# Physics-Informed Neural Networks for Reservoir Modeling
## Tutorial 2: Data Processing and LAS File Handling

In this tutorial, we'll learn how to process real Kansas Geological Survey (KGS) LAS well log files and prepare quality datasets for PINN training. This is a crucial step that determines the success of our physics-informed models.

### Learning Objectives
By the end of this tutorial, you will be able to:
- Read and parse LAS files with different formats and versions
- Extract and validate well log curves (gamma ray, density, neutron porosity, resistivity)
- Implement robust data preprocessing and quality filtering
- Create training/validation datasets for PINN models
- Visualize data distributions and identify quality issues

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Import our custom modules
import sys
sys.path.append('../src')

from data.las_reader import LASFileReader
from data.preprocessor import DataPreprocessor
from data.dataset_builder import DatasetBuilder
from visualization.scientific_plotter import ScientificPlotter

# Set up plotting
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 11

print("✅ Environment setup complete!")
print(f"📁 Current working directory: {Path.cwd()}")

## 1. Understanding LAS File Structure

Log ASCII Standard (LAS) files are the industry standard for storing well log data. Let's examine the structure and learn how to parse them effectively.

### LAS File Sections
1. **~V (Version)**: LAS version information
2. **~W (Well)**: Well identification and location data
3. **~C (Curve)**: Curve definitions, units, and descriptions
4. **~P (Parameter)**: Additional parameters and constants
5. **~A (ASCII)**: The actual log data in columnar format

### Key Challenges
- Different LAS versions (1.2, 2.0, 3.0)
- Varying curve names and units
- Missing or corrupted data
- Inconsistent formatting
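
To make the section layout concrete, here is a minimal sketch that splits LAS text into its `~` sections using only the standard library. The `SAMPLE_LAS` snippet and the `split_las_sections` helper are made up for illustration and are not part of `LASFileReader`. A production reader (ours, or a library such as `lasio`) also has to handle wrapped data, null values, and version differences.

```python
from collections import defaultdict

# A tiny, made-up LAS 2.0 snippet used only for illustration
SAMPLE_LAS = """\
~Version Information
 VERS.   2.0 : CWLS LOG ASCII STANDARD - VERSION 2.0
 WRAP.   NO  : ONE LINE PER DEPTH STEP
~Well Information
 NULL.   -999.25 : NULL VALUE
~Curve Information
 DEPT.FT     : Depth
 GR  .GAPI   : Gamma Ray
~ASCII
 2000.0  45.2
 2000.5  -999.25
"""

def split_las_sections(text):
    """Group non-comment lines under the first letter of each ~section header."""
    sections = defaultdict(list)
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            continue  # skip blank lines and comments
        if stripped.startswith('~'):
            current = stripped[1:2].upper()  # '~Well Information' -> 'W'
            continue
        if current is not None:
            sections[current].append(stripped)
    return dict(sections)

sections = split_las_sections(SAMPLE_LAS)

# Each curve line has the form MNEM.UNIT : DESCRIPTION; the mnemonic precedes the first dot
curve_names = [line.split('.', 1)[0].strip() for line in sections['C']]
print(curve_names)           # ['DEPT', 'GR']
print(len(sections['A']))    # 2
```

Keying sections by their first letter works because the standard identifies a section by the character after `~`, so `~A`, `~ASCII`, and `~Ascii Log Data` all land in the same bucket.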

In [None]:
# Initialize our data processing components
las_reader = LASFileReader()
preprocessor = DataPreprocessor()
dataset_builder = DatasetBuilder()
plotter = ScientificPlotter()

# Set up data directory
data_dir = Path('../data')
print(f"📂 Looking for LAS files in: {data_dir}")

# Find all LAS files
las_files = list(data_dir.glob('*.las'))
print(f"🔍 Found {len(las_files)} LAS files")

if len(las_files) > 0:
    print(f"📄 First few files: {[f.name for f in las_files[:5]]}")
else:
    print("⚠️  No LAS files found. Please check the data directory.")

## 2. Reading and Parsing LAS Files

Let's start by examining a single LAS file to understand its structure and content.

In [None]:
# Read a sample LAS file, falling back to synthetic data if none can be read
def make_synthetic_curves(n_points=500, missing_fraction=0.05, seed=42):
    """Create a synthetic depth axis and log curves with some missing values."""
    np.random.seed(seed)
    depth = np.linspace(2000, 2500, n_points)

    curves = {
        'GR': 30 + 50 * np.sin(0.01 * depth) + 10 * np.random.randn(n_points),
        'RHOB': 2.3 + 0.3 * np.random.randn(n_points),
        'NPHI': 0.15 + 0.1 * np.random.randn(n_points),
        'RT': 10 * np.exp(np.random.randn(n_points)),
        'PHIE': 0.2 + 0.05 * np.random.randn(n_points)
    }

    # Add some missing values to simulate real data
    for curve_name in curves:
        missing_idx = np.random.choice(n_points, size=int(missing_fraction * n_points), replace=False)
        curves[curve_name][missing_idx] = np.nan

    return depth, curves

if len(las_files) > 0:
    sample_file = las_files[0]
    print(f"📖 Reading sample file: {sample_file.name}")

    try:
        # Read the LAS file
        well_data = las_reader.read_las_file(str(sample_file))

        print(f"\n📊 Well Information:")
        print(f"  Well ID: {well_data.well_id}")
        print(f"  Location: {well_data.metadata.location if well_data.metadata else 'Unknown'}")
        print(f"  Depth range: {well_data.depth.min():.1f} - {well_data.depth.max():.1f} ft")
        print(f"  Number of depth points: {len(well_data.depth)}")

        print(f"\n📈 Available Curves:")
        for curve_name, curve_data in well_data.curves.items():
            valid_points = np.sum(~np.isnan(curve_data))
            print(f"  {curve_name:12s}: {valid_points:4d} valid points ({valid_points/len(curve_data)*100:.1f}%)")

        # Extract curves for analysis; later cells also need the matching depth axis
        curves = las_reader.extract_curves(well_data)
        depth = well_data.depth
        print(f"\n✅ Successfully extracted {len(curves)} curves")

    except Exception as e:
        print(f"❌ Error reading LAS file: {e}")
        print("🔧 Creating synthetic data for demonstration...")
        depth, curves = make_synthetic_curves()
        print("✅ Synthetic data created successfully")

else:
    print("🔧 No LAS files available. Creating synthetic data for demonstration...")
    depth, curves = make_synthetic_curves()

## 3. Data Quality Assessment

Before using well log data for PINN training, we need to assess its quality and identify potential issues.

In [None]:
def assess_data_quality(curves, depth):
    """Comprehensive data quality assessment"""
    
    print("🔍 Data Quality Assessment")
    print("=" * 50)
    
    quality_report = {}
    
    for curve_name, curve_data in curves.items():
        # Basic statistics
        valid_mask = ~np.isnan(curve_data)
        valid_data = curve_data[valid_mask]
        
        if len(valid_data) == 0:
            print(f"❌ {curve_name}: No valid data")
            continue
        
        # Calculate quality metrics
        completeness = len(valid_data) / len(curve_data) * 100
        mean_val = np.mean(valid_data)
        std_val = np.std(valid_data)
        min_val = np.min(valid_data)
        max_val = np.max(valid_data)
        
        # Outlier detection (3-sigma rule)
        outliers = np.abs(valid_data - mean_val) > 3 * std_val
        outlier_pct = np.sum(outliers) / len(valid_data) * 100
        
        # Continuity assessment (gaps)
        gaps = np.diff(np.where(valid_mask)[0]) > 1
        num_gaps = np.sum(gaps)
        
        quality_report[curve_name] = {
            'completeness': completeness,
            'mean': mean_val,
            'std': std_val,
            'range': (min_val, max_val),
            'outliers': outlier_pct,
            'gaps': num_gaps
        }
        
        # Quality assessment
        quality_score = completeness
        if outlier_pct > 5:
            quality_score -= 10
        if num_gaps > 5:
            quality_score -= 10
        
        status = "✅" if quality_score > 80 else "⚠️" if quality_score > 60 else "❌"
        
        print(f"{status} {curve_name:8s}: {completeness:5.1f}% complete, "
              f"{outlier_pct:4.1f}% outliers, {num_gaps:2d} gaps, "
              f"range: [{min_val:6.2f}, {max_val:6.2f}]")
    
    return quality_report

# Assess quality of our sample data
quality_report = assess_data_quality(curves, depth)

## 4. Data Preprocessing Pipeline

Now let's implement a comprehensive preprocessing pipeline to clean and standardize our well log data.
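
The internals of `DataPreprocessor` are not shown in this tutorial, so the following is only a sketch of what the three steps could look like, assuming 3-sigma outlier masking, z-score normalization, and linear interpolation along depth. The helper names `mask_outliers`, `zscore`, and `fill_missing` are hypothetical. The real `clean_data`, `normalize_curves`, and `handle_missing_values` methods may behave differently.

```python
import numpy as np

def mask_outliers(values, n_sigma=3.0):
    """Replace samples farther than n_sigma standard deviations from the mean with NaN."""
    values = np.asarray(values, dtype=float).copy()
    mean, std = np.nanmean(values), np.nanstd(values)
    if std > 0:
        values[np.abs(values - mean) > n_sigma * std] = np.nan
    return values

def zscore(values):
    """Scale to zero mean and unit variance, ignoring NaN samples."""
    values = np.asarray(values, dtype=float)
    mean, std = np.nanmean(values), np.nanstd(values)
    return (values - mean) / std if std > 0 else values - mean

def fill_missing(values, depth):
    """Linearly interpolate NaN samples along depth."""
    values = np.asarray(values, dtype=float).copy()
    missing = np.isnan(values)
    if missing.any() and (~missing).sum() >= 2:
        values[missing] = np.interp(depth[missing], depth[~missing], values[~missing])
    return values

# Toy gamma-ray curve alternating between 60 and 50, with one spike and one missing sample
depth = np.arange(2000.0, 2020.0)
gr = np.where(np.arange(20) % 2 == 0, 60.0, 50.0)
gr[5] = 500.0
gr[10] = np.nan

cleaned = mask_outliers(gr)               # the spike at index 5 becomes NaN
normalized = zscore(cleaned)              # remaining samples map to +1 / -1
filled = fill_missing(normalized, depth)  # both NaN samples are interpolated
print(filled[5], filled[10])              # 1.0 -1.0
```

Masking outliers before computing the normalization statistics matters here: with the spike included, the mean and standard deviation would be dominated by a single bad sample.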

In [None]:
# Apply preprocessing pipeline
print("🔧 Applying Data Preprocessing Pipeline")
print("=" * 40)

# Step 1: Clean data (remove outliers, handle missing values)
print("1️⃣ Cleaning data...")
cleaned_curves = preprocessor.clean_data(curves)

# Step 2: Normalize curves
print("2️⃣ Normalizing curves...")
normalized_curves = preprocessor.normalize_curves(cleaned_curves)

# Step 3: Handle missing values
print("3️⃣ Handling missing values...")
final_curves = preprocessor.handle_missing_values(normalized_curves)

print("✅ Preprocessing complete!")

# Compare before and after
print("\n📊 Preprocessing Results:")
for curve_name in curves.keys():
    original_valid = np.sum(~np.isnan(curves[curve_name]))
    final_valid = np.sum(~np.isnan(final_curves[curve_name]))
    
    print(f"{curve_name:8s}: {original_valid:3d} → {final_valid:3d} valid points "
          f"({final_valid/len(depth)*100:.1f}% complete)")

## 5. Data Visualization and Distribution Analysis

Visualizing our data helps us understand its characteristics and identify potential issues.

In [None]:
# Create comprehensive data visualization
def create_data_visualization(original_curves, processed_curves, depth):
    """Create comprehensive visualization of well log data"""
    
    fig = plt.figure(figsize=(20, 12))
    
    # Create subplot layout
    gs = fig.add_gridspec(3, 6, hspace=0.3, wspace=0.4)
    
    # 1. Well log tracks (original data)
    ax_logs = fig.add_subplot(gs[:, :2])
    
    curve_names = list(original_curves.keys())
    colors = ['green', 'blue', 'red', 'purple', 'orange']
    
    for i, (curve_name, color) in enumerate(zip(curve_names, colors)):
        # Normalize for display
        curve_data = original_curves[curve_name]
        valid_mask = ~np.isnan(curve_data)
        
        if np.sum(valid_mask) > 0:
            norm_data = (curve_data - np.nanmin(curve_data)) / (np.nanmax(curve_data) - np.nanmin(curve_data))
            ax_logs.plot(norm_data + i, depth, color=color, linewidth=1, label=curve_name)
    
    ax_logs.set_ylabel('Depth (ft)')
    ax_logs.set_xlabel('Normalized Curve Values')
    ax_logs.set_title('Well Log Display (Original Data)')
    ax_logs.invert_yaxis()
    ax_logs.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    ax_logs.grid(True, alpha=0.3)
    
    # 2. Distribution plots (top row)
    for i, curve_name in enumerate(curve_names[:3]):
        ax = fig.add_subplot(gs[0, i+2])
        
        # Original data
        orig_data = original_curves[curve_name]
        orig_valid = orig_data[~np.isnan(orig_data)]
        
        # Processed data
        proc_data = processed_curves[curve_name]
        proc_valid = proc_data[~np.isnan(proc_data)]
        
        if len(orig_valid) > 0 and len(proc_valid) > 0:
            ax.hist(orig_valid, bins=30, alpha=0.5, color='red', label='Original', density=True)
            ax.hist(proc_valid, bins=30, alpha=0.7, color='blue', label='Processed', density=True)
            ax.set_title(f'{curve_name} Distribution')
            ax.set_xlabel('Value')
            ax.set_ylabel('Density')
            ax.legend()
            ax.grid(True, alpha=0.3)
    
    # 3. Correlation matrix (middle)
    ax_corr = fig.add_subplot(gs[1, 2:4])
    
    # Create correlation matrix from processed data, keeping samples aligned by depth
    corr_data = {}
    for name, data in processed_curves.items():
        valid_mask = ~np.isnan(data)
        if np.sum(valid_mask) > 10:  # Need enough points
            corr_data[name] = data  # keep NaNs so rows stay aligned across curves
    
    if len(corr_data) > 1:
        # DataFrame.corr() excludes NaN pairwise, so each coefficient uses only
        # the depths where both curves are valid
        df_corr = pd.DataFrame(corr_data)
        corr_matrix = df_corr.corr()
        
        sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0, 
                   square=True, ax=ax_corr, cbar_kws={'shrink': 0.8})
        ax_corr.set_title('Curve Correlation Matrix')
    
    # 4. Data completeness (middle right)
    ax_complete = fig.add_subplot(gs[1, 4:])
    
    completeness_orig = []
    completeness_proc = []
    curve_labels = []
    
    for name in curve_names:
        orig_complete = np.sum(~np.isnan(original_curves[name])) / len(original_curves[name]) * 100
        proc_complete = np.sum(~np.isnan(processed_curves[name])) / len(processed_curves[name]) * 100
        
        completeness_orig.append(orig_complete)
        completeness_proc.append(proc_complete)
        curve_labels.append(name)
    
    x = np.arange(len(curve_labels))
    width = 0.35
    
    ax_complete.bar(x - width/2, completeness_orig, width, label='Original', color='red', alpha=0.7)
    ax_complete.bar(x + width/2, completeness_proc, width, label='Processed', color='blue', alpha=0.7)
    
    ax_complete.set_xlabel('Curves')
    ax_complete.set_ylabel('Completeness (%)')
    ax_complete.set_title('Data Completeness Comparison')
    ax_complete.set_xticks(x)
    ax_complete.set_xticklabels(curve_labels, rotation=45)
    ax_complete.legend()
    ax_complete.grid(True, alpha=0.3)
    ax_complete.set_ylim(0, 105)
    
    # 5. Quality metrics (bottom)
    ax_quality = fig.add_subplot(gs[2, 2:])
    
    # Calculate quality scores
    quality_scores = []
    for name in curve_names:
        data = processed_curves[name]
        valid_data = data[~np.isnan(data)]
        
        if len(valid_data) > 0:
            completeness = len(valid_data) / len(data) * 100
            # Simple quality score based on completeness and variance
            cv = np.std(valid_data) / np.abs(np.mean(valid_data)) if np.mean(valid_data) != 0 else 0
            quality = completeness * (1 - min(cv, 1))  # Penalize high coefficient of variation
            quality_scores.append(max(0, quality))
        else:
            quality_scores.append(0)
    
    bars = ax_quality.bar(curve_labels, quality_scores, color=['green' if q > 70 else 'orange' if q > 40 else 'red' for q in quality_scores])
    ax_quality.set_xlabel('Curves')
    ax_quality.set_ylabel('Quality Score')
    ax_quality.set_title('Data Quality Assessment')
    ax_quality.set_ylim(0, 100)
    
    # Add quality score labels
    for bar, score in zip(bars, quality_scores):
        height = bar.get_height()
        ax_quality.text(bar.get_x() + bar.get_width()/2., height + 1,
                       f'{score:.1f}', ha='center', va='bottom')
    
    ax_quality.grid(True, alpha=0.3)
    
    plt.suptitle('Comprehensive Well Log Data Analysis', fontsize=16, fontweight='bold')
    plt.show()

# Create the visualization
create_data_visualization(curves, final_curves, depth)

## 6. Multi-Well Data Processing

For PINN training, we typically need data from multiple wells. Let's process multiple LAS files and create a comprehensive dataset.
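
Different wells often store the same measurement under different mnemonics, so curve names usually need standardizing before wells can be combined. The sketch below shows one way to do that with an alias table. The `CURVE_ALIASES` table and the `standardize_curve_names` helper are illustrative assumptions, not part of `LASFileReader`, and the alias lists should be extended from the curve sections of your own files.

```python
# Hypothetical alias table mapping canonical names to mnemonics seen in the field
CURVE_ALIASES = {
    'GR':   ['GR', 'GAMMA', 'GRD'],
    'RHOB': ['RHOB', 'RHOZ', 'DEN'],
    'NPHI': ['NPHI', 'NPOR', 'CNPOR'],
    'RT':   ['RT', 'ILD', 'RILD', 'LLD'],
}

def standardize_curve_names(curves, aliases=CURVE_ALIASES):
    """Return a new dict keyed by canonical names; unknown curves keep their upper-cased name."""
    lookup = {alias.upper(): canonical
              for canonical, names in aliases.items() for alias in names}
    standardized = {}
    for name, data in curves.items():
        key = name.strip().upper()
        canonical = lookup.get(key, key)
        standardized.setdefault(canonical, data)  # first matching curve wins
    return standardized

well_a = {'GR': [45.0, 50.0], 'RHOZ': [2.4, 2.5]}
well_b = {'gamma': [60.0, 62.0], 'ILD': [12.0, 9.5]}
print(sorted(standardize_curve_names(well_a)))  # ['GR', 'RHOB']
print(sorted(standardize_curve_names(well_b)))  # ['GR', 'RT']
```

Renaming alone does not reconcile units. A curve stored in different units across wells (for example density in g/cc versus kg/m3) still needs an explicit conversion step.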

In [None]:
def process_multiple_wells(las_files, max_wells=10):
    """Process multiple LAS files and create a combined dataset"""
    
    print(f"🔄 Processing up to {max_wells} wells...")
    
    processed_wells = []
    processing_stats = {
        'total_files': min(len(las_files), max_wells),  # count only the files actually attempted
        'successful': 0,
        'failed': 0,
        'filtered_out': 0
    }
    
    for i, las_file in enumerate(las_files[:max_wells]):
        try:
            print(f"  📄 Processing {las_file.name}... ", end="")
            
            # Read LAS file
            well_data = las_reader.read_las_file(str(las_file))
            
            # Extract curves
            curves = las_reader.extract_curves(well_data)
            
            # Check if well has required curves
            required_curves = ['GR', 'PHIE', 'PERM']  # Minimum required
            has_required = all(curve in curves for curve in required_curves)
            
            if not has_required:
                print("❌ Missing required curves")
                processing_stats['filtered_out'] += 1
                continue
            
            # Apply preprocessing
            cleaned_curves = preprocessor.clean_data(curves)
            normalized_curves = preprocessor.normalize_curves(cleaned_curves)
            final_curves = preprocessor.handle_missing_values(normalized_curves)
            
            # Check data quality
            total_points = len(well_data.depth)
            valid_points = sum(np.sum(~np.isnan(curve)) for curve in final_curves.values())
            quality_ratio = valid_points / (total_points * len(final_curves))
            
            if quality_ratio < 0.7:  # Require 70% data completeness
                print(f"❌ Low quality ({quality_ratio:.1%})")
                processing_stats['filtered_out'] += 1
                continue
            
            # Store processed well data
            processed_well = {
                'well_id': well_data.well_id,
                'depth': well_data.depth,
                'curves': final_curves,
                'metadata': well_data.metadata,
                'quality_ratio': quality_ratio
            }
            
            processed_wells.append(processed_well)
            processing_stats['successful'] += 1
            
            print(f"✅ Success ({quality_ratio:.1%} quality)")
            
        except Exception as e:
            print(f"❌ Error: {str(e)[:50]}...")
            processing_stats['failed'] += 1
    
    return processed_wells, processing_stats

# Process multiple wells (or create synthetic data if no files available)
if len(las_files) > 0:
    processed_wells, stats = process_multiple_wells(las_files, max_wells=5)
else:
    print("🔧 Creating synthetic multi-well dataset...")
    
    # Create synthetic wells with different characteristics
    processed_wells = []
    
    for well_id in range(5):
        np.random.seed(42 + well_id)
        
        # Vary depth ranges and characteristics
        depth_start = 2000 + well_id * 100
        depth_end = depth_start + 400 + well_id * 50
        depth = np.linspace(depth_start, depth_end, 400)
        
        # Create curves with well-specific characteristics
        base_porosity = 0.15 + well_id * 0.02
        base_permeability = 10 * (well_id + 1)
        
        curves = {
            'GR': 40 + 30 * np.sin(0.01 * depth) + 15 * np.random.randn(len(depth)),
            'PHIE': base_porosity + 0.05 * np.random.randn(len(depth)),
            'PERM': base_permeability * np.exp(0.5 * np.random.randn(len(depth))),
            'RT': 5 + 10 * np.exp(np.random.randn(len(depth))),
            'RHOB': 2.2 + 0.4 * np.random.randn(len(depth))
        }
        
        # Apply preprocessing
        cleaned_curves = preprocessor.clean_data(curves)
        normalized_curves = preprocessor.normalize_curves(cleaned_curves)
        final_curves = preprocessor.handle_missing_values(normalized_curves)
        
        processed_well = {
            'well_id': f'SYNTHETIC_WELL_{well_id+1:02d}',
            'depth': depth,
            'curves': final_curves,
            'metadata': None,
            'quality_ratio': 0.95
        }
        
        processed_wells.append(processed_well)
    
    stats = {'successful': 5, 'failed': 0, 'filtered_out': 0, 'total_files': 5}

print(f"\n📊 Processing Summary:")
print(f"  ✅ Successful: {stats['successful']}")
print(f"  ❌ Failed: {stats['failed']}")
print(f"  🚫 Filtered out: {stats['filtered_out']}")
print(f"  📈 Success rate: {stats['successful']/stats['total_files']*100:.1f}%")

## 7. Dataset Creation for PINN Training

Now let's create training and validation datasets suitable for PINN models.
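
The cells below read each split as flat arrays: `depth`, a `curves` dict, and a per-sample `well_ids` array. As a rough sketch of how per-well dictionaries could be flattened into that layout, consider the hypothetical `combine_wells` helper below. It is an assumption for illustration only, and `DatasetBuilder.create_datasets` may organize its output differently.

```python
import numpy as np

def combine_wells(wells, curve_names):
    """Concatenate per-well arrays into flat arrays plus a per-sample well label."""
    return {
        'depth': np.concatenate([w['depth'] for w in wells]),
        'curves': {name: np.concatenate([w['curves'][name] for w in wells])
                   for name in curve_names},
        'well_ids': np.concatenate([np.full(len(w['depth']), w['well_id']) for w in wells]),
    }

# Two toy wells with the same keys used for processed wells in this tutorial
wells = [
    {'well_id': 'A', 'depth': np.array([2000.0, 2001.0]),
     'curves': {'GR': np.array([40.0, 42.0])}},
    {'well_id': 'B2', 'depth': np.array([2100.0, 2101.0, 2102.0]),
     'curves': {'GR': np.array([55.0, 57.0, 59.0])}},
]

combined = combine_wells(wells, curve_names=['GR'])
print(combined['well_ids'].tolist())  # ['A', 'A', 'B2', 'B2', 'B2']
```

Carrying a well label for every sample is what makes later per-well diagnostics and well-level splitting possible.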

In [None]:
# Create datasets for PINN training
print("🏗️ Building PINN Training Datasets")
print("=" * 35)

# Build combined dataset
training_config = {
    'validation_split': 0.2,
    'test_split': 0.1,
    'random_seed': 42,
    'min_points_per_well': 100
}

# Create datasets
datasets = dataset_builder.create_datasets(processed_wells, training_config)

print(f"📦 Dataset Creation Results:")
print(f"  🏋️ Training samples: {len(datasets['train']['depth'])}")
print(f"  🔍 Validation samples: {len(datasets['validation']['depth'])}")
print(f"  🧪 Test samples: {len(datasets['test']['depth'])}")
print(f"  📊 Total wells used: {len(processed_wells)}")

# Analyze dataset characteristics
def analyze_dataset_characteristics(datasets):
    """Analyze the characteristics of our PINN datasets"""
    
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    
    dataset_names = ['train', 'validation', 'test']
    colors = ['blue', 'orange', 'green']
    
    # 1. Depth distribution
    for i, (name, color) in enumerate(zip(dataset_names, colors)):
        depths = datasets[name]['depth']
        axes[0, 0].hist(depths, bins=30, alpha=0.6, label=name.title(), color=color, density=True)
    
    axes[0, 0].set_xlabel('Depth (ft)')
    axes[0, 0].set_ylabel('Density')
    axes[0, 0].set_title('Depth Distribution Across Datasets')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)
    
    # 2. Porosity distribution
    for i, (name, color) in enumerate(zip(dataset_names, colors)):
        porosity = datasets[name]['curves']['PHIE']
        valid_porosity = porosity[~np.isnan(porosity)]
        if len(valid_porosity) > 0:
            axes[0, 1].hist(valid_porosity, bins=30, alpha=0.6, label=name.title(), color=color, density=True)
    
    axes[0, 1].set_xlabel('Porosity (fraction)')
    axes[0, 1].set_ylabel('Density')
    axes[0, 1].set_title('Porosity Distribution')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)
    
    # 3. Permeability distribution (log scale)
    for i, (name, color) in enumerate(zip(dataset_names, colors)):
        perm = datasets[name]['curves']['PERM']
        valid_perm = perm[~np.isnan(perm)]
        if len(valid_perm) > 0 and np.all(valid_perm > 0):
            axes[0, 2].hist(np.log10(valid_perm), bins=30, alpha=0.6, label=name.title(), color=color, density=True)
    
    axes[0, 2].set_xlabel('Log10(Permeability)')
    axes[0, 2].set_ylabel('Density')
    axes[0, 2].set_title('Permeability Distribution (Log Scale)')
    axes[0, 2].legend()
    axes[0, 2].grid(True, alpha=0.3)
    
    # 4. Data completeness by curve
    curve_names = list(datasets['train']['curves'].keys())
    completeness_data = {name: [] for name in dataset_names}
    
    for curve in curve_names:
        for name in dataset_names:
            curve_data = datasets[name]['curves'][curve]
            completeness = np.sum(~np.isnan(curve_data)) / len(curve_data) * 100
            completeness_data[name].append(completeness)
    
    x = np.arange(len(curve_names))
    width = 0.25
    
    for i, (name, color) in enumerate(zip(dataset_names, colors)):
        axes[1, 0].bar(x + i*width, completeness_data[name], width, label=name.title(), color=color, alpha=0.7)
    
    axes[1, 0].set_xlabel('Curves')
    axes[1, 0].set_ylabel('Completeness (%)')
    axes[1, 0].set_title('Data Completeness by Curve')
    axes[1, 0].set_xticks(x + width)
    axes[1, 0].set_xticklabels(curve_names, rotation=45)
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)
    
    # 5. Sample distribution by well
    well_counts = {name: {} for name in dataset_names}
    
    for name in dataset_names:
        well_ids = datasets[name]['well_ids']
        unique_wells, counts = np.unique(well_ids, return_counts=True)
        well_counts[name] = dict(zip(unique_wells, counts))
    
    # Get all unique wells
    all_wells = set()
    for name in dataset_names:
        all_wells.update(well_counts[name].keys())
    all_wells = sorted(list(all_wells))
    
    x = np.arange(len(all_wells))
    
    for i, (name, color) in enumerate(zip(dataset_names, colors)):
        counts = [well_counts[name].get(well, 0) for well in all_wells]
        axes[1, 1].bar(x + i*width, counts, width, label=name.title(), color=color, alpha=0.7)
    
    axes[1, 1].set_xlabel('Wells')
    axes[1, 1].set_ylabel('Number of Samples')
    axes[1, 1].set_title('Sample Distribution by Well')
    axes[1, 1].set_xticks(x + width)
    axes[1, 1].set_xticklabels([w[:10] + '...' if len(w) > 10 else w for w in all_wells], rotation=45)
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)
    
    # 6. Feature correlation in training data
    train_curves = datasets['train']['curves']
    corr_data = {}
    
    for name, data in train_curves.items():
        valid_mask = ~np.isnan(data)
        if np.sum(valid_mask) > 100:
            corr_data[name] = data  # keep NaNs so samples stay aligned across curves
    
    if len(corr_data) > 1:
        # DataFrame.corr() excludes NaN pairwise, so each coefficient uses only
        # the samples where both curves are valid
        df_corr = pd.DataFrame(corr_data)
        corr_matrix = df_corr.corr()
        
        im = axes[1, 2].imshow(corr_matrix.values, cmap='RdBu_r', vmin=-1, vmax=1)
        axes[1, 2].set_xticks(range(len(corr_matrix.columns)))
        axes[1, 2].set_yticks(range(len(corr_matrix.columns)))
        axes[1, 2].set_xticklabels(corr_matrix.columns, rotation=45)
        axes[1, 2].set_yticklabels(corr_matrix.columns)
        axes[1, 2].set_title('Feature Correlation Matrix\n(Training Data)')
        
        # Add correlation values
        for i in range(len(corr_matrix)):
            for j in range(len(corr_matrix)):
                axes[1, 2].text(j, i, f'{corr_matrix.iloc[i, j]:.2f}', 
                               ha='center', va='center', fontsize=8)
        
        plt.colorbar(im, ax=axes[1, 2], shrink=0.8)
    
    plt.tight_layout()
    plt.suptitle('PINN Dataset Analysis', fontsize=16, fontweight='bold', y=1.02)
    plt.show()

# Analyze our datasets
analyze_dataset_characteristics(datasets)

## 8. Data Export and Preparation for PINN Training

Finally, let's prepare our processed data for use in PINN training.

In [None]:
# Prepare data for PINN training
def prepare_pinn_training_data(datasets):
    """Prepare final datasets for PINN training"""
    
    print("🎯 Preparing Data for PINN Training")
    print("=" * 35)
    
    pinn_data = {}
    
    for split_name in ['train', 'validation', 'test']:
        split_data = datasets[split_name]
        
        # Extract features (inputs to PINN)
        features = []
        feature_names = ['depth']
        
        # Add depth as first feature
        features.append(split_data['depth'].reshape(-1, 1))
        
        # Add curve data as features
        for curve_name in ['GR', 'PHIE', 'PERM']:
            if curve_name in split_data['curves']:
                curve_data = split_data['curves'][curve_name]
                features.append(curve_data.reshape(-1, 1))
                feature_names.append(curve_name)
        
        # Combine features
        X = np.hstack(features)
        
        # Create targets (what PINN should predict)
        # For demonstration, we'll use porosity and a synthetic pressure field
        porosity = split_data['curves']['PHIE']
        
        # Create synthetic pressure field based on depth and porosity
        depth_norm = (split_data['depth'] - np.min(split_data['depth'])) / (np.max(split_data['depth']) - np.min(split_data['depth']))
        pressure = 100 + 50 * depth_norm + 20 * (1 - porosity) + 5 * np.random.randn(len(porosity))
        
        # Create synthetic saturation field
        saturation = 0.3 + 0.4 * porosity + 0.1 * np.random.randn(len(porosity))
        saturation = np.clip(saturation, 0.2, 0.8)
        
        y = np.column_stack([pressure, saturation])
        target_names = ['pressure', 'saturation']
        
        # Remove NaN values
        valid_mask = ~np.any(np.isnan(X), axis=1) & ~np.any(np.isnan(y), axis=1)
        
        X_clean = X[valid_mask]
        y_clean = y[valid_mask]
        well_ids_clean = np.array(split_data['well_ids'])[valid_mask]
        
        pinn_data[split_name] = {
            'X': X_clean,
            'y': y_clean,
            'well_ids': well_ids_clean,
            'feature_names': feature_names,
            'target_names': target_names,
            'n_samples': len(X_clean)
        }
        
        print(f"  {split_name:10s}: {len(X_clean):5d} samples, {X_clean.shape[1]} features, {y_clean.shape[1]} targets")
    
    return pinn_data

# Prepare PINN training data (seeded so the synthetic targets are reproducible)
np.random.seed(training_config['random_seed'])
pinn_data = prepare_pinn_training_data(datasets)

# Display data summary
print(f"\n📋 PINN Data Summary:")
print(f"  Features: {pinn_data['train']['feature_names']}")
print(f"  Targets: {pinn_data['train']['target_names']}")
print(f"  Feature ranges:")

X_train = pinn_data['train']['X']
for i, name in enumerate(pinn_data['train']['feature_names']):
    min_val, max_val = np.min(X_train[:, i]), np.max(X_train[:, i])
    print(f"    {name:8s}: [{min_val:8.3f}, {max_val:8.3f}]")

print(f"  Target ranges:")
y_train = pinn_data['train']['y']
for i, name in enumerate(pinn_data['train']['target_names']):
    min_val, max_val = np.min(y_train[:, i]), np.max(y_train[:, i])
    print(f"    {name:10s}: [{min_val:8.3f}, {max_val:8.3f}]")

## 9. Summary and Best Practices

In this tutorial, we've covered the complete data processing pipeline for PINN training with well log data.

### Key Accomplishments
1. **LAS File Processing**: Successfully read and parsed LAS files with robust error handling
2. **Data Quality Assessment**: Implemented comprehensive quality metrics and filtering
3. **Preprocessing Pipeline**: Applied cleaning, normalization, and missing value handling
4. **Multi-Well Integration**: Combined data from multiple wells into coherent datasets
5. **PINN-Ready Datasets**: Created training/validation/test splits suitable for physics-informed learning

### Best Practices for Well Log Data Processing

#### Data Quality
- **Completeness**: Require at least 70% data completeness for reliable training
- **Consistency**: Standardize curve names and units across wells
- **Outlier Detection**: Use statistical methods (3-sigma rule) to identify anomalous values
- **Gap Analysis**: Monitor data continuity and interpolate small gaps carefully
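
As an illustration of interpolating small gaps carefully, the sketch below fills only interior runs of missing samples up to a chosen length and leaves longer gaps and edge gaps as NaN. The `interpolate_small_gaps` helper and its `max_gap` parameter are assumptions for this example, not part of `DataPreprocessor`.

```python
import numpy as np

def interpolate_small_gaps(values, depth, max_gap=3):
    """Linearly fill interior NaN runs of at most max_gap samples; leave longer runs as NaN."""
    values = np.asarray(values, dtype=float).copy()
    depth = np.asarray(depth, dtype=float)
    missing = np.isnan(values)
    if missing.all() or not missing.any():
        return values
    valid_idx = np.flatnonzero(~missing)
    # Each pair of consecutive valid samples brackets a (possibly empty) gap
    for left, right in zip(valid_idx[:-1], valid_idx[1:]):
        gap_len = right - left - 1
        if 0 < gap_len <= max_gap:
            inner = slice(left + 1, right)
            values[inner] = np.interp(depth[inner], depth[[left, right]], values[[left, right]])
    return values

depth = np.arange(10, dtype=float)
gr = np.array([10., np.nan, 30., np.nan, np.nan, np.nan, np.nan, 80., 90., np.nan])

filled = interpolate_small_gaps(gr, depth, max_gap=2)
print(filled)  # index 1 becomes 20.0; indices 3-6 and 9 remain NaN
```

Leaving long gaps unfilled keeps the model from training on values that are purely an artifact of interpolation. Those samples can be dropped later, when rows containing NaN are removed.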

#### Preprocessing Strategy
- **Normalization**: Essential for neural network training stability
- **Missing Values**: Use domain-appropriate interpolation methods
- **Feature Engineering**: Consider derived properties (porosity-permeability relationships)
- **Validation**: Always validate preprocessing results visually and statistically

#### Dataset Design
- **Stratified Splitting**: Ensure representative samples in train/validation/test sets
- **Well-Based Splitting**: Avoid data leakage by splitting at well level
- **Balanced Representation**: Include diverse geological conditions
- **Physics Constraints**: Ensure data respects known physical relationships
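
Well-based splitting can be sketched as follows: shuffle the unique well IDs once, then assign whole wells to each split so that no well contributes samples to more than one. The `split_wells` helper is hypothetical. Its parameter names mirror the keys of `training_config`, but whether `DatasetBuilder` splits at the well level is not shown in this tutorial.

```python
import numpy as np

def split_wells(well_ids, validation_split=0.2, test_split=0.1, random_seed=42):
    """Assign whole wells to train/validation/test so no well spans two splits."""
    unique_wells = np.unique(well_ids)
    rng = np.random.default_rng(random_seed)
    shuffled = rng.permutation(unique_wells).tolist()
    # Keep at least one well in each held-out split
    n_test = max(1, round(test_split * len(shuffled)))
    n_val = max(1, round(validation_split * len(shuffled)))
    return {
        'test': set(shuffled[:n_test]),
        'validation': set(shuffled[n_test:n_test + n_val]),
        'train': set(shuffled[n_test + n_val:]),
    }

well_ids = [f'WELL_{i:02d}' for i in range(10)]
splits = split_wells(well_ids)
print({name: len(wells) for name, wells in splits.items()})
# {'test': 1, 'validation': 2, 'train': 7}
```

A per-sample mask for a split then follows from `np.isin(sample_well_ids, list(splits['train']))`. With only a handful of wells, the held-out splits contain very few wells, so validation metrics will be noisy.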

### Common Pitfalls to Avoid
1. **Ignoring Data Quality**: Poor quality data leads to poor PINN performance
2. **Over-Preprocessing**: Excessive smoothing can remove important physics
3. **Data Leakage**: Using future information or mixing wells inappropriately
4. **Scale Issues**: Forgetting to normalize features with very different ranges
5. **Physics Violations**: Creating datasets that violate known physical constraints

### Next Steps
With our processed datasets ready, we can now move to:
1. **Tutorial 3**: Implement PINN architecture with physics constraints
2. **Tutorial 4**: Train PINNs with optimization best practices
3. **Tutorial 5**: Validate results and analyze performance

In [None]:
# Save processed data for next tutorials
import pickle

# Create output directory
output_dir = Path('../output')
output_dir.mkdir(exist_ok=True)

# Save PINN-ready datasets
with open(output_dir / 'pinn_datasets.pkl', 'wb') as f:
    pickle.dump(pinn_data, f)

print("💾 Datasets saved successfully!")
print(f"📁 Location: {output_dir / 'pinn_datasets.pkl'}")
print("\n🎉 Tutorial 2 Complete!")
print("\n➡️  Next: Tutorial 3 - PINN Model Implementation")

# Quick verification
print(f"\n🔍 Final Verification:")
print(f"  Training samples: {pinn_data['train']['n_samples']:,}")
print(f"  Validation samples: {pinn_data['validation']['n_samples']:,}")
print(f"  Test samples: {pinn_data['test']['n_samples']:,}")
print(f"  Features: {len(pinn_data['train']['feature_names'])}")
print(f"  Targets: {len(pinn_data['train']['target_names'])}")
print(f"  Wells processed: {len(processed_wells)}")
print("\n✅ Ready for PINN training!")