# Binary Tsunami Detection with Focal Loss - Kaggle
## CNN-LSTM Model Training Pipeline

**Setup Instructions:**
1. Upload this repository as a Kaggle Dataset:
   - Go to Kaggle Datasets → New Dataset
   - Upload the entire repository folder
   - Name it: `india-tsunami-early-warning`
2. In this notebook, go to Settings → Accelerator → GPU T4 x2
3. Add the dataset: Add Data → Your Datasets → `india-tsunami-early-warning`
4. Run all cells

**OR** Clone from GitHub directly (shown in Cell 1)

In [None]:
#!/usr/bin/env python3
"""
Setup: Clone repo from GitHub (if not using Kaggle dataset)
"""

import os
import sys

# Kaggle working directory
os.chdir('/kaggle/working')

# Option 1: Clone from GitHub
repo_path = '/kaggle/working/India-specific-tsunami-early-warning-system'
if not os.path.exists(repo_path):
    print("Cloning repository from GitHub...")
    !git clone https://github.com/vsiva763-git/India-specific-tsunami-early-warning-system.git
    print("✓ Repository cloned")
else:
    print("✓ Repository already present")
    # Pull latest changes
    !cd {repo_path} && git pull origin main

# Option 2: If using Kaggle dataset (uncomment if you uploaded as dataset)
# repo_path = '/kaggle/input/india-tsunami-early-warning'

print(f"✓ Working directory: {os.getcwd()}")
print(f"✓ Repository path: {repo_path}")

# Verify structure
if os.path.exists(f'{repo_path}/config/config.yaml'):
    print("✓ Config file found")
else:
    print("⚠ Config file missing - check repository structure")

In [None]:
#!/usr/bin/env python3
"""
Install any missing dependencies
Kaggle has most packages pre-installed
"""

import sys

print("Installing dependencies...")

# Kaggle usually has TensorFlow 2.x installed
# Only install missing packages
!pip install -q pyyaml loguru

print("✓ Dependencies ready")

In [None]:
#!/usr/bin/env python3
"""
Import libraries and verify GPU
"""

import os
import sys
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import yaml
import json

# Add repo to path
repo_path = '/kaggle/working/India-specific-tsunami-early-warning-system'
# If using Kaggle dataset, uncomment:
# repo_path = '/kaggle/input/india-tsunami-early-warning'

sys.path.insert(0, repo_path)

# Project imports
from src.models.cnn_lstm_binary_model import TsunamiPredictionBinaryModel, focal_loss
from src.models.model_trainer import ModelTrainer
from src.models.data_preprocessor import DataPreprocessor

print("✓ Imports loaded")
print(f"✓ TensorFlow: {tf.__version__}")
print(f"✓ Keras: {tf.keras.__version__}")

# Check GPU
gpu_devices = tf.config.list_physical_devices('GPU')
if gpu_devices:
    print(f"\n🚀 GPU Available: {len(gpu_devices)} device(s)")
    for i, gpu in enumerate(gpu_devices):
        print(f"   GPU {i}: {gpu.name}")
    # Enable memory growth to avoid OOM
    for gpu in gpu_devices:
        tf.config.experimental.set_memory_growth(gpu, True)
else:
    print("\n⚠ No GPU detected - training will be slower")
    print("   Enable GPU: Settings → Accelerator → GPU T4 x2")

# Load config
config_path = Path(repo_path) / 'config' / 'config.yaml'
with open(config_path, 'r') as f:
    config = yaml.safe_load(f)
print(f"\n✓ Config loaded: {config_path}")

In [None]:
#!/usr/bin/env python3
"""
Create balanced training data with sample weights
40% positive class (tsunami events)
"""

np.random.seed(42)

print("Creating synthetic training data...")

n_samples = 10000
n_timesteps = 24  # 24-hour temporal window
n_features = 32   # Combined earthquake + ocean + spatial features

# Features: combined all modalities
X_data = np.random.randn(n_samples, n_timesteps, n_features).astype(np.float32)

# Labels with balanced distribution (40% positive)
y_balanced = np.random.choice([0, 1], size=n_samples, p=[0.6, 0.4]).astype(np.float32).reshape(-1, 1)

# Add some signal: positive samples have higher amplitude
X_data[y_balanced.flatten() == 1] *= 1.5

print(f"✓ Data shape: {X_data.shape}")
print(f"✓ Labels shape: {y_balanced.shape}")
print(f"✓ Class distribution:")
unique, counts = np.unique(y_balanced, return_counts=True)
for label, count in zip(unique, counts):
    pct = 100 * count / len(y_balanced)
    print(f"   Class {int(label)}: {count} ({pct:.1f}%)")

# Stratified train/validation split
X_train, X_val, y_train, y_val = train_test_split(
    X_data, y_balanced,
    test_size=0.2,
    stratify=y_balanced,
    random_state=42
)

print(f"\n✓ Train set: {X_train.shape[0]} samples")
print(f"✓ Val set: {X_val.shape[0]} samples")
print(f"✓ Train positive: {(y_train == 1).sum()} ({100*(y_train == 1).sum()/len(y_train):.1f}%)")
print(f"✓ Val positive: {(y_val == 1).sum()} ({100*(y_val == 1).sum()/len(y_val):.1f}%)")

# Compute sample weights (inverse class frequency)
class_counts = np.bincount(y_train.astype(int).flatten())
class_weights = 1.0 / class_counts
class_weights = class_weights / class_weights.sum() * len(class_weights)

# Assign sample weights based on class
sample_weights = np.array([class_weights[int(label)] for label in y_train.flatten()])

print(f"\n✓ Class weights: {dict(zip(range(len(class_weights)), class_weights))}")
print(f"✓ Sample weights range: [{sample_weights.min():.3f}, {sample_weights.max():.3f}]")
print(f"✓ Sample weights mean: {sample_weights.mean():.3f}")
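
The inverse-frequency weighting in the cell above can be checked by hand. The sketch below repeats the same arithmetic in plain Python on a hypothetical 60/40 label split; the helper name `inverse_frequency_weights` is illustrative and not part of the repository.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return {class: weight}, proportional to 1/count and
    rescaled so that the weights sum to the number of classes."""
    counts = Counter(labels)
    inverse = {cls: 1.0 / n for cls, n in counts.items()}
    scale = len(inverse) / sum(inverse.values())
    return {cls: w * scale for cls, w in inverse.items()}

# Hypothetical toy split: 6 negatives, 4 positives
labels = [0] * 6 + [1] * 4
weights = inverse_frequency_weights(labels)

# One weight per sample, looked up by its class
sample_weights = [weights[y] for y in labels]
mean_weight = sum(sample_weights) / len(sample_weights)
print(weights, mean_weight)
```

For this 60/40 split the weights come out at roughly 0.8 for the negative class and 1.2 for the positive class, so the mean sample weight is slightly below 1 (about 0.96).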

In [None]:
#!/usr/bin/env python3
"""
Build binary CNN-LSTM model with Focal Loss
"""

print("Building binary tsunami detection model...")

# Create model builder
model_builder = TsunamiPredictionBinaryModel(config)

# Input shape: (timesteps, features)
input_shape = (X_train.shape[1], X_train.shape[2])

# Build model
model = model_builder.build_model(input_shape=input_shape)

print(f"\n✓ Model built successfully")
print(f"✓ Input shape: {input_shape}")
print(f"✓ Total parameters: {model.count_params():,}")

# Display model architecture
print("\n📊 Model Architecture:")
model.summary()

In [None]:
#!/usr/bin/env python3
"""
Configure training parameters for Kaggle GPU
"""

# Override config for Kaggle training
config['model']['training']['epochs'] = 30
config['model']['training']['batch_size'] = 128  # Larger batch for GPU
config['model']['training']['learning_rate'] = 0.001
config['model']['training']['early_stopping_patience'] = 7
config['model']['training']['reduce_lr_patience'] = 3

print("Training Configuration (Kaggle GPU):")
print(f"  Epochs: {config['model']['training']['epochs']}")
print(f"  Batch Size: {config['model']['training']['batch_size']} (larger for GPU)")
print(f"  Learning Rate: {config['model']['training']['learning_rate']}")
print(f"  Early Stopping Patience: {config['model']['training']['early_stopping_patience']}")
print(f"  Reduce LR Patience: {config['model']['training']['reduce_lr_patience']}")
print(f"\nKey Features:")
print(f"  ✓ Focal Loss (γ=2.0, α=0.25) - handles class imbalance")
print(f"  ✓ Sample Weights - weights positives higher")
print(f"  ✓ Dropout (0.3) - prevents overfitting")
print(f"  ✓ Early Stopping - prevents overfitting")
print(f"  ✓ Reduce LR on Plateau - adaptive learning rate")
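
To make the γ and α settings concrete, here is a minimal per-example sketch of binary focal loss in the commonly cited form FL = -α_t (1 - p_t)^γ log(p_t). It is written in plain Python for illustration; the repository's own `focal_loss` is not shown in this notebook and may differ in details such as clipping, reduction, or how α is applied.

```python
import math

def binary_focal_loss(y_true, p, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for a single example; p is the predicted P(y=1)."""
    p = min(max(p, eps), 1.0 - eps)                   # clip to avoid log(0)
    p_t = p if y_true == 1 else 1.0 - p               # probability assigned to the true class
    alpha_t = alpha if y_true == 1 else 1.0 - alpha   # class-dependent weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = binary_focal_loss(1, 0.9)   # confident and correct
hard = binary_focal_loss(1, 0.1)   # confident and wrong
print(f"easy={easy:.6f} hard={hard:.6f}")
```

The (1 - p_t)^γ factor shrinks the loss of well-classified examples far more than that of misclassified ones. Under this convention α = 0.25 multiplies the positive-class term by 0.25 and the negative-class term by 0.75, so it is worth checking which convention the repository's implementation follows.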

In [None]:
#!/usr/bin/env python3
"""
Train binary model with Focal Loss and sample weights
"""

print("Starting training with Focal Loss + Sample Weights...\n")

# Create checkpoint directory in Kaggle working directory
checkpoint_dir = '/kaggle/working/models/checkpoints'
Path(checkpoint_dir).mkdir(parents=True, exist_ok=True)

# Setup callbacks
# AUC should be maximized, so set mode='max' explicitly rather than
# relying on each callback's 'auto' inference from the metric name
callbacks = [
    keras.callbacks.ModelCheckpoint(
        filepath=f'{checkpoint_dir}/best_model.keras',
        monitor='val_auc',
        mode='max',
        save_best_only=True,
        verbose=1
    ),
    keras.callbacks.EarlyStopping(
        monitor='val_auc',
        mode='max',
        patience=config['model']['training']['early_stopping_patience'],
        restore_best_weights=True,
        verbose=1
    ),
    keras.callbacks.ReduceLROnPlateau(
        monitor='val_auc',
        mode='max',
        factor=0.5,
        patience=config['model']['training']['reduce_lr_patience'],
        min_lr=1e-7,
        verbose=1
    )
]

# Train with sample weights
print(f"Train samples: {X_train.shape[0]}")
print(f"Val samples: {X_val.shape[0]}")
print(f"Using sample weights: min={sample_weights.min():.3f}, max={sample_weights.max():.3f}\n")

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    sample_weight=sample_weights,
    batch_size=config['model']['training']['batch_size'],
    epochs=config['model']['training']['epochs'],
    callbacks=callbacks,
    verbose=1
)

print("\n✓ Training completed!")
# Report the best epoch by val AUC, the metric the callbacks monitor
print(f"✓ Best epoch (by val AUC): {np.argmax(history.history['val_auc']) + 1}")
print(f"✓ Best val loss: {np.min(history.history['val_loss']):.4f}")
print(f"✓ Best val AUC: {np.max(history.history['val_auc']):.4f}")
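
Validation loss and validation AUC need not peak at the same epoch, so which epoch counts as "best" depends on the monitored metric. The sketch below illustrates this on a made-up history dictionary; `best_epoch` is a hypothetical helper and the numbers are not results from this notebook.

```python
def best_epoch(history, metric="val_auc", mode="max"):
    """Return (1-based epoch, value) of the best entry for a metric."""
    values = history[metric]
    pick = max if mode == "max" else min
    idx = pick(range(len(values)), key=values.__getitem__)
    return idx + 1, values[idx]

# Made-up curves in the same layout as history.history
toy_history = {
    "val_loss": [0.20, 0.15, 0.16, 0.14],
    "val_auc":  [0.70, 0.82, 0.85, 0.84],
}

by_auc = best_epoch(toy_history, "val_auc", "max")
by_loss = best_epoch(toy_history, "val_loss", "min")
print(by_auc, by_loss)
```

In this toy case the AUC peaks at epoch 3 while the loss is lowest at epoch 4, so a checkpoint that monitors `val_auc` would keep the epoch-3 weights.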

In [None]:
#!/usr/bin/env python3
"""
Plot training history
"""

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Loss
axes[0, 0].plot(history.history['loss'], label='Train Loss')
axes[0, 0].plot(history.history['val_loss'], label='Val Loss')
axes[0, 0].set_title('Loss (Focal Loss)', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Epoch')
axes[0, 0].set_ylabel('Loss')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# AUC
axes[0, 1].plot(history.history['auc'], label='Train AUC')
axes[0, 1].plot(history.history['val_auc'], label='Val AUC')
axes[0, 1].set_title('AUC (Key Metric)', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Epoch')
axes[0, 1].set_ylabel('AUC')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Recall
axes[1, 0].plot(history.history['recall'], label='Train Recall')
axes[1, 0].plot(history.history['val_recall'], label='Val Recall')
axes[1, 0].set_title('Recall (Tsunami Detection)', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Epoch')
axes[1, 0].set_ylabel('Recall')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Precision
axes[1, 1].plot(history.history['precision'], label='Train Precision')
axes[1, 1].plot(history.history['val_precision'], label='Val Precision')
axes[1, 1].set_title('Precision (Share of Alarms Correct)', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Epoch')
axes[1, 1].set_ylabel('Precision')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('/kaggle/working/training_history.png', dpi=100, bbox_inches='tight')
plt.show()

print("✓ Training history plot saved")

In [None]:
#!/usr/bin/env python3
"""
Validation analysis with comprehensive metrics
"""

print("="*70)
print("VALIDATION ANALYSIS")
print("="*70)

# Predictions on validation set
y_val_pred_proba = model.predict(X_val, verbose=0)
y_val_pred = (y_val_pred_proba > 0.5).astype(int).flatten()

# Key metrics
val_auc = roc_auc_score(y_val, y_val_pred_proba)
val_acc = (y_val.flatten() == y_val_pred).mean()
val_recall = (y_val_pred[y_val.flatten() == 1] == 1).sum() / (y_val.flatten() == 1).sum()
val_precision = (y_val.flatten()[y_val_pred == 1] == 1).sum() / max((y_val_pred == 1).sum(), 1)

print(f"\n📊 Metrics:")
print(f"  AUC:       {val_auc:.4f} (0.5 = random, 1.0 = perfect)")
print(f"  Accuracy:  {val_acc:.4f}")
print(f"  Recall:    {val_recall:.4f} (% of tsunamis detected)")
print(f"  Precision: {val_precision:.4f} (% of alarms correct)")

# Confusion matrix
cm = confusion_matrix(y_val.flatten(), y_val_pred)
print(f"\n📋 Confusion Matrix:")
print(f"  TN: {cm[0,0]:.0f}  FP: {cm[0,1]:.0f}")
print(f"  FN: {cm[1,0]:.0f}  TP: {cm[1,1]:.0f}")

# Classification report
print(f"\n📈 Classification Report:")
print(classification_report(y_val.flatten(), y_val_pred, 
                          target_names=['No Tsunami', 'Tsunami'],
                          digits=4))

# ROC curve
fpr, tpr, thresholds = roc_curve(y_val.flatten(), y_val_pred_proba)
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC (AUC={val_auc:.4f})')
ax.plot([0, 1], [0, 1], 'r--', linewidth=2, label='Random Classifier')
ax.set_xlabel('False Positive Rate', fontsize=11)
ax.set_ylabel('True Positive Rate', fontsize=11)
ax.set_title('ROC Curve (Validation Set)', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('/kaggle/working/roc_curve.png', dpi=100, bbox_inches='tight')
plt.show()

print(f"✓ ROC curve saved")
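
The "0.5 = random, 1.0 = perfect" reading of AUC follows from its interpretation as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. A small pairwise sketch, in plain Python and suitable only for tiny inputs, makes this explicit; `pairwise_auc` is an illustrative name, not a library function.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties count as half a win. Quadratic in the number of samples."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and scores (not model output)
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

Three of the four positive/negative pairs are ordered correctly here, giving 0.75. A model that outputs the same score for every sample gets exactly 0.5.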

In [None]:
#!/usr/bin/env python3
"""
Threshold analysis - find optimal threshold
"""

print("="*70)
print("THRESHOLD ANALYSIS")
print("="*70)

thresholds_to_test = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
results = []

for threshold in thresholds_to_test:
    y_pred_th = (y_val_pred_proba > threshold).astype(int).flatten()
    
    tp = ((y_val.flatten() == 1) & (y_pred_th == 1)).sum()
    fp = ((y_val.flatten() == 0) & (y_pred_th == 1)).sum()
    fn = ((y_val.flatten() == 1) & (y_pred_th == 0)).sum()
    tn = ((y_val.flatten() == 0) & (y_pred_th == 0)).sum()
    
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    
    results.append({
        'threshold': threshold,
        'recall': recall,
        'precision': precision,
        'accuracy': accuracy,
        'tp': tp,
        'fp': fp,
        'fn': fn,
        'tn': tn
    })

df_thresh = pd.DataFrame(results)

print("\n📊 Threshold Performance:")
print(df_thresh.to_string(index=False))

# Find best threshold by F1 score
df_thresh['f1'] = 2 * (df_thresh['precision'] * df_thresh['recall']) / (df_thresh['precision'] + df_thresh['recall'] + 1e-8)
best_threshold = df_thresh.loc[df_thresh['f1'].idxmax(), 'threshold']
best_f1 = df_thresh.loc[df_thresh['f1'].idxmax(), 'f1']

print(f"\n⭐ Best Threshold: {best_threshold} (F1={best_f1:.4f})")
print(f"   At this threshold:")
print(f"   - Recall: {df_thresh[df_thresh['threshold']==best_threshold]['recall'].values[0]:.4f}")
print(f"   - Precision: {df_thresh[df_thresh['threshold']==best_threshold]['precision'].values[0]:.4f}")

# Plot threshold analysis
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Metrics vs threshold
axes[0].plot(df_thresh['threshold'], df_thresh['recall'], 'o-', label='Recall', linewidth=2, markersize=8)
axes[0].plot(df_thresh['threshold'], df_thresh['precision'], 's-', label='Precision', linewidth=2, markersize=8)
axes[0].plot(df_thresh['threshold'], df_thresh['accuracy'], '^-', label='Accuracy', linewidth=2, markersize=8)
axes[0].axvline(best_threshold, color='red', linestyle='--', label=f'Best (θ={best_threshold})', linewidth=2)
axes[0].set_xlabel('Threshold', fontsize=11)
axes[0].set_ylabel('Score', fontsize=11)
axes[0].set_title('Metrics vs Threshold', fontsize=12, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Confusion matrix at best threshold
y_pred_best = (y_val_pred_proba > best_threshold).astype(int).flatten()
cm_best = confusion_matrix(y_val.flatten(), y_pred_best)
sns.heatmap(cm_best, annot=True, fmt='d', cmap='Blues', ax=axes[1], 
            xticklabels=['No Tsunami', 'Tsunami'],
            yticklabels=['No Tsunami', 'Tsunami'])
axes[1].set_title(f'Confusion Matrix at θ={best_threshold}', fontsize=12, fontweight='bold')
axes[1].set_ylabel('True Label', fontsize=11)
axes[1].set_xlabel('Predicted Label', fontsize=11)

plt.tight_layout()
plt.savefig('/kaggle/working/threshold_analysis.png', dpi=100, bbox_inches='tight')
plt.show()

print(f"✓ Threshold analysis plot saved")
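
Maximizing F1 treats missed events and false alarms symmetrically. If missed events are considered costlier (an assumption about the use case, not something this notebook establishes), one alternative is to require a minimum recall and then take the threshold with the best precision among those that qualify. The sketch below shows that rule on toy data; `pick_threshold` and the 0.9 recall floor are hypothetical.

```python
def confusion_counts(y_true, probs, threshold):
    """Count TP, FP, FN, TN for predictions p > threshold."""
    tp = fp = fn = tn = 0
    for y, p in zip(y_true, probs):
        predicted = 1 if p > threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 0 and predicted == 1:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def pick_threshold(y_true, probs, thresholds, min_recall=0.9):
    """Return (threshold, precision, recall) with the highest precision
    among thresholds whose recall is at least min_recall, or None."""
    best = None
    for t in thresholds:
        tp, fp, fn, _ = confusion_counts(y_true, probs, t)
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        if recall >= min_recall and (best is None or precision > best[1]):
            best = (t, precision, recall)
    return best

# Toy labels and probabilities (not model output)
toy_y = [0, 0, 0, 0, 1, 1, 1, 1]
toy_p = [0.05, 0.2, 0.35, 0.6, 0.3, 0.55, 0.7, 0.9]
choice = pick_threshold(toy_y, toy_p, [0.1, 0.25, 0.5, 0.75], min_recall=0.9)
print(choice)
```

On this toy data the rule selects 0.25: it is the highest-precision threshold that still catches every positive, whereas 0.5 would miss one of the four.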

In [None]:
#!/usr/bin/env python3
"""
Test on hold-out test set
"""

print("="*70)
print("TEST SET EVALUATION")
print("="*70)

# Create separate test set
X_test = np.random.randn(2000, X_train.shape[1], X_train.shape[2]).astype(np.float32)
y_test = np.random.choice([0, 1], size=2000, p=[0.6, 0.4]).astype(np.float32).reshape(-1, 1)
X_test[y_test.flatten() == 1] *= 1.5  # Add signal

print(f"\nTest set: {X_test.shape[0]} samples")
print(f"Positive: {(y_test == 1).sum()} ({100*(y_test == 1).sum()/len(y_test):.1f}%)")

# Predictions on test set
y_test_pred_proba = model.predict(X_test, verbose=0)
y_test_pred = (y_test_pred_proba > best_threshold).astype(int).flatten()

# Metrics
test_auc = roc_auc_score(y_test, y_test_pred_proba)
test_acc = (y_test.flatten() == y_test_pred).mean()
test_recall = (y_test_pred[y_test.flatten() == 1] == 1).sum() / (y_test.flatten() == 1).sum()
test_precision = (y_test.flatten()[y_test_pred == 1] == 1).sum() / max((y_test_pred == 1).sum(), 1)

print(f"\n📊 Test Metrics (threshold={best_threshold}):")
print(f"  AUC:       {test_auc:.4f}")
print(f"  Accuracy:  {test_acc:.4f}")
print(f"  Recall:    {test_recall:.4f}")
print(f"  Precision: {test_precision:.4f}")

# Confusion matrix
cm_test = confusion_matrix(y_test.flatten(), y_test_pred)
print(f"\n📋 Test Confusion Matrix:")
print(f"  TN: {cm_test[0,0]:.0f}  FP: {cm_test[0,1]:.0f}")
print(f"  FN: {cm_test[1,0]:.0f}  TP: {cm_test[1,1]:.0f}")

# Classification report
print(f"\n📈 Test Classification Report:")
print(classification_report(y_test.flatten(), y_test_pred,
                          target_names=['No Tsunami', 'Tsunami'],
                          digits=4))

In [None]:
#!/usr/bin/env python3
"""
Save trained model and metadata
"""

print("="*70)
print("SAVING MODEL")
print("="*70)

# Create models directory in Kaggle working directory
models_dir = Path('/kaggle/working/models')
models_dir.mkdir(exist_ok=True)

# Save the best model
model_path = models_dir / 'tsunami_detection_binary_focal.keras'
model.save(str(model_path))
print(f"\n✓ Model saved: {model_path}")
print(f"  File size: {model_path.stat().st_size / 1e6:.2f} MB")

# Save model metadata
metadata = {
    'model_type': 'Binary CNN-LSTM with Focal Loss',
    'input_shape': (X_train.shape[1], X_train.shape[2]),
    'output': 'Binary classification (tsunami/no-tsunami)',
    'threshold': best_threshold,
    'validation_auc': float(val_auc),
    'validation_recall': float(val_recall),
    'validation_precision': float(val_precision),
    'test_auc': float(test_auc),
    'test_recall': float(test_recall),
    'test_precision': float(test_precision),
    'focal_loss_gamma': 2.0,
    'focal_loss_alpha': 0.25,
    'training_samples': int(X_train.shape[0]),
    'positive_class_ratio': float((y_train == 1).sum() / len(y_train)),
    'platform': 'Kaggle GPU'
}

metadata_path = models_dir / 'model_metadata.json'
with open(metadata_path, 'w') as f:
    json.dump(metadata, f, indent=2)
print(f"✓ Metadata saved: {metadata_path}")

# Summary
print("\n" + "="*70)
print("TRAINING SUMMARY")
print("="*70)
print(f"""
✅ Binary Model with Focal Loss Successfully Trained on Kaggle GPU!

📊 Architecture:
   - CNN blocks: 2 (32, 64 filters)
   - LSTM layers: 2 (128, 64 units)
   - Dense layers: 3 (128, 64, 32 units)
   - Total parameters: {model.count_params():,}

🎯 Key Features:
   ✓ Focal Loss (γ=2.0, α=0.25) - focuses on hard examples
   ✓ Sample Weights - upweights positive class
   ✓ 40% positive class in training data
   ✓ Stratified train/val split
   ✓ Early stopping and LR reduction

📈 Performance (Val Set):
   AUC: {val_auc:.4f}
   Recall: {val_recall:.4f} (detects {100*val_recall:.1f}% of tsunamis)
   Precision: {val_precision:.4f}
   Best Threshold: {best_threshold}

🧪 Performance (Test Set):
   AUC: {test_auc:.4f}
   Recall: {test_recall:.4f}
   Precision: {test_precision:.4f}

📁 Output Files (in /kaggle/working/):
   - tsunami_detection_binary_focal.keras
   - model_metadata.json
   - training_history.png
   - roc_curve.png
   - threshold_analysis.png

💾 Download: Click "Output" tab above to download all files
""")

print("✅ Ready for deployment!")

## Next Steps

1. **Download trained model**: Go to the "Output" tab and download:
   - `tsunami_detection_binary_focal.keras`
   - `model_metadata.json`
   - All visualization plots

2. **Use the model**:
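```python
import tensorflow as tf
# compile=False loads the model for inference only, so the custom focal loss
# does not have to be supplied through custom_objects
model = tf.keras.models.load_model('tsunami_detection_binary_focal.keras', compile=False)
probability = model.predict(your_data)
```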

3. **Adjust threshold**: Use the threshold analysis to pick the optimal threshold for your use case:
   - Lower threshold (0.3-0.4): More sensitive, catches more tsunamis but more false alarms
   - Higher threshold (0.6-0.7): More conservative, fewer false alarms but may miss some events

4. **Integrate into production**: Deploy the model to your early warning system infrastructure
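
As a sketch of how steps 2 and 3 fit together, the snippet below reads the `threshold` field from a metadata file and turns probabilities into alert flags. To keep it self-contained it writes a stand-in metadata file to a temporary directory and uses hard-coded probabilities in place of `model.predict` output; the `classify` helper and the example values are illustrative.

```python
import json
import tempfile
from pathlib import Path

def classify(probabilities, threshold):
    """Flag an alert (1) when the probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the downloaded model_metadata.json
    metadata_path = Path(tmp) / "model_metadata.json"
    metadata_path.write_text(json.dumps({"threshold": 0.4}))

    # Read the decision threshold stored at training time
    threshold = json.loads(metadata_path.read_text())["threshold"]

# Stand-ins for flattened model.predict(...) probabilities
probabilities = [0.12, 0.41, 0.87]
alerts = classify(probabilities, threshold)
print(alerts)  # [0, 1, 1]
```

Reading the threshold from the metadata file keeps the deployed decision rule in sync with the value chosen during threshold analysis instead of hard-coding 0.5.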