# EpiWatch: AI for Early Epidemic Detection

## Project Overview

**Mission:** Create a scalable and intelligent system that detects early signs of disease outbreaks in low-resource regions.

**UN SDG Alignment:**
- **SDG 3**: Good Health & Well-being
- **SDG 9**: Industry, Innovation & Infrastructure
- **SDG 10**: Reduced Inequalities

## Notebook Goals
1. Train 5 models (1 custom + 4 pre-trained)
2. Compare performance metrics
3. Select best model for mobile app
4. Generate outputs for app integration

In [None]:
# Import required libraries
import sys
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Add src to path
sys.path.append('../src')

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Libraries imported successfully!")

## 1. Data Loading and Exploration

In [None]:
from preprocessing.text_preprocessing import DatasetBuilder, TextPreprocessor

# Initialize
builder = DatasetBuilder()
preprocessor = TextPreprocessor()

# Create or load dataset
data_path = '../data/processed/epidemic_data.csv'

if not os.path.exists(data_path):
    print("Creating sample dataset...")
    df = builder.create_sample_dataset(n_samples=2000, save_path=data_path)
else:
    print(f"Loading dataset from {data_path}...")
    df = builder.load_data(data_path)

print(f"\n✓ Dataset loaded: {df.shape}")
df.head()

In [None]:
# Dataset statistics
print("Dataset Information:")
print("=" * 60)
print(f"Total samples: {len(df)}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nClass distribution:")
print(df['label'].value_counts())
print(f"\nRegion distribution:")
print(df['region'].value_counts())
print(f"\nDisease distribution:")
print(df['disease'].value_counts())

In [None]:
# Visualize class distribution
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Class distribution
df['label'].value_counts().plot(kind='bar', ax=axes[0], color=['#4CAF50', '#FF4444'])
axes[0].set_title('Class Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Label (0=Normal, 1=Outbreak)')
axes[0].set_ylabel('Count')

# Region distribution
df['region'].value_counts().plot(kind='bar', ax=axes[1], color='#3498db')
axes[1].set_title('Region Distribution', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Region')
axes[1].set_ylabel('Count')

# Disease distribution
df['disease'].value_counts().plot(kind='bar', ax=axes[2], color='#e74c3c')
axes[2].set_title('Disease Distribution', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Disease')
axes[2].set_ylabel('Count')

plt.tight_layout()
plt.show()

## 2. Text Preprocessing
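
The cells in this section call `clean_text` and `preprocess` on the project's `TextPreprocessor`, whose implementation lives in `src/preprocessing` and is not shown in this notebook. As a rough illustration of what a cleaning step of this kind often does, here is a minimal sketch. The function name `sketch_clean_text` and the chosen steps (lowercasing, URL removal, punctuation stripping, whitespace collapsing) are assumptions for illustration, not the project's actual logic.

```python
import re

# Hypothetical helper for illustration only; not the project's TextPreprocessor.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NON_WORD_PATTERN = re.compile(r"[^\w\s]")


def sketch_clean_text(text: str) -> str:
    """Lowercase, drop URLs and punctuation, and collapse whitespace."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    # \w is Unicode-aware in Python 3, so letters from non-Latin scripts are kept
    text = NON_WORD_PATTERN.sub(" ", text)
    return " ".join(text.split())


print(sketch_clean_text("ALERT: 12 new dengue cases in Mumbai! https://example.org/report"))
```

Stripping punctuation this aggressively also splits tokens such as `COVID-19` into two parts, so a real pipeline may want to protect disease names before cleaning.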

In [None]:
# Example text preprocessing
example_text = df['text'].iloc[0]

print("Original Text:")
print("=" * 60)
print(example_text)
print()

print("Cleaned Text:")
print("=" * 60)
cleaned = preprocessor.clean_text(example_text)
print(cleaned)
print()

print("Fully Preprocessed Text:")
print("=" * 60)
processed = preprocessor.preprocess(example_text)
print(processed)
print()

print("Extracted Features:")
print("=" * 60)
features = preprocessor.extract_features(example_text)
for key, value in features.items():
    print(f"{key}: {value}")

In [None]:
# Preprocess all texts
print("Preprocessing all texts...")
df['processed_text'] = preprocessor.preprocess_dataset(df['text'].tolist(), show_progress=True)
print("\n✓ Preprocessing complete!")

# Show comparison
comparison_df = df[['text', 'processed_text', 'label']].head(5)
comparison_df

## 3. Data Splitting
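
The cell below delegates the split to `builder.prepare_train_test_split` with `test_size=0.2` and `val_size=0.1`. The sketch below shows one way a stratified three-way split can be done in plain Python. `sketch_stratified_split` is a hypothetical helper, and treating both sizes as fractions of the full dataset is an assumption; the project's `DatasetBuilder` may interpret them differently.

```python
import random
from collections import defaultdict


def sketch_stratified_split(texts, labels, test_size=0.2, val_size=0.1, seed=42):
    """Split into train/val/test while keeping class proportions in each part.

    Hypothetical helper: test_size and val_size are treated as fractions
    of the full dataset, which is an assumption.
    """
    rng = random.Random(seed)

    # Group sample indices by class so each class is split separately
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    parts = {"train": [], "val": [], "test": []}
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        n_val = round(len(indices) * val_size)
        parts["test"] += indices[:n_test]
        parts["val"] += indices[n_test:n_test + n_val]
        parts["train"] += indices[n_test + n_val:]

    # Same nested shape as data_splits['train']['texts'] used in this notebook
    return {
        name: {"texts": [texts[i] for i in idxs], "labels": [labels[i] for i in idxs]}
        for name, idxs in parts.items()
    }


texts = [f"report {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]
splits = sketch_stratified_split(texts, labels)
print({name: len(part["labels"]) for name, part in splits.items()})
```

Splitting per class keeps the outbreak/normal ratio the same in all three parts, which matters when the validation set is small.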

In [None]:
# Balance dataset
df_balanced = builder.balance_dataset(df, text_col='processed_text')

# Split into train, val, test
data_splits = builder.prepare_train_test_split(
    df_balanced,
    text_col='processed_text',
    test_size=0.2,
    val_size=0.1
)

print("\n✓ Data split complete!")

## 4. Model Training

We will train 5 models:
1. **Custom Neural Network** (LSTM + Attention) - Built from scratch
2. **XLM-RoBERTa** - Cross-lingual pre-trained model
3. **mBERT** - Multilingual BERT
4. **DistilBERT** - Efficient distilled model
5. **MuRIL** - Multilingual Representations for Indian Languages
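
The custom model is described as LSTM + Attention; its code is in `models/custom_model.py` and is not reproduced here. To make the attention part concrete, the sketch below shows a simple form of attention pooling in plain Python: each time step gets a score, the scores are turned into weights with a softmax, and the hidden states are averaged using those weights. The `query` vector is fixed for illustration (a real model would learn it), and this dot-product scoring is an assumption about the general technique, not a description of `CustomEpiDetector`.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Collapse a sequence of hidden states into a single vector.

    hidden_states: list of per-token vectors (for example, LSTM outputs).
    query: scoring vector; learned in a real model, fixed here for illustration.
    """
    # Dot-product score for every time step
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    # Weighted sum of the hidden states, one output value per dimension
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return pooled, weights


hidden_states = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
pooled, weights = attention_pool(hidden_states, query=[1.0, 1.0])
print([round(w, 3) for w in weights])
```

The weights sum to 1, and the time step with the highest score dominates the pooled vector.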

### 4.1 Custom Neural Network (From Scratch)

In [None]:
import torch
from torch.utils.data import DataLoader
from models.custom_model import CustomEpiDetector, ModelTrainer, build_vocab, EpidemicDataset

# Build vocabulary
print("Building vocabulary...")
vocab = build_vocab(data_splits['train']['texts'], min_freq=2)
print(f"✓ Vocabulary size: {len(vocab)}")

# Create datasets
train_dataset = EpidemicDataset(data_splits['train']['texts'], data_splits['train']['labels'], vocab)
val_dataset = EpidemicDataset(data_splits['val']['texts'], data_splits['val']['labels'], vocab)
test_dataset = EpidemicDataset(data_splits['test']['texts'], data_splits['test']['labels'], vocab)

# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)
test_loader = DataLoader(test_dataset, batch_size=32)

print("✓ DataLoaders created")
print(f"  Train batches: {len(train_loader)}")
print(f"  Val batches: {len(val_loader)}")
print(f"  Test batches: {len(test_loader)}")

In [None]:
# Initialize custom model
custom_model = CustomEpiDetector(
    vocab_size=len(vocab),
    embedding_dim=256,
    hidden_dim=128,
    num_layers=2,
    dropout=0.3
)

print("Custom Neural Network Architecture:")
print("=" * 60)
print(custom_model)
print(f"\nTotal parameters: {sum(p.numel() for p in custom_model.parameters()):,}")

In [None]:
# Train custom model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

trainer = ModelTrainer(custom_model, device=device)
train_losses, val_losses = trainer.train(train_loader, val_loader, epochs=5, lr=0.001)

In [None]:
# Plot training curves
plt.figure(figsize=(14, 5))

plt.subplot(1, 2, 1)
plt.plot(train_losses, label='Train Loss', marker='o')
plt.plot(val_losses, label='Val Loss', marker='s')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and Validation Loss', fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(trainer.train_accuracies, label='Train Accuracy', marker='o')
plt.plot(trainer.val_accuracies, label='Val Accuracy', marker='s')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Training and Validation Accuracy', fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)

plt.tight_layout()
plt.show()

### 4.2 Pre-trained Transformer Models

**Note:** Fine-tuning transformer models is time-consuming, so this notebook fine-tunes only one of them (DistilBERT) as a demonstration. Use the `train_all.py` script to train all of the models.

In [None]:
from models.pretrained_models import PretrainedEpiDetector

# Set up one pre-trained model as an example (DistilBERT, the fastest of the four)
print("Preparing DistilBERT (fastest pre-trained model)...")

distilbert_model = PretrainedEpiDetector('distilbert-base-multilingual-cased', device=device)

# Prepare dataloaders
train_loader_bert = distilbert_model.prepare_dataloader(
    data_splits['train']['texts'],
    data_splits['train']['labels'],
    batch_size=16
)

val_loader_bert = distilbert_model.prepare_dataloader(
    data_splits['val']['texts'],
    data_splits['val']['labels'],
    batch_size=16,
    shuffle=False
)

print("✓ DataLoaders prepared")

In [None]:
# Train DistilBERT
distilbert_model.train(train_loader_bert, val_loader_bert, epochs=3, learning_rate=2e-5)

## 5. Model Evaluation and Comparison
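
The comparison in this section relies on the project's `ModelEvaluator`, whose internals are not shown. For reference, the sketch below computes the standard binary classification metrics directly from confusion-matrix counts. `sketch_binary_metrics` is a hypothetical helper, and the exact set of metrics that `ModelEvaluator` reports is not confirmed by this notebook.

```python
def sketch_binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels in {0, 1} (1 = outbreak)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]
print(sketch_binary_metrics(y_true_demo, y_pred_demo))
```

Recall on label 1 is the fraction of true outbreak reports that the model catches, so it is worth reading alongside accuracy when comparing models.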

In [None]:
from evaluation.model_evaluator import ModelEvaluator

# Initialize evaluator
evaluator = ModelEvaluator()

# Evaluate custom model
print("Evaluating Custom Neural Network...")
predictions_custom, _ = trainer.predict(test_loader)
y_pred_custom = (predictions_custom > 0.5).astype(int)
y_true = np.array(data_splits['test']['labels'])

evaluator.evaluate_model(
    "Custom Neural Network",
    y_true,
    y_pred_custom,
    y_prob=predictions_custom,
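    inference_time=0.05,  # hard-coded estimate, not measured in this notebook
    model_size=50         # hard-coded estimate, not measured in this notebook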
)

print("\n✓ Custom model evaluated!")

In [None]:
# Evaluate DistilBERT
print("Evaluating DistilBERT...")
predictions_distil, probabilities_distil = distilbert_model.predict(
    data_splits['test']['texts'],
    batch_size=16
)

# Measure inference time
sample_text = data_splits['test']['texts'][0]
timing = distilbert_model.measure_inference_time(sample_text, num_runs=50)

evaluator.evaluate_model(
    "DistilBERT-Multilingual",
    y_true,
    predictions_distil,
    y_prob=probabilities_distil,
    inference_time=timing['mean'],
    model_size=270
)

print("\n✓ DistilBERT evaluated!")

In [None]:
# Display comparison table
comparison_df = evaluator.get_comparison_table()
print("\nModel Comparison:")
print("=" * 100)
print(comparison_df.to_string(index=False))

In [None]:
# Generate visualizations
evaluator.plot_comparison(save_path='../outputs/visualizations/model_comparison.png')
evaluator.plot_confusion_matrices(save_path='../outputs/visualizations/confusion_matrices.png')

In [None]:
# Get recommendation
recommendation = evaluator.generate_recommendation()

print("\n" + "=" * 80)
print("üèÜ RECOMMENDED MODEL FOR EPIWATCH APPLICATION")
print("=" * 80)
print(f"\nModel: {recommendation['recommended_model']}")
print(f"Overall Score: {recommendation['total_score']:.4f}")
print(f"\nReasons:")
for reason in recommendation['reasons']:
    print(f"  ✓ {reason}")
print("\n" + "=" * 80)

## 6. Anomaly Detection and Alert Generation
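
The alert system below is configured with `AnomalyDetector(method='zscore', threshold=2.5)`. As an illustration of the underlying idea, the sketch below flags days whose case count lies more than `threshold` standard deviations above the mean of the series. `sketch_zscore_anomalies` is a hypothetical helper; the project's detector may differ, for example by using a rolling baseline or by grouping per region and disease.

```python
from statistics import mean, pstdev


def sketch_zscore_anomalies(counts, threshold=2.5):
    """Return indices whose z-score exceeds the threshold (upper tail only)."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        # A perfectly flat series has no outliers
        return []
    return [i for i, c in enumerate(counts) if (c - mu) / sigma > threshold]


daily_counts = [4, 5, 3, 6, 4, 5, 4, 3, 5, 4, 6, 40]
print(sketch_zscore_anomalies(daily_counts))
```

Only unusually high counts are flagged here, since a drop in reports is not an outbreak signal. Note that a single extreme value inflates the standard deviation, so very short series can hide a spike behind a high threshold.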

In [None]:
from evaluation.anomaly_detection import AnomalyDetector, OutbreakAlertSystem
from datetime import datetime, timedelta

# Create synthetic sample predictions with temporal and spatial data
np.random.seed(42)  # fixed seed so the synthetic data is reproducible
dates = pd.date_range(start='2024-11-01', periods=100, freq='H')
regions = ['Mumbai', 'Delhi', 'Bangalore', 'Chennai']
diseases = ['Dengue', 'COVID-19', 'Malaria', 'Influenza']

predictions_df = pd.DataFrame({
    'date': np.random.choice(dates, 500),
    'region': np.random.choice(regions, 500),
    'disease': np.random.choice(diseases, 500),
    'text': ['sample text'] * 500,
    'prediction': np.random.binomial(1, 0.3, 500),
    'probability': np.random.uniform(0.3, 0.95, 500)
})

# Add some anomalies (outbreaks)
outbreak_indices = np.random.choice(500, 50, replace=False)
predictions_df.loc[outbreak_indices, 'prediction'] = 1
predictions_df.loc[outbreak_indices, 'probability'] = np.random.uniform(0.8, 0.99, 50)

print(f"✓ Sample predictions created: {len(predictions_df)} records")
predictions_df.head()

In [None]:
# Initialize alert system
detector = AnomalyDetector(method='zscore', threshold=2.5)
alert_system = OutbreakAlertSystem(detector)

# Process and generate alerts
alerts = alert_system.process_and_alert(predictions_df)

print(f"\nüö® Generated {len(alerts)} alerts\n")

# Display sample alerts
for i, alert in enumerate(alerts[:5]):
    print(f"Alert {i+1}:")
    print(f"  {alert['message']}")
    print()

## 7. Mobile App Integration - Output Generation

### 7.1 Alert Feed (for Recent Alerts Screen)

In [None]:
# Get alerts for mobile app
mobile_alerts = alert_system.get_alerts_for_mobile(limit=10)

print("📱 Mobile App - Recent Alerts")
print("=" * 80)

# Map each risk level to a colored marker (defined once, outside the loop)
color_emoji = {'high': '🔴', 'moderate': '🟠', 'low': '🟢'}

for alert in mobile_alerts[:5]:
    # Fall back to a neutral marker if an unexpected risk level appears
    marker = color_emoji.get(alert['risk_level'], '⚪')
    print(f"\n{marker} {alert['title']}")
    print(f"   Location: {alert['location']}")
    print(f"   Cases: {alert['case_count']}")
    print(f"   Risk: {alert['risk_level'].upper()}")
    print(f"   Summary: {alert['summary'][:100]}...")

# Save to JSON for mobile app
import json
os.makedirs('../outputs/alerts', exist_ok=True)  # create the output folder if missing
with open('../outputs/alerts/mobile_alerts.json', 'w') as f:
    # default=str serializes values json cannot handle natively, such as timestamps
    json.dump(mobile_alerts, f, indent=2, default=str)

print("\n✓ Alerts saved for mobile app: outputs/alerts/mobile_alerts.json")

### 7.2 Map Data (for Global Outbreak Status Screen)

In [None]:
# Get map data
map_data = alert_system.get_map_data()

print("üì± Mobile App - Map Data")
print("=" * 80)

for region in map_data[:5]:
    print(f"\n{region['region']}:")
    print(f"  Max Risk Level: {region['max_risk']}")
    print(f"  Active Alerts: {len(region['alerts'])}")

# Save to JSON
with open('../outputs/visualizations/map_data.json', 'w') as f:
    json.dump(map_data, f, indent=2, default=str)

print("\n✓ Map data saved: outputs/visualizations/map_data.json")

### 7.3 Trend Data (for 7-Day Disease Trends Screen)
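
The cells below read `trend_data` as a mapping from disease name to a `data` list of `{'date', 'count'}` entries. The sketch below shows one way such a structure could be built from raw records. `sketch_trend_data` is a hypothetical helper, and counting only positive predictions per day is an assumption about what `count` means; `alert_system.get_trend_data` may aggregate differently.

```python
from collections import Counter
from datetime import date, timedelta


def sketch_trend_data(records, end_date, days=7):
    """Build per-disease daily counts for the last `days` days.

    records: list of (day, disease, prediction) tuples, prediction in {0, 1}.
    """
    # Oldest day first, ending on end_date
    window = [end_date - timedelta(days=offset) for offset in range(days - 1, -1, -1)]
    counts = Counter(
        (disease, day)
        for day, disease, prediction in records
        if prediction == 1 and window[0] <= day <= end_date
    )
    diseases = sorted({disease for _, disease, _ in records})
    return {
        disease: {
            "data": [
                {"date": day.isoformat(), "count": counts[(disease, day)]}
                for day in window
            ]
        }
        for disease in diseases
    }


records = [
    (date(2024, 11, 3), "Dengue", 1),
    (date(2024, 11, 3), "Dengue", 1),
    (date(2024, 11, 4), "Dengue", 0),
    (date(2024, 11, 5), "Malaria", 1),
    (date(2024, 10, 1), "Malaria", 1),  # outside the 7-day window
]
trend = sketch_trend_data(records, end_date=date(2024, 11, 5))
print(trend["Dengue"]["data"][-3:])
```

Days with no cases are emitted with a count of 0 rather than omitted, so every disease has the same number of bars when the series is charted.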

In [None]:
# Get 7-day trend data
trend_data = alert_system.get_trend_data(days=7)

print("üì± Mobile App - 7-Day Trend Data")
print("=" * 80)

for disease_name, disease_data in list(trend_data.items())[:3]:
    print(f"\n{disease_name}:")
    for day_data in disease_data['data'][-3:]:
        print(f"  {day_data['date']}: {day_data['count']} cases")

# Save to JSON
with open('../outputs/visualizations/trend_data.json', 'w') as f:
    json.dump(trend_data, f, indent=2, default=str)

print("\n✓ Trend data saved: outputs/visualizations/trend_data.json")

### 7.4 Visualize Trends

In [None]:
# Visualize 7-day trends
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
fig.suptitle('7-Day Disease Trends for Mobile App', fontsize=16, fontweight='bold')

axes = axes.flatten()

for i, (disease_name, disease_data) in enumerate(list(trend_data.items())[:6]):
    dates = [d['date'] for d in disease_data['data']]
    counts = [d['count'] for d in disease_data['data']]
    
    axes[i].bar(range(len(dates)), counts, color='#3498db', alpha=0.7)
    axes[i].set_title(disease_name, fontweight='bold')
    axes[i].set_xlabel('Day')
    axes[i].set_ylabel('Cases')
    axes[i].grid(axis='y', alpha=0.3)
    
    # Add total
    total = sum(counts)
    axes[i].text(0.5, 0.95, f'Total: {total}', 
                transform=axes[i].transAxes,
                ha='center', va='top',
                bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.savefig('../outputs/visualizations/7day_trends.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Trend visualization saved: outputs/visualizations/7day_trends.png")

## 8. Summary and Next Steps

In [None]:
print("\n" + "="*80)
print("üìã EPIWATCH PROJECT SUMMARY")
print("="*80)

print("\n‚úÖ COMPLETED:")
print("  1. ‚úì Data loading and preprocessing")
print("  2. ‚úì Custom neural network training (from scratch)")
print("  3. ‚úì Pre-trained model fine-tuning (DistilBERT)")
print("  4. ‚úì Model evaluation and comparison")
print("  5. ‚úì Anomaly detection and alert generation")
print("  6. ‚úì Mobile app output generation")

print("\nüì± MOBILE APP INTEGRATION:")
print("  ‚Ä¢ Alert Feed JSON: outputs/alerts/mobile_alerts.json")
print("  ‚Ä¢ Map Data JSON: outputs/visualizations/map_data.json")
print("  ‚Ä¢ Trend Data JSON: outputs/visualizations/trend_data.json")
print("  ‚Ä¢ Visualizations: outputs/visualizations/")

print("\nüöÄ NEXT STEPS:")
print("  1. Train remaining pre-trained models (XLM-RoBERTa, mBERT, MuRIL)")
print("     ‚Üí Run: python src/models/train_all.py")
print("  2. Compare all 5 models and select the best")
print("  3. Deploy the best model with FastAPI")
print("     ‚Üí Run: uvicorn src.api.main:app --reload")
print("  4. Connect mobile app to API endpoints")
print("  5. Deploy to production (AWS/Azure/GCP)")

print("\nüåç SDG IMPACT:")
print("  ‚Ä¢ SDG 3: Faster epidemic response ‚Üí Reduced mortality")
print("  ‚Ä¢ SDG 9: Innovative AI infrastructure for public health")
print("  ‚Ä¢ SDG 10: Focusing on low-resource regions")

print("\n" + "="*80)
print("‚ú® EpiWatch: Saving lives through AI-powered early detection ‚ú®")
print("="*80 + "\n")