# ML Model for Predicting Disease Outbreak Risk (Multiple Classifiers)

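This notebook trains and compares classifiers on the refined ASHA worker dataset, which combines structured fields with free-text community notes. It tunes RandomForest and LightGBM with cross-validated grid search, evaluates the better of the two on a held-out test set, and saves the resulting pipeline components.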

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import pickle

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from scipy.sparse import hstack

# Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from lightgbm import LGBMClassifier

import warnings
warnings.filterwarnings('ignore')

In [6]:
# Load the dataset created by the updated data_synthesizer.py
df = pd.read_csv('truly_realistic_dataset.csv')

# Drop location identifiers as they are not predictive features for a general model
df = df.drop(columns=['State', 'District'])

print("Successfully loaded the new realistic dataset with CommunityNotes.")

Successfully loaded the new realistic dataset with CommunityNotes.


### 3. Preprocess Data (Structured + NLP)

In [7]:
X = df.drop('OutbreakStatus', axis=1)
y = df['OutbreakStatus']

# Identify column types
categorical_features = X.select_dtypes(include=['object']).drop(columns=['CommunityNotes']).columns.tolist()
boolean_features = X.select_dtypes(include=['bool']).columns.tolist()
text_feature = 'CommunityNotes'

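# Create the preprocessor for structured data (categorical & boolean).
# Caveat: with drop='first', a category unseen during fit is encoded as all zeros,
# which is indistinguishable from the dropped first category.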
structured_preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore', drop='first'), categorical_features),
        ('bool', 'passthrough', boolean_features)
    ], 
    remainder='drop'
)

# Create the preprocessor for the text data
text_preprocessor = TfidfVectorizer(stop_words='english', max_features=50, ngram_range=(1,2))

# Encode the target variable
le = LabelEncoder()
y_encoded = le.fit_transform(y)

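# Split before fitting any preprocessor so the one-hot categories and the
# TF-IDF vocabulary are learned from the training rows only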
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded)

# Apply preprocessing to each part and combine
X_train_structured = structured_preprocessor.fit_transform(X_train)
X_test_structured = structured_preprocessor.transform(X_test)

X_train_text = text_preprocessor.fit_transform(X_train[text_feature])
X_test_text = text_preprocessor.transform(X_test[text_feature])

X_train_final = hstack([X_train_structured, X_train_text]).tocsr()
X_test_final = hstack([X_test_structured, X_test_text]).tocsr()

print("Preprocessing with NLP complete.")
print("Final training data shape:", X_train_final.shape)

Preprocessing with NLP complete.
Final training data shape: (4000, 59)
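
The fit-on-train, transform-on-test pattern used above can be illustrated on a tiny example. This is a minimal sketch, not the notebook's actual data: the rows in `train` and `test` are invented, and only the column names are borrowed from the dataset. It shows that the test matrix gets the same columns as the training matrix and that words seen only in the test notes are ignored.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows; column names mirror the notebook, values are invented.
train = pd.DataFrame({
    'WaterSourceType': ['HandPump', 'Well', 'Tap', 'Well'],
    'Fever': [True, False, True, False],
    'CommunityNotes': ['high fever reported', 'no complaints',
                       'fever and vomiting', 'water looks clean'],
})
test = pd.DataFrame({
    'WaterSourceType': ['Tap'],
    'Fever': [True],
    'CommunityNotes': ['fever after flooding'],
})

structured = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['WaterSourceType']),
        ('bool', 'passthrough', ['Fever']),
    ],
    remainder='drop',
)
text = TfidfVectorizer()

# fit_transform learns categories and vocabulary from the training rows only.
train_matrix = hstack([structured.fit_transform(train),
                       text.fit_transform(train['CommunityNotes'])]).tocsr()

# transform reuses what was learned; nothing is re-fit on the test rows.
test_matrix = hstack([structured.transform(test),
                      text.transform(test['CommunityNotes'])]).tocsr()

print('train shape:', train_matrix.shape)
print('test shape: ', test_matrix.shape)
print("'flooding' in vocabulary:", 'flooding' in text.vocabulary_)
```

Because the column count is fixed at fit time, the same fitted preprocessors have to be reused whenever new data is scored.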


### 4. Hyperparameter Tuning with GridSearchCV

We tune the two most promising models, RandomForest and LightGBM. GridSearchCV fits every combination in each parameter grid with 5-fold cross-validation on the training set and keeps the combination with the highest mean accuracy.
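
To make the cost of an exhaustive search concrete, the candidates can be enumerated by hand. The sketch below uses only the standard library and copies the RandomForest grid from the tuning cell; `cv_folds`, `names`, and `candidates` are illustrative names, not part of the notebook.

```python
from itertools import product

# Same values as param_grid_rf in the RandomForest tuning cell.
param_grid_rf = {
    'n_estimators': [100, 150],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
}
cv_folds = 5  # matches cv=5 in the GridSearchCV call

# Each candidate is one combination holding a single value per parameter.
names = list(param_grid_rf)
candidates = [dict(zip(names, values))
              for values in product(*param_grid_rf.values())]

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds  # the final refit on all training rows is extra

print(n_candidates, 'candidates,', n_fits, 'cross-validation fits')
print('first candidate:', candidates[0])
```

The count grows multiplicatively with every value added to a list, which is why both grids here are kept small.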

In [8]:
# --- RandomForest Tuning ---
print("--- Tuning RandomForestClassifier ---")
param_grid_rf = {
    'n_estimators': [100, 150],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}
grid_rf = GridSearchCV(RandomForestClassifier(random_state=42, n_jobs=-1), param_grid_rf, cv=5, scoring='accuracy', verbose=1)
grid_rf.fit(X_train_final, y_train)

print("\nBest RandomForest Parameters:", grid_rf.best_params_)
print("Best RandomForest CV Score:", grid_rf.best_score_)

# --- LightGBM Tuning ---
print("\n--- Tuning LGBMClassifier ---")
param_grid_lgbm = {
    'n_estimators': [100, 150],
    'num_leaves': [20, 31, 40],
    'learning_rate': [0.1, 0.05],
    'max_depth': [-1, 10]
}
grid_lgbm = GridSearchCV(LGBMClassifier(random_state=42, verbosity=-1), param_grid_lgbm, cv=5, scoring='accuracy', verbose=1)
grid_lgbm.fit(X_train_final, y_train)

print("\nBest LightGBM Parameters:", grid_lgbm.best_params_)
print("Best LightGBM CV Score:", grid_lgbm.best_score_)

--- Tuning RandomForestClassifier ---
Fitting 5 folds for each of 24 candidates, totalling 120 fits

Best RandomForest Parameters: {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 100}
Best RandomForest CV Score: 0.79325

--- Tuning LGBMClassifier ---
Fitting 5 folds for each of 24 candidates, totalling 120 fits

Best LightGBM Parameters: {'learning_rate': 0.05, 'max_depth': -1, 'n_estimators': 100, 'num_leaves': 20}
Best LightGBM CV Score: 0.7987500000000001


In [9]:
# Compare the best scores from the tuned models
if grid_rf.best_score_ > grid_lgbm.best_score_:
    best_model = grid_rf.best_estimator_
    best_model_name = 'Tuned RandomForest'
else:
    best_model = grid_lgbm.best_estimator_
    best_model_name = 'Tuned LightGBM'

print(f"--- The best overall model is: {best_model_name} ---")

# Evaluate the final model on the unseen test data
y_pred = best_model.predict(X_test_final)
final_accuracy = accuracy_score(y_test, y_pred)

print(f"\nFinal Test Set Accuracy: {final_accuracy:.4f}")
print("\nClassification Report on Test Set:")
print(classification_report(y_test, y_pred, target_names=le.classes_))

--- The best overall model is: Tuned LightGBM ---

Final Test Set Accuracy: 0.8310

Classification Report on Test Set:
              precision    recall  f1-score   support

   High_Risk       0.80      0.80      0.80       428
    Low_Risk       0.85      0.85      0.85       572

    accuracy                           0.83      1000
   macro avg       0.83      0.83      0.83      1000
weighted avg       0.83      0.83      0.83      1000
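
The two average rows in the report can be reproduced approximately from the per-class rows. The sketch below copies the F1 scores and supports from the report above; since those inputs are already rounded to two decimals, the results only approximate the unrounded values. It also computes the accuracy of always predicting the majority class, as a reference point for the 0.8310 test accuracy.

```python
# Per-class figures copied from the classification report (rounded to 2 decimals).
report = {
    'High_Risk': {'f1': 0.80, 'support': 428},
    'Low_Risk': {'f1': 0.85, 'support': 572},
}
test_accuracy = 0.8310  # from the evaluation output

total = sum(row['support'] for row in report.values())

# Macro average: plain mean over classes, so each class counts equally.
macro_f1 = sum(row['f1'] for row in report.values()) / len(report)

# Weighted average: mean over classes, weighted by the number of test rows.
weighted_f1 = sum(row['f1'] * row['support'] for row in report.values()) / total

# Accuracy obtained by always predicting the larger class.
majority_baseline = max(row['support'] for row in report.values()) / total

print(f'macro F1 (from rounded inputs):    {macro_f1:.4f}')
print(f'weighted F1 (from rounded inputs): {weighted_f1:.4f}')
print(f'majority-class baseline accuracy:  {majority_baseline:.4f}')
print(f'gain over baseline:                {test_accuracy - majority_baseline:.4f}')
```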



### 6. Save the Complete Tuned Pipeline

In [10]:
print(f"Saving the best model '{best_model_name}' and all pipeline components.")

# Save all components, one pickle file per object
with open('best_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)
with open('structured_preprocessor.pkl', 'wb') as f:
    pickle.dump(structured_preprocessor, f)
with open('text_preprocessor.pkl', 'wb') as f:
    pickle.dump(text_preprocessor, f)
with open('label_encoder.pkl', 'wb') as f:
    pickle.dump(le, f)

print("\nAll pipeline components have been saved successfully.")

Saving the best model 'Tuned LightGBM' and all pipeline components.

All pipeline components have been saved successfully.
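
An alternative to saving separate files is to wrap the preprocessing and the model in one scikit-learn `Pipeline`, so a single object is fitted, pickled, and loaded. The following is a sketch under stated assumptions rather than the notebook's method: the toy rows are invented, `LogisticRegression` stands in for the tuned LightGBM model, and the names `toy`, `toy_labels`, `features`, `pipeline`, and `restored` are hypothetical.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows; column names mirror the notebook, values are invented.
toy = pd.DataFrame({
    'WaterSourceType': ['HandPump', 'Well', 'Tap', 'Well', 'HandPump', 'Tap'],
    'Fever': [True, False, True, False, True, False],
    'CommunityNotes': ['high fever reported', 'no complaints',
                       'fever and vomiting', 'water looks clean',
                       'many children with fever', 'nothing unusual'],
})
toy_labels = ['High_Risk', 'Low_Risk', 'High_Risk',
              'Low_Risk', 'High_Risk', 'Low_Risk']

features = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['WaterSourceType']),
        ('bool', 'passthrough', ['Fever']),
        # A string (not a list) selects a 1-D column, which TfidfVectorizer expects.
        ('text', TfidfVectorizer(), 'CommunityNotes'),
    ],
    remainder='drop',
)

# Stand-in classifier; any scikit-learn compatible estimator could take its place.
pipeline = Pipeline([('features', features), ('model', LogisticRegression())])
pipeline.fit(toy, toy_labels)

# One object to serialize; an in-memory round trip stands in for a file on disk.
restored = pickle.loads(pickle.dumps(pipeline))
predictions = restored.predict(toy)
print(list(predictions))
```

With this layout the raw DataFrame is passed straight to `predict`, and string labels can be used directly, so the manual `hstack` step and the separate label encoder are not needed at prediction time.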


### 7. Example: Loading and Using the Full Pipeline

In [12]:
# Load all saved objects (the context managers close each file after reading)
with open('best_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
with open('structured_preprocessor.pkl', 'rb') as f:
    loaded_structured_preprocessor = pickle.load(f)
with open('text_preprocessor.pkl', 'rb') as f:
    loaded_text_preprocessor = pickle.load(f)
with open('label_encoder.pkl', 'rb') as f:
    loaded_label_encoder = pickle.load(f)
print("All pipeline components loaded successfully.")

# Create a sample of new data; edit these field values to try other scenarios
sample_data = pd.DataFrame({
    'WaterSourceType': ['HandPump'],
    'SanitationLevels': ['Poor'],
    'Fever': [True],
    'Vomiting': [False],
    'AbdominalPain': [True],
    'RecentTravelHistory': [False],
    'CommunityNotes': ["High Fever"]
})

print("\nNew sample data:")
display(sample_data)

# Transform the new data using the correct preprocessors
transformed_structured = loaded_structured_preprocessor.transform(sample_data)
transformed_text = loaded_text_preprocessor.transform(sample_data['CommunityNotes'])

# Combine the transformed parts
transformed_sample = hstack([transformed_structured, transformed_text]).tocsr()

# Make a prediction
prediction_encoded = loaded_model.predict(transformed_sample)
prediction = loaded_label_encoder.inverse_transform(prediction_encoded)

print(f"\nModel Prediction: {prediction[0]} (Encoded: {prediction_encoded[0]})")

All pipeline components loaded successfully.

New sample data:


|   | WaterSourceType | SanitationLevels | Fever | Vomiting | AbdominalPain | RecentTravelHistory | CommunityNotes |
|---|---|---|---|---|---|---|---|
| 0 | HandPump | Poor | True | False | True | False | High Fever |



Model Prediction: High_Risk (Encoded: 0)
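
The encoded value 0 printed next to `High_Risk` follows from how `LabelEncoder` assigns codes: it stores the sorted unique labels in `classes_` and encodes each label as its position in that list. The pure-Python sketch below mirrors that behaviour for illustration; the `labels` list is invented, and only the two class names come from the notebook.

```python
# Hypothetical label column; only the two class names come from the notebook.
labels = ['Low_Risk', 'High_Risk', 'Low_Risk', 'High_Risk', 'Low_Risk']

# Mirror of LabelEncoder: sorted unique labels, each encoded as its index.
classes = sorted(set(labels))
to_code = {label: code for code, label in enumerate(classes)}

encoded = [to_code[label] for label in labels]   # analogous to fit_transform
decoded = [classes[code] for code in encoded]    # analogous to inverse_transform

print('classes:', classes)
print('mapping:', to_code)
print('encoded:', encoded)
```

One consequence worth keeping in mind is that class 1 is `Low_Risk`, so any metric that treats label 1 as the positive class would be describing the low-risk class rather than the high-risk one.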
