**Disease Outbreak Prediction Using Machine Learning for Public Health**

**Step 1: Data Collection**

This synthetic dataset includes the following columns:

* Temperature (°C) – Higher temperatures may affect disease spread.
* Humidity (%) – High humidity levels can affect transmission.
* Population Density (people per sq. km) – Higher density often correlates with faster spread.
* Mobility Change (%) – Changes in movement, such as restrictions, affect outbreak levels.
* Outbreak Label (0 or 1) – 1 indicates an outbreak, 0 indicates no outbreak.


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np

# Set a random seed for reproducibility
np.random.seed(42)

# Number of samples
num_samples = 1000

# Generate synthetic feature data
temperature = np.random.uniform(10, 40, num_samples)  # Temperature in Celsius
humidity = np.random.uniform(20, 90, num_samples)  # Humidity in %
population_density = np.random.uniform(50, 5000, num_samples)  # People per sq.km
mobility_change = np.random.uniform(-50, 50, num_samples)  # Percentage change in movement

# Generate outbreak labels with a fixed rule: an outbreak requires warm, humid,
# densely populated conditions together with reduced mobility
outbreak = (temperature > 25) & (humidity > 50) & (population_density > 1000) & (mobility_change < 0)
outbreak = outbreak.astype(int)  # Convert Boolean to Integer (0 = No Outbreak, 1 = Outbreak)

# Create a DataFrame
data = pd.DataFrame({
    'Temperature': temperature,
    'Humidity': humidity,
    'Population_Density': population_density,
    'Mobility_Change': mobility_change,
    'Outbreak': outbreak
})

# Display first few rows of the dataset
data.head()


| | Temperature | Humidity | Population_Density | Mobility_Change | Outbreak |
|---|---|---|---|---|---|
| 0 | 21.236204 | 32.959305 | 1345.443134 | 17.270299 | 0 |
| 1 | 38.521429 | 57.933066 | 1272.545055 | 29.66814 | 0 |
| 2 | 31.959818 | 81.106209 | 4535.960174 | -24.95321 | 1 |
| 3 | 27.959755 | 71.255742 | 1285.253689 | 12.48741 | 0 |
| 4 | 14.680559 | 76.45928 | 1396.151144 | 7.174598 | 0 |


**Understanding the Generated Data**

Each row represents a region or time period, with:

1. Environmental factors (temperature, humidity).
2. Social factors (population density, mobility changes).
3. Outbreak label (1 = Outbreak, 0 = No Outbreak).
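
Because the label is a conjunction of four threshold tests on independent uniform features, the expected share of outbreak rows can be worked out directly. The sketch below compares that analytic rate with a simulated one. The generator seed and the sample size are arbitrary choices for illustration, not values from the notebook.

```python
import numpy as np

# Probability that each uniform feature passes its threshold
p_temp = (40 - 25) / (40 - 10)          # temperature > 25 on U(10, 40)
p_hum = (90 - 50) / (90 - 20)           # humidity > 50 on U(20, 90)
p_dens = (5000 - 1000) / (5000 - 50)    # density > 1000 on U(50, 5000)
p_mob = (0 - (-50)) / (50 - (-50))      # mobility change < 0 on U(-50, 50)

# The features are drawn independently, so the probabilities multiply
analytic_rate = p_temp * p_hum * p_dens * p_mob

# Empirical check on a large simulated sample (seed chosen arbitrarily)
rng = np.random.default_rng(0)
n = 200_000
simulated = (
    (rng.uniform(10, 40, n) > 25)
    & (rng.uniform(20, 90, n) > 50)
    & (rng.uniform(50, 5000, n) > 1000)
    & (rng.uniform(-50, 50, n) < 0)
)
empirical_rate = simulated.mean()

print(f"analytic: {analytic_rate:.4f}, simulated: {empirical_rate:.4f}")
```

Roughly 11.5% of rows are expected to be outbreaks, so the classes are imbalanced and accuracy on its own can look better than the model really is.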

Now, let's preprocess the data.

**Step 2: Data Preprocessing**

**Why Preprocessing?**
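* Many machine learning models, especially distance-based and gradient-based ones, work best with normalized (scaled) data. Tree-based models such as Random Forest are largely insensitive to feature scaling, so here the step mainly keeps the workflow general.
* Splitting the dataset into training (80%) and testing (20%) sets lets us evaluate the model on data it has not seen during training.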

**Preprocessing Steps**


* Extract the features and the target variable (the Outbreak column is the target).
* Normalize the data (scale each feature to the range 0 to 1).
* Split into training and testing sets (80% training, 20% testing).
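
Min-max normalization maps each value x in a column to (x - min) / (max - min). As a small illustration, the sketch below applies that formula by hand and checks it against scikit-learn's `MinMaxScaler`. The temperature readings are made-up values, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical temperature readings in °C (illustrative values only)
temps = np.array([[10.0], [25.0], [40.0], [17.5]])

# Manual min-max scaling: (x - min) / (max - min), computed per column
col_min = temps.min(axis=0)
col_max = temps.max(axis=0)
manual = (temps - col_min) / (col_max - col_min)

# The same transformation through scikit-learn
scaled = MinMaxScaler().fit_transform(temps)

print(manual.ravel())
print(np.allclose(manual, scaled))
```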








**Preprocessing Results**


* 800 samples for training the model.
* 200 samples for testing the model.
*  Features are normalized between 0 and 1.

Now, let's train the model using a Random Forest classifier.

**Step 3: Model Training**

**Why Random Forest?**


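* It works well on structured (tabular) data such as this dataset, with little tuning.
* Averaging many trees makes it less prone to overfitting than a single decision tree.
* It can be adapted to imbalanced datasets, for example through the `class_weight` parameter.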

**Training the Model:**

We'll use 100 decision trees (n_estimators=100).


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Define features (X) and target variable (y)
X = data[['Temperature', 'Humidity', 'Population_Density', 'Mobility_Change']]
y = data['Outbreak']

# Normalize the features using MinMaxScaler (scales values between 0 and 1)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Split the data into 80% training and 20% testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Print dataset shapes
X_train.shape, X_test.shape, y_train.shape, y_test.shape


((800, 4), (200, 4), (800,), (200,))
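
The preprocessing cell above fits `MinMaxScaler` on all 1,000 rows before splitting, so the test rows influence the scaling ranges. One way to avoid that is to split first and let a pipeline fit the scaler on the training rows only. The following is a minimal sketch of that alternative, not the notebook's method. `make_data` is a hypothetical helper that regenerates a comparable synthetic dataset so the block runs on its own, and `stratify=y` is an added assumption that keeps the outbreak share similar in both splits.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler


def make_data(num_samples=1000, seed=42):
    """Hypothetical helper: builds a dataset with the same rule as Step 1."""
    rng = np.random.default_rng(seed)  # rows differ from np.random.seed(42)
    X = pd.DataFrame({
        "Temperature": rng.uniform(10, 40, num_samples),
        "Humidity": rng.uniform(20, 90, num_samples),
        "Population_Density": rng.uniform(50, 5000, num_samples),
        "Mobility_Change": rng.uniform(-50, 50, num_samples),
    })
    y = (
        (X["Temperature"] > 25)
        & (X["Humidity"] > 50)
        & (X["Population_Density"] > 1000)
        & (X["Mobility_Change"] < 0)
    ).astype(int)
    return X, y


X, y = make_data()

# Split the raw features first, so no test row is seen during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits the scaler on X_train only, then trains the forest
model = make_pipeline(
    MinMaxScaler(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(X_train.shape, X_test.shape, round(test_accuracy, 3))
```

Because the pipeline stores the fitted scaler, calling `model.predict` on raw feature values applies the same scaling automatically.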

The data is now scaled and split into 800 training rows and 200 test rows, each with 4 features.
Next, the Random Forest model is trained and its performance is evaluated on the test data.

**Step 4: Model Evaluation**

**Why Evaluate?**


* To measure how well the model predicts new, unseen data.
* To check accuracy, precision, recall, and F1-score.

**Evaluation Metrics:**
* Accuracy: The fraction of all predictions that are correct.
* Precision: Of all predicted outbreaks, the fraction that were actual outbreaks.
* Recall: Of all actual outbreaks, the fraction the model detected.
* F1-score: The harmonic mean of precision and recall.
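
To make these definitions concrete, the sketch below computes each metric by hand from confusion-matrix counts and compares the results with scikit-learn's functions. The ten labels are made-up values for illustration, not model output from this notebook.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical true and predicted labels (1 = outbreak, 0 = no outbreak)
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct / all predictions
precision = tp / (tp + fp)                   # correct among predicted outbreaks
recall = tp / (tp + fn)                      # detected among actual outbreaks
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy, precision, recall, f1)
print(
    accuracy_score(y_true, y_pred),
    precision_score(y_true, y_pred),
    recall_score(y_true, y_pred),
    f1_score(y_true, y_pred),
)
```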






In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initialize Random Forest Classifier with 100 trees
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model on the training set
rf_model.fit(X_train, y_train)
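
Once a forest is fitted, its `feature_importances_` attribute gives a rough view of which inputs drive the splits. The sketch below regenerates a dataset with the same rule as Step 1 and fits on unscaled features. Both are assumptions made to keep the block self-contained: the NumPy `Generator` seed is arbitrary, so the rows differ from the notebook's.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Rebuild a comparable synthetic dataset (seed is an arbitrary choice)
rng = np.random.default_rng(42)
n = 1000
X = pd.DataFrame({
    "Temperature": rng.uniform(10, 40, n),
    "Humidity": rng.uniform(20, 90, n),
    "Population_Density": rng.uniform(50, 5000, n),
    "Mobility_Change": rng.uniform(-50, 50, n),
})
y = (
    (X["Temperature"] > 25)
    & (X["Humidity"] > 50)
    & (X["Population_Density"] > 1000)
    & (X["Mobility_Change"] < 0)
).astype(int)

# Fit the same kind of model as in the training cell
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances, one per feature, normalized to sum to 1
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Since the labeling rule uses all four features, each one should receive a non-trivial share, though the exact ranking depends on the sampled data.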


**Evaluation Results**
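* 100% accuracy
* 100% precision, recall, and F1-score

The model predicts every test sample correctly. This is most likely because the labels were generated by a fixed threshold rule on the four features, with no noise, so the model only has to recover that rule. On real-world data, which is noisy and incomplete, scores would typically be lower.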
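
One way to see how much the clean labels flatter the model is to flip a fraction of them and compare cross-validated accuracy. The sketch below does this under stated assumptions: the 10% flip rate, the 5-fold setup, and the generator seed are illustrative choices, and the data is regenerated with the Step 1 rule rather than taken from the notebook's DataFrame.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Rebuild a comparable synthetic dataset (seed is an arbitrary choice)
rng = np.random.default_rng(42)
n = 1000
X = np.column_stack([
    rng.uniform(10, 40, n),     # temperature
    rng.uniform(20, 90, n),     # humidity
    rng.uniform(50, 5000, n),   # population density
    rng.uniform(-50, 50, n),    # mobility change
])
y_clean = (
    (X[:, 0] > 25) & (X[:, 1] > 50) & (X[:, 2] > 1000) & (X[:, 3] < 0)
).astype(int)

# Simulate imperfect reporting by flipping about 10% of the labels
flip = rng.random(n) < 0.10
y_noisy = np.where(flip, 1 - y_clean, y_clean)

model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validated accuracy on clean and on noisy labels
clean_acc = cross_val_score(model, X, y_clean, cv=5, scoring="accuracy").mean()
noisy_acc = cross_val_score(model, X, y_noisy, cv=5, scoring="accuracy").mean()

print(f"clean labels: {clean_acc:.3f}, noisy labels: {noisy_acc:.3f}")
```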


**Step 5: Making Predictions**

Let's test the model with new, unseen data to predict an outbreak.

**Example Case**

* Temperature: 30°C
* Humidity: 70%
* Population Density: 1500 people/km²
* Mobility Change: -10%


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Make predictions on the test set
y_pred = rf_model.predict(X_test)

# Compute evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Store results
evaluation_results = {
    "Accuracy": accuracy,
    "Precision": precision,
    "Recall": recall,
    "F1-Score": f1
}

evaluation_results


{'Accuracy': 1.0, 'Precision': 1.0, 'Recall': 1.0, 'F1-Score': 1.0}

In [None]:
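# New data sample for prediction, using the training column names so the
# scaler receives the feature names it was fitted with
new_data = pd.DataFrame(
    [[30, 70, 1500, -10]],
    columns=['Temperature', 'Humidity', 'Population_Density', 'Mobility_Change']
)

# Normalize the new data using the same scaler
new_data_scaled = scaler.transform(new_data)

# Predict the outbreak class (0 or 1)
prediction = rf_model.predict(new_data_scaled)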

# Output result
predicted_outbreak = "Yes" if prediction[0] == 1 else "No"
predicted_outbreak




'Yes'
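
A hard Yes/No answer hides how confident the model is. `predict_proba` returns a probability per class, which for a random forest is the average of the per-tree class probabilities. The sketch below shows this for the example case. It is self-contained and therefore retrains a model on regenerated data (arbitrary seed, scaler and forest wrapped in a pipeline), so its numbers are not the notebook's.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Rebuild a comparable synthetic dataset (seed is an arbitrary choice)
rng = np.random.default_rng(42)
n = 1000
X = pd.DataFrame({
    "Temperature": rng.uniform(10, 40, n),
    "Humidity": rng.uniform(20, 90, n),
    "Population_Density": rng.uniform(50, 5000, n),
    "Mobility_Change": rng.uniform(-50, 50, n),
})
y = (
    (X["Temperature"] > 25)
    & (X["Humidity"] > 50)
    & (X["Population_Density"] > 1000)
    & (X["Mobility_Change"] < 0)
).astype(int)

# Scaler and forest in one pipeline, so raw values can be passed directly
model = make_pipeline(
    MinMaxScaler(),
    RandomForestClassifier(n_estimators=100, random_state=42),
).fit(X, y)

# The example case from Step 5, with matching column names
new_sample = pd.DataFrame([[30, 70, 1500, -10]], columns=X.columns)

# Columns follow model.classes_, i.e. [0, 1]; index 1 is the outbreak class
proba = model.predict_proba(new_sample)[0]
outbreak_probability = proba[1]

print(f"Estimated outbreak probability: {outbreak_probability:.2f}")
```

Reporting the probability lets a public health user choose a decision threshold other than 0.5, for instance a lower one when missing an outbreak is costlier than a false alarm.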