In [None]:
# Load the dataset with the correct filename
import pandas as pd
df = pd.read_csv('symptom_disease_dataset.csv')

# Inspect the first few rows
display(df.head())

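# Inspect columns and data types
# (df.info() prints its summary itself and returns None, so display() also shows a trailing "None")
display(df.info())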

Unnamed: 0,Symptom1,Symptom2,Symptom3,Symptom4,Symptom5,Symptom6,Symptom7,Disease
0,Body Pain,Headache,Fatigue,Chills,Fever,Cough,Sore Throat,Flu
1,Cough,Shortness of Breath,Sweating,Chest Pain,Fatigue,Wheezing,Anxiety,Asthma
2,Fatigue,Fever,Sore Throat,Loss of Smell,Shortness of Breath,Chest Pain,Cough,COVID-19
3,Cough,Fever,Loss of Smell,Chest Pain,Shortness of Breath,Fatigue,Sore Throat,COVID-19
4,Sweating,Anxiety,Wheezing,Fatigue,Cough,Chest Pain,Shortness of Breath,Asthma


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 750 entries, 0 to 749
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Symptom1  750 non-null    object
 1   Symptom2  750 non-null    object
 2   Symptom3  750 non-null    object
 3   Symptom4  750 non-null    object
 4   Symptom5  750 non-null    object
 5   Symptom6  750 non-null    object
 6   Symptom7  750 non-null    object
 7   Disease   750 non-null    object
dtypes: object(8)
memory usage: 47.0+ KB


None

**Reasoning**:
`df.info()` reports 750 non-null entries in every column, so there are no missing values to handle. Confirm this with `isnull().sum()`, then examine the distribution of the target variable and encode it.



In [None]:
from sklearn.preprocessing import LabelEncoder

# Check for missing values
display(df.isnull().sum())

# Examine the distribution of the target variable
display(df['Disease'].value_counts())

# Encode the disease labels
le = LabelEncoder()
df['Disease_Encoded'] = le.fit_transform(df['Disease'])

# Display the first few rows with the new encoded column
display(df.head())

Column,Missing values
Symptom1,0
Symptom2,0
Symptom3,0
Symptom4,0
Symptom5,0
Symptom6,0
Symptom7,0
Disease,0


Disease,count
Allergy,104
COVID-19,102
Heart Disease,101
Asthma,96
Food Poisoning,91
Depression,89
Diabetes,89
Flu,78


Unnamed: 0,Symptom1,Symptom2,Symptom3,Symptom4,Symptom5,Symptom6,Symptom7,Disease,Disease_Encoded
0,Body Pain,Headache,Fatigue,Chills,Fever,Cough,Sore Throat,Flu,5
1,Cough,Shortness of Breath,Sweating,Chest Pain,Fatigue,Wheezing,Anxiety,Asthma,1
2,Fatigue,Fever,Sore Throat,Loss of Smell,Shortness of Breath,Chest Pain,Cough,COVID-19,2
3,Cough,Fever,Loss of Smell,Chest Pain,Shortness of Breath,Fatigue,Sore Throat,COVID-19,2
4,Sweating,Anxiety,Wheezing,Fatigue,Cough,Chest Pain,Shortness of Breath,Asthma,1
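
`LabelEncoder` assigns integer codes in sorted order of the class labels, which is why Flu maps to 5 and Asthma to 1 in the output above. A minimal sketch of that mapping, using the eight disease names from the value counts (the names `le_demo` and `mapping` are illustrative and not part of the notebook):

```python
from sklearn.preprocessing import LabelEncoder

# The eight disease labels that appear in the dataset's value counts
diseases = [
    "Allergy", "COVID-19", "Heart Disease", "Asthma",
    "Food Poisoning", "Depression", "Diabetes", "Flu",
]

# Fitting sorts the unique labels; the position in classes_ is the code
le_demo = LabelEncoder().fit(diseases)
mapping = {label: code for code, label in enumerate(le_demo.classes_)}

for label, code in mapping.items():
    print(f"{code}: {label}")
```

Printing this mapping once next to the encoded column makes the integer codes easier to read back later.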


**Reasoning**:
Transform the symptom columns into a binary format: collect the unique symptoms across the seven symptom columns, then create one indicator column per symptom (1 if the symptom appears anywhere in the row, 0 otherwise). Then perform basic exploratory data analysis by checking the frequency of each symptom.



In [None]:
import numpy as np

# Get all unique symptoms from the symptom columns
symptom_cols = ['Symptom1', 'Symptom2', 'Symptom3', 'Symptom4', 'Symptom5', 'Symptom6', 'Symptom7']
all_symptoms = pd.unique(df[symptom_cols].values.ravel())

# Create a binary representation of symptoms
for symptom in all_symptoms:
    df[symptom] = df[symptom_cols].apply(lambda row: 1 if symptom in row.values else 0, axis=1)

# Drop the original symptom columns
df = df.drop(columns=symptom_cols)

# Display the first few rows with the new binary symptom columns
display(df.head())

# Perform EDA on symptom frequency
symptom_frequencies = df[all_symptoms].sum().sort_values(ascending=False)
display(symptom_frequencies)

Unnamed: 0,Disease,Disease_Encoded,Body Pain,Headache,Fatigue,Chills,Fever,Cough,Sore Throat,Shortness of Breath,...,Sleep Problems,Hunger,Thirst,Frequent Urination,Blurred Vision,Infections,Vomiting,Dehydration,Abdominal Pain,Diarrhea
0,Flu,5,1,1,1,1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,Asthma,1,0,0,1,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
2,COVID-19,2,0,0,1,0,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
3,COVID-19,2,0,0,1,0,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
4,Asthma,1,0,0,1,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0


Symptom,Count
Fatigue,555
Cough,380
Shortness of Breath,299
Chest Pain,299
Anxiety,286
Fever,271
Sweating,197
Nausea,192
Sore Throat,180
Weight Loss,178
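
The row-wise `apply` used above runs once per symptom and can be slow on larger data. One possible vectorized alternative is sketched below on a small hypothetical frame (`toy` and its values are made up for illustration). It assumes the same layout as the dataset: several symptom columns holding symptom names, with no missing values.

```python
import pandas as pd

# Hypothetical two-row frame with the same layout as the dataset
toy = pd.DataFrame({
    "Symptom1": ["Fever", "Cough"],
    "Symptom2": ["Cough", "Wheezing"],
    "Disease": ["Flu", "Asthma"],
})
symptom_cols = ["Symptom1", "Symptom2"]

# Stack the symptom columns into one long Series indexed by (row, column)
stacked = toy[symptom_cols].stack()

# One indicator column per symptom, then collapse back to one row per record
binary = pd.get_dummies(stacked, dtype=int).groupby(level=0).max()

# Replace the original symptom columns with the indicator columns
encoded = toy.drop(columns=symptom_cols).join(binary)
print(encoded)
```

Unlike the loop, `get_dummies` orders the indicator columns alphabetically, so the column order would differ from the notebook's output.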


**Reasoning**:
The list saved to `symptom_columns.pkl` must contain exactly the feature names used during training, in the same order, so that prediction inputs line up with the model's features.

## Encode and save artifacts

### Subtask:
Encode the target variable, save the encoder and the list of symptom columns.


**Reasoning**:
Save the LabelEncoder and the list of symptom columns to files using pickle.



In [None]:
import pickle

# Save the trained LabelEncoder object
with open('label_encoder.pkl', 'wb') as f:
    pickle.dump(le, f)

# Extract the names of the binary symptom columns (excluding the target and encoded target)
symptom_columns = df.columns.tolist()
symptom_columns.remove('Disease')
symptom_columns.remove('Disease_Encoded')

# Save the list of symptom column names
with open('symptom_columns.pkl', 'wb') as f:
    pickle.dump(symptom_columns, f)

print("LabelEncoder saved to label_encoder.pkl")
print("Symptom column list saved to symptom_columns.pkl")

LabelEncoder saved to label_encoder.pkl
Symptom column list saved to symptom_columns.pkl
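
Before relying on the saved files, it can help to confirm that the artifacts survive a pickle round trip. The sketch below does this in memory with a small hypothetical encoder and column list (`le_demo` and `columns_demo` are illustrative names), so it does not touch the files written above.

```python
import pickle
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for the notebook's encoder and symptom column list
le_demo = LabelEncoder().fit(["Flu", "Asthma", "COVID-19"])
columns_demo = ["Fever", "Cough", "Loss of Smell"]

# Serialize to bytes and load back, mimicking the save/load cycle
restored_encoder = pickle.loads(pickle.dumps(le_demo))
restored_columns = pickle.loads(pickle.dumps(columns_demo))

# The restored objects should decode and order features exactly as before
print(list(restored_encoder.classes_))
print(restored_columns == columns_demo)
```

Pickled scikit-learn objects are generally expected to be loaded with the same library version that saved them, so recording the version alongside the files is a reasonable precaution.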


## Train and evaluate models

### Subtask:
Split data, train multiple classification models, evaluate their performance using cross-validation and various metrics, and select the best model.


**Reasoning**:
Import necessary libraries for model training, evaluation, and cross-validation, separate features and target, initialize StratifiedKFold, create a dictionary of models, and perform cross-validation for each model.



In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Separate features (symptom columns) and target variable (encoded disease)
X = df[symptom_columns]
y = df['Disease_Encoded']

# Initialize StratifiedKFold cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Create a dictionary of classification models
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(random_state=42, max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42)
}

# Dictionary to store cross-validation results
cv_results = {}

# Iterate through each model and perform cross-validation
for name, model in models.items():
    print(f"Evaluating {name}...")
    # Perform cross-validation with multiple scoring metrics
    cv_scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
    cv_precision = cross_val_score(model, X, y, cv=skf, scoring='precision_weighted')
    cv_recall = cross_val_score(model, X, y, cv=skf, scoring='recall_weighted')
    cv_f1 = cross_val_score(model, X, y, cv=skf, scoring='f1_weighted')

    # Store average scores
    cv_results[name] = {
        'Accuracy': np.mean(cv_scores),
        'Precision': np.mean(cv_precision),
        'Recall': np.mean(cv_recall),
        'F1 Score': np.mean(cv_f1)
    }

    # Print average cross-validation scores
    print(f"{name} Average Scores:")
    print(f"  Accuracy: {cv_results[name]['Accuracy']:.4f}")
    print(f"  Precision: {cv_results[name]['Precision']:.4f}")
    print(f"  Recall: {cv_results[name]['Recall']:.4f}")
    print(f"  F1 Score: {cv_results[name]['F1 Score']:.4f}")
    print("-" * 30)


Evaluating Random Forest...
Random Forest Average Scores:
  Accuracy: 1.0000
  Precision: 1.0000
  Recall: 1.0000
  F1 Score: 1.0000
------------------------------
Evaluating Logistic Regression...
Logistic Regression Average Scores:
  Accuracy: 1.0000
  Precision: 1.0000
  Recall: 1.0000
  F1 Score: 1.0000
------------------------------
Evaluating Naive Bayes...
Naive Bayes Average Scores:
  Accuracy: 1.0000
  Precision: 1.0000
  Recall: 1.0000
  F1 Score: 1.0000
------------------------------
Evaluating K-Nearest Neighbors...
K-Nearest Neighbors Average Scores:
  Accuracy: 1.0000
  Precision: 1.0000
  Recall: 1.0000
  F1 Score: 1.0000
------------------------------
Evaluating Decision Tree...
Decision Tree Average Scores:
  Accuracy: 1.0000
  Precision: 1.0000
  Recall: 1.0000
  F1 Score: 1.0000
------------------------------
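
The loop above calls `cross_val_score` four times per model, which refits each model on every fold once per metric. A possible alternative is `cross_validate`, which computes several metrics from a single set of fits. The sketch below uses synthetic data from `make_classification` as a stand-in for the symptom matrix; `X_demo`, `y_demo`, and `summary` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic three-class data standing in for the binary symptom features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=42
)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Map display names to scikit-learn scorer names
scoring = {
    "Accuracy": "accuracy",
    "Precision": "precision_weighted",
    "Recall": "recall_weighted",
    "F1 Score": "f1_weighted",
}

# One pass over the folds computes all four metrics
results = cross_validate(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=skf, scoring=scoring
)

# Per-metric fold scores are stored under "test_<name>"
summary = {name: results[f"test_{name}"].mean() for name in scoring}
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```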


**Reasoning**:
Determine the best performing model based on average accuracy, train the best model on the entire dataset, and print the name of the best model.



In [None]:
# Determine the best performing model based on average accuracy
# (max() returns the first maximal key, so ties are resolved by dictionary order)
best_model_name = max(cv_results, key=lambda name: cv_results[name]['Accuracy'])
best_model = models[best_model_name]

# Train the best performing model on the entire dataset
best_model.fit(X, y)

# Print the name of the best performing model
print(f"The best performing model based on average accuracy is: {best_model_name}")

The best performing model based on average accuracy is: Random Forest
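
Because every model scored 1.0000, the selection above is decided by tie-breaking rather than by performance. The sketch below, with made-up scores in `cv_results_demo`, shows how `max()` resolves ties and how the full set of tied models could be listed instead.

```python
# Hypothetical cross-validation results with a tie at the top
cv_results_demo = {
    "Random Forest": {"Accuracy": 1.0},
    "Logistic Regression": {"Accuracy": 1.0},
    "Decision Tree": {"Accuracy": 0.98},
}

# max() keeps the first key it sees among equal maxima (insertion order)
best = max(cv_results_demo, key=lambda name: cv_results_demo[name]["Accuracy"])

# Listing every model that reaches the top score makes the tie explicit
top_score = max(scores["Accuracy"] for scores in cv_results_demo.values())
tied = [
    name for name, scores in cv_results_demo.items()
    if scores["Accuracy"] == top_score
]

print(f"Selected by max(): {best}")
print(f"Tied at {top_score:.4f}: {tied}")
```

When several models tie, a secondary criterion such as simplicity or prediction speed may be a more meaningful way to choose between them.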


## Export model and artifacts

### Subtask:
Save the best model, the encoder, and the symptom column list to files.


**Reasoning**:
Save the best model, the label encoder, and the symptom column list to files using pickle.



In [None]:
# Save the best model
with open('best_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)

# Save the LabelEncoder object (already done in a previous step, but included here for completeness)
# with open('label_encoder.pkl', 'wb') as f:
#     pickle.dump(le, f)

# Save the list of symptom column names (already done in a previous step, but included here for completeness)
# with open('symptom_columns.pkl', 'wb') as f:
#     pickle.dump(symptom_columns, f)

print("Best model saved to best_model.pkl")
print("LabelEncoder saved to label_encoder.pkl")
print("Symptom column list saved to symptom_columns.pkl")

Best model saved to best_model.pkl
LabelEncoder saved to label_encoder.pkl
Symptom column list saved to symptom_columns.pkl


## Test model with dynamic input

### Subtask:
Create a cell to test the saved model by taking a list of symptoms as input, converting it to the required format, loading the saved model and encoder, and predicting the disease.


**Reasoning**:
Define a function to predict disease based on a list of symptoms, loading the saved model and artifacts within the function, and then call the function with a sample input.



In [None]:
import pickle
import pandas as pd

def predict_disease(symptoms):
    """
    Predicts disease based on a list of symptoms.

    Args:
        symptoms (list): A list of symptom strings.

    Returns:
        str: The predicted disease string.
    """
    # Load the saved symptom column list
    with open('symptom_columns.pkl', 'rb') as f:
        symptom_columns = pickle.load(f)

    # Load the saved model
    with open('best_model.pkl', 'rb') as f:
        model = pickle.load(f)

    # Load the saved label encoder
    with open('label_encoder.pkl', 'rb') as f:
        encoder = pickle.load(f)

    # Create a DataFrame with a single row and columns for all symptoms, initialized to 0
    input_data = pd.DataFrame(0, index=[0], columns=symptom_columns)

    # Set the columns corresponding to the input symptoms to 1
    for symptom in symptoms:
        if symptom in input_data.columns:
            input_data[symptom] = 1
        else:
            print(f"Warning: Symptom '{symptom}' not found in the training data.")

    # Predict the encoded disease
    predicted_encoded_disease = model.predict(input_data)

    # Convert the encoded prediction back to the original disease string
    predicted_disease = encoder.inverse_transform(predicted_encoded_disease)

    return predicted_disease[0]

# Test the function with multiple diverse symptom combinations
test_cases = [
    ["Fatigue", "Cough", "Fever"],
    ["Chest Pain", "Shortness of Breath", "Sweating"],
    ["Loss of Smell", "Sore Throat", "Fever"],
    ["Nausea", "Vomiting", "Abdominal Pain"],
    ["Blurred Vision", "Increased Thirst", "Frequent Urination"],
    ["Weight Loss", "Fatigue", "Increased Hunger"],
    ["Anxiety", "Sleep Problems", "Lack of Concentration"],
    ["Body Pain", "Chills", "Headache"],
    ["Rash", "Itching", "Swelling"],
    ["Runny Nose", "Sneezing", "Red Eyes"]
]

for symptoms in test_cases:
    predicted_disease = predict_disease(symptoms)
    print(f"Based on the symptoms {symptoms}, the predicted disease is: {predicted_disease}")

Based on the symptoms ['Fatigue', 'Cough', 'Fever'], the predicted disease is: Flu
Based on the symptoms ['Chest Pain', 'Shortness of Breath', 'Sweating'], the predicted disease is: Asthma
Based on the symptoms ['Loss of Smell', 'Sore Throat', 'Fever'], the predicted disease is: COVID-19
Based on the symptoms ['Nausea', 'Vomiting', 'Abdominal Pain'], the predicted disease is: Food Poisoning
Based on the symptoms ['Blurred Vision', 'Increased Thirst', 'Frequent Urination'], the predicted disease is: Diabetes
Based on the symptoms ['Weight Loss', 'Fatigue', 'Increased Hunger'], the predicted disease is: Diabetes
Based on the symptoms ['Anxiety', 'Sleep Problems', 'Lack of Concentration'], the predicted disease is: Depression
Based on the symptoms ['Body Pain', 'Chills', 'Headache'], the predicted disease is: Flu
Based on the symptoms ['Rash', 'Itching', 'Swelling'], the predicted disease is: Allergy
Based on the symptoms ['Runny Nose', 'Sneezing', 'Red Eyes'], the predicted disease is: A

## Test with custom input

**Reasoning**:
Use the `predict_disease` function with a custom list of symptoms to test the model's output.

In [None]:
# Test the function with a custom list of symptoms
custom_symptoms = ['Loss of Smell', 'Sore Throat', 'Fever']
predicted_disease = predict_disease(custom_symptoms)
print(f"Based on the symptoms {custom_symptoms}, the predicted disease is: {predicted_disease}")

Based on the symptoms ['Loss of Smell', 'Sore Throat', 'Fever'], the predicted disease is: COVID-19
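
`predict_disease` matches symptom names exactly and still returns a prediction when some inputs are not recognized. The sketch below shows one way to build the input row that tolerates differences in case and surrounding whitespace and reports unmatched symptoms to the caller. `build_input_row` is a hypothetical helper, not part of the notebook, and `demo_columns` is a shortened stand-in for the saved column list.

```python
import pandas as pd

def build_input_row(symptoms, symptom_columns):
    """Return a one-row binary frame plus the list of unrecognized symptoms."""
    # Map a normalized form of each known column back to its exact name
    lookup = {col.strip().lower(): col for col in symptom_columns}
    row = pd.DataFrame(0, index=[0], columns=symptom_columns)
    unknown = []
    for symptom in symptoms:
        col = lookup.get(symptom.strip().lower())
        if col is None:
            unknown.append(symptom)
        else:
            row.loc[0, col] = 1
    return row, unknown

# Shortened, illustrative column list
demo_columns = ["Fever", "Cough", "Loss of Smell"]
row, unknown = build_input_row(["fever ", "Loss of Smell", "Rash"], demo_columns)

print(row)
print(f"Unrecognized symptoms: {unknown}")
```

Returning the unmatched symptoms lets the caller decide whether a prediction based on only part of the input should be trusted.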


## Provide downloadable links

### Subtask:
Generate code to create downloadable links for the saved model, encoder, and symptom column files.


**Reasoning**:
Import necessary libraries and create a function to generate downloadable links for files in a Colab environment.



In [None]:
from google.colab import files
import base64
from IPython.display import HTML

def create_download_link(filepath, filename):
    """Generates a link to download a file in Google Colab."""
    with open(filepath, 'rb') as f:
        data = f.read()
    bin_str = base64.b64encode(data).decode()
    href = f'<a href="data:application/octet-stream;base64,{bin_str}" download="{filename}">Download {filename}</a>'
    return href

# Generate and display download links for the saved files
model_link = create_download_link('best_model.pkl', 'best_model.pkl')
encoder_link = create_download_link('label_encoder.pkl', 'label_encoder.pkl')
symptom_cols_link = create_download_link('symptom_columns.pkl', 'symptom_columns.pkl')

display(HTML(model_link))
display(HTML(encoder_link))
display(HTML(symptom_cols_link))

## Summary:

### Data Analysis Key Findings

*   The dataset `symptom_disease_dataset.csv` was successfully loaded and contained no missing values.
*   The target variable, `Disease`, was successfully encoded into a numerical format, `Disease_Encoded`, using `LabelEncoder`.
*   The original symptom columns were transformed into a binary format, creating new columns for each unique symptom (1 if present, 0 otherwise).
*   Exploratory Data Analysis on symptom frequencies showed that Fatigue is the most common symptom (555 of 750 records), followed by Cough (380), and Shortness of Breath and Chest Pain (299 each).
*   Five different classification models (Random Forest, Logistic Regression, Naive Bayes, K-Nearest Neighbors, and Decision Tree) were trained and evaluated using 5-fold Stratified K-Fold cross-validation.
*   All evaluated models achieved perfect average scores (1.0000) for Accuracy, Precision, Recall, and F1 Score during cross-validation, indicating the dataset might be perfectly separable.
*   The Random Forest model was selected as the best-performing model based on average accuracy. Since all models scored identically, it was chosen only because `max()` returns the first of the tied entries.
*   The best model (`best_model.pkl`), the trained `LabelEncoder` (`label_encoder.pkl`), and the list of binary symptom column names (`symptom_columns.pkl`) were successfully saved as pickle files.
*   A function was successfully created to predict a disease based on a list of symptoms using the saved artifacts. For the sample input `["Fatigue", "Cough", "Fever"]`, the predicted disease was "Flu".
*   Downloadable links for the saved model, encoder, and symptom column files were successfully generated.

### Insights or Next Steps

*   The perfect scores observed during cross-validation suggest that the dataset might be too simple or synthetic. Further analysis could involve testing the model on a more complex or real-world dataset.
*   While the current model performs perfectly on this dataset, evaluating its robustness to noise or unseen symptom combinations would be beneficial for real-world application.
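
The robustness check suggested above could be sketched as follows: train on clean binary features, then randomly flip a fraction of the symptom indicators at test time and measure accuracy. Everything here is an assumption for illustration. The three symptom profiles are made up, and `accuracy_under_noise` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Made-up data: each of three classes has a fixed profile of 12 binary symptoms
profiles = np.array([
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
])
y = np.repeat([0, 1, 2], 100)
X = profiles[y]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

def accuracy_under_noise(model, X, y, flip_prob, rng):
    """Flip each binary feature with probability flip_prob, then score."""
    flips = rng.random(X.shape) < flip_prob
    X_noisy = np.where(flips, 1 - X, X)
    return float((model.predict(X_noisy) == y).mean())

for flip_prob in (0.0, 0.1, 0.3):
    acc = accuracy_under_noise(model, X_test, y_test, flip_prob, rng)
    print(f"flip_prob={flip_prob:.1f}: accuracy={acc:.3f}")
```

A sharp drop in accuracy at small flip probabilities would suggest the model depends on exact symptom patterns that real patient reports are unlikely to match.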
