# **Disease Prediction System**

This project uses machine learning to predict diseases based on symptoms. The system analyzes patient symptoms and provides disease predictions along with relevant information like descriptions, precautions, medications, recommended workouts, and dietary advice.

# **Table of Contents**

* [Setup and Imports](#setup-and-imports)
* [Data Loading and Exploration](#data-loading-and-exploration)
* [Data Preprocessing](#data-preprocessing)
* [Model Training and Evaluation](#model-training-and-evaluation)
* [Model Selection and Saving](#model-selection-and-saving)
* [Loading Additional Data](#loading-additional-data)
* [Prediction System](#prediction-system)
* [Testing the System](#testing-the-system)

# **Setup and Imports**

In [2]:
# Importing necessary libraries for data manipulation
import pandas as pd
import numpy as np

In [3]:
# Importing machine learning related libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

In [4]:
# Importing pickle for model persistence
import pickle

# **Data Loading and Exploration**

In [5]:
# Loading the training dataset
dataset = pd.read_csv('datasets/Training.csv')

In [6]:
# Displaying the first few rows to understand the data structure
dataset.head()

Unnamed: 0,itching,skin_rash,nodal_skin_eruptions,continuous_sneezing,shivering,chills,joint_pain,stomach_pain,acidity,ulcers_on_tongue,...,blackheads,scurring,skin_peeling,silver_like_dusting,small_dents_in_nails,inflammatory_nails,blister,red_sore_around_nose,yellow_crust_ooze,prognosis
0,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
1,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
2,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
3,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
4,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection


Let's explore the dataset dimensions and the target variable.

In [7]:
# Checking the dataset dimensions
print(f"Dataset shape: {dataset.shape}")

Dataset shape: (4920, 133)


In [8]:
# Number of unique diseases (classes)
num_diseases = len(dataset['prognosis'].unique())
print(f"Number of unique diseases: {num_diseases}")

Number of unique diseases: 41


In [9]:
# List of unique diseases
print("Unique diseases:")
print(dataset['prognosis'].unique())

Unique diseases:
['Fungal infection' 'Allergy' 'GERD' 'Chronic cholestasis' 'Drug Reaction'
 'Peptic ulcer diseae' 'AIDS' 'Diabetes ' 'Gastroenteritis'
 'Bronchial Asthma' 'Hypertension ' 'Migraine' 'Cervical spondylosis'
 'Paralysis (brain hemorrhage)' 'Jaundice' 'Malaria' 'Chicken pox'
 'Dengue' 'Typhoid' 'hepatitis A' 'Hepatitis B' 'Hepatitis C'
 'Hepatitis D' 'Hepatitis E' 'Alcoholic hepatitis' 'Tuberculosis'
 'Common Cold' 'Pneumonia' 'Dimorphic hemmorhoids(piles)' 'Heart attack'
 'Varicose veins' 'Hypothyroidism' 'Hyperthyroidism' 'Hypoglycemia'
 'Osteoarthristis' 'Arthritis' '(vertigo) Paroymsal  Positional Vertigo'
 'Acne' 'Urinary tract infection' 'Psoriasis' 'Impetigo']


# **Data Preprocessing**

In [10]:
# Splitting the data into features (X) and target variable (y)
X = dataset.drop("prognosis", axis=1)
y = dataset["prognosis"]

In [11]:
# Encoding the target labels
label_encoder = LabelEncoder()
label_encoder.fit(y)
y_encoded = label_encoder.transform(y)

In [12]:
# Splitting the data into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.3, random_state=20
)

In [13]:
# Verifying the shapes of training and testing datasets
print(f"Training data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}")
print(f"Training labels shape: {y_train.shape}")
print(f"Testing labels shape: {y_test.shape}")

Training data shape: (3444, 132)
Testing data shape: (1476, 132)
Training labels shape: (3444,)
Testing labels shape: (1476,)
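
The split above is purely random, so the class proportions in the test set can differ from those in the full data. If per-class balance matters for evaluation, `train_test_split` accepts a `stratify` argument. The sketch below uses a small synthetic array to show the effect; `X_toy` and `y_toy` are illustrative names, not part of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 samples, two classes with 10 samples each.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

# stratify=y_toy keeps the class ratio the same in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=20, stratify=y_toy
)

# Count how many samples of each class ended up in each partition.
train_counts = np.bincount(y_tr).tolist()
test_counts = np.bincount(y_te).tolist()
print(train_counts, test_counts)
```

In this notebook the equivalent change would be passing `stratify=y_encoded` to the existing call.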


# **Model Training and Evaluation**

We'll train multiple machine learning models and compare their performance:

In [14]:
# Defining a dictionary of models to evaluate
models = {
    'SVC': SVC(kernel='linear'),
    'RandomForest': RandomForestClassifier(n_estimators=100, random_state=42),
    'GradientBoosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'KNeighbors': KNeighborsClassifier(n_neighbors=5),
    'NaiveBayes': MultinomialNB()
}

In [15]:
# Dictionary to store model performances
model_performances = {}

# Training and evaluating each model
for model_name, model in models.items():
    print(f"\n{'-'*20} {model_name} {'-'*20}")
    
    # Training the model
    model.fit(X_train, y_train)
    
    # Making predictions on test data
    predictions = model.predict(X_test)
    
    # Calculating accuracy
    accuracy = accuracy_score(y_test, predictions)
    model_performances[model_name] = accuracy
    print(f"Accuracy: {accuracy:.4f}")
    
    # Generating confusion matrix
    cm = confusion_matrix(y_test, predictions)
    print(f"Confusion Matrix:\n{cm}")
    
    print(f"{'-'*50}")


-------------------- SVC --------------------
Accuracy: 1.0000
Confusion Matrix:
[[40  0  0 ...  0  0  0]
 [ 0 43  0 ...  0  0  0]
 [ 0  0 28 ...  0  0  0]
 ...
 [ 0  0  0 ... 34  0  0]
 [ 0  0  0 ...  0 41  0]
 [ 0  0  0 ...  0  0 31]]
--------------------------------------------------

-------------------- RandomForest --------------------
Accuracy: 1.0000
Confusion Matrix:
[[40  0  0 ...  0  0  0]
 [ 0 43  0 ...  0  0  0]
 [ 0  0 28 ...  0  0  0]
 ...
 [ 0  0  0 ... 34  0  0]
 [ 0  0  0 ...  0 41  0]
 [ 0  0  0 ...  0  0 31]]
--------------------------------------------------

-------------------- GradientBoosting --------------------
Accuracy: 1.0000
Confusion Matrix:
[[40  0  0 ...  0  0  0]
 [ 0 43  0 ...  0  0  0]
 [ 0  0 28 ...  0  0  0]
 ...
 [ 0  0  0 ... 34  0  0]
 [ 0  0  0 ...  0 41  0]
 [ 0  0  0 ...  0  0 31]]
--------------------------------------------------

-------------------- KNeighbors --------------------
Accuracy: 1.0000
Confusion Matrix:
[[40  0  0 ...  0  0  

In [16]:
# Identifying the best performing model
best_model = max(model_performances, key=model_performances.get)
print(f"\nBest performing model: {best_model} with accuracy {model_performances[best_model]:.4f}")


Best performing model: SVC with accuracy 1.0000
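
Several models report the same accuracy, so it helps to know how `max` resolves the tie. The sketch below reuses the accuracies printed above for the four models whose output is shown; `tied_models` is an illustrative addition.

```python
# Accuracies as printed above for the four models whose output is shown.
model_performances = {
    'SVC': 1.0,
    'RandomForest': 1.0,
    'GradientBoosting': 1.0,
    'KNeighbors': 1.0,
}

# max() returns the first key with the highest value, so a tie is
# resolved by dictionary insertion order rather than by model quality.
best_model = max(model_performances, key=model_performances.get)

# Listing every model that shares the top score makes the tie explicit.
top_score = max(model_performances.values())
tied_models = [name for name, acc in model_performances.items() if acc == top_score]
print(best_model, tied_models)
```

When several models tie, a secondary criterion such as training time or model size may be a more meaningful tie-breaker than insertion order.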


# **Model Selection and Saving**

Based on the evaluation results, we select the Support Vector Machine (SVC) model:

In [17]:
# Training the final SVC model on the complete training dataset
final_model = SVC(kernel='linear')
final_model.fit(X_train, y_train)

In [18]:
# Verifying the model accuracy on the test set
y_pred = final_model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)
print(f"Final model accuracy: {final_accuracy:.4f}")

Final model accuracy: 1.0000
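
A perfect score is worth a sanity check: if a dataset contains repeated rows, a random split can place identical samples in both the training and the test partition. The sketch below shows one way to count exact duplicates with pandas. `toy` is a hypothetical stand-in for `dataset`; whether the real data contains duplicates is not established here.

```python
import pandas as pd

# Hypothetical miniature of the training table: rows 0 and 1 are identical.
toy = pd.DataFrame({
    'itching': [1, 1, 0, 1],
    'skin_rash': [1, 1, 1, 0],
    'prognosis': ['Fungal infection', 'Fungal infection', 'Allergy', 'Fungal infection'],
})

# duplicated() marks rows that repeat an earlier row exactly.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
unique_rows = toy.drop_duplicates()
print(n_duplicates, len(unique_rows))
```

Running the same two calls on `dataset` would show how many distinct symptom patterns the models actually see.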


In [19]:
# Saving the trained model to disk
model_filename = 'models/disease_prediction_model.pkl'
with open(model_filename, 'wb') as model_file:
    pickle.dump(final_model, model_file)
print(f"Model saved to {model_filename}")

Model saved to models/disease_prediction_model.pkl


In [20]:
# Quick validation - reloading the model and testing it on a sample
with open(model_filename, 'rb') as model_file:
    loaded_model = pickle.load(model_file)

In [21]:
# Testing with first sample
sample_index = 0
predicted_disease_index = loaded_model.predict(X_test.iloc[sample_index].values.reshape(1, -1))[0]
print(f"Sample {sample_index} - Predicted disease index: {predicted_disease_index}, Actual disease index: {y_test[sample_index]}")

Sample 0 - Predicted disease index: 40, Actual disease index: 40




In [22]:
# Testing with another sample
sample_index = 100
predicted_disease_index = loaded_model.predict(X_test.iloc[sample_index].values.reshape(1, -1))[0]
print(f"Sample {sample_index} - Predicted disease index: {predicted_disease_index}, Actual disease index: {y_test[sample_index]}")

Sample 100 - Predicted disease index: 39, Actual disease index: 39




# **Loading Additional Data**

We'll load supplementary datasets containing disease-related information:

In [23]:
# Loading datasets with disease information
disease_descriptions = pd.read_csv('datasets/description.csv')
disease_precautions = pd.read_csv('datasets/precautions_df.csv')
disease_medications = pd.read_csv('datasets/medications.csv')
disease_diets = pd.read_csv('datasets/diets.csv')
disease_workouts = pd.read_csv('datasets/workout_df.csv')
symptom_descriptions = pd.read_csv('datasets/symtoms_df.csv')

In [24]:
# Displaying a sample of each dataset
print("Disease Descriptions Sample:")
disease_descriptions.head()

Disease Descriptions Sample:


Unnamed: 0,Disease,Description
0,Fungal infection,Fungal infection is a common skin condition ca...
1,Allergy,Allergy is an immune system reaction to a subs...
2,GERD,GERD (Gastroesophageal Reflux Disease) is a di...
3,Chronic cholestasis,Chronic cholestasis is a condition where bile ...
4,Drug Reaction,Drug Reaction occurs when the body reacts adve...


In [25]:
print("\nDisease Precautions Sample:")
disease_precautions.head()


Disease Precautions Sample:


Unnamed: 0.1,Unnamed: 0,Disease,Precaution_1,Precaution_2,Precaution_3,Precaution_4
0,0,Drug Reaction,stop irritation,consult nearest hospital,stop taking drug,follow up
1,1,Malaria,Consult nearest hospital,avoid oily food,avoid non veg food,keep mosquitos out
2,2,Allergy,apply calamine,cover area with bandage,,use ice to compress itching
3,3,Hypothyroidism,reduce stress,exercise,eat healthy,get proper sleep
4,4,Psoriasis,wash hands with warm soapy water,stop bleeding using pressure,consult doctor,salt baths


In [26]:
print("\nDisease Medications Sample:")
disease_medications.head()


Disease Medications Sample:


Unnamed: 0,Disease,Medication
0,Fungal infection,"['Antifungal Cream', 'Fluconazole', 'Terbinafi..."
1,Allergy,"['Antihistamines', 'Decongestants', 'Epinephri..."
2,GERD,"['Proton Pump Inhibitors (PPIs)', 'H2 Blockers..."
3,Chronic cholestasis,"['Ursodeoxycholic acid', 'Cholestyramine', 'Me..."
4,Drug Reaction,"['Antihistamines', 'Epinephrine', 'Corticoster..."


In [27]:
print("\nDisease Diets Sample:")
disease_diets.head()


Disease Diets Sample:


Unnamed: 0,Disease,Diet
0,Fungal infection,"['Antifungal Diet', 'Probiotics', 'Garlic', 'C..."
1,Allergy,"['Elimination Diet', 'Omega-3-rich foods', 'Vi..."
2,GERD,"['Low-Acid Diet', 'Fiber-rich foods', 'Ginger'..."
3,Chronic cholestasis,"['Low-Fat Diet', 'High-Fiber Diet', 'Lean prot..."
4,Drug Reaction,"['Antihistamine Diet', 'Omega-3-rich foods', '..."


In [28]:
print("\nDisease Workouts Sample:")
disease_workouts.head()


Disease Workouts Sample:


Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,disease,workout
0,0,0,Fungal infection,Avoid sugary foods
1,1,1,Fungal infection,Consume probiotics
2,2,2,Fungal infection,Increase intake of garlic
3,3,3,Fungal infection,Include yogurt in diet
4,4,4,Fungal infection,Limit processed foods


In [29]:
print("\nSymptom Descriptions Sample:")
symptom_descriptions.head()


Symptom Descriptions Sample:


Unnamed: 0.1,Unnamed: 0,Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4
0,0,Fungal infection,itching,skin_rash,nodal_skin_eruptions,dischromic _patches
1,1,Fungal infection,skin_rash,nodal_skin_eruptions,dischromic _patches,
2,2,Fungal infection,itching,nodal_skin_eruptions,dischromic _patches,
3,3,Fungal infection,itching,skin_rash,dischromic _patches,
4,4,Fungal infection,itching,skin_rash,nodal_skin_eruptions,


# **Prediction System**

Now we'll create the functions for the prediction system:

In [30]:
# Mapping of symptoms to their indices in the feature vector
symptoms_dict = {'itching': 0, 'skin_rash': 1, 'nodal_skin_eruptions': 2, 'continuous_sneezing': 3, 'shivering': 4, 'chills': 5, 'joint_pain': 6, 'stomach_pain': 7, 'acidity': 8, 'ulcers_on_tongue': 9, 'muscle_wasting': 10, 'vomiting': 11, 'burning_micturition': 12, 'spotting_ urination': 13, 'fatigue': 14, 'weight_gain': 15, 'anxiety': 16, 'cold_hands_and_feets': 17, 'mood_swings': 18, 'weight_loss': 19, 'restlessness': 20, 'lethargy': 21, 'patches_in_throat': 22, 'irregular_sugar_level': 23, 'cough': 24, 'high_fever': 25, 'sunken_eyes': 26, 'breathlessness': 27, 'sweating': 28, 'dehydration': 29, 'indigestion': 30, 'headache': 31, 'yellowish_skin': 32, 'dark_urine': 33, 'nausea': 34, 'loss_of_appetite': 35, 'pain_behind_the_eyes': 36, 'back_pain': 37, 'constipation': 38, 'abdominal_pain': 39, 'diarrhoea': 40, 'mild_fever': 41, 'yellow_urine': 42, 'yellowing_of_eyes': 43, 'acute_liver_failure': 44, 'fluid_overload': 45, 'swelling_of_stomach': 46, 'swelled_lymph_nodes': 47, 'malaise': 48, 'blurred_and_distorted_vision': 49, 'phlegm': 50, 'throat_irritation': 51, 'redness_of_eyes': 52, 'sinus_pressure': 53, 'runny_nose': 54, 'congestion': 55, 'chest_pain': 56, 'weakness_in_limbs': 57, 'fast_heart_rate': 58, 'pain_during_bowel_movements': 59, 'pain_in_anal_region': 60, 'bloody_stool': 61, 'irritation_in_anus': 62, 'neck_pain': 63, 'dizziness': 64, 'cramps': 65, 'bruising': 66, 'obesity': 67, 'swollen_legs': 68, 'swollen_blood_vessels': 69, 'puffy_face_and_eyes': 70, 'enlarged_thyroid': 71, 'brittle_nails': 72, 'swollen_extremeties': 73, 'excessive_hunger': 74, 'extra_marital_contacts': 75, 'drying_and_tingling_lips': 76, 'slurred_speech': 77, 'knee_pain': 78, 'hip_joint_pain': 79, 'muscle_weakness': 80, 'stiff_neck': 81, 'swelling_joints': 82, 'movement_stiffness': 83, 'spinning_movements': 84, 'loss_of_balance': 85, 'unsteadiness': 86, 'weakness_of_one_body_side': 87, 'loss_of_smell': 88, 'bladder_discomfort': 89, 'foul_smell_of urine': 90, 
'continuous_feel_of_urine': 91, 'passage_of_gases': 92, 'internal_itching': 93, 'toxic_look_(typhos)': 94, 'depression': 95, 'irritability': 96, 'muscle_pain': 97, 'altered_sensorium': 98, 'red_spots_over_body': 99, 'belly_pain': 100, 'abnormal_menstruation': 101, 'dischromic _patches': 102, 'watering_from_eyes': 103, 'increased_appetite': 104, 'polyuria': 105, 'family_history': 106, 'mucoid_sputum': 107, 'rusty_sputum': 108, 'lack_of_concentration': 109, 'visual_disturbances': 110, 'receiving_blood_transfusion': 111, 'receiving_unsterile_injections': 112, 'coma': 113, 'stomach_bleeding': 114, 'distention_of_abdomen': 115, 'history_of_alcohol_consumption': 116, 'fluid_overload.1': 117, 'blood_in_sputum': 118, 'prominent_veins_on_calf': 119, 'palpitations': 120, 'painful_walking': 121, 'pus_filled_pimples': 122, 'blackheads': 123, 'scurring': 124, 'skin_peeling': 125, 'silver_like_dusting': 126, 'small_dents_in_nails': 127, 'inflammatory_nails': 128, 'blister': 129, 'red_sore_around_nose': 130, 'yellow_crust_ooze': 131}
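
The hard-coded mapping follows the column order of the feature matrix `X`. A minimal sketch of deriving it from the column names instead is shown below; `feature_columns` is a hypothetical three-entry stand-in for `list(X.columns)`.

```python
# Hypothetical stand-in for list(X.columns); the real list has 132 entries.
feature_columns = ['itching', 'skin_rash', 'nodal_skin_eruptions']

# Map each symptom name to its position in the feature vector.
symptom_index = {name: position for position, name in enumerate(feature_columns)}
print(symptom_index)
```

Deriving the mapping keeps unusual keys such as `'fluid_overload.1'` or `'spotting_ urination'` in sync with whatever pandas loaded from the CSV.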

In [31]:
# Mapping of disease indices to disease names
diseases_list = {15: 'Fungal infection', 4: 'Allergy', 16: 'GERD', 9: 'Chronic cholestasis', 14: 'Drug Reaction', 33: 'Peptic ulcer diseae', 1: 'AIDS', 12: 'Diabetes ', 17: 'Gastroenteritis', 6: 'Bronchial Asthma', 23: 'Hypertension ', 30: 'Migraine', 7: 'Cervical spondylosis', 32: 'Paralysis (brain hemorrhage)', 28: 'Jaundice', 29: 'Malaria', 8: 'Chicken pox', 11: 'Dengue', 37: 'Typhoid', 40: 'hepatitis A', 19: 'Hepatitis B', 20: 'Hepatitis C', 21: 'Hepatitis D', 22: 'Hepatitis E', 3: 'Alcoholic hepatitis', 36: 'Tuberculosis', 10: 'Common Cold', 34: 'Pneumonia', 13: 'Dimorphic hemmorhoids(piles)', 18: 'Heart attack', 39: 'Varicose veins', 26: 'Hypothyroidism', 24: 'Hyperthyroidism', 25: 'Hypoglycemia', 31: 'Osteoarthristis', 5: 'Arthritis', 0: '(vertigo) Paroymsal  Positional Vertigo', 2: 'Acne', 38: 'Urinary tract infection', 35: 'Psoriasis', 27: 'Impetigo'}
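
The indices in `diseases_list` match the order `LabelEncoder` assigns, which is the sorted order of the labels. The sketch below rebuilds such a mapping from a fitted encoder, using five labels copied from the dataset output; `labels`, `encoder`, and `index_to_disease` are illustrative names.

```python
from sklearn.preprocessing import LabelEncoder

# A few labels copied from the dataset output, in their original order.
labels = ['Fungal infection', 'Allergy', 'GERD', 'hepatitis A', 'AIDS']

encoder = LabelEncoder()
encoder.fit(labels)

# classes_ is sorted, and each label's position in it is its encoded index.
index_to_disease = {i: str(name) for i, name in enumerate(encoder.classes_)}

# inverse_transform performs the same lookup for an array of indices.
decoded = str(encoder.inverse_transform([4])[0])
print(index_to_disease, decoded)
```

Lowercase `'hepatitis A'` sorts after every capitalized label, which is why it holds the last index (40) in the full mapping above.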

In [32]:
def predict_disease(patient_symptoms, model):
    """
    Predicts disease based on a list of symptoms.
    
    Args:
        patient_symptoms (list): List of symptoms as strings
        model: Trained ML model
        
    Returns:
        str: Predicted disease name
    """
    # Creating input vector (all zeros initially)
    input_vector = np.zeros(len(symptoms_dict))
    
    # Setting 1 for each symptom that is present
    for symptom in patient_symptoms:
        if symptom in symptoms_dict:
            input_vector[symptoms_dict[symptom]] = 1
        else:
            print(f"Warning: Symptom '{symptom}' not recognized")
    
    # Making prediction
    disease_index = model.predict([input_vector])[0]
    
    # Returning disease name
    return diseases_list[disease_index]
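
`predict_disease` still returns a disease when none of the symptoms is recognized, because an all-zero vector is a valid model input. The sketch below separates the encoding step so the caller can detect that case. `encode_symptoms` and `toy_index` are hypothetical names, and the three-symptom index stands in for `symptoms_dict`.

```python
def encode_symptoms(patient_symptoms, symptom_index):
    """Return (vector, unknown) for a list of symptom names."""
    vector = [0] * len(symptom_index)
    unknown = []
    for symptom in patient_symptoms:
        position = symptom_index.get(symptom)
        if position is None:
            # Collect unrecognized names instead of only printing a warning.
            unknown.append(symptom)
        else:
            vector[position] = 1
    return vector, unknown


# Hypothetical three-symptom index standing in for symptoms_dict.
toy_index = {'itching': 0, 'skin_rash': 1, 'nodal_skin_eruptions': 2}

vector, unknown = encode_symptoms(['itching', 'feverish'], toy_index)
print(vector, unknown)

# With no recognized symptom the vector is all zeros.
no_match_vector, _ = encode_symptoms(['feverish'], toy_index)
has_known_symptom = any(no_match_vector)
print(has_known_symptom)
```

A caller could skip the prediction, or ask for the input again, whenever `has_known_symptom` is false.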

In [33]:
def get_disease_info(disease_name):
    """
    Retrieves information about a specific disease.
    
    Args:
        disease_name (str): Name of the disease
        
    Returns:
        tuple: (description, precautions, medications, diet, workout)
    """
    # Getting disease description
    description = disease_descriptions[disease_descriptions['Disease'] == disease_name]['Description']
    description = " ".join([w for w in description])
    
    # Getting precautions
    precautions = disease_precautions[disease_precautions['Disease'] == disease_name][
        ['Precaution_1', 'Precaution_2', 'Precaution_3', 'Precaution_4']
    ]
    precautions = [col for col in precautions.values]
    
    # Getting medications
    medications = disease_medications[disease_medications['Disease'] == disease_name]['Medication']
    medications = [med for med in medications.values]
    
    # Getting dietary recommendations
    diet = disease_diets[disease_diets['Disease'] == disease_name]['Diet']
    diet = [d for d in diet.values]
    
    # Getting workout recommendations
    workout = disease_workouts[disease_workouts['disease'] == disease_name]['workout']
    workout = [w for w in workout.values]
    
    return description, precautions, medications, diet, workout
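
The samples shown earlier suggest that the `Medication` and `Diet` columns store each list as a single string such as `"['Antifungal Cream', 'Fluconazole', ...]"`, so `get_disease_info` returns a one-element list holding that string. A minimal sketch of turning such a cell into a real list with the standard library follows; `parse_list_cell` is a hypothetical helper.

```python
import ast


def parse_list_cell(cell):
    """Parse a cell such as "['a', 'b']" into a Python list of strings."""
    # literal_eval only evaluates Python literals, unlike eval().
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError(f"Expected a list literal, got {type(value).__name__}")
    return [str(item) for item in value]


# Cell content copied from the medications printed in the test run.
cell = "['Antibiotics', 'Urinary analgesics', 'Phenazopyridine', 'Antispasmodics', 'Probiotics']"
parsed = parse_list_cell(cell)
print(parsed)
```

With parsed lists, each medication or diet item could be printed on its own numbered line instead of as one bracketed string.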

In [34]:
def display_disease_info(disease_name, description, precautions, medications, diet, workout):
    """
    Displays information about a disease in a formatted manner.
    
    Args:
        disease_name (str): Name of the disease
        description (str): Description of the disease
        precautions (list): List of precautions
        medications (list): List of medications
        diet (list): List of dietary recommendations
        workout (list): List of workout recommendations
    """
    print("\n" + "="*30 + " PREDICTION RESULTS " + "="*30)
    print(f"\n📋 PREDICTED DISEASE: {disease_name}")
    
    print(f"\n📝 DESCRIPTION:")
    print(description)
    
    print(f"\n⚠️ PRECAUTIONS:")
    for i, precaution in enumerate(precautions[0], 1):
        print(f"{i}. {precaution}")
    
    print(f"\n💊 RECOMMENDED MEDICATIONS:")
    for i, medication in enumerate(medications, len(precautions[0]) + 1):
        print(f"{i}. {medication}")
    
    print(f"\n🏋️ RECOMMENDED WORKOUTS:")
    for i, exercise in enumerate(workout, len(precautions[0]) + len(medications) + 1):
        print(f"{i}. {exercise}")
    
    print(f"\n🍽️ DIETARY RECOMMENDATIONS:")
    for i, food in enumerate(diet, len(precautions[0]) + len(medications) + len(workout) + 1):
        print(f"{i}. {food}")
    
    print("\n" + "="*75)
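
`display_disease_info` offsets each `enumerate` call by the lengths of the preceding sections, so the numbering runs continuously from one section to the next. If each section should instead start at 1, one option is a small formatting helper. The sketch below assumes a hypothetical `format_section` function that is not part of this notebook.

```python
def format_section(title, items):
    """Return the lines for one section, with items numbered from 1."""
    lines = [title]
    lines.extend(f"{number}. {item}" for number, item in enumerate(items, start=1))
    return lines


section = format_section("RECOMMENDED MEDICATIONS:", ['Antibiotics', 'Probiotics'])
print("\n".join(section))
```

Returning lines rather than printing them also makes the formatting easy to test.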

# **Testing the System**

Let's test our disease prediction system with user input:

In [35]:
# Loading the trained model
with open('models/disease_prediction_model.pkl', 'rb') as model_file:
    model = pickle.load(model_file)

In [36]:
def run_prediction_system():
    """
    Runs the disease prediction system with user input.
    """
    print("\n" + "="*30 + " DISEASE PREDICTION SYSTEM " + "="*30)
    print("\nPlease enter your symptoms, separated by commas.")
    print("Example: itching,skin_rash,nodal_skin_eruptions")
    
    # Getting input from user
    symptoms_input = input("\nEnter your symptoms: ")
    
    # Processing the input
    user_symptoms = [s.strip() for s in symptoms_input.split(',')]
    user_symptoms = [symptom.strip("[]' ") for symptom in user_symptoms]
    
    # Predicting disease
    predicted_disease = predict_disease(user_symptoms, model)
    
    # Getting disease information
    description, precautions, medications, diet, workout = get_disease_info(predicted_disease)
    
    # Displaying results
    display_disease_info(predicted_disease, description, precautions, medications, diet, workout)
    
    return predicted_disease
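
The input handling in `run_prediction_system` accepts both plain comma-separated text and a pasted Python-style list. The sketch below isolates that parsing so it can be checked without calling `input()`. `parse_symptoms` is a hypothetical helper, and dropping empty entries is an addition to the notebook's logic.

```python
def parse_symptoms(raw_text):
    """Split comma-separated input and strip brackets, quotes and spaces."""
    cleaned = [part.strip().strip("[]' ") for part in raw_text.split(',')]
    # Dropping empty entries (e.g. from a trailing comma) is an addition
    # to the notebook's logic.
    return [symptom for symptom in cleaned if symptom]


plain = parse_symptoms("itching, skin_rash ,nodal_skin_eruptions,")
listy = parse_symptoms("['itching', 'skin_rash']")
print(plain)
print(listy)
```

Note that inner spaces are kept, which matters because a few dataset keys such as `'spotting_ urination'` contain one.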

In [37]:
# Running the prediction system
test_result = run_prediction_system()
print(f"\nTest completed. Predicted disease: {test_result}")



Please enter your symptoms, separated by commas.
Example: itching,skin_rash,nodal_skin_eruptions


📋 PREDICTED DISEASE: Urinary tract infection

📝 DESCRIPTION:
Urinary tract infection is an infection in any part of the urinary system.

⚠️ PRECAUTIONS:
1. drink plenty of water
2. increase vitamin c intake
3. drink cranberry juice
4. take probiotics

💊 RECOMMENDED MEDICATIONS:
5. ['Antibiotics', 'Urinary analgesics', 'Phenazopyridine', 'Antispasmodics', 'Probiotics']

🏋️ RECOMMENDED WORKOUTS:
6. Stay hydrated
7. Consume cranberry products
8. Include vitamin C-rich foods
9. Limit caffeine and alcohol
10. Consume probiotics
11. Avoid spicy and acidic foods
12. Consult a healthcare professional
13. Follow medical recommendations
14. Maintain good hygiene
15. Limit sugary foods and beverages

🍽️ DIETARY RECOMMENDATIONS:
16. ['UTI Diet', 'Hydration', 'Cranberry juice', 'Probiotics', 'Vitamin C-rich foods']


Test completed. Predicted disease: Urinary tract infection




# **Conclusion**

This notebook demonstrates a system for predicting diseases from symptoms using machine learning. Beyond the prediction itself, the system provides descriptions, precautions, medications, dietary recommendations, and workout suggestions for the predicted disease.

The **Support Vector Machine** model was selected. It reached 1.0000 accuracy on the test set, as did the other models whose results are shown, and `max` picked it as the first of the tied entries.