## Project Overview
This notebook implements machine learning models for predicting diseases based on patient symptoms. We compare three gradient boosting algorithms (XGBoost, LightGBM, and CatBoost) to identify which performs best for this medical diagnostic task.

## Dataset Information
- **Source**: Preprocessed medical diagnostic data
- **Features**: 132 binary symptom indicators (e.g., fever, headache, cough)
- **Target**: 41 unique disease classifications
- **Size**: 304 patient records
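
The figures above can be checked directly against a loaded DataFrame. The following is a minimal sketch on a small made-up frame: the helper name `summarize_dataset` and the toy symptom values are illustrative assumptions, and only the `prognosis` column name comes from the dataset.

```python
import pandas as pd


def summarize_dataset(df: pd.DataFrame, target: str = "prognosis") -> dict:
    """Return basic facts about a symptom dataset (hypothetical helper)."""
    features = df.drop(columns=[target])
    # A feature column is binary if every value in it is 0 or 1.
    is_binary = features.isin([0, 1]).all(axis=0)
    return {
        "n_records": len(df),
        "n_features": features.shape[1],
        "n_classes": df[target].nunique(),
        "all_binary": bool(is_binary.all()),
    }


# Toy data for illustration only; not taken from the real dataset.
toy = pd.DataFrame(
    {
        "itching": [1, 0, 1, 0],
        "skin_rash": [1, 1, 0, 0],
        "prognosis": ["Fungal infection", "Fungal infection", "Allergy", "Allergy"],
    }
)

summary = summarize_dataset(toy)
print(summary)
```

Running the same helper on the real frame would give the record, feature, and class counts to compare with the list above.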

## Project Goals
1. Build accurate disease prediction models using gradient boosting algorithms
2. Compare the models on several criteria (accuracy, degree of overfitting, robustness)
3. Identify the most important symptoms for disease diagnosis
4. Select the best model for potential deployment in medical applications
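
One way to make the second goal concrete is to compute accuracy on training and held-out data and treat the difference as an overfitting indicator. This is a sketch under that assumption: the notebook section here does not specify how overfitting is measured, and the function names and toy labels are hypothetical.

```python
import numpy as np


def accuracy(y_true, y_pred) -> float:
    """Fraction of predictions that match the true labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())


def overfitting_gap(train_acc: float, test_acc: float) -> float:
    """Train accuracy minus test accuracy; larger values suggest overfitting."""
    return train_acc - test_acc


# Toy labels for illustration only.
y_true = ["Allergy", "GERD", "Allergy", "GERD"]
train_acc = accuracy(y_true, ["Allergy", "GERD", "Allergy", "GERD"])
test_acc = accuracy(y_true, ["Allergy", "GERD", "GERD", "GERD"])

gap = overfitting_gap(train_acc, test_acc)
print(train_acc, test_acc, gap)
```

The same two functions could be applied to the predictions of each candidate model so that all of them are compared on identical terms.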

# 1.0 Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import math

# Suppress library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")

## 1.2 Loading Dataset

In [None]:
data = pd.read_csv('../data/raw/Training.csv')

In [13]:
data.head()

| itching | skin_rash | nodal_skin_eruptions | continuous_sneezing | shivering | chills | joint_pain | stomach_pain | acidity | ulcers_on_tongue | muscle_wasting | ... | scurring | skin_peeling | silver_like_dusting | small_dents_in_nails | inflammatory_nails | blister | red_sore_around_nose | yellow_crust_ooze | prognosis | Unnamed: 133 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fungal infection | NaN |
| 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fungal infection | NaN |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fungal infection | NaN |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fungal infection | NaN |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fungal infection | NaN |


## 1.3 Metadata of our dataset

In [8]:
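def data_overview(data):
    """Print shape, columns, dtypes, missing-value counts, and summary statistics."""
    print("Shape of the dataset: ", data.shape)
    print("\n")
    print("Columns in the dataset: ", data.columns)
    print("\n")
    print("Info of the dataset: ")
    # DataFrame.info() prints its report and returns None, so "None" also appears in the output
    print(data.info())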
    print("\n")
    print("Missing values in the dataset: ")
    print(data.isnull().sum())
    print("\n")
    print("Statistical summary of the dataset: ")
    print(data.describe())
    
data_overview(data)

Shape of the dataset:  (4920, 134)


Columns in the dataset:  Index(['itching', 'skin_rash', 'nodal_skin_eruptions', 'continuous_sneezing',
       'shivering', 'chills', 'joint_pain', 'stomach_pain', 'acidity',
       'ulcers_on_tongue',
       ...
       'scurring', 'skin_peeling', 'silver_like_dusting',
       'small_dents_in_nails', 'inflammatory_nails', 'blister',
       'red_sore_around_nose', 'yellow_crust_ooze', 'prognosis',
       'Unnamed: 133'],
      dtype='object', length=134)


Info of the dataset: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4920 entries, 0 to 4919
Columns: 134 entries, itching to Unnamed: 133
dtypes: float64(1), int64(132), object(1)
memory usage: 5.0+ MB
None


Missing values in the dataset: 
itching                    0
skin_rash                  0
nodal_skin_eruptions       0
continuous_sneezing        0
shivering                  0
                        ... 
blister                    0
red_sore_around_nose       0
yellow_crust_ooze          

* The column `Unnamed: 133` contains null values and carries no diagnostic information, so it is removed.

In [14]:
# 'Unnamed: 133' contains null values -> drop the column from the dataset
data = data.drop(columns=['Unnamed: 133'])
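
A column such as `Unnamed: 133` typically appears when every line of a CSV file ends with a trailing delimiter. Whether that is the cause here is an assumption, since the raw file is not shown. The sketch below reproduces the effect on a small in-memory CSV (toy values) and drops every all-null column without naming it explicitly.

```python
import io

import pandas as pd

# In-memory CSV with a trailing comma on every line (toy data, illustration only).
raw = io.StringIO(
    "itching,skin_rash,prognosis,\n"
    "1,1,Fungal infection,\n"
    "0,1,Fungal infection,\n"
)

df = pd.read_csv(raw)
# The empty header field after the last comma becomes an 'Unnamed: N' column.
print(df.columns.tolist())

# Drop every column in which all values are missing.
cleaned = df.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

Dropping by name, as the cell above does, is more explicit; `dropna(axis=1, how="all")` is an alternative when the position of the stray column is not known in advance.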

### Treat Duplicates

In [15]:
data.duplicated().sum()

np.int64(4622)

In [16]:
data = data.drop_duplicates()
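
`DataFrame.duplicated()` marks every row that repeats an earlier row, so its sum is the number of rows that `drop_duplicates()` removes. The following sketch on made-up data shows that relationship and how to inspect the class counts that remain. The toy values are illustrative, and resetting the index is an optional step the notebook does not perform.

```python
import pandas as pd

# Toy frame: rows 1 and 3 repeat row 0 (illustration only).
df = pd.DataFrame(
    {
        "itching": [1, 1, 0, 1],
        "skin_rash": [1, 1, 1, 1],
        "prognosis": ["Fungal infection", "Fungal infection", "Allergy", "Fungal infection"],
    }
)

n_dupes = int(df.duplicated().sum())

# Keep the first occurrence of each row and renumber the index from 0.
deduped = df.drop_duplicates().reset_index(drop=True)

# The number of rows removed equals the duplicate count.
assert len(deduped) == len(df) - n_dupes

# Records left per class after deduplication.
class_counts = deduped["prognosis"].value_counts()
print(n_dupes, len(deduped))
print(class_counts)
```

Checking the per-class counts after this step is worthwhile because heavy deduplication can leave only a few records for each disease.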

# 2.0 Exporting Cleaned Data

* Export the cleaned data for use in the subsequent EDA.

In [17]:
data.to_csv('../data/processed/cleaned_data.csv', index=False)
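
A quick way to confirm the export is to read the file back and compare it with the in-memory frame. This sketch writes to a temporary directory rather than `../data/processed/`, and the toy frame is an assumption for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the cleaned data (illustration only).
toy = pd.DataFrame(
    {
        "itching": [1, 0],
        "skin_rash": [1, 1],
        "prognosis": ["Fungal infection", "Allergy"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "cleaned_data.csv"
    # index=False keeps the row index out of the file, as in the export cell.
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Compare column names, shape, and values after the round trip.
round_trip_ok = (
    list(reloaded.columns) == list(toy.columns)
    and reloaded.shape == toy.shape
    and reloaded.to_dict("list") == toy.to_dict("list")
)
print(round_trip_ok)
```

Without `index=False`, the reloaded file would contain an extra unnamed index column, which is one reason to run a check like this after exporting.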