In [317]:
import numpy as np

In [318]:
import pandas as pd

In [319]:
from sklearn.preprocessing import StandardScaler

In [320]:
from sklearn.impute import SimpleImputer

In [321]:
data = pd.read_csv("Diabetes_raw_dataset.csv", sep = ",")
data

Unnamed: 0,Patient number,Cholesterol (mg/dl),Glucose (mg/dl),HDL Chol (mg/dl),TChol/HDL ratio,Age,Gender,Height /stature (cm),weight1 (Kg),weight2(Kg),BMI (Kg/m^2),Systolic BP,Diastolic BP,waist (cm),hip (cm),Diabetes
0,1,193.0,77.0,49.0,3.9,19,female,154.9,54.93,54,22.88,118.0,70.0,81.3,96.5,No diabetes
1,2,146.0,79.0,41.0,3.6,19,female,152.4,98.97,61,42.61,108.0,58.0,83.8,101.6,No diabetes
2,3,217.0,75.0,54.0,4.0,20,female,170.2,116.22,85,40.13,110.0,72.0,101.6,114.3,No diabetes
3,4,226.0,97.0,70.0,3.2,20,female,162.6,54.03,52,20.44,122.0,64.0,78.7,99.1,No diabetes
4,5,164.0,91.0,67.0,2.4,20,female,177.8,83.08,64,26.28,122.0,86.0,81.3,99.1,No diabetes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,496,155.0,58.0,69.0,2.2,26,male,185.4,72.19,79,21.00,110.0,76.0,76.2,88.9,No diabetes
496,497,179.0,90.0,60.0,3.0,26,female,152.4,93.07,59,40.07,138.0,84.0,81.3,101.6,No diabetes
497,498,283.0,83.0,74.0,3.8,26,male,182.9,65.83,103,19.68,158.0,104.0,104.1,111.8,No diabetes
498,499,228.0,79.0,37.0,6.2,26,male,182.9,97.16,118,29.05,122.0,90.0,121.9,124.5,No diabetes


In [322]:
print("|| Data: " + str(data.shape) + "||")

|| Data: (500, 16)||


In [323]:
data.replace(0, np.nan, inplace=True)

data.Gender = data.Gender.map({'male': 0 , 'female': 1})
data.Diabetes = data.Diabetes.map({'Diabetes': 1 , 'No diabetes': 0})

The first line replaces every 0 in the data with NaN, so that zeros are treated as missing values. The next two lines encode the categorical Gender and Diabetes columns as numeric values.
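As a small illustration of the two steps described above, the sketch below applies the same replace and map calls to a tiny frame. The values are invented for the example; only the column names and the mappings come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (invented values); the 0.0 stands in for a missing reading.
toy = pd.DataFrame({
    "Glucose (mg/dl)": [77.0, 0.0, 91.0],
    "Gender": ["female", "male", "female"],
    "Diabetes": ["No diabetes", "Diabetes", "No diabetes"],
})

# Zeros become NaN so that they can be imputed later.
toy = toy.replace(0, np.nan)

# Text categories become numeric codes. A label that is missing from
# the dictionary would turn into NaN, so the NaN counts are printed.
toy["Gender"] = toy["Gender"].map({"male": 0, "female": 1})
toy["Diabetes"] = toy["Diabetes"].map({"Diabetes": 1, "No diabetes": 0})

print(toy)
print(toy.isna().sum())
```

The order matters here: if the replace call ran after the mapping, the code 0 used for 'male' and 'No diabetes' would itself be turned into NaN.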

In [324]:
imputer = SimpleImputer(missing_values=np.nan, strategy='median')  # np.NaN alias was removed in NumPy 2.0
imputer = imputer.fit(data)
X = imputer.transform(data)
print(X[418])

mod_data = pd.DataFrame(X, columns = data.columns)
mod_data

[419.   254.   342.    37.     6.9   75.     0.   172.7   54.93  95.
  18.41 151.    87.   111.8  114.3    1.  ]


Unnamed: 0,Patient number,Cholesterol (mg/dl),Glucose (mg/dl),HDL Chol (mg/dl),TChol/HDL ratio,Age,Gender,Height /stature (cm),weight1 (Kg),weight2(Kg),BMI (Kg/m^2),Systolic BP,Diastolic BP,waist (cm),hip (cm),Diabetes
0,1.0,193.0,77.0,49.0,3.9,19.0,1.0,154.9,54.93,54.0,22.88,118.0,70.0,81.3,96.5,0.0
1,2.0,146.0,79.0,41.0,3.6,19.0,1.0,152.4,98.97,61.0,42.61,108.0,58.0,83.8,101.6,0.0
2,3.0,217.0,75.0,54.0,4.0,20.0,1.0,170.2,116.22,85.0,40.13,110.0,72.0,101.6,114.3,0.0
3,4.0,226.0,97.0,70.0,3.2,20.0,1.0,162.6,54.03,52.0,20.44,122.0,64.0,78.7,99.1,0.0
4,5.0,164.0,91.0,67.0,2.4,20.0,1.0,177.8,83.08,64.0,26.28,122.0,86.0,81.3,99.1,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,496.0,155.0,58.0,69.0,2.2,26.0,0.0,185.4,72.19,79.0,21.00,110.0,76.0,76.2,88.9,0.0
496,497.0,179.0,90.0,60.0,3.0,26.0,1.0,152.4,93.07,59.0,40.07,138.0,84.0,81.3,101.6,0.0
497,498.0,283.0,83.0,74.0,3.8,26.0,0.0,182.9,65.83,103.0,19.68,158.0,104.0,104.1,111.8,0.0
498,499.0,228.0,79.0,37.0,6.2,26.0,0.0,182.9,97.16,118.0,29.05,122.0,90.0,121.9,124.5,0.0


This code creates a SimpleImputer instance configured to treat NaN as missing and to fill each gap with the median of its column. After the imputer is fitted and applied, the final lines convert the resulting array X back into a pandas DataFrame and assign it to the variable mod_data.
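A minimal sketch of median imputation on a toy frame may make the behaviour easier to see. The values below are invented; the SimpleImputer arguments match the ones used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (invented values) with one gap per column and one extreme reading.
toy = pd.DataFrame({
    "Glucose (mg/dl)": [77.0, np.nan, 91.0, 300.0],
    "Age": [19.0, 20.0, np.nan, 26.0],
})

imputer = SimpleImputer(missing_values=np.nan, strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# statistics_ holds the per-column medians learned during fit.
print(imputer.statistics_)
print(filled)
```

In this toy example the glucose gap is filled with 91, the median of the observed values, and is not pulled upward by the extreme reading of 300 as a mean would be.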

In [325]:
len(mod_data)

500

Outliers definition - Outliers are data points that are significantly different from other observations in a dataset.

Why can removing outliers be a problem? Removing outliers can result in a significant loss of data. Outliers can also have a significant impact on the results of a model, so removing them can change the model's overall results. It is generally recommended to analyze the impact of outliers on the model before deciding whether to remove or replace them.

In [326]:
value_threshold = (mod_data['Glucose (mg/dl)'].mean()+(mod_data['Glucose (mg/dl)'].std()*3))

This line of code computes an outlier threshold for the 'Glucose (mg/dl)' column. The threshold is the mean of the column plus three times its standard deviation.

In [327]:
value_threshold 

1368.9809181154378

This line of code outputs value_threshold, which is the mean of the glucose column plus three times its standard deviation.
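One way to use such a threshold without deleting anything is to count the rows that exceed it. The sketch below does this on invented glucose readings; the variable name is_outlier is an assumption made for the example and does not appear in the notebook.

```python
import pandas as pd

# Invented readings: nineteen ordinary values and one extreme value.
glucose = pd.Series([80.0 + i for i in range(19)] + [400.0],
                    name="Glucose (mg/dl)")

# Same rule as in the notebook: mean plus three standard deviations.
value_threshold = glucose.mean() + 3 * glucose.std()

# Flag the rows above the threshold instead of removing them.
is_outlier = glucose > value_threshold

print(f"threshold: {value_threshold:.1f}")
print(f"rows above threshold: {int(is_outlier.sum())} of {len(glucose)}")
```

Flagging keeps every row available, so the effect of the extreme values on a model can be checked before deciding whether to remove or replace them.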

In [328]:
# Add a new feature: based on Article 1, create the WSR column (waist divided by height)
mod_data['WSR'] = mod_data['waist (cm)']/mod_data['Height /stature (cm)']
mod_data

Unnamed: 0,Patient number,Cholesterol (mg/dl),Glucose (mg/dl),HDL Chol (mg/dl),TChol/HDL ratio,Age,Gender,Height /stature (cm),weight1 (Kg),weight2(Kg),BMI (Kg/m^2),Systolic BP,Diastolic BP,waist (cm),hip (cm),Diabetes,WSR
0,1.0,193.0,77.0,49.0,3.9,19.0,1.0,154.9,54.93,54.0,22.88,118.0,70.0,81.3,96.5,0.0,0.524855
1,2.0,146.0,79.0,41.0,3.6,19.0,1.0,152.4,98.97,61.0,42.61,108.0,58.0,83.8,101.6,0.0,0.549869
2,3.0,217.0,75.0,54.0,4.0,20.0,1.0,170.2,116.22,85.0,40.13,110.0,72.0,101.6,114.3,0.0,0.596945
3,4.0,226.0,97.0,70.0,3.2,20.0,1.0,162.6,54.03,52.0,20.44,122.0,64.0,78.7,99.1,0.0,0.484010
4,5.0,164.0,91.0,67.0,2.4,20.0,1.0,177.8,83.08,64.0,26.28,122.0,86.0,81.3,99.1,0.0,0.457255
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,496.0,155.0,58.0,69.0,2.2,26.0,0.0,185.4,72.19,79.0,21.00,110.0,76.0,76.2,88.9,0.0,0.411003
496,497.0,179.0,90.0,60.0,3.0,26.0,1.0,152.4,93.07,59.0,40.07,138.0,84.0,81.3,101.6,0.0,0.533465
497,498.0,283.0,83.0,74.0,3.8,26.0,0.0,182.9,65.83,103.0,19.68,158.0,104.0,104.1,111.8,0.0,0.569163
498,499.0,228.0,79.0,37.0,6.2,26.0,0.0,182.9,97.16,118.0,29.05,122.0,90.0,121.9,124.5,0.0,0.666484


This code creates a new feature called WSR (waist-to-stature ratio, i.e. waist circumference divided by height). It expresses a person's waist size relative to their height, which can be relevant information for certain kinds of analysis or prediction.

In [329]:
# Create binary categories for cholesterol, HDL and the TChol/HDL ratio
cholestrol_data = mod_data.copy()
cholestrol_data['Chol'] = cholestrol_data[' Cholesterol (mg/dl)'].apply(lambda x: '1' if x > 240 else '0')
cholestrol_data['HDL'] = cholestrol_data['HDL Chol (mg/dl)'].apply(lambda x: '1' if x < 40 else '0')
cholestrol_data['TChol_Ratio_new'] = cholestrol_data['TChol/HDL ratio'].apply(lambda x: '1' if x > 4.5 else '0')
cholestrol_data

Unnamed: 0,Patient number,Cholesterol (mg/dl),Glucose (mg/dl),HDL Chol (mg/dl),TChol/HDL ratio,Age,Gender,Height /stature (cm),weight1 (Kg),weight2(Kg),BMI (Kg/m^2),Systolic BP,Diastolic BP,waist (cm),hip (cm),Diabetes,WSR,Chol,HDL,TChol_Ratio_new
0,1.0,193.0,77.0,49.0,3.9,19.0,1.0,154.9,54.93,54.0,22.88,118.0,70.0,81.3,96.5,0.0,0.524855,0,0,0
1,2.0,146.0,79.0,41.0,3.6,19.0,1.0,152.4,98.97,61.0,42.61,108.0,58.0,83.8,101.6,0.0,0.549869,0,0,0
2,3.0,217.0,75.0,54.0,4.0,20.0,1.0,170.2,116.22,85.0,40.13,110.0,72.0,101.6,114.3,0.0,0.596945,0,0,0
3,4.0,226.0,97.0,70.0,3.2,20.0,1.0,162.6,54.03,52.0,20.44,122.0,64.0,78.7,99.1,0.0,0.484010,0,0,0
4,5.0,164.0,91.0,67.0,2.4,20.0,1.0,177.8,83.08,64.0,26.28,122.0,86.0,81.3,99.1,0.0,0.457255,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,496.0,155.0,58.0,69.0,2.2,26.0,0.0,185.4,72.19,79.0,21.00,110.0,76.0,76.2,88.9,0.0,0.411003,0,0,0
496,497.0,179.0,90.0,60.0,3.0,26.0,1.0,152.4,93.07,59.0,40.07,138.0,84.0,81.3,101.6,0.0,0.533465,0,0,0
497,498.0,283.0,83.0,74.0,3.8,26.0,0.0,182.9,65.83,103.0,19.68,158.0,104.0,104.1,111.8,0.0,0.569163,1,0,0
498,499.0,228.0,79.0,37.0,6.2,26.0,0.0,182.9,97.16,118.0,29.05,122.0,90.0,121.9,124.5,0.0,0.666484,0,1,1


This code creates a new DataFrame, cholestrol_data, by copying mod_data. It then adds three new columns, 'Chol', 'HDL' and 'TChol_Ratio_new', to cholestrol_data.

(1) The 'Chol' column is created by applying a lambda function that assigns '1' if the value in the 'Cholesterol (mg/dl)' column is greater than 240 and '0' otherwise.

(2) The 'HDL' column is created by applying a lambda function that assigns '1' if the value in the 'HDL Chol (mg/dl)' column is less than 40 and '0' otherwise.

(3) The 'TChol_Ratio_new' column is created by applying a lambda function that assigns '1' if the value in the 'TChol/HDL ratio' column is greater than 4.5 and '0' otherwise.

This process is called binning. It is a technique for grouping continuous numerical data into discrete "bins" for analysis.
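The same kind of binary binning can also be written as a vectorized comparison. The sketch below uses invented values and produces integer 0/1 codes, whereas the lambdas above return the strings '1' and '0'; treating integer output as preferable is an assumption of this example.

```python
import pandas as pd

# Toy frame (invented values), including values exactly on the cut-offs.
toy = pd.DataFrame({
    "TChol/HDL ratio": [3.9, 4.5, 6.2],
    "HDL Chol (mg/dl)": [49.0, 40.0, 37.0],
})

# Each comparison yields booleans; astype(int) turns them into 0/1.
toy["TChol_Ratio_new"] = (toy["TChol/HDL ratio"] > 4.5).astype(int)
toy["HDL"] = (toy["HDL Chol (mg/dl)"] < 40).astype(int)

print(toy)
```

Values that sit exactly on a cut-off (4.5 and 40 here) fall into bin 0, which matches the strict comparisons used in the lambdas.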

In [330]:
# Drop the three original columns now that they have been binned
cholestrol_data.drop([' Cholesterol (mg/dl)','HDL Chol (mg/dl)','TChol/HDL ratio'],axis=1, inplace=True)
cholestrol_data

Unnamed: 0,Patient number,Glucose (mg/dl),Age,Gender,Height /stature (cm),weight1 (Kg),weight2(Kg),BMI (Kg/m^2),Systolic BP,Diastolic BP,waist (cm),hip (cm),Diabetes,WSR,Chol,HDL,TChol_Ratio_new
0,1.0,77.0,19.0,1.0,154.9,54.93,54.0,22.88,118.0,70.0,81.3,96.5,0.0,0.524855,0,0,0
1,2.0,79.0,19.0,1.0,152.4,98.97,61.0,42.61,108.0,58.0,83.8,101.6,0.0,0.549869,0,0,0
2,3.0,75.0,20.0,1.0,170.2,116.22,85.0,40.13,110.0,72.0,101.6,114.3,0.0,0.596945,0,0,0
3,4.0,97.0,20.0,1.0,162.6,54.03,52.0,20.44,122.0,64.0,78.7,99.1,0.0,0.484010,0,0,0
4,5.0,91.0,20.0,1.0,177.8,83.08,64.0,26.28,122.0,86.0,81.3,99.1,0.0,0.457255,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,496.0,58.0,26.0,0.0,185.4,72.19,79.0,21.00,110.0,76.0,76.2,88.9,0.0,0.411003,0,0,0
496,497.0,90.0,26.0,1.0,152.4,93.07,59.0,40.07,138.0,84.0,81.3,101.6,0.0,0.533465,0,0,0
497,498.0,83.0,26.0,0.0,182.9,65.83,103.0,19.68,158.0,104.0,104.1,111.8,0.0,0.569163,1,0,0
498,499.0,79.0,26.0,0.0,182.9,97.16,118.0,29.05,122.0,90.0,121.9,124.5,0.0,0.666484,0,1,1


This line of code removes the original cholesterol, HDL, and ratio columns, which are now represented by their binned versions. The drop() function removes the given columns: the axis=1 argument indicates that columns (not rows) are being removed, and inplace=True modifies the existing DataFrame rather than returning a new one.

In [331]:
cholestrol_data['Categorized_Glucose'] = cholestrol_data['Glucose (mg/dl)'].apply(lambda x: '1' if x > 99 else '0')
cholestrol_data.drop(['Glucose (mg/dl)'],axis=1, inplace=True)
cholestrol_data

Unnamed: 0,Patient number,Age,Gender,Height /stature (cm),weight1 (Kg),weight2(Kg),BMI (Kg/m^2),Systolic BP,Diastolic BP,waist (cm),hip (cm),Diabetes,WSR,Chol,HDL,TChol_Ratio_new,Categorized_Glucose
0,1.0,19.0,1.0,154.9,54.93,54.0,22.88,118.0,70.0,81.3,96.5,0.0,0.524855,0,0,0,0
1,2.0,19.0,1.0,152.4,98.97,61.0,42.61,108.0,58.0,83.8,101.6,0.0,0.549869,0,0,0,0
2,3.0,20.0,1.0,170.2,116.22,85.0,40.13,110.0,72.0,101.6,114.3,0.0,0.596945,0,0,0,0
3,4.0,20.0,1.0,162.6,54.03,52.0,20.44,122.0,64.0,78.7,99.1,0.0,0.484010,0,0,0,0
4,5.0,20.0,1.0,177.8,83.08,64.0,26.28,122.0,86.0,81.3,99.1,0.0,0.457255,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,496.0,26.0,0.0,185.4,72.19,79.0,21.00,110.0,76.0,76.2,88.9,0.0,0.411003,0,0,0,0
496,497.0,26.0,1.0,152.4,93.07,59.0,40.07,138.0,84.0,81.3,101.6,0.0,0.533465,0,0,0,0
497,498.0,26.0,0.0,182.9,65.83,103.0,19.68,158.0,104.0,104.1,111.8,0.0,0.569163,1,0,0,0
498,499.0,26.0,0.0,182.9,97.16,118.0,29.05,122.0,90.0,121.9,124.5,0.0,0.666484,0,1,1,0


This code adds a new column named 'Categorized_Glucose' to the cholestrol_data DataFrame by applying a lambda function that assigns '1' if the value in the 'Glucose (mg/dl)' column is greater than 99 and '0' otherwise. The 'Glucose (mg/dl)' column is then dropped.

In [332]:
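def BloodPressure(age,sbp,dbp):
    # Under 40: flag when systolic > 110 and diastolic > 70
    if age < 40:
        if sbp > 110 and dbp > 70:
            return 1
        return 0
    # 40 and over: flag when systolic > 117 and diastolic > 69
    if sbp > 117 and dbp > 69:
        return 1
    return 0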

cholestrol_data['Categorized_BP'] = cholestrol_data.apply(lambda x: BloodPressure(x['Age'], x['Systolic BP'], x['Diastolic BP']), axis=1)
cholestrol_data

Unnamed: 0,Patient number,Age,Gender,Height /stature (cm),weight1 (Kg),weight2(Kg),BMI (Kg/m^2),Systolic BP,Diastolic BP,waist (cm),hip (cm),Diabetes,WSR,Chol,HDL,TChol_Ratio_new,Categorized_Glucose,Categorized_BP
0,1.0,19.0,1.0,154.9,54.93,54.0,22.88,118.0,70.0,81.3,96.5,0.0,0.524855,0,0,0,0,0
1,2.0,19.0,1.0,152.4,98.97,61.0,42.61,108.0,58.0,83.8,101.6,0.0,0.549869,0,0,0,0,0
2,3.0,20.0,1.0,170.2,116.22,85.0,40.13,110.0,72.0,101.6,114.3,0.0,0.596945,0,0,0,0,0
3,4.0,20.0,1.0,162.6,54.03,52.0,20.44,122.0,64.0,78.7,99.1,0.0,0.484010,0,0,0,0,0
4,5.0,20.0,1.0,177.8,83.08,64.0,26.28,122.0,86.0,81.3,99.1,0.0,0.457255,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,496.0,26.0,0.0,185.4,72.19,79.0,21.00,110.0,76.0,76.2,88.9,0.0,0.411003,0,0,0,0,0
496,497.0,26.0,1.0,152.4,93.07,59.0,40.07,138.0,84.0,81.3,101.6,0.0,0.533465,0,0,0,0,1
497,498.0,26.0,0.0,182.9,65.83,103.0,19.68,158.0,104.0,104.1,111.8,0.0,0.569163,1,0,0,0,1
498,499.0,26.0,0.0,182.9,97.16,118.0,29.05,122.0,90.0,121.9,124.5,0.0,0.666484,0,1,1,0,1


This code defines a function, BloodPressure, that accepts three parameters: age, sbp (systolic blood pressure), and dbp (diastolic blood pressure). If the patient is younger than 40, it returns 1 when the SBP is greater than 110 and the DBP is greater than 70, and 0 otherwise. If the patient is 40 or older, it returns 1 when the SBP is greater than 117 and the DBP is greater than 69, and 0 otherwise. The function is applied row by row to create the 'Categorized_BP' column.
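The rule described above can also be expressed without a row-wise apply. The following is only a sketch on invented values; the helper names under_40, high_under_40 and high_40_plus are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame (invented values): two patients under 40, two aged 40 or over.
toy = pd.DataFrame({
    "Age": [20.0, 20.0, 55.0, 55.0],
    "Systolic BP": [122.0, 110.0, 118.0, 116.0],
    "Diastolic BP": [86.0, 72.0, 70.0, 90.0],
})

under_40 = toy["Age"] < 40
high_under_40 = (toy["Systolic BP"] > 110) & (toy["Diastolic BP"] > 70)
high_40_plus = (toy["Systolic BP"] > 117) & (toy["Diastolic BP"] > 69)

# Pick the age-appropriate rule for each row, then convert True/False to 1/0.
toy["Categorized_BP"] = np.where(under_40, high_under_40, high_40_plus).astype(int)

print(toy)
```

Including rows on both sides of age 40 in a check like this makes it easy to confirm that both sets of cut-offs are actually being applied.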

In [333]:
cholestrol_data['Categorized_Age'] = cholestrol_data['Age'].apply(lambda x: '1' if x > 50 else '0')
cholestrol_data.drop(['Age','Systolic BP','Diastolic BP'],axis=1, inplace=True)
cholestrol_data

Unnamed: 0,Patient number,Gender,Height /stature (cm),weight1 (Kg),weight2(Kg),BMI (Kg/m^2),waist (cm),hip (cm),Diabetes,WSR,Chol,HDL,TChol_Ratio_new,Categorized_Glucose,Categorized_BP,Categorized_Age
0,1.0,1.0,154.9,54.93,54.0,22.88,81.3,96.5,0.0,0.524855,0,0,0,0,0,0
1,2.0,1.0,152.4,98.97,61.0,42.61,83.8,101.6,0.0,0.549869,0,0,0,0,0,0
2,3.0,1.0,170.2,116.22,85.0,40.13,101.6,114.3,0.0,0.596945,0,0,0,0,0,0
3,4.0,1.0,162.6,54.03,52.0,20.44,78.7,99.1,0.0,0.484010,0,0,0,0,0,0
4,5.0,1.0,177.8,83.08,64.0,26.28,81.3,99.1,0.0,0.457255,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,496.0,0.0,185.4,72.19,79.0,21.00,76.2,88.9,0.0,0.411003,0,0,0,0,0,0
496,497.0,1.0,152.4,93.07,59.0,40.07,81.3,101.6,0.0,0.533465,0,0,0,0,1,0
497,498.0,0.0,182.9,65.83,103.0,19.68,104.1,111.8,0.0,0.569163,1,0,0,0,1,0
498,499.0,0.0,182.9,97.16,118.0,29.05,121.9,124.5,0.0,0.666484,0,1,1,0,1,0


This code creates a new column called 'Categorized_Age' in the cholestrol_data DataFrame by applying a lambda function that assigns '1' if the value in the 'Age' column is greater than 50 and '0' otherwise. The 'Age', 'Systolic BP' and 'Diastolic BP' columns are then dropped.

In [334]:
cholestrol_data['Categorized_BMI'] = cholestrol_data['BMI (Kg/m^2)'].apply(lambda x: '1' if x > 25 else '0')
cholestrol_data.drop(['BMI (Kg/m^2)'],axis=1, inplace=True)
cholestrol_data

Unnamed: 0,Patient number,Gender,Height /stature (cm),weight1 (Kg),weight2(Kg),waist (cm),hip (cm),Diabetes,WSR,Chol,HDL,TChol_Ratio_new,Categorized_Glucose,Categorized_BP,Categorized_Age,Categorized_BMI
0,1.0,1.0,154.9,54.93,54.0,81.3,96.5,0.0,0.524855,0,0,0,0,0,0,0
1,2.0,1.0,152.4,98.97,61.0,83.8,101.6,0.0,0.549869,0,0,0,0,0,0,1
2,3.0,1.0,170.2,116.22,85.0,101.6,114.3,0.0,0.596945,0,0,0,0,0,0,1
3,4.0,1.0,162.6,54.03,52.0,78.7,99.1,0.0,0.484010,0,0,0,0,0,0,0
4,5.0,1.0,177.8,83.08,64.0,81.3,99.1,0.0,0.457255,0,0,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,496.0,0.0,185.4,72.19,79.0,76.2,88.9,0.0,0.411003,0,0,0,0,0,0,0
496,497.0,1.0,152.4,93.07,59.0,81.3,101.6,0.0,0.533465,0,0,0,0,1,0,1
497,498.0,0.0,182.9,65.83,103.0,104.1,111.8,0.0,0.569163,1,0,0,0,1,0,0
498,499.0,0.0,182.9,97.16,118.0,121.9,124.5,0.0,0.666484,0,1,1,0,1,0,1


In [335]:
cholestrol_data['WHR'] = cholestrol_data['waist (cm)']/cholestrol_data['hip (cm)']
cholestrol_data

Unnamed: 0,Patient number,Gender,Height /stature (cm),weight1 (Kg),weight2(Kg),waist (cm),hip (cm),Diabetes,WSR,Chol,HDL,TChol_Ratio_new,Categorized_Glucose,Categorized_BP,Categorized_Age,Categorized_BMI,WHR
0,1.0,1.0,154.9,54.93,54.0,81.3,96.5,0.0,0.524855,0,0,0,0,0,0,0,0.842487
1,2.0,1.0,152.4,98.97,61.0,83.8,101.6,0.0,0.549869,0,0,0,0,0,0,1,0.824803
2,3.0,1.0,170.2,116.22,85.0,101.6,114.3,0.0,0.596945,0,0,0,0,0,0,1,0.888889
3,4.0,1.0,162.6,54.03,52.0,78.7,99.1,0.0,0.484010,0,0,0,0,0,0,0,0.794147
4,5.0,1.0,177.8,83.08,64.0,81.3,99.1,0.0,0.457255,0,0,0,0,1,0,1,0.820383
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,496.0,0.0,185.4,72.19,79.0,76.2,88.9,0.0,0.411003,0,0,0,0,0,0,0,0.857143
496,497.0,1.0,152.4,93.07,59.0,81.3,101.6,0.0,0.533465,0,0,0,0,1,0,1,0.800197
497,498.0,0.0,182.9,65.83,103.0,104.1,111.8,0.0,0.569163,1,0,0,0,1,0,0,0.931127
498,499.0,0.0,182.9,97.16,118.0,121.9,124.5,0.0,0.666484,0,1,1,0,1,0,1,0.979116


This code creates a new feature called 'WHR' (waist-to-hip ratio) in the cholestrol_data DataFrame by dividing the values in the 'waist (cm)' column by the values in the 'hip (cm)' column.

In [336]:
cholestrol_data.drop(['Height /stature (cm)','weight1 (Kg)','weight2(Kg)','waist (cm)','hip (cm)','Patient number'],axis=1, inplace=True)
cholestrol_data

Unnamed: 0,Gender,Diabetes,WSR,Chol,HDL,TChol_Ratio_new,Categorized_Glucose,Categorized_BP,Categorized_Age,Categorized_BMI,WHR
0,1.0,0.0,0.524855,0,0,0,0,0,0,0,0.842487
1,1.0,0.0,0.549869,0,0,0,0,0,0,1,0.824803
2,1.0,0.0,0.596945,0,0,0,0,0,0,1,0.888889
3,1.0,0.0,0.484010,0,0,0,0,0,0,0,0.794147
4,1.0,0.0,0.457255,0,0,0,0,1,0,1,0.820383
...,...,...,...,...,...,...,...,...,...,...,...
495,0.0,0.0,0.411003,0,0,0,0,0,0,0,0.857143
496,1.0,0.0,0.533465,0,0,0,0,1,0,1,0.800197
497,0.0,0.0,0.569163,1,0,0,0,1,0,0,0.931127
498,0.0,0.0,0.666484,0,1,1,0,1,0,1,0.979116


In [337]:
data_m = cholestrol_data.copy()

In [338]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.metrics import confusion_matrix 
from sklearn.metrics import classification_report

X = data_m.drop('Diabetes', axis = 1)
Y = data_m['Diabetes']

X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.20,random_state=42)

log = LogisticRegression()
rnd = RandomForestClassifier(n_estimators=100)
svm = SVC()

voting = VotingClassifier(estimators=[('logistics_regression', log), ('random_forest', rnd), ('support_vector_machine', svm)], voting='hard')
voting.fit(X_train, y_train)

for clf in (log, rnd, svm, voting):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
    print('Recall: ',recall_score(y_test,y_pred))
    print('Confusion Matrix: ',confusion_matrix(y_test, y_pred)) 
    print('Classification Report: ',classification_report(y_test, y_pred))

LogisticRegression 0.87
Recall:  0.3
Confusion Matrix:  [[84  6]
 [ 7  3]]
Classification Report:                precision    recall  f1-score   support

         0.0       0.92      0.93      0.93        90
         1.0       0.33      0.30      0.32        10

    accuracy                           0.87       100
   macro avg       0.63      0.62      0.62       100
weighted avg       0.86      0.87      0.87       100

RandomForestClassifier 0.93
Recall:  0.8
Confusion Matrix:  [[85  5]
 [ 2  8]]
Classification Report:                precision    recall  f1-score   support

         0.0       0.98      0.94      0.96        90
         1.0       0.62      0.80      0.70        10

    accuracy                           0.93       100
   macro avg       0.80      0.87      0.83       100
weighted avg       0.94      0.93      0.93       100

SVC 0.87
Recall:  0.2
Confusion Matrix:  [[85  5]
 [ 8  2]]
Classification Report:                precision    recall  f1-score   support

     

(1) The code separates the data into feature and target variables, X and Y respectively.

(2) It then splits the data into training and testing sets, holding out 20% of the rows for testing.

(3) Three classifiers are used: logistic regression, random forest, and a support vector machine. They are also combined in a hard-voting ensemble.

(4) All classifiers are then fitted on the training data, and predictions are made on the test data.

(5) The code then calculates the accuracy, recall, confusion matrix, and classification report for each classifier. These measures helped me evaluate the performance of the classifiers and choose the best one for the task.
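To show how the reported metrics relate to one another, the sketch below recomputes accuracy, recall and precision from the confusion matrix printed for the random forest above. It assumes scikit-learn's layout, in which rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix reported for the random forest:
# rows are true classes (0, 1), columns are predicted classes (0, 1).
cm = np.array([[85, 5],
               [2, 8]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
recall = tp / (tp + fn)           # share of true diabetes cases that were found
precision = tp / (tp + fp)        # share of predicted diabetes cases that are right

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
# prints: accuracy=0.93 recall=0.80 precision=0.62
```

With only 10 positive cases in the test set, each missed case moves recall by 0.10, which is why recall separates the classifiers more clearly than accuracy does.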

In [339]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

X = data_m.drop('Diabetes', axis = 1)

Y = data_m['Diabetes']

X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.20,random_state=42)

clf = RandomForestClassifier(n_estimators=100)

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
print(recall_score(y_test,y_pred))
print(confusion_matrix(y_test, y_pred)) 
print(classification_report(y_test, y_pred))

X.head()

RandomForestClassifier 0.93
0.8
[[85  5]
 [ 2  8]]
              precision    recall  f1-score   support

         0.0       0.98      0.94      0.96        90
         1.0       0.62      0.80      0.70        10

    accuracy                           0.93       100
   macro avg       0.80      0.87      0.83       100
weighted avg       0.94      0.93      0.93       100



Unnamed: 0,Gender,WSR,Chol,HDL,TChol_Ratio_new,Categorized_Glucose,Categorized_BP,Categorized_Age,Categorized_BMI,WHR
0,1.0,0.524855,0,0,0,0,0,0,0,0.842487
1,1.0,0.549869,0,0,0,0,0,0,1,0.824803
2,1.0,0.596945,0,0,0,0,0,0,1,0.888889
3,1.0,0.48401,0,0,0,0,0,0,0,0.794147
4,1.0,0.457255,0,0,0,0,1,0,1,0.820383


In [340]:
data_m

Unnamed: 0,Gender,Diabetes,WSR,Chol,HDL,TChol_Ratio_new,Categorized_Glucose,Categorized_BP,Categorized_Age,Categorized_BMI,WHR
0,1.0,0.0,0.524855,0,0,0,0,0,0,0,0.842487
1,1.0,0.0,0.549869,0,0,0,0,0,0,1,0.824803
2,1.0,0.0,0.596945,0,0,0,0,0,0,1,0.888889
3,1.0,0.0,0.484010,0,0,0,0,0,0,0,0.794147
4,1.0,0.0,0.457255,0,0,0,0,1,0,1,0.820383
...,...,...,...,...,...,...,...,...,...,...,...
495,0.0,0.0,0.411003,0,0,0,0,0,0,0,0.857143
496,1.0,0.0,0.533465,0,0,0,0,1,0,1,0.800197
497,0.0,0.0,0.569163,1,0,0,0,1,0,0,0.931127
498,0.0,0.0,0.666484,0,1,1,0,1,0,1,0.979116


In [341]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

X2 = mod_data.drop(['Diabetes'],axis=1)
Y2 = mod_data['Diabetes']

bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X2,Y2)

dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X2.columns)

featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score'] 
print(featureScores.nlargest(15,'Score'))

                   Specs        Score
2        Glucose (mg/dl)  2751.035511
0         Patient number   965.735415
5                    Age   320.235926
1    Cholesterol (mg/dl)   181.848171
13            waist (cm)    48.819581
9            weight2(Kg)    43.576859
11           Systolic BP    41.343451
3       HDL Chol (mg/dl)    31.413330
4        TChol/HDL ratio    20.619178
14              hip (cm)    19.417746
8           weight1 (Kg)    17.626782
10          BMI (Kg/m^2)     6.183316
12          Diastolic BP     6.016456
15                   WSR     0.285323
6                 Gender     0.030586


In [342]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.metrics import confusion_matrix 
from sklearn.metrics import classification_report

list_features = ['Glucose (mg/dl)','Age',' Cholesterol (mg/dl)','waist (cm)','weight2(Kg)']
X3 = mod_data[list_features]

Y3 = mod_data['Diabetes']

X_train, X_test, y_train, y_test = train_test_split(X3,Y3, test_size=0.20,random_state=42)

clf = RandomForestClassifier(n_estimators=100)

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
print(recall_score(y_test,y_pred))
print(confusion_matrix(y_test, y_pred)) 
print(classification_report(y_test, y_pred))

mod_data

RandomForestClassifier 0.95
0.9
[[86  4]
 [ 1  9]]
              precision    recall  f1-score   support

         0.0       0.99      0.96      0.97        90
         1.0       0.69      0.90      0.78        10

    accuracy                           0.95       100
   macro avg       0.84      0.93      0.88       100
weighted avg       0.96      0.95      0.95       100



Unnamed: 0,Patient number,Cholesterol (mg/dl),Glucose (mg/dl),HDL Chol (mg/dl),TChol/HDL ratio,Age,Gender,Height /stature (cm),weight1 (Kg),weight2(Kg),BMI (Kg/m^2),Systolic BP,Diastolic BP,waist (cm),hip (cm),Diabetes,WSR
0,1.0,193.0,77.0,49.0,3.9,19.0,1.0,154.9,54.93,54.0,22.88,118.0,70.0,81.3,96.5,0.0,0.524855
1,2.0,146.0,79.0,41.0,3.6,19.0,1.0,152.4,98.97,61.0,42.61,108.0,58.0,83.8,101.6,0.0,0.549869
2,3.0,217.0,75.0,54.0,4.0,20.0,1.0,170.2,116.22,85.0,40.13,110.0,72.0,101.6,114.3,0.0,0.596945
3,4.0,226.0,97.0,70.0,3.2,20.0,1.0,162.6,54.03,52.0,20.44,122.0,64.0,78.7,99.1,0.0,0.484010
4,5.0,164.0,91.0,67.0,2.4,20.0,1.0,177.8,83.08,64.0,26.28,122.0,86.0,81.3,99.1,0.0,0.457255
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,496.0,155.0,58.0,69.0,2.2,26.0,0.0,185.4,72.19,79.0,21.00,110.0,76.0,76.2,88.9,0.0,0.411003
496,497.0,179.0,90.0,60.0,3.0,26.0,1.0,152.4,93.07,59.0,40.07,138.0,84.0,81.3,101.6,0.0,0.533465
497,498.0,283.0,83.0,74.0,3.8,26.0,0.0,182.9,65.83,103.0,19.68,158.0,104.0,104.1,111.8,0.0,0.569163
498,499.0,228.0,79.0,37.0,6.2,26.0,0.0,182.9,97.16,118.0,29.05,122.0,90.0,121.9,124.5,0.0,0.666484


Methodology:

(1) Data Cleaning and Preprocessing: The first step is to load the required libraries: numpy, pandas, and the scikit-learn preprocessing and imputation modules. The data is then read into a pandas DataFrame, and zero values are replaced with NaN so that they are treated as missing. The 'Gender' and 'Diabetes' columns are then encoded as 0s and 1s.

(2) Handling Missing Data: The NaN values are then replaced with the median of each column using the SimpleImputer class from scikit-learn.

(3) Handling Outliers: A threshold of the mean plus three standard deviations is computed for glucose, but outliers are not removed.

(4) Feature Engineering: New features such as 'WSR' (Waist-to-Stature Ratio) and 'WHR' (Waist-to-Hip Ratio) are added.

(5) Feature Categorization: Continuous measurements (cholesterol, HDL, TChol/HDL ratio, glucose, blood pressure, age, and BMI) are then converted into binary bins.

(6) Dropping columns: The dataframe's unnecessary columns are then dropped.

(7) Modeling: The final DataFrame is then used to train the models. Logistic regression, random forest, and support vector machine classifiers are trained individually and also combined in a hard-voting classifier. The data is divided into training and testing sets, and the accuracy, recall, confusion matrix, and classification report for each classifier are calculated.

(8) Feature selection: Feature selection is then performed on the imputed, unbinned data (mod_data) using the chi-squared statistical test, which requires non-negative features.

(9) Modeling with selected features: A random forest classifier is then trained on the selected features, and predictions are produced on the test data. The accuracy, recall, confusion matrix, and classification report for the classifier are then calculated.
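The chi-squared selection step in (8) can be sketched on synthetic data as follows. The data and the column names informative and noise are invented for the example; only the SelectKBest and chi2 calls mirror the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)

# Synthetic, non-negative features: one shifts with the class, one does not.
X = pd.DataFrame({
    "informative": 80 + 60 * y + rng.normal(0, 5, size=n),
    "noise": rng.uniform(50, 100, size=n),
})

selector = SelectKBest(score_func=chi2, k=1).fit(X, y)

# Pair each score with its column name and sort from highest to lowest.
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores)
print(list(X.columns[selector.get_support()]))
```

Because the chi-squared statistic grows with the magnitude of a feature, a column with large raw values, such as a patient identifier, can receive a high score without being meaningful, so the ranked list is best read alongside domain knowledge.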

To conclude - The random forest classifier was the best of the three classifiers tested on the processed dataset (accuracy 0.93, recall 0.80). It was then trained on the imputed, unbinned dataset using five of the top-scoring features from SelectKBest (Glucose, Age, Cholesterol, waist, and weight2; the Patient number identifier was left out). This model reached an accuracy of 0.95 and a recall of 0.90, both higher than with the processed dataset. For this task, selecting the top-scoring features first and training on them was therefore more effective than training on the full set of engineered, binned features.

References -

Wang, S., Ma, W., Yuan, Z., Wang, S., Yi, X., Jia, H. and Xue, F. (2016). Association between obesity indices and type 2 diabetes mellitus among middle-aged and elderly people in Jinan, China: a cross-sectional study. BMJ Open, 6(11), p.e012742. doi:10.1136/bmjopen-2016-012742.

Hillier, T.A. and Pedula, K.L. (2001). Characteristics of an Adult Population With Newly Diagnosed Type 2 Diabetes. Diabetes Care, 24(9), pp.1522–1527. doi:10.2337/diacare.24.9.1522.

Lee, M.-K., Han, K. and Kwon, H.-S. (2019). Age-specific diabetes risk by the number of metabolic syndrome components: a Korean nationwide cohort study. Diabetology & Metabolic Syndrome, 11(1). doi:10.1186/s13098-019-0509-8.

Mallya, V. 'Everything you wanted to know about - cholesterol, lipid profile, VLDL, HDL, and triglycerides'.

Dwivedi, K. (2022). How to Calculate BMI? Just Enter Your Weight & Height.