# Predicting early-stage diabetes

We go through the following steps to predict early-stage diabetes:
1. Data wrangling
2. Exploratory data analysis
3. Data preprocessing and feature importance analysis
4. Evaluating machine learning models
5. Creating an ensemble model for robustness


## Load the data set into memory

This is a very small dataset of only 520 instances that takes up approximately 70 KB, so we can load all of it into memory. Huge datasets, by contrast, need to be processed chunk by chunk.
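For reference, chunk-by-chunk processing could look roughly like the sketch below. It is not needed for this dataset; the in-memory CSV text and the chunk size of 2 are illustrative assumptions that stand in for a file too large to load at once.

```python
import io

import pandas as pd

# Hypothetical toy CSV standing in for a very large file.
csv_text = (
    "Age,class\n"
    "40,Positive\n"
    "58,Positive\n"
    "41,Negative\n"
    "45,Negative\n"
    "60,Positive\n"
)

total_rows = 0
positive_count = 0

# With chunksize set, read_csv yields DataFrames of at most that many rows
# instead of returning one large DataFrame.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)
    positive_count += int((chunk["class"] == "Positive").sum())

print(total_rows, positive_count)
```

Only running aggregates are kept between iterations, so memory use is bounded by the chunk size rather than the file size.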

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

df = pd.read_csv("/kaggle/input/early-stage-diabetes-risk-prediction-dataset/diabetes_data_upload.csv")
df.head()

## Data Wrangling

Data wrangling is the process of gathering, selecting, and transforming data to answer an analytical question. It is also known as data cleaning or munging. Legend has it that wrangling takes up as much as 80% of an analytics professional's time, leaving only 20% for exploration and modeling.

This dataset, however, is pre-cleaned and has no missing data, so this step is easy and there is not much to do.

In [None]:
# Check the data types of the columns and if there is any null value.
df.info()
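`df.info()` reports non-null counts per column. An explicit count of missing values can make the same check easier to read. The sketch below uses a small hypothetical frame with deliberately missing entries, not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
toy = pd.DataFrame({
    "Age": [40, 58, np.nan],
    "Polyuria": ["No", None, "Yes"],
})

# Number of missing entries per column.
missing_per_column = toy.isna().sum()
# True if any cell in the frame is missing.
has_missing = bool(toy.isna().any().any())

print(missing_per_column)
print(has_missing)
```

On a clean dataset every count would be zero and `has_missing` would be `False`.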

## Exploratory Data Analysis

In this section, we look for trends in the dataset using data visualization and statistics.

We can see that every column except *Age* holds binary categorical values, so we first encode them with *LabelEncoder*. Because these values are binary categories, [*pandas.DataFrame.describe*](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) is not very informative for them: its descriptive statistics, which summarize the central tendency, dispersion and shape of a dataset’s distribution (excluding NaN values), are meant for numeric data.

Instead, we can analyze the frequency of attributes across age groups and genders.

In [None]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
for column in df.columns[1:]:
    df[column] = label_encoder.fit_transform(df[column])
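To see which integer each label receives, the fitted encoder can be inspected. The sketch below assumes the target labels are the strings "Positive" and "Negative"; *LabelEncoder* sorts the classes, so the alphabetically first label maps to 0.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# Assumed label strings; the real column values may differ.
encoded = encoder.fit_transform(["Positive", "Negative", "Positive"])

# classes_ holds the sorted labels; the position of a label is its code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mapping)
print(list(encoded))
```

Under this assumption the positive class is encoded as 1, so the mean of the encoded target equals the share of positive cases.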

In [None]:
# descriptive statistics of the Age column
df.Age.describe()

In [None]:
# Distribution of patients age. 
sns.displot(x='Age', kind ='hist', data= df, bins = 10, kde = True);
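The frequency analysis over age groups and gender mentioned earlier could be sketched as follows. The toy frame, the bin edges, and the group labels are illustrative assumptions; the same calls would apply to `df` with its actual column names.

```python
import pandas as pd

# Hypothetical toy data; Polyuria is already encoded as 0/1.
toy = pd.DataFrame({
    "Age": [25, 34, 47, 52, 61, 68],
    "Gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
    "Polyuria": [0, 1, 1, 0, 1, 1],
})

# Bucket ages into three groups: (0, 40], (40, 60], (60, 120].
toy["age_group"] = pd.cut(
    toy["Age"], bins=[0, 40, 60, 120], labels=["<=40", "41-60", ">60"]
)

# Share of patients with the symptom in each age group.
by_age = toy.groupby("age_group", observed=False)["Polyuria"].mean()
# Symptom counts split by gender.
by_gender = pd.crosstab(toy["Gender"], toy["Polyuria"])

print(by_age)
print(by_gender)
```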

## Data Preprocessing and Feature Importance Analysis

The dataset is already preprocessed, so there is not much to do. We can still check which features matter most in two ways: computing the Pearson correlation between each feature and the class, and applying feature importance techniques. Dropping the least important features can reduce computational cost.


### Correlation analysis

Correlation is not causation, but it lets us infer which features contribute most to defining the class. We can then drop relatively insignificant features to reduce the load on the machine learning models. We use the Pearson correlation coefficient.

In [None]:
# Absolute correlation of each feature with the class, sorted from strongest to weakest
np.abs(df.iloc[:,:-1].corrwith(df['class'])).sort_values(ascending = False)
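As a sanity check on what `corrwith` computes, the Pearson coefficient can be written out by hand and compared. The toy columns below are hypothetical, and `pearson_r` is a helper introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical encoded features and target.
toy = pd.DataFrame({
    "Polyuria": [1, 1, 0, 0, 1, 0],
    "Itching": [1, 0, 1, 0, 1, 0],
    "class": [1, 1, 0, 0, 1, 1],
})


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


manual = pearson_r(toy["Polyuria"], toy["class"])
via_pandas = toy[["Polyuria", "Itching"]].corrwith(toy["class"])

print(manual)
print(via_pandas)
```

The hand-written value for `Polyuria` should match the corresponding entry returned by pandas.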

### Feature importance

Feature importance is analyzed using *RandomForestClassifier* from *sklearn.ensemble*. This is done in the Predictive Analytics section below. There we can see that *Age* is one of the most important factors even though it is only weakly correlated with the target variable *class*.

So, we should not rely on only one method when determining which features are important for predicting the target.
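One way to cross-check the two methods above is permutation importance, which is not used elsewhere in this notebook and is shown here only as an optional third view. The synthetic data below is an assumption: the first column fully determines the target and the second is pure noise.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n_samples = 300

# Synthetic binary features: one informative, one noise.
informative = rng.integers(0, 2, size=n_samples)
noise = rng.integers(0, 2, size=n_samples)
X = np.column_stack([informative, noise])
y = informative  # target depends only on the first column

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Shuffle one column at a time and measure how much the score drops.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)

print(result.importances_mean)
```

A feature whose shuffling barely changes the score contributes little to the model, whatever its impurity-based importance suggests.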

## Predictive Analytics

In predictive analytics, we need a strategy for building a robust classifier that estimates the likelihood of a person having early-stage diabetes from the features of previously unseen patients. In this project, our goal is to minimize false negatives (Type II errors), even if that reduces overall accuracy.

In this step we need to:
* split the dataset into training and testing datasets
* find out the classification accuracy of a naive classifier
* compare different classifiers using F1 score, precision, recall, accuracy and confusion matrix
* create an ensemble classifier using the top three classifiers
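The metrics in the list above all derive from the four cells of the confusion matrix. The sketch below uses hypothetical labels and predictions to show that relationship and checks the hand-computed values against scikit-learn.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical true labels and predictions (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

precision = tp / (tp + fp)  # how many predicted positives are correct
recall = tp / (tp + fn)     # how many actual positives are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)
print(precision, recall, f1, accuracy)
```

The `fn` cell counts false negatives (Type II errors), so a lower `fn` means a higher recall.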

In [None]:
# importing necessary libraries
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.model_selection import KFold, ShuffleSplit
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
predictors = df.drop(['class'], axis= 1)
target = df['class']
x_train, x_val, y_train, y_val = train_test_split(predictors, target, test_size = 0.10, random_state = 0, stratify=target)
print(x_train.shape, y_train.shape)
print(x_val.shape, y_val.shape)
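The `stratify` argument keeps the class ratio the same in both splits. A minimal sketch on hypothetical data with a 60/40 class balance illustrates the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 60 positive and 40 negative.
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 60 + [0] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.10, random_state=0, stratify=y
)

# Both splits keep the 60% positive share.
print(y_tr.mean(), y_te.mean())
```

Without stratification, a 10% hold-out from a small dataset could end up with a noticeably different class balance than the training set.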

## Creating a Baseline Regression Model

*DummyRegressor* allows us to create a very simple model that we can use as a baseline to compare against our actual model. This can often be useful to simulate a “naive” existing prediction process in a product or system.

*score* returns the coefficient of determination, $R^2$. The closer $R^2$ is to 1, the more of the variance in the target vector is explained by the features. A regressor that always predicts the training mean scores close to 0, so any useful model should score well above that.

In [None]:
from sklearn.dummy import DummyRegressor

dummy = DummyRegressor(strategy='mean')
dummy.fit(x_train, y_train)
dummy.score(x_val, y_val)

**Null accuracy:** the accuracy achieved by always predicting the most frequent class. It tells us how well a classifier would score if it only ever predicted one class (here, the positive class), so we can use it as our baseline accuracy.

In [None]:
y_train.value_counts()

In [None]:
y_train.mean()

So, we need to develop a classifier with more than 62% accuracy and the lowest possible false negative rate.
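The same null accuracy can also be obtained from a classifier-style baseline. The sketch below uses *DummyClassifier* with the `most_frequent` strategy on hypothetical data with a 60/40 class balance; it is an illustration, not part of the original pipeline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical data: the features are ignored by the dummy model.
X = np.zeros((10, 1))
y = np.array([1] * 6 + [0] * 4)

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
null_accuracy = baseline.score(X, y)

# Equivalent manual calculation: share of the majority class.
majority_share = max(y.mean(), 1 - y.mean())

print(null_accuracy, majority_share)
```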

## Developing classifiers

Now, to develop the best classifier, we analyze several classifiers and ensemble the top three with soft voting to get a robust decision and lower bias.

In [None]:
classifiers_description = {"model":[],"precision":[], "recall":[],"f1-score":[], "accuracy":[], "standard_deviation" :[]}

In [None]:
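def model_accuracy(classifier=None, predictors=None, target=None, n_splits=10):
    # Helper for model evaluation: cross-validates the classifier, plots the
    # confusion matrix, and records summary metrics in classifiers_description.
    global classifiers_description

    # Use the n_splits argument instead of a hard-coded fold count.
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=1)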
    
    y_pred = cross_val_predict(classifier, predictors, target, cv=kf)
    scores = cross_val_score(classifier, predictors, target, cv=kf)
    
    # plotting confusion matrix

    cf_matrix = confusion_matrix(target, y_pred)
    sns.set_style('ticks')
    fig, ax = plt.subplots()
    sns.heatmap(cf_matrix,annot=True, ax=ax, fmt='g', cmap='Blues')

    #making classifier description report
    report = classification_report(target, y_pred, output_dict=True)
    classifier_name = type(classifier).__name__
    if classifier_name not in classifiers_description["model"]:
        classifiers_description["model"].append(classifier_name)
        classifiers_description["precision"].append(report['weighted avg']["precision"])
        classifiers_description["recall"].append(report['weighted avg']["recall"])
        classifiers_description["f1-score"].append(report['weighted avg']["f1-score"])
        classifiers_description["accuracy"].append(scores.mean())
        classifiers_description["standard_deviation"].append(scores.std())   

    
    print(classification_report(target, y_pred))

    return (scores.mean(), scores.std())

In [None]:
from sklearn.naive_bayes import GaussianNB
clf_GNB = GaussianNB()
model_accuracy(classifier=clf_GNB, predictors=x_train, target=y_train)

In [None]:
from sklearn.linear_model import LogisticRegression
clf_LR = LogisticRegression(max_iter=1000)
model_accuracy(classifier=clf_LR, predictors=x_train, target=y_train)

In [None]:
from sklearn.tree import DecisionTreeClassifier
clf_DT = DecisionTreeClassifier()
model_accuracy(classifier=clf_DT, predictors=x_train, target=y_train)

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf_RF = RandomForestClassifier()
model_accuracy(classifier=clf_RF, predictors=x_train, target=y_train)

We can also analyze the important features for this classification problem using the Random Forest classifier.

In [None]:
# plotting feature importances

from sklearn.ensemble import RandomForestClassifier

# fitting the model
model_RF = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
model_RF.fit(x_train, y_train)

features = df.drop('class', axis=1).columns
importances = model_RF.feature_importances_
indices = np.argsort(importances)

plt.figure()
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
clf_GBC = GradientBoostingClassifier()
model_accuracy(classifier=clf_GBC, predictors=x_train, target=y_train)

In [None]:
from sklearn.neural_network import MLPClassifier
MLP_clf = MLPClassifier(random_state=1, max_iter=10000)
model_accuracy(classifier=MLP_clf, predictors=x_train, target=y_train)

## Comparing classifiers for the voting classifier

In [None]:
classifier_df = pd.DataFrame.from_dict(classifiers_description)
# Highest F1 score first; among ties, the lowest standard deviation first.
classifier_df.sort_values(by=["f1-score","standard_deviation"], ascending=[False, True])

## Developing a Voting Classifier by ensembling top three models

From the above dataframe and confusion matrices we can see that the RandomForest, GradientBoosting, and DecisionTree classifiers give better F1 scores and fewer false negatives. So, we can ensemble them into a robust classifier using soft voting. Ensemble methods are techniques that create multiple models and then combine them to produce improved results. Soft voting averages the class probabilities predicted by each model, optionally weighted, and picks the class with the highest average.
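To make the soft-voting rule concrete, the sketch below fits a small *VotingClassifier* on synthetic data and reproduces its probabilities by hand as a weighted average of the member models' probabilities. The data from `make_classification`, the small estimator sizes, and the weights are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

weights = [3, 1, 1]
voter = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=20, random_state=0)),
        ("dt", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    voting="soft",
    weights=weights,
).fit(X, y)

# Probabilities from each fitted member, shape (n_models, n_samples, n_classes).
per_model = np.array([est.predict_proba(X) for est in voter.estimators_])
# Soft voting: weighted average over the models.
manual_proba = np.average(per_model, axis=0, weights=weights)
# The predicted class is the one with the highest averaged probability.
manual_pred = manual_proba.argmax(axis=1)

print(np.allclose(manual_proba, voter.predict_proba(X)))
```

A larger weight on one model pulls the averaged probability toward that model's opinion, which is why the weights matter when the members disagree.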

In [None]:
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import GridSearchCV

clf1 = RandomForestClassifier(random_state=1)
clf2 = GradientBoostingClassifier(random_state=1)
clf3 =  DecisionTreeClassifier(random_state=1)


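# Single-value grid: each list holds one fixed setting.
# "log_loss" and "sqrt" replace "deviance" and "auto", which were removed in
# scikit-learn 1.3 (for these classifiers "auto" meant sqrt(n_features)).
# "log_loss" requires scikit-learn >= 1.1.
params={"rf__max_depth":[8],
        "rf__criterion":["entropy"],
        "rf__n_estimators":[1000],
        "gb__loss":["log_loss"],
        "gb__n_estimators":[1000],
        "gb__criterion":["friedman_mse"],
        "gb__max_depth":[2],
        "gb__max_features":["sqrt"],
        "dt__max_features":["sqrt"],
        "dt__criterion":["gini"],
        "dt__max_depth":[16]
        }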

eclf = VotingClassifier(estimators=[("rf", clf1), ("gb", clf2), ("dt", clf3)],
                       voting= 'soft', weights = [3,1,1])
grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5)
grid.fit(x_train, y_train)
grid_best = grid.best_estimator_
model_accuracy(classifier=grid_best, predictors=x_train, target=y_train)

From the confusion matrix, we can see that the weighted voting classifier reduces the Type II error (false negatives). Now let's check how it performs on the validation dataset.

In [None]:
y_pred = grid_best.predict(x_val)
print(classification_report(y_val, y_pred))

## Conclusion

We can see that the Random Forest classifier alone achieves high accuracy, but with the help of the other two classifiers we are able to reduce the Type II error. Our main goal was to reduce the Type II error, and we achieved it at the cost of some accuracy.

In [None]:
classifier_df = pd.DataFrame.from_dict(classifiers_description)
# Highest F1 score first; among ties, the lowest standard deviation first.
classifier_df.sort_values(by=["f1-score","standard_deviation"], ascending=[False, True])