# Data Science Capstone (Health Care): Submission by Vijaya Ahire

### *Project Task: Week 1*
##### Data Exploration:

1. Perform descriptive analysis. Understand the variables and their corresponding values. In the columns below, a value of zero does not make sense and therefore indicates a missing value:
- Glucose
- BloodPressure
- SkinThickness
- Insulin
- BMI

2. Visually explore these variables using histograms. Treat the missing values accordingly.

3. There are integer and float data type variables in this dataset. Create a count (frequency) plot describing the data types and the count of variables. 

### Importing Python Libraries

In [None]:
!pip install pandas-profiling

In [None]:
!pip install dabl

In [None]:
!pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

In [None]:
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport

import dabl
import missingno as msno
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression 
from sklearn.ensemble import GradientBoostingClassifier
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
plt.rcParams.update({'figure.figsize': (12.0, 8.0)})
plt.rcParams.update({'font.size': 14})

### Reading Dataset using Pandas

In [None]:
df = pd.read_csv("/content/diabetes.csv")

In [None]:
df.head()

In [None]:
df.isnull().sum()

In [None]:
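zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']  # 0 is a valid value in Pregnancies and Outcome
df[zero_cols] = df[zero_cols].replace(0, np.nan)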

In [None]:
df.isnull().sum()

In [None]:
from sklearn.impute import KNNImputer

In [None]:
imputer = KNNImputer(n_neighbors=5)

In [None]:
df = pd.DataFrame(imputer.fit_transform(df),columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age','Outcome'])

In [None]:
df

In [None]:
df.isnull().sum()
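The zero-to-NaN replacement and KNN imputation steps can be sketched on a small frame. The rows below are made up for illustration (they are not taken from `diabetes.csv`), and `n_neighbors=2` is chosen only because the toy frame is tiny. The point of the sketch is that zeros are treated as missing only in the measurement columns, so `Pregnancies` and `Outcome` keep their legitimate zeros.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up rows for illustration; zeros in Glucose/Insulin/BMI stand for missing readings.
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 1, 4, 0, 3],
    "Glucose": [120, 0, 95, 160, 110, 0],
    "Insulin": [80, 95, 0, 200, 0, 130],
    "BMI": [28.1, 31.4, 0.0, 35.2, 24.9, 30.0],
    "Outcome": [0, 1, 0, 1, 0, 1],
})

# Mark zeros as missing only where a zero is physiologically implausible.
zero_means_missing = ["Glucose", "Insulin", "BMI"]
toy[zero_means_missing] = toy[zero_means_missing].replace(0, np.nan)
print(toy.isnull().sum())

# Fill each gap with the mean of that column over the 2 nearest rows.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(imputed)
```

Note that `KNNImputer` returns floating-point values, so integer columns come back as `float64` after this step.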

### Exploratory Data Analysis using Pandas Profiling (Generating HTML Report)

In [None]:
design_report = ProfileReport(df)

#### Saving Pandas Profiling Data Analysis Report

In [None]:
design_report.to_file(output_file='data_analysis_report11.html')

### Exploratory Data Analysis using Data Analysis Baseline Library (DABL)

In [None]:
df_clean = dabl.clean(df, verbose=1)
df_clean

In [None]:
types = dabl.detect_types(df_clean)
types

In [None]:
dabl.plot(df, target_col="Outcome")

### Treating "?" Values as Missing (NULL) Values

In [None]:
df = df.replace('?', np.nan)
df.isnull().sum()

In [None]:
print("Total Missing Values Count: ",df.isnull().sum().sum())

Total Missing Values Count: 0




# Feature Engineering

In [None]:
df.dtypes
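Task 3 of the Week 1 list asks for a count of variables per data type. A minimal sketch of that count follows, using a toy frame with made-up values whose column names follow the diabetes dataset. It is likely more informative on the raw CSV than on the imputed frame, because `KNNImputer` returns every column as a float.

```python
import pandas as pd

# Toy frame with made-up values; column names follow the diabetes dataset.
toy = pd.DataFrame({
    "Pregnancies": [1, 0, 3],
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672],
    "Outcome": [1, 0, 1],
})

# Count how many columns share each dtype (integer vs. float here).
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)
```

Calling `dtype_counts.plot(kind="bar")` on the result would draw the frequency plot in the notebook.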

In [None]:
numerical_df = df.select_dtypes(include=['number'])
numerical_df.head()

In [None]:
categorical_df = df.select_dtypes(include=['object'])
categorical_df.head()

### Perform Label Encoding on Categorical Features

In [None]:
categorical_df = categorical_df.apply(LabelEncoder().fit_transform)

In [None]:
df = pd.concat([categorical_df,numerical_df],axis = 1)
df.shape

In [None]:
df.head()

### Perform Feature Scaling using StandardScaler

In [None]:
df.columns

In [None]:
features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction']
autoscaler = StandardScaler()
df[features] = autoscaler.fit_transform(df[features])
df.head()
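The cell above standardizes the whole frame in one pass. A common alternative, sketched here on synthetic data rather than the diabetes frame, is to fit the scaler on the training rows only and reuse those statistics for the test rows, so that the test data does not influence the mean and standard deviation. The array shapes and the `random_state` values are arbitrary choices for the sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature columns (not the diabetes data).
rng = np.random.default_rng(0)
X = rng.normal(loc=100.0, scale=15.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```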

In [None]:
main_df = df.copy()

# Machine Learning Model Building Approaches
## 1. Baseline Model using DABL SimpleClassifier()

In [None]:
ec = dabl.SimpleClassifier(random_state=0).fit(df, target_col="Outcome")
ec

### Feature Importance using DABL on Best Model: Logistic Regression

In [None]:
dabl.explain(ec) 

## 2. Ensemble Learning Model: Random Forest

Training data: 80%, test data: 20%

In [None]:
# Work on a copy so later steps do not modify df
df_new = df.copy()

In [None]:
df_new.columns

In [None]:
df_new.iloc[:,8:9]

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
RSEED = 10

# Separate the label from the features; leaving 'Outcome' in the inputs would leak the answer to the model
target = df_new['Outcome'].astype(int).to_numpy()
X = df_new.drop(columns=['Outcome'])
# Hold out 20% of the examples as test data
train, test, train_labels, test_labels = train_test_split(X, target, test_size=0.2, random_state=RSEED)
# Create the model with 100 trees
model = RandomForestClassifier(n_estimators=100, random_state=RSEED, n_jobs=-1, verbose=1)
# Fit on training data
model.fit(train, train_labels)

### Statistics about the trees in Random Forest

In [None]:
n_nodes = []
max_depths = []

# Stats about the trees in random forest
for ind_tree in model.estimators_:
    n_nodes.append(ind_tree.tree_.node_count)
    max_depths.append(ind_tree.tree_.max_depth)
    
print(f'Average number of nodes {int(np.mean(n_nodes))}')
print(f'Average maximum depth {int(np.mean(max_depths))}')

In [None]:
# Training predictions (to demonstrate overfitting)
train_rf_predictions = model.predict(train)
train_rf_probs = model.predict_proba(train)[:, 1]

# Testing predictions (to determine performance)
rf_predictions = model.predict(test)
rf_probs = model.predict_proba(test)[:, 1]
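The training predictions are meant to expose overfitting. One way to make that visible is to compare accuracy on the training and test splits. The sketch below does this on synthetic data from `make_classification`, not the diabetes data, and its sample sizes and seeds are arbitrary. A large gap between the two numbers would suggest the forest has memorized the training rows.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data as a stand-in for the real features.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=10
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

forest = RandomForestClassifier(n_estimators=50, random_state=10)
forest.fit(X_train, y_train)

# Fully grown trees usually fit the training rows almost perfectly.
train_acc = accuracy_score(y_train, forest.predict(X_train))
test_acc = accuracy_score(y_test, forest.predict(X_test))
print(f"Train accuracy: {train_acc:.2%}")
print(f"Test accuracy:  {test_acc:.2%}")
```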

### Feature Importances given by Random Forest

In [None]:
# Extract feature importances
features = pd.DataFrame()
features['Feature'] = train.columns
features['Importance'] = model.feature_importances_
features.sort_values(by=['Importance'], ascending=False, inplace=True)
print(features)
features.set_index('Feature', inplace=True)
features.plot(kind='bar')

## Model Evaluation using Various Metrics
### ROC Curve

In [None]:
from plot_metric.functions import BinaryClassification
# Visualisation with plot_metric
bc = BinaryClassification(test_labels, rf_probs, labels=["Class 1", "Class 2"])

# Figures
plt.figure()
bc.plot_roc_curve()
plt.show()
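If `plot_metric` is not available, the points of the ROC curve and the area under it can also be computed with scikit-learn directly. The sketch below uses four made-up labels and scores purely for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# False-positive and true-positive rates at each score threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", auc)  # 3 of the 4 positive/negative pairs are ranked correctly, so AUC = 0.75
```

In the notebook, `test_labels` and `rf_probs` would take the place of `y_true` and `y_score`.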

### Accuracy, Confusion Matrix, Precision, Recall, F1-Score

In [None]:
print("Accuracy {:.2%}".format(accuracy_score(test_labels,rf_predictions)))
print("Classification Report")
print (classification_report(test_labels,rf_predictions))
print ("Confusion matrix")
print (confusion_matrix(test_labels,rf_predictions))

In [None]:
plt.matshow(confusion_matrix(test_labels,rf_predictions), cmap=plt.cm.copper, interpolation='nearest')
plt.title('confusion matrix')
plt.colorbar()
plt.ylabel('actual labels')
plt.xlabel('predicted labels')
plt.show()
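The figures in the classification report follow directly from the four cells of the confusion matrix. The sketch below derives accuracy, precision and recall by hand from made-up labels, assuming the binary case with class 1 as the positive class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1, 0])

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```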

## 3. Running Grid Search on Multiple Algorithms with Hyperparameter Tuning

In [None]:
# Defining RFC(random forest classifier) model hyper-parameters
rfc_models = RandomForestClassifier()
rfc_params = {'n_estimators': [75,100,120],  
                      'max_depth': [25,30,40],
                      'min_samples_leaf': [4,6],
                      'min_samples_split': [4,6]}

# Defining LR(logistic regression) model hyper-parameters
lr_models = LogisticRegression()
lr_params = {'C': [0.1, 0.01],
                     'tol': [0.001, 0.01],
                     'max_iter': [1000, 2000]}

# Defining GBC(gradient boosting classifier) model hyper-parameters
gbc_models = GradientBoostingClassifier()
gbc_params = {'n_estimators': [25,50,100], 
              'learning_rate':[0.1,0.2,0.3],
                      'max_depth': [25,30],
                      'min_samples_leaf': [2,4],
                      'min_samples_split': [4,6,3]}


grid = zip([rfc_models,lr_models,gbc_models],[rfc_params,lr_params,gbc_params])

best_clf = None
# perform grid search and select the model with best cv set scores
for model_pipeline, param in grid:
    temp = GridSearchCV(model_pipeline, param_grid=param, cv=3, n_jobs=-1)
    temp.fit(train, train_labels)
    if best_clf is None:
        best_clf = temp
    else:
        if temp.best_score_ > best_clf.best_score_:
            best_clf = temp
print ("Best CV Score",best_clf.best_score_)
print ("Model Parameters",best_clf.best_params_)
print("Best Estimator",best_clf.best_estimator_)
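The selection loop above keeps whichever grid search reaches the highest mean cross-validation score. The same idea can be written more compactly, as sketched here on synthetic data with two estimators and deliberately small grids. The grids and seeds are illustrative, not the ones used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training split (not the diabetes data).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

candidates = [
    (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1]}),
    (RandomForestClassifier(random_state=0),
     {"n_estimators": [20, 40], "max_depth": [3, 5]}),
]

# Fit one grid search per estimator, then keep the one with the best mean CV score.
searches = [
    GridSearchCV(est, param_grid=grid, cv=3).fit(X, y) for est, grid in candidates
]
best = max(searches, key=lambda s: s.best_score_)

print(type(best.best_estimator_).__name__)
print(best.best_params_, round(best.best_score_, 3))
```

Because `best_score_` is computed on held-out folds of the training data, the test split stays untouched until the final evaluation.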

## Best Model Prediction and Evaluation by Grid Search

In [None]:
predictions = best_clf.predict(test)
probs = best_clf.predict_proba(test)[:, 1]
print("Classification Report")
print (classification_report(test_labels,predictions))
print ("Confusion matrix")
print (confusion_matrix(test_labels,predictions))

In [None]:
print("Accuracy {:.2%}".format(accuracy_score(test_labels,predictions)))
plt.matshow(confusion_matrix(test_labels,predictions), cmap=plt.cm.copper, interpolation='nearest')
plt.title('confusion matrix')
plt.colorbar()
plt.ylabel('actual labels')
plt.xlabel('predicted labels')
plt.show()

In [None]:
from plot_metric.functions import BinaryClassification
# Visualisation with plot_metric
bc = BinaryClassification(test_labels,probs, labels=["Class 1", "Class 2"])

# Figures
plt.figure()
bc.plot_roc_curve()
plt.show()

In [None]:
pd.DataFrame({'Actual': np.ravel(test_labels), 'Predicted': predictions}).head()

# Model Interpretation using LIME

In [None]:
import lime
import lime.lime_tabular
# All features in this dataset are numeric, so no categorical names are passed
explainer = lime.lime_tabular.LimeTabularExplainer(train.values,
                                                   training_labels=np.ravel(train_labels),
                                                   feature_names=train.columns.tolist(),
                                                   feature_selection="lasso_path",
                                                   class_names=['0','1'],
                                                   discretize_continuous=True,
                                                   discretizer="entropy",
                                                   mode='classification')

## Model Explanation for Test Record at Index 100

In [None]:
# Pick the observation in the validation set for which explanation is required
print ('predicted:', best_clf.predict(test)[100])
print ('expected:', test_labels[100])

In [None]:
# LIME works on a 1-D NumPy array, so pass the row's values rather than the pandas Series
exp = explainer.explain_instance(test.iloc[100].values, best_clf.predict_proba, num_features=5)

exp.show_in_notebook(show_table=True)


## Model Explanation for Test Record at Index 12

In [None]:
# Pick the observation in the validation set for which explanation is required
print ('predicted:', best_clf.predict(test)[12])
print ('expected:', test_labels[12])

In [None]:
# LIME works on a 1-D NumPy array, so pass the row's values rather than the pandas Series
exp = explainer.explain_instance(test.iloc[12].values, best_clf.predict_proba, num_features=5)

exp.show_in_notebook(show_table=True)


## Model Explanation for Test Record at Index 19

In [None]:
# Pick the observation in the validation set for which explanation is required
print ('predicted:', best_clf.predict(test)[19])
print ('expected:', test_labels[19])

In [None]:
# LIME works on a 1-D NumPy array, so pass the row's values rather than the pandas Series
exp = explainer.explain_instance(test.iloc[19].values, best_clf.predict_proba, num_features=5)

exp.show_in_notebook(show_table=True)


The probability values for each class differ between algorithms because each algorithm computes different feature weights. Given the feature values of a particular record and the weights assigned to those features, the algorithm computes the class probabilities and predicts the class with the highest probability. A subject matter expert can interpret these results to judge which algorithm is picking up the right signals (features) for its predictions. In this sense, the black-box algorithms become more transparent: we can now see what drives each prediction.
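For a linear model such as logistic regression, the link between feature values, weights and class probability can be reproduced by hand. The sketch below fits a binary `LogisticRegression` on synthetic data (not the diabetes data), recomputes the probability for one row from the learned weights, and picks the class with the highest probability. Tree ensembles arrive at their probabilities differently, so this illustrates only the linear case.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data as a stand-in for the real features.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

row = X[0]
# Weighted sum of the feature values plus the intercept.
score = row @ clf.coef_[0] + clf.intercept_[0]
# The sigmoid turns the score into the probability of class 1.
p1 = 1.0 / (1.0 + np.exp(-score))
manual = np.array([1.0 - p1, p1])

# Predict the class with the highest probability.
predicted_class = clf.classes_[np.argmax(manual)]

print("manual probabilities: ", manual)
print("sklearn probabilities:", clf.predict_proba(row.reshape(1, -1))[0])
print("predicted class:", predicted_class)
```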

## 4. Machine Learning Model Comparison using PyCaret for further improvement 

In [None]:
from pycaret import classification

In [None]:
classification_setup = classification.setup(data=main_df, target='Outcome')

In [None]:
classification.compare_models()

### Best Model: CatBoost Classifier

In [None]:
classification_cat = classification.create_model('catboost')

### Model Interpretation using PyCaret Best Model: CatBoost Classifier

In [None]:
classification.interpret_model(classification_cat)

In [None]:
# Interactive Interpretation
classification.interpret_model(classification_cat,plot='reason')

#### Saving the Best Model

In [None]:
classification.save_model(classification_cat, 'catboost_model')