---

## [Index:](#index)


* [Problem Description](#problem)
* [Data Cleaning](#dc)
* [EDA](#eda)
* [Model Building](#mb)
     1. [Common Terminology](#common_terms)
     2. [Data Scaling & Splitting](#pre_process)
     3. [Logistic Regression](#logreg)
     4. [Naive Bayes Classification](#nb)
     5. [Random Forest](#rf)
     6. [K Nearest Neighbours](#knn)
     7. [Putting it all together](#summary)
       * [ROC AUC Curves](#roc_auc)
       * [Model Comparison](#compare)
* [Model Comparison](#mc)

<a id=problem></a>


_Notebook Overview:_ This notebook builds four basic models, with basic tuning, for the PIMA diabetes dataset. Logistic regression reaches 93% recall (sensitivity), and KNN reaches 83% accuracy.

## DESCRIPTION

### Problem Statement
- NIDDK (National Institute of Diabetes and Digestive and Kidney Diseases) research creates knowledge about and treatments for the most chronic, costly, and consequential diseases.
- The dataset used in this project is originally from NIDDK. The objective is to predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.
- Build a model to accurately predict whether the patients in the dataset have diabetes or not.

### Dataset Description
The dataset consists of several medical predictor variables and one target variable (Outcome). Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and more.

|Variable | Description|
|----------|-------------|
|Pregnancies|Number of times pregnant|
|Glucose|Plasma glucose concentration in an oral glucose tolerance test|
|BloodPressure|Diastolic blood pressure (mm Hg)|
|SkinThickness|Triceps skinfold thickness (mm)|
|Insulin|Two-hour serum insulin (mu U/ml)|
|BMI|Body mass index (weight in kg / (height in m)^2)|
|DiabetesPedigreeFunction|Diabetes pedigree function|
|Age|Age in years|
|Outcome|Class variable (either 0 or 1). 268 of 768 values are 1, and the others are 0|

---

<a id=dc></a>

## Data Cleaning:

1. Perform descriptive analysis. Understand the variables and their corresponding values. In the columns below, a value of zero does not make sense and therefore indicates a missing value:

* Glucose
* BloodPressure
* SkinThickness
* Insulin
* BMI

2. Visually explore these variables using histograms. Treat the missing values accordingly.

3. There are integer and float data type variables in this dataset. Create a count (frequency) plot describing the data types and the count of variables. 

---

### Observations
**Exploratory Data Analysis:**

1. 
 * Age, Insulin, DiabetesPedigreeFunction and Pregnancies are right skewed.
 * Zero values in BloodPressure, BMI, Insulin and Glucose clearly stand out in the plots.
 * After removing zeros from the columns where zero is not a valid value, all of them are close to a Gaussian 
   distribution except Insulin, which is highly right skewed. 
2. 
 * For every zero-treated column except Insulin, we fill the missing values with the mean. 
 * For Insulin, we fill with the median because of its skew.
3. 
 * The raw file has 7 integer columns and 2 float columns. Replacing zeros with NaN converts the five treated columns to float, so the data type count plot on the cleaned data shows 3 int columns and 6 float columns.
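
The imputation rule described above (mean for near-Gaussian columns, median for the skewed one) can be sketched on a toy frame. The values and the names `toy`, `fill_values` and `toy_filled` are hypothetical and only illustrate the mechanics; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): zeros stand in for missing readings.
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, 0.0, 89.0],
    "Insulin": [0.0, 94.0, 168.0, 846.0],
})

# Step 1: mark the impossible zeros as missing.
toy = toy.replace(0, np.nan)

# Step 2: mean for the roughly symmetric column, median for the skewed one.
fill_values = {
    "Glucose": toy["Glucose"].mean(),
    "Insulin": toy["Insulin"].median(),
}
toy_filled = toy.fillna(fill_values)
print(toy_filled)
```

The median is less sensitive than the mean to the single very large Insulin value, which is why it is the safer fill for a right-skewed column.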

[Go to Index](#index)    
[EDA](#eda)

---

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import copy

In [None]:
data_raw = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')

In [None]:
data_raw.dtypes

### Performing Exploratory Data Analysis

In [None]:
data_raw.shape

In [None]:
data_raw.sample(5)

In [None]:
data_raw.info()

In [None]:
data_raw.describe()

In [None]:
data_raw.boxplot(figsize=(10,10), rot=90)

In [None]:
data_raw.hist(figsize=(15,20), )

### Treating Zero valued columns 

In [None]:
not_allowed_zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
data = copy.deepcopy(data_raw)

In [None]:
data[not_allowed_zero_cols] = data[not_allowed_zero_cols].replace(0, np.nan)  # np.NaN alias was removed in NumPy 2.0

In [None]:
data.isnull().sum()

In [None]:
fig, ax = plt.subplots(nrows=3, ncols=2, figsize=(15,20))
# histplot(kde=True) replaces the deprecated distplot
sns.histplot(data.Glucose, kde=True, ax=ax[0][0])
sns.histplot(data.BloodPressure, kde=True, ax=ax[0][1])
sns.histplot(data.Insulin, kde=True, ax=ax[1][0])
sns.histplot(data.SkinThickness, kde=True, ax=ax[1][1])
sns.histplot(data.BMI, kde=True, ax=ax[2][0])

In [None]:
# Assign the result back: fillna(inplace=True) on a selected column is unreliable under pandas copy-on-write
for col in ['Glucose', 'BloodPressure', 'BMI', 'SkinThickness']:
    data[col] = data[col].fillna(data[col].mean())

# Insulin is highly right skewed, so use the median
data['Insulin'] = data['Insulin'].fillna(data['Insulin'].median())

#### Plots after filling the NaN values. 

In [None]:
fig, ax = plt.subplots(nrows=3, ncols=2, figsize=(15,20))
# histplot(kde=True) replaces the deprecated distplot
sns.histplot(data.Glucose, kde=True, ax=ax[0][0])
sns.histplot(data.BloodPressure, kde=True, ax=ax[0][1])
sns.histplot(data.Insulin, kde=True, ax=ax[1][0])
sns.histplot(data.SkinThickness, kde=True, ax=ax[1][1])
sns.histplot(data.BMI, kde=True, ax=ax[2][0])

### Plot data types 

In [None]:
data.dtypes.value_counts().plot(kind='bar')

---

<a id=eda></a>

## Exploratory Data Analysis:

1. Check the balance of the data by plotting the count of outcomes by their value. Describe your findings and plan future course of action.

2. Create scatter charts between the pair of variables to understand the relationships. Describe your findings.

3. Perform correlation analysis. Visually explore it using a heat map.

---

### Observations

 The dataset is imbalanced: there are roughly half as many positive outcomes (268) as negative outcomes (500). While building the models, 
 we need to balance the outcomes, either by oversampling the minority class or by undersampling the majority class. Another 
 option is to weight the classes while training the model.
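
As a rough illustration of the weighted option, scikit-learn's `compute_class_weight` shows what `class_weight='balanced'` would assign to each class. The label vector below is rebuilt from the class counts in the dataset description (268 positives out of 768) rather than read from the file, and `y_counts` and `weights` are illustrative names.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels rebuilt from the stated class counts: 500 negatives, 268 positives.
y_counts = np.array([0] * 500 + [1] * 268)

# 'balanced' weights are n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]),
                               y=y_counts)
print(dict(zip([0, 1], weights.round(3))))
```

The minority class receives the larger weight, so each misclassified positive costs the model more during training.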

**Pair plot analysis**

* BMI and SkinThickness have a positive correlation.
* Insulin and Glucose have a positive correlation.
* The remaining fields are uncorrelated or very weakly correlated.
    
**Correlation Analysis**

* There is no strong correlation between any two fields.
* The BMI-SkinThickness and Insulin-Glucose pairs are the most correlated in the set, but only moderately so.
* Outcome is moderately correlated with Glucose.

[Go to Index](#index)    
[Model Building](#mb)

---

### Checking Data balance

In [None]:
sns.countplot(x='Outcome', data=data).set(title="Data Imbalance Check")

### Pair plot  analysis

In [None]:
sns.pairplot(data, hue='Outcome')

### Correlation analysis


In [None]:
cor = data.corr()
mask = np.triu(np.ones_like(cor, dtype=bool))  # np.bool was removed from NumPy; use the builtin bool

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(cor, mask=mask, center=0,
            square=True, linewidths=.5, annot=True)

<a id=mb></a>

### Creating, Tuning and Comparing Models

1. Devise strategies for model building. It is important to decide the right validation framework. Express your thought process.
2. Apply an appropriate classification algorithm to build a model. Compare various models with the results from KNN algorithm.

---

## Observations  
 
   This is a binary classification problem: given all the features, the model has to predict
whether a person has diabetes or not. There are several ways to build a model for binary or multi-class classification. 
A few of them are listed below:

 1. Logistic Regression 
 2. Naive Bayes classification 
 3. Stochastic Gradient Descent
 4. K-Nearest Neighbours
 5. Decision Tree
 6. Random Forest
 7. Support Vector Machine

    We are going to build four models and compare their performance on the train and test datasets. We will tune the models where tuning is needed. 
    We will use K-Fold Cross Validation to validate the models. We will plot the statistics of all the models together and compare their performance.
    Of all the tuned models, we will pick the best one. The step-by-step procedure is listed below: 

     1. [Common Terminology](#common_terms)
     2. [Data Scaling & Splitting](#pre_process)
     3. [Logistic Regression](#logreg)
     4. [Naive Bayes Classification](#nb)
     5. [Random Forest](#rf)
     6. [K Nearest Neighbours](#knn)
     7. [Putting it all together](#summary)
       * [ROC AUC Curves](#roc_auc)
       * [Model Comparison](#compare)

   From the model comparison, we find that KNN is the most stable classifier: all of its metrics are good, 
   and it has the best accuracy, AUC, precision and F1 score of all the models. 

   If we are looking for a highly sensitive model, we can take the logistic regression model, which has the highest recall. 
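
The K-fold validation mentioned above can be sketched as follows. This is a minimal example on synthetic data generated with `make_classification`, not the notebook's own data; `X_demo`, `y_demo` and `cv_scores` are illustrative names, and the stratified variant of K-fold is an assumption chosen because the classes are imbalanced.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in: 768 rows, 8 predictors, roughly 35% positives.
X_demo, y_demo = make_classification(n_samples=768, n_features=8,
                                     n_informative=5, weights=[0.65, 0.35],
                                     random_state=123)

# Stratified folds keep the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
model = LogisticRegression(class_weight="balanced", max_iter=500)

# One recall value per fold; the spread indicates how stable the model is.
cv_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="recall")
print(cv_scores.mean(), cv_scores.std())
```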


[Go to Index](#index)    

---

<a id=common_terms></a>

## Classification Terminology 

**Precision**:  What proportion of positive identifications were actually positive?
Precision is a ratio of the number of true positives divided by the sum of the true positives and false positives. It describes how good a model is at predicting the positive class. Precision is referred to as the positive predictive value.
    
$$
\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
$$


**Recall/Sensitivity/True Positive Rate (TPR)**: What proportion of actual positives were identified correctly?

In medical terms, **Sensitivity** measures how often a test correctly generates a positive result for people who have the condition that’s being tested for. A test that’s highly sensitive will flag almost everyone who has the disease and not generate many false-negative results. (Example: a test with 90% sensitivity will correctly return a positive result for 90% of people who have the disease, but will return a negative result — a false-negative — for 10% of the people who have the disease and should have tested positive.)

$$
\text{Recall} = \text{Sensitivity} = \text{TPR} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
$$


**Specificity/True Negative Rate** measures a test’s ability to correctly generate a negative result for people who don’t have the condition that’s being tested for. A high-specificity test will correctly rule out almost everyone who doesn’t have the disease and won’t generate many false-positive results. (Example: a test with 90% specificity will correctly return a negative result for 90% of people who don’t have the disease, but will return a positive result — a false-positive — for 10% of the people who don’t have the disease and should have tested negative.)

$$
\text{Specificity} = \text{True Negative Rate} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}
$$

The False Positive Rate (FPR) is also called inverted specificity or the false alarm rate:

$$
\text{False Positive Rate} = 1 - \text{Specificity} = \frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}}
$$


For any test, there is usually a trade-off between TPR and FPR.

### ROC-AUC Curve
It is a plot of the false positive rate (x-axis) versus the true positive rate (y-axis) for a number of different candidate threshold values between 0.0 and 1.0. Put another way, it plots the false alarm rate versus the hit rate.

### Precision - Recall Curve
Reviewing both precision and recall is useful in cases where there is an imbalance in the observations between the two classes. Specifically, there are many examples of no event (class 0) and only a few examples of an event (class 1).

The reason for this is that typically the large number of class 0 examples means we are less interested in the skill of the model at predicting class 0 correctly, e.g. high true negatives.

Key to the calculation of precision and recall is that neither makes use of the true negatives. Both are concerned only with the correct prediction of the minority class, class 1.

A precision-recall curve is a plot of the precision (y-axis) and the recall (x-axis) for different thresholds, much like the ROC curve.
In terms of model selection, the F-measure summarizes model skill for a specific probability threshold (e.g. 0.5), whereas the area under the curve summarizes the skill of a model across thresholds, like ROC AUC.
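
A precision-recall curve and its threshold-free summary can be computed with scikit-learn as sketched below. The labels and scores are made-up toy values, and average precision is used here as the area-style summary of the curve.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Toy labels and predicted probabilities (hypothetical values).
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])

# One (precision, recall) pair per candidate threshold, plus a final
# (precision=1, recall=0) point that has no threshold attached.
precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Average precision summarizes the curve across all thresholds.
ap = average_precision_score(y_true, scores)
print(ap)
```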

**F1 Score:**

$$
F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$
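
As a small worked example of the definitions in this section, the sketch below derives each metric from a confusion matrix and compares precision, recall and F1 with scikit-learn's own functions. The label vectors are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy ground truth and predictions (hypothetical values).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)            # sensitivity / TPR
specificity = tn / (tn + fp)       # TNR
fpr = fp / (fp + tn)               # 1 - specificity
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, specificity, fpr, f1)
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```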


---

<a id=pre_process></a>

## Scaling and splitting the data

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score, accuracy_score, mean_squared_error, roc_auc_score, confusion_matrix, roc_curve, recall_score, precision_score
from sklearn.preprocessing import StandardScaler

In [None]:
X_scaled = StandardScaler().fit_transform(data.drop(['Outcome'], axis='columns'))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, data.Outcome, random_state=123, test_size=.2)
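
The cells above standardize the full dataset before splitting, so the test rows influence the scaler's mean and variance. One common way to avoid that, sketched here on synthetic data and not used elsewhere in this notebook, is to fit the scaler inside a pipeline on the training split only. `X_demo`, `y_demo`, `pipe` and the other names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 768 x 8 feature matrix.
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=123)

# Split first, keeping the class ratio the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=123)

# The scaler is fitted on X_tr only; X_te is merely transformed at predict time.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```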

<a id=logreg></a>

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
lr_clf= LogisticRegression(class_weight='balanced', random_state=123, max_iter=500)

In [None]:
lr_clf.fit(X_train, y_train)

In [None]:
lr_pred = lr_clf.predict(X_test)

In [None]:
lr_model_vals = dict(accuracy=accuracy_score(y_test, lr_pred),
                    auc=roc_auc_score(y_test, lr_pred),
                    recall=recall_score(y_test, lr_pred),
                    precision=precision_score(y_test, lr_pred),
                    f1_score = f1_score(y_test, lr_pred),
                    )

#### AUC ROC Curve params computation for Logistic Regression

In [None]:
y_pred_prob_lr = lr_clf.predict_proba(X_test)[:, 1]
fpr_lr, tpr_lr , th_lr = roc_curve(y_test, y_pred_prob_lr)
gmean_lr = np.sqrt(tpr_lr * (1-fpr_lr))
ix_lr = np.argmax(gmean_lr)

<a id=tune_lr></a>

### Tuning by AUC_ROC Threshold

The geometric mean of sensitivity (TPR) and specificity (1 - FPR) is computed at every candidate threshold; the threshold that maximizes it balances the two rates. If our focus is a model that predicts both classes well, this threshold can be chosen as the optimum threshold. 
Here the optimum threshold for classifying an observation as True or False comes out at about .364. 
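
The threshold selection can be sketched on toy data as below. The labels and scores are hypothetical and much smaller than the real test set, so the selected threshold differs from the value quoted above.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and predicted probabilities (hypothetical values).
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, scores)

# Geometric mean of sensitivity (tpr) and specificity (1 - fpr) per threshold.
gmean = np.sqrt(tpr * (1 - fpr))
best = np.argmax(gmean)
best_threshold = thresholds[best]

# Classify as positive when the score reaches the selected threshold.
y_hat = (scores >= best_threshold).astype(int)
print(best_threshold, gmean[best])
```

The vectorized comparison `scores >= best_threshold` is equivalent to looping over the probabilities one at a time.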

In [None]:
th_lr[np.argmax(gmean_lr)]

In [None]:
y_roc_pred_lr = [0 if pred < th_lr[ix_lr] else 1 for pred in y_pred_prob_lr ]

In [None]:
print("Test classification Report With  tuned threshold")
print(classification_report(y_test, y_roc_pred_lr)  )

print("Test classification Report Without  tuned threshold")
print(classification_report(y_test, lr_pred) )

In [None]:
fpr_tlr, tpr_tlr , th_tlr = roc_curve(y_test, y_roc_pred_lr)
gmean_tlr = np.sqrt(tpr_tlr * (1-fpr_tlr))
ix_tlr = np.argmax(gmean_tlr)

In [None]:
tlr_model_vals = dict(accuracy=accuracy_score(y_test, y_roc_pred_lr),
                    auc=roc_auc_score(y_test, y_roc_pred_lr),
                    recall=recall_score(y_test, y_roc_pred_lr),
                    precision=precision_score(y_test, y_roc_pred_lr),
                    f1_score = f1_score(y_test, y_roc_pred_lr),
                    )

<a id=nb></a>

## Naive Bayes Classification 

In [None]:
from sklearn.naive_bayes import GaussianNB

In [None]:
gnb_clf = GaussianNB()

In [None]:
gnb_clf.fit(X_train, y_train)

In [None]:
gnb_pred = gnb_clf.predict(X_test)

In [None]:
gnb_model_vals = dict(accuracy=accuracy_score(y_test, gnb_pred),
                    auc=roc_auc_score(y_test, gnb_pred),
                    recall=recall_score(y_test, gnb_pred),
                    precision=precision_score(y_test, gnb_pred),
                    f1_score = f1_score(y_test, gnb_pred),
                    )

### AUC ROC for Naive Bayes classifier 

In [None]:
y_pred_prob_gnb = gnb_clf.predict_proba(X_test)[:, 1]
fpr_nb, tpr_nb , th_nb = roc_curve(y_test, y_pred_prob_gnb)
gmean_nb = np.sqrt(tpr_nb * (1-fpr_nb))
ix_nb = np.argmax(gmean_nb)

In [None]:
print("Train Classification Report")
print(classification_report(y_train, gnb_clf.predict(X_train))  )

In [None]:
print("Test Classification Report")
print(classification_report(y_test, gnb_clf.predict(X_test))  )

<a id=rf></a>

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf_clf = RandomForestClassifier()

In [None]:
rf_clf.fit(X_train, y_train)

In [None]:
rf_pred = rf_clf.predict(X_test)

In [None]:
rf_model_vals = dict(accuracy=accuracy_score(y_test, rf_pred),
                    auc=roc_auc_score(y_test, rf_pred),
                    recall=recall_score(y_test, rf_pred),
                    precision=precision_score(y_test, rf_pred),
                    f1_score = f1_score(y_test, rf_pred),
                    )

In [None]:
y_pred_prob_rf = rf_clf.predict_proba(X_test)[:, 1]
fpr_rf, tpr_rf , th_rf = roc_curve(y_test, y_pred_prob_rf)
gmean_rf = np.sqrt(tpr_rf * (1-fpr_rf))
ix_rf = np.argmax(gmean_rf)

In [None]:
print("\t\tTrain Classification Report\n")
print(classification_report(y_train, rf_clf.predict(X_train))  )

In [None]:
print("\t\tTest Classification Report\n")
print(classification_report(y_test, rf_clf.predict(X_test))  )

## Observation

Judging by the training-set report, the random forest has overfitted. An overfitted model can show very good scores, but ultimately it is not a good model. We will tune its parameters, starting with the cost-complexity pruning parameter (`ccp_alpha`), and look for the value that makes the training and test accuracy comparable.

The code below shows only one sweep. Before arriving at this range, I ran several sweeps and compared train and test errors to narrow down the cost value. This final sweep is there to arrive at a more precise value. 
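
A variation worth considering is to choose `ccp_alpha` by cross-validation on the training data, so that the test set stays untouched until the final evaluation. The sketch below is an assumption-laden illustration on synthetic data with a small, arbitrary grid of alphas; it is not the search used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training split.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=123)

cv_accuracy = {}
for alpha in [0.0, 0.01, 0.03]:
    # Larger ccp_alpha prunes each tree harder, trading variance for bias.
    rf_demo = RandomForestClassifier(n_estimators=50, ccp_alpha=alpha, random_state=123)
    cv_accuracy[alpha] = cross_val_score(rf_demo, X_demo, y_demo,
                                         cv=3, scoring="accuracy").mean()

# Pick the alpha with the best mean cross-validated accuracy.
best_alpha = max(cv_accuracy, key=cv_accuracy.get)
print(cv_accuracy, best_alpha)
```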

In [None]:
alphas=[]
test=[]
train=[]
for alpha in np.linspace(.03, .05, 10):
    rf = RandomForestClassifier(ccp_alpha=alpha, random_state=123)
    rf.fit(X_train, y_train)
    y_train_predicted = rf.predict(X_train)
    y_test_predicted = rf.predict(X_test)
    mse_train = mean_squared_error(y_train, y_train_predicted)
    mse_test = mean_squared_error(y_test, y_test_predicted)
    alphas.append(alpha)
    test.append(mse_test)
    train.append(mse_train)
    print("Alpha: {} Train mse: {} Test mse: {}".format(alpha, mse_train, mse_test))
    
score=pd.DataFrame({'alpha': alphas, 'test':test, 'train': train})

In [None]:
plt.plot(score.alpha, score.test)
plt.plot(score.alpha, score.train)
plt.legend(['Test Error', 'Train Error'])
plt.xlabel('Alpha')
plt.ylabel('Error')

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
# Create the random grid
random_grid = { 'ccp_alpha': np.linspace(.03, .05, 10),
                'n_estimators': [int(x) for x in np.linspace(start = 200, stop = 1000, num = 5)],
               'max_features': ['sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3 (it meant 'sqrt' for classifiers)
               'max_depth': [int(x) for x in np.linspace(5, 55, num = 10)], 
               'min_samples_split': [5, 10, 12], 
               'min_samples_leaf': [3,5,7,10],
               }

In [None]:
rf_clf_cv = RandomForestClassifier(class_weight="balanced", random_state=123)
rscv = RandomizedSearchCV(estimator=rf_clf_cv, param_distributions=random_grid, cv=3, scoring='f1_weighted')

In [None]:
rscv.fit(X_train, y_train)

In [None]:
rscv.best_estimator_

In [None]:
print("\t\tTest Classification Report\n")
print(classification_report(y_test, rscv.predict(X_test))  )

**Observation:** ccp_alpha is 0.037 

In [None]:
print("\t\tTrain Classification Report\n")
print(classification_report(y_train, rscv.predict(X_train))  )

In [None]:
tuned_rf_model_vals = dict(accuracy=accuracy_score(y_test, rscv.predict(X_test)),
                    auc=roc_auc_score(y_test, rscv.predict(X_test)),
                    recall=recall_score(y_test, rscv.predict(X_test)),
                    precision=precision_score(y_test, rscv.predict(X_test)),
                    f1_score = f1_score(y_test, rscv.predict(X_test)),
                    )

In [None]:
y_pred_prob_trf = rscv.predict_proba(X_test)[:, 1]
fpr_trf, tpr_trf , th_trf = roc_curve(y_test, y_pred_prob_trf)
gmean_trf = np.sqrt(tpr_trf * (1-fpr_trf))
ix_trf = np.argmax(gmean_trf)

In [None]:
import sklearn.metrics
# SCORERS was removed in scikit-learn 1.3; get_scorer_names() lists the valid scoring strings
sorted(sklearn.metrics.get_scorer_names())

<a id=knn></a>

## K-Nearest Neighbour 

We capture rmse, error_rate and accuracy for a range of neighbour counts and plot each of them to pick the number of nearest 
neighbours. The error_rate and rmse plots have the same shape, while the accuracy plot is a mirror image of the other two.
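
The reason the three plots carry the same information is that, for 0/1 labels, every squared error is either 0 or 1. The mean squared error therefore equals the error rate, RMSE is its square root, and accuracy is one minus the error rate. A toy check with made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Toy labels and predictions (hypothetical values): 2 mistakes out of 8.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

error_rate = np.mean(y_true != y_hat)
mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)
accuracy = accuracy_score(y_true, y_hat)

print(error_rate, mse, rmse, accuracy)
```

Since all three are monotone transformations of one another, they select the same number of neighbours.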

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
nbr = []
error_rmse = []
error_rate = []
accuracy = []
for n in range(2, 50):
    knn_clf = KNeighborsClassifier(n_neighbors=n, weights='distance')
    knn_clf.fit(X_train, y_train)
    pred = knn_clf.predict(X_test)
    
    nbr.append(n)
    error_rmse.append(np.sqrt(mean_squared_error(y_test, pred)))  # squared=False was removed in newer scikit-learn
    error_rate.append(np.mean(y_test != pred))
    accuracy.append(accuracy_score(y_test, pred))
    
knn_stats = pd.DataFrame({'neighbour': nbr, 'rmse': error_rmse, 'error_rate': error_rate, 'accuracy': accuracy})      

In [None]:
sns.lineplot(x='neighbour', y='error_rate', data=knn_stats)

In [None]:
sns.lineplot(x='neighbour', y='rmse', data=knn_stats)

In [None]:
sns.lineplot(x='neighbour', y='accuracy', data=knn_stats)

In [None]:
knn_stats.neighbour[knn_stats.rmse.argmin()]

In [None]:
knn_stats.neighbour[[5,19,9,10]]

#### With distance weighting, 7 is the chosen nearest neighbour count. We will create our model with 7 NN

In [None]:
knn_clf = KNeighborsClassifier(n_neighbors=7, weights='distance')
knn_clf.fit(X_train, y_train)
knn_pred = knn_clf.predict(X_test)

In [None]:
print(classification_report(y_test, knn_pred))  # report the 7-NN model, not the last model from the loop

In [None]:
knn_model_vals = dict(accuracy=accuracy_score(y_test, knn_pred),
                    auc=roc_auc_score(y_test, knn_pred),
                    recall=recall_score(y_test, knn_pred),
                    precision=precision_score(y_test, knn_pred),
                    f1_score = f1_score(y_test, knn_pred),
                    )

In [None]:
y_pred_prob_knn = knn_clf.predict_proba(X_test)[:, 1]
fpr_knn, tpr_knn , th_knn = roc_curve(y_test, y_pred_prob_knn)
gmean_knn = np.sqrt(tpr_knn * (1-fpr_knn))
ix_knn = np.argmax(gmean_knn)

### Tuning KNN

We have seen that four values of nearest neighbours yielded the same error in our earlier plot. We will tune the model over those
values and pick the best.

In [None]:
knn_param_grid = {'n_neighbors' : [7, 11, 12, 21],
                      'weights': ['distance', 'uniform'],
                      'algorithm' : ['ball_tree', 'kd_tree'],
                     'leaf_size' :[30,40,50],                 
                 }

In [None]:
knn_rscv = RandomizedSearchCV(estimator=KNeighborsClassifier(), param_distributions=knn_param_grid, cv=3, scoring='f1_weighted')

In [None]:
knn_rscv.fit(X_train, y_train)

In [None]:
knn_rscv.best_params_

In [None]:
tknn_pred = knn_rscv.predict(X_test)

In [None]:
y_pred_prob_knn_cv = knn_rscv.predict_proba(X_test)[:, 1]
fpr_tknn, tpr_tknn , th_tknn = roc_curve(y_test, y_pred_prob_knn_cv)
gmean_tknn = np.sqrt(tpr_tknn * (1-fpr_tknn))
ix_tknn = np.argmax(gmean_tknn)

In [None]:
accuracy_score(y_test, tknn_pred)

In [None]:
tknn_model_vals = dict(accuracy=accuracy_score(y_test, tknn_pred),
                    auc=roc_auc_score(y_test, tknn_pred),
                    recall=recall_score(y_test, tknn_pred),
                    precision=precision_score(y_test, tknn_pred),
                    f1_score = f1_score(y_test, tknn_pred),
                    )

<a id=mc></a>

## Model Comparison

**Data Modeling:**

Create a classification report by analyzing sensitivity, specificity, AUC (ROC curve), etc. Please be descriptive to explain what values of these parameter you have used.

### Observations

I have plotted the ROC curves of all the models together for comparison. 
I have also plotted the statistics of all the models together, in both tuned and untuned versions. 
We can compare the performance by looking at the plots. From these, I have selected 
the 3 best tuned models and plotted them again to show which model is the best of the lot.
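
One detail to keep in mind when reading the `auc` values in the comparison: the model cells pass hard 0/1 predictions to `roc_auc_score`. On hard predictions the ROC curve has a single interior point, so the score reduces to balanced accuracy, (sensitivity + specificity) / 2, which generally differs from the AUC computed on predicted probabilities. A toy sketch with made-up values:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Toy labels and predicted probabilities (hypothetical values).
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
proba = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])
hard = (proba >= 0.5).astype(int)

auc_from_proba = roc_auc_score(y_true, proba)   # ranks all scores
auc_from_labels = roc_auc_score(y_true, hard)   # one operating point only
bal_acc = balanced_accuracy_score(y_true, hard)

print(auc_from_proba, auc_from_labels, bal_acc)
```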

<a id=roc_auc></a>

#### AUC ROC Curve of all the models put together

In [None]:
plt.subplots(figsize=(12,9))
plt.plot(fpr_knn, tpr_knn, marker='o', markevery=[ix_knn])
plt.plot(fpr_tknn, tpr_tknn, marker='o', markevery=[ix_tknn])
plt.plot(fpr_rf, tpr_rf, marker='o', markevery=[ix_rf])
plt.plot(fpr_trf, tpr_trf, marker='o', markevery=[ix_trf])
plt.plot(fpr_nb, tpr_nb, marker='o', markevery=[ix_nb])
plt.plot(fpr_lr, tpr_lr, marker='o', markevery=[ix_lr])
plt.plot(fpr_tlr, tpr_tlr, marker='o', markevery=[ix_tlr])
plt.plot([0,1], [0,1])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('AUC ROC Curve for all models')
plt.legend(['KNN', 'TunedKNN', 'Random Forest', 'Tuned Random Forest', 'Naive Bayes', 'Logistic Regression',  'Tuned Logistic Regression','Dumb Classifier'])
plt.show()

<a id=compare></a>

### Comparing model parameters

In [None]:
model_stats = pd.DataFrame(data=[lr_model_vals, tlr_model_vals, gnb_model_vals, rf_model_vals, 
                                 tuned_rf_model_vals, knn_model_vals, tknn_model_vals ], 
                           index=['LogReg', 'Tuned LogReg auc_roc', 'naive_bayes', 'random_forest', 
                                  'tuned_random_forest ', 'knn', 'tuned knn'])

In [None]:
model_stats.T.plot(kind='line', figsize=(12,9))

### Observations

I have plotted the different metrics of all the models above. At a glance, recall and precision move in opposite directions for all the models. The overfitted Random Forest had similar characteristics to KNN, but once it was tuned, its recall went up and its precision came down.

Considering the metrics overall, KNN is the best predictor of all the models. I have tuned KNN as well. With tuning, the model's performance increased on all the metrics.

I have tuned Logistic Regression as well. After tuning, TPR went up while precision went down.

If we are looking for a model with high sensitivity, we can pick the Logistic Regression model. For better overall performance, we can choose KNN. 

<a id=final_model></a>

### Final Model: 

Of all the models tuned and plotted above, I am picking the best 3 and comparing their values. From the plot, we can say that 
the KNN model is best on 4 out of 5 metrics, so it can be termed the best model. KNN has the best accuracy, AUC, precision and F1 score. Logistic Regression has the best recall/sensitivity, at 93%.


In [None]:
final_models = pd.DataFrame(data=[ tlr_model_vals, tuned_rf_model_vals,  tknn_model_vals ], 
                           index=['Logistic Regression',  'Random Forest', 'KNN'])

In [None]:
final_models.T.plot(kind='line', figsize=(12,9), table=True)


[Tableau Public Link](https://public.tableau.com/profile/awadhesh2246#!/vizhome/PGP-DSFinalProject/scatter_bubble?publish=yes)