## M6W1 Assignment

*Q0: Have a quick overview of the features and implement a “cleaning process”. Make sure this part of the code is well organised, if possible make this an object-oriented exercise.*

In [None]:
#Import necessary dependencies
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
import matplotlib.pyplot as plt
import seaborn as sns 
import re
import numpy as np
from sklearn.metrics import roc_curve

import warnings
warnings.filterwarnings("ignore")
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
df = pd.read_csv('../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.head()

In [None]:
df1 = df.copy()
df.info()

In [None]:
# check for null values in the dataset
df.isna().sum().sum()

In [None]:
df.nunique().sort_values()

In [None]:
# Based on the above, all columns are categorical except for Tenure, Monthly Charges, Total Charges
cols = ['Churn', 'gender', 'SeniorCitizen','Partner', 'Dependents', 'PaperlessBilling', 'PhoneService','Contract','StreamingMovies','StreamingTV','TechSupport','OnlineBackup','OnlineSecurity','InternetService', 'MultipleLines', 'DeviceProtection', 'PaymentMethod']
for col in cols:
    print (col,':', df[col].unique())

The categorical variables can be divided into 3 groups:

1) Yes/No categorical variables (Partner, Dependents, etc.)

2) Yes/No/Other categorical variables (StreamingMovies, TechSupport, etc.)

3) Categorical variables with other values (gender, Contract, etc.)


In [None]:
# we convert SeniorCitizen to Yes/No in order to plot it with the other Yes/No columns 
df['SeniorCitizen'] = df['SeniorCitizen'].map({0:'No',1:'Yes'}) 

# For these columns we expect yes/no values only 
cols1 = ['Churn', 'SeniorCitizen', 'Partner', 'Dependents', 'PaperlessBilling', 'PhoneService']  

# For these columns we expect yes/no or a different value such as 'no internet service', or 'special package'
cols2 = ['StreamingMovies','StreamingTV','TechSupport','OnlineBackup','OnlineSecurity', 
         'MultipleLines', 'DeviceProtection']

# For these columns, we expect values other than yes/no
cols3 = ['gender','Contract','InternetService','PaymentMethod']

In [None]:
def plot_chart(cols,fz=(12,10), rot=0):
    fig, axes = plt.subplots(nrows=2, ncols=(len(cols)+1)//2, figsize=fz)
    for i, col in enumerate(cols):
        sns.countplot(x=col, data=df, ax=axes[i%2,i//2], order=df[col].value_counts().index)
        axes[i%2,i//2].set_title(col)
        axes[i%2,i//2].set_xlabel(None)
        axes[i%2,i//2].set_ylabel(None)
        xlabels = axes[i%2,i//2].get_xticklabels()
        axes[i%2,i//2].set_xticklabels(xlabels, rotation=rot)
 
    for i in range(len(cols), len(axes.flatten()) ):
        fig.delaxes(axes.flatten()[i])

In [None]:
# Plot the first type - accept only Yes/No
plot_chart(cols1)

In [None]:
# plot bar charts 
plot_chart(cols2,(18,10))

In [None]:
# plot bar charts 
plot_chart(cols3,(10,10),45)

In [None]:
# TotalCharges holds a blank string (' ') where the value is missing; count those rows before replacing them with 0 and converting the column to float
df[df['TotalCharges'] == ' '].shape

In [None]:
# treat blank values as missing, set them to 0 and convert TotalCharges to numeric
# (pd.to_numeric avoids rewriting spaces inside otherwise valid values)
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'].str.strip(), errors='coerce').fillna(0)

In [None]:
df.describe()

In [None]:
# now we look at the numeric values 
ncols = ['tenure', 'MonthlyCharges', 'TotalCharges']
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15,4))
for i, col in enumerate(ncols): 
    sns.histplot(df[col], kde=True, ax=axes[i])

*Q1: Explain the process that needs to happen for each feature before you train your model. Also, think about how future observations might be different from the ones you have! Be creative.*

We need to convert all the categorical variables to numbers so they can be used in the prediction models. This is done by assigning a numerical value to each category of every categorical feature in the dataset. Because future observations may contain values we have not seen, each group gets its own rule:

1) Yes/No categorical variables - cannot accept values other than Yes/No.

2) Yes/No/'Others' categorical variables - can accept values other than Yes/No.

3) Categorical variables with other values (gender, Contract, etc.) - can accept any value.
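
To make the rule for the first group concrete, here is a minimal sketch of how a future batch could be checked against a fixed mapping. The helper `encode_with_check` and the toy `future` frame are hypothetical and not part of this notebook; the sketch only assumes that unseen values should either raise an error or be marked with -1.

```python
import pandas as pd

# Hypothetical helper (not part of the original notebook): encode a column
# with a fixed mapping and report values the mapping has never seen.
def encode_with_check(series, mapping, strict=True):
    unseen = sorted(set(series.dropna().unique()) - set(mapping))
    if unseen and strict:
        raise ValueError(f"Unexpected values in {series.name}: {unseen}")
    # Unseen (and missing) values become -1 so they stay identifiable
    encoded = series.map(mapping).fillna(-1).astype(int)
    return encoded, unseen

yes_no = {"No": 0, "Yes": 1}
# Toy "future" batch with one value the mapping does not know
future = pd.DataFrame({"Partner": ["Yes", "No", "Maybe"]})

encoded, unseen = encode_with_check(future["Partner"], yes_no, strict=False)
print(encoded.tolist())
print(unseen)
```

With `strict=True` the same call would raise instead, which is closer to the "cannot accept" rule for the Yes/No group.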


In [None]:
#df= df1.copy()

def mapping_dict(col, yes_no=True):
    md = {}
    if (yes_no == True):
        md = {'No': 0, 'Yes': 1}
    val = col.unique()
    if len(md)==0:
        cnt=0
    else:
        cnt=max(md.values())+1
    for i in val:
        if not(i in md.keys()):
            md[i] = cnt
            cnt+=1
    return md

In [None]:
#For first group : convert category values to numbers 
for col in cols1:
    md = mapping_dict(df[col])
    df[col] = df[col].map(md)

    # for any value other than Yes/No, identify these values, flag and remove them 
    lst = [x for x in md.keys() if x not in ['Yes','No']]
    if (len(lst)>0):
        print ('The following values in',col,'cannot be accepted and needs to be revised : ', lst)
        print (df[df[col]>1].shape[0], 'rows removed from the dataset' )
        df = df[df[col]<=1]


In [None]:
# for second group : convert category values to numbers and accept any value for these columns
for col in cols2:
    md = mapping_dict(df[col])
    df[col] = df[col].map(md)

In [None]:
# We use patterns to identify males & females based on the first letter and convert the value to Male / Female
print (df['gender'].value_counts())
df.loc[df['gender'].str.contains(r'^[Mm]'),'gender']='Male'
df.loc[df['gender'].str.contains(r'^[Ff]'),'gender']='Female'

# for any value other than Male/Female, identify these values, flag them and remove them
lst = [x for x in list(df['gender'].unique()) if x not in ['Male','Female']]
if (len(lst)>0):
    print ('The following values in gender to be revised : ', lst)
    print (df[~df['gender'].isin(['Male','Female'])].shape[0], 'rows removed from the dataset' )
    df = df[df['gender'].isin(['Male','Female'])]

In [None]:
# for third group : convert category values to numbers and accept any value for these columns
for col in cols3:
    md = mapping_dict(df[col], yes_no=False)
    df[col] = df[col].map(md)

df['gender'].value_counts()    

In [None]:
for col in cols:
    print (col,': before -', list(df1[col].value_counts()), '&  after -', list(df[col].value_counts()))

In [None]:
# Normalize Numeric Value
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15,4))
for i, col in enumerate(ncols): 
    df[col] = (df[col] - df[col].mean()) / (df[col].std())
    sns.histplot(df[col], kde=True, ax=axes[i])

In [None]:
# customerID is not required for the prediction model
df = df.drop(['customerID'],axis = 1)

In [None]:
df.info()

In [None]:
df.head()

In [None]:
plt.figure(figsize=(8,8))
sns.heatmap(df.corr())

From the heatmap above, the feature pairs with correlation coefficients higher than 0.8 suggest that customers with internet access normally have other online services as well, such as online security, backup, tech support and streaming movies.
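
Reading pairs off a heatmap is error-prone, so a small helper can list the strongly correlated pairs directly. This is only a sketch: `high_corr_pairs` is a hypothetical function and the `toy` frame is synthetic data standing in for the encoded churn features.

```python
import numpy as np
import pandas as pd

# Hypothetical helper: list feature pairs whose absolute correlation
# exceeds a threshold, reporting each pair only once.
def high_corr_pairs(frame, threshold=0.8):
    corr = frame.corr().abs()
    # Keep only the upper triangle (excluding the diagonal)
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                 # independent noise
})

strong = high_corr_pairs(toy)
print(strong)
```

In the notebook, the same function could be called on `df` after the encoding steps.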

*Q2: Choose one metric to evaluate the different models you will train and explain why you are choosing that instead of other metrics. You can try a few base models but model performance is not of prime importance yet.*

Precision measures, out of the observations predicted positive, how many are actually positive. Precision is a good measure when the cost of a False Positive is high.

Recall measures how many of the Actual Positives our model captures by labeling them as Positive (True Positives). Recall is the metric to use for selecting the best model when there is a high cost associated with a False Negative.
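
As a small worked example of the two definitions, the sketch below computes precision = TP / (TP + FP) and recall = TP / (TP + FN) from a confusion matrix. The labels are made up for illustration (1 = churn) and are not taken from the dataset.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (1 = churn): 4 churners and 6 non-churners
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns tn, fp, fn, tp (here 5, 1, 2, 2)
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the predicted churners, how many churned
recall = tp / (tp + fn)     # of the actual churners, how many were caught

print(precision, precision_score(y_true, y_hat))
print(recall, recall_score(y_true, y_hat))
```

Here the model misses half of the churners (recall 0.5) even though two thirds of its churn predictions are correct.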


In our model, the cost of not identifying the clients who will churn is the lost revenue after losing them. The cost of incorrectly flagging customers who are not planning to leave could be calling them to check on their satisfaction level and perhaps offering them some incentives to keep their services.

Assuming that the cost of losing clients is higher, we will use recall to evaluate our model.
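
`GridSearchCV` selects hyperparameters by accuracy unless told otherwise, so one way to make the search follow the chosen metric is to pass `scoring="recall"`. The sketch below shows this on synthetic, imbalanced data; the data, the reduced grid and the variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the churn data (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.73, 0.27], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

# scoring="recall" makes the search rank candidates by recall of class 1
search = GridSearchCV(
    pipe, {"knn__n_neighbors": [3, 5, 11, 21]}, scoring="recall", cv=5
)
search.fit(X_tr, y_tr)

print(search.best_params_, search.best_score_)
```

Note that `best_score_` is then a mean cross-validated recall, not an accuracy.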

In [None]:
# Def X and Y
y = df['Churn']
X = df.drop('Churn', axis=1)

In [None]:
# split the dataset to train and test the model 
X_train, X_test,y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
print (X_train.shape)
print (X_test.shape)

#### 1. Using K Neighbors Classifier

In [None]:
# Setup the pipeline steps: steps
steps = [('scaler', StandardScaler()),
         ('k_neighbors', KNeighborsClassifier())]

# Create the pipeline: pipeline 
pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'k_neighbors__n_neighbors':np.arange(5,50)}

# Create the GridSearchCV object: knn
knn_cv = GridSearchCV(pipeline,param_grid=parameters,cv=5)

# Fit to the training set
knn_cv.fit(X_train,y_train)

# Compute and print the metrics
print('Best Score: %s' % knn_cv.best_score_)
print('Best Hyperparameters: %s' % knn_cv.best_params_)

In [None]:
# Plot the mean_test_score values (after 5-fold CV) versus n_neighbors from 5 to 49
x1 = np.arange(5,50)
y1 = knn_cv.cv_results_['mean_test_score']
plt.plot(x1,y1)
plt.xlabel('K_neighbors_value')
plt.ylabel('Mean Test Score')
plt.title('Mean Test Score vs. K_Neighbors value')
plt.show()

In [None]:
# Run the model using the best parameter
#knn = KNeighborsClassifier(n_neighbors=knn_cv.best_params_['k_neighbors__n_neighbors'])
knn = knn_cv.best_estimator_

#Make the prediction:
y_pred1 = knn.predict (X_test)

#Classification report
print (classification_report(y_test, y_pred1))

In [None]:
# Compute predicted probabilities: y_pred_prob
y_pred_prob1 = knn.predict_proba(X_test)[:,1]

# Generate ROC curve values: fpr, tpr, thresholds
fpr1, tpr1, thresholds1 = roc_curve(y_test, y_pred_prob1)

# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr1, tpr1)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve of KNN Model')
plt.show()

In [None]:
# plot the confusion matrix
cm = confusion_matrix(y_test, y_pred1)
print (cm)
ax = plt.subplot()
sns.heatmap(cm, annot=True, fmt='g', ax=ax)
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')

In [None]:
# since our objective is to improve recall, we look at the class balance to pick a lower threshold and see the impact
y.value_counts()/len(y)

In [None]:
# assuming the threshold of 0.26
y_pred_new1 = np.where(y_pred_prob1 >=0.26, 1, 0)

cm1 = confusion_matrix(y_test, y_pred_new1)
ax = plt.subplot()
sns.heatmap(cm1, annot=True, fmt='g', ax=ax)
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')

In [None]:
# We improved the recall but the precision is lower
print(classification_report(y_test, y_pred_new1))
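
The trade-off seen above can be made explicit by sweeping a few thresholds instead of testing a single one. This is a minimal sketch on synthetic data; the dataset, the `n_neighbors=15` setting and the list of thresholds are assumptions, not values from this notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the churn data (assumption)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.73, 0.27], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

model = KNeighborsClassifier(n_neighbors=15).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

# Lowering the threshold flags more customers as churners
results = []
for threshold in (0.5, 0.4, 0.26):
    pred = (proba >= threshold).astype(int)
    results.append((
        threshold,
        precision_score(y_te, pred, zero_division=0),
        recall_score(y_te, pred),
    ))

for threshold, precision, recall in results:
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

Recall can only stay the same or rise as the threshold drops, because every customer flagged at a higher threshold is still flagged at a lower one.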

#### 2. Using Logistic Regression 

In [None]:
# Setup the pipeline steps: steps
steps2 = [('scaler', StandardScaler()),
         ('logreg', LogisticRegression())]

# Create the pipeline: pipeline 
pipeline2 = Pipeline(steps2)

# Specify the hyperparameter space
c_space = np.logspace(-5, 8, 15)
param_grid = {'logreg__C': c_space}

# Create the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(pipeline2,param_grid,cv=5)

# Fit to the training set
logreg_cv.fit(X_train,y_train)

# Compute and print the metrics
print('Best Score: %s' % logreg_cv.best_score_)
print('Best Hyperparameters: %s' % logreg_cv.best_params_)

In [None]:
#logreg = LogisticRegression(C=logreg_cv.best_params_['logreg__C'])
logreg = logreg_cv.best_estimator_

#Make the prediction:
y_pred2 = logreg.predict (X_test)

#Classification report
print (classification_report(y_test, y_pred2))

In [None]:
# Compute predicted probabilities: y_pred_prob
y_pred_prob2 = logreg_cv.predict_proba(X_test)[:,1]

# Generate ROC curve values: fpr, tpr, thresholds
fpr2, tpr2, thresholds2 = roc_curve(y_test, y_pred_prob2)

# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr2, tpr2)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve of Logistic Regression')
plt.show()

In [None]:
# plot the confusion matrix
cm = confusion_matrix(y_test, y_pred2)

ax = plt.subplot()
sns.heatmap(cm, annot=True, fmt='g', ax=ax)
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')

In [None]:
y_pred_new2 = np.where(y_pred_prob2 >=0.26, 1, 0)

cm1 = confusion_matrix(y_test, y_pred_new2)
ax = plt.subplot()
sns.heatmap(cm1, annot=True, fmt='g', ax=ax)
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')

In [None]:
# We improved the recall but the precision is lower
print(classification_report(y_test, y_pred_new2))

## M6W2 Assignment

*Q0: Select three different models that you would like to test your dataset with. Make sure that at least two of them are tree-based models.*

*Q1: Explain why you selected these three models. You might want to discuss their performance, explainability, complexity, etc.*

We will start with a simple model, the Decision Tree Classifier, then we will use the Random Forest Classifier and the Gradient Boosting Classifier.

The Decision Tree Classifier is simple to understand, easy to explain and provides a clear visual to guide the decision making process. Its disadvantages include overfitting, error due to bias and error due to variance.

Random forest is an ensemble model that uses a collection of decision trees with a single, aggregated result. Random forests are often considered among the most accurate learning algorithms. They reduce the variance seen in decision trees by using different samples for training, specifying random feature subsets, and building and combining small trees.

Gradient boosting is another ensemble model that uses a set of decision trees. The two main differences between Random Forest and Gradient Boosting are:

- How trees are built: random forests build each tree independently while gradient boosting builds one tree at a time. This additive model (ensemble) works in a forward stage-wise manner, introducing a weak learner to improve on the shortcomings of the existing weak learners.
- Combining results: random forests combine results at the end of the process (by averaging or "majority rules") while gradient boosting combines results along the way.
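
The two differences can be observed directly in scikit-learn. The sketch below uses synthetic data (an assumption) and shows that a fitted forest's probabilities are the average of its independently built trees, while `staged_predict` exposes the boosted ensemble's prediction after each added tree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# Random forest: trees are built independently and combined at the end
rf_demo = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)
tree_avg = np.mean([t.predict_proba(X_demo) for t in rf_demo.estimators_], axis=0)
print(np.allclose(tree_avg, rf_demo.predict_proba(X_demo)))

# Gradient boosting: each stage adds one tree to the current ensemble
gb_demo = GradientBoostingClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)
staged_acc = [accuracy_score(y_demo, p) for p in gb_demo.staged_predict(X_demo)]
print(staged_acc[0], staged_acc[-1])
```

The staged accuracies here are training accuracies, so they illustrate the stage-wise construction rather than generalization.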



### 1. First Model : Decision Tree Classifier

Decision trees are a series of sequential steps designed to answer a question and provide probabilities, costs, or other consequences of making a particular decision.

A decision tree is derived from the independent variables, with each node holding a condition over a feature. Each node decides which node to navigate to next based on its condition. Once a leaf node is reached, an output is predicted. The right sequence of conditions makes the tree efficient. Information gain is used as the criterion to select the conditions in the nodes.
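
To illustrate the criterion, here is a small sketch that computes entropy and information gain for a candidate split. The functions and toy labels are hypothetical. Note also that scikit-learn's `DecisionTreeClassifier` uses Gini impurity by default; information gain corresponds to `criterion="entropy"`.

```python
import numpy as np

def entropy(labels):
    # Shannon entropy (in bits) of a collection of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, left, right):
    # Parent entropy minus the size-weighted entropy of the two children
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = [1, 1, 1, 1, 0, 0, 0, 0]

weak_gain = information_gain(parent, [1, 1, 1, 0], [1, 0, 0, 0])
perfect_gain = information_gain(parent, [1, 1, 1, 1], [0, 0, 0, 0])

print(round(weak_gain, 3))     # about 0.189
print(round(perfect_gain, 3))  # 1.0
```

The tree prefers the condition with the larger gain, so the perfect split would be chosen over the weak one.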

They are simple to understand, providing a clear visual to guide the decision making process. However, this simplicity comes with a few serious disadvantages, including overfitting, error due to bias and error due to variance.

- Overfitting happens for many reasons, including the presence of noise and a lack of representative instances. A single large (deep) tree can easily overfit.
- Bias error happens when you place too many restrictions on target functions. For example, restricting your result with a restricting function (e.g. a linear equation) or with a simple binary algorithm (like the true/false choices at each node of a tree) will often result in bias.
- Variance error refers to how much a result will change based on changes to the training set. Decision trees have high variance, which means that tiny changes in the training data have the potential to cause large changes in the final result.

**Advantages :**
- Little preprocessing needed on data (for example, no feature scaling).
- No assumptions on distribution of data.
- Handles collinearity efficiently.
- Decision trees can provide an understandable explanation of the prediction.

**Disadvantages :**
- Chances of overfitting the model if we keep on building the tree to achieve high purity. Decision tree pruning can be used to solve this issue.
- Prone to outliers.
- Tree may grow to be very complex while training on complicated datasets.
- Loses valuable information while handling continuous variables.
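
As an illustration of pruning, scikit-learn offers cost-complexity pruning through the `ccp_alpha` parameter. The sketch below is a minimal example on synthetic, noisy data; taking the middle value of the pruning path is an arbitrary assumption, and in practice the value would be chosen by cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise, for illustration only
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Fully grown tree (high purity, likely overfitted)
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Candidate pruning strengths; pick the middle one (arbitrary choice)
path = full.cost_complexity_pruning_path(X_tr, y_tr)
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)

print(full.get_n_leaves(), pruned.get_n_leaves())
print(full.score(X_te, y_te), pruned.score(X_te, y_te))
```

The `max_depth` grid used in this notebook is a simpler way to limit tree size; `ccp_alpha` is an alternative that removes branches after the tree is grown.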

**Decision tree vs KNN :**
- Both are non-parametric methods.
- Decision tree supports automatic feature interaction, whereas KNN cannot.
- Decision tree prediction is faster, because KNN does its expensive computation at prediction time.

**Decision Tree vs Logistic Regression :**
- Decision tree handles collinearity better than LR.
- Decision trees report feature importances but not the direction or statistical significance of a feature's effect, which LR can provide.
- Decision trees are better for categorical values than LR.




In [None]:
# Specify the hyperparameters
param_grid = {'max_depth': range(1,10)}#, 'min_samples_leaf':[1,2,3]}

# Create the DecisionTreeClassifier : dt
dt = DecisionTreeClassifier()

# Create the GridSearchCV object: dt_cv
dt_cv = GridSearchCV(dt,param_grid,cv=5,return_train_score = True)

# Fit to the training set
dt_cv.fit(X_train,y_train)

# Compute and print the metrics
print('Best Score: %s' % dt_cv.best_score_)
print('Best Hyperparameters: %s' % dt_cv.best_params_)

In [None]:
dt = dt_cv.best_estimator_

#Make the prediction:
y_pred3 = dt.predict (X_test)

#Classification report
print (classification_report(y_test, y_pred3))

In [None]:
# Compute predicted probabilities: y_pred_prob
y_pred_prob3 = dt_cv.predict_proba(X_test)[:,1]

# Generate ROC curve values: fpr, tpr, thresholds
fpr3, tpr3, thresholds3 = roc_curve(y_test, y_pred_prob3)

# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr3, tpr3)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve of Decision Tree')
plt.show()

In [None]:
# plot the confusion matrix
cm = confusion_matrix(y_test, y_pred3)

ax = plt.subplot()
sns.heatmap(cm, annot=True, fmt='g', ax=ax)
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')

In [None]:
y_pred_new3 = np.where(y_pred_prob3 >=0.26, 1, 0)

cm1 = confusion_matrix(y_test, y_pred_new3)
ax = plt.subplot()
sns.heatmap(cm1, annot=True, fmt='g', ax=ax)
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')

In [None]:
# We improved the recall but the precision is lower
print(classification_report(y_test, y_pred_new3))

In [None]:
data = pd.Series(dt.feature_importances_, index=X.columns)
print (data)

In [None]:
data.sort_values(ascending=True).plot.barh(figsize=(8,6))

In [None]:
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(dt , filled=True, max_depth=3)

### 2. Second Model : Random Forest Classifier

Random Forest is a collection of decision trees with a single, aggregated result. It is an ensemble model that combines a set of decision trees using the “bagging” method to obtain classification and regression outputs. In classification, it calculates the output using majority voting, whereas in regression, the mean is calculated. The derived model is generally more robust and accurate, and handles overfitting better, than its constituent models.

Random forests reduce the variance seen in decision trees by:

- Using different samples for training
- Specifying random feature subsets
- Building and combining small (shallow) trees
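
Each of the three points maps onto a `RandomForestClassifier` parameter. The sketch below is for illustration only: the data is synthetic and the parameter values are assumptions rather than tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=50,
    bootstrap=True,       # each tree is trained on a different bootstrap sample
    max_features="sqrt",  # random subset of features considered at each split
    max_depth=4,          # keep the individual trees shallow
    random_state=0,
).fit(X_demo, y_demo)

# Every tree in the forest respects the depth limit
depths = [tree.get_depth() for tree in forest.estimators_]
print(len(forest.estimators_), max(depths))
```

The grid search in the next cell varies only `max_depth`; the other two settings keep their scikit-learn defaults there.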

**Advantages :**
- Accurate and powerful model.
- Handles overfitting efficiently.
- Supports implicit feature selection and derives feature importance.

**Disadvantages :**
- Computationally complex and slower when the forest becomes large.
- Less interpretable: it does not give a simple explanation for a prediction.

**Decision tree vs Random Forest :**
- Random Forest is a collection of decision trees and average/majority vote of the forest is selected as the predicted output.
- Random Forest model will be less prone to overfitting than Decision tree, and gives a more generalized solution.
- Random Forest is more robust and accurate than decision trees.

In [None]:
# Specify the hyperparameters
param_grid = {'max_depth': range(1,10)}#, 'min_samples_leaf':[1,2,3]}

# Create the RandomForestClassifier : rf
rf = RandomForestClassifier()

# Create the GridSearchCV object: rf_cv
rf_cv = GridSearchCV(rf,param_grid,cv=5,return_train_score = True)

# Fit to the training set
rf_cv.fit(X_train,y_train)

# Compute and print the metrics
print('Best Score: %s' % rf_cv.best_score_)
print('Best Hyperparameters: %s' % rf_cv.best_params_)

In [None]:
rf = rf_cv.best_estimator_

#Make the prediction:
y_pred4 = rf.predict (X_test)

#Classification report
print (classification_report(y_test, y_pred4))

In [None]:
# Compute predicted probabilities: y_pred_prob
y_pred_prob4 = rf_cv.predict_proba(X_test)[:,1]

# Generate ROC curve values: fpr, tpr, thresholds
fpr4, tpr4, thresholds4 = roc_curve(y_test, y_pred_prob4)

# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr4, tpr4)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve of Random Forest')
plt.show()

In [None]:
# plot the confusion matrix
cm = confusion_matrix(y_test, y_pred4)

ax = plt.subplot()
sns.heatmap(cm, annot=True, fmt='g', ax=ax)
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')

In [None]:
y_pred_new4 = np.where(y_pred_prob4 >=0.26, 1, 0)

cm1 = confusion_matrix(y_test, y_pred_new4)
ax = plt.subplot()
sns.heatmap(cm1, annot=True, fmt='g', ax=ax)
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')

In [None]:
data = pd.Series(rf.feature_importances_, index=X.columns)
print (data)

In [None]:
data.sort_values(ascending=True).plot.barh(figsize=(8,6))

### 3. Third Model : Gradient Boosting Classifier

Gradient boosting is also an ensemble of decision trees. The two main differences from Random Forest are:

- How trees are built: random forests build each tree independently while gradient boosting builds one tree at a time. This additive model (ensemble) works in a forward stage-wise manner, introducing a weak learner to improve on the shortcomings of the existing weak learners.
- Combining results: random forests combine results at the end of the process (by averaging or "majority rules") while gradient boosting combines results along the way.

With carefully tuned parameters, gradient boosting can result in better performance than random forests. However, gradient boosting may not be a good choice if we have a lot of noise, as it can result in overfitting. It also tends to be harder to tune than random forests.

Random forests and gradient boosting each excel in different areas. Random forests perform well for multi-class object detection and bioinformatics, which tend to have a lot of statistical noise. Gradient Boosting performs well when you have unbalanced data, such as in real time risk assessment.


**Advantages :**
- Since boosted trees are derived by optimizing an objective function, GBM can be used with almost any objective function whose gradient can be written out. This includes tasks such as ranking and Poisson regression, which are harder to achieve with RF.

**Disadvantages :**
- GBMs are more sensitive to overfitting if the data is noisy.
- Training generally takes longer because trees are built sequentially.
- GBMs are harder to tune than RF. There are typically three parameters: number of trees, depth of trees and learning rate, and each tree built is generally shallow.
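
The three parameters named above can be searched together. This is only a sketch on synthetic data: the grid values are assumptions chosen to keep the run short, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Number of trees, depth of trees and learning rate, searched jointly
grid = {
    "n_estimators": [50, 100],
    "max_depth": [1, 2, 3],
    "learning_rate": [0.05, 0.1],
}

gb_search = GridSearchCV(GradientBoostingClassifier(random_state=0), grid, cv=3)
gb_search.fit(X_demo, y_demo)

print(gb_search.best_params_)
```

The number of trees and the learning rate interact (a smaller learning rate usually needs more trees), which is one reason the joint search is more expensive than tuning `max_depth` alone as in the next cell.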

In [None]:
# Specify the hyperparameters
param_grid = {'max_depth': range(1,10)}#, 'min_samples_leaf':range(1,10)}

# Create the GradientBoostingClassifier : gb
gb =  GradientBoostingClassifier()

# Create the GridSearchCV object: gb_cv
gb_cv = GridSearchCV(gb,param_grid,cv=5,return_train_score = True)

# Fit to the training set
gb_cv.fit(X_train,y_train)

# Compute and print the metrics
print('Best Score: %s' % gb_cv.best_score_)
print('Best Hyperparameters: %s' % gb_cv.best_params_)

In [None]:
gb = gb_cv.best_estimator_

#Make the prediction:
y_pred5 = gb.predict (X_test)

#Classification report
print (classification_report(y_test, y_pred5))

In [None]:
# Compute predicted probabilities: y_pred_prob
y_pred_prob5 = gb_cv.predict_proba(X_test)[:,1]

# Generate ROC curve values: fpr, tpr, thresholds
fpr5, tpr5, thresholds5 = roc_curve(y_test, y_pred_prob5)

# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr5, tpr5)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve of Gradient Boosting')
plt.show()

In [None]:
# plot the confusion matrix
cm = confusion_matrix(y_test, y_pred5)

ax = plt.subplot()
sns.heatmap(cm, annot=True, fmt='g', ax=ax)
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')

In [None]:
y_pred_new5 = np.where(y_pred_prob5 >=0.26, 1, 0)

cm1 = confusion_matrix(y_test, y_pred_new5)
ax = plt.subplot()
sns.heatmap(cm1, annot=True, fmt='g', ax=ax)
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')

In [None]:
data = pd.Series(gb.feature_importances_, index=X.columns)
print (data)

In [None]:
data.sort_values(ascending=True).plot.barh(figsize=(8,6))

In [None]:
# Plot ROC curve to compare all models
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr1, tpr1, label='KNN')
plt.plot(fpr2, tpr2, label='Logistic Regression')
plt.plot(fpr3, tpr3, label='Decision Tree')
plt.plot(fpr4, tpr4, label='Random Forest')
plt.plot(fpr5, tpr5, label='Gradient Boosting')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve of All Models')
plt.legend()
plt.show()

Judging from the ROC curves, all models seem to perform very close to each other.
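
When curves overlap this closely, the area under each curve (AUC) gives a single number to compare. The sketch below shows the calculation with `roc_auc_score` on synthetic data. The two models and the data are stand-ins, so the printed values say nothing about the churn results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=800, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# AUC is computed from the predicted probability of the positive class
auc = {}
for name, model in candidates.items():
    proba = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    auc[name] = roc_auc_score(y_te, proba)

for name, value in auc.items():
    print(f"{name}: AUC = {value:.3f}")
```

In this notebook the same call could be applied to `y_test` with `y_pred_prob1` through `y_pred_prob5`.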

*Q3: Do you believe you have overfitting? Why?*

In [None]:
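# overfitting: mean train vs. mean cross-validated test score for each max_depth
train_score = dt_cv.cv_results_["mean_train_score"]
test_score = dt_cv.cv_results_["mean_test_score"]

x = range (1,10)
plt.plot(x,train_score, label='train')
plt.plot(x,test_score, label='test (CV)')
plt.axvline(dt_cv.best_params_['max_depth'], color='gray',linestyle="--")
plt.xlabel('max_depth')
plt.ylabel('Mean accuracy')
plt.title('Decision Tree')
plt.legend()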

In [None]:
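# overfitting: mean train vs. mean cross-validated test score for each max_depth
train_score = rf_cv.cv_results_["mean_train_score"]
test_score = rf_cv.cv_results_["mean_test_score"]

x = range (1,10)
plt.plot(x,train_score, label='train')
plt.plot(x,test_score, label='test (CV)')
plt.axvline(rf_cv.best_params_['max_depth'], color='gray',linestyle="--")
plt.xlabel('max_depth')
plt.ylabel('Mean accuracy')
plt.title('Random Forest')
plt.legend()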

In [None]:
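# overfitting: mean train vs. mean cross-validated test score for each max_depth
train_score = gb_cv.cv_results_["mean_train_score"]
test_score = gb_cv.cv_results_["mean_test_score"]

x = range (1,10)
plt.plot(x,train_score, label='train')
plt.plot(x,test_score, label='test (CV)')
plt.axvline(gb_cv.best_params_['max_depth'], color='gray',linestyle="--")
plt.xlabel('max_depth')
plt.ylabel('Mean accuracy')
plt.title('Gradient Boosting')
plt.legend()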

From the charts above, in our opinion, once `max_depth` is higher than the best value found by GridSearchCV, each model starts to overfit the training dataset: the training accuracy increases while the test accuracy (the cross-validated `mean_test_score`) decreases or does not improve (for the two latter models).

We can also observe that, out of the three models, the Decision Tree suffers the most from overfitting: as the hyperparameter is increased to raise the accuracy on the train dataset, its accuracy on the test dataset deteriorates significantly, while it remains constant for the Random Forest and drops slightly for the Gradient Boosting.

The Decision Tree is more prone to overfitting than Random Forest and Gradient Boosting because the two latter models are ensemble models, which can better generalize the prediction logic to unseen data.

Next, we compare the training accuracy and test-set accuracy of each model with its best tuned parameters by plotting the accuracy scores.

In [None]:
training_accuracy = []
testing_accuracy = []

# use a list (not a set) so the order matches the model names used in the results table below
models = [knn, logreg, dt, rf, gb]

#Compute the train and test accuracy for each model
for model in models:
    y_pred_train = model.predict(X_train)
    training_accuracy.append(accuracy_score(y_train, y_pred_train))
    Y_pred_test = model.predict(X_test)
    acc_score = accuracy_score(y_test,Y_pred_test)
    testing_accuracy.append(acc_score)

In [None]:
print(training_accuracy)
print(testing_accuracy)

In [None]:
df = pd.DataFrame({'models':["KNN", "Logistic Regression", "Decision Tree", "Random Forest", "Gradient Boosting"], 'TrainAccuracy':training_accuracy, 'TestAccuracy':testing_accuracy})

df.head()

In [None]:
df_melted = df.melt(id_vars='models')
df_melted

In [None]:
sns.lineplot(x='variable', y='value', hue='models', data=df_melted)

From the chart above, we can conclude that three of the models selected so far in this assignment show overfitting: KNN (the worst), Decision Tree and Gradient Boosting, whose test accuracy deteriorates compared to the training accuracy. The Random Forest Classifier generalizes well to unseen data, with its accuracy increasing slightly, and Logistic Regression performs best by far, with its accuracy increasing steeply on the test dataset.