# Exploratory Data Analysis and Machine Learning Classification on Customer Churn

In this notebook, I performed EDA on the 'Customer Churn Dataset' and visualized the data using the Seaborn library. I then built pipelines for several machine learning algorithms, applied k-fold cross-validation to each of them, and evaluated their results. Lastly, I determined the best features for some of the algorithms. I hope this notebook will be useful to you.

### If you have questions, please ask them in the comment section.

### I will be glad if you can give feedback.

## Content:

1. [Importing the Necessary Libraries](#1)
1. [Read Data & Explanation of Features & Information About Datasets](#2)
   1. [Variable Descriptions](#3)
   1. [Univariate Variable Analysis](#4)
      1. [Categorical Variables](#5)
      1. [Numerical Variables](#6)
1. [Basic Data Analysis](#7)
   1. [Distribution of Each Feature](#8)
   1. [Distributions of Each Feature According to 'Churn'](#9)
1. [Data Visualization](#10)   
1. [Pandas Profiling](#11)
1. [Correlation](#12)
1. [Skewness](#13)
1. [Encoding](#14)
   1. [Uniqueness of each Feature](#22)
   1. [Label Encoding](#15)
   1. [One-Hot Encoding](#16)
1. [Train-Test Split](#17)
1. [Pipelines](#18)
   1. [k-Fold Cross Validation](#19)
   1. [Best Features Selection](#20)
1. [Gradient Boosting Classifier](#23)
   1. [Confusion Matrix](#24)
   1. [Classification Report](#25)
   1. [ROC Curve](#26)
   1. [Visualization](#27)
1. [Conclusion](#21)      

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id="1"></a>
# Importing the Necessary Libraries

In [None]:
# Standard library
import os
import warnings
from collections import Counter

# Data handling and plotting
import numpy as np
import pandas as pd
import pandas  # the loading cell below calls pandas.read_csv directly
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
%matplotlib inline

import graphviz
import plotly as py
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot, plot
init_notebook_mode(connected=True)
from wordcloud import WordCloud

from pandas_profiling import ProfileReport

# scikit-learn: preprocessing, feature selection, model selection
from sklearn import preprocessing, tree
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder, normalize
from sklearn.feature_selection import SelectKBest, SelectFromModel, RFE, chi2, f_classif
from sklearn.model_selection import (KFold, GridSearchCV, cross_val_score,
                                     cross_val_predict, train_test_split)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Models
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, CategoricalNB
from sklearn.linear_model import LinearRegression, LogisticRegression, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier, XGBRFClassifier, plot_tree, plot_importance
import lightgbm

# Metrics
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, classification_report,
                             roc_auc_score, roc_curve)

warnings.filterwarnings("ignore")

<a id="2"></a>

# Read Data & Explanation of Features & Information About Datasets

In [None]:
dataset = pandas.read_csv('/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')
dataset.head(490)

In [None]:
dataset.drop("customerID", axis=1, inplace=True)

In [None]:
dataset['TotalCharges'] = dataset['TotalCharges'].apply(lambda x: 0 if x == ' ' else x)

In [None]:
dataset["TotalCharges"] = pd.to_numeric(dataset["TotalCharges"])
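The two cells above replace blank strings with 0 and then convert the column. As a minimal alternative sketch, assuming a toy Series rather than the real column, `pd.to_numeric(..., errors="coerce")` turns every unparseable entry into NaN, so the blanks can be counted before they are filled:

```python
import pandas as pd

# Toy stand-in for the raw 'TotalCharges' column (values are illustrative).
raw = pd.Series(["29.85", " ", "108.15", "1889.5"])

# errors="coerce" converts anything that cannot be parsed into NaN.
parsed = pd.to_numeric(raw, errors="coerce")

# Count the blanks before deciding how to fill them.
n_missing = int(parsed.isna().sum())

# Fill with 0, matching the choice made in this notebook.
cleaned = parsed.fillna(0)
print(n_missing, cleaned.tolist())
```

Filling with 0 is only one option; dropping the rows or imputing a value are alternatives that this notebook does not explore.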

<a id="3"></a>

## Variable Descriptions

* gender    -->    Whether the customer is a male or a female     
* SeniorCitizen   -->    Whether the customer is a senior citizen or not (1, 0)
* Partner       -->       Whether the customer has a partner or not (Yes, No)
* Dependents       -->   Whether the customer has dependents or not (Yes, No)
* tenure            -->  Number of months the customer has stayed with the company
* PhoneService      -->  Whether the customer has a phone service or not (Yes, No)
* MultipleLines     -->  Whether the customer has multiple lines or not (Yes, No, No phone service)
* InternetService   -->  Customer’s internet service provider (DSL, Fiber optic, No)
* OnlineSecurity    -->  Whether the customer has online security or not (Yes, No, No internet service)
* OnlineBackup      -->  Whether the customer has online backup or not (Yes, No, No internet service)
* DeviceProtection  -->  Whether the customer has device protection or not (Yes, No, No internet service)
* TechSupport       -->  Whether the customer has tech support or not (Yes, No, No internet service)
* StreamingTV       -->  Whether the customer has streaming TV or not (Yes, No, No internet service)
* StreamingMovies   -->  Whether the customer has streaming movies or not (Yes, No, No internet service)
* Contract           -->  The contract term of the customer (Month-to-month, One year, Two year)
* PaperlessBilling  -->  Whether the customer has paperless billing or not (Yes, No)
* PaymentMethod     -->  The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
* MonthlyCharges    -->  The amount charged to the customer monthly
* TotalCharges      -->  The total amount charged to the customer
* Churn              -->  Whether the customer churned or not (Yes or No)

In [None]:
dataset.info()

In [None]:
dataset.describe().T

In [None]:
dataset.isnull().sum().sum()

<a id="4"></a>

## Univariate Variable Analysis

* Categorical Variables: ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']

* Numerical Variables: ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges']

<a id="5"></a>

### Categorical Variables

In [None]:
def bar_plot(variable):
    # get feature
    var = dataset[variable]
    # count number of categorical variable(value/sample)
    varValue = var.value_counts()
    
    # visualize
    plt.figure(figsize = (9,3))
    plt.bar(varValue.index, varValue)
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    print("{}:\n{}".format(variable,varValue))

In [None]:
categorical = (dataset.dtypes == "object")
categorical_list = list(categorical[categorical].index)

print("Categorical variables:")
print(categorical_list)

In [None]:
for c in categorical_list:
    bar_plot(c)

<a id="6"></a>

### Numerical Variables

In [None]:
numerical_int64 = (dataset.dtypes == "int64")
numerical_int64_list = list(numerical_int64[numerical_int64].index)

print("Numerical variables (int64):")
print(numerical_int64_list)

In [None]:
def plot_hist(variable):
    plt.figure(figsize = (9,3))
    plt.hist(dataset[variable], bins = 50)
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title("{} distribution with hist".format(variable))
    plt.show()

In [None]:
for n in numerical_int64_list:
    plot_hist(n)

In [None]:
numerical_float64 = (dataset.dtypes == "float64")
numerical_float64_list = list(numerical_float64[numerical_float64].index)

print("Numerical variables:")
print(numerical_float64_list)

In [None]:
for n in numerical_float64_list:
    plot_hist(n)

<a id="7"></a>
# Basic Data Analysis

<a id="8"></a>

## Distribution of Each Feature

These graphs show the distribution of each feature within itself.

In [None]:
plt.figure(figsize=(50,50))
j = 0

for i in categorical_list:
    colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99','#fbdf70','#ac9fd0','#8b7470']
    
    labels = dataset[i].value_counts().index
    sizes = dataset[i].value_counts().values
    
    # Pull out the first (largest) slice; this works for any number of categories
    unique = dataset[i].nunique()
    myexplode = [0.1] + [0] * (unique - 1)
    
    plt.subplot(5,4,j+1)
    plt.pie(sizes, labels=labels, explode = myexplode, shadow = True, startangle=90, colors=colors, autopct='%1.1f%%',textprops={'fontsize': 25})
    plt.title(f'Distribution of {i}',color = 'black',fontsize = 30)
    j += 1

<a id="9"></a>

## Distributions of Each Feature According to 'Churn'

These graphs show the distribution of the variable in each feature according to 'Churn'.

### Gender

In [None]:
plt.figure(figsize=(25,15))
plt.subplot(2,3,1)
plt.title('gender = Female')
dataset.groupby('gender').Churn.value_counts().loc['Female'].plot(kind='bar')

plt.subplot(2,3,2)
plt.title('gender = Male')
dataset.groupby('gender').Churn.value_counts().loc['Male'].plot(kind='bar')

### Partner 

In [None]:
plt.figure(figsize=(25,15))
plt.subplot(2,3,1)
plt.title('Partner = Yes')
dataset.groupby('Partner').Churn.value_counts().loc['Yes'].plot(kind='bar')

plt.subplot(2,3,2)
plt.title('Partner = No')
dataset.groupby('Partner').Churn.value_counts().loc['No'].plot(kind='bar')

### Dependents

In [None]:
plt.figure(figsize=(25,15))
plt.subplot(2,3,1)
plt.title('Dependents = Yes')
dataset.groupby('Dependents').Churn.value_counts().loc['Yes'].plot(kind='bar')

plt.subplot(2,3,2)
plt.title('Dependents = No')
dataset.groupby('Dependents').Churn.value_counts().loc['No'].plot(kind='bar')

### PhoneService

In [None]:
plt.figure(figsize=(25,15))
plt.subplot(2,3,1)
plt.title('PhoneService = Yes')
dataset.groupby('PhoneService').Churn.value_counts().loc['Yes'].plot(kind='bar')

plt.subplot(2,3,2)
plt.title('PhoneService = No')
dataset.groupby('PhoneService').Churn.value_counts().loc['No'].plot(kind='bar')

### MultipleLines

In [None]:
plt.figure(figsize=(25,15))
plt.subplot(2,3,1)
plt.title('MultipleLines = Yes')
dataset.groupby('MultipleLines').Churn.value_counts().loc['Yes'].plot(kind='bar')

plt.subplot(2,3,2)
plt.title('MultipleLines = No')
dataset.groupby('MultipleLines').Churn.value_counts().loc['No'].plot(kind='bar')

plt.subplot(2,3,3)
plt.title('MultipleLines = No phone service')
dataset.groupby('MultipleLines').Churn.value_counts().loc['No phone service'].plot(kind='bar')

### InternetService

In [None]:
plt.figure(figsize=(25,15))
plt.subplot(2,3,1)
plt.title('InternetService = DSL')
dataset.groupby('InternetService').Churn.value_counts().loc['DSL'].plot(kind='bar')

plt.subplot(2,3,2)
plt.title('InternetService = No')
dataset.groupby('InternetService').Churn.value_counts().loc['No'].plot(kind='bar')

plt.subplot(2,3,3)
plt.title('InternetService = Fiber optic')
dataset.groupby('InternetService').Churn.value_counts().loc['Fiber optic'].plot(kind='bar')

### OnlineSecurity

In [None]:
plt.figure(figsize=(25,15))
plt.subplot(2,3,1)
plt.title('OnlineSecurity = Yes')
dataset.groupby('OnlineSecurity').Churn.value_counts().loc['Yes'].plot(kind='bar')

plt.subplot(2,3,2)
plt.title('OnlineSecurity = No')
dataset.groupby('OnlineSecurity').Churn.value_counts().loc['No'].plot(kind='bar')

plt.subplot(2,3,3)
plt.title('OnlineSecurity = No internet service')
dataset.groupby('OnlineSecurity').Churn.value_counts().loc['No internet service'].plot(kind='bar')

### OnlineBackup

In [None]:
plt.figure(figsize=(25,15))
plt.subplot(2,3,1)
plt.title('OnlineBackup = Yes')
dataset.groupby('OnlineBackup').Churn.value_counts().loc['Yes'].plot(kind='bar')

plt.subplot(2,3,2)
plt.title('OnlineBackup = No')
dataset.groupby('OnlineBackup').Churn.value_counts().loc['No'].plot(kind='bar')

plt.subplot(2,3,3)
plt.title('OnlineBackup = No internet service')
dataset.groupby('OnlineBackup').Churn.value_counts().loc['No internet service'].plot(kind='bar')

### DeviceProtection

In [None]:
plt.figure(figsize=(25,15))
plt.subplot(2,3,1)
plt.title('DeviceProtection = Yes')
dataset.groupby('DeviceProtection').Churn.value_counts().loc['Yes'].plot(kind='bar')

plt.subplot(2,3,2)
plt.title('DeviceProtection = No')
dataset.groupby('DeviceProtection').Churn.value_counts().loc['No'].plot(kind='bar')

plt.subplot(2,3,3)
plt.title('DeviceProtection = No internet service')
dataset.groupby('DeviceProtection').Churn.value_counts().loc['No internet service'].plot(kind='bar')

### TechSupport

In [None]:
plt.figure(figsize=(25,15))
plt.subplot(2,3,1)
plt.title('TechSupport = Yes')
dataset.groupby('TechSupport').Churn.value_counts().loc['Yes'].plot(kind='bar')

plt.subplot(2,3,2)
plt.title('TechSupport = No')
dataset.groupby('TechSupport').Churn.value_counts().loc['No'].plot(kind='bar')

plt.subplot(2,3,3)
plt.title('TechSupport = No internet service')
dataset.groupby('TechSupport').Churn.value_counts().loc['No internet service'].plot(kind='bar')

### StreamingTV

In [None]:
plt.figure(figsize=(25,15))
plt.subplot(2,3,1)
plt.title('StreamingTV = Yes')
dataset.groupby('StreamingTV').Churn.value_counts().loc['Yes'].plot(kind='bar')

plt.subplot(2,3,2)
plt.title('StreamingTV = No')
dataset.groupby('StreamingTV').Churn.value_counts().loc['No'].plot(kind='bar')

plt.subplot(2,3,3)
plt.title('StreamingTV = No internet service')
dataset.groupby('StreamingTV').Churn.value_counts().loc['No internet service'].plot(kind='bar')

### StreamingMovies

In [None]:
plt.figure(figsize=(25,15))
plt.subplot(2,3,1)
plt.title('StreamingMovies = Yes')
dataset.groupby('StreamingMovies').Churn.value_counts().loc['Yes'].plot(kind='bar')

plt.subplot(2,3,2)
plt.title('StreamingMovies = No')
dataset.groupby('StreamingMovies').Churn.value_counts().loc['No'].plot(kind='bar')

plt.subplot(2,3,3)
plt.title('StreamingMovies = No internet service')
dataset.groupby('StreamingMovies').Churn.value_counts().loc['No internet service'].plot(kind='bar')

### Contract

In [None]:
plt.figure(figsize=(25,15))
plt.subplot(2,3,1)
plt.title('Contract = Month-to-month')
dataset.groupby('Contract').Churn.value_counts().loc['Month-to-month'].plot(kind='bar')

plt.subplot(2,3,2)
plt.title('Contract = One year')
dataset.groupby('Contract').Churn.value_counts().loc['One year'].plot(kind='bar')

plt.subplot(2,3,3)
plt.title('Contract = Two year')
dataset.groupby('Contract').Churn.value_counts().loc['Two year'].plot(kind='bar')

### PaperlessBilling

In [None]:
plt.figure(figsize=(25,15))
plt.subplot(2,3,1)
plt.title('PaperlessBilling = Yes')
dataset.groupby('PaperlessBilling').Churn.value_counts().loc['Yes'].plot(kind='bar')

plt.subplot(2,3,2)
plt.title('PaperlessBilling = No')
dataset.groupby('PaperlessBilling').Churn.value_counts().loc['No'].plot(kind='bar')

### PaymentMethod

In [None]:
plt.figure(figsize=(25,15))
sns.set_theme(style="darkgrid")

plt.subplot(2,2,1)
plt.title('PaymentMethod = Electronic check')
dataset.groupby('PaymentMethod').Churn.value_counts().loc['Electronic check'].plot(kind='bar')

plt.subplot(2,2,2)
plt.title('PaymentMethod = Mailed check')
dataset.groupby('PaymentMethod').Churn.value_counts().loc['Mailed check'].plot(kind='bar')

plt.subplot(2,2,3)
plt.title('PaymentMethod = Bank transfer (automatic)')
dataset.groupby('PaymentMethod').Churn.value_counts().loc['Bank transfer (automatic)'].plot(kind='bar')

plt.subplot(2,2,4)
plt.title('PaymentMethod = Credit card (automatic)')
dataset.groupby('PaymentMethod').Churn.value_counts().loc['Credit card (automatic)'].plot(kind='bar')
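The cells in this section repeat the same pattern for every feature. A more compact way to get the same information is a normalized cross-tabulation. The sketch below uses a made-up toy frame and a hypothetical helper named `churn_share`; neither is part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for `dataset`; the values are made up.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year", "Two year"],
    "Churn": ["Yes", "No", "No", "No"],
})

def churn_share(frame, column, target="Churn"):
    """Return the share of each target value within every category of `column`."""
    return pd.crosstab(frame[column], frame[target], normalize="index")

shares = churn_share(toy, "Contract")
print(shares)
# shares.plot(kind="bar", stacked=True) would draw one stacked bar per category.
```

Because every row sums to 1, the churn rates of categories with very different sizes become directly comparable, which raw counts do not allow.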

<a id="10"></a>
# Data Visualization

### Numerical values, value ranges and distributions.

In [None]:
plt.figure(figsize=(25,15))

plt.subplot(2,3,1)
sns.histplot(dataset['MonthlyCharges'], color = 'red', kde = True).set_title('MonthlyCharges Interval and Counts')

plt.subplot(2,3,2)
sns.histplot(dataset['TotalCharges'], color = 'green', kde = True).set_title('TotalCharges Interval and Counts')

plt.subplot(2,3,3)
sns.histplot(dataset['tenure'], color = 'blue', kde = True).set_title('tenure Interval and Counts')

### The relationship between 'MonthlyCharges' and 'TotalCharges' and the correlation with 'gender' and 'tenure'.

In [None]:
sns.set_style('darkgrid')
f, ax = plt.subplots(figsize=(20,10))
sns.despine(f, left=True, bottom=True)
sns.set_theme(style="darkgrid")
sns.scatterplot(x='MonthlyCharges', y='TotalCharges',
                hue='gender',
                size='tenure',
                palette='tab20',
                hue_order=['Female', 'Male'],  # hue_order takes the category levels, not the whole column
                sizes=(20, 50),
                linewidth=0,
                data=dataset)

### Distribution of Samples According to 'Churn' = Yes or 'Churn' = No with Histograms

In [None]:
plt.figure(figsize=(50,50))
j = 0
sns.set_theme(style="whitegrid")
for i in categorical_list:
    
    plt.subplot(5,4,j+1)
    sns.histplot(dataset, x="Churn",  hue=dataset[i], multiple="stack", palette="light:m_r", edgecolor=".3", linewidth=.5)
    plt.title(f'Distribution of {i}',color = 'black',fontsize = 25)
    j += 1

<a id="11"></a>

# Pandas Profiling

Pandas Profiling is a library that generates interactive reports about a dataset. With it, we can see the data types, the distribution of each feature, and various summary statistics, which makes it useful during data preparation. The report also includes plots for individual features and correlation maps. You can find more details about this tool at the following URL: https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/

In [None]:
import pandas_profiling as pp
pp.ProfileReport(dataset)

<a id="12"></a>

# Correlation

In [None]:
plt.figure(figsize=(12,8)) 
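# Correlation is only defined for numeric columns, so select them explicitly
sns.heatmap(dataset.select_dtypes(include='number').corr(), annot=True, cmap='Dark2_r', linewidths = 2)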
plt.show()

#### Implications:

* As the heat map shows, there is a high correlation between 'tenure' and 'TotalCharges'.

* Another notable correlation is between 'MonthlyCharges' and 'TotalCharges'.
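One plausible reading of these two correlations is that the total charge accumulates month by month. The sketch below is built on synthetic data and assumes, purely for illustration, that the total equals tenure times the monthly charge. It shows that such a relationship alone produces both correlations even when tenure and monthly charge are unrelated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, independent tenure (months) and monthly charge values.
tenure = rng.integers(1, 73, size=500)
monthly = rng.uniform(20, 120, size=500)

toy = pd.DataFrame({
    "tenure": tenure,
    "MonthlyCharges": monthly,
    # Assumption for this sketch only: total = tenure * monthly charge.
    "TotalCharges": tenure * monthly,
})

# Pearson correlation matrix of the three synthetic columns.
corr = toy.corr()
print(corr.round(2))
```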

In [None]:
sns.pairplot(dataset, hue = 'Churn')

<a id="13"></a>

# Skewness

In [None]:
# Skewness is only defined for numeric columns
dataset.select_dtypes(include='number').agg(['skew'])

In [None]:
skews = ['MonthlyCharges']
from scipy.stats import norm, skew, boxcox
for i in skews:
    sns.set_style('darkgrid')
    sns.distplot(dataset[i], fit = norm)
    plt.title('Skewed')
    plt.show()
    (mu, sigma) = norm.fit(dataset[i])
    print("mu {} : {}, sigma {} : {}".format(i, mu, i, sigma))
    print()
    
    dataset[i], lam = boxcox(dataset[i])

    sns.set_style('darkgrid')
    sns.distplot(dataset[i], fit = norm)
    plt.title('Transformed')
    plt.show()
    (mu, sigma) = norm.fit(dataset[i])
    print("mu {} : {}, sigma {} : {}".format(i, mu, i, sigma))
    print()
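To make the effect of the Box-Cox transform measurable rather than only visual, the sketch below compares skewness before and after the transform. It uses a synthetic log-normal sample instead of 'MonthlyCharges', so the numbers it prints do not describe this dataset.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(0)

# Synthetic, strictly positive, right-skewed sample (Box-Cox requires x > 0).
x = rng.lognormal(mean=3.0, sigma=0.8, size=2000)

skew_before = skew(x)

# boxcox returns the transformed values and the fitted lambda.
transformed, lam = boxcox(x)
skew_after = skew(transformed)

print(f"skew before: {skew_before:.3f}, after: {skew_after:.3f}, lambda: {lam:.3f}")
```

A skewness close to 0 after the transform indicates a roughly symmetric distribution.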

<a id="14"></a>

# Encoding

<a id="22"></a>
## Uniqueness of each Feature

In [None]:
label_encoding = []
one_hot = []

for x in categorical_list:
    a = dataset[x].unique()
    print(f'Unique Values for {x}: ', dataset[x].unique())
    if(len(a) == 2):
        label_encoding.append(x)
    else:
        one_hot.append(x)

<a id="15"></a>

## Label Encoding

Label Encoding is a technique for handling categorical variables in which each category is assigned a unique integer.

In [None]:
for y in label_encoding:
    var = dataset[y].unique()
    y_mapping = {var[0]: 0, var[1]: 1}
    dataset[y] = dataset[y].map(y_mapping)
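Note that `unique()` returns values in order of first appearance, so the loop above assigns 0 to whichever value happens to come first in each column. A sketch of a more explicit variant, on a made-up toy frame with only Yes/No columns, fixes the mapping up front so that every column is encoded the same way:

```python
import pandas as pd

# Toy frame with two binary columns (values are illustrative).
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "No"],
    "Churn": ["No", "Yes", "No"],
})

# One explicit mapping, applied identically to every Yes/No column.
yes_no = {"No": 0, "Yes": 1}

encoded = toy.copy()
for col in ["Partner", "Churn"]:
    encoded[col] = encoded[col].map(yes_no)

print(encoded)
```

With the order-of-appearance approach, 'Partner' in this toy frame would map 'Yes' to 0 while 'Churn' would map 'Yes' to 1. The model can still learn from that, but the columns are harder to read afterwards.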

<a id="16"></a>

## One-Hot Encoding

One-Hot Encoding is a binary representation of categorical variables. Each category is first mapped to an integer index, and each index is then represented as a binary vector that is all zeros except for a 1 at that index.

Many machine learning algorithms cannot work directly with categorical data, so categories must be converted to numbers. One-Hot Encoding does this without implying an order among the categories, which plain integer codes would.

In this part, I converted the categorical features with more than two values to binary indicator columns. This often improves accuracy for models that would otherwise treat the integer codes as ordered quantities.


In [None]:
for i in range(0, len(one_hot)):
    dataset[f'{one_hot[i]}'] = pd.Categorical(dataset[f'{one_hot[i]}'])
    dummies = pd.get_dummies(dataset[f'{one_hot[i]}'], prefix = f'{one_hot[i]}_encoded')
    dataset.drop([f'{one_hot[i]}'], axis=1, inplace=True)
    dataset = pd.concat([dataset, dummies], axis=1)
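The loop above can also be written as a single `pd.get_dummies` call with the `columns` argument. This is a sketch on a made-up toy frame, not a replacement tested against the full dataset:

```python
import pandas as pd

# Toy frame with one multi-valued categorical column (values are illustrative).
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "tenure": [1, 12, 24, 14],
})

# `columns` selects what to encode; the original column is dropped automatically.
encoded = pd.get_dummies(toy, columns=["Contract"], prefix="Contract_encoded")

print(list(encoded.columns))
```

Every row has exactly one indicator set, which is what distinguishes one-hot encoding from the integer codes used in label encoding.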

In [None]:
dataset

<a id="17"></a>
# Train - Test Split

In [None]:
columns = dataset.columns.drop('Churn')

In [None]:
features = columns
label = 'Churn'

X = dataset[features]
y = dataset[label]  # 1-D Series, the target shape scikit-learn expects

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0) 
X_valid, X_test, y_valid, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=0)

print(f'Total # of sample in whole dataset: {len(X)}')
print(f'Total # of sample in train dataset: {len(X_train)}')
print(f'Total # of sample in validation dataset: {len(X_valid)}')
print(f'Total # of sample in test dataset: {len(X_test)}')
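The two calls above produce an 80/10/10 split. Because churn datasets are usually imbalanced, it may be worth keeping the class ratio equal across the splits with `stratify`. This is a suggestion rather than what the notebook does. A minimal sketch on synthetic arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 25% positive class.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 25 + [0] * 75)

# First split: 80% train, 20% held out, preserving the class ratio.
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)

# Second split: halve the held-out part into validation and test.
X_val, X_te, y_val, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest)

print(len(X_tr), len(X_val), len(X_te), y_tr.mean())
```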

<a id="18"></a>
# Pipelines

In [None]:
pipeline_GaussianNB = Pipeline([("scaler",StandardScaler()),
                     ("pipeline_GaussianNB",GaussianNB())])

pipeline_BernoulliNB = Pipeline([("scaler",StandardScaler()),
                     ("pipeline_BernoulliNB",BernoulliNB())])

pipeline_LogisticRegression = Pipeline([("scaler",StandardScaler()),
                     ("pipeline_LogisticRegression",LogisticRegression())])

pipeline_RandomForest = Pipeline([("scaler",StandardScaler()),
                     ("pipeline_RandomForest",RandomForestClassifier())])

pipeline_SVM = Pipeline([("scaler",StandardScaler()),
                     ("pipeline_SVM",SVC())])

pipeline_DecisionTree = Pipeline([("scaler",StandardScaler()),
                     ("pipeline_DecisionTree",DecisionTreeClassifier())])

pipeline_KNN = Pipeline([("scaler",StandardScaler()),
                     ("pipeline_KNN",KNeighborsClassifier())])

pipeline_GBC = Pipeline([("scaler",StandardScaler()),
                     ("pipeline_GBC",GradientBoostingClassifier())])

pipeline_SGD = Pipeline([("scaler",StandardScaler()),
                     ("pipeline_SGD",SGDClassifier(max_iter=5000, random_state=0))])

pipeline_LGBM = Pipeline([("scaler",StandardScaler()),
                     ("pipeline_LGBM",lightgbm.LGBMClassifier())])

pipelines = [pipeline_GaussianNB, pipeline_BernoulliNB, pipeline_LogisticRegression, pipeline_RandomForest, pipeline_SVM, pipeline_DecisionTree, pipeline_KNN, pipeline_GBC, pipeline_SGD, pipeline_LGBM]

pipe_dict = {0: "GaussianNB", 1: "BernoulliNB", 2: "LogisticRegression",3: "RandomForestClassifier", 4: "SupportVectorMachine", 5: "DecisionTreeClassifier",
            6: "KNeighborsClassifier", 7: "GradientBoostingClassifier", 8:"Stochastic Gradient Descent", 9: "LGBM"}

modelNames = ["GaussianNB", 'BernoulliNB','LogisticRegression','RandomForestClassifier','SupportVectorMachine',
             'DecisionTreeClassifier', 'KNeighborsClassifier','GradientBoostingClassifier',
             'Stochastic Gradient Descent', 'LGBM']

i= 0
trainScores = []
validationScores = []
testScores = []

for pipe in pipelines:
    pipe.fit(X_train, y_train)
    print(f'{pipe_dict[i]}')
    print("Train Score of %s: %f     " % (pipe_dict[i], pipe.score(X_train, y_train)*100))
    trainScores.append(pipe.score(X_train, y_train)*100)
    
    print("Validation Score of %s: %f" % (pipe_dict[i], pipe.score(X_valid, y_valid)*100))
    validationScores.append(pipe.score(X_valid, y_valid)*100)
    
    print("Test Score of %s: %f      " % (pipe_dict[i], pipe.score(X_test, y_test)*100))
    testScores.append(pipe.score(X_test, y_test)*100)
    print(" ")
    
    y_predictions = pipe.predict(X_test)
    # scikit-learn expects (y_true, y_pred): rows are actual classes, columns are predictions
    conf_matrix = confusion_matrix(y_test, y_predictions)
    print(f'Confusion Matrix: \n{conf_matrix}\n')
    
    tn = conf_matrix[0,0]
    fp = conf_matrix[0,1]
    tp = conf_matrix[1,1]
    fn = conf_matrix[1,0]

    total = tn + fp + tp + fn
    real_positive = tp + fn
    real_negative = tn + fp

    accuracy  = (tp + tn) / total # Accuracy Rate
    precision = tp / (tp + fp) # Positive Predictive Value
    recall    = tp / (tp + fn) # True Positive Rate
    f1score  = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp) # True Negative Rate
    error_rate = (fp + fn) / total # Misclassification Rate
    prevalence = real_positive / total
    miss_rate = fn / real_positive # False Negative Rate
    fall_out = fp / real_negative # False Positive Rate
    
    print('Evaluation Metrics:')
    print(f'Accuracy    : {accuracy}')
    print(f'Precision   : {precision}')
    print(f'Recall      : {recall}')
    print(f'F1 score    : {f1score}')
    print(f'Specificity : {specificity}')
    print(f'Error Rate  : {error_rate}')
    print(f'Prevalence  : {prevalence}')
    print(f'Miss Rate   : {miss_rate}')
    print(f'Fall Out    : {fall_out}')

    print("") 
    print(f'Classification Report: \n{classification_report(y_test, y_predictions)}\n')
    print("")

    print("*****"*20)
    i +=1
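The metric formulas in the loop rely on scikit-learn's confusion-matrix layout: `confusion_matrix(y_true, y_pred)` puts actual classes on the rows and predicted classes on the columns, so for a binary problem `ravel()` yields `tn, fp, fn, tp` in that order. A small worked example with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 5 actual negatives, 3 actual positives.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

# Rows = actual class, columns = predicted class.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

print(tn, fp, fn, tp, precision, recall)
```

Swapping the two arguments transposes the matrix, which exchanges `fp` with `fn` and therefore precision with recall.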

In [None]:
plt.figure(figsize=(20,10))
sns.set_style('darkgrid')
plt.title('Train - Validation - Test Scores of Models', fontweight='bold', size = 24)

barWidth = 0.25
 
bars1 = trainScores
bars2 = validationScores
bars3 = testScores
 
r1 = np.arange(len(bars1))
r2 = [x + barWidth for x in r1]
r3 = [x + barWidth for x in r2]
 
plt.bar(r1, bars1, color='blue', width=barWidth, edgecolor='white', label='train', yerr=0.5,ecolor="black",capsize=10)
plt.bar(r2, bars2, color='#557f2d', width=barWidth, edgecolor='white', label='validation', yerr=0.5,ecolor="black",capsize=10, alpha = .50)
plt.bar(r3, bars3, color='red', width=barWidth, edgecolor='white', label='test', yerr=0.5,ecolor="black",capsize=10, hatch = '-')
 
modelNames = ["GaussianNB", 'BernoulliNB','LogisticRegression','RandomForestClassifier','SupportVectorMachine',
             'DecisionTreeClassifier', 'KNeighborsClassifier','GradientBoostingClassifier',
             'Stochastic Gradient Descent', 'LGBM']
    
plt.xlabel('Algorithms', fontweight='bold', size = 24)
plt.ylabel('Scores', fontweight='bold', size = 24)
plt.xticks([r + barWidth for r in range(len(bars1))], modelNames, rotation = 75)
 
plt.legend()
plt.show()

In [None]:
table = pd.DataFrame({'Model': modelNames, 'Train': trainScores, 'Validation': validationScores, 'Test': testScores})
table


<a id="19"></a>
## Cross Validation

In [None]:
cv_results_acc = []

for i, model in enumerate(pipelines):
    cv_score = cross_val_score(model, X_train, y_train, scoring = "accuracy", cv = 10)
    cv_results_acc.append(cv_score.mean()*100)
    print("%s: %f" % (pipe_dict[i], cv_score.mean()*100))
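Passing `cv = 10` lets scikit-learn pick the splitter. For readers who want control over shuffling and reproducibility, the splitter can be created explicitly. The sketch below uses synthetic data from `make_classification` and a single logistic-regression pipeline, so the scores it prints say nothing about the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# Scaling lives inside the pipeline, so it is refit on each training fold.
model = make_pipeline(StandardScaler(), LogisticRegression())

# Ten stratified folds with a fixed shuffle for reproducibility.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X_toy, y_toy, scoring="accuracy", cv=folds)

print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the standard deviation next to the mean shows how much the score varies from fold to fold.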

In [None]:
table_cv = pd.DataFrame({'Model': modelNames, 'CV Score': cv_results_acc})
table_cv

In [None]:
plt.figure(figsize=(20,10))
sns.set_style('darkgrid')
plt.title('CV Scores Means', fontweight='bold', size = 24)

barWidth = 0.5
 
bars2 = cv_results_acc

# Size the x positions from the CV results themselves, not from an earlier cell
r1 = np.arange(len(bars2))
r2 = [x + barWidth for x in r1]

plt.bar(r2, bars2, color='#557f2d', width=barWidth, edgecolor='black', yerr=0.5, ecolor="black", capsize=10, label='mean CV accuracy')


modelNames = ["GaussianNB", 'BernoulliNB','LogisticRegression','RandomForestClassifier','SupportVectorMachine',
             'DecisionTreeClassifier', 'KNeighborsClassifier','GradientBoostingClassifier',
             'Stochastic Gradient Descent', 'Light GBM']

plt.xlabel('Algorithms', fontweight='bold', size = 24)
plt.ylabel('Scores', fontweight='bold', size = 24)
plt.xticks([r + barWidth for r in range(len(bars2))], modelNames, rotation = 75)

plt.legend()
plt.show()

<a id="20"></a>
## Best Features Selection

In [None]:
sc=StandardScaler()

# Fit the scaler on the training data only, then reuse it for the other splits
X_train = sc.fit_transform(X_train)
X_valid = sc.transform(X_valid)
X_test = sc.transform(X_test)
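The reason for fitting the scaler only once is that every split must be measured against the same mean and standard deviation. A tiny worked example with made-up numbers shows how refitting on another split changes the result:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature splits.
train = np.array([[0.0], [10.0]])   # mean 5, standard deviation 5
valid = np.array([[5.0], [20.0]])

# Correct: statistics come from the training split only.
scaler = StandardScaler().fit(train)
valid_scaled = scaler.transform(valid)

# Incorrect: refitting uses the validation split's own mean and deviation.
valid_refit = StandardScaler().fit_transform(valid)

print(valid_scaled.ravel(), valid_refit.ravel())
```

With the training statistics, the value 20 is three standard deviations above the mean. After refitting it appears to be only one, so a model trained on the first scale would misread it.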

In [None]:
models = {
    'RandomForestClassifier': RandomForestClassifier(),
    'DecisionTreeClassifier': DecisionTreeClassifier(),
    'GradientBoostingClassifier': GradientBoostingClassifier(),
    'Light GBM': lightgbm.LGBMClassifier(),
}

for m in models:
    model = models[m]
    model.fit(X_train, y_train)

    print(f'{m}')
    # Reuse the model fitted on the training split instead of refitting on the whole dataset
    best_features = SelectFromModel(model, prefit=True)

    transformedX = best_features.transform(X_train)
    print(f"Old Shape: {X_train.shape} New shape: {transformedX.shape}")
    print("\n")

    imp_feature = pd.DataFrame({'Feature': features, 'Importance': model.feature_importances_})
    plt.figure(figsize=(15,10))
    plt.title("Feature Importance Graphic")
    plt.xlabel("importance ")
    plt.ylabel("features")
    plt.barh(imp_feature['Feature'],imp_feature['Importance'])
    plt.show()

In [None]:
models = {
    'BernoulliNB': BernoulliNB(),
    'LogisticRegression': LogisticRegression(),
    'Stochastic Gradient Descent':  SGDClassifier(max_iter=5000, random_state=0),
}

for m in models:
    model = models[m]
    model.fit(X_train, y_train)

    print(f'{m}')
    # Reuse the model fitted on the training split instead of refitting on the whole dataset
    best_features = SelectFromModel(model, prefit=True)

    transformedX = best_features.transform(X_train)
    print(f"Old Shape: {X_train.shape} New shape: {transformedX.shape}")
    print("\n")
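`SelectFromModel` keeps the features whose importance (or coefficient magnitude) reaches a threshold. The sketch below makes that threshold explicit on synthetic data; the data and the `n_estimators` value are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic problem: 10 features, only 3 of them informative.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, n_informative=3,
                                   n_redundant=0, random_state=0)

# Keep features whose importance is at least the mean importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0), threshold="mean")
selector.fit(X_toy, y_toy)

mask = selector.get_support()      # boolean mask over the original columns
X_reduced = selector.transform(X_toy)

print(mask, X_reduced.shape)
```

The boolean mask from `get_support()` can be used to index a list of column names, which shows which features were kept and not only how many.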

<a id="23"></a>
# Gradient Boosting Classifier

In [None]:
gbc_model = GradientBoostingClassifier()
gbc_model.fit(X_train, y_train)

train_score = gbc_model.score(X_train, y_train)
print(f'Train score of trained model: {train_score*100}')

validation_score = gbc_model.score(X_valid, y_valid)
print(f'Validation score of trained model: {validation_score*100}')

test_score = gbc_model.score(X_test, y_test)
print(f'Test score of trained model: {test_score*100}')

<a id="24"></a>
## Confusion Matrix

In [None]:
y_predictions = gbc_model.predict(X_test)

# scikit-learn expects (y_true, y_pred): rows are actual classes, columns are predictions
conf_matrix = confusion_matrix(y_test, y_predictions)

print(f'Accuracy: {accuracy_score(y_test, y_predictions)*100}')
print()
print(f'Confusion matrix: \n{conf_matrix}\n')

sns.heatmap(conf_matrix, annot=True)

In [None]:
tn = conf_matrix[0,0]
fp = conf_matrix[0,1]
tp = conf_matrix[1,1]
fn = conf_matrix[1,0]

total = tn + fp + tp + fn
real_positive = tp + fn
real_negative = tn + fp

In [None]:
accuracy  = (tp + tn) / total # Accuracy Rate
precision = tp / (tp + fp) # Positive Predictive Value
recall    = tp / (tp + fn) # True Positive Rate
f1score  = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp) # True Negative Rate
error_rate = (fp + fn) / total # Misclassification Rate
prevalence = real_positive / total
miss_rate = fn / real_positive # False Negative Rate
fall_out = fp / real_negative # False Positive Rate

print(f'Accuracy    : {accuracy}')
print(f'Precision   : {precision}')
print(f'Recall      : {recall}')
print(f'F1 score    : {f1score}')
print(f'Specificity : {specificity}')
print(f'Error Rate  : {error_rate}')
print(f'Prevalence  : {prevalence}')
print(f'Miss Rate   : {miss_rate}')
print(f'Fall Out    : {fall_out}')

<a id="25"></a>
## Classification Report

In [None]:
predictions = gbc_model.predict(X_test)

print(classification_report(y_test, predictions))

<a id="26"></a>
## ROC Curve

In [None]:
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

def plot_roc_curve(fpr, tpr):
    plt.plot(fpr, tpr, color='orange', label='ROC' )
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()

In [None]:
probs = gbc_model.predict_proba(X_test)
probs = probs[:, 1]

In [None]:
auc = roc_auc_score(y_test, probs)
print('AUC: ', auc*100)

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, probs)
# plot_roc_curve adds the legend itself, after the curve has been drawn
plot_roc_curve(fpr, tpr)
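The AUC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The small worked example below, with made-up labels and scores, checks that interpretation against `roc_auc_score`:

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc_value = roc_auc_score(y_true, scores)

# Pairwise interpretation: count positive/negative pairs ranked correctly.
pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]
wins = sum(p > n for p in pos for n in neg)
pairwise = wins / (len(pos) * len(neg))

print(auc_value, pairwise)
```

Three of the four positive/negative pairs are ordered correctly, so both computations give 0.75. Ties would count as half a win, a case this toy example does not include.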

<a id="27"></a>
## Visualization

In [None]:
!pip install pydotplus

In [None]:
#####
# This code snippet was taken from this url: https://stackoverflow.com/questions/44974360/how-to-visualize-an-sklearn-gradientboostingclassifier
#####

import pydotplus
from sklearn.tree import export_graphviz
from pydotplus import graph_from_dot_data
from IPython.display import Image

sub_tree = gbc_model.estimators_[10, 0]
dot_data = export_graphviz(sub_tree, out_file=None, filled=True, 
                           rounded=True, special_characters=True, proportion=True, impurity=True)

graph = graph_from_dot_data(dot_data)
Image(graph.create_png())

<a id="21"></a>
# Conclusion

This notebook covered visualization and machine learning. If you like my visualizations and want to know how I made them, you can check my other notebooks about the Seaborn and Plotly libraries via these links:

**EDA: Visualization with Plotly for Beginners**

* https://www.kaggle.com/barisscal/eda-visualization-with-plotly-for-beginners


**EDA: Visualization with Seaborn**

* https://www.kaggle.com/barisscal/eda-visualization-with-seaborn


* If you have questions, please leave them in the comments. I will try to explain anything that is unclear.
* If you liked this notebook, please let me know :)

Thank you for your time.