<span style="font-family: Arial; font-weight:bold;font-size:2.5em;color:#00b3e5;">Ensemble Technique Project

**DOMAIN: Telecom**

• **CONTEXT:** A telecom company wants to use their historical customer data to predict behaviour to retain customers. You can 
analyse all relevant customer data and develop focused customer retention programs.

• **DATA DESCRIPTION:** Each row represents a customer, and each column contains one of the customer’s attributes as described in the column 
metadata. The data set includes information about:

• Customers who left within the last month – the column is called Churn.

• Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device 
protection, tech support, and streaming TV and movies

• Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly 
charges, and total charges

• Demographic info about customers – gender, age range, and if they have partners and dependents.

• **PROJECT OBJECTIVE:** Build a model that will help to identify the potential customers who have a higher probability to churn. 
This will help the company understand the pain points and patterns behind customer churn and sharpen its focus on customer retention strategy.

In [None]:
# Importing Libraries
import pandas as pd
import os
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from scipy import stats
%matplotlib inline
sns.set_style('darkgrid')
from sklearn.preprocessing import MinMaxScaler
from scipy.stats import zscore
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn import model_selection
import warnings
warnings.filterwarnings("ignore")

<span style="font-family: Arial; font-weight:bold;font-size:1.5em;color:#00b3e5;"> 1.Importing datasets
    
There are two datasets given
    
   1) TelcomCustomer-Churn_1.csv
    
   2) TelcomCustomer-Churn_2.csv

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=UserWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
df = pd.read_csv("../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.head(3)

**Shape and size of final dataset**

In [None]:
print(f"Shape of final Dataset : {df.shape}")
print(f"Size of final Dataset : {df.size}")

**Check any duplicate in data**

In [None]:
df[df.duplicated(keep = 'first')] #No Duplicates in the data

**Variable Descriptions:**

gender --> Whether the customer is a male or a female

SeniorCitizen --> Whether the customer is a senior citizen or not (1, 0)

Partner --> Whether the customer has a partner or not (Yes, No)

Dependents --> Whether the customer has dependents or not (Yes, No)

tenure --> Number of months the customer has stayed with the company

PhoneService --> Whether the customer has a phone service or not (Yes, No)

MultipleLines --> Whether the customer has multiple lines or not (Yes, No, No phone service)

InternetService --> Customer’s internet service provider (DSL, Fiber optic, No)

OnlineSecurity --> Whether the customer has online security or not (Yes, No, No internet service)

OnlineBackup --> Whether the customer has online backup or not (Yes, No, No internet service)

DeviceProtection --> Whether the customer has device protection or not (Yes, No, No internet service)

TechSupport --> Whether the customer has tech support or not (Yes, No, No internet service)

StreamingTV --> Whether the customer has streaming TV or not (Yes, No, No internet service)

StreamingMovies --> Whether the customer has streaming movies or not (Yes, No, No internet service)

Contract --> The contract term of the customer (Month-to-month, One year, Two year)

PaperlessBilling --> Whether the customer has paperless billing or not (Yes, No)

PaymentMethod --> The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))

MonthlyCharges --> The amount charged to the customer monthly

TotalCharges --> The total amount charged to the customer

Churn --> Whether the customer churned or not (Yes or No)

<span style="font-family: Arial; font-weight:bold;font-size:1.5em;color:#00b3e5;"> 2.Data Cleansing

In [None]:
df.info()

 <span style="font-family: Arial; font-weight:bold;font-size:1.0em;color:#00b3e5;">Missing value treatment

In [None]:
empty_cols=['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection','TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']
for i in empty_cols:
    df[i]=df[i].replace(" ",np.nan)

In [None]:
df.isnull().sum()

**Observed 11 missing values in TotalCharges.**

**Impute missing values with Mean.**

**TotalCharges needs to be converted to float, because its values are continuous.**

In [None]:
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"])

In [None]:
df.mean(numeric_only=True)  # restrict to numeric columns; recent pandas raises on object columns otherwise

In [None]:
df.fillna(df.mean(numeric_only=True),inplace = True)  # only numeric columns have a mean

**NaN values are filled with the mean of the respective attribute; here this affects only TotalCharges.**

In [None]:
df.isnull().sum()

**Imputed missing values with Mean for TotalCharges.**

In [None]:
df_graph = df.copy() # Copy kept with the original categories for the univariate, bivariate and multivariate plots.

In [None]:
df_graph.columns

<span style="font-family: Arial; font-weight:bold;font-size:1.0em;color:#00b3e5;">Convert Categorical attributes to continuous.

In [None]:
col_obj = [c for c in df.columns if df[c].dtype == 'object'] # SeniorCitizen is categorical too, but it is already stored as int, so it is not listed here.
for col in col_obj:
    uniques = sorted(df[col].unique())
    print('{0:20s} {1:5d} \t'.format(col,len(uniques)),uniques[:10])

**Attributes with Yes/No values are encoded as Yes=1, No=0 (and gender as Male=1, Female=0).**

In [None]:
df.gender = [1 if each == "Male" else 0 for each in df.gender]

columns_to_convert = ['Partner',
                      'Dependents',
                      'PhoneService',
                      'PaperlessBilling',
                      'Churn']

for item in columns_to_convert:
    df[item] = [1 if each == "Yes" else 0 for each in df[item]]
    
df.head()

**Converting the remaining categorical attributes to indicator (dummy) columns using get_dummies from pandas.**

(**Pandas.get_dummies:** This method converts string columns into a one-hot representation; all string columns are converted unless particular columns are specified.)

(**OneHotEncoder:** scikit-learn versions before 0.20 could not process string values directly, so string features first had to be mapped to integers. Recent versions accept string categories.)

get_dummies is the more convenient choice here, since we can specify the columns and control the output dtype.
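
To make the one-hot idea concrete, here is a minimal dependency-free sketch of what the encoding does to a single column. The helper name `one_hot` and the four sample values are illustrative assumptions; the `#` separator mirrors the prefix used in the next cell. It is not how pandas implements `get_dummies` internally.

```python
def one_hot(values, prefix):
    """Return {column_name: [0/1, ...]} with one indicator column per category."""
    categories = sorted(set(values))
    return {
        "{}#{}".format(prefix, cat): [1 if v == cat else 0 for v in values]
        for cat in categories
    }

# Hypothetical sample of the Contract column
contract = ["Month-to-month", "Two year", "One year", "Month-to-month"]
encoded = one_hot(contract, "Contract")
for name, column in encoded.items():
    print(name, column)
```

Each row has exactly one 1 across the new columns, which is why one column per attribute is redundant and can later be dropped.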

In [None]:
category_cols=['InternetService','Contract', 'PaymentMethod', 'OnlineSecurity','MultipleLines',
                      'OnlineBackup',
                      'DeviceProtection',
                      'TechSupport',
                      'StreamingTV',
                      'StreamingMovies',]

for cc in category_cols:
    dummies = pd.get_dummies(df[cc], drop_first=False)
    dummies = dummies.add_prefix("{}#".format(cc))
    df.drop(cc, axis=1, inplace=True)
    df = df.join(dummies)
df.head()

**Drop customerID, because it has no influence on the target variable.**

In [None]:
df.drop('customerID',axis=1,inplace=True)

In [None]:
df.tail(1)

In [None]:
df.columns

**Examining correlation of "Churn" with other features**

In [None]:
plt.figure(figsize=(15,8))
df.corr()['Churn'].sort_values(ascending = False).plot(kind='bar')

**Observations**

Month-to-month contracts and the absence of online security and tech support are positively correlated with churn, while tenure and two-year contracts are negatively correlated with churn.

Interestingly, the "No internet service" level of services such as online security, streaming TV, online backup and tech support is negatively correlated with churn.

PhoneService, gender and MultipleLines#No phone service do not influence churn much.

In [None]:
df.info()

In [None]:
# Checking Correlation Heatmap
plt.figure(dpi = 540,figsize= (30,25))
mask = np.triu(np.ones_like(df.corr()))
sns.heatmap(df.corr(),mask = mask, fmt = ".2f",annot=True,lw=1,cmap = 'plasma')
plt.yticks(rotation = 0)
plt.xticks(rotation = 90)
plt.title('Correlation Heatmap')
plt.show()

**The "No internet service" columns of OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV and StreamingMovies are highly correlated with each other, and all of them are highly correlated with InternetService#No.**

**MultipleLines#No phone service and PhoneService are perfectly negatively correlated (correlation of -1).**

**Let's drop MultipleLines#No phone service and the "No internet service" columns of OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV and StreamingMovies, keeping InternetService#No.**
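
The perfect negative correlation follows directly from the encoding: one indicator is always 1 minus the other. A small sketch with made-up values (the `pearson` helper and the six sample rows are assumptions for illustration) shows the arithmetic:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

phone_service = [1, 1, 0, 1, 0, 1]                 # hypothetical values
no_phone_service = [1 - v for v in phone_service]  # complementary indicator
r = pearson(phone_service, no_phone_service)
print(r)
```

Because the two columns carry the same information, keeping both adds nothing for the model.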

In [None]:
df.drop(['OnlineSecurity#No internet service',
         'OnlineBackup#No internet service',
        'DeviceProtection#No internet service',
        'StreamingTV#No internet service',
        'TechSupport#No internet service',
         'MultipleLines#No phone service',
        'StreamingMovies#No internet service'], axis=1,inplace=True)

In [None]:
df.head()

<span style="font-family: Arial; font-weight:bold;font-size:1.5em;color:#00b3e5;"> 3.Data Analysis and Visualization

In [None]:
df.describe().T

<span style="font-family: Arial; font-weight:bold;font-size:1.5em;color:#00b3e5;"> Univariate and Bivariate Analysis

In [None]:
plt.figure(figsize=(20,6))
plt.subplot(1, 3, 1)
plt.title('tenure')
sns.histplot(df['tenure'],color='green',kde = True)

# subplot 2
plt.subplot(1, 3, 2)
plt.title('MonthlyCharges')
sns.histplot(df['MonthlyCharges'],color='blue',kde = True)

# subplot 3
plt.subplot(1, 3, 3)
plt.title('TotalCharges')
sns.histplot(df['TotalCharges'],color='red', kde = True)


**The tenure distribution is bimodal, with most of the density at the low and high ends.**

**Many customers have monthly charges below 30, but the largest group of customers lies between 70 and 100.**

**Few customers have total charges above 2000.**

In [None]:
fig, ax = plt.subplots(1, 3)
fig.set_figheight(5)
fig.set_figwidth(18)
sns.boxplot(x='Churn', y ='tenure', data=df, ax=ax[0])
sns.boxplot(x='Churn', y ='MonthlyCharges', data= df, ax=ax[1])
sns.boxplot(x='Churn', y='TotalCharges',data=df, ax=ax[2])
ax[0].set_title("Customer churn out based on tenure",fontsize=15)
ax[1].set_title('Customer churn out based on MonthlyCharges',fontsize=15)
ax[2].set_title('Customer churn out based on TotalCharges',fontsize=15)
plt.show()

**Observations :**

**Customers with shorter tenure are more likely to churn.**

**Customers with higher monthly charges are more likely to churn.**

**Customers with total charges below 2000 are more likely to churn.**

In [None]:
df.columns

In [None]:
columns = ['gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod',]

In [None]:
a = 10  # number of subplot rows
b = 3   # number of subplot columns

fig = plt.figure(figsize=(20,80))

# One count plot per categorical column, split by churn status
for c, col in enumerate(columns, start=1):
    plt.subplot(a, b, c)
    plt.title(col)
    plt.xlabel(col)
    sns.countplot(x=col, hue="Churn", data=df_graph)

plt.show()

Observation :

Churn is about the same for male and female customers.

Churn is higher among customers who have no partner or dependents, have phone service, use fiber optic internet, have no online security, online backup, device protection or tech support, have streaming TV or streaming movies, are on a month-to-month contract, use paperless billing, or pay by electronic check.

<span style="font-family: Arial; font-weight:bold;font-size:1.5em;color:#00b3e5;"> Multivariate Analysis

In [None]:
fig, axes = plt.subplots(5, 2, sharex=True, figsize=(20, 10))
fig.suptitle('Summary')
sns.barplot(ax=axes[0, 0], x="tenure", y="Contract", hue="gender", data=df_graph,orient="h")
sns.barplot(ax=axes[0, 1], x="tenure", y="Contract", hue="PaymentMethod", data=df_graph,orient="h")
sns.barplot(ax=axes[1, 0], x="tenure", y="StreamingMovies", hue="gender", data=df_graph,orient="h")
sns.barplot(ax=axes[1, 1], x="tenure", y="StreamingMovies", hue="Partner", data=df_graph,orient="h")
sns.barplot(ax=axes[2, 0], x="MonthlyCharges", y="InternetService", hue="StreamingTV", data=df_graph,orient="h")
sns.barplot(ax=axes[2, 1], x="tenure", y="OnlineSecurity", hue="DeviceProtection", data=df_graph,orient="h")
sns.barplot(ax=axes[3, 0], x="tenure", y="OnlineSecurity", hue="InternetService", data=df_graph,orient="h")
sns.barplot(ax=axes[3, 1], x="tenure", y="Contract", hue="PaperlessBilling", data=df_graph,orient="h")
sns.barplot(ax=axes[4, 0], x="tenure", y="Contract", hue="SeniorCitizen", data=df_graph,orient="h")
sns.barplot(ax=axes[4, 1], x="tenure", y="InternetService", hue="SeniorCitizen", data=df_graph,orient="h")

From the left to the right:

Contract, gender and tenure: no significant pattern; males and females behave the same.

Payment methods: electronic check, bank transfer and credit card are the preferred payment methods; mailed check is the least used across all contract types.

Streaming movies, gender and tenure: no significant pattern; males and females behave the same.

Streaming movies: most customers who use this service have a partner.

Fiber optic is expensive, which may be why customers are leaving this product.

Some customers have device protection without online security. This is odd; the company could tell them it is not necessary and offer a more useful service instead, in order to gain their trust.

Internet service customers with long tenure tend to have online security.

Long tenure is associated with paperless billing, so the company should promote this billing option.

**Correlation of churn with respect to other features.**

In [None]:
plt.figure(figsize=(15,8))
df.corr()['Churn'].sort_values(ascending = False).plot(kind='bar')

**Observation :**
    
A few variables are positively correlated with churn.

Gender does not influence churn; the bivariate and multivariate plots confirm that males and females behave the same.

Other variables are negatively correlated with churn.

<span style="font-family: Arial; font-weight:bold;font-size:2.0em;color:#00b3e5;"> Hypothesis Testing

We now check whether the assumption that a few variables have a positive impact on churn holds.

**Do these variables have a significant impact on churn?**

A **Chi-square test** is used to test this assumption.

The Chi-square test of independence determines whether there is a statistically significant relationship between categorical variables. It is a hypothesis test that answers the question—do the values of one categorical variable depend on the value of other categorical variables?

The Chi-square test of association evaluates relationships between categorical variables. Like any statistical hypothesis test, the Chi-square test has both a null hypothesis and an alternative hypothesis.

**Null hypothesis:** There are no relationships between the categorical variables. If you know the value of one variable, it does not help you predict the value of another variable.

**Alternative hypothesis:** There are relationships between the categorical variables. Knowing the value of one variable does help you predict the value of another variable.
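
As a worked illustration of the test statistic, the sketch below computes the Chi-square statistic by hand for a 2x2 table. The counts are made-up values, not results from this dataset, and `chi_square_statistic` is a hypothetical helper written only for this example.

```python
def chi_square_statistic(observed):
    """Return (statistic, degrees_of_freedom, expected) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    # Expected count under independence: row total * column total / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(observed, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof, expected

# Rows: Churn = 0 / 1, columns: feature = 0 / 1 (hypothetical counts)
table = [[30, 20],
         [10, 40]]
chi2_stat, dof, expected = chi_square_statistic(table)

CRITICAL_5PCT_DOF1 = 3.841  # Chi-square critical value for 1 degree of freedom at alpha = 0.05
print(chi2_stat, dof, chi2_stat > CRITICAL_5PCT_DOF1)
```

Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so its statistic for the same table will be somewhat smaller than this uncorrected value.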

In [None]:
['Contract#Month-to-month','OnlineSecurity#No','TechSupport#No','InternetService#Fiber optic',
 'PaymentMethod#Electronic check','DeviceProtection#No','OnlineBackup#No', 'PaperlessBilling','SeniorCitizen',
 'StreamingTV#No', 'StreamingTV#Yes','StreamingMovies#No', 'StreamingMovies#Yes']

In [None]:
var = ['Contract#Month-to-month','OnlineSecurity#No','TechSupport#No','InternetService#Fiber optic',
 'PaymentMethod#Electronic check','DeviceProtection#No','OnlineBackup#No', 'PaperlessBilling','SeniorCitizen',
 'StreamingTV#No', 'StreamingTV#Yes','StreamingMovies#No', 'StreamingMovies#Yes']
# Test whether each variable is significantly associated with churn
for i in var:
    df_var = pd.pivot_table(data=df,index='Churn',columns= i,aggfunc='size')
    chi_sq_Stat, p_value, deg_freedom, exp_freq = stats.chi2_contingency(df_var)
    print("{}Chi statistics of {}".format('\033[92m',i))
    print('{} chi_sq_Stat: {}'.format('\033[92m',chi_sq_Stat))
    print('{} p_value: {}'.format('\033[92m',p_value))
    print('{} deg_freedom: {}'.format('\033[92m',deg_freedom))
    if p_value < 0.05:  # Setting our significance level at 5%
        print('{} Rejecting Null Hypothesis.Means {} has significant impact on churn'.format('\033[92m',i))
    else:
        print('{} Fail to Reject Null Hypothesis. Means {} has no significant impact on churn'.format('\033[92m',i))
    print('\n')

<span style="font-family: Arial; font-weight:bold;font-size:1.0em;color:#00b3e5;"> The assumption holds: these variables have a statistically significant association with churn.

“Fiber optic” internet service ranks at the top in terms of positive impact on churn. We would expect fast internet to make a customer stay, but the data says otherwise, possibly because the service is expensive.

<span style="font-family: Arial; font-weight:bold;font-size:1.5em;color:#00b3e5;">4.Data pre-processing

**Distribution of Target Variable.**

In [None]:
count_no_churn = (df['Churn'] == 0).sum()
print("Number of customers who didn't churn:",count_no_churn)
count_yes_churn = (df['Churn']==1).sum()
print("Number of customers who churned:",count_yes_churn)

In [None]:
fig, ax = plt.subplots(figsize=(20,8))
width = len(df['Churn'].unique())+6
fig.set_size_inches(width , 8)
ax=sns.countplot(data = df, x= 'Churn') 



for p in ax.patches: 
    ax.annotate(str((np.round(p.get_height()/len(df)*100,decimals=2)))+'%', (p.get_x()+p.get_width()/2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')


**Imbalance in dataset:**

As we can see, the target variable is not equally distributed: only 26.54% of customers have churned. A model trained on this dataset may therefore be biased towards the majority class (here, customers who did not churn) and ignore the minority class. Hence, we should balance the dataset so that the model learns and predicts without being biased and treats both classes equally.

**Balancing the Target Variable**

The target variable is balanced with SMOTE (Synthetic Minority Oversampling Technique). Once the training data is created, the minority class (in our case 'yes_churn', customers who churn) is up-sampled using the SMOTE algorithm. At a high level, SMOTE:

1. Creates synthetic samples from the minority class (yes-churn) instead of creating copies.

2. Randomly chooses one of the k nearest neighbours of a minority sample and uses it to create a similar, but randomly tweaked, new observation.
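
The core interpolation step can be sketched in a few lines. This is a simplified illustration, not the imbalanced-learn implementation: the neighbours are assumed to be precomputed, and `smote_like_sample` and the sample points are hypothetical.

```python
import random

def smote_like_sample(point, neighbours, rng):
    """Create one synthetic point between `point` and a randomly chosen neighbour."""
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # uniform in [0, 1)
    # Move a random fraction of the way from the point towards the neighbour
    return [p + gap * (q - p) for p, q in zip(point, neighbour)]

rng = random.Random(0)
minority_point = [1.0, 2.0]
nearest_neighbours = [[2.0, 3.0], [1.5, 1.0]]  # assumed to be precomputed minority neighbours
synthetic = smote_like_sample(minority_point, nearest_neighbours, rng)
print(synthetic)
```

The synthetic point always lies on the line segment between the original point and the chosen neighbour, which is what makes it similar to, but not a copy of, existing minority samples.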

 **Segregate predictors vs target attributes.**

In [None]:
from sklearn.model_selection import train_test_split
X = df.loc[:, df.columns != 'Churn']
y = df.loc[:, df.columns == 'Churn']
print('Shape of X: {}'.format(X.shape))
print('Shape of y: {}'.format(y.shape))

**Standardization (Scaling) for numerical variables**

In [None]:
from sklearn.preprocessing import StandardScaler
cols_to_scale = ["MonthlyCharges","TotalCharges","tenure"]
scaler=StandardScaler()
X[cols_to_scale]=scaler.fit_transform(X[cols_to_scale])
X.sample(5)

**Training Data=70%, Test Data=30%**

In [None]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print("Shape of X_train: ", X_train.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_test: ", y_test.shape)
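
In this notebook the scaler is fitted on all rows before the split. A common alternative, sketched below under stated assumptions, learns the scaling statistics from the training rows only and then reuses them for the test rows, so that no test information enters preprocessing. The helper names and the sample values are hypothetical; with scikit-learn the same idea is `fit` on the training split followed by `transform` on both splits.

```python
import statistics

def fit_standardizer(values):
    """Learn mean and population standard deviation from training values only."""
    return statistics.fmean(values), statistics.pstdev(values)

def apply_standardizer(values, mean, std):
    """Scale values with previously learned statistics."""
    return [(v - mean) / std for v in values]

train_charges = [20.0, 50.0, 80.0, 110.0]  # hypothetical training values
test_charges = [65.0, 95.0]                # hypothetical held-out values

mean, std = fit_standardizer(train_charges)
train_scaled = apply_standardizer(train_charges, mean, std)
test_scaled = apply_standardizer(test_charges, mean, std)  # reuses the training statistics
print(train_scaled, test_scaled)
```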

In [None]:

#!pip install -U imbalanced-learn

from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=0)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of train_y: {}'.format(y_train_res.shape))



The training data is now perfectly balanced. Only **the training data was over-sampled**: because **none of the information in the test data is used to create synthetic observations**, no information bleeds from the test data into model training.

<span style="font-family: Arial; font-weight:bold;font-size:1.0em;color:#00b3e5;">Check if the train and test data have similar statistical characteristics when compared with original data

**To check for similar characteristics, we treat the train data and the test data each as a sample and compare them separately with the population of the original data.**

To do this, we run a hypothesis test using a one-sample Z-test.

**z  tests** are a statistical way of testing a hypothesis when either:

* We know the population variance, or
* We do not know the population variance but our sample size is large (n ≥ 30).

We perform the **One-Sample Z test** when we want to compare a sample mean with the population mean.

$$SE = \frac{\sigma}{\sqrt{n}}, \qquad z = \frac{\bar{x} - \mu}{SE}$$

where,

$\bar{x}$: mean of the sample.

$\mu$: mean of the population.

$\sigma$: standard deviation of the population.

$n$: sample size.
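
A quick worked example of the standard error and z statistic defined above, using made-up numbers chosen for easy arithmetic (they are not values from this dataset, and `one_sample_z` is a hypothetical helper):

```python
import math

def one_sample_z(sample_mean, population_mean, population_sd, n):
    """z statistic for comparing a sample mean with a known population mean."""
    standard_error = population_sd / math.sqrt(n)
    return (sample_mean - population_mean) / standard_error

# Hypothetical inputs: SE = 30 / sqrt(900) = 1, so z = (65.5 - 64.0) / 1
z_example = one_sample_z(sample_mean=65.5, population_mean=64.0, population_sd=30.0, n=900)
z_crit = 1.96  # two-tailed critical value at alpha = 0.05
reject_null = abs(z_example) > z_crit
print(z_example, reject_null)
```

The decision compares the absolute value of the statistic with the critical value, because the test is two-tailed.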

**Let's use the MonthlyCharges attribute (it is numeric and has a positive impact on churn) to check for similar characteristics.**

**Population from Original data**

In [None]:
Original = X['MonthlyCharges']
mu = Original.mean()
sigma = Original.std(ddof=0)
print("mu: ", mu, ", sigma:", sigma)

**Sample from Train data**

In [None]:
train = X_train['MonthlyCharges']
X_bar = train.mean()
n= X_train['MonthlyCharges'].size
print("X_Bar: ", X_bar, ", n:", n)

Hypotheses for checking whether the train and test data have characteristics similar to the original data:

* H<sub>0</sub>: The sample from the train or test data comes from the original population, x_bar = &mu;.
* H<sub>A</sub>: The sample from the train or test data does not come from the original population, x_bar != (not equal) &mu;.

In [None]:
import numpy as np

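z_critical = 1.96 # two-tailed critical value at alpha = 0.05
SE = sigma/np.sqrt(n)
z_stat = (X_bar - mu)/SE
print(z_stat)
# Compare the magnitude of z_stat with the critical value
print("Reject H0" if abs(z_stat) > z_critical else "Fail to reject H0")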

**Sample from Test Data**

In [None]:
test = X_test['MonthlyCharges']
X_bar_Test = test.mean()
n2= X_test['MonthlyCharges'].size
print("X_Bar: ", X_bar_Test, ", n:", n2)

In [None]:
import numpy as np

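z_critical = 1.96 # two-tailed critical value at alpha = 0.05
SE = sigma/np.sqrt(n2)
z_stat = (X_bar_Test - mu)/SE
print(z_stat)
# Compare the magnitude of z_stat with the critical value
print("Reject H0" if abs(z_stat) > z_critical else "Fail to reject H0")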

Since the absolute value of z_stat is less than z_critical in both cases, we fail to reject the null hypothesis. Statistically, the train and test sample means are not significantly different from the population mean, which is consistent with both samples being drawn from the same population.

We can conclude that test and train have similar characteristics when compared with original data.

<span style="font-family: Arial; font-weight:bold;font-size:1.5em;color:#00b3e5;">5.Model training, testing and tuning

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier
#!pip install catboost
from catboost import CatBoostClassifier
#!pip install xgboost
from xgboost import XGBClassifier
#!pip install lightgbm
from lightgbm import LGBMClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split, RandomizedSearchCV, StratifiedKFold
# plot_roc_curve and plot_precision_recall_curve were removed in scikit-learn 1.2; the Display classes replace them
from sklearn.metrics import accuracy_score,confusion_matrix,roc_auc_score,ConfusionMatrixDisplay,precision_score,recall_score,f1_score,classification_report,roc_curve,auc,precision_recall_curve,average_precision_score,RocCurveDisplay,PrecisionRecallDisplay
from sklearn.ensemble import VotingClassifier

In [None]:
#Ensemble Algorithms
models = []
models.append(['XGBClassifier',XGBClassifier(learning_rate=0.1,objective='binary:logistic',random_state=0,eval_metric='mlogloss')])
models.append(['RandomForest',RandomForestClassifier(random_state=0)])
models.append(['AdaBoostClassifier',AdaBoostClassifier()])
models.append(['GBClassifier',GradientBoostingClassifier(n_estimators = 50, learning_rate = 0.1, random_state=0)])
models.append(['LGBMClassifier',LGBMClassifier(random_state=0)])
models.append(['CatBoostClassifier',CatBoostClassifier(learning_rate=0.1,loss_function= 'Logloss', eval_metric='AUC',random_state=0)])
models.append(['BaggingClassifier', BaggingClassifier(n_estimators=50, max_samples= .7, bootstrap=True, oob_score=True, random_state=22)])

# For hybrid model preparation
Hybrid = []
Hybrid.append(['RidgeClassifier',RidgeClassifier()])
Hybrid.append(['Logistic Regression',LogisticRegression(random_state=0)])
Hybrid.append(['SVM',SVC(random_state=0)])
Hybrid.append(['KNeigbors',KNeighborsClassifier()])
Hybrid.append(['GaussianNB',GaussianNB()])
Hybrid.append(['BernoulliNB',BernoulliNB()])
Hybrid.append(['DecisionTree',DecisionTreeClassifier(random_state=0)])

**Each model outcome is stored in the "lst_2" to prepare the table.** 

In [None]:
lst_1 = []
for m in range(len(models)):
    lst_2 = []
    model = models[m][1]
    model.fit(X_train_res,y_train_res)
    y_pred = model.predict(X_test)
    y_train_pred = model.predict(X_train_res)
    cm = confusion_matrix(y_test,y_pred)
    # k-fold cross-validation on the resampled training data
    accuracies = cross_val_score(estimator= model, X = X_train_res,y = y_train_res, cv=10)

    # Metrics on the untouched test set
    roc = roc_auc_score(y_test,y_pred)
    precision = precision_score(y_test,y_pred)
    recall = recall_score(y_test,y_pred)
    f1 = f1_score(y_test,y_pred)
    print(models[m][0],':')
    print(cm)
    print('')
    print('Train Accuracy Score: ',accuracy_score(y_train_res,y_train_pred))
    print('')
    print('Test Accuracy Score: ',accuracy_score(y_test,y_pred))
    print('')
    print('K-Fold Validation Mean Accuracy: {:.2f} %'.format(accuracies.mean()*100))
    print('')
    print('Standard Deviation: {:.2f} %'.format(accuracies.std()*100))
    print('')
    print('ROC AUC Score: {:.2f} %'.format(roc*100))
    print('')
    print('Precision: {:.2f} %'.format(precision*100))
    print('')
    print('Recall: {:.2f} %'.format(recall*100))
    print('')
    print('F1 Score: {:.2f} %'.format(f1*100))
    print('')
    print(classification_report(y_test, y_pred)) 
    print('-'*40)
    print('')
    lst_2.append(models[m][0])
    lst_2.append(accuracy_score(y_train_res,y_train_pred)*100)
    lst_2.append(accuracy_score(y_test,y_pred)*100)
    lst_2.append(accuracies.mean()*100)
    lst_2.append(accuracies.std()*100)
    lst_2.append(roc)
    lst_2.append(precision)
    lst_2.append(recall)
    lst_2.append(f1)
    lst_1.append(lst_2)

**Creating Hybrid model using other algorithms**
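
With its default settings, `VotingClassifier` uses hard voting: every base model predicts a class and the majority wins. The sketch below illustrates that rule on made-up predictions from three hypothetical models; the tie-break (smallest label) is an assumption chosen to mirror scikit-learn's documented ascending-order behaviour.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Majority vote per sample; ties go to the smallest label."""
    voted = []
    for sample_votes in zip(*predictions_per_model):
        counts = Counter(sample_votes)
        top = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted

# One list of predicted labels per model, four samples each (hypothetical)
model_predictions = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
ensemble_prediction = hard_vote(model_predictions)
print(ensemble_prediction)
```

The hybrid model here combines seven base classifiers, so a binary vote cannot end in a tie.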

In [None]:
#Hybrid 
lst_3 = []
Hybrid_ensemble = VotingClassifier(Hybrid)
Hybrid_ensemble.fit(X_train_res,y_train_res)
y_pred_Hyb = Hybrid_ensemble.predict(X_test)
y_train_pred_Hyb = Hybrid_ensemble.predict(X_train_res)
cm = confusion_matrix(y_test,y_pred_Hyb)
# k-fold cross-validation on the resampled training data
accuracies = cross_val_score(estimator= Hybrid_ensemble, X = X_train_res,y = y_train_res, cv=10)

# Metrics on the untouched test set
roc = roc_auc_score(y_test,y_pred_Hyb)
precision = precision_score(y_test,y_pred_Hyb)
recall = recall_score(y_test,y_pred_Hyb)
f1 = f1_score(y_test,y_pred_Hyb)
print('Hybrid_Model')
print(cm)
print('')
print('Train Accuracy Score: ',accuracy_score(y_train_res,y_train_pred_Hyb))
print('')
print('Test Accuracy Score: ',accuracy_score(y_test,y_pred_Hyb))
print('')
print('K-Fold Validation Mean Accuracy: {:.2f} %'.format(accuracies.mean()*100))
print('')
print('Standard Deviation: {:.2f} %'.format(accuracies.std()*100))
print('')
print('ROC AUC Score: {:.2f} %'.format(roc*100))
print('')
print('Precision: {:.2f} %'.format(precision*100))
print('')
print('Recall: {:.2f} %'.format(recall*100))
print('')
print('F1 Score: {:.2f} %'.format(f1*100))
print('-'*40)
print('')
lst_3.append('Hybrid_model')
lst_3.append(accuracy_score(y_train_res,y_train_pred_Hyb)*100)
lst_3.append(accuracy_score(y_test,y_pred_Hyb)*100)
lst_3.append(accuracies.mean()*100)
lst_3.append(accuracies.std()*100)
lst_3.append(roc)
lst_3.append(precision)
lst_3.append(recall)
lst_3.append(f1)
lst_1.append(lst_3)
#final =pd.concat([lst_1,lst_3], axis=1)

**All model results**

In [None]:
df2 = pd.DataFrame(lst_1,columns=['Model','Train_Accuracy','Test_Accuracy','K-Fold Mean Accuracy','Std.Deviation','ROC_AUC','Precision','Recall','F1 Score'])

df2.sort_values(by=['Recall','F1 Score'],inplace=True,ascending=False)
df2

<span style="font-family: Arial; font-weight:bold;font-size:1.5em;color:#00b3e5;"> Best Model
    
**GBClassifier** is considered the best model, because:

1) Recall tells us what share of the customers who actually churned were identified by the model. Catching churners is the goal here, so recall is the most important metric for choosing the best model, and this model has the highest recall. AdaBoost has the same recall.

2) However, GBClassifier has higher precision (the share of predicted churners who actually churned), ROC AUC and F1 score than AdaBoost.
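
For reference, the three metrics used in this comparison can be computed directly from confusion-matrix counts. The counts below are made up for illustration and `classification_metrics` is a hypothetical helper; for a binary problem, scikit-learn's `confusion_matrix` lays the counts out as `[[tn, fp], [fn, tp]]`.

```python
def classification_metrics(tn, fp, fn, tp):
    """Precision, recall and F1 for the positive (churn) class."""
    precision = tp / (tp + fp)  # share of predicted churners who really churned
    recall = tp / (tp + fn)     # share of actual churners that were caught
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical confusion-matrix counts
precision_ex, recall_ex, f1_ex = classification_metrics(tn=80, fp=20, fn=10, tp=40)
print(round(precision_ex, 4), round(recall_ex, 4), round(f1_ex, 4))
```

Two models can share the same recall while differing in precision, which is why precision and F1 are used as tie-breakers above.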



    

<span style="font-family: Arial; font-weight:bold;font-size:1.5em;color:#00b3e5;">Tuning
    
**Hyperparameter tuning in GradientBoostingClassifier using Gridsearch.**

In [None]:
gb_clf = GradientBoostingClassifier(random_state=42)
skfold = StratifiedKFold(n_splits=5)
param_grid = {
              'n_estimators' : [25, 50 ,75, 100, 200],
              'learning_rate': [0.005 ,0.05, 0.5, 1.5],
              'max_depth': [2, 4, 6, 8],
              'max_features': [10, 12, 17] 
              }
grid_gb_clf = GridSearchCV(gb_clf, param_grid, cv=skfold, scoring="accuracy", n_jobs= -1, verbose = 1)
grid_gb_clf.fit(X_train_res,y_train_res)


In [None]:
grid_gb_clf.best_params_

**Fitting GradientBoostingClassifier with best parameters.**

In [None]:
GBC_best=GradientBoostingClassifier(random_state=42,learning_rate = 0.05,
 max_depth = 8,max_features =12,n_estimators = 200)

GBC_best.fit(X_train_res, y_train_res)

In [None]:
y_pred_GBC=GBC_best.predict(X_test)

**Evaluating GradientBoostingClassifier**


In [None]:
from sklearn.metrics import confusion_matrix


confusion_matrix_forest = confusion_matrix(y_test, y_pred_GBC)
print(confusion_matrix_forest)

In [None]:
import seaborn as sns

#plotting a confusion matrix
labels = ['Not Churned', 'Churned']
plt.figure(figsize=(7,5))
ax= plt.subplot()
sns.heatmap(confusion_matrix_forest,cmap="Blues",annot=True,fmt='.1f', ax = ax); #annot=True to annotate cells

# labels, title and ticks
ax.set_xticklabels(labels)
ax.set_yticklabels(labels)
ax.set_xlabel('Predicted labels');ax.set_ylabel('True labels'); 
ax.set_title('Confusion Matrix: Gradient Boosting (Grid Search)'); 

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_GBC)) 

**Fine tuning using Random search.**

In [None]:
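# Random search over the same hyperparameter space with stratified 5-fold CV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

gb_clf = GradientBoostingClassifier(random_state=42)
skfold = StratifiedKFold(n_splits=5)
param_grid = {
              'n_estimators' : [25, 50, 75, 100, 200],
              'learning_rate': [0.005, 0.05, 0.5, 1.5],
              'max_depth': [2, 4, 6, 8],
              'max_features': [10, 12, 17]
              }
# Samples n_iter=10 (the default) combinations; random_state makes the sample reproducible
random_gb_clf = RandomizedSearchCV(gb_clf, param_grid, cv=skfold, scoring="accuracy", n_jobs=-1, verbose=1, random_state=42)
random_gb_clf.fit(X_train_res, y_train_res)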


In [None]:
random_gb_clf.best_params_
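
To put the two searches in perspective, the sketch below counts how many model fits each one performs on the parameter space used above. It assumes `RandomizedSearchCV` is left at its default `n_iter=10`, as in the call above; the variable names are illustrative only.

```python
from itertools import product

# Same search space as in the tuning cells
param_grid = {
    'n_estimators': [25, 50, 75, 100, 200],
    'learning_rate': [0.005, 0.05, 0.5, 1.5],
    'max_depth': [2, 4, 6, 8],
    'max_features': [10, 12, 17],
}

cv_folds = 5
n_iter = 10  # RandomizedSearchCV default when n_iter is not passed

# Grid search tries every combination; random search samples n_iter of them
n_combinations = len(list(product(*param_grid.values())))
grid_fits = n_combinations * cv_folds
random_fits = n_iter * cv_folds

print(f"combinations: {n_combinations}")
print(f"grid search fits: {grid_fits}")
print(f"random search fits: {random_fits}")
```

Random search therefore explores only a small fraction of the grid, so it is much cheaper but may miss the combination that grid search finds.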

**Fitting GradientBoostingClassifier with the new parameters.**

In [None]:
GBC_best_Rand=GradientBoostingClassifier(random_state=42,learning_rate = 0.05,
 max_depth = 8,max_features =10,n_estimators = 100)

GBC_best_Rand.fit(X_train_res, y_train_res)

In [None]:
y_pred_GBCR=GBC_best_Rand.predict(X_test)

**Evaluating GradientBoostingClassifier.**

In [None]:
from sklearn.metrics import confusion_matrix


confusion_matrix_Rand = confusion_matrix(y_test, y_pred_GBCR)
print(confusion_matrix_Rand)

In [None]:
import seaborn as sns

#plotting a confusion matrix
labels = ['Not Churned', 'Churned']
plt.figure(figsize=(7,5))
ax= plt.subplot()
sns.heatmap(confusion_matrix_Rand, cmap="Blues", annot=True, fmt='d', ax=ax); # annot=True writes the count in each cell

# labels, title and ticks
ax.set_xticklabels(labels)
ax.set_yticklabels(labels)
ax.set_xlabel('Predicted labels'); ax.set_ylabel('True labels');
ax.set_title('Confusion Matrix: GradientBoostingClassifier (random search)');

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_GBCR)) 

**After tuning with both grid search and random search, we can conclude that grid search gives the better result, with improved accuracy.**

**However, recall is higher in the base model at 73.96%, whereas with grid search it drops to 64%.**

**Let's try to improve on this with more detailed, step-by-step tuning using grid search.**
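
Since recall is the metric that matters most here while the searches optimise accuracy, one option worth considering is to let the search score candidates by recall directly. The sketch below illustrates this on synthetic data generated with `make_classification` (an assumption standing in for `X_train_res`, `y_train_res`); the small grid and the names `X_demo`, `y_demo`, `demo_grid` and `recall_search` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the churn data (not the project data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.7, 0.3], random_state=42
)

# Deliberately small grid so the sketch runs quickly
demo_grid = {'n_estimators': [25, 50], 'max_depth': [2, 3]}

recall_search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    demo_grid,
    cv=StratifiedKFold(n_splits=3),
    scoring='recall',  # rank candidates by recall on the positive class
)
recall_search.fit(X_demo, y_demo)

print(recall_search.best_params_)
print(round(recall_search.best_score_, 3))
```

Optimising recall alone can favour models that flag too many customers, so precision and F1 should still be checked on the test set.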

<span style="font-family: Arial; font-weight:bold;font-size:1.0em;color:#00b3e5;">Tuning n_estimators and Learning rate

In [None]:
p_test1 = {'learning_rate':[0.15,0.1,0.05,0.01,0.005,0.001], 'n_estimators':[100,250,500,750,1000,1250,1500,1750]}

tuning = GridSearchCV(estimator =GradientBoostingClassifier(max_depth=4, min_samples_split=2, min_samples_leaf=1, subsample=1,max_features='sqrt', random_state=10), 
            param_grid = p_test1, scoring='accuracy',n_jobs=4,cv=5)
tuning.fit(X_train_res,y_train_res)
tuning.best_params_, tuning.best_score_

<span style="font-family: Arial; font-weight:bold;font-size:1.0em;color:#00b3e5;">Tuning max_depth

In [None]:
p_test2 = {'max_depth':[2,3,4,5,6,7] }
tuning = GridSearchCV(estimator =GradientBoostingClassifier(learning_rate=0.15,n_estimators=500, min_samples_split=2, min_samples_leaf=1, subsample=1,max_features='sqrt', random_state=10), 
            param_grid = p_test2, scoring='accuracy',n_jobs=4,cv=5)
tuning.fit(X_train_res,y_train_res)
tuning.best_params_, tuning.best_score_

**First Evaluation of model with latest tuning parameters.**

In [None]:
List_1 = []
List_final = []
model1 = GradientBoostingClassifier(learning_rate=0.15, n_estimators=500,max_depth=7, min_samples_split=2, min_samples_leaf=1, subsample=1,max_features='sqrt', random_state=10)
model1.fit(X_train_res,y_train_res)
pred=model1.predict(X_test)
pred_train=model1.predict(X_train_res)
print(classification_report(y_test, pred))
roc = roc_auc_score(y_test,pred)
precision = precision_score(y_test,pred)
recall = recall_score(y_test,pred)
f1 = f1_score(y_test,pred)
List_1.append('First Evaluation')
List_1.append(accuracy_score(y_train_res,pred_train)*100)
List_1.append(accuracy_score(y_test,pred)*100)
List_1.append(roc)
List_1.append(precision)
List_1.append(recall)
List_1.append(f1)
List_final.append(List_1)
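
Note that `roc_auc_score` above receives hard class predictions, which reduces the ROC curve to a single threshold. The sketch below uses made-up labels and scores (not project data) to show how the value differs when predicted probabilities are passed instead; in the notebook that would correspond to something like `model1.predict_proba(X_test)[:, 1]`.

```python
from sklearn.metrics import roc_auc_score

# Made-up ground truth and predicted churn probabilities
y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]

# Hard labels obtained with the usual 0.5 threshold
y_label = [1 if s >= 0.5 else 0 for s in y_score]

auc_from_labels = roc_auc_score(y_true, y_label)  # single operating point
auc_from_scores = roc_auc_score(y_true, y_score)  # full ranking of the scores

print(round(auc_from_labels, 3), round(auc_from_scores, 3))
```

Both numbers are valid, but they answer different questions, so models should be compared using the same convention throughout the results table.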

**No improvement in the model.**

**Let's try to fine-tune the model with more parameters.**

<span style="font-family: Arial; font-weight:bold;font-size:1.0em;color:#00b3e5;">Tuning min_samples_split and min_samples_leaf

In [None]:
p_test3 = {'min_samples_split':[2,4,6,8,10,20,40,60,100], 'min_samples_leaf':[1,3,5,7,9]}

tuning = GridSearchCV(estimator =GradientBoostingClassifier(learning_rate=0.15, n_estimators=500,max_depth=7, subsample=1,max_features='sqrt', random_state=10), 
            param_grid = p_test3, scoring='accuracy',n_jobs=4,cv=5)
tuning.fit(X_train_res,y_train_res)
tuning.best_params_, tuning.best_score_

**No improvement observed, as min_samples_split=2 and min_samples_leaf=1 are already in use.**

<span style="font-family: Arial; font-weight:bold;font-size:1.0em;color:#00b3e5;">Tuning Max features

In [None]:
p_test4 ={'max_features':[2,3,4,5,6,7]}

tuning = GridSearchCV(estimator =GradientBoostingClassifier(learning_rate=0.15, n_estimators=500,max_depth=7,min_samples_split=2, min_samples_leaf=1, subsample=1, random_state=10), 
            param_grid = p_test4,scoring='accuracy',n_jobs=4,cv=5)
tuning.fit(X_train_res,y_train_res)
tuning.best_params_, tuning.best_score_

**Second Evaluation of model with latest Max Features.**

In [None]:
List_2 = []
model1 = GradientBoostingClassifier(learning_rate=0.15, n_estimators=500,max_depth=7, min_samples_split=2, min_samples_leaf=1, subsample=1,max_features=4, random_state=10)
model1.fit(X_train_res,y_train_res)
pred=model1.predict(X_test)
pred_train=model1.predict(X_train_res)
print(classification_report(y_test, pred))
roc = roc_auc_score(y_test,pred)
precision = precision_score(y_test,pred)
recall = recall_score(y_test,pred)
f1 = f1_score(y_test,pred)
List_2.append('Second Evaluation')
List_2.append(accuracy_score(y_train_res,pred_train)*100)
List_2.append(accuracy_score(y_test,pred)*100)
List_2.append(roc)
List_2.append(precision)
List_2.append(recall)
List_2.append(f1)
List_final.append(List_2)

<span style="font-family: Arial; font-weight:bold;font-size:1.0em;color:#00b3e5;">Tuning Subsamples

In [None]:
p_test5 ={'subsample':[0.7,0.75,0.8,0.85,0.9,0.95,1]}

tuning = GridSearchCV(estimator =GradientBoostingClassifier(learning_rate=0.15, n_estimators=500,max_depth=7,min_samples_split=2, min_samples_leaf=1, max_features=4, random_state=10), 
            param_grid = p_test5,scoring='accuracy',n_jobs=4,cv=5)
tuning.fit(X_train_res,y_train_res)
tuning.best_params_, tuning.best_score_

**Subsample = 1 is already in use, so there is no improvement.**

**List the results of the model evaluations alongside the base model**

In [None]:
# Add Base model data to final list to prepare the table
List_3 = []
List_3.append('Base')
List_3.append(82.069729)
List_3.append(76.053005)
List_3.append(0.753775)
List_3.append(0.530480)
List_3.append(0.739602)
List_3.append(0.617825)
List_final.append(List_3)

**Display and compare all the models**

In [None]:
df_final = pd.DataFrame(List_final,columns=['Model','Train_Accuracy','Test_Accuracy','ROC_AUC','Precision','Recall','F1 Score'])

df_final.sort_values(by=['Recall','F1 Score'],inplace=True,ascending=False)
df_final

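**Observation:**

After fine-tuning several hyperparameters:

1) Train accuracy improved but test accuracy did not, which suggests the tuned models fit the resampled training data more closely without generalising better.

2) There is also no improvement in recall, precision, ROC AUC or F1 score.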

<span style="font-family: Arial; font-weight:bold;font-size:1.0em;color:#00b3e5;">The base model is therefore our final model for future predictions; it predicts customer churn with 76% accuracy and 73.96% recall

<span style="font-family: Arial; font-weight:bold;font-size:1.5em;color:#00b3e5;">Pickle the Base model for future prediction

What is pickle:

Pickling is the process of converting a Python object hierarchy into a byte stream, which the `dump` function writes to a file. This byte stream contains all the information necessary to reconstruct the object in another Python script.

pickle has two main functions. The first is `dump`, which writes an object to a file object, and the second is `load`, which reads an object back from a file object.
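
The round trip described above can be seen in memory with `pickle.dumps` and `pickle.loads`, the bytes-based counterparts of `dump` and `load`. The dictionary below is a hypothetical stand-in for a fitted model, used only to keep the sketch self-contained.

```python
import pickle

# Hypothetical stand-in for a fitted model object
model_summary = {
    "name": "GradientBoostingClassifier",
    "params": {"n_estimators": 50, "learning_rate": 0.1},
}

payload = pickle.dumps(model_summary)  # object hierarchy -> byte stream
restored = pickle.loads(payload)       # byte stream -> new, equal object

print(type(payload).__name__, len(payload) > 0)
print(restored == model_summary, restored is model_summary)
```

The restored object is equal to the original but is a separate copy, which is what makes it possible to reload a saved model in a different script or session.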

In [None]:
# Final model (which is Base model)

GBClassifier = GradientBoostingClassifier(n_estimators = 50, learning_rate = 0.1, random_state=0)
GBClassifier.fit(X_train_res,y_train_res)
pred=GBClassifier.predict(X_test)
pred_train=GBClassifier.predict(X_train_res)
print(classification_report(y_test, pred))

In [None]:
# Import pickle Package

import pickle

In [None]:
# Save the model to a file in the current working directory

Pkl_Filename = "Pickle_GBC_Model.pkl"  

with open(Pkl_Filename, 'wb') as file:  
    pickle.dump(GBClassifier, file)

In [None]:
# Load the Model back from file
with open(Pkl_Filename, 'rb') as file:  
    Pickle_GBC_Model = pickle.load(file)
    
Pickle_GBC_Model

<span style="font-family: Arial; font-weight:bold;font-size:1.5em;color:#00b3e5;">6. Conclusion

**1. The GradientBoostingClassifier model performs best, as the results above show.**

**2. The GradientBoostingClassifier model is able to identify 74% of churning customers (recall).**

**3. Hyperparameter tuning brought no improvement.**

**4. Model performance might improve with other classification algorithms.**

**5. Using hypothesis testing, we can conclude that a few attributes show a positive impact on customer churn.**

**6. We dropped customer ID (as it does not influence churn) and a few other attributes that are highly correlated, to avoid multicollinearity.**

The company should focus on the following questions for customer retention:

1) Why churn is higher among customers on month-to-month contracts.

2) Why attributes with no internet service have a negative impact on churn.

3) Why customers with fiber optic internet show a positive impact on churn.


<span style="font-family: Arial; font-weight:bold;font-size:1.0em;color:#00b3e5;">Suggestions for improvement

In [None]:
fig, ax = plt.subplots(figsize=(20,8))
width = len(df['SeniorCitizen'].unique())+6
fig.set_size_inches(width , 8)
ax=sns.countplot(data = df, x= 'SeniorCitizen') 



for p in ax.patches: 
    ax.annotate(str((np.round(p.get_height()/len(df)*100,decimals=2)))+'%', (p.get_x()+p.get_width()/2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')


**We can see that 84% of customers are not senior citizens and, as seen earlier, more churn comes from these customers. 
There is no information on age or age group (teen, young, middle-aged).**

**Information on age would help us perform a better analysis and focus on particular groups.** 
