## 1. *Background*
Customer churn is the loss of customers, that is, customers who stop using a company's services. Companies such as banks, telecommunication providers and insurers need churn analysis because it has been shown that retaining existing customers is less expensive than acquiring new ones. Existing customers also purchase more than new ones, and introducing new products or services to them is easier. Companies therefore rely on data scientists to analyze their historical data and support decision-makers by indicating which customers are about to churn and what is causing them to leave.
         
   ###### The objectives of this project are:
* To investigate and visualize the predictors which contribute the most to customers leaving,
* To build machine learning algorithms which classify customer churn based on the available historical data,
* To select the algorithm with the best performance (high Recall and Accuracy) compared to the others, so that it can be used to predict on new data.

*Let's get started!*

<u>Explanation of the dataset</u> <br />The data has 14 features (columns or attributes): categorical columns, continuous columns and the target column.


In [None]:
#Here, I load the necessary libraries that I will need to visualize the distribution 
# of the features to the target attribute 
# and to build my classification models.
#Important mathematical and dataframe libraries and html univariate report
import numpy as np 
import pandas as pd
import pandas_profiling as pf
# For visualization and plotting
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
pd.options.display.max_rows = None
pd.options.display.max_columns = None

# Support functions
from sklearn.preprocessing import PolynomialFeatures
from scipy.stats import uniform
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from catboost import CatBoostClassifier
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn import svm
from imblearn.over_sampling import ADASYN, SMOTE

# Scoring functions
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.model_selection import cross_val_score
from sklearn.metrics import recall_score



In [None]:
#Here, I read the data and look at its first 10 rows.
churnData = pd.read_csv('../input/predicting-churn-for-bank-customers/Churn_Modelling.csv')
churnData.head(10)

In [None]:
# Checking the dimensions, column types and non-null counts of my dataframe
churnData.info()

### Univariate analysis

In [None]:
# Generated a profile report about the data set using pandas profiling
# report to be added in html format
churn_data_report = pf.ProfileReport(churnData)
churn_data_report
# churn_data_report.to_file('churn_data_report.html')

## 2. **Data Manipulation**

The feature names mix lower-case and capital letters, so I will make them all lower case.<br />
I will check for missing data as part of data cleaning.<br />
I will drop the first three attributes (row number, customer ID and surname), which are irrelevant for prediction, as part of feature selection.

In [None]:
#Let me change the columns names of my dataset to lower case.
churnData.columns = map(str.lower, churnData.columns)

In [None]:

# Here, I am going to check if there is any missing data.
churnData.isnull().sum()

Luckily, there are no missing values in my data, so *data cleaning* becomes much easier.<br />
Now, *feature selection* is my next step: the first three columns are not relevant to my target, so I am going to drop them.

In [None]:
#Here, I am going to remove the first three columns.
churnData =  churnData.iloc[:, 3:14]
# churnData.drop(columns = [list of columns], axis =1,inplace = True)

## 3. Descriptive Data Analysis

In this section, I will look at the distribution of the features with respect to my target column.<br />
* Distribution of the target column (exited) itself.
* Distribution of the continuous attributes with respect to the target column.
* Distribution of the categorical columns with respect to the target column.
<br />I will add comments to show clearly how the features contribute to customer churn.

      Proportion of the churned and stayed customers.

In [None]:
#The proportion pie chart of the churned and retained customers.
labels = 'churned', 'stayed'
sizes = [churnData.exited[churnData['exited']==1].count(), churnData.exited[churnData['exited']==0].count()]
fig1, ax1 = plt.subplots(figsize=(10, 8))
ax1.pie(sizes, labels=labels, autopct='%1.1f%%')
ax1.axis('equal')
plt.title("Percentage of the customers churn", size = 15)
plt.legend()
plt.show()


*The pie chart above shows that churned customers make up about 20% of the dataset. Because the churned customers are the minority class, I will have to find a prediction model with high accuracy that is still able to track those churned customers.*
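
With a split of roughly 80/20, accuracy on its own can look good even when no churned customer is found, which is why Recall is tracked alongside it. The sketch below uses made-up labels (100 customers, 20 of them churned, an assumption that only mirrors the proportion above) and a hypothetical "model" that always predicts the majority class.

```python
# Toy labels mirroring the roughly 80/20 split reported above (assumed counts).
y_true = [0] * 80 + [1] * 20          # 1 = churned, 0 = stayed

# A hypothetical "model" that always predicts the majority class (stayed).
y_pred = [0] * len(y_true)

# Accuracy: share of all predictions that are correct.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of the actual churners that were predicted as churners.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(t == 1 for t in y_true)
recall = true_pos / actual_pos

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
# prints: accuracy = 0.80, recall = 0.00
```

The baseline reaches 80% accuracy while missing every churned customer, so any model in this notebook should be judged on both numbers.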
    

### Bivariate Analysis    
     Now, I am going to evaluate the contribution of the categorical attributes to the target (exited) column
   ##### Visualizing how customer churn changes by gender group

In [None]:
#Visualizing how customers churn based on their gender group
print(pd.crosstab(churnData['gender'],churnData['exited']))
gender = pd.crosstab(churnData['gender'],churnData['exited'])
gender.div(gender.sum(1).astype(float), axis=0).plot(kind="bar", stacked=False, figsize=(9,5))
plt.xlabel('gender')
p = plt.ylabel('Percentage')


The bar chart above shows that a larger proportion of female customers churned than male customers.
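
The bars in this chart are row-normalised proportions: each crosstab row is divided by its own total, so groups of different sizes can be compared. As a small illustration on hypothetical toy data (the column names mirror the notebook's, the values are invented), the manual division used in the plotting cell gives the same table as `pd.crosstab` with `normalize="index"`.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the notebook.
toy = pd.DataFrame({
    "gender": ["Female", "Female", "Female", "Female",
               "Male", "Male", "Male", "Male"],
    "exited": [1, 1, 0, 0, 1, 0, 0, 0],
})

counts = pd.crosstab(toy["gender"], toy["exited"])

# Dividing each row by its total, as the plotting cell does ...
manual = counts.div(counts.sum(axis=1), axis=0)

# ... matches the built-in row normalisation.
builtin = pd.crosstab(toy["gender"], toy["exited"], normalize="index")

print(builtin)
```

In this toy table half of the female rows and a quarter of the male rows have `exited == 1`; the same reading applies to the geography, credit card and activity charts that follow.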


 ##### Now, I am going to check the variation of the customers churn based on their geographical location


In [None]:
#Visualizing how customers churn based on their Geo-Location
print(pd.crosstab(churnData['geography'],churnData['exited']))
Location = pd.crosstab(churnData['geography'],churnData['exited'])
Location.div(Location.sum(1).astype(float), axis=0).plot(kind="bar", stacked=False, figsize=(9,5))
plt.xlabel('geography')
p = plt.ylabel('Percentage')


From the above visualization, we can see that customers located in Germany churn more than those located in France and Spain.

#### Here, I need to check how the attribute <u>hascrcard</u> (whether or not the customer has a credit card) affects customer churn. 

In [None]:
#Visualizing how customers churn based on having credit card information
print(pd.crosstab(churnData['hascrcard'],churnData['exited']))
creditCard = pd.crosstab(churnData['hascrcard'],churnData['exited'])
creditCard.div(creditCard.sum(1).astype(float), axis=0).plot(kind="bar", stacked=False, figsize=(9,5))
plt.xlabel('credit card holding')
p = plt.ylabel('Percentage')


From the above visualization, we can see that customers with credit cards (1 on the x-axis) churn more than those who do not have credit cards.

#### Let me also check how the customer being active or not affects customer churn.



In [None]:
#Visualizing how customers churn based on being active information
print(pd.crosstab(churnData['isactivemember'],churnData['exited']))
actives = pd.crosstab(churnData['isactivemember'],churnData['exited'])
actives.div(actives.sum(1).astype(float), axis=0).plot(kind="bar", stacked=False, figsize=(9,5))
plt.xlabel('active members')
p = plt.ylabel('Percentage')


Not surprisingly, the bar chart above (where 0 on the x-axis denotes inactive members) shows that inactive customers churn more than active ones.

In this descriptive analysis, we note the following:

 * The majority of the data is from customers in France. However, the proportion of churned customers is inversely related to the number of customers in each country, so the bank needs to put more effort into the locations with higher churn.
 * The proportion of female customers churning is also greater than that of male customers, even though there are fewer female customers than male customers. The bank needs to investigate why female customers are churning more.
 
 * Interestingly, the majority of the customers that churned are those with credit cards. Given that the majority of all customers have credit cards, this could be just a coincidence.
 
  * Unsurprisingly, the inactive members have a greater churn. Worryingly, the overall proportion of inactive members is quite high; as a data scientist, I see this as a good reason to suggest that the bank turn those inactive customers into active ones, as they will increase the bank's revenue.



        The analysis of the contribution of the continuous attributes to customers churn situation.

In [None]:
# The contribution of the continuous attributes to the customers churn (exited) column.
# Relations based on the continuous data attributes
continuous = ['age', 'tenure', 'creditscore', 'estimatedsalary', 'balance','numofproducts']
fig = plt.subplots(figsize = (15,15))
for i,j in enumerate(continuous):
  plt.subplot(3,2,i+1)
  sns.boxplot(x='exited', y = j , data=churnData)
  plt.title("Boxplot of exited vs {}".format(j))
plt.show()
 

From the above subplots, where 0 means stayed and 1 means churned, we note the following:



   *  Worryingly, the bank is losing customers with significant bank balances, which is likely to reduce its available capital for lending.
   
   *  The number of products and the estimated salary have an insignificant effect on the likelihood of churning.
   
   * There is no significant difference in the credit score distribution between retained (0) and churned (1) customers.
   * Older customers are churning more than younger ones. The bank may need to review its target market or its retention strategy for the different age groups. One idea: could the bank approach retired people to grow its group of older customers? 
   
   * For tenure, the clients who have spent a long time with the bank are more likely to leave compared to those with an average tenure.
   
##### Now, we are going to visualize the correlation between the continuous features. We can see that our features are not strongly correlated, so they will not introduce redundancy into our model building.
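
For reference, each cell of the heatmap is a Pearson correlation coefficient, which is the default method of pandas' `corr()`. The following is a minimal hand-written sketch of that formula on invented numbers (the variable names and values are assumptions for illustration, not columns read from the dataset).

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented values for illustration only.
age = [25, 32, 41, 50, 63]
linear_in_age = [2 * a + 10 for a in age]   # perfectly linear in age
unrelated = [5, 1, 4, 1, 5]                 # no clear trend with age

r_linear = pearson(age, linear_in_age)
r_unrelated = pearson(age, unrelated)
print(round(r_linear, 3), round(r_unrelated, 3))
```

A value close to 1 or -1 would signal a redundant pair of features, while values near 0, like the second one here, indicate that the two columns carry largely separate information.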
  
  


In [None]:
#Selecting the continuous columns and view the correlation between them.
churnData_cont = churnData[["age","tenure","creditscore","estimatedsalary","balance","numofproducts"]]
plt.figure(figsize=(13,8), dpi=100)
sns.heatmap(churnData_cont.corr(), xticklabels=churnData_cont.corr().columns, yticklabels=churnData_cont.corr().columns, cmap="viridis", annot=True)
plt.title("Correlation of the continuous features")
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

### Features engineering
In this step, I perform feature engineering by encoding the categorical columns as numbers. Later on, I will standardize the features to improve model performance.

In [None]:
# Encoding the categorical features as numbers
# Copy the churnData dataframe to data_copy to keep the original data before feature engineering
data_copy = churnData.copy()
catfeat = ['geography', 'gender']
encode_dict = {}
for col in catfeat:
  # Order the categories by ascending mean balance and use each category's rank as its code
  ordered = data_copy.groupby([col])['balance'].mean().sort_values(ascending =True).index
  encode_dict[col] = {category: rank for rank, category in enumerate(ordered)}
for col in catfeat:
  data_copy[col] = data_copy[col].map(encode_dict[col])

In [None]:
data_copy.head()


After encoding the categorical data, the geography attribute becomes Spain=0, France=1, Germany=2.
The gender attribute becomes Female=0 and Male=1. The codes follow the ascending order of each category's mean balance.
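
The encoding above ranks each category by its mean balance. A minimal pure-Python sketch of that idea follows; the rows and balance values are hypothetical and chosen only so that the resulting order matches the mapping reported above.

```python
from collections import defaultdict

# Hypothetical rows: (geography, balance). Values are made up for illustration.
rows = [
    ("France", 60.0), ("France", 64.0),
    ("Spain", 58.0), ("Spain", 62.0),
    ("Germany", 118.0), ("Germany", 122.0),
]

# Mean balance per category.
totals = defaultdict(float)
counts = defaultdict(int)
for geo, bal in rows:
    totals[geo] += bal
    counts[geo] += 1
means = {geo: totals[geo] / counts[geo] for geo in totals}

# Rank the categories by ascending mean and use the rank as the code.
ordered = sorted(means, key=means.get)
encoding = {geo: rank for rank, geo in enumerate(ordered)}

print(encoding)
# prints: {'Spain': 0, 'France': 1, 'Germany': 2}
```

One consequence of this design is that the integer codes imply an order between categories, which tree-based models tolerate well but linear models will interpret literally.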

In [None]:
churnData['geography'].unique()

In [None]:
data_copy['geography'].unique()

### Data leakage 
Data leakage refers to a mistake made by the creator of a machine learning model in which information is accidentally shared between the test and training datasets. Typically, when splitting a dataset into testing and training sets, the goal is to ensure that no data is shared between the two. Here, I split the data before performing standardization or normalization to avoid data leakage.
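
To make the "split first, then scale" rule concrete, here is a small sketch under assumed data: a single invented feature that has already been split. The mean and standard deviation are learned from the training rows only and then reused for the test rows, which is the same fit/transform pattern that `StandardScaler` follows.

```python
from statistics import mean, pstdev

# Hypothetical one-feature data, already split into train and test.
train = [10.0, 20.0, 30.0, 40.0]
test = [25.0, 100.0]

# Fit: learn the statistics from the training rows only.
mu = mean(train)
sigma = pstdev(train)  # population standard deviation, as StandardScaler uses

# Transform: apply the same training statistics to both splits.
train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in test_scaled])
```

Had the statistics been computed on train and test together, the outlying test value of 100 would have shifted `mu` and `sigma`, and information about the test set would have leaked into the training features.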

In [None]:
#Here, I am going to view the proportion of our target before performing oversampling or undersampling
from collections import Counter
print(Counter(churnData['exited']))

print(churnData['exited'].value_counts(normalize=True))
total = float(len(churnData))
plt.figure(figsize=(5,5))
# Pass the column as x so that seaborn counts the rows in each class
ax = sns.countplot(x=churnData['exited'], palette='cubehelix')
# Annotate each bar with its share of the total
for p in ax.patches:
   height = p.get_height()
   ax.text(p.get_x()+p.get_width()/2.,
           height + 3,
           '{:.2f}%'.format((height/total)*100),
           ha="center")
# plt.savefig('targ.jpg')
plt.show()

In [None]:
# I split the data into predictors and target columns to oversample the minority class
features = data_copy[['creditscore', 'geography', 'gender', 'age', 'tenure', 'balance',
       'numofproducts', 'hascrcard', 'isactivemember', 'estimatedsalary']]
target = data_copy[['exited']]

In [None]:
#Resampling with SMOTE to have the same class proportion in our data
from imblearn.over_sampling import SMOTE
smote = SMOTE(
    sampling_strategy='minority',
    random_state=None,
    k_neighbors=5,
)
# Oversample the minority class with synthetic rows
feature_smote, target_smote = smote.fit_resample(features, target)

In [None]:
#Select the single target column as a Series to plot it.
SmotingData = target_smote.iloc[:, 0]
type(SmotingData)

In [None]:
#To see how the data has become balanced after performing the SMOTE operation
from collections import Counter
print(Counter(SmotingData))

print(SmotingData.value_counts(normalize=True))
print('+-+'*38)
print('The target variable is balanced, with a 50/50 split between churned customers and \
non-churned customers')
print('+-+'*38)
total = float(len(SmotingData))
plt.figure(figsize=(5,5))
# Pass the Series as x so that seaborn counts the rows in each class
ax = sns.countplot(x=SmotingData, palette='Set1')
# Annotate each bar with its share of the total
for p in ax.patches:
   height = p.get_height()
   ax.text(p.get_x()+p.get_width()/2.,
           height + 3,
           '{:.2f}%'.format((height/total)*100),
           ha="center")
# plt.savefig('targ.jpg')
plt.show()

## Modeling

In [None]:
# Split data into train and test
# We get the training data to pass into our models and the test data to evaluate their performance.
# The split uses the original (non-resampled) features and target; class imbalance is handled
# further below with BalancedBaggingClassifier.

from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest = train_test_split(features,target,test_size = .2,random_state = 2)
print(xtrain.shape)
print(ytrain.shape)
print(xtest.shape)
print(ytest.shape)



In [None]:
#The features need to be standardized so that they are on a comparable scale; the scaler is fitted on the training set only
from sklearn.preprocessing import RobustScaler,MinMaxScaler,StandardScaler
scal = StandardScaler()
xtrain = scal.fit_transform(xtrain)
xtest = scal.transform(xtest)

### Logistic Regression

In [None]:
#Training a logistic regression classifier and evaluating it on the test set
logreg = LogisticRegression()
logreg = logreg.fit(xtrain,ytrain)
pred = logreg.predict(xtest)

print('Classification Report')
print('+-+'*15)
print(classification_report(ytest,pred))
print('+-+'*15)
print('Confusion Matrix')
print('+-+'*15)
cm = confusion_matrix(ytest,pred)
df_cm = pd.DataFrame(cm, index=['Not Churn','Churn'], columns=['Not Churn','Churn'])
plt.figure(figsize=(8,5))
heatmap = sns.heatmap(df_cm, annot=True, fmt="d",cmap='Set1')
heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=15)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=15)
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title('Confusion Matrix for Logistic Regression')
plt.show()


### Gradient Boosting Classifier

In [None]:
gradreg = GradientBoostingClassifier(n_estimators=600,max_depth =5,random_state=4)
gradreg = gradreg.fit(xtrain,ytrain)
pred = gradreg.predict(xtest)
print(classification_report(ytest,pred))
print('')
print(recall_score(ytest,pred))
print('')
print(roc_auc_score(ytest,pred))

### Random Forest Classifier

In [None]:
randreg = RandomForestClassifier()
randreg = randreg.fit(xtrain,ytrain)
pred = randreg.predict(xtest)
print(classification_report(ytest,pred))
print('')
print(recall_score(ytest,pred))
print('')
print(roc_auc_score(ytest,pred))

### Catboost Classifier

In [None]:
#Training the model using the CatBoostClassifier algorithm
catb = CatBoostClassifier(iterations=1000,learning_rate=0.001,
                          depth = 5,use_best_model=True,eval_metric='AUC',
                          early_stopping_rounds=10,
                          verbose = 100)

catb = catb.fit(xtrain,ytrain,eval_set=(xtest,ytest), plot=True)
pred = catb.predict(xtest)
print(classification_report(ytest,pred))
print('')

# plotting feature importance
plt.figure(figsize=(12,8))
feat_importances = pd.Series(catb.feature_importances_, index=features.columns)
feat_importances.nlargest(10).sort_values().plot(kind='barh',color = 'crimson')
plt.title('Feature Importance from  Cat Boost Classifier')
plt.show()


## Imbalanced Dataset
One of the common issues found in datasets used for classification is class imbalance. Imbalance reflects an unequal distribution of classes within a dataset: in the bank's customer churn data we have fewer exited customers than retained ones. When building a classification model, the model will recognize non-churning customers more easily because it sees more of them in the training set, and it will have difficulty recognizing churned customers. To overcome this issue, we use the ***imbalanced-learn*** library to improve our model performance.
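
The balanced bagging approach used in the following cells can be pictured as training each base model on a sample in which the majority class has been cut down to the size of the minority class. The sketch below is a simplified illustration of that sampling step only, under assumed toy labels; the function name `balanced_sample` is hypothetical, and the real `BalancedBaggingClassifier` additionally draws a bootstrap sample and fits one base estimator per bag.

```python
import random

def balanced_sample(labels, rng):
    """Return row indices with equal counts per class (majority undersampled)."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n_minority = min(len(idxs) for idxs in by_class.values())
    sample = []
    for idxs in by_class.values():
        # Sample without replacement, in the spirit of replacement=False.
        sample.extend(rng.sample(idxs, n_minority))
    return sample

# Toy labels with the same 80/20 imbalance as the churn data (assumed counts).
labels = [0] * 80 + [1] * 20
rng = random.Random(0)

# Three bags, each balanced but built from a different subset of the majority.
bags = [balanced_sample(labels, rng) for _ in range(3)]
for bag in bags:
    churned = sum(labels[i] for i in bag)
    print(len(bag), churned)
# each line prints: 40 20
```

Because every bag sees a different subset of the majority class, the ensemble as a whole still uses most of the retained customers while each individual model trains on balanced data.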

### Decision Tree with Balanced Algorithm

In [None]:
#Here, I use BalancedBaggingClassifier from the imbalanced-learn library, which deals with the imbalanced dataset
#Create an object of the classifier using Decision Tree algorithm.
bbc = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),
                                sampling_strategy='auto',
                                replacement=False,
                                random_state=0)
#Train the classifier.
bbc.fit(xtrain, ytrain)
preds = bbc.predict(xtest)
#Print out the results of the model
print(classification_report(ytest,preds))
print('')
print(recall_score(ytest,preds))
print('')
print(roc_auc_score(ytest,preds))


In [None]:
# Average accuracy and Recall across 10-folds cross-validation
crossRecall = cross_val_score(bbc, features, target, cv=10, scoring='recall')
crossScore = cross_val_score(bbc, features, target, cv=10, scoring='accuracy')
print("DecisionTree Average Recall", crossRecall.mean())
print("DecisionTree Average Accuracy", crossScore.mean())

### Random Forest with Balanced Algorithm

In [None]:

#Create an object of the classifier using Random Forest algorithm.
bbc = BalancedBaggingClassifier(base_estimator=RandomForestClassifier(),
                                sampling_strategy='auto',
                                replacement=False,
                                random_state=0)
#Train the classifier.
bbc.fit(xtrain, ytrain)
preds = bbc.predict(xtest)
#print the performance of the model
print(classification_report(ytest,preds))
print('')
print(recall_score(ytest,preds))
print('')
print(roc_auc_score(ytest,preds))


In [None]:
# Average accuracy and Recall across 10-folds cross-validation
crossRecall = cross_val_score(bbc, features, target, cv=10, scoring='recall')
crossScore = cross_val_score(bbc, features, target, cv=10, scoring='accuracy')
print("RandomForest Average Recall", crossRecall.mean())
print("RandomForest Average Accuracy", crossScore.mean())

### Gradient Boosting with Balanced Algorithm

In [None]:
#Create an object of the classifier using Gradient Boosting algorithm.
bbc = BalancedBaggingClassifier(base_estimator=GradientBoostingClassifier(),
                                sampling_strategy='auto',
                                replacement=False,
                                random_state=0)
#Train the classifier.
grb = bbc.fit(xtrain, ytrain)
preds = grb.predict(xtest)
#Print the results of the model
print(classification_report(ytest,preds))
print('')
print(recall_score(ytest,preds))
print('')
print(roc_auc_score(ytest,preds))


### Logistic Regression with Balanced Algorithm

In [None]:
#Create an object of the classifier using Logistic Regression.
bbc = BalancedBaggingClassifier(base_estimator=LogisticRegression(),
                                sampling_strategy='auto',
                                replacement=False,
                                random_state=0)

#Train the classifier.
bbc.fit(xtrain, ytrain)
preds = bbc.predict(xtest)
#Print the results of the model
print(classification_report(ytest,preds))
print('')
print(recall_score(ytest,preds))
print('')
print(roc_auc_score(ytest,preds))


In [None]:
# Average accuracy and Recall across 10-folds cross-validation
crossRecall = cross_val_score(bbc, features, target, cv=10, scoring='recall')
crossScore = cross_val_score(bbc, features, target, cv=10, scoring='accuracy')
print("LogisticRegression Average Recall", crossRecall.mean())
print("LogisticRegression Average Accuracy", crossScore.mean())

### Cat Boost with Balanced Algorithm

In [None]:
#Create an object of the classifier using catBoostClassifier algorithm.
bbc = BalancedBaggingClassifier(base_estimator=CatBoostClassifier(verbose= 0),
                                sampling_strategy='auto',
                                replacement=False,
                                random_state=0)

#Train the classifier.
bbc.fit(xtrain, ytrain)
preds = bbc.predict(xtest)
#Print the results of the model
print(classification_report(ytest,preds))
print('')
print(recall_score(ytest,preds))
print('')
print(roc_auc_score(ytest,preds))


In [None]:
# Average accuracy and Recall across 10-folds cross-validation
crossRecall = cross_val_score(bbc, features, target, cv=10, scoring='recall')
crossScore = cross_val_score(bbc, features, target, cv=10, scoring='accuracy')
print("CatBoostClassifier Average Recall", crossRecall.mean())
print("CatBoostClassifier Average Accuracy", crossScore.mean())

In [None]:
#Average of the Recall across 10-folds on the CatBoostClassifier
crossRecall.mean()

 After building the model, we can visualize which attributes contribute the most to customer churn.

In [None]:

# plotting feature importance, the features that are contributing more in catBoostClassifier
plt.figure(figsize=(12,8))
feat_importances = pd.Series(catb.feature_importances_, index=features.columns)
feat_importances.nlargest(10).sort_values().plot(kind='barh',color = 'crimson')
plt.title('Feature Importance from  Catboost Classifier')
plt.show()

### Hyper-Parameter search on XGBClassifier
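
As a rough picture of what a randomized search does: it draws a fixed number of parameter combinations from the grid, scores each one, and keeps the best. When every parameter is given as a list, scikit-learn draws the combinations without replacement, which the sketch below imitates. The grid is a subset of the one in this notebook, and `toy_score` is a hypothetical stand-in for the cross-validated score, not a real model evaluation.

```python
import itertools
import random

# A subset of the notebook's grid, used here only for illustration.
param_grid = {
    "max_depth": [5, 6, 7, 8],
    "learning_rate": [0.05, 0.1, 0.2, 0.3],
    "n_estimators": [5, 10, 20, 100],
}

def sample_candidates(grid, n_iter, seed):
    """Draw n_iter distinct parameter combinations from the full grid."""
    names = list(grid)
    all_combos = [dict(zip(names, values))
                  for values in itertools.product(*(grid[n] for n in names))]
    rng = random.Random(seed)
    return rng.sample(all_combos, min(n_iter, len(all_combos)))

def toy_score(params):
    """Hypothetical score; a real search would cross-validate a model here."""
    return -abs(params["max_depth"] - 6) - abs(params["learning_rate"] - 0.1)

candidates = sample_candidates(param_grid, n_iter=10, seed=51)
best = max(candidates, key=toy_score)
print(len(candidates), best)
```

Only 10 of the 64 combinations are scored here, which is the trade-off of a randomized search compared with an exhaustive grid search: less computation in exchange for possibly missing the best combination.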

In [None]:
# We look for the best parameters to train the XGBClassifier model
m_dep = [5,6,7,8]
gammas = [0.01,0.001,0.001]
min_c_wt = [1,5,10]
l_rate = [0.05,0.1, 0.2, 0.3]
n_est = [5,10,20,100]

param_grid = {'n_estimators': n_est, 'gamma': gammas, 'max_depth': m_dep,
              'min_child_weight': min_c_wt, 'learning_rate': l_rate}

xgb_cv = RandomizedSearchCV(estimator = XGBClassifier(), n_iter=100, param_distributions =  param_grid, random_state=51, cv=3, n_jobs=-1, refit=True)
xgb_cv.fit(xtrain,ytrain)

print("tuned hyperparameters :(best parameters) ",xgb_cv.best_params_)
print("accuracy :",xgb_cv.best_score_)
print(xgb_cv.best_estimator_)


In [None]:
#Training the XGBClassifier model with the best parameters and checking its performance without the balanced algorithm
xgb = XGBClassifier( n_estimators = 100, min_child_weight= 5, max_depth= 5, learning_rate= 0.1, gamma=0.001)
xgb = xgb.fit(xtrain,ytrain)
pred = xgb.predict(xtest)

print('Classification Report')
print('+-+'*15)
print(classification_report(ytest,pred))
print('+-+'*15)
print('Confusion Matrix')
print('+-+'*15)
cm = confusion_matrix(ytest,pred)
df_cm = pd.DataFrame(cm, index=['Not Churn','Churn'], columns=['Not Churn','Churn'])
plt.figure(figsize=(8,5))
heatmap = sns.heatmap(df_cm, annot=True, fmt="d",cmap='Set1')
heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=15)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=15)
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title('Confusion Matrix for XGBoost')
plt.show()

In [None]:
# Plotting the confusion Matrix of the XGBoost model
m_dep = [7]
gammas = [0.01]
min_c_wt = [1]
l_rate = [0.2]
n_est = [100]

param_grid = {'n_estimators': n_est, 'gamma': gammas, 'max_depth': m_dep,
              'min_child_weight': min_c_wt, 'learning_rate': l_rate}

xgb_cv_10 = RandomizedSearchCV(estimator = XGBClassifier(), n_iter=100, param_distributions =  param_grid, random_state=51, cv=10, n_jobs=-1, refit=True)
xgb_cv_10.fit(xtrain,ytrain)

pred = xgb_cv_10.predict(xtest)
print(roc_auc_score(ytest, pred))
print('Classification Report')
print('+-+'*15)
print(classification_report(ytest,pred))
print('+-+'*15)
print('Confusion Matrix')
print('+-+'*15)
cm = confusion_matrix(ytest,pred)
df_cm = pd.DataFrame(cm, index=['Not Churn','Churn'], columns=['Not Churn','Churn'])
plt.figure(figsize=(8,5))
heatmap = sns.heatmap(df_cm, annot=True, fmt="d",cmap='Set1')
heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=15)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=15)
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title('Confusion Matrix for XGBClassifier algorithm')
plt.show()

In [None]:
# Checking the area under the curve of XGBClassifier with 10 folds cross validation
y_pred_prob = xgb_cv_10.predict_proba(xtest)[:,1]
fpr, tpr, thresholds = roc_curve(ytest, y_pred_prob)
plt.plot([0,1], [0,1], 'k--')
plt.plot(fpr, tpr, label='xgb')
plt.legend()

In [None]:
# View the area under the curve score
roc_auc_score(ytest, y_pred_prob)

### XGBoost with the balanced algorithm

In [None]:
#Here, I use the imbalanced-learn classifier which deals with the imbalanced dataset
#Create an object of the classifier using XGBoost algorithm.
bbc = BalancedBaggingClassifier(base_estimator=XGBClassifier(),
                                sampling_strategy='auto',
                                replacement=False,
                                random_state=0)
#Train the classifier.
bbc.fit(xtrain, ytrain)
preds = bbc.predict(xtest)
#Print out the results of the model
print(classification_report(ytest,preds))
print('')
print(recall_score(ytest,preds))
print('')
print(roc_auc_score(ytest,preds))


In [None]:
# Average accuracy and Recall across 10-folds cross-validation
crossRecall = cross_val_score(bbc, features, target, cv=10, scoring='recall')
crossScore = cross_val_score(bbc, features, target, cv=10, scoring='accuracy')
print("XGBoost Average Recall", crossRecall.mean())
print("XGboost Average Accuracy", crossScore.mean())

### K-Nearest Neighbors Classifier

In [None]:
#Create an object of the classifier using K-Neighbors Classifier.
bbc = BalancedBaggingClassifier(base_estimator=KNeighborsClassifier(),
                                sampling_strategy='auto',
                                replacement=False,
                                random_state=0)
#Train the classifier.
bbc.fit(xtrain, ytrain)
preds = bbc.predict(xtest)
#Print out the results of the model
print(classification_report(ytest,preds))
print('')
print(recall_score(ytest,preds))
print('')
print(roc_auc_score(ytest,preds))


In [None]:
# Average accuracy and Recall across 10-folds cross-validation
crossRecall = cross_val_score(bbc, features, target, cv=10, scoring='recall')
crossScore = cross_val_score(bbc, features, target, cv=10, scoring='accuracy')
print("KNeighbors Average Recall", crossRecall.mean())
print("KNeighbors Average Accuracy", crossScore.mean())

### Support Vector Machine with balanced algorithm

In [None]:
#Here, I use the imbalanced-learn classifier which deals with the imbalanced dataset
#Create an object of the classifier using Support vector machine algorithm.
bbc = BalancedBaggingClassifier(base_estimator=svm.SVC(),
                                sampling_strategy='auto',
                                replacement=False,
                                random_state=0)
#Train the classifier.
bbc.fit(xtrain, ytrain)
preds = bbc.predict(xtest)
#Print out the results of the model
print(classification_report(ytest,preds))
print('')
print(recall_score(ytest,preds))
print('')
print(roc_auc_score(ytest,preds))

In [None]:
# Average accuracy and Recall across 10-folds cross-validation
crossRecall = cross_val_score(bbc, features, target, cv=10, scoring='recall')
crossScore = cross_val_score(bbc, features, target, cv=10, scoring='accuracy')
print("SVM Average Recall", crossRecall.mean())
print("SVM Average Accuracy", crossScore.mean())

### Cross Validation on Gradient Boost model with Balanced Algorithm
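
As a reminder of what 10-fold cross-validation does, the sketch below splits row indices into 10 folds so that every row is used for testing exactly once. It is a plain, unshuffled k-fold written by hand for illustration; the function name `k_fold_indices` is hypothetical, and for classifiers `cross_val_score` with an integer `cv` uses stratified folds, which additionally keep the class proportions similar in each fold.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # The first n_samples % k folds get one extra row.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=25, k=10))
for train_idx, test_idx in folds[:3]:
    print(len(train_idx), test_idx)
```

The model is refitted on each training part and scored on the matching test part, and the reported figure is the mean of those 10 scores.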



In [None]:
#Perform cross-validation to see how the model performs across many test sets.
# Here, I used k-fold cross validation with 10 folds
crossRecallg = cross_val_score(grb, features, target, cv=10, scoring='recall')
crossScoreg = cross_val_score(grb, features, target, cv=10, scoring='accuracy')
# The average of the recalls across all 10 folds with the gradient boosting model
print("Recall Average:", crossRecallg.mean())
# The average of the accuracies across all 10 folds with the gradient boosting model
print("Accuracy Average", crossScoreg.mean())

### Conclusion

In this project on classifying customer churn, I have carried out univariate analysis, bivariate analysis and modeling. The univariate analysis is available in the provided HTML file. After trying several machine learning classifiers, balancing the data with a balancing algorithm and using 10-fold cross-validation, I have arrived at the model that performs better than the others on our classification problem.
Our metrics of interest are <u>Recall</u> and Accuracy. Recall matters because I need a model with few False Negatives *(FN)*, that is, a model that rarely predicts that a customer will not churn when the customer actually does churn.
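
To show how these metrics relate to the confusion matrix, here is a short worked example. The four counts are hypothetical and are not results from this notebook.

```python
# Hypothetical confusion-matrix counts (not results from this notebook).
tn, fp = 140, 20   # actually stayed:  predicted stayed / predicted churn
fn, tp = 10, 30    # actually churned: predicted stayed / predicted churn

# Recall: of the customers who really churned, how many did the model catch?
recall = tp / (tp + fn)

# Accuracy: of all customers, how many were classified correctly?
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Precision: of the customers flagged as churners, how many really churned?
precision = tp / (tp + fp)

print(f"recall = {recall:.2f}, accuracy = {accuracy:.2f}, precision = {precision:.2f}")
# prints: recall = 0.75, accuracy = 0.85, precision = 0.60
```

Lowering the number of False Negatives raises Recall directly, usually at the cost of more False Positives and therefore lower precision.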

Comparing these algorithms, Gradient Boosting predicts better than the others, with a Recall of *74.76%*: it correctly identifies *74.76%* of the customers who actually churn. The accuracy of Gradient Boosting is *80%*.