# Business Problem: 
***Get insights from the No-Churn Telecom dataset to find out why more customers are leaving the company than expected and what can be done to improve the situation.***

# Objective: 
- In this notebook we explore the processed data, create a Churn_Risk_Score for each customer, and introduce a new predictor variable, churn_flag.


**Steps in Exploratory Data Analysis**

Step 1 : Import the libraries

Step 2 : Import the data-set

Step 3 : Creating Churn Risk Score

Step 4 :  Creating Churn Flag

Step 5 : Model for Target--->Churn


# Step 1 : Import the libraries

In [153]:
# Import the libraries
import numpy as np  #NumPy is the fundamental package for scientific computing with Python.
import pandas as pd #pandas is for data manipulation and analysis.
import matplotlib.pyplot as plt #Matplotlib is a Python 2D plotting library which produces publication quality figures.
import seaborn as sns #Seaborn is a Python data visualization library based on matplotlib
%matplotlib inline
import joblib 

# Step 2 : Import the data-set

In [43]:
#pd.set_option('display.height', 500)
#pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
df = pd.read_excel("No-Churn_Telecom_Europe_processed_final01.xlsx")
print(df.shape)
df.head()

(4617, 7)


Unnamed: 0,International_Plan,International_calls,International_Mins,VMail_Plan,Total_charges,CustServ_Calls,churn
0,0,3,10.0,1,75.56,1,0
1,0,3,13.7,1,59.24,1,0
2,0,5,12.2,0,62.29,0,0
3,1,7,6.6,0,66.8,2,0
4,1,3,10.1,0,52.09,3,0


# Step 3 : Creating Churn Risk Score

![alt text](Logistic-Sigmoid-function.png "Title")

In [44]:
df1 = df.copy()

In [45]:
df.columns

Index(['International_Plan', 'International_calls', 'International_Mins',
       'VMail_Plan', 'Total_charges', 'CustServ_Calls', 'churn'],
      dtype='object')

In [46]:
df1['int'] = 1
indep_var =['International_Plan', 'International_calls', 'International_Mins', 'VMail_Plan', 'Total_charges',
       'CustServ_Calls','int', 'churn']
df1 = df1[indep_var]

In [47]:
df1.columns

Index(['International_Plan', 'International_calls', 'International_Mins',
       'VMail_Plan', 'Total_charges', 'CustServ_Calls', 'int', 'churn'],
      dtype='object')

In [48]:
# Separate the features (X) from the target (y)
target_name = 'churn'
X = df1.drop('churn', axis=1)
y = df1[target_name]

In [49]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.20, random_state=123, stratify=y)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(3693, 7)
(924, 7)
(3693,)
(924,)


### Finding Logistic Regression Coefficients:
- The coefficients will be used later to compute each customer's probability of churning (churn_Risk_score).
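
For reference, the logistic model behind the score can be written as below. The symbols are notation introduced here for illustration: $\beta_0$ is the intercept (the `int` column), $x_1,\dots,x_6$ are the six predictors, and $z$ is the log-odds of churn. The second form of $p$ is the one the notebook evaluates with `np.exp`.

```latex
z = \beta_0 + \sum_{i=1}^{6} \beta_i x_i,
\qquad
p = P(\text{churn} = 1 \mid x) = \frac{1}{1 + e^{-z}} = \frac{e^{z}}{1 + e^{z}}
```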

In [50]:
import statsmodels.api as sm
# Six predictors plus the constant column 'int' (the intercept)
features = ['International_Plan', 'International_calls', 'International_Mins', 'VMail_Plan', 'Total_charges',
       'CustServ_Calls', 'int']
logReg = sm.Logit(y_train, X_train[features])
answer = logReg.fit()

# answer.summary() would show the full regression table; only the fitted coefficients are displayed here
answer.params

Optimization terminated successfully.
         Current function value: 0.321870
         Iterations 7


International_Plan     2.081353
International_calls   -0.057075
International_Mins     0.066189
VMail_Plan            -1.064314
Total_charges          0.078957
CustServ_Calls         0.517306
int                   -8.168612
dtype: float64

- The values above are the coefficients assigned to each independent variable. The constant (int), -8.168612, is the intercept: the log-odds of churn when every independent variable is zero.
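
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to read. The sketch below is a hedged illustration: it hard-codes the coefficients printed above rather than refitting the model, and the name `odds_ratios` is introduced here.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the answer.params output above
coef = pd.Series({
    "International_Plan": 2.081353,
    "International_calls": -0.057075,
    "International_Mins": 0.066189,
    "VMail_Plan": -1.064314,
    "Total_charges": 0.078957,
    "CustServ_Calls": 0.517306,
    "int": -8.168612,
})

# exp(coefficient) is the multiplicative change in the odds of churn
# for a one-unit increase in that variable, holding the others fixed
odds_ratios = np.exp(coef.drop("int"))
print(odds_ratios.round(3))
```

Read this way, having an international plan multiplies the odds of churn by roughly 8, while a voicemail plan reduces them (odds ratio below 1).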

#### Create function to compute the log-odds from the coefficients

In [51]:
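coef = answer.params
# Linear predictor (log-odds of churn) for one customer.
# Coefficients are looked up by label rather than by position.
def y (coef, International_Plan, International_calls, International_Mins,
       VMail_Plan,
       Total_charges,
       CustServ_Calls) :
    return (coef['int']
            + coef['International_Plan']*International_Plan
            + coef['International_calls']*International_calls
            + coef['International_Mins']*International_Mins
            + coef['VMail_Plan']*VMail_Plan
            + coef['Total_charges']*Total_charges
            + coef['CustServ_Calls']*CustServ_Calls)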

import numpy as np


y1 = y(coef, 0,3,10,1,75.56,7)
p = np.exp(y1) / (1+np.exp(y1))
print(f'A Customer with given parameters has a {p*100}% chance of churn.')

A Customer with given parameters has a 69.94860838234864% chance of churn.


In [52]:
df1.head(3)

Unnamed: 0,International_Plan,International_calls,International_Mins,VMail_Plan,Total_charges,CustServ_Calls,int,churn
0,0,3,10.0,1,75.56,1,1,0
1,0,3,13.7,1,59.24,1,1,0
2,0,5,12.2,0,62.29,0,1,0


In [53]:
df1['churn_score'] = df1.apply(lambda row : y(coef,row['International_Plan'],row['International_calls'] ,
                                              row['International_Mins'] ,
                                              row['VMail_Plan'] ,
                                             row['Total_charges'],row['CustServ_Calls']),axis=1)  

#### Defining Function for finding 'churn_Risk_score' of all customers

In [54]:

def final_churn_score(y1):
    return np.exp(y1) / (1+np.exp(y1))

In [55]:
# Using lambda to map to all customers
df1['churn_Risk_score'] = df1.apply(lambda row : final_churn_score(row['churn_score']) ,axis=1)
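
The two row-wise `apply` calls can also be expressed as one vectorized computation. The following is a minimal sketch under stated assumptions: the coefficients are hard-coded from the output above, `customers` holds only the five rows shown by `df.head()`, and the names `customers`, `z` and `risk` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the answer.params output above
coef = pd.Series({
    "International_Plan": 2.081353,
    "International_calls": -0.057075,
    "International_Mins": 0.066189,
    "VMail_Plan": -1.064314,
    "Total_charges": 0.078957,
    "CustServ_Calls": 0.517306,
    "int": -8.168612,
})

# First five rows of the processed dataset, as displayed by df.head()
customers = pd.DataFrame({
    "International_Plan": [0, 0, 0, 1, 1],
    "International_calls": [3, 3, 5, 7, 3],
    "International_Mins": [10.0, 13.7, 12.2, 6.6, 10.1],
    "VMail_Plan": [1, 1, 0, 0, 0],
    "Total_charges": [75.56, 59.24, 62.29, 66.8, 52.09],
    "CustServ_Calls": [1, 1, 0, 2, 3],
})

predictors = coef.index.drop("int")
# Log-odds for every customer at once: X . beta + intercept
z = customers[predictors].dot(coef[predictors]) + coef["int"]
# Sigmoid written as 1 / (1 + exp(-z)), algebraically equal to exp(z) / (1 + exp(z))
risk = 1.0 / (1.0 + np.exp(-z))
print(risk.round(6))
```

On a few thousand rows the difference in speed is small, but the vectorized form avoids passing each column by hand and keeps the predictor names in one place.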

In [56]:
df1 = df1.drop('churn_score',axis=1)
df1 = df1.drop('int',axis=1)

In [57]:
pd.set_option('display.max_rows', 500)
df1.head()

Unnamed: 0,International_Plan,International_calls,International_Mins,VMail_Plan,Total_charges,CustServ_Calls,churn,churn_Risk_score
0,0,3,10.0,1,75.56,1,0,0.094578
1,0,3,13.7,1,59.24,1,0,0.03548
2,0,5,12.2,0,62.29,0,0,0.061326
3,1,7,6.6,0,66.8,2,0,0.564387
4,1,3,10.1,0,52.09,3,0,0.518692


In [58]:
Churn_Risk_Score= df1['churn_Risk_score']
Churn_Risk_Score

0       0.094578
1       0.035480
2       0.061326
3       0.564387
4       0.518692
          ...   
4612    0.028569
4613    0.229311
4614    0.034345
4615    0.022343
4616    0.120946
Name: churn_Risk_score, Length: 4617, dtype: float64

- churn_Risk_score is successfully created

In [59]:
df_churnRS = df1.copy()

In [60]:
df_churnRS 

Unnamed: 0,International_Plan,International_calls,International_Mins,VMail_Plan,Total_charges,CustServ_Calls,churn,churn_Risk_score
0,0,3,10.0,1,75.56,1,0,0.094578
1,0,3,13.7,1,59.24,1,0,0.035480
2,0,5,12.2,0,62.29,0,0,0.061326
3,1,7,6.6,0,66.80,2,0,0.564387
4,1,3,10.1,0,52.09,3,0,0.518692
...,...,...,...,...,...,...,...,...
4612,0,6,8.5,1,49.83,3,0,0.028569
4613,0,1,15.7,1,69.49,3,0,0.229311
4614,0,3,13.0,1,59.40,1,0,0.034345
4615,0,3,14.3,1,59.26,0,0,0.022343


In [61]:
#pd.set_option('display.height', 500)
#pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
df_raw = pd.read_excel('No-Churn_Telecom_Europe_renamed.xlsx')
print(df_raw.shape)
df_raw.head(1)

(4617, 21)


Unnamed: 0,State,Account_Length,Area_Code,Phone,International_Plan,VMail_Plan,VMail_Message,Day_Mins,Day_Calls,Day_Charge,Eve_Mins,Eve_Calls,Eve_Charge,Night_Mins,Night_Calls,Night_Charge,International_Mins,International_calls,International_Charge,CustServ_Calls,churn
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False.


In [62]:
Customer_Phone =df_raw.Phone

In [63]:
Churn_Risk_Score_Customers =  pd.concat([Customer_Phone, Churn_Risk_Score], axis=1)
Churn_Risk_Score_Customers

Unnamed: 0,Phone,churn_Risk_score
0,382-4657,0.094578
1,371-7191,0.035480
2,358-1921,0.061326
3,375-9999,0.564387
4,330-6626,0.518692
...,...,...
4612,345-7512,0.028569
4613,343-6820,0.229311
4614,338-4794,0.034345
4615,355-8388,0.022343


## Churn risk scores: To drive retention campaigns.

With the logistic regression model in place, we can use the churn risk scores to decide which zone each customer belongs to and take the required action. Each zone is explained here:

Churn risk score(0-0.3)
- Safe Zone **<font color='green'>(Green)</font>** –  Customers within this zone are considered safe.

Churn risk score(0.3-0.5)
- Low Risk Zone **<font color='yellow'>(Yellow)</font>** –  Customers within this zone should be watched for potential churn. This is more of a long-term track.

Churn risk score(0.5-0.8)
- Medium Risk Zone **<font color='orange'>(Orange)</font>** – Customers within this zone are at risk of churn. Action should be taken and monitored accordingly.

Churn risk score(>0.8)
- High Risk Zone **<font color='red'>(Red)</font>** – Customers within this zone are considered to have the highest chance of churn. Action should be taken immediately.
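
The zones above can be assigned programmatically with `pd.cut`. This is a hedged sketch: the scores are the five values displayed earlier, the names `scores` and `zones` are introduced here, and the handling of boundary values (a score of exactly 0.3 counts as Green, exactly 0.5 as Yellow, and so on) is an assumption, since the zone definitions do not say which side each boundary belongs to.

```python
import pandas as pd

# The five churn_Risk_score values shown by df1.head()
scores = pd.Series([0.094578, 0.035480, 0.061326, 0.564387, 0.518692],
                   name="churn_Risk_score")

# Assumption: each band includes its upper bound
bins = [0.0, 0.3, 0.5, 0.8, 1.0]
labels = ["Green", "Yellow", "Orange", "Red"]
zones = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)

# Number of customers per zone, in zone order
print(zones.value_counts(sort=False))
```

Applied to the full `churn_Risk_score` column, the resulting categorical column can be grouped or filtered to build the retention campaign lists.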

#### Save the DataFrame containing churn_Risk_score to Excel with pandas for easy retrieval later

In [64]:
Churn_Risk_Score_Customers.to_excel("No-Churn_Telecom_Europe_Phone&churnRiskScore_only_final00.xlsx",index=False)

# Step 4 :  Creating Churn Flag

In [65]:
# Flag customers by risk score: 0 if the score is below the 0.49 cut-off, 1 otherwise
df1["churn_flag"] = np.where(df1["churn_Risk_score"] < 0.49, 0, 1)

In [66]:
pd.set_option('display.max_rows', 500)
df1.head()

Unnamed: 0,International_Plan,International_calls,International_Mins,VMail_Plan,Total_charges,CustServ_Calls,churn,churn_Risk_score,churn_flag
0,0,3,10.0,1,75.56,1,0,0.094578,0
1,0,3,13.7,1,59.24,1,0,0.03548,0
2,0,5,12.2,0,62.29,0,0,0.061326,0
3,1,7,6.6,0,66.8,2,0,0.564387,1
4,1,3,10.1,0,52.09,3,0,0.518692,1


In [67]:
df1.columns

Index(['International_Plan', 'International_calls', 'International_Mins',
       'VMail_Plan', 'Total_charges', 'CustServ_Calls', 'churn',
       'churn_Risk_score', 'churn_flag'],
      dtype='object')

In [68]:
df_churn_flag = df1[['International_Plan', 'International_calls', 'International_Mins',
       'VMail_Plan', 'Total_charges', 'CustServ_Calls','churn_flag']]

In [69]:
# Save the predictors together with churn_flag to Excel for easy retrieval later
df_churn_flag.to_excel("No-Churn_Telecom_Europe_churnFlag_only_final01.xlsx",index=False)

In [70]:
df_churn_flag_1 = df1[df1['churn_flag'] ==1]
print(df_churn_flag_1.shape)

(258, 9)


In [71]:
df_churn_flag_0= df1[df1['churn_flag'] ==0]
print(df_churn_flag_0.shape)

(4359, 9)
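
To see how the 0.49 cut-off relates to actual churn, the flag can be cross-tabulated against the `churn` column. The sketch below uses only the five rows displayed earlier as toy data, so its counts say nothing about the full dataset; on the full data the equivalent call would be `pd.crosstab(df1['churn'], df1['churn_flag'])`. The names `toy` and `table` are introduced here.

```python
import pandas as pd

# churn and churn_flag for the five rows shown by df1.head()
toy = pd.DataFrame({
    "churn":      [0, 0, 0, 0, 0],
    "churn_flag": [0, 0, 0, 1, 1],
})

# Rows: actual churn, columns: flag derived from the risk score
table = pd.crosstab(toy["churn"], toy["churn_flag"])
print(table)
```

Cells off the diagonal are customers whose flag disagrees with the recorded outcome, which is useful when deciding whether the cut-off should be moved.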


# Step 5 : Model for Target--->Churn

In [73]:
df2 = df1.copy()

In [74]:
df1 = df1.drop('churn_Risk_score',axis=1)
df1 = df1.drop('churn_flag',axis=1)

In [75]:
# Separate the features (X) from the target (y)
target_name = 'churn'
X = df1.drop('churn', axis=1)
y = df1[target_name]

In [76]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.20, random_state=123, stratify=y)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(3693, 6)
(924, 6)
(3693,)
(924,)


In [77]:
# Import different models
from sklearn.linear_model import LogisticRegression
from sklearn import svm, tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
import xgboost

# Model selection and evaluation utilities
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)
import joblib

In [78]:
# Inspect the target variable
target_name = 'churn'
y = df1[target_name]
y

0       0
1       0
2       0
3       0
4       0
       ..
4612    0
4613    0
4614    0
4615    0
4616    0
Name: churn, Length: 4617, dtype: int64

In [79]:
from sklearn.ensemble import RandomForestClassifier
import xgboost
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import roc_auc_score

In [80]:
# Use 10-fold cross-validation to evaluate a RandomForestClassifier
from sklearn.model_selection import cross_val_score

model4 = RandomForestClassifier()
# F1 is the harmonic mean of precision and recall (best value 1, worst 0).
# With scoring='f1_micro' on a single-label problem, the micro-averaged F1 equals accuracy.
scores = cross_val_score(model4 ,X, y, cv=10,scoring='f1_micro')
print(scores)
# Mean score and a rough 95% interval (mean +/- 2 standard deviations across folds):
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[0.97835498 0.98051948 0.96103896 0.98484848 0.98484848 0.97619048
 0.98268398 0.98047722 0.97180043 0.98698482]
Accuracy: 0.98 (+/- 0.01)
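
Because churners are a minority class, accuracy alone can look high even when many churners are missed, so it can help to report several metrics per fold. The following is a minimal sketch, not the notebook's method: it uses synthetic data from `make_classification` as a stand-in for `X` and `y` (the roughly 15% positive rate is an assumption), and `cross_validate` with an explicit `StratifiedKFold`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for X and y (assumption: about 15% positives, 6 features)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.85, 0.15], random_state=123,
)

# Shuffled, stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
metrics = ["accuracy", "recall", "f1", "roc_auc"]
results = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=123),
    X_demo, y_demo, cv=cv, scoring=metrics,
)

for name in metrics:
    fold_scores = results[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.2f} (+/- {2 * fold_scores.std():.2f})")
```

Here `recall` and `f1` refer to the positive (churn) class, which is usually the quantity a retention team cares about most.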


In [235]:
# Random Forest model
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)  # predict once and reuse the result
print('Confusion Matrix :')
print(confusion_matrix(y_test, rf_pred))
print( 'Accuracy Score :',accuracy_score(y_test, rf_pred) )
print( '---classification_report---')
print(classification_report(y_test, rf_pred))

Confusion Matrix :
[[871   1]
 [ 17 127]]
Accuracy Score : 0.9822834645669292
---classification_report---
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       872
           1       0.99      0.88      0.93       144

    accuracy                           0.98      1016
   macro avg       0.99      0.94      0.96      1016
weighted avg       0.98      0.98      0.98      1016
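
The figures in the report above follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy, precision, recall and F1 for the churn class from the four counts printed above; the variable names are introduced here for illustration.

```python
import numpy as np

# Confusion matrix printed above: rows are actual classes, columns are predicted classes
cm = np.array([[871, 1],
               [17, 127]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted churners who actually churned
recall = tp / (tp + fn)              # share of actual churners the model caught
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 2))
```

The 17 false negatives are churners the model missed, which is why recall for class 1 is noticeably lower than its precision.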



In [82]:
# Use 10-fold cross-validation to evaluate an XGBClassifier
from sklearn.model_selection import cross_val_score

model1 = xgboost.XGBClassifier()
#scoring = 'roc_auc'
# F1 is the harmonic mean of precision and recall (best value 1, worst 0).
# With scoring='f1_micro' on a single-label problem, the micro-averaged F1 equals accuracy.
scores = cross_val_score(model1 ,X, y, cv=10,scoring='f1_micro')
print(scores)
# Mean score and a rough 95% interval (mean +/- 2 standard deviations across folds):
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[0.98051948 0.98484848 0.96320346 0.98484848 0.98484848 0.97619048
 0.98268398 0.98047722 0.97180043 0.98698482]
Accuracy: 0.98 (+/- 0.01)


In [214]:
# XGBoost model
import xgboost
xgb = xgboost.XGBClassifier()
xgb.fit(X_train, y_train)
xgb_pred = xgb.predict(X_test)  # predict once and reuse the result
print('---Confusion Matrix---')
print(confusion_matrix(y_test, xgb_pred))
print('\n')
print( 'Accuracy Score :-->',accuracy_score(y_test, xgb_pred) )
print('\n')
print( '---classification_report---')
print(classification_report(y_test, xgb_pred))

---Confusion Matrix---
[[872   0]
 [ 16 128]]


Accuracy Score :--> 0.984251968503937


---classification_report---
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       872
           1       1.00      0.89      0.94       144

    accuracy                           0.98      1016
   macro avg       0.99      0.94      0.97      1016
weighted avg       0.98      0.98      0.98      1016

- Note: both classification reports above show a total support of 1016 rows (872 + 144), whereas the test split printed earlier has 924 rows, so these outputs appear to come from a run on a different split.



In [236]:
# Save the model as a pickle in a file 
joblib.dump(xgb, 'Xbgboost_Classifier_No-Churn Telecom_predict_Churn.pkl')       
#joblib.dump to serialize an object hierarchy

['Xbgboost_Classifier_No-Churn Telecom_predict_Churn.pkl']
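
A saved model can be restored later with `joblib.load` and used for prediction without retraining. The sketch below shows the round trip under stated assumptions: to stay self-contained it uses a small scikit-learn `LogisticRegression` on random data as a stand-in for the XGBoost model, and writes to a temporary file named `demo_model.pkl` (a hypothetical name) instead of the file saved above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data and model (assumption: 6 features, like the notebook's X)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 6))
y_demo = (X_demo[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(model, path)        # serialize the fitted model
    restored = joblib.load(path)    # read it back into memory

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

When loading the real file, the same library versions used for saving should be installed, since pickled models are not guaranteed to load across versions.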