#Churn or not Churn

In this project, we train a model to predict whether a customer will stop using our services (churn) or continue using them (not churn). 
The model is trained on a dataset that holds the attributes of past customers together with whether each of them churned. The dataset comes from Kaggle and is named [Telco Customer Churn](https://www.kaggle.com/datasets/blastchar/telco-customer-churn). 
If the model predicts that a customer is going to churn, we can send them promotional discounts and emails to retain them.

In [79]:
#importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mutual_info_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from IPython.display import display 

%matplotlib inline 

##Data reading and Pre-processing

In [80]:
#reading the dataset
df = pd.read_csv("/content/WA_Fn-UseC_-Telco-Customer-Churn.csv")

In [81]:
#Checking the df
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [82]:
#checking the length of df
len(df)

7043

In [83]:
#making the column names and the string values consistent (lowercase, spaces replaced by underscores)
df.columns = df.columns.str.lower().str.replace(" ","_")
string_columns = list(df.dtypes[df.dtypes == 'object'].index)
for col in string_columns:
  df[col]  =df[col].str.lower().str.replace(" ","_")

In [84]:
#Now check the df
df.head()

Unnamed: 0,customerid,gender,seniorcitizen,partner,dependents,tenure,phoneservice,multiplelines,internetservice,onlinesecurity,...,deviceprotection,techsupport,streamingtv,streamingmovies,contract,paperlessbilling,paymentmethod,monthlycharges,totalcharges,churn
0,7590-vhveg,female,0,yes,no,1,no,no_phone_service,dsl,no,...,no,no,no,no,month-to-month,yes,electronic_check,29.85,29.85,no
1,5575-gnvde,male,0,no,no,34,yes,no,dsl,yes,...,yes,no,no,no,one_year,no,mailed_check,56.95,1889.5,no
2,3668-qpybk,male,0,no,no,2,yes,no,dsl,yes,...,no,no,no,no,month-to-month,yes,mailed_check,53.85,108.15,yes
3,7795-cfocw,male,0,no,no,45,no,no_phone_service,dsl,yes,...,yes,yes,no,no,one_year,no,bank_transfer_(automatic),42.3,1840.75,no
4,9237-hqitu,female,0,no,no,2,yes,no,fiber_optic,no,...,no,no,no,no,month-to-month,yes,electronic_check,70.7,151.65,yes


In [85]:
#We can transpose the df to see more of the columns, as it is easier to read
df.head().T

Unnamed: 0,0,1,2,3,4
customerid,7590-vhveg,5575-gnvde,3668-qpybk,7795-cfocw,9237-hqitu
gender,female,male,male,male,female
seniorcitizen,0,0,0,0,0
partner,yes,no,no,no,no
dependents,no,no,no,no,no
tenure,1,34,2,45,2
phoneservice,no,yes,yes,no,yes
multiplelines,no_phone_service,no,no,no_phone_service,no
internetservice,dsl,dsl,dsl,dsl,fiber_optic
onlinesecurity,no,yes,yes,yes,no


In [86]:
#Now we can see there are multiple columns in the df but we are most interested in churn


In [87]:
#Also let us look at data types of columns
df.dtypes

customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges         object
churn                object
dtype: object

In [88]:
#seniorcitizen is an int although it is really a boolean, but it only holds 0 and 1, so that is not a problem.
#totalcharges is an object column but it must be numeric; errors='coerce' turns unparseable values into NaN
total_charges = pd.to_numeric(df["totalcharges"], errors='coerce')


In [89]:
#check for which customers the totalcharges values are missing (blank strings became "_" in the cleaning step above)
df[total_charges.isnull()][["customerid" , "totalcharges"]]

Unnamed: 0,customerid,totalcharges
488,4472-lvygi,_
753,3115-czmzd,_
936,5709-lvoeq,_
1082,4367-nuyao,_
1340,1371-dwpaz,_
3331,7644-omvmy,_
3826,3213-vvolg,_
4380,2520-sgtta,_
5218,2923-arzlg,_
6670,4075-wkniu,_


In [90]:
#now we set these missing values to 0 
df["totalcharges"] = pd.to_numeric(df["totalcharges"] , errors = "coerce")
df["totalcharges"] = df["totalcharges"].fillna(0)
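The coercion step can be tried on a tiny stand-in column. This is only a sketch: the three values below are made up for illustration, with `"_"` playing the role of the blank entries.

```python
import pandas as pd

# Toy column mimicking totalcharges after cleaning: "_" stands in for a blank entry.
raw = pd.Series(["29.85", "_", "1889.5"])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
parsed = pd.to_numeric(raw, errors="coerce")
n_missing = int(parsed.isnull().sum())

# Replace the missing entries with 0, as the notebook does.
filled = parsed.fillna(0)

print(n_missing)
print(filled.tolist())
```

Filling with 0 is a simplification; it treats an unknown total as "nothing charged yet", which is an assumption about these rows rather than something the data states.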

In [91]:
#let's convert the yes/no values in churn column to binary numbers
df["churn"] = (df["churn"]=='yes').astype(int)

In [92]:
#Now let us split our data into train and test sets
df_train_full , df_test = train_test_split(df , test_size = 0.2 ,random_state=1) 

In [93]:
#Now we also need a validation set, which can be obtained by again splitting the df_train_full
df_train,df_val =  train_test_split(df_train_full , test_size = 0.33 ,random_state=11)
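To see what the two consecutive splits do to the row counts, the same calls can be run on a plain index array. This is a sketch that uses `np.arange(7043)` as a stand-in for the 7043 rows of `df`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(7043)  # stand-in for the rows of df

# Same parameters as the two splits in the notebook.
train_full, test = train_test_split(rows, test_size=0.2, random_state=1)
train, val = train_test_split(train_full, test_size=0.33, random_state=11)

sizes = {"train": len(train), "val": len(val), "test": len(test)}
shares = {name: round(n / len(rows), 3) for name, n in sizes.items()}

print(len(train_full))
print(sizes)
print(shares)
```

Because the second split takes 33% of the remaining 80%, the validation set ends up somewhat larger than the test set, and the training set holds a little more than half of the data.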

In [94]:
#Now we can save the y values for the split sets
y_train = df_train.churn.values
y_val = df_val.churn.values

#Also delete the y values from the sets so we can't use them accidentally
del df_train['churn']
del df_val['churn']


##Exploratory Data Analysis

In [95]:
#now we need to look at any missing values in data
df_train_full.isnull().sum()

customerid          0
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64

In [96]:
#We also need to look at the value counts of the target variable
df_train_full['churn'].value_counts()

0    4113
1    1521
Name: churn, dtype: int64

In [97]:
#As we can see, the majority of the customers didn't churn
#Now let us also look at the proportion of churned users
global_mean = df_train_full.churn.mean()
round(global_mean, 3)

0.27

In [98]:
#This means that 27% of the users in df_train_full stopped using the services
#Our dataset is also imbalanced, as the numbers of positive and negative instances are not equal

In [99]:
#Our data has two types of features: categorical and numerical
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents',
 'phoneservice', 'multiplelines', 'internetservice',
 'onlinesecurity', 'onlinebackup', 'deviceprotection',
 'techsupport', 'streamingtv', 'streamingmovies',
 'contract', 'paperlessbilling', 'paymentmethod']
numerical = ['tenure', 'monthlycharges', 'totalcharges'] 

In [100]:
#Now let us look at how many unique values each categorical attribute has
df_train_full[categorical].nunique()

gender              2
seniorcitizen       2
partner             2
dependents          2
phoneservice        2
multiplelines       3
internetservice     3
onlinesecurity      3
onlinebackup        3
deviceprotection    3
techsupport         3
streamingtv         3
streamingmovies     3
contract            3
paperlessbilling    2
paymentmethod       4
dtype: int64

###Understanding Features

In [101]:
#Every feature has a different importance, so we look at each categorical variable together with its churn rate
#This helps us answer important questions such as "What makes people churn?" and "Which gender is churning more?"

In [102]:
#let us look into gender variable first
female_mean = df_train_full[df_train_full["gender"]=="female"].churn.mean()
male_mean = df_train_full[df_train_full["gender"]=="male"].churn.mean()
print(f'female_mean: {female_mean}\nmale_mean: {male_mean}')

female_mean: 0.27682403433476394
male_mean: 0.2632135306553911


In [103]:
#We can see that female_mean is 27.7% while male_mean is 26.3%; as the difference is small, knowing the gender
#will not help us much in predicting whether the customer will churn or not

In [104]:
#Now let us look into another variable that is partner
partner_yes = df_train_full[df_train_full["partner"]=="yes"].churn.mean()
partner_no = df_train_full[df_train_full["partner"]=="no"].churn.mean()
print(f'partner_yes: {partner_yes}\npartner_no: {partner_no}')

partner_yes: 0.20503330866025166
partner_no: 0.3298090040927694


In [105]:
#We can see that people with no partner have a significantly higher churn rate (33% versus 20.5%).
#This means partner is a useful variable for predicting churn

####Risk Ratio

In [106]:
#We can calculate the risk ratio to see how much more (or less) likely customers with a certain value of a categorical variable are to churn
#first group by gender and calculate the mean churn rate of each group
df_group = df_train_full.groupby("gender").churn.agg(['mean'])
#Now calculating the difference between global_churn and group_churn
df_group['diff']  =df_group['mean']-global_mean
#Now calculate risk of churning
df_group['risk'] = df_group['mean']/global_mean
df_group

gender,mean,diff,risk
female,0.276824,0.006856,1.025396
male,0.263214,-0.006755,0.97498
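The gender row can be checked by hand from numbers already printed above. The sketch below only reuses the value counts (1521 churned out of 5634) and the two group means; no new data is introduced.

```python
# Value counts printed earlier for df_train_full['churn'].
churned, total = 1521, 5634
global_mean = churned / total

# Group churn rates printed earlier for gender.
female_mean = 0.27682403433476394
male_mean = 0.2632135306553911

# risk = group churn rate / global churn rate
female_risk = female_mean / global_mean
male_risk = male_mean / global_mean

# diff = group churn rate - global churn rate
female_diff = female_mean - global_mean

print(round(global_mean, 6))
print(round(female_risk, 6), round(male_risk, 6))
print(round(female_diff, 6))
```

A risk above 1 means the group churns more often than the average customer, and a risk below 1 means it churns less often. Both gender values sit within about 3% of 1.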


In [107]:
#Now let us calculate risk ratio for every categorical variable

for col in categorical:
  df_group = df_train_full.groupby(col).churn.agg(['mean'])
  df_group['diff']  =df_group['mean']-global_mean
  df_group['risk'] = df_group['mean']/global_mean
  display(df_group)


gender,mean,diff,risk
female,0.276824,0.006856,1.025396
male,0.263214,-0.006755,0.97498


seniorcitizen,mean,diff,risk
0,0.24227,-0.027698,0.897403
1,0.413377,0.143409,1.531208


partner,mean,diff,risk
no,0.329809,0.059841,1.221659
yes,0.205033,-0.064935,0.759472


dependents,mean,diff,risk
no,0.31376,0.043792,1.162212
yes,0.165666,-0.104302,0.613651


phoneservice,mean,diff,risk
no,0.241316,-0.028652,0.89387
yes,0.273049,0.003081,1.011412


multiplelines,mean,diff,risk
no,0.257407,-0.012561,0.953474
no_phone_service,0.241316,-0.028652,0.89387
yes,0.290742,0.020773,1.076948


internetservice,mean,diff,risk
dsl,0.192347,-0.077621,0.712482
fiber_optic,0.425171,0.155203,1.574895
no,0.077805,-0.192163,0.288201


onlinesecurity,mean,diff,risk
no,0.420921,0.150953,1.559152
no_internet_service,0.077805,-0.192163,0.288201
yes,0.153226,-0.116742,0.56757


onlinebackup,mean,diff,risk
no,0.404323,0.134355,1.497672
no_internet_service,0.077805,-0.192163,0.288201
yes,0.217232,-0.052736,0.80466


deviceprotection,mean,diff,risk
no,0.395875,0.125907,1.466379
no_internet_service,0.077805,-0.192163,0.288201
yes,0.230412,-0.039556,0.85348


techsupport,mean,diff,risk
no,0.418914,0.148946,1.551717
no_internet_service,0.077805,-0.192163,0.288201
yes,0.159926,-0.110042,0.59239


streamingtv,mean,diff,risk
no,0.342832,0.072864,1.269897
no_internet_service,0.077805,-0.192163,0.288201
yes,0.302723,0.032755,1.121328


streamingmovies,mean,diff,risk
no,0.338906,0.068938,1.255358
no_internet_service,0.077805,-0.192163,0.288201
yes,0.307273,0.037305,1.138182


contract,mean,diff,risk
month-to-month,0.431701,0.161733,1.599082
one_year,0.120573,-0.149395,0.446621
two_year,0.028274,-0.241694,0.10473


paperlessbilling,mean,diff,risk
no,0.172071,-0.097897,0.637375
yes,0.338151,0.068183,1.25256


paymentmethod,mean,diff,risk
bank_transfer_(automatic),0.168171,-0.101797,0.622928
credit_card_(automatic),0.164339,-0.10563,0.608733
electronic_check,0.45589,0.185922,1.688682
mailed_check,0.19387,-0.076098,0.718121


From the results, we can see that there is not much difference between male and female customers: both risks are close to 1.

Senior citizens are more likely to churn than non-senior citizens: their risk ratio is 1.53 versus 0.90.

People with no dependents are more likely to churn than people with dependents: the risks are 1.16 and 0.61 respectively.

People with no online security and no online backup are also more likely to churn (risks of 1.56 and 1.50).

Similarly, people who pay by electronic check are more likely to churn, as their risk is 1.69.

####Mutual Information

In [108]:
#Now we need to know the degree of dependency between each categorical variable and the target variable.
def calculate_mi(series):
  return mutual_info_score(series , df_train_full.churn)
df_mi = df_train_full[categorical].apply(calculate_mi)
df_mi = df_mi.sort_values(ascending=False).to_frame(name='MI')
df_mi

Unnamed: 0,MI
contract,0.09832
onlinesecurity,0.063085
techsupport,0.061032
internetservice,0.055868
onlinebackup,0.046923
deviceprotection,0.043453
paymentmethod,0.04321
streamingtv,0.031853
streamingmovies,0.031581
paperlessbilling,0.017589


In [109]:
#We can see that contract (MI 0.098), onlinesecurity, techsupport, internetservice and onlinebackup are the most important features,
#while partner, seniorcitizen, multiplelines, phoneservice and gender are the least significant
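To make the score less of a black box, the sketch below computes mutual information directly from its definition and compares it with `mutual_info_score`. The helper `mi_by_hand` and the eight toy rows are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mi_by_hand(x, y):
    """Sum of p(x,y) * log(p(x,y) / (p(x) * p(y))) over all observed value pairs."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy > 0:  # empty cells contribute nothing
                mi += p_xy * np.log(p_xy / (np.mean(x == xv) * np.mean(y == yv)))
    return mi

# Toy data: contract is related to churn, gender is not.
contract = ["monthly"] * 4 + ["yearly"] * 4
gender = ["f", "m", "f", "m", "f", "m", "f", "m"]
churn = [1, 1, 1, 0, 0, 0, 0, 1]

mi_contract = mi_by_hand(contract, churn)
mi_gender = mi_by_hand(gender, churn)

print(mi_contract, mutual_info_score(contract, churn))
print(mi_gender, mutual_info_score(gender, churn))
```

In the toy data both genders churn at the same rate, so knowing the gender tells us nothing about churn and the score is zero. The contract column shifts the churn rate from 25% to 75%, which gives a clearly positive score.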

####Correlation coefficient

In [110]:
#We have seen the dependency for the categorical variables, but the numerical variables still remain
df_train_full[numerical].corrwith(df_train_full.churn)

tenure           -0.351885
monthlycharges    0.196805
totalcharges     -0.196353
dtype: float64

Tenure has a negative correlation with churn: the longer a customer stays, the less likely they are to churn.

Monthly charges have a positive correlation with churn: the higher the monthly charges, the more likely the customer is to churn.

Total charges have a negative correlation with churn: customers with higher total charges have usually been customers for longer and are less likely to churn.

##Feature Engineering

In [111]:
#We will use one-hot encoding to vectorize our categorical variables
#First we need to convert our DataFrame into a list of dictionaries, one per row
#(orient='records'; the older alias 'rows' is deprecated in pandas)
train_dict  = df_train[categorical+numerical].to_dict(orient='records')
#now if we look into the first element of this list
train_dict[0]


{'contract': 'two_year',
 'dependents': 'no',
 'deviceprotection': 'yes',
 'gender': 'male',
 'internetservice': 'dsl',
 'monthlycharges': 86.1,
 'multiplelines': 'no',
 'onlinebackup': 'yes',
 'onlinesecurity': 'yes',
 'paperlessbilling': 'yes',
 'partner': 'yes',
 'paymentmethod': 'bank_transfer_(automatic)',
 'phoneservice': 'yes',
 'seniorcitizen': 0,
 'streamingmovies': 'yes',
 'streamingtv': 'yes',
 'techsupport': 'yes',
 'tenure': 71,
 'totalcharges': 6045.9}

In [112]:
#Now we will use dictvectorizer to convert our df into vector
dv = DictVectorizer(sparse=False)
dv.fit(train_dict)

DictVectorizer(sparse=False)

In [113]:
#Now we will convert this dictionary to matrix by transform method
X_train = dv.transform(train_dict)
X_train[0]

array([0.0000e+00, 0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00,
       0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 1.0000e+00,
       1.0000e+00, 0.0000e+00, 0.0000e+00, 8.6100e+01, 1.0000e+00,
       0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00,
       0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 1.0000e+00,
       0.0000e+00, 1.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00,
       0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00,
       0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00,
       0.0000e+00, 0.0000e+00, 1.0000e+00, 7.1000e+01, 6.0459e+03])

In [114]:
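#all the values are 0s or 1s except three, which are our numerical variables
#(get_feature_names() is deprecated in scikit-learn in favour of get_feature_names_out())
dv.get_feature_names_out().tolist()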



['contract=month-to-month',
 'contract=one_year',
 'contract=two_year',
 'dependents=no',
 'dependents=yes',
 'deviceprotection=no',
 'deviceprotection=no_internet_service',
 'deviceprotection=yes',
 'gender=female',
 'gender=male',
 'internetservice=dsl',
 'internetservice=fiber_optic',
 'internetservice=no',
 'monthlycharges',
 'multiplelines=no',
 'multiplelines=no_phone_service',
 'multiplelines=yes',
 'onlinebackup=no',
 'onlinebackup=no_internet_service',
 'onlinebackup=yes',
 'onlinesecurity=no',
 'onlinesecurity=no_internet_service',
 'onlinesecurity=yes',
 'paperlessbilling=no',
 'paperlessbilling=yes',
 'partner=no',
 'partner=yes',
 'paymentmethod=bank_transfer_(automatic)',
 'paymentmethod=credit_card_(automatic)',
 'paymentmethod=electronic_check',
 'paymentmethod=mailed_check',
 'phoneservice=no',
 'phoneservice=yes',
 'seniorcitizen',
 'streamingmovies=no',
 'streamingmovies=no_internet_service',
 'streamingmovies=yes',
 'streamingtv=no',
 'streamingtv=no_internet_servic

##Logistic Regression

In [115]:
#we first train our model using logistic regression
model = LogisticRegression(solver="liblinear" , random_state=1)
model.fit(X_train , y_train)

LogisticRegression(random_state=1, solver='liblinear')

In [116]:
#now let us predict our target variable for the validation set using this model
dict_val = df_val[categorical+numerical].to_dict(orient='records')
X_val = dv.transform(dict_val)

  


In [117]:
y_pred = model.predict_proba(X_val)
y_pred

array([[0.76508784, 0.23491216],
       [0.73113015, 0.26886985],
       [0.68054704, 0.31945296],
       ...,
       [0.94274614, 0.05725386],
       [0.38476895, 0.61523105],
       [0.93872763, 0.06127237]])

In [118]:
#the first column gives the probability that the customer will not churn, while the second gives the probability that the customer will churn
#We only need the probability of churn, so
y_pred = y_pred[: , 1] 
y_pred

array([0.23491216, 0.26886985, 0.31945296, ..., 0.05725386, 0.61523105,
       0.06127237])

In [120]:
#we don't need probabilities here but a hard prediction of whether the customer will churn or not,
#so we set a threshold on the probability and convert it into hard predictions
churn = y_pred>=0.5

##Evaluating the model

In [121]:
#we calculate the accuracy by comparing the values in y_val and churn; the mean of the matches
#tells us the fraction of predictions that are correct
(y_val==churn).mean()

0.8016129032258065
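Accuracy is easier to judge next to a trivial baseline. The following sketch uses five made-up labels and probabilities, not the notebook's data, to show the thresholding step and a comparison against always predicting "no churn".

```python
import numpy as np

# Hypothetical labels and predicted churn probabilities for five customers.
y_true = np.array([0, 0, 1, 1, 0])
proba = np.array([0.2, 0.3, 0.7, 0.4, 0.1])

# Threshold at 0.5 to get hard predictions, as in the notebook.
hard = proba >= 0.5

# Fraction of predictions that match the labels.
accuracy = (y_true == hard).mean()

# Baseline: predict "no churn" for everyone.
baseline = (y_true == 0).mean()

print(accuracy, baseline)
```

On an imbalanced target the baseline can already be high, so the gap between the two numbers says more than the accuracy alone.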

In [122]:
#We can see that our model does reasonably well, predicting about 80% of the validation values correctly

In [124]:
#We can also see weights attached to each variable
dict(zip(dv.get_feature_names_out() , model.coef_[0].round(3)))

{'contract=month-to-month': 0.563,
 'contract=one_year': -0.086,
 'contract=two_year': -0.599,
 'dependents=no': -0.03,
 'dependents=yes': -0.092,
 'deviceprotection=no': 0.1,
 'deviceprotection=no_internet_service': -0.116,
 'deviceprotection=yes': -0.106,
 'gender=female': -0.027,
 'gender=male': -0.095,
 'internetservice=dsl': -0.323,
 'internetservice=fiber_optic': 0.317,
 'internetservice=no': -0.116,
 'monthlycharges': 0.001,
 'multiplelines=no': -0.168,
 'multiplelines=no_phone_service': 0.127,
 'multiplelines=yes': -0.081,
 'onlinebackup=no': 0.136,
 'onlinebackup=no_internet_service': -0.116,
 'onlinebackup=yes': -0.142,
 'onlinesecurity=no': 0.258,
 'onlinesecurity=no_internet_service': -0.116,
 'onlinesecurity=yes': -0.264,
 'paperlessbilling=no': -0.213,
 'paperlessbilling=yes': 0.091,
 'partner=no': -0.048,
 'partner=yes': -0.074,
 'paymentmethod=bank_transfer_(automatic)': -0.027,
 'paymentmethod=credit_card_(automatic)': -0.136,
 'paymentmethod=electronic_check': 0.175,
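These weights turn into a probability through the logistic (sigmoid) function: the model adds its intercept to the weighted sum of the features and squashes the result into the range 0 to 1. The sketch below checks this on a small made-up dataset; `X`, `y` and `clf` are illustrative names, not the notebook's variables.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical dataset with two features.
X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0],
              [4.0, 1.0], [5.0, 1.0], [6.0, 0.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(solver="liblinear", random_state=1).fit(X, y)

# Linear score: intercept + sum of weight * feature value.
score = clf.intercept_[0] + X @ clf.coef_[0]

# Sigmoid maps the score to a probability of the positive class.
manual = 1.0 / (1.0 + np.exp(-score))

print(np.allclose(manual, clf.predict_proba(X)[:, 1]))
```

A positive weight pushes the score, and therefore the churn probability, up when the feature is present; a negative weight pushes it down.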


##Real World Application

In [125]:
#Now let us assume the following are the attributes of one of our customers; we would like to know
#whether this customer will leave us soon or not
customer = {
 'customerid': '8879-zkjof',
 'gender': 'female',
 'seniorcitizen': 0,
 'partner': 'no',
 'dependents': 'no',
 'tenure': 41,
 'phoneservice': 'yes',
 'multiplelines': 'no',
 'internetservice': 'dsl',
 'onlinesecurity': 'yes',
 'onlinebackup': 'no',
 'deviceprotection': 'yes',
 'techsupport': 'yes',
 'streamingtv': 'yes',
 'streamingmovies': 'yes', 
  'contract': 'one_year',
 'paperlessbilling': 'yes',
 'paymentmethod': 'bank_transfer_(automatic)',
 'monthlycharges': 79.85,
 'totalcharges': 3320.75,
}

X_test = dv.transform([customer])
X_test

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, 1.00000e+00, 0.00000e+00,
        0.00000e+00, 0.00000e+00, 1.00000e+00, 1.00000e+00, 0.00000e+00,
        1.00000e+00, 0.00000e+00, 0.00000e+00, 7.98500e+01, 1.00000e+00,
        0.00000e+00, 0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00,
        0.00000e+00, 0.00000e+00, 1.00000e+00, 0.00000e+00, 1.00000e+00,
        1.00000e+00, 0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00,
        0.00000e+00, 0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00,
        0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00,
        0.00000e+00, 0.00000e+00, 1.00000e+00, 4.10000e+01, 3.32075e+03]])

In [129]:
#Now we will predict if this customer will churn or not
y_pred = model.predict_proba(X_test)

In [130]:
y_pred

array([[0.92667889, 0.07332111]])

In [131]:
#We only need the probability of churn, so
y_pred = y_pred[0,1]
#we also need a yes/no answer, so
print(y_pred>=0.5)

False


In [132]:
#Our customer is predicted not to churn, so we will not send promotional discounts, emails etc. to this customer

In [133]:
#Let us look into another customer
customer = {
 'gender': 'female',
 'seniorcitizen': 1,
 'partner': 'no',
 'dependents': 'no',
 'phoneservice': 'yes',
 'multiplelines': 'yes',
 'internetservice': 'fiber_optic',
 'onlinesecurity': 'no',
 'onlinebackup': 'no',
 'deviceprotection': 'no',
 'techsupport': 'no',
 'streamingtv': 'yes',
 'streamingmovies': 'no',
 'contract': 'month-to-month',
 'paperlessbilling': 'yes',
 'paymentmethod': 'electronic_check',
 'tenure': 1,
 'monthlycharges': 85.7,
 'totalcharges': 85.7
} 
X_test = dv.transform([customer])
model.predict_proba(X_test)[0, 1]

0.8321656556545182

In [134]:
#The probability of churning for this customer is 83%, so it is likely that this customer
#will stop using our services soon, and we will send emails, discount offers etc. to this customer
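The two scoring steps above can be wrapped into one small function. This is a sketch only: `predict_churn` is a hypothetical helper, and the toy vectorizer and model are trained on four made-up customers so that the block runs on its own.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def predict_churn(customer, dv, model, threshold=0.5):
    """Return (churn probability, hard decision) for one customer dict."""
    X = dv.transform([customer])
    proba = float(model.predict_proba(X)[0, 1])
    return proba, proba >= threshold

# Toy training data, for illustration only.
toy_customers = [
    {"contract": "month-to-month", "tenure": 1},
    {"contract": "month-to-month", "tenure": 3},
    {"contract": "two_year", "tenure": 60},
    {"contract": "two_year", "tenure": 70},
]
toy_labels = [1, 1, 0, 0]

toy_dv = DictVectorizer(sparse=False)
toy_model = LogisticRegression(solver="liblinear", random_state=1)
toy_model.fit(toy_dv.fit_transform(toy_customers), toy_labels)

proba, will_churn = predict_churn(
    {"contract": "month-to-month", "tenure": 2}, toy_dv, toy_model
)
print(proba, will_churn)
```

With the notebook's own objects the call would be `predict_churn(customer, dv, model)`. The threshold is left as a parameter because the cost of a wasted discount and the cost of a lost customer may not be equal.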

We trained our model using logistic regression. For this purpose we first prepared our features and converted them to vectors using DictVectorizer from sklearn. We also used risk ratios to see how strongly each value of a categorical variable is associated with churn, and mutual information scores and correlation coefficients to check the dependency between the attributes and the target variable.
A model like this helps a business know beforehand which customers are likely to leave soon, so that it can take steps to retain them.