# Lab: Logistic Regression Analysis

For this lab, we will use the Customer-Churn.csv dataset. You can find a copy of the dataset in the GitHub folder. This dataset includes variables related to customer characteristics, as well as a variable indicating whether or not each customer churned. As discussed in class, the goal of this exercise is to predict whether or not a customer will churn.

In [1]:
# Import any libraries you may need & the data
import pandas as pd
import numpy as np
import statsmodels.api as sm

churn = pd.read_csv('Customer-Churn.csv')
churn.head()

,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [2]:
churn.columns

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

In order to use 'Churn' as a target variable, we need to encode it as 0/1 (or True/False) instead of 'Yes'/'No'. Use np.where to create a variable called y, which has the value 1 (or True) whenever 'Churn' is 'Yes', and 0 (or False) otherwise.

In [3]:
# Your code here
y = np.where(churn['Churn']=='Yes', 1, 0)

First, we would like to use 'tenure' as an explanatory variable. Declare this as your variable X, add a constant term, and run a logistic regression of 'Churn' on 'tenure'. Interpret the values of the model.

In [4]:
X = churn[['tenure']]
X = sm.add_constant(X)
model = sm.Logit(y, X).fit()
model.summary()

Optimization terminated successfully.
         Current function value: 0.510569
         Iterations 6


Dep. Variable:,y,No. Observations:,7043.0
Model:,Logit,Df Residuals:,7041.0
Method:,MLE,Df Model:,1.0
Date:,"Thu, 12 Nov 2020",Pseudo R-squ.:,0.1176
Time:,18:14:13,Log-Likelihood:,-3595.9
converged:,True,LL-Null:,-4075.1
Covariance Type:,nonrobust,LLR p-value:,2.106e-210

,coef,std err,z,P>|z|,[0.025,0.975]
const,0.0273,0.042,0.647,0.518,-0.055,0.110
tenure,-0.0388,0.001,-27.586,0.000,-0.042,-0.036
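
One way to interpret these values is to convert the coefficients into an odds ratio and into predicted probabilities. The sketch below does this by hand, using the coefficients copied from the table above; the helper name `churn_probability` and the tenure values 0 and 24 are illustrative choices, not part of the lab.

```python
import math

# Coefficients copied from the summary table above (tenure-only model).
const, b_tenure = 0.0273, -0.0388

def churn_probability(tenure_months):
    """Predicted churn probability under the fitted logit model (hypothetical helper)."""
    log_odds = const + b_tenure * tenure_months
    return 1.0 / (1.0 + math.exp(-log_odds))

# exp(coef) is the multiplicative change in the odds of churning per extra month.
odds_ratio = math.exp(b_tenure)
p_new = churn_probability(0)         # brand-new customer
p_two_years = churn_probability(24)  # customer with 24 months of tenure

print(f"odds ratio per month of tenure: {odds_ratio:.3f}")
print(f"P(churn | tenure = 0):  {p_new:.3f}")
print(f"P(churn | tenure = 24): {p_two_years:.3f}")
```

An odds ratio below 1 means that each additional month of tenure lowers the odds of churning, which is consistent with the negative sign of the coefficient.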


Next, we would like to add the variable 'SeniorCitizen' to the model. Run a logistic regression of 'Churn' on 'tenure' and 'SeniorCitizen'. Interpret the values of the model.

In [5]:
X = churn[['tenure', 'SeniorCitizen']]
X = sm.add_constant(X)
model = sm.Logit(y, X).fit()
model.summary()

Optimization terminated successfully.
         Current function value: 0.496871
         Iterations 6


Dep. Variable:,y,No. Observations:,7043.0
Model:,Logit,Df Residuals:,7040.0
Method:,MLE,Df Model:,2.0
Date:,"Thu, 12 Nov 2020",Pseudo R-squ.:,0.1413
Time:,18:14:13,Log-Likelihood:,-3499.5
converged:,True,LL-Null:,-4075.1
Covariance Type:,nonrobust,LLR p-value:,1.038e-250

,coef,std err,z,P>|z|,[0.025,0.975]
const,-0.1232,0.044,-2.801,0.005,-0.209,-0.037
tenure,-0.0405,0.001,-27.981,0.000,-0.043,-0.038
SeniorCitizen,1.0465,0.075,13.964,0.000,0.900,1.193
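
For a 0/1 variable such as 'SeniorCitizen', exponentiating the coefficient gives the odds ratio between the two groups, holding tenure fixed. The following is a minimal sketch that uses the coefficient and confidence bounds copied from the table above; the variable names are illustrative.

```python
import math

# Values copied from the SeniorCitizen row of the summary table above.
coef, ci_low, ci_high = 1.0465, 0.900, 1.193

# Exponentiating a logit coefficient (and its confidence bounds) gives an odds ratio.
odds_ratio = math.exp(coef)
odds_ratio_ci = (math.exp(ci_low), math.exp(ci_high))

print(f"odds ratio, senior vs. non-senior: {odds_ratio:.2f}")
print(f"95% interval: ({odds_ratio_ci[0]:.2f}, {odds_ratio_ci[1]:.2f})")
```

Because the whole interval lies above 1, the model suggests that senior citizens have higher odds of churning than non-seniors with the same tenure.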


Finally, we would like to add the variable 'Contract' to the model. Please inspect the possible values for 'Contract'. What type of variable is it?

In [6]:
churn['Contract'].value_counts()

Month-to-month    3875
Two year          1695
One year          1473
Name: Contract, dtype: int64

Please convert Contract to dummy variables, and add it to the matrix of explanatory variables. Then run a logistic regression of 'Churn' on 'tenure', 'SeniorCitizen' and 'Contract'. Interpret the values of the model.

In [7]:
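# Build the design matrix from scratch so this cell does not depend on earlier cells.
# drop_first=True makes 'Month-to-month' the reference category; dtype=float avoids
# the boolean dummy columns that newer pandas versions return by default.
contract_dummies = pd.get_dummies(churn['Contract'], drop_first=True, dtype=float)
X = pd.concat([churn[['tenure', 'SeniorCitizen']], contract_dummies], axis=1)
X = sm.add_constant(X)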
model = sm.Logit(y, X).fit()
model.summary()

Optimization terminated successfully.
         Current function value: 0.466523
         Iterations 8


Dep. Variable:,y,No. Observations:,7043.0
Model:,Logit,Df Residuals:,7038.0
Method:,MLE,Df Model:,4.0
Date:,"Thu, 12 Nov 2020",Pseudo R-squ.:,0.1937
Time:,18:14:13,Log-Likelihood:,-3285.7
converged:,True,LL-Null:,-4075.1
Covariance Type:,nonrobust,LLR p-value:,0.0

,coef,std err,z,P>|z|,[0.025,0.975]
const,-0.1061,0.045,-2.344,0.019,-0.195,-0.017
tenure,-0.0197,0.002,-11.157,0.000,-0.023,-0.016
SeniorCitizen,0.7413,0.077,9.682,0.000,0.591,0.891
One year,-1.2895,0.097,-13.328,0.000,-1.479,-1.100
Two year,-2.4539,0.162,-15.130,0.000,-2.772,-2.136
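
To check whether adding 'Contract' improves on the previous model, one option is a likelihood ratio test between the two nested models. The sketch below uses the log-likelihoods reported in the summaries above. It relies on the fact that the chi-square survival function with exactly 2 degrees of freedom equals exp(-x/2), so it does not generalize to other degrees of freedom. It also converts the contract coefficients into odds ratios relative to the 'Month-to-month' baseline.

```python
import math

# Log-likelihoods copied from the model summaries above.
ll_restricted = -3499.5  # tenure + SeniorCitizen
ll_full = -3285.7        # tenure + SeniorCitizen + Contract dummies
extra_params = 2         # 'One year' and 'Two year'

# Likelihood ratio statistic, chi-square distributed under the null hypothesis.
lr_stat = 2.0 * (ll_full - ll_restricted)

# Exact only for 2 degrees of freedom: P(chi2_2 > x) = exp(-x / 2).
p_value = math.exp(-lr_stat / 2.0)

# Odds ratios relative to the dropped 'Month-to-month' category.
contract_odds_ratios = {
    "One year": math.exp(-1.2895),
    "Two year": math.exp(-2.4539),
}

print(f"LR statistic: {lr_stat:.1f} on {extra_params} df, p-value: {p_value:.3g}")
for level, ratio in contract_odds_ratios.items():
    print(f"{level}: odds ratio {ratio:.3f} vs. Month-to-month")
```

A very small p-value indicates that the contract dummies jointly add explanatory power, and odds ratios well below 1 indicate that longer contracts are associated with lower odds of churning.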


## Bonus Challenge: Feature Selection

Use the above data set on customer churn, and try including and excluding different variables to build the best model. Which criteria can you use for deciding whether a variable is helpful for predicting whether a customer will churn or not?
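
Information criteria such as AIC and BIC are one possible answer: they reward a higher log-likelihood but penalize extra parameters, and lower values are better. The sketch below computes both by hand for the three models fitted above, using the reported log-likelihoods and counting the constant as a parameter. The model labels and helper names are illustrative.

```python
import math

n_obs = 7043  # number of observations reported in the summaries

# (log-likelihood, number of estimated parameters including the constant)
models = {
    "tenure": (-3595.9, 2),
    "tenure + SeniorCitizen": (-3499.5, 3),
    "tenure + SeniorCitizen + Contract": (-3285.7, 5),
}

def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: k * ln(n) - 2 * log-likelihood."""
    return k * math.log(n) - 2 * log_lik

scores = {name: (aic(ll, k), bic(ll, k, n_obs)) for name, (ll, k) in models.items()}
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])

for name, (a, b) in scores.items():
    print(f"{name}: AIC = {a:.1f}, BIC = {b:.1f}")
print("lowest AIC:", best_by_aic)
print("lowest BIC:", best_by_bic)
```

With a fitted statsmodels result, the same quantities should be available directly as `model.aic` and `model.bic`. Other criteria worth considering are the p-values of individual coefficients, the pseudo R-squared, and predictive accuracy on held-out data.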

In [8]:
# Your code here. 

# https://www.displayr.com/how-to-interpret-logistic-regression-coefficients/