## Problem - Customer Churn

We work for a fictitious telecommunications company.

One of the most important goals of the company is to increase customer loyalty. 

One strategy to achieve this goal is to identify customers who are likely to churn and approach them before they leave.

To do this, we'll look at historical customer data and see if we can develop a model that predicts churn based on various customer factors such as contract length, monthly payments, demographic information, etc.

This model would then help us identify customers at high risk of churn early enough for marketing to still reach them, e.g. with promotional packages.

## Import Packages

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

## Get Data

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv")

In [3]:
df

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,...,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.5,No
7039,2234-XADUH,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.9,No
7040,4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No
7041,8361-LTMKD,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Mailed check,74.40,306.6,Yes


Keep only customers on month-to-month contracts:

In [4]:
df = df.query("Contract == 'Month-to-month'")
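A quick sanity check after a filter like this can be sketched as follows. The frame `toy` is a hypothetical stand-in for the real data, not part of the notebook. Note that `query` keeps the original row labels, which is why later outputs show gaps in the index (0, 2, 4, 5, ...).

```python
import pandas as pd

# Hypothetical toy frame standing in for the real customer data
toy = pd.DataFrame({
    "customerID": ["a", "b", "c", "d"],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
})

# Same filter as in the notebook, plus the equivalent boolean-mask form
monthly = toy.query("Contract == 'Month-to-month'")
mask_version = toy[toy["Contract"] == "Month-to-month"]

remaining_contracts = monthly["Contract"].unique().tolist()
same_rows = monthly.equals(mask_version)
kept_index = monthly.index.tolist()  # original row labels are preserved

print(remaining_contracts, kept_index, same_rows)
```

If a contiguous index is preferred, `reset_index(drop=True)` could be chained after the filter; the notebook does not do this.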

## Basic EDA


Skipped in this notebook.

## Modeling

### Create feature matrix X

Select features

In [5]:
X = df.loc[:,['MonthlyCharges', 'tenure', 'Partner', 'SeniorCitizen']]

Add constant

In [6]:
X = sm.add_constant(X)  # Add a constant column to estimate the intercept parameter

In [7]:
X

Unnamed: 0,const,MonthlyCharges,tenure,Partner,SeniorCitizen
0,1.0,29.85,1,Yes,0
2,1.0,53.85,2,No,0
4,1.0,70.70,2,No,0
5,1.0,99.65,8,No,0
6,1.0,89.10,22,No,0
...,...,...,...,...,...
7033,1.0,69.50,38,No,0
7034,1.0,102.95,67,No,0
7035,1.0,78.70,19,No,0
7040,1.0,29.60,11,Yes,0


Convert Yes to 1 and No to 0

In [8]:
X['Partner'] = X['Partner'].map(dict(Yes=1, No=0))
X

Unnamed: 0,const,MonthlyCharges,tenure,Partner,SeniorCitizen
0,1.0,29.85,1,1,0
2,1.0,53.85,2,0,0
4,1.0,70.70,2,0,0
5,1.0,99.65,8,0,0
6,1.0,89.10,22,0,0
...,...,...,...,...,...
7033,1.0,69.50,38,0,0
7034,1.0,102.95,67,0,0
7035,1.0,78.70,19,0,0
7040,1.0,29.60,11,1,0
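One thing worth checking after a dictionary-based `map`: any value that is not a key in the dictionary becomes `NaN` without raising an error. The sketch below uses a hypothetical series with deliberately bad entries to show the effect and one way to detect it.

```python
import pandas as pd

# Hypothetical column with a misspelled and a missing entry
partner = pd.Series(["Yes", "No", "yes", None])

# dict(Yes=1, No=0) is the same mapping as {"Yes": 1, "No": 0}
encoded = partner.map({"Yes": 1, "No": 0})

# Values that were not keys of the mapping end up as NaN
n_unmapped = int(encoded.isna().sum())
print(encoded.tolist(), n_unmapped)
```

On the real data, a check such as `X['Partner'].isna().sum()` after the mapping would confirm that every value was either `Yes` or `No`.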


### Create target vector y

In [9]:
y = df['Churn']

In [10]:
y.value_counts()

Churn
No     2220
Yes    1655
Name: count, dtype: int64

In [11]:
y = y.map(dict(Yes=1, No=0))

In [12]:
y.value_counts()

Churn
0    2220
1    1655
Name: count, dtype: int64

### Fit Model

In [13]:
model = sm.Logit(y, X)

In [14]:
res = model.fit()

Optimization terminated successfully.
         Current function value: 0.606712
         Iterations 5
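For reference, the model being fitted here can be written out as follows. The symbols $\beta_0, \dots, \beta_4$ and $p_i$ are notation chosen for this note, not names used by the notebook.

$$
p_i = P(\text{Churn}_i = 1 \mid x_i) = \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\text{MonthlyCharges}_i + \beta_2\,\text{tenure}_i + \beta_3\,\text{Partner}_i + \beta_4\,\text{SeniorCitizen}_i)}}
$$

Maximum likelihood estimation picks the coefficients that maximize the log-likelihood

$$
\ell(\beta) = \sum_{i=1}^{n} \bigl[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\bigr]
$$

The "Current function value" printed by the optimizer is the average negative log-likelihood, $-\ell(\beta)/n$. With $n = 3875$ customers, $0.606712 \times 3875 \approx 2351.0$.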


In [15]:
print(res.summary())

                           Logit Regression Results                           
Dep. Variable:                  Churn   No. Observations:                 3875
Model:                          Logit   Df Residuals:                     3870
Method:                           MLE   Df Model:                            4
Date:                Wed, 27 Mar 2024   Pseudo R-squ.:                  0.1110
Time:                        20:15:41   Log-Likelihood:                -2351.0
converged:                       True   LL-Null:                       -2644.6
Covariance Type:            nonrobust   LLR p-value:                9.152e-126
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
const             -1.4356      0.100    -14.359      0.000      -1.632      -1.240
MonthlyCharges     0.0265      0.002     17.281      0.000       0.023       0.029
tenure            -0.0431      0.002
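Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to read. The sketch below uses only the two slope coefficients visible in the output above; the remaining rows are cut off there and are left out rather than guessed.

```python
import numpy as np

# Coefficients copied from the visible part of the summary output
coefs = {"MonthlyCharges": 0.0265, "tenure": -0.0431}

# Odds ratio for a one-unit increase in each feature
odds_ratios = {name: float(np.exp(beta)) for name, beta in coefs.items()}

# Odds ratio for a ten-unit increase (e.g. ten more months of tenure)
odds_ratios_10 = {name: float(np.exp(10 * beta)) for name, beta in coefs.items()}

for name in coefs:
    print(f"{name}: x{odds_ratios[name]:.3f} per unit, x{odds_ratios_10[name]:.3f} per 10 units")
```

Read this way, each additional unit of `MonthlyCharges` multiplies the odds of churn by a factor slightly above 1, while each additional month of `tenure` multiplies them by a factor slightly below 1, holding the other features fixed.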

## Use Model

Predict the probability of churn (class `1`) for each customer:



In [16]:
res.predict(X)

0       0.325235
2       0.475677
4       0.586221
5       0.701700
6       0.493134
          ...   
7033    0.225133
7034    0.167721
7035    0.456809
7040    0.237268
7041    0.706326
Length: 3875, dtype: float64
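These probabilities are the logistic function applied to the linear combination of features and coefficients. As a rough cross-check, the sketch below recomputes them by hand for rows 2, 4, 5 and 6, using the rounded coefficients from the summary. Those four customers have `Partner = 0` and `SeniorCitizen = 0`, so the two coefficients that are cut off in the summary do not enter the calculation.

```python
import numpy as np

# Rounded coefficients as printed in the summary output
b_const, b_monthly, b_tenure = -1.4356, 0.0265, -0.0431

# Feature values of rows 2, 4, 5 and 6 of X (Partner = 0, SeniorCitizen = 0)
monthly_charges = np.array([53.85, 70.70, 99.65, 89.10])
tenure = np.array([2, 2, 8, 22])

# Linear predictor (log-odds), then the logistic function
linear_predictor = b_const + b_monthly * monthly_charges + b_tenure * tenure
probs = 1.0 / (1.0 + np.exp(-linear_predictor))

# Values reported by res.predict(X) for the same rows
reported = np.array([0.475677, 0.586221, 0.701700, 0.493134])
max_gap = float(np.max(np.abs(probs - reported)))

print(probs.round(3), max_gap)
```

The hand-computed values agree with the reported ones up to small differences in the third decimal place, which is what one would expect from using coefficients rounded to four digits.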

Assign a class using a decision threshold of `0.5`:

In [17]:
y_hat = (res.predict(X) >= 0.5).astype(int)

In [18]:
y_hat

0       0
2       0
4       1
5       1
6       0
       ..
7033    0
7034    0
7035    0
7040    0
7041    1
Length: 3875, dtype: int32
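A natural next step is to compare the predicted classes with the actual labels. The sketch below shows one way to do that with a confusion matrix and accuracy; `y_true` and `p_hat` are small hypothetical series, not the notebook's data. On the real data the equivalent calls would be `pd.crosstab(y, y_hat)` and `(y == y_hat).mean()`.

```python
import pandas as pd

# Hypothetical labels and predicted probabilities for six customers
y_true = pd.Series([0, 1, 1, 0, 1, 0], name="actual")
p_hat = pd.Series([0.2, 0.7, 0.4, 0.6, 0.9, 0.1])

# Same thresholding rule as in the notebook
y_pred = (p_hat >= 0.5).astype(int).rename("predicted")

# Rows are actual classes, columns are predicted classes
confusion = pd.crosstab(y_true, y_pred)
accuracy = float((y_true == y_pred).mean())

print(confusion)
print(f"accuracy: {accuracy:.3f}")
```

Two caveats when reading such numbers for this model: always predicting "no churn" would already be right for 2220 of the 3875 customers (about 57%), and predictions made on the same rows used for fitting tend to look better than predictions on new customers, so a held-out test set would give a fairer picture.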