# Telco Customer Churn Prediction

## Initial data preparation

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns


from matplotlib import pyplot as plt
%matplotlib inline

Importing the dataset.

In [2]:
df = pd.read_csv('Telco-Customer-Churn.csv')
df.head()


In [None]:
df.head().T

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.describe(include='O').T

In [None]:
df.TotalCharges.describe()

In [None]:
df[df['TotalCharges']==' ']

Two problems:
 - The `TotalCharges` column is stored as an object (string) column.
 - The column has 11 values that consist of a single space.

One solution:
 - Convert the column to numeric and turn the spaces (non-numeric values) into null values.

In [None]:
# Let's transform the TotalCharges column
df.TotalCharges = pd.to_numeric(df.TotalCharges, errors='coerce')
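
To see what `errors='coerce'` does in isolation, here is a minimal sketch on a toy series. The values are made up for illustration and only mimic the shape of the `TotalCharges` column.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column: numeric strings plus one blank entry
raw = pd.Series(['29.85', '1889.5', ' ', '108.15'])

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising
converted = pd.to_numeric(raw, errors='coerce')

print(converted.dtype)         # float64
print(converted.isna().sum())  # 1
```

The blank entry becomes a missing value, which is why `isnull().sum()` reports nulls in this column only after the conversion.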

## Data cleaning
### The missing values

In [None]:
df.isnull().sum()


In [None]:
df['TotalCharges'].describe()

The mean is much higher than the median, which suggests that the TotalCharges column is right-skewed.
Let's visualize the distribution.

In [None]:
df['TotalCharges'].hist()
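
As a numeric check alongside the histogram, the sample skewness can be computed with `Series.skew()`. The sketch below uses a small made-up series rather than the real column, so only the sign of the result matters.

```python
import pandas as pd

# Toy right-skewed sample (values are illustrative, not taken from the dataset)
s = pd.Series([20, 25, 30, 35, 40, 45, 50, 400, 900])

print(s.mean() > s.median())  # True: the mean is pulled up by the large values
print(s.skew() > 0)           # True: positive skewness indicates a right-skewed distribution
```

When a column is skewed like this, the median is usually a more robust fill value than the mean, which is the choice made in the next cell.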

In [None]:
# filling the missing values with the median
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())
df.isnull().sum()

In [None]:
df['TotalCharges'].describe()

`SeniorCitizen` is detected as int64, although it is a categorical flag (0 or 1).
We convert it to a string so that it is treated as a categorical column.

In [None]:
#df['SeniorCitizen'] = df['SeniorCitizen'].astype('object')
df['SeniorCitizen'] = df['SeniorCitizen'].astype(str)

The column names don't follow the same naming convention. Let's make them uniform by lowercasing everything and replacing spaces with underscores.

In [None]:
df.columns = df.columns.str.lower().str.replace(' ', '_')

In [None]:
string_columns = list(df.dtypes[df.dtypes == 'object'].index)

In [None]:
print(df[string_columns].dtypes)


Correcting the values in the columns (turning into lower case and replacing spaces with "_").

In [None]:
for col in string_columns:
    df[col] = df[col].str.lower().str.replace(' ', '_')

Our target variable `churn` is categorical, with two values: `yes` and `no`. For binary classification, models typically expect `0` (no) and `1` (yes).
Let's turn `churn` into a numerical column.

In [None]:
df['churn'].head()

In [None]:
# we perform casting by using the astype(int) function
(df.churn == 'yes').astype(int).head()

In [None]:
# Let's apply the conversion to numbers
df.churn = (df.churn == 'yes').astype(int)

In [None]:
#let's count the values in the churn column:
df.churn.value_counts()

We can clearly see that there is an imbalance between the number of customers who left the company and the number of customers that didn't leave. We can even provide the proportions of both customer categories using normalize.

In [None]:
# the churn rate
df.churn.value_counts(normalize=True)

## Data splitting

In [None]:
from sklearn.model_selection import train_test_split
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_train_full, test_size=0.33,random_state=11)

#We take the column with the target variable,churn, and save it outside the data-frame
y_train = df_train.churn.values
y_val = df_val.churn.values 

# we delete the churn columns from both data-frames to make sure we don’t accidentally use the churn variable as a feature during training
del df_train['churn']
del df_val['churn']
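
The two successive splits can be hard to read at a glance. The short calculation below works out the share of the full dataset that ends up in each part. It is only arithmetic on the `test_size` values used above; the actual row counts differ slightly because scikit-learn rounds to whole rows.

```python
# Fractions implied by the two successive splits (test_size=0.2, then test_size=0.33)
test_frac = 0.2
val_frac = (1 - test_frac) * 0.33
train_frac = (1 - test_frac) * (1 - 0.33)

print(round(train_frac, 3), round(val_frac, 3), round(test_frac, 3))
# 0.536 0.264 0.2
```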

## Exploratory Data Analysis (EDA)

In [None]:
# Let us see our categorical variables
df_train_full.select_dtypes(include=['object']).columns

In [None]:
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents',
'phoneservice', 'multiplelines', 'internetservice',
'onlinesecurity', 'onlinebackup', 'deviceprotection',
'techsupport', 'streamingtv', 'streamingmovies',
'contract', 'paperlessbilling', 'paymentmethod']

In [None]:
numerical = ['tenure', 'monthlycharges', 'totalcharges']

In [None]:
# Let's check the number of categories in each categorical feature
df_train_full[categorical].nunique()

## Feature Importance
Here we check which of our features have the strongest impact on the churn variable, in other words which customer features best explain churn behavior.

In [None]:
# We need our global mean (the mean of our target variable)
global_mean = df_train_full.churn.mean()
round(global_mean,3)

For a given categorical feature, we check whether the churn rate within each category differs from the global mean.

In [None]:
# Let's check for the gender variable
churn_gender = df_train_full.groupby('gender').churn.mean()
round(churn_gender,3)

We can see that gender (male or female) does not appear to affect the churn behavior.

In [None]:
# Let's check for the partner variable
churn_partner = df_train_full.groupby('partner').churn.mean()
round(churn_partner,3)

There is some variation in the churn behavior depending on whether the customer lives with a partner or not.

For more accurate conclusions, we will check the risk ratios for both the gender and partner variables.

#### Risk Ratio

risk = group rate / global rate

- Risk close to 1 = the category has no impact on the churn behavior
- Risk lower than 1 = the customers in this category are less likely to churn
- Risk higher than 1 = the customers in this category are more likely to churn
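
The definition above can be sketched on a toy example. The data frame `toy` is hypothetical and exists only to make the arithmetic easy to follow.

```python
import pandas as pd

# Toy data (hypothetical): 10 customers, 4 of whom churned
toy = pd.DataFrame({
    'partner': ['no'] * 5 + ['yes'] * 5,
    'churn':   [1, 1, 1, 0, 0,   1, 0, 0, 0, 0],
})

global_rate = toy.churn.mean()                    # 0.4
group_rate = toy.groupby('partner').churn.mean()  # no: 0.6, yes: 0.2
risk = group_rate / global_rate                   # no: 1.5, yes: 0.5

print(risk.round(2))
```

In this toy data, customers without a partner churn 1.5 times as often as the average customer, while customers with a partner churn half as often.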

In [None]:
# For gender
gender_risk = churn_gender/global_mean
round(gender_risk,3)

Since the ratios are close to 1, we can say that gender does not significantly impact the churn behavior.

In [None]:
# For partner
partner_risk = churn_partner/global_mean
round(partner_risk,3)

Not having a partner puts the customer at higher risk of leaving the company.

Now let's get the risk ratios for all our categorical variables.

In [None]:
from IPython.display import display
for feature in categorical:
    df_group = df_train_full.groupby(by=feature).churn.agg(['mean'])
    df_group['diff'] = df_group['mean'] - global_mean
    df_group['risk'] = df_group['mean']/global_mean
    display(df_group)

Let's visualize these tables

In [None]:
for feature in ['contract']:  # plot a single feature; iterate over `categorical` to plot all of them
    _=sns.countplot(x= feature, hue = 'churn', data=df)
    plt.show()

In [None]:
for feature in categorical:
    df_group = df_train_full.groupby(by=feature).churn.agg(['mean']).reset_index()  # same data as global_mean
    df_group['hue'] = df_group[feature]  # Assign the feature values to hue
    graph = sns.barplot(x=feature, y='mean', hue='hue', data=df_group, palette='Blues', dodge=False)
    graph.axhline(global_mean, linewidth=3, color='b')
    plt.text(0, global_mean - 0.03, "global_mean", color='black', weight='semibold')
    plt.legend([],[], frameon=False)  # Removes unnecessary legend
    plt.show()


## Mutual information 
The metrics of importance can help us determine what are the most important features. We can measure the degree of dependency between a categorical variable and the target variable. The higher the degree of dependency, the more useful a feature is.

For categorical variables, the `mutual information metric` tells us how much information we learn about one variable if we learn the value of the other variable.

Mutual information is already implemented in scikit-learn as the `mutual_info_score` function in the `sklearn.metrics` module, so we can use it directly:

In [None]:
from sklearn.metrics import mutual_info_score

def calculate_mi(series):
    return mutual_info_score(series, df_train_full.churn)

df_mi = df_train_full[categorical].apply(calculate_mi)
df_mi = df_mi.sort_values(ascending=False).to_frame(name='MI')
df_mi
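
To build some intuition for the scale of these scores, the sketch below compares two extreme cases on made-up data: a feature that fully determines the target and a feature that is independent of it. The variable names are illustrative only.

```python
from sklearn.metrics import mutual_info_score

target      = [0, 0, 1, 1]
informative = ['a', 'a', 'b', 'b']   # determines the target completely
useless     = ['a', 'b', 'a', 'b']   # carries no information about the target

mi_informative = mutual_info_score(informative, target)
mi_useless = mutual_info_score(useless, target)

print(round(mi_informative, 3))  # 0.693, which is ln(2)
print(round(mi_useless, 3))      # 0.0
```

Scores are measured in nats and are only meaningful relative to each other, which is why the table above is sorted rather than compared to a fixed cutoff.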

According to the mutual information score, the most useful features are: [contract, onlinesecurity, techsupport, internetservice, onlinebackup]


The least useful features are: [gender, seniorcitizen, multiplelines, phoneservice]

## Correlation coefficient

We can measure the dependency between our binary target variable and our numerical variables using the correlation coefficient.

- Positive correlation: when the feature values are high, we see more ones than zeros in the target variable. When the values are low, zeros become more frequent.
- Zero correlation: there is no relationship between the target variable and the feature.
- Negative correlation: when the feature values are high, we see more zeros than ones in the target variable. When the values are low, we see more ones.
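
The sign conventions above can be illustrated on a toy data frame. The numbers are hypothetical and chosen so that churners have short tenure and high monthly charges.

```python
import pandas as pd

# Toy data (hypothetical): churners have short tenure and high monthly charges
toy = pd.DataFrame({
    'tenure':         [1, 2, 3, 24, 36, 48],
    'monthlycharges': [90, 85, 80, 30, 25, 20],
    'churn':          [1, 1, 1, 0, 0, 0],
})

corr = toy[['tenure', 'monthlycharges']].corrwith(toy.churn)
print(corr)  # tenure is negative, monthlycharges is positive
```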

In [None]:
corr_table = df_train_full[numerical].corrwith(df_train_full.churn)
print(corr_table)

`tenure` and `totalcharges` are negatively correlated with the churn variable: the higher these variables are, the more often churn takes the value zero.

##### Interpretation : 
- The longer a customer stays with the company the less risk there is of them churning.
- `totalcharges` partially indicates how long the customer has stayed with the company: the longer people stay, the more they have paid in total, so they are less likely to leave.

--

`monthlycharges` is positively correlated with `churn`: the higher the `monthlycharges` variable, the more often the churn variable takes the value one.
##### Interpretation:
- Customers who pay more on a monthly basis tend to leave more often. 

Let's visualize this:

In [None]:
t1 =df_train_full[df_train_full['tenure'] <= 2].churn.mean()
t1

In [None]:
t2 = df_train_full[(df_train_full.tenure >= 3) & (df_train_full.tenure <= 12)].churn.mean()
t2 

In [None]:
t3 = df_train_full[df_train_full['tenure'] > 12].churn.mean()  # strictly above 12, so the groups do not overlap
t3

In [None]:
# Calculating churn means for the specified tenure groups
data = pd.DataFrame({
    'Tenure Group': ['1-2', '3-12', '+12'],
    'Churn Rate': [t1, t2, t3]
})

# Plotting the barplot with hue
sns.barplot(data=data, x='Tenure Group', y='Churn Rate', hue='Tenure Group', palette='Blues', dodge=False)
plt.xlabel('Tenure Groups')
plt.ylabel('Churn Rate')
plt.title('Tenure vs. churn (correlation –0.35)')
plt.legend([],[], frameon=False)  # Removes legend if not needed
plt.show()

In [None]:
mc1 =df_train_full[df_train_full['monthlycharges'] <= 20].churn.mean()
mc1

In [None]:
mc2 = df_train_full[(df_train_full.monthlycharges > 20) & (df_train_full.monthlycharges <= 50)].churn.mean()  # > 20 so that values between 20 and 21 are not dropped
mc2

In [None]:
mc3 = df_train_full[df_train_full['monthlycharges'] > 50].churn.mean()
mc3

In [None]:
sns.barplot(x =['0-20', '21-50', '+50'], y =[mc1,mc2,mc3]);
plt.title('Monthly charges vs. churn (correlation 0.19)');
plt.xlabel('Monthly Charges');
plt.ylabel('Churn Rate');

## Feature engineering
Transforming all categorical variables to numeric features.

#### One-hot encoding for categorical variables
The `contract` variable takes three values: monthly, yearly and two-year. A customer with a yearly contract is represented by (0, 1, 0):
 - the yearly value is active (hot) => 1
 - the remaining values are not active (cold) => 0

DictVectorizer takes in a dictionary and vectorizes it. To be able to encode our categorical variables using DictVectorizer, we first need to turn our data-frame into a list of dictionaries.

In [None]:
train_dict = df_train[categorical + numerical].to_dict(orient='records')
train_dict[0]

Now we create our matrix.

In [None]:
# Adjusting NumPy print options to disable scientific notation
np.set_printoptions(suppress=True, precision=4)

In [None]:
#we first fit our vectorizer 
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)  # sparse=False returns a dense NumPy array instead of a sparse matrix
dv.fit(train_dict)

#we apply it to our training set
X_train = dv.transform(train_dict)
X_train[0]

In [None]:
dv.get_feature_names_out()

## Classification

Using logistic regression, we want to predict the probability that a customer i will churn (yi=1).
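
As a rough sketch of what happens inside the model, logistic regression computes a linear score from the features and passes it through the sigmoid function to obtain a probability. The bias `w0`, the weights `w` and the feature vector `x` below are made up for illustration and are not the fitted values.

```python
import numpy as np

def sigmoid(z):
    """Map a raw score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical bias, weights and feature vector, for illustration only
w0 = -1.0
w = np.array([0.8, -0.05])
x = np.array([1.0, 12.0])   # e.g. a one-hot flag and a tenure of 12 months

score = w0 + x.dot(w)       # -1.0 + 0.8 - 0.6 = -0.8
print(sigmoid(score))       # churn probability under these made-up weights
```

A score of zero corresponds to a probability of exactly 0.5, so negative scores map to probabilities below 0.5 and positive scores to probabilities above it.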

### MODEL 1 : 

#### Training logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='liblinear', random_state=1)
model.fit(X_train, y_train)

Let's see how the model performs on the validation set. We need to first apply the encoding to the validation set.

In [None]:
val_dict = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dict)

In [None]:
X_val[0]

In [None]:
# We run the model on the validation set

y_pred = model.predict_proba(X_val)[:, 1]

# we take only the second column of the 2D array because it contains the probability that the target is positive (churn)

In [None]:
# We get a one-dimensional NumPy array 
y_pred[0:5]

We need binary values of True (churn, so send promotional message) or False (not churn, so don’t send the message). To do this we fix a probability threshold, and assign True to the values above the threshold and False to the values below it.

In [None]:
# we automatically get the binary values using :
churn = y_pred >= 0.5

In [None]:
churn[10:14],y_pred[10:14]

We need to evaluate the quality of our predictions. One very simple way to do this is to compare the actual values to the predicted ones.

In [None]:
# Accuracy
(y_val == churn).mean()

This means that the model's predictions matched the actual values in about 80% of cases.
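
Because the classes are imbalanced, accuracy is easier to judge against a trivial baseline that always predicts "no churn". The sketch below uses made-up labels, not the real validation set, to show that such a baseline scores one minus the churn rate.

```python
import numpy as np

# Toy validation labels (hypothetical): 3 churners out of 10 customers
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# A "model" that always predicts "no churn"
always_no = np.zeros_like(y_true)

baseline_accuracy = (y_true == always_no).mean()
print(baseline_accuracy)  # 0.7, i.e. 1 minus the churn rate
```

Applying the same idea to `y_val` would give the figure that the logistic regression accuracy should be compared against.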

In [None]:
#the bias term
model.intercept_[0]

In [None]:
#the coefficients
model.coef_[0]

In [None]:
dict(zip(dv.get_feature_names_out(), model.coef_[0].round(3)))

#### Testing set

In [None]:
test_dict = df_test[categorical + numerical].to_dict(orient='records')
X_test = dv.transform(test_dict)

In [None]:
y_test_pred = model.predict_proba(X_test)[:, 1]

In [None]:
y_test_pred[0:5]

In [None]:
churn = (y_test_pred >= 0.5)

In [None]:
churn,y_test_pred

In [None]:
y_test = df_test.churn.values

In [None]:
(y_test == churn).mean()

In [None]:
print('LogisticRegression Training Accuracy: ', round(model.score(X_train, y_train), 3))
print('LogisticRegression Validation Accuracy: ', round(model.score(X_val, y_val), 3))
print('LogisticRegression Testing Accuracy: ', round(model.score(X_test, y_test), 3))

### MODEL 2 : Logistic regression using the most important features only

In [None]:
important_fea = df_mi.head().index.to_list()
important_fea

In [None]:
train_dict_impo = df_train[important_fea].to_dict(orient='records')
dv_important = DictVectorizer(sparse=False)
dv_important.fit(train_dict_impo)

X_impo_train = dv_important.transform(train_dict_impo)

In [None]:
dv_important.get_feature_names_out()

In [None]:
model_impo = LogisticRegression(solver='liblinear', random_state=1)
model_impo.fit(X_impo_train, y_train)

In [None]:
model_impo.intercept_[0]

In [None]:
dict(zip(dv_important.get_feature_names_out(), model_impo.coef_[0].round(3)))

In [None]:
val_dict_impo = df_val[important_fea].to_dict(orient='records')
X_val_impo = dv_important.transform(val_dict_impo)

In [None]:
y_val_pred_impo =model_impo.predict_proba(X_val_impo)[:, 1]

-------------------------------------------------------------------------------------

Clients with two-year contracts are more likely to stay with the company than clients with one-year contracts. Clients with month-to-month contracts are very prone to churning (a positive weight of 0.846).
The coefficients confirm the feature importance analysis we did above.

Our numerical features have low coefficients. The weight for `tenure` (-0.094) is negative, which means that the longer a client stays with the company, the less likely they are to churn. This is consistent with the correlation of -0.35 between `tenure` and `churn`.
Total charges are insignificant, with a weight of 0.
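
One common way to read logistic regression weights is to exponentiate them, which gives the multiplicative change in the odds of churning when the feature increases by one unit and all other features are held fixed. The sketch below applies this to the two weights quoted above. It is an interpretation aid only and assumes the features are on the scale used for training.

```python
import math

# Weights quoted in the text above
w_month_to_month = 0.846
w_tenure = -0.094

print(round(math.exp(w_month_to_month), 2))  # 2.33
print(round(math.exp(w_tenure), 2))          # 0.91
```

Under this reading, a month-to-month contract multiplies the odds of churning by roughly 2.3, and each additional month of tenure multiplies them by roughly 0.91.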



-----------------------------------------------------------------------------------

## Using the model
Now we can apply the model to customers for scoring them.

In [None]:
# First, we take a customer we want to score and put all the variable values in a dictionary:
customer = {
'customerid': '8879-zkjof',
'gender': 'female',
'seniorcitizen': '0',  # string, because the column was cast to str before training
'partner': 'no',
'dependents': 'no',
'tenure': 41,
'phoneservice': 'yes',
'multiplelines': 'no',
'internetservice': 'dsl',
'onlinesecurity': 'yes',
'onlinebackup': 'no',
'deviceprotection': 'yes',
'techsupport': 'yes',
'streamingtv': 'yes',
'streamingmovies': 'yes',
'contract': 'one_year',
'paperlessbilling': 'yes',
'paymentmethod': 'bank_transfer_(automatic)',
'monthlycharges': 79.85,
'totalcharges': 3320.75,
}

Now we can use our model to see whether this customer is going to churn.

In [None]:
# we convert this dictionary to a matrix by using the DictVectorizer:
X_test = dv.transform([customer])

In [None]:
X_test

In [None]:
#Now we take this matrix and put it into the trained model:
model.predict_proba(X_test)

The probability of churning for this customer is in the first row, second column.

In [None]:
model.predict_proba(X_test)[0, 1]

The probability that this customer will churn is only 19% (less than 50%). So there is no need to send promotional emails to this customer.

In [None]:
# We can try to score another client:
customer = {
'gender': 'female',
'seniorcitizen': '1',  # string, because the column was cast to str before training
'partner': 'no',
'dependents': 'no',
'phoneservice': 'yes',
'multiplelines': 'yes',
'internetservice': 'fiber_optic',
'onlinesecurity': 'no',
'onlinebackup': 'no',
'deviceprotection': 'no',
'techsupport': 'no',
'streamingtv': 'yes',
'streamingmovies': 'no',
'contract': 'month-to-month',
'paperlessbilling': 'yes',
'paymentmethod': 'electronic_check',
'tenure': 1,
'monthlycharges': 85.7,
'totalcharges': 85.7
}

In [None]:
#Let’s make a prediction:
X_test = dv.transform([customer])
model.predict_proba(X_test)[0, 1]
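
Scoring a customer always requires the same two steps: vectorize, then predict. One possible way to keep them together is scikit-learn's `make_pipeline`. The sketch below is an assumption about how this could be organized, not part of the workflow above, and it is trained on a few made-up records only so that it runs on its own.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical), only to make the sketch runnable
toy_customers = [
    {'contract': 'month-to-month', 'tenure': 1},
    {'contract': 'month-to-month', 'tenure': 3},
    {'contract': 'one_year', 'tenure': 20},
    {'contract': 'two_year', 'tenure': 48},
]
toy_churn = [1, 1, 0, 0]

# The pipeline vectorizes the dictionaries and then fits the classifier
pipeline = make_pipeline(
    DictVectorizer(sparse=False),
    LogisticRegression(solver='liblinear', random_state=1),
)
pipeline.fit(toy_customers, toy_churn)

# A new customer dictionary can be scored directly
new_customer = {'contract': 'month-to-month', 'tenure': 2}
churn_probability = pipeline.predict_proba([new_customer])[0, 1]
print(round(churn_probability, 3))
```

With this layout, a single object holds both the encoding and the model, which reduces the risk of scoring a customer with a vectorizer that does not match the trained model.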