# ***Telco customer churn predictions*** 
Recommended music for exploring this notebook:
https://www.youtube.com/watch?v=t3217H8JppI&ab_channel=AnAmericanComposer

It was playing while I created it.

****

# Import libraries, for starters

In [None]:

import numpy as np
import pandas as pd
from scipy import stats
import math

import seaborn as sns
import matplotlib.pyplot as plt


# Import data and explore basic properties

In [None]:
#import data from kaggle store
data=pd.read_csv('/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')
#data.head()

In [None]:
# nice resume table to describe the data
def resumetable(df):
    print(f"Dataset Shape: {df.shape}")
    summary = pd.DataFrame(df.dtypes,columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name','dtypes']]
    summary['Missing'] = df.isnull().sum().values    
    summary['Uniques'] = df.nunique().values
    summary['First Value'] = df.loc[0].values
    summary['Second Value'] = df.loc[1].values
    summary['Third Value'] = df.loc[2].values
    summary['Fourth Value'] = df.loc[3].values
    summary['Fifth Value'] = df.loc[4].values

    for name in summary['Name'].value_counts().index:
        summary.loc[summary['Name'] == name, 'Entropy'] = round(stats.entropy(df[name].value_counts(normalize=True), base=10),4) 

    return summary

In [None]:
# resumetable(data)

# also handy: a built-in method for data description
# data.describe()

# and one more
# data.info()

* No missing data
* One row per customer
* Object columns should be retyped, and TotalCharges should be checked
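
The `TotalCharges` check mentioned above can be sketched as follows. This is a minimal illustration on a hypothetical three-row frame (`toy` is not part of the notebook). It assumes the column is stored as text because a few rows hold blank strings. That cause is an assumption here, not something verified above.

```python
import pandas as pd

# Hypothetical toy frame; the blank string mimics a value that cannot be parsed as a number.
toy = pd.DataFrame({"tenure": [0, 2, 10],
                    "TotalCharges": [" ", "108.15", "593.3"]})

# Coerce to numeric: unparseable entries become NaN instead of raising an error.
parsed = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Rows that failed to parse, shown with their original text value.
bad_rows = toy[parsed.isna()]
print(bad_rows)
print("Unparseable values:", parsed.isna().sum())
```

Running the same two steps on `data['TotalCharges']` before retyping would show which customers end up with `NaN`.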


# **2. Let's look at histograms of the numerical variables: tenure, MonthlyCharges and TotalCharges**

KDE was intentionally omitted from the first few plots, so as not to clutter the histograms visually.

In [None]:
# retyping TotalCharges to numeric
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')

# for the sake of ease while plotting
df_y = data[data["Churn"]=="Yes"] 
df_n = data[data["Churn"]=="No"]

# 3. TotalCharges vs tenure x MonthlyCharges - discount effect
* TotalCharges should equal MonthlyCharges x tenure. If it does not, it is a sign of a discount or price increase that the customer got.
* That might be a big factor for churning, let's look further.

In [None]:
# Calculate the difference between tenure*MonthlyCharges and TotalCharges
data['TotalCharge_diff'] = (data['tenure'] * data['MonthlyCharges']) - data['TotalCharges']
data['TotalCharge_diff_abs'] = data['TotalCharge_diff'].abs()
# keeping both as possibly good features; logically, I suppose only TotalCharge_diff will be of any use

In [None]:
# plot
plt.figure(figsize=(14, 4))
plt.title("Histogram of {}".format('TotalCharge_diff'))
ax0 = sns.histplot(data[data['Churn'] == 'No']['TotalCharge_diff'].dropna(), color = "#22ff57", label= 'Churn: No')
ax1 = sns.histplot(data[data['Churn'] == 'Yes']['TotalCharge_diff'].dropna(), color= "#FF5722", label= 'Churn: Yes')
plt.legend(prop={'size': 12})


In [None]:
#kde_plot('TotalCharge_diff')
#kde_plot('TotalCharge_diff_abs')

* Interestingly, customers who did not churn seem to follow a distribution with "fatter" tails when looking at the payment difference.
* It seems that customers who were not exposed to a change in payment are more prone to churn, while customers who were exposed to a change in payment are less prone to churn.
* **Interestingly, the distribution is roughly symmetrical, but, as expected, the fatter tail is on the left side, i.e. customers who got a discount compared to their previous payments were less likely to churn.**
* **BUT** we might be exposed to the Law of Small Numbers, as neither the sample nor the effect is that big.
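
One way to guard against the small-numbers concern is a two-sample test on the two groups. The sketch below uses SciPy's Kolmogorov-Smirnov test on synthetic stand-in data. The arrays `diff_no` and `diff_yes` are hypothetical and are not drawn from the Telco dataset, and the choice of test is an assumption, not something the notebook ran.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for TotalCharge_diff in the two groups (NOT the notebook's data):
# the non-churn group is drawn with a wider spread to mimic "fatter" tails.
diff_no = rng.normal(loc=0.0, scale=80.0, size=500)
diff_yes = rng.normal(loc=0.0, scale=40.0, size=150)

# Two-sample Kolmogorov-Smirnov test: compares the two empirical distributions.
ks_stat, p_value = stats.ks_2samp(diff_no, diff_yes)
print(f"KS statistic: {ks_stat:.3f}, p-value: {p_value:.4f}")
```

With the real data, the inputs would be `data.loc[data['Churn'] == 'No', 'TotalCharge_diff'].dropna()` and the matching `'Yes'` selection. A small p-value would suggest the difference in shape is not just noise.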

#  **4. Now let's explore the categorical variables**

In [None]:
# create a real copy for thingama-jigging with categorical vars (plain assignment would only alias `data`)
df = data.copy()

In [None]:
# borrowed function for plotting bar plots of churn percentages per category
def barplot_percentages(feature, orient='v', axis_name="percentage of customers"):
    # count churners and non-churners per category; naming the Series up front
    # avoids a clash with the "Churn" index level and works across pandas versions
    g = df.groupby(feature)["Churn"].value_counts().rename(axis_name).reset_index()
    # express the counts as a share of all customers
    g[axis_name] = g[axis_name]/len(df)
    if orient == 'v':
        ax = sns.barplot(x=feature, y= axis_name, hue='Churn', data=g, orient=orient)
        ax.set_yticklabels(['{:,.0%}'.format(y) for y in ax.get_yticks()])
    else:
        ax = sns.barplot(x= axis_name, y=feature, hue='Churn', data=g, orient=orient)
        ax.set_xticklabels(['{:,.0%}'.format(x) for x in ax.get_xticks()])
    ax.plot()

In [None]:
# borrowed function for plotting one pie chart per variable, showing the share of each category
def plot_var_percentages (df, var_list):

    for i,var in enumerate(var_list):

        # category labels and their counts, in matching order
        value_counts = df[var].value_counts()
        labels = list(value_counts.index)
        counts = list(value_counts)

        plt.figure(i)
        plt.pie(counts, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
        plt.title(var)
    # call show() (the original referenced plt.show without calling it)
    plt.show()

In [None]:
# Brief look at data distribution across categories
var_list = data.columns[1:-5].drop(['tenure'])

#plot_var_percentages(data, var_list)

* It seems we do not have many senior citizen customers (even though we clearly see that they behave differently concerning churn).
* PhoneService: there are also not many customers without PhoneService. Let's see whether this imbalance causes trouble later.

In [None]:
#print(var_list)
#print(data.columns)

# Let's start with our seniors
#barplot_percentages("SeniorCitizen")

* It seems senior citizens like to change operators more often. They might be bored at home and be the only ones to answer the cold calls. Or they might just have more time to calculate which services they need and at what price. Or something else :-)
* One way or the other, it seems that being a senior citizen goes with a significantly higher probability of churn.
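
Whether "significantly higher" holds in the statistical sense can be checked with a chi-square test of independence. The sketch below is a minimal illustration on hypothetical counts. The frame `toy` and its numbers are made up, and the choice of test is an assumption rather than part of the original analysis.

```python
import pandas as pd
from scipy import stats

# Hypothetical counts, for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "SeniorCitizen": [0] * 80 + [1] * 20,
    "Churn": ["No"] * 60 + ["Yes"] * 20 + ["No"] * 10 + ["Yes"] * 10,
})

# Churn rate per group: share of "Yes" among non-seniors (0) and seniors (1).
churn_rate = toy.groupby("SeniorCitizen")["Churn"].apply(lambda s: (s == "Yes").mean())
print(churn_rate)

# Chi-square test of independence on the 2x2 contingency table.
table = pd.crosstab(toy["SeniorCitizen"], toy["Churn"])
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}")
```

Replacing `toy` with `df` would run the same check on the real customers.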

In [None]:
# numeric copy of the target: 0 = stayed, 1 = churned
df['churn_rate'] = df['Churn'].map({"No": 0, "Yes": 1})
#g = sns.FacetGrid(df, col="SeniorCitizen", height=4, aspect=.9)
#ax = g.map(sns.barplot, "gender", "churn_rate", palette = "Blues_d", order= ['Female', 'Male'])

* The churn rate of women and men is similar across age groups (but it is higher for senior citizens).

In [None]:
#churn rates across customers w/ partners and dependents

# fig, axis = plt.subplots(1, 2, figsize=(12,4))
# axis[0].set_title("Has partner")
# axis[1].set_title("Has dependents")
# axis_y = "percentage of customers"
# # Plot Partner column
# gp_partner = df.groupby('Partner')["Churn"].value_counts()/len(df)
# gp_partner = gp_partner.to_frame().rename({"Churn": axis_y}, axis=1).reset_index()
# ax = sns.barplot(x='Partner', y= axis_y, hue='Churn', data=gp_partner, ax=axis[0])
# # Plot Dependents column
# gp_dep = df.groupby('Dependents')["Churn"].value_counts()/len(df)
# gp_dep = gp_dep.to_frame().rename({"Churn": axis_y}, axis=1).reset_index()
# ax = sns.barplot(x='Dependents', y= axis_y, hue='Churn', data=gp_dep, ax=axis[1])

Customers without partners and dependents are more likely to churn. Feeling free. Interesting.

In [None]:
# Categorical vars - MultipleLines
#plt.figure(figsize=(9, 4.5))
#barplot_percentages("MultipleLines", orient='h')

* The MultipleLines variable does not seem to be of much use.

In [None]:
# What internet service the customer has? 
#plt.figure(figsize=(9, 4.5))
#barplot_percentages("InternetService", orient="h")

* Customers with a fiber optic connection churn more often than DSL customers, so they are probably less satisfied with the service.
* This might be a good feature!

In [None]:
plt.figure(figsize=(15, 15))
some_vars = ['gender','SeniorCitizen','Partner','Dependents','PhoneService','PaperlessBilling']
i=1
for var in some_vars:
    plt.subplot(3, 2, i)
    sns.countplot(x=var,data=data, hue='Churn')
    i+=1

# 5. Preprocessing: data preparation and feature engineering based on EDA

In [None]:
# Show what we get here. Again.
resumetable(df)

In [None]:
non_dummy_cols = ['customerID','tenure','MonthlyCharges','TotalCharges','Churn','churn_rate','TotalCharge_diff','TotalCharge_diff_abs'] 
dummy_cols = list(set(df.columns) - set(non_dummy_cols))
df_test = pd.get_dummies(df, columns=dummy_cols)

# non_dummy_cols = ['A','B','C'] 
# Takes all other columns
# dummy_cols = list(set(df.columns) - set(non_dummy_cols))
# df = pd.get_dummies(df, columns=dummy_cols)


In [None]:
# check dummies
# resumetable(df_test)

In [None]:
# tenure - create two more categories, as the tenure feature does not have linear behaviour
df_test['tenure_short'] = np.where(df_test['tenure']<18, 1, 0)
df_test['tenure_long'] = np.where(df_test['tenure']>54, 1, 0)
#df.head()

In [None]:
# create cat var for high monthly charges
#df.loc[:,'high_payer'] = np.where(df['MonthlyCharges'] > 60, 1,0)

In [None]:
# frames = [df, df_dummies]

# df_ready = pd.concat(frames,axis=1)
# print(df_ready.head())

In [None]:
#resumetable(df_ready)

# Dropping customerID

In [None]:
# drop NaNs in TotalCharges
df_test = df_test.dropna()

# drop customerID, as it would not be of any help
df_test.drop(['customerID'],axis=1,inplace=True)

resumetable(df_test)

In [None]:
# Further usage of just "df"
df = df_test

# Retype to ints and bools

In [None]:
# cast all dummy and indicator columns to boolean
non_int_cols = ['tenure','MonthlyCharges','TotalCharges','Churn','churn_rate','TotalCharge_diff','TotalCharge_diff_abs'] 
int_cols = list(set(df.columns) - set(non_int_cols))
df[int_cols] = df[int_cols].astype(bool)

# cast the float columns to integers (note: this truncates the decimal part of the charges)
float_cols = ['MonthlyCharges','TotalCharges','TotalCharge_diff','TotalCharge_diff_abs']
df[float_cols] = df[float_cols].astype(np.int64)


Looking good now. 

# Features correlation

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score

from sklearn.linear_model import LogisticRegression

In [None]:
corrMatrix = df.drop(['Churn','churn_rate'], axis=1).corr()
fig, ax = plt.subplots(figsize=(30,25))
sns.heatmap(corrMatrix,annot=True, annot_kws={'size':12},cmap="GnBu")
plt.show();

* The correlation matrix is very large, but it nevertheless shows which features we can drop for now.
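
Reading a heatmap of this size by eye is error-prone, so the highly correlated pairs can also be listed programmatically. The sketch below is a minimal illustration on a hypothetical frame `toy`. The 0.95 threshold is an arbitrary assumption, not a value used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical dummy columns: "b" is an exact copy of "a", "c" is only weakly related.
toy = pd.DataFrame({"a": [1, 0, 1, 0, 1, 0],
                    "b": [1, 0, 1, 0, 1, 0],
                    "c": [1, 1, 0, 0, 1, 0]})

corr = toy.corr().abs()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format: one entry per column pair, filtered by the threshold.
pairs = upper.stack()
high = pairs[pairs > 0.95]
print(high)
```

Applied to `corrMatrix`, the resulting index lists the column pairs from which one member can be dropped.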

In [None]:
# drop the "No internet service" dummies and other highly correlated columns. Their meaning was not clear to me, but the correlation matrix shows that these duplicated columns add no information
# this was actually added after looking at the correlation matrix, but I left it here for the sake of simplicity
# (df and df_test refer to the same DataFrame here; the duplicated 'OnlineBackup_No internet service' entry was removed)
df.drop(['OnlineBackup_No internet service','TechSupport_No internet service','StreamingTV_No internet service','DeviceProtection_No internet service',
              'OnlineSecurity_No internet service', 'StreamingMovies_No internet service', 'MultipleLines_No phone service','PhoneService_No'],axis=1,inplace=True)
              # ,'MultipleLines_No',
              # 'OnlineSecurity_No','OnlineBackup_No','DeviceProtection_No','TechSupport_No','StreamingTV_No','StreamingMovies_No'],
              # axis=1,inplace=True)
        
# leaving out all the rest for now

In [None]:
#Correlation of "Churn" with other variables in 1D:

plt.figure(figsize=(15,8))
df.drop(['Churn'], axis=1).corr()['churn_rate'].sort_values(ascending = False).plot(kind='bar')

* So TotalCharge_diff is not correlated with churn, which makes sense and is great.
* Gender, as expected, is not correlated either.

* It seems we can keep all features for now; none are so heavily correlated that they should be omitted at the moment.

# Apply feature scaling and split dataset

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler

In [None]:
target0 = df['churn_rate'] # for y
features0 = df.drop(['Churn','churn_rate'], axis=1) # for X

In [None]:
# To preserve the shape of the distributions (no distortion), the data is min-max scaled to the range [0, 1]
# instead of standard scaled. I also tried StandardScaler, but the results were worse since the data is not Gaussian.
# RobustScaler performed similarly to MinMaxScaler.
scaler0=MinMaxScaler()

In [None]:
# split first, then fit the scaler on the training part only, so that no test-set information leaks into the scaling
X_train0, X_test0, y_train0, y_test0 = train_test_split (features0,target0,test_size=0.2, random_state=123)
X_train0 = scaler0.fit_transform(X_train0)
X_test0 = scaler0.transform(X_test0)

# Grid search and model evaluation

In [None]:
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV

In [None]:
# grid search to find optimal value of C (for regularization) and solver

model_gs = LogisticRegression(max_iter=500)
solvers = ['newton-cg', 'lbfgs', 'liblinear']
penalty = ['l2']
c_values = [100, 10, 1.0, 0.1, 0.01]
# define grid search
grid = dict(solver=solvers,penalty=penalty,C=c_values)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model_gs, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X_train0, y_train0)

In [None]:
# summarize results of grid search
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

# ROC-AUC score

In [None]:
from sklearn.metrics import roc_curve, auc

# Zero gen model with tuned C and solver

# increased the number of iterations here, as the lbfgs solver was not converging in 100 steps
logreg0 = LogisticRegression(max_iter=500,C=10, penalty='l2', solver='lbfgs')

# Decision scores (signed distances to the decision boundary, not probabilities) for the test set
y_score0 = logreg0.fit(X_train0, y_train0).decision_function(X_test0)
#False positive Rate and true positive rate
fpr0, tpr0, thresholds0 = roc_curve(y_test0, y_score0)

#Visualization for ROC curve
sns.set_style("darkgrid", {"axes.facecolor": ".9"})

print('AUC: {}'.format(auc(fpr0, tpr0)))
plt.figure(figsize=(10,8))
lw = 2
plt.plot(fpr0, tpr0, color='darkorange',
         lw=lw, label='ROC curve')
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([-0.05, 1.0])
plt.ylim([0.0, 1.05])
plt.yticks([i/20.0 for i in range(21)])
plt.xticks([i/20.0 for i in range(21)])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
#print(y_score0)

In [None]:
from sklearn.metrics import confusion_matrix
y_hat_test0 = logreg0.predict(X_test0)
cm0=confusion_matrix(y_test0,y_hat_test0)
conf_matrix0=pd.DataFrame(data=cm0,columns=['Predicted:0','Predicted:1'],index=['Actual:0','Actual:1'])
plt.figure(figsize = (8,5))
sns.heatmap(conf_matrix0, annot=True,fmt='d',cmap="YlGnBu")
#print(y_hat_test0)

In [None]:
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import classification_report
# scikit-learn expects the true labels first, then the predictions
#print(precision_recall_fscore_support(y_test0, y_hat_test0))
print(classification_report(y_test0, y_hat_test0))
print("Accuracy: ")
print (accuracy_score(y_test0, y_hat_test0))

In [None]:
# check the accuracy on the training set (the model uses L2 regularization, which is also the scikit-learn default penalty)
print("Train Accuracy:",logreg0.score(X_train0, y_train0))

In [None]:
# check features0
# features0.columns.values

# Weights of model illustrated

In [None]:
# To get the weights of all the variables
weights = pd.Series(logreg0.coef_[0],
                 index=features0.columns.values)

plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
# 10 largest coefficients: features that increase the predicted probability of churn the most
weights.sort_values(ascending = False)[:10].plot(kind='bar')
plt.subplot(1, 2, 2)
# 10 lowest coefficients: the negative ones decrease the predicted probability of churn
weights.sort_values(ascending = False)[-10:].plot(kind='bar')
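
Logistic regression coefficients are easier to interpret as odds ratios. The sketch below is a minimal illustration with made-up coefficients. `toy_weights` and its feature names are hypothetical and are not the fitted values of `logreg0`.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients, for illustration only (not the fitted values of logreg0).
toy_weights = pd.Series({"feature_a": 0.7, "feature_b": 0.0, "feature_c": -1.2})

# exp(coefficient) is the multiplicative change in the odds of churn
# when the scaled feature goes from 0 to 1, with the other features held fixed.
odds_ratios = np.exp(toy_weights).sort_values(ascending=False)
print(odds_ratios.round(2))
```

An odds ratio above 1 raises the odds of churn, a value below 1 lowers them, and exactly 1 means no effect. Calling `np.exp(weights)` gives the same reading for the real model.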


# Train/test accuracy check
* From the results above, it is clear that the results on the train and test sets differ.
* It will be interesting to look at model performance on the two data sets with different C values.

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import learning_curve
from sklearn.model_selection import validation_curve
### 1. Use of validation curves for both datasets.
C_param_range = [0.001,0.005,0.01,0.05,0.1,0.5,1,5,10,50,100]

plt.figure(figsize=(15, 10))

# Apply logistic regression model to training data
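lr = LogisticRegression(penalty='l2',random_state = 42,max_iter=500)

# Compute the validation curve: validation_curve sets C from param_range and, by default, scores each value with 5-fold cross-validation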
train_scores, test_scores = validation_curve(estimator=lr
                                                            ,X=X_train0
                                                            ,y=y_train0
                                                            ,param_name='C'
                                                            ,param_range=C_param_range
                                                            )

train_mean = np.mean(train_scores,axis=1)
train_std = np.std(train_scores,axis=1)
test_mean = np.mean(test_scores,axis=1)
test_std = np.std(test_scores,axis=1)

plt.plot(C_param_range
            ,train_mean
            ,color='blue'
            ,marker='o'
            ,markersize=5
            ,label='training accuracy')

    
plt.plot(C_param_range
            ,test_mean
            ,color='green'
            ,marker='x'
            ,markersize=5
            ,label='test accuracy') 
    
plt.xscale('log')
plt.xlabel('C_parameter')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.ylim([0.725,0.825])


* Depending on our needs, we might in the end go with a different C value, e.g. 0.1, as this might generalize best, because we have
    * a *small difference* between the accuracies on the train and test sets and
    * *high performance on both the train and the test set*.
* The lower test set accuracy around C=0 might be due to this specific data split, e.g. an inconsistency in the data.

# How to apply findings?
* The accuracy is certainly not high, although it gives some hint and is maybe better than nothing: with a roughly 75/25 class split, an accuracy of just over 82% is only about 7 percentage points better than always predicting "no churn", which is not much. Still, that is roughly 30% fewer errors than without any model.
* Since it is easy to get the probabilities output by logistic regression, we might consider using these churn probabilities, maybe combining them with MonthlyCharges, and try not to lose the most valuable customers.
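
The two ways of reading the improvement over the no-model baseline can be worked out explicitly. This is a small arithmetic sketch using the approximate figures quoted above. The variable names are mine, and reading the "roughly 30%" as a relative error reduction is an interpretation.

```python
# Figures quoted in the text above; both are approximate.
baseline_accuracy = 0.75   # always predicting "no churn" on a roughly 75/25 split
model_accuracy = 0.82      # logistic regression accuracy quoted above

# Absolute gain in percentage points.
gain_points = (model_accuracy - baseline_accuracy) * 100

# Relative reduction of the error rate (share of baseline mistakes that the model avoids).
error_reduction = ((1 - baseline_accuracy) - (1 - model_accuracy)) / (1 - baseline_accuracy)

print(f"Gain: {gain_points:.0f} percentage points")           # Gain: 7 percentage points
print(f"Relative error reduction: {error_reduction:.0%}")     # Relative error reduction: 28%
```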


In [None]:
churn_prob = logreg0.predict_proba(X_test0[:10])
print(churn_prob)
print(y_test0[:10])

* This might lead us to offer vouchers or packages to customers with a churn probability of e.g. >40% or something like that.
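
Combining the churn probability with MonthlyCharges could look roughly like the sketch below. The `customers` frame, the `revenue_at_risk` column and the ranking rule are all hypothetical illustrations, and the 0.40 cut-off is just the example value mentioned above.

```python
import pandas as pd

# Hypothetical customers: churn_prob stands in for logreg0.predict_proba(...)[:, 1].
customers = pd.DataFrame({
    "customer": ["A", "B", "C", "D"],
    "churn_prob": [0.10, 0.45, 0.80, 0.55],
    "MonthlyCharges": [90.0, 100.0, 20.0, 60.0],
})

THRESHOLD = 0.40  # the ">40%" cut-off suggested above; an arbitrary choice

# Expected monthly revenue at risk = probability of churn x monthly charge.
customers["revenue_at_risk"] = customers["churn_prob"] * customers["MonthlyCharges"]

# Keep customers above the threshold and contact the most valuable ones first.
targets = (customers[customers["churn_prob"] > THRESHOLD]
           .sort_values("revenue_at_risk", ascending=False))
print(targets)
```

Here customer C has the highest churn probability but ranks last among the targets, because little revenue is at stake. That is the trade-off this ranking is meant to expose.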

# Credits
* https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/ 
* https://towardsdatascience.com/the-dummys-guide-to-creating-dummy-variables-f21faddb1d40 
* https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/ 
* https://www.analyticsvidhya.com/blog/2020/03/one-hot-encoding-vs-label-encoding-using-scikit-learn/
* https://machinelearningmastery.com/hyperparameters-for-classification-machine-learning-algorithms/
* https://www.kaggle.com/joparga3/2-tuning-parameters-for-logistic-regression
* ... and various Kaggle kernels


# Open questions

* Does a circa 75/25% churn split match the real-life situation of a telco company?
* Which metric would best fit the needs of a service provider: is recall more important? Or the number of false negatives? Probably false negatives weighted by the revenue the customer generates?
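
To get a feeling for the recall question, the decision threshold can be swept and recall, precision and false negatives compared at each value. The sketch below is a minimal illustration. The labels and probabilities are made up, and the thresholds are arbitrary assumptions.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels (1 = churned) and predicted churn probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.2, 0.35, 0.45, 0.42, 0.6, 0.9, 0.55, 0.3, 0.05])

results = {}
for threshold in (0.3, 0.4, 0.5):
    y_pred = (y_prob >= threshold).astype(int)
    recall = recall_score(y_true, y_pred)        # share of churners that were caught
    precision = precision_score(y_true, y_pred)  # share of flagged customers that churned
    missed = int(((y_true == 1) & (y_pred == 0)).sum())  # false negatives
    results[threshold] = (recall, precision, missed)
    print(f"threshold={threshold:.1f}  recall={recall:.2f}  precision={precision:.2f}  FN={missed}")
```

Lowering the threshold catches more churners at the price of flagging more loyal customers. Which point to pick depends on the cost of a retention offer versus the revenue lost with a missed churner.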

# Next possible steps

* Prettify the syntax, add more comments and use more looping for the sake of simplicity. This solution is just a get-your-hands-dirty attempt, and I know it doesn't read as easily as I would like...