# Why do telco customers leave?

**Customer churn** is when an existing customer, user, subscriber or any other kind of client stops doing business with a company or ends the relationship with it.

**The goal** of this kernel is to explore the different variables and their effect on churn.

In [None]:
# make sure to have the latest seaborn version
!pip install seaborn --upgrade

In [None]:
import seaborn as sns
sns.__version__

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')

In [None]:
df

In [None]:
df.info()

In [None]:
for feat in df.columns:
    print(feat)
    print(df[feat].dtype)
    print(df[feat].unique())
    print('#'*30)

### Transformation
- `TotalCharges` shouldn't be an object. A simple type cast isn't possible, because several empty values are stored as strings
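
As a minimal sketch of why a plain cast fails and what `errors='coerce'` does, consider a toy series (the values below are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the raw column: numbers stored as text, plus one blank entry
raw = pd.Series(['29.85', ' ', '108.15'])

# errors='coerce' turns anything that can't be parsed into NaN instead of raising
numeric = pd.to_numeric(raw, errors='coerce')

print(numeric.dtype)         # the column is numeric now
print(numeric.isna().sum())  # how many entries could not be parsed
```

Counting the NaNs right after the conversion is a quick way to see how many rows are affected before deciding how to fill them.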

In [None]:
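# change dtype; blank strings that can't be parsed become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# fill NaNs with mean (assign back instead of using inplace on a single column)
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].mean())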

## Look at the churn across the dataset

In [None]:
sns.countplot(data=df, x='Churn');

As expected, `Churn` is imbalanced. We need to keep that in mind when splitting the data into train and test sets. Stratification is the key word.
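
To illustrate what stratification does, here is a small sketch on a hypothetical toy frame with an 80/20 class split (the data and the `test_size` value are assumptions chosen for the example):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 'No' labels and 20 'Yes' labels
toy = pd.DataFrame({'x': range(100), 'Churn': ['No'] * 80 + ['Yes'] * 20})

# stratify keeps the class proportions the same in both parts
train, test = train_test_split(
    toy, test_size=0.25, random_state=0, stratify=toy['Churn']
)

train_share = (train['Churn'] == 'Yes').mean()
test_share = (test['Churn'] == 'Yes').mean()
print(train_share, test_share)
```

Both shares should match the share in the full toy frame, which is the property we want for an imbalanced target.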

### Crosstab
- a crosstab compares two (or more) factors by creating a frequency table, unless an aggregation function is passed
- this only works with categorical data

#### Steps
1. remove non-categorical data
2. create an empty list to store every crosstab for each feature
3. create the individual crosstabs for each feature with a for loop
4. save them in the list with `.append`
5. concat the individual crosstabs and pass the column names as keys to create the outer index

In [None]:
# Step 1: remove features
df_cross = df.drop(columns=['customerID', 'tenure', 'MonthlyCharges', 'TotalCharges'])

# Step 2: create empty list
crosstabs = []
# Step 3: For loop to create the crosstabs (every feature except the target itself)
features = df_cross.columns.drop('Churn')
for feat in features:
    crosstab = pd.crosstab(df_cross[feat], df_cross['Churn'])
    # Step 4: append them to the list
    crosstabs.append(crosstab)

# Step 5: concat each of them with the column names as index
crosstab_count = pd.concat(crosstabs, keys=features)
crosstab_count


# this can be done in one line of code:
# pd.concat([pd.crosstab(df_cross[x], df_cross['Churn']) for x in df_cross.columns[:-1]], keys=df_cross.columns[:-1])
# code from udemy course: Data Science & Deep Learning for Business

After creating the crosstab, we can calculate the churn percentage for each factor level to get a quick overview of where churn is especially prevalent.
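
As an aside, `pd.crosstab` can also produce row-wise shares directly through its `normalize` argument. A minimal sketch on a made-up toy frame (the rows below are illustrative, not from the dataset):

```python
import pandas as pd

# Hypothetical toy data with one feature and the target
toy = pd.DataFrame({
    'Contract': ['Month-to-month', 'Month-to-month', 'Two year', 'Two year'],
    'Churn': ['Yes', 'No', 'No', 'No'],
})

# normalize='index' divides each row by its row total; * 100 gives percentages
share = pd.crosstab(toy['Contract'], toy['Churn'], normalize='index') * 100
print(share)
```

This gives the same kind of percentage as dividing the `Yes` count by the row total, but without keeping the raw counts alongside it.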

In [None]:
crosstab_count['percentage'] = (crosstab_count['Yes'] / (crosstab_count['Yes'] + crosstab_count['No'])* 100)
crosstab_count.sort_values('percentage', ascending=False)

### Conclusion
- the highest churn rates are for: `PaymentMethod`-electronic check, `Contract`-month to month, and `InternetService`-fiber optic
    * this makes sense, since customers can quickly change the service when they have a month-to-month contract and a digital payment method
- the lowest churn rates are for customers who have a two-year `Contract` and no internet service
    * this is also as expected, since customers who aren't online can't easily compare or change contracts, especially when they have committed for two years

Now that we have looked at the categorical and binary data, let us compare the ratio-scaled data.

In [None]:
df_ratio = df[['tenure', 'MonthlyCharges', 'TotalCharges', 'Churn']]

fig, axes = plt.subplots(1, 3, figsize=[12,6])
# space between the plots 
fig.subplots_adjust(wspace=0.5)
for feat, ax in zip(df_ratio.columns[:-1], axes.flatten()):
    sns.violinplot(data=df_ratio, y=feat, x='Churn', ax=ax)

### Conclusion
- `tenure`: the number of months a customer has stayed with the company is a good indicator of whether a person is leaving or not.
- `MonthlyCharges`: high monthly charges go along with more churn
- `TotalCharges`: at first glance there seems to be no difference, but the body of the violin for churn = Yes is wider at the low end than for No. This makes sense, because most people leave in the first months, which explains the low total charges

## Preprocessing
- drop `customerID`
- map binary columns to 0 and 1
- create dummy variables
- combine Data Frames
- split into x and y for train and test
- scale features
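
One detail of the scaling step is easy to miss: when only some columns are scaled with a `ColumnTransformer` and the rest are passed through, the output puts the transformed columns first and the passthrough columns after them. A small sketch on a hypothetical two-column frame (column names `flag` and `amount` are made up for the example):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame: 'flag' comes first, 'amount' second
toy = pd.DataFrame({'flag': [0, 1, 0, 1], 'amount': [10.0, 20.0, 30.0, 40.0]})

ct_toy = ColumnTransformer(
    [('scaler', StandardScaler(), ['amount'])], remainder='passthrough'
)
out = ct_toy.fit_transform(toy)

# Column 0 now holds the scaled 'amount', column 1 the untouched 'flag'
print(out)
```

Any list of feature names that is later matched against the transformed array has to follow this new order.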

In [None]:
df.drop(columns='customerID', inplace=True)

In [None]:
# copy so the mapping below doesn't write into a slice of df
df_bin = df.select_dtypes('object').copy()

# select only features with two unique values
list_bin = [col for col in df_bin.columns if df_bin[col].nunique() == 2]
# create dict with values
dict_bin = {'Yes': 1, 'No': 0, 'Female': 0, 'Male': 1}

# for loop to go through each binary feature and map the values
for feat in list_bin:
    df_bin[feat] = df_bin[feat].map(dict_bin)

Now we just need to address the categorical variables and transform them into dummy variables.
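
As a small illustration of what dummy encoding with `drop_first=True` produces, here is a sketch on a toy column (the four rows are made up; only the column name mirrors the dataset):

```python
import pandas as pd

# Hypothetical toy column with three categories
toy = pd.DataFrame({'InternetService': ['DSL', 'Fiber optic', 'No', 'DSL']})

# drop_first=True drops the first category, which then acts as the baseline
dummies = pd.get_dummies(toy, drop_first=True)
print(list(dummies.columns))
# → ['InternetService_Fiber optic', 'InternetService_No']
```

A row with both dummies set to 0 therefore stands for the dropped category, here `DSL`.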

In [None]:
# dtype object only includes the categorical variables now, so we can just filter on it
df_cat = df_bin.select_dtypes('object')

df_dummy = pd.get_dummies(df_cat, drop_first=True)

In [None]:
# combine Data Frames
df_final = pd.concat([df.select_dtypes(exclude='object'), df_bin.select_dtypes(exclude='object'), df_dummy], axis=1)
df_final

In [None]:
X = df_final.drop(columns='Churn')
y = df_final['Churn']

# we will use the stratify argument to account for the imbalance
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
# use Column Transformer to apply scaling only to the specified columns
ct = ColumnTransformer([('scaler', StandardScaler(), ['tenure', 'MonthlyCharges', 'TotalCharges'])], remainder='passthrough')
# fit and transform X_train
X_train_sc = ct.fit_transform(X_train)
# transform X_test 
X_test_sc = ct.transform(X_test)

## Make predictions

In [None]:
# instantiate models
logreg = LogisticRegression()
rf = RandomForestClassifier()
adac = AdaBoostClassifier()
gbc = GradientBoostingClassifier()

model_list = [logreg, rf, adac, gbc]

for model in model_list:
    model.fit(X_train_sc, y_train)
    preds = model.predict(X_test_sc)
    print('accuracy: ', accuracy_score(y_test, preds))
    print(confusion_matrix(y_test, preds))
    print(classification_report(y_test, preds))
    print('#'*60)

Logistic Regression performed the best out of every model tested.

## Feature importance

In [None]:
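# ColumnTransformer puts the scaled columns first and the passthrough columns after them,
# so the feature names have to follow that order to line up with the coefficients
scaled_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
col_list = scaled_cols + [col for col in X_train.columns if col not in scaled_cols]
importance = logreg.coef_.flatten()

feat_importance = pd.DataFrame({'feature': col_list, 'importance': importance})
feat_importance.sort_values('importance')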

### Conclusion
- As seen in the EDA, the contract types (`Contract_Two year`, `Contract_One year`) are influential variables predicting that a customer doesn't churn
- `Fiber optic` and digital billing (`MonthlyCharges`, `PaperlessBilling`) are strongly associated with customers churning

## Resources
- https://stackoverflow.com/questions/38420847/apply-standardscaler-to-parts-of-a-data-set