The dataset contains the target variable `Churn` (whether the customer churned: Yes or No) and the customer attributes described below:

`customerID` – Customer ID<br>
`gender` – Whether the customer is a male or a female<br>
`SeniorCitizen` – Whether the customer is a senior citizen or not (1, 0)<br>
`Partner` – Whether the customer has a partner or not (Yes, No)<br>
`Dependents` – Whether the customer has dependents or not (Yes, No)<br>
`tenure` – Number of months the customer has stayed with the company<br>
`PhoneService` – Whether the customer has a phone service or not (Yes, No)<br>
`MultipleLines` – Whether the customer has multiple lines or not (Yes, No, No phone service)<br>
`InternetService` – Customer’s internet service provider (DSL, Fiber optic, No)<br>
`OnlineSecurity` – Whether the customer has online security or not (Yes, No, No internet service)<br>
`OnlineBackup` – Whether the customer has online backup or not (Yes, No, No internet service)<br>
`DeviceProtection` – Whether the customer has device protection or not (Yes, No, No internet service)<br>
`TechSupport` – Whether the customer has tech support or not (Yes, No, No internet service)<br>
`StreamingTV` – Whether the customer has streaming TV or not (Yes, No, No internet service)<br>
`StreamingMovies` – Whether the customer has streaming movies or not (Yes, No, No internet service)<br>
`Contract` – The contract term of the customer (Month-to-month, One year, Two year)<br>
`PaperlessBilling` – Whether the customer has paperless billing or not (Yes, No)<br>
`PaymentMethod` – The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))<br>
`MonthlyCharges` – The amount charged to the customer monthly<br>
`TotalCharges` – The total amount charged to the customer<br>


# 1. Data description and preprocessing

In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)

import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data = pd.read_csv("../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")
data.head()

In [None]:
# We can delete 'customerID' because it is only an identifier and doesn't affect the churn.

data.drop(columns=['customerID'], inplace=True)

In [None]:
# We can see there are no missing values, but 'TotalCharges' has object type although it looks like float.

data.info()

In [None]:
# We can see the categorical features have 2-4 values
# (the np.object alias was removed in recent NumPy versions, so the builtin object is used)

data.describe(include=object)

In [None]:
data.columns.tolist()

In [None]:
# Let's check why 'TotalCharges' has object type: what does it contain besides numbers?

data[~data['TotalCharges'].str.match(r'^\d*\.?\d*$')]

In [None]:
data.loc[488, 'TotalCharges']

In [None]:
# Blank 'TotalCharges' values belong to accounts that are less than a month old. They give us no information, so we can get rid of them.

data.drop(data[data['TotalCharges'] == ' '].index, inplace=True)

In [None]:
# Convert 'TotalCharges' to float

data['TotalCharges'] = data['TotalCharges'].astype(float)
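
A possible alternative to the drop-then-cast steps above is `pd.to_numeric` with `errors='coerce'`, which turns non-numeric strings such as `' '` into `NaN` so that they can be dropped afterwards. The sketch below uses a small made-up frame (`toy`), not the real dataset.

```python
import pandas as pd

# Hypothetical miniature of the real column: numbers stored as strings, one blank
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"]})

# Strings that cannot be parsed (here the single space) become NaN instead of raising
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Drop the rows that could not be parsed
toy = toy.dropna(subset=["TotalCharges"])

print(toy["TotalCharges"].dtype)
print(len(toy))
```

This variant does not depend on knowing in advance which exact string marks a missing value.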

In [None]:
# Convert target value 'Churn' to 1 and 0

data['Churn'] = data['Churn'].map({'Yes': 1, 'No': 0})

In [None]:
# We'll work with a copy of the dataset

df = data.copy()

# 2. Correlation searching

In [None]:
# Distinguish feature groups: social, service subscriptions and account features.

social_features = ['gender', 'SeniorCitizen', 'Partner', 'Dependents']
service_features = ['PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
                     'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']
account_features = ['Contract', 'PaperlessBilling', 'PaymentMethod']

In [None]:
def countplot_stat(data, features, hue=None, n_cols=5):
    """
    Plot a countplot for each given column, arranged in a grid with at most n_cols columns
    """
    n_cols = min(n_cols, len(features))
    n_rows = int(np.ceil(len(features) / n_cols))
    # squeeze=False keeps axes a 2D array even when there is a single subplot
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols*5, n_rows*5), squeeze=False)
    for ax, feat in zip(axes.flatten(), features):
        sns.countplot(x=feat, hue=hue, data=data, ax=ax)
        ax.tick_params(axis='x', labelrotation=15)

    plt.tight_layout()
    plt.show()

In [None]:
# Bar graphs for 'Churn' in social features.

countplot_stat(df, social_features, hue='Churn')

We can see that 'gender' is not very useful for predicting 'Churn'.<br>
Although the total number of churned accounts among senior citizens is lower, the churn ratio in this category is higher.<br>
There are more churned accounts among those who have no 'Partner' and no 'Dependents'.
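
The difference between counts and ratios noted above can be checked numerically: when the target is encoded as 1/0, its mean within a group is that group's churn rate. A minimal sketch on made-up numbers (`toy` is illustrative, not the real data):

```python
import pandas as pd

# Made-up data: 8 non-seniors (2 churned) and 2 seniors (1 churned)
toy = pd.DataFrame({
    "SeniorCitizen": [0] * 8 + [1] * 2,
    "Churn":         [1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
})

# sum = number of churned accounts, mean = churn rate within each group
summary = toy.groupby("SeniorCitizen")["Churn"].agg(churned="sum", rate="mean")
print(summary)
```

In this toy frame the senior group has fewer churned accounts (1 against 2) but a higher churn rate (0.5 against 0.25), which is the same pattern the bar graphs suggest.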

In [None]:
# Convert binary features to 1 and 0

df[['Partner', 'Dependents']] = df[['Partner', 'Dependents']]\
    .stack().map({'Yes': 1, 'No': 0}).unstack()

In [None]:
# Bar graphs for 'Churn' in services features

countplot_stat(df, service_features, hue='Churn')

The churn rate is higher among accounts that have a Fiber optic connection and do not have the OnlineSecurity, OnlineBackup, DeviceProtection and TechSupport options.

In [None]:
# Convert binary feature 'PhoneService' to 1 and 0

df[['PhoneService']] = df[['PhoneService']]\
    .stack().map({'Yes': 1, 'No': 0}).unstack()

# Create some features based on connected services

df['is_fiber_optic'] = df['InternetService'].apply(lambda x: 1 if x == 'Fiber optic' else 0)
df['no_internet_service'] = df['InternetService'].apply(lambda x: 1 if x == 'No' else 0)
df['no_online_security'] = df['OnlineSecurity'].apply(lambda x: 1 if x == 'No' else 0)
df['no_online_backup'] = df['OnlineBackup'].apply(lambda x: 1 if x == 'No' else 0)
df['no_device_protection'] = df['DeviceProtection'].apply(lambda x: 1 if x == 'No' else 0)
df['no_tech_support'] = df['TechSupport'].apply(lambda x: 1 if x == 'No' else 0)
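
The same 0/1 flags can also be built without `apply`, by comparing the column with the value of interest and casting the boolean result to integers. This is only an equivalent vectorized sketch on a made-up frame (`toy`):

```python
import pandas as pd

# Made-up values for illustration
toy = pd.DataFrame({"InternetService": ["DSL", "Fiber optic", "No", "Fiber optic"]})

# The comparison yields a boolean Series; astype(int) turns True/False into 1/0
toy["is_fiber_optic"] = (toy["InternetService"] == "Fiber optic").astype(int)
toy["no_internet_service"] = (toy["InternetService"] == "No").astype(int)

print(toy)
```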

In [None]:
# Bar graphs for 'Churn' in account features

countplot_stat(df, account_features, hue='Churn', n_cols=3)

Those who have a month-to-month contract are more likely to churn.<br>
The churn rate is higher among those who use PaperlessBilling and Electronic check.

In [None]:
# Create features on account properties

df['monthly_payments'] = df['Contract'].apply(lambda x: 1 if x == 'Month-to-month' else 0)
df['electronic_check'] = df['PaymentMethod'].apply(lambda x: 1 if x == 'Electronic check' else 0)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 7))
for ax, feat in zip(axes.flatten(), ['tenure', 'MonthlyCharges']):
    # Density curves only; kdeplot replaces distplot, which is deprecated in recent seaborn versions
    sns.kdeplot(x=df[df['Churn'] == 0][feat], color='b', ax=ax, label='No Churn')
    sns.kdeplot(x=df[df['Churn'] == 1][feat], color='r', ax=ax, label='Churn')
    ax.legend()
plt.show()

The distributions of tenure and MonthlyCharges show that accounts with a tenure under 20 months and with charges between 70 and 110 per month have a higher churn risk.
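
One way to cross-check thresholds read off density plots is to bin the variable and compare churn rates per bin. The bin edges and the data below are made up for illustration only:

```python
import pandas as pd

# Made-up tenures (months) and churn flags
toy = pd.DataFrame({
    "tenure": [1, 5, 12, 18, 25, 40, 60, 72],
    "Churn":  [1, 1, 1, 0, 0, 1, 0, 0],
})

# Right-closed bins: (0, 20], (20, 50], (50, 72]
toy["tenure_bin"] = pd.cut(toy["tenure"], bins=[0, 20, 50, 72])

# Mean of the 0/1 target per bin = churn rate in that bin
rates = toy.groupby("tenure_bin", observed=True)["Churn"].mean()
print(rates)
```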

In [None]:
# Create binary features based on critical values 

df['short_tenure'] = df['tenure'].apply(lambda x: 1 if x < 20  else 0)
df['high_charges'] = df['MonthlyCharges'].apply(lambda x: 1 if x > 70 and x < 110 else 0)

In [None]:
# Choose important features and create dataset with them

features = ['SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'is_fiber_optic', 'no_internet_service', 'no_online_security', 'no_online_backup', 
            'no_device_protection', 'no_tech_support', 'monthly_payments', 'electronic_check', 'short_tenure', 'high_charges']
X_selected = df[features + ['Churn']]

In [None]:
# Create the second dataset using one-hot encoding.

X_dummies = pd.get_dummies(data)
            
X_dummies.head()

In [None]:
# Create the third dataset with categorical features replaced by their frequencies.

X_freq = data.copy()

categ_features = X_freq.select_dtypes(include=object).columns
for column in categ_features:
    encoding = X_freq.groupby(column).size()
    encoding /= len(X_freq)
    X_freq[column] = X_freq[column].map(encoding)
X_freq.head()
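
Frequency encoding as used above replaces each category with its share of rows. The same mapping can be obtained with `value_counts(normalize=True)`; the sketch below shows it on a made-up column:

```python
import pandas as pd

# Made-up column for illustration
toy = pd.DataFrame({"Contract": ["Month-to-month", "Month-to-month", "One year", "Two year"]})

# Share of rows per category
freq = toy["Contract"].value_counts(normalize=True)

# Replace every category with its share
toy["Contract_freq"] = toy["Contract"].map(freq)
print(toy)
```

Note that categories with equal shares (here 'One year' and 'Two year') receive the same code, so this encoding can merge distinct categories.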

In [None]:
# Plot correlations between features and target

fig, ax = plt.subplots(1, 3, figsize=(25,15), )
fig.subplots_adjust(left=0.4)

sns.heatmap(X_dummies.corr()[['Churn']].sort_values(by='Churn', ascending=False), vmin=-1, vmax=1, cmap='YlGnBu', annot=True, ax=ax[0])
ax[0].set_title("Dummies data")

sns.heatmap(X_freq.corr()[['Churn']].sort_values(by='Churn', ascending=False), vmin=-1, vmax=1, cmap='YlGnBu', annot=True, ax=ax[1])
ax[1].set_title("Frequencies data")

sns.heatmap(X_selected.corr()[['Churn']].sort_values(by='Churn', ascending=False), vmin=-1, vmax=1, cmap='YlGnBu', annot=True, ax=ax[2])
ax[2].set_title("Selected features")

# ax[0].tick_params(labelsize=16)
plt.tight_layout()
plt.rc('font', size='14')
plt.show()

**We can see that the feature selection is reasonable, but some of the constructed features have lost part of the information and have a slightly smaller correlation with the target.**

In [None]:
# Exclude target from datasets

X_dummies = X_dummies.drop(columns=['Churn'])
X_freq = X_freq.drop(columns=['Churn'])
X_selected = X_selected.drop(columns=['Churn'])

**So we have three datasets for model fitting:**
1. Using one-hot encoding with the pandas.get_dummies function
2. With categorical features replaced with their frequencies
3. With features selected after dataset analysis.

In [None]:
variants = [X_dummies, X_freq, X_selected]

In [None]:
# Target dataset

target = data['Churn']

# 3. Models construction

In [None]:
from sklearn.model_selection import (GridSearchCV,
                                     train_test_split,
                                     StratifiedKFold)

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

from xgboost import XGBClassifier

from sklearn.preprocessing import Normalizer

rnd_state = 17
test_size_ = 0.2

In [None]:
# Create dictionaries with models

models = {'gbc': GradientBoostingClassifier(), 
          'rfc': RandomForestClassifier(), 
          'svc': SVC(), 
          'lr': LogisticRegression(),
          'xgb': XGBClassifier()
         }

In [None]:
# Determine models parameters for using with GridSearchCV

gbc_params = {'learning_rate': np.arange(0.1, 0.6, 0.1), 
              'random_state': [rnd_state]} # GradientBoostingClassifier

rfc_params = {'n_estimators': range(10, 100, 10), # RandomForestClassifier
              'min_samples_leaf': range(1, 5), 
              'random_state': [rnd_state]}

svc_params = {'kernel': ['rbf', 'sigmoid'], # SVC
              'C' : [0.1, 1, 5, 10], 
              'gamma' : [0.01, 0.1, 0.9, 1], 
              'random_state': [rnd_state]}

lr_params = {'C': np.arange(0.5, 1, 0.1), # LogisticRegression
             'max_iter': [1000],
             'random_state': [rnd_state]}

xgb_params = {'learning_rate' : [0.01, 0.03, 0.05], # XGBClassifier
              'max_depth' : [1, 4, 6], 
              'n_estimators' : [100, 300, 400, 600, 1000], 
              'random_state': [rnd_state]}

params = [gbc_params, rfc_params, svc_params, 
          lr_params, xgb_params]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=rnd_state)

In [None]:
def grid_search_selector(features_df, target_df, models, params, test_size=0.2, random_state=None):
    """
    Takes features and target datasets, a dict of models and a list of parameter grids
    (in the same order as the models).
    Divides the data into train and validation parts and searches for the best parameters with GridSearchCV.
    Prints the validation scores and returns the best parameters and the validation scores.
    """
    best_params = {}
    best_scores = {}

    X_train, X_test, y_train, y_test = train_test_split(features_df, target_df, test_size=test_size, random_state=random_state)
    
    for n, (name, model) in enumerate(models.items()):
        clf = GridSearchCV(estimator=model, param_grid=params[n], cv=skf).fit(X_train, y_train)
        best_params[name] = clf.best_params_
        best_scores[name] = clf.score(X_test, y_test)
        print(f"{str(name)} -- {best_scores[name]}")
    
    return best_params, best_scores
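
To illustrate what the function does for a single model: `GridSearchCV` selects parameters by cross-validation on the training part only, and calling `score` on the fitted search evaluates the refitted best estimator on the held-out part. A small sketch on synthetic data (the data from `make_classification` and the grid are assumptions, not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the features and target
X, y = make_classification(n_samples=200, n_features=5, random_state=17)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=17)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.5, 1.0]}, cv=cv)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)            # mean cross-validated accuracy on the training part
print(search.score(X_test, y_test))  # accuracy of the refitted best model on the held-out part
```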

In [None]:
# Find best parameters for models for three datasets

best_dummies, dummies_scores = grid_search_selector(X_dummies, target, models, params, test_size=test_size_, random_state=rnd_state)

In [None]:
best_dummies

In [None]:
best_freq, freq_scores = grid_search_selector(X_freq, target, models, params, test_size=test_size_, random_state=rnd_state)

In [None]:
best_freq

In [None]:
best_selected, selected_scores = grid_search_selector(X_selected, target, models, params, test_size=test_size_, random_state=rnd_state)

In [None]:
best_selected

In [None]:
# Combine scores to dataframe to choose the best

all_scores = pd.DataFrame([dummies_scores, freq_scores, selected_scores], index=['dummies', 'frequencies', 'selected'])

In [None]:
plt.figure(figsize=(7, 5))
sns.heatmap(all_scores, annot=True, fmt='.3g')
plt.yticks(rotation=0)
plt.show()

# 4. Models evaluation

In [None]:
def model_evaluation(model, params: dict, features_df, target_df, test_size_=None, random_state=None):
    """
    Fits the model with the given parameters, prints train and test accuracy and the
    classification report for the test data, and plots the ROC curve and confusion matrix
    """
    X_train, X_test, y_train, y_test = train_test_split(features_df, target_df, test_size=test_size_, random_state=random_state)
    
    model = model.set_params(**params)
    model.fit(X_train, y_train)
    
    y_pred_train = model.predict(X_train)
    y_pred = model.predict(X_test)
    
    acc_score_train = metrics.accuracy_score(y_train, y_pred_train)
    acc_score = metrics.accuracy_score(y_test, y_pred)
    print(f"Train accuracy: {acc_score_train:.3f}, test accuracy: {acc_score:.3f}")
    
    print(metrics.classification_report(y_test, y_pred, target_names=['Non-churned', 'Churned']))

    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    # plot_roc_curve and plot_confusion_matrix were removed in scikit-learn 1.2;
    # the Display classes below are available since scikit-learn 1.0
    metrics.RocCurveDisplay.from_estimator(model, X_test, y_test, ax=axes[0])
    axes[0].plot([0, 1], [0, 1], linestyle='--')  # chance-level diagonal
    metrics.ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, display_labels=['Non-churned', 'Churned'], cmap='GnBu', ax=axes[1])
    plt.tight_layout()

In [None]:
# XGBClassifier on frequencies data

model_evaluation(models['xgb'], best_freq['xgb'], X_freq, target, test_size_, rnd_state)

In [None]:
# SVC on selected features

model_evaluation(models['svc'], best_selected['svc'], X_selected, target, test_size_, rnd_state)

In [None]:
# RandomForestClassifier on frequencies data

model_evaluation(models['rfc'], best_freq['rfc'], X_freq, target, test_size_, rnd_state)

In [None]:
# LogisticRegression on one-hot encoded data

model_evaluation(models['lr'], best_dummies['lr'], X_dummies, target, test_size_, rnd_state)

In [None]:
# LogisticRegression on frequencies data

model_evaluation(models['lr'], best_freq['lr'], X_freq, target, test_size_, rnd_state)

**As we can see, all models have almost the same quality.<br><br>
The SVC model on selected features has the highest precision, but for predicting accounts that are more likely to churn, the LogisticRegression model is slightly better, as it has higher recall.**
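
For reference, precision and recall for the churned class follow directly from the confusion matrix counts. The counts below are made up to show the arithmetic and are not results of the models above:

```python
# Made-up confusion matrix counts for the 'Churned' class
tp, fp, fn = 60, 20, 40

precision = tp / (tp + fp)  # share of predicted churners that actually churned
recall = tp / (tp + fn)     # share of actual churners that the model caught

print(f"precision={precision:.2f}, recall={recall:.2f}")
```

A model with higher recall misses fewer churning accounts, at the cost of possibly flagging more accounts that would have stayed.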