### What is Customer Churn?
#### Customer churn means a customer leaving a product subscription or a service. Every company wants to grow its customer base, but while doing so it also wants existing customers to keep using its service. Hence companies build churn models to detect customers who are likely to churn and try to retain them by talking to them, giving offers/rewards, etc. In this notebook we try to predict churn on a customer dataset.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

#### Load the dataset

In [None]:
df = pd.read_csv('/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv', sep=',')
df.head()

#### Get to know the size of data

In [None]:
rows = df.shape[0]
cols = df.shape[1]
print("Rows: {}, cols:{} ".format(rows, cols))

#### Check for null values (note that the statement below doesn't catch empty strings: null and "" are different)

In [None]:
df.isnull().sum().values.sum()
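
To make the difference between null and "" concrete, here is a minimal sketch on a made-up `toy` DataFrame (not the Telco data; the values are invented for illustration). It shows that `isnull()` reports nothing when the gaps are blank strings, and one possible way to count those blanks separately.

```python
import pandas as pd

# Hypothetical toy column: two of the four entries are blank strings, none are NaN
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15", ""]})

# isnull() only flags real missing values (NaN/None), so it sees nothing here
null_count = int(toy["TotalCharges"].isnull().sum())

# Strip whitespace and compare with "" to count blank entries
blank_count = int(toy["TotalCharges"].astype(str).str.strip().eq("").sum())

print("nulls: {}, blanks: {}".format(null_count, blank_count))
```

Running a check like this on every object-typed column is one way to spot hidden gaps before converting column types.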

#### Finding unique values in each column

In [None]:
df.nunique()

#### Look at the values in columns like MultipleLines, OnlineSecurity, etc. Values like "No phone service"/"No internet service" can be replaced by "No" in their respective columns. Try looking at your data in Excel and you will see why we are doing this.

In [None]:
df['MultipleLines'] = df['MultipleLines'].replace({'No phone service': 'No'})
df['MultipleLines'].unique()

In [None]:
replace_cols = ['OnlineSecurity', 'OnlineBackup',
                'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']
for col in replace_cols:
    print("Col:{}, unique: {} ".format(col, df[col].unique()))
    # Assign the result back instead of calling replace(..., inplace=True) on a column selection
    df[col] = df[col].replace({'No internet service': 'No'})

#### Check column types and change the type if required

In [None]:
print(df.dtypes)

In [None]:
# Expected to raise a ValueError: TotalCharges contains a few blank strings (" ")
df["TotalCharges"] = df["TotalCharges"].astype(float)

#### The type conversion to float failed because the TotalCharges column has a few empty strings in it. Remove them and then try again

In [None]:
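# Turn the blank strings into real missing values so they can be counted and dropped
df['TotalCharges'] = df["TotalCharges"].replace(" ", np.nan)
print("Number of null values in TotalCharges: {}".format(df['TotalCharges'].isnull().sum()))

# Drop the rows with missing TotalCharges, renumber the index, then convert to float
df = df[df['TotalCharges'].notnull()].reset_index(drop=True)
df["TotalCharges"] = df["TotalCharges"].astype(float)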
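
As an aside, a possible alternative to the replace-then-cast steps above is `pd.to_numeric` with `errors="coerce"`, which turns anything it cannot parse into NaN in a single call. The sketch below runs on a made-up `toy` Series, not on the notebook's `df`.

```python
import pandas as pd

# Hypothetical toy values mimicking TotalCharges: one entry is a blank string
toy = pd.Series(["29.85", " ", "108.15"])

# errors="coerce" converts unparseable entries (here the blank) to NaN
converted = pd.to_numeric(toy, errors="coerce")

print(converted)
print("missing after conversion:", int(converted.isna().sum()))
```

The rows that became NaN can then be dropped or imputed, depending on how many there are.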

In [None]:
bin_labels_5 = ['Tenure1', 'Tenure2', 'Tenure3', 'Tenure4', 'Tenure5']
df['TenureBin'] = pd.qcut(df['tenure'],
                              q=[0, .2, .4, .6, .8, 1],
                              labels=bin_labels_5)

df = df.drop('tenure', axis=1)
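
The cell above uses `pd.qcut`, which cuts at quantiles so that each bin holds roughly the same number of customers (unlike `pd.cut`, which uses equal-width intervals). A minimal sketch on made-up tenure values, with `toy_tenure` and `toy_bins` as illustrative names:

```python
import pandas as pd

# Hypothetical tenure values in months, 1 through 10
toy_tenure = pd.Series(range(1, 11))

labels = ['Tenure1', 'Tenure2', 'Tenure3', 'Tenure4', 'Tenure5']
toy_bins = pd.qcut(toy_tenure, q=[0, .2, .4, .6, .8, 1], labels=labels)

# Each quantile bin receives the same number of values for this toy input
bin_counts = toy_bins.value_counts().sort_index()
print(bin_counts)
```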

In [None]:
df.head()

## EDA

In [None]:
churn = df[df['Churn']=='Yes']
non_churn = df[df['Churn']=='No']

In [None]:
churn_values = df['Churn'].value_counts().values.tolist()
churn_keys = df['Churn'].value_counts().keys().tolist()

print("labels are ", churn_keys)
print("values are ", churn_values)

fig1, ax1 = plt.subplots()
ax1.pie(churn_values, explode=(0, 0.1), labels=churn_keys, autopct='%1.1f%%',
        shadow=True)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.legend(['Non Churn', 'Churn'])
plt.show()

In [None]:
def plot_pie(col):
    # Sort by category so both pies list the categories in the same order;
    # value_counts() alone orders by frequency, which can differ between the two groups
    churn_counts = churn[col].value_counts().sort_index()
    nonchurn_counts = non_churn[col].value_counts().sort_index()

    f, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
    # explode=(0, 0.1) assumes the column has exactly two categories
    ax1.pie(churn_counts.values, explode=(0, 0.1), labels=churn_counts.index,
            autopct='%1.1f%%', shadow=True)
    ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
    ax1.set_title('Churn')

    ax2.pie(nonchurn_counts.values, explode=(0, 0.1), labels=nonchurn_counts.index,
            autopct='%1.1f%%', shadow=True)
    ax2.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
    ax2.set_title('Non churn')

    # The wedges are the column's categories, so the wedge labels replace the old Churn/Non Churn legend
    f.suptitle(col)
    plt.show()

In [None]:
plot_pie('SeniorCitizen')

#### Senior Citizens have a higher tendency to churn

In [None]:
plot_pie('gender')

#### Gender as a univariate feature is neutral towards churning

In [None]:
plot_pie('Partner')

#### Those who have a partner tend to churn less. Companies should help their customers get a partner, right? XD

In [None]:
plot_pie('Dependents')

#### Customers with dependents churn out less
#### Similarly we can analyze other binary features 
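
One compact way to analyze the other binary features without drawing a pie chart for each is to compute the churn rate per category. The sketch below uses a made-up `toy` frame (the rows are invented, not taken from the dataset); on the real data the same pattern would apply to `df` before `Churn` is label-encoded.

```python
import pandas as pd

# Hypothetical toy data: six customers with a Partner flag and a Churn label
toy = pd.DataFrame({
    "Partner": ["Yes", "Yes", "No", "No", "No", "Yes"],
    "Churn":   ["No",  "No",  "Yes", "Yes", "No", "Yes"],
})

# Fraction of churners within each Partner category
churn_rate = toy["Churn"].eq("Yes").groupby(toy["Partner"]).mean()
print(churn_rate)
```

A higher rate in one category than in the other suggests that the feature carries some signal about churn.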

In [None]:
df.head()

In [None]:

# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(data=df, x='MonthlyCharges', y='TotalCharges', hue='Churn')
plt.show()

#### We see that customers whose monthly charges are high tend to churn the most

In [None]:
fig = plt.figure(figsize=(7,7))
sns.scatterplot(data=df, x='MonthlyCharges', y='TotalCharges', hue='TenureBin')
plt.show()

#### Customers tend to churn more early in their tenure than later on. Companies use this pattern to focus on retaining new customers and making them stick with the service.

### Data Processing

In [None]:
df.head()

In [None]:
df = df.drop('customerID', axis=1)

In [None]:
binary_cols = [col for col in df.columns.tolist() if df[col].nunique()==2]

categorical_cols = [col for col in df.columns.tolist() if df[col].nunique() < 6]
categorical_cols = [col for col in categorical_cols if col not in binary_cols]

target_col = ['Churn']

numerical_cols = [col for col in df.columns.tolist() if col not in binary_cols+categorical_cols+target_col]

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

le = LabelEncoder()
for col in binary_cols :
    df[col] = le.fit_transform(df[col])
    
df = pd.get_dummies(data=df, columns=categorical_cols)

scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
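
Note that the cell above fits the scaler on the full dataset, before the train/test split further down. A commonly recommended variant is to fit the scaler on the training rows only and reuse it on the test rows, so that test-set statistics do not leak into preprocessing. The sketch below illustrates that idea on synthetic data; the arrays and names (`X_tr`, `X_te`, and so on) are invented for the example and are not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numerical columns and a binary target
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
y = np.array([0, 1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)

# Learn mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuses the training statistics

print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
```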

In [None]:
f = plt.figure(figsize=(19, 15))
plt.matshow(df.corr(), fignum=f.number)
plt.xticks(range(df.shape[1]), df.columns, fontsize=14, rotation=90)
plt.yticks(range(df.shape[1]), df.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)

### Model building

In [None]:
import plotly.offline as py#visualization
py.init_notebook_mode(connected=True)#visualization
import plotly.graph_objs as go#visualization
import plotly.tools as tls#visualization
import plotly.figure_factory as ff#visualization

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report
from sklearn.metrics import roc_auc_score,roc_curve  # `scorer` is not importable in recent scikit-learn versions
from sklearn.metrics import f1_score
import statsmodels.api as sm
from sklearn.metrics import precision_score,recall_score
from sklearn.model_selection import GridSearchCV

In [None]:
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from imblearn.ensemble import BalancedRandomForestClassifier

In [None]:
target = df['Churn']
train = df.drop('Churn', axis=1)

In [None]:
# target = target.to_numpy()

#### Split dataset into train and test

In [None]:
X_train,X_test,y_train,y_test = train_test_split(train, target, test_size=0.15, shuffle = True, stratify=target )

In [None]:
def grid_search(params, model):
    # Cross-validated search over `params`, scored by F1 on the churn class
    search = GridSearchCV(model, params, scoring='f1')
    search.fit(X_train, y_train)
    print('Best score: %0.3f' % search.best_score_)
    print('Best parameters set:', search.best_params_)
    # The fitted search object predicts with the refitted best estimator
    return search

def print_classification_report(model):
    predictions = model.predict(X_test)
    # conf_matrix = confusion_matrix(y_test,predictions)
    target_names = ['Not churn', 'Churn']
    print("\n")
    print(classification_report(y_test, predictions, target_names = target_names))
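
The grid search above is scored with F1, and the classification report prints precision, recall and F1 per class. To show how those numbers relate, here is a small worked example on made-up labels (`y_true` and `y_pred` are invented, not model output): precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels: 6 non-churners (0) followed by 4 churners (1)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # 2 * precision * recall / (precision + recall)

print("tn={}, fp={}, fn={}, tp={}".format(tn, fp, fn, tp))
print("precision={:.3f}, recall={:.3f}, f1={:.3f}".format(precision, recall, f1))
```

On imbalanced data, accuracy can look good even when most churners are missed, which is one reason F1 on the churn class is a reasonable scoring choice here.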

### Use Logistic Regression

In [None]:
parameters = {'C': (0.1, 0.5,1)}
logit = grid_search(parameters, LogisticRegression())
print_classification_report(logit)

### Using weighted XGBoost (since the data is imbalanced, i.e. the number of data points in one class is much greater than in the other)

In [None]:
from xgboost import XGBClassifier

#### Calculate the weight of the positive class (we give more weight to the churn class since it has fewer samples, so the loss is higher when the classifier predicts incorrectly for churn-class samples)

In [None]:
churn_values = df['Churn'].value_counts().values.tolist()
pos_weight = churn_values[0]/churn_values[1]
print("pos_weight is ", pos_weight)
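
The ratio above is the number of negative (non-churn) samples divided by the number of positive (churn) samples. The cell relies on `value_counts()` listing the majority class first. A slightly more explicit sketch, on a made-up label Series `y_toy`, looks the counts up by label instead:

```python
import pandas as pd

# Hypothetical encoded target: 8 non-churners (0) and 2 churners (1)
y_toy = pd.Series([0] * 8 + [1] * 2)

counts = y_toy.value_counts()
# Look up counts by class label rather than by position
pos_weight = counts.loc[0] / counts.loc[1]

print("pos_weight is ", pos_weight)
```

With this toy input, each churn sample counts four times as much as a non-churn sample when the value is passed as `scale_pos_weight`.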

In [None]:
parameters = {'max_depth': (5, 8, 10), 'n_estimators': (70, 100, 150)}
xgb_model = grid_search(parameters, XGBClassifier(scale_pos_weight=pos_weight))
print_classification_report(xgb_model)

### Weighted Random forests

In [None]:
parameters = {'n_estimators': (15, 20, 50), 'max_depth': (5,10,12)}
rf_clf = grid_search(parameters, RandomForestClassifier(class_weight='balanced_subsample'))
print_classification_report(rf_clf)
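
For intuition about what the class weighting does: scikit-learn's "balanced" mode sets each class weight to n_samples / (n_classes * count of that class), and "balanced_subsample" applies the same formula to each tree's bootstrap sample. The sketch below computes the "balanced" weights for a made-up label array `y_toy` (invented for illustration).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical encoded target: 8 non-churners (0) and 2 churners (1)
y_toy = np.array([0] * 8 + [1] * 2)

weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y_toy)

# Same numbers by hand: n_samples / (n_classes * class_count)
manual = len(y_toy) / (2 * np.bincount(y_toy))

print(weights, manual)
```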


#### Similarly, you can try many other models. You can also try out different techniques such as PCA (most helpful when there are a lot of dimensions, but it can be used here to visualize the data), SMOTE (instead of using weighted models, we can oversample the churn class with this technique), plotting the AUC metric, etc. I have tried to keep this notebook simple so that you can get started easily.
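
As a starting point for the AUC suggestion, here is a minimal sketch that computes ROC AUC and the ROC curve points for a logistic regression. It runs on synthetic imbalanced data from `make_classification`, so the data, the split and the variable names are assumptions for illustration; with the notebook's own objects you would use `X_train`, `X_test`, `y_train` and `y_test` instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 75% negatives, 25% positives
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.75, 0.25], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC is computed from predicted probabilities of the positive class, not hard labels
proba = clf.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba)
fpr, tpr, thresholds = roc_curve(y_te, proba)

print("ROC AUC: %0.3f" % auc)
```

Plotting `fpr` against `tpr` with `plt.plot(fpr, tpr)` gives the ROC curve.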

#### Please comment below if you have any suggestions. I would be happy to consider them.
#### Like the notebook if you found it helpful. Thanks!!