# What is Customer Churn?

> Churn rate, in its broadest sense, is a measure of the number of individuals or items moving out of a collective group over a specific period. It is one of two primary factors that determine the steady-state level of customers a business will support. (Source: Wikipedia)
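
As a minimal sketch of the definition above, churn rate over a period can be computed as the customers lost divided by the customers present at the start of that period. The function name `churn_rate` and the figures used are illustrative assumptions, not values from the dataset.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Fraction of the starting customer base that left during the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


# Hypothetical figures: 1,000 customers at the start of the month, 50 of them left.
rate = churn_rate(1000, 50)
print(f"Monthly churn rate: {rate:.1%}")
```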

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization 

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
data = pd.read_csv('/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.shape

# **Analysis of Target Variable**

A **Bar Chart** is a good choice when we want to show how a quantity varies among some discrete set of items. In our case, our target variable 'Churn' is discrete.

In [None]:
import seaborn as sns
sns.set(style='darkgrid')
sns.catplot(x="Churn", kind="count",edgecolor='.3',data=data)

In [None]:
# normalize=True returns the share of each class instead of raw counts
print("Share of customers by churn status:\n{}".format(data['Churn'].value_counts(normalize=True)))

### **To quantify our target variable: about 73% of customers stayed, so the customer churn rate is roughly 27%.**
### That is high compared with the benchmark below.

> A typical “good” churn rate for SaaS companies that target small businesses is 3-5% monthly. The larger the businesses you target, the lower your churn rate has to be as the market is smaller. For an enterprise-level product (talking $X,000-$XX,000 per month), churn should be < 1% monthly. [source](https://www.cobloom.com/blog/churn-rate-how-high-is-too-high)
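
To put monthly benchmarks like these on an annual scale, here is a rough sketch. It assumes the monthly rate is constant and that each month's churn applies to the customers remaining from the previous month; `annual_churn` is a hypothetical helper name.

```python
def annual_churn(monthly_rate: float, months: int = 12) -> float:
    """Share of the starting customers lost after `months` periods of constant churn."""
    if not 0.0 <= monthly_rate <= 1.0:
        raise ValueError("monthly_rate must be between 0 and 1")
    # (1 - monthly_rate) ** months is the share of customers still retained.
    return 1 - (1 - monthly_rate) ** months


for m in (0.01, 0.03, 0.05):
    print(f"{m:.0%} monthly churn -> {annual_churn(m):.1%} over 12 months")
```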

**So, let's dig into the data and get started.**

## Data Cleaning:

Data cleaning means checking for errors, dealing with special values, converting data into different formats, and performing calculations.

Let's check for spaces and blanks in our data.

In [None]:
columns = data.columns

In [None]:
data[data[columns] == " "].count()

We can see there are 11 blanks in the 'TotalCharges' column. Let's check the 'Churn' status for these rows.

In [None]:
data[data['TotalCharges'] == " "].Churn

Next, replace these blank values with 0 and convert the column to a numeric type, since 'TotalCharges' is a continuous variable.

In [None]:
data['TotalCharges'] = data["TotalCharges"].replace(" ", 0).astype('float32')
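
An alternative way to handle the blanks is sketched below on a small made-up frame: `pd.to_numeric` with `errors="coerce"` turns any unparseable string into `NaN`, which can then be filled explicitly. The `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15", " "]})

# Count the cells that contain only whitespace.
blank_mask = toy["TotalCharges"].str.strip() == ""
n_blank = int(blank_mask.sum())
print("Blank entries:", n_blank)

# Unparseable strings become NaN, which are then replaced with 0.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce").fillna(0)
print(toy["TotalCharges"].tolist())
```

Coercing first makes the missing values visible as `NaN`, so the choice of fill value (0 here, to match the cell above) stays an explicit decision.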

## Handling Categorical data

In [None]:
cat_data = data.select_dtypes(include = 'object').copy()
cat_data = cat_data.drop(columns='customerID')
cat_data.head(2)

In [None]:
sns.countplot(data = data, x = 'gender',edgecolor='.3',alpha=0.8)

From the plot above, the data is balanced with respect to gender. Let's see how gender relates to the target variable 'Churn'.

In [None]:
sns.countplot(data = data, x = 'gender',hue='Churn',alpha=0.8)

In [None]:
sns.violinplot(data=data, x='gender', y='MonthlyCharges', palette='pastel')

From the violin plot above, the median and distribution of monthly charges are more or less the same for both genders. So, based on the count plot and the violin plot, 'gender' alone may not be a good predictive feature of 'Churn'.
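
The visual comparison can also be expressed as numbers. The sketch below uses a small made-up frame (not the real dataset) and `pd.crosstab` with `normalize="index"` to get the churn share within each gender; similar shares across groups would support the reading that the feature carries little signal on its own.

```python
import pandas as pd

# Made-up sample for illustration; the real notebook would use `data` instead.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male", "Female", "Male", "Female", "Male"],
    "Churn":  ["Yes",    "No",   "No",     "Yes",  "No",     "No",   "No",     "No"],
})

# normalize="index" makes each row (gender) sum to 1.
rates = pd.crosstab(toy["gender"], toy["Churn"], normalize="index")
print(rates)
```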

Let's see, on average, **how long our customers stay with us**.

In [None]:
sns.boxenplot(data=data, x='Churn', y='tenure')

Around 10 months, which is not too bad.

# **Plotting univariate distributions**

> A histogram represents the distribution of data by forming bins along the range of the data and then drawing bars to show the number of observations that fall in each bin. [Source - Seaborn Tutorial](https://seaborn.pydata.org/tutorial/distributions.html#plotting-univariate-distributions)
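
The binning described in the quote can be sketched with `np.histogram`, which returns the per-bin counts and the bin edges without drawing anything. The observations below are made up for illustration.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 3, 3, 8, 9])  # made-up observations

# Four equal-width bins spanning the range of the data.
counts, edges = np.histogram(values, bins=4)

# Each bin includes its left edge; only the last bin also includes its right edge.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count}")
```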

In [None]:
num_attr = ['tenure','MonthlyCharges','TotalCharges']
num_data = data[num_attr]

In [None]:
num_data.hist(bins=15, color='steelblue', edgecolor='black', linewidth=1.0,
              xlabelsize=8, ylabelsize=8, grid=False)
# tight_layout's rect must lie within the figure's 0-1 coordinates, so use the default here
plt.tight_layout()

> The most convenient way to take a quick look at a univariate distribution in seaborn is the distplot() function. By default, this will draw a histogram and fit a kernel density estimate (KDE).
 [Source - Seaborn Tutorial](https://seaborn.pydata.org/tutorial/distributions.html#plotting-univariate-distributions)

In [None]:
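# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a histogram plus a KDE
sns.histplot(data['TotalCharges'], kde=True)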

In [None]:
resp = data['TotalCharges']
resp.skew()

## But wait, what is Skewed Distribution?

> If one tail is longer than another, the distribution is skewed. These distributions are sometimes called asymmetric or asymmetrical distributions as they don’t show any kind of symmetry. Symmetry means that one half of the distribution is a mirror image of the other half. For example, the normal distribution is a symmetric distribution with no skew. The tails are exactly the same.
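
A small sketch of the idea, using synthetic samples rather than the dataset: a normal sample should have skewness near zero, while an exponential sample has a long right tail and clearly positive skewness. The seed and sample sizes are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

symmetric = pd.Series(rng.normal(size=10_000))          # no long tail on either side
right_skewed = pd.Series(rng.exponential(size=10_000))  # long tail to the right

print("normal sample skew:     ", round(symmetric.skew(), 3))
print("exponential sample skew:", round(right_skewed.skew(), 3))
```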

## Handle Skewed data

It is quite evident from the above plot that there is a definite right skew in the distribution. 

> If the values of a certain independent variable (feature) are skewed, depending on the model, skewness may violate model assumptions (e.g. logistic regression) or may impair the interpretation of feature importance. We can address skewed variables by transforming them (i.e. applying the same function to each value). Common transformations include square root (sqrt(x)), logarithmic (log(x)), and reciprocal (1/x). [source](https://medium.com/@ODSC/transforming-skewed-data-for-machine-learning-90e6cc364b0)

Let's apply sqrt function to 'TotalCharges' to understand further.

In [None]:
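# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a histogram plus a KDE
sns.histplot(np.sqrt(data['TotalCharges']), kde=True)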

Well, it’s not normally distributed for sure, but is a lot better than what we had before!
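
To compare transformations numerically rather than by eye, the following sketch computes the skewness of a synthetic right-skewed sample before and after a square-root and a `log1p` transform. The sample is generated, not taken from 'TotalCharges', so the numbers only illustrate the procedure.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = pd.Series(rng.exponential(scale=2000, size=10_000))  # synthetic, right-skewed, non-negative

skews = {
    "raw": x.skew(),
    "sqrt": np.sqrt(x).skew(),
    "log1p": np.log1p(x).skew(),  # log1p(x) = log(1 + x), defined at x = 0
}
for name, value in skews.items():
    print(f"{name:>5}: {value:.2f}")
```

Whichever transform is chosen, a skewness closer to zero indicates a more symmetric distribution; a stronger transform can also overshoot and produce a left skew.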

In [None]:
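# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a histogram plus a KDE
sns.histplot(data['MonthlyCharges'], kde=True)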

# Identifying relation between Continuous Variables:

In [None]:
num_attributes = ['tenure', 'MonthlyCharges','TotalCharges']
data_num = data[num_attributes]

In [None]:
sns.scatterplot(x="TotalCharges", y="MonthlyCharges", hue="tenure", data=data)

In [None]:
attributes = ['MonthlyCharges', 'TotalCharges']
for x in attributes:
    sns.relplot(x='tenure',y=x, kind='line', data=data_num)

In [None]:
cat_data = data.drop(columns=num_attributes)

# Identifying relation between Continuous and Categorical Variables

## Churn vs Gender based on TotalCharges spent


In [None]:
sns.catplot(x='Churn', y='TotalCharges',hue='gender',kind='bar',edgecolor='.3', data=data)

## Churn vs Contract based on TotalCharges spent

In [None]:
g = sns.catplot(x="TotalCharges", y="Churn", row="Contract",
                kind="bar", orient="h", height=1.5, aspect=4,
                data=data.query("TotalCharges < 3000"))


## Churn vs SeniorCitizen based on MonthlyCharges spent

In [None]:
g = sns.catplot(x="MonthlyCharges", y="Churn", row="SeniorCitizen",
                kind="bar", orient="h", height=1.5, aspect=4,
                data=data.query("TotalCharges < 3000"))

## Analysis of remaining discrete variables vs continuous variable: 'MonthlyCharges'

In [None]:
fig = plt.figure(figsize = (15,10))

ax1 = fig.add_subplot(2,3,1)
sns.countplot(data = data, x = 'Partner', ax=ax1)

ax2 = fig.add_subplot(2,3,2)
sns.countplot(data = data, x = 'Dependents', ax=ax2)

ax3 = fig.add_subplot(2,3,3)
sns.countplot(data = data, x = 'PaperlessBilling', ax=ax3)

ax4 = fig.add_subplot(2,3,4)
sns.violinplot(data=data, x='Partner', y='MonthlyCharges', ax=ax4, palette='pastel')

ax5 = fig.add_subplot(2,3,5)
sns.violinplot(data=data, x='Dependents', y='MonthlyCharges', ax=ax5, palette='pastel')

ax6 = fig.add_subplot(2,3,6)
sns.violinplot(data=data, x='PaperlessBilling', y='MonthlyCharges', ax=ax6, palette='pastel')


In [None]:
data = data.drop('customerID', axis=1)

In [None]:
from sklearn.preprocessing import LabelEncoder

def encoder(column):
    """Fit a fresh LabelEncoder on a single column and return its integer codes."""
    return LabelEncoder().fit_transform(column)

Using one shared label encoder for every feature in the data set is not recommended. Each feature has its own set of values, so it is safer to fit a separate label encoder per column, which is what the `encoder` helper does when it is applied column by column below.
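
To show why a separate mapping per column matters, here is a library-free sketch of label encoding. It assigns integer codes in sorted order of the distinct values, which is also how scikit-learn's `LabelEncoder` orders its classes. The helper `fit_label_mapping` and the sample values are illustrative assumptions.

```python
def fit_label_mapping(values):
    """Map each distinct value to an integer code, in sorted order of the values."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


# Made-up columns for illustration.
columns = {
    "Contract": ["Month-to-month", "Two year", "One year", "Month-to-month"],
    "Churn": ["No", "Yes", "No", "No"],
}

# One independent mapping per column, so code 1 means different things in each.
mappings = {name: fit_label_mapping(vals) for name, vals in columns.items()}
encoded = {name: [mappings[name][v] for v in vals] for name, vals in columns.items()}

print(mappings)
print(encoded)
```

Keeping the mappings per column also makes it possible to decode each feature back to its original labels later.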

In [None]:
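# Encode only the categorical (object) columns; label-encoding the numeric
# columns would replace their values with rank-like integer codes.
cat_cols = data.select_dtypes(include='object').columns
data[cat_cols] = data[cat_cols].apply(encoder)
data.head()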

# Tree-Based Models and Voting Classifier

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

### Split the data into train, validation and test sets

In [None]:
n = len(data)

n_val = int(0.2 * n)
n_test = int(0.2 * n)
n_train = n - (n_test + n_val)

np.random.seed(2)
idx = np.arange(n)
np.random.shuffle(idx)

df_shuffled = data.iloc[idx]

df_train = df_shuffled.iloc[:n_train].copy()
df_val = df_shuffled.iloc[n_train:n_train+n_val].copy()
df_test = df_shuffled.iloc[n_train+n_val:].copy()
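
The same 60/20/20 split logic can be wrapped in a small helper and sanity-checked, as sketched below. `split_indices` is a hypothetical name, and it uses `np.random.default_rng` instead of `np.random.seed`, so the shuffle order will differ from the cell above even with the same seed value.

```python
import numpy as np


def split_indices(n, val_frac=0.2, test_frac=0.2, seed=2):
    """Return shuffled, non-overlapping index arrays for train, validation and test."""
    n_val = int(val_frac * n)
    n_test = int(test_frac * n)
    n_train = n - (n_val + n_test)  # the training set absorbs any rounding remainder
    idx = np.random.default_rng(seed).permutation(n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


# 1003 is an arbitrary row count chosen so that the fractions do not divide evenly.
train_idx, val_idx, test_idx = split_indices(1003)
print(len(train_idx), len(val_idx), len(test_idx))

# Every row should land in exactly one of the three sets.
all_idx = np.concatenate([train_idx, val_idx, test_idx])
print("covers all rows once:", len(set(all_idx.tolist())) == 1003)
```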

In [None]:
# Apply the square-root transform to each split's own 'TotalCharges' column
df_train['TotalCharges'] = np.sqrt(df_train['TotalCharges'])
df_val['TotalCharges'] = np.sqrt(df_val['TotalCharges'])
df_test['TotalCharges'] = np.sqrt(df_test['TotalCharges'])

In [None]:
y_train = df_train['Churn']
y_val = df_val['Churn']
y_test = df_test['Churn']

In [None]:
# The positional axis argument was removed in pandas 2.0; use columns= instead
df_train = df_train.drop(columns='Churn')
df_val = df_val.drop(columns='Churn')
df_test = df_test.drop(columns='Churn')

In [None]:
classifiers = [['RandomForest :', RandomForestClassifier()],
              ['XGB :', XGBClassifier()]]

predictions_df = pd.DataFrame()
predictions_df['actual_labels'] = y_val

In [None]:
for name, classifier in classifiers:
    classifier.fit(df_train, y_train)
    predictions = classifier.predict(df_val)
    predictions_df[name.strip(" :")] = predictions
    # The built-in round() works whether accuracy_score returns a Python or a NumPy float
    print(name, "validation accuracy:", round(accuracy_score(y_val, predictions), 2))
    test_predictions = classifier.predict(df_test)
    print(name, "test accuracy:", round(accuracy_score(y_test, test_predictions), 2))

In [None]:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier
from sklearn.ensemble import AdaBoostClassifier

# max_features="auto" was removed in newer scikit-learn; "sqrt" is what it meant for classifiers
clf1 = RandomForestClassifier(n_estimators=1000, oob_score=True, n_jobs=-1,
                              random_state=50, max_features="sqrt",
                              max_leaf_nodes=30)
clf2 = CatBoostClassifier(logging_level='Silent')
clf3 = XGBClassifier()
clf4 = AdaBoostClassifier()
vc = VotingClassifier(estimators=[('rf', clf1),('Cat', clf2) ,('xgb', clf3),('Ada', clf4)],voting='soft')
vc.fit(df_train, y_train)
predictions = vc.predict(df_val)
print(accuracy_score(y_val, predictions))

In [None]:
test_predictions = vc.predict(df_test)
print(accuracy_score(y_test, test_predictions))
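
With `voting='soft'`, the ensemble averages the class probabilities predicted by each model and picks the class with the highest average, instead of counting one vote per model. The NumPy sketch below illustrates that idea with made-up probabilities for three hypothetical models and two customers; it is not the `VotingClassifier` implementation itself.

```python
import numpy as np

# Hypothetical probabilities [P(no churn), P(churn)]; shape is (models, customers, classes).
probas = np.array([
    [[0.60, 0.40], [0.30, 0.70]],  # model A
    [[0.55, 0.45], [0.45, 0.55]],  # model B
    [[0.20, 0.80], [0.40, 0.60]],  # model C
])

# Soft voting: average the probabilities across models, then take the most likely class.
avg = probas.mean(axis=0)
soft_votes = avg.argmax(axis=1)

# Hard voting, for comparison: each model casts one vote and the majority wins.
model_votes = probas.argmax(axis=2)
hard_votes = (model_votes.sum(axis=0) > probas.shape[0] / 2).astype(int)

print("soft:", soft_votes.tolist())
print("hard:", hard_votes.tolist())
```

For the first customer the two schemes disagree: two models lean slightly towards "no churn", but the third is confident enough about churn to pull the averaged probability above 0.5.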