# Introduction
Customer churn, also known as __customer attrition__, occurs when __customers__ or __subscribers__ stop doing business with a company or service. It is also referred to as the loss of clients or customers. Churn rates are particularly useful in the telecommunications industry, because most customers can choose among multiple providers within a geographic location.

In [None]:
import numpy as np 
import pandas as pd 
import eli5
import plotly
import plotly.graph_objs as go
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder
from scipy.stats import chisquare
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from eli5.sklearn import PermutationImportance

# enable plotly's offline mode inside the notebook
plotly.offline.init_notebook_mode(connected=True)

In [None]:
# read the dataset
dataset = pd.read_csv("../input/WA_Fn-UseC_-Telco-Customer-Churn.csv")

In [None]:
# an overview of the dataset
dataset.head()

In [None]:
dataset.shape

In [None]:
# list of columns in the dataset
dataset.columns

## Dataset Description
This dataset contains `7043` observations and `21` columns: `20` features and 1 label (`Churn`).

| __Feature Name__ | __Description__ | __Data Type__ |
| - | - | - |
| customerID | Contains customer ID | categorical | 
| gender | Whether the customer is female or male | categorical |
| SeniorCitizen | Whether the customer is a senior citizen or not (1, 0) | numeric, int |
| Partner | Whether the customer has a partner or not (Yes, No) | categorical |
| Dependents | Whether the customer has dependents or not (Yes, No) | categorical | 
| tenure | Number of months the customer has stayed with the company | numeric, int |
| PhoneService | Whether the customer has a phone service or not (Yes, No) | categorical |
| MultipleLines | Whether the customer has multiple lines or not (Yes, No, No phone service) | categorical |
| InternetService | Customer’s internet service provider (DSL, Fiber optic, No) | categorical |
| OnlineSecurity | Whether the customer has online security or not (Yes, No, No internet service) | categorical | 
| OnlineBackup |  Whether the customer has online backup or not (Yes, No, No internet service) | categorical | 
| DeviceProtection | Whether the customer has device protection or not (Yes, No, No internet service) | categorical |
| TechSupport | Whether the customer has tech support or not (Yes, No, No internet service) | categorical | 
| StreamingTV | Whether the customer has streaming TV or not (Yes, No, No internet service) | categorical |
| StreamingMovies | Whether the customer has streaming movies or not (Yes, No, No internet service) | categorical |
| Contract | The contract term of the customer (Month-to-month, One year, Two year) | categorical |
| PaperlessBilling | Whether the customer has paperless billing or not (Yes, No) | categorical |
| PaymentMethod | The customer’s payment method (Electronic check, Mailed check, Bank transfer, Credit card) | categorical |
| MonthlyCharges | The amount charged to the customer monthly | numeric, float |
| TotalCharges | The total amount charged to the customer  | object |
| Churn | Whether the customer churned or not (Yes or No) | categorical |

## Statistical Summary of the Dataset
The [DataFrame.describe()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) method generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding `NaN` values. One important detail is that, on a DataFrame with mixed column types, `describe()` summarizes only the `numeric` columns by default. `categorical` columns are skipped unless they are requested explicitly (for example with `include="all"`).

In [None]:
# only 3 features contain numerical values; the rest are categorical
dataset.describe()
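
A minimal sketch, on a toy frame with made-up values (not the Telco data), of how `describe()` treats numeric and non-numeric columns. By default only the numeric column is summarized, while `include="all"` adds `count`, `unique`, `top` and `freq` for the rest.

```python
import pandas as pd

# Toy frame with hypothetical values, used only for illustration
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
})

# Default behaviour: only the numeric column is summarized
numeric_summary = toy.describe()

# include="all" also reports count / unique / top / freq for non-numeric columns
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```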

## Data Preprocessing

In [None]:
# customer id is unnecessary
del dataset["customerID"]

### Encoding
* Binary Encoding
* One Hot Encoding

In [None]:
gender_map = {"Female" : 0, "Male": 1}
yes_no_map = {"Yes" : 1, "No" : 0}

dataset["gender"] = dataset["gender"].map(gender_map)

def binary_encode(features):
    for feature in features:
        dataset[feature] = dataset[feature].map(yes_no_map)

### Apply binary encoding to categorical features that contain only two categories

In [None]:
binary_encode_candidate = ["Partner", "Dependents", "PhoneService", "PaperlessBilling", "Churn"]
binary_encode(binary_encode_candidate)
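
One thing worth checking with `map`-based encoding is that any value missing from the mapping silently becomes `NaN`. The sketch below uses a toy frame with made-up rows to show how a three-category column such as `MultipleLines` would behave if it were passed through the Yes/No map by mistake.

```python
import pandas as pd

yes_no_map = {"Yes": 1, "No": 0}

# Toy rows (hypothetical); MultipleLines has a third category on purpose
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "MultipleLines": ["Yes", "No phone service", "No"],
})

# Map every column with the same Yes/No dictionary
encoded = toy.apply(lambda col: col.map(yes_no_map))

# Count values that were not covered by the mapping
unmapped = encoded.isna().sum()
print(unmapped)
```

A non-zero count signals that the feature has more than two categories and is a better fit for one-hot encoding.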

In [None]:
# convert the object-typed TotalCharges column to numeric
# errors='coerce' means that values which cannot be parsed are set to NaN
dataset["TotalCharges"] = pd.to_numeric(dataset["TotalCharges"], errors = 'coerce')
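
A small sketch of what `errors='coerce'` does, assuming the unparseable entries are blank strings (the values below are made up for illustration):

```python
import pandas as pd

# Hypothetical raw values: two valid numbers and one blank entry
raw = pd.Series(["29.85", " ", "1889.5"])

# The blank string cannot be parsed, so it is replaced by NaN
parsed = pd.to_numeric(raw, errors="coerce")

print(parsed)
print("null count:", parsed.isna().sum())
```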

In [None]:
# missing values check
print(dataset.isnull().any())
print("\n# of null values in 'TotalCharges': ", dataset["TotalCharges"].isnull().sum())

In [None]:
# fill null values with the mean of that feature
dataset["TotalCharges"] = dataset["TotalCharges"].fillna(dataset["TotalCharges"].mean())

### Apply One Hot Encoding to categorical features that contain more than two categories

In [None]:
# dtype=int keeps the dummy columns numeric (recent pandas versions default to bool)
dataset = pd.get_dummies(dataset, dtype=int)
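
To make the effect of one-hot encoding concrete, here is a sketch on a toy frame with made-up values. Numeric columns pass through unchanged, and each category of a text column becomes its own 0/1 column named `<column>_<category>`.

```python
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "tenure": [1, 34],
    "Contract": ["Month-to-month", "One year"],
})

# Only the text column is expanded; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```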

In [None]:
# now take a look at our final dataset
dataset.head()

In [None]:
dataset.describe().T

## Feature Selection
__Apply the $\chi^2$ test and keep only the 20 features with the highest $\chi^2$ weights__

In [None]:
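# NOTE: scipy.stats.chisquare is a one-sample goodness-of-fit test; it does not use the Churn label
rows = []

for column in dataset.columns:
    chi2, p = chisquare(dataset[column])
    rows.append({"Features": column, "Chi2Weights": chi2})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
result = pd.DataFrame(rows, columns=["Features", "Chi2Weights"])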

In [None]:
result = result.sort_values(by="Chi2Weights", ascending=False)

In [None]:
result.head(20)

In [None]:
# .copy() avoids chained-assignment warnings when columns are deleted later
new_df = dataset[result["Features"].head(20)].copy()

In [None]:
new_df.head()
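
The ranking in this section comes from `scipy.stats.chisquare`, which scores each column on its own and never looks at `Churn`. As an alternative that is not the method used in this kernel, `sklearn.feature_selection.chi2` scores each non-negative feature against the target. The sketch below uses synthetic data, and the column names `related` and `noise` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
n = 200

# Synthetic binary target standing in for a churn label
target = rng.integers(0, 2, size=n)

# "related" copies the target except for the first 20 rows, which are flipped;
# "noise" is drawn independently of the target
related = target.copy()
related[:20] = 1 - related[:20]
toy = pd.DataFrame({
    "related": related,
    "noise": rng.integers(0, 2, size=n),
})

# chi2 requires non-negative features and returns one score and p-value per column
scores, p_values = chi2(toy, target)
ranking = pd.Series(scores, index=toy.columns).sort_values(ascending=False)
print(ranking)
```

On real data the target column itself would be dropped from the feature matrix before scoring.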

### Finding Correlation

In [None]:
plt.figure(figsize = (15, 12))
sns.heatmap(new_df.corr(), cmap="RdYlBu", annot=True, fmt=".1f")
plt.show()

In [None]:
# these "No internet service" dummies all flag the same customers, so they duplicate each other
highly_corr_features = ["OnlineBackup_No internet service", "StreamingMovies_No internet service", "StreamingTV_No internet service", 
"TechSupport_No internet service", "DeviceProtection_No internet service", "OnlineSecurity_No internet service"]

def remove_corr_features(features):
    for feature in features:
        del new_df[feature]

In [None]:
remove_corr_features(highly_corr_features)

In [None]:
plt.figure(figsize = (12, 8))
sns.heatmap(new_df.corr(), cmap="RdYlBu", annot=True, fmt=".1f")
plt.show()
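
The highly correlated columns above were picked by reading the heatmap. A sketch of how the same idea could be automated, assuming a cut-off of `0.95` (a hypothetical threshold, not one used in this kernel) and a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

# Toy frame: "a_copy" duplicates "a", while "b" is only weakly related
toy = pd.DataFrame({
    "a":      [0, 1, 0, 1, 1, 0],
    "a_copy": [0, 1, 0, 1, 1, 0],
    "b":      [1, 1, 0, 0, 1, 0],
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.95  # assumed cut-off
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

reduced = toy.drop(columns=to_drop)
print(to_drop)
```

For every highly correlated pair this keeps the column that appears first and drops the later one.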

### Boxplot For Outlier Detection

In [None]:
trace = []

def gen_boxplot(df):
    for feature in df:
        trace.append(
            go.Box(
                name = feature,
                y = df[feature]
            )
        )
        
gen_boxplot(new_df)

In [None]:
data = trace
plotly.offline.iplot(data)
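
Outliers can also be listed numerically rather than read off the plot. The sketch below applies the common 1.5 × IQR rule, the convention that box-plot whiskers usually follow. The helper `iqr_outliers` and the sample values are hypothetical, and the quartile method may differ slightly from the one plotly uses.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]

# Made-up charges with one obvious extreme value
charges = pd.Series([20, 25, 30, 35, 40, 45, 500])
outliers = iqr_outliers(charges)
print(outliers)
```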

> __Note: you can interact with the boxplot, so play around with it. Double-click to return to the initial state.__


__This plot is generated with `plotly`. I have an interactive tutorial on `plotly`; you may visit that kernel:__

* [Getting started with Plotly (Part 1)](https://www.kaggle.com/nasirislamsujan/getting-started-with-plotly-part-1)

## Data Visualization

In [None]:
ax = new_df["Churn"].value_counts().plot(kind='bar', figsize=(6, 8), fontsize=13)
ax.set_ylabel("Number of Customers", fontsize=14);

totals = []
for i in ax.patches:
    totals.append(i.get_height())

total = sum(totals)

for i in ax.patches:
    ax.text(i.get_x() - .01, i.get_height() + .5, \
            str(round((i.get_height()/total)*100, 2))+'%', fontsize=15,
                color='#444444')
plt.show()
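
The percentages annotated on the bars can also be computed directly, without going through the plot. A minimal sketch on a made-up label column:

```python
import pandas as pd

# Hypothetical labels: three retained customers and one churned customer
churn = pd.Series([0, 0, 0, 1], name="Churn")

# normalize=True returns fractions, which are scaled to percentages here
share = churn.value_counts(normalize=True).mul(100).round(2)
print(share)
```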

In [None]:
new_df.columns

In [None]:
new_df["tenure"].unique()

In [None]:
_, ax = plt.subplots(1, 2, figsize= (16, 6))
sns.scatterplot(x="TotalCharges", y = "tenure" , hue="Churn", data=new_df, ax=ax[0])
sns.scatterplot(x="MonthlyCharges", y = "tenure" , hue="Churn", data=new_df, ax=ax[1])

In [None]:
facet = sns.FacetGrid(new_df, hue = "Churn", aspect = 3)
facet.map(sns.kdeplot, "TotalCharges", fill=True)  # `shade` is deprecated in favour of `fill`
facet.set(xlim=(0, new_df["TotalCharges"].max()))
facet.add_legend()

facet = sns.FacetGrid(new_df, hue = "Churn", aspect = 3)
facet.map(sns.kdeplot, "MonthlyCharges", fill=True)  # `shade` is deprecated in favour of `fill`
facet.set(xlim=(0, new_df["MonthlyCharges"].max()))
facet.add_legend()

In [None]:
_, ax = plt.subplots(1, 2, figsize= (8, 6))
plt.subplots_adjust(wspace = 0.5)
sns.boxplot(x = 'Churn',  y = 'TotalCharges', data = new_df, ax=ax[0])
sns.boxplot(x = 'Churn',  y = 'MonthlyCharges', data = new_df, ax=ax[1])

In [None]:
_, axs = plt.subplots(1, 2, figsize=(9, 6))
plt.subplots_adjust(wspace = 0.3)
ax = sns.countplot(data = new_df, x = "SeniorCitizen", hue = "Churn", ax = axs[0])
ax1 = sns.countplot(data = new_df, x = "MultipleLines_No phone service", hue = "Churn", ax = axs[1])

for p in ax.patches:
    height = p.get_height()
    ax.text(
        p.get_x() + p.get_width() / 2,
        height + 3.4,
        # share of all customers, in percent
        "{:.2f}%".format(100 * height / len(new_df)),
        ha="center", rotation=0
    )

for p in ax1.patches:
    height = p.get_height()
    ax1.text(
        p.get_x() + p.get_width() / 2,
        height + 3.4,
        # share of all customers, in percent
        "{:.2f}%".format(100 * height / len(new_df)),
        ha="center", rotation=0
    )

> **Senior citizens tend to churn more than other customers**

In [None]:
plt.figure(figsize=(8, 6))
sns.swarmplot(x = 'SeniorCitizen', y = 'MonthlyCharges', hue="Churn", data = new_df)
plt.legend(loc='upper right')

In [None]:
fig, ax = plt.subplots(1,3, figsize=(14, 4))
plt.subplots_adjust(wspace=0.4)
sns.countplot(x = "Contract_One year", hue="Churn" , ax=ax[0], data=new_df)
sns.countplot(data = new_df, x = "PaymentMethod_Credit card (automatic)", ax=ax[1], hue="Churn")
sns.countplot(data = new_df, x ="InternetService_No", ax=ax[2], hue="Churn")
plt.show()

In [None]:
fig, ax = plt.subplots(1,2, figsize=(10, 4))
plt.subplots_adjust(wspace=0.4)
sns.swarmplot(x = 'PaymentMethod_Bank transfer (automatic)', y = 'TotalCharges', hue="Churn", data = new_df, ax=ax[0])
sns.swarmplot(x = 'Contract_Two year', y = 'TotalCharges', hue="Churn", data = new_df, ax=ax[1])

> Customers with contracts of **less than 2 years** churn more often

In [None]:
fig, ax = plt.subplots(1,2, figsize=(8, 4))
plt.subplots_adjust(wspace=0.3)
sns.swarmplot(x = 'PaymentMethod_Mailed check', y = 'TotalCharges', hue="Churn", data = new_df, ax=ax[0])
sns.swarmplot(x = 'Contract_Two year', y = 'TotalCharges', hue="Churn", data = new_df, ax=ax[1])
plt.show()

In [None]:
cols = ["TotalCharges", "MonthlyCharges", "tenure", "Churn"] 
pairplot_feature = new_df[cols]
sns.pairplot(pairplot_feature, hue = "Churn")

In [None]:
X = new_df.drop("Churn", axis=1)
y = new_df["Churn"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

In [None]:
perm = PermutationImportance(clf, random_state = 1).fit(X_test, y_test)
eli5.show_weights(perm, feature_names = X_test.columns.tolist())
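
Permutation importance shuffles one feature at a time on held-out data and measures how much the model's score drops. The sketch below shows the same idea with `sklearn.inspection.permutation_importance`, which is a different API from the `eli5` wrapper used above and may be handy where `eli5` is unavailable. The data is synthetic and the column names `signal` and `noise` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic features: the label depends on "signal" only
X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y = (X["signal"] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle each feature several times on the test set and average the score drop
result = permutation_importance(clf, X_test, y_test, n_repeats=5, random_state=1)
importances = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(importances)
```

A feature whose shuffling barely changes the score ends up with an importance close to zero.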

> **Darker green represents the features with the `highest impact`; lighter green represents features with `less impact`**

### That's it for this Kernel. 