# Telco Churn Prediction

![imglink](http://www.oxper.in/wp-content/uploads/2017/06/churn-1-900x444.png)


# <font color='red'> Introduction </font>

     Welcome to a basic binary classification task.
     The goal with this dataset is to build a model that predicts which customers will churn in the future.


In this kernel,

 - Simple Exploratory Data Analysis
 - Data wrangling
 - Creating predictive models
 - Fine tuning by GridSearch

## Import packages

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plotting
import seaborn as sns # statistical plots built on top of matplotlib
import warnings
warnings.filterwarnings("ignore")

from pylab import rcParams


%matplotlib inline

In [None]:
data = pd.read_csv('../input/WA_Fn-UseC_-Telco-Customer-Churn.csv')

## Data Exploration

In [None]:
data.columns

In [None]:
data.shape

    7043 rows (customers) with 21 columns

In [None]:
data.head(5)

**<font color='forestgreen'> Note</font>**

    First, drop customerID: it is only an identifier and should not affect the churn probability.

In [None]:
data.drop(['customerID'], axis=1, inplace=True)

## Target Feature

In [None]:
data['Churn'].value_counts(sort = False)

In [None]:
# Data to plot
churn_counts = data['Churn'].value_counts(sort=True)
labels = churn_counts.index
sizes = churn_counts

colors = ["whitesmoke", "red"]
explode = (0.1, 0)  # explode 1st slice

rcParams['figure.figsize'] = 8, 8
# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=270)

plt.title('Percentage of customers who churned')
plt.show()

**<font color='tomato'> Finding</font>**
    
    Only 26.5% of the customers in the dataset churned, so the classes are imbalanced.
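
Because of this imbalance, accuracy figures are easier to judge against a trivial baseline. The following is a minimal sketch, assuming only the 26.5% churn share reported above; the variable names are illustrative and not part of the notebook.

```python
# Share of churners reported in the finding above (26.5%).
churn_rate = 0.265

# A trivial model that always predicts the majority class ("no churn")
# is right for every non-churner and wrong for every churner.
baseline_accuracy = max(churn_rate, 1 - churn_rate)

print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Any model evaluated on accuracy should be compared with this baseline rather than with 50%.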

In [None]:
data['Churn'] = data['Churn'].map(lambda s :1  if s =='Yes' else 0)

## Data Wrangling

In [None]:
data.info()

In [None]:
#missing data
total = data.isnull().sum().sort_values(ascending=False)
percent = (data.isnull().sum()/data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(6)

**<font color='tomato'> Finding</font>**

    isnull() reports no missing data. It only detects NaN values, though, so blank strings would not show up here.
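
A check like the one above can miss values that are stored as blank strings. The sketch below uses a made-up toy column (the name `charges` and its values are assumptions for illustration) to show the difference between counting NaN values and counting blank strings.

```python
import pandas as pd

# Toy column (made-up values): one entry is a blank string, not NaN.
toy = pd.DataFrame({"charges": ["29.85", " ", "108.15"]})

# isnull() only counts real missing values, so it finds nothing here.
n_null = int(toy["charges"].isnull().sum())

# Counting strings that are empty after stripping whitespace reveals the gap.
n_blank = int(toy["charges"].str.strip().eq("").sum())

print("NaN values:", n_null)
print("Blank strings:", n_blank)
```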

In [None]:
data.head(5)

**<font color='royalblue'> Preprocessing</font>**

#### Gender : Customer gender (female, male)

    Models cannot take strings as input, so I convert this column into new binary columns.

In [None]:
data['gender'].head()

In [None]:
# factorplot was renamed to catplot in seaborn 0.9
g = sns.catplot(y="Churn", x="gender", data=data, kind="bar", palette="Pastel1")


In [None]:
data = pd.get_dummies(data=data, columns=['gender'])
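
With only two categories, the two dummy columns carry the same information (one is always the opposite of the other). The notebook keeps both; the sketch below, on a made-up toy frame, only illustrates the `drop_first=True` option of `pd.get_dummies` as a possible alternative.

```python
import pandas as pd

# Toy frame with made-up rows.
toy = pd.DataFrame({"gender": ["Female", "Male", "Male"]})

# Default behaviour: one column per category.
both = pd.get_dummies(toy, columns=["gender"])

# drop_first=True drops the first category and keeps a single indicator column.
single = pd.get_dummies(toy, columns=["gender"], drop_first=True)

print(list(both.columns))
print(list(single.columns))
```

For tree-based models the redundant column is mostly harmless, but for linear models a single indicator avoids perfectly collinear inputs.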

**<font color='royalblue'> Preprocessing</font>**

#### SeniorCitizen : Whether the customer is a senior citizen or not (1, 0)
    This feature is already 0/1, so it is ready to use.

In [None]:
data['SeniorCitizen'].value_counts()

**<font color='royalblue'> Preprocessing</font>**

#### Partner : Whether the customer has a partner or not (Yes, No)
    Like SeniorCitizen, this is a binary feature,
    but it is stored as "Yes/No" values, so we need to convert it to 1/0.

In [None]:
data['Partner'].value_counts()

In [None]:
data['Partner'] = data['Partner'].map(lambda s :1  if s =='Yes' else 0)
data['Partner'].value_counts()

**<font color='royalblue'> Preprocessing</font>**

#### Dependents, PhoneService, PaperlessBilling
    Do the same as for the "Partner" column.
    

In [None]:
data['Dependents'] = data['Dependents'].map(lambda s :1  if s =='Yes' else 0)
data['PhoneService'] = data['PhoneService'].map(lambda s :1  if s =='Yes' else 0)
data['PaperlessBilling'] = data['PaperlessBilling'].map(lambda s :1  if s =='Yes' else 0)
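
The same conversion can also be written as a loop over column names. This is a minimal sketch on a made-up toy frame; the column names match the ones converted above, while the rows and the `yes_no_cols` list are assumptions for illustration.

```python
import pandas as pd

# Toy frame with made-up rows.
toy = pd.DataFrame({
    "Dependents": ["Yes", "No", "No"],
    "PhoneService": ["Yes", "Yes", "No"],
    "PaperlessBilling": ["No", "Yes", "Yes"],
})

yes_no_cols = ["Dependents", "PhoneService", "PaperlessBilling"]
for col in yes_no_cols:
    # "Yes" becomes 1; every other value becomes 0.
    toy[col] = (toy[col] == "Yes").astype(int)

print(toy)
```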


**<font color='royalblue'> Preprocessing</font>**

#### Tenure : Number of months the customer has stayed with the company
    This is a numerical feature. We could cut it into bins, but I think it is ready to use as is.
    

In [None]:
data['tenure'].head()
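
The notebook keeps tenure as a raw number. For reference, binning could be sketched with `pd.cut` as below; the sample values, bin edges and labels are assumptions chosen for illustration, not values taken from the dataset.

```python
import pandas as pd

# Made-up tenure values in months.
tenure = pd.Series([1, 8, 15, 30, 47, 72], name="tenure")

# Illustrative bin edges; include_lowest=True keeps a tenure of 0 in the first bin.
bins = [0, 12, 24, 48, 72]
labels = ["0-12", "13-24", "25-48", "49-72"]
tenure_group = pd.cut(tenure, bins=bins, labels=labels, include_lowest=True)

print(tenure_group.value_counts(sort=False))
```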

In [None]:
# tenure distribution
# (the kdeplot argument `shade` was replaced by `fill` in seaborn 0.11)
g = sns.kdeplot(data.tenure[(data["Churn"] == 0)], color="Red", fill=True)
g = sns.kdeplot(data.tenure[(data["Churn"] == 1)], ax=g, color="Blue", fill=True)
g.set_xlabel("tenure")
g.set_ylabel("Density")
plt.title('Distribution of tenure by churn status')
g = g.legend(["Not Churn", "Churn"])

**<font color='tomato'> Finding</font>**

    It seems that most churned customers stayed with the company for less than 20 months,
    and customers with a high tenure have a low probability of churning.

**<font color='royalblue'> Preprocessing</font>**

#### MultipleLines : Whether the customer has multiple lines or not (Yes, No, No phone service)
    This looks like a Yes/No feature, but it contains 3 values.
    I could create a new column that tells the model whether the customer has phone service,
    but we already have the 'PhoneService' column,
    so I treat "No phone service" as equivalent to "No".
    

In [None]:
data['MultipleLines'].value_counts()

In [None]:
data['MultipleLines'] = data['MultipleLines'].replace('No phone service', 'No')
data['MultipleLines'] = data['MultipleLines'].map(lambda s :1  if s =='Yes' else 0)
data['MultipleLines'].value_counts()

**<font color='royalblue'> Preprocessing</font>**

#### InternetService : Customer’s internet service provider (DSL, Fiber optic, No)
    First, I want something like the "PhoneService" column,
    so I create a Has_InternetService column that tells whether the customer has internet service.
    Next, if they have internet service, we need to tell the model what kind of service it is.

In [None]:
data['InternetService'].value_counts()

In [None]:
data['Has_InternetService'] = data['InternetService'].map(lambda s :0  if s =='No' else 1)
data['Fiber_optic'] = data['InternetService'].map(lambda s :1  if s =='Fiber optic' else 0)
data['DSL'] = data['InternetService'].map(lambda s :1  if s =='DSL' else 0)


In [None]:
print(data['Has_InternetService'].value_counts())
print(data['Fiber_optic'].value_counts())
print(data['DSL'].value_counts())
data.drop(['InternetService'], axis=1, inplace=True)

**<font color='royalblue'> Preprocessing</font>**

**OnlineSecurity OnlineBackup DeviceProtection <br>
 TechSupport StreamingTV StreamingMovies**

     All of these columns have the same three-valued format, so I handle them the same way as the "MultipleLines" column: only "Yes" maps to 1.

In [None]:
data['OnlineSecurity'] = data['OnlineSecurity'].map(lambda s :1  if s =='Yes' else 0)
data['OnlineBackup'] = data['OnlineBackup'].map(lambda s :1  if s =='Yes' else 0)
data['DeviceProtection'] = data['DeviceProtection'].map(lambda s :1  if s =='Yes' else 0)
data['TechSupport'] = data['TechSupport'].map(lambda s :1  if s =='Yes' else 0)
data['StreamingTV'] = data['StreamingTV'].map(lambda s :1  if s =='Yes' else 0)
data['StreamingMovies'] = data['StreamingMovies'].map(lambda s :1  if s =='Yes' else 0)

**<font color='royalblue'> Preprocessing</font>**

#### PaymentMethod : The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
    This is a categorical feature, so I use the pandas function "get_dummies" for it.

In [None]:
data['PaymentMethod'].value_counts()

In [None]:
data = pd.get_dummies(data=data, columns=['PaymentMethod'])

**<font color='forestgreen'> Note</font>**

    What did we get from get_dummies?

In [None]:
data[['PaymentMethod_Electronic check',
      'PaymentMethod_Mailed check',
      'PaymentMethod_Bank transfer (automatic)',
      'PaymentMethod_Credit card (automatic)']].head()

**<font color='royalblue'> Preprocessing</font>**

#### Contract : The contract term of the customer (Month-to-month, One year, Two year)
    This is also a categorical feature, so let's "get_dummies" it.


In [None]:
data['Contract'].value_counts()

In [None]:
data = pd.get_dummies(data=data, columns=['Contract'])

**<font color='royalblue'> Preprocessing</font>**

#### MonthlyCharges : The amount charged to the customer monthly
    A numerical feature and, luckily, it is ready to use.

In [None]:
data['MonthlyCharges'].head()

In [None]:
# factorplot was renamed to catplot in seaborn 0.9
g = sns.catplot(x="Churn", y="MonthlyCharges", data=data, kind="box", palette="Pastel1")

**<font color='tomato'> Finding</font>**

    According to the plot above, high MonthlyCharges may increase the churn probability,
    and customers with low MonthlyCharges seem less likely to churn.

**<font color='royalblue'> Preprocessing</font>**

#### TotalCharges : The total amount charged to the customer
    This should be a numerical feature, but it is currently stored as object type.
    We need to fix that.

In [None]:
data['TotalCharges'].head()

In [None]:
## 11 rows contain " ", which means 11 values are missing in our dataset
len(data[data['TotalCharges'] == " "])

In [None]:
## Drop rows with missing data (.copy() so the later column assignment works on an independent frame)
data = data[data['TotalCharges'] != " "].copy()

In [None]:
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'])
## The first time I ran this command it raised an error because some values contain " ".
## That is how I found out that " " was hiding in our dataset.
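
An alternative to filtering the blank strings by hand is to let `pd.to_numeric` turn unparseable values into NaN with `errors="coerce"` and drop them afterwards. The sketch below uses a made-up toy series; it is one possible approach, not the one used in this notebook.

```python
import pandas as pd

# Toy series (made-up values) with one blank string hidden among the numbers.
raw = pd.Series(["29.85", " ", "1889.5"])

# errors="coerce" converts anything that cannot be parsed into NaN instead of raising.
numeric = pd.to_numeric(raw, errors="coerce")

# The blank string is now a regular missing value that isnull() can see.
n_missing = int(numeric.isnull().sum())
clean = numeric.dropna()

print("Missing after coercion:", n_missing)
print(clean.tolist())
```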

In [None]:
# factorplot was renamed to catplot in seaborn 0.9
g = sns.catplot(y="TotalCharges", x="Churn", data=data, kind="boxen", palette="Pastel2")

**<font color='tomato'> Finding</font>**

    From the boxen plot, most churned customers have less than 2000 in total charges.
    In the range of 2500 to 8000, loyal customers are roughly twice as numerous as churned customers.

In [None]:
data.info()

## Creating Models & Evaluation

      In this step, I compare a model with default parameters against one tuned by grid search.
      Grid search tries every possible combination of the parameter values that we supply.
      It takes a lot of time, so I commented out some lines. You can uncomment them when running on your local PC.
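
To get a feel for the cost, the number of combinations can be counted before running anything. The sketch below copies the active (uncommented) grid values from the random forest search in this notebook and counts them with scikit-learn's `ParameterGrid`. The fold count of 5 is an assumption: it is the `GridSearchCV` default in scikit-learn 0.22 and later.

```python
from sklearn.model_selection import ParameterGrid

# Active grid values from the random forest search in this notebook.
param_grid = {
    "n_estimators": [500, 1200],
    "max_depth": [1, 3],  # same values as range(1, 5, 2)
    "max_features": ["log2", "sqrt"],
    "class_weight": [{1: 1}, {1: 1.5}],
}

# Every combination is one candidate model: 2 * 2 * 2 * 2.
n_candidates = len(ParameterGrid(param_grid))

# Assumed number of cross-validation folds (GridSearchCV default since 0.22).
cv_folds = 5
total_fits = n_candidates * cv_folds

print("Candidates:", n_candidates)
print("Model fits:", total_fits)
```

Uncommenting `min_samples_split` (5 values) and `min_samples_leaf` (4 values) would multiply the number of candidates by 20, which is why those lines are left commented out.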

In [None]:
data["Churn"] = data["Churn"].astype(int)

Y_train = data["Churn"]
X_train = data.drop(labels = ["Churn"],axis = 1)

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import  cross_val_score,GridSearchCV

Rfclf = RandomForestClassifier(random_state=15)
Rfclf.fit(X_train, Y_train)

In [None]:
# 10 Folds Cross Validation 
clf_score = cross_val_score(Rfclf, X_train, Y_train, cv=10)
print(clf_score)
clf_score.mean()

**<font color='tomato'> Finding</font>**

    My default random forest gets around 78% accuracy in cross-validation.

In [None]:
%%time
param_grid  = { 
                'n_estimators' : [500,1200],
               # 'min_samples_split': [2,5,10,15,100],
               # 'min_samples_leaf': [1,2,5,10],
                'max_depth': range(1,5,2),  # depths 1 and 3
                'max_features' : ('log2', 'sqrt'),
                'class_weight':[{1: w} for w in [1,1.5]]
              }

GridRF = GridSearchCV(RandomForestClassifier(random_state=15), param_grid)

GridRF.fit(X_train, Y_train)
#RF_preds = GridRF.predict_proba(X_test)[:, 1]
#RF_performance = roc_auc_score(Y_test, RF_preds)

print(
    #'DecisionTree: Area under the ROC curve = {}'.format(RF_performance)
     "\nBest parameters \n" + str(GridRF.best_params_))

In [None]:
rf = RandomForestClassifier(random_state=15,**GridRF.best_params_)
rf.fit(X_train, Y_train)

## K-Fold CV with accuracy metric


In [None]:
# 10 Folds Cross Validation 
clf_score = cross_val_score(rf, X_train, Y_train, cv=10)
print(clf_score)
clf_score.mean()
    

**<font color='tomato'> Finding</font>**

    My grid-searched random forest gets around 80% accuracy in cross-validation,
    a small improvement over the default parameters.

## Feature importances
    Another advantage of tree-based models.

In [None]:
# Store importances under a named column so the plot can refer to it by keyword
Rfclf_fea = pd.DataFrame({"Feature": list(X_train), "Weight": rf.feature_importances_})
Rfclf_fea.sort_values(by="Weight", ascending=False).head()

In [None]:
g = sns.barplot(x="Weight", y="Feature", data=Rfclf_fea.sort_values(by="Weight", ascending=False)[0:5], palette="Pastel1", orient="h")
g.set_xlabel("Weight")
g = g.set_title("Random Forest")

## Confusion Matrix

    Also known as an error matrix, it is a table layout that visualizes the performance of a classifier.

In [None]:
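# Confusion Matrix
from sklearn.metrics import confusion_matrix

# Note: these predictions are made on the same data the model was trained on,
# so the resulting numbers are optimistic.
y_pred = rf.predict(X_train)

print(confusion_matrix(Y_train, y_pred))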

In [None]:
from sklearn.metrics import classification_report

print(classification_report( Y_train, y_pred))
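
A less optimistic confusion matrix can be built from out-of-fold predictions, where each row is predicted by a model that did not see it during fitting. The sketch below shows the idea with `cross_val_predict` on synthetic data from `make_classification`; the data, the 73/27 class split, and the small forest size are assumptions for illustration, not the notebook's data or tuned model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the churn data (roughly 73% / 27% class split).
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.73, 0.27], random_state=15)

model = RandomForestClassifier(n_estimators=50, random_state=15)

# Each prediction comes from a model fitted on the other folds only.
y_oof = cross_val_predict(model, X, y, cv=5)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y, y_oof)
print(cm)
```

In the notebook, the same call could be applied to `rf`, `X_train` and `Y_train` in place of the synthetic objects.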

**Thank you for reading until the end : )** 

    I will try to publish an updated version.
    Please vote or comment if you like it ^_^
    If you have any suggestions, let me know in the comments.