# Sony Research - Customer Churn Prediction

## Table of Contents
* [Assignment](#Assignment)
* [Data Description](#Data-Description)
* [Question 1](#Question-1)
* [Question 2](#Question-2)
* [Question 3](#Question-3)
* [Question 4](#Question-4)

## Assignment
You are provided with a sample dataset of a telecom company’s customers and are expected to complete the following tasks:

**Question 1:**  Perform exploratory analysis and extract insights from the dataset.

**Question 2:** Split the dataset into train/test sets and explain your reasoning.

**Question 3:** Build a predictive model to predict which customers are going to churn and discuss the reason why you choose a particular algorithm.

**Question 4:** Establish metrics to evaluate model performance.

**Question 5:** Discuss the potential issues with deploying the model into production.

## Data Description

**State:** The state where a customer comes from

**Account length:** Number of days a customer has been using services

**Area code:** The area where a customer comes from

**Phone number:** The phone number of a customer

**International:** The status of customer international plan

**Voicemail plan:** The status of customer voicemail plan

**Number vmail msgs:** Number of voicemail messages sent by a customer

**Total day minutes:** Total call minutes spent by a customer during day time

**Total day calls:** Total number of calls made by a customer during day time

**Total day charge:** Total amount charged to a customer during day time

**Total eve minutes:** Total call minutes spent by a customer during evening time

**Total eve calls:** Total number of calls made by a customer during evening time

**Total eve charge:** Total amount charged to a customer during evening time

**Total night minutes:** Total call minutes spent by a customer during night time

**Total night calls:** Total number of calls made by a customer during night time

**Total night charge:** Total amount charged to a customer during night time

**Total intl minutes:** Total international call minutes spent by a customer

**Total intl calls:** Total number of international calls made by a customer

**Total intl charge:** Total international call amount charged to a customer

**Customer service calls:** Total number of customer service calls made by a customer

**Churn:** Whether a customer has churned or not

## Question 1
Perform exploratory analysis and extract insights from the dataset.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data = pd.read_csv('Data_Science_Challenge.csv')
data.head(3)

In [None]:
data.shape

In [None]:
data.isnull().sum()

In [None]:
data.info()

In [None]:
print('Number of unique phone number:' , data['phone number'].nunique())
print('Number of unique area code:' , data['area code'].nunique())
print('Number of unique state:' , data['state'].nunique())

Categorical features such as area code can be one-hot encoded, as long as the number of distinct values does not blow up the number of columns in the dataset. This step is necessary because leaving the codes as plain integers could mislead an ML model into assuming an ordinal relationship between the categories.



We leave the state values out of the dataset to avoid problems with high dimensionality, and move on to processing the other categorical features.

In [None]:
area_code_dummies = pd.get_dummies(data['area code'])

# The add_prefix() method inserts the specified value in front of the column label.
area_code_dummies = area_code_dummies.add_prefix('area_code_')

area_code_dummies

In [None]:
# Map 'no'/'yes' to 0/1 in one step. Assigning through data[...].loc[...] is chained
# assignment, which pandas may apply to a temporary copy instead of the DataFrame.
data['voice mail plan'] = data['voice mail plan'].map({'no': 0, 'yes': 1}).astype('int64')

data['voice mail plan']

In [None]:
# Same mapping for the international plan flag, again avoiding chained assignment.
data['international plan'] = data['international plan'].map({'no': 0, 'yes': 1}).astype('int64')

data['international plan']

In [None]:
data_final = data.drop(columns = ['phone number', 'state', 'area code'])

#Pandas concat() method is used to concatenate pandas objects such as DataFrames and Series
data_final = pd.concat([data_final, area_code_dummies], axis = 1)

data_final

In [None]:
data_final.hist(figsize = (15, 15), bins = 15)
plt.show()

In [None]:
data_final.groupby(['churn'])['churn'].count()

The distributions tell us:

* Most customers have no voice mail plan, and most have no international plan.

* Half of the customers live in area code 415.

* Daytime calls bring in the most revenue (compare total day charge with the evening, night, and international charges).

* We have an imbalanced dataset which could be tricky when choosing evaluation metrics.
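The imbalance can be quantified directly from the label column. The following is a minimal sketch on hypothetical toy labels (the `churn` Series below stands in for `data_final['churn']`, and the 85/15 split is made up for illustration); it also shows the accuracy a trivial majority-class predictor would reach, which is the baseline any real model has to beat.

```python
import pandas as pd

# Hypothetical toy labels standing in for data_final['churn'] (0 = stayed, 1 = churned).
churn = pd.Series([0] * 17 + [1] * 3, name="churn")

# Share of each class; normalize=True returns fractions instead of counts.
class_share = churn.value_counts(normalize=True)
print(class_share)

# Accuracy of a baseline that always predicts the majority class.
majority_baseline = class_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```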

In [None]:
f, ax = plt.subplots(figsize=(23,15))
sns.heatmap(data_final.corr(), annot = True, linewidths = 0.5)

From the correlation matrix, we observe the following things:

There is a positive correlation between:

* total day charge, total day minutes, and churn
* total eve minutes and total eve charge
* total night minutes and total night charge
* total intl minutes and total intl charge
* customer service calls and churn
* number vmail messages and voice mail plan
* international plan and churn

There is a negative correlation between:

* churn and voice mail plan
* churn and number vmail messages
* churn and total intl calls
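Reading these relationships off a large heatmap is error-prone, so it can help to extract them programmatically. The sketch below uses a synthetic toy frame (the column names mimic the dataset, but the values and the flat per-minute rate of 0.17 are assumptions made only for illustration). It ranks features by their correlation with `churn` and lists feature pairs that are almost perfectly correlated with each other.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are synthetic, not taken from the real dataset.
rng = np.random.default_rng(0)
minutes = rng.uniform(0, 300, size=200)
toy = pd.DataFrame({
    "total day minutes": minutes,
    "total day charge": minutes * 0.17,  # assumed flat per-minute rate
    "customer service calls": rng.integers(0, 6, size=200),
})
toy["churn"] = (toy["total day minutes"] > 220).astype(int)

corr = toy.corr()

# Correlation of every feature with the target, strongest (by magnitude) first.
target_corr = corr["churn"].drop("churn").sort_values(key=lambda s: s.abs(), ascending=False)
print(target_corr)

# Feature pairs whose absolute correlation exceeds a threshold.
# Only the upper triangle is kept so that each pair appears once.
features = corr.drop(index="churn", columns="churn")
upper = features.where(np.triu(np.ones(features.shape, dtype=bool), k=1))
redundant_pairs = upper.stack().loc[lambda s: s.abs() > 0.95]
print(redundant_pairs)
```

Pairs that show up in `redundant_pairs` carry nearly the same information, so one column of each pair is a candidate for removal before modeling.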


Most of the relations are as expected. Still, there could be some interesting points such as the positive correlation between churn and international plan. It could be caused by the poor quality of international plans or calls. Let's check the individual effect of features on churn rate through a random forest classifier.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

label_encoder = preprocessing.LabelEncoder()

# Apply label encoder for churn since its values are also categories.
y = label_encoder.fit_transform(data_final['churn'])

X = data_final.drop(columns = ['churn'])
feature_names = X.columns  # keep the column names; scaling returns a plain array

# Standardize the features (tree-based models do not need this, but it is harmless here).
X = StandardScaler().fit_transform(X)

# Train-test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Fix the seed so that the importance ranking is reproducible.
random_forest = RandomForestClassifier(max_depth = 5, random_state = 42)
random_forest.fit(X_train, y_train)

# Feature importance assigns each input feature a score based on how useful it is
# for predicting the target; here it is the impurity-based (Gini) importance.
importances = pd.DataFrame({'Gini-importance': random_forest.feature_importances_}, index = feature_names)
importances.sort_values(by = 'Gini-importance').plot(kind = 'bar', rot = 90, figsize = (12,8))

plt.show()

Gini-importance shows us which features would be most helpful if we build a tree-based model with given features. According to the analysis above, the most important three churn features are:
* total day charge
* total day minutes
* customer service calls.
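Impurity-based importance is computed from the training data and is known to favor features with many distinct values, so a cross-check can be useful. The sketch below compares it with permutation importance on held-out data. This is a swapped-in technique, not something the notebook itself uses, and the data is synthetic (only the first of four features carries any signal).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only feature 0 determines the label.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(500, 4))
y_toy = (X_toy[:, 0] > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

forest = RandomForestClassifier(max_depth=5, random_state=42).fit(X_tr, y_tr)

# Impurity-based (Gini) importance, derived from the training data.
gini_rank = np.argsort(forest.feature_importances_)[::-1]

# Permutation importance: drop in held-out score when one column is shuffled.
perm = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=42)
perm_rank = np.argsort(perm.importances_mean)[::-1]

print("Gini ranking:       ", gini_rank)
print("Permutation ranking:", perm_rank)
```

If both rankings agree on the top features, the conclusion is less likely to be an artifact of how impurity-based importance is computed.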

## Question 2
Split the dataset into train/test sets and explain your reasoning.

In [None]:
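from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

label_encoder = preprocessing.LabelEncoder()

# Apply label encoder for churn since its values are also categories.
y = label_encoder.fit_transform(data_final['churn'])

X = data_final.drop(columns = ['churn'])

# Train-test split: hold out 20% of the rows for testing. stratify = y keeps the churn
# ratio the same in both sets, which matters because the classes are imbalanced.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)

# Fit the scaler on the training set only, then apply it to both sets,
# so that no test-set statistics leak into training.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)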
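The reason for passing `stratify = y` can be illustrated on toy data. The following sketch uses hypothetical labels with a 15% positive rate (an assumption for illustration, not the real churn rate) and compares a stratified split with a plain random one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 15% positives, as a stand-in for churn.
y_toy = np.array([1] * 150 + [0] * 850)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Stratified split: class proportions are preserved in both parts.
_, _, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Plain random split: proportions can drift between the two parts.
_, _, y_tr_r, y_te_r = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print("Overall positive rate:       ", y_toy.mean())
print("Stratified train/test rate:  ", y_tr_s.mean(), y_te_s.mean())
print("Unstratified train/test rate:", y_tr_r.mean(), y_te_r.mean())
```

With stratification, the test set reflects the same class balance the model will be evaluated against, so metrics such as the F1 score are not distorted by an unlucky split.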


## Question 3
Build a predictive model to predict which customers are going to churn and discuss the reason why you choose a particular algorithm.

Since this is a classification problem, we will try a range of classifiers and pick one for production based on performance. The hyperparameters of these classifiers were chosen by trial and error, without a systematic tuning procedure.
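For reference, a systematic alternative to trial and error could look like the sketch below. It is not what this notebook does: the data is synthetic, and the parameter grid is hypothetical and chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the training split.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, weights=[0.85, 0.15], random_state=42
)

# Hypothetical grid; the values are illustrative, not tuned recommendations.
param_grid = {"max_depth": [3, 5], "n_estimators": [50, 100]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1_macro",  # macro-averaged F1, more informative than accuracy on skewed classes
    cv=3,                # 3-fold cross-validation on the training data only
)
search.fit(X_toy, y_toy)

print("Best parameters:", search.best_params_)
print("Best CV macro F1:", round(search.best_score_, 2))
```

Because the search cross-validates within the training data, the held-out test set stays untouched for the final comparison.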

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from xgboost import XGBClassifier
#from lightgbm import LGBMClassifier

In [None]:
names = [
    "Nearest Neighbors",
    "Linear SVM",
    "RBF SVM",
    "Gaussian Process",
    "Decision Tree",
    "Random Forest",
    "Neural Net",
    "AdaBoost",
    "Naive Bayes",
    "QDA",
    "XGBoost",
    #"LightGBM",  # disabled together with LGBMClassifier below so both lists stay aligned
]

classifiers = [

    KNeighborsClassifier(3), 
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0), random_state=42),
    DecisionTreeClassifier(max_depth=5, random_state=42),
    RandomForestClassifier(max_depth=5, random_state=42),
    MLPClassifier(alpha=1, max_iter=1000, random_state=42),
    AdaBoostClassifier(random_state=42),
    GaussianNB(),
    QuadraticDiscriminantAnalysis(),
    # 'logloss' is the binary log loss; 'mlogloss' is meant for multi-class problems.
    XGBClassifier(use_label_encoder=False, eval_metric='logloss', seed=0),
    #LGBMClassifier(random_state=42),
]

## Question 4
Establish metrics to evaluate model performance.

This is a classification task, and the most commonly used metric is accuracy. However, our dataset is imbalanced, so accuracy alone can be misleading. Suppose a very skewed dataset in which 99% of the labels are 1 and 1% are 0: a model that always predicts 1 reaches 99% accuracy, yet it never identifies the minority class, so it is not a good model. The F1 score is the harmonic mean of precision and recall, which makes it more informative on imbalanced datasets. Hence, we will use both accuracy and the macro-averaged F1 score when comparing the performance of different algorithms.
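The skewed example above can be worked through numerically. This is a small sketch on made-up labels: a constant predictor on a 99/1 split, scored with accuracy, with the F1 score of the minority class, and with the macro-averaged F1 score.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# The skewed example from the text: 99% of labels are 1, 1% are 0.
y_true = np.array([1] * 99 + [0] * 1)
y_pred_const = np.ones_like(y_true)  # a "model" that always predicts 1

accuracy = accuracy_score(y_true, y_pred_const)

# F1 for the minority class (label 0). Precision is undefined here because the
# class is never predicted; zero_division=0 scores it as 0 without a warning.
minority_f1 = f1_score(y_true, y_pred_const, pos_label=0, zero_division=0)

# Macro F1 averages the per-class F1 scores with equal weight.
macro_f1 = f1_score(y_true, y_pred_const, average="macro", zero_division=0)

print(f"Accuracy:          {accuracy:.2f}")
print(f"Minority-class F1: {minority_f1:.2f}")
print(f"Macro F1:          {macro_f1:.2f}")
```

The accuracy is 0.99 while the macro F1 stays just below 0.5, because the minority class contributes an F1 of 0 to the average.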

In [None]:
from sklearn.metrics import f1_score

## Classical Machine Learning Models

In [None]:
for name, classification in zip(names,classifiers):
    classification.fit(X_train,y_train)
    accurate_score = classification.score(X_test, y_test)
    y_pred = classification.predict(X_test)
    f_score = f1_score(y_test, y_pred, average = 'macro')
    print('Accuracy:', "{:.2f}".format(accurate_score), "F1 Score",
          "{:.2f}".format(f_score), 'Model:', name)

We obtained good accuracy and a satisfying F1 score with the tree-based methods. The best-performing model is XGBoost, with 0.95 accuracy and a 0.88 F1 score. Let's visualize a decision tree to see how tree-based algorithms make decisions for our particular problem.

In [None]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score
from IPython.display import Image, display
import pydotplus

# Function arguments
# maximum_depth  - depth of tree
# criterion_type - ["gini" or "entropy"]
# split_type     - ["best" or "random"]

def plot_decision_tree(maximum_depth, criterion_type, split_type):
    # Pass the arguments through instead of hard-coding the tree settings.
    clf = DecisionTreeClassifier(max_depth = maximum_depth, criterion = criterion_type,
                                 splitter = split_type, random_state = 42)
    clf.fit(X_train, y_train)

    # Score this tree itself rather than reusing values left over from the previous cell.
    accuracy = clf.score(X_test, y_test)
    f_score = f1_score(y_test, clf.predict(X_test), average = 'macro')
    print('Accuracy:', "{:.2f}".format(accuracy), "F1 Score",
          "{:.2f}".format(f_score))

    # Export the fitted tree to DOT format and render it as a PNG.
    graph = tree.export_graphviz(clf, out_file = None, rounded = True, proportion = True,
                                feature_names = data_final.drop(columns = ['churn']).columns.to_list(),
                                precision = 2, class_names = ["Not churn", "Churn"], filled = True)

    pydot_graph = pydotplus.graph_from_dot_data(graph)
    pydot_graph.set_size('10', '10')
    tree_image = Image(pydot_graph.create_png())  # avoid reusing the name plt (matplotlib)
    display(tree_image)


plot_decision_tree(3, 'gini', 'best')

This visualization shows how the decision tree grows by splitting on feature values. For example, the left-hand side of the tree shows that customers without an international plan churn less than customers who have one. Another example, from level two of the tree: customers whose customer service calls value is below 1.47 mostly do not churn, while the others have a higher probability of churning. Note that the tree was trained on standardized features, so 1.47 is a standardized value rather than a raw number of calls.
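Because the tree is fitted on standardized features, its split thresholds need to be mapped back before they can be read as real-world quantities. The sketch below assumes a fitted `StandardScaler` is kept around; the helper `to_original_units` and the raw values are hypothetical and only illustrate the arithmetic `raw = z * scale + mean`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical raw values for a single feature, e.g. customer service calls.
raw = np.array([[0.0], [1.0], [1.0], [2.0], [3.0], [5.0]])
scaler = StandardScaler().fit(raw)

def to_original_units(z_threshold, fitted_scaler, column):
    """Map a split threshold on a standardized feature back to raw units."""
    return z_threshold * fitted_scaler.scale_[column] + fitted_scaler.mean_[column]

# 1.47 is the standardized threshold read off the tree in the text; the result
# here depends on the toy data above, not on the real dataset.
print(to_original_units(1.47, scaler, column=0))
```

Applied to the scaler fitted on the real training data, the same formula would turn each threshold in the tree plot into a number of calls, minutes, or charges.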

## Deep Learning Model
What would be the performance of an Artificial Neural Network (ANN) for the given problem (without spending hours on hyperparameter optimization - just experimenting)?

In [None]:
# Sequential model to initialise our ann and dense module to build the layers.
from keras.models import Sequential
from keras.layers import Dense

# To have reproducible results
import tensorflow
tensorflow.random.set_seed(42)

In [None]:
classifier = Sequential()

# Adding the input layer and the first hidden layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu',
                    input_dim = X.shape[1]))

# Adding the second hidden layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))

# Adding the output layer
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

In [None]:
# Compiling the ANN: set the optimizer (Adam), the loss function, and the metric to track
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# Fitting the ANN to the training set
classifier.fit(X_train, y_train, batch_size = 10, epochs = 100, verbose = 0)

loss, accuracy = classifier.evaluate(X_train, y_train, batch_size = 10)

print('Train accuracy:', accuracy)

In [None]:
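# Predicting the test set results
y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.5).astype(int).ravel()  # threshold the probabilities into 0/1 labels

print('*'*20)
loss, accuracy = classifier.evaluate(X_test, y_test,
                            batch_size=10)

print('Test accuracy:', accuracy)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)

# Store the result under a new name so that the f1_score function is not overwritten.
# Note: this is the F1 of the churn class only, whereas the classical models above
# were scored with average = 'macro', so the two numbers are not directly comparable.
test_f1 = f1_score(y_test, y_pred)
print('Test F1-score:', test_f1)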