# Advanced Certification Program in Computational Data Science

##  A program by IISc and TalentSprint

### Mini Project Notebook: Customer Churn Analysis

## Learning Objectives

At the end of the experiment, you will be able to:

* identify users who are likely to churn in the future
* find the factors that drive users to churn
* perform EDA on the given churn data and prepare the data for the prediction task
* apply various machine learning algorithms and analyse the results


## Information

**Churn Analysis**

Customer churn analysis studies the rate at which customers leave a company (the attrition rate). It helps identify the causes of churn and implement effective retention strategies.


Customer churn describes subscribers who discontinue a service within a given time frame. Churn prediction consists of detecting which customers are likely to cancel a subscription, based on how they use the service.

Businesses often have to invest substantial amounts attracting new clients, so every time a client leaves it represents a significant investment lost. Both time and effort then need to be channelled into replacing them. Being able to predict when a client is likely to leave and offer them incentives to stay can offer huge savings to a business.

**Predicting customer churn with machine learning**

As with any machine learning task, data science specialists first need data to work with. Depending on the goal, the selected data is prepared, preprocessed, and transformed into a form suitable for building machine learning models. Finding the right training methods, fine-tuning the models, and selecting the best performers is another significant part of the work. Once the model with the best predictive performance is chosen, it can be put into production.

The overall scope of work data scientists carry out to build ML-powered systems capable of forecasting customer attrition may look like the following:

* Understanding a problem and final goal
* Data collection
* Data preparation and preprocessing
* Modeling and testing
* Model deployment and monitoring

## Dataset

The dataset chosen for this task is a customer churn dataset describing users' trips, rider and driver ratings, and luxury car usage. Every row represents a separate customer, and there are 50,000 customers in total.

Variable descriptions:
* **city:**	city this user signed up in
* **phone:**	primary device for this user
* **signup_date:**	date of account registration; in the form `YYYYMMDD`
* **last_trip_date:**	the last time this user completed a trip; in the form `YYYYMMDD`
* **avg_dist:**	the average distance (in miles) per trip taken in the first 30 days after signup
* **avg_rating_by_driver:**	the rider’s average rating, as given by drivers, over all of their trips
* **avg_rating_of_driver:**	the rider’s average rating of their drivers over all of their trips
* **surge_pct:**	the percent of trips taken with surge multiplier > 1
* **avg_surge:**	the average surge multiplier over all of this user’s trips
* **trips_in_first_30_days:**	the number of trips this user took in the first 30 days after signing up
* **luxury_car_user:**	TRUE if the user took a luxury car in their first 30 days; FALSE otherwise
* **weekday_pct:**	the percent of the user’s trips occurring during a weekday
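
The two date columns are described as `YYYYMMDD`. The following is a minimal sketch of parsing such values explicitly, assuming they arrive as compact strings like `"20140125"`. The toy values and the `days_active` column are made up for illustration and are not part of the dataset; if the file actually stores dates in another layout, the `format` argument would need to change.

```python
import pandas as pd

# Toy values (made up) in the compact YYYYMMDD form described above.
toy = pd.DataFrame({
    "signup_date": ["20140125", "20140106"],
    "last_trip_date": ["20140617", "20140505"],
})

# An explicit format avoids ambiguous parsing; errors="raise" surfaces bad rows early.
for col in ["signup_date", "last_trip_date"]:
    toy[col] = pd.to_datetime(toy[col], format="%Y%m%d", errors="raise")

# Hypothetical helper column: days between signup and last trip.
toy["days_active"] = (toy["last_trip_date"] - toy["signup_date"]).dt.days
print(toy)
```

Once the columns are real datetimes, comparisons against a cutoff date and accessors such as `.dt.month` work directly.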



## Problem Statement

Analyse and preprocess the data, and build a machine learning model to predict customer churn.

## Grading = 10 Points

In [None]:
#@title Download Dataset
!wget -qq https://cdn.iisc.talentsprint.com/CDS/MiniProjects/churn.csv
print("Dataset downloaded successfully!!")

### Import required Packages

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn import ensemble
from xgboost import XGBClassifier
import warnings
warnings.simplefilter('ignore')

### Load the data and summarize (1 point)

In [None]:
# reading the .csv file
data = pd.read_csv("/content/churn.csv")
data

#### Summarize the data
* Explore the datatypes of the columns and correct them where needed
* Identify the numerical, categorical and date columns
* Identify the columns with missing values

In [None]:
#Available Information
data.info()

In [None]:
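#Lets first convert columns to their appropriate data types
#pd.to_datetime is used because pandas 2.x rejects the unit-less 'datetime64' dtype in astype
data["signup_date"] = pd.to_datetime(data["signup_date"].astype(str))
data["last_trip_date"] = pd.to_datetime(data["last_trip_date"].astype(str))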

#### Breakdown by months

* Using `last_trip_date`, count the users whose last trip fell in each month

In [None]:
data['last_trip_date'].max()

In [None]:
#Lets see the breakdown by months
last_trip_bd = data.groupby(data['last_trip_date'].dt.strftime('%B')).last_trip_date.count()
cats = ['January', 'February', 'March', 'April','May','June', 'July', 'August','September', 'October', 'November', 'December']
last_trip_bd.index = pd.CategoricalIndex(last_trip_bd.index, categories=cats, ordered=True)
last_trip_bd = last_trip_bd.sort_index()
last_trip_bd

Clearly, users who last used the app in June or July are still active customers. However, customers whose last trip was before June (in May or earlier) have gone a considerable time without using the app. Let's mark them as inactive (users who have churned).

### Data Preparation (Target variable - Churn) ( 2 points)

**Note:** Any user whose last trip with the company was before 1st June, 2014 is considered to be "churned".

In [None]:
#Any user whose last trip with the company was before 1st June, 2014 is considered to be "churned".
data["churned"] = 1
#.loc avoids chained assignment, which may not update the original frame
data.loc[data["last_trip_date"] >= "2014-06-01", "churned"] = 0
data["churned"] = data["churned"].astype("category")

In [None]:
data[data["churned"] == 0]

#### Handle the Duplicates

Although we don't have a unique ID for each customer, it is highly unlikely that two customers share identical values in every column (same city, same phone, same signup_date, same last_trip_date, and so on). Find such rows in the data and drop them.

Hint: `drop_duplicates()`


In [None]:
data[data.duplicated()].shape

There are 8 such fully duplicated rows in the data. Since two customers are highly unlikely to match on every column, we drop them.

In [None]:
clean_data = data.copy()
clean_data.drop_duplicates(inplace = True)

#We have a total of 49,992 customers
clean_data.shape

In [None]:
#Since, we have used last_trip_date to create our target variable, we can drop the variable from further analysis
clean_data.drop(["last_trip_date"],axis = 1,inplace = True)

#### Separate columns by data types

* Identify which columns belong to each data type and separate them

In [None]:
#Before moving further, lets separate our variables based on their types
#Separating columns by data types
def separate(df):
    separated_cols = {
        "categorical" : list(df.select_dtypes(include = ["bool","object","category"]).columns),
        "continuous" : list(df.select_dtypes(include = ["int64","float64"]).columns),
        "date" : list(df.select_dtypes(include = ["datetime"]).columns)
    }
    return separated_cols

separate(clean_data)

#### Handle the null values

* Identify and handle the null values and provide the justification

In [None]:
# Missing Values
clean_data.isnull().sum()

Replacing Missing Values

In [None]:
clean_data["phone"] = clean_data["phone"].fillna("Other")

clean_data["phone"].value_counts()

#Before replacing values for rating variables, lets create indicator variables recording which values were missing
#(built directly from isnull() to avoid chained assignment)
clean_data["rating_by_driver_replaced"] = clean_data["avg_rating_by_driver"].isnull().astype("int64")
clean_data["rating_of_driver_replaced"] = clean_data["avg_rating_of_driver"].isnull().astype("int64")

#Replacing missing ratings with the median of the variable, since the ratings are highly skewed.
clean_data["avg_rating_by_driver"] = clean_data["avg_rating_by_driver"].fillna(clean_data["avg_rating_by_driver"].median())
clean_data["avg_rating_of_driver"] = clean_data["avg_rating_of_driver"].fillna(clean_data["avg_rating_of_driver"].median())

clean_data.isnull().sum()

* Why does the column "phone" have missing values? A customer needs a phone to use the app, so a missing value could indicate a different OS.
* Missing values for ratings seem intuitive: not all customers rate their drivers, and not all drivers rate their customers.

#### Outliers Detection

* Investigate outliers for every variable and find the necessary variables suitable for modelling

In [None]:
clean_data[clean_data["avg_dist"] > 50].shape

In [None]:
clean_data[clean_data["avg_dist"] > 50].head(5)

Why is trips_in_first_30_days = 0 when the avg_dist travelled by the customer is greater than 0?

* If a customer did not take any trip in the first 30 days after signing up, then the distance travelled should be 0.

In [None]:
inconsistent = clean_data[(clean_data["avg_dist"] > 0) & (clean_data["trips_in_first_30_days"] == 0)]
#only the last expression of a cell is displayed, so print the shape explicitly
print(inconsistent.shape)
inconsistent.head(10)

There are 15,000 such customers (that's 30% of observations).

Looks like the variable has quality issues. We'll drop the variable from further analysis and not include it for modeling.

In [None]:
clean_data.drop(["trips_in_first_30_days"],inplace = True, axis =1)

#Lets look at remaining outliers
zero_dist = clean_data[clean_data["avg_dist"] == 0]
print(zero_dist.shape)
zero_dist.head(10)

### Data Exploration & Analysis (1 point)

#### Univariate Analysis

* Analyze each variable individually with appropriate plot

In [None]:
#Churn distribution
clean_data.churned.value_counts(normalize = True).plot(kind = "bar",title = "Class distribution: Churned")
plt.xticks(np.arange(2),labels = ["churned","active"])
plt.show()

#### Categorical variables

In [None]:
cat_cols = separate(clean_data)["categorical"]
cat_cols.remove("churned")
cat_cols

In [None]:
fig, ax= plt.subplots(1,3, figsize = (15,5))
for i, col in enumerate(cat_cols):
    sns.countplot(x = col,data = clean_data, ax = ax[i])
    ax[i].set_title("Distribution: "+ col.upper())
clean_data[cat_cols].describe()

#### Numerical Variables

In [None]:
cont_cols = separate(clean_data)["continuous"]
clean_data[cont_cols].describe()

In [None]:

clean_data[cont_cols].mean().plot(kind = "bar")
plt.title("Mean Distribution")
plt.show()

In [None]:
clean_data[cont_cols].hist(figsize = (15,10),bins = 20)
plt.show()

**Insights**

* Almost all variables are skewed. We need to transform them.
* Customer/driver ratings can give us an insight into their behaviour and personality. We can create new features using these variables.
* `avg_surge` has most observations at 1 and `surge_pct` at 0. There could be some correlation here. This needs further analysis.
* All outlier points need further investigation.
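
One common way to reduce right skew is a `log1p` transform. The sketch below is illustrative only: it uses a synthetic, made-up sample standing in for a column such as `avg_dist`, and the column name `avg_dist_log` is hypothetical. Whether a transform helps on the real data would need to be checked against the histograms above.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for a column such as avg_dist.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"avg_dist": rng.lognormal(mean=1.0, sigma=1.0, size=5000)})

# log1p(x) = log(1 + x) is defined at zero, which matters for columns with many zeros.
toy["avg_dist_log"] = np.log1p(toy["avg_dist"])

skew_before = toy["avg_dist"].skew()
skew_after = toy["avg_dist_log"].skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Tree-based models are insensitive to monotonic transforms of a single feature, so a step like this mainly matters for models such as logistic regression.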

#### Bivariate Analysis

* Identify relationships between variables with appropriate plot



In [None]:
fig,ax = plt.subplots(1,3,figsize = (15,5))
for i in range(len(separate(clean_data)["categorical"])-1):
    temp = clean_data.groupby(separate(clean_data)["categorical"][i])["churned"].value_counts(normalize = True).unstack()
    temp = temp[[1,0]]
    temp.plot(kind = "bar",stacked = True,rot = 0,ax = ax[i])
    ax[i].hlines(0.63,-10,100,linestyle = "dashed") #dashed line is the average customer churn rate
plt.show()

**Insights**

* Astapor has a higher churn rate than average, which suggests customers there are less satisfied. King's Landing has a very low churn rate, so operations there appear to be working well.
* Android users are unhappy / churning. There can be various issues here, for example the UI of the Android app being too complex or difficult for customers, or customers experiencing other problems.
* Customers who took a luxury car in their first 30 days churn less. Promoting the use of luxury cars may help retention.
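
Comparisons like these can also be read off numerically. The sketch below computes the churn rate per group and its difference from the overall rate on a made-up toy frame; the rows are invented for illustration and do not reflect the real proportions.

```python
import pandas as pd

# Toy rows (made up); churned is 1 for churned users and 0 for active ones.
toy = pd.DataFrame({
    "city": ["Astapor"] * 3 + ["King's Landing"] * 3,
    "churned": [1, 1, 0, 1, 0, 0],
})

overall_rate = toy["churned"].mean()
# The mean of a 0/1 column within each group is that group's churn rate.
rate_by_city = toy.groupby("city")["churned"].mean().sort_values(ascending=False)
# Positive values mean the group churns more than average.
lift = rate_by_city - overall_rate
print(rate_by_city)
print(lift)
```

In this notebook `churned` is stored as a category, so it would need `.astype(int)` before taking the mean.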

### Feature Engineering (1 point)

* Create a feature indicating whether an Android user has faced surge pricing
* Create a variable indicating whether the user is good/bad by grouping the customers' average ratings
* Create a variable identifying 3 groups of the population by grouping `weekday_pct`
   - those who don't ride during the week
   - those who ride only during the week
   - others
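
As a sketch of the grouping steps listed above, `pd.cut` can bin a rating column in one call. The toy values are made up; the cut points 3 and 4 and the labels follow the solution code in this notebook and are otherwise arbitrary choices.

```python
import pandas as pd

# Toy values (made up) covering the bin edges.
toy = pd.DataFrame({
    "avg_rating_by_driver": [5.0, 4.0, 3.9, 3.0, 2.5],
    "weekday_pct": [0.0, 100.0, 66.7, 50.0, 0.0],
})

# right=False makes each bin closed on the left: [0, 3) bad, [3, 4) okay, [4, inf) good.
toy["customer_behaviour"] = pd.cut(
    toy["avg_rating_by_driver"],
    bins=[0, 3, 4, float("inf")],
    labels=["bad", "okay", "good"],
    right=False,
)

# Three weekday groups: exactly 0 -> none, exactly 100 -> all, anything else -> some.
toy["ride_during_week"] = "some"
toy.loc[toy["weekday_pct"] == 0, "ride_during_week"] = "none"
toy.loc[toy["weekday_pct"] == 100, "ride_during_week"] = "all"
print(toy)
```

Assigning through `.loc` with a boolean mask updates the frame in place, unlike chained indexing, which may write to a temporary copy.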
   
  

In [None]:
#Android users facing surge
android_surge = (clean_data["phone"] == "Android") & (clean_data["surge_pct"] != 0)
clean_data["Android_user_facing_surge"] = np.where(android_surge, "Yes", "No")

In [None]:
# Converting ratings into a categorical variable
rating = clean_data["avg_rating_by_driver"]
# np.select takes the first matching condition: >= 4 is good, [3, 4) is okay, the rest is bad
clean_data["customer_behaviour"] = np.select([rating >= 4, rating >= 3], ["good", "okay"], default = "bad")

In [None]:
#Weekday pct into groups
weekday = clean_data["weekday_pct"]
#0 -> none, 100 -> all, anything in between -> some
clean_data["ride_during_week"] = np.select([weekday == 0, weekday == 100], ["none", "all"], default = "some")

In [None]:
#sanity check
clean_data.info()

In [None]:
#changing the *_replaced indicator columns to bool so they are treated as categorical
clean_data["rating_by_driver_replaced"] = clean_data["rating_by_driver_replaced"].astype("bool")
clean_data["rating_of_driver_replaced"] = clean_data["rating_of_driver_replaced"].astype("bool")

In [None]:
#Removing signup_date
clean_data.drop("signup_date",axis =1, inplace = True)
clean_data.head()

### Data Preprocessing

#### Plot the correlation heatmap and analyze

In [None]:
plt.figure(figsize = (15,10))
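#numeric_only=True skips the remaining non-numeric columns, which newer pandas versions refuse to correlate
sns.heatmap(clean_data.corr(numeric_only = True),annot = True,linewidths = 0.2)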
plt.show()

In [None]:
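#Dropping avg_surge, presumably because it is strongly correlated with surge_pct (check the heatmap above)
clean_data.drop("avg_surge",axis = 1,inplace = True)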

In [None]:
#Converting categorical variables into dummy variables
clean_data = pd.get_dummies(clean_data, drop_first = True,columns = ["city","phone","Android_user_facing_surge",
                                                                     "customer_behaviour","ride_during_week"])

In [None]:
#Before scaling, lets divide our data into training and testing splits
features = [i for i in clean_data.columns if i != "churned"]
target = "churned"
#a 1-D integer target avoids shape warnings when fitting the models
X_train, X_test, y_train, y_test = train_test_split(clean_data[features], clean_data[target].astype(int), test_size=0.2,random_state = 1)

#### Scaling the features

In [None]:
scale = StandardScaler().fit(X_train)
X_train = scale.transform(X_train)
X_test = scale.transform(X_test)
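
Fitting the scaler on the training split only, as above, keeps the test set out of the scaling statistics. During cross-validation on an already scaled `X_train`, however, the scaler has seen every fold. A minimal sketch of one way around this is to wrap the scaler and the model in a pipeline. It uses synthetic data from `make_classification`, not the churn data, and the names `X_toy`, `y_toy`, and `pipe` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features (not the real data).
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)

# The scaler is re-fitted on the training part of every fold, so no statistics
# from the validation part leak into the transformation.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X_toy, y_toy, scoring="roc_auc", cv=5)
print(f"roc_auc: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

For standardization the effect of this leak is usually small, but the pipeline form also keeps preprocessing and model together when the model is later tuned or deployed.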

#### PCA Analysis

In [None]:
cor_mat1 = np.corrcoef(X_train.T)
# eigvalsh is meant for symmetric matrices and always returns real eigenvalues
eig_vals = np.linalg.eigvalsh(cor_mat1)
# Looking at sorted eigenvalues
sorted_eigs = sorted(np.around(eig_vals, 5), reverse = True)
print('Eigenvalues in descending order:\n',sorted_eigs)
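
Each eigenvalue of the correlation matrix, divided by the sum of all eigenvalues, gives the share of variance carried by that component. The sketch below checks this against scikit-learn's `PCA` on a synthetic matrix (made up, not the churn features). The 95% threshold is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training matrix (not the real data):
# five columns built from two underlying factors plus a little noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(300, 2))
X_toy = np.hstack([base, base @ rng.normal(size=(2, 3))]) + 0.05 * rng.normal(size=(300, 5))
X_toy = StandardScaler().fit_transform(X_toy)

# Eigenvalues of the correlation matrix, largest first, as shares of the total.
eig_vals = np.linalg.eigvalsh(np.corrcoef(X_toy.T))[::-1]
share_from_eigs = eig_vals / eig_vals.sum()

# PCA on standardized data reports the same shares as explained_variance_ratio_.
pca = PCA().fit(X_toy)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components_95 = int(np.searchsorted(cumulative, 0.95) + 1)

print(np.round(share_from_eigs, 3))
print(np.round(pca.explained_variance_ratio_, 3))
print("components needed for 95% of the variance:", n_components_95)
```

A few dominant eigenvalues followed by many near-zero ones would indicate that the features are strongly correlated and could be compressed.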

### Train the Machine Learning models (3 points)


* Apply all the ML models on the data



#### Evaluation Metrics - Model Comparison

* Since the goal is to identify churning customers correctly, focus on getting true positives right (a high TPR). False positives (customers we predicted would churn but who did not) matter less and can be tolerated.

* Also keep false negatives low (customers we predicted would not churn but who did). These errors are costly, because we may lose those customers without ever trying to retain them.

**The main target is to MAXIMIZE TRUE POSITIVES and MINIMIZE FALSE NEGATIVES.**

**Metrics:** Plot the ROC curve (with its AUC) and the confusion matrix for all the models.
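
The quantities named above can be read from a confusion matrix. The following is a minimal sketch on made-up labels, with 1 as the churned (positive) class. In the notebook, `y_pred` would come from something like `model.predict(X_test)`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 1 = churned (positive class), 0 = active.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 0])

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

tpr = tp / (tp + fn)   # share of churners that were caught (recall)
fnr = fn / (tp + fn)   # share of churners that were missed
print(f"TP={tp} FN={fn} FP={fp} TN={tn}  TPR={tpr:.2f}  FNR={fnr:.2f}")
```

TPR and FNR always sum to 1, so maximizing true positives and minimizing false negatives are the same objective. Lowering the decision threshold applied to `predict_proba` raises TPR at the cost of more false positives.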


In [None]:
def cross_validation(model,xtrain,ytrain, scoretype, folds):
    scores = cross_val_score(estimator = model,X= xtrain, y = ytrain,scoring = scoretype,cv = folds)
    #report the metric that was actually requested
    print("%s: %0.3f (+/- %0.2f)" % (scoretype,scores.mean(),scores.std()))

#Note: this helper shares its name with sklearn.metrics.roc_curve, which it calls through the metrics module
def roc_curve(X_test,y_test,model,model_name):
    #predicted probability of the positive class (churned = 1)
    y_score = model.predict_proba(X_test)[:,1]
    roc_auc = metrics.roc_auc_score(y_test,y_score)
    fpr,tpr,threshold = metrics.roc_curve(y_test,y_score)
    plt.figure()
    plt.plot(fpr,tpr,label = "Model:" + model_name +(" (AUC) = %0.2f")%roc_auc)
    plt.plot([0,1],[0,1],"r--")
    plt.xlim(0,1)
    plt.ylim(0,1)
    plt.legend(loc = "lower right")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("ROC curve")
    plt.show()

**Models**

Logistic Regression

In [None]:
log_model = LogisticRegression()
log_model.fit(X_train, y_train)

cross_validation(log_model,X_train,y_train,scoretype = "roc_auc",folds = 10)
roc_curve(X_test,y_test,log_model,"Logistic Regression")

Decision Tree

In [None]:
dtree = DecisionTreeClassifier(random_state = 42)
dtree.fit(X_train,y_train)

cross_validation(dtree,X_train,y_train,scoretype = "roc_auc",folds = 10)
roc_curve(X_test,y_test,dtree,"Decision Tree")

Random Forest

In [None]:
rf = RandomForestClassifier(random_state = 42)
rf.fit(X_train,y_train)

cross_validation(rf,X_train,y_train,scoretype = "roc_auc",folds = 10)
roc_curve(X_test,y_test,rf,"Random Forest")

Gradient Boosting

In [None]:
gbm = ensemble.GradientBoostingClassifier(random_state = 30)
gbm.fit(X_train,y_train)

cross_validation(gbm,X_train,y_train,scoretype = "roc_auc",folds = 10)
roc_curve(X_test,y_test,gbm,"Gradient Boosting Classifier")

AdaBoost

In [None]:
ab = ensemble.AdaBoostClassifier(random_state = 30)
ab.fit(X_train,y_train)

cross_validation(ab,X_train,y_train,scoretype = "roc_auc",folds = 10)
roc_curve(X_test,y_test,ab,"Ada Boost Classifier")

In [None]:
#combined results
plt.figure(figsize = (15,7))

#one ROC curve per fitted model, drawn on the same axes
fitted_models = [("Logistic Regression",log_model),("Decision Tree",dtree),("Random Forest",rf),
                 ("Gradient Boosting",gbm),("AdaBoost",ab)]
for name, model in fitted_models:
    y_score = model.predict_proba(X_test)[:,1]
    roc_auc = metrics.roc_auc_score(y_test,y_score)
    fpr,tpr,threshold = metrics.roc_curve(y_test,y_score)
    plt.plot(fpr,tpr,label = "Model: " + name + (" (AUC) = %0.2f")%roc_auc)

plt.plot([0,1],[0,1],"r--")
plt.xlim(0,1)
plt.ylim(0,1)
plt.legend(loc = "lower right")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.show()

#### Model Optimization: Hyperparameter tuning

Shortlist the best-performing models and tune their hyperparameters to see whether performance can be improved further.

There are two common ways to tune hyperparameters:

1. Grid Search
2. Random Search
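
The difference between the two approaches can be sketched on a small synthetic problem. The data, the search space, and the budget of 4 random candidates below are made up so that the example runs in seconds; they are not the settings used for the churn models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Small synthetic problem (not the churn data).
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

space = {"learning_rate": [0.01, 0.05, 0.1], "n_estimators": [20, 40], "max_depth": [2, 3]}
model = GradientBoostingClassifier(random_state=0)

# Grid search fits every combination: 3 * 2 * 2 = 12 candidates.
grid = GridSearchCV(model, param_grid=space, scoring="roc_auc", cv=3)
grid.fit(X_toy, y_toy)

# Random search samples a fixed budget of combinations (here 4) from the same space.
rand = RandomizedSearchCV(model, param_distributions=space, n_iter=4,
                          scoring="roc_auc", cv=3, random_state=0)
rand.fit(X_toy, y_toy)

print("grid:  ", len(grid.cv_results_["params"]), "candidates,", grid.best_params_)
print("random:", len(rand.cv_results_["params"]), "candidates,", rand.best_params_)
```

Grid search is exhaustive but its cost multiplies with every added parameter, while random search keeps the cost fixed at `n_iter` and may miss the best combination.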

In [None]:
#Tuning learning-rate and number of trees - with change in learning rate, we are adjusting number of estimators as well
param_grid = { 'learning_rate' : [0.15,0.1,0.05,0.01,0.005,0.001],
              'n_estimators' :[100,250,500,750,1000,1250,1500,1750],
              }
grid = RandomizedSearchCV(estimator = ensemble.GradientBoostingClassifier(),param_distributions = param_grid,n_jobs =-1,scoring = "roc_auc")
grid.fit(X_train,y_train)

In [None]:
#Tuning max_depth (it must be a positive integer, so the grid starts at 2 instead of 0)
param_grid = { 'max_depth' : [2,4,6,8,10],
              }
grid = GridSearchCV(estimator = ensemble.GradientBoostingClassifier(learning_rate = 0.05, n_estimators = 750 ),param_grid = param_grid,n_jobs =-1,cv =5,scoring = "roc_auc")
grid.fit(X_train,y_train)

In [None]:
print(grid.best_params_, grid.best_score_)

In [None]:
#Tuned Classifier
gbm_tuned = ensemble.GradientBoostingClassifier(learning_rate =0.05 ,n_estimators =750 ,random_state = 42, max_depth =2)
gbm_tuned.fit(X_train,y_train)
roc_curve(X_test,y_test,gbm_tuned,"Tuned Gradient Boosting Classifier")

### Factors driving customers to churn (1 point)

* Find the factors from the data which are causing customers to churn

* Plot the features with a bar plot

Hint: `model.feature_importances_`

In [None]:
temp = pd.DataFrame({'features': features, 'importance': gbm_tuned.feature_importances_}).sort_values('importance',ascending = False)
temp

In [None]:
chart = sns.barplot(x = "features",y = "importance",data = temp)
chart.set_xticklabels(chart.get_xticklabels(), rotation=90)
plt.show()
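
Impurity-based `feature_importances_` are computed on the training data, so a common cross-check is permutation importance on held-out data. This is a different technique from the one used above, shown here only as a hedged sketch on synthetic data; with `shuffle=False`, the first two columns of the toy matrix are the informative ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (not the churn data): columns 0 and 1 are informative, the rest are noise.
X_toy, y_toy = make_classification(n_samples=600, n_features=5, n_informative=2,
                                   n_redundant=0, shuffle=False, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the drop in ROC AUC.
result = permutation_importance(model, X_te, y_te, scoring="roc_auc",
                                n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]

print("impurity-based:", np.round(model.feature_importances_, 3))
print("permutation:   ", np.round(result.importances_mean, 3))
print("columns ranked by permutation importance:", ranking)
```

Either measure shows which features the model relies on, not what causes churn, so the ranking is best read alongside the EDA findings.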

### Report Analysis

* Find the city which is experiencing a higher churn rate than average.

* Which app users (Android/iOS) are unhappy / churning, and why?

* Derive an insight on luxury cars and churned customers.

* Discuss the overall factors causing customer churn and the reasons for poor ratings.