
### Mini Project Notebook: Customer Churn Analysis

## Learning Objectives

At the end of the experiment, you will be able to:

* identify users who are likely to churn in the future
* find the factors that drive users to churn
* perform EDA on the given churn data and prepare the data for the prediction task
* apply various machine learning algorithms and analyse the results


## Information

**Churn Analysis**

Customer churn analysis is the study of customer attrition in a company. It helps identify the causes of churn and informs effective retention strategies.


Customer churn describes subscribers to a service who discontinue it within a given time frame. Churn prediction consists of detecting which customers are likely to cancel a subscription, based on how they use the service.

Businesses often have to invest substantial amounts in attracting new clients, so every client who leaves represents a significant lost investment. Both time and effort then need to be channelled into replacing them. Being able to predict when a client is likely to leave, and offering them incentives to stay, can save a business a great deal.

## Dataset

The dataset chosen for this task is a customer churn dataset covering users' trips, driver ratings, and luxury car usage. Every row represents a separate customer. The data has a total of 50,000 customers.

Variables and their descriptions:
* **city:**	city this user signed up in
* **phone:**	primary device for this user
* **signup_date:**	date of account registration; in the form `YYYYMMDD`
* **last_trip_date:**	the last time this user completed a trip; in the form `YYYYMMDD`
* **avg_dist:**	the average distance (in miles) per trip taken in the first 30 days after signup
* **avg_rating_by_driver:**	the rider’s average rating over all of their trips
* **avg_rating_of_driver:**	the rider’s average rating of their drivers over all of their trips
* **surge_pct:**	the percent of trips taken with surge multiplier > 1
* **avg_surge:**	the average surge multiplier over all of this user’s trips
* **trips_in_first_30_days:**	the number of trips this user took in the first 30 days after signing up
* **luxury_car_user:**	TRUE if the user took a luxury car in their first 30 days; FALSE otherwise
* **weekday_pct:**	the percent of the user’s trips occurring during a weekday



## Problem Statement

Analyse and preprocess the data, then build a machine learning model to predict customer churn.

## Grading = 10 Points

In [1]:
#@title Download Dataset
!wget -qq https://cdn.iisc.talentsprint.com/CDS/MiniProjects/churn.csv
print("Dataset downloaded successfully!!")

Dataset downloaded successfully!!


### Import required Packages 

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn import ensemble
from xgboost import XGBClassifier
import warnings
warnings.simplefilter('ignore')

### Load the data and summarize (1 point)

In [None]:
# reading the .csv file
path = "/content/churn.csv"
# YOUR CODE HERE

#### Summarize the data
* Explore the datatypes of each column
* Identify the numerical, categorical and date columns and convert to appropriate data types if required
* Identify the columns with missing values and handle them appropriately
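
As a rough illustration of the steps above, the sketch below uses a tiny hand-made frame in place of `churn.csv` (all values are invented). It assumes the date columns arrive as `YYYYMMDD` strings or integers, as the dataset description states; adjust the `format` if the file differs.

```python
import pandas as pd

# Toy rows standing in for churn.csv; every value here is made up.
df = pd.DataFrame({
    "city": ["A", "B", "A"],
    "phone": ["Android", None, "iPhone"],
    "signup_date": ["20140105", "20140120", "20140111"],
    "last_trip_date": ["20140620", "20140315", "20140701"],
    "avg_rating_of_driver": [4.5, None, 5.0],
})

# Parse the YYYYMMDD values into proper datetimes.
for col in ["signup_date", "last_trip_date"]:
    df[col] = pd.to_datetime(df[col].astype(str), format="%Y%m%d")

# Treat low-cardinality text columns as categoricals.
for col in ["city", "phone"]:
    df[col] = df[col].astype("category")

print(df.dtypes)

# Count missing values and show only the columns that have any.
missing = df.isna().sum()
print(missing[missing > 0])
```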

In [None]:
# YOUR CODE HERE

#### Breakdown by months

* Using the `last_trip_date` get the data for each month 
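
One possible way to get a monthly breakdown is to convert each date to a monthly period and count users per period. The dates below are invented for illustration.

```python
import pandas as pd

# Toy last-trip dates (made up), parsed from the YYYYMMDD form.
last_trip = pd.to_datetime(
    pd.Series(["20140115", "20140520", "20140603", "20140628", "20140701"]),
    format="%Y%m%d",
)

# Map each date to its calendar month and count users per month.
users_per_month = last_trip.dt.to_period("M").value_counts().sort_index()
print(users_per_month)
```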

In [None]:
# YOUR CODE HERE

### Data Preparation (Target variable - Churn) (2 points)

Clearly, users who used the app in June or July are still loyal to the company. However, customers who last used the app before June (in May or earlier) have gone a considerable time without using it. Let's mark them as inactive (users who have churned).

**Note:** Any user whose last trip with the company was before 1st June, 2014 is considered to be "churned". 
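
The rule in the note can be expressed as a single comparison against the cutoff date. This is a minimal sketch on invented dates; the column name `churn` is an assumption, not something the dataset provides.

```python
import pandas as pd

# Toy last-trip dates (made up) around the 1 June 2014 cutoff.
df = pd.DataFrame({
    "last_trip_date": pd.to_datetime(
        ["20140115", "20140531", "20140601", "20140701"], format="%Y%m%d"
    )
})

cutoff = pd.Timestamp("2014-06-01")

# 1 = churned (last trip strictly before the cutoff), 0 = still active.
df["churn"] = (df["last_trip_date"] < cutoff).astype(int)

print(df)
print("churn rate:", df["churn"].mean())
```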

In [None]:
# YOUR CODE HERE

#### Handle the Duplicates

Although we don't have a unique ID for each customer, it is highly unlikely that two customers share the same values in every column. Find such rows in the data (customers with the same city, same phone, same signup_date, same last_trip_date, and so on) and drop them.

Hint: `drop_duplicates()`
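
A small sketch of the hint, on invented rows: `duplicated()` flags rows that repeat an earlier row in every column, and `drop_duplicates()` keeps the first occurrence of each.

```python
import pandas as pd

# Toy rows (made up); the first two are identical in every column.
df = pd.DataFrame({
    "city": ["A", "A", "B"],
    "phone": ["Android", "Android", "iPhone"],
    "signup_date": ["20140105", "20140105", "20140111"],
    "last_trip_date": ["20140620", "20140620", "20140701"],
    "avg_dist": [3.2, 3.2, 8.1],
})

# Count fully duplicated rows, then drop them.
n_dupes = int(df.duplicated().sum())
deduped = df.drop_duplicates().reset_index(drop=True)

print("duplicates found:", n_dupes)
print(deduped)
```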


In [None]:
# YOUR CODE HERE

#### Separate columns by data types

* Identify and separate the columns based on their data type 

  eg. categorical, continuous, date

In [None]:
# YOUR CODE HERE

#### Handle the null values

* Identify and handle the null values and provide the justification
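
One common option, shown here only as a sketch on invented values, is to fill numeric gaps with the median and categorical gaps with the most frequent value. Whether this is appropriate for the real data is part of the justification the task asks for.

```python
import numpy as np
import pandas as pd

# Toy frame (made up) with one missing value per column.
df = pd.DataFrame({
    "phone": ["Android", None, "iPhone", "iPhone"],
    "avg_rating_of_driver": [4.0, np.nan, 5.0, 3.0],
})

# Numeric column: the median is robust to skewed rating distributions.
rating_median = df["avg_rating_of_driver"].median()
df["avg_rating_of_driver"] = df["avg_rating_of_driver"].fillna(rating_median)

# Categorical column: fall back to the most frequent value (the mode).
phone_mode = df["phone"].mode()[0]
df["phone"] = df["phone"].fillna(phone_mode)

print(df)
```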

In [None]:
# YOUR CODE HERE

#### Outliers Detection

* Investigate outliers for every variable
* Accordingly, identify the variables suitable for modelling
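
The notebook does not prescribe an outlier rule. As one assumption, the sketch below applies the usual 1.5 × IQR fence to an invented `avg_dist` series; box plots or z-scores would be equally valid starting points.

```python
import pandas as pd

# Toy distances (made up) with one obviously extreme value.
avg_dist = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 60.0], name="avg_dist")

# Interquartile range and the conventional 1.5 * IQR fences.
q1, q3 = avg_dist.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers = avg_dist[(avg_dist < lower) | (avg_dist > upper)]
print(f"fences: [{lower}, {upper}]")
print(outliers)
```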

In [None]:
# YOUR CODE HERE

### Data Exploration & Analysis (1 point)

  - Derive exploratory insights
    - trends
    - anomalies
    - interesting observations

In [None]:
# YOUR CODE HERE

#### Univariate Analysis

* Analyze each variable individually with appropriate plot

In [None]:
# YOUR CODE HERE

#### Categorical variables

In [None]:
# YOUR CODE HERE

#### Numerical Variables

In [None]:
# YOUR CODE HERE

#### Multivariate Analysis

* Identify relationships between variables with appropriate plot 



In [None]:
# YOUR CODE HERE

### Feature Engineering (1 point)

* Create a feature to count the instances of surge pricing for Android vs non-Android users
* Create a variable based on ratings indicating whether a user is good/bad, by grouping the customers' average ratings
* Create a variable to identify 3 groups of the population by grouping the `weekday_pct`
   - those who don't ride during the week
   - those who ride only during the week
   - others
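
For the `weekday_pct` grouping, one reading is that 0 means no weekday rides and 100 means only weekday rides. The sketch below follows that assumption on invented values; the column name `weekday_group` and its labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy weekday percentages (made up).
df = pd.DataFrame({"weekday_pct": [0.0, 100.0, 46.2, 100.0, 12.5]})

# Two explicit conditions; everything else falls into the default group.
conditions = [df["weekday_pct"] == 0, df["weekday_pct"] == 100]
labels = ["weekend_only", "weekday_only"]  # hypothetical label names
df["weekday_group"] = np.select(conditions, labels, default="mixed")

print(df)
```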
   
  

In [None]:
# YOUR CODE HERE

### Data Preprocessing (1 point)

#### Plot the correlation heatmap and analyze

- Identify and drop the highly correlated features
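
A possible way to pick the features to drop is to scan the upper triangle of the absolute correlation matrix. The data below is synthetic, with `avg_surge` built to be nearly collinear with `surge_pct`, and the 0.9 threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Synthetic data: avg_surge is constructed to track surge_pct closely.
rng = np.random.default_rng(0)
surge_pct = rng.uniform(0, 100, 200)
df = pd.DataFrame({
    "surge_pct": surge_pct,
    "avg_surge": 1 + surge_pct / 100 + rng.normal(0, 0.01, 200),
    "avg_dist": rng.uniform(0, 20, 200),
})

# Keep only the upper triangle so each pair is considered once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair whose correlation exceeds 0.9.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)

print("dropping:", to_drop)
print(reduced.columns.tolist())
```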

In [None]:
# YOUR CODE HERE

#### Scale the features

In [None]:
# YOUR CODE HERE

#### PCA Analysis

* Apply PCA and analyze the variables with explained variance ratio
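
As a sketch on synthetic data, scaling followed by PCA exposes `explained_variance_ratio_`, whose cumulative sum shows how many components are needed to retain a chosen share of the variance. The 95% target is an assumption, not a requirement of the task.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic features; column 1 is made to depend strongly on column 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
X[:, 1] = X[:, 0] * 2 + rng.normal(scale=0.1, size=300)

# PCA is scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# Cumulative explained variance and the component count reaching 95%.
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components_95 = int(np.searchsorted(cum_var, 0.95) + 1)

print(pca.explained_variance_ratio_.round(3))
print("components for 95% variance:", n_components_95)
```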

In [None]:
# YOUR CODE HERE

### Predict churn/no churn using Machine Learning models (3 points)


* Apply suitable ML models on the data and evaluate the models
* Improve the performance of the models



#### Build, fit and evaluate ML models

- Any ML classifiers can be used. Compare at least 4 different models.
- Since the goal is to identify churning customers correctly, focus on getting True Positives right (high TPR). False Positive errors (customers we predicted would churn, but who did not) are less important and can be tolerated.
- Also, minimize False Negative errors (customers we predicted would not churn, but who did). These errors may cost us the customers.

**The main target is to MAXIMIZE TRUE POSITIVES and MINIMIZE FALSE NEGATIVE ERRORS!**

**Metrics:** Plot the ROC curve (with its AUC) and the confusion matrix for all the models.
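
The metrics above can be computed as in this sketch, which trains a logistic regression on synthetic data from `make_classification` (not the churn data) and omits plotting. TPR equals recall, so `recall_score` gives the same number as `tp / (tp + fn)`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the prepared churn features.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)  # lowering 0.5 trades more FPs for fewer FNs

# Confusion-matrix cells and the true positive rate.
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
tpr = tp / (tp + fn)

# Threshold-independent summary and the points of the ROC curve.
auc = roc_auc_score(y_test, proba)
fpr_curve, tpr_curve, thresholds = roc_curve(y_test, proba)

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"TPR (recall)={tpr:.3f}  AUC={auc:.3f}")
```

The arrays `fpr_curve` and `tpr_curve` can be passed to `plt.plot` to draw the ROC curve for each model.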


In [None]:
# YOUR CODE HERE

#### Model Optimization: tuning hyperparameters

Shortlist the best-performing models, tune their hyperparameters, and see whether the performance can be improved further.

Two common ways to select optimal hyperparameters are:

1. Grid Search
2. Random Search
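
Both approaches are available in scikit-learn as `GridSearchCV` and `RandomizedSearchCV`. The sketch below runs them on synthetic data with a decision tree; the parameter ranges are arbitrary assumptions, and `scoring="recall"` is chosen to match the emphasis on true positives.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem standing in for the prepared churn features.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Illustrative search space (values are assumptions, not recommendations).
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 10]}

# Grid search tries all 9 combinations.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, scoring="recall", cv=3
).fit(X, y)

# Random search samples only n_iter combinations from the same space.
rand = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid,
    n_iter=4, scoring="recall", cv=3, random_state=0,
).fit(X, y)

print("grid  :", grid.best_params_, round(grid.best_score_, 3))
print("random:", rand.best_params_, round(rand.best_score_, 3))
```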

In [None]:
# YOUR CODE HERE

### Factors driving customers to churn (1 point)

* Find the factors from the data which are causing customers to churn

* Plot the features with a bar plot

Hint: `model.feature_importances_`
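
A sketch of the hint with a random forest on synthetic data, where the toy target depends only on `trips_in_first_30_days` by construction. Note that importances indicate which features the model relies on, which is not by itself proof of causation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features using column names from the dataset description.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "trips_in_first_30_days": rng.poisson(3, 400),
    "avg_dist": rng.uniform(0, 20, 400),
    "weekday_pct": rng.uniform(0, 100, 400),
})

# Toy target driven only by the trip count (an assumption for the demo).
y = (X["trips_in_first_30_days"] < 2).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pair each importance with its column name and sort for plotting.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(importances)
```

Calling `importances.plot.bar()` on the sorted series would produce the requested bar plot.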

In [None]:
# YOUR CODE HERE

### Report Analysis

* Find the city which is experiencing a higher churn rate than average. 

* Which app users (Android/iOS) are unhappy / churning, and why?

* Derive an insight on the luxury cars and customers churned.

* Discuss the overall factors causing the customer churn and reasons for poor ratings.
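
For the city question, a grouped mean of the churn flag can be compared against the overall rate. This sketch uses invented rows and assumes a 0/1 `churn` column has already been created.

```python
import pandas as pd

# Toy rows (made up): city labels and a 0/1 churn flag.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "churn": [1, 1, 0, 1, 0, 0],
})

# Overall churn rate and the rate per city.
overall_rate = df["churn"].mean()
rate_by_city = df.groupby("city")["churn"].mean()

# Cities whose churn rate exceeds the overall average.
above_average = rate_by_city[rate_by_city > overall_rate]

print("overall:", overall_rate)
print(rate_by_city)
print("above average:", above_average.index.tolist())
```

The same `groupby` pattern applies to `phone` and `luxury_car_user` for the other questions.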