#### Sample Problem Statement

Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs.

Step 1:

- load libraries
- import the json
- convert the json file into dataframe
- output the data description

### Imports
Let us start by importing some common Python libraries.

In [66]:
import pandas as pd
import pyodbc
import numpy as np
import matplotlib.pyplot as plt
import json

%matplotlib inline

### Get the Data
After importing these libraries, we load the JSON file with the json module and convert it to a pandas DataFrame.

In [79]:
with open("uber_data_challenge.json") as datafile:
    data = json.load(datafile)
df = pd.DataFrame(data)
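
The loading step can be tried without the file. This is a minimal sketch, assuming the JSON holds a list of records (the real layout of uber_data_challenge.json is not shown here). The name `raw` is a hypothetical stand-in for the file contents, with two sample records taken from the first rows of the data:

```python
import json
import pandas as pd

# Hypothetical stand-in for the file contents (assumed list-of-records layout).
raw = """[
  {"avg_dist": 3.67, "city": "King's Landing", "signup_date": "2014-01-25", "trips_in_first_30_days": 4},
  {"avg_dist": 8.26, "city": "Astapor", "signup_date": "2014-01-29", "trips_in_first_30_days": 0}
]"""

records = json.loads(raw)          # json.load does the same for an open file object
sample_df = pd.DataFrame(records)  # one row per record, one column per key

# Dates arrive as strings; convert them if date arithmetic is needed later.
sample_df["signup_date"] = pd.to_datetime(sample_df["signup_date"])

print(sample_df.dtypes)
```

If the file were organized differently (for example, one top-level key per column), `pd.DataFrame` would still accept it, but the resulting index could differ.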

Let’s see what this data looks like.

In [80]:
df.head()

Unnamed: 0,avg_dist,avg_rating_by_driver,avg_rating_of_driver,avg_surge,city,last_trip_date,phone,signup_date,surge_pct,trips_in_first_30_days,uber_black_user,weekday_pct
0,3.67,5.0,4.7,1.1,King's Landing,2014-06-17,iPhone,2014-01-25,15.4,4,True,46.2
1,8.26,5.0,5.0,1.0,Astapor,2014-05-05,Android,2014-01-29,0.0,0,False,50.0
2,0.77,5.0,4.3,1.0,Astapor,2014-01-07,iPhone,2014-01-06,0.0,3,False,100.0
3,2.36,4.9,4.6,1.14,King's Landing,2014-06-29,iPhone,2014-01-10,20.0,9,True,80.0
4,3.13,4.9,4.4,1.19,Winterfell,2014-03-15,Android,2014-01-27,11.8,14,False,82.4


We can see that a few features are not relevant to our problem, e.g. last_trip_date, phone, signup_date and uber_black_user. We assume they will not affect the model, so we drop them and keep the rest of the features.

### Exploratory Data Analysis

We will perform our EDA with the following steps:

1. Drop or impute missing values
2. Drop irrelevant features
3. Convert categorical features


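We begin our exploratory data analysis by checking for missing data. Note that the cells below operate on df_norm, which is built in the “Drop irrelevant features” and “Converting Categorical Features” steps further down; the cell numbers show that those cells were run first.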



#### 1. Drop or impute missing values

In [91]:
print(df_norm.isnull().sum())

avg_dist                     0
avg_rating_by_driver       201
avg_rating_of_driver      8122
avg_surge                    0
surge_pct                    0
weekday_pct                  0
trips_in_first_30_days       0
city_code                    0
dtype: int64


### Missing values

avg_rating_of_driver is missing in 8122 records and avg_rating_by_driver in 201. The phone column, which is not part of df_norm and so does not appear in the output above, has 396 missing values.

### Data Cleaning

We want to fill in missing data instead of just dropping the rows that contain it. One way to do this is imputation: replacing the missing values in each column with that column's mean.

In [92]:

#df[['avg_rating_by_driver','avg_rating_of_driver']] = df[['avg_rating_by_driver','avg_rating_of_driver']] 
df_norm.fillna(df_norm.mean(), inplace=True)
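
To see what mean imputation does on a small scale, here is a minimal sketch on a toy DataFrame. The values are illustrative only and are not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Toy ratings with gaps; values are illustrative only.
toy = pd.DataFrame({
    "avg_rating_by_driver": [5.0, np.nan, 4.0],
    "avg_rating_of_driver": [4.0, 5.0, np.nan],
})

# DataFrame.mean() skips NaN by default, so each mean uses observed values only.
col_means = toy.mean()

# Passing a Series to fillna fills each column with the value under its own label.
filled = toy.fillna(col_means)

print(filled)
```

This is also why a value such as 4.601559 shows up in avg_rating_of_driver after imputation: it is the mean of that column, substituted wherever the rating was missing.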


In [98]:
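# All columns were already filled above; assigning the result avoids an in-place call on a column selection
df_norm['surge_pct'] = df_norm['surge_pct'].fillna(df_norm['surge_pct'].mean())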
df_norm.head(10)

Unnamed: 0,avg_dist,avg_rating_by_driver,avg_rating_of_driver,avg_surge,surge_pct,weekday_pct,trips_in_first_30_days,city_code
0,3.67,5.0,4.7,1.1,15.4,46.2,4,1
1,8.26,5.0,5.0,1.0,0.0,50.0,0,0
2,0.77,5.0,4.3,1.0,0.0,100.0,3,0
3,2.36,4.9,4.6,1.14,20.0,80.0,9,1
4,3.13,4.9,4.4,1.19,11.8,82.4,14,2
5,10.56,5.0,3.5,1.0,0.0,100.0,2,2
6,3.95,4.0,4.601559,1.0,0.0,100.0,1,0
7,2.04,5.0,5.0,1.0,0.0,100.0,2,2
8,4.36,5.0,4.5,1.0,0.0,100.0,2,2
9,2.37,5.0,4.601559,1.0,0.0,0.0,1,2


In [93]:
print(df_norm.isnull().sum())

avg_dist                  0
avg_rating_by_driver      0
avg_rating_of_driver      0
avg_surge                 0
surge_pct                 0
weekday_pct               0
trips_in_first_30_days    0
city_code                 0
dtype: int64


#### 2. Drop irrelevant features

In [82]:
#df.info()

In [83]:
df1 = df.select_dtypes(include ='float64')
df1.head()

Unnamed: 0,avg_dist,avg_rating_by_driver,avg_rating_of_driver,avg_surge,surge_pct,weekday_pct
0,3.67,5.0,4.7,1.1,15.4,46.2
1,8.26,5.0,5.0,1.0,0.0,50.0
2,0.77,5.0,4.3,1.0,0.0,100.0
3,2.36,4.9,4.6,1.14,20.0,80.0
4,3.13,4.9,4.4,1.19,11.8,82.4


In [84]:
df2 = df.select_dtypes(include ='int64') 
#df2

In [85]:
df3 = df['city']

In [88]:
# Run this after cell In [87] below, which creates df['city_code']
df_norm = pd.concat([df1, df2, df['city_code']], axis=1, sort=False)
#df_norm

In [86]:
df = pd.concat([df1, df2, df3], axis=1, sort=False)
df.head()

Unnamed: 0,avg_dist,avg_rating_by_driver,avg_rating_of_driver,avg_surge,surge_pct,weekday_pct,trips_in_first_30_days,city
0,3.67,5.0,4.7,1.1,15.4,46.2,4,King's Landing
1,8.26,5.0,5.0,1.0,0.0,50.0,0,Astapor
2,0.77,5.0,4.3,1.0,0.0,100.0,3,Astapor
3,2.36,4.9,4.6,1.14,20.0,80.0,9,King's Landing
4,3.13,4.9,4.4,1.19,11.8,82.4,14,Winterfell
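
The same column selection can be written more compactly. The sketch below, on a toy DataFrame with illustrative values, passes a list of dtypes to `select_dtypes` so float and integer columns are picked in one call. The names `toy`, `numeric` and `subset` are hypothetical:

```python
import pandas as pd

# Toy frame with one column of each kind found in the data.
toy = pd.DataFrame({
    "avg_dist": pd.Series([3.67, 8.26], dtype="float64"),
    "trips_in_first_30_days": pd.Series([4, 0], dtype="int64"),
    "city": ["King's Landing", "Astapor"],
    "uber_black_user": [True, False],
})

# A list of dtypes selects float and integer columns together;
# object and bool columns are left out.
numeric = toy.select_dtypes(include=["float64", "int64"])

# Add back the one categorical column that is still needed.
subset = pd.concat([numeric, toy["city"]], axis=1)

print(list(subset.columns))
```

Selecting by dtype is convenient, but it drops columns implicitly; listing the wanted column names explicitly is an equally valid choice when the schema is known.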


#### 3. Converting Categorical Features
We’ll need to encode the categorical feature as numerical values using sklearn’s LabelEncoder. Otherwise, our machine learning algorithm won’t be able to take that feature as input directly.

In [87]:
from sklearn.preprocessing import LabelEncoder

lb_city = LabelEncoder()
df["city_code"] = lb_city.fit_transform(df["city"])
df[["city", "city_code"]].head(11)

Unnamed: 0,city,city_code
0,King's Landing,1
1,Astapor,0
2,Astapor,0
3,King's Landing,1
4,Winterfell,2
5,Winterfell,2
6,Astapor,0
7,Winterfell,2
8,Winterfell,2
9,Winterfell,2
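
The mapping from city name to code can be inspected through the encoder's `classes_` attribute. The sketch below uses a short hypothetical Series of the three city names. It also shows one-hot encoding with `pd.get_dummies` as an alternative; that alternative is an assumption for illustration and is not used in this walkthrough:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

cities = pd.Series(["King's Landing", "Astapor", "Winterfell", "Astapor"], name="city")

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)

# classes_ is sorted, so the position of each name is its integer code.
mapping = {name: code for code, name in enumerate(encoder.classes_)}
print(mapping)

# Alternative (not used in this walkthrough): one indicator column per city.
one_hot = pd.get_dummies(cities, prefix="city")
print(one_hot.columns.tolist())
```

Integer codes imply an order (0 < 1 < 2) that a linear model will treat as numeric, whereas indicator columns do not, so the choice between the two can affect the fitted coefficients.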


Before splitting the data, let’s first separate our features from the target variable. We want to predict ‘retained customer’, so it should be our ‘target’ rather than part of ‘data’. We set a rule: a user who took at least one trip in their first 30 days (trips_in_first_30_days is nonzero) counts as retained. Using this rule, we create our target variable.


In [48]:
df_norm['retained_customer'] = [1 if x != 0 else 0 for x in df_norm['trips_in_first_30_days']]

In [51]:
df_norm.head()

Unnamed: 0,avg_dist,avg_rating_by_driver,avg_rating_of_driver,avg_surge,surge_pct,weekday_pct,trips_in_first_30_days,city_code,retained_customer
0,3.67,5.0,4.7,1.1,15.4,46.2,4,1,1
1,8.26,5.0,5.0,1.0,0.0,50.0,0,0,0
2,0.77,5.0,4.3,1.0,0.0,100.0,3,0,1
3,2.36,4.9,4.6,1.14,20.0,80.0,9,1,1
4,3.13,4.9,4.4,1.19,11.8,82.4,14,2,1
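
The labelling rule can also be written as a vectorized comparison. This is a minimal sketch on the five trip counts shown above; the names `trips` and `retained` are hypothetical:

```python
import pandas as pd

trips = pd.Series([4, 0, 3, 9, 14], name="trips_in_first_30_days")

# The comparison yields booleans; astype(int) maps True/False to 1/0.
retained = (trips != 0).astype(int)

# The list comprehension from the cell above gives the same labels.
same_as_loop = [1 if x != 0 else 0 for x in trips]

print(retained.tolist())
```

Both forms produce identical labels; the vectorized one avoids a Python-level loop over 50,000 rows.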


Now we have

Feature Set: 

               [avg_dist, avg_rating_by_driver, avg_rating_of_driver, 
               avg_surge, surge_pct, weekday_pct, trips_in_first_30_days,
               city_code]
Target Variable : 
                 
                 [retained_customer] 

In [100]:
#USE this before data scaling
print(df_norm.describe())


           avg_dist  avg_rating_by_driver  avg_rating_of_driver     avg_surge  \
count  50000.000000          50000.000000          50000.000000  50000.000000   
mean       5.796827              4.778158              4.601559      1.074764   
std        5.707357              0.445753              0.564977      0.222336   
min        0.000000              1.000000              1.000000      1.000000   
25%        2.420000              4.700000              4.500000      1.000000   
50%        3.880000              5.000000              4.700000      1.000000   
75%        6.940000              5.000000              5.000000      1.050000   
max      160.960000              5.000000              5.000000      8.000000   

          surge_pct   weekday_pct  trips_in_first_30_days     city_code  
count  50000.000000  50000.000000            50000.000000  50000.000000  
mean       8.849536     60.926084                2.278200      1.136040  
std       19.958811     37.081503               

### Build Model
Our target is binary and we have several numeric features, some of which may be correlated, so logistic regression is a reasonable first choice.
 1. Split the data into training and test sets
 2. Scale the training and test data (the data summary shows a large gap between min and max for some features,
    so we scale the data to keep those features from dominating the model)
 3. Fit the model
 3. Fit the model

In [26]:
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split( df_norm.drop('retained_customer', axis=1), df_norm['retained_customer'], test_size=0.2, random_state=42)

In [59]:
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()
scaler = scaler.fit(train_x)

X_train_minmax = scaler.transform(train_x)
X_test_minmax = scaler.transform(test_x)
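
The reason for fitting the scaler on the training data only is easier to see on a toy example. The sketch below uses a single made-up feature; the arrays `train` and `test` are illustrative and unrelated to the dataset:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data; values are illustrative only.
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [12.0]])

# fit learns the minimum (0) and maximum (10) from the training data only.
scaler = MinMaxScaler().fit(train)

train_scaled = scaler.transform(train)  # (x - min) / (max - min)
test_scaled = scaler.transform(test)    # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

A test value outside the training range (12.0 here) maps outside [0, 1]. That is expected: the test set must not influence the scaling parameters.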

Many classification algorithms are available, but logistic regression is a common and useful method for solving binary classification problems.

### Logistic Regression

Predicting customer churn/retention is a binary classification problem. Customers either churn or are retained in a given period. Along with being a robust model, logistic regression provides interpretable outcomes too.

Logistic regression is a statistical method for predicting binary classes. The outcome or target variable is dichotomous, meaning there are only two possible classes. Real-life examples of classification include categorizing mail as spam or not spam, a tumor as malignant or benign, and a transaction as fraudulent or genuine. The answers to all these problems are categorical, i.e. Yes or No. Here, we have to determine whether or not a user will be retained.



In [109]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
result = model.fit(X_train_minmax, train_y)



In [111]:
from sklearn import metrics
prediction_test = model.predict(X_test_minmax)
# Print the prediction accuracy
print (metrics.accuracy_score(test_y, prediction_test))


0.8709


When we build such a prediction model, we may ask the two questions below:

1- Which characteristics make customers churn or retain?

2- What are the most critical ones? What should we focus on?



The second question can be answered by looking at the model coefficients. The exponential of a coefficient gives the multiplicative change in the odds of retention when the corresponding (scaled) feature increases by one unit. The code below lists the raw coefficients, sorted from largest to smallest:

In [103]:
# To get the weights of all the variables, labelled with the training feature names
weights = pd.Series(model.coef_[0],
                    index=train_x.columns.values)
weights.sort_values(ascending = False)

trips_in_first_30_days    74.854577
avg_surge                  0.445019
city_code                  0.111440
surge_pct                  0.013527
avg_rating_by_driver      -0.008558
weekday_pct               -0.077533
avg_rating_of_driver      -0.218198
avg_dist                  -0.368804
dtype: float64
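
To read these weights as odds ratios, each coefficient can be exponentiated. The sketch below copies the printed coefficients into a Series rather than using the fitted model, so it runs on its own; the name `odds_ratios` is hypothetical:

```python
import numpy as np
import pandas as pd

# Coefficients copied from the output above.
weights = pd.Series({
    "trips_in_first_30_days": 74.854577,
    "avg_surge": 0.445019,
    "city_code": 0.111440,
    "surge_pct": 0.013527,
    "avg_rating_by_driver": -0.008558,
    "weekday_pct": -0.077533,
    "avg_rating_of_driver": -0.218198,
    "avg_dist": -0.368804,
})

# exp(coef) is the factor by which the odds change for a one-unit increase.
odds_ratios = np.exp(weights).sort_values(ascending=False)

print(odds_ratios)
```

A ratio above 1 raises the odds of retention and a ratio below 1 lowers them. Because the features were min-max scaled, a "one-unit" increase here means moving from the training minimum to the training maximum of that feature.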

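Some variables have a positive relation to the predicted variable and some have a negative one. A positive coefficient means that larger values of the feature raise the predicted probability of retention. The clearest example is “trips_in_first_30_days”, whose coefficient is far larger than any other. This is expected, because retained_customer was derived directly from this column, so the coefficient largely reflects how the target was constructed. In practice, it suggests monitoring each user's weekly, biweekly and monthly trip counts as an early signal of retention. On the negative side, “avg_dist” has the most negative coefficient (-0.37), followed by “avg_rating_of_driver” (-0.22): customers with higher values on these features are predicted to be somewhat less likely to stay. The coefficient of “avg_rating_by_driver” (-0.009) is close to zero, so it has almost no effect on the prediction.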

### Evaluation
We can check precision, recall and f1-score using the classification report.

In [112]:
from sklearn.metrics import classification_report
print(classification_report(test_y,prediction_test))


print("Accuracy:",metrics.accuracy_score(test_y, prediction_test))


              precision    recall  f1-score   support

           0       1.00      0.58      0.73      3075
           1       0.84      1.00      0.91      6925

    accuracy                           0.87     10000
   macro avg       0.92      0.79      0.82     10000
weighted avg       0.89      0.87      0.86     10000

Accuracy: 0.8709


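We got a classification rate of about 87%. Accuracy alone can flatter a model on imbalanced data, so it should be read alongside the per-class figures: recall for class 0 is only 0.58, meaning many non-retained customers are predicted as retained.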
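
The report can be unpacked with a little arithmetic. The sketch below uses only the figures printed above and assumes that the rounded recall of 1.00 for class 1 means every class-1 row was predicted correctly; under that assumption the confusion matrix can be reconstructed:

```python
# Figures copied from the classification report above.
support_0, support_1 = 3075, 6925
accuracy = 0.8709
total = support_0 + support_1

# Baseline: always predict the majority class (1).
baseline_accuracy = support_1 / total

# Assumption: recall 1.00 for class 1 means no class-1 row was missed.
true_pos = support_1
correct = round(accuracy * total)   # number of correct predictions overall
true_neg = correct - true_pos       # correct class-0 predictions
false_pos = support_0 - true_neg    # class-0 rows predicted as class 1

precision_1 = true_pos / (true_pos + false_pos)
recall_0 = true_neg / support_0

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"class-1 precision: {precision_1:.2f}, class-0 recall: {recall_0:.2f}")
```

The reconstructed precision and recall agree with the report to two decimals, which supports the assumption. The baseline shows how much of the 87% is available simply by predicting "retained" for everyone.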

Precision: the fraction of predictions for a class that are correct. In other words, when the model predicts that a customer will be retained, precision tells us how often that prediction is right (0.84 for class 1 above).
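
As a small worked example of this definition, the sketch below computes precision by hand and with scikit-learn's `precision_score` on made-up labels. The lists `y_true` and `y_pred` are illustrative only:

```python
from sklearn.metrics import precision_score

# Made-up labels: 1 = retained, 0 = not retained.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]

# Count predictions of class 1 and how many of them were right.
predicted_pos = sum(1 for p in y_pred if p == 1)
true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

manual_precision = true_pos / predicted_pos
library_precision = precision_score(y_true, y_pred)

print(manual_precision, library_precision)
```

Here three rows are predicted as class 1 and two of them are correct, so both calculations give two thirds.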