# Ultimate Challenge - part 3

**The Data Science Method**  

1.   Problem Identification 

2.   Data Wrangling
  * Data Collection
  * Data Organization
  * Data Definition
  * Data Cleaning

3.   Exploratory Data Analysis
  * Build data profile tables and plots
    - Outliers & Anomalies
  * Explore data relationships
  * Identification and creation of features

4.   Pre-processing and Training Data Development
  * Create dummy or indicator features for categorical variables
  * Standardize the magnitude of numeric features
  * Split into testing and training datasets
  * Apply scaler to the testing set

5.   Modeling
  * Fit Models with Training Data Set
  * Review Model Outcomes: iterate over additional models as needed.
  * Identify the Final Model

6.   Documentation
  * Review the Results
  * Present and share your findings - storytelling
  * Finalize Code 
  * Finalize Documentation

## Data Wrangling

In [163]:
#load common python packages
import os
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.4f' % x) #get rid of scientific notations
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import time
import math
from sklearn import preprocessing
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.metrics import mean_squared_error, mean_absolute_error
from IPython.display import Image
%matplotlib inline

In [97]:
# switch folder
os.chdir('C:\\Users\\tc18f\\Desktop\\springboard\\ultimate_challenge')
os.getcwd()

'C:\\Users\\tc18f\\Desktop\\springboard\\ultimate_challenge'

In [98]:
# load the ultimate_data_challenge.json file
df = pd.read_json('ultimate_data_challenge.json')
df.head() # take a quick look

Unnamed: 0,city,trips_in_first_30_days,signup_date,avg_rating_of_driver,avg_surge,last_trip_date,phone,surge_pct,ultimate_black_user,weekday_pct,avg_dist,avg_rating_by_driver
0,King's Landing,4,2014-01-25,4.7,1.1,2014-06-17,iPhone,15.4,True,46.2,3.67,5.0
1,Astapor,0,2014-01-29,5.0,1.0,2014-05-05,Android,0.0,False,50.0,8.26,5.0
2,Astapor,3,2014-01-06,4.3,1.0,2014-01-07,iPhone,0.0,False,100.0,0.77,5.0
3,King's Landing,9,2014-01-10,4.6,1.14,2014-06-29,iPhone,20.0,True,80.0,2.36,4.9
4,Winterfell,14,2014-01-27,4.4,1.19,2014-03-15,Android,11.8,False,82.4,3.13,4.9


In [99]:
# check df.info() to see each column's Dtype
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   city                    50000 non-null  object 
 1   trips_in_first_30_days  50000 non-null  int64  
 2   signup_date             50000 non-null  object 
 3   avg_rating_of_driver    41878 non-null  float64
 4   avg_surge               50000 non-null  float64
 5   last_trip_date          50000 non-null  object 
 6   phone                   49604 non-null  object 
 7   surge_pct               50000 non-null  float64
 8   ultimate_black_user     50000 non-null  bool   
 9   weekday_pct             50000 non-null  float64
 10  avg_dist                50000 non-null  float64
 11  avg_rating_by_driver    49799 non-null  float64
dtypes: bool(1), float64(6), int64(1), object(4)
memory usage: 4.2+ MB


In [100]:
# convert every column whose name contains 'date' to datetime
for i in df.columns:
    if 'date' in i:
        df[i] = pd.to_datetime(df[i])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   city                    50000 non-null  object        
 1   trips_in_first_30_days  50000 non-null  int64         
 2   signup_date             50000 non-null  datetime64[ns]
 3   avg_rating_of_driver    41878 non-null  float64       
 4   avg_surge               50000 non-null  float64       
 5   last_trip_date          50000 non-null  datetime64[ns]
 6   phone                   49604 non-null  object        
 7   surge_pct               50000 non-null  float64       
 8   ultimate_black_user     50000 non-null  bool          
 9   weekday_pct             50000 non-null  float64       
 10  avg_dist                50000 non-null  float64       
 11  avg_rating_by_driver    49799 non-null  float64       
dtypes: bool(1), datetime64[ns](2), float64(6), int

In [101]:
# fill the missing values with the column mean for every column whose name contains 'avg'
for i in df.columns:
    if 'avg' in i:
        df[i] = df[i].fillna(df[i].mean()) # assign back instead of a chained inplace fill
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   city                    50000 non-null  object        
 1   trips_in_first_30_days  50000 non-null  int64         
 2   signup_date             50000 non-null  datetime64[ns]
 3   avg_rating_of_driver    50000 non-null  float64       
 4   avg_surge               50000 non-null  float64       
 5   last_trip_date          50000 non-null  datetime64[ns]
 6   phone                   49604 non-null  object        
 7   surge_pct               50000 non-null  float64       
 8   ultimate_black_user     50000 non-null  bool          
 9   weekday_pct             50000 non-null  float64       
 10  avg_dist                50000 non-null  float64       
 11  avg_rating_by_driver    50000 non-null  float64       
dtypes: bool(1), datetime64[ns](2), float64(6), int

In [102]:
# let's check for unique values on the phone column
df['phone'].unique()

array(['iPhone', 'Android', None], dtype=object)

In [103]:
# 'phone' has only two categories, and fewer than 1% of the rows are missing it,
# so let's drop the rows with a missing value in this column
df = df.dropna(subset=['phone'])
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49604 entries, 0 to 49999
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   city                    49604 non-null  object        
 1   trips_in_first_30_days  49604 non-null  int64         
 2   signup_date             49604 non-null  datetime64[ns]
 3   avg_rating_of_driver    49604 non-null  float64       
 4   avg_surge               49604 non-null  float64       
 5   last_trip_date          49604 non-null  datetime64[ns]
 6   phone                   49604 non-null  object        
 7   surge_pct               49604 non-null  float64       
 8   ultimate_black_user     49604 non-null  bool          
 9   weekday_pct             49604 non-null  float64       
 10  avg_dist                49604 non-null  float64       
 11  avg_rating_by_driver    49604 non-null  float64       
dtypes: bool(1), datetime64[ns](2), float64(6), int

In [104]:
# check the correlation between the numeric columns
df.corr(numeric_only=True)

Unnamed: 0,trips_in_first_30_days,avg_rating_of_driver,avg_surge,surge_pct,ultimate_black_user,weekday_pct,avg_dist,avg_rating_by_driver
trips_in_first_30_days,1.0,-0.0114,-0.0019,0.0055,0.1122,0.0508,-0.1368,-0.0389
avg_rating_of_driver,-0.0114,1.0,-0.0216,-0.0032,-0.0025,0.012,0.0288,0.1011
avg_surge,-0.0019,-0.0216,1.0,0.7934,-0.0777,-0.1102,-0.0817,0.0107
surge_pct,0.0055,-0.0032,0.7934,1.0,-0.1056,-0.1452,-0.1045,0.0203
ultimate_black_user,0.1122,-0.0025,-0.0777,-0.1056,1.0,0.0359,0.0329,0.0093
weekday_pct,0.0508,0.012,-0.1102,-0.1452,0.0359,1.0,0.1023,0.0201
avg_dist,-0.1368,0.0288,-0.0817,-0.1045,0.0329,0.1023,1.0,0.0795
avg_rating_by_driver,-0.0389,0.1011,0.0107,0.0203,0.0093,0.0201,0.0795,1.0


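The correlation between avg_surge and surge_pct is about 0.79, far higher than for any other pair of columns.

  * surge_pct: the percent of trips taken with surge multiplier > 1
  * avg_surge: the average surge multiplier over all of this user’s trips

Since the two columns carry largely the same information, let's keep only avg_surge.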
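
Scanning a correlation matrix by eye gets harder as the number of columns grows. The sketch below shows one way to list the highly correlated pairs automatically. The helper name `correlated_pairs`, the 0.75 threshold, and the toy values are assumptions made for illustration; they are not part of the original analysis.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.75) -> pd.Series:
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    numeric = frame.select_dtypes(include=['number', 'bool'])
    corr = numeric.corr().abs()
    # keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# toy data (hypothetical values): 'b' is almost a multiple of 'a', 'c' is unrelated
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.1, 3.9, 6.2, 8.0, 9.9],
    'c': [5.0, 1.0, 4.0, 2.0, 3.0],
})
high = correlated_pairs(toy)
print(high)
```

On the real dataframe, `correlated_pairs(df)` would be expected to flag the avg_surge / surge_pct pair, since 0.7934 is above the assumed threshold.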

In [105]:
# drop the surge_pct column
df =df.drop('surge_pct', axis=1)
df.head()

Unnamed: 0,city,trips_in_first_30_days,signup_date,avg_rating_of_driver,avg_surge,last_trip_date,phone,ultimate_black_user,weekday_pct,avg_dist,avg_rating_by_driver
0,King's Landing,4,2014-01-25,4.7,1.1,2014-06-17,iPhone,True,46.2,3.67,5.0
1,Astapor,0,2014-01-29,5.0,1.0,2014-05-05,Android,False,50.0,8.26,5.0
2,Astapor,3,2014-01-06,4.3,1.0,2014-01-07,iPhone,False,100.0,0.77,5.0
3,King's Landing,9,2014-01-10,4.6,1.14,2014-06-29,iPhone,True,80.0,2.36,4.9
4,Winterfell,14,2014-01-27,4.4,1.19,2014-03-15,Android,False,82.4,3.13,4.9


In [106]:
# we want to find the riders who were retained into their 6th month, so we need a column that flags them
# a rider who is still active in the 2nd month has last_trip_date - signup_date >= 30 days;
# by the same logic, a rider who is still active in the 6th month has a difference of >= 150 days
# date_diff holds the number of days between sign-up and the last trip
# active_150 is True when the last trip was at least 150 days after sign-up
df['date_diff'] = (df['last_trip_date'] - df['signup_date']).dt.days # whole days as int
df['active_150'] = df['date_diff'] >= 150
df.head()

Unnamed: 0,city,trips_in_first_30_days,signup_date,avg_rating_of_driver,avg_surge,last_trip_date,phone,ultimate_black_user,weekday_pct,avg_dist,avg_rating_by_driver,date_diff,active_150
0,King's Landing,4,2014-01-25,4.7,1.1,2014-06-17,iPhone,True,46.2,3.67,5.0,143,False
1,Astapor,0,2014-01-29,5.0,1.0,2014-05-05,Android,False,50.0,8.26,5.0,96,False
2,Astapor,3,2014-01-06,4.3,1.0,2014-01-07,iPhone,False,100.0,0.77,5.0,1,False
3,King's Landing,9,2014-01-10,4.6,1.14,2014-06-29,iPhone,True,80.0,2.36,4.9,170,True
4,Winterfell,14,2014-01-27,4.4,1.19,2014-03-15,Android,False,82.4,3.13,4.9,47,False


In [107]:
# check whether date_diff has any negative values
df['date_diff'].min()

0

In [108]:
# check the signup_date period
display(df['signup_date'].min())
display(df['signup_date'].max())

Timestamp('2014-01-01 00:00:00')

Timestamp('2014-01-31 00:00:00')

In [109]:
# since we have the signup date, let's add a column for the sign-up day of the week; it may be a factor
dow_names = dict(enumerate(['Mon','Tue','Wed','Thu','Fri','Sat','Sun'])) # 0 = Monday
df['signup_DoW'] = df['signup_date'].dt.dayofweek.map(dow_names) # change 0~6 to Mon~Sun
df.head()

Unnamed: 0,city,trips_in_first_30_days,signup_date,avg_rating_of_driver,avg_surge,last_trip_date,phone,ultimate_black_user,weekday_pct,avg_dist,avg_rating_by_driver,date_diff,active_150,signup_DoW
0,King's Landing,4,2014-01-25,4.7,1.1,2014-06-17,iPhone,True,46.2,3.67,5.0,143,False,Sat
1,Astapor,0,2014-01-29,5.0,1.0,2014-05-05,Android,False,50.0,8.26,5.0,96,False,Wed
2,Astapor,3,2014-01-06,4.3,1.0,2014-01-07,iPhone,False,100.0,0.77,5.0,1,False,Mon
3,King's Landing,9,2014-01-10,4.6,1.14,2014-06-29,iPhone,True,80.0,2.36,4.9,170,True,Fri
4,Winterfell,14,2014-01-27,4.4,1.19,2014-03-15,Android,False,82.4,3.13,4.9,47,False,Mon


In [110]:
# the dataset covers a single sign-up cohort (riders who signed up in January 2014)
# let's see how many of them were retained after 150 days
retained = df[df['active_150']]
print('Retained on 6th month: ', len(retained))
print('Retained percentage: ', round(len(retained)/len(df)*100, 2), '%')

Retained on 6th month:  12629
Retained percentage:  25.46 %


In [111]:
# let's compare the means of those who were retained and those who weren't
df.groupby('active_150').mean(numeric_only=True)

Unnamed: 0_level_0,trips_in_first_30_days,avg_rating_of_driver,avg_surge,ultimate_black_user,weekday_pct,avg_dist,avg_rating_by_driver,date_diff
active_150,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
False,1.8112,4.604,1.0741,0.3287,60.7107,6.0896,4.7833,69.341
True,3.6417,4.5941,1.0775,0.5115,61.3906,4.8947,4.7618,161.9718


Comparing the means of the riders who were retained after 150 days with those who were not:

Noticeable differences (retained vs not retained): 3.6 vs 1.8 trips in the first 30 days, 51% vs 33% ultimate_black_user, 4.9 vs 6.1 avg_dist.

The remaining columns, apart from date_diff (which defines the two groups), are almost the same (about 1% difference or less).

In [112]:
# let's compare the medians of those who were retained and those who weren't
df.groupby('active_150').median(numeric_only=True)

Unnamed: 0_level_0,trips_in_first_30_days,avg_rating_of_driver,avg_surge,ultimate_black_user,weekday_pct,avg_dist,avg_rating_by_driver,date_diff
active_150,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
False,1,4.8,1.0,False,66.7,4.0,5.0,71
True,2,4.7,1.02,True,63.6,3.65,4.8,161


Comparing the medians of the riders who were retained after 150 days with those who were not:

Noticeable differences (retained vs not retained): 2 vs 1 trip(s) in the first 30 days, True vs False ultimate_black_user, 3.65 vs 4.0 avg_dist.

The remaining columns, apart from date_diff, are almost the same (less than 5% difference).

After comparing the means and medians of the two groups, the following variables appear to be influential:

trips_in_first_30_days, ultimate_black_user, avg_dist.
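
To make "noticeable difference" concrete, the relative gap between the two group means can be computed directly. The sketch below re-enters the means from the groupby output above by hand (rounded to four decimals) and ranks the columns by the size of the gap. The row labels `retained` / `not_retained` and the variable names are chosen here for illustration.

```python
import pandas as pd

# group means copied from the groupby('active_150') output above
means = pd.DataFrame(
    {
        'trips_in_first_30_days': [1.8112, 3.6417],
        'avg_rating_of_driver': [4.6040, 4.5941],
        'avg_surge': [1.0741, 1.0775],
        'ultimate_black_user': [0.3287, 0.5115],
        'weekday_pct': [60.7107, 61.3906],
        'avg_dist': [6.0896, 4.8947],
        'avg_rating_by_driver': [4.7833, 4.7618],
    },
    index=['not_retained', 'retained'],
)

# gap of retained vs not retained, as a percentage of the not-retained mean
rel_diff = (means.loc['retained'] - means.loc['not_retained']) / means.loc['not_retained'] * 100
ranked = rel_diff.sort_values(key=abs, ascending=False)
print(ranked.round(1))
```

The three columns named above come out on top of this ranking, while every other column differs by less than 2%.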

In [138]:
# let's compare the columns with string values
n_retained = df[df['active_150']==False] # subset of df for those who weren't retained
# create empty lists to store values
val_list=[]
retained_list=[]
n_retained_list=[]
for col in ['city','phone','signup_DoW']: # iterate through the columns with string values
    for val in df[col].unique(): # iterate through the distinct strings in the column
        val_list.append(val)
        retained_val = retained[col].value_counts()[val] # count of this value in the retained df
        retained_val_p = round(retained_val/len(retained)*100,2) # as a percentage of the retained riders
        retained_list.append(retained_val_p)
        n_retained_val = n_retained[col].value_counts()[val] # count of this value in the n_retained df
        n_retained_val_p = round(n_retained_val/len(n_retained)*100,2) # as a percentage of the not-retained riders
        n_retained_list.append(n_retained_val_p)
comp_df = pd.DataFrame({
    'string_value':val_list,
    'retained':retained_list,
    'n_retained':n_retained_list
})
comp_df

Unnamed: 0,string_value,retained,n_retained
0,King's Landing,33.18,15.89
1,Astapor,22.23,36.8
2,Winterfell,44.59,47.31
3,iPhone,83.97,64.85
4,Android,16.03,35.15
5,Sat,20.52,19.1
6,Wed,14.4,13.02
7,Mon,11.89,10.26
8,Fri,16.71,20.5
9,Thu,12.72,14.3


The string columns show much larger differences between the two groups. Each value is the count for that category divided by the total count of its group, so the categories of one column sum to 100 within a group. For example, city has three categories (King's Landing, Astapor, and Winterfell), and for the retained group 33.18 + 22.23 + 44.59 = 100. In other words, 33.18% of the riders who were retained signed up in King's Landing, compared with only 15.89% of those who were not retained.

From this we can see that the city where a rider signed up may carry some weight in whether the rider is retained after 150 days (riders who signed up in King's Landing are more likely to be retained). The same goes for phone type (iPhone users make up 83.97% of the retained group but only 64.85% of the not-retained group), but much less so for the day of the week on which the rider signed up.

Key takeaway: city and phone are influential variables.

All influential variables: trips_in_first_30_days, ultimate_black_user, avg_dist, city, phone
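
The per-group shares used in this comparison can also be obtained in one step with `pd.crosstab` and `normalize='columns'`. The sketch below uses a tiny toy frame; the city labels `A` / `B` and the six rows are hypothetical, not the challenge data.

```python
import pandas as pd

# toy data (hypothetical): six riders, two cities
toy = pd.DataFrame({
    'city': ['A', 'A', 'B', 'B', 'B', 'A'],
    'active_150': [True, True, True, False, False, False],
})

# readable group labels instead of True / False
group = toy['active_150'].map({True: 'retained', False: 'n_retained'})

# normalize='columns' divides each count by its group total,
# so every column sums to 100 after scaling
shares = pd.crosstab(toy['city'], group, normalize='columns') * 100
print(shares.round(2))
```

Each column of `shares` sums to 100, which matches the convention of the comparison table above (share of a category within the retained or not-retained group).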

Let's now drop the unnecessary columns.

In [141]:
df = df[['city', 'trips_in_first_30_days','ultimate_black_user','avg_dist','phone','active_150']].copy() # copy so later edits don't touch a view
df.head()

Unnamed: 0,city,trips_in_first_30_days,ultimate_black_user,avg_dist,phone,active_150
0,King's Landing,4,True,3.67,iPhone,False
1,Astapor,0,False,8.26,Android,False
2,Astapor,3,False,0.77,iPhone,False
3,King's Landing,9,True,2.36,iPhone,True
4,Winterfell,14,False,3.13,Android,False


In [148]:
# since phone has only 2 distinct values, let's encode them as 1 and 0, with 1 being iPhone and 0 being Android
df['phone'] = df['phone'].map({'iPhone': 1, 'Android': 0})
df.head()

Unnamed: 0,city,trips_in_first_30_days,ultimate_black_user,avg_dist,phone,active_150
0,King's Landing,4,True,3.67,1,False
1,Astapor,0,False,8.26,0,False
2,Astapor,3,False,0.77,1,False
3,King's Landing,9,True,2.36,1,True
4,Winterfell,14,False,3.13,0,False


Create dummy features for the city column and add them to the dataframe. Make sure to also remove the original categorical column from the dataframe.

In [153]:
# replace the city column with one 0/1 indicator column per city
dfd = df.drop('city', axis=1).join(pd.get_dummies(df['city'], dtype=int))
dfd.head()

Unnamed: 0,trips_in_first_30_days,ultimate_black_user,avg_dist,phone,active_150,Astapor,King's Landing,Winterfell
0,4,True,3.67,1,False,0,1,0
1,0,False,8.26,0,False,1,0,0
2,3,False,0.77,1,False,1,0,0
3,9,True,2.36,1,True,0,1,0
4,14,False,3.13,0,False,0,0,1


In [159]:
# Create the X and y matrices from the dataframe, where y = dfd.active_150
X = dfd.drop('active_150', axis=1)
y = dfd.active_150

In [160]:
# Split X and y into 75/25 training and testing data subsets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=17)

In [161]:
# fit the standard scaler on the training set only, then apply it to both sets
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [165]:
# grid search over learning rate, number of estimators, and max depth
learn_rate_list = []
n_est_list=[]
max_dep_list=[]
train_score=[]
valid_score=[]
for learn_rate in [0.05, 0.1, 0.25, 0.5, 0.75, 1]:
    for estimator in [25,50,100,200]:
        for depth in [2,3,4,5]:
            gb = GradientBoostingClassifier(n_estimators=estimator, learning_rate=learn_rate, max_depth=depth, random_state = 17)
            gb.fit(X_train, y_train)
            learn_rate_list.append(learn_rate)
            n_est_list.append(estimator)
            max_dep_list.append(depth)
            train_score.append(gb.score(X_train, y_train))
            valid_score.append(gb.score(X_test, y_test))

In [170]:
gb_df = pd.DataFrame({
    'n_estimator':n_est_list,
    'max_depth': max_dep_list,
    'learning_rate': learn_rate_list,
    'train_score': train_score,
    'validation_score': valid_score,
})
gb_df.head()

Unnamed: 0,n_estimator,max_depth,learning_rate,train_score,validation_score
0,25,2,0.05,0.7468,0.7411
1,25,3,0.05,0.7561,0.7507
2,25,4,0.05,0.7578,0.7525
3,50,2,0.05,0.7591,0.7545
4,50,3,0.05,0.7635,0.7578


In [180]:
# display the top 10 validation scores
display(gb_df.sort_values('validation_score', ascending=False).head(10))

Unnamed: 0,n_estimator,max_depth,learning_rate,train_score,validation_score
77,50,5,0.1,0.7725,0.7631
75,200,5,0.05,0.7757,0.7631
78,100,5,0.1,0.7762,0.763
10,200,3,0.05,0.7702,0.7625
63,50,2,1.0,0.77,0.7625
49,25,3,0.75,0.77,0.7624
22,200,3,0.1,0.7721,0.7623
76,25,5,0.1,0.7704,0.7623
74,100,5,0.05,0.7724,0.7623
29,50,4,0.25,0.7732,0.7623


In [183]:
# the best validation score is about 0.76; however, we also want train_score and validation_score to be as close as possible
# let's add a score_diff column equal to |train_score - validation_score|, then keep the rows with validation_score > 0.76
gb_df['score_diff'] = abs(gb_df.train_score - gb_df.validation_score)
gb_df[gb_df['validation_score'] > 0.76].sort_values('score_diff').head()

Unnamed: 0,n_estimator,max_depth,learning_rate,train_score,validation_score,score_diff
6,100,2,0.05,0.7672,0.7613,0.0059
36,25,2,0.5,0.7674,0.7615,0.006
24,25,2,0.25,0.7675,0.7612,0.0062
15,50,2,0.1,0.767,0.7607,0.0063
9,200,2,0.05,0.7676,0.7612,0.0063


We will use the parameters from row 36 of gb_df (n_estimators=25, max_depth=2, learning_rate=0.5). Its score_diff is the second lowest (0.0060, only 0.0001 above row 6), and its validation score of 0.7615 is the highest among the five rows with the smallest score_diff, 0.0002 above row 6.
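
The selection rule described here (shortlist the rows with the smallest train/validation gap, then take the one with the best validation score) can be written down explicitly. The sketch below re-enters the five rows from the table above by hand, using the rounded values as displayed; the names `candidates`, `shortlist`, and `best_idx` are chosen for illustration.

```python
import pandas as pd

# the five rows shown above, keyed by their gb_df index (values as displayed)
candidates = pd.DataFrame(
    {
        'n_estimator': [100, 25, 25, 50, 200],
        'max_depth': [2, 2, 2, 2, 2],
        'learning_rate': [0.05, 0.5, 0.25, 0.1, 0.05],
        'validation_score': [0.7613, 0.7615, 0.7612, 0.7607, 0.7612],
        'score_diff': [0.0059, 0.0060, 0.0062, 0.0063, 0.0063],
    },
    index=[6, 36, 24, 15, 9],
)

# step 1: shortlist the rows with the smallest train/validation gap
shortlist = candidates.nsmallest(5, 'score_diff')
# step 2: among those, take the row with the highest validation score
best_idx = shortlist['validation_score'].idxmax()
print(best_idx)
print(candidates.loc[best_idx])
```

Applied to the full gb_df, the same two steps would run on `gb_df[gb_df['validation_score'] > 0.76]` instead of the hand-entered frame.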

In [184]:
# using the params with gb_df's 36th row
gb = GradientBoostingClassifier(n_estimators=25, learning_rate = 0.5, max_depth = 2, random_state = 17)
gb.fit(X_train, y_train)
y_pred = gb.predict(X_test)
# print the classification report
print(classification_report(y_test, y_pred))
# confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('True Positive:', tp, '\n'
     'False Positive :', fp, '\n'
     'True Negative:', tn, '\n'
     'False Negative:', fn)

              precision    recall  f1-score   support

       False       0.78      0.94      0.85      9191
        True       0.59      0.25      0.35      3210

    accuracy                           0.76     12401
   macro avg       0.69      0.59      0.60     12401
weighted avg       0.73      0.76      0.72     12401

True Positive: 795 
False Positive : 543 
True Negative: 8648 
False Negative: 2415


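As expected, the overall accuracy is 0.76. Recall for the retained class (True) is only 0.25, though, so the model misses most of the riders who were actually retained, and always predicting False would already score 9191/12401 ≈ 0.74. I'd say it's an okay model at best.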
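
The figures in the classification report follow directly from the four confusion-matrix counts printed above. The sketch below recomputes them with plain arithmetic; the `baseline` value (accuracy of always predicting False) is an extra comparison point added here for illustration and is not printed by the notebook.

```python
# counts taken from the confusion matrix output above
tp, fp, tn, fn = 795, 543, 8648, 2415
total = tp + fp + tn + fn

accuracy = (tp + tn) / total      # share of all predictions that are correct
precision = tp / (tp + fp)        # of the predicted True, how many are truly True
recall = tp / (tp + fn)           # of the truly True, how many were found
f1 = 2 * precision * recall / (precision + recall)
baseline = (tn + fp) / total      # accuracy of always predicting False

print(f'accuracy={accuracy:.2f} precision={precision:.2f} '
      f'recall={recall:.2f} f1={f1:.2f} baseline={baseline:.2f}')
```

The rounded values agree with the report (0.76, 0.59, 0.25, 0.35 for the True class), and the baseline comes out at about 0.74.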

In [191]:
# let's find out how important each feature was
ft_df = pd.DataFrame({
    'Feature':X.columns,
    'feature_importance':gb.feature_importances_
})
ft_df.set_index('Feature', inplace=True)
ft_df.sort_values('feature_importance', ascending=False)

Unnamed: 0_level_0,feature_importance
Feature,Unnamed: 1_level_1
trips_in_first_30_days,0.3345
King's Landing,0.1806
phone,0.165
avg_dist,0.1411
ultimate_black_user,0.1272
Astapor,0.0516
Winterfell,0.0


The best predictor is trips_in_first_30_days, which is only known 30 days after sign-up. The second is signing up in King's Landing: those riders are more likely to be retained after 150 days. The third is phone type; the earlier comparison indicated that iPhone users are more likely to be retained after 150 days.

A random forest classifier could be used here as well, but since we didn't have much missing data, I decided to use gradient boosting over a random forest.
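
For readers who want to check the choice between the two ensembles empirically, a minimal sketch of a side-by-side comparison follows. It runs on synthetic data from `make_classification`, not on the challenge data, and the random forest settings (`n_estimators=25`, `max_depth=5`) are assumptions picked only to keep the example quick. The scores it prints say nothing about which model is better on the rider dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic stand-in data (assumption): 400 rows, 7 features like the final X
X_demo, y_demo = make_classification(
    n_samples=400, n_features=7, n_informative=4, random_state=17
)

models = {
    # same hyperparameters as the final gradient boosting model above
    'gradient_boosting': GradientBoostingClassifier(
        n_estimators=25, learning_rate=0.5, max_depth=2, random_state=17
    ),
    # hypothetical random forest settings, not tuned
    'random_forest': RandomForestClassifier(
        n_estimators=25, max_depth=5, random_state=17
    ),
}

# mean 5-fold cross-validated accuracy per model
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in models.items()
}
for name, score in cv_scores.items():
    print(f'{name}: {score:.4f}')
```

Swapping `X_demo, y_demo` for the notebook's own feature matrix and target would give a like-for-like comparison on the real data.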