# Question 3 Modeling

Uber’s Driver team is interested in predicting which driver signups are most likely to start driving.
To help explore this question, we have provided a sample dataset of a cohort of driver signups.

The data was pulled some time after they signed up to include the result of whether they
actually completed their first trip. It also includes several pieces of background information
gathered about the driver and their car.

We would like you to use this data set to help understand what factors are best at predicting
whether a signup will start to drive within 30 days of signing up, and offer suggestions to
operationalize those insights to help Uber.

See below for a description of the dataset. Please include any code you wrote for the analysis
and delete the dataset when you have finished with the challenge. Please also call out any data
related assumptions or issues that you encounter.

# Part A

Perform any cleaning, exploratory analysis, and/or visualizations to use the provided data for this
analysis (a few sentences/plots describing your approach will suffice). What fraction of the driver
signups took a first trip within 30 days of signing up?

### Data issues and exploration
- Some records have a value for first_completed_trip_timestamp but NaN for signup_timestamp, which should not be possible; those records are removed.
- After removing records with a NaN signup_timestamp, about 54.7% of the remaining signups completed a first trip within 30 days.

### Data assumptions
- A NaN value in any timestamp column means the event never happened; e.g., a NaN first_completed_trip_timestamp means the driver never completed a first trip.
- signup_timestamp must occur before the background check (bgc), the vehicle being added, and the first completed trip.
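
The ordering assumption can be checked directly rather than taken on faith. A minimal sketch, assuming the column names used in the code further down and a small made-up frame in place of the real dataset; the helper name `count_order_violations` is hypothetical:

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "signup_timestamp": ["2016-01-02", "2016-01-10", "2016-01-15"],
    "bgc_date": ["2016-01-05", "2016-01-08", None],
    "vehicle_added_date": ["2016-01-06", None, None],
    "first_completed_trip_timestamp": ["2016-01-09", None, None],
})
toy = toy.apply(pd.to_datetime)


def count_order_violations(frame, start="signup_timestamp"):
    """Count rows where a later event is dated before the signup date."""
    later = [c for c in frame.columns if c != start]
    # Comparisons against NaT are False, so missing events never count.
    return {c: int((frame[c] < frame[start]).sum()) for c in later}


violations = count_order_violations(toy)
print(violations)
```

Any non-zero count would point to rows worth inspecting before the day-difference features are built.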


### Data transformation
- first_completed_trip_timestamp
    - convert to binary (1, 0), where 1 means the first trip happened within 30 days of signup and 0 means anything else
- bgc_date
    - convert to binary (1, 0), where 1 means the driver completed a background check (bgc)
    - days from signup to background check
- vehicle_added_date
    - convert to binary (1, 0), where 1 means the driver registered vehicle information
    - days from signup to vehicle registration
- vehicle_year
    - flag whether the registered vehicle is a recent model
        
        

In [1]:
import os
import pandas as pd
import numpy as np
import math
from datetime import datetime

In [2]:
# load the dataset; the path points to my local machine and must be changed to run elsewhere
filepath = r'C:\Users\sunny.wong2\JupyterNotebook\Uber Assignment\uber_assignment\product_ds_exercise_2018_h2_dataset.csv'
df = pd.read_csv(filepath)

In [4]:
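# the analysis only covers records with a non-null signup timestamp
df['signup_label'] = df['signup_timestamp'].notnull()
# .copy() so that later column assignments modify an independent frame, not a view of the original
df = df[df['signup_label']].copy()
df.shape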

In [100]:
# data_transformer groups the custom data transformation functions
class data_transformer(object):
    
    def __init__(self):
        self.sign_up_window = 30
            
    # label whether the first completed trip happened within 30 days of signup
    # assumption: a NaN day count means no first trip was completed; NaN < 30 is False, so those rows get 0
    def create_final_label(self, days):
        if days < self.sign_up_window:
            return 1
        else:
            return 0
    
    # calculate date difference (used to find how long it took for bgc and veh regis since signup date)
    def extract_days_difference(self, input_delta):

        # Attempt to coerce into Pandas time delta
        delta = pd.Timedelta(input_delta)

        # Attempt to extract number of days
        days = np.nan
        if pd.notnull(delta):
            days = delta.days

        # Return result
        return days
    
    # flag whether the registered vehicle is "new" (model year after 2011); a missing year falls through to "old"
    def vehicle_year_bin(self, y):
        if y > 2011:
            return "new"
        else:
            return "old"
        

In [11]:
# Create simple features based on whether the data exist
df['bgc_known'] = df['bgc_date'].notnull()
df['vehicle_inspection_known'] = df['vehicle_added_date'].notnull()
df['signup_os_known'] = df['signup_os'].notnull()
df['vehicle_make_known'] = df['vehicle_make'].notnull()
df['drove_label'] = df['first_completed_trip_timestamp'].notnull()

In [12]:
# convert date columns into dates
df['first_completed_trip_timestamp'] = pd.to_datetime(df['first_completed_trip_timestamp'])
df['vehicle_added_date'] = pd.to_datetime(df['vehicle_added_date'])
df['bgc_date'] = pd.to_datetime(df['bgc_date'])
df['signup_timestamp'] = pd.to_datetime(df['signup_timestamp'])

In [14]:
# compute day differences; these may be important features, since drivers who complete these steps are likely more committed
dt = data_transformer()
df['signup_to_bgc'] = (df['bgc_date'] - df['signup_timestamp']).apply(dt.extract_days_difference)
df['signup_to_veh'] = (df['vehicle_added_date'] - df['signup_timestamp']).apply(dt.extract_days_difference)

In [106]:
# based on vehicle year, see if vehicle is a recent model
df['vehicle_new_indicator'] = df['vehicle_year'].apply(dt.vehicle_year_bin)



In [17]:
# build the label: 1 if the time from signup to first trip is under 30 days, else 0
df['signup_to_first_complete'] = (df['first_completed_trip_timestamp'] - df['signup_timestamp']).apply(dt.extract_days_difference)
df['complete_trip_label'] = df['signup_to_first_complete'].apply(dt.create_final_label)
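
The same day counts and label can also be computed without `apply`, using the `.dt.days` accessor on a timedelta column. A sketch on made-up rows, assuming the 30-day window from `data_transformer`; the variable names are illustrative:

```python
import pandas as pd

# Made-up signups: one drives after 5 days, one after 45 days, one never.
toy = pd.DataFrame({
    "signup_timestamp": pd.to_datetime(["2016-01-01", "2016-01-01", "2016-01-01"]),
    "first_completed_trip_timestamp": pd.to_datetime(["2016-01-06", "2016-02-15", None]),
})

# Subtracting datetimes gives timedeltas; .dt.days yields NaN where the trip is missing.
days = (toy["first_completed_trip_timestamp"] - toy["signup_timestamp"]).dt.days

# NaN < 30 evaluates to False, so drivers with no trip get label 0.
label = (days < 30).astype(int)
print(label.tolist())
```

This keeps the same rule as `create_final_label` (strictly fewer than 30 days), only expressed column-wise.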

In [107]:
# keep only columns considered for modelling
columns = ['city_name',
           'signup_os',
           'signup_channel',
           'vehicle_make',
           'vehicle_model',
           'vehicle_year',
           'bgc_known',
           'vehicle_inspection_known',
           'signup_os_known',
           'vehicle_make_known',
           'signup_to_bgc',
           'signup_to_veh',
           'vehicle_new_indicator',
           'complete_trip_label'
          ]
df = df[columns]

In [108]:
# exploratory check:
# the share of drivers who complete a first trip within 30 days of signup is about 54.7% - note this is after removing records with a null signup date
r = len(df.loc[(df['complete_trip_label'] == 1)]) / len(df)
print(r)

0.546810791495444


# Part B 

Build a predictive model to help Uber determine whether or not a driver signup will start driving
within 30 days of signing up. Discuss why you chose your approach, what alternatives you
considered, and any concerns you have. How valid is your model? Include any key indicators of
model performance.

- After the data transformation step, the dataframe is ready for building a predictive model.
- Create a simple model as a baseline (decision tree).
- Build more advanced models (random forest and gradient boosting) to try to beat the baseline.
- Use grid search and cross-validation to find good hyperparameters.
- Evaluate the selected model against a held-out test set.


In [78]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import f1_score, confusion_matrix, precision_score, recall_score, accuracy_score
from xgboost.sklearn import XGBClassifier

In [109]:
# X is our predictors, y is the label we want to predict
y = df['complete_trip_label']
X = df.drop(['complete_trip_label'], axis=1)

In [110]:
# sanity check to see if y and X have same number of rows
print(len(y))
print(len(X))

11194
11194


In [111]:
# prep_df is defined before it is called so the notebook runs top to bottom
def prep_df(input_df):
    
    prepped_df = input_df.copy()

    # replace null values with not_known in these categorical variables
    columns = ['signup_os', 'signup_channel', 'vehicle_make', 'vehicle_model', 'city_name', 'vehicle_year']
    for c in columns:
        prepped_df[c] = prepped_df[c].fillna('not_known')
    
    # a simple way to deal with NaN in the day-difference features is to replace it with the column mean
    columns = ['signup_to_bgc', 'signup_to_veh']
    for c in columns:
        prepped_df[c] = prepped_df[c].fillna(prepped_df[c].mean())
        
    # one-hot encode the categorical columns
    prepped_df = pd.get_dummies(prepped_df)
    return prepped_df

In [112]:
# use the function above to prepare our X dataframe for scikit-learn models
X_prepped = prep_df(X)
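
One concern with the mean imputation in `prep_df` is that the fill value is computed over all rows, including those that later land in the test set. A hedged sketch of learning the mean from the training rows only; `impute_with_train_mean` is a hypothetical helper and the numbers are made up:

```python
import pandas as pd


def impute_with_train_mean(train, test, columns):
    """Fill NaN in both frames using means learned from the training frame only."""
    train, test = train.copy(), test.copy()
    for c in columns:
        fill_value = train[c].mean()  # NaN values are skipped by default
        train[c] = train[c].fillna(fill_value)
        test[c] = test[c].fillna(fill_value)
    return train, test


# Made-up day counts standing in for signup_to_bgc.
train = pd.DataFrame({"signup_to_bgc": [2.0, 4.0, None]})
test = pd.DataFrame({"signup_to_bgc": [None, 10.0]})

train_filled, test_filled = impute_with_train_mean(train, test, ["signup_to_bgc"])
print(test_filled["signup_to_bgc"].tolist())
```

With only two imputed columns the effect on the reported scores is probably small, but it keeps the test set untouched by anything learned during preparation.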

In [113]:
# this function first splits the data into train and test sets, then runs a cross-validated grid search on the train set
# to identify the best hyperparameters, and finally evaluates the best model on the test set
def train_model_with_cv(model, params, X, y):

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

    # Use the training data for parameter selection in a grid search
    gs_clf = GridSearchCV(model, params, n_jobs=1, cv=5)
    gs_clf = gs_clf.fit(X_train, y_train)
    model = gs_clf.best_estimator_

    # Use best model and test data for final evaluation
    y_pred = model.predict(X_test)

    # note: micro-averaged F1 on a binary target is identical to accuracy
    _f1 = f1_score(y_test, y_pred, average='micro')
    _confusion = confusion_matrix(y_test, y_pred).ravel()
    _accuracy = accuracy_score(y_test, y_pred)
    _precision = precision_score(y_test, y_pred)
    _recall = recall_score(y_test, y_pred)
    _statistics = {'f1_score': _f1,
                   'confusion_matrix': 'tn, fp, fn, tp' + str(_confusion),
                   'accuracy': _accuracy,
                   'precision': _precision,
                   'recall': _recall
                   }

    return model, _statistics 

In [114]:
# baseline model using decision tree model
clf = DecisionTreeClassifier()
param_grid = {"max_depth": [10, 15, 20],
              "min_impurity_decrease": [0],
              "criterion": ["gini"],
              "min_samples_split": [50],
              "min_samples_leaf": [50],
              "max_features": [None]
              }

dt_model , stats = train_model_with_cv(clf, param_grid, X_prepped, y)
print(stats)

{'f1_score': 0.5737483085250338, 'confusion_matrix': 'tn, fp, fn, tp[ 676 1005  570 1444]', 'accuracy': 0.5737483085250338, 'precision': 0.5896284197631686, 'recall': 0.7169811320754716}


In [115]:
# random forest model
clf = RandomForestClassifier()
# "auto" meant the same as "sqrt" for classifiers and was removed in scikit-learn 1.3
param_grid = {"n_estimators": [100, 150],
              "max_depth": [3, 8, 12],
              "max_features": ["sqrt"],
              "min_samples_split": [30, 75],
              "min_samples_leaf": [30, 75],
              "bootstrap": [True],
              "criterion": ["gini"]
              }

rf_model , stats = train_model_with_cv(clf, param_grid, X_prepped, y)
print(stats)

{'f1_score': 0.56617050067659, 'confusion_matrix': 'tn, fp, fn, tp[ 335 1346  257 1757]', 'accuracy': 0.56617050067659, 'precision': 0.5662262326780535, 'recall': 0.8723932472691162}


In [116]:
# gradient boosting model
clf = XGBClassifier()
param_grid = {"learning_rate": [0.1],
              "n_estimators": [100, 150],  # number of boosting rounds (trees)
              "max_depth": [3, 8, 15],  # maximum depth of each tree
              "colsample_bytree": [0.33, 0.66],  # fraction of features sampled for each tree
              "subsample": [0.5, 0.8, 1]  # fraction of training rows sampled for each tree
             }


gb_model , stats = train_model_with_cv(clf, param_grid, X_prepped, y)
print(stats)

{'f1_score': 0.5732070365358592, 'confusion_matrix': 'tn, fp, fn, tp[ 595 1086  491 1523]', 'accuracy': 0.5732070365358592, 'precision': 0.5837485626676888, 'recall': 0.7562065541211519}


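All three models performed about the same. Each reaches roughly 57% test accuracy, only slightly above the roughly 55% that predicting "will drive" for every signup would achieve, so none of them is a strong predictor. The random forest has the highest recall (0.87) but the lowest precision (0.57). The similar scores likely reflect the limited signal in the available features rather than the choice of algorithm; with more informative features, I would expect the algorithms to differ more in performance.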
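
To put the reported numbers in context, the metrics can be recomputed from the confusion matrices printed above and compared with a trivial baseline that predicts every signup will drive. The sketch uses only those printed counts; the helper name `summarize` is hypothetical:

```python
# Confusion-matrix counts (tn, fp, fn, tp) copied from the outputs above.
reported = {
    "decision_tree": (676, 1005, 570, 1444),
    "random_forest": (335, 1346, 257, 1757),
    "xgboost": (595, 1086, 491, 1523),
}


def summarize(tn, fp, fn, tp):
    """Derive standard binary metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        # F1 of the positive class; differs from the micro-averaged value,
        # which for a binary problem is identical to accuracy.
        "f1_binary": 2 * precision * recall / (precision + recall),
        # Accuracy of always predicting 1 (the share of positives in the test set).
        "always_drive_accuracy": (tp + fn) / total,
    }


summary = {name: summarize(*counts) for name, counts in reported.items()}
for name, metrics in summary.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

On these counts the always-drive baseline is about 54.5% accurate, so the models improve on it by roughly two to three percentage points.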

# Part C

Briefly discuss how Uber might leverage the insights gained from the model to generate more
first trips (again, a few ideas/sentences will suffice).

Using the trained random forest model, we can inspect its feature importances to see which features are the strongest predictors.

In [117]:
feature_important_list = sorted(zip(map(lambda x: round(x, 4), rf_model.feature_importances_), list(X_prepped)), reverse=True)
feature_important_list_top5 = feature_important_list[:5]

In [118]:
# show top 5 features from feature important list
print(feature_important_list_top5)

[(0.1708, 'signup_channel_Referral'), (0.1017, 'signup_to_bgc'), (0.0904, 'signup_channel_R2D'), (0.0695, 'signup_to_veh'), (0.0628, 'signup_os_known')]


### Insights to leverage
The strongest predictor of whether a signup completes a first trip is whether the driver signed up through a referral. Assuming referred signups convert at a higher rate, Uber could increase the incentive it offers users for successful referrals.

The second most important feature is the time a signup takes to submit the background check consent form. Uber may want to encourage signups to complete this form as soon as possible, for example through offers or incentives.
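
Feature importances show that a feature matters but not in which direction. Before acting on the referral finding, it would be worth comparing first-trip rates across signup channels. A sketch on made-up rows (the real per-channel rates are not shown in this notebook), using the `signup_channel` and `complete_trip_label` columns:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    "signup_channel": ["Referral", "Referral", "Referral", "R2D", "R2D", "R2D"],
    "complete_trip_label": [1, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 label per group is the first-trip rate for that channel.
rate_by_channel = (
    toy.groupby("signup_channel")["complete_trip_label"]
    .agg(rate="mean", signups="count")
    .sort_values("rate", ascending=False)
)
print(rate_by_channel)
```

The same grouping applied to bins of `signup_to_bgc` would show whether faster background checks go with higher first-trip rates.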