# Part 3 - Ultimate Challenge - model training

We will take the dataset created in the previous notebook and train our model.

# Conclusion

Our final model had an accuracy score of 0.715.

The following factors contribute the most to predicting whether a user will be active:

* trips_in_first_30_days - the more rides within the first 30 days, the better
* avg_dist - the shorter, the better
* city - users from Kings Landing are more likely to stay active
* phone - iPhone users are more likely to stay active
* ultimate_black_user - Ultimate Black users are more likely to stay active


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.utils.class_weight import compute_class_weight


%matplotlib inline
sns.set(font_scale=2)

DATA_FILE = 'ultimate_data_challenge-preprocessed.csv'

In [2]:
data = pd.read_csv(DATA_FILE)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49511 entries, 0 to 49510
Data columns (total 22 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   trips_in_first_30_days     49511 non-null  int64  
 1   avg_rating_of_driver       49511 non-null  float64
 2   avg_surge                  49511 non-null  float64
 3   surge_pct                  49511 non-null  float64
 4   weekday_pct                49511 non-null  float64
 5   avg_dist                   49511 non-null  float64
 6   avg_rating_by_driver       49511 non-null  float64
 7   city_astapor               49511 non-null  int64  
 8   city_kingslanding          49511 non-null  int64  
 9   city_winterfell            49511 non-null  int64  
 10  phone_android              49511 non-null  int64  
 11  phone_iphone               49511 non-null  int64  
 12  ultimate_black_user_false  49511 non-null  int64  
 13  ultimate_black_user_true   49511 non-null  int

# Extract features and label columns

In [3]:
feature_cols = [i for i in data.columns if i != "active"]
features = data[feature_cols]
features.columns

Index(['trips_in_first_30_days', 'avg_rating_of_driver', 'avg_surge',
       'surge_pct', 'weekday_pct', 'avg_dist', 'avg_rating_by_driver',
       'city_astapor', 'city_kingslanding', 'city_winterfell', 'phone_android',
       'phone_iphone', 'ultimate_black_user_false', 'ultimate_black_user_true',
       'signup_dayofweek_fri', 'signup_dayofweek_mon', 'signup_dayofweek_sat',
       'signup_dayofweek_sun', 'signup_dayofweek_thu', 'signup_dayofweek_tue',
       'signup_dayofweek_wed'],
      dtype='object')

In [4]:
labels = data["active"]

# Normalize our data

We want to normalize our data for training so that the model trains faster and the coefficients are easier to compare.

In [5]:
s = MinMaxScaler()
s.fit(features.values)
print(s.data_max_)

[125.     5.     8.   100.   100.   160.96   5.     1.     1.     1.
   1.     1.     1.     1.     1.     1.     1.     1.     1.     1.
   1.  ]
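
As a minimal sketch of what `MinMaxScaler` does to each column, the toy example below (made-up values, not rows from our dataset) checks the scaler against the hand-computed formula `(x - min) / (max - min)`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column of trip counts (made-up values, not from the dataset)
toy = np.array([[0.0], [5.0], [25.0], [125.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(toy)

# Same result computed by hand: (x - min) / (max - min)
manual = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

After scaling, every column lies in the range 0 to 1, so a coefficient describes the effect of moving a feature from its minimum to its maximum.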


In [6]:
features_normed = pd.DataFrame(s.transform(features.values), columns=features.columns)
features_normed.sample(5).T

| | 34702 | 22738 | 11355 | 40339 | 7525 |
|---|---|---|---|---|---|
| trips_in_first_30_days | 0.008 | 0.016 | 0.072 | 0.008 | 0.032 |
| avg_rating_of_driver | 1.0 | 1.0 | 0.85 | 0.9 | 0.875 |
| avg_surge | 0.0 | 0.0 | 0.001429 | 0.0 | 0.0 |
| surge_pct | 0.0 | 0.0 | 0.024 | 0.0 | 0.0 |
| weekday_pct | 1.0 | 0.0 | 0.595 | 1.0 | 0.611 |
| avg_dist | 0.079833 | 0.023111 | 0.028889 | 0.04703 | 0.035785 |
| avg_rating_by_driver | 0.75 | 1.0 | 0.975 | 0.75 | 0.95 |
| city_astapor | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| city_kingslanding | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| city_winterfell | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |


# Split our data into training and test sets

In [21]:
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state = 1)
X_train_normed, X_test_normed, y_train_normed, y_test_normed = train_test_split(features_normed, labels, random_state = 1)
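
Both calls above use `random_state = 1`, so the raw and normalized splits should contain the same rows and the two label splits should be identical. The sketch below checks this on toy stand-ins for `features`, `features_normed`, and `labels` (the names `raw`, `normed`, and `y` are hypothetical).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for `features`, `features_normed`, and `labels`
raw = pd.DataFrame({"x": np.arange(20, dtype=float)})
normed = raw / raw.max()
y = pd.Series(np.arange(20) % 2, name="active")

_, raw_test, _, y_test_a = train_test_split(raw, y, random_state=1)
_, normed_test, _, y_test_b = train_test_split(normed, y, random_state=1)

# The same rows land in the test set both times, so the labels match exactly
same_rows = raw_test.index.equals(normed_test.index)
same_labels = y_test_a.equals(y_test_b)
print(same_rows, same_labels)
```

This is why `y_train` and `y_train_normed` can be used interchangeably in the cells that follow.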

# SKLearn Model - Logistic Regression

In [59]:
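# NOTE: the model is fit on the unscaled X_train but scored on the scaled
# X_test_normed; this mismatch likely explains the low score printed below.
lr = LogisticRegression(random_state = 1,
#                         class_weight = 'balanced',
                        max_iter = 500,
                        verbose = 1,
                        fit_intercept=True).fit(X_train, y_train)
predict = lr.predict(X_test_normed)
print(lr.score(X_test_normed, y_test_normed))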


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


0.5862013249313298


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.0s finished
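
Accuracy alone does not show which class the model gets wrong. `confusion_matrix` and `classification_report` are imported in the first cell but not used in the cells shown; the sketch below illustrates how they could be applied, using made-up labels rather than our real predictions.

```python
from sklearn.metrics import confusion_matrix, classification_report

# Made-up labels for illustration: 0 = inactive, 1 = active
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Per-class precision, recall and F1
print(classification_report(y_true, y_pred, target_names=["inactive", "active"]))
```

In this notebook the equivalent call would take the test labels and a prediction array, for example `confusion_matrix(y_test_normed, predict)`.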


# Variables that have positive impact on user being active

* more trips in the first 30 days
* users living in Kings Landing
* iPhone users
* Ultimate Black users
* higher surge percentage

In [60]:
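# Assumption: lr_normed (not defined in the cells shown) is the same model fit on the normalized training split.
lr_normed = LogisticRegression(random_state=1, max_iter=500).fit(X_train_normed, y_train_normed)
coef_df = pd.DataFrame(lr_normed.coef_, columns=features.columns).T.rename({0: "coef"}, axis=1)
coef_df[coef_df.coef > 0].sort_values("coef", ascending=False)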

| | coef |
|---|---|
| trips_in_first_30_days | 11.007165 |
| city_kingslanding | 0.936073 |
| phone_iphone | 0.537091 |
| ultimate_black_user_true | 0.441988 |
| surge_pct | 0.278817 |
| signup_dayofweek_sun | 0.037076 |
| signup_dayofweek_mon | 0.035158 |
| signup_dayofweek_sat | 0.034118 |
| signup_dayofweek_wed | 0.016163 |
| signup_dayofweek_tue | 0.001225 |
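
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below does this for three of the positive coefficients listed above; treat the result as a rough guide only, because both dummy columns of each pair are still in this model and the individual values are therefore not uniquely determined.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the positive-coefficient table in this notebook
coefs = pd.Series({
    "city_kingslanding": 0.936073,
    "phone_iphone": 0.537091,
    "ultimate_black_user_true": 0.441988,
})

# exp(coef) is the multiplicative change in the odds of being active
# when the indicator switches from 0 to 1, holding other features fixed
odds_ratios = np.exp(coefs)
print(odds_ratios.round(2))
```

An odds ratio above 1 means the feature raises the odds of a user being active; below 1 means it lowers them.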


# Variables that have a negative impact on user being active

* Large average distances for trips
* High average surge price
* Users in Astapor
* Users with high average ratings by Drivers
* Android Users
* Non-Ultimate Black Users

In [61]:
coef_df[coef_df.coef < 0].sort_values("coef", ascending=True)

| | coef |
|---|---|
| avg_dist | -5.384973 |
| avg_surge | -0.900984 |
| city_astapor | -0.707441 |
| avg_rating_by_driver | -0.688605 |
| phone_android | -0.518419 |
| ultimate_black_user_false | -0.423316 |
| avg_rating_of_driver | -0.308756 |
| city_winterfell | -0.20996 |
| weekday_pct | -0.071473 |
| signup_dayofweek_thu | -0.061179 |


# Feature Selection

We will select the features that have the most impact on our model and retrain it to see whether performance improves.

Variables that had the most impact on our model (positive or negative):

* trips_in_first_30_days
* avg_dist
* city
* avg_surge
* avg_rating_by_driver
* phone
* ultimate_black_user
* avg_rating_of_driver

In [63]:
coef_df["coef_abs"] = np.absolute(coef_df.coef)
coef_df.sort_values("coef_abs", ascending=False)

| | coef | coef_abs |
|---|---|---|
| trips_in_first_30_days | 11.007165 | 11.007165 |
| avg_dist | -5.384973 | 5.384973 |
| city_kingslanding | 0.936073 | 0.936073 |
| avg_surge | -0.900984 | 0.900984 |
| city_astapor | -0.707441 | 0.707441 |
| avg_rating_by_driver | -0.688605 | 0.688605 |
| phone_iphone | 0.537091 | 0.537091 |
| phone_android | -0.518419 | 0.518419 |
| ultimate_black_user_true | 0.441988 | 0.441988 |
| ultimate_black_user_false | -0.423316 | 0.423316 |


In [66]:
filtered_columns = ["trips_in_first_30_days", "avg_dist", 
                    "city_astapor", "city_kingslanding", 
                    "city_winterfell", "phone_iphone", 
                    "phone_android", "ultimate_black_user_false", 
                    "ultimate_black_user_true", "avg_surge", 
                    "avg_rating_by_driver", "avg_rating_of_driver"]

X_train_filtered = X_train_normed[filtered_columns]
X_test_filtered = X_test_normed[filtered_columns]


lr_filtered = LogisticRegression(random_state = 1, 
#                         class_weight = 'balanced',
                        max_iter = 500, 
                        verbose = 1,
                       fit_intercept=True).fit(X_train_filtered, y_train_normed)
predict_filtered = lr_filtered.predict(X_test_filtered)
print(lr_filtered.score(X_test_filtered, y_test_normed))


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


0.7174018419777024


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s finished


In [72]:
coef_filtered = pd.DataFrame(lr_filtered.coef_, 
                             columns=X_train_filtered.columns).T.rename({0: "coef"}, 
                                                                        axis=1)
coef_filtered

| | coef |
|---|---|
| trips_in_first_30_days | 10.58274 |
| avg_dist | -5.361652 |
| city_astapor | -0.678762 |
| city_kingslanding | 0.953123 |
| city_winterfell | -0.181199 |
| phone_iphone | 0.574399 |
| phone_android | -0.481237 |
| ultimate_black_user_false | -0.386408 |
| ultimate_black_user_true | 0.479571 |
| avg_surge | 0.486637 |


# Let's filter the feature set one more time.

Each phone is either an iPhone or an Android, and each user either is or is not an Ultimate Black user, so one dummy column in each pair is redundant.

Let's keep just one column from each pair and see how the model does.

As suspected, the model performs roughly the same, since the columns we dropped did not add any further information.
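
The redundancy can be seen on a toy example. The sketch below uses a made-up `phone` column (not our data) and `pd.get_dummies` to show that the two indicator columns always sum to 1, so keeping one of them loses nothing.

```python
import pandas as pd

# Toy phone column (made-up values)
phones = pd.DataFrame({"phone": ["iPhone", "Android", "iPhone", "Android", "iPhone"]})

# One indicator column per category
full = pd.get_dummies(phones, columns=["phone"], dtype=int)

# drop_first=True keeps one column fewer per categorical variable
reduced = pd.get_dummies(phones, columns=["phone"], drop_first=True, dtype=int)

# The two full dummy columns always sum to 1, so either one determines the other
redundant = (full["phone_Android"] + full["phone_iPhone"] == 1).all()
print(list(full.columns), list(reduced.columns), redundant)
```

The same argument applies to the pair `ultimate_black_user_true` and `ultimate_black_user_false`.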

In [73]:
filtered_column2 = ["trips_in_first_30_days", "avg_dist", 
                    "city_astapor", "city_kingslanding", 
                    "city_winterfell", "phone_iphone", 
                    "ultimate_black_user_true", "avg_surge", 
                    "avg_rating_by_driver", "avg_rating_of_driver"]

X_train_filtered2 = X_train_normed[filtered_column2]
X_test_filtered2 = X_test_normed[filtered_column2]


lr_filtered2 = LogisticRegression(random_state = 1, 
#                         class_weight = 'balanced',
                        max_iter = 500, 
                        verbose = 1,
                       fit_intercept=True).fit(X_train_filtered2, y_train_normed)
predict_filtered2 = lr_filtered2.predict(X_test_filtered2)
print(lr_filtered2.score(X_test_filtered2, y_test_normed))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


0.7175634189691388


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s finished


In [75]:
coef_filtered2 = pd.DataFrame(lr_filtered2.coef_,
                              columns=X_train_filtered2.columns).T.rename({0: "coef"},
                                                                          axis=1)
coef_filtered2["coef_abs"] = np.absolute(coef_filtered2.coef)
coef_filtered2.sort_values("coef_abs", ascending=False)

| | coef | coef_abs |
|---|---|---|
| trips_in_first_30_days | 10.587678 | 10.587678 |
| avg_dist | -5.36177 | 5.36177 |
| phone_iphone | 1.055166 | 1.055166 |
| ultimate_black_user_true | 0.865738 | 0.865738 |
| city_astapor | -0.828811 | 0.828811 |
| city_kingslanding | 0.803056 | 0.803056 |
| avg_rating_by_driver | -0.658597 | 0.658597 |
| avg_surge | 0.492726 | 0.492726 |
| city_winterfell | -0.331323 | 0.331323 |
| avg_rating_of_driver | -0.284911 | 0.284911 |


# Remove Factors That Are Out Of Our Control

The following factors are generally out of our control:

* ratings by and for drivers
* average surge

Let's remove these to see if this impacts our model greatly.

## Conclusion

Our model accuracy went down by around 0.002 (from 0.7176 to 0.7151), a difference small enough that we can safely ignore it.
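
To put that drop in perspective, the arithmetic below uses the two scores printed in this notebook. The test-set size is an assumption: `train_test_split` is called without `test_size`, so the scikit-learn default of 0.25 is applied to the 49511 rows and rounded up.

```python
import math

before = 0.7175634189691388  # score printed for the previous feature set
after = 0.7151397640975925   # score printed for this reduced feature set

# Assumed test-set size: default test_size=0.25 on 49511 rows, rounded up
n_test = math.ceil(0.25 * 49511)

drop = before - after
fewer_correct = round(drop * n_test)
print(round(drop, 4), n_test, fewer_correct)
```

Under that assumption, the difference amounts to a few dozen test rows out of more than twelve thousand.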

In [79]:
filtered_column3 = ["trips_in_first_30_days", "avg_dist", 
                    "city_astapor", "city_kingslanding", 
                    "city_winterfell", "phone_iphone", 
                    "ultimate_black_user_true"]

X_train_filtered3 = X_train_normed[filtered_column3]
X_test_filtered3 = X_test_normed[filtered_column3]


lr_filtered3 = LogisticRegression(random_state = 1, 
#                         class_weight = 'balanced',
                        max_iter = 500, 
                        verbose = 1,
                       fit_intercept=True).fit(X_train_filtered3, y_train_normed)
predict_filtered3 = lr_filtered3.predict(X_test_filtered3)
print(lr_filtered3.score(X_test_filtered3, y_test_normed))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


0.7151397640975925


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.3s finished


In [80]:
coef_filtered3 = pd.DataFrame(lr_filtered3.coef_,
                              columns=X_train_filtered3.columns).T.rename({0: "coef"},
                                                                          axis=1)
coef_filtered3["coef_abs"] = np.absolute(coef_filtered3.coef)
coef_filtered3.sort_values("coef_abs", ascending=False)

| | coef | coef_abs |
|---|---|---|
| trips_in_first_30_days | 10.639454 | 10.639454 |
| avg_dist | -5.568104 | 5.568104 |
| phone_iphone | 1.058335 | 1.058335 |
| city_kingslanding | 0.905685 | 0.905685 |
| ultimate_black_user_true | 0.861433 | 0.861433 |
| city_astapor | -0.708534 | 0.708534 |
| city_winterfell | -0.195566 | 0.195566 |
