## Sections
* [Retention Percentage](#Retention_Percentage)
* [Modeling](#Modeling)
* [Feature Importances](#Feature_Importances)

In [1]:
import pandas as pd
import numpy as np
from statistics import mean
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score, log_loss
import chart_studio.plotly as py
import cufflinks as cf
import plotly.express as px
import plotly.graph_objects as go
%matplotlib inline
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
cf.go_offline()

- `city`: city this user signed up in
- `phone`: primary device for this user
- `signup_date`: date of account registration; in the form ‘YYYYMMDD’
- `last_trip_date`: the last time this user completed a trip; in the form ‘YYYYMMDD’
- `avg_dist`: the average distance in miles per trip taken in the first 30 days after signup
- `avg_rating_by_driver`: the rider’s average rating over all of their trips
- `avg_rating_of_driver`: the rider’s average rating of their drivers over all of their trips
- `surge_pct`: the percent of trips taken with surge multiplier > 1
- `avg_surge`: the average surge multiplier over all of this user’s trips
- `trips_in_first_30_days`: the number of trips this user took in the first 30 days after signing up
- `ultimate_black_user`: TRUE if the user took an Ultimate Black in their first 30 days; FALSE otherwise
- `weekday_pct`: the percent of the user’s trips occurring during a weekday

In [2]:
data = pd.read_json('ultimate_data_challenge.json')
data

Unnamed: 0,city,trips_in_first_30_days,signup_date,avg_rating_of_driver,avg_surge,last_trip_date,phone,surge_pct,ultimate_black_user,weekday_pct,avg_dist,avg_rating_by_driver
0,King's Landing,4,2014-01-25,4.7,1.10,2014-06-17,iPhone,15.4,True,46.2,3.67,5.0
1,Astapor,0,2014-01-29,5.0,1.00,2014-05-05,Android,0.0,False,50.0,8.26,5.0
2,Astapor,3,2014-01-06,4.3,1.00,2014-01-07,iPhone,0.0,False,100.0,0.77,5.0
3,King's Landing,9,2014-01-10,4.6,1.14,2014-06-29,iPhone,20.0,True,80.0,2.36,4.9
4,Winterfell,14,2014-01-27,4.4,1.19,2014-03-15,Android,11.8,False,82.4,3.13,4.9
...,...,...,...,...,...,...,...,...,...,...,...,...
49995,King's Landing,0,2014-01-25,5.0,1.00,2014-06-05,iPhone,0.0,False,100.0,5.63,4.2
49996,Astapor,1,2014-01-24,,1.00,2014-01-25,iPhone,0.0,False,0.0,0.00,4.0
49997,Winterfell,0,2014-01-31,5.0,1.00,2014-05-22,Android,0.0,True,100.0,3.86,5.0
49998,Astapor,2,2014-01-14,3.0,1.00,2014-01-15,iPhone,0.0,False,100.0,4.58,3.5


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   city                    50000 non-null  object 
 1   trips_in_first_30_days  50000 non-null  int64  
 2   signup_date             50000 non-null  object 
 3   avg_rating_of_driver    41878 non-null  float64
 4   avg_surge               50000 non-null  float64
 5   last_trip_date          50000 non-null  object 
 6   phone                   49604 non-null  object 
 7   surge_pct               50000 non-null  float64
 8   ultimate_black_user     50000 non-null  bool   
 9   weekday_pct             50000 non-null  float64
 10  avg_dist                50000 non-null  float64
 11  avg_rating_by_driver    49799 non-null  float64
dtypes: bool(1), float64(6), int64(1), object(4)
memory usage: 4.2+ MB


In [4]:
data.describe()

Unnamed: 0,trips_in_first_30_days,avg_rating_of_driver,avg_surge,surge_pct,weekday_pct,avg_dist,avg_rating_by_driver
count,50000.0,41878.0,50000.0,50000.0,50000.0,50000.0,49799.0
mean,2.2782,4.601559,1.074764,8.849536,60.926084,5.796827,4.778158
std,3.792684,0.617338,0.222336,19.958811,37.081503,5.707357,0.446652
min,0.0,1.0,1.0,0.0,0.0,0.0,1.0
25%,0.0,4.3,1.0,0.0,33.3,2.42,4.7
50%,1.0,4.9,1.0,0.0,66.7,3.88,5.0
75%,3.0,5.0,1.05,8.6,100.0,6.94,5.0
max,125.0,5.0,8.0,100.0,100.0,160.96,5.0


Judging by the maximum values of `trips_in_first_30_days`, `avg_surge`, `surge_pct`, and `avg_dist`, each of these columns appears to contain at least one outlier. The maxima lie more than 3 standard deviations above the third quartile, let alone the mean. Should I get rid of these?
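
Before deciding, it can help to count how many rows an outlier rule would actually flag. The following is a minimal sketch, assuming Tukey's 1.5×IQR fence as the rule (this notebook does not use it anywhere else) and a small made-up frame standing in for `data`. `iqr_outlier_mask` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy values only; the 125 mirrors the max of trips_in_first_30_days above.
toy = pd.DataFrame({"trips_in_first_30_days": [0, 0, 1, 1, 2, 3, 3, 4, 125]})

mask = iqr_outlier_mask(toy["trips_in_first_30_days"])
n_flagged = int(mask.sum())
print(n_flagged, "of", len(toy), "rows flagged")
```

If only a handful of rows are flagged, dropping or capping them is unlikely to change the model much; if many are, the heavy tail is probably real behavior rather than bad data.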

# Comments

- The prompt says there are two cities, but the data set contains three.
- I do not believe `avg_dist` is the average distance for the first 30 days. If that were the case, any person who didn't ride in the first 30 days would have an average distance of 0. That is not so, as the second entry shows.
- I'm going to assume `avg_rating_of_driver` is the average rating of the specific driver the rider happened to be riding with. I'm also going to assume `avg_rating_by_driver` is the average of all the ratings the rider gave out.
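
The `avg_dist` observation can be checked with a simple filter. This sketch uses only the first five rows of the table printed above (copied by hand into a frame called `sample`, a name introduced here for illustration); the same filter could be applied to the full `data` frame.

```python
import pandas as pd

# First five rows of the table above, restricted to the two relevant columns.
sample = pd.DataFrame({
    "trips_in_first_30_days": [4, 0, 3, 9, 14],
    "avg_dist": [3.67, 8.26, 0.77, 2.36, 3.13],
})

# Riders with no trips in the first 30 days but a non-zero average distance.
contradicting = sample[
    (sample["trips_in_first_30_days"] == 0) & (sample["avg_dist"] > 0)
]
n_contradicting = len(contradicting)
print(contradicting)
```

Any row returned by this filter contradicts the "first 30 days" reading of `avg_dist`.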

## Retention Percentage<a id='Retention_Percentage'></a>

To get the user retention percentage, we first need the number of users whose last trip was 30 or more days after their signup date. We then divide that number by the total number of users.

In [5]:
# Focus on signup_date and last_trip_date
user_dates = data[['signup_date', 'last_trip_date']]
user_dates.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   signup_date     50000 non-null  object
 1   last_trip_date  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [6]:
# Convert columns to datetime type (assign to new variables rather than to the
# user_dates slice, which would trigger a SettingWithCopyWarning)
signup_date = pd.to_datetime(user_dates.signup_date)
last_trip_date = pd.to_datetime(user_dates.last_trip_date)

# Get number of days between each date and append to original data
user_days = last_trip_date - signup_date
num_days = user_days.dt.days.tolist()
data['num_days'] = num_days

# Mean number of days between signup and last trip
mean_num_days = mean(num_days)

In [7]:
# Visualize distribution of days between signup and last trip
fig = px.histogram(num_days)
fig.update_layout(showlegend=False)
fig.update_xaxes(title='days between signup and last trip')
fig.add_vline(x=30, line_width=3, line_dash="dash", line_color="red", annotation_text='30 days')
fig.add_vline(x=mean_num_days, line_width=1, line_dash="dash", line_color="green", annotation_text='Mean Days')

In [8]:
# Mark days>=30 as 1 (retained), else mark as 0 (not-retained)
data['retention_status'] = np.where(data['num_days']>=30, 1, 0)
retention_labels = pd.DataFrame({
    'Status': np.where(data['retention_status']==1, 'Retained (>=30 days)', 'Not-Retained (<30 days)')
})

In [9]:
# Visualize Percentages
fig = px.pie(retention_labels, names='Status', title='Retention Percentages')
fig.update_layout(legend_title_text='Status', title_x=.42)

Almost 75% of users are retained, meaning their last trip was 30 or more days after signup.

In [10]:
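# Restrict to numeric and boolean columns; recent pandas versions raise on string columns in corr()
px.imshow(data.dropna().select_dtypes(include=['number', 'bool']).corr())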

The <ins>number of trips in the first 30 days</ins> and being an <ins>ultimate black user</ins> appear to be the features most correlated with retention status. We disregard `num_days`, the number of days the user has been active, because retention status is derived directly from it.
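
Reading the strongest correlations off a heatmap is easier with a sorted list. Below is a minimal sketch on made-up data (the values in `toy` are illustrative, not taken from the data set) that ranks features by absolute correlation with `retention_status` after dropping `num_days`.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "trips_in_first_30_days": [0, 1, 0, 5, 9, 2, 7, 0],
    "ultimate_black_user": [False, True, False, True, True, False, True, False],
    "num_days": [1, 10, 0, 150, 170, 20, 160, 5],
})
toy["retention_status"] = (toy["num_days"] >= 30).astype(int)

corr_with_target = (
    toy.drop(columns="num_days")   # the target is derived from num_days
       .astype(float)              # treat the boolean flag as 0/1
       .corr()["retention_status"]
       .drop("retention_status")   # drop the trivial self-correlation
       .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

Sorting by absolute value keeps strongly negative correlations near the top as well, which a visual scan of the heatmap can miss.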

## Modeling<a id='Modeling'></a>

We want to predict user retention. So we will use `retention_status` as our target variable. 

In [11]:
# fill in null values with mean
data.avg_rating_of_driver = data.avg_rating_of_driver.fillna(data.avg_rating_of_driver.mean())
data.avg_rating_by_driver = data.avg_rating_by_driver.fillna(data.avg_rating_by_driver.mean())

In [12]:
# Drop remaining rows with nulls (after the fill above, only phone has missing values)
df = data.dropna().reset_index(drop=True)

# set retention_status as target variable
y = df.retention_status.astype('int64')

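# OneHotEncode categorical variables
needs_encoding = df[['city', 'phone', 'ultimate_black_user']]
encode = OneHotEncoder()
array_encode = encode.fit_transform(needs_encoding).toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out()
# prefixes each category with its source column name
df_encode = pd.DataFrame(array_encode, columns=encode.get_feature_names_out())

# Join all explanatory variables. num_days and retention_status are dropped explicitly
# so the target cannot leak into X, whatever integer dtype np.where gave it.
numeric = df.drop(columns=['num_days', 'retention_status']).select_dtypes(include=['int64', 'float64'])
X = numeric.join(df_encode)
X = X.rename(columns={'avg_rating_by_driver':'avg_rating_given', 'city_Astapor':'Astapor',
                      "city_King's Landing":"King's Landing", 'city_Winterfell':'Winterfell',
                      'phone_Android':'Android', 'phone_iPhone':'iphone',
                      'ultimate_black_user_False':'Non-UB User', 'ultimate_black_user_True':'UB User'})
X.head()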

Unnamed: 0,trips_in_first_30_days,avg_rating_of_driver,avg_surge,surge_pct,weekday_pct,avg_dist,avg_rating_given,Astapor,King's Landing,Winterfell,Android,iphone,Non-UB User,UB User
0,4,4.7,1.1,15.4,46.2,3.67,5.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
1,0,5.0,1.0,0.0,50.0,8.26,5.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
2,3,4.3,1.0,0.0,100.0,0.77,5.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
3,9,4.6,1.14,20.0,80.0,2.36,4.9,0.0,1.0,0.0,0.0,1.0,0.0,1.0
4,14,4.4,1.19,11.8,82.4,3.13,4.9,0.0,0.0,1.0,1.0,0.0,1.0,0.0


In [13]:
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [14]:
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((37203, 14), (37203,), (12401, 14), (12401,))

In [15]:
# Simple pipeline
steps = [('scaler', StandardScaler()), 
         ('log', LogisticRegression(random_state = 0))]

pipe = Pipeline(steps)

params = {'log__C':np.logspace(-2, 2, 50)}

gs = GridSearchCV(pipe, params, cv=5).fit(x_train, y_train)

In [16]:
y_pred = gs.predict(x_test)

In [17]:
labels = ['Not-Retained', 'Retained']
print(classification_report(y_test, y_pred, target_names = labels, digits = 5))
pd.DataFrame(confusion_matrix(y_test, y_pred), index=labels, columns=labels)

              precision    recall  f1-score   support

Not-Retained    0.44366   0.01984   0.03797      3176
    Retained    0.74606   0.99144   0.85142      9225

    accuracy                        0.74260     12401
   macro avg    0.59486   0.50564   0.44470     12401
weighted avg    0.66862   0.74260   0.64309     12401



Unnamed: 0,Not-Retained,Retained
Not-Retained,63,3113
Retained,79,9146


In [18]:
# Share of test users predicted as retained (second column of the confusion matrix)
cm = confusion_matrix(y_test, y_pred)
p_ret = cm[:, 1].sum() / cm.sum()
print(f'{p_ret:.4f}')

0.9885

This is clearly a terrible model: it predicts that almost 99% of users will be retained and captures barely 2% of the non-retained users.
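
To put the reported accuracy in context, it can be compared with a trivial rule that always predicts the majority class. The sketch below uses only the confusion matrix printed above; the variable names are introduced here for illustration.

```python
import numpy as np

# Confusion matrix reported above: rows = actual, columns = predicted,
# both in the order [Not-Retained, Retained].
cm = np.array([[63, 3113],
               [79, 9146]])

# Fraction of correct predictions (diagonal over total).
accuracy = np.trace(cm) / cm.sum()

# Accuracy of always predicting the larger actual class ("Retained").
majority_baseline = cm.sum(axis=1).max() / cm.sum()

# Balanced accuracy is the mean of the per-class recalls.
per_class_recall = np.diag(cm) / cm.sum(axis=1)
balanced_accuracy = per_class_recall.mean()

print(f"accuracy:          {accuracy:.5f}")
print(f"majority baseline: {majority_baseline:.5f}")
print(f"balanced accuracy: {balanced_accuracy:.5f}")
```

On these numbers the model's accuracy falls slightly below the majority-class baseline, and its balanced accuracy is close to 0.5, which supports the conclusion that it has learned little beyond the class imbalance. One option worth trying, as an assumption rather than something tested here, is `LogisticRegression(class_weight='balanced')`.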

## Feature Importances<a id='Feature_Importances'></a>

In [19]:
coefficients = pd.DataFrame()
coefficients['features'] = X.columns
# Coefficients of the fitted logistic regression step (features were standardized by the scaler)
coefficients['coefficients'] = gs.best_estimator_.named_steps['log'].coef_.ravel().round(3)
fig = px.bar(coefficients, x='features', y='coefficients', text='coefficients')
fig.update_traces(textposition='outside')

Even though this is a poor model, the coefficients suggest that Ultimate should focus more on riders from King's Landing and find ways to promote shorter rides. Also, as one would expect, we should try to increase the number of UB users.
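
Because the pipeline standardizes the features, each coefficient is the change in log-odds per one standard deviation of that feature, and exponentiating it gives an odds ratio. The sketch below illustrates this on synthetic data; the columns `helps` and `hurts` and the effect sizes are invented for the example and have nothing to do with the Ultimate data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one feature raises the odds of the positive class, one lowers them.
rng = np.random.default_rng(0)
n = 2000
X_toy = pd.DataFrame({
    "helps": rng.normal(size=n),
    "hurts": rng.normal(size=n),
})
logit = 1.5 * X_toy["helps"] - 1.0 * X_toy["hurts"]
y_toy = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

pipe_toy = Pipeline([("scaler", StandardScaler()), ("log", LogisticRegression())])
pipe_toy.fit(X_toy, y_toy)

# One coefficient per feature; exp() turns log-odds into odds ratios.
coefs = pd.Series(pipe_toy.named_steps["log"].coef_.ravel(), index=X_toy.columns)
odds_ratios = np.exp(coefs)
print(odds_ratios.round(2))
```

An odds ratio above 1 means the feature pushes predictions toward the positive class, and below 1 away from it. With one-hot columns that sum to 1 within a group (such as the three city columns), individual coefficients are only identified relative to each other, so they are best read as contrasts between categories.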