# ATG Data Scientist Challenge

Thank you for your interest in joining the data science team at Uber ATG. The next step is to complete Uber ATG’s take home exercise. This will allow you to get an idea of what it's like to work for us while showcasing your statistics, programming, and data analysis capabilities. 

While we invite you to fill out the remainder of this notebook for your submission, you may send your results back in any format as long as the work/code/analysis is **reproducible**. We've had candidates submit RMarkdown HTMLs or LaTeX generated PDFs as well. You may use any language (or Jupyter Kernel) you want but keep in mind that we primarily do deployment, engineering, and analysis work in Python.

There is no time limit, but please try to send back the completed assignment within 1 week of receiving it. Please delete any data you have downloaded from us after submitting the assignment.

If you have any questions about the assignment, please reach out to your recruiter.

Thanks,
ATG Data Science


# Driver Signup Analysis

You can use the CSV file
  * `ds_challenge_v2_data.csv` 

bundled in the zip file with this notebook, or download the data set from the following link:

* [**Dataset Download Link**](https://drive.google.com/a/uber.com/file/d/0BxkZqrCogcyWbUs2Smhlc0VSams/view?usp=drive_web)


Uber’s Driver team is interested in predicting which driver signups are most likely to start driving. To help explore this question, we have provided a sample dataset of a cohort of driver signups in January 2015. The data was pulled a few months after they signed up to include the result of whether they actually completed their first trip. It also includes several pieces of background information gathered about the driver and their car.


See below for a detailed description of the dataset:

- **id:** driver_id
- **city_id:** city_id this user signed up in
- **signup_os:** signup device of the user (“android”, “ios”, “website”, “other”)
- **signup_channel:** what channel did the driver sign up from (“offline”, “paid”, “organic”, “referral”)
- **signup_timestamp:** timestamp of account creation; local time in the form ‘YYYY-MM-DD’
- **bgc_date:** date of background check consent; in the form ‘YYYY-MM-DD’
- **vehicle_added_date:** date when driver’s vehicle information was uploaded; in the form ‘YYYY-MM-DD’
- **first_trip_date:** date of the first trip as a driver; in the form ‘YYYY-MM-DD’
- **vehicle_model:** model of vehicle uploaded (i.e. Accord, Prius, 350z)
- **vehicle_year:** year that the car was made; in the form ‘YYYY’


Our primary goal is to understand what factors are best at predicting whether a signup will start to drive, and offer suggestions to operationalize those insights to help Uber. This take home consists of answering three main tasks with some discussion questions under them to get you started.

_Ordering and presentation format is up to you, but we love analyses that are well organized and have a linear flow from data ingestion to final result(s). We especially look for well stated assumptions and an eye for business/product impact of any analysis or model:_


- **Conduct an exploratory analysis of the data to give us qualitative and quantitative insights.** 
    - Does all the data make sense? Did you have to throw anything away? Are there interesting patterns that emerge?


- **Build a statistical model to predict whether a driver that signed up will begin driving for Uber.**
    - How did the model perform? Are there any caveats? How can Uber use your model to improve our product?
    


- **_Optional_: Build a model to forecast the number of new drivers we expect to start every week.**
    - How would you validate a model like this? What other information would you use if you had access to all of Uber's data?
    - Feel free to include model results in your presentation and analysis.
    
    
 

In [29]:
%%HTML
<script>
  function code_toggle() {
    if (code_shown){
      $('div.input').hide('500');
      $('#toggleButton').val('Show Code')
    } else {
      $('div.input').show('500');
      $('#toggleButton').val('Hide Code')
    }
    code_shown = !code_shown
  }

  $( document ).ready(function(){
    code_shown=false;
    $('div.input').hide()
  });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>

### ^Toggle Button for the Code. Turned off by default to improve readability.

In [3]:
import pandas as pd

data = pd.read_csv('ds_challenge_v2_data.csv')
#flag signups that completed a first trip
data['first_trip_completed'] = data['first_completed_date'].notnull()

#### Sample drivers come largely from 2 cities.

## Looking at Date-Time data in relation to trip-completion

In [4]:
#convert dates to Pandas DateTime
data['signup_date'] = pd.to_datetime(data['signup_date'])
data['bgc_date']=pd.to_datetime(data['bgc_date'])
data['vehicle_added_date']=pd.to_datetime(data['vehicle_added_date'])
data['first_completed_date']=pd.to_datetime(data['first_completed_date'])

In [5]:

#check date quartile distribution
data[['signup_date','bgc_date','vehicle_added_date','first_completed_date']].describe(include='all')

Unnamed: 0,signup_date,bgc_date,vehicle_added_date,first_completed_date
count,54681,32743,13134,6137
unique,30,74,78,57
top,2016-01-05 00:00:00,2016-01-29 00:00:00,2016-01-26 00:00:00,2016-01-23 00:00:00
freq,2489,1119,377,257
first,2016-01-01 00:00:00,2016-01-01 00:00:00,2016-01-01 00:00:00,2016-01-04 00:00:00
last,2016-01-30 00:00:00,2016-03-25 00:00:00,2016-03-26 00:00:00,2016-02-29 00:00:00


In [6]:
#Distribution span of First Completed Trip in months
fig = data['first_completed_date'].dt.month.value_counts().iplot(title='Span of Dates(months) from the sample',kind='bar', asFigure= True)
fig['data'].update({'x':['January','February']})
py.iplot(fig)

First-trip dates in the sample span 2 months, mainly January. (The bar graph shows only first-trip dates; per the table above, background-check and vehicle-added dates run later, into late March.)

#### Examining Duration between date-metrics

In [7]:
#examine different Time Delta's and create duration data
data['dur_signup-bgc'] = data['bgc_date']-data['signup_date']
data['dur_bgc-vadd'] = data['vehicle_added_date']- data['bgc_date']
data['dur_signup-vadd'] = data['vehicle_added_date']- data['signup_date']
data['dur_bgc-trip'] = data['first_completed_date']- data['bgc_date']
data['dur_vadd-trip'] = data['first_completed_date']- data['vehicle_added_date']
data['dur_signup-trip'] = data['first_completed_date']- data['signup_date']
data[['dur_signup-bgc','dur_signup-vadd','dur_bgc-vadd','dur_bgc-trip','dur_vadd-trip','dur_signup-trip']].describe(include='all')


Unnamed: 0,dur_signup-bgc,dur_signup-vadd,dur_bgc-vadd,dur_bgc-trip,dur_vadd-trip,dur_signup-trip
count,32743,13134,12794,5984,5872,6137
mean,10 days 01:04:23.103564,15 days 04:52:37.606212,7 days 04:16:23.711114,8 days 20:29:40.748663,5 days 21:47:49.209809,12 days 00:45:03.079680
std,10 days 12:48:19.197391,14 days 01:33:12.396720,9 days 08:24:05.184277,6 days 19:04:19.437692,5 days 21:10:11.195601,7 days 17:30:26.639917
min,0 days 00:00:00,-5 days +00:00:00,-30 days +00:00:00,-14 days +00:00:00,-18 days +00:00:00,0 days 00:00:00
25%,2 days 00:00:00,4 days 00:00:00,1 days 00:00:00,3 days 00:00:00,1 days 00:00:00,6 days 00:00:00
50%,6 days 00:00:00,11 days 00:00:00,4 days 00:00:00,7 days 00:00:00,4 days 00:00:00,11 days 00:00:00
75%,15 days 00:00:00,24 days 00:00:00,10 days 00:00:00,13 days 00:00:00,9 days 00:00:00,17 days 00:00:00
max,69 days 00:00:00,72 days 00:00:00,55 days 00:00:00,30 days 00:00:00,30 days 00:00:00,30 days 00:00:00


This table summarizes the duration columns: counts, means, spreads, and quartiles. The negative minimums indicate out-of-order dates, which are handled in the data cleaning step.

In [8]:
pie = pd.DataFrame(data.loc[data['first_trip_completed']==True]['city_name'].value_counts())
pie

Unnamed: 0,city_name
Strark,3239
Berton,2437
Wrouver,461


### Most Signup Traffic is coming from Mobile

In [9]:
pie = pd.DataFrame(data.loc[data['first_trip_completed']==True]['signup_os'].value_counts())
pie2 = pd.DataFrame(data.loc[data['first_trip_completed']==True]['signup_channel'].value_counts())
colors = ['#FEBFB3', '#E1396C', '#96D38C', '#D0F9B1']
fig = {
  "data": [
    {
      "values": pie.signup_os,
      "labels": pie.index,
      "domain": {"x": [0, .48]},
      "name": "Sign-Up OS",
      "hoverinfo":"label+percent+name",
      "hole": .4,
      "type": "pie",
        'text': list(pie.index),
        'marker':{'colors':colors}
    },     
    {
      "values": pie2.signup_channel,
      "labels": pie2.index,
      "textposition":"inside",
      "domain": {"x": [.52, 1]},
      "name": "Acquisition Channel",
      "hoverinfo":"label+percent+name",
      "hole": .4,
      "type": "pie",
        'text': list(pie2.index),
        'marker':{'colors':colors}
    }],
  "layout": {
        "title":"Incoming Channels that ended up having their first trip",
        "annotations": [
            {
                "font": {
                    "size": 20
                },
                "showarrow": False,
                "text": "Sign-Up <br>OS",
                "x": 0.18,
                "y": 0.5
            },
            {
                "font": {
                    "size": 20
                },
                "showarrow": False,
                "text": "Acquisition<br> Channel",
                "x": 0.83,
                "y": 0.5
            }
        ]
    }
}
#fig['data'].update({'text': list(pie.index)+ list(pie2.index),'textposition':'auto','marker':{'colors':colors}})
py.iplot(fig)

In [10]:
pie = pd.DataFrame(data['signup_os'].value_counts())
pie2 = pd.DataFrame(data['signup_channel'].value_counts())
colors = ['#FEBFB3', '#E1396C', '#96D38C', '#D0F9B1']
fig = {
  "data": [
    {
      "values": pie.signup_os,
      "labels": pie.index,
      "domain": {"x": [0, .48]},
      "name": "Sign-Up OS",
      "hoverinfo":"label+percent+name",
      "hole": .4,
      "type": "pie",
        'text': list(pie.index),
        'marker':{'colors':colors}
    },     
    {
      "values": pie2.signup_channel,
      "labels": pie2.index,
      "textposition":"inside",
      "domain": {"x": [.52, 1]},
      "name": "Acquisition Channel",
      "hoverinfo":"label+percent+name",
      "hole": .4,
      "type": "pie",
        'text': list(pie2.index),
        'marker':{'colors':colors}
    }],
  "layout": {
        "title":"Total Incoming Channels",
        "annotations": [
            {
                "font": {
                    "size": 20
                },
                "showarrow": False,
                "text": "Sign-Up <br>OS",
                "x": 0.18,
                "y": 0.5
            },
            {
                "font": {
                    "size": 20
                },
                "showarrow": False,
                "text": "Acquisition<br> Channel",
                "x": 0.83,
                "y": 0.5
            }
        ]
    }
}
#fig['data'].update({'text': list(pie.index)+ list(pie2.index),'textposition':'auto','marker':{'colors':colors}})
py.iplot(fig)

### Preferred/popular makes and models among Uber Drivers

In [11]:
fig = tls.make_subplots(rows=2, cols=1,subplot_titles=('Top Makes Uber Drivers Prefer','Top Models Uber Drivers Prefer'))
fig1 = data.loc[data['first_trip_completed']==True]['vehicle_make'].value_counts()[:6].iplot(asFigure = True,title='Top Car Manufacturers Uber Drivers Add',kind='bar')
fig.append_trace(fig1['data'][0],1,1)

This is the format of your plot grid:
[ (1,1) x1,y1 ]
[ (2,1) x2,y2 ]



In [12]:
fig2 = data.loc[data['first_trip_completed']==True]['vehicle_model'].value_counts()[:6].iplot(asFigure = True, title='Top Models Uber Drivers Prefer to Add',kind='bar')
fig.append_trace(fig2['data'][0],2,1)
py.iplot(fig)

## Data Cleaning

In [13]:
print('Out of place Values:')
print('SignUp' ,len(data[data['dur_signup-trip'].dt.days<0]))
print('BGC',len(data[data['dur_bgc-trip'].dt.days<0]))
print('vehicle' ,len(data[data['dur_vadd-trip'].dt.days<0]))

Out of place Values:
SignUp 0
BGC 33
vehicle 54


## Cleaning Solutions:
Make sure there are no trips dated before the background check or before the vehicle was added. 
* Where that happens, preserve the trip date and set the BGC or vehicle-added date to the midpoint between signup and first trip. 

In [14]:
#fix rows that claim a first trip before the vehicle was added or the background check was done: replace the offending date with the midpoint of signup and first trip
#loop to fill in missing, or correct out-of-order, background check dates
for i, row in data.loc[(data['signup_date'].notnull()&data['first_completed_date'].notnull()&data['bgc_date'].isnull())|\
                       (data['dur_bgc-trip'].dt.days<0) | (data['dur_signup-bgc'].dt.days<0),:].iterrows():
    a = row['signup_date']
    b = row['first_completed_date']
    c = a + (b - a)/2
    data.at[i, 'bgc_date'] = pd.to_datetime(c)
#loop to fill-in missing or edit the vehicle added date column
for i, row in data.loc[(data['signup_date'].notnull()&data['first_completed_date'].notnull()&data['vehicle_added_date'].isnull())|\
                       (data['dur_vadd-trip'].dt.days<0) | (data['dur_signup-vadd'].dt.days<0),:].iterrows():
    a = row['signup_date']
    b = row['first_completed_date']
    c = a + (b - a)/2
    data.at[i, 'vehicle_added_date'] = pd.to_datetime(c)
    
#Note: re-run the duration cell (In [7]) so the duration columns reflect the cleaned dates
#Then re-run the check cell directly above to confirm the dates are consistently in order
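
The midpoint rule used in the loops above can also be written without `iterrows`. The following is a minimal sketch on a toy frame with made-up dates; the helper name `fill_with_midpoint` is hypothetical. Unlike the loops, it deliberately touches only rows that have both a signup and a first trip, which is an assumption about the intent rather than a reproduction of the original behavior.

```python
import pandas as pd

# Toy frame with hypothetical values; column names follow the notebook.
df = pd.DataFrame({
    'signup_date': pd.to_datetime(['2016-01-01', '2016-01-10', '2016-01-05']),
    'bgc_date': pd.to_datetime(['2016-01-03', None, '2016-01-20']),
    'first_completed_date': pd.to_datetime(['2016-01-09', '2016-01-20', '2016-01-15']),
})

def fill_with_midpoint(frame, col):
    """Replace missing or out-of-order dates in `col` with the midpoint
    between signup and first trip, only for rows that have a first trip."""
    has_trip = frame['signup_date'].notna() & frame['first_completed_date'].notna()
    out_of_order = (frame[col] > frame['first_completed_date']) | (frame[col] < frame['signup_date'])
    needs_fix = has_trip & (frame[col].isna() | out_of_order)
    midpoint = frame['signup_date'] + (frame['first_completed_date'] - frame['signup_date']) / 2
    fixed = frame.copy()  # leave the input frame untouched
    fixed.loc[needs_fix, col] = midpoint[needs_fix]
    return fixed

cleaned = fill_with_midpoint(df, 'bgc_date')
print(cleaned)
```

Restricting the fix to rows with a first trip avoids overwriting a valid date with `NaT` when the midpoint cannot be computed.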

### Assumptions and Notes
#### Ordering Issues
* Background checks and vehicle additions appear to have no fixed (topological) order; they can occur concurrently or out of order. 
* Entries with the background check **AFTER** the vehicle was added make up only 2.1% of this sample, so while atypical, these ordering exceptions can reasonably be accepted.

#### Spanning Problem
* The sample appears to cover one month of signups and 2 months of first trips. Data spanning a wider time period, and therefore more seasonality, would be better for model fitting, but the other features should still help when we build the model later. 

In [15]:
bandv = data.loc[data['bgc_date'].notnull()&data['vehicle_added_date'].notnull()]
print('entries with background checks before vehicle added:',len(bandv.loc[bandv['bgc_date']<bandv['vehicle_added_date']]))
print('entries with background checks after vehicle added:', len(bandv.loc[bandv['bgc_date']>bandv['vehicle_added_date']]))

entries with background checks before vehicle added: 10337
entries with background checks after vehicle added: 280


In [16]:
# calculating conversion percentages
count= data.count()
perc= round(count/data.shape[0] *100,1)

### Only 11.2% of Signups end up completing their first trip

In [17]:
titl = f"Conversion of Signup to First-Completed-Trip<br><b>{perc['bgc_date']}%->{perc['vehicle_added_date']}%->{perc['first_completed_date']}%</b><br>(% relative to total signups)"
fig = data[['signup_date','bgc_date','vehicle_added_date','first_completed_date']].count().iplot(kind='bar',title=titl,asFigure= True) 
fig['data'].update({'text':data[['signup_date','bgc_date','vehicle_added_date','first_completed_date']].count(),'textposition':'auto', 'opacity':0.8,'marker':{'color':'rgb(158,202,225)'}})
py.iplot(fig)

### Friction between conversions. Why might that be?

The largest relative drop is in vehicle_added, possibly due to:
* Failed Inspections
* Shared Accounts/Vehicles
* Insufficient motivation to reach an inspection dealer

Notes: 
* The duration between the background check date and the vehicle-added date may matter less, since those two steps can be done concurrently, with no fixed order. 
* Time-to-completion in those two stages is likely less of a factor than in the other stages. Still, getting the vehicle inspected and approved may be significant friction for some applicants.
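
To make the "largest relative drop" concrete, here is a small sketch that computes step-to-step conversion from the non-null counts in the date summary table near the top of the notebook. Those counts predate the cleaning step, so the exact percentages after cleaning may differ slightly; the variable names are illustrative only.

```python
# Non-null counts per stage, copied from the describe() output of the date columns.
funnel = [
    ('signup_date', 54681),
    ('bgc_date', 32743),
    ('vehicle_added_date', 13134),
    ('first_completed_date', 6137),
]

total = funnel[0][1]
step_rates = {}
# Pair each stage with the one before it to get stage-over-stage conversion.
for (prev_name, prev_n), (name, n) in zip(funnel, funnel[1:]):
    step_rates[name] = n / prev_n
    print(f'{prev_name} -> {name}: {n / prev_n:.1%} of previous stage, '
          f'{n / total:.1%} of signups')

# The stage with the lowest step-to-step rate is where the funnel loses the most.
weakest = min(step_rates, key=step_rates.get)
print('weakest step:', weakest)
```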

In [18]:
#copy of cell from before; recalculate duration values to cleaned data
# data['dur_signup-bgc'] = data['bgc_date']-data['signup_date']
# data['dur_bgc-vadd'] = data['vehicle_added_date']- data['bgc_date']
# data['dur_signup-vadd'] = data['vehicle_added_date']- data['signup_date']
# data['dur_bgc-trip'] = data['first_completed_date']- data['bgc_date']
# data['dur_vadd-trip'] = data['first_completed_date']- data['vehicle_added_date']
# data['dur_signup-trip'] = data['first_completed_date']- data['signup_date']

In [19]:
##** MAKE SURE TO RUN DURATION CELL FROM ABOVE OR THE HISTOGRAM WILL LOOK SKEWED

fig = data['dur_signup-bgc'].dt.days.iplot(asFigure = True, kind='histogram', title = 'Days to Completion: Frequency of Successful Conversion by days<br>(Hover Mouse to compare values)')
fig['layout'].update({'barmode':'stack'})
t1 = go.Histogram(x = data['dur_signup-vadd'].dt.days, name = 'dur_signup-vehAdd')
t2 = go.Histogram(x= data['dur_signup-trip'].dt.days, name = 'dur_signup-trip')
t3 =  go.Histogram(x= data['dur_vadd-trip'].dt.days ,name = 'dur_vehAdd-trip')
t4 = go.Histogram(x= data['dur_bgc-trip'].dt.days ,name = 'dur_bgc-trip')
fig['data'].append(t1)
fig['data'].append(t2)
fig['data'].append(t3)
fig['data'].append(t4)
py.iplot(fig)

All duration data appear to have a right-skewed distribution, which suggests that the sooner an applicant moves on to the next step of the process, the more likely they are to complete a first trip. 

In [20]:
#data[['dur_signup-bgc','dur_signup-vadd','dur_bgc-vadd','dur_bgc-trip','dur_vadd-trip','dur_signup-trip']].corr()

## Building a Statistical Model 
### to predict whether a new signup will begin their first drive with Uber

Prework before running a machine learning model:
* Selecting which features to include in our model (intuitively and through testing). 
* Create dummy variables so our categorical/non-numerical data can be fed into the model.
* Select which models and algorithms to use
* Tuning Parameters and accuracy/validation testing
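
The dummy-variable step in the list above can be illustrated on a toy frame. This is only a sketch: the rows are made up, and the category labels are assumed to match those seen in the notebook's column names.

```python
import pandas as pd

# Toy rows (hypothetical); only the column names follow the notebook.
toy = pd.DataFrame({
    'signup_channel': ['Organic', 'Paid', 'Referral', 'Paid'],
    'first_trip_completed': [False, True, True, False],
})

# drop_first=True drops the first level in sorted order ('Organic' here),
# which then acts as the baseline category and avoids a redundant column.
encoded = pd.get_dummies(toy, columns=['signup_channel'], drop_first=True)
print(list(encoded.columns))
```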

In [21]:
data.columns

Index(['id', 'city_name', 'signup_os', 'signup_channel', 'signup_date',
       'bgc_date', 'vehicle_added_date', 'vehicle_make', 'vehicle_model',
       'vehicle_year', 'first_completed_date', 'first_trip_completed',
       'dur_signup-bgc', 'dur_bgc-vadd', 'dur_signup-vadd', 'dur_bgc-trip',
       'dur_vadd-trip', 'dur_signup-trip'],
      dtype='object')

In [24]:
#data['dur_signup-trip'] = data.loc[(data['dur_signup-trip'].notnull())].apply(lambda x: x.days()if(pd.notnull(x)) else x)
# data['dur_bgc-trip'] = data['dur_bgc-trip'].apply(lambda x: x.days())
# data['dur_vadd-trip']= data['dur_signup-trip'].apply(lambda x: x.days())


In [26]:
dummy = pd.get_dummies(data, columns=['city_name','signup_os','signup_channel'],drop_first=True)

In [27]:
data = pd.concat([dummy],axis = 1)

In [28]:
data

Unnamed: 0,id,signup_date,bgc_date,vehicle_added_date,vehicle_make,vehicle_model,vehicle_year,first_completed_date,first_trip_completed,dur_signup-bgc,...,dur_vadd-trip,dur_signup-trip,city_name_Strark,city_name_Wrouver,signup_os_ios web,signup_os_mac,signup_os_other,signup_os_windows,signup_channel_Paid,signup_channel_Referral
0,1,2016-01-02,NaT,NaT,,,,NaT,False,NaT,...,NaT,NaT,1,0,1,0,0,0,1,0
1,2,2016-01-21,NaT,NaT,,,,NaT,False,NaT,...,NaT,NaT,1,0,0,0,0,1,1,0
2,3,2016-01-11,2016-01-11 00:00:00,NaT,,,,NaT,False,0 days,...,NaT,NaT,0,1,0,0,0,1,0,0
3,4,2016-01-29,2016-02-03 00:00:00,2016-02-03 00:00:00,Toyota,Corolla,2016.0,2016-02-03,True,5 days,...,0 days,5 days,0,0,0,0,0,0,0,1
4,5,2016-01-10,2016-01-25 00:00:00,2016-01-26 00:00:00,Hyundai,Sonata,2016.0,NaT,False,15 days,...,NaT,NaT,1,0,0,0,0,0,0,1
5,6,2016-01-18,2016-01-18 00:00:00,2016-01-22 00:00:00,Cadillac,DTS,2006.0,NaT,False,0 days,...,NaT,NaT,1,0,0,0,0,0,0,1
6,7,2016-01-14,2016-01-16 00:00:00,2016-01-21 00:00:00,Toyota,Prius V,2014.0,2016-01-23,True,2 days,...,2 days,9 days,1,0,1,0,0,0,1,0
7,8,2016-01-26,2016-02-05 00:00:00,NaT,,,,NaT,False,10 days,...,NaT,NaT,1,0,1,0,0,0,0,1
8,9,2016-01-05,NaT,NaT,,,,NaT,False,NaT,...,NaT,NaT,1,0,0,0,0,0,0,1
9,10,2016-01-25,NaT,NaT,,,,NaT,False,NaT,...,NaT,NaT,0,0,1,0,0,0,1,0


Logistic regression is a natural starting point here because the target is binary. The model is evaluated with k-fold cross-validation, so every signup is scored by a model that was not trained on it. 
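
As a rough illustration of k-fold cross-validation on an imbalanced binary target, the sketch below uses synthetic data, not the challenge dataset. The names `X_toy` and `y_toy` and the data-generating process are assumptions made for the example; `StratifiedKFold` is used so each fold keeps a similar share of positives.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic, imbalanced toy data (hypothetical; positives are the minority class).
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
prob = 1 / (1 + np.exp(-(2 * x - 3)))
y_toy = (rng.random(n) < prob).astype(int)
X_toy = x.reshape(-1, 1)

# Stratified folds keep the positive rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_rates = [y_toy[test_idx].mean() for _, test_idx in cv.split(X_toy, y_toy)]

# Each row is predicted by a model fitted on the other folds only.
pred = cross_val_predict(LogisticRegression(), X_toy, y_toy, cv=cv)
recall = recall_score(y_toy, pred)

print('positive rate per fold:', np.round(fold_rates, 3))
print('recall on positives:', recall)
```

With an imbalanced target, per-class recall is worth reading alongside accuracy, since accuracy alone can look high even when few positives are found.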

In [259]:
X = data.iloc[:,-8:]

In [None]:
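from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

#target: whether the signup completed a first trip (cast to int for the regressor)
y = data['first_trip_completed'].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                        test_size=0.3, 
                        random_state=1)
#criterion is left at its default (mean squared error)
forest = RandomForestRegressor(n_estimators=1000, 
                                random_state=1, 
                                n_jobs=-1)
forest.fit(X_train, y_train)
y_train_pred = forest.predict(X_train)
y_test_pred = forest.predict(X_test)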

In [None]:
import numpy as np
from sklearn import metrics

print('MSE train set:', metrics.mean_squared_error(y_train, y_train_pred))
print('MSE test set:', metrics.mean_squared_error(y_test, y_test_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))
print('r2 score train:', metrics.r2_score(y_train, y_train_pred))
print('r2 score test:', metrics.r2_score(y_test, y_test_pred))

In [260]:
# Your code here feel free to use multiple cells and include markdown, graphs, latex or equations as you see fit.
y = data['first_trip_completed']
# from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=5)
# from sklearn.linear_model import LogisticRegression
# lm = LogisticRegression()
# lm.fit(X_train,y_train)
# # mse = np.mean((lm.predict(X_test) - y_test)**2)
# y_train_pred=lm.predict(X_train)
# y_pred = lm.predict(X_test)


from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import cross_val_predict
lr = LogisticRegression()
predicted = cross_val_predict(lr, X, y, cv=10)
print(metrics.accuracy_score(y, predicted))
print(metrics.classification_report(y, predicted))


0.8877672317624038
             precision    recall  f1-score   support

      False       0.89      1.00      0.94     48544
       True       0.00      0.00      0.00      6137

avg / total       0.79      0.89      0.83     54681




Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
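
The warning above, together with the zero recall on `True`, suggests the model predicts `False` for every signup. A quick sanity check, using only the support counts from the classification report, is to compute what an always-`False` baseline would score:

```python
# Support counts copied from the classification report above.
n_false, n_true = 48544, 6137

# A "model" that always predicts False is right on every False row
# and wrong on every True row.
baseline_accuracy = n_false / (n_false + n_true)
baseline_recall_true = 0 / n_true

print(f'majority-class accuracy: {baseline_accuracy:.4f}')
print(f'recall on True: {baseline_recall_true:.2f}')
```

If the baseline accuracy matches the reported accuracy, the model is not yet separating the two classes with the features it was given.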



## Post-Model Notes
#### Overfit:
The classification report shows the model never predicts `True`: its 88.8% accuracy equals the share of signups that never drive, so as it stands it adds nothing over a majority-class baseline. Beyond that, the sample covers a single signup month, so even a better-fitted model would be tuned to this period. For new and more recent samples there will likely be a lot of variance (due to seasonality, changes outside the sample data, etc.), leaving an overfit model that is suboptimal at predicting new data. 
#### Solution:
With Uber's huge data warehouse, it would be interesting to build a low-bias model and train it on far more data (for example, a neural network with over 100 million training examples); that would most likely forecast future data much better.

In [None]:
# Further code or markdowns





In [None]:
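#weekly aggregation template; `df`, 'Name', 'Date' and 'Quantity' are placeholder names, not columns of this dataset
#shift dates back 7 days so each weekly bin is labelled by the Monday before it rather than the Monday that closes it
df['Date'] = pd.to_datetime(df['Date']) - pd.to_timedelta(7, unit='d')
df = (df.groupby(['Name', pd.Grouper(key='Date', freq='W-MON')])['Quantity']
        .sum()
        .reset_index()
        .sort_values('Date'))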
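
For the optional forecasting task, the weekly aggregation above could be applied to first-trip dates. The following is a minimal sketch on made-up dates, with a naive "last week repeats" forecast standing in as a baseline; it is not a fitted forecasting model, and the names `trips`, `weekly`, and `naive_forecast` are illustrative.

```python
import pandas as pd

# Toy first-trip dates (hypothetical values; the column name follows the notebook).
trips = pd.DataFrame({
    'first_completed_date': pd.to_datetime([
        '2016-01-04', '2016-01-05', '2016-01-08',
        '2016-01-12', '2016-01-13',
        '2016-01-19', '2016-01-20', '2016-01-22', '2016-01-24',
    ])
})

# Count new drivers per Monday-to-Sunday week; each bin is labelled
# by the Monday that starts it.
weekly = (
    trips.groupby(pd.Grouper(key='first_completed_date', freq='W-MON',
                             closed='left', label='left'))
         .size()
         .rename('new_drivers')
)

# Naive baseline: next week's forecast equals the last observed week.
naive_forecast = weekly.iloc[-1]

print(weekly)
print('naive forecast for next week:', naive_forecast)
```

Any real forecasting model would need to beat this kind of baseline on held-out weeks to be worth deploying.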