# LYFT - Driver Lifetime Value

## Contents

* [Assignment](#Assignment)
* [Data Description](#Data-Description)
* [Read and Explore all Datasets](#Read-and-Explore-all-Datasets)
* [Data Engineering](#Data-Engineering)
* [Question 2](#Question-2)
* [Question 3](#Question-3)

## Assignment
After exploring and analyzing the data, please:

1. Recommend a Driver's Lifetime Value (i.e., the value of a driver to Lyft over the entire projected lifetime of a driver).


2. What are the main factors that affect a driver's lifetime value?


3. What is the average projected lifetime of a driver? That is, once a driver is onboarded, how long do they typically continue driving with Lyft?


4. Do all drivers act alike? Are there specific segments of drivers that generate more value for Lyft than the average driver?


5. What actionable recommendations are there for the business?


## Data Description

You'll find three CSV files attached with the following data:

### driver_ids.csv

driver_id  = Unique identifier for a driver

driver_onboard_date  = Date on which driver was on-boarded


### ride_ids.csv

driver_id  = Unique identifier for a driver

ride_id  = Unique identifier for a ride that was completed by the driver

ride_distance = Ride distance in meters

ride_duration = Ride duration in seconds

ride_prime_time  = Prime Time applied on the ride


### ride_timestamps.csv

ride_id = Unique identifier for a ride

event =  describes the type of event; this variable takes the following values:

          requested_at - passenger requested a ride
          accepted_at - driver accepted a passenger request
          arrived_at - driver arrived at pickup point
          picked_up_at - driver picked up the passenger
          dropped_off_at - driver dropped off a passenger at destination


timestamp  = Time of event


### You can assume that:

All rides in the data set occurred in San Francisco

All timestamps in the data set are in UTC
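
Because the rides took place in San Francisco while the timestamps are stored in UTC, any analysis by local hour of day needs a time zone conversion first. A minimal sketch, assuming the IANA zone name `America/Los_Angeles` for San Francisco; the variable names are illustrative and the two sample timestamps are taken from the preview of `ride_timestamps.csv`:

```python
import pandas as pd

# Two sample UTC timestamps in the same format as ride_timestamps.csv
utc_times = pd.Series(pd.to_datetime(["2016-06-13 09:39:19", "2016-06-13 09:39:51"]))

# Mark the naive timestamps as UTC, then convert to San Francisco local time
# (assumption: America/Los_Angeles is the right zone for every ride)
local_times = utc_times.dt.tz_localize("UTC").dt.tz_convert("America/Los_Angeles")

print(local_times.dt.hour.tolist())  # [2, 2] -> 09:39 UTC is 02:39 local (PDT)
```

Working in local time matters for features such as "rides during evening rush hour", which would otherwise be shifted by seven or eight hours.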

## Read and Explore all Datasets

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [4]:
driver_ids = pd.read_csv('driver_ids.csv')

ride_ids = pd.read_csv('ride_ids.csv')

ride_timestamps = pd.read_csv('ride_timestamps.csv')

In [9]:
driver_ids.head(3)

| | driver_id | driver_onboard_date |
|---|---|---|
| 0 | 002be0ffdc997bd5c50703158b7c2491 | 2016-03-29 00:00:00 |
| 1 | 007f0389f9c7b03ef97098422f902e62 | 2016-03-29 00:00:00 |
| 2 | 011e5c5dfc5c2c92501b8b24d47509bc | 2016-04-05 00:00:00 |


In [10]:
ride_ids.head(3)

| | driver_id | ride_id | ride_distance | ride_duration | ride_prime_time |
|---|---|---|---|---|---|
| 0 | 002be0ffdc997bd5c50703158b7c2491 | 006d61cf7446e682f7bc50b0f8a5bea5 | 1811 | 327 | 50 |
| 1 | 002be0ffdc997bd5c50703158b7c2491 | 01b522c5c3a756fbdb12e95e87507eda | 3362 | 809 | 0 |
| 2 | 002be0ffdc997bd5c50703158b7c2491 | 029227c4c2971ce69ff2274dc798ef43 | 3282 | 572 | 0 |


In [12]:
ride_timestamps.head(3)

| | ride_id | event | timestamp |
|---|---|---|---|
| 0 | 00003037a262d9ee40e61b5c0718f7f0 | requested_at | 2016-06-13 09:39:19 |
| 1 | 00003037a262d9ee40e61b5c0718f7f0 | accepted_at | 2016-06-13 09:39:51 |
| 2 | 00003037a262d9ee40e61b5c0718f7f0 | arrived_at | 2016-06-13 09:44:31 |


In [25]:
driver_ids.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 937 entries, 0 to 936
Data columns (total 2 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   driver_id            937 non-null    object
 1   driver_onboard_date  937 non-null    object
dtypes: object(2)
memory usage: 14.8+ KB


In [15]:
ride_ids.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193502 entries, 0 to 193501
Data columns (total 5 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   driver_id        193502 non-null  object
 1   ride_id          193502 non-null  object
 2   ride_distance    193502 non-null  int64 
 3   ride_duration    193502 non-null  int64 
 4   ride_prime_time  193502 non-null  int64 
dtypes: int64(3), object(2)
memory usage: 7.4+ MB


In [17]:
ride_timestamps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 970405 entries, 0 to 970404
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   ride_id    970405 non-null  object
 1   event      970405 non-null  object
 2   timestamp  970404 non-null  object
dtypes: object(3)
memory usage: 22.2+ MB


In [30]:
print('Shape of data : ', driver_ids.shape)
print('Unique of driver id : ', len(driver_ids['driver_id'].unique()))
print('Min date : ',driver_ids['driver_onboard_date'].min())
print('Max date : ',driver_ids['driver_onboard_date'].max())

Shape of data :  (937, 2)
Unique of driver id :  937
Min date :  2016-03-28 00:00:00
Max date :  2016-05-15 00:00:00


In [31]:
print('Shape of data : ', ride_ids.shape)
print('Unique of driver id : ', len(ride_ids['driver_id'].unique()))
print('Unique of ride id : ', len(ride_ids['ride_id'].unique()))
ride_ids.describe()

Shape of data :  (193502, 5)
Unique of driver id :  937
Unique of ride id :  193502


| | ride_distance | ride_duration | ride_prime_time |
|---|---|---|---|
| count | 193502.0 | 193502.0 | 193502.0 |
| mean | 6955.218266 | 858.966099 | 17.305893 |
| std | 8929.444606 | 571.375818 | 30.8258 |
| min | -2.0 | 2.0 | 0.0 |
| 25% | 2459.0 | 491.0 | 0.0 |
| 50% | 4015.0 | 727.0 | 0.0 |
| 75% | 7193.0 | 1069.0 | 25.0 |
| max | 724679.0 | 28204.0 | 500.0 |


In [34]:
print('Shape of data : ', ride_timestamps.shape)
print('Unique of ride id : ', len(ride_timestamps['ride_id'].unique()))
ride_timestamps['event'].value_counts()

Shape of data :  (970405, 3)
Unique of ride id :  194081


requested_at      194081
accepted_at       194081
arrived_at        194081
picked_up_at      194081
dropped_off_at    194081
Name: event, dtype: int64

#### A little reminder
          requested_at - passenger requested a ride
          accepted_at - driver accepted a passenger request
          arrived_at - driver arrived at pickup point
          picked_up_at - driver picked up the passenger
          dropped_off_at - driver dropped off a passenger at destination
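
Since `ride_timestamps` stores one row per event, it is often easier to work with one row per ride and one column per event. A minimal sketch of that reshaping on a toy frame; the ride id `ride_a`, the timestamps, and the derived column names are made up for illustration and are not part of the dataset:

```python
import pandas as pd

events = ["requested_at", "accepted_at", "arrived_at", "picked_up_at", "dropped_off_at"]

# Toy long-format data: one hypothetical ride with illustrative timestamps
toy = pd.DataFrame({
    "ride_id": ["ride_a"] * 5,
    "event": events,
    "timestamp": pd.to_datetime([
        "2016-06-13 09:39:19",
        "2016-06-13 09:39:51",
        "2016-06-13 09:44:31",
        "2016-06-13 09:44:49",
        "2016-06-13 09:50:16",
    ]),
})

# Long -> wide: one row per ride, one column per event, in chronological order
wide = toy.pivot(index="ride_id", columns="event", values="timestamp")[events]

# Example derived features (hypothetical names)
wide["pickup_wait_seconds"] = (wide["picked_up_at"] - wide["requested_at"]).dt.total_seconds()
wide["trip_seconds"] = (wide["dropped_off_at"] - wide["picked_up_at"]).dt.total_seconds()

print(wide[["pickup_wait_seconds", "trip_seconds"]])
```

`pivot` raises an error if a ride has the same event twice, so on the real data it may be worth checking for duplicate `(ride_id, event)` pairs first.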

## Data Engineering

### Calculate the Total Cost of Ride

We calculate the cost of each ride using the following assumptions from the Lyft rate card:

* Base Fare $2.00

* Cost per Mile $1.15

* Cost per Minute $0.22

* Service Fee $1.75

* Minimum Fare $5.00

* Maximum Fare $400.00

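We also make assumptions about how the Prime Time rate and the Service Fee are applied, in line with Lyft's actual pricing model as described in many online articles: Prime Time is a percentage surcharge on the base fare plus the distance and time charges, and the Service Fee is added afterward.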

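The cost of each ride is therefore calculated with this formula:

$$
\text{ride\_total\_cost} = \left(\text{base fare} + \text{cost per mile} \times \text{ride\_distance} + \text{cost per minute} \times \text{ride\_duration}\right) \times \left(1 + \frac{\text{ride\_prime\_time}}{100}\right) + \text{service fee}
$$

Here `ride_distance` is converted from meters to miles and `ride_duration` from seconds to minutes before the rates are applied.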

After that, any cost below the Minimum Fare is raised to the Minimum Fare, and any cost above the Maximum Fare is capped at the Maximum Fare.


In [36]:
# Lyft rate card assumptions (USD)
base_fare = 2.00
cost_per_mile = 1.15
cost_per_minute = 0.22
service_fee = 1.75
minimum_fare = 5.00
maximum_fare = 400.00

# Convert meters to miles and seconds to minutes so the units match the rates
ride_miles = ride_ids['ride_distance'] * 0.000621
ride_minutes = ride_ids['ride_duration'] / 60

# Prime Time scales the metered fare; the service fee is added afterward
ride_ids['ride_total_cost'] = ((base_fare + cost_per_mile * ride_miles + cost_per_minute * ride_minutes) *
                               (1 + ride_ids['ride_prime_time'] / 100)) + service_fee

# Clamp each fare to the [minimum_fare, maximum_fare] range
ride_ids['ride_total_cost'] = ride_ids['ride_total_cost'].clip(lower=minimum_fare, upper=maximum_fare)

# Show the first 3 rows of the dataframe
ride_ids.head(3)

| | driver_id | ride_id | ride_distance | ride_duration | ride_prime_time | ride_total_cost |
|---|---|---|---|---|---|---|
| 0 | 002be0ffdc997bd5c50703158b7c2491 | 006d61cf7446e682f7bc50b0f8a5bea5 | 1811 | 327 | 50 | 8.488488 |
| 1 | 002be0ffdc997bd5c50703158b7c2491 | 01b522c5c3a756fbdb12e95e87507eda | 3362 | 809 | 0 | 9.117306 |
| 2 | 002be0ffdc997bd5c50703158b7c2491 | 029227c4c2971ce69ff2274dc798ef43 | 3282 | 572 | 0 | 8.191174 |
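
As a sanity check on the pricing formula, the fare of a single ride can be reproduced with plain arithmetic. The sketch below uses the same rate card values and the same meters-to-miles factor (0.000621) as the cell above; the function name `estimate_fare` is hypothetical and only serves this check:

```python
def estimate_fare(distance_m, duration_s, prime_time_pct):
    """Fare for one ride under the rate card assumptions used in this notebook."""
    # Metered part: base fare + per-mile charge + per-minute charge
    metered = 2.00 + 1.15 * (distance_m * 0.000621) + 0.22 * (duration_s / 60)
    # Prime Time scales the metered part; the service fee is added afterward
    total = metered * (1 + prime_time_pct / 100) + 1.75
    # Clamp to the minimum ($5.00) and maximum ($400.00) fare
    return min(max(total, 5.00), 400.00)

# First ride in ride_ids: 1811 m, 327 s, Prime Time 50
print(round(estimate_fare(1811, 327, 50), 6))  # 8.488488

# A very short hypothetical ride falls back to the minimum fare
print(estimate_fare(-2, 2, 0))  # 5.0
```

The first value matches `ride_total_cost` for row 0, which suggests the vectorized calculation applies the unit conversions and the Prime Time multiplier as intended.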
