In [1]:
import pandas as pd

# Part 1: Supply Data Analysis

An important part of our business is the supply/demand balance. We can’t control demand, but we can shift some supply to the hours where it is needed, to cover more demand during peaks.

For this task you have sample supply and demand data covering a few weeks in a single city, recorded a few weeks after launch.

**We need to understand:**
- What is the supply-to-demand dynamic, and do they match?
- Where are the hours of oversupply? Can we shift some of them to undersupply hours?

**Needed output:**
- Show which 36 hours in a week are most undersupplied. Show/describe your decision based on sample data.
- 24-hour curve of average supply and demand (to illustrate match/mismatch).
- Visualisation of hours where we lack supply during a weekly period. This one we can send to drivers to show when to go online for extra hours.
- Estimate the number of hours needed to ensure we have a high Coverage Ratio during most peak hours.
- Calculate the levels of guaranteed hourly earnings we can offer to drivers during the 36 weekly hours with highest demand without losing money, and how many extra hours we want to get to capture missed demand.
  - Assume that Finished Rides have an average value of €10 (80% goes to driver, 20% is our revenue).
  - Assume the same level of demand with increased supply; base the estimate on RPH over 3-hour periods.
  - Assume that with extra hours we will capture “missed coverage” or people attributed to “People saw 0 cars” in demand data.
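
The earnings assumptions above reduce to simple arithmetic. The following is a minimal sketch, assuming that "without losing money" means the top-up paid to a driver in a given hour never exceeds our 20% commission for that hour. The function name `break_even_guarantee` and the example RPH value are illustrative and not part of the task.

```python
RIDE_VALUE_EUR = 10.0   # average value of a finished ride (from the task)
DRIVER_SHARE = 0.8      # 80% goes to the driver
PLATFORM_SHARE = 0.2    # 20% is our revenue


def break_even_guarantee(rph: float) -> dict:
    """Per-online-hour figures for a given rides-per-online-hour (RPH)."""
    driver_earnings = rph * RIDE_VALUE_EUR * DRIVER_SHARE
    platform_revenue = rph * RIDE_VALUE_EUR * PLATFORM_SHARE
    # Assumption: the guarantee tops the driver up to a fixed hourly amount,
    # and the top-up is funded only from our commission for that same hour.
    max_guarantee = driver_earnings + platform_revenue
    return {
        "driver_earnings": driver_earnings,
        "platform_revenue": platform_revenue,
        "max_guarantee": max_guarantee,
    }


example = break_even_guarantee(rph=1.5)  # 1.5 is a made-up RPH for illustration
print(example)
```

Under this assumption, at an RPH of 1.5 a driver earns €12 per online hour from rides and our commission is €3, so a guarantee of up to €15 per hour would not lose money for that hour.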

# The data

**Hourly_DriverActivity_1.csv**

In [None]:
data1 = pd.read_csv('Hourly_DriverActivity_1.csv')

Details:
- Hourly data for 5 full weeks, from 2016-11-14 to 2016-12-18
- Real data from a recent launch in a competitive city (2 big apps have been active for years)
- Fields:
  - Date – date + hour for which the row of data is presented
  - Active drivers – number of active drivers (any level of activity) available during the time period
  - Online (h) – total supply hours that were available during the time period
  - Has booking (h) – total hours during which drivers had a client booking (any state)
  - Waiting for booking (h) – total hours drivers spent waiting for a booking
  - Busy (h) – total hours during which drivers were not available to take orders
  - Hours per active driver – average number of hours each driver was online during the time period
  - Rides per online hour – aka RPH – average finished trips per online hour during the period
  - Finished Rides – number of finished trips during the period
- Note the data is sorted with more recent data first
- Note that if a time period has 0 values in all columns, it is skipped (no row)
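
Because `Date` combines a date and an hour and the rows arrive newest first, it can help to parse the column and sort it before any grouping. This is a minimal sketch using three sample rows from the file; the helper name `add_time_columns` and the derived columns `timestamp`, `weekday`, `hour` and `rph_check` are hypothetical. The check on RPH assumes it equals Finished Rides divided by Online (h), which agrees with the sample rows.

```python
import pandas as pd

# Three rows copied from the head of Hourly_DriverActivity_1.csv
sample = pd.DataFrame({
    "Date": ["2016-12-18 23", "2016-12-18 22", "2016-12-18 21"],
    "Online (h)": [18, 20, 25],
    "Rides per online hour": [0.67, 1.40, 0.64],
    "Finished Rides": [12.0, 28.0, 16.0],
})


def add_time_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Parse 'YYYY-MM-DD HH' strings and sort oldest first."""
    out = df.copy()
    out["timestamp"] = pd.to_datetime(out["Date"], format="%Y-%m-%d %H")
    out["weekday"] = out["timestamp"].dt.dayofweek  # Monday = 0, Sunday = 6
    out["hour"] = out["timestamp"].dt.hour
    # Assumed definition of RPH, rounded like the published column
    out["rph_check"] = (out["Finished Rides"] / out["Online (h)"]).round(2)
    return out.sort_values("timestamp").reset_index(drop=True)


parsed = add_time_columns(sample)
print(parsed[["timestamp", "weekday", "hour", "Rides per online hour", "rph_check"]])
```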


In [22]:
data1.head()

Unnamed: 0,Date,Active drivers,Online (h),Has booking (h),Waiting for booking (h),Busy (h),Hours per active driver,Rides per online hour,Finished Rides
0,2016-12-18 23,52,18,6,11,11,0.3,0.67,12.0
1,2016-12-18 22,59,20,11,9,12,0.3,1.4,28.0
2,2016-12-18 21,72,25,7,18,15,0.3,0.64,16.0
3,2016-12-18 20,86,29,7,23,15,0.3,0.52,15.0
4,2016-12-18 19,82,31,14,17,19,0.4,1.16,36.0


**Hourly_OverviewSearch_1.csv**

In [10]:
data2 = pd.read_csv('Hourly_OverviewSearch_1.csv')

Shows how many people saw a car in the app when setting the pickup marker on the map. If you saw a car at one point and did not see a car later, you are counted in both columns in that period.

Details:
- The data is from the same period as Supply data above.
- Fields:
  - Date – date + hour for which the row of data is presented
  - People saw 0 cars (unique) – number of users who did not see a car.
  - People saw +1 cars (unique) – number of users who saw at least one car.
  - Coverage Ratio (unique) – % of users who saw a car.
- Note the data is sorted with more recent data first
- Note that if a time period has 0 values in all columns, it is skipped (no row)
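
The field definitions suggest that Coverage Ratio can be recomputed from the two count columns. This is a small sketch on the first five rows of the file, assuming the ratio is the share of "saw +1 cars" users in the sum of both columns, rounded to a whole percent. The sample rows agree with that assumption, although users counted in both columns mean the sum is not a strict count of unique users.

```python
import pandas as pd

# First five rows of Hourly_OverviewSearch_1.csv
sample = pd.DataFrame({
    "Date": ["2016-12-18 23", "2016-12-18 22", "2016-12-18 21",
             "2016-12-18 20", "2016-12-18 19"],
    "People saw 0 cars (unique)": [9, 29, 5, 13, 12],
    "People saw +1 cars (unique)": [32, 64, 39, 48, 77],
    "Coverage Ratio (unique)": [78, 69, 89, 79, 87],
})

saw_none = sample["People saw 0 cars (unique)"]
saw_some = sample["People saw +1 cars (unique)"]

# Assumed formula: % of searching users who saw at least one car
recomputed = (100 * saw_some / (saw_none + saw_some)).round().astype(int)

print(pd.DataFrame({
    "reported": sample["Coverage Ratio (unique)"],
    "recomputed": recomputed,
}))
```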


In [21]:
data2.head()

Unnamed: 0,Date,People saw 0 cars (unique),People saw +1 cars (unique),Coverage Ratio (unique)
0,2016-12-18 23,9,32,78
1,2016-12-18 22,29,64,69
2,2016-12-18 21,5,39,89
3,2016-12-18 20,13,48,79
4,2016-12-18 19,12,77,87


**Merging dataframes**  
We use a full outer join so that the first pass of the analysis keeps every row that exists in either dataset.

In [55]:
data = pd.merge(data1, data2, on='Date', how='outer')
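
Since either file may skip an hour, an outer join can produce rows that have data from only one side. This is a sketch of how `indicator=True` makes those rows visible; the two small frames and their values are toy data made up for illustration, not taken from the files.

```python
import pandas as pd

# Toy frames (made-up values): each side is missing one hour the other has
supply = pd.DataFrame({
    "Date": ["2016-01-01 03", "2016-01-01 02", "2016-01-01 01"],
    "Online (h)": [7, 7, 9],
})
demand = pd.DataFrame({
    "Date": ["2016-01-01 02", "2016-01-01 01", "2016-01-01 00"],
    "People saw 0 cars (unique)": [3, 8, 5],
})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
merged = pd.merge(supply, demand, on="Date", how="outer", indicator=True)
counts = merged["_merge"].value_counts()

print(merged)
print(counts)
```

Rows marked `left_only` or `right_only` carry `NaN` in the other side's columns, which is worth keeping in mind before averaging.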

# Questions

## What is the supply-to-demand dynamic, and do they match?

In [98]:
data

Unnamed: 0,Date,Active drivers,Online (h),Has booking (h),Waiting for booking (h),Busy (h),Hours per active driver,Rides per online hour,Finished Rides,People saw 0 cars (unique),People saw +1 cars (unique),Coverage Ratio (unique)
0,2016-12-18 23,52,18,6,11,11,0.3,0.67,12.0,9.0,32.0,78.0
1,2016-12-18 22,59,20,11,9,12,0.3,1.40,28.0,29.0,64.0,69.0
2,2016-12-18 21,72,25,7,18,15,0.3,0.64,16.0,5.0,39.0,89.0
3,2016-12-18 20,86,29,7,23,15,0.3,0.52,15.0,13.0,48.0,79.0
4,2016-12-18 19,82,31,14,17,19,0.4,1.16,36.0,12.0,77.0,87.0
...,...,...,...,...,...,...,...,...,...,...,...,...
835,2016-11-14 04,15,6,0,6,6,0.4,0.00,,4.0,4.0,50.0
836,2016-11-14 03,18,7,0,7,7,0.4,0.00,,1.0,2.0,67.0
837,2016-11-14 02,21,7,0,7,9,0.3,0.14,1.0,3.0,6.0,67.0
838,2016-11-14 01,29,9,1,8,11,0.3,0.22,2.0,8.0,8.0,50.0
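
One possible way to get from the merged table to a 24-hour curve is to average supply and demand by hour of day. This is a sketch only, assuming supply is measured by `Online (h)` and demand by the sum of the two "People saw" columns; the helper `hourly_curve` and the `demand` column are hypothetical names. The four rows are copied from the merged output above.

```python
import pandas as pd

# Four rows copied from the merged table shown above
rows = pd.DataFrame({
    "Date": ["2016-12-18 23", "2016-12-18 22", "2016-11-14 02", "2016-11-14 01"],
    "Online (h)": [18, 20, 7, 9],
    "People saw 0 cars (unique)": [9.0, 29.0, 3.0, 8.0],
    "People saw +1 cars (unique)": [32.0, 64.0, 6.0, 8.0],
})


def hourly_curve(df: pd.DataFrame) -> pd.DataFrame:
    """Average supply hours and demand (assumed definition) per hour of day."""
    out = df.copy()
    out["hour"] = pd.to_datetime(out["Date"], format="%Y-%m-%d %H").dt.hour
    # Assumption: demand = everyone who opened the app and set a pickup marker
    out["demand"] = (out["People saw 0 cars (unique)"]
                     + out["People saw +1 cars (unique)"])
    return out.groupby("hour")[["Online (h)", "demand"]].mean()


curve = hourly_curve(rows)
print(curve)
```

On the full merged frame each hour would average over all days in the period, and `hourly_curve(data).plot()` would draw both lines if matplotlib is installed. The two series use different units (hours and people), so the shapes are comparable but the levels are not.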


## Where are the hours of oversupply? Can we shift some of them to undersupply hours?