
<font color='gray'>

## Data Science Challenge

The purpose of this challenge is to assist us in evaluating candidates for a role in our Product team. We only pass this challenge to candidates that we feel have a solid background and could be a good fit for our team. We appreciate you taking this time to help ensure we’re a good fit for each other.

### Tips
- Include code, graphics and text in a combined output. Tell a story, and let us know very clearly about your thoughts and analytical process.

### Part 1: Experiment design
#### Background
In 2011 AirBnB ran some experiments which showed that when a property featured professional photography, users were much more likely to trust the property and consequently make a booking. So, AirBnB launched a [free professional photography service](https://www.airbnb.com/professional_photography) for all hosts. From inside the listing page, hosts were able to click a link to view more about the service, request a professional photographer, and subsequently (after the photo shoot) have their property profile updated with professional photos. 

The project initially proved to be a success:
- Guests were more likely to book a property that had professional photography.
- Hosts were able to charge more for listings with professional photos.

However, over time this also became a multimillion dollar operation and a challenge to manage across over 200 countries. 

Fast forward to 2016, and some new developments have also helped with building trust:

- 2013: Launch of identity verification for hosts and guests.
- 2014: Launch of double blind reviews (neither host nor guest can see the other’s review), ensuring more honest reviews of properties and hosts. 
- 2015: Huge global PR lift for AirBnB, raising the profile of the company.

An additional interesting development has also been the proliferation of smartphones with powerful and high-quality cameras (+apps) over the last few years, which has made it more possible for hosts to take good quality photos of their property. There is also the opinion that perhaps millennials have come to expect smartphone photos as the norm and are less likely to expect professional photography. 

#### Challenge

Since the professional photography service consumes so many operational and financial resources, AirBnB management are unsure if they should continue. AirBnB management have asked the Data Science team to analyse the impact of the professional photography service in order to determine whether or not they should continue funding the service. 

- Provide full details about how you will run experiments to assess the impact of this service on both hosts and guests. How will you ensure that the experiments are valid and not biased? 

</font>



---------------------------------------------

## Response to Part 1


I will run an A/B test and use several metrics to assess the impact of the professional photos.

**Benchmark**

Average price per night for listings without professional photos.

**Metrics**  
1. Do professional photos increase the click-through rate?
2. Do professional photos increase the conversion (booking) rate?
3. Do professional photos lead to higher profit? (optional)

**Period**  

30 days (based on this [ref](https://medium.com/airbnb-engineering/experiments-at-airbnb-e2db3abf39e7))

**Sample groups**  

The required sample size depends on the control group's conversion rate and click rate. Let's split the sample 50% vs. 50%.  
I'd choose the 5 top cities with the highest demand on AirBnB and randomly assign flats to control or test within each area and price band.
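
To make the dependence on the control rate concrete, here is a minimal sketch of a sample-size estimate for a two-sided two-proportion z-test under the normal approximation. The function name `sample_size_per_group`, the 5% baseline rate, the 1-point minimum lift, and the default significance level and power are all illustrative assumptions, not figures from AirBnB.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p_control, min_lift, alpha=0.05, power=0.8):
    """Approximate observations needed per arm for a two-sided
    two-proportion z-test (normal approximation)."""
    p_test = p_control + min_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_test) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_control * (1 - p_control) + p_test * (1 - p_test))
    ) ** 2
    return ceil(numerator / min_lift ** 2)


# Hypothetical inputs: 5% baseline conversion, detect an absolute lift of 1 point.
n = sample_size_per_group(p_control=0.05, min_lift=0.01)
print(n)
```

The lower the baseline rate or the smaller the lift to detect, the more observations each group needs, which is worth checking before fixing the number of flats per city.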


**Steps**  
Using Barcelona as an example.  
1. Choose a total of 50 flats in L'Eixample with similar nightly prices. 
2. Split them into 25 control flats and 25 test flats, and run the test for 30 days, excluding holidays to avoid bias. 
3. Swap the two groups and run the test for another 30 days to see whether the professional photos have any impact.
4. Besides swapping the two groups, we could also test flats in different price ranges. (I am not sure about this part.)
5. Calculate p-values for the differences in click rate and conversion rate between the groups over the 30 days.
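
The p-value in step 5 could be computed with a pooled two-proportion z-test, sketched below using only the standard library. The helper name `two_proportion_z_test` and the counts are made up for illustration; a real analysis would also have to account for repeated views of the same listing.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up counts for illustration: bookings out of listing views per group.
z, p_value = two_proportion_z_test(success_a=200, n_a=4000, success_b=260, n_b=4000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```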







-----------------------------

<font color='gray'>
    
### Part 2: Result analysis

#### Background
A ride hailing app currently assigns new incoming trips to the _closest_ available vehicle. To compute such distance, the app currently computes haversine distance between the pickup point and each of the available vehicles. We refer to this distance as *linear*. 

However, the expected time to reach A from B in a city is not 100% defined by Haversine distance:
cities are known to be places where huge amount of transport infrastructure (roads, highways, bridges, tunnels) is deployed to increase capacity and reduce average travel time. Interestingly, this heavy investment in infrastructure also implies that bird distance does not work so well as proxy, so the isochrones for travel time from certain location drastically differ from the perfect circle defined by bird distance, as we can see in this example from CDMX where the blue area represents that it is reachable within a 10 min drive. 

![Imgur](https://i.imgur.com/hYXhpiM.png)
 
In addition to this, travel times can be drastically affected by traffic, accidents, road work, and so on. Even if a driver is only 300 m away, he might need to drive for 10 min because of road work on a bridge.

#### Proposal
In order to optimise operations, engineering team has suggested they could query an external real time maps API that not only has roads, but also knows realtime traffic information. We refer to this distance as *road* distance.

In principle this assignment is more efficient and should outperform *linear*. However, the queries to the maps API have a certain cost (per query), add complexity to a critical system within the company, and may reduce its reliability. So the Data Science team has designed an experiment to help engineering decide.

#### Experimental design

The designed experiment is very simple. For a period of 5 days, all trips in 3 cities (Bravos, Pentos and Volantis) have been randomly assigned using *linear* or *road* distance:

* Trips whose *trip_id* starts with digits 0-8 were assigned using *road* distance.
* Trips whose *trip_id* starts with digits 9-f were assigned using *linear* distance.

#### Data description
The collected data is available in [this link](https://www.dropbox.com/s/e3j1pybfz5o3vq9/intervals_challenge.json.gz?dl=0). Each object represents a `vehicle_interval` that contains the following attributes:

* `type`: can be `going_to_pickup`, `waiting_for_rider` or `driving_to_destination`. 
* `trip_id`: uniquely identifies the trip.
* `duration`: how long the interval lasts, in seconds.
* `distance`: how far the vehicle moved in this interval, in meters.
* `city_id`: one of bravos, pentos or volantis.
* `started_at`: when the interval started, UTC Time.
* `vehicle_id`: uniquely identifies the vehicle.
* `rider_id`: uniquely identifies the rider.

#### Example
```
{
  "duration": 857,
  "distance": 5384,
  "started_at": 1475499600.287,
  "trip_id": "c00cee6963e0dc66e50e271239426914",
  "vehicle_id": "52d38cf1a3240d5cbdcf730f2d9a47d6",
  "city_id": "pentos",
  "type": "driving_to_destination"
}
```

#### Challenge
Try to answer the following questions:

1. Should the company move towards *road* distance? What's the max price it would make sense to pay per query? (make all the  assumptions you need, and make them explicit)
2. How would you improve the experimental design? Would you collect any additional data? 

</font>

-------------

## Response to Part 2





In [32]:
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import re
import warnings
warnings.filterwarnings("ignore")

In [2]:
import pandas as pd
# Load the newline-delimited JSON file of vehicle intervals
df = pd.read_json('data/intervals_challenge.json', lines=True)

In [3]:
df.head()

| | duration | distance | started_at | trip_id | vehicle_id | city_id | type |
|---|---|---|---|---|---|---|---|
| 0 | 857 | 5384 | 2016-10-03 13:00:00.286999941 | c00cee6963e0dc66e50e271239426914 | 52d38cf1a3240d5cbdcf730f2d9a47d6 | pentos | driving_to_destination |
| 1 | 245 | 1248 | 2016-10-03 13:00:00.852999926 | 427425e1f4318ca2461168bdd6e4fcbd | 8336b28f24c3e7a1e3d582073b164895 | volantis | going_to_pickup |
| 2 | 1249 | 5847 | 2016-10-03 13:00:01.670000076 | 757867f6d7c00ef92a65bfaa3895943f | 8885c59374cc539163e83f01ed59fd16 | pentos | driving_to_destination |
| 3 | 471 | 2585 | 2016-10-03 13:00:01.841000080 | d09d1301d361f7359d0d936557d10f89 | 81b63920454f70b6755a494e3b28b3a7 | bravos | going_to_pickup |
| 4 | 182 | 743 | 2016-10-03 13:00:01.970000029 | 00f20a701f0ec2519353ef3ffaf75068 | b73030977cbad61c9db55418909864fa | pentos | going_to_pickup |


In [4]:
df.describe()

| | duration | distance | started_at | trip_id | vehicle_id | city_id | type |
|---|---|---|---|---|---|---|---|
| count | 165170.0 | 165170.0 | 165170 | 165170 | 165170 | 165170 | 165170 |
| unique | 3344.0 | 13185.0 | 165012 | 58686 | 4746 | 3 | 3 |
| top | 4.0 | 0.0 | 2016-10-03 13:00:52.447000027 | afacd04e18402f482e950ecc17c9f998 | 6661ee4bee90709e97c50a6bcb5ac682 | pentos | going_to_pickup |
| freq | 2972.0 | 16470.0 | 2 | 10 | 150 | 113684 | 58510 |
| first | | | 2016-10-03 13:00:00.286999941 | | | | |
| last | | | 2016-10-04 20:36:20.473999977 | | | | |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165170 entries, 0 to 165169
Data columns (total 7 columns):
duration      165170 non-null object
distance      165170 non-null object
started_at    165170 non-null datetime64[ns]
trip_id       165170 non-null object
vehicle_id    165170 non-null object
city_id       165170 non-null object
type          165170 non-null object
dtypes: datetime64[ns](1), object(6)
memory usage: 8.8+ MB


### Identify the linear or road distance type

Since we want to compare the two distance types, we first need to label each interval with the type used for its trip.

In [8]:
# Label each interval by the first character of its trip_id:
# 0-8 -> "road", anything else (9-f) -> "linear"

def regex_filter(val):
    mo = re.search(r'^[0-8]', val)
    if mo:
        return "road"
    else:
        return "linear"

df["distance_type"] = df["trip_id"].apply(regex_filter)

In [9]:
df.head()

| | duration | distance | started_at | trip_id | vehicle_id | city_id | type | distance_type |
|---|---|---|---|---|---|---|---|---|
| 0 | 857 | 5384 | 2016-10-03 13:00:00.286999941 | c00cee6963e0dc66e50e271239426914 | 52d38cf1a3240d5cbdcf730f2d9a47d6 | pentos | driving_to_destination | linear |
| 1 | 245 | 1248 | 2016-10-03 13:00:00.852999926 | 427425e1f4318ca2461168bdd6e4fcbd | 8336b28f24c3e7a1e3d582073b164895 | volantis | going_to_pickup | road |
| 2 | 1249 | 5847 | 2016-10-03 13:00:01.670000076 | 757867f6d7c00ef92a65bfaa3895943f | 8885c59374cc539163e83f01ed59fd16 | pentos | driving_to_destination | road |
| 3 | 471 | 2585 | 2016-10-03 13:00:01.841000080 | d09d1301d361f7359d0d936557d10f89 | 81b63920454f70b6755a494e3b28b3a7 | bravos | going_to_pickup | linear |
| 4 | 182 | 743 | 2016-10-03 13:00:01.970000029 | 00f20a701f0ec2519353ef3ffaf75068 | b73030977cbad61c9db55418909864fa | pentos | going_to_pickup | road |


In [10]:
# Check how the intervals are distributed across the two distance types

df["distance_type"].value_counts()

road      93652
linear    71518
Name: distance_type, dtype: int64
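
Because nine prefixes (0-8) map to *road* and only seven (9-f) map to *linear*, the two arms are not expected to be the same size. The sketch below, assuming the first character of `trip_id` is a uniformly distributed hexadecimal digit, compares the expected *road* share with the share implied by the counts above. `assignment_arm` is a hypothetical helper that also rejects unexpected prefixes instead of silently labelling them *linear*.

```python
ROAD_PREFIXES = set("012345678")   # per the experimental design
LINEAR_PREFIXES = set("9abcdef")


def assignment_arm(trip_id):
    """Map a hexadecimal trip_id to its experiment arm by first character."""
    first = trip_id[0].lower()
    if first in ROAD_PREFIXES:
        return "road"
    if first in LINEAR_PREFIXES:
        return "linear"
    raise ValueError(f"unexpected trip_id prefix: {first!r}")


# Expected share of "road" if the first hex character is uniform: 9 of 16 prefixes.
expected_road_share = len(ROAD_PREFIXES) / 16

# Interval counts reported by value_counts() above.
observed_road_share = 93652 / (93652 + 71518)

print(assignment_arm("c00cee6963e0dc66e50e271239426914"))  # linear
print(round(expected_road_share, 4), round(observed_road_share, 4))
```

The observed share is close to 9/16, although these counts are intervals rather than trips, so this is only a rough check.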

### Extract only the "going to pickup" intervals
Since the main question is how the distance type affects the assignment between the pickup point and the available vehicles, I will first focus only on intervals of type "going_to_pickup".

In [11]:
# Count the intervals of each type

df["type"].value_counts()

going_to_pickup           58510
waiting_for_rider         53746
driving_to_destination    52914
Name: type, dtype: int64

In [12]:
# Keep only the going_to_pickup intervals (copy to avoid modifying a view of df)

df_pickup = df[df["type"] == "going_to_pickup"].copy()

In [13]:
df_pickup.describe()

| | duration | distance | started_at | trip_id | vehicle_id | city_id | type | distance_type |
|---|---|---|---|---|---|---|---|---|
| count | 58510.0 | 58510.0 | 58510 | 58510 | 58510 | 58510 | 58510 | 58510 |
| unique | 1851.0 | 4217.0 | 58488 | 58468 | 4745 | 3 | 1 | 2 |
| top | | 0.0 | 2016-10-03 23:23:35.987999916 | afacd04e18402f482e950ecc17c9f998 | a3e0ec5c6ea97b306dbe4fcabd47f2b5 | pentos | going_to_pickup | road |
| freq | 299.0 | 1720.0 | 2 | 9 | 58 | 40064 | 58510 | 33171 |
| first | | | 2016-10-03 13:00:00.852999926 | | | | | |
| last | | | 2016-10-04 20:36:20.473999977 | | | | | |


In [33]:
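# Distances are stored as objects; convert them to integers
df_pickup["distance"] = df_pickup["distance"].astype("int")

In [34]:
# Missing durations are stored as the string "NA"; replace them with 0
df_pickup["duration"] = np.where(df_pickup["duration"] == "NA", 0, df_pickup["duration"])

In [35]:
df_pickup["duration"] = df_pickup["duration"].astype("int")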
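
An alternative to replacing `"NA"` with 0 is to treat those durations as missing, so that they are excluded from averages rather than counted as zero-second intervals. The sketch below uses a toy frame (the values are made up) and `pd.to_numeric` with `errors="coerce"`; it is not the approach used for the figures in this notebook.

```python
import pandas as pd

# Toy frame mimicking the raw column types: numbers stored as strings, with "NA".
toy = pd.DataFrame({"duration": ["245", "NA", "471"], "distance": ["1248", "0", "2585"]})

# errors="coerce" turns unparseable entries such as "NA" into NaN.
toy["duration"] = pd.to_numeric(toy["duration"], errors="coerce")
toy["distance"] = pd.to_numeric(toy["distance"], errors="coerce")

# mean() skips NaN by default, so the missing interval does not pull the average down.
mean_with_nan = toy["duration"].mean()
mean_with_zero = toy["duration"].fillna(0).mean()
print(mean_with_nan, mean_with_zero)
```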

In [36]:
df_pickup.describe()

| | duration | distance |
|---|---|---|
| count | 58510.0 | 58510.0 |
| mean | 298.632559 | 979.436 |
| std | 291.058871 | 10434.48 |
| min | 0.0 | 0.0 |
| 25% | 140.0 | 304.0 |
| 50% | 236.0 | 628.0 |
| 75% | 370.0 | 1067.0 |
| max | 9441.0 | 1218089.0 |


In [38]:
# Average speed in m/s; inf (non-zero distance over zero duration) is replaced with 0
df_pickup["speed"] = (df_pickup["distance"]/df_pickup["duration"]).replace(np.inf, 0)
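
Note that dividing by a zero duration produces `inf` only when the distance is non-zero; `0 / 0` yields `NaN`. A minimal sketch on made-up values, showing one possible way to mark every zero-duration interval as having an unknown speed:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"distance": [1248, 0, 500], "duration": [245, 0, 0]})

# Plain division: 0/0 gives NaN and 500/0 gives inf.
raw_speed = toy["distance"] / toy["duration"]

# Treat every zero-duration interval as "speed unknown" (NaN) rather than 0 m/s,
# so that it is skipped by mean() instead of dragging the average down.
safe_speed = toy["distance"] / toy["duration"].replace(0, np.nan)

print(raw_speed.tolist())
print(safe_speed.tolist())
```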

In [39]:
df_pickup.head()

| | duration | distance | started_at | trip_id | vehicle_id | city_id | type | distance_type | speed |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 245 | 1248 | 2016-10-03 13:00:00.852999926 | 427425e1f4318ca2461168bdd6e4fcbd | 8336b28f24c3e7a1e3d582073b164895 | volantis | going_to_pickup | road | 5.093878 |
| 3 | 471 | 2585 | 2016-10-03 13:00:01.841000080 | d09d1301d361f7359d0d936557d10f89 | 81b63920454f70b6755a494e3b28b3a7 | bravos | going_to_pickup | linear | 5.488323 |
| 4 | 182 | 743 | 2016-10-03 13:00:01.970000029 | 00f20a701f0ec2519353ef3ffaf75068 | b73030977cbad61c9db55418909864fa | pentos | going_to_pickup | road | 4.082418 |
| 5 | 599 | 1351 | 2016-10-03 13:00:02.154000044 | 158e7bc8d42e1d8c94767b00c8f89568 | 126e868fb282852c2fa95d88878686bf | volantis | going_to_pickup | road | 2.255426 |
| 9 | 1525 | 2674 | 2016-10-03 13:00:05.637000084 | d3e6e8fb50c02d66feca2c60830c4fcc | b0906e917dc5cc0bcba190fd80079a74 | bravos | going_to_pickup | linear | 1.753443 |


In [43]:
df_pickup.groupby(["city_id", "distance_type"]).agg(
    avg_duration=("duration", "mean"),
    avg_distance=("distance", "mean"),
    avg_speed=("speed", "mean"),
)

| city_id | distance_type | avg_duration | avg_distance | avg_speed |
|---|---|---|---|---|
| bravos | linear | 550.764774 | 2980.561431 | 379.554993 |
| bravos | road | 573.926288 | 3021.996152 | 102.190215 |
| pentos | linear | 252.109576 | 697.203335 | 4.098022 |
| pentos | road | 251.657042 | 724.663851 | 4.461381 |
| volantis | linear | 327.902783 | 873.317225 | 5.520544 |
| volantis | road | 318.142937 | 868.545417 | 5.28069 |
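
A mean speed of several hundred m/s for bravos suggests the averages are dominated by a few extreme intervals (the earlier summary shows a maximum distance of 1,218,089 m). As a hedged sketch, medians combined with a plausibility filter are less sensitive to such rows. The toy data, the `MAX_PLAUSIBLE_SPEED` cut-off of 40 m/s, and the choice of median are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data standing in for df_pickup; the last row imitates an extreme outlier.
toy = pd.DataFrame({
    "city_id": ["bravos"] * 4,
    "distance_type": ["linear", "linear", "road", "road"],
    "duration": [500, 0, 600, 10],
    "distance": [3000, 0, 3100, 1_000_000],
})

# Leave speed undefined (NaN) when duration is zero instead of forcing it to 0.
toy["speed"] = toy["distance"] / toy["duration"].replace(0, np.nan)

MAX_PLAUSIBLE_SPEED = 40  # m/s, roughly 144 km/h; an assumed cut-off, not from the data
plausible = toy[toy["speed"].isna() | (toy["speed"] <= MAX_PLAUSIBLE_SPEED)]

# Medians are far less affected by the remaining extreme values than means.
summary = plausible.groupby(["city_id", "distance_type"]).agg(
    median_duration=("duration", "median"),
    median_distance=("distance", "median"),
    median_speed=("speed", "median"),
    n_intervals=("duration", "count"),
)
print(summary)
```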
