## Part 1 - Assessing the Impact of the Free Photography Offer

In [8]:
import pandas as pd
import numpy as np
from scipy import stats
import json

**Background**

In 2011 AirBnB ran some experiments which showed that when a property featured professional photography, users were much more likely to trust the property and consequently make a booking. So AirBnB launched a free professional photography service for all hosts. From inside the listing page, hosts were able to click a link to view more about the service, request a professional photographer, and subsequently (after the photo shoot) have their property profile updated with professional photos.

The project initially proved to be a success:

- Guests were more likely to book a property that had professional photography.
- Hosts were able to charge more for listings with professional photos.

However, over time this also became a multimillion-dollar operation and a challenge to manage across more than 200 countries.

Fast forward to 2016, and some new developments have also helped with building trust:

- 2013: Launch of identity verification for hosts and guests.
- 2014: Launch of double blind reviews (neither host nor guest can see the other’s review), ensuring more honest reviews of properties and hosts.
- 2015: Huge global PR lift for AirBnB, raising the profile of the company.

Another interesting development has been the proliferation over the last few years of smartphones with powerful, high-quality cameras (and apps), which has made it easier for hosts to take good-quality photos of their property. There is also the opinion that millennials may have come to expect smartphone photos as the norm and are less likely to expect professional photography.

**Challenge 1**

Since the professional photography service consumes so many operational and financial resources, AirBnB management are unsure if they should continue. AirBnB management have asked the Data Science team to analyse the impact of the professional photography service in order to determine whether or not they should continue funding the service.

Provide full details about how you will run experiments to assess the impact of this service on both hosts and guests. How will you ensure that the experiments are valid and not biased?

**-----------------------------------------------------------------------------------------**

**Step 1: Gather and prepare the data:**

In my opinion, there are two ways of going about this: 

- Compare bookings and prices chronologically, before and after each host made use of the professional photo service.
- Compare the hosts who took up the service against a control group of hosts who did not use it.

In the second scenario, a variety of biases could occur, for example:
- Some of the hosts not using the service could be professional photographers themselves.
- Some of the hosts not using the service could be targeting people travelling on a budget and therefore see no need to take better pictures of a cramped single room.
- Some of the hosts not using the service might not be active hosts: they may rent out only occasionally, or keep the listing up just for fun or by default.

These are just some examples of potential biases in the data. If the control group can be filtered to reduce them, it would also be worth adding a control group to the first scenario, the chronological comparison, in order to detect and remove general trends that are not due to the change in image quality. A good choice of control group would be the hosts who are on the waiting list for the photo service, since they have shown the same interest in the service as the hosts who already received it.

**Step 2: Formulate Hypotheses:**

In order to stay focussed on the most valuable prospective findings, I would formulate the following hypotheses as questions to test:

1. Does the number of bookings on a listing change after the photography service?
2. If this information is accessible to AirBnB: does the number of views on the listing change after the photography service?
3. Did the price of the listings with the photography service rise? Were the hosts able to charge more for a stay?

The tests would have to be balanced by number of listings, location, size of property, type of property (weekend stay, flat for monthly stay, country house, ...) and quality of service measured by ratings.


**Step 3: Exploring the Dataset:**

Explore the dataset in order to get a feel for the data, check that the hypotheses are reasonable, clean out potential biases and test correlations between factors.

**Step 4: Change over time:**

Test whether bookings, views and price change over time for the listings that used the service.
Additionally, check that the findings are statistically significant.
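
A minimal sketch of how such a before/after comparison could be checked for significance, assuming each listing has one measurement before and one after the photo shoot. The function name `paired_change_test` and the booking figures are hypothetical and purely illustrative; a paired t-test is only one possible choice of test.

```python
import numpy as np
from scipy import stats


def paired_change_test(before, after):
    """Paired t-test on per-listing values measured before and after the photo shoot."""
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    # ttest_rel compares the two samples listing by listing (paired design).
    result = stats.ttest_rel(after, before)
    return {
        "mean_change": float(np.mean(after - before)),
        "t_statistic": float(result.statistic),
        "p_value": float(result.pvalue),
    }


# Hypothetical monthly bookings for five listings (illustrative values only).
bookings_before = [4, 6, 3, 5, 7]
bookings_after = [6, 7, 5, 6, 9]

summary = paired_change_test(bookings_before, bookings_after)
print(summary)
```

A paired test is used here because the same listings are observed twice. On its own it cannot separate the effect of the photos from general trends over the same period, which is why a control group is still needed.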

**Step 5: Change in comparison to the control group:**

Check whether the change differs significantly from the change in the control group, or whether both groups show parallel changes that cannot be attributed to the photography, and adjust the estimated effect accordingly.
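
One common way to make this adjustment is a difference-in-differences estimate: the change in the treated group minus the change in the control group. The sketch below assumes a long-format table with hypothetical columns `group`, `period` and `bookings`; the numbers are made up for illustration.

```python
import pandas as pd


def diff_in_diff(df, outcome="bookings"):
    """Return (treated change) minus (control change) in the mean outcome."""
    means = df.groupby(["group", "period"])[outcome].mean()
    treated_change = means.loc[("treated", "after")] - means.loc[("treated", "before")]
    control_change = means.loc[("control", "after")] - means.loc[("control", "before")]
    return float(treated_change - control_change)


# Hypothetical panel: two listings per group, observed before and after.
panel = pd.DataFrame({
    "group": ["treated"] * 4 + ["control"] * 4,
    "period": ["before", "before", "after", "after"] * 2,
    "bookings": [4, 6, 7, 9, 5, 5, 6, 6],
})

effect = diff_in_diff(panel)
print(effect)
```

The estimate is only meaningful if both groups would have followed parallel trends without the service, which is itself an assumption that should be checked on pre-treatment data.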

**Step 6: Draw conclusions**

Draw conclusions and find potential solutions:
- determine the value for AirBnB
- determine the optimal pricing for the professional photography service / research alternatives to optimize the outcome and value add for AirBnB and the hosts.

**Step 7: Visualize and prepare:**

Visualize the results and prepare a presentation including conclusions, findings and proposed solutions for management and other stakeholders.

**-----------------------------------------------------------------------------------------**

## Part 2 - Performance Analysis

**Background**

A ride-hailing app currently assigns new incoming trips to the closest available vehicle. To compute this distance, the app currently computes the haversine distance between the pickup point and each of the available vehicles. We refer to this distance as linear.

However, the expected time to reach B from A in a city is not fully determined by haversine distance: cities contain a huge amount of transport infrastructure (roads, highways, bridges, tunnels) deployed to increase capacity and reduce average travel time. This heavy investment in infrastructure also means that straight-line ("as the crow flies") distance works poorly as a proxy, so the isochrones for travel time from a given location differ drastically from the perfect circle defined by straight-line distance, as we can see in this example from CDMX, where the blue area represents the region reachable within a 10 min drive.

In addition, travel times can be drastically affected by traffic, accidents and road work, so that even if a driver is only 300 m away, they might need to drive for 10 min because of road work on a bridge.
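
For reference, the linear (haversine) distance used by the current assignment can be sketched as follows. This is a generic implementation of the standard haversine formula, not the app's actual code; the function names, the vehicle ids and the coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def closest_vehicle(pickup, vehicles):
    """Return the id of the vehicle with the smallest haversine distance to the pickup."""
    return min(vehicles, key=lambda vid: haversine_m(*pickup, *vehicles[vid]))


# Hypothetical pickup point and vehicle positions as (lat, lon) in degrees.
pickup = (19.43, -99.13)
vehicles = {"a": (19.44, -99.13), "b": (19.50, -99.13)}
print(closest_vehicle(pickup, vehicles))
```

Because the formula only looks at coordinates, it ignores the road network and traffic entirely, which is the limitation described above.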

**Proposal**

In order to optimise operations, the engineering team has suggested querying an external real-time maps API that not only has road data but also knows real-time traffic information. We refer to this distance as road distance.

In principle this assignment is more efficient and should outperform linear. However, the queries to the maps API have a certain cost (per query), and they add complexity to, and may reduce the reliability of, a critical system within the company. So the Data Science team has designed an experiment to help engineering decide.

**Experimental design**

The designed experiment is very simple. For a period of 5 days, all trips in 3 cities (Bravos, Pentos and Volantis) have been randomly assigned using linear or road distance:

- Trips whose trip_id starts with digits 0-8 were assigned using road distance.
- Trips whose trip_id starts with digits 9-f were assigned using linear distance.
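
The assignment rule above can be turned into a group label per trip. A minimal sketch, assuming trip_id is a hexadecimal string as in the data sample; the helper name `assignment_group` and the constants are hypothetical.

```python
ROAD_PREFIXES = set("012345678")   # leading characters 0-8
LINEAR_PREFIXES = set("9abcdef")   # leading characters 9-f


def assignment_group(trip_id):
    """Map a trip_id to 'road' or 'linear' based on its first hex character."""
    first = str(trip_id)[0].lower()
    if first in ROAD_PREFIXES:
        return "road"
    if first in LINEAR_PREFIXES:
        return "linear"
    raise ValueError(f"unexpected trip_id prefix: {first!r}")


# Trip ids taken from the first rows of the dataset.
sample_ids = [
    "c00cee6963e0dc66e50e271239426914",
    "427425e1f4318ca2461168bdd6e4fcbd",
    "00f20a701f0ec2519353ef3ffaf75068",
]
print([assignment_group(t) for t in sample_ids])
```

Under this rule, 9 of the 16 possible leading hex characters map to road distance, so about 56% of trips would be expected in the road group if the prefixes are uniformly distributed (an assumption worth verifying on the data).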


**Challenge**

Try to answer the following questions:

- Should the company move towards road distance? What's the max price it would make sense to pay per query? (make all the assumptions you need, and make them explicit)
- How would you improve the experimental design? Would you collect any additional data?

**-----------------------------------------------------------------------------------------**

1. Examine the Data

In [15]:
intervals = pd.read_json(r'data/intervals_challenge.json', lines=True) 
intervals.head()

| | duration | distance | started_at | trip_id | vehicle_id | city_id | type |
|---|---|---|---|---|---|---|---|
| 0 | 857 | 5384 | 2016-10-03 13:00:00.286999941 | c00cee6963e0dc66e50e271239426914 | 52d38cf1a3240d5cbdcf730f2d9a47d6 | pentos | driving_to_destination |
| 1 | 245 | 1248 | 2016-10-03 13:00:00.852999926 | 427425e1f4318ca2461168bdd6e4fcbd | 8336b28f24c3e7a1e3d582073b164895 | volantis | going_to_pickup |
| 2 | 1249 | 5847 | 2016-10-03 13:00:01.670000076 | 757867f6d7c00ef92a65bfaa3895943f | 8885c59374cc539163e83f01ed59fd16 | pentos | driving_to_destination |
| 3 | 471 | 2585 | 2016-10-03 13:00:01.841000080 | d09d1301d361f7359d0d936557d10f89 | 81b63920454f70b6755a494e3b28b3a7 | bravos | going_to_pickup |
| 4 | 182 | 743 | 2016-10-03 13:00:01.970000029 | 00f20a701f0ec2519353ef3ffaf75068 | b73030977cbad61c9db55418909864fa | pentos | going_to_pickup |


In [17]:
intervals.dtypes

duration              object
distance              object
started_at    datetime64[ns]
trip_id               object
vehicle_id            object
city_id               object
type                  object
dtype: object
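
The dtypes above show duration and distance stored as object rather than as a numeric type, so they would need converting before any arithmetic. A minimal sketch, assuming the columns hold numeric strings with occasional non-numeric placeholders (the actual cause is not visible in the output above); the helper name `coerce_numeric` and the sample frame are hypothetical.

```python
import pandas as pd


def coerce_numeric(df, columns=("duration", "distance")):
    """Return a copy with the given columns converted to numbers; bad values become NaN."""
    out = df.copy()
    for col in columns:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out


# Hypothetical sample with one non-numeric placeholder in duration.
sample = pd.DataFrame({
    "duration": ["857", "245", "NA"],
    "distance": ["5384", "1248", "743"],
})

clean = coerce_numeric(sample)
print(clean.dtypes)
print(clean.isna().sum())
```

Counting the resulting NaN values per column shows how many rows would be lost, which is worth checking per city and per experimental group before dropping anything.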

**Question 1:** Should the company move towards road distance?
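
One possible starting point for this question is to compare the going_to_pickup durations between the two assignment methods. The sketch below assumes a hypothetical `group` column (road or linear, derived from the trip_id prefix as described in the experimental design) and numeric durations; the data is made up, and Welch's t-test is just one reasonable choice of test.

```python
import pandas as pd
from scipy import stats


def compare_pickup_durations(df):
    """Welch's t-test on going_to_pickup durations, road vs linear assignment."""
    pickups = df[df["type"] == "going_to_pickup"]
    road = pickups.loc[pickups["group"] == "road", "duration"].astype(float)
    linear = pickups.loc[pickups["group"] == "linear", "duration"].astype(float)
    # equal_var=False selects Welch's test, which does not assume equal variances.
    result = stats.ttest_ind(road, linear, equal_var=False)
    return {
        "road_mean": float(road.mean()),
        "linear_mean": float(linear.mean()),
        "p_value": float(result.pvalue),
    }


# Hypothetical intervals (durations in the dataset's own units; values are illustrative).
demo = pd.DataFrame({
    "type": ["going_to_pickup"] * 10 + ["driving_to_destination"],
    "group": ["road"] * 5 + ["linear"] * 5 + ["road"],
    "duration": [200, 210, 190, 205, 195, 260, 250, 270, 255, 265, 900],
})

comparison = compare_pickup_durations(demo)
print(comparison)
```

On the real data the durations are likely to be skewed, so a rank-based test or a comparison of medians per city may be more robust than a test on means.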

**Question 2:** What's the max price it would make sense to pay per query? (make all the assumptions you need, and make them explicit)

**Question 3:** How would you improve the experimental design?

**Question 4:** Would you collect any additional data?