# Data Science Challenge 2

## Challenge 

#### Background
A ride-hailing app currently assigns each new incoming trip to the _closest_ available vehicle. To measure closeness, the app computes the haversine distance between the pickup point and each available vehicle. We refer to this distance as *linear*.
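As an illustration of the *linear* assignment described above, here is a minimal sketch of the haversine formula and a closest-vehicle lookup. The function names, the Earth radius constant and the coordinates are assumptions made for this example; the app's actual implementation is not part of this document.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (common approximation, assumed here)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))


def closest_vehicle(pickup, vehicles):
    """Return the id of the vehicle with the smallest haversine distance to `pickup`.

    pickup:   (lat, lon) tuple in degrees
    vehicles: dict mapping a vehicle id to its (lat, lon) position
    """
    return min(vehicles, key=lambda vid: haversine_m(*pickup, *vehicles[vid]))


# Hypothetical pickup point and vehicle positions, for illustration only.
pickup = (19.4326, -99.1332)
vehicles = {"v1": (19.4400, -99.1400), "v2": (19.5000, -99.2000)}
print(closest_vehicle(pickup, vehicles))
```

Note that this ranking ignores the road network entirely, which is exactly the limitation the rest of this background discusses.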

However, the expected travel time from A to B in a city is not fully determined by haversine distance. Cities deploy a huge amount of transport infrastructure (roads, highways, bridges, tunnels) to increase capacity and reduce average travel time. This heavy investment in infrastructure also means that straight-line ("as the crow flies") distance is a poor proxy for travel time: the isochrones around a given location differ drastically from the perfect circle that straight-line distance defines, as we can see in this example from CDMX, where the blue area represents the region reachable within a 10-minute drive.

#### Proposal
In order to optimise operations, the engineering team has suggested querying an external real-time maps API that not only knows the road network but also has real-time traffic information. We refer to this distance as *road* distance.

In principle, assignment by *road* distance is more efficient and should outperform *linear*. However, queries to the maps API have a cost (per query), and depending on an external service increases the complexity and reduces the reliability of a critical system within the company. So the Data Science team has designed an experiment to help engineering decide.

#### Experimental design

The designed experiment is very simple. For a period of 5 days, all trips in 3 cities (Bravos, Pentos and Volantis) have been randomly assigned using *linear* or *road* distance:

* Trips whose *trip_id* starts with digits 0-8 were assigned using *road* distance.
* Trips whose *trip_id* starts with digits 9-f were assigned using *linear* distance.
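The split rule above can be expressed as a small helper. This is a sketch: the function name `assignment_method` is an assumption for this example, and it reads "9-f" as the hexadecimal characters `9` and `a` through `f`.

```python
ROAD_PREFIXES = set("012345678")   # first character 0-8 -> road distance
LINEAR_PREFIXES = set("9abcdef")   # first character 9-f -> linear distance


def assignment_method(trip_id: str) -> str:
    """Return 'road' or 'linear' based on the first hex character of trip_id."""
    first = trip_id[0].lower()
    if first in ROAD_PREFIXES:
        return "road"
    if first in LINEAR_PREFIXES:
        return "linear"
    raise ValueError(f"trip_id does not start with a hex character: {trip_id!r}")


print(assignment_method("c00cee6963e0dc66e50e271239426914"))  # linear
print(assignment_method("00f20a701f0ec2519353ef3ffaf75068"))  # road
```

Under this rule 9 of the 16 possible first characters map to *road* and 7 map to *linear*, so if trip ids are uniformly distributed the groups are split roughly 56% / 44% rather than evenly.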

#### Data description
The collected data is available at [this link](https://www.dropbox.com/s/e3j1pybfz5o3vq9/intervals_challenge.json.gz?dl=0). Each object represents a `vehicle_interval` that contains the following attributes:

* `type`: can be `going_to_pickup`, `waiting_for_rider` or `driving_to_destination`. 
* `trip_id`: uniquely identifies the trip.
* `duration`: how long the interval lasted, in seconds.
* `distance`: how far the vehicle moved in this interval, in meters.
* `city_id`: one of `bravos`, `pentos` or `volantis`.
* `started_at`: when the interval started, UTC Time.
* `vehicle_id`: uniquely identifies the vehicle.
* `rider_id`: uniquely identifies the rider.

#### Example
```
{
  "duration": 857,
  "distance": 5384,
  "started_at": 1475499600.287,
  "trip_id": "c00cee6963e0dc66e50e271239426914",
  "vehicle_id": "52d38cf1a3240d5cbdcf730f2d9a47d6",
  "city_id": "pentos",
  "type": "driving_to_destination"
}
```
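To make the units concrete, the record above can be parsed with the standard library. This is only a sketch: it assumes `started_at` is a Unix timestamp in seconds, which is consistent with the "UTC Time" description and with the dates that appear in the dataset.

```python
import json
from datetime import datetime, timezone

raw = """
{
  "duration": 857,
  "distance": 5384,
  "started_at": 1475499600.287,
  "trip_id": "c00cee6963e0dc66e50e271239426914",
  "vehicle_id": "52d38cf1a3240d5cbdcf730f2d9a47d6",
  "city_id": "pentos",
  "type": "driving_to_destination"
}
"""

interval = json.loads(raw)

# Assumption: started_at is seconds since the Unix epoch, in UTC.
started = datetime.fromtimestamp(interval["started_at"], tz=timezone.utc)

# duration is in seconds and distance in meters, so their ratio is meters per second.
speed_ms = interval["distance"] / interval["duration"]

print(started.isoformat())
print(f"{speed_ms:.2f} m/s = {speed_ms * 3.6:.1f} km/h")
```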

#### Challenge
Try to answer the following questions:

1. Should the company move towards *road* distance? What is the maximum price it would make sense to pay per query? (Make all the assumptions you need, and make them explicit.)
2. How would you improve the experimental design? Would you collect any additional data? 
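One possible way to frame the price question is a break-even calculation: the API is worth paying for as long as its cost per trip stays below the value of the time it saves. The sketch below is one such framing, not a prescribed answer. The function name, its parameters and every number in the example are hypothetical and would have to be replaced by estimates from the experiment and from the business.

```python
def max_price_per_query(seconds_saved_per_trip, value_per_vehicle_second, queries_per_trip):
    """Break-even price per maps API query.

    seconds_saved_per_trip:   average reduction in time-to-pickup under road distance (assumed input)
    value_per_vehicle_second: monetary value of one second of vehicle time (assumed input)
    queries_per_trip:         API queries needed to assign one trip, e.g. one per candidate vehicle
    """
    if queries_per_trip <= 0:
        raise ValueError("queries_per_trip must be positive")
    return seconds_saved_per_trip * value_per_vehicle_second / queries_per_trip


# Hypothetical numbers, for illustration only:
# 30 s saved per trip, 0.005 currency units per vehicle-second, 10 candidate vehicles queried.
print(max_price_per_query(30.0, 0.005, 10))
```

If the experiment shows no reduction in pickup time, the break-even price is zero or negative and the extra complexity is not justified under this framing.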



## Solution

In [64]:
import pandas as pd

In [65]:
pwd

'/Users/rebeccaestiarte/Desktop/IronHack/LABS/technical-interview'

In [76]:
data = pd.read_json("data/intervals_challenge.json", lines=True)

In [77]:
# DATA EXPLORATION
data.dtypes

duration              object
distance              object
started_at    datetime64[ns]
trip_id               object
vehicle_id            object
city_id               object
type                  object
dtype: object

In [78]:
data.isnull().values.any()

False

In [79]:
data[data.duration == "NA"].head()
# We have 1157 rows with missing values in both duration and distance. Those are the only missing values I found.

| | duration | distance | started_at | trip_id | vehicle_id | city_id | type |
|---|---|---|---|---|---|---|---|
| 131832 | NA | NA | 2016-10-04 16:16:57.677000046 | e1a5305515f04de1a32a883e752f5da4 | 9eceeaf8c8ad105212d6e8eecda02c4a | pentos | driving_to_destination |
| 133409 | NA | NA | 2016-10-04 16:28:50.309999943 | 13f154ab0c7d17fb2ec203a3a714d6b0 | fce3a43cd5f5a43e2d0b929ad604d3b6 | pentos | going_to_pickup |
| 138211 | NA | NA | 2016-10-04 17:04:50.207000017 | 99dc4314729ae959762a9bc2ba681de6 | b041d487fdc4afcbdc9d3ce23bfbe59a | pentos | going_to_pickup |
| 153973 | NA | NA | 2016-10-04 19:11:57.548000097 | 4edd2ed1f5c5401d117b87d70d694f8b | 909d1f5607f5796963b0142f8536ccad | pentos | driving_to_destination |
| 155979 | NA | NA | 2016-10-04 19:27:43.782999992 | 179ec5f2abe307b008d8f5d4b33d29b4 | bef0644f66f06d5aa5f547c58845d8b7 | pentos | driving_to_destination |


In [80]:
data[data.duration == "NA"].shape[0] / data.shape[0] * 100
# The rows with missing values represent only 0.7% of the whole dataset, so I can drop them without affecting the analysis.

0.7004904038263606

In [85]:
# DATA CLEANING
data = data[data.duration != "NA"].copy() # excluding rows with missing values; .copy() avoids SettingWithCopyWarning
data.duration = data.duration.astype(str).astype(int) # changing datatype duration from object to int
data.distance = data.distance.astype(str).astype(int) # changing datatype distance from object to int
data.trip_id = data.trip_id.astype(str) # making sure every trip_id is a Python str (the dtype is still reported as object)
data.dtypes

duration               int64
distance               int64
started_at    datetime64[ns]
trip_id               object
vehicle_id            object
city_id               object
type                  object
dtype: object

In [122]:
# Note: Series.filter matches index labels, not values, so this returns an empty Series.
data.trip_id.filter(regex=("d"))

Series([], Name: trip_id, dtype: object)

0         c00cee6963e0dc66e50e271239426914
1         427425e1f4318ca2461168bdd6e4fcbd
2         757867f6d7c00ef92a65bfaa3895943f
3         d09d1301d361f7359d0d936557d10f89
4         00f20a701f0ec2519353ef3ffaf75068
                        ...               
165131    0e4ed67de5fc7e16456119bc21143310
165139    99878ca945b1b6d2feef106d0cb9527f
165148    5352a94aa14c7e66783dab604aef5313
165155    f6080061c6425a877e34ea92d536017c
165162    d2f81b419daddb90bd701ab9870f47a3
Name: trip_id, Length: 164013, dtype: object

In [124]:
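import re

# Inside apply, x is a plain Python str (it has no .str accessor), so match with re instead.
# Keeps ids that start with a digit and replaces the rest with a placeholder.
data.trip_id.apply(lambda x: x if re.match(r"\d", x) else "bella ciao")

# Vectorised alternative: extract the first digit found anywhere in each id.
data.trip_id.str.extract(r"(\d)")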

In [98]:
# DATA ANALYSIS

NameError: name 're' is not defined

In [None]:
# Trips whose trip_id starts with digits 0-8 were assigned using road distance.
# Trips whose trip_id starts with digits 9-f were assigned using linear distance.
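
The notebook stops before the comparison itself. As a sketch of one possible next step, the rule above can be turned into an `assignment` column and the `going_to_pickup` durations summarised per city and group. The toy frame below uses made-up values standing in for the cleaned `data`, and the helper name `add_assignment` is an assumption for this example.

```python
import pandas as pd

# Toy rows standing in for the cleaned `data` frame (all values are made up).
toy = pd.DataFrame({
    "trip_id": ["0aa1", "0aa1", "9bb2", "9bb2", "3cc3", "fdd4"],
    "type": ["going_to_pickup", "driving_to_destination",
             "going_to_pickup", "driving_to_destination",
             "going_to_pickup", "going_to_pickup"],
    "city_id": ["pentos", "pentos", "pentos", "pentos", "bravos", "bravos"],
    "duration": [120, 600, 180, 500, 90, 150],
})


def add_assignment(df):
    """Return a copy of df with an 'assignment' column derived from the first trip_id character."""
    out = df.copy()
    is_road = out["trip_id"].str[0].isin(list("012345678"))
    out["assignment"] = is_road.map({True: "road", False: "linear"})
    return out


labelled = add_assignment(toy)

# The assignment rule only affects how the vehicle reaches the rider,
# so compare the going_to_pickup intervals.
pickups = labelled[labelled["type"] == "going_to_pickup"]
summary = pickups.groupby(["city_id", "assignment"])["duration"].agg(["count", "median", "mean"])
print(summary)
```

On the real data, a difference in these summaries would still need a significance test before it could support a decision.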