# Insights from Failed Orders

This data project has been used as a take-home assignment in the recruitment process for the data science positions at Gett.

Gett, previously known as GetTaxi, is an Israeli-developed technology platform solely focused on corporate Ground Transportation Management (GTM). They have an application where clients can order taxis, and drivers can accept their rides (offers). At the moment, when the client clicks the Order button in the application, the matching system searches for the most relevant drivers and offers them the order. In this task, we would like to investigate some matching metrics for orders that did not complete successfully, i.e., the customer didn't end up getting a car.

### Assignment
Please complete the following tasks.

- Build the distribution of orders by reason for failure: cancellations before and after driver assignment, and reasons for order rejection. Analyse the resulting plot. Which category has the highest number of orders?
- Plot the distribution of failed orders by hour. Do certain hours have an abnormally high proportion of one category or another? Which hours have the most failed orders? How can this be explained?
- Plot the average time to cancellation, with and without a driver assigned, by hour. If there are any outliers in the data, it would be better to remove them. Can we draw any conclusions from this plot?
- Plot the distribution of average ETA by hour. How can this plot be explained?

*BONUS Hexagons. Using the h3 and folium packages, calculate how many size 8 hexes contain 80% of all orders from the original data sets and visualise the hexes, colouring them by the number of fails on the map.*

### Data Description
We have two data sets, data_orders and data_offers, both stored in CSV format. The data_orders data set contains the following columns:

- order_datetime - time of the order
- origin_longitude - longitude of the order
- origin_latitude - latitude of the order
- m_order_eta - time before order arrival
- order_gk - order number
- order_status_key - status, an enumeration consisting of the following mapping:
    - 4 - cancelled by client
    - 9 - cancelled by system, i.e., a reject
- is_driver_assigned_key - whether a driver has been assigned
- cancellations_time_in_seconds - how many seconds passed before cancellation
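
The two coded columns are easier to read once they are mapped to text labels. The following is a minimal sketch on a few hand-typed rows; the label strings and the new column names `order_status` and `driver_assigned` are illustrative choices, not part of the data set.

```python
import pandas as pd

# A few hand-typed rows in the shape described above (illustrative sample).
sample = pd.DataFrame({
    "order_gk": [3000583041974, 3000583116437, 3000583140877],
    "order_status_key": [4, 4, 9],
    "is_driver_assigned_key": [1, 0, 0],
})

# Label text is an assumption; only the codes 4 and 9 come from the description.
STATUS_LABELS = {4: "cancelled by client", 9: "cancelled by system"}
DRIVER_LABELS = {0: "no driver assigned", 1: "driver assigned"}

# Map the integer codes to readable labels in new columns.
sample["order_status"] = sample["order_status_key"].map(STATUS_LABELS)
sample["driver_assigned"] = sample["is_driver_assigned_key"].map(DRIVER_LABELS)

print(sample[["order_gk", "order_status", "driver_assigned"]])
```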

The data_offers data set is a simple map with 2 columns:

- order_gk - order number, associated with the same column from the orders data set
- offer_id - ID of an offer
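
Since data_offers links each offer_id to an order_gk, one possible use is counting how many offers each order received. The sketch below uses toy frames with made-up IDs, and `n_offers` is a hypothetical column name introduced only for illustration.

```python
import pandas as pd

# Toy data with made-up IDs, standing in for data_orders and data_offers.
orders = pd.DataFrame({"order_gk": [1, 2, 3]})
offers = pd.DataFrame({"order_gk": [1, 1, 2], "offer_id": [10, 11, 12]})

# Count distinct offers per order.
offers_per_order = (
    offers.groupby("order_gk")["offer_id"]
    .nunique()
    .rename("n_offers")
    .reset_index()
)

# A left join keeps orders that never received an offer; they get 0.
orders_with_offers = orders.merge(offers_per_order, on="order_gk", how="left")
orders_with_offers["n_offers"] = orders_with_offers["n_offers"].fillna(0).astype(int)

print(orders_with_offers)
```

A left join is used here so that orders without any offer are not silently dropped from the result.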

### Practicalities
Make sure that the solution reflects your entire thought process, including the preparation of data. How the code is structured matters more than the final result or plot.

In [1]:
# import packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [4]:
# read data
data_offer = pd.read_csv("./datasets/data_offers.csv")
data_order = pd.read_csv("./datasets/data_orders.csv")


### Data description

In [14]:
# data_offer
data_offer.describe()

| | order_gk | offer_id |
|---|---|---|
| count | 334363.0 | 334363.0 |
| mean | 3000602000000.0 | 300051500000.0 |
| std | 24316380.0 | 527682.1 |
| min | 3000551000000.0 | 300050600000.0 |
| 25% | 3000585000000.0 | 300051100000.0 |
| 50% | 3000596000000.0 | 300051600000.0 |
| 75% | 3000625000000.0 | 300052000000.0 |
| max | 3000633000000.0 | 300052400000.0 |


In [15]:
data_offer.head()

| | order_gk | offer_id |
|---|---|---|
| 0 | 3000579625629 | 300050936206 |
| 1 | 3000627306450 | 300052064651 |
| 2 | 3000632920686 | 300052408812 |
| 3 | 3000632771725 | 300052393030 |
| 4 | 3000583467642 | 300051001196 |


In [19]:
data_offer.shape

(334363, 2)

In [20]:
# data_order
data_order.describe()

| | origin_longitude | origin_latitude | m_order_eta | order_gk | order_status_key | is_driver_assigned_key | cancellations_time_in_seconds |
|---|---|---|---|---|---|---|---|
| count | 10716.0 | 10716.0 | 2814.0 | 10716.0 | 10716.0 | 10716.0 | 7307.0 |
| mean | -0.964323 | 51.450541 | 441.415423 | 3000598000000.0 | 5.590612 | 0.262598 | 157.892021 |
| std | 0.022818 | 0.011984 | 288.006379 | 23962610.0 | 2.328845 | 0.440066 | 213.366963 |
| min | -1.066957 | 51.399323 | 60.0 | 3000550000000.0 | 4.0 | 0.0 | 3.0 |
| 25% | -0.974363 | 51.444643 | 233.0 | 3000583000000.0 | 4.0 | 0.0 | 45.0 |
| 50% | -0.966386 | 51.451972 | 368.5 | 3000595000000.0 | 4.0 | 0.0 | 98.0 |
| 75% | -0.949605 | 51.456725 | 653.0 | 3000623000000.0 | 9.0 | 1.0 | 187.5 |
| max | -0.867088 | 51.496169 | 1559.0 | 3000633000000.0 | 9.0 | 1.0 | 4303.0 |


In [21]:
data_order.head()

| | order_datetime | origin_longitude | origin_latitude | m_order_eta | order_gk | order_status_key | is_driver_assigned_key | cancellations_time_in_seconds |
|---|---|---|---|---|---|---|---|---|
| 0 | 18:08:07 | -0.978916 | 51.456173 | 60.0 | 3000583041974 | 4 | 1 | 198.0 |
| 1 | 20:57:32 | -0.950385 | 51.456843 | NaN | 3000583116437 | 4 | 0 | 128.0 |
| 2 | 12:07:50 | -0.96952 | 51.455544 | 477.0 | 3000582891479 | 4 | 1 | 46.0 |
| 3 | 13:50:20 | -1.054671 | 51.460544 | 658.0 | 3000582941169 | 4 | 1 | 62.0 |
| 4 | 21:24:45 | -0.967605 | 51.458236 | NaN | 3000583140877 | 9 | 0 | NaN |


In [22]:
data_order.shape

(10716, 8)
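
The `describe()` output above reports only 2814 non-null values for m_order_eta and 7307 for cancellations_time_in_seconds, against 10716 rows. In the five rows returned by `head()`, the ETA is missing exactly where no driver was assigned, and the cancellation time is missing on the system reject. Whether that pattern holds for the whole data set would need to be checked; the following sketch shows one way to do it, run here on those five rows only.

```python
import numpy as np
import pandas as pd

# The five rows printed by data_order.head(), restricted to the relevant columns.
sample = pd.DataFrame({
    "m_order_eta": [60.0, np.nan, 477.0, 658.0, np.nan],
    "order_status_key": [4, 4, 4, 4, 9],
    "is_driver_assigned_key": [1, 0, 1, 1, 0],
    "cancellations_time_in_seconds": [198.0, 128.0, 46.0, 62.0, np.nan],
})

# Number of missing values per column.
missing = sample.isna().sum()
print(missing)

# Share of rows with a missing ETA, split by driver assignment.
eta_missing_by_driver = (
    sample["m_order_eta"].isna()
    .groupby(sample["is_driver_assigned_key"])
    .mean()
)
print(eta_missing_by_driver)
```

On the full data, replacing `sample` with `data_order` would show whether missing ETAs line up with unassigned drivers across all 10716 orders.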

### Task-1
- Build the distribution of orders by reason for failure: cancellations before and after driver assignment, and reasons for order rejection.
- Analyse the resulting plot.
- Which category has the highest number of orders?
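
One possible way to approach this task is to combine order_status_key and is_driver_assigned_key into a single category and count orders per category. The sketch below is an assumption about how this could be done, not the notebook's own solution. The helper names `failure_distribution` and `plot_failure_distribution` and the label strings are hypothetical, and the sample holds only the five rows shown by `head()`, so its counts say nothing about the full data set.

```python
import pandas as pd

# Label text is illustrative; only the numeric codes come from the data description.
STATUS_LABELS = {4: "client cancel", 9: "system reject"}
DRIVER_LABELS = {0: "no driver", 1: "driver assigned"}


def failure_distribution(orders: pd.DataFrame) -> pd.Series:
    """Count orders per (status, driver assignment) category, largest first."""
    category = (
        orders["order_status_key"].map(STATUS_LABELS)
        + ", "
        + orders["is_driver_assigned_key"].map(DRIVER_LABELS)
    )
    return category.value_counts()


def plot_failure_distribution(counts: pd.Series):
    """Draw the counts as a bar chart and return the matplotlib Axes."""
    import matplotlib.pyplot as plt

    ax = counts.plot(kind="bar", rot=30)
    ax.set_xlabel("failure category")
    ax.set_ylabel("number of orders")
    plt.tight_layout()
    return ax


# The five rows printed by data_order.head().
sample = pd.DataFrame({
    "order_status_key": [4, 4, 4, 4, 9],
    "is_driver_assigned_key": [1, 0, 1, 1, 0],
})

counts = failure_distribution(sample)
print(counts)
```

In the notebook, `plot_failure_distribution(failure_distribution(data_order))` would draw the bar chart for all orders, and `counts.idxmax()` would name the largest category.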