# Travel Audience Data Science Challenge #

## Goal ##

One of the main problems we face at travel audience is identifying users who will eventually book a trip to an advertised destination. In this challenge, your task is to build a classifier that predicts the conversion likelihood of a user based on previous search events, with emphasis on feature engineering and evaluation.

## Data ##

You are provided with two sample data sets:

- `events.csv.gz` - A sample of events collected from an online travel agency, containing:
  * `ts` - the timestamp of the event
  * `event_type` - either `search` for searches made on the site, or `book` for a conversion, i.e. the user books the flight
  * `user_id` - unique identifier of a user
  * `date_from` - desired start date of the journey
  * `date_to` - desired end date of the journey
  * `origin` - IATA airport code of the origin airport
  * `destination` - IATA airport code of the destination airport
  * `num_adults` - number of adults
  * `num_children` - number of children

- `iata.csv` - containing geo-coordinates of major airports
  * `iata_code` - IATA code of the airport
  * `lat` - latitude in floating point format
  * `lon` - longitude in floating point format

## Tasks ##

Your code needs to do the following:

- Data preparation:
  - Calculate the geographic distance between origins and destinations
  - Convert raw data to a format suitable for the classification task
- Feature engineering:
  - Based on the given input data, compute and justify three features of your choice that are relevant for predicting converters
- Experimental design:
  - Split data into test and training sets in a meaningful way
- Model:
  - A classifier of your choice that predicts the conversion-likelihood of a user

Use your best judgment to define rules and logic to compute each feature. Don't forget to comment your code!

## Deliverables ##

Code and comments that satisfy the tasks and demonstrate your coding style in Python or R, plus instructions on how to run your code.

We'll be evaluating the quality of your code, communication, and general solution design. We won't evaluate the actual performance of your model.

<br>
<br>

### Imports ###

In [119]:
import numpy as np
import pandas as pd
from haversine import haversine

In [137]:
events = pd.read_csv('events.csv')

In [138]:
iata = pd.read_csv('iata.csv')

In [139]:
events.head()

Unnamed: 0,ts,event_type,user_id,date_from,date_to,origin,destination,num_adults,num_children
0,2017-04-27 11:06:51,search,60225f,2017-06-01,2017-06-07,PAR,NYC,6,1
1,2017-04-27 20:15:27,book,e5d69e,2017-08-12,2017-09-02,FRA,WAS,3,1
2,2017-04-27 23:03:43,book,f953f0,2017-10-08,2017-10-11,BER,CGN,2,0
3,2017-04-27 15:17:50,book,794d35,2017-04-28,2017-05-01,BER,BCN,1,0
4,2017-04-27 22:51:57,book,ca4f94,2017-05-16,2017-05-22,DEL,BKK,4,0


In [140]:
events = events.merge(iata, how='left', left_on='origin', right_on='iata_code')

In [141]:
events.rename(columns={'lat': 'origin_lat', 'lon': 'origin_lon'}, inplace=True)
del events['iata_code']

In [142]:
events = events.merge(iata, how='left', left_on='destination', right_on='iata_code')

In [143]:
events.rename(columns={'lat': 'dest_lat', 'lon': 'dest_lon'}, inplace=True)
del events['iata_code']
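
Because both merges use `how='left'`, any airport code that is missing from `iata.csv` ends up with `NaN` coordinates, and the distance for that row becomes `NaN` as well. The sketch below shows one way to do the two merges in a loop and list the unmatched rows. The toy frames, airport codes and coordinates are made up for illustration, and `validate='many_to_one'` is an added safeguard that assumes each code appears only once in the lookup table.

```python
import pandas as pd

# Toy stand-ins for the real frames; codes and coordinates are made up.
toy_events = pd.DataFrame({'origin': ['AAA', 'BBB'],
                           'destination': ['BBB', 'ZZZ']})
toy_iata = pd.DataFrame({'iata_code': ['AAA', 'BBB'],
                         'lat': [10.0, 20.0],
                         'lon': [1.0, 2.0]})

merged = toy_events
for side, prefix in [('origin', 'origin'), ('destination', 'dest')]:
    # Rename the lookup columns up front so no cleanup is needed after the merge.
    coords = toy_iata.rename(columns={'iata_code': side,
                                      'lat': f'{prefix}_lat',
                                      'lon': f'{prefix}_lon'})
    # validate raises if a code occurs more than once in the lookup table,
    # which would otherwise silently duplicate event rows.
    merged = merged.merge(coords, how='left', on=side, validate='many_to_one')

# Rows whose origin or destination code was not found in the lookup table.
unmatched = merged[merged[['origin_lat', 'dest_lat']].isna().any(axis=1)]
```

On the real data, inspecting `unmatched` before computing distances shows how many events would have to be dropped or imputed.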

In [144]:
events.columns

Index(['ts', 'event_type', 'user_id', 'date_from', 'date_to', 'origin',
       'destination', 'num_adults', 'num_children', 'origin_lat', 'origin_lon',
       'dest_lat', 'dest_lon'],
      dtype='object')

In [145]:
events['event_type'].value_counts()

search    45198
book       1809
Name: event_type, dtype: int64

In [146]:
events['trip_len'] = (pd.to_datetime(events['date_to']) - pd.to_datetime(events['date_from'])).dt.days

In [147]:
events['travel_dist'] = events.apply(lambda x: haversine((x['origin_lat'], x['origin_lon']), (x['dest_lat'], x['dest_lon']), unit='km'), axis=1)
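
The row-wise `apply` above calls `haversine` once per event. As an alternative, the haversine formula can be evaluated on whole columns with NumPy. This is a minimal sketch, not the method used in this notebook: `haversine_km` is a hypothetical helper, and the mean Earth radius constant is an assumption.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius in kilometres


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars, arrays or pandas Series of degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula: a is the squared half-chord length between the points.
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Two points on the equator, half the globe apart: half the circumference.
half_equator = haversine_km(0.0, 0.0, 0.0, 180.0)

# Works element-wise on arrays; the second pair is the same point twice.
dists = haversine_km(np.array([0.0, 52.0]), np.array([0.0, 13.0]),
                     np.array([0.0, 52.0]), np.array([90.0, 13.0]))
```

Applied to this notebook, the call would be `haversine_km(events['origin_lat'], events['origin_lon'], events['dest_lat'], events['dest_lon'])`; rows with missing coordinates yield `NaN`.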

In [149]:
events.sort_values('ts',inplace=True)

In [150]:
events.reset_index(inplace=True)

In [151]:
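# One column per event type (num_book, num_search), holding the user_id on rows of that type.
df1 = events.set_index('event_type', append=True)['user_id'].unstack().add_prefix('num_')
# Within each column, number every user's events of that type 1, 2, 3, ... in row (time) order.
df1 = pd.concat([df1.dropna(subset=[c]).groupby(c).cumcount().add(1)
                 for c in df1.columns], axis=1, keys=df1.columns)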

In [152]:
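# Carry each user's running counts forward onto rows of the other event type; 0 before the first event.
events = events.join(df1.groupby(events['user_id']).ffill().fillna(0).astype(int))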
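
The two cells above build running per-user counts of `search` and `book` events, where each count includes the current row. A shorter formulation that should give the same counters, sketched here on a made-up toy frame that is already sorted by time, is a grouped cumulative sum over an indicator column:

```python
import pandas as pd

# Toy events, assumed to be sorted by timestamp already.
toy = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u1', 'u1', 'u2'],
    'event_type': ['search', 'search', 'search', 'book', 'book'],
})

for kind in ['search', 'book']:
    # 1 on rows of this event type, 0 elsewhere.
    is_kind = (toy['event_type'] == kind).astype(int)
    # Running total per user, including the current row.
    toy[f'num_{kind}'] = is_kind.groupby(toy['user_id']).cumsum()
```

For user `u1` this yields `num_search` values 1, 2, 2 and `num_book` values 0, 0, 1 across that user's three rows.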

In [153]:
events

Unnamed: 0,index,ts,event_type,user_id,date_from,date_to,origin,destination,num_adults,num_children,origin_lat,origin_lon,dest_lat,dest_lon,trip_len,travel_dist,num_book,num_search
0,3919,2017-04-18 04:41:09,search,10fc25,2017-05-05,2017-05-18,MRS,MIA,1,0,-85.077306,-103.303183,-41.495320,-80.770338,13.0,4891.279315,0,1
1,3689,2017-04-18 06:08:06,search,2c3d94,2017-05-30,2017-06-02,FRA,BCN,2,0,-88.026756,-78.978109,27.961273,166.863410,3.0,13204.814377,0,1
2,3232,2017-04-18 06:11:20,search,2c3d94,2017-05-30,2017-06-02,FRA,BCN,2,0,-88.026756,-78.978109,27.961273,166.863410,3.0,13204.814377,0,2
3,3523,2017-04-18 07:18:38,search,a8590b,2017-04-19,2017-05-24,HAM,BCN,1,0,-65.797210,29.715909,27.961273,166.863410,35.0,14886.581651,0,1
4,3092,2017-04-18 07:40:59,search,a8590b,2017-04-19,2017-04-22,HAM,BCN,1,0,-65.797210,29.715909,27.961273,166.863410,3.0,14886.581651,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47002,42340,2017-05-02 04:01:17,search,a51dd0,2017-05-03,2017-05-05,FRA,IST,1,0,-88.026756,-78.978109,80.704313,-123.580997,2.0,18815.411765,0,1
47003,41142,2017-05-02 04:01:33,search,ccdb92,2017-09-24,2017-10-16,ORY,OPO,4,0,63.232777,-98.270180,28.342015,26.149790,22.0,8725.960760,0,1
47004,45117,2017-05-02 04:02:39,search,f82db9,2017-10-18,2017-11-03,DUS,HNL,1,0,41.220876,147.690852,85.156499,107.553020,16.0,5021.266807,0,2
47005,41018,2017-05-02 04:06:18,search,51e763,2017-05-11,2017-05-12,PAR,VIE,2,0,-85.864998,-73.578576,-5.985141,-139.866217,1.0,9158.733291,0,2


Each event is either a `search` or a `book`, so the events for a given user and journey (origin and destination pair), ordered by time, form a sequence. The last event of each sequence tells us whether that sequence ended in a booking.

In [157]:
events_grouped = events.groupby(['user_id', 'origin', 'destination']).tail(1)
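
To turn the last event of each sequence into a classification target, a binary label can be derived from its `event_type`. The sketch below uses a made-up toy frame, and the column name `converted` is a hypothetical choice rather than something defined in this notebook.

```python
import pandas as pd

# Toy events: u1 searches and then books BER-BCN, and only searches BER-CGN.
toy = pd.DataFrame({
    'ts': pd.to_datetime(['2017-04-18 04:00', '2017-04-18 05:00',
                          '2017-04-18 06:00', '2017-04-18 07:00']),
    'user_id': ['u1', 'u1', 'u2', 'u1'],
    'origin': ['BER', 'BER', 'FRA', 'BER'],
    'destination': ['BCN', 'BCN', 'IST', 'CGN'],
    'event_type': ['search', 'book', 'search', 'search'],
})

# Last event per (user, origin, destination) sequence, in time order.
last_events = (
    toy.sort_values('ts')
       .groupby(['user_id', 'origin', 'destination'])
       .tail(1)
       .copy()
)

# 1 if the sequence ended in a booking, 0 otherwise.
last_events['converted'] = (last_events['event_type'] == 'book').astype(int)
```

Two caveats apply to this labelling. A user who searches the same route again after booking would be labelled 0, so checking whether any event in the sequence is a `book` may be a safer rule. Also, `num_book` on the last row already counts the booking itself, so it would leak the label if used as a feature without shifting.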