# IDAO: expected time of orders in airports

Airports are special points for taxi service. Every day a lot of people use a taxi to get to the city centre from the airport.

One of important task is to predict how long a driver need to wait an order. It helps to understand what to do. Maybe the driver have to wait near doors, or can drink a tea, or even should drive to city center without an order.

We request you to solve a simple version of this prediction task.

**Task:** predict time of $k$ orders in airport (time since now when you get an order if you are $k$-th in queue), $k$ is one of 5 values (different for every airports).

**Data**
- train: number of order for every minutes for 6 months
- test: every test sample has datetime info + numer of order for every minutes for last 2 weeks

**Submission:** for every airport you should prepare a model which will be evaluated in submission system (code + model files). You can make different models for different airports.

**Evaluation:** for every airport for every $k$ sMAPE will be calculated and averaged. General leaderboard will be calculated via Borda count. 

## Baseline

In [55]:
%pylab inline

import catboost
import pandas as pd
import pickle
import tqdm

Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy
  "\n`%matplotlib` prevents importing * from pylab and numpy"


Let's prepare a model for set2.

# Load train dataset

In [56]:
set_name = 'set1'
path_train_set = '../../data/train/{}.csv'.format(set_name)

data = pd.read_csv(path_train_set)
data.datetime = data.datetime.apply(
    lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
data = data.sort_values('datetime')
data.head()

Unnamed: 0,datetime,num_orders
0,2018-03-01 00:00:00,0
1,2018-03-01 00:01:00,0
2,2018-03-01 00:02:00,0
3,2018-03-01 00:03:00,0
4,2018-03-01 00:04:00,1


Predict position for set2.

In [57]:
target_positions = {
    'set1': [10, 30, 45, 60, 75],
    'set2': [5, 10, 15, 20, 25],
    'set3': [5, 7, 9, 11, 13]
}[set_name]

Some useful constant.

In [58]:
HOUR_IN_MINUTES = 60
DAY_IN_MINUTES = 24 * HOUR_IN_MINUTES
WEEK_IN_MINUTES = 7 * DAY_IN_MINUTES

MAX_TIME = DAY_IN_MINUTES

## Generate train samples with targets

We have only history of orders (count of orders in every minutes) but we need to predict time of k orders since current minutes. So we should calculate target for train set. Also we will make a lot of samples from all set (we can only use two weeks of history while prediction so we can use only two weeks in every train sample).

In [59]:
samples = {
    'datetime': [],
    'history': []}

for position in target_positions:
    samples['target_{}'.format(position)] = []
    
num_orders = data.num_orders.values

To calculate target (minutes before k orders) we are going to use cumulative sum of orders. 

In [60]:
# start after 2 weeks because of history
# finish earlier because of target calculation
for i in range(2 * WEEK_IN_MINUTES,
               len(num_orders) - 2 * DAY_IN_MINUTES):
    
    samples['datetime'].append(data.datetime[i])
    samples['history'].append(num_orders[i-2*WEEK_IN_MINUTES:i])
    
    # cumsum not for all array because of time economy
    cumsum_num_orders = num_orders[i+1:i+1+2*DAY_IN_MINUTES].cumsum()
    for position in target_positions:
        orders_by_positions = np.where(cumsum_num_orders >= position)[0]
        if len(orders_by_positions):
            time = orders_by_positions[0] + 1
        else:
            # if no orders in last days
            time = MAX_TIME
        samples['target_{}'.format(position)].append(time)

Convert to pandas.dataframe. Now we have targets to train and predict.

In [61]:
df = pd.DataFrame.from_dict(samples)
df.head()

Unnamed: 0,datetime,history,target_10,target_30,target_45,target_60,target_75
0,2018-03-15 00:00:00,"[0, 0, 0, 0, 1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, ...",5,18,28,32,42
1,2018-03-15 00:01:00,"[0, 0, 0, 1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, 3, ...",5,19,27,32,42
2,2018-03-15 00:02:00,"[0, 0, 1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, 3, 2, ...",7,20,27,33,43
3,2018-03-15 00:03:00,"[0, 1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, 3, 2, 3, ...",7,21,26,35,42
4,2018-03-15 00:04:00,"[1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, 3, 2, 3, 1, ...",7,20,26,35,42


# Train model

Let's generate simple features.

By time:

In [62]:
df['weekday'] = df.datetime.apply(lambda x: x.weekday())
df['hour'] = df.datetime.apply(lambda x: x.hour)
df['minute'] = df.datetime.apply(lambda x: x.minute)


for i in range(24):
    aa = []
    for row in df.itertuples():
        if row[1].hour == i:
              aa.append(1)
        else: 
              aa.append(0)
    df['hour{}'.format(i)] = aa

for i in range(7):
    aa = []
    for row in df.itertuples():
        if row[1].weekday == i:
              aa.append(1)
        else: 
              aa.append(0)
    df['weekday{}'.format(i)] = aa


In [63]:
df

Unnamed: 0,datetime,history,target_10,target_30,target_45,target_60,target_75,weekday,hour,minute,...,hour21,hour22,hour23,weekday0,weekday1,weekday2,weekday3,weekday4,weekday5,weekday6
0,2018-03-15 00:00:00,"[0, 0, 0, 0, 1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, ...",5,18,28,32,42,3,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2018-03-15 00:01:00,"[0, 0, 0, 1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, 3, ...",5,19,27,32,42,3,0,1,...,0,0,0,0,0,0,0,0,0,0
2,2018-03-15 00:02:00,"[0, 0, 1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, 3, 2, ...",7,20,27,33,43,3,0,2,...,0,0,0,0,0,0,0,0,0,0
3,2018-03-15 00:03:00,"[0, 1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, 3, 2, 3, ...",7,21,26,35,42,3,0,3,...,0,0,0,0,0,0,0,0,0,0
4,2018-03-15 00:04:00,"[1, 2, 0, 1, 1, 4, 0, 1, 1, 1, 1, 3, 2, 3, 1, ...",7,20,26,35,42,3,0,4,...,0,0,0,0,0,0,0,0,0,0
5,2018-03-15 00:05:00,"[2, 0, 1, 1, 4, 0, 1, 1, 1, 1, 3, 2, 3, 1, 1, ...",7,21,26,34,41,3,0,5,...,0,0,0,0,0,0,0,0,0,0
6,2018-03-15 00:06:00,"[0, 1, 1, 4, 0, 1, 1, 1, 1, 3, 2, 3, 1, 1, 1, ...",7,20,25,34,40,3,0,6,...,0,0,0,0,0,0,0,0,0,0
7,2018-03-15 00:07:00,"[1, 1, 4, 0, 1, 1, 1, 1, 3, 2, 3, 1, 1, 1, 2, ...",6,20,24,33,40,3,0,7,...,0,0,0,0,0,0,0,0,0,0
8,2018-03-15 00:08:00,"[1, 4, 0, 1, 1, 1, 1, 3, 2, 3, 1, 1, 1, 2, 1, ...",6,19,23,33,39,3,0,8,...,0,0,0,0,0,0,0,0,0,0
9,2018-03-15 00:09:00,"[4, 0, 1, 1, 1, 1, 3, 2, 3, 1, 1, 1, 2, 1, 1, ...",5,19,23,34,39,3,0,9,...,0,0,0,0,0,0,0,0,0,0


Aggregators by order history with different shift and window size:

In [64]:
SHIFTS = [
    HOUR_IN_MINUTES // 12,
    HOUR_IN_MINUTES // 6,
    HOUR_IN_MINUTES // 3,
    HOUR_IN_MINUTES // 2,
    HOUR_IN_MINUTES,
    DAY_IN_MINUTES,
    DAY_IN_MINUTES * 2,
    WEEK_IN_MINUTES,
    WEEK_IN_MINUTES * 2,
    WEEK_IN_MINUTES * 4
]
WINDOWS = [
    HOUR_IN_MINUTES // 12,
    HOUR_IN_MINUTES // 6,
    HOUR_IN_MINUTES // 3,
    HOUR_IN_MINUTES // 2,
    HOUR_IN_MINUTES,
    DAY_IN_MINUTES,
    DAY_IN_MINUTES * 2,
    WEEK_IN_MINUTES,
    WEEK_IN_MINUTES * 2,
    WEEK_IN_MINUTES * 4
]

In [65]:
for shift in SHIFTS:
    for window in WINDOWS:
        if window >= shift:
            continue
        df['num_orders_{}_{}'.format(shift, window)] = \
            df.history.apply(lambda x: x[-shift : -shift + window].sum())

Train/validation split for time. Let's use last 4 weeks for validation.

In [66]:
df.datetime.min(), df.datetime.max()

(Timestamp('2018-03-15 00:00:00'), Timestamp('2018-08-29 23:59:00'))

In [67]:
df_train = df.loc[df.datetime <= df.datetime.max() - datetime.timedelta(days=28)]
df_test = df.loc[df.datetime > df.datetime.max() - datetime.timedelta(days=28)]

In [68]:
target_cols = ['target_{}'.format(position) for position in target_positions]

y_train = df_train[target_cols]
y_test = df_test[target_cols]

df_train = df_train.drop(['datetime', 'history'] + target_cols, axis=1)
df_test = df_test.drop(['datetime', 'history'] + target_cols, axis=1)

In [69]:
def sMAPE(y_true, y_predict, shift=0):
    return 2 * np.mean(
        np.abs(y_true - y_predict) /
        (np.abs(y_true) + np.abs(y_predict) + shift))

Also we will save models for prediction stage.

In [70]:
model_to_save = {
    'models': {}
}

What is good or bad model? We can compare our model with constant solution. For instance median (optimal solution for MAE).

In [71]:
for position in target_positions:
    model = catboost.CatBoostRegressor(
        iterations=3000, learning_rate=0.3, loss_function='MAE')
    model.fit(
        X=df_train,
        y=y_train['target_{}'.format(position)],
        use_best_model=True,
        eval_set=(df_test, y_test['target_{}'.format(position)]),
        verbose=False)
    y_predict = model.predict(df_test)
    
    print('target_{}'.format(position))
    print('stupid:\t{}'.format(sMAPE(
        y_test['target_{}'.format(position)],
        y_train['target_{}'.format(position)].median())))
    print('model:\t{}'.format(sMAPE(
        y_test['target_{}'.format(position)],
        y_predict)))
    print()
    
    model_to_save['models'][position] = model

target_10
stupid:	0.582173151841312
model:	0.32092150425224913

target_30
stupid:	0.5388110900373055
model:	0.25857678545831303

target_45
stupid:	0.5280161445107636
model:	0.23698090438379554

target_60
stupid:	0.5090823344692388
model:	0.22133911572245954

target_75
stupid:	0.5070227037322823
model:	0.20934441191283895



Our model is better than constant solution. Saving model.

In [72]:
pickle.dump(model_to_save, open('models.pkl', 'wb'))