# IDAO: expected time of orders in airports

Airports are special points for taxi service. Every day a lot of people use a taxi to get to the city centre from the airport.

One of important task is to predict how long a driver need to wait an order. It helps to understand what to do. Maybe the driver have to wait near doors, or can drink a tea, or even should drive to city center without an order.

We request you to solve a simple version of this prediction task.

**Task:** predict time of $k$ orders in airport (time since now when you get an order if you are $k$-th in queue), $k$ is one of 5 values (different for every airports).

**Data**
- train: number of order for every minutes for 6 months
- test: every test sample has datetime info + numer of order for every minutes for last 2 weeks

**Submission:** for every airport you should prepare a model which will be evaluated in submission system (code + model files). You can make different models for different airports.

**Evaluation:** for every airport for every $k$ sMAPE will be calculated and averaged. General leaderboard will be calculated via Borda count. 

## Baseline

In [1]:
%pylab inline

import catboost
import pandas as pd
import pickle
import tqdm

Populating the interactive namespace from numpy and matplotlib


Let's prepare a model for set2.

# Load train dataset

In [2]:
set_name = 'set3'
path_train_set = '../../data/train/{}.csv'.format(set_name)

data = pd.read_csv(path_train_set)
data.datetime = data.datetime.apply(
    lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
data = data.sort_values('datetime')
data.head()

Unnamed: 0,datetime,num_orders
0,2018-02-01 00:00:00,0
1,2018-02-01 00:01:00,0
2,2018-02-01 00:02:00,0
3,2018-02-01 00:03:00,0
4,2018-02-01 00:04:00,0


Predict position for set2.

In [3]:
target_positions = {
    'set1': [10, 30, 45, 60, 75],
    'set2': [5, 10, 15, 20, 25],
    'set3': [5, 7, 9, 11, 13]
}[set_name]

Some useful constant.

In [4]:
HOUR_IN_MINUTES = 60
DAY_IN_MINUTES = 24 * HOUR_IN_MINUTES
WEEK_IN_MINUTES = 7 * DAY_IN_MINUTES

MAX_TIME = DAY_IN_MINUTES

## Generate train samples with targets

We have only history of orders (count of orders in every minutes) but we need to predict time of k orders since current minutes. So we should calculate target for train set. Also we will make a lot of samples from all set (we can only use two weeks of history while prediction so we can use only two weeks in every train sample).

In [5]:
samples = {
    'datetime': [],
    'history': []}

for position in target_positions:
    samples['target_{}'.format(position)] = []
    
num_orders = data.num_orders.values

To calculate target (minutes before k orders) we are going to use cumulative sum of orders. 

In [6]:
# start after 2 weeks because of history
# finish earlier because of target calculation
for i in range(2 * WEEK_IN_MINUTES,
               len(num_orders) - 2 * DAY_IN_MINUTES):
    
    samples['datetime'].append(data.datetime[i])
    samples['history'].append(num_orders[i-2*WEEK_IN_MINUTES:i])
    
    # cumsum not for all array because of time economy
    cumsum_num_orders = num_orders[i+1:i+1+2*DAY_IN_MINUTES].cumsum()
    for position in target_positions:
        orders_by_positions = np.where(cumsum_num_orders >= position)[0]
        if len(orders_by_positions):
            time = orders_by_positions[0] + 1
        else:
            # if no orders in last days
            time = MAX_TIME
        samples['target_{}'.format(position)].append(time)

Convert to pandas.dataframe. Now we have targets to train and predict.

In [7]:
df = pd.DataFrame.from_dict(samples)
df.head()

Unnamed: 0,datetime,history,target_5,target_7,target_9,target_11,target_13
0,2018-02-15 00:00:00,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",308,418,421,820,1167
1,2018-02-15 00:01:00,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",307,417,420,819,1166
2,2018-02-15 00:02:00,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",306,416,419,818,1165
3,2018-02-15 00:03:00,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",305,415,418,817,1164
4,2018-02-15 00:04:00,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",304,414,417,816,1163


# Train model

Let's generate simple features.

By time:

In [8]:
df['weekday'] = df.datetime.apply(lambda x: x.weekday())
df['hour'] = df.datetime.apply(lambda x: x.hour)
df['minute'] = df.datetime.apply(lambda x: x.minute)

Aggregators by order history with different shift and window size:

In [9]:
SHIFTS = [
    HOUR_IN_MINUTES // 4,
    HOUR_IN_MINUTES // 2,
    HOUR_IN_MINUTES,
    DAY_IN_MINUTES,
    DAY_IN_MINUTES * 2,
    WEEK_IN_MINUTES,
    WEEK_IN_MINUTES * 2]
WINDOWS = [
    HOUR_IN_MINUTES // 4,
    HOUR_IN_MINUTES // 2,
    HOUR_IN_MINUTES,
    DAY_IN_MINUTES,
    DAY_IN_MINUTES * 2,
    WEEK_IN_MINUTES,
    WEEK_IN_MINUTES * 2]

In [10]:
for shift in SHIFTS:
    for window in WINDOWS:
        if window > shift:
            continue
        df['num_orders_{}_{}'.format(shift, window)] = \
            df.history.apply(lambda x: x[-shift : -shift + window].sum())

Train/validation split for time. Let's use last 4 weeks for validation.

In [11]:
df.datetime.min(), df.datetime.max()

(Timestamp('2018-02-15 00:00:00'), Timestamp('2018-07-29 23:59:00'))

In [12]:
df_train = df.loc[df.datetime <= df.datetime.max() - datetime.timedelta(days=28)]
df_test = df.loc[df.datetime > df.datetime.max() - datetime.timedelta(days=28)]

In [13]:
target_cols = ['target_{}'.format(position) for position in target_positions]

y_train = df_train[target_cols]
y_test = df_test[target_cols]

df_train = df_train.drop(['datetime', 'history'] + target_cols, axis=1)
df_test = df_test.drop(['datetime', 'history'] + target_cols, axis=1)

In [14]:
def sMAPE(y_true, y_predict, shift=0):
    return 2 * np.mean(
        np.abs(y_true - y_predict) /
        (np.abs(y_true) + np.abs(y_predict) + shift))

Also we will save models for prediction stage.

In [15]:
model_to_save = {
    'models': {}
}

What is good or bad model? We can compare our model with constant solution. For instance median (optimal solution for MAE).

In [16]:
for position in target_positions:
    model = catboost.CatBoostRegressor(
        iterations=1500, learning_rate=0.69, loss_function='MAE') #don't remember exactly which one this or 2000 0.31
    model.fit(
        X=df_train,
        y=y_train['target_{}'.format(position)],
        use_best_model=True,
        eval_set=(df_test, y_test['target_{}'.format(position)]),
        verbose=False)
    y_predict = model.predict(df_test)
    
    print('target_{}'.format(position))
    print('stupid:\t{}'.format(sMAPE(
        y_test['target_{}'.format(position)],
        y_train['target_{}'.format(position)].median())))
    print('model:\t{}'.format(sMAPE(
        y_test['target_{}'.format(position)],
        y_predict)))
    print()
    
    model_to_save['models'][position] = model

target_5
stupid:	0.6113491702802483
model:	0.4494756126421308

target_7
stupid:	0.5761401162272479
model:	0.3965217739645374

target_9
stupid:	0.5630557259108779
model:	0.37044011939266813

target_11
stupid:	0.5587023576311817
model:	0.34135429288536934

target_13
stupid:	0.5513765273934248
model:	0.3020100382154508



Our model is better than constant solution. Saving model.

In [17]:
pickle.dump(model_to_save, open('models.pkl', 'wb'))