Build a model, trained on a historical demand dataset, that can forecast demand on a hold-out test dataset. Given all data up to time T, the model should accurately forecast demand for T+1 to T+5, where each interval is 15 minutes.

Provide step-by-step documentation on how to run your code. Our evaluators will run your models on a test dataset.

The given dataset contains normalised historical demand of a city, aggregated spatiotemporally within geohashes and over 15-minute intervals. The dataset spans a two-month period.


- geohash6: geohash level 6. Geohash is a public-domain geocoding system that encodes a geographic location into a short string of letters and digits with arbitrary precision. You are free to use any geohash library to convert geohashes to latitude and longitude or vice versa. One example is https://github.com/hkwi/python-geohash (for Python).
- day: sequential day number; the value indicates order, not a particular day of the month
- timestamp: start time of the 15-minute interval, in the format hour:minute, where hour ranges from 0 to 23 and minute is one of (0, 15, 30, 45)
- demand: aggregated demand normalised to be in the range [0,1]
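
The `day` and `timestamp` columns together identify one 15-minute bucket. A minimal sketch of collapsing them into a single sequential step index, which tends to make lagging and windowing easier. The helper name `to_step_index` and the `first_day=1` default are assumptions for illustration; neither comes from the dataset documentation.

```python
STEPS_PER_DAY = 96  # 24 hours * 4 quarter-hour buckets


def to_step_index(day, timestamp, first_day=1):
    """Map (day, 'hour:minute') to a 0-based count of 15-minute buckets.

    Assumes `day` is an integer counted from `first_day` and `timestamp`
    follows the 'hour:minute' format described above (e.g. '20:0', '6:15').
    """
    hour_str, minute_str = timestamp.split(":")
    hour, minute = int(hour_str), int(minute_str)
    if not (0 <= hour <= 23 and minute in (0, 15, 30, 45)):
        raise ValueError(f"unexpected timestamp: {timestamp!r}")
    return (day - first_day) * STEPS_PER_DAY + hour * 4 + minute // 15


print(to_step_index(1, "0:0"))    # 0
print(to_step_index(18, "20:0"))  # 17 * 96 + 80 = 1712
```

On a loaded frame, the same mapping could be applied row by row, for example over `zip(data['day'], data['timestamp'])`.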


Test dataset details:

1. Timeframe: The test dataset can start at any time after the timeframe of the training dataset. Your model can use features from up to 14 consecutive days of the test dataset, ending at timestamp T, and must predict T+1 to T+5.


2. Geohash coverage: You may assume that the set of geohashes is the same in the training and test datasets. The original geohashes are anonymised (they may not correspond to an existing city), but you may assume that adjacency between geohashes is maintained.
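
Because adjacency is preserved, decoding each geohash6 into coordinates gives a usable spatial feature even though the locations are anonymised. The brief suggests a library such as python-geohash; the block below is instead a dependency-free sketch of the standard geohash decoding scheme, and `decode_geohash` is a hypothetical helper name.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def decode_geohash(geohash):
    """Return the (latitude, longitude) of the centre of a geohash cell."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    is_lon = True  # bits alternate, starting with longitude
    for char in geohash:
        value = BASE32.index(char)
        for shift in range(4, -1, -1):  # 5 bits per character, high bit first
            bit = (value >> shift) & 1
            rng = lon_range if is_lon else lat_range
            mid = (rng[0] + rng[1]) / 2
            if bit:
                rng[0] = mid  # keep the upper half of the interval
            else:
                rng[1] = mid  # keep the lower half of the interval
            is_lon = not is_lon
    return sum(lat_range) / 2, sum(lon_range) / 2


lat, lon = decode_geohash("qp03wc")
print(round(lat, 4), round(lon, 4))
```

Since the geohashes are anonymised, the decoded values are only meaningful relative to one another, for example as distances between cells.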


Submissions will be evaluated by RMSE (root mean squared error) averaged over all (geohash6, 15-minute bucket) pairs.
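
Read literally, the metric pools squared errors over every geohash and time bucket. One way to write it, assuming $N$ is the number of (geohash6, bucket) pairs in the test set, $y_{g,t}$ is the true normalised demand and $\hat{y}_{g,t}$ is the forecast (this notation is an assumption, not the organisers'):

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{(g,t)} \left( y_{g,t} - \hat{y}_{g,t} \right)^{2}}
```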

[Data Source](https://s3-ap-southeast-1.amazonaws.com/grab-aiforsea-dataset/traffic-management.zip)

In [1]:
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt

%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [None]:
# configurations

output_size = 5  # timesteps

In [None]:
# walk-forward validation
steps_per_day = 96
n_min_obs = 7
window_size = 14
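
With these settings, one sample pairs up to `window_size` days of history ending at T with the next `output_size` steps. A minimal sketch on a plain list, assuming a single geohash's demand series with no missing intervals; `make_windows` is a hypothetical helper, not part of `rdforecast`.

```python
def make_windows(series, input_steps, output_steps, stride=1):
    """Slice a 1-D sequence into (history, target) pairs.

    history covers `input_steps` values ending at T;
    target covers T+1 .. T+output_steps.
    """
    pairs = []
    last_start = len(series) - input_steps - output_steps
    for start in range(0, last_start + 1, stride):
        split = start + input_steps
        pairs.append((series[start:split], series[split:split + output_steps]))
    return pairs


steps_per_day = 96
window_size = 14  # days of history
output_size = 5   # steps to forecast

toy_series = list(range(20))  # stand-in for one geohash's demand
toy_pairs = make_windows(toy_series, input_steps=8, output_steps=output_size)
print(len(toy_pairs))  # 8
print(toy_pairs[0])    # ([0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12])
```

For the full setting, `input_steps` would be `window_size * steps_per_day` (1344 steps).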

In [None]:
# Walk-forward split, adapted from:
# https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/
# Series.from_csv was removed in pandas 1.0; read_csv + squeeze replaces it.
series = pd.read_csv('sunspots.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
X = series.values
n_train = 500  # size of the initial training window
n_records = len(X)
for i in range(n_train, n_records):
    # expanding training window, one-step-ahead test
    train, test = X[0:i], X[i:i+1]
    print('train=%d, test=%d' % (len(train), len(test)))
    # Train and evaluate the model here, inside the loop.

In [None]:
from sklearn.model_selection import TimeSeriesSplit
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
tscv = TimeSeriesSplit(n_splits=3, max_train_size=None)
for train_index, test_index in tscv.split(X):
    print("Train:", train_index, "Validation:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# 1. Load Data

In [2]:
from rdforecast import datasets
data = datasets.load_training_data()

'filepath' not given, download data from: https://s3-ap-southeast-1.amazonaws.com/grab-aiforsea-dataset/traffic-management.zip
Data loaded.
N: 4206321
  geohash6  day timestamp    demand
0   qp03wc   18      20:0  0.020072
1   qp03pn   10     14:30  0.024721
2   qp09sw    9      6:15  0.102821


In [23]:
assert data.isna().sum().sum() == 0  # no missing values expected
# hold out the last `test_days` days as a local test set
test_days = 14
total_days = data['day'].max()
cutpoint = total_days - test_days
train = data[data['day'] <= cutpoint]
test = data[data['day'] > cutpoint]
print('Training set: {} days'.format(train['day'].nunique()))
print('Testing set: {} days'.format(test['day'].nunique()))

Training set: 47 days
Testing set: 14 days


# 2. Pre-process

# 3. Feature Engineering
Min-max scaling: rescales each feature to a fixed range such as [0, 1]. This is useful, and often required, preprocessing for linear models and neural networks.

Standardization (z-score): subtracts the mean and divides by the standard deviation. This is also useful, and often required, preprocessing for linear models and neural networks.
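
A minimal sketch of both scalings on a plain list, using only the standard library. The helper names are hypothetical; in practice scikit-learn's `MinMaxScaler` and `StandardScaler` cover the same ground.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


sample = [2.0, 4.0, 6.0, 8.0]
print(min_max_scale(sample))
print(standardize(sample))
```

To avoid leaking information, the statistics (min, max, mean, standard deviation) should be computed on the training split only and then reused unchanged on validation and test data.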

Log-based feature/target: use log-transformed features or a log-transformed target. A log transformation compresses the long right tail of skewed variables such as income and can bring them closer to a normal distribution, which may help models that are sensitive to skew, such as linear models.
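
Normalised demand lies in [0, 1] and may include 0, where a plain logarithm is undefined. A small sketch using `log1p` and its inverse `expm1` instead; the helper names are illustrative only.

```python
import math


def log_transform(values):
    """log(1 + x): defined at x = 0, unlike log(x)."""
    return [math.log1p(v) for v in values]


def inverse_log_transform(values):
    """exp(y) - 1, the inverse of log_transform."""
    return [math.expm1(v) for v in values]


demand = [0.0, 0.02, 0.1, 1.0]
transformed = log_transform(demand)
restored = inverse_log_transform(transformed)
print(transformed)
print(restored)
```

If a model is trained on a transformed target, its predictions need to be mapped back with the inverse before RMSE is computed on the original scale.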



- Time segmentation
- Manually relate temporal features
- Manually tag something domain-specific
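
One common way to relate temporal features by hand is cyclical encoding, so that 23:45 and 0:00 end up close together. The sketch below is an illustration under assumptions: `time_features` is a hypothetical helper, and `day % 7` only serves as a position within the week if the sequential days are consecutive calendar days; the true weekday is not given in the data.

```python
import math

STEPS_PER_DAY = 96


def time_features(day, timestamp):
    """Return cyclical time-of-day and position-in-week features for one row."""
    hour, minute = (int(part) for part in timestamp.split(":"))
    step = hour * 4 + minute // 15  # 0..95 within the day
    tod_angle = 2 * math.pi * step / STEPS_PER_DAY
    dow = day % 7  # assumed weekly position; the actual weekday is unknown
    dow_angle = 2 * math.pi * dow / 7
    return {
        "tod_sin": math.sin(tod_angle),
        "tod_cos": math.cos(tod_angle),
        "dow_sin": math.sin(dow_angle),
        "dow_cos": math.cos(dow_angle),
    }


print(time_features(18, "20:0"))
```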

Normalize the inputs so that their mean is zero.

# Feature Selection, Sampling, and Splitting
- spatial sampling?
- imbalance issue?

# 4. Model Training
- Training score
- Validation score

In [None]:
# model setup

# 5. Evaluation (Backtesting)
- RMSE
- MAPE
- visualize

In [None]:
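from sklearn.metrics import mean_squared_error
# y_true and y_pred are assumed to be array-likes of equal length
y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
assert len(y_true) == len(y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
# rmse = np.sqrt(np.average((y_true - y_pred) ** 2))
# MAPE is undefined where y_true == 0 (demand lies in [0, 1]), so skip those points
nonzero = y_true != 0
mape = np.mean(np.abs((y_true[nonzero] - y_pred[nonzero]) / y_true[nonzero])) * 100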
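
To make the backtesting step concrete, here is a small walk-forward sketch that reports RMSE separately for each of the T+1 to T+5 horizons. The seasonal-naive forecast (repeat the value from one season earlier) is a stand-in baseline chosen for illustration, not the model this notebook intends to train, and both function names are hypothetical.

```python
import math


def seasonal_naive_forecast(history, horizon, season=96):
    """Forecast each future step with the value observed one season earlier."""
    return [history[len(history) - season + (h % season)] for h in range(horizon)]


def backtest_rmse(series, n_train, horizon=5, season=96):
    """Walk forward one step at a time and return RMSE per horizon step."""
    squared_errors = [[] for _ in range(horizon)]
    for t in range(n_train, len(series) - horizon + 1):
        history, actual = series[:t], series[t:t + horizon]
        forecast = seasonal_naive_forecast(history, horizon, season)
        for h in range(horizon):
            squared_errors[h].append((actual[h] - forecast[h]) ** 2)
    return [math.sqrt(sum(errs) / len(errs)) for errs in squared_errors]


# Toy series with an exact period of 4, so the baseline is perfect.
toy = [i % 4 for i in range(40)]
print(backtest_rmse(toy, n_train=8, horizon=5, season=4))  # [0.0, 0.0, 0.0, 0.0, 0.0]
```

Any real model can be dropped in place of `seasonal_naive_forecast`, and comparing against this baseline shows whether the added complexity pays off.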

# 6. Model Selection
- ANN stacking with models?

# 7. Prediction

# 8. Output
- result log for validation
- test prediction