# Website Traffic Timeseries Forecasting

What is the prediction problem?
* 140k web pages with 550 days of daily page-view history
* set aside the final 55 days (10%) as the test set

Thoughts?
* The series are really spiky; the spikes are probably driven by exogenous events. Can we source data somewhere to capture these events?
* Most series don't seem particularly seasonal, though some probably are.
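
One rough way to put a number on the seasonality hunch is the autocorrelation at a 7-day lag: a page with a weekly rhythm should score close to 1, a noise-dominated page close to 0. The sketch below is only an illustration on synthetic series; `lag_autocorr` is a hypothetical helper and the two example series are made up, not taken from the dataset.

```python
import numpy as np

def lag_autocorr(series, lag):
    """Correlation between a series and itself shifted by `lag` steps, ignoring NaNs."""
    x = np.asarray(series, dtype=float)
    a, b = x[:-lag], x[lag:]
    mask = ~(np.isnan(a) | np.isnan(b))  # keep only pairs where both days are observed
    a, b = a[mask], b[mask]
    if a.size < 2 or a.std() == 0 or b.std() == 0:
        return np.nan  # not enough variation to define a correlation
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic stand-ins for two pages (assumed shapes, not real data)
rng = np.random.default_rng(0)
days = np.arange(550)
weekly = 100 + 30 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 5, days.size)
noisy = 100 + rng.normal(0, 30, days.size)

weekly_score = lag_autocorr(weekly, 7)  # high for the weekly series
noisy_score = lag_autocorr(noisy, 7)    # near zero for pure noise
```

On the real data something like `df.apply(lambda s: lag_autocorr(s, 7))` would give one score per page. A strong trend also inflates this score, so it is a screening heuristic rather than a seasonality test.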

Initial Model Plans?
* Quick ARIMA baseline
* LSTM Multi-task Network
* All models that don't incorporate events might be quite crap

In [None]:
import numpy as np
import itertools
import matplotlib.pyplot as plt
import pandas as pd 
import warnings

from sklearn.model_selection import train_test_split
from statsmodels.tsa.arima.model import ARIMA

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
datadir = '/kaggle/input/web-traffic-time-series-forecasting/'

df = (pd.read_csv(os.path.join(datadir, 'train_1.csv.zip'), index_col='Page')).T

df.index = pd.to_datetime(df.index)

## EDA

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))
df.iloc[:, 6000:6010].rolling(7).mean().plot(ax=ax)

In [None]:
df.max().sort_values(ascending=False)[:20]

## Data Preparation: Train/Test Split

In [None]:
train, test = train_test_split(df, shuffle=False, test_size=.1)

## Baseline Models

### ARIMA

Try and fit an ARIMA to the series and see what we get. 

How baseline is this? It's really baseline... 

In [None]:
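# Hold out the last 10% of one page's training series for validation
fit, validate = train_test_split(train.iloc[:, 5].reset_index(drop=True), test_size=.1, shuffle=False)

best_model = None
best_rmse = np.inf

with warnings.catch_warnings():
    warnings.simplefilter("ignore")

    # product (not permutations) so that orders with repeated values,
    # e.g. (1, 1, 1), are included in the grid
    for p, d, q in itertools.product(range(5), repeat=3):
        try:
            arima = ARIMA(fit, order=(p, d, q)).fit()
        except Exception:
            continue  # skip orders that fail to fit
        preds = arima.forecast(len(validate))

        # Root mean squared error on the validation window
        rmse = np.sqrt(((preds - validate)**2).mean())

        if rmse < best_rmse:
            best_rmse = rmse
            best_model = arima

print(best_rmse)
best_model.summary()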

#### Summary: ARIMA was awful (best RMSE of ~60)

This is partly because forecasts from ARIMA models with few terms and no seasonal component depend heavily on the final few values of the fit set, which here fell to very low levels towards the end.
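
To make that dependence concrete, here is a minimal sketch using the simplest case, an AR(1) process with mean `mu` and coefficient `phi`, whose h-step forecast is `mu + phi**h * (y_T - mu)`. The parameter values are made up for illustration and this is not the model fitted above; it only shows how the whole forecast path is pinned to the last observed value.

```python
import numpy as np

def ar1_forecast(last_value, mu, phi, steps):
    """h-step-ahead forecasts of an AR(1) process for h = 1..steps."""
    h = np.arange(1, steps + 1)
    return mu + phi ** h * (last_value - mu)

# Illustrative parameters (assumed, not estimated from the data)
mu, phi = 100.0, 0.9

low = ar1_forecast(20.0, mu, phi, 10)    # series ended on a dip
high = ar1_forecast(180.0, mu, phi, 10)  # series ended on a spike
```

With `phi` near 1 the forecast stays close to the final value for many steps, so a fit window that ends on a dip produces a forecast that sits well below the mean across a short horizon.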

Instead of bothering with much more ARIMA let's just skip to the good stuff.

## Simple LSTM

How will this work? 

Input features:
* *pageviews*: sequence; standardised independently (might need missing value identification and imputation)
* *median volume*: scalar; 
* *std*: scalar; 
* *country*: categorical;
* *agent*: categorical;
* *attention*: sequence(?); zoom back a year/quarter and add in (already-standardised) values from then; single lagged points might be good enough.

All features are then standardised.
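
A rough sketch of how these features could be assembled for a single page follows. It rests on assumptions that are not confirmed above: that page names follow the pattern `title_project_access_agent`, and that the language prefix of the project can stand in for *country*. The helpers `standardise`, `parse_page` and `lagged` are hypothetical names introduced here, and the view counts are made up.

```python
import numpy as np

def standardise(series):
    """Z-score one page's series with its own mean and std, ignoring NaNs."""
    x = np.asarray(series, dtype=float)
    mu = np.nanmean(x)
    sigma = np.nanstd(x)
    if not np.isfinite(sigma) or sigma == 0:
        sigma = 1.0  # constant series: avoid dividing by zero
    return (x - mu) / sigma

def parse_page(page):
    """Split an assumed 'title_project_access_agent' name from the right,
    since titles may themselves contain underscores."""
    title, project, access, agent = page.rsplit("_", 3)
    return {"title": title, "language": project.split(".")[0],
            "access": access, "agent": agent}

def lagged(x, lag):
    """Values from `lag` steps earlier (lag >= 1), NaN-padded at the start."""
    x = np.asarray(x, dtype=float)
    out = np.full_like(x, np.nan)
    out[lag:] = x[:-lag]
    return out

# Made-up example in the assumed naming pattern, with one missing day
page = "2NE1_zh.wikipedia.org_all-access_spider"
views = np.array([10.0, 12.0, np.nan, 30.0, 8.0, 11.0, 9.0, 14.0])

z = standardise(views)
features = {
    "pageviews": z,                        # standardised sequence
    "median": float(np.nanmedian(views)),  # scalar
    "std": float(np.nanstd(views)),        # scalar
    **parse_page(page),                    # categorical fields
    "lag_7": lagged(z, 7),                 # lagged copy of the standardised sequence
}
```

A year or quarter lag works the same way with `lagged(z, 365)` or `lagged(z, 91)`; the NaN padding at the start would still need masking or imputation before the values reach a network.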

In [None]:
train, test = train_test_split(df.to_numpy(), shuffle=False, test_size=.1)

In [None]:
def 