### Simple Time Series Methods

This notebook demonstrates several simple time series methods from the statsmodels package and compares the predictions they generate.

In [None]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import timeit
start_time = timeit.default_timer()
train = pd.read_csv('../input/train_2.csv')
elapsed = timeit.default_timer() - start_time
print("Time to load data: ", round(elapsed, 2), "s")
print("Shape of Data: ", train.shape)

In [None]:
train.head()

To run the time series models, I need the data transposed so that each article is a column:

In [None]:
train = train.transpose()
train.head(2)

Promote the first row to the column header:

In [None]:
new_header = train.iloc[0]
train = train[1:]
train.columns = new_header
train.head(3)
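
The transpose and header steps above can also be done in one go by making the page identifier the index before transposing. The following is a minimal sketch on toy data; the column name `Page` and all values are assumptions for illustration, not taken from the actual file. It also parses the date labels into a daily `DatetimeIndex`, which date-based prediction calls generally expect.

```python
import pandas as pd

# Toy stand-in for the wide traffic table. The column name "Page" and the
# values are assumptions for illustration, not read from train_2.csv.
wide = pd.DataFrame({
    "Page": ["article_a", "article_b"],
    "2015-07-01": [10.0, 3.0],
    "2015-07-02": [12.0, None],
    "2015-07-03": [11.0, 5.0],
})

# One column per article, one row per date.
by_date = wide.set_index("Page").T

# Parse the date labels so the rows carry a real daily DatetimeIndex.
by_date.index = pd.to_datetime(by_date.index)
by_date = by_date.asfreq("D")

print(by_date.shape)
print(by_date.index)
```

Setting the index first avoids the separate header step and keeps the value columns numeric instead of mixing page names into them.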

Check the average number of missing values per page:

In [None]:
train.isna().sum().mean()

Each page has an average of 48 missing values, which is a lot. I'll impute them with a simple forward fill, followed by a backward fill to cover any gaps at the start of a series.

In [None]:
train = train.ffill()  # fillna(method="ffill") is deprecated in recent pandas
train.head(2)

In [None]:
train = train.bfill()  # fillna(method="bfill") is deprecated in recent pandas
train.head(2)

In [None]:
train.isna().sum().mean()
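
To see why both fill directions are needed, here is a small sketch on made-up data (the column names and values are illustrative only). A forward fill cannot repair a gap at the very start of a series, so the backward fill handles what is left.

```python
import numpy as np
import pandas as pd

# Toy series with gaps at the start, middle, and end (values are made up).
toy = pd.DataFrame({
    "article_a": [np.nan, 4.0, np.nan, 6.0],
    "article_b": [1.0, np.nan, np.nan, np.nan],
})

forward = toy.ffill()      # carries the last seen value forward
filled = forward.bfill()   # fills remaining leading gaps from the next value

print(forward["article_a"].tolist())
print(filled["article_a"].tolist())
print(filled.isna().sum().mean())
```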

To decide which model is best, I'll use a small random sample of 100 pages:

In [None]:
sample = train.sample(n = 100, axis = 1)
sample.head(2)

In [None]:
sample.shape

Now withhold 1 month (30 days) for testing:

In [None]:
sample_test = sample.tail(30)
sample = sample.iloc[:-30]  # keep everything except the 30 held-out rows
sample.tail(2)

In [None]:
sample_test.head(2)
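
The hold-out split can be sketched on a toy frame to check that the boundaries land where expected. The start date below is an assumption, chosen so that 803 daily rows end on the `end_date` used in the model cells; the values are placeholders.

```python
import numpy as np
import pandas as pd

HOLDOUT_DAYS = 30  # same horizon as the one-month test window

# Toy frame with 803 daily rows. The start date is an assumption and the
# values are a simple counter, purely for illustration.
dates = pd.date_range("2015-07-01", periods=803, freq="D")
toy = pd.DataFrame({"article_a": np.arange(803, dtype=float)}, index=dates)

train_part = toy.iloc[:-HOLDOUT_DAYS]   # everything except the last 30 rows
test_part = toy.iloc[-HOLDOUT_DAYS:]    # the last 30 rows

print(train_part.index[-1].date())
print(test_part.index[0].date(), test_part.index[-1].date())
```

Slicing with a negative offset keeps the split correct even if the number of rows changes.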

The two sets line up: the test set starts where the training set ends. Next I fit 8 different time series models, one after the other, and forecast the 30 held-out days with each. I'll time each model and compute its RMSE against the test set before choosing one to predict on all the data.
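
The error metric used in the cells below is the RMSE of each column, averaged over all columns. A minimal sketch of that calculation as a helper follows. The function name `mean_column_rmse` is hypothetical, and comparing by position rather than by index label is an assumption made here so that mismatched index types cannot silently produce NaN; it requires both frames to have the same shape and column order.

```python
import numpy as np
import pandas as pd

def mean_column_rmse(preds: pd.DataFrame, actual: pd.DataFrame) -> float:
    """RMSE per column, then averaged across columns."""
    # Compare by position so differing index labels cannot turn
    # every difference into NaN through index alignment.
    errors = preds.to_numpy(dtype=float) - actual.to_numpy(dtype=float)
    per_column = np.sqrt((errors ** 2).mean(axis=0))
    return float(per_column.mean())

# Made-up values: column "a" is predicted exactly, column "b" is off by 2.
actual = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 10.0, 10.0]})
preds = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [12.0, 12.0, 12.0]})

print(mean_column_rmse(preds, actual))
```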

In [None]:
import warnings
warnings.filterwarnings("ignore")

# 1. Autoregression (AR)
# Note: AR is deprecated in newer statsmodels releases; AutoReg is its replacement.
from statsmodels.tsa.ar_model import AR

preds_data = pd.DataFrame()
start_date = "2017-08-12"
end_date = "2017-09-10"

# Measure time
start_time = timeit.default_timer()

# Fit model
for column in sample:
    model = AR(sample[column], freq = 'D')
    model_fit = model.fit()
    # Make prediction
    yhat = model_fit.predict(start_date, end_date)
    preds_data[column] = yhat

# End time
elapsed = timeit.default_timer() - start_time

print("Time for 100 predictions: ", round(elapsed, 2), "s")
print("RMSE: ", (((preds_data - sample_test) ** 2).mean() ** 0.5).mean())
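
For intuition about what an autoregressive model does, here is a minimal sketch that fits a single-lag model by ordinary least squares with NumPy and then iterates it forward. It is a simplification for illustration, not how statsmodels fits its models; the helper names `fit_ar1` and `forecast_ar1` are hypothetical and the series is synthetic.

```python
import numpy as np

def fit_ar1(y):
    """Least-squares fit of y[t] = c + phi * y[t-1]."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    (c, phi), *_ = np.linalg.lstsq(X, y[1:], rcond=None)
    return c, phi

def forecast_ar1(last_value, c, phi, steps):
    """Iterate the recursion forward from the last observed value."""
    out = []
    value = last_value
    for _ in range(steps):
        value = c + phi * value
        out.append(value)
    return np.array(out)

# Noise-free synthetic series generated by y[t] = 2 + 0.5 * y[t-1].
series = [10.0]
for _ in range(20):
    series.append(2.0 + 0.5 * series[-1])

c, phi = fit_ar1(series)
forecast = forecast_ar1(series[-1], c, phi, steps=3)
print(round(c, 6), round(phi, 6))
print(forecast)
```

Each forecast step feeds the previous prediction back in, which is why multi-step autoregressive forecasts drift toward the long-run mean of the series.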


In [None]:
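# 2. Moving Average (MA): an ARMA model with order = (0, 1), i.e. MA(1)
# Note: statsmodels.tsa.arima_model is deprecated in newer statsmodels releases.
from statsmodels.tsa.arima_model import ARMA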

preds_data = pd.DataFrame()
start_date = "2017-08-12"
end_date = "2017-09-10"

# Measure time
start_time = timeit.default_timer()

# Fit model
for column in sample:
    model = ARMA(sample[column], order = (0,1), freq = 'D')
    model_fit = model.fit()
    # Make prediction
    yhat = model_fit.predict(start_date, end_date)
    preds_data[column] = yhat

# End time
elapsed = timeit.default_timer() - start_time

print("Time for 100 predictions: ", round(elapsed, 2), "s")
print("RMSE: ", (((preds_data - sample_test) ** 2).mean() ** 0.5).mean())

In [None]:
import warnings
warnings.filterwarnings("ignore")

# 3. Autoregressive Moving Average (ARMA)
# Note: order = (1, 0) has no MA term, so this fits an AR(1) model.
from statsmodels.tsa.arima_model import ARMA

preds_data = pd.DataFrame()
start_date = "2017-08-12"
end_date = "2017-09-10"

# Measure time
start_time = timeit.default_timer()

# Fit model
for column in sample:
    model = ARMA(sample[column], order = (1,0), freq = 'D')
    model_fit = model.fit()
    # Make prediction
    yhat = model_fit.predict(start_date, end_date)
    preds_data[column] = yhat

# End time
elapsed = timeit.default_timer() - start_time

print("Time for 100 predictions: ", round(elapsed, 2), "s")
print("RMSE: ", (((preds_data - sample_test) ** 2).mean() ** 0.5).mean())

In [None]:
# 4. Autoregressive Integrated Moving Average (ARIMA)
# Note: order = (1, 0, 0) has no differencing or MA term, so this fits an AR(1) model.
from statsmodels.tsa.arima_model import ARIMA

preds_data = pd.DataFrame()
start_date = "2017-08-12"
end_date = "2017-09-10"

# Measure time
start_time = timeit.default_timer()

# Fit model
for column in sample:
    model = ARIMA(sample[column], order = (1, 0, 0), freq = 'D')
    model_fit = model.fit()
    # Make prediction
    yhat = model_fit.predict(start_date, end_date)
    preds_data[column] = yhat

# End time
elapsed = timeit.default_timer() - start_time

print("Time for 100 predictions: ", round(elapsed, 2), "s")
print("RMSE: ", (((preds_data - sample_test) ** 2).mean() ** 0.5).mean())

In [None]:
# 5. SARIMAX with default parameters
from statsmodels.tsa.statespace.sarimax import SARIMAX

preds_data = pd.DataFrame()
start_date = "2017-08-12"
end_date = "2017-09-10"

# Measure time
start_time = timeit.default_timer()

# Fit model
for column in sample:
    model = SARIMAX(sample[column], freq = 'D')
    model_fit = model.fit()
    # Make prediction
    yhat = model_fit.forecast(steps = 30)
    preds_data[column] = yhat

# End time
elapsed = timeit.default_timer() - start_time

print("Time for 100 predictions: ", round(elapsed, 2), "s")
print("RMSE: ", (((preds_data - sample_test) ** 2).mean() ** 0.5).mean())

In [None]:
# 6. SARIMAX with an explicit order of (1, 1, 0)
from statsmodels.tsa.statespace.sarimax import SARIMAX

preds_data = pd.DataFrame()
start_date = "2017-08-12"
end_date = "2017-09-10"

# Measure time
start_time = timeit.default_timer()

# Fit model
for column in sample:
    model = SARIMAX(sample[column], order = (1,1,0), freq = 'D')
    model_fit = model.fit()
    # Make prediction
    yhat = model_fit.forecast(steps = 30)
    preds_data[column] = yhat

# End time
elapsed = timeit.default_timer() - start_time

print("Time for 100 predictions: ", round(elapsed, 2), "s")
print("RMSE: ", (((preds_data - sample_test) ** 2).mean() ** 0.5).mean())

In [None]:
# 7. Simple Exponential Smoothing (SES)
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

preds_data = pd.DataFrame()
start_date = "2017-08-12"
end_date = "2017-09-10"

# Measure time
start_time = timeit.default_timer()

# Fit model
for column in sample:
    model = SimpleExpSmoothing(sample[column])
    model_fit = model.fit()
    # Make prediction
    yhat = model_fit.predict(start_date, end_date)
    preds_data[column] = yhat

# End time
elapsed = timeit.default_timer() - start_time

print("Time for 100 predictions: ", round(elapsed, 2), "s")
print("RMSE: ", (((preds_data - sample_test) ** 2).mean() ** 0.5).mean())
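
To make the behaviour of simple exponential smoothing concrete, here is a hand-rolled sketch with a fixed smoothing factor. It is an illustration only: statsmodels normally estimates the smoothing factor rather than taking a fixed one, and the function name `ses_forecast`, the value `alpha = 0.5`, and the input values are all assumptions.

```python
def ses_forecast(values, alpha, steps):
    """Simple exponential smoothing with a fixed smoothing factor."""
    level = values[0]  # initialise the level at the first observation
    for y in values[1:]:
        # Blend the new observation with the previous level.
        level = alpha * y + (1 - alpha) * level
    # SES has no trend or seasonality, so every future step gets the same value.
    return [level] * steps

flat = ses_forecast([10.0, 20.0, 30.0], alpha=0.5, steps=3)
print(flat)
```

The forecast is a flat line at the final smoothed level, which is worth keeping in mind when comparing its RMSE over a 30-day horizon with the other models.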

In [None]:
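# 8. Holt Winters Exponential Smoothing (HWES)
# Note: with no trend or seasonal component specified, this reduces to simple exponential smoothing.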
from statsmodels.tsa.holtwinters import ExponentialSmoothing

preds_data = pd.DataFrame()
start_date = "2017-08-12"
end_date = "2017-09-10"

# Measure time
start_time = timeit.default_timer()

# Fit model
for column in sample:
    model = ExponentialSmoothing(sample[column])
    model_fit = model.fit()
    # Make prediction
    yhat = model_fit.predict(start_date, end_date)
    preds_data[column] = yhat

# End time
elapsed = timeit.default_timer() - start_time

print("Time for 100 predictions: ", round(elapsed, 2), "s")
print("RMSE: ", (((preds_data - sample_test) ** 2).mean() ** 0.5).mean())

With these tested, I can pick one and generate predictions for all the data.