# <center>Time Series Data with ARIMA Model</center>

## Table of contents
> ### Overview of Data
> ### Feature Engineering
> ### Prediction


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
%matplotlib inline

## Overview of Data

First, let's take a look at the data we will use.

In [None]:
train = pd.read_csv('../input/web-traffic-time-series-forecasting/train_1.csv.zip')

In [None]:
train.head()

In [None]:
train.info()

So, our data consist of...
* Page: The name of each Wikipedia page.
* yyyy-mm-dd: The number of views recorded on that day. (2015-07-01 to 2016-12-31, 550 days in total)
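The 550-day figure can be checked with a few lines of standard-library Python. This is only a sketch: `day_to_date` is a hypothetical helper (not part of the notebook) that maps the integer day index used on the plots' x-axes back to a calendar date, assuming day 0 is 2015-07-01.

```python
from datetime import date, timedelta

# First and last recorded days, as stated above.
first_day = date(2015, 7, 1)
last_day = date(2016, 12, 31)

# Both endpoints are included, hence the +1.
n_days = (last_day - first_day).days + 1
print(n_days)

def day_to_date(day_index):
    # Hypothetical helper: day 0 is the first recorded day.
    return first_day + timedelta(days=day_index)

print(day_to_date(123))
```

Such a helper makes it easier to relate a position on a plot (for example a vertical line at a given day index) to a real date.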


## Missing Data

In [None]:
# Missing data per day

days = [r for r in range(train.shape[1] - 1)]
fig, ax = plt.subplots(figsize = (10,7))
plt.xlabel('Day')
plt.ylabel('# of null data')
ax.axvline(x=123, c = 'red',  lw = 0.5)
plt.plot(days, train.iloc[:,1:].isnull().sum())

In [None]:
# skip the 'Page' column so that the index matches the x-axis of the plot above
train.columns[1:][123]

In [None]:
# Histogram of the number of missing values per page.

train.isnull().sum(axis = 1).hist()

* OK, so we have quite a lot of null data at the beginning, but the amount decreases as time goes on.
* Perhaps some pages were newly created during the period?
* However, many pages have no missing data, as seen in the second diagram.

I will fill the missing values with 0, as that makes it easier to continue.

In [None]:
train = train.fillna(0)

## Feature Engineering
First I will take a look at each page name. The general format is 

```
SPECIFIC NAME _ LANGUAGE.wikipedia.org _ ACCESS TYPE _ ACCESS ORIGIN
``` 

So we can split the name on underscores and dots.

In [None]:
train.Page

In [None]:
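import re
def split_page(page):
  # split on underscores and dots (raw string, so that \. is a literal dot)
  w = re.split(r'_|\.', page)
  # title, language, access type, access origin
  return ' '.join(w[:-5]), w[-5], w[-2], w[-1]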

li = list(train.Page.apply(split_page))
df = pd.DataFrame(li)
df.columns = ['Title', 'Language', 'Access_type','Access_origin']
df = pd.concat([train, df], axis = 1)
del df['Page']

In [None]:
df.iloc[:, -4:]

Okay, so now the Page name is split into 
* Title (title of the page)
* Language (language the page is written in)
* Access_type (the type of access)
* Access_origin (the origin of access)
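To make the split concrete, here is a small self-contained check on a single page name. The sample name is made up for illustration, and `split_page_keep_title` is a hypothetical alternative (not used in this notebook) that peels the three trailing fields off from the right, so that dots and underscores inside a title are left untouched.

```python
import re

def split_page(page):
    # Same rule as in the notebook: split on underscores and dots.
    w = re.split(r'_|\.', page)
    return ' '.join(w[:-5]), w[-5], w[-2], w[-1]

def split_page_keep_title(page):
    # Hypothetical alternative: split from the right on the last three
    # underscores, then take the language from the site name.
    title, site, access_type, access_origin = page.rsplit('_', 3)
    language = site.split('.')[0]
    return title, language, access_type, access_origin

# Made-up page name in the same format as the Page column.
sample = 'Some_Title_en.wikipedia.org_desktop_all-agents'

print(split_page(sample))
# ('Some Title', 'en', 'desktop', 'all-agents')
print(split_page_keep_title(sample))
# ('Some_Title', 'en', 'desktop', 'all-agents')
```

Both versions agree on the language and access fields; they differ only in how the title is reconstructed.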


In [None]:
df[df.Language == 'de'].iloc[:,-4:]

In [None]:
df.Language.value_counts().plot(kind = 'bar')

* There are actually pages without a specific language: commons (commons.wikimedia.org) and www (www.mediawiki.org).
* The others are English, Japanese, German, French, Chinese, Russian, and Spanish.

In [None]:
df.Access_type.value_counts().plot(kind = 'bar')

* There are three types of access: all-access, mobile-web, and desktop.

In [None]:
df.Access_origin.value_counts().plot(kind = 'bar', color = 'orange')

* The access origins are either all-agents or spider. 

Okay, so now we have global categorical features. Let's take a look at the time series data.

In [None]:
sum_all = df.iloc[:,:-4].sum(axis = 0)

days = list(r for r in range(sum_all.shape[0]))

fig = plt.figure(figsize = (10, 7))
plt.xlabel('Days')
plt.ylabel('Views')
plt.title('Page View of All Pages')
plt.plot(days, sum_all)


In [None]:
summap = {}
lang_list = ["en", "ja", "de", "fr", "zh", "ru", "es", "commons", "www"]
for l in lang_list:
  summap[l] = df[df.Language == l].iloc[:,:-4].sum(axis = 0)/df[df.Language == l].shape[0]

fig = plt.figure(figsize = (15, 7))
plt.xlabel('Days')
plt.ylabel('Views')
plt.title('Average Page View by Language')

for key in summap:
  plt.plot(days, summap[key], label = key)
plt.legend()
plt.show()
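The per-language average computed in the loop above can also be expressed as a single `groupby`. The sketch below uses a tiny made-up table (`toy`) in place of `df`, so the numbers are purely illustrative.

```python
import pandas as pd

# Tiny stand-in for `df`: two day columns followed by a Language column.
toy = pd.DataFrame({
    '2015-07-01': [10.0, 30.0, 4.0],
    '2015-07-02': [20.0, 40.0, 6.0],
    'Language': ['en', 'en', 'ja'],
})

# Average each day column over the pages of each language.
day_cols = [c for c in toy.columns if c != 'Language']
avg_by_lang = toy.groupby('Language')[day_cols].mean()
print(avg_by_lang)
```

Each row of `avg_by_lang` is then one language's average series, which plays the same role as one entry of `summap`.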




* The overall sum is largely driven by the English trend. 
* It is difficult to see the periodic patterns of the less-viewed languages, so I will use the Fourier transform.

In [None]:
from scipy.fftpack import fft

fig, ax = plt.subplots(figsize = (15, 7))

fftmean = {}
fftxvals = {}

for key in summap:
  # FFT of every page's series in this language (one row per page, all day columns)
  fftval = fft(df[df.Language == key].iloc[:, :-4].values)

  # magnitude of each complex coefficient
  fftmag = np.abs(fftval)
  # mean magnitude over the pages
  fftmean[key] = np.mean(fftmag, axis=0)

  # frequency of each bin, in cycles per day
  fftxvals[key] = [day/fftmean[key].shape[0] for day in range(fftmean[key].shape[0])]

  # keep only the non-negative frequencies and normalise by the series length
  npts = len(fftxvals[key])//2 + 1
  fftmean[key] = fftmean[key][:npts]/fftmean[key].shape[0]
  fftxvals[key] = fftxvals[key][:npts]
  # skip the zero-frequency term, which only reflects the mean level
  ax.plot(fftxvals[key][1:], fftmean[key][1:], label = key)

plt.axvline(x = 1/7, color = 'black', lw = 0.5)
plt.axvline(x = 2/7, color = 'black', lw = 0.5)
plt.axvline(x = 3/7, color = 'black', lw = 0.5)

plt.xlabel('Frequency')
plt.ylabel('Mean magnitude')
plt.title('Fourier Transform of Average View by Language')

plt.legend()
plt.show()

* There are clear peaks at 1/7, 2/7 and 3/7. The peak at 1/7 corresponds to a weekly cycle, as we have 7 days per week, and 2/7 and 3/7 are its harmonics. 
* Longer-term trends (lower frequencies) depend on the language. 
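To see why a weekly pattern shows up at 1/7, 2/7 and 3/7, here is a minimal sketch on synthetic data (not the Wikipedia series). It assumes a made-up signal that is flat on five days and higher on two "weekend" days, and uses NumPy's `np.fft.rfft` rather than the `scipy.fftpack.fft` call used above. The length of 546 days (78 full weeks) is chosen so that 1/7 falls exactly on a frequency bin.

```python
import numpy as np

n_days = 546                       # 78 full weeks
t = np.arange(n_days)

# Synthetic views: 100 on five days of the week, 150 on the other two.
views = np.where(t % 7 >= 5, 150.0, 100.0)

# Magnitude spectrum of the mean-removed series.
spectrum = np.abs(np.fft.rfft(views - views.mean()))
freqs = np.fft.rfftfreq(n_days, d=1.0)   # in cycles per day

# Frequencies of the three largest peaks, in ascending order.
top = np.sort(freqs[np.argsort(spectrum)[-3:]])
print(top)   # approximately 1/7, 2/7 and 3/7
```

Because the weekly shape is not a pure sine wave, its energy is spread over the fundamental frequency 1/7 and its multiples, which matches the pattern of peaks in the plot.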

Okay, so now I will take a look at individual pages.
Let's check the most viewed pages for each language.

In [None]:
sums = pd.concat([df.iloc[:,-4:], df.iloc[:,:-4].sum(axis = 1)], axis = 1)
sums.columns = ['Title', 'Language', 'Access_type', 'Access_origin', 'sumvalues']

In [None]:
max_list = {}
for l in lang_list:
  lang_sums = sums[sums.Language == l]
  max_list[l] = lang_sums.sumvalues.idxmax()

In [None]:
df[df.index.isin(max_list.values())].iloc[:,-4:]

- The most viewed pages are the main portal pages of Wikipedia. 


In [None]:
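import matplotlib as mpl
# AppleGothic is a macOS font; on other systems matplotlib warns and falls back to its default font
mpl.rcParams['font.family'] = 'AppleGothic'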

def plot_trend(lang, idx):
    fig = plt.figure(1,figsize=(10,5))
    plt.plot(days, df.iloc[idx,:-4])
    plt.xlabel('day')
    plt.ylabel('views')
    plt.title('Most Viewed Pages ({})'.format(lang))  
    plt.show()

In [None]:
for key in max_list:
  plot_trend(key, max_list[key])

- Each page has unique trend features. 
- There are some weird spikes as well.
- OK, so what about the second largest?

In [None]:
sums2 = sums.drop(labels = max_list.values(), axis = 0)
max_list2 = {}
for l in lang_list:
  lang_sums = sums2[sums2.Language == l]
  max_list2[l] = lang_sums.sumvalues.idxmax()
  
df[df.index.isin(max_list2.values())].iloc[:,-4:]

- The second most viewed ones are the main pages again, under a different access type. 
- Let's compare the trends by access type to see if there is any difference.

In [None]:
main_titles = dict(zip(list(df[df.index.isin(max_list.values())].Language), list(df[df.index.isin(max_list.values())].Title)))

all_access = {}
mobile_access = {}
desktop_access = {}

for l in lang_list:
  all_access[l] = df.index[(df.Language == l) & (df.Title == main_titles[l]) & (df.Access_type == 'all-access')]
  mobile_access[l] = df.index[(df.Language == l) & (df.Title == main_titles[l]) & (df.Access_type == 'mobile-web')]
  desktop_access[l] = df.index[(df.Language == l) & (df.Title == main_titles[l]) & (df.Access_type == 'desktop')]

In [None]:
def plot_trend_access_type(lang):
    # use the `lang` argument rather than the global loop variable

    plt.figure(figsize=(15,4))

    plt.subplot(1,3, 1)
    plt.plot(days, df.iloc[all_access[lang][0],:-4])
    plt.title('All Access ({})'.format(lang))
    plt.subplot(1,3, 2)
    plt.plot(days, df.iloc[mobile_access[lang][0],:-4])
    plt.title('Mobile-web Access ({})'.format(lang))
    plt.subplot(1,3, 3)
    plt.plot(days, df.iloc[desktop_access[lang][0],:-4])
    plt.title('Desktop Access ({})'.format(lang))
    plt.show()

In [None]:
for l in lang_list:
  plot_trend_access_type(l)


- The shapes of the graphs differ a lot depending on the access type. 
- In most of the languages there are peaks around day 300, which could be because of the 2016 Summer Olympics in Rio. Note, however, that the Games ran from 5 to 21 August 2016, roughly days 400 to 417 counting from 2015-07-01, so peaks closer to day 300 (late April 2016) may have another cause. 
- For English, Japanese and French, there is an overall decrease in mobile access but an increase in desktop access, for some reason... 
- Possible reasons could be... mobile users shifted to desktop? Was a new MacBook released? Or did mobile Wikipedia degrade? 


In [None]:
def plot_trend_access_origin(lang):
    # all_access[lang] holds one row per access origin for the main page

    plt.figure(figsize=(10,3))

    plt.subplot(1,2, 1)
    plt.plot(days, df.iloc[all_access[lang][0],:-4])
    plt.title('{} ({})'.format(df.iloc[all_access[lang][0],:].Access_origin,lang))
    plt.subplot(1,2, 2)
    plt.plot(days, df.iloc[all_access[lang][1],:-4])
    plt.title('{} ({})'.format(df.iloc[all_access[lang][1],:].Access_origin, lang))
    plt.show()
  
for l in lang_list:
    plot_trend_access_origin(l)

- Spider traffic seems to be more constant than all-agents traffic. Perhaps it is less event-driven than normal access.

# Prediction
I will use the first 500 days to predict the last 50 days.

## ARIMA
Let's try ARIMA to predict the views from the time series data.

In [None]:
# Split the data into train and test

series = df.iloc[:, 0:-4]

from sklearn.model_selection import train_test_split

X = series.iloc[:,:500]
y = series.iloc[:,500:]

X_train, X_val, y_train, y_val = train_test_split(X.values, y.values, test_size=0.1, random_state=42)


Let's use walk-forward validation to see how the ARIMA model learns and predicts. Fitting every series in the dataset would take a long time, so I will use just the main page in Japanese.

In [None]:
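# Note: in statsmodels 0.13 and later this legacy class can no longer be used; the
# replacement is statsmodels.tsa.arima.model.ARIMA, whose fit() has no `disp` argument.
from statsmodels.tsa.arima_model import ARIMA

# Note: train_test_split shuffled the rows, so 86431 is a position in the
# shuffled split, not the original DataFrame index.
train, test = X_train[86431], y_train[86431]
record = [x for x in train]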
predictions = list()
# walk-forward validation
for t in range(len(test)):
	# fit model
	model = ARIMA(record, order=(4,1,0))
	model_fit = model.fit(disp=False)
	# forecast one step
	yhat = model_fit.forecast()[0]
	# store the result
	predictions.append(yhat)
	record.append(test[t])
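The structure of the walk-forward loop above is independent of the model. As a minimal sketch, the version below swaps ARIMA for a naive "last value" forecaster so that it runs without statsmodels. `walk_forward` and `last_value` are hypothetical names introduced only for this illustration.

```python
def walk_forward(train, test, forecast_one_step):
    """Forecast each test point from all values observed before it."""
    history = list(train)
    predictions = []
    for actual in test:
        # Predict the next step using only the history seen so far.
        predictions.append(forecast_one_step(history))
        # Reveal the true value only after the prediction is stored.
        history.append(actual)
    return predictions

def last_value(history):
    # Naive stand-in for the ARIMA fit: tomorrow equals today.
    return history[-1]

preds = walk_forward([1, 2, 3], [4, 5, 6], last_value)
print(preds)  # [3, 4, 5]
```

The key point is the ordering inside the loop: the model never sees a test value before it has made its forecast for that step.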


In [None]:
from math import sqrt
from sklearn.metrics import mean_squared_error
# evaluate forecasts
rmse = sqrt(mean_squared_error(test, predictions))
print('RMSE: %.3f' % rmse)
# plot forecasts against actual outcomes
fig = plt.subplots(figsize=(10,7))
plt.plot(test)
plt.plot(predictions, color='red')
plt.legend(['test', 'prediction'])
plt.title('ARIMA with Walk-forward validation')
plt.show()

Let's see how the parameters p, d, q affect our ARIMA model.
- p: The number of lagged observations used in the autoregressive (AR) part.
- q: The number of lagged forecast errors used in the moving-average (MA) part.
- d: The order of differencing applied to the series.
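The role of d can be illustrated without any library. The sketch below uses a hypothetical helper, `difference`, that applies first-order differencing d times: one round removes a linear trend, and two rounds remove a quadratic one.

```python
def difference(series, d=1):
    """Apply first-order differencing d times."""
    out = list(series)
    for _ in range(d):
        # Replace each value by its change from the previous value.
        out = [b - a for a, b in zip(out, out[1:])]
    return out

linear = [3 * t + 10 for t in range(6)]     # 10, 13, 16, 19, 22, 25
quadratic = [t * t for t in range(6)]       # 0, 1, 4, 9, 16, 25

print(difference(linear, d=1))      # [3, 3, 3, 3, 3]
print(difference(quadratic, d=2))   # [2, 2, 2, 2]
```

Each round of differencing shortens the series by one value, and the AR and MA parts are then fitted to the differenced series.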

In [None]:

# evaluate an ARIMA model for a given order (p,d,q) with MSE
def evaluate_arima_model(train, test, arima_order):
	# prepare training dataset
	record = [x for x in train]
	# make predictions
	predictions = list()
	for t in range(len(test)):
		model = ARIMA(record, order=arima_order)
		model_fit = model.fit(disp=0)
		yhat = model_fit.forecast()[0]
		predictions.append(yhat)
		record.append(test[t])
	# calculate out of sample error
	error = mean_squared_error(test, predictions)
	# AIC of the last fit, i.e. the one trained on the longest history
	aic_score = model_fit.aic
	return error, aic_score


import warnings
warnings.filterwarnings("ignore")


# evaluate combinations of p, d and q values for an ARIMA model
def evaluate_models(train, test, p_values, d_values, q_values):
	train, test = train.astype('float32'), test.astype('float32')
	best_score, best_cfg, best_aic = float("inf"), None, None
	for p in p_values:
		for d in d_values:
			for q in q_values:
				order = (p,d,q)
				try:
					mse, aic = evaluate_arima_model(train, test, order)
				except Exception:
					# skip orders for which the fit fails
					continue
				if mse < best_score:
					best_score, best_cfg, best_aic = mse, order, aic
				print('ARIMA  ', order,    'MSE=%.3f   AIC=%.3f' % ( mse, aic))
	# report the order with the lowest MSE, if any fit succeeded
	if best_cfg is not None:
		print('Best ARIMA:    ', best_cfg,  'MSE=%.3f  AIC=%.3f' % (best_score, best_aic))
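The nested loops in the search above can also be written with `itertools.product`. The following is a sketch only: `grid_search` is a hypothetical helper, and `toy_score` is a made-up scoring function standing in for the walk-forward MSE of an ARIMA fit, so its minimum says nothing about the real data.

```python
from itertools import product

def grid_search(score, p_values, d_values, q_values):
    """Return (best_order, best_score) for the lowest score found."""
    best_order, best_score = None, float('inf')
    for order in product(p_values, d_values, q_values):
        try:
            s = score(order)
        except Exception:
            # Skip orders for which scoring fails.
            continue
        if s < best_score:
            best_order, best_score = order, s
    return best_order, best_score

def toy_score(order):
    # Made-up score with its minimum at (2, 1, 1); fails for p = q = 0.
    p, d, q = order
    if p == 0 and q == 0:
        raise ValueError('nothing to fit')
    return (p - 2) ** 2 + (d - 1) ** 2 + (q - 1) ** 2

best = grid_search(toy_score, [0, 1, 2], [0, 1], [0, 1])
print(best)  # ((2, 1, 1), 0)
```

Returning the best order, rather than only printing it, makes it easy to reuse the result in a later cell.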


In [None]:

p_values = [0, 5]
d_values = range(0, 2)
q_values = range(4, 8)
warnings.filterwarnings("ignore")
evaluate_models(X_train[86431], y_train[86431], p_values, d_values, q_values)

- So in general, MSE is smaller at p = 0, d = 0 and q around 5 to 7.
- Indeed, the best combination was p = 0, d = 0 and q = 5.

In [None]:
p_values = [0, 1]
d_values = [0,1]
q_values = [5,7]
warnings.filterwarnings("ignore")
evaluate_models(X_train[86431], y_train[86431], p_values, d_values, q_values)

So the best combination was actually (p, d, q) = (1, 0, 5).

Let's visualize and compare to the previous result.


In [None]:
train, test = X_train[86431], y_train[86431]
record = [x for x in train]
predictions = list()
# walk-forward validation
for t in range(len(test)):
	# fit model
	model = ARIMA(record, order=(1,0,5))
	model_fit = model.fit(disp=False)
	# forecast one step
	yhat = model_fit.forecast()[0]
	# store the result
	predictions.append(yhat)
	record.append(test[t])
 
# evaluate forecasts
rmse = sqrt(mean_squared_error(test, predictions))
print('RMSE: %.3f' % rmse)
# plot forecasts against actual outcomes
fig = plt.subplots(figsize=(10,7))
plt.plot(test)
plt.plot(predictions, color='red')
plt.legend(['test', 'prediction'])
plt.title('ARIMA with Walk-forward validation 2')
plt.show()

- Compared to the previous model, this model does not capture the large spikes, but it does not produce any huge errors either, which is why it had the lowest MSE.