# Web Traffic Time Series Forecasting

**Forecast future traffic to Wikipedia pages**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

## Collecting DATA

In [None]:
base_url = '/kaggle/input/web-traffic-time-series-forecasting/'

key_1 = pd.read_csv(base_url+'key_1.csv')
train_1 = pd.read_csv(base_url+'train_1.csv')
sample_submission_1 = pd.read_csv(base_url+'sample_submission_1.csv')

In [None]:
print(train_1.shape, key_1.shape, sample_submission_1.shape)

## Understanding the DATA

**train_1.csv**

- Contains 145,063 rows, each representing a different Wikipedia page
- Contains 551 columns: the first is the page name, and each remaining column holds the number of visits to the page on one day, from 2015-07-01 to 2016-12-31 (1.5 years, 550 days in total)

Jul/2015 - 31 days  
Aug/2015 - 31 days  
Sep/2015 - 30 days  
Oct/2015 - 31 days  
Nov/2015 - 30 days  
Dec/2015 - 31 days  

Total: 184 days

2016 - 366 days (leap year)

Total: 184 + 366 = 550 days
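
The day count above can be double-checked with pandas. This is a standalone sanity check on the date range only; it does not read the competition files.

```python
import pandas as pd

# Every calendar day covered by train_1, both endpoints included
days = pd.date_range("2015-07-01", "2016-12-31", freq="D")

days_2015 = (days.year == 2015).sum()  # Jul-Dec 2015
days_2016 = (days.year == 2016).sum()  # full leap year
print(days_2015, days_2016, len(days))
```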

In [None]:
train_1.head()

**key_1.csv**

- Contains 8,703,780 rows, each one holding a "URL page"_"datetime" key, where datetime ranges from 2017-01-01 to 2017-03-01 (60 days in total). The row count is the total number of pages multiplied by 60 days (145,063 x 60 = 8,703,780)
- Contains 2 columns: the first is the "URL page"_"datetime" key, the second is the ID for that key
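
Because page names contain underscores themselves, the date has to be split off from the right. The sketch below shows this on a single key written by hand in the key_1 format (the string is only an illustration, not read from the file), and re-checks the row count arithmetic.

```python
# Illustrative key in the key_1 format: "<page>_<date>"
key = "!vote_en.wikipedia.org_all-access_all-agents_2017-01-01"

# Split once from the right so underscores inside the page name are kept
page, date = key.rsplit("_", 1)
print(page)
print(date)

# Expected row count of key_1: one row per page per forecast day
n_pages, n_days = 145063, 60
print(n_pages * n_days)
```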

In [None]:
key_1.head()

In [None]:
print(key_1.Page[0])
print()
print(key_1.Page[59])
print()
print(key_1.Page[60])

**sample_submission_1.csv**

- Contains 8,703,780 rows, each one holding the ID of a page/date key and the respective number of visits to the page on that date

In [None]:
sample_submission_1.head()

In summary:

We need to predict the number of visits for the period from 2017-01-01 to 2017-03-01 (60 days), using training data (train_1) that contains the visits to the 145,063 pages in the preceding period, from 2015-07-01 to 2016-12-31 (550 days).

## Exploratory Data Analysis (EDA)

In [None]:
train_1.info()

In [None]:
train_1.head()

In [None]:
# Creating a list of wikipedia main sites 
sites = ["wikipedia.org", "commons.wikimedia.org", "www.mediawiki.org"]

# Function to create a new column having the site part of the article page
def filter_by_site(page):
    for site in sites:
        if site in page:
            return site

# Creating a new column having the site part of the article page
train_1['Site'] = train_1.Page.apply(filter_by_site)

In [None]:
train_1['Site'].value_counts(dropna=False)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

plt.figure(figsize=(12, 6))
plt.title("Number of Wikipedia Articles by Sites", fontsize="18")
train_1['Site'].value_counts().plot.bar(rot=0);

In [None]:
# Checking which country codes exist in the article pages (strictly, these are Wikipedia language codes such as "en" or "de")
train_1.Page.str.split(pat=".wikipedia.org", expand=True).iloc[:,0].str[-3:].value_counts().index.to_list()

In [None]:
# Creating a list of country codes
train_1.Page.str.split(pat=".wikipedia.org", expand=True).iloc[:,0].str[-2:].value_counts().index.to_list()[0:7]

In [None]:
# Checking which agents + access exist in the article pages and creating a list with them
train_1.Page.str.split(pat=".wikipedia.org", expand=True).iloc[:,1].str[1:].value_counts().index.to_list()

In [None]:
# Creating the list of country codes and agents
countries = train_1.Page.str.split(pat=".wikipedia.org", expand=True).iloc[:,0].str[-2:].value_counts().index.to_list()[0:7]
agents = train_1.Page.str.split(pat=".wikipedia.org", expand=True).iloc[:,1].str[1:].value_counts().index.to_list()

# Function to create a new column having the country code part of the article page
def filter_by_country(page):
    for country in countries:
        if "_"+country+"." in page:
            return country

# Creating a new column having the country code part of the article page
train_1['Country'] = train_1.Page.apply(filter_by_country)

# Function to create a new column having the agent + access part of the article page
def filter_by_agent(page):
    for agent in agents:
        if agent in page:
            return agent

# Creating a new column having the agent part of the article page
train_1['Agent'] = train_1.Page.apply(filter_by_agent)

In [None]:
# Inspecting the pages with NaN values in the Country column
# It seems that the URL page does not contain the country code in those cases

train_1.Page[train_1['Country'].isna()]

In [None]:
plt.figure(figsize=(12, 6))
plt.title("Number of Wikipedia Articles by Country", fontsize="18")
train_1['Country'].value_counts(dropna=False).plot.bar(rot=0);

In [None]:
train_1['Agent'].value_counts(dropna=False)

In [None]:
plt.figure(figsize=(12, 6))
plt.title("Number of Wikipedia Articles by Agents/Access", fontsize="18")
train_1['Agent'].value_counts().plot.bar(rot=0);

In [None]:
# Creating a sample dataset from the Train dataset for analysis
train_1_sample = train_1.drop(['Site','Country','Agent'], axis=1).sample(6, random_state=42)
train_1_sample

In [None]:
# Transposing the sample dataset to have Date Time at the index
train_1_sampleT = train_1_sample.drop('Page', axis=1).T
train_1_sampleT.columns = train_1_sample.Page.values
train_1_sampleT.shape

In [None]:
train_1_sampleT.head()

In [None]:
# Plotting the Series from the sample dataset 
plt.figure(figsize=(16,8))

for k, v in enumerate(train_1_sampleT.columns):
    plt.subplot(2, 3, k + 1)
    plt.title( str(v.split(".org")[0])+".org"+"\n"+str(v.split(".org")[1]) )
    train_1_sampleT[v].plot()

plt.tight_layout();

In [None]:
# Plotting the Series from the sample dataset at the same graph
plt.figure(figsize=(15,8))

for v in train_1_sampleT.columns:
    plt.plot(train_1_sampleT[v], label=v)  # label each series so the legend has entries

plt.legend(loc='upper center');

In [None]:
# Plotting the histograms for the Series from the sample dataset
plt.figure(figsize=(16,8))

for k, v in enumerate(train_1_sampleT.columns):
    plt.subplot(2, 3, k + 1)
    plt.title( str(v.split(".org")[0])+".org"+"\n"+str(v.split(".org")[1]) )
    sns.distplot(train_1_sampleT[v])

plt.tight_layout();

In [None]:
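# Testing whether the number of visits to the Wikipedia Articles follows a Gaussian distribution
# A p-value close to 0 rejects the normality hypothesis
# Note: 'norm' is the standard normal N(0, 1), so the raw (unscaled) counts are compared against it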
from scipy.stats import kstest, ks_2samp

pages = list(train_1_sampleT.columns)

print("Kolmogorov-Smirnov - Normality Test")
print()

for p in pages:
    print(p,':', kstest(train_1_sampleT[p], 'norm', alternative = 'less'))    

### Exploring Groups of Time Series for Different Sites     

In [None]:
# List of the main Wikipedia Article sites
sites

In [None]:
# Creating sample datasets from the train dataset and filtering them by sites
train_1_sample_site0 = train_1[train_1['Site'] == sites[0]].drop(['Site','Country','Agent'], axis=1).sample(6, random_state=42)
train_1_sample_site1 = train_1[train_1['Site'] == sites[1]].drop(['Site','Country','Agent'], axis=1).sample(6, random_state=42)
train_1_sample_site2 = train_1[train_1['Site'] == sites[2]].drop(['Site','Country','Agent'], axis=1).sample(6, random_state=42)

# Transposing them to have the Date Time as index
train_1_sampleT_site0 = train_1_sample_site0.drop('Page', axis=1).T
train_1_sampleT_site0.columns = train_1_sample_site0.Page.values
train_1_sampleT_site1 = train_1_sample_site1.drop('Page', axis=1).T
train_1_sampleT_site1.columns = train_1_sample_site1.Page.values
train_1_sampleT_site2 = train_1_sample_site2.drop('Page', axis=1).T
train_1_sampleT_site2.columns = train_1_sample_site2.Page.values

**Time Series of "WIKIPEDIA.ORG" sites only**

In [None]:
# Plotting the Series from the sample datasets
plt.figure(figsize=(16,8))

for k, v in enumerate(train_1_sampleT_site0.columns):
    plt.subplot(2, 3, k + 1)
    plt.title( str(v.split(".org")[0])+".org"+"\n"+str(v.split(".org")[1]) )
    train_1_sampleT_site0[v].plot()

plt.tight_layout();

In [None]:
# Plotting the Series from the sample datasets at the same graph
plt.figure(figsize=(15,8))

for v in train_1_sampleT_site0.columns:
    plt.plot(train_1_sampleT_site0[v], label=v)  # label each series so the legend has entries

plt.legend(loc='upper center');

**Time Series of "COMMONS.WIKIMEDIA.ORG" sites only**

In [None]:
# Plotting the Series from the sample datasets
plt.figure(figsize=(16,8))

for k, v in enumerate(train_1_sampleT_site1.columns):
    plt.subplot(2, 3, k + 1)
    plt.title( str(v.split(".org")[0])+".org"+"\n"+str(v.split(".org")[1]) )
    train_1_sampleT_site1[v].plot()

plt.tight_layout();

In [None]:
# Plotting the Series from the sample datasets at the same graph
plt.figure(figsize=(15,8))

for v in train_1_sampleT_site1.columns:
    plt.plot(train_1_sampleT_site1[v], label=v)  # label each series so the legend has entries

plt.legend(loc='upper center');

**Time Series of "WWW.MEDIAWIKI.ORG" sites only**

In [None]:
# Plotting the Series from the sample datasets
plt.figure(figsize=(16,8))

for k, v in enumerate(train_1_sampleT_site2.columns):
    plt.subplot(2, 3, k + 1)
    plt.title( str(v.split(".org")[0])+".org"+"\n"+str(v.split(".org")[1]) )
    train_1_sampleT_site2[v].plot()

plt.tight_layout();

In [None]:
# Plotting the Series from the sample datasets at the same graph
plt.figure(figsize=(15,8))

for v in train_1_sampleT_site2.columns:
    plt.plot(train_1_sampleT_site2[v], label=v)  # label each series so the legend has entries

plt.legend(loc='upper center');

In [None]:
train_1_sampleT_site2.columns[4]

Notes:

For all the site samples, some series have missing data (NaNs).

One of the WWW.MEDIAWIKI.ORG sample series has no data at all.  
For this series, the URL contains an IP address instead of a DNS name and starts with "User:".
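
To put numbers on these observations, the share of missing days per series can be computed directly. The sketch below uses a small made-up dataframe (`toy`, with hypothetical page names) shaped like the transposed samples, with days as rows and pages as columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a transposed sample: rows are days, columns are pages
toy = pd.DataFrame(
    {
        "page_a": [10.0, 12.0, np.nan, 9.0],
        "page_b": [np.nan, np.nan, np.nan, np.nan],  # series with no data at all
        "page_c": [3.0, 4.0, 5.0, 6.0],
    },
    index=pd.date_range("2015-07-01", periods=4, freq="D"),
)

# Share of missing days per series
missing_share = toy.isna().mean()
print(missing_share)

# Series that are entirely empty
empty_pages = missing_share[missing_share == 1.0].index.tolist()
print(empty_pages)
```

The same two lines should work on any of the transposed sample dataframes, for example train_1_sampleT_site2.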

### Exploring a Group of Time Series for a Specific Country - DE

In [None]:
# List of the Wikipedia Article country codes
countries

In [None]:
# Creating a sample dataset from the train dataset for pages having the "de" code
# Filtering by the code itself instead of relying on its position in the countries list
train_1_sample_de = train_1[train_1['Country'] == 'de'].drop(['Site','Country','Agent'], axis=1).sample(6, random_state=42)

# Transposing the sample dataset to have Date Time at the index
train_1_sampleT_de = train_1_sample_de.drop('Page', axis=1).T
train_1_sampleT_de.columns = train_1_sample_de.Page.values

In [None]:
# Plotting the Series from the sample dataset
plt.figure(figsize=(16,8))

for k, v in enumerate(train_1_sampleT_de.columns):
    plt.subplot(2, 3, k + 1)
    plt.title( str(v.split(".org")[0])+".org"+"\n"+str(v.split(".org")[1]) )
    train_1_sampleT_de[v].plot()

plt.tight_layout();

In [None]:
# Plotting the Series from the sample datasets at the same graph
plt.figure(figsize=(15,8))

for v in train_1_sampleT_de.columns:
    plt.plot(train_1_sampleT_de[v], label=v)  # label each series so the legend has entries

plt.legend(loc='upper center');

## Modeling with Facebook Prophet

The Facebook Prophet library is used to define a forecasting model in Python.  

I will now use Prophet to model one specific Time Series taken from the samples of the training dataset. 

In [None]:
# Import Prophet library
from fbprophet import Prophet

In [None]:
# Picked up one Time Series for the prophet modeling
train_1_sampleT.columns[1]

In [None]:
# Creating a dataframe for the Time Series from the train_1 samples dataset
ds = pd.Series(train_1_sampleT.index)
y = pd.Series(train_1_sampleT.iloc[:,1].values)
frame = { 'ds': ds, 'y': y }
df = pd.DataFrame(frame)
df.head()

In [None]:
df.plot();

In [None]:
# Instantiate and fit the Prophet model with no hyperparameters at all
m = Prophet()
m.fit(df);

In [None]:
# Make dataframe for the future predictions to the next 60 days
# By default it will also include the dates from the history
# In summary it will have 550 + 60 days (610)
future = m.make_future_dataframe(periods=60)
future.tail()

In [None]:
# Predicting the values from the future dataframe
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

In [None]:
forecast.shape

In [None]:
# The forecast object here is a new dataframe that includes a column yhat with the forecast, 
# as well as columns for components and uncertainty intervals
forecast.head()

In [None]:
# Plotting the forecast by calling the Prophet.plot method and passing in the forecast dataframe
fig1 = m.plot(forecast)

In [None]:
# Plotting the forecast components by calling the Prophet.plot_components method
# By default it includes the trend and seasonality of the time series
fig2 = m.plot_components(forecast)

In [None]:
# Plotting both the Actual values and Predict values at the same graph for comparison
plt.figure(figsize=(15, 7))
plt.plot(df.y)                  # Actual values in default blue color
plt.plot(forecast.yhat, "g");   # Predicted values in green color

**Conclusion:** In this case, the model captured only the trend.

### Prophet - Saturating forecasts

As the results above show, the prediction has a downward trend that reaches negative values, which is not acceptable here: there can be no negative visits to a Wikipedia article.

For this reason, I tried the Prophet logistic growth model with a saturating minimum, setting the floor value to zero. However, in order to use a logistic growth trend with a saturating minimum, a maximum capacity must also be specified.
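
To see why both bounds are needed, it helps to look at the shape of a logistic curve: it moves between a lower and an upper asymptote and cannot leave that band. The sketch below is only an illustration of that bounded shape with made-up rate and midpoint values (`k`, `m`); Prophet's actual logistic trend is piecewise, with changepoints, and `saturating_trend` is a hypothetical helper.

```python
import numpy as np

def saturating_trend(t, cap, floor, k, m):
    """Logistic curve bounded below by `floor` and above by `cap`."""
    return floor + (cap - floor) / (1.0 + np.exp(-k * (t - m)))

t = np.arange(0, 610)  # 550 history days + 60 forecast days
# k < 0 gives a declining trend; k and m are made-up values for illustration
trend = saturating_trend(t, cap=500.0, floor=0.0, k=-0.02, m=300.0)

# The curve declines but never crosses the floor or the cap
print(trend[0], trend[-1])
print(trend.min() >= 0.0, trend.max() <= 500.0)
```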

In [None]:
forecast['yhat'].tail()

In [None]:
# Setting the floor to 0 and the capacity to 500, for both the history and the future dataframes
df['cap'] = 500
df['floor'] = 0.0
future['cap'] = 500
future['floor'] = 0.0

# Instantiating prophet 'logistic' growth mode, then fitting and predicting future values
m = Prophet(growth='logistic')
forecast = m.fit(df).predict(future)

# Plotting both the forecast predictions and components
fig1 = m.plot(forecast)
fig2 = m.plot_components(forecast)

**Conclusion:** in this case the prediction trend reached the capacity value defined (500). I will need to explore other prophet parameters to get better results. 

### Prophet - Seasonality

I will now enable the yearly, weekly and daily seasonality parameters in the Prophet model.

In [None]:
# Instantiate Prophet with the seasonality parameters enabled, then fit and predict the future
# Plotting both the forecast and its components
# I will keep the default growth='linear' for now instead of 'logistic'
m = Prophet(daily_seasonality=True, weekly_seasonality=True, yearly_seasonality=True)
forecast = m.fit(df).predict(future)
fig1 = m.plot(forecast)
fig2 = m.plot_components(forecast)

In [None]:
# Plotting both the Actual values and Predict values at the same graph for comparison
plt.figure(figsize=(15, 7))
plt.plot(df.y)                  # Actual values in default blue color
plt.plot(forecast.yhat, "g");   # Predicted values in green color

**Conclusion:** In this case, the fit was much better, which was expected since the seasonality components capture the most relevant frequencies. Seasonalities are estimated using a partial Fourier sum. However, the model could not capture the high peaks.
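
A partial Fourier sum models a periodic pattern as a weighted sum of a few sine and cosine harmonics. The sketch below builds such features for a weekly period with NumPy. `fourier_features` is a hypothetical helper written for this note, and the order of 3 is chosen for illustration; the model would then learn one weight per column.

```python
import numpy as np

def fourier_features(t, period, order):
    """Return sin/cos columns for harmonics 1..order of the given period."""
    columns = []
    for n in range(1, order + 1):
        angle = 2.0 * np.pi * n * t / period
        columns.append(np.sin(angle))
        columns.append(np.cos(angle))
    return np.column_stack(columns)

t = np.arange(550)  # day index of the training window
weekly = fourier_features(t, period=7.0, order=3)
print(weekly.shape)

# The features repeat every 7 days, as a weekly pattern should
print(np.allclose(weekly[0], weekly[7]))
```

A low order gives a smooth seasonal curve, which is consistent with the sharp peaks being missed.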

### Prophet - Changepoints

Now I will explore Prophet changepoints, which automatically detect abrupt changes in the time series trajectory, and see whether they allow the trend to adapt appropriately. 

In [None]:
# Checking the locations of the significant changepoints
from fbprophet.plot import add_changepoints_to_plot
fig = m.plot(forecast)
a = add_changepoints_to_plot(fig.gca(), m, forecast)

By default changepoints are only inferred for the first 80% of the time series in order to have plenty of runway for projecting the trend forward and to avoid overfitting fluctuations at the end of the time series.

Since I still see some changepoints after 80%, I will increase it to check for other ones.
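
As far as I understand, the candidate changepoints are spread evenly over the eligible part of the history. The sketch below reproduces that idea with NumPy to show where the candidates would fall on this 550-day series. The count of 25 candidates is an assumption about the default, and the indexing is a simplified illustration rather than the library's own code.

```python
import numpy as np

n_history = 550          # days in the training series
changepoint_range = 0.8  # default share of history eligible for changepoints
n_changepoints = 25      # assumed default number of candidate changepoints

# Only the first 80% of the history is eligible
hist_size = int(np.floor(n_history * changepoint_range))

# Evenly spaced positions inside the eligible part, skipping position 0
cp_indexes = (
    np.linspace(0, hist_size - 1, n_changepoints + 1).round().astype(int)[1:]
)
print(hist_size, len(cp_indexes), cp_indexes[-1])
```

Raising changepoint_range moves the last candidate closer to the end of the series, so the trend can react to the most recent days.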

In [None]:
# Increasing the 'changepoint_range' parameter from default 80% to 90%
m = Prophet(daily_seasonality=True, weekly_seasonality=True, yearly_seasonality=True,
            changepoint_range=0.9)
forecast = m.fit(df).predict(future)
fig = m.plot(forecast)
a = add_changepoints_to_plot(fig.gca(), m, forecast)

In [None]:
deltas = m.params['delta'].mean(0)
fig = plt.figure(facecolor='w', figsize=(10, 6))
ax = fig.add_subplot(111)
ax.bar(range(len(deltas)), deltas, facecolor='#0072B2', edgecolor='#0072B2')
ax.grid(True, which='major', c='gray', ls='-', lw=1, alpha=0.2)
ax.set_ylabel('Rate change')
ax.set_xlabel('Potential changepoint')
fig.tight_layout()

**Conclusion:** The trend is going down faster when increasing the changepoint_range, making the prediction values more negative, which doesn't make sense. So I will keep changepoint_range to default 80%.

In [None]:
# Changing the changepoint_range back to 80% since I don't want to make the trend more negative
# Also increasing the changepoint_prior_scale from its default of 0.05 to 0.7
# Increasing changepoint_prior_scale makes the trend more flexible
m = Prophet(daily_seasonality=True, weekly_seasonality=True, yearly_seasonality=True,
            changepoint_range=0.8, changepoint_prior_scale=0.7)
forecast = m.fit(df).predict(future)
fig = m.plot(forecast)
a = add_changepoints_to_plot(fig.gca(), m, forecast)

In [None]:
deltas = m.params['delta'].mean(0)
fig = plt.figure(facecolor='w', figsize=(10, 6))
ax = fig.add_subplot(111)
ax.bar(range(len(deltas)), deltas, facecolor='#0072B2', edgecolor='#0072B2')
ax.grid(True, which='major', c='gray', ls='-', lw=1, alpha=0.2)
ax.set_ylabel('Rate change')
ax.set_xlabel('Potential changepoint')
fig.tight_layout()

In [None]:
# Plotting both the Actual values and Predict values at the same graph for comparison
plt.figure(figsize=(15, 7))
plt.plot(df.y)                  # Actual values in default blue color
plt.plot(forecast.yhat, "g");   # Predicted values in green color

**Conclusion:** At this point we have a pretty good model.

### Prophet - Holidays

Now I will include a dataframe for holidays. Since the Wikipedia article time series I am analyzing has the country code "es", I will use the Spanish holidays, covering the years 2015 to 2017.
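
Prophet expects the holidays as a dataframe with a `holiday` column (the name) and a `ds` column (the date). As a minimal sketch of that format, the block below builds such a dataframe by hand from three fixed-date Spanish holidays. The short list and the `manual_holidays` name are illustrative only; the notebook itself relies on the holidays package for the full calendar.

```python
import pandas as pd

# A few fixed-date national holidays in Spain, listed by hand for illustration
fixed_dates = {"New Year's Day": "01-01", "Epiphany": "01-06", "Christmas Day": "12-25"}

rows = [
    {"holiday": name, "ds": pd.Timestamp(f"{year}-{month_day}")}
    for year in (2015, 2016, 2017)
    for name, month_day in fixed_dates.items()
]
manual_holidays = pd.DataFrame(rows).sort_values("ds").reset_index(drop=True)
print(manual_holidays.head())
```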

In [None]:
train_1_sampleT.columns[1]

In [None]:
"_es." in train_1_sampleT.columns[1]

In [None]:
import holidays

# Select the country and the years to cover
es_holidays = holidays.Spain(years = [2015,2016,2017])
# Turn the holiday dates into the dataframe format Prophet expects ('holiday' and 'ds' columns)
es_holidays = pd.DataFrame.from_dict(es_holidays, orient='index')
es_holidays = pd.DataFrame({'holiday': 'Spain', 'ds': es_holidays.index})

In [None]:
es_holidays.head()

In [None]:
# Instantiate prophet with seasonality, changepoints and holidays parameters
m = Prophet(daily_seasonality=True, weekly_seasonality=True, yearly_seasonality=True,
            changepoint_range=0.8, changepoint_prior_scale=0.7,
            holidays=es_holidays)
m.add_country_holidays(country_name='ES')
# Fitting and predicting the future
forecast = m.fit(df).predict(future)
# Plotting both the forecast and its components
fig1 = m.plot(forecast)
fig2 = m.plot_components(forecast)

### Prophet - Uncertainty interval

#### Uncertainty in the trend

The width of the uncertainty intervals (by default 80%) can be set using the parameter interval_width.  
I will increase it to 95%.

#### Uncertainty in seasonality

The mcmc_samples parameter determines whether the model uses Maximum a posteriori (MAP) estimation or full Bayesian inference with the specified number of Markov Chain Monte Carlo (MCMC) samples to train and predict.
If mcmc_samples is zero, Prophet does MAP estimation; otherwise the value is the number of samples to use with MCMC.

Source: <a href="https://towardsdatascience.com/implementing-facebook-prophet-efficiently-c241305405a3">Implementing Facebook Prophet efficiently</a>

Since we are using SMAPE as the evaluation metric, which depends only on the point forecast, I decided to keep the mcmc_samples parameter at its default value of zero.

In [None]:
# Instantiate prophet with seasonality, changepoints and holidays parameters
m = Prophet(daily_seasonality=True, weekly_seasonality=True, yearly_seasonality=True,
            changepoint_range=0.8, changepoint_prior_scale=0.7,
            holidays=es_holidays,
            interval_width=0.95,
            mcmc_samples=0)
m.add_country_holidays(country_name='ES')
# Fitting and predicting the future
forecast = m.fit(df).predict(future)
# Plotting both the forecast and its components
fig1 = m.plot(forecast)
fig2 = m.plot_components(forecast)

In [None]:
deltas = m.params['delta'].mean(0)
fig = plt.figure(facecolor='w', figsize=(10, 6))
ax = fig.add_subplot(111)
ax.bar(range(len(deltas)), deltas, facecolor='#0072B2', edgecolor='#0072B2')
ax.grid(True, which='major', c='gray', ls='-', lw=1, alpha=0.2)
ax.set_ylabel('Rate change')
ax.set_xlabel('Potential changepoint')
fig.tight_layout()

In [None]:
plt.figure(figsize=(15, 7))
plt.plot(df.y)
plt.plot(forecast.yhat, "g");

## An interactive figure of the forecast created with Plotly

In [None]:
from fbprophet.plot import plot_plotly
import plotly.offline as py
py.init_notebook_mode()

fig = plot_plotly(m, forecast)  # This returns a plotly Figure
py.iplot(fig)

### Prophet - All parameters

Let us look at a summary of some of the most important Prophet parameters for reference.

**Trend parameters**

Parameter and Description

- growth -> ‘linear’ or ‘logistic’ to specify a linear or logistic trend
- changepoints -> List of dates at which to include potential changepoints (automatic if not specified)
- n_changepoints -> If changepoints is not supplied, you may provide the number of changepoints to be automatically included
- changepoint_prior_scale -> Parameter for changing flexibility of automatic changepoint selection

**Seasonality & Holiday Parameters**

Parameter and Description

- yearly_seasonality -> Fit yearly seasonality
- weekly_seasonality -> Fit weekly seasonality
- daily_seasonality -> Fit daily seasonality
- holidays -> Feed dataframe containing holiday name and date
- seasonality_prior_scale -> Parameter for changing strength of seasonality model
- holidays_prior_scale -> Parameter for changing strength of holiday model

Source: https://www.analyticsvidhya.com/blog/2018/05/generate-accurate-forecasts-facebook-prophet-python-r/

In [None]:
m.params

## Evaluating the Model

SMAPE function

$$ SMAPE = \frac{100\%}{n} \sum_{t=1}^{n} \frac{\left|F_t - A_t\right|}{(\left|A_t\right|+\left|F_t\right|)/2} $$
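
A small worked example may make the formula more concrete. The numbers below are made up; the third pair shows the 0/0 case (actual and forecast both zero), which is treated here as zero error, an assumption that matches the function defined in the next cell.

```python
import numpy as np

actual = np.array([100.0, 50.0, 0.0])          # made-up visit counts
forecast_values = np.array([110.0, 40.0, 0.0])  # made-up predictions

denominator = (np.abs(actual) + np.abs(forecast_values)) / 2.0
# Treat the 0/0 case (both actual and forecast are zero) as zero error
ratio = np.divide(
    np.abs(forecast_values - actual),
    denominator,
    out=np.zeros_like(denominator),
    where=denominator != 0,
)
smape_value = 100.0 * ratio.mean()
print(round(smape_value, 2))
```

The three terms are 10/105, 10/45 and 0, so the result is their mean multiplied by 100.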

In [None]:
def smape(y_true, y_pred):
    denominator = (np.abs(y_true) + np.abs(y_pred))
    diff = np.abs(y_true - y_pred) / denominator
    diff[denominator == 0] = 0.0
    return 200 * np.mean(diff)

# Source: http://shortnotes.herokuapp.com/how-to-implement-smape-function-in-python-149

Calculating the SMAPE for the time series prediction for the visits at a single URL page

In [None]:
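# In-sample SMAPE: compare only the historical days (the last 60 forecast rows have no actual values)
smape_single_page = smape(df.y, forecast.yhat[:len(df)])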
smape_single_page

### Prophet - Cross Validation

In [None]:
from fbprophet.diagnostics import cross_validation

In [None]:
# horizon: forecast horizon
# initial: size of the initial training period
# period: spacing between cutoff dates
#
# Here we do cross-validation to assess prediction performance on a horizon of 60 days, 
# starting with 360 days of training data in the first cutoff and then making predictions every 30 days
# The model was fitted on 550 days of history, which bounds the number of forecasts that fit

cv_results = cross_validation(m, initial='360 days', period='30 days', horizon='60 days')
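
To get a feel for these three settings, the sketch below lays out rolling-origin cutoffs for the 550-day history with plain pandas. It is a simplified illustration of the idea under the stated rule (step back by `period` while at least `initial` days of training data remain), not Prophet's own implementation, so the library's cutoffs may differ in detail.

```python
import pandas as pd

start, end = pd.Timestamp("2015-07-01"), pd.Timestamp("2016-12-31")
initial = pd.Timedelta("360 days")
period = pd.Timedelta("30 days")
horizon = pd.Timedelta("60 days")

# The last cutoff leaves exactly one horizon of data after it
cutoff = end - horizon
cutoffs = []
# Step backwards while at least `initial` days of training data remain
while cutoff >= start + initial:
    cutoffs.append(cutoff)
    cutoff -= period
cutoffs.reverse()

# Each cutoff is followed by a 60-day evaluation window
for c in cutoffs:
    print(c.date(), "->", (c + horizon).date())
```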

In [None]:
smape_baseline = smape(cv_results.y, cv_results.yhat)
smape_baseline

## Prophet - Running for Multiple Time Series

In [None]:
train_1_all = train_1.drop(['Page','Site','Country','Agent'], axis=1).T
train_1_all.columns = train_1.Page.values
train_1_all.shape

In [None]:
train_1_all.head()

In [None]:
# Filling up NaN values with 0 visits to avoid breaking the model fit
train_1_all.fillna(0, inplace=True)

# Selecting a few series to run the Prophet model against
num_series = 10
train_1_sample = train_1_all.sample(num_series, axis=1, random_state=42)

In [None]:
# Plotting the Series from the sample datasets at the same graph
plt.figure(figsize=(15,8))

for v in train_1_sample.columns:
    plt.plot(train_1_sample[v], label=v)  # label each series so the legend has entries

plt.legend(loc='upper center');

In [None]:
%%time

smape_partial = 0

for k, v in enumerate(train_1_sample.columns):
    ds = pd.Series(train_1_sample.index)
    y = pd.Series(train_1_sample.iloc[:,k].values)
    frame = { 'ds': ds, 'y': y }
    df = pd.DataFrame(frame)
    m_partial = Prophet(daily_seasonality=True, weekly_seasonality=True, yearly_seasonality=True)
    forecast = m_partial.fit(df).predict(future)
    smape_partial += smape(df.y, forecast.yhat)

smape_average = smape_partial / len(train_1_sample.columns)
smape_average

## Multivariate Time Series models

I could use Multivariate Time Series (MTS) models instead of fitting a univariate model to each Time Series.  
Following this approach, below are some ideas I could try in the future:

- Vector Auto Regression (VAR)
  - Johansen’s test for checking cointegration among the series of a multivariate time series  
    (from statsmodels.tsa.vector_ar.vecm import coint_johansen)
  - Fit the model using the VAR model from the statsmodels library  
    (from statsmodels.tsa.vector_ar.var_model import VAR)  
- Random Forest  
- Recurrent Neural Networks (RNN)  

Sources:  

<a href="https://link.medium.com/miaEiLC0c1">A Multivariate Time Series Guide to Forecasting and Modeling (with Python codes)</a>  
<a href="https://towardsdatascience.com/multivariate-time-series-forecasting-using-random-forest-2372f3ecbad1">Multivariate Time Series Forecasting Using Random Forest</a>  
<a href="https://link.medium.com/XFbTA4O0c1">Interpreting recurrent neural networks on multivariate time series</a>

## Multiple Time Series in parallel  

Another idea is to use the Python multiprocessing package to forecast multiple Time Series in parallel.  
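
A minimal sketch of that idea, assuming one independent forecast per page: each (page, values) pair is sent to a worker process through multiprocessing.Pool. The worker `forecast_one` is a placeholder that simply repeats the median of the last 7 days, standing in for the per-page Prophet fit and predict; both function names are hypothetical.

```python
from multiprocessing import Pool

import numpy as np

def forecast_one(item):
    """Placeholder worker: repeats the median of the last 7 values for 60 days.

    In the notebook this is where a Prophet fit/predict per page would go.
    """
    page, values = item
    level = float(np.nanmedian(values[-7:]))
    return page, np.full(60, level)

def forecast_all(series_by_page, processes=2):
    """Run forecast_one over every (page, values) pair in a process pool."""
    with Pool(processes=processes) as pool:
        return dict(pool.map(forecast_one, list(series_by_page.items())))
```

In a script, `forecast_all({"page_a": values_a, "page_b": values_b})` would typically be called from inside an `if __name__ == "__main__":` block, which multiprocessing needs on platforms that spawn new processes.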

Source:  

<a href="https://medium.com/spikelab/forecasting-multiples-time-series-using-prophet-in-parallel-2515abd1a245">Forecasting multiple time-series using Prophet in parallel</a>

## Submitting to Kaggle

In [None]:
# train_1_sampleT.columns[1]+"_"+"2017-01-01"
# train_1_sampleT.columns[1]+"_"+"2017-01-01" in list(key_1.Page.values)