# <span style='color:RED'>Web Traffic Time Series Forecasting</span>

# What is Time Series Forecasting?

Time series forecasting uses information regarding historical values and associated patterns to predict future activity. Most often, this relates to trend analysis, cyclical fluctuation analysis, and issues of seasonality. As with all forecasting methods, success is not guaranteed.
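
As a minimal sketch of using historical patterns to predict future values, the snippet below repeats the last observed week as the forecast (a "seasonal naive" baseline). The helper name `seasonal_naive_forecast` and the visit counts are made up for illustration; this is not the model used later in this notebook.

```python
import numpy as np

def seasonal_naive_forecast(history, season_length, horizon):
    """Forecast by repeating the last full season of observed values."""
    history = np.asarray(history, dtype=float)
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    last_season = history[-season_length:]
    # Tile the last season enough times, then cut to the requested horizon.
    repeats = int(np.ceil(horizon / season_length))
    return np.tile(last_season, repeats)[:horizon]

# Hypothetical daily visit counts for two weeks (lower values on weekends).
visits = [100, 110, 105, 108, 102, 60, 55,
          104, 112, 107, 109, 103, 62, 57]
forecast = seasonal_naive_forecast(visits, season_length=7, horizon=10)
print(forecast)
```

A baseline like this is useful mainly as a yardstick: a more elaborate model should at least beat it.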

### <span style='color:R'> How Do You Analyze Time Series Data?</span>

Statistical techniques can be used to analyze time series data in two key ways: to generate inferences on how one or more variables affect some variable of interest over time, or to forecast future trends. 

###  What Are Some Examples of Time Series?

A time series can be constructed from any data that is measured over time at evenly spaced intervals. Historical stock prices, earnings, GDP, or other sequences of financial or economic data can be analyzed as a time series.
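
For illustration, here is a small sketch of an evenly spaced series built with pandas. The dates and view counts are invented; the check at the end simply confirms that every gap between consecutive timestamps is the same.

```python
import pandas as pd

# Hypothetical daily page-view counts; the values are made up for illustration.
dates = pd.date_range(start="2016-01-01", periods=7, freq="D")
views = pd.Series([120, 135, 128, 140, 150, 90, 85], index=dates, name="views")

# Evenly spaced means every gap between consecutive timestamps is identical.
gaps = views.index.to_series().diff().dropna()
is_evenly_spaced = gaps.nunique() == 1
print(is_evenly_spaced)
```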

## <span style='color:bLUE'> DATA ANALYSIS AND VISUALIZATIONS</span>

In [None]:
import os
import math
import numpy as np
import pandas as pd
import seaborn as sns
import calendar

%matplotlib inline
import matplotlib.pyplot as plt

import plotly.graph_objs as go


In [None]:
train_df = pd.read_csv("../input/web-traffic-time-series-forecasting/train_1.csv.zip")
key_df = pd.read_csv("../input/web-traffic-time-series-forecasting/key_1.csv.zip")

In [None]:
train_df.head(3)

In [None]:
key_df.head(3)

In [None]:
print("Train--- ", train_df.shape)
print("Key----- ", key_df.shape)

We can see that each page name ends with three underscore-separated sections: site, access type, and agent. Splitting them into separate columns makes the analysis easier.

In [None]:
page= pd.DataFrame([i.split("_")[-3:] for i in train_df["Page"]])
page.columns = ["Site", "Access_Type", "Agent"]
page.describe()
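
The cell above keeps only the last three fields of each page name. As a sketch of an alternative, assuming page names follow the layout `<article>_<site>_<access>_<agent>`, splitting from the right also preserves the article title, even when the title itself contains underscores. The two sample names and the `Article` column are illustrative.

```python
import pandas as pd

# Hypothetical page names in the "<article>_<site>_<access>_<agent>" layout.
pages = pd.Series([
    "2NE1_zh.wikipedia.org_all-access_spider",
    "The_Beatles_en.wikipedia.org_desktop_all-agents",
])

# Split from the right at most three times so underscores inside the
# article title stay intact in the first column.
parts = pages.str.rsplit("_", n=3, expand=True)
parts.columns = ["Article", "Site", "Access_Type", "Agent"]
print(parts)
```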

In [None]:
site_column = list(page['Site'].unique())
access_column =list(page['Access_Type'].unique())
agents_column= list(page['Agent'].unique())

In [None]:
print(site_column)
print('-----------------------------------------------------------------------------------------------------------------------')
print(access_column)
print('-----------------------------------------------------------------------------------------------------------------------')
print(agents_column)

In [None]:
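import re

def get_language(page):
    # Match a two-letter language code in front of ".wikipedia.org";
    # the dots are escaped so they match literal dots only.
    res = re.search(r'[a-z][a-z]\.wikipedia\.org', page)
    if res:
        return res[0][0:2]
    # Pages outside *.wikipedia.org get the placeholder 'na'.
    return 'na'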

In [None]:
train_df['lang'] = train_df.Page.map(get_language)

page['lang'] = train_df.Page.map(get_language)

In [None]:


page["lang"].value_counts().sort_index().plot.bar().set_title('Language - distribution')



In [None]:
from collections import Counter

print(Counter(train_df.lang))

In [None]:
# One sub-frame per language, dropping the trailing 'lang' helper column.
lang = {
    code: train_df[train_df.lang == code].iloc[:, 0:-1]
    for code in ['en', 'ja', 'de', 'na', 'fr', 'zh', 'ru', 'es']
}

In [None]:
sums = {}
for key in lang:
    sums[key] = lang[key].iloc[:,1:].sum(axis=0) / lang[key].shape[0]

In [None]:
days = list(range(sums['en'].shape[0]))

fig = plt.figure(1,figsize=[10,10])
plt.ylabel('Views per Page')
plt.xlabel('Day')
plt.title('Pages in Different Languages')
labels={'en':'English','ja':'Japanese','de':'German',
        'na':'Media','fr':'French','zh':'Chinese',
        'ru':'Russian','es':'Spanish'
       }

for key in sums:
    plt.plot(days,sums[key],label = labels[key] )
    
plt.legend()
plt.show()

In [None]:
train_df_2=pd.read_csv("../input/web-traffic-time-series-forecasting/train_1.csv.zip")

In [None]:
train=pd.melt(train_df_2[list(train_df_2.columns[-50:])+['Page']], id_vars='Page', var_name='date', value_name='Visits')

In [None]:
train.head(3)

In [None]:
train.info()

In [None]:
train['date'] = train['date'].astype('datetime64[ns]')

In [None]:
train['weekend'] = ((train.date.dt.dayofweek) // 5 == 1).astype(float)

In [None]:
median = pd.DataFrame(train.groupby(['Page'])['Visits'].median())
median.columns = ['median']
mean = pd.DataFrame(train.groupby(['Page'])['Visits'].mean())
mean.columns = ['mean']


In [None]:
train = train.set_index('Page').join(mean).join(median)
train.reset_index(drop=False,inplace=True)
train['weekday'] = train['date'].apply(lambda x: x.weekday())

In [None]:
train['year']=train.date.dt.year 
train['month']=train.date.dt.month 
train['day']=train.date.dt.day

In [None]:
train.head(3)

In [None]:
plt.figure(figsize=(50, 2))
mean_g = train[['Page','date','Visits']].groupby(['date'])['Visits'].mean()
plt.plot(mean_g,c='red')
plt.title('Mean Page Views Per Date')
plt.show()

In [None]:
train[['Page','date','Visits']].groupby(['date'])['Visits'].mean().head(3)

In [None]:
plt.figure(figsize=(50, 10))
median_g = train[['Page','date','Visits']].groupby(['date'])['Visits'].median()
plt.plot(median_g, color = 'b')
plt.title('Time Series - median')
plt.show()

# MODELS

##  <span style='color:Green'>Types of time series analysis</span>

<span style='background :#e6ffff' >Classification:</span> Identifies and assigns categories to the data.
    
<span style='background :#e6ffff' >Curve fitting: </span>Plots the data along a curve to study the relationships of variables within the data.
    
<span style='background :#e6ffff' >Descriptive analysis:</span> Identifies patterns in time series data, like trends, cycles, or seasonal variation.

<span style='background :#e6ffff' >Explanative analysis:</span> Attempts to understand the data and the relationships within it, as well as cause and effect.

<span style='background :#e6ffff' >Exploratory analysis:</span> Highlights the main characteristics of the time series data, usually in a visual format.

<span style='background :#e6ffff' >Forecasting:</span> Predicts future data. This type is based on historical trends. It uses the historical data as a model for future data, predicting scenarios that could happen along future plot points.

<span style='background :#e6ffff' >Intervention analysis:</span> Studies how an event can change the data.

    
    

## <span style='color:Green'> Primary techniques and tools for time series analysis</span>

<span style='background : #f5ccff' >Box-Jenkins ARIMA models:</span> These univariate models are used to better understand a single time-dependent variable, such as temperature over time, and to predict its future values. They assume that the data is stationary, so analysts must first remove as much trend and seasonality from past data points as they can, typically by differencing. The ARIMA model supports this directly: it combines autoregressive terms, differencing (including seasonal difference operators in its seasonal variant), and moving-average terms.

<span style='background : #f5ccff' >Box-Jenkins Multivariate Models:</span>  Multivariate models are used to analyze more than one time-dependent variable, such as temperature and humidity, over time.

<span style='background : #f5ccff' >Holt-Winters Method:</span>  The Holt-Winters method is an exponential smoothing technique. It is designed to predict outcomes, provided that the data points include seasonality.
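
To make the stationarity requirement of ARIMA models more concrete, the sketch below applies differencing to a synthetic series with a linear trend and a weekly pattern. All numbers are invented for illustration; real traffic data would also contain noise, so the result would be only approximately constant.

```python
import numpy as np

# Synthetic series (made-up numbers): linear trend plus a weekly pattern.
t = np.arange(28)
weekly_pattern = np.array([5.0, 3.0, 0.0, -1.0, 2.0, -4.0, -5.0])
series = 10.0 + 0.5 * t + weekly_pattern[t % 7]

# First difference removes the linear trend but leaves the weekly pattern.
first_diff = np.diff(series)

# Seasonal difference at lag 7 cancels the weekly pattern; what remains is
# the trend's contribution over one week (7 * 0.5 = 3.5), a constant.
seasonal_diff = series[7:] - series[:-7]
print(seasonal_diff)
```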

## <span style='color:Green'> PROPHET</span>

## What is Prophet?

<span style='background : #f5ccff' >Prophet is an open-source library, released in 2017 and available for R and Python, that helps users analyze and forecast time series values. It was built to make time series analysis possible without expert knowledge, so it is user-friendly for non-experts while remaining highly customizable.</span>

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data.

## The Prophet Forecasting Model

Prophet uses a decomposable time series model with three main model components: trend, seasonality, and holidays. They are combined in the following equation:

<span style='background : #f5ccff' >y(t) = g(t) + s(t) + h(t) + εt </span>

<span style='background : #f5ccff' >g(t) </span> : piecewise linear or logistic growth curve for modeling non-periodic changes in time series

<span style='background : #f5ccff' >s(t) </span>: periodic changes (e.g. weekly/yearly seasonality)

<span style='background : #f5ccff' >h(t) </span>: effects of holidays (user provided) with irregular schedules

<span style='background : #f5ccff' >εt </span>: error term accounts for any unusual changes not accommodated by the model
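
The additive structure can be sketched with NumPy by building each component separately and summing them. This only illustrates the decomposition, not how Prophet fits its parameters; the changepoint, amplitudes, and holiday dates below are invented.

```python
import numpy as np

days = np.arange(60)

# g(t): piecewise linear trend with one changepoint at day 30 (hypothetical).
g = np.where(days < 30, 100 + 1.0 * days, 130 + 0.2 * (days - 30))

# s(t): weekly seasonality modeled as a sine wave with period 7.
s = 5.0 * np.sin(2 * np.pi * days / 7)

# h(t): a holiday effect on two made-up dates, zero everywhere else.
h = np.zeros(days.size)
h[[24, 25]] = -20.0

# Error term: random noise not explained by the other components.
rng = np.random.default_rng(0)
eps = rng.normal(scale=1.0, size=days.size)

# The observed series is the sum of all four terms.
y = g + s + h + eps
```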


In [None]:
train = pd.read_csv("../input/web-traffic-time-series-forecasting/train_1.csv.zip")
keys = pd.read_csv("../input/web-traffic-time-series-forecasting/key_1.csv.zip")

In [None]:
from fbprophet import Prophet

In [None]:
means =  pd.DataFrame(mean_g).reset_index(drop=False)
means['weekday'] =means['date'].apply(lambda x: x.weekday())

means['Date_str'] = means['date'].apply(lambda x: str(x))

# Create year, month and day columns by splitting the date string on hyphens.
# n=2 limits the split to three parts; the third part still carries the time.
means[['year','month','day']] = pd.DataFrame(means['Date_str'].str.split('-', n=2).tolist(), columns = ['year','month','day'])

# Split the day column on the space to separate the day from the time part.
# The following cells show what the day column looked like before this step.
date = pd.DataFrame(means['day'].str.split(' ', n=2).tolist(), columns = ['day','other'])
means['day'] = date['day']*1




In [None]:
pd.DataFrame(means['Date_str'].str.split('-', n=2).tolist(), columns = ['year','month','day']).head(3)

In [None]:
date.head(3)

In [None]:
means.drop('Date_str',axis = 1, inplace =True)
means.head()

In [None]:
pip install pystan==2.19.1.1

In [None]:
pip install fbprophet==0.6.0

# -------------------------------------------------------------------------------------------------------

### Methods used here in prophet

###  <span style='background : #f5ccff'>1) make_future_dataframe:</span> Make dataframe with future dates for forecasting.

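<span style='color:Green'>make_future_dataframe(m, periods, freq = "day", include_history = TRUE)</span>

This is the R signature. In Python, the same function is a method of the fitted model: `m.make_future_dataframe(periods, freq='D', include_history=True)`.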

Arguments
    
<span style='background : #f5ccff' >m:</span> Prophet model object.

<span style='background : #f5ccff' >periods	:</span> Int number of periods to forecast forward.

<span style='background : #f5ccff' >freq :</span> In R: 'day', 'week', 'month', 'quarter', 'year', 1 (1 second), 60 (1 minute) or 3600 (1 hour). In Python: a pandas frequency string such as 'D' for daily.

<span style='background : #f5ccff' >include_history	:</span> Boolean to include the historical dates in the data frame for predictions.
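
To show what such a future dataframe looks like, here is a rough pandas-only approximation. The helper `future_dates` is hypothetical and is not Prophet's implementation; it only mimics the `periods`, `freq`, and `include_history` arguments described above.

```python
import pandas as pd

def future_dates(history_ds, periods, freq="D", include_history=True):
    """Return a frame with a 'ds' column extended `periods` steps past the last date."""
    history_ds = pd.to_datetime(pd.Series(history_ds)).reset_index(drop=True)
    last_date = history_ds.max()
    # Generate periods + 1 stamps starting at the last date, then drop the
    # first one so that only strictly later dates remain.
    new_dates = pd.date_range(start=last_date, periods=periods + 1, freq=freq)[1:]
    if include_history:
        all_dates = pd.concat([history_ds, pd.Series(new_dates)], ignore_index=True)
    else:
        all_dates = pd.Series(new_dates)
    return pd.DataFrame({"ds": all_dates})

# Five made-up historical dates, extended by three days.
history = pd.date_range("2016-12-01", periods=5, freq="D")
future = future_dates(history, periods=3)
print(future)
```

With `include_history=True` the result contains both the historical and the new dates, so predictions cover the fitted range as well as the forecast horizon.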

####  <span style='background : #f5ccff'>2) prophet_plot_components: </span>
Plots the components of a Prophet forecast. In R, this prints a ggplot2 figure (ggplot2 is R's grammar-of-graphics system, in which a chart is assembled from separate components that are mapped to the data and stitched together into a complete plot). In Python, the equivalent call is `m.plot_components(forecast)`, which returns a matplotlib figure. The plot shows whichever of these components are available: trend, holidays, weekly seasonality, yearly seasonality, and additive and multiplicative extra regressors.



In [None]:
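import seaborn as sns
sns.set(font_scale=1)

# Daily mean visits, indexed by date.
date_index = means[['date','Visits']]
date_index = date_index.set_index('date')

# Prophet expects exactly two columns: 'ds' (dates) and 'y' (values).
prophet = date_index.copy()
prophet.reset_index(drop=False,inplace=True)
prophet.columns = ['ds','y']

# Fit the model with its default settings.
m = Prophet()
m.fit(prophet)

# Extend the dates 30 days past the end of the history and predict.
future = m.make_future_dataframe(periods=30,freq='D')
forecast = m.predict(future)

# Plot the observed points, the forecast and its uncertainty interval.
fig = m.plot(forecast)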

 <span style='background : #f5ccff'>This is the plot of the forecast.</span>

 <span style='background : #f5ccff'>The make_future_dataframe method is one big reason why Prophet is so user friendly. Building a dataset of future dates is usually an unpleasant step in time series analysis because it requires datetime handling; with Prophet, you only give the length of the future period and it returns the dataframe you need. Another useful property is that Prophet has no problem with missing data: if you set values to NA in the history but keep their dates in the future dataframe, Prophet will give you a prediction for those dates.</span>
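
As a sketch of the missing-data handling described above, the snippet below marks a value as missing while keeping its date in the history. The dates, values, and the threshold of 1000 are invented for illustration, and only the data preparation is shown.

```python
import numpy as np
import pandas as pd

# Hypothetical history in the two-column layout Prophet expects: ds and y.
history = pd.DataFrame({
    "ds": pd.date_range("2016-12-01", periods=10, freq="D"),
    "y": [100.0, 104.0, 98.0, 5000.0, 101.0, 97.0, 103.0, 99.0, 102.0, 100.0],
})

# Blank out a suspicious spike instead of deleting its row, so the date
# stays in the frame while the value is treated as missing.
history.loc[history["y"] > 1000, "y"] = np.nan
print(history)
```

A frame prepared this way would then be passed to `m.fit(...)` in place of the original history.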

In [None]:
m.plot_components(forecast)