For the fully rendered notebook, please visit this direct link: https://nbviewer.jupyter.org/github/Cybernorse/WA-jupyter-notebooks/blob/main/web%20traffic%20forcasting.ipynb  

This notebook predicts unique visitors for the website http://statforecasting.com/, for which we have a daily visitor traffic dataset.

Dataset and data description can be found at:
https://www.kaggle.com/bobnau/daily-website-visitors

In [None]:
import numpy as np 
import pandas as pd
import pandas_profiling
import warnings
warnings.filterwarnings('ignore')
import datetime
from datetime import date

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style("whitegrid")

# import chart_studio.plotly as py
import cufflinks as cf
import plotly.express as px

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

cf.go_offline()

import plotly.graph_objects as go

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
import xgboost as xg
# from prophet import Prophet

Load the dataset, rename the columns, remove the thousands separators (commas) from the values, and convert the count columns to integers.

In [None]:
df=pd.read_csv('../input/daily-website-visitors/daily-website-visitors.csv')

df.rename(columns = {'Day.Of.Week':'day_of_week'
                    ,'Page.Loads':'page_loads'
                    ,'Unique.Visits':'unique_visits'
                    ,'First.Time.Visits':'first_visits'
                    ,'Returning.Visits':'returning_visits'}, inplace = True)

df=df.replace(',','',regex=True)

df['page_loads']=df['page_loads'].astype(int)
df['unique_visits']=df['unique_visits'].astype(int)
df['first_visits']=df['first_visits'].astype(int)
df['returning_visits']=df['returning_visits'].astype(int)

df
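The cleaning step above can be illustrated on a tiny frame. This is a minimal sketch, not the notebook's own code: the two rows are made-up values, and the date format `%m/%d/%Y` is an assumption about how the CSV stores dates. It restricts the comma removal to the count columns and also parses `Date`, which the notebook leaves as a string.

```python
import pandas as pd

# Toy frame mimicking the raw CSV, where counts are stored as strings like "2,146".
# The values below are invented for illustration.
raw = pd.DataFrame({
    "Date": ["9/14/2014", "9/15/2014"],
    "page_loads": ["2,146", "3,621"],
    "unique_visits": ["1,582", "2,528"],
})

count_cols = ["page_loads", "unique_visits"]
clean = raw.copy()
for col in count_cols:
    # Strip thousands separators only in the count columns, then cast to int
    clean[col] = clean[col].str.replace(",", "", regex=False).astype(int)

# Assumed date format; a parsed datetime column gives a proper time axis in plots
clean["Date"] = pd.to_datetime(clean["Date"], format="%m/%d/%Y")

print(clean.dtypes)
```

`pd.read_csv` also accepts a `thousands=","` argument, which may make the separate replace step unnecessary.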

Check for null values, if any.

In [None]:
df.isna().sum()

Check for duplicate rows, if any.

In [None]:
df.duplicated().sum()

In [None]:
df.info()

A line plot of page loads and visits over time. All four series fluctuate in a regular, repeating pattern and move together, which suggests that they follow a shared pattern over time and are correlated with each other.

In [None]:
px.line(df,x='Date',y=['page_loads' ,'unique_visits' ,'first_visits' ,'returning_visits'],
       labels={'value':'Visits'}
       ,title='Page Loads & visitors over Time')

This histogram shows the distribution of daily unique visits, with the bars colored by day of the week.

From this plot it is hard to tell which day had the most unique visitors, so we will explore further.

In [None]:
px.histogram(df,x='unique_visits',color='Day',title='unique visits for each day')

This bar plot makes it clear that Tuesday, Wednesday, Monday and Thursday are the days of the week on which the website receives the most traffic.

In [None]:
day_imp=df.groupby(['Day'])['unique_visits'].agg(['sum']).sort_values(by='sum',ascending=False)
px.bar(day_imp,labels={'value':'sum of unique visits'},title='Sum of Unique visits for each day')
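A per-day sum depends on how many Mondays, Tuesdays and so on fall inside the date range, so a per-day mean can be a useful cross-check. The sketch below uses a made-up toy frame with the same `Day` and `unique_visits` column names; the `order` list is introduced here only to display the days in calendar order.

```python
import pandas as pd

# Invented toy data with the notebook's column names
toy = pd.DataFrame({
    "Day": ["Monday", "Tuesday", "Sunday", "Monday", "Tuesday", "Sunday"],
    "unique_visits": [300, 320, 150, 310, 340, 170],
})

# Calendar order for display (hypothetical helper list)
order = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]

by_day = (
    toy.groupby("Day")["unique_visits"]
       .agg(["sum", "mean"])   # total and average traffic per weekday
       .reindex(order)         # calendar order instead of alphabetical
       .dropna()               # drop weekdays absent from the toy data
)
print(by_day)
```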

Sum of unique visits for each weekday over time. We know which days get the most traffic, but in which time intervals? This graph answers that question.

The dates are binned into intervals and the unique visits in each bin are split by day, so we can see in which days, months and years the website got the most traffic.

In [None]:
px.histogram(df,x='Date',y='unique_visits',color='Day',title='Sum of unique visits for each day over Time')

Get the sum of page_loads, unique_visits, first_visits and returning_visits for each day of the week.

In [None]:
sums=df.groupby(['Day'])[['page_loads' ,'unique_visits' ,'first_visits' ,'returning_visits']].sum().sort_values(
    by='unique_visits',ascending=False)
sums

This grouped bar chart is built from the table above and shows the sum of page_loads, unique_visits, first_visits and returning_visits for each day.

In [None]:
px.bar(sums,barmode='group',title='Sum of page loads and visits for each of their days')

This density heatmap shows how the values of the page_loads, unique_visits, first_visits and returning_visits columns are distributed over time. First visits seem to track unique visits closely.

The yellow cells mark where observations are most concentrated. The plot hints at a strong relationship between first visits and unique visits, but it does not tell us how strong, so let's find that out.

In [None]:
px.density_heatmap(df, x='Date',y=['page_loads' ,'unique_visits' ,'first_visits' ,'returning_visits']
#                    color_continuous_scale="Viridis"
                   ,marginal_x="histogram", marginal_y="histogram",title='Correlation for each data point')

This heatmap shows the pairwise correlation of the page_loads, unique_visits, first_visits and returning_visits columns, annotated with the coefficients. First visits and unique visits have a correlation of 0.99, which is very strong, and page loads are also well correlated with our target variable.

Let's see what the correlation looks like in the next plot.

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(8, 6)
sns.heatmap(df[['page_loads' ,'unique_visits' ,'first_visits' ,'returning_visits']].corr(),
            annot=True,
            cmap='viridis_r', 
            fmt='g')
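To read the correlations with the target as a ranked list rather than a matrix, one option is to take a single column of `DataFrame.corr()`. The sketch below runs on synthetic data: the toy frame assumes, purely for illustration, that unique visits are the sum of first and returning visits, which has not been verified against the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic counts; ranges are arbitrary illustration values
first = rng.integers(1000, 4000, size=200)
returning = rng.integers(200, 800, size=200)
toy = pd.DataFrame({
    "first_visits": first,
    "returning_visits": returning,
    "unique_visits": first + returning,  # assumption made for the toy data only
})

# Pearson correlation of every feature with the target, strongest first
corr_with_target = (
    toy.corr()["unique_visits"]
       .drop("unique_visits")
       .sort_values(ascending=False)
)
print(corr_with_target)
```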

This scatter matrix shows the pairwise scatter plots of the page_loads, unique_visits, first_visits and returning_visits columns. Unique visits and first visits fall along a straight upward line, which means that first visits increase as unique visits increase. We can also inspect the other pairs and judge their level of correlation visually.

The last thing we need is to visualize the trend line.

In [None]:
px.scatter_matrix(df[['page_loads' ,'unique_visits' ,'first_visits' ,'returning_visits']])

Now we have the regression line pointing upward, which confirms the trend between these two columns.

In [None]:
px.scatter(
    df, x='first_visits', y='unique_visits',opacity=0.4,
    trendline='ols', trendline_color_override='purple',title="Regression line for unique visits and first visits"
)
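The trend line drawn by `trendline='ols'` is an ordinary least squares fit (Plotly relies on the statsmodels package for it). As a rough sketch of what such a fit computes, the slope and intercept can be estimated with `np.polyfit`; the five points below are invented and only stand in for the `first_visits` and `unique_visits` columns.

```python
import numpy as np

# Invented points: a line y = 1.2 * x + 100 with small deviations
x = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
y = 1.2 * x + 100 + np.array([5.0, -5.0, 0.0, 5.0, -5.0])

# Degree-1 least squares fit returns (slope, intercept)
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson correlation coefficient between the two variables
r = np.corrcoef(x, y)[0, 1]

print(f"slope={slope:.3f}, intercept={intercept:.1f}, r={r:.4f}")
```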

There are no outliers that need to be dealt with. The data is tightly packed with little dispersion, except for returning visits; this column was also less correlated with our target variable.

In [None]:
px.violin(df,y=['page_loads' ,'unique_visits' ,'first_visits' ,'returning_visits'],box=True,points='all')
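The visual check can be complemented with a count. A common convention, and the one box plots typically use for their whiskers, flags points more than 1.5 times the interquartile range beyond the quartiles. The helper name `iqr_outlier_count` and the sample series are hypothetical.

```python
import pandas as pd

def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(mask.sum())

# Invented series with one obvious outlier
toy = pd.Series([100, 110, 120, 130, 140, 150, 1000])
print(iqr_outlier_count(toy))
```

Applied per column, for example with `df[cols].apply(iqr_outlier_count)`, this gives one count for each of the four traffic columns.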

Starting the feature engineering.

We only need these columns:

In [None]:
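# .copy() makes pred_df an independent frame, so adding columns later does not act on a view of df
pred_df=df[['page_loads' ,'unique_visits' ,'first_visits' ,'returning_visits','Day']].copy()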

Tuesday, Wednesday, Thursday and Monday are the days when the website received the most traffic, so we create a binary feature days_f: 1 marks these four days and 0 marks the rest.

In [None]:
# 1 for the four high-traffic days, 0 for the rest
busy_days=['Monday','Tuesday','Wednesday','Thursday']
pred_df['days_f']=df['Day'].isin(busy_days).astype(int)

pred_df

Multiple Linear Regression model

In [None]:
# drop the Day column as we don't need it anymore
pred_df=pred_df.drop(columns='Day')

In [None]:
pred_df.head(5)

Separate the independent variables from the dependent (target) variable.

In [None]:
X2=pred_df[['page_loads','first_visits' ,'returning_visits','days_f']]
y2=pred_df['unique_visits']

Split the dataset into train and test samples.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X2, y2, test_size=0.3,random_state=42)
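The split above shuffles rows, so test days are interleaved with training days. For a forecasting-style evaluation, an alternative is to keep the rows in date order and hold out the most recent part. The sketch below shows that variant on a made-up ten-day frame; it is not what the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented ten-day series, already sorted by date
toy = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "unique_visits": np.arange(10),
})

# shuffle=False keeps row order, so the last 30% of days form the test set
train, test = train_test_split(toy, test_size=0.3, shuffle=False)

print(train["Date"].max(), "<", test["Date"].min())
```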

Train the model on the training sample.

In [None]:
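# normalize= was removed from LinearRegression in scikit-learn 1.2; it was ignored anyway when fit_intercept=False
regressor2 = LinearRegression(fit_intercept=False)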
regressor2.fit(X_train, y_train)

In [None]:
y_pred2 = regressor2.predict(X_test)

In [None]:
lr2 = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred2})
lr2

Visualize the actual and predicted values.

In [None]:
plt.figure(figsize=(16,8))
sns.lineplot(data=lr2)

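Get the R² score of the model on the test set, as a percentage. For regressors, score returns the coefficient of determination, not a classification accuracy.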

In [None]:
regressor2.score(X_test,y_test)*100
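For scikit-learn regressors, `score` returns the coefficient of determination R². Error metrics in the units of the target can make it easier to interpret. The sketch below computes R², MAE and RMSE on four invented values that stand in for `y_test` and `y_pred2`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented actual and predicted values
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 390.0])

r2 = r2_score(y_true, y_hat)                        # same quantity as .score()
mae = mean_absolute_error(y_true, y_hat)            # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_hat))   # root mean squared error

print(f"R2={r2:.3f}  MAE={mae:.1f}  RMSE={rmse:.1f}")
```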

Support Vector Regression

In [None]:
svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.00001)
svr_rbf.fit(X_train, y_train)

In [None]:
y_pred3 = svr_rbf.predict(X_test)

In [None]:
svr = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred3})
svr

In [None]:
plt.figure(figsize=(16,8))
sns.lineplot(data=svr)

In [None]:
svr_rbf.score(X_test,y_test)*100
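An RBF-kernel SVR is sensitive to the scale of its inputs, which is one reason a very small `gamma` is needed when fitting on raw counts. A commonly used alternative is to standardize the features inside a pipeline. The sketch below is an assumption-laden illustration on synthetic data, not a tuned replacement for the model above; `gamma="scale"` and the data ranges are arbitrary choices.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(42)

# Synthetic features on a count-like scale; target is their sum (toy assumption)
X = rng.uniform(0, 5000, size=(300, 2))
y = X[:, 0] + X[:, 1]

# The scaler is fitted on the training rows only, then applied before the SVR
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1e3, gamma="scale"))
model.fit(X[:200], y[:200])

pred = model.predict(X[200:])
score = model.score(X[200:], y[200:])  # R^2 on the held-out rows
print(round(score, 3))
```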

Decision Tree Regression

In [None]:
dtr = DecisionTreeRegressor(random_state=0)
dtr.fit(X_train, y_train)

In [None]:
dtr_pred = dtr.predict(X_test)

In [None]:
dtr_g = pd.DataFrame({'Actual': y_test, 'Predicted': dtr_pred})
dtr_g

In [None]:
plt.figure(figsize=(16,8))
sns.lineplot(data=dtr_g)

In [None]:
dtr.score(X_test,y_test)*100
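An unconstrained decision tree can memorize the training data, so its depth is often tuned with cross-validation. The notebook imports `GridSearchCV` but does not use it; the sketch below shows one way it could be applied. The data is synthetic and the candidate depths are arbitrary illustration values.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression problem with some noise
X = rng.uniform(0, 5000, size=(200, 2))
y = X[:, 0] + X[:, 1] + rng.normal(0, 50, size=200)

# 5-fold cross-validated search over tree depth, scored by R^2
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4, 8, None]},
    cv=5,
    scoring="r2",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```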

XGBoost regression

In [None]:
xgb_r = xg.XGBRegressor(objective ='reg:squarederror',n_estimators = 10, seed = 123)
xgb_r.fit(X_train, y_train)

In [None]:
xgb_pred = xgb_r.predict(X_test)

In [None]:
xgb_df = pd.DataFrame({'Actual': y_test, 'Predicted': xgb_pred})
xgb_df

In [None]:
plt.figure(figsize=(16,8))
sns.lineplot(data=xgb_df)

In [None]:
xgb_r.score(X_test,y_test)*100