<a href="https://colab.research.google.com/github/saurabh231088/FBEVAL/blob/master/Google_Analytics_Forecasting_v1_1share.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Website Traffic Prediction With Colaboratory and Facebook Prophet

A few resources to read up on Prophet:


*   https://research.fb.com/prophet-forecasting-at-scale/
*   https://facebook.github.io/prophet/
*   http://pbpython.com/prophet-overview.html

At its core, the Prophet procedure is an additive regression model with four main components:

*   A piecewise linear or logistic growth curve trend. Prophet automatically detects changes in trends by selecting changepoints from the data.
*   A yearly seasonal component modeled using Fourier series.
*   A weekly seasonal component using dummy variables.
*   A user-provided list of important holidays.
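
To make the additive structure above concrete, here is a minimal sketch of a linear trend plus a yearly Fourier-series term, built with NumPy only. The function name `fourier_features`, the series order, and all coefficients are illustrative assumptions; this is not Prophet's internal code.

```python
import numpy as np

def fourier_features(t_days, period=365.25, order=3):
    """Return a (len(t_days), 2 * order) matrix of sin/cos terms."""
    t = np.asarray(t_days, dtype=float)
    cols = []
    for k in range(1, order + 1):
        angle = 2.0 * np.pi * k * t / period
        cols.append(np.sin(angle))
        cols.append(np.cos(angle))
    return np.column_stack(cols)

# Toy additive model: linear trend + yearly seasonality.
# The coefficients below are made up for illustration.
t = np.arange(730)                      # two years of daily steps
X = fourier_features(t, order=3)        # seasonal design matrix
beta = np.array([0.5, 0.2, 0.1, 0.0, 0.05, 0.0])
trend = 10.0 + 0.01 * t
y = trend + X @ beta                    # components simply add up
```

In a fitted model the coefficients would be estimated from the data rather than chosen by hand.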

This notebook adds historical and future major holiday data, Google algorithm update history (Moz), as well as tools for getting and plotting Google Analytics data forecasts.

**Note**: Works best in Google Chrome. Some features, like downloading CSVs, do not seem to work in other browsers.



## Install needed libraries in Colab
This will attempt to install Facebook Prophet, its requirement pystan, and a custom Google Analytics library updated to run on Colaboratory.
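
If you want to confirm what actually got installed, a minimal sketch using only the standard library is shown below. The helper name `missing_packages` is hypothetical, and the import names are assumed to match the ones used in the import cell of this notebook.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of top-level module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Import names assumed from the import cell of this notebook.
needed = ["pystan", "fbprophet", "googleanalytics"]
print("Still to install:", missing_packages(needed))
```

An empty list means every library can be imported and the install cell does not need to be re-run.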


In [0]:
# Install Libraries (This may need to be done first each time the notebook is used here.  Takes a few minutes to install)
from IPython.display import clear_output
try:
  !pip install pystan
  !pip install --upgrade git+https://github.com/jroakes/google-analytics.git
  !pip install fbprophet
except:
  pass
finally:
  clear_output()
  print('All Loaded')

## Upload some datapoint files
This cell downloads a module of helper functions, prior algorithm update dates (from Moz), and bank holiday dates through 2020, unless the files are already present.
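
The dates in these files are later turned into event windows for Prophet. The sketch below shows the shape of such a frame, using the same column names as the run_model function further down. The helper names `make_event_frame` and `affected_days` are hypothetical, and the date is only an example.

```python
import pandas as pd

def make_event_frame(name, dates, lower_window, upper_window, prior_scale):
    """Build a holidays-style frame: one row per event date."""
    return pd.DataFrame({
        "holiday": name,
        "ds": pd.to_datetime(list(dates)),
        "lower_window": lower_window,
        "upper_window": upper_window,
        "prior_scale": prior_scale,
    })

def affected_days(events):
    """List every calendar day covered by the event windows."""
    days = []
    for row in events.itertuples(index=False):
        for offset in range(int(row.lower_window), int(row.upper_window) + 1):
            days.append(row.ds + pd.Timedelta(days=offset))
    return sorted(set(days))

# Example date only; the notebook reads the real dates from holidays.csv.
bank = make_event_frame("bank_holidays", ["2018-12-25"], -5, 5, 1)
covered = affected_days(bank)
print(covered[0].date(), "to", covered[-1].date(), "=", len(covered), "days")
```

A window of -5 to 5 therefore lets one holiday influence eleven days of traffic.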



In [0]:
import os.path

if not os.path.isfile('holidays.csv'):
  !wget https://raw.githubusercontent.com/jroakes/google-analytics/master/examples/holidays.csv
if not os.path.isfile('algo_updates.csv'):
  !wget https://raw.githubusercontent.com/jroakes/google-analytics/master/examples/algo_updates.csv
if not os.path.isfile('helpers.py'):
  !wget https://raw.githubusercontent.com/jroakes/google-analytics/master/examples/helpers.py

## Import needed libraries and add settings
Imports needed libraries.  Also updates some settings needed to make this run more smoothly.

**NOTE**: This includes limited Analytics API credentials. Please do not share them. If you copy this notebook, please update the credentials to your own.

**NOTE**: If you see errors on the first run, try running the cell again once or twice. This appears to be a pip/Colab issue rather than a problem with this code.


In [0]:
# Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from fbprophet import Prophet
import googleanalytics as ga
import datetime
import warnings
import logging
from helpers import Struct, get_months, print_profiles
from google.colab import files
from google.colab import auth
from oauth2client.client import GoogleCredentials
from IPython.core.display import display, HTML, clear_output

# Settings
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)
warnings.simplefilter(action='ignore', category=FutureWarning)

%matplotlib inline
plt.style.use('seaborn-colorblind')
plt.rcParams['figure.figsize'] = 8.4, 6.8


identity = "analytics_access"
client_id = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.apps.googleusercontent.com"
client_secret = "XXXXXXXXXXXXXXXXXXXXXXX"

## Let's collect some data

Select the Account, Web Property, and Profile from your Google Analytics account (names, not IDs). Then provide the historical data either as a Specific Range or as Months Prior. If Months Prior is greater than zero (0), it is used; otherwise the Specific Range is used.

Finally, select how many future months you want to predict, and enter the maximum daily volume (max_available_volume) possible for your specific niche (think amazon.com vs. joeshardware.com). If the maximum volume is zero (0), linear growth is used; otherwise logistic growth is used. You can also omit daily values above a certain threshold (omit_values_over) to remove outlier days where traffic spiked as a one-off occurrence.

If save_output is "yes", a CSV will be downloaded when the model is run. You can use this for ad-hoc plotting in Excel.
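
The selection rules above can be summarized in a few lines. This is a sketch only: `resolve_settings` is a hypothetical helper that is not part of the notebook, and the 30.42 days-per-month factor is the one used later in run_model.

```python
def resolve_settings(prior_months, start_date, end_date,
                     future_months, max_available_volume):
    """Mirror the form rules: history source, growth mode, forecast horizon."""
    if prior_months and int(prior_months) > 0:
        history = {"mode": "months_prior", "months": int(prior_months)}
    else:
        history = {"mode": "specific_range", "start": start_date, "end": end_date}

    # A positive cap switches the trend from linear to logistic growth.
    if max_available_volume and int(max_available_volume) > 0:
        growth = "logistic"
    else:
        growth = "linear"

    # Months are converted to days with an average month length.
    horizon_days = int(int(future_months) * 30.42)
    return history, growth, horizon_days

print(resolve_settings(36, "2016-03-10", "2018-03-10", 12, 0))
print(resolve_settings(0, "2016-03-10", "2018-03-10", 6, 5000))
```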


In [0]:
#@title Google Analytics { run: "auto", form-width: "50%", display-mode: "form" }
ga_account = "" #@param {type:"string"}
ga_webproperty = "" #@param {type:"string"}
ga_profile = "" #@param {type:"string"}
ga_segment = "organic traffic" #@param ["all users", "organic traffic", "direct traffic", "referral traffic", "mobile traffic", "tablet traffic"] {type:"string"}
ga_metric = "sessions" #@param ["sessions", "pageviews", "unique pageviews", "transactions"] {type:"string"}








In [0]:
#@title Historical Data (Specific Range) { run: "auto", form-width: "50%", display-mode: "form" }
ga_start_date = "2016-03-10" #@param {type:"date"}
ga_end_date = "2018-03-10" #@param {type:"date",name:"GA Date"}


In [0]:
#@title Historical Data (Months Prior) { run: "auto", form-width: "50%", display-mode: "form" }
prior_months = 36 #@param {type:"integer"}

In [0]:
#@title Prediction Data { run: "auto", form-width: "50%", display-mode: "form" }

future_months = 12 #@param {type:"integer"}
max_available_volume = 0  #@param {type:"integer"}
omit_values_over = 400 #@param {type:"integer"}
save_output = "no" #@param ["yes", "no"] {type:"string"}


## Authenticate
This will produce an authentication URL that you must click and follow to authorize access to your Google account. The account you choose should be the one with access to the property you want to analyze.

In [0]:
try:
  profile = ga.authenticate(
      client_id=client_id, 
      client_secret=client_secret, 
      identity=identity, 
      account=ga_account.strip(),
      webproperty=ga_webproperty.strip(),
      profile=ga_profile.strip(),     
      interactive=True
  )
except Exception as e:
  print('An error occurred', str(e))

  
  

## Run The Models

### Load needed functions
These are the functions that need to be defined before the models can run. There are settings in the run_model function that can be updated for your specific use case. The run_model function handles most of the heavy lifting for the prediction.
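
Before fitting, run_model blanks out zero and outlier days and moves the metric into log space, then converts the forecast back afterwards. The sketch below reproduces just that round trip on made-up numbers, so you can see why values below 1 have to be removed before taking the log.

```python
import numpy as np
import pandas as pd

# Made-up daily sessions: one zero day and one spike.
daily = pd.Series([0, 120, 95, 900, 110], dtype=float)
omit_values_over = 400

cleaned = daily.copy()
cleaned[cleaned < 1] = np.nan                  # log(0) is undefined
cleaned[cleaned > omit_values_over] = np.nan   # drop one-off spikes

log_y = np.log(cleaned)             # what the model is fitted on
restored = np.exp(log_y).round()    # back to the original units

print(restored.tolist())
```

The missing days stay missing through the transform, so they are simply left out of the fit rather than treated as zero traffic.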

In [0]:

def get_ga_data(profile, data):
  """Query Google Analytics for the daily metric and return it as a DataFrame."""

  try:
    if data.prior_months and int(data.prior_months) > 0:
      # Relative range: the last N months
      sessions = profile.core.query.metrics(data.ga_metric).segment(data.ga_segment).daily(months=(0 - int(data.prior_months))).report
    else:
      # Explicit start and end dates
      sessions = profile.core.query.metrics(data.ga_metric).segment(data.ga_segment).daily(data.ga_start_date, data.ga_end_date).report
  except Exception as e:
    print('Error retrieving data from Google Analytics.', str(e))
    # Re-raise: without a report there is nothing to convert below
    raise

  df = sessions.as_dataframe()

  df['date'] = pd.to_datetime(df['date'])

  return df

  


def run_model(df, data):
  
  max_daily = df[data.ga_metric].max()
  
  # Remove zero values
  df.loc[(df[data.ga_metric] < 1 ), data.ga_metric] = np.nan
  
  if data.prior_months and int(data.prior_months) > 0:
    prior = data.prior_months
    end_historical = datetime.date.today().strftime("%Y-%m-%d")
  else:
    prior = get_months(data.ga_start_date,data.ga_end_date )
    end_historical = data.ga_end_date
    

  if data.omit_values_over and int(data.omit_values_over) > 0:
    df.loc[(df[data.ga_metric] > data.omit_values_over), data.ga_metric] = np.nan
      
  if df[data.ga_metric].isnull().all():
    print("Error: omit_values_over is set to {} and the largest daily {} value is {}".format(str(data.omit_values_over),str(data.ga_metric), str(max_daily) ))
    return False, False
    
  # Take a look at a plot of the data
  df.set_index('date').plot(title="{}-month {} for {}".format(str(prior),data.ga_metric, data.ga_webproperty))

  # Convert traffic to a log value to understand the data's behavior linearly. 
  df[data.ga_metric] = np.log(df[data.ga_metric])

  # For the Prophet API, rename Day Index and Sessions to ds and y
  df.columns = ['ds', 'y']

  if data.max_available_volume and data.max_available_volume > 0:      
    # Add Cap
    df['cap'] = np.log(data.max_available_volume)
    growth = "logistic"
  else:
    growth = "linear"
    
  
  # Loading algorithm and holiday information
  al_df = pd.read_csv('algo_updates.csv')
  hol_df = pd.read_csv('holidays.csv')
  
  al_dates = pd.to_datetime(al_df['date'].tolist())
  hol_dates = pd.to_datetime(hol_df['date'].tolist())
    
  # Bank Holidays  
  c1 = pd.DataFrame({
    'holiday': 'bank_holidays',
    'ds': hol_dates,
    'prior_scale': 1,
    'lower_window': -5,
    'upper_window': 5,
  })
  
  # Algorithm Updates  
  c2 = pd.DataFrame({
    'holiday': 'prior_algorithm_updates',
    'ds': al_dates,
    'prior_scale':5,
    'lower_window': 0,
    'upper_window': 10,
  })

  calendar = pd.concat([c1,c2])

  # Fit the model to the data
  model = Prophet(growth = growth, holidays=calendar)
  model.fit(df)

  # Define how far in the future for Prophet to predict
  future = model.make_future_dataframe(periods=int(data.future_months*30.42))

  if data.max_available_volume and data.max_available_volume > 0: 
    future['cap'] = np.log(data.max_available_volume)

  # Apply predict
  forecast = model.predict(future)
  
  # Bring back from log space.
  trns_cols = [ 'trend', 'trend_lower', 'trend_upper', 'yhat_lower', 'yhat_upper', 'yhat']
  forecast[trns_cols] = np.exp(forecast[trns_cols]).round()
  model.history['y'] = np.exp(model.history['y']).round()
  
  # Use the metric passed in via data rather than the notebook-level global
  model.plot(forecast, xlabel='date', ylabel=data.ga_metric)
  
  model.plot_components(forecast)

  return model, forecast

### Run
This cell will output:


*   Plot of prior Analytics data in your specific metric.
*   Plot of the forecast (including prior data), converted back from log space to your original metric.
*   Plots of the component parts of the forecast, including holidays and weekly / yearly trends.

If save_output is "yes", the notebook will attempt to download a CSV of the forecast data.
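
Once you have the forecast, a common next step is rolling the daily values up to monthly totals. The sketch below does this with pandas on a made-up frame that only mimics the `ds` and `yhat` columns of a Prophet forecast; the values are placeholders, not real output.

```python
import numpy as np
import pandas as pd

# Stand-in for the forecast frame: 59 days at a flat 100 sessions per day.
forecast_like = pd.DataFrame({
    "ds": pd.date_range("2018-01-01", periods=59, freq="D"),
    "yhat": np.full(59, 100.0),
})

# Sum daily predictions into calendar months (labelled by month start).
monthly = (
    forecast_like.set_index("ds")["yhat"]
    .resample("MS")
    .sum()
)
print(monthly)
```

The same pattern should work on the downloaded forecast.csv after parsing the `ds` column as dates.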



In [0]:
data = {
        'ga_account': ga_account,
        'ga_webproperty': ga_webproperty,
        'ga_segment': ga_segment,
        'ga_metric': ga_metric,
        'ga_start_date': ga_start_date,
        'ga_end_date': ga_end_date,
        'prior_months': prior_months,
    
        'future_months': future_months,
        'max_available_volume': max_available_volume,
        'omit_values_over': omit_values_over
        }

data = Struct(**data)

# Get data from Analytics
datafile = get_ga_data(profile, data)


# Create Model and get forecast
model, forecast = run_model(datafile, data)

# Maybe save output (run_model returns False, False when no usable data remains)
if save_output == 'yes' and forecast is not False:
  forecast.to_csv('forecast.csv')
  files.download('forecast.csv')