#### Define functions for random forest model

Random Forest models are ensemble machine learning algorithms widely used for classification and regression predictive modelling problems.  They are an ensemble of [decision tree algorithms](https://en.wikipedia.org/wiki/Decision_tree_learning), each fitted to a slightly different sample of a single dataset in an attempt to produce a better overall result.  To validate such models, K-fold cross validation is often used.  This process involves repeatedly splitting the data into random training and test samples and retraining the model, so that it does not overfit to one training sample.  Unfortunately, this method is not necessarily suitable for modelling time series data, as it can lead to overly optimistic models that are trained on future data and evaluated on the past.
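
As a small illustration of the problem described above, the following sketch checks whether a splitter ever trains on observations that come after its test observations.  The ten-row dataset is made up, and scikit-learn's `TimeSeriesSplit` is included only as a point of comparison; it is not the validation method used in this notebook.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# Ten observations in time order; the row position stands in for the date.
X = np.arange(10).reshape(-1, 1)

def trains_on_future(splitter, X):
    """Return True if any fold trains on rows later than its earliest test row."""
    return any(train.max() > test.min() for train, test in splitter.split(X))

shuffled_kfold_leaks = trains_on_future(KFold(n_splits=5, shuffle=True, random_state=0), X)
ordered_split_leaks = trains_on_future(TimeSeriesSplit(n_splits=4), X)

print(f"Shuffled K-fold trains on future data: {shuffled_kfold_leaks}")
print(f"TimeSeriesSplit trains on future data: {ordered_split_leaks}")
```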

Jason Brownlee [documents a workflow](https://machinelearningmastery.com/random-forest-for-time-series-forecasting/) for Random Forest Time Series Forecasting, which is used in this notebook to explore its feasibility for predicting ambient population. This will be discussed later, as the data is processed and the models are run.

A note on cross validation - Bergmeir et al. (2018) found it to be feasible for time series when only lagged versions of the outcome variable are used as predictors.  It would therefore be unsuitable for this project, which seeks to use other explanatory variables alongside these lags.

Please note - <b>The Walk Forward Validation used later can take a long time and use a lot of PC resources, depending on the size of the dataset, the number of trees and the time lags used.  Please be selective in which sections of the notebook you run.  To avoid any problems when running it in batches, only run up to the cells that undertake validation.  These are clearly marked with a note later.</b>

In [1]:
import pandas as pd
import numpy as np
import datetime
import sklearn
from statsmodels.tsa.seasonal import seasonal_decompose
import plotly.express as px
import plotly.graph_objects as go
pd.options.plotting.backend = "plotly"
from plotly.subplots import make_subplots
from numpy import asarray
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.preprocessing import MinMaxScaler
from pandas.plotting import lag_plot
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor

import os

from source import *

# forecast daily footfall with random forest

if not os.path.exists("../images"):
    os.mkdir("../images")

Load in data and define some functions

In [2]:
#import merged footfall data
footfalldf_imported = pd.read_csv("../data/LCC_footfall_2021.gz",
                                  parse_dates=['Date','DateTime'],
                                  dtype={"BRCYear": int,"BRCWeekNum":int})


new_weather = pd.read_csv("../data/weatherdata.csv",parse_dates=['timestamp'],index_col='timestamp')
previous_weather = pd.read_csv("../data/overall_weather.csv",parse_dates=['date'],index_col='date',dayfirst=True)
bankhols = pd.read_csv('../data/ukbankholidays.csv',parse_dates=['ukbankhols'])
schoolterms = pd.read_csv('../data/schoolterms.csv',parse_dates=['date'],dayfirst=True,index_col='date',usecols=['schoolStatus','date'])

min_max_scaler = MinMaxScaler()

def model1_validation(laglist,footfall,tree):
    mae_list = []
    validation_df = pd.DataFrame()
    for lag in laglist:

        n_in=lag
        # transform the time series data into supervised learning and add additional explanatory variables
        processed_data = (footfall
                          .pipe(start_pipeline)
                          .pipe(series_to_supervised,n_in=n_in)
                          .pipe(create_lockdown_predictors)
                          .pipe(create_date_predictors)
                          .pipe(create_holiday_predictors,bankhols,schoolterms)
                          .pipe(create_weather_predictors, new_weather, previous_weather)
                          .pipe(drop_na)
                          .pipe(arrange_cols,n_in=n_in))

        cols_to_scale = ['mean_temp', 'wind_speed', 'rain', 'hosp_indoor', 'hosp_outdoor', 'hotels', 'ent_indoor', 'ent_outdoor',
                         'weddings', 'self_acc', 'sport_lei_indoor', 'sport_lei_outdoor', 'non_ess_retail', 'prim_sch',
                         'sec_sch', 'uni_campus', 'outdoor_grp_public', 'outdoor_grp_private', 'indoor_grp', 'eat_out']

        #variable to count how many days fall on or after 23rd March 2020
        lockdown_days = len(processed_data.loc[processed_data.index >= "2020-03-23"])
        #variable to count how many days fall on or after 1st January 2021
        model1_test_days = len(processed_data.loc[processed_data.index >= "2021-01-01"])

        #evaluate model using walk forward validation.  The test window is currently set to the last 30 days (the second argument, n_test).  To use all days on or after 1st January 2021, pass model1_test_days instead.  NOTE THIS WILL TAKE A LONG TIME TO VALIDATE.
        mae, y, yhat = walk_forward_validation(processed_data, 30, cols_to_scale, n_in,tree)
        mae_list.append(mae)
        validation_df[f'timelag_{lag}'] = yhat

    return mae_list, validation_df, y

def model2_validation(laglist,footfall,tree):
    mae_list = []
    validation_df = pd.DataFrame()
    for lag in laglist:

        n_in=lag
        # transform the time series data into supervised learning and add additional explanatory variables
        processed_data = (footfall
                      .pipe(start_pipeline)
                      .pipe(series_to_supervised,n_in=n_in)
                      .pipe(create_date_predictors)
                        .pipe(create_holiday_predictors,bankhols,schoolterms)
                        .pipe(create_weather_predictors, new_weather, previous_weather)
                      .pipe(drop_na)
                      .pipe(arrange_cols,n_in=n_in))

        #create list of columns that need scaling
        cols_to_scale = ['mean_temp', 'wind_speed', 'rain']

        #split dataset based on when first measures to reduce social contact were put in place.
        prelockdown_data = processed_data.loc[processed_data.index < "2020-03-16"]

        #evaluate model using walk forward validation.  The test window is currently set to the last 30 days of the pre-lockdown data (the second argument, n_test).  Increasing it will make validation take considerably longer.
        mae, y, yhat = walk_forward_validation(prelockdown_data, 30, cols_to_scale, n_in,tree)
        mae_list.append(mae)
        validation_df[f'timelag_{lag}'] = yhat

    return mae_list, validation_df, y

### Cleaning the data
The next step in the pipeline is to check for duplicates and remove them.  Initial data exploration revealed errors in some of the csv files where individual records had been duplicated.  In some instances, the same records existed in several different files, for example dates in early July appeared towards the end of the June csv.

The cameras don't all come online at the same time, with the last starting on 27th August 2008.  To ensure meaningful comparability, any records before this date have been removed.

Finally, one of the cameras appeared to have moved locations on 31st May 2015 from Commercial Street at Lush to Commercial Street at Sharps.  These are combined and renamed to Commercial Street Combined.

In [3]:
#Pipeline that imports csv files, creates a dataframe and applies cleaning functions
footfalldf = (footfalldf_imported
              .pipe(start_pipeline)
              .pipe(set_start_date, '2008-08-27')
              .pipe(combine_cameras)
              .pipe(check_remove_dup)
              .pipe(remove_new_cameras)
              .pipe(create_BRC_MonthNum))

Footfall hasn't changed when combining cameras
There are 0 duplicates left


In [4]:
#Resample into daily footfall.
daily_footfall = footfalldf.groupby( [pd.Grouper(key='DateTime',freq='D')])['Count'].sum().to_frame()
#dayfinal = pd.concat([day,frame],verify_integrity=True)
daily_footfall = daily_footfall.drop(daily_footfall[daily_footfall['Count'] == 0].index)
#Set frequency to daily, which creates rows for missing days, then drop those rows again.  The 'time' based interpolation is currently commented out.
daily_footfall = daily_footfall.asfreq('D').dropna()#.replace(0,np.nan).interpolate(method='time')


### Random Forest Models for Footfall

The following code blocks generate two Random Forest Models.

The first is trained on all available data up to the most recent (currently April 2021).  It includes lockdown specific conditions and other explanatory variables such as lagged outcomes, weather and time series specific inputs such as day of the week or month of the year. This could quantify how important lockdown variables were in explaining the changes in footfall that occur during periods of lockdown.

The second model is trained only on data up to the beginning of lockdown and ignores the lockdown conditions, given that it predicts 'business as usual'.  Essentially it pretends the pandemic never happened, so it only needs to consider the elements that were found to be important in [previous modelling of footfall data](https://github.com/nickmalleson/lcc-footfall/blob/master/LCC_Footfall.ipynb) by [Molly Asher](https://eps.leeds.ac.uk/civil-engineering/pgr/7524/molly-asher).

This approach explores several avenues.  First, it quantifies the changes and explains general footfall, and then it covers making predictions.  It would be useful to have the model with lockdown variables available for future prediction if they turn out to have good explanatory power compared with the usual predictors.

#### Model #1 - Explanation of lockdown variable strength when considering changes in footfall during pandemic period.

To create the models, data is processed as follows:

- Data is transformed into a supervised learning problem, with various time lags created using the pandas `shift()` method.  These lags become X input variables, with the original counts becoming the y output.  Each row in the dataset contains however many time shifts you've calculated as X inputs, plus any additional variables you'd like to include for explanatory power
- Creation of lockdown predictors to represent the different parts of society that were closed during lockdowns since March 2020. They cover things such as non-essential retail, hospitality, entertainment, sport/leisure, weddings, education and maximum group sizes.  These are a mix of label encoded and dummy variables.  Some of these might be subject to change depending on model performance
- Creation of Date predictors, time series specific input variables such as day of week, month of year or even whether the day is a weekend or not
- Creation of Holiday predictors that represent whether the day falls on a bank or school holiday
- Creation of Weather predictors that consist of mean daily temperature, total daily rainfall and mean daily wind speed
- Removal of missing values
- Arrange columns so that the observed y variable is the last in the dataframe.  This is required for walk-forward validation later on

This is undertaken for each model, with some changes that will be discussed in the relevant markdown section.
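
The first and third steps in the list above can be sketched as follows.  This is a minimal illustration rather than the `series_to_supervised` and `create_date_predictors` functions imported from `source`, whose implementations are not shown here.  The function name `make_lagged_frame`, the lag column names and the sample counts are all assumptions made for the example; only the `var1(t)` target name appears elsewhere in this notebook.

```python
import pandas as pd

def make_lagged_frame(counts: pd.Series, n_in: int = 1) -> pd.DataFrame:
    """Hypothetical helper: build lagged inputs, date predictors and the target."""
    frame = pd.DataFrame(index=counts.index)
    # Lagged copies of the series become the X inputs, oldest lag first.
    for lag in range(n_in, 0, -1):
        frame[f"var1(t-{lag})"] = counts.shift(lag)
    # Simple date predictors derived from the datetime index.
    frame["day_of_week"] = counts.index.dayofweek
    frame["is_weekend"] = (counts.index.dayofweek >= 5).astype(int)
    # The observed value goes last so that it can be split off as y.
    frame["var1(t)"] = counts
    # Rows without a full set of lags contain NaN and are dropped.
    return frame.dropna()

# Made-up daily counts, starting on a Friday.
dates = pd.date_range("2021-01-01", periods=6, freq="D")
counts = pd.Series([100, 120, 90, 150, 130, 110], index=dates, name="Count")
supervised = make_lagged_frame(counts, n_in=2)
print(supervised)
```

With `n_in=2` the first two days are lost, because they do not have two earlier observations to use as inputs.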

#### Walk Forward Validation

The walk-forward validation method iterates through a test set taken from the end of the data sequence.  At each step it uses the explanatory variables to predict the next value and compares the prediction with the observed value; the errors are then summarised as a Mean Absolute Error (MAE).  Once all the parameters have been set and the model finalised, variable importance can be extracted and predictions on out of sample data can be made.

<b>Please note that this method is relatively computationally expensive</b> as it loops through each row and retrains the model to include additional footfall values.  Only run if you want to see the outputs.  It is not required to see the results of the actual model.

The code can be adjusted to account for more/less lags and trees.  These are just for testing purposes.
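
The loop described above can be sketched as follows.  This is not the `walk_forward_validation` function imported from `source`, which also handles scaling and progress messages; it is a stripped-down version, assuming the target sits in the last column, run on made-up data with a weekly pattern.  The name `walk_forward_sketch` and its parameters are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def walk_forward_sketch(data: np.ndarray, n_test: int, n_trees: int = 50):
    """Illustrative walk-forward validation; the last column of `data` is the target."""
    train, test = data[:-n_test], data[-n_test:]
    history = list(train)
    predictions = []
    for row in test:
        past = np.asarray(history)
        # Refit on everything observed so far, then predict one step ahead.
        model = RandomForestRegressor(n_estimators=n_trees, random_state=0)
        model.fit(past[:, :-1], past[:, -1])
        predictions.append(model.predict(row[:-1].reshape(1, -1))[0])
        # The true observation joins the history before the next step.
        history.append(row)
    mae = mean_absolute_error(test[:, -1], predictions)
    return mae, test[:, -1], np.asarray(predictions)

# Synthetic series with a weekly cycle and a single lag as the only input.
rng = np.random.default_rng(0)
series = 100 + 20 * np.sin(np.arange(120) * 2 * np.pi / 7) + rng.normal(0, 2, 120)
data = np.column_stack([series[:-1], series[1:]])
mae, y, yhat = walk_forward_sketch(data, n_test=10, n_trees=50)
print(f"MAE over {len(y)} one-step forecasts: {mae:.2f}")
```

Because a new forest is fitted for every test row, the run time grows with both `n_test` and the number of trees, which is why the validation cells below are slow.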

In [5]:
#Run model 1 validation function using a range of trees
lags = [1,3,7]
trees = [100,200,500,1000]
figs = []
mae_df = pd.DataFrame()


for tree in trees:
    mae_list,validation_df,y = model1_validation(lags,daily_footfall,tree)
    validation_df['y'] = y
    fig = px.line(validation_df,title=f'Model 1 Validation - {tree} trees')
    figs.append(fig)
    fig.write_image(f"../images/Model1Validation{tree}.svg")
    mae_df[f'{tree}_trees'] = mae_list
    mae_df.to_csv(f"../data/model1mae.csv",index=False)

Validation has started on 100 trees with 1 time lag(s).  Please be patient, it may take a while and a message will be displayed when finished.
Validation has started on 100 trees with 3 time lag(s).  Please be patient, it may take a while and a message will be displayed when finished.
Validation has started on 100 trees with 7 time lag(s).  Please be patient, it may take a while and a message will be displayed when finished.


Plots the Mean Absolute Error for each number of trees.

In [6]:
mae_df = pd.read_csv("../data/model1mae.csv")
mae_df = mae_df.rename(index={0:'1 Lag',1: '3 Lag',2: '7 Lag'})

fig = mae_df.plot.bar(facet_col='variable')
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
fig.update_layout(showlegend=False,
                      title={
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

fig.write_image("../images/Model1mae.svg")
fig.show()

ValueError: Usecols do not match columns, columns expected but not found: [1, 2, 3, 4]

The next few cells plot the validation line charts for each number of trees.

In [None]:
figs[0].show()

In [None]:
figs[1].show()

In [None]:
figs[2].show()

In [None]:
figs[3].show()

The validation seems to show that the model works fairly well with the shifts in lockdown variables if validating on the period from 1st January 2021.  It's computationally quite expensive to validate on too much data but it will show that the model can follow the general trend of changes.

Reducing the number of trees certainly helps it to run quicker, which is good to know seeing as 200 seemed to be enough to produce a decent result.

There are a few other reasons why it might take so much time.  The first is that it's a large dataset.  Reducing the amount of training data might help it run quicker.

Furthermore, reducing the dimensionality of the data might also help.

#### Variable Importance

The following code repeats the data processing pipeline, fits the RandomForest model from above and extracts variable importance.

The number of trees is set to 200 by default; this can be changed if necessary.

In [None]:
lags = [1,3,7]
tree = 200
importance_list = []
for lag in lags:
    n_in=lag
    processed_data = (daily_footfall
                      .pipe(start_pipeline)
                      .pipe(series_to_supervised,n_in=n_in)
                      .pipe(create_lockdown_predictors)
                      .pipe(create_date_predictors)
                          .pipe(create_holiday_predictors,bankhols,schoolterms)
                          .pipe(create_weather_predictors, new_weather, previous_weather)
                      .pipe(drop_na)
                      .pipe(arrange_cols,n_in=n_in))

    cols_to_scale = ['mean_temp', 'wind_speed', 'rain', 'hosp_indoor', 'hosp_outdoor', 'hotels', 'ent_indoor', 'ent_outdoor',
                     'weddings', 'self_acc', 'sport_lei_indoor', 'sport_lei_outdoor', 'non_ess_retail', 'prim_sch',
                     'sec_sch', 'uni_campus', 'outdoor_grp_public', 'outdoor_grp_private', 'indoor_grp', 'eat_out']

    values = processed_data.values

    data_cols_pos, data_cols = create_data_cols(processed_data)

    #split into input and output columns
    trainX, trainy = processed_data.iloc[:, :-1].copy(), processed_data.iloc[:, -1].copy()
    trainX.loc[:,cols_to_scale] = min_max_scaler.fit_transform(trainX.loc[:,cols_to_scale])
    #fit model
    model = RandomForestRegressor(n_estimators=tree)
    model.fit(trainX, trainy)
    #extract feature importance
    importance = create_importance_df(model.feature_importances_,data_cols,n_in)
    importance_list.append(importance)

importance_fig = []

for df,lag in zip(importance_list,lags):
    fig = df.head(10).plot.bar(color=df.head(10).index,
                                           title=f"Model 1 Variable Importance (Top 10) - {lag} time lagged",
                                           labels={'value': 'Variable Importance',
                                                   'feature_name': 'Model Feature'})
    fig.update_layout(showlegend=False,
                      title={
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
    importance_fig.append(fig)

Plot importance chart for 1 time lagged variable

In [None]:
#plot importance chart
importance_fig[0].show()
importance_fig[0].write_image(f"../images/Model1Importance1.svg")

Plot importance chart for 3 time lagged variables

In [None]:
#plot importance chart
importance_fig[1].show()
importance_fig[1].write_image(f"../images/Model1Importance3.svg")

Plot importance chart for 7 time lagged variables

In [None]:
#plot importance chart
importance_fig[2].show()
importance_fig[2].write_image(f"../images/Model1Importance7.svg")

Feature importance is a useful way of assessing which predictors have strong power in predicting the outcome variable.  It does not necessarily explain why footfall changes at specific points, but it is a useful indicator of what is dominant within the model.  It may be that this can be used to reduce the dimensionality of the model in future iterations.

Adding more time lagged variables reduces the validation MAE somewhat, but they tend to dominate the model in terms of importance, squeezing out some explanatory lockdown inputs.

Just using a single time lagged variable seems to be most appropriate for this model.
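
To illustrate how a lagged variable can dominate the importance ranking, the sketch below fits a forest to made-up data in which the target depends almost entirely on one lag, then ranks the importances reported by scikit-learn.  It does not use `create_importance_df` from `source`; the column names and data are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n_rows = 300
# Made-up predictors: one strongly related to the target and one pure noise column.
X = pd.DataFrame({
    "var1(t-1)": rng.normal(1000, 100, n_rows),
    "noise": rng.normal(0, 1, n_rows),
})
y = 0.9 * X["var1(t-1)"] + rng.normal(0, 5, n_rows)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Pair each impurity-based importance with its column name and rank them.
importance = (pd.DataFrame({"feature_name": X.columns,
                            "importance": model.feature_importances_})
              .sort_values("importance", ascending=False)
              .set_index("feature_name"))
print(importance)
```

The importances sum to one, so any share taken by a lagged variable is no longer available to the other predictors.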

#### Model #2 - Prediction of business as usual

The following code repeats the initial data processing undertaken for the previous model, but omits the creation of lockdown variables.  The idea is to predict what footfall might have been like if the pandemic and lockdown measures hadn't happened.  The initial pipeline steps through creating a supervised learning dataset and the relevant footfall predictors.  Several tests have been run using different processes.

This first block drops missing values from the data altogether, regardless of what columns these occur in.  In most checks, missing values only really occurred in weather related variables as data were not available.

<b>Remember - Walk Forward Validation is potentially computationally expensive depending on how many tree/lag combinations you'd like to test</b>

In [None]:
#Run model 2 validation function using a range of trees
lags = [1,3,7]
trees = [100,200,500,1000]
figs = []
mae_df = pd.DataFrame()

for tree in trees:
    mae_list,validation_df,y = model2_validation(lags,daily_footfall,tree)
    validation_df['y'] = y
    fig = px.line(validation_df,title=f'Model 2A Validation - {tree} trees')
    figs.append(fig)
    fig.write_image(f"../images/Model2AValidation{tree}.svg")
    mae_df[f'{tree}_trees'] = mae_list
    mae_df.to_csv(f"../data/model2Amae.csv",index=False)

Plot the MAE for each combination of trees/lags

In [None]:
mae_df = pd.read_csv("../data/model2Amae.csv")
mae_df = mae_df.rename(index={0:'1 Lag',1: '3 Lag',2: '7 Lag'})

fig = mae_df.plot.bar(facet_col='variable')
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
fig.update_layout(showlegend=False,
                      title={
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

fig.write_image("../images/Model2mae.svg")
fig.show()

Show validation charts for each number of trees - These might all generate in one scrollable area.  Use the single shot code underneath the loop if you'd like each to be more visible.

In [None]:
for fig in figs:
    fig.show()

In [None]:
#Replace index to show specific figure
figs[1].show()

The validation above gives an indication of how accurate the model is on this run.  There may be differences each time it's run, so the code could be developed to run a number of validations and see if the MAE changes much.  All forecasting is tricky, particularly with time series data, but the model seems to do a fairly good job of getting close to the observed values for each day of the validation set.

The next stage is to use the model above to make predictions on an 'out of sample' test set.  In this scenario, the data we want to predict on is everything after and including 16th March 2020, when large changes in footfall started to be seen after social distancing announcements.

The same data processing pipeline is used, with train and test data split using the 16th March 2020 point.

In [None]:
importance_list = []
yhat_list = []
tree = 200

lags = [1,3,7]
for lag in lags:
    n_in=lag
    # transform the time series data into supervised learning and add additional explanatory variables.  Filter to exclude period where lockdowns were in effect.
    processed_data = (daily_footfall
                      .pipe(start_pipeline)
                      .pipe(series_to_supervised,n_in=n_in)
                      .pipe(create_date_predictors)
                          .pipe(create_holiday_predictors,bankhols,schoolterms)
                          .pipe(create_weather_predictors, new_weather, previous_weather)
                      .pipe(drop_na)
                      .pipe(arrange_cols,n_in=n_in))

    #Set columns to scale
    cols_to_scale = ['mean_temp', 'wind_speed', 'rain']

    #Create data col files to allow prediction outputs to be joined back to datetime later.
    data_col_pos, data_cols = create_data_cols(processed_data)

    #Split into training and testing datasets
    train = processed_data.loc[processed_data.index < "2020-03-16"]
    test = processed_data.loc[processed_data.index >= "2020-03-16"]

    #split training data into input and output columns
    trainX, trainy = train.iloc[:, :-1].copy(), train.iloc[:, -1].copy()
    #Fit and apply scaling to training data
    trainX.loc[:,cols_to_scale] = min_max_scaler.fit_transform(trainX.loc[:,cols_to_scale])

    #fit model
    model = RandomForestRegressor(n_estimators=tree)
    model.fit(trainX, trainy)

    #split test data into input and output columns
    testX, testy = test.iloc[:, :-1].copy(), test.iloc[:, -1].copy()
    #Apply scaler to test data
    testX.loc[:,cols_to_scale] = min_max_scaler.transform(testX.loc[:,cols_to_scale])

    #extract feature importance and join back to variable names
    importance = create_importance_df(model.feature_importances_, data_cols,n_in)
    importance_list.append(importance)



    yhat_list.append(create_prediction_data(model.predict(testX),test))

processed_data['roll_7_mean'] = processed_data['var1(t)'].rolling(7).mean()
prediction_fig = daily_predicted_chart(yhat_list,processed_data)
importance_fig = []

for df,lag in zip(importance_list,lags):
    fig = df.head(10).plot.bar(color=df.head(10).index,
                                           title=f"Model 2 Variable Importance (Top 10) - {lag} time lagged",
                                           labels={'value': 'Variable Importance',
                                                   'feature_name': 'Model Feature'})
    fig.update_layout(showlegend=False,
                      title={
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
    importance_fig.append(fig)

Plot predictions chart for 1,3 and 7 time lagged variables

In [None]:
fig = prediction_fig
fig.show()
fig.write_image(f"../images/Model2PredictionA.svg")

Plot importance chart for 1 time lagged variable

In [None]:
#plot importance chart
importance_fig[0].show()
importance_fig[0].write_image(f"../images/Model2ImportanceA1.svg")

Plot importance chart for 3 time lagged variables - Won't work if only 1 time lag has been included in the model

In [None]:
#plot importance chart
importance_fig[1].show()
importance_fig[1].write_image(f"../images/Model2ImportanceA3.svg")

Plot importance chart for 7 time lagged variables - Won't work if only 1 time lag has been included in the model

In [None]:
#plot importance chart
importance_fig[2].show()
importance_fig[2].write_image(f"../images/Model2ImportanceA7.svg")

There's a lot going on in those charts, but to summarise:
- Introduction of more than one time lagged variable seems to throw the predictions off in a big way.
- With 7 lagged variables, the predictions mirror what actually happened due to lockdown more than what might have been.

There's much less variation in the predictions than in the actual values from previous years.  There's an almost immediate drop after 16th March 2020, which goes against the trend seen in the data previously.  Maybe this is because seasonality and trends haven't been removed in pre-processing.  It's difficult to quantify any changes right now as I'm not particularly confident this works well.  There are also horrible artifacts in the weekly and monthly resamples caused by dropping entire weeks/months where weather data is missing.

It might be worth looking at variable importance, removing some less powerful ones and potentially imputing some weather data.

The following blocks of code rerun the analysis but with a process to impute weather data.  Note that as it stands the method isn't as robust as it could be.  Really the imputation should be done after splitting the train/test data for validation purposes.  I tried to do so and then rejoin the sets, but the train_test_split function caused an indexing error within the walk forward validation, so I abandoned this and carried out the imputation on the dataset as a whole.  This means that there could be some data leakage, but I'm happy to accept that for now as the scaling is still done after the split.

In [None]:
#transform the time series data into supervised learning and add additional explanatory variables.  Filter to exclude period where lockdowns were in effect.
processed_data = (daily_footfall
                  .pipe(start_pipeline)
                  .pipe(series_to_supervised,n_in=6)
                  .pipe(create_date_predictors)
                          .pipe(create_holiday_predictors,bankhols,schoolterms)
                          .pipe(create_weather_predictors, new_weather, previous_weather)
                  .pipe(arrange_cols,n_in=6))

#lists of columns to scale and impute (may or may not be different)
cols_to_scale = ['mean_temp', 'wind_speed', 'rain']
cols_to_impute = ['mean_temp', 'wind_speed', 'rain']

#set processed data as pre lockdown period
processed_data = processed_data.loc[processed_data.index < "2020-03-16"]
#interpolate columns with missing values using a time based method.
processed_data = processed_data.interpolate(method='time')
#drop missing values
processed_data = processed_data.dropna()

#DEPRECATED CODE - initially split, impute and then rejoin but it was causing an indexing error within the walk forward validation method.
# test_rows = 30
# train, test_rows = train_test_split(processed_data,test_rows)
#
# train.loc[:,cols_to_impute], test_rows.loc[:,cols_to_impute] = train.loc[:,cols_to_impute].interpolate(method='time'), test_rows.loc[:,cols_to_impute].interpolate(method='time')
# train,test_rows = train.dropna(),test_rows.dropna()
#
# prelockdown_data = pd.concat([train,test_rows],axis=0)
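
For reference, the split-then-impute idea from the commented-out code can be sketched on a toy dataset as follows.  This is only an illustration of the ordering: the data are made up, and a plain positional split is used in place of the `train_test_split` function from `source`.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-01-01", periods=10, freq="D")
frame = pd.DataFrame({
    "mean_temp": [5.0, np.nan, 7.0, 8.0, np.nan, 6.0, 5.0, np.nan, 9.0, 10.0],
    "var1(t)": np.arange(10, dtype=float),
}, index=dates)
cols_to_impute = ["mean_temp"]

# Split first, so that no test value can inform an imputed training value.
n_test = 4
train, test = frame.iloc[:-n_test].copy(), frame.iloc[-n_test:].copy()

# Interpolate within each partition separately, then drop anything still missing.
train[cols_to_impute] = train[cols_to_impute].interpolate(method="time")
test[cols_to_impute] = test[cols_to_impute].interpolate(method="time")
train, test = train.dropna(), test.dropna()

print(train["mean_temp"].tolist())
print(test["mean_temp"].tolist())
```

Note that interpolating inside the test partition still lets later test days fill earlier ones, so a stricter version would impute each test row using past values only.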


The rest of the modelling including the weather imputed variables needs to be added back into the notebook.  I initially ran all the models using only a single value for trees and time lags.  Since then I've added in the multiple tests, so the new code just needs adapting for this section.

### Final Thoughts

Due to the way in which the validation has been set up, it's difficult to fit it into the 'conventional' scikit-learn pipeline and evaluate the hyperparameters, so I'll leave those for now.  Tweaking them might result in different predictions, but it might just be that this model isn't the best or will take a lot of effort to turn into something usable.

I've spent a bit of time tidying up the code above to run the preprocessing through pipes, but not necessarily the scikit-learn pipelines used explicitly for modelling.  Due to the transformation of several variables into standardised/normalised ranges prior to splitting the data, I'll need to come back to this and adjust it so that only the variables are created at this point and any adjustments (such as scaling or imputing) are undertaken at the correct time.

The scaling is potentially still an issue as it's only applied once throughout the validation process.  There is a solution, listed below, but it might not be worth the time trying to implement it if a separate ARIMA/Prophet workflow is developed:
- After seeding the history list, pull the code to convert into an array and split off the X and y variables.
- Within the loop, split off test X and y.  Don't isolate the single row in the loop, instead do it inside the random forest function.
- Instead of feeding history into the random forest function, feed in train X and y as they are.
- Fit the model inside the random forest function and apply scaling to test X.  Isolate the single row the loop is currently on and then predict on that.
- After appending the last observation to the history, turn it into an array, split off train X and scale it once again.
- Return to the start of the loop with a new, scaled train X and repeat, refitting the model on this and rescaling test X after.
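
One possible way to get the same effect as the steps above is sketched below.  Rather than rescaling arrays by hand, it swaps in a scikit-learn `Pipeline` with a `ColumnTransformer`, so the scaler and the forest are refitted together at every step.  This is an untested alternative to the existing validation code, not a description of it; the function name `walk_forward_rescaled` and the data are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

def walk_forward_rescaled(frame, n_test, cols_to_scale, n_trees=50):
    """Sketch: refit scaler and forest at every step; last column of `frame` is the target."""
    X, y = frame.iloc[:, :-1], frame.iloc[:, -1]
    predictions = []
    for step in range(len(frame) - n_test, len(frame)):
        pipeline = Pipeline([
            # Scale only the listed columns; pass the rest through unchanged.
            ("scale", ColumnTransformer([("minmax", MinMaxScaler(), cols_to_scale)],
                                        remainder="passthrough")),
            ("forest", RandomForestRegressor(n_estimators=n_trees, random_state=0)),
        ])
        # The scaler only ever sees rows that precede the row being predicted.
        pipeline.fit(X.iloc[:step], y.iloc[:step])
        predictions.append(pipeline.predict(X.iloc[[step]])[0])
    mae = mean_absolute_error(y.iloc[-n_test:], predictions)
    return mae, np.asarray(predictions)

# Made-up data: one lagged count and one weather column.
rng = np.random.default_rng(2)
n_rows = 80
frame = pd.DataFrame({
    "var1(t-1)": rng.normal(1000, 50, n_rows),
    "mean_temp": rng.normal(10, 5, n_rows),
})
frame["var1(t)"] = frame["var1(t-1)"] + 10 * frame["mean_temp"]
mae, yhat = walk_forward_rescaled(frame, n_test=5, cols_to_scale=["mean_temp"])
print(f"MAE with per-step rescaling: {mae:.2f}")
```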

The overarching problem with this method of modelling using a RF is the amount of custom code and script required to develop more feasible and robust workflows.  As illustrated above, making sure that data leakage is avoided during scaling isn't considered in the initial validation code and could be an enormous time sink going forward.

Future work to be developed on RF if desired:
- Adjust scaling algorithm to retrain every time the validation makes a one step prediction.
- Undertake dimensionality reduction using an appropriate method
- Remove outliers caused by 'one-off' events or odd days.  Include some of these events as explanatory variables.
- Do a location based analysis/modelling to take into account different areas of the city
- More research into the feasibility of Random Forest for Time Series Modelling