# Experiment with an XGBoost Regression on basic weather data by Street

Updated to use the full Foot Traffic Weather data from 2013 - July 2022

This will use XGBoost Regression as the model. It will use 5x K-Folds Cross Validation to train then fit the model and evaluate the MAE and RMSE. For each fold, it will write out the data with the predictions to the /predictions folder so we can look at what the model is predicting vs the true people count.

Also, This version has been updated so that it will use the FT_Street_Melb versions of files that have the counts by selected street locations. Will need to One hot encode the street before predicting

In [123]:
import pandas as pd
import numpy as np
import xgboost as xgb

from utilities import data_basic_utility as databasic
from utilities import dataframe_utility as dfutil

from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

# dfFootWeather = pd.read_csv("./data_files/FootTrafficWeather_July2022_Melbourne.csv")
dfFootWeather = pd.read_csv("./data_files/FT_Street_Melb_20130101_20220701.csv", parse_dates=["date"])
thisFileName = "11a.StreetXGboostV1"

print(dfFootWeather.shape)
print(dfFootWeather.info())
dfFootWeather.head()

(24633, 21)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24633 entries, 0 to 24632
Data columns (total 21 columns):
 #   Column                                      Non-Null Count  Dtype         
---  ------                                      --------------  -----         
 0   date                                        24633 non-null  datetime64[ns]
 1   street                                      24633 non-null  object        
 2   total_people                                24633 non-null  int64         
 3   total_rain                                  23534 non-null  float64       
 4   rain_quality                                23534 non-null  object        
 5   max_temp                                    23569 non-null  float64       
 6   max_temp_quality                            23562 non-null  object        
 7   min_temp                                    23562 non-null  float64       
 8   min_temp_quality                            23562 non-null  object        

Unnamed: 0,date,street,total_people,total_rain,rain_quality,max_temp,max_temp_quality,min_temp,min_temp_quality,solar_exp,...,population_annual,population_change_annual,is_holiday,is_lockdown,OfflineRetail_Original_Turnover,OfflineRetail_Seasonally_Adjusted_Turnover,OfflineRetail_Trend_Turnover,all_ords,sp_asx200,dom_equity_market_cap
0,2022-07-31,Bourke Street Mall (North),15434,0.0,N,14.7,Y,4.3,Y,4.8,...,5151000,1.78,,,8562.7,8947.3,,7173.8,6945.21,2453645.0
1,2022-07-31,Spencer St-Collins St (North),12349,0.0,N,14.7,Y,4.3,Y,4.8,...,5151000,1.78,,,8562.7,8947.3,,7173.8,6945.21,2453645.0
2,2022-07-31,Southern Cross Station,1661,0.0,N,14.7,Y,4.3,Y,4.8,...,5151000,1.78,,,8562.7,8947.3,,7173.8,6945.21,2453645.0
3,2022-07-31,QV Market-Peel St,3203,0.0,N,14.7,Y,4.3,Y,4.8,...,5151000,1.78,,,8562.7,8947.3,,7173.8,6945.21,2453645.0
4,2022-07-31,Melbourne Central,23363,0.0,N,14.7,Y,4.3,Y,4.8,...,5151000,1.78,,,8562.7,8947.3,,7173.8,6945.21,2453645.0


In [124]:
dfFootWeather = dfFootWeather[dfFootWeather["total_rain"].notna()]
dfFootWeather = dfFootWeather[dfFootWeather["solar_exp"].notna()]

# assume missing quality is an N
dfFootWeather.loc[dfFootWeather["max_temp_quality"].isna(), "max_temp_quality"] = "N"

print(dfFootWeather.shape)
print(dfFootWeather.info())
dfFootWeather.head()

(23527, 21)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 23527 entries, 0 to 23568
Data columns (total 21 columns):
 #   Column                                      Non-Null Count  Dtype         
---  ------                                      --------------  -----         
 0   date                                        23527 non-null  datetime64[ns]
 1   street                                      23527 non-null  object        
 2   total_people                                23527 non-null  int64         
 3   total_rain                                  23527 non-null  float64       
 4   rain_quality                                23527 non-null  object        
 5   max_temp                                    23527 non-null  float64       
 6   max_temp_quality                            23527 non-null  object        
 7   min_temp                                    23527 non-null  float64       
 8   min_temp_quality                            23527 non-null  object        

Unnamed: 0,date,street,total_people,total_rain,rain_quality,max_temp,max_temp_quality,min_temp,min_temp_quality,solar_exp,...,population_annual,population_change_annual,is_holiday,is_lockdown,OfflineRetail_Original_Turnover,OfflineRetail_Seasonally_Adjusted_Turnover,OfflineRetail_Trend_Turnover,all_ords,sp_asx200,dom_equity_market_cap
0,2022-07-31,Bourke Street Mall (North),15434,0.0,N,14.7,Y,4.3,Y,4.8,...,5151000,1.78,,,8562.7,8947.3,,7173.8,6945.21,2453645.0
1,2022-07-31,Spencer St-Collins St (North),12349,0.0,N,14.7,Y,4.3,Y,4.8,...,5151000,1.78,,,8562.7,8947.3,,7173.8,6945.21,2453645.0
2,2022-07-31,Southern Cross Station,1661,0.0,N,14.7,Y,4.3,Y,4.8,...,5151000,1.78,,,8562.7,8947.3,,7173.8,6945.21,2453645.0
3,2022-07-31,QV Market-Peel St,3203,0.0,N,14.7,Y,4.3,Y,4.8,...,5151000,1.78,,,8562.7,8947.3,,7173.8,6945.21,2453645.0
4,2022-07-31,Melbourne Central,23363,0.0,N,14.7,Y,4.3,Y,4.8,...,5151000,1.78,,,8562.7,8947.3,,7173.8,6945.21,2453645.0


### Feature Engineering

First, we need to convert any non-number columns into numbers that the model can understand. This first version isn't doing anything beyond that, later on we should probably look for any missing data flags, and maybe do some column Min/Max scaling or other.

Convert the 3 Quality Y/N columns into 1/0 values, use the shared utility function for future code reuse

In [125]:
dfFootWeather = dfutil.convertBoolColToInt(dfFootWeather, "rain_quality")
dfFootWeather = dfutil.convertBoolColToInt(dfFootWeather, "max_temp_quality")
dfFootWeather = dfutil.convertBoolColToInt(dfFootWeather, "min_temp_quality")
dfFootWeather.head()

Unnamed: 0,date,street,total_people,total_rain,rain_quality,max_temp,max_temp_quality,min_temp,min_temp_quality,solar_exp,...,population_annual,population_change_annual,is_holiday,is_lockdown,OfflineRetail_Original_Turnover,OfflineRetail_Seasonally_Adjusted_Turnover,OfflineRetail_Trend_Turnover,all_ords,sp_asx200,dom_equity_market_cap
0,2022-07-31,Bourke Street Mall (North),15434,0.0,0,14.7,1,4.3,1,4.8,...,5151000,1.78,,,8562.7,8947.3,,7173.8,6945.21,2453645.0
1,2022-07-31,Spencer St-Collins St (North),12349,0.0,0,14.7,1,4.3,1,4.8,...,5151000,1.78,,,8562.7,8947.3,,7173.8,6945.21,2453645.0
2,2022-07-31,Southern Cross Station,1661,0.0,0,14.7,1,4.3,1,4.8,...,5151000,1.78,,,8562.7,8947.3,,7173.8,6945.21,2453645.0
3,2022-07-31,QV Market-Peel St,3203,0.0,0,14.7,1,4.3,1,4.8,...,5151000,1.78,,,8562.7,8947.3,,7173.8,6945.21,2453645.0
4,2022-07-31,Melbourne Central,23363,0.0,0,14.7,1,4.3,1,4.8,...,5151000,1.78,,,8562.7,8947.3,,7173.8,6945.21,2453645.0


In [126]:
print(type(dfFootWeather["date"].dtype))
print(dfFootWeather["date"].dtype == "object")
print(dfFootWeather["date"][0])

<class 'numpy.dtype[datetime64]'>
False
2022-07-31 00:00:00


In [127]:
dfFootWeather = dfutil.separateYmdCol(dfFootWeather, "date")
print(dfFootWeather.info())
dfFootWeather.head(20)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23527 entries, 0 to 23568
Data columns (total 23 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   street                                      23527 non-null  object 
 1   total_people                                23527 non-null  int64  
 2   total_rain                                  23527 non-null  float64
 3   rain_quality                                23527 non-null  int64  
 4   max_temp                                    23527 non-null  float64
 5   max_temp_quality                            23527 non-null  int64  
 6   min_temp                                    23527 non-null  float64
 7   min_temp_quality                            23527 non-null  int64  
 8   solar_exp                                   23527 non-null  float64
 9   WeekDay                                     23527 non-null  int64  
 10  population

Unnamed: 0,street,total_people,total_rain,rain_quality,max_temp,max_temp_quality,min_temp,min_temp_quality,solar_exp,WeekDay,...,is_lockdown,OfflineRetail_Original_Turnover,OfflineRetail_Seasonally_Adjusted_Turnover,OfflineRetail_Trend_Turnover,all_ords,sp_asx200,dom_equity_market_cap,date_year,date_month,date_day
0,Bourke Street Mall (North),15434,0.0,0,14.7,1,4.3,1,4.8,6,...,,8562.7,8947.3,,7173.8,6945.21,2453645.0,2022,7,31
1,Spencer St-Collins St (North),12349,0.0,0,14.7,1,4.3,1,4.8,6,...,,8562.7,8947.3,,7173.8,6945.21,2453645.0,2022,7,31
2,Southern Cross Station,1661,0.0,0,14.7,1,4.3,1,4.8,6,...,,8562.7,8947.3,,7173.8,6945.21,2453645.0,2022,7,31
3,QV Market-Peel St,3203,0.0,0,14.7,1,4.3,1,4.8,6,...,,8562.7,8947.3,,7173.8,6945.21,2453645.0,2022,7,31
4,Melbourne Central,23363,0.0,0,14.7,1,4.3,1,4.8,6,...,,8562.7,8947.3,,7173.8,6945.21,2453645.0,2022,7,31
5,Collins Place (North),1410,0.0,0,14.7,1,4.3,1,4.8,6,...,,8562.7,8947.3,,7173.8,6945.21,2453645.0,2022,7,31
6,Chinatown-Swanston St (North),11123,0.0,0,14.7,1,4.3,1,4.8,6,...,,8562.7,8947.3,,7173.8,6945.21,2453645.0,2022,7,31
7,Spencer St-Collins St (North),15937,0.0,0,13.0,1,2.1,1,11.3,5,...,,8562.7,8947.3,,7173.8,6945.21,2453645.0,2022,7,30
8,Southern Cross Station,2540,0.0,0,13.0,1,2.1,1,11.3,5,...,,8562.7,8947.3,,7173.8,6945.21,2453645.0,2022,7,30
9,QV Market-Peel St,4457,0.0,0,13.0,1,2.1,1,11.3,5,...,,8562.7,8947.3,,7173.8,6945.21,2453645.0,2022,7,30


In [128]:
dfFootWeather.loc[dfFootWeather["is_holiday"].isna(), "is_holiday"] = 0
dfFootWeather.loc[dfFootWeather["is_lockdown"].isna(), "is_lockdown"] = 0

dfFootWeather["OfflineRetail_Trend_Turnover"] = dfFootWeather["OfflineRetail_Trend_Turnover"].fillna(dfFootWeather["OfflineRetail_Seasonally_Adjusted_Turnover"])

Do One Hot Encoding on the street   

In [129]:
dfFootWeather = pd.get_dummies(data=dfFootWeather, columns=["street"])
print(dfFootWeather.info())
dfFootWeather.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23527 entries, 0 to 23568
Data columns (total 29 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   total_people                                23527 non-null  int64  
 1   total_rain                                  23527 non-null  float64
 2   rain_quality                                23527 non-null  int64  
 3   max_temp                                    23527 non-null  float64
 4   max_temp_quality                            23527 non-null  int64  
 5   min_temp                                    23527 non-null  float64
 6   min_temp_quality                            23527 non-null  int64  
 7   solar_exp                                   23527 non-null  float64
 8   WeekDay                                     23527 non-null  int64  
 9   population_annual                           23527 non-null  int64  
 10  population

Unnamed: 0,total_people,total_rain,rain_quality,max_temp,max_temp_quality,min_temp,min_temp_quality,solar_exp,WeekDay,population_annual,...,date_year,date_month,date_day,street_Bourke Street Mall (North),street_Chinatown-Swanston St (North),street_Collins Place (North),street_Melbourne Central,street_QV Market-Peel St,street_Southern Cross Station,street_Spencer St-Collins St (North)
0,15434,0.0,0,14.7,1,4.3,1,4.8,6,5151000,...,2022,7,31,1,0,0,0,0,0,0
1,12349,0.0,0,14.7,1,4.3,1,4.8,6,5151000,...,2022,7,31,0,0,0,0,0,0,1
2,1661,0.0,0,14.7,1,4.3,1,4.8,6,5151000,...,2022,7,31,0,0,0,0,0,1,0
3,3203,0.0,0,14.7,1,4.3,1,4.8,6,5151000,...,2022,7,31,0,0,0,0,1,0,0
4,23363,0.0,0,14.7,1,4.3,1,4.8,6,5151000,...,2022,7,31,0,0,0,1,0,0,0


Do a K-Folds Cross Validation using XGBoost and get an MAE and an RMSE for mean error and indication of variance

In [130]:
# Test a basic XGBoost Regression with KFolds Cross Validation
randomSeed = databasic.get_random_seed()
# Tuned {'colsample_bytree': 0.7, 'eta': 0.01, 'max_depth': 10, 'n_estimators': 1000}
model = xgb.XGBRegressor(objective="reg:squarederror", booster="gbtree", 
     n_estimators=1000, max_depth=10, colsample_bytree=0.7, eta=0.1, seed=randomSeed)
modellingLog = ""   

targetColName = "total_people"
col_names = dfFootWeather.columns
feature_cols = col_names.drop([targetColName])
trainFeatures = dfFootWeather[feature_cols]
trainTargets = dfFootWeather[targetColName]


In [131]:

lstMae = []
lstRmse = []
kfolds = KFold(n_splits=5, random_state=randomSeed, shuffle=True)
for k, (train_index, test_index) in enumerate(kfolds.split(dfFootWeather)):
    # x_train = trainFeatures.loc[train_index, ]
    # x_vali = trainFeatures.loc[test_index, ]

    # y_train = trainTargets.loc[train_index, ]
    # y_vali = trainTargets.loc[test_index, ]
    x_train = trainFeatures.loc[trainFeatures.index.intersection(train_index)]
    x_vali = trainFeatures.loc[trainFeatures.index.intersection(test_index)]
    
    y_train = trainTargets.loc[trainTargets.index.intersection(train_index)]
    y_vali = trainTargets.loc[trainTargets.index.intersection(test_index)]
        
    model.fit(x_train, y_train)
    y_pred = model.predict(x_vali)

    # Compute the mae
    mae = mean_absolute_error(y_pred, y_vali)
    lstMae.append(mae)

    # Compute the rmse
    rmse = np.sqrt(mean_squared_error(y_pred, y_vali))
    lstRmse.append(rmse)
    
    print("Fold {0} MAE: {1}, RMSE: {2}".format(str(k), str(mae), str(rmse)))

    dfPredicted = x_vali
    dfPredicted["total_people"] = y_vali
    dfPredicted["total_people_predicted"] = y_pred
    dfPredicted.to_csv("./predictions/" + thisFileName+"_KFold" + str(k) + ".csv", index=False)

print("Final Result")
print("----------")
print("Average Mean Absolute Error (MAE): " + str(np.mean(lstMae)))
print("Average Root Mean Squared Error (RMSE): " + str(np.mean(lstRmse)))


print("maeResults.append(" + str(np.mean(lstMae)) + ")")
print("rmseResults.append(" + str(np.mean(lstRmse)) + ")")


Fold 0 MAE: 1176.652740648017, RMSE: 2180.3184082722178
Fold 1 MAE: 1158.0145031153927, RMSE: 2234.9601967067983
Fold 2 MAE: 1140.6853325957813, RMSE: 2143.3812485669646
Fold 3 MAE: 1179.889030368108, RMSE: 2185.787020944508
Fold 4 MAE: 1138.755441507718, RMSE: 2110.472002746773
Final Result
----------
Average Mean Absolute Error (MAE): 1158.7994096470034
Average Root Mean Squared Error (RMSE): 2170.9837754474524
maeResults.append(1158.7994096470034)
rmseResults.append(2170.9837754474524)


# Version 3 - same as version 2 but tuned

Run 1:
- Average Mean Absolute Error (MAE): 1121.0414643434003
- Average Root Mean Squared Error (RMSE): 2129.5900508703976

Run 2:
- Average Mean Absolute Error (MAE): 1143.2207060840512
- Average Root Mean Squared Error (RMSE): 2135.8656044644054

Run 3:
- Average Mean Absolute Error (MAE): 1158.7994096470034
- Average Root Mean Squared Error (RMSE): 2170.9837754474524


# Version 2 - Added holidays, lockdown dates, retail data and all ords data

Run 1:
- Average Mean Absolute Error (MAE): 2427.7061911435057
- Average Root Mean Squared Error (RMSE): 3880.6926322712666

Run 2:
- Average Mean Absolute Error (MAE): 2430.2261640958413
- Average Root Mean Squared Error (RMSE): 3865.9539948526444

Run 3:
- Average Mean Absolute Error (MAE): 2428.957682123305
- Average Root Mean Squared Error (RMSE): 3876.7395727045455

# Version 1 - Just the basic weather data
This is before the following was added in: holidays, lockdown dates, retail data and all ords data

Run 1:
- Average Mean Absolute Error (MAE): 2920.0653982122726
- Average Root Mean Squared Error (RMSE): 4807.48902009495

Run 2:
- Average Mean Absolute Error (MAE): 2879.142585249733
- Average Root Mean Squared Error (RMSE): 4723.4315077591855

Run 3:
- Average Mean Absolute Error (MAE): 2898.2915500385693
- Average Root Mean Squared Error (RMSE): 4769.295103327263

Run 4:
- Average Mean Absolute Error (MAE): 2641.3858962528147
 -Average Root Mean Squared Error (RMSE): 4262.662939275366

In [135]:
avgTotalPeople = np.mean(dfFootWeather["total_people"])
maeResults = []
rmseResults = []
maeResults.append(1121.0414643434003)
rmseResults.append(2129.5900508703976)
maeResults.append(1143.2207060840512)
rmseResults.append(2135.8656044644054)
maeResults.append(1158.7994096470034)
rmseResults.append(2170.9837754474524)

avgMae = np.mean(maeResults)
avgRmse = np.mean(rmseResults)

predictionAccuracy = 100 - np.round((avgMae / avgTotalPeople) * 100, 2)
percentAvgAccuracyError = np.round((avgRmse / avgTotalPeople) * 100, 2)

print("Average for Total People per street location reading: " + str(avgTotalPeople) + "")
print("Averaged MAE: " + str(avgMae) + "")
print("Averaged RMSE: " + str(avgRmse) + "")
print("Predictions made to an accuracy of: " + str(predictionAccuracy) + "%")
print("Predictions Error: +/-" + str(percentAvgAccuracyError) + "%")

Average for Total People per street location reading: 14693.907893059039
Averaged MAE: 1141.020526691485
Averaged RMSE: 2145.4798102607515
Predictions made to an accuracy of: 92.23%
Predictions Error: +/-14.6%
