# Modeling-baseline


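After an initial exploration of average travel time on northbound National Freeway No. 5,
we know that the time series behave differently across time periods and across gantry pairs.
Being able to use multiple features effectively to forecast these series would be a good result.
Before completing our own model, we therefore need to build a baseline to serve as the basis for later comparison.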

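We use the statsforecast package developed by Nixtla to build the baseline.
The main workflow is:
* Fit the most basic time series models and use several metrics to see where the errors fall
* Use cross_validation to assess how well the models perform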

# Lib

In [2]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import os
import time
import sys

# Add the folder containing hwttp to the Python search path
sys.path.append(os.path.abspath('..'))
import hwttp.hwtoolkit as tk

# traditional method as baseline
from statsforecast import StatsForecast
from statsforecast.models import(
    AutoARIMA,
    SeasonalNaive,
    MSTL
)

# evaluation
from utilsforecast.losses import mse, mae, rmse, smape
from utilsforecast.evaluation import evaluate
from utilsforecast.plotting import plot_series

# necessary for unique_id working
os.environ['NIXTLA_ID_AS_COL'] = '1'



# Data Preparation

## load basic data

In [8]:
data_paths = {'etag_5n_loc': '../data/cleaned/etag_5n_loc.csv',
              'section_info': '../data/cleaned/section_info.csv',
              'hw5_m04a_df': '../data/cleaned/hw5_m04a.csv',
              'congestion_table': '../data/cleaned/congestion_table.csv',
              'calendar_event': '../data/cleaned/calendar_event.csv',
              'road_build_event': '../data/cleaned/202301_10_road_build_event.xlsx',
              'traffic_accident_data': '../data/cleaned/202301_10_traffic_accident_data.xlsx'
             }

rs = tk.hw_df_resource(data_paths)
rs.load_raw_environment_info()
rs.load_raw_event_info()
rs.generate_mile_location_info()

Complete loading environment and gantry info
Complete loading raw event info
Complete generating mile location info


In [9]:
# Travel time data
hw5_15watt = pd.read_csv("../data/features/hw5_15watt.csv")

In [10]:
hw5_15watt = hw5_15watt[~hw5_15watt.gf_gt.isin(['03F0150N-03F0140N', 
                                                '03F0201S-03A0041N', 
                                                '03F0201S-03F0217S'])].copy()

In [11]:
hw5_15watt.gf_gt.unique()

array(['05F0001N-03F0150N', '05F0001N-03F0201S', '05F0055N-05F0001N',
       '05F0287N-05F0055N', '05F0309N-05F0287N', '05F0438N-05F0309N',
       '05F0438N-05FR143N', '05F0528N-05F0438N'], dtype=object)

In [12]:
hw5_15watt.shape

(350208, 6)

## data split

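Here the data is reshaped into the long format that statsforecast expects (unique_id, ds, y), so that each gantry section is handled as an independent time series during training.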

In [36]:
df = hw5_15watt[['TimeStamp', 'WeightedAvgTravelTime', 'gf_gt']].copy()#.reset_index()
df.rename(columns={'gf_gt': 'unique_id', 
                   'TimeStamp': 'ds',
                   'WeightedAvgTravelTime': 'y'}, inplace=True)

In [37]:
# data split
f_df = df[df.ds>='2024-01-01 00:00:00'].copy()
p_df = df[df.ds<'2024-01-01 00:00:00'].copy()
f_df.shape, p_df.shape

((69888, 3), (280320, 3))

We use the full year of 2023 as the training set; the 2024 data is held out for later validation.
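The shapes printed above can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the helper name `split_by_date` and the toy frame are hypothetical, and it parses `ds` with `pd.to_datetime` before comparing, which the notebook cell does not show. The expected counts assume 8 gantry sections, one observation every 15 minutes, and no missing intervals.

```python
import pandas as pd

def split_by_date(df: pd.DataFrame, boundary: str):
    """Split a long-format frame into rows before / on-or-after the boundary."""
    out = df.copy()
    out["ds"] = pd.to_datetime(out["ds"])  # compare timestamps rather than strings
    ts = pd.Timestamp(boundary)
    return out[out["ds"] < ts].copy(), out[out["ds"] >= ts].copy()

# Toy data: two hypothetical series covering the last day of 2023 and the first day of 2024
ds = pd.date_range("2023-12-31 00:00", "2024-01-01 23:45", freq="15min")
toy = pd.concat(
    [pd.DataFrame({"unique_id": uid, "ds": ds.astype(str), "y": 1.0}) for uid in ("A", "B")],
    ignore_index=True,
)
train, holdout = split_by_date(toy, "2024-01-01")

# Expected row counts for the real data, assuming contiguous 15-minute data
n_series, per_day = 8, 96
expected_train = n_series * 365 * per_day    # 2023 has 365 days
expected_holdout = n_series * 91 * per_day   # 91 days of 2024 data
print(expected_train, expected_holdout)      # 280320 69888
```

Both numbers agree with the shapes of p_df and f_df, which suggests that the 2023 data has no gaps and that the 2024 portion spans 91 days.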

# Modelling

## baseline

Baseline using:
* AutoARIMA
* SeasonalNaive
* MSTL

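All of the models above can capture trend to some extent.
Traffic data exhibits seasonality over short to medium horizons.
With these three models we can see how well the simplest methods perform.
statsforecast does not require splitting the series by hand;
whether the input needs adjusting depends on whether the model in use is multivariate or univariate.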
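To make the simplest of the three models concrete, here is a minimal hand-written sketch of the seasonal naive idea: the forecast repeats the last observed season. The function `seasonal_naive` is an illustration written for this note, not the statsforecast implementation.

```python
import numpy as np

def seasonal_naive(y: np.ndarray, season_length: int, h: int) -> np.ndarray:
    """Forecast h steps ahead by repeating the last full season of y."""
    if len(y) < season_length:
        raise ValueError("need at least one full season of history")
    last_season = y[-season_length:]
    reps = -(-h // season_length)  # ceiling division: seasons needed to cover h steps
    return np.tile(last_season, reps)[:h]

history = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
forecast = seasonal_naive(history, season_length=3, h=5)
print(forecast)  # [4. 5. 6. 4. 5.]
```

With a season of 96 steps on 15-minute data, the same idea means that each forecast value equals the value observed at the same time of day on the most recent day of history.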

In [25]:
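# SEASON_LENGTH = 288  # 5-minute data would give 288 observations per day
SEASON_LENGTH = 96  # 15-minute data: 96 observations per day, taken as one season

models = [
    AutoARIMA(),  # season_length defaults to 1, so this ARIMA is non-seasonal
    SeasonalNaive(season_length=SEASON_LENGTH),  # one day of 15-minute data
    MSTL(season_length=SEASON_LENGTH),
]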

sf = StatsForecast(
    models=models,
    freq='15min',
    fallback_model = SeasonalNaive(season_length=SEASON_LENGTH),
    n_jobs=-1,
)

## cross validation

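Because we have one year of data, we set the windows to weekly: each window takes two weeks of data as input
and forecasts the following week.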

In [28]:
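# Originally: step one month at a time over the past year, validating on the following 7 days
# Now: step one week at a time
start_time = time.time()
horizon = 4 * 24 * 7  # one week of 15-minute steps

crossvaldation_df = sf.cross_validation(
    df=p_df,
    h=horizon,  # number of steps to forecast: one week = 96 * 7
    input_size=horizon * 2,  # train on at most the two weeks before each cutoff
    step_size=horizon,  # move forward one week per window
    n_windows=52
)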

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time:.2f} seconds")

Elapsed time: 53.85 seconds
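The window layout implied by these parameters can be worked out by hand. The sketch below is a plain pandas calculation, not a call into statsforecast; the helper `cv_cutoffs` is hypothetical, and it assumes each series runs contiguously from 2023-01-01 00:00 to 2023-12-31 23:45 at 15-minute resolution.

```python
import pandas as pd

def cv_cutoffs(last_ts, freq, h, step_size, n_windows):
    """Return the cutoff timestamp of each validation window, oldest first."""
    last_ts = pd.Timestamp(last_ts)
    offset = pd.Timedelta(freq)
    cutoffs = []
    for i in range(n_windows):
        # The newest window ends at last_ts; each older window sits step_size earlier
        steps_back = h + (n_windows - 1 - i) * step_size
        cutoffs.append(last_ts - steps_back * offset)
    return cutoffs

horizon = 4 * 24 * 7
cutoffs = cv_cutoffs("2023-12-31 23:45:00", "15min",
                     h=horizon, step_size=horizon, n_windows=52)
print(cutoffs[0], cutoffs[-1])  # 2023-01-01 23:45:00 2023-12-24 23:45:00

# Observations available before the first cutoff, per series
n_obs = 365 * 96
history_first = n_obs - 52 * horizon
print(history_first)  # 96
```

Under these assumptions, the first cutoff leaves only 96 observations (one day) of history, which is less than input_size = 1344. The earliest windows would therefore be trained on less than two weeks of data, which is worth keeping in mind when reading the per-window errors.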


In [41]:
crossvaldation_df.to_csv(f'./cv_results/baseline.csv', index=False)

In [42]:
crossvaldation_df

| | unique_id | ds | cutoff | y | AutoARIMA | SeasonalNaive | MSTL |
|---|---|---|---|---|---|---|---|
| 0 | 05F0001N-03F0150N | 2023-01-02 00:00:00 | 2023-01-01 23:45:00 | 51.079063 | 51.863762 | 49.187500 | 49.187500 |
| 1 | 05F0001N-03F0150N | 2023-01-02 00:15:00 | 2023-01-01 23:45:00 | 51.550102 | 51.863762 | 50.619617 | 50.619617 |
| 2 | 05F0001N-03F0150N | 2023-01-02 00:30:00 | 2023-01-01 23:45:00 | 50.661072 | 51.863762 | 49.515656 | 49.515656 |
| 3 | 05F0001N-03F0150N | 2023-01-02 00:45:00 | 2023-01-01 23:45:00 | 49.492664 | 51.863762 | 49.298500 | 49.298500 |
| 4 | 05F0001N-03F0150N | 2023-01-02 01:00:00 | 2023-01-01 23:45:00 | 47.874908 | 51.863762 | 50.595329 | 50.595329 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 279547 | 05F0528N-05F0438N | 2023-12-31 22:45:00 | 2023-12-24 23:45:00 | 336.021637 | 328.242249 | 333.683319 | 338.231415 |
| 279548 | 05F0528N-05F0438N | 2023-12-31 23:00:00 | 2023-12-24 23:45:00 | 337.304749 | 328.233307 | 320.679504 | 342.791321 |
| 279549 | 05F0528N-05F0438N | 2023-12-31 23:15:00 | 2023-12-24 23:45:00 | 337.973419 | 328.224365 | 335.102753 | 336.178589 |
| 279550 | 05F0528N-05F0438N | 2023-12-31 23:30:00 | 2023-12-24 23:45:00 | 359.978516 | 328.215393 | 343.051270 | 333.614227 |


## evaluation

In [30]:
def evaluate_cross_validation(df, metric_list):
    """Average each model's error over all CV cutoffs, per unique_id.

    Pass one metric per call: the final mean is taken over every row of a
    unique_id, so several metrics in metric_list would be averaged together.
    """
    models = df.drop(columns=['unique_id', 'ds', 'cutoff', 'y']).columns.tolist()
    evals = []
    # Calculate the loss for every unique_id within each cutoff.
    for cutoff in df['cutoff'].unique():
        eval_ = evaluate(df[df['cutoff'] == cutoff], metrics=metric_list, models=models)
        evals.append(eval_)
    evals = pd.concat(evals)
    # Average the error over all cutoffs for every combination of model and unique_id
    evals = evals.groupby('unique_id').mean(numeric_only=True)
    evals['best_model'] = evals.idxmin(axis=1)
    return evals


In [36]:
# TODO: this still needs to be tested
evaluation_df_mse = evaluate_cross_validation(crossvaldation_df, [mse])
evaluation_df_mae = evaluate_cross_validation(crossvaldation_df, [mae])
evaluation_df_rmse = evaluate_cross_validation(crossvaldation_df, [rmse])
evaluation_df_smape = evaluate_cross_validation(crossvaldation_df, [smape])

In [37]:
evaluation_df_mse

| unique_id | AutoARIMA | SeasonalNaive | MSTL | best_model |
|---|---|---|---|---|
| 05F0001N-03F0150N | 376.167938 | 97.110977 | 94.56897 | MSTL |
| 05F0001N-03F0201S | 33796.117188 | 4736.969238 | 941777.6 | SeasonalNaive |
| 05F0055N-05F0001N | 16516.925781 | 9754.59082 | 193390.0 | SeasonalNaive |
| 05F0287N-05F0055N | 365189.875 | 115129.757812 | 197608.1 | SeasonalNaive |
| 05F0309N-05F0287N | 128602.203125 | 254116.71875 | 120323.4 | MSTL |
| 05F0438N-05F0309N | 152686.515625 | 249547.9375 | 5095195.0 | AutoARIMA |
| 05F0438N-05FR143N | 578.646057 | 312.696075 | 198.8433 | MSTL |
| 05F0528N-05F0438N | 736.866516 | 1243.195557 | 717.6853 | MSTL |


In [38]:
evaluation_df_mae

| unique_id | AutoARIMA | SeasonalNaive | MSTL | best_model |
|---|---|---|---|---|
| 05F0001N-03F0150N | 10.599102 | 2.695446 | 2.934812 | SeasonalNaive |
| 05F0001N-03F0201S | 150.085693 | 24.024132 | 528.020264 | SeasonalNaive |
| 05F0055N-05F0001N | 86.866219 | 47.741257 | 189.126495 | SeasonalNaive |
| 05F0287N-05F0055N | 451.027863 | 218.448944 | 259.41803 | SeasonalNaive |
| 05F0309N-05F0287N | 185.061447 | 298.864471 | 219.115585 | AutoARIMA |
| 05F0438N-05F0309N | 224.897812 | 266.812561 | 1322.447998 | AutoARIMA |
| 05F0438N-05FR143N | 19.594347 | 8.654881 | 7.409321 | MSTL |
| 05F0528N-05F0438N | 15.577595 | 18.65905 | 15.354082 | MSTL |


In [39]:
evaluation_df_rmse

| unique_id | AutoARIMA | SeasonalNaive | MSTL | best_model |
|---|---|---|---|---|
| 05F0001N-03F0150N | 14.689403 | 8.230233 | 8.192002 | MSTL |
| 05F0001N-03F0201S | 170.975693 | 64.596428 | 624.727783 | SeasonalNaive |
| 05F0055N-05F0001N | 111.534538 | 92.10527 | 230.351501 | SeasonalNaive |
| 05F0287N-05F0055N | 550.075195 | 329.042603 | 324.749542 | MSTL |
| 05F0309N-05F0287N | 326.329315 | 486.213745 | 307.178223 | MSTL |
| 05F0438N-05F0309N | 341.114594 | 463.42688 | 1539.190308 | AutoARIMA |
| 05F0438N-05FR143N | 23.480635 | 17.043928 | 13.883498 | MSTL |
| 05F0528N-05F0438N | 26.699253 | 34.287468 | 26.367897 | MSTL |


In [40]:
evaluation_df_smape

| unique_id | AutoARIMA | SeasonalNaive | MSTL | best_model |
|---|---|---|---|---|
| 05F0001N-03F0150N | 0.145415 | 0.023032 | 0.025212 | SeasonalNaive |
| 05F0001N-03F0201S | 0.559216 | 0.041335 | 0.337324 | SeasonalNaive |
| 05F0055N-05F0001N | 0.259395 | 0.071341 | 0.170383 | SeasonalNaive |
| 05F0287N-05F0055N | 0.262765 | 0.082131 | 0.094278 | SeasonalNaive |
| 05F0309N-05F0287N | 0.489881 | 0.299228 | 0.390857 | SeasonalNaive |
| 05F0438N-05F0309N | 0.277011 | 0.141146 | 0.389826 | SeasonalNaive |
| 05F0438N-05FR143N | 0.103869 | 0.046752 | 0.040476 | MSTL |
| 05F0528N-05F0438N | 0.022337 | 0.026503 | 0.021945 | MSTL |


We still need to decide which metric is a sound basis for the baseline comparison.
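One thing to keep in mind when choosing: the MSE and RMSE tables can disagree on best_model (for example on 05F0287N-05F0055N), because evaluate_cross_validation averages each metric over the cutoffs, and the mean of per-window RMSE is not the square root of the mean per-window MSE. The sketch below illustrates this with made-up per-window values for two hypothetical models; the numbers are not taken from the results above.

```python
import numpy as np

# Hypothetical per-window MSE values (illustration only)
per_window_mse = {
    "A": np.array([100.0, 100.0, 100.0, 100.0]),  # steady errors in every window
    "B": np.array([4.0, 4.0, 4.0, 484.0]),        # mostly small, one very bad window
}

summary = {}
for name, values in per_window_mse.items():
    mean_mse = values.mean()            # how the MSE table aggregates
    mean_rmse = np.sqrt(values).mean()  # how the RMSE table aggregates
    summary[name] = (mean_mse, mean_rmse)
    print(name, mean_mse, mean_rmse)
# A 100.0 10.0
# B 124.0 7.0

best_by_mse = min(summary, key=lambda k: summary[k][0])
best_by_rmse = min(summary, key=lambda k: summary[k][1])
print(best_by_mse, best_by_rmse)  # A B
```

Averaged MSE punishes a single bad window much more heavily than averaged RMSE does, so the choice between them depends on how costly occasional large misses are for the travel time forecasts.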