# modeling - feature selection

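Before moving on to multivariate modeling and prediction, the features need to be screened.  
In principle, we could dump every variable into a neural network and let the model work out which ones are useful.  
The drawback is that more features in the training data also mean longer training time.  
Running a round of feature selection beforehand makes training more efficient.  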
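
As a rough illustration of what such a pre-selection pass can look like, the sketch below ranks features by impurity-based importance from a `RandomForestRegressor`, averaged over `TimeSeriesSplit` folds. The data, the column names, the 0.05 threshold, and the `rank_features` helper are all made up for illustration, and the procedure actually used in this notebook may differ.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit


def rank_features(X, y, n_splits=3, random_state=0):
    """Average impurity-based importances over time-ordered training folds."""
    importances = np.zeros(X.shape[1])
    splitter = TimeSeriesSplit(n_splits=n_splits)
    for train_idx, _ in splitter.split(X):
        model = RandomForestRegressor(n_estimators=50, random_state=random_state)
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        importances += model.feature_importances_
    ranking = pd.Series(importances / n_splits, index=X.columns)
    return ranking.sort_values(ascending=False)


# Synthetic data: only `signal_a` and `signal_b` drive the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(400, 5)),
    columns=["signal_a", "signal_b", "noise_1", "noise_2", "noise_3"],
)
y_demo = (3.0 * X_demo["signal_a"] - 2.0 * X_demo["signal_b"]
          + rng.normal(scale=0.1, size=400))

ranking = rank_features(X_demo, y_demo)
selected = ranking[ranking > 0.05].index.tolist()  # arbitrary cut-off
print(ranking.round(3))
print("selected:", selected)
```

Using time-ordered folds keeps each fit restricted to past data, which matters when the features are later used for forecasting.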

# Lib

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import os
import time
import sys

# Add the directory that contains the hwttp package to the Python search path
sys.path.append(os.path.abspath('..'))
import hwttp.hwtoolkit as tk

# extra
import importlib

In [2]:
import matplotlib.pyplot as plt
from os import listdir
from os.path import isfile, join
import itertools

In [3]:
from pathlib import Path

In [4]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# Load data

Because the full dataset was already built earlier (to avoid the very slow load-and-transform step from scratch),  
we can load the saved parquet directly here.  
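
The build-once, load-afterwards pattern described above can be sketched as follows. `load_or_build` and `build_dataset` are hypothetical names, and the sketch uses pandas' pickle format so that it runs without extra dependencies; with `pyarrow` or `fastparquet` installed, `to_parquet` and `read_parquet` would slot in the same way.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_or_build(cache_path, build_fn):
    """Return the cached frame if present; otherwise build it and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return pd.read_pickle(cache_path)
    df = build_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_pickle(cache_path)
    return df


build_calls = []


def build_dataset():
    """Stand-in for the slow load-and-transform step (hypothetical)."""
    build_calls.append(1)
    return pd.DataFrame({"gf_gt": ["A-B", "A-B"],
                         "WeightedAvgTravelTime": [61.0, 64.5]})


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "features" / "demo.pkl"
    first = load_or_build(cache_file, build_dataset)   # builds and writes the cache
    second = load_or_build(cache_file, build_dataset)  # reads the cache, no rebuild

print(len(build_calls), first.equals(second))
```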

In [3]:
data_paths = {'etag_5n_loc': '../data/cleaned/etag_5n_loc.csv',
              'section_info': '../data/cleaned/section_info.csv',
              'hw5_m04a_df': '../data/cleaned/hw5_m04a.csv',
              'congestion_table': '../data/cleaned/congestion_table.csv',
              'calendar_event': '../data/cleaned/calendar_event.csv',
              'road_build_event': '../data/cleaned/202301_10_road_build_event.xlsx',
              'traffic_accident_data': '../data/cleaned/202301_10_traffic_accident_data.xlsx'
             }

rs = tk.hw_df_resource(data_paths)
rs.load_raw_environment_info()
rs.load_raw_event_info()
rs.generate_mile_location_info()

Complete loading environment and gantry info
Complete loading raw event info
Complete generating mile location info


In [4]:
# Drop calendar events presumed to have little bearing on traffic
drop_event_list = ['婦女節', '愚人節', '復活節', '地球日', '感恩節', 'Black Friday', 'Cyber Monday', '冬至']
rs.calendar_event = rs.calendar_event[~rs.calendar_event.event_name.isin(drop_event_list)].copy()

In [5]:
# Travel time data
hw5_15watt = pd.read_csv("../data/features/hw5_15watt.csv")
hw5_15watt = hw5_15watt[~hw5_15watt.gf_gt.isin(['03F0150N-03F0140N', 
                                                '03F0201S-03A0041N', 
                                                '03F0201S-03F0217S'])].copy()

hw5_15watt = hw5_15watt[hw5_15watt.TimeStamp<'2024-01-01 00:00:00'].copy()
hw5_15watt['TimeStamp'] = pd.to_datetime(hw5_15watt['TimeStamp'])

In [6]:
hw5_15watt.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 280320 entries, 131328 to 472799
Data columns (total 6 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   gf_gt                  280320 non-null  object        
 1   GantryFrom             280320 non-null  object        
 2   GantryTo               280320 non-null  object        
 3   TimeStamp              280320 non-null  datetime64[ns]
 4   WeightedAvgTravelTime  280320 non-null  float64       
 5   TotalTraffic           280320 non-null  float64       
dtypes: datetime64[ns](1), float64(2), object(3)
memory usage: 15.0+ MB
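
The row count above is consistent with a complete 15-minute grid: 365 days × 96 intervals per day = 35,040 rows per gantry pair, and 280,320 / 35,040 = 8 pairs. That reading assumes 15-minute spacing over one non-leap year, which the output itself does not state. A small check along these lines, on made-up data with a hypothetical `missing_slots` helper, could confirm that no pair has gaps:

```python
import pandas as pd


def missing_slots(df, freq="15min"):
    """Count, per gantry pair, the timestamps missing from a regular grid."""
    full_index = pd.date_range(df["TimeStamp"].min(), df["TimeStamp"].max(), freq=freq)
    observed = df.groupby("gf_gt")["TimeStamp"].nunique()
    return len(full_index) - observed


# Toy frame: pair "A-B" is complete for one day, pair "B-C" lacks two slots.
day = pd.date_range("2023-01-01", periods=96, freq="15min")
demo = pd.concat(
    [
        pd.DataFrame({"gf_gt": "A-B", "TimeStamp": day}),
        pd.DataFrame({"gf_gt": "B-C", "TimeStamp": day.delete([10, 11])}),
    ],
    ignore_index=True,
)

gaps = missing_slots(demo)
print(gaps.to_dict())

# The arithmetic behind the 280,320 rows, under the stated assumptions.
rows_per_pair = 365 * 96
print(rows_per_pair, 280320 // rows_per_pair)
```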


# Combinations

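New kinds of variables can be added by defining a separate handler function for each data source.  
All feature families are included here:  
c: congestion-time information for the road section  
h: holiday information  
t: traffic accident information  
r: road construction information  
p: travel time of the corresponding downstream section over the previous five minutes  
  
The goal is to produce, for each gantry pair, the full set of features its travel-time model will consider by default.  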
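
To make the `p` family more concrete, here is a rough sketch of attaching an earlier travel time from a downstream section. It is not the implementation of `tk.add_ds_5prev_traveltime`, whose internals are not shown in this notebook; the `downstream_of` mapping, the `add_prev_downstream` helper, the row-based `lag`, and the column name `ds_prev_tt` are all hypothetical.

```python
import pandas as pd


def add_prev_downstream(df, downstream_of, lag=1):
    """Attach the downstream pair's travel time from `lag` rows earlier."""
    df = df.sort_values(["gf_gt", "TimeStamp"]).copy()
    # Lagged travel time within each gantry pair.
    lagged = df[["gf_gt", "TimeStamp"]].copy()
    lagged["ds_prev_tt"] = df.groupby("gf_gt")["WeightedAvgTravelTime"].shift(lag)
    lagged = lagged.rename(columns={"gf_gt": "ds_gf_gt"})
    # Look up each row's downstream pair, then join on (downstream pair, time).
    df["ds_gf_gt"] = df["gf_gt"].map(downstream_of)
    out = df.merge(lagged, on=["ds_gf_gt", "TimeStamp"], how="left")
    return out.drop(columns="ds_gf_gt")


times = pd.date_range("2023-01-01 08:00", periods=3, freq="15min")
demo = pd.DataFrame({
    "gf_gt": ["up"] * 3 + ["down"] * 3,
    "TimeStamp": list(times) * 2,
    "WeightedAvgTravelTime": [100.0, 110.0, 120.0, 50.0, 55.0, 60.0],
})

with_p = add_prev_downstream(demo, downstream_of={"up": "down"})
up_rows = with_p[with_p["gf_gt"] == "up"]
print(up_rows)

# Filtering to one pair *before* the lookup loses the downstream rows entirely.
only_up = demo[demo["gf_gt"] == "up"]
all_missing = add_prev_downstream(only_up, {"up": "down"})["ds_prev_tt"].isna().all()
print(all_missing)
```

The second part illustrates why a lookup of this kind has to run on the full frame, with the restriction to a single gantry pair applied afterwards.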

In [7]:
# For each data source, define a function that appends its features to an existing df
def add_c(df):
    output_df = tk.add_congestion_condition(df, rs.congestion_table, rs.milelocation_info_df).copy()
    return output_df

def add_h(df):
    output_df = tk.add_calendar_event(df, rs.calendar_event).copy()
    return output_df

def add_t(df):
    *_, output_df = tk.add_traffic_event(df, rs.traffic_accident_data, rs.milelocation_info_df)
    return output_df

def add_r(df):
    *_, output_df = tk.add_road_build_event(df, rs.road_build_event, rs.milelocation_info_df)
    return output_df

def add_p(df):
    output_df = tk.add_ds_5prev_traveltime(df)
    return output_df
    
# function_set = [add_c, add_h, add_t, add_r, add_p]

# Keep the functions in a dict so they can be looked up and called directly later
element_dict = {'_c': add_c, 
                '_h': add_h, 
                '_t': add_t, 
                '_r': add_r, 
                '_p': add_p
               }
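
The dictionary-of-functions pattern above lets a list of suffixes decide which feature steps run. A self-contained toy version, with made-up stand-ins (`add_flag_a`, `add_flag_b`, `apply_steps`) in place of the `tk` helpers, might look like this:

```python
import pandas as pd


def add_flag_a(df):
    """Toy feature step: append a constant column."""
    return df.assign(flag_a=1)


def add_flag_b(df):
    """Toy feature step: append a column derived from an existing one."""
    return df.assign(flag_b=df["value"] > 1)


steps = {"_a": add_flag_a, "_b": add_flag_b}


def apply_steps(df, keys, registry):
    """Apply the registered steps in order and build a label from their keys."""
    label = "base"
    for key in keys:
        if key not in registry:
            raise KeyError(f"unknown feature key: {key!r}")
        df = registry[key](df)
        label += key
    return label, df


label, out = apply_steps(pd.DataFrame({"value": [1, 2, 3]}), ["_a", "_b"], steps)
print(label, list(out.columns))
```

Checking the key before the lookup turns a typo in a feature list into an explicit error message rather than a bare `KeyError` deep inside a long-running loop.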

In [8]:
# combinations: predefine which feature families to attach for each gantry pair
combinations = {'05F0001N-03F0150N': ['_c', '_h', '_t', '_r', '_p'],
                '05F0001N-03F0201S': ['_c', '_h', '_t', '_r', '_p'],
                '05F0055N-05F0001N': ['_c', '_h', '_t', '_r', '_p'],
                '05F0287N-05F0055N': ['_c', '_h', '_t', '_r', '_p'],
                '05F0309N-05F0287N': ['_c', '_h', '_t', '_r', '_p'],
                '05F0438N-05F0309N': ['_c', '_h', '_t', '_r', '_p'],
                '05F0438N-05FR143N': ['_c', '_h', '_t', '_r', '_p'],
                '05F0528N-05F0438N': ['_c', '_h', '_t', '_r', '_p'],
               }
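
All eight entries above request the same five feature families. If the feature functions are applied to the full frame and only filtered to one gantry pair afterwards, as the loop in the next cell does, the expensive step could in principle run once per distinct feature list instead of once per pair. A sketch of that idea on toy data follows; `build_once_per_combo` and `slow_step` are hypothetical, and it assumes each step's output does not depend on which pair is being processed.

```python
from collections import defaultdict

import pandas as pd


def build_once_per_combo(df, combos, registry, key_col="gf_gt"):
    """Build features once per distinct key list, then slice per pair."""
    pairs_by_combo = defaultdict(list)
    for pair, keys in combos.items():
        pairs_by_combo[tuple(keys)].append(pair)

    results = {}
    for keys, pairs in pairs_by_combo.items():
        featured = df.copy()
        for key in keys:
            featured = registry[key](featured)
        for pair in pairs:
            results[pair + "".join(keys)] = featured[featured[key_col] == pair].copy()
    return results


calls = {"n": 0}


def slow_step(df):
    """Stand-in for an expensive feature step; counts how often it runs."""
    calls["n"] += 1
    return df.assign(feature_x=df["value"] * 2)


demo = pd.DataFrame({"gf_gt": ["P1", "P1", "P2", "P3"], "value": [1, 2, 3, 4]})
demo_combos = {"P1": ["_x"], "P2": ["_x"], "P3": ["_x"]}

out = build_once_per_combo(demo, demo_combos, {"_x": slow_step})
print(sorted(out), calls["n"])
```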

In [11]:
# Dict for storing the results
results = dict()

# Run every combination
for gantry_pair, combo in tqdm(combinations.items()):
    
    # initialization
    result_df = hw5_15watt.copy()
    output_string = gantry_pair+'_b'
    print(gantry_pair, combo)
    combo_list = combo.copy()
    
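    # Handle `_p` first: its method cannot yet attach previous values once other columns have been added, and skipping this would multiply the number of columns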
    if '_p' in combo_list:
        combo_list.remove('_p')
        func = element_dict['_p']
        result_df = func(result_df)
        output_string += '_p'
    
    # If the list is empty we should skip straight to the end; that case is not specially handled yet
    print('combo_list = ', combo_list)
    
    for func_name in combo_list:
        func = element_dict[func_name]
        result_df = func(result_df)
        output_string += func_name

    # Restrict to this gf_gt only at the end; otherwise the previous-interval lookup would find no data
    result_df = result_df[result_df['gf_gt']==gantry_pair].copy()

    results[output_string] = result_df.copy()


  0%|          | 0/8 [00:00<?, ?it/s]

05F0001N-03F0150N ['_c', '_h', '_t', '_r', '_p']
combo_list =  ['_c', '_h', '_t', '_r']
locate same df.index before
[... line above printed 16 times in total; nested progress bars omitted ...]

Complete checking traffic accident data, total event count = 115
Complete traffic accident data insertion
will return 2 object: target_traffic_accident_data, df
Complete checking & extract road_build_event
Complete road_build_event insertion to current df
will return 2 object: road_build_event, df
05F0001N-03F0201S ['_c', '_h', '_t', '_r', '_p']
combo_list =  ['_c', '_h', '_t', '_r']
locate same df.index before
[... line above printed 16 times in total; nested progress bars omitted ...]

Complete checking traffic accident data, total event count = 115
Complete traffic accident data insertion
will return 2 object: target_traffic_accident_data, df
Complete checking & extract road_build_event
Complete road_build_event insertion to current df
will return 2 object: road_build_event, df
05F0055N-05F0001N ['_c', '_h', '_t', '_r', '_p']
combo_list =  ['_c', '_h', '_t', '_r']
locate same df.index before
[... line above printed 16 times in total; nested progress bars omitted ...]

Complete checking traffic accident data, total event count = 115
Complete traffic accident data insertion
will return 2 object: target_traffic_accident_data, df
Complete checking & extract road_build_event
Complete road_build_event insertion to current df
will return 2 object: road_build_event, df
05F0287N-05F0055N ['_c', '_h', '_t', '_r', '_p']
combo_list =  ['_c', '_h', '_t', '_r']
locate same df.index before
[... line above printed 16 times in total; nested progress bars omitted ...]

Complete checking traffic accident data, total event count = 115
Complete traffic accident data insertion
will return 2 object: target_traffic_accident_data, df
Complete checking & extract road_build_event
Complete road_build_event insertion to current df
will return 2 object: road_build_event, df
05F0309N-05F0287N ['_c', '_h', '_t', '_r', '_p']
combo_list =  ['_c', '_h', '_t', '_r']
locate same df.index before
[... line above printed 16 times in total; nested progress bars omitted ...]

Complete checking traffic accident data, total event count = 115
Complete traffic accident data insertion
will return 2 object: target_traffic_accident_data, df
Complete checking & extract road_build_event
Complete road_build_event insertion to current df
will return 2 object: road_build_event, df
05F0438N-05F0309N ['_c', '_h', '_t', '_r', '_p']
combo_list =  ['_c', '_h', '_t', '_r']
locate same df.index before
[... line above printed 16 times in total; nested progress bars omitted ...]

Complete checking traffic accident data, total event count = 115
Complete traffic accident data insertion
will return 2 object: target_traffic_accident_data, df
Complete checking & extract road_build_event
Complete road_build_event insertion to current df
will return 2 object: road_build_event, df


 75%|███████▌  | 6/8 [12:06<04:03, 121.80s/it]

05F0438N-05FR143N ['_c', '_h', '_t', '_r', '_p']
combo_list =  ['_c', '_h', '_t', '_r']
locate same df.index before
locate same df.index before
locate same df.index before
locate same df.index before
locate same df.index before
locate same df.index before
locate same df.index before
locate same df.index before
locate same df.index before
locate same df.index before
locate same df.index before
locate same df.index before
locate same df.index before
locate same df.index before
locate same df.index before
locate same df.index before



  0%|          | 0/179 [00:00<?, ?it/s][A
100%|██████████| 179/179 [00:00<00:00, 1103.74it/s][A


Complete checking traffic accident data, total event count = 115



  0%|          | 0/115 [00:00<?, ?it/s][A
  3%|▎         | 3/115 [00:00<00:04, 22.81it/s][A
  6%|▌         | 7/115 [00:00<00:03, 31.47it/s][A
 10%|▉         | 11/115 [00:00<00:03, 34.66it/s][A
 13%|█▎        | 15/115 [00:00<00:03, 32.91it/s][A
 17%|█▋        | 19/115 [00:00<00:02, 32.05it/s][A
 20%|██        | 23/115 [00:00<00:03, 29.26it/s][A
 23%|██▎       | 26/115 [00:00<00:03, 28.60it/s][A
 25%|██▌       | 29/115 [00:00<00:03, 27.97it/s][A
 28%|██▊       | 32/115 [00:01<00:03, 27.41it/s][A
 30%|███       | 35/115 [00:01<00:02, 26.81it/s][A
 35%|███▍      | 40/115 [00:01<00:02, 32.34it/s][A
 38%|███▊      | 44/115 [00:01<00:02, 28.27it/s][A
 42%|████▏     | 48/115 [00:01<00:02, 30.85it/s][A
 45%|████▌     | 52/115 [00:01<00:01, 33.16it/s][A
 49%|████▊     | 56/115 [00:01<00:01, 32.38it/s][A
 53%|█████▎    | 61/115 [00:01<00:01, 35.24it/s][A
 57%|█████▋    | 65/115 [00:02<00:01, 30.64it/s][A
 61%|██████    | 70/115 [00:02<00:01, 33.85it/s][A
 64%|██████▍   | 74/11

Complete traffic accident data insertion
will return 2 object: target_traffic_accident_data, df




Complete checking & extract road_build_event




Complete road_build_event insertion to current df
will return 2 object: road_build_event, df


 88%|████████▊ | 7/8 [14:06<02:01, 121.21s/it]

05F0528N-05F0438N ['_c', '_h', '_t', '_r', '_p']
combo_list =  ['_c', '_h', '_t', '_r']
locate same df.index before





Complete checking traffic accident data, total event count = 115




Complete traffic accident data insertion
will return 2 object: target_traffic_accident_data, df




Complete checking & extract road_build_event




Complete road_build_event insertion to current df
will return 2 object: road_build_event, df





In [14]:
common_cols = ['gf_gt', 'GantryFrom', 'GantryTo', 'TimeStamp', 'WeightedAvgTravelTime']

for i in results.keys():
    print(i)
    print(f'number of extra features = {len(set(results[i].keys()) - set(common_cols))}')
    print('------------')

05F0001N-03F0150N_b_p_c_h_t_r
number of extra features = 58
------------
05F0001N-03F0201S_b_p_c_h_t_r
number of extra features = 58
------------
05F0055N-05F0001N_b_p_c_h_t_r
number of extra features = 58
------------
05F0287N-05F0055N_b_p_c_h_t_r
number of extra features = 58
------------
05F0309N-05F0287N_b_p_c_h_t_r
number of extra features = 58
------------
05F0438N-05F0309N_b_p_c_h_t_r
number of extra features = 58
------------
05F0438N-05FR143N_b_p_c_h_t_r
number of extra features = 58
------------
05F0528N-05F0438N_b_p_c_h_t_r
number of extra features = 58
------------


## to parquet

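The raw data with all features is saved here so that it can be loaded quickly next time.  
Parquet has no object type, so object columns need to be converted to string during processing.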

In [17]:
results['05F0001N-03F0150N_b_p_c_h_t_r'].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 35040 entries, 0 to 35039
Data columns (total 63 columns):
 #   Column                           Non-Null Count  Dtype         
---  ------                           --------------  -----         
 0   gf_gt                            35040 non-null  object        
 1   GantryFrom                       35040 non-null  object        
 2   GantryTo                         35040 non-null  object        
 3   TimeStamp                        35040 non-null  datetime64[ns]
 4   WeightedAvgTravelTime            35040 non-null  float64       
 5   TotalTraffic                     35040 non-null  float64       
 6   ds_prev_1_WATT                   35040 non-null  float64       
 7   ds_prev_2_WATT                   35040 non-null  float64       
 8   ds_prev_3_WATT                   35040 non-null  float64       
 9   ds_prev_4_WATT                   35040 non-null  float64       
 10  ds_prev_5_WATT                   35040 non-null  float64  

In [23]:
# Save each frame to parquet so the slow processing above can be skipped next time
for df_name, df in results.items():
    object_columns = df.select_dtypes(include='object').columns
    df[object_columns] = df[object_columns].astype('string')
    df.to_parquet(f'../data/features/all_features_by_intergrantry/{df_name}.parquet')
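The cell above converts the object columns in place, so the frames held in `results` are changed as a side effect. A minimal sketch of the same conversion on a copy is shown here; the helper name `objects_to_string` and the two-row demo frame are assumptions for illustration only, not part of the original notebook.

```python
import pandas as pd


def objects_to_string(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose object columns use the pandas 'string' dtype."""
    object_columns = df.select_dtypes(include="object").columns
    # astype with a dict returns a new frame and leaves the input untouched
    return df.astype({col: "string" for col in object_columns})


# Hypothetical two-row frame standing in for one entry of `results`
demo = pd.DataFrame({
    "gf_gt": pd.Series(["05F0001N-03F0150N", "05F0001N-03F0150N"], dtype="object"),
    "WeightedAvgTravelTime": [61.5, 58.2],
})

converted = objects_to_string(demo)
print(converted.dtypes)
```

The converted frame can then be passed to `to_parquet` exactly as in the cell above, while the original frame keeps its dtypes.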

# Feature Selection

## read from parquet

In [5]:
parquet_path = "../data/features/all_features_by_intergrantry/"

p = Path(parquet_path)
results = dict()

for f in p.glob("*.parquet"):
    results[f.name.replace('.parquet', '')] = pd.read_parquet(f)

## main

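I first tried to implement this with RFECV, but that algorithm settles on the final feature set by itself during selection, which makes it inconvenient for us to choose features by importance.  
So I went with a different approach.  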
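The function in the next cell relies on scikit-learn's `TimeSeriesSplit`. As a small sketch of what that splitter does, the snippet below runs it on 12 placeholder rows (the array `rows` is an assumption for illustration) and prints the index ranges of each fold.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 placeholder rows standing in for a time-ordered feature table
rows = np.arange(12).reshape(-1, 1)

splits = list(TimeSeriesSplit(n_splits=3).split(rows))
for fold, (train_idx, test_idx) in enumerate(splits):
    # The training window grows each fold; the test block always comes after it
    print(f"fold {fold}: train {train_idx.min()}..{train_idx.max()}, "
          f"test {test_idx.min()}..{test_idx.max()}")
```

Each fold trains only on rows that precede its test block, so no future observation leaks into training.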

In [6]:
def rolling_window_feature_selection(X, y, model, n_splits=5):
    """Average the model's feature importances over TimeSeriesSplit folds."""
    tscv = TimeSeriesSplit(n_splits=n_splits)
    feature_importances = np.zeros(X.shape[1])

    for train_index, test_index in tscv.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        # Out-of-sample error of this fold; only used for optional debugging
        mse = mean_squared_error(y_test, y_pred)
        # print(f"Mean Squared Error: {mse}")

        # Accumulate the feature importances of this fold
        feature_importances += model.feature_importances_

    # Compute the mean feature importance across folds
    feature_importances /= n_splits
    return feature_importances
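To get a feel for how fold-averaged importances behave, here is a minimal sketch on synthetic data. The compact helper `average_importances`, the column names, and the generated data are all assumptions for illustration. It follows the same idea as the function above but leaves out the per-fold MSE.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit


def average_importances(X, y, model, n_splits=5):
    """Average model.feature_importances_ over expanding-window folds."""
    total = np.zeros(X.shape[1])
    for train_idx, _ in TimeSeriesSplit(n_splits=n_splits).split(X):
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        total += model.feature_importances_
    return total / n_splits


# Synthetic data: only "signal" actually drives the target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})
y_demo = 3.0 * X_demo["signal"] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=20, random_state=42)
importances = pd.Series(
    average_importances(X_demo, y_demo, model, n_splits=5),
    index=X_demo.columns,
).sort_values(ascending=False)
print(importances)
```

Because each fit yields importances that sum to 1, the averaged values also sum to 1. This is what makes the per-section rankings comparable.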

In [None]:
# Holds the feature selection results
results_fs = dict()

for item in tqdm(results.items()):
    cur_df_name = item[0]
    cur_df = item[1]

    # Skip sections that already have a result (useful when resuming with an existing results_fs)
    if cur_df_name in results_fs.keys():
        continue
    
    # Separate the features from the target variable
    X = cur_df.drop(columns=['WeightedAvgTravelTime', 'gf_gt', 'GantryFrom', 'GantryTo', 'TimeStamp', 'TotalTraffic'])
    y = cur_df['WeightedAvgTravelTime']
    
    # Define the baseline model
    base_model = RandomForestRegressor(n_estimators=100, random_state=42)
    
    # Compute the feature importances
    feature_importances = rolling_window_feature_selection(X, y, base_model, n_splits=180)
    feature_importances_df = pd.DataFrame({'feature': X.columns,
                                           'importance': feature_importances
                                          }).sort_values(by='importance', ascending=False)
    
    # Pick the most important features
    selected_features = feature_importances_df['feature'].head(10).values  # assume we keep the 10 most important features
    print(f"df = {cur_df_name} Selected features: {selected_features}")

    results_fs[cur_df_name] = feature_importances_df.copy()

print('feature selection complete')

 12%|█▎        | 1/8 [01:37<11:24, 97.74s/it]

df = 05F0001N-03F0150N_b_p_c_h_t_r Selected features: ['dayofweek' 'holiday_length' 'holiday_name_萬聖節' 'holiday_name_七夕情人節'
 'holiday_name_西洋情人節' 'holiday_continue' 'holiday_name_農曆新年'
 'holiday_name_國慶日' 'holiday_name_二二八紀念日' 'holiday_name_跨年元旦']


 25%|██▌       | 2/8 [03:10<09:28, 94.67s/it]

df = 05F0001N-03F0201S_b_p_c_h_t_r Selected features: ['dayofweek' 'holiday_length' 'holiday_continue' 'holiday_name_七夕情人節'
 'holiday_name_跨年元旦' 'holiday_name_西洋情人節' 'holiday_name_母親節'
 'holiday_name_中秋節' 'holiday_name_教師節' 'holiday_name_端午節']


 38%|███▊      | 3/8 [45:08<1:40:07, 1201.52s/it]

df = 05F0055N-05F0001N_b_p_c_h_t_r Selected features: ['ds_prev_1_WATT' 'ds_prev_2_WATT' 'dayofweek' 'ds_prev_3_WATT'
 'ds_prev_5_WATT' 'ds_prev_4_WATT' 'accident_mileage' 'holiday_length'
 'holiday_continue' 'handling_minutes']


 50%|█████     | 4/8 [1:26:28<1:53:43, 1705.92s/it]

df = 05F0287N-05F0055N_b_p_c_h_t_r Selected features: ['ds_prev_1_WATT' 'ds_prev_2_WATT' 'ds_prev_4_WATT' 'ds_prev_3_WATT'
 'ds_prev_5_WATT' 'dayofweek' 'holiday_length' 'congestion_syndrome'
 'holiday_continue' 'road_build']


 62%|██████▎   | 5/8 [2:17:58<1:50:15, 2205.15s/it]

df = 05F0309N-05F0287N_b_p_c_h_t_r Selected features: ['ds_prev_1_WATT' 'ds_prev_5_WATT' 'dayofweek' 'ds_prev_2_WATT'
 'ds_prev_4_WATT' 'ds_prev_3_WATT' 'holiday_length' 'holiday_continue'
 'congestion_syndrome' 'holiday_name_跨年元旦']


 75%|███████▌  | 6/8 [3:05:32<1:20:51, 2425.94s/it]

df = 05F0438N-05F0309N_b_p_c_h_t_r Selected features: ['ds_prev_1_WATT' 'ds_prev_5_WATT' 'ds_prev_2_WATT' 'ds_prev_4_WATT'
 'ds_prev_3_WATT' 'dayofweek' 'holiday_length' 'congestion_syndrome'
 'holiday_name_端午節' 'road_build']


 88%|████████▊ | 7/8 [3:07:33<27:52, 1672.19s/it]  

df = 05F0438N-05FR143N_b_p_c_h_t_r Selected features: ['road_build' 'dayofweek' 'holiday_length' 'road_block_count'
 'holiday_name_農曆新年' 'holiday_name_跨年元旦' 'congestion_syndrome'
 'holiday_continue' 'total_block_count' 'holiday_name_端午節']


100%|██████████| 8/8 [3:57:08<00:00, 1778.59s/it]

df = 05F0528N-05F0438N_b_p_c_h_t_r Selected features: ['ds_prev_1_WATT' 'ds_prev_3_WATT' 'ds_prev_2_WATT' 'ds_prev_4_WATT'
 'ds_prev_5_WATT' 'dayofweek' 'holiday_length' 'holiday_continue'
 'road_build' 'road_block_count']
feature selection complete





## quick view of results

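The feature selection step has already computed the average importances; the snippet below gives a quick view of the features with importance > 0, in ranked order.  
The importances differ from one road section to another: some sections lean on the downstream travel time, some on holidays, and for some the top-ranked feature is whether the road is under construction.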

In [None]:
for i in results_fs.keys():
    print(i)
    display(results_fs[i].query('importance > 0'))

05F0001N-03F0150N_b_p_c_h_t_r


Unnamed: 0,feature,importance
8,dayofweek,0.610277
7,holiday_length,0.132508
23,holiday_name_萬聖節,0.090272
9,holiday_name_七夕情人節,0.035597
24,holiday_name_西洋情人節,0.028306
6,holiday_continue,0.028191
26,holiday_name_農曆新年,0.014103
16,holiday_name_國慶日,0.011962
12,holiday_name_二二八紀念日,0.011498
25,holiday_name_跨年元旦,0.008412


05F0001N-03F0201S_b_p_c_h_t_r


Unnamed: 0,feature,importance
8,dayofweek,0.636347
7,holiday_length,0.20783
6,holiday_continue,0.12204
9,holiday_name_七夕情人節,0.009345
25,holiday_name_跨年元旦,0.007429
24,holiday_name_西洋情人節,0.002793
18,holiday_name_母親節,0.00237
11,holiday_name_中秋節,0.002101
17,holiday_name_教師節,0.001563
21,holiday_name_端午節,0.001469


05F0055N-05F0001N_b_p_c_h_t_r


Unnamed: 0,feature,importance
0,ds_prev_1_WATT,0.2378905
1,ds_prev_2_WATT,0.1332968
8,dayofweek,0.1318631
2,ds_prev_3_WATT,0.08975141
4,ds_prev_5_WATT,0.07473619
3,ds_prev_4_WATT,0.07428364
27,accident_mileage,0.0655821
7,holiday_length,0.04512436
6,holiday_continue,0.03379307
30,handling_minutes,0.02932843


05F0287N-05F0055N_b_p_c_h_t_r


Unnamed: 0,feature,importance
0,ds_prev_1_WATT,0.702646
1,ds_prev_2_WATT,0.05938129
3,ds_prev_4_WATT,0.05823486
2,ds_prev_3_WATT,0.05525221
4,ds_prev_5_WATT,0.0494052
8,dayofweek,0.0193886
7,holiday_length,0.01147614
5,congestion_syndrome,0.00909245
6,holiday_continue,0.005254969
53,road_build,0.00498422


05F0309N-05F0287N_b_p_c_h_t_r


Unnamed: 0,feature,importance
0,ds_prev_1_WATT,0.6270457
4,ds_prev_5_WATT,0.06945312
8,dayofweek,0.05829736
1,ds_prev_2_WATT,0.05136546
3,ds_prev_4_WATT,0.04757021
2,ds_prev_3_WATT,0.04525812
7,holiday_length,0.04259185
6,holiday_continue,0.02585313
5,congestion_syndrome,0.01061701
25,holiday_name_跨年元旦,0.005217496


05F0438N-05F0309N_b_p_c_h_t_r


Unnamed: 0,feature,importance
0,ds_prev_1_WATT,0.7595523
4,ds_prev_5_WATT,0.05281273
1,ds_prev_2_WATT,0.05242279
3,ds_prev_4_WATT,0.03896254
2,ds_prev_3_WATT,0.03596455
8,dayofweek,0.02365059
7,holiday_length,0.01144466
5,congestion_syndrome,0.004481054
21,holiday_name_端午節,0.003131
53,road_build,0.002596335


05F0438N-05FR143N_b_p_c_h_t_r


Unnamed: 0,feature,importance
53,road_build,0.419806
8,dayofweek,0.376563
7,holiday_length,0.055563
55,road_block_count,0.027151
26,holiday_name_農曆新年,0.0178
25,holiday_name_跨年元旦,0.017045
5,congestion_syndrome,0.016521
6,holiday_continue,0.016115
54,total_block_count,0.009573
21,holiday_name_端午節,0.009007


05F0528N-05F0438N_b_p_c_h_t_r


Unnamed: 0,feature,importance
0,ds_prev_1_WATT,0.2186123
2,ds_prev_3_WATT,0.1770015
1,ds_prev_2_WATT,0.1741344
3,ds_prev_4_WATT,0.1697923
4,ds_prev_5_WATT,0.1678288
8,dayofweek,0.04713649
7,holiday_length,0.01368403
6,holiday_continue,0.005498046
53,road_build,0.00471249
55,road_block_count,0.003497024
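To compare sections side by side rather than scrolling through eight separate tables, the per-section frames can be pivoted into one matrix. This is a sketch only: `importance_matrix` is a hypothetical helper, and `results_fs_demo` is a cut-down stand-in for `results_fs` that reuses a few values from the tables above.

```python
import pandas as pd

# Stand-in for results_fs: two sections, two features each (values taken from the tables above)
results_fs_demo = {
    "05F0001N-03F0150N_b_p_c_h_t_r": pd.DataFrame({
        "feature": ["dayofweek", "holiday_length"],
        "importance": [0.610277, 0.132508],
    }),
    "05F0287N-05F0055N_b_p_c_h_t_r": pd.DataFrame({
        "feature": ["ds_prev_1_WATT", "dayofweek"],
        "importance": [0.702646, 0.0193886],
    }),
}


def importance_matrix(results_fs):
    """One row per feature, one column per section, values = importance."""
    long = pd.concat(
        [df.assign(section=name) for name, df in results_fs.items()],
        ignore_index=True,
    )
    # Features missing from a section stay NaN rather than being filled in
    return long.pivot(index="feature", columns="section", values="importance")


matrix = importance_matrix(results_fs_demo)
print(matrix)
```

Reading across a row then shows, for example, how much less weight `dayofweek` carries on a section that is dominated by `ds_prev_1_WATT`.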


## to_csv

In [15]:
for i in results_fs.keys():
    f_name = f'../data/features/feature_selection_results/{i}.csv'
    results_fs[i].to_csv(f_name, index=False)
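A later notebook would presumably read these CSV files back and keep only the top-ranked features. A minimal sketch of that step follows; `load_top_features` is a hypothetical helper, and the demo writes a tiny made-up table to a temporary directory instead of touching the real output folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_top_features(csv_path, k=10):
    """Return the k most important feature names from a saved importance CSV."""
    table = pd.read_csv(csv_path)
    ranked = table.sort_values("importance", ascending=False)
    return ranked["feature"].head(k).tolist()


# Demo with a made-up three-row table in the same feature/importance layout
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_section.csv"
    pd.DataFrame({
        "feature": ["dayofweek", "ds_prev_1_WATT", "road_build"],
        "importance": [0.02, 0.70, 0.005],
    }).to_csv(demo_path, index=False)
    top_two = load_top_features(demo_path, k=2)

print(top_two)
```

Sorting again after loading keeps the helper correct even if a file was saved in a different row order.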