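# Youbike Usage Forecasting

## Background

Youbike (Ubike) is Taipei City's bike-sharing system, offering residents a convenient option for short trips. The main goal of this project is to use data-science methods to estimate the potential usage of locations that do not yet have a Youbike station, giving city planners a better-grounded basis for choosing station sites. By analyzing Youbike 2.0 rental and return flows together with geographic and spatio-temporal data around each station, we build a predictive model of how usage is likely to develop once a station is installed.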

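> #### Why Youbike 2.0?
> Taipei City has now fully switched to Youbike 2.0. Because its docks need no wired power or network connection, 2.0 stations are easier to site. All flow figures discussed below are Youbike 2.0 data.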

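## Project Goals

1. **Estimate potential usage at locations without a Youbike station:** Using historical rental/return flow data, we explore which areas could benefit from a new Youbike station, in order to raise bike-sharing usage.

2. **Examine the effect of location:** Analyze geographic information around each station, such as proximity to an MRT station, convenience stores, or schools, to understand how these factors affect usage.

3. **Integrate spatio-temporal data for prediction:** Account for spatio-temporal factors such as weather and time to build a more accurate model of how Youbike usage changes.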

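## Methodology

1. **Data collection:** Collect Youbike rental/return flow data from 2020 through 2023, at hourly resolution for each station.

2. **Location analysis:** Analyze the surroundings of each station, including MRT stations, convenience stores, and schools, and build corresponding features.

3. **Spatio-temporal data integration:** Integrate spatio-temporal data, including the weather at each station at each point in time and other time-related factors that may affect usage.

4. **Machine-learning model training:** Train a predictive model on the historical data with machine-learning algorithms and evaluate its performance.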

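# Data Overview

This section briefly introduces the data sources and their structure. For the details of the feature-engineering algorithms, see `transformation/features.py/VillageFeature`.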

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import warnings

warnings.filterwarnings('ignore')

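## Youbike Data -- Flow

Taipei City Public Bicycle 2.0 rental records (臺北市公共自行車 2.0 租借紀錄), obtained from the government open data site and summed by hour, give the number of bikes rented out and returned at each Youbike station in each hour.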
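The hourly roll-up described above can be sketched in plain Python as follows. This is a minimal illustration, not the project's actual pipeline: the trip tuple layout `(rent_station, rent_time, return_station, return_time)` and the function names `floor_to_hour` and `hourly_flow` are assumptions. Only the output columns (`name`, `time`, `count_out`, `count_in`) mirror the flow table in this notebook.

```python
from collections import Counter
from datetime import datetime


def floor_to_hour(ts):
    """Truncate a datetime to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def hourly_flow(trips):
    """Aggregate trips into per-station, per-hour rent/return counts.

    `trips` is an iterable of hypothetical tuples
    (rent_station, rent_time, return_station, return_time).
    Returns rows of (name, time, count_out, count_in), sorted by name and time.
    """
    count_out, count_in = Counter(), Counter()
    for rent_station, rent_time, return_station, return_time in trips:
        count_out[(rent_station, floor_to_hour(rent_time))] += 1
        count_in[(return_station, floor_to_hour(return_time))] += 1
    # A Counter returns 0 for keys it has never seen, so a station-hour
    # with only rentals (or only returns) still gets a complete row.
    keys = sorted(set(count_out) | set(count_in))
    return [(name, hour, count_out[(name, hour)], count_in[(name, hour)])
            for name, hour in keys]


# Two made-up trips: both start at station "A" during the 18:00 hour.
trips = [
    ("A", datetime(2022, 12, 13, 18, 5), "B", datetime(2022, 12, 13, 18, 20)),
    ("A", datetime(2022, 12, 13, 18, 40), "A", datetime(2022, 12, 13, 19, 2)),
]
rows = hourly_flow(trips)
for row in rows:
    print(row)
```

Rentals are bucketed by the hour the bike left and returns by the hour it arrived, so a single trip can contribute to two different station-hours.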

In [2]:
yb_flow = pd.read_parquet("DATA/Youbike/combined_rent_return.pq", engine='pyarrow').convert_dtypes(dtype_backend="pyarrow")
yb_flow[yb_flow['time'].dt.hour >=7].sample(n=10)

Unnamed: 0,name,time,count_out,count_in
4369541,撫順街41巷(崇德宮前),2022-12-13 18:00:00,6,2
2384787,康寧路一段156巷20弄口,2022-03-21 17:00:00,3,1
4617678,新生公園,2022-02-08 16:00:00,3,1
1190159,南港路三段220巷口,2022-02-14 20:00:00,3,1
1304125,台北數位產業園區,2023-01-06 23:00:00,1,2
7778444,臺灣科技大學後門,2022-03-19 12:00:00,6,7
4786140,星雲街47號,2021-09-09 07:00:00,2,1
8933988,龍門廣場,2021-10-16 10:00:00,2,7
9629337,捷運國父紀念館站(2號出口),2021-11-27 12:00:00,0,1
6654862,臺北市立大學(博愛校區),2022-11-01 18:00:00,11,4


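## Youbike Data -- Station Info

Youbike station information includes latitude/longitude coordinates, the total number of docks at the station (`total`), and the address. It can be retrieved from the real-time Youbike 2.0 API. After transformation, the data looks like this: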

In [3]:
yb_info = pd.read_parquet("DATA/Youbike/yb_info.pq").convert_dtypes(dtype_backend='pyarrow')
yb_info.sample(n=10)

Unnamed: 0,name,total,lat,lng,address
575,民權龍江路口,32,25.06225,121.54107,民權東路三段與龍江路口東南側
621,大佳社區公園,15,25.07275,121.5373,濱江街154巷對側
534,仁愛金山路口(東南側),14,25.03811,121.52855,仁愛路二段48-4號前
577,捷運行天宮站(3號出口),42,25.05992,121.5333,松江路261號前
1038,忠孝東路七段527巷口,15,25.05266,121.61275,忠孝東路七段552號
410,景美污水抽水站,15,25.01076,121.53637,汀州路四段10號北側人行道
587,仁德公園,18,25.05337,121.5289,吉林路108巷與南京東路二段21巷口東南側
503,臺北市國父史蹟館(逸仙公園),31,25.04779,121.52042,中山北路一段46號北側
456,永昌公園,15,25.02643,121.50936,詔安街224號前
848,捷運關渡站(1號出口),54,25.12397,121.46712,大度路三段270巷/立功街55巷(西南側)


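## Demographic Variables -- Age-Group Shares and Income Distribution by Village (里)

Population structure may affect how Youbike is used. Areas dominated by working-age adults can be expected to see higher usage than areas dominated by the elderly, although middle-aged and older residents may also ride Youbike for exercise. Because actual usage patterns are unknown, we split the age distribution into several main bands intended to capture teenagers, university students, office workers, middle-aged adults, seniors, and the very elderly, and let the machine-learning model learn the rules.

Besides population structure, income distribution may also be a factor. Commuting by private car may be more common in high-income areas than in low-income ones, and high-income areas are also more likely to be office districts, so returns can be expected to exceed rentals there. These factors also interact with how busy an area is, so they too are left to the machine-learning model.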

In [4]:
demographic = pd.read_parquet("DATA/Demographic/demographic.pq")
demographic.drop(columns=["行政區", "綜合所得總額"], inplace=True)

In [5]:
demographic

Unnamed: 0,里別,總計,age_0_15,age_16_18,age_19_24,age_25_40,age_41_65,age_66_75,age_76_100,平均數,中位數,第一分位數,第三分位數
0,莊敬里,5045.0,11.912785,2.418236,5.748266,20.118930,38.691774,13.280476,7.829534,844,543,242,1054
1,東榮里,7799.0,17.348378,2.295166,5.064752,17.027824,36.145660,13.437620,8.680600,1512,723,287,1758
2,三民里,6380.0,12.351097,2.147335,5.564263,17.789969,38.605016,13.620690,9.921630,1115,625,260,1320
3,新益里,4326.0,11.534905,2.265372,4.808137,20.850670,37.586685,14.193250,8.760980,919,538,231,1181
4,富錦里,4942.0,14.225010,2.185350,4.917038,19.607446,36.806961,14.467827,7.790368,1118,580,262,1346
...,...,...,...,...,...,...,...,...,...,...,...,...,...
451,關渡里,10982.0,11.946822,2.522309,5.554544,22.272810,37.606993,12.793662,7.302859,859,482,218,1007
452,泉源里,2267.0,7.190119,2.514336,6.793119,21.614468,40.979268,13.056903,7.851787,717,395,190,724
453,湖山里,1492.0,6.300268,2.077748,4.423592,19.235925,41.085791,14.946381,11.930295,967,452,216,980
454,大屯里,1232.0,11.444805,1.948052,6.250000,20.535714,37.662338,12.662338,9.496753,887,457,237,997


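### Village Geographic Information
The table above holds the demographic variables for each village (里). To link them to Youbike 2.0 stations, we also need each village's boundary coordinates.

If a station sits at the very center of a village, it can simply be assigned that village's demographic structure. If it lies on a village boundary, or where several villages meet, the assignment is more complicated. To simplify this, we use inverse distance weighting (IDW) based on each village's centroid:
1. Let the demographic variable $U$ of village $k$ be $U^k$.
1. Compute the geometric centroid of each village.
1. For Youbike station $i$, obtain the distance $d(i,k)$ to the centroid of each village $k$.
1. The weighted value of the demographic variable $U_i$ for station $i$ is then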
$$
U_i = \frac{
\sum_{k} w_{i,k} U^k
}{
    \sum_{k} w_{i,k}
}
$$

where $w_{i,k} = \frac{1}{d(i,k)}$.

For computational details, see `transformation/features.py/VillageFeature`.
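The weighting formula above can be sketched in a few lines of plain Python. This is only an illustration of the IDW formula, not the code in `VillageFeature`: the function name `idw_value`, the toy coordinates, and the `eps` guard against a zero distance are all assumptions.

```python
import math


def idw_value(station_xy, centroids_xy, village_values, eps=1e-9):
    """Inverse-distance-weighted average of a village-level variable.

    station_xy     : (x, y) of the station in a projected, metric system
    centroids_xy   : list of (x, y) village centroids
    village_values : list of U^k, one value per village
    eps            : assumed guard so a station exactly on a centroid
                     does not divide by zero
    """
    sx, sy = station_xy
    weighted_sum = 0.0
    weight_total = 0.0
    for (cx, cy), value in zip(centroids_xy, village_values):
        distance = math.hypot(sx - cx, sy - cy)   # d(i, k)
        weight = 1.0 / max(distance, eps)          # w_{i,k} = 1 / d(i, k)
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Toy example: two villages, 100 m and 300 m away from the station.
station = (0.0, 0.0)
centroids = [(100.0, 0.0), (300.0, 0.0)]
age_25_40 = [10.0, 30.0]
u_i = idw_value(station, centroids, age_25_40)
print(u_i)
```

In the toy example the nearer village receives three times the weight of the farther one, so the result is 15 rather than the unweighted mean of 20.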

In [6]:
li_coord = pd.read_parquet("DATA/Demographic/li_coords_xy.pq")
li_coord

Unnamed: 0,VILLNAME,x,y
0,西新里,309680.933610,2.771898e+06
1,重陽里,310291.290886,2.772318e+06
2,蘆洲里,310928.725286,2.772775e+06
3,石潭里,309759.045690,2.772820e+06
4,朱園里,303960.451833,2.771279e+06
...,...,...,...
451,泉源里,302505.110366,2.782842e+06
452,平等里,308390.308238,2.781948e+06
453,大屯里,300992.353133,2.783640e+06
454,菁山里,307313.564606,2.783887e+06


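## MRT

The MRT has a large effect on Youbike usage. At exits with heavy traffic, the number of Youbike rentals and returns is bound to be high as well. In addition, transferring from the MRT to Youbike makes the first 30 minutes free, which gives people an incentive to use Youbike for short connecting trips.

The model specification that would reduce error the most is probably to use the actual passenger flow of the MRT station near each Youbike station directly. However, this project aims to evaluate the benefit of future stations. If the model relies too heavily on real-time MRT flow, that variable would have to be supplied by hand at prediction time, which makes evaluation more complex.

Our solution is to use, as a weight, the logarithm of the full-year entry and exit counts of the nearby MRT station (if any) for the given hour of day. For example, if Taipei Main Station (台北車站) lies within distance $d$ of a station, its entry weight at 6 p.m. is about 6.58 (`in` = 6.575462 in the table below).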

In [7]:
MRT = pd.read_parquet("DATA/MRT/MRT_coord_with_flow_weight.pq")
MRT

Unnamed: 0,station,hour,in,out,No,lng,lat
0,台北車站,18,6.575462,6.605948,M5,121.516246,25.046755
1,台北車站,18,6.575462,6.605948,M6,121.516787,25.046234
2,台北車站,18,6.575462,6.605948,M7,121.518643,25.046077
3,台北車站,18,6.575462,6.605948,M8,121.517479,25.045948
4,台北車站,18,6.575462,6.605948,M1,121.518193,25.048232
...,...,...,...,...,...,...,...
9067,十四張,1,1.623249,1.748188,0,121.527701,24.984467
9068,十四張,2,1.204120,1.612784,0,121.527701,24.984467
9069,十四張,5,1.000000,0.903090,0,121.527701,24.984467
9070,十四張,4,1.000000,0.845098,0,121.527701,24.984467


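Because the weight depends only on the hour of day and not on the actual date, evaluating a new station later requires only a location and the hours of a day, not actual MRT entry/exit flow.

This variable can also serve as one indicator of how busy an area is.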
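A minimal sketch of the hour-of-day weight described above. The text only says "logarithm"; base 10 is an assumption here, as are the function names, the dictionary layout, the made-up counts, and the choice to return 0.0 when a Youbike station has no MRT station nearby.

```python
import math


def hourly_log_weights(annual_counts):
    """Turn full-year entry/exit totals per (station, hour) into log weights.

    annual_counts: {(station, hour): (entries, exits)}  (hypothetical layout)
    Returns       {(station, hour): (in_weight, out_weight)}
    """
    weights = {}
    for key, (entries, exits) in annual_counts.items():
        weights[key] = (
            math.log10(entries) if entries > 0 else 0.0,  # assumed base 10
            math.log10(exits) if exits > 0 else 0.0,
        )
    return weights


def mrt_weight(weights, station, hour):
    """Look up the weight for a nearby MRT station at a given hour.

    `station` is None when no MRT station lies within the search radius;
    in that case (and for missing hours) the weight defaults to (0.0, 0.0).
    """
    if station is None:
        return (0.0, 0.0)
    return weights.get((station, hour), (0.0, 0.0))


# Made-up totals for a hypothetical station "X".
weights = hourly_log_weights({("X", 18): (1_000_000, 100_000), ("X", 4): (10, 0)})
print(mrt_weight(weights, "X", 18))
print(mrt_weight(weights, None, 18))
```

Because the lookup key is only `(station, hour)`, scoring a candidate site needs nothing beyond its location and the hours to evaluate.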

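## Bus

MRT stations cannot fully capture how busy each location in Taipei City is, so we add buses, whose network covers a much wider area.

However, no boarding/alighting data is available for buses, so we capture bus-related busyness another way. Generally, a major road does not necessarily have more bus stops than a peripheral ("egg-white", 蛋白區) district, but the number of buses serving a single stop increases with how busy the location is.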


In [8]:
bus = pd.read_parquet("DATA/Bs/bus_agg_by_stop.pq")
bus

Unnamed: 0,PublicId,Name,Latitude,Longitude,Bus_number,x,y
0,1000100040,捷運中正紀念堂站(中山),25.036430,121.516730,1,302143.790748,2.769912e+06
1,1000500040,重慶南路一段,25.045607,121.513187,2,301782.392692,2.770927e+06
2,1000500041,重慶南路一段,25.044691,121.513165,2,301780.557479,2.770826e+06
3,1000500042,重慶南路一段,25.044432,121.513114,2,301775.530123,2.770797e+06
4,1000900060,衡陽路,25.042270,121.510210,5,301483.401044,2.770557e+06
...,...,...,...,...,...,...,...
3694,2517206110,崁頂三路,25.195630,121.430072,1,293342.737269,2.787515e+06
3695,2517206260,輕軌淡水行政中心站,25.189169,121.443384,1,294686.697500,2.786804e+06
3696,2517300400,馬偕醫院,25.141655,121.459785,4,296357.683439,2.781547e+06
3697,2517300500,捷運竹圍站,25.136591,121.460120,4,296393.333479,2.780986e+06


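We scraped all bus routes in the greater Taipei area and the stops they pass through, and counted the number of buses serving each stop (`Bus_number`). A stop may be split into several sub-stops, but this does not affect the total. For each Youbike station, we sum the bus counts of all bus sub-stops within distance $d$ and use the total as the bus indicator.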
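The radius sum for the bus indicator can be sketched as below. It assumes the projected `x`/`y` columns are in meters so that a plain Euclidean distance is meaningful. The function name `bus_index` and the sample coordinates are illustrative, not taken from `BusAggregateFeature`.

```python
import math


def bus_index(station_xy, bus_stops, d):
    """Sum Bus_number over every bus sub-stop within d meters of a station.

    station_xy : (x, y) of the Youbike station, projected coordinates in meters
    bus_stops  : iterable of (x, y, bus_number) tuples, one per sub-stop
    d          : search radius in meters
    """
    sx, sy = station_xy
    return sum(
        bus_number
        for x, y, bus_number in bus_stops
        if math.hypot(x - sx, y - sy) <= d
    )


# Illustrative sub-stops: two close to the station, one about 450 m away.
stops = [
    (301782.4, 2770927.0, 2),
    (301780.6, 2770826.0, 2),
    (301483.4, 2770557.0, 5),
]
station = (301780.0, 2770900.0)
print(bus_index(station, stops, 400))
print(bus_index(station, stops, 500))
```

Sub-stops of the same named stop are simply separate rows, so their counts add up without any special handling.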

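## High Schools

While most universities have an MRT station nearby, high schools are often some distance from one. For high-school students commuting to school, taking the MRT to the nearest station and then riding a Youbike is a relatively efficient way to commute.
For each Youbike station, we record whether there is a high school within distance $d$.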

In [9]:
highschool = pd.read_parquet("DATA/Education/highschool_coord.pq")
highschool

Unnamed: 0,school_name,lat,lng
0,臺北市立西松高級中學,25.054988,121.565841
1,臺北市立中崙高級中學,25.048714,121.561104
2,臺北市立松山高級中學,25.043634,121.565614
3,臺北市立永春高級中學,25.032446,121.578128
4,國立臺灣師範大學附屬高級中學,25.033674,121.540411
...,...,...,...
134,新北市私立清傳高級商業職業學校,25.052015,121.476840
135,新北市私立能仁高級家事商業職業學校,24.957891,121.540058
136,新北市私立豫章高級工商職業學校,24.997370,121.458729
137,新北市私立莊敬高級工業家事職業學校,24.985717,121.532169


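## Tourist Attractions

Youbike is more likely to be used near tourist attractions. For each Youbike station, we count the officially designated tourist attractions within distance $d$.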

In [10]:
tourist_spot = pd.read_parquet("DATA/Tourist/taipei_tour.pq")
tourist_spot.sample(n=10)

Unnamed: 0,tour_name,lat,lng,Class1,Class2,Class3
3736,國民革命忠烈祠,25.07838,121.53313,1,,
3852,華中河濱公園,25.0154,121.49476,12,13.0,15.0
3729,大屯山系_中正山步道,25.14698,121.51721,11,13.0,
3769,行天宮,25.06308,121.5339,4,,
4137,吳興街商圈,25.02776,121.56311,18,,
4002,站前地下街_K區誠品地下街,25.0471,121.51527,12,,
4086,北投社三層崎公園,25.14462,121.49199,18,,
4171,閻錫山故居,25.13407,121.56052,18,,
4095,香堤大道廣場,25.03725,121.56698,18,,
3753,台北探索館,25.03753,121.56377,5,12.0,


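### Riverside Bike Paths

Riverside bike paths are a special kind of tourist attraction. Because a bike path is a route rather than a point, any Youbike station within a certain distance of a riverside bike path can be treated as within reach of recreational cycling. We mark stations that lie within 1 km of a bike path as "riverside docks" (河濱車柱) and add this flag to the features.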
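One way to sketch the 1 km riverside flag directly on latitude/longitude is a haversine distance to the nearest path vertex. This is a stand-in technique, not necessarily what the project's feature class does: it measures distance to vertices rather than to the segments between them, which is a reasonable approximation only when the vertices are densely spaced. The function names and the spherical Earth radius are assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two lat/lng points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def around_riverside(station, route_points, d=1000.0):
    """True if any path vertex lies within d meters of the station.

    station      : (lat, lng) of the Youbike station
    route_points : iterable of (lat, lng) vertices along the bike path
    """
    lat, lng = station
    return any(haversine_m(lat, lng, rlat, rlng) <= d for rlat, rlng in route_points)


# The two path vertices come from the first rows of the route table;
# the station positions are made up for illustration.
route = [(25.109929, 121.467283), (25.109890, 121.467547)]
near = around_riverside((25.105, 121.4673), route)   # roughly 0.5 km south
far = around_riverside((25.05, 121.4673), route)     # several km south
print(near, far)
```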


In [11]:
bike_route = pd.read_parquet("DATA/Tourist/tpe_river_bike.pq")
bike_route

Unnamed: 0,lat,lng
0,25.109929,121.467283
1,25.109890,121.467547
2,25.109818,121.467912
3,25.109660,121.468860
4,25.109676,121.468981
...,...,...
4366,25.098003,121.511199
4367,25.097981,121.511194
4368,25.097958,121.511195
4369,25.097936,121.511199


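## 7-11 Coordinates

To capture busyness more completely, we add the 7-11 stores spread across the greater Taipei area to the data.
For each Youbike station, we count all 7-11 stores within distance $d$ and use the count as the convenience-store indicator.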

In [12]:
conv_store = pd.read_parquet("DATA/711/all_711_coord.pq")
conv_store

Unnamed: 0,store_name,lat,lng
0,上弘,25.056391,121.548287
1,小巨蛋,25.050944,121.549433
2,中崙,25.048396,121.552737
3,北體,25.050888,121.552850
4,台場,25.048086,121.551158
...,...,...,...
898,懷得,25.114096,121.519656
899,關渡,25.121540,121.467483
900,關渡站,25.125037,121.467181
901,鐏賢,25.117453,121.506854


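## Weather

Because weather that is too cold, too hot, or rainy affects people's willingness to ride Youbike, we scraped historical weather data for the greater Taipei area from 2020/4 to 2023/12. The observation stations are 信義, 臺北, 竹子湖, 社子, 石牌, 天母, 平等, 內湖, 松山, and 文山.
The data includes temperature, humidity, wind speed, precipitation, and the latitude and longitude of each observation station.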

In [13]:
weather_df = pd.read_csv("DATA/Weather/weather_data_with_coord.csv", parse_dates=['datetime']).convert_dtypes(dtype_backend='pyarrow')

In [14]:
weather_df.sort_values(['station_name', 'datetime'])

Unnamed: 0,datetime,temperature,relative_humidity,wind_speed,precipitation,station_id,station_name,station_type,latitude,longitude
289200,2020-04-01 01:00:00,20.0,97,1.1,1.0,C0AC70,信義,auto_C0,25.0378,121.5646
289202,2020-04-01 02:00:00,19.3,96,1.1,0.0,C0AC70,信義,auto_C0,25.0378,121.5646
289204,2020-04-01 03:00:00,18.7,96,0.3,0.5,C0AC70,信義,auto_C0,25.0378,121.5646
289206,2020-04-01 04:00:00,18.3,96,0.9,0.5,C0AC70,信義,auto_C0,25.0378,121.5646
289208,2020-04-01 05:00:00,18.3,94,1.0,0.0,C0AC70,信義,auto_C0,25.0378,121.5646
...,...,...,...,...,...,...,...,...,...,...
52555,2023-12-31 20:00:00,18.4,67,4.2,0.0,466920,臺北,cwb,25.0377,121.5149
52556,2023-12-31 21:00:00,18.2,64,4.1,0.0,466920,臺北,cwb,25.0377,121.5149
52557,2023-12-31 22:00:00,17.9,65,4.2,0.0,466920,臺北,cwb,25.0377,121.5149
52558,2023-12-31 23:00:00,17.6,61,4.7,0.0,466920,臺北,cwb,25.0377,121.5149


In [15]:
obs = weather_df.groupby('station_name').first().reset_index()
px.scatter_mapbox(obs, lat = 'latitude', lon = 'longitude', hover_name='station_name', mapbox_style='carto-positron')

# Adding Features

In [16]:
from transformation import FeaturePipe
import transformation.features as FE

yb_flow_set_index = yb_flow.set_index(['name', 'time'])
yb_info_set_index = yb_info.set_index('name')


In [17]:

fp = FeaturePipe([
    FE.XYCoord(),
    FE.ConvenientStoreFeature(conv_store, 500),
    FE.VillageFeature(demographic, li_coord),
    FE.InAlley(),
    FE.BusAggregateFeature(bus, 400),
    FE.TouristSpotFeature(tourist_spot, 400),
    FE.BikeRouteFeaure(bike_route, 1000),
    FE.HighSchoolFeature(highschool, 400),
    FE.MRTIndex(MRT, 200),
    FE.WeatherIndex(weather_df)
])

print("Station Info with Geo Info Added:")
df_geo = fp.build(yb_info)      # ! can't use name as index, or else concat will fail due to index mismatch
fp.show_features()

Station Info with Geo Info Added:
0	XYCoord
1	ConvenientStoreFeature[distance = 500]
2	VillageFeature
3	InAlley
4	BusAggregateFeature[distance = 400]
5	TouristSpotFeature[distance = 400]
6	BikeRouteFeaure[distance = 1000]
7	HighSchoolFeature[distance = 400]
8	MRTIndex[distance = 200]
9	WeatherIndex
Total columns:  27


In [18]:
df_geo

Unnamed: 0,name,total,lat,lng,address,x,y,conv_count,總計,age_0_15,...,中位數,第一分位數,第三分位數,in_alley,bus_arround,count_tourist_spot,around_riverside,around_highschool,MRT_name,weather_station
0,捷運科技大樓站,28,25.02605,121.54360,復興南路二段235號前,304859.938784,2.768773e+06,9,5289.763742,13.824678,...,613.484660,259.013040,1351.154626,False,57,0,False,True,科技大樓,信義
1,復興南路二段273號前,21,25.02565,121.54357,復興南路二段273號西側,304857.088954,2.768729e+06,11,5298.603467,13.858581,...,613.339145,258.972667,1350.220552,False,55,0,False,True,科技大樓,信義
2,國北教大實小東側門,16,25.02429,121.54124,和平東路二段96巷7號,304622.542775,2.768577e+06,9,5484.350616,14.324470,...,620.931923,261.313565,1364.446159,True,70,0,False,False,,信義
3,和平公園東側,11,25.02351,121.54282,和平東路二段118巷33號,304782.347456,2.768491e+06,7,5398.918114,14.160731,...,616.023285,259.867428,1352.210162,True,62,1,False,False,,信義
4,辛亥復興路口西北側,16,25.02153,121.54299,復興南路二段368號,304800.383642,2.768272e+06,5,5323.235098,14.008011,...,610.184094,258.212969,1337.336069,False,26,1,False,False,,信義
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1306,臺大獸醫館南側,24,25.01791,121.54242,臺大獸醫系館南側,304744.461695,2.767871e+06,3,5264.655805,13.658091,...,599.359170,255.475791,1309.047082,False,38,2,False,False,,信義
1307,臺大新體育館東南側,40,25.02112,121.53591,臺大體育館東側,304086.012244,2.768224e+06,2,5339.805774,13.987413,...,610.972571,258.291626,1337.603485,False,26,2,False,False,,臺北
1308,臺大明達館北側(員工宿舍),18,25.01816,121.54469,明達館北側前空地,304973.457593,2.767900e+06,2,5259.413040,13.589284,...,598.060255,255.019815,1306.648207,False,58,1,False,False,,信義
1309,辛亥路五段73巷口,23,24.99818,121.55312,已移除。手動增加資訊,305833.314032,2.765690e+06,7,5488.574132,12.967324,...,577.983221,249.638505,1248.837684,False,27,2,False,False,,文山


In [19]:
yb_flow_add_geoinfo = yb_flow_set_index.join(df_geo.set_index('name')).reset_index()

## With space-time

In [20]:
fp_spacetime = FeaturePipe([
    FE.MRTTimeFeature(MRT),
    FE.WeatherTimeFeature(weather_df),
    FE.FeatureTime(),
    FE.Covid19Feature(),
])

fp_spacetime.build(yb_flow_add_geoinfo)
fp_spacetime.df

Unnamed: 0,name,time,count_out,count_in,total,lat,lng,address,x,y,...,MRT_inflow,MRT_outflow,temperature,relative_humidity,wind_speed,precipitation,hour,month,weekday,during_covid
0,一壽橋,2021-08-17 17:00:00,1,0,16,24.97837,121.55548,樟新街64號前方,306080.529930,2.763497e+06,...,0.000000,0.000000,30.9,71,1.8,0.0,17,8,1,False
1,一壽橋,2021-08-17 18:00:00,2,1,16,24.97837,121.55548,樟新街64號前方,306080.529930,2.763497e+06,...,0.000000,0.000000,29.2,70,0.3,0.0,18,8,1,False
2,一壽橋,2021-08-17 21:00:00,1,0,16,24.97837,121.55548,樟新街64號前方,306080.529930,2.763497e+06,...,0.000000,0.000000,26.5,84,0.6,0.0,21,8,1,False
3,一壽橋,2021-08-18 08:00:00,2,0,16,24.97837,121.55548,樟新街64號前方,306080.529930,2.763497e+06,...,0.000000,0.000000,26.6,93,0.0,0.0,8,8,2,False
4,一壽橋,2021-08-18 12:00:00,1,0,16,24.97837,121.55548,樟新街64號前方,306080.529930,2.763497e+06,...,0.000000,0.000000,29.3,80,0.4,2.5,12,8,2,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10748122,龍門廣場,2023-02-22 05:00:00,0,1,50,25.04092,121.54825,敦化南路一段與敦化南路一段236巷口,305322.555268,2.770422e+06,...,2.986324,2.267172,13.2,86,2.4,0.0,5,2,2,False
10748123,龍門廣場,2023-02-22 06:00:00,0,1,50,25.04092,121.54825,敦化南路一段與敦化南路一段236巷口,305322.555268,2.770422e+06,...,4.713062,5.054234,13.4,86,2.4,0.5,6,2,2,False
10748124,龍門廣場,2023-02-23 04:00:00,0,1,50,25.04092,121.54825,敦化南路一段與敦化南路一段236巷口,305322.555268,2.770422e+06,...,3.016197,2.187521,17.0,86,2.9,0.0,4,2,3,False
10748125,龍門廣場,2023-02-28 05:00:00,0,1,50,25.04092,121.54825,敦化南路一段與敦化南路一段236巷口,305322.555268,2.770422e+06,...,2.986324,2.267172,16.3,72,1.4,0.0,5,2,1,False


In [21]:
fp_spacetime.show_features()

0	MRTTimeFeature
1	WeatherTimeFeature
2	FeatureTime
3	Covid19Feature
Total columns:  38


In [22]:
fp_spacetime.df.to_parquet("feature_added.pq")

## Preprocessing for Outcomes

In [23]:
final_fp = FeaturePipe([
    FE.FlowPolarFeature(),
    # FE.FlowRatioFeature(),
    FE.RemoveUseless(),
])


final_df = final_fp.build(fp_spacetime.df)
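
`FlowPolarFeature` produces the two targets `theta` and `r` that the model is trained on. Its implementation is not shown in this notebook. Judging only from the sample rows here (`count_out=2, count_in=1` appears as `theta=0.463648, r=3.0`, and `count_out=0, count_in=1` appears as `theta=1.570796, r=1.0`), the transform looks consistent with `theta = atan2(count_in, count_out)` and `r = count_out + count_in`. The sketch below encodes that reading as an assumption. `flow_to_polar` and `polar_to_flow` are hypothetical helper names, and the inverse is derived under the same assumption.

```python
import math


def flow_to_polar(count_out, count_in):
    """Encode hourly flow as (theta, r), under the assumed definition.

    theta : direction of the flow, 0 = only rentals, pi/2 = only returns
    r     : total activity, count_out + count_in (a sum, not a Euclidean norm)
    """
    theta = math.atan2(count_in, count_out)
    r = float(count_out + count_in)
    return theta, r


def polar_to_flow(theta, r):
    """Invert flow_to_polar: recover (count_out, count_in) from (theta, r).

    From tan(theta) = in / out and out + in = r it follows that
    out = r * cos / (cos + sin) and in = r * sin / (cos + sin).
    """
    denom = math.cos(theta) + math.sin(theta)
    return r * math.cos(theta) / denom, r * math.sin(theta) / denom


theta, r = flow_to_polar(2, 1)
print(theta, r)
print(polar_to_flow(theta, r))
```

Under this reading, `theta` describes the balance between returns and rentals while `r` describes overall volume, so predicted `(p_theta, p_r)` pairs could be mapped back to estimated counts with the inverse.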

In [25]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import mean_squared_error

training_data = final_df.copy().dropna().astype(float).reset_index(drop = True)
output_columns = ['theta', 'r']

y  = training_data[output_columns]
X = training_data.drop(columns=output_columns, inplace=False)


In [26]:
X_in_train, X_in_test, y_in_train, y_in_test = train_test_split(X, y, test_size=0.2)

In [27]:
from sklearn import ensemble

params = dict(
    learning_rate=0.2,
    categorical_features=[24, 25, 26],
    max_leaf_nodes=100,
    max_iter = 500,
)

reg = MultiOutputRegressor(
    ensemble.HistGradientBoostingRegressor(**params)
)

# scores = cross_val_score(reg, X_in_train, y_in_train, cv = 5)

reg.fit(X_in_train, y_in_train)
mse = mean_squared_error(y_in_test, reg.predict(X_in_test))
print("The mean squared error (MSE) on test set: {:.4f}".format(mse))

The mean squared error (MSE) on test set: 12.6439


In [59]:
scores

array([0.65311733, 0.65375052, 0.65366501, 0.6541119 , 0.65523453])

In [28]:
reg.score(X_in_train, y_in_train), reg.score(X_in_test, y_in_test)

(0.49853939300050065, 0.4860205104233895)

In [32]:
y_predict = reg.predict(X)


In [33]:
training_data['p_theta'], training_data['p_r'] = y_predict[:,0], y_predict[:,1]

In [34]:
training_data

Unnamed: 0,conv_count,總計,age_0_15,age_16_18,age_19_24,age_25_40,age_41_65,age_66_75,age_76_100,平均數,...,wind_speed,precipitation,hour,month,weekday,during_covid,theta,r,p_theta,p_r
0,2.0,5502.346739,13.210193,2.44714,6.001440,20.496243,38.969198,11.916153,6.959633,1039.134462,...,1.8,0.0,17.0,8.0,1.0,0.0,0.000000,1.0,0.740073,6.576026
1,2.0,5502.346739,13.210193,2.44714,6.001440,20.496243,38.969198,11.916153,6.959633,1039.134462,...,0.3,0.0,18.0,8.0,1.0,0.0,0.463648,3.0,0.889504,3.836213
2,2.0,5502.346739,13.210193,2.44714,6.001440,20.496243,38.969198,11.916153,6.959633,1039.134462,...,0.6,0.0,21.0,8.0,1.0,0.0,0.000000,1.0,1.013669,3.825514
3,2.0,5502.346739,13.210193,2.44714,6.001440,20.496243,38.969198,11.916153,6.959633,1039.134462,...,0.0,0.0,8.0,8.0,2.0,0.0,0.000000,2.0,0.693081,2.539704
4,2.0,5502.346739,13.210193,2.44714,6.001440,20.496243,38.969198,11.916153,6.959633,1039.134462,...,0.4,2.5,12.0,8.0,2.0,0.0,0.000000,1.0,0.796980,1.608084
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10310676,20.0,5296.725044,13.741042,2.31602,5.353273,19.483756,38.113269,12.920005,8.072636,1179.040696,...,2.4,0.0,5.0,2.0,2.0,0.0,1.570796,1.0,0.847747,2.239201
10310677,20.0,5296.725044,13.741042,2.31602,5.353273,19.483756,38.113269,12.920005,8.072636,1179.040696,...,2.4,0.5,6.0,2.0,2.0,0.0,1.570796,1.0,0.957991,2.165640
10310678,20.0,5296.725044,13.741042,2.31602,5.353273,19.483756,38.113269,12.920005,8.072636,1179.040696,...,2.9,0.0,4.0,2.0,3.0,0.0,1.570796,1.0,1.070639,3.005064
10310679,20.0,5296.725044,13.741042,2.31602,5.353273,19.483756,38.113269,12.920005,8.072636,1179.040696,...,1.4,0.0,5.0,2.0,1.0,0.0,1.570796,1.0,0.851567,2.425752
