# Example: Taxi Fare Prediction
https://www.kaggle.com/c/new-york-city-taxi-fare-prediction

# [Learning Objectives]
- Build feature combinations and observe their effect in the taxi fare prediction competition

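# [Example Highlights]
- Add two features, the longitude difference and the latitude difference, and observe how the linear regression and gradient boosting predictions change (In[4], Out[4], In[5], Out[5])
- Then add a coordinate distance feature and observe how the linear regression and gradient boosting predictions change (In[6], Out[6], In[7], Out[7])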

In [1]:
# All preparation before feature engineering
import pandas as pd
import numpy as np
import datetime
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

import warnings
warnings.filterwarnings('ignore')

data_path = '../data/'
df = pd.read_csv(data_path + 'taxi_data1.csv')

train_Y = df['fare_amount']
df = df.drop(['fare_amount'] , axis=1)
df.head()

Unnamed: 0,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2011-10-21 23:54:10 UTC,-73.99058,40.761071,-73.981128,40.758634,2
1,2015-02-03 10:42:03 UTC,-73.988403,40.723431,-73.989647,40.741695,1
2,2014-03-16 18:58:58 UTC,-74.015785,40.71511,-74.012029,40.707888,2
3,2009-06-13 16:10:54 UTC,-73.977322,40.787275,-73.95803,40.778838,3
4,2014-06-12 03:25:56 UTC,-73.989683,40.729717,-73.98249,40.761887,3


In [2]:
# Decompose the time feature using datetime
df['pickup_datetime'] = df['pickup_datetime'].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S UTC'))
df['pickup_year'] = df['pickup_datetime'].apply(lambda x: datetime.datetime.strftime(x, '%Y')).astype('int64')
df['pickup_month'] = df['pickup_datetime'].apply(lambda x: datetime.datetime.strftime(x, '%m')).astype('int64')
df['pickup_day'] = df['pickup_datetime'].apply(lambda x: datetime.datetime.strftime(x, '%d')).astype('int64')
df['pickup_hour'] = df['pickup_datetime'].apply(lambda x: datetime.datetime.strftime(x, '%H')).astype('int64')
df['pickup_minute'] = df['pickup_datetime'].apply(lambda x: datetime.datetime.strftime(x, '%M')).astype('int64')
df['pickup_second'] = df['pickup_datetime'].apply(lambda x: datetime.datetime.strftime(x, '%S')).astype('int64')
df.head()

Unnamed: 0,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,pickup_year,pickup_month,pickup_day,pickup_hour,pickup_minute,pickup_second
0,2011-10-21 23:54:10,-73.99058,40.761071,-73.981128,40.758634,2,2011,10,21,23,54,10
1,2015-02-03 10:42:03,-73.988403,40.723431,-73.989647,40.741695,1,2015,2,3,10,42,3
2,2014-03-16 18:58:58,-74.015785,40.71511,-74.012029,40.707888,2,2014,3,16,18,58,58
3,2009-06-13 16:10:54,-73.977322,40.787275,-73.95803,40.778838,3,2009,6,13,16,10,54
4,2014-06-12 03:25:56,-73.989683,40.729717,-73.98249,40.761887,3,2014,6,12,3,25,56
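
As a side note, a parsed `datetime` object already exposes each component as an integer attribute, so the `strftime` followed by `astype('int64')` round trip is not strictly required. The sketch below checks this on the first row using only the standard library; the `parts` dictionary is an illustrative name, not something from the notebook. On a whole pandas column, the `.dt` accessor (for example `.dt.year`) should give the same values.

```python
from datetime import datetime

# First pickup_datetime value from the table above
raw = "2011-10-21 23:54:10 UTC"
parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S UTC")

# Each component is available directly as an integer attribute
parts = {
    "pickup_year": parsed.year,
    "pickup_month": parsed.month,
    "pickup_day": parsed.day,
    "pickup_hour": parsed.hour,
    "pickup_minute": parsed.minute,
    "pickup_second": parsed.second,
}
print(parts)
```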


In [3]:
# Score the features with linear regression and gradient boosting separately
df = df.drop(['pickup_datetime'] , axis=1)
scaler = MinMaxScaler()
train_X = scaler.fit_transform(df)

Linear = LinearRegression()
linear_time_score = cross_val_score(Linear, train_X, train_Y, cv=5).mean()
print("Linear Reg Score : {s}".format(s=linear_time_score))
#print(f'Linear Reg Score : {cross_val_score(Linear, train_X, train_Y, cv=5).mean()}')

GDBT = GradientBoostingRegressor()
tree_time_score = cross_val_score(GDBT, train_X, train_Y, cv=5).mean()
print("Gradient Boosting Reg Score : {s}".format(s=tree_time_score))
#print(f'Gradient Boosting Reg Score : {cross_val_score(GDBT, train_X, train_Y, cv=5).mean()}')

Linear Reg Score : 0.026876871475641616
Gradient Boosting Reg Score : 0.7117954336575305
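
To read these numbers correctly: when `cross_val_score` is called without a `scoring` argument it uses the estimator's own `score` method, which for scikit-learn regressors is the coefficient of determination R². The following is a minimal hand-written sketch of that metric on made-up fares (the function name `r2_manual` and the sample values are assumptions for illustration only).

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up fares, only to illustrate the two reference points of the metric
fares = [4.5, 7.0, 10.5, 16.0]
mean_fare = sum(fares) / len(fares)

perfect = r2_manual(fares, fares)                      # predictions equal the targets
baseline = r2_manual(fares, [mean_fare] * len(fares))  # always predict the mean fare

print(perfect, baseline)
```

A score near 0 therefore means the model does little better than always predicting the mean fare, which is roughly where linear regression sits here.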


In [4]:
# Add two features: longitude difference and latitude difference
df['longitude_diff'] = df['dropoff_longitude'] - df['pickup_longitude']
df['latitude_diff'] = df['dropoff_latitude'] - df['pickup_latitude']
df[['longitude_diff', 'latitude_diff', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']].head()

Unnamed: 0,longitude_diff,latitude_diff,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
0,0.009452,-0.002437,-73.99058,40.761071,-73.981128,40.758634
1,-0.001244,0.018265,-73.988403,40.723431,-73.989647,40.741695
2,0.003756,-0.007222,-74.015785,40.71511,-74.012029,40.707888
3,0.019292,-0.008437,-73.977322,40.787275,-73.95803,40.778838
4,0.007193,0.03217,-73.989683,40.729717,-73.98249,40.761887


In [5]:
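# Result: the longitude/latitude differences alone raise the gradient boosting score sharply; linear regression barely changes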
train_X = scaler.fit_transform(df)

linear_time_Ddiff_score = cross_val_score(Linear, train_X, train_Y, cv=5).mean()
tree_time_Ddiff_score = cross_val_score(GDBT, train_X, train_Y, cv=5).mean()
print("Linear Reg Score : {s}".format(s=linear_time_Ddiff_score))
print("Gradient Boosting Reg Score : {s}".format(s=tree_time_Ddiff_score))
#print(f'Linear Reg Score : {cross_val_score(Linear, train_X, train_Y, cv=5).mean()}')
#print(f'Gradient Boosting Reg Score : {cross_val_score(GDBT, train_X, train_Y, cv=5).mean()}')

Linear Reg Score : 0.026922682001443564
Gradient Boosting Reg Score : 0.7966844863474172


In [6]:
# Add the coordinate distance feature
df['distance_2D'] = (df['longitude_diff']**2 + df['latitude_diff']**2)**0.5
df[['distance_2D', 'longitude_diff', 'latitude_diff']].head()

Unnamed: 0,distance_2D,longitude_diff,latitude_diff
0,0.009761,0.009452,-0.002437
1,0.018307,-0.001244,0.018265
2,0.00814,0.003756,-0.007222
3,0.021056,0.019292,-0.008437
4,0.032964,0.007193,0.03217
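
As a quick check of what `distance_2D` measures, the sketch below recomputes the first row by hand with the standard library. The variable names are illustrative. Note that the result is in degrees, not kilometres, and that it treats one degree of longitude as being as long as one degree of latitude.

```python
import math

# Coordinates of row 0 from the tables above: (longitude, latitude)
pickup = (-73.99058, 40.761071)
dropoff = (-73.981128, 40.758634)

longitude_diff = dropoff[0] - pickup[0]
latitude_diff = dropoff[1] - pickup[1]

# Straight-line (Euclidean) distance in coordinate space, in degrees
distance_2d = math.hypot(longitude_diff, latitude_diff)
print(round(distance_2d, 6))
```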


In [7]:
# Result: adding the coordinate distance raises the score again (for linear regression too)
train_X = scaler.fit_transform(df)
linear_time_Ddif_2D_score = cross_val_score(Linear, train_X, train_Y, cv=5).mean()
tree_time_Ddif_2D_score = cross_val_score(GDBT, train_X, train_Y, cv=5).mean()
print("Linear Reg Score : {s}".format(s=linear_time_Ddif_2D_score))
print("Gradient Boosting Reg Score : {s}".format(s=tree_time_Ddif_2D_score))
#print(f'Linear Reg Score : {cross_val_score(Linear, train_X, train_Y, cv=5).mean()}')
#print(f'Gradient Boosting Reg Score : {cross_val_score(GDBT, train_X, train_Y, cv=5).mean()}')

Linear Reg Score : 0.027479693774541868
Gradient Boosting Reg Score : 0.8055651666040233


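# [Exercise Focus]
- Following the example and today's course material, build a new feature from the ratio between the length of one degree of longitude and one degree of latitude, and observe the effect (In[8], Out[8])
- Use only that new feature and observe the effect (In[13], Out[13])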

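# Exercise 1
* Referring to today's material, use the longitude-to-latitude length ratio to build a new feature, then check whether the original features plus the new feature improve the score.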



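**Your answer:** *Adding the new feature (latitude-corrected real distance) actually lowers the score slightly: gradient boosting 0.806 -> 0.804, linear regression 0.027 -> 0.025.*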

In [8]:
import math
"""
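Your Code Here, set the new feature at df['distance_real']

The latitudes in this data cluster around 40.75 degrees,
so the length ratio of a degree of longitude to a degree of latitude is cos(40.75 degrees) : 1 = 0.75756 : 1
The two-point distance corrected this way should give more accurate predictions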
"""
df['distance_real'] = ((df['longitude_diff']*0.75756)**2 + (df['latitude_diff']*1)**2 ) **0.5

# Observe the result
train_X = scaler.fit_transform(df)
linear_time_Ddif_2D_real2D_score = cross_val_score(Linear, train_X, train_Y, cv=5).mean()
tree_time_Ddif_2D_real2D_score = cross_val_score(GDBT, train_X, train_Y, cv=5).mean()

print("Linear Reg Score : {s}".format(s=linear_time_Ddif_2D_real2D_score))
print("Gradient Boosting Reg Score : {s}".format(s=tree_time_Ddif_2D_real2D_score))
#print(f'Linear Reg Score : {cross_val_score(Linear, train_X, train_Y, cv=5).mean()}')
#print(f'Gradient Boosting Reg Score : {cross_val_score(GDBT, train_X, train_Y, cv=5).mean()}')

Linear Reg Score : 0.02523573872638458
Gradient Boosting Reg Score : 0.8039312746889067
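
The constant 0.75756 can be reproduced directly, and the corrected distance can be turned into kilometres by multiplying by the length of one degree of latitude. The sketch below does both for row 0. It assumes a spherical Earth of radius 6371 km (the same radius used in the haversine cell); the constant and variable names are illustrative.

```python
import math

MEAN_LATITUDE_DEG = 40.75  # latitude the data clusters around, per the note above
EARTH_RADIUS_KM = 6371.0   # assumption: spherical Earth

# Length of one degree of longitude relative to one degree of latitude
lon_to_lat_ratio = math.cos(math.radians(MEAN_LATITUDE_DEG))

# Length of one degree of latitude in kilometres
km_per_degree = math.pi * EARTH_RADIUS_KM / 180

# Row 0 differences from the earlier table
longitude_diff, latitude_diff = 0.009452, -0.002437

distance_real_deg = math.hypot(longitude_diff * lon_to_lat_ratio, latitude_diff)
distance_real_km = distance_real_deg * km_per_degree
print(round(lon_to_lat_ratio, 5), round(distance_real_km, 2))
```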


## [Haversine formula](https://en.wikipedia.org/wiki/Haversine_formula)

In [9]:
def haversine(lon1, lon2, lat1, lat2):
    
    # Convert longitude and latitude from degrees to radians
    lon1, lon2, lat1, lat2 = map(math.radians, [lon1, lon2, lat1, lat2])

    r = 6371  # Earth radius (km)

    hav_lon = (math.sin((lon2-lon1)/2))**2  # haversine function applied to the longitude and latitude differences
    hav_lat = (math.sin((lat2-lat1)/2))**2
    
    d = 2*r*math.asin((hav_lat + math.cos(lat1) * math.cos(lat2) * hav_lon)**0.5)
    
    return d
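
A short sanity check can confirm the function behaves as expected before it is applied to the whole table. The sketch below restates the function so it runs on its own, keeping the notebook's argument order `(lon1, lon2, lat1, lat2)`, then checks two cases: one degree of latitude along a meridian, and the row 0 trip.

```python
import math

def haversine(lon1, lon2, lat1, lat2):
    # Same formula and argument order as the cell above; result in kilometres
    lon1, lon2, lat1, lat2 = map(math.radians, [lon1, lon2, lat1, lat2])
    r = 6371  # Earth radius (km)
    hav_lon = math.sin((lon2 - lon1) / 2) ** 2
    hav_lat = math.sin((lat2 - lat1) / 2) ** 2
    return 2 * r * math.asin((hav_lat + math.cos(lat1) * math.cos(lat2) * hav_lon) ** 0.5)

# One degree of latitude along a meridian should equal r * (pi / 180)
one_degree_km = haversine(0.0, 0.0, 0.0, 1.0)

# Row 0 of the data: a short trip of under one kilometre
row0_km = haversine(-73.99058, -73.981128, 40.761071, 40.758634)

print(round(one_degree_km, 2), round(row0_km, 2))
```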

In [10]:
import copy

df_temp = copy.deepcopy(df)
df_temp['distance_real'] = list(map(haversine, df['pickup_longitude'], df['dropoff_longitude'], df['pickup_latitude'], df['dropoff_latitude']))
# Observe the result
train_X = scaler.fit_transform(df_temp)
linear_time_Ddif_2D_haversine_score = cross_val_score(Linear, train_X, train_Y, cv=5).mean()
tree_time_Ddif_2D_haversine_score = cross_val_score(GDBT, train_X, train_Y, cv=5).mean()

print("Linear Reg Score : {s}".format(s=linear_time_Ddif_2D_haversine_score))
print("Gradient Boosting Reg Score : {s}".format(s=tree_time_Ddif_2D_haversine_score))
#print(f'Linear Reg Score : {cross_val_score(Linear, train_X, train_Y, cv=5).mean()}')
#print(f'Gradient Boosting Reg Score : {cross_val_score(GDBT, train_X, train_Y, cv=5).mean()}')

Linear Reg Score : 0.3673501080423339
Gradient Boosting Reg Score : 0.8043916433506462
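
On larger datasets, calling the scalar function once per row through `map` can be slow. One possible alternative is a NumPy version that works on whole arrays at once. The sketch below is an assumption-labelled illustration, not the notebook's method: `haversine_np` is a hypothetical name, it keeps the `(lon1, lon2, lat1, lat2)` argument order, and it is exercised on the first two rows of the data.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # same radius as the scalar version

def haversine_np(lon1, lon2, lat1, lat2):
    """Vectorized haversine distance in kilometres (hypothetical helper)."""
    lon1, lon2, lat1, lat2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lon1, lon2, lat1, lat2)
    )
    hav_lon = np.sin((lon2 - lon1) / 2) ** 2
    hav_lat = np.sin((lat2 - lat1) / 2) ** 2
    inner = hav_lat + np.cos(lat1) * np.cos(lat2) * hav_lon
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(inner))

# First two rows of the data shown earlier
pickup_lon = [-73.99058, -73.988403]
dropoff_lon = [-73.981128, -73.989647]
pickup_lat = [40.761071, 40.723431]
dropoff_lat = [40.758634, 40.741695]

distances = haversine_np(pickup_lon, dropoff_lon, pickup_lat, dropoff_lat)
print(distances.round(2))
```

Since the inputs go through `np.asarray`, passing the DataFrame columns directly should also work.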


### Replace the Euclidean distance with the real distance between the two points

In [11]:
df = df.drop(['distance_2D'], axis=1)

# Observe the result
train_X = scaler.fit_transform(df)
linear_time_Ddif_real2D_score = cross_val_score(Linear, train_X, train_Y, cv=5).mean()
tree_time_Ddif_real2D_score = cross_val_score(GDBT, train_X, train_Y, cv=5).mean()

print("Linear Reg Score : {s}".format(s=linear_time_Ddif_real2D_score))
print("Gradient Boosting Reg Score : {s}".format(s=tree_time_Ddif_real2D_score))
#print(f'Linear Reg Score : {cross_val_score(Linear, train_X, train_Y, cv=5).mean()}')
#print(f'Gradient Boosting Reg Score : {cross_val_score(GDBT, train_X, train_Y, cv=5).mean()}')

Linear Reg Score : 0.02742580852964984
Gradient Boosting Reg Score : 0.8071243903002323


In [12]:
df_temp = df_temp.drop(['distance_2D'], axis=1)

# Observe the result
train_X = scaler.fit_transform(df_temp)
linear_time_Ddif_haversine_score = cross_val_score(Linear, train_X, train_Y, cv=5).mean()
tree_time_Ddif_haversine_score = cross_val_score(GDBT, train_X, train_Y, cv=5).mean()

print("Linear Reg Score : {s}".format(s=linear_time_Ddif_haversine_score))
print("Gradient Boosting Reg Score : {s}".format(s=tree_time_Ddif_haversine_score))
#print(f'Linear Reg Score : {cross_val_score(Linear, train_X, train_Y, cv=5).mean()}')
#print(f'Gradient Boosting Reg Score : {cross_val_score(GDBT, train_X, train_Y, cv=5).mean()}')

Linear Reg Score : 0.027188071191767228
Gradient Boosting Reg Score : 0.8048081527773657


# Exercise 2
* Try estimating the target with the new feature only (ignoring the original features). How does the result compare with Exercise 1?


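**Your answer:** *Using only the new feature, gradient boosting still reaches a score of about 0.72, but that is lower than in Exercise 1 (about 0.80); linear regression drops to nearly 0.*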

In [13]:
train_X = scaler.fit_transform(df[['distance_real']])

linear_real2D_score = cross_val_score(Linear, train_X, train_Y, cv=5).mean()
tree_real2D_score = cross_val_score(GDBT, train_X, train_Y, cv=5).mean()

print("Linear Reg Score : {s}".format(s=linear_real2D_score))
print("Gradient Boosting Reg Score : {s}".format(s=tree_real2D_score))
#print(f'Linear Reg Score : {cross_val_score(Linear, train_X, train_Y, cv=5).mean()}')
#print(f'Gradient Boosting Reg Score : {cross_val_score(GDBT, train_X, train_Y, cv=5).mean()}')

Linear Reg Score : 0.0014467562845988714
Gradient Boosting Reg Score : 0.7221194640622727


In [14]:
train_X = scaler.fit_transform(df_temp[['distance_real']])

linear_haversine_score = cross_val_score(Linear, train_X, train_Y, cv=5).mean()
tree_haversine_score = cross_val_score(GDBT, train_X, train_Y, cv=5).mean()

print("Linear Reg Score : {s}".format(s=linear_haversine_score))
print("Gradient Boosting Reg Score : {s}".format(s=tree_haversine_score))
#print(f'Linear Reg Score : {cross_val_score(Linear, train_X, train_Y, cv=5).mean()}')
#print(f'Gradient Boosting Reg Score : {cross_val_score(GDBT, train_X, train_Y, cv=5).mean()}')

Linear Reg Score : 0.0011536096142396256
Gradient Boosting Reg Score : 0.715704780543987


In [15]:
data = {'Method':['Linear_Time', 'Gradient Boosting_Time',
                 'Linear_Time_Ddiff', 'Gradient Boosting_Time_Ddiff',
                 'Linear_Time_Ddiff_2D', 'Gradient Boosting_Time_Ddiff_2D',
                 'Linear_Time_real2D', 'Gradient Boosting_Time_real2D',
                 'Linear_Time_haversine', 'Gradient Boosting_Time_haversine',
                 'Linear_Time_Ddiff_2D_real2D', 'Gradient Boosting_Ddiff_2D_real2D',
                 'Linear_Time_Ddiff_2D_haversine', 'Gradient Boosting_Ddiff_2D_haversine',
                 'Linear_real2D', 'Gradient Boosting_real2D',
                 'Linear_haversine', 'Gradient Boosting_haversine'],
       'Score':[linear_time_score, tree_time_score,
               linear_time_Ddiff_score, tree_time_Ddiff_score,
               linear_time_Ddif_2D_score, tree_time_Ddif_2D_score,
               linear_time_Ddif_real2D_score, tree_time_Ddif_real2D_score,
               linear_time_Ddif_haversine_score, tree_time_Ddif_haversine_score,
               linear_time_Ddif_2D_real2D_score, tree_time_Ddif_2D_real2D_score,
               linear_time_Ddif_2D_haversine_score, tree_time_Ddif_2D_haversine_score,
               linear_real2D_score, tree_real2D_score,
               linear_haversine_score, tree_haversine_score]}

sheet = pd.DataFrame(data)
sheet.set_index('Method', inplace=True)
sheet

Method,Score
Linear_Time,0.026877
Gradient Boosting_Time,0.711795
Linear_Time_Ddiff,0.026923
Gradient Boosting_Time_Ddiff,0.796684
Linear_Time_Ddiff_2D,0.02748
Gradient Boosting_Time_Ddiff_2D,0.805565
Linear_Time_real2D,0.027426
Gradient Boosting_Time_real2D,0.807124
Linear_Time_haversine,0.027188
Gradient Boosting_Time_haversine,0.804808


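### Adding the coordinate differences and the Euclidean distance improves the score
- Adding the coordinate differences: gradient boosting 0.712 -> 0.797; then adding the Euclidean distance: 0.797 -> 0.806 (linear regression: 0.0269 -> 0.0275)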


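### Adding the real two-point distance on top of that lowers the score
- Gradient boosting 0.806 -> 0.804 (linear regression: 0.027 -> 0.025)
- If the real distance is computed with the haversine formula instead, the linear regression score rises sharply: 0.027 -> 0.367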

### Coordinate differences plus the real two-point distance (course formula) give the best gradient boosting score (0.807)