# Linear Regression with Python

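In this lesson we use real housing data to train a machine learning model and make predictions. Specifically, we try Linear Regression to predict house prices (the log of the total price, LTP).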

The dataset we will use for prediction has the following columns:

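Variable      | Definition  |
--------------|:-----------:|
TYPE          | Housing type |
LOCATION      | Administrative district |
LANDAREA      | Land area (square meters) |
AREA          | Building area (square meters) |
ROOM          | Number of rooms |
BATH          | Number of bathrooms |
LTP           | Log of total house price |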
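Because the target LTP is the log of the total price, any predicted value has to go through `np.exp` to get back to the price scale. The sketch below uses the LTP value of the first row in the sample output of this notebook, which works out to roughly 4.5 million; the variable names are illustrative only.

```python
import numpy as np

ltp = 15.319588          # LTP of the first row in the notebook's sample output
price = np.exp(ltp)      # undo the log to recover the total price
print(round(price))

# Round trip: taking the log again returns the original LTP
ltp_again = np.log(price)
print(np.isclose(ltp_again, ltp))
```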

## Import Libraries

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm  # used to assess the model's predictive quality
from sklearn.linear_model import LinearRegression  # used for linear regression

## The Data

First, we read the taichung.xlsx file into pandas as a DataFrame.

In [2]:
df = pd.read_excel('taichung.xlsx')

In [3]:
df.head(1)

Unnamed: 0,TYPE,LOCATION,LANDAREA,AREA,ROOM,BATH,LTP
0,透天厝,清水區,155.5,165.1,3,2,15.319588


In [4]:
# Keep only the rows whose TYPE is 透天厝
df = df[df['TYPE'] == '透天厝']

In [5]:
df.head(1)

Unnamed: 0,TYPE,LOCATION,LANDAREA,AREA,ROOM,BATH,LTP
0,透天厝,清水區,155.5,165.1,3,2,15.319588


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12770 entries, 0 to 12769
Data columns (total 7 columns):
TYPE        12770 non-null object
LOCATION    12770 non-null object
LANDAREA    12770 non-null float64
AREA        12770 non-null float64
ROOM        12770 non-null int64
BATH        12770 non-null int64
LTP         12770 non-null float64
dtypes: float64(3), int64(2), object(2)
memory usage: 798.1+ KB


## Converting Categorical Features 

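Machine learning algorithms cannot use categorical features (for example, the administrative district) directly, so we use pandas to convert them into dummy variables.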

In [7]:
df = df.join(pd.get_dummies(df['LOCATION']))
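To make the encoding step concrete, here is a minimal sketch on a toy frame. The rows are made up for illustration (only 清水區 appears in the sample output of this notebook; the second district and the AREA values are placeholders), and `toy`, `dummies`, and `toy_encoded` are hypothetical names.

```python
import pandas as pd

# Toy data: illustrative values, not taken from taichung.xlsx
toy = pd.DataFrame({
    'LOCATION': ['清水區', '西屯區', '清水區'],
    'AREA': [165.1, 98.4, 120.0],
})

# One indicator column per distinct district
dummies = pd.get_dummies(toy['LOCATION'])

# Same pattern as the cell above: append the indicator columns to the frame
toy_encoded = toy.join(dummies)
print(toy_encoded)
```

Each row gets a true/1 value in the column of its own district and false/0 elsewhere. `pd.get_dummies` also accepts `drop_first=True`, which drops one level as the baseline and avoids perfectly collinear columns when the model fits an intercept as well.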

## In-sample and Out-of-sample Split

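We split the dataset into a training (in-sample) set and a prediction (out-of-sample) set, and drop the columns that are not needed for training.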

In [8]:
in_df = df[:10000]
out_df = df[10000:]
X_in = in_df.drop(['TYPE','LOCATION','LTP'],axis=1)
y_in = in_df['LTP']
X_out = out_df.drop(['TYPE','LOCATION','LTP'],axis=1)

In [9]:
li_regr = LinearRegression()

In [10]:
li_regr.fit(X_in, y_in)



LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

## R-SQUARED
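Coefficient of determination: the better the linear regression performs compared with simply predicting the mean, the closer the value is to 1.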

In [11]:
# R-SQUARED
li_regr.score(X_in, y_in)

0.61102845983916054
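The comparison against the mean can be spelled out by computing the coefficient of determination by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below does this on synthetic data (not the housing data) and checks that it agrees with `LinearRegression.score`; the data and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

ss_res = np.sum((y - pred) ** 2)       # error left over after the regression
ss_tot = np.sum((y - y.mean()) ** 2)   # error of always predicting the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X, y))    # the two values should agree
```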

## Hit rate and Mape
* Hit rate: the share of predictions whose relative error falls within a given threshold
* MAPE: mean absolute percentage error
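As a small worked example of both metrics, the sketch below uses four made-up price pairs. The relative error is computed the same way as in the notebook's `hit_rate` function (`predicted / actual - 1` in absolute value); the numbers and variable names are illustrative assumptions.

```python
import numpy as np

actual = np.array([100.0, 200.0, 400.0, 500.0])      # made-up actual prices
predicted = np.array([105.0, 150.0, 460.0, 500.0])   # made-up predictions

# Absolute relative error per row: about 0.05, 0.25, 0.15, 0.00
rel_err = np.abs(predicted / actual - 1)

hit_10 = np.mean(rel_err <= 0.1)   # share of rows within 10%: 2 of 4
hit_20 = np.mean(rel_err <= 0.2)   # share of rows within 20%: 3 of 4
mape = rel_err.mean()              # mean absolute percentage error

print(hit_10, hit_20, '%.2f' % (mape * 100))
```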

In [12]:
df_result_in = in_df.copy()
df_result_in['PTP'] = li_regr.predict(X_in)
df_result_in['E_PTP'] = np.exp(df_result_in['PTP'])
df_result_in['E_LTP'] = np.exp(df_result_in['LTP'])

In [13]:
df_result_out = out_df.copy()
df_result_out['PTP'] = li_regr.predict(X_out)
df_result_out['E_PTP'] = np.exp(df_result_out['PTP'])
df_result_out['E_LTP'] = np.exp(df_result_out['LTP'])

In [14]:
# DataFrame.append was removed in pandas 2.0; pd.concat works across versions
df_result_combine = pd.concat([df_result_in, df_result_out])

In [15]:
def hit_rate(df, i):
    # Share of rows whose relative error is at most i (shape[0] is the row count, shape[1] the column count)
    rate = df[(abs(df['E_PTP'] / df['E_LTP'] - 1) <= i)].shape[0] / df['E_LTP'].count()
    return '%.4f' % rate

In [16]:
hr = [0.1, 0.2]

print('ALL')
for i in hr:
    print(hit_rate(df_result_combine, i))
mape = abs(df_result_combine['E_PTP'] / df_result_combine['E_LTP'] -1).mean()
print('%.2f'%(mape*100))
print()

#in sample    
print('IN SAMPLE')
for i in hr:
    print(hit_rate(df_result_in, i))
mape = abs(df_result_in['E_PTP'] / df_result_in['E_LTP'] -1).mean()
print('%.2f'%(mape*100))
print()

#out sample
print('OUT SAMPLE')
for i in hr:
    print(hit_rate(df_result_out, i))
mape = abs(df_result_out['E_PTP'] / df_result_out['E_LTP'] -1).mean()
print('%.2f'%(mape*100))

ALL
0.2296
0.4421
31.28

IN SAMPLE
0.2335
0.4447
32.03

OUT SAMPLE
0.2155
0.4325
28.54


* Hit rate: the closer to 1, the better the performance
* MAPE: the smaller, the better the performance