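# Julia Machine Learning: Linear Regression with GLM

## Assignment 027: Boston Housing Price Dataset

Use a model from the GLM package to build a predictive model for Boston housing prices.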

In [1]:
using GLM, RDatasets, MLDataUtils

## Loading the Data

#### The Boston dataset has 14 columns
* Crim - per capita crime rate by town
* Zn - proportion of residential land zoned for lots over 25,000 sq.ft.
* Indus - proportion of non-retail business acres per town.
* Chas - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
* NOx - nitric oxides concentration (parts per 10 million)
* Rm - average number of rooms per dwelling
* Age - proportion of owner-occupied units built prior to 1940
* Dis - weighted distances to five Boston employment centres
* Rad - index of accessibility to radial highways
* Tax - full-value property-tax rate per USD 10,000
* PTRatio - pupil-teacher ratio by town
* Black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
* LStat - % lower status of the population
* MedV - Median value of owner-occupied homes in USD 1000's

In [2]:
boston = dataset("MASS", "Boston")
show(first(boston, 10), allcols = true)


10×14 DataFrame
│ Row │ Crim    │ Zn      │ Indus   │ Chas  │ NOx     │ Rm      │ Age     │
│     │ Float64 │ Float64 │ Float64 │ Int64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼───────┼─────────┼─────────┼─────────┤
│ 1   │ 0.00632 │ 18.0    │ 2.31    │ 0     │ 0.538   │ 6.575   │ 65.2    │
│ 2   │ 0.02731 │ 0.0     │ 7.07    │ 0     │ 0.469   │ 6.421   │ 78.9    │
│ 3   │ 0.02729 │ 0.0     │ 7.07    │ 0     │ 0.469   │ 7.185   │ 61.1    │
│ 4   │ 0.03237 │ 0.0     │ 2.18    │ 0     │ 0.458   │ 6.998   │ 45.8    │
│ 5   │ 0.06905 │ 0.0     │ 2.18    │ 0     │ 0.458   │ 7.147   │ 54.2    │
│ 6   │ 0.02985 │ 0.0     │ 2.18    │ 0     │ 0.458   │ 6.43    │ 58.7    │
│ 7   │ 0.08829 │ 12.5    │ 7.87    │ 0     │ 0.524   │ 6.012   │ 66.6    │
│ 8   │ 0.14455 │ 12.5    │ 7.87    │ 0     │ 0.524   │ 6.172   │ 96.1    │
│ 9   │ 0.21124 │ 12.5    │ 7.87    │ 0     │ 0.524   │ 5.631   │ 100.0   │
⋮

## Splitting into Training and Test Sets

In [3]:
indices = MLDataUtils.shuffleobs(collect(1:nrow(boston)))
train_ind, test_ind = MLDataUtils.splitobs(indices, at = 0.8);

In [4]:
train = boston[train_ind, :]
test = boston[test_ind, :]

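│ Row │ Crim    │ Zn      │ Indus   │ Chas  │ NOx     │ Rm      │ Age     │ Dis     │ Rad   │ Tax   │
│     │ Float64 │ Float64 │ Float64 │ Int64 │ Float64 │ Float64 │ Float64 │ Float64 │ Int64 │ Int64 │
├─────┼─────────┼─────────┼─────────┼───────┼─────────┼─────────┼─────────┼─────────┼───────┼───────┤
│ 1   │ 0.08221 │ 22.0    │ 5.86    │ 0     │ 0.431   │ 6.957   │ 6.8     │ 8.9067  │ 7     │ 330   │
│ 2   │ 0.06664 │ 0.0     │ 4.05    │ 0     │ 0.51    │ 6.546   │ 33.1    │ 3.1323  │ 5     │ 296   │
│ 3   │ 0.0686  │ 0.0     │ 2.89    │ 0     │ 0.445   │ 7.416   │ 62.5    │ 3.4952  │ 2     │ 276   │
│ 4   │ 3.77498 │ 0.0     │ 18.1    │ 0     │ 0.655   │ 5.952   │ 84.7    │ 2.8715  │ 24    │ 666   │
│ 5   │ 1.27346 │ 0.0     │ 19.58   │ 1     │ 0.605   │ 6.25    │ 92.6    │ 1.7984  │ 5     │ 403   │
│ 6   │ 0.05188 │ 0.0     │ 4.49    │ 0     │ 0.449   │ 6.015   │ 45.1    │ 4.4272  │ 3     │ 247   │
│ 7   │ 0.03359 │ 75.0    │ 2.95    │ 0     │ 0.428   │ 7.024   │ 15.8    │ 5.4011  │ 3     │ 252   │
│ 8   │ 0.01432 │ 100.0   │ 1.32    │ 0     │ 0.411   │ 6.816   │ 40.5    │ 8.3248  │ 5     │ 256   │
│ 9   │ 7.75223 │ 0.0     │ 18.1    │ 0     │ 0.713   │ 6.301   │ 83.7    │ 2.7831  │ 24    │ 666   │
│ 10  │ 0.09065 │ 20.0    │ 6.96    │ 1     │ 0.464   │ 5.92    │ 61.5    │ 3.9175  │ 3     │ 223   │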
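
Because the shuffle is random, the rows that land in `train` and `test` change on every run. A minimal sketch of a reproducible 80/20 split, using only the standard library's `Random` module, could look like the following. The helper name `split_indices`, the seed value, and the rounding rule are illustrative assumptions and are not part of MLDataUtils.

```julia
using Random

# Hypothetical helper: permute 1:n with a seeded RNG and cut at the given fraction.
function split_indices(n::Integer; at::Real = 0.8, seed::Integer = 42)
    rng = MersenneTwister(seed)      # fixed seed makes the permutation repeatable
    idx = randperm(rng, n)           # random permutation of the row indices 1:n
    n_train = round(Int, at * n)     # number of rows assigned to the training part
    return idx[1:n_train], idx[n_train+1:end]
end

# The Boston dataset has 506 rows.
train_idx, test_idx = split_indices(506)
```

With these indices, the data frames would be built the same way as in the cell above, for example `train = boston[train_idx, :]` and `test = boston[test_idx, :]`.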


## Linear Regression Model

In [5]:
ols = GLM.lm(@formula(MedV ~ Crim + Zn + Indus + Chas + NOx + Rm + Age + Dis + Rad + Tax + PTRatio + Black + LStat), train)

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

MedV ~ 1 + Crim + Zn + Indus + Chas + NOx + Rm + Age + Dis + Rad + Tax + PTRatio + Black + LStat

Coefficients:
───────────────────────────────────────────────────────────────────────────────────────
                 Estimate  Std. Error     t value  Pr(>|t|)     Lower 95%     Upper 95%
───────────────────────────────────────────────────────────────────────────────────────
(Intercept)   38.1883      5.65946      6.7477       <1e-10   27.0615       49.3151
Crim          -0.124891    0.0350837   -3.5598       0.0004   -0.193867     -0.0559146
Zn             0.0510185   0.0157817    3.23277      0.0013    0.019991      0.0820461
Indus         -0.0067066   0.0699311   -0.0959029    0.9236   -0.144195      0.130781
Chas           2.19799     0.958293     2.29365      0.0223    0.313936      4.08204
NOx          -19.1561    
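
For intuition about what `lm` estimates: ordinary least squares picks the coefficient vector that minimizes the sum of squared residuals. The sketch below shows this on made-up data with Julia's backslash operator. The helper name `ols_coefficients` and the toy numbers are assumptions for illustration only, and GLM's own computation (a Cholesky-based solver, judging by the model type printed above) may differ in numerical details.

```julia
using LinearAlgebra

# Hypothetical helper: least-squares fit with an intercept column prepended.
function ols_coefficients(X::AbstractMatrix, y::AbstractVector)
    A = hcat(ones(size(X, 1)), X)   # design matrix: intercept column plus predictors
    return A \ y                    # least-squares solution of A * β ≈ y
end

# Toy data generated exactly from y = 1 + 2x, so the fit should recover
# an intercept close to 1 and a slope close to 2.
x = [0.0, 1.0, 2.0, 3.0]
y = 1 .+ 2 .* x
β = ols_coefficients(reshape(x, :, 1), y)
```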

## Prediction

In [7]:
predict(ols, test)

101-element Array{Union{Missing, Float64},1}:
 24.655206201831877
 30.80684505344785
 32.29429212659351
 14.277307654914473
 31.43703781449451
 22.577019531036456
 34.46409124760812
 33.58093165047392
 16.963230119177844
 25.780991937476994
 27.338905348076267
 20.717207153960256
 23.276975858485635
  ⋮
 24.461462020346563
 30.487463732167967
 11.889839684667177
 10.152952651780929
 25.134343653143652
 23.531658780487334
 19.923836531767826
 13.320353052166368
 27.320757325190755
 31.34218554337242
 18.599278874515974
 25.01640515638521

## Model Evaluation

In [9]:
GLM.r²(ols)

0.7442087442405037

In [10]:
GLM.adjr²(ols)

0.7357041756346892
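
The `r²` and `adjr²` values above are computed on the training data the model was fitted to. To gauge out-of-sample accuracy, the predictions from `predict(ols, test)` can be compared with the observed `MedV` values in `test`. A minimal sketch with two hand-written metrics follows. The helper names `rmse` and `r2_score` are illustrative assumptions, not GLM functions, and the toy vectors stand in for real predictions.

```julia
using Statistics

# Root-mean-square error between observed values y and predictions ŷ.
rmse(y, ŷ) = sqrt(mean((y .- ŷ) .^ 2))

# Coefficient of determination: 1 - SS_res / SS_tot.
r2_score(y, ŷ) = 1 - sum((y .- ŷ) .^ 2) / sum((y .- mean(y)) .^ 2)

# Toy check with made-up numbers: only the last prediction is off, by 1.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
toy_rmse = rmse(y_true, y_pred)      # sqrt(1/4) = 0.5
toy_r2 = r2_score(y_true, y_pred)    # 1 - 1/5, approximately 0.8
```

On the notebook's data, these would be called as `rmse(test.MedV, predict(ols, test))` and `r2_score(test.MedV, predict(ols, test))`, assuming the predictions contain no `missing` values.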