# Linear Regression

## Using GLM

In [1]:
using GLM
using RDatasets
using MLDataUtils

### Load data

In [2]:
data = RDatasets.dataset("datasets", "mtcars")
first(data, 6)

|   | Model | MPG | Cyl | Disp | HP | DRat | WT | QSec | VS |
|---|---|---|---|---|---|---|---|---|---|
|   | String | Float64 | Int64 | Float64 | Int64 | Float64 | Float64 | Float64 | Int64 |
| 1 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 |
| 2 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 |
| 3 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 |
| 4 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 |
| 5 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.44 | 17.02 | 0 |
| 6 | Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.46 | 20.22 | 1 |


### Training/Testing set

In [3]:
indices = MLDataUtils.shuffleobs(collect(1:nrow(data)))
train_ind, test_ind = MLDataUtils.splitobs(indices, at = 0.8);

In [4]:
train = data[train_ind, :]
test = data[test_ind, :]

|   | Model | MPG | Cyl | Disp | HP | DRat | WT | QSec | VS |
|---|---|---|---|---|---|---|---|---|---|
|   | String | Float64 | Int64 | Float64 | Int64 | Float64 | Float64 | Float64 | Int64 |
| 1 | Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.57 | 14.6 | 0 |
| 2 | Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.57 | 15.84 | 0 |
| 3 | Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.44 | 18.3 | 1 |
| 4 | Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.9 | 1 |
| 5 | Merc 450SLC | 15.2 | 8 | 275.8 | 180 | 3.07 | 3.78 | 18.0 | 0 |
| 6 | Merc 450SL | 17.3 | 8 | 275.8 | 180 | 3.07 | 3.73 | 17.6 | 0 |
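
The shuffle-then-split step above can be sketched with the standard library alone. This is only an illustration of the idea, not the MLDataUtils implementation: `holdout_indices` is a hypothetical helper, and its rounding rule is an assumption that may differ from the library's.

```julia
using Random

# Hypothetical helper: shuffle the row indices 1:n, then cut the shuffled
# vector so that roughly `frac` of the rows land in the training part.
function holdout_indices(n::Integer, frac::Real; rng = Random.default_rng())
    idx = shuffle(rng, 1:n)            # random permutation of 1:n
    n_train = round(Int, frac * n)     # assumed rounding rule
    return idx[1:n_train], idx[n_train+1:end]
end

# 32 rows, as in mtcars; a seeded generator makes the split reproducible.
train_idx, test_idx = holdout_indices(32, 0.8; rng = MersenneTwister(1))
```

The two index vectors are disjoint and together cover every row, so no observation used for fitting leaks into the held-out part.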


### Model

In [5]:
ols = GLM.lm(@formula(MPG ~ Cyl + Disp + HP + DRat + WT + QSec + VS + AM + Gear + Carb), train)

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

MPG ~ 1 + Cyl + Disp + HP + DRat + WT + QSec + VS + AM + Gear + Carb

Coefficients:
──────────────────────────────────────────────────────────────────────────────────
               Estimate  Std. Error     t value  Pr(>|t|)    Lower 95%   Upper 95%
──────────────────────────────────────────────────────────────────────────────────
(Intercept)  26.4146     21.6047      1.22263      0.2403  -19.6346     72.4639
Cyl          -0.590615    1.26807    -0.465759     0.6481   -3.29344     2.11221
Disp          0.0156236   0.0212651   0.734707     0.4738   -0.0297018   0.060949
HP           -0.0383613   0.0288122  -1.33142      0.2029   -0.0997731   0.0230505
DRat          0.995973    1.87521     0.531126     0.6031   -3.00095     4.99289
WT           -2.6337      2.38469    -1.10442      0.2868   -7.71656     2.44915
QSec     
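
For intuition about what the fit above estimates, here is a minimal least-squares sketch on toy data. The numbers are hypothetical and not taken from mtcars. The model type printed above indicates that GLM uses a Cholesky-based solver; the sketch instead uses Julia's `\` operator, which returns the least-squares solution for a tall, full-rank matrix.

```julia
using LinearAlgebra

# Toy data (hypothetical): y = 1 + 2x exactly.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

# Design matrix with a leading column of ones for the intercept,
# mirroring the implicit `1 +` term shown in the fitted formula.
A = hcat(ones(length(x)), x)

# Least-squares coefficients: β[1] is the intercept, β[2] the slope.
β = A \ y

# Fitted values and residuals.
ŷ = A * β
residuals = y .- ŷ
```

Because the toy data lie exactly on a line, the residuals vanish up to floating-point error; with real data they would not.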

### Prediction

In [6]:
predict(ols, test)

6-element Array{Union{Missing, Float64},1}:
 10.894192349816905
 13.552961534673093
 19.17930062232591
 27.9204981478803
 14.517381360702817
 14.6408661395037

### Validation

In [7]:
GLM.r²(ols)

0.8660455449392241
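
Note that `r²(ols)` is computed on the rows used for fitting (`train`), so it describes in-sample fit rather than performance on `test`. As a rough sketch of the quantity itself, R² can be written as one minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r_squared` and the toy values below are hypothetical.

```julia
using Statistics

# Coefficient of determination: 1 - SS_res / SS_tot.
function r_squared(y, ŷ)
    ss_res = sum(abs2, y .- ŷ)         # residual sum of squares
    ss_tot = sum(abs2, y .- mean(y))   # total sum of squares around the mean
    return 1 - ss_res / ss_tot
end

# Toy values (hypothetical).
y_true = [1.0, 2.0, 3.0, 4.0]
r2_perfect  = r_squared(y_true, y_true)                  # predictions equal the targets
r2_baseline = r_squared(y_true, fill(mean(y_true), 4))   # always predict the mean
```

Applying the same function to `predict(ols, test)` and `test.MPG` would give one possible out-of-sample counterpart, which can be negative when the model does worse than predicting the mean.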

## Using MLJ

In [8]:
using MLJ



### Casting scientific types

In [9]:
y, X = unpack(data[!, 2:end], ==(:MPG), colname -> true);
first(X, 6)

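|   | Cyl | Disp | HP | DRat | WT | QSec | VS | AM | Gear | Carb |
|---|---|---|---|---|---|---|---|---|---|---|
|   | Int64 | Float64 | Int64 | Float64 | Float64 | Float64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| 2 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| 3 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| 4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| 5 | 8 | 360.0 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |
| 6 | 6 | 225.0 | 105 | 2.76 | 3.46 | 20.22 | 1 | 0 | 3 | 1 |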


In [10]:
first(X, 6) |> pretty

┌[0m───────[0m┬[0m────────────[0m┬[0m───────[0m┬[0m────────────[0m┬[0m────────────[0m┬[0m────────────[0m┬[0m───────[0m┬[0m─[0m ⋯
│[0m[1m Cyl   [0m│[0m[1m Disp       [0m│[0m[1m HP    [0m│[0m[1m DRat       [0m│[0m[1m WT         [0m│[0m[1m QSec       [0m│[0m[1m VS    [0m│[0m[1m [0m ⋯
│[0m[90m Int64 [0m│[0m[90m Float64    [0m│[0m[90m Int64 [0m│[0m[90m Float64    [0m│[0m[90m Float64    [0m│[0m[90m Float64    [0m│[0m[90m Int64 [0m│[0m[90m [0m ⋯
│[0m[90m Count [0m│[0m[90m Continuous [0m│[0m[90m Count [0m│[0m[90m Continuous [0m│[0m[90m Continuous [0m│[0m[90m Continuous [0m│[0m[90m Count [0m│[0m[90m [0m ⋯
├[0m───────[0m┼[0m────────────[0m┼[0m───────[0m┼[0m────────────[0m┼[0m────────────[0m┼[0m────────────[0m┼[0m───────[0m┼[0m─[0m ⋯
│[0m 6.0   [0m│[0m 160.0      [0m│[0m 110.0 [0m│[0m 3.9        [0m│[0m 2.62       [0m│[0m 16.46      [0m│[0m 0.0   [0m│[0m [0m ⋯
│[0m 6.0   [0m│

In [11]:
X = coerce(X, :Cyl => Continuous, :HP => Continuous, :VS => Continuous, :AM => Continuous,
              :Gear => Continuous, :Carb  => Continuous)
first(X, 6)

|   | Cyl | Disp | HP | DRat | WT | QSec | VS | AM | Gear |
|---|---|---|---|---|---|---|---|---|---|
|   | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 6.0 | 160.0 | 110.0 | 3.9 | 2.62 | 16.46 | 0.0 | 1.0 | 4.0 |
| 2 | 6.0 | 160.0 | 110.0 | 3.9 | 2.875 | 17.02 | 0.0 | 1.0 | 4.0 |
| 3 | 4.0 | 108.0 | 93.0 | 3.85 | 2.32 | 18.61 | 1.0 | 1.0 | 4.0 |
| 4 | 6.0 | 258.0 | 110.0 | 3.08 | 3.215 | 19.44 | 1.0 | 0.0 | 3.0 |
| 5 | 8.0 | 360.0 | 175.0 | 3.15 | 3.44 | 17.02 | 0.0 | 0.0 | 3.0 |
| 6 | 6.0 | 225.0 | 105.0 | 2.76 | 3.46 | 20.22 | 1.0 | 0.0 | 3.0 |


### Training/testing set

In [12]:
train, test = partition(eachindex(y), 0.7, shuffle=true)

([23, 15, 2, 11, 1, 22, 16, 29, 7, 9  …  6, 13, 10, 24, 14, 5, 3, 21, 4, 19], [12, 8, 25, 27, 31, 17, 18, 30, 32, 20])

### Model

In [13]:
model = @load LinearRegressor pkg=GLM

LinearRegressor(
    fit_intercept = true,
    allowrankdeficient = false) @ 1…64

In [14]:
mach = machine(model, X, y)

Machine{LinearRegressor} @ 1…52


### Training

In [15]:
fit!(mach, rows=train)

┌ Info: Training Machine{LinearRegressor} @ 1…52.
└ @ MLJBase /home/yuehhua/.julia/packages/MLJBase/qJs1o/src/machines.jl:182


Machine{LinearRegressor} @ 1…52


### Predict

In [16]:
ŷ = predict_mean(mach, rows=test)

10-element Array{Float64,1}:
 12.05067537141278
 21.592647112997337
 15.926228647069108
 23.649360616726838
 15.707714578491686
  7.8945560241360155
 25.259911599486497
 20.743055578234383
 21.68606014203565
 27.711459523673255

### Evaluation

In [17]:
rms(ŷ, y[test])

4.246228419552175
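
The `rms` measure used above is the root-mean-square error, so the result is in the same units as the target (MPG). A minimal standard-library sketch of that quantity follows; the name `rmse` and the toy inputs are hypothetical, and this is not MLJ's own implementation.

```julia
using Statistics

# Root-mean-square error between predictions ŷ and observations y.
rmse(ŷ, y) = sqrt(mean(abs2, ŷ .- y))

# Toy check (hypothetical values): errors of 3 and 4 give sqrt((9 + 16) / 2).
toy_error = rmse([0.0, 0.0], [3.0, 4.0])
```

Because the errors are squared before averaging, a few large misses raise this score more than many small ones.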