# Linear Regression

## Using GLM

In [1]:
using GLM
using RDatasets
using MLDataUtils

### Load data

In [2]:
data = RDatasets.dataset("datasets", "mtcars")
first(data, 6)

Row,Model,MPG,Cyl,Disp,HP,DRat,WT,QSec,VS
,String,Float64,Int64,Float64,Int64,Float64,Float64,Float64,Int64
1,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0
2,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0
3,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1
4,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1
5,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0
6,Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1


### Training/Testing set

In [3]:
indices = MLDataUtils.shuffleobs(collect(1:nrow(data)))
train_ind, test_ind = MLDataUtils.splitobs(indices, at = 0.8);

In [4]:
train = data[train_ind, :]
test = data[test_ind, :]

Row,Model,MPG,Cyl,Disp,HP,DRat,WT,QSec,VS
,String,Float64,Int64,Float64,Int64,Float64,Float64,Float64,Int64
1,Ferrari Dino,19.7,6,145.0,175,3.62,2.77,15.5,0
2,Honda Civic,30.4,4,75.7,52,4.93,1.615,18.52,1
3,Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0
4,Merc 450SLC,15.2,8,275.8,180,3.07,3.78,18.0,0
5,Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1
6,Volvo 142E,21.4,4,121.0,109,4.11,2.78,18.6,1
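
The shuffle-then-split pattern used here can be sketched with the standard library alone. `holdout_split` is a hypothetical helper, not part of MLDataUtils, and its rounding rule for the cut point is an assumption that may differ from `splitobs`.

```julia
using Random

# Hypothetical helper: shuffle the indices 1:n, then cut them at the fraction `at`.
function holdout_split(n::Integer; at::Real = 0.8, rng = Random.default_rng())
    idx = shuffle(rng, collect(1:n))   # random permutation of the row indices
    cut = round(Int, at * n)           # number of training rows (assumed rounding rule)
    return idx[1:cut], idx[cut+1:end]
end

# 32 is the number of rows in mtcars; the seed is arbitrary.
train_idx, test_idx = holdout_split(32; at = 0.8, rng = MersenneTwister(42))
```

The two index vectors are disjoint and together cover every row, so they can be used to index the data frame exactly as `train_ind` and `test_ind` are used in this section.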


### Model

In [5]:
ols = GLM.lm(@formula(MPG ~ Cyl + Disp + HP + DRat + WT + QSec + VS + AM + Gear + Carb), train)

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

MPG ~ 1 + Cyl + Disp + HP + DRat + WT + QSec + VS + AM + Gear + Carb

Coefficients:
───────────────────────────────────────────────────────────────────────────────────
                Estimate  Std. Error     t value  Pr(>|t|)    Lower 95%   Upper 95%
───────────────────────────────────────────────────────────────────────────────────
(Intercept)  11.4564      23.5115      0.48727      0.6331  -38.6571     61.57
Cyl           0.0346765    1.32772     0.0261172    0.9795   -2.7953      2.86465
Disp          0.00345135   0.0238037   0.144992     0.8866   -0.0472851   0.0541878
HP           -0.0080817    0.0277733  -0.290988     0.7750   -0.0672791   0.0511157
DRat          0.977499     2.32746     0.419986     0.6804   -3.98336     5.93836
WT           -2.7256       2.58143    -1.05585      0.3077   -8.22778     2.77658
Q

### Prediction

In [6]:
predict(ols, test)

6-element Array{Union{Missing, Float64},1}:
 19.22775863148683
 29.718780395094722
 14.275210958375084
 16.08087360892606
 21.49105615888983
 25.499262705932903

### Validation

In [7]:
GLM.r²(ols)

0.8629425591930656
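
`GLM.r²(ols)` reports the fit on the rows the model was trained on. As a rough sketch, the same statistic can be computed by hand, which also makes it possible to apply it to held-out predictions. `r_squared` is a hypothetical helper and the demo vectors are illustrative placeholders, not values from this notebook.

```julia
using Statistics

# Hypothetical helper: coefficient of determination, 1 - SS_res / SS_tot.
function r_squared(ŷ::AbstractVector, y::AbstractVector)
    ss_res = sum(abs2, y .- ŷ)          # residual sum of squares
    ss_tot = sum(abs2, y .- mean(y))    # total sum of squares around the mean
    return 1 - ss_res / ss_tot
end

# Illustrative placeholder data.
y_demo = [1.0, 2.0, 3.0, 4.0]
ŷ_demo = [1.0, 2.0, 3.0, 5.0]
r2_demo = r_squared(ŷ_demo, y_demo)
```

With the objects defined in this section, an out-of-sample value could be obtained with something like `r_squared(predict(ols, test), test.MPG)`.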

## Using MLJ

In [8]:
using MLJ



### Casting scientific types

In [9]:
y, X = unpack(data[!, 2:end], ==(:MPG), colname -> true);
first(X, 6)

Row,Cyl,Disp,HP,DRat,WT,QSec,VS,AM,Gear,Carb
,Int64,Float64,Int64,Float64,Float64,Float64,Int64,Int64,Int64,Int64
1,6,160.0,110,3.9,2.62,16.46,0,1,4,4
2,6,160.0,110,3.9,2.875,17.02,0,1,4,4
3,4,108.0,93,3.85,2.32,18.61,1,1,4,1
4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
5,8,360.0,175,3.15,3.44,17.02,0,0,3,2
6,6,225.0,105,2.76,3.46,20.22,1,0,3,1


In [10]:
first(X, 6) |> pretty

┌───────┬────────────┬───────┬────────────┬────────────┬────────────┬───────┬─ ⋯
│ Cyl   │ Disp       │ HP    │ DRat       │ WT         │ QSec       │ VS    │  ⋯
│ Int64 │ Float64    │ Int64 │ Float64    │ Float64    │ Float64    │ Int64 │  ⋯
│ Count │ Continuous │ Count │ Continuous │ Continuous │ Continuous │ Count │  ⋯
├───────┼────────────┼───────┼────────────┼────────────┼────────────┼───────┼─ ⋯
│ 6.0   │ 160.0      │ 110.0 │ 3.9        │ 2.62       │ 16.46      │ 0.0   │  ⋯
│ 6.0   │

In [11]:
X = coerce(X, :Cyl => Continuous, :HP => Continuous, :VS => Continuous, :AM => Continuous,
              :Gear => Continuous, :Carb  => Continuous)
first(X, 6)

Row,Cyl,Disp,HP,DRat,WT,QSec,VS,AM,Gear
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,6.0,160.0,110.0,3.9,2.62,16.46,0.0,1.0,4.0
2,6.0,160.0,110.0,3.9,2.875,17.02,0.0,1.0,4.0
3,4.0,108.0,93.0,3.85,2.32,18.61,1.0,1.0,4.0
4,6.0,258.0,110.0,3.08,3.215,19.44,1.0,0.0,3.0
5,8.0,360.0,175.0,3.15,3.44,17.02,0.0,0.0,3.0
6,6.0,225.0,105.0,2.76,3.46,20.22,1.0,0.0,3.0


### Training/testing set

In [12]:
train, test = partition(eachindex(y), 0.7, shuffle=true)

([26, 31, 13, 11, 19, 20, 3, 7, 15, 12  …  22, 21, 27, 14, 6, 17, 10, 23, 18, 5], [4, 16, 1, 32, 29, 2, 25, 8, 9, 24])

### Model

In [13]:
model = @load LinearRegressor pkg=GLM

LinearRegressor(
    fit_intercept = true,
    allowrankdeficient = false) @ 9…18

In [14]:
mach = machine(model, X, y)

Machine{LinearRegressor} @ 1…76


### Training

In [15]:
fit!(mach, rows=train)

┌ Info: Training Machine{LinearRegressor} @ 1…76.
└ @ MLJBase /home/yuehhua/.julia/packages/MLJBase/O5b6j/src/machines.jl:187


Machine{LinearRegressor} @ 1…76


### Prediction

In [16]:
ŷ = predict_mean(mach, rows=test)

10-element Array{Float64,1}:
 19.88022954974272
 11.239138690584138
 21.499655005017065
 25.933066616422433
 25.786485030753983
 21.675070035816976
 16.689578159696
 20.325390248156463
 25.457322333839
 14.623636775685082

### Evaluation

In [17]:
rms(ŷ, y[test])

3.9460847127901757
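
For intuition, the root-mean-square error that `rms` reports can be written out directly. This is a minimal sketch; `rmse` is a hypothetical helper rather than the MLJ implementation, and the input vectors are illustrative placeholders.

```julia
using Statistics

# Hypothetical helper: square root of the mean squared difference.
rmse(ŷ, y) = sqrt(mean(abs2, ŷ .- y))

# Illustrative placeholder data: the errors are 1 and 2.
demo_error = rmse([2.0, 4.0], [1.0, 2.0])
```

The result is in the same units as the target, so the value in this section can be read as a typical prediction error in miles per gallon.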