# Linear Regression

In [1]:
using GLM
using StatsBase
using RDatasets
using MLDataUtils

## Load data

In [2]:
data = RDatasets.dataset("datasets", "mtcars")
first(data, 6)

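|   | Model | MPG | Cyl | Disp | HP | DRat | WT | QSec | VS | AM | Gear | Carb |
|---|-------|-----|-----|------|----|------|----|------|----|----|------|------|
|   | String⍰ | Float64⍰ | Int64⍰ | Float64⍰ | Int64⍰ | Float64⍰ | Float64⍰ | Float64⍰ | Int64⍰ | Int64⍰ | Int64⍰ | Int64⍰ |
| 1 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| 2 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| 3 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| 4 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| 5 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |
| 6 | Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.46 | 20.22 | 1 | 0 | 3 | 1 |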


## Training/Testing set

In [3]:
indices = MLDataUtils.shuffleobs(collect(1:nrow(data)))
train_ind, test_ind = MLDataUtils.splitobs(indices, at = 0.8);
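
The shuffle-then-split step can also be sketched with only the standard library, which makes the split reproducible through an explicit seed. The helper `shuffle_split` below is hypothetical (it is not part of MLDataUtils), and its rounding rule is an assumption chosen so that 32 rows give 26 training and 6 test rows, matching the six test rows shown in this notebook.

```julia
using Random

# Hypothetical helper, not an MLDataUtils function:
# shuffle the indices 1:n and cut them at the fraction `at`.
function shuffle_split(n::Integer; at::Real = 0.8, rng::AbstractRNG = Random.GLOBAL_RNG)
    0 < at < 1 || throw(ArgumentError("`at` must lie strictly between 0 and 1"))
    perm = randperm(rng, n)          # random permutation of the row indices
    n_train = round(Int, at * n)     # rounding rule is an assumption of this sketch
    return perm[1:n_train], perm[n_train+1:end]
end

rng = MersenneTwister(42)            # fixed seed so the same split is produced on each run
train_idx, test_idx = shuffle_split(32; at = 0.8, rng = rng)
println(length(train_idx), " training rows, ", length(test_idx), " test rows")
```

With a data frame, the two index vectors would be used exactly like `train_ind` and `test_ind` in the next cell.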

In [4]:
train = data[train_ind, :]
test = data[test_ind, :]

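|   | Model | MPG | Cyl | Disp | HP | DRat | WT | QSec | VS | AM | Gear | Carb |
|---|-------|-----|-----|------|----|------|----|------|----|----|------|------|
|   | String⍰ | Float64⍰ | Int64⍰ | Float64⍰ | Int64⍰ | Float64⍰ | Float64⍰ | Float64⍰ | Int64⍰ | Int64⍰ | Int64⍰ | Int64⍰ |
| 1 | Chrysler Imperial | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 |
| 2 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| 3 | Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.78 | 18.6 | 1 | 1 | 4 | 2 |
| 4 | Merc 450SE | 16.4 | 8 | 275.8 | 180 | 3.07 | 4.07 | 17.4 | 0 | 0 | 3 | 3 |
| 5 | Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.77 | 15.5 | 0 | 1 | 5 | 6 |
| 6 | Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.7 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |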


## Model

In [5]:
ols = GLM.lm(@formula(MPG ~ Cyl + Disp + HP + DRat + WT + QSec + VS + AM + Gear + Carb), train)

StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: MPG ~ 1 + Cyl + Disp + HP + DRat + WT + QSec + VS + AM + Gear + Carb

Coefficients:
               Estimate Std.Error   t value Pr(>|t|)
(Intercept)     23.4015    19.967   1.17201   0.2595
Cyl             -0.8718   1.19619 -0.728816   0.4773
Disp          0.0223236   0.02083    1.0717   0.3008
HP           -0.0196154 0.0259391 -0.756209   0.4612
DRat           0.213961   1.73492  0.123326   0.9035
WT             -5.63146   2.35533  -2.39095   0.0304
QSec           0.905362  0.747622   1.21099   0.2446
VS             0.237087   2.15355  0.110091   0.9138
AM              2.02058   2.23834  0.902711   0.3809
Gear          -0.277109    1.6279 -0.170225   0.8671
Carb           0.379176  0.967901   0.39175   0.7008
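
To see what the coefficient table is estimating, the following sketch solves a small least-squares problem with the standard library only. It uses synthetic data rather than mtcars, all variable names are illustrative, and it is not GLM's implementation: the model type printed above indicates a Cholesky-based solver, whereas `\` on a tall matrix uses a QR factorization.

```julia
using LinearAlgebra
using Random

# Synthetic data (not mtcars): y = 2 + 3*x1 - 1.5*x2 + small noise.
rng = MersenneTwister(1)
n = 50
x1 = randn(rng, n)
x2 = randn(rng, n)
y = 2.0 .+ 3.0 .* x1 .- 1.5 .* x2 .+ 0.01 .* randn(rng, n)

# Design matrix with an explicit intercept column,
# mirroring the `1 +` term in the printed formula.
X = hcat(ones(n), x1, x2)

# Least-squares estimate of the coefficients.
beta = X \ y

# Standard errors and t values from the residual variance and inv(X'X).
resid = y - X * beta
sigma2 = sum(abs2, resid) / (n - size(X, 2))
se = sqrt.(diag(sigma2 * inv(X' * X)))
tvals = beta ./ se

println("estimates:  ", round.(beta, digits = 3))
println("std errors: ", round.(se, digits = 4))
```

The `Estimate`, `Std.Error`, and `t value` columns of the table correspond to `beta`, `se`, and `tvals` in this sketch.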


## Prediction

In [6]:
predict(ols, test)

6-element Array{Union{Missing, Float64},1}:
  8.785617245040202
 22.06684882711072 
 24.448572745930615
 12.849452574920683
 20.093483353947622
 25.503968628708527
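
For a linear model, each prediction is the intercept plus the coefficient-weighted sum of that row's features. As a rough check, the sketch below recomputes the first two predictions by hand from the rounded coefficients printed in the Model section and the first two test rows; the variable names are illustrative. Because the printed coefficients are rounded, the results should agree with `predict` only to within about 1e-4.

```julia
# Coefficients copied from the printed table, in this order:
# Intercept, Cyl, Disp, HP, DRat, WT, QSec, VS, AM, Gear, Carb
coefs = [23.4015, -0.8718, 0.0223236, -0.0196154, 0.213961,
         -5.63146, 0.905362, 0.237087, 2.02058, -0.277109, 0.379176]

# First two test rows (Chrysler Imperial, Mazda RX4 Wag),
# each with a leading 1 for the intercept.
X_new = [1.0 8 440.0 230 3.23 5.345 17.42 0 0 3 4;
         1.0 6 160.0 110 3.9  2.875 17.02 0 1 4 4]

# Prediction is a matrix-vector product.
y_pred = X_new * coefs
println(round.(y_pred, digits = 3))
```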

## Validation

In [7]:
GLM.r²(ols)

0.910994112617272
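
Note that `r²(ols)` measures fit on the training rows. A held-out check can be sketched from numbers already shown in this notebook: the observed MPG of the six test rows and their predicted values. The helpers `rmse` and `r2` are hypothetical names defined here, not GLM functions, and the predictions are rounded to four decimals.

```julia
using Statistics

# Values copied from the notebook output: observed MPG of the six test rows
# and the corresponding `predict(ols, test)` results (rounded).
y_test = [14.7, 21.0, 21.4, 16.4, 19.7, 21.5]
y_hat  = [8.7856, 22.0668, 24.4486, 12.8495, 20.0935, 25.5040]

# Hypothetical helpers, not GLM functions.
rmse(y, yhat) = sqrt(mean(abs2, y .- yhat))
r2(y, yhat) = 1 - sum(abs2, y .- yhat) / sum(abs2, y .- mean(y))

test_rmse = rmse(y_test, y_hat)
test_r2 = r2(y_test, y_hat)
println("test RMSE = ", round(test_rmse, digits = 3))
println("test R²   = ", round(test_r2, digits = 3))
```

An out-of-sample R² computed this way can fall well below the training value, and can even be negative, when a model with ten predictors is fitted to only 26 rows.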

# Packages

## Distributions

* Distributions.jl

## Regression

* Lasso.jl
    * Ridge regression
    * LASSO regression
    * ElasticNet
* LARS.jl
    * Least angle regression
    * L1-regularized linear regression
* Isotonic.jl
    * Linear PAVA (fastest)
    * Pooled PAVA (slower)
    * Active Set (slowest)

## Clustering

* Clustering.jl
    * K-means
    * K-medoids
    * Affinity Propagation
    * Density-based spatial clustering of applications with noise (DBSCAN)
    * Markov Clustering Algorithm (MCL)
    * Fuzzy C-Means Clustering
    * Hierarchical Clustering
        * Single Linkage
        * Average Linkage
        * Complete Linkage
        * Ward's Linkage

## Dimensionality Reduction

* MultivariateStats.jl
    * Data Whitening
    * Principal Components Analysis (PCA)
    * Canonical Correlation Analysis (CCA)
    * Classical Multidimensional Scaling (MDS)
    * Linear Discriminant Analysis (LDA)
    * Multiclass LDA
    * Independent Component Analysis (ICA), FastICA
    * Probabilistic PCA
    * Factor Analysis
    * Kernel PCA

## Nonnegative Matrix Factorization

* NMF.jl
    * Lee & Seung's Multiplicative Update (for both MSE & Divergence objectives)
    * (Naive) Projected Alternating Least Squares
    * ALS Projected Gradient Methods
    * Random Initialization
    * NNDSVD Initialization

## Kernel Density Estimation

* KernelDensity.jl

## Time Series Analysis

* TimeSeries.jl

# Go to deep learning!