# 1) OLS vs. Regularized Linear Regression

In OLS, the goal is to minimize the cost function
$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2; \quad h_\theta(x^{(i)}) = \theta^T x^{(i)}$$
where $i$ runs over all $m$ data points and the minimization is over the parameter vector $\theta$.
In regularized linear regression, we add a penalty term to this cost function that pushes the coefficients toward smaller values. Specifically, the new cost function is
$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n}\theta_j^2$$
where $j$ runs over all components of the $\theta$ vector except for the bias $\theta_0$, and $\lambda \ge 0$ sets the strength of the penalty.
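
The regularized cost can be evaluated by hand on a toy dataset. The following is a minimal sketch, not part of the original notebook: the helper name `regularized_cost` and the convention that `X` carries a first row of ones (so that `θ[1]` plays the role of the bias $\theta_0$) are assumptions made for illustration.

```julia
# Hypothetical helper: regularized cost for a linear hypothesis.
# X is (n+1)×m with a first row of ones (bias feature); θ[1] plays the role of θ_0.
function regularized_cost(θ::AbstractVector, X::AbstractMatrix, y::AbstractVector, λ::Real)
    m = size(X, 2)                            # number of data points (columns)
    residuals = X' * θ .- y                   # h_θ(x⁽ⁱ⁾) - y⁽ⁱ⁾ for every i
    data_term = sum(abs2, residuals) / (2m)   # ordinary least-squares part
    penalty = λ / (2m) * sum(abs2, θ[2:end])  # L2 penalty, bias excluded
    return data_term + penalty
end

X = [1.0 1.0 1.0; 1.0 2.0 3.0]   # three points with one feature each
y = [1.0, 2.0, 3.0]
θ = [0.0, 1.0]                   # fits the toy data exactly

cost_ols = regularized_cost(θ, X, y, 0.0)    # 0.0: perfect fit, no penalty
cost_ridge = regularized_cost(θ, X, y, 3.0)  # 0.5: only the penalty remains
```

With $\lambda = 0$ the function reduces to the OLS cost, so the same $\theta$ that fits the data perfectly is still charged $\frac{\lambda}{2m}\theta_1^2$ once $\lambda > 0$.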

# 2) Where to Use Regularized Linear Regression

When we have many features and little data, the model may overfit the dataset. One remedy is to remove features, but this may be undesirable if we want to use every feature of the data. Regularized linear regression keeps the coefficients from growing too large, so we can keep every feature and still reduce overfitting.
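
To see how the penalty keeps coefficients small, it may help to write out one gradient descent step on the regularized cost. This is a worked sketch; the learning rate $\alpha$ is a symbol introduced here for illustration. For $j \ge 1$,
$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j,$$
so the update $\theta_j \leftarrow \theta_j - \alpha\,\partial J/\partial\theta_j$ becomes
$$\theta_j \leftarrow \theta_j\left(1 - \frac{\alpha\lambda}{m}\right) - \frac{\alpha}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}.$$
The second term is the usual OLS step. The factor $1 - \alpha\lambda/m$ shrinks each coefficient slightly toward zero on every iteration, provided $\alpha\lambda/m < 1$.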

When we have few features and a lot of data, the model tends to underfit (high bias), and regularized linear regression will not help, because the penalty only constrains the model further.

The disadvantage of regularized linear regression is that it reduces the model's ability to fit the data. If we can get more data instead, that is a better way to avoid overfitting, since the model can then fit the data unrestrained.

# 3) A Regularized Linear Regression Example (Insurance)
## Processing the Data

In [1]:
using Flux, DataFrames, CSV
df = CSV.read("insurance.csv", DataFrame)

| Row | age (Int64) | sex (String7) | bmi (Float64) | children (Int64) | smoker (String3) | region (String15) | charges (Float64) |
|---|---|---|---|---|---|---|---|
| 1 | 19 | female | 27.9 | 0 | yes | southwest | 16884.9 |
| 2 | 18 | male | 33.77 | 1 | no | southeast | 1725.55 |
| 3 | 28 | male | 33.0 | 3 | no | southeast | 4449.46 |
| 4 | 33 | male | 22.705 | 0 | no | northwest | 21984.5 |
| 5 | 32 | male | 28.88 | 0 | no | northwest | 3866.86 |
| 6 | 31 | female | 25.74 | 0 | no | southeast | 3756.62 |
| 7 | 46 | female | 33.44 | 1 | no | southeast | 8240.59 |
| 8 | 37 | female | 27.74 | 3 | no | northwest | 7281.51 |
| 9 | 37 | male | 29.83 | 2 | no | northeast | 6406.41 |
| 10 | 60 | female | 25.84 | 0 | no | northwest | 28923.1 |


In [2]:
# binary-encode sex and smoker as 1×N rows; one-hot encode region as a 4×N matrix
sex = permutedims(df.sex .== "female")
smoker = permutedims(df.smoker .== "yes")
region = (unique(df.region) .== permutedims(df.region))

4×1338 BitMatrix:
 1  0  0  0  0  0  0  0  0  0  0  0  1  …  0  0  0  1  0  1  1  0  0  0  1  0
 0  1  1  0  0  1  1  0  0  0  0  1  0     0  1  0  0  1  0  0  0  0  1  0  0
 0  0  0  1  1  0  0  1  0  1  0  0  0     0  0  0  0  0  0  0  1  0  0  0  1
 0  0  0  0  0  0  0  0  1  0  1  0  0     1  0  1  0  0  0  0  0  1  0  0  0

In [3]:
xs = Matrix{Float32}(vcat(permutedims(df.age), sex, permutedims(df.bmi), permutedims(df.children), smoker, region))
ys = permutedims(df.charges)

# Normalization
using Statistics: mean, std

xs .-= mean(xs, dims=2)
xs ./= std(xs, dims=2)

ys .-= mean(ys, dims=2)
ys ./= std(ys, dims=2)

1×1338 Matrix{Float64}:
 0.298472  -0.953333  -0.728402  0.719574  …  -0.961237  -0.930014  1.31056

In [4]:
# separating the test set and training set
test_set_percentage = 15

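using Random: randperm

# draw test indices without replacement, so no data point appears twice in the test set
n_test = round(Int, test_set_percentage / 100 * nrow(df))
test_set_indices = randperm(nrow(df))[1:n_test]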
train_set_indices = setdiff(1:nrow(df), test_set_indices)

xtest, ytest = xs[:, test_set_indices], ys[:, test_set_indices]
xtrain, ytrain = xs[:, train_set_indices], ys[:, train_set_indices]

(Float32[-1.438226 -0.797655 … -1.2958769 1.5511055; 1.0101442 -0.98922414 … 1.0101442 1.0101442; … ; -0.5662043 -0.5662043 … -0.5662043 1.7648153; -0.56505513 -0.56505513 … -0.56505513 -0.56505513], [0.29847220322195916 -0.728402318770214 … -0.9300137749678821 1.3105634441336376])
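
A train/test split should produce two disjoint index sets that together cover the data. Sampling with `rand(1:n, k)` draws with replacement and can repeat indices, whereas a random permutation cannot. The following is a minimal sketch of a permutation-based split; the helper name `split_indices` is hypothetical and not part of the original notebook.

```julia
using Random: randperm

# Hypothetical helper: split 1:n into disjoint train and test index sets.
function split_indices(n::Integer, test_percentage::Real)
    n_test = round(Int, test_percentage / 100 * n)
    shuffled = randperm(n)            # random permutation of 1:n, so no index repeats
    test = shuffled[1:n_test]
    train = shuffled[n_test+1:end]
    return train, test
end

# 1338 is the number of rows in the insurance dataset above
train_idx, test_idx = split_indices(1338, 15)
```

Because both sets are slices of one permutation, the test set has exactly the requested size and no index can land in both sets.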

## Training

In [5]:
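lambda = 0.1                   # regularization strength
model = Flux.Dense(9 => 1)     # linear model: 9 features in, 1 output, identity activation
# mean squared error plus an L2 penalty on the weights only (the bias is not penalized)
loss(x, y) = Flux.mse(model(x), y) + lambda * sum(abs2, model.weight)
optimiser = Flux.Descent()     # plain gradient descent with Flux's default learning rate
parameters = Flux.params(model)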

Params([Float32[-0.71297485 0.3522394 … -0.57007253 0.059994113], Float32[0.0]])
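
The `loss` defined in this cell is scaled differently from the cost function $J(\theta)$ in section 1. As a worked comparison, write $\lambda_{\text{code}}$ for the `lambda` variable and $m$ for the number of examples passed to `loss` (both symbols are introduced here for illustration). Since `Flux.mse` averages the squared errors,
$$\text{loss} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda_{\text{code}}\sum_{j=1}^{n}\theta_j^2 = 2\,J(\theta) \quad \text{with} \quad \lambda = m\,\lambda_{\text{code}}.$$
The overall factor of 2 does not change the minimizer, so the code minimizes the same kind of objective. The value `lambda = 0.1` should be read as $\lambda/m$ rather than as the $\lambda$ of section 1.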

In [6]:
loss(xtest, ytest)  # test loss before training (includes the L2 penalty term)

3.153158343188032

In [7]:
for _ in 1:10000
    Flux.train!(loss, parameters, [(xtrain, ytrain)], optimiser)
end

In [8]:
loss(xtest, ytest)  # test loss after training (includes the L2 penalty term)

0.32709130973580575

In [11]:
display("text/markdown", "\$\\theta_0 = $(model.bias[1])\$")
for i in 1:9
    display("text/markdown", "\$\\theta_$i = $(model.weight[i])\$")
end

$\theta_0 = -0.0076105306$

$\theta_1 = 0.2738731$

$\theta_2 = 0.0013144038$

$\theta_3 = 0.15348485$

$\theta_4 = 0.044129185$

$\theta_5 = 0.72482073$

$\theta_6 = -0.016531605$

$\theta_7 = -0.010159851$

$\theta_8 = 0.005979021$

$\theta_9 = 0.021117795$

# 4) Identical Data for a Neural Network

Every example will be classified as B. Specifically, the network's output will be much closer to 1. The exception is if the optimizer gets stuck in a local minimum, which is unlikely if everything is configured correctly.

The cost function is a simple mean squared error. With a linear hypothesis it is convex and easy to minimize, but with a nonlinear hypothesis local minima may exist. Nevertheless, given our data, the global minimum is where all data is classified as B (output closer to 1).

# 5) Partial Derivative of Integral Cost Function

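Most of this is a differentiation problem; the optimum hypothesis only enters in the last step. Assuming the cost is the expected squared error
$$E = \frac{1}{2}\int\int \left(y(\mathbf{x}, \mathbf{w}) - t\right)^2 p(\mathbf{x}, t)\, d\mathbf{x} \, dt,$$
we use the Leibniz integral rule,
$${\frac {d}{dx}}\left(\int _{a(x)}^{b(x)}f(x,t)\,dt\right)=f{\big (}x,b(x){\big )}\cdot {\frac {d}{dx}}b(x)-f{\big (}x,a(x){\big )}\cdot {\frac {d}{dx}}a(x)+\int _{a(x)}^{b(x)}{\frac {\partial }{\partial x}}f(x,t)\,dt. $$
Here the integration limits do not depend on $\mathbf{w}$, so the boundary terms vanish and the derivative moves inside the integral:
$$\frac{\partial E}{\partial w_s} = \int\int \frac{\partial y}{\partial w_s}(y(\mathbf{x}, \mathbf{w}) - t) p(\mathbf{x}, t)\, d\mathbf{x} \, dt.$$
Taking the second derivative with the product rule,
$$\frac{\partial^2 E}{\partial w_r\partial w_s} = \int\int \left[\frac{\partial y}{\partial w_r}\frac{\partial y}{\partial w_s} + \frac{\partial^2 y}{\partial w_r\partial w_s}(y(\mathbf{x}, \mathbf{w}) - t)\right] p(\mathbf{x}, t)\, d\mathbf{x} \, dt.$$
Since $y$ does not depend on $t$, the $t$ integral of the second term gives $\frac{\partial^2 y}{\partial w_r\partial w_s}\left(y(\mathbf{x}, \mathbf{w}) - \mathbb{E}[t \mid \mathbf{x}]\right)p(\mathbf{x})$, which vanishes at the optimum hypothesis $y(\mathbf{x}, \mathbf{w}) = \mathbb{E}[t \mid \mathbf{x}]$. In the first term, integrating $p(\mathbf{x}, t)$ over $t$ gives $p(\mathbf{x})$, so
$$\frac{\partial^2 E}{\partial w_r\partial w_s} = \int \frac{\partial y}{\partial w_r}\frac{\partial y}{\partial w_s} p(\mathbf{x})\, d\mathbf{x}.$$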