# Introduction to Machine Learning

## Definition of Machine Learning
*Field of study that gives computers the ability to learn without being explicitly programmed. Arthur Samuel (1959)*

# Machine Learning algorithms

## Supervised learning
For every given input, we know what the correct output should look like.
### Examples
* Given data about the size of houses on the real estate market, try to predict their price.
* Given the content of an email, classify it as either spam or not spam.

## Unsupervised learning
Given a set of inputs, we try to find structure or relationships among them, without being told the correct output.
### Example
* Take a collection of 1,000,000 different genes and automatically cluster them into groups that are similar or related by variables such as lifespan, location, and role.

# Supervised learning example

## House pricing prediction

In [None]:
include("helper.jl")
m = 10
X, y = gen_samples(m)

[X y]

In [None]:
plot_samples(X, y)


# Hypothesis

## Single variable
\begin{equation*}
h_{\theta}(x) = \theta_0 + \theta_1 x
\end{equation*}

## Multiple variables
\begin{equation*}
h_{\theta}(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n
\end{equation*}

## Multiple variables - vector notation
\begin{equation*}
h_{\theta}(x) = x \cdot \Theta
\end{equation*}

\begin{equation*}
x = \left[1\ x_1\ x_2\ \cdots\ x_n \right],
\Theta = \begin{bmatrix}
       \theta_0 \\
       \theta_1 \\
       \theta_2 \\
       \vdots \\
       \theta_n
\end{bmatrix}
\end{equation*}
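
As a quick check that the vector form agrees with the expanded sum, here is a minimal sketch. The parameter values and the `features` vector are made up for illustration. Note that Julia arrays are 1-indexed, so `Θ[1]` plays the role of $\theta_0$.

```julia
using LinearAlgebra

# Hypothetical parameters θ_0..θ_3 and one example with n = 3 features.
Θ = [1.0, 2.0, -0.5, 0.25]
features = [3.0, 4.0, 8.0]

# Expanded form: θ_0 + θ_1 x_1 + ... + θ_n x_n
h_sum = Θ[1] + sum(Θ[2:end] .* features)

# Vector form: prepend the constant 1 so that θ_0 acts as the intercept.
x = [1.0; features]
h_vec = dot(x, Θ)

println((h_sum, h_vec))
```

Both expressions evaluate the same hypothesis; the vector form is the one that generalizes to a whole matrix of examples at once.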

# Cost function

Choose $\theta_0$, $\theta_1$ so that $h_{\theta}(x)$ is close to $y$ for our training examples $(x,y)$.

\begin{equation*}
J(\theta_0, \theta_1)=\frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^2
\end{equation*}

$x^{(i)}$ and $y^{(i)}$ denote the $i$-th example in the training set, and $m$ is the number of training examples.
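
If the training inputs are stacked as rows of a matrix $X$ (each row starting with the constant 1) and the targets are collected in a vector $y$, the same cost can be written in matrix form. This is only a restatement of the sum above, and it is the form that is most convenient to compute:

\begin{equation*}
J(\Theta)=\frac{1}{2m}\left(X\Theta-y\right)^T\left(X\Theta-y\right)
\end{equation*}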

In [None]:
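# Prepend a column of ones so that Θ[1] acts as the intercept θ_0
_X = [ones(m, 1) X]

# Squared-error cost; `2m` is a literal coefficient, so this divides by (2 * m)
J(Θ) = 1 / (2m) * sum((_X * Θ - y) .^ 2)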

plot_cost(J)

# Cost optimization

## Gradient descent algorithm
`repeat until convergence, simultaneously for every` $j=0,\dots,n$

\begin{equation*}
\theta_j:=\theta_j - \alpha \frac{\partial}{\partial \theta_j}J\left(\theta_0,\dots,\theta_n\right)
\end{equation*}

`end repeat`

$\alpha$ denotes the learning rate
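
For the squared-error cost defined earlier, the partial derivative in the update rule follows from the chain rule. With the convention $x_0^{(i)} = 1$, it works out to:

\begin{equation*}
\frac{\partial}{\partial \theta_j}J\left(\theta_0,\dots,\theta_n\right)=\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_j^{(i)}
\end{equation*}

Stacking the training inputs as rows of a matrix $X$ and the targets in a vector $y$, all $n+1$ partial derivatives can be computed at once:

\begin{equation*}
\nabla J(\Theta)=\frac{1}{m}X^T\left(X\Theta-y\right)
\end{equation*}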

In [None]:
α = 10.0^-4       # learning rate
Θ = [0.0, 0.0]    # initial guess for [θ_0, θ_1]

# Gradient of J in matrix form
G(Θ) = 1/m .* _X' * (_X * Θ - y)

for i=1:10^6
    Θ = Θ - α * G(Θ)
end

Θ
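
The loop above runs for a fixed number of iterations. A common variation is to stop once the gradient becomes small. The following is a minimal, self-contained sketch of that idea, not the notebook's own method: the synthetic data, the function name `gradient_descent`, and the parameters `tol` and `maxiter` are all assumptions chosen for illustration.

```julia
using LinearAlgebra

# Synthetic data (an assumption for illustration): y = 2 + 3x exactly.
xs = collect(0.0:0.5:4.5)
ys = 2.0 .+ 3.0 .* xs
A = [ones(length(xs)) xs]          # design matrix with a leading column of ones

# Gradient of the squared-error cost, in matrix form.
grad(Θ) = A' * (A * Θ .- ys) ./ length(ys)

# Gradient descent that stops once the gradient norm falls below `tol`.
function gradient_descent(grad, Θ0; α=0.05, tol=1e-8, maxiter=1_000_000)
    Θ = copy(Θ0)
    for iter in 1:maxiter
        g = grad(Θ)
        norm(g) < tol && return Θ, iter
        Θ -= α .* g                # simultaneous update of every θ_j
    end
    return Θ, maxiter
end

Θ_fit, iters = gradient_descent(grad, [0.0, 0.0])
println("Θ = $Θ_fit after $iters iterations")
```

On this noise-free data the result should approach the true parameters `[2.0, 3.0]`. The learning rate `α` still has to be small enough for the iteration to converge.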

# Improving the algorithm

In [None]:
using Optim

init_Θ = [0.0, 0.0]

# Optim expects an in-place gradient that writes into its first argument
G!(res, Θ) = res[:] = G(Θ)

optimize(J, G!, init_Θ, GradientDescent())

# Classification problems

## Why is linear regression not suitable for classification problems?

In [None]:
X = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 0, 1, 1, 1, 1]
plot(X, y, "rx")
plot(X, 1/6 .* (X .- 1.5))

In [None]:
X = [1, 2, 3, 4, 5, 6, 7, 8, 14]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]
plot(X, y, "rx")
plot(X, 1/10 .* (X .- 2))
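
What the two plots show can also be checked numerically. The sketch below assumes a decision rule that the notebook does not state explicitly: predict class 1 whenever the line's output is at least 0.5. The helper names `classify` and `misclassified` are hypothetical.

```julia
# Data from the two cells above; the second set adds one far-away positive example.
X1, y1 = [1, 2, 3, 4, 5, 6, 7, 8],     [0, 0, 0, 0, 1, 1, 1, 1]
X2, y2 = [1, 2, 3, 4, 5, 6, 7, 8, 14], [0, 0, 0, 0, 1, 1, 1, 1, 1]

# The two straight lines plotted above (same slopes and offsets).
line1(x) = (x - 1.5) / 6
line2(x) = (x - 2) / 10

# Assumed decision rule: predict class 1 when the line's output is at least 0.5.
classify(line, X) = Int.(line.(X) .>= 0.5)

# Inputs whose predicted class differs from the true label.
misclassified(line, X, y) = X[classify(line, X) .!= y]

println(misclassified(line1, X1, y1))
println(misclassified(line2, X2, y2))
```

With the first line every example is classified correctly. After the single extra example at `x = 14` flattens the line, the 0.5 crossing moves from `x = 4.5` to `x = 7`, so the positive examples at 5 and 6 end up on the wrong side. A straight line is therefore sensitive to examples that should not affect the decision boundary at all.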

## Output of logistic regression
\begin{equation*}
h_{\theta}(x) = s(x \cdot \Theta)
\end{equation*}

\begin{equation*}
s(z)=\frac{1}{1+e^{-z}}
\end{equation*}

In [None]:
sig(z) = 1.0 ./ (1.0 .+ exp.(-z))

sig([-10, -1, 0, 1, 10])

In [None]:
z = range(-5,stop=5,length=100)
plot(z, sig(z))

# Installing required dependencies
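
A possible way to install the packages, sketched from the `using` statements in this notebook. `PyPlot` is an assumption: the plotting calls (`figure`, `imshow`, `withfig`) look like PyPlot.jl functions, but the package that actually provides them is loaded elsewhere (possibly in `helper.jl`).

```julia
using Pkg

# Packages loaded with `using` in the cells below, plus PyPlot (assumed, see above).
Pkg.add(["MLDatasets", "Optim", "Interact", "PyPlot"])
```

This only needs to be run once per Julia environment.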

# Importing the data

In [None]:
using MLDatasets

X, y = MNIST.traindata();

# Inspecting the data

In [None]:
size(X), size(y)

In [None]:
(minimum(X), maximum(X))

## Displaying individual images

In [None]:
using Interact

w, h, m = size(X)

f_img = figure(figsize=(5,5))
@manipulate throttle = 0.5 for i = 1:m
    withfig(f_img) do;
        imshow(X[:,:,i]', cmap="gray")
    end
end

## Training the model

Training uses only the first 300 of the 60,000 examples in the MNIST training set. To train on more examples, increase the `_m` variable; note that the training time will increase as well.
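
The cell below trains ten one-vs-all classifiers, one per digit, each by minimizing a regularized logistic regression cost. For reference, the standard form of that cost and of its gradient is given here, where $\lambda$ is the regularization strength and the intercept $\theta_0$ is not regularized:

\begin{equation*}
J(\Theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_{\theta}(x^{(i)})+\left(1-y^{(i)}\right)\log\left(1-h_{\theta}(x^{(i)})\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2
\end{equation*}

\begin{equation*}
\frac{\partial}{\partial \theta_j}J(\Theta)=\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}\theta_j \quad (j \geq 1)
\end{equation*}

For $j = 0$ the regularization term $\frac{\lambda}{m}\theta_j$ is omitted.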

In [None]:
using Optim

_n = w * h
_m = 300
_X = X[:, :, 1:_m]
_X = reshape(_X, (_n, _m))'
_X = [ones(_m, 1) _X]
init_Θ = zeros(_n+1)
λ = 0.1
all_Θ = zeros(10, _n+1)

for i = 1:10
    println("iteration $i/10")
    
    @time begin
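        # One-vs-all labels: class i stands for digit i % 10 (i = 10 is digit 0)
        _y = y[1:_m] .== i % 10

        # Regularized logistic cost; the data term is averaged over _m so that it matches G
        local J(Θ) = 1/_m * (-_y' * log.(sig(_X * Θ)) .- (1.0 .- _y') * log.(1 .- sig(_X * Θ)))[1] + λ/(2 * _m) * sum(Θ[2:end] .^ 2)
        # Gradient of J; the intercept Θ[1] is not regularized
        local G(Θ) = 1/_m .* _X' * (sig(_X * Θ) .- _y) + pushfirst!(λ/_m * Θ[2:end], 0.0)
        local G!(res, Θ) = res[:] = G(Θ)

        res = optimize(J, G!, init_Θ, GradientDescent())
        all_Θ[i,:] = Optim.minimizer(res)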
    end
end

# Predictions based on trained model

Move the slider to check how the trained model predicts the digit in the image. Bar $k$ shows the output of the classifier for digit $k$; bar 10 corresponds to digit 0.

In [None]:
f = figure(figsize=(10,10))
@manipulate throttle = 0.5 for i = 1:m
    withfig(f) do;
        subplot(211)
        bar(1:10,reshape(sig([1 X[:,:,i]...] * all_Θ'), 10))
        subplot(212)
        axis("off")
        imshow(X[:,:,i]', cmap="gray")
    end
end
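
The bar chart shows all ten classifier outputs. To turn them into a single predicted digit, one option is to pick the classifier with the highest output. The sketch below illustrates this on a tiny made-up example; `predict_digit` and `Θ_demo` are hypothetical names, and the 2x2 "images" stand in for real MNIST data.

```julia
# Sigmoid, as defined earlier in the notebook.
sig(z) = 1.0 ./ (1.0 .+ exp.(-z))

# Row k of Θ_all holds the classifier for digit k % 10 (row 10 is digit 0),
# matching the `i % 10` labelling used during training.
function predict_digit(Θ_all, image)
    x = [1.0; vec(image)]            # prepend the intercept term
    scores = sig(Θ_all * x)          # one score per classifier
    return argmax(scores) % 10
end

# Tiny hypothetical example: 2x2 "images" and hand-picked parameters.
Θ_demo = zeros(10, 5)
Θ_demo[10, 2] = 5.0                  # classifier for digit 0 reacts to pixel 1
Θ_demo[3, 5] = 5.0                   # classifier for digit 3 reacts to pixel 4

println(predict_digit(Θ_demo, [1.0 0.0; 0.0 0.0]))
println(predict_digit(Θ_demo, [0.0 0.0; 0.0 1.0]))
```

With the trained parameters, the same function could be called as `predict_digit(all_Θ, X[:,:,i])`, since `vec` flattens the image in the same column-major order used when building `_X`.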