# Machine Learning
_(by Stanford University)_

**WEEK 3**

Introductory machine learning course offered by Stanford University on [Coursera.org](https://www.coursera.org/learn/machine-learning). These are **notes** and **comments** based on lectures and assignments.

The IPython kernel of choice is *Octave* since many exercises and assignments have been devised for that language.

## Classification

- typical of 0/1 decisions: spam/not spam, fraudulent/non-fraudulent, etc.;
- in this case there are typically 2 **classes**: true and false (i.e. 1 and 0);
- **multiclass classification** deals with more than two classes.

## Algorithms

- **linear regression** can be turned into a classifier by thresholding its output, e.g.: $h_{\theta}(x) \ge 0.5 \Rightarrow $ predict $1$;
- this is fragile: a single sample far from the others changes the slope of the fitted line and therefore moves the point where it crosses $0.5$, even though the classes themselves have not changed;
- even though the labels are $y^{(i)} \in \left\{ 0, 1 \right\}$, the $h_{\theta}(x)$ of linear regression can be much larger than $1$ or smaller than $0$;
- **logistic regression** always guarantees $0 \le h_{\theta}(x) \le 1$.
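
The effect of a distant sample on a thresholded linear fit can be checked numerically. The following is a minimal sketch on made-up 1-D data (the values and the helper names `fit_line` and `threshold` are assumptions for illustration, not part of the course material):

```octave
% toy 1-D data: label 0 for x <= 4, label 1 for x >= 5 (illustrative values)
x = (1:8)';
y = [0 0 0 0 1 1 1 1]';

% least-squares fit of h(x) = theta(1) + theta(2) * x,
% and the value of x where the fitted line crosses 0.5
fit_line  = @(x, y) [ones(size(x)), x] \ y;
threshold = @(theta) (0.5 - theta(1)) / theta(2);

theta_clean = fit_line(x, y);
thr_clean   = threshold(theta_clean)   % lies between the two classes

% add one extra positive sample far to the right
x_out = [x; 30];
y_out = [y; 1];
theta_out = fit_line(x_out, y_out);
thr_out   = threshold(theta_out)       % moves to the right of x = 5

h_far = [1, 30] * theta_out            % prediction at x = 30, larger than 1
```

With the extra sample the threshold moves past $x = 5$, so a point that is labelled $1$ would now be predicted as $0$, and the prediction at $x = 30$ falls outside $[0, 1]$.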

## Logistic Regression

- let $g(z) = \frac{1}{1 + e^{-z}}$, then $h_{\theta}(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$; $g$ is called the **sigmoid** or **logistic function**;
- the interpretation is: $h_{\theta}(x)$ is the estimated probability that $y = 1$ given the input $x$ (that is, if $h_{\theta}(x) = 0.7$, then there is a $70\%$ probability that $y = 1$). This is usually written $h_{\theta}(x) = \mathrm{P}(y = 1 \vert x; \theta)$;
- clearly $\mathrm{P}(y = 0 \vert x; \theta) + \mathrm{P}(y = 1 \vert x; \theta) = 1$, so $\mathrm{P}(y = 0 \vert x; \theta) = 1 - h_{\theta}(x)$.
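
As a small sketch, the hypothesis can be written in Octave as follows. The parameter vector and the input are arbitrary values chosen for illustration, and the first component of `x` is the usual bias term $x_0 = 1$:

```octave
% element-wise sigmoid (logistic) function
g = @(z) 1 ./ (1 + exp(-z));

theta = [-1; 2];       % hypothetical parameters
x     = [1; 0.5];      % x(1) is the bias term x_0 = 1

h    = g(theta' * x)   % estimated P(y = 1 | x; theta); here theta' * x = 0
p_y0 = 1 - h           % estimated P(y = 0 | x; theta)
```

Because `g` uses element-wise operators, it can also be applied to a whole vector or matrix of values at once.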

## Decision Boundaries

- we still have to predict $y_{pred} \in \left\{ 0, 1 \right\}$, thus we need to use a "trigger" to discretize the output (e.g.: $y_{pred} = 1$ if $h_{\theta}(x) \ge 0.5$);
- N.B.: $g(z) \ge 0.5$ if $z \ge 0$, thus $h_{\theta}(x) \ge 0.5$ if $\theta^T x \ge 0$;
- the boundary separating the region where $y_{pred} = 0$ from the region where $y_{pred} = 1$ is called **decision boundary**;
- the decision boundary is a property of the **hypothesis function**, not the training set.
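
A minimal sketch of the thresholding step, assuming the hypothetical parameters $\theta = (-3, 1, 1)^T$, for which the decision boundary is the line $x_1 + x_2 = 3$ (the sample points are made up):

```octave
g = @(z) 1 ./ (1 + exp(-z));   % sigmoid

theta = [-3; 1; 1];            % hypothetical parameters: boundary x_1 + x_2 = 3

% each row is one sample, the first column is the bias term x_0 = 1
X = [1 1 1;
     1 2 2;
     1 3 0;
     1 0 0];

z      = X * theta;            % theta' * x for every sample
pred_z = double(z >= 0)        % threshold on theta' * x
pred_h = double(g(z) >= 0.5)   % threshold on h_theta(x), same result
```

Both thresholds give the same labels, and the boundary depends only on `theta`, not on the samples in `X`.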

## Cost Function (Optimization Objective)

- the simple _mean squared error_ applied to the logistic hypothesis is a non-convex function of $\theta$, so it can have many local minima and gradient descent is not guaranteed to reach the global one;
- in the case of logistic regression we use $\mathrm{Cost}(h_{\theta}(x), y) = \begin{cases} -\ln(h_{\theta}(x)) & \text{if } y = 1 \\ -\ln(1 - h_{\theta}(x)) & \text{if } y = 0 \end{cases}$, which makes sense since the cost is $0$ when the prediction is exactly right ($h_{\theta}(x) = y$);
- with this definition a confident wrong prediction is penalized heavily: the cost goes to $\infty$ as $h_{\theta}(x) \to 0$ when $y = 1$ (or as $h_{\theta}(x) \to 1$ when $y = 0$);
- notice that the two cases can be merged into $\mathrm{Cost}(h_{\theta}(x), y) = - y \ln(h_{\theta}(x)) - (1 - y) \ln(1 - h_{\theta}(x))$; averaging over the training set gives $J(\theta) = - \frac{1}{m} \sum\limits_{i = 1}^m \left( y^{(i)} \ln(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) \ln(1 - h_{\theta}(x^{(i)})) \right)$, which can be derived from maximum likelihood principles;
- we then use **gradient descent** to compute the minimum of the cost function: the update is $\theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i = 1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}$ (at the end of the day the form of the derivative is the same as in linear regression, but $h_{\theta}(x)$ is different);
- feature scaling can also help in this case.
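
The cost function and the gradient descent update can be sketched in vectorized form as below. The data set, the learning rate `alpha` and the number of iterations are arbitrary choices for illustration, not values from the course:

```octave
g = @(z) 1 ./ (1 + exp(-z));        % sigmoid

% made-up training set: first column is the bias term, labels are not
% perfectly separable so the minimum is at a finite theta
X = [1 1; 1 2; 1 3; 1 4; 1 5; 1 6];
y = [0; 0; 1; 0; 1; 1];
m = length(y);

% cost J(theta) and its gradient, both vectorized over the samples
J    = @(theta) -mean(y .* log(g(X * theta)) + (1 - y) .* log(1 - g(X * theta)));
grad = @(theta) X' * (g(X * theta) - y) / m;

% plain batch gradient descent
alpha = 0.1;                        % learning rate (assumed value)
theta = zeros(2, 1);
J_start = J(theta);                 % equals ln(2) for theta = 0
for iter = 1:5000
    theta = theta - alpha * grad(theta);
end
J_end = J(theta)                    % lower than J_start
```

The gradient expression `X' * (h - y) / m` is the same as for linear regression; only the definition of `h` changes.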

## Optimization

- there are more sophisticated optimization algorithms (e.g. **BFGS**, **L-BFGS**, etc.) which are usually faster than gradient descent and do not require choosing the learning rate manually, but they are considerably more complex to implement;
- Octave ships with good implementations of such minimization routines, e.g. `fminunc`:

```octave
% cost function j(theta) = (theta_1 - 5)^2 + (theta_2 - 5)^2 and its gradient
function [j, grad] = CostFunction(theta)
    j       = (theta(1) - 5)^2 + (theta(2) - 5)^2;
    grad    = zeros(2,1);
    grad(1) = 2 * (theta(1) - 5);
    grad(2) = 2 * (theta(2) - 5);
end

% call the minimization function
% 'GradObj' 'on' tells fminunc that CostFunction also returns the gradient;
% 'MaxIter' must be a number, not a string
options  = optimset('GradObj', 'on', 'MaxIter', 100);
in_theta = zeros(2,1);
[opt_theta, f_val, exit_flag] = fminunc(@CostFunction, in_theta, options) % the @ sign creates a handle to the function
```

```
opt_theta =

   5.0000
   5.0000

f_val =    7.8886e-31
exit_flag =  1
```
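
The same call pattern can be applied to the logistic regression cost. This is a sketch under stated assumptions: the training set is made up and `logistic_cost` is a hypothetical helper name; the data are passed to it through an anonymous function, so that `fminunc` only sees a function of `theta`:

```octave
% made-up training set: first column is the bias term
X = [1 1; 1 2; 1 3; 1 4; 1 5; 1 6];
y = [0; 0; 1; 0; 1; 1];

% logistic regression cost and gradient (hypothetical helper)
function [j, grad] = logistic_cost(theta, X, y)
    m    = length(y);
    h    = 1 ./ (1 + exp(-X * theta));                    % sigmoid hypothesis
    j    = -mean(y .* log(h) + (1 - y) .* log(1 - h));    % J(theta)
    grad = X' * (h - y) / m;                              % gradient of J
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
cost_fn = @(t) logistic_cost(t, X, y);                    % fixes X and y
[opt_theta, f_val, exit_flag] = fminunc(cost_fn, zeros(2, 1), options)
```

No learning rate has to be chosen here: the step size is handled internally by `fminunc`.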


## Multiclass Classification

- we have more than two classes (not just $0$ and $1$);
- **one-vs-all** (or **one-vs-rest**) works by turning the $n$ class problem into $n$ binary classification problems: we train $n$ different classifiers, each capable of distinguishing one class from all the others, and then we predict the class whose classifier returns the highest probability.
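
The prediction step of one-vs-all can be sketched as follows. The matrix `all_theta` holds one row of parameters per class; its values are invented here for illustration, whereas in practice each row would come from training a binary classifier on the labels `y == k`:

```octave
g = @(z) 1 ./ (1 + exp(-z));   % sigmoid

% hypothetical parameters of 3 already trained classifiers, one per row
all_theta = [0  1  0;          % class 1: fires when x_1 is large
             0  0  1;          % class 2: fires when x_2 is large
             0 -1 -1];         % class 3: fires when x_1 + x_2 is very negative

% samples to classify, first column is the bias term x_0 = 1
X = [1  3  0;
     1  0  3;
     1 -2 -2;
     1  2  1];

probs = g(X * all_theta');     % probs(i, k) = h of classifier k on sample i
[~, pred] = max(probs, [], 2); % index of the most confident classifier per row
pred
```

Each row of `probs` contains $n$ independent probabilities, so they do not have to sum to $1$; only their ordering matters for the prediction.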