# Machine Learning
_(by Stanford University)_

**WEEK 2**

Introductory machine learning course offered by Stanford University on [Coursera.org](https://www.coursera.org/learn/machine-learning). These are **notes** and **comments** based on the lectures and assignments.

The IPython kernel of choice is *Octave* since many exercises and assignments have been devised for that language.

## Multiple Features

### Notations

- $X$ is the full feature matrix, with one training example per row ($m$ examples, $n$ features);
- $x^{(i)}$ is the feature vector of the i-th training example;
- $x^{(i)}_j$ is the j-th component of the i-th feature vector;
- $\theta_j$ is the j-th component of the parameter vector;
- by definition $x^{(i)}_0 = 1$ $\forall i = 1, 2, \dots, m$, so that the hypothesis for a single feature vector $x$ can be written as $h_{\theta}(x) = \theta^T x$ (that is, we add one constant component to pair with $\theta_0$, i.e. the intercept of the linear regression);
- the **cost function** now becomes $J(\theta) = \frac{1}{2 m} \sum\limits_{i = 1}^m ( h_{\theta}( x^{(i)} ) - y^{(i)} )^2$;
- the **gradient descent** update now reads $\theta_j' = \theta_j - \frac{\alpha}{m} \sum\limits_{i = 1}^m ( h_{\theta}( x^{(i)} ) - y^{(i)} ) x^{(i)}_j$, where all the $\theta_j$ ($j = 0, 1, \dots, n$) are updated simultaneously.
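
The formulas above can be written in vectorized form. The following is a minimal sketch of a single gradient descent step, assuming a made-up data set; the names `X`, `y`, `theta`, `alpha` and the numbers are illustrative and do not come from the course material.

```octave
% Toy data (made up for illustration): m = 3 examples, n = 2 features.
% The first column of X is the constant component x_0 = 1.
X = [ 1, 2, 3; 1, 4, 5; 1, 6, 8 ];
y = [ 1; 2; 3 ];
m = length( y );

theta = zeros( 3, 1 );  % initial parameters (assumed starting point)
alpha = 0.01;           % learning rate (assumed value)

h = X * theta;                          % h_theta( x^(i) ) for all i at once
J = sum( ( h - y ) .^ 2 ) / ( 2 * m );  % cost function J(theta)
grad = X' * ( h - y ) / m;              % partial derivatives of J w.r.t. theta_j
theta_new = theta - alpha * grad;       % simultaneous update of all theta_j
```

Computing `grad` for the whole vector before touching `theta` is what makes the update simultaneous.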

### Convergence and Scaling

- **gradient descent** converges faster when the features are on a similar **scale** (graphically: the level curves of the cost function are closer to circles than to elongated ellipses, so the descent follows a more direct path to the minimum);
- we roughly want all features in $-1 \le x_j \le 1$. We achieve this by rescaling $x_j \mapsto \frac{x_j - \mu_j}{s_j}$, where $\mu_j$ is the mean of the feature $x_j$ and $s_j$ is its range of variation (maximum minus minimum);
- the **cost function** should decrease after every iteration (we can stop at the plateau, e.g. when the decrease between two consecutive iterations is smaller than a small threshold $\epsilon$);
- an increasing or oscillating $J(\theta)$ is usually a hint that the **learning rate** $\alpha$ should be smaller, since for an appropriate $\alpha$, $J(\theta)$ should always decrease (notice that if $\alpha$ is too small, then convergence will require a lot of time).
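
As an illustration of the points above, here is a minimal sketch of mean normalization followed by gradient descent with a plateau-based stopping rule. The data, the learning rate `alpha`, the threshold `epsilon` and the cap `max_iter` are all assumptions chosen for the example.

```octave
% Hypothetical data: two features on very different scales.
X_raw = [ 2104, 3; 1600, 3; 2400, 4; 1416, 2; 3000, 4 ];
y = [ 400; 330; 369; 232; 540 ];
m = length( y );

% Mean normalization: x_j -> ( x_j - mu_j ) / s_j
mu = mean( X_raw );                 % mean of each feature (row vector)
s = max( X_raw ) - min( X_raw );    % range of each feature (row vector)
X_norm = ( X_raw - mu ) ./ s;       % relies on automatic broadcasting
X = [ ones( m, 1 ), X_norm ];       % add the constant component x_0 = 1

alpha = 0.3;        % learning rate (assumed value)
epsilon = 1e-9;     % stop when J decreases by less than this (assumed value)
max_iter = 10000;   % safety cap on the number of iterations (assumed value)

theta = zeros( size( X, 2 ), 1 );
J_history = [];
for iter = 1:max_iter
  err = X * theta - y;
  J_history(end + 1) = sum( err .^ 2 ) / ( 2 * m );
  % stop at the plateau of the cost function
  if iter > 1 && abs( J_history(end - 1) - J_history(end) ) < epsilon
    break;
  end
  theta = theta - ( alpha / m ) * ( X' * err );
end
```

Plotting `J_history` against the iteration number is a simple way to check that the cost decreases at every step. To predict on new data, the same `mu` and `s` have to be applied to the new features before multiplying by `theta`.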

### Features Selection and Polynomial Regression

- given my features, I can create new features as combinations of the given input (e.g. products or powers of the original features);
- with the same idea in mind, I can use polynomial models to fit my data (notice that **feature scaling** becomes increasingly important in this case: if $x$ ranges up to $10^3$, then $x^3$ ranges up to $10^9$).
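
A minimal sketch of polynomial regression along these lines, assuming a single made-up feature `x` and made-up targets `y`; the choice of a cubic model is arbitrary and only meant to show how quickly the ranges of the new features diverge.

```octave
% Hypothetical single feature and made-up targets.
x = [ 1; 2; 3; 4; 5 ];
y = [ 1.1; 7.9; 27.2; 63.8; 125.1 ];
m = length( x );

% New features built from the input: x, x^2, x^3
P = [ x, x .^ 2, x .^ 3 ];

% The ranges of the new features differ a lot, so we rescale them.
mu = mean( P );
s = max( P ) - min( P );          % here: [ 4, 24, 124 ]
P_norm = ( P - mu ) ./ s;

% Design matrix with the constant component x_0 = 1.
X = [ ones( m, 1 ), P_norm ];

% Fit the parameters of the cubic model (least squares solution).
theta = pinv( X' * X ) * X' * y;
```

The model is still linear in the parameters `theta`, so the same cost function and the same fitting procedures apply unchanged.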

### Normal Equation

- alternative to **gradient descent** (does not need iterations);
- it finds the minimum of the cost function for linear regression analytically: $\theta = ( X^T X )^{-1} X^T y$, where $X$ is the **design matrix**, whose rows are the feature vectors.
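
A short derivation sketch, assuming $X^T X$ is invertible, shows where the formula comes from. In matrix form the cost function and its gradient read

$$
J(\theta) = \frac{1}{2 m} ( X \theta - y )^T ( X \theta - y ), \qquad \nabla_{\theta} J(\theta) = \frac{1}{m} X^T ( X \theta - y ).
$$

Setting the gradient to zero gives

$$
X^T X \theta = X^T y \quad \Longrightarrow \quad \theta = ( X^T X )^{-1} X^T y.
$$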

In [3]:
% given a matrix X and a vector y, we can compute the normal equation
X = [ 1, 145, 165; 1, 323, 123; 1, 345, 234 ]
y = [ 3; 5; 6 ]

th = pinv( X' * X ) * X' * y

X =

     1   145   165
     1   323   123
     1   345   234

y =

   3
   5
   6

th =

   0.0800696
   0.0127647
   0.0064791



- it is usually a good idea to fall back to **gradient descent** when the number of features $n$ is large (roughly $n \ge 10000$), since $X^T X \in \mathbb{M}^{n \times n}( \mathbb{R} )$ is then a large matrix to invert (and matrix inversion scales as $O(n^3)$).
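
As a side note, the normal equation is a linear system in $\theta$, so it can also be solved without computing an inverse explicitly. The sketch below reuses the matrix and vector from the cell above and compares `pinv` with the backslash operator; the variable names `th_pinv` and `th_solve` are made up for the example.

```octave
X = [ 1, 145, 165; 1, 323, 123; 1, 345, 234 ];
y = [ 3; 5; 6 ];

th_pinv = pinv( X' * X ) * X' * y;    % explicit pseudo-inverse
th_solve = ( X' * X ) \ ( X' * y );   % solve ( X' * X ) * theta = X' * y directly

disp( norm( th_pinv - th_solve ) );   % difference between the two results
```

Both approaches should agree up to numerical precision on a small, well-behaved problem like this one.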