# Lecture 2 - Introduction to Supervised Learning

We introduced the concept of **Machine Learning** last class:

<div class="alert alert-info" role="alert">
  <strong>Machine Learning</strong>
    
is a machine's general ability to solve *intelligent* tasks by learning from experience/data without being explicitly programmed.
</div>

We have also distinguished Machine Learning from the concepts of Deep Learning and Artificial Intelligence.

* Deep Learning is a subset of Machine Learning
* Machine Learning is a subset of Artificial Intelligence

The term **Artificial Intelligence (AI)** was coined in the mid-1950s (before the term *Machine Learning*), and can be defined as the ability of machines to carry out tasks that typically require human-level intelligent behavior.

* An example of AI that is not ML is a rule-based system, e.g. the diagnostic table used by doctors to determine Heart Failure with Normal Ejection Fraction. The *intelligent* knowledge in this table was encoded directly by experts as a carefully crafted set of rules.

![HFNEF](https://ars.els-cdn.com/content/image/1-s2.0-S0735109708041272-gr3.jpg)

A great deal of AI is carried out using Machine Learning.

* **Deep Learning** is ML that uses Artificial Neural Network models, a specific type of model architecture.

In this course we will concentrate on AI tasks that require ML models or *training* a model to fit a set of data. We will also study some Deep Learning models.

# 1. Types of Learning

In Machine Learning, there are different types of learning:

* **Supervised Learning:** the model will *learn* from labeled data
* **Unsupervised Learning:** the model will *learn* from unlabeled data
* **Semi-Supervised Learning:** the model will *learn* from training data in which some samples are labeled and some are not, and both are used during training
* **Reinforcement Learning:** the model will *learn* which action to take based on reinforcement from the environment, so as to maximize a reward or minimize a penalty
* **Multiple Instance Learning:** the model will *learn* based on multiple-instance labels that have a particular form of imprecision
* **Active Learning:** the model will *learn* by obtaining labels online from a user/oracle in an intelligent fashion
* **Transfer Learning:** the model will *learn* by transferring learnt knowledge from a similar task.
* and many more...

In this course, we will study Supervised Learning, Unsupervised Learning and we will briefly discuss Transfer Learning.

## 1.1. Supervised Learning

<div class="alert alert-info" role="alert">
  <strong>Supervised Learning</strong>
    
Learning a mapping from input data to desired output values given labeled training data.
</div>

* In Supervised Learning you can have 2 different types of tasks: **classification** and **regression**.

### Classification

**Classification** is a predictive modeling approach that characterizes the relationship between a collection of observational input data and a set of categorical labels.

Suppose we have training images from two classes, $C_1=\text{conure}$ and $C_2=\text{macaw}$, and we would like to train a classifier that assigns each incoming test image to class $C_1$ or class $C_2$.

<div><img src="figures/classification.png" width="800"></div>
    
This is a **classification** example. Each data point was classified into a discrete class (either conure or macaw). 

Classifiers can further be sub-categorized as **discriminative** or **generative** classifiers.

* A **discriminative** approach for classification is one in which we partition the feature space into regions, one for each class. Then, when we have a test point, we determine which region it falls in and classify it accordingly.

* A **generative** approach for classification is one in which we estimate the parameters for distributions that generate the data for each class using Bayesian principles. When we have a test point, we can compute the posterior probability of that point belonging to each class and assign the point to the class with the highest posterior probability.
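The generative recipe above can be sketched in a few lines. This is only an illustration under assumptions that are not part of the lecture: a single scalar feature (a hypothetical wing length in cm, with made-up values), Gaussian class-conditional distributions, and priors estimated from class frequencies. All function names are illustrative.

```python
import numpy as np

def fit_gaussian_class(x):
    """Estimate the mean and variance of a 1-D Gaussian from samples."""
    return x.mean(), x.var()

def gaussian_pdf(x, mean, var):
    """Evaluate the 1-D Gaussian density at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Made-up training features (hypothetical wing lengths in cm) for each class.
x_conure = np.array([13.0, 14.5, 15.0, 13.5, 14.0])
x_macaw = np.array([36.0, 38.5, 40.0, 37.0, 39.5])

# Estimate the parameters of each class distribution and the class priors.
params = {"conure": fit_gaussian_class(x_conure), "macaw": fit_gaussian_class(x_macaw)}
n_total = x_conure.size + x_macaw.size
priors = {"conure": x_conure.size / n_total, "macaw": x_macaw.size / n_total}

def posterior(x_test):
    """Return P(class | x_test) for each class using Bayes' rule."""
    joint = {c: gaussian_pdf(x_test, *params[c]) * priors[c] for c in params}
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}

def classify(x_test):
    """Assign the class with the highest posterior probability."""
    post = posterior(x_test)
    return max(post, key=post.get)

print(classify(15.5), classify(35.0))
```

A discriminative classifier would skip the density estimation step and learn the boundary between the two regions directly.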

Applications of classification include: topic labeling, spam email filtering, medical diagnosis, recommender systems, etc.

### Regression

**Regression** is a *predictive modeling* approach that characterizes the relationship between a collection of observational input data and a continuous desired response.

* A linear regression model is a linear *weighted* combination of input values.

Consider the example below:

* The goal is to *train* a model that takes in the silhouette images with their corresponding labels (age of the person in the silhouette) and learns a linear weighted relationship between images and age.

<div><img src="figures/regression.png" width="400"></div>

* After the model is trained, the **goal** is to be able to *predict* the desired output value of any *new* unlabeled test data.
    
Applications of regression include: electric/solar power forecast, stock market, inventory investment, etc.

### Typical Flowchart for Supervised Learning

The usual (but not universal) flow for supervised learning is:
* **Training stage**
    1. Collect labeled training data - often the most time-consuming and expensive task. This constitutes the **input space**.
    2. Extract features - extract *useful* features from the input (or observational) data, i.e. features that carry discriminatory information for mapping to the desired output. This constitutes the **feature space**.
    3. Select a model - relationship between input data and desired output.
    4. Fit the model - change model parameters (*Learning Algorithm*) in order to meet some *Objective Function*.
<div><img src="figures/training.png" width="800"></div>

* **Testing stage**
    1. Given unlabeled test data
    2. Extract (the same) features
    3. Run the unlabeled data through the trained model
<div><img src="figures/testing.png" width="800"></div>
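The two stages can be sketched as three small functions. This is a toy illustration, not a prescribed implementation: the feature (the mean of each raw sample), the model (a single threshold halfway between the class means), and all names and numbers are assumptions chosen to keep the example short.

```python
import numpy as np

def extract_features(raw):
    """Map each raw sample (a row) to one feature: its mean value."""
    return raw.mean(axis=1)

def fit(features, labels):
    """Learn a threshold halfway between the two class means."""
    mean0 = features[labels == 0].mean()
    mean1 = features[labels == 1].mean()
    return {"threshold": 0.5 * (mean0 + mean1), "high_class": int(mean1 > mean0)}

def predict(model, features):
    """Assign each sample to the class on its side of the threshold."""
    above = features > model["threshold"]
    return np.where(above, model["high_class"], 1 - model["high_class"])

# Training stage: labeled raw data -> features -> fitted model.
raw_train = np.array([[0.1, 0.2, 0.0], [0.3, 0.1, 0.2], [0.9, 1.0, 0.8], [0.7, 0.9, 1.1]])
labels_train = np.array([0, 0, 1, 1])
model = fit(extract_features(raw_train), labels_train)

# Testing stage: unlabeled raw data -> the same features -> predictions.
raw_test = np.array([[0.2, 0.1, 0.1], [1.0, 0.8, 0.9]])
predictions = predict(model, extract_features(raw_test))
print(predictions)
```

Note that `extract_features` is called in both stages, which is what "extract (the same) features" means in practice.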

### "Fitting the Model"

Let's open the *virtual whiteboard* to describe the steps of "fitting a model".

(See whiteboard notes) The system has a **feedback** loop which will make this approach automatic without user intervention.

* But we have full control over each stage and how it is set up.

### Challenges

Some of the challenges of supervised learning include:

* How do you know if you have *representative* training data?
* How do you know if you extracted *good* features?
* How do you know if you selected the *right* model?
* How do you know if you trained the model *well*?

Many of these challenges are alleviated (not solved entirely, but helped significantly) with *LOTS AND LOTS* of **data** and good **experimental design**.

<div><img src="figures/PHDcomics.png" width="600"></div>
    
* Sometimes, obtaining labeled training data is hard, expensive, time consuming and, in some cases, infeasible.

**<h1 align="center">Any Questions?</h1>**

## 1.2. Unsupervised Learning

<div class="alert alert-info" role="alert">
  <strong>Unsupervised Learning</strong>
    
Learning structure from data without any labels.
</div>

* In Unsupervised Learning we typically perform **Clustering**.

### Clustering

**Clustering** algorithms seek to learn, from the properties of the data, an optimal division or discrete labeling of groups (also called *clusters*) of points.

Suppose you collect pictures of the following objects:

<div><img src="figures/clustering2.png" width="500"></div>
    
* How many groups would you partition this data into?

In clustering, we typically *search* for the number of clusters that optimizes some relationship between members within a group and members between different groups.

Applications of clustering include: consumer market research, social networks, product display in grocery stores, etc.
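One common (but by no means the only) way to quantify the relationship between members within a group and members of different groups is to compare the within-cluster and between-cluster sums of squared distances. The sketch below scores two candidate partitions of made-up 2-D points; the data, labels, and function name are assumptions for illustration.

```python
import numpy as np

def scatter_decomposition(points, labels):
    """Return (within, between) sums of squared distances for a partition."""
    overall_mean = points.mean(axis=0)
    within, between = 0.0, 0.0
    for k in np.unique(labels):
        members = points[labels == k]
        center = members.mean(axis=0)
        # Spread of the members around their own cluster center.
        within += np.sum((members - center) ** 2)
        # Spread of the cluster center around the overall mean, weighted by size.
        between += members.shape[0] * np.sum((center - overall_mean) ** 2)
    return within, between

# Made-up 2-D points forming two visually separated groups.
points = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
                   [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])

good = np.array([0, 0, 0, 1, 1, 1])   # partition that follows the two groups
poor = np.array([0, 1, 0, 1, 0, 1])   # partition that mixes them

for name, labels in [("good", good), ("poor", poor)]:
    within, between = scatter_decomposition(points, labels)
    print(f"{name}: within={within:.3f}, between={between:.3f}")
```

A partition with small within-cluster scatter and large between-cluster scatter groups similar points together; the two quantities always add up to the total scatter of the data.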

### Challenges

Some of the challenges of unsupervised learning include:

* How do you *validate* your clustered results?
* How do you know if you selected the *right* similarity measure?

**<h1 align="center">Any Questions?</h1>**

# 2. Polynomial Regression

<div class="alert alert-info">
    
**Polynomial Regression** is a type of linear regression that uses a special set of *features* - polynomial features.
</div>

<div class="alert alert-success">
    <b>Step 1 - Input Space</b> 

Suppose we are given a training set comprising $N$ observations of $\mathbf{x}$, $\mathbf{x} = \left[x_1, x_2, \ldots, x_N \right]^T$, and its corresponding desired outputs $\mathbf{t} = \left[t_1, t_2, \ldots, t_N\right]^T$, where sample $x_i$ has the desired label $t_i$.

So, we want to learn the *true* function mapping $f$ such that $\mathbf{t}  = f(\mathbf{x}, \mathbf{w})$, where $\mathbf{w}$ are *unknown* parameters of the model.
</div>

* We generally organize data into *vectors* and *matrices*. Not only is it a common way to organize the data, but it allows us to easily apply linear algebraic operations during analysis. It also makes it much simpler when it comes to code implementation!
    * In engineering textbooks and in this course's notation, **vectors** are defined as *column vectors*. This is why we write $\mathbf{x} = \left[ \begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_N\end{array} \right] = \left[x_1, x_2, \ldots, x_N \right]^T$.

* Note that both the training data and desired outputs can be noisy.
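As a small illustration of the column-vector convention and of noisy targets, the sketch below builds $\mathbf{x}$ and $\mathbf{t}$ as NumPy arrays of shape $(N, 1)$. The sine function and the noise level are assumptions made only to have something to generate; the lecture does not specify a data source.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

N = 10
x = np.linspace(0.0, 1.0, N).reshape(-1, 1)   # column vector, shape (N, 1)

# Hypothetical hidden function plus Gaussian noise; both are assumptions.
t = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal((N, 1))

print(x.shape, t.shape)   # shapes of the two column vectors
print(x.T.shape)          # shape of the transposed (row) vector
```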

<div class="alert alert-success">
    <b>Step 2 - Feature Space</b> 

For the polynomial regression problem, let's consider *polynomial features* for each data point $x_i$. Let's say we can find these features using a **basis function**, $\phi(\mathbf{x})$. In the *polynomial regression* example, let's consider $\phi(x_i) = \left[ x_{i}^{0}, x_{i}^{1}, x_{i}^{2}, \dots, x_{i}^{M}\right]^T$, where $x_i^M$ is the $M^{th}$-power of $x_i$.
</div>

* These are specifically called **polynomial features**, but other features can be extracted.

* For all data observations $\{x_i\}_{i=1}^N$ and using the feature space defined as $\phi(x_i) = \left[x_{i}^{0}, x_{i}^{1}, x_{i}^{2}, \dots, x_{i}^{M}\right]^T$, we can write the input data in a *matrix* form as:

$$\mathbf{X} =\left[\begin{array}{c} \phi(x_1)^T \\ \phi(x_2)^T \\ \vdots \\ \phi(x_N)^T \end{array}\right]  = \left[\begin{array}{ccccc}
1 & x_{1} & x_{1}^{2} & \cdots & x_{1}^{M}\\
1 & x_{2} & x_{2}^{2} & \cdots & x_{2}^{M}\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
1 & x_{N} & x_{N}^{2} & \cdots & x_{N}^{M}
\end{array}\right] \in \mathbb{R}^{N\times (M+1)}$$

where each row is a feature representation of a data point $x_i$.
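A minimal sketch of building this matrix with NumPy, assuming the inputs are scalars stored in a 1-D array. The helper name `polynomial_features` is illustrative; `np.vander` with `increasing=True` produces the columns in the order $x^0, x^1, \ldots, x^M$ used above.

```python
import numpy as np

def polynomial_features(x, M):
    """Build the N x (M+1) matrix whose i-th row is [x_i^0, x_i^1, ..., x_i^M]."""
    x = np.asarray(x, dtype=float).ravel()
    return np.vander(x, N=M + 1, increasing=True)

x = np.array([0.0, 0.5, 1.0, 2.0])
X = polynomial_features(x, M=3)
print(X.shape)
print(X)
```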

<div class="alert alert-info">
    <b>Feature Space</b> 

The set of features produced by the transformation 

\begin{align}
\phi: \mathbb{R} & \rightarrow \mathbb{R}^{M+1} \\
x_i & \mapsto \left[x_{i}^{0}, x_{i}^{1}, x_{i}^{2}, \dots, x_{i}^{M}\right]^T
\end{align}
is often called the **feature space**.
When we write a linear regression with respect to a set of basis functions, the regression model is linear in the *feature space*.

$M$ is the polynomial degree, often called the *model order*; the feature space has dimensionality $M+1$.
</div>

* Now, we want to find the mapping from the feature input data $\mathbf{X}$ to the desired output values $\mathbf{t}$.

Suppose the data actually comes from some **unknown hidden function**, that takes in the data points $\mathbf{x}$ with some parameters $\mathbf{w}$ and produces the desired values $\mathbf{t}$, i.e. $\mathbf{t} = f(\mathbf{x},\mathbf{w})$.
* We do not know anything about the function $f$. If we knew the hidden function, we would not need to learn the *mapping* - we would already know it. However, since we do not know the true underlying function, we need to do our best to estimate it from the examples of input-output pairs that we have.

<div class="alert alert-success">
    <b>Step 3 - Model Selection or Mapping</b> 

Let's assume that the desired output values are a *linear combination* of the features, i.e., a **linear regression model** of **polynomial features**

$$t \sim f(x,\mathbf{w}) = w_0x^0 + w_1x^1 + w_2x^2+\cdots+w_Mx^M = \sum_{j=0}^M w_jx^j = \phi(x)^T\mathbf{w}$$

Stacking the predictions for all $N$ training points gives the vector $\mathbf{X}\mathbf{w}$.
</div>

* This means that for every paired training data point $\{x_i, t_i\}_{i=1}^N$, we can model the output value as 

$$t_i \sim f(x_i,\mathbf{w}) = w_0x_i^0 + w_1x_i^1 + w_2x_i^2+\cdots+w_Mx_i^M $$

* Although the polynomial function $f(x,\mathbf{w})$ is a nonlinear function of $x$, it is a linear function of the coefficients $\mathbf{w}$. Functions, such as the polynomial, which are linear in the unknown parameters have important properties and are called *linear models*.
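The distinction between "nonlinear in $x$" and "linear in $\mathbf{w}$" can be checked numerically. The sketch below is only an illustration; the coefficient values are arbitrary and the function name `f` mirrors the notation above.

```python
import numpy as np

def f(x, w):
    """Evaluate the polynomial model sum_j w_j * x**j at each entry of x."""
    x = np.asarray(x, dtype=float).ravel()
    X = np.vander(x, N=len(w), increasing=True)
    return X @ w

x = np.array([-1.0, 0.0, 2.0])
w_a = np.array([1.0, -2.0, 0.5])   # arbitrary illustrative coefficients
w_b = np.array([0.0, 3.0, 1.0])

# Linear in w: f(x, 2*w_a + w_b) equals 2*f(x, w_a) + f(x, w_b).
lhs = f(x, 2.0 * w_a + w_b)
rhs = 2.0 * f(x, w_a) + f(x, w_b)
print(np.allclose(lhs, rhs))

# Not linear in x: f(2*x, w_a) generally differs from 2*f(x, w_a).
print(np.allclose(f(2.0 * x, w_a), 2.0 * f(x, w_a)))
```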


The values of the coefficients $\mathbf{w}$ will be determined by *fitting* the polynomial to the training data. 

This can be done by minimizing an **error function** (also defined as **cost function**, **objective function**, or **loss function**) that measures the *misfit* between the function $f(x,\mathbf{w})$, for any given value of $\mathbf{w}$, and the training set data points $\{x_i,t_i\}_{i=1}^N$.

* What is the model's *objective* or goal?

<div><img src="figures/LeastSquares.png" width="300"></div>

One simple choice for fitting the model is to consider the error function given by the sum of the squares of the errors between the predictions $f(x_i,\mathbf{w})$ for each data point $x_i$ and the corresponding target values $t_i$, so that we minimize

$$J(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^N \left(f(x_i,\mathbf{w}) - t_i\right)^2 = \frac{1}{2}\left\Vert f(\mathbf{x},\mathbf{w}) - \mathbf{t} \right\Vert^2_2$$

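* This error function penalizes the squared vertical distance (the *residual*) between each target $t_i$ and the curve's prediction $f(x_i,\mathbf{w})$; equivalently, it is half the squared Euclidean norm of the residual vector.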

* **What other objective functions could we use?**

* We can write the error function compactly in matrix/vector form:
\begin{align*}
J(\mathbf{w}) &= \frac{1}{2}\left\Vert f(\mathbf{x},\mathbf{w}) - \mathbf{t} \right\Vert^2_2 \\
&= \frac{1}{2} \left\Vert \mathbf{X}\mathbf{w} - \mathbf{t}\right\Vert^2_2\\
&= \frac{1}{2} \left(\mathbf{X}\mathbf{w} - \mathbf{t}\right)^T \left(\mathbf{X}\mathbf{w} - \mathbf{t}\right)\\
\text{where: } & \mathbf{X} = \left[\begin{array}{ccccc}
1 & x_{1} & x_{1}^{2} & \cdots & x_{1}^{M}\\
1 & x_{2} & x_{2}^{2} & \cdots & x_{2}^{M}\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
1 & x_{N} & x_{N}^{2} & \cdots & x_{N}^{M}
\end{array}\right], \mathbf{w} =  \left[\begin{array}{c}
w_{0}\\
w_{1}\\
\vdots\\
w_{M}
\end{array}\right], \text{and }  \mathbf{t} = \left[\begin{array}{c}
t_{1}\\
t_{2}\\
\vdots\\
t_{N}
\end{array}\right]
\end{align*}
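As a quick check that the sum form and the matrix form agree, the sketch below evaluates $J(\mathbf{w})$ both ways for a model of order $M=1$. The data and the candidate weights are made up, and 1-D NumPy arrays stand in for the column vectors.

```python
import numpy as np

def least_squares_error(X, w, t):
    """J(w) = 0.5 * (Xw - t)^T (Xw - t)."""
    residual = X @ w - t
    return 0.5 * float(residual @ residual)

x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.0, 3.0, 4.0, 8.0])        # made-up targets
X = np.vander(x, N=2, increasing=True)     # M = 1: columns [1, x]
w = np.array([1.0, 2.0])                   # candidate weights, not the optimum

# Sum form: 0.5 * sum_i (f(x_i, w) - t_i)^2
J_sum = 0.5 * sum((w[0] + w[1] * xi - ti) ** 2 for xi, ti in zip(x, t))
# Matrix form: 0.5 * ||Xw - t||^2
J_mat = least_squares_error(X, w, t)
print(J_sum, J_mat)
```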

<div class="alert alert-success">
    <b>Step 4 - Fitting the Model or Learning Algorithm</b> 

Also referred to as training the model.

We *fit* the polynomial function model such that the *objective function* $J(\mathbf{w})$ is minimized, i.e. we *optimize* the following error function

\begin{align}
J(\mathbf{w}) &= \frac{1}{2} \left(\mathbf{X}\mathbf{w} - \mathbf{t}\right)^T \left(\mathbf{X}\mathbf{w} - \mathbf{t}\right) \\
&= \frac{1}{2} \left\Vert \mathbf{X}\mathbf{w} - \mathbf{t} \right\Vert_2^2
\end{align}

* This function is called the **least squares error** objective function.

The optimization problem is then:
$$\arg\min_{\mathbf{w}} J(\mathbf{w})$$ 
</div>

* So, we want $J(\mathbf{w})$ to be small. What is the set of parameters $\mathbf{w}$ that minimizes the objective function $J(\mathbf{w})$?

* What does it mean to **optimize** $J(\mathbf{w})$? **How do you find $\mathbf{w}$?**
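Before working out how to find $\mathbf{w}$ by hand, it can help to see what a minimizer looks like numerically. The sketch below treats NumPy's `np.linalg.lstsq` as a black-box solver for $\min_{\mathbf{w}} \Vert \mathbf{X}\mathbf{w} - \mathbf{t} \Vert_2$, which has the same minimizer as $J(\mathbf{w})$. The cubic generating function, noise level, and seed are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from an assumed cubic plus noise (illustrative only).
x = np.linspace(-1.0, 1.0, 30)
w_true = np.array([0.5, -1.0, 0.0, 2.0])
X = np.vander(x, N=4, increasing=True)     # M = 3
t = X @ w_true + 0.05 * rng.standard_normal(x.size)

def J(w):
    """Least squares error objective for the data above."""
    r = X @ w - t
    return 0.5 * float(r @ r)

# lstsq returns a minimizer of ||Xw - t||_2, hence of J(w).
w_hat, *_ = np.linalg.lstsq(X, t, rcond=None)

print(w_hat)
print(J(w_hat) <= J(w_true))
```

Because the targets are noisy, the fitted weights achieve an error no larger than that of the weights used to generate the data, which is a reminder that minimizing training error and recovering the hidden function are not the same thing.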

<!-- * To do that, we **take the derivative of $J(\mathbf{w})$ with respect to the parameters $\mathbf{w}$**.

* How do you take the derivative of a *scalar*, such as $J(\mathbf{w})$, with respect to a vector, such as $\mathbf{w}$?

    * What is the derivative of a scalar with respect to a vector? www.wooclap.com/RBLGOK --> 

We will continue next class. Be sure to review any Linear Algebra and Calculus concepts as needed.