**CS596 - Machine Learning**
<br>
Date: **31 August 2020**
<br>

Title: **Lecture 1: Introduction to Machine Learning**
<br>
Speaker: **Dr. Shota Tsiskaridze**
<br>
Teaching Assistant: **Levan Sanadiradze**

Bibliography:
<br>
[1] Bishop, Christopher M., *Pattern Recognition and Machine Learning*, Springer, 2006

<h1 align="center">Machine Learning Basics</h1>

<h3 align="center">What Is Machine Learning?</h3>

- **Machine Learning** is a field of computer science that gives computer systems the ability to find solutions to problems on their own by recognizing patterns in data.

- The main focus of ML is to provide algorithms that can be **trained to perform a task**.

- It is closely related to the field of **computational statistics** as well as **mathematical optimization**.

<center><img src="images/arthur_samuel.jpg" width="250" alt="Example" /></center>

- The name **Machine Learning** was coined in **1959** by **Arthur Samuel**.

<h3 align="center">Why is Machine Learning important?</h3>

- Recent years have shown that ML can automate many tasks once thought to require humans, such as **Image Recognition**, **Text Generation** and **Playing Games**.

- ML is going to have huge effects on the economy and living in general.
- Entire work tasks and industries can be automated and the job market will be changed forever.

    For example:

- In 2015, Google DeepMind's **AlphaGo** program became the first computer **Go** program to beat a professional human player without a handicap.
- On April 13, 2019, **OpenAI Five** won back-to-back games against the **Dota 2 world champions OG at Finals**, becoming the first AI to beat the world champions in an esports game.

<h3 align="center">"A.I. Fathers"</h3>
<br>
<center><img src="images/ai_nobels.png" width="1200" alt="Example" /></center>

<h3 align="center">Learning Algorithms</h3>


- A **Machine Learning Algorithm** is an algorithm that is able to **learn from data**.

- What do we mean by **learning**? 
<center><img src="images/tom_mitchel.jpg" width="700" alt="Example" /></center>
- In 1997, **Tom M. Mitchell** provided a definition: *A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$*.

<h3 align="center">The Task, T</h3>

- The **Task** $T$ is usually described in terms of how the ML system should process an **example**.

- An **example** is a collection of **features** that have been quantitatively measured from some object or event.

- An example is represented as a vector $\mathbf{x} \in \mathbb{R}^n$, where each entry $x_i$ of the vector $\mathbf{x}$ is some feature.



<h3 align="center">Tasks that can be solved with ML</h3>

<br> <br> &emsp; $\bullet$ **Classification**: the computer program is asked to specify which of $k$ categories some input belongs to. 
<br> &emsp; &ensp; The learning algorithm is usually asked to produce **a function** $f: \mathbb{R}^n\to \{1, ..., k\}$.
<br> <br> &emsp; $\bullet$ **Classification with missing inputs**: the computer program is not guaranteed that every measurement
<br> &emsp; &ensp; in its input vector will always be provided, i.e. the learning algorithm must learn a **set of functions**.
<br> &emsp; &ensp; Each function corresponds to classifying $\mathbf{x}$ with a different subset of its inputs missing.
<br> <br> &emsp; $\bullet$ **Regression**: the computer program is asked to predict a numerical value given some input.
<br> &emsp; &ensp; The learning algorithm is asked to output **a function** $f: \mathbb{R}^n \to \mathbb{R}$.
<br> <br> &emsp; $\bullet$ **Transcription**: the machine learning system is asked to observe a relatively unstructured representation 
<br> &emsp; &ensp; of some kind of data and **transcribe it into textual form**.
<br> <br> &emsp; $\bullet$ **Machine translation**: the input already consists of a **sequence of symbols** in **some language**,
<br> &emsp; &ensp; and the computer program must **convert** this into a **sequence of symbols** in **another language**.

&emsp; $\bullet$ **Structured output**: involves any task where the **output is a vector** (or other data structure containing 
<br> &emsp; &ensp; multiple values) with important **relationships** between the different elements.
<br> <br> &emsp; $\bullet$ **Anomaly detection**: the computer program sifts through a set of events or objects, 
<br> &emsp; &ensp; and **flags** some of them as being **unusual or atypical**.
<br> <br> &emsp; $\bullet$ **Synthesis and sampling**: the machine learning algorithm is asked to **generate new examples** that are 
<br> &emsp; &ensp; similar to those in the training data.
<br> <br> &emsp; $\bullet$ **Imputation of missing values**: the machine learning algorithm is given a **new example** $\mathbf{x} \in \mathbb{R}^n$ but with 
<br> &emsp; &ensp; some $x_i$ of $\mathbf{x}$ missing. The algorithm must provide a prediction of the values of the missing entries.
<br> <br> &emsp; $\bullet$ **Denoising**: the machine learning algorithm is given as input a **corrupted example** $\widetilde{\mathbf{x}} \in \mathbb{R}^n$, 
<br> &emsp; &ensp; obtained by an unknown corruption process from a **clean example** $\mathbf{x} \in \mathbb{R}^n$. 
<br> &emsp; &ensp; The algorithm must provide a prediction of the conditional probability distribution $p(\mathbf{x}| \widetilde{\mathbf{x}})$.
<br> <br> &emsp; $\bullet$ **Density estimation or probability mass function estimation**: the machine learning algorithm is asked to 
<br> &emsp; &ensp; learn a **function** $p_{model}: \mathbb{R}^n\to\mathbb{R}$, where $p_{model}(\mathbf{x})$ is a **PDF** if $\mathbf{x}$ is continuous, or a **PMF** if $\mathbf{x}$ is discrete.

<h3 align="center">The Performance Measure, P</h3>


&ensp; $\bullet$ The **Performance Measure** $P$ is a quantitative measure of the performance of a machine learning algorithm.

<br> &ensp; $\bullet$ The **Performance Measure** $P$ is specific to the task $T$ being carried out by the system.

<br> &ensp; $\bullet$ For **classification** tasks we often measure the **accuracy** of the model, which can be computed from the **Confusion Matrix**.

<br> &ensp; $\bullet$ For tasks such as **density estimation** the most common approach is to report the average **log-probability**.



<h3 align="center">Confusion Matrix</h3>

$\bullet$ The **Confusion Matrix** is a specific table layout that allows visualization of the performance of an algorithm:

|                    |  Actual: Yes    |  Actual: No    |
|:-------------------|:---------------:|:--------------:|
| **Predicted: Yes** |      **TP**     |     **FP**     |
|  **Predicted: No** |      **FN**     |     **TN**     |

&emsp; $\bullet$ **TP** is **True Positive**: You predicted positive and it’s true;
<br> &emsp; $\bullet$ **TN** is **True Negative**: You predicted negative and it’s true;
<br> &emsp; $\bullet$ **FP** is **False Positive** (**Type 1 Error**): You predicted positive and it’s false.
<br> &emsp; $\bullet$ **FN** is **False Negative** (**Type 2 Error**): You predicted negative and it’s false.
<br>


<br> &emsp; $\bullet$ $\mathbf{Precision} = \frac{\mathbf{TP}}{\mathbf{TP}+\mathbf{FP}};$

<br> &emsp; $\bullet$ $\mathbf{Accuracy} = \frac{\mathbf{TP}+ \mathbf{TN}}{\mathbf{TP}+\mathbf{TN} + \mathbf{FP}+\mathbf{FN}};$

<br> &emsp; $\bullet$ $\mathbf{Recall} = \frac{\mathbf{TP}}{\mathbf{TP}+\mathbf{FN}};$

<br> &emsp; $\bullet$ $\mathbf{F\text{-}measure} = 2 \cdot \frac{\mathbf{Recall} \times \mathbf{Precision}}{\mathbf{Recall} + \mathbf{Precision}}.$
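
The four formulas above can be sketched in a few lines of Python. The function name `classification_metrics` and the counts passed to it are hypothetical, chosen only for illustration; the sketch also assumes that none of the denominators is zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return precision, accuracy, recall and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, accuracy, recall, f_measure


# Hypothetical counts, not taken from any real experiment.
precision, accuracy, recall, f_measure = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(precision, accuracy, recall, f_measure)
```

With these counts, precision is $40/50$, accuracy is $85/100$ and recall is $40/45$, which shows that the three measures can differ noticeably on the same predictions.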

<center><img src="images/accuracy_vs_precision.png" width="700" alt="Example" /></center>



<h3 align="center">The Experience, $E$</h3>


&ensp; $\bullet$ Most of the **ML algorithms** can be understood as being allowed to experience an entire **dataset**.

&ensp; $\bullet$ A **dataset** is a collection of many **examples**.

&ensp; $\bullet$ **Supervised learning algorithms**: 
<br> &emsp; &ensp; $\bullet$ involves observing several examples of a random vector $\mathbf{x}$
<br> &emsp; &ensp; $\bullet$ involves observing a value or vector $\mathbf{y}$, called the **label** or **target**, associated with each example of $\mathbf{x}$
<br> &emsp; &ensp; $\bullet$ learning to predict $\mathbf{y}$ from $\mathbf{x}$, usually by estimating $p(\mathbf{y} | \mathbf{x})$.

&ensp; $\bullet$ **Unsupervised learning algorithms**:
<br> &emsp; &ensp; $\bullet$ involves observing several examples of a random vector $\mathbf{x}$
<br> &emsp; &ensp; $\bullet$ attempting to implicitly or explicitly learn the probability distribution $p(\mathbf{x})$.


&ensp; $\bullet$ **Reinforcement learning algorithms**:
<br> &emsp; &ensp; $\bullet$ interact with an environment, i.e. **no fixed dataset**.
<br> &emsp; &ensp; $\bullet$ there is a feedback loop between the learning system and its experiences.



<center><img src="images/Supervised_learning.png" width="1250" alt="Example" /></center>

- **Classification**: classifying images, spam detection, document classification, medical diagnosis.
- **Regression**: predicting prices, predicting the life cycle of vehicle parts.

<center><img src="images/Unsupervised_learning.png" width="1250" alt="Example" /></center>

- **Dimensionality reduction**: data compression, visualization, topic modeling.
- **Clustering**: social network analysis, targeted marketing, big data analysis, detecting outliers.

<center><img src="images/Reinforcement_learning.png" width="1250" alt="Example" /></center>

- **Real Time Decisions**: recommendation systems, playing games (OpenAI).
- **Robot Control**: robot navigation, self-driving cars.

<h3 align="center"> Unsupervised Learning versus Supervised Learning</h3>

- **Unsupervised Learning** and **Supervised Learning** are not **formally defined** terms.
- Many machine learning technologies can be used to **perform both tasks**.


- The **chain rule** of probability states that for a vector $\mathbf{x} \in \mathbb{R}^n$, the joint distribution can be written as:
$$p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i| x_1, ..., x_{i-1}).$$
This decomposition means that we can solve the **unsupervised problem** of modeling $p(\mathbf{x})$ by splitting it into $n$ **supervised learning** problems.
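
As a small worked instance of this decomposition, take $n = 3$:

$$p(x_1, x_2, x_3) = p(x_1)\, p(x_2 | x_1)\, p(x_3 | x_1, x_2).$$

Each factor on the right can be read as a supervised problem in which the entries $x_1, ..., x_{i-1}$ play the role of the input and $x_i$ plays the role of the target; the first factor, $p(x_1)$, has no input at all.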


- Alternatively, we can solve the **supervised learning** problem of learning $p(\mathbf{y} | \mathbf{x})$ by using traditional unsupervised learning technologies to learn the joint distribution $p(\mathbf{x}, \mathbf{y})$ and inferring:
$$p(\mathbf{y} | \mathbf{x}) = \frac{p(\mathbf{x}, \mathbf{y})}{\sum_{\mathbf{y}'}p(\mathbf{x}, \mathbf{y}')}.$$
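
A minimal sketch of this inference step for discrete variables, assuming the joint distribution over $\mathbf{x}$ and $\mathbf{y}$ is already available as a table. The table values and the helper name `conditional_y_given_x` are made up for illustration.

```python
# Hypothetical joint distribution over 3 values of x (rows) and 2 values of y (columns).
joint = [
    [0.10, 0.20],
    [0.30, 0.05],
    [0.15, 0.20],
]


def conditional_y_given_x(joint, x):
    """Return p(y | x) by normalising row x of the joint table over y."""
    row = joint[x]
    total = sum(row)  # marginal p(x): the sum of the joint over all y'
    return [p / total for p in row]


p_y_given_x0 = conditional_y_given_x(joint, 0)
print(p_y_given_x0)
```

The normalising sum in the denominator is exactly the marginal $p(\mathbf{x})$, so the result is a valid distribution over $\mathbf{y}$ for every fixed $\mathbf{x}$.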



<h3 align="center">Design matrix</h3>

$\bullet$ A **Design Matrix** is a matrix containing a different example in each row and a different feature in each column, i.e. $\mathbf{X} \in \mathbb{R}^{m\times n}$ for $m$ examples with $n$ features each.

$\bullet$ One of the oldest datasets studied by machine learning researchers is the **Iris dataset** (**Fisher, 1936**):
<br> &ensp; $\bullet$ A collection of measurements of different parts of **150 iris plants**.
<br> &ensp; $\bullet$ The **features** within each example are: the **sepal length**, **sepal width**, **petal length** and **petal width**.

<center><img src="images/L1_Iris_Dataset.png" width="950" alt="Example" /></center>

$\bullet$ We can represent the dataset with a **design matrix** $\mathbf{X} \in \mathbb{R}^{150\times 4}$, where $X_{i,1}$ is the sepal length of plant $i$, 
<br> &emsp; $X_{i,2}$ is the sepal width of plant $i$, $X_{i,3}$ is the petal length of plant $i$ and $X_{i,4}$ is the petal width of plant $i$.

$\bullet$ In case of a dataset where **vectors** have **different sizes**, we describe it as a **set containing $m$ elements**: 

$$\mathbf{X} = \{ \mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_m\},$$ 

&ensp; where vectors $\mathbf{x}_i$ and $\mathbf{x}_j$ may have different sizes.
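
The two representations can be sketched with NumPy as follows. The three rows are illustrative measurements in the Iris feature order rather than the full 150-row dataset, and the vectors in `ragged` are made up.

```python
import numpy as np

# Three illustrative rows in the Iris feature order:
# sepal length, sepal width, petal length, petal width.
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
])

m, n = X.shape           # m examples (rows), n features (columns)
sepal_length = X[:, 0]   # the column X_{i,1} in the 1-based notation of the text

# Examples of different sizes cannot form a matrix, so keep them as a list of vectors.
ragged = [np.array([1.0, 2.0]), np.array([3.0, 4.0, 5.0])]

print(m, n, sepal_length, [len(x) for x in ragged])
```

Note that NumPy indexes from 0, so the first feature $X_{i,1}$ of the text is `X[i, 0]` in code.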

<h3 align="center">Example: Linear Regression</h3>

$\bullet$ As the name implies, linear regression solves a regression problem.
<br>
$\bullet$ A system takes a vector $\mathbf{x}\in \mathbb{R}^n$ as input and predicts the value of a scalar $y \in \mathbb{R}$ as an output, 
<br> &ensp; which is a linear function of the input.
$$\hat{y} = \mathbf{w}^T \mathbf{x},$$
<br> &ensp; where $\hat{y}$ is the value that our model predicts and $\mathbf{w} \in \mathbb{R}^n$ is a vector of **parameters** (also called **weights**).

$\bullet$ One way of measuring the **performance** of the model is to compute the **mean squared error**:

$$MSE^{(test)} = \frac{1}{m}\sum_{i} \left (\hat{y}^{(test)} - {y}^{(test)} \right )_i^2 = \frac{1}{m} \left \| \hat{y}^{(test)} - {y}^{(test)} \right \|_2^2$$
$\bullet$ To make a machine learning algorithm, we need to design an algorithm that improves the weights $\mathbf{w}$ so as to reduce $MSE^{(test)}$ while observing only the training set $(\mathbf{X}^{(train)}, y^{(train)})$.
<br>
$\bullet$ One intuitive way to do this is to minimize the mean squared error on the training set, $MSE^{(train)}$, by solving for where its gradient is $0$:
$$\nabla_\mathbf{w} MSE^{(train)}=0 \Rightarrow \nabla_\mathbf{w} \frac{1}{m} \left \| \hat{y}^{(train)} - {y}^{(train)} \right \|_2^2 = 0 \Rightarrow  \nabla_\mathbf{w} \frac{1}{m} \left \| \mathbf{X}^{(train)}\mathbf{w} - {y}^{(train)} \right \|_2^2 = 0 \Rightarrow$$
    $$\mathbf{w} = \left ( \mathbf{X}^{(train)T} \mathbf{X}^{(train)} \right )^{-1} \mathbf{X}^{(train)T}y^{(train)}$$
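
The closed-form solution above can be sketched with NumPy. The synthetic data, the value of `w_true` and the noise level are assumptions made only so the example runs on its own; the sketch also assumes that $\mathbf{X}^{(train)T} \mathbf{X}^{(train)}$ is invertible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: y = X w_true + small noise (w_true is a made-up value).
m, n = 50, 3
X_train = rng.normal(size=(m, n))
w_true = np.array([2.0, -1.0, 0.5])
y_train = X_train @ w_true + 0.01 * rng.normal(size=m)

# Solve (X^T X) w = X^T y rather than forming the inverse explicitly.
w = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

mse_train = np.mean((X_train @ w - y_train) ** 2)
print(w, mse_train)
```

Solving the linear system gives the same $\mathbf{w}$ as the formula with the explicit inverse, but is generally the more numerically stable choice.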

<h3 align="center">Example</h3>

$\bullet$ Let's consider the **dataset generated by the function** $f(x) = \sin(2\pi x)$.
<br>
$\bullet$ In particular, we can fit the data using a **polynomial function** of the form:
$$ y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M.$$
$\bullet$ Plots of polynomials of various orders $M$, fitted to the data set, are shown as red curves below:

<center><img src="images/Polynomials.png" width="950" alt="Example" /></center>
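
A fit of this kind can be sketched as a least-squares problem on a design matrix whose columns are the powers of $x$. The number of points, the noise level and the orders tried are assumptions, not values stated in the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: 10 inputs in [0, 1], targets sin(2*pi*x) plus Gaussian noise.
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=x.shape)


def fit_polynomial(x, t, M):
    """Least-squares coefficients w_0..w_M of an order-M polynomial."""
    Phi = np.vander(x, M + 1, increasing=True)  # columns: 1, x, x^2, ..., x^M
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w


def predict(x, w):
    """Evaluate y(x, w) = w_0 + w_1 x + ... + w_M x^M."""
    return np.vander(x, len(w), increasing=True) @ w


rms_error = {}
for M in (0, 1, 3, 9):
    w = fit_polynomial(x, t, M)
    rms_error[M] = float(np.sqrt(np.mean((predict(x, w) - t) ** 2)))
    print(f"M={M}: training RMS error = {rms_error[M]:.4f}")
```

With 10 points, the order-9 polynomial has enough coefficients to pass through every training point, so its training error is essentially zero even though the targets are noisy.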

<h3 align="center">Overfitting and Underfitting</h3>

$\bullet$ The central challenge in machine learning is that we must perform well on **previously unseen (new) inputs**.
<br>
$\bullet$ The ability to perform well on previously unobserved inputs is called **generalization**.
<br>
$\bullet$ There are two types of error we can compute:
<br> &emsp; $\bullet$ the **training error**, measured on the training set.
<br> &emsp; $\bullet$ the **test error**, or **generalization error**, defined as the expected value of the error on a **new input**.

$\bullet$ In our linear regression example, the **training** and **test errors** are:

$$\text{training error} = \frac{1}{m^{(train)}} || \mathbf{X}^{(train)}\mathbf{w} - \mathbf{y}^{(train)}||_2^2$$

$$\text{test error} = \frac{1}{m^{(test)}} || \mathbf{X}^{(test)}\mathbf{w} - \mathbf{y}^{(test)}||_2^2$$

$\bullet$ The factors determining **how well** a **machine learning** algorithm will **perform** are its ability to:
<br> &emsp; $\bullet$ Make the **training error** small.
<br> &emsp; $\bullet$ Make the gap between **training** and **test error** small.
<br>
$\bullet$ These factors correspond to the two central challenges in machine learning: **underfitting** and **overfitting**.
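
The gap between the two errors can be observed in a small experiment. This is a sketch under assumed settings (noisy samples of $\sin(2\pi x)$, 10 training points, 100 test points, polynomial orders 1, 3 and 9), none of which come from the lecture.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)


def noisy_sine(x):
    """Assumed data model: sin(2*pi*x) plus Gaussian noise with std 0.3."""
    return np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=x.shape)


x_train = np.linspace(0.0, 1.0, 10)
y_train = noisy_sine(x_train)
x_test = rng.uniform(0.0, 1.0, size=100)
y_test = noisy_sine(x_test)


def mse(x, y, w):
    """Mean squared error of the polynomial with coefficients w_0..w_M."""
    return float(np.mean((P.polyval(x, w) - y) ** 2))


train_error, test_error = {}, {}
for M in (1, 3, 9):
    w = P.polyfit(x_train, y_train, M)  # coefficients ordered from w_0 to w_M
    train_error[M] = mse(x_train, y_train, w)
    test_error[M] = mse(x_test, y_test, w)
    print(f"M={M}: train={train_error[M]:.4f}  test={test_error[M]:.4f}")
```

The training error can only decrease as the order grows, whereas the test error has no such guarantee; a low-order model with a large training error illustrates underfitting, and a high-order model with a large gap illustrates overfitting.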

<h3 align="center">Capacity</h3>



$\bullet$ A model's **capacity** is its ability to fit a wide variety of functions. 
<br> &ensp; By altering it, we **can control** whether a model is more likely to **underfit** or **overfit**.

$\bullet$  One way to control the **capacity** of a ML algorithm is by choosing its **hypothesis space**, 
<br> &ensp; the set of functions that the learning algorithm is allowed to select as being the solution.

<center><img src="images/Capacity.png" width="1000" alt="Example" /></center>

<h3 align="center">Regularization</h3>

$\bullet$ **Regularization** is any modification we make to a ML algorithm that is intended to reduce its **test error** but not its **training error**.

$\bullet$ The **no free lunch theorem** has made it clear that there is **no best machine learning algorithm**, and, in particular, **no best form of regularization**.

$\bullet$ For example, we can modify the training criterion for **linear regression** to include **weight decay**:

$$J(\mathbf{w}) = MSE^{(train)} + \lambda \mathbf{w}^T \mathbf{w}.$$

<center><img src="images/Regularization.png" width="800" alt="Example" /></center>

$\bullet$ Regularization **allows us to select from** a larger **hypothesis space** while expressing a preference for **simpler models**.
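
For linear regression, the weight-decay criterion also has a closed-form minimiser. Setting the gradient $\frac{2}{m}\mathbf{X}^T(\mathbf{X}\mathbf{w} - \mathbf{y}) + 2\lambda\mathbf{w}$ to zero gives $(\mathbf{X}^T\mathbf{X} + m\lambda \mathbf{I})\mathbf{w} = \mathbf{X}^T\mathbf{y}$. The sketch below assumes this derivation; the function name `fit_weight_decay`, the random data and the values of $\lambda$ are illustrative.

```python
import numpy as np


def fit_weight_decay(X, y, lam):
    """Minimise (1/m)||Xw - y||^2 + lam * w^T w in closed form."""
    m, n = X.shape
    # Stationarity condition: (X^T X + m * lam * I) w = X^T y.
    return np.linalg.solve(X.T @ X + m * lam * np.eye(n), X.T @ y)


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))  # made-up training inputs
y = rng.normal(size=20)       # made-up training targets

squared_norms = []
for lam in (0.0, 0.1, 10.0):
    w = fit_weight_decay(X, y, lam)
    squared_norms.append(float(w @ w))
    print(f"lambda={lam}: ||w||^2 = {squared_norms[-1]:.4f}")
```

Larger values of $\lambda$ shrink the weights towards zero, which is how the criterion expresses its preference for smaller weights; $\lambda = 0$ recovers ordinary least squares.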

<h3 align="center">Hyperparameters and Validation Sets</h3>

$\bullet$ **Hyperparameters** are settings that we use to control the behavior of the learning algorithm.

$\bullet$ For example, in polynomial regression the degree of the polynomial is a **capacity hyperparameter**.
<br>
$\bullet$ The $\lambda$, used to control the strength of **weight decay**, is another example of a **hyperparameter**.
<br>

$\bullet$ If **learned on the training set**, such **hyperparameters** would always favor the highest possible capacity (a higher polynomial degree, a smaller $\lambda$), which leads to **overfitting**.
<br>
$\bullet$ To solve this problem, we need a **validation set** of examples that the training algorithm does not observe.

<center><img src="images/Train_validation_test_sets.png" width="1300" alt="Example" /></center>

$\bullet$ **Training dataset**: the sample of data used to fit the model.
<br>
$\bullet$ **Validation dataset**: the sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters.
<br>
$\bullet$ **Test dataset**: the sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.
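
One simple way to produce the three datasets is to shuffle the example indices and cut them into disjoint parts. The 60/20/20 proportions and the function name `train_val_test_split` below are assumptions for illustration, not a recommendation from the lecture.

```python
import numpy as np


def train_val_test_split(m, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle indices 0..m-1 and cut them into three disjoint index arrays."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(m)
    n_test = int(m * test_fraction)
    n_val = int(m * val_fraction)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = train_val_test_split(150)
print(len(train_idx), len(val_idx), len(test_idx))
```

The index arrays can then be used to select rows of a design matrix, e.g. `X[train_idx]`; the test indices should be set aside until the final evaluation.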


<h3 align="center">Cross-Validation</h3>

$\bullet$ Dividing the **dataset** into a fixed training set and a fixed test set can be **problematic** if the resulting **test set is small**.
<br>
$\bullet$ A **small test set** implies **statistical uncertainty** around the estimated average test error, 
<br> &ensp; making it difficult to claim that algorithm $A$ works better than algorithm $B$ on the given task.

$\bullet$ **Alternative procedures** enable one to use all of the examples in the estimation of the mean test error,
<br> &ensp;  **at the price of increased computational cost**. The most common of these is $k$-fold cross-validation:

<center><img src="images/k-fold.png" width="1100" alt="Example" /></center>
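
A minimal sketch of $k$-fold cross-validation, using least-squares linear regression as the model being evaluated. The helper names, the synthetic data and the choice $k = 5$ are assumptions made so that the example is self-contained.

```python
import numpy as np


def k_fold_indices(m, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each example is tested exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(m), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx


def cross_validated_mse(X, y, k=5):
    """Average test MSE of least-squares linear regression over k folds."""
    errors = []
    for train_idx, test_idx in k_fold_indices(len(y), k):
        w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        errors.append(np.mean((X[test_idx] @ w - y[test_idx]) ** 2))
    return float(np.mean(errors))


# Made-up data: a linear target with a small amount of Gaussian noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)

cv_mse = cross_validated_mse(X, y, k=5)
print(cv_mse)
```

The model is refit $k$ times, once per fold, which is the increased computational cost mentioned above; in exchange, every example contributes to the test-error estimate.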



<h1 align="center">End of Lecture</h1>