<a href="https://colab.research.google.com/github/yiboxu20/MachineLearning/blob/main/Resources/Module0/Intro_to_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to machine learning

## What is machine learning

A popular definition of **machine learning** (**ML**), due to Tom Mitchell:

"A computer program is said to learn from experience $E$ with respect to some class of tasks $T$, and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$."


---



One of my friends says:

"Machine learning is driving by looking at the rearview mirror."


---



<img src="https://github.com/yiboxu20/MachineLearning/blob/main/Resources/images/ml_map.png?raw=true" width="900" />

## Main topics

There are three basic machine learning paradigms:

- supervised learning: Labeled data, direct feedback, predict future/outcome.

- unsupervised learning: No labels, no feedback, find hidden structure in data.


- reinforcement learning (not covered in this course): Decision process, reward system.



---



Here are several examples of topics:

1. Discrete supervised learning:     **classification**

2. Continuous supervised learning:   **regression**

3. Discrete unsupervised learning: **clustering**

4. Continuous unsupervised learning: **dimensionality reduction**



# Supervised learning

<img src="https://github.com/yiboxu20/MachineLearning/blob/main/Resources/images/01_02.png?raw=true" width="600" />

## Setup of supervised learning
Learn a function $F$ that maps an input to an output, based on example input-output pairs

$$\{(x^{(i)}, y^{(i)}): i = 1, \dots, n\} $$

with input $x^{(i)}$ and label $y^{(i)}$ such that

$$ y^{(i)}=F(x^{(i)})+\epsilon_i,  $$

where $\epsilon_i$ is a noise term. Then, for new input data $x$, we can make a prediction $y=F(x)$.


## Parametric function approximation

Parameterize the (abstract) mapping $F$  as a (concrete) function $F(x; \theta):x \mapsto y$  with  $y = F(x; \theta)$

- $\theta\in\mathbb{R}^d$: model parameters that determine the expression of $F$, and thus the performance of learning.

- **training**: find the optimal $\theta^\ast$ using some algorithm, such that on the training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^n$, $F(x^{(i)}; \theta^\ast)\approx y^{(i)}$ for all $i$, in some sense

- **prediction**:  for new data $x$, predict $y = F(x;\theta^\ast)$


## Training

Fit the model $F(x;\theta)$ on training data $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^n$

- the performance of the rule $F(x;\theta)$ on a data sample $(x^{(i)},y^{(i)})$ is measured by a loss function $\ell(F(x^{(i)}; \theta),y^{(i)})$ with the property
  $$\ell(F, y) \downarrow 0,  \quad \mbox{if } F \to y
  $$
  One possible loss function is the squared loss $\ell(F(x; \theta), y) = (F(x; \theta) - y)^2$


- minimize the averaged loss function over $\mathcal{D}$ w.r.t. $\theta$:
 $$
 \theta^* = \arg\min_{\theta} \; \frac{1}{n} \sum_{i=1}^n \ell(F(x^{(i)}; \theta),y^{(i)})
 $$

- an objective value of 0 means correct output for every training sample.
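
The training recipe above can be sketched in a few lines of NumPy. This is only a minimal illustration, not a prescribed algorithm: the linear form $F(x;\theta)=\theta_0+\theta_1 x$, the synthetic data, the plain gradient-descent loop, and the step size are all assumptions chosen for the example.

```python
import numpy as np

# Synthetic training data (an assumption for illustration):
# y = 2x + 1 plus small Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + 0.05 * rng.standard_normal(x.size)

def F(x, theta):
    """Linear model F(x; theta) = theta[0] + theta[1] * x."""
    return theta[0] + theta[1] * x

def average_loss(theta):
    """Average squared loss over the training set."""
    return np.mean((F(x, theta) - y) ** 2)

# Gradient descent on the average loss (step size chosen by hand).
theta = np.zeros(2)
learning_rate = 0.5
for _ in range(2000):
    residual = F(x, theta) - y
    # Gradient of the average squared loss with respect to theta.
    grad = np.array([2.0 * residual.mean(), 2.0 * (residual * x).mean()])
    theta = theta - learning_rate * grad

print("theta:", theta)
print("average loss:", average_loss(theta))
```

Because the synthetic labels contain noise, the minimized objective stays slightly above 0 here; an objective of exactly 0 would require the model to reproduce every training label.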


## 1. Regression
$y^{(i)}$ is a continuous variable. (Mostly covered in AMAT 565 (Applied Statistics for Data Scientists) and AMAT 593 (Practical).)

<img src="https://github.com/yiboxu20/MachineLearning/blob/main/Resources/images/regression1.png?raw=true" width="500" />

We can use a polynomial to fit the dataset.

<img src="https://github.com/yiboxu20/MachineLearning/blob/main/Resources/images/regression2.png?raw=true" width="500" />
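
A polynomial fit of this kind can be sketched with NumPy's `np.polyfit`, which solves the least-squares problem for the polynomial coefficients. The data below are synthetic and the degree is picked by hand; neither comes from the figures above.

```python
import numpy as np

# Synthetic data (an assumption, not the data in the figures):
# a quadratic trend plus small Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 - 2.0 * x + 3.0 * x**2 + 0.05 * rng.standard_normal(x.size)

degree = 2
# np.polyfit returns least-squares coefficients, highest power first.
coeffs = np.polyfit(x, y, degree)

# Evaluate the fitted polynomial on the inputs and measure the fit.
y_fit = np.polyval(coeffs, x)
mse = np.mean((y_fit - y) ** 2)

print("coefficients (highest power first):", coeffs)
print("training mean squared error:", mse)
```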


## 2. Classification

$y^{(i)}$ is a discrete variable, for example, a sign ($+$ or $-$) or an integer. (Mostly covered in AMAT 592 (Machine Learning) and partly covered in AMAT 565 (Applied Statistics for Data Scientists) and AMAT 593 (Practical).)

<img src="https://github.com/yiboxu20/MachineLearning/blob/main/Resources/images/01_03.png?raw=true" width="400" />


<img src="https://github.com/yiboxu20/MachineLearning/blob/main/Resources/images/digits.png?raw=true" width="500" />
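
As a rough sketch of binary classification in the parametric setup above, the example below fits a linear score to labels in $\{-1, +1\}$ by least squares and predicts with its sign. The two-blob data, the least-squares fit, and the helper `predict` are assumptions for illustration; other classifiers would serve equally well.

```python
import numpy as np

# Synthetic two-class data (an assumption): two Gaussian blobs in the plane.
rng = np.random.default_rng(0)
n_per_class = 50
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(n_per_class, 2))
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(n_per_class, 2))
X = np.vstack([X_neg, X_pos])
y = np.concatenate([-np.ones(n_per_class), np.ones(n_per_class)])

# Append a constant column so the model has an intercept:
# score(x; theta) = theta[0] * x1 + theta[1] * x2 + theta[2].
A = np.hstack([X, np.ones((X.shape[0], 1))])

# Fit theta by minimizing the squared loss against the +/-1 labels.
theta, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(X_new):
    """Predict a label in {-1, +1} from the sign of the linear score."""
    A_new = np.hstack([X_new, np.ones((X_new.shape[0], 1))])
    return np.where(A_new @ theta >= 0.0, 1.0, -1.0)

accuracy = np.mean(predict(X) == y)
print("training accuracy:", accuracy)
```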


## Testing and generalization

**Testing/Validation**: examine the performance of $F(x;\theta^*)$ on a new set of labeled data. (Mostly covered in AMAT 565 (Applied Statistics for Data Scientists); I will briefly mention it in AMAT 592.)

- generalizability: the ability of the trained model $F(x;\theta^*)$ to perform well on new data. (Hold-out method)
  - divide the available data into two sets: a training set and a validation/testing set.
  - train on the training set and evaluate the error on the validation set
  - adjust the number of model parameters for the best generalization on the validation set

- overfitting: the model fits the training data too well but generalizes poorly; usually caused by more parameters than can be justified by the data.

- underfitting: poor fit on both the training and test sets
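
The hold-out method can be sketched as follows. The synthetic sine data, the 70/30 split, and the candidate polynomial degrees are assumptions made for this example; the polynomial degree plays the role of the number of model parameters.

```python
import numpy as np

# Synthetic data (an assumption): a sine curve plus Gaussian noise.
rng = np.random.default_rng(0)
n = 40
x = rng.uniform(0.0, 1.0, n)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(n)

# Hold-out split: shuffle the indices, then use 70% for training
# and the remaining 30% for validation.
indices = rng.permutation(n)
n_train = (7 * n) // 10
train_idx, val_idx = indices[:n_train], indices[n_train:]

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# Fit polynomials of increasing degree on the training set only,
# then compare training and validation errors.
train_error, val_error = {}, {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
    train_error[degree] = mse(coeffs, x[train_idx], y[train_idx])
    val_error[degree] = mse(coeffs, x[val_idx], y[val_idx])
    print(degree, train_error[degree], val_error[degree])
```

The training error can only go down as the degree grows, so it cannot be used to choose the degree. A validation error well above the training error points to overfitting, while large errors on both sets point to underfitting.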


<img src="https://github.com/yiboxu20/MachineLearning/blob/main/Resources/images/overfitting.png?raw=true" width="600" />


<img src="https://github.com/yiboxu20/MachineLearning/blob/main/Resources/images/generalise.png?raw=true" width="500" />

## More on supervised learning: non-parametric learning

- $k$-nearest neighbour

- Deep learning (covered in AMAT 592 (Machine Learning) and AMAT 593 (Practical))
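
For the first item, a $k$-nearest-neighbour classifier keeps the training data and predicts by majority vote among the $k$ closest training points. The sketch below is a minimal version under a few assumptions: Euclidean distance, a hypothetical helper name `knn_predict`, and a tiny hand-made dataset.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=3):
    """Predict labels for X_new by majority vote among the k nearest
    training points (Euclidean distance)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    predictions = []
    for x in np.atleast_2d(X_new):
        distances = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(distances)[:k]
        # Count the labels among the k neighbours; on a tie, np.argmax
        # picks the smallest label.
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Tiny hand-made dataset (an assumption): two groups of three points.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

X_new = np.array([[0.5, 0.5], [5.5, 5.5]])
predictions = knn_predict(X_train, y_train, X_new, k=3)
print(predictions)
```

There is no training step here: all the work happens at prediction time, which is why the method is called non-parametric.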



## No free lunch theorem
Unfortunately, there is no single best model that works optimally for all kinds of problems.

The best way to pick a suitable model is based on domain knowledge and/or trial and error.


# Unsupervised learning

Find previously unknown patterns in a data set without pre-existing labels.



## 3. Clustering
Group data so that data in the same group (i.e., cluster) are more similar (in some sense) to each other than to those in other clusters. (Partly covered in AMAT 592 and mostly covered in AMAT 810 (Advanced Machine Learning) and AMAT 585 (Practical TDA).)

<img src="https://github.com/yiboxu20/MachineLearning/blob/main/Resources/images/01_06.png?raw=true" width="500" />
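
The notes do not commit to a particular clustering algorithm; as one common choice, the sketch below implements $k$-means (Lloyd's algorithm) with Euclidean distance as the notion of similarity. The function name `kmeans`, the farthest-first initialization, and the synthetic blobs are assumptions made for this example.

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Cluster the rows of X into k groups with Lloyd's algorithm."""
    X = np.asarray(X, dtype=float)
    # Deterministic farthest-first initialization (a choice made for this
    # sketch): start from the first point, then repeatedly add the point
    # farthest from the centers chosen so far.
    centers = [X[0]]
    for _ in range(k - 1):
        dist_to_centers = np.min(
            [np.linalg.norm(X - c, axis=1) for c in centers], axis=0
        )
        centers.append(X[np.argmax(dist_to_centers)])
    centers = np.array(centers)

    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        # Update step: each center moves to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Synthetic data (an assumption): two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(20, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(20, 2))
X = np.vstack([blob_a, blob_b])

labels, centers = kmeans(X, k=2)
print(labels)
print(centers)
```

No labels are used anywhere: the grouping is recovered from the geometry of the data alone.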



## 4. Dimensionality reduction

Reduce the number of variables in a dataset, here with a technique that emphasizes variation in the data. (Partly covered in AMAT 592 and mostly covered in AMAT 810 (Advanced Machine Learning).)

In the example here, the direction of largest variation is relevant; the rest is “noise”. This is the method of **principal component analysis** (PCA).


<img src="https://github.com/yiboxu20/MachineLearning/blob/main/Resources/images/pca_demo.png?raw=true" width="500" />
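
Principal component analysis can be sketched with a singular value decomposition of the centered data. The synthetic 2-D dataset and the variable names below are assumptions for illustration and are not taken from the figure.

```python
import numpy as np

# Synthetic 2-D data (an assumption, not the data in the figure): points
# spread widely along the direction (1, 1) and only slightly across it.
rng = np.random.default_rng(0)
n = 200
main_direction = np.array([1.0, 1.0]) / np.sqrt(2.0)
cross_direction = np.array([1.0, -1.0]) / np.sqrt(2.0)
X = (np.outer(3.0 * rng.standard_normal(n), main_direction)
     + np.outer(0.3 * rng.standard_normal(n), cross_direction))

# Center the data, then take the singular value decomposition.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Rows of Vt are the principal directions, ordered by decreasing variance.
first_direction = Vt[0]
explained_variance_ratio = S**2 / np.sum(S**2)

# Reduce from 2 dimensions to 1 by projecting onto the first direction.
X_reduced = X_centered @ first_direction

print("first principal direction:", first_direction)
print("explained variance ratio:", explained_variance_ratio)
```

The sign of a principal direction is arbitrary, so `first_direction` may point along $(1, 1)$ or $(-1, -1)$; both describe the same line of largest variation.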



## Dimension reduction for data compression

Manifold learning


<img src="https://github.com/yiboxu20/MachineLearning/blob/main/Resources/images/01_07.png?raw=true" width="600" />
