<img src="images/ublogo.png"/>

### CSE610 - Bayesian Non-parametric Machine Learning

  - Lecture Notes
  - Instructor - Varun Chandola
  - Term - Fall 2020

### Objective
The objective of this notebook is to provide a gentle introduction to Machine Learning and to the course in general.

### What is Machine Learning
> Use what we have to infer what we do not

There are many types of ML problems, but most can be thought of as the following task:
- Given ${\bf x}$ find $y$
  * ${\bf x}$ refers to the observed part of the data. Other names include inputs, independent variables, covariates, etc. Here we denote it as a mathematical vector, i.e., ${\bf x} \in \mathbb{R}^D$, though this might not always be the case.
  * $y$ refers to the part that we want the ML method to estimate/infer. Depending on the problem, $y$ could mean:
    - A categorical target (classification)
    - A continuous value or vector of values (univariate/multivariate regression)
    - A membership indicator (clustering)
    - A vector representation (representation learning, dimensionality reduction)
    - A state of a stochastic system (reinforcement learning)

### Supervised ML
We will start with one flavor of machine learning - *supervised learning*, where we use a collection of $\langle {\bf x},y\rangle$ pairs (referred to as **training data**) to learn an ML model that can be used for the aforementioned inference. The training data provides supervision, hence the name.

The following discussion is based on supervised ML, even though many of the concepts translate to other ML flavors (unsupervised learning, reinforcement learning, etc.).

### Functional vs. Probabilistic Methods
**Functional models** learn a mathematical relationship between ${\bf x}$ and $y$, i.e.,
$$
y = f_{\Theta}({\bf x})
$$
The function $f$ is parameterized by a set of parameters ($\Theta$) that are learned using the training data.

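As a minimal sketch of a functional model, assume $f_{\Theta}$ is a linear map with $\Theta = ({\bf w}, b)$ and that $\Theta$ is learned by gradient descent on the mean squared error. The function names `f` and `fit`, the synthetic data, and the learning-rate and step-count settings are illustrative assumptions, not part of the course material.

```python
import numpy as np

def f(theta, X):
    """Functional model: a linear map with parameters theta = (w, b)."""
    w, b = theta[:-1], theta[-1]
    return X @ w + b

def fit(X, y, lr=0.1, steps=2000):
    """Learn theta by gradient descent on the mean squared error."""
    n, d = X.shape
    theta = np.zeros(d + 1)
    Xb = np.hstack([X, np.ones((n, 1))])   # column of ones handles the bias b
    for _ in range(steps):
        residual = Xb @ theta - y
        grad = 2.0 * Xb.T @ residual / n   # gradient of the mean squared error
        theta -= lr * grad
    return theta

# Synthetic, noise-free training pairs (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 2))
y = X @ np.array([2.0, -1.0]) + 0.5
theta = fit(X, y)   # learned parameters (w_1, w_2, b)
```
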
**Probabilistic models** estimate the probability (or density) $p(y \vert {\bf x})$. Some models directly estimate the conditional probability (discriminative models), while others also treat ${\bf x}$ as a random variable, model the joint distribution $p(y,{\bf x})$, and then obtain $p(y \vert {\bf x})$ by applying *Bayes' rule* (generative models).
> *Question*: How will you use the output of a probabilistic model, i.e., $p(y \vert {\bf x})$, for the actual task, i.e., classification or regression?

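The generative route can be sketched for a toy two-class problem. This sketch assumes one-dimensional Gaussian class-conditional densities $p(x \vert y)$ with hand-picked (not learned) means, standard deviations, and class priors; the helper names `gaussian_pdf` and `posterior` are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def posterior(x, priors, means, stds):
    """p(y | x) for every class y via Bayes' rule: p(y | x) = p(x | y) p(y) / p(x)."""
    joint = priors * gaussian_pdf(x, means, stds)   # p(x, y) for each class
    return joint / joint.sum()                      # normalize by p(x)

# Assumed model: two equally likely classes centered at -1 and +1.
priors = np.array([0.5, 0.5])
means = np.array([-1.0, 1.0])
stds = np.array([1.0, 1.0])

post_mid = posterior(0.0, priors, means, stds)    # point halfway between the classes
post_right = posterior(1.0, priors, means, stds)  # point at the mean of class 1
```

By symmetry, the posterior at $x = 0$ is split evenly between the two classes, while at $x = 1$ most of the mass moves to the second class.
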

### Bayesian vs. Non-Bayesian Probabilistic Models
This distinction stems from the *Bayesian* vs. *Frequentist* schools of thought in Statistics. In a typical non-Bayesian ML formulation, the unknown model parameters are learned/estimated by maximizing the log-likelihood of the data, possibly along with some other *prior* constraints on the solution.

For instance, consider **Ordinary Linear Regression**:
$$
y = {\bf w}^\top{\bf x} + \epsilon
$$
or, equivalently, assuming Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$,
$$
y \sim \mathcal{N}({\bf w}^\top {\bf x}, \sigma^2)
$$
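The frequentist MLE approach gives the following estimate for ${\bf w}$, where the rows of ${\bf X}$ are the training inputs and ${\bf y}$ stacks the corresponding targets: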
$$
{\bf w}  =  ({\bf X}^\top{\bf X})^{-1}{\bf X}^\top {\bf y}
$$

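The closed-form estimate can be checked numerically. The sketch below assumes synthetic data generated from hypothetical weights `w_true` with noise standard deviation 0.1, and solves the normal equations rather than forming the matrix inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 3
X = rng.normal(size=(N, D))                 # rows are training inputs
w_true = np.array([1.5, -2.0, 0.5])         # hypothetical weights, for illustration
y = X @ w_true + 0.1 * rng.normal(size=N)   # Gaussian noise with sigma = 0.1

# Solve the normal equations (X^T X) w = X^T y instead of inverting X^T X.
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq minimizes the same squared error and should agree.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```
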
The *Bayesian* treatment of regression has the same formulation as above; the only difference is that the parameters, i.e., ${\bf w}$ and $\sigma^2$, are also treated as random variables. This means that we place a prior distribution on the parameters and, after observing the data, calculate their posterior distribution using Bayes' rule.

In general, that is the fundamental difference between the two types of approaches. In Bayesian methods, the model parameters are treated as random variables, and the observed data is used to revise the prior distribution for these parameters, to obtain a posterior distribution, i.e.,
$$
p({\Theta}\vert D) \propto p(\Theta) \times p(D\vert \Theta)
$$
where $D$ denotes the observed (training) data. We will go over this in more detail in the first part of this course, for linear regression.

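For linear regression, the posterior update can be sketched in closed form under two simplifying assumptions that are not made in the text above: the noise variance $\sigma^2$ is treated as known, and the prior on the weights is ${\bf w} \sim \mathcal{N}({\bf 0}, \tau^2 {\bf I})$. Under these assumptions the posterior is Gaussian with covariance ${\bf S} = ({\bf I}/\tau^2 + {\bf X}^\top{\bf X}/\sigma^2)^{-1}$ and mean ${\bf m} = {\bf S}{\bf X}^\top{\bf y}/\sigma^2$. The function name `posterior_w` and the synthetic data are hypothetical.

```python
import numpy as np

def posterior_w(X, y, sigma2=0.01, tau2=1.0):
    """Gaussian posterior N(m, S) over w, assuming the prior w ~ N(0, tau2 * I)
    and a known noise variance sigma2."""
    D = X.shape[1]
    precision = np.eye(D) / tau2 + X.T @ X / sigma2   # inverse of S
    S = np.linalg.inv(precision)
    m = S @ X.T @ y / sigma2
    return m, S

# Synthetic data from hypothetical weights, noise standard deviation 0.1.
rng = np.random.default_rng(1)
w_true = np.array([1.0, -1.0])
X = rng.normal(size=(100, 2))
y = X @ w_true + 0.1 * rng.normal(size=100)

m_few, S_few = posterior_w(X[:5], y[:5])   # posterior after 5 observations
m_all, S_all = posterior_w(X, y)           # posterior after all 100 observations
```

Comparing `S_few` with `S_all` shows the posterior covariance shrinking as more data is observed, which is the sense in which the data revises the prior.
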

### Parametric vs. Non-parametric Methods

Parametric machine learning methods, as the name suggests, have parameters. Linear regression is a parametric method, because it has a parameter, the weight vector ${\bf w}$. A neural network is a parametric method, too.

A non-parametric method does not have parameters, which essentially means that it **does not make any assumptions about the form of the relationship between ${\bf x}$ and $y$**. For instance, a $k$-nearest neighbor classifier is a non-parametric method: to classify a test instance, it uses the entire training data set.

> A $k$-nearest neighbor classifier does have a user-defined "parameter", $k$. We will call such entities *hyper-parameters* to differentiate them from the actual parameters.

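A $k$-nearest neighbor classifier can be sketched in a few lines. This version assumes Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy data are hypothetical. Note that no parameters are learned: prediction works directly from the stored training set, and $k$ is the only hyper-parameter.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                     # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy training set: two well-separated clusters (assumed for illustration).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])

label_a = knn_predict(X_train, y_train, np.array([0.05, 0.05]))
label_b = knn_predict(X_train, y_train, np.array([1.0, 0.95]))
```
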
### So what is this course about?
<div align="center">
    <img src="images/mlmodels.png"/>
</div>

### Why?
- Because not all problems can be solved by deep learning
- How to solve problems with "small data"
- Uncertainty?

<img src="images/deepgp.png" />
          