## Module 1

### What is Machine Learning?

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
To have a learning problem, we must identify:
- class of tasks T
- performance measure P
- source of experience E

Traditional programming vs Machine Learning
- Traditional programming: (data) + (program) = (output)
- Machine Learning: (data) + (output) = (program)

#### Table with learning tasks, performance measures and experience sources

| Task | Performance Measure | Experience Source |
| --- | --- | --- |
| Email spam filter | Accuracy of the filter | User marks emails as spam/not spam |
| Handwritten digit recognition | Accuracy of the classifier | User provides examples of digits |
| Self-driving car | Safety and efficiency of the car | User drives the car |
| Playing checkers | % of games won against opponent | Games played against itself |

#### When do we use / not use Machine Learning?
Used when:
- lots of hand-tuning, long lists of rules, or hard to define rules
- complex / fluctuating environment
- expert knowledge does not exist, or is difficult to obtain
- models based on huge amount of data, must be customized to each individual

Not used when:
- simple, static environment, well-defined rules
- no uncertainty in the environment
- expert knowledge is available

### Machine Learning Process

| Step | Description |
| --- | --- |
| 1. Define the Problem     | Clearly define the problem statement, including the goal and the target variable(s).<br> Identify the available resources, constraints, and relevant stakeholders.<br> Understand the domain knowledge and business context to ensure the problem's relevance. |
| 2. Data Collection        | Determine the data requirements based on the problem definition.<br> Identify potential data sources and acquire the necessary datasets.<br> Ensure data quality by performing data validation, cleaning, and handling missing values or outliers. |
| 3. Data Exploration       | Perform statistical analysis, such as summary statistics and data distributions.<br> Visualize the data through plots, histograms, scatterplots, or heatmaps.<br> Identify correlations, patterns, and outliers within the dataset.<br> Conduct feature correlation analysis to understand relationships between variables. |
| 4. Feature Engineering    | Select relevant features based on domain knowledge and exploration.<br> Handle categorical variables through techniques like one-hot encoding or ordinal encoding.<br> Scale numerical features to a common range or apply normalization techniques.<br> Create new features by transforming or combining existing ones (e.g., feature interactions, polynomial features). |
| 5. Model Selection        | Identify the problem type (classification, regression, clustering, etc.).<br> Consider the characteristics of the dataset (e.g., size, dimensionality) and the assumptions of different algorithms.<br> Evaluate various algorithms and choose the one that best suits the problem and data. |
| 6. Model Training         | Split the data into training and testing sets (e.g., using random sampling or time-based splitting).<br> Apply the chosen algorithm to the training data and optimize its hyperparameters.<br> Evaluate the model's performance on the testing set using appropriate metrics.<br> Repeat the training process with different algorithms or parameter settings if necessary. |
| 7. Model Evaluation       | Calculate evaluation metrics such as accuracy, precision, recall, F1 score, or mean squared error.<br> Perform cross-validation or holdout validation to estimate the model's performance on unseen data.<br> Analyze the model's strengths, weaknesses, and potential biases.<br> Consider business requirements and domain-specific metrics for a comprehensive evaluation. |
| 8. Model Optimization     | Fine-tune the model's hyperparameters through techniques like grid search, random search, or Bayesian optimization.<br> Regularize the model to prevent overfitting using techniques like L1/L2 regularization or dropout.<br> Explore ensemble methods, such as bagging or boosting, to improve model performance.<br> Use feature selection techniques to remove irrelevant or redundant features. |
| 9. Model Deployment       | Prepare the model for deployment by saving its trained parameters and associated preprocessing steps.<br> Integrate the model into an application, system, or cloud infrastructure.<br> Design and implement an API for making predictions using the deployed model.<br> Ensure the model's scalability, robustness, and security in a production environment. |
| 10. Monitoring and Maintenance | Continuously monitor the model's performance in real-world scenarios.<br> Collect feedback and track performance metrics to detect any degradation or concept drift.<br> Retrain the model periodically with new data to keep it up-to-date and maintain its accuracy.<br> Conduct regular model audits and updates as needed. |
| 11. Iteration and Improvement | Regularly revisit and refine the model as new insights are gained, data quality improves, or new techniques emerge.<br> Incorporate feedback from stakeholders and address any limitations or shortcomings.<br> Continuously experiment with new algorithms or approaches to improve the model's performance and adapt to evolving requirements. |

#### Types of learning:
- Supervised (inductive) learning
    - given training data, desired outputs (labels)
    - learn a function that maps inputs to outputs
    - types:
        - classification (predict class or category, discrete value)
            - binary classification (2 classes)
            - multi-class classification (more than 2 classes)
        - regression (predict continuous value)
- Unsupervised learning
    - given training data, no desired outputs
    - learn a function that describes hidden structure from unlabeled data
- Semi-supervised learning
    - given training data, some desired outputs
    - learn a function that maps inputs to outputs
- Reinforcement learning
    - rewards from sequence of actions
    - learn a function that maximizes a reward signal

High level, general comparison table:

|                       | Supervised Learning               | Unsupervised Learning          | Semi-Supervised Learning             | Reinforcement Learning                    |
|-----------------------|-----------------------------------|--------------------------------|--------------------------------------|------------------------------------------|
| Data                  | Labelled                          | Unlabelled                     | Mix of Labelled and Unlabelled       | Depends on State and Reward              |
| Task                  | Prediction                        | Pattern Recognition            | Prediction                           | Sequential Decision Making               |
| Example Algorithms    | Linear Regression, SVM, Neural Networks | Clustering, K-Means, PCA | Self-Training, Multi-View Training   | Q-Learning, SARSA, DQN                    |
| Feedback              | Direct                            | None                           | Partial                              | Reward-based                              |
| Goal                  | Minimize Error on Given Labels    | Discover Hidden Structure      | Better Generalization Accuracy       | Maximize Cumulative Reward                |
| Typical Use Case      | Image Recognition, Email Spam Detection | Customer Segmentation, Anomaly Detection | Web Content Classification, Bioinformatics | Game AI, Robot Navigation, Real-time Decisions |
| Training Efficiency   | High (due to direct feedback)     | Medium (no feedback)           | Varies (depends on labeled/unlabeled ratio) | Typically slow, trial and error-based      |
| Complexity of Problem | Low-Medium                        | High                           | Medium-High                          | High                                      |
| Real-time Adaptation  | Not Typically                     | Not Typically                  | Not Typically                        | Yes, using online learning                 |


# Machine Learning Formulas

### What is Machine Learning?
A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.
Example: playing checkers.
$E$ = the experience of playing many games of checkers
$T$ = the task of playing checkers
$P$ = the probability that the program will win the next game

Traditional programming : (data) + (program) = (output)
Machine Learning : (data) + (output) = (program)

ML is used when:
- lots of hand-tuning, long lists of rules, or hard to define rules
- complex / fluctuating environment
- expert knowledge does not exist, or is difficult to obtain
- models based on huge amount of data, must be customized to each individual

It is based on learning a function $h$ that approximates $y$ as a function of $x$: $$h: X \rightarrow Y$$ where $X$ is the input space and $Y$ is the output space.
Steps:
1. Define the objective of the problem
2. Collect data
3. Prepare / preprocess data
4. Exploratory data analysis
5. Building a machine learning model
6. Model evaluation and optimization
7. Deploy the model, predict new values

## Module 2

### Data
data properties:
- value : how useful the data is for the problem
- volume : how much data to be analyzed and processed
- variety : what types of data (structured, unstructured, semi-structured)
- velocity : how fast data is generated and processed
- veracity : how accurate and reliable the data is

#### Data quality
Have to check:
- accuracy : should reflect reality
- completeness : should have all the required data
- consistency : should be consistent with other data
- validity : should be valid according to the domain
- uniqueness : should be unique, no duplications or redundancies
- timeliness : should be up-to-date

Issues might be:
- missing values
    - data not collected
    - variable not applicable for that observation
- outliers 
    - data point that differs significantly from other observations
- inconsistent data
- invalid data
    - can be due to:
        - measurement error
        - experimental error
        - data corruption
        - data entry error
        - natural variation
- noise
    - extraneous object, modification, or event that interferes with the data
- duplicate data
    - when merging data from heterogeneous sources
- biased / unrepresentative data
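
A few of the issues above can be checked mechanically. The sketch below counts missing values, exact duplicate rows, and outliers in one numeric column. The function name, the toy rows, and the 1.5 × IQR outlier rule are assumptions made for illustration; the notes do not prescribe a particular detection method.

```python
import statistics


def quality_report(rows, numeric_col):
    """Count missing values, duplicate rows, and IQR outliers in one column."""
    # Missing values are represented by None in this sketch.
    missing = sum(1 for row in rows for value in row if value is None)
    # Rows are tuples, so exact duplicates collapse inside a set.
    duplicates = len(rows) - len(set(rows))
    values = [row[numeric_col] for row in rows if row[numeric_col] is not None]
    # Assumed rule: flag points beyond 1.5 * IQR from the quartiles.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"missing": missing, "duplicates": duplicates, "outliers": outliers}


rows = [
    ("a", 10), ("b", 11), ("c", 11), ("d", 12),
    ("e", 12), ("f", 13), ("g", 100),  # 100 is far from the other values
    ("a", 10),                         # exact duplicate of the first row
    ("h", None),                       # measurement not collected
]
report = quality_report(rows, numeric_col=1)
```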

#### Data types


## Linear Regression

- linear regression can be used to fit a model to an observed dataset of values of the response (dependent variable) and explanatory variables (independent variables / features)
- $x^{(i)}$ is the vector of input variables / features, $x^{(i)} = \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix} _{((n+1) \times 1)}$, where $n$ is the number of features, with $x_0^{(i)} = 1$ being the intercept term. 
- $y^{(i)}$ is the output variable / target.
- $(x^{(i)}, y^{(i)})$ is a training example.
- $\{(x^{(i)}, y^{(i)}) : i = 1, \dots, m\}$ is the training set, where $m$ is the number of examples in the training set.

Goal : to learn a function $h(x) : \text{space of input values} \rightarrow \text{space of output values}$, so that $h(x)$ is a good predictor for the corresponding value $y$

#### Equations

If we decide to approximate $y$ as a linear function of $x$, then for the $i^{th}$ training example:

$$\hat{y}^{(i)} =  h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}_1 + \theta_2 x^{(i)}_2 + \dotsm + \theta_n x^{(i)}_n = \sum_{j=0}^n \theta_j x^{(i)}_j$$

This is called **simple / univariate** linear regression when $n = 1$, and **multiple** linear regression when $n > 1$. Both are different from **multivariate** regression, which pertains to multiple dependent variables and multiple independent variables. [Link](https://stats.stackexchange.com/q/2358/331716)

Then we can define the cost function as:
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$$

This is the **ordinary least squares (OLS)** cost function; minimizing it minimizes the **mean squared error (MSE)**.

Goal : to choose $\theta$ so as to minimize $J(\theta)$

#### Vectorized

$$ X = \begin{bmatrix} - \left( x^{(1)} \right)^T - \\ - \left( x^{(2)} \right)^T - \\ \vdots \\ - \left( x^{(m)} \right)^T - \end{bmatrix}_{(m \times (n+1))} , \qquad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}_{((n+1) \times 1)} \qquad and \qquad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} _{(m \times 1)}$$

Then the vector of predictions, 

$$ \hat{y} =  X\theta = \begin{bmatrix} - \left( x^{(1)} \right)^T\theta - \\ - \left( x^{(2)} \right)^T\theta - \\ \vdots \\ - \left( x^{(m)} \right)^T\theta - \end{bmatrix}_{(m \times 1)} $$

We can rewrite the least-squares cost as follows, replacing the explicit sum by matrix multiplication:

$$J(\theta) = \frac{1}{2m} (X\theta - y)^T(X\theta - y)$$
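
As a rough illustration, the cost can be evaluated in plain Python with `X` stored as a list of rows (each row starting with the intercept term $x_0 = 1$). The helper names and the toy data, generated from $y = 1 + 2x$, are assumptions for this sketch.

```python
def predict(theta, x):
    """h_theta(x) = sum_j theta_j * x_j for one example x (with x[0] == 1)."""
    return sum(t * xj for t, xj in zip(theta, x))


def cost(theta, X, y):
    """Least-squares cost J(theta) = 1/(2m) * sum of squared residuals."""
    m = len(X)
    return sum((predict(theta, x) - yi) ** 2 for x, yi in zip(X, y)) / (2 * m)


# Toy data generated from y = 1 + 2x, so theta = [1, 2] fits it exactly.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [1.0, 3.0, 5.0, 7.0, 9.0]

cost_at_truth = cost([1.0, 2.0], X, y)  # residuals are all zero
cost_at_zero = cost([0.0, 0.0], X, y)   # equals sum(y^2) / (2m)
```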

#### Finding coefficients for simple linear regression

The simple linear regression model is $y = \theta_0 + \theta_1 x$, where $\theta_0$ is the intercept and $\theta_1$ is the slope. The coefficients are found by minimizing the sum of squared residuals (SSR), which is the sum of the squares of the differences between the observed dependent variable ($y$) and those predicted by the linear function ($\hat{y}$).

Make a table with $x_i$, $y_i$, $x_i - \bar{x}$, $y_i - \bar{y}$, $(x_i - \bar{x})^2$, $(x_i - \bar{x})(y_i - \bar{y})$.

Equation for $\theta_1$: $$\theta_1 = \frac{\sum_{i=1}^m (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sum_{i=1}^m (x^{(i)} - \bar{x})^2}$$
Equation for $\theta_0$: $$\theta_0 = \bar{y} - \theta_1 \bar{x} $$
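
The two formulas translate directly into code. This is a minimal sketch; the function name and the five sample points are made up for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (theta0, theta1) minimizing the sum of squared residuals."""
    m = len(xs)
    x_bar = sum(xs) / m
    y_bar = sum(ys) / m
    # Numerator and denominator of the slope formula.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    theta1 = s_xy / s_xx
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1


# Illustrative data: x_bar = 3, y_bar = 4, s_xy = 6, s_xx = 10.
theta0, theta1 = fit_simple_linear_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

For these points the slope is $6 / 10 = 0.6$ and the intercept is $4 - 0.6 \cdot 3 = 2.2$.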

#### Assumptions for linear regression
1. dependent and independent variables are linearly related
2. independent variables are not random
3. residuals are normally distributed
4. residuals are homoscedastic (constant variance)

### Polynomial Regression

Polynomial regression is a form of regression analysis in which the relationship between the independent variable $x$ and the dependent variable $y$ is modelled as an $n^{th}$ degree polynomial in $x$. Polynomial regression fits a nonlinear relationship between the value of $x$ and $y$.
- Simplest form of polynomial regression is a quadratic equation, $y = \theta_0 + \theta_1 x + \theta_2 x^2$

### Normal Equation

The normal equation is an analytical solution to the linear regression problem with an ordinary least squares cost function. That is, to find the value of $\theta$ that minimizes $J({\theta})$, take the [gradient](https://mathinsight.org/gradient_vector) of $J(\theta)$ with respect to $\theta$ and set it to $0$, i.e. $\nabla_\theta J(\theta) = 0$.

Solving for $\theta$, we get 

$$\theta = (X^TX)^{-1} X^Ty$$

[Here](https://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression/) is a post containing the derivation of the normal equation.
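
A dependency-free sketch of the normal equation follows. Rather than forming the inverse explicitly, it solves the equivalent linear system $(X^TX)\theta = X^Ty$ by Gaussian elimination with partial pivoting, which is a substitution made here for numerical convenience. The function name and the toy data are assumptions.

```python
def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for theta; X is a list of rows."""
    n = len(X[0])
    # Build A = X^T X (n x n) and b = X^T y (length n).
    A = [[sum(row[j] * row[k] for row in X) for k in range(n)] for j in range(n)]
    b = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(n)]
    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; no unique solution exists")
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution on the upper-triangular system.
    theta = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(A[r][c] * theta[c] for c in range(r + 1, n))
        theta[r] = (b[r] - s) / A[r][r]
    return theta


# Toy data from y = 1 + 2x; the first column is the intercept term x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = normal_equation(X, y)
```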

### Gradient Descent

Gradient descent is based on the observation that if the function $J({\theta})$ is differentiable in a neighborhood of a point $\theta$, then $J({\theta})$ decreases fastest if one goes from $\theta$ in the direction of the negative gradient of $J({\theta})$ at $\theta$. 

Thus if we repeatedly apply the following update rule, ${\theta := \theta - \alpha \nabla J(\theta)}$ for a sufficiently small value of **learning rate**, $\alpha$, we will eventually converge to a value of $\theta$ that minimizes $J({\theta})$.

For a specific parameter $\theta_j$, the update rule is 

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J({\theta}) $$

Using the definition of $J({\theta})$, we get

$$\frac{\partial}{\partial \theta_j} J({\theta}) = \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$$

Therefore, we repeatedly apply the following update rule:

$\qquad Loop \: \{$
    $\qquad \qquad \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \qquad \text{simultaneously update } \theta_j \text{ for all } j$
$\qquad \}$

This method looks at every example in the entire training set on every step, and is called **batch gradient descent (BGD)**. 

When the cost function $J$ is convex, all local minima are also global minima, so in this case gradient descent can converge to the global solution.
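
A minimal pure-Python sketch of batch gradient descent for linear regression. The function name, the toy data, and the defaults `alpha=0.05` and `iterations=5000` are illustrative assumptions; a suitable learning rate depends on the scale of the features.

```python
def batch_gradient_descent(X, y, alpha=0.05, iterations=5000):
    """Minimize the least-squares cost using the full training set per step."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        # Residuals h_theta(x_i) - y_i, computed with the current theta.
        errors = [sum(t * xj for t, xj in zip(theta, x)) - yi
                  for x, yi in zip(X, y)]
        # Gradient component j is (1/m) * sum_i error_i * x_ij.
        gradient = [sum(e * x[j] for e, x in zip(errors, X)) / m
                    for j in range(n)]
        # Simultaneous update of every theta_j.
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta


# Toy data from y = 1 + 2x; the first column is the intercept term x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
theta = batch_gradient_descent(X, y)
```

Building the whole gradient before touching `theta` is what makes the update simultaneous for all $j$.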

There is an alternative to BGD that also works very well:

$\qquad Loop \: \{$
    $\qquad \qquad for \: i=1 \: to \: m \: \{$
    $\qquad \qquad \qquad \theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \qquad \text{simultaneously update } \theta_j \text{ for all } j$
    $\qquad \qquad \}$
$\qquad \}$

This is **stochastic gradient descent (SGD)** (also incremental gradient descent), where we repeatedly run through the training set, and for each training example, we update the parameters using gradient of the error for that training example only.

Whereas BGD has to scan the entire training set before taking a single step, SGD can start making progress right away with each example it looks at. 

Often, SGD gets $\theta$ *close* to the minimum much faster than BGD. However it may never *converge* to the minimum, and $\theta$ will keep oscillating around the minimum of $J(\theta)$; but in practice these values are reasonably good approximations. Also, by slowly decreasing $\alpha$ to $0$ as the algorithm runs, $\theta$ converges to the global minimum rather than oscillating around it.
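
The per-example update can be sketched as follows. Shuffling the visiting order each epoch is a common practice assumed here; the loop above visits examples in index order. The function name, toy data, and default hyperparameters are also assumptions.

```python
import random


def stochastic_gradient_descent(X, y, alpha=0.05, epochs=2000, seed=0):
    """Update theta after every single training example."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    theta = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # assumption: random visiting order per epoch
        for i in order:
            error = sum(t * xj for t, xj in zip(theta, X[i])) - y[i]
            # Gradient of the squared error of example i only.
            theta = [t - alpha * error * xj for t, xj in zip(theta, X[i])]
    return theta


# Noise-free toy data from y = 1 + 2x, so every per-example gradient
# vanishes at the optimum and theta settles there instead of oscillating.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
theta = stochastic_gradient_descent(X, y)
```

With noisy data the same loop would keep oscillating around the minimum unless $\alpha$ is decayed over time.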

#### Underfitting and Overfitting

Error(model) = Bias(model)² + Variance(model) + Irreducible Error

Bias : how far off in general the model is from the actual value. High bias means the model is not complex enough to capture the underlying trend of the data. Low bias means the model is complex enough to capture the underlying trend of the data.

Variance : how much the model changes based on the training data. High variance means the model changes a lot based on the training data. Low variance means the model does not change much based on the training data.

**Underfitting** – High bias and low variance
- model does not fit the training data and does not generalize well to unseen data

Techniques to reduce underfitting :
1. Increase model complexity
2. Increase number of features, performing feature engineering
3. Remove noise from the data.
4. Increase the number of epochs or increase the duration of training to get better results.

**Overfitting** – High variance and low bias
- model fits the training data well, but does not generalize well to unseen data

Techniques to reduce overfitting :
1. Increase training data (data augmentation)
2. Reduce model complexity.
3. Early stopping during the training phase (monitor the validation loss during training and stop as soon as it begins to increase).
4. Ridge Regularization and Lasso Regularization
5. Use dropout for neural networks to tackle overfitting.
6. Ensemble learning (bagging, boosting, stacking)
7. Cross-validation, holdout validation, k-fold cross-validation

<img src="https://i.imgur.com/b4CWHHf.png" width="500" style="display: block; margin-left: auto; margin-right: auto;">

Simple models like linear and logistic regression are prone to underfitting, whereas complex models like decision trees and neural networks are prone to overfitting.


### Adding regularization

Regularization is a technique to reduce overfitting in machine learning. This technique discourages learning a more complex or flexible model, by shrinking the parameters towards $0$.

We can regularize machine learning methods through the cost function using $L1$ regularization or $L2$ regularization. $L1$ regularization adds an absolute penalty term to the cost function, while $L2$ regularization adds a squared penalty term to the cost function. A model with $L1$ norm for regularisation is called **lasso regression**, and one with (squared) $L2$ norm for regularisation is called **ridge regression**. [Link](https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261)

$$J(\theta)_{L1} = \frac{1}{2m} \left( \sum_{i=1}^m \left( h_\theta\left( x^{(i)} \right) - y^{(i)} \right)^2 \right) + \frac{\lambda}{2m} \left( \sum_{j=1}^n |\theta_j| \right)$$

$$J(\theta)_{L2} = \frac{1}{2m} \left( \sum_{i=1}^m \left( h_\theta\left( x^{(i)} \right) - y^{(i)} \right)^2 \right) + \frac{\lambda}{2m} \left( \sum_{j=1}^n \theta_j^2 \right)$$

The partial derivatives of the cost function for lasso linear regression are given below. Since $|\theta_j|$ is not differentiable at $0$, the sign term is a subgradient there, conventionally taken as $0$:

\begin{align}
& \frac{\partial J(\theta)_{L1}}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^m \left( h_\theta \left(x^{(i)} \right) - y^{(i)} \right) x_0^{(i)} 
& \qquad \text{for } j = 0 \\
& \frac{\partial J(\theta)_{L1}}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m \left( h_\theta \left( x^{(i)} \right) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{2m} \operatorname{sign} (\theta_j)
& \qquad \text{for } j \ge 1
\end{align}

Similarly for ridge linear regression,

\begin{align}
& \frac{\partial J(\theta)_{L2}}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^m \left( h_\theta \left(x^{(i)} \right) - y^{(i)} \right) x_0^{(i)} 
& \qquad \text{for } j = 0 \\
& \frac{\partial J(\theta)_{L2}}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m \left( h_\theta \left( x^{(i)} \right) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j 
& \qquad \text{for } j \ge 1
\end{align}

These equations can be substituted into the general gradient descent update rule to get the specific lasso / ridge update rules.
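
As a sketch of that substitution, the function below returns the regularized gradient for either penalty, leaving $\theta_0$ unpenalized. The function names and the `penalty` string argument are assumptions for illustration, and the sign at $\theta_j = 0$ is taken as $0$.

```python
def sign(v):
    """Return -1, 0, or 1; the value at 0 is a conventional subgradient choice."""
    return (v > 0) - (v < 0)


def regularized_gradient(theta, X, y, lam, penalty):
    """Gradient of the lasso ('l1') or ridge ('l2') cost for linear regression."""
    m, n = len(X), len(theta)
    errors = [sum(t * xj for t, xj in zip(theta, x)) - yi
              for x, yi in zip(X, y)]
    grad = [sum(e * x[j] for e, x in zip(errors, X)) / m for j in range(n)]
    for j in range(1, n):  # theta_0 (the intercept) is not penalized
        if penalty == "l1":
            grad[j] += lam / (2 * m) * sign(theta[j])
        elif penalty == "l2":
            grad[j] += lam / m * theta[j]
        else:
            raise ValueError("penalty must be 'l1' or 'l2'")
    return grad


def gradient_step(theta, X, y, alpha, lam, penalty):
    """One gradient descent update using the regularized gradient."""
    grad = regularized_gradient(theta, X, y, lam, penalty)
    return [t - alpha * g for t, g in zip(theta, grad)]


# theta = [0, 1] fits this data exactly, so only the penalty term remains.
X = [[1.0, 1.0], [1.0, 2.0]]
y = [1.0, 2.0]
ridge_grad = regularized_gradient([0.0, 1.0], X, y, lam=4.0, penalty="l2")
lasso_grad = regularized_gradient([0.0, 1.0], X, y, lam=4.0, penalty="l1")
```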

Elastic Net regression is a combination of lasso and ridge regression. Its regularization term is a combination of the $L1$ and $L2$ regularization terms. The cost function is:
$$ J(\theta)_{ElasticNet} = \frac{1}{2m} \left( \sum_{i=1}^m \left( h_\theta\left( x^{(i)} \right) - y^{(i)} \right)^2 \right) + \frac{\lambda_1}{2m} \left( \sum_{j=1}^n |\theta_j| \right) + \frac{\lambda_2}{2m} \left( \sum_{j=1}^n \theta_j^2 \right)$$

#### Note:
- $\theta_0$ is NOT constrained
- scale the data before using Ridge regression
- $\lambda$ is a hyperparameter: a larger value penalizes the weights more strongly and results in a flatter, smoother model
- Lasso tends to completely eliminate the weights of the least important features (i.e., setting them to 0) and it automatically performs feature selection
- Elastic Net is a third way to constrain the weights, combining Ridge and Lasso
- When to use which?
    * Ridge is a good default
    * If you suspect some features are not useful, use Lasso or Elastic Net
    * When there are more features than training examples, prefer Elastic Net

### Metrics for evaluating regression models
1. Mean Absolute Error (MAE)
    - average of absolute differences between predictions and actual values
    - robust to outliers, does not penalize large errors like MSE, not differentiable at 0
    - $$ MAE = \frac{1}{m} \sum_{i=1}^m |y^{(i)} - \hat{y}^{(i)}| $$
2. Mean Squared Error (MSE)
    - average of squared differences between predictions and actual values
    - penalizes large errors more than MAE, more sensitive to outliers, differentiable
    - $$ MSE = \frac{1}{m} \sum_{i=1}^m (y^{(i)} - \hat{y}^{(i)})^2 $$
3. Mean Absolute Percentage Error (MAPE)
    - average of absolute percentage differences between predictions and actual values
    - $$ MAPE = \frac{1}{m} \sum_{i=1}^m \left| \frac{y^{(i)} - \hat{y}^{(i)}}{y^{(i)}} \right| $$
4. Root Mean Squared Error (RMSE)
    - square root of MSE
    - $$ RMSE = \sqrt{\frac{1}{m} \sum_{i=1}^m (y^{(i)} - \hat{y}^{(i)})^2} $$
5. R-squared
    - proportion of variance in the dependent variable that is predictable from the independent variable(s)
    - $$ R^2 = 1 - \frac{\sum_{i=1}^m (y^{(i)} - \hat{y}^{(i)})^2}{\sum_{i=1}^m (y^{(i)} - \bar{y})^2} $$
    - value varies between 0 and 1 usually, where 0 indicates that the model explains none of the variability of the response data around its mean, and 1 indicates that the model explains all the variability of the response data around its mean
    - if you have negative $R^2$, it means that your model is worse than the mean model
    - tends to increase when more predictors are added to the model, even if they are unrelated to the response
        - this could be misleading, because the model may not actually have a better fit
6. Adjusted R-squared
    - penalizes the addition of unnecessary predictors to the model
    - $$ R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1} $$
    - where $n$ is the number of observations and $k$ is the number of predictors
    - $R^2_{adj}$ increases only when the increase in $R^2$ is more than what is expected to happen by chance
    - $R^2_{adj}$ decreases when the model contains useless predictors
    - $R^2_{adj}$ can be negative, and it is always less than or equal to $R^2$

The metrics with squared errors (MSE, RMSE) are more commonly used than MAE and MAPE, because they are differentiable and penalize large errors more. RMSE is the most popular metric, because it is interpretable in the "y" units.
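
The formulas above can be computed together in a few lines. This is a sketch with an assumed function name and made-up sample values. In the code, `m` is the number of observations (called $n$ in the adjusted $R^2$ formula) and `k` is the number of predictors.

```python
import math


def regression_metrics(y_true, y_pred, k):
    """Return MAE, MSE, MAPE, RMSE, R^2 and adjusted R^2 as a dict."""
    m = len(y_true)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / m
    mse = sum(e * e for e in errors) / m
    # MAPE divides by the true value, so it is undefined if any y_true is 0.
    mape = sum(abs(e / yt) for e, yt in zip(errors, y_true)) / m
    rmse = math.sqrt(mse)
    y_bar = sum(y_true) / m
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)
    r2 = 1 - ss_res / ss_tot
    r2_adj = 1 - (1 - r2) * (m - 1) / (m - k - 1)
    return {"mae": mae, "mse": mse, "mape": mape, "rmse": rmse,
            "r2": r2, "r2_adj": r2_adj}


# Illustrative values only.
metrics = regression_metrics([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0], k=1)
```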

### Cross-validation

Cross-validation is a technique for evaluating ML models by training several ML models on subsets of the available input data and evaluating them on the complementary subset of the data. Types:
1. holdout validation set approach
    - split the data into training and validation sets
    - train the model on the training set
    - evaluate the model on the validation set
2. leave-p-out cross-validation
    - use $p$ observations as the testing set and the remaining observations as the training set
    - repeat for every possible choice of $p$ testing observations, and average the metric over all splits
3. k-fold cross-validation
    - split the data into $k$ folds
    - for each fold, train the model on the remaining $k-1$ folds and evaluate it on the current fold
    - average the metric with different folds
4. stratified k-fold cross-validation
    - same as k-fold cross-validation, but the folds are made by preserving the percentage of samples for each class
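
A sketch of how k-fold splits can be generated and averaged. The function names are assumptions, and the folds are contiguous index ranges, so the data is assumed to have been shuffled beforehand. `fit_and_score` is a hypothetical callback that trains on the training indices and returns a metric for the validation indices.

```python
def k_fold_indices(m, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [m // k + (1 if i < m % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, m))
        yield train, val
        start += size


def cross_validate(m, k, fit_and_score):
    """Average the score returned by fit_and_score over the k folds."""
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(m, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(7, 3))
```

For 7 examples and 3 folds, the validation folds have sizes 3, 2 and 2, and every index appears in exactly one of them.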

## Basis functions

The simplest model for linear regression is given by $y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \dotsm + w_D x_D$
- key property of this model is that it is a linear function of the parameters $w_0, \dotsm, w_D$
- basis functions can be used to extend linear models to make them non-linear
- basis functions are fixed and known functions of the input variables
- the model is still linear in the parameters, but non-linear in the input variables $$ y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^M w_j \phi_j(\mathbf{x}) $$ where $\phi_j(\mathbf{x})$ are the basis functions
- using linear combinations of fixed nonlinear functions of the input variables, we can model a wide range of nonlinear functions
- examples of basis functions:
    - polynomial basis functions $$ \phi_j(x) = x^j $$
    - Gaussian basis functions $$ \phi_j(x) = \exp \left\{ - \frac{(x - \mu_j)^2}{2s^2} \right\} $$
    - sigmoidal basis functions $$ \phi_j(x) = \sigma \left( \frac{x - \mu_j}{s} \right) $$ where $\sigma(a) = \frac{1}{1 + \exp(-a)}$ is the logistic sigmoid function
- advantages of basis functions:
    - closed-form solution for the parameters
    - non linear models mapping input variables to output variables through basis functions
- disadvantages:
    - assumption that the basis functions are fixed and not learned
    - curse of dimensionality, to capture the input space with a fine grid of basis functions, the number of basis functions grows exponentially with the number of input variables
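
The three example basis functions can be sketched for a scalar input as below. The function names, the `centers` argument (the $\mu_j$ values) and the sample weights are assumptions. The polynomial basis includes $\phi_0(x) = 1$ so that $w_0$ is handled like any other weight.

```python
import math


def polynomial_basis(x, degree):
    """phi_j(x) = x**j for j = 0..degree (phi_0 = 1 carries the bias w_0)."""
    return [x ** j for j in range(degree + 1)]


def gaussian_basis(x, centers, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)) for each centre mu_j."""
    return [math.exp(-((x - mu) ** 2) / (2 * s ** 2)) for mu in centers]


def sigmoidal_basis(x, centers, s):
    """phi_j(x) = sigma((x - mu_j) / s) with the logistic sigmoid sigma."""
    return [1 / (1 + math.exp(-(x - mu) / s)) for mu in centers]


def model(x, w, features):
    """y(x, w) = sum_j w_j * phi_j(x): linear in w, non-linear in x."""
    return sum(wj * phi for wj, phi in zip(w, features(x)))


# y = 1 + 0*x + 2*x^2 evaluated at x = 3 using a degree-2 polynomial basis.
y_poly = model(3.0, [1.0, 0.0, 2.0], lambda x: polynomial_basis(x, 2))
```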


## Discriminative classifiers

### Generative vs Discriminative classifiers

| Aspect                     | Discriminative Models                         | Generative Models                            |
|----------------------------|----------------------------------------------|---------------------------------------------|
| Objective                  | Focuses on learning the decision boundary that separates different classes or categories in the data. | Focuses on modeling the joint distribution of features and labels to generate new data samples. |
| Modeling Approach          | Directly models the conditional probability of the class labels given the features (P(y\|x)). | Models the joint probability of both class labels and features (P(x, y)). |
| Use Case                   | Well-suited for classification tasks where the primary goal is to predict the class labels of new data points. | Suitable for classification tasks and can also be used for data generation and sampling. |
| Data Generation            | Cannot be used for generating new data samples as it only models the decision boundary. | Can be used to generate new data samples by sampling from the learned joint distribution. |
| Training Data              | Requires labeled data for learning the conditional probabilities. | Requires labeled data for estimating both the class priors and the class-conditional probabilities. |
| Dimensionality Reduction   | Not well-suited for dimensionality reduction tasks. | Can be used for dimensionality reduction tasks, such as generating low-dimensional representations of data. |
| Example Algorithms         | Logistic Regression, Support Vector Machines (SVM), Neural Networks (for classification). | Naive Bayes, Gaussian Mixture Models, Hidden Markov Models. |


Types of classifiers:
1. Linear classifiers
    - classes are separated by a linear decision boundary, if for a given input $x$, $\theta^Tx \ge 0$ then $y = 1$, else $y = 0$
    - $\theta$'s are learned from the training data during the training phase, then used to classify new data
    - examples:
        - logistic regression
        - support vector machines
2. Non-linear classifiers
    - classes are separated by a non-linear decision boundary
    - examples:
        - Decision trees
        - Random forests
        - Neural networks

## Logistic Regression
- transforms the output of a linear regression model into a probability by applying the logistic function (sigmoid function) $$ \sigma(z) = \frac{1}{1 + e^{-z}} $$
- the output of the logistic function is interpreted as the probability of the input belonging to the positive class, $$ h_{\theta}(x) = P(y = 1 \mid x) = \sigma(\theta^Tx) = \frac{1}{1 + e^{-\theta^Tx}} $$
    - if $\theta^Tx = 0$, then $P(y = 1 \mid x) = 0.5$
    - if $\theta^Tx \gg 0$, then $P(y = 1 \mid x) \approx 1$
    - if $\theta^Tx \ll 0$, then $P(y = 1 \mid x) \approx 0$
    - here $f(x) = \theta^Tx$ is the logit (log-odds) of $P(y = 1 \mid x)$
- works by determining the weights $\theta$ such that the predicted probability is maximized for the positive class and minimized for the negative class
- you maximize the log-likelihood function, $$ \ell(\theta) = \sum_{i=1}^m \left[ y^{(i)} \log P(y^{(i)} = 1 \mid x^{(i)}) + (1 - y^{(i)}) \log P(y^{(i)} = 0 \mid x^{(i)}) \right] $$ 
    - if $y^{(i)} = 1$, then $P(y^{(i)} = 1 \mid x^{(i)})$ is maximized, which happens when $\theta^Tx^{(i)}$ is maximized
    - if $y^{(i)} = 0$, then $P(y^{(i)} = 0 \mid x^{(i)})$ is maximized, which happens when $\theta^Tx^{(i)}$ is minimized
- this method is called maximum likelihood estimation (MLE)
- Logit function
    - the logit function is the inverse of the logistic function, $$ \text{logit}(p) = \log \left( \frac{p}{1 - p} \right) = \sigma^{-1}(p) $$
    - because of this, logit is also called log-odds, because it is the logarithm of the odds, $$ \text{odds}(p) = \frac{p}{1 - p} $$
- dependent variable follows Bernoulli distribution
- cost function is the negative log-likelihood averaged over the $m$ examples, $$ J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log P(y^{(i)} = 1 \mid x^{(i)}) + (1 - y^{(i)}) \log P(y^{(i)} = 0 \mid x^{(i)}) \right] $$
- gradient descent update rule is $$ \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \qquad \text{for } j \ge 1 $$ but here the hypothesis function is different from that of linear regression, as defined above
- regularization terms are added exactly as in linear regression, but the data term is the cross-entropy cost $J(\theta)$ defined above rather than the squared error
    - L1 regularization : $$ J(\theta)_{L1} = J(\theta) + \frac{\lambda}{2m} \left( \sum_{j=1}^n |\theta_j| \right) $$
    - L2 regularization : $$ J(\theta)_{L2} = J(\theta) + \frac{\lambda}{2m} \left( \sum_{j=1}^n \theta_j^2 \right) $$
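
The gradient descent update rule above can be sketched with NumPy as follows. This is a minimal illustration, not a reference implementation: the toy data, the learning rate `alpha`, and the iteration count are assumptions chosen for the example, and the update is applied to every $\theta_j$ including the bias term.

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Cross-entropy cost J(theta), averaged over the m examples."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def fit_logistic(X, y, alpha=0.1, iterations=5000):
    """Batch gradient descent; X must already contain a bias column of ones."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        gradient = X.T @ (sigmoid(X @ theta) - y) / m
        theta -= alpha * gradient
    return theta

# Toy data (assumed for illustration): bias column x_0 = 1 plus one feature
X = np.array([[1, 0.5], [1, 1.0], [1, 1.5], [1, 3.0], [1, 3.5], [1, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

theta = fit_logistic(X, y)
probabilities = sigmoid(X @ theta)               # P(y = 1 | x) for each row
predictions = (probabilities >= 0.5).astype(int)  # threshold at theta^T x = 0
```

Thresholding the probability at 0.5 is the same as checking the sign of $\theta^Tx$, which is why logistic regression is a linear classifier.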

### Types of logistic regression
1. Binary logistic regression
    - the dependent variable has only two possible outcomes
    - the goal is to determine the probability that an observation is in a particular category
2. Multinomial logistic regression
    - the dependent variable has three or more unordered categories
    - the goal is to determine the probability that an observation is in each category
3. Ordinal logistic regression
    - the dependent variable has three or more ordered categories
    - the goal is to determine the probability that an observation is in each category

### Metrics for evaluating classification models

Confusion Matrix:
<img src="https://i.imgur.com/v4FpYTm.png" width="400" style="display: block; margin-left: auto; margin-right: auto;">

Accuracy = $\frac{TP + TN}{TP + TN + FP + FN}$

Error Rate = $\frac{FP + FN}{TP + TN + FP + FN} = 1 - Accuracy$

Precision = $\frac{TP}{TP + FP}$

Recall / Sensitivity / TPR = $\frac{TP}{TP + FN}$

TNR / Specificity = $\frac{TN}{TN + FP}$

FPR / Fall-out / Type I error = $\frac{FP}{FP + TN} = 1 - TNR$

Type II error / False Negative Rate = $\frac{FN}{TP + FN} = 1 - Recall$

F1 Score = $\frac{2 * Precision * Recall}{Precision + Recall}$
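
As a quick check of the formulas above, the metrics can be computed directly from the four confusion matrix counts. The function name and the example counts below are assumptions made for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics listed above from confusion matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "precision": precision,
        "recall": recall,                # sensitivity / TPR
        "specificity": tn / (tn + fp),   # TNR
        "fpr": fp / (fp + tn),           # 1 - specificity
        "fnr": fn / (tp + fn),           # 1 - recall
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts: 50 actual positives, 50 actual negatives
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

With these counts the accuracy is 0.85, the recall is 0.8, and the specificity is 0.9. The sketch does not guard against zero denominators, for example a classifier that never predicts the positive class.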

#### ROC Curve

AUC = Area Under the ROC Curve : ROC Curves are used to see how well your classifier can separate positive and negative examples and to identify the best threshold for separating them. To be able to use the ROC curve, your classifier has to produce a ranking, that is, it should be able to score examples such that the ones with higher rank are more likely to be positive.
- y axis : TPR
- x axis : FPR

<img src="https://i.imgur.com/FICMCrT.png" width="600" style="display: block; margin-left: auto; margin-right: auto;">

Given a dataset and a classifier, you can plot the ROC curve by doing the following:
1. Rank the examples according to the classifier's output, from highest to lowest.
2. Start at (0,0).
3. For each example in the dataset:
    - If the example is positive, move $1/positive\_examples$ up.
    - If the example is negative, move $1/negative\_examples$ to the right.
4. The resulting curve is the ROC curve.
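
The plotting procedure above can be sketched as follows. The scores and labels are made-up values, the helper names are assumptions, and the sketch assumes there are no tied scores (ties would need to be handled as a single diagonal step).

```python
import numpy as np

def roc_curve_points(scores, labels):
    """Walk through examples from highest to lowest score, stepping up or right."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    ranked = np.asarray(labels)[order]
    n_pos = int(ranked.sum())
    n_neg = len(ranked) - n_pos
    fpr, tpr = [0.0], [0.0]              # start at (0, 0)
    for label in ranked:
        if label == 1:                   # positive example: move up
            fpr.append(fpr[-1])
            tpr.append(tpr[-1] + 1 / n_pos)
        else:                            # negative example: move right
            fpr.append(fpr[-1] + 1 / n_neg)
            tpr.append(tpr[-1])
    return np.array(fpr), np.array(tpr)

def area_under_curve(fpr, tpr):
    """Area under the piecewise-linear curve (trapezoidal rule)."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

# Hypothetical classifier scores and true labels
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]

fpr, tpr = roc_curve_points(scores, labels)
auc = area_under_curve(fpr, tpr)
```

Here 8 of the 9 (positive, negative) pairs are ranked correctly, so the AUC is 8/9, matching its interpretation as a ranking probability.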

<img src="https://habrastorage.org/files/267/36b/ff1/26736bff158a4d82893ff85b2022cc5b.gif" width="300" style="display: block; margin-left: auto; margin-right: auto;">

AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.

AUC is desirable for these two reasons:
- scale-invariant : measures how well predictions are ranked, rather than their absolute values.
- classification-threshold-invariant : measures the quality of the model’s predictions irrespective of classification threshold

Caveats, which limit usefulness of AUC in certain cases:
- scale-invariance : sometimes we really do need well calibrated probability outputs
- classification-threshold invariance : cases where there are wide disparities in the cost of false negatives vs. false positives

## Decision Trees
Advantages:
- inexpensive to construct
- extremely fast at classifying unknown records
- easy to interpret for small-sized trees
- can easily handle redundant or irrelevant attributes (unless the attributes are interacting)

Disadvantages:
- space of possible decision trees is exponentially large
- greedy approaches are often unable to find the best tree
- does not take into account interactions between attributes
- each decision boundary involves only a single attribute

#### Decision tree impurity measures

$${\displaystyle \text{Entropy}(t) = -\sum_{c=1}^{C} p(c|t) \log_2(p(c|t))}$$

$${\displaystyle Gini(t) = 1 - \sum_{c=1}^{C} [p(c|t)]^2}$$

$$\text{Misclassification error}(t) =  1 - \max_c[p(c|t)]$$

Where $t$ is the current node, $C$ is the number of classes, and $p(c|t)$ is the proportion of the samples that belong to class $c$ at node $t$.

<img src="https://i.imgur.com/0Zn675n.png" width="400" style="display: block; margin-left: auto; margin-right: auto;">

### Using information gain to decide split
1. Calculate the entropy of the target. $$ \text{Entropy}(S) = - \sum_{i=1}^n p_i \log_2 p_i $$
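2. Calculate the weighted entropy of the target after splitting on each feature. $$ \text{Entropy}(S, A) = \sum_{i=1}^{v} \frac{|S_i|}{|S|} \text{Entropy}(S_i) $$ where $S_i$ is the subset of $S$ for which feature $A$ takes its $i$-th value and $v$ is the number of distinct values of $A$. The entropy of each subset is $$ \text{Entropy}(S_i) = - \sum_{c=1}^n p_{c \mid i} \log_2 p_{c \mid i} $$ where $p_{c \mid i}$ is the proportion of samples in $S_i$ that belong to class $c$.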
3. Calculate the information gain for each feature. $$ \text{Information Gain}(S, A) = \text{Entropy}(S) - \text{Entropy}(S, A) $$
4. Choose the feature with the highest information gain.
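
A small sketch of these steps, working from class counts. The counts are those of the weather frequency table used later in these notes (5 No and 9 Yes overall, split by Sunny, Overcast, Rainy); the function names are assumptions.

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, subsets):
    """Entropy(S) minus the size-weighted entropy of the subsets S_i."""
    total = sum(parent_counts)
    weighted = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(parent_counts) - weighted

# [No, Yes] counts for the whole set and for each value of Weather
parent = [5, 9]
by_weather = [[2, 3], [0, 4], [3, 2]]   # Sunny, Overcast, Rainy

gain_weather = information_gain(parent, by_weather)
```

The entropy of the target is about 0.940 bits and the gain from splitting on Weather is about 0.247 bits. In a full tree builder this value would be computed for every candidate feature and the largest one chosen.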

### Using Gini index to decide split
1. Calculate the Gini index of the target. $$ \text{Gini}(S) = 1 - \sum_{i=1}^n p_i^2 $$
2. Calculate the weighted Gini index of the target after splitting on each feature. $$ \text{Gini}(S, A) = \sum_{i=1}^{v} \frac{|S_i|}{|S|} \text{Gini}(S_i) $$ where $S_i$ is the subset of $S$ for which feature $A$ takes its $i$-th value and $v$ is the number of distinct values of $A$. The Gini index of each subset is $$ \text{Gini}(S_i) = 1 - \sum_{c=1}^n p_{c \mid i}^2 $$ where $p_{c \mid i}$ is the proportion of samples in $S_i$ that belong to class $c$.
3. Calculate the Gini gain for each feature. $$ \text{Gini Gain}(S, A) = \text{Gini}(S) - \text{Gini}(S, A) $$
4. Choose the feature with the highest Gini gain.
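
The same calculation with the Gini index, again as a sketch with assumed function names and the class counts from the weather frequency table in these notes.

```python
def gini(counts):
    """Gini index of a class distribution given as a list of counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_gain(parent_counts, subsets):
    """Gini(S) minus the size-weighted Gini index of the subsets S_i."""
    total = sum(parent_counts)
    weighted = sum(sum(s) / total * gini(s) for s in subsets)
    return gini(parent_counts) - weighted

# [No, Yes] counts for the whole set and for Sunny, Overcast, Rainy
parent_counts = [5, 9]
weather_subsets = [[2, 3], [0, 4], [3, 2]]

weather_gini_gain = gini_gain(parent_counts, weather_subsets)
```

For this split the Gini index drops from about 0.459 to about 0.343, a Gini gain of roughly 0.116.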


## Instance based learning

- systems that learn the training examples by heart and then generalize to new instances based on some similarity measure
- called instance-based because it builds the hypotheses from the training instances
- also called lazy learning, memory-based learning, or case-based reasoning
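- prediction time complexity is $O(n)$ per query with a brute-force search, where $n$ is the number of training instances, because the query is compared against every stored instance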

### k-Nearest Neighbors (kNN) 
- non-parametric method, supervised learning
- used for classification
    - object is classified by a majority vote of its $k$ nearest neighbors
    - if $k=1$, then the object is simply assigned to the class of that single nearest neighbor
- used for regression
    - object's value is estimated by the average of its $k$ nearest neighbors
- $k$ is a hyperparameter that is usually chosen by cross-validation
- kNN is sensitive to the local structure of the data
    - if the data is not uniformly sampled, then the nearest neighbors will not be representative of the entire data set
    - in this case, the decision boundary will be irregular
- kNN is sensitive to the distance metric used (Euclidean, Manhattan, Minkowski, etc.)
- drawback when class distribution is skewed
    - if one class is much more frequent than the others, then the nearest neighbors will be dominated by the most frequent class
    - solution: use weighted voting, where the weights are the inverse of the distance to the query point
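
A brute-force sketch of kNN classification with optional inverse-distance weighting, as described above. The function name, the toy data, and the small constant added to the distance to avoid division by zero are assumptions for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3, weighted=False):
    """Classify `query` by a (possibly distance-weighted) vote of its k nearest neighbors."""
    distances = np.linalg.norm(X_train - query, axis=1)   # Euclidean distance
    nearest = np.argsort(distances)[:k]
    votes = Counter()
    for idx in nearest:
        # Inverse-distance weight; 1e-9 guards against a zero distance
        weight = 1.0 / (distances[idx] + 1e-9) if weighted else 1.0
        votes[int(y_train[idx])] += weight
    return votes.most_common(1)[0][0]

# Toy data: two well-separated groups of 2D points
X_train = np.array([[1, 1], [2, 2], [2, 3], [1, 2], [5, 6], [5, 7], [6, 7], [6, 6]], dtype=float)
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

label_a = knn_predict(X_train, y_train, np.array([2.0, 2.5]), k=3)
label_b = knn_predict(X_train, y_train, np.array([5.5, 6.5]), k=3, weighted=True)
```

Every query scans all training points, which is the $O(n)$ cost of lazy learning; real implementations often use spatial indexes to speed this up.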

### Locally Weighted Regression (LWR)
- memory-based method that performs a regression around a point of interest using only training data that are local to that point
- non-parametric method (parameters are computed individually for each query point)
- cost function is modified as $$J(\theta) = \frac{1}{2m} \sum_{i=1}^m w^{(i)} (h_\theta(x^{(i)}) - y^{(i)})^2$$ where $w^{(i)} = \exp \left( - \frac{(x^{(i)} - x)^2}{2\tau^2} \right)$ and $\tau$ is the bandwidth parameter that controls the degree of smoothing
    - if $(x^{(i)} - x)^2$ is small, then $w^{(i)}$ is close to 1, and vice versa
- training data must be available at the time of prediction
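
A minimal sketch of LWR for a single query point. Instead of gradient descent it solves the weighted least squares problem in closed form through the weighted normal equations $X^T W X \theta = X^T W y$, which minimizes the same cost (the constant factor $\frac{1}{2m}$ does not change the minimizer). The data and the value of `tau` are assumptions.

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Fit a local linear model around x_query and return its prediction.

    X and x_query include a leading bias entry of 1; weights use the other features.
    """
    squared_dist = np.sum((X[:, 1:] - x_query[1:]) ** 2, axis=1)
    w = np.exp(-squared_dist / (2 * tau ** 2))     # w^(i), close to 1 near the query
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return float(x_query @ theta)

# Toy data generated from y = 2x + 1, so the local fit should recover it
x = np.linspace(0, 5, 11)
X = np.column_stack([np.ones_like(x), x])
y = 2 * x + 1

prediction = lwr_predict(X, y, np.array([1.0, 2.5]))
```

Because $\theta$ is recomputed for each query, the whole training set has to be kept around at prediction time.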

### Kernel function
- for linear regression, the hypothesis is $h_\theta(x) = \theta^T x$, the dot product is integral to the prediction operation
- suppose the vectors are not linearly separable, then we can use a function $\phi(x)$ to map the vectors to a higher dimensional space where they are linearly separable
- a kernel function takes two vectors in the original space and returns the dot product of their images in a higher dimensional space, without computing the mapping explicitly
- Let $\phi(x) : \mathbb{R}^2 \rightarrow \mathbb{R}^3$ be a function that maps $x$ from 2D to 3D space, and the kernel function is defined as $K(x, x^*) = \phi(x) \cdot \phi(x^*)$
- Consider $x = [x_1, x_2]$,  $x^* = [x^*_1, x^*_2]$, then $\phi(x) = [x_1^2, \sqrt{2}x_1x_2, x_2^2]$ and $\phi(x^*) = [x_1^{*2}, \sqrt{2}x_1^*x_2^*, x_2^{*2}]$
- Here $\phi(x) \cdot \phi(x^*) = x_1^2x_1^{*2} + 2x_1x_2x_1^*x_2^* + x_2^2x_2^{*2} = (x_1x_1^* + x_2x_2^*)^2 = (x \cdot x^*)^2$
- So $K(x, x^*) = (x \cdot x^*)^2$ is a kernel function, and we were able to perform the dot product in a higher dimensional space without explicitly computing $\phi(x)$ and $\phi(x^*)$, which is a computationally expensive operation
- $x \rightarrow \phi(x)$, $x^* \rightarrow \phi(x^*)$, then $x \cdot x^* \rightarrow K(x, x^*) = \phi(x) \cdot \phi(x^*)$
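
The identity derived above can be checked numerically. The two vectors are arbitrary example values and the function names are assumptions.

```python
import numpy as np

def phi(x):
    """Explicit map from 2D to 3D: [x1^2, sqrt(2) x1 x2, x2^2]."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(x, x_star):
    """K(x, x*) = (x . x*)^2, computed in the original 2D space."""
    return np.dot(x, x_star) ** 2

x = np.array([1.0, 2.0])
x_star = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(x_star))   # dot product in the 3D space
implicit = poly_kernel(x, x_star)        # same value without computing phi
```

Both routes give 121 for these vectors, since $x \cdot x^* = 11$.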

### Radial Basis Functions (RBF) 
- The basic idea behind RBFs is to model the data using a set of basis functions, where each basis function represents a localized influence on the data.
- a real-valued function $\varphi$ whose value depends only on the distance from a fixed point
    - point is either the origin, so that $\varphi(\mathbf{x}) = \hat{\varphi}(\|\mathbf{x}\|)$
    - a fixed point $\mathbf{c}$, called a center, so that $\varphi_c(\mathbf{x}) = \hat{\varphi}(\|\mathbf{x} - \mathbf{c}\|)$
- examples of RBFs:
    - Gaussian: $\varphi(r) = e^{{-(\epsilon r)}^2}$ where $r = \|\mathbf{x} - \mathbf{c}\|$ and $\epsilon$ is a shape parameter
    - Multiquadric: $\varphi(r) = \sqrt{1 + (\epsilon r)^2}$
    - Inverse quadratic: $\varphi(r) = \frac{1}{1 + (\epsilon r)^2}$
    - Inverse multiquadric: $\varphi(r) = \frac{1}{\sqrt{1 + (\epsilon r)^2}}$
    - Thin plate spline: $\varphi(r) = r^2 \ln(r)$
    - Cubic: $\varphi(r) = r^3$
    - Wendland: $\varphi(r) = (1 - \epsilon r)^4_+ (4\epsilon r + 1)$
- used to build function approximations of the form $$y(\mathbf{x}) = \sum_{i=1}^N w_i \varphi(\|\mathbf{x} - \mathbf{x}_i\|)$$
    - approximating function is represented as a sum of $N$ radial basis functions, each associated with a different center $\mathbf{x}_i$, and weighted by an appropriate coefficient $w_i$
    - weights $w_i$ can be estimated using the matrix methods of linear least squares, because the approximating function is linear in the weights $w_i$
- numerical: finding target value for a new point
    - say the data points are [1, 2, 3, 6, 7] and their targets [4, 6, 2, 10, 8]
    - new point is $x_{new} = 4$
    - choosing Gaussian RBF, $\varphi(r) = e^{{-(\epsilon r)}^2}$
    - for simplicity, let $\epsilon = 1$
    - choose the data points themselves as the centers, so $\mathbf{x}_i = [1, 2, 3, 6, 7]$
    - RBF for each center using the formula is $\varphi(x_{new}, c_i) = e^{-1 \times (x_{new} - c_i)^2}$
    - $predicted\_target = \frac{\sum_{i=1}^N target(c_i) \times \varphi(x_{new}, c_i)}{\sum_{i=1}^N \varphi(x_{new}, c_i)}$
- numerical: interpolation function that passes through the data points
    - say the data points are (x, y): (1, 2) (2, 3) (3, 4) (4, 5) (5, 6)
    - choose 3 centers: $c_1 = 1, c_2 = 3, c_3 = 5$
    - choose Gaussian RBF, $\varphi_i(r) = e^{{-(\epsilon ||x - c_i||)}^2}$ where $\epsilon = 1$
    - then interpolation function, $F(x) = \sum_{i=1}^3 w_i \varphi_i(x, c_i)$
    - to find the weights, we need to solve the system of linear equations
        1. $2 = w_1 \varphi_1(1, 1) + w_2 \varphi_2(1, 3) + w_3 \varphi_3(1, 5)$
        2. $3 = w_1 \varphi_1(2, 1) + w_2 \varphi_2(2, 3) + w_3 \varphi_3(2, 5)$ and so on...
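
Both numericals above can be worked through with NumPy. Note one assumption in the second part: with 5 data points and only 3 centers the linear system has more equations than unknowns, so this sketch solves it in the least squares sense with `np.linalg.lstsq` rather than exactly.

```python
import numpy as np

def gaussian_rbf(r, eps=1.0):
    """Gaussian RBF, phi(r) = exp(-(eps * r)^2)."""
    return np.exp(-(eps * r) ** 2)

# Numerical 1: normalized weighted average of targets at x_new = 4
centers = np.array([1.0, 2.0, 3.0, 6.0, 7.0])
targets = np.array([4.0, 6.0, 2.0, 10.0, 8.0])
x_new = 4.0
phi_new = gaussian_rbf(np.abs(x_new - centers))
predicted_target = np.sum(targets * phi_new) / np.sum(phi_new)

# Numerical 2: weights for F(x) = sum_i w_i * phi(|x - c_i|) with 3 centers
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
c = np.array([1.0, 3.0, 5.0])
Phi = gaussian_rbf(np.abs(x[:, None] - c[None, :]))    # 5 x 3 design matrix
weights, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # least squares solution
fitted = Phi @ weights                                  # F(x) at the data points
```

In the first numerical the prediction is about 2.55, dominated by the nearest center at $x = 3$ (target 2) because the Gaussian weights decay quickly with distance.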


### RBF Network
- fundamental idea is that an item's predicted target value is likely to be the same as other items with close values of predictor variables
- places one or many RBF neurons in the space described by the predictor variables
- space has multiple dimensions corresponding to the number of predictor variables present
- calculates the Euclidean distance from the evaluated point to the center of each neuron
- RBF (kernel function) is applied to the distance to calculate every neuron's weight (influence)
- the greater the distance of a neuron from the point being evaluated, the less influence (weight) it has
- predict value for new points by adding the output values of RBF functions applied to the distance between the new point and the center of each neuron multiplied by the weight of each neuron
- RBF network is a three-layer neural network
    - input layer: neurons that receive the input values
    - hidden layer: neurons that apply the RBF function to the distance between the input values and the center of each neuron
    - output layer: neurons that sum the output values of the hidden layer neurons multiplied by the weight of each neuron $$ y(\mathbf{x}) = \sum_{i=1}^N w_i \varphi(\|\mathbf{x} - \mathbf{x}_i\|)$$
- the approximant $y(\mathbf{x})$ is differentiable with respect to the weights $w_i$, hence the weights can be estimated using any of the standard iterative methods for neural networks
- numerical: RBF network, you have output value for a few 2D points
    - method is same as interpolation (pg. 700)

## Support Vector Machines (SVM) [🔗](https://drive.google.com/file/d/12KgpHBHalf4WFJHsVfbim1dQDdVyPk8d/view?usp=drive_link)
- maps training examples to points in space so as to maximise the width of the gap between the two categories
- new examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall
- SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces
- [Link](https://shuzhanfan.github.io/2018/05/understanding-mathematics-behind-support-vector-machines/)

The functional margin is represented as $\hat{\gamma}$ and the geometric margin is represented as $\gamma$. For a training example $(x_i, y_i)$ the functional margin is $\hat{\gamma} = y_i(w^Tx_i + b)$, and the geometric margin can be expressed as $\gamma = \frac{\hat{\gamma}}{||w||}$, where $w$ is the weight vector orthogonal to the hyperplane. So, the geometric margin is a scaled version of the functional margin. The sign of the functional margin tells us whether the prediction is correct, but its magnitude only represents confidence if $||w||$ is held constant, because rescaling $w$ and $b$ rescales $\hat{\gamma}$ without moving the hyperplane.

The functional margin therefore gives the side of the hyperplane on which a point lies, while its value depends on the scale of $w$. The geometric margin gives the distance between a given training example and the given hyperplane. It is invariant to the scaling of the vector orthogonal to the hyperplane.

The optimization equation for hard margin SVM is to maximize the margin between two classes subject to the constraint that all data points are classified correctly. The margin is defined as the distance between two parallel hyperplanes that separate the two classes. The optimization problem can be expressed as minimizing $\frac{1}{2}||w||^2$ subject to the constraint $y_i(w^Tx_i + b) \geq 1$ for all data points $i$. [Video](https://www.youtube.com/watch?v=vNt_WCM1M3M&list=PLAoF4o7zqskR7U98D799FKHkZ4YrHKPqs&index=82)

The Lagrangian of the primal optimization problem for SVM is $$ L(w, b, \alpha) = \frac{1}{2}||w||^2 - \sum_{i=1}^n \alpha_i[y_i(w^Tx_i + b) - 1] $$ where $\alpha_i \geq 0$ are the Lagrange multipliers, one per constraint.

The dual optimization problem for SVM is 
$$ \Theta_D(\alpha) = \min_{w, b} L(w, b, \alpha) $$

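Setting the derivatives of $L$ with respect to $w$ and $b$ to zero gives $w = \sum_{i=1}^n \alpha_i y_i x_i$ and $\sum_{i=1}^n \alpha_i y_i = 0$. Substituting these back into $L$ gives $$ \Theta_D(\alpha) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \, x_i^T x_j $$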

This gives us the dual problem, which is a convex quadratic optimization problem. The dual is often more convenient to solve than the primal because the training data appears only through the inner products $x_i^T x_j$, which is what makes the kernel trick possible. It is given by $$ \max_{\alpha} \Theta_D(\alpha) $$ subject to $\sum_{i=1}^n \alpha_i y_i = 0$ and $\alpha_i \geq 0$ for all $i$.

So we can first solve for $\alpha$ using the dual optimization problem, and then solve for $w$ and $b$ using the values of $\alpha$.

#### Kernel Trick

A kernel function computes the dot product of two data points after they have been mapped from a low-dimensional space to a higher-dimensional space, where the data may become linearly separable. The kernel function is defined as $$ K(x_i, x_j) = \phi(x_i)^T \phi(x_j) $$ where $\phi(x_i)$ is the transformed data point. Because only this dot product is needed, the transformation $\phi$ never has to be computed explicitly.

- Equation of linear kernel function is $$ K(x_i, x_j) = x_i^T x_j $$
- Equation of polynomial kernel function is $$ K(x_i, x_j) = (x_i^T x_j + c)^d $$
- Equation of radial basis function kernel function is $$ K(x_i, x_j) = \exp(-\gamma ||x_i - x_j||^2) $$ where $\gamma > 0$
    - small value of $\gamma$ will make the model behave like a linear SVM
    - large value of $\gamma$ will make the model heavily influenced by the individual support vectors
- Equation of sigmoid kernel function is $$ K(x_i, x_j) = \tanh(\beta x_i^T x_j + \theta) $$

#### Soft margin SVM

The optimization problem is given by $$ \min_{w, b} \frac{1}{2}||w||^2 + C \sum_{i=1}^n \xi_i $$ subject to $ y_i(w^Tx_i + b) \geq 1 - \xi_i $ and $ \xi_i \geq 0 $ for all $i$.

In soft margin SVM, a slack variable $\xi_i$ is introduced for every data point $x_i$. The value of $\xi_i$ is the distance of $x_i$ from the corresponding class’s margin if $x_i$ is on the wrong side of the margin, otherwise zero. This allows some misclassifications to happen while keeping the margin as wide as possible so that other points can still be classified correctly.
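
For a fixed hyperplane, the smallest slack that satisfies the constraints is $\xi_i = \max(0, 1 - y_i(w^Tx_i + b))$, which lets the objective be evaluated directly. The sketch below only evaluates the objective for a hand-picked $w$ and $b$; it does not solve the optimization problem. The data, the hyperplane, and the function name are assumptions.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C=1.0):
    """Return the soft margin objective and the slack of every point for a given (w, b)."""
    margins = y * (X @ w + b)                  # functional margins y_i (w^T x_i + b)
    slack = np.maximum(0.0, 1.0 - margins)     # xi_i, zero when the constraint holds
    objective = 0.5 * np.dot(w, w) + C * np.sum(slack)
    return float(objective), slack

# Hypothetical points and labels in {-1, +1}
X = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0], [0.5, 0.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 0.0])
b = 0.0

objective, slack = soft_margin_objective(w, b, X, y, C=1.0)
```

The second point lies inside the margin (slack 0.5) and the fourth is misclassified (slack 1.5), so the objective is 0.5 + 2.0 = 2.5 for $C = 1$.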

#### Regularization parameter C
- determines how heavily the slack variables $\xi_i$ are penalized
    - smaller $C$ tolerates larger slack, so more margin violations are allowed
    - larger $C$ penalizes slack more heavily, so fewer margin violations are allowed
- controls how the SVM will handle errors
    - if $C$ is positive infinite, then we will get the same result as the hard margin SVM
    - if $C$ is 0, then the slack is not penalized at all, so the constraints no longer restrict anything, and we will end up with a hyperplane not classifying anything
- small values of $C$ will result in a wider margin, at the cost of some misclassifications
- large values of $C$ approach the hard margin classifier and tolerate almost no constraint violations



## Naive Bayes classifier

- Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong independence assumptions between the features.
- We have a set of features $X = \{X_1, X_2, ..., X_n\}$ and a class variable $Y$.
- We want to find the class $Y$ that maximizes the posterior probability $P(Y|X)$.
- Then $$ P(Y = y_k|X_1, X_2, ..., X_n) = \frac{P(Y = y_k)P(X_1, X_2, ..., X_n|Y = y_k)}{\sum\limits_{j} P(Y = y_j)P(X_1, X_2, ..., X_n|Y = y_j)} $$
- Assuming conditional independence, we have $P(X_1, X_2, ..., X_n|Y = y_k) = \prod\limits_{i=1}^n P(X_i|Y = y_k)$. Therefore, $$P(Y = y_k|X_1, X_2, ..., X_n) = \frac{P(Y = y_k)\prod\limits_{i=1}^n P(X_i|Y = y_k)}{\sum\limits_{j} P(Y = y_j)\prod\limits_{i=1}^n P(X_i|Y = y_j)}$$
- Pick the most probable class: $$\hat{y} = \arg\max\limits_{y_k} P(Y = y_k)\prod\limits_{i} P(X_i|Y = y_k)$$

Steps to apply Naive Bayes classifier, given a table like this:

|Weather|Play|
|---|---|
|Sunny|No|
|...|...|
|Rainy|Yes|

We convert it into a frequency table like this:

|Weather|No|Yes|Total|Prob|
|---|---|---|---|---|
|Sunny|2|3|5| P(Sunny) = $\frac{5}{14}$|
|Overcast|0|4|4| P(Overcast) = $\frac{4}{14}$|
|Rainy|3|2|5| P(Rainy) = $\frac{5}{14}$|
|Total|5|9|14|1|
|Prob|P(No) = $\frac{5}{14}$|P(Yes) = $\frac{9}{14}$|1| |

Then we can calculate the posterior probability of each class, given the evidence (weather), for example, $P(Yes|Sunny)$:
$$ P(Yes|Sunny) = \frac{P(Sunny|Yes)P(Yes)}{P(Sunny)} = \frac{\frac{3}{9}\frac{9}{14}}{\frac{5}{14}} = \frac{3}{5}$$

If there are multiple features, we can calculate the posterior probability of each class, given the evidence (weather and temperature), for example, $P(Yes|Sunny, Cool)$:
$$ P(Yes|Sunny, Cool) = \frac{P(Sunny, Cool|Yes)P(Yes)}{P(Sunny, Cool)} = \frac{P(Sunny|Yes)P(Cool|Yes)P(Yes)}{P(Sunny)P(Cool)} $$
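
The single-feature calculation above can be reproduced from the frequency table with exact fractions. The dictionary layout and function name are assumptions.

```python
from fractions import Fraction

# Frequency table from above: counts of Play = No / Yes for each Weather value
counts = {
    "Sunny": {"No": 2, "Yes": 3},
    "Overcast": {"No": 0, "Yes": 4},
    "Rainy": {"No": 3, "Yes": 2},
}

def posterior(label, weather):
    """P(label | weather) = P(weather | label) * P(label) / P(weather)."""
    total = sum(sum(row.values()) for row in counts.values())
    class_total = sum(row[label] for row in counts.values())
    prior = Fraction(class_total, total)
    likelihood = Fraction(counts[weather][label], class_total)
    evidence = Fraction(sum(counts[weather].values()), total)
    return likelihood * prior / evidence

p_yes_sunny = posterior("Yes", "Sunny")
p_no_sunny = posterior("No", "Sunny")
```

This gives $P(Yes|Sunny) = \frac{3}{5}$ and $P(No|Sunny) = \frac{2}{5}$, so the classifier predicts Yes for a sunny day.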

## Naive Bayes for text classification
You need a document $d$, a set of classes $C = \{c_1, c_2, ..., c_n\}$, and a set of $m$ hand-labelled documents $(d_1, c_1), (d_2, c_2), ..., (d_m, c_m)$. Then, for a document $d$, we want to find the class $c$ that maximizes the posterior probability $P(c|d)$.
$$ P(c|d) = \frac{P(c)P(d|c)}{P(d)} = \frac{P(c)\prod\limits_{i=1}^n P(w_i|c)}{P(d)}$$
Here, there are two assumptions : bag of words (position doesn't matter) and conditional independence.
Then, we pick the most probable class: $$c_{MAP} = \arg\max\limits_{c} P(c)\prod\limits_{i=1}^n P(w_i|c)$$
Here, $$ P(c_j) = \frac{docCount(C = c_j)}{N_{doc}} $$ and $$ P(w_i|c_j) = \frac{wordCount(w_i, C = c_j)}{\sum\limits_{w \in V} wordCount(w, C = c_j)} $$, where $V$ is the vocabulary.
This has a problem of zero probability, so we use Laplace smoothing: $$ P(w_i|c_j) = \frac{wordCount(w_i, C = c_j) + 1}{\sum\limits_{w \in V} wordCount(w, C = c_j) + |V|} $$

[Example](https://www.fi.muni.cz/~sojka/PV211/p13bayes.pdf):
<img src="https://i.imgur.com/p3nZUNM.png" width="500" style="display: block; margin-left: auto; margin-right: auto; padding-top: 10px; padding-bottom: 10px;">
<img src="https://i.imgur.com/kcNsCro.png" width="500" style="display: block; margin-left: auto; margin-right: auto; padding-top: 10px; padding-bottom: 10px;">

Therefore, $$P(C|d_5) = \frac{3}{4} {(\frac{3}{7})}^3 \frac{1}{14} \frac{1}{14} \frac{1}{P(d_5)}$$
and $$P(\bar{C} | d_5) = \frac{1}{4} {(\frac{2}{9})}^3 \frac{2}{9} \frac{2}{9} \frac{1}{P(d_5)}$$

$P(d_5)$ is the same for both classes, so we can ignore it.
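
The two scores above can be reproduced with a short multinomial Naive Bayes sketch using Laplace smoothing. The training documents below are an assumption: they are the four-document "China" example that the linked slides appear to use, chosen because they reproduce the factors $\frac{3}{7}$, $\frac{1}{14}$ and $\frac{2}{9}$ shown above. The class names and function name are also assumptions.

```python
from collections import Counter
from fractions import Fraction

# Assumed training set (d_1..d_4) and test document d_5
train = [
    ("Chinese Beijing Chinese", "c"),
    ("Chinese Chinese Shanghai", "c"),
    ("Chinese Macao", "c"),
    ("Tokyo Japan Chinese", "not_c"),
]
d5 = "Chinese Chinese Chinese Tokyo Japan"

vocabulary = {word for text, _ in train for word in text.split()}
doc_counts = Counter(label for _, label in train)
word_counts = {label: Counter() for label in doc_counts}
for text, label in train:
    word_counts[label].update(text.split())

def score(label, document):
    """P(c) * prod_i P(w_i | c) with add-one smoothing; P(d) is left out."""
    result = Fraction(doc_counts[label], len(train))
    denominator = sum(word_counts[label].values()) + len(vocabulary)
    for word in document.split():
        result *= Fraction(word_counts[label][word] + 1, denominator)
    return result

score_c = score("c", d5)
score_not_c = score("not_c", d5)
predicted = max(doc_counts, key=lambda label: score(label, d5))
```

The score for class `c` (about 0.0003) is larger than for `not_c` (about 0.0001), so $d_5$ is assigned to class $C$. For long documents, summing log probabilities instead of multiplying avoids numerical underflow when floats are used.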

## Ensemble learning

- Ensemble learning is a machine learning paradigm where multiple learners are trained to solve the same problem, and then combined to get better results.
- types of ensemble methods:
1. bagging - decrease variance
    - building multiple models (typically of the same type) from different subsamples of the training dataset (with replacement) 
    - considers homogeneous weak learners, learns them independently from each other in parallel and combines them following some kind of deterministic averaging process
    - eg. random forest, extra trees
2. boosting - decrease bias
    - building multiple models (typically of the same type) each of which learns to fix the prediction errors of a prior model in the chain
    - considers homogeneous weak learners, learns them sequentially in a very adaptive way (a base model depends on the previous ones) and combines them following a deterministic strategy
3. stacking  - increase predictive power
    - building multiple models (typically of differing types) and supervisor model that learns how to best combine the predictions of the base models
    - considers heterogeneous weak learners, learns them in parallel and combines them by training a meta-model to output a prediction based on the different weak models predictions

### Random forest vs Extra trees
- Random forest is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set. The steps for random forest are:
    1. Sample $n$ cases at random with replacement to create a subset of the data, called the bag used to build a tree
    2. At each node, randomly select $d$ features without replacement
    3. Calculate the best split point for the selected features
    4. Split the node into two daughter nodes
    5. Repeat steps 1 to 4 $k$ times
    6. Aggregate the prediction by each tree to assign the class label by majority vote (classification) or average (regression)
- Extra trees is an ensemble learning method for classification, regression and other tasks that constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Extra-trees differ from classic decision trees in the way they are built. When looking for the best split to separate the samples of a node into two groups, random splits are drawn for each of the max_features randomly selected features and the best split among those is chosen. When max_features is set to 1, this amounts to building a totally random decision tree.

### AdaBoost vs Gradient Boosting vs XGBoost
- AdaBoost is an ensemble learning method for classification and regression. It is a meta-algorithm, and can be used in conjunction with many other learning algorithms to improve their performance. The output of the other learning algorithms ('weak learners') is combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers.

<img src="https://i.imgur.com/uYsoCL2.png" width="400" style="display: block; margin-left: auto; margin-right: auto; padding-top: 10px; padding-bottom: 10px;">

- Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function. Gradient boosting is a greedy algorithm and can overfit if run for too many iterations.

<img src="https://i.imgur.com/Fz2HzoG.png" width="400" style="display: block; margin-left: auto; margin-right: auto; padding-top: 10px; padding-bottom: 10px;">

- XGBoost is short for eXtreme Gradient Boosting. It is an optimized distributed gradient boosting library. It provides a parallel tree boosting (also known as GBDT, GBM) that solves many ML problems quickly and accurately.

## Clustering

- Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).

### K-means clustering
- K-means clustering aims to partition $n$ observations into $k$ clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
    - K-means clustering minimizes within-cluster variances (squared Euclidean distances)
    - Given a set of observations $(x_1, x_2, ..., x_n)$, where each observation is a $d$-dimensional real vector, k-means clustering aims to partition the $n$ observations into $k$ sets $S = \{S_1, S_2, ..., S_k\}$ so as to minimize the within-cluster sum of squares (WCSS) $$ \sum_{i=1}^k \sum_{x \in S_i} ||x - \mu_i||^2 $$ where $\mu_i$ is the mean of points in $S_i$.

#### Evaluation metrics
- Distortion
    - the average of the squared distances from the cluster centers of the respective clusters. Typically, the Euclidean distance metric is used.
    - The distortion is given by $$ J = \sum_{i=1}^k \frac{1}{|S_i|} \sum_{x \in S_i} ||x - \mu_i||^2 $$
- Inertia
    - the sum of squared distances of samples to their closest cluster center.
    - The inertia is given by $$ I = \sum_{i=1}^k \sum_{x \in S_i} ||x - \mu_i||^2 $$

- Dunn index
    - the ratio between the minimum inter-cluster distance to maximum intra-cluster distance. The higher the value of Dunn index, the better the clustering.
    - The Dunn index is defined as $$ D = \frac{\min\limits_{1 \leq i < j \leq n} d(i, j)}{\max\limits_{1 \leq k \leq n} \Delta(k)} = \frac{\min(\text{inter-cluster distance})}{\max(\text{intra-cluster distance})}$$ where $d(i, j)$ is the distance between clusters $i$ and $j$, and $\Delta(k)$ is the diameter of cluster $k$.
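
The three metrics can be sketched as follows, using the formulas exactly as written above. For the Dunn index this sketch assumes one common choice of definitions: the inter-cluster distance is the smallest distance between points of two different clusters, and the diameter is the largest distance between two points of the same cluster. The data, labels, and centers are illustrative values.

```python
import numpy as np
from itertools import combinations

def inertia(data, labels, centers):
    """Sum of squared distances of samples to their assigned cluster center."""
    return float(np.sum((data - centers[labels]) ** 2))

def distortion(data, labels, centers):
    """Sum over clusters of the mean squared distance to the cluster center."""
    per_cluster = [
        np.mean(np.sum((data[labels == k] - center) ** 2, axis=1))
        for k, center in enumerate(centers)
    ]
    return float(np.sum(per_cluster))

def pairwise_distances(a, b):
    """Matrix of Euclidean distances between rows of a and rows of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def dunn_index(data, labels):
    """Minimum inter-cluster distance divided by maximum cluster diameter."""
    clusters = [data[labels == k] for k in np.unique(labels)]
    min_inter = min(pairwise_distances(a, b).min() for a, b in combinations(clusters, 2))
    max_intra = max(pairwise_distances(c, c).max() for c in clusters)
    return float(min_inter / max_intra)

data = np.array([[1, 1], [2, 2], [2, 3], [1, 2], [5, 6], [5, 7], [6, 7], [6, 6]], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
centers = np.array([[1.5, 2.0], [5.5, 6.5]])

total_inertia = inertia(data, labels, centers)
total_distortion = distortion(data, labels, centers)
dunn = dunn_index(data, labels)
```

For these two tight, well-separated clusters the inertia is 5.0, the distortion is 1.25, and the Dunn index is about 1.90.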

```python
import numpy as np

# Define the number of clusters and the number of iterations
K = 2
max_iterations = 3

# Generate some sample data
# data = np.random.randint(0, 10, size=(5, 2))
data = np.array([[1, 1], [2, 2], [2, 3], [1, 2], [5, 6], [5, 7], [6, 7], [6, 6]])
print(f"Points: {data}")

# Initialize the centroids by randomly selecting K data points
# centroids = data[np.random.choice(data.shape[0], K, replace=False)]
centroids = np.array([[1, 1], [5, 6]])
centroids = centroids.astype(float)

# Iterate the k-means algorithm
for i in range(max_iterations):
    # Assign each point to the nearest centroid
    distances = np.sqrt(np.sum((data[:, np.newaxis, :] - centroids) ** 2, axis=2))
    labels = np.argmin(distances, axis=1)

    # Print the centroids and the distances at each iteration
    print(f"Iteration {i+1}:")
    print(f"Centroids: {centroids}")
    print(f"Distances: {distances}")

    # Update the centroids to the mean of the assigned points
    for k in range(K):
        centroids[k] = np.mean(data[labels == k], axis=0)
```

Output:

```text
Points: [[1 1]
 [2 2]
 [2 3]
 [1 2]
 [5 6]
 [5 7]
 [6 7]
 [6 6]]
Iteration 1:
Centroids: [[1. 1.]
 [5. 6.]]
Distances: [[0.         6.40312424]
 [1.41421356 5.        ]
 [2.23606798 4.24264069]
 [1.         5.65685425]
 [6.40312424 0.        ]
 [7.21110255 1.        ]
 [7.81024968 1.41421356]
 [7.07106781 1.        ]]
Iteration 2:
Centroids: [[1.5 2. ]
 [5.5 6.5]]
Distances: [[1.11803399 7.1063352 ]
 [0.5        5.70087713]
 [1.11803399 4.94974747]
 [0.5        6.36396103]
 [5.31507291 0.70710678]
 [6.10327781 0.70710678]
 [6.72681202 0.70710678]
 [6.02079729 0.70710678]]
Iteration 3:
Centroids: [[1.5 2. ]
 [5.5 6.5]]
Distances: [[1.11803399 7.1063352 ]
 [0.5        5.70087713]
 [1.11803399 4.94974747]
 [0.5        6.36396103]
 [5.31507291 0.70710678]
 [6.10327781 0.70710678]
 [6.72681202 0.70710678]
 [6.02079729 0.70710678]]
```


### K-means++ algorithm

- K-means++ algorithm is an algorithm for choosing the initial values (or "seeds") for the k-means clustering algorithm
- The algorithm is as follows:
    1. Choose one center uniformly at random from among the data points.
    2. For each data point $x$, compute $D(x)$, the distance between $x$ and the nearest center that has already been chosen.
    3. Choose one new data point at random as a new center, using a weighted probability distribution where a point $x$ is chosen with probability proportional to $D(x)^2$.
    4. Repeat Steps 2 and 3 until $k$ centers have been chosen.
    5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
- The k-means++ seeding method gives a provable guarantee on solution quality: the expected within-cluster sum of squares is within a factor of $O(\log k)$ of the optimal clustering.
- main advantage is that it helps avoid poor local minima, by ensuring a more spread out initial set of centers, leading to faster convergence and better final solution
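
The seeding steps above can be sketched with NumPy. The function name, the fixed random seed, and the sample data are assumptions; the sketch also assumes the data contains at least $k$ distinct points, otherwise the probabilities cannot be normalized.

```python
import numpy as np

def kmeans_pp_init(data, k, rng):
    """Choose k initial centers with D(x)^2-weighted sampling (k-means++ seeding)."""
    n = data.shape[0]
    centers = [data[rng.integers(n)]]            # step 1: uniform random first center
    for _ in range(k - 1):
        chosen = np.array(centers)
        # step 2: squared distance from each point to its nearest chosen center
        d2 = np.min(np.sum((data[:, None, :] - chosen[None, :, :]) ** 2, axis=2), axis=1)
        # step 3: sample the next center with probability proportional to D(x)^2
        probabilities = d2 / d2.sum()
        centers.append(data[rng.choice(n, p=probabilities)])
    return np.array(centers, dtype=float)

data = np.array([[1, 1], [2, 2], [2, 3], [1, 2], [5, 6], [5, 7], [6, 7], [6, 6]], dtype=float)
rng = np.random.default_rng(0)
initial_centers = kmeans_pp_init(data, k=2, rng=rng)
```

An already chosen point has $D(x) = 0$ and so can never be picked again, while points far from all current centers are the most likely candidates. The returned array can be used in place of a random initialization before running standard k-means.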

### K Medoids clustering
- k-medoids chooses actual data points as cluster centers (medoids or exemplars) and forms clusters around them. It is similar to k-means, except that in k-means the center of a cluster is the mean of its points, which need not be a data point, whereas a medoid is always one of the data points.
- The algorithm is as follows:
    1. Initialize: randomly select $k$ of the $n$ data points as the medoids
    2. Associate each data point to the closest medoid. (Thus forming $k$ clusters of data points.)
    3. For each cluster $j$ and its medoid $m_j$:
        - For each non-medoid data point $o$ in the cluster:
            - Swap $o$ and $m_j$ and compute the total cost of the configuration (the sum of distances from every point to its closest medoid).
        - Select the configuration with the lowest cost.
    4. Repeat Steps 2 and 3 until the medoids no longer change.
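
A minimal sketch of this swap-based procedure, assuming NumPy and Euclidean distance. The names `total_cost` and `k_medoids` are illustrative, and this version applies only the single best swap per pass, which is one of several reasonable ways to read Step 3.

```python
import numpy as np

def total_cost(dist, medoids):
    """Sum of distances from every point to its closest medoid."""
    return dist[:, medoids].min(axis=1).sum()

def k_medoids(data, k, max_iter=100, rng=None):
    rng = np.random.default_rng(rng)
    n = data.shape[0]
    # Pairwise Euclidean distances (any dissimilarity matrix could be used instead)
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    # Step 1: pick k distinct data points as the initial medoids
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        # Step 2: assign each point to its closest medoid
        labels = dist[:, medoids].argmin(axis=1)
        best_cost, best_medoids = total_cost(dist, medoids), medoids
        # Step 3: try swapping each medoid with each non-medoid in its cluster
        for j in range(k):
            for o in np.flatnonzero(labels == j):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[j] = o
                cost = total_cost(dist, candidate)
                if cost < best_cost:
                    best_cost, best_medoids = cost, candidate
        # Step 4: stop when no swap lowers the total cost
        if np.array_equal(best_medoids, medoids):
            break
        medoids = best_medoids
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels

points = np.array([[1, 1], [2, 2], [2, 3], [1, 2],
                   [5, 6], [5, 7], [6, 7], [6, 6]])
medoids, labels = k_medoids(points, k=2, rng=0)
```

Here `medoids` holds row indices into `points`, so every cluster center is guaranteed to be an actual data point.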

### Fuzzy C-means clustering
- Fuzzy C-means clustering is a method of clustering which allows one piece of data to belong to two or more clusters
- The algorithm is as follows:
    1. Specify the number of clusters $c$ and the fuzziness parameter $m > 1$ (a common choice is $m = 2$).
    2. Initialize the cluster centers randomly, $v_j \in \mathbb{R}^d$ for $j = 1, 2, ..., c$.
    3. For each data point, compute the degree of membership of that point to each cluster center: $$\mu_{ij} = \frac{1}{\sum\limits_{k=1}^c \left( \frac{d_{ij}}{d_{ik}} \right)^{\frac{2}{m-1}}}$$ where $d_{ij}$ is the distance between the $i^{th}$ data point $x_i$ and the $j^{th}$ cluster center $v_j$.
    4. Recompute the cluster centers: $$v_j = \frac{\sum\limits_{i=1}^n \mu_{ij}^m x_i}{\sum\limits_{i=1}^n \mu_{ij}^m}$$
    5. Repeat Steps 3 and 4 until the membership coefficients $\mu_{ij}$ change by less than a small tolerance between iterations.
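
The update rules can be sketched in NumPy as below. This is an illustrative implementation under stated assumptions: Euclidean distance, centers initialised at randomly chosen data points, and a hypothetical function name `fuzzy_c_means` with a tolerance-based stopping rule.

```python
import numpy as np

def fuzzy_c_means(data, c, m=2.0, tol=1e-5, max_iter=100, rng=None):
    rng = np.random.default_rng(rng)
    n = data.shape[0]
    # Step 2: initialise centers at c distinct random data points (an assumption)
    centers = data[rng.choice(n, size=c, replace=False)].astype(float)
    u = np.zeros((n, c))
    for _ in range(max_iter):
        # d[i, j] = distance between point i and center j, floored to avoid 0/0
        d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)
        # Step 3: mu[i, j] = 1 / sum_k (d[i, j] / d[i, k]) ** (2 / (m - 1))
        ratio = d[:, :, None] / d[:, None, :]
        new_u = 1.0 / np.sum(ratio ** (2.0 / (m - 1.0)), axis=2)
        # Step 4: each center is the mean of the data weighted by mu ** m
        w = new_u ** m
        centers = (w.T @ data) / w.sum(axis=0)[:, None]
        # Step 5: stop once the memberships barely change
        converged = np.max(np.abs(new_u - u)) < tol
        u = new_u
        if converged:
            break
    return centers, u

points = np.array([[1, 1], [2, 2], [2, 3], [1, 2],
                   [5, 6], [5, 7], [6, 7], [6, 6]])
centers, u = fuzzy_c_means(points, c=2, rng=0)
```

Each row of `u` sums to 1, so a point's memberships can be read as how strongly it belongs to each cluster. Taking `u.argmax(axis=1)` gives a hard assignment if one is needed.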


## Expectation Maximization (EM) algorithm
- Expectation Maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between:
    - Expectation step (E-step): compute the expected complete-data log-likelihood, where the expectation is taken over the latent variables $Z$ given the observed data $X$ and the current parameter estimate $\theta^{(t)}$
        $$ Q(\theta | \theta^{(t)}) = E_{Z|X, \theta^{(t)}} [\log L(\theta; X, Z)] $$
    - Maximization step (M-step): compute parameters maximizing the expected log-likelihood found on the E-step
        $$ \theta^{(t+1)} = \arg\max\limits_{\theta} Q(\theta | \theta^{(t)}) $$
- Each EM iteration never decreases the observed-data likelihood, and under mild regularity conditions the iterates converge to a stationary point, typically a local maximum rather than the global maximum. Because the result depends on the initialization, multiple random restarts are commonly used. Generalized EM (GEM) variants relax the M-step so that it only needs to increase $Q$ rather than fully maximize it.
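
As a concrete illustration of the E-step and M-step, the sketch below fits a two-component 1-D Gaussian mixture. The choice of model, the initialisation, the synthetic data, and the function name `em_gmm_1d` are all assumptions made for this example; here $\theta$ is the set of mixture weights, means, and variances, and $Z$ is the unobserved component of each sample.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture (illustrative sketch)."""
    # Crude initial guesses for theta = (weights, means, variances)
    weights = np.array([0.5, 0.5])
    means = np.array([x.min(), x.max()], dtype=float)
    variances = np.array([x.var(), x.var()])
    log_liks = []
    for _ in range(n_iter):
        # E-step: responsibilities resp[i, j] = P(z_i = j | x_i, current theta)
        dens = (np.exp(-0.5 * (x[:, None] - means) ** 2 / variances)
                / np.sqrt(2 * np.pi * variances))
        joint = weights * dens
        log_liks.append(np.log(joint.sum(axis=1)).sum())
        resp = joint / joint.sum(axis=1, keepdims=True)
        # M-step: closed-form maximisers of Q for a Gaussian mixture
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
    return weights, means, variances, log_liks

# Synthetic data: two Gaussians centered at 0 and 5 (made up for this example)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])
weights, means, variances, log_liks = em_gmm_1d(x)
```

The recorded `log_liks` values should never decrease from one iteration to the next, which is a handy sanity check when implementing EM.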

## Model evaluation / Comparison

### Confidence Interval for Accuracy
$$ CI = \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} $$
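where $\hat{p}$ is the observed accuracy, $z_{\alpha/2}$ is the critical value of the standard normal distribution at $\alpha/2$ (e.g. for a 95% confidence interval, $\alpha = 0.05$ and $z_{\alpha/2} = 1.96$), and $n$ is the number of test instances. The interval relies on the normal approximation to the binomial distribution, so it is most reliable when $n$ is large and $\hat{p}$ is not too close to 0 or 1.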
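
A small sketch of the calculation, using made-up numbers (an observed accuracy of 0.85 on 200 test instances) purely for illustration; the function name `accuracy_confidence_interval` is hypothetical.

```python
import math

def accuracy_confidence_interval(accuracy, n, z=1.96):
    """Normal-approximation confidence interval for an observed accuracy."""
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width

# Hypothetical example: 85% accuracy measured on 200 test instances
low, high = accuracy_confidence_interval(0.85, 200)
print(f"95% CI: [{low:.3f}, {high:.3f}]")  # roughly [0.801, 0.899]
```

Because the half-width shrinks with $\sqrt{n}$, quadrupling the test set size halves the width of the interval.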