## Module 1

### What is Machine Learning?

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
To have a learning problem, we must identify:
- class of tasks T
- performance measure P
- source of experience E

Traditional programming vs Machine Learning
- Traditional programming: (data) + (program) = (output)
- Machine Learning: (data) + (output) = (program)

#### Table with learning tasks, performance measures and experience sources

| Task | Performance Measure | Experience Source |
| --- | --- | --- |
| Email spam filter | Accuracy of the filter | User marks emails as spam/not spam |
| Handwritten digit recognition | Accuracy of the classifier | User provides examples of digits |
| Self-driving car | Safety and efficiency of the car | User drives the car |
| Playing checkers | % of games won against opponent | Games played against itself |

#### When do we use / not use Machine Learning?
Used when:
- lots of hand-tuning, long lists of rules, or hard to define rules
- complex / fluctuating environment
- expert knowledge does not exist, or is difficult to obtain
- models are based on huge amounts of data, or must be customized to each individual

Not used when:
- simple, static environment, well-defined rules
- no uncertainty in the environment
- expert knowledge is available

### Machine Learning Process

| Step | Description |
| --- | --- |
| 1. Define the Problem     | Clearly define the problem statement, including the goal and the target variable(s).<br> Identify the available resources, constraints, and relevant stakeholders.<br> Understand the domain knowledge and business context to ensure the problem's relevance. |
| 2. Data Collection        | Determine the data requirements based on the problem definition.<br> Identify potential data sources and acquire the necessary datasets.<br> Ensure data quality by performing data validation, cleaning, and handling missing values or outliers. |
| 3. Data Exploration       | Perform statistical analysis, such as summary statistics and data distributions.<br> Visualize the data through plots, histograms, scatterplots, or heatmaps.<br> Identify correlations, patterns, and outliers within the dataset.<br> Conduct feature correlation analysis to understand relationships between variables. |
| 4. Feature Engineering    | Select relevant features based on domain knowledge and exploration.<br> Handle categorical variables through techniques like one-hot encoding or ordinal encoding.<br> Scale numerical features to a common range or apply normalization techniques.<br> Create new features by transforming or combining existing ones (e.g., feature interactions, polynomial features). |
| 5. Model Selection        | Identify the problem type (classification, regression, clustering, etc.).<br> Consider the characteristics of the dataset (e.g., size, dimensionality) and the assumptions of different algorithms.<br> Evaluate various algorithms and choose the one that best suits the problem and data. |
| 6. Model Training         | Split the data into training and testing sets (e.g., using random sampling or time-based splitting).<br> Apply the chosen algorithm to the training data and optimize its hyperparameters.<br> Evaluate the model's performance on the testing set using appropriate metrics.<br> Repeat the training process with different algorithms or parameter settings if necessary. |
| 7. Model Evaluation       | Calculate evaluation metrics such as accuracy, precision, recall, F1 score, or mean squared error.<br> Perform cross-validation or holdout validation to estimate the model's performance on unseen data.<br> Analyze the model's strengths, weaknesses, and potential biases.<br> Consider business requirements and domain-specific metrics for a comprehensive evaluation. |
| 8. Model Optimization     | Fine-tune the model's hyperparameters through techniques like grid search, random search, or Bayesian optimization.<br> Regularize the model to prevent overfitting using techniques like L1/L2 regularization or dropout.<br> Explore ensemble methods, such as bagging or boosting, to improve model performance.<br> Use feature selection techniques to remove irrelevant or redundant features. |
| 9. Model Deployment       | Prepare the model for deployment by saving its trained parameters and associated preprocessing steps.<br> Integrate the model into an application, system, or cloud infrastructure.<br> Design and implement an API for making predictions using the deployed model.<br> Ensure the model's scalability, robustness, and security in a production environment. |
| 10. Monitoring and Maintenance | Continuously monitor the model's performance in real-world scenarios.<br> Collect feedback and track performance metrics to detect any degradation or concept drift.<br> Retrain the model periodically with new data to keep it up-to-date and maintain its accuracy.<br> Conduct regular model audits and updates as needed. |
| 11. Iteration and Improvement | Regularly revisit and refine the model as new insights are gained, data quality improves, or new techniques emerge.<br> Incorporate feedback from stakeholders and address any limitations or shortcomings.<br> Continuously experiment with new algorithms or approaches to improve the model's performance and adapt to evolving requirements. |

#### Types of learning:
- Supervised (inductive) learning
    - given training data, desired outputs (labels)
    - learn a function that maps inputs to outputs
    - types:
        - classification (predict class or category, discrete value)
            - binary classification (2 classes)
            - multi-class classification (more than 2 classes)
        - regression (predict continuous value)
- Unsupervised learning
    - given training data, no desired outputs
    - learn a function that describes hidden structure from unlabeled data
- Semi-supervised learning
    - given training data, some desired outputs
    - learn a function that maps inputs to outputs
- Reinforcement learning
    - rewards from sequence of actions
    - learn a function that maximizes a reward signal

High level, general comparison table:

|                       | Supervised Learning               | Unsupervised Learning          | Semi-Supervised Learning             | Reinforcement Learning                    |
|-----------------------|-----------------------------------|--------------------------------|--------------------------------------|------------------------------------------|
| Data                  | Labelled                          | Unlabelled                     | Mix of Labelled and Unlabelled       | Depends on State and Reward              |
| Task                  | Prediction                        | Pattern Recognition            | Prediction                           | Sequential Decision Making               |
| Example Algorithms    | Linear Regression, SVM, Neural Networks | Clustering, K-Means, PCA | Self-Training, Multi-View Training   | Q-Learning, SARSA, DQN                    |
| Feedback              | Direct                            | None                           | Partial                              | Reward-based                              |
| Goal                  | Minimize Error on Given Labels    | Discover Hidden Structure      | Better Generalization Accuracy       | Maximize Cumulative Reward                |
| Typical Use Case      | Image Recognition, Email Spam Detection | Customer Segmentation, Anomaly Detection | Web Content Classification, Bioinformatics | Game AI, Robot Navigation, Real-time Decisions |
| Training Efficiency   | High (due to direct feedback)     | Medium (no feedback)           | Varies (depends on labeled/unlabeled ratio) | Typically slow, trial and error-based      |
| Complexity of Problem | Low-Medium                        | High                           | Medium-High                          | High                                      |
| Real-time Adaptation  | Not Typically                     | Not Typically                  | Not Typically                        | Yes, using online learning                 |


## Module 1

### What is Machine Learning?
A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.
Example: playing checkers.
$E$ = the experience of playing many games of checkers
$T$ = the task of playing checkers
$P$ = the probability that the program will win the next game

Machine learning is based on learning a function $h$ that approximates $y$ as a function of $x$: $$h: X \rightarrow Y$$ where $X$ is the input space and $Y$ is the output space.
Steps:
1. Define the objective of the problem
2. Collect data
3. Prepare / preprocess data
4. Explorative data analysis
5. Building a machine learning model
6. Model evaluation and optimization
7. Deploy the model, predict new values

## Module 2

### Data
data properties:
- value : how useful the data is for the problem
- volume : how much data to be analyzed and processed
- variety : what types of data (structured, unstructured, semi-structured)
- velocity : how fast data is generated and processed
- veracity : how accurate and reliable the data is

#### Data quality
Have to check:
- accuracy : should reflect reality
- completeness : should have all the required data
- consistency : should be consistent with other data
- validity : should be valid according to the domain
- uniqueness : should be unique, no duplications or redundancies
- timeliness : should be up-to-date

Issues might be:
- missing values
    - data not collected
    - variable not applicable for that observation
- outliers 
    - data point that differs significantly from other observations
- inconsistent data
- invalid data
    - can be due to:
        - measurement error
        - experimental error
        - data corruption
        - data entry error
        - natural variation
- noise
    - extraneous object, modification, or event that interferes with the data
- duplicate data
    - when merging data from heterogeneous sources
- biased / unrepresentative data

#### Data types


## Linear Regression

- linear regression can be used to fit a model to an observed dataset of values of the response (dependent variable) and explanatory variables (independent variables / features)
- $x^{(i)}$ is the vector of input variables / features, $x^{(i)} = \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix} _{((n+1) \times 1)}$, where $n$ is the number of features, with $x_0^{(i)} = 1$ being the intercept term. 
- $y^{(i)}$ is the output variable / target.
- $(x^{(i)}, y^{(i)})$ is a training example.
- $\{(x^{(i)}, y^{(i)}) : i = 1, \dots, m\}$ is the training set, where $m$ is the number of examples in the training set.

Goal : to learn a function $h(x) : \text{space of input values} \rightarrow \text{space of output values}$, so that $h(x)$ is a good predictor for the corresponding value $y$

#### Equations

If we decide to approximate $y$ as a linear function of $x$, then for the $i^{th}$ training example:

$$\hat{y}^{(i)} =  h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}_1 + \theta_2 x^{(i)}_2 + \dotsm + \theta_n x^{(i)}_n = \sum_{j=0}^n \theta_j x^{(i)}_j$$

This is called **simple / univariate** linear regression when $n = 1$, and **multiple** linear regression when $n > 1$. Both are different from **multivariate** regression, which pertains to multiple dependent variables and multiple independent variables. [Link](https://stats.stackexchange.com/q/2358/331716)

Then we can define the cost function as:
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$$

This is the **ordinary least squares (OLS)** cost function; minimizing it minimizes the **mean squared error (MSE)**. The factor $\frac{1}{2}$ only simplifies the derivative and does not change the minimizer.

Goal : to choose $\theta$ so as to minimize $J(\theta)$
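To make the cost function concrete, here is a minimal sketch in plain Python. The helper names `predict` and `cost` and the toy dataset are assumptions made for illustration; each row of `X` already includes the intercept feature $x_0 = 1$.

```python
def predict(theta, x):
    """Linear hypothesis h_theta(x) = sum_j theta_j * x_j (x[0] must be 1)."""
    return sum(t * xj for t, xj in zip(theta, x))


def cost(theta, X, y):
    """OLS cost J(theta) = 1/(2m) * sum of squared residuals."""
    m = len(X)
    return sum((predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)


# Toy data generated from y = 1 + 2x, with x_0 = 1 prepended to each row.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]

perfect = cost([1.0, 2.0], X, y)  # parameters that generated the data
zeros = cost([0.0, 0.0], X, y)    # all-zero parameters predict 0 everywhere
```

With the generating parameters every residual is zero, so the cost is zero; with all-zero parameters the cost is $(1 + 9 + 25) / 6$.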

#### Vectorized

$$ X = \begin{bmatrix} - \left( x^{(1)} \right)^T - \\ - \left( x^{(2)} \right)^T - \\ \vdots \\ - \left( x^{(m)} \right)^T - \end{bmatrix}_{(m \times (n+1))} , \qquad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}_{((n+1) \times 1)} \qquad and \qquad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} _{(m \times 1)}$$

Then the vector of predictions, 

$$ \hat{y} =  X\theta = \begin{bmatrix} - \left( x^{(1)} \right)^T\theta - \\ - \left( x^{(2)} \right)^T\theta - \\ \vdots \\ - \left( x^{(m)} \right)^T\theta - \end{bmatrix}_{(m \times 1)} $$

We can rewrite the least-squares cost as follows, replacing the explicit sum by matrix multiplication:

$$J(\theta) = \frac{1}{2m} (X\theta - y)^T(X\theta - y)$$

#### Finding coefficients for simple linear regression

The simple linear regression model is $y = \theta_0 + \theta_1 x$, where $\theta_0$ is the intercept and $\theta_1$ is the slope. The coefficients are found by minimizing the sum of squared residuals (SSR), which is the sum of the squares of the differences between the observed dependent variable ($y$) and those predicted by the linear function ($\hat{y}$).

Make a table with $x_i$, $y_i$, $x_i - \bar{x}$, $y_i - \bar{y}$, $(x_i - \bar{x})^2$, $(x_i - \bar{x})(y_i - \bar{y})$.

Equation for $\theta_1$: $$\theta_1 = \frac{\sum_{i=1}^m (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sum_{i=1}^m (x^{(i)} - \bar{x})^2}$$
Equation for $\theta_0$: $$\theta_0 = \bar{y} - \theta_1 \bar{x} $$
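The two formulas above can be sketched in a few lines of plain Python. The function name `fit_simple_linear_regression` and the five data points are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (theta_0, theta_1) minimizing the sum of squared residuals."""
    m = len(xs)
    x_bar = sum(xs) / m
    y_bar = sum(ys) / m
    # Numerator: sum of (x - x_bar)(y - y_bar); denominator: sum of (x - x_bar)^2.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    theta_1 = s_xy / s_xx
    theta_0 = y_bar - theta_1 * x_bar
    return theta_0, theta_1


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
theta_0, theta_1 = fit_simple_linear_regression(xs, ys)
```

Here $\bar{x} = 3$ and $\bar{y} = 4$, the numerator is $6$ and the denominator is $10$, so $\theta_1 = 0.6$ and $\theta_0 = 4 - 0.6 \cdot 3 = 2.2$.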

#### Assumptions for linear regression
1. dependent and independent variables are linearly related
2. independent variables are not random
3. residuals are normally distributed
4. residuals are homoscedastic (constant variance)

### Polynomial Regression

Polynomial regression is a form of regression analysis in which the relationship between the independent variable $x$ and the dependent variable $y$ is modelled as an $n^{th}$ degree polynomial in $x$. Polynomial regression fits a nonlinear relationship between the value of $x$ and $y$.
- Simplest form of polynomial regression is a quadratic equation, $y = \theta_0 + \theta_1 x + \theta_2 x^2$
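Because the polynomial model is still linear in $\theta$, one common way to fit it is to expand each input into powers of $x$ and then run ordinary multiple linear regression on the expanded features. A minimal sketch of that expansion (the name `polynomial_features` is an assumption for illustration):

```python
def polynomial_features(x, degree):
    """Return [1, x, x^2, ..., x^degree] for a scalar input x."""
    return [x ** j for j in range(degree + 1)]


# A quadratic model uses degree=2; degree=3 is shown here for illustration.
row = polynomial_features(2.0, 3)
```

The leading `1` plays the role of the intercept feature $x_0 = 1$, so `row` can be used directly as one row of the design matrix.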

### Normal Equation

The normal equation is an analytical solution to the linear regression problem with an ordinary least squares cost function. That is, to find the value of $\theta$ that minimizes $J({\theta})$, take the [gradient](https://mathinsight.org/gradient_vector) of $J(\theta)$ with respect to $\theta$ and set it to $0$, i.e. $\nabla_\theta J(\theta) = 0$.

Solving for $\theta$, we get 

$$\theta = (X^TX)^{-1} X^Ty$$

[Here](https://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression/) is a post containing the derivation of the normal equation.
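As a rough sketch of the normal equation in plain Python, the code below forms $X^TX$ and $X^Ty$ and solves the linear system $(X^TX)\theta = X^Ty$ by Gauss-Jordan elimination instead of computing the inverse explicitly. All helper names are assumptions for illustration, and the sketch assumes $X^TX$ is invertible.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; no unique solution")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def normal_equation(X, y):
    """Return theta satisfying (X^T X) theta = X^T y."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(a * b for a, b in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)


# Toy data generated from y = 1 + 2x, with x_0 = 1 prepended to each row.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = normal_equation(X, y)
```

For this noise-free data the solution recovers the intercept $1$ and slope $2$ up to floating-point error.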

### Gradient Descent

Gradient descent is based on the observation that if the function $J({\theta})$ is differentiable in a neighborhood of a point $\theta$, then $J({\theta})$ decreases fastest if one goes from $\theta$ in the direction of the negative gradient of $J({\theta})$ at $\theta$. 

Thus if we repeatedly apply the following update rule, ${\theta := \theta - \alpha \nabla J(\theta)}$ for a sufficiently small value of **learning rate**, $\alpha$, we will eventually converge to a value of $\theta$ that minimizes $J({\theta})$.

For a specific parameter $\theta_j$, the update rule is 

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J({\theta}) $$

Using the definition of $J({\theta})$, we get

$$\frac{\partial}{\partial \theta_j} J({\theta}) = \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$$

Therefore, we repeatedly apply the following update rule:

$\qquad Loop \: \{$
    $\qquad \qquad \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \qquad \text{simultaneously update } \theta_j \text{ for all } j$
$\qquad \}$

This method looks at every example in the entire training set on every step, and is called **batch gradient descent (BGD)**. 

When the cost function $J$ is convex, all local minima are also global minima, so in this case gradient descent can converge to the global solution.
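The batch update rule can be sketched as follows. The function name, the learning rate, the iteration count, and the toy data are all assumptions chosen so that this small example converges; they are not recommended defaults.

```python
def predict(theta, x):
    """Linear hypothesis h_theta(x); x[0] is the intercept feature 1."""
    return sum(t * xj for t, xj in zip(theta, x))


def batch_gradient_descent(X, y, alpha=0.1, iterations=2000):
    m, n_params = len(X), len(X[0])
    theta = [0.0] * n_params
    for _ in range(iterations):
        # Residuals are computed once from the current theta, so every
        # theta_j is updated simultaneously from the same predictions.
        errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
        gradient = [
            sum(e * xi[j] for e, xi in zip(errors, X)) / m
            for j in range(n_params)
        ]
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta


# Toy data generated from y = 1 + 2x, with x_0 = 1 prepended to each row.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = batch_gradient_descent(X, y)
```

Each iteration touches all $m$ examples before taking a single step, which is what makes this the batch variant.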

There is an alternative to BGD that also works very well:

$\qquad Loop \: \{$
    $\qquad \qquad for \: i=1 \: to \: m \: \{$
    $\qquad \qquad \qquad \theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \qquad \text{simultaneously update } \theta_j \text{ for all } j$
    $\qquad \qquad \}$
$\qquad \}$

This is **stochastic gradient descent (SGD)** (also incremental gradient descent), where we repeatedly run through the training set, and for each training example, we update the parameters using gradient of the error for that training example only.

Whereas BGD has to scan the entire training set before taking a single step, SGD can start making progress right away with each example it looks at. 

Often, SGD gets $\theta$ *close* to the minimum much faster than BGD. However it may never *converge* to the minimum, and $\theta$ will keep oscillating around the minimum of $J(\theta)$; but in practice these values are reasonably good approximations. Also, by slowly decreasing $\alpha$ to $0$ as the algorithm runs, $\theta$ converges to the global minimum rather than oscillating around it.
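A minimal sketch of SGD with a slowly decaying learning rate is given below. The decay schedule $\alpha_t = \alpha_0 / (1 + 0.001\,t)$, the per-epoch shuffling, the function name, and the toy data are assumptions for illustration. The data is noise-free, so the oscillation described above is not visible here.

```python
import random


def predict(theta, x):
    """Linear hypothesis h_theta(x); x[0] is the intercept feature 1."""
    return sum(t * xj for t, xj in zip(theta, x))


def stochastic_gradient_descent(X, y, alpha0=0.05, epochs=1000, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    theta = [0.0] * len(X[0])
    indices = list(range(len(X)))
    step = 0
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for i in indices:
            step += 1
            alpha = alpha0 / (1.0 + 0.001 * step)  # slowly decaying step size
            error = predict(theta, X[i]) - y[i]
            # Update every theta_j using the gradient of this one example only.
            theta = [t - alpha * error * xj for t, xj in zip(theta, X[i])]
    return theta


# Toy data generated from y = 1 + 2x, with x_0 = 1 prepended to each row.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = stochastic_gradient_descent(X, y)
```

Unlike the batch version, the parameters change after every single example, so progress starts immediately.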

#### Underfitting and Overfitting

Error(model) = Bias(model)² + Variance(model) + Irreducible Error

Bias : how far off in general the model is from the actual value. High bias means the model is not complex enough to capture the underlying trend of the data. Low bias means the model is complex enough to capture the underlying trend of the data.

Variance : how much the model changes based on the training data. High variance means the model changes a lot based on the training data. Low variance means the model does not change much based on the training data.

**Underfitting** – High bias and low variance
- model does not fit the training data and does not generalize well to unseen data

Techniques to reduce underfitting :
1. Increase model complexity
2. Increase number of features, performing feature engineering
3. Remove noise from the data.
4. Increase the number of epochs or increase the duration of training to get better results.

**Overfitting** – High variance and low bias
- model fits the training data well, but does not generalize well to unseen data

Techniques to reduce overfitting :
1. Increase training data (data augmentation)
2. Reduce model complexity.
3. Early stopping during the training phase (monitor the validation loss over the training period and stop training as soon as it begins to increase).
4. Ridge Regularization and Lasso Regularization
5. Use dropout for neural networks to tackle overfitting.
6. Ensemble learning (bagging, boosting, stacking)
7. Cross-validation, holdout validation, k-fold cross-validation

<img src="https://i.imgur.com/b4CWHHf.png" width="500" style="display: block; margin-left: auto; margin-right: auto;">

Simple models like linear and logistic regression are prone to underfitting, whereas complex models like decision trees and neural networks are prone to overfitting.


### Adding regularization

Regularization is a technique to reduce overfitting in machine learning. This technique discourages learning a more complex or flexible model, by shrinking the parameters towards $0$.

We can regularize machine learning methods through the cost function using $L1$ regularization or $L2$ regularization. $L1$ regularization adds an absolute penalty term to the cost function, while $L2$ regularization adds a squared penalty term to the cost function. A model with $L1$ norm for regularisation is called **lasso regression**, and one with (squared) $L2$ norm for regularisation is called **ridge regression**. [Link](https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261)

$$J(\theta)_{L1} = \frac{1}{2m} \left( \sum_{i=1}^m \left( h_\theta\left( x^{(i)} \right) - y^{(i)} \right)^2 \right) + \frac{\lambda}{2m} \left( \sum_{j=1}^n |\theta_j| \right)$$

$$J(\theta)_{L2} = \frac{1}{2m} \left( \sum_{i=1}^m \left( h_\theta\left( x^{(i)} \right) - y^{(i)} \right)^2 \right) + \frac{\lambda}{2m} \left( \sum_{j=1}^n \theta_j^2 \right)$$

The partial derivative of the cost function for lasso linear regression is:

\begin{align}
& \frac{\partial J(\theta)_{L1}}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^m \left( h_\theta \left(x^{(i)} \right) - y^{(i)} \right) x_0^{(i)} 
& \qquad \text{for } j = 0 \\
& \frac{\partial J(\theta)_{L1}}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m \left( h_\theta \left( x^{(i)} \right) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{2m} \text{sign} (\theta_j)
& \qquad \text{for } j \ge 1
\end{align}

Similarly for ridge linear regression,

\begin{align}
& \frac{\partial J(\theta)_{L2}}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^m \left( h_\theta \left(x^{(i)} \right) - y^{(i)} \right) x_0^{(i)} 
& \qquad \text{for } j = 0 \\
& \frac{\partial J(\theta)_{L2}}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m \left( h_\theta \left( x^{(i)} \right) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j 
& \qquad \text{for } j \ge 1
\end{align}

These equations can be substituted into the general gradient descent update rule to get the specific lasso / ridge update rules.
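As a sketch of how the penalty enters the gradient, the function below computes the regularized gradient for either penalty and leaves $\theta_0$ unpenalized. The function and parameter names are assumptions for illustration. For lasso, `sign(0)` is taken to be `0`, which is one valid subgradient choice at $\theta_j = 0$.

```python
def predict(theta, x):
    """Linear hypothesis h_theta(x); x[0] is the intercept feature 1."""
    return sum(t * xj for t, xj in zip(theta, x))


def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def regularized_gradient(theta, X, y, lam, penalty="l2"):
    m = len(X)
    errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
    grad = [sum(e * xi[j] for e, xi in zip(errors, X)) / m
            for j in range(len(theta))]
    for j in range(1, len(theta)):  # j = 0 (intercept) is not penalized
        if penalty == "l2":
            grad[j] += lam / m * theta[j]              # ridge term
        elif penalty == "l1":
            grad[j] += lam / (2 * m) * sign(theta[j])  # lasso (sub)gradient term
        else:
            raise ValueError("penalty must be 'l1' or 'l2'")
    return grad


# theta fits this toy data exactly, so only the penalty contributes.
X = [[1.0, 1.0], [1.0, 2.0]]
y = [1.0, 2.0]
theta = [0.0, 1.0]
ridge_grad = regularized_gradient(theta, X, y, lam=4.0, penalty="l2")
lasso_grad = regularized_gradient(theta, X, y, lam=4.0, penalty="l1")
```

Plugging either gradient into $\theta_j := \theta_j - \alpha \, \partial J / \partial \theta_j$ gives the corresponding ridge or lasso update.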

Elastic Net regression is a combination of lasso and ridge regression. Its regularization term is a combination of the $L1$ and $L2$ regularization terms. The cost function is:
$$ J(\theta)_{ElasticNet} = \frac{1}{2m} \left( \sum_{i=1}^m \left( h_\theta\left( x^{(i)} \right) - y^{(i)} \right)^2 \right) + \frac{\lambda_1}{2m} \left( \sum_{j=1}^n |\theta_j| \right) + \frac{\lambda_2}{2m} \left( \sum_{j=1}^n \theta_j^2 \right)$$

#### Note:
- $\theta_0$ is NOT constrained
- scale the data before using Ridge regression
- $\lambda$ is a hyperparameter: a larger value shrinks the weights more, resulting in a flatter and smoother model
- Lasso tends to completely eliminate the weights of the least important features (i.e., setting them to 0) and it automatically performs feature selection
- Elastic Net is a third way to constrain the weights, combining Ridge and Lasso
- When to use which?
    * Ridge is a good default
    * If you suspect some features are not useful, use Lasso or Elastic Net
    * When there are more features than training examples, prefer Elastic Net

### Metrics for evaluating regression models
1. Mean Absolute Error (MAE)
    - average of absolute differences between predictions and actual values
    - more robust to outliers than MSE because large errors are not squared; not differentiable where the error is 0
    - $$ MAE = \frac{1}{m} \sum_{i=1}^m |y^{(i)} - \hat{y}^{(i)}| $$
2. Mean Squared Error (MSE)
    - average of squared differences between predictions and actual values
    - penalizes large errors more than MAE, more sensitive to outliers, differentiable
    - $$ MSE = \frac{1}{m} \sum_{i=1}^m (y^{(i)} - \hat{y}^{(i)})^2 $$
3. Mean Absolute Percentage Error (MAPE)
    - average of absolute percentage differences between predictions and actual values
    - $$ MAPE = \frac{1}{m} \sum_{i=1}^m \left| \frac{y^{(i)} - \hat{y}^{(i)}}{y^{(i)}} \right| $$
4. Root Mean Squared Error (RMSE)
    - square root of MSE
    - $$ RMSE = \sqrt{\frac{1}{m} \sum_{i=1}^m (y^{(i)} - \hat{y}^{(i)})^2} $$
5. R-squared
    - proportion of variance in the dependent variable that is predictable from the independent variable(s)
    - $$ R^2 = 1 - \frac{\sum_{i=1}^m (y^{(i)} - \hat{y}^{(i)})^2}{\sum_{i=1}^m (y^{(i)} - \bar{y})^2} $$
    - value varies between 0 and 1 usually, where 0 indicates that the model explains none of the variability of the response data around its mean, and 1 indicates that the model explains all the variability of the response data around its mean
    - if you have negative $R^2$, it means that your model is worse than the mean model
    - tends to increase when more predictors are added to the model, even if they are unrelated to the response
        - this could be misleading, because the model may not actually have a better fit
6. Adjusted R-squared
    - penalizes the addition of unnecessary predictors to the model
    - $$ R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1} $$
    - where $n$ is the number of observations (written $m$ elsewhere in these notes) and $k$ is the number of predictors
    - $R^2_{adj}$ increases only when the increase in $R^2$ is more than what is expected to happen by chance
    - $R^2_{adj}$ decreases when the model contains useless predictors
    - $R^2_{adj}$ can be negative, and it is always less than or equal to $R^2$

The metrics with squared errors (MSE, RMSE) are more commonly used than MAE and MAPE, because they are differentiable and penalize large errors more. RMSE is the most popular metric, because it is interpretable in the "y" units.
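The metrics above translate almost directly into code. The following is a minimal sketch in plain Python; the function names and the four example values are assumptions for illustration, and `mape` assumes no true value is zero.

```python
import math


def mae(y, y_hat):
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def mse(y, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def rmse(y, y_hat):
    return math.sqrt(mse(y, y_hat))


def mape(y, y_hat):
    # Undefined when any true value is 0.
    return sum(abs((a - b) / a) for a, b in zip(y, y_hat)) / len(y)


def r2(y, y_hat):
    y_bar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1 - ss_res / ss_tot


def adjusted_r2(r2_value, n_obs, k_predictors):
    return 1 - (1 - r2_value) * (n_obs - 1) / (n_obs - k_predictors - 1)


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
scores = {
    "MAE": mae(y_true, y_pred),
    "MSE": mse(y_true, y_pred),
    "RMSE": rmse(y_true, y_pred),
    "R2": r2(y_true, y_pred),
}
```

For these values the residuals are $1, 0, -1, 0$, so MAE and MSE are both $0.5$, and $R^2 = 1 - 2/20 = 0.9$.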

### Cross-validation

Cross-validation is a technique for evaluating ML models by training several ML models on subsets of the available input data and evaluating them on the complementary subset of the data. Types:
1. holdout validation set approach
    - split the data into training and validation sets
    - train the model on the training set
    - evaluate the model on the validation set
2. leave-p-out cross-validation
    - use $p$ observations as the testing set and the remaining observations as the training set
    - repeat for every possible choice of $p$ observations, and average the metric over all of these splits
3. k-fold cross-validation
    - split the data into $k$ folds
    - for each fold, train the model on the remaining $k-1$ folds and evaluate it on the current fold
    - average the metric over the $k$ folds
4. stratified k-fold cross-validation
    - same as k-fold cross-validation, but the folds are made by preserving the percentage of samples for each class
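The k-fold splitting step can be sketched as an index generator. The name `k_fold_indices` is an assumption for illustration, and the sketch uses contiguous folds, so the data should be shuffled beforehand if its order is meaningful. It does not stratify by class.

```python
def k_fold_indices(m, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [m // k + (1 if i < m % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, m))
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 3))
```

A model would then be trained on each `train` index set and scored on the matching `validation` set, with the $k$ scores averaged at the end.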

## Basis functions

Simplest model for linear regression is given by $y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \dotsm + w_D x_D$
- key property of this model is that it is a linear function of the parameters $w_0, \dotsm, w_D$
- basis functions can be used to extend linear models to make them non-linear
- basis functions are fixed and known functions of the input variables
- the model is still linear in the parameters, but non-linear in the input variables $$ y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^M w_j \phi_j(\mathbf{x}) $$ where $\phi_j(\mathbf{x})$ are the basis functions
- using linear combinations of fixed nonlinear functions of the input variables, we can model a wide range of nonlinear functions
- examples of basis functions:
    - polynomial basis functions $$ \phi_j(x) = x^j $$
    - Gaussian basis functions $$ \phi_j(x) = \exp \left\{ - \frac{(x - \mu_j)^2}{2s^2} \right\} $$
    - sigmoidal basis functions $$ \phi_j(x) = \sigma \left( \frac{x - \mu_j}{s} \right) $$ where $\sigma(a) = \frac{1}{1 + \exp(-a)}$ is the logistic sigmoid function
- advantages of basis functions:
    - closed-form solution for the parameters
    - non linear models mapping input variables to output variables through basis functions
- disadvantages:
    - assumption that the basis functions are fixed and not learned
    - curse of dimensionality, to capture the input space with a fine grid of basis functions, the number of basis functions grows exponentially with the number of input variables
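As an illustration of a basis-function model, the sketch below maps a scalar input through fixed Gaussian basis functions and evaluates the model that is linear in $\mathbf{w}$. The centers $\mu_j$, the width $s$, the weights, and the function names are assumptions for illustration.

```python
import math


def gaussian_basis(x, centers, s):
    """Feature vector [1, phi_1(x), ..., phi_M(x)] with Gaussian basis functions."""
    return [1.0] + [math.exp(-((x - mu) ** 2) / (2 * s ** 2)) for mu in centers]


def model(x, w, centers, s):
    """y(x, w) = w_0 + sum_j w_j * phi_j(x): linear in w, non-linear in x."""
    return sum(wj * phi for wj, phi in zip(w, gaussian_basis(x, centers, s)))


centers = [0.0, 1.0, 2.0]  # fixed in advance, not learned
features = gaussian_basis(1.0, centers, s=1.0)
value = model(1.0, [0.0, 0.0, 2.0, 0.0], centers, s=1.0)
```

Because the features are fixed, the weights can still be found with the same least-squares machinery as plain linear regression, applied to the transformed features.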


## Discriminative classifiers

### Generative vs Discriminative classifiers

| Aspect                     | Discriminative Models                         | Generative Models                            |
|----------------------------|----------------------------------------------|---------------------------------------------|
| Objective                  | Focuses on learning the decision boundary that separates different classes or categories in the data. | Focuses on modeling the joint distribution of features and labels to generate new data samples. |
| Modeling Approach          | Directly models the conditional probability of the class labels given the features (P(y\|x)). | Models the joint probability of both class labels and features (P(x, y)). |
| Use Case                   | Well-suited for classification tasks where the primary goal is to predict the class labels of new data points. | Suitable for classification tasks and can also be used for data generation and sampling. |
| Data Generation            | Cannot be used for generating new data samples as it only models the decision boundary. | Can be used to generate new data samples by sampling from the learned joint distribution. |
| Training Data              | Requires labeled data for learning the conditional probabilities. | Requires labeled data for estimating both the class priors and the class-conditional feature distributions. |
| Dimensionality Reduction   | Not well-suited for dimensionality reduction tasks. | Can be used for dimensionality reduction tasks, such as generating low-dimensional representations of data. |
| Example Algorithms         | Logistic Regression, Support Vector Machines (SVM), Neural Networks (for classification). | Naive Bayes, Gaussian Mixture Models, Hidden Markov Models. |


Types of classifiers:
1. Linear classifiers
    - classes are separated by a linear decision boundary, if for a given input $x$, $\theta^Tx \ge 0$ then $y = 1$, else $y = 0$
    - $\theta$'s are learned from the training data during the training phase, then used to classify new data
    - examples:
        - logistic regression
        - support vector machines
2. Non-linear classifiers
    - classes are separated by a non-linear decision boundary
    - examples:
        - Decision trees
        - Random forests
        - Neural networks

## Logistic Regression
- transforms the output of a linear regression model into a probability by applying the logistic function (sigmoid function) $$ \sigma(z) = \frac{1}{1 + e^{-z}} $$
- the output of the logistic function is interpreted as the probability of the input belonging to the positive class, $$ h_{\theta}(x) = P(y = 1 \mid x) = \sigma(\theta^Tx) = \frac{1}{1 + e^{-\theta^Tx}} $$
    - if $\theta^Tx = 0$, then $P(y = 1 \mid x) = 0.5$
    - if $\theta^Tx \gg 0$, then $P(y = 1 \mid x) \approx 1$
    - if $\theta^Tx \ll 0$, then $P(y = 1 \mid x) \approx 0$
    - here $f(x) = \theta^Tx$ is the logit (log-odds) of the predicted probability $P(y = 1 \mid x)$
- works by determining the weights $\theta$ such that the predicted probability is maximized for the positive class and minimized for the negative class
- you maximize the log-likelihood function, $$ \ell(\theta) = \sum_{i=1}^m \left[ y^{(i)} \log P(y^{(i)} = 1 \mid x^{(i)}) + (1 - y^{(i)}) \log P(y^{(i)} = 0 \mid x^{(i)}) \right] $$ 
    - if $y^{(i)} = 1$, then $P(y^{(i)} = 1 \mid x^{(i)})$ is maximized, which happens when $\theta^Tx^{(i)}$ is maximized
    - if $y^{(i)} = 0$, then $P(y^{(i)} = 0 \mid x^{(i)})$ is maximized, which happens when $\theta^Tx^{(i)}$ is minimized
- this method is called maximum likelihood estimation (MLE)
- Logit function
    - the logit function is the inverse of the logistic function, $$ \text{logit}(p) = \log \left( \frac{p}{1 - p} \right) = \sigma^{-1}(p) $$
    - logit is also called log-odds, because it is the logarithm of the odds, $$ \text{odds}(p) = \frac{p}{1 - p} $$
- dependent variable follows Bernoulli distribution
- cost function is the negative average log-likelihood, $$ J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log P(y^{(i)} = 1 \mid x^{(i)}) + (1 - y^{(i)}) \log P(y^{(i)} = 0 \mid x^{(i)}) \right] $$
- gradient descent update rule is $$ \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \qquad \text{for } j = 0, 1, \dots, n $$ with $x_0^{(i)} = 1$; the rule has the same form as in linear regression, but the hypothesis $h_\theta$ is now the sigmoid of $\theta^Tx$, as defined above
- regularization penalty terms are the same as in linear regression, but they are added to the logistic (cross-entropy) cost above, not to the squared error
    - L1 regularization : $$ J(\theta)_{L1} = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log h_\theta\left( x^{(i)} \right) + (1 - y^{(i)}) \log \left( 1 - h_\theta\left( x^{(i)} \right) \right) \right] + \frac{\lambda}{2m} \left( \sum_{j=1}^n |\theta_j| \right) $$
    - L2 regularization : $$ J(\theta)_{L2} = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log h_\theta\left( x^{(i)} \right) + (1 - y^{(i)}) \log \left( 1 - h_\theta\left( x^{(i)} \right) \right) \right] + \frac{\lambda}{2m} \left( \sum_{j=1}^n \theta_j^2 \right) $$
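
The pieces above (sigmoid, logit, cost function, and gradient descent update) can be tied together in a small sketch. It uses only the standard library and a tiny made-up dataset; the learning rate `alpha = 0.1` and the iteration count are arbitrary choices for illustration, and no regularization is applied.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real z to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds, the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))


def predict_proba(theta, x):
    """h_theta(x) = P(y = 1 | x) = sigmoid(theta^T x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def cost(theta, X, y):
    """Negative average log-likelihood J(theta)."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = predict_proba(theta, xi)
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return -total / len(X)


def gradient_step(theta, X, y, alpha):
    """One simultaneous gradient descent update of every theta_j."""
    m = len(X)
    errors = [predict_proba(theta, xi) - yi for xi, yi in zip(X, y)]
    return [
        tj - alpha * sum(e * xi[j] for e, xi in zip(errors, X)) / m
        for j, tj in enumerate(theta)
    ]


# Made-up data: each row is [1 (bias), feature]; labels overlap on purpose
# so that the classes are not perfectly separable.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
y = [0, 0, 1, 0, 1, 1]

theta = [0.0, 0.0]
initial_cost = cost(theta, X, y)  # equals log(2) when theta is all zeros
for _ in range(2000):
    theta = gradient_step(theta, X, y, alpha=0.1)
final_cost = cost(theta, X, y)

print(initial_cost, final_cost, theta)
```

With all weights at zero every prediction is 0.5, so the starting cost is $\log 2$; each update should then lower the cost on this data.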

### Types of logistic regression
1. Binary logistic regression
    - the dependent variable has only two possible outcomes
    - the goal is to determine the probability that an observation is in a particular category
2. Multinomial logistic regression
    - the dependent variable has three or more unordered categories
    - the goal is to determine the probability that an observation is in each category
3. Ordinal logistic regression
    - the dependent variable has three or more ordered categories
    - the goal is to determine the probability that an observation is in each category

### Metrics for evaluating classification models

Confusion Matrix:
<img src="https://i.imgur.com/v4FpYTm.png" width="400" style="display: block; margin-left: auto; margin-right: auto;">

Accuracy = $\frac{TP + TN}{TP + TN + FP + FN}$

Error Rate = $\frac{FP + FN}{TP + TN + FP + FN} = 1 - \text{Accuracy}$

Precision = $\frac{TP}{TP + FP}$

Recall / Sensitivity / TPR = $\frac{TP}{TP + FN}$

TNR / Specificity = $\frac{TN}{TN + FP}$

FPR / Fall-out / Type I error = $\frac{FP}{FP + TN} = 1 - TNR$

Type II error / False Negative Rate = $\frac{FN}{TP + FN} = 1 - Recall$

F1 Score = $\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
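
As a quick check of the formulas above, the following sketch computes each metric from the four confusion matrix counts. The counts are made up for illustration, and the function assumes no denominator is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute common metrics from confusion matrix counts.

    Assumes every denominator is non-zero (no guard against empty classes).
    """
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "precision": precision,
        "recall": recall,                # sensitivity / TPR
        "specificity": tn / (tn + fp),   # TNR
        "fpr": fp / (fp + tn),           # fall-out
        "fnr": fn / (tp + fn),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for 100 examples.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For these counts accuracy is 0.85, recall is 0.8, and specificity is 0.9, which matches $FPR = 1 - TNR = 0.1$.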

#### ROC Curve

AUC = Area Under the ROC Curve. ROC curves are used to see how well your classifier can separate positive and negative examples and to identify the best threshold for separating them. To be able to use the ROC curve, your classifier has to produce a ranking, that is, it should output a score such that examples with higher scores are more likely to be positive.
- y axis : TPR
- x axis : FPR

<img src="https://i.imgur.com/FICMCrT.png" width="600" style="display: block; margin-left: auto; margin-right: auto;">

Given a dataset and a classifier, you can plot the ROC curve by doing the following:
1. Rank the examples according to the classifier's output, from highest to lowest.
2. Start at (0,0).
3. For each example in the dataset:
    - If the example is positive, move $1/positive\_examples$ up.
    - If the example is negative, move $1/negative\_examples$ to the right.
4. The resulting curve is the ROC curve.
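
The plotting procedure above can be sketched as follows. The scores and labels are made up, the function names are illustrative, and the sketch assumes there are no tied scores (ties would need to be handled as a single diagonal step). The area is computed with the trapezoid rule.

```python
def roc_points(scores, labels):
    """Return the (FPR, TPR) points of the ROC curve, starting at (0, 0)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Step 1: rank examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    fpr, tpr = 0.0, 0.0
    points = [(fpr, tpr)]  # Step 2: start at (0, 0).
    for _, label in ranked:  # Step 3: one move per example.
        if label == 1:
            tpr += 1.0 / n_pos  # positive: move up
        else:
            fpr += 1.0 / n_neg  # negative: move right
        points.append((fpr, tpr))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up classifier scores and true labels (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
auc = auc_trapezoid(points)
print(points)
print(auc)  # 8/9 for this data
```

The curve always ends at (1, 1), since by then every positive and every negative example has been counted.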

<img src="https://habrastorage.org/files/267/36b/ff1/26736bff158a4d82893ff85b2022cc5b.gif" width="300" style="display: block; margin-left: auto; margin-right: auto;">

AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.
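
The probabilistic interpretation can be checked directly by comparing every positive example against every negative example. This is a brute-force sketch on made-up data; counting a tied pair as half a win is a common convention assumed here, not something stated above.

```python
def auc_pairwise(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumed convention for ties
    return wins / (len(pos) * len(neg))


# Made-up classifier scores and true labels (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]

pairwise_auc = auc_pairwise(scores, labels)
print(pairwise_auc)  # 8 of the 9 pairs are ranked correctly
```

Only the pair (0.6 positive, 0.7 negative) is ranked in the wrong order, so the result is 8/9.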

AUC is desirable for these two reasons:
- scale-invariant : measures how well predictions are ranked, rather than their absolute values.
- classification-threshold-invariant : measures the quality of the model’s predictions irrespective of classification threshold

Caveats, which limit usefulness of AUC in certain cases:
- scale-invariance : sometimes we really do need well calibrated probability outputs
- classification-threshold invariance : cases where there are wide disparities in the cost of false negatives vs. false positives

## Decision Trees
Advantages:
- inexpensive to construct
- extremely fast at classifying unknown records
- easy to interpret for small-sized trees
- can easily handle redundant or irrelevant attributes (unless the attributes are interacting)

Disadvantages:
- space of possible decision trees is exponentially large
- greedy approaches are often unable to find the best tree
- does not take into account interactions between attributes
- each decision boundary involves only a single attribute

### Decision tree impurity measures

$${\displaystyle \text{Entropy}(t) = -\sum_{c=1}^{C} p(c|t) \log_2 p(c|t)}$$

$${\displaystyle \text{Gini}(t) = 1 - \sum_{c=1}^{C} [p(c|t)]^2}$$

$$\text{Misclassification error}(t) =  1 - \max_c[p(c|t)]$$

Where $t$ is the current node, $C$ is the number of classes, and $p(c|t)$ is the proportion of the samples that belong to class $c$ at node $t$.
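
A minimal sketch of the three impurity measures, taking the class proportions $p(c|t)$ at a node as a list. The example proportions are made up, and terms with $p = 0$ are skipped in the entropy (using the usual convention $0 \log 0 = 0$).

```python
import math


def entropy(proportions):
    """Entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)


def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p ** 2 for p in proportions)


def misclassification_error(proportions):
    """Error made by always predicting the majority class at the node."""
    return 1.0 - max(proportions)


# Two hypothetical nodes: one maximally impure, one fairly pure.
for node in ([0.5, 0.5], [0.9, 0.1]):
    print(node, entropy(node), gini(node), misclassification_error(node))
```

For two classes, all three measures peak at the 50/50 split and fall to zero for a pure node.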

<img src="https://i.imgur.com/0Zn675n.png" width="400" style="display: block; margin-left: auto; margin-right: auto;">

### Using information gain to decide split
1. Calculate the entropy of the target. $$ \text{Entropy}(S) = - \sum_{i=1}^n p_i \log_2 p_i $$
2. For each feature $A$, calculate the weighted entropy of the target after splitting on $A$. $$ \text{Entropy}(S, A) = \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} \text{Entropy}(S_v) $$ where $S_v$ is the subset of $S$ for which feature $A$ has value $v$, and $\text{Entropy}(S_v)$ is computed as in step 1 using the class proportions within $S_v$.
3. Calculate the information gain for each feature. $$ \text{Information Gain}(S, A) = \text{Entropy}(S) - \text{Entropy}(S, A) $$
4. Choose the feature with the highest information gain.
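
The four steps above can be sketched on a tiny made-up dataset. The feature names and values are hypothetical and chosen so that one feature separates the classes perfectly while the other carries no information.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy(S) minus the weighted entropy after splitting on the feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


# Hypothetical data: 4 "yes" and 4 "no" labels, so Entropy(S) = 1 bit.
labels = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
features = {
    "perfect": ["a", "a", "b", "b", "a", "b", "a", "b"],  # splits classes cleanly
    "useless": ["x", "x", "x", "x", "y", "y", "y", "y"],  # each half stays 50/50
}

gains = {name: information_gain(vals, labels) for name, vals in features.items()}
best_feature = max(gains, key=gains.get)
print(gains, best_feature)
```

Here the `perfect` feature recovers the full 1 bit of entropy, while `useless` has a gain of 0, so step 4 picks `perfect`.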

### Using Gini index to decide split
1. Calculate the Gini index of the target. $$ \text{Gini}(S) = 1 - \sum_{i=1}^n p_i^2 $$
2. For each feature $A$, calculate the weighted Gini index of the target after splitting on $A$. $$ \text{Gini}(S, A) = \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} \text{Gini}(S_v) $$ where $S_v$ is the subset of $S$ for which feature $A$ has value $v$, and $\text{Gini}(S_v)$ is computed as in step 1 using the class proportions within $S_v$.
3. Calculate the Gini gain for each feature. $$ \text{Gini Gain}(S, A) = \text{Gini}(S) - \text{Gini}(S, A) $$
4. Choose the feature with the highest Gini gain.
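
The same procedure with the Gini index, again as a sketch on made-up data. The feature names and values are hypothetical.

```python
from collections import Counter, defaultdict


def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(feature_values, labels):
    """Gini(S) minus the weighted Gini index after splitting on the feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    n = len(labels)
    weighted = sum(len(g) / n * gini(g) for g in groups.values())
    return gini(labels) - weighted


# Hypothetical data: 4 "yes" and 4 "no" labels, so Gini(S) = 0.5.
labels = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
features = {
    "perfect": ["a", "a", "b", "b", "a", "b", "a", "b"],  # pure subsets
    "partial": ["x", "y", "x", "y", "x", "y", "x", "y"],  # 3-to-1 subsets
}

gains = {name: gini_gain(vals, labels) for name, vals in features.items()}
best_feature = max(gains, key=gains.get)
print(gains, best_feature)
```

The `partial` feature produces two subsets with a 3-to-1 class mix (Gini 0.375 each), giving a gain of 0.125, while `perfect` gives the maximum possible gain of 0.5.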
