# A memo and review of the popular ML algorithms
- First we answer some general questions: why we need a model, and how we train, interpret, evaluate, validate and choose between different models based on the pros and cons of each
- Then we discuss individual model families in detail:
  - Linear models: linear regression and logistic regression
  - Tree-based models: the basic decision tree and tree ensembles such as random forest and gradient boosted decision trees
  - SVM
  - Neural network

## General aspects

### What is a model?

A model represents the relationship between variables. We want to describe the relationship between the features and the response variables while making some simplifying assumptions.

### Why do we model?

There are two main purposes: prediction and inference.
- For prediction, we want to predict the value of the response variable given a set of features. We care more about getting the right answer than about knowing the relationship between the features and the response. An example is predicting stock prices.
- For inference, in contrast, we care more about the relationship between the features and the response. An example is finding the most important features in predicting house prices.

Therefore there is a tradeoff between these two purposes. Linear regression is great for inference: each coefficient tells us how much the response changes with a one-unit change in the corresponding feature, holding the other features fixed. A Gaussian kernel SVM, by contrast, could result in higher accuracy but is hard to interpret.

### Getting the right model

Generally there are two steps in achieving a good model: building the model given a set of inputs, and choosing between models. The pool of metrics available for training and for choosing is the same for every model. The key difference is that when we train a model we optimize a single metric, whereas when we choose a model we can compare any number of metrics.
- Building the model: Each type of model has its own algorithm or equation for fitting a fixed set of inputs, and the result is a fitted model. Training optimizes the model by minimizing a target metric, referred to as the loss. For regression problems we often use the residual sum of squares (RSS) as the loss, and for classification we often use cross entropy.
- Choosing the model: When choosing a model, we need to consider factors such as model performance, interpretability and scalability, since different models trade these off differently. The choice is often determined by the needs of the project and the knowledge and experience of the data scientist.
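
The two losses named above can be sketched in a few lines. This is a minimal illustration using only the standard library; the function names, the `eps` clipping constant and the toy numbers are assumptions made for the example, and the cross entropy shown is the binary (two-class) form averaged over observations.

```python
import math


def residual_sum_of_squares(y_true, y_pred):
    """RSS: sum of squared differences between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross entropy for labels in {0, 1} and predicted P(y = 1)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log(0) never occurs
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


# Regression: a smaller RSS means predictions sit closer to the observations.
rss = residual_sum_of_squares([3.0, 5.0, 7.0], [2.5, 5.0, 8.0])

# Classification: confident, correct probabilities give a lower loss
# than hesitant ones for the same labels.
confident = binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])
hesitant = binary_cross_entropy([1, 0, 1], [0.6, 0.4, 0.5])
```

Both functions return a single number, which is what makes them usable as the one metric optimized during training.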

### Cross validation

When training a supervised model, we want to know how the model performs on data it has not seen, so we validate it on a different data set. We therefore often divide the available data into training and test sets, commonly with an 80:20 ratio.
- If the model performs much worse on the test set than on the training set, we say it is overfitting. To address this, we could get more data, use regularization, reduce the number of features, decrease the flexibility of the model and so on.
- If the model performs about the same on the training and test sets but poorly on both, we say the model has high bias (it underfits). To address this, we could engineer more features, increase the flexibility of the model and so on.
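
As a rough sketch of the split and the two diagnoses above, assuming the data fit in a Python list: `train_test_split` and `diagnose` are hypothetical helpers written for this note, and the `acceptable_error` and `gap_ratio` thresholds are illustrative assumptions, not standard values.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def diagnose(train_error, test_error, acceptable_error, gap_ratio=1.5):
    """Heuristic reading of train/test errors; the thresholds are assumptions."""
    if train_error > acceptable_error:
        # Poor even on the data the model was fitted to.
        return "underfitting (high bias)"
    if test_error > gap_ratio * train_error:
        # Good on the training set, much worse on held-out data.
        return "overfitting (high variance)"
    return "ok"


rows = list(range(100))
train, test = train_test_split(rows)  # 80:20 split by default
```

Shuffling before splitting matters when the rows are stored in a meaningful order, for example sorted by date or by class.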

We often use k-fold cross validation to choose hyperparameters. It works as follows:
- We divide the training data set into k groups (folds). Each time, we train a model on (k - 1) groups and use it to predict the observations in the held-out group, which yields a score.
- We repeat this step k times, holding out each group once, then average the k scores and use the average as the criterion for choosing the hyperparameters of interest.
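
The procedure can be sketched without any library. In this minimal example, `k_fold_indices` and `cross_val_score` are hypothetical helper names, and the one-feature ridge fit through the origin is only a stand-in model with a single hyperparameter (`lam`); the data are a noiseless toy set. The folds are contiguous, so real data should be shuffled first.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs that partition range(n) into k folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, val_idx
        start += size


def cross_val_score(xs, ys, k, fit, loss):
    """Average validation loss over k folds; fit returns a predict function."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        preds = [model(xs[i]) for i in val_idx]
        scores.append(loss([ys[i] for i in val_idx], preds))
    return sum(scores) / len(scores)


def make_ridge_fit(lam):
    """Stand-in model: one-feature ridge regression through the origin."""
    def fit(xs, ys):
        sxy = sum(x * y for x, y in zip(xs, ys))
        sxx = sum(x * x for x in xs)
        slope = sxy / (sxx + lam)  # larger lam shrinks the slope toward 0
        return lambda x: slope * x
    return fit


def mse(y_true, y_pred):
    """Mean squared error on one validation fold."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


xs = [float(i) for i in range(1, 21)]
ys = [2.0 * x for x in xs]  # noiseless toy data, so no penalty is needed
candidates = [0.0, 10.0, 1000.0]
scores = {lam: cross_val_score(xs, ys, 5, make_ridge_fit(lam), mse)
          for lam in candidates}
best_lam = min(scores, key=scores.get)  # lowest average validation loss
```

Each candidate value is scored on exactly the same folds, so the averaged scores are directly comparable.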

### Characteristics of models:
- Dimension safe: How well will the model perform if we have a lot of features? What if the number of features (p) is greater than the number of observations (n)?
- Training speed: How long does it take to train the model?
- Prediction speed: Once we have the trained model, how long does it take to make predictions?
- Interpretability: Especially important when our purpose is inference. Can we determine the most important features and their direct effect on the response?
- Communication: How do we explain the model in 2 sentences to non-technical colleagues?
- Visualization: How do we present the model visually?
- Evaluation: What metrics can we use to score the models in order to choose the best model?
- Nonlinearity: How does this model react to nonlinear data?
- Multicollinearity: Does the model handle multicollinearity well?
- Outliers: Is the model robust to outliers?
- Overfitting: Does this model tend to overfit?
- Hyperparameters: What parameters can we tune in order to achieve a better score for this model?
- Online: Can the model be easily updated with new data (without refitting on the previously used data)?
- Unique attributes for each model
- Special use cases

### Other general notes
More data usually produces a better model. Better features usually beat a better model. Therefore data collection and feature engineering typically trump model selection. Keep this in mind when tackling a new problem.

Real-life data are usually messy: the inputs tend to be mixtures of quantitative, binary and categorical values, the latter often with many levels. There are also generally many missing values. Distributions of numerical predictors and responses are often long-tailed and highly skewed.

Usually only a small fraction of the large number of predictors that have been included in the analysis are actually relevant to the response.

Furthermore, when doing model selection, trying out several different kinds of models is one of the best ways to determine which model to use, and you might ultimately find that an ensemble works best.

Things like interpretability, speed and simplicity will guide your choice of models.

## Linear models

### Linear regression

Linear regression assumes a simple linear relationship between the predictors and the response variable:
$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon$, where $\epsilon$ is a random error. When fitting the model, we choose the parameters that minimize the residual sum of squares.
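
For the single-feature case ($p = 1$), the parameters that minimize the residual sum of squares have a well-known closed form. The sketch below restricts itself to that case to stay dependency-free; `fit_simple_ols`, `rss_for` and the toy data are assumptions made for the example, and the general case with $p$ features would normally be solved with a linear algebra library.

```python
def fit_simple_ols(xs, ys):
    """Return (beta0, beta1) minimizing sum((y - beta0 - beta1 * x) ** 2)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Centered cross-product and centered sum of squares of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    beta1 = sxy / sxx                  # slope
    beta0 = y_mean - beta1 * x_mean    # intercept
    return beta0, beta1


def rss_for(beta0, beta1, xs, ys):
    """Residual sum of squares of the line beta0 + beta1 * x on the data."""
    return sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys))


# Toy data lying roughly on a straight line (values invented for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 8.8]
beta0, beta1 = fit_simple_ols(xs, ys)
best_rss = rss_for(beta0, beta1, xs, ys)
```

Moving either parameter away from the fitted values can only increase `rss_for`, which is a quick sanity check that the fit really is the least-squares solution.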

#### Assumptions of linear regression
- There is a linear relationship between the predictors and the response variable
- The distribution of X is arbitrary, and the observations are independent of each other
- The random error is independent across observations, has expected value 0, and has constant variance
- Gaussian noise assumption: $\epsilon \sim N(0, \sigma^2)$
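
Written out for observations $i = 1, \dots, n$, the error assumptions above read as follows. The observation index $i$ and the notation $x_{ij}$ for the value of feature $j$ in observation $i$ are introduced here for illustration and do not appear elsewhere in these notes.

$$
E[\epsilon_i \mid X] = 0, \qquad
\mathrm{Var}(\epsilon_i \mid X) = \sigma^2, \qquad
\mathrm{Cov}(\epsilon_i, \epsilon_j \mid X) = 0 \quad (i \neq j)
$$

Adding the Gaussian noise assumption is equivalent to specifying the conditional distribution of the response:

$$
Y_i \mid X \sim N\left(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}, \; \sigma^2\right)
$$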

#### Some notes on the assumptions
- The true relationship between Y and X might not be linear, but by Taylor expansion a linear model is a reasonable local (first-order) approximation, and it is better than nothing
- There is no assumption about the distribution of X or Y, or about the joint distribution of X and Y
- No assumption about causality that X causes Y
- No assumption that X is more precise, Y is more accurate
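- The error term is not always normally distributed, but normality is often a good approximation by the central limit theorem (CLT) when the error is the sum of many small independent effects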

### Logistic regression

## Tree-based models