# Machine Learning Models

## Supervised Learning 

### What is Regression?
Regression is the method of finding the best-fitting line or curve, out of the infinitely many that could be drawn through a given set of observations, by optimizing a criterion.
For regression, that criterion is a loss function, which we minimize.

### Linear Regression
The dependent variable is continuous in nature.

### Classification
The dependent variable is categorical in nature.

## Unsupervised Learning 

### Clustering
There are no predefined labels; the model separates the data into a number of groups (clusters).

# Linear Regression

Linear regression is regression in which the best-fit line is a first-degree polynomial.
The equation of the best-fit regression line is found by minimizing the cost function, which can be done using one of the following two methods:
1. Differentiation
2. Gradient Descent

The strength of a linear regression fit can be measured by $R^2$ (R squared).

## Cost Function
Linear regression assumes a linear relationship between the response variable and the predictor variables. Mathematically, this can be written as:

$y_i = β_0 + β_1x_i + ϵ_i$

where $y_i$ is the response variable, $x_i$ is the predictor variable, $β_0, β_1$ are the coefficients for the population and $ϵ_i$ is the error.

We estimate the parameters using linear regression. The predicted value is given by:

$\hat{y}_i = b_0 + b_1x_i$

where $b_0$ and $b_1$ are the estimated parameters, which serve as estimates of the population parameters $β_0$ and $β_1$.

The error between the original and the predicted value (the residual) is thus:

$e_i = y_i - \hat{y}_i$

To find the best-fit line using linear regression, we need to minimize this error term. To do this, we define the function:

$\sum_{i=1}^n (y_i - \hat{y}_i)^2$

which is our cost function. Minimizing the cost function gives us the optimal values of the parameters/coefficients.
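
As a minimal sketch of what minimizing this cost looks like for a single predictor, the snippet below uses the standard closed-form least-squares solution (slope = co-variation of x and y divided by the variation of x; intercept chosen so the line passes through the point of means). The function names and the data are made up for illustration.

```python
def fit_simple_ols(xs, ys):
    """Closed-form least-squares estimates (b0, b1) for y ~ b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    b1 = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means.
    b0 = y_mean - b1 * x_mean
    return b0, b1


def cost(xs, ys, b0, b1):
    """Sum of squared errors between actual and predicted values."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Illustrative data (made up for this sketch).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1 = fit_simple_ols(xs, ys)
print(b0, b1, cost(xs, ys, b0, b1))
```

Any other choice of `b0` and `b1` gives a larger value of `cost` on the same data.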

### Types of Cost Functions
#### Minimisation
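We minimise the cost function by setting its derivative (gradient) with respect to the parameters to 0. By minimising the cost function, the fitted line lies as close as possible to the original data points.

Types of minimisation:
1. Constrained minimisation - the cost function is minimised subject to a constraint on the parameters (for example, a bound on the size of the coefficients)
2. Unconstrained minimisation - the cost function is minimised with no restriction on the parameters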

Unconstrained minimisation can be done using one of the following methods:
- Closed form - the cost function is differentiated and the derivative is equated to 0 to find the solution. The second derivative is then checked to confirm it is greater than 0, i.e. that the solution is a minimum.
- Gradient descent - an iterative process for finding a minimum (the global minimum when the cost function is convex, as the least-squares cost is). We start with an initial value $x_0$ and a learning rate $\alpha$. The derivative of the cost function, denoted $f'(x)$, is evaluated at the current value, and the parameter is updated to $x - \alpha f'(x)$. With each step, the algorithm moves closer to the minimum.
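
The gradient descent update can be sketched for the two parameters of a straight line as follows. This is an illustration under stated assumptions, not a tuned implementation: the cost is averaged over the samples (which does not change where the minimum is), the starting point is 0 for both parameters, and the learning rate, step count and data are made up.

```python
def gradient_descent(xs, ys, alpha=0.05, steps=5000):
    """Fit y ~ b0 + b1 * x by repeatedly stepping against the gradient."""
    n = len(xs)
    b0, b1 = 0.0, 0.0  # initial values (assumed)
    for _ in range(steps):
        residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error w.r.t. b0 and b1.
        grad_b0 = -2.0 / n * sum(residuals)
        grad_b1 = -2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        # Update rule: new value = old value - learning rate * derivative.
        b0 -= alpha * grad_b0
        b1 -= alpha * grad_b1
    return b0, b1


# Illustrative data lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
gd_b0, gd_b1 = gradient_descent(xs, ys)
print(gd_b0, gd_b1)
```

If the learning rate is too large the updates overshoot and diverge; if it is too small, many more steps are needed.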

#### Maximisation
Here the objective function is maximised instead: we look for a line that is as far as possible from the data points, e.g. in a classification/clustering problem we want a line that is far enough from all the data points that it clearly separates the classes/clusters in the data.

### R2
For simple linear regression, it is the square of Pearson's correlation coefficient, also known as _r_.

R² = 1 - (RSS / TSS)

#### RSS: Residual Sum of Squares
The sum of squared distances between each actual data point and the corresponding predicted point on the regression line.

#### TSS: Total Sum of Squares
The sum of squared distances between each actual data point and the line through the average of all the data points, i.e. a line parallel to the x axis.
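
A small sketch of the $R^2$ computation, taking TSS in its standard form (squared deviations of the actual values from their mean). It also checks, on one made-up dataset with a single predictor, that $R^2$ matches the square of Pearson's _r_. The helper names and the data are illustrative; the predictions come from the least-squares line for this data.

```python
import math


def r_squared(ys, preds):
    """R^2 = 1 - RSS / TSS."""
    y_mean = sum(ys) / len(ys)
    rss = sum((y - p) ** 2 for y, p in zip(ys, preds))  # actual vs predicted
    tss = sum((y - y_mean) ** 2 for y in ys)            # actual vs mean
    return 1 - rss / tss


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
preds = [2.2 + 0.6 * x for x in xs]  # least-squares line for this data
r2 = r_squared(ys, preds)
r = pearson_r(xs, ys)
print(r2, r ** 2)
```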

# Linear Regression
The dependent variable is continuous in nature.
The relationship between the dependent variable and the independent variables is linear in nature. 

The least squares error, which is the sum of the squared differences between the actual values and the predicted values (using the fitted regression line), is used to determine the best-fit line. The key to getting the best-fit line is minimising these errors.

## Simple Linear Regression
Single independent variable

$y = β_0 + β_1X$


#### Assumptions of simple linear regression
1. Linear relationship between X and y.
2. Normal distribution of error terms.
3. Independence of error terms.
4. Constant variance of error terms.

#### Hypothesis testing in linear regression
1. To determine the significance of beta coefficients.
2. $H_0: β_1=0; H_A:β_1≠0.$
3. T-test on the beta coefficient.
4. $t\text{-score} = \frac{\hat{β}_i}{SE(\hat{β}_i)}.$

The null hypothesis can be rejected if the p-value is less than 0.05, i.e. the $β$ coefficients learned by the model are indeed significant.
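
To make the t-score concrete, here is a hedged sketch for the slope of a simple linear regression. It assumes the usual standard error formula $SE(b_1) = \sqrt{\frac{RSS/(n-2)}{\sum (x_i - mean(x))^2}}$; the function name and data are illustrative. Turning the t-score into a p-value needs the t distribution with $n-2$ degrees of freedom, which the Python standard library does not provide, so that step is left to a statistics package or a t-table.

```python
import math


def slope_t_score(xs, ys):
    """t-score of the fitted slope in y ~ b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    b0 = y_mean - b1 * x_mean
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    se_b1 = math.sqrt(rss / (n - 2) / s_xx)
    return b1 / se_b1


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
t_score = slope_t_score(xs, ys)
print(t_score)
```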

#### Building Linear model
1. OLS (Ordinary Least Squares) method in statsmodels to fit a line.
2. Summary statistics\
    F-statistic, R-squared, coefficients and their p-values.
    
The basic idea behind the F-test is that it is a relative comparison between the model that you've built and the model without any of the coefficients except for $β_0$. If the value of the F-statistic is high, it would mean that the Prob(F) would be low and hence, you can conclude that the model is significant. On the other hand, if the value of F-statistic is low, it might lead to the value of Prob(F) being higher than the significance level (taken 0.05, usually) which in turn would conclude that the overall model fit is insignificant and the intercept-only model can provide a better fit.    
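
The comparison against the intercept-only model can be sketched by hand. The snippet assumes the usual form of the overall F-statistic for a model with an intercept and `p` predictors, $F = \frac{(TSS - RSS)/p}{RSS/(n-p-1)}$; the function name and data are illustrative, and converting F into Prob(F) again needs a statistics package.

```python
def f_statistic(ys, preds, p):
    """Overall F-statistic of a fitted model with p predictors plus intercept."""
    n = len(ys)
    y_mean = sum(ys) / n
    rss = sum((y - f) ** 2 for y, f in zip(ys, preds))  # error of the model
    tss = sum((y - y_mean) ** 2 for y in ys)            # error of intercept-only model
    explained = tss - rss
    return (explained / p) / (rss / (n - p - 1))


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
preds = [2.2 + 0.6 * x for x in xs]  # least-squares line for this data
f_value = f_statistic(ys, preds, p=1)
print(f_value)
```

With a single predictor, the F-statistic equals the square of the slope's t-score.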
    
#### Residual analysis
1. Histogram of the error terms to check normality.
2. Plot of the error terms with X or y to check independence.

#### Predict


## Steps of implementing Linear Regression

1. Step 1: Reading and understanding the data 
2. Step 2: Training the model
3. Step 3: Residual analysis
4. Step 4: Predicting and evaluating the model on test set

## Multiple Linear Regression
As the name suggests, we have multiple independent variables, and the model captures the change in the dependent variable per unit increase in each independent variable.

Multiple independent variables

$y = β_0 + β_1X_1 + β_2X_2 + ... + β_nX_n$

**The model now fits a hyperplane instead of a line**

### Considerations for MLR
All the assumptions of SLR hold true for MLR; however, there are certain considerations we need to keep in mind:
1. Adding more variables isn't always helpful
  - Adding more variables may lead to overfitting
  - Multicollinearity - the independent variables should not be correlated with each other; in other words, they must be independent of each other. Only then can we interpret the coefficients learned by the model reliably, because each coefficient measures the change in the dependent variable per unit increase in one independent variable when the other predictors are held constant.
2. Feature selection becomes an important aspect of model development

### Dealing with categorical variables
1. Dummy variables - used when a categorical variable has few levels, e.g. marital status.
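
A minimal sketch of dummy encoding for a marital-status column. The helper name, the `is_` prefix and the sample values are hypothetical; dropping the first level (so it becomes the base state) is a common convention assumed here, not something these notes prescribe.

```python
def make_dummies(values, drop_first=True):
    """Encode a categorical column as 0/1 dummy columns, one dict per row."""
    levels = sorted(set(values))
    if drop_first:
        # The dropped level becomes the base state (all dummies equal to 0).
        levels = levels[1:]
    return [{f"is_{lvl}": int(v == lvl) for lvl in levels} for v in values]


marital_status = ["single", "married", "divorced", "married"]
rows = make_dummies(marital_status)
for status, row in zip(marital_status, rows):
    print(status, row)
```

With three levels, two dummy columns are enough: a row with both set to 0 is the base level.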

### Multicollinearity
Multicollinearity affects:
1. Interpretation
    - the reading of a coefficient as the change in the dependent variable per unit increase in an independent variable, with all other independent variables held constant, no longer applies
2. Inference
    - Coefficients can change wildly, and their signs can flip
    - p-values are not reliable

However, multicollinearity does not affect:
1. Predictions and the precision of predictions
    - because all the independent variables are in the model, the predictions are not affected by the fact that some independent variables are correlated
2. Goodness of fit
    - the fit is not affected either
    
### Detecting multicollinearity
Use pairwise correlations between independent variables. However, pairwise correlation only captures one-to-one association, while a variable can be associated with several other variables jointly.

To find the association of a predictor $X_1$ with the other predictors $X_2$, $X_3$, ..., $X_n$, we can build a model that predicts $X_1$ from the other predictors. The resulting measure is known as the **Variance Inflation Factor (VIF)**. VIF quantifies how well one independent variable is explained by all the other independent variables combined.

$VIF_i = \frac {1}{1-R_i^2}$

where _i_ refers to the i<sup>th</sup> variable, which is represented as a linear combination of the rest of the independent variables, and $R_i^2$ is the $R^2$ of that regression.

The common heuristic we follow for the VIF values is:\
    **> 10**:  Definitely high VIF value and the variable should be **eliminated**.\
    **> 5**:  Can be okay, but it is **worth inspecting**.\
    **< 5**: Good VIF value. **No need to eliminate** this variable.
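
The VIF computation can be sketched in plain Python: regress each predictor on the others, take the $R^2$ of that fit, and apply the formula. This is an illustrative sketch, not a library routine; the normal-equations solver, the function names and the three made-up predictor columns (where `x2` is roughly twice `x1`) are all assumptions of the example.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared_of_fit(columns, target):
    """R^2 of a least-squares fit of target on the columns plus an intercept."""
    n = len(target)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * t for row, t in zip(design, target)) for i in range(k)]
    beta = solve(xtx, xty)
    preds = [sum(c * v for c, v in zip(beta, row)) for row in design]
    mean_t = sum(target) / n
    rss = sum((t - p) ** 2 for t, p in zip(target, preds))
    tss = sum((t - mean_t) ** 2 for t in target)
    return 1 - rss / tss


def vif(columns, i):
    """VIF of predictor i: 1 / (1 - R_i^2), with R_i^2 from the other predictors."""
    others = columns[:i] + columns[i + 1:]
    return 1.0 / (1.0 - r_squared_of_fit(others, columns[i]))


x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]  # roughly 2 * x1, so collinear with x1
x3 = [5.0, 3.0, 8.0, 1.0, 7.0, 2.0]
columns = [x1, x2, x3]
for i, name in enumerate(["x1", "x2", "x3"]):
    print(name, vif(columns, i))
```

On this data `x1` and `x2` get very high VIF values, while `x3` stays low.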
 
### Dealing with multicollinearity
Multicollinearity does not affect predictions or their precision, but the goal of modeling is to build a lean model with as few variables as needed, so we should deal with multicollinear variables.
1. Drop variables
  - Drop variables which are highly correlated 
  - Keep the variables which are important to the business
2. Create variables 
  - Create new variables using interactions of old variables
  - Add interaction features, i.e. features derived from some of the original features
  - Drop the original features
3. Transform variables
  - Transform the original variables into new features which are independent of each other
    - PCA - Principal Component Analysis
    - PLS - Partial Least Squares

### Scaling variables
- It is good practice to have all variables in a comparable range.
- Scaling gives faster convergence of gradient descent.
- Scale after the train/validation split, so that no validation data is used when computing the mean/standard deviation.
- A variable's scale does not affect the p-values or the accuracy of the model; it only affects the coefficients.
- Scaling does not affect the relationship with the target variable.
- The shape of the original variable's distribution is not affected; it is just shifted and rescaled.

#### Methods of scaling 
1. Standardization 
  - rescales the data to have a mean of 0 and a standard deviation of 1 (it does not make the data normally distributed)
  - $x_{new} = \frac{x - mean(x)}{sd(x)}$
  - in standardized simple regression, i.e. when both X and y are standardized, the beta coefficient is the same as the correlation
2. MinMax Scaling 
  - also known as Normalization
  - data is shifted and rescaled so that it lies between 0 and 1
  - $x_{new} = \frac{x - x_{min}}{x_{max} - x_{min}}$
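
Both scaling methods can be sketched with the standard library. The statistics are computed on the training split only and then reused on the validation split. The function names and the data are illustrative, and using the population standard deviation (`pstdev`) rather than the sample one is an assumption of this sketch.

```python
from statistics import mean, pstdev


def fit_standardizer(train):
    """Learn the mean and standard deviation from the training data only."""
    return mean(train), pstdev(train)


def standardize(values, mu, sigma):
    return [(v - mu) / sigma for v in values]


def fit_minmax(train):
    """Learn the minimum and maximum from the training data only."""
    return min(train), max(train)


def minmax_scale(values, lo, hi):
    return [(v - lo) / (hi - lo) for v in values]


train = [10.0, 20.0, 30.0, 40.0, 50.0]
valid = [25.0, 60.0]

mu, sigma = fit_standardizer(train)
train_std = standardize(train, mu, sigma)
valid_std = standardize(valid, mu, sigma)  # reuses the training statistics

lo, hi = fit_minmax(train)
train_mm = minmax_scale(train, lo, hi)
valid_mm = minmax_scale(valid, lo, hi)     # can fall outside [0, 1]
print(train_std, valid_std)
print(train_mm, valid_mm)
```

Because the validation value 60 is larger than anything seen in training, its min-max scaled value is above 1; that is expected when the scaler is fitted on the training split alone.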

#### Should you scale categorical variables?
- There is no definite answer.
- A dummy variable is already between 0 and 1.
- It depends on what you are doing, e.g. with regularization such as LASSO you may want to scale them.
- With dummy variable values of 0 and 1, the coefficient is easy to interpret: 0 is the base state and 1 is the state over the base state.

### Model assessment and comparison
While $R^2$ is a good measure of model fit, it is not sufficient for comparing two models with different sets of features, because $R^2$ never decreases when more variables are added. So $R^2$ alone does not tell us which variables actually contribute to the prediction.
Also, while comparing models, we need to strike a balance between **keeping the model simple** and **explaining the most variance**, which tends to favour keeping more variables. This is known as the **Bias-Variance Tradeoff.**
There are some measures which can be used to compare models:
1. Adjusted $R^2$
  - The adjusted $R^2$ penalizes models with a higher number of variables. 
  - $\text{adjusted } R^2 = 1 - \frac {(1-R^2)(N-1)} {(N - p - 1)}$
    - N is the number of records/samples in the data
    - p is the number of variables
2. $AIC = n \times \log\left(\frac{RSS}{n}\right) + 2p$
3. BIC
4. Mallows's $C_p$
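
The first two measures translate directly into code. The sketch below follows the two formulas as written in this section (with the natural logarithm assumed for AIC); the function names and the example numbers are made up for illustration.

```python
import math


def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n samples and p variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def aic(rss, n, p):
    """AIC = n * log(RSS / n) + 2p; lower is better."""
    return n * math.log(rss / n) + 2 * p


# Hypothetical comparison: the larger model has a slightly higher R^2,
# but not enough to justify seven extra variables.
small_model = adjusted_r2(0.800, n=50, p=3)
large_model = adjusted_r2(0.805, n=50, p=10)
print(small_model, large_model)

# Same RSS with one more variable gives a worse (higher) AIC.
print(aic(2.4, n=5, p=1), aic(2.4, n=5, p=2))
```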

### Feature selection
#### Manual feature elimination
1. Build the model with all the features
2. Drop the features that are least helpful in prediction (high p-value)
3. Drop the features that are redundant (using correlations and VIF)
4. Rebuild model and repeat

#### Automated approach
1. Top 'n' features: Recursive Feature Elimination
2. Forward/backward/Step wise selection: based on AIC
3. Regularization: LASSO

- Regression supports **interpolation** of data, **not extrapolation**. Interpolation means using the model to predict the value of the dependent variable for independent values that **lie within the range of data** you already have. Extrapolation, on the other hand, means predicting the dependent variable for independent values that **lie outside the range of the data** the model was built on.



# Handling nonlinearity in data
## Identify 
1. Simple Linear regression
  - plot dependent variable against independent variable.
2. Multiple Linear Regression
  - Residuals are normally distributed with mean of 0
  - Residuals have constant variance
  - There are no significant patterns when residuals are plotted against the independent variables, i.e. the residuals/error terms are independent of each other
  
## Addressing nonlinearity 
When the response variable does not have a linear relationship with the independent variable, we can use one of the following methods:
1. Polynomial Regression
2. Data transformation
3. Non-linear regression

## Polynomial Regression
The equation for linear regression is extended to a polynomial of order n:

$y = β_0 + β_1X + β_2X^2 + ... + β_nX^n$

The equation should include terms of the independent variable (X) with every power from 0 to n, where n is the order of the polynomial. 
Each term is treated as a separate independent variable, so polynomial regression is a special case of multiple linear regression.
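
The "each power is its own variable" idea can be sketched as a feature-expansion step. This shows only how the columns are built and how a prediction stays linear in the coefficients; fitting the coefficients would then use any multiple linear regression routine. The function names and the coefficient values are hypothetical.

```python
def polynomial_features(x, degree):
    """Expand a single value x into [x**0, x**1, ..., x**degree]."""
    return [x ** k for k in range(degree + 1)]


def predict(betas, x):
    """Prediction is a plain weighted sum of the expanded features."""
    feats = polynomial_features(x, len(betas) - 1)
    return sum(b * f for b, f in zip(betas, feats))


betas = [1.0, 2.0, 3.0]  # hypothetical coefficients for 1 + 2x + 3x^2
row = polynomial_features(2.0, 3)
y_hat = predict(betas, 2.0)
print(row, y_hat)
```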

### Advantages
1. A polynomial curve can bend to follow the data points more closely than the straight line of linear regression

### Disadvantages
1. Complex model
2. Risk of overfitting

## Data Transformation
The independent variable and/or the dependent variable are transformed using one of the following
- $\ln(x)$
- $\sqrt{x}$
- $\exp(x)$ 
- inverse
- many more

If there is non-linearity in the data, transform the predictor variables.
If the error terms are not normally distributed or have unequal variance, transform the response variable. This can help with non-linearity in the data as well.
When the regression function is not linear and the error terms are not normal and have unequal variance, transform both the predictor and the response variables.
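
As one hedged example of a transformation, suppose the response grows exponentially, $y = a \cdot e^{bx}$. Taking $\ln(y)$ turns this into a straight line, $\ln(y) = \ln(a) + bx$, which ordinary linear regression can fit. The data below is generated without noise purely for illustration, and the helper name is made up.

```python
import math


def fit_line(xs, ys):
    """Closed-form least-squares fit of y ~ b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    b1 = s_xy / s_xx
    return y_mean - b1 * x_mean, b1


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]  # non-linear in x

# Transform the response, then fit a straight line to (x, ln y).
log_ys = [math.log(y) for y in ys]
b0, b1 = fit_line(xs, log_ys)

# Map the fitted line back to the original scale: a = exp(b0), b = b1.
a_hat, rate_hat = math.exp(b0), b1
print(a_hat, rate_hat)
```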

### Advantages
1. The transformation helps with non-linearity between the dependent and independent variables.
2. Helps model stay in linear regression framework.

### Disadvantages
1. The choice of transformation becomes an important decision.
2. It is a trial-and-error approach.

## Nonlinear regression


# Pitfalls of Linear Regression
1. Non-constant variance
2. Autocorrelation and time series issue
3. Multicollinearity
4. Extrapolation
5. Overfitting

## Non-constant variance
Constant variance of the error terms is one of the assumptions of linear regression. However, it is often seen that the error terms do not have constant variance: the spread of the residuals gradually increases or decreases as we move from left to right on the residual plot. This is called *heteroscedasticity*.
One way to tackle this is to transform the response variable with a concave function such as log or square root, so that larger values of the response variable shrink more, thus reducing the heteroscedasticity.

## Autocorrelation and time series issue
This happens when data is collected over time and the model fails to capture the time trends in the data. As a result, the error terms are positively correlated over time, such that each error term is similar to the previous one. This happens with time series data. 
One way to detect this is to plot the residuals as a function of time. There should not be any visible pattern if the error terms are uncorrelated. However, if consecutive values closely follow each other, it is a case of autocorrelation. To tackle this, we use autoregressive models.
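
Besides plotting, "consecutive values closely follow each other" can be checked numerically with the lag-1 autocorrelation of the residuals: the correlation between each residual and the one before it. This is a rough sketch with made-up residual series; the function name is illustrative and no formal significance test is applied.

```python
def lag1_autocorrelation(residuals):
    """Correlation between each residual and the previous one (time-ordered)."""
    n = len(residuals)
    mean_r = sum(residuals) / n
    num = sum((residuals[t] - mean_r) * (residuals[t - 1] - mean_r)
              for t in range(1, n))
    den = sum((r - mean_r) ** 2 for r in residuals)
    return num / den


# Residuals that drift slowly over time: neighbours look alike.
drifting = [1.0, 0.9, 0.8, 0.7, -0.7, -0.8, -0.9, -1.0]
# Residuals that flip sign every step: neighbours are opposite.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

print(lag1_autocorrelation(drifting))
print(lag1_autocorrelation(alternating))
```

A value well above 0 points to positive autocorrelation, a value near 0 to uncorrelated errors, and a value well below 0 to negative autocorrelation.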

## Multicollinearity
If two or more independent variables are linearly dependent on each other, it is called multicollinearity. Use a correlation matrix or VIF to detect multicollinearity. Drop the multicollinear variables. Regularization can also help in dealing with multicollinearity.

## Overfitting 
If a model is too complex, it tends to overfit: it performs well on the training data but poorly on test/unseen data. One way to tackle overfitting is to increase the amount and diversity of the data. Regularization can also help in dealing with overfitting.

## Extrapolation
Extrapolation occurs when a linear regression model is used to predict for predictor values which are outside the range of the data used to build the model. Linear regression predictions are only valid within the range of the data used to build the model. We can increase the amount and diversity of the data to widen that range.
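
A simple guard against accidental extrapolation, sketched for a single predictor: before predicting, check whether the new value falls inside the range seen during training. The function name and values are hypothetical, and with several predictors a per-variable range check like this is only a rough first screen.

```python
def is_extrapolation(x_new, train_xs):
    """True when x_new lies outside the range of the training predictor values."""
    return not (min(train_xs) <= x_new <= max(train_xs))


train_xs = [1.0, 2.0, 3.0, 4.0, 5.0]
for x_new in (3.0, 5.0, 7.5):
    label = "extrapolation" if is_extrapolation(x_new, train_xs) else "interpolation"
    print(x_new, label)
```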

# Further readings

## F-statistic

## Partial Least Square

## Bias-Variance Tradeoff
### Bias 
- The bias is the difference between the average predicted value and the correct value. 
- A model with high bias pays little attention to the training data
- It is an oversimplified model 
- Leads to high errors on both training and test data
- The model is said to be underfitting
- Represents the correctness of the model

### Variance
- The variance is the variability of the predicted value, i.e. how far predictions spread around the mean of the predicted values. 
- Typical of a very complex model
- Does very well on training data but performs poorly on test data
- The model is said to be overfitting
- Represents the consistency of the model

So ideally we want our model to be both complex enough to have low bias and simple enough to have low variance, which is clearly a contradiction. This is known as the Bias-Variance tradeoff. 
Therefore we have to trade off bias against variance while finding the best model. 

![bias-variance-tradeoff.png](attachment:bias-variance-tradeoff.png)

The expected squared error of a model is bias$^2$ + variance + irreducible error (noise).

### Parametric models

### Non-parametric models

### Prediction vs Projection

| Consideration | Prediction | Projection |
| --- | --- | --- |
| Importance of outcome | Identify driver variables and measure their impact on the dependent variable. | Final projected result/forecasted value. |
| Assumption | No specific assumptions. | Assumes everything remains the same as today.<br>The forecast will change if anything changes. |
| Complexity/Accuracy of model | Simple models are better for inference. | Accuracy is important. Expected to be complex. |


What is the coefficient of correlation for below lines?
![image.png](attachment:image.png)