# ML DL Notes

## ML/DL basics introduction

#### Perceptron
- Takes several binary inputs and produces a single binary output by comparing the weighted sum of the inputs against a threshold.
- The threshold is usually moved to the left side of the inequality and treated as a bias (bias = -threshold), so the weighted sum plus bias is compared against zero.
- If a small change in a weight or bias causes only a small change in output, it is possible for a network to learn.
- Perceptrons do not always behave this way: a **small change in weights can entirely flip the output**, say from 1 to 0.
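
A minimal sketch of the perceptron rule described above, assuming the common convention that the output is 1 when the weighted sum plus bias is strictly positive. The input, weight and bias values are illustrative, not taken from these notes.

```python
def perceptron(inputs, weights, bias):
    """Return 1 if the weighted sum plus bias is positive, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum + bias > 0 else 0


# Illustrative values (assumptions for this sketch).
inputs = [1, 1]
bias = -1.0  # equivalent to a threshold of 1.0

# Nudging one weight from 0.49 to 0.51 moves the sum across zero.
before = perceptron(inputs, [0.5, 0.49], bias)
after = perceptron(inputs, [0.5, 0.51], bias)
print(before, after)
```

The weight changes by only 0.02, yet the output jumps from one class to the other, which is the behavior that makes learning with perceptrons awkward.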

#### Sigmoid Neuron
- Sigmoid neurons are similar to perceptrons (the sigmoid is a smoothed-out version of the step function), <br> but modified so that **small changes in their weights and bias cause only a small change in their output**
- Instead of being just 0 or 1, the inputs can take on any value between 0 and 1.
- The output is not 0 or 1 either. Instead, it is σ(w⋅x+b), where σ(z) = 1/(1+e^(-z)) is called the sigmoid function.
- Somewhat confusingly, and for historical reasons, networks with multiple layers of such neurons are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons.
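
A small sketch of a single sigmoid neuron, to show the "small change in, small change out" property. The input, weight and bias values are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_neuron(inputs, weights, bias):
    """Output of one neuron: sigmoid of the weighted sum plus bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


# Illustrative values (assumptions for this sketch).
inputs = [1.0, 1.0]
bias = -1.0

# The same kind of weight nudge that can flip a perceptron
# only shifts the sigmoid output slightly.
out_before = sigmoid_neuron(inputs, [0.5, 0.49], bias)
out_after = sigmoid_neuron(inputs, [0.5, 0.51], bias)
print(out_before, out_after)
```

Both outputs stay close to 0.5, one just below and one just above it.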

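#### Gradient descent
- To quantify how well the network's outputs match the desired outputs, we define a cost function.
- The goal of training is to find a set of weights and biases that makes the cost as small as possible. We'll do that using an algorithm known as gradient descent, which repeatedly moves each parameter a small step in the direction that reduces the cost.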
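
A minimal sketch of gradient descent on a toy cost, assuming a hand-written gradient and a fixed learning rate. The cost function, the starting point and the hyperparameter values are hypothetical, picked only so that the minimum is easy to see.

```python
def gradient_descent(gradient_fn, params, learning_rate=0.1, steps=100):
    """Repeatedly step each parameter against its gradient."""
    for _ in range(steps):
        gradients = gradient_fn(params)
        params = [p - learning_rate * g for p, g in zip(params, gradients)]
    return params


def cost(params):
    """Toy cost with its minimum at w = 3, b = -1 (an assumption for this sketch)."""
    w, b = params
    return (w - 3.0) ** 2 + (b + 1.0) ** 2


def cost_gradient(params):
    """Partial derivatives of the toy cost with respect to w and b."""
    w, b = params
    return [2.0 * (w - 3.0), 2.0 * (b + 1.0)]


w, b = gradient_descent(cost_gradient, [0.0, 0.0])
print(w, b, cost([w, b]))
```

In a real network the gradient would come from backpropagation rather than a closed-form expression.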

#### Backpropagation

#### Overfitting

- One of the problems that occur during neural network training is called overfitting. 
- The error on the training set is driven to a very small value, but when new data is presented to the network the error is large. The network has **memorized the training examples, but it has not learned to generalize to new situations**.

#### How to avoid overfitting

- Go for simpler models over more complicated models. Generally, the **fewer parameters** that you have to tune the better. 
- Use **more data** to train the model. 
- Some sort of **regularization** can help penalize certain sources of overfitting.

#### Vanishing Gradients

- If a change in a parameter's value causes only a very small change in the network's output, the gradient with respect to that parameter is tiny and the network can't learn the parameter effectively, which is a problem.
- For example, the **sigmoid maps the real number line onto the "small" range** (0, 1). As a result, there are large regions of the input space which are mapped to an extremely small range. In these regions of the input space, even a large change in the input will produce only a small change in the output, hence the gradient is small.
- This **becomes much worse when we stack multiple layers** of such non-linearities on top of each other. <br> For instance, the first layer will map a large input region to a smaller output region, which will be mapped to an even smaller region by the second layer, which will be mapped to an even smaller region by the third layer and so on. As a result, even a large change in the parameters of the first layer doesn't change the output much.
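
A rough sketch of why stacking sigmoids shrinks the gradient. It makes a strong simplifying assumption: every layer sees the same pre-activation and all weights are 1, so the chained gradient is just the sigmoid derivative raised to the number of layers. Real networks also multiply in the weights.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Derivative of the sigmoid; its maximum value is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def chained_gradient(z, num_layers):
    """Product of one sigmoid derivative per layer (weights assumed to be 1)."""
    return sigmoid_derivative(z) ** num_layers


# Even in the best case (z = 0), the factor shrinks quickly with depth.
for depth in (1, 5, 10):
    print(depth, chained_gradient(0.0, depth))
```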

#### How to avoid vanishing gradients

- We can avoid this problem by using activation functions that don't 'squash' the input space into a small region. A popular choice is the Rectified Linear Unit (ReLU), which maps x to max(0, x); its gradient is 1 for every positive input, so that factor does not shrink as layers are stacked.
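
A short sketch of ReLU and its derivative. Taking the derivative at exactly 0 to be 0 is a convention assumed here, and the depth and input values are illustrative.

```python
def relu(x):
    """Rectified Linear Unit: passes positive values through, clips negatives to 0."""
    return max(0.0, x)


def relu_derivative(x):
    """1 for positive inputs, 0 otherwise (the value at 0 is a chosen convention)."""
    return 1.0 if x > 0 else 0.0


# For a positive pre-activation the per-layer factor is 1,
# so multiplying it across layers does not shrink the gradient.
depth = 10
chained = relu_derivative(2.0) ** depth
print(relu(-3.0), relu(2.5), chained)
```

For negative inputs the derivative is 0, so a unit that is always negative stops learning; variants such as Leaky ReLU address that.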

#### Cross validation

- **Cross validation is a method for estimating the prediction accuracy of a model.**
- One way to evaluate a model is to see how well it predicts the data used to fit the model. But this is too optimistic -- a model tailored to a particular data set will make better predictions on that data set than on new data. 
- Another way is to hold out some data and fit the model using the rest. Then you can test your accuracy on the held-out data. But the held-out data is "wasted" from the point of view of building the model. If you have huge amounts of data, holding some out won't make the model much worse, but with limited data the loss matters.
- Cross validation does something like this but tries to **make more efficient use of the data**: you divide the data into (say) 10 equal parts. Then **successively hold out each part and fit the model using the rest**. This gives you 10 estimates of prediction accuracy which can be combined into an overall measure.
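
A minimal sketch of the k-fold procedure described above, in plain Python. The scorer is a hypothetical stand-in (it "fits" by taking the training mean and reports mean absolute error instead of accuracy), and the data values are made up; in practice the indices are usually shuffled first.

```python
def k_fold_indices(num_samples, k):
    """Yield (train_indices, test_indices) for each of k roughly equal folds."""
    indices = list(range(num_samples))
    fold_sizes = [num_samples // k + (1 if i < num_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(num_samples, k, fit_and_score):
    """Hold out each fold in turn and average the k scores."""
    scores = [fit_and_score(train, test)
              for train, test in k_fold_indices(num_samples, k)]
    return sum(scores) / len(scores)


def mean_baseline_error(ys):
    """Hypothetical scorer: predict the training mean, score by mean absolute error."""
    def fit_and_score(train, test):
        prediction = sum(ys[i] for i in train) / len(train)
        return sum(abs(ys[i] - prediction) for i in test) / len(test)
    return fit_and_score


ys = [float(i % 5) for i in range(20)]  # made-up targets
folds = list(k_fold_indices(len(ys), 10))
overall_error = cross_validate(len(ys), 10, mean_baseline_error(ys))
print(len(folds), overall_error)
```

Every sample is used for testing exactly once and for training in the other folds, which is the "more efficient use of the data" mentioned above.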

## Regression Algorithms

- **Regression is a statistical way to establish a relationship between a dependent variable and a set of independent variable(s)**
- Regression is concerned with modeling the relationship between variables that is iteratively refined using a measure of error in the predictions made by the model.
- Regression methods are a workhorse of statistics and have been co-opted into statistical machine learning.

#### Linear regression

- In linear regression our objective is to **fit a line through the data that is as close as possible to the points**, i.e. to reduce the distance (error term) between the data points and the fitted line.
- It is conventional to measure that distance with squares: the regression line minimizes the sum of the “squares of the residuals”. That’s why this way of fitting a **linear regression is known as “Ordinary Least Squares (OLS)”**
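
A sketch of ordinary least squares for a single input variable, using the standard closed-form slope and intercept. The data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Made-up data that lies roughly on a line.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
slope, intercept = fit_line(xs, ys)
print(slope, intercept, sum_squared_residuals(xs, ys, slope, intercept))
```

Any other line through the same points gives a larger sum of squared residuals than the fitted one.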

#### Logistic regression

- Regression means fitting a particular family of functions to data: <br> "linear regression" means obtaining a best-fit line; <br> "polynomial regression" means obtaining a best-fit polynomial (of given degree); <br> and **"logistic regression" means obtaining a best-fit logistic function**
- Logistic regression works largely the same way linear regression works: it multiplies each input by a coefficient, sums them up, and adds a constant.
- In linear regression, the output is very straightforward. In the case of predicting heights, our output is simply someone's predicted height. 
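- **In logistic regression, however, that weighted sum is the log of the odds** of the positive class (not an odds ratio); passing it through the logistic function converts it to a probability between 0 and 1.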
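
A small sketch of how a logistic regression prediction is formed, assuming the coefficients are already fitted. The coefficient, intercept and input values are hypothetical. The weighted sum is read as the log of the odds, and the logistic function turns it into a probability.

```python
import math


def log_odds(inputs, coefficients, intercept):
    """Same linear form as linear regression: coefficients times inputs, plus a constant."""
    return sum(c * x for c, x in zip(coefficients, inputs)) + intercept


def logistic(z):
    """Convert log-odds into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(inputs, coefficients, intercept):
    return logistic(log_odds(inputs, coefficients, intercept))


# Hypothetical fitted values.
coefficients = [1.0, 0.5]
intercept = -1.0
inputs = [1.0, 2.0]

z = log_odds(inputs, coefficients, intercept)
p = predict_probability(inputs, coefficients, intercept)
print(z, p, math.log(p / (1.0 - p)))  # the last value recovers the log-odds
```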

## Regularization Algorithms

- One way to avoid over-optimizing on the training set is to **use early termination, stopping as soon as learning stops improving (typically measured on held-out data). Another method is to use regularization**
- Regularization is an extension made to another method (typically a regression method) that **penalizes models based on their complexity, favoring simpler models that are also better at generalizing**
- Another regularization technique is dropout.
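
One common way to "penalize complexity" is to add the sum of squared weights to the loss (an L2 penalty). The sketch below assumes a mean-squared-error data term; the `strength` parameter and all numeric values are hypothetical.

```python
def mean_squared_error(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def l2_penalty(weights, strength):
    """Penalty that grows with the squared size of the weights."""
    return strength * sum(w * w for w in weights)


def regularized_loss(predictions, targets, weights, strength):
    """Data-fit term plus complexity penalty."""
    return mean_squared_error(predictions, targets) + l2_penalty(weights, strength)


# Two hypothetical models that fit the data equally well,
# one with small weights and one with large weights.
targets = [1.0, 2.0]
predictions = [1.0, 2.0]
loss_small = regularized_loss(predictions, targets, [0.1, 0.1], strength=0.01)
loss_large = regularized_loss(predictions, targets, [3.0, -4.0], strength=0.01)
print(loss_small, loss_large)
```

With equal fit, the model with smaller weights gets the lower regularized loss, so training is pushed toward it.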

#### L2 regularization

#### Ridge regularization

#### Least Absolute Shrinkage and Selection Operator (LASSO)

#### Dropout

- Randomly drop a few units (and with them their connections) during training to force the network to learn redundant representations of the input, so that it doesn't overfit by depending on any particular unit or parameter and each unit learns useful features on its own.
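
A minimal sketch of dropout applied to one layer's activations at training time. It assumes the "inverted dropout" variant, which zeroes unit outputs (dropping all of that unit's outgoing connections at once) and rescales the survivors so the expected activation is unchanged. The drop probability and the seed are illustrative.

```python
import random


def dropout(activations, drop_probability, rng):
    """Zero each activation with the given probability and rescale the rest."""
    keep_probability = 1.0 - drop_probability
    return [a / keep_probability if rng.random() < keep_probability else 0.0
            for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0] * 10
dropped = dropout(activations, 0.5, rng)
print(dropped)
```

At test time dropout is switched off and the activations are used as they are.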

## Bayesian Algorithms

## Clustering Algorithms

## Dimensionality Reduction Algorithms

#### Principal Component Analysis (PCA)

## Ensemble Algorithms

- Ensemble methods are models composed of multiple weaker models, trained independently (as in bagging) or sequentially (as in boosting), whose predictions are combined in some way to make the overall prediction
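
A tiny sketch of one way to combine predictions: a majority vote over classifiers. The three threshold rules are hypothetical stand-ins for trained weak models.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common predicted label."""
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(models, x):
    """Ask every model for a prediction and combine them by voting."""
    return majority_vote([model(x) for model in models])


# Hypothetical weak classifiers: simple thresholds on a single number.
models = [
    lambda x: int(x > 0),
    lambda x: int(x > 2),
    lambda x: int(x > -2),
]

print(ensemble_predict(models, 1), ensemble_predict(models, -1))
```

Other combination rules, such as averaging predicted probabilities or weighting the votes, follow the same pattern.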

#### Boosting

#### Bootstrapped Aggregation (Bagging)

#### AdaBoost

#### Random Forest

## Training hyperparameters

#### Learning rate

#### Learning rate decay

#### Momentum

#### Mini batch size

#### Weight initialization

#### Adam

#### AdaGrad

## Activation Functions

- One of the crucial factors in deep networks is the activation function, which **brings non-linearity into the network**
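
A small sketch of why the non-linearity matters: two linear layers stacked without an activation collapse into a single linear layer, while putting a ReLU between them does not. The weights and biases are made-up scalars.

```python
def linear(weight, bias):
    """A one-input, one-output linear layer."""
    return lambda x: weight * x + bias


layer1 = linear(2.0, 1.0)
layer2 = linear(-3.0, 0.5)


def stacked(x):
    """Two linear layers with nothing in between."""
    return layer2(layer1(x))


# Equivalent single layer: weight = -3 * 2, bias = -3 * 1 + 0.5.
collapsed = linear(-6.0, -2.5)


def stacked_with_relu(x):
    """Same two layers with a ReLU applied between them."""
    return layer2(max(0.0, layer1(x)))


for x in (-2.0, 0.0, 1.5):
    print(x, stacked(x), collapsed(x), stacked_with_relu(x))
```

Without the activation, extra layers add no expressive power; with it, the network can represent functions a single linear layer cannot.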

#### Sigmoid

#### TanH

#### Rectified Linear Units (ReLU)

#### Leaky ReLU

#### Parametric ReLU

#### Randomized ReLU

#### Exponential Linear Unit (ELU)

## Layer types

#### Pooling

#### Batch Normalization layer

#### Local Response Normalization (LRN)

#### Element-wise

#### Fully Connected (FC)

## Loss functions

#### Softmax

#### Cross entropy

#### Hinge

## Popular ANN types

#### Auto encoders, Stacked auto encoders

#### Restricted Boltzmann Machines (RBM)

#### Deep Belief Networks: stacked RBM

## RNNs