# ML DL Notes

## <font color=red> Topics to learn </font>
- [**ML Cheat sheet**](https://medium.com/machine-learning-in-practice/cheat-sheet-of-machine-learning-and-python-and-math-cheat-sheets-a4afe4e791b6)
- classification loss functions
- log odds ratio
- Precision and recall
- knn
- Naive bayes
- decision tree, random forest
- svd
- Unsupervised learning
    - clustering
        - k-means
        - hierarchical
    - dimensionality reduction
        - pca
- confusion matrix
- batch normalization
- instance normalization
- generative and discriminative model

## <font color=brown> ML/DL basics introduction </font>

### <font color=DarkMagenta> Perceptron </font>
- Takes several binary inputs and produces a single binary output by comparing the weighted sum of the inputs against a threshold.
- The threshold is treated as a bias and moved to the left side of the inequality, so that the weighted sum plus bias is compared against zero.
- A network can learn only if a small change in a weight or bias causes only a small change in the output.
- Perceptrons do not guarantee this: a <font color=blue> **small change in the weights can entirely flip the output** </font>, say from 1 to 0.

### Parametric Vs Non-Parametric algorithms
- Parametric
    - Parametric methods make an assumption about the form of the function relating X and Y
    - Linear regression is a parametric method
- Non-Parametric
    - Non-parametric learners do not have a model structure specified a priori.
    - We don’t speculate about the form of the function f that we’re trying to learn before training the model, as we did previously with linear regression. 
    - Instead, the model structure is purely determined from the data.

### <font color=DarkMagenta> Sigmoid Neuron </font>
- Sigmoid neurons are similar to perceptrons (the sigmoid is a smoothed-out version of the step function), <br> but modified so that **small changes in their weights and bias cause only a small change in their output**.
- Instead of being just 0 or 1, the inputs can take on any value between 0 and 1.
- The output is not 0 or 1 either. Instead, it is σ(w⋅x+b), where σ is called the sigmoid function.
- Somewhat confusingly, and for historical reasons, networks with multiple layers of such neurons are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons.
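
The contrast between the two neuron types can be sketched in a few lines. This is only an illustration: the weights, inputs, and function names below are made up for the example, and the perceptron uses the "weighted sum plus bias compared against zero" convention described above.

```python
import math

def perceptron(x, w, b):
    """Step-function neuron: outputs 1 if w.x + b > 0, else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

def sigmoid_neuron(x, w, b):
    """Smooth neuron: outputs sigma(w.x + b), a value strictly between 0 and 1."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical inputs and weights chosen so the weighted sum sits at the threshold.
x = [1.0, 1.0]
b = 0.0
w_before = [0.5, -0.5]   # weighted sum is exactly 0
w_after = [0.5, -0.49]   # one weight nudged by 0.01

# The perceptron output flips, while the sigmoid output barely moves.
print(perceptron(x, w_before, b), perceptron(x, w_after, b))
print(sigmoid_neuron(x, w_before, b), sigmoid_neuron(x, w_after, b))
```

The same tiny weight change moves the perceptron across its threshold but shifts the sigmoid output only slightly, which is the property that makes gradual learning possible.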

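### Gradient descent
- To quantify how well the network's output approximates the desired output, we define a cost function.
- The goal of training is to find a set of weights and biases that makes the cost as small as possible. We do that using an algorithm known as gradient descent.
- Gradient descent repeatedly nudges each parameter a small step in the direction opposite to the gradient of the cost.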
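
As a rough sketch of the idea, the loop below runs gradient descent on a quadratic cost for a single linear neuron. The dataset, learning rate, and step count are arbitrary choices for illustration, not values taken from these notes.

```python
def cost(w, b, data):
    """Quadratic cost: half the mean squared difference between w*x + b and y."""
    return sum((w * x + b - y) ** 2 for x, y in data) / (2 * len(data))

def gradient(w, b, data):
    """Partial derivatives of the cost with respect to w and b."""
    n = len(data)
    dw = sum((w * x + b - y) * x for x, y in data) / n
    db = sum((w * x + b - y) for x, y in data) / n
    return dw, db

# Toy data generated from y = 2x + 1 (an assumption made for this example).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w, b = 0.0, 0.0
learning_rate = 0.1
for step in range(2000):
    dw, db = gradient(w, b, data)
    # Move each parameter a small step against its gradient.
    w -= learning_rate * dw
    b -= learning_rate * db

print(w, b, cost(w, b, data))
```

With a learning rate that is too large the updates can diverge instead of settling, so in practice the step size is itself something to tune.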

### Backpropagation

### Overfitting
- One of the problems that occur during neural network training is called overfitting. 
- The error on the training set is driven to a very small value, but when new data is presented to the network the error is large. The network has <font color=blue> **memorized the training examples, but it has not learned to generalize to new situations** </font>

### How to avoid overfitting
- Go for simpler models over more complicated models. Generally, the **fewer parameters** that you have to tune the better. 
- Use **more data** to train the model. 
- Some sort of <font color= blue> **regularization** </font> can help penalize certain sources of overfitting.

### Vanishing Gradients
- If a change in a parameter's value causes only a very small change in the network's output, the network can't learn that parameter effectively, which is a problem.
- For example, **sigmoid maps the real number line onto a "small" range** of (0, 1). As a result, there are large regions of the input space which are mapped to an extremely small range. In these regions of the input space, even a large change in the input will produce a small change in the output, hence the gradient is small.
- This **becomes much worse when we stack multiple layers** of such non-linearities on top of each other. <br> For instance, the first layer will map a large input region to a smaller output region, which will be mapped to an even smaller region by the second layer, which will be mapped to an even smaller region by the third layer and so on. **As a result, even a large change in the parameters of the first layer doesn't change the output much.**
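
A small numerical sketch of this effect: the derivative of the sigmoid never exceeds 0.25, so by the chain rule the gradient through a stack of sigmoids is a product of factors that are each at most 0.25. The unit weights and zero biases below are simplifying assumptions made only to keep the example short.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def chain_gradient(x, depth):
    """d(output)/d(input) for `depth` stacked sigmoids with unit weights and zero bias."""
    grad = 1.0
    a = x
    for _ in range(depth):
        a = sigmoid(a)
        # sigma'(z) = sigma(z) * (1 - sigma(z)), which is at most 0.25
        grad *= a * (1.0 - a)
    return grad

# The gradient shrinks rapidly as more sigmoid layers are stacked.
for depth in (1, 2, 5, 10):
    print(depth, chain_gradient(0.5, depth))
```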

### How to avoid vanishing gradients
- We can avoid this problem by using activation functions which don't have this property of 'squashing' the input space into a small region. 
- A popular <font color=blue> **choice is Rectified Linear Unit** </font> which maps x to max(0,x)
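
For comparison, here is the same kind of stacked-layer sketch with ReLU, again assuming unit weights and zero biases purely for illustration. For positive inputs the derivative is exactly 1, so the product of derivatives does not shrink with depth.

```python
def relu(z):
    """Rectified Linear Unit: max(0, z)."""
    return max(0.0, z)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def relu_chain_gradient(x, depth):
    """d(output)/d(input) for `depth` stacked ReLUs with unit weights and zero bias."""
    grad, a = 1.0, x
    for _ in range(depth):
        grad *= relu_grad(a)
        a = relu(a)
    return grad

print(relu_chain_gradient(0.5, 10))   # positive input: gradient passes through
print(relu_chain_gradient(-0.5, 10))  # negative input: gradient is zero
```

The second call shows the flip side: a unit whose input is negative passes no gradient at all, so ReLU trades the squashing problem for the possibility of inactive units.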

### Cross validation
- **Cross validation is a method for estimating the prediction accuracy of a model.**
- One way to evaluate a model is to see how well it predicts the data used to fit the model. But this is too optimistic -- a model tailored to a particular data set will make better predictions on that data set than on new data. 
- Another way is to hold out some data and fit the model using the rest. Then you can test your accuracy on the holdout data. But the held out data is "wasted" from the point of view of building the model. If you have huge amounts of data this is acceptable, since holding some data out won't make the model much worse.
- Cross validation does something like this but tries to <font color=blue> **make more efficient use of the data** </font>: you divide the data into (say) 10 equal parts. Then **successively hold out each part and fit the model using the rest**. This gives you 10 estimates of prediction accuracy which can be combined into an overall measure.
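
The procedure can be sketched with NumPy as follows. The synthetic data, the straight-line model, and the helper names `k_fold_indices` and `cross_validate` are assumptions for this example; the folds are contiguous, which is only reasonable here because the samples are already in random order.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k nearly equal, contiguous folds."""
    return np.array_split(np.arange(n_samples), k)

def cross_validate(x, y, k=10):
    """Return the held-out mean squared error of a fitted line for each fold."""
    folds = k_fold_indices(len(x), k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Fit on every fold except the i-th one.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
        # Evaluate on the held-out fold.
        pred = slope * x[test_idx] + intercept
        scores.append(np.mean((y[test_idx] - pred) ** 2))
    return np.array(scores)

# Synthetic data: a noisy line (values chosen only for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=100)

scores = cross_validate(x, y, k=10)
print(scores)         # one estimate per fold
print(scores.mean())  # combined into an overall measure
```

Every sample is used for testing exactly once and for fitting k-1 times, which is the "more efficient use of the data" described above.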

### Types of data
- **Categorical**: Categorical variables take on values that are names or labels. The colour of a ball (e.g., red, green, blue) or the breed of a dog (e.g., collie, shepherd, terrier) would be examples of categorical variables.
- **Quantitative**: Quantitative variables are numerical. They represent a measurable quantity. For example, when we speak of the population of a city, we are talking about the number of people in the city, which is a measurable attribute of the city. Therefore population would be a quantitative variable.

### Classification and Regression
- **So in very simple terms, classification is about predicting a label and regression is about predicting a quantity**

## <font color=brown> Classification Algorithms </font>

### Support Vector Machine (SVM)
- [SVM Overview](https://towardsdatascience.com/support-vector-machines-a-brief-overview-37e018ae310f)
- Support vector machines attempt to pass a <font color=blue> linear separating hyperplane through a dataset in order to classify the data into two groups </font>
- This hyperplane is a linear separator in any dimension: it could be a line (2D), a plane (3D), or a hyperplane (4D+).
- The <font color=blue> best hyperplane is the one that maximizes the margin </font>. The margin is the distance between the hyperplane and the few closest points. These <font color=blue> close points are the support vectors because they control the hyperplane. </font>
- A hard-margin SVM requires the classes to be linearly separable. A variant, known as a <font color=blue> Soft Margin Classifier or a Support Vector Classifier </font>, handles data that are not perfectly separable by allowing slight misclassification. It has a tuning parameter that controls how much misclassification is allowed.
- **Kernel Trick**:
    - When the classes are not linearly separable in the original feature space, the data is mapped from the lower dimension to a higher dimension where a linear separator may exist. Kernels make this practical by computing inner products in the higher-dimensional space without building the mapping explicitly, which is known as the kernel trick.
    - In effect, these kernels transform our data so that a linear hyperplane can pass through it and classify it.
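
To make the kernel trick concrete, the sketch below uses a degree-2 polynomial kernel on 2-D points. The feature map `phi`, the kernel `poly_kernel`, and the sample points are illustrative choices, not something prescribed by these notes. The point is that the kernel value equals the inner product in the higher-dimensional space, so the mapping never has to be built explicitly.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (2-D -> 3-D)."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly_kernel(a, b):
    """Homogeneous degree-2 polynomial kernel, computed in the original 2-D space."""
    return float(np.dot(a, b)) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

# Both routes give the same number; the kernel avoids constructing phi.
print(np.dot(phi(a), phi(b)), poly_kernel(a, b))

# Points inside and outside the unit circle are not linearly separable in 2-D,
# but in the mapped space the linear function phi[0] + phi[2] (= x1^2 + x2^2) separates them.
inner = np.array([[0.5, 0.0], [0.0, -0.5]])
outer = np.array([[2.0, 0.0], [0.0, -2.0]])

def radius_sq(v):
    mapped = phi(v)
    return mapped[0] + mapped[2]

print([radius_sq(p) for p in inner], [radius_sq(p) for p in outer])
```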

## <font color=brown> Regression Algorithms </font>
- **Regression is a statistical way to establish a relationship between a dependent variable and a set of independent variable(s)**
- Regression is concerned with modeling the relationship between variables that is iteratively refined using a measure of error in the predictions made by the model.
- Regression methods are a workhorse of statistics and have been co-opted into statistical machine learning.

### Linear regression
- In linear regression, the objective is to **fit a line through the data that is as close as possible to most of the points**, that is, to reduce the distance (error term) between the data points and the fitted line.
- It is conventional to square these distances: the regression line minimizes the sum of squared residuals.
- That’s why this method of **linear regression is** <font color=blue> **known as “Ordinary Least Squares (OLS)”** </font>
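
A minimal least-squares fit can be sketched with `numpy.linalg.lstsq`. The five data points are invented for the example; the column of ones in the design matrix is what lets the fit include an intercept.

```python
import numpy as np

# Toy data, roughly following y = 2x + 1 (made up for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Design matrix with one column for the slope and one for the intercept.
A = np.column_stack([x, np.ones_like(x)])

# lstsq returns the coefficients that minimize the sum of squared residuals.
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

residuals = y - (slope * x + intercept)
print(slope, intercept)
print(np.sum(residuals ** 2))  # the quantity OLS minimizes
```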

### Logistic regression
- The idea of <font color=blue> **Logistic Regression is to find a relationship between features and the probability of a particular outcome.** </font>
- Logistic regression works largely the same way linear regression works: it multiplies each input by a coefficient, sums them up, and adds a constant. <font color=blue> **In logistic regression, however, the output is the log of the odds (log-odds) of the outcome, which the sigmoid function converts into a probability.** </font>
- When the response variable has two values (0 and 1, pass and fail, or true and false), the problem is referred to as **Binomial Logistic Regression**. **Multinomial Logistic Regression** deals with situations where the response variable can have three or more possible values.
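
The relationship between the linear score, the log-odds, and the probability can be checked numerically. The coefficients and inputs below are hypothetical stand-ins for fitted values, used only to show the arithmetic.

```python
import math

def log_odds(x, coefs, intercept):
    """Linear part of the model: intercept plus weighted sum of the inputs."""
    return intercept + sum(c * xi for c, xi in zip(coefs, x))

def probability(x, coefs, intercept):
    """Apply the sigmoid to the log-odds to get P(outcome = 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds(x, coefs, intercept)))

# Hypothetical fitted coefficients and one input row.
coefs, intercept = [0.8, -0.4], -1.0
x = [2.0, 1.0]

z = log_odds(x, coefs, intercept)
p = probability(x, coefs, intercept)

# Taking log(p / (1 - p)) recovers the linear score.
print(z, p, math.log(p / (1.0 - p)))
```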

## <font color=brown> Loss Functions </font>
- Most algorithms in machine learning rely on minimizing or maximizing a function, which we call the “**objective function**”. The group of functions that are minimized are called “loss functions”.
- A loss function is a measure of how well a prediction model is able to predict the expected outcome.
- One of the most commonly used methods of finding the minimum point of a function is “gradient descent”.
- Loss functions can be broadly categorized into 2 types: <font color=blue> **Classification Loss and Regression Loss** </font>

## <font color=brown> Classification Loss Functions </font>

## <font color=brown> Regression Loss Functions </font>
- [**5 regression loss functions**](https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0)

### L2 Loss, Mean Square Error (MSE), Quadratic loss
- Mean Square Error (MSE) is the most commonly used regression loss function. MSE is the mean of the squared differences between our target variable and predicted values.

<h4 align="center"> $ MSE = \frac{1}{n} \sum\limits_{i=1}^n  {(y_i - y_i^p)}^2 $ </h4>

### L1 Loss, Mean Absolute Error (MAE)
- MAE is the mean of the absolute differences between our target and predicted variables.
- So it **measures the average magnitude of errors** in a set of predictions, without considering their directions.
- If we consider directions also, that would be called **Mean Bias Error (MBE)**

<h4 align="center"> $ MAE = \frac{1}{n} \sum\limits_{i=1}^n  {|y_i - y_i^p|} $ </h4>
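
The three quantities can be compared on a tiny example. This sketch uses the averaged form (dividing by n) that the word "mean" in each name implies; the target and prediction values are made up.

```python
import numpy as np

def mse(y, y_pred):
    """Mean squared error."""
    return np.mean((y - y_pred) ** 2)

def mae(y, y_pred):
    """Mean absolute error: average magnitude, ignoring direction."""
    return np.mean(np.abs(y - y_pred))

def mbe(y, y_pred):
    """Mean bias error: signed errors, so over- and under-predictions can cancel."""
    return np.mean(y - y_pred)

y = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 6.0])   # errors: 1, 0, -2, 1

print(mse(y, y_pred), mae(y, y_pred), mbe(y, y_pred))
```

Here the signed errors cancel, so MBE reports no bias even though the predictions are off by 1 on average, which is why MAE and MBE answer different questions.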

### L1 vs L2 Loss
- [**L1 vs L2 Comparison**](http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/)
- <font color=blue> **L1 loss is more robust to outliers than L2** </font>
    - Since MSE squares the error (e = y - y_predicted), the squared error grows much faster than the error itself once |e| > 1.
    - If we have an outlier in our data, the value of e will be high and e² will be >> |e|.
    - This will make the model with **MSE loss give more weight to outliers** than a model with MAE loss.
    - MAE loss is useful if the training data is corrupted with outliers
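- <font color=blue> **The derivative of L1 loss is not continuous (it jumps at zero error), which makes the solution harder to find and less stable** </font>
    - The gradient has the same magnitude however small the error is, so gradient-based updates can overshoot and miss the minimum.
    - As the derivative of L2 is continuous, it gives a more stable solution; however, it is not robust in the presence of outliers.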
- <font color=blue> **Issue with L1 and L2 loss functions:** </font>
    - There can be cases where neither loss function gives desirable predictions.
    - **For example,** suppose 90% of the observations in our data have a true target value of 150 and the remaining 10% have a target value between 0–30.
    - Then a model with MAE as loss might predict 150 for all observations, ignoring the 10% of outlier cases, since it tends toward the median value.
    - In the same case, a model using MSE would be skewed toward the outliers, pulling its predictions away from 150 and toward the 0–30 range. Both results are undesirable in many business cases.
    - **An easy fix would be to transform the target variables. Another way is to try a different loss function. This is the motivation behind our 3rd loss function, Huber loss.**
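
The 90%/10% example can be reproduced for the simplest possible model, one that predicts a single constant for every observation. This is a sketch under that assumption, with the outlier targets all set to 15 for convenience; a grid search over candidate constants stands in for real training.

```python
import numpy as np

# 90 observations at 150 and 10 outliers (all placed at 15 for this illustration).
y = np.array([150.0] * 90 + [15.0] * 10)

# Try every constant prediction from 0 to 150 in steps of 0.1.
candidates = np.linspace(0, 150, 1501)
mse_per_c = [np.mean((y - c) ** 2) for c in candidates]
mae_per_c = [np.mean(np.abs(y - c)) for c in candidates]

best_mse = candidates[int(np.argmin(mse_per_c))]
best_mae = candidates[int(np.argmin(mae_per_c))]

print(best_mse, y.mean())      # MSE is minimized near the mean
print(best_mae, np.median(y))  # MAE is minimized at the median
```

The MAE-optimal constant ignores the outliers entirely, while the MSE-optimal constant is dragged below the value that 90% of the data share; a model with input features can behave differently, but the pull in each direction is the same.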

### Huber Loss, Smooth Mean Absolute Error
- Huber loss is less sensitive to outliers in data than the squared error loss. It’s also **differentiable at 0**. 
- It’s basically absolute error, which becomes quadratic when error is small.
- A drawback of Huber loss is that the hyperparameter delta may need to be tuned, which is an iterative process.
- Huber loss is differentiable everywhere, but it is **not twice differentiable everywhere**: its second derivative jumps where the absolute error equals delta. Log-cosh loss, by contrast, is twice differentiable everywhere.
- Many ML model implementations like XGBoost use Newton’s method to find the optimum, which is why the second derivative (Hessian) is needed. For ML frameworks like XGBoost, **twice differentiable functions are more favorable**.
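
A minimal sketch of the Huber loss in its standard piecewise form, quadratic for small errors and linear for large ones. The default `delta=1.0` is an arbitrary choice for the example.

```python
import numpy as np

def huber(error, delta=1.0):
    """Elementwise Huber loss: 0.5*e^2 if |e| <= delta, else delta*(|e| - 0.5*delta)."""
    abs_e = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * (abs_e - 0.5 * delta)
    return np.where(abs_e <= delta, quadratic, linear)

errors = np.array([0.0, 0.5, 1.0, 3.0, -3.0])
print(huber(errors))
```

The two pieces meet at |e| = delta with the same value and slope, which is what makes the loss smooth there while still growing only linearly for outliers.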

### Log-CosH Loss
- Log-cosh is another function used in regression tasks that’s **smoother than L2**. 
- Log-cosh is the logarithm of the hyperbolic cosine of the prediction error.
- Log-cosh works mostly like the mean squared error, but is **not so strongly affected by the occasional wildly incorrect prediction**: it is approximately e²/2 for small errors e and approximately |e| − log 2 for large ones.
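
The two regimes of log-cosh can be seen numerically. The function below is one way to write it, rearranged so that large errors do not overflow `cosh`; the rearrangement is a choice made for this sketch.

```python
import numpy as np

def log_cosh(error):
    """log(cosh(e)), using log(cosh(e)) = |e| + log(1 + exp(-2|e|)) - log(2) for stability."""
    abs_e = np.abs(error)
    return abs_e + np.log1p(np.exp(-2.0 * abs_e)) - np.log(2.0)

small, large = 0.01, 100.0
print(log_cosh(small), 0.5 * small ** 2)      # close to e^2 / 2 for small errors
print(log_cosh(large), large - np.log(2.0))   # close to |e| - log 2 for large errors
```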

### Quantile Loss
- Quantile loss functions turn out to be useful when we are interested in predicting an interval instead of only point predictions.
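
A common form of the quantile (pinball) loss is sketched below; the notes do not give a formula, so this definition and the name `quantile_loss` are assumptions for illustration. For a quantile q, under-predictions are weighted by q and over-predictions by 1 - q, so fitting with a low and a high q gives the two ends of an interval.

```python
import numpy as np

def quantile_loss(y, y_pred, q):
    """Pinball loss: q * e when e >= 0 (under-prediction), (q - 1) * e otherwise."""
    e = y - y_pred
    return np.mean(np.maximum(q * e, (q - 1.0) * e))

y = np.array([10.0])
# With q = 0.9, predicting too low costs far more than predicting too high.
print(quantile_loss(y, np.array([8.0]), q=0.9))
print(quantile_loss(y, np.array([12.0]), q=0.9))
# With q = 0.5 the loss is half the absolute error.
print(quantile_loss(y, np.array([8.0]), q=0.5))
```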

## <font color=brown> Regularization Algorithms </font>
- One way to avoid over-optimizing (overfitting) on the training set is early termination: stop training as soon as the model stops improving. Another method is **to use regularization**.
- Regularization is an extension made to another method (typically regression methods) that <font color=blue> **penalizes models based on their complexity** </font>, favoring simpler models that are also better at generalizing.
- Another regularization technique is **dropout**.
- [Regularization techniques](https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/)

### <font color=DarkMagenta> L1 (Lasso) and L2 (Ridge) as Regularization </font>

### Drop-out
- Randomly drop a few units (and their connections) during training to force the network to learn redundant representations of the input, so that it doesn't overfit by depending on any particular unit and each unit learns useful features independently.
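
One common way to implement this is "inverted" dropout, sketched below with NumPy. The function name, the 0.5 drop probability, and the rescaling by the keep probability are choices made for this illustration rather than details taken from the notes.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Zero each activation with probability drop_prob and rescale the survivors."""
    if not training or drop_prob == 0.0:
        # At inference time nothing is dropped.
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((1000, 100))
dropped = dropout(a, drop_prob=0.5, rng=rng)

print(np.unique(dropped))  # surviving activations are scaled up, the rest are zero
print(dropped.mean())      # stays close to the original mean of 1.0
```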

## <font color=brown> Bias Variance Trade-Off </font>
- [Bias Variance Trade-off Tutorial](https://www.listendata.com/2017/02/bias-variance-tradeoff.html)
- [Bias Variance Trade-off Infograph](https://elitedatascience.com/bias-variance-tradeoff)
- **Bias**
    - Bias is the error introduced by the simplifying assumptions a model makes; it shows up as poor prediction accuracy even on the training data
    - <font color=blue> High bias means low prediction accuracy </font>: the model may be too simple to learn from the training data, which is known as underfitting
    - For example, a linear regression model would have high bias when trying to model a non-linear relationship
    - High Bias Techniques
        - Linear Regression, Linear Discriminant Analysis and Logistic Regression
    - Low Bias Techniques
        - Decision Trees,  K-nearest neighbours and Gradient Boosting
    - <font color=blue> Parametric algorithms, which assume something about the distribution of the data points, </font> tend to suffer from high bias, whereas non-parametric algorithms, which do not assume anything special about the distribution, tend to have low bias.
- **Variance**
    - Variance measures how sensitive the model is to the particular training set, and therefore how well it generalizes
    - <font color=blue> High variance means less generalization </font>: complex models fit the training data well but cannot generalise the pattern, which results in overfitting
    - Low Variance Techniques
        - Linear Regression, Linear Discriminant Analysis and Logistic Regression
    - High Variance Techniques
        - Decision Trees,  K-nearest neighbours and SVM
- **Bias Variance Trade-off**
    - There is a trade-off between accuracy on the training data and generalization of the pattern outside the training data. Pushing training accuracy higher with a more complex model (lower bias) tends to reduce generalization (higher variance). In general, increasing the bias will decrease the variance, and increasing the variance will decrease the bias.
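
The trade-off can be explored by fitting polynomials of increasing degree to the same noisy data. Everything in this sketch is an assumption made for illustration: the sine-shaped target, the noise level, the sample sizes, and the degrees compared.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a non-linear function (chosen only for this example)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, size=n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

def errors(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse = np.mean((model(x_train) - y_train) ** 2)
    test_mse = np.mean((model(x_test) - y_test) ** 2)
    return train_mse, test_mse

# Degree 1 is a high-bias model; degree 12 is a flexible, high-variance one.
for degree in (1, 3, 12):
    print(degree, errors(degree))
```

Training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. The error on the fresh test sample is the number to watch when judging whether the extra flexibility has started to hurt generalization.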

## Bayesian Algorithms

## Probability

```python
from IPython.display import YouTubeVideo
YouTubeVideo('o4QmoNfW3bI')
```

### Binomial Theorem

```python
from IPython.display import YouTubeVideo  # imported here so this cell runs on its own
YouTubeVideo('6nOhUqA29EI')
```

## Clustering Algorithms

## Dimensionality Reduction Algorithms

## Ensemble Algorithms

## Training hyper parameters

## Activation Functions

## Layer types

## Recurrent Neural Networks (RNN)

## <font color=green> New DNN Research Concepts </font>

### Capsule Net

### Transparency by Design (TbD Net)

## <font color=brown> Interview Questions </font>

## Skills required for ML/DL Engineer Jobs
- Strong expertise in DL frameworks (e.g., TensorFlow, Keras)

## <font color=brown> Best Resources </font>
- [Over 200 best ML, NLP, Python Tutorials - 2018 Edition](https://medium.com/machine-learning-in-practice/over-200-of-the-best-machine-learning-nlp-and-python-tutorials-2018-edition-dd8cf53cb7dc)
- [ML, DL Tutorials](https://github.com/ujjwalkarn/Machine-Learning-Tutorials)
- [Machine Learning Introduction- Series](https://medium.com/machine-learning-for-humans/supervised-learning-740383a2feab)