# Supervised Learning

## Linear Regression

### Gradient Descent

$$
w_i \rightarrow w_i - \alpha \frac{{\partial}}{{\partial w_i}} Error
$$

#### Error Functions
- Mean Absolute Error

![MeanAbsoluteError](img/MeanAbsoluteError.png)

$$
Error = \frac{1}{m} \sum_{i=1}^m |y_i - \hat{y}_i|
$$

- Mean Squared Error

![MeanSquaredError](img/MeanSquaredError.png)

$$
Error = \frac{1}{2m} \sum_{i=1}^m (y_i - \hat{y}_i)^2
$$
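
The two error functions above can be sketched in plain Python. This is a minimal illustration, not code from the course; the function names and the toy data are assumptions, and the squared error keeps the $\frac{1}{2m}$ factor used in these notes.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |y - y_hat| over all m points."""
    m = len(y_true)
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / m


def half_mean_squared_error(y_true, y_pred):
    """Sum of (y - y_hat)^2 divided by 2m, the convention used in these notes."""
    m = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / (2 * m)


# Made-up labels and predictions, for illustration only.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

mae = mean_absolute_error(y_true, y_pred)      # (1 + 0 + 2) / 3
mse = half_mean_squared_error(y_true, y_pred)  # (1 + 0 + 4) / 6
```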


### [Mini-batch Gradient Descent](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Supervised%20Learning/01%20Linear%20Regression/Mini-batch%20Gradient%20Descent.pdf)

#### Batch Gradient Descent
Apply the squared (or absolute) trick at every point in the data all at the same time, and repeat this process many times.

#### Stochastic Gradient Descent
Apply the squared (or absolute) trick at every point in the data one by one, and repeat this process many times.

![batch-stochastic](img/batch-stochastic.png)

#### Mini-batch Gradient Descent
In practice, the best way to do linear regression is to split your data into many small batches, each with roughly the same number of points, and then use each batch to update your weights. This is called mini-batch gradient descent.

![minibatch](img/minibatch.png)
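
As a rough sketch of the idea, the loop below fits a line $\hat{y} = wx + b$ with mini-batch gradient descent on the squared error. It is a pure-Python illustration under assumptions: the function name, the batch size, the learning rate, and the noise-free toy data are all made up for this example and are not taken from the quiz solution.

```python
import random


def minibatch_gradient_descent(xs, ys, batch_size=4, learning_rate=0.05,
                               epochs=1000, seed=0):
    """Fit y ~ w * x + b by mini-batch gradient descent on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # draw new random batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of 1/(2m) * sum((y - y_hat)^2) over this batch only.
            grad_w = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum((w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # The update rule w_i -> w_i - alpha * dError/dw_i.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Toy data lying exactly on y = 2x + 1, so the fit should approach w = 2, b = 1.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [2 * x + 1 for x in xs]
w, b = minibatch_gradient_descent(xs, ys)
```

Setting `batch_size` to the full dataset size gives batch gradient descent, and setting it to 1 gives stochastic gradient descent.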

[Quiz: Mini-Batch Gradient Descent](../../edit/01%20Linear%20Regression/batch_graddesc_solution.py)

[Programming Quiz: Linear Regression in scikit-learn](../../edit/01%20Linear%20Regression/gapminder1.py)

[Programming Quiz: Multiple Linear Regression](../../edit/01%20Linear%20Regression/multiple_linear_Regression.py)


### [Linear Regression Warnings](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Supervised%20Learning/01%20Linear%20Regression/Linear%20Regression%20Warnings.pdf)

__Linear Regression Works Best When the Data is Linear__
Linear regression produces a straight line model from the training data. If the relationship in the training data is not really linear, you'll need to either make adjustments (transform your training data), add features (we'll come to this next), or use another kind of model.

__Linear Regression is Sensitive to Outliers__
Linear regression tries to find a 'best fit' line among the training data. If your dataset has some outlying extreme values that don't fit a general pattern, they can have a surprisingly large effect.


### Polynomial Regression
[Quiz: Polynomial Regression](../../edit/01%20Linear%20Regression/poly_reg.py)


### Regularization
- L1
- L2

![Regularization](img/Regularization.png)

[Quiz: Regularization](../../edit/01%20Linear%20Regression/regularization.py)


### [Feature Scaling](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Supervised%20Learning/01%20Linear%20Regression/FeatureScaling.pdf)

What is feature scaling? Feature scaling is a way of transforming your data into a common range of values. There are two common scalings:

1. Standardizing
__Standardizing__ is completed by taking each value of your column, subtracting the mean of the column, and then dividing by the standard deviation of the column.

2. Normalizing
With __normalizing__, data are scaled between 0 and 1.
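
Both scalings can be sketched with the standard library. This is an illustrative sketch only: the function names and the sample column are assumptions, the standard deviation used is the population one, and a constant column (zero spread) would need special handling that is omitted here.

```python
import statistics


def standardize(values):
    """Subtract the column mean, then divide by the column standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed choice)
    return [(v - mean) / std for v in values]


def normalize(values):
    """Min-max scale the column so that it lies between 0 and 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical column of heights in centimetres.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
standardized = standardize(heights_cm)  # mean 0, standard deviation 1
normalized = normalize(heights_cm)      # [0.0, 0.25, 0.5, 0.75, 1.0]
```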

#### When Should I Use Feature Scaling?
In many machine learning algorithms, the result will change depending on the units of your data. This is especially true in two specific cases:

1. When your algorithm uses a distance-based metric to predict.
2. When you incorporate regularization.

#### Regularization
When you start introducing regularization, you will again want to scale the features of your model. The penalty on particular coefficients in regularized linear regression techniques depends largely on the scale associated with the features. When one feature is on a small range, say from 0 to 10, and another is on a large range, say from 0 to 1 000 000, applying regularization is going to unfairly punish the feature with the small range. Features with small ranges need to have larger coefficients compared to features with large ranges in order to have the same effect on the outcome of the data. (Think about how `ab = ba` for two numbers `a` and `b`.) Therefore, if regularization could remove one of those two features with the same net increase in error, it would rather remove the small-ranged feature with the large coefficient, since that would reduce the regularization term the most.

[Quiz: Feature Scaling](../../edit/01%20Linear%20Regression/feature_scaling.py)

## Decision Trees

### [Entropy](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Supervised%20Learning/03%20Decision%20Trees/Multiclass%20Entropy.pdf)
The more rigid or homogeneous the set is, the less entropy it has, and vice versa.

![Entropy](img/Entropy.png)

$$
Entropy = -\frac{m}{m+n}log_2(\frac{m}{m+n})-\frac{n}{m+n}log_2(\frac{n}{m+n})
$$

We can state this in terms of probabilities instead, writing the proportion of red balls as $p_1$ and the proportion of blue balls as $p_2$:

$$
p_1=\frac{m}{m+n}
\\
p_2=\frac{n}{m+n}
\\
Entropy = -p_1log_2(p_1) - p_2log_2(p_2)
$$

This entropy equation can be extended to the multi-class case, where we have three or more possible values:

$$
Entropy = -p_1log_2(p_1) - p_2log_2(p_2) - ... - p_nlog_2(p_n) = -\sum_{i=1}^np_ilog_2(p_i)
$$

The minimum value is still 0, when all elements are of the same value. The maximum value is still achieved when the outcome probabilities are the same, but the upper limit increases with the number of different outcomes. (For example, you can verify the maximum entropy is 2 if there are four different possibilities, each with probability 0.25.)
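
The multi-class formula can be checked with a small function. This is a minimal sketch, assuming the set is described by the count of each class; the function name and the example counts are illustrative.

```python
import math


def entropy(counts):
    """Entropy (in bits) of a set described by the count of each class."""
    total = sum(counts)
    result = 0.0
    for count in counts:
        if count > 0:  # a class with no members contributes nothing
            p = count / total
            result -= p * math.log2(p)
    return result


two_class = entropy([4, 4])         # evenly mixed red and blue balls
pure = entropy([8, 0])              # homogeneous set
four_class = entropy([1, 1, 1, 1])  # four equally likely outcomes
```

Here `two_class` evaluates to 1, `pure` to 0, and `four_class` to 2, matching the minimum and maximum values described above.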


### Information Gain

$$
IG = Entropy(Parent)-[\frac{m}{m+n}Entropy(Child_1)+\frac{n}{m+n}Entropy(Child_2)]
$$
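
A short sketch of this formula follows. It assumes each node is described by its class counts and generalizes the two-child weighting to any number of children; the names and the example splits are made up for illustration.

```python
import math


def entropy(counts):
    """Entropy (in bits) of a node described by the count of each class."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# A split that separates the two classes perfectly gains the full 1 bit.
perfect_split = information_gain([5, 5], [[5, 0], [0, 5]])
# A split whose children look exactly like the parent gains nothing.
useless_split = information_gain([4, 4], [[2, 2], [2, 2]])
```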

### [Hyperparameters](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Supervised%20Learning/03%20Decision%20Trees/Hyperparameters.pdf)

- Maximum Depth
The maximum depth of a decision tree is simply the largest possible length from the root to a leaf. A tree of maximum depth `k` can have at most $2^k$ leaves.

- Minimum number of samples to split
A node must have at least `min_samples_split` samples in order to be large enough to split. If a node has fewer samples than `min_samples_split` samples, it will not be split, and the splitting process stops.

- Minimum number of samples per leaf
When splitting a node, one could run into the problem of having 99 samples in one of them, and 1 on the other. This will not take us too far in our process, and would be a waste of resources and time. If we want to avoid this, we can set a minimum for the number of samples we allow on each leaf.
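
To make the three rules concrete, here is a simplified check of whether a proposed split is allowed. It is a sketch of the idea only, not how sklearn implements it; the function `split_allowed` and its default values are hypothetical, while the parameter names mirror the hyperparameters described above.

```python
def split_allowed(depth, n_left, n_right, max_depth=3,
                  min_samples_split=10, min_samples_leaf=5):
    """Return True if a node at `depth` may be split into the two given children."""
    n_samples = n_left + n_right
    if depth >= max_depth:
        return False  # the children would be deeper than the maximum depth
    if n_samples < min_samples_split:
        return False  # the node is too small to be split at all
    if min(n_left, n_right) < min_samples_leaf:
        return False  # one of the resulting leaves would be too small
    return True


balanced = split_allowed(depth=0, n_left=50, n_right=50)  # allowed
lopsided = split_allowed(depth=0, n_left=99, n_right=1)   # 1 sample in a leaf
too_deep = split_allowed(depth=3, n_left=50, n_right=50)  # already at max depth
too_small = split_allowed(depth=1, n_left=4, n_right=4, min_samples_leaf=2)
```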

[Decision Trees in sklearn](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Supervised%20Learning/03%20Decision%20Trees/Decision%20Trees%20in%20sklearn.pdf)

[Quiz: Decision Trees](../../edit/03%20Decision%20Trees/dt.py)

[Lab: Titanic Survival Exploration with Decision Trees](../../notebooks/03%20Decision%20Trees/titanic_survival_exploration.ipynb)

## Naive Bayes

![BayesTheorem](img/BayesTheorem.png)

[Practice Project: Building a spam classifier](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Supervised%20Learning/04%20Naive%20Bayes/Building%20a%20Spam%20Classifier.pdf)

[Lab: Spam Classifier](../../notebooks/04%20Naive%20Bayes/Bayesian_Inference.ipynb)

## Support Vector Machines

### Error Function

Error = Classification Error + Margin Error

Minimize using gradient descent

- Classification Error

![ClassificationError](img/ClassificationError.png)

- [Margin Error](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Supervised%20Learning/05%20Support%20Vector%20machines/Margin%20Error%20Calculation.pdf)

![MarginError](img/MarginError.png)

- large margin, small error
- small margin, large error

![MarginError](img/MarginError2.png)


### The C Parameter

Error = C * Classification Error + Margin Error

- Large C: Focus on classifying points, may have a small margin
- Small C: Focus on a large margin, may make classification errors


### Kernel

- Polynomial Kernel
- RBF Kernel

[SVMs in sklearn](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Supervised%20Learning/05%20Support%20Vector%20machines/SVMs%20in%20sklearn.pdf)

[Quiz: SVM](../../edit/05%20Support%20Vector%20machines/svm.py)

## Ensemble Methods

### [Ensembles](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Supervised%20Learning/06%20Ensemble%20Methods/Ensembles.pdf)

This whole lesson (on ensembles) is about how we can combine (or ensemble) the models you have already seen in a way that makes the combination of these models better at predicting than the individual models.

Commonly the "weak" learners you use are decision trees. In fact the default for most ensemble methods is a decision tree in sklearn. However, you can change this value to any of the models you have seen so far.


#### Why Would We Want to Ensemble Learners Together?

There are two competing variables in finding a well-fitting machine learning model: __Bias__ and __Variance__. It is common in interviews for you to be asked about this topic and how it pertains to different modeling techniques. As a first pass, [the Wikipedia article is quite useful](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff). However, I will give you my perspective and examples:

__Bias__: When a model has high bias, this means that it doesn't do a good job of bending to the data. An example of an algorithm that usually has high bias is linear regression. Even with completely different datasets, we end up with the same line fit to the data. When models have high bias, this is bad.

__Variance__: When a model has high variance, this means that it changes drastically to meet the needs of every point in our dataset. Linear models like the one above have low variance, but high bias. An example of an algorithm that tends to have high variance and low bias is a decision tree (especially decision trees with no early stopping parameters). A decision tree, as a high variance algorithm, will attempt to split every point into its own branch if possible. This is a trait of high variance, low bias algorithms: they are extremely flexible and fit exactly whatever data they see.

By combining algorithms, we can often build models that perform better by meeting in the middle in terms of bias and variance. There are some other tactics that are used to combine algorithms in ways that help them perform better as well. These ideas are based on minimizing bias and variance based on mathematical theories, like the central limit theorem.

- __High Bias, Low Variance__ models tend to underfit data, as they are not flexible. __Linear models__ fall into this category of models.

- __High Variance, Low Bias__ models tend to overfit data, as they are too flexible. __Decision trees__ fall into this category of models.

#### Introducing Randomness Into Ensembles

Another method that is used to improve ensemble methods is to introduce randomness into high variance algorithms before they are ensembled together. The introduction of randomness combats the tendency of these algorithms to overfit (or fit directly to the data available). There are two main ways that randomness is introduced:

1. __Bootstrap the data__ - that is, sampling the data with replacement and fitting your algorithm to the sampled data.

2. __Subset the features__ - in each split of a decision tree or with each algorithm used in an ensemble, only a subset of the total possible features are used.

In fact, these are the two random components used in the next algorithm you are going to see called __random forests__.
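
The two sources of randomness can be sketched with the standard `random` module. The helper names, the seed, and the feature names are hypothetical, and the rows are stand-in integers rather than real training examples.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so some rows may repeat."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(feature_names, k, rng):
    """Pick k distinct features (without replacement)."""
    return rng.sample(feature_names, k)


rng = random.Random(42)                     # fixed seed so the sketch is repeatable
rows = list(range(10))                      # stand-ins for 10 training rows
features = ["age", "fare", "sex", "class"]  # hypothetical feature names

sample = bootstrap_sample(rows, rng)               # same size as the original data
subset = random_feature_subset(features, 2, rng)   # two of the four features
```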


These ensemble methods use a combination of techniques you have seen throughout this lesson:

- __Bootstrap the data__ passed through a learner (bagging).
- __Subset the features__ used for a learner (combined with bagging signifies the two random components of random forests).
- __Ensemble learners__ together in a way that allows those that perform best in certain areas to create the largest impact (boosting).

### [Techniques](https://scikit-learn.org/stable/modules/ensemble.html)

You saw a number of ensemble methods in this lesson including:

- [BaggingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier)

- [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)

- [AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier)

#### Random Forests

Pick a random subset of the features and build a decision tree on it. Repeat this several times, then let the ensemble of trees vote on the prediction.

![RandomForests](img/RandomForests.png)

#### Bagging

Since data may be huge, in general, we do not want to train many models on the same data. This would be very expensive. Instead, we will just take subsets of it and train a weak learner on each one of these subsets. Then we will figure out how to combine these learners.

#### AdaBoost

We fit our first learner to maximize accuracy or, equivalently, minimize the number of errors. The second learner needs to fix the mistakes that the first one made, so we take the misclassified points and make them bigger. In other words, we punish the model more if it misses these points, so the next weak learner focuses on them more. We keep going and then combine these models.

![AdaBoost](img/AdaBoost.png)

##### Combining the Models

Calculate weight of each model by:

$$
weight = ln(\frac{accuracy}{1-accuracy})
$$

![CombiningModels](img/CombiningModels.png)

![CombiningModels](img/CombiningModels2.png)
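
The weight formula can be tried out on a few made-up accuracies. The sketch below also assumes the learners are combined by a weighted vote over +1/-1 predictions, which is one common way to do it; the function names and numbers are illustrative only. Note that an accuracy of exactly 0 or 1 makes the weight undefined (infinite), which this sketch does not handle.

```python
import math


def model_weight(accuracy):
    """Weight of a weak learner: ln(accuracy / (1 - accuracy))."""
    return math.log(accuracy / (1 - accuracy))


def weighted_vote(predictions, weights):
    """Combine +1/-1 predictions from several learners using their weights."""
    score = sum(w * p for p, w in zip(predictions, weights))
    return 1 if score >= 0 else -1


# Made-up accuracies: a good learner, a coin flip, and a worse-than-random one.
accuracies = [0.875, 0.5, 0.25]
weights = [model_weight(a) for a in accuracies]  # positive, zero, negative

# The strong learner says +1 and the worse-than-random one also says +1.
combined = weighted_vote([1, -1, 1], weights)
```

A learner that is right half the time gets weight 0 and is ignored, while a learner that is usually wrong gets a negative weight, so its vote counts in reverse.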

[AdaBoost in sklearn](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Supervised%20Learning/06%20Ensemble%20Methods/AdaBoost%20in%20sklearn.pdf)

[Lab: Spam Classifying](../../notebooks/06%20Ensemble%20Methods/Spam_&_Ensembles.ipynb)


### Additional Resources

Additionally, here are some great resources on AdaBoost if you'd like to learn some more!

- Here is the original [paper](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Supervised%20Learning/06%20Ensemble%20Methods/IntroToBoosting.pdf) from Freund and Schapire.

- A follow-up [paper](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Supervised%20Learning/06%20Ensemble%20Methods/boostingexperiments.pdf) from the same authors regarding several experiments with AdaBoost.

- A great [tutorial](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Supervised%20Learning/06%20Ensemble%20Methods/explaining-adaboost.pdf) by Schapire.

## Model Evaluation Metrics

### Confusion Matrix

![confusion](img/confusion.png)

In this image, the blue points are labelled positive, and the red points are labelled negative. Furthermore, the points on top of the line are predicted (guessed) to be positive, and the points below the line are predicted to be negative.

#### Type 1 and Type 2 Errors
Sometimes in the literature, you'll see False Positives and False Negatives as Type 1 and Type 2 errors. Here is the correspondence:

- __Type 1 Error (Error of the first kind, or False Positive)__: In the medical example, this is when we misdiagnose a healthy patient as sick.
- __Type 2 Error (Error of the second kind, or False Negative)__: In the medical example, this is when we misdiagnose a sick patient as healthy.
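
The four cells of the confusion matrix can be counted directly from labels and predictions. This is a small sketch assuming binary labels where 1 means positive (sick) and 0 means negative (healthy); the function name and the data are made up.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 = positive, 0 = negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # Type 1 errors
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # Type 2 errors
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


# Hypothetical medical example: 1 = sick, 0 = healthy.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
```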


## Classification Measures
If you are fitting your model to predict categorical data (spam or not spam), there are different measures to understand how well your model is performing than if you are predicting numeric values (the price of a home).

As we look at classification metrics, note that the [wikipedia page](https://en.wikipedia.org/wiki/Precision_and_recall) on this topic is wonderful, but also a bit daunting. I frequently use it to remember which metric does what.

Specifically, you saw how to calculate:

### Accuracy

Accuracy is often used to compare models, as it tells us the proportion of observations we correctly labeled.

![accuracy](img/accuracy.png)

$$
Accuracy = \frac{Correctly Classified Points}{All points} = \frac{True Positives + True Negatives}{Total}
$$

Often accuracy is not the only metric you should be optimizing on. This is especially the case when you have class imbalance in your data. Optimizing on only accuracy can be misleading in how well your model is truly performing. With that in mind, you saw some additional metrics.

### Precision

Out of the points we have predicted to be positive, how many are correct?

Precision focuses on the __predicted__ "positive" values in your dataset. By optimizing based on precision values, you are determining if you are doing a good job of predicting the positive values, as compared to predicting negative values as positive.

![precision](img/precision.png)

$$
Precision = \frac{True Positive}{True Positive + False Positive}
$$

### Recall

Out of the points labelled "positive", how many did we correctly predict?

Recall focuses on the __actual__ "positive" values in your dataset. By optimizing based on recall values, you are determining if you are doing a good job of predicting the positive values __without__ regard of how you are doing on the __actual__ negative values. If you want to perform something similar to recall on the __actual__ 'negative' values, this is called specificity (TN / (TN + FP)).

![recall](img/recall.png)

$$
Recall = \frac{True Positive}{True Positive + False Negative}
$$

![PrecisionRecall](img/PrecisionRecall.png)
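
Accuracy, precision, and recall can all be computed from the confusion matrix counts. The sketch below uses made-up counts for an imbalanced dataset to show how a high accuracy can hide a mediocre recall; the function names are illustrative.

```python
def accuracy(tp, fp, fn, tn):
    """Proportion of all points that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """Of the points predicted positive, the proportion that really are positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Of the points that really are positive, the proportion we predicted positive."""
    return tp / (tp + fn)


# Made-up counts: only 10 of the 100 points are actually positive.
tp, fp, fn, tn = 6, 2, 4, 88

acc = accuracy(tp, fp, fn, tn)  # 0.94
prec = precision(tp, fp)        # 0.75
rec = recall(tp, fn)            # 0.6
```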


### F-beta Score

In order to look at a combination of metrics at the same time, there are some common techniques like the F-Beta Score (where the F1 score is frequently used), as well as the ROC and AUC. The $\beta$ parameter controls the balance between precision and recall in the F score, which allows the two to be considered simultaneously. The most common value for $\beta$ is 1, as this gives the harmonic mean of precision and recall.

$$
F_{\beta} Score = (1+\beta^2) \cdot \frac{Precision \cdot Recall}{(\beta^2 \cdot Precision) + Recall}
$$

- If $\beta$ = 0, then we get precision.
- If $\beta$ = $\infty$, then we get recall.
- For other values of $\beta$, if they are close to 0, we get something close to precision, if they are large numbers, then we get something close to recall, and if $\beta$ = 1, then we get the __harmonic mean__ of precision and recall.

#### F1 Score

$$
F1 Score = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}
$$
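
The limiting cases listed above can be verified numerically. This is a minimal sketch of the $F_{\beta}$ formula; the function name and the precision and recall values are made up for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


precision, recall = 0.75, 0.6  # made-up values

f0 = f_beta(precision, recall, beta=0)        # equals precision
f1 = f_beta(precision, recall, beta=1)        # harmonic mean of the two
f_large = f_beta(precision, recall, beta=1000)  # very close to recall
```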

### ROC Curve & AUC

Receiver Operating Characteristic

By sweeping the classification threshold and plotting the true positive rate against the false positive rate at each threshold, we trace out a curve known as the ROC curve, and we can measure the area under it (the AUC). Similar to each of the other metrics above, a higher AUC (closer to 1) suggests better model performance, while an AUC near 0.5 corresponds to random guessing.

$$
True Positive Rate = \frac{True Positives}{All Positives}
\\
False Positive Rate = \frac{False Positives}{All Negatives}
$$

![ROCCurve](img/ROCCurve.png)

![ROC](img/ROC.png)

__Summary__: The closer your area under the ROC curve is to 1, the better your model is.
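
A rough sketch of how the curve and its area could be computed by hand: sweep the threshold over every distinct score, record the (false positive rate, true positive rate) pair at each one, and integrate with the trapezoidal rule. The function names, labels, and scores are assumptions made for this example, and the code assumes both classes are present.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    positives = sum(y_true)                 # assumes at least one positive
    negatives = len(y_true) - positives     # assumes at least one negative
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up labels (1 = positive) and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auc(roc_points(y_true, scores))  # 8/9: one negative outranks one positive

# A model that ranks every positive above every negative reaches an AUC of 1.
perfect_area = auc(roc_points([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))
```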

You may end up choosing to optimize on any of these measures. I commonly end up using AUC or an F1 score in practice. However, there are always reasons to choose one measure over another depending on your situation.


## Regression Measures
When you want to measure how well your algorithms perform at predicting numeric values, there are three main metrics that are frequently used: __mean absolute error__, __mean squared error__, and __r2__ values.

As an important note, optimizing on the mean absolute error may lead to a different 'best model' than if you optimize on the mean squared error. However, optimizing on the mean squared error will __always__ lead to the same 'best' model as if you were to optimize on the __r2__ value.

Again, if you choose a model with the best r2 value (the highest), it will also be the model that has the lowest (MSE). Choosing one versus another is based on which one you feel most comfortable explaining to someone else.

### Mean Absolute Error (MAE)

The first metric you saw was the mean absolute error. This is a useful metric to optimize on when the value you are trying to predict follows a skewed distribution. Optimizing on an absolute value is particularly helpful in these cases because outliers influence models optimized on this metric less than they influence models optimized on the mean squared error. The optimal constant prediction for this metric is the median value. When you optimize for the R2 value or the mean squared error, the optimal constant prediction is actually the mean.

![mae](img/mae.png)
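
The median-versus-mean point can be checked on a tiny skewed sample. The sketch below compares two constant predictions, the median and the mean, under each metric; the helper names and the data (with one large outlier) are made up for illustration.

```python
import statistics


def mae_constant(values, guess):
    """Mean absolute error when every point is predicted as `guess`."""
    return sum(abs(v - guess) for v in values) / len(values)


def mse_constant(values, guess):
    """Mean squared error when every point is predicted as `guess`."""
    return sum((v - guess) ** 2 for v in values) / len(values)


values = [1, 2, 3, 4, 100]           # skewed by a single outlier
median = statistics.median(values)   # 3
mean = statistics.mean(values)       # 22

mae_at_median = mae_constant(values, median)  # 20.2
mae_at_mean = mae_constant(values, mean)      # 31.2
mse_at_median = mse_constant(values, median)  # 1883.0
mse_at_mean = mse_constant(values, mean)      # 1522.0
```

The median gives the lower MAE and the mean gives the lower MSE, and the outlier pulls the mean much further than it pulls the median.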

### Mean-Squared Error (MSE)

The mean squared error is by far the most used metric for optimization in regression problems. As with MAE, you want to find a model that minimizes this value. This metric can be greatly impacted by skewed distributions and outliers. When a model is considered optimal via MAE, but not for MSE, it is useful to keep this in mind. In many cases, it is easier to optimize on MSE, because the quadratic term is differentiable everywhere, whereas the absolute value is not differentiable at zero. This makes MSE better suited to gradient-based optimization algorithms.

![mse](img/mse.png)

### R2 Score

Finally, the r2 value is another common metric when looking at regression values. Optimizing a model to have the lowest MSE will also optimize a model to have the highest R2 value. This is a convenient feature of this metric. The R2 value is frequently interpreted as the 'amount of variability' captured by a model. Therefore, you can think of MSE as the average squared amount you miss by across all the points, and the R2 value as the amount of the variability in the points that you capture with a model.

![r2](img/r2.png)
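
The link between MSE and R2 can be seen by scoring two sets of predictions against the same targets. This is a sketch with made-up data and illustrative function names; here MSE uses the plain $\frac{1}{m}$ average, and R2 is computed as one minus the ratio of the residual sum of squares to the total sum of squares.

```python
def mse(y_true, y_pred):
    """Mean squared error (plain 1/m average)."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
model_a = [1.1, 1.9, 3.2, 3.8, 5.0]  # made-up predictions, close to the targets
model_b = [2.0, 2.0, 3.0, 4.0, 4.0]  # made-up predictions, further off

mse_a, mse_b = mse(y_true, model_a), mse(y_true, model_b)
r2_a, r2_b = r2_score(y_true, model_a), r2_score(y_true, model_b)
```

Because the total sum of squares depends only on the targets, the model with the lower MSE on a given dataset is also the one with the higher R2.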


## [Recap](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Supervised%20Learning/07%20Model%20Evaluation%20Metrics/Recap.pdf)

[Lab: Classification Metrics](../../notebooks/07%20Model%20Evaluation%20Metrics/Classification_Metrics.ipynb)

[Lab: Regression Metrics](../../notebooks/07%20Model%20Evaluation%20Metrics/Regression%20Metrics.ipynb)

## Training and Tuning

### Types of Errors

![tradeoff](img/tradeoff.png)

### Cross Validation

- Training dataset: training our model
- Cross Validation: making decisions
- Testing dataset: final testing

![CrossValidation](img/CrossValidation.png)

### K-Fold Cross Validation

![K-FoldCrossValidation](img/K-FoldCrossValidation.png)
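
A minimal sketch of how the folds could be built: split the sample indices into `k` nearly equal chunks and let each chunk serve once as the validation set while the rest is used for training. The function name is hypothetical, and the indices are taken in order here; in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=4))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Every sample lands in exactly one validation fold, so each point is used for validation once and for training in the other folds.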

### Learning Curves

![LearningCurves](img/learning-curves.png)

[Detecting Overfitting and Underfitting](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Supervised%20Learning/08%20Training%20and%20Tuning/Detecting%20Overfitting%20and%20Underfitting%20Solution.pdf)

### Grid Search

[Grid Search in sklearn](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Supervised%20Learning/08%20Training%20and%20Tuning/Grid%20Search%20in%20sklearn.pdf)

[Lab: Grid Search](../../notebooks/08%20Training%20and%20Tuning/Grid_Search_Lab.ipynb)

[Lab: Diabetes Case Study](../../notebooks/08%20Training%20and%20Tuning/Diabetes%20Case%20Study.ipynb)