# Supervised Learning

## Linear Regression

### Gradient Descent

$$
w_i \rightarrow w_i - \alpha \frac{\partial}{\partial w_i} Error
$$
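As a minimal sketch of this update rule, the NumPy function below performs one step for a linear model `y_hat = X @ w + b`. It assumes a squared error with a `1/(2m)` factor; the function name, the bias term `b`, and the default learning rate are illustrative choices, not taken from the course material.

```python
import numpy as np

def gradient_descent_step(w, b, X, y, alpha=0.01):
    """Apply w_i -> w_i - alpha * dError/dw_i once (and the same for b).

    Assumes Error = 1/(2m) * sum((y - y_hat)^2) with y_hat = X @ w + b.
    """
    m = len(y)
    residual = X @ w + b - y        # y_hat - y, shape (m,)
    grad_w = X.T @ residual / m     # partial derivatives of Error w.r.t. each w_i
    grad_b = residual.mean()        # partial derivative of Error w.r.t. b
    return w - alpha * grad_w, b - alpha * grad_b
```

Calling this function repeatedly, feeding each result back in as the next `w` and `b`, gives plain gradient descent.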

#### Error Functions
- Mean Absolute Error

![MeanAbsoluteError](img/MeanAbsoluteError.png)

$$
Error = \frac{1}{m} \sum_{i=1}^m |y_i - \hat{y}_i|
$$

- Mean Squared Error

![MeanSquaredError](img/MeanSquaredError.png)

$$
Error = \frac{1}{2m} \sum_{i=1}^m (y_i - \hat{y}_i)^2
$$
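The two error functions can be written directly in NumPy. This is a small sketch; the function names are illustrative. Note that the squared error here keeps the `1/(2m)` factor used in these notes, which differs by a factor of 2 from the mean squared error reported by most libraries.

```python
import numpy as np

def mean_absolute_error(y, y_hat):
    """Error = 1/m * sum(|y - y_hat|)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean(np.abs(y - y_hat))

def mean_squared_error_half(y, y_hat):
    """Error = 1/(2m) * sum((y - y_hat)^2), with the 1/2 factor from the notes."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.sum((y - y_hat) ** 2) / (2 * len(y))
```

For `y = [1, 2, 3]` and `y_hat = [1, 1, 5]`, the absolute differences are 0, 1, and 2, so the mean absolute error is 1.0, and the squared differences sum to 5, so the squared error is 5/6.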


### [Mini-batch Gradient Descent](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Supervised%20Learning/01%20Linear%20Regression/Mini-batch%20Gradient%20Descent.pdf) 
#### Batch Gradient Descent
Batch gradient descent applies the squared (or absolute) trick at every point in the data all at the same time, and repeats this process many times.

#### Stochastic Gradient Descent
Stochastic gradient descent applies the squared (or absolute) trick at every point in the data one by one, and repeats this process many times.

![batch-stochastic](img/batch-stochastic.png)

#### Mini-batch Gradient Descent
In practice, the best way to do linear regression is to split your data into many small batches, each with roughly the same number of points, and then use each batch in turn to update your weights. This is called mini-batch gradient descent.

![minibatch](img/minibatch.png)
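The batching idea can be sketched in a few lines of NumPy. This is not the quiz solution: the function name, the shuffling step, the zero initialization, and the default hyperparameters are all assumptions made for illustration, and the update assumes the squared error with a `1/(2m)` factor.

```python
import numpy as np

def minibatch_gradient_descent(X, y, batch_size=20, alpha=0.01, epochs=100, seed=0):
    """Fit y_hat = X @ w + b by updating the weights once per mini-batch."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(m)                 # reshuffle the points every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]  # indices of the current batch
            residual = X[idx] @ w + b - y[idx]     # y_hat - y on this batch only
            w -= alpha * X[idx].T @ residual / len(idx)
            b -= alpha * residual.mean()
    return w, b
```

Setting `batch_size` to the number of points recovers batch gradient descent, and setting it to 1 recovers stochastic gradient descent.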

[Quiz: Mini-Batch Gradient Descent](../../edit/01%20Linear%20Regression/batch_graddesc_solution.py)

[Programming Quiz: Linear Regression in scikit-learn](../../edit/01%20Linear%20Regression/gapminder1.py)

[Programming Quiz: Multiple Linear Regression](../../edit/01%20Linear%20Regression/multiple_linear_Regression.py)


### [Linear Regression Warnings](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Supervised%20Learning/01%20Linear%20Regression/Linear%20Regression%20Warnings.pdf)

__Linear Regression Works Best When the Data is Linear__
Linear regression produces a straight line model from the training data. If the relationship in the training data is not really linear, you'll need to either make adjustments (transform your training data), add features (we'll come to this next), or use another kind of model.

__Linear Regression is Sensitive to Outliers__
Linear regression tries to find a 'best fit' line among the training data. If your dataset has some outlying extreme values that don't fit the general pattern, they can have a surprisingly large effect on the fitted line, especially under squared error, which penalizes large deviations heavily.
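To illustrate the outlier warning, the sketch below fits a line to ten perfectly linear points and then again after moving a single point far off the line. The data are made up for this example, and `np.polyfit` with `deg=1` is used as a convenient least-squares line fit.

```python
import numpy as np

x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0                  # perfectly linear data: slope 2, intercept 1

slope_clean, intercept_clean = np.polyfit(x, y, deg=1)

y_outlier = y.copy()
y_outlier[-1] += 50.0              # a single extreme value at the last point

slope_out, intercept_out = np.polyfit(x, y_outlier, deg=1)

print(f"clean fit:   slope={slope_clean:.2f}, intercept={intercept_clean:.2f}")
print(f"with outlier: slope={slope_out:.2f}, intercept={intercept_out:.2f}")
```

With only one of the ten points changed, the fitted slope more than doubles and the intercept turns negative.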


### Polynomial Regression
[Quiz: Polynomial Regression](../../edit/01%20Linear%20Regression/poly_reg.py)


### Regularization
- L1
- L2

![Regularization](img/Regularization.png)

[Quiz: Regularization](../../edit/01%20Linear%20Regression/regularization.py)


### [Feature Scaling](https://github.com/stephengineer/Introduction-to-Machine-Learning-with-TensorFlow/blob/main/Supervised%20Learning/01%20Linear%20Regression/FeatureScaling.pdf)

What is feature scaling? Feature scaling is a way of transforming your data into a common range of values. There are two common scalings:

1. Standardizing
__Standardizing__ is done by taking each value of your column, subtracting the mean of the column, and then dividing by the standard deviation of the column.

2. Normalizing
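With __normalizing__, data are rescaled to lie between 0 and 1, typically by subtracting the minimum of the column and then dividing by the range (maximum minus minimum) of the column.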
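Both scalings can be sketched for a single column in NumPy. The function names are illustrative, the column is assumed not to be constant (otherwise the divisor is zero), and `std()` here is NumPy's default population standard deviation; some libraries default to the sample standard deviation instead.

```python
import numpy as np

def standardize(column):
    """Subtract the column mean, then divide by the column standard deviation."""
    column = np.asarray(column, dtype=float)
    return (column - column.mean()) / column.std()

def normalize(column):
    """Min-max scale the column so its values lie between 0 and 1."""
    column = np.asarray(column, dtype=float)
    return (column - column.min()) / (column.max() - column.min())
```

For example, `normalize([1, 2, 3, 4, 5])` gives `[0, 0.25, 0.5, 0.75, 1]`, while `standardize` of the same column has mean 0 and standard deviation 1.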

#### When Should I Use Feature Scaling?
In many machine learning algorithms, the result will change depending on the units of your data. This is especially true in two specific cases:

1. When your algorithm uses a distance-based metric to predict.
2. When you incorporate regularization.


#### Regularization
When you introduce regularization, you will again want to scale the features of your model. The penalty on particular coefficients in regularized linear regression depends largely on the scale of the corresponding features. When one feature is on a small range, say from 0 to 10, and another is on a large range, say from 0 to 1 000 000, regularization unfairly punishes the feature with the small range. A feature with a small range needs a larger coefficient than a feature with a large range in order to have the same effect on the prediction. (Think about how `ab = ba` for two numbers `a` and `b`.) Therefore, if regularization could remove either of those two features with the same net increase in error, it would rather remove the small-ranged feature with the large coefficient, since that would reduce the regularization term the most.
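The effect of feature ranges on coefficient size can be checked numerically. The sketch below uses made-up data in which both features contribute equally to the target relative to their ranges, fits an unregularized least-squares model with `np.linalg.lstsq`, and compares the squared coefficients (the terms an L2 penalty would add up) before and after standardizing. The variable names and the target are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
small = rng.uniform(0, 10, size=200)          # feature on a small range
large = rng.uniform(0, 1_000_000, size=200)   # feature on a large range

# Hypothetical target: each feature contributes between 0 and 1.
y = small / 10 + large / 1_000_000

# Least-squares fit on the raw features.
X = np.column_stack([small, large])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Same fit after standardizing each feature (and centering the target).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
coef_scaled, *_ = np.linalg.lstsq(X_scaled, y - y.mean(), rcond=None)

print("raw coefficients:     ", coef)
print("raw L2 terms:         ", coef ** 2)
print("scaled coefficients:  ", coef_scaled)
print("scaled L2 terms:      ", coef_scaled ** 2)
```

On the raw features the small-range coefficient is about 0.1 and the large-range coefficient about 0.000001, so the small-range feature accounts for almost all of the L2 penalty. After standardizing, the two coefficients are of similar size and the penalty treats the features comparably.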

[Quiz: Feature Scaling](../../edit/01%20Linear%20Regression/feature_scaling.py)

## Perceptron Algorithm
