### Linear Regression

* Linear Regression is a statistical technique used to find the relationship between variables
* In ML context, linear regression finds the relationship between features and a label
* In algebraic terms, the model is defined as y = mx + b
    * y is the value we want to predict
    * m is the slope of the line
    * x is the input value
    * b is the y-intercept
* In ML terms, the model is defined as y' = b + w1*x1
    * y' is the predicted value
    * b is the bias term
    * w1 is the weight for feature x1
    * x1 is the input value

<img src="../pics/linear-equation.png" alt="Mathematical representation of a linear model" width="550"/>

* A more sophisticated model can be defined as y' = b + w1*x1 + w2*x2 + ... + wn*xn
    * y' is the predicted value
    * b is the bias term
    * w1, w2, ..., wn are the weights for features x1, x2, ..., xn
    * x1, x2, ..., xn are the input values
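
As a minimal sketch of the two model equations above, the prediction is the bias plus the weighted sum of the features. The function name `predict` and the numbers used here are hypothetical and chosen only for illustration:

```python
def predict(features, weights, bias):
    """Return y' = b + w1*x1 + ... + wn*xn for a single example."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return bias + sum(w * x for w, x in zip(weights, features))


# Hypothetical values, not taken from any real dataset.
single = predict([3.0], weights=[2.0], bias=1.0)            # 1 + 2*3
multi = predict([3.0, 4.0], weights=[2.0, -0.5], bias=1.0)  # 1 + 2*3 - 0.5*4
print(single, multi)
```

The single-feature model is the same function called with one weight and one feature, which is why the multi-feature form is a generalization rather than a different model.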

### Linear Regression: Loss

* Loss is a numerical metric that describes how wrong a model's predictions are
* The goal of training a model is to minimize the loss
* Loss measures the distance between the predicted and actual values, not the direction, so the sign of each error is removed (by taking the absolute value or squaring)
* Four main types of loss in linear regression:
    1. L1 loss
        * The sum of the absolute values of the differences between the predicted and actual values
        * $\sum |\text{actual value} - \text{predicted value}|$
    2. Mean Absolute Error
        * The average of L1 losses across a set of *N* examples
        * $\frac{1}{N}\sum |\text{actual value} - \text{predicted value}|$
    3. L2 loss
        * The sum of the squared differences between the predicted and actual values
        * $\sum (\text{actual value} - \text{predicted value})^2$
    4. Mean Squared Error
        * The average of L2 losses across a set of *N* examples
        * $\frac{1}{N}\sum (\text{actual value} - \text{predicted value})^2$
* When processing multiple examples at once, MAE or MSE is preferred
* When choosing the best loss function, consider how you want the model to treat outliers
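    * MSE moves the model toward outliers, because squaring makes large errors dominate the loss
    * MAE is less affected by outliers, because each error contributes in proportion to its size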
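
The four loss definitions above can be sketched in a few lines of plain Python. The function names and the sample values are assumptions made for illustration. The second dataset replaces one label with an outlier to show how differently MAE and MSE react:

```python
def l1_loss(actual, predicted):
    """Sum of absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted))


def mae(actual, predicted):
    """L1 loss averaged over N examples."""
    return l1_loss(actual, predicted) / len(actual)


def l2_loss(actual, predicted):
    """Sum of squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def mse(actual, predicted):
    """L2 loss averaged over N examples."""
    return l2_loss(actual, predicted) / len(actual)


# Hypothetical labels and predictions.
predicted = [2.0, 2.0, 2.0, 2.0]
actual = [1.0, 2.0, 3.0, 4.0]
actual_outlier = [1.0, 2.0, 3.0, 14.0]  # last label is an outlier

print(mae(actual, predicted), mse(actual, predicted))                  # 1.0 1.5
print(mae(actual_outlier, predicted), mse(actual_outlier, predicted))  # 3.5 36.5
```

In this example the single outlier raises MAE by a factor of 3.5 but raises MSE by a factor of about 24, which is why a model trained on MSE is pulled harder toward outliers.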

### Gradient Descent

* Gradient descent is a mathematical technique that iteratively finds the weights and bias that produce the model with the lowest loss
* Process of gradient descent:
    1. Calculate the loss with the current weight and bias
    2. Determine the direction in which to move the weights and bias to reduce the loss
    3. Move the weight and bias values a small amount in the direction that reduces loss
    4. Repeat steps 1-3 until the loss stops decreasing
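
The four steps above can be sketched for a single-feature model trained with MSE. This is an illustrative sketch, not a reference implementation: the function names, the default learning rate, the stopping tolerance, and the toy data are all assumptions. The gradients used are the standard partial derivatives of MSE with respect to the weight and the bias.

```python
def mse(xs, ys, w, b):
    """Mean squared error of the model y' = w*x + b."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradients(xs, ys, w, b):
    """Partial derivatives of MSE with respect to w and b."""
    n = len(xs)
    dw = sum(2 * ((w * x + b) - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * ((w * x + b) - y) for x, y in zip(xs, ys)) / n
    return dw, db


def gradient_descent(xs, ys, learning_rate=0.05, max_steps=5000, tolerance=1e-12):
    w, b = 0.0, 0.0
    loss = mse(xs, ys, w, b)               # step 1: loss with current parameters
    for _ in range(max_steps):
        dw, db = gradients(xs, ys, w, b)   # step 2: direction that reduces loss
        w -= learning_rate * dw            # step 3: small move against the gradient
        b -= learning_rate * db
        new_loss = mse(xs, ys, w, b)
        improvement = loss - new_loss
        loss = new_loss
        if improvement < tolerance:        # step 4: stop once loss stops decreasing
            break
    return w, b, loss


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w, b, final_loss = gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))
```

On this toy data the learned weight and bias should approach 2 and 1. The `max_steps` cap is a safeguard so the loop ends even if the tolerance is never reached.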

<img src="../pics/loss-process.png" alt="Gradient descent is an iterative process" width="550"/>

### Hyperparameters

* Hyperparameters are variables that control different aspects of training
* Learning rate is a floating point number that influences how quickly the model converges
    * If learning rate is too low, model takes too long to converge
    * If learning rate is too high, the model bounces around the weights and bias that minimize the loss and never converges
* Learning rate determines the magnitude of the changes made to the weights and bias during each step of gradient descent
    * The gradient is multiplied by the learning rate, and the result is subtracted from the current parameters to get the parameters for the next iteration
* Batch size refers to the number of examples the model processes before updating its weights and bias
* Stochastic gradient descent (SGD) uses only a single example per iteration
    * This one example is chosen at random
    * Works given enough iterations, but can be noisy
* Mini-batch stochastic gradient descent is a compromise between SGD and full-batch gradient descent
    * Uses a small number of examples per iteration
    * Reduces noise while still being efficient
* Epoch means the model has processed all examples in the training set once
* Number of epochs is the number of times the model processes all examples in the training set
    * Given a training set with 1,000 examples and a batch size of 100, the model will take 10 iterations to complete one epoch
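
To show how learning rate, batch size, and number of epochs fit together, here is a minimal sketch of mini-batch SGD for a single-feature model with MSE loss. The function name `minibatch_sgd`, the synthetic data, and the chosen hyperparameter values are assumptions for illustration only. Setting `batch_size=1` would give plain SGD, and `batch_size=len(xs)` would give full-batch gradient descent.

```python
import random


def minibatch_sgd(xs, ys, learning_rate, batch_size, epochs, seed=0):
    """Train y' = w*x + b and return (w, b, total number of iterations)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    iterations = 0
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average MSE gradient over the examples in this batch only.
            dw = sum(2 * ((w * xs[i] + b) - ys[i]) * xs[i] for i in batch) / len(batch)
            db = sum(2 * ((w * xs[i] + b) - ys[i]) for i in batch) / len(batch)
            # New parameters = current parameters - learning rate * gradient.
            w -= learning_rate * dw
            b -= learning_rate * db
            iterations += 1  # one parameter update per batch
    return w, b, iterations


# 1,000 synthetic examples on the line y = 2x + 1 (hypothetical data).
xs = [i / 1000 for i in range(1000)]
ys = [2 * x + 1 for x in xs]

epochs = 100
w, b, iterations = minibatch_sgd(xs, ys, learning_rate=0.1, batch_size=100, epochs=epochs)
print(iterations // epochs)  # iterations per epoch: 10
```

With 1,000 examples and a batch size of 100, each epoch performs 10 parameter updates, matching the example above.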
    