# Gradient Descent

In the previous notes, we used gradient descent to solve regression for linear functions. Now we want to gather the steps into a recipe that works for any function, including multi-variable non-linear functions. The order is the same as for multi-variable linear regression, so let's first recap what we did.

## Feature Engineering

Feature engineering is the process of coming up with a formula for our function. This is where we use our understanding of the problem to combine the available features into a function that should predict the output.

### Feature Selection

First, we need to select the features that have an impact on the output. For example, when predicting a house price, the unit number of the house probably has no impact on its price (although in some special cases it might), while the number of bedrooms definitely does.

So the first step is to select the features that are related to our output.


### Adding Features

Sometimes, when we have a better understanding of the output, we can introduce a new feature that explains the data better. In the house pricing example, we know that houses with age 0, called new homes, are more expensive than others, and that the price drops a lot once a house is no longer new. So we can add a new feature, derived from age, that says whether the house is a new home or not (`is_new_home`: true or false). We were already using the age of the house to predict the price, but the new feature explicitly captures the big drop in price, which makes the relation easier for the model to learn.
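As a minimal sketch of this idea, the snippet below derives the `is_new_home` flag from the age feature. The record layout (dictionaries with `bedrooms` and `age` keys) and the helper name `add_is_new_home` are assumptions made for illustration only.

```python
def add_is_new_home(houses):
    """Return copies of the house records with a derived is_new_home flag."""
    # A house counts as "new" when its age is exactly 0 (assumption from the text).
    return [{**house, "is_new_home": house["age"] == 0} for house in houses]


# Hypothetical training records.
houses = [
    {"bedrooms": 3, "age": 0},
    {"bedrooms": 3, "age": 1},
    {"bedrooms": 2, "age": 12},
]

with_flag = add_is_new_home(houses)
for house in with_flag:
    print(house)
```

The original `age` feature is kept, so the model can use both the gradual effect of age and the one-time drop after a house stops being new.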


### Final Feature Engineering Step

Now that we have a list of features that we know have an impact on the output, we need to come up with a formula that explains the relation between the features and the output. The better we understand the data, the better the function we can come up with:
$ f_{\mathbf{w},b}(\mathbf{x^{(i)}}) $

## Scaling

You can refer to the multi-variable linear regression notes to see how we apply scaling to the features. The mathematical process is the same for every function.

### Normalization Scaling

For feature **j**, we use this formula to scale each training example:

$$  x^{(i)}_{\text{scaled},j} = \frac{x^{(i)}_j - \min{x_j}}{\max{x_j} - \min{x_j}} $$
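The formula above can be sketched in plain Python as follows. The function name `min_max_scale`, the sample values, and the handling of a constant column (mapping it to 0) are assumptions, not part of the original notes.

```python
def min_max_scale(column):
    """Scale one feature column into the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant feature would divide by zero, so map it to 0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical values of one feature (for example, house size) over the training set.
sizes = [50.0, 100.0, 150.0, 200.0]
scaled_sizes = min_max_scale(sizes)
print(scaled_sizes)
```

The smallest value maps to 0 and the largest to 1, and every other value lands proportionally in between.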


### Standardization Scaling 

There is another way of scaling, often preferred in practice, that uses the Z-score to scale features. First, let's define some well-known statistical functions (**m** is the number of training examples we have):

**Population Mean**

$$ \mu = \frac{\sum_{i=0}^{m-1} x_i}{m} $$

**Population Standard Deviation**

$$ \sigma = \sqrt{\frac{\sum_{i=0}^{m-1}(x_i - \mu)^2}{m}} $$

**X-Scaled**


$$ \Rightarrow x^{(i)}_{\text{scaled},j} = \frac{x^{(i)}_j - \mu(x_j)}{\sigma(x_j)} $$
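A small sketch of Z-score scaling for one feature column, using the population mean and standard deviation defined above. The function name `standardize`, the sample values, and the treatment of a zero standard deviation are assumptions for illustration.

```python
import math


def standardize(column):
    """Scale one feature column to zero mean and unit standard deviation."""
    m = len(column)
    mu = sum(column) / m
    # Population standard deviation: divide by m, not m - 1.
    sigma = math.sqrt(sum((x - mu) ** 2 for x in column) / m)
    if sigma == 0:
        # Assumption: a constant feature would divide by zero, so map it to 0.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical values of one feature: mean is 5.0, standard deviation is 2.0.
ages = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled_ages = standardize(ages)
print(scaled_ages)
```

After scaling, each value says how many standard deviations it lies above or below the mean of its feature.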

## Cost Function

We define the cost function as the average of the loss function over all training examples, where **L** is our loss function:

$$ J_{\mathbf{w},b} = \frac {1}{m} \sum_{i=0}^{m-1}L(f_{\mathbf{w},b}(\mathbf{x^{(i)}}),y^{(i)})$$


### Loss Function

The loss function should measure how far our function's output is from the real output. If the output is exactly the same as the expected output, the loss should be zero. If the output is very far from the expected output, the loss should be a large positive number. Note that the loss is always zero or positive.

#### Requirements
1. It should measure how far the function output is from the expected output.
2. It should be zero or positive.
3. It should make the cost function ($J_{\mathbf{w},b}$) convex (roughly, bowl-shaped, so that any local minimum is also a global minimum and gradient descent cannot get stuck in a worse local minimum).

#### Most Used Solution

The simplest function we can come up with is the difference between the function output and the expected output. Since the loss should not be negative, we take the absolute value of the difference:
$$ L(f_{\mathbf{w},b}(\mathbf{x^{(i)}}),y^{(i)}) = \lvert f_{\mathbf{w},b}(\mathbf{x^{(i)}}) - y^{(i)} \rvert $$

Using the absolute value makes the mathematics harder, especially during differentiation, because it is not differentiable at zero. So we can use the square of the difference instead to make the value positive:

$$ L(f_{\mathbf{w},b}(\mathbf{x^{(i)}}),y^{(i)}) = (f_{\mathbf{w},b}(\mathbf{x^{(i)}}) - y^{(i)}) ^2$$

In machine learning, we usually add a factor of one half to the loss function, so that the 2 produced by differentiating the square cancels out:

$$ L(f_{\mathbf{w},b}(\mathbf{x^{(i)}}),y^{(i)}) = \frac{1}{2}(f_{\mathbf{w},b}(\mathbf{x^{(i)}}) - y^{(i)}) ^2$$


$$ \Rightarrow J_{\mathbf{w},b} = \frac {1}{2m} \sum_{i=0}^{m-1}(f_{\mathbf{w},b}(\mathbf{x^{(i)}}) - y^{(i)}) ^2$$
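The squared-error cost can be sketched for any model by passing the prediction function in as an argument. The names `squared_error_cost` and `predict`, the quadratic model, and the data points are hypothetical and only serve the example.

```python
def squared_error_cost(predict, xs, ys):
    """Cost J = 1 / (2m) * sum of squared differences between prediction and target."""
    m = len(xs)
    total = sum((predict(x) - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)


# Hypothetical parameters of a non-linear model with one input feature.
w, b = [2.0, 0.5], 1.0


def predict(x):
    """Illustrative model: w_0 * x + w_1 * x^2 + b."""
    return w[0] * x + w[1] * x ** 2 + b


xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.5, 8.0]  # the model is exact on the first two points, off by 1 on the last

cost = squared_error_cost(predict, xs, ys)
print(cost)
```

Only the last example contributes to the cost here, so the result is $\frac{1}{2 \cdot 3} \cdot 1^2 = \frac{1}{6}$.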


#### Other Solutions

Any other loss function that meets the requirements above can be used. For example, in the next course you will see that we use another loss function to keep our cost function convex.

### Overfitting

Minimizing the cost function on the training data alone can lead to a problem called overfitting. Let's see what overfitting is and how we can deal with it.

#### Definition

**Overfitting**, sometimes called **high variance**, is when our model fits the training data almost exactly but does not work on new data. It usually happens when we have a small training dataset and a lot of features. The model matches the training data (with a very low cost), but rather than learning to predict the output, it is just reproducing the exact outputs of the training data. Because of this, the model changes a lot when a new example is added to the training data, and its predictions for unseen data are very bad.

**Underfitting**, sometimes called **high bias**, is the opposite: we have simplified our model too much, for example by using too few features. The model does not even fit the training data (very large cost), and it is also not good on unseen data, because it ignores a lot of relevant information.

**Generalization** is what we're looking for. It happens when we choose a model that is neither too complex nor too simple, so that it actually predicts the output. Such a model fits the training set pretty well, does not change much when a new example is added to the training set, and gives good predictions for unseen data.

#### Solutions to overfitting

##### More Training Data

If we have more training data, then even with a lot of features the model sees enough examples to find out which features are important and which have a very low impact on the output. So when we give it unseen data, it can predict more accurately.

##### Select the correct number of features to be included

If we select only the relevant features in the feature engineering step, our model can make good predictions even with a small training set.

##### Regularization

In regularization, we reduce the size of the parameters ($\mathbf{w}$), especially for the features that we think have a very low impact on the output. This makes sure that no single feature has a very large impact on the model's output. Even if we don't know which features have little impact, we can apply regularization to all the parameters in $\mathbf{w}$ (usually we don't apply regularization to $b$).

The goal is that, by keeping all of our $w_j$ small, the function does not change rapidly when we multiply them by our $x_j$. To achieve this, we introduce a new variable called lambda ($\lambda$), the **regularization parameter**. By adding a new term to the cost function, we make sure that the cost becomes large when we use large $w_j$.


A typical **regularization term** looks like this:

$$ \frac {\lambda}{2m}(\sum_{j=1}^{n}w_j^2)$$

Adding it to our cost function gives:

$$ J_{\mathbf{w},b} = \frac {1}{m} (\sum_{i=0}^{m-1}L(f_{\mathbf{w},b}(\mathbf{x^{(i)}}),y^{(i)})) + \frac {\lambda}{2m}(\sum_{j=1}^{n}w_j^2)$$
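A minimal sketch of the regularized cost, assuming the half squared-error loss from the previous section. The names `regularized_cost`, `linear_model`, and `lam`, as well as the data, are hypothetical and chosen only for this example.

```python
def regularized_cost(predict, w, b, xs, ys, lam):
    """Mean half-squared-error loss plus (lam / 2m) * sum of w_j squared."""
    m = len(xs)
    data_term = sum(0.5 * (predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / m
    # b is deliberately left out of the penalty, as the text suggests.
    reg_term = lam / (2 * m) * sum(w_j ** 2 for w_j in w)
    return data_term + reg_term


def linear_model(w, b, x):
    """Illustrative model used only for this example: w . x + b."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


xs = [[1.0, 0.0], [0.0, 1.0]]
ys = [3.0, 2.0]
w, b = [3.0, 4.0], 0.0

plain = regularized_cost(linear_model, w, b, xs, ys, lam=0.0)
penalized = regularized_cost(linear_model, w, b, xs, ys, lam=1.0)
print(plain, penalized)  # 1.0 7.25
```

With the same parameters, the penalized cost is larger by $\frac{\lambda}{2m}(3^2 + 4^2) = 6.25$, which is what pushes gradient descent toward smaller $w_j$.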

###### Choosing Lambda

For lambda, we should choose a value that is neither too close to zero nor too big. If it's close to zero, the regularization has almost no effect and the model can still overfit. If it's too large, all of our $w_j$ are pushed close to zero and the model underfits.

## Calculating Gradient Descent

Now that we have our cost function, we can take its derivatives. We need to calculate each of $\frac {\partial J_{\mathbf{w}, b}} {\partial w_1}$, $\frac {\partial J_{\mathbf{w}, b}} {\partial w_2}$, ..., $\frac {\partial J_{\mathbf{w}, b}} {\partial w_n}$, and $\frac {\partial J_{\mathbf{w}, b}} {\partial b}$ (this is the mathematical work!).
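As a worked sketch, assuming the squared-error loss with the $\frac{1}{2}$ factor and the regularization term from the sections above, the chain rule gives these partial derivatives for any differentiable $f_{\mathbf{w},b}$:

$$ \frac{\partial J_{\mathbf{w},b}}{\partial w_j} = \frac{1}{m} \sum_{i=0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x^{(i)}}) - y^{(i)}) \frac{\partial f_{\mathbf{w},b}(\mathbf{x^{(i)}})}{\partial w_j} + \frac{\lambda}{m} w_j $$

$$ \frac{\partial J_{\mathbf{w},b}}{\partial b} = \frac{1}{m} \sum_{i=0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x^{(i)}}) - y^{(i)}) \frac{\partial f_{\mathbf{w},b}(\mathbf{x^{(i)}})}{\partial b} $$

In the linear case $f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$, the inner derivatives are simply $\frac{\partial f_{\mathbf{w},b}(\mathbf{x^{(i)}})}{\partial w_j} = x^{(i)}_j$ and $\frac{\partial f_{\mathbf{w},b}(\mathbf{x^{(i)}})}{\partial b} = 1$. For another loss function, the factor $(f_{\mathbf{w},b}(\mathbf{x^{(i)}}) - y^{(i)})$ is replaced by the derivative of that loss with respect to the function output.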

After that, on each step we update all the $w_j$ and $b$ simultaneously to get their next values, and we keep repeating this to minimize the cost function:

$$ w_j = w_j - \alpha \frac{\partial J_{\mathbf{w},b}}{\partial w_j} $$
$$ b = b - \alpha \frac{\partial J_{\mathbf{w},b}}{\partial b} $$
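The sketch below shows one way to make the update simultaneous: all partial derivatives are computed from the old values first, and only then are $w_j$ and $b$ replaced. The names `gradient_step` and `toy_gradient`, and the toy cost function, are assumptions for illustration.

```python
def gradient_step(gradient, w, b, alpha):
    """Perform one simultaneous update of all w_j and b."""
    # Both gradients are evaluated at the OLD (w, b) before anything changes.
    grad_w, grad_b = gradient(w, b)
    new_w = [w_j - alpha * g_j for w_j, g_j in zip(w, grad_w)]
    new_b = b - alpha * grad_b
    return new_w, new_b


def toy_gradient(w, b):
    """Gradient of the toy cost J = w_0^2 + w_0 * b + b^2 (illustration only)."""
    return [2 * w[0] + b], w[0] + 2 * b


w, b = gradient_step(toy_gradient, [1.0], 1.0, alpha=0.1)
print(w, b)
```

Both values move from 1.0 to about 0.7. If $w_0$ were updated first and the new value were then used to compute the derivative for $b$, the result for $b$ would be about 0.73 instead, which is not the update the formulas describe.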

### Choosing alpha

For choosing the correct alpha, please refer to the one-variable linear regression lesson.

## Code

Now that we have the mathematical terms for gradient descent, we can write the code. To decide when to stop computing the next values and return the result, we have two options. The first is to use a limited number of iterations, or steps (we implemented this for linear regression in the previous notes). The other is to use $\epsilon$, a very small number such as $10^{-8}$: whenever the cost function changes by less than $\epsilon$ after an update, we stop iterating.
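Here is a minimal sketch that combines both stopping rules. It is not a definitive implementation: the function names, the default values for `max_steps` and `epsilon`, and the small linear dataset are all assumptions, and the cost and gradient are passed in as functions so the same loop works for any model.

```python
def gradient_descent(cost, gradient, w, b, alpha, max_steps=10_000, epsilon=1e-8):
    """Repeat simultaneous updates until the cost barely changes or steps run out."""
    previous = cost(w, b)
    for step in range(1, max_steps + 1):
        grad_w, grad_b = gradient(w, b)  # evaluated at the old w and b
        w = [w_j - alpha * g_j for w_j, g_j in zip(w, grad_w)]
        b = b - alpha * grad_b
        current = cost(w, b)
        if abs(previous - current) < epsilon:
            return w, b, step  # stopped by the epsilon rule
        previous = current
    return w, b, max_steps  # stopped by the step limit


# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
m = len(xs)


def cost(w, b):
    """Squared-error cost for the model f(x) = w_0 * x + b."""
    return sum((w[0] * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient(w, b):
    """Partial derivatives of the cost above with respect to w_0 and b."""
    errors = [w[0] * x + b - y for x, y in zip(xs, ys)]
    grad_w = [sum(e * x for e, x in zip(errors, xs)) / m]
    grad_b = sum(errors) / m
    return grad_w, grad_b


w, b, steps = gradient_descent(cost, gradient, [0.0], 0.0, alpha=0.1)
print(w, b, steps)
```

On this data the loop stops through the $\epsilon$ rule well before the step limit, with $w_0$ close to 2 and $b$ close to 1. Keeping the step limit as a fallback protects against an endless loop when alpha is too large and the cost never settles.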


That's it! Now you can implement the 