# Cost Function

Consider a training set (features and targets) and a univariate linear regression model with parameters $w$ and $b$, $\hat{y} = wx + b$.  
Here $w$ is the slope, $b$ is the intercept, $x$ is the feature (input), and $y$ is the target (label).  
Using the training set, we need to find $w,b$ that provide a good fit.  
The prediction for the $i$-th example is $\hat{y}^{(i)} = f_{w,b}(x^{(i)})$.

Cost function: compares $\hat{y}$ to $y$. The error is $\hat{y}-y$.  
- Squared error cost function:  
$J(w,b)=\frac{1}{2m}\sum_{i=1}^m(\hat{y}^{(i)}-y^{(i)})^2$ or  
 $J(w,b)=\frac{1}{2m}\sum_{i=1}^m(f_{w,b}(x^{(i)})-y^{(i)})^2$,  
where $m$ is the number of training examples.

By evaluating the cost over a range of $w,b$ values, a set of cost-function values can be computed, and the pair $w,b$ that gives the smallest $J(w,b)$ can be found.
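A minimal sketch of this brute-force search, assuming the same two-point data set used in the code cell below and an arbitrarily chosen grid of candidate values. The names `squared_error_cost`, `w_grid`, and `b_grid` are illustrative and are not part of the lab utilities.

```python
import numpy as np

def squared_error_cost(x, y, w, b):
    """Vectorized J(w,b) = 1/(2m) * sum((w*x + b - y)^2)."""
    residuals = w * x + b - y
    return np.sum(residuals ** 2) / (2 * x.shape[0])

# Same values as the training data defined later in the notebook
x_demo = np.array([1.0, 2.0])
y_demo = np.array([300.0, 500.0])

# Candidate parameter values (grid spacing is an assumption for this demo)
w_grid = np.arange(0.0, 401.0, 50.0)
b_grid = np.arange(0.0, 201.0, 50.0)

# costs[i, j] holds J(w_grid[i], b_grid[j])
costs = np.array([[squared_error_cost(x_demo, y_demo, w, b) for b in b_grid]
                  for w in w_grid])

# Locate the grid point with the smallest cost
i, j = np.unravel_index(np.argmin(costs), costs.shape)
best_w, best_b = w_grid[i], b_grid[j]
print(f"best w = {best_w}, best b = {best_b}, cost = {costs[i, j]}")
```

On this data the grid happens to contain the exact fit $w=200$, $b=100$, where the cost is zero. A grid search scales poorly as the number of parameters grows, which is one motivation for gradient descent.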


In [None]:
import numpy as np
%matplotlib widget
import matplotlib.pyplot as plt
from lab_utils_uni import plt_intuition, plt_stationary, plt_update_onclick, soup_bowl

# define the cost function for linear regression:
def compute_cost(x, y, w, b): 
    """
    Computes the cost function for linear regression.
    
    Args:
      x (ndarray (m,)): Data, m examples 
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters  
    
    Returns
        total_cost (float): The cost of using w,b as the parameters for linear regression
               to fit the data points in x and y
    """
    # number of training examples
    m = x.shape[0] 
    
    cost_sum = 0 
    for i in range(m): 
        f_wb = w * x[i] + b   
        cost = (f_wb - y[i]) ** 2  
        cost_sum = cost_sum + cost  
    total_cost = (1 / (2 * m)) * cost_sum  

    return total_cost

# define training data
x_train = np.array([1.0, 2.0])           #(size in 1000 square feet)
y_train = np.array([300.0, 500.0])           #(price in 1000s of dollars)

# plot data using custom func. TODO: make it work here! It freezes the kernel
plt.close('all') 
fig, ax, dyn_items = plt_stationary(x_train, y_train)
updater = plt_update_onclick(fig, ax, x_train, y_train, dyn_items)

In [None]:
soup_bowl()

# Gradient Descent

Find the minimum of a given function: $\min_{w_1,\dots,w_n,b} J(w_1,\dots,w_n,b)$. 
- Start with a random initial guess. Change the parameters and see if $J(...)$ goes down. 
- Keep taking steps in the direction of steepest descent. 

Local minima are the attractors of gradient descent.  

### Implementation
Algorithm: update $w$ and $b$ **simultaneously** as  
| $w_{tmp} = w - \alpha \frac{\partial}{\partial w}J(w,b)$  
| $b_{tmp} = b - \alpha \frac{\partial}{\partial b}J(w,b)$  
| $w = w_{tmp}$  
| $b = b_{tmp}$
- $\alpha$ is the `learning rate`: *how big of a step to take*. If it is too small, convergence is slow. If it is too large, the algorithm may diverge. 
- $\frac{\partial}{\partial w}J(w,b)$ and $\frac{\partial}{\partial b}J(w,b)$ are the partial derivatives; their sign determines whether $w$ or $b$ increases or decreases.
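The difference between a simultaneous and a sequential update can be sketched on a toy problem. The cost $J(w,b) = w^2 + wb + b^2$ used here is a hypothetical example chosen only because its two derivatives depend on both parameters; it is not the regression cost from this notebook.

```python
# Hypothetical coupled cost, for illustration only: J(w, b) = w**2 + w*b + b**2
def dj_dw(w, b):
    return 2 * w + b      # partial derivative of J with respect to w

def dj_db(w, b):
    return w + 2 * b      # partial derivative of J with respect to b

alpha = 0.1               # learning rate (assumed value)
w, b = 1.0, 1.0           # starting point (assumed value)

# Simultaneous update: both derivatives are evaluated at the OLD (w, b)
w_tmp = w - alpha * dj_dw(w, b)
b_tmp = b - alpha * dj_db(w, b)
w_sim, b_sim = w_tmp, b_tmp

# Sequential update (not what the algorithm prescribes):
# the derivative for b is evaluated with the already-updated w
w_seq = w - alpha * dj_dw(w, b)
b_seq = b - alpha * dj_db(w_seq, b)

print(f"simultaneous: w = {w_sim:.2f}, b = {b_sim:.2f}")
print(f"sequential:   w = {w_seq:.2f}, b = {b_seq:.2f}")
```

The two variants agree on $w$ but give different values for $b$, which is why the temporary variables are needed.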

If the parameters are already at a local minimum of $J(w,...)$ (and there may be **many** local minima), gradient descent stays there: $\frac{\partial J}{\partial w} = 0$ at the minimum, so the update leaves $w$ unchanged ($w=\rm const$). 

Gradient descent can reach a local minimum with a **fixed** learning rate (provided it is not too large), because the derivative shrinks when approaching a local minimum, so the update steps become smaller automatically.
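A small numerical illustration of the shrinking steps, assuming the hypothetical one-parameter cost $J(w) = w^2$ (derivative $2w$) and an arbitrarily chosen fixed learning rate:

```python
alpha = 0.1          # fixed learning rate (assumed value)
w = 1.0              # starting point (assumed value)
step_sizes = []

for _ in range(5):
    grad = 2 * w                 # derivative of the hypothetical J(w) = w**2
    step = alpha * grad          # update size: learning rate times derivative
    step_sizes.append(abs(step))
    w = w - step                 # gradient descent update

print(step_sizes)    # each step is smaller than the previous one
print(w)             # w moves toward the minimum at 0
```

Even though $\alpha$ never changes, every step is smaller than the one before because the derivative shrinks as $w$ approaches the minimum.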

### Building the model

Consider $f_{w,b} =  wx + b$ and compute the derivatives  
$$
\begin{aligned}
\frac{\partial}{\partial w}J(w,b) 
&= \frac{\partial}{\partial w}\frac{1}{2m}\sum_{i=1}^m\Big(f_{w,b}(x^{(i)})-y^{(i)}\Big)^2 \\
&= \frac{\partial}{\partial w}\frac{1}{2m}\sum_{i=1}^m\Big(wx^{(i)}+b-y^{(i)}\Big)^2 \\
&= \frac{1}{2m}\sum_{i=1}^m\Big(wx^{(i)}+b-y^{(i)}\Big) \times 2x^{(i)} \\
&= \frac{1}{m}\sum_{i=1}^m\Big(wx^{(i)}+b-y^{(i)}\Big)x^{(i)}
\end{aligned}
$$

This also shows why the `cost function` has the factor $2$ in its denominator: it cancels the $2$ produced by differentiating the square.
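The derivative with respect to $b$ follows the same steps. The only difference is that the inner derivative of $wx^{(i)}+b-y^{(i)}$ with respect to $b$ is $1$ instead of $x^{(i)}$:

$$
\begin{aligned}
\frac{\partial}{\partial b}J(w,b) 
&= \frac{\partial}{\partial b}\frac{1}{2m}\sum_{i=1}^m\Big(wx^{(i)}+b-y^{(i)}\Big)^2 \\
&= \frac{1}{2m}\sum_{i=1}^m\Big(wx^{(i)}+b-y^{(i)}\Big) \times 2 \\
&= \frac{1}{m}\sum_{i=1}^m\Big(wx^{(i)}+b-y^{(i)}\Big)
\end{aligned}
$$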

Then the algorithm is: repeat
$$
\begin{aligned}
w &= w - \alpha\frac{1}{m}\sum_{i=1}^m\Big(f_{w,b}(x^{(i)})-y^{(i)}\Big)x^{(i)} \\
b &= b - \alpha\frac{1}{m}\sum_{i=1}^m\Big(f_{w,b}(x^{(i)})-y^{(i)}\Big)
\end{aligned}
$$
until convergence (updating both parameters simultaneously).
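The update rule can be sketched as a short loop. This is a minimal version under stated assumptions: the function name `gradient_descent_sketch`, the learning rate, and the iteration count are chosen for illustration, and a fixed number of iterations stands in for a real convergence check.

```python
import numpy as np

def gradient_descent_sketch(x, y, w_init, b_init, alpha, num_iters):
    """Batch gradient descent for f_wb(x) = w*x + b with squared error cost."""
    m = x.shape[0]
    w, b = w_init, b_init
    for _ in range(num_iters):
        err = w * x + b - y              # f_wb(x_i) - y_i for every example
        dj_dw = np.sum(err * x) / m      # dJ/dw
        dj_db = np.sum(err) / m          # dJ/db
        # Simultaneous update: both gradients were computed from the old w, b
        w, b = w - alpha * dj_dw, b - alpha * dj_db
    return w, b

# Same two-point data set used elsewhere in this notebook
x_demo = np.array([1.0, 2.0])
y_demo = np.array([300.0, 500.0])

w_fit, b_fit = gradient_descent_sketch(x_demo, y_demo, w_init=0.0, b_init=0.0,
                                       alpha=0.1, num_iters=5000)
print(f"w = {w_fit:.4f}, b = {b_fit:.4f}")
```

For this data the parameters approach the exact fit $w=200$, $b=100$. Other data sets would likely need a different $\alpha$ and number of iterations.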

Note. For linear regression, the `squared error cost function` **never** has multiple local minima, because it is 'bowl shaped'. Such a function is called a `convex function`.

Note. If **all the training data** is used at each step of gradient descent, it is called `batch gradient descent`. 

In [1]:
import math, copy
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('./deeplearning.mplstyle')
from lab_utils_uni import plt_house_x, plt_contour_wgrad, plt_divergence, plt_gradients

In [2]:
# Load our data set
x_train = np.array([1.0, 2.0])   #features
y_train = np.array([300.0, 500.0])   #target value

In [3]:
#Function to calculate the cost
def compute_cost(x, y, w, b):
   
    m = x.shape[0] 
    cost = 0
    
    for i in range(m):
        f_wb = w * x[i] + b
        cost = cost + (f_wb - y[i])**2
    total_cost = 1 / (2 * m) * cost

    return total_cost


# compute gradient of the cost function
def compute_gradient(x, y, w, b): 
    """
    Computes the gradient for linear regression 
    Args:
      x (ndarray (m,)): Data, m examples 
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters  
    Returns
      dj_dw (scalar): The gradient of the cost w.r.t. the parameter w
      dj_db (scalar): The gradient of the cost w.r.t. the parameter b     
     """
    
    # Number of training examples
    m = x.shape[0]    
    dj_dw = 0 # derivatives
    dj_db = 0
    
    for i in range(m):  
        f_wb = w * x[i] + b  # model prediction for example i
        dj_dw_i = (f_wb - y[i]) * x[i]  # contribution of example i to dJ/dw
        dj_db_i = f_wb - y[i]  # contribution of example i to dJ/db
        dj_db += dj_db_i  # accumulate over examples
        dj_dw += dj_dw_i 
    dj_dw = dj_dw / m # recall normalization by m
    dj_db = dj_db / m 
        
    return (dj_dw, dj_db)
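One way to sanity-check an analytic gradient like the one above is to compare it with a central finite difference of the cost. The sketch below is self-contained and uses vectorized stand-ins (`cost_vec`, `gradient_vec`, both names are illustrative) rather than the loop-based functions from the cell; the step `eps` and the test point are assumptions.

```python
import numpy as np

def cost_vec(x, y, w, b):
    """Squared error cost J(w,b), vectorized."""
    return np.sum((w * x + b - y) ** 2) / (2 * x.shape[0])

def gradient_vec(x, y, w, b):
    """Analytic gradient (dJ/dw, dJ/db), vectorized."""
    err = w * x + b - y
    return np.mean(err * x), np.mean(err)

x_chk = np.array([1.0, 2.0])
y_chk = np.array([300.0, 500.0])
w0, b0 = 0.0, 0.0        # point at which the gradient is checked
eps = 1e-4               # finite-difference step

# Central differences approximate the partial derivatives numerically
num_dw = (cost_vec(x_chk, y_chk, w0 + eps, b0) - cost_vec(x_chk, y_chk, w0 - eps, b0)) / (2 * eps)
num_db = (cost_vec(x_chk, y_chk, w0, b0 + eps) - cost_vec(x_chk, y_chk, w0, b0 - eps)) / (2 * eps)

dj_dw, dj_db = gradient_vec(x_chk, y_chk, w0, b0)
print(f"analytic: dj_dw = {dj_dw}, dj_db = {dj_db}")
print(f"numeric:  dj_dw = {num_dw:.4f}, dj_db = {num_db:.4f}")
```

At $w=0$, $b=0$ the analytic values are $-650$ and $-400$, and the numerical estimates should agree with them closely.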