## The problem of overfitting
- The algorithm can run into a problem called overfitting, which can cause it to perform poorly on new data.
- Underfitting is almost the opposite problem.
- Regularization will help us minimize the overfitting problem and get our learning algorithms to work much better.
- The term high bias describes an algorithm that has underfit the data, meaning that it is not even able to fit the training set well.
- When a learning algorithm does well even on examples that are not in the training set, that is called generalization. We want our learning algorithm to generalize well, which means making good predictions even on brand new examples that the model has never seen before.
- An overfit model has high variance: small changes in the training set can produce very different fitted functions.
- Underfitting = high bias, overfitting = high variance.
- We can say that the goal of machine learning is to find a model that is neither underfitting nor overfitting. In other words, a model that has neither high bias nor high variance.
![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
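
A minimal sketch of the bias/variance picture above, assuming a made-up dataset (a quadratic trend plus noise) and polynomial degrees 1, 2 and 8 chosen purely for illustration. It fits each polynomial with `np.polyfit` and compares the error on the training set with the error on fresh examples.

```python
import numpy as np

# Hypothetical toy data: a quadratic trend plus Gaussian noise.
rng = np.random.default_rng(0)

def make_data(m):
    x = rng.uniform(-1.0, 1.0, size=m)
    y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.3, size=m)
    return x, y

x_train, y_train = make_data(12)    # small training set
x_test, y_test = make_data(200)     # examples the model never saw

def mse(coeffs, x, y):
    """Mean squared error of a polynomial given np.polyfit coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 2, 8):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares fit
    results[degree] = (mse(coeffs, x_train, y_train),
                       mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE = {results[degree][0]:.4f}, "
          f"test MSE = {results[degree][1]:.4f}")
```

The training error can only go down as the degree grows, because each higher-degree polynomial contains the lower-degree ones. A degree-1 fit with a large error on both sets suggests underfitting (high bias). A degree-8 fit with a tiny training error but a noticeably larger test error suggests overfitting (high variance). The exact numbers depend on the random seed.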

## Regularization to Reduce Overfitting
- If our model has high variance, it is overfit.
- One way to address overfitting is to collect more data, that is, more training examples. With a larger training set, the learning algorithm will learn to fit a function that is less wiggly, so we can continue to fit a high-order polynomial or some other function with a lot of features.
- In summary, the number one tool we can use against overfitting is to get more training data.
    ![image.png](attachment:image.png)
- A second option for addressing overfitting is to see if we can use fewer features. By using only the most relevant features, we may find that our model no longer overfits as badly.
- Choosing the most appropriate set of features to use is sometimes also called feature selection.
- One disadvantage of feature selection is that by using only a subset of the features, the algorithm is throwing away some of the information.
    ![image-2.png](attachment:image-2.png)
- The third option is called regularization. 
- Setting a parameter to 0 is equivalent to eliminating a feature, which is what feature selection does. It turns out that regularization is a way to more gently reduce the impact of some of the features without doing something as harsh as eliminating them outright.
- What regularization does is encourage the learning algorithm to shrink the values of the parameters without necessarily demanding that a parameter is set to exactly 0.
- So what regularization does is let us keep all our features, while preventing any of them from having an overly large effect, which is what can sometimes cause overfitting.
![image-3.png](attachment:image-3.png)
- These are the 3 ways of addressing overfitting
    1. Collect more data
    2. Try selecting and using only a subset of the features : Feature Selection
    3. Reduce the size of the parameters using regularization

## Optional Lab : Overfitting
- This lab explores situations where overfitting can occur.

### Regression Examples
- Underfitting
![image.png](attachment:image.png)
- Precise
![image-2.png](attachment:image-2.png)
- Overfitting
![image-3.png](attachment:image-3.png)

### Categorical Examples
- Underfitting
![image-5.png](attachment:image-5.png)
- Precise
![image-6.png](attachment:image-6.png)
- Overfitting
![image-4.png](attachment:image-4.png)

## Cost Function with Regularization
- ![image.png](attachment:image.png)
- The idea is that if the parameters have smaller values, that's a bit like having a simpler model, maybe one with fewer features, which is therefore less prone to overfitting.
- If you have a lot of features, say 100, we may not know which are the most important and which ones to penalize. So regularization is typically implemented by penalizing all of the features, or more precisely all of the wj parameters. It's possible to show that this will usually result in fitting a smoother, simpler, less wiggly function that's less prone to overfitting.
- ![image-2.png](attachment:image-2.png)
- This cost function with a regularization term trades off 2 goals
    1. Minimizing the 1st term encourages the algorithm to fit the training data well by minimizing the squared differences between the predictions and the actual values.
    2. Minimizing the 2nd term keeps the parameters wj small, which will tend to reduce overfitting.
- The value of lambda that we choose specifies the relative importance of, or the trade-off between, these 2 goals.
- If lambda is set to 0, we're not using the regularization term at all, because it becomes 0. We then end up fitting an overly wiggly, overly complex curve that overfits. That is one extreme, when lambda is 0.
- If lambda is very large, the learning algorithm will choose the W's (W1, W2, W3, ...) to be extremely close to 0, so f(x) is approximately equal to b. The learning algorithm fits a horizontal line and underfits.
- If lambda is 0 this model will overfit, and if lambda is enormous this model will underfit. So we should choose a value of lambda in between that appropriately balances the first and second terms: minimizing the mean squared error and keeping the parameters small.
![image-3.png](attachment:image-3.png)
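
To see the effect of lambda numerically, here is a small sketch under stated assumptions: the data is made up, and instead of gradient descent it uses the closed-form minimizer of the regularized squared-error cost (a swapped-in technique, used only because it gives the exact minimum for each lambda). The helper name `fit_regularized` is hypothetical.

```python
import numpy as np

# Hypothetical data: 20 examples, 5 features, linear target plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + 4.0 + rng.normal(scale=0.5, size=20)

def fit_regularized(X, y, lambda_):
    """Minimize (1/2m)*sum((Xw + b - y)^2) + (lambda_/2m)*sum(w^2).

    b is not penalized, so it is recovered from the feature and target means.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean            # centering removes b from the problem
    n = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + lambda_ * np.eye(n), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

norms = []
for lambda_ in (0.0, 1.0, 100.0, 1e6):
    w, b = fit_regularized(X, y, lambda_)
    norms.append(np.linalg.norm(w))
    print(f"lambda = {lambda_:>9}: ||w|| = {norms[-1]:.4f}, b = {b:.4f}")
```

As lambda grows, the size of `w` shrinks toward 0. For the largest lambda the prediction is essentially the constant `b`, which ends up close to the mean of `y`: the horizontal-line, underfit extreme described above.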

## Regularized Linear Regression
- ![image.png](attachment:image.png)
- ![image-2.png](attachment:image-2.png)
- ![image-3.png](attachment:image-3.png)
- ![image-4.png](attachment:image-4.png)

## Regularized Logistic Regression
- ![image.png](attachment:image.png)
- ![image-2.png](attachment:image-2.png)

## Optional Lab : Regularized Cost and Gradient
- Extend the linear and logistic cost functions with a regularization term

In [1]:
import numpy as np
import matplotlib.pyplot as plt

### Adding Regularization
- ![image-2.png](attachment:image-2.png)
- ![image-3.png](attachment:image-3.png)
- The cost functions differ significantly between linear and logistic regression, but adding regularization to the equations is the same.
- The gradient functions for linear and logistic regression are very similar. They differ only in the implementation of f_wb.

## Cost function with regularization
### Cost function for regularized linear regression
- The equation for the cost function of regularized linear regression is:
    ![image.png](attachment:image.png)
- Compare this to the cost function without regularization which is of the form
    ![image-2.png](attachment:image-2.png)
- The difference is the regularization term.

In [3]:
def compute_cost_linear_reg(X, y, w, b, lambda_ = 1):
    '''
    Computes the cost over all examples
    Args:
        X (ndarray (m, n)) : Data, m examples with n features
        y (ndarray (m,)) : target values
        w (ndarray (n,)) : model parameters
        b (scalar) : model parameter
        lambda_ (scalar) : controls amount of regularization
    Return : 
        total_cost (scalar) : cost
    '''
    
    m = X.shape[0]
    n = len(w)
    cost = 0.
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b
        cost = cost + (f_wb_i - y[i])**2
    cost = cost / (2*m)
    
    reg_cost = 0
    for j in range(n):
        reg_cost += (w[j]**2)
    reg_cost = (lambda_/(2*m)) * reg_cost
    
    total_cost = cost + reg_cost
    return total_cost

In [25]:
np.random.seed(1)
X_tmp = np.random.rand(5, 6)
y_tmp = np.array([0, 1, 0, 1, 0])
w_tmp = np.random.rand(X_tmp.shape[1]).reshape(-1,)-0.5
b_tmp = 0.5
lambda_tmp = 0.7
cost_tmp = compute_cost_linear_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)
cost_tmp

0.07917239320214277
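
The two loops above can also be written with NumPy array operations. This is a sketch of an equivalent vectorized form, not part of the lab itself. The function name `compute_cost_linear_reg_vec` is made up here, and the test data simply repeats the cell above.

```python
import numpy as np

def compute_cost_linear_reg_vec(X, y, w, b, lambda_=1.0):
    """Vectorized regularized squared-error cost (hypothetical helper name)."""
    m = X.shape[0]
    err = X @ w + b - y                               # residuals, shape (m,)
    cost = np.sum(err ** 2) / (2 * m)                 # squared-error term
    reg_cost = (lambda_ / (2 * m)) * np.sum(w ** 2)   # penalty on w only, not b
    return cost + reg_cost

# Same inputs as the cell above.
np.random.seed(1)
X_tmp = np.random.rand(5, 6)
y_tmp = np.array([0, 1, 0, 1, 0])
w_tmp = np.random.rand(X_tmp.shape[1]) - 0.5
b_tmp = 0.5
cost_vec = compute_cost_linear_reg_vec(X_tmp, y_tmp, w_tmp, b_tmp, lambda_=0.7)
print(cost_vec)
```

The result should agree with the loop version up to floating-point rounding. The vectorized form mainly helps when `m` and `n` are large.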

### Cost function for regularized logistic regression
- For regularized logistic regression, the cost is of the form
    ![image.png](attachment:image.png)
- Compare this to the cost function without regularization 
    ![image-2.png](attachment:image-2.png)
- The difference is the regularization term.

In [26]:
def sigmoid(z):
    '''compute the sigmoid of z
    
    Args : 
        z (ndarray) : A scalar or numpy array of any size.
        
    Returns : 
        g (ndarray) : sigmoid(z), with the same shape as z
    '''
    
    g = 1/(1+np.exp(-z))
    return g

In [27]:
def compute_cost_logistic_reg(X, y, w, b, lambda_):
    '''
    Compute the cost over all examples
    Args : 
        X (ndarray (m, n)) : Data, m examples with n features
        y (ndarray (m,)) : Target values
        w (ndarray (n,)) : Model parameters
        b (scalar) : Model parameter
        lambda_ (scalar) : Controls amount of regularization
    Returns : 
        total_cost (scalar) : cost
    '''
    
    m, n = X.shape
    cost = 0.
    for i in range(m):
        z_i = np.dot(X[i], w) + b
        f_wb_i = sigmoid(z_i)
        cost += -y[i]*np.log(f_wb_i) - (1-y[i])*np.log(1 - f_wb_i)
    cost = cost /m 
    
    reg_cost = 0
    for j in range(n):
        reg_cost += (w[j]**2)
    reg_cost = (lambda_/(2*m)) * reg_cost
    
    total_cost = cost + reg_cost
    return total_cost

In [30]:
np.random.seed(1)
X_tmp = np.random.rand(5, 6)
y_tmp = np.array([0, 1, 0, 1, 0])
w_tmp = np.random.rand(X_tmp.shape[1]).reshape(-1, ) - 0.5
b_tmp = 0.5
lambda_tmp = 0.7
cost_tmp = compute_cost_logistic_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)
cost_tmp

0.6850849138741673
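
One practical caveat with the loop version: for very large `|z|` the sigmoid rounds to exactly 0 or 1, and `np.log` then returns `-inf`. The sketch below is one possible workaround, not the lab's method. It uses the identity `-y*log(g) - (1-y)*log(1-g) = log(1 + e^z) - y*z` for `g = sigmoid(z)` together with `np.logaddexp`. The function name and the extreme inputs are assumptions made for illustration.

```python
import numpy as np

def compute_cost_logistic_reg_stable(X, y, w, b, lambda_=1.0):
    """Regularized logistic cost computed without calling log on sigmoid(z)."""
    m = X.shape[0]
    z = X @ w + b
    # log(1 + e^z) evaluated stably as logaddexp(0, z)
    loss = np.logaddexp(0.0, z) - y * z
    cost = np.sum(loss) / m
    reg_cost = (lambda_ / (2 * m)) * np.sum(w ** 2)   # penalty on w only, not b
    return cost + reg_cost

# Same inputs as the cell above.
np.random.seed(1)
X_tmp = np.random.rand(5, 6)
y_tmp = np.array([0, 1, 0, 1, 0])
w_tmp = np.random.rand(X_tmp.shape[1]) - 0.5
cost_stable = compute_cost_logistic_reg_stable(X_tmp, y_tmp, w_tmp, 0.5, lambda_=0.7)
print(cost_stable)

# Hypothetical extreme case: confidently wrong predictions with |z| = 1000.
X_big = np.array([[1000.0], [-1000.0]])
y_big = np.array([0, 1])
extreme_cost = compute_cost_logistic_reg_stable(X_big, y_big, np.array([1.0]), 0.0, lambda_=0.0)
print(extreme_cost)   # large but finite
```

On ordinary inputs this should match `compute_cost_logistic_reg` up to rounding. The difference only shows up in the extreme case, where the loop version would evaluate `log(0)`.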

## Gradient Descent with Regularization
- The basic algorithm for running gradient descent does not change with regularization
    ![image.png](attachment:image.png)
- What changes with regularization is how the gradients are computed
### Computing the gradient with regularization (both linear/logistic)
- The gradient calculations for linear and logistic regression are nearly identical, differing only in the computation of f_wb
    ![image-2.png](attachment:image-2.png)
    ![image-3.png](attachment:image-3.png)

### Gradient function for regularized linear regression

In [35]:
def compute_gradient_linear_reg(X, y, w, b, lambda_):
    '''
    Computes the gradient for linear regression
    Args : 
        X (ndarray (m, n)) : Data, m examples with n features
        y (ndarray (m, )) : Target values
        w (ndarray (n, )) : Model parameters
        b (scalar) : Model parameter
        lambda_ (scalar) : Controls amount of regularization
    Returns : 
        dj_dw (ndarray (n,)) : The gradient of the cost w.r.t the parameters w
        dj_db (scalar) : the gradient of the cost w.r.t the parameter b.
    '''
    
    m, n = X.shape
    dj_dw = np.zeros((n,))
    dj_db = 0.
    
    for i in range(m):
        err = (np.dot(X[i], w) + b) - y[i]
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err*X[i,j]
        dj_db = dj_db + err
    dj_dw = dj_dw/m
    dj_db = dj_db/m
    
    for j in range(n):
        dj_dw[j] = dj_dw[j] + (lambda_/m) * w[j]
    return dj_db, dj_dw

In [37]:
np.random.seed(1)
X_tmp = np.random.rand(5, 3)
y_tmp = np.array([0, 1, 0, 1, 0])
w_tmp = np.random.rand(X_tmp.shape[1])
b_tmp = 0.5
lambda_tmp = 0.7
dj_db_tmp, dj_dw_tmp =  compute_gradient_linear_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)

print(f"dj_db: {dj_db_tmp}", )
print(f"Regularized dj_dw:\n {dj_dw_tmp.tolist()}", )

dj_db: 0.6648774569425726
Regularized dj_dw:
 [0.29653214748822276, 0.4911679625918033, 0.21645877535865857]
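
Since the update loop itself does not change with regularization, a gradient descent run only needs the regularized gradient plugged in. The following is a minimal sketch, assuming a vectorized gradient equivalent to `compute_gradient_linear_reg`. The names `gradient_linear_reg`, `cost_linear_reg` and `gradient_descent`, the learning rate `alpha=0.1` and the iteration count are choices made here for illustration.

```python
import numpy as np

def cost_linear_reg(X, y, w, b, lambda_):
    """Regularized squared-error cost (vectorized)."""
    m = X.shape[0]
    err = X @ w + b - y
    return np.sum(err ** 2) / (2 * m) + (lambda_ / (2 * m)) * np.sum(w ** 2)

def gradient_linear_reg(X, y, w, b, lambda_):
    """Regularized gradient; returns (dj_db, dj_dw) like the lab function."""
    m = X.shape[0]
    err = X @ w + b - y
    dj_dw = X.T @ err / m + (lambda_ / m) * w   # extra term shrinks w
    dj_db = np.mean(err)                        # b is not regularized
    return dj_db, dj_dw

def gradient_descent(X, y, w, b, alpha, num_iters, lambda_):
    """Plain simultaneous updates of w and b; same loop as without regularization."""
    for _ in range(num_iters):
        dj_db, dj_dw = gradient_linear_reg(X, y, w, b, lambda_)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

np.random.seed(1)
X_tmp = np.random.rand(5, 3)
y_tmp = np.array([0, 1, 0, 1, 0])
w_init = np.random.rand(X_tmp.shape[1])
b_init = 0.5
lambda_tmp = 0.7

cost_before = cost_linear_reg(X_tmp, y_tmp, w_init, b_init, lambda_tmp)
w_fit, b_fit = gradient_descent(X_tmp, y_tmp, w_init, b_init,
                                alpha=0.1, num_iters=1000, lambda_=lambda_tmp)
cost_after = cost_linear_reg(X_tmp, y_tmp, w_fit, b_fit, lambda_tmp)
print(f"cost before: {cost_before:.4f}, cost after: {cost_after:.4f}")
```

The regularized cost should drop from its starting value. A different `alpha` or more iterations may be needed on other data.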


### Gradient function for regularized logistic regression

In [40]:
def compute_gradient_logistic_reg(X, y, w, b, lambda_):
    '''
    Computes the gradient for regularized logistic regression 
 
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)): target values
      w (ndarray (n,)): model parameters  
      b (scalar)      : model parameter
      lambda_ (scalar): Controls amount of regularization
    Returns
      dj_dw (ndarray Shape (n,)): The gradient of the cost w.r.t. the parameters w. 
      dj_db (scalar)            : The gradient of the cost w.r.t. the parameter b.
    '''
    
    m, n = X.shape
    dj_dw = np.zeros((n,))
    dj_db = 0.0
    
    for i in range(m):
        f_wb_i = sigmoid(np.dot(X[i], w) + b)
        err_i = f_wb_i - y[i]
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err_i * X[i, j]
        dj_db = dj_db + err_i
    dj_dw = dj_dw/m
    dj_db = dj_db/m
    
    for j in range(n):
        dj_dw[j] = dj_dw[j] + (lambda_/m) * w[j]
        
    return dj_db, dj_dw

In [41]:
np.random.seed(1)
X_tmp = np.random.rand(5,3)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1])
b_tmp = 0.5
lambda_tmp = 0.7
dj_db_tmp, dj_dw_tmp =  compute_gradient_logistic_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)

print(f"dj_db: {dj_db_tmp}", )
print(f"Regularized dj_dw:\n {dj_dw_tmp.tolist()}", )

dj_db: 0.341798994972791
Regularized dj_dw:
 [0.17380012933994293, 0.32007507881566943, 0.10776313396851499]
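
A common way to gain confidence in a hand-written gradient is to compare it against finite differences of the cost. The sketch below does this for the regularized logistic case. It is an illustrative check rather than part of the lab. The vectorized helpers, the function name `numerical_gradient` and the step size `eps=1e-6` are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_logistic_reg(X, y, w, b, lambda_):
    """Regularized logistic cost (vectorized)."""
    m = X.shape[0]
    f_wb = sigmoid(X @ w + b)
    loss = -y * np.log(f_wb) - (1 - y) * np.log(1 - f_wb)
    return np.sum(loss) / m + (lambda_ / (2 * m)) * np.sum(w ** 2)

def gradient_logistic_reg(X, y, w, b, lambda_):
    """Analytic regularized gradient; returns (dj_db, dj_dw)."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y
    return np.mean(err), X.T @ err / m + (lambda_ / m) * w

def numerical_gradient(X, y, w, b, lambda_, eps=1e-6):
    """Central-difference approximation of the same gradient."""
    dj_dw = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        dj_dw[j] = (cost_logistic_reg(X, y, w + step, b, lambda_)
                    - cost_logistic_reg(X, y, w - step, b, lambda_)) / (2 * eps)
    dj_db = (cost_logistic_reg(X, y, w, b + eps, lambda_)
             - cost_logistic_reg(X, y, w, b - eps, lambda_)) / (2 * eps)
    return dj_db, dj_dw

np.random.seed(1)
X_tmp = np.random.rand(5, 3)
y_tmp = np.array([0, 1, 0, 1, 0])
w_tmp = np.random.rand(X_tmp.shape[1])
b_tmp, lambda_tmp = 0.5, 0.7

dj_db_a, dj_dw_a = gradient_logistic_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)
dj_db_n, dj_dw_n = numerical_gradient(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)
max_diff = max(abs(dj_db_a - dj_db_n), np.max(np.abs(dj_dw_a - dj_dw_n)))
print(f"max difference between analytic and numerical gradient: {max_diff:.2e}")
```

If the analytic gradient is right, the difference should be tiny compared with the gradient values themselves. A large gap usually points to a missing `(lambda_/m) * w` term or to regularizing `b` by mistake.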
