# Overfitting with linear and logistic regression

`Underfitting` = `high bias` - the algorithm cannot fit the data well and misses a trend. This is due to an underlying assumption about the data (e.g. that it is linear). Choosing a higher-order polynomial that fits the data can remove the underfit.  
`Generalization` - building a model that fits not only the training set but also new data.  
`Overfitting` = `high variance` - the model fits the training data well, but predicts values that go against the trend seen in the data.

This applies to linear regression and classification.
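The three regimes can be seen on a toy problem. The sketch below is only an illustration: the data are synthetic (a quadratic trend plus noise, an assumption not taken from these notes), and `np.polyfit` stands in for the fitting procedure. A degree-1 fit underfits, degree 2 matches the trend, and degree 8 typically drives the training error down further while the error on new data tends to get worse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): quadratic trend plus noise.
def make_data(m):
    x = rng.uniform(-1.0, 1.0, size=m)
    y = x**2 + rng.normal(scale=0.1, size=m)
    return x, y

x_train, y_train = make_data(15)    # small training set
x_test, y_test = make_data(200)     # "new data" from the same distribution

def mse(coeffs, x, y):
    """Mean squared error of a polynomial (highest power first) on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_mse, test_mse = {}, {}
for degree in (1, 2, 8):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_mse[degree] = mse(coeffs, x_train, y_train)
    test_mse[degree] = mse(coeffs, x_test, y_test)
    print(f"degree {degree}: train MSE = {train_mse[degree]:.4f}, "
          f"test MSE = {test_mse[degree]:.4f}")
```

A low training error combined with a clearly higher test error is the signature of high variance; a high error on both is the signature of high bias.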

# Regularization to reduce overfitting

Solutions to **overfitting**:
- Collect more data
- Use **fewer** features (`feature selection`)
- Regularization (gently reduce the impact of features by shrinking their parameters, e.g. those of higher-order terms)

## Cost function with regularization

Consider the following minimization problem:

$$
\min_{\vec{w},b}\frac{1}{2m}\sum_{i=1}^m\Big( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \Big)^2 \textcolor{gray}{ + 1000 w_3^2} \textcolor{gray}{ + 1000 w_4^2}
$$

where the last two terms **penalize** the model if $w_3$ and $w_4$ are too large.  
Usually one does not know which features to penalize, so one penalizes them all:

$$
J(\vec{w},b) = \frac{1}{2m}\sum_{i=1}^m\Big( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \Big)^2 + \frac{\lambda}{2m}\sum_{j=1}^n w_{j}^2 \textcolor{gray}{+\frac{\lambda}{2m}b^2}
$$

$\lambda > 0$ is the regularization parameter, and the last term (the penalty on $b$) is generally ignored because its effect is small.
- If $\lambda$ is small: **overfitting** (little or no regularization)
- If $\lambda$ is large: **underfitting** (all $w_j \approx 0$, so only $b$ is left and the fit is a horizontal line)
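The effect of $\lambda$ can be checked numerically. The sketch below is an illustration under stated assumptions: the data are synthetic, and instead of gradient descent it minimizes $J(\vec{w},b)$ in closed form (a swapped-in technique, not the method of these notes) so that the result does not depend on a learning rate. The helper `fit_regularized` is a hypothetical name. Centering the data keeps $b$ unregularized.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumption): 20 noisy samples of a quadratic trend.
x = rng.uniform(-1.0, 1.0, size=20)
y = 1.0 + 2.0 * x**2 + rng.normal(scale=0.1, size=20)

# Degree-6 polynomial features x, x^2, ..., x^6 (deliberately too flexible).
X = np.column_stack([x**p for p in range(1, 7)])

def fit_regularized(X, y, lambda_):
    """Minimize the regularized squared-error cost in closed form (b unpenalized)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean           # centering removes b from the problem
    n = X.shape[1]
    # Setting dJ/dw = 0 gives (Xc^T Xc + lambda * I) w = Xc^T yc
    w = np.linalg.solve(Xc.T @ Xc + lambda_ * np.eye(n), Xc.T @ yc)
    b = y_mean - x_mean @ w                   # optimal intercept for this w
    return w, b

norms = []
for lambda_ in (0.0, 0.1, 10.0, 1e6):
    w, b = fit_regularized(X, y, lambda_)
    norms.append(np.linalg.norm(w))
    print(f"lambda = {lambda_:>9}: ||w|| = {norms[-1]:.4f}, b = {b:.4f}")
```

As $\lambda$ grows, the norm of $\vec{w}$ shrinks toward zero and $b$ approaches the mean of $y$, which is the "only $b$ is left" case.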

# Regularized linear regression

Gradient descent itself is unchanged. The only difference is the derivative of the cost function with respect to $w_j$, which gains a regularization term (we do not regularize $b$):

$$
w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(\vec{w},b) = w_j - \alpha \Big[ \frac{1}{m}\sum_{i=1}^{m}\Big( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \Big) x_j^{(i)} + \frac{\lambda}{m}w_j \Big] \\
b = b - \alpha \frac{\partial}{\partial b} J(\vec{w},b) = b - \alpha \frac{1}{m}\sum_{i=1}^{m}\Big( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \Big)
$$

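The update for $w_j$ can be rearranged to get $w_j = w_j\Big( 1 - \alpha\frac{\lambda}{m} \Big) - \alpha\frac{1}{m}\sum_{i=1}^{m}\Big( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \Big) x_j^{(i)}$. The factor $1 - \alpha\frac{\lambda}{m}$ is slightly less than 1, so it shrinks $w_j$ a little at every iteration before the usual unregularized step is applied.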
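The rearrangement can be verified numerically. In the sketch below all values are hypothetical and chosen only to illustrate the algebra: with $\alpha = 0.01$, $\lambda = 1$ and $m = 50$ the shrink factor is $1 - 0.0002 = 0.9998$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical values, chosen only to illustrate the algebra.
m, n = 50, 3
X = rng.normal(size=(m, n))
y = rng.normal(size=m)
w = rng.normal(size=n)
b, alpha, lambda_ = 0.5, 0.01, 1.0

err = X @ w + b - y                 # f(x^(i)) - y^(i) for every example, shape (m,)
data_grad = X.T @ err / m           # unregularized part of dJ/dw, shape (n,)

# Form 1: subtract alpha times the full regularized gradient.
w_direct = w - alpha * (data_grad + (lambda_ / m) * w)

# Form 2: shrink w first, then apply the usual unregularized step.
shrink = 1 - alpha * lambda_ / m
w_rearranged = shrink * w - alpha * data_grad

print("shrink factor:", shrink)
print("forms agree:", np.allclose(w_direct, w_rearranged))
```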

# Regularized logistic regression

Recall the cost function for logistic regression and modify it to include regularization:
$$
J(\vec{w},b) = -\frac{1}{m}\sum_{i=1}^m \Big[y^{(i)}\log(f_{\vec{w},b}(\vec{x}^{(i)})) + (1 - y^{(i)})\log(1-f_{\vec{w},b}(\vec{x}^{(i)}))\Big] \textcolor{gray}{+\frac{\lambda}{2m}\sum_{j=1}^nw_j^2}
$$

Here, large $w_j$ are penalized.

The gradient descent algorithm looks the same as for linear regression; only $f_{\vec{w},b}$ is now the sigmoid of the linear model.

In [2]:
import numpy as np
%matplotlib widget
import matplotlib.pyplot as plt
from plt_overfit import overfit_example, output
from lab_utils_common import sigmoid
np.set_printoptions(precision=8)

In [3]:
# LINEAR REGRESSION
def compute_cost_linear_reg(X, y, w, b, lambda_ = 1):
    """
    Computes the cost over all examples
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)): target values
      w (ndarray (n,)): model parameters  
      b (scalar)      : model parameter
      lambda_ (scalar): Controls amount of regularization
    Returns:
      total_cost (scalar):  cost 
    """

    m  = X.shape[0]
    n  = len(w)
    cost = 0.
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b  # Linear regression #(n,)(n,)=scalar, see np.dot
        cost = cost + (f_wb_i - y[i])**2 # Cost function  #scalar             
    cost = cost / (2 * m)  #scalar  
 
    reg_cost = 0. 
    for j in range(n):
        reg_cost += (w[j]**2)  # add regularization term  #scalar
    reg_cost = (lambda_/(2*m)) * reg_cost #scalar
    
    total_cost = cost + reg_cost # combine into regularized cost function #scalar
    return total_cost   

In [4]:
np.random.seed(1)
X_tmp = np.random.rand(5,6)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1]).reshape(-1,)-0.5
b_tmp = 0.5
lambda_tmp = 0.7
cost_tmp = compute_cost_linear_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)

print("Regularized cost:", cost_tmp)

Regularized cost: 0.07917239320214275
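The two loops above can also be written with array operations. This is a minimal vectorized sketch, not part of the original lab; the name `compute_cost_linear_reg_vec` is hypothetical. On the same test inputs it should agree with the looped function up to floating-point rounding.

```python
import numpy as np

def compute_cost_linear_reg_vec(X, y, w, b, lambda_=1.0):
    """Vectorized regularized squared-error cost; b is not regularized."""
    m = X.shape[0]
    err = X @ w + b - y                            # (m,) prediction errors
    cost = np.sum(err**2) / (2 * m)                # data term
    reg_cost = (lambda_ / (2 * m)) * np.sum(w**2)  # penalty on w only
    return cost + reg_cost

# Same test inputs as in the cell above.
np.random.seed(1)
X_tmp = np.random.rand(5, 6)
y_tmp = np.array([0, 1, 0, 1, 0])
w_tmp = np.random.rand(X_tmp.shape[1]) - 0.5
b_tmp = 0.5
lambda_tmp = 0.7
cost_vec = compute_cost_linear_reg_vec(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)
print("Regularized cost (vectorized):", cost_vec)
```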


In [6]:
# LOGISTIC REGRESSION
def compute_cost_logistic_reg(X, y, w, b, lambda_ = 1):
    """
    Computes the cost over all examples
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)): target values
      w (ndarray (n,)): model parameters  
      b (scalar)      : model parameter
      lambda_ (scalar): Controls amount of regularization
    Returns:
      total_cost (scalar):  cost 
    """

    m,n  = X.shape
    cost = 0.
    for i in range(m):
        z_i = np.dot(X[i], w) + b  # Regression #(n,)(n,)=scalar, see np.dot
        f_wb_i = sigmoid(z_i)  # scalar
        cost +=  -y[i]*np.log(f_wb_i) - (1-y[i])*np.log(1-f_wb_i)      #scalar
             
    cost = cost/m                                                      #scalar

    reg_cost = 0
    for j in range(n):
        reg_cost += (w[j]**2)                                          #scalar
    reg_cost = (lambda_/(2*m)) * reg_cost                              #scalar
    
    total_cost = cost + reg_cost                                       #scalar
    return total_cost                                                  #scalar

In [7]:
np.random.seed(1)
X_tmp = np.random.rand(5,6)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1]).reshape(-1,)-0.5
b_tmp = 0.5
lambda_tmp = 0.7
cost_tmp = compute_cost_logistic_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)

print("Regularized cost:", cost_tmp)

Regularized cost: 0.6850849138741673
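One practical caveat with the looped version: for large $|z|$ the sigmoid rounds to exactly 0 or 1 in floating point, and the corresponding `np.log` term becomes infinite. A possible workaround, sketched here as an assumption rather than as part of the original lab, uses the identity $-y\log f - (1-y)\log(1-f) = \log(1+e^{z}) - yz$ for $f = \mathrm{sigmoid}(z)$ and evaluates it with `np.logaddexp`. The function name is hypothetical.

```python
import numpy as np

def compute_cost_logistic_reg_stable(X, y, w, b, lambda_=1.0):
    """Regularized logistic cost computed from z directly, without log(sigmoid)."""
    m = X.shape[0]
    z = X @ w + b                                  # (m,) linear scores
    # -y*log(f) - (1-y)*log(1-f) with f = sigmoid(z) equals log(1 + e^z) - y*z
    losses = np.logaddexp(0.0, z) - y * z          # (m,) per-example loss
    reg_cost = (lambda_ / (2 * m)) * np.sum(w**2)  # penalty on w only
    return np.mean(losses) + reg_cost

# A score of 50 makes sigmoid(z) round to exactly 1.0 in float64,
# so a naive log(1 - f) would evaluate log(0) for the first example.
X_big = np.array([[50.0], [-50.0]])
y_big = np.array([0, 1])
w_big = np.array([1.0])
cost_big = compute_cost_logistic_reg_stable(X_big, y_big, w_big, 0.0, lambda_=0.0)
print("cost on extreme scores:", cost_big)
```

For moderate scores, such as the test inputs above, this form gives the same value as the direct formula up to rounding.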


In [8]:
# GRADIENT DESCENT FOR LINEAR REGRESSION WITH REGULARIZATION
def compute_gradient_linear_reg(X, y, w, b, lambda_): 
    """
    Computes the gradient for linear regression 
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)): target values
      w (ndarray (n,)): model parameters  
      b (scalar)      : model parameter
      lambda_ (scalar): Controls amount of regularization
      
    Returns:
      dj_db (scalar):       The gradient of the cost w.r.t. the parameter b. 
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w. 
    """
    m,n = X.shape           #(number of examples, number of features)
    dj_dw = np.zeros((n,))
    dj_db = 0.

    for i in range(m):                             
        err = (np.dot(X[i], w) + b) - y[i]                 
        for j in range(n):                         
            dj_dw[j] = dj_dw[j] + err * X[i, j]               
        dj_db = dj_db + err                        
    dj_dw = dj_dw / m                                
    dj_db = dj_db / m   
    
    for j in range(n):
        dj_dw[j] = dj_dw[j] + (lambda_/m) * w[j]

    return (dj_db, dj_dw)

In [9]:
np.random.seed(1)
X_tmp = np.random.rand(5,3)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1])
b_tmp = 0.5
lambda_tmp = 0.7
dj_db_tmp, dj_dw_tmp =  compute_gradient_linear_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)

print(f"dj_db: {dj_db_tmp}", )
print(f"Regularized dj_dw:\n {dj_dw_tmp.tolist()}", )

dj_db: 0.6648774569425727
Regularized dj_dw:
 [0.29653214748822276, 0.4911679625918033, 0.21645877535865857]
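A common sanity check for a hand-written gradient is to compare it with central finite differences of the cost. The sketch below is self-contained and uses vectorized re-implementations (`cost`, `gradient` and `numerical_gradient` are hypothetical names, not functions from the lab); it assumes the same cost definition as above, with $b$ unregularized.

```python
import numpy as np

def cost(X, y, w, b, lambda_):
    """Regularized squared-error cost (vectorized)."""
    m = X.shape[0]
    err = X @ w + b - y
    return np.sum(err**2) / (2 * m) + (lambda_ / (2 * m)) * np.sum(w**2)

def gradient(X, y, w, b, lambda_):
    """Analytic gradient of the cost above; returns (dj_db, dj_dw)."""
    m = X.shape[0]
    err = X @ w + b - y
    dj_dw = X.T @ err / m + (lambda_ / m) * w   # regularization acts on w only
    dj_db = np.mean(err)
    return dj_db, dj_dw

def numerical_gradient(X, y, w, b, lambda_, eps=1e-6):
    """Central finite differences of the cost; returns (dj_db, dj_dw)."""
    dj_dw = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps                            # perturb one parameter at a time
        dj_dw[j] = (cost(X, y, w + step, b, lambda_)
                    - cost(X, y, w - step, b, lambda_)) / (2 * eps)
    dj_db = (cost(X, y, w, b + eps, lambda_)
             - cost(X, y, w, b - eps, lambda_)) / (2 * eps)
    return dj_db, dj_dw

# Same test inputs as in the cell above.
np.random.seed(1)
X_tmp = np.random.rand(5, 3)
y_tmp = np.array([0, 1, 0, 1, 0])
w_tmp = np.random.rand(X_tmp.shape[1])
db_a, dw_a = gradient(X_tmp, y_tmp, w_tmp, 0.5, 0.7)
db_n, dw_n = numerical_gradient(X_tmp, y_tmp, w_tmp, 0.5, 0.7)
print("max |analytic - numerical| for w:", np.max(np.abs(dw_a - dw_n)))
print("|analytic - numerical| for b:   ", abs(db_a - db_n))
```

Both differences should be tiny; a large discrepancy usually points to a missing or misplaced regularization term.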


In [10]:
# GRADIENT DESCENT FOR LOGISTIC REGRESSION WITH REGULARIZATION
def compute_gradient_logistic_reg(X, y, w, b, lambda_): 
    """
    Computes the gradient for logistic regression 
 
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)): target values
      w (ndarray (n,)): model parameters  
      b (scalar)      : model parameter
      lambda_ (scalar): Controls amount of regularization
    Returns:
      dj_db (scalar)            : The gradient of the cost w.r.t. the parameter b. 
      dj_dw (ndarray Shape (n,)): The gradient of the cost w.r.t. the parameters w. 
    """
    m,n = X.shape
    dj_dw = np.zeros((n,))                            #(n,)
    dj_db = 0.0                                       #scalar

    for i in range(m):
        f_wb_i = sigmoid(np.dot(X[i],w) + b)          #(n,)(n,)=scalar
        err_i  = f_wb_i  - y[i]                       #scalar
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err_i * X[i,j]      #scalar
        dj_db = dj_db + err_i
    dj_dw = dj_dw/m                                   #(n,)
    dj_db = dj_db/m                                   #scalar

    for j in range(n):
        dj_dw[j] = dj_dw[j] + (lambda_/m) * w[j]

    return (dj_db, dj_dw) 


In [11]:
np.random.seed(1)
X_tmp = np.random.rand(5,3)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1])
b_tmp = 0.5
lambda_tmp = 0.7
dj_db_tmp, dj_dw_tmp =  compute_gradient_logistic_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)

print(f"dj_db: {dj_db_tmp}", )
print(f"Regularized dj_dw:\n {dj_dw_tmp.tolist()}", )

dj_db: 0.341798994972791
Regularized dj_dw:
 [0.17380012933994293, 0.32007507881566943, 0.10776313396851499]
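The gradient functions above only compute one step's derivatives. A minimal sketch of the surrounding descent loop for regularized logistic regression might look as follows. It is self-contained, so it re-implements the cost and gradient in vectorized form; the toy data, the function names and the hyperparameters `alpha`, `lambda_` and `num_iters` are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_logistic_reg(X, y, w, b, lambda_):
    """Regularized logistic cost; b is not regularized."""
    m = X.shape[0]
    z = X @ w + b
    return np.mean(np.logaddexp(0.0, z) - y * z) + (lambda_ / (2 * m)) * np.sum(w**2)

def gradient_logistic_reg(X, y, w, b, lambda_):
    """Returns (dj_db, dj_dw) for the regularized logistic cost."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y                 # (m,) prediction errors
    return np.mean(err), X.T @ err / m + (lambda_ / m) * w

def gradient_descent(X, y, w, b, alpha, lambda_, num_iters):
    """Simultaneous update of w and b; records the cost after each step."""
    history = []
    for _ in range(num_iters):
        dj_db, dj_dw = gradient_logistic_reg(X, y, w, b, lambda_)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
        history.append(cost_logistic_reg(X, y, w, b, lambda_))
    return w, b, history

# Toy data (assumption): two features in [0, 1], label 1 when x1 + x2 > 1.
rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, size=(100, 2))
y = (X.sum(axis=1) > 1.0).astype(float)

w, b, history = gradient_descent(X, y, np.zeros(2), 0.0,
                                 alpha=0.5, lambda_=1.0, num_iters=500)
print(f"cost after first step: {history[0]:.4f}, after last step: {history[-1]:.4f}")
print("w:", w, "b:", b)
```

With a learning rate this small relative to the feature scale, the recorded cost should decrease at every step; if it increases, `alpha` is too large.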
