In [1]:
import numpy as np
import copy
import math
X_train = np.array([[0.5, 1.5], [1,1], [1.5, 0.5], [3, 0.5], [2, 2], [1, 2.5]])  #(m,n)
y_train = np.array([0, 0, 0, 1, 1, 1])                                           #(m,)

In [2]:
X_train,y_train

(array([[0.5, 1.5],
        [1. , 1. ],
        [1.5, 0.5],
        [3. , 0.5],
        [2. , 2. ],
        [1. , 2.5]]),
 array([0, 0, 0, 1, 1, 1]))

## Logistic Regression model
A logistic regression model applies the sigmoid function to the familiar linear regression model:

$$ f_{\mathbf{w},b}(\mathbf{x}) = g(z) = \frac{1}{1+e^{-z}} \tag{2} $$ 

  where

  $$z = \mathbf{w} \cdot \mathbf{x} + b$$
and $\mathbf{w} \cdot \mathbf{x}$ is the vector dot product:
  
  $$\mathbf{w} \cdot \mathbf{x} = w_0 x_0 + w_1 x_1 + ... + w_{n-1} x_{n-1} $$

* We interpret the output of the model ($f_{\mathbf{w},b}(\mathbf{x})$) as the probability that $y=1$ given $\mathbf{x}$, parameterized by $\mathbf{w}$ and $b$.
* To get a final prediction ($y=0$ or $y=1$) from the logistic regression model, we apply a threshold of 0.5:

  if $f_{\mathbf{w},b}(\mathbf{x}) \geq 0.5$, predict $y=1$
  
  if $f_{\mathbf{w},b}(\mathbf{x}) < 0.5$, predict $y=0$
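
The decision rule above can be sketched for a single example. This is only an illustration, not the notebook's implementation: the names `sigmoid_demo`, `w_demo`, `b_demo` and `x_demo` are hypothetical, and the parameter values are made up rather than learned.

```python
import numpy as np

def sigmoid_demo(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters, chosen only for illustration (not learned)
w_demo = np.array([1.0, 1.0])
b_demo = -3.0

x_demo = np.array([2.0, 2.0])               # one example with n = 2 features
z_demo = np.dot(w_demo, x_demo) + b_demo    # z = w . x + b
prob_demo = sigmoid_demo(z_demo)            # model output, read as P(y = 1 | x)
label_demo = int(prob_demo >= 0.5)          # threshold at 0.5

print(f"z = {z_demo}, probability = {prob_demo:.4f}, predicted label = {label_demo}")
```

Because $g(z) \geq 0.5$ exactly when $z \geq 0$, thresholding the probability at 0.5 is the same as checking the sign of $z$.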
  
  

## Cost function

Logistic Regression uses a loss function more suited to the task of categorization where the target is 0 or 1 rather than any number. 

>**Definition Note:**   In this course, these definitions are used:  
**Loss** is a measure of the difference of a single example to its target value while the  
**Cost** is a measure of the losses over the training set


The loss is defined as follows: 
* $loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)})$ is the loss for a single data point, which is:

\begin{equation}
  loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) = \begin{cases}
    - \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) & \text{if $y^{(i)}=1$}\\
    - \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) & \text{if $y^{(i)}=0$}
  \end{cases}
\end{equation}


*  $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is the model's prediction, while $y^{(i)}$ is the target value.

*  $f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = g(\mathbf{w} \cdot\mathbf{x}^{(i)}+b)$ where function $g$ is the sigmoid function.

The defining feature of this loss function is that it uses two separate curves: one for when the target is zero ($y=0$) and another for when the target is one ($y=1$). Combined, these curves provide the behavior useful for a loss function, namely, being zero when the prediction matches the target and increasing rapidly as the prediction moves away from the target.

The loss function above can be rewritten to be easier to implement.
    $$loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) = -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right)$$
  
This is a rather formidable-looking equation. It is less daunting when you consider $y^{(i)}$ can have only two values, 0 and 1. One can then consider the equation in two pieces:  
when $ y^{(i)} = 0$, the left-hand term is eliminated:
$$
\begin{align}
loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), 0) &= -(0) \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - 0\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) \\
&= -\log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right)
\end{align}
$$
and when $ y^{(i)} = 1$, the right-hand term is eliminated:
$$
\begin{align}
  loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), 1) &=  -(1) \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - 1\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right)\\
  &=  -\log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right)
\end{align}
$$
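
The equivalence of the two forms can also be checked numerically. The sketch below compares the piecewise loss with the single-expression loss for a few made-up predictions; the function names `loss_piecewise` and `loss_combined` are hypothetical helpers, not part of the notebook.

```python
import numpy as np

def loss_piecewise(f, y):
    """Loss written as two cases: -log(f) if y == 1, else -log(1 - f)."""
    return -np.log(f) if y == 1 else -np.log(1.0 - f)

def loss_combined(f, y):
    """Loss written as one expression: -y*log(f) - (1 - y)*log(1 - f)."""
    return -y * np.log(f) - (1 - y) * np.log(1.0 - f)

# Made-up predictions strictly between 0 and 1 (the log is undefined at the endpoints)
for f in (0.1, 0.5, 0.9):
    for y in (0, 1):
        assert np.isclose(loss_piecewise(f, y), loss_combined(f, y))
        print(f"f = {f}, y = {y}: loss = {loss_combined(f, y):.4f}")
```

The printed values also show the intended behavior: a confident wrong prediction (for example $f = 0.1$ with $y = 1$) is penalized much more than a confident correct one.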

With this logistic loss function, a cost function can be produced that incorporates the loss from all the examples by averaging it over the training set.

## Apply gradient descent on the logistic regression cost function 

## Logistic Gradient Descent

Recall the gradient descent algorithm utilizes the gradient calculation:
$$\begin{align*}
&\text{repeat until convergence:} \; \lbrace \\
&  \; \; \;w_j = w_j -  \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \tag{1}  \; & \text{for j := 0..n-1} \\ 
&  \; \; \;  \; \;b = b -  \alpha \frac{\partial J(\mathbf{w},b)}{\partial b} \\
&\rbrace
\end{align*}$$

Where each iteration performs simultaneous updates on $w_j$ for all $j$, where
$$\begin{align*}
\frac{\partial J(\mathbf{w},b)}{\partial w_j}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)} \tag{2} \\
\frac{\partial J(\mathbf{w},b)}{\partial b}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) \tag{3} 
\end{align*}$$

* m is the number of training examples in the data set      
* $f_{\mathbf{w},b}(x^{(i)})$ is the model's prediction, while $y^{(i)}$ is the target
* For a logistic regression model  
    $z = \mathbf{w} \cdot \mathbf{x} + b$  
    $f_{\mathbf{w},b}(x) = g(z)$  
    where $g(z)$ is the sigmoid function:  
    $g(z) = \frac{1}{1+e^{-z}}$   
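
One way to gain confidence in equations (2) and (3) is to compare them with a finite-difference approximation of the cost. The sketch below does this on the six training points used in this notebook; the helper names (`cost_demo`, `grad_demo`), the test point `w_demo`, `b_demo` and the step size `eps` are assumptions made for the illustration.

```python
import numpy as np

def cost_demo(X, y, w, b):
    """Mean logistic loss over the training set."""
    f = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return np.mean(-y * np.log(f) - (1 - y) * np.log(1 - f))

def grad_demo(X, y, w, b):
    """Analytic gradient: (1/m) * sum((f - y) * x_j) and (1/m) * sum(f - y)."""
    f = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    err = f - y
    return X.T @ err / X.shape[0], err.mean()

X_demo = np.array([[0.5, 1.5], [1, 1], [1.5, 0.5], [3, 0.5], [2, 2], [1, 2.5]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
w_demo = np.array([0.3, -0.2])   # arbitrary test point
b_demo = 0.1
eps = 1e-6                       # finite-difference step

dj_dw, dj_db = grad_demo(X_demo, y_demo, w_demo, b_demo)

# Central differences: perturb one parameter at a time
num_dw = np.zeros_like(w_demo)
for j in range(w_demo.size):
    step = np.zeros_like(w_demo)
    step[j] = eps
    num_dw[j] = (cost_demo(X_demo, y_demo, w_demo + step, b_demo)
                 - cost_demo(X_demo, y_demo, w_demo - step, b_demo)) / (2 * eps)
num_db = (cost_demo(X_demo, y_demo, w_demo, b_demo + eps)
          - cost_demo(X_demo, y_demo, w_demo, b_demo - eps)) / (2 * eps)

print("dj_dw matches:", np.allclose(dj_dw, num_dw, atol=1e-6))
print("dj_db matches:", np.isclose(dj_db, num_db, atol=1e-6))
```

A check like this is useful when writing a gradient function by hand, since a sign or indexing mistake shows up as a clear mismatch.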
    


## Code implementation

In [3]:
def sigmoid(z): #correct
    """
    Compute the sigmoid of z

    Args:
        z (ndarray): A scalar or a numpy array of any size.

    Returns:
        y_prid (ndarray): sigmoid(z), with the same shape as z
         
    """

    y_prid = 1/(1+np.exp(-z))
   
    return y_prid

In [4]:
def predict(X, w, b): # correct
    """
    Predict whether the label is 0 or 1 using learned logistic
    regression parameters w
    
    Args:
    X : (ndarray Shape (m, n))
    w : (array_like Shape (n,))      Parameters of the model
    b : (scalar, float)              Parameter of the model

    Returns:
    y_prid: (ndarray Shape (m,))
        The predictions (0. or 1.) for X using a threshold at 0.5
    """
    m,n = X.shape
    y_prid = np.zeros(m)
    for i in range(m):
        z = np.dot(w,X[i]) + b
        y_prid[i] = sigmoid(z)
    # Apply threshold
    y_prid[y_prid >= 0.5] = 1
    y_prid[y_prid < 0.5] = 0
    # Equivalent one-liner: y_prid = (y_prid >= 0.5).astype(float)
    return y_prid

In [5]:
def predict_vect(X, w, b): #correct
    """
    Predict whether the label is 0 or 1 using learned logistic
    regression parameters w
    
    Args:
    X : (ndarray Shape (m, n))
    w : (array_like Shape (n,))      Parameters of the model
    b : (scalar, float)              Parameter of the model

    Returns:
    p: (ndarray Shape (m,))
        Boolean predictions for X using a threshold at 0.5
    """
    z = np.matmul(X, w) + b   # (m,) linear scores, one per example
    p = sigmoid(z) >= 0.5     # same >= 0.5 threshold as predict()
    return p

In [6]:
# Generate an array of evenly spaced values between -10 and 10
z_tmp = np.arange(-10,11)

# Use the function implemented above to get the sigmoid values
y = sigmoid(z_tmp)

# Code for pretty printing the two arrays next to each other
np.set_printoptions(precision=3) 
print("Input (z), Output (sigmoid(z))")
print(np.c_[z_tmp, y])

Input (z), Output (sigmoid(z))
[[-1.000e+01  4.540e-05]
 [-9.000e+00  1.234e-04]
 [-8.000e+00  3.354e-04]
 [-7.000e+00  9.111e-04]
 [-6.000e+00  2.473e-03]
 [-5.000e+00  6.693e-03]
 [-4.000e+00  1.799e-02]
 [-3.000e+00  4.743e-02]
 [-2.000e+00  1.192e-01]
 [-1.000e+00  2.689e-01]
 [ 0.000e+00  5.000e-01]
 [ 1.000e+00  7.311e-01]
 [ 2.000e+00  8.808e-01]
 [ 3.000e+00  9.526e-01]
 [ 4.000e+00  9.820e-01]
 [ 5.000e+00  9.933e-01]
 [ 6.000e+00  9.975e-01]
 [ 7.000e+00  9.991e-01]
 [ 8.000e+00  9.997e-01]
 [ 9.000e+00  9.999e-01]
 [ 1.000e+01  1.000e+00]]


In [7]:
def my_compute_cost_logistic(X, y, w, b): #correct
    """
    Computes cost

    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters  
      b (scalar)       : model parameter
      
    Returns:
      cost (scalar): cost
    """
    m = X.shape[0]
    total_loss = 0
    for i in range(m):
        z = np.dot(w,X[i]) + b
        y_prid = 1 / (1 + np.exp(-z))
        loss = y[i] * np.log(y_prid) + (1-y[i]) * np.log(1-y_prid)
        total_loss += loss
    total_loss /= -m
    return total_loss

In [8]:
w_tmp = np.array([1,1])
b_tmp = -3
print(my_compute_cost_logistic(X_train, y_train, w_tmp, b_tmp))

0.36686678640551745


**Expected output**: 0.3668667864055175

In [9]:
def my_compute_gradient_logistic(X, y, w, b): #correct
    """
    Computes the gradient for logistic regression 
 
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters  
      b (scalar)       : model parameter
    Returns
      dj_db (scalar)      : The gradient of the cost w.r.t. the parameter b. 
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w. 
    """
    m,n = X.shape
    dj_dw = np.zeros(n)
    dj_db = 0
    for i in range(m):
        y_prid_i = 1 / (1 + np.exp(-(np.dot(X[i],w) + b)) )
        loss = y_prid_i - y[i]
        dj_db += loss
        for j in range(n):
            dj_dw[j] += loss * X[i][j]
    dj_dw /= m
    dj_db /= m
    return dj_db,dj_dw

In [10]:
X_tmp = np.array([[0.5, 1.5], [1,1], [1.5, 0.5], [3, 0.5], [2, 2], [1, 2.5]])
y_tmp = np.array([0, 0, 0, 1, 1, 1])
w_tmp = np.array([2.,3.])
b_tmp = 1.
dj_db_tmp, dj_dw_tmp = my_compute_gradient_logistic(X_tmp, y_tmp, w_tmp, b_tmp)
print(f"dj_db: {dj_db_tmp}" )
print(f"dj_dw: {dj_dw_tmp.tolist()}" )

dj_db: 0.49861806546328574
dj_dw: [0.498333393278696, 0.49883942983996693]


**Expected output**
``` 
dj_db: 0.49861806546328574
dj_dw: [0.498333393278696, 0.49883942983996693]
```

In [11]:
def gradient_descent(X, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters): 
    """
    Performs batch gradient descent to learn w and b. Updates w and b by taking 
    num_iters gradient steps with learning rate alpha
    
    Args:
      X (ndarray (m,n))   : Data, m examples with n features
      y (ndarray (m,))    : target values
      w_in (ndarray (n,)) : initial model parameters  
      b_in (scalar)       : initial model parameter
      cost_function       : function to compute cost
      gradient_function   : function to compute the gradient
      alpha (float)       : Learning rate
      num_iters (int)     : number of iterations to run gradient descent
      
    Returns:
      w (ndarray (n,))    : Updated values of parameters 
      b (scalar)          : Updated value of parameter 
      cost_history (list) : Cost after each update (first 100000 iterations only)
      """
    w = copy.deepcopy(w_in)  # work on a copy so the caller's w_in array is left untouched
    b = b_in
    cost_history = []
    for i in range(num_iters):
        dj_db,dj_dw = gradient_function(X,y,w,b)
        w = w - alpha * dj_dw # Vector of size n
        b = b - alpha * dj_db # Scalar 
        if i < 100000:      # cap the history length to limit memory use
            cost_history.append(cost_function(X,y,w,b))
            
        # Print the cost about 10 times over the run (every iteration if num_iters < 10)
        if i% math.ceil(num_iters / 10) == 0:
            print(f"Iteration {i:4d}: Cost {cost_history[-1]:8.2f}   ")
    return w,b,cost_history

## Run it 

In [12]:
w_tmp  = np.zeros_like(X_train[0])
b_tmp  = 0.
alph = 0.1
iters = 10000

w_out, b_out, _ = gradient_descent(X_train, y_train, w_tmp, b_tmp,my_compute_cost_logistic,my_compute_gradient_logistic, alph, iters) 
print(f"\nupdated parameters: w:{w_out}, b:{b_out}")

Iteration    0: Cost     0.68   
Iteration 1000: Cost     0.16   
Iteration 2000: Cost     0.08   
Iteration 3000: Cost     0.06   
Iteration 4000: Cost     0.04   
Iteration 5000: Cost     0.03   
Iteration 6000: Cost     0.03   
Iteration 7000: Cost     0.02   
Iteration 8000: Cost     0.02   
Iteration 9000: Cost     0.02   

updated parameters: w:[5.281 5.078], b:-14.222409982019837


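## Vectorized implementation
- The `sigmoid` function (and `predict_vect`) are already vectorized: they operate on whole arrays at once.
- The gradient descent loop itself is iterative by nature, because each step depends on the result of the previous one.
- What we can vectorize is the work done inside each step:
    - computing the cost
    - computing the gradient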
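
As a minimal sketch of what vectorizing means here, the snippet below computes the linear scores $z^{(i)} = \mathbf{w} \cdot \mathbf{x}^{(i)} + b$ for all examples twice: once with an explicit loop and once with a single matrix-vector product. The `*_demo` names and parameter values are illustrative assumptions; the data are the six training points from the top of the notebook.

```python
import numpy as np

X_demo = np.array([[0.5, 1.5], [1, 1], [1.5, 0.5], [3, 0.5], [2, 2], [1, 2.5]])  # (m, n)
w_demo = np.array([1.0, 1.0])   # (n,)
b_demo = -3.0

# Loop version: one dot product per example
z_loop = np.array([np.dot(w_demo, X_demo[i]) + b_demo for i in range(X_demo.shape[0])])

# Vectorized version: one matrix-vector product, with b broadcast over all m entries
z_vect = X_demo @ w_demo + b_demo

print("shapes:", z_loop.shape, z_vect.shape)
print("same result:", np.allclose(z_loop, z_vect))
```

Both versions produce an array of shape `(m,)`; the vectorized one delegates the loop to NumPy's compiled routines.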

In [13]:
def sigmoid(z):
    """
    Compute the sigmoid of z

    Args:
        z (ndarray): A scalar or a numpy array of any size.

    Returns:
        y_prid (ndarray): sigmoid(z), with the same shape as z
         
    """

    y_prid = 1/(1+np.exp(-z))
   
    return y_prid

In [14]:
x = np.array([1,2,3])
y = np.array([2,2,2])
np.multiply(x,y)

array([2, 4, 6])

In [15]:
def compute_cost_logistic_vect(X, y, w, b):
    """
    Computes cost

    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters  
      b (scalar)       : model parameter
      
    Returns:
      cost (scalar): cost
    """
    y = y.reshape((-1,1)) # Important: make y a column of shape (m,1) to match z; a (m,) y would broadcast against (m,1) into an (m,m) matrix and give a wrong cost
    wT = w.reshape((1,-1))
    z = np.matmul(wT,X.T).T + b
    y_prid = sigmoid(z)
    losses = np.add(np.multiply(y,np.log(y_prid)), np.multiply(np.subtract(1,y), np.log(np.subtract(1,y_prid)))  )
    return -1 * losses.mean()

In [16]:
w_tmp = np.array([1,1])
b_tmp = -3
print(compute_cost_logistic_vect(X_train, y_train, w_tmp, b_tmp))  # the function takes exactly 4 arguments (no sigmoid); expect about 0.3669, as in the loop version

In [17]:
def compute_gradient_logistic_vect(X, y, w, b): 
    """
    Computes the gradient for logistic regression 
 
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters  
      b (scalar)       : model parameter
    Returns
      dj_db (scalar)      : The gradient of the cost w.r.t. the parameter b. 
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w. 
    """
    m = X.shape[0]
    y = y.reshape((-1,1)) # Important: make y a column of shape (m,1) to match z; a (m,) y would broadcast against (m,1) into an (m,m) matrix and give a wrong gradient
    wT = w.reshape((1,-1))
    z = np.matmul(wT,X.T).T + b
    y_prid = sigmoid(z)
    err = np.subtract(y_prid,y)
    dj_db = err.mean()
    dj_dw = np.matmul(err.T,X)
    dj_dw /= m
    return dj_db,dj_dw[0]

In [18]:
X_tmp = np.array([[0.5, 1.5], [1,1], [1.5, 0.5], [3, 0.5], [2, 2], [1, 2.5]])
y_tmp = np.array([0, 0, 0, 1, 1, 1])
w_tmp = np.array([2.,3.])
b_tmp = 1.
dj_db_tmp, dj_dw_tmp = compute_gradient_logistic_vect(X_tmp, y_tmp, w_tmp, b_tmp)
print(dj_db_tmp)
print(dj_dw_tmp.tolist())

0.49861806546328574
[0.498333393278696, 0.49883942983996693]


**Expected output**
``` 
dj_db: 0.49861806546328574
dj_dw: [0.498333393278696, 0.49883942983996693]
```

## Run it

In [19]:
w_tmp  = np.zeros_like(X_train[0])
b_tmp  = 0.
alph = 0.1
iters = 10000

w_out, b_out, _ = gradient_descent(X_train, y_train, w_tmp, b_tmp,compute_cost_logistic_vect,compute_gradient_logistic_vect, alph, iters) 
print(f"\nupdated parameters: w:{w_out.tolist()}, b:{b_out}")

Iteration    0: Cost     0.68   
Iteration 1000: Cost     0.16   
Iteration 2000: Cost     0.08   
Iteration 3000: Cost     0.06   
Iteration 4000: Cost     0.04   
Iteration 5000: Cost     0.03   
Iteration 6000: Cost     0.03   
Iteration 7000: Cost     0.02   
Iteration 8000: Cost     0.02   
Iteration 9000: Cost     0.02   

updated parameters: w:[5.281230291780549, 5.078156075159833], b:-14.222409982019837


## Compare speeds

In [20]:
import time
tic1 = time.time()
w_final1, b_final1, J_hist = gradient_descent(X_train, y_train, w_tmp,
                                              b_tmp,my_compute_cost_logistic,
                                              my_compute_gradient_logistic, alph, iters) 

toc1 = time.time()
tic2 = time.time()
w_final2, b_final2, J_hist2 = gradient_descent(X_train, y_train, w_tmp,
                                              b_tmp,compute_cost_logistic_vect,
                                              compute_gradient_logistic_vect, alph, iters) 
toc2 = time.time()
print(f"The w from the looping algorithm is {w_final1} and the b is {b_final1} it took {toc1-tic1}")
print(f"The w from the vectorized algorithm is {w_final2} and the b is {b_final2} it took {toc2-tic2}")

Iteration    0: Cost     0.68   
Iteration 1000: Cost     0.16   
Iteration 2000: Cost     0.08   
Iteration 3000: Cost     0.06   
Iteration 4000: Cost     0.04   
Iteration 5000: Cost     0.03   
Iteration 6000: Cost     0.03   
Iteration 7000: Cost     0.02   
Iteration 8000: Cost     0.02   
Iteration 9000: Cost     0.02   
Iteration    0: Cost     0.68   
Iteration 1000: Cost     0.16   
Iteration 2000: Cost     0.08   
Iteration 3000: Cost     0.06   
Iteration 4000: Cost     0.04   
Iteration 5000: Cost     0.03   
Iteration 6000: Cost     0.03   
Iteration 7000: Cost     0.02   
Iteration 8000: Cost     0.02   
Iteration 9000: Cost     0.02   
The w from the looping algorithm is [5.281 5.078] and the b is -14.222409982019837 it took 2.562499523162842
The w from the vectorized algorithm is [5.281 5.078] and the b is -14.222409982019837 it took 1.1406280994415283
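
With only six training examples, the timings above are dominated by Python overhead rather than by arithmetic. The sketch below repeats the comparison for a single gradient evaluation on a larger synthetic data set, using `time.perf_counter`. The data sizes, the random data and the `grad_loop` / `grad_vect` helpers are assumptions made for this illustration, and the measured times will vary from machine to machine.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
m_demo, n_demo = 2000, 20                           # synthetic problem size (assumption)
X_demo = rng.normal(size=(m_demo, n_demo))
y_demo = (rng.random(m_demo) > 0.5).astype(float)   # random 0/1 labels
w_demo = rng.normal(size=n_demo) * 0.1
b_demo = 0.0

def grad_loop(X, y, w, b):
    """Gradient with explicit loops over examples and features."""
    m, n = X.shape
    dj_dw, dj_db = np.zeros(n), 0.0
    for i in range(m):
        err = 1.0 / (1.0 + np.exp(-(np.dot(X[i], w) + b))) - y[i]
        dj_db += err
        for j in range(n):
            dj_dw[j] += err * X[i, j]
    return dj_db / m, dj_dw / m

def grad_vect(X, y, w, b):
    """Same gradient using array operations only."""
    err = 1.0 / (1.0 + np.exp(-(X @ w + b))) - y
    return err.mean(), X.T @ err / X.shape[0]

t0 = time.perf_counter()
db_loop, dw_loop = grad_loop(X_demo, y_demo, w_demo, b_demo)
t1 = time.perf_counter()
db_vect, dw_vect = grad_vect(X_demo, y_demo, w_demo, b_demo)
t2 = time.perf_counter()

print(f"loop: {t1 - t0:.4f} s, vectorized: {t2 - t1:.4f} s")
print("same gradient:", np.isclose(db_loop, db_vect) and np.allclose(dw_loop, dw_vect))
```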


# Adding regularization to the model

### Cost function for regularized logistic regression
For regularized **logistic** regression, the cost function is of the form
$$J(\mathbf{w},b) = \frac{1}{m}  \sum_{i=0}^{m-1} \left[ -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) \right] + \frac{\lambda}{2m}  \sum_{j=0}^{n-1} w_j^2 \tag{3}$$
where:
$$ f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = sigmoid(\mathbf{w} \cdot \mathbf{x}^{(i)} + b)  \tag{4} $$ 

Compare this to the cost function without regularization (implemented above as `my_compute_cost_logistic`):

$$ J(\mathbf{w},b) = \frac{1}{m}\sum_{i=0}^{m-1} \left[ -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right)\right] $$

As in regularized linear regression, the difference is the regularization term, which is    <span style="color:blue">
    $\frac{\lambda}{2m}  \sum_{j=0}^{n-1} w_j^2$ </span> 

Including this term encourages gradient descent to minimize the size of the parameters. Note, in this example, the parameter $b$ is not regularized. This is standard practice. 
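
To get a feel for the size of the penalty, the sketch below evaluates only the regularization term for a small and a large weight vector at a few values of $\lambda$. The helper `reg_term_demo` and all values are hypothetical and chosen for illustration.

```python
import numpy as np

def reg_term_demo(w, m, lambda_):
    """Regularization term (lambda / 2m) * sum_j w_j^2; b is deliberately excluded."""
    return (lambda_ / (2 * m)) * np.sum(w ** 2)

w_small = np.array([0.1, -0.2])   # made-up small weights
w_large = np.array([5.0, -4.0])   # made-up large weights
m_demo = 6                        # number of training examples

for lam in (0.0, 0.7, 10.0):
    small = reg_term_demo(w_small, m_demo, lam)
    large = reg_term_demo(w_large, m_demo, lam)
    print(f"lambda = {lam:5.1f}: small weights -> {small:.4f}, large weights -> {large:.4f}")
```

With $\lambda = 0$ the term vanishes and the cost reduces to the unregularized one; as $\lambda$ grows, large weights are penalized increasingly heavily.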

### unvectorized

In [21]:
def compute_cost_logistic_reg(X, y, w, b, lambda_ = 1): #correct
    """
    Computes the regularized cost over all examples
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters  
      b (scalar)       : model parameter
      lambda_ (scalar) : Controls amount of regularization
    Returns:
      total_cost (scalar):  cost 
    """
    m,n = X.shape
    total_loss = 0
    for i in range(m):
        z = np.dot(w,X[i]) + b
        y_prid = 1 / (1 + np.exp(-z))
        loss = y[i] * np.log(y_prid) + (1-y[i]) * np.log(1-y_prid)
        total_loss += loss
    print(total_loss)
    total_loss /= -m
    # This is the added part 
    reg_cost = 0
    for j in range(n):
        reg_cost += (w[j]**2)
    reg_cost = reg_cost * (lambda_/(2*m))
    print(reg_cost)
    total_loss += reg_cost
    return total_loss

In [22]:
np.random.seed(1)
X_tmp = np.random.rand(5,6)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1]).reshape(-1,)-0.5
b_tmp = 0.5
lambda_tmp = 0.7
cost_tmp = compute_cost_logistic_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)

print("Regularized cost:", cost_tmp)

-3.2682174003938935
0.031441433795388524
Regularized cost: 0.6850849138741673


**Expected Output**:
<table>
  <tr>
    <td> <b>Regularized cost: </b> 0.6850849138741673 </td>
  </tr>
</table>

### Vectorized

In [23]:
def compute_cost_logistic_reg_vect(X, y, w, b,lambda_ = 1): #correct
    """
    Computes the regularized cost

    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters  
      b (scalar)       : model parameter
      lambda_ (scalar) : Controls amount of regularization
      
    Returns:
      cost (scalar): cost
    """
    m,n = X.shape
    y = y.reshape((-1,1)) # Important: make y a column of shape (m,1) to match z; a (m,) y would broadcast against (m,1) into an (m,m) matrix and give a wrong cost
    wT = w.reshape((1,-1))
    z = np.matmul(wT,X.T).T + b
    y_prid = sigmoid(z)
    losses = np.add(np.multiply(y,np.log(y_prid)), np.multiply(np.subtract(1,y), np.log(np.subtract(1,y_prid)))  )
    losses = losses.mean() * -1
    
    # This is the added part
    cost_reg = np.square(w) 
    cost_reg = cost_reg.sum() * (lambda_/(2*m))
    
    return losses + cost_reg

In [24]:
np.random.seed(1)
X_tmp = np.random.rand(5,6)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1]).reshape(-1,)-0.5
b_tmp = 0.5
lambda_tmp = 0.7
cost_tmp = compute_cost_logistic_reg_vect(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)

print("Regularized cost:", cost_tmp)

Regularized cost: 0.6850849138741673


**Expected Output**:
<table>
  <tr>
    <td> <b>Regularized cost: </b> 0.6850849138741673 </td>
  </tr>
</table>

### Gradient function for regularized logistic regression

### Computing the Gradient with regularization (both linear/logistic)
The gradient calculations for linear and logistic regression are nearly identical, differing only in the computation of $f_{\mathbf{w},b}$.
$$\begin{align*}
\frac{\partial J(\mathbf{w},b)}{\partial w_j}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)}  +  \frac{\lambda}{m} w_j \tag{2} \\
\frac{\partial J(\mathbf{w},b)}{\partial b}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) \tag{3} 
\end{align*}$$

* m is the number of training examples in the data set      
* $f_{\mathbf{w},b}(x^{(i)})$ is the model's prediction, while $y^{(i)}$ is the target

      
* For a  <span style="color:blue"> **linear** </span> regression model  
    $f_{\mathbf{w},b}(x) = \mathbf{w} \cdot \mathbf{x} + b$  
* For a <span style="color:blue"> **logistic** </span> regression model  
    $z = \mathbf{w} \cdot \mathbf{x} + b$  
    $f_{\mathbf{w},b}(x) = g(z)$  
    where $g(z)$ is the sigmoid function:  
    $g(z) = \frac{1}{1+e^{-z}}$   
    
The term that adds regularization is <span style="color:blue">$\frac{\lambda}{m} w_j $</span>.
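
As a short worked step, this term follows from differentiating the regularization part of the cost with respect to a single weight $w_j$. The summation index is renamed to $k$ here for clarity; only the $k = j$ summand depends on $w_j$:

$$\begin{align*}
\frac{\partial}{\partial w_j} \left[ \frac{\lambda}{2m} \sum_{k=0}^{n-1} w_k^2 \right] &= \frac{\lambda}{2m} \cdot 2 w_j \\
&= \frac{\lambda}{m} w_j
\end{align*}$$

Since $b$ does not appear in the regularization term, $\frac{\partial J(\mathbf{w},b)}{\partial b}$ is the same as in the unregularized case.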

### unvectorized

In [25]:
def compute_gradient_logistic_reg(X, y, w, b, lambda_): #correct
    """
    Computes the gradient for regularized logistic regression 
 
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters  
      b (scalar)       : model parameter
      lambda_ (scalar) : Controls amount of regularization
    Returns
      dj_db (scalar)            : The gradient of the cost w.r.t. the parameter b. 
      dj_dw (ndarray Shape (n,)): The gradient of the cost w.r.t. the parameters w. 
    """
    m,n = X.shape
    dj_dw = np.zeros(n)
    dj_db = 0
    for i in range(m):
        y_prid_i = 1 / (1 + np.exp(-(np.dot(X[i],w) + b)) )
        loss = y_prid_i - y[i]
        dj_db += loss
        for j in range(n):
            dj_dw[j] += loss * X[i][j]
    dj_dw /= m
    dj_db /= m
    
    #this is The Added Part
    for j in range(n):
        dj_dw[j] = dj_dw[j] + (lambda_/m) * w[j]
    
    return dj_db,dj_dw

In [26]:
np.random.seed(1)
X_tmp = np.random.rand(5,3)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1])
b_tmp = 0.5
lambda_tmp = 0.7
dj_db_tmp, dj_dw_tmp =  compute_gradient_logistic_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)

print(f"dj_db: {dj_db_tmp}", )
print(f"Regularized dj_dw:\n {dj_dw_tmp.tolist()}", )

dj_db: 0.341798994972791
Regularized dj_dw:
 [0.17380012933994293, 0.32007507881566943, 0.10776313396851499]


**Expected Output**
```
dj_db: 0.341798994972791
Regularized dj_dw:
 [0.17380012933994293, 0.32007507881566943, 0.10776313396851499]
 ```

### Vectorized

In [27]:
def compute_gradient_logistic_reg_vect(X, y, w, b,lambda_ = 1): #correct
    """
    Computes the gradient for regularized logistic regression 
 
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters  
      b (scalar)       : model parameter
      lambda_ (scalar) : Controls amount of regularization
    Returns
      dj_db (scalar)      : The gradient of the cost w.r.t. the parameter b. 
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w. 
    """
    m = X.shape[0]
    y = y.reshape((-1,1)) # Important: make y a column of shape (m,1) to match z; a (m,) y would broadcast against (m,1) into an (m,m) matrix and give a wrong gradient
    wT = w.reshape((1,-1))
    z = np.matmul(wT,X.T).T + b
    y_prid = sigmoid(z)
    err = np.subtract(y_prid,y)
    dj_db = err.mean()
    dj_dw = np.matmul(err.T,X)
    dj_dw /= m
    
    # This is the added part
    reg_term = np.multiply(w,(lambda_/m)) #reg term 
    dj_dw[0] += reg_term
    
    return dj_db,dj_dw[0]

In [28]:
np.random.seed(1)
X_tmp = np.random.rand(5,3)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1])
b_tmp = 0.5
lambda_tmp = 0.7
dj_db_tmp, dj_dw_tmp =  compute_gradient_logistic_reg_vect(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)

print(f"dj_db: {dj_db_tmp}", )
print(f"Regularized dj_dw:\n {dj_dw_tmp.tolist()}", )

dj_db: 0.341798994972791
Regularized dj_dw:
 [0.17380012933994293, 0.32007507881566943, 0.10776313396851499]


**Expected Output**
```
dj_db: 0.341798994972791
Regularized dj_dw:
 [0.17380012933994293, 0.32007507881566943, 0.10776313396851499]
 ```

In [29]:
#Compute accuracy on the training set
p = predict(X_train, w_out, b_out)  # use the parameters learned by gradient_descent above

print('Train Accuracy: %f'%(np.mean(p == y_train) * 100))

Train Accuracy: 100.000000