**Regularization is the process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting.**

Like linear regression, logistic regression is prone to overfitting when there are a large number of features. An overfit decision boundary can become highly contorted so that it fits the training data closely while failing to generalise to unseen data.

To counter this, the cost function of logistic regression is extended with a term that penalizes large parameter values:

$ J(θ) = -\frac{1}{m} \sum_{i=1}^{m} \left(y^{(i)} \log(h_θ(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_θ(x^{(i)}))\right) + \frac{λ}{2m} \sum_{j=1}^{n} θ^2_j $

+ Where

 $ \circ \frac{λ}{2m} \sum_{j=1}^{n}θ^2_j $ is the regularization term.

 $ \circ λ $ is the regularization parameter.
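To make the two parts of the cost concrete, the following is a minimal sketch that evaluates the data term and the penalty separately. The toy data, the 1-D array layout, and the name `regularized_cost` are assumptions made for illustration and are not taken from the notebook.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Cross-entropy cost plus an L2 penalty that skips theta[0]."""
    m = X.shape[0]
    p = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x) for every row of X
    data_term = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    penalty = lam / (2 * m) * np.sum(theta[1:] ** 2)
    return data_term + penalty

# Hypothetical toy data: 4 examples, a bias column of ones plus 2 features
X = np.array([[1.0, 0.5, 1.5],
              [1.0, 1.0, 1.0],
              [1.0, 1.5, 0.5],
              [1.0, 3.0, 0.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.array([0.1, 2.0, -1.0])

plain = regularized_cost(theta, X, y, lam=0.0)
penalized = regularized_cost(theta, X, y, lam=1.0)

# The gap between the two is the penalty alone:
# lam / (2 * m) * (theta_1^2 + theta_2^2) = 1 / 8 * (4 + 1)
print(penalized - plain)
```

Only $θ_1, \dots, θ_n$ contribute to the gap, so changing `theta[0]` changes the data term but not the penalty.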

### Regularization for Gradient Descent

The gradient descent update for logistic regression without regularization was given by,

Repeat until convergence:

$ \{ 
     θ_j := θ_j - α \frac{1}{m} \sum_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) x^{(i)}_j 
\} $

$ \bullet $ where j ∈ {0,1,⋯,n}
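The same update can also be written in vectorized form, which is a convenient way to read the NumPy code in this notebook. This is a sketch under the assumption that $ X $ is the $ m \times (n+1) $ design matrix whose rows are the $ x^{(i)} $ (with $ x^{(i)}_0 = 1 $), $ y $ is the column vector of targets, and $ g $ denotes the sigmoid applied element-wise. The symbol $ g $ is introduced here only for illustration:

$ θ := θ - \frac{α}{m} X^T \left( g(Xθ) - y \right) $

Row $ j $ of $ X^T \left( g(Xθ) - y \right) $ is exactly the sum $ \sum_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) x^{(i)}_j $, so all parameters are updated simultaneously.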

Since the cost function now includes the regularization term, the **derivative of the cost function** used in the gradient descent update changes as well:

$ \frac{∂}{∂θ_j} J(θ) = \frac{∂}{∂θ_j} \left[ -\frac{1}{m}\sum_{i=1}^{m} \left(y^{(i)}\log(h_θ(x^{(i)})) + (1-y^{(i)})\log(1-h_θ(x^{(i)}))\right) + \frac{λ}{2m} \sum_{j=1}^{n} θ^2_j \right] $

 $ = \frac{1}{m} \sum_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)}) x^{(i)}_j + \frac{λ}{m}θ_j $

Because the first term of the cost function is unchanged, so is the first term of the derivative. Differentiating the regularization term with respect to $ θ_j $ (for j ≥ 1) gives $ \frac{λ}{m}θ_j $, as seen above.
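A quick way to gain confidence in a derivative like this is to compare it with a finite-difference approximation of the cost. The sketch below does that on random data. All names (`cost`, `analytic_grad`, `numeric_grad`), the random data, and the value `lam = 0.7` are assumptions chosen for illustration, and $ θ_0 $ is left unpenalized.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, lam):
    """Regularized cross-entropy cost; theta[0] is not penalized."""
    m = X.shape[0]
    p = sigmoid(X @ theta)
    data_term = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return data_term + lam / (2 * m) * np.sum(theta[1:] ** 2)

def analytic_grad(theta, X, y, lam):
    """Gradient from the formula: data term plus (lam / m) * theta_j for j >= 1."""
    m = X.shape[0]
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    grad[1:] += lam / m * theta[1:]
    return grad

def numeric_grad(theta, X, y, lam, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step, X, y, lam)
                   - cost(theta - step, X, y, lam)) / (2 * eps)
    return grad

# Hypothetical random problem: 20 examples, bias column plus 3 features
rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])
y = (rng.random(20) < 0.5).astype(float)
theta = rng.normal(size=4)
lam = 0.7

max_diff = np.max(np.abs(analytic_grad(theta, X, y, lam)
                         - numeric_grad(theta, X, y, lam)))
print(max_diff)  # should be very small if the formula is right
```

If the $ \frac{λ}{m}θ_j $ term were missing or scaled incorrectly, the difference would be on the order of `lam / m * theta` rather than near zero.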

The gradient descent algorithm is therefore updated as:

Repeat until convergence:

$ θ_0 := θ_0 - α\left[\frac{1}{m}\sum_{i=1}^{m}(h_θ(x^{(i)})-y^{(i)})x^{(i)}_0\right] $

$ θ_j := θ_j - α\left[\frac{1}{m}\sum_{i=1}^{m}(h_θ(x^{(i)})-y^{(i)})x^{(i)}_j + \frac{λ}{m}θ_j \right] $

+ Where j ∈ {1,2,⋯,n} and $ h_θ $ is the **sigmoid function** applied to $ θ^T x $.

Note that for j = 0 no regularization term is included, which follows the usual convention of not penalizing the bias parameter $ θ_0 $.
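One way to see the effect of the penalty, as a short rearrangement of the update above rather than a new rule, is to collect the $ θ_j $ terms for j ≥ 1:

$ θ_j := θ_j\left(1 - α\frac{λ}{m}\right) - α\frac{1}{m}\sum_{i=1}^{m}(h_θ(x^{(i)})-y^{(i)})x^{(i)}_j $

Assuming $ 0 < α\frac{λ}{m} < 1 $, each step first shrinks $ θ_j $ toward zero by a constant factor and then applies the usual unregularized gradient step.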

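In [5]:
import numpy as np

mul = np.matmul

# X is the design matrix, assumed to have shape (m, n+1) with a leading column of ones
# y is the target vector, a column of shape (m, 1)
# theta is the parameter vector, a column of shape (n+1, 1)
# lamda is the regularization parameter


def sigmoid(z):
    """Element-wise logistic function 1 / (1 + exp(-z))."""
    return 1 / (1 + np.exp(-z))


def h(X, theta):
    """Hypothesis function: predicted probability for each row of X."""
    return sigmoid(mul(X, theta))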

In [6]:
def j(theta, X, y, lamda=None):
    """Regularized cost function."""
    m = X.shape[0]
    # Cross-entropy term, computed with the full theta (bias included)
    cost = -(1/m) * (mul(y.T, np.log(h(X, theta))) +
                     mul((1-y).T, np.log(1 - h(X, theta))))[0][0]
    if lamda:
        # Penalize a copy of theta with the bias zeroed, so that theta_0 is
        # not regularized and the caller's theta is left unmodified
        reg = theta.copy()
        reg[0] = 0
        cost += (lamda/(2*m)) * mul(reg.T, reg)[0][0]
    return cost

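In [7]:
def j_prime(theta, X, y, lamda=None):
    """Gradient of the regularized cost function."""
    m = X.shape[0]
    # Gradient of the cross-entropy term, computed with the full theta
    grad = (1/m) * mul(X.T, (h(X, theta) - y))
    if lamda:
        # Add (lamda / m) * theta_j for j >= 1 only; theta_0 is not penalized
        reg = theta.copy()
        reg[0] = 0
        grad = grad + (lamda/m) * reg
    return grad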

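In [8]:
def update_theta(theta, X, y, lamda=None, *, alpha):
    """Simultaneous update of all parameters.

    alpha is the learning rate; it is passed as a keyword argument
    instead of being read from a global variable.
    """
    return theta - alpha * j_prime(theta, X, y, lamda)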
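As a usage illustration, the sketch below runs the regularized update in a loop on synthetic data and compares the size of the learned weights with and without the penalty. It is self-contained and does not reuse the notebook's functions. The helper names (`gradient`, `fit`), the synthetic data, and the values `alpha=0.1`, `iterations=2000`, and `lam=10.0` are all assumptions picked for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta, X, y, lam):
    """Regularized gradient; the bias theta[0] is not penalized."""
    m = X.shape[0]
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    penalty = (lam / m) * theta
    penalty[0] = 0.0
    return grad + penalty

def fit(X, y, lam, alpha=0.1, iterations=2000):
    """Plain batch gradient descent for a fixed number of iterations."""
    theta = np.zeros((X.shape[1], 1))
    for _ in range(iterations):
        theta = theta - alpha * gradient(theta, X, y, lam)
    return theta

# Hypothetical synthetic data: labels follow a noisy linear rule
rng = np.random.default_rng(42)
m = 100
features = rng.normal(size=(m, 2))
X = np.hstack([np.ones((m, 1)), features])
noise = 0.5 * rng.normal(size=(m, 1))
y = (features[:, [0]] + features[:, [1]] + noise > 0).astype(float)

theta_plain = fit(X, y, lam=0.0)
theta_reg = fit(X, y, lam=10.0)

# Compare the norm of the non-bias weights for the two fits
print(np.linalg.norm(theta_plain[1:]), np.linalg.norm(theta_reg[1:]))
```

On data like this, the regularized fit is expected to end with a smaller weight norm, which is the shrinkage that the $ \frac{λ}{m}θ_j $ term is meant to produce. A fixed iteration count is used here for simplicity; a convergence test on the change in cost would be a reasonable alternative.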
