## Logistic regression with gradient descent

We will implement a gradient descent algorithm for logistic regression to find the weights $w$ that minimize the negative log likelihood of the data.

### Logistic regression and the logistic function

The logistic regression model is a linear model for binary classification. It is based on the logistic function, which is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Where $z = w^T x$ is the linear combination of the input features $x$ and the weights $w$. The logistic function maps any real $z$ to the open interval $(0, 1)$, so its output can be interpreted as a probability. It satisfies $\sigma(0) = 0.5$ and $\sigma(-z) = 1 - \sigma(z)$.
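The logistic function can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the name `sigmoid` is an assumption, and the branch on the sign of $z$ is one common way to avoid overflow in `exp` for inputs of large magnitude.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied elementwise to an array."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the textbook form cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the algebraically equivalent form exp(z) / (1 + exp(z)).
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

values = sigmoid(np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0]))
print(values)
```

The extreme inputs are included only to show that the function stays finite where a naive `1 / (1 + np.exp(-z))` would emit overflow warnings.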

### Logistic regression probabilistic interpretation

The logistic regression model can be interpreted as a probabilistic model. The probability of the target variable $y$ being 1 given the input features $x$ is given by:

$$p(y=1|x, w) = \sigma(w^T x)$$

The probability of the target variable $y$ being 0 given the input features $x$ is given by:

$$p(y=0|x, w) = 1 - \sigma(w^T x)$$
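As a small illustration of the two class probabilities, the sketch below evaluates them for a handful of made-up inputs. The data, the weights, and the function name `predict_proba` are all assumptions chosen for the example; the first column of `X` is a constant 1 so that the first weight acts as a bias.

```python
import numpy as np

def predict_proba(X, w):
    """Return p(y=1 | x, w) = sigma(w^T x) for each row x of X."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))

# Made-up inputs: a bias column followed by a single feature.
X = np.array([[1.0, 2.0],
              [1.0, -1.0],
              [1.0, 0.0]])
w = np.array([0.0, 1.0])  # hypothetical weights

p1 = predict_proba(X, w)  # p(y=1 | x, w)
p0 = 1.0 - p1             # p(y=0 | x, w)
print(p1, p0)
```

For each sample the two probabilities sum to 1, and a sample with $w^T x = 0$ gets probability 0.5 for either class.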

### Negative log likelihood

The negative log likelihood of the data given the weights $w$ is given by:

$$\mathcal{L}(w) = -\sum_{i=1}^{N} \left[ y_i \log(\sigma(w^T x_i)) + (1 - y_i) \log(1 - \sigma(w^T x_i)) \right]$$

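Where $N$ is the number of samples in the dataset, $x_i$ is the feature vector of the $i$-th sample, and $y_i \in \{0, 1\}$ is the target of the $i$-th sample. Both log terms belong inside the sum: for each sample, only the term matching its label is nonzero.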
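The negative log likelihood can be computed as in the following sketch. The function name and the toy data are assumptions. Rather than calling `log` on the sigmoid directly, it uses the identities $\log \sigma(z) = -\log(1 + e^{-z})$ and $\log(1 - \sigma(z)) = -\log(1 + e^{z})$, which `np.logaddexp` evaluates without overflow.

```python
import numpy as np

def negative_log_likelihood(w, X, y):
    """Sum over samples of -[y log(sigma(z)) + (1 - y) log(1 - sigma(z))], z = X @ w."""
    z = X @ w
    # np.logaddexp(0, t) computes log(1 + exp(t)) in a numerically stable way.
    return np.sum(y * np.logaddexp(0.0, -z) + (1.0 - y) * np.logaddexp(0.0, z))

# Made-up data: bias column plus one feature, binary labels.
X = np.array([[1.0, 2.0],
              [1.0, -1.0],
              [1.0, 0.0]])
y = np.array([1.0, 0.0, 1.0])

print(negative_log_likelihood(np.zeros(2), X, y))
print(negative_log_likelihood(np.array([0.0, 1.0]), X, y))
```

With $w = 0$ every sample has probability 0.5, so the loss equals $N \log 2$, which is a convenient sanity check.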

### Gradient of the negative log likelihood

The gradient of the negative log likelihood with respect to the weights $w$ is given by:

$$\nabla \mathcal{L}(w) = -\sum_{i=1}^{N} (y_i - \sigma(w^T x_i)) x_i$$
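The gradient formula follows from $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, and it can be checked numerically. The sketch below compares the analytic gradient against central finite differences on random data. All names (`nll`, `nll_gradient`, `numerical_gradient`) and the random data are assumptions made for this check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    z = X @ w
    return np.sum(y * np.logaddexp(0.0, -z) + (1.0 - y) * np.logaddexp(0.0, z))

def nll_gradient(w, X, y):
    # -sum_i (y_i - sigma(w^T x_i)) x_i, written as a matrix product.
    return -X.T @ (y - sigmoid(X @ w))

def numerical_gradient(f, w, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at w."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (f(w + step) - f(w - step)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) < 0.5).astype(float)
w = rng.normal(size=3)

analytic = nll_gradient(w, X, y)
numeric = numerical_gradient(lambda v: nll(v, X, y), w)
print(np.max(np.abs(analytic - numeric)))
```

If the two gradients disagree by more than a small tolerance, the usual culprit is a sign error or a term left outside the sum.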

### Gradient descent

The weights $w$ are updated using the gradient descent algorithm. The update rule is given by:

$$w = w - \alpha \nabla \mathcal{L}(w)$$

Where $\alpha$ is the learning rate.
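The update rule can be turned into a short training loop, sketched below under several assumptions: a fixed number of iterations instead of a convergence test, a zero initial weight vector, a hand-picked learning rate, and a small made-up dataset that is not linearly separable (so a finite minimizer exists). The function name `gradient_descent` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, n_iters=500):
    """Batch gradient descent on the negative log likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = -X.T @ (y - sigmoid(X @ w))  # gradient over the full dataset
        w = w - alpha * grad                # step against the gradient
    return w

# Made-up 1-D data with a bias column; the labels overlap, so no perfect split exists.
x = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0])

w_hat = gradient_descent(X, y)
final_grad = -X.T @ (y - sigmoid(X @ w_hat))
print(w_hat, np.linalg.norm(final_grad))
```

Because the gradient here is a sum over samples rather than a mean, a learning rate that works for this dataset may be too large for a bigger one; dividing the gradient by $N$ is a common adjustment.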


### Stochastic gradient descent

In the stochastic gradient descent algorithm, the weights are updated at each iteration using the gradient of the negative log likelihood of a single sample. For sample $i$ this gradient is $-(y_i - \sigma(w^T x_i)) x_i$, so stepping against it gives the update rule:

$$w = w + \alpha (y_i - \sigma(w^T x_i)) x_i$$

Where $\alpha$ is the learning rate, $x_i$ is the feature vector of the $i$-th sample, and $y_i$ is the target of the $i$-th sample.
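A minimal sketch of stochastic gradient descent follows. The per-sample gradient of the negative log likelihood is $-(y_i - \sigma(w^T x_i)) x_i$, and the code steps against it. The function name `sgd`, the shuffling of samples once per epoch, the constant learning rate, and the toy data are assumptions of this example, not requirements of the method.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    z = X @ w
    return np.sum(y * np.logaddexp(0.0, -z) + (1.0 - y) * np.logaddexp(0.0, z))

def sgd(X, y, alpha=0.05, n_epochs=200, seed=0):
    """Stochastic gradient descent: one sample per weight update."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):  # visit samples in random order
            grad_i = -(y[i] - sigmoid(X[i] @ w)) * X[i]  # single-sample gradient
            w = w - alpha * grad_i
    return w

# Made-up 1-D data with a bias column and overlapping labels.
x = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0])

w_sgd = sgd(X, y)
print(w_sgd, nll(w_sgd, X, y))
```

With a constant learning rate the iterates keep fluctuating around the minimizer instead of settling exactly on it; a decaying learning rate is a common remedy.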

### Regularization

The negative log likelihood can be regularized to prevent overfitting. The regularized negative log likelihood is given by:

$$\mathcal{L}(w) = -\sum_{i=1}^{N} \left[ y_i \log(\sigma(w^T x_i)) + (1 - y_i) \log(1 - \sigma(w^T x_i)) \right] + \frac{\lambda}{2} ||w||^2$$

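Where $\lambda \geq 0$ is the regularization parameter. Larger values of $\lambda$ penalize large weights more strongly, and $\lambda = 0$ recovers the unregularized loss.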

The gradient of the regularized negative log likelihood with respect to the weights $w$ is given by:

$$\nabla \mathcal{L}(w) = -\sum_{i=1}^{N} (y_i - \sigma(w^T x_i)) x_i + \lambda w$$

The update rule has the same form as in plain gradient descent, with the regularized gradient in place of the unregularized one:

$$w = w - \alpha \nabla \mathcal{L}(w)$$

Where $\alpha$ is the learning rate.
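The effect of the penalty can be seen by fitting the same data with and without it. The sketch below is illustrative only: the function name `fit`, the value $\lambda = 1$, the learning rate, and the toy data are assumptions, and the penalty is applied to every weight including the bias, as in the formula above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, lam=0.0, alpha=0.1, n_iters=500):
    """Batch gradient descent on the L2-regularized negative log likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        # Data term plus the gradient of (lam / 2) * ||w||^2, which is lam * w.
        grad = -X.T @ (y - sigmoid(X @ w)) + lam * w
        w = w - alpha * grad
    return w

# Made-up 1-D data with a bias column and overlapping labels.
x = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0])

w_plain = fit(X, y, lam=0.0)
w_reg = fit(X, y, lam=1.0)
print(np.linalg.norm(w_plain), np.linalg.norm(w_reg))
```

The regularized weights have a smaller norm than the unregularized ones. In practice the bias weight is often left out of the penalty, which would require masking its entry in `lam * w`.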