## Logistic regression is a probabilistic classifier similar to Naïve Bayes

The goal of binary logistic regression is to train a classifier that can make a binary decision about the class of a new input observation. Here we introduce the sigmoid classifier that will help us make this decision.

Consider a single input observation x, which we will represent by a vector of features [x1, x2,..., xn] (we’ll show sample features in the next subsection).

The classifier output y can be 1 (meaning the observation is a member of the class) or 0 (the observation is not a member of the class). We want to know the probability P(y = 1|x) that this observation is a member of the class. So perhaps the decision is “positive sentiment” versus “negative sentiment”, the features represent counts of words in a document, P(y = 1|x) is the probability that the document has positive sentiment, and P(y = 0|x) is the probability that the document has negative sentiment.

Logistic regression solves this task by learning, from a training set, a vector of weights and a bias term. Each weight wi is a real number, and is associated with one of the input features xi. The weight wi represents how important that input feature is to the classification decision, and can be positive (providing evidence that the instance being classified belongs in the positive class) or negative (providing evidence that the instance being classified belongs in the negative class). Thus we might expect in a sentiment task the word awesome to have a high positive weight, and abysmal to have a very negative weight. The bias term, also called the intercept, is another real number that’s added to the weighted inputs.

To make a decision on a test instance, after we’ve learned the weights in training, the classifier first multiplies each xi by its weight wi, sums up the weighted features, and adds the bias term b. The resulting single number z expresses the weighted sum of the evidence for the class.


![Imgur](https://imgur.com/sX3ek9T.png)

The above equation can be written using dot product notation from linear algebra. The dot product of two vectors a and b, written as a · b, is the sum of the products of the corresponding elements of each vector. Thus the following is an equivalent formulation of the above equation:

# $$z = w· x+b$$

Because this is simply the below vector calculation

![Imgur](https://imgur.com/r2Ayx2o.png)
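As a minimal sketch of this weighted sum, assuming NumPy and made-up values for the features, weights, and bias (none of these numbers come from the text), the explicit sum and the dot product give the same z:

```python
import numpy as np

# Hypothetical feature vector, weights, and bias (illustrative values only)
x = np.array([3.0, 2.0, 1.0])
w = np.array([0.5, -1.0, 0.25])
b = 0.1

# Explicit sum: z = sum_i (w_i * x_i) + b
z_sum = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b

# Dot-product form: z = w . x + b
z_dot = np.dot(w, x) + b

print(z_sum, z_dot)
```

Both forms do the same arithmetic; the dot product is simply the more compact way to write it.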

---

But note that nothing in the above equation forces z to be a legal probability, that is, to lie between 0 and 1. In fact, since weights are real-valued, the output might even be negative; z ranges from −∞ to ∞.

To create a probability, we’ll pass z through the sigmoid function, σ(z). The
sigmoid function (named because it looks like an s) is also called the logistic function, and gives logistic regression its name. The sigmoid has the following equation

![Imgur](https://imgur.com/sjkaJq1.png)

The sigmoid has a number of advantages: it takes a real-valued number and maps it into the range (0, 1), which is just what we want for a probability. Because it is nearly linear around 0 but flattens toward the ends, it tends to squash outlier values toward 0 or 1. And it’s differentiable, which is handy for learning the weights with gradient-based methods.

![Imgur](https://imgur.com/hpMykvK.png)
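A small sketch of the sigmoid, assuming NumPy; the input values are arbitrary and only serve to show the squashing behaviour described above:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary inputs, from strongly negative to strongly positive
z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
probs = sigmoid(z)
print(probs)
```

This plain form is fine for moderate inputs; for very large negative z, `np.exp(-z)` can overflow, so production code usually relies on a numerically stable variant.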

If we apply the sigmoid to the sum of the weighted features, we get a number between 0 and 1. To make it a probability, we just need to make
sure that the two cases, p(y = 1) and p(y = 0), sum to 1. We can do this as follows

![Imgur](https://imgur.com/vlRcxZt.png)

Now we have an algorithm that given an instance x computes the probability P(y = 1|x). How do we make a decision? For a test instance x, we say yes if the
probability P(y = 1|x) is more than .5, and no otherwise. We call .5 the decision boundary:

![Imgur](https://imgur.com/7KCnMNt.png)
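The two probabilities and the 0.5 decision rule can be sketched as follows. The function names `predict_proba` and `decide`, the weights, and the word-count features are all hypothetical and chosen only for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """Return (P(y=1|x), P(y=0|x)) for one feature vector x."""
    p1 = sigmoid(np.dot(w, x) + b)
    return p1, 1.0 - p1

def decide(x, w, b):
    """Say yes (1) if P(y=1|x) is more than 0.5, and no (0) otherwise."""
    p1, _ = predict_proba(x, w, b)
    return 1 if p1 > 0.5 else 0

# Made-up weights for the counts of "awesome" and "abysmal", plus a bias
w = np.array([2.0, -3.0])
b = 0.1
x_pos = np.array([2.0, 0.0])   # "awesome" appears twice
x_neg = np.array([0.0, 1.0])   # "abysmal" appears once

print(predict_proba(x_pos, w, b), decide(x_pos, w, b))
print(predict_proba(x_neg, w, b), decide(x_neg, w, b))
```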

#### Now a little more math to represent the same thing

Source: Python Machine Learning By Example, by Yuxi Liu (2020), page 153

In logistic regression, the function input z becomes the weighted sum of features. Given a data sample x with n features, x1, x2,
..., xn (x represents a feature vector and x = (x1, x2, ..., xn)), and weights (also called coefficients) of the model w (w represents a vector (w1, w2, ..., wn)), z is expressed as follows:

![Imgur](https://imgur.com/nXLUyuf.png)

Also, occasionally, the model comes with an intercept (also called bias), w0. In this instance, the preceding linear relationship becomes:

![Imgur](https://imgur.com/wnqt8Ys.png)

#### Note the basic representation of the predicted y-value of a Logistic Function

![Imgur](https://imgur.com/kxOrYA9.png)

![Imgur](https://imgur.com/PgL9Ikv.png)


A logistic regression model or, more specifically, its weight vector w is learned from the training data, with the goal of predicting a positive sample as close to 1 as possible and predicting a negative sample as close to 0 as possible. In mathematical language, the weights are trained so as to minimize the cost defined as the mean squared error (MSE), which measures the average of squares of the difference between the truth and the prediction.

![Imgur](https://imgur.com/OkwP5KC.png)

However, the above cost function is non-convex: when searching for the optimal w, there can be many local (suboptimal) optima, and a gradient-based search is not guaranteed to converge to the global optimum.

Examples of the convex and non-convex functions are plotted respectively below:

![Imgur](https://imgur.com/28ri9b3.png)

To overcome this, the cost function in practice is defined as follows:

![Imgur](https://imgur.com/A4DuZxI.png)

When the ground truth y(i) = 1, if the model predicts correctly with full confidence (the positive class with 100% probability), the sample cost j is 0. The cost j increases as the predicted probability (y_hat) decreases. If the model incorrectly predicts that there is no chance of the positive class, the cost is infinitely high.

![Imgur](https://imgur.com/zPo2TDZ.png)

On the contrary, when the ground truth y(i) = 0, if the model predicts correctly with full confidence (the positive class with 0 probability, or the negative class with 100% probability), the sample cost j is 0.

The cost j increases when the predicted probability y_hat increases. When it incorrectly predicts that there is no chance of the negative class, the cost becomes infinitely high. We can visualize it as follows:

![Imgur](https://imgur.com/W0FDQS4.png)
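The behaviour of the per-sample cost in both cases can be reproduced with a short sketch. It assumes NumPy; `sample_cost` is a hypothetical helper name, and the probabilities are arbitrary:

```python
import numpy as np

def sample_cost(y_true, y_hat):
    """Per-sample log cost: -log(y_hat) if y_true == 1, else -log(1 - y_hat)."""
    return -np.log(y_hat) if y_true == 1 else -np.log(1.0 - y_hat)

# Arbitrary predicted probabilities of the positive class
for y_hat in (0.99, 0.5, 0.01):
    print(y_hat, sample_cost(1, y_hat), sample_cost(0, y_hat))
```

For a positive sample the cost grows as `y_hat` falls toward 0, and for a negative sample it grows as `y_hat` rises toward 1.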

Minimizing this alternative cost function serves the same goal as minimizing the MSE-based cost function: both are smallest when every prediction matches its label. The advantages of choosing it over the MSE one include the following:

![Imgur](https://imgur.com/luKwJzZ.png)

To understand the above graph, also check out this video by Andrew Ng:

https://www.youtube.com/watch?v=HIQlmHxI6-0&ab_channel=ArtificialIntelligence-AllinOne


## Cost-Function of Logistic-Regression

First, a little comparison between Linear Regression and Logistic Regression

1. **Linear regression** uses the following hypothesis:

$$ h_\theta(x) = \theta_0 + \theta_1 x $$

Accordingly, the cost function is defined as:

$$J(\theta) = \dfrac {1}{2m} \displaystyle \sum_{i=1}^m \left (h_\theta (x^{(i)}) - y^{(i)} \right)^2$$

2. **Logistic regression** uses a sigmoid/logistic function as its hypothesis, so its output is bounded:

$$ 0 \leq h_\theta (x) \leq 1 $$

It is defined as:

$$ h_\theta (x) =  \dfrac{1}{1 + e^{-(\theta^T x)}} $$

Accordingly, our cost function has also changed. However, instead of plugging in the new h(x) equation directly, we use the logarithm.

#### The main reason for using logarithms here is to make the cost function convex, so it has a single global minimum that Gradient Descent can reach.


$$ J(\theta) = \dfrac{1}{m} \sum_{i=1}^m Cost(h_\theta(x^{(i)}),y^{(i)}) $$

$$ Cost(h_\theta(x),y) = -\log(h_\theta(x)) \;  \text{if } y = 1 $$

$$ Cost(h_\theta(x),y) = -\log(1-h_\theta(x)) \;  \text{if } y = 0 $$

And the new cost function is defined as:

$$ J(\theta) = - \frac{1}{m} \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))]$$

[Slightly modified from here](https://math.stackexchange.com/questions/477207/derivative-of-cost-function-for-logistic-regression)

where $h_{\theta}(x)$ is defined as follows

$$h_{\theta}(x)=g(\theta^{T}x)$$
$$g(z)=\frac{1}{1+e^{-z}}$$

Substituting value of z

$$
h_{\theta}(\mathbf{x}_i) = \dfrac{1}{1+e^{(- \theta^T \mathbf{x}_i)}}
$$

So in Probability Terms

$$
P( y_i =1 | \mathbf{x}_i ; \theta) = h_{\theta}(\mathbf{x}_i) = \dfrac{1}{1+e^{(- \theta^T \mathbf{x}_i)}}
$$

so $y_i = 1$ with probability $h_{\theta}(\mathbf{x}_i)$ and $y_i=0$ with probability $1-h_{\theta}(\mathbf{x}_i)$.
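As a quick check that the two-case form of the cost and the single combined formula agree, here is a sketch assuming NumPy; the predictions `h` and labels `y` are made up:

```python
import numpy as np

def cost_piecewise(h, y):
    """Mean of -log(h) where y == 1 and -log(1 - h) where y == 0."""
    per_sample = np.where(y == 1, -np.log(h), -np.log(1.0 - h))
    return per_sample.mean()

def cost_combined(h, y):
    """J = -(1/m) * sum(y * log(h) + (1 - y) * log(1 - h))."""
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# Made-up predicted probabilities and true labels
h = np.array([0.9, 0.2, 0.7, 0.4])
y = np.array([1, 0, 1, 0])

print(cost_piecewise(h, y), cost_combined(h, y))
```

Because y is either 0 or 1, exactly one of the two terms in the combined formula is active for each sample, which is why both functions return the same value.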

---

Noting again, our logistic regression cost function is

## $$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{i}\log(h_\theta(x^{i}))+(1-y^{i})\log(1-h_\theta(x^{i}))\right]$$

Here $h_\theta(x^{i})$ is the predicted y and $y^{i}$ is the true y, so the above formula can be written as

## $$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{true}\log(Y_{pred})+(1-y^{true})\log(1-Y_{pred})\right]$$



Calculating the partial derivative of the cost function with respect to $\theta_j$ (the logs are natural logarithms) gives:

$$ \frac{\partial}{\partial\theta_{j}}J(\theta) = \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{i})-y^i)x_j^i$$

So the ".. the more hypothesis is off from y, the larger the cost function output. If our hypothesis is equal to y, then our cost is 0."

The class notes also mention that MLE (maximum-likelihood estimation) is used to derive the logs in the cost function. The logs act as penalties: the further a prediction is from the true label, the larger the penalty.
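The link to maximum likelihood can be sketched in a few lines, assuming the m training labels are independent given their inputs. Each label $y^{i}$ is 1 with probability $h_\theta(x^{i})$ and 0 with probability $1-h_\theta(x^{i})$, so the likelihood of the whole training set is

$$L(\theta)=\prod_{i=1}^{m}h_\theta(x^{i})^{y^{i}}\left(1-h_\theta(x^{i})\right)^{1-y^{i}}$$

Taking the logarithm turns the product into a sum:

$$\log L(\theta)=\sum_{i=1}^{m}\left[y^{i}\log(h_\theta(x^{i}))+(1-y^{i})\log(1-h_\theta(x^{i}))\right]$$

Comparing with the cost function gives $J(\theta)=-\frac{1}{m}\log L(\theta)$, so minimizing $J(\theta)$ is the same as maximizing the likelihood.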



---

## Gradient Descent Loop for Logistic Regression

![Imgur](https://imgur.com/MWrz5iE.png)

[source](https://www.internalpointers.com/post/cost-function-logistic-regression)
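A minimal sketch of such a loop, assuming NumPy, a design matrix with a leading column of ones for the intercept, and a fixed number of iterations. The function name, the learning rate, and the tiny dataset are illustrative choices, not taken from the source:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent for logistic regression.

    X has shape (m, d) and includes a leading column of ones;
    y has shape (m,) with 0/1 labels. Returns theta of shape (d,).
    """
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)        # predictions for all m samples
        grad = X.T @ (h - y) / m      # (1/m) * sum_i (h_i - y_i) * x_i
        theta -= alpha * grad         # update every theta_j simultaneously
    return theta

# Tiny made-up dataset: intercept column plus one feature
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0, 0, 1, 1])

theta = gradient_descent(X, y)
preds = (sigmoid(X @ theta) > 0.5).astype(int)
print(theta, preds)
```

Computing the gradient for all components at once with `X.T @ (h - y)` keeps the update simultaneous, which is what the loop over j is meant to achieve.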

---

Refer below for calculating the Gradient for Logistic Regression

![Imgur](https://imgur.com/XY8Pnfk.png)

[Source](https://web.stanford.edu/~jurafsky/slp3/5.pdf)

Below are the gradients for vectorized logistic regression with L2 regularization.

## Gradient of loss function w.r.t each weight in weight vector

# $$dw^{(t)} = x_n\left(y_n - \sigma\left((w^{(t)})^{T} x_n+b^{(t)}\right)\right)- \frac{\lambda}{N}w^{(t)}$$

---

## Gradient of the intercept

# $$db^{(t)} = y_n- \sigma\left((w^{(t)})^{T} x_n+b^{(t)}\right)$$

After calculating the gradient of the loss function w.r.t. each weight in the weight vector and the gradient of the intercept, the two update equations below are how our model learns to get better at cats and dogs image classification.

![Imgur](https://imgur.com/Va11TbN.png)

The α represents the learning rate for our gradient descent algorithm i.e. the step size for going down the hill. The term(s) next to α represent the gradients of the loss function corresponding to the weights and the bias respectively.
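One update step with these two gradients might look like the sketch below. It assumes NumPy and, importantly, that the update adds α times `dw` and `db`: as written above, both use (y_n − σ(...)), which points uphill on the log-likelihood, so adding them lowers the log loss. The function name `sgd_step`, the data, and the values of α and λ are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, b, x_n, y_n, alpha, lam, N):
    """One SGD step on sample (x_n, y_n) with L2 regularization.

    dw and db follow the two formulas above. Assumption: the update
    ADDS alpha * dw and alpha * db, because both use (y_n - sigma(...)).
    """
    err = y_n - sigmoid(np.dot(w, x_n) + b)
    dw = x_n * err - (lam / N) * w
    db = err
    return w + alpha * dw, b + alpha * db

def log_loss(w, b, X, y):
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Made-up one-feature dataset and hyperparameters
X = np.array([[0.5], [1.5], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
w, b = np.zeros(1), 0.0
alpha, lam, N = 0.1, 0.01, len(y)

initial_loss = log_loss(w, b, X, y)
for epoch in range(200):
    for x_n, y_n in zip(X, y):
        w, b = sgd_step(w, b, x_n, y_n, alpha, lam, N)
final_loss = log_loss(w, b, X, y)
preds = (sigmoid(X @ w + b) > 0.5).astype(int)
print(initial_loss, final_loss, preds)
```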



## Derive partial derivative of Logistic Regression Cost Function

### First, the most important basic derivative formulas that we need here

![Imgur](https://imgur.com/2huUIby.png)

---

### Derive partial derivative of Logistic Regression Cost Function

[Slightly adapted from here](https://math.stackexchange.com/questions/477207/derivative-of-cost-function-for-logistic-regression)

#### The partial derivative of

$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{i}\log(h_\theta(x^{i}))+(1-y^{i})\log(1-h_\theta(x^{i}))\right]$$

where $h_{\theta}(x)$ is given as below

$$h_{\theta}(x)=g(\theta^{T}x)$$
$$g(z)=\frac{1}{1+e^{-z}}$$

with respect to $\theta_j$ is derived to be as below (the logs are natural logarithms)

$$ \frac{\partial}{\partial\theta_{j}}J(\theta) = \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{i})-y^i)x_j^i$$

---

### Actual Derivation of the above formulae

$$\theta x^i:=\theta_0+\theta_1 x^i_1+\dots+\theta_p x^i_p.$$

Then

$$\log h_\theta(x^i)=\log\frac{1}{1+e^{-\theta x^i} }=-\log ( 1+e^{-\theta x^i} ),$$

$$\log(1- h_\theta(x^i))=\log(1-\frac{1}{1+e^{-\theta x^i} })=\log (e^{-\theta x^i} )-\log ( 1+e^{-\theta x^i} )=-\theta x^i-\log ( 1+e^{-\theta x^i} )$$

Because,

$\log(x/y) = \log(x) - \log(y)$

Since our original cost function has the form:

$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{i}\log(h_\theta(x^{i}))+(1-y^{i})\log(1-h_\theta(x^{i}))\right]$$

Plugging in the two simplified expressions above, we obtain

$$J(\theta)=-\frac{1}{m}\sum_{i=1}^m \left[-y^i(\log ( 1+e^{-\theta x^i})) + (1-y^i)(-\theta x^i-\log ( 1+e^{-\theta x^i} ))\right]$$

which can be simplified to:

$$J(\theta)=-\frac{1}{m}\sum_{i=1}^m \left[y^i\theta x^i-\theta x^i-\log(1+e^{-\theta x^i})\right]=-\frac{1}{m}\sum_{i=1}^m \left[y^i\theta x^i-\log(1+e^{\theta x^i})\right]$$

where the second equality follows from

$$
-\left[ \log e^{\theta x^i}+
\log(1+e^{-\theta x^i} )
\right]=-\log(1+e^{\theta x^i}). $$

Here we used $\log(x) + \log(y) = \log(x y)$.


---


All you need now is to compute the partial derivatives of the simplified expression for $J(\theta)$ w.r.t. $\theta_j$. The two terms give

$$\frac{\partial}{\partial \theta_j}y^i\theta x^i = y^ix^i_j $$

$$\frac{\partial}{\partial \theta_j}\log(1+e^{\theta x^i}) = \frac{x^i_je^{\theta x^i}}{1+e^{\theta x^i}} = x^i_jh_\theta(x^i) $$


---

#### Full Calculation of the above line

**Also note, Anything without θ is treated as constant:**

![Imgur](https://imgur.com/u2GRqyY.png)

So finally, the result follows:

$$\frac{\partial}{\partial\theta_{j}}J(\theta) = \frac{\partial}{\partial\theta_{j}} \left[-\frac{1}{m}\sum_{i=1}^m \left[y^i\theta x^i-\log(1+e^{\theta x^i})\right]\right] = - \frac{1}{m}\sum_{i=1}^{m} \left[ y^ix^i_j - x^i_jh_\theta(x^i) \right] = \frac{1}{m}\sum_{i=1}^{m} \left( h_\theta(x^i) - y^i \right) x^i_j$$

#### Hence Proved
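The derived gradient can also be checked numerically. The sketch below assumes NumPy and compares the analytic formula with central finite differences of the cost on synthetic data; the seed, sizes, and function names are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_grad(theta, X, y):
    """(1/m) * sum_i (h_theta(x^i) - y^i) * x^i_j, for every j at once."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def numeric_grad(theta, X, y, eps=1e-6):
    """Central finite differences, one coordinate theta_j at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)
    return grad

# Synthetic data: intercept column plus two random features
rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])
y = rng.integers(0, 2, size=20)
theta = rng.normal(size=3)

print(analytic_grad(theta, X, y))
print(numeric_grad(theta, X, y))
```

If the derivation is right, the two printed vectors should agree up to finite-difference error.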

---

#### Proof of

## $$\frac{\partial}{\partial \theta_j}\log(1+e^{\theta x^i})=\frac{x^i_je^{\theta x^i}}{1+e^{\theta x^i}}$$

[By chain rule](https://en.wikipedia.org/wiki/Chain_rule).

$$(u(v))' = u'(v) * v'$$


**Also note, Anything without θ is treated as constant:**

For example:

$$y = \sin(3x - 5)$$
$$u(v) = \sin(3x - 5)$$
$$v = (3x - 5)$$
$$y' = \sin(3x - 5)' = \cos(3x - 5) * (3 - 0) = 3\cos(3x-5)$$

So, Regarding:

$$\frac{\partial}{\partial \theta_j}\log(1+e^{\theta x^i})=\frac{x^i_je^{\theta x^i}}{1+e^{\theta x^i}}$$

$$u(v) = \log(1+e^{\theta x^i})$$

$$v = 1+e^{\theta x^i}$$

$$\frac{\partial}{\partial \theta_j}\log(1+e^{\theta x^i}) = \frac{\partial \log(v)}{\partial v} * \frac{\partial v}{\partial \theta_j} = \frac{1}{1+e^{\theta x^i}} * (0 + x^i_je^{\theta x^i}) = \frac{x^i_je^{\theta x^i}}{1+e^{\theta x^i}} $$

Note that $$\log(x)' = \frac{1}{x}$$


  [1]: https://en.wikipedia.org/wiki/Chain_rule

## Time and Space Complexity of Logistic Regression

### During Train - Roughly order of $n * d$

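Where n is the number of samples and d is the dimensionality. This is the cost of one pass of gradient descent over the training set, since every feature of every sample is touched once.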


### During Run Time - It's $O(d)$

Because at run time I already have the W vector stored from the training phase. Now I just need to multiply the sample with that optimum W vector.

$W = [w_1, w_2, \ldots, w_d]$

Meaning, given a query point $x_q$, I just need to compute

# $$W^T * x_q$$

And if the result is > 0, the label is determined to be positive, else negative.

![Imgur](https://imgur.com/NYrKtMQ.png)

[source of image-Applied AI Course](https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course)
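A sketch of this run-time step, assuming NumPy and a weight vector with no separate bias term, as in the formula above. The weights and query points are hypothetical. It also shows that testing the sign of the dot product is the same as testing whether the sigmoid exceeds 0.5:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(W, x_q):
    """Classify one query point with a single O(d) dot product."""
    return 1 if np.dot(W, x_q) > 0 else 0

# Hypothetical learned weights (d = 3) and two made-up query points
W = np.array([0.8, -1.2, 0.3])
x_pos = np.array([2.0, 0.5, 1.0])   # W . x_pos is positive
x_neg = np.array([0.5, 1.0, 0.0])   # W . x_neg is negative

for x_q in (x_pos, x_neg):
    z = np.dot(W, x_q)
    print(z, predict_label(W, x_q), sigmoid(z) > 0.5)
```

Only the d stored weights and d multiply-adds are needed per query, which is where the $O(d)$ time and space cost comes from.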

After a substantial number of iterations, the learned w and b are used to classify a new sample $x'$ by means of the following equation:

![Imgur](https://imgur.com/hnZMLMr.png)

The decision threshold is 0.5 by default, but it can take other values. When a false negative must be avoided at all costs, for example when predicting fire occurrence (the positive class) for alerts, the decision threshold can be lower than 0.5, such as 0.3, depending on how paranoid we are and how proactively we want to prevent the positive event from happening. On the other hand, when false positives are what should be avoided, for instance when predicting product success (the positive class) for quality assurance, the decision threshold can be greater than 0.5, such as 0.7, depending on how high a standard you set.
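The effect of moving the threshold can be seen in a short sketch. It assumes NumPy; the helper name `classify` and the predicted probabilities are made up:

```python
import numpy as np

def classify(probs, threshold=0.5):
    """Label a sample positive (1) when its probability exceeds the threshold."""
    return (np.asarray(probs) > threshold).astype(int)

# Made-up predicted probabilities of the positive class
probs = np.array([0.2, 0.35, 0.55, 0.8])

for threshold in (0.3, 0.5, 0.7):
    print(threshold, classify(probs, threshold))
```

A lower threshold labels more samples positive, which trades false negatives for false positives; a higher threshold does the opposite.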


## Training a logistic regression model using stochastic gradient descent

In gradient descent-based logistic regression models, all training samples are used
to update the weights in every single iteration. Hence, if the number of training
samples is large, the whole training process becomes very time-consuming and
computationally expensive.
Fortunately, a small tweak makes logistic regression suitable for large
datasets. For each weight update, only one training sample is consumed, instead
of the complete training set. The model moves a step based on the error calculated
from a single training sample. Once all samples are used, one iteration finishes. This
advanced version of gradient descent is called stochastic gradient descent (SGD).
Expressed in a formula, for each iteration, we do the following:

![Imgur](https://imgur.com/X7TkU1q.png)

![Imgur](https://imgur.com/eweOXsV.png)

```python
import numpy as np


def train_logistic_regression_sgd(
    X_train, y_train, max_iter, learning_rate, fit_intercept=False
):
    """Train a logistic regression model via SGD.

    Args:
        X_train, y_train (numpy.ndarray, training data set)
        max_iter (int, number of iterations)
        learning_rate (float)
        fit_intercept (bool, with an intercept w0 or not)
    Returns:
        numpy.ndarray, learned weights
    """
    # If fit_intercept, prepend a column of ones so that w0 acts as the intercept
    if fit_intercept:
        intercept = np.ones((X_train.shape[0], 1))
        X_train = np.hstack((intercept, X_train))

    # Initialize weights with zeros
    weights = np.zeros(X_train.shape[1])

    # sgd_update_weights and compute_cost are helpers that are not shown in
    # this section; they must be defined before this function is called.
    for iteration in range(max_iter):
        weights = sgd_update_weights(X_train, y_train, weights, learning_rate)
        # Check the cost every 2 (for example) iterations
        if iteration % 2 == 0:
            print(compute_cost(X_train, y_train, weights))

    return weights
```
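The training function calls `sgd_update_weights` and `compute_cost`, which this section does not define. The sketch below is one plausible way to write them, assuming NumPy, the log cost from earlier, and a per-sample update along (y − y_hat)·x. The helper `compute_prediction` and the tiny dataset are hypothetical additions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_prediction(X, weights):
    """Predicted probability of the positive class for each row of X."""
    return sigmoid(np.dot(X, weights))

def sgd_update_weights(X_train, y_train, weights, learning_rate):
    """One pass over the data, updating the weights after every sample."""
    for x_each, y_each in zip(X_train, y_train):
        prediction = compute_prediction(x_each, weights)
        # Step along (y - y_hat) * x, which lowers the log cost for this sample
        weights = weights + learning_rate * (y_each - prediction) * x_each
    return weights

def compute_cost(X, y, weights):
    """Mean log cost over all samples."""
    predictions = compute_prediction(X, weights)
    return np.mean(-y * np.log(predictions) - (1 - y) * np.log(1 - predictions))

# Made-up data with the intercept column already prepended
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0, 0, 1, 1])

weights = np.zeros(X.shape[1])
cost_before = compute_cost(X, y, weights)
for _ in range(50):
    weights = sgd_update_weights(X, y, weights, learning_rate=0.1)
cost_after = compute_cost(X, y, weights)
print(cost_before, cost_after)
```

In practice the samples are often shuffled before each pass; this sketch keeps them in a fixed order for simplicity.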