# Logistic Regression From Scratch

## Introduction


In Logistic Regression, we fit a curve to data that represents the probability of a binary outcome. We then use this curve to predict the probability of the outcome for new data points.

The objective of this project is to code a Logistic Regression function from scratch without using imported Python libraries. It follows my completion of the Linear Regression from scratch project. This project still involves finding the best coefficient values for the hypothesis function, but in this case the hypothesis function is a sigmoid function. We use gradient descent to optimize the coefficients, as in Linear Regression.

Coding Logistic Regression from scratch may be more complex than Linear Regression. However, by building the algorithm from the ground up, I hope to gain a better understanding of the underlying math and logic involved.

## Background: Probability and Odds in Logistic Regression
In logistic regression, we are interested in estimating the probability of a binary outcome, such as whether a customer will buy a product or not, based on one or more predictor variables. In order to understand logistic regression, it's important to have a basic understanding of probability and odds.

#### Probability
Probability is a measure of the likelihood of an event occurring. When all outcomes are equally likely, it is the number of favorable outcomes divided by the total number of possible outcomes. For example, the probability of rolling a 1 or 2 on a fair six-sided die is $\frac{2}{6}$, or $\frac{1}{3}$, or approximately 0.333.

#### Odds
Odds are another way to express the likelihood of an event occurring. Odds are defined as the probability of the event occurring divided by the probability of the event not occurring. Mathematically, this can be expressed as:

$$Odds = \frac{P(event)}{1-P(event)}$$

For example, the odds of rolling a 1 or 2 on a fair six-sided die can be calculated as follows:

$$Odds(1\ or\ 2) = \frac{P(1\ or\ 2)}{P(not\ 1\ or\ 2)} = \frac{2/6}{4/6} = \frac{1}{2}$$
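The die calculation above can be checked with a few lines of plain Python. This is a minimal sketch; the helper name `odds` is my own choice for illustration and is not part of the project code.

```python
def odds(p):
    """Convert a probability p (0 <= p < 1) into odds p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)

# Probability of rolling a 1 or 2 on a fair six-sided die
p_one_or_two = 2 / 6

# Odds of roughly 0.5, i.e. "1 to 2" (up to floating-point rounding)
print(odds(p_one_or_two))
```

Note that the odds grow without bound as $p$ approaches 1, which is why the helper rejects $p = 1$.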

#### Odds Ratio
In logistic regression, we are interested in the odds ratio, which is the ratio of two odds. For example, the odds ratio of buying a product between two different age groups might be the odds of buying the product for the older group divided by the odds of buying the product for the younger group.
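As a small illustration of the odds ratio, the sketch below compares two groups. The purchase probabilities are hypothetical numbers made up for this example, and the function names are assumptions, not part of the project code.

```python
def odds(p):
    """Odds of an event with probability p (0 <= p < 1)."""
    return p / (1 - p)

def odds_ratio(p_a, p_b):
    """Odds of the event in group A divided by the odds in group B."""
    return odds(p_a) / odds(p_b)

# Hypothetical purchase probabilities, chosen only for illustration
p_older = 0.5    # odds of 1 (an even chance)
p_younger = 0.2  # odds of about 0.25

# A ratio of about 4: the older group's odds of buying are about
# four times the younger group's odds.
print(odds_ratio(p_older, p_younger))
```

An odds ratio of 1 would mean both groups have the same odds of buying.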

#### Logit Function
The logistic regression model estimates the probability of an event occurring based on one or more predictor variables. It does this by modeling the log of the odds, called the logit, as a linear combination of the predictor variables. The logit function is defined as:

$$logit(p) = ln\left(\frac{p}{1-p}\right)$$

where $p$ is the probability of the event occurring. The logit function is useful because it maps probabilities in the open interval $(0, 1)$ onto the entire real line, so a linear combination of predictors, which can take any real value, always corresponds to a valid probability. It is undefined at $p=0$ and $p=1$, which are the extreme values of the probability.

#### Inverse Logit
The inverse logit function, also known as the logistic function, is used to convert the output of the logistic regression model back into a probability. It is defined as:

$$p = \frac{e^a}{1+e^a}$$

where $a$ is the linear combination of the predictor variables and their coefficients, also known as the log odds or logit.
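To see that the two functions undo each other, here is a minimal round-trip sketch. It uses the standard-library `math` module for the natural logarithm purely for brevity, which departs from this project's no-import rule; the function names are my own.

```python
import math  # used here only for brevity; the project itself avoids imports

def logit(p):
    """Log odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def inverse_logit(a):
    """Map log odds a back to a probability between 0 and 1."""
    return math.exp(a) / (1 + math.exp(a))

p = 1 / 3                  # probability of rolling a 1 or 2
a = logit(p)               # negative, because p < 0.5
p_back = inverse_logit(a)  # recovers p up to floating-point rounding

print(a, p_back)
```

A probability of 0.5 corresponds to log odds of 0, probabilities above 0.5 give positive log odds, and probabilities below 0.5 give negative log odds.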

## Hypothesis Function


In Logistic Regression, the hypothesis function uses the sigmoid function to model the probability of a binary outcome. The sigmoid function is defined as:

$$ g(z) = \frac{1}{1 + e^{-z}} $$

where $z$ is the linear combination of the input features and weights, represented as:

$$ z = \theta_{0} x_{0} + \theta_{1} x_{1} + \theta_{2} x_{2} + ... + \theta_{n} x_{n} $$

Here, $\theta_{0}$ corresponds to the bias term, and $x_{0}$ is set to 1 for all input examples. The hypothesis function is then defined as:

$$ h_{\theta}(x) = g(z) = \frac{1}{1 + e^{-\vec{\theta}^{\top} \vec{x}}} $$
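The hypothesis function for a single example can be sketched in plain Python as follows. The names `sigmoid` and `hypothesis` and the sample weights are assumptions made for illustration; Euler's number is hard-coded to stay import-free.

```python
E = 2.718281828459045  # Euler's number, hard-coded to avoid imports

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z))"""
    return 1 / (1 + E ** (-z))

def hypothesis(theta, x):
    """h_theta(x) for one example; x must already include x_0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

theta = [0.0, 0.0]  # hypothetical weights (all zero)
x = [1, 3.5]        # x_0 = 1 for the bias term, then one feature

print(hypothesis(theta, x))  # 0.5, because z = 0
```

With all weights at zero the model is maximally uncertain and predicts 0.5 for every input, which is the starting point for gradient descent.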

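#### The Cross Entropy (Negative Log-Likelihood) Loss Function
For Logistic Regression, we use the cross entropy loss, which is the negative log-likelihood averaged over the training examples, to optimize the parameters of the model, rather than the MSE. Combined with the sigmoid, the MSE is not convex in $\theta$, while the cross entropy loss is: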

$$ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[y^{(i)}\log(h_{\theta}(x^{(i)})) + (1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right] $$

* $m$ is the number of training examples,  
* $y^{(i)}$ is the true label (0 or 1) for the $i$th example,  
* $h_{\theta}(x^{(i)})$ is the predicted probability that the $i$th example belongs to the positive class (i.e., has label 1).
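The loss formula above can be evaluated with a short function. This sketch uses the standard-library `math` module for the logarithm, which departs from this project's no-import rule; the function name and the sample predictions are assumptions made for illustration.

```python
import math  # used here only for brevity; the project itself avoids imports

def cross_entropy(y_true, y_prob):
    """Average cross entropy loss J for labels y_true (0 or 1) and
    predicted probabilities y_prob (each strictly between 0 and 1)."""
    m = len(y_true)
    total = 0.0
    for y, h in zip(y_true, y_prob):
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / m

labels = [1, 0]
confident_right = [0.9, 0.1]  # hypothetical predictions close to the labels
confident_wrong = [0.1, 0.9]  # hypothetical predictions far from the labels

print(cross_entropy(labels, confident_right))  # small loss
print(cross_entropy(labels, confident_wrong))  # much larger loss
```

The loss punishes confident mistakes heavily: as a predicted probability for the wrong class approaches 1, the corresponding log term grows without bound.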

## The Gradient Descent Algorithm:
To minimize the loss function and find the optimal values of the weights $\theta$, we again use gradient descent. The gradient of the loss function with respect to $\theta_{j}$ is:

$$ \frac{\partial J}{\partial \theta_{j}} = \frac{1}{m} \sum_{i=1}^{m} \left(h_{\theta}(x^{(i)}) - y^{(i)}\right) x_{j}^{(i)} $$

We then update every $\theta_{j}$ simultaneously, using the learning rate $\alpha$:

$$ \theta_{j} = \theta_{j} - \alpha \frac{\partial J}{\partial \theta_{j}} $$
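A single gradient descent step can be sketched as below. The function name `gradient_step` and the tiny dataset are hypothetical, chosen so the arithmetic is easy to follow by hand; each row of `X` already includes the leading 1 for the bias term.

```python
E = 2.718281828459045  # Euler's number, hard-coded to avoid imports

def gradient_step(theta, X, y, alpha):
    """Return new theta values after one gradient descent update."""
    m, n = len(X), len(theta)
    # Predicted probability for every example
    y_hat = []
    for i in range(m):
        z = sum(X[i][j] * theta[j] for j in range(n))
        y_hat.append(1 / (1 + E ** (-z)))
    # Partial derivative of J with respect to each theta[j]
    grad = [sum((y_hat[i] - y[i]) * X[i][j] for i in range(m)) / m
            for j in range(n)]
    # All theta values are updated from the same gradient (simultaneous update)
    return [theta[j] - alpha * grad[j] for j in range(n)]

# Hypothetical data: one feature, label switches from 0 to 1 as it grows
X = [[1, 0.0], [1, 1.0], [1, 2.0], [1, 3.0]]
y = [0, 0, 1, 1]

theta = gradient_step([0.0, 0.0], X, y, alpha=0.1)
print(theta)  # approximately [0.0, 0.05]
```

Starting from zero weights, every prediction is 0.5. The bias gradient cancels out because the labels are balanced, while the feature weight moves in the positive direction since larger feature values go with label 1.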


## The Cross Entropy Loss Function:

Recall the loss function defined earlier:

$$ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1 - h_\theta(x^{(i)}))] $$

Note: Technically, the code never needs to evaluate this loss function, since gradient descent only uses its gradient. Computing the loss is still a useful way to check that training is converging.

#### Gradient of Cross Entropy Loss Function
Differentiating $J(\theta)$ with respect to $\theta_j$, we end up with:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$$

Note that this has the same form as the gradient of the MSE in linear regression. The difference lies in how the predicted output $h_\theta(x^{(i)})$ is computed. For logistic regression, the predicted output is the sigmoid of the linear combination of the input features. For linear regression, it is the linear combination itself.
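For completeness, here is a short sketch of why the gradient takes this form. For a single example, write $h = h_\theta(x) = g(z)$ with $z = \vec{\theta}^{\top} \vec{x}$. The sigmoid has the convenient derivative:

$$ \frac{\partial g}{\partial z} = g(z)\left(1 - g(z)\right) = h(1-h) $$

Applying the chain rule to the loss of that one example, and using $\frac{\partial z}{\partial \theta_j} = x_j$:

$$ \frac{\partial}{\partial \theta_j}\left[-y\log(h) - (1-y)\log(1-h)\right] = \left(-\frac{y}{h} + \frac{1-y}{1-h}\right) h(1-h)\, x_j = (h - y)\, x_j $$

Averaging this term over all $m$ training examples gives the gradient shown above.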


## The Code

# Set the value of Euler's number.
# I am doing this instead of math.exp simply because this project does not use imported modules or libraries
e = 2.718281828459045

class LogisticRegression:

    # Initialize hyperparameters
    def __init__(self, alpha=0.001, num_iterations=1000, threshold=0.5):
        self.alpha = alpha
        self.num_iterations = num_iterations
        self.theta = None
        self.threshold = threshold

    def fit(self, X, y):
        # X is a list of rows (one list of feature values per example); y is a list of 0/1 labels
        # Prepend a 1 to each row to represent the intercept (this builds a copy, so the caller's X is unchanged)
        X = [[1] + list(row) for row in X]

        # Rows (m) and columns (n)
        m, n = len(X), len(X[0])

        # Initialize a list of 0's of length "n" for the first theta values
        self.theta = [0]*n

        for _ in range(self.num_iterations):
            # Hypothesis function: start from a list of 0's for the predicted values y_hat
            y_hat = [0]*m
            for i in range(m):
                for j in range(n):
                    y_hat[i] += X[i][j] * self.theta[j]  # this is just the hypothesis function from linear regression

                # For Logistic Regression, we need to apply the sigmoid function to each predicted value
                y_hat[i] = 1 / (1 + e**(-y_hat[i]))

            # Gradient calculation of cross-entropy (J(theta)) w.r.t. theta[j]
            dJ_dtheta = [0]*n

            for j in range(n):
                for i in range(m):
                    dJ_dtheta[j] += 1/m * ((y_hat[i] - y[i]) * X[i][j])  # partial derivative of the cross-entropy w.r.t. theta[j]

            # Update theta values. Previous theta value minus the corresponding gradient value times the learning rate
            for j in range(n):
                self.theta[j] -= self.alpha * dJ_dtheta[j]

    def predict_prob(self, X):
        # Prepend a 1 to each row to represent the intercept
        X = [[1] + list(row) for row in X]

        # Rows (m) and columns (n)
        m, n = len(X), len(X[0])

        # Create a list of zeros for probability predictions
        y_predicted_prob = [0]*m

        # Compute predictions
        for i in range(m):
            for j in range(n):
                y_predicted_prob[i] += X[i][j] * self.theta[j]

            # Apply the sigmoid once per example, after the linear combination is complete
            y_predicted_prob[i] = 1 / (1 + e**(-y_predicted_prob[i]))

        return y_predicted_prob

    def predict(self, X):
        # Compute class predictions by comparing each predicted probability to the threshold
        y_prob_pred = self.predict_prob(X)
        y_pred = [0]*len(y_prob_pred)

        for i in range(len(y_prob_pred)):
            if y_prob_pred[i] >= self.threshold:
                y_pred[i] = 1

        return y_pred
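One caveat of computing the sigmoid as `1 / (1 + e**(-z))`: when `z` is a large negative number, `e**(-z)` exceeds the range of a Python float and raises an `OverflowError`. The sketch below shows one common way to avoid this by choosing an algebraically equivalent form depending on the sign of `z`. The function name `stable_sigmoid` is my own, and this is a possible refinement, not part of the class above.

```python
E = 2.718281828459045  # Euler's number, hard-coded to avoid imports

def stable_sigmoid(z):
    """Sigmoid that never raises E to a large positive power."""
    if z >= 0:
        # -z <= 0, so E ** (-z) is at most 1 and cannot overflow
        return 1 / (1 + E ** (-z))
    # z < 0, so t is between 0 and 1 and cannot overflow;
    # t / (1 + t) equals 1 / (1 + E ** (-z)) algebraically
    t = E ** z
    return t / (1 + t)

print(stable_sigmoid(0))      # 0.5
print(stable_sigmoid(-1000))  # effectively 0, with no OverflowError
print(stable_sigmoid(1000))   # effectively 1
```

Large values of `z` are most likely when features are unscaled, so scaling the inputs before training is another way to reduce the risk.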
        
        


## Insights and Conclusion

Conceptually, I find Logistic Regression more difficult and less intuitive than Linear Regression. However, I feel I learned quite a bit by building this.

Like my Linear Regression project, there are many ways to improve. For example, NumPy would make the code more efficient by replacing the nested loops with vectorized operations. However, I chose not to use NumPy to ensure that I understood every step of the process. Avoiding NumPy also demonstrates that this project was my own work, since most online tutorials I found rely on NumPy.