# Logistic Regression

## Mathematical Understanding
Logistic regression is a way to map continuous inputs to a binary output. In essence, we feed the model a set of input features and ask for the answer to a binary question, i.e. whether the output is a Yes (1) or a No (0).
 
So an excellent example would be:
> Given the following study habits (number of hours studied, number of books read), what is the probability that the student passes, i.e. how do the study habits affect the chances of passing the exam?

In this example, we are looking for a function
$$
f: (n_s, n_b) \rightarrow \{0,1\}
$$

However, we do not model it as a function that spits out a binary output, but rather as a function that gives a probability:

$$
p: (n_s, n_b) \rightarrow (0,1)
$$

### Model
In the simple case of a single input, $p$ becomes a function of one variable, for which we use the ansatz:
$$
p(x)=\frac{1}{1+e^{-(\beta_0+\beta_1 x)}}
$$
We can check that this function only outputs values in the open interval $(0,1)$: the exponential is strictly positive, so the denominator is always greater than $1$.

Our whole task in this regression model is then to find the $\beta_0$ and $\beta_1$ that best fit the data. $p(x)$ is the probability that the output is 1 at $x$, and conversely $1-p(x)$ is the probability that the output is 0.
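
As a quick numerical check of the claims above, here is a minimal sketch that evaluates $p(x)$ on a grid. The values `beta_0 = -4.0` and `beta_1 = 1.5` are made up purely for illustration and are not fitted to any data.

```python
import numpy as np

def p(x, beta_0, beta_1):
    """Logistic model p(x) = 1 / (1 + exp(-(beta_0 + beta_1 * x)))."""
    return 1.0 / (1.0 + np.exp(-(beta_0 + beta_1 * x)))

# Hypothetical parameters, chosen only for illustration
beta_0, beta_1 = -4.0, 1.5

x = np.linspace(-10, 10, 201)
probs = p(x, beta_0, beta_1)

# Every output stays strictly between 0 and 1
print(probs.min() > 0, probs.max() < 1)

# The probability crosses 0.5 where beta_0 + beta_1 * x = 0
print(p(-beta_0 / beta_1, beta_0, beta_1))
```

The point $x = -\beta_0/\beta_1$, where the probability is one half, is the decision boundary that a 0.5 threshold would use.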

To generalise to an $n$-dimensional input $\mathbf{x}$:
$$
p(\mathbf{x})=\frac{1}{1+e^{-(\boldsymbol{\beta} \cdot \mathbf{x}+\beta_0)}}
$$

where $\boldsymbol{\beta}$ is a vector that is dotted with the vector $\mathbf{x}$ (by vector I just mean an array of numbers, so that the long sum of products can be written compactly as a dot product). In this case, we have to determine each of the $n$ components of $\boldsymbol{\beta}$ as well as $\beta_0$.
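
The dot-product form maps directly onto a matrix-vector product when there are several data points. The sketch below assumes a hypothetical helper `predict_proba` and made-up values for the weights, the intercept, and the three rows of data (hours studied, books read); none of these come from a real fit.

```python
import numpy as np

def predict_proba(X, beta, beta_0):
    """p(x) = 1 / (1 + exp(-(beta . x + beta_0))) for every row x of X."""
    return 1.0 / (1.0 + np.exp(-(X @ beta + beta_0)))

# Hypothetical data: 3 students, columns = (hours studied, books read)
X = np.array([[2.0, 1.0],
              [5.0, 2.0],
              [9.0, 4.0]])
beta = np.array([0.8, 0.5])   # made-up weights, one per feature
beta_0 = -5.0                 # made-up intercept

probs = predict_proba(X, beta, beta_0)
print(probs)  # one probability per row of X
```

With these particular numbers the second student lands exactly on $\boldsymbol{\beta} \cdot \mathbf{x}+\beta_0 = 0$, so the predicted probability there is one half.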

### Fit
Now, to determine the function above, we need to fit the curve to the data that we already have.

In most cases, we do this by choosing a suitable loss function, which tracks how much our model's predictions deviate from the outputs observed in the real world. Here we can use the log-loss to calculate the loss of each data point and then sum the losses up.

To calculate the log-loss, we go about as follows:
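$$
\text{Logloss}(x_k)=\begin{cases}
-\ln(p_k) & \text{if } y_k=1\\
-\ln(1-p_k) & \text{if } y_k=0
\end{cases}
$$

This can be written more neatly as $-\left(y_k\ln(p_k)+(1-y_k)\ln(1-p_k)\right)$, where $p_k = p(x_k)$ is the predicted probability and $y_k$ is the observed label.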

To copy some fancy jargon (formal language) from Wikipedia:
This expression is more formally known as the cross entropy of the predicted distribution $(p_{k},(1-p_{k}))$ from the actual distribution $(y_{k},(1-y_{k}))$, as probability distributions on the two-element space of (pass, fail).
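
The piecewise and single-expression forms of the per-point log-loss can be checked against each other numerically. This is a small sketch; the function names `log_loss_cases` and `log_loss_compact` and the probe values are chosen here for illustration.

```python
import numpy as np

def log_loss_cases(p_k, y_k):
    """Piecewise definition: -ln(p_k) if y_k == 1, else -ln(1 - p_k)."""
    return -np.log(p_k) if y_k == 1 else -np.log(1 - p_k)

def log_loss_compact(p_k, y_k):
    """Single-expression form: -(y_k ln(p_k) + (1 - y_k) ln(1 - p_k))."""
    return -(y_k * np.log(p_k) + (1 - y_k) * np.log(1 - p_k))

results = []
for p_k in (0.1, 0.5, 0.9):
    for y_k in (0, 1):
        a = log_loss_cases(p_k, y_k)
        b = log_loss_compact(p_k, y_k)
        results.append((p_k, y_k, a, b))
        print(f"p={p_k}, y={y_k}: cases={a:.4f}, compact={b:.4f}")
```

The loss is small when the predicted probability agrees with the label (for example $p_k = 0.9$, $y_k = 1$) and grows as the prediction becomes confidently wrong.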

Thus, the quantity to be minimised is the mean log-loss over all $N$ data points:
$$
l=-\frac{1}{N}\sum_{k=1}^N \left(y_k\ln(p_k)+(1-y_k)\ln(1-p_k)\right)
$$

Or, if we just remove the minus sign from the RHS, we reduce the problem to maximising $-l$.

#### Log odds and assumptions
Let's define the odds as follows (we stick with the 1-D case here):
$$
o(x)=\frac{p(x)}{1-p(x)}=e^{\beta_0+\beta_1 x}
$$

The log-odds are thus:
$$
\log(o(x))=\beta_0+\beta_1 x
$$

The basic assumption of logistic regression is that the log-odds depend linearly on the input variables, which is not necessarily true of real data.
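
The linearity of the log-odds is easy to confirm numerically. The sketch below uses made-up parameters (`beta_0 = -4.0`, `beta_1 = 1.5`), computes $p(x)$ on a few points, and checks that $\log\left(p/(1-p)\right)$ reproduces the straight line $\beta_0+\beta_1 x$.

```python
import numpy as np

beta_0, beta_1 = -4.0, 1.5   # hypothetical parameters, for illustration only
x = np.linspace(-3, 8, 12)

# Probability from the logistic model
p = 1.0 / (1.0 + np.exp(-(beta_0 + beta_1 * x)))

# Odds and log-odds as defined above
odds = p / (1 - p)
log_odds = np.log(odds)

# The log-odds should match the linear predictor beta_0 + beta_1 * x
is_linear = np.allclose(log_odds, beta_0 + beta_1 * x)
print(is_linear)
```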

### Estimating the parameters $\boldsymbol{\beta}$ and $\beta_0$

This now boils down to simple multivariable calculus: to find the extrema of a function, we take its partial derivative with respect to each parameter and set it to $0$:
$$
\frac{\partial l}{\partial \beta_i}= 0 
$$
for $i \in \{0,1,\dots,n\}$.

### Gradient Descent
These equations have no closed-form solution in general, so we solve them numerically with gradient descent. First, let us calculate explicit formulas for the gradient with $n$-dimensional inputs.

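For $\beta_m$, $m\in \{0,1,\dots, n\}$, with $N$ data points and the convention $x_{0,i}= 1$ (so that $\beta_0$ is treated like any other coefficient), and using $\frac{\partial p_i}{\partial \beta_m} = x_{m,i}\,p_i(1-p_i)$:
$$
    \frac{\partial l}{\partial \beta_m}= -\frac{1}{N}\sum_{i=1}^N \left[ y_i\,x_{m,i} (1-p_i) - (1-y_i)\,x_{m,i}\, p_i \right]
$$

$$
= \frac{1}{N}\sum_{i=1}^N x_{m,i}(p_i-y_i)
$$

Thus the gradient is:
$$
\nabla l=\begin{bmatrix}
     \frac{\partial l}{\partial \beta_0}\\
     \frac{\partial l}{\partial \beta_1}\\
     \vdots\\
     \frac{\partial l}{\partial \beta_n}\\
    \end{bmatrix}=
    \begin{bmatrix}
    \frac{1}{N}\sum_{i=1}^{N} (p_i-y_i)x_{0,i}\\
    \frac{1}{N}\sum_{i=1}^{N} (p_i-y_i)x_{1,i}\\
    \vdots\\
    \frac{1}{N}\sum_{i=1}^{N} (p_i-y_i)x_{n,i}\\
    \end{bmatrix}
$$

Gradient descent then starts from an initial guess and repeatedly steps against the gradient, $\beta_m \leftarrow \beta_m - \eta\,\frac{\partial l}{\partial \beta_m}$, where the step size $\eta$ is the learning rate (`lr` in the implementation below).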
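
A finite-difference check is a handy way to make sure the sign and scaling of a hand-derived gradient are right. The sketch below compares the form used by the `fit` method further down, `X.T @ (p - y) / N`, against central differences of the mean log-loss. The data is synthetic and the function names are chosen here for illustration.

```python
import numpy as np

def mean_log_loss(beta, X, y):
    """Mean log-loss l; X already contains a leading column of ones (x_0 = 1)."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad(beta, X, y):
    """Gradient of the mean log-loss: (1/N) * X^T (p - y)."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (p - y) / len(y)

def numeric_grad(beta, X, y, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(beta)
    for m in range(len(beta)):
        step = np.zeros_like(beta)
        step[m] = eps
        grad[m] = (mean_log_loss(beta + step, X, y)
                   - mean_log_loss(beta - step, X, y)) / (2 * eps)
    return grad

# Synthetic data, made up for this check only
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = rng.integers(0, 2, size=50).astype(float)
beta = rng.normal(size=3)

grads_match = np.allclose(analytic_grad(beta, X, y),
                          numeric_grad(beta, X, y), atol=1e-6)
print(grads_match)
```

If the sign of the analytic gradient were flipped, the two results would disagree, and a descent step would climb the loss instead of reducing it.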




## Training and Prediction

```python
# Definitions
import numpy as np
from numpy import log, dot, e, shape
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import confusion_matrix
```

```python
def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x))."""
    return 1 / (1 + np.exp(-x))


class Logistic_Regression():
    def __init__(self, lr=0.01, n_iters=1000000):
        self.lr = lr              # learning rate (gradient-descent step size)
        self.n_iters = n_iters    # number of gradient-descent iterations
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        for _ in range(self.n_iters):
            linear_pred = np.dot(X, self.weights) + self.bias
            predictions = sigmoid(linear_pred)

            # Gradient of the mean log-loss w.r.t. the weights and the bias
            dw = (1/n_samples) * np.dot(X.T, (predictions - y))
            db = (1/n_samples) * np.sum(predictions - y)

            # Step against the gradient
            self.weights = self.weights - self.lr*dw
            self.bias = self.bias - self.lr*db

    def predict(self, X):
        linear_pred = np.dot(X, self.weights) + self.bias
        y_pred = sigmoid(linear_pred)
        # Threshold the probabilities at 0.5 to get class labels
        return (y_pred >= 0.5).astype(int)
```

```python
clf = Logistic_Regression()

# Training data
df = pd.read_csv('data/ds1_train.csv')
X_0 = df.drop(columns=["y"])
y_0 = df["y"]
X0 = X_0.to_numpy()
y0 = y_0.to_numpy().flatten()

# Testing data
# Note: X_0 and y_0 are reassigned here, so from now on they hold the test split
df1 = pd.read_csv('data/ds1_test.csv')
X_0 = df1.drop(columns=["y"])
y_0 = df1["y"]
X1 = X_0.to_numpy()
y1 = y_0.to_numpy().flatten()
```

```python
clf.fit(X0, y0)
```

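```python
y_pred = clf.predict(X1)

def accuracy(y_pred, y_test):
    """Fraction of predictions that match the true labels."""
    return np.sum(np.asarray(y_pred) == np.asarray(y_test)) / len(y_test)

acc = accuracy(y_pred, y1)
print("Accuracy: ", acc)
# Note: confusion_matrix expects (y_true, y_pred). With the arguments in this
# order, rows are the predicted classes and columns are the true classes.
print("Confusion Matrix: \n", confusion_matrix(y_pred, y1))
# An unusual number of false positives (16, bottom-left entry)
```

```
Accuracy:  0.82
Confusion Matrix: 
 [[34  2]
 [16 48]]
```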
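
To make the remark about false positives concrete, the counts in the matrix printed above can be unpacked by hand. This sketch only re-uses those four printed numbers; the orientation follows from the call `confusion_matrix(y_pred, y1)`, which puts the predicted class on the rows.

```python
import numpy as np

# Matrix as printed above. Because the call was confusion_matrix(y_pred, y1),
# rows index the predicted class and columns index the true class.
cm = np.array([[34, 2],
               [16, 48]])

tn, fn = cm[0]   # predicted 0: true negatives, false negatives
fp, tp = cm[1]   # predicted 1: false positives, true positives

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many were found

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
# accuracy=0.82, precision=0.75, recall=0.96
```

The gap between recall and precision reflects the same imbalance: the model rarely misses a positive, but it labels a fair number of negatives as positive.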


## Scikit-learn implementation

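```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
# Caution: X_0 and y_0 were reassigned to the test split when the test data
# was loaded, so this call fits the model on the test data.
# Use model.fit(X0, y0) to fit on the training split instead.
model.fit(X_0, y_0)
```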

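```python
y_pred_scikit = model.predict(X1)
acc = accuracy(y_pred_scikit, y1)
print("Accuracy: ", acc)
# Rows are the predicted classes and columns are the true classes here,
# because the predictions are passed as the first argument.
print("Confusion Matrix: \n", confusion_matrix(y_pred_scikit, y1))
# Fewer false positives (7) than the vanilla implementation (16)
```

```
Accuracy:  0.9
Confusion Matrix: 
 [[43  3]
 [ 7 47]]
```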




#### **Vanilla implementation's accuracy**: $82\%$

#### **Scikit-Learn pre-built's accuracy**: $90\%$

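So the scikit-learn model is more accurate on this test set. The comparison is not quite like-for-like, though: the scikit-learn model was fit on `X_0, y_0`, which held the test split at that point, whereas the vanilla model was fit on the training split.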