# Logistic Regression

- The main difference between regression and classification is the response variable, which is qualitative in the classification case.
- Just as in regression, we have (x,y) training observations that we can use to build a classifier.

- Widely used classifiers: Logistic Regression, Linear Discriminant Analysis and K-Nearest Neighbours.

- Logistic regression models the probability that Y belongs to a particular category.

## Logistic Model

- We model the relationship between $p(X) = \Pr(Y = 1 \mid X)$ and $X$.

- Logistic function (or Sigmoid function) : 

$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{{1 + e^{\beta_0 + \beta_1 X} }}$

or

$p(X) = \frac{1}{{1 + e^{-(\beta_0 + \beta_1 X)} }}$


In [4]:
import math

# Logistic function / sigmoid function
def logistic(x):
    return 1.0 / (1 + math.exp(-x))
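One practical caveat: `math.exp(-x)` raises `OverflowError` when `x` is a large negative number (roughly below -709 for double precision). A minimal sketch of a safer variant follows; the name `stable_logistic` is hypothetical and not part of the original notebook.

```python
import math

def stable_logistic(x):
    """Sigmoid that never calls math.exp on a positive argument."""
    if x >= 0:
        # exp(-x) is at most 1 here, so it cannot overflow
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x use the equivalent form e^x / (1 + e^x)
    z = math.exp(x)
    return z / (1.0 + z)

print(stable_logistic(0.0))      # midpoint of the curve
print(stable_logistic(-1000.0))  # no OverflowError
```

Both branches are the two algebraically equivalent forms of $p(X)$ given above, so the values agree with `logistic` wherever that function does not overflow.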

After a bit of manipulation we get:

$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}$


- The quantity $\frac{p(X)}{1 - p(X)}$ is called the odds and can take any value between 0 and $\infty$.
- For example, $p = 0.9$ implies odds of $\frac{0.9}{1 - 0.9} = 9$: on average, 9 out of 10 people with these odds will default.
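The conversion between probability and odds can be checked numerically. This is a small sketch; the helper names `odds` and `prob_from_odds` are illustrative and do not appear in the original notebook.

```python
def odds(p):
    """Odds p / (1 - p) for a probability 0 <= p < 1."""
    return p / (1.0 - p)

def prob_from_odds(o):
    """Inverse mapping: probability o / (1 + o) for odds o >= 0."""
    return o / (1.0 + o)

for p in (0.2, 0.5, 0.9):
    print(p, round(odds(p), 4))
```

Probabilities below 0.5 give odds below 1, and the odds grow without bound as the probability approaches 1.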

- Taking the log of both sides of the equation:

$\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X$

- The left-hand side is called the log-odds or logit.

- In linear regression, $\beta_1$ gives the average change in Y for a one-unit increase in X. Here, a one-unit increase in X changes the log-odds by $\beta_1$. However, $\beta_1$ does not correspond to the change in $p(X)$ associated with a one-unit increase in X, since the relationship between X and $p(X)$ is not a straight line.

- The amount by which $p(X)$ changes for a one-unit increase in X depends on the current value of X.
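To make these two points concrete, the sketch below compares the change in log-odds with the change in $p(X)$ for a one-unit increase at different starting values of X. The coefficients $\beta_0 = -3$ and $\beta_1 = 1$ are made up for illustration and are not fitted to any data.

```python
import math

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def log_odds(p):
    return math.log(p / (1.0 - p))

beta_0, beta_1 = -3.0, 1.0  # illustrative values only

def p_of(x):
    """p(X) under the single-predictor logistic model."""
    return logistic(beta_0 + beta_1 * x)

for x in (0.0, 3.0, 6.0):
    delta_log_odds = log_odds(p_of(x + 1)) - log_odds(p_of(x))
    delta_p = p_of(x + 1) - p_of(x)
    print(f"x={x}: change in log-odds={delta_log_odds:.3f}, change in p={delta_p:.3f}")
```

The change in log-odds is the same at every starting point, while the change in $p(X)$ is largest near $p(X) = 0.5$ and shrinks in the tails.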

### Determining Coefficients :

- To determine the coefficients $\beta_0$ and $\beta_1$ we use the maximum likelihood method.

- In this method we look for values of $\beta_0$ and $\beta_1$ such that plugging them into the model for $p(X)$ gives a number close to 1 for individuals who defaulted and close to 0 for those who did not.

- The estimates of $\beta_0$ and $\beta_1$ are chosen to maximize the likelihood function. If $f$ is the logistic function, then $f(x, \beta)$ is the probability that y is 1 and $1-f(x, \beta)$ is the probability that y is 0.

- Hence, the likelihood of a single observation $(x, y)$ is:

$L(\beta_0, \beta_1) = f(x, \beta)^y \, (1-f(x, \beta))^{1-y}$

- It is simpler to work with the log of this function:

$\log L(\beta_0, \beta_1) = y \log f(x, \beta) + (1-y) \log(1-f(x, \beta))$

- If we assume the data points are independent of one another, the overall likelihood is the product of the individual likelihoods, and the overall log-likelihood is the sum of the individual log-likelihoods.
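Written out for $n$ independent observations $(x_i, y_i)$, the product and its logarithm take the following form. The index $i$ and the count $n$ are notation introduced here for illustration.

$L(\beta) = \prod_{i=1}^{n} f(x_i, \beta)^{y_i} \, (1-f(x_i, \beta))^{1-y_i}$

$\log L(\beta) = \sum_{i=1}^{n} \left[ y_i \log f(x_i, \beta) + (1-y_i) \log(1-f(x_i, \beta)) \right]$

The sum is usually preferred in computation, because a product of many probabilities quickly becomes too small to represent in floating point.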

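In [5]:
# Derivative of the logistic function (this is the density of the
# logistic distribution, not the likelihood)
def logistic_prime(x):
    return logistic(x) * (1 - logistic(x))

# Dot product of two equal-length vectors (helper assumed by the code below)
def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

# Log likelihood of a single data point (x_i, y_i)
def logistic_log_likelihood_i(x_i, y_i, beta):
    if y_i == 1:
        return math.log(logistic(dot(x_i, beta)))
    else:
        return math.log(1 - logistic(dot(x_i, beta)))


# Log likelihood of the whole data set: sum over the data points
def logistic_log_likelihood(x, y, beta):
    return sum(logistic_log_likelihood_i(x_i, y_i, beta)
               for x_i, y_i in zip(x, y))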

### Multiple Logistic Regression

- Predict a binary response using multiple predictors.

- By analogy we can generalize the formula to $p$ predictors:

$\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$

- and $p(X)$ becomes:

$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}$
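A short sketch of the formula for several predictors follows. The function name `predict_probability` and the coefficient values are assumptions made for illustration; nothing here is estimated from data.

```python
import math

def predict_probability(x, beta):
    """p(X) for one observation.

    x    : predictor values [x_1, ..., x_p]
    beta : coefficients [beta_0, beta_1, ..., beta_p], intercept first
    """
    linear = beta[0] + sum(b * x_j for b, x_j in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-linear))

# Illustrative coefficients only
beta = [-1.0, 0.5, 2.0]
print(predict_probability([2.0, 0.0], beta))  # linear part is 0, so p = 0.5
```

Prepending a constant 1 to every observation, as the example later in this notebook does, lets the intercept be handled by the same dot product as the other coefficients.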

### Calculating Gradient

- Now we need to calculate the gradient of the log-likelihood function.

- $\beta x$ is the dot product of the vectors $\beta$ and x, or simply a weighted sum of the components of x (with $\beta$ containing the weights).

- To maximize the likelihood we need to find the value of $\beta$ that maximizes it. This can be done with an optimization algorithm.

- First we need the partial derivative of the log-likelihood with respect to each parameter.
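For a single observation $(x_i, y_i)$, write $f_i = f(x_i, \beta)$ for the logistic function evaluated at $\beta x_i$, and $x_{ij}$ for the j-th component of $x_i$ (this subscript notation is introduced here). Since the derivative of the logistic function is $f(1-f)$, the chain rule gives:

$\frac{\partial}{\partial \beta_j} \log L_i = \frac{y_i}{f_i} f_i (1-f_i) x_{ij} - \frac{1-y_i}{1-f_i} f_i (1-f_i) x_{ij}$

$= y_i (1-f_i) x_{ij} - (1-y_i) f_i x_{ij} = (y_i - f_i) x_{ij}$

So each partial derivative is the prediction error $y_i - f_i$ scaled by the corresponding component of $x_i$.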


In [6]:
from functools import reduce

def logistic_log_partial_ij(x_i, y_i, beta, j):
    """here i is the index of the data point,
    j is the index of the derivative"""

    return (y_i - logistic(dot(x_i, beta))) * x_i[j]

# Calculating the gradient
def logistic_log_gradient_i(x_i, y_i, beta):
    """gradient of log of likelihood function
    corresponding to the ith data point"""

    return [logistic_log_partial_ij(x_i, y_i, beta, j)
            for j, _ in enumerate(beta)]

# Componentwise sum of two vectors (helper assumed by logistic_log_gradient)
def vector_add(v, w):
    return [v_i + w_i for v_i, w_i in zip(v, w)]

# Gradient for the whole data set: sum of the per-point gradients
def logistic_log_gradient(x, y, beta):
    return reduce(vector_add,
                  [logistic_log_gradient_i(x_i, y_i, beta)
                   for x_i, y_i in zip(x, y)])
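A common sanity check for a hand-derived gradient is to compare it with a finite-difference approximation of the log-likelihood. The sketch below is self-contained and uses its own function names (`log_likelihood`, `analytic_gradient`, `numeric_gradient`) and a tiny made-up data set; none of these come from the original notebook.

```python
import math

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def log_likelihood(x, y, beta):
    total = 0.0
    for x_i, y_i in zip(x, y):
        p = logistic(dot(x_i, beta))
        total += math.log(p) if y_i == 1 else math.log(1 - p)
    return total

def analytic_gradient(x, y, beta):
    """Sum over data points of (y_i - f_i) * x_ij for each j."""
    grad = [0.0] * len(beta)
    for x_i, y_i in zip(x, y):
        error = y_i - logistic(dot(x_i, beta))
        for j in range(len(beta)):
            grad[j] += error * x_i[j]
    return grad

def numeric_gradient(x, y, beta, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for j in range(len(beta)):
        up = [b + (h if k == j else 0.0) for k, b in enumerate(beta)]
        down = [b - (h if k == j else 0.0) for k, b in enumerate(beta)]
        diff = log_likelihood(x, y, up) - log_likelihood(x, y, down)
        grad.append(diff / (2 * h))
    return grad

# Tiny made-up data set: each row is [1, feature], labels are 0 or 1
x = [[1, 0.5], [1, -1.2], [1, 2.0], [1, 0.1]]
y = [1, 0, 1, 0]
beta = [0.1, -0.3]

print(analytic_gradient(x, y, beta))
print(numeric_gradient(x, y, beta))
```

If the two printed vectors agree to several decimal places, the analytic expression is very likely correct.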

### Example:

- We have to determine whether a user has a premium account based on work experience and salary.
- The data set contains 200 users.

In [2]:
# data is a matrix in which each row is a list [experience, salary, paid_account]

x = [[1] + row[:2] for row in data] #each element is [1, experience, salary]
y = [row[2] for row in data] # each element is paid_account

$\text{PaidAccount} = \beta_0 + \beta_1 \cdot \text{experience} + \beta_2 \cdot \text{salary} + \epsilon$

In [None]:
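import matplotlib.pyplot as plt

# rescale, estimate_beta and predict are helpers defined outside this section
rescaled_x = rescale(x)
beta = estimate_beta(rescaled_x, y)  # [0.26, 0.43, -0.43]
predictions = [predict(x_i, beta) for x_i in rescaled_x]

plt.scatter(predictions, y)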

### Applying the model

In [None]:
import random
from functools import partial

random.seed(0)
# Assumes train_test_split returns (x_train, x_test, y_train, y_test)
x_train, x_test, y_train, y_test = train_test_split(rescaled_x, y, 0.33)

# Want to maximize the log likelihood on the training data
fn = partial(logistic_log_likelihood, x_train, y_train)
gradient_fn = partial(logistic_log_gradient, x_train, y_train)

# Pick a random starting point
beta_0 = [random.random() for _ in range(3)]

# Maximize the log likelihood with batch gradient steps
# (maximize_batch is a helper defined outside this section)
beta_hat = maximize_batch(fn, gradient_fn, beta_0)
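Since `maximize_batch` is not shown in this section, the following is only a rough sketch of what a batch maximizer could look like: plain gradient ascent with a fixed step size. The function `gradient_ascent`, its `step` and `n_iter` parameters, and the simulated data are all assumptions made for illustration, not the helper used above.

```python
import math
import random

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def log_likelihood(x, y, beta):
    total = 0.0
    for x_i, y_i in zip(x, y):
        p = logistic(dot(x_i, beta))
        total += math.log(p) if y_i == 1 else math.log(1 - p)
    return total

def gradient(x, y, beta):
    grad = [0.0] * len(beta)
    for x_i, y_i in zip(x, y):
        error = y_i - logistic(dot(x_i, beta))
        for j in range(len(beta)):
            grad[j] += error * x_i[j]
    return grad

def gradient_ascent(x, y, beta_start, step=0.1, n_iter=500):
    """Move beta uphill along the averaged gradient for a fixed number of steps."""
    beta = list(beta_start)
    for _ in range(n_iter):
        grad = gradient(x, y, beta)
        beta = [b + step * g / len(x) for b, g in zip(beta, grad)]
    return beta

# Simulated data: labels drawn from a logistic model with made-up coefficients
random.seed(0)
x = [[1, random.gauss(0, 1)] for _ in range(100)]
y = [1 if random.random() < logistic(-0.5 + 2.0 * x_i[1]) else 0 for x_i in x]

beta_start = [0.0, 0.0]
beta_fit = gradient_ascent(x, y, beta_start)
print(log_likelihood(x, y, beta_start), log_likelihood(x, y, beta_fit))
```

The log-likelihood of a logistic model is concave, so a small enough fixed step moves steadily toward the maximum; a production optimizer would typically add a stopping criterion or an adaptive step size.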