<a href="https://colab.research.google.com/github/vermadev54/Machine-learning-basic-algorithm/blob/master/Understanding%20Logistic%20Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Understanding Logistic Regression**
# Pre-requisite: **`Linear Regression`**

This article discusses the basics of Logistic Regression and its implementation in Python. Logistic regression is a supervised classification algorithm. In a classification problem, the target variable (or output), y, can take only discrete values for a given set of features (or inputs), X.
In other words, the target variable is categorical. Based on the number and kind of categories, logistic regression can be classified as:



1.   ***binomial***: the target variable can have only 2 possible types: “0” or “1”, which may represent “win” vs “loss”, “pass” vs “fail”, “dead” vs “alive”, etc.
2.   ***multinomial***: the target variable can have 3 or more possible types which are not ordered (i.e. the types have no quantitative significance), like “disease A” vs “disease B” vs “disease C”.
3.   ***ordinal***: the target variable has ordered categories. For example, a test score can be categorized as: “very poor”, “poor”, “good”, “very good”. Here, each category can be given a score like 0, 1, 2, 3.

First of all, we explore the simplest form of Logistic Regression, i.e. Binomial Logistic Regression.

**Binomial Logistic Regression**

Consider an example dataset which maps the number of hours of study to the result of an exam. The result can take only two values, namely passed (1) or failed (0):
<pre><table>
<tr>
<th>Hours(x)</th>
<td>0.50</td>
<td>0.75</td>
<td>1.00</td>
<td>1.25</td>
<td>1.50</td>
<td>1.75</td>
<td>2.00</td>
<td>2.25</td>
<td>2.50</td>
<td>2.75</td>
<td>3.00</td>
<td>3.25</td>
<td>3.50</td>
<td>3.75</td>
<td>4.00</td>
<td>4.25</td>
<td>4.50</td>
<td>4.75</td>
<td>5.00</td>
<td>5.50</td>
</tr>
<tr>
<th>Pass(y)</th>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
</table></pre>
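For the examples that follow, it helps to have the table in code. Here is a minimal sketch that copies the values above into two NumPy arrays; the names `hours` and `passed` are our own choice and do not come from any library.

```python
import numpy as np

# Study hours (x) and exam results (y), copied from the table above.
hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50, 2.75,
                  3.00, 3.25, 3.50, 3.75, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
                   1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

print("observations:", hours.shape[0])
print("number of passes:", passed.sum())
```

Both arrays have one entry per student, so `hours[i]` and `passed[i]` describe the same observation.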
So, we have:

![alt text](https://latex.codecogs.com/gif.latex?y%20%3D%20%5Cleft%5C%7B%5Cbegin%7Bmatrix%7D%200%2Cif%20fail%5C%5C%201%2Cif%20pass%5C%5C%20%5Cend%7Bmatrix%7D%5Cright.)

i.e. y is a categorical target variable which can take only two possible values: “0” or “1”.
In order to generalize our model, we assume that:





* The dataset has ‘p’ feature variables and ‘n’ observations.
* The feature matrix is represented as:

![alt text](https://latex.codecogs.com/gif.latex?%5Cmathbf%7BX%7D%20%3D%5Cbegin%7Bpmatrix%7D%201%20%26%20x_%7B11%7D%20%26%20%5Ccdots%20%26%20x_%7B1p%7D%20%5C%5C%201%20%26%20x_%7B21%7D%20%26%20%5Ccdots%20%26%20x_%7B2p%7D%20%5C%5C%20%5Cvdots%20%26%20%5Cvdots%20%26%20%5Cddots%20%26%20%5Cvdots%20%5C%5C%201%20%26%20x_%7Bn1%7D%20%26%20%5Ccdots%20%26%20x_%7Bnp%7D%20%5Cend%7Bpmatrix%7D)

Here, x_{ij} denotes the value of the j^{th} feature for the i^{th} observation.
We keep the convention of letting x_{i0} = 1 (keep reading, the reason becomes clear in a few moments).

* The i^{th} observation, x_i, can be represented as:

![alt text](https://latex.codecogs.com/gif.latex?x_i%20%3D%20%5Cbegin%7Bbmatrix%7D%201%5C%5C%20x_%7Bi1%7D%5C%5C%20x_%7Bi2%7D%5C%5C%20.%5C%5C%20.%5C%5C%20x_%7Bip%7D%5C%5C%20%5Cend%7Bbmatrix%7D)

* h(x_i) represents the predicted response for the i^{th} observation, x_i. The formula we use for calculating h(x_i) is called the **hypothesis**.


If you have gone through Linear Regression, you should recall that in Linear Regression, the hypothesis we used for prediction was:

![alt text](https://latex.codecogs.com/gif.latex?h%28x_i%29%20%3D%20%5Cbeta_0%20+%20%5Cbeta_1x_%7Bi1%7D%20+%20%5Cbeta_2x_%7Bi2%7D%20+%20.....%20+%20%5Cbeta_px_%7Bip%7D)

where ***B_0, B_1, …, B_p*** are the regression coefficients.
Let the regression coefficient vector, ***B***, be:

![alt text](https://latex.codecogs.com/gif.latex?%5Cbeta%20%3D%20%5Cbegin%7Bbmatrix%7D%20%5Cbeta_0%5C%5C%20%5Cbeta_1%5C%5C%20%5Cbeta_2%5C%5C%20.%5C%5C%20.%5C%5C%20%5Cbeta_p%5C%5C%20%5Cend%7Bbmatrix%7D)


Then, in a more compact form,

![alt text](https://latex.codecogs.com/gif.latex?h%28x_i%29%20%3D%20%5Cbeta%5ET%20x_i)

***The reason for taking x_0 = 1 is pretty clear now. We needed a product of two vectors of equal length, but there was no actual x_0 multiplied by B_0 in the original hypothesis formula. So, we defined x_0 = 1.***
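The role of the x_0 = 1 column can be checked numerically. The sketch below uses a few hours values from the table and hypothetical coefficients (chosen only for illustration) to show that the dot product of **B** with each x_i gives the same numbers as the expanded formula B_0 + B_1 x_{i1}.

```python
import numpy as np

# A few "hours" values from the example table (one feature, p = 1).
x = np.array([0.50, 1.75, 3.00])

# Feature matrix with a leading column of ones (x_0 = 1 for every row).
X = np.column_stack((np.ones_like(x), x))

# Hypothetical coefficients [B_0, B_1], for illustration only.
beta = np.array([-1.0, 0.5])

h_compact = X @ beta                 # one dot product per observation
h_expanded = beta[0] + beta[1] * x   # the original, expanded hypothesis

print(np.allclose(h_compact, h_expanded))
```

Without the column of ones, the intercept B_0 would have nothing to multiply and would have to be added as a separate step.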

Now, if we try to apply Linear Regression to the above problem, we are likely to get continuous values using the hypothesis we discussed above. Also, since the target is either 0 or 1, it does not make sense for h(x_i) to take values larger than 1 or smaller than 0.
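To see this on the study-hours data, here is a small sketch that fits a straight line to the table with `np.polyfit` (an ordinary least-squares fit, used here purely for illustration) and looks at the range of the fitted values.

```python
import numpy as np

hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50, 2.75,
                  3.00, 3.25, 3.50, 3.75, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
                   1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

# Least-squares line: polyfit returns [slope, intercept] for degree 1.
slope, intercept = np.polyfit(hours, passed, 1)
linear_pred = intercept + slope * hours

print("smallest fitted value:", linear_pred.min())
print("largest fitted value:", linear_pred.max())
```

The fitted line drops below 0 for the smallest number of hours and rises above 1 for the largest, which is exactly the behaviour we want to avoid for a 0/1 target.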

So, some modifications are made to the hypothesis for classification:


![alt text](https://latex.codecogs.com/gif.latex?h%28x_i%29%20%3D%20g%28%5Cbeta%5ET%20x_i%29%20%3D%20%5Cfrac%7B1%7D%7B1%20+%20e%5E%7B-%5Cbeta%5ET%20x_i%7D%7D)

where,

![alt text](https://latex.codecogs.com/gif.latex?g%28z%29%20%3D%20%5Cfrac%7B1%7D%7B1%20+%20e%5E%7B-z%7D%7D)

is called **logistic function** or the sigmoid function.
Here is a plot showing g(z):

![alt text](https://cdncontribute.geeksforgeeks.org/wp-content/uploads/logistic-function.png)

We can infer from above graph that:

* g(z) tends towards 1 as z --> +infinity
* g(z) tends towards 0 as z --> -infinity
* g(z) is always bounded between 0 and 1
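These properties are easy to check numerically. A minimal sketch, with a helper named `sigmoid` of our own:

```python
import numpy as np


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


# Large negative, moderate, zero, moderate and large positive inputs.
z = np.array([-50.0, -2.0, 0.0, 2.0, 50.0])
g = sigmoid(z)

for zi, gi in zip(z, g):
    print(f"g({zi:+.0f}) = {gi}")
```

Note that g(0) is exactly 0.5, which is why 0.5 is the natural threshold when turning the output into a 0/1 label.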

So, now, we can define the conditional probabilities of the 2 labels (0 and 1) for the i^{th} observation as:

![alt text](https://latex.codecogs.com/gif.latex?%5Cnewline%20P%28y_i%20%3D%201%7Cx_i%3B%20%5Cbeta%29%20%3D%20h%28x_i%29%20%5Cnewline%20P%28y_i%3D0%7Cx_i%3B%20%5Cbeta%29%20%3D%201%20-%20h%28x_i%29)

We can write it more compactly as:

![alt text](https://latex.codecogs.com/gif.latex?P%28y_i%7Cx_i%3B%5Cbeta%29%20%3D%20%28h%28x_i%29%29%5E%7By_i%7D%281-h%28x_i%29%29%5E%7B1-y_i%7D)
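The compact expression works because one of the two exponents is always 0. The sketch below checks this for a few hypothetical values of h(x_i); the function name `label_probability` is ours.

```python
import numpy as np


def label_probability(h, y):
    """Compact form P(y | x) = h^y * (1 - h)^(1 - y) for y in {0, 1}."""
    return h**y * (1 - h)**(1 - y)


# Hypothetical predicted probabilities h(x_i), for illustration only.
h = np.array([0.1, 0.5, 0.9])

p_y1 = label_probability(h, 1)   # should equal h
p_y0 = label_probability(h, 0)   # should equal 1 - h

print(np.allclose(p_y1, h), np.allclose(p_y0, 1 - h))
```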

Now, we define another term, the **likelihood of parameters**, as:

![alt text](https://latex.codecogs.com/gif.latex?%5Cnewline%20L%28%5Cbeta%29%20%3D%20%5Cprod_%7Bi%3D1%7D%5E%7Bn%7DP%28y_i%7Cx_i%3B%5Cbeta%29%20%5Cnewline%20or%20%5Cnewline%20L%28%5Cbeta%29%20%3D%20%5Cprod_%7Bi%3D1%7D%5E%7Bn%7D%28h%28x_i%29%29%5E%7By_i%7D%281-h%28x_i%29%29%5E%7B1-y_i%7D)

Likelihood is nothing but the probability of the data (training examples), given a model and specific parameter values (here, **B**). It measures the support provided by the data for each possible value of **B**. We obtain it by multiplying all P(y_i|x_i) for a given **B**.

And for easier calculations, we take **log likelihood**:

![alt text](https://latex.codecogs.com/gif.latex?%5Cnewline%20l%28%5Cbeta%29%20%3D%20log%28L%28%5Cbeta%29%29%20%5Cnewline%20or%20%5Cnewline%20l%28%5Cbeta%29%20%3D%20%5Csum_%7Bi%3D1%7D%5E%7Bn%7Dy_ilog%28h%28x_i%29%29%20+%20%281-y_i%29log%281-h%28x_i%29%29)
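As a sanity check, we can compute both quantities for the study-hours table and confirm that the log of the product equals the sum of the logs. The coefficients below are illustrative values picked by hand, not the result of any fit described in this article.

```python
import numpy as np

hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50, 2.75,
                  3.00, 3.25, 3.50, 3.75, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
                   1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

X = np.column_stack((np.ones_like(hours), hours))   # x_0 = 1 column
beta = np.array([-4.0, 1.5])                        # illustrative [B_0, B_1]

h = 1.0 / (1.0 + np.exp(-X @ beta))                 # h(x_i) for every row

# Likelihood: product of P(y_i | x_i; B) over all observations.
likelihood = np.prod(h**passed * (1 - h)**(1 - passed))

# Log likelihood: the same quantity written as a sum of logs.
log_likelihood = np.sum(passed * np.log(h) + (1 - passed) * np.log(1 - h))

print("L(B)      =", likelihood)
print("log(L(B)) =", np.log(likelihood))
print("l(B)      =", log_likelihood)
```

With many observations the product quickly becomes a very small number, which is one practical reason to work with the sum of logs instead.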

The cost function for logistic regression is the negative of the log likelihood, so minimizing the cost is the same as maximizing the likelihood. Hence, we can obtain an expression for the cost function, J, from the log likelihood equation as:

![alt text](https://latex.codecogs.com/gif.latex?J%28%5Cbeta%29%20%3D%5Csum_%7Bi%3D1%7D%5E%7Bn%7D%20-%20y_ilog%28h%28x_i%29%29%20-%20%281-y_i%29log%281-h%28x_i%29%29)

and our aim is to estimate **B** so that the cost function is minimized!
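To get a feel for J, the sketch below evaluates it on the study-hours table for two candidate coefficient vectors: all zeros, and a hand-picked illustrative guess. The helper name `cost` and both candidates are our own assumptions.

```python
import numpy as np

hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50, 2.75,
                  3.00, 3.25, 3.50, 3.75, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
                   1, 0, 1, 0, 1, 1, 1, 1, 1, 1])
X = np.column_stack((np.ones_like(hours), hours))


def cost(beta, X, y):
    """J(B) = sum_i [ -y_i log h(x_i) - (1 - y_i) log(1 - h(x_i)) ]."""
    h = 1.0 / (1.0 + np.exp(-X @ beta))
    return np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h))


cost_zero = cost(np.array([0.0, 0.0]), X, passed)    # h = 0.5 everywhere
cost_guess = cost(np.array([-4.0, 1.5]), X, passed)  # illustrative guess

print("J at B = [0, 0]    :", cost_zero)
print("J at B = [-4, 1.5] :", cost_guess)
```

With all coefficients at zero every prediction is 0.5, so each observation contributes log(2) to the cost. A vector that separates passes from failures more sensibly gives a lower J.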

**Using Gradient Descent Algorithm**

Firstly, we take partial derivatives of J(**B**) w.r.t. each **B**_j in **B** to derive the gradient descent rule (we present only the final derived value here):

![alt text](https://latex.codecogs.com/gif.latex?%5Cfrac%7B%5Cpartial%20J%28%5Cbeta%29%7D%7B%5Cpartial%20%5Cbeta_j%7D%20%3D%20%28h%28x%29%20-%20y%29x_j)

Here, y and h(x) represent the response vector and the predicted response vector, respectively. Also, x_j is the vector representing the observation values for the j^{th} feature.
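Since we skipped the derivation, a quick way to gain confidence in this result is to compare it with a finite-difference approximation of the gradient. The sketch below does that on the study-hours table; the helper names and the test point `beta` are our own assumptions.

```python
import numpy as np

hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50, 2.75,
                  3.00, 3.25, 3.50, 3.75, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
                   1, 0, 1, 0, 1, 1, 1, 1, 1, 1])
X = np.column_stack((np.ones_like(hours), hours))


def hypothesis(beta, X):
    return 1.0 / (1.0 + np.exp(-X @ beta))


def cost(beta, X, y):
    h = hypothesis(beta, X)
    return np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h))


def analytic_gradient(beta, X, y):
    # dJ/dB_j = sum_i (h(x_i) - y_i) * x_ij, for all j at once.
    return X.T @ (hypothesis(beta, X) - y)


def numerical_gradient(beta, X, y, eps=1e-6):
    # Central differences: perturb one coefficient at a time.
    grad = np.zeros_like(beta)
    for j in range(beta.size):
        step = np.zeros_like(beta)
        step[j] = eps
        grad[j] = (cost(beta + step, X, y) - cost(beta - step, X, y)) / (2 * eps)
    return grad


beta = np.array([-1.0, 0.5])   # arbitrary test point
grad_a = analytic_gradient(beta, X, passed)
grad_n = numerical_gradient(beta, X, passed)

print("analytic :", grad_a)
print("numerical:", grad_n)
```

The two vectors should agree to several decimal places. A large mismatch would point to a mistake in either the cost or the gradient code.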
Now, in order to minimize J(**B**), we repeat the following update:

![alt text](https://latex.codecogs.com/gif.latex?%5Cnewline%20Repeat%5C%7B%20%5Cnewline%20%5Cbeta_j%20%3A%3D%20%5Cbeta_j%20-%20%5Calpha%5Csum_%7Bi%3D1%7D%5E%7Bn%7D%28h%28x_i%29-y_i%29x_%7Bij%7D%20%5Cnewline%20%28Simultaneously%5Chspace%7B5%7Dupdate%5Chspace%7B5%7Dall%5Chspace%7B5%7D%5Cbeta_j%29%20%5Cnewline%20%5C%7D)

where **alpha** is called the learning rate and needs to be set explicitly.

Let us see the Python implementation of the above technique on a sample dataset (the script expects it in a file named `dataset1.csv`):

```python
import csv
import numpy as np
import matplotlib.pyplot as plt


def loadCSV(filename):
    '''
    function to load dataset
    '''
    with open(filename, "r") as csvfile:
        lines = csv.reader(csvfile)
        # convert every field of every row to float
        dataset = [[float(x) for x in row] for row in lines]
    return np.array(dataset)


def normalize(X):
    '''
    function to normalize feature matrix, X
    (min-max scaling of each column to the range [0, 1])
    '''
    mins = np.min(X, axis=0)
    maxs = np.max(X, axis=0)
    rng = maxs - mins
    norm_X = 1 - ((maxs - X) / rng)
    return norm_X


def logistic_func(beta, X):
    '''
    logistic(sigmoid) function, returns h(x_i) for every row of X
    '''
    return 1.0 / (1 + np.exp(-np.dot(X, beta)))


def log_gradient(beta, X, y):
    '''
    logistic gradient function: sum over i of (h(x_i) - y_i) * x_ij
    '''
    first_calc = logistic_func(beta, X) - y
    final_calc = np.dot(first_calc, X)
    return final_calc


def cost_func(beta, X, y):
    '''
    cost function, J (summed over all observations, as in the formula above)
    '''
    log_func_v = logistic_func(beta, X)
    step1 = y * np.log(log_func_v)
    step2 = (1 - y) * np.log(1 - log_func_v)
    final = -step1 - step2
    return np.sum(final)


def grad_desc(X, y, beta, lr=.01, converge_change=.001):
    '''
    gradient descent function
    '''
    cost = cost_func(beta, X, y)
    change_cost = 1
    num_iter = 1

    # stop once the cost improves by less than converge_change
    while change_cost > converge_change:
        old_cost = cost
        beta = beta - (lr * log_gradient(beta, X, y))
        cost = cost_func(beta, X, y)
        change_cost = old_cost - cost
        num_iter += 1

    return beta, num_iter


def pred_values(beta, X):
    '''
    function to predict labels
    '''
    pred_prob = logistic_func(beta, X)
    pred_value = np.where(pred_prob >= .5, 1, 0)
    return pred_value


def plot_reg(X, y, beta):
    '''
    function to plot decision boundary
    (assumes exactly two features, stored in columns 1 and 2 of X)
    '''
    # labelled observations
    x_0 = X[y == 0.0]
    x_1 = X[y == 1.0]

    # plotting points with diff color for diff label
    plt.scatter(x_0[:, 1], x_0[:, 2], c='b', label='y = 0')
    plt.scatter(x_1[:, 1], x_1[:, 2], c='r', label='y = 1')

    # plotting decision boundary: beta_0 + beta_1*x1 + beta_2*x2 = 0
    x1 = np.arange(0, 1, 0.1)
    x2 = -(beta[0] + beta[1] * x1) / beta[2]
    plt.plot(x1, x2, c='k', label='reg line')

    plt.xlabel('x1')
    plt.ylabel('x2')
    plt.legend()
    plt.show()


if __name__ == "__main__":
    # load the dataset (dataset1.csv must be present in the working
    # directory; the last column is the label, the rest are features)
    dataset = loadCSV('dataset1.csv')

    # normalizing feature matrix
    X = normalize(dataset[:, :-1])

    # stacking a column with all ones in feature matrix
    X = np.hstack((np.ones((X.shape[0], 1)), X))

    # response vector
    y = dataset[:, -1]

    # initial beta values
    beta = np.zeros(X.shape[1])

    # beta values after running gradient descent
    beta, num_iter = grad_desc(X, y, beta)

    # estimated beta values and number of iterations
    print("Estimated regression coefficients:", beta)
    print("No. of iterations:", num_iter)

    # predicted labels
    y_pred = pred_values(beta, X)

    # number of correctly predicted labels
    print("Correctly predicted labels:", np.sum(y == y_pred))

    # plotting regression line
    plot_reg(X, y, beta)
```
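The script above needs a `dataset1.csv` file, which is not included with this article. As a self-contained alternative, here is a minimal sketch that applies the same update rule to the study-hours table from the beginning of the article. The learning rate, the number of iterations and all helper names are our own assumptions, and a fixed iteration count is used instead of a convergence test to keep the example short.

```python
import numpy as np

hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50, 2.75,
                  3.00, 3.25, 3.50, 3.75, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
                   1, 0, 1, 0, 1, 1, 1, 1, 1, 1])
X = np.column_stack((np.ones_like(hours), hours))   # x_0 = 1 column


def hypothesis(beta, X):
    return 1.0 / (1.0 + np.exp(-X @ beta))


def cost(beta, X, y):
    h = hypothesis(beta, X)
    return np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h))


lr = 0.01           # learning rate alpha (assumed value)
n_iter = 10000      # fixed number of updates (assumed value)
beta = np.zeros(X.shape[1])
costs = [cost(beta, X, passed)]

for _ in range(n_iter):
    # B_j := B_j - alpha * sum_i (h(x_i) - y_i) * x_ij, for all j at once
    beta = beta - lr * (X.T @ (hypothesis(beta, X) - passed))
    costs.append(cost(beta, X, passed))

y_pred = np.where(hypothesis(beta, X) >= 0.5, 1, 0)

print("Estimated regression coefficients:", beta)
print("Cost before and after:", costs[0], costs[-1])
print("Correctly predicted labels:", np.sum(y_pred == passed), "of", passed.size)
```

The coefficient for hours comes out positive, i.e. more hours of study raise the estimated probability of passing. Not every label can be predicted correctly here, because passes and failures overlap between roughly 2 and 4 hours.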