<img src="AV_Logo.png" style="width: 200px;height: 75px"/>

*Note: This lesson builds on Linear Regression. If you are not yet familiar with Linear Regression, I suggest going through the [Day 5 material](https://datahack.analyticsvidhya.com/contest/avdatafest-datahack-hour/) first.*

Table of Contents
--------------------
* [Introduction to Logistic Regression](#Introduction-to-Logistic-Regression)
* [Theory behind Logistic Regression](#Theory-behind-Logistic-Regression)
* [Optimizing weights for best classification](#Optimizing-weights-for-best-classification)
* [Visualizing classification](#Visualizing-classification)
* [Evaluation Metrics to know for Logistic Regression](#Evaluation-Metrics-to-know-for-Logistic-Regression)
* [Hands on Problem](#Hands-on-Problem)

# Introduction to Logistic Regression 

As we learned earlier, Linear Regression deals with a continuous dependent variable, and there is no restriction on the independent variables (features): they may be continuous, categorical, or both.

In Logistic Regression, by contrast, the output (dependent variable) is categorical, for example male or female, or 0 or 1. Here too, the independent variables may be continuous or categorical.

Let us have a look at an example dataset where Logistic Regression may be applied.

In [1]:
# import modules
import numpy as np
import pandas as pd

In [2]:
# read dataset
data = pd.read_csv('winequality.csv')

In [3]:
data.head()

Unnamed: 0,ID,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,W0001,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,2
1,W0002,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,,9.5,2
2,W0003,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,,10.1,2
3,W0004,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,2
4,W0005,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,2


As you can see, this dataset has discrete levels in the output variable (quality). So rather than predicting a continuous output, we are interested in classifying unseen data into one of these discrete levels. This is the classification task for which Logistic Regression is used.

# Theory behind Logistic Regression

Let us first look at the functional form of Logistic Regression.

$g(f(x)) = g(\theta_0 + \theta_1*x_1 + \dots)$

If you look closely at the right-hand side, you will notice an equation in terms of thetas (coefficients) and Xs (variables). This is the same equation that Linear Regression uses to predict a continuous output. So how can we predict classes with the same functional form as Linear Regression?

To understand the answer, note that Logistic Regression never predicts classes directly. Instead, it predicts the probability of the positive class (i.e. 1).

Probabilities are continuous rather than discrete, and Linear Regression, as we know, predicts continuous outputs. The one remaining problem is that probabilities must lie between 0 and 1, while the linear equation can produce any real number. This is where the "g" in the functional form of Logistic Regression comes in. g is the sigmoid function, which has the mathematical form
$g(x)=e^x/(1+e^x)$

<img src="sigmoid.png" style="width: 400px;height: 300px"/>
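To make the sigmoid concrete, here is a minimal sketch in NumPy. The `scores` array is a made-up stand-in for the linear part $\theta_0 + \theta_1*x_1 + \dots$, and the function is written in the algebraically equivalent form $1/(1+e^{-x})$.

```python
import numpy as np

def sigmoid(x):
    """Logistic function g(x) = e^x / (1 + e^x), written as 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + np.exp(-x))

# hypothetical linear scores (not taken from the wine dataset)
scores = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
probs = sigmoid(scores)

# every score, however large or small, lands strictly between 0 and 1
print(probs)
```

Large negative scores end up close to 0, large positive scores close to 1, and a score of exactly 0 gives 0.5.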

The sigmoid maps any real value of x to a value between 0 and 1. The question that naturally arises is: how do we predict classes from probabilities? The answer is to set a threshold. Let's see with an example how we can predict classes by setting a threshold.

In [4]:
#generate 100 random numbers between 0 and 1 and convert them into 0 and 1 using 0.5 threshold (use if else)

In [5]:
random_nums = np.random.rand(100, 1)

In [6]:
print (random_nums)

[[ 0.38886568]
 [ 0.62768454]
 [ 0.02753184]
 [ 0.88656832]
 [ 0.5436794 ]
 [ 0.85116379]
 [ 0.45646028]
 [ 0.69877083]
 [ 0.25261354]
 [ 0.58613324]
 [ 0.67357946]
 [ 0.76045733]
 [ 0.93351731]
 [ 0.28437589]
 [ 0.87083286]
 [ 0.53899245]
 [ 0.79875127]
 [ 0.20615844]
 [ 0.11631048]
 [ 0.30600695]
 [ 0.67967911]
 [ 0.76338644]
 [ 0.22214849]
 [ 0.94314299]
 [ 0.42049908]
 [ 0.59866325]
 [ 0.86391126]
 [ 0.01697726]
 [ 0.77876255]
 [ 0.72706968]
 [ 0.95867538]
 [ 0.41330192]
 [ 0.01119864]
 [ 0.66269601]
 [ 0.87015847]
 [ 0.42611905]
 [ 0.75207268]
 [ 0.42421444]
 [ 0.07710168]
 [ 0.16901722]
 [ 0.10513816]
 [ 0.87010402]
 [ 0.07225868]
 [ 0.70422604]
 [ 0.07514892]
 [ 0.28013859]
 [ 0.59828911]
 [ 0.48727088]
 [ 0.67550888]
 [ 0.11450931]
 [ 0.57503867]
 [ 0.49254503]
 [ 0.68140362]
 [ 0.8415574 ]
 [ 0.21510704]
 [ 0.79974769]
 [ 0.34655056]
 [ 0.80058852]
 [ 0.22834465]
 [ 0.10702351]
 [ 0.78380294]
 [ 0.025667  ]
 [ 0.7327199 ]
 [ 0.40228742]
 [ 0.78340968]
 [ 0.70772165]
 [ 0.31282

In [7]:
for num in random_nums:
    if num >= 0.5:
        print (1)
    else:
        print (0)

0
1
0
1
1
1
0
1
0
1
1
1
1
0
1
1
1
0
0
0
1
1
0
1
0
1
1
0
1
1
1
0
0
1
1
0
1
0
0
0
0
1
0
1
0
0
1
0
1
0
1
0
1
1
0
1
0
1
0
0
1
0
1
0
1
1
0
0
0
1
1
1
0
0
0
0
1
0
0
0
0
0
0
1
0
1
0
1
1
1
0
1
0
0
1
0
1
1
0
0


We can set any threshold according to the problem statement and our understanding of it. Suppose we need to decide whether a person needs to be tested further for cancer. In this case we might decide that even a 25% (0.25) probability of cancer is enough to send the person for further testing. The threshold here would therefore be 0.25.
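The effect of moving the threshold can be sketched with a small vectorized helper. The random probabilities below are only a stand-in for model output, and `to_classes` is a hypothetical name used for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
probs = rng.random(100)          # stand-in for predicted probabilities

def to_classes(probabilities, threshold=0.5):
    """Return 1 where the probability reaches the threshold, else 0."""
    return (probabilities >= threshold).astype(int)

default_classes = to_classes(probs)          # threshold of 0.5
cautious_classes = to_classes(probs, 0.25)   # lower threshold, as in the cancer example

# lowering the threshold can only move observations from class 0 to class 1
print(default_classes.sum(), cautious_classes.sum())
```

A lower threshold trades more false alarms for fewer missed positives, which is usually the right trade when a missed case is costly.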

# Optimizing weights for best classification

Our goal in classification problems is to predict 1 as 1 with high confidence and 0 as 0 with high confidence. We do not want our model to classify 0s as 1s (or 1s as 0s); when it does, we want to penalise it so that the coefficients are readjusted. This is achieved using a cost function.

## Cost Function

Just as Linear Regression has a cost function of $\sum (y-h(x))^2/n$, Logistic Regression has a cost function of $-[y*log(h(x))+(1-y)*log(1-h(x))]$, averaged over all observations. Let us understand this cost function more intuitively. Suppose y=1 and you predict a probability of 0: the first term becomes $-log(0)$, so the penalty is infinite. The same happens when y=0 and you predict a probability of 1. We want this behaviour, since we want to severely penalise confident wrong classifications.

<img src="cost function.png" style="width: 400px;height: 300px"/>
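To see how quickly the penalty grows, the per-observation cost can be evaluated for a few predicted probabilities. This is a small sketch with made-up values of h(x); the cost is written with the conventional leading minus sign so that lower is better, and `cost` is a hypothetical helper name.

```python
import math

def cost(y, h):
    """Penalty for one observation: -[y*log(h) + (1-y)*log(1-h)]."""
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

# true label y = 1, with predictions moving from confident-right to confident-wrong
for h in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(f"y=1, h(x)={h:<4}  cost={cost(1, h):.3f}")
```

The cost stays near 0 while the prediction is right and confident, and climbs steeply as h(x) approaches 0 for an observation whose true label is 1.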

The next question is how the actual weights should be updated. They are updated in the same way as in Linear Regression: by partially differentiating the cost function with respect to the thetas and moving each theta against its gradient. Here's an interesting insight: try differentiating the cost function for both Linear and Logistic Regression, and you will get update equations of the same form for both, differing only in how h(x) is computed. One way to see this is that our classifier is performing a regression under the hood.
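The update rule can be sketched as plain batch gradient descent. This is an illustration under assumptions, not what scikit-learn does internally: the data is synthetic, and `fit_logistic`, `lr` and `n_iter` are hypothetical names and values chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Batch gradient descent on the mean log loss (lr and n_iter are illustrative)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # column of 1s for theta_0
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        h = sigmoid(Xb @ theta)                     # predicted probabilities
        gradient = Xb.T @ (h - y) / len(y)          # same form as in linear regression
        theta -= lr * gradient                      # step against the gradient
    return theta

# synthetic one-feature data: class 1 becomes more likely as x grows
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y = (x[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)

theta = fit_logistic(x, y)
pred = (sigmoid(np.hstack([np.ones((200, 1)), x]) @ theta) >= 0.5).astype(float)
print(theta, (pred == y).mean())
```

The only difference from the linear regression update is that `h` passes through the sigmoid before the error `h - y` is computed.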

# Visualizing classification 

In a binary classification setting with two independent variables, classification can be understood as drawing a line that can best separate the two classes. Look at the image below.

<img src="decision boundary.png" style="width: 400px;height: 300px"/>

The line that separates the two classes above is called the decision boundary.
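With two features and a threshold of 0.5, the decision boundary is the set of points where $\theta_0 + \theta_1*x_1 + \theta_2*x_2 = 0$, because the sigmoid of 0 is exactly 0.5. The sketch below uses made-up coefficients (not fitted to any dataset) to compute points on that line.

```python
import numpy as np

# hypothetical coefficients for a model with two features
theta_0, theta_1, theta_2 = -1.0, 2.0, 1.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def boundary_x2(x1):
    """x2 value on the decision boundary for a given x1 (requires theta_2 != 0)."""
    return -(theta_0 + theta_1 * x1) / theta_2

x1 = np.array([-1.0, 0.0, 1.0])
x2 = boundary_x2(x1)

on_line = sigmoid(theta_0 + theta_1 * x1 + theta_2 * x2)       # probability on the line
above = sigmoid(theta_0 + theta_1 * x1 + theta_2 * (x2 + 1))   # one unit above the line
print(x2, on_line, above)
```

Points on the line get a probability of 0.5; with a positive `theta_2`, points above it are pushed towards class 1 and points below it towards class 0.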

# Evaluation Metrics to know for Logistic Regression

The following metrics help us decide whether our model is a good fit for classification or not.
1. Accuracy
2. Log loss
3. F1 score

## Accuracy

It is defined as the proportion of all observations that were correctly classified. This is the most common evaluation metric used in balanced classification problems (balanced means an almost equal number of 1s and 0s in the dependent variable). Suppose there are 50 males and 50 females in the dependent variable and you classify 40 out of 50 males correctly and 45 out of 50 females correctly; then the accuracy is (40+45)/100 = 0.85.
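The arithmetic of this example can be checked in a few lines. The label encoding (1 for male, 0 for female) is an assumption made only for this sketch.

```python
# 50 males (label 1) and 50 females (label 0), as in the example above
y_true = [1] * 50 + [0] * 50
# 40 males and 45 females are classified correctly; the rest are misclassified
y_pred = [1] * 40 + [0] * 10 + [0] * 45 + [1] * 5

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)  # 0.85
```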

## LogLoss

This evaluation metric is used in situations where wrong classifications made with high confidence are not acceptable, e.g. classifying a 0 as 1 with a probability of 0.9. Financial modeling problems are a typical example. An important point to note is that log loss has the same mathematical form as the logistic regression cost function, including the leading minus sign that makes lower values better:
$$-[y*log(h(x))+(1-y)*log(1-h(x))]$$
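Log loss can tell apart two models that accuracy treats as identical. In the sketch below, both hypothetical models misclassify the same observation at a 0.5 threshold, but one of them is far more confident about its mistake; `log_loss_mean` is an illustrative helper that averages the expression above with a leading minus sign.

```python
import math

def log_loss_mean(y_true, probs):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all observations."""
    total = 0.0
    for y, p in zip(y_true, probs):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 1, 0, 0]
cautious = [0.8, 0.7, 0.3, 0.6]         # last one wrong, but only mildly confident
overconfident = [0.8, 0.7, 0.3, 0.95]   # last one wrong with high confidence

def accuracy(y_true, probs, threshold=0.5):
    return sum((p >= threshold) == bool(y) for y, p in zip(y_true, probs)) / len(y_true)

print(accuracy(y_true, cautious), accuracy(y_true, overconfident))
print(log_loss_mean(y_true, cautious), log_loss_mean(y_true, overconfident))
```

Both models have the same accuracy, yet the overconfident one has the higher (worse) log loss.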

## F1 score

This is a robust evaluation metric for classification problems where it is important to keep a check on both the wrong and the right predictions. It is defined as the harmonic mean of precision and recall. Let us understand both of these terms separately.

**Precision** - The ratio of true positives to the sum of true positives and false positives. Let us understand this more intuitively. Suppose there are 100 dog images and 100 cat images, and you correctly classify 80 of the 100 dog images as dogs while also classifying 20 cats as dogs. Precision in this case is 80/(80+20) = 0.8.

**Recall** - The ratio of true positives to the sum of true positives and false negatives. It is the proportion of the dogs that we were able to classify correctly. In the above case, 80 of the 100 dogs were found and 20 were missed, so recall is 80/(80+20) = 0.8.

$$F1\ score = \frac{2*precision*recall}{precision+recall}$$
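Working from the counts in the dog/cat example (80 dogs found out of 100, 20 cats mislabelled as dogs), precision, recall and the F1 score can be computed directly. Treating "dog" as the positive class is an assumption of this sketch.

```python
# counts from the dog/cat example, with "dog" as the positive class
tp = 80   # dogs classified as dogs
fp = 20   # cats classified as dogs
fn = 20   # dogs classified as cats (100 dogs - 80 found)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```

Because the F1 score is a harmonic mean, it drops sharply when either precision or recall is low, so a model cannot score well by optimising only one of them.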

# Hands on Problem

In [4]:
from sklearn.linear_model import LogisticRegression

In [5]:
# separate dependent and independent variables
X = data.drop(['ID', 'quality'], axis=1)
y = data.quality

In [7]:
# fill missing values with the column means
X.fillna(X.mean(), inplace=True)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
0,7.0,0.270,0.360000,20.70,0.045,45.0,170.0,1.00100,3.000000,0.450000,8.800000
1,6.3,0.300,0.340000,1.60,0.049,14.0,132.0,0.99400,3.300000,0.490158,9.500000
2,8.1,0.280,0.400000,6.90,0.050,30.0,97.0,0.99510,3.260000,0.490158,10.100000
3,7.2,0.230,0.320000,8.50,0.058,47.0,186.0,0.99560,3.190000,0.400000,9.900000
4,7.2,0.230,0.320000,8.50,0.058,47.0,186.0,0.99560,3.190000,0.400000,9.900000
5,8.1,0.280,0.400000,6.90,0.050,30.0,97.0,0.99510,3.260000,0.440000,10.100000
6,6.2,0.320,0.334031,7.00,0.045,30.0,136.0,0.99490,3.188762,0.470000,9.600000
7,7.0,0.270,0.360000,20.70,0.045,45.0,170.0,1.00100,3.000000,0.450000,8.800000
8,6.3,0.300,0.340000,1.60,0.049,14.0,132.0,0.99400,3.300000,0.490158,9.500000
9,8.1,0.220,0.430000,1.50,0.044,28.0,129.0,0.99380,3.220000,0.450000,11.000000


In [11]:
# define logistic regression
logReg = LogisticRegression()

In [12]:
# train model
logReg.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [13]:
# print coefficient
logReg.coef_

array([[ -2.37841382e-01,  -5.76392875e+00,   2.45333013e-01,
          6.13206508e-02,  -6.22912895e-01,   1.13769493e-02,
         -2.69561960e-03,  -2.73005303e+00,  -5.29625757e-01,
          1.08583712e+00,   9.75524832e-01]])

In [14]:
# print intercept
logReg.intercept_

array([-2.72425802])

In [15]:
# get predictions
pred = logReg.predict(X)

In [16]:
from sklearn.metrics import accuracy_score

In [17]:
# get accuracy
accuracy_score(y, pred)

0.75091874234381384

In [18]:
from sklearn.metrics import f1_score

In [19]:
# get f1 score
f1_score(y, pred)

0.56490727532097007

In [20]:
from sklearn.metrics import log_loss

In [21]:
pred_prob = logReg.predict_proba(X)

In [22]:
# get log loss
log_loss(y, pred_prob[:, 0])

1.3758063624635555

Note that when `log_loss` is given a one-dimensional array, it treats the values as probabilities of the positive class, which is column 1 of `predict_proba`. Column 0, used above, holds the probability of the first class in `logReg.classes_`, so the value above scores the wrong column. Passing `pred_prob[:, 1]` or the full `pred_prob` array gives the intended log loss.

**Exercise**:

Q1. Apply your learnings on [Loan Prediction practice problem](https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/)

That's all for today!
----------------
-------------------------------
<img src="AV_Datafest_logo.png" style="width: 200px;height: 200px"/>
[www.analyticsvidhya.com](www.analyticsvidhya.com)

DATAFEST 2017