# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Logistic Regression

## LEARNING OBJECTIVES
*After this lesson, you will be able to:*

- Explain what a logistic regression is.
- Explain when to use a logistic regression.
- Feel comfortable performing a logistic regression with built-in sklearn datasets and checking the accuracy of your model. 

In [1]:
from IPython.display import Image, HTML

### Introduction (5 mins)

At this point, we've learned many different classification models (Decision Tree, Random Forest, Support Vector Machine, Naive Bayes, and others), so let's review the first classification model we learned: logistic regression.

<center>![Image](http://www.gepsoft.com/tutorials/imagesLRAP/LogisticRegressionWindowLogisticFitChart6.png)</center>

### Key Points

- With logistic regression and all classification models, our label is categorical, not continuous.
- It is important to normalize all features in your dataframe before running your logistic regression model. We normalize by subtracting the column mean and dividing by the column standard deviation. Our features might be on different scales, and normalizing corrects for this. It matters here because scikit-learn's `LogisticRegression` applies an L2 penalty by default, and that penalty treats coefficients of differently scaled features unevenly.
- We can perform multinomial logistic regressions or binary logistic regressions.
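
The normalization described above can be sketched as follows. This is a minimal illustration, not part of the original lesson: it standardizes the iris features by hand and compares the result with scikit-learn's `StandardScaler`. The variable names `X_manual` and `X_scaled` are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Manual z-score: subtract each column's mean, then divide by its standard deviation.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler applies the same transformation (population standard deviation, ddof=0).
X_scaled = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_scaled))   # do the two approaches agree?
print(X_scaled.mean(axis=0).round(6))    # column means after scaling
print(X_scaled.std(axis=0).round(6))     # column standard deviations after scaling
```

In a real workflow you would typically fit the scaler on the training rows only and reuse it to transform the test rows, so that no information from the test set leaks into training.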

### What's going on under the hood?

#### Here's how it works:

![Image](https://dl.dropboxusercontent.com/s/fpvgsspzn40ueve/Screenshot%202016-12-02%2012.41.01.png?dl=0)

What we're predicting with the logistic regression classifier is the probability of being in one class versus another. This is done in three basic steps:
   1. Calculate the odds of the outcome, p/(1-p), which gives a sense of how many times more likely one outcome is than the other.
   2. Remove the bounds on this quantity by taking the natural log of the odds (called the logit link function). A probability is restricted to the range 0 to 1 and the odds cannot be negative, but the log-odds can take any real value, so they can be modeled as a linear combination of the features.
   3. Take the inverse of the logit link function (called the logistic function) to map the linear prediction back to a probability between 0 and 1.
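
The three steps can be traced numerically. The sketch below is an illustration written for this lesson, not library code; the helper names `odds`, `logit`, and `logistic` are hypothetical.

```python
import numpy as np

def odds(p):
    """Step 1: odds of the outcome, p / (1 - p)."""
    return p / (1 - p)

def logit(p):
    """Step 2: natural log of the odds; maps (0, 1) onto the whole real line."""
    return np.log(odds(p))

def logistic(z):
    """Step 3: inverse of the logit; maps any real number back into (0, 1)."""
    return 1 / (1 + np.exp(-z))

p = np.array([0.1, 0.5, 0.8])
print(odds(p))             # p = 0.8 gives odds of about 4 (four times as likely as not)
print(logit(p))            # p = 0.5 gives log-odds of 0
print(logistic(logit(p)))  # recovers the original probabilities
```

Because `logistic` undoes `logit`, a model can fit a straight line in log-odds space and still report its predictions as probabilities.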

### Guided practice - IRIS DATASET - multinomial logistic regression

In [120]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import pandas as pd

In [121]:
iris = load_iris()

In [122]:
iris

 'data': array([[ 5.1,  3.5,  1.4,  0.2],
        [ 4.9,  3. ,  1.4,  0.2],
        [ 4.7,  3.2,  1.3,  0.2],
        [ 4.6,  3.1,  1.5,  0.2],
        [ 5. ,  3.6,  1.4,  0.2],
        [ 5.4,  3.9,  1.7,  0.4],
        [ 4.6,  3.4,  1.4,  0.3],
        [ 5. ,  3.4,  1.5,  0.2],
        [ 4.4,  2.9,  1.4,  0.2],
        [ 4.9,  3.1,  1.5,  0.1],
        [ 5.4,  3.7,  1.5,  0.2],
        [ 4.8,  3.4,  1.6,  0.2],
        [ 4.8,  3. ,  1.4,  0.1],
        [ 4.3,  3. ,  1.1,  0.1],
        [ 5.8,  4. ,  1.2,  0.2],
        [ 5.7,  4.4,  1.5,  0.4],
        [ 5.4,  3.9,  1.3,  0.4],
        [ 5.1,  3.5,  1.4,  0.3],
        [ 5.7,  3.8,  1.7,  0.3],
        [ 5.1,  3.8,  1.5,  0.3],
        [ 5.4,  3.4,  1.7,  0.2],
        [ 5.1,  3.7,  1.5,  0.4],
        [ 4.6,  3.6,  1. ,  0.2],
        [ 5.1,  3.3,  1.7,  0.5],
        [ 4.8,  3.4,  1.9,  0.2],
        [ 5. ,  3. ,  1.6,  0.2],
        [ 5. ,  3.4,  1.6,  0.4],
        [ 5.2,  3.5,  1.5,  0.2],
        [ 5.2,  3.4,  1.4,  0.2],
      

In [123]:
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [124]:
iris.data = pd.DataFrame(iris.data)

In [125]:
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [133]:
# Build a labelled DataFrame: four feature columns plus the class label
irisDF = pd.DataFrame({
    'Sepal Length': iris.data[iris.data.columns[0]],
    'Sepal Width': iris.data[iris.data.columns[1]],
    'Petal Length': iris.data[iris.data.columns[2]],
    'Petal Width': iris.data[iris.data.columns[3]],
    'Target': iris.target,
})

In [134]:
irisDF.head()

|  | Sepal Length | Sepal Width | Petal Length | Petal Width | Target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [135]:
# Let's shuffle the iris dataset so we can predict on all three classes 
# as opposed to just two:
from sklearn.utils import shuffle
irisDF = shuffle(irisDF)

In [136]:
irisDF.head()

|  | Sepal Length | Sepal Width | Petal Length | Petal Width | Target |
|---|---|---|---|---|---|
| 83 | 6.0 | 2.7 | 5.1 | 1.6 | 1 |
| 73 | 6.1 | 2.8 | 4.7 | 1.2 | 1 |
| 47 | 4.6 | 3.2 | 1.4 | 0.2 | 0 |
| 125 | 7.2 | 3.2 | 6.0 | 1.8 | 2 |
| 53 | 5.5 | 2.3 | 4.0 | 1.3 | 1 |


In [200]:
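# Train-test split: the first 100 shuffled rows train the model, the remaining 50 test it
Xtrain, Xtest = irisDF.iloc[:100, :4], irisDF.iloc[100:, :4]
# Select the target as a 1-D Series (column 4), the shape scikit-learn expects for y
ytrain, ytest = irisDF.iloc[:100, 4], irisDF.iloc[100:, 4]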
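
As an alternative to shuffling and slicing by hand, scikit-learn's `train_test_split` does both in one call. The sketch below is not what the lesson ran. It assumes a 100/50 split to mirror the manual version, and the `stratify` and `random_state` arguments are additions made here for balanced classes and reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# test_size=50 holds out 50 rows; stratify=y keeps the three classes balanced
# in both halves; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, shuffle=True, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape)
print(len(y_train), len(y_test))
```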

In [143]:
print(len(Xtrain))
print(len(ytrain))

100
100


In [149]:
lr = LogisticRegression()
lr.fit(Xtrain,ytrain)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [191]:
# Let's test how it performs on our test data
predicted = lr.predict(Xtest)
actual = ytest.values.flatten()

In [195]:
df = pd.DataFrame({'Predicted':predicted, 'Actual':actual})
df.head()

|  | Actual | Predicted |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 2 | 2 |
| 2 | 1 | 1 |
| 3 | 0 | 0 |
| 4 | 0 | 0 |


In [196]:
# Predicted class probabilities: one row per test sample, one column per class (0, 1, 2)
lr.predict_proba(Xtest)

array([[  6.72170676e-03,   5.26420684e-01,   4.66857609e-01],
       [  3.68221955e-04,   2.55304523e-01,   7.44327255e-01],
       [  5.58643744e-02,   6.57907855e-01,   2.86227770e-01],
       [  8.85124078e-01,   1.14805786e-01,   7.01351829e-05],
       [  7.81418732e-01,   2.18455471e-01,   1.25796337e-04],
       [  1.39887789e-01,   7.99561300e-01,   6.05509102e-02],
       [  4.40480634e-02,   8.01647736e-01,   1.54304201e-01],
       [  3.34981838e-03,   2.72964167e-01,   7.23686015e-01],
       [  1.95639336e-03,   3.27457946e-01,   6.70585660e-01],
       [  9.14466705e-01,   8.54510020e-02,   8.22930165e-05],
       [  1.19228482e-02,   5.67908995e-01,   4.20168157e-01],
       [  4.87096138e-03,   6.19211861e-01,   3.75917177e-01],
       [  3.16287136e-04,   4.51704959e-01,   5.47978754e-01],
       [  8.58258317e-01,   1.41513609e-01,   2.28074102e-04],
       [  1.09366292e-03,   1.82562808e-01,   8.16343529e-01],
       [  1.41163098e-03,   2.98441732e-01,   7.0014663
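
To make the link between these probabilities and the predicted labels concrete, here is a small self-contained sketch. It refits a model on its own split, so the numbers will differ from the output above. `max_iter=1000` and `random_state=0` are choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)

print(proba.shape)                           # (number of test rows, number of classes)
print(np.allclose(proba.sum(axis=1), 1.0))   # each row sums to 1

# The predicted label is the class whose column holds the largest probability.
predicted_from_proba = model.classes_[proba.argmax(axis=1)]
print((predicted_from_proba == model.predict(X_test)).all())
```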

In [199]:
# Finally, let's see our model performance:
from sklearn.metrics import classification_report
print(classification_report(ytest, predicted))

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        18
          1       0.93      0.87      0.90        15
          2       0.89      0.94      0.91        17

avg / total       0.94      0.94      0.94        50
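
Besides the classification report, overall accuracy and a confusion matrix are quick ways to check a model. The following sketch refits a model on its own split, so its numbers are not the ones reported above. The split parameters and `max_iter=1000` are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Accuracy: the fraction of test rows whose predicted label matches the true label.
acc = accuracy_score(y_test, y_pred)

# Confusion matrix: rows are true classes, columns are predicted classes,
# so correct predictions sit on the diagonal.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])

print(acc)
print(cm)
```

The diagonal of the confusion matrix divided by the total number of test rows equals the accuracy, which makes it easy to see which classes account for the errors.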



### Independent practice (5 mins)

Import the handwritten digits dataset from sklearn.datasets (`load_digits`, a set of 8x8 digit images that is a smaller relative of MNIST, not MNIST itself) and fit a multinomial logistic regression classifier like the one above.

In [72]:
# Starter CODE
from sklearn.datasets import load_digits
digits = load_digits()