# SLU08 - Classification: Example notebook
How to use scikit-learn's implementation of:
- LogisticRegression

to solve the last exercise of the SLU08 Exercise Notebook.

In [1]:
import pandas as pd 
import numpy as np 

from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

### The Banknote Authentication Dataset

The dataset has 1372 items (images of banknotes, such as euro or dollar bills). There are 4 predictor variables (variance of the image, skewness, kurtosis, entropy). The target variable is encoded as 0 (authentic) or 1 (forgery).

Your quest is to first analyze this dataset using what you've learned in the previous SLUs, and then create a logistic regression model that can correctly distinguish forged banknotes from authentic ones.

The data is loaded for you below.

In [2]:
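columns = ['variance','skewness','kurtosis','entropy', 'forgery']
# Load the data and shuffle the rows (frac=1 keeps every row; random_state makes the shuffle reproducible)
data = pd.read_csv('data/data_banknote_authentication.txt',names=columns).sample(frac=1, random_state=1)
# Split into a feature matrix and a target vector
X_train = data.drop(columns='forgery').values
Y_train = data.forgery.values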

In [5]:
data.head()

      variance  skewness   kurtosis   entropy  forgery
1240   -3.5510    1.8955   0.1865     -2.4409        1
703     1.3114    4.5462   2.2935      0.22541       0
821    -4.0173   -8.3123  12.4547     -1.4375        1
1081   -5.1190    6.6486  -0.049987   -6.5206        1
37      3.6289    0.81322  1.6277      0.77627       0


In [10]:
data[['variance','skewness','kurtosis','entropy']].to_numpy()

array([[-3.551   ,  1.8955  ,  0.1865  , -2.4409  ],
       [ 1.3114  ,  4.5462  ,  2.2935  ,  0.22541 ],
       [-4.0173  , -8.3123  , 12.4547  , -1.4375  ],
       ...,
       [-4.3667  ,  6.0692  ,  0.57208 , -5.4668  ],
       [ 2.0466  ,  2.03    ,  2.1761  , -0.083634],
       [-2.3147  ,  3.6668  , -0.6969  , -1.2474  ]])

What do the features and the target look like?

In [3]:
X_train

array([[-3.551   ,  1.8955  ,  0.1865  , -2.4409  ],
       [ 1.3114  ,  4.5462  ,  2.2935  ,  0.22541 ],
       [-4.0173  , -8.3123  , 12.4547  , -1.4375  ],
       ...,
       [-4.3667  ,  6.0692  ,  0.57208 , -5.4668  ],
       [ 2.0466  ,  2.03    ,  2.1761  , -0.083634],
       [-2.3147  ,  3.6668  , -0.6969  , -1.2474  ]])

In [11]:
data['forgery'].to_numpy()

array([1, 0, 1, ..., 1, 0, 1])

In [5]:
Y_train

array([1, 0, 1, ..., 1, 0, 1])

# [MinMaxScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)
_Transforms features by scaling each feature to a given range._

You can set the range of the scaled feature values with the `feature_range` argument; the default is `feature_range=(0, 1)`.

In [4]:
# Initialize the scaler
scaler = MinMaxScaler(feature_range=(0, 1))

# Fit the scaler (learns the minimum and maximum of each feature)
scaler.fit(X_train)

MinMaxScaler()

In [5]:
# Transform your data
X_train = scaler.transform(X_train)
X_train

array([[0.25175778, 0.58629657, 0.23575075, 0.5553252 ],
       [0.60240573, 0.68548197, 0.3265169 , 0.79776772],
       [0.21813094, 0.20433532, 0.76424494, 0.64656246],
       ...,
       [0.19293425, 0.74247045, 0.25236091, 0.28018586],
       [0.65542407, 0.59132937, 0.3214595 , 0.76966693],
       [0.34091253, 0.65257608, 0.19769531, 0.6638479 ]])

So, now our features are scaled between 0 and 1.
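
To make the scaling step less of a black box, here is a minimal NumPy sketch of the per-column min-max formula, `(x - min) / (max - min)`, stretched to the requested range. The helper `min_max_scale` and the toy matrix `X_toy` are hypothetical names and values used only for illustration; they are not part of scikit-learn or of the banknote data, and the sketch does not handle constant columns (where `max - min` is zero).

```python
import numpy as np

# Toy feature matrix (hypothetical values, not the banknote data)
X_toy = np.array([[-3.0,  2.0],
                  [ 1.0,  4.0],
                  [ 5.0, -8.0]])

def min_max_scale(X, feature_range=(0.0, 1.0)):
    """Linearly rescale each column of X into feature_range."""
    low, high = feature_range
    col_min = X.min(axis=0)   # per-feature minimum
    col_max = X.max(axis=0)   # per-feature maximum
    X_unit = (X - col_min) / (col_max - col_min)   # now in [0, 1]
    return X_unit * (high - low) + low

X_toy_scaled = min_max_scale(X_toy)
print(X_toy_scaled)
```

After scaling, every column has minimum 0 and maximum 1, and the ordering of values within each column is unchanged.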

# [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
_Logistic Regression (aka logit, MaxEnt) classifier._ Here we use the L2 penalty (argument: `penalty='l2'`), which is also scikit-learn's default.

In [6]:
# init with your arguments
logit_clf = LogisticRegression(penalty='l2', random_state=1)

# Fit it!
logit_clf.fit(X_train, Y_train)

LogisticRegression(random_state=1)

What are the predicted probabilities on the training data (probability of being `1`) with our **Logit** classifier for the first 10 samples?

In [9]:
# First ten instances
logit_clf.predict_proba(X_train)[:, 1][:10]

array([0.95431586, 0.09710292, 0.92123834, 0.94047534, 0.06491872,
       0.20113678, 0.01061188, 0.03811844, 0.02582772, 0.10689863])
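
Where do these probabilities come from? For a binary problem with default settings, the probability of class `1` should be the sigmoid of the linear score `X @ coef_ + intercept_`. The sketch below checks this on a small synthetic dataset; `X_demo`, `y_demo` and `demo_clf` are hypothetical stand-ins so that the snippet runs without the banknote file.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: 4 features already scaled to [0, 1])
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1.0).astype(int)

# Default settings, which use the L2 penalty
demo_clf = LogisticRegression().fit(X_demo, y_demo)

# Linear score for each sample, then the sigmoid
z = X_demo @ demo_clf.coef_.ravel() + demo_clf.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Probability of class 1 as reported by scikit-learn
sklearn_proba = demo_clf.predict_proba(X_demo)[:, 1]

print(np.allclose(manual_proba, sklearn_proba))
```

The same check could be run on `logit_clf` and `X_train` from this notebook.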

What about the predicted classes?

In [10]:
# First ten instances
logit_clf.predict(X_train)[:10]

array([1, 0, 1, 1, 0, 0, 0, 0, 0, 0])

And the accuracy on the training data?

In [11]:
logit_clf.score(X_train, Y_train)

0.9715743440233237
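
For a classifier, `score` returns the mean accuracy, that is, the fraction of samples whose predicted class matches the true class. A minimal sketch of that computation, using made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical, not taken from the banknote data):

```python
import numpy as np

# Hypothetical true and predicted labels for eight samples
y_true_toy = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_toy = np.array([1, 0, 1, 0, 0, 0, 1, 1])

# Accuracy = share of positions where prediction and truth agree
accuracy_toy = np.mean(y_true_toy == y_pred_toy)
print(accuracy_toy)  # 6 of 8 correct -> 0.75
```

Keep in mind that the score above is measured on the same data the model was fitted on, so it is likely an optimistic estimate of performance on unseen banknotes.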

How can we change the threshold from the default (0.5) to 0.9?

In [12]:
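# Probability of class 1 for every sample
predictions = logit_clf.predict_proba(X_train)[:, 1]
# Overwrite in place: probabilities >= 0.9 become 1, the rest become 0
predictions[predictions>=0.9] = 1
predictions[predictions<0.9] = 0
predictions[:10]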

array([1., 0., 1., 1., 0., 0., 0., 0., 0., 0.])
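
An alternative that leaves the probabilities untouched and returns integer labels is to compare against the threshold and cast the boolean result. This is only a sketch: `apply_threshold` is a hypothetical helper, and `proba_first_ten` simply copies the ten probabilities printed earlier in this notebook.

```python
import numpy as np

# The first ten class-1 probabilities shown earlier in the notebook
proba_first_ten = np.array([0.95431586, 0.09710292, 0.92123834, 0.94047534,
                            0.06491872, 0.20113678, 0.01061188, 0.03811844,
                            0.02582772, 0.10689863])

def apply_threshold(proba, threshold=0.5):
    """Return 1 where proba >= threshold, else 0, without modifying proba."""
    return (proba >= threshold).astype(int)

labels_at_090 = apply_threshold(proba_first_ten, threshold=0.9)
print(labels_at_090)  # [1 0 1 1 0 0 0 0 0 0]
```

Because the original array is not overwritten, the same probabilities can be reused to try several thresholds. Raising the threshold makes the classifier more conservative about labelling a banknote as a forgery.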