# Logistic Regression - Comparison With Other Models


The goal of this notebook is to compare the Logistic Regression classifier model with two other classifier models that we have studied. 

In particular we will investigate the following three classification approaches.

- Instance or Memory Based Learning (K-Nearest Neighbor)
- Model Based Learning
  - Generative Approach (Naive Bayes classifier)
  - Discriminative Approach (Logistic Regression)


For this study we will use Scikit-Learn's LogisticRegression object, which fits the model with an iterative, gradient-based solver (`newton-cg`, `lbfgs`, `sag`, or `saga`) rather than a closed-form solution. Since our dataset is small, training will not take long. However, for larger datasets, we should use a Stochastic Gradient Descent implementation of Logistic Regression.
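As a rough sketch of the Stochastic Gradient Descent option mentioned above, one possibility is Scikit-Learn's `SGDClassifier` with the logistic loss. The variable names (`X_demo`, `sgd_log_reg`, etc.) and the hyperparameter values are illustrative assumptions, not settings used elsewhere in this notebook. Note that `SGDClassifier` handles multi-class problems with a one-versus-all scheme rather than softmax.

```python
import sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The logistic loss is named "log_loss" in scikit-learn >= 1.1 and "log" in older releases
major, minor = (int(part) for part in sklearn.__version__.split(".")[:2])
logistic_loss = "log_loss" if (major, minor) >= (1, 1) else "log"

X_demo, y_demo = load_iris(return_X_y=True)

# SGD is sensitive to feature scale, so standardize the features before fitting
sgd_log_reg = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss=logistic_loss, max_iter=1000, tol=1e-3, random_state=42),
)
sgd_log_reg.fit(X_demo, y_demo)

# Accuracy on the data used for fitting (illustration only, not a test score)
demo_accuracy = sgd_log_reg.score(X_demo, y_demo)
print("Training accuracy:", demo_accuracy)
```

On a dataset as small as iris this offers no speed advantage; the benefit of SGD appears when the training set is too large to process in full at every step.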

# Dataset


We will use the iris dataset, which is a multivariate data set. 

This is a famous dataset that contains the sepal and petal length and width of 150 iris flowers of three different species: Iris-Setosa, Iris-Versicolor, and Iris-Virginica.

There are 4 features: 
- sepal length (cm)
- sepal width (cm)
- petal length (cm)
- petal width (cm)

Total number of samples: 150

The dataset is also known as Fisher's Iris dataset, as it was introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis.


<img src="https://cse.unl.edu/~hasan/IrisFlowers.png" width=800 height=400>

In [1]:
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

## Explore The Dataset

In [2]:
iris = load_iris()

print(iris.keys())
print(iris.feature_names)
print(iris.target_names)
print(iris.data.shape)

#print(iris.DESCR)

dict_keys(['filename', 'DESCR', 'feature_names', 'target', 'data', 'target_names'])
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
['setosa' 'versicolor' 'virginica']
(150, 4)


## Create Data Matrix (X) and the Label Vector (y)

We can use all features or a subset.

In [3]:
#X = iris["data"][:, (2, 3)]  # petal length, petal width
X = iris["data"]
y = iris["target"]

## Split Data Into Training and Test Sets

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# 1. Logistic Regression

We will use multinomial (softmax) logistic regression for multi-class classification.
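To illustrate what the softmax step computes, here is a minimal NumPy sketch that turns per-class scores into class probabilities. The function name `softmax` and the example scores are made up for illustration; this is not Scikit-Learn's internal implementation.

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores of shape (n_samples, n_classes) into probabilities."""
    # Subtract the row-wise maximum for numerical stability; softmax is
    # unchanged when the same constant is added to every score in a row.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Hypothetical scores for two samples and three classes
example_scores = np.array([[2.0, 1.0, 0.1],
                           [0.5, 0.5, 0.5]])
example_probs = softmax(example_scores)

# The predicted class is the one with the highest probability
predicted_class = example_probs.argmax(axis=1)
print(example_probs)
print(predicted_class)
```

Each row of `example_probs` sums to 1, which is what allows a single model to handle all three iris classes directly.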

## Model Selection for Softmax Regression: Hyperparameter Tuning

First, we need to find the optimal hyperparameters via grid search.

In [5]:
%%time
param_grid_sm = {'solver': ['newton-cg', 'lbfgs', 'sag', 'saga'], 
              'multi_class' : ['multinomial'],
              'tol': [1e-3, 1e-4, 1e-5], 'max_iter':[10000, 20000],'C': [10, 50]}

lg_reg_sm = LogisticRegression()

lg_reg_sm_cv = GridSearchCV(lg_reg_sm, param_grid_sm, scoring='accuracy', cv=3)
lg_reg_sm_cv.fit(X_train, y_train)

params_optimal_sm = lg_reg_sm_cv.best_params_

print("Best Score (accuracy): %f" % lg_reg_sm_cv.best_score_)
print("Optimal Hyperparameter Values: ", params_optimal_sm)



Best Score (accuracy): 0.966667
Optimal Hyperparameter Values:  {'max_iter': 10000, 'solver': 'sag', 'tol': 0.001, 'C': 10, 'multi_class': 'multinomial'}
CPU times: user 9.52 s, sys: 62.4 ms, total: 9.58 s
Wall time: 9.6 s




## 1. Train the Optimal Softmax Logistic Regression Classifier

In [6]:
softmax_reg = LogisticRegression(**params_optimal_sm)

softmax_reg.fit(X_train, y_train)

LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=10000, multi_class='multinomial',
          n_jobs=None, penalty='l2', random_state=None, solver='sag',
          tol=0.001, verbose=0, warm_start=False)

## 1. Evaluate the Optimal Softmax Logistic Regression Classifier on Test Data

In [7]:
print("No. of Iterations:", softmax_reg.n_iter_ )


y_test_predicted = softmax_reg.predict(X_test)
#print(y_test_predicted)

accuracy_score_test = np.mean(y_test_predicted == y_test)
print("\nAccuracy: ", accuracy_score_test)


print("\nTest Confusion Matrix:")
print(confusion_matrix(y_test, y_test_predicted))


print("\nClassification Report:")
print(classification_report(y_test, y_test_predicted))

No. of Iterations: [132]

Accuracy:  1.0

Test Confusion Matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

   micro avg       1.00      1.00      1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



# 2. K-Nearest Neighbor

## Model Selection for KNN: Hyperparameter Tuning

First, we need to find the optimal hyperparameters via grid search.

In [8]:
%%time
# The param_grid tells Scikit-Learn to evaluate all 49 × 7 × 2 = 686 combinations of 
#   n_neighbors, p and weights hyperparameter values specified in the dict
param_grid_knn = {'n_neighbors': np.arange(1,50), 'p': [1, 2, 10, 50, 100, 500, 1000], 
                  'weights': ["uniform", "distance"]}

knn = KNeighborsClassifier()

knn_cv = GridSearchCV(knn, param_grid_knn, scoring='f1_macro', cv=3)
knn_cv.fit(X_train, y_train)

params_optimal_knn = knn_cv.best_params_

print("Best Score: %f" % knn_cv.best_score_)
print("Optimal Hyperparameter Values: ", params_optimal_knn)

Best Score: 0.974941
Optimal Hyperparameter Values:  {'weights': 'distance', 'p': 10, 'n_neighbors': 43}
CPU times: user 12.5 s, sys: 479 ms, total: 13 s
Wall time: 11.1 s
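To check how many candidates a grid like this contains, one option is Scikit-Learn's `ParameterGrid`, which enumerates the same combinations that `GridSearchCV` evaluates. The names `grid_to_count`, `n_candidates` and `n_fits` are illustrative; the grid itself is copied from the cell above.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as the KNN search: 49 values of n_neighbors, 7 of p, 2 of weights
grid_to_count = {'n_neighbors': np.arange(1, 50),
                 'p': [1, 2, 10, 50, 100, 500, 1000],
                 'weights': ["uniform", "distance"]}

n_candidates = len(ParameterGrid(grid_to_count))

# With cv=3 every candidate is fitted three times (the final refit is not counted)
n_fits = n_candidates * 3

print("Candidates:", n_candidates)
print("Cross-validation fits:", n_fits)
```

Counting the fits up front helps explain the wall time of the search and shows how quickly the cost grows when another hyperparameter is added.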




## 2. Train the Optimal KNN Classifier

In [9]:
%%time
knn_clf = KNeighborsClassifier(**params_optimal_knn)

knn_clf.fit(X_train, y_train)

CPU times: user 773 µs, sys: 324 µs, total: 1.1 ms
Wall time: 845 µs


## 2. Evaluate the Optimal KNN Classifier on Test Data

In [10]:
y_test_predicted = knn_clf.predict(X_test)
#print(y_test_predicted)

accuracy_score_test = np.mean(y_test_predicted == y_test)
print("\nAccuracy: ", accuracy_score_test)


print("\nTest Confusion Matrix:")
print(confusion_matrix(y_test, y_test_predicted))


print("\nClassification Report:")
print(classification_report(y_test, y_test_predicted))


Accuracy:  1.0

Test Confusion Matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

   micro avg       1.00      1.00      1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



# 3. Gaussian Naive Bayes Classifier

Since the feature values are real numbers, we will use Gaussian Naive Bayes for classification.
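The idea behind Gaussian Naive Bayes can be sketched in a few lines of NumPy: estimate a prior, a mean and a variance per class and feature, then pick the class with the largest log posterior. The helper names (`fit_gaussian_nb`, `predict_gaussian_nb`), the toy data and the small variance offset are assumptions of this sketch, not Scikit-Learn's `GaussianNB` code.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means and feature variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # A small constant keeps the variances strictly positive (sketch assumption)
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Return the class with the largest log posterior for each sample."""
    # Differences between every sample and every class mean: (n_samples, n_classes, n_features)
    diff = X[:, None, :] - means[None, :, :]
    # Sum of per-feature Gaussian log densities (the naive independence assumption)
    log_likelihood = -0.5 * np.sum(
        np.log(2 * np.pi * variances)[None, :, :] + diff ** 2 / variances[None, :, :],
        axis=2,
    )
    log_posterior = np.log(priors)[None, :] + log_likelihood
    return classes[np.argmax(log_posterior, axis=1)]

# Tiny made-up dataset: one feature, two well-separated classes
X_nb_toy = np.array([[1.0], [1.2], [0.8], [5.0], [5.2], [4.8]])
y_nb_toy = np.array([0, 0, 0, 1, 1, 1])

nb_params = fit_gaussian_nb(X_nb_toy, y_nb_toy)
nb_toy_pred = predict_gaussian_nb(np.array([[1.1], [4.9]]), *nb_params)
print(nb_toy_pred)
```

Fitting only requires computing means and variances, which is why there is no iterative optimization involved.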

## Note:
There is no model selection (hyperparameter tuning) step for the NB classifier here; we use Scikit-Learn's default settings.

In [11]:
gaussianNB_clf = GaussianNB()

gaussianNB_clf.fit(X_train, y_train)  

GaussianNB(priors=None, var_smoothing=1e-09)

## Evaluate the Gaussian NB Classifier on Test Data

In [12]:
y_test_predicted = gaussianNB_clf.predict(X_test)


accuracy_score_test = np.mean(y_test_predicted == y_test)
print("\nAccuracy: ", accuracy_score_test)

print("\nTest Confusion Matrix:")
print(confusion_matrix(y_test, y_test_predicted))


print("\nClassification Report:")
print(classification_report(y_test, y_test_predicted))


Accuracy:  1.0

Test Confusion Matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

   micro avg       1.00      1.00      1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
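Because all three classifiers reach an accuracy of 1.0 on the 30-sample test set, the test score alone cannot separate them. One possible follow-up, sketched here, is to compare them with 5-fold cross-validation on the full dataset. The models below use mostly default hyperparameters (an assumption of this sketch, not the tuned values found above), and names such as `candidates` and `cv_results` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X_all, y_all = load_iris(return_X_y=True)

# Default-ish models; max_iter is raised so the solver has room to converge
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbor": KNeighborsClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
}

cv_results = {}
for name, model in candidates.items():
    # Each model is trained and scored on 5 different train/validation splits
    scores = cross_val_score(model, X_all, y_all, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print("%-22s mean accuracy: %.3f (std: %.3f)" % (name, scores.mean(), scores.std()))
```

Averaging over several folds gives a less noisy estimate than a single small test split, so differences between the models are easier to see.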



# Comparative Understanding of Three Classifiers

### Naive Bayes Classifier
- No model selection (hyperparameter tuning). Thus we can immediately start training the model.
- Works well even if the dataset is small.
- Scales well with high-dimensional features, hence we don't have to worry about the curse of dimensionality.
- No need for regularization.
- Since it's not an optimization technique, we don't have to search appropriate learning rate, generate learning curves to detect underfitting/overfitting, regularize via early stopping, etc.
- Life is good with the NB classifier!
- Can directly perform multi-class classification. We don't need to train multiple binary classifiers.


### K-Nearest Neighbor
- No learning!
- There is no need to compute class probabilities.
- Suffers from the curse of dimensionality.
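- Training is almost free (the samples are simply stored), but prediction time is painfully long for large datasets and high-dimensional features, because every query must be compared against the stored training samples.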
- Can directly perform multi-class classification. We don't need to train multiple binary classifiers.
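The points above can be seen in a minimal NumPy sketch of KNN prediction: there is no fitting step, and all the work happens at query time. The function `knn_predict` and the toy data are hypothetical, and the sketch uses uniform voting rather than the distance weighting selected by the grid search.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3, p=2):
    """Classify each query point by majority vote among its k nearest neighbors."""
    predictions = []
    for x in X_query:
        # Minkowski distance of order p to every stored training sample
        distances = np.sum(np.abs(X_train - x) ** p, axis=1) ** (1.0 / p)
        nearest = np.argsort(distances)[:k]
        # Majority vote over the labels of the k nearest samples
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Made-up 2-D data with two clusters
X_knn_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                      [3.0, 3.0], [3.1, 2.9], [2.9, 3.2]])
y_knn_toy = np.array([0, 0, 0, 1, 1, 1])

knn_toy_pred = knn_predict(X_knn_toy, y_knn_toy, np.array([[0.1, 0.1], [3.0, 3.1]]), k=3)
print(knn_toy_pred)
```

Every prediction scans the whole training set, so the cost per query grows with both the number of samples and the number of features.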


### Logistic Regression
- It's mainly a binary classifier.
- For a non-linear dataset we need to augment the features.
- For multi-class classification we need to train multiple binary classifiers when we use the one-versus-all (OvA) strategy. However, Softmax regression is able to directly perform multi-class classification.
- No closed-form solution!
- Batch Gradient Descent takes longer on larger datasets.
- Training gets slower if the learning rate is not chosen appropriately.
- For faster convergence we can use second-order derivative information of the cost function (as Newton-type solvers such as `newton-cg` do) instead of plain gradient descent.
- Stochastic Gradient Descent requires careful hyperparameter tuning.
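To make the role of the learning rate and the full-batch gradient concrete, here is a minimal sketch of binary logistic regression trained with batch gradient descent. The function names, the toy data and the default values of `learning_rate` and `n_iterations` are assumptions for illustration; Scikit-Learn's solvers work differently.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, learning_rate=0.1, n_iterations=1000):
    """Binary logistic regression trained with batch gradient descent."""
    X_b = np.c_[np.ones(len(X)), X]   # prepend a bias column
    theta = np.zeros(X_b.shape[1])
    for _ in range(n_iterations):
        probabilities = sigmoid(X_b @ theta)
        # Gradient of the average log loss, computed over the full training set
        gradient = X_b.T @ (probabilities - y) / len(y)
        theta -= learning_rate * gradient
    return theta

# Made-up, centered 1-D data: negative values belong to class 0, positive to class 1
X_gd_toy = np.array([[-2.5], [-1.5], [-0.5], [0.5], [1.5], [2.5]])
y_gd_toy = np.array([0, 0, 0, 1, 1, 1])

theta_gd = fit_logistic_gd(X_gd_toy, y_gd_toy)
gd_toy_pred = (sigmoid(np.c_[np.ones(len(X_gd_toy)), X_gd_toy] @ theta_gd) >= 0.5).astype(int)
print(gd_toy_pred)
```

Every iteration touches all training samples, which is why batch gradient descent slows down as the dataset grows, and why a poorly chosen `learning_rate` means either slow progress or unstable updates.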



## Logistic Regression vs Naive Bayes

- Logistic Regression makes no assumption about P(X|Y) in learning, while Naive Bayes does!
- They optimize different functions and obtain different solutions.

In general, NB and LR make different assumptions:
- NB: Features are conditionally independent given the class, i.e. P(X|Y) factorizes over the features.
- LR: Defines the functional form of P(Y|X), but makes no assumption about P(X|Y).

Convergence rates:
- GNB (usually) needs less data.
- LR (usually) gets to better solutions in the limit.

Asymptotic comparison (as the number of training samples grows):
- When the model is correct (the GNB assumptions hold): GNB (with class-independent variances) and LR produce identical classifiers.
- When the model is incorrect: LR is less biased because it does not assume conditional independence, and is therefore expected to outperform GNB.
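The link between GNB with class-independent variances and LR can be sketched for the binary case. Assume each feature $X_i$ is Gaussian given the class, with class-dependent mean $\mu_{i0}$ or $\mu_{i1}$ and a shared variance $\sigma_i^2$, and write $\pi = P(Y=1)$ (these symbols are introduced here for illustration). Bayes' rule together with the conditional independence assumption gives

$$
P(Y=1 \mid X) = \frac{1}{1 + \exp\left( \ln\frac{1-\pi}{\pi} + \sum_i \ln\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)} \right)}
$$

With Gaussian class-conditional densities and a shared variance, the quadratic terms in $X_i$ cancel:

$$
\ln\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)} = \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}\, X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}
$$

so the posterior takes the logistic form

$$
P(Y=1 \mid X) = \frac{1}{1 + \exp\left( w_0 + \sum_i w_i X_i \right)}, \qquad
w_i = \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}, \qquad
w_0 = \ln\frac{1-\pi}{\pi} + \sum_i \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}
$$

Under these assumptions GNB implies a posterior that is linear in $X$ inside the logistic function, the same functional form that LR fits directly; the two differ in how the weights are estimated.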