# Logistic Regression - Binary Classification - Stochastic Gradient Descent

In this notebook, we apply the **Stochastic Gradient Descent (SGD)** algorithm to solve a binary classification problem with a Logistic Regression model.

### Regularization
The Scikit-Learn SGDClassifier() model supports regularization techniques such as L2, L1, and Elastic Net. The strength of the regularization is controlled by the hyperparameter $\alpha$.
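
As a rough illustration of what these penalties measure, the sketch below computes the three regularization terms for a small weight vector. It assumes the form given in the scikit-learn SGD user guide, where the Elastic Net term is a convex combination of the L2 and L1 terms weighted by `l1_ratio`. The function name `penalty` and the example weights are illustrative and not part of the notebook.

```python
import numpy as np

def penalty(w, kind="l2", l1_ratio=0.15):
    """Regularization term R(w); the training objective adds alpha * R(w) to the loss."""
    l2 = 0.5 * np.sum(w ** 2)      # L2: half the squared Euclidean norm
    l1 = np.sum(np.abs(w))         # L1: sum of absolute values
    if kind == "l2":
        return l2
    if kind == "l1":
        return l1
    if kind == "elasticnet":
        # Mix of both terms; l1_ratio=0 gives pure L2, l1_ratio=1 gives pure L1.
        return (1.0 - l1_ratio) * l2 + l1_ratio * l1
    raise ValueError(f"unknown penalty: {kind}")

w = np.array([3.0, -4.0])          # illustrative weights
for kind in ("l2", "l1", "elasticnet"):
    print(kind, penalty(w, kind))
```

A larger $\alpha$ scales whichever term is selected, pulling the weights more strongly toward zero.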

More on Stochastic Gradient Descent:
https://scikit-learn.org/stable/modules/sgd.html#sgd

# Dataset


We will use the Iris dataset, a multivariate dataset.

This famous dataset contains the sepal and petal length and width of 150 iris flowers from three different species: Iris-Setosa, Iris-Versicolor, and Iris-Virginica.

There are 4 features: 
- sepal length (cm)
- sepal width (cm)
- petal length (cm)
- petal width (cm)

Total number of samples: 150

The dataset is also known as Fisher's Iris dataset because it was introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems", where it served as an example of linear discriminant analysis.



<img src="http://engineering.unl.edu/images/uploads/IrisFlowers.png" width="800" height="400">


## Binary Classification
We will use Scikit-Learn's SGDClassifier() model with the logistic loss to detect the Iris-Virginica type.

In [1]:
import warnings
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

## Explore The Dataset

In [2]:
iris = load_iris()

# See the key values
print("\nKey Values: \n", list(iris.keys()))

# The feature names
print("\nFeature Names: \n", list(iris.feature_names))

# The target names
print("\nTarget Names: \n", list(iris.target_names))

# The target values (codes)
#print("\nTarget Values: \n", list(iris.target))


Key Values: 
 ['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module']

Feature Names: 
 ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Target Names: 
 ['setosa', 'versicolor', 'virginica']


## Create Data Matrix (X) and the Label Array (y)

Recall that our goal is to detect the Iris-Virginica type. In the target names printed above, the index for virginica is 2.

Thus, we create a binary target vector that is 1 where the target value is 2 (Iris-Virginica) and 0 otherwise.

We can use all features or a subset. For this notebook, we will use two features: petal length and petal width.

In [3]:
# For the experimentation we use two features
X = iris["data"][:, (2, 3)]  # petal length, petal width

# Target Array
y = (iris["target"] == 2).astype(np.int32)  # 1 if Iris-Virginica, else 0


print(X.shape)
print(y.shape)

print("\nX data type: ", X.dtype)
print("y data type: ", y.dtype)

(150, 2)
(150,)

X data type:  float64
y data type:  int32


## Split Data Into Training and Test Sets

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

## Standardize the Data

In [5]:
scaler = StandardScaler()

# Fit on the training set only.
scaler.fit(X_train)

# Apply transform to both the training set and the test set.
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
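
To make the "fit on the training set only" step concrete, here is a minimal NumPy sketch of the same computation on hypothetical toy arrays (`X_tr` and `X_te` are made-up stand-ins, not the Iris data). It assumes the usual standardization formula, subtracting the per-feature mean and dividing by the per-feature population standard deviation, both estimated from the training data.

```python
import numpy as np

# Hypothetical toy arrays standing in for the training and test features.
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_te = np.array([[2.0, 40.0]])

mu = X_tr.mean(axis=0)      # per-feature mean, estimated on the training set only
sigma = X_tr.std(axis=0)    # per-feature population standard deviation (ddof=0)

X_tr_std = (X_tr - mu) / sigma
X_te_std = (X_te - mu) / sigma   # the test set reuses the training statistics

print(X_tr_std.mean(axis=0))     # approximately 0 for each feature
print(X_tr_std.std(axis=0))      # approximately 1 for each feature
print(X_te_std)
```

Reusing the training statistics on the test set keeps information about the test data from leaking into preprocessing.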

## Stochastic Gradient Descent


The main problem with Batch Gradient Descent is that it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large. 

At the opposite extreme, Stochastic Gradient Descent just picks a random instance in the training set at every step and computes the gradients based only on that single instance. 

This makes the algorithm much faster since it has very little data to manipulate at every iteration. It also makes it possible to train on huge training sets, since only one instance needs to be in memory at each iteration.

On the other hand, due to its stochastic (i.e., random) nature, this algorithm is much less regular than Batch Gradient Descent: instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average. 

Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around, never settling down. So once the algorithm stops, the final parameter values are good, but not optimal.
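
The update described above can be sketched in a few lines of NumPy. This is a simplified illustration, not how SGDClassifier is implemented: it uses a constant learning rate, no regularization, and synthetic two-blob data. The names `sgd_logistic`, `X_toy`, and `y_toy` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, eta=0.1, n_steps=2000, seed=0):
    """Plain SGD for logistic regression: one random instance per step."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_steps):
        i = rng.integers(len(X))                 # pick a single random training instance
        error = sigmoid(X[i] @ w + b) - y[i]     # derivative of the log loss w.r.t. the score
        w -= eta * error * X[i]                  # gradient step using that instance only
        b -= eta * error
    return w, b

# Synthetic toy data: two well-separated Gaussian blobs.
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
                   rng.normal(2.0, 1.0, size=(50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

w, b = sgd_logistic(X_toy, y_toy)
accuracy = np.mean((sigmoid(X_toy @ w + b) >= 0.5) == y_toy)
print("weights:", w, "bias:", b, "accuracy:", accuracy)
```

Because each step sees only one instance, rerunning with a different `seed` gives slightly different final weights, which is the bouncing behaviour described above.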

## Scikit-Learn SGDClassifier


The SGDClassifier implements a plain Stochastic Gradient Descent learning routine which supports different loss functions and penalties for classification.

The concrete loss function can be set via the loss parameter. SGDClassifier supports, among others, the following loss functions:

- loss="hinge": (soft-margin) linear Support Vector Machine
- loss="modified_huber": smoothed hinge loss
- loss="log_loss": logistic regression (named "log" in scikit-learn releases before 1.1)

To implement SGD for Logistic Regression, we use the "log_loss" loss, which yields logistic regression, a probabilistic classifier.

Using loss="log_loss" enables the predict_proba method, which gives a vector of probability estimates per sample.



The following hyperparameters control how an SGDClassifier is trained.


- penalty : ‘l2’, ‘l1’, ‘elasticnet’, or None (written as the string ‘none’ in older releases)
    -- The penalty (aka regularization term) to be used. Defaults to ‘l2’, which is the standard regularizer for linear SVM models. ‘l1’ and ‘elasticnet’ might bring sparsity to the model (feature selection) not achievable with ‘l2’.
    

- alpha : Constant that multiplies the regularization term. Defaults to 0.0001 


- l1_ratio : The Elastic Net mixing parameter, with 0 <= l1_ratio <= 1. l1_ratio=0 corresponds to L2 penalty, l1_ratio=1 to L1. Defaults to 0.15.


- max_iter : The maximum number of passes over the training data (aka epochs). It only impacts the behavior of the fit method, not partial_fit. Defaults to 1000 (since version 0.21).


- tol : The stopping criterion. If it is not None, training stops when (loss > best_loss - tol) for n_iter_no_change consecutive epochs. Defaults to 1e-3 (since version 0.21).


- random_state : The seed of the pseudo-random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.


- learning_rate : The learning rate schedule:

    -- ‘constant’: eta = eta0

    -- ‘optimal’: [default] eta = 1.0 / (alpha * (t + t0)), where t0 is chosen by a heuristic proposed by Leon Bottou.

    -- ‘invscaling’: eta = eta0 / pow(t, power_t)

    -- ‘adaptive’: eta = eta0, as long as the training loss keeps decreasing. Each time n_iter_no_change consecutive epochs fail to decrease the training loss by tol (or fail to increase the validation score by tol, if early_stopping is True), the current learning rate is divided by 5.


- eta0 : The initial learning rate for the ‘constant’, ‘invscaling’ or ‘adaptive’ schedules. The default value is 0.0 as eta0 is not used by the default schedule ‘optimal’.



- early_stopping : Whether to use early stopping to terminate training when the validation score is not improving. If set to True, it will automatically set aside a fraction of training data as validation and terminate training when the validation score is not improving by at least tol for n_iter_no_change consecutive epochs.


- n_iter_no_change : Number of iterations with no improvement to wait before early stopping.

More detail: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
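
The learning rate schedules listed above can be written out as plain functions to see how eta changes with the step counter t. This is only a sketch of the formulas as stated. In particular, `t0` is passed in by hand here rather than chosen by the heuristic, `power_t=0.5` is an assumed value, and the function names are hypothetical. The ‘adaptive’ schedule is omitted because it depends on the training history.

```python
def eta_constant(eta0, t):
    """'constant': the learning rate never changes."""
    return eta0

def eta_optimal(alpha, t, t0):
    """'optimal': decays with t; t0 is supplied by hand in this sketch."""
    return 1.0 / (alpha * (t + t0))

def eta_invscaling(eta0, t, power_t=0.5):
    """'invscaling': eta0 divided by t raised to power_t (assumed 0.5 here)."""
    return eta0 / (t ** power_t)

for t in (1, 4, 100):
    print(t,
          eta_constant(0.1, t),
          eta_optimal(alpha=0.01, t=t, t0=50),
          eta_invscaling(0.1, t))
```

Both decaying schedules take large steps early and progressively smaller ones later, which helps damp the bouncing around the minimum.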


# Binary Classification



## Model Selection for Binary Classifier: Hyperparameter Tuning

First, we need to find the optimal hyperparameters via grid search.

- For logistic regression, the loss function should be set to "log_loss".

In [6]:
%%time

warnings.filterwarnings('ignore')

param_grid = {'alpha': [0.05, 0.01, 0.001],
              'penalty' : ["l2", "l1"],
              'learning_rate': ["constant", "optimal", "invscaling", "adaptive"], 
              'max_iter':[500, 1000, 3000],
              'eta0': [0.1, 0.01, 0.001],
              'tol': [1e-3, 1e-5, 1e-8]}

sgd_clf = SGDClassifier(loss='log_loss') # For logistic regression, the loss function should be "log_loss"

sgd_clf_cv = GridSearchCV(sgd_clf, param_grid, scoring='f1_micro', cv=3, verbose=1, n_jobs=-1)
sgd_clf_cv.fit(X_train, y_train)

params_optimal = sgd_clf_cv.best_params_

print("Best Score (F1 micro): %f" % sgd_clf_cv.best_score_)
print("Optimal Hyperparameter Values: ", params_optimal)
print("\n")

Fitting 3 folds for each of 648 candidates, totalling 1944 fits



Best Score (F1 micro): 0.966667
Optimal Hyperparameter Values:  {'alpha': 0.05, 'eta0': 0.1, 'learning_rate': 'invscaling', 'max_iter': 500, 'penalty': 'l2', 'tol': 0.001}


CPU times: user 264 ms, sys: 96.2 ms, total: 360 ms
Wall time: 1.23 s
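
The candidate count reported by the grid search can be checked by hand: it is the product of the number of values for each hyperparameter, and the number of fits multiplies that by the number of folds. The sketch below repeats the grid from the cell above and uses only the standard library. The variable names `n_candidates`, `n_fits`, and `n_optimal` are illustrative.

```python
from itertools import product
from math import prod

param_grid = {'alpha': [0.05, 0.01, 0.001],
              'penalty': ["l2", "l1"],
              'learning_rate': ["constant", "optimal", "invscaling", "adaptive"],
              'max_iter': [500, 1000, 3000],
              'eta0': [0.1, 0.01, 0.001],
              'tol': [1e-3, 1e-5, 1e-8]}

# One candidate per element of the Cartesian product of all value lists.
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * 3   # cv=3 folds per candidate

# Enumerate the combinations and count those that use the 'optimal' schedule.
keys = list(param_grid)
combos = [dict(zip(keys, values)) for values in product(*param_grid.values())]
n_optimal = sum(c["learning_rate"] == "optimal" for c in combos)

print(n_candidates, n_fits, n_optimal)   # 648 1944 162
```

Since eta0 is not used by the ‘optimal’ schedule, those 162 candidates differ from one another in groups of three only by an ignored value. Passing a list of smaller grids to GridSearchCV is one way to avoid such redundant fits.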




## Train the Optimal SGDClassifier 

In [7]:
sgd = SGDClassifier(loss='log_loss', **params_optimal)
sgd.fit(X_train, y_train)

## Evaluate the Optimal SGDClassifier

In [8]:
y_train_predicted = sgd.predict(X_train)

accuracy_score_train = np.mean(y_train_predicted == y_train)
print("\nTrain Accuracy: ", accuracy_score_train)

print("\nTrain Confusion Matrix:")
print(confusion_matrix(y_train, y_train_predicted))

y_test_predicted = sgd.predict(X_test)

accuracy_score_test = np.mean(y_test_predicted == y_test)
print("\nTest Accuracy: ", accuracy_score_test)

print("\nTest Confusion Matrix:")
print(confusion_matrix(y_test, y_test_predicted))

precision_test = precision_score(y_test, y_test_predicted) 
print("\nTest Precision = %f" % precision_test)

recall_test = recall_score(y_test, y_test_predicted)
print("Test Recall = %f" % recall_test)

f1_test = f1_score(y_test, y_test_predicted)
print("Test F1 Score = %f" % f1_test)

print("\nClassification Report:")
print(classification_report(y_test, y_test_predicted))


Train Accuracy:  0.975

Train Confusion Matrix:
[[78  3]
 [ 0 39]]

Test Accuracy:  1.0

Test Confusion Matrix:
[[19  0]
 [ 0 11]]

Test Precision = 1.000000
Test Recall = 1.000000
Test F1 Score = 1.000000

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
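
For reference, the reported metrics can be recomputed directly from a confusion matrix. The sketch below uses the training confusion matrix printed above and assumes scikit-learn's layout, with rows as true classes and columns as predicted classes, so that the entries read [[TN, FP], [FN, TP]].

```python
import numpy as np

# Training confusion matrix from the output above: [[TN, FP], [FN, TP]].
cm = np.array([[78, 3],
               [0, 39]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)    # of the predicted positives, how many are correct
recall = tp / (tp + fn)       # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.975 precision=0.929 recall=1.000 f1=0.963
```

On the training set the model finds every Iris-Virginica sample (recall 1.0) at the cost of three false positives, which is what lowers precision and F1 below 1.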

