Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt).

# REGULARIZATION AND SGD

Regularization is a technique that helps avoid overfitting by penalizing large feature weights. Several classifiers, such as [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) and [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html), let you choose which regularization term to use.
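
To make the idea of a penalty concrete, here is a minimal sketch of the two most common regularization terms, computed with NumPy on a made-up weight vector. The helper names `l1_penalty` and `l2_penalty` and the strength `lam` are illustrative assumptions and are not part of scikit-learn.

```python
import numpy as np

def l1_penalty(w, lam=1.0):
    """L1 term: lam times the sum of absolute weights."""
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam=1.0):
    """L2 term: lam times the sum of squared weights."""
    return lam * np.sum(w ** 2)

w = np.array([0.5, -2.0, 0.0, 3.0])  # made-up weights, for illustration only
print(l1_penalty(w))  # 5.5
print(l2_penalty(w))  # 13.25
```

Either term is added to the training loss, so larger weights cost more and the optimizer is pushed towards smaller ones.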

In this notebook we'll explore the usage of different regularization terms. For that, we'll use a restaurant reviews classification task.

In [2]:
# Loading the data

import pandas as pd

# Importing the dataset
dataset = pd.read_csv('data/restaurant_reviews.tsv', delimiter = '\t', quoting = 3)

print(dataset['Liked'].value_counts())
dataset.head()

1    500
0    500
Name: Liked, dtype: int64


Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


In [3]:
# Cleaning the text

import re
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

# Build the stop word set once (requires nltk.download('stopwords') on first use)
stop_words = set(stopwords.words('english'))

corpus = []
ps = PorterStemmer()
for text in dataset['Review']:
    # replace non-alphabetic characters with spaces
    review = re.sub('[^a-zA-Z]', ' ', text)
    # lower-case and tokenize
    review = review.lower().split()
    # stem each token and drop stop words
    review = ' '.join(ps.stem(w) for w in review if w not in stop_words)
    corpus.append(review)

In [4]:
# Creating a bag-of-words model

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features = 1500)
X = vectorizer.fit_transform(corpus).toarray()
y = dataset['Liked']

print(X.shape, y.shape)

(1000, 1500) (1000,)
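
To see what the bag-of-words matrix above contains, the following sketch applies `CountVectorizer` to a made-up two-document corpus. The `toy_` names are illustrative and do not touch the notebook's own variables.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good food good place", "bad food"]  # made-up mini corpus
toy_vectorizer = CountVectorizer()
counts = toy_vectorizer.fit_transform(docs).toarray()

# Column order follows the alphabetically sorted vocabulary
print(sorted(toy_vectorizer.vocabulary_))  # ['bad', 'food', 'good', 'place']
print(counts)
# [[0 1 2 1]
#  [1 1 0 0]]
```

Each row is a document and each column counts how often one vocabulary term occurs in it.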


In [5]:
# Splitting the dataset into training and test sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0, stratify=y)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

print(y_train.value_counts())
print(y_test.value_counts())

(800, 1500) (800,)
(200, 1500) (200,)
1    400
0    400
Name: Liked, dtype: int64
0    100
1    100
Name: Liked, dtype: int64


## Logistic Regression

Scikit-learn's [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) includes both L1 and L2 regularizations. L2 is the default.

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

clf = LogisticRegression(penalty='l2') # l2 regularization is the default
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))

[[83 17]
 [22 78]]


Print the feature weights that we've obtained.

In [7]:
print(clf.coef_)


[[ 0.41686797  0.17697687  0.         ... -0.20321363  0.64383437
  -0.61415454]]


How many features are actually being used? (I.e., how many non-zero weights are there?)

In [10]:
total_features = len(clf.coef_[0])

used_features = sum([1 for i in clf.coef_[0] if i != 0])

print('Total features: ', total_features)
print('Used features: ', used_features)

Total features:  1500
Used features:  1311


L1 regularization typically obtains sparser weight vectors. Try using L1 regularization (check the documentation for additional changes you might need). How many non-zero weights do you have now?

In [12]:
clf = LogisticRegression(penalty='l1', solver='liblinear')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))

print("Total features: ", len(clf.coef_[0]))
print("Used features: ", sum([1 for i in clf.coef_[0] if i != 0]))

[[89 11]
 [30 70]]
Total features:  1500
Used features:  149
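
One way to build intuition for why L1 tends to produce exact zeros is to look at a single weight under a squared-error objective. This is a simplified one-dimensional illustration, not what the `liblinear` solver does internally, and the helper names are assumptions. With an L1 penalty the minimizer is a soft-threshold, which sets small weights to exactly zero. With an L2 penalty the weight is only rescaled.

```python
import numpy as np

def l1_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*|w| (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def l2_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2 (uniform rescaling)."""
    return z / (1.0 + lam)

z = np.array([-1.5, -0.2, 0.1, 0.4, 2.0])  # made-up unregularized weights
w_l1 = l1_shrink(z, lam=0.5)
w_l2 = l2_shrink(z, lam=0.5)
print(np.count_nonzero(w_l1), np.count_nonzero(w_l2))  # 2 5
```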


You can also try using a mix of L1 and L2 (check the documentation for how to do it).

In [14]:
clf = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5, max_iter=10000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))

print("Total features: ", len(clf.coef_[0]))
print("Used features: ", sum([1 for i in clf.coef_[0] if i != 0]))

[[84 16]
 [28 72]]
Total features:  1500
Used features:  380
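
The `l1_ratio` parameter sets the balance between the two terms. As a rough sketch of the elastic-net penalty described in the scikit-learn user guide (the helper name `elastic_net_penalty` is an assumption, and the factor `C` that scales the loss is left out), the mix can be written as:

```python
import numpy as np

def elastic_net_penalty(w, l1_ratio):
    """l1_ratio * ||w||_1 + (1 - l1_ratio) / 2 * ||w||_2^2."""
    l1 = np.sum(np.abs(w))
    l2 = 0.5 * np.sum(w ** 2)
    return l1_ratio * l1 + (1.0 - l1_ratio) * l2

w = np.array([1.0, -2.0, 0.0])  # made-up weights
for ratio in (0.0, 0.5, 1.0):
    print(ratio, elastic_net_penalty(w, ratio))
# 0.0 2.5
# 0.5 2.75
# 1.0 3.0
```

With `l1_ratio=0` only the L2 term remains, and with `l1_ratio=1` only the L1 term does.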


## SVM

Scikit-learn's [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) also includes both L1 and L2 regularizations. L2 is the default.

In [21]:
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

clf = LinearSVC(penalty='l2') # l2 regularization is the default

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))

[[82 18]
 [20 80]]


How many features are actually being used? (I.e., how many non-zero weights are there?)

In [22]:
print("Total features: ", len(clf.coef_[0]))
print("Used features: ", sum([1 for i in clf.coef_[0] if i != 0]))

Total features:  1500
Used features:  1084


Try using L1 regularization (check the documentation for additional changes you might need). How many non-zero weights do you have now?

In [26]:
clf = LinearSVC(penalty='l1', dual=False)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))

print("Total features: ", len(clf.coef_[0]))
print("Used features: ", sum([1 for i in clf.coef_[0] if i != 0]))

[[86 14]
 [27 73]]
Total features:  1500
Used features:  416
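
The amount of sparsity also depends on the regularization strength `C` (smaller values mean stronger regularization). The sketch below explores this on synthetic stand-in data from `make_classification`, not on the restaurant reviews, and the `_demo` names are illustrative. In `LinearSVC`, the L1 penalty is only supported together with `dual=False`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic data: 50 features, only a few of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=50, n_informative=5, random_state=0
)

nonzero_by_C = {}
for C in (0.01, 0.1, 1.0):
    svm_demo = LinearSVC(penalty='l1', dual=False, C=C, max_iter=5000)
    svm_demo.fit(X_demo, y_demo)
    nonzero_by_C[C] = int(np.count_nonzero(svm_demo.coef_))

print(nonzero_by_C)  # smaller C typically leaves fewer non-zero weights
```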


## SGD Classifier

Scikit-learn's [SGD Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) implements regularized linear models (such as SVM and Logistic Regression) with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated one sample at a time, and the model is updated along the way with a decreasing learning rate.

Several loss functions can be used, namely *hinge loss* (which corresponds to SVM) and *log loss* (which corresponds to Logistic Regression). And as before, you can use L1 and/or L2 regularization.
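
The per-sample update described above can be sketched in plain NumPy for the log loss with an L2 penalty. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: the function name `sgd_logistic`, the toy data, and the learning rate decay are all made up for the example.

```python
import numpy as np

def sgd_logistic(X, y, alpha=1e-4, eta0=0.1, n_epochs=50, seed=0):
    """Per-sample SGD for L2-regularized logistic regression (y in {0, 1})."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    t = 0
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):              # visit samples in random order
            t += 1
            eta = eta0 / (1.0 + eta0 * alpha * t)      # illustrative decreasing learning rate
            p = 1.0 / (1.0 + np.exp(-(X[i] @ w + b)))  # predicted P(y = 1)
            grad = p - y[i]                            # derivative of log loss w.r.t. the score
            w -= eta * (grad * X[i] + alpha * w)       # loss gradient plus L2 term
            b -= eta * grad
    return w, b

# Tiny made-up, linearly separable problem
X_toy = np.array([[0.0, 1.0], [1.0, 2.0], [3.0, 0.0], [4.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])
w, b = sgd_logistic(X_toy, y_toy)
preds = (X_toy @ w + b > 0).astype(int)
print(preds)
```

Swapping the `grad` line for the hinge loss subgradient would give the SVM counterpart of the same loop.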

The *max_iter* parameter allows you to set the maximum number of epochs, where an epoch corresponds to going through the whole dataset for training. Also, *learning_rate* allows you to set a learning rate schedule.

Several parameters allow you to define stopping criteria: *tol* sets the tolerance of the stopping criterion (how much the loss must improve to count as progress), while *n_iter_no_change* is the number of consecutive epochs without improvement to wait before stopping; *early_stopping* allows us to use a validation set (a fraction *validation_fraction* of the training data) on which the stopping criterion will be checked (instead of checking the loss on the training data).

The *verbose* parameter allows you to set a verbosity (output) level.

Try using SGD, and explore different parameters!

In [27]:
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='hinge', penalty='l2')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))

print("Total features: ", len(clf.coef_[0]))
print("Used features: ", sum([1 for i in clf.coef_[0] if i != 0]))

[[78 22]
 [19 81]]
Total features:  1500
Used features:  1096


Stochastic gradient descent updates the model weights based on one example at a time. Instead, we can compute the gradient over batches of training instances before updating the weights.

SGDClassifier offers a related, incremental way of training via [*partial_fit*](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier.partial_fit), which runs a single epoch of SGD over the specific set of examples it is given (within each call, the weights are still updated one example at a time). To properly use this method, we need to split our data into mini-batches and then iterate through them for as many epochs as we want.
Matters such as objective convergence, early stopping, and learning rate adjustments must be handled manually.
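
The mini-batch splitting described above can be sketched with a small generator that also reshuffles the data on every pass, which is a common choice when iterating over several epochs. The helper name `iter_minibatches` is an assumption made for this example.

```python
import numpy as np

def iter_minibatches(n_samples, batch_size, rng):
    """Yield index arrays that cover all samples once, in shuffled order."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
batches = list(iter_minibatches(n_samples=10, batch_size=4, rng=rng))
print([len(b) for b in batches])  # [4, 4, 2]
```

With this notebook's data, each index array `idx` could be used as `X_train[idx]` and `y_train.iloc[idx]` in a call to *partial_fit*.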

Try it out!

In [35]:
import numpy as np
from sklearn.metrics import accuracy_score

num_epochs = 100

# Note: partial_fit runs a single epoch per call, so max_iter has no effect in this loop
clf = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3, random_state=42)

batch_size = 32
n_batches = int(np.ceil(len(X_train) / batch_size))

for epoch in range(num_epochs):
    for batch in range(n_batches):
        start = batch * batch_size
        end = (batch + 1) * batch_size
        X_batch = X_train[start:end]
        y_batch = y_train[start:end]
        clf.partial_fit(X_batch, y_batch, classes=np.unique(y))
    
    y_pred = clf.predict(X_test)
    print(f"Epoch {epoch}: Test Accuracy = {accuracy_score(y_test, y_pred): 2f}")

Epoch 0: Test Accuracy =  0.750000
Epoch 1: Test Accuracy =  0.770000
Epoch 2: Test Accuracy =  0.810000
Epoch 3: Test Accuracy =  0.800000
Epoch 4: Test Accuracy =  0.810000
Epoch 5: Test Accuracy =  0.820000
Epoch 6: Test Accuracy =  0.815000
Epoch 7: Test Accuracy =  0.815000
Epoch 8: Test Accuracy =  0.815000
Epoch 9: Test Accuracy =  0.820000
Epoch 10: Test Accuracy =  0.825000
Epoch 11: Test Accuracy =  0.825000
Epoch 12: Test Accuracy =  0.820000
Epoch 13: Test Accuracy =  0.815000
Epoch 14: Test Accuracy =  0.815000
Epoch 15: Test Accuracy =  0.810000
Epoch 16: Test Accuracy =  0.815000
Epoch 17: Test Accuracy =  0.815000
Epoch 18: Test Accuracy =  0.815000
Epoch 19: Test Accuracy =  0.815000
Epoch 20: Test Accuracy =  0.815000
Epoch 21: Test Accuracy =  0.815000
Epoch 22: Test Accuracy =  0.815000
Epoch 23: Test Accuracy =  0.820000
Epoch 24: Test Accuracy =  0.820000
Epoch 25: Test Accuracy =  0.825000
Epoch 26: Test Accuracy =  0.820000
Epoch 27: Test Accuracy =  0.820000
Ep