In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
import torch.nn.functional as F

In [3]:
# Load dataset
categories = ['rec.autos', 'sci.med', 'comp.graphics']
newsgroups = fetch_20newsgroups(subset='all', categories=categories)
vectorizer = CountVectorizer(max_features=5000)
X = vectorizer.fit_transform(newsgroups.data).toarray()
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(newsgroups.target)


In [5]:
print(y[:10])

[1 0 1 1 0 0 2 0 2 0]


In [6]:
# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert data to PyTorch tensors
X_train_seq = torch.tensor(X_train, dtype=torch.float32)
X_test_seq = torch.tensor(X_test, dtype=torch.float32)
y_train_seq = torch.tensor(y_train, dtype=torch.long)
y_test_seq = torch.tensor(y_test, dtype=torch.long)

In [7]:
# Parameters
input_size = X_train_seq.shape[1]
hidden_size = 32
num_layers = 2
num_classes = 3

# Building an RNN model for text

As a data analyst at PyBooks, you often encounter datasets that contain sequential information, such as customer interactions, time series data, or text documents. RNNs can effectively analyze and extract insights from such data. In this exercise, you will dive into the Newsgroup dataset that has already been processed and encoded for you. This dataset comprises articles from different categories. Your task is to apply an RNN to classify these articles into three categories:

rec.autos, sci.med, and comp.graphics.

This and the following exercises use the fetch_20newsgroups dataset from sklearn.

* Complete the RNN class with an RNN layer and a fully connected linear layer.
* Initialize the model.
* Train the RNN model, zeroing the gradients at the start of each epoch (the exercise asks for ten epochs; the code below runs 50).

In [24]:
class RNNModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(RNNModel, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)        
        
    def forward(self, x):
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        out, _ = self.rnn(x.unsqueeze(1), h0)
        out = out[:, -1, :]
        out = self.fc(out)
        return out

# Initialize the model
rnn_model = RNNModel(input_size, hidden_size, num_layers, num_classes)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(rnn_model.parameters(), lr=0.01)
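
Because each article is a single bag-of-words vector, x.unsqueeze(1) turns the batch into sequences of length one, so the recurrent layer takes exactly one time step per article. The sketch below traces the tensor shapes through nn.RNN under that setup. The sizes (batch_size, vocab_size) and the random input are toy assumptions for illustration, not values from the Newsgroup data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, chosen only for illustration
batch_size, vocab_size, hidden_size, num_layers = 4, 10, 8, 2

x = torch.rand(batch_size, vocab_size)  # stand-in for bag-of-words rows
rnn = nn.RNN(vocab_size, hidden_size, num_layers, batch_first=True)

x_seq = x.unsqueeze(1)   # (batch, seq_len=1, features)
out, h_n = rnn(x_seq)    # initial hidden state defaults to zeros

print(tuple(x_seq.shape))  # (4, 1, 10)
print(tuple(out.shape))    # (4, 1, 8)
print(tuple(h_n.shape))    # (2, 4, 8)

# The last time step of `out` is the top layer's final hidden state
last_step_matches = torch.allclose(out[:, -1, :], h_n[-1])
print(last_step_matches)
```

With a sequence length of 1, out[:, -1, :] is simply the top layer's hidden state after that single step, which is what the linear layer receives.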

In [25]:
# Train the model for 50 epochs
for epoch in range(50):
    rnn_model.train()
    optimizer.zero_grad()
    outputs = rnn_model(X_train_seq)
    loss = criterion(outputs, y_train_seq)
    loss.backward()
    optimizer.step()
    print(f'Epoch: {epoch+1}, Loss: {loss.item()}')

Epoch: 1, Loss: 1.1557263135910034
Epoch: 2, Loss: 1.0302963256835938
Epoch: 3, Loss: 0.9026435017585754
Epoch: 4, Loss: 0.7266136407852173
Epoch: 5, Loss: 0.548943281173706
Epoch: 6, Loss: 0.39736050367355347
Epoch: 7, Loss: 0.27267393469810486
Epoch: 8, Loss: 0.18637728691101074
Epoch: 9, Loss: 0.12769444286823273
Epoch: 10, Loss: 0.08587413281202316
Epoch: 11, Loss: 0.05711580440402031
Epoch: 12, Loss: 0.038773197680711746
Epoch: 13, Loss: 0.027011841535568237
Epoch: 14, Loss: 0.019469747319817543
Epoch: 15, Loss: 0.014546065591275692
Epoch: 16, Loss: 0.011154643259942532
Epoch: 17, Loss: 0.008809257298707962
Epoch: 18, Loss: 0.007102936506271362
Epoch: 19, Loss: 0.005821355618536472
Epoch: 20, Loss: 0.004847390111535788
Epoch: 21, Loss: 0.004096565302461386
Epoch: 22, Loss: 0.0034660627134144306
Epoch: 23, Loss: 0.0028982353396713734
Epoch: 24, Loss: 0.00237806374207139
Epoch: 25, Loss: 0.0019221705151721835
Epoch: 26, Loss: 0.001561245764605701
Epoch: 27, Loss: 0.00130589818581938

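A training loss that falls steadily from epoch to epoch, as it does here, shows that the model is fitting the training data. The loss does not have to decrease at every step, and a low training loss alone does not guarantee good performance on unseen data, which is why the model is evaluated on the test set next. Keep up the excellent work!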
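
A useful reference point when reading these loss values, offered as a rough sketch: nn.CrossEntropyLoss averages the negative natural log of the probability assigned to the true class, so a model that spreads its probability evenly over three classes scores ln(3). The snippet below computes that chance-level value and compares it with the first epoch's loss printed above.

```python
import math

num_classes = 3  # rec.autos, sci.med, comp.graphics

# Loss of a classifier that assigns probability 1/3 to every class
chance_loss = -math.log(1.0 / num_classes)

first_epoch_loss = 1.1557263135910034  # Epoch 1 value printed above

print(f"Chance-level loss: {chance_loss:.4f}")
print(f"Epoch 1 loss minus chance level: {first_epoch_loss - chance_loss:.4f}")
```

An untrained model starting close to this value is expected; the later epochs show how far training moves the loss below it.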

In [26]:
# Evaluate the model
rnn_model.eval()
with torch.no_grad():
    outputs = rnn_model(X_test_seq)
    _, predicted = torch.max(outputs, 1)
    accuracy = accuracy_score(y_test_seq, predicted)
    print(f'Test Accuracy: {accuracy:.2f}')

Test Accuracy: 0.97


# Building an LSTM model for text

At PyBooks, the team is constantly seeking to enhance the user experience by leveraging the latest advancements in technology. In line with this vision, they have assigned you a critical task. The team wants you to explore the potential of another powerful tool: LSTM, known for capturing more complexities in data patterns. You are working with the same Newsgroup dataset, with the objective remaining unchanged: to classify news articles into three distinct categories:

rec.autos, sci.med, and comp.graphics.

* Set up an LSTM model by completing the LSTM and linear layers with the necessary parameters.
* Initialize the model with the necessary parameters.
* Train the LSTM model resetting the gradients to zero and passing the input data X_train_seq through the model.
* Calculate the loss based on the predicted outputs and the true labels.




In [14]:
class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(LSTMModel, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)        
    def forward(self, x):
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        out, _ = self.lstm(x.unsqueeze(1), (h0, c0))  # Reshape input to (batch_size, seq_length, input_size)
        out = out[:, -1, :] 
        out = self.fc(out)
        return out


In [30]:
# Initialize model with required parameters
lstm_model = LSTMModel(input_size, hidden_size, num_layers, num_classes)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(lstm_model.parameters(), lr=0.01)

# Train the model by passing the correct parameters and zeroing the gradient
for epoch in range(10): 
    optimizer.zero_grad()
    outputs = lstm_model(X_train_seq)
    loss = criterion(outputs, y_train_seq)
    loss.backward()
    optimizer.step()
    print(f'Epoch: {epoch+1}, Loss: {loss.item()}')

Epoch: 1, Loss: 1.1131354570388794
Epoch: 2, Loss: 1.0890493392944336
Epoch: 3, Loss: 1.054835557937622
Epoch: 4, Loss: 1.0028889179229736
Epoch: 5, Loss: 0.9312984943389893
Epoch: 6, Loss: 0.8425112962722778
Epoch: 7, Loss: 0.7449913024902344
Epoch: 8, Loss: 0.6447567343711853
Epoch: 9, Loss: 0.5466322302818298
Epoch: 10, Loss: 0.4525633752346039


In [33]:
# Evaluate the model
lstm_model.eval()
with torch.no_grad():
    outputs = lstm_model(X_test_seq)
    _, y_pred_lstm = torch.max(outputs, 1)
#     accuracy = accuracy_score(y_test_seq, y_pred_lstm)
#     print(f'Test Accuracy: {accuracy:.2f}')

The output shows the training loss decreasing with each epoch. The team at PyBooks can use this information to compare the LSTM with other models. Keep up the great work!

# Building a GRU model for text

At PyBooks, the team has been impressed with the performance of the two models you previously trained. However, in their pursuit of excellence, they want to ensure the selection of the absolute best model for the task at hand. Therefore, they have asked you to further expand the project by experimenting with the capabilities of GRU models, renowned for their efficiency and effectiveness in text classification tasks. Your new assignment is to apply the GRU model to classify articles from the Newsgroup dataset into the following categories:

rec.autos, sci.med, and comp.graphics.

* Complete the GRU class with the required parameters.
* Initialize the model with the same parameters.
* Train the model: pass the parameters to the criterion function, and backpropagate the loss.

In [17]:
# Complete the GRU model
class GRUModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(GRUModel, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.gru = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)
        
    def forward(self, x):
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        out, _ = self.gru(x.unsqueeze(1), h0)  # Reshape input to (batch_size, seq_length, input_size)
        out = out[:, -1, :]
        out = self.fc(out)
        return out
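
One way to make the efficiency comparison between the three architectures concrete is to count the weights in each recurrent stack. The sketch below assumes PyTorch's parameter layout for unidirectional nn.RNN, nn.GRU, and nn.LSTM: per layer, an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors, each stacked over 1, 3, or 4 gate groups respectively. It also assumes input_size is 5000, which holds only if the vectorizer's vocabulary reaches max_features. recurrent_param_count is a hypothetical helper written for this illustration, not a PyTorch function.

```python
def recurrent_param_count(input_size, hidden_size, num_layers, gates):
    """Count parameters in a stacked, unidirectional recurrent module."""
    total = 0
    for layer in range(num_layers):
        # Only the first layer sees the raw features; deeper layers see hidden states
        layer_input = input_size if layer == 0 else hidden_size
        weights = layer_input * hidden_size + hidden_size * hidden_size
        biases = 2 * hidden_size  # bias_ih and bias_hh
        total += gates * (weights + biases)
    return total

# Assumed sizes: input_size from max_features=5000, the rest from the Parameters cell
input_size, hidden_size, num_layers = 5000, 32, 2

counts = {
    name: recurrent_param_count(input_size, hidden_size, num_layers, gates)
    for name, gates in [("RNN", 1), ("GRU", 3), ("LSTM", 4)]
}
for name, count in counts.items():
    print(f"{name}: {count} recurrent parameters")
```

Under these assumptions the GRU carries three times the recurrent weights of the plain RNN and the LSTM four times, so the GRU sits between the two in size.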


In [34]:
# Initialize the model
gru_model = GRUModel(input_size, hidden_size, num_layers, num_classes)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(gru_model.parameters(), lr=0.01)

# Train the model and backpropagate the loss after initialization
for epoch in range(15):
    optimizer.zero_grad()
    outputs = gru_model(X_train_seq)
    loss = criterion(outputs, y_train_seq)
    loss.backward()
    optimizer.step()
    print(f'Epoch: {epoch+1}, Loss: {loss.item()}')

Epoch: 1, Loss: 1.0958889722824097
Epoch: 2, Loss: 1.0295681953430176
Epoch: 3, Loss: 0.9211702942848206
Epoch: 4, Loss: 0.7771641612052917
Epoch: 5, Loss: 0.6244705319404602
Epoch: 6, Loss: 0.475839227437973
Epoch: 7, Loss: 0.34952712059020996
Epoch: 8, Loss: 0.24499188363552094
Epoch: 9, Loss: 0.16498719155788422
Epoch: 10, Loss: 0.10784843564033508
Epoch: 11, Loss: 0.06953068822622299
Epoch: 12, Loss: 0.04533623531460762
Epoch: 13, Loss: 0.030256202444434166
Epoch: 14, Loss: 0.020781321451067924
Epoch: 15, Loss: 0.01461486890912056


In [35]:
# Evaluate the model
gru_model.eval()
with torch.no_grad():
    outputs = gru_model(X_test_seq)
    _, y_pred_gru = torch.max(outputs, 1)
#     accuracy = accuracy_score(y_test_seq, y_pred_gru)
#     print(f'Test Accuracy: {accuracy:.2f}')

You've effectively trained GRU models for text classification. The decreasing model loss across epochs is promising, and can be used by the PyBooks team for comparison with other models!


# Evaluating RNN classifications


The team at PyBooks now wants you to evaluate the RNN model you created and ran using the Newsgroup dataset. Recall that the goal was to classify the articles into one of three categories:

rec.autos, sci.med, and comp.graphics.


An instance of rnn_model trained in the previous exercise is preloaded for you, too.


* Create an instance of each metric for multi-class classification with num_classes equal to the number of categories.
* Generate the predictions for the rnn_model using the test data X_test_seq.
* Calculate the metrics using the predicted classes and the true labels.

In [29]:
from torchmetrics import Accuracy, Precision, F1Score, Recall

# Create an instance of the metrics
accuracy = Accuracy(task="multiclass", num_classes=3)
precision = Precision(task="multiclass", num_classes=3)
recall = Recall(task="multiclass", num_classes=3)
f1 = F1Score(task="multiclass", num_classes=3)

# Generate the predictions (no gradients are needed for evaluation)
with torch.no_grad():
    outputs = rnn_model(X_test_seq)
_, predicted = torch.max(outputs, 1)

# Calculate the metrics (named so that sklearn's accuracy_score is not overwritten)
rnn_accuracy = accuracy(predicted, y_test_seq)
rnn_precision = precision(predicted, y_test_seq)
rnn_recall = recall(predicted, y_test_seq)
rnn_f1 = f1(predicted, y_test_seq)
print("RNN Model - \n Accuracy: {} \n Precision: {} \n Recall: {} \n F1 Score: {} \n".format(rnn_accuracy, rnn_precision, rnn_recall, rnn_f1))

RNN Model - 
 Accuracy: 0.9712352156639099 
 Precision: 0.9712352156639099 
 Recall: 0.9712352156639099 
 F1 Score: 0.9712352156639099 



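These metrics summarize how well the model classifies the test articles. All four are about 0.97, which is a good sign. The four values are exactly identical, which is what micro averaging yields for single-label multi-class predictions: precision, recall, and F1 all reduce to accuracy. That appears to be the averaging in effect here, since no average argument was passed when the metrics were created; per-class (macro) averaging would generally give different values. Keep up the excellent work!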
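
To see how the choice of averaging changes precision and recall, here is a minimal plain-Python sketch. The labels are a small, deliberately imbalanced toy example rather than the Newsgroup predictions, and per_class_counts is a hypothetical helper written for this illustration. It assumes every class is predicted at least once, so no division by zero occurs.

```python
def per_class_counts(y_true, y_pred, num_classes):
    """Return true-positive, false-positive, and false-negative counts per class."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted, but wrongly
            fn[t] += 1  # true class t was missed
    return tp, fp, fn

# Toy labels: class 0 dominates and absorbs two of the errors
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 2]
num_classes = 3

tp, fp, fn = per_class_counts(y_true, y_pred, num_classes)

accuracy = sum(tp) / len(y_true)

# Micro averaging pools the counts over all classes before dividing
micro_precision = sum(tp) / (sum(tp) + sum(fp))
micro_recall = sum(tp) / (sum(tp) + sum(fn))

# Macro averaging computes the metric per class, then takes the plain mean
macro_precision = sum(tp[c] / (tp[c] + fp[c]) for c in range(num_classes)) / num_classes
macro_recall = sum(tp[c] / (tp[c] + fn[c]) for c in range(num_classes)) / num_classes

print(f"accuracy={accuracy:.3f}")
print(f"micro: precision={micro_precision:.3f}, recall={micro_recall:.3f}")
print(f"macro: precision={macro_precision:.3f}, recall={macro_recall:.3f}")
```

In this toy case the micro values both equal the accuracy, while the macro values diverge because the two small classes count as much as the large one.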

**Evaluating the model's performance**

The PyBooks team has been making strides on the book recommendation engine. The modeling team has provided you with two different models ready for your book recommendation engine at PyBooks. One model is based on LSTM (lstm_model) and the other uses a GRU (gru_model). You've been tasked with evaluating and comparing these models.

The testing labels y_test_seq and the models' predictions, y_pred_lstm for lstm_model and y_pred_gru for gru_model, are available to you.

* Define accuracy, precision, recall and F1 for multi-class classification by specifying num_classes and task.
* Calculate and print the accuracy, precision, recall, and F1 score for lstm_model.
* Similarly, calculate the evaluation metrics for gru_model.

In [37]:
# Create an instance of the metrics
accuracy = Accuracy(task="multiclass", num_classes=3)
precision = Precision(task="multiclass", num_classes=3)
recall = Recall(task="multiclass", num_classes=3)
f1 = F1Score(task="multiclass", num_classes=3)

# Calculate metrics for the LSTM model
accuracy_1 = accuracy(y_pred_lstm, y_test_seq)
precision_1 = precision(y_pred_lstm, y_test_seq)
recall_1 = recall(y_pred_lstm, y_test_seq)
f1_1 = f1(y_pred_lstm, y_test_seq)
print("LSTM Model - Accuracy: {}, Precision: {}, Recall: {}, F1 Score: {}".format(accuracy_1, precision_1, recall_1, f1_1))

# Calculate metrics for the GRU model
accuracy_2 = accuracy(y_pred_gru, y_test_seq)
precision_2 = precision(y_pred_gru, y_test_seq)
recall_2 = recall(y_pred_gru, y_test_seq)
f1_2 = f1(y_pred_gru, y_test_seq)
print("GRU Model - Accuracy: {}, Precision: {}, Recall: {}, F1 Score: {}".format(accuracy_2, precision_2, recall_2, f1_2))

LSTM Model - Accuracy: 0.9492385983467102, Precision: 0.9492385983467102, Recall: 0.9492385983467102, F1 Score: 0.9492385983467102
GRU Model - Accuracy: 0.9644669890403748, Precision: 0.9644669890403748, Recall: 0.9644669890403748, F1 Score: 0.9644669890403748


Well done! You've evaluated and compared two different models. Now, PyBooks can decide which model to deploy for their book recommendation engine. 