#  Long Exam 5 - On Deep Learning Methods

## Name: Yumol, Dianne


**Instructions**

Save this notebook with your answers and code, then submit it here in Canvas.

## Problem 1. Preprocessing text  (20 points)

As a Data Analyst at PyBooks, you are working toward mastering text preprocessing, and what better practice material than Sherlock Holmes? Your task is to preprocess a block of text using the techniques presented in the video to prepare it for further analysis.

The `text` variable is an excerpt from *The Hound of the Baskervilles* by Arthur Conan Doyle.

The following packages and functions will be used: `nltk`, `torch`, `get_tokenizer`, `PorterStemmer`, `stopwords`.

In [1]:
!pip install torchtext



In [2]:
import nltk
import torch
import torchtext

nltk.download('stopwords')

from torchtext.data.utils import get_tokenizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to C:\Users\Don
[nltk_data]     Bosco\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


###  Instructions: Fill in the blanks and use the text below.

In [3]:
text = 'The moor is very sparsely inhabited, and those who live near each other are thrown very much together. For this reason I saw a good deal of Sir Charles Baskerville. With the exception of Mr. Frankland, of Lafter Hall, and Mr. Stapleton, the naturalist, there are no other men of education within many miles. Sir Charles was a retiring man, but the chance of his illness brought us together, and a community of interests in science kept us so. He had brought back much scientific information from South Africa, and many a charming evening we have spent together discussing the comparative anatomy of the Bushman and the Hottentot.'


### 1.1   Initialize the tokenizer with "basic_english" and tokenize the text using the tokenizer.

In [4]:
# Initialize and tokenize the text
### 1.1.1 Answer 
tokenizer = get_tokenizer("basic_english")
### 1.1.2 Answer 
tokens = tokenizer(text)
print(tokens)

['the', 'moor', 'is', 'very', 'sparsely', 'inhabited', ',', 'and', 'those', 'who', 'live', 'near', 'each', 'other', 'are', 'thrown', 'very', 'much', 'together', '.', 'for', 'this', 'reason', 'i', 'saw', 'a', 'good', 'deal', 'of', 'sir', 'charles', 'baskerville', '.', 'with', 'the', 'exception', 'of', 'mr', '.', 'frankland', ',', 'of', 'lafter', 'hall', ',', 'and', 'mr', '.', 'stapleton', ',', 'the', 'naturalist', ',', 'there', 'are', 'no', 'other', 'men', 'of', 'education', 'within', 'many', 'miles', '.', 'sir', 'charles', 'was', 'a', 'retiring', 'man', ',', 'but', 'the', 'chance', 'of', 'his', 'illness', 'brought', 'us', 'together', ',', 'and', 'a', 'community', 'of', 'interests', 'in', 'science', 'kept', 'us', 'so', '.', 'he', 'had', 'brought', 'back', 'much', 'scientific', 'information', 'from', 'south', 'africa', ',', 'and', 'many', 'a', 'charming', 'evening', 'we', 'have', 'spent', 'together', 'discussing', 'the', 'comparative', 'anatomy', 'of', 'the', 'bushman', 'and', 'the', 'ho

### 1.2  Create a set of English stopwords and use list comprehension to filter these `stop_words` out of the text, making sure to ignore capitalization.

In [5]:
## Create a set of English stopwords 
### 1.2.1 Answer 
stop_words = set(stopwords.words("english"))

# Remove any stopwords ignoring capitalization
### 1.2.2 Answer 
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
print(filtered_tokens)

['moor', 'sparsely', 'inhabited', ',', 'live', 'near', 'thrown', 'much', 'together', '.', 'reason', 'saw', 'good', 'deal', 'sir', 'charles', 'baskerville', '.', 'exception', 'mr', '.', 'frankland', ',', 'lafter', 'hall', ',', 'mr', '.', 'stapleton', ',', 'naturalist', ',', 'men', 'education', 'within', 'many', 'miles', '.', 'sir', 'charles', 'retiring', 'man', ',', 'chance', 'illness', 'brought', 'us', 'together', ',', 'community', 'interests', 'science', 'kept', 'us', '.', 'brought', 'back', 'much', 'scientific', 'information', 'south', 'africa', ',', 'many', 'charming', 'evening', 'spent', 'together', 'discussing', 'comparative', 'anatomy', 'bushman', 'hottentot', '.']
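
The same filter can be sketched without any library. The tiny `demo_stop_words` set below is a hypothetical stand-in for NLTK's English list (which stores its entries in lowercase), so each token is lowercased only for the membership test:

```python
# Hypothetical miniature stop-word set standing in for stopwords.words("english").
demo_stop_words = {"the", "is", "and", "of"}

demo_tokens = ["The", "moor", "IS", "sparsely", "inhabited", "and", "quiet"]

# Lowercase each token only for the comparison; kept tokens retain their casing.
demo_filtered = [t for t in demo_tokens if t.lower() not in demo_stop_words]
print(demo_filtered)
```

Because `basic_english` already lowercases its output, the `.lower()` call is redundant for this particular tokenizer, but it keeps the filter correct if a case-preserving tokenizer is swapped in.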


###  1.3  Perform stemming on the filtered_tokens using the appropriate `nltk` function.

In [6]:
# Perform stemming on the filtered tokens
### 1.3.1 Answer
stemmer = PorterStemmer()
### 1.3.2 Answer 
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
print(stemmed_tokens)

['moor', 'spars', 'inhabit', ',', 'live', 'near', 'thrown', 'much', 'togeth', '.', 'reason', 'saw', 'good', 'deal', 'sir', 'charl', 'baskervil', '.', 'except', 'mr', '.', 'frankland', ',', 'lafter', 'hall', ',', 'mr', '.', 'stapleton', ',', 'naturalist', ',', 'men', 'educ', 'within', 'mani', 'mile', '.', 'sir', 'charl', 'retir', 'man', ',', 'chanc', 'ill', 'brought', 'us', 'togeth', ',', 'commun', 'interest', 'scienc', 'kept', 'us', '.', 'brought', 'back', 'much', 'scientif', 'inform', 'south', 'africa', ',', 'mani', 'charm', 'even', 'spent', 'togeth', 'discuss', 'compar', 'anatomi', 'bushman', 'hottentot', '.']


## Problem 2. Creating a Shakespearean language encoder  (25 points)

Over at PyBooks, the team wants to transform a vast library of Shakespearean text data for further analysis. The most efficient way to do this is with a text processing pipeline, starting with the preprocessing steps which you have done.

With the *preprocessed Shakespearean text* at your fingertips, you now face the challenge of encoding it into a numerical representation. This means it's time to define the encoding steps before putting the pipeline together. To handle large amounts of data and perform the encoding efficiently, you will use PyTorch's `Dataset` and `DataLoader` for batching and shuffling the data.

The processed Shakespearean text data is saved as `processed_shakespeare_df` and the `processed_sentences` have already been extracted.

In [7]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from torch.utils.data.dataset import Dataset
from torch.utils.data.dataloader import DataLoader

In [11]:
import os
# I updated the code so that I didn't encounter any errors
dataset_path = os.path.join("Datasets", "shakespeare_complete_works.txt")

# Read the dataset file, ignoring any bytes that cannot be decoded
with open(dataset_path, "r", errors="ignore") as file:
    shakespeare = file.readlines()

# Create a list of stopwords
stop_words = set(stopwords.words("english"))

# Initialize the tokenizer and stemmer
tokenizer = get_tokenizer("basic_english")
stemmer = PorterStemmer() 

### 2.1  Define a ShakespeareDataset dataset class and complete the `__init__` and `__getitem__` methods.


In [12]:
## Define your Dataset class
class ShakespeareDataset(Dataset):
    def __init__(self, data):
        ### 2.1.1 Answer 
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        ### 2.1.2 Answer 
        return self.data[idx]  
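
To see why `__len__` and `__getitem__` are the only methods the class needs, here is a rough, library-free sketch of what a data loader does with them. `ToyDataset` and `iterate_batches` are hypothetical names used only for illustration; the real `DataLoader` also collates each batch into tensors, which this sketch omits:

```python
import random

class ToyDataset:
    """Minimal map-style dataset: only __len__ and __getitem__ are required."""
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx]

def iterate_batches(dataset, batch_size, shuffle, seed=0):
    """Yield lists of samples: shuffle the indices, then slice them into batches."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]

toy = ToyDataset([[1, 0], [0, 1], [1, 1], [0, 0], [2, 1]])
batches = list(iterate_batches(toy, batch_size=2, shuffle=True))
print(batches)
```

With five samples and `batch_size=2`, the last batch holds a single sample, which mirrors what happens when the dataset size is not a multiple of the batch size.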

### 2.2 Complete the `preprocess_sentences()` function to enable tokenization, stop word removal, and stemming.

In [13]:
# Complete the function to preprocess sentences

def preprocess_sentences(sentences):
    processed_sentences = []
    for sentence in sentences:
        ### 2.2.1 Answer 
        sentence = sentence.lower()
        ### 2.2.2 Answer 
        tokens = tokenizer(sentence)
        tokens = [token for token in tokens if token not in stop_words]
        ### 2.2.3 Answer 
        tokens = [stemmer.stem(token) for token in tokens]
        processed_sentences.append(' '.join(tokens))
    return processed_sentences

###  2.3 Complete the `encode_sentences()` function to take in a list of sentences and encode them using the `bag-of-words` technique from `sklearn`.

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

## Complete the encoding function
def encode_sentences(sentences):
    vectorizer = CountVectorizer()
    ### 2.3 Answer 
    X = vectorizer.fit_transform(sentences)
    return X.toarray(), vectorizer
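
As a rough illustration of what the bag-of-words encoding produces, the sketch below builds the count matrix by hand. `bag_of_words` is a hypothetical helper that splits on whitespace; `CountVectorizer` applies its own tokenization (by default it lowercases and drops single-character tokens), so the two will not always agree on the vocabulary:

```python
from collections import Counter

def bag_of_words(sentences):
    """Return (count matrix, sorted vocabulary) for whitespace-tokenized sentences."""
    vocabulary = sorted({word for s in sentences for word in s.split()})
    column = {word: j for j, word in enumerate(vocabulary)}
    matrix = []
    for s in sentences:
        counts = Counter(s.split())
        row = [0] * len(vocabulary)
        for word, n in counts.items():
            row[column[word]] = n
        matrix.append(row)
    return matrix, vocabulary

matrix, vocabulary = bag_of_words(["love love book", "hate book"])
print(vocabulary)  # ['book', 'hate', 'love']
print(matrix)      # [[1, 0, 2], [1, 1, 0]]
```

Each row is one sentence and each column one vocabulary word, so word order within a sentence is discarded.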

###  2.4  Complete and call the `text_processing_pipeline()` function by using `preprocess_sentences()`, `encode_sentences()`, `ShakespeareDataset` class, and `DataLoader`.

In [15]:
# Complete the text processing pipeline
def text_processing_pipeline(sentences):
    ### 2.4.1 Answer 
    processed_sentences = preprocess_sentences(sentences)
    encoded_sentences, vectorizer = encode_sentences(processed_sentences)
    ### 2.4.2 Answer 
    dataset = ShakespeareDataset(encoded_sentences)
    ### 2.4.3 Answer 
    dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
    return dataloader, vectorizer

dataloader, vectorizer = text_processing_pipeline(shakespeare)

# Print the vectorizer's feature names at indices 1000 through 1009
print(vectorizer.get_feature_names_out()[1000:1010]) 

['accurs' 'accurst' 'accursâ' 'accus' 'accusativo' 'accuserâ' 'accuseth'
 'accustom' 'accustomâ' 'accusâ']


##  Problem 3. Convolutional Neural Networks  (25 points)

PyBooks has successfully built a book recommendation engine. Their next task is to implement a sentiment analysis model to understand user reviews and gain insight into book preferences.

You'll use a Convolutional Neural Network (CNN) model to classify text data (book reviews) based on their sentiment.

In [16]:
import torch 
import torch.nn as nn
import torch.nn.functional as F

### 3.1 Do the following:

*   Initialize the embedding layer in the `__init__()` method.
*   Apply the `ReLU` activation to this layer within the `forward()` method.

In [20]:
class TextClassificationCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super(TextClassificationCNN, self).__init__()
        # Initialize the embedding layer 
        ### 3.1.1 Answer
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, 
                              embed_dim, 
                              kernel_size=3, 
                              stride=1, 
                              padding=1)
        self.relu = nn.ReLU()  # ReLU module, applied to the convolution output in forward()
        self.fc = nn.Linear(embed_dim, 2)
    
    def forward(self, text):
        embedded = self.embedding(text).permute(0, 2, 1)
        # Pass the embedded text through the convolutional layer and apply a ReLU
        ### 3.1.2 Answer
        conved = self.relu(self.conv(embedded))   
        conved = conved.mean(dim=2) 
        return self.fc(conved)
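
The choice of `kernel_size=3`, `stride=1`, `padding=1` keeps the sequence length unchanged, so `conved.mean(dim=2)` averages one feature vector per token. The sketch below checks this with the standard output-length formula for a 1-D convolution; `conv1d_output_length` is a hypothetical helper, not part of PyTorch:

```python
def conv1d_output_length(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution along the sequence axis."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

# Reviews in this notebook have 4 to 6 tokens.
for n_tokens in (4, 5, 6):
    same = conv1d_output_length(n_tokens, kernel_size=3, stride=1, padding=1)
    shrunk = conv1d_output_length(n_tokens, kernel_size=3, stride=1, padding=0)
    print(n_tokens, same, shrunk)
```

Without padding, each review would lose two positions, and a review shorter than the kernel could not be convolved at all.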

PyBooks now needs to train the model to optimize it for accurate sentiment analysis of book reviews.

In [21]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

#  text data for this problem
data = [(['I', 'love', 'this', 'book'], 1),
 (['This', 'is', 'an', 'amazing', 'novel'], 1),
 (['I', 'really', 'like', 'this', 'story'], 1),
 (['I', 'do', 'not', 'like', 'this', 'book'], 0),
 (['I', 'hate', 'this', 'novel'], 0),
 (['This', 'is', 'a', 'terrible', 'story'], 0)]

feature, label = zip(*data)
tokenized_sentences = list(feature)

unique_words = set()
vocab = []

for sentence in tokenized_sentences:
    for word in sentence:
        if word not in unique_words:
            vocab.append(word)
            unique_words.add(word)

word_to_idx = {word: i for i, word in enumerate(vocab)}

### 3.2 Define a loss function used for binary classification and save as criterion.

In [22]:
vocab_size = len(vocab)
embed_dim = 10
model = TextClassificationCNN(vocab_size, embed_dim)

## Define the loss function
## 3.2 Answer
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

### 3.3  Zero the gradients at the start of the training loop and update the parameters at the end of the loop

In [31]:
for epoch in range(10):
    for sentence, label in data:     
        # Clear the gradients
        ### 3.3.1 Answer
        model.zero_grad()
        sentence = torch.LongTensor([word_to_idx.get(w, 0) for w in sentence]).unsqueeze(0)
        label = torch.LongTensor([int(label)])
        # BCEWithLogitsLoss needs float targets shaped like the (1, 2) logits, so one-hot encode the label
        label = torch.nn.functional.one_hot(label, num_classes=2).float()
        outputs = model(sentence)
        loss = criterion(outputs, label)
        loss.backward()
        # Update the parameters
        ### 3.3.2 Answer
        optimizer.step()
    print('Epoch {}: Loss: {:.4f}'.format(epoch+1, loss.item()))
print('Training complete!')

Epoch 1: Loss: 0.6123
Epoch 2: Loss: 0.6016
Epoch 3: Loss: 0.5883
Epoch 4: Loss: 0.5769
Epoch 5: Loss: 0.5634
Epoch 6: Loss: 0.5421
Epoch 7: Loss: 0.5222
Epoch 8: Loss: 0.4992
Epoch 9: Loss: 0.4813
Epoch 10: Loss: 0.4580
Training complete!
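
For reference, the quantity printed above can be reproduced by hand. The sketch below is a plain-Python version of binary cross-entropy on logits, averaged over all elements, applied to a two-logit output with a one-hot target as in the training loop; `bce_with_logits` is a hypothetical helper, and the logit values are made up for illustration:

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all elements, computed from raw logits."""
    total = 0.0
    for x, z in zip(logits, targets):
        # Numerically stable form of -[z*log(sigmoid(x)) + (1-z)*log(1-sigmoid(x))].
        total += max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

# Two logits for one review, one-hot target for class 1 (positive).
loss_untrained = bce_with_logits([0.0, 0.0], [0.0, 1.0])
loss_confident = bce_with_logits([-3.0, 3.0], [0.0, 1.0])
print(loss_untrained, loss_confident)
```

All-zero logits give a loss of ln 2 (about 0.693), which is the natural reference point for the losses logged during the first epochs.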


### 3.4 Testing the Sentiment Analysis CNN Model

Now that the model is trained, PyBooks wants to check its performance on some new book reviews.

You need to check if the sentiment in a review is `positive` or `negative`.


In [32]:
book_reviews = [
    "I love this book".split(),
    "I do not like this book".split()
]

for review in book_reviews:
    # Convert the review words into tensor form
    input_tensor = torch.LongTensor([word_to_idx[w] for w in review]).unsqueeze(0)
    
    # Get the model's output
    outputs = model(input_tensor)
    
    # Find the index of the most likely sentiment category
    _, predicted_label = torch.max(outputs.data, 1)
    
    # Convert the predicted label into a sentiment string
    sentiment = "Positive" if predicted_label.item() == 1 else "Negative"
    
    print(f"Book Review: {' '.join(review)}")
    print(f"Sentiment: {sentiment}\n")

Book Review: I love this book
Sentiment: Positive

Book Review: I do not like this book
Sentiment: Negative
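
Both test reviews use only words seen during training. For genuinely new reviews, `word_to_idx[w]` raises a `KeyError` on an unseen word, while the training loop's `word_to_idx.get(w, 0)` silently maps it to index 0, which in this vocabulary is the word "I". One common remedy, sketched here with a hypothetical `<unk>` token and a made-up vocabulary, is to reserve an index for unknown words:

```python
# Hypothetical vocabulary with index 0 reserved for unknown words.
UNK = "<unk>"
demo_vocab = [UNK, "I", "love", "this", "book"]
demo_word_to_idx = {word: i for i, word in enumerate(demo_vocab)}

def encode(words, mapping, unk_token=UNK):
    """Map each word to its index, falling back to the unknown-token index."""
    unk_idx = mapping[unk_token]
    return [mapping.get(w, unk_idx) for w in words]

seen = encode("I love this book".split(), demo_word_to_idx)
unseen = encode("I adore this book".split(), demo_word_to_idx)
print(seen)    # [1, 2, 3, 4]
print(unseen)  # [1, 0, 3, 4]
```

If this approach were adopted, `vocab_size` would have to count the extra token so the embedding layer has a row for it.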



## Problem 4. Long Short-Term Memory (30 points)

### Data Preparation

In [33]:
#!pip install scikit-learn

from sklearn.datasets import fetch_20newsgroups

# Specify the categories you want to download. You can also use 'all' to get all categories.
categories = ['rec.autos', 'sci.med', 'comp.graphics']

# Fetch the dataset
newsgroups = fetch_20newsgroups(categories=categories, shuffle=True, random_state=42)
y = newsgroups.target

# Reload the same categories with headers, footers, and quotes stripped.
# Both calls use the default subset='train', shuffle=True, and random_state=42,
# so the documents stay aligned with the labels in y.
newsgroups = fetch_20newsgroups(categories=categories, remove=('headers', 'footers', 'quotes'))

# news_data = newsgroups.data
newsgroups_train = fetch_20newsgroups(subset='train')
#newsgroups_train

In [34]:
import re
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def set_clean(raw_text):
    set_stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    
    no_punc = re.sub(r'[^\w\s]', '', raw_text)
    lowercase_no_punc = no_punc.lower()
    tokenized_text = word_tokenize(lowercase_no_punc)
    no_stop = [w for w in tokenized_text if w not in set_stop_words]
    lc_text = [lemmatizer.lemmatize(word, pos="v") for word in no_stop]
    lc_text = [lemmatizer.lemmatize(word, pos="n") for word in lc_text]
    lc_text = [lemmatizer.lemmatize(word, pos="a") for word in lc_text]
    lc_text = [lemmatizer.lemmatize(word, pos="r") for word in lc_text]
    lc_text = [lemmatizer.lemmatize(word, pos="s") for word in lc_text]
    return lc_text

[nltk_data] Downloading package punkt to C:\Users\Don
[nltk_data]     Bosco\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Don
[nltk_data]     Bosco\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to C:\Users\Don
[nltk_data]     Bosco\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Don
[nltk_data]     Bosco\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [35]:
# parameters 
input_size = 21500
hidden_size = 32
num_layers = 2
num_classes = 3

###  4.1 Tokenize and vectorize the text data

In [36]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

#  Tokenize and vectorize the text data:
### 4.1.1 Answer
vectorizer = CountVectorizer(analyzer=set_clean, max_features=input_size)

#fit and transform the vectorizer
### 4.1.2 Answer
bow_matrix = vectorizer.fit_transform(newsgroups.data)


# L2-normalize each document's row of counts (rows, not columns, are scaled)
X = normalize(bow_matrix, norm='l2')
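
A plain-Python sketch of row-wise L2 normalization may make the step concrete. `l2_normalize_rows` is a hypothetical helper and the counts are made up; leaving all-zero rows untouched is an assumption of this sketch, chosen so that empty documents (possible once headers, footers, and quotes are removed) do not cause a division by zero:

```python
import math

def l2_normalize_rows(rows):
    """Scale each row to unit Euclidean length; all-zero rows are left unchanged."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        normalized.append([v / norm for v in row] if norm > 0 else list(row))
    return normalized

counts = [[3, 4, 0], [0, 0, 0], [1, 1, 1]]
unit_rows = l2_normalize_rows(counts)
print(unit_rows)
```

After this step a long document and a short one with the same word proportions map to the same vector, so document length no longer dominates the features.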


###  4.2 Split the text data into train and test sets and convert them to tensors

In [37]:
import numpy as np
from sklearn.model_selection import train_test_split

# Split into training and testing
### 4.2.1 Answer
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=321)
X_train_seq = torch.tensor(X_train.toarray()).unsqueeze(1).to(torch.float32)
### 4.2.2 Answer
X_test_seq = torch.tensor(X_test.toarray()).unsqueeze(1).to(torch.float32)
y_train_seq = torch.tensor(y_train).to(torch.long)
### 4.2.3 Answer
y_test_seq = torch.tensor(y_test).to(torch.long)

### 4.3  Set up the LSTM model

In [38]:
# Set up an LSTM model by completing the LSTM and linear layers with the necessary parameters.

class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        ### 4.3.1 Answer
        super(LSTMModel, self).__init__()
        self.hidden_size = hidden_size
        ### 4.3.2 Answer
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        ### 4.3.3 Answer
        self.fc = nn.Linear(hidden_size, num_classes)       

    def forward(self, x):
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
        out, _ = self.lstm(x, (h0, c0))
        out = out[:, -1, :] 
        out = self.fc(out)
        return out
 

###  4.4  Initialize the model with the necessary parameters.

In [39]:
# Initialize model with required parameters
### 4.4 Answer
lstm_model = LSTMModel(input_size, hidden_size, num_layers, num_classes)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(lstm_model.parameters(), lr=0.01)

### 4.5 Train the LSTM model: reset the gradients to zero, pass the input data `X_train_seq` through the model, and calculate the loss from the predicted outputs and the true labels.

In [49]:
# Train the model by passing the correct parameters and zeroing the gradients
for epoch in range(20):
    optimizer.zero_grad()
    ### 4.5.1 Answer
    outputs = lstm_model(X_train_seq)
    ### 4.5.2 Answer
    loss = criterion(outputs, y_train_seq)
    loss.backward()
    optimizer.step()
    print(f'Epoch: {epoch+1}, Loss: {loss.item()}')

Epoch: 1, Loss: 0.17680732905864716
Epoch: 2, Loss: 0.14261922240257263
Epoch: 3, Loss: 0.11559505015611649
Epoch: 4, Loss: 0.09492459148168564
Epoch: 5, Loss: 0.07962517440319061
Epoch: 6, Loss: 0.06864938139915466
Epoch: 7, Loss: 0.06098884716629982
Epoch: 8, Loss: 0.055756185203790665
Epoch: 9, Loss: 0.05222829431295395
Epoch: 10, Loss: 0.04984162747859955
Epoch: 11, Loss: 0.04817504063248634
Epoch: 12, Loss: 0.04693993926048279
Epoch: 13, Loss: 0.04597131162881851
Epoch: 14, Loss: 0.04520771652460098
Epoch: 15, Loss: 0.044646967202425
Epoch: 16, Loss: 0.04429207742214203
Epoch: 17, Loss: 0.04410171136260033
Epoch: 18, Loss: 0.043969057500362396
Epoch: 19, Loss: 0.04378441721200943
Epoch: 20, Loss: 0.04352302476763725


### 4.6   Evaluating the model's performance using the test set

In [42]:
!pip install torchmetrics
import torchmetrics
from torchmetrics import Accuracy
from torchmetrics import Precision
from torchmetrics import Recall
from torchmetrics import F1Score

Collecting torchmetrics
  Downloading torchmetrics-1.2.0-py3-none-any.whl (805 kB)
     ------------------------------------- 805.2/805.2 kB 12.8 MB/s eta 0:00:00
Collecting lightning-utilities>=0.8.0
  Downloading lightning_utilities-0.10.0-py3-none-any.whl (24 kB)
Installing collected packages: lightning-utilities, torchmetrics
Successfully installed lightning-utilities-0.10.0 torchmetrics-1.2.0


In [50]:
# Apply the lstm_model to the test set (gradients are not needed for evaluation)
with torch.no_grad():
    y_pred_lstm = lstm_model(X_test_seq)

# Create an instance of the metrics
accuracy_metric = Accuracy(task="multiclass", num_classes=3)
precision_metric = Precision(task="multiclass", num_classes=3)
recall_metric = Recall(task="multiclass", num_classes=3)
f1_metric = F1Score(task="multiclass", num_classes=3)

## Calculate metrics for the LSTM model
### Answer 4.6.1
accuracy = accuracy_metric(y_pred_lstm, y_test_seq)
### Answer 4.6.2
precision = precision_metric(y_pred_lstm, y_test_seq)
### Answer 4.6.3
recall = recall_metric(y_pred_lstm, y_test_seq)
### Answer 4.6.4
f1 = f1_metric(y_pred_lstm, y_test_seq)

print("LSTM Model: \n Accuracy: {},\n Precision: {},\n Recall: {},\n F1 Score: {}".format(accuracy, precision, recall, f1))

LSTM Model: 
 Accuracy: 0.9239436388015747,
 Precision: 0.9239436388015747,
 Recall: 0.9239436388015747,
 F1 Score: 0.9239436388015747
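
The four identical values are consistent with micro averaging, where counts are pooled over all classes before the ratio is taken. In single-label multiclass classification every wrong prediction is one false positive (for the predicted class) and one false negative (for the true class), so micro precision, micro recall, and micro F1 all equal accuracy. The sketch below shows this on made-up labels; `precision_scores` is a hypothetical helper, not a torchmetrics function:

```python
def precision_scores(y_true, y_pred, num_classes):
    """Return (micro, macro) precision for integer class labels."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[p] += 1
        else:
            fp[p] += 1
    # Micro: pool the counts over all classes, then take one ratio.
    micro = sum(tp) / (sum(tp) + sum(fp))
    # Macro: take one ratio per class, then average the ratios.
    per_class = [tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
                 for c in range(num_classes)]
    macro = sum(per_class) / num_classes
    return micro, macro

y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
micro, macro = precision_scores(y_true, y_pred, num_classes=3)
print(accuracy, micro, macro)
```

To obtain per-class-sensitive numbers from torchmetrics, the metric classes accept an `average` argument (for example `average="macro"`); rerunning the evaluation with it would be needed to report such values for this model.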
