# **Day 8: Embeddings**
---

### **Description**
In today's lab, we have another text classification task, but this time we will be using **embeddings.** For this project, we will be working with a dataset of BBC News articles classified by topic.

<br>

### **Lab Structure**

**Part 1**: [Review](#p1)
>
>**Part 1.1**: [Tokenization and Vectorization with sklearn](#p1.1)
>
>**Part 1.2**: [Tokenization and Vectorization with TextDataLoaders](#p1.2)

**Part 2**: [Embeddings](#p2)


**Part 3**: [News Classification with a Simple Neural Net with Embedding](#p3)

**Part 4**: [News Classification with a CNN with Embedding](#p4)

**Part 5**: [[ADDITIONAL PRACTICE] Sentiment Analysis with IMDB Movie Reviews](#p5)

<br>

### **Goals**
By the end of this lab, you will:
* Understand how tokenization and vectorization work when using TextDataLoaders.
* Understand how to apply embedding layers in models.
* Compare a fully connected network to a CNN for text classification with embeddings.

<br>

### **Cheat Sheets**
[Natural Language Processing II](https://docs.google.com/document/d/1OoP-sFW6qMk0BzvYMlavgJtiXX9eziTUptlFdzgLfGk/edit)

<br>

**Before starting, run the code below to import all necessary functions and libraries.**


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from fastai.text.all import *

# Use the GPU when one is available; otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

<a name="p1"></a>

---
## **Part 1: Review**
---



<a name="p1.1"></a>

---
### **Part 1.1: Tokenization and Vectorization with sklearn**
---

**Run the cell below to load a simple corpus for us to work with.**

In [None]:
# Define a collection of text documents
corpus = [
       "One Cent, Two Cents, Old Cent, New Cent: All About Money (Cat in the Hat's Learning Library)",
       "Inside Your Outside: All About the Human Body (Cat in the Hat's Learning Library)",
       "Oh, The Things You Can Do That Are Good for You: All About Staying Healthy (Cat in the Hat's Learning Library)",
       "On Beyond Bugs: All About Insects (Cat in the Hat's Learning Library)",
       "There's No Place Like Space: All About Our Solar System (Cat in the Hat's Learning Library)"
]

#### **Problem #1.1.1: Create a CountVectorizer object**



In [None]:
vectorizer = #FILL IN CODE HERE

#### **Problem #1.1.2: Fit the vectorizer to the corpus**



#### **Problem #1.1.3: Transform the corpus into a matrix of token counts**



In [None]:
# Transform the corpus into a matrix of token counts
# WRITE YOUR CODE HERE

# Print the resulting matrix
print(X.toarray())

#### **Problem #1.1.4: Print the tokens**

Use `get_feature_names_out()` to print the tokens.


Compare the tokens, the matrix, and the corpus. Do you see how each sentence is represented in the matrix?
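
To make the link between corpus, tokens, and matrix concrete, here is a minimal pure-Python sketch of the same bag-of-words idea on a made-up two-sentence corpus. It is not `CountVectorizer` itself: the names `toy_corpus` and `tokenize` are our own, and the lowercasing and two-or-more-character token rule are assumptions chosen to resemble sklearn's defaults.

```python
import re
from collections import Counter

# Hypothetical toy corpus, unrelated to the lab corpus above
toy_corpus = ["the cat sat", "the cat saw the hat"]

def tokenize(doc):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# The vocabulary is the sorted set of all tokens seen in the corpus
vocab = sorted({tok for doc in toy_corpus for tok in tokenize(doc)})

# One row per document, one column per vocabulary token
counts = [[Counter(tokenize(doc))[tok] for tok in vocab] for doc in toy_corpus]

print(vocab)
for row in counts:
    print(row)
```

Each row is one document and each column is one token from `vocab`, so the `2` in the last row records that "the" appears twice in the second sentence.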

<a name="p1.2"></a>

---
### **Part 1.2: Tokenization and Vectorization with TextDataLoaders**
---

Last time, we learned how to tokenize and vectorize data using `sklearn`'s `CountVectorizer()`. However, we don't always have to do this step manually. When loading data with fast.ai's `TextDataLoaders`, the DataLoader handles the tokenization and vectorization for us. Let's take a look.

**Run the code below to load the BBC News data into a pandas DataFrame.**


In [None]:
dataset = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vRRiQ1DUkUxk31YpaHA2i9QtwGq_VGXiy86z7l3aT9v5zoB6M7a-2M2qlYckr1C_ZG6StBELlU_hD3S/pub?output=csv')

#### **Problem #1.2.1: Load the data using TextDataLoaders**

Use the following parameters:
* `dataset`
* `text_col='text'`
* `label_col='category'`
* `valid_pct=0.2`
* `bs=64`
* `seq_len=100`


In [None]:
dls = TextDataLoaders.from_df(
    #FILL IN CODE HERE
)

#### **Problem #1.2.2: Print the vocabulary**


The `vocab` attribute of the DataLoaders object contains the vocabulary of the data and of the labels.

Use the first element of the `vocab` attribute of the DataLoaders object to print the vocabulary of the data.

#### **Problem #1.2.3: Print the labels**


Use the second element of the `vocab` attribute to print the labels.

#### **Problem #1.2.4: View vectorized data**
---

TextDataLoaders assigns a unique integer ID to each token in the vocabulary and represents each document as a sequence of those IDs, preserving the order of the tokens.

Use `dls.one_batch()` to pull one batch of the data and view the first instance.
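
As a rough illustration of what "a unique integer ID per token" means, the sketch below maps tokens to IDs and back using a tiny hand-written vocabulary. The vocabulary, the helper names `numericalize` and `decode`, and the choice to send unseen words to an unknown-token entry are all assumptions made for this example; the real vocabulary built by `TextDataLoaders` is much larger.

```python
# Hypothetical toy vocabulary; position in the list is the token's ID
toy_vocab = ["xxunk", "xxpad", "the", "match", "was", "final"]
token_to_id = {tok: i for i, tok in enumerate(toy_vocab)}

def numericalize(tokens):
    # Unknown tokens fall back to the ID of "xxunk"
    return [token_to_id.get(tok, token_to_id["xxunk"]) for tok in tokens]

def decode(ids):
    # Reverse lookup: ID -> token
    return [toy_vocab[i] for i in ids]

tokens = ["the", "final", "match", "was", "the", "replay"]
ids = numericalize(tokens)

print(ids)
print(decode(ids))
```

Unlike the count matrix from Part 1.1, the ID sequence keeps word order, and a repeated word such as "the" shows up as the same ID each time it occurs.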

In [None]:
xb, yb = # FILL IN CODE HERE

#### **Problem #1.2.5: Decode vectorized data**
---

Use `dls.show()` to decode the numeric data.

*Hint: Pass a tuple to the function.*

---

<center>

#### **Back to lecture**

---

<a name="p2"></a>

---
## **Part 2: Embeddings**
---

#### **Problem #2.1: Create an embedding layer in PyTorch**


We can use PyTorch's `nn.Embedding` to create embeddings. Create an embedding layer whose first argument is the vocabulary size and whose second argument is the embedding dimension. Set the embedding dimension to 50.
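
Conceptually, an embedding layer is a lookup table with one row of numbers per token ID. The NumPy sketch below imitates that lookup with a small random table; it is only an illustration of the idea, not how `nn.Embedding` is implemented, and `toy_vocab_size`, `toy_embed_dim`, and `table` are names invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, far smaller than the lab's vocabulary
toy_vocab_size, toy_embed_dim = 6, 4

# One row per token ID; in a real layer these rows are learned during training
table = rng.normal(size=(toy_vocab_size, toy_embed_dim))

# A batch of 2 documents, each a sequence of 3 token IDs
batch_ids = np.array([[2, 5, 3],
                      [4, 2, 0]])

# Indexing with the ID array replaces every ID with its row of the table
embedded_toy = table[batch_ids]

print(embedded_toy.shape)  # (batch size, sequence length, embedding dimension)
```

Token ID 2 appears in both documents, so `embedded_toy[0, 0]` and `embedded_toy[1, 1]` hold the same vector.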

In [None]:
# Create an embedding layer with 50 dimensions
vocab_size = # FILL IN CODE HERE
embedding = nn.Embedding(vocab_size, # FILL IN CODE HERE)

#### **Problem #2.2: Apply the embedding**


Earlier, we pulled one batch of the data and saved the data `xb` and labels `yb`. Apply the embedding to the batch of data `xb`. Compare this numerical representation to the integer IDs you viewed in Problem #1.2.4.

In [None]:
# Apply the embedding layer
embedded = # FILL IN CODE HERE

# Print the first embedded instance of the data
embedded[0]

#### **Run the code below to visualize the embedding.**

You can change the first and second dimension for plotting.

In [None]:
first_dimension = 0
second_dimension = 1

# Detach the tensor from the computational graph (preparing for plotting)
embedded = embedded.detach()

# Plot the embeddings
plt.figure(figsize=(10, 10))
plt.scatter(
    embedded[:, :, first_dimension].numpy().flatten(),
    embedded[:, :, second_dimension].numpy().flatten(),
    s=10)
plt.title('Embedding Visualization')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.show()

---

<center>

#### **Back to lecture**

---

<a name="p3"></a>

---
## **Part 3: News Classification with a Simple Neural Net with Embedding**
---

#### **Problem #3.1**



Fill in the code below for a fully connected network of your own design. Ensure you have the correct number of inputs and outputs.

Set the embedding dimension to 200.
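
The skeleton for this problem places a pooling layer and a Flatten layer between the embedding and the first linear layer. Our reading is that together they average the token vectors of each document into a single vector of length `embed_size`, so the linear layer receives a fixed-size input. The NumPy sketch below shows that averaging step on a made-up array; `embedded_toy` and `doc_vectors` are hypothetical names.

```python
import numpy as np

# Pretend embedding output: 2 documents, 3 tokens each, 4 embedding dimensions
embedded_toy = np.arange(24, dtype=float).reshape(2, 3, 4)

# Average over the token axis: one vector per document
doc_vectors = embedded_toy.mean(axis=1)

print(embedded_toy.shape)  # (2, 3, 4)
print(doc_vectors.shape)   # (2, 4)
print(doc_vectors)
```

Because the token axis is averaged away, word order no longer matters after this step, which is one reason to compare this network with the CNN in Part 4.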

In [None]:
embed_size = # FILL IN CODE HERE

model = nn.Sequential(
    nn.Embedding( # FILL IN CODE HERE
    nn.AdaptiveAvgPool2d((1, embed_size)),
    nn.Flatten(),
    nn.Linear(embed_size, # FILL IN CODE HERE
    # ADD THE REST OF YOUR LAYERS
)

#### **Problem #3.2**

Create a Learner object and fit the model. Since this is a multiclass classification problem, you will use `nn.CrossEntropyLoss()` as the loss function and accuracy as the metric. Choose your own hyperparameters (5 epochs and a learning rate of 0.001 are a good start).
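
For intuition about what the loss and the metric measure, here is a small NumPy sketch that computes a mean cross-entropy and an accuracy from raw model outputs. The `logits` and `labels` values are invented, and the calculation is a simplified stand-in for `nn.CrossEntropyLoss()` with its default settings rather than the library code.

```python
import numpy as np

# Hypothetical raw outputs (logits) for 3 articles and 3 classes
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0],
                   [1.0, 1.5, 0.0]])
labels = np.array([0, 2, 0])  # true class index for each article

# Log-softmax, shifted by the row maximum for numerical stability
shifted = logits - logits.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# Cross-entropy: average negative log-probability of the true class
loss = -log_probs[np.arange(len(labels)), labels].mean()

# Accuracy: fraction of rows whose largest logit is the true class
predictions = logits.argmax(axis=1)
accuracy = (predictions == labels).mean()

print(loss, accuracy)
```

The third article is misclassified (class 1 has the largest logit but the label is 0), so the accuracy is two out of three, and that row contributes the largest share of the loss.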

In [None]:
# Create a Learner and train the model


#### **Problem #3.3**

Now, evaluate the model for both the training and validation sets.


In [None]:
# Evaluate the training set


# Evaluate the validation set


#### **How did your model perform? Try to improve your result with hyperparameter tuning.**




---

<center>

#### **Back to lecture**

---

<a name="p4"></a>

---
## **Part 4: News Classification with a CNN with Embedding**
---



#### **Problem #4.1**


Let's build a CNN model with an embedding layer. In order for the output of the embedding layer to be the right dimensions for the convolutional layer, we've provided a custom module that transposes the last two dimensions.
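
To see why the transpose is needed, the sketch below traces tensor shapes through an embedding layer, the `Transpose` module, and one convolution. All sizes (`toy_vocab_size`, `toy_embed_size`, 5 output channels, kernel size 3) are arbitrary values picked for the example and are not the ones this problem asks for.

```python
import torch
import torch.nn as nn

class Transpose(nn.Module):
    def forward(self, x):
        # Swap the sequence and embedding dimensions
        return x.transpose(1, 2)

# Hypothetical toy sizes
batch_size, seq_len = 2, 12
toy_vocab_size, toy_embed_size = 30, 8

# Random token IDs standing in for one batch of numericalized text
token_ids = torch.randint(0, toy_vocab_size, (batch_size, seq_len))

embedded = nn.Embedding(toy_vocab_size, toy_embed_size)(token_ids)
transposed = Transpose()(embedded)

# Conv1d treats dimension 1 as channels, so it must hold the embedding size
conv = nn.Conv1d(toy_embed_size, 5, kernel_size=3, padding='same')
features = conv(transposed)

print(embedded.shape)    # (batch, sequence length, embedding size)
print(transposed.shape)  # (batch, embedding size, sequence length)
print(features.shape)    # (batch, output channels, sequence length)
```

With `padding='same'` the sequence length is unchanged by the convolution; only the number of channels changes.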

Remember, we've introduced a new version of the max pooling layer. The previous one specifies the pool size, and requires us to keep track of the output sizes:
* `nn.MaxPool1d(2)`

For the new one, we just specify what size *output* we would like:
* `nn.AdaptiveMaxPool1d(10)`
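
The difference between the two pooling styles can be sketched in plain Python. The functions below are our own simplified versions for a single list of numbers, and the window boundaries in `adaptive_max_pool` follow the floor/ceil rule that, to our understanding, PyTorch uses; treat them as an illustration rather than the library's implementation.

```python
def fixed_max_pool(values, pool_size):
    """Max over consecutive, non-overlapping windows of length pool_size."""
    n_windows = len(values) // pool_size
    return [max(values[i * pool_size:(i + 1) * pool_size]) for i in range(n_windows)]

def adaptive_max_pool(values, output_size):
    """Max over output_size windows that together span the whole input."""
    n = len(values)
    pooled = []
    for i in range(output_size):
        start = (i * n) // output_size            # floor(i * n / output_size)
        end = -((-(i + 1) * n) // output_size)    # ceil((i + 1) * n / output_size)
        pooled.append(max(values[start:end]))
    return pooled

shorter = [1, 5, 2, 8, 3, 7]   # length 6
longer = list(range(10))       # length 10

# Fixed pooling: output length depends on input length
print(len(fixed_max_pool(shorter, 2)), len(fixed_max_pool(longer, 2)))

# Adaptive pooling: output length is whatever we ask for
print(adaptive_max_pool(shorter, 3), adaptive_max_pool(longer, 3))
```

With a fixed pool size of 2 the two inputs produce outputs of different lengths, while the adaptive version returns three values for both, which is why we no longer need to track sizes by hand.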

<br>

Define a CNN with the following layers:

Block 1:
* A convolutional layer with 300 outputs, kernel size of 11, `padding='same'`, and ReLU activation.
* An adaptive max pooling layer with an output size of 10

Block 2:
* A convolutional layer with 150 outputs, kernel size of 11, `padding='same'`, and ReLU activation.
* An adaptive max pooling layer with an output size of 1

Finally, add:
* A Flatten layer
* A linear layer with 20 outputs and ReLU activation
* The output layer

In [None]:
# To prepare the embedding layer for the convolutional layer, we need
# to define a custom module to transpose the last two dimensions.
class Transpose(nn.Module):
    def forward(self, x):
        return x.transpose(1, 2)

model = nn.Sequential(
    nn.Embedding(vocab_size, embed_size),
    Transpose(),
    nn.Conv1d(embed_size, # FILL IN CODE HERE
    # ADD THE REST OF THE LAYERS
)


#### **Problem #4.2**


Create a Learner object and fit the model. Since this is a multiclass classification problem, you will use `nn.CrossEntropyLoss()` as the loss function and accuracy as the metric.

In [None]:
# Create a Learner and train the model


#### **Problem #4.3**


Now, evaluate the model for both the training and validation sets.


In [None]:
# Evaluate the training set


# Evaluate the validation set


#### **How did your model perform? Try to improve your result with hyperparameter tuning.**

<a name="p5"></a>

---
## **[ADDITIONAL PRACTICE] Part 5: Sentiment Analysis with IMDB Movie Reviews**
---

In this part, we will create a CNN using the IMDB Movie Reviews dataset, which includes movie reviews along with their corresponding sentiment (positive, neutral, negative).

#### **Problem #5.1**

**Run the code below to load the IMDB Movie Reviews data into a pandas DataFrame.**

View the DataFrame before beginning.


In [None]:
dataset = pd.read_csv('https://raw.githubusercontent.com/the-codingschool/TRAIN-datasets/main/imdb_reviews/IMDB_Dataset.csv')

dataset.head()

#### **Problem #5.2: Load the data using TextDataLoaders**

Use the following parameters:
* `dataset`
* `text_col='review'`
* `label_col='sentiment'`
* `valid_pct=0.2`
* `bs=64`
* `seq_len=100`


In [None]:
dls = TextDataLoaders.from_df(
    #FILL IN CODE HERE
)

#### **Problem #5.3: Print the vocabulary**


The `vocab` attribute of the DataLoaders object contains the vocabulary of the data and of the labels.

Use the first element of the `vocab` attribute of the DataLoaders object to print the vocabulary of the data.

#### **Problem #5.4: Print the labels**


Use the second element of the `vocab` attribute to print the labels.

#### **Problem #5.5: View vectorized data**


TextDataLoaders assigns a unique integer ID to each token in the vocabulary and represents each document as a sequence of those IDs, preserving the order of the tokens.

Use `dls.one_batch()` to pull one batch of the data and view the first instance.

In [None]:
xb, yb = # FILL IN CODE HERE

#### **Problem #5.6: Decode vectorized data**


Use `dls.show()` to decode the numeric data.

*Hint: Pass a tuple to the function.*

#### **Problem #5.7: Create an embedding layer in PyTorch**


We can use PyTorch's `nn.Embedding` to create embeddings. Create an embedding layer whose first argument is the vocabulary size and whose second argument is the embedding dimension. Set the embedding dimension to 50.

In [None]:
# Create an embedding layer with 50 dimensions
vocab_size = # FILL IN CODE HERE
embedding = nn.Embedding(vocab_size, # FILL IN CODE HERE)

#### **Problem #5.8: Apply the embedding**


Earlier, we pulled one batch of the data and saved the data `xb` and labels `yb`. Apply the embedding to the batch of data `xb`. Compare this numerical representation to the integer IDs you viewed in Problem #5.5.

In [None]:
# Apply the embedding layer
embedded = # FILL IN CODE HERE

# Print the first embedded instance of the data
embedded[0]

#### **Run the code below to visualize the embedding.**

You can change the first and second dimension for plotting.

In [None]:
first_dimension = 0
second_dimension = 1

# Detach the tensor from the computational graph (preparing for plotting)
embedded = embedded.detach()

# Plot the embeddings
plt.figure(figsize=(10, 10))
plt.scatter(
    embedded[:, :, first_dimension].numpy().flatten(),
    embedded[:, :, second_dimension].numpy().flatten(),
    s=10)
plt.title('Embedding Visualization')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.show()

#### **Problem #5.9**


Let's build a CNN model with an embedding layer. In order for the output of the embedding layer to be the right dimensions for the convolutional layer, we've provided a custom module that transposes the last two dimensions.

Remember, we've introduced a new version of the max pooling layer. The previous one specifies the pool size, and requires us to keep track of the output sizes:
* `nn.MaxPool1d(2)`

For the new one, we just specify what size *output* we would like:
* `nn.AdaptiveMaxPool1d(10)`

<br>

Define a CNN with the following layers:

Block 1:
* A convolutional layer with 300 outputs, kernel size of 11, `padding='same'`, and ReLU activation.
* An adaptive max pooling layer with an output size of 10

Block 2:
* A convolutional layer with 150 outputs, kernel size of 11, `padding='same'`, and ReLU activation.
* An adaptive max pooling layer with an output size of 1

Finally, add:
* A Flatten layer
* A linear layer with 20 outputs and ReLU activation
* The output layer

In [None]:
# To prepare the embedding layer for the convolutional layer, we need
# to define a custom module to transpose the last two dimensions.
class Transpose(nn.Module):
    def forward(self, x):
        return x.transpose(1, 2)

model = nn.Sequential(
    nn.Embedding(vocab_size, embed_size),
    Transpose(),
    nn.Conv1d(embed_size, # FILL IN CODE HERE
    # ADD THE REST OF THE LAYERS
)


#### **Problem #5.10**


Create a Learner object and fit the model. Since this is a multiclass classification problem, you will use `nn.CrossEntropyLoss()` as the loss function and accuracy as the metric.

In [None]:
# Create a Learner and train the model


#### **Problem #5.11**


Now, evaluate the model for both the training and validation sets.


In [None]:
# Evaluate the training set


# Evaluate the validation set


---
# End of notebook
---
© 2024 The Coding School, All rights reserved