# **Lab 7: Introduction to Natural Language Processing (NLP)**
---

### **Description**
In today's lab, we will use neural networks for one of the most popular NLP tasks: **text classification**. You will combine what you already know about neural networks with two new NLP concepts: tokenization and vectorization.

For this lab, we will work with the 20 Newsgroups dataset, which scikit-learn provides through `fetch_20newsgroups`. It is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. Each newsgroup covers a different topic, such as sports, politics, religion, and technology. The documents within each newsgroup were posted by various authors and cover a wide range of subtopics related to the newsgroup's main theme.

The goal of this lab is to build a machine learning model that accurately classifies newsgroup documents based on their content.

<br>

### **Lab Structure**
**Part 1**: [Tokenization and Vectorization](#p1)

**Part 2**: [News Group Classification with a Neural Network](#p2)
>
>**Part 2.1**: [Tokenizing and Vectorizing the News Groups Dataset](#p2.1)
>
>**Part 2.2**: [Training and Testing a Neural Network](#p2.2)

**Part 3**: [News Group Classification with a CNN](#p3)



<br>

### **Goals**
By the end of this lab, you will:
* Understand the concept of tokenization in NLP.
* Compare a fully connected network to a CNN for text classification.

<br>

### **Cheat Sheets**
[Natural Language Processing I](https://docs.google.com/document/d/1MamYMxe8zlWoiDc0tX2RzUKQULCPVUh-2QtdzRRvzcs/edit?usp=sharing)

<br>

**Before starting, run the code below to import all necessary functions and libraries.**


In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from fastai.text.all import *

import warnings
warnings.filterwarnings('ignore')

<a name="p1"></a>

---
## **Part 1: Tokenization and Vectorization**
---

**Run the cell below to load a simple corpus for us to work with.**

In [None]:
# Define a collection of text documents
corpus = [
    "This is the first document.",
    "This is the second document.",
    "And this is the third document.",
    "Is this the first document?",
]

#### **Problem #1.1: Create a CountVectorizer object**



In [None]:
vectorizer = #FILL IN CODE HERE

###### **Solution**

In [None]:
# Create a CountVectorizer object
vectorizer = CountVectorizer()

#### **Problem #1.2: Fit the vectorizer to the corpus**



###### **Solution**

In [None]:
# Fit the vectorizer to the corpus
vectorizer.fit(corpus)

#### **Problem #1.3: Transform the corpus into a matrix of token counts**



In [None]:
# Transform the corpus into a matrix of token counts
# WRITE YOUR CODE HERE

# Print the resulting matrix
print(X.toarray())

###### **Solution**

In [None]:
# Transform the corpus into a matrix of token counts
X = vectorizer.transform(corpus)

# Print the resulting matrix
print(X.toarray())

[[0 1 1 1 0 1 0 1]
 [0 1 0 1 1 1 0 1]
 [1 1 0 1 0 1 1 1]
 [0 1 1 1 0 1 0 1]]


#### **Problem #1.4: Print the tokens**

Use `get_feature_names_out()` to print the tokens.


###### **Solution**

In [None]:
print(vectorizer.get_feature_names_out())

['and' 'document' 'first' 'is' 'second' 'the' 'third' 'this']


Compare the tokens, the matrix, and the corpus. Do you see how each sentence is represented in the matrix?
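
To make the comparison concrete, here is a minimal sketch in plain Python that rebuilds the same vocabulary and count matrix by hand. It assumes `CountVectorizer`'s default behavior (lowercasing the text and keeping tokens of two or more word characters). The helper names `tokenize` and `count_vector` are illustrative and not part of scikit-learn.

```python
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This is the second document.",
    "And this is the third document.",
    "Is this the first document?",
]

# Assumed to mirror CountVectorizer's default token pattern:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())

# The vocabulary is every distinct token, sorted alphabetically.
vocab = sorted({token for doc in corpus for token in tokenize(doc)})

def count_vector(text):
    """Count how often each vocabulary token appears in one document."""
    counts = Counter(tokenize(text))
    return [counts[token] for token in vocab]

# One row per document, one column per vocabulary token.
matrix = [count_vector(doc) for doc in corpus]

print(vocab)
for row in matrix:
    print(row)
```

Each column lines up with one token, so the `1` in the first column of the third row records the single "and" in the third sentence. Word order is lost, which is why the first and fourth sentences get identical rows.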

<a name="p2"></a>

---
## **Part 2: News Group Classification with a Neural Network**
---


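The 20 Newsgroups dataset is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. Our task is to assign each article to the correct newsgroup. The cell below loads only the `train` subset and strips headers, footers, and quoted replies, so the model has to rely on the body text of each post.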

<br>

**Run the cell below to load the dataset.**

In [None]:
# Load the dataset
newsgroups_data = fetch_20newsgroups(
    subset='train',
    remove=('headers', 'footers', 'quotes')
)

texts = newsgroups_data.data
labels = newsgroups_data.target

# Split the dataset into training and validation sets
texts_train, texts_val, labels_train, labels_val = train_test_split(
    texts,
    labels,
    test_size=0.2,
    random_state=42
)
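
As a quick sanity check on the 80/20 split, the arithmetic can be sketched as follows. The document count of 11,314 is an assumption derived from the row counts printed later in this lab (9,051 + 2,263), and rounding the held-out share up reflects our understanding of how `train_test_split` treats a fractional `test_size`.

```python
import math

# Assumed size of the 'train' subset: 9051 + 2263 rows, as printed later in this lab.
n_documents = 11314
test_size = 0.2

# The held-out share is rounded up; the remainder is used for training.
n_val = math.ceil(test_size * n_documents)
n_train = n_documents - n_val

print("training documents:  ", n_train)
print("validation documents:", n_val)
```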

<a name="p2.1"></a>

---
### **Part 2.1: Tokenizing and Vectorizing the News Groups Dataset**
---

#### **Problem #2.1.1: Create the CountVectorizer object**

Initialize the vectorizer with the following parameters:
* `stop_words='english'`
* `max_features=4000`

###### **Solution**


In [None]:
vectorizer = CountVectorizer(stop_words='english', max_features=4000)
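
To see what these two parameters do, here is a rough sketch in plain Python. The three toy documents, the tiny `STOP_WORDS` set, and `MAX_FEATURES = 3` are hypothetical stand-ins; scikit-learn's built-in English stop word list is much longer, and the lab keeps 4000 features.

```python
from collections import Counter

docs = [
    "the game was a great game",
    "the rocket launch was great",
    "a great game and a rocket",
]

# Hypothetical tiny stop word list, for illustration only.
STOP_WORDS = {"the", "a", "was", "and"}
MAX_FEATURES = 3

# Count every remaining word across the whole corpus.
totals = Counter(
    word for doc in docs for word in doc.split() if word not in STOP_WORDS
)

# Keep only the most frequent words, then sort them to form the vocabulary.
top_words = [word for word, _ in totals.most_common(MAX_FEATURES)]
vocab = sorted(top_words)

print(totals)
print(vocab)
```

Here "launch" appears only once, so it falls outside the three most frequent words and is dropped. Capping the vocabulary this way keeps the input layer of the network at a manageable size.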

#### **Problem #2.1.2: Fit and transform the training data.**



###### **Solution**


In [None]:
X_train_bow = vectorizer.fit_transform(texts_train)

#### **Problem #2.1.3: Transform the validation data.**

###### **Solution**


In [None]:
X_valid_bow = vectorizer.transform(texts_val)
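
Note that the validation data is only transformed, never fitted: the vocabulary comes from the training texts alone. The sketch below illustrates the consequence with hypothetical toy sentences and a hand-written `transform` helper (not the scikit-learn method): words that never appeared in training have no column and are ignored.

```python
train_docs = ["the puck hit the net", "the rocket reached orbit"]
val_doc = "the goalie stopped the puck"

# "Fit": build the vocabulary from the training documents only.
vocab = sorted({word for doc in train_docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

def transform(doc):
    """Count known words; words outside the vocabulary are skipped."""
    row = [0] * len(vocab)
    for word in doc.split():
        if word in index:
            row[index[word]] += 1
    return row

val_row = transform(val_doc)
print(vocab)
print(val_row)
```

The validation row has the same number of columns as the training vocabulary, which is what lets one model consume both matrices.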

###### **Run the code below to print out the shapes of each BoW matrix and a sample of the vocabulary.**

In [None]:
# Show the shape of the BoW matrices
print("Shape of the training BoW matrix:", X_train_bow.shape)
print("Shape of the validation BoW matrix:", X_valid_bow.shape)

print(L(vectorizer.get_feature_names_out()[2000:2100]))

Shape of the training BoW matrix: (9051, 4000)
Shape of the validation BoW matrix: (2263, 4000)
[array(['ken', 'kent', 'kept', 'kevin', 'key', 'keyboard', 'keys', 'kg',
       'kh', 'khf', 'ki', 'kids', 'kill', 'killed', 'killing', 'kind',
       'kinds', 'king', 'kingdom', 'kings', 'kit', 'kjz', 'kk', 'km',
       'kn', 'knew', 'knife', 'know', 'knowing', 'knowledge', 'known',
       'knows', 'koresh', 'kt', 'kurds', 'la', 'lab', 'labor',
       'laboratory', 'lack', 'land', 'lane', 'language', 'large',
       'largely', 'larger', 'larry', 'larson', 'laser', 'late', 'later',
       'latest', 'launch', 'launched', 'launches', 'law', 'laws',
       'lawyer', 'lay', 'lc', 'lcs', 'le', 'lead', 'leader', 'leaders',
       'leadership', 'leading', 'leads', 'leafs', 'league', 'learn',
       'learned', 'learning', 'leave', 'leaves', 'leaving', 'lebanese',
       'lebanon', 'led', 'lee', 'left', 'legal', 'legally', 'legislation',
       'legitimate', 'lemieux', 'length', 'let', 'lets', 'lette

#### **Problem #2.1.4: Choose a random document and print out its BoW representation.**

Then use `get_feature_names_out()` to determine what some of the words are.

In [None]:
random_doc_idx = # You can choose any index
print("BoW representation of a random document:\n", X_train_bow[random_doc_idx])

In [None]:
# Use get_feature_names_out() to explore your results

###### **Solution**

In [None]:
random_doc_idx = 42  # You can choose any index
print("BoW representation of a random document:\n", X_train_bow[random_doc_idx])

BoW representation of a random document:
   (0, 790)	1
  (0, 3060)	1
  (0, 3353)	1
  (0, 2050)	1
  (0, 3136)	1
  (0, 1988)	1
  (0, 2213)	1
  (0, 1460)	1
  (0, 3699)	1
  (0, 860)	1
  (0, 3156)	2
  (0, 553)	1
  (0, 3616)	1
  (0, 1964)	1
  (0, 3893)	1
  (0, 1245)	1
  (0, 3258)	1
  (0, 1969)	1
  (0, 647)	1


In [None]:
vectorizer.get_feature_names_out()[3156]

'say'
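
The printout above is a sparse representation: each line `(0, j)  c` means that column `j` of this document's row holds the count `c`, and every column not listed is zero. A minimal sketch of the same idea in plain Python, using a hypothetical five-word vocabulary and a made-up dense row:

```python
# Hypothetical toy vocabulary and one dense bag-of-words row.
vocab = ["ball", "game", "say", "team", "win"]
dense_row = [0, 1, 2, 0, 1]

# Keep only the nonzero entries as (column index, count) pairs.
sparse = [(col, count) for col, count in enumerate(dense_row) if count != 0]

# Look up each column index in the vocabulary to recover the word.
for col, count in sparse:
    print(f"(0, {col})\t{count}\t{vocab[col]}")
```

With 4000 columns and only a handful of distinct words per post, storing just the nonzero entries saves a lot of memory.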

<a name="p2.2"></a>

---
### **Part 2.2: Training and Testing a Neural Network**
---

At this point, we have imported, split, and vectorized the data. Now we need to prepare it for a PyTorch model and proceed as we would for *any* classification task with a PyTorch model.

#### **Step #1**

**This code has been provided for you. Run the cell below.**

In [None]:
# Convert to PyTorch tensors
X_train = torch.tensor(X_train_bow.todense()).float()
X_valid = torch.tensor(X_valid_bow.todense()).float()

# Extract labels
y_train = torch.tensor(labels_train)
y_valid = torch.tensor(labels_val)

# Create DataLoaders
train_dataset = list(zip(X_train, y_train))
valid_dataset = list(zip(X_valid, y_valid))

train_dl = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_dl = DataLoader(valid_dataset, batch_size=64)
dls = DataLoaders(train_dl,val_dl)

#### **Step #2**

For a fully connected network, the dimension of the input layer will be the number of tokens. Complete the code below.

###### **Solution**

In [None]:
input_dims = len(vectorizer.get_feature_names_out())

#### **Steps #3-6**

Define a fully connected network of your own design. Ensure you have the correct number of inputs (`input_dims`) and outputs (20, one per newsgroup).

###### **Solution**


In [None]:
# Define the DNN model
dnn_model = nn.Sequential(
    nn.Linear(input_dims, 512),
    nn.ReLU(),
    nn.Linear(512, 20)
)

#### **Step #7**

Create a Learner object and fit the model. Since this is a multiclass classification problem, you will use `nn.CrossEntropyLoss()`.
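
For intuition about what this loss measures, here is a minimal plain-Python sketch of cross-entropy for a single example: take the softmax of the raw scores (logits) and return the negative log of the probability assigned to the true class. The function below is an illustration written for this lab, not PyTorch's implementation, which also averages over the batch by default.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class under a softmax."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target]

# A model with no preference among 20 classes: every logit is equal.
uniform_logits = [0.0] * 20
uniform_loss = cross_entropy(uniform_logits, target=3)

# A model that scores the correct class much higher than the rest.
confident_logits = [0.0] * 20
confident_logits[3] = 5.0
confident_loss = cross_entropy(confident_logits, target=3)

print(round(uniform_loss, 4))  # ln(20), about 2.9957
print(confident_loss < uniform_loss)
```

A loss near ln(20) ≈ 3.0 therefore means the model is doing no better than guessing uniformly among the 20 newsgroups, which is a useful reference point when reading the training table.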

In [None]:
# Create a Learner and train the model


###### **Solution**


In [None]:
# Create a Learner and train the model
dnn_learner = Learner(
    dls,
    dnn_model,
    loss_func=nn.CrossEntropyLoss(),
    metrics=accuracy)

dnn_learner.fit(5, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,1.619478,1.426403,0.624834,00:05
1,0.907468,1.559543,0.652673,00:04
2,0.443807,1.858257,0.640301,00:05
3,0.220311,2.206271,0.627044,00:04
4,0.192995,2.542234,0.641184,00:04


#### **Step #8**

Now, evaluate the model for both the training and validation sets.


In [None]:
# Evaluate the training set


# Evaluate the validation set


###### **Solution**


In [None]:
# Calculate training accuracy
train_loss, train_accuracy = dnn_learner.validate(dl=dls.train)
print(f"Training accuracy: {train_accuracy:.4f}")

# Calculate validation accuracy
valid_loss, valid_accuracy = dnn_learner.validate(dl=dls.valid)
print(f"Validation accuracy: {valid_accuracy:.4f}")

Training accuracy: 0.9619


Validation accuracy: 0.6412


#### **How did your model perform?**

<a name="p3"></a>

---
## **Part 3: News Group Classification with a CNN**
---



#### **Step #1**

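**The code for preparing the data is provided for you. Run the cell below.** It matches Part 2.2 except for `.unsqueeze(1)`, which adds a channel dimension so that each document has shape `(1, 4000)`, the `(channels, length)` layout that `nn.Conv1d` expects.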

In [None]:
# Convert to PyTorch tensors
X_train = torch.tensor(X_train_bow.todense()).float().unsqueeze(1)
X_valid = torch.tensor(X_valid_bow.todense()).float().unsqueeze(1)

# Extract labels
y_train = torch.tensor(labels_train)
y_valid = torch.tensor(labels_val)

# Create DataLoaders
train_dataset = list(zip(X_train, y_train))
valid_dataset = list(zip(X_valid, y_valid))

train_dl = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_dl = DataLoader(valid_dataset, batch_size=64)
dls = DataLoaders(train_dl,val_dl)

input_dims = len(vectorizer.get_feature_names_out())

#### **Steps #3-6**

Let's start by building a new CNN model. Remember, the syntax for CNNs for NLP is a little different than for images. We will be using the 1D versions of the convolution and max pooling layers. Examples:
* `nn.Conv1d(64, 128, kernel_size=5, padding=2)`
* `nn.MaxPool1d(2)`

Define a CNN with the following layers:

Block 1:
* A convolutional layer with the appropriate number of input channels and 16 output channels, kernel size of 3, `padding=1`, and ReLU activation.
* A max pooling layer with a pool size of 2

Block 2:
* A convolutional layer with 32 output channels, kernel size of 3, `padding=1`, and ReLU activation.
* A max pooling layer with a pool size of 2

Finally, add:
* A flatten layer
* A linear layer with 8 outputs and ReLU activation
* The output layer

###### **Solution**


In [None]:
# Define the CNN Model
cnn_model = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool1d(2),
    nn.Conv1d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool1d(2),
    nn.Flatten(),
    nn.Linear(32 * (input_dims // 4), 8),
    nn.ReLU(),
    nn.Linear(8, 20)
)
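
The `32 * (input_dims // 4)` in the first linear layer comes from tracking the sequence length through the network. The sketch below applies the output-length formulas documented for `nn.Conv1d` and `nn.MaxPool1d` in plain Python; the helper names are illustrative, and `input_dims = 4000` is assumed from `max_features=4000`.

```python
def conv1d_out_len(length, kernel_size, padding=0, stride=1, dilation=1):
    """Output length of a 1D convolution."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

def maxpool1d_out_len(length, kernel_size):
    """Output length of 1D max pooling (stride equals kernel size, no padding)."""
    return (length - kernel_size) // kernel_size + 1

input_dims = 4000  # assumed: one position per vocabulary token

length = conv1d_out_len(input_dims, kernel_size=3, padding=1)  # length unchanged
length = maxpool1d_out_len(length, kernel_size=2)              # halved
length = conv1d_out_len(length, kernel_size=3, padding=1)      # length unchanged
length = maxpool1d_out_len(length, kernel_size=2)              # halved again

# Flatten turns (32 channels, length) into one vector per document.
flattened = 32 * length
print(length, flattened)
```

With `kernel_size=3` and `padding=1` the convolutions preserve the length, so only the two pooling layers shrink it, each by a factor of 2.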

#### **Step #7**

Create a Learner object and fit the model. Since this is a multiclass classification problem, you will use `nn.CrossEntropyLoss()`.

In [None]:
# Create a Learner and train the model


###### **Solution**


In [None]:
# Create a Learner and train the model
cnn_learner = Learner(
    dls,
    cnn_model,
    loss_func=nn.CrossEntropyLoss(),
    metrics=accuracy)

cnn_learner.fit(5, 1e-4)

epoch,train_loss,valid_loss,accuracy,time
0,1.567503,1.803258,0.447636,00:24
1,1.528314,1.768604,0.46487,00:25
2,1.495133,1.750425,0.470172,00:24
3,1.490543,1.738211,0.473707,00:25
4,1.4478,1.728292,0.479894,00:23


#### **Step #8**

Now, evaluate the model for both the training and validation sets.


In [None]:
# Evaluate the training set


# Evaluate the validation set


###### **Solution**


In [None]:
# Calculate training accuracy
train_loss, train_accuracy = cnn_learner.validate(dl=dls.train)
print(f"Training accuracy: {train_accuracy:.4f}")

# Calculate validation accuracy
valid_loss, valid_accuracy = cnn_learner.validate(dl=dls.valid)
print(f"Validation accuracy: {valid_accuracy:.4f}")

Training accuracy: 0.5496


Validation accuracy: 0.4799


**Oh no!** The CNN did not do better. Its validation accuracy (0.4799) is actually lower than the fully connected network's (0.6412). It turns out that tokenization and vectorization alone are not enough to prepare text data for deep learning. There's an additional processing step we can take that will set our models up for success: **embedding.** We will see how embedding improves model performance in the next lab.

# End of notebook
---
© 2024 The Coding School, All rights reserved