<a href="https://colab.research.google.com/github/rtrochepy/machine_learning/blob/main/Text_Classification_TorchText.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Neural Networks Course with PyTorch
Instructor: Omar Uriel Espejel Diaz

**Text Classification with TorchText**

The PyTorch Project contains libraries for different types of data and purposes.

* torchaudio
* torchvision
* TorchElastic
* TorchServe

We will use torchtext for text classification. The torchtext package consists of data processing utilities and popular datasets for natural language processing.

However, feel free to try other available libraries in PyTorch. torchvision is particularly used for applications working with images!

**1. Importing libraries and dataset**

In [1]:
%%capture
!pip install "portalocker>=2.0.0"
!pip install torchtext --upgrade

In [2]:
import torch
import torchtext
from torchtext.datasets import DBpedia

**Check the version**

In [3]:
torchtext.__version__

**2. Processing the dataset and creating a vocabulary**

We imported the torch and torchtext libraries above and will use torchtext to load the DBpedia dataset.

Next, use the iter function to create an iterator over the training split, and call next to inspect the first (label, text) pair.

In [4]:
train_iter = iter(DBpedia(split="train"))

In [5]:
next(train_iter)

(1,
 'E. D. Abbott Ltd  Abbott of Farnham E D Abbott Limited was a British coachbuilding business based in Farnham Surrey trading under that name from 1929. A major part of their output was under sub-contract to motor vehicle manufacturers. Their business closed in 1972.')

We will build a vocabulary from the dataset with the built-in function **build_vocab_from_iterator**, which accepts an iterator that yields lists (or iterators) of tokens.

We use **torchtext** to build a vocabulary from the English DBpedia dataset.

First, import the **get_tokenizer** function from the **torchtext** library to get a pre-defined tokenizer for the English language. Then, create a fresh data iterator over the DBpedia training split.

Next, define a **yield_tokens** function that uses the tokenizer to split each text into tokens and yields one list of tokens per text. This generator is the input to the **build_vocab_from_iterator** function, which builds a vocabulary from the tokens it receives. The **build_vocab_from_iterator** function also takes a list of special tokens; here **<unk>** represents out-of-vocabulary words, and **set_default_index** makes the vocabulary return its index for any unknown token.

In summary, this code snippet builds a vocabulary from a training dataset and prepares it for use in machine learning models using PyTorch.

In [6]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer("basic_english")
train_iter = DBpedia(split="train")

def yield_tokens(data_iter):
  for _, text in data_iter:
    yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

Our vocabulary converts the list of tokens into integers.

In [7]:
vocab(tokenizer("Hello how are you? I am a platzi student"))

[7296, 1506, 47, 578, 2323, 187, 2409, 5, 0, 1078]
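As a rough mental model of what the vocabulary does, the plain-Python sketch below counts tokens, places the special token first, and falls back to its index for unknown words. It is an illustration only, not torchtext's implementation; `build_toy_vocab`, `lookup`, and the tiny corpus are hypothetical.

```python
from collections import Counter


def build_toy_vocab(token_lists, specials=("<unk>",)):
    """Map tokens to integer ids: specials first, then by descending frequency."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    # Ties in frequency are broken alphabetically so the result is deterministic.
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    index_to_token = list(specials) + [tok for tok, _ in ordered if tok not in specials]
    return {tok: idx for idx, tok in enumerate(index_to_token)}


def lookup(token_to_index, tokens, unk="<unk>"):
    """Convert tokens to ids, using the id of `unk` for unknown tokens."""
    return [token_to_index.get(tok, token_to_index[unk]) for tok in tokens]


# Hypothetical three-sentence corpus, already tokenized.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird"]]
toy_vocab = build_toy_vocab(corpus)
toy_ids = lookup(toy_vocab, ["the", "cat", "flew"])
print(toy_ids)  # [1, 4, 0]
```

The word "flew" never appears in the toy corpus, so it maps to index 0, the position of the special token.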

Define two lambda functions, **text_pipeline** and **label_pipeline**, which are used to process the input data into a format that can be used for training and evaluating models.

The first function, **text_pipeline**, takes a text string as input and processes it using the tokenizer and vocabulary we defined. Remember that the tokenizer splits the text into tokens (words or subwords), while the vocabulary maps each token to a unique integer index. The function returns a list of integers representing the tokens in the text.

The second function, **label_pipeline**, takes a label as input and converts it to an integer. In this case, **1** is subtracted from the label so that the labels fall in the index range **0** to **n-1**, where **n** is the number of classes in the problem.

In [8]:
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1

In [9]:
text_pipeline("Hello I am Ruben Dario")
label_pipeline("1")

0

Create a function called **collate_batch** to process a batch of data. The input batch is a list of tuples, where each tuple contains a label and its corresponding text.

* Three lists are initialized: **label_list**, **text_list**, and **offsets**. Offsets store the starting index of each text sequence in the concatenated tensor of text sequences. It helps to keep track of the boundaries of individual text sequences within the concatenated tensor. It starts with a value of 0, representing the starting index of the first text sequence.

* The function iterates over each data point in the batch. For each data point, it processes the label using **label_pipeline(_label)** and adds the result to **label_list**. It processes the text using **text_pipeline(_text)** and converts it to a torch tensor of type **torch.int64**. The processed text is added to **text_list**, and its length **(size(0))** is added to **offsets**.

* The last element in the **offsets** list is removed using the slicing **offsets[:-1]**. Then, the **cumsum** function calculates the cumulative sum of the elements in the **offsets** list along dimension 0.

* The **text_list** is concatenated into a single 1D tensor using **torch.cat(text_list)**.

Finally, the **label_list**, **text_list**, and **offsets** tensors are moved to the selected device (either GPU or CPU).

In [10]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
  label_list = []
  text_list = []
  offsets = [0]

  for (_label, _text) in batch:
    label_list.append(label_pipeline(_label))
    processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
    text_list.append(processed_text)
    offsets.append(processed_text.size(0))

  label_list = torch.tensor(label_list, dtype=torch.int64)
  offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
  text_list = torch.cat(text_list)
  return label_list.to(device), text_list.to(device), offsets.to(device)
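The offsets bookkeeping can be traced with plain Python lists. The sketch below mirrors the steps of collate_batch (start at 0, append each length, drop the last entry, take the cumulative sum) without tensors; `toy_collate` and the sample sequences are hypothetical.

```python
from itertools import accumulate


def toy_collate(sequences):
    """Concatenate sequences and record where each one starts."""
    offsets = [0]
    flat = []
    for seq in sequences:
        flat.extend(seq)
        offsets.append(len(seq))
    # Drop the last length, then take the running sum, as collate_batch does.
    offsets = list(accumulate(offsets[:-1]))
    return flat, offsets


# Three already-numericalized texts of lengths 3, 2, and 4.
flat, offsets = toy_collate([[5, 6, 7], [8, 9], [1, 2, 3, 4]])
print(flat)     # [5, 6, 7, 8, 9, 1, 2, 3, 4]
print(offsets)  # [0, 3, 5]
```

Each text can be recovered by slicing the flat list between consecutive offsets, which is the information the embedding layer needs to treat every text separately.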

A **DataLoader** handles the process of iteration through a dataset in mini-batches. The DataLoader is important because it efficiently manages memory, shuffles data, and easily parallelizes data loading.

In [11]:
from torch.utils.data import DataLoader

train_iter = DBpedia(split="train")
dataloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)

In [12]:
dataloader

<torch.utils.data.dataloader.DataLoader at 0x7d5a1890de10>

**3. Creating the classification model and its layers**

We create **TextClassificationModel**, a neural network class that implements a simple but effective architecture for text classification, using embedding layers, batch normalization, and fully connected layers.

* **__init__(self, vocab_size, embed_dim, num_class)**: This method initializes the model with three arguments: the vocabulary size (vocab_size), the embedding dimension (embed_dim), and the number of classes (num_class).

* **self.embedding**: The embedding layer (nn.EmbeddingBag) looks up a vector of dimension embed_dim for each token and, in its default "mean" mode, averages the vectors of each text into a single vector. The offsets tensor tells it where each text starts in the concatenated batch.

* **self.bn1**: The batch normalization layer (nn.BatchNorm1d) improves the stability and training speed of the model by normalizing each of the embed_dim features across the batch.

* **self.fc**: The fully connected layer (nn.Linear) maps the embed_dim features to num_class output scores, one per class.

In [13]:
from torch import nn
import torch.nn.functional as F

class TextClassificationModel(nn.Module):
  def __init__(self, vocab_size, embed_dim, num_class):
    super(TextClassificationModel, self).__init__()

    # Embedding layer
    self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)

    # Batch normalization layer
    self.bn1 = nn.BatchNorm1d(embed_dim)

    # Fully connected layer
    self.fc = nn.Linear(embed_dim, num_class)

  def forward(self, text, offsets):
    # Embed the text
    embedded = self.embedding(text, offsets)

    # Apply batch normalization
    embedded_norm = self.bn1(embedded)

    # Apply the ReLU activation function
    embedded_activated = F.relu(embedded_norm)

    # Output the raw class scores (logits); CrossEntropyLoss applies softmax internally
    return self.fc(embedded_activated)
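To see what the embedding step produces, here is a plain-Python sketch of mean pooling over "bags" of token ids, which is what nn.EmbeddingBag computes in its default mode. It is a conceptual illustration, not PyTorch code; `embedding_bag_mean` and the 2-dimensional table are hypothetical.

```python
def embedding_bag_mean(table, flat_indices, offsets):
    """Average the embedding rows of each bag delimited by `offsets`."""
    dim = len(table[0])
    bounds = list(offsets) + [len(flat_indices)]
    pooled = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        rows = [table[i] for i in flat_indices[start:end]]
        if rows:
            pooled.append([sum(r[d] for r in rows) / len(rows) for d in range(dim)])
        else:
            pooled.append([0.0] * dim)  # an empty bag yields a zero vector
    return pooled


# Hypothetical embedding table: 4 tokens, 2 dimensions each.
table = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Two texts: token ids [1, 2] and [3, 1], concatenated with their offsets.
pooled = embedding_bag_mean(table, [1, 2, 3, 1], [0, 2])
print(pooled)  # [[2.0, 3.0], [3.0, 4.0]]
```

Whatever the length of a text, the result is one vector of size embed_dim per text, which is why the following layers can be plain BatchNorm1d and Linear layers.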

We build a model with an embedding dimension of 100.

In [14]:
train_iter = DBpedia(split="train")
num_class = len(set([label for (label, text) in train_iter]))
vocab_size = len(vocab)
embedding_size = 100

model = TextClassificationModel(vocab_size=vocab_size, embed_dim=embedding_size, num_class=num_class).to(device)

# Model architecture
# print(model)

# Number of trainable parameters in our model
def count_parameters(model):
  return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"The model has {count_parameters(model):,} trainable parameters")

The model has 80,301,414 trainable parameters
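The reported total can be checked by hand from the three layers. In the sketch below, the vocabulary size of 802,998 is not printed anywhere in the notebook; it is inferred as the value that reproduces the reported total, assuming 14 classes (the number of entries in the DBpedia label table used later). `count_text_classifier_params` is a hypothetical helper.

```python
def count_text_classifier_params(vocab_size, embed_dim, num_class):
    """Trainable parameters of EmbeddingBag + BatchNorm1d + Linear."""
    embedding = vocab_size * embed_dim               # one vector per token
    batch_norm = 2 * embed_dim                       # scale and shift per feature
    linear = embed_dim * num_class + num_class       # weights plus biases
    return embedding + batch_norm + linear


# Assumed values: vocab_size inferred from the total, 14 DBpedia classes.
total_params = count_text_classifier_params(vocab_size=802_998, embed_dim=100, num_class=14)
print(f"{total_params:,}")  # 80,301,414
```

Almost all of the parameters sit in the embedding table, so the vocabulary size dominates the size of the model.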


**4. Functions for Model Training and Evaluation**

We now define the functions to train the model and evaluate the results.

We use **torch.nn.utils.clip_grad_norm_** to limit the maximum value of the gradient norm during the training of a neural network. In other words, it ensures that the gradients aren't too large, and thus avoids the neural network becoming unstable during training.

The first argument, **model.parameters()**, refers to the parameters of the model being trained. The second argument, "0.1", is the maximum allowed value for the gradient norm.
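The idea behind norm clipping can be sketched in plain Python: compute the norm of all gradients taken together and, if it exceeds the limit, scale every gradient by the same factor. This is a simplified illustration on a flat list of numbers, not the PyTorch implementation; `clip_by_global_norm` and the small `eps` guard against division by zero are assumptions.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale `grads` so their L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


# A gradient of norm 5.0 is scaled down to a norm of about 0.1.
clipped, total_norm = clip_by_global_norm([3.0, 4.0], max_norm=0.1)
clipped_norm = math.sqrt(sum(g * g for g in clipped))
print(total_norm, round(clipped_norm, 4))

# A gradient already below the limit is left unchanged.
small, _ = clip_by_global_norm([0.03, 0.04], max_norm=0.1)
print(small)
```

Because every component is multiplied by the same factor, clipping changes the length of the update but not its direction.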

In [15]:
def train(dataloader):
    # Set the model to training mode
    model.train()

    # Initialize accuracy, count, and loss for each epoch
    epoch_acc = 0
    epoch_loss = 0
    total_count = 0

    for idx, (label, text, offsets) in enumerate(dataloader):
        # Reset gradients before processing each batch
        optimizer.zero_grad()
        # Get model predictions
        prediction = model(text, offsets)

        # Get the loss
        loss = criterion(prediction, label)

        # Backpropagate the loss and compute gradients
        loss.backward()

        # Get the accuracy
        acc = (prediction.argmax(1) == label).sum()

        # Prevent gradients from becoming too large
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)

        # Update the weights
        optimizer.step()

        # Keep track of the loss and accuracy for this epoch
        epoch_acc += acc.item()
        epoch_loss += loss.item()
        total_count += label.size(0)

        if idx % 500 == 0 and idx > 0:
            print(f" epoch {epoch} | {idx}/{len(dataloader)} batches | loss {epoch_loss/total_count} | accuracy {epoch_acc/total_count}")

    return epoch_acc/total_count, epoch_loss/total_count

In [16]:
def evaluate(dataloader):
    model.eval()
    epoch_acc = 0
    total_count = 0
    epoch_loss = 0

    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            # Get the predicted label
            prediction = model(text, offsets)

            # Get loss and accuracy
            loss = criterion(prediction, label)
            acc = (prediction.argmax(1) == label).sum()

            # Keep track of the loss and accuracy for this epoch
            epoch_loss += loss.item()
            epoch_acc += acc.item()
            total_count += label.size(0)

    return epoch_acc/total_count, epoch_loss/total_count


Please note that the variable **epoch** inside the training function is not defined within the function scope. Make sure it is defined globally or passed as an argument if you are running epochs outside the function scope.
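One detail helps when reading the logged loss. torch.nn.CrossEntropyLoss averages over the batch by default, so with the default criterion used in this notebook, epoch_loss accumulates one mean per batch while total_count counts individual samples. The sketch below uses made-up batch losses to contrast that ratio with a true per-sample average; `epoch_averages` is a hypothetical helper.

```python
import math


def epoch_averages(batch_mean_losses, batch_sizes):
    """Return (sum of batch means / samples, true per-sample mean loss)."""
    total_samples = sum(batch_sizes)
    logged_style = sum(batch_mean_losses) / total_samples
    per_sample = sum(l * n for l, n in zip(batch_mean_losses, batch_sizes)) / total_samples
    return logged_style, per_sample


# Made-up mean losses for three batches of 64 samples each.
logged_style, per_sample = epoch_averages([2.0, 1.0, 1.5], [64, 64, 64])
print(logged_style)  # 0.0234375
print(per_sample)    # 1.5
```

With equally sized batches the two values differ by a factor equal to the batch size, so the logged numbers are still comparable between epochs as long as the batch size does not change.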

**5. Preparing for Training: Data Split, Loss, and Optimization**

We split the training dataset into training and validation sets with a split ratio of 0.95 (training) and 0.05 (validation) using the function torch.utils.data.dataset.random_split.

**Hyperparameters**

In [17]:
EPOCHS = 4 # epochs
LEARNING_RATE = 0.2 # learning rate
BATCH_SIZE = 64 # batch size

Explore the other loss functions available in PyTorch. You can find them all here: https://pytorch.org/docs/stable/nn.html#loss-functions.

The loss function is the one that measures how good our model's predictions are compared to the actual labels. PyTorch offers a wide range of loss functions that we can use to train our models on different types of problems, such as regression, classification, and sequence-to-sequence modeling.

By delving into these other loss functions, we can expand our machine learning knowledge. The same applies to the optimizers. PyTorch provides a variety of optimization algorithms: https://pytorch.org/docs/stable/optim.html#algorithms.

Spend time exploring PyTorch's documentation on loss functions and optimizers. Experiment with different functions in your projects.

**Loss, Optimizer**

In [18]:
#Loss, Optimizer
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr= LEARNING_RATE)
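For intuition about what cross-entropy measures on a single example, the sketch below computes it from raw class scores in plain Python: the negative log of the softmax probability assigned to the correct class. It is an illustration of the formula, not the PyTorch implementation; `cross_entropy` and the sample scores are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax of `logits` at index `target`."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# A model that scores all 14 classes equally has a loss of ln(14).
uniform_loss = cross_entropy([0.0] * 14, target=3)
print(round(uniform_loss, 3))  # 2.639

# A confident, correct prediction has a much lower loss than a confident, wrong one.
confident = [10.0, 0.0, 0.0]
print(cross_entropy(confident, 0) < cross_entropy(confident, 1))  # True
```

The value ln(14), roughly 2.64, is a useful reference point: a per-sample loss near it means the model is doing no better than guessing among 14 classes.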

We split the dataset into three parts: training, validation, and test.

First, we import the **random_split** function from the **torch.utils.data.dataset** module and the **to_map_style_dataset** function from **torchtext.data.functional**. Then, we load the **DBpedia** dataset by calling **DBpedia()**, which returns the training and test splits. Next, we convert each split into a map-style dataset, which supports indexing and **len()**, using the **to_map_style_dataset** function so that it can be split and shuffled by PyTorch's **DataLoader**.

We then define the proportion of data we will use to train our model (95%) and the percentage we will use to validate our model (5%). We use the **random_split** function to split the training dataset into training and validation.

Finally, we create three DataLoaders, one for each part of the dataset: training, validation, and testing. We use the **batch_size** argument to define the size of the data batches that will be used in training and testing. The **collate_fn** argument specifies how data samples should be joined to form a batch.

In [19]:
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset

# Get the trainset and testset
train_iter, test_iter = DBpedia()
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)

# We train the model with 95% of the data from the trainset
num_train = int(len(train_dataset) * 0.95)

# We create a validation dataset with 5% of the trainset
split_train_, split_valid_ = random_split(train_dataset, [num_train, len(train_dataset) - num_train])

# We create dataloaders ready to feed into our model
train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
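The sizes involved in the split can be worked out with plain Python. The sketch below shuffles a list of indices and cuts it at 95%, which is conceptually what a random split does. The training set size of 560,000 is an assumption taken from the dataset's published description rather than something the notebook prints; `split_indices` and the seed are hypothetical.

```python
import math
import random


def split_indices(n, train_fraction, seed=0):
    """Shuffle indices 0..n-1 and cut them into two disjoint parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    num_train = int(n * train_fraction)
    return indices[:num_train], indices[num_train:]


# Assumed size of the DBpedia training split.
train_idx, valid_idx = split_indices(560_000, 0.95)
num_batches = math.ceil(len(train_idx) / 64)  # BATCH_SIZE = 64
print(len(train_idx), len(valid_idx), num_batches)  # 532000 28000 8313
```

Under that assumption, one epoch over the training part takes 8313 batches of 64 samples.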

In [20]:
train_dataset

<torchtext.data.functional.to_map_style_dataset.<locals>._MapStyleDataset at 0x7d5a0bc48df0>

With the data split, the hyperparameters, the loss function, the optimizer, and the DataLoaders for the training, validation, and test datasets in place, the training process is fully set up.

**6. Training and Evaluating the Model**

Now, let's proceed with the training and evaluation of our model. Firstly, we define the variable **best_validation_loss** and initialize it with positive infinity. This variable is used to track the best validation loss during the training.

Then, we run a **for** loop over the epochs. In each epoch, the model is trained on the training set and then evaluated on the validation set.

If the current validation loss is lower than the best validation loss seen so far, we save the current state of the model in the **best_saved.pt** file.
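The "keep the best model" rule can be isolated from the training code. The sketch below applies it to a list of made-up validation losses and reports at which epochs a save would happen; `epochs_that_save` and the loss values are hypothetical and are not results from this notebook.

```python
def epochs_that_save(validation_losses):
    """Return the 1-based epochs at which the validation loss improves."""
    best = float("inf")
    saved_at = []
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss < best:
            best = loss
            saved_at.append(epoch)  # here the real loop would call torch.save
    return saved_at, best


# Made-up validation losses for four epochs.
saved_at, best = epochs_that_save([0.012, 0.009, 0.010, 0.008])
print(saved_at, best)  # [1, 2, 4] 0.008
```

Epoch 3 is skipped because its loss is worse than the best seen so far, so the file on disk always holds the weights with the lowest validation loss.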

In [21]:
# Obtain the best loss
best_validation_loss = float('inf')

# Training loop
for epoch in range(1, EPOCHS + 1):
    # Training
    train_acc, train_loss = train(train_dataloader)

    # Validation
    validation_acc, validation_loss = evaluate(valid_dataloader)

    # Save the best model
    if validation_loss < best_validation_loss:
      best_validation_loss = validation_loss
      torch.save(model.state_dict(), "best_saved.pt")


 epoch 1 | 500/8313 batches | loss 0.033213432907998684 | accuracy 0.38566616766467066
 epoch 1 | 1000/8313 batches | loss 0.02894332105448315 | accuracy 0.4741976773226773
 epoch 1 | 1500/8313 batches | loss 0.02638594473064998 | accuracy 0.5161142571618921
 epoch 1 | 2000/8313 batches | loss 0.024672556727163973 | accuracy 0.5444621439280359
 epoch 1 | 2500/8313 batches | loss 0.023443429099648343 | accuracy 0.563811975209916
 epoch 1 | 3000/8313 batches | loss 0.022505414862432745 | accuracy 0.5778907030989671
 epoch 1 | 3500/8313 batches | loss 0.021759408591030514 | accuracy 0.5892780634104542
 epoch 1 | 4000/8313 batches | loss 0.02117373248558958 | accuracy 0.5981551487128218
 epoch 1 | 4500/8313 batches | loss 0.020674863539286863 | accuracy 0.6062819373472562
 epoch 1 | 5000/8313 batches | loss 0.020251466184064905 | accuracy 0.6128430563887223
 epoch 1 | 5500/8313 batches | loss 0.01985594714976342 | accuracy 0.6191743546627886
 epoch 1 | 6000/8313 batches | loss 0.0195181625

We evaluate the model on the test dataset.

In [24]:
test_acc, test_loss = evaluate(test_dataloader)

print(f'Test dataset accuracy -> {test_acc}')
print(f'Test dataset loss -> {test_loss}')

Test dataset accuracy -> 0.7986857142857143
Test dataset loss -> 0.01025109011062554


**7. Inference**

Let's try the model on examples of English text. We will use **torch.compile()** to speed up the model's inference. We pass it the argument **mode="reduce-overhead"**, which aims to reduce the per-call framework overhead of running the model and therefore the time needed for inference or, in other cases, training.

**reduce-overhead** allows our code to run more efficiently. However, this optimization may come at the cost of a small amount of additional memory. It is the recommended mode for small models like ours for classification.

The **max-autotune** mode compiles the code for a longer time, trying to optimize the code as much as possible to achieve the highest execution speed. This mode may involve exploring different optimization strategies and finding the best one, which may result in longer compilation times but better performance during execution.

In [27]:
DBpedia_label = {1: 'Company',
                2: 'EducationalInstitution',
                3: 'Artist',
                4: 'Athlete',
                5: 'OfficeHolder',
                6: 'MeanOfTransportation',
                7: 'Building',
                8: 'NaturalPlace',
                9: 'Village',
                10: 'Animal',
                11: 'Plant',
                12: 'Album',
                13: 'Film',
                14: 'WrittenWork'}

def predict(text, texto_pipeline):
  with torch.no_grad():
    text = torch.tensor(texto_pipeline(text))
    opt_mod = torch.compile(model, mode="reduce-overhead")
    output = opt_mod(text, torch.tensor([0]))
    return output.argmax(1).item() + 1


ejemplo_1 = "Nithari is a village in the western part of the state of Uttar Pradesh India bordering on New Delhi. Nithari forms part of the New Okhla Industrial Development Authority's planned industrial city Noida falling in Sector 31. Nithari made international news headlines in December 2006 when the skeletons of a number of apparently murdered women and children were unearthed in the village."


model = model.to("cpu")


print(f"Example 1 belongs to category {DBpedia_label[predict(ejemplo_1, text_pipeline)]}")

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'


Example 1 belongs to category Village
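The `+ 1` at the end of predict undoes the shift applied by label_pipeline: the model works with class indices starting at 0, while the DBpedia label table starts at 1. A minimal plain-Python sketch of that mapping, with made-up scores and a shortened label table, could look like this; `scores_to_label` is a hypothetical helper.

```python
def scores_to_label(scores, label_names):
    """Pick the highest-scoring class index and map it to a 1-based label name."""
    index = max(range(len(scores)), key=lambda i: scores[i])
    return label_names[index + 1]


# Shortened label table and made-up scores for three classes.
toy_labels = {1: "Company", 2: "EducationalInstitution", 3: "Artist"}
toy_scores = [0.2, -1.3, 2.7]
toy_prediction = scores_to_label(toy_scores, toy_labels)
print(toy_prediction)  # Artist
```

The highest score sits at index 2, which corresponds to label 3 in the table.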


**8. Saving and loading the model**

The **state_dict()** method returns the state dictionary of the model. This dictionary contains the learnable parameters of the model, such as weights and biases, together with persistent buffers such as the running statistics of batch normalization, in the form of PyTorch tensors.

It is useful for a variety of tasks, such as saving and loading models or transferring learned parameters from one model to another. It allows you to easily manipulate the state of the model as a dictionary of named tensors, without having to access them directly.

For example, if we want to save our model to disk, we can use it to get a dictionary of the model parameters and then save that dictionary with **torch.save**, which relies on Python's **pickle** module. Then, when we want to load the model again, we can use the **load_state_dict()** method to load the saved dictionary into a new instance of the model.

In [30]:
model_state_dict = model.state_dict()
optimizer_state_dict = optimizer.state_dict()

checkpoint = {
    "model_state_dict" :  model_state_dict,
    "optimizer_state_dict" : optimizer_state_dict,
    "epoch" : epoch,
    "loss" : train_loss,
}

torch.save(checkpoint, "model_checkpoint.pth")

We upload the model to the Hugging Face Hub so that other community members have access to it and we also have a copy in the cloud.

In [31]:
%%capture
!pip install huggingface_hub

In [32]:
!huggingface-cli login


    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


We create the repository where we will store our model in the Hugging Face Hub.

In [33]:
from huggingface_hub import HfApi
api = HfApi()

api.create_repo(repo_id="platzi/clasificacion-DBpedia-Ruben-Troche")

RepoUrl('https://huggingface.co/platzi/clasificacion-DBpedia-Ruben-Troche', endpoint='https://huggingface.co', repo_type='model', repo_id='platzi/clasificacion-DBpedia-Ruben-Troche')

We upload our checkpoint.

In [34]:
!ls

'=2.0.0'   best_saved.pt   model_checkpoint.pth   sample_data


In [35]:
api.upload_file(
    path_or_fileobj="./model_checkpoint.pth",
    path_in_repo="model_checkpoint.pth",
    repo_id="platzi/clasificacion-DBpedia-Ruben-Troche"
)

model_checkpoint.pth:   0%|          | 0.00/321M [00:00<?, ?B/s]

'https://huggingface.co/platzi/clasificacion-DBpedia-Ruben-Troche/blob/main/model_checkpoint.pth'

Let's place the checkpoint in a new directory called weights and load it from there.

In [36]:
!mkdir weights

In [37]:
!rm weights/model_checkpoint.pth

rm: cannot remove 'weights/model_checkpoint.pth': No such file or directory


Now let's load our model

In [39]:
checkpoint = torch.load("weights/model_checkpoint.pth")

In [41]:
train_iter = DBpedia(split="train")
num_class = len(set([label for (label, text) in train_iter]))
vocab_size = len(vocab)
embedding_size = 100

modelo_2 = TextClassificationModel(vocab_size=vocab_size, embed_dim=embedding_size, num_class=num_class)

In [42]:
optimizer_2 = torch.optim.SGD(modelo_2.parameters(), lr=0.2)

In [43]:
modelo_2.load_state_dict(checkpoint["model_state_dict"])

<All keys matched successfully>

In [44]:
optimizer_2.load_state_dict(checkpoint["optimizer_state_dict"])

In [45]:
epoch_2 = checkpoint["epoch"]
loss_2 = checkpoint["loss"]

In [47]:
ejemplo_2 = "Axolotls are members of the tiger salamander, or Ambystoma tigrinum, species complex, along with all other Mexican species of Ambystoma."

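# predict() reads the global `model`, so point it at the loaded model.
# eval() makes BatchNorm use its running statistics for a single example.
model = modelo_2.to("cpu").eval()

DBpedia_label[predict(ejemplo_2, text_pipeline)]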

'Plant'

**Conclusion**

In this module we learned to use **torchtext** to train a classification model with real data.

1. We started by preprocessing the data through tokenization and building a vocabulary.

2. Then we created a PyTorch dataset and used it to train a classification model with a neural network architecture.

3. We tested the model with a test set.

4. Then we performed inference on new data.

5. Finally, we saved our trained model so that it can be used later for other tasks.




