# Chapter 15: Word Embeddings and Text Classification

#### Installation Notes
To run this notebook on Google Colab, you will need to install the following libraries: transformers, evaluate, datasets, chromadb, langchain, and gensim.

In Google Colab, you can run the following command to install them:

In [None]:
!pip install transformers evaluate chromadb langchain datasets gensim

## 15.2 Learning Objectives

By the end of this chapter, you should be able to:
- tokenize and encode sentences into their corresponding embeddings
- train a simple model using embeddings as features
- use vector databases to store and search documents
- use a similarity metric to perform zero-shot text classification

## 15.4 AG News Dataset

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step1.png)

In this chapter, we'll be primarily using the AG News Dataset. The original [AG](http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html) is a collection of more than 1,000,000 news articles gathered from more than 2,000 news sources.

The version we'll be using here, the [AG News Dataset](https://github.com/mhjabreel/CharCnn_Keras/tree/master/data/ag_news_csv), was constructed by choosing the four largest classes from the original corpus, namely, "world", "sports", "business", and "science and technology". Each class contains 30,000 training and 1,900 testing samples, amounting to a total of 120,000 training and 7,600 testing samples.

The AG News Dataset is a [built-in dataset](https://pytorch.org/text/stable/datasets.html#ag-news) from Torchtext. It downloads the corresponding files directly from the [AG News Dataset](https://github.com/mhjabreel/CharCnn_Keras/tree/master/data/ag_news_csv) repository.

Before using the dataset, we need to do a little bit of cleaning up, such as replacing some special characters and HTML tags that weren't stored as raw characters. For example, the apostrophe is found 44,316 times as "#39;". By cleaning up a little, we'll get more sensible tokens and thus better results.

You can download the files from the following links:

- https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv
- https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv
- https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/classes.txt

Alternatively, you can download all files as a single compressed file instead:

https://raw.githubusercontent.com/lftraining/LFD273-code/main/data/AGNews/agnews.zip

If you're running Google Colab, you can download the files using the commands below:

In [None]:
!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv
!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv
!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/classes.txt

--2024-09-09 17:13:19--  https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29470338 (28M) [text/plain]
Saving to: ‘train.csv’


2024-09-09 17:13:23 (11.0 MB/s) - ‘train.csv’ saved [29470338/29470338]

--2024-09-09 17:13:23--  https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1857427 (1.8M) [text/plain]
Saving to: ‘test.csv’


2024-09-09 17:13:24 (9.19 MB/

### 15.4.1 Data Cleaning

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step2.png)

Next, let's perform some data cleaning. We're keeping the cleanup to a minimum, namely, replacing the aforementioned special characters and HTML tags. The dataset also contains duplicate rows, and even some JavaScript code, but we won't be handling those here.

Here is a non-exhaustive list of characters and tags for replacement:

In [None]:
import numpy as np

chr_codes = np.array([
     36,   151,    38,  8220,   147,   148,   146,   225,   133,    39,  8221,  8212,   232,   149,   145,   233,
  64257,  8217,   163,   160,    91,    93,  8211,  8482,   234,    37,  8364,   153,   195,   169
])
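# Numeric character references (e.g. " #39;") mapped back to their characters
chr_subst = {f' #{c};': chr(c) for c in chr_codes}
# Named entities and HTML tags; full tags are listed before the bare '&lt;' and '&gt;'
# so they are matched as a whole (replacements are applied in insertion order)
chr_subst.update({' amp;': '&', ' quot;': "'", ' hellip;': '...', ' nbsp;': ' ',
                  '&lt;em&gt;': '', '&lt;/em&gt;': '', '&lt;strong&gt;': '', '&lt;/strong&gt;': '',
                  '&lt;': '', '&gt;': ''})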

And here are a couple of helper functions to perform a quick cleanup:

In [None]:
def replace_chars(sent):
    to_replace = [c for c in list(chr_subst.keys()) if c in sent]
    for c in to_replace:
        sent = sent.replace(c, chr_subst[c])
    return sent

def preproc_description(desc):
    desc = desc.replace('\\', ' ').strip()
    return replace_chars(desc)
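To see what this cleanup does, here is a minimal, self-contained sketch. It uses a reduced substitution table (`sample_subst`, a hypothetical stand-in for the full `chr_subst` above) and a made-up input string, so treat it as an illustration rather than the exact pipeline:

```python
# Reduced, hypothetical substitution table (the full table has many more entries)
sample_subst = {' #39;': "'", ' amp;': '&', ' quot;': "'"}

def replace_sample_chars(sent, subst=sample_subst):
    # Replace every key found in the sentence with its substitute
    for key, value in subst.items():
        sent = sent.replace(key, value)
    return sent

def preproc_sample(desc):
    # Backslashes become spaces, then the leftover entity codes are replaced
    return replace_sample_chars(desc.replace('\\', ' ').strip())

raw = "Wall Street #39;s dwindling\\band of ultra-cynics"
cleaned = preproc_sample(raw)
print(cleaned)
```

The backslash between "dwindling" and "band" turns into a space, and the " #39;" code turns back into an apostrophe.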

### 15.4.2 Hugging Face Datasets

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step4.png)

We've already used Hugging Face Datasets before, first with a tabular dataset, and then again with the Stanford Sentiment Treebank dataset for sentiment analysis.

Now, we'll start by loading both CSV files using the load_dataset() function and naming the columns manually. Since each file represents a split, we'll assemble a DatasetDict manually as well:

In [None]:
from datasets import load_dataset, Split, DatasetDict

colnames = ['topic', 'title', 'news']

train_ds = load_dataset("csv", data_files='train.csv', sep=',', split=Split.ALL, column_names=colnames)
test_ds = load_dataset("csv", data_files='test.csv', sep=',', split=Split.ALL, column_names=colnames)

datasets = DatasetDict({'train': train_ds, 'test': test_ds})
datasets

Downloading and preparing dataset csv/default to /home/dvgodoy/.cache/huggingface/datasets/csv/default-e74aa9f4afc75bd6/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /home/dvgodoy/.cache/huggingface/datasets/csv/default-e74aa9f4afc75bd6/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.
Downloading and preparing dataset csv/default to /home/dvgodoy/.cache/huggingface/datasets/csv/default-b95e4b26323eb18c/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /home/dvgodoy/.cache/huggingface/datasets/csv/default-b95e4b26323eb18c/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['topic', 'title', 'news'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['topic', 'title', 'news'],
        num_rows: 7600
    })
})

Let's take a quick look at an example from our training set:

In [None]:
datasets['train'][0]

{'topic': 3,
 'title': 'Wall St. Bears Claw Back Into the Black (Reuters)',
 'news': "Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again."}

Looks good! We'll be focusing on the topic (numbered from one to four - world, sports, business, and sci-tech) and the piece of news itself. We won't be using the "title" field.

We can use the map() method to apply a transformation to every row (each row is a dictionary) in each split. Here, we're adjusting the topic numbering to a 0-based index and cleaning up the news using the preproc_description() function:

In [None]:
datasets = datasets.map(lambda row: {'topic': row['topic']-1,
                                     'news': preproc_description(row['news'])})
datasets = datasets.select_columns(['topic', 'news'])

Map:   0%|          | 0/120000 [00:00<?, ? examples/s]

Map:   0%|          | 0/7600 [00:00<?, ? examples/s]
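The row-wise transformation can be pictured in plain Python. The sketch below does not use the datasets library: the rows are made-up examples, and the `class_names` list simply follows the topic order mentioned above (world, sports, business, sci-tech):

```python
# Hypothetical rows in the same shape as the loaded CSV (topic is 1-based)
rows = [
    {'topic': 3, 'title': 'Some title', 'news': 'Reuters - Some business news.'},
    {'topic': 2, 'title': 'Another title', 'news': 'AP - Some sports news.'},
]

# Class names in the order listed above (topics one to four)
class_names = ['world', 'sports', 'business', 'sci-tech']

def to_zero_based(row):
    # Return only the keys that change, like the function passed to map()
    return {'topic': row['topic'] - 1}

# Merge the updated keys into each row and keep only the columns of interest
updated = [{**row, **to_zero_based(row)} for row in rows]
selected = [{'topic': r['topic'], 'news': r['news']} for r in updated]

for r in selected:
    print(r['topic'], class_names[r['topic']])
```

With a 0-based topic, the label can be used directly as an index into the list of class names.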

Let's take a batch of four elements from our training set:

In [None]:
batch = datasets['train'][:4]
labels, descriptions = batch['topic'], batch['news']
labels, descriptions

([2, 2, 2, 2],
 ["Reuters - Short-sellers, Wall Street's dwindling band of ultra-cynics, are seeing green again.",
  'Reuters - Private investment firm Carlyle Group, which has a reputation for making well-timed and occasionally controversial plays in the defense industry, has quietly placed its bets on another part of the market.',
  'Reuters - Soaring crude prices plus worries about the economy and the outlook for earnings are expected to hang over the stock market next week during the depth of the summer doldrums.',
  'Reuters - Authorities have halted oil export flows from the main pipeline in southern Iraq after intelligence showed a rebel militia could strike infrastructure, an oil official said on Saturday.'])

Now that the text is cleaned, we can focus on the next step: tokenization.

## 15.5 Tokenization

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step3.png)

Tokenization is the process of turning a piece of text (be it a sentence, a paragraph, or a full page, commonly referred to as a "document") into a sequence of its components, the tokens.

If our document is a paragraph, its sentences may be considered tokens, and we would be talking about sentence tokenization. But, if our document is a sentence, its words (or sometimes subwords, e.g., syllables, prefixes, and suffixes) may be considered tokens.

The simplest and most straightforward way of tokenizing a string is to use its split() method:

In [None]:
tokens = descriptions[0].split()
tokens

['Reuters',
 '-',
 'Short-sellers,',
 'Wall',
 "Street's",
 'dwindling',
 'band',
 'of',
 'ultra-cynics,',
 'are',
 'seeing',
 'green',
 'again.']

***
#### Aside: Gensim

Gensim is a popular library for topic modeling, which offers out-of-the-box tools for NLP-related tasks such as tokenization, vocabularies, and pretrained embeddings. In this chapter, we'll be using a few tools from Gensim, such as the simple_preprocess() utility function for tokenization and the downloader module to retrieve and load pretrained GloVe embeddings.
***

However, by using split() alone, we'd be overlooking lots of details: lower and uppercase differences, accents, special characters, punctuation, etc. This kind of preprocessing can be tedious to implement, so we could use a utility function such as Gensim's simple_preprocess() to take care of these steps:

In [None]:
from gensim.utils import simple_preprocess
tokens = simple_preprocess(descriptions[0])
tokens

['reuters',
 'short',
 'sellers',
 'wall',
 'street',
 'dwindling',
 'band',
 'of',
 'ultra',
 'cynics',
 'are',
 'seeing',
 'green',
 'again']

No punctuation, no apostrophes, lowercase words. That's some good old-fashioned tokenization, and it still is an important part of the tokenization pipeline of current tokenizers.
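For intuition, this kind of old-fashioned tokenization can be roughly approximated with a regular expression. The sketch below is not Gensim's implementation: the function name is hypothetical, it only handles ASCII letters, and the `min_len` and `max_len` bounds are assumptions meant to mirror simple_preprocess()'s defaults:

```python
import re

def simple_tokenize_sketch(text, min_len=2, max_len=15):
    # Lowercase, keep runs of ASCII letters only, and drop very short/long tokens
    candidates = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in candidates if min_len <= len(tok) <= max_len]

sentence = ("Reuters - Short-sellers, Wall Street's dwindling band of "
            "ultra-cynics, are seeing green again.")
sketch_tokens = simple_tokenize_sketch(sentence)
print(sketch_tokens)
```

The minimum length is the reason why the stray "s" left over from "Street's" does not show up as a token.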

In HF tokenizers, this kind of cleaning up and splitting up is performed by two components of the [tokenization pipeline](https://huggingface.co/docs/tokenizers/en/pipeline): the normalizer and the pre-tokenizer. Let's load a typical BERT tokenizer and see the output of each of these steps:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tok_obj = tokenizer.backend_tokenizer
tok_obj



<tokenizers.Tokenizer at 0x7f2c4d81ac30>

The tokenizer from the transformers library is actually a wrapper around the tokenizer from the tokenizers library. We can retrieve the latter using the former's backend_tokenizer attribute.

The backend tokenizer is the one that implements the pipeline. Let's take a look at its first step, the normalizer:

In [None]:
normalizer = tok_obj.normalizer
normalizer.lowercase, normalizer.clean_text, normalizer.strip_accents

(True, True, None)
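To get a feel for what flags like these control, here is a small standard-library sketch of a normalizer. It is only an illustration under simple assumptions (the function name and the example string are made up), not the code used by the tokenizers library:

```python
import unicodedata

def normalize_sketch(text, lowercase=True, strip_accents=True):
    if lowercase:
        text = text.lower()
    if strip_accents:
        # Decompose accented characters, then drop the combining marks
        decomposed = unicodedata.normalize('NFD', text)
        text = ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')
    return text

normalized_sample = normalize_sketch("Café Élan, Wall Street's")
print(normalized_sample)
```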

By inspecting the attributes of the normalizer, we can see that it is configured to convert the string to lowercase and to clean the text (removing control characters and normalizing whitespace). The strip_accents attribute is None, which means it follows the lowercase setting (as in the original BERT), so accents are stripped as well. Let's put it to the test by calling its normalize_str() method:

In [None]:
normalized = normalizer.normalize_str(descriptions[0])
normalized

"reuters - short-sellers, wall street's dwindling band of ultra-cynics, are seeing green again."

In the next step, the normalized string is going to be pre-tokenized (which is pretty much the same as the old-fashioned tokenization). Let's try it out by calling the pre-tokenizer's pre_tokenize_str() method:

In [None]:
pre_tokenizer = tok_obj.pre_tokenizer
tokens = pre_tokenizer.pre_tokenize_str(normalized)
tokens

[('reuters', (0, 7)),
 ('-', (8, 9)),
 ('short', (10, 15)),
 ('-', (15, 16)),
 ('sellers', (16, 23)),
 (',', (23, 24)),
 ('wall', (25, 29)),
 ('street', (30, 36)),
 ("'", (36, 37)),
 ('s', (37, 38)),
 ('dwindling', (39, 48)),
 ('band', (49, 53)),
 ('of', (54, 56)),
 ('ultra', (57, 62)),
 ('-', (62, 63)),
 ('cynics', (63, 69)),
 (',', (69, 70)),
 ('are', (71, 74)),
 ('seeing', (75, 81)),
 ('green', (82, 87)),
 ('again', (88, 93)),
 ('.', (93, 94))]

As you can see, its approach is very simple: even commas and apostrophes are considered individual tokens and the sentence is split accordingly. Moreover, the pre-tokenizer keeps track of the position of each token in the original sentence, denoted by the tuple of integers next to each token.
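A rough approximation of this behavior fits in a single regular expression: a token is either a run of word characters or a single punctuation character, and the match positions give us the offsets. This is a sketch only (the function name is hypothetical, and `\w` treats the underscore as a word character, which the actual pre-tokenizer may handle differently):

```python
import re

def pre_tokenize_sketch(text):
    # A run of word characters, or any single character that is neither word nor space
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]

normalized_text = ("reuters - short-sellers, wall street's dwindling band of "
                   "ultra-cynics, are seeing green again.")
token_spans = pre_tokenize_sketch(normalized_text)
print(token_spans[:5])
```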

So far, there's nothing extraordinary about it, right? We're getting to the interesting part, which is the definition of the tokenizer's vocabulary, in the next section.

### 15.5.1 Vocabulary

The set of all unique tokens in a corpus of text (that is, a collection of documents) makes up its corresponding vocabulary. The vocabulary is like the "dictionary" (although not necessarily in the Python sense!) that contains entries for every token (word) that we expect to find in our sentences. We can retrieve the tokenizer's vocabulary using the get_vocab() method:

In [None]:
vocab = tok_obj.get_vocab()
vocab

{'##camp': 26468,
 '[unused625]': 630,
 'raid': 8118,
 'zoological': 26168,
 'guarantee': 11302,
 'auckland': 8666,
 'fai': 26208,
 'overhead': 8964,
 '##hee': 21030,
 'johnston': 10773,
 '##linger': 23101,
 'acting': 3772,
 '##tablished': 28146,
 '[unused9]': 10,
 'considered': 2641,
 'pardon': 14933,
 'greyish': 26916,
 '##54': 27009,
 'caracas': 21675,
 'renumbered': 27855,
 'flowing': 8577,
 '[unused748]': 753,
 'forces': 2749,
 'credited': 5827,
 '##hell': 18223,
 'milk': 6501,
 'deals': 9144,
 '##kle': 19099,
 'christy': 21550,
 'guests': 6368,
 '[unused372]': 377,
 'curse': 8364,
 'alvaro': 24892,
 '##onus': 24891,
 '##千': 30310,
 'corpses': 18113,
 'mollusk': 13269,
 '−': 1597,
 '##graphy': 12565,
 '##lius': 15513,
 'convincing': 13359,
 'clutched': 13514,
 'iraq': 5712,
 '##ggy': 22772,
 'urgency': 19353,
 'executives': 12706,
 'hobart': 14005,
 'telecommunication': 25958,
 'def': 13366,
 'inmates': 13187,
 'twinned': 25901,
 'kathleen': 14559,
 'staircase': 10714,
 'approache

Do you see anything weird? There are plenty of words starting with "##". Hold on to that thought, we'll understand their role shortly.

The vocabulary is going to be used by the tokenizer's model to map words into their corresponding indices, so strings are translated into integers.
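As a toy illustration of this mapping, the sketch below builds a vocabulary from a tiny made-up corpus and translates a sentence into indices. The corpus, the variable names, and the choice of sorting the tokens are all assumptions made for the example:

```python
corpus = [
    "wall street is seeing green again",
    "oil prices hang over the stock market",
]

# Collect unique tokens; sorting makes the index assignment reproducible
unique_tokens = sorted({tok for doc in corpus for tok in doc.split()})
toy_vocab = {tok: idx for idx, tok in enumerate(unique_tokens)}

# get() returns None for tokens that are not in the vocabulary
sentence_ids = [toy_vocab.get(tok) for tok in "wall street is plummeting".split()]
print(len(toy_vocab), sentence_ids)
```

Any word that never appeared in the corpus has no index, which is the problem the rest of this section deals with.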

Let's try fetching the index of a given word from our example:

In [None]:
vocab['dwindling']

KeyError: 'dwindling'

How could it be? "Dwindling" isn't such a rare word, after all.

Let's see how large our vocabulary is:

In [None]:
tok_obj.get_vocab_size()

30522

Doesn't the vocabulary length look suspiciously short? It's no wonder that the very first sentence of our example already contained a word that is _not_ in the vocabulary.

In the past, that word would be removed and deemed an "unknown" token. Vocabularies had to grow up to 400,000 words in length to avoid that situation as much as possible. Modern-day tokenizers, on the other hand, tackle the challenge in a different way: they implement an algorithm that maximizes coverage while keeping the vocabulary short and manageable. Enter the tokenizer's model, the third step in the pipeline.

### 15.5.2 Tokenizer's Model

The basic naïve tokenizer is simple and straightforward, but it has a drawback: it either makes the vocabulary huge (to include rare words), or it leads to a lot of unknown tokens (that will replace words absent from our vocabulary).

Ideally, we would be able to take any word that's thrown at us (at our vocabulary, that is) and handle it without resorting to the unknown token while keeping the size of the vocabulary manageable and limited. Seems too good to be true? Not really!

The general idea of modern tokenizers is to break words into their components, so the vocabulary is actually made of the "building blocks", or subwords, used to assemble (almost) any word we may find. Let's say our vocabulary has only a few entries: "any", "some", "where", "how", and "body". These are all (full) words, but they can be used as parts to build many other words: "anyhow", "anybody", "somehow", "somebody", "anywhere", and "somewhere". We covered 11 words using a vocabulary of only five entries. Some words like "some" may be used to modify others, like "awesome", "wholesome", etc. Besides, we are not limited to using full words: we can also add prefixes and suffixes to the vocabulary, such as "ly", for "commonly", "ordinarily", and so on.
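The counting in the example above can be checked with a few lines of Python. The variable names are only illustrative:

```python
# The five vocabulary entries from the text
blocks = ['any', 'some', 'where', 'how', 'body']

# Every combination of a leading block with a trailing block
compounds = [head + tail for head in ('any', 'some') for tail in ('where', 'how', 'body')]

covered = blocks + compounds
print(len(covered), compounds)
```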

The choice between keeping the full word in a vocabulary or breaking it into smaller pieces is made according to how common or rare the word is, and the algorithm being used. There are a few different algorithms, such as WordPiece, Byte-Pair Encoding (BPE), and Byte-Level BPE, which are used by BERT, GPT, and CLIP models, respectively.

For more details on sub-word tokenizers such as WordPiece, Byte-Pair Encoding (BPE), and SentencePiece, please check Hugging Face’s "[Summary of the Tokenizers](https://huggingface.co/docs/transformers/tokenizer_summary)" and Cathal Horan’s great post "Tokenizers: How machines read" on FloydHub.

In a nutshell, it is the tokenizer's own internal model (yes, you can train a tokenizer!) that learns how to split a complex or unusual token/word into smaller components.

BERT's tokenizer model was trained using the WordPiece algorithm:

In [None]:
tok_obj.model

<tokenizers.models.WordPiece at 0x7f2c4cf87890>

If we call the model's token_to_id() method, it will simply look the token up in the vocabulary and, if it is not a valid key, the method returns None instead of the token's index:

In [None]:
tokens_only = [token[0] for token in tokens]
token_ids = [tok_obj.model.token_to_id(token) for token in tokens_only]
print(tokens_only)
print(token_ids)

['reuters', '-', 'short', '-', 'sellers', ',', 'wall', 'street', "'", 's', 'dwindling', 'band', 'of', 'ultra', '-', 'cynics', ',', 'are', 'seeing', 'green', 'again', '.']
[26665, 1011, 2460, 1011, 19041, 1010, 2813, 2395, 1005, 1055, None, 2316, 1997, 11087, 1011, None, 1010, 2024, 3773, 2665, 2153, 1012]


As we've seen before, "dwindling" is not part of the vocabulary, so its token ID is missing (None) in the list above, and the same goes for "cynics". Let's locate the first missing token:

In [None]:
missing_id = token_ids.index(None)
missing_token = tokens_only[missing_id]
missing_id, missing_token

(10, 'dwindling')

Well, the second step in the pipeline was called pre-tokenization for a reason! Some tokens, those that are not keys in the vocabulary dictionary, need to be further tokenized into smaller components. We can call the model's tokenize() method on the missing token to handle this situation:

In [None]:
tokenized_word = tok_obj.model.tokenize(missing_token)
[piece.as_tuple() for piece in tokenized_word]

[(1040, 'd', (0, 1)), (11101, '##wind', (1, 5)), (2989, '##ling', (5, 9))]

The (pre-)token was decomposed into three parts, "d", "##wind", and "##ling", where the "##" prefix denotes tokens that are not whole words themselves, but rather parts of a word.
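One common way to perform this kind of decomposition is a greedy, longest-match-first search over the vocabulary. The sketch below illustrates that idea with a tiny hand-picked vocabulary; the function name and the toy vocabulary are assumptions, and it is not meant to reproduce the library's implementation:

```python
def wordpiece_sketch(word, vocab, unk='[UNK]', prefix='##'):
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, shrinking it from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # not at the beginning of the word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Tiny, hand-picked vocabulary for illustration only
toy_pieces = {'d', '##wind', '##ling', 'band', 'cy', '##nic', '##s'}

print(wordpiece_sketch('dwindling', toy_pieces))
print(wordpiece_sketch('cynics', toy_pieces))
```

A word that is already in the vocabulary comes back as a single piece, and a word that cannot be assembled from the available pieces falls back to the unknown token.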

The tokenizer's encode() method does that automatically, though, as we can see in the example below:

In [None]:
encoded = tok_obj.encode(descriptions[0], add_special_tokens=False)
print(encoded.tokens)
print(encoded.ids)

['reuters', '-', 'short', '-', 'sellers', ',', 'wall', 'street', "'", 's', 'd', '##wind', '##ling', 'band', 'of', 'ultra', '-', 'cy', '##nic', '##s', ',', 'are', 'seeing', 'green', 'again', '.']
[26665, 1011, 2460, 1011, 19041, 1010, 2813, 2395, 1005, 1055, 1040, 11101, 2989, 2316, 1997, 11087, 1011, 22330, 8713, 2015, 1010, 2024, 3773, 2665, 2153, 1012]


Did you notice that we explicitly set the add_special_tokens argument to False? Let's discuss these tokens in the next section.

### 15.5.3 Special Tokens

The last step of the tokenizer's pipeline, the postprocessor, is the one responsible for prepending and appending special tokens to the tokenized inputs. Let's see what it does to the encoded sentence we got as a result of the previous step:

In [None]:
post_processor = tok_obj.post_processor
post_encoded = post_processor.process(encoded)
print(post_encoded.tokens)

['[CLS]', 'reuters', '-', 'short', '-', 'sellers', ',', 'wall', 'street', "'", 's', 'd', '##wind', '##ling', 'band', 'of', 'ultra', '-', 'cy', '##nic', '##s', ',', 'are', 'seeing', 'green', 'again', '.', '[SEP]']


By the way, the output is exactly what we would have obtained from the tokenizer's encode() method if we hadn't turned the special tokens off:

In [None]:
print(tok_obj.encode(descriptions[0]).tokens)

['[CLS]', 'reuters', '-', 'short', '-', 'sellers', ',', 'wall', 'street', "'", 's', 'd', '##wind', '##ling', 'band', 'of', 'ultra', '-', 'cy', '##nic', '##s', ',', 'are', 'seeing', 'green', 'again', '.', '[SEP]']


There are two special tokens in the sentence above: the classification ([CLS]) token and the separation ([SEP]) token. Let's take a closer look at both of them.

#### 15.5.3.1 `[CLS]`: Classification Token

The classification token ([CLS]) is a "very" special token that is prepended to the sequence. Unlike the other special tokens, which are primarily used to define the boundaries of a sequence, the classification token can also be used as a type of "summary" of the whole sequence in classification tasks (hence its name). We can retrieve the token and its corresponding ID using the cls_token and cls_token_id attributes of the tokenizer:

In [None]:
tokenizer.cls_token, tokenizer.cls_token_id

('[CLS]', 101)

#### 15.5.3.2 `[SEP]`: Separation Token

The [SEP] token is used to either separate two sentences or to mark the end of a single sentence. We can retrieve the token and its corresponding ID using the sep_token and sep_token_id attributes of the tokenizer:

In [None]:
tokenizer.sep_token, tokenizer.sep_token_id

('[SEP]', 102)

In the example below, the [SEP] token is used for both separating the two sentences and marking the end of the whole sequence:

In [None]:
print(tok_obj.encode(*descriptions[:2]).tokens)

['[CLS]', 'reuters', '-', 'short', '-', 'sellers', ',', 'wall', 'street', "'", 's', 'd', '##wind', '##ling', 'band', 'of', 'ultra', '-', 'cy', '##nic', '##s', ',', 'are', 'seeing', 'green', 'again', '.', '[SEP]', 'reuters', '-', 'private', 'investment', 'firm', 'carly', '##le', 'group', ',', 'which', 'has', 'a', 'reputation', 'for', 'making', 'well', '-', 'timed', 'and', 'occasionally', 'controversial', 'plays', 'in', 'the', 'defense', 'industry', ',', 'has', 'quietly', 'placed', 'its', 'bets', 'on', 'another', 'part', 'of', 'the', 'market', '.', '[SEP]']
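The pattern in the outputs above, [CLS] A [SEP] for one sentence and [CLS] A [SEP] B [SEP] for a pair, can be sketched in plain Python. The function name and the short token lists are made up for illustration:

```python
def add_special_tokens_sketch(tokens_a, tokens_b=None, cls='[CLS]', sep='[SEP]'):
    # Single sequence: [CLS] A [SEP]; pair of sequences: [CLS] A [SEP] B [SEP]
    output = [cls] + list(tokens_a) + [sep]
    if tokens_b is not None:
        output += list(tokens_b) + [sep]
    return output

single = add_special_tokens_sketch(['seeing', 'green', 'again', '.'])
pair = add_special_tokens_sketch(['seeing', 'green', '.'], ['private', 'investment', 'firm'])
print(single)
print(pair)
```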


#### 15.5.3.3 `[UNK]`: Unknown Token

The [UNK] token, once a common fixture of tokenized sequences, rarely shows up anymore. The third step in the pipeline, the tokenizer's model, handles words that are missing from the vocabulary by splitting them further into smaller tokens that are present in it.

We can retrieve the token and its corresponding ID using the unk_token and unk_token_id attributes of the tokenizer:

In [None]:
tokenizer.unk_token, tokenizer.unk_token_id

('[UNK]', 100)

#### 15.5.3.4 `[PAD]`: Padding Token

The [PAD] token is used to pad (or stuff) sequences so their lengths match. We can retrieve the token and its corresponding ID using the pad_token and pad_token_id attributes of the tokenizer:

In [None]:
tokenizer.pad_token, tokenizer.pad_token_id

('[PAD]', 0)

Our mini-batch has four data points, and each data point has a description containing a different number of tokens:

In [None]:
[len(seq) for seq in tokenizer(descriptions)['input_ids']]

[28, 41, 38, 34]

We cannot make tensors out of lists of different lengths, so we need to set the padding argument to True to make every sequence the same length as the longest one:

In [None]:
padded_token_ids = tokenizer(descriptions, padding=True, return_tensors='pt')['input_ids']
padded_token_ids

tensor([[  101, 26665,  1011,  2460,  1011, 19041,  1010,  2813,  2395,  1005,
          1055,  1040, 11101,  2989,  2316,  1997, 11087,  1011, 22330,  8713,
          2015,  1010,  2024,  3773,  2665,  2153,  1012,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0],
        [  101, 26665,  1011,  2797,  5211,  3813, 18431,  2571,  2177,  1010,
          2029,  2038,  1037,  5891,  2005,  2437,  2092,  1011, 22313,  1998,
          5681,  6801,  3248,  1999,  1996,  3639,  3068,  1010,  2038,  5168,
          2872,  2049, 29475,  2006,  2178,  2112,  1997,  1996,  3006,  1012,
           102],
        [  101, 26665,  1011, 23990, 13587,  7597,  4606, 15508,  2055,  1996,
          4610,  1998,  1996, 17680,  2005, 16565,  2024,  3517,  2000,  6865,
          2058,  1996,  4518,  3006,  2279,  2733,  2076,  1996,  5995,  1997,
          1996,  2621,  2079,  6392,  6824,  2015,  1012,   102,     0,     0,
             0],
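Padding itself is a simple operation, as the plain-Python sketch below suggests. The function name is hypothetical, the short sequences reuse a few token IDs seen in the outputs above, and the mask of ones and zeros is an addition for illustration (it marks which positions hold real tokens):

```python
def pad_batch_sketch(batch_ids, pad_id=0):
    max_len = max(len(seq) for seq in batch_ids)
    # Append the padding ID until every sequence is as long as the longest one
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch_ids]
    # 1 marks a real token, 0 marks padding
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch_ids]
    return padded, mask

toy_batch = [[101, 2813, 2395, 102], [101, 2665, 102], [101, 2316, 1997, 11087, 102]]
padded_ids, padding_mask = pad_batch_sketch(toy_batch)
print(padded_ids)
print(padding_mask)
```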
 

### 15.5.4 Truncation

What happens if a sentence is too long? Models can only take so many tokens as inputs, so we may need to truncate our sequences to the maximum length taken by our model.

But if the [SEP] token were appended to the end of a sequence that is too long, it would be the first token to be cut off by truncation! Therefore, the first thing the tokenizer needs to do is to truncate our sequences to a length two tokens shorter than the maximum length supported by our model.

That's why the tokenizer's max_len_single_sentence attribute is two tokens shorter than its model_max_length:

In [None]:
tokenizer.max_len_single_sentence, tokenizer.model_max_length

(510, 512)

Only after truncating the sentence can it proceed to prepend and append the two special tokens, [CLS] and [SEP], respectively, and, if needed, pad the sequences.
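The order of operations can be sketched as follows. The function name and the made-up token IDs are assumptions; the IDs 101 and 102 are the [CLS] and [SEP] IDs shown earlier:

```python
def truncate_then_wrap_sketch(ids, model_max_length=512, cls_id=101, sep_id=102):
    # Reserve two positions for [CLS] and [SEP] before truncating
    max_len_single = model_max_length - 2
    return [cls_id] + ids[:max_len_single] + [sep_id]

long_ids = list(range(1000, 1600))  # 600 made-up token IDs
wrapped = truncate_then_wrap_sketch(long_ids)
print(len(wrapped), wrapped[0], wrapped[-1])
```

Truncating first guarantees that the final sequence still ends with [SEP] and never exceeds the model's maximum length.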

## 15.6 Embeddings

Tokenizers are fun and all, but they are only the opening act.

Let us introduce you to the main star of the NLP world: "Embeddings". Embeddings are at the front and center of everything we'll be doing in this course from now on. We can use them to perform text classification, semantic search, clustering, you name it.

We have already touched upon the topic of embeddings a few times in this course. First, we used them to convert categorical features (from the Auto MPG dataset, remember that?) into numerical ones. Then, we used them once again while fine-tuning RoBERTa to perform sentiment analysis. Finally, we briefly talked about them while discussing the powerful CLIP model, which bridges the gap between the worlds of image and text by using, guess what, embeddings!

So, as you can see, embeddings are everywhere! But what exactly is an embedding? If we're talking about an embedding layer in PyTorch, it works just like a lookup table. We could, for example, create an embedding layer to handle our vocabulary of 30,522 tokens, assigning a tensor to each and every token.

Each token gets converted to an index, and that index is then used to look the corresponding tensor up in the embedding layer:

In [None]:
import torch.nn as nn

emb_dims = 50
embeddings = nn.Embedding(len(vocab), emb_dims)
embeddings

Embedding(30522, 50)

Let's try it out:

In [None]:
import torch

idx = torch.as_tensor([vocab['reuters']])
idx, embeddings(idx)

(tensor([26665]),
 tensor([[ 1.8288e-01, -1.3801e+00, -5.3830e-01, -1.4301e-01, -2.0827e-01,
          -1.7362e+00,  1.0018e+00,  4.6152e-01,  3.2680e-02, -5.8854e-01,
           3.1180e-01,  6.0066e-01, -1.2477e-01, -1.1660e+00, -1.2219e+00,
           1.0182e+00, -2.0216e-01, -3.1973e-01, -5.4026e-01, -1.8794e+00,
           2.2819e-01,  2.7748e-01, -1.4689e-01, -9.8170e-01, -2.1549e+00,
           4.9118e-01, -4.7388e-01, -2.3673e-01,  2.1740e-03,  1.1351e-01,
          -1.0422e+00,  1.4274e+00, -9.8884e-02, -7.2925e-01, -3.3722e-01,
           4.6264e-01, -3.8414e-01, -9.5412e-01, -5.6739e-02,  2.9316e+00,
           1.2275e-01,  5.6614e-01,  1.0147e-01, -8.2784e-01,  1.8933e-01,
          -8.5093e-01, -3.1484e-01, -1.7159e+00,  6.4275e-01,  2.7018e+00]],
        grad_fn=<EmbeddingBackward0>))

The token "reuters" corresponds to index 26,665, and there is its corresponding tensor.

Notice that this tensor is the row at that index in the tensor that represents the weights of the embedding layer:

In [None]:
embeddings.weight[idx]

tensor([[ 1.8288e-01, -1.3801e+00, -5.3830e-01, -1.4301e-01, -2.0827e-01,
         -1.7362e+00,  1.0018e+00,  4.6152e-01,  3.2680e-02, -5.8854e-01,
          3.1180e-01,  6.0066e-01, -1.2477e-01, -1.1660e+00, -1.2219e+00,
          1.0182e+00, -2.0216e-01, -3.1973e-01, -5.4026e-01, -1.8794e+00,
          2.2819e-01,  2.7748e-01, -1.4689e-01, -9.8170e-01, -2.1549e+00,
          4.9118e-01, -4.7388e-01, -2.3673e-01,  2.1740e-03,  1.1351e-01,
         -1.0422e+00,  1.4274e+00, -9.8884e-02, -7.2925e-01, -3.3722e-01,
          4.6264e-01, -3.8414e-01, -9.5412e-01, -5.6739e-02,  2.9316e+00,
          1.2275e-01,  5.6614e-01,  1.0147e-01, -8.2784e-01,  1.8933e-01,
         -8.5093e-01, -3.1484e-01, -1.7159e+00,  6.4275e-01,  2.7018e+00]],
       grad_fn=<IndexBackward0>)

Perfect match; after all, the embedding layer works as a lookup table to its own weights.
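The lookup-table behavior can be mimicked without PyTorch. In the sketch below, everything is hypothetical (a four-token vocabulary, five dimensions, an arbitrary seed); it also shows that selecting a row gives the same result as multiplying a one-hot vector by the weight matrix:

```python
import random

random.seed(13)  # arbitrary seed, only to make the sketch reproducible

toy_vocab = {'[PAD]': 0, 'reuters': 1, 'wall': 2, 'street': 3}  # hypothetical tiny vocabulary
emb_dims = 5

# One randomly initialized row per token, like a freshly created embedding layer
weights = [[random.gauss(0, 1) for _ in range(emb_dims)] for _ in range(len(toy_vocab))]

def embed(token_ids):
    # The "forward pass" is just row selection
    return [weights[i] for i in token_ids]

idx = toy_vocab['reuters']
looked_up = embed([idx])[0]

# Same result through a one-hot vector multiplied by the weight matrix
one_hot = [1.0 if i == idx else 0.0 for i in range(len(toy_vocab))]
multiplied = [sum(one_hot[i] * weights[i][j] for i in range(len(toy_vocab)))
              for j in range(emb_dims)]

print(looked_up == weights[idx])
print(all(abs(a - b) < 1e-12 for a, b in zip(looked_up, multiplied)))
```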

Of course, the values we got now are completely meaningless. As with every other layer, its weights (the embeddings) were randomly initialized.

In the non-linear regression model for the Auto MPG Dataset, the embedding layers representing the categorical features got trained together with the rest of the model so, in the end, our model learned how to represent those features numerically.

When it comes to words, though, it is a much more daunting task. There are, literally, hundreds of thousands of unique words in English. But, instead of trying to accomplish this ourselves, let's stand on the shoulders of giants and use good old pretrained word embeddings.

### 15.6.1 Word2Vec

It wasn't practical to use one-hot encoding (OHE) with a vocabulary of hundreds of thousands of unique words. Moreover, OHE vectors are orthogonal by nature, thus making every word completely independent of all others. That's not how languages work: there are synonyms and antonyms, and words may be modified by their relationship to other dimensions such as gender, for example. "King" and "Queen" are related words, since they both represent royals (a man and a woman, respectively). They shouldn't be orthogonal, that is, their numerical representations should be comparable and more similar to one another than, for example, "King" and "Jester" or "King" and "Horse". That's the general idea behind Word2Vec, proposed in 2013.
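The orthogonality argument is easy to verify numerically. In the sketch below, the one-hot vectors are built for a toy four-word vocabulary, and the dense vectors are made-up values used only to contrast the two representations (they are not real embeddings):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# One-hot vectors for a toy vocabulary of four words
words = ['king', 'queen', 'jester', 'horse']
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(words))]
           for i, w in enumerate(words)}

# Made-up dense vectors, for illustration only (not real embeddings)
dense = {'king': [0.9, 0.8], 'queen': [0.9, 0.6], 'horse': [-0.5, 0.7]}

print(cosine(one_hot['king'], one_hot['queen']))
print(cosine(dense['king'], dense['queen']), cosine(dense['king'], dense['horse']))
```

Any two different one-hot vectors have a similarity of zero, while dense vectors can express that some words are closer to each other than others.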

In the continuous-bag-of-words (CBoW) architecture used to train the embeddings, the goal was to predict the central word in a sentence, using words both before and after it as context. For example, in the sentence "Yesterday, the Queen was crowned", "yesterday", "the", "was", and "crowned" are the context around the central word, "Queen". Being crowned is a typical thing for both Queens and Kings, but not so much for anyone else, right? So, the model should quickly learn that those two words, and likely only those, are good candidates for the central word. "Yesterday, the Aardvark was crowned" isn't gonna cut it! By extensively training the model over lots and lots of text, it will eventually figure out appropriate numerical representations for the words.
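To make the training setup more concrete, the sketch below generates (context, central word) pairs from the example sentence. The function name and the window size of two words on each side are assumptions made for illustration:

```python
def cbow_pairs_sketch(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        # Words before and after the central word, up to `window` on each side
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

sentence_tokens = ['yesterday', 'the', 'queen', 'was', 'crowned']
cbow_pairs = cbow_pairs_sketch(sentence_tokens)
print(cbow_pairs[2])
```

Each pair is one training example: the model sees the context words and is asked to predict the central one.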

It has been (only?) a decade, but at the time everyone marveled at the possibility of doing something rather impressive: embedding arithmetic.

### 15.6.2 Embedding Arithmetic

Embeddings are vectors, and you can easily add or subtract them from one another. What happens if you start using words as variables in an equation?

**KING - MAN + WOMAN = ?**

In theory, since embeddings learned how to encode abstract dimensions, "KING - MAN" should result in the vector that represents "royalty". If you add "WOMAN" to it, you should get a royal woman, that is, a "QUEEN".

**KING - MAN + WOMAN = QUEEN**

The figure below is a hypothetical two-dimensional representation of the relationships, in embedding space, among these four words: king, queen, man, and woman ('w' stands for 'weight' since embeddings are weights in an embedding layer):


![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch13/embed_arithmetic.png)

Arithmetic with kings and queens



In its large 50-dimensional feature space, the model learned to place "man" as far apart from "woman" as "king" is from "queen" (roughly approximating the gender difference between the two). Similarly, the model learned to place "king" as far apart from "man" as "queen" is from "woman" (roughly approximating the difference of being a royal).

In practice, if you try out the equation above using some pretrained word embeddings (such as GloVe, covered in the next section), you'll see that the word that's actually most similar to the result is "King" itself, not "Queen".

**KING - MAN + WOMAN ~ KING**

This may happen because the embedding space is very sparse (that is, each word is "far apart" from every other word). For this reason, it is usual to exclude the original words from the results. In this particular case, the runner-up is, in fact, "Queen".
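The effect can be reproduced with a toy example. The two-dimensional vectors below are entirely made up (they are not GloVe values) and were chosen so that the result of the equation stays closest to "king"; the helper names `cosine` and `rank` are also hypothetical.

```python
import math

# Hypothetical 2-D embeddings, made up for illustration: (royalty, gender)
toy = {
    "king":  [0.9, 0.1],
    "queen": [0.9, 0.9],
    "man":   [0.1, 0.2],
    "woman": [0.1, 0.4],
    "horse": [-0.5, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# KING - MAN + WOMAN, computed dimension by dimension
synthetic = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

def rank(query, exclude=()):
    """Sort the vocabulary by cosine similarity to the query, most similar first."""
    candidates = [word for word in toy if word not in exclude]
    return sorted(candidates, key=lambda word: cosine(query, toy[word]), reverse=True)

print(rank(synthetic))                                    # all words compete
print(rank(synthetic, exclude=("king", "man", "woman")))  # input words excluded
```

With every word competing, "king" comes first and "queen" second; once the three input words are excluded, "queen" takes the top spot.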

Here is an illustration of the vectors involved in the computation. Of course, it's impossible to tell, visually, if the "synthetic Queen" (the result of the equation) is more similar to the vector representing "King" or that representing "Queen".

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch13/synthetic_queen.png)

Synthetic Queen

Of course, we're not adding and subtracting words, but the fact that these embeddings can capture the relationship between different words (as opposed to being orthogonal as one-hot encoded vectors) makes them quite useful in tasks such as sentiment analysis or text classification in general.

### 15.6.3 Global Vectors (GloVe)

Stanford's Global Vectors (GloVe) are among the most successful pretrained word embeddings. They represented a great leap forward when they were released back in 2014, in the BT (Before Transformers) era. They are simple and straightforward to use, and they can deliver very good results in tasks such as text classification, as we'll see very soon.

They can be retrieved using Gensim's downloader. We'll be loading the 50-dimension version:

In [None]:
from gensim import downloader

vec = downloader.load('glove-wiki-gigaword-50')

Gensim's KeyedVectors class, which the downloaded GloVe is an instance of, has plenty of methods to retrieve and compare embeddings. Moreover, we can easily peek at its internal vectors using its vectors attribute:

In [None]:
vec.vectors, vec.vectors.shape

(array([[ 0.418   ,  0.24968 , -0.41242 , ..., -0.18411 , -0.11514 ,
         -0.78581 ],
        [ 0.013441,  0.23682 , -0.16899 , ..., -0.56657 ,  0.044691,
          0.30392 ],
        [ 0.15164 ,  0.30177 , -0.16763 , ..., -0.35652 ,  0.016413,
          0.10216 ],
        ...,
        [-0.51181 ,  0.058706,  1.0913  , ..., -0.25003 , -1.125   ,
          1.5863  ],
        [-0.75898 , -0.47426 ,  0.4737  , ...,  0.78954 , -0.014116,
          0.6448  ],
        [ 0.072617, -0.51393 ,  0.4728  , ..., -0.18907 , -0.59021 ,
          0.55559 ]], dtype=float32),
 (400000, 50))

There are 400,000 entries in it, each corresponding to a token in its extensive vocabulary, each returning 50 numerical features. Let's see what the "reuters" embedding looks like in GloVe:

In [None]:
vec['reuters']

array([-0.13741  , -0.25495  ,  1.8853   ,  0.1476   ,  0.63859  ,
       -0.67678  , -1.1622   , -0.21528  ,  0.2598   , -0.52879  ,
        0.66678  , -0.76747  , -0.52731  ,  0.06657  ,  0.076613 ,
        0.32743  , -0.80251  , -0.4955   , -0.37393  ,  0.11261  ,
        1.1671   ,  1.1508   ,  0.61801  ,  0.079467 ,  0.1269   ,
       -0.072447 , -1.2037   , -0.24622  , -0.77076  ,  0.76699  ,
        1.2745   , -0.12898  ,  0.99892  , -0.26733  , -0.57542  ,
       -1.0151   , -0.14278  , -0.43824  ,  0.76577  , -0.0087715,
        1.2848   ,  0.0030819,  0.1186   , -0.38817  , -0.23516  ,
       -0.92094  , -0.51644  ,  1.5083   ,  0.36456  ,  0.59912  ],
      dtype=float32)

GloVe embeddings, although somewhat "old school", are pretrained embeddings nonetheless. Therefore, we can load them into a PyTorch embedding layer, provided we convert the NumPy array into a PyTorch tensor first:

In [None]:
import torch
import torch.nn as nn

tensor_glove = torch.as_tensor(vec.vectors).float()
embedding = nn.Embedding.from_pretrained(tensor_glove)
embedding.state_dict()

OrderedDict([('weight',
              tensor([[ 0.4180,  0.2497, -0.4124,  ..., -0.1841, -0.1151, -0.7858],
                      [ 0.0134,  0.2368, -0.1690,  ..., -0.5666,  0.0447,  0.3039],
                      [ 0.1516,  0.3018, -0.1676,  ..., -0.3565,  0.0164,  0.1022],
                      ...,
                      [-0.5118,  0.0587,  1.0913,  ..., -0.2500, -1.1250,  1.5863],
                      [-0.7590, -0.4743,  0.4737,  ...,  0.7895, -0.0141,  0.6448],
                      [ 0.0726, -0.5139,  0.4728,  ..., -0.1891, -0.5902,  0.5556]]))])

In order to find out the entry corresponding to a given word, we can use the key_to_index dictionary attribute. We can also use the index_to_key attribute to get the token corresponding to a given index:

In [None]:
idx = vec.key_to_index['reuters']
token = vec.index_to_key[idx]
idx, token

(10851, 'reuters')

The embedding layer is just a big lookup table, so we can use the reuters index as an argument to retrieve its corresponding embedding:

In [None]:
embedding(torch.as_tensor(idx))

It is simple and straightforward, as long as the word is part of GloVe's vocabulary.

But what if we make something up?

In [None]:
vec.key_to_index['zzzzz']

KeyError: 'zzzzz'

Unfortunately, Gensim's implementation of GloVe vectors does not handle missing words gracefully, raising an exception instead. We'll need to check ourselves whether a given word is in the vocabulary or not. In this case, we assign that token a special ID for an unknown token.

In [None]:
def encode_str(key_to_index, tokens, unk_token=-1):
    token_ids = [key_to_index.get(token, unk_token) for token in tokens]
    return token_ids

In the encode_str() function above, invalid tokens return -1 as their ID, for example:

In [None]:
some_ids = encode_str(vec.key_to_index, ['reuters', 'zzzzz'])
some_ids

[10851, -1]

We still have to filter out these invalid tokens whenever we're retrieving the tokens' embeddings:

In [None]:
def get_embeddings(embedding, token_ids):
    valid_ids = torch.as_tensor([token_id for token_id in token_ids if token_id >= 0])
    embedded_tokens = embedding(valid_ids)
    return embedded_tokens

In [None]:
get_embeddings(embedding, some_ids)

tensor([[-0.1374, -0.2549,  1.8853,  0.1476,  0.6386, -0.6768, -1.1622, -0.2153,
          0.2598, -0.5288,  0.6668, -0.7675, -0.5273,  0.0666,  0.0766,  0.3274,
         -0.8025, -0.4955, -0.3739,  0.1126,  1.1671,  1.1508,  0.6180,  0.0795,
          0.1269, -0.0724, -1.2037, -0.2462, -0.7708,  0.7670,  1.2745, -0.1290,
          0.9989, -0.2673, -0.5754, -1.0151, -0.1428, -0.4382,  0.7658, -0.0088,
          1.2848,  0.0031,  0.1186, -0.3882, -0.2352, -0.9209, -0.5164,  1.5083,
          0.3646,  0.5991]])

See? Even though the list has two tokens, one of them is invalid, and therefore we only get one embedding back.

The function builder below takes an instance of Gensim's GloVe embeddings, builds the corresponding PyTorch embedding layer, and then builds and returns the get_vecs_by_tokens() function. The resulting function, in turn, takes a list of tokens as its argument, filters out those tokens without valid IDs in GloVe's vocabulary, and retrieves the embeddings of the remaining ones:

In [None]:
def func_builder(vec):
    tensor_glove = torch.as_tensor(vec.vectors).float()
    embedding = nn.Embedding.from_pretrained(tensor_glove)

    def get_vecs_by_tokens(tokens):
        token_ids = encode_str(vec.key_to_index, tokens)
        embedded_tokens = get_embeddings(embedding, token_ids)
        return embedded_tokens

    return get_vecs_by_tokens

get_vecs_by_tokens = func_builder(vec)

Remember, GloVe vectors were trained on whole words, so we need to tokenize our sentences accordingly. We can use Gensim's own simple_preprocess() method (introduced earlier in this chapter) to get a list of tokens:

In [None]:
from gensim.utils import simple_preprocess
tokens = simple_preprocess(descriptions[0])
tokens

['reuters',
 'short',
 'sellers',
 'wall',
 'street',
 'dwindling',
 'band',
 'of',
 'ultra',
 'cynics',
 'are',
 'seeing',
 'green',
 'again']

Next, we can use the get_vecs_by_tokens() function to retrieve the GloVe embeddings for our sentence:

In [None]:
embedded_tokens = get_vecs_by_tokens(tokens)
embedded_tokens.shape

torch.Size([14, 50])

Fourteen tokens, 50 dimensions each. Notice that a call to this function hides the step of transforming tokens into indices (it's performed internally using the encode_str() function) and returns the corresponding embeddings directly. This won't be the case for more sophisticated approaches, such as contextual embeddings, where embeddings don't actually work like a lookup table anymore. We're getting ahead of ourselves, though. Let's keep it simple for now.

## 15.7 Vector Databases

### Vector Databases: Overview
In the age of Large Language Models (LLMs), vector databases are all the rage! Why? Because they are perfect for storing vectors, that is, embeddings. They make it very easy to search for embeddings that are similar to each other, and this particular characteristic plays a crucial role in providing context to LLMs, as we'll see in a bit more detail towards the end of this course.

There are many vector databases: some of them work in-memory only, some can optionally be persisted to disk, and some are full-fledged databases in the sense that they support the typical operations found in relational databases.

In this course, we'll be briefly discussing ChromaDB only, given its ease of use, and the fact that it can be easily run in a Jupyter notebook.

### 15.7.1 ChromaDB

[ChromaDB](https://docs.trychroma.com/getting-started) is an open-source embedding database that allows you to store embeddings and metadata, embed documents and queries, and search embeddings.

In this example, we'll be storing a collection of GloVe embeddings for the AG News Dataset on a persisted database, and then will query the collection to search for similar items.

Creating a database in ChromaDB follows a short sequence of steps:
- getting a client, which we can configure to persist the data
- creating a collection that will store the embeddings and metadata - you can think of it as a folder or table
- adding documents (to be embedded by ChromaDB itself) or embeddings (as we're doing here) to the collection, along with any corresponding metadata you may wish to add
- querying the collection to get the most similar results back

Let's get a client and define `agnews_db` as the folder our collection must be saved to:

In [None]:
import chromadb

client = chromadb.PersistentClient(path="./agnews_db")

Creating the collection itself is easy: we only have to give it a name.

In [None]:
collection = client.create_collection("agnews_collection")

Next, we go over our dataset, computing the embeddings and storing them along with the documents (sentences), their IDs, and their labels (as metadata).

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch13/vector_db.png)

Populating a Vector Database

The first helper function, tokenize_batch(), applies the tokenizer to each sentence in a mini-batch, while the second helper function, get_bag_of_embeddings(), computes the average of the embeddings of all tokens in each sentence of a mini-batch:

In [None]:
def tokenize_batch(sentences, tokenizer=None):
    if tokenizer is None:
        tokenizer = simple_preprocess

    return [tokenizer(s) for s in sentences]

def get_bag_of_embeddings(tokens):
    embeddings = torch.cat([get_vecs_by_tokens(s).mean(axis=0).unsqueeze(0) for s in tokens], dim=0)
    return embeddings
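The averaging step itself can be sketched without PyTorch. One caveat worth keeping in mind: a sentence whose tokens are all missing from GloVe's vocabulary leaves nothing to average, and the helper above does not handle that case explicitly. The function below, `mean_pool`, is a hypothetical stand-in that falls back to a zero vector; that fallback is an assumption made for this sketch, not the chapter's behavior.

```python
def mean_pool(token_vectors, dim):
    """Average a list of token vectors into a single sentence vector.
    Falls back to a zero vector when there are no tokens to average
    (an assumption made for this sketch)."""
    if not token_vectors:
        return [0.0] * dim
    n = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

# Two hypothetical 3-dimensional token embeddings
sentence = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(mean_pool(sentence, dim=3))  # [2.0, 3.0, 4.0]
print(mean_pool([], dim=3))        # [0.0, 0.0, 0.0]
```

Whether a zero vector, a skipped document, or an explicit error is the right choice depends on the application.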

Notice that we're creating a new, unshuffled, data loader for our training set so we can assign a sequential number to each data point as its ID.

In [None]:
from torch.utils.data import DataLoader

batch_size = 32
unshuffled_dl = DataLoader(dataset=datasets['train'], batch_size=batch_size, shuffle=False)

for i, batch in enumerate(unshuffled_dl):
    labels, sentences = batch['topic'], batch['news']
    tokens = tokenize_batch(sentences)
    embeddings = get_bag_of_embeddings(tokens)
    # zero-padded sequential IDs for the documents in this mini-batch
    ids = [f'{j:06}' for j in range(i*batch_size, i*batch_size+len(sentences))]

    collection.add(embeddings=embeddings.tolist(),
                   documents=sentences,
                   metadatas=[{'label': v} for v in labels.tolist()],
                   ids=ids)

    if i == 300: # roughly 10k docs
        break

In [None]:
collection.count()

9632

We have loaded roughly 10,000 documents into our database. Let's query it!

### 15.7.2 Similarity Search

In a nutshell, that's what vector databases are built for: similarity search. They allow faster comparison among hundreds of thousands of vectors. We're not discussing their implementation details, though. We're interested in using their ability to quickly search for similar vectors (embeddings) to find and group similar documents together.
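To get a feel for what a similarity search does, here is a brute-force sketch in plain Python. It compares the query against every stored vector, which is the exhaustive scan that vector databases are designed to avoid. The function names, the toy vectors, and the use of the squared Euclidean distance are all assumptions made for illustration.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def brute_force_query(stored, query, n_results=2):
    """Compare the query against every stored vector and return the
    (id, distance) pairs with the smallest distances."""
    scored = [(doc_id, squared_l2(vec, query)) for doc_id, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1])[:n_results]

# Hypothetical 2-D "document embeddings" keyed by ID
stored = {
    "000000": [0.0, 0.0],
    "000001": [1.0, 1.0],
    "000002": [5.0, 5.0],
}
results = brute_force_query(stored, query=[1.0, 1.0], n_results=2)
print(results)  # [('000001', 0.0), ('000000', 2.0)]
```

A query that matches a stored vector exactly comes back with a distance of zero, followed by its nearest neighbors in increasing order of distance.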

Let's take a sentence from the AG News Dataset that talks about a nuclear plant and compute its bag of embeddings (pretending it wasn't loaded in the database already):

In [None]:
query_sentence = 'The company running the Japanese nuclear plant hit by a fatal accident is to close its reactors for safety checks.'
query_tokens = tokenize_batch([query_sentence])
query_embeddings = get_bag_of_embeddings(query_tokens)[0]

query_embeddings

tensor([ 4.0827e-01,  7.9920e-02,  3.1115e-01,  1.8721e-01, -4.7369e-02,
         3.3698e-01, -5.0617e-01, -3.6810e-02,  2.6068e-01, -1.2847e-01,
         2.2948e-01, -5.9424e-02, -3.3787e-01,  2.9188e-02,  2.6071e-01,
         2.0179e-01, -7.7526e-02,  3.3718e-01, -5.2526e-01, -2.7158e-01,
         3.7156e-01, -1.0214e-01, -1.1645e-01, -2.9637e-01,  7.9672e-02,
        -1.6904e+00, -4.1659e-02,  1.0523e-01,  2.8247e-01,  2.9835e-03,
         3.0708e+00, -1.5180e-01, -1.4941e-01, -2.3085e-01,  2.1777e-01,
        -4.0086e-02,  2.0281e-01,  8.9309e-02,  5.5554e-02,  2.6830e-02,
        -2.9215e-01, -1.5104e-01,  2.6449e-01, -1.0034e-01,  1.4842e-01,
         8.8036e-02, -2.3717e-01,  3.1029e-01,  1.2701e-02, -1.4513e-01])

The query embeddings must be computed and/or retrieved in the same way the embeddings stored in the database were. You cannot mix and match different embeddings. If we stored bags of GloVe embeddings in the database, we need to query it using bags of GloVe embeddings as well.

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch13/query_db.png)

Querying a Vector Database

So, let's query the database using the computed query embeddings, and retrieve the top 5 results for it:

In [None]:
query_embeddings = query_embeddings.tolist()
collection.query(query_embeddings=query_embeddings, n_results=5)

{'ids': [['000030', '001046', '004715', '002464', '006905']],
 'distances': [[0.0,
   0.8038501739501953,
   0.9175586104393005,
   0.9644219875335693,
   0.9812381267547607]],
 'metadatas': [[{'label': 2},
   {'label': 0},
   {'label': 0},
   {'label': 0},
   {'label': 2}]],
 'embeddings': None,
 'documents': [['The company running the Japanese nuclear plant hit by a fatal accident is to close its reactors for safety checks.',
   'AP - The operator of a nuclear power plant where a long-neglected cooling pipe burst and killed four workers last week said Monday that four other pipes at its reactors also went unchecked for years.',
   'TOKYO The operators of a Japanese nuclear plant say there was no evidence of danger at the plant before a deadly explosion this month.',
   'Reuters - No more Japanese nuclear reactors need to be closed for inspections, electric power companies said on Wednesday after submitting reports ordered by the government following a reactor accident that killed fou

The first, and obvious, result is the sentence itself - having a distance of exactly zero because it's exactly the same sentence.

The other four results, the ones most similar to the original sentence (those with the lowest distances), are also about nuclear power plants.

What about searching for something that's not from the database itself, as a real query, or even a typical search term such as "asian stock market"? Let's check it out:

In [None]:
query_sentence = 'asian stock market'
query_tokens = tokenize_batch([query_sentence])
query_embeddings = get_bag_of_embeddings(query_tokens)[0]
query_embeddings = query_embeddings.tolist()

In [None]:
collection.query(query_embeddings=query_embeddings, n_results=5)

{'ids': [['007389', '006925', '004791', '007014', '006829']],
 'distances': [[5.2573628425598145,
   5.490711688995361,
   5.643772125244141,
   5.697542190551758,
   5.810704231262207]],
 'metadatas': [[{'label': 2},
   {'label': 2},
   {'label': 0},
   {'label': 0},
   {'label': 2}]],
 'embeddings': None,
 'documents': [['Asian stocks rose after oil prices fell from a record on Friday, easing concern higher energy costs will damp consumer spending and corporate profits.',
   'Asian stocks advanced after oil prices fell from a record Friday in New York, easing concern higher energy costs will damp consumer spending and corporate profits.',
   "AP - Tokyo's main stock index ended lower Friday amid profit-taking of technology issues and concerns about soaring oil prices. The U.S. dollar was down against the Japanese yen.",
   'Japanese stocks rose after oil prices fell from a record in New York on Friday, easing concern higher energy costs will damp consumer spending and corporate profi

The values for the distances are much higher now, but the most similar results are related to the "asian stock market": from literally "asian stocks", to "Japanese stocks", and "Japan's Nikkei" (stock index).

As you can see, searching for similar documents is quite easy. ChromaDB reports the results as distances, where low values mean the documents are "close", that is, similar to each other. That's not the only metric you can use, though: it is also possible to use a similarity metric, such as cosine similarity, instead.

***
**ASIDE: Cosine Similarity**

If two vectors are pointing in the same direction, their cosine similarity is a perfect one. If they are orthogonal (that is, if there is a right angle between them), their cosine similarity is zero. If they are pointing in opposite directions, their cosine similarity is minus one.

$$
\Large
\cos \theta = \frac{\sum_i{x_iy_i}}{\sqrt{\sum_j{x_j^2}}\sqrt{\sum_j{y_j^2}}}
$$
***
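The three cases described in the aside can be checked with a direct translation of the formula into plain Python. The function name `cosine_similarity` and the toy vectors are chosen for illustration only.

```python
import math

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their norms."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi ** 2 for xi in x))
    norm_y = math.sqrt(sum(yi ** 2 for yi in y))
    return dot / (norm_x * norm_y)

print(cosine_similarity([1.0, 0.0], [3.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 2.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-2.0, 0.0]))  # opposite directions
```

Notice that the length of the vectors plays no role: only the angle between them matters.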


Let's use this similarity metric, implemented in PyTorch, to tackle a task that we've already discussed briefly.

## 15.8 Zero-Shot Text Classification
We have already seen an example of zero-shot image classification that was based on CLIP, OpenAI's model that bridges the gap between the worlds of computer vision and natural language processing. It accomplishes zero-shot classification by producing embeddings for both images and text, and comparing the image embeddings to the embeddings of each candidate label. The candidate label that has the embeddings that are most similar to those computed for the image being classified is deemed the right one.

Let's try the same approach to accomplish zero-shot text classification!

Our candidate labels are the classes from the AG News Dataset: "world", "sports", "business", and "science and technology". Let's compute their embeddings (or bag of embeddings, in the case of the last, three-word, label):
![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step5.png)

In [None]:
cand_labels = ["world", "sports", "business", "science and technology"]

cand_emb = torch.vstack([get_vecs_by_tokens(tokens).mean(axis=0) for tokens in tokenize_batch(cand_labels)])
cand_emb.shape

torch.Size([4, 50])

To compare two embeddings, we can use the typical cosine similarity metric. If two vectors are identical (except for their norm), their similarity is one. If they are orthogonal to each other, their similarity is zero.

We can use PyTorch's own nn.CosineSimilarity to easily compute, for example, a similarity matrix between our candidate labels:

In [None]:
cos = nn.CosineSimilarity(dim=2)

cos(cand_emb.unsqueeze(1), cand_emb.unsqueeze(0))

tensor([[1.0000, 0.6529, 0.6136, 0.6678],
        [0.6529, 1.0000, 0.6410, 0.6171],
        [0.6136, 0.6410, 1.0000, 0.8069],
        [0.6678, 0.6171, 0.8069, 1.0000]])

As expected, the main diagonal compares each label to itself, so the similarity is a perfect one. Notice the calls to each tensor's unsqueeze() method, required to make the N-to-N comparison.
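The role of the two unsqueeze() calls can be mimicked with nested loops: comparing a tensor of shape (N, 1, D) to one of shape (1, M, D) amounts to computing one similarity for every (row, column) pair. The sketch below uses plain Python and made-up two-dimensional vectors; the helper names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pairwise_cosine(rows, cols):
    """Return an N x M matrix where entry [i][j] compares rows[i] to cols[j],
    which is what broadcasting (N, 1, D) against (1, M, D) produces."""
    return [[cosine(r, c) for c in cols] for r in rows]

# Hypothetical 2-D embeddings: three "sentences" and two "labels"
sentences = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
labels = [[2.0, 0.0], [0.0, 3.0]]

matrix = pairwise_cosine(sentences, labels)
print(len(matrix), len(matrix[0]))  # 3 rows (sentences), 2 columns (labels)

# Index of the most similar label for each sentence
predicted = [row.index(max(row)) for row in matrix]
print(predicted)  # [0, 1, 0]
```

Each row holds the similarities between one sentence and every candidate label, so taking the position of the largest value in a row yields the most similar label.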

Now, let's do exactly the same using a mini-batch from our validation/test data loader, and comparing the computed embeddings to those of the candidate labels:

In [None]:
batch_size = 32
dataloader = DataLoader(dataset=datasets['test'], batch_size=batch_size, shuffle=False)

batch = next(iter(dataloader))
labels, sentences = batch['topic'], batch['news']
tokens = tokenize_batch(sentences)
embeddings = get_bag_of_embeddings(tokens)
similarities = cos(embeddings.unsqueeze(1), cand_emb.unsqueeze(0))
similarities

tensor([[0.6534, 0.5361, 0.7945, 0.7030],
        [0.7634, 0.6130, 0.7007, 0.7579],
        [0.6658, 0.5156, 0.7244, 0.8548],
        [0.7275, 0.5425, 0.6800, 0.7176],
        [0.7058, 0.5120, 0.7004, 0.7567],
        [0.7184, 0.5626, 0.7428, 0.7919],
        [0.7056, 0.5410, 0.7269, 0.8026],
        [0.6613, 0.5637, 0.7525, 0.8029],
        [0.5893, 0.4821, 0.6439, 0.6403],
        [0.7183, 0.5389, 0.7209, 0.7500],
        [0.6883, 0.6081, 0.8330, 0.8441],
        [0.6728, 0.6266, 0.8165, 0.8166],
        [0.7453, 0.6166, 0.8239, 0.8035],
        [0.7400, 0.4860, 0.6221, 0.7080],
        [0.6582, 0.4579, 0.6692, 0.7232],
        [0.6819, 0.4262, 0.6630, 0.7159],
        [0.6154, 0.4228, 0.5796, 0.6999],
        [0.7636, 0.5894, 0.7239, 0.8140],
        [0.6990, 0.4981, 0.6911, 0.8043],
        [0.6747, 0.4536, 0.7283, 0.6745],
        [0.7615, 0.5422, 0.7594, 0.7675],
        [0.6438, 0.5324, 0.7470, 0.7863],
        [0.5911, 0.4596, 0.7293, 0.8482],
        [0.7202, 0.5606, 0.7810, 0

The predicted class is, as already mentioned, the one whose candidate label embedding is the most similar to the embedding of the sentence being classified:

In [None]:
predicted_class = similarities.argmax(dim=1)
predicted_class

tensor([2, 0, 3, 0, 3, 3, 3, 3, 2, 3, 3, 3, 2, 0, 3, 3, 3, 3, 3, 2, 3, 3, 3, 2,
        2, 3, 0, 3, 3, 0, 3, 0])

Is it any good? Let's see how many of those match the actual labels:

In [None]:
(predicted_class == labels).float().mean()

tensor(0.5625)

Not bad for such a simple approach, right? But, does it hold for the whole dataset?

### 15.8.1 Evaluation

We can run the very same evaluation as in the lab, except for the fact that the "predictions" are not logits coming out of a model anymore, but similarities among embeddings:

In [None]:
import evaluate

metric1 = evaluate.load('precision', average=None)
metric2 = evaluate.load('recall', average=None)
metric3 = evaluate.load('accuracy')

In [None]:
for batch in dataloader:
    labels, sentences = batch['topic'], batch['news']
    tokens = tokenize_batch(sentences)
    embeddings = get_bag_of_embeddings(tokens)

    # predictions = model(embeddings)
    predictions = cos(embeddings.unsqueeze(1), cand_emb.unsqueeze(0))

    pred_class = predictions.argmax(dim=1).tolist()
    labels = labels.tolist()

    metric1.add_batch(references=labels, predictions=pred_class)
    metric2.add_batch(references=labels, predictions=pred_class)
    metric3.add_batch(references=labels, predictions=pred_class)

In [None]:
metric1.compute(average=None), metric2.compute(average=None), metric3.compute()

({'precision': array([0.33205619, 1.        , 0.67253045, 0.43290471])},
 {'recall': array([4.10526316e-01, 5.26315789e-04, 7.84736842e-01, 6.91052632e-01])},
 {'accuracy': 0.47171052631578947})

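Well, that's not so great. Apparently, the "sports" label is almost never the most similar one: its precision is a perfect one, but its recall is roughly 0.05%, which means that only one of the 1,900 "sports" sentences in the test set was classified as such.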
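The combination of a perfect precision and a near-zero recall for the second class ("sports") in the output above may look odd at first. The sketch below, using made-up labels and a hypothetical helper called `precision_recall`, shows how it comes about: if a class is predicted very rarely, but correctly whenever it is, its precision is high while its recall stays low.

```python
def precision_recall(references, predictions, label):
    """Precision and recall for a single class, computed from raw counts."""
    true_pos = sum(1 for r, p in zip(references, predictions) if r == label and p == label)
    predicted_pos = sum(1 for p in predictions if p == label)
    actual_pos = sum(1 for r in references if r == label)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall

# Hypothetical labels: class 1 is predicted only once, and correctly
references  = [1, 1, 1, 1, 0, 0, 2, 2]
predictions = [1, 0, 2, 2, 0, 0, 2, 2]

precision, recall = precision_recall(references, predictions, label=1)
print(precision, recall)  # 1.0 0.25
```

In the toy data, only one of the four examples of class 1 is found (recall of 0.25), but the single prediction made for that class is right (precision of 1.0).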

Perhaps if we had chosen different words/sentences as candidate labels, we could have achieved better results. Wouldn't it be nice to know which words in a sentence are the most meaningful words when it comes to classifying the sentence as belonging to a given topic or not?

In a way, we're interested in knowing what the model is paying attention to. The keyword here is "attention": it revolutionized the world of natural language processing, and fostered the development of contextual word embeddings, which are much better at, well, embedding the meaning of individual words and sentences.

The quality of the results you get, whether you're searching for similar documents or performing zero-shot text classification, depends on the quality of the embeddings you use. Better embeddings, better results. So, let's up our game and go for contextual embeddings!

Before moving on, however, we need to tackle one last topic here.

## 15.9 Chunking Strategies

#### Chunking Strategies: Overview
The AG News dataset contains very short pieces of text, usually a single sentence. Most of the time, however, pieces of text are much longer than that: paragraphs, pages, even full books. Even though language models are getting larger and able to take longer sequences, there's always a hard limit that forces a sequence to be truncated.

The good news is, we can always split the original text into chunks (if you go back to the figures in the "Vector Databases" section, you'll see it depicts chunks being transformed into embeddings). The bad news is, there is no right or wrong answer to how you should split the text into chunks. It depends on a series of factors such as the type of text you're dealing with (long reports or short tweets, for example), the model you're using to embed the text and the nature of your queries (more on that later), and limitations in size (models typically have a maximum input length as we've already seen).

For more details, check the "[Chunking Strategies for LLM Applications](https://www.pinecone.io/learn/chunking-strategies/)" blog post.

Having said that, it's possible to chunk your text using a fixed-length or a content-aware approach. Let's take a quick look at both, plus a custom alternative. We'll use a paragraph of text from a financial report, reproduced below, to illustrate them:

In [None]:
text = """
ITEM 1A. RISK FACTORS Our operations and financial results are subject to various risks and uncertainties, including those described below, that could adversely affect our business, financial condition, results of operations, cash flows, and the trading price of our common stock. STRATEGIC AND COMPETITIVE RISKS We face intense competition across all markets for our products and services, which may lead to lower revenue or operating margins.    Competition in the technology sector Our competitors range in size from diversified global companies with significant research and development resources to small, specialized firms whose narrower product lines may let them be more effective in deploying technical, marketing, and financial resources. Barriers to entry in many of our businesses are low and many of the areas in which we compete evolve rapidly with changing and disruptive technologies, shifting user needs, and frequent introductions of new products and services. Our ability to remain competitive depends on our success in making innovative products, devices, and services that appeal to businesses and consumers.    Competition among platform-based ecosystems An important element of our business model has been to create platform-based ecosystems on which many participants can build diverse solutions. A well-established ecosystem creates beneficial network effects among users, application developers, and the platform provider that can accelerate growth. Establishing significant scale in the marketplace is necessary to achieve and maintain attractive margins. We face significant competition from firms that provide competing platforms.
"""

### 15.9.1 Fixed-Length

Fixed-length approaches split the text into equal-length chunks (e.g. 300 characters/words/tokens) with or without some overlap between them. LangChain, a Python package that has grown a lot in popularity in the last few months, and that allows you to integrate different tools (e.g. embedding models, vector databases) into a workflow, also offers a convenient way of splitting text into fixed-length chunks:

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=20)
chunks = text_splitter.create_documents([text])
chunks[:3]

[Document(page_content='ITEM 1A. RISK FACTORS Our operations and financial results are subject to various risks and uncertainties, including those described below, that could adversely affect our business, financial condition, results of operations, cash flows, and the trading', metadata={}),
 Document(page_content='and the trading price of our common stock. STRATEGIC AND COMPETITIVE RISKS We face intense competition across all markets for our products and services, which may lead to lower revenue or operating margins.    Competition in the technology sector Our', metadata={}),
 Document(page_content='sector Our competitors range in size from diversified global companies with significant research and development resources to small, specialized firms whose narrower product lines may let them be more effective in deploying technical, marketing, and', metadata={})]
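The basic idea of fixed-length chunking with overlap can be sketched in a few lines of plain Python. This is not how LangChain's splitter is implemented (its chunks above end on word boundaries, while the sketch below cuts at exact character positions); the function name `fixed_length_chunks` and its parameters are assumptions made for illustration.

```python
def fixed_length_chunks(text, chunk_size=40, overlap=10):
    """Slice text into chunks of at most chunk_size characters, where each
    chunk starts (chunk_size - overlap) characters after the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

sample = "We face intense competition across all markets for our products and services."
chunks = fixed_length_chunks(sample, chunk_size=40, overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

The last 10 characters of each chunk are repeated at the start of the next one, and words are happily cut in half, which illustrates why splitters that respect separators are usually preferred.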

### 15.9.2 Content-Aware

The problem with fixed-length chunking is that the chunks will almost certainly end mid-sentence. The overlap may help mitigate this issue, but it won't work for long sentences. That's where the content-aware approach comes in: we can split the text by sentences or paragraphs using indications in its structure. We could, for example, naively split the text into sentences using the period (.) as an indication of the end of a sentence. But what about exclamation and question marks?
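A naive splitter along those lines might look like the sketch below, which uses a regular expression to break the text after a period, an exclamation mark, or a question mark followed by whitespace. The function name `naive_sentences` and the sample text are made up for illustration.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace.
    Deliberately simplistic: it knows nothing about abbreviations or headings."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

sample = "ITEM 1A. RISK FACTORS We face intense competition. Is that a risk? Yes!"
sentences = naive_sentences(sample)
print(sentences)
```

It handles the three punctuation marks, but it also breaks right after "ITEM 1A.", and it would do the same after abbreviations such as "e.g." whenever they are followed by a space.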

Fortunately, sentence tokenizing is a very well-known problem, and the [traditional Natural Language Toolkit (NLTK)](https://www.nltk.org/) package has a sentence tokenizer available. We only need to download the punkt tokenizer data (the code below also fetches punkt_tab, which newer NLTK releases look for) and it will be ready to be used:

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /home/dvgodoy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
from nltk.tokenize import sent_tokenize

chunks = sent_tokenize(text)
chunks[:3]

['\nITEM 1A.',
 'RISK FACTORS Our operations and financial results are subject to various risks and uncertainties, including those described below, that could adversely affect our business, financial condition, results of operations, cash flows, and the trading price of our common stock.',
 'STRATEGIC AND COMPETITIVE RISKS We face intense competition across all markets for our products and services, which may lead to lower revenue or operating margins.']

As you can see, each element in the list of documents is a single sentence.

### 15.9.3 Custom

Sometimes, as in the case of our example, there's some other indication of the text's structure: it looks like paragraphs are separated by a sequence of two or more spaces. Let's try it out:

In [None]:
chunks = text.split('  ')
chunks[:3]

['\nITEM 1A. RISK FACTORS Our operations and financial results are subject to various risks and uncertainties, including those described below, that could adversely affect our business, financial condition, results of operations, cash flows, and the trading price of our common stock. STRATEGIC AND COMPETITIVE RISKS We face intense competition across all markets for our products and services, which may lead to lower revenue or operating margins.',
 '',
 'Competition in the technology sector Our competitors range in size from diversified global companies with significant research and development resources to small, specialized firms whose narrower product lines may let them be more effective in deploying technical, marketing, and financial resources. Barriers to entry in many of our businesses are low and many of the areas in which we compete evolve rapidly with changing and disruptive technologies, shifting user needs, and frequent introductions of new products and services. Our ability

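Looks good, these are definitely paragraphs (the empty string in between is just a by-product of splitting a run of four spaces two spaces at a time, and it can be filtered out). Unfortunately, this is a strategy that's specific to this particular document only.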
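A slightly more robust variation, sketched below, treats any run of two or more spaces as a single separator and drops empty chunks. The function name `split_on_space_runs` and the sample text are hypothetical, and the approach remains tied to documents that mark paragraphs this way.

```python
import re

def split_on_space_runs(text, min_spaces=2):
    """Split wherever min_spaces or more consecutive spaces occur,
    dropping empty chunks and surrounding whitespace."""
    pattern = r" {%d,}" % min_spaces
    return [chunk.strip() for chunk in re.split(pattern, text) if chunk.strip()]

sample = "First paragraph. Still first.    Second paragraph.    Third one."
paragraphs = split_on_space_runs(sample)
print(paragraphs)
```

Single spaces between words and sentences are left alone, so each run of four spaces in the sample produces exactly one split.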