In [None]:
%%capture
!pip install -Uqqq spacy datasets tokenizers plotly
!python -m spacy download en_core_web_sm

Introduction & Setup

Natural language processing (NLP) is the field of using computers to understand written language. NLP is one of the fastest growing fields in all of machine learning, and there are consistently exciting developments in the field.

You may have worked with techniques like sklearn's CountVectorizer or TfidfVectorizer to turn raw text into features where each document/item is a row, and each column represents a different word. However, deep learning represents text a bit differently. In deep learning, text is represented as a sequence of words. This allows each word to be understood by the model in the context of the other words it appears with in the text.

In this module, we will introduce some core concepts required to work with natural language and model it using deep learning. In this lecture we will cover:

Classical NLP techniques - sparse representations of text
Text preprocessing - cleaning and normalizing text
Tokenization & numericalization - representing text as a sequence of words (or tokens), and turning those tokens into integers
These concepts will give us a foundation for how to think about language data when we begin to use it to perform tasks like text classification and text generation.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import torch
from torch.utils.data import DataLoader
import pandas as pd
import spacy
from tokenizers.models import WordLevel, BPE, WordPiece
import tokenizers
from datasets import load_dataset
import matplotlib.pyplot as plt
import plotly.express as px
import warnings

In [None]:
warnings.filterwarnings('ignore')

Later on in this module, we'll be using the train split of the emotion dataset to illustrate data preparation for NLP with deep learning. From the dataset's documentation:

Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise.

In the code block below, we load the dataset and create a lookup for the class names.

In [None]:
dataset = load_dataset("SetFit/emotion", split='train')
class_names = 'anger fear joy love sadness surprise'.split()
class_lookup = {i:c for i, c in enumerate(class_names)}
class_lookup

{0: 'anger', 1: 'fear', 2: 'joy', 3: 'love', 4: 'sadness', 5: 'surprise'}

In [None]:
sentence1 = 'The quick brown fox jumped over the lazy dog'
sentence2 = 'Deep learning is fun!'
sentence3 = 'deep learning is hard.'

sentences = [sentence1, sentence2, sentence3]

In [None]:
cv = CountVectorizer()
sentences_cv = cv.fit_transform(sentences)
sentences_cv = pd.DataFrame(sentences_cv.toarray(), columns=cv.get_feature_names_out())

sentences_cv

Unnamed: 0,brown,deep,dog,fox,fun,hard,is,jumped,lazy,learning,over,quick,the
0,1,0,1,1,0,0,0,1,1,0,1,1,2
1,0,1,0,0,1,0,1,0,0,1,0,0,0
2,0,1,0,0,0,1,1,0,0,1,0,0,0
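
To make the limitation of this sparse representation concrete, here is a minimal pure-Python sketch that is not part of the original notebook. The helper name bag_of_words and the regular expression (a rough approximation of CountVectorizer's default tokenization) are assumptions for illustration. It shows that two sentences with different meanings can produce identical counts, while a sequence of words keeps them distinct.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count lower-cased words of 2+ characters (hypothetical helper)."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))


first = "the dog chased the fox"
second = "the fox chased the dog"

a = bag_of_words(first)
b = bag_of_words(second)

# Word order is lost: both sentences produce identical counts.
same_counts = a == b

# A sequence representation keeps the order, so the sentences differ.
ordered_a = first.split()
ordered_b = second.split()
same_sequence = ordered_a == ordered_b

print(same_counts, same_sequence)
```

This is the motivation for representing text as a sequence of tokens in the rest of the lecture.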


Text Normalization

If you've taken the DSML course, you've probably learned a bit about cleaning data. Generally, cleaning data involves steps like filling or imputing missing values, checking for values that don't make sense, or removing outliers. However, text data can have different data quality problems that require a different set of skills and tools to address.

As an example, let's consider the word "naive". For the most part, the word is spelt "naive." However, some writers may choose to use the alternate "naïve", which contains an accented character (diaeresis). One way to reduce the number of words we need to remember in our vocabulary would be to replace accented characters with non-accented characters.

Another example is lower casing. One hyperparameter affecting the size of a model is the vocabulary size, or the number of tokens or words we will be able to recognize. We want to make sure we can recognize very common words, but we may not care about extremely uncommon words. One way to reduce the number of words we have to remember is lower casing. If we don't lower case, for example, we will have to store both "The" and "the" in our vocabulary, whereas there will only be one "the" in our vocabulary if we perform lower casing.
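
As a quick illustration of the effect of lower casing on vocabulary size, the sketch below counts unique whitespace-separated words in a tiny made-up corpus, with and without lower casing. The corpus and variable names are assumptions for illustration only.

```python
# A tiny, made-up corpus for illustration.
docs = ["The cat sat", "the cat ran", "THE dog sat"]

# Vocabulary when case is preserved: "The", "the" and "THE" are all distinct.
cased_vocab = {word for doc in docs for word in doc.split()}

# Vocabulary after lower casing: the three variants collapse into one entry.
lowered_vocab = {word for doc in docs for word in doc.lower().split()}

print(len(cased_vocab), sorted(cased_vocab))
print(len(lowered_vocab), sorted(lowered_vocab))
```

On a real corpus the saving is usually smaller in relative terms, but it applies to every word that appears at the start of a sentence or in all caps.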

In this lesson, we'll use some tools in tokenizers to perform some text cleaning steps.

In [None]:
from tokenizers import normalizers

Similar to image augmentations, there is a compose-like feature called Sequence that composes all of our normalization steps. In the case below, we are using three steps - one that normalizes Unicode (see here), one that lower cases our text, and a third that strips accents.

In [None]:
normalizer = normalizers.Sequence([
    normalizers.NFD(),
    normalizers.Lowercase(),
    normalizers.StripAccents()
])

Let's see how our normalizer works on the string 'Höw aRę ŸõŪ dÔįñg?'. After normalization, every character should be lower case with its accent removed.

In [None]:
normalizer.normalize_str('Höw aRę ŸõŪ dÔįñg?')

'how are you doing?'
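
To see roughly what these three steps do under the hood, here is a sketch using only Python's standard unicodedata module. It is an approximation for illustration, not the tokenizers implementation: the function name is hypothetical, and it assumes that "stripping accents" means dropping nonspacing combining marks (Unicode category Mn) after NFD decomposition.

```python
import unicodedata


def normalize_like_sequence(text):
    """Approximate NFD -> Lowercase -> StripAccents (hypothetical helper)."""
    # NFD splits an accented character into a base character
    # followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    lowered = decomposed.lower()
    # Drop the combining marks, keeping only the base characters.
    return "".join(ch for ch in lowered if unicodedata.category(ch) != "Mn")


print(normalize_like_sequence("naïve"))
print(normalize_like_sequence("Höw aRę ŸõŪ dÔįñg?"))
```

The order matters: the accents can only be removed as separate characters once NFD has decomposed them.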

Let's use our normalizer to normalize each string in sentences. Although we're using the normalizer directly now, we will bundle it with other preprocessing steps in a later lesson.

In [None]:
normalized_senences = [normalizer.normalize_str(s) for s in sentences]
normalized_senences

['the quick brown fox jumped over the lazy dog',
 'deep learning is fun!',
 'deep learning is hard.']

If you have extremely dirty text, you may need some help from outside tools. The tools provided in gensim, documented here, have been useful in the past. They can be especially useful if you have lots of unicode characters, HTML tags, etc. in your text.

NOTE: In the past, you may have come across concepts like removing stop words and stemming/lemmatization. These techniques were more popular when a large vocabulary would be very computationally expensive. While this is something you may want to be sensitive to, today they are generally not used in deep learning. Currently, tokenizers for extremely powerful models like BERT and GPT have vocabularies on the order of 30-60K tokens. We will learn a little bit about how this works in a later lesson.

In this lesson, we learned about the first step in preparing text data for neural networks - text normalization. For additional information and more documentation on text normalization, please see the tokenizers documentation.

Text Splitting

In the previous lesson, we introduced text normalization. The next step in the process is to split text into chunks. Each chunk is called a token. The easiest way to think of tokens is that each token is a word.

You may be familiar with the str.split() method. If you've used it before, you have some idea of what this looks like. Let's examine what our normalized_sentences look like after being split using this method.

In the tokenizers library, splitting is handled by the pre_tokenizers module, which we will look at after trying str.split().

In [None]:
for s in normalized_senences:
    print(s.split())

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
['deep', 'learning', 'is', 'fun!']
['deep', 'learning', 'is', 'hard.']


This does a decent job, but there are some issues. We see that the words 'fun!' and 'hard.' are not split from the punctuation. This means that in our vocabulary, the word 'fun!' would be a different entry from 'fun', 'fun,', and 'fun?', assuming all of those items appeared in our corpus.

The tokenizers library has some methods that handle splitting up texts into items in different ways. Let's take a look at the pre_tokenizers module.

In [None]:
from tokenizers import pre_tokenizers

First, let's build a pre_tokenizer using the Whitespace class. Despite the name, and unlike str.split(), this pre-tokenizer splits on both whitespace and punctuation.

In [None]:
pre_tokenizer = pre_tokenizers.Whitespace()
# split our normalized_sentences
split_sentences = [pre_tokenizer.pre_tokenize_str(s) for s in normalized_senences]
split_sentences

[[('the', (0, 3)),
  ('quick', (4, 9)),
  ('brown', (10, 15)),
  ('fox', (16, 19)),
  ('jumped', (20, 26)),
  ('over', (27, 31)),
  ('the', (32, 35)),
  ('lazy', (36, 40)),
  ('dog', (41, 44))],
 [('deep', (0, 4)),
  ('learning', (5, 13)),
  ('is', (14, 16)),
  ('fun', (17, 20)),
  ('!', (20, 21))],
 [('deep', (0, 4)),
  ('learning', (5, 13)),
  ('is', (14, 16)),
  ('hard', (17, 21)),
  ('.', (21, 22))]]

We can see that the pre_tokenizer also returns the start and end character indices of each token in the document. This behavior is specific to the tokenizers library.
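
For intuition, the same splitting behavior can be approximated with a regular expression from the standard library. The sketch below is an illustration rather than the library's code: the function name is hypothetical, and the pattern (runs of word characters, or runs of characters that are neither word characters nor whitespace) is assumed to mirror what the Whitespace pre-tokenizer does.

```python
import re


def whitespace_like_split(text):
    """Return (token, (start, end)) pairs, similar in shape to pre_tokenize_str."""
    pattern = r"\w+|[^\w\s]+"
    return [(m.group(), m.span()) for m in re.finditer(pattern, text)]


pieces = whitespace_like_split("deep learning is fun!")
for token, (start, end) in pieces:
    print(token, start, end)
```

Because each token carries its character offsets, it can always be traced back to the exact span of the original string it came from.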

In this lesson, we learned about an efficient way to split text up into a sequence of words. These are the sequences we will model using deep learning.

Tokenization

In the past few lessons, we normalized and split our text using some nice tools from tokenizers. In this lesson, we will build and train a tokenizer that we can apply to a dataset. We'll also examine the key components of the tokenizer like its vocabulary.

What happens when you train a tokenizer? The role of a tokenizer is to convert text into numbers for input into an ML model. The tokenizer has a vocabulary, which is all the tokens it can recognize. One hyperparameter of the tokenizer is the vocab size, which is an upper limit on the number of individual tokens it will be able to recognize.

Since it's impossible to capture every imaginable word in a limited vocabulary, we need a way to handle out-of-sample tokens. To address this, we can add a special token that represents unknown words, which will have its own representation in the model. Today, we'll use the special token '[UNK]'.

Another special token we will want to add is a padding token. Not all of our documents are the same length, but we still need to pass our tokenized documents to our models as tensors, which can't have sequences of variable length in them. To address this, we'll use a padding token. There are multiple ways to pad, but in general there is a maximum sequence length that we will use, and any document shorter than that length will be extended with the padding token. In this case, we'll use '[PAD]' for our padding token.
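
Before building the real tokenizer, here is a small pure-Python sketch of how the two special tokens behave. The toy vocabulary, the encode helper, and the max_len value are all made up for illustration: unknown words fall back to the '[UNK]' id, and short sequences are extended with the '[PAD]' id up to a fixed length.

```python
PAD_TOKEN = "[PAD]"
UNK_TOKEN = "[UNK]"

# A toy vocabulary mapping tokens to integer ids (illustrative only).
toy_vocab = {PAD_TOKEN: 0, UNK_TOKEN: 1, "i": 2, "feel": 3, "happy": 4}


def encode(tokens, vocab, max_len):
    """Map tokens to ids, then truncate or pad on the right to max_len."""
    # Words missing from the vocabulary get the id of the unknown token.
    ids = [vocab.get(tok, vocab[UNK_TOKEN]) for tok in tokens]
    ids = ids[:max_len]
    return ids + [vocab[PAD_TOKEN]] * (max_len - len(ids))


known = encode(["i", "feel", "happy"], toy_vocab, max_len=5)
unknown = encode(["i", "feel", "ecstatic", "today"], toy_vocab, max_len=5)

print(known)
print(unknown)
```

Both results have the same length, so they can be stacked into a single tensor.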

Let's start by instantiating our tokenizer with the unknown token.

In [None]:
UNK_TOKEN = '[UNK]'
PAD_TOKEN = '[PAD]'

tokenizer = tokenizers.Tokenizer(model=tokenizers.models.WordLevel(unk_token=UNK_TOKEN))

Next, we can tell the tokenizer how to normalize and preprocess each document.

In [None]:
tokenizer.normalizer = normalizer
tokenizer.pre_tokenizer = pre_tokenizer

In the cell below, we instantiate our trainer. The trainer governs how the tokenizer learns the vocabulary. In this case, we are considering words as tokens, with a max vocabulary size of 30,000 tokens, a minimum token frequency of 10, and with the unknown and padding special tokens.

In [None]:
trainer = tokenizers.trainers.WordLevelTrainer(vocab_size=30000, min_frequency=10, show_progress=True, special_tokens=[PAD_TOKEN, UNK_TOKEN])

Now we can train our tokenizer. To do this with an in-memory dataset, we need a generator that yields the texts we want to use to build the vocabulary.

In [None]:
def document_iterator(ds):
    for item in ds:
        yield item['text']

one_document = next(document_iterator(dataset))
one_document

'i didnt feel humiliated'

Finally, we perform the training step. In the cell below, we pass the document iterator and trainer to the tokenizer.train_from_iterator method.

In [None]:
tokenizer.train_from_iterator(document_iterator(dataset), trainer)

print(f"""
Our tokenizer contains {tokenizer.get_vocab_size()} unique tokens.
""")


Our tokenizer contains 2119 unique tokens.



Let's print out the first few items in our vocabulary.

In [None]:
for i in range(5):
    print(f'ID: {i}, token: {tokenizer.id_to_token(i)}')

ID: 0, token: [PAD]
ID: 1, token: [UNK]
ID: 2, token: i
ID: 3, token: feel
ID: 4, token: and


Let's take a look at one encoded item from our dataset. We can see that each word is a token that corresponds to a token ID from the vocabulary.

In [None]:
encoded = tokenizer.encode(one_document)
pd.DataFrame(zip(encoded.tokens, encoded.ids), columns=['token', 'id']).T

Unnamed: 0,0,1,2,3
token,i,didnt,feel,humiliated
id,2,139,3,686


Using the tokenizer in the training loop

In the last lesson, we trained a tokenizer and examined the vocabulary. In this lesson, we'll demonstrate one way to use the tokenizer during a training loop.

The first thing we'll want to do is enable padding. This ensures that the items in our batch can be put into a tensor together. We do this using the enable_padding method.

In [None]:
tokenizer.enable_padding(pad_id=tokenizer.token_to_id(PAD_TOKEN), pad_token=PAD_TOKEN)

Next, we want to create a collate function for our DataLoader. In this instance, a batch will be a list of dictionaries, each with the keys ['label', 'text']. In the cell block below, we use the encode_batch method to encode all our texts. This applies padding so everything in the batch comes out the same length. We also extract the ids from each item in the batch. Finally, we return a tensor of the ids and a tensor of the labels.

In [None]:
def collate_fn(batch):
    texts = [i['text'] for i in batch]
    encoded = tokenizer.encode_batch(texts)
    ids = [t.ids for t in encoded]
    labels = [i['label'] for i in batch]
    return torch.tensor(ids), torch.tensor(labels)

Let's instantiate our DataLoader and look at a batch.

In [None]:
dl = DataLoader(dataset, batch_size=8, collate_fn=collate_fn)
encoded_texts, labels = next(iter(dl))

Below are our encoded texts. We can see that for shorter sequences, there are many 0 values on the right. We can also see that the longest sequence in the batch has no padding.

In [None]:
encoded_texts

tensor([[   2,  139,    3,  686,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0],
        [   2,   40,  101,   60,    8,   15,  497,    5,   15,    1,  557,   32,
           60,   61,  128,  148,   77, 1487,    4,   22, 1256,    0,    0],
        [  17,    1,    7, 1165,    5,  288,    2,    3,  496,  448,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0],
        [   2,   24,  165,    8,  671,   27,    6,    1,    2,   59,   48,    9,
           13,   22,   72,   30,    6,    1,    0,    0,    0,    0,    0],
        [   2,   24,    8, 1075,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0],
        [  73,   47,    8,    7,   56,  522,  321,  335,  160,  161,    9,   20,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0],
        [  73,   47,  332,   35,    1,   35,  196,    1,  

Another way to use our tokenizer might be to tokenize directly in the training loop. Let's mock up a training loop in the code below.

In [None]:
# pass the dataset directly to the dataloader without a collate_fn
dl = DataLoader(dataset, batch_size=8)

# tokenize each batch after it's loaded in the training loop
for batch in dl:
    encoded = tokenizer.encode_batch(batch['text'])
    input_ids = torch.tensor([document.ids for document in encoded])
    labels = batch['label']
    # We'll also write the rest of the steps,
    # although we're not actually training a model at the moment.
    # logits = model(input_ids)
    # loss = loss_fn(logits, labels)
    # loss.backward()
    # opt.step()
    # opt.zero_grad()
    break

In [None]:
input_ids

tensor([[   2,  139,    3,  686,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0],
        [   2,   40,  101,   60,    8,   15,  497,    5,   15,    1,  557,   32,
           60,   61,  128,  148,   77, 1487,    4,   22, 1256,    0,    0],
        [  17,    1,    7, 1165,    5,  288,    2,    3,  496,  448,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0],
        [   2,   24,  165,    8,  671,   27,    6,    1,    2,   59,   48,    9,
           13,   22,   72,   30,    6,    1,    0,    0,    0,    0,    0],
        [   2,   24,    8, 1075,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0],
        [  73,   47,    8,    7,   56,  522,  321,  335,  160,  161,    9,   20,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0],
        [  73,   47,  332,   35,    1,   35,  196,    1,  

Besides these two examples, there is a third option to tokenize the entire dataset using the same sequence length. To determine how many tokens to consider as you tokenize, you may want to consider the longest document. While this is a viable strategy, you are also basing the size of the input tensors on an outlier (the longest document). If this is a lot longer than the median document length, you could be taking up a lot of additional memory for your model.
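
One common alternative to using the longest document is to pick the sequence length from a percentile of the document lengths and truncate anything longer. The sketch below illustrates the idea in pure Python. The lengths, the 90th-percentile choice, and both helper names are assumptions for illustration, not recommendations from this lesson.

```python
import math


def percentile_length(lengths, pct):
    """Nearest-rank percentile of a list of lengths (hypothetical helper)."""
    ordered = sorted(lengths)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


def pad_or_truncate(ids, max_len, pad_id=0):
    """Cut sequences longer than max_len and right-pad shorter ones."""
    return ids[:max_len] + [pad_id] * max(max_len - len(ids), 0)


# Made-up token counts per document, with one very long outlier.
token_lengths = [4, 5, 6, 6, 7, 8, 9, 10, 12, 60]

longest = max(token_lengths)
max_len = percentile_length(token_lengths, 90)
print(longest, max_len)

short_doc = pad_or_truncate([2, 3, 4], max_len)
long_doc = pad_or_truncate(list(range(2, 62)), max_len)
print(len(short_doc), len(long_doc))
```

The trade-off is that the few documents longer than the chosen length lose their final tokens, in exchange for much smaller input tensors.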

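Now that we've encoded a batch, let's see how to use our tokenizer to decode from token IDs back into text. Note that the decoded strings are not always identical to the originals: by default, special tokens such as '[PAD]' and '[UNK]' are skipped, so any word that was mapped to '[UNK]' is missing from the output.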

In [None]:
tokenizer.decode_batch(encoded_texts.numpy())

['i didnt feel humiliated',
 'i can go from feeling so hopeless to so hopeful just from being around someone who cares and is awake',
 'im a minute to post i feel greedy wrong',
 'i am ever feeling nostalgic about the i will know that it is still on the',
 'i am feeling grouchy',
 'ive been feeling a little burdened lately wasnt sure why that was',
 'ive been taking or or times amount and ive asleep a lot faster but i also feel like so funny',
 'i feel as confused about life as a teenager or as jaded as a year old man']
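
Some of the decoded strings above are missing words. The sketch below, which is not part of the original notebook, uses a tiny hand-written vocabulary to show why: decoding skips special tokens by default, and passing skip_special_tokens=False makes the '[PAD]' and '[UNK]' tokens visible. The toy vocabulary and example sentences are assumptions for illustration.

```python
from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import WordLevel

# A tiny hand-written vocabulary instead of a trained one (illustrative only).
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "i": 2, "feel": 3, "fine": 4}

toy = Tokenizer(WordLevel(vocab=toy_vocab, unk_token="[UNK]"))
toy.pre_tokenizer = pre_tokenizers.Whitespace()
# Register the special tokens so that decoding can recognize and skip them.
toy.add_special_tokens(["[PAD]", "[UNK]"])
toy.enable_padding(pad_id=0, pad_token="[PAD]")

batch = toy.encode_batch(["i feel fine", "i feel groovy today"])
ids = [encoding.ids for encoding in batch]
print(ids)

# Default behavior: padding and unknown tokens are dropped.
skipped = toy.decode_batch(ids)
print(skipped)

# Keeping special tokens shows where words were lost.
kept = toy.decode_batch(ids, skip_special_tokens=False)
print(kept)
```

Decoding with special tokens kept is a quick way to check how much of a document fell outside the vocabulary.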

Now that we've fully encoded text into tensors, we've covered all the preprocessing required before we train a model.

In this lesson, we trained our tokenizer and used it a few different ways in a mock training loop. When our data's prepared, each document is represented by a sequence of integers, where each integer represents some item in a vocabulary of tokens. We also used our tokenizer to decode an encoded batch of