### Chapter 5. Introduction to Natural Language Processing

In [1]:
import torch
sentences = [
    'Today is a sunny day',
    'Today is a rainy day'
]
# Tokenization function
def tokenize(text):
    return text.lower().split()

# Build the vocabulary
def build_vocab(sentences):
    vocab={}
    for sentence in sentences:
        tokens = tokenize(sentence)
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)+1
    return vocab

# create the vocabulary index
vocab = build_vocab(sentences)
print("vocabulary Index:",vocab)

vocabulary Index: {'today': 1, 'is': 2, 'a': 3, 'sunny': 4, 'day': 5, 'rainy': 6}


In [4]:
# !pip install transformers

In [8]:
from transformers import BertTokenizerFast
# Initialize the tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
# Tokenize the sentences and encode them
encoded_inputs = tokenizer(sentences,padding=True,truncation=True,return_tensors='pt')
# To see the tokens for each input (helpful for understanding the output)
tokens = [tokenizer.convert_ids_to_tokens(ids) 
           for ids in encoded_inputs["input_ids"]]
 
# To get the word index similar to Keras' tokenizer
word_index = tokenizer.get_vocab()
 
print("Tokens:", tokens)
print("Token IDs:", encoded_inputs['input_ids'])
# Show only the first 10 vocabulary entries for brevity
print("Word Index:", dict(list(word_index.items())[:10]))
 

Tokens: [['[CLS]', 'today', 'is', 'a', 'sunny', 'day', '[SEP]'], ['[CLS]', 'today', 'is', 'a', 'rainy', 'day', '[SEP]']]
Token IDs: tensor([[  101,  2651,  2003,  1037, 11559,  2154,   102],
        [  101,  2651,  2003,  1037, 16373,  2154,   102]])
Word Index: {'##eb': 15878, 'nearest': 7205, 'prequel': 28280, 'unauthorized': 24641, '##ʿ': 29714, 'vegetation': 10072, '##nsen': 29428, 'malabar': 28785, 'terra': 14403, 'ョ': 1730}


BERT stands for Bidirectional Encoder Representations from Transformers. The `BertTokenizerFast` used above is loaded from the pretrained `bert-base-uncased` checkpoint.


Now, you may be wondering what [CLS] and [SEP] are. The BERT model has been trained to expect sentences to begin with [CLS] (for classifier) and to end with, or be separated by, [SEP] (for separator). These two special tokens are tokenized to the values 101 and 102, respectively, so when you print out the token values for your sentences, each row begins with 101 and ends with 102.

Either way, once you have the words in your sentences tokenized, the next step is to convert each sentence into a list of numbers, where each number is the value stored in the vocabulary under that word as the key. This process is called sequencing.
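Sequencing with the hand-built vocabulary from the first cell can be sketched as follows. This is a minimal illustration, not a library API: `texts_to_sequences` is a hypothetical helper name, and mapping unknown words to `0` is an assumption (index 0 is free because the vocabulary starts counting at 1).

```python
def tokenize(text):
    # Same whitespace tokenizer as in the first cell
    return text.lower().split()

# Vocabulary as printed by build_vocab() above
vocab = {'today': 1, 'is': 2, 'a': 3, 'sunny': 4, 'day': 5, 'rainy': 6}

def texts_to_sequences(sentences, vocab, oov_id=0):
    # Hypothetical helper: look up each token's index in the vocabulary.
    # Words missing from the vocabulary map to oov_id (an assumption here).
    return [[vocab.get(token, oov_id) for token in tokenize(sentence)]
            for sentence in sentences]

sequences = texts_to_sequences(
    ['Today is a sunny day', 'Today is a rainy day', 'Today is a snowy day'],
    vocab,
)
print(sequences)
# [[1, 2, 3, 4, 5], [1, 2, 3, 6, 5], [1, 2, 3, 0, 5]]
```

The third sentence contains "snowy", which is not in the vocabulary, so it is encoded as the placeholder `0` rather than raising a `KeyError`.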

### Stripping HTML Tags
Use the BeautifulSoup library to remove HTML tags. Here's an example:

```
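from bs4 import BeautifulSoup
# Example input (hypothetical); any string containing HTML works
sentence = "<p>Today is a <b>sunny</b> day</p>"
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(sentence, "html.parser")
sentence = soup.get_text()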
```
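If installing BeautifulSoup is not an option, a rough dependency-free alternative can be sketched with Python's standard-library `html.parser` module. This is a different technique from the BeautifulSoup call, not a drop-in equivalent, and `TextExtractor` and `strip_tags` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML string, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text found between tags
        self.parts.append(data)

def strip_tags(html):
    # Hypothetical helper: feed the markup through the parser and
    # join the collected text pieces back together
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

cleaned = strip_tags("<p>Today is a <b>sunny</b> day</p>")
print(cleaned.lower().split())
```

Stripping the tags before tokenizing keeps markup such as `<b>` from ending up as part of a word in the vocabulary.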