<a href="https://colab.research.google.com/github/vektor8891/llm/blob/main/projects/12_bert_preparing/12_bert_preparing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Loading and Text Processing for BERT

## Tokenization and vocabulary building

### Tokenization

Components:

- `tokenizer`: tokenizes text
- `yield_tokens`: generator function that iterates through the data
- `word_dict`: defines the special tokens used in text processing: padding `[PAD]`, classification `[CLS]`, separator `[SEP]`, mask `[MASK]`, and unknown `[UNK]`
- `text_to_index`: converts text into a list of numerical indices based on the vocabulary
- `index_to_en`: translates a sequence of indices back into readable English text
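
To make the round trip between text and indices concrete, here is a minimal sketch using a hand-written toy vocabulary. The names `toy_vocab`, `toy_itos`, and `simple_tokenize` are assumptions for illustration only; the notebook's own `text_to_index` and `index_to_en` rely on the vocabulary built from the dataset and on the `basic_english` tokenizer.

```python
# Toy vocabulary for illustration only; the real one is built from the dataset.
special_symbols = ['[PAD]', '[CLS]', '[SEP]', '[MASK]', '[UNK]']
toy_vocab = {tok: idx for idx, tok in
             enumerate(special_symbols + ['the', 'movie', 'was', 'great'])}
toy_itos = {idx: tok for tok, idx in toy_vocab.items()}
UNK_IDX = toy_vocab['[UNK]']


def simple_tokenize(text):
    # Stand-in for the notebook's tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def text_to_index(text):
    # Words missing from the vocabulary fall back to the [UNK] index.
    return [toy_vocab.get(tok, UNK_IDX) for tok in simple_tokenize(text)]


def index_to_en(indices):
    # Map each index back to its token and join with spaces.
    return ' '.join(toy_itos[i] for i in indices)


ids = text_to_index("The movie was fantastic")
print(ids)               # [5, 6, 7, 4]
print(index_to_en(ids))  # the movie was [UNK]
```

Because "fantastic" is not in the toy vocabulary, it maps to `[UNK]`, so the round trip is lossy for out-of-vocabulary words.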

Special tokens:

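- **`[CLS]` (Classification Token)**: placed at the start of every input sequence, where it acts as a Start of Sentence (SOS) marker; its final hidden state is typically used for classification tasks
- **`[SEP]` (Separator Token)**: End of Sentence (EOS) marker that also separates the two sentences of a pair
- **`[PAD]` (Padding Token)**: added to sequences to ensure all inputs are of equal length
- **`[MASK]` (Masked Token)**: used in masked language modeling, where it replaces the tokens the model must predict
- **`[UNK]` (Unknown Token)**: placeholder for words that are not in the vocabulary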
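
The roles of these tokens can be sketched with plain Python lists. The helper `build_bert_input` and the example indices below are hypothetical, not part of the notebook; only the special-token indices match the ones the notebook defines.

```python
# Special-token indices as defined in the notebook.
PAD_IDX, CLS_IDX, SEP_IDX, MASK_IDX, UNK_IDX = 0, 1, 2, 3, 4


def build_bert_input(ids_a, ids_b, max_len):
    """Hypothetical helper: wrap two index sequences with special tokens and pad."""
    # Layout: [CLS] sentence A [SEP] sentence B [SEP]
    input_ids = [CLS_IDX] + ids_a + [SEP_IDX] + ids_b + [SEP_IDX]
    if len(input_ids) > max_len:
        raise ValueError("sequence is longer than max_len")
    # Right-pad with [PAD] so every input has the same length.
    return input_ids + [PAD_IDX] * (max_len - len(input_ids))


# Made-up word indices (5 and above) standing in for two short sentences.
input_ids = build_bert_input([5, 6], [7, 8, 9], max_len=10)
print(input_ids)  # [1, 5, 6, 2, 7, 8, 9, 2, 0, 0]

# Masked language modeling: hide one word behind [MASK] and keep its label.
masked_ids = list(input_ids)
masked_position = 2
label = masked_ids[masked_position]
masked_ids[masked_position] = MASK_IDX
print(masked_ids, label)  # [1, 5, 3, 2, 7, 8, 9, 2, 0, 0] 6
```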

In [4]:
!pip install torchtext==0.17.2
!pip install portalocker==2.8.2
!pip install transformers==4.35.2
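# torchtext 0.17.2 is built against torch 2.2.2; a mismatched torch version
# fails on import with an "undefined symbol" OSError
!pip install torch==2.2.2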

In [5]:
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer("basic_english")

def yield_tokens(data_iter):
    """Yield the token list for each (label, text) pair in data_iter."""
    for _, data_sample in data_iter:
        yield tokenizer(data_sample)

# Define special symbols and indices
PAD_IDX, CLS_IDX, SEP_IDX, MASK_IDX, UNK_IDX = 0, 1, 2, 3, 4

# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['[PAD]', '[CLS]', '[SEP]', '[MASK]', '[UNK]']


### Vocabulary building

In [None]:
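from itertools import chain, islice

from torchtext.datasets import IMDB

# Create the data splits and chain them into a single iterator
train_iter, test_iter = IMDB(split=('train', 'test'))
all_data_iter = chain(train_iter, test_iter)

# Check the tokenizer on the item at index 5 (the sixth sample) without
# materializing the whole dataset. Note that this consumes the first six
# items of all_data_iter.
sample_tokens = next(islice(yield_tokens(all_data_iter), 5, None))
print(sample_tokens[:20])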
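
The vocabulary step itself maps each distinct token to an index, with the special symbols occupying the first positions. The sketch below does this in plain Python on a made-up corpus; `build_vocab` and `toy_corpus` are assumptions for illustration, not the notebook's code. torchtext offers `build_vocab_from_iterator`, which accepts `specials` and `special_first` arguments for the same purpose and would normally be fed `yield_tokens(...)`.

```python
from collections import Counter

special_symbols = ['[PAD]', '[CLS]', '[SEP]', '[MASK]', '[UNK]']


def build_vocab(token_lists, specials, min_freq=1):
    """Hypothetical stand-in for a vocabulary builder with specials first."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Keep tokens seen at least min_freq times; sort by descending frequency,
    # then alphabetically so the result is deterministic.
    ordered = sorted(
        (t for t, c in counts.items() if c >= min_freq and t not in specials),
        key=lambda t: (-counts[t], t),
    )
    itos = list(specials) + ordered                     # index -> token
    stoi = {tok: idx for idx, tok in enumerate(itos)}   # token -> index
    return stoi, itos


# Made-up token lists standing in for the output of yield_tokens.
toy_corpus = [['a', 'great', 'movie'], ['a', 'bad', 'movie'], ['great', 'acting']]
stoi, itos = build_vocab(toy_corpus, special_symbols)
print(itos)
# ['[PAD]', '[CLS]', '[SEP]', '[MASK]', '[UNK]', 'a', 'great', 'movie', 'acting', 'bad']
```

Placing the special symbols first keeps their positions aligned with `PAD_IDX` through `UNK_IDX`, which is why their order in `special_symbols` matters.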