# `torchtext`
`torchtext` is an extremely useful library for NLP pre-processing. It covers the following kinds of pre-processing:

- **Train/Val/Test split**

- **File Loading**: Loading corpus in various formats

- **Tokenization**: Break sentences into a list of words

- **Vocab**: Generating a Vocabulary list

- **Numericalize/Indexify**: Map words into integer numbers for the entire corpus

- **Word Vector**: Either initialize the word vectors randomly or load them from a pretrained embedding. A pretrained embedding must be "trimmed", meaning that only the vectors for words in our vocabulary are kept in memory.

- **Batching**: Generate batches of training samples (padding is normally done here)

- **Embedding Lookup**: Map each sentence (which contains word indices) to a sequence of fixed-dimension word vectors
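The tokenization, vocabulary, and numericalization steps listed above can be sketched in plain Python. This is only an illustration of the idea, not how `torchtext` implements it: the two-sentence corpus is hypothetical, and reserving index 0 for `<unk>` and index 1 for `<pad>` is an assumption of the sketch.

```python
from collections import Counter

# Hypothetical two-sentence corpus, used only for illustration.
corpus = ["the cat sat", "the dog sat down"]

# Tokenization: break each sentence into a list of lower-cased words.
tokenized = [sentence.lower().split() for sentence in corpus]

# Vocab: reserve index 0 for unknown words and index 1 for padding (assumption),
# then add corpus words from most to least frequent, ties broken alphabetically.
counts = Counter(word for sentence in tokenized for word in sentence)
itos = ["<unk>", "<pad>"] + sorted(counts, key=lambda w: (-counts[w], w))
stoi = {word: index for index, word in enumerate(itos)}

# Numericalize: map every word to its integer index, falling back to <unk>.
numericalized = [
    [stoi.get(word, stoi["<unk>"]) for word in sentence]
    for sentence in tokenized
]

print(itos)
print(numericalized)
```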

<p align="center">
<b>Examples</b>
</p>
<p align="center">
<img src="../images/torchtext.png" style="width:450px;height:450px;">
</p>

## Steps
1. Specify how the preprocessing should be done --> Will be done with `Field`

2. Use `Dataset` to load and tokenize the data --> Will be done using `TabularDataset` (handles JSON/CSV/TSV files). The examples stay as token lists at this stage; they are converted to indices when batches are built.

3. Construct an iterator to do batching and padding --> Will be done using `BucketIterator`

In [1]:
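# Note: from torchtext 0.9 these classes live in torchtext.legacy.data, and they were removed in 0.12
from torchtext.data import BucketIterator, TabularDataset, Field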

In [2]:
tokenize = lambda x: x.split()
tokenize

<function __main__.<lambda>(x)>

In [3]:
quote = Field(sequential=True, use_vocab=True, tokenize=tokenize, lower=True)
# sequential=True: the data is a sequence of tokens
# use_vocab=True: tokens are mapped to indices through a vocabulary
# lower=True: convert all text to lower case

In [4]:
score = Field(sequential=False, use_vocab=False)
# This is a sentiment classification problem, so the target is a single integer label
# For a machine translation target, sequential and use_vocab would both be True

In [5]:
fields = {'quote':('q', quote), 'score':('s', score)} # key in the data file -> (attribute name on each example, Field to apply)

In [6]:
# JSON
train_data, test_data = TabularDataset.splits(
    path = 'data',
    train = 'train.json',
    # validation = 'val.json',
    test = 'test.json',
    format = 'json',
    fields = fields
)

#CSV
# train_data, test_data = TabularDataset.splits(
#     path = 'data',
#     train = 'train.csv',
#     # validation = 'val.csv',
#     test = 'test.csv',
#     format = 'csv',
#     fields = fields
# )

#TSV
# train_data, test_data = TabularDataset.splits(
#     path = 'data',
#     train = 'train.tsv',
#     # validation = 'val.tsv',
#     test = 'test.tsv',
#     format = 'tsv',
#     fields = fields
# )
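For the JSON case, `TabularDataset` reads the file line by line and expects one JSON object per line, with keys matching the keys of the `fields` dict (`quote` and `score` here). The sketch below writes and re-reads such a file using only the standard library. The records and the temporary directory are hypothetical and stand in for the real `data/train.json`.

```python
import json
import os
import tempfile

# Hypothetical records; the keys match the keys of the `fields` dict.
records = [
    {"quote": "A hypothetical positive quote.", "score": 1},
    {"quote": "A hypothetical negative quote.", "score": 0},
]

# A temporary directory stands in for the 'data' folder used above.
data_dir = tempfile.mkdtemp()
train_path = os.path.join(data_dir, "train.json")

# Write one JSON object per line (the "JSON lines" layout).
with open(train_path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back line by line to confirm each line parses on its own.
with open(train_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]["quote"], loaded[0]["score"])
```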

In [9]:
print(train_data[0].__dict__.keys(), '\n', train_data[0].__dict__.values())

dict_keys(['q', 's']) 
 dict_values([['you', 'must', 'own', 'everything', 'in', 'your', 'world.', 'there', 'is', 'no', 'one', 'else', 'to', 'blame.'], 1])


In [11]:
# Building the Vocabulary
quote.build_vocab(
    train_data,
    max_size = 10000, # Upper bound on vocabulary size (our train dataset only has about 50 words)
    min_freq = 1 # Keep only the words that occur at least min_freq times in the train data
)
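To make the effect of `max_size` and `min_freq` concrete, here is a minimal pure-Python sketch of frequency-based vocabulary building. The function `build_vocab_sketch` is a hypothetical helper, not part of `torchtext`, and the choice to count `max_size` over regular words only (excluding the special tokens) is an assumption of this sketch.

```python
from collections import Counter

def build_vocab_sketch(tokenized_sentences, max_size=None, min_freq=1,
                       specials=("<unk>", "<pad>")):
    """Return an index-to-string list: specials first, then words by descending frequency."""
    counts = Counter(word for sentence in tokenized_sentences for word in sentence)
    # Sort by frequency (highest first), breaking ties alphabetically.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    itos = list(specials)
    for word, freq in ordered:
        if freq < min_freq:
            break  # every remaining word is at least as rare
        if max_size is not None and len(itos) - len(specials) >= max_size:
            break  # assumption: the cap counts regular words only
        itos.append(word)
    return itos

# Hypothetical tokenized corpus: 'a' occurs 3 times, 'b' twice, 'c' once.
sentences = [["a", "b", "a"], ["a", "c", "b"]]

print(build_vocab_sketch(sentences))
print(build_vocab_sketch(sentences, min_freq=2))
print(build_vocab_sketch(sentences, max_size=1))
```

With a tiny dataset such as the one used here, a `max_size` of 10000 never takes effect; it only matters once the corpus has more distinct words than the cap.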

In [16]:
train_iterator, test_iterator = BucketIterator.splits(
    (train_data, test_data),
    batch_sizes = (2,2),
    device = 'cpu'
)
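The motivation for a bucketing iterator is that batching sentences of similar length wastes fewer positions on padding. The sketch below illustrates that idea only; it is not `torchtext`'s implementation, and `bucket_batches` and `padding_needed` are hypothetical helpers. In `torchtext` the length used for grouping is typically supplied through the `sort_key` argument, for example `sort_key=lambda ex: len(ex.q)`.

```python
import random

def bucket_batches(examples, batch_size, seed=0):
    """Group examples of similar length into batches, then shuffle the batch order."""
    ordered = sorted(examples, key=len)
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    random.Random(seed).shuffle(batches)
    return batches

def padding_needed(batches):
    """Count the pad positions needed to make every batch rectangular."""
    return sum(
        max(len(example) for example in batch) - len(example)
        for batch in batches
        for example in batch
    )

# Hypothetical numericalized sentences with lengths 2, 9, 3, 8, 2, 9.
examples = [[0] * n for n in (2, 9, 3, 8, 2, 9)]

# Naive batching takes the examples in their original order.
naive = [examples[i:i + 2] for i in range(0, len(examples), 2)]
bucketed = bucket_batches(examples, batch_size=2)

print("padding with naive batches:   ", padding_needed(naive))
print("padding with bucketed batches:", padding_needed(bucketed))
```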

In [18]:
for batch in train_iterator:
    print(batch.q) # To print the quote
    print(batch.q.shape)
    print(batch.s) # To print the score

tensor([[27, 10],
        [29, 21],
        [ 7,  4],
        [26,  3],
        [18,  6],
        [ 2, 11],
        [25, 17],
        [ 1,  4],
        [ 1,  3],
        [ 1, 30],
        [ 1, 28],
        [ 1,  5],
        [ 1, 13],
        [ 1,  2],
        [ 1,  9],
        [ 1, 23]])
torch.Size([16, 2])
tensor([0, 1])
tensor([[33],
        [19],
        [24],
        [14],
        [15],
        [34],
        [32],
        [31],
        [16],
        [20],
        [22],
        [12],
        [ 5],
        [ 8]])
torch.Size([14, 1])
tensor([1])


- The number **1** in the quote tensor is the index of the PAD token, which fills the gaps left by shorter sentences
- Every other number is the index of the corresponding word in the vocabulary
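The first batch printed above has shape `[16, 2]`, that is, sequence length first and batch size second, so each column is one sentence. A minimal pure-Python sketch of padding to that layout follows. `pad_batch` is a hypothetical helper, and the index sequences are shortened, made-up examples.

```python
PAD_INDEX = 1  # assumed to match the PAD index seen in the output above

def pad_batch(sequences, pad_index=PAD_INDEX):
    """Pad sequences to equal length and return rows indexed by time step."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]
    # Transpose so the result has shape [max_len, batch_size].
    return [list(step) for step in zip(*padded)]

# Two hypothetical numericalized sentences of lengths 3 and 5.
batch = pad_batch([[27, 29, 7], [10, 21, 4, 3, 6]])
for row in batch:
    print(row)
```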

In [27]:
import spacy

In [28]:
# spaCy 3 removed the 'en' shortcut, so load the model by its full name
# (install it first with: python -m spacy download en_core_web_sm)
spacy_en = spacy.load('en_core_web_sm')

In [29]:
def tokenize(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]
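The example printed earlier contains tokens such as `'world.'` and `'blame.'`, because splitting on whitespace leaves punctuation attached to the preceding word. The dependency-free sketch below shows the difference a punctuation-aware tokenizer makes. The regular expression is only a rough stand-in chosen for this illustration; it is not the algorithm spaCy uses.

```python
import re

text = "you must own everything in your world. there is no one else to blame."

# Whitespace splitting keeps punctuation glued to the preceding word.
whitespace_tokens = text.split()

# Rough stand-in for a real tokenizer (NOT spaCy's algorithm): match runs of
# word characters, or any single character that is neither word nor whitespace.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens[6])
print(regex_tokens[6:8])
```

Separating punctuation keeps `world` and `world.` from becoming two different vocabulary entries.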

## Pretrained word embedding
Let's use GloVe.

In [None]:
quote.build_vocab(
    train_data,
    max_size = 10000, # Upper bound on vocabulary size (our train dataset only has about 50 words)
    min_freq = 1, # Keep only the words that occur at least min_freq times in the train data
    vectors = 'glove.6B.100d' # GloVe trained on 6 billion tokens, 100-dimensional vectors (download is roughly 1 GB)
)
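Passing `vectors` attaches a pretrained vector to each word in the vocabulary, which is the "trimming" mentioned at the start: only words in our vocabulary get a row. A minimal pure-Python sketch of that idea, followed by an embedding lookup, is given below. The 3-dimensional vectors, the vocabulary, and the helper `trimmed_embeddings` are all hypothetical, and starting missing words at zero is an assumption of this sketch.

```python
# Hypothetical 3-dimensional "pretrained" vectors; real GloVe files hold far more words.
pretrained = {
    "the": [0.1, 0.2, 0.3],
    "cat": [0.4, 0.5, 0.6],
    "dog": [0.7, 0.8, 0.9],
    "sat": [1.0, 1.1, 1.2],
}

# Hypothetical vocabulary; 'zebra' has no pretrained vector.
itos = ["<unk>", "<pad>", "the", "cat", "zebra"]
dim = 3

def trimmed_embeddings(itos, pretrained, dim):
    """Build one row per vocabulary word, copying pretrained vectors where available."""
    matrix = []
    for word in itos:
        if word in pretrained:
            matrix.append(list(pretrained[word]))
        else:
            # Assumption of this sketch: words without a pretrained vector start at zero.
            matrix.append([0.0] * dim)
    return matrix

embedding_matrix = trimmed_embeddings(itos, pretrained, dim)

# Embedding lookup: replace each word index with its row of the matrix.
sentence = [2, 3, 1]  # "the cat <pad>"
sentence_vectors = [embedding_matrix[index] for index in sentence]

print(len(embedding_matrix), "rows kept out of", len(pretrained), "pretrained vectors")
print(sentence_vectors)
```

The vectors for `dog` and `sat` are never copied, because those words are not in the vocabulary.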