# TorchText Tutorials

* TorchText Github: https://github.com/pytorch/text
* TorchText Documentation: http://torchtext.readthedocs.io/en/latest/data.html

Install: 
```
pip install torchtext
```

**Reference:**

* Allen Nie's article about TorchText: [A Tutorial on Torchtext](http://anie.me/On-Torchtext/)
* yunjey's pytorch tutorial: [link](https://github.com/yunjey/pytorch-tutorial/blob/master/tutorials/01-basics/pytorch_basics/main.py)

## Contents
1. **Natural Language Processing Process**
2. **Using TorchText**
3. **What if I don't want to use TorchText**

---

### 1. Natural Language Processing Process

Deep learning for natural language processing usually involves the following steps:

1. Tokenization
2. Build Vocabulary
3. Numericalize all tokens
4. Feed the processed data into your model, for example: input > embedding lookup > RNN > output

TorchText is a powerful tool for steps 1 to 3.
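
Steps 1 to 3 can be sketched in plain Python before any library is involved. This is only a minimal illustration: the two sentences are made up, and the names `corpus`, `stoi`, and `numericalized` are assumptions for this sketch, not TorchText API.

```python
# Toy corpus; the sentences are made up for illustration.
corpus = ["the movie was great", "the plot was thin"]

# 1. Tokenization: a simple whitespace split.
tokenized = [sentence.split() for sentence in corpus]

# 2. Build vocabulary: reserve 0 and 1 for the special tokens,
#    then give every new token the next free index.
stoi = {"<unk>": 0, "<pad>": 1}
for tokens in tokenized:
    for token in tokens:
        if token not in stoi:
            stoi[token] = len(stoi)

# 3. Numericalize: replace each token with its index.
numericalized = [[stoi[token] for token in tokens] for tokens in tokenized]
print(numericalized)  # [[2, 3, 4, 5], [2, 6, 4, 7]]
```

The nested lists of indices are what step 4 would turn into tensors and pass to an embedding layer.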

---

### 2. Using TorchText

1. Create Field
2. Create Datasets
3. Build vocabulary
4. Create Iterator

The examples below use data from a sentiment analysis task.

In [1]:
from torchtext.data import Field, Iterator, TabularDataset
with open('./data/examples.tsv') as file:
    data = file.read().splitlines()
    data = [line.split('\t') for line in data]

In [2]:
data[7]

["The Importance of Being Earnest , so thick with wit it plays like a reading from Bartlett 's Familiar Quotations",
 '3']

#### 2.1 Create Field

* http://torchtext.readthedocs.io/en/latest/data.html#fields

In [3]:
TEXT = Field(sequential=True,  
             use_vocab=True,
             tokenize=str.split,  # you can define your own tokenizer
             lower=True, 
             batch_first=True)
LABEL = Field(sequential=False,  
              use_vocab=False,
              preprocessing=lambda x: int(x),  # applied after tokenization and before numericalization
              batch_first=True)

#### 2.2 Create Datasets

If you have separate train/dev/test files, you can use the `TabularDataset.splits` method with the `train=`, `validation=`, and `test=` arguments.

In [4]:
train_data = TabularDataset(path='./data/examples.tsv', format='tsv', fields=[('text', TEXT), ('label', LABEL)])

#### 2.3 Build vocabulary

When building the vocabulary, TorchText adds two special tokens: `<unk>` for words in the validation/test data that are not in the vocabulary, and `<pad>` for padding sentences so that all sequences in a batch have the same length.

In [5]:
TEXT.build_vocab(train_data)
print('Total vocabulary: {}'.format(len(TEXT.vocab)))
print('Token for "<unk>": {}'.format(TEXT.vocab.stoi['<unk>']))
print('Token for "<pad>": {}'.format(TEXT.vocab.stoi['<pad>']))

Total vocabulary: 15561
Token for "<unk>": 0
Token for "<pad>": 1
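
To make the role of `<unk>` concrete, here is a small plain-Python sketch of an out-of-vocabulary lookup. The vocabulary and the helper `numericalize` are hypothetical and only imitate the behaviour described above; they are not TorchText functions.

```python
# Hypothetical, tiny vocabulary with the two special tokens at 0 and 1.
stoi = {"<unk>": 0, "<pad>": 1, "good": 2, "movie": 3}


def numericalize(tokens, stoi, unk_idx=0):
    """Map tokens to indices, falling back to unk_idx for unseen tokens."""
    return [stoi.get(token, unk_idx) for token in tokens]


# "a" was never seen when the vocabulary was built, so it maps to <unk>.
ids = numericalize("a good movie".split(), stoi)
print(ids)  # [0, 2, 3]
```

Because unseen words all collapse to the same index, the model sees them as one shared "unknown" embedding.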


#### 2.4 Create Iterator

In [6]:
train_loader = Iterator(train_data, 
                        batch_size=3,  # size of batches  
                        device=None,  # set to "cuda" to use a GPU
                        repeat=False)

In [7]:
batch = next(iter(train_loader))  # take the first batch
print(batch.text)
print(batch.label)

tensor([[   643,    191,      4,     43,   1447,      3,   4384,    485,
              7,    207,    892,    107,     43,     85,    408,      3,
            376,     17,      5,   6447,  11035,     37,     98,     43,
            199,   5859,      2,      1,      1,      1,      1,      1,
              1,      1,      1],
        [     3,   4515,     51,    444,      4,   3738,     30,     94,
            957,   3498,     59,    700,  13967,      6,   2287,   4435,
              4,    431,     40,      3,   1201,      7,    486,   1134,
           4120,     59,      5,    166,   1749,    547,      6,   1339,
            144,  14759,      2],
        [    29,      7,    195,    568,    192,     63,    229,     60,
             17,     21,    202,    334,     18,      5,    535,     20,
              4,     15,    628,    231,     52,      9,    303,    195,
           6910,      8,  10136,      8,      3,   2204,   4340,      2,
              1,      1,      1]])
tensor([ 0,  3,  1])


All sentences within a batch are padded to the same length.

In [8]:
f = lambda x: TEXT.vocab.itos[x]
for b in batch.text:
    x = list(map(f, b.tolist()))
    print(' '.join(x), '\033[1;01;36m|| len of sentence: {} \033[0m'.format(len(x)))

deep down , i realized the harsh reality of my situation : i would leave the theater with a lower i.q. than when i had entered . <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> || len of sentence: 35
the leaping story line , shaped by director peter kosminsky into sharp slivers and cutting impressions , shows all the signs of rich detail condensed into a few evocative images and striking character traits . || len of sentence: 35
one of these days hollywood will come up with an original idea for a teen movie , but until then there 's always these rehashes to feed to the younger generations . <pad> <pad> <pad> || len of sentence: 35
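
Note that the padded length is decided per batch, not for the whole dataset. The following plain-Python sketch illustrates this with made-up token indices; `pad_batch` is a hypothetical helper, not part of TorchText.

```python
def pad_batch(batch, pad_idx=1):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_idx] * (max_len - len(seq)) for seq in batch]


# Made-up numericalized sentences of different lengths.
sequences = [[5, 6], [7, 8, 9, 10], [11], [12, 13, 14]]

# Split into batches of two and pad each batch independently.
batches = [sequences[i:i + 2] for i in range(0, len(sequences), 2)]
padded_batches = [pad_batch(batch) for batch in batches]

for padded in padded_batches:
    print(padded, "width:", len(padded[0]))
# [[5, 6, 1, 1], [7, 8, 9, 10]] width: 4
# [[11, 1, 1], [12, 13, 14]] width: 3
```

The first batch is padded to width 4 and the second only to width 3, so short sentences are not padded up to the longest sentence in the dataset.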


---

### 3. What if I don't want to use TorchText

1. Pure Python code DataLoader
2. Custom DataLoader

#### 3.1 Pure Python code DataLoader

In [9]:
import torch

In [10]:
with open('./data/examples.tsv') as file:
    data = file.read().splitlines()
    data = [line.split('\t') for line in data]

# Tokenization
sentences, labels = list(zip(*data))
all_tokens = [s.split() for s in sentences]
labels = [int(l) for l in labels]

# Build Vocabulary
flatten = lambda x: [tkn for s in x for tkn in s]
unique_tokens = set(flatten(all_tokens))
vocab_stoi = {'<unk>': 0, '<pad>': 1}
# start at 2 so the indices are contiguous and vocab_itos[i] matches vocab_stoi
for i, token in enumerate(unique_tokens, 2):
    vocab_stoi[token] = i
vocab_itos = [token for token, index in sorted(vocab_stoi.items(), key=lambda x: x[1])]

# Numericalize all tokens (tokens missing from the vocabulary map to <unk>)
all_tokens_numerical = [[vocab_stoi.get(tkn, vocab_stoi['<unk>']) for tkn in s] for s in all_tokens]

# Add <pad> and create batches
def Loader(x, y, batch_size=32, pad_idx=1):
    # step through the data in chunks of batch_size; the last batch may be smaller
    for sindex in range(0, len(x), batch_size):
        batch_text = x[sindex:sindex + batch_size]
        batch_label = y[sindex:sindex + batch_size]
        # pad every sentence to the length of the longest one in this batch
        max_len = max(len(s) for s in batch_text)
        batch_text = [s + [pad_idx] * (max_len - len(s)) for s in batch_text]

        yield torch.LongTensor(batch_text), torch.LongTensor(batch_label)

In [11]:
for batch in Loader(x=all_tokens_numerical, y=labels, batch_size=32, pad_idx=1):
    print('batch_text:', batch[0].size())
    print('batch_label:', batch[1].size())
    break

batch_text: torch.Size([32, 41])
batch_label: torch.Size([32])
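
When a vocabulary is built by hand, it is worth checking that the string-to-index dictionary and the index-to-string list agree with each other. The sketch below shows one possible round-trip check on made-up tokens; the names `stoi` and `itos` are assumptions for this illustration.

```python
# Hypothetical toy tokens, chosen only for illustration.
tokens = ["deep", "down", "i", "realized", "deep"]

# Special tokens take indices 0 and 1, so regular tokens start at 2.
stoi = {"<unk>": 0, "<pad>": 1}
for index, token in enumerate(sorted(set(tokens)), 2):
    stoi[token] = index

# itos is a list, so position i must hold the token whose index is i.
itos = [token for token, _ in sorted(stoi.items(), key=lambda kv: kv[1])]

# Check 1: indices are contiguous, 0 .. len(stoi) - 1.
assert sorted(stoi.values()) == list(range(len(stoi)))

# Check 2: encoding and then decoding returns the original tokens.
encoded = [stoi[token] for token in tokens]
decoded = [itos[index] for index in encoded]
print(encoded)            # [2, 3, 4, 5, 2]
print(decoded == tokens)  # True
```

If the indices had a gap (for example, nothing assigned to index 2), the list positions would be shifted by one and the decoded tokens would not match.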


#### 3.2 Custom DataLoader

References for a custom `collate_fn`:

1. https://discuss.pytorch.org/t/how-to-create-a-dataloader-with-variable-size-input/8278
2. https://github.com/yunjey/seq2seq-dataloader/blob/master/data_loader.py

In [12]:
import torch
import torch.utils.data as torchdata

In [13]:
class CustomDataset(torchdata.Dataset):
    def __init__(self, path='./data/examples.tsv', format_='\t', pad_idx=1):
        # Preprocessing: read the file and split each line into (sentence, label)
        with open(path, 'r') as file:
            data = file.read().splitlines()
            data = [line.split(format_) for line in data]

        # Tokenization
        sentences, labels = list(zip(*data))
        all_tokens = [s.split() for s in sentences]
        labels = [int(l) for l in labels]

        # Build Vocabulary
        flatten = lambda x: [tkn for s in x for tkn in s]
        unique_tokens = set(flatten(all_tokens))
        self.vocab_stoi = {'<unk>': 0, '<pad>': 1}
        # start at 2 so the indices are contiguous and vocab_itos[i] matches vocab_stoi
        for i, token in enumerate(unique_tokens, 2):
            self.vocab_stoi[token] = i
        self.vocab_itos = [token for token, index in sorted(self.vocab_stoi.items(), key=lambda x: x[1])]

        # Numericalize all tokens (tokens missing from the vocabulary map to <unk>)
        unk_idx = self.vocab_stoi['<unk>']
        all_tokens_numerical = [[self.vocab_stoi.get(tkn, unk_idx) for tkn in s] for s in all_tokens]

        self.x = all_tokens_numerical
        self.y = labels
        # index used for padding; keep it equal to self.vocab_stoi['<pad>']
        self.pad_idx = pad_idx

    def __getitem__(self, index):
        # return the (text, label) pair at the given index
        return [self.x[index], self.y[index]]

    def __len__(self):
        # number of examples in the dataset
        return len(self.x)

    def custom_collate_fn(self, data):
        """
        Custom 'collate_fn' for 'torchdata.DataLoader': pads the variable-length
        sentences in a batch to the same length before stacking them into a tensor.
        """
        texts, labels = list(zip(*data))
        max_len = max(len(s) for s in texts)
        texts = [s + [self.pad_idx] * (max_len - len(s)) for s in texts]
        return torch.LongTensor(texts), torch.LongTensor(labels)

In [14]:
exam_dataset = CustomDataset(path='./data/examples.tsv', format_='\t')

In [15]:
train_loader = torchdata.DataLoader(dataset=exam_dataset,
                                    collate_fn=exam_dataset.custom_collate_fn,
                                    batch_size=32, 
                                    shuffle=False, 
                                    drop_last=False)

In [16]:
for batch in train_loader:
    print('batch_text:', batch[0].size())
    print('batch_label:', batch[1].size())
    break

batch_text: torch.Size([32, 41])
batch_label: torch.Size([32])
