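# Basic concepts of torchtext
`torchtext` takes a declarative approach to loading data: you declare how each column should be processed with a `Field`, and the library handles tokenization, vocabulary construction, numericalization, and batching from that declaration.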
* reference: [A Comprehensive Introduction to Torchtext](http://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/)

![Alt text](https://i0.wp.com/mlexplained.com/wp-content/uploads/2018/02/%E3%82%B9%E3%82%AF%E3%83%AA%E3%83%BC%E3%83%B3%E3%82%B7%E3%83%A7%E3%83%83%E3%83%88-2018-02-07-10.32.59.png?resize=1024%2C481)

### Setup

In [1]:
import pandas as pd
from pathlib import Path
from pprint import pprint
from mecab import MeCab

In [2]:
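data_dir = Path.cwd() / 'data'
# Name the file explicitly: iterdir() yields entries in arbitrary order.
# 'train.txt' is the same file that TabularDataset reads later in this notebook.
nsmc = pd.read_csv(data_dir / 'train.txt', sep='\t')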

In [3]:
nsmc.head()

| | document | label |
|---|---|---|
| 0 | 애들 욕하지마라 지들은 뭐 그렇게 잘났나? 솔까 거기 나오는 귀여운 애들이 당신들보... | 1 |
| 1 | 여전히 반복되고 있는 80년대 한국 멜로 영화의 유치함. | 0 |
| 2 | 쉐임리스 스티브와 피오나가 손오공 부르마로 ㅋㅋㅋ | 0 |
| 3 | 0점은 없나요?... | 0 |
| 4 | 제발 시즌2 ㅜㅜ | 1 |


### Pipeline
* api guide: https://torchtext.readthedocs.io/en/latest/data.html#pipeline

In [4]:
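# Note: from torchtext 0.9 this API lives under torchtext.legacy.data (removed in 0.12)
from torchtext.data import Pipeline

# convert_token is the function applied to each token passed through the pipeline
pipe = Pipeline(convert_token=lambda s: s + '1')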

In [5]:
pipe('안녕')

'안녕1'

In [6]:
# add_after appends a step that runs after the existing ones; it returns the pipeline itself
pipe.add_after(lambda s: s + '2')

<torchtext.data.pipeline.Pipeline at 0x7fd7904671d0>

In [7]:
pipe('안녕')

'안녕12'

In [8]:
# add_before prepends a step that runs before the existing ones
pipe.add_before(lambda s: s + '3')

<torchtext.data.pipeline.Pipeline at 0x7fd7904671d0>

In [9]:
pipe('안녕')

'안녕312'
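
The order of the suffixes in `'안녕312'` follows from where each step is placed: `add_before` puts its function ahead of the existing steps and `add_after` puts it at the end. Below is a minimal pure-Python sketch of that composition. `MiniPipeline` is a hypothetical stand-in written for illustration, not torchtext's implementation, and it assumes only the behavior shown in the cells above.

```python
class MiniPipeline:
    """Hypothetical stand-in that mimics the call order of torchtext's Pipeline."""

    def __init__(self, convert_token):
        # Steps are applied left to right.
        self.steps = [convert_token]

    def add_before(self, fn):
        self.steps.insert(0, fn)  # new step runs first
        return self

    def add_after(self, fn):
        self.steps.append(fn)  # new step runs last
        return self

    def __call__(self, x):
        # A list of tokens is processed element by element.
        if isinstance(x, list):
            return [self(tok) for tok in x]
        for fn in self.steps:
            x = fn(x)
        return x


mini = MiniPipeline(lambda s: s + '1')
mini.add_after(lambda s: s + '2')
mini.add_before(lambda s: s + '3')

result_single = mini('안녕')        # '3' is appended first, then '1', then '2'
result_list = mini(['애', '들'])    # same steps, applied per token
print(result_single)
print(result_list)
```

The list branch is why the same pipeline can later be passed as `preprocessing` to a `Field`, where it receives a list of tokens rather than a single string.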

### Field
* api guide: https://torchtext.readthedocs.io/en/latest/data.html#field

In [29]:
from torchtext.data import Field

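# Text field: tokenized into morphemes by MeCab, mapped through a vocab,
# and padded or truncated to 32 tokens when batched
sentence = Field(sequential=True, use_vocab=True, tokenize=MeCab().morphs, batch_first=True,
                 fix_length=32, preprocessing=pipe)
# Label field: already numeric, so no tokenization, vocab, or special tokens
label = Field(sequential=False, use_vocab=False, batch_first=True, unk_token=None, pad_token=None,
              is_target=True)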

In [30]:
print(sentence.pad_token, label.pad_token)
print(sentence.unk_token, label.unk_token)
print(sentence.eos_token, label.eos_token)
print(sentence.init_token, label.init_token)

<pad> None
<unk> None
None None
None None


In [31]:
example_sentence = nsmc.iloc[0]['document']
print(example_sentence)

애들 욕하지마라 지들은 뭐 그렇게 잘났나? 솔까 거기 나오는 귀여운 애들이 당신들보다 훨 낮다.


In [32]:
list_of_tokens = sentence.tokenize(example_sentence)
print(list_of_tokens)

['애', '들', '욕', '하', '지', '마', '라', '지', '들', '은', '뭐', '그렇게', '잘', '났', '나', '?', '솔', '까', '거기', '나오', '는', '귀여운', '애', '들', '이', '당신', '들', '보다', '훨', '낮', '다', '.']


In [33]:
sentence.preprocess(example_sentence)

['애312',
 '들312',
 '욕312',
 '하312',
 '지312',
 '마312',
 '라312',
 '지312',
 '들312',
 '은312',
 '뭐312',
 '그렇게312',
 '잘312',
 '났312',
 '나312',
 '?312',
 '솔312',
 '까312',
 '거기312',
 '나오312',
 '는312',
 '귀여운312',
 '애312',
 '들312',
 '이312',
 '당신312',
 '들312',
 '보다312',
 '훨312',
 '낮312',
 '다312',
 '.312']
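
`fix_length=32` does not change the output of `preprocess`; it takes effect later, when examples are batched and each token list is padded with `pad_token` or truncated. The sketch below shows that step in plain Python, assuming the default behavior of padding on the right and keeping the first `fix_length` tokens. `pad_to_length` is a hypothetical helper, not part of torchtext, and the token lists are illustrative rather than real MeCab output.

```python
def pad_to_length(tokens, fix_length, pad_token='<pad>'):
    """Hypothetical helper: right-pad or truncate a token list to fix_length."""
    clipped = tokens[:fix_length]
    return clipped + [pad_token] * (fix_length - len(clipped))


# Illustrative tokens, not actual MeCab output
padded_short = pad_to_length(['제발', '시즌', '2'], fix_length=5)
padded_long = pad_to_length(list('abcdefg'), fix_length=5)
print(padded_short)
print(padded_long)
```

The example sentence above happens to tokenize into exactly 32 tokens, so it would pass through this step unchanged.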

### Vocab
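A `Vocab` object maps tokens to integer ids and **is used to numericalize a field.** The cells below build one by hand from a `Counter`, then again with `Field.build_vocab`.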

In [None]:
import itertools
from collections import Counter
from torchtext.vocab import Vocab

In [None]:
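# Tokenize every review; tokenize() alone skips the `preprocessing` pipeline,
# so these tokens carry no '312' suffix
list_of_tokenized = nsmc['document'].apply(sentence.tokenize).tolist()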

In [None]:
count_tokens = Counter(itertools.chain.from_iterable(list_of_tokenized))

In [None]:
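# Build the vocab by hand: tokens seen fewer than 10 times are left out
# and are numericalized as the unknown token
vocab = Vocab(counter=count_tokens, min_freq=10)
sentence.vocab = vocab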

In [None]:
print(list_of_tokens)
print(sentence.vocab.itos[:5], len(sentence.vocab))
print(sentence.numericalize([list_of_tokens]))

In [None]:
sentence.vocab = None  # discard the hand-built vocab before rebuilding it

In [None]:
# Alternative route: the Field counts the tokens itself and adds its own special tokens
sentence.build_vocab(list_of_tokenized, min_freq=10)

In [None]:
print(list_of_tokens)
print(sentence.vocab.itos[:5], len(sentence.vocab))
print(sentence.numericalize([list_of_tokens]))
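
Both routes end with a token-to-id mapping in which rare tokens fall back to the unknown token. The following is a rough pure-Python sketch of that idea. `build_mini_vocab` and `numericalize` are hypothetical helpers, and the ordering used here (special tokens first, then tokens by descending frequency with alphabetical tie-breaking) is an assumption made for illustration; the exact ids that torchtext assigns can differ by version.

```python
from collections import Counter


def build_mini_vocab(counter, min_freq=1, specials=('<unk>', '<pad>')):
    """Hypothetical helper: specials first, then tokens by descending frequency."""
    itos = list(specials)
    # Sort alphabetically first so that frequency ties keep a deterministic order.
    by_alpha = sorted(counter.items())
    for token, freq in sorted(by_alpha, key=lambda kv: kv[1], reverse=True):
        if freq >= min_freq:
            itos.append(token)
    stoi = {token: idx for idx, token in enumerate(itos)}
    return itos, stoi


def numericalize(tokens, stoi, unk_token='<unk>'):
    """Map tokens to ids, falling back to the id of the unknown token."""
    return [stoi.get(tok, stoi[unk_token]) for tok in tokens]


counts = Counter(['영화', '영화', '영화', '재미', '재미', '최고'])
itos, stoi = build_mini_vocab(counts, min_freq=2)
ids = numericalize(['영화', '최고', '재미'], stoi)
print(itos)  # '최고' occurs once, so min_freq=2 leaves it out
print(ids)   # the missing token maps to the id of '<unk>'
```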

### Example
Defines a single training or test example. Stores each column of the example as an attribute.

In [None]:
from torchtext.data import Example

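# Build one Example from the first row: values are matched to fields by position,
# and each value is run through its field's preprocess() (tokenize, then `pipe`)
example = Example.fromlist(nsmc.iloc[0].tolist(), fields=[('document', sentence), ('label', label)])
print(example.document, example.label)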
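
Conceptually, `Example.fromlist` pairs each value in the row with a `(name, field)` tuple, preprocesses the value with that field, and stores the result as an attribute under that name. A simplified sketch of this pairing follows. `example_from_list` is a hypothetical stand-in, and plain callables take the place of real `Field` objects.

```python
from types import SimpleNamespace


def example_from_list(values, fields):
    """Hypothetical stand-in: fields is a list of (name, preprocess_fn) pairs."""
    example = SimpleNamespace()
    for value, (name, preprocess) in zip(values, fields):
        setattr(example, name, preprocess(value))
    return example


row = ['제발 시즌2 ㅜㅜ', 1]
mini_example = example_from_list(
    row,
    fields=[('document', str.split),   # whitespace split stands in for MeCab().morphs
            ('label', lambda v: v)],   # the label passes through unchanged
)
print(mini_example.document, mini_example.label)
```

Because the pairing is positional, the order of the `fields` list has to match the order of the columns in the row.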

### Dataset
Defines a dataset composed of `Example`s along with its `Field`s.

In [None]:
from torchtext.data import Dataset, TabularDataset

In [None]:
# Build one Example per DataFrame row
fields = [('document', sentence), ('label', label)]
list_of_examples = [Example.fromlist(row.tolist(), fields=fields) for _, row in nsmc.iterrows()]

In [None]:
pprint(list_of_examples[:5])
print(list_of_examples[0].document, list_of_examples[0].label)

In [None]:
# generate dataset
dataset = Dataset(examples=list_of_examples, fields=[('document', sentence), ('label', label)])

In [None]:
dataset.examples[:5]

In [None]:
dataset.fields

In [None]:
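# Using TabularDataset: redefine the fields without the demo `preprocessing` pipeline,
# then let torchtext read the TSV file and build the Examples itself
sentence = Field(sequential=True, use_vocab=True, tokenize=MeCab().morphs, batch_first=True, fix_length=32)
label = Field(sequential=False, use_vocab=False, batch_first=True, unk_token=None, pad_token=None, is_target=True)

# Fields are matched to the file's columns by position; skip_header drops the column-name row
dataset = TabularDataset(path='data/train.txt', format='tsv', fields=[('document', sentence), ('label', label)],
                         skip_header=True)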

In [None]:
sentence.build_vocab(dataset)
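
`TabularDataset` matches the `fields` list to the file's columns by position and, with `skip_header=True`, discards the first row. The sketch below reproduces only that reading step with the standard `csv` module. `read_tsv` is a hypothetical helper, and the in-memory text stands in for `data/train.txt`, whose exact contents are not shown here.

```python
import csv
import io

# In-memory stand-in for a two-column TSV file with a header row
tsv_text = "document\tlabel\n제발 시즌2 ㅜㅜ\t1\n0점은 없나요?...\t0\n"


def read_tsv(handle, field_names, skip_header=True):
    """Hypothetical reader: map TSV columns to field names by position."""
    reader = csv.reader(handle, delimiter='\t')
    if skip_header:
        next(reader)  # drop the column-name row
    return [dict(zip(field_names, row)) for row in reader]


rows = read_tsv(io.StringIO(tsv_text), field_names=['document', 'label'])
print(rows)
```

Every value comes back as a string, including the label. That is why a numeric label field still has to convert its raw values when batches are built.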

### Iterator
Defines an iterator that loads batches of data from a `Dataset`.

In [None]:
from torchtext.data import Iterator

In [None]:
iterator = Iterator(dataset, batch_size=2)

In [None]:
# A batch unpacks into (input fields, target fields); `label` is the target because is_target=True
x_mb, y_mb = next(iter(iterator))

In [None]:
print(x_mb, y_mb)
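
With `batch_first=True` and `fix_length=32`, `x_mb` should have shape `(batch_size, 32)` and `y_mb` shape `(batch_size,)`. The sketch below imitates the batching step on plain lists: shuffle, cut into chunks of `batch_size`, and pad each sequence. `mini_batches` is a hypothetical generator, `pad_id=1` is an assumed id for `<pad>`, and the toy data is made up for illustration.

```python
import random


def mini_batches(examples, batch_size, fix_length, pad_id=1, seed=0):
    """Hypothetical generator: shuffle, chunk, and pad (token_ids, label) pairs."""

    def pad(ids):
        clipped = ids[:fix_length]
        return clipped + [pad_id] * (fix_length - len(clipped))

    order = list(examples)
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        chunk = order[start:start + batch_size]
        x = [pad(ids) for ids, _ in chunk]   # batch_size rows of fix_length ids
        y = [label for _, label in chunk]    # one label per row
        yield x, y


toy = [([5, 6, 7], 1), ([8], 0), ([9, 10, 11, 12, 13], 1)]
batches = list(mini_batches(toy, batch_size=2, fix_length=4))
for x, y in batches:
    print(x, y)
```

The last batch can be smaller than `batch_size` when the number of examples is not a multiple of it, as with the three toy examples here.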