For the full list of datasets available in torchtext, see https://pytorch.org/text/stable/datasets.html

We will use Multi30k for our machine translation program, translating **German** (source) to **English** (target)

In [1]:
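# spaCy is optional here: the spaCy tokenizers in cell [4] are defined but not used.
# import spacy
# To install the spaCy language models:
#     python -m spacy download en
#     python -m spacy download de   (Deutsch)
import numpy as np
from torchtext.datasets import Multi30k
# Field and BucketIterator belong to the legacy torchtext API
# (moved to torchtext.legacy.data in torchtext 0.9 and removed in 0.12).
from torchtext.data import Field, BucketIterator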

In [2]:
# Example trial
string = 'one,two three\nfour.five'
print(string)
# string.split(',')
p = np.array([])
for s in string.split(','):
    s1 = s.split()
    for s2 in s1:
        s2 = s2.split('.')
        p = np.append(p, s2)
p

one,two three
four.five


array(['one', 'two', 'three', 'four', 'five'], dtype='<U32')

In [3]:
# spacy_eng = spacy.load('en')
# spacy_ger = spacy.load('de')
def tokenize(text):
    p = np.array([])
    for s in text.split(' '):
        s1 = s.split(',')
        for s2 in s1:
            s2 = s2.split('.')
            p = np.append(p, s2)
    return list(p)
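
The `tokenize` function above leaves empty strings in its output whenever a word ends with a period or comma (for example, `'bushes.'.split('.')` gives `['bushes', '']`), and those empty strings then count as tokens. A minimal alternative sketch using the standard `re` module is shown here. The name `tokenize_regex` is hypothetical and is not used elsewhere in this notebook, so the outputs below still come from the original `tokenize`.

```python
import re

def tokenize_regex(text):
    """Split on runs of whitespace, commas and periods, dropping empty tokens."""
    return [tok for tok in re.split(r"[\s,.]+", text) if tok]

print(tokenize_regex("Two young, White males are outside near many bushes."))
print(tokenize_regex("one,two three\nfour.five"))
```

Swapping this in would change the vocabulary and therefore the indices printed later, so treat it as an option rather than a drop-in replacement.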

In [4]:
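# These spaCy tokenizers need spacy_eng and spacy_ger, whose spacy.load calls
# are commented out above, so calling them as-is raises NameError.
# The Fields below use the simple tokenize function instead.
def tokenize_eng(text):
    return [tok.text for tok in spacy_eng.tokenizer(text)]

def tokenize_ger(text):
    return [tok.text for tok in spacy_ger.tokenizer(text)]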

In [5]:
english = Field(sequential=True, use_vocab=True, tokenize=tokenize, lower = True)
german = Field(sequential=True, use_vocab=True, tokenize=tokenize, lower = True)

In [6]:
train_data, validation_data, test_data = Multi30k.splits(
    exts = ('.de', '.en'), # (Source language, Target Language)
    fields = (german, english)
)

In [54]:
english.build_vocab(train_data, max_size = 10000, min_freq = 2)
german.build_vocab(train_data, max_size = 10000, min_freq = 2)
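
To make the effect of `max_size` and `min_freq` concrete, here is a rough pure-Python sketch of frequency-based vocabulary building. It is not torchtext's code. The function name `build_vocab_sketch`, the toy corpus, and the assumption that the special tokens `<unk>` and `<pad>` take the first two indices and do not count toward `max_size` are all illustrative.

```python
from collections import Counter

def build_vocab_sketch(tokenized_sentences, max_size=None, min_freq=1,
                       specials=("<unk>", "<pad>")):
    """Return (itos, stoi) built from token frequencies."""
    counter = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Most frequent first; ties are broken alphabetically.
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials)
    for word, freq in ordered:
        if freq < min_freq:
            break  # every remaining word is rarer still
        if max_size is not None and len(itos) - len(specials) >= max_size:
            break  # vocabulary is full
        itos.append(word)
    stoi = {word: idx for idx, word in enumerate(itos)}
    return itos, stoi

# Toy corpus (hypothetical): only "a" and "man" occur at least twice.
corpus = [["a", "man", "in", "a", "hat"], ["a", "man", "runs"]]
itos, stoi = build_vocab_sketch(corpus, max_size=10000, min_freq=2)
print(itos)
```

Under these assumptions, `min_freq` can keep the vocabulary well below `max_size`, since words seen only once are dropped before the size cap is ever reached.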

In [55]:
train_iterator, validation_iterator, test_iterator = BucketIterator.splits(
    (train_data, validation_data, test_data),
    batch_sizes=(64, 64, 64),
    device = 'cpu'
)
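
`BucketIterator` is designed to put sentences of similar length in the same batch so that less padding is needed. The sketch below illustrates only that idea in plain Python; the helper names and the dummy sentence lengths are made up for the example, and the real iterator also shuffles the data.

```python
def count_padding(batches):
    """Total pad positions needed to make every batch rectangular."""
    return sum(max(map(len, b)) * len(b) - sum(map(len, b)) for b in batches)

def make_batches(seqs, batch_size):
    """Cut a list of sequences into consecutive batches."""
    return [seqs[i:i + batch_size] for i in range(0, len(seqs), batch_size)]

lengths = [3, 20, 4, 19, 5, 21, 3, 18]   # hypothetical sentence lengths
sentences = [[0] * n for n in lengths]   # dummy token ids

naive = make_batches(sentences, 2)                      # original order
bucketed = make_batches(sorted(sentences, key=len), 2)  # similar lengths together
print(count_padding(naive), count_padding(bucketed))    # 63 3
```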

In [56]:
for batch in train_iterator:
    print(batch)


[torchtext.data.batch.Batch of size 64 from MULTI30K]
	[.src]:[torch.LongTensor of size 24x64]
	[.trg]:[torch.LongTensor of size 24x64]

[torchtext.data.batch.Batch of size 64 from MULTI30K]
	[.src]:[torch.LongTensor of size 23x64]
	[.trg]:[torch.LongTensor of size 24x64]

[torchtext.data.batch.Batch of size 64 from MULTI30K]
	[.src]:[torch.LongTensor of size 22x64]
	[.trg]:[torch.LongTensor of size 23x64]

[torchtext.data.batch.Batch of size 64 from MULTI30K]
	[.src]:[torch.LongTensor of size 22x64]
	[.trg]:[torch.LongTensor of size 23x64]

[torchtext.data.batch.Batch of size 64 from MULTI30K]
	[.src]:[torch.LongTensor of size 27x64]
	[.trg]:[torch.LongTensor of size 29x64]

[torchtext.data.batch.Batch of size 64 from MULTI30K]
	[.src]:[torch.LongTensor of size 20x64]
	[.trg]:[torch.LongTensor of size 27x64]

[torchtext.data.batch.Batch of size 64 from MULTI30K]
	[.src]:[torch.LongTensor of size 23x64]
	[.trg]:[torch.LongTensor of size 22x64]

[torchtext.data.batch.Batch of size 64 f

We can see that every tensor has shape b x 64, where 64 is the batch size and b is the length of the longest sentence in that batch (shorter sentences are padded up to b)

`.src` : German numericalized sentences (Source)

`.trg` : English numericalized sentences (Target)
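
As a rough illustration of how variable-length sentences become a single b x batch-size block, the sketch below pads lists of token indices and then transposes them. The function name `pad_and_transpose`, the toy indices, and the pad index of 1 are assumptions made for this example.

```python
def pad_and_transpose(batch, pad_index=1):
    """Pad index lists to a common length, then return one row per time step."""
    max_len = max(len(seq) for seq in batch)
    padded = [seq + [pad_index] * (max_len - len(seq)) for seq in batch]
    # zip(*padded) turns [sentence][position] into [position][sentence]
    return [list(step) for step in zip(*padded)]

batch = [[2, 55, 81], [2, 7], [18, 230, 7, 11]]  # three toy sentences
out = pad_and_transpose(batch)
print(len(out), "x", len(out[0]))  # 4 x 3
for row in out:
    print(row)
```

Each column of the result is one sentence and each row is one position, which is the same layout as the `24x64` tensors printed above.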

In [57]:
english.vocab.stoi['the'] # stoi: String to index (from the vocabulary)

5

In [58]:
english.vocab.stoi['i']

1313

In [59]:
english.vocab.itos[5] # itos: Index to String

'the'

In [60]:
len(english.vocab)

5962
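
`stoi` and `itos` are inverse lookups, and numericalizing a sentence amounts to mapping each token through `stoi`. The sketch below uses a toy vocabulary, not the real one, and assumes that out-of-vocabulary words map to an `<unk>` token at index 0; the helper name `numericalize` is hypothetical.

```python
def numericalize(tokens, stoi, unk_index=0):
    """Map tokens to indices, sending unknown tokens to unk_index."""
    return [stoi.get(tok, unk_index) for tok in tokens]

itos = ["<unk>", "<pad>", "a", "man", "the"]   # toy vocabulary (assumption)
stoi = {tok: i for i, tok in enumerate(itos)}

ids = numericalize(["the", "man", "juggles"], stoi)
print(ids)                     # [4, 3, 0]
print([itos[i] for i in ids])  # ['the', 'man', '<unk>']
```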

## Text files to Dataset

In [61]:
import pandas as pd
from torchtext.data import TabularDataset
from sklearn.model_selection import train_test_split

In [68]:
with open('.data/multi30k/train.en', encoding='utf8') as f:
    english_txt = f.read().split('\n')
with open('.data/multi30k/train.de', encoding='utf8') as f:
    german_txt = f.read().split('\n')

In [69]:
english_txt

['Two young, White males are outside near many bushes.',
 'Several men in hard hats are operating a giant pulley system.',
 'A little girl climbing into a wooden playhouse.',
 'A man in a blue shirt is standing on a ladder cleaning a window.',
 'Two men are at the stove preparing food.',
 'A man in green holds a guitar while the other man observes his shirt.',
 'A man is smiling at a stuffed lion',
 'A trendy girl talking on her cellphone while gliding slowly down the street.',
 'A woman with a large purse is walking by a gate.',
 'Boys dancing on poles in the middle of the night.',
 'A ballet class of five girls jumping in sequence.',
 'Four guys three wearing hats one not are jumping at the top of a staircase.',
 'A black dog and a spotted dog are fighting',
 'A man in a neon green and orange uniform is driving on a green tractor.',
 'Several women wait outside in a city.',
 'A lady in a black top with glasses is sprinkling powdered sugar on a bundt cake.',
 'A little girl is sitting

In [70]:
raw_data = {"English": english_txt, "German": german_txt}

In [71]:
df = pd.DataFrame(raw_data, columns = ['English', 'German'])
df

| | English | German |
|---|---|---|
| 0 | Two young, White males are outside near many b... | Zwei junge weiße Männer sind im Freien in der ... |
| 1 | Several men in hard hats are operating a giant... | Mehrere Männer mit Schutzhelmen bedienen ein A... |
| 2 | A little girl climbing into a wooden playhouse. | Ein kleines Mädchen klettert in ein Spielhaus ... |
| 3 | A man in a blue shirt is standing on a ladder ... | Ein Mann in einem blauen Hemd steht auf einer ... |
| 4 | Two men are at the stove preparing food. | Zwei Männer stehen am Herd und bereiten Essen zu. |
| ... | ... | ... |
| 28995 | A woman behind a scrolled wall is writing | Eine Frau schreibt hinter einer verschnörkelte... |
| 28996 | A rock climber practices on a rock climbing wall. | Ein Bergsteiger übt an einer Kletterwand. |
| 28997 | Two male construction workers are working on a... | Zwei Bauarbeiter arbeiten auf einer Straße vor... |
| 28998 | An elderly man sits outside a storefront accom... | Ein älterer Mann sitzt mit einem Jungen mit ei... |


In [72]:
train, test = train_test_split(df, test_size=0.2) # Only the Multi30k training file is split here; its validation and test files are not used

`TabularDataset` accepts only JSON, CSV, or TSV files, so we need to save the train and test splits in one of these formats

In [73]:
train.to_json('data/train_translation.json', orient='records', lines=True)
test.to_json('data/test_translation.json', orient='records', lines=True)

# train.to_csv('data/train_translation.csv', index=False)
# test.to_csv('data/test_translation.csv', index=False)
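
With `orient='records', lines=True`, each row is written as one JSON object per line (often called JSON Lines). The sketch below writes and reads that layout with only the standard library, using two sentence pairs taken from the DataFrame above; the temporary path and the file name `sample_translation.json` are made up for the example.

```python
import json
import os
import tempfile

records = [
    {"English": "Two men are at the stove preparing food.",
     "German": "Zwei Männer stehen am Herd und bereiten Essen zu."},
    {"English": "A rock climber practices on a rock climbing wall.",
     "German": "Ein Bergsteiger übt an einer Kletterwand."},
]

# Write one JSON object per line.
path = os.path.join(tempfile.mkdtemp(), "sample_translation.json")
with open(path, "w", encoding="utf8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read the file back line by line.
with open(path, encoding="utf8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded))
print(loaded[0]["German"])
```

The keys `English` and `German` in each line are the names that the fields dictionary refers to when the files are loaded.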

### Steps -
1. Create the Fields for input (source) and output (target)
2. Create the fields dictionary
3. Create Dataset using TabularDataset
4. Build the vocabulary using the dataset
5. Create the BucketIterator, which works similarly to PyTorch's DataLoader

We will reuse the `english` and `german` `Field` objects we created earlier

In [74]:
fields = {"English": ("eng", english), "German": ("ger", german)}
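
The fields dictionary maps each JSON key to a pair of (attribute name on the example, `Field` used to process the text). The sketch below imitates that mapping in plain Python. `apply_fields` and `lower_split` are hypothetical stand-ins, with `lower_split` playing the role of a `Field` that lowercases and tokenizes on whitespace.

```python
def apply_fields(record, fields):
    """Map one JSON record to {attribute_name: processed_tokens}."""
    example = {}
    for json_key, (attr_name, preprocess) in fields.items():
        example[attr_name] = preprocess(record[json_key])
    return example

def lower_split(text):
    """Stand-in for a Field with lower=True and a whitespace tokenizer."""
    return text.lower().split()

fields_sketch = {"English": ("eng", lower_split), "German": ("ger", lower_split)}
record = {"English": "A rock climber practices on a rock climbing wall.",
          "German": "Ein Bergsteiger übt an einer Kletterwand."}

example = apply_fields(record, fields_sketch)
print(example["eng"])
print(example["ger"])
```

This is why the batches built from these files expose `.eng` and `.ger` attributes rather than `.src` and `.trg`.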

In [76]:
train_data, test_data = TabularDataset.splits(
    path = 'data',
    train = 'train_translation.json',
    test = 'test_translation.json',
    format = 'json',
    fields = fields # This ensures the pre-processing
)

In [77]:
english.build_vocab(train_data, max_size = 10000, min_freq = 2)
german.build_vocab(train_data, max_size = 10000, min_freq = 2)

In [78]:
train_iterator, test_iterator = BucketIterator.splits(
    (train_data, test_data),
    batch_sizes=(32, 32),
    device = 'cpu'
)

In [82]:
for batch in train_iterator:
    print(batch)


[torchtext.data.batch.Batch of size 32]
	[.eng]:[torch.LongTensor of size 23x32]
	[.ger]:[torch.LongTensor of size 21x32]

[torchtext.data.batch.Batch of size 32]
	[.eng]:[torch.LongTensor of size 25x32]
	[.ger]:[torch.LongTensor of size 28x32]

[torchtext.data.batch.Batch of size 32]
	[.eng]:[torch.LongTensor of size 34x32]
	[.ger]:[torch.LongTensor of size 39x32]

[torchtext.data.batch.Batch of size 32]
	[.eng]:[torch.LongTensor of size 23x32]
	[.ger]:[torch.LongTensor of size 22x32]

[torchtext.data.batch.Batch of size 32]
	[.eng]:[torch.LongTensor of size 26x32]
	[.ger]:[torch.LongTensor of size 24x32]

[torchtext.data.batch.Batch of size 32]
	[.eng]:[torch.LongTensor of size 18x32]
	[.ger]:[torch.LongTensor of size 19x32]

[torchtext.data.batch.Batch of size 32]
	[.eng]:[torch.LongTensor of size 20x32]
	[.ger]:[torch.LongTensor of size 19x32]

[torchtext.data.batch.Batch of size 32]
	[.eng]:[torch.LongTensor of size 20x32]
	[.ger]:[torch.LongTensor of size 19x32]

[torchtext.data

In [81]:
for batch in train_iterator:
    print(batch.eng)
    print(batch.ger)

tensor([[   2,    2,    2,  110, 1519,    2,    2,   18,    2,    2,   18,    2,
            2,   35,    2,    2,    2,   18,    7,    2,    2,    2,    7,   13,
            2,    2,    2,    2,    7,   15,  190,    2],
        [  55,    7,    7,  113,  295,    7,   12,  230,   20,   12,    0,   58,
            7,  388,   35,    7,   35,  895,   71,   12,   35,   35,   11,   47,
            7,    7,   33,    7,   71,    4,  210,  128],
        [  81,   74,    4,    8,   11,    4,   37,    7,   31,    9, 2101,   11,
           32,   49,   10,    4,   10,    7,    4,    9,   10,   10, 3844,   96,
          200,    4,    8,   11,   49,   26,    0,   20],
        [  10,    2,   30,  630,   24,    2,   38,   11,    4,    2,    0,    2,
            4,    2,  477,    2, 1741,   11,   92,   13,   20,   20,    3,  204,
           11,    2,   76,  597,    8,  789,  434,   12],
        [  15,   59,    3,   17,    4,   23,    2,    2,   23,   20,   17,   26,
            5,  245,   15,   98,  106, 