Author: Yuvaraj Prem Kumar <br>

In [None]:
'''
References:
-----------

Ben Trevett (2020). PyTorch Seq2Seq. URL: https://github.com/bentrevett/pytorch-seq2seq

Alexander Rush (2018). "The Annotated Transformer". In: Proceedings of Workshop for NLP Open Source Software (NLP-OSS). Melbourne, Australia: Association for Computational Linguistics, pp. 52-60. doi: 10.18653/v1/W18-2509. URL: https://aclanthology.org/W18-2509

Yu-Hsiang Huang (2019). Attention is all you need: A Pytorch Implementation. URL: https://github.com/jadore801120/attention-is-all-you-need-pytorch

TensorFlow (2021). Transformer model for language understanding. URL: https://www.tensorflow.org/text/tutorials/transformer
'''

# Dataset exploration for the Multi30K and IWSLT2016 datasets:

* **Loading the dataset via TorchText in-built datasets.**
* **Showing sample sentence pairs for the translation task.**
* **Using the spaCy EN and DE language models for data preprocessing (tokenization).**
* **Implementation for the PyTorch dataloader iterator.**

In [1]:
import pandas as pd
import matplotlib.pyplot as plt 
from itertools import islice
#plt.style.use('ggplot')

%matplotlib inline
%config InlineBackend.figure_format='retina'

In [31]:
import torch
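from torchtext.datasets import Multi30k, IWSLT2016
# Note: the next import rebinds the name Multi30k to the legacy class (the one that provides .splits)
from torchtext.legacy.datasets import Multi30k, IWSLT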
from torchtext.legacy.data import Field, BucketIterator
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#from torchtext.data import Field, BucketIterator
#from torchtext.data.utils import get_tokenizer

spaCy's English and German language models are used for tokenization. More info here: https://spacy.io/usage/models

In [3]:
# Download and import the language models. You may have to restart the runtime after downloading so the models are picked up
#!python -m spacy download en_core_web_sm
#!python -m spacy download de_core_news_sm

import en_core_web_sm
import de_core_news_sm
nlp_en = en_core_web_sm.load()
nlp_de = de_core_news_sm.load()

# or:
#nlp_en = spacy.load('en_core_web_sm')
#nlp_de = spacy.load('de_core_news_sm')

In [4]:
# Left-align the DataFrame output so the text data is easier to read

def left_align(df: pd.DataFrame):
    left_aligned_df = df.style.set_properties(**{'text-align': 'left'})
    left_aligned_df = left_aligned_df.set_table_styles(
        [dict(selector='th', props=[('text-align', 'left')])]
    )
    return left_aligned_df

In [45]:
def tokenize_de(text):
    """
    Tokenizes German text from a string into a list of strings
    """
    return [tok.text for tok in nlp_de.tokenizer(text)]


def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings
    """
    return [tok.text for tok in nlp_en.tokenizer(text)]

# German lang. is the source
SRC = Field(tokenize = tokenize_de, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True, 
            batch_first = True)

# English lang. is the target
TRG = Field(tokenize = tokenize_en, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True, 
            batch_first = True)
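
The two `Field` definitions above bundle several preprocessing steps. As a rough, library-free sketch of what happens to a single sentence by the time it reaches the model: the text is tokenized, lowercased (`lower = True`), and wrapped in the `<sos>` and `<eos>` markers. The regex tokenizer `simple_tokenize` and the helper `preprocess` are hypothetical stand-ins for illustration only; the notebook itself uses the spaCy tokenizers, and torchtext may apply these steps at different points internally.

```python
import re

def simple_tokenize(text):
    # Hypothetical stand-in for the spaCy tokenizers used in the notebook:
    # splits into runs of word characters and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def preprocess(text, init_token="<sos>", eos_token="<eos>", lower=True):
    # Tokenize, optionally lowercase each token, then add the boundary markers.
    tokens = simple_tokenize(text)
    if lower:
        tokens = [tok.lower() for tok in tokens]
    return [init_token] + tokens + [eos_token]

tokens = preprocess("Zwei Männer stehen am Herd.")
print(tokens)
```

The boundary markers give the decoder an explicit start symbol to condition on and an explicit stop symbol to predict.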

# Multi30K dataset

In [28]:
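# Call the new-style dataset API through its module: the bare name Multi30k
# is rebound to the legacy class by the second import in the cell above.
import torchtext.datasets as raw_datasets
train_iter, valid_iter, test_iter = raw_datasets.Multi30k(root='.data',
                                                          split=('train', 'valid', 'test'),
                                                          language_pair=('de', 'en'))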

In [29]:
print(f"Number of train sentence pairs: {len(train_iter)}")
print(f"Number of validation sentence pairs: {len(valid_iter)}") 
print(f"Number of test sentence pairs: {len(test_iter)}") 

Number of train sentence pairs: 29000
Number of validation sentence pairs: 1014
Number of test sentence pairs: 1000


In [6]:
multi30k_df = pd.DataFrame(list(islice(train_iter,5)), columns=['DE','EN'])
left_align(multi30k_df)

| | DE | EN |
|---|---|---|
| 0 | Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche. | Two young, White males are outside near many bushes. |
| 1 | Mehrere Männer mit Schutzhelmen bedienen ein Antriebsradsystem. | Several men in hard hats are operating a giant pulley system. |
| 2 | Ein kleines Mädchen klettert in ein Spielhaus aus Holz. | A little girl climbing into a wooden playhouse. |
| 3 | Ein Mann in einem blauen Hemd steht auf einer Leiter und putzt ein Fenster. | A man in a blue shirt is standing on a ladder cleaning a window. |
| 4 | Zwei Männer stehen am Herd und bereiten Essen zu. | Two men are at the stove preparing food. |


In [7]:
src_sentence, tgt_sentence = next(train_iter)
print(f"DE: {src_sentence} EN: {tgt_sentence}")

# Tokenize each sentence with the model for its own language (German source, English target)
for doc in nlp_de.pipe([src_sentence]):
    print([tok.text for tok in doc])

for doc in nlp_en.pipe([tgt_sentence]):
    print([tok.text for tok in doc])

DE: Ein Mann in grün hält eine Gitarre, während der andere Mann sein Hemd ansieht.
 EN: A man in green holds a guitar while the other man observes his shirt.

['Ein', 'Mann', 'in', 'grün', 'hält', 'eine', 'Gitarre', ',', 'während', 'der', 'andere', 'Mann', 'sein', 'Hemd', 'ansieht', '.', '\n']
['A', 'man', 'in', 'green', 'holds', 'a', 'guitar', 'while', 'the', 'other', 'man', 'observes', 'his', 'shirt', '.', '\n']
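
The sentence pair printed here is not one of the five rows previewed in the DataFrame. A likely explanation, assuming the dataset object behaves like an ordinary one-shot Python iterator, is that `islice` already consumed the first five pairs, so `next` returns the sixth. The sketch below reproduces that behaviour on a small made-up list; the names `pairs`, `preview`, and `following` are illustrative only.

```python
from itertools import islice

# A made-up stand-in for the dataset iterator: four (source, target) pairs.
pairs = iter([("de0", "en0"), ("de1", "en1"), ("de2", "en2"), ("de3", "en3")])

# Previewing with islice advances the underlying iterator by two items...
preview = list(islice(pairs, 2))

# ...so the next call continues from the third pair, not the first.
following = next(pairs)

print(preview)
print(following)
```

If the preview should not affect later iteration, one option is to re-create the iterator after inspecting it.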


In [46]:
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'), 
                                                    fields = (SRC, TRG), root='data')

SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)
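
To make the effect of `min_freq = 2` concrete, here is a minimal pure-Python sketch of frequency-based vocabulary building. It is not torchtext's implementation: the helpers `build_vocab` and `numericalize` are hypothetical, the special-token order follows the `<unk>`, `<pad>`, `<sos>`, `<eos>` order visible in the printed vocabulary further down, and the alphabetical tie-breaking is an assumption.

```python
from collections import Counter

SPECIALS = ["<unk>", "<pad>", "<sos>", "<eos>"]

def build_vocab(tokenized_sentences, min_freq=2, specials=SPECIALS):
    # Count every token in the (already tokenized) training sentences.
    freqs = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Keep tokens seen at least min_freq times, most frequent first
    # (ties broken alphabetically, which is an assumption of this sketch).
    kept = sorted((t for t, c in freqs.items() if c >= min_freq),
                  key=lambda t: (-freqs[t], t))
    itos = list(specials) + kept                   # index -> token
    stoi = {tok: i for i, tok in enumerate(itos)}  # token -> index
    return itos, stoi

def numericalize(tokens, stoi):
    # Tokens that were filtered out or never seen map to the <unk> index.
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]

corpus = [["ein", "mann", "."], ["ein", "hund", "."], ["eine", "frau", "."]]
itos, stoi = build_vocab(corpus, min_freq=2)
ids = numericalize(["ein", "hund", "."], stoi)
print(itos)
print(ids)
```

Words that occur only once in the training data, such as `hund` here, are replaced by `<unk>`, which keeps the vocabulary smaller at the cost of losing rare words.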

In [None]:
def print_data_info(train_data, valid_data, test_data, src_field, trg_field):
    """Print dataset sizes, the first training pair, and vocabulary statistics."""

    print("Data set sizes (number of sentence pairs):")
    print('train', len(train_data))
    print('valid', len(valid_data))
    print('test', len(test_data), "\n")

    print("First training example:")
    print("src:", " ".join(vars(train_data[0])['src']))
    print("trg:", " ".join(vars(train_data[0])['trg']), "\n")

    print("Most common words (src):")
    print("\n".join(["%10s %10d" % x for x in src_field.vocab.freqs.most_common(10)]), "\n")
    print("Most common words (trg):")
    print("\n".join(["%10s %10d" % x for x in trg_field.vocab.freqs.most_common(10)]), "\n")

    print("First 10 words (src):")
    print("\n".join(
        '%02d %s' % (i, t) for i, t in enumerate(src_field.vocab.itos[:10])), "\n")
    print("First 10 words (trg):")
    print("\n".join(
        '%02d %s' % (i, t) for i, t in enumerate(trg_field.vocab.itos[:10])), "\n")

    print("Number of German words (types):", len(src_field.vocab))
    print("Number of English words (types):", len(trg_field.vocab), "\n")

In [48]:
print_data_info(train_data, valid_data, test_data, SRC, TRG)

Data set sizes (number of sentence pairs):
train 29000
valid 1014
test 1000 

First training example:
src: zwei junge weiße männer sind im freien in der nähe vieler büsche .
trg: two young , white males are outside near many bushes . 

Most common words (src):
         .      28809
       ein      18851
     einem      13711
        in      11895
      eine       9909
         ,       8938
       und       8925
       mit       8843
       auf       8745
      mann       7805 

Most common words (trg):
         a      49165
         .      27623
        in      14886
       the      10955
        on       8035
       man       7781
        is       7525
       and       7379
        of       6871
      with       6179 

First 10 words (src):
00 <unk>
01 <pad>
02 <sos>
03 <eos>
04 .
05 ein
06 einem
07 in
08 eine
09 , 

First 10 words (trg):
00 <unk>
01 <pad>
02 <sos>
03 <eos>
04 a
05 .
06 in
07 the
08 on
09 man 

Number of German words (types): 7853
Number of English words (types): 5893

# IWSLT2016 dataset

In [33]:
train_iter, valid_iter, test_iter = IWSLT2016(root='.data',
                                              split=('train', 'valid', 'test'),
                                              language_pair=('de', 'en'))

In [25]:
print(f"Number of train sentence pairs: {len(train_iter)}")
print(f"Number of validation sentence pairs: {len(valid_iter)}") 
print(f"Number of test sentence pairs: {len(test_iter)}") 

Number of train sentence pairs: 196884
Number of validation sentence pairs: 993
Number of test sentence pairs: 1305


In [9]:
iwslt_df = pd.DataFrame(list(islice(train_iter,5)), columns=['DE','EN'])
left_align(iwslt_df)

| | DE | EN |
|---|---|---|
| 0 | David Gallo: Das ist Bill Lange. Ich bin Dave Gallo. | David Gallo: This is Bill Lange. I'm Dave Gallo. |
| 1 | Wir werden Ihnen einige Geschichten über das Meer in Videoform erzählen. | And we're going to tell you some stories from the sea here in video. |
| 2 | Wir haben ein paar der unglaublichsten Aufnahmen der Titanic, die man je gesehen hat,, und wir werden Ihnen nichts davon zeigen. | We've got some of the most incredible video of Titanic that's ever been seen, and we're not going to show you any of it. |
| 3 | Die Wahrheit ist, dass die Titanic – obwohl sie alle Kinokassenrekorde bricht – nicht gerade die aufregendste Geschichte vom Meer ist. | The truth of the matter is that the Titanic -- even though it's breaking all sorts of box office records -- it's not the most exciting story from the sea. |
| 4 | Ich denke, das Problem ist, dass wir das Meer für zu selbstverständlich halten. | And the problem, I think, is that we take the ocean for granted. |


In [10]:
src_sentence, tgt_sentence = next(train_iter)
print(f"DE: {src_sentence} EN: {tgt_sentence}")

# Tokenize each sentence with the model for its own language (German source, English target)
for doc in nlp_de.pipe([src_sentence]):
    print([tok.text for tok in doc])

for doc in nlp_en.pipe([tgt_sentence]):
    print([tok.text for tok in doc])

DE: Wenn man darüber nachdenkt, machen die Ozeane 75 % des Planeten aus.
 EN: When you think about it, the oceans are 75 percent of the planet.

['Wenn', 'man', 'darüber', 'nachdenkt', ',', 'machen', 'die', 'Ozeane', '75', '%', 'des', 'Planeten', 'aus', '.', '\n']
['When', 'you', 'think', 'about', 'it', ',', 'the', 'oceans', 'are', '75', 'percent', 'of', 'the', 'planet', '.', '\n']


In [40]:
def tokenize_de(text):
    """
    Tokenizes German text from a string into a list of strings
    """
    return [tok.text for tok in nlp_de.tokenizer(text)]

def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings
    """
    return [tok.text for tok in nlp_en.tokenizer(text)]

In [41]:
SRC = Field(tokenize = tokenize_de, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True, 
            batch_first = True)

TRG = Field(tokenize = tokenize_en, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True, 
            batch_first = True)

In [42]:
train_data, valid_data, test_data = IWSLT.splits(exts = ('.de', '.en'), 
                                                 fields = (SRC, TRG), root='data')

In [47]:
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)

In [44]:
print_data_info(train_data, valid_data, test_data, SRC, TRG)

Data set sizes (number of sentence pairs):
train 196884
valid 993
test 1305 

First training example:
src: david gallo : das ist bill lange . ich bin dave gallo .
trg: david gallo : this is bill lange . i 'm dave gallo . 

Most common words (src):
         ,     277689
         .     201700
       und      99754
       die      91280
       sie      61033
       das      58977
       ich      58472
       der      57421
       ist      51463
       wir      49321 

Most common words (trg):
         ,     234067
         .     194932
       the     162091
       and     115840
        to      95881
        of      89481
         a      81801
      that      73082
         i      63755
        in      60638 

First 10 words (src):
00 <unk>
01 <pad>
02 <sos>
03 <eos>
04 ,
05 .
06 und
07 die
08 sie
09 das 

First 10 words (trg):
00 <unk>
01 <pad>
02 <sos>
03 <eos>
04 ,
05 .
06 the
07 and
08 to
09 of 

Number of German words (types): 56378
Number of English words (types): 32772 
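
The `BucketIterator` imported at the top is not exercised in this notebook. As a simplified, deterministic sketch of the idea behind it (grouping sentences of similar length so that little padding is needed), the code below sorts index sequences by length, slices them into batches, and pads each batch to its longest member. The helpers `pad_batch` and `bucket_batches` are hypothetical; `pad_id=1` mirrors the `<pad>` index shown in the vocabulary listings above, and the rows-as-sentences layout corresponds to `batch_first = True`.

```python
def pad_batch(batch, pad_id=1):
    # Pad every sequence in the batch to the length of the longest one.
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

def bucket_batches(sequences, batch_size, pad_id=1):
    # Sort by length so each batch holds similarly sized sequences,
    # then cut the sorted list into consecutive batches and pad each one.
    ordered = sorted(sequences, key=len)
    return [pad_batch(ordered[i:i + batch_size], pad_id)
            for i in range(0, len(ordered), batch_size)]

# Made-up index sequences; 2 and 3 play the role of <sos> and <eos>.
seqs = [[2, 5, 3], [2, 7, 8, 9, 4, 3], [2, 6, 4, 3], [2, 9, 8, 7, 6, 3]]
batches = bucket_batches(seqs, batch_size=2)
for batch in batches:
    print(batch)
```

In this toy example only one padding index is needed in total, whereas batching the sequences in their original order would require five.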



References:

https://bastings.github.io/annotated_encoder_decoder/ <br>
https://github.com/bentrevett/pytorch-seq2seq/blob/master/1%20-%20Sequence%20to%20Sequence%20Learning%20with%20Neural%20Networks.ipynb