# Optional

**Run the cells below only if the tokenizer needs to be trained. A trained tokenizer can be cloned from Hugging Face directly.**

### The purpose of this file is to create a corpus of English, Hindi and Kannada data for training a translation model. The saved corpus will be used to train a SentencePiece model and then a transformer model for translation.

In [1]:
import pandas as pd
from pathlib import Path
from torch.nn.utils.rnn import pad_sequence
from en_indic_transformer import TranslationDataset, TranslationDataLoader, Tokenizer

In [2]:
path = Path()
base_dir = path.absolute().parent

In [3]:
load_dir = base_dir / 'data'

Load the saved English-to-Hindi and English-to-Kannada data from their respective CSV files.

In [4]:
en_hindi_file = load_dir / 'en_hindi.csv'
en_kannada_file = load_dir / 'en_kannada.csv'

In [5]:
en_hindi_file

PosixPath('/Users/sameergururajmathad/en-indic-transformer/data/en_hindi.csv')

In [6]:
en_kannada_file

PosixPath('/Users/sameergururajmathad/en-indic-transformer/data/en_kannada.csv')

In [7]:
en_hindi_df = pd.read_csv(en_hindi_file)
en_kannada_df = pd.read_csv(en_kannada_file)

In [8]:
en_hindi_source = en_hindi_df["english_sentence"].tolist()
en_hindi_target = en_hindi_df["hindi_sentence"].tolist()
en_kannada_source = en_kannada_df["english_sentence"].tolist()
en_kannada_target = en_kannada_df["kannada_sentence"].tolist()

Combine all the data into a single list to form the corpus.

In [9]:
corpus = []

corpus.extend(en_hindi_source)
corpus.extend(en_hindi_target)
corpus.extend(en_kannada_source)
corpus.extend(en_kannada_target)
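
If either CSV contains empty cells, `pd.read_csv` loads them as `NaN` floats, and a later `'\n'.join(corpus)` would fail on them. A minimal sketch of a guard, assuming a hypothetical helper `build_corpus` that is not part of `en_indic_transformer`:

```python
def build_corpus(*columns):
    """Flatten several sentence lists into one list, skipping unusable entries.

    Hypothetical helper for illustration; not part of en_indic_transformer.
    """
    corpus = []
    for column in columns:
        for sentence in column:
            # pandas stores empty CSV cells as float NaN; skip anything that is not text.
            if not isinstance(sentence, str):
                continue
            sentence = sentence.strip()
            if sentence:
                corpus.append(sentence)
    return corpus


example = build_corpus(
    ["This profile exists already.", float("nan")],
    ["  Halo with Ornamental Borde  ", ""],
)
print(example)
```

In this notebook the call would take the four lists built in the previous cell, in the same order as the `extend` calls.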

In [10]:
corpus

["When it is said to him: 'Fear Allah' egotism takes him in his sin. Gehenna (Hell) shall be enough for him. How evil a cradling!",
 'This profile exists already.',
 'Halo with Ornamental Borde',
 'and the jinn We had created before from flaming fire.',
 'Ladies and Gentlemen, the Government of India proposes to launch a new Urban Development Mission to support states by handholding them in building infrastructure and services in step with the rapid pace of urbanization.',
 "Have you then considered Al - Lat, and Al - 'Uzza (two idols of the pagan Arabs).",
 'Escalation in demand will provide traders an opportunity to increase the price.',
 'He understood the pity and the beauty of life and looked upon himself together with every other living creature as forming a single sympathetic cadence in the poem of creation.',
 'Fast Track Court - The Additional Sessions Court formed for settlement of long - standing crimes and under trial cases quickly.',
 'He was working as a quality controlle

Save the processed data to a text file to be used for training the SentencePiece model.

In [11]:
corpus_save_dir = base_dir / 'data'
tokenizer_save_dir = base_dir / 'tokenizer'

Check whether the directory exists and create it if it does not.

In [12]:
# exist_ok=True makes a separate existence check unnecessary
corpus_save_dir.mkdir(parents=True, exist_ok=True)

Save the corpus to a text file if it is not already present.

In [13]:
save_file = corpus_save_dir / 'tokenizer_corpus.txt'

In [14]:
if not save_file.exists():
    with open(save_file, 'w', encoding='utf-8') as file:
        file.write('\n'.join(corpus))
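
SentencePiece reads its training corpus one sentence per line, so a sentence that contains an embedded newline would be split into two training lines. A minimal sketch of a writer that avoids this, assuming a hypothetical helper `write_corpus` (not part of `en_indic_transformer`) and a temporary file in place of the real `save_file`:

```python
import tempfile
from pathlib import Path


def write_corpus(sentences, path):
    """Write one sentence per line, collapsing internal whitespace runs.

    Hypothetical helper for illustration; not part of en_indic_transformer.
    """
    with open(path, "w", encoding="utf-8") as file:
        for sentence in sentences:
            # str.split() with no arguments also splits on newlines and tabs,
            # so each sentence ends up on exactly one line.
            file.write(" ".join(sentence.split()) + "\n")


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "tokenizer_corpus.txt"
    write_corpus(["first line\nstill the first sentence", "second  sentence"], demo_file)
    lines = demo_file.read_text(encoding="utf-8").splitlines()

print(lines)
```

Collapsing whitespace changes the text slightly, so whether it is appropriate depends on how the same sentences are preprocessed at translation time.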

Train the tokenizer. It requires a few parameters, such as the input file, the model prefix and the vocabulary size.

In [15]:
vocab_size = 50_000
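# directory plus file stem for the trained tokenizer; 'tokenizer.model' under this prefix is loaded later
model_prefix = tokenizer_save_dir / 'tokenizer'
# special tokens reserved in the vocabulary: an end-of-text marker and one tag per language
user_defined_symbols = {'<|endoftext|>', '<|english|>', '<|hindi|>', '<|kannada|>'}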

In [16]:
Tokenizer.train(corpus_path=str(save_file),
                save_path=str(model_prefix),
                vocab_size=vocab_size, 
                user_defined_symbols=user_defined_symbols, 
                model_type='unigram', 
                split_by_whitespace=False)

Training SentencePiece on the given data.


sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: /Users/sameergururajmathad/en-indic-transformer/data/tokenizer_corpus.txt
  input_format: 
  model_prefix: /Users/sameergururajmathad/en-indic-transformer/tokenizer/tokenizer
  model_type: UNIGRAM
  vocab_size: 50000
  self_test_sample_size: 0
  character_coverage: 1
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 0
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  user_defined_symbols: <|kannada|>
  user_defined_symbols: <|hindi|>
  user_defined_symbols: <|endoftext|>
  user_defined_symbols: <|english|>
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_c
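
`Tokenizer.train` belongs to `en_indic_transformer`, and its internals are not shown here. The log above comes from SentencePiece, though, so as a rough sketch, and assuming the wrapper forwards its arguments more or less directly, an equivalent direct call might be set up as follows. The helper `trainer_kwargs` is hypothetical. It also sorts the special symbols: a Python set has no guaranteed iteration order, and SentencePiece reserves IDs for user-defined symbols in the order it receives them (the log lists them as kannada, hindi, endoftext, english).

```python
def trainer_kwargs(corpus_path, model_prefix, vocab_size, user_defined_symbols):
    """Collect keyword arguments for sentencepiece.SentencePieceTrainer.train.

    Hypothetical helper for illustration; the real Tokenizer.train may differ.
    """
    return {
        "input": str(corpus_path),
        "model_prefix": str(model_prefix),
        "vocab_size": vocab_size,
        # Sorting fixes the order of the symbols, which a plain set
        # does not guarantee from one run to the next.
        "user_defined_symbols": sorted(user_defined_symbols),
        "model_type": "unigram",
        "split_by_whitespace": False,
    }


kwargs = trainer_kwargs(
    "data/tokenizer_corpus.txt",
    "tokenizer/tokenizer",
    50_000,
    {"<|endoftext|>", "<|english|>", "<|hindi|>", "<|kannada|>"},
)
print(kwargs["user_defined_symbols"])

# Uncomment to train (requires the sentencepiece package and the corpus file):
# import sentencepiece as spm
# spm.SentencePieceTrainer.train(**kwargs)
```

Retraining with a different symbol order would change the special-token IDs, so any tensors printed from an earlier tokenizer would no longer match.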

In [17]:
tokenizer = Tokenizer(str(tokenizer_save_dir/'tokenizer.model'))

In [18]:
txt_en = "The quick brown fox jumps over the lazy dog."
txt_hi = "मुझे हिन्दी बहुत पसंद है।"
txt_kn = "ನನಗೆ ಕನ್ನಡ ತುಂಬಾ ಇಷ್ಟ."

In [19]:
dataset = TranslationDataset(
    src=en_hindi_source,
    target=en_hindi_target,
    tokenizer=tokenizer,
    src_prepend_value='<|english|>',
    target_prepend_value='<|hindi|>',
    endoftext='<|endoftext|>',
)

In [20]:
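def custom_collate_fn(batch):
    """Pad a batch of (source, target_in, target_out) tensors to a common length."""
    # Pad inputs with the '<|endoftext|>' ID so the value stays inside the 50,000-token vocabulary.
    pad_val = tokenizer.get_piece_id('<|endoftext|>')
    sources, target_ins, target_outs = [], [], []

    for source, target_in, target_out in batch:
        sources.append(source)
        target_ins.append(target_in)
        target_outs.append(target_out)

    source_padded = pad_sequence(sources, batch_first=True, padding_value=pad_val)
    target_in_padded = pad_sequence(target_ins, batch_first=True, padding_value=pad_val)
    # -100 marks label positions that the loss should ignore.
    target_out_padded = pad_sequence(target_outs, batch_first=True, padding_value=-100)

    return source_padded, target_in_padded, target_out_padded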

In [21]:
# dataloader = DataLoader(dataset=dataset, batch_size=16, shuffle=True,collate_fn=custom_collate_fn)
dataloader = TranslationDataLoader(dataset=dataset, batch_size=16, shuffle=True, pad_val=tokenizer.get_piece_id('<|endoftext|>'), ignore_index=-100)
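
The `pad_val` and `ignore_index` arguments control how sequences of different lengths are brought to a common length within a batch. The internals of `TranslationDataLoader` are not shown here, so the following is only a pure-Python sketch of right-padding, mirroring what `pad_sequence(..., batch_first=True)` does for 1-D sequences. The helper `pad_batch` and the toy token IDs are made up for illustration.

```python
def pad_batch(sequences, padding_value):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [padding_value] * (max_len - len(seq)) for seq in sequences]


# Toy token IDs: 5 stands in for the '<|endoftext|>' ID, -100 for the ignored label.
target_ins = [[4, 11, 12, 13], [4, 21]]
target_outs = [[11, 12, 13, 5], [21, 5]]

padded_ins = pad_batch(target_ins, padding_value=5)
padded_outs = pad_batch(target_outs, padding_value=-100)
print(padded_ins)   # [[4, 11, 12, 13], [4, 21, 5, 5]]
print(padded_outs)  # [[11, 12, 13, 5], [21, 5, -100, -100]]
```

The value -100 is the default `ignore_index` of `torch.nn.CrossEntropyLoss`, so label positions padded with it do not contribute to the loss.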

In [22]:
data = iter(dataloader)

In [23]:
first = next(data)

In [24]:
# source = list(first[0][2])
# target_in = list(first[1][2])
# target_out = list(first[2][2])

source = first[0][2]
target_in = first[1][2]
target_out = first[2][2]

In [25]:
target_in, target_out

(tensor([    4, 13395,     8,  1760,    28, 20896,  1786,  5204,  8084,   342,
          4144,     8,   578,   610, 15884,  2324,   110,     7, 14686,  1069,
           204, 23170,  4483, 44341,    29,  5204, 36482, 39844,   874, 11440,
         25894,    17,   764,  5682,   909, 13142, 14179,  2127,     7,     5,
             5]),
 tensor([13395,     8,  1760,    28, 20896,  1786,  5204,  8084,   342,  4144,
             8,   578,   610, 15884,  2324,   110,     7, 14686,  1069,   204,
         23170,  4483, 44341,    29,  5204, 36482, 39844,   874, 11440, 25894,
            17,   764,  5682,   909, 13142, 14179,  2127,     7,     5,  -100,
          -100]))

In [26]:
tokenizer.decode(target_in), tokenizer.decode(target_out)

('<|hindi|> जाहिर है, दिल्ली की बै क क्रिकेट को लेकर कतई नहीं थी, उसका मकसद दूसरा था. हते भर तक स्वायत्तता और जवाबदेही को लेकर मगजमारी करने के बाद बोर्ड़ और खेल मंत्रालय के बीच तनाव में कमी आई.<|endoftext|><|endoftext|>',
 'जाहिर है, दिल्ली की बै क क्रिकेट को लेकर कतई नहीं थी, उसका मकसद दूसरा था. हते भर तक स्वायत्तता और जवाबदेही को लेकर मगजमारी करने के बाद बोर्ड़ और खेल मंत्रालय के बीच तनाव में कमी आई.<|endoftext|>')

Check that the target input and target output have the same length.

In [27]:
len(target_in), len(target_out)

(41, 41)
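
Equal lengths are necessary but not sufficient. The tensors printed in cell 25 also show that `target_out` is `target_in` shifted left by one position, with `-100` where the input was padded. A minimal sketch of a check for that relationship, assuming a hypothetical helper `is_shifted_pair` that works on plain lists (a tensor can be converted with `.tolist()`):

```python
def is_shifted_pair(target_in, target_out, ignore_index=-100):
    """Check that target_out[i] == target_in[i + 1] wherever a label is present."""
    if len(target_in) != len(target_out):
        return False
    # The final label has no following input position, so it is not compared.
    for i, label in enumerate(target_out[:-1]):
        if label != ignore_index and label != target_in[i + 1]:
            return False
    return True


# Shortened toy version of the tensors above (4 = '<|hindi|>', 5 = '<|endoftext|>').
toy_in = [4, 13395, 8, 7, 5, 5]
toy_out = [13395, 8, 7, 5, -100, -100]

ok = is_shifted_pair(toy_in, toy_out)
bad = is_shifted_pair([4, 1, 2], [2, 1, -100])
print(ok, bad)
```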

In [28]:
# tokenizer.decode([id for id in source if id != -100])
tokenizer.decode(source)

'<|english|> Naturally the Delhi meeting was not about cricket at all, it served a different purpose: it established a detente of sorts between the board and the Sports Ministry after a week of sniping about autonomy and accountability.<|endoftext|>'

In [29]:
# tokenizer.decode([id for id in target_in if id != -100])
tokenizer.decode(target_in)

'<|hindi|> जाहिर है, दिल्ली की बै क क्रिकेट को लेकर कतई नहीं थी, उसका मकसद दूसरा था. हते भर तक स्वायत्तता और जवाबदेही को लेकर मगजमारी करने के बाद बोर्ड़ और खेल मंत्रालय के बीच तनाव में कमी आई.<|endoftext|><|endoftext|>'

In [30]:
# tokenizer.decode([id for id in target_out if id != -100])
tokenizer.decode(target_out)

'जाहिर है, दिल्ली की बै क क्रिकेट को लेकर कतई नहीं थी, उसका मकसद दूसरा था. हते भर तक स्वायत्तता और जवाबदेही को लेकर मगजमारी करने के बाद बोर्ड़ और खेल मंत्रालय के बीच तनाव में कमी आई.<|endoftext|>'
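
The decoded `target_in` ends in a repeated `<|endoftext|>` because the padding ID is the same as the end-of-text ID, and `target_out` carries `-100` labels that are not real tokens. The commented-out lines in the cells above hint at filtering these before decoding. A minimal sketch, assuming a hypothetical helper `strip_padding` that works on plain lists of IDs:

```python
def strip_padding(ids, pad_id, ignore_index=-100):
    """Drop ignored labels and collapse a run of trailing pad tokens to one."""
    kept = [i for i in ids if i != ignore_index]
    # The pad ID doubles as the end-of-text ID, so one trailing token is kept
    # for display; the rest of the run is removed.
    while len(kept) >= 2 and kept[-1] == pad_id and kept[-2] == pad_id:
        kept.pop()
    return kept


# Shortened toy version of the tensors above (5 = '<|endoftext|>').
clean_in = strip_padding([4, 13395, 8, 7, 5, 5], pad_id=5)
clean_out = strip_padding([13395, 8, 7, 5, -100, -100], pad_id=5)
print(clean_in)   # [4, 13395, 8, 7, 5]
print(clean_out)  # [13395, 8, 7, 5]
```

In the notebook this might be used as `tokenizer.decode(strip_padding(target_in.tolist(), pad_id=tokenizer.get_piece_id('<|endoftext|>')))`, assuming `decode` accepts a plain list of IDs.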