# **Hybrid Neural Machine Translation for HimangiY**
#### Vandan Mujadia, Dipti Misra Sharma
#### LTRC, IIIT-Hyderabad, Hyderabad

This notebook demonstrates how to train a sequence-to-sequence (seq2seq) model for Kannada-to-Hindi translation, **roughly** based on [Attention Is All You Need](https://arxiv.org/abs/1706.03762) (Vaswani et al., 2017).

## An Example to Understand Sequence-to-Sequence Processing Using a Transformer Network

<img src="https://www.tensorflow.org/images/tutorials/transformer/apply_the_transformer_to_machine_translation.gif" alt="Applying the Transformer to machine translation">

Source: [Google AI Blog](https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html)



## Applying the Transformer to machine translation.


<table>
<tr>
  <td>
   <img width=400 src="https://miro.medium.com/max/720/1*57LYNxwBGcCFFhkOCSnJ3g.png"/>
  </td>
</tr>
<tr>
  <th colspan=1>This tutorial: an encoder/decoder connected by a self-attention neural network.</th>
</tr>
</table>
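To make the self-attention idea in the figure concrete, here is a minimal pure-Python sketch of scaled dot-product attention, the core operation of the Transformer. It is an illustration only and not how OpenNMT-py implements it; the function names and the toy vectors are assumptions made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Return one output vector per query.

    queries, keys and values are lists of float vectors;
    keys and values must contain the same number of rows.
    """
    d_k = len(keys[0])
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(dim)])
    return outputs

# Toy input: three token vectors attending to themselves (self-attention).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = scaled_dot_product_attention(tokens, tokens, tokens)
for row in attended:
    print([round(x, 3) for x in row])
```

In self-attention the queries, keys and values all come from the same sequence, which is why the toy call passes `tokens` three times.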

# Tools used in this notebook

*   Library: OpenNMT-py
*   Library: PyTorch-based neural network implementation


In [None]:
!pip install -U pip
!git clone https://github.com/OpenNMT/OpenNMT-py
! ls
%cd OpenNMT-py
!git checkout 1.2.0
!pip3 install torchtext==0.4.0 torch==1.11.0

Collecting pip
  Downloading pip-23.2.1-py3-none-any.whl (2.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 20.9 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-23.2.1
OpenNMT-py  sample_data
/content/OpenNMT-py
Note: switching to '1.2.0'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is 

In [None]:
%cd /content/

/content


# Check GPU

In [None]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found
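The output above shows that `nvidia-smi` was not found, which means this runtime has no GPU attached, while the training command later in the notebook passes `-gpu_ranks 0`. A small sketch like the following can check for a GPU before starting a long run. The helper name `gpu_available` is an assumption for this example, and the check only looks for the `nvidia-smi` binary.

```python
import shutil
import subprocess

def gpu_available():
    """Return True if the nvidia-smi binary exists and exits successfully."""
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    return result.returncode == 0

has_gpu = gpu_available()
if has_gpu:
    print("GPU detected")
else:
    print("No GPU detected: switch to a GPU runtime before training")
```

On Colab, the runtime type can usually be changed from the Runtime menu; after switching, the installation cells need to be run again.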


# Tokenizer Tool

In [None]:
!pip install git+https://github.com/vmujadia/tokenizer.git --upgrade

Collecting git+https://github.com/vmujadia/tokenizer.git
  Cloning https://github.com/vmujadia/tokenizer.git to /tmp/pip-req-build-3lao7iuu
  Running command git clone --filter=blob:none --quiet https://github.com/vmujadia/tokenizer.git /tmp/pip-req-build-3lao7iuu
  Resolved https://github.com/vmujadia/tokenizer.git to commit 93cd09b81702108a51c08c9796fd1cc941a1b98b
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: IL-Tokenizer
  Building wheel for IL-Tokenizer (setup.py) ... done
  Created wheel for IL-Tokenizer: filename=IL_Tokenizer-0.0.2-py3-none-any.whl size=7225 sha256=1bec4df8b3d0a8ca3a48367f72deb4b2d68623a782699f27934083cfbaa6b959
  Stored in directory: /tmp/pip-ephem-wheel-cache-624d680m/wheels/9a/fb/5b/3d75bfde8561726121c09f0f0a83389c05312df8a513808c41
Successfully built IL-Tokenizer
Installing collected packages: IL-Tokenizer
Successfully installed IL-Tokenizer-0.0.2

# To Clean and Filter Parallel Corpora

In [None]:
!git clone https://github.com/moses-smt/mosesdecoder.git

Cloning into 'mosesdecoder'...
remote: Enumerating objects: 148097, done.
remote: Counting objects: 100% (525/525), done.
remote: Compressing objects: 100% (229/229), done.
remote: Total 148097 (delta 323), reused 441 (delta 292), pack-reused 147572
Receiving objects: 100% (148097/148097), 129.88 MiB | 10.32 MiB/s, done.
Resolving deltas: 100% (114349/114349), done.


# To tackle the vocabulary issue: subword algorithm

In [None]:
!git clone https://github.com/rsennrich/subword-nmt.git

Cloning into 'subword-nmt'...
remote: Enumerating objects: 597, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 597 (delta 8), reused 12 (delta 4), pack-reused 576
Receiving objects: 100% (597/597), 252.23 KiB | 3.11 MiB/s, done.
Resolving deltas: 100% (357/357), done.


In [None]:
!ls mosesdecoder/scripts/training/clean-corpus-n.perl

mosesdecoder/scripts/training/clean-corpus-n.perl


# Training Corpora for This Tutorial

## Kannada - Hindi
## (small corpus developed by MIT+CDAC-B)

In [None]:
! wget -O train.src https://ssmt.iiit.ac.in/uploads/data_mining/kannada-hindi_combined_all.hi
! wget -O train.tgt https://ssmt.iiit.ac.in/uploads/data_mining/kannada-hindi_combined_all.kn
! wget -O valid.src https://ssmt.iiit.ac.in/uploads/data_mining/flores200-dev.hi
! wget -O valid.tgt https://ssmt.iiit.ac.in/uploads/data_mining/flores200-dev.kn

--2023-09-23 05:22:08--  https://ssmt.iiit.ac.in/uploads/data_mining/kannada-hindi_combined_all.hi
Resolving ssmt.iiit.ac.in (ssmt.iiit.ac.in)... 196.12.53.52
Connecting to ssmt.iiit.ac.in (ssmt.iiit.ac.in)|196.12.53.52|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4195974 (4.0M) [application/octet-stream]
Saving to: ‘train.src’


2023-09-23 05:22:18 (515 KB/s) - ‘train.src’ saved [4195974/4195974]

--2023-09-23 05:22:18--  https://ssmt.iiit.ac.in/uploads/data_mining/kannada-hindi_combined_all.kn
Resolving ssmt.iiit.ac.in (ssmt.iiit.ac.in)... 196.12.53.52
Connecting to ssmt.iiit.ac.in (ssmt.iiit.ac.in)|196.12.53.52|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3796364 (3.6M) [application/octet-stream]
Saving to: ‘train.tgt’


2023-09-23 05:22:23 (1.03 MB/s) - ‘train.tgt’ saved [3796364/3796364]

--2023-09-23 05:22:23--  https://ssmt.iiit.ac.in/uploads/data_mining/flores200-dev.hi
Resolving ssmt.iiit.ac.in (ssmt.iiit.ac.in)... 196.

# Data Numbers

In [None]:
print ('Data Stats')
! wc -l train.*
! wc -l valid.*

Data Stats
  15877 train.src
  15877 train.tgt
  31754 total
   997 valid.src
   997 valid.tgt
  1994 total
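A parallel corpus is only usable if both sides have the same number of lines, which is what the `wc -l` output confirms. The same check can be sketched in Python so that a mismatch stops the notebook early. The helper names `count_lines` and `check_parallel` are assumptions for this example, and the demo runs on throwaway files; in the notebook you would pass `train.src` and `train.tgt` instead.

```python
import os
import tempfile

def count_lines(path):
    """Count the lines of a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)

def check_parallel(src_path, tgt_path):
    """Return the number of sentence pairs, or raise if the sides differ."""
    n_src = count_lines(src_path)
    n_tgt = count_lines(tgt_path)
    if n_src != n_tgt:
        raise ValueError(
            f"{src_path} has {n_src} lines but {tgt_path} has {n_tgt}")
    return n_src

# Demo on throwaway files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "demo.src")
    tgt_path = os.path.join(tmp, "demo.tgt")
    with open(src_path, "w", encoding="utf-8") as f:
        f.write("first source line\nsecond source line\n")
    with open(tgt_path, "w", encoding="utf-8") as f:
        f.write("first target line\nsecond target line\n")
    n_pairs = check_parallel(src_path, tgt_path)

print(n_pairs, "sentence pairs")
```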


# Tokenize the text

In [None]:
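from ilstokenizer import tokenizer

def to_tokenize_and_lower(input_path, output_path):
  """Tokenize each line of input_path, lowercase it, and write it to output_path."""
  # Explicit UTF-8 so Kannada and Hindi text is read and written consistently.
  with open(input_path, encoding='utf-8') as infile, \
       open(output_path, 'w', encoding='utf-8') as outfile:
    for line in infile:
      line = tokenizer.tokenize(line.strip()).lower()
      outfile.write(line + '\n')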

In [None]:
to_tokenize_and_lower('train.src','train.src.tkn')
to_tokenize_and_lower('train.tgt','train.tgt.tkn')

to_tokenize_and_lower('valid.src','valid.src.tkn')
to_tokenize_and_lower('valid.tgt','valid.tgt.tkn')

In [None]:
! cat train.src.tkn > train.all.tkn
! cat train.tgt.tkn >> train.all.tkn

# Data Cleaning

In [None]:
! perl mosesdecoder/scripts/training/clean-corpus-n.perl -ratio 2.5 train src.tkn tgt.tkn train_filtered 1 250

clean-corpus.perl: processing train.src.tkn & .tgt.tkn to train_filtered, cutoff 1-250, ratio 2.5
.
Input sentences: 15878  Output sentences:  15807
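The cleaning command above keeps sentence pairs whose token counts lie between 1 and 250 and whose length ratio does not exceed 2.5. The rule can be sketched in Python as follows. This is a simplified approximation of what `clean-corpus-n.perl` does, not a port of the script; the function name `keep_pair` and the toy pairs are assumptions for this example.

```python
def keep_pair(src_line, tgt_line, min_len=1, max_len=250, max_ratio=2.5):
    """Return True if a sentence pair passes the length and ratio filters."""
    src_len = len(src_line.split())
    tgt_len = len(tgt_line.split())
    # Empty sides are always dropped (this also avoids division by zero).
    if src_len == 0 or tgt_len == 0:
        return False
    if src_len < min_len or tgt_len < min_len:
        return False
    if src_len > max_len or tgt_len > max_len:
        return False
    # Drop pairs where one side is much longer than the other.
    if src_len / tgt_len > max_ratio or tgt_len / src_len > max_ratio:
        return False
    return True

pairs = [
    ("a b c", "x y z"),      # balanced lengths
    ("a", "x y z"),          # ratio 3.0 exceeds 2.5
    ("", "x"),               # empty source side
    ("a b", "x y z w v"),    # ratio exactly 2.5
]
kept = [pair for pair in pairs if keep_pair(*pair)]
print(len(kept), "of", len(pairs), "pairs kept")
```

Very uneven pairs are often misaligned or partial translations, so removing them tends to give the model cleaner training signal.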


In [None]:
print ('Data Stats')
! wc -l train*
! wc -l valid*

Data Stats
   31756 train.all.tkn
   15807 train_filtered.src.tkn
   15807 train_filtered.tgt.tkn
   15877 train.src
   15878 train.src.tkn
   15877 train.tgt
   15878 train.tgt.tkn
  126880 total
    997 valid.src
    997 valid.src.tkn
    997 valid.tgt
    997 valid.tgt.tkn
   3988 total


In [None]:
print ('Data Stats')
! wc -l train*
! wc -l valid*

Data Stats
   31756 train.all.tkn
   15807 train_filtered.src.tkn
   15807 train_filtered.src.tkn.pos
   15807 train_filtered.src.tkn.posword
   15807 train_filtered.tgt.tkn
   15877 train.src
   15878 train.src.tkn
   15877 train.tgt
   15878 train.tgt.tkn
  158494 total
    997 valid.src
    997 valid.src.tkn
    997 valid.tgt
    997 valid.tgt.tkn
   3988 total


# Train the subword model
## Experiment with the number of subword merge operations

In [None]:
!python subword-nmt/subword_nmt/learn_bpe.py -s 5000 < train.all.tkn > train.codes

100% 5000/5000 [00:23<00:00, 211.36it/s]


# What the subword codes look like

In [None]:
! head -n 10 train.codes

#version: 0.2
ತ ್
ನ ್
क े</w>
ಲ ್
् र
ಾ ಗ
ಗ ಳ
ತ್ ತ
ಕ ್
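Each line of the codes file is one merge operation: the two symbols on the line are joined into a new symbol, and the most frequent pairs come first. The toy sketch below shows how such merges can be learned by repeatedly merging the most frequent adjacent pair. It is a simplified illustration in the spirit of byte pair encoding, not the `learn_bpe.py` implementation; the function names and the English toy vocabulary are assumptions for this example.

```python
import collections
import re

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with one merged symbol."""
    bigram = re.escape(" ".join(pair))
    # The lookarounds ensure only whole symbols are matched.
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Return the list of merges, most frequent pair first."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

# Words are split into characters, with </w> marking the end of a word.
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
merges = learn_bpe(toy_vocab, 3)
for left, right in merges:
    print(left, right)
```

The `-s 5000` option used above sets the number of merges; fewer merges give shorter subword units and a smaller vocabulary, more merges give units closer to full words.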


# Apply the subword model to the corpus

In [None]:
!python subword-nmt/subword_nmt/apply_bpe.py -c train.codes < train_filtered.src.tkn > train.kn
!python subword-nmt/subword_nmt/apply_bpe.py -c train.codes < train_filtered.tgt.tkn > train.hi

!python subword-nmt/subword_nmt/apply_bpe.py -c train.codes < valid.src.tkn > valid.kn
!python subword-nmt/subword_nmt/apply_bpe.py -c train.codes < valid.tgt.tkn > valid.hi

# The training corpus after subword segmentation

In [None]:
! head -n 10 train.kn

ಪ್ರ@@ ವಾ@@ ಹ ಪೀ@@ ಡಿತ ಕ@@ ಲ್ಲ@@ ಿದ್ದ@@ ಲು ಗಣ@@ ಿಗಳನ್ನು s@@ e@@ c@@ l ಕೈ@@ ಬಿಡ@@ ಬೇಕಾ@@ ಯಿತು .
ಹಣಕಾ@@ ಸು ಸಚಿ@@ ವಾ@@ ಲಯ@@ ವು ರಾಯ@@ ಧ@@ ನ ಮೇಲಿನ ತನ್ನ ಕೈ@@ ಬಿ@@ ಟ್ಟ ಹಕ್ಕ@@ ನ್ನು ಹೊ@@ ಸದ@@ ಾಗಿ ಪ್ರಸ್ತು@@ ತ@@ ಪಡಿಸಲು ನಿರ್ಧರಿಸ@@ ಿದೆ .
ಕಂಪ@@ ನಿಯ ತೊ@@ ರೆ@@ ದ ಆ@@ ಸ್@@ ತಿಯನ್ನು ಹ@@ ರಾಜ@@ ು ಮಾಡಲು ಸರ್ಕಾರ ನಿರ್ಧರಿಸ@@ ಿದೆ .
ವಿ@@ ಮಾನ ನಿ@@ ಲ್@@ ದಾ@@ ಣ@@ ದಲ್ಲಿ ಕೈ@@ ಬಿಡ@@ ಲಾದ ಸರ@@ ಕು@@ ಗಳನ್ನು ನಿರ್ವಹ@@ ಿಸುವಲ್ಲಿ ಸಮಸ್ಯೆ ಇದೆ .
ಶಿ@@ ಥ@@ ಿ@@ ಲ@@ ಗೊಂಡ ಕಟ್ಟಡ@@ ಗಳ ತ್ಯ@@ ಜ@@ ಿಸುವ ಪ್ರಕ್ರಿಯ@@ ೆಯನ್ನು ಎ@@ ಸ್ಟ@@ ೇಟ್ ಇಲಾಖೆ ಆರಂಭ@@ ಿಸಿದೆ .
ಮಾ@@ ಲಿ@@ ನ್ಯ@@ ವನ್ನು ಕೊ@@ ನ@@ ೆಗೊಳ@@ ಿಸಲು ಕೈಗೊಳ್ಳ@@ ಬೇಕು .
ಸರ್ಕಾರವು ಆ@@ ಮ@@ ದು ಸು@@ ಂಕ@@ ವನ್ನು ಕಡಿಮೆ ಮಾಡಲು ಕ್ರಮಗಳನ್ನು ತೆಗೆದುಕೊಳ್ಳ@@ ುತ್ತಿದೆ .
ಕಾನೂ@@ ನಿ@@ ನ ಈ ವಿಭಾಗ@@ ವನ್ನು ರ@@ ದ್ದ@@ ು@@ ಗೊಳಿಸಲಾಗಿದೆ .
ಸರ್ಕಾರಿ ಮನೆ@@ ಗಳ ಬಾ@@ ಡ@@ ಿಗೆ ಕಡ@@ ಿತ@@ ದ ಪ್ರಸ್ತಾ@@ ವ@@ ನೆ ಪರಿ@@ ಶೀ@@ ಲನ@@ ೆಯಲ್ಲ@@ ಿದೆ .
ಪ್ರ@@ ಯಾ@@ ಣ@@ ಿಕ ಸಾ@@ ರಿಗ@@ ೆಗೆ 60 % ರಿಯಾ@@ ಯ@@ ಿತ@@ ಿಯ@@ ೊಂದಿಗೆ 1@@ 2.@@ 3@@ 6 % ದ@@ ರದಲ್ಲಿ ತೆ@@ ರಿಗೆ ವಿಧ@@ ಿಸಲಾಗುತ್ತದೆ .


In [None]:
! head -n 10 train.hi

एस@@ ई@@ सी@@ एल को बा@@ ढ़ प्रभावित को@@ य@@ ला ख@@ दा@@ नों को छोड़@@ ना पड़ा ।
वित्@@ त मंत्रालय ने अपने राज@@ स्@@ व संबंधी छो@@ ड़े हुए दा@@ वे को नए सि@@ रे से पे@@ श करने का निर्णय लिया है ।
सरकार ने कंपनी की छो@@ ड़ी हुई परि@@ सं@@ पत्@@ तियों की नी@@ ला@@ मी का फै@@ स@@ ला किया है ।
वि@@ मान पत्@@ तन में छो@@ ड़े हुए कार्@@ ग@@ ो को रखने की समस्या आ रही है ।
सं@@ प@@ दा विभाग द्वारा ज@@ र्@@ जर भ@@ व@@ नों को छोड़@@ ने की प्रक्रिया शुरू की गई ।
प्र@@ दू@@ षण समाप्त करने के लिए कदम उठा@@ ने हैं ।
सरकार आ@@ या@@ त शु@@ ल्@@ क कम करने के लिए कदम उठ@@ ा रही है ।
कान@@ ून की इस धार@@ ा का उप@@ श@@ मन हो चु@@ का है ।
सरकारी आवा@@ सों के कि@@ रा@@ ये में कमी का प्रस्ताव वि@@ च@@ ारा@@ ध@@ ीन है ।
या@@ त्र@@ ी परि@@ वहन पर 60 % की छू@@ ट के साथ 1@@ 2.@@ 3@@ 6 % की दर से कर व@@ सू@@ ला जाता है ।
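In the segmented text, `@@` at the end of a unit means that the word continues in the next unit. Model output will use the same segmentation, so it has to be joined back into words before it is read or evaluated. A minimal sketch of that step, assuming the default `@@` separator; the function name `remove_bpe` is an assumption for this example.

```python
import re

def remove_bpe(line):
    """Join subword units by deleting the '@@ ' continuation markers."""
    return re.sub(r"(@@ )|(@@ ?$)", "", line)

segmented = "सरकार आ@@ या@@ त शु@@ ल्@@ क कम करने के लिए कदम उठ@@ ा रही है ।"
restored = remove_bpe(segmented)
print(restored)
```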



# Starting NMT Training
## Preprocessing stage: create dictionaries and prepare the corpora for parallel processing


In [None]:
!pip install torch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 configargparse

!python OpenNMT-py/preprocess.py \
	    -train_src train.kn \
	    -train_tgt train.hi \
	    -valid_src valid.kn \
	    -valid_tgt valid.hi \
	    -save_data processed -share_vocab -overwrite

Collecting configargparse
  Obtaining dependency information for configargparse from https://files.pythonhosted.org/packages/6f/b3/b4ac838711fd74a2b4e6f746703cf9dd2cf5462d17dac07e349234e21b97/ConfigArgParse-1.7-py3-none-any.whl.metadata
  Downloading ConfigArgParse-1.7-py3-none-any.whl.metadata (23 kB)
Downloading ConfigArgParse-1.7-py3-none-any.whl (25 kB)
Installing collected packages: configargparse
Successfully installed configargparse-1.7
[2023-09-23 05:40:19,687 INFO] Extracting features...
[2023-09-23 05:40:19,690 INFO]  * number of source features: 1.
[2023-09-23 05:40:19,690 INFO]  * number of target features: 0.
[2023-09-23 05:40:19,690 INFO] Building `Fields` object...
[2023-09-23 05:40:19,691 INFO] Building & saving training data...
[2023-09-23 05:40:19,870 INFO] Building shard 0.
[2023-09-23 05:40:21,946 INFO]  * saving 0th train data shard to processed.train.0.pt.
[2023-09-23 05:40:24,031 INFO]  * tgt vocab size: 2488.
[2023-09-23 05:40:24,038 INFO]  * src vocab size:
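The `-share_vocab` option asks the preprocessing step to build one dictionary for both languages, which fits here because the subword codes were learned on the combined Kannada and Hindi text. The sketch below illustrates the idea of a shared vocabulary. It is not OpenNMT-py's implementation; the function name, the special-token names and the toy lines are assumptions for this example.

```python
from collections import Counter

def build_shared_vocab(src_lines, tgt_lines, min_freq=1):
    """Map every token seen on either side to a single integer id."""
    counts = Counter()
    for line in list(src_lines) + list(tgt_lines):
        counts.update(line.split())
    # Illustrative special tokens first, then tokens by descending frequency.
    specials = ["<unk>", "<pad>", "<s>", "</s>"]
    tokens = [tok for tok, freq in counts.most_common() if freq >= min_freq]
    return {tok: idx for idx, tok in enumerate(specials + tokens)}

src_lines = ["a@@ b .", "c ."]
tgt_lines = ["x@@ y .", "a@@ z ."]
vocab = build_shared_vocab(src_lines, tgt_lines)
print(len(vocab), "entries; id of '.':", vocab["."])
```

With a shared dictionary, a token such as a number or punctuation mark gets the same id on the source and target side.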

In [None]:
!ls processed.*

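# Training
## Parameters to tune for your corpus and language pair

The values below describe a much smaller configuration than the one in the training command of the next cell: 2 layers, 128-dimensional embeddings and feed-forward layers, 2 attention heads, and dropout 0.3. The option names are written in fairseq style; in the OpenNMT-py command the corresponding quantities are set with `-layers`, `-rnn_size`, `-transformer_ff`, `-heads`, `-dropout` and `-train_steps`.

```
    --encoder-embed-dim 128 --encoder-ffn-embed-dim 128 \
    --encoder-layers 2 --encoder-attention-heads 2 \
    --decoder-embed-dim 128 --decoder-ffn-embed-dim 128 \
    --decoder-layers 2 --decoder-attention-heads 2 \
    --dropout 0.3 --weight-decay 0.0 \
    --max-update 4000 \
    --keep-last-epochs 10 \
```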



---



In [None]:
! python OpenNMT-py/train.py -data processed -save_model model.pt \
		-layers 6 -rnn_size 512 -src_word_vec_size 512 -tgt_word_vec_size 512 -transformer_ff 2048 -heads 8  \
		-encoder_type transformer -decoder_type transformer -position_encoding \
		-train_steps 200000  -max_generator_batches 2 -dropout 0.1 \
		-batch_size 4096 -batch_type tokens -normalization tokens  -accum_count 2 \
		-optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
		-max_grad_norm 0 -param_init 0  -param_init_glorot \
		-label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 \
		-world_size 1 -gpu_ranks 0
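The combination `-decay_method noam -warmup_steps 8000 -learning_rate 2` means the learning rate rises linearly for the first 8000 steps and then decays with the inverse square root of the step number. The sketch below shows the schedule as it is usually written for Transformer training, together with the number of tokens per update implied by `-batch_size 4096 -accum_count 2` on one GPU. The function name `noam_lr` is an assumption for this example, and the formula should be checked against the OpenNMT-py version in use.

```python
def noam_lr(step, model_size=512, warmup_steps=8000, learning_rate=2.0):
    """Learning rate at a given step (step >= 1) under the Noam schedule."""
    scale = model_size ** -0.5
    return learning_rate * scale * min(step ** -0.5,
                                       step * warmup_steps ** -1.5)

# The rate peaks at the end of warm-up and decays afterwards.
for step in (1, 4000, 8000, 16000, 200000):
    print(step, f"{noam_lr(step):.6f}")

# Tokens contributing to each parameter update on a single GPU.
batch_size_tokens = 4096
accum_count = 2
tokens_per_update = batch_size_tokens * accum_count
print("tokens per update:", tokens_per_update)
```

For a corpus of about 15,800 sentence pairs, 200000 training steps amounts to a very large number of passes over the data, so the validation scores reported every `-valid_steps` are a useful guide for stopping earlier.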