<a href="https://colab.research.google.com/github/sakhile-mabunda/Natural-Language-Processing/blob/master/starter_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Umsuka English - Isizulu Parallel Corpus

## Research question: How do tokenization strategies affect machine translation performance?

In [1]:
from google.colab import drive
drive.flush_and_unmount()
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
! rm -r joeynmt
! rm -r sample_data
! rm -r dev.bpe.en
! rm -r dev.bpe.zu
! rm -r dev.en
! rm -r dev.zu
! rm -r test.bpe.en
! rm -r test.bpe.zu
! rm -r test.en
! rm -r test.zu
! rm -r train.bpe.en
! rm -r train.bpe.zu
! rm -r train.en
! rm -r train.zu
! rm -r en-zu.eval.csv
! rm -r en-zu.training.csv
! rm -r tokenizer-trained.json
! rm -r train.uni.en
! rm -r train.uni.zu
! rm -r dev.uni.en
! rm -r dev.uni.zu
! rm -r test.uni.en
! rm -r test.uni.zu

rm: cannot remove 'sample_data': No such file or directory


In [3]:
import os

source_language = 'en'
target_language = 'zu' 
lc = False  # If True, lowercase the data.
seed = 42  # Random seed for shuffling.
tag = 'baseline' # Give a unique name to your folder - this is to ensure you don't rewrite any models you've already submitted

os.environ["src"] = source_language # Sets them in bash as well, since we often use bash scripts
os.environ["tgt"] = target_language
os.environ["tag"] = tag

# This will save it to a folder in our gdrive instead!
!mkdir -p "/content/drive/My Drive/nlp-project/$src-$tgt-$tag"
os.environ["gdrive_path"] = "/content/drive/My Drive/nlp-project/%s-%s-%s" % (source_language, target_language, tag)

In [4]:
!echo $gdrive_path

/content/drive/My Drive/nlp-project/en-zu-baseline


In [5]:
# from google.colab import files

# uploaded = files.upload()

! wget https://github.com/sakhile-mabunda/Natural-Language-Processing/blob/master/dataset/en-zu.training.csv

! wget https://github.com/sakhile-mabunda/Natural-Language-Processing/blob/master/dataset/en-zu.eval.csv

--2021-11-21 22:43:06--  https://github.com/sakhile-mabunda/Natural-Language-Processing/blob/master/dataset/en-zu.training.csv
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘en-zu.training.csv’

en-zu.training.csv      [  <=>               ] 159.46K   478KB/s    in 0.3s    

2021-11-21 22:43:07 (478 KB/s) - ‘en-zu.training.csv’ saved [163288]

--2021-11-21 22:43:08--  https://github.com/sakhile-mabunda/Natural-Language-Processing/blob/master/dataset/en-zu.eval.csv
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘en-zu.eval.csv’

en-zu.eval.csv          [ <=>                ] 855.44K  5.43MB/s    in 0.2s    

2021-11-21 22:43:08 (5.43 MB/s) - ‘en-zu.eval.csv’ saved [

In [6]:
import pandas as pd

train = pd.read_csv('en-zu.training.csv?download=1')

eval_df = pd.read_csv('en-zu.eval.csv?download=1')

## Pre-processing and export

It is generally a good idea to remove duplicate translations and conflicting translations from the corpus. In practice, these public corpora include some number of these that need to be cleaned.

In addition we will split our data into dev/test/train and export to the filesystem.

In [7]:
# drop duplicate translations
train = train.drop_duplicates()

# Shuffle the data to remove bias in dev set selection.
train = train.sample(frac=1, random_state=seed).reset_index(drop=True)

In [8]:
# Install fuzzy wuzzy to remove "almost duplicate" sentences in the
# test and training sets.
! pip install fuzzywuzzy
! pip install python-Levenshtein

import time
from fuzzywuzzy import process
import numpy as np
from os import cpu_count
from functools import partial
from multiprocessing import Pool

# reset the index of the training set after previous filtering
train.reset_index(drop=False, inplace=True)

# Remove samples from the training data set if they "almost overlap" with the
# samples in the test set.

# Filtering function. Adjust pad to narrow down the candidate matches to
# within a certain length of characters of the given sample.
def fuzzfilter(sample, candidates, pad):
  candidates = [x for x in candidates if len(x) <= len(sample)+pad and len(x) >= len(sample)-pad] 
  if len(candidates) > 0:
    return process.extractOne(sample, candidates)[1]
  else:
    return np.nan



In [9]:
# This section does the split between train/dev for the parallel corpora then saves them as separate files
# We use 1000 dev set and the given test set.
# Do the split between dev/train and create parallel corpora

num_dev_patterns = 600

# lower case the corpora - this will make it easier to generalize, but without proper casing.
if lc:
    train["en"] = train["en"].str.lower()
    train["zu"] = train["zu"].str.lower()

# dev and test sets
dev = eval_df.tail(num_dev_patterns) 
test = eval_df.drop(eval_df.tail(num_dev_patterns).index)

with open("train."+source_language, "w") as src_file, open("train."+target_language, "w") as trg_file:
  for index, row in train.iterrows():
    src_file.write(row["en"]+"\n")
    trg_file.write(row["zu"]+"\n")
    
with open("dev."+source_language, "w") as src_file, open("dev."+target_language, "w") as trg_file:
  for index, row in dev.iterrows():
    src_file.write(row["en"]+"\n")
    trg_file.write(row["zu"]+"\n")

with open("test."+source_language, "w") as src_file, open("test."+target_language, "w") as trg_file:
  for index, row in test.iterrows():
    src_file.write(row["en"]+"\n")
    trg_file.write(row["zu"]+"\n")

# Doublecheck the format below. There should be no extra quotation marks or weird characters.
! head train.*
! head dev.*
! head test.*

==> train.en <==
Today, those using Planck and cosmic background data to obtain a value for the Hubble constant get a figure of 67.4 plus or minus 0.5. By contrast the local approach gives a figure of 73.5 plus or minus 1.4. These values represent the two different values we have for the expansion of the universe. (See "A matter of metrics," below.)
Slack's stock has now fallen nearly 20% from its reference price of $24 on the day of its Wall Street debut.
Ayesha Shroff's latest Instagram entry deserves everyone's attention.
winter weather alerts from west Virginia all the way up to Maine.
Mr Moise - who has been in power since 2017 - has called for talks with the opposition, with no success so far.
5. Clue (1985)
"My friend was stabbed in the shoulder and my other friend was shot," she said. "I escaped with one friend and went home and then came back to look for another friend."
Raising awareness or making light?
There was also a report where, because we had three policemen in our sid



---


## Installation of JoeyNMT

[JoeyNMT](https://joeynmt.readthedocs.io) is a simple, minimalist NMT package.  

In [10]:
# Install JoeyNMT
! git clone https://github.com/joeynmt/joeynmt.git
! cd joeynmt;pip3 install .

# Install Pytorch with GPU support v1.7.1.
# ! pip uninstall torch
! pip install torch==1.10.0+cu102 -f https://download.pytorch.org/whl/torch_stable.html

Cloning into 'joeynmt'...
remote: Enumerating objects: 3224, done.[K
remote: Counting objects: 100% (273/273), done.[K
remote: Compressing objects: 100% (143/143), done.[K
remote: Total 3224 (delta 155), reused 209 (delta 130), pack-reused 2951[K
Receiving objects: 100% (3224/3224), 8.19 MiB | 16.00 MiB/s, done.
Resolving deltas: 100% (2183/2183), done.
Processing /content/joeynmt
[33m  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.[0m
Building wheels for collected packages: joeynmt
  Building wheel for joeynmt (setup.py) ... [?25l[?25hdone
  Created wheel for joeynmt: filename=joeynmt-1.3-py3-none-any.whl size=86029 sha256=8e507524fcbef

# Preprocessing the Data into Subword BPE Tokens

- One of the most powerful improvements for agglutinative languages (a feature of most Bantu languages) is using BPE tokenization [ (Sennrich, 2015) ](https://arxiv.org/abs/1508.07909).

- It was also shown that by optimizing the umber of BPE codes we significantly improve results for low-resourced languages [(Sennrich, 2019)](https://www.aclweb.org/anthology/P19-1021) [(Martinus, 2019)](https://arxiv.org/abs/1906.05685)

- Below we have the scripts for doing BPE tokenization of our data. We use 4000 tokens as recommended by [(Sennrich, 2019)](https://www.aclweb.org/anthology/P19-1021). You do not need to change anything. Simply running the below will be suitable. 

In [11]:
# One of the huge boosts in NMT performance was to use a different method of tokenizing. 
# Usually, NMT would tokenize by words. However, using a method called BPE gave amazing boosts to performance

# Do subword NMT
from os import path
os.environ['src'] = source_language # Sets them in bash as well, since we often use bash scripts
os.environ['tgt'] = target_language

# Learn BPEs on the training data.
os.environ['data_path'] = path.join('joeynmt', 'data', source_language + target_language)
! subword-nmt learn-joint-bpe-and-vocab --input train.$src train.$tgt -s 4000 -o bpe.codes.4000 --write-vocabulary vocab.$src vocab.$tgt

# Apply BPE splits to the development and test data.
! subword-nmt apply-bpe --dropout 0.1 -c bpe.codes.4000 --vocabulary vocab.$src < train.$src > train.bpe.$src
! subword-nmt apply-bpe --dropout 0.1 -c bpe.codes.4000 --vocabulary vocab.$tgt < train.$tgt > train.bpe.$tgt

! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < dev.$src > dev.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < dev.$tgt > dev.bpe.$tgt
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < test.$src > test.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < test.$tgt > test.bpe.$tgt



In [12]:
# importing the tokenizer and subword BPE trainer
! pip install tokenizers

from tokenizers import Tokenizer
from tokenizers.models import BPE, Unigram, WordLevel, WordPiece
from tokenizers.trainers import BpeTrainer, WordLevelTrainer, WordPieceTrainer, UnigramTrainer
from tokenizers.pre_tokenizers import Whitespace



In [13]:
unk_token = "<UNK>"  # token for unknown words
spl_tokens = ["<UNK>", "<SEP>", "<MASK>", "<CLS>"]  # special tokens

def prepare_tokenizer_trainer(alg):
    """
    Prepares the tokenizer and trainer with unknown & special tokens.
    """
    if alg == 'BPE':
        tokenizer = Tokenizer(BPE(unk_token = unk_token))
        trainer = BpeTrainer(special_tokens = spl_tokens)
    elif alg == 'UNI':
        tokenizer = Tokenizer(Unigram())
        trainer = UnigramTrainer(unk_token= unk_token, special_tokens = spl_tokens)
    elif alg == 'WPC':
        tokenizer = Tokenizer(WordPiece(unk_token = unk_token))
        trainer = WordPieceTrainer(special_tokens = spl_tokens)
    else:
        tokenizer = Tokenizer(WordLevel(unk_token = unk_token))
        trainer = WordLevelTrainer(special_tokens = spl_tokens)
    
    tokenizer.pre_tokenizer = Whitespace()
    return tokenizer, trainer


def train_tokenizer(files, alg):
    """
    Takes the files and trains the tokenizer.
    """
    # f = open(files, 'r')
    tokenizer, trainer = prepare_tokenizer_trainer(alg)
    tokenizer.train(files, trainer) # training the tokenzier
    tokenizer.save("./tokenizer-trained.json")
    tokenizer = Tokenizer.from_file("./tokenizer-trained.json")
    return tokenizer

def tokenize(input_string, tokenizer):
    """
    Tokenizes the input string using the tokenizer provided.
    """
    output = tokenizer.encode(input_string)
    return output

In [14]:
def tokenization(strategy):
  
  train_en = ['train.en']
  train_zu = ['train.zu']
  dev_en = ['dev.en']
  dev_zu = ['dev.zu']
  test_en = ['test.en']
  test_zu = ['test.zu']

  if strategy == 'UNI':

    for f in [train_en, train_zu, dev_en, dev_zu, test_en, test_zu]:

      # training tokens
      if f == train_en:
        trained_tokenizer = train_tokenizer(f, strategy)

        with open("train.uni."+$src, "w") as src_file:
          for sentence in train.en:
            output = ' '.join(tokenize(sentence, trained_tokenizer).tokens)
            src_file.write(output+"\n")
              
      elif f == train_zu:
        trained_tokenizer = train_tokenizer(f, strategy)

        with open("train.uni."+$tgt, "w") as trg_file:
          for sentence in train.zu:
            output = ' '.join(tokenize(sentence, trained_tokenizer).tokens)
            trg_file.write(output+"\n")
      
      # validation tokens
      elif f == dev_en:
        trained_tokenizer = train_tokenizer(f, strategy)

        with open("dev.uni."+$src, "w") as src_file:
          for sentence in dev.en:
            output = ' '.join(tokenize(sentence, trained_tokenizer).tokens)
            src_file.write(output+"\n")

      elif f == dev_zu:
        trained_tokenizer = train_tokenizer(f, strategy)

        with open("dev.uni."+$tgt, "w") as trg_file:
          for sentence in dev.zu:
            output = ' '.join(tokenize(sentence, trained_tokenizer).tokens)
            trg_file.write(output+"\n")

      # testing tokens
      elif f == test_en:
        trained_tokenizer = train_tokenizer(f, strategy)

        with open("test.uni."+$src, "w") as src_file:
          for sentence in test.en:
            output = ' '.join(tokenize(sentence, trained_tokenizer).tokens)
            src_file.write(output+"\n")

      elif f == test_zu:
        trained_tokenizer = train_tokenizer(f, strategy)

        with open("test.uni."+$tgt, "w") as trg_file:
          for sentence in test.zu:
            output = ' '.join(tokenize(sentence, trained_tokenizer).tokens)
            trg_file.write(output+"\n")

# Create directory, move everyone we care about to the correct location
# # ! mkdir -p $data_path
# ! cp train.* $data_path
# ! cp test.* $data_path
# ! cp dev.* $data_path
# ! ls $data_path

# # Also move everything we care about to a mounted location in google drive (relevant if running in colab) at gdrive_path
# ! cp train.* "$gdrive_path"
# ! cp test.* "$gdrive_path"
# ! cp dev.* "$gdrive_path"
# ! ls "$gdrive_path"

# # Create that vocab using build_vocab
# ! sudo chmod 777 joeynmt/scripts/build_vocab.py
# ! joeynmt/scripts/build_vocab.py joeynmt/data/$src$tgt/train.uni.$src joeynmt/data/$src$tgt/train.uni.$tgt --output_path joeynmt/data/$src$tgt/vocab.txt

# # Some output
# ! echo "UNI Zulu Sentences"
# ! tail -n 5 test.uni.$tgt
# ! echo "Combined UNI Vocab"
# ! tail -n 10 joeynmt/data/$src$tgt/vocab.txt  # Herman

SyntaxError: ignored

In [None]:
# Create directory, move everyone we care about to the correct location
! mkdir -p $data_path
! cp train.* $data_path
! cp test.* $data_path
! cp dev.* $data_path
! cp bpe.codes.4000 $data_path
! ls $data_path

# Also move everything we care about to a mounted location in google drive (relevant if running in colab) at gdrive_path
! cp train.* "$gdrive_path"
! cp test.* "$gdrive_path"
! cp dev.* "$gdrive_path"
! cp bpe.codes.4000 "$gdrive_path"
! ls "$gdrive_path"

# Create that vocab using build_vocab
! sudo chmod 777 joeynmt/scripts/build_vocab.py
! joeynmt/scripts/build_vocab.py joeynmt/data/$src$tgt/train.bpe.$src joeynmt/data/$src$tgt/train.bpe.$tgt --output_path joeynmt/data/$src$tgt/vocab.txt

# Some output
! echo "BPE Zulu Sentences"
! tail -n 5 test.bpe.$tgt
! echo "Combined BPE Vocab"
! tail -n 10 joeynmt/data/$src$tgt/vocab.txt  # Herman

In [None]:
print(test.bpe.en)

In [None]:
tokenization('UNI')

# Creating the JoeyNMT Config

JoeyNMT requires a yaml config. We provide a template below. We've also set a number of defaults with it, that you may play with!

- We used Transformer architecture 
- We set our dropout to reasonably high: 0.3 (recommended in  [(Sennrich, 2019)](https://www.aclweb.org/anthology/P19-1021))

Things worth playing with:
- The batch size (also recommended to change for low-resourced languages)
- The number of epochs (we've set it at 30 just so it runs in about an hour, for testing purposes)
- The decoder options (beam_size, alpha)
- Evaluation metrics (BLEU versus Crhf4)

In [None]:
# This creates the config file for our JoeyNMT system. It might seem overwhelming so we've provided a couple of useful parameters you'll need to update
# (You can of course play with all the parameters if you'd like!)

name = '%s%s' % (source_language, target_language)
gdrive_path = os.environ["gdrive_path"]

# Create the config
config = """
name: "{name}_transformer"

data:
    src: "{source_language}"
    trg: "{target_language}"
    train: "data/{name}/train.uni"  # change back to .bpe
    dev:   "data/{name}/dev.uni"
    test:  "data/{name}/test.uni"
    level: "uni"
    lowercase: False
    max_sent_length: 100
    src_vocab: "data/{name}/vocab.txt"
    trg_vocab: "data/{name}/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    #load_model: "{gdrive_path}/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
    patience: 5                     # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 4096
    batch_type: "token"
    eval_batch_size: 3600
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 35                     # TODO: Decrease for when playing around and checking of working. Around 30 is sufficient to check if its working at all
    validation_freq: 1000          # TODO: Set to at least once per epoch.
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "models/{name}_transformer"
    overwrite: False               # TODO: Set to True if you want to overwrite possibly existing models. 
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256   # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256    # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
""".format(name=name, gdrive_path=os.environ["gdrive_path"], source_language=source_language, target_language=target_language)
with open("joeynmt/configs/transformer_{name}.yaml".format(name=name),'w') as f:
    f.write(config)

# Train the Model

This single line of joeynmt runs the training using the config we made above

In [None]:
# Train the model

!cd joeynmt; python3 -m joeynmt train configs/transformer_$src$tgt.yaml

In [None]:
# Copy the created models from the notebook storage to google drive for persistant storage
!cp -r joeynmt/models/${src}${tgt}_transformer/* "$gdrive_path/models/${src}${tgt}_transformer/"

In [None]:
# Output our validation accuracy
! cat "$gdrive_path/models/${src}${tgt}_transformer/validations.txt"

In [None]:
# Test our model
! cd joeynmt; python3 -m joeynmt test "$gdrive_path/models/${src}${tgt}_transformer/config.yaml"