# Project Compositionality: Training

This notebook runs all the models and code necessary for the project

In [0]:
# MOUNTING GOOGLE DRIVE

from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /gdrive
/gdrive


## Setup

**IMPORTANT!** Please ensure you run this notebook with **GPU** support, enabled in Google Colab via `"Runtime > Change runtime type"`. The command below will show the registered GPU.

1. Change the working directory to the top-level, containing subfolders like:
  - code
  - data
  - OpenNMT-py (required)/Fairseq (optional)
  - logs
  - models
  - results
2. Clone OpenNMT from Github
3. Install OpenNMT via `pip install`

In [0]:
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



In [0]:
%%bash
# Set working directory
set WORK_DIR="/gdrive/My Drive/Education/Master Data Science/Master/Semester 1/NLP2/NLP2_2 (shared)"
cd $WORK_DIR
ls -lah

total 68K
drwx------ 1 root root 4.0K May 24 16:19 .
drwxr-xr-x 1 root root 4.0K May 24 16:19 ..
-rw------- 1 root root 1.1K May 24 16:19 .bash_history
-rw-r--r-- 1 root root 3.1K Apr  9  2018 .bashrc
drwxr-xr-x 1 root root 4.0K May 20 16:47 .cache
drwxr-xr-x 1 root root 4.0K May 24 16:19 .config
drwxr-xr-x 3 root root 4.0K May 20 16:14 .gsutil
drwxr-xr-x 1 root root 4.0K May 20 16:45 .ipython
drwx------ 2 root root 4.0K May 20 16:45 .jupyter
drwxr-xr-x 2 root root 4.0K May 24 16:11 .keras
drwx------ 1 root root 4.0K May 20 16:45 .local
drwxr-xr-x 3 root root 4.0K May 20 16:45 .node-gyp
drwxr-xr-x 4 root root 4.0K May 20 16:45 .npm
-rw-r--r-- 1 root root  148 Aug 17  2015 .profile


### OpenNMT - Cloning & Installation

If installation results in `Error`, you must restart the runtime in order to use newly installed versions. After runtime restart re-running the notebook from [Setup](#setup), `pip install` will show all modules installed as necessary.

In [0]:
%%bash
if [ ! -d "OpenNMT-py" ]; then
  # Check if OpenNMT-py doesn't exist, if so, git clone.
  git clone https://github.com/OpenNMT/OpenNMT-py
fi

In [0]:
!pip install OpenNMT-py



### Fairseq - Cloning and install (optional)

As OpenNMT-py does not provide sufficient support for convolutional sequence to sequence models, Fairseq is used to set up a model comparable to [Compositionality Decomposed (Hupkes et al., 2020)](https://arxiv.org/abs/1908.08351) and [Convolutional sequence to sequence learning (Gehring et al., 2017)](https://dl.acm.org/doi/10.5555/3305381.3305510).

As Fairseq is not well supported within Google Colab, the installation requires some different installation procedures compared to regular Python packages.

In [0]:
%%bash
if [ ! -d "Fairseq" ]; then
  # Check if Fairseq doesn't exist, if so, git clone.
  git clone https://github.com/pytorch/fairseq
fi

fatal: destination path 'fairseq' already exists and is not an empty directory.


In [0]:
# Installing editable version of Fairseq over Github repository
!pip install fairseq --editable ./fairseq/

Obtaining file:///gdrive/My%20Drive/Education/Master%20Data%20Science/Master/Semester%201/NLP2/NLP2_2%20%28shared%29/fairseq
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Installing backend dependencies ... [?25l[?25hdone
    Preparing wheel metadata ... [?25l[?25hdone
Collecting fairseq
[?25l  Downloading https://files.pythonhosted.org/packages/67/bf/de299e082e7af010d35162cb9a185dc6c17db71624590f2f379aeb2519ff/fairseq-0.9.0.tar.gz (306kB)
[K     |████████████████████████████████| 307kB 3.4MB/s 
[?25hCollecting sacrebleu
[?25l  Downloading https://files.pythonhosted.org/packages/6e/9d/9846507837ca50ae20917f59d83b79246b8313bd19d4f5bf575ecb98132b/sacrebleu-1.4.9-py3-none-any.whl (60kB)
[K     |████████████████████████████████| 61kB 5.2MB/s 
Collecting portalocker
  Downloading https://files.pythonhosted.org/packages/53/84/7b3146ec6378d28abc73ab484f09f47dfa008ad6f03f33d90a369f880e25/portalocker-1.7.0-py2.py3-none

In [0]:
%%bash
# Build the package from pypi locally to support Nvidia GPUs
cd ./fairseq/
python setup.py build_ext --inplace
cd $WORK_DIR

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
running build_ext
skipping 'fairseq/data/data_utils_fast.cpp' Cython extension (up-to-date)
skipping 'fairseq/data/token_block_utils_fast.cpp' Cython extension (up-to-date)
copying build/lib.linux-x86_64-3.6/fairseq/libbleu.cpython-36m-x86_64-linux-gnu.so -> fairseq
copying build/lib.linux-x86_64-3.6/fairseq/data/data_utils_fast.cpython-36m-x86_64-linux-gnu.so -> fairseq/data
copying build/lib.linux-x86_64-3.6/fairseq/data/token_block_utils_fast.cpython-36m-x86_64-linux-gnu.so -> fairseq/data
copying build/lib.linux-x86_64-3.6/fairseq/libnat.cpython-36m-x86_64-linux-gnu.so -> fairseq




### Notebook seeds, models and parameters

These are the seeds, models and parameters used in all steps of this notebook.

In [0]:
seeds = [0,1,2]
models = ["lsmts2s","grus2s","transformer","convs2s"]
data_dir = "data"
results_dir = "results"

### Logging

Enabling Tensorboard support.

In [0]:
!pip install tensorboardX

In [0]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

In [0]:
%tensorboard --logdir logs

ERROR: Timed out waiting for TensorBoard to start. It may still be running as pid 819.

## **Preprocessing**

As the training data needs transformations and preprocessing, we will outline the data flow below.

1. The PCFG datasets are stored with the following file/directory structure with `data`:
  - Final:
    - All:
      - 'Training files' src.txt & tgt.txt
      - 'Development (validation) files' src.txt & tgt.txt
      - 'Test files' src.txt & tgt.txt
    - Productivity:
      - Tasks:
        - Test1:
          - 'Training files'
        - Test2:
          - 'Training files'
        - 'Test files'
      - Whens:
        - Test3: 'Training files'
        - Test4: 'Training files'
        - Test5: 'Training files'
        - 'Test files'
    - Systematicity:
      - Test1:
        - 'Training files'
        - 'Test files'
    - README.txt
2. We will copy and rename some of these files, as OpenNMT-py and Fairsec parameters rely on file extensions and common directories. The new files will be placed in the `Transformed` directory.
3. Finally, Fairseq and OpenNMT-py will be used to preprocess the data for training with their respective seed values.


### 1. Datasets and Analysis

In [0]:
# Printing the directory and file structure of the data
import os, sys, yaml

original_dir = f"{data_dir}/Final"

def dir_to_dict(path):
    directory = {}
    for dirname, dirnames, filenames in os.walk(path):
        dn = os.path.basename(dirname)
        directory[dn] = []
        if dirnames:
            for d in dirnames:
                directory[dn].append(dir_to_dict(path=os.path.join(path, d)))
            for f in filenames:
                directory[dn].append(f)
        else:
            directory[dn] = filenames
        return directory

print(yaml.dump(dir_to_dict(path=original_dir), default_flow_style=False))

Final:
- All:
  - train30k_src_all.txt
  - train30k_tgt_all.txt
  - dev5k_src_all.txt
  - dev5k_tgt_all.txt
  - test5k_src_all.txt
  - test5k_tgt_all.txt
- Productivity:
  - Tasks:
    - Test1:
      - prod1_train_src_12tasks.txt
      - prod1_train_tgt_12tasks.txt
    - Test2:
      - prod2_train_src_13tasks.txt
      - prod2_train_tgt_13tasks.txt
    - prod_test_src_1tasks.txt
    - prod_test_tgt_1tasks.txt
    - prod_test_src_2tasks.txt
    - prod_test_tgt_2tasks.txt
    - prod_test_src_3tasks.txt
    - prod_test_tgt_3tasks.txt
  - Whens:
    - Test3:
      - prod3_train_src_012whens.txt
      - prod3_train_tgt_012whens.txt
    - Test4:
      - prod4_train_src_0123whens.txt
      - prod4_train_tgt_0123whens.txt
    - Test5:
      - prod5_train_src_024whens.txt
      - prod5_train_tgt_024whens.txt
    - prod_test_src_0whens.txt
    - prod_test_tgt_0whens.txt
    - prod_test_src_1whens.txt
    - prod_test_tgt_1whens.txt
    - prod_test_src_2whens.txt
    - prod_test_tgt_2whens.txt
   

In [0]:
def get_unique_words(data):
  unique_words = set()
  for src, _ in data:
    unique_words.update(src.split())
  return unique_words

In [0]:
# Analyzing training data to get accurate vocabulary sizes
all_dir = f"{original_dir}/All/"
files = [f for f in os.listdir(all_dir) if os.path.isfile(os.path.join(all_dir, f))]

for f in [f for f in files if ".txt" in f]:
  file_name = f.split("/")[-1]
  with open(os.path.join(all_dir, f)) as input_file:
    data = [(line, "") for line in input_file.readlines()]
    unique_words = get_unique_words(data)
    vocab_size = len(unique_words)
    print(f"{file_name} contains {vocab_size} words: ", sorted(unique_words))

train30k_src_all.txt contains 66 words:  ['at', 'doing0', 'doing1', 'doing10', 'doing11', 'doing12', 'doing13', 'doing14', 'doing15', 'doing16', 'doing17', 'doing18', 'doing19', 'doing2', 'doing3', 'doing4', 'doing5', 'doing6', 'doing7', 'doing8', 'doing9', 'location0', 'location1', 'location10', 'location11', 'location12', 'location13', 'location14', 'location15', 'location16', 'location17', 'location18', 'location19', 'location2', 'location3', 'location4', 'location5', 'location6', 'location7', 'location8', 'location9', 'person0', 'person1', 'person10', 'person11', 'person12', 'person13', 'person14', 'person15', 'person16', 'person17', 'person18', 'person19', 'person2', 'person3', 'person4', 'person5', 'person6', 'person7', 'person8', 'person9', 'went', 'what', 'when', 'where', 'who']
train30k_tgt_all.txt contains 62 words:  ['at', 'doing0', 'doing1', 'doing10', 'doing11', 'doing12', 'doing13', 'doing14', 'doing15', 'doing16', 'doing17', 'doing18', 'doing19', 'doing2', 'doing3', 'doi

### 2. Transform files

To facilitate training scripts, copy all files and rename datasets to corresponding `.src` and `.tgt` files.

In [0]:
from distutils.dir_util import copy_tree
transformed_dir = f"{data_dir}/Transformed"

copy_data = False

if copy_data:
  copy_tree(original_dir, transformed_dir)
  for root, dirs, files in os.walk(transformed_dir, topdown=False):
    for name in [f for f in files if ".txt" in f and "README.txt" not in f]:
        source_file = os.path.join(root, name)
        source_file_renamed = source_file
        if "_src_" in source_file:
            source_file_renamed = source_file_renamed.replace("_src_","_").replace(".txt", ".src")
        elif "_tgt_" in source_file:
            source_file_renamed = source_file_renamed.replace("_tgt_","_").replace(".txt", ".tgt")
        os.rename(source_file, source_file_renamed)
        print(f"Source '{source_file}' renamed to '{source_file_renamed}'")

### 3.a Create preprocessing shell scripts

As preprocessing is done for multiple seeds to account for random sampling, we divide the original datasets to seed specific datasets, called `'runs'` for models.

In [0]:
from pprint import pprint

# Create list of datasets to preprocess
datasets = []
transformed_dir = f"{data_dir}/Transformed"
dev_set = f"{transformed_dir}/All/dev5k_all"

for root, dirs, files in os.walk(transformed_dir, topdown=False):
  for name in [f for f in files if ".txt" not in f and "dev5k" not in f]:
      source_file = os.path.join(root, name)
      source_path = source_file.split(".")[0]
      opennmt_dest_path = source_path.replace("/Transformed/","/OpenNMT/")
      fairseq_dest_path = source_path.replace("/Transformed/","/Fairseq/")
      datasets.append((source_path, dev_set, opennmt_dest_path, fairseq_dest_path))
      
datasets = list(set(datasets))

pprint(datasets)

[]


In [0]:
# Remove existing bash files and write new ones
try:
  os.remove("code/fairseq_preprocess.sh")
except:
  pass

try:
  os.remove("code/opennmt_preprocess.sh")
except:
  pass

for seed in seeds:
  for source_path, dev_path, opennmt_dest_path, fairseq_dest_path in datasets:
    if "train" in source_path:
      try:
        os.makedirs(opennmt_dest_path)
      except:
        pass
      try:
        os.makedirs(fairseq_dest_path)
      except:
        pass
      with open("code/fairseq_preprocess.sh", "a") as scriptfile:
        scriptfile.write(f'python fairseq/preprocess.py  --source-lang src --target-lang tgt --trainpref "{source_path}" --validpref "{dev_path}" --destdir "{fairseq_dest_path}" --seed {seed}\n')
      with open("code/opennmt_preprocess.sh", "a") as scriptfile:
        scriptfile.write(f'python OpenNMT-py/preprocess.py  -train_src "{source_path}.src" -train_tgt "{source_path}.tgt" -valid_src "{dev_path}.src" -valid_tgt "{dev_path}.tgt" -save_data "{opennmt_dest_path}" -src_vocab_size 68 -tgt_vocab_size 66 --seed {seed}\n')

### 3.b Preprocess OpenNMT & Fairseq data

In [0]:
!chmod +x ./code/fairseq_preprocess.sh
!./code/fairseq_preprocess.sh

In [0]:
!chmod +x ./code/opennmt_preprocess.sh
!./code/opennmt_preprocess.sh

## Model and Training Script Creation 

For this project the following *hints* were given as models to train. For our research we focused on a bi-directional LSTM and GRU model, both with two layers and a 512 hidden unit and word embedding space. As these were baseline models, we will also look into a convolutional model, a Transformer and an optimized LSTM but these require significantly more training time.

Hupkes et al.(2019) LSTMS2S of Klein et al.(2018) (2 layers)
- (+) sequential nature useful for local encodings
- (-) long-distance encodings complicated due to sequential nature

ConvS2S of Gehring et al.(2017) (15 layers)
+ (+) local convolutions useful for local encodings
+ (+) layered architecture eases modelling hierarchy
- (-) long-distance encodings complicated due to local convolutions

Transformer of Vaswani et al.(2017) (6 layers)
+ (+) long-distance encodings easy due to attention
+ (+) layered architecture eases modelling hierarchy
- (-) out-of-order encoding complicate local encoding

### OpenNMT: Hyperparameters 

- The two models to compare;
- Hidden dimensionalities (e.g. 128, 256, 512);
- Number of stacked layers in enc./dec. (e.g. 2-6);
- Regularisation (e.g. dropout of 0.1-0.3);
- Balance model size, training speed, and dataset size

Experiment with hyperparameters: A full grid search is not necessary

In [0]:
# Create model directories
for seed in seeds:
  for model in models:
    for _, _, opennmt_dest_path, _ in datasets:
      try:
        os.makedirs(opennmt_dest_path.replace(f"{data_dir}/OpenNMT",f"models/{model}/run_{seed}"))
      except:
        pass
      try:
        os.makedirs(opennmt_dest_path.replace(f"{data_dir}/OpenNMT",f"logs/{model}/run_{seed}"))
      except:
        pass

### LSTMS2S

As the first model to train, we will look at a bi-directional Long Short-Term Memory recurrent neural network, or LSTMS2S in here for short. We will create a seperate command for each seed and model task and combine them as a `bash` file, which will train all `runs` and tasks when executed.

Training can be monitored at the [Logging](#logging) of this notebook with the Tensorboard extension.

In [0]:
# Create training scripts

try:
  os.remove("code/lstms2s_train.sh")
except:
  pass
  
for seed in seeds:
  for source_path, dev_path, dest_path, _ in datasets:
    if "train" in source_path:
      with open("code/lstms2s_train.sh", "a") as scriptfile:
        data_path = dest_path
        model_path = dest_path.replace(f"{data_dir}/OpenNMT", f"models/lstms2s/run_{seed}")
        log_path = dest_path.replace(f"{data_dir}/OpenNMT", f"logs/lstms2s/run_{seed}")
        scriptfile.write(f"""
python OpenNMT-py/train.py  -data "{data_path}" -save_model "{model_path}" \\
                            -train_steps 25000 -layers 2 -rnn_size 512 -word_vec_size 512 \\
                            -rnn_type LSTM --encoder_type brnn \\
                            -batch_size 64 -gpu_rank 0 -seed {seed} \\
                            -learning_rate 0.1 -optim sgd -global_attention general \\
                            -tensorboard -tensorboard_log_dir "{log_path}"
""")

### GRUS2S

As the second model to train, we look at a comparable GRU network, with similar parameters for fair comparison.

In [0]:
# Create training scripts
try:
  os.remove(f"code/grus2s_train.sh")
except:
  pass

for seed in seeds:
  for source_path, dev_path, dest_path, _ in datasets:
    if "train" in source_path:
      with open("code/grus2s_train.sh", "a") as scriptfile:
        data_path = dest_path
        model_path = dest_path.replace(f"{data_dir}/OpenNMT", f"models/lstms2s/run_{seed}")
        log_path = dest_path.replace(f"{data_dir}/OpenNMT", f"logs/lstms2s/run_{seed}")
        scriptfile.write(f"""
python OpenNMT-py/train.py  -data "{data_path}" -save_model "{model_path}" \\
                            -train_steps 25000 -layers 2 -rnn_size 512 -word_vec_size 512 \\
                            -rnn_type GRU --encoder_type brnn \\
                            -batch_size 64 -gpu_rank 0 -seed {seed} \\
                            -learning_rate 0.1 -optim sgd -global_attention general \\
                            -tensorboard -tensorboard_log_dir "{log_path}"
""")

### Large model preprocessing (optional)

As large models require more data, these need more steps before training. 

1.   Please run the `pcfg_big` notebook to generate a dataset five times larger than the original PCFG. This should suffice for the Transformer and possibly ConvS2S as well
2.   Run the processing cells below after dataset generation

**WARNING!** As these models haven't been trained succesfully, this notebook is set up to the best of our knowledge, but hasn't been fully tested as we did not have access to sufficient computation resources (Google Colab has a maximum of 12 consecutive GPU hours). Be prepared that some script might not work 'out of the box'.[link text](https://)

In [0]:
# Transformer preprocessing
from distutils.dir_util import copy_tree


original_dir = f"{data_dir}/pcfg_big"
transformed_dir = f"{data_dir}/TransformedBig"

copy_data = False

if copy_data:
  copy_tree(original_dir, transformed_dir)
  for root, dirs, files in os.walk(transformed_dir, topdown=False):
    for name in [f for f in files if ".txt" in f and "README.txt" not in f]:
        source_file = os.path.join(root, name)
        source_file_renamed = source_file
        if "_src_" in source_file:
            source_file_renamed = source_file_renamed.replace("_src_","_").replace(".txt", ".src")
        elif "_tgt_" in source_file:
            source_file_renamed = source_file_renamed.replace("_tgt_","_").replace(".txt", ".tgt")
        os.rename(source_file, source_file_renamed)
        print(f"Source '{source_file}' renamed to '{source_file_renamed}'")


# Create list of datasets to preprocess
datasets = []
dev_set = f"{transformed_dir}/All/dev5k_all"

for root, dirs, files in os.walk(transformed_dir, topdown=False):
  for name in [f for f in files if ".txt" not in f and "dev5k" not in f]:
      source_file = os.path.join(root, name)
      source_path = source_file.split(".")[0]
      opennmt_dest_path = source_path.replace("/TransformedBig/","/OpenNMTBig/")
      fairseq_dest_path = source_path.replace("/TransformedBig/","/FairseqBig/")
      datasets.append((source_path, dev_set, opennmt_dest_path, fairseq_dest_path))
      
datasets = list(set(datasets))

# Remove existing bash files and write new ones
try:
  os.remove("code/opennmt_preprocess_big.sh")
except:
  pass

for seed in seeds:
  for source_path, dev_path, opennmt_dest_path, fairseq_dest_path in datasets:
    if "train" in source_path:
      try:
        os.makedirs(opennmt_dest_path)
      except:
        pass
      try:
        os.makedirs(fairseq_dest_path)
      except:
        pass
      with open("code/fairseq_preprocess_big.sh", "a") as scriptfile:
        scriptfile.write(f'python fairseq/preprocess.py  --source-lang src --target-lang tgt --trainpref "{source_path}" --validpref "{dev_path}" --destdir "{fairseq_dest_path}" --seed {seed}\n')
      with open("code/opennmt_preprocess_big.sh", "a") as scriptfile:
        scriptfile.write(f'python OpenNMT-py/preprocess.py  -train_src "{source_path}.src" -train_tgt "{source_path}.tgt" -valid_src "{dev_path}.src" -valid_tgt "{dev_path}.tgt" -save_data "{opennmt_dest_path}" -src_vocab_size 70 -tgt_vocab_size 70 --seed {seed}\n')


In [0]:
!chmod +x ./code/fairseq_preprocess_big.sh
!./code/fairseq_preprocess_big.sh

In [0]:
!chmod +x ./code/fairseq_preprocess_big.sh
!./code/fairseq_preprocess_big.sh

### Transformer (optional)

As a third comparison, we could train a transformer. This, however, takes far more data and computation time than the models above. If this is desired, the `pcfg_big` notebook can create the necessary data. Below the scripts for preprocesssing can be run seperately.

In [0]:
# Train a Transformer for 50000 steps, with 16000 warmup steps
# As this training script takes significantly longer, we have
# not been able to test its full potential

try:
  os.remove("code/transformer_train.sh")
except:
  pass
  
for seed in seeds:
  for source_path, dev_path, dest_path, _ in datasets:
    with open("code/transformer_train.sh", "a") as scriptfile:
      if "train" in source_path:
        data_path = dest_path
        model_path = dest_path.replace("data/OpenNMTBig",f"models/transformer/run_{seed}")
        log_path = dest_path.replace("data/OpenNMTBig",f"logs/transformer/run_{seed}")
        scriptfile.write(f"""
python OpenNMT-py/train.py  -data "{data_path}" -save_model "{model_path}" \\
                            -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \\
                            -encoder_type transformer -decoder_type transformer -position_encoding \\
                            -train_steps 50000 -max_generator_batches 2 -dropout 0.1 -batch_size 4096 \\
                            -batch_type tokens -normalization tokens -accum_count 2 -optim adam \\
                            -adam_beta2 0.998 -decay_method noam -warmup_steps 16000 -learning_rate 2 \\
                            -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 \\
                            -valid_steps 1000 -save_checkpoint_steps 5000 -world_size 1 -gpu_rank 0 -seed {seed} \\
                            -tensorboard -tensorboard_log_dir "{log_path}"
  """)

### ConvS2S (optional)

This model has been trained on the original PCFG, but we were unable to validate its performance and were informed that accuracies might not be sufficient as convolutional networks require more training data. We have set up this notebook to use the new 'Big' data set, but have not tested its performance or functionality.

In [0]:
# Train a convolutional model based on Fairseq
# We have been able to train this model on a small
# dataset, but have not been able to validate its
# performance

try:
  os.remove("code/convs2s_train.sh")
except:
  pass
  
for seed in seeds:
  for source_path, dev_path, dest_path in datasets:
    with open("code/convs2s_train.sh", "a") as scriptfile:
      if "train" in source_path:
        data_path = dest_path
        model_path = dest_path.replace("data/FairseqBig",f"models/convs2s/run_{seed}")
        log_path = dest_path.replace("data/FairseqBig",f"logs/convs2s/run_{seed}")
        scriptfile.write(f"""
python fairseq/train.py "{data_path}" --no-epoch-checkpoints --arch fconv_wmt_en_de --lr 0.25 --max-tokens 3000 --save-dir "{model_path}" \\
                                      --clip-norm 0.1 --dropout 0.1 --max-epoch 25 --save-interval 1 --no-epoch-checkpoints \\
                                      --encoder-embed-dim 512 --decoder-embed-dim 512 \\
                                      --tensorboard-logdir "{log_path}" \\
                                      --batch-size 64 --seed {seed}
""")


## Model Training

Training all models, with optional training for the Transformer and ConvS2S.

### LSTMS2S

In [0]:
!chmod +x ./code/lstms2s_train.sh
!./code/lstms2s_train.sh

### GRUS2S

In [0]:
!chmod +x ./code/grus2s_train.sh
!./code/grus2s_train.sh

### Transformer (optional)

In [0]:
!chmod +x ./code/transformer_train.sh
!./code/transformer_train.sh

### ConvS2S (optional)

In [0]:
!chmod +x ./code/convs2s_train.sh
!./code/convs2s_train.sh

## Evaluation

These scripts were used for initial evaluation and experimentation. They do not contribute to training, but are left for convenience

In [0]:
def sequence_accuracy(a, b):
    return float(a.split() == b.split())

def word_accuracy(a, b):
    return float(sum([1. for _ in a.split() if _ in b.split()]) / len(b.split()))

import numpy as np

def levenshtein(seq1, seq2):
    size_x = len(seq1) + 1
    size_y = len(seq2) + 1
    matrix = np.zeros ((size_x, size_y))
    for x in range(size_x):
        matrix [x, 0] = x
    for y in range(size_y):
        matrix [0, y] = y

    for x in range(1, size_x):
        for y in range(1, size_y):
            if seq1[x-1] == seq2[y-1]:
                matrix [x,y] = min(
                    matrix[x-1, y] + 1,
                    matrix[x-1, y-1],
                    matrix[x, y-1] + 1
                )
            else:
                matrix [x,y] = min(
                    matrix[x-1,y] + 1,
                    matrix[x-1,y-1] + 1,
                    matrix[x,y-1] + 1
                )
    # print (matrix)
    return (matrix[size_x - 1, size_y - 1])

In [0]:
# Analyzing predictions
from os import listdir
from os.path import isfile, join


test_dir = f'{data_dir}/Final/All/'
for seed in seeds:
  for model in models:
    pred_dir = f'results/run_{seed}/All/'
    with open(join(test_dir, "test5k_all.tgt")) as input_file:
      pred = input_file.readlines()
    
    with open(join(pred_dir, "test5k_all.prd")) as input_file:
      actual = input_file.readlines()

    comparison = zip(pred, actual)

    sequence_acc = 0
    word_acc = 0
    lev_scores = []
    for i, c in enumerate(comparison):
      p, a = c
      sequence_acc += sequence_accuracy(p, a)
      word_acc += word_accuracy(p, a)
      lev_scores.append(levenshtein(p, a))

    sequence_acc /= len(pred)
    word_acc /= len(pred)
    lev_score = sum(lev_scores) / len(lev_scores)

    print(f'{model}, seed {seed}: Seq. acc.: {sequence_acc:.4f}, Word acc.: {word_acc:.4f}, Avg. Lev.: {lev_score:.4f}')

LSTMS2S: Seq. acc.: 0.6382, Word acc.: 0.9477, Avg. Lev.: 3.8848
