<a href="https://colab.research.google.com/github/soniasol/test_normalisation_2/blob/main/notebooks/Train_norm_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Train a normaliser for French

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gabays/32M7131/blob/main/Cours_04/Cours04.ipynb)

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Licence Creative Commons" style="border-width:0;float:right;" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a>

Simon Gabay (UniGE), Rachel Bawden (INRIA Paris)

<img width="30px" style="float:left" src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d0/Google_Colaboratory_SVG_Logo.svg/320px-Google_Colaboratory_SVG_Logo.svg.png"/>  Specific requirements for Colab users are marked with this icon.

## I. Preparing the experiment

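You may first need to clean your environment (e.g. OOD users in Geneva). The cell below uninstalls every package installed at user level, then lists what remains.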


In [None]:
!pip freeze --user --exclude-editable | xargs pip uninstall -y
!pip list

Installing required packages.

In [None]:
!pip install fairseq==0.12.2 sentencepiece sacrebleu omegaconf==2.0.5 gdown==4.2.0 tensorboardX numpy==1.25.2

### I.a Colab vs other tools

<img width="30px" style="float:left" src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d0/Google_Colaboratory_SVG_Logo.svg/320px-Google_Colaboratory_SVG_Logo.svg.png"/>  Colab users:

In [None]:
DIRECTORY="/content/FreEMnorm"

Non-Colab users:

In [None]:
DIRECTORY="FreEMnorm"
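
Instead of running one of the two cells by hand, the choice could be made automatically. The sketch below assumes that the `google.colab` module is loaded only inside a Colab runtime, which is a common heuristic rather than a guarantee. `default_directory` is a hypothetical helper and is not part of FreEMnorm.

```python
import sys

def default_directory(modules=None):
    """Return the working directory used in this notebook.

    Assumption: 'google.colab' is among the loaded modules only on Colab.
    """
    loaded = sys.modules if modules is None else modules
    return "/content/FreEMnorm" if "google.colab" in loaded else "FreEMnorm"

DIRECTORY = default_directory()
print(DIRECTORY)
```

The optional `modules` argument only exists so that both branches can be checked without actually being on Colab.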

### I.b Retrieving data
We clone the repository (only the `dev` branch for now).

In [None]:
# delete any older copy of the repo
!rm -rf $DIRECTORY
# clone the dev branch into $DIRECTORY
!git clone -b dev https://github.com/FreEM-corpora/FreEMnorm.git $DIRECTORY

### I.c If you want to redo the split and the creation of the training data

We can split the corpus and create files with the data in the required format.

<img width="30px" style="float:left" src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d0/Google_Colaboratory_SVG_Logo.svg/320px-Google_Colaboratory_SVG_Logo.svg.png"/> Add `-c` to the last command (`split_to_src_trg.py`).

In [None]:
# Remove any previously created split and data directories
!rm -rf $DIRECTORY/split/ $DIRECTORY/data/
!echo "Split and Data directories have been removed"
# Split the corpus into train/test/dev
!python $DIRECTORY/split.py
# Convert the split into the required format
!python $DIRECTORY/split_to_src_trg.py
# Colab users should comment out the previous command and use the following one
#!python /content/FreEMnorm/split_to_src_trg.py -c

## II. Training a model


### II.a Preprocessing

We will need a few functions

In [None]:
# Read a file line by line, stripping surrounding whitespace
def read_file(filename):
    list_sents = []
    with open(filename, encoding='utf-8') as fp:
        for line in fp:
            list_sents.append(line.strip())
    return list_sents

# Write a list of sentences to a file, one per line
def write_file(list_sents, filename):
    with open(filename, 'w', encoding='utf-8') as fp:
        for sent in list_sents:
            fp.write(sent + '\n')
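
A quick way to check the two helpers is a round trip through a temporary file. The sketch below repeats the two functions so that it runs on its own; the sample sentences and the explicit `utf-8` encoding are assumptions made for illustration.

```python
import os
import tempfile

def read_file(filename):
    # One sentence per line, surrounding whitespace removed
    with open(filename, encoding="utf-8") as fp:
        return [line.strip() for line in fp]

def write_file(list_sents, filename):
    # One sentence per line
    with open(filename, "w", encoding="utf-8") as fp:
        for sent in list_sents:
            fp.write(sent + "\n")

sample = ["Premiere phrase", "Seconde phrase"]  # made-up sentences
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    write_file(sample, path)
    restored = read_file(path)

print(restored == sample)  # True
```

Because `read_file` strips each line, leading and trailing spaces in a sentence do not survive the round trip.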

We will need several vocabulary sizes: 1000, 2000, 3000 and 4000.

In [None]:
import sentencepiece
import os

!rm -rf $DIRECTORY/data/vocabulary.src-trg
!rm -rf $DIRECTORY/data/data_norm_bin_1000
!rm -rf $DIRECTORY/data/data_norm_bin_2000
!rm -rf $DIRECTORY/data/data_norm_bin_3000
!rm -rf $DIRECTORY/data/data_norm_bin_4000

# We make a big file with all the data
!cat $DIRECTORY/data/* > $DIRECTORY/data/vocabulary.src-trg

# 1000
sentencepiece.SentencePieceTrainer.train(input=os.path.join(DIRECTORY,"data/vocabulary.src-trg"),
                               model_prefix=os.path.join(DIRECTORY,"data/bpe_joint_1000"),
                               vocab_size=1000)

# 2000
sentencepiece.SentencePieceTrainer.train(input=os.path.join(DIRECTORY,"data/vocabulary.src-trg"),
                               model_prefix=os.path.join(DIRECTORY,"data/bpe_joint_2000"),
                               vocab_size=2000)

#3000
sentencepiece.SentencePieceTrainer.train(input=os.path.join(DIRECTORY,"data/vocabulary.src-trg"),
                               model_prefix=os.path.join(DIRECTORY,"data/bpe_joint_3000"),
                               vocab_size=3000)

#4000
sentencepiece.SentencePieceTrainer.train(input=os.path.join(DIRECTORY,"data/vocabulary.src-trg"),
                               model_prefix=os.path.join(DIRECTORY,"data/bpe_joint_4000"),
                               vocab_size=4000)
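
The four calls above differ only in the vocabulary size, so they could be written as a loop. This is a sketch under that assumption: `spm_train_kwargs` and `train_all` are hypothetical helpers, and the keyword arguments are the same three used in the cell above.

```python
import os

VOCAB_SIZES = (1000, 2000, 3000, 4000)

def spm_train_kwargs(directory, vocab_size):
    """Build the keyword arguments for one SentencePiece training run."""
    return {
        "input": os.path.join(directory, "data/vocabulary.src-trg"),
        "model_prefix": os.path.join(directory, f"data/bpe_joint_{vocab_size}"),
        "vocab_size": vocab_size,
    }

def train_all(directory, vocab_sizes=VOCAB_SIZES):
    """Train one model per vocabulary size (requires sentencepiece)."""
    import sentencepiece  # imported here so the helper above works without it
    for size in vocab_sizes:
        sentencepiece.SentencePieceTrainer.train(**spm_train_kwargs(directory, size))

# Show the model prefixes that would be produced, without training anything
for size in VOCAB_SIZES:
    print(spm_train_kwargs("FreEMnorm", size)["model_prefix"])
```

Calling `train_all(DIRECTORY)` would then replace the four separate calls.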

We prepare the datasets with a 1000-word vocabulary.

In [None]:
#Loading datasets
train_src = read_file(os.path.join(DIRECTORY,'data/train.src'))
train_trg = read_file(os.path.join(DIRECTORY,'data/train.trg'))
dev_src = read_file(os.path.join(DIRECTORY,'data/dev.src'))
dev_trg = read_file(os.path.join(DIRECTORY,'data/dev.trg'))
test_src = read_file(os.path.join(DIRECTORY,'data/test.src'))
test_trg = read_file(os.path.join(DIRECTORY,'data/test.trg'))

# Loading the bpe model
spm = sentencepiece.SentencePieceProcessor(model_file=os.path.join(DIRECTORY,'data/bpe_joint_1000.model'))

# Apply the bpe model to the datasets
train_src_sp = spm.encode(train_src, out_type=str)
train_trg_sp = spm.encode(train_trg, out_type=str)
dev_src_sp = spm.encode(dev_src, out_type=str)
dev_trg_sp = spm.encode(dev_trg, out_type=str)
test_src_sp = spm.encode(test_src, out_type=str)
test_trg_sp = spm.encode(test_trg, out_type=str)

# Checking the result (src and trg should have the same length)
print(len(train_src_sp), len(train_trg_sp))
print(len(dev_src_sp), len(dev_trg_sp))
print(len(test_src_sp), len(test_trg_sp))

# Write the BPE-encoded files
write_file([' '.join(sent) for sent in train_src_sp], os.path.join(DIRECTORY,'data/train.sp1000.src'))
write_file([' '.join(sent) for sent in train_trg_sp], os.path.join(DIRECTORY,'data/train.sp1000.trg'))
write_file([' '.join(sent) for sent in dev_src_sp], os.path.join(DIRECTORY,'data/dev.sp1000.src'))
write_file([' '.join(sent) for sent in dev_trg_sp], os.path.join(DIRECTORY,'data/dev.sp1000.trg'))
write_file([' '.join(sent) for sent in test_src_sp], os.path.join(DIRECTORY,'data/test.sp1000.src'))
write_file([' '.join(sent) for sent in test_trg_sp], os.path.join(DIRECTORY,'data/test.sp1000.trg'))

19472 19472
2615 2615
5879 5879
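
The counts above are compared by eye. A small check that fails loudly on a mismatch could look like the sketch below; `check_parallel` is a hypothetical helper and the toy sentences are made up.

```python
def check_parallel(src_sents, trg_sents, name):
    """Return the number of sentence pairs, or raise if the sides differ."""
    if len(src_sents) != len(trg_sents):
        raise ValueError(
            f"{name}: {len(src_sents)} source vs {len(trg_sents)} target sentences"
        )
    return len(src_sents)

print(check_parallel(["a b", "c"], ["a b", "c"], "toy"))  # 2
```

In the notebook it would be called as, for instance, `check_parallel(train_src_sp, train_trg_sp, "train")`.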


We prepare the datasets with a 2000-word vocabulary.

In [None]:
#Loading datasets
train_src = read_file(os.path.join(DIRECTORY,'data/train.src'))
train_trg = read_file(os.path.join(DIRECTORY,'data/train.trg'))
dev_src = read_file(os.path.join(DIRECTORY,'data/dev.src'))
dev_trg = read_file(os.path.join(DIRECTORY,'data/dev.trg'))
test_src = read_file(os.path.join(DIRECTORY,'data/test.src'))
test_trg = read_file(os.path.join(DIRECTORY,'data/test.trg'))

# Loading the bpe model
spm = sentencepiece.SentencePieceProcessor(model_file=os.path.join(DIRECTORY,'data/bpe_joint_2000.model'))

# Apply the bpe model to the datasets
train_src_sp = spm.encode(train_src, out_type=str)
train_trg_sp = spm.encode(train_trg, out_type=str)
dev_src_sp = spm.encode(dev_src, out_type=str)
dev_trg_sp = spm.encode(dev_trg, out_type=str)
test_src_sp = spm.encode(test_src, out_type=str)
test_trg_sp = spm.encode(test_trg, out_type=str)

# Checking the result (src and trg should have the same length)
print(len(train_src_sp), len(train_trg_sp))
print(len(dev_src_sp), len(dev_trg_sp))
print(len(test_src_sp), len(test_trg_sp))

# Write the BPE-encoded files
write_file([' '.join(sent) for sent in train_src_sp], os.path.join(DIRECTORY,'data/train.sp2000.src'))
write_file([' '.join(sent) for sent in train_trg_sp], os.path.join(DIRECTORY,'data/train.sp2000.trg'))
write_file([' '.join(sent) for sent in dev_src_sp], os.path.join(DIRECTORY,'data/dev.sp2000.src'))
write_file([' '.join(sent) for sent in dev_trg_sp], os.path.join(DIRECTORY,'data/dev.sp2000.trg'))
write_file([' '.join(sent) for sent in test_src_sp], os.path.join(DIRECTORY,'data/test.sp2000.src'))
write_file([' '.join(sent) for sent in test_trg_sp], os.path.join(DIRECTORY,'data/test.sp2000.trg'))

19472 19472
2615 2615
5879 5879


We prepare the datasets with a 3000-word vocabulary.

In [None]:
# Loading the bpe model
spm = sentencepiece.SentencePieceProcessor(model_file=os.path.join(DIRECTORY,'data/bpe_joint_3000.model'))

# Apply the bpe model to the datasets
train_src_sp = spm.encode(train_src, out_type=str)
train_trg_sp = spm.encode(train_trg, out_type=str)
dev_src_sp = spm.encode(dev_src, out_type=str)
dev_trg_sp = spm.encode(dev_trg, out_type=str)
test_src_sp = spm.encode(test_src, out_type=str)
test_trg_sp = spm.encode(test_trg, out_type=str)

# Checking the result (src and trg should have the same length)
print(len(train_src_sp), len(train_trg_sp))
print(len(dev_src_sp), len(dev_trg_sp))
print(len(test_src_sp), len(test_trg_sp))

# Write the BPE-encoded files
write_file([' '.join(sent) for sent in train_src_sp], os.path.join(DIRECTORY,'data/train.sp3000.src'))
write_file([' '.join(sent) for sent in train_trg_sp], os.path.join(DIRECTORY,'data/train.sp3000.trg'))
write_file([' '.join(sent) for sent in dev_src_sp], os.path.join(DIRECTORY,'data/dev.sp3000.src'))
write_file([' '.join(sent) for sent in dev_trg_sp], os.path.join(DIRECTORY,'data/dev.sp3000.trg'))
write_file([' '.join(sent) for sent in test_src_sp], os.path.join(DIRECTORY,'data/test.sp3000.src'))
write_file([' '.join(sent) for sent in test_trg_sp], os.path.join(DIRECTORY,'data/test.sp3000.trg'))

19472 19472
2615 2615
5879 5879


We prepare the datasets with a 4000-word vocabulary.

In [None]:
# Loading the bpe model
spm = sentencepiece.SentencePieceProcessor(model_file=os.path.join(DIRECTORY,'data/bpe_joint_4000.model'))

# Apply the bpe model to the datasets
train_src_sp = spm.encode(train_src, out_type=str)
train_trg_sp = spm.encode(train_trg, out_type=str)
dev_src_sp = spm.encode(dev_src, out_type=str)
dev_trg_sp = spm.encode(dev_trg, out_type=str)
test_src_sp = spm.encode(test_src, out_type=str)
test_trg_sp = spm.encode(test_trg, out_type=str)

# Checking the result (src and trg should have the same length)
print(len(train_src_sp), len(train_trg_sp))
print(len(dev_src_sp), len(dev_trg_sp))
print(len(test_src_sp), len(test_trg_sp))

# Write the BPE-encoded files
write_file([' '.join(sent) for sent in train_src_sp], os.path.join(DIRECTORY,'data/train.sp4000.src'))
write_file([' '.join(sent) for sent in train_trg_sp], os.path.join(DIRECTORY,'data/train.sp4000.trg'))
write_file([' '.join(sent) for sent in dev_src_sp], os.path.join(DIRECTORY,'data/dev.sp4000.src'))
write_file([' '.join(sent) for sent in dev_trg_sp], os.path.join(DIRECTORY,'data/dev.sp4000.trg'))
write_file([' '.join(sent) for sent in test_src_sp], os.path.join(DIRECTORY,'data/test.sp4000.src'))
write_file([' '.join(sent) for sent in test_trg_sp], os.path.join(DIRECTORY,'data/test.sp4000.trg'))

19472 19472
2615 2615
5879 5879


In [None]:
!fairseq-preprocess --destdir $DIRECTORY/data/data_norm_bin_1000/ \
                    -s trg -t src \
                    --trainpref $DIRECTORY/data/train.sp1000 \
                    --validpref $DIRECTORY/data/dev.sp1000 \
                    --testpref $DIRECTORY/data/test.sp1000 \
                    --joined-dictionary

!fairseq-preprocess --destdir $DIRECTORY/data/data_norm_bin_2000/ \
                    -s trg -t src \
                    --trainpref $DIRECTORY/data/train.sp2000 \
                    --validpref $DIRECTORY/data/dev.sp2000 \
                    --testpref $DIRECTORY/data/test.sp2000 \
                    --joined-dictionary

!fairseq-preprocess --destdir $DIRECTORY/data/data_norm_bin_3000/ \
                    -s trg -t src \
                    --trainpref $DIRECTORY/data/train.sp3000 \
                    --validpref $DIRECTORY/data/dev.sp3000 \
                    --testpref $DIRECTORY/data/test.sp3000 \
                    --joined-dictionary

!fairseq-preprocess --destdir $DIRECTORY/data/data_norm_bin_4000/ \
                    -s trg -t src \
                    --trainpref $DIRECTORY/data/train.sp4000 \
                    --validpref $DIRECTORY/data/dev.sp4000 \
                    --testpref $DIRECTORY/data/test.sp4000 \
                    --joined-dictionary

2024-06-20 20:34:03 | INFO | fairseq_cli.preprocess | Namespace(no_progress_bar=False, log_interval=100, log_format=None, log_file=None, aim_repo=None, aim_run_hash=None, tensorboard_logdir=None, wandb_project=None, azureml_logging=False, seed=1, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, on_cpu_convert_precision=False, min_loss_scale=0.0001, threshold_loss_scale=None, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, quantization_config_path=None, profile=False, reset_logging=False, suppress_crashes=False, use_plasma_view=False, plasma_path='/tmp/plasma', criterion='cross_entropy', tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang='trg', target_lang='src', 

### II.b Training

We now train a model with a vocab of 1000

In [None]:
# -y skips the confirmation prompt, which would otherwise block the cell
!pip uninstall -y numpy
!pip install numpy==1.25.2
import numpy
numpy.version.version

In [None]:
# create an empty model folder to store the model in
!mkdir -p $DIRECTORY/models/lstm_dict1000_3l_embed384

# call fairseq-train
!fairseq-train \
       $DIRECTORY/data/data_norm_bin_1000 \
        --save-dir $DIRECTORY/models/lstm_dict1000_3l_embed384 \
        --save-interval 1 --patience 12 \
        --arch lstm \
        --encoder-layers 3 --decoder-layers 3 \
        --encoder-embed-dim 384 --decoder-embed-dim 384 --decoder-out-embed-dim 384 \
        --encoder-hidden-size 768 --encoder-bidirectional --decoder-hidden-size 768 \
        --dropout 0.3 \
        --criterion cross_entropy --optimizer adam --adam-betas '(0.9, 0.98)' \
        --lr 0.001 --lr-scheduler inverse_sqrt \
        --warmup-updates 4000 \
        --share-all-embeddings \
        --max-tokens 3000 \
        --batch-size-valid 64

2024-06-20 20:38:42 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': N

We now train a model with a vocab of 2000

In [None]:
# create an empty model folder to store the model in
!mkdir -p $DIRECTORY/models/lstm_dict2000_3l_embed384

# call fairseq-train
!fairseq-train \
       $DIRECTORY/data/data_norm_bin_2000 \
        --save-dir $DIRECTORY/models/lstm_dict2000_3l_embed384 \
        --save-interval 1 --patience 12 \
        --arch lstm \
        --encoder-layers 3 --decoder-layers 3 \
        --encoder-embed-dim 384 --decoder-embed-dim 384 --decoder-out-embed-dim 384 \
        --encoder-hidden-size 768 --encoder-bidirectional --decoder-hidden-size 768 \
        --dropout 0.3 \
        --criterion cross_entropy --optimizer adam --adam-betas '(0.9, 0.98)' \
        --lr 0.001 --lr-scheduler inverse_sqrt \
        --warmup-updates 4000 \
        --share-all-embeddings \
        --max-tokens 3000 \
        --batch-size-valid 64

We now train a model with a vocab of 3000

In [None]:
# create an empty model folder to store the model in
!mkdir -p $DIRECTORY/models/lstm_dict3000_3l_embed384

# call fairseq-train
!fairseq-train \
       $DIRECTORY/data/data_norm_bin_3000 \
        --save-dir $DIRECTORY/models/lstm_dict3000_3l_embed384 \
        --save-interval 1 --patience 12 \
        --arch lstm \
        --encoder-layers 3 --decoder-layers 3 \
        --encoder-embed-dim 384 --decoder-embed-dim 384 --decoder-out-embed-dim 384 \
        --encoder-hidden-size 768 --encoder-bidirectional --decoder-hidden-size 768 \
        --dropout 0.3 \
        --criterion cross_entropy --optimizer adam --adam-betas '(0.9, 0.98)' \
        --lr 0.001 --lr-scheduler inverse_sqrt \
        --warmup-updates 4000 \
        --share-all-embeddings \
        --max-tokens 3000 \
        --batch-size-valid 64

We now train a model with a vocab of 4000

In [None]:
# create an empty model folder to store the model in
!mkdir -p $DIRECTORY/models/lstm_dict4000_3l_embed384

# call fairseq-train
!fairseq-train \
       $DIRECTORY/data/data_norm_bin_4000 \
        --save-dir $DIRECTORY/models/lstm_dict4000_3l_embed384 \
        --save-interval 1 --patience 12 \
        --arch lstm \
        --encoder-layers 3 --decoder-layers 3 \
        --encoder-embed-dim 384 --decoder-embed-dim 384 --decoder-out-embed-dim 384 \
        --encoder-hidden-size 768 --encoder-bidirectional --decoder-hidden-size 768 \
        --dropout 0.3 \
        --criterion cross_entropy --optimizer adam --adam-betas '(0.9, 0.98)' \
        --lr 0.001 --lr-scheduler inverse_sqrt \
        --warmup-updates 4000 \
        --share-all-embeddings \
        --max-tokens 3000 \
        --batch-size-valid 64
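
The four training cells differ only in the vocabulary size that appears in the data and model paths. As a sketch, the command could be built once by a hypothetical `train_command` helper; the flags and values are copied from the cells above, and nothing is executed here.

```python
import os
import shlex

def train_command(directory, vocab_size):
    """Build the fairseq-train argument list for one vocabulary size."""
    data_dir = os.path.join(directory, f"data/data_norm_bin_{vocab_size}")
    save_dir = os.path.join(directory, f"models/lstm_dict{vocab_size}_3l_embed384")
    return [
        "fairseq-train", data_dir,
        "--save-dir", save_dir,
        "--save-interval", "1", "--patience", "12",
        "--arch", "lstm",
        "--encoder-layers", "3", "--decoder-layers", "3",
        "--encoder-embed-dim", "384", "--decoder-embed-dim", "384",
        "--decoder-out-embed-dim", "384",
        "--encoder-hidden-size", "768", "--encoder-bidirectional",
        "--decoder-hidden-size", "768",
        "--dropout", "0.3",
        "--criterion", "cross_entropy", "--optimizer", "adam",
        "--adam-betas", "(0.9, 0.98)",
        "--lr", "0.001", "--lr-scheduler", "inverse_sqrt",
        "--warmup-updates", "4000",
        "--share-all-embeddings",
        "--max-tokens", "3000",
        "--batch-size-valid", "64",
    ]

# Print the commands only; nothing is trained by this loop
for size in (1000, 2000, 3000, 4000):
    print(shlex.join(train_command("FreEMnorm", size)))
```

To actually train, each list could be passed to `subprocess.run(..., check=True)` after creating the model folder with `os.makedirs(save_dir, exist_ok=True)`.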

## III. Testing

We will need a few functions. One to "paste" the BPE pieces back together:

In [None]:
def decode_sp(list_sents):
    return [''.join(sent).replace(' ', '').replace('▁', ' ').strip() for sent in list_sents]
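
To see what `decode_sp` does, here is a small worked example. The segmentation in `pieces` is written by hand for illustration and is not the output of one of the trained models.

```python
def decode_sp(list_sents):
    # Drop the spaces between pieces, then turn each '▁' marker back into a space
    return [''.join(sent).replace(' ', '').replace('▁', ' ').strip() for sent in list_sents]

pieces = ['▁Bon', 'jour', '▁le', '▁mon', 'de']  # hand-written pieces
encoded_line = ' '.join(pieces)

print(encoded_line)               # ▁Bon jour ▁le ▁mon de
print(decode_sp([encoded_line]))  # ['Bonjour le monde']
```

The `▁` character marks the start of a word, so removing the plain spaces first and then replacing `▁` restores the original word boundaries.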

One to extract the hypotheses from the prediction output:

In [None]:
def extract_hypothesis(filename):
    outputs = []
    with open(filename) as fp:
        for line in fp:
            # keep only the lines that start with H- (for Hypothesis)
            if line.startswith('H-'):
                # take the 3rd column (i.e. index 2)
                outputs.append(line.strip().split('\t')[2])
    return outputs
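
The extraction relies on the tab-separated lines that `fairseq-interactive` writes, where the hypothesis line starts with `H-` and carries a score in the second column and the text in the third. The sketch below applies the same logic to a few in-memory lines; the sample content and scores are invented, and `extract_hypothesis_lines` is a hypothetical variant that takes lines rather than a filename.

```python
# Invented sample mimicking the S-/W-/H-/D-/P- line prefixes
sample_output = "\n".join([
    "S-0\t▁Bon jour",
    "W-0\t0.050\tseconds",
    "H-0\t-0.12\t▁Bon jour",
    "D-0\t-0.12\t▁Bon jour",
    "P-0\t-0.10 -0.14",
])

def extract_hypothesis_lines(lines):
    """Return the third column of every line that starts with 'H-'."""
    outputs = []
    for line in lines:
        if line.startswith("H-"):
            outputs.append(line.rstrip("\n").split("\t")[2])
    return outputs

print(extract_hypothesis_lines(sample_output.splitlines()))  # ['▁Bon jour']
```

Testing the prefix rather than searching for `H-` anywhere in the line avoids picking up a source line whose text happens to contain that string.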

### III.a 1000 words

Prepare the model

In [None]:
import sentencepiece, os
!mkdir -p $DIRECTORY/dev

def normalise(sents):
    # generate temporary file
    filetmp = os.path.join(DIRECTORY,'data/tmp_norm.sp.src.tmp')
    # preprocessing
    input_sp = spm.encode(sents, out_type=str)
    # encode src sentences
    input_sp_sents = [' '.join(sent) for sent in input_sp]
    write_file(input_sp_sents, filetmp)
    #print("preprocessed = ", input_sp_sents)
    # normalise
    !cat $DIRECTORY/data/tmp_norm.sp.src.tmp | fairseq-interactive $DIRECTORY/data/data_norm_bin_1000 --source-lang src --target-lang trg --path $DIRECTORY/models/lstm_dict1000_3l_embed384/checkpoint_best.pt > $DIRECTORY/data/tmp_norm.sp.src.output  #2> $DIRECTORY/dev
    # postprocessing
    outputs = extract_hypothesis(os.path.join(DIRECTORY,'data/tmp_norm.sp.src.output'))
    outputs_postproc = decode_sp(outputs)
    return outputs_postproc

spm = sentencepiece.SentencePieceProcessor(model_file=os.path.join(DIRECTORY,'data/bpe_joint_1000.model'))

Prepare the files

In [None]:
#getting the test files
test_trg = read_file(os.path.join(DIRECTORY,'data/test.trg'))
test_src = read_file(os.path.join(DIRECTORY,'data/test.src'))
#normalising the test.src
test_norm = normalise(test_src)
#save the normalised output
write_file(test_norm, os.path.join(DIRECTORY,'data/test.norm.trg'))

2024-06-20 21:49:57 | INFO | fairseq_cli.interactive | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_na

BLEU:

In [None]:
from sacrebleu.metrics import BLEU, CHRF, TER

bleu = BLEU()
bleu.corpus_score(test_norm,[test_trg])

BLEU = 52.51 77.4/59.4/46.2/35.8 (BP = 1.000 ratio = 1.003 hyp_len = 82838 ref_len = 82598)

Translation Error Rate:

In [None]:
ter = TER()
ter.corpus_score(test_norm,[test_trg])

Character n-gram F-score:

In [None]:
chrf = CHRF()
chrf.corpus_score(test_norm,[test_trg])

Word accuracy:

In [None]:
#Wacc
import align
# first create a file containing only the first 10 sentences of the target document
!head -n 10 data/dev.trg > data/dev.10.trg
align_dev_norm_10 = align.align(test_trg, test_norm)
num_diff = 0
total = 0
for sentence in align_dev_norm_10:
    for word in sentence:
        if '>' in word:
            num_diff += 1
        total += 1
print('Accuracy = ' + str((total - num_diff)/total))
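
For readers who do not have the `align` module at hand, a rough approximation of word accuracy can be computed with the standard library. This is a sketch only: it uses `difflib.SequenceMatcher` instead of the alignment above, counts the reference words that are matched in the hypothesis, and may therefore give slightly different figures. `word_accuracy` and the toy sentences are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def word_accuracy(references, hypotheses):
    """Share of reference words found, in order, in the hypotheses."""
    matched = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        hyp_words = hyp.split()
        matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
        # Each matching block is a run of identical words on both sides
        matched += sum(block.size for block in matcher.get_matching_blocks())
        total += len(ref_words)
    return matched / total if total else 0.0

refs = ["le chat est noir"]     # made-up reference
hyps = ["le chat estoit noir"]  # made-up system output
print('Accuracy = ' + str(word_accuracy(refs, hyps)))  # Accuracy = 0.75
```

In the notebook, the call would be `word_accuracy(test_trg, test_norm)`.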

### III.b 2000 words

Prepare the model

In [None]:
!mkdir -p $DIRECTORY/dev

def normalise(sents):
    # generate temporary file
    filetmp = os.path.join(DIRECTORY,'data/tmp_norm.sp.src.tmp')
    # preprocessing
    input_sp = spm.encode(sents, out_type=str)
    # encode src sentences
    input_sp_sents = [' '.join(sent) for sent in input_sp]
    write_file(input_sp_sents, filetmp)
    #print("preprocessed = ", input_sp_sents)
    # normalise
    !cat $DIRECTORY/data/tmp_norm.sp.src.tmp | fairseq-interactive $DIRECTORY/data/data_norm_bin_2000 --source-lang src --target-lang trg --path $DIRECTORY/models/lstm_dict2000_3l_embed384/checkpoint_best.pt > $DIRECTORY/data/tmp_norm.sp.src.output  #2> $DIRECTORY/dev
    # postprocessing
    outputs = extract_hypothesis(os.path.join(DIRECTORY,'data/tmp_norm.sp.src.output'))
    outputs_postproc = decode_sp(outputs)
    return outputs_postproc

spm = sentencepiece.SentencePieceProcessor(model_file=os.path.join(DIRECTORY,'data/bpe_joint_2000.model'))

Prepare the files

In [None]:
#getting the test files
test_trg = read_file(os.path.join(DIRECTORY,'data/test.trg'))
test_src = read_file(os.path.join(DIRECTORY,'data/test.src'))
#normalising the test.src
test_norm = normalise(test_src)
#save the normalised output
write_file(test_norm, os.path.join(DIRECTORY,'data/test.norm.trg'))

BLEU:

In [None]:
from sacrebleu.metrics import BLEU, CHRF, TER

bleu = BLEU()
bleu.corpus_score(test_norm,[test_trg])

Translation Error Rate:

In [None]:
ter = TER()
ter.corpus_score(test_norm,[test_trg])

Character n-gram F-score:

In [None]:
chrf = CHRF()
chrf.corpus_score(test_norm,[test_trg])

Word accuracy:

In [None]:
#Wacc
import align
# first create a file containing only the first 10 sentences of the target document
!head -n 10 data/dev.trg > data/dev.10.trg
align_dev_norm_10 = align.align(test_trg, test_norm)
num_diff = 0
total = 0
for sentence in align_dev_norm_10:
    for word in sentence:
        if '>' in word:
            num_diff += 1
        total += 1
print('Accuracy = ' + str((total - num_diff)/total))

### III.c 3000 words

Prepare the model

In [None]:
def normalise(sents):
    # generate temporary file
    filetmp = os.path.join(DIRECTORY,'data/tmp_norm.sp.src.tmp')
    # preprocessing
    input_sp = spm.encode(sents, out_type=str)
    # encode src sentences
    input_sp_sents = [' '.join(sent) for sent in input_sp]
    write_file(input_sp_sents, filetmp)
    #print("preprocessed = ", input_sp_sents)
    # normalise
    !cat $DIRECTORY/data/tmp_norm.sp.src.tmp | fairseq-interactive $DIRECTORY/data/data_norm_bin_3000 --source-lang src --target-lang trg --path $DIRECTORY/models/lstm_dict3000_3l_embed384/checkpoint_best.pt > $DIRECTORY/data/tmp_norm.sp.src.output  #2> $DIRECTORY/dev
    # postprocessing
    outputs = extract_hypothesis(os.path.join(DIRECTORY,'data/tmp_norm.sp.src.output'))
    outputs_postproc = decode_sp(outputs)
    return outputs_postproc

spm = sentencepiece.SentencePieceProcessor(model_file=os.path.join(DIRECTORY,'data/bpe_joint_3000.model'))

Prepare the files

In [None]:
#getting the test files
test_trg = read_file(os.path.join(DIRECTORY,'data/test.trg'))
test_src = read_file(os.path.join(DIRECTORY,'data/test.src'))
#normalising the test.src
test_norm = normalise(test_src)
#save the normalised output
write_file(test_norm, os.path.join(DIRECTORY,'data/test.norm.trg'))

BLEU:

In [None]:
from sacrebleu.metrics import BLEU, CHRF, TER

#BLEU
bleu = BLEU()
bleu.corpus_score(test_norm,[test_trg])

Translation Error Rate:

In [None]:
ter = TER()
ter.corpus_score(test_norm,[test_trg])

Character n-gram F-score:

In [None]:
chrf = CHRF()
chrf.corpus_score(test_norm,[test_trg])

Word accuracy:

In [None]:
#Wacc
import align
# first create a file containing only the first 10 sentences of the target document
!head -n 10 data/dev.trg > data/dev.10.trg
align_dev_norm_10 = align.align(test_trg, test_norm)
num_diff = 0
total = 0
for sentence in align_dev_norm_10:
    for word in sentence:
        if '>' in word:
            num_diff += 1
        total += 1
print('Accuracy = ' + str((total - num_diff)/total))

### III.d 4000 words

Prepare the model

In [None]:
def normalise(sents):
    # generate temporary file
    filetmp = os.path.join(DIRECTORY,'data/tmp_norm.sp.src.tmp')
    # preprocessing
    input_sp = spm.encode(sents, out_type=str)
    # encode src sentences
    input_sp_sents = [' '.join(sent) for sent in input_sp]
    write_file(input_sp_sents, filetmp)
    #print("preprocessed = ", input_sp_sents)
    # normalise
    !cat $DIRECTORY/data/tmp_norm.sp.src.tmp | fairseq-interactive $DIRECTORY/data/data_norm_bin_4000 --source-lang src --target-lang trg --path $DIRECTORY/models/lstm_dict4000_3l_embed384/checkpoint_best.pt > $DIRECTORY/data/tmp_norm.sp.src.output  #2> $DIRECTORY/dev
    # postprocessing
    outputs = extract_hypothesis(os.path.join(DIRECTORY,'data/tmp_norm.sp.src.output'))
    outputs_postproc = decode_sp(outputs)
    return outputs_postproc

spm = sentencepiece.SentencePieceProcessor(model_file=os.path.join(DIRECTORY,'data/bpe_joint_4000.model'))

Prepare the files

In [None]:
#getting the test files
test_trg = read_file(os.path.join(DIRECTORY,'data/test.trg'))
test_src = read_file(os.path.join(DIRECTORY,'data/test.src'))
#normalising the test.src
test_norm = normalise(test_src)
#save the normalised output
write_file(test_norm, os.path.join(DIRECTORY,'data/test.norm.trg'))

BLEU:

In [None]:
from sacrebleu.metrics import BLEU, CHRF, TER

#BLEU
bleu = BLEU()
bleu.corpus_score(test_norm,[test_trg])

Translation Error Rate:

In [None]:
ter = TER()
ter.corpus_score(test_norm,[test_trg])

Character n-gram F-score:

In [None]:
chrf = CHRF()
chrf.corpus_score(test_norm,[test_trg])

Word accuracy:

In [None]:
#Wacc
import align
# first create a file containing only the first 10 sentences of the target document
!head -n 10 data/dev.trg > data/dev.10.trg
align_dev_norm_10 = align.align(test_trg, test_norm)
num_diff = 0
total = 0
for sentence in align_dev_norm_10:
    for word in sentence:
        if '>' in word:
            num_diff += 1
        total += 1
print('Accuracy = ' + str((total - num_diff)/total))