
# Fairseq

from https://fairseq.readthedocs.io/en/latest/getting_started.html

"Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks." It provides reference implementations of various sequence-to-sequence models, which makes our lives much easier!

In [None]:
import warnings

# Silence Python warnings so the cell outputs below stay readable.
warnings.filterwarnings('ignore')

## Installation

In [None]:
!pip install --upgrade fairseq
!pip install sacremoses subword_nmt

Collecting fairseq
  Downloading fairseq-0.10.2-cp37-cp37m-manylinux1_x86_64.whl (1.7 MB)
Collecting sacrebleu>=1.4.12
  Downloading sacrebleu-2.0.0-py3-none-any.whl (90 kB)
Collecting hydra-core
  Downloading hydra_core-1.1.2-py3-none-any.whl (147 kB)
Collecting dataclasses
  Downloading dataclasses-0.6-py3-none-any.whl (14 kB)
Collecting colorama
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Collecting portalocker
  Downloading portalocker-2.4.0-py2.py3-none-any.whl (16 kB)
Collecting omegaconf==2.1.*
  Downloading omegaconf-2.1.2-py3-none-any.whl (74 kB)
Collecting importlib-resources<5.3
  Downloading importlib_resources-5.2.3-py3-none-any.whl (27 kB)
Collecting antlr4-python3-runtime==4.8
  Downloading antlr4-python3-runti

Collecting sacremoses
  Downloading sacremoses-0.0.53.tar.gz (880 kB)

## Generation using pre-trained MT model

In [None]:
! curl https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf -

wmt14.en-fr.fconv-py/
wmt14.en-fr.fconv-py/model.pt
wmt14.en-fr.fconv-py/dict.en.txt
wmt14.en-fr.fconv-py/dict.fr.txt
wmt14.en-fr.fconv-py/bpecodes
wmt14.en-fr.fconv-py/README.md


We're going to use this downloaded model in an interactive setting and try out all the decoding algorithms we learnt about! 

##### First, let's try out standard beam search:

In [None]:
%%bash
MODEL_DIR=wmt14.en-fr.fconv-py
echo "Please tell me your name and a little about yourself ." | fairseq-interactive \
    --path $MODEL_DIR/model.pt $MODEL_DIR \
    --tokenizer moses --bpe subword_nmt --bpe-codes $MODEL_DIR/bpecodes  \
    --beam 5 --nbest 5 --source-lang en --target-lang fr --remove-bpe 

2022-05-17 07:00:49 | INFO | fairseq_cli.interactive | Namespace(all_gather_list_size=16384, batch_size=1, batch_size_valid=None, beam=5, bf16=False, bpe='subword_nmt', bpe_codes='wmt14.en-fr.fconv-py/bpecodes', bpe_separator='@@', broadcast_buffers=False, bucket_cap_mb=25, buffer_size=1, checkpoint_shard_count=1, checkpoint_suffix='', constraints=None, cpu=False, criterion='cross_entropy', curriculum=0, data='wmt14.en-fr.fconv-py', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoding_format=None, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_num_procs=0, distributed_port=-1, distributed_rank=0, distributed_world_size=1, distributed_wrapper='DDP', diverse_beam_groups=-1, diverse_beam_strength=0.5, diversity_rate=-1.0, empty_cache_freq=0, eval_bleu=False, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=False, eval_bleu_remove_bpe=Non

  beams_buf = indices_buf // vocab_size
  unfin_idx = idx // beam_size


This generation script produces three types of outputs: a line prefixed with O is a copy of the original source sentence; H is the hypothesis along with an average log-likelihood; and P is the positional score per token position, including the end-of-sentence marker which is omitted from the text.
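
To work with these lines programmatically, a small parser can help. The sketch below is only an illustration: it assumes each result line is tab-separated and starts with a `<letter>-<sentence id>` prefix (for example `H-0`), which may differ between fairseq versions. The function name `parse_fairseq_output` and the sample scores are made up for this example and are not taken from a real run.

```python
from collections import defaultdict


def parse_fairseq_output(text):
    """Group hypothesis (H) and positional-score (P) lines by sentence id.

    Assumes tab-separated lines prefixed with "<letter>-<id>"; everything
    else (log lines, warnings) is skipped.
    """
    results = defaultdict(lambda: {"hypotheses": [], "positional": []})
    for line in text.splitlines():
        fields = line.split("\t")
        tag, _, sent_id = fields[0].partition("-")
        if not sent_id.isdigit():
            continue  # not a "<letter>-<id>" result line
        entry = results[int(sent_id)]
        if tag == "H" and len(fields) == 3:
            # H-<id> <tab> average log-likelihood <tab> hypothesis text
            entry["hypotheses"].append((float(fields[1]), fields[2]))
        elif tag == "P" and len(fields) == 2:
            # P-<id> <tab> one score per token, end-of-sentence included
            entry["positional"].append([float(s) for s in fields[1].split()])
    return dict(results)


# Hypothetical output with invented scores, for illustration only.
sample = (
    "2022-05-17 07:00:49 | INFO | some log line\n"
    "H-0\t-0.5\tbonjour le monde\n"
    "P-0\t-0.4 -0.6 -0.5 -0.5\n"
)
parsed = parse_fairseq_output(sample)
print(parsed[0]["hypotheses"])
print(parsed[0]["positional"])
```

Note that the positional list in the sample has four scores for a three-token hypothesis, because the end-of-sentence marker gets a score but is omitted from the text.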

In [None]:
%%bash
MODEL_DIR=wmt14.en-fr.fconv-py
echo "Please tell me your name and a little about yourself ." | fairseq-interactive \
    --path $MODEL_DIR/model.pt $MODEL_DIR \
    --tokenizer moses --bpe subword_nmt --bpe-codes $MODEL_DIR/bpecodes  \
    --beam 100 --nbest 5 --source-lang en --target-lang fr --remove-bpe 

2022-05-17 07:01:20 | INFO | fairseq_cli.interactive | Namespace(all_gather_list_size=16384, batch_size=1, batch_size_valid=None, beam=100, bf16=False, bpe='subword_nmt', bpe_codes='wmt14.en-fr.fconv-py/bpecodes', bpe_separator='@@', broadcast_buffers=False, bucket_cap_mb=25, buffer_size=1, checkpoint_shard_count=1, checkpoint_suffix='', constraints=None, cpu=False, criterion='cross_entropy', curriculum=0, data='wmt14.en-fr.fconv-py', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoding_format=None, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_num_procs=0, distributed_port=-1, distributed_rank=0, distributed_world_size=1, distributed_wrapper='DDP', diverse_beam_groups=-1, diverse_beam_strength=0.5, diversity_rate=-1.0, empty_cache_freq=0, eval_bleu=False, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=False, eval_bleu_remove_bpe=N

  beams_buf = indices_buf // vocab_size
  unfin_idx = idx // beam_size


Diverse beam search produces much more varied generations:

In [None]:
%%bash
MODEL_DIR=wmt14.en-fr.fconv-py
echo "Please tell me your name and a little about yourself ." | fairseq-interactive \
    --path $MODEL_DIR/model.pt $MODEL_DIR \
    --tokenizer moses --bpe subword_nmt --bpe-codes $MODEL_DIR/bpecodes  \
    --beam 5 --nbest 5 --source-lang en --target-lang fr --remove-bpe --diverse-beam-groups 5 --diverse-beam-strength 10

2022-05-17 07:01:42 | INFO | fairseq_cli.interactive | Namespace(all_gather_list_size=16384, batch_size=1, batch_size_valid=None, beam=5, bf16=False, bpe='subword_nmt', bpe_codes='wmt14.en-fr.fconv-py/bpecodes', bpe_separator='@@', broadcast_buffers=False, bucket_cap_mb=25, buffer_size=1, checkpoint_shard_count=1, checkpoint_suffix='', constraints=None, cpu=False, criterion='cross_entropy', curriculum=0, data='wmt14.en-fr.fconv-py', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoding_format=None, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_num_procs=0, distributed_port=-1, distributed_rank=0, distributed_world_size=1, distributed_wrapper='DDP', diverse_beam_groups=5, diverse_beam_strength=10.0, diversity_rate=-1.0, empty_cache_freq=0, eval_bleu=False, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=False, eval_bleu_remove_bpe=Non

  beams_buf = indices_buf // vocab_size
  unfin_idx = idx // beam_size
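
The idea behind the diverse beam search run above can be sketched in a few lines. This is a simplified illustration, not fairseq's implementation: it assumes one beam per group (as with `--beam 5 --diverse-beam-groups 5`), assumes every group sees the same next-token log-probabilities (true only at the first decoding step), and applies a Hamming-style penalty in which each token already picked by an earlier group loses `strength` from its score. The function name `diverse_step` and the toy distribution are hypothetical.

```python
import math


def diverse_step(log_probs, num_groups, strength):
    """Pick one token per group, penalising tokens chosen by earlier groups."""
    counts = [0] * len(log_probs)  # how many earlier groups picked each token
    chosen = []
    for _ in range(num_groups):
        adjusted = [lp - strength * c for lp, c in zip(log_probs, counts)]
        token = max(range(len(adjusted)), key=adjusted.__getitem__)
        chosen.append(token)
        counts[token] += 1
    return chosen


# Toy next-token distribution over a four-word vocabulary.
log_probs = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]

no_penalty = diverse_step(log_probs, num_groups=3, strength=0.0)
with_penalty = diverse_step(log_probs, num_groups=3, strength=10.0)
print(no_penalty)    # [0, 0, 0]
print(with_penalty)  # [0, 1, 2]
```

With no penalty every group picks the same most likely token. With a large strength, as in the cell above, later groups are pushed onto different tokens, which is why the hypotheses diverge.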


Let's try using different decoding methods for generation and see what results we get!

In [None]:
%%bash
MODEL_DIR=wmt14.en-fr.fconv-py
echo "Please tell me your name and a little about yourself ." | fairseq-interactive \
    --path $MODEL_DIR/model.pt $MODEL_DIR \
    --tokenizer moses --bpe subword_nmt --bpe-codes $MODEL_DIR/bpecodes  \
    --sampling --sampling-topk 10 --nbest 5 --source-lang en --target-lang fr --remove-bpe

2022-05-17 07:01:57 | INFO | fairseq_cli.interactive | Namespace(all_gather_list_size=16384, batch_size=1, batch_size_valid=None, beam=5, bf16=False, bpe='subword_nmt', bpe_codes='wmt14.en-fr.fconv-py/bpecodes', bpe_separator='@@', broadcast_buffers=False, bucket_cap_mb=25, buffer_size=1, checkpoint_shard_count=1, checkpoint_suffix='', constraints=None, cpu=False, criterion='cross_entropy', curriculum=0, data='wmt14.en-fr.fconv-py', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoding_format=None, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_num_procs=0, distributed_port=-1, distributed_rank=0, distributed_world_size=1, distributed_wrapper='DDP', diverse_beam_groups=-1, diverse_beam_strength=0.5, diversity_rate=-1.0, empty_cache_freq=0, eval_bleu=False, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=False, eval_bleu_remove_bpe=Non

  unfin_idx = idx // beam_size
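
The `--sampling-topk 10` run above restricts sampling to the 10 most probable next tokens at each step. A minimal sketch of that truncation on a toy distribution, using only the standard library, might look as follows. The helper name `top_k_filter` and the probabilities are illustrative assumptions, not fairseq code.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, and renormalise."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


# Toy next-token distribution over a four-word vocabulary.
probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_k_filter(probs, k=2)
print(filtered)

# Sample one token id from the truncated distribution.
rng = random.Random(0)
token = rng.choices(range(len(probs)), weights=filtered, k=1)[0]
print(token)
```

Only the two most likely tokens can be sampled here. A smaller `k` makes the output closer to greedy decoding, and a larger `k` allows more variety.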


In [None]:
%%bash
MODEL_DIR=wmt14.en-fr.fconv-py
echo "Please tell me your name and a little about yourself ." | fairseq-interactive \
    --path $MODEL_DIR/model.pt $MODEL_DIR \
    --tokenizer moses --bpe subword_nmt --bpe-codes $MODEL_DIR/bpecodes  \
    --sampling --sampling-topp 0.1 --nbest 5 --source-lang en --target-lang fr --remove-bpe

2022-05-17 07:02:12 | INFO | fairseq_cli.interactive | Namespace(all_gather_list_size=16384, batch_size=1, batch_size_valid=None, beam=5, bf16=False, bpe='subword_nmt', bpe_codes='wmt14.en-fr.fconv-py/bpecodes', bpe_separator='@@', broadcast_buffers=False, bucket_cap_mb=25, buffer_size=1, checkpoint_shard_count=1, checkpoint_suffix='', constraints=None, cpu=False, criterion='cross_entropy', curriculum=0, data='wmt14.en-fr.fconv-py', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoding_format=None, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_num_procs=0, distributed_port=-1, distributed_rank=0, distributed_world_size=1, distributed_wrapper='DDP', diverse_beam_groups=-1, diverse_beam_strength=0.5, diversity_rate=-1.0, empty_cache_freq=0, eval_bleu=False, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=False, eval_bleu_remove_bpe=Non

  unfin_idx = idx // beam_size


In [None]:
%%bash
MODEL_DIR=wmt14.en-fr.fconv-py
echo "Please tell me your name and a little about yourself ." | fairseq-interactive \
    --path $MODEL_DIR/model.pt $MODEL_DIR \
    --tokenizer moses --bpe subword_nmt --bpe-codes $MODEL_DIR/bpecodes  \
    --sampling --sampling-topp 0.5 --nbest 5 --source-lang en --target-lang fr --remove-bpe

2022-05-17 07:02:27 | INFO | fairseq_cli.interactive | Namespace(all_gather_list_size=16384, batch_size=1, batch_size_valid=None, beam=5, bf16=False, bpe='subword_nmt', bpe_codes='wmt14.en-fr.fconv-py/bpecodes', bpe_separator='@@', broadcast_buffers=False, bucket_cap_mb=25, buffer_size=1, checkpoint_shard_count=1, checkpoint_suffix='', constraints=None, cpu=False, criterion='cross_entropy', curriculum=0, data='wmt14.en-fr.fconv-py', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoding_format=None, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_num_procs=0, distributed_port=-1, distributed_rank=0, distributed_world_size=1, distributed_wrapper='DDP', diverse_beam_groups=-1, diverse_beam_strength=0.5, diversity_rate=-1.0, empty_cache_freq=0, eval_bleu=False, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=False, eval_bleu_remove_bpe=Non

  unfin_idx = idx // beam_size


In [None]:
%%bash
MODEL_DIR=wmt14.en-fr.fconv-py
echo "Please tell me your name and a little about yourself ." | fairseq-interactive \
    --path $MODEL_DIR/model.pt $MODEL_DIR \
    --tokenizer moses --bpe subword_nmt --bpe-codes $MODEL_DIR/bpecodes \
    --sampling --sampling-topp 0.9 --nbest 5 --source-lang en --target-lang fr --remove-bpe

2022-05-17 07:02:43 | INFO | fairseq_cli.interactive | Namespace(all_gather_list_size=16384, batch_size=1, batch_size_valid=None, beam=5, bf16=False, bpe='subword_nmt', bpe_codes='wmt14.en-fr.fconv-py/bpecodes', bpe_separator='@@', broadcast_buffers=False, bucket_cap_mb=25, buffer_size=1, checkpoint_shard_count=1, checkpoint_suffix='', constraints=None, cpu=False, criterion='cross_entropy', curriculum=0, data='wmt14.en-fr.fconv-py', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoding_format=None, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_num_procs=0, distributed_port=-1, distributed_rank=0, distributed_world_size=1, distributed_wrapper='DDP', diverse_beam_groups=-1, diverse_beam_strength=0.5, diversity_rate=-1.0, empty_cache_freq=0, eval_bleu=False, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=False, eval_bleu_remove_bpe=Non

  unfin_idx = idx // beam_size
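
The last three runs vary `--sampling-topp` over 0.1, 0.5 and 0.9. Top-p (nucleus) sampling keeps the smallest set of most probable tokens whose cumulative probability reaches `p`, then samples from that set. The sketch below shows how the size of that set grows with `p` on a toy distribution. The helper names and probabilities are hypothetical, and the exact boundary handling in fairseq may differ.

```python
def top_p_filter(probs, p):
    """Keep the smallest high-probability prefix with mass >= p, renormalised."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break  # the nucleus now covers at least p of the probability mass
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]


def nucleus_size(probs, p):
    """Number of tokens that can still be sampled after top-p filtering."""
    return sum(1 for q in top_p_filter(probs, p) if q > 0)


# Toy next-token distribution over a five-word vocabulary.
probs = [0.4, 0.3, 0.15, 0.1, 0.05]
for p in (0.1, 0.5, 0.9):
    print(p, nucleus_size(probs, p))
```

On this toy distribution the nucleus holds 1, 2 and 4 tokens for `p` = 0.1, 0.5 and 0.9 respectively, so a low `p` behaves almost like greedy decoding while a high `p` admits less likely tokens.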
