Skip to content
No description, website, or topics provided.
Python Cuda Other
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
.github fix Windows build (#1007) Jan 24, 2020
docs More fully deprecate --raw-text and --lazy-load (fixes #1488) Dec 17, 2019
examples Script Fairseq transformer (#1011) Jan 30, 2020
fairseq knnlm initial commit Feb 14, 2020
fairseq_cli knnlm initial commit Feb 14, 2020
scripts Switch to Python logging (+ lint) (#1627) Jan 17, 2020
tests Add save and load tests to fairseq export test (#1653) Jan 31, 2020
.gitignore REFACTOR: NAT Implementation (#925) Dec 4, 2019
CODE_OF_CONDUCT.md Adopt Contributor Covenant Aug 30, 2019
CONTRIBUTING.md Relicense fairseq under MIT license (#786) Jul 30, 2019
LICENSE Relicense fairseq under MIT license (#786) Jul 30, 2019
README.md Update README.md Feb 14, 2020
README_fairseq.md knnlm initial commit Feb 14, 2020
build_dstore.py knnlm initial commit Feb 14, 2020
eval_lm.py Fix binaries in root dir (#995) Jan 17, 2020
fairseq.gif Initial commit Sep 15, 2017
fairseq_logo.png Fixes (#442) Jan 14, 2019
generate.py Fix binaries in root dir (#995) Jan 17, 2020
hubconf.py Build Cython components when loading hub (#1386) Nov 18, 2019
interactive.py Fix binaries in root dir (#995) Jan 17, 2020
preprocess.py Fix binaries in root dir (#995) Jan 17, 2020
score.py Fix binaries in root dir (#995) Jan 17, 2020
setup.py Fix binaries in root dir (#995) Jan 17, 2020
train.py Fix binaries in root dir (#995) Jan 17, 2020
validate.py Fix binaries in root dir (#995) Jan 17, 2020

README.md

Nearest Neighbor Language Models

This repository is a fork of the Fairseq repository and the exact commit that this code is based on can be found here. Please use the exact commit page to determine software requirements for using this code. This README will be updated once the code has been merged into Fairseq.

This code pertains to the ICLR 2020 paper: Generalization through Memorization: Nearest Neighbor Language Models. If you use this code or results from our paper, please cite:

@inproceedings{khandelwal20generalization,
  title={{Generalization through Memorization: Nearest Neighbor Language Models}},
  author={Khandelwal, Urvashi and Levy, Omer and Jurafsky, Dan and Zettlemoyer, Luke and Lewis, Mike},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2020}
}

Wikitext-103 Experiments

Before starting, make sure you install Fairseq (after pulling the code, from the project directory) and FAISS:

pip install --editable .

pip install faiss

A Note about Hardware

Experiments for this paper were conducted on machines that contain 500GB of RAM, NVIDIA V100 32GB GPUs and flash storage (SSDs). Saving the Wikitext-103 datastore requires 400GB of disk space. The speed of saving the datastore, building the FAISS index and evaluating the nearest neighbors language model heavily depends on the amount of RAM available for each job. Some of these steps can be sped up by parallelizing, which we leave for users to do in order to best cater to their setup.

If you are working with a remote cluster, please note that we use memmaps for saving the datastore. This allows us to keep the data on disk while accessing it by loading small chunks into memory, depending on the available RAM. This means there are a large number of disk seeks. In order to prevent slowing down your entire cluster, we suggest always reading/writing this data to/from local disks (as opposed to NFS directories), and flash storage is best for faster access.

Preparing the data

We share Fairseq's instructions on how to prepare the data here.

cd examples/language_model/
bash prepare-wikitext-103.sh
cd ../..


TEXT=examples/language_model/wikitext-103
python preprocess.py \
    --only-source \
    --trainpref $TEXT/wiki.train.tokens \
    --validpref $TEXT/wiki.valid.tokens \
    --testpref $TEXT/wiki.test.tokens \
    --destdir data-bin/wikitext-103 \
    --workers 20

Training the Language Model

We share Fairseq's instructions on how to train the language model here. Alternatively, you can download the checkpoint used for our experiments here.

python train.py --task language_modeling \
    data-bin/wikitext-103 \
    --save-dir checkpoints/ \
    --arch transformer_lm_wiki103 \
    --max-update 286000 --max-lr 1.0 --t-mult 2 --lr-period-updates 270000 --lr-scheduler cosine --lr-shrink 0.75 \
    --warmup-updates 16000 --warmup-init-lr 1e-07 --min-lr 1e-09 --optimizer nag --lr 0.0001 --clip-norm 0.1 \
    --criterion adaptive_loss --max-tokens 3072 --update-freq 3 --tokens-per-sample 3072 --seed 1 --fp16 \
    --sample-break-mode none --skip-invalid-size-inputs-valid-test --ddp-backend=no_c10d

This model was trained on 8 gpus.

Evaluating the Language Model

To evaluate the model on the validation set:

python eval_lm.py data-bin/wikitext-103 \
    --path checkpoints/checkpoint_best.pt \
    --sample-break-mode complete --max-tokens 3072 \
    --context-window 2560 --softmax-batch 1024 \
    --gen-subset valid

Saving the keys and values for the datastore

In order to save keys and values for the datastore, we must run model evaluation over the entire training set.

Caution: Running this step requires a large amount of disk space (400GB!). Please read the note about hardware above, before running this!

python eval_lm.py data-bin/wikitext-103 \
    --path checkpoints/checkpoint_best.pt \
    --sample-break-mode none --max-tokens 3072 \
    --softmax-batch 1024 --gen-subset train \
    --context-window 1536 --tokens-per-sample 1536 \
    --dstore-mmap checkpoints/dstore --knn-keytype 'last_ffn_input' \
    --dstore-size 103225485 --model-overrides "{'knn_keytype': 'last_ffn_input'}" \
    --save-knnlm-dstore --fp16

The total number of tokens in the Wikitext-103 training set is 103227021. The dstore size 103225485 is 1536 tokens less than the total due to the context-window. We want each key to be constructed using a minimum amount of prior context.

If you would prefer to save the keys and values in float16, please use the --dstore-fp16 flag and remember to use it during the index building and evaluation steps as well.

Building the FAISS index

The FAISS index requires a training stage where it learns a set of clusters for the keys. Once this is completed, the keys must all be added to the index. The speed of adding keys to the index depends on the hardware, particularly the amount of RAM available. Please check the paper for more details on our use of FAISS.

Note that the following command runs on CPU.

python build_dstore.py 
    --dstore_mmap checkpoints/dstore \
    --dstore_size 103225485 \
    --faiss_index checkpoints/knn.index \
    --num_keys_to_add_at_a_time 500000 \
    --starting_point 0

Evaluating the Nearest Neighbor Language Model

To evaluate the model on the validation set:

python eval_lm.py data-bin/wikitext-103 \
    --path checkpoints/checkpoint_best.pt \
    --sample-break-mode complete --max-tokens 3072 \
    --context-window 2560 --softmax-batch 1024 \
    --gen-subset valid --dstore-filename checkpoints/dstore \
    --indexfile checkpoints/knn.index  \
    --model-overrides "{'knn_keytype': 'last_ffn_input'}" \
    --k 1024 --lmbda 0.25 --dstore-size 103225485 --knn-keytype last_ffn_input \
    --probe 32 --knnlm --fp16

If your hardware constraints make this too slow, you can run it without using full precision keys by adding two flags: --no-load-keys and --knn-sim-func "do_not_recomp_l2". This uses the quantized versions of keys stored within the FAISS index. You can make things faster by reducing the value of the probe (the number of clusters FAISS checks for neighbors) at the cost of performance. You can also try reducing the number of neighbors k.

You can’t perform that action at this time.