Data Contamination

Code for "Data Contamination: From Memorization to Exploitation" by Inbal Magar and Roy Schwartz, ACL 2022.

Setup

The code is implemented in python 3.7.3 using Hugging Face's transformers. To run it, please install the requirements.txt file:

pip install -r requirements.txt

Prepare datasets

To create the combined corpus of clean and contaminated data, one needs to extract and preprocess the April 21’ English Wikipedia dump (we used the wikiextractor tool (Attardi, 2015), which can be found in https://github.com/attardi/wikiextractor). Then, run the prepare_data.py command.

E.g.,

python prepare_data/prepare_data.py \
--contamination_path data/SST5/sst5_train_test_1_set.txt\
--contamination_copies 100 \
--wiki_path **path_to_wikipedia_file** \
--wiki_size 1e6 \
--path **path_to_new_file**

Running experiments

To reproduce the experiments in the paper (or experiment with your own data), use the run_pipeline.sh command. To reproduce the experiments of changing the position of the contaminated data (figure 4), replace pretrain/run_mlm.py with ``pretrain/run_mlm_no_shuffle.py``` in the above script.

Cite

@inproceedings{magar-schwartz-2022-data,
    title = "Data Contamination: From Memorization to Exploitation",
    author = "Magar, Inbal  and Schwartz, Roy",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-short.18",
    doi = "10.18653/v1/2022.acl-short.18",
    pages = "157--165",
}

Contact

For inquiries, please send an email to inbal.magar@mail.huji.ac.il.

Name		Name	Last commit message	Last commit date
Latest commit History 65 Commits
datasets		datasets
finetune		finetune
papers_results		papers_results
prepare_data		prepare_data
pretrain		pretrain
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
run_pipeline.sh		run_pipeline.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data Contamination

Setup

Prepare datasets

Running experiments

Cite

Contact

About

Releases

Packages

Languages

License

schwartz-lab-NLP/data_contamination

Folders and files

Latest commit

History

Repository files navigation

Data Contamination

Setup

Prepare datasets

Running experiments

Cite

Contact

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages