Using Confidential Data for NMT Domain Adaptation

This page hosts the datasets used in the paper Using Confidential Data for NMT Domain Adaptation. All original parallel corpora come from OPUS.

We use the EMEA, GNOME, and JRC-Acquis domains, all for German-to-English translation.

Description of datasets

\full_sent_datasets :
- Full-sentence training, validation, and test sets.
\phrase_pairs :
- Extracted phrase pairs with a maximum length of 4.
- The pairs are shuffled and then sub-sampled at 50%.
\original_documents :
- To simulate our scenario more closely, we build the datasets document by document. We also provide the documents that make up our datasets.
- All documents are named sequentially from the number 1.
- The JRC documents are not online yet but will be uploaded soon.
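
The shuffling and 50% sub-sampling described for \phrase_pairs can be sketched roughly as follows. This is a minimal illustration, not the script used to build the released data: the function name `subsample_phrase_pairs`, the whitespace tokenization, the fixed seed, and the choice to apply the length limit of 4 to both sides of a pair are all assumptions, and the example pairs are made up.

```python
import random


def subsample_phrase_pairs(pairs, rate=0.5, max_len=4, seed=0):
    """Filter by length, shuffle, and keep a fraction of the phrase pairs.

    Hypothetical helper: the length limit is applied to both sides and
    tokens are split on whitespace, which are assumptions.
    """
    rng = random.Random(seed)  # local RNG so the result is reproducible
    kept = [
        (src, tgt)
        for src, tgt in pairs
        if len(src.split()) <= max_len and len(tgt.split()) <= max_len
    ]
    rng.shuffle(kept)
    # Keep the first `rate` fraction of the shuffled list.
    return kept[: int(len(kept) * rate)]


# Made-up German-English phrase pairs for illustration only.
pairs = [
    ("das Arzneimittel", "the medicinal product"),
    ("Datei öffnen", "open file"),
    ("der Rat", "the Council"),
    ("Seite einrichten", "page setup"),
    ("in der Europäischen Union zugelassen werden",
     "be authorised in the European Union"),
]

sample = subsample_phrase_pairs(pairs)
for src, tgt in sample:
    print(f"{src}\t{tgt}")
```

With these five pairs, the last one exceeds the length limit and is dropped, so two of the remaining four are kept. Which two depends on the seed.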

Statistics of datasets

Full sentence datasets

Type	Sentences
Train	10k
Validation	150
Test	2k

References/Credits

Please cite the following paper if you use the code or the datasets. Thank you!

Kim S., Bisazza A., Turkmen F., Using Confidential Data for Domain Adaptation of Neural Machine Translation, Third Workshop on Privacy in Natural Language Processing (PrivateNLP) 2021, co-located with NAACL 2021.
