GitHub - sythello/CIBB-dataset: The CIBB dataset used for 'blacklist method' evaluation for MT system's performance on Chinese idioms.

sythello / CIBB-dataset Public

Notifications You must be signed in to change notification settings
Fork 2
Star 6

The CIBB dataset used for 'blacklist method' evaluation for MT system's performance on Chinese idioms.

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.gitignore		.gitignore
README.txt		README.txt
filter-trig-pairs.py		filter-trig-pairs.py
idiom_blacklist.blacklist.en.txt		idiom_blacklist.blacklist.en.txt
idiom_blacklist.ref.en.txt		idiom_blacklist.ref.en.txt
idiom_blacklist.src.zh.txt		idiom_blacklist.src.zh.txt
list_idiom_blacklist.txt		list_idiom_blacklist.txt

Repository files navigation

README of the CIBB dataset.

1.Files in this directory:

- README.txt: This file.

- filter-trig-pairs.py: The python script for automatically separating translations triggering or not triggering the blacklist. (Need 'nltk' package to run)

- idiom_blacklist.blacklist.en.txt: The blacklist for each Chinese source sentence.

- idiom_blacklist.src.zh.txt: Chinese source sentences.

- idiom_blacklist.ref.en.txt: Reference English translations for each Chinese source sentence.

- list_idiom_blacklist.txt: The idiom list. Every 5 lines describe an idiom. The format is like the following:
The Chinese idiom
Frequency in the training data
English translation
Blacklist
(blank line)

2. How to execute the evaluation:
(1) Translate all the Chinese source sentences using your MT system. Put all the translations in a file named 'idiom_blacklist.trans.en.txt' under this directory, one translation in a line, just like in 'idiom_blacklist.ref.en.txt'.
(2) Run the script 'filter-trig-pairs.py'.
(3) All the translations triggering the blacklist will be put into file 'idiom_blacklist.trig.pairs.txt', while those not triggering the blacklist will be put into 'idiom_blacklist.nontrig.pairs.txt'. The result (how many triggered, how many not) will be printed on the shell.