Tokyo Metropolitan University Paraphrase Corpus (TMUP)

TMUP is an evaluation corpus for Japanese paraphrase identification. It consists of 655 sentence pairs in total.

  • 363 paraphrase sentence pairs
  • 292 non-paraphrase sentence pairs

Candidate Acquisition Method

To acquire both paraphrase and non-paraphrase instances, we

  • generated sentence pairs with Google's phrase-based (PBMT) and neural (NMT) machine translation systems to acquire paraphrases
  • extracted sentence pairs from Japanese Wikipedia to acquire non-paraphrases

To acquire both trivial and non-trivial instances, we

  • calculated the word overlap rate (Jaccard score) of each sentence pair and sampled candidates uniformly with respect to that score
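
The word overlap step can be sketched roughly as follows. This is an illustration only, not the authors' code: it assumes each sentence has already been split into word tokens (the tokenizer is not specified here), and the binning scheme, the function names, and the `n_bins`, `per_bin`, and `seed` parameters are all hypothetical.

```python
import random
from collections import defaultdict


def jaccard(tokens_a, tokens_b):
    """Word overlap rate: |A & B| / |A | B| over the two token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty sentences count as zero overlap
    return len(set_a & set_b) / len(union)


def sample_uniformly_by_overlap(pairs, n_bins=10, per_bin=2, seed=0):
    """Group pairs into equal-width Jaccard bins and draw up to per_bin from each.

    Assumption: "uniform sampling" is approximated by taking the same number
    of candidates from every score bin that has any.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for tokens_a, tokens_b in pairs:
        score = jaccard(tokens_a, tokens_b)
        index = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 falls in the last bin
        bins[index].append((score, tokens_a, tokens_b))
    sampled = []
    for index in sorted(bins):
        bucket = bins[index]
        sampled.extend(rng.sample(bucket, min(per_bin, len(bucket))))
    return sampled


# Illustrative tokens, not taken from the corpus: 4 shared / 6 distinct words.
score = jaccard(["私", "は", "猫", "が", "好き"], ["私", "は", "犬", "が", "好き"])
```

Sampling per bin keeps both high-overlap (trivial) and low-overlap (non-trivial) pairs in the candidate set, rather than letting the most common overlap range dominate.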

Annotation

Two annotators judged whether each candidate pair is a paraphrase.

*For more details, please refer to the paper.

Data Format

label <TAB> sentence_A_ja <TAB> sentence_B_ja <TAB> source_sentence_en (if applicable)

Labels

  • 1: Paraphrase
  • 0: Non-paraphrase
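
A minimal sketch of reading this format in Python is shown below. The `Pair` type, the function names, and the choice of UTF-8 encoding are assumptions made for illustration; the corpus file name is not stated here, so the path is left to the caller.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Pair(NamedTuple):
    label: int                # 1 = paraphrase, 0 = non-paraphrase
    sentence_a: str           # sentence_A_ja
    sentence_b: str           # sentence_B_ja
    source_en: Optional[str]  # source_sentence_en, or None when absent


def parse_line(line):
    """Parse one tab-separated line with three or four fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) not in (3, 4):
        raise ValueError(f"expected 3 or 4 tab-separated fields, got {len(fields)}")
    label = int(fields[0])
    if label not in (0, 1):
        raise ValueError(f"unexpected label: {label}")
    # The English source is only present where applicable.
    source_en = fields[3] if len(fields) == 4 and fields[3] else None
    return Pair(label, fields[1], fields[2], source_en)


def load_corpus(path):
    """Read every non-empty line of the file (assumed to be UTF-8)."""
    with open(path, encoding="utf-8") as handle:
        return [parse_line(line) for line in handle if line.strip()]


def label_counts(pairs):
    """Count how many pairs carry each label."""
    return Counter(pair.label for pair in pairs)
```

On the full corpus, `label_counts(load_corpus(path))` should give 363 for label 1 and 292 for label 0 if the file matches the counts stated above.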

Citing

If you make use of this corpus, please cite the following publication:

Yui Suzuki, Tomoyuki Kajiwara and Mamoru Komachi. Building a Non-Trivial Paraphrase Corpus using Multiple Machine Translation Systems. In Proceedings of ACL 2017 Student Research Workshop, Vancouver, Canada. July 2017 (to appear).

@inproceedings{,
    author      = {Suzuki, Yui and Kajiwara, Tomoyuki and Komachi, Mamoru},
    title       = {Building a Non-Trivial Paraphrase Corpus
                  using Multiple Machine Translation Systems},
    booktitle   = {Proceedings of ACL 2017 Student Research Workshop},
    month       = {July},
    year        = {2017},
    address     = {Vancouver, Canada},
    publisher   = {Association for Computational Linguistics},
    pages     = {(to appear)},
    url       = {http://www.aclweb.org/anthology/}
}

License


This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

Copyright (c) 2017 TMU-NLP

Contact

For inquiries and feedback, please contact the authors below:

  • Yui Suzuki <suzuki-yui at ed.tmu.ac.jp>
  • Tomoyuki Kajiwara <kajiwara-tomoyuki at ed.tmu.ac.jp>
  • Mamoru Komachi <komachi at tmu.ac.jp>
