Benchmarks for word embeddings evaluation

The metadata for a dataset includes:

  • language (en, ja, etc.)
  • task (analogy, similarity, etc.)
  • description (e.g. Bigger Analogy Test Set)
  • version (e.g. 3.0)
  • cite (bibtex for the paper to cite)
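As an illustration, the fields above could be represented and checked as in the following minimal sketch. It assumes the metadata is stored as JSON, which this README does not state; the `DatasetMetadata` class, the `parse_metadata` helper, and the placeholder `cite` value are hypothetical names used only for this example.

```python
import json
from dataclasses import dataclass

# The five fields listed in this README.
REQUIRED_FIELDS = ("language", "task", "description", "version", "cite")


@dataclass
class DatasetMetadata:
    """Hypothetical container for one dataset's metadata."""
    language: str     # e.g. "en", "ja"
    task: str         # e.g. "analogy", "similarity"
    description: str  # e.g. "Bigger Analogy Test Set"
    version: str      # e.g. "3.0"
    cite: str         # bibtex for the paper to cite


def parse_metadata(text: str) -> DatasetMetadata:
    """Parse a JSON string (assumed format) and check that all fields exist."""
    raw = json.loads(text)
    missing = [name for name in REQUIRED_FIELDS if name not in raw]
    if missing:
        raise ValueError("missing metadata fields: " + ", ".join(missing))
    return DatasetMetadata(**{name: str(raw[name]) for name in REQUIRED_FIELDS})


# Example record built from the sample values given in the list above.
example = json.dumps({
    "language": "en",
    "task": "analogy",
    "description": "Bigger Analogy Test Set",
    "version": "3.0",
    "cite": "<bibtex entry goes here>",  # placeholder, not a real citation
})
meta = parse_metadata(example)
print(meta.task, meta.version)
```

Keeping `version` as a string avoids turning values such as `3.0` into floats when comparing or printing them.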

Available datasets

English

Word similarity:

  1. WordSim 353
  2. MEN
  3. SimLex
  4. Rare Words
  5. MTurk

Word analogy:

  1. BATS

Text classification:

  1. IMDb movie reviews sample

Japanese

Word similarity:

  1. Japanese word similarity (https://github.com/tmu-nlp/JapaneseWordSimilarityDataset)

Word analogy:

  1. JBATS