Code accompanying Incorporating Chinese Characters of Words for Lexical Sememe Prediction (ACL2018) https://arxiv.org/abs/1806.06349
.gitignore
CSP.sh
Ensemble_model_CSP.py
Ensemble_model_external.py
Ensemble_model_internal.py
LICENSE
README.md
SPCSE.sh
SPCSE_prediction.py
SPCSE_train.py
SPWCF.sh
SPWCF_prediction.py
Sememe_PMI_Matrix_Generator.py
data_generator.sh
hownet.txt
hownet_corpus_data_picker.py
pickle_version_change.py
scorer.py
test_data_generator.py
work.sh


Character-enhanced-Sememe-Prediction


Introduction

This repository contains the code for Incorporating Chinese Characters of Words for Lexical Sememe Prediction (ACL 2018) [1].

Usage

Dependency Requirements

The shell files explicitly specify which Python version is used to run each Python file.

  1. Python 2.7 (for running the main code)
  2. Python 3 (required only by CSP.sh, to change the pickle version of the files dumped by SPWE and SPSE)
  3. NumPy > 1.0
  4. We strongly encourage you to install Anaconda to manage your dependency environment.
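
Before running the shell scripts, it can help to confirm that both interpreters are reachable. The sketch below is only an illustration: the executable names `python2.7` and `python3` and the helper `check_environment` are assumptions, and the shell files in this repository remain the authority on which interpreter each script uses. Note that the NumPy check covers only the interpreter that runs this snippet.

```python
import importlib
import shutil


def check_environment(executables=("python2.7", "python3")):
    """Report where each interpreter lives and which NumPy is importable.

    The executable names are assumptions; adjust them to match the
    commands used in the repository's shell files.
    """
    report = {}
    for name in executables:
        # shutil.which returns the full path, or None if not on PATH.
        report[name] = shutil.which(name)
    try:
        # Only checks NumPy for the interpreter running this snippet.
        numpy = importlib.import_module("numpy")
        report["numpy"] = numpy.__version__
    except ImportError:
        report["numpy"] = None
    return report


if __name__ == "__main__":
    for item, found in check_environment().items():
        print("%-10s %s" % (item, found if found else "NOT FOUND"))
```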

Preparation 

  1. Prepare a file that contains pre-trained Chinese word embeddings in Google Word2Vec format. We recommend at least 200,000 words and at least 200 dimensions. Embeddings trained on a large corpus (20GB or more is recommended) give much better results with this program.

  2. Rename the word embedding file to embedding_200.txt and put it in the repository root directory.

mv path/to/file/your_word_vec.txt ./embedding_200.txt
  3. Prepare a file that contains pre-trained Chinese character embeddings in CWE format (see paper [2] and its code). We recommend at least 200 dimensions. Embeddings trained on a large corpus (20GB or more is recommended) give much better results with this program.

  4. Rename the character embedding file to char_embedding_200.txt and put it in the repository root directory.

mv path/to/file/your_character_embedding_file.txt ./char_embedding_200.txt
  5. Run data_generator.sh. The program automatically generates the evaluation data set and the other data files required during training.
./data_generator.sh
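
The size recommendations above can be checked before training. The following is a minimal sketch, assuming the word embedding file uses the Word2Vec text layout (a `<vocab_size> <dimension>` header line followed by one `word v1 v2 ...` row per word) and UTF-8 encoding. The function name `check_word_embeddings` is hypothetical and not part of this repository, and the character embedding file follows the CWE format, which may differ from this layout.

```python
import io


def check_word_embeddings(path, min_vocab=200000, min_dim=200):
    """Return a list of problems found in the file (empty if none).

    Assumes Word2Vec text format and UTF-8 encoding; both are
    assumptions about your embedding file, not guarantees.
    """
    with io.open(path, "r", encoding="utf-8") as handle:
        header = handle.readline().split()
        first_row = handle.readline().split()
    try:
        vocab_size, dim = int(header[0]), int(header[1])
    except (IndexError, ValueError):
        return ["first line is not a '<vocab_size> <dimension>' header"]

    problems = []
    if vocab_size < min_vocab:
        problems.append("only %d words (recommended: %d or more)"
                        % (vocab_size, min_vocab))
    if dim < min_dim:
        problems.append("only %d dimensions (recommended: %d or more)"
                        % (dim, min_dim))
    # The first row should hold one word followed by `dim` values.
    if len(first_row) - 1 != dim:
        problems.append("first row has %d values, header says %d"
                        % (len(first_row) - 1, dim))
    return problems


if __name__ == "__main__":
    for problem in check_word_embeddings("embedding_200.txt"):
        print("WARNING: " + problem)
```

An empty result only means the header and first row look consistent; it does not validate every row of the file.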

Training and Prediction

  1. Run SPWCF.sh or SPCSE.sh. The corresponding model is trained and evaluated automatically.
./SPWCF.sh
./SPCSE.sh
  2. Our model uses SPWE and SPSE as components; see paper [3] and its code for details. Use SPWE and SPSE to produce the model files model_SPWE and model_SPSE, then place them in the root directory of this repository.
mv path/to/file/model_SPWE ./
mv path/to/file/model_SPSE ./
  3. Run CSP.sh. The corresponding model is trained and evaluated automatically.
./CSP.sh
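
CSP.sh relies on Python 3 to change the pickle version of the SPWE and SPSE model files. The repository's pickle_version_change.py handles that step, and its contents are not reproduced here. The sketch below only illustrates the general technique under the assumption that the files were dumped with a newer pickle protocol and must be rewritten so Python 2.7 can load them. The function name `redump_pickle` and the output file name in the usage line are hypothetical.

```python
import pickle


def redump_pickle(src_path, dst_path, protocol=2):
    """Load a pickle and write it back using an older protocol.

    Protocol 2 is the highest protocol that Python 2.7 can read.
    Run this under Python 3, which can read the newer protocols.
    """
    with open(src_path, "rb") as src:
        obj = pickle.load(src)
    with open(dst_path, "wb") as dst:
        pickle.dump(obj, dst, protocol=protocol)


if __name__ == "__main__":
    # Hypothetical output name; the real script may name files differently.
    redump_pickle("model_SPWE", "model_SPWE_py2")
```

Plain containers of numbers and strings usually survive this round trip; objects that depend on Python 3 only types may need extra care.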

References

[1] Huiming Jin, Hao Zhu, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Fen Lin, and Leyu Lin. 2018. Incorporating Chinese Characters of Words for Lexical Sememe Prediction. In Proceedings of ACL.

[2] Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, and Huan-Bo Luan. 2015. Joint Learning of Character and Word Embeddings. In Proceedings of IJCAI.

[3] Ruobing Xie, Xingchi Yuan, Zhiyuan Liu, and Maosong Sun. 2017. Lexical Sememe Prediction via Word Embeddings and Matrix Factorization. In Proceedings of IJCAI.