Skip to content
Source code for ACL 2019 paper "Chinese Relation Extraction with Multi-Grained Information and External Linguistic Knowledge"
Python
Branch: master
Clone or download
Latest commit 08adbe7 Nov 3, 2019
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
data/SanWen code Jul 20, 2019
nn code Jul 20, 2019
utils code Jul 20, 2019
LICENSE Initial commit May 28, 2019
README.md Update README.md Nov 3, 2019
configure.py code Jul 20, 2019
main.py code Jul 20, 2019

README.md

Chinese-NRE

Source code for ACL 2019 paper "Chinese Relation Extraction with Multi-Grained Information and External Linguistic Knowledge". Some code in this repository is based on the excellent open-source project https://github.com/jiesutd/LatticeLSTM.

Requirements

  • Python 3.6
  • Pytorch 0.4.1

Datasets

Three datasets are used in our paper:

  • FinRE: A manual-labeled financial news RE dataset. The data cannot be made public for the time being.

  • SanWen: A Chinese literature NER-RE dataset, the source of the dataset is https://github.com/lancopku/Chinese-Literature-NER-RE-Dataset.

  • ACE 2005: A benchmark RE dataset. According to the terms of LDC, we are not allowed to share the dataset with the third party. If you have the LDC license, please obtain the dataset (LDC2006T06) and follow the data format by yourself.

In this project, train.txt , dev.txt and test.txt are all from SanWen.

Data Format

Input Format

data/SanWen/train.txt, dev.txt, test.txt One instance per line with 4 columns separated by tab character. The first and second columns are head and tail entities. The third column is the relation label and the last one is text:

[head]	[tail]	[relation]	  text

For example ( one line ):

 湖底	   卵石	 Located	 连湖底的卵石颜色也可分辨

data/SanWen/relation2id.txt One relation per line with 2 columns separated by tab character. The first column is teh label while the second one is the corresponding ID:

[relation]	[ID]

Pre-trained Character Embeddings

data/vec.txt One character per line. For each line, the first column is the character, the rest columns is the value of the embedding of the character.

Pre-trained Word-Sense Embeddings

data/sense.txt Similar to character embedding but for word senses. For example:

释放#1 0.304095 ...
释放#2 -0.175496 ...
夏天 -0.230772 ...

Here, A#n means that it is the n-th sense of word A ( A is a polysemous word ). And the word-sense embeddings could be trained by the SAT (Sememe Attention over Target) approach.

Word-Sense Map

data/sense_map.txt Recording all senses for each polysemous word, corresponding to the word sense embedding. One word per line, for each line, the first column is the word, and the rest columns are all the senses of it ( if exist ). For example:

释放 释放#1 释放#2
夏天

The sense_map file could be obtained by HowNet.

Data Preparation

You can download the pre-trained character embeddings vec.txt, pre-trained word-sense embeddings sense.txt and word-sense map sense_map.txt from Tsinghua Cloud or Google Drive. Then put them in place following the folder structure:

MG-Lattice
|-- ...
|-- data
	|
	|-- sense.txt
	|
	|-- vec.txt
	|
	|-- sense_map.txt
	|
	|-- DATASET_NAME_1
	|	|
	|	|-- train.txt
	|	|-- valid.txt
	|	|-- test.txt
	|	|-- relation2id.txt
    	|
   	|-- DATASET_NAME_2
    		|-- ...

How to run

Arguments are set in configure.py, the default values are for SanWen dataset. The full usage is:

-- savemodel  			path to save the model					
-- loadmodel			path to load the model					
-- savedset			path to load the data settings 			

-- public_path			the parent path of the dataset 			(data/)
-- dataset          		the folder name of dataset			(SanWen/)
-- train_file			train dataset  					(train.txt)
-- dev_file			developement dataset  				(dev.txt)
-- test_file			test dataset  					(test.txt)
-- relation2id			map relation to id  				(relation2id.txt)
-- char_emb_file		pre-trained char embeddings 			(vec.txt)
-- sense_emb_file		pre-trained sense embeddings 			(sense.txt)
-- word_sense_map		record polysemous words 			(sense_map.txt)
-- max_length			the max length of the input				
					
-- Encoder			Specify which encoder to use
-- Optimizer			Specify which optimizier to use
-- lr				learning rate							
-- weights_mode			mode to set weights for each class in loss function

With appropriate configuration and data preparation, you can run the model by:

python main.py

Citation

If you use the code, please cite the paper:

@inproceedings{li2019chinese,
 title={Chinese Relation Extraction with Multi-Grained Information and External Linguistic Knowledge},
 author={Li, Ziran and Ding, Ning and Liu, Zhiyuan and Zheng, Hai-Tao and Shen, Ying},
 booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
 pages={4377--4386},
 year={2019}
}
You can’t perform that action at this time.