Introduction

This repository contains a Python implementation for generating the term pairs of the English Bigram Relatedness Dataset (BiRD). Details on how we created BiRD can be found in our paper: Big BiRD: A Large, Fine-Grained, Bigram Relatedness Dataset for Examining Semantic Composition.

BiRD can be used for two purposes: (1) to evaluate methods of semantic composition and (2) to obtain insights into bigram semantic relatedness by analysing the data. Both purposes are studied in detail in our paper. An interactive visualization for exploring the data and the questionnaire used for annotation are also available through the project's webpage.

Usage

Dependencies

Requirements:

Python3
nltk>=3

Install the required Python dependencies using the command below:

pip install -r python/requirements.txt

Files

generate_related_terms.py Main script to obtain the related terms for any bigram AB.
bigrams_file.txt List of bigrams for which the related terms will be extracted. Each line is a bigram.
data.zip A zip file which contains the following files. Unzip it before running the code:
1. phrase_table_unigram_bigram.txt The processed phrase table of the NRC Portage Machine Translation Toolkit. It contains only one-word and two-word English translations of French phrases (disregarding frequency).
2. wikipedia_unigram_bigram.pickle The list of unigrams and bigrams and their frequencies in Wikipedia. We used the 2018 English Wikipedia dump in our work. We only consider adjective-noun and noun-noun bigrams.
semantic_composition_evaluation.py Main script to examine semantic composition methods on BiRD, including the addition, multiplication, dilation, tensor product+convolution, head-only and modifier-only methods. This script loads the fastText and GloVe word embeddings and the term-context matrix from the word_embeddings folder in the project and reproduces the results of Table 4 in the paper. Note that these results were obtained from a subset of BiRD (3,159 pairs), since some words in BiRD do not occur in some of the corpora used to create the word embeddings.
word_embeddings.zip Contains four embedding files. Unzip the file before use. The files fasttext.txt, glove.txt and term-context.txt in the folder contain only the words (and their vectors) that occur in a subset of BiRD (3,159 pairs). fasttext-all.txt contains all the words of BiRD.
BiRD.txt The BiRD dataset.
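
To make the composition methods named above more concrete, here is a minimal sketch of four of them (addition, multiplication, head only and modifier only) plus one common formulation of dilation. It is not the code in semantic_composition_evaluation.py: the function names, the default value of lam, the choice to treat the first word of a bigram as the modifier and the second as the head, and the use of cosine similarity for comparison are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def compose_add(modifier: Vector, head: Vector) -> List[float]:
    """Addition: element-wise sum of the two word vectors."""
    return [a + b for a, b in zip(modifier, head)]


def compose_multiply(modifier: Vector, head: Vector) -> List[float]:
    """Multiplication: element-wise product of the two word vectors."""
    return [a * b for a, b in zip(modifier, head)]


def compose_dilate(modifier: Vector, head: Vector, lam: float = 2.0) -> List[float]:
    """Dilation (one common formulation, assumed here):
    stretch `head` along the direction of `modifier` by a factor `lam`,
    i.e. (u.u) v + (lam - 1) (u.v) u with u = modifier, v = head."""
    uu = dot(modifier, modifier)
    uv = dot(modifier, head)
    return [uu * h + (lam - 1.0) * uv * m for m, h in zip(modifier, head)]


def compose_head_only(modifier: Vector, head: Vector) -> List[float]:
    """Head only: ignore the modifier (assumes the second word is the head)."""
    return list(head)


def compose_modifier_only(modifier: Vector, head: Vector) -> List[float]:
    """Modifier only: ignore the head (assumes the first word is the modifier)."""
    return list(modifier)


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0
```

A composed bigram vector can then be compared with the vector of another term using cosine(), for example cosine(compose_add(a, b), c).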

Running the code

In order to obtain the related terms for any bigram, run the script generate_related_terms.py. The command to run the script is as follows:

python generate_related_terms.py -p2pt PATH_TO_PHRASE_TABLE -p2wikipedia PATH_TO_WIKIPEDIA -freq 30 -syn_number 5 -p2in PATH_TO_INPUT -p2out PATH_TO_OUTPUT

where

p2pt is the path to the phrase_table_unigram_bigram.txt file generated from the phrase table,
p2wikipedia is the path to the wikipedia_unigram_bigram.pickle file generated from Wikipedia,
freq is the frequency threshold for English phrases in the phrase table,
syn_number is the maximum number of related terms required for each bigram AB,
p2in is the path to the input file bigrams_file.txt,
p2out is the path to the output file where all related terms of the bigrams are saved.

If the p2in argument is not set in the argument list, you are asked to enter a bigram as input after running the script. In this case, typing exit quits the program. If p2out is not set in the argument list, the obtained related terms are not saved to an output file. If a bigram does not exist in WordNet or the phrase table, the program writes "not found" to STDOUT.

In order to examine methods of semantic composition on BiRD, run the script semantic_composition_evaluation.py as follows:
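
As an illustration of how these arguments fit together, the sketch below mirrors the flags from the command above with argparse and reads an input file that holds one bigram per line. It is not taken from generate_related_terms.py: build_parser and read_bigrams are hypothetical names, and which arguments are required, their types, and the skipping of malformed lines are assumptions.

```python
import argparse
from typing import Iterable, Iterator, Tuple


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the command above."""
    parser = argparse.ArgumentParser(description="Generate related terms for bigrams.")
    parser.add_argument("-p2pt", required=True, help="path to phrase_table_unigram_bigram.txt")
    parser.add_argument("-p2wikipedia", required=True, help="path to wikipedia_unigram_bigram.pickle")
    parser.add_argument("-freq", type=int, required=True, help="frequency threshold")
    parser.add_argument("-syn_number", type=int, required=True, help="max related terms per bigram")
    # Assumed optional: without -p2in the script prompts for bigrams interactively,
    # and without -p2out the results are not saved.
    parser.add_argument("-p2in", default=None, help="path to the input bigram file")
    parser.add_argument("-p2out", default=None, help="path to the output file")
    return parser


def read_bigrams(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (A, B) for each line holding exactly two whitespace-separated words."""
    for line in lines:
        parts = line.split()
        if len(parts) == 2:  # assumption: blank or malformed lines are skipped
            yield parts[0], parts[1]


# Example with an explicit argument list; 30 and 5 are the values from the command above.
args = build_parser().parse_args(
    ["-p2pt", "phrase_table_unigram_bigram.txt",
     "-p2wikipedia", "wikipedia_unigram_bigram.pickle",
     "-freq", "30", "-syn_number", "5"]
)
interactive = args.p2in is None  # no input file given, so bigrams would be typed in
```

With a real input file, the bigrams could be read with open(args.p2in, encoding="utf-8") and passed to read_bigrams().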

python semantic_composition_evaluation.py -p2bird PATH_TO_BiRD -p2embedding PATH_TO_WORD_EMBEDDING_TEXT_FILE -p2out PATH_TO_OUTPUT_FILE

For example:

python semantic_composition_evaluation.py -p2bird BiRD.txt -p2embedding word_embeddings/term-context.txt -p2out output.txt

where:

p2bird is the path to the BiRD file,
p2embedding is the path to the embedding file. In this project, three embedding files (fastText, GloVe and the term-context matrix) are available in the word_embeddings folder. You can also use other word embeddings. The file format should be as follows: each line contains <word><space><vector>. Please see the files in the word_embeddings folder for examples.
p2out is the path to the output file where the results of the semantic composition evaluation are written.
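
If you want to check that your own embedding file follows the <word><space><vector> format before passing it to the script, a minimal reader could look like the sketch below. It assumes that the vector components are themselves separated by whitespace and that empty lines can be skipped; load_embeddings is a hypothetical helper, not a function from this repository.

```python
from typing import Dict, Iterable, List


def load_embeddings(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse lines of the form '<word> <v1> <v2> ...' into a word -> vector map."""
    embeddings: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:  # skip empty lines or lines without a vector
            continue
        word, values = parts[0], parts[1:]
        embeddings[word] = [float(x) for x in values]
    return embeddings


# Small in-memory example standing in for an embedding file.
sample = ["cat 0.1 0.2\n", "\n", "dog 1 2\n"]
vectors = load_embeddings(sample)
```

For a real file, pass an open file handle instead of a list, for example load_embeddings(open(path, encoding="utf-8")), where path points to your embedding file.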

Reference

Please cite our paper [1] when referring to our dataset or code.

[1] Big BiRD: A Large, Fine-Grained, Bigram Relatedness Dataset for Examining Semantic Composition. Shima Asaadi, Saif M. Mohammad, and Svetlana Kiritchenko. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL-2019), June 2019, Minnesota, USA.
