TAXI

TAXI: a Taxonomy Induction Method based on Lexico-Syntactic Patterns, Substrings and Focused Crawling

More information about the approach can be found at the TAXI web site.

System Requirements

The system was tested on Debian/Ubuntu Linux and Mac OS X. Loading all resources into memory requires about 64 GB of RAM.

Installation

  1. Clone the repository:
git clone https://github.com/tudarmstadt-lt/taxi.git
  2. Download the resources into the repository (4.4 GB, gzip-compressed):
cd taxi && wget http://panchenko.me/data/joint/taxi/res/resources.tgz && tar xzf resources.tgz
  3. Install the dependencies:
pip install -r requirements.txt
  4. Set up spaCy by downloading the language models for English, Dutch, French and Italian:
python -m spacy download en
python -m spacy download nl
python -m spacy download fr
python -m spacy download it
  5. Set up NLTK by downloading the stopword lists:
python -m nltk.downloader stopwords
  6. Install treetagger and treetagger-python.
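
Before starting a long run, it can help to confirm that the Python dependencies from the steps above are importable. The following is a minimal sketch, not part of the repository: the helper name `missing_modules` is hypothetical, and only `spacy` and `nltk` are checked because those are the import names the steps above make certain. It does not verify the spaCy language models, the NLTK data, or the treetagger installation.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Assumption: these are the import names of two of the installed dependencies.
    missing = missing_modules(["spacy", "nltk"])
    if missing:
        print("Missing Python modules:", ", ".join(missing))
    else:
        print("spacy and nltk are importable.")
```

An empty result only means the packages can be located; the downloaded models and resources still have to be in place for semeval.py to run.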

Induction of SemEval Taxonomies

Run semeval.py to reproduce the experimental results, e.g.:

For a test run (few resources loaded, quick):

python semeval.py vocabularies/science_en.csv en simple --test

For a normal run (all resources are loaded, requires 64 GB of RAM):

python semeval.py vocabularies/science_en.csv en simple

The run produces a noisy graph. Clean the output by executing the following command (this example uses the input file science_en.csv-relations.csv-taxo-knn1.csv):

./run.sh taxi_output/simple_full/science_en.csv-relations.csv-taxo-knn1.csv
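
The induction and cleaning commands above can be assembled programmatically when several vocabularies have to be processed. The sketch below only builds the argument lists and does not run anything. It rests on two assumptions generalized from the single example shown here: that vocabulary files are named `<domain>_<language>.csv`, and that a full run writes its noisy graph to `taxi_output/<mode>_full/<vocabulary file>-relations.csv-taxo-knn1.csv`. The helper names are hypothetical and not part of the repository.

```python
from pathlib import Path


def language_of(vocabulary):
    """Guess the language code from a name like 'science_en.csv' (assumed naming scheme)."""
    return Path(vocabulary).stem.rsplit("_", 1)[-1]


def induction_command(vocabulary, mode="simple", test=False):
    """Build the semeval.py call for one vocabulary file."""
    cmd = ["python", "semeval.py", str(vocabulary), language_of(vocabulary), mode]
    if test:
        cmd.append("--test")  # quick run with few resources loaded
    return cmd


def cleaning_command(vocabulary, mode="simple"):
    """Build the run.sh call for the noisy graph of a full run (assumed output path)."""
    noisy_graph = (
        Path("taxi_output")
        / f"{mode}_full"
        / f"{Path(vocabulary).name}-relations.csv-taxo-knn1.csv"
    )
    return ["./run.sh", noisy_graph.as_posix()]


if __name__ == "__main__":
    vocab = "vocabularies/science_en.csv"
    print(" ".join(induction_command(vocab)))
    print(" ".join(cleaning_command(vocab)))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the repository root. If the actual output location differs for other modes or for test runs, adjust `cleaning_command` accordingly.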

The vocabularies directory contains the input terms for different domains and languages. The script lets you reproduce the results of the SemEval 2016 Task 13 Taxonomy Extraction Evaluation described in our paper. It loads hypernyms from the downloaded resources and constructs a taxonomy for every input vocabulary of the SemEval datasets, e.g. the English Food domain. In general, the TAXI approach takes a vocabulary as input and outputs a taxonomy for a linked subset of the terms in this vocabulary.

Currently, the main purpose of this repository is to ensure reproducibility of the SemEval results. The resulting taxonomies are generated next to the corresponding input vocabulary file. If you need to adapt the script to your needs and require help, do not hesitate to contact us.
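
To make the input and output described above concrete, here is a toy illustration of what "a taxonomy for a linked subset of the terms" means. It is not the TAXI algorithm and does not use the downloaded resources: it simply keeps candidate hypernym pairs whose two terms both occur in the vocabulary and reports which terms ended up linked. The function name and the example data are made up for illustration.

```python
def induce_toy_taxonomy(vocabulary, candidate_hypernyms):
    """Keep (hyponym, hypernym) pairs fully covered by the vocabulary.

    Returns the sorted list of kept edges and the sorted list of linked terms.
    """
    vocab = {term.lower() for term in vocabulary}
    edges = set()
    for hyponym, hypernym in candidate_hypernyms:
        hypo, hyper = hyponym.lower(), hypernym.lower()
        if hypo in vocab and hyper in vocab and hypo != hyper:
            edges.add((hypo, hyper))
    linked_terms = {term for edge in edges for term in edge}
    return sorted(edges), sorted(linked_terms)


if __name__ == "__main__":
    vocabulary = ["science", "physics", "optics", "alchemy"]
    candidates = [
        ("physics", "science"),
        ("optics", "physics"),
        ("banana", "fruit"),     # dropped: neither term is in the vocabulary
        ("physics", "science"),  # duplicate, kept only once
    ]
    edges, linked = induce_toy_taxonomy(vocabulary, candidates)
    print(edges)   # [('optics', 'physics'), ('physics', 'science')]
    print(linked)  # ['optics', 'physics', 'science']
```

In this example "alchemy" is part of the input vocabulary but has no relation to any other term, so it is absent from the output. This is the sense in which the result covers only a linked subset of the vocabulary.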
