How to Use

Marcel Heinz edited this page Jul 30, 2018 · 5 revisions

Required technology:
- pip install sparqlwrapper
- pip install nltk
- pip install requests
- pip install pandas (if you want to run plot/table scripts)
- Download Stanford CoreNLP: https://stanfordnlp.github.io/CoreNLP/index.html#download
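
Before mining, it can help to confirm that the Python packages listed above are importable. The following is a minimal sketch, not part of the repository; the helper name `missing_modules` is hypothetical, and it assumes the usual import names for these packages (`SPARQLWrapper` for the `sparqlwrapper` package).

```python
import importlib.util

# Import names for the packages listed above (assumed, not taken from the repository).
REQUIRED = ["SPARQLWrapper", "nltk", "requests"]
# pandas is only needed for the plot/table scripts.
OPTIONAL = ["pandas"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("missing required packages:", missing_modules(REQUIRED))
    print("missing optional packages:", missing_modules(OPTIONAL))
```

An empty list for the required packages means the pip installs above succeeded for the interpreter you are using. The check does not cover the Stanford CoreNLP download, which is a separate Java distribution.
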
How to reproduce results:
1. Mine DBpedia Live by running src/mine/miner.py.
  1. This produces the file data/langdict.json, which lists all articles with depth information. The indicators/checks later annotate these entries.
  2. It also produces the file data/catdict.json, which lists and annotates all categories.
2. Start the Stanford CoreNLP server.
  1. Open a terminal in the folder where you deployed Stanford CoreNLP.
  2. Start the server using: java -cp "*" -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 150000 -quiet
  3. (Make sure that your computer's network settings allow connections to the local server you just started. In particular, proxy settings such as http_proxy in your environment variables seem to cause trouble.)
3. Start the indication pipeline.
  1. Run src/check/pipeline.py
  2. It's safe to run the following indicators in src/check separately for the software languages domain.
    1. infobox_dbpedia_existence.py (search for infobox template references maintained in DBpedia)
    2. url_pattern.py (search for keywords in the URL)
    3. lists_of.py (search for links to extracted article names in a list of lists that is located at data/Language_Lists.txt)
    4. summary_keywords.py (search for keywords in the summary)
    5. hypernym_nlp_firstsentence.py (search for a part-of-speech pattern indicating that one of the keywords is a hypernym)
  3. It's also safe to run one of the additional indicators separately:
    1. hypernym_dbpedia.py (search for hypernyms that are annotated only in non-Live DBpedia)
    2. hypernym_wordnet.py (tries to match article names in WordNet)
    3. infobox_position.py (search for infobox template references in Wikipedia. The revision numbers from the DBpedia entries are reused here, and the pages are accessed through the API.)
    4. semantic_distance.py (computes a metric based on annotated categories)
4. Customize the configuration if necessary. Otherwise you can keep the configuration for the software languages domain.
  1. Set root categories (CATS) in src/data/__init__.py
  2. Set the domain keywords in src/data/__init__.py
  3. Note that lists_of.py is a domain-specific indicator. You'd have to retrieve a new list of lists for a new target domain.
  4. src/check/seed.py annotates the extracted articles with whether they were matched in the seed (see data/seed_annotated.json).
5. The Wiki provides a more general overview of the evaluation results and further statistics.
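
Steps 2 and 3 can be sketched as a small driver script. This is a hedged illustration, not code from the repository: the function names `server_is_up`, `indicator_command` and `run_core_indicators` are hypothetical, the scripts are assumed to be started from the repository root, and checking the server before every indicator is an assumption (the source only says to start the server before the pipeline). The reachability check deliberately ignores proxy environment variables, since those are the settings reported above to cause trouble.

```python
import subprocess
import sys
import urllib.request

# Port taken from the server command in step 2.
CORENLP_URL = "http://localhost:9000"

# Indicators from step 3.2, in the order listed there.
CORE_INDICATORS = [
    "infobox_dbpedia_existence.py",
    "url_pattern.py",
    "lists_of.py",
    "summary_keywords.py",
    "hypernym_nlp_firstsentence.py",
]


def server_is_up(url=CORENLP_URL, timeout=5.0):
    """Return True if an HTTP GET to `url` succeeds, ignoring proxy settings."""
    # An empty ProxyHandler stops urllib from picking up proxy environment variables.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, refused connections and timeouts are all OSError subclasses.
        return False


def indicator_command(script, check_dir="src/check"):
    """Build the command for one indicator script (assumes the repository root as working directory)."""
    return [sys.executable, f"{check_dir}/{script}"]


def run_core_indicators():
    """Run each core indicator as a separate process, stopping at the first failure."""
    if not server_is_up():
        raise RuntimeError("CoreNLP server not reachable at " + CORENLP_URL)
    for script in CORE_INDICATORS:
        subprocess.run(indicator_command(script), check=True)
```

Calling `run_core_indicators()` from the repository root would run the five indicators one after another. Running them as separate processes mirrors the note that each indicator is safe to run on its own; src/check/pipeline.py remains the way to run the full pipeline.
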