How to Use

Marcel Heinz edited this page Jul 30, 2018 · 5 revisions
  • Required technology:
  • How to reproduce results:
    1. Mine DBpedia Live by running src/mine/miner.py.
      1. This produces the file data/langdict.json, which lists all articles with depth information. The indicators (checks) run in the later steps annotate these entries.
      2. It also produces the file data/catdict.json, which lists and annotates all categories.
    2. Start the Stanford CoreNLP server.
      1. Open a terminal in the folder where you deployed Stanford CoreNLP.
      2. Start the server using: java -cp "*" -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 150000 -quiet
      3. Make sure your computer's network settings allow connections to the local server this creates. In particular, proxy settings such as http_proxy in your environment variables seem to cause trouble.
    3. Start the indication pipeline.
      1. Run src/check/pipeline.py
      2. It's safe to run the following indicators in src/check separately for the software languages domain.
        1. infobox_dbpedia_existence.py (search for infobox template references maintained in DBpedia)
        2. url_pattern.py (search for keywords in the URL)
        3. lists_of.py (search for links to extracted article names in a list of lists that is located at data/Language_Lists.txt)
        4. summary_keywords.py (search for keywords in the summary)
        5. hypernym_nlp_firstsentence.py (search the first sentence for a part-of-speech pattern indicating that one of the keywords is a hypernym)
      3. It's also safe to run one of the additional indicators separately:
        1. hypernym_dbpedia.py (search for hypernyms that are annotated only in non-live DBpedia)
        2. hypernym_wordnet.py (tries to match article names in WordNet)
        3. infobox_position.py (search for infobox template references in Wikipedia; the revision numbers from the DBpedia entries are reused here, and the pages are accessed through the API)
        4. semantic_distance.py (computes a metric based on annotated categories)
    4. Customize the configuration if necessary. Otherwise you can keep the configuration for the software languages domain.
      1. Set root categories (CATS) in src/data/__init__.py
      2. Set the domain keywords in src/data/__init__.py
      3. Note that lists_of.py is a domain-specific indicator. You'd have to retrieve a new list of lists for a new target domain.
      4. src/check/seed.py annotates the extracted articles with whether they were matched in the seed (see data/seed_annotated.json).
    5. The Wiki provides a more general overview of the evaluation results and further statistics.
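Before starting the indication pipeline, it can help to confirm that the CoreNLP server from step 2 is reachable and that proxy settings are not in the way. The following is a minimal sketch, not part of the repository: the helper names (`build_annotate_url`, `direct_opener`, `annotate`, `server_is_up`) are hypothetical, the URL assumes the `-port 9000` option shown above, and the annotator list is only an example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the server was started with "-port 9000" on this machine.
CORENLP_URL = "http://localhost:9000"


def build_annotate_url(base_url, annotators):
    """Encode the annotator list in the `properties` query parameter."""
    props = {"annotators": ",".join(annotators), "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(props)})
    return base_url + "/?" + query


def direct_opener():
    """Build an opener that ignores http_proxy/https_proxy from the environment."""
    return urllib.request.build_opener(urllib.request.ProxyHandler({}))


def annotate(text, base_url=CORENLP_URL, timeout=150):
    """POST raw text to the server and return the parsed JSON annotation."""
    url = build_annotate_url(base_url, ["tokenize", "ssplit", "pos"])
    request = urllib.request.Request(url, data=text.encode("utf-8"))
    with direct_opener().open(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def server_is_up(base_url=CORENLP_URL, timeout=5):
    """Return True if a test sentence can be annotated, False otherwise."""
    try:
        annotate("ping", base_url, timeout)
        return True
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts;
        # ValueError covers a reply that is not valid JSON.
        return False


if __name__ == "__main__":
    print("CoreNLP reachable:", server_is_up())
```

If this prints `False` although the server is running, the proxy-related environment variables mentioned in step 2 are a likely place to look, since the sketch bypasses them only for its own requests and not for the pipeline scripts.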

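To illustrate what a keyword-based indicator such as url_pattern.py checks, here is a small sketch under stated assumptions. It is not the repository's implementation: the keyword list is a made-up placeholder (the real domain keywords are set in src/data/__init__.py), the function name `url_keyword_hits` is hypothetical, and whole-word matching is just one plausible matching rule.

```python
import re
from urllib.parse import unquote, urlparse

# Hypothetical placeholder keywords; the actual list is configured
# in src/data/__init__.py.
KEYWORDS = ["language", "format", "notation"]


def url_keyword_hits(url, keywords=KEYWORDS):
    """Return the keywords that occur as whole words in the last URL segment."""
    # Take the article part of the URL, e.g. "Java_(programming_language)".
    title = unquote(urlparse(url).path.rsplit("/", 1)[-1])
    # Split on anything that is not a letter or digit (underscores, brackets).
    words = set(re.split(r"[^a-z0-9]+", title.lower()))
    return [k for k in keywords if k.lower() in words]


if __name__ == "__main__":
    print(url_keyword_hits("http://dbpedia.org/resource/Java_(programming_language)"))
    print(url_keyword_hits("http://dbpedia.org/resource/Berlin"))
```

When adapting the pipeline to a new target domain, a quick check like this against a handful of known article URLs can show whether the chosen keywords are too broad or too narrow before the full pipeline is run.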