WebIsALOD: Providing Hypernymy Relations extracted from the Web as Linked Open Data
vad_package
10_split_csv_with_header.sh
11_create_final_dataset.py
12_map_dbpedia.py
13_analyse_results.py
14_TransformCsvToRDF.py
1_download_and_extract.sh
2_a_sentences_to_one_file.py
2_b_sentences_sort.sh
2_c_sentences_make_skip_file.py
3_transform_into_csv_with_threshold.py
4_randomSampling.py
5_create_mturk_files.py
6_a_append_mturk_relation_results_to_samples.py
6_b_append_mturk_sentence_results_to_samples.py
7_calculate_cycles.py
8_append_sentences.py
9_prepare_for_analysis.py
README.md
TypeAnalyse.xlsx
confidenceScores.xlsx
countAndJugementOfRelations.xlsx
mTurk_Relation_20.html
mTurk_Sentence.html
ontology.pptx
pattern_details.csv
pattern_regex.csv
utilwebisadb.py
webisa_0_sample_results.csv
webisa_10_sample_results.csv
webisa_1_sample_results.csv
webisa_1_sentence_results.csv
webisa_20_sample_results.csv
webisa_2_sample_results.csv
webisa_3_sample_results.csv
webisa_5_sample_results.csv


This repository contains all the code used for the WebIsALOD paper.

Abstract

Hypernymy relations are an important asset in many applications, and a central ingredient to Semantic Web ontologies. The IsA database is a large collection of such hypernymy relations extracted from the Common Crawl. In this paper, we introduce WebIsALOD, a Linked Open Data version of the IsA database, containing 11.7M hypernymy relations, each provided with rich provenance information. As the original dataset contained more than 80% wrong, noisy extractions, we run a machine learning algorithm to assign confidence scores to the individual statements.

Structure of the files

All files starting with a number are scripts that generate the csv files, the mappings and the nquads. The files starting with mTurk are HTML surveys used to generate the ground truth. Files with the name "webisa_{threshold}_sample_results" contain the samples for the corresponding threshold, together with the majority vote and the answer of each worker. webisa_1_sentence_results.csv contains the results from the mapping to Wikipedia pages and categories.
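As a small illustration of the naming scheme above, the threshold can be read back from a sample file name. This is only a sketch: the helper name `threshold_of` is hypothetical and not part of the repository, and it assumes the threshold is always a non-negative integer, as in the file names listed here.

```python
import re

# Matches names such as "webisa_10_sample_results.csv".
# Assumption: the threshold is a plain non-negative integer.
SAMPLE_NAME = re.compile(r"^webisa_(\d+)_sample_results(?:\.csv)?$")


def threshold_of(filename):
    """Return the threshold encoded in a sample file name, or None if
    the name does not follow the webisa_{threshold}_sample_results scheme."""
    match = SAMPLE_NAME.match(filename)
    return int(match.group(1)) if match else None
```

Names that do not follow the scheme, such as webisa_1_sentence_results.csv, yield `None` rather than an error.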

Most of the csv files are structured as follows:

  1. id
  2. instance
  3. class
  4. frequency
  5. pidspread
  6. pldspread
  7. ipremod
  8. ilemma
  9. ipostmod
  10. cpremod
  11. clemma
  12. cpostmod
  13. pids
  14. plds
  15. provids
  16. majority voting
  17. yes (counts)
  18. uncertain (counts)
  19. no (counts)
  20. mapping instance to dbpedia page (json array)
  21. mapping instance to dbpedia category (json array)
  22. mapping class to dbpedia page (json array)
  23. mapping class to dbpedia category (json array)
  24. mapping instance to yago (string)
  25. mapping class to yago (string)
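
The column layout above could be read along the following lines. This is a minimal sketch, not the code used in the repository: it assumes comma-delimited files readable by Python's `csv` module, that the columns appear in the listed order, and that columns 20 to 23 hold valid JSON arrays. The dictionary keys and the functions `parse_row` and `read_rows` are hypothetical names chosen for illustration.

```python
import csv
import json

# Column names in the order listed above (keys are illustrative, not official).
COLUMNS = [
    "id", "instance", "class", "frequency", "pidspread", "pldspread",
    "ipremod", "ilemma", "ipostmod", "cpremod", "clemma", "cpostmod",
    "pids", "plds", "provids", "majority_voting", "yes", "uncertain", "no",
    "instance_dbpedia_page", "instance_dbpedia_category",
    "class_dbpedia_page", "class_dbpedia_category",
    "instance_yago", "class_yago",
]

# Columns 20-23 are described as JSON arrays.
JSON_COLUMNS = COLUMNS[19:23]


def parse_row(fields):
    """Map one csv row to a dict; rows with fewer columns keep only the
    leading keys, since not every file carries all 25 columns."""
    row = dict(zip(COLUMNS, fields))
    for name in JSON_COLUMNS:
        if row.get(name):  # skip missing or empty cells
            row[name] = json.loads(row[name])
    return row


def read_rows(handle, skip_header=False):
    """Yield parsed rows from an open csv file handle.

    Set skip_header=True for files whose first line is a header row
    (an assumption to check per file)."""
    reader = csv.reader(handle)
    if skip_header:
        next(reader, None)
    for fields in reader:
        yield parse_row(fields)
```

A file would then be read with something like `with open(path, newline="", encoding="utf-8") as f: rows = list(read_rows(f))`, where the encoding is also an assumption.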