## DBpedia Fixes
The gold-standard datasets (metacritic-movies etc.) reference some entity URIs that no longer exist in our version of the DBpedia Knowledge Graph (a subset of the 2016-04 dump).

Example:
http://dbpedia.org/resource/Carpool_(film) -> http://dbpedia.org/resource/Carpool_(1996_film)

This notebook identifies and fixes these broken URIs. It writes a fixed version of the selected dataset (`data_fixed.tsv`) to the original dataset folder.

Make sure that Jena Fuseki is up and running.
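
One way to confirm that the endpoint answers before running the cells below is to send a trivial SPARQL `ASK` query. The following is only a sketch under assumptions: the helper names `build_ask_url` and `endpoint_is_up` are hypothetical and not part of `rdflimeConfig`, and it assumes that `dbpediaLocation` is the URL of a SPARQL query endpoint that accepts GET requests with a `query` parameter.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def build_ask_url(endpoint: str) -> str:
    """Return a GET URL asking whether the endpoint holds at least one triple."""
    query = "ASK { ?s ?p ?o }"
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def endpoint_is_up(endpoint: str, timeout: float = 5.0) -> bool:
    """Return True only if the endpoint answers the ASK query with `true`."""
    try:
        request = urllib.request.Request(
            build_ask_url(endpoint),
            headers={"Accept": "application/sparql-results+json"},
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return bool(json.load(response).get("boolean", False))
    except (urllib.error.URLError, OSError, ValueError):
        # Unreachable server, malformed URL, or a non-JSON answer.
        return False
```

Calling `endpoint_is_up(dbpediaLocation)` at the top of the notebook would then fail fast instead of surfacing a connection error in a later cell.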

In [1]:
import pandas as pd
from pyrdf2vec.graphs import KG
import re
import os.path
from rdflimeConfig import dbpediaLocation, datasets, load_dataset
import json

# 0: Metacritic-Movies
# 1: Metacritic-Albums
# 2: Forbes-Companies
# 3: Mercer-Cities
# 4: AAUP-Universities
# -> See rdflimeConfig.py
explored_dataset_idx = 1
dataset_full, dataset_entities = load_dataset(datasets[explored_dataset_idx], fixed=False)
cfg = datasets[explored_dataset_idx]
entity_kind = cfg["entity"]

SyntaxError: f-string: single '}' is not allowed (connectors.py, line 159)

### Identify broken URIs

In [27]:
# Initiate dbpedia connection
dbpedia = KG(dbpediaLocation)

# Check gold-standard datasets
missing = [entity for entity in dataset_entities if not dbpedia.is_exist([entity])]
print(f"{len(missing)} out of {len(dataset_entities)} {entity_kind} have broken URIs.")
len(missing)/len(dataset_entities)

50 out of 1600 albums have broken URIs.


0.03125

### Apply URI fixes so that DBpedia and gold-standard dataset match

In [28]:
fixes = []

In [29]:
# Try using the (fixed) DBpedia_URI15
ctr = len(fixes)
for entity in missing:
    fix = dataset_full[dataset_full["DBpedia_URI"] == entity].DBpedia_URI15.values[0]
    fix = fix.replace(" ", "+") # Some URIs contain spaces instead of "+"
    if dbpedia.is_exist([fix]):
        fixes.append({"original": entity, "fix": fix})

print(f"{len(fixes)-ctr} can be fixed by using the fixed DBpedia_URI15 instead.")

13 can be fixed by using the fixed DBpedia_URI15 instead.
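
The dataset-specific steps repeat the membership test `x not in [f["original"] for f in fixes]`, which rebuilds a list on every iteration. A minimal sketch of an alternative is shown here, assuming a hypothetical `FixRegistry` helper that is not part of the notebook. It keeps the first fix recorded for each original URI and can still be exported in the `{"original": ..., "fix": ...}` shape used throughout.

```python
class FixRegistry:
    """Collects URI fixes and keeps only the first fix per original URI."""

    def __init__(self):
        self._by_original = {}  # original URI -> fixed URI, in insertion order

    def add(self, original: str, fix: str) -> bool:
        """Record a fix; return False if the original already has one."""
        if original in self._by_original:
            return False
        self._by_original[original] = fix
        return True

    def __contains__(self, original: str) -> bool:
        return original in self._by_original

    def __len__(self) -> int:
        return len(self._by_original)

    def as_list(self) -> list:
        """Return the fixes as a list of {"original", "fix"} dictionaries."""
        return [{"original": o, "fix": f} for o, f in self._by_original.items()]


registry = FixRegistry()
registry.add(
    "http://dbpedia.org/resource/Carpool_(film)",
    "http://dbpedia.org/resource/Carpool_(1996_film)",
)
```

With such a helper, a step would call `registry.add(entity, fix)` after a successful `dbpedia.is_exist([fix])`, and the per-step count would be the difference in `len(registry)`.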


#### Dataset-specific fixes

##### Specific fixes for metacritic-movies

In [5]:
# Try adding _(film) or _(yyyy_film) to URI
ctr = len(fixes)
for movie in missing:
    
    # Simply add _(film)
    simpleFix = movie + "_(film)"
    if dbpedia.is_exist([simpleFix]) and movie not in [f["original"] for f in fixes]: 
        fixes.append({"original": movie, "fix": simpleFix})
    
    # Try to add _(yyyy_film)
    else:
        releaseDate = dataset_full[dataset_full.DBpedia_URI == movie]["Release date"].iloc[0]
        releaseYear = int(re.search(r"\d{4}", releaseDate).group(0))

        maxDistYears = 1
        for y in range(releaseYear-maxDistYears, releaseYear+maxDistYears+1):

            # _(film) -> _(yyyy_film) or {blank} -> _(yyyy_film)
            fix = movie.replace("(film)", f"({y}_film)")
            if not fix.endswith("film)"):
                fix += f"_({y}_film)"

            if dbpedia.is_exist([fix]) and movie not in [f["original"] for f in fixes]:
                fixes.append({"original": movie, "fix": fix})
                break
print(f"{len(fixes)-ctr} can be fixed by adding _(film) or _(yyyy_film) to the URI.")

26 can be fixed by adding _(film) or _(yyyy_film) to the URI.
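
The candidate URIs tried by the cell above can also be generated separately from the existence check, which makes them easy to inspect. This is a simplified sketch rather than a drop-in replacement: `film_uri_candidates` is a hypothetical helper, and it assumes the input URI either has no film suffix or ends in exactly `_(film)`.

```python
FILM_SUFFIX = "_(film)"


def film_uri_candidates(uri, release_year, max_dist_years=1):
    """List candidate URIs: plain _(film) first, then one _(yyyy_film) per year."""
    has_suffix = uri.endswith(FILM_SUFFIX)
    base = uri[: -len(FILM_SUFFIX)] if has_suffix else uri
    # Appending _(film) only makes sense if the URI does not carry it already.
    candidates = [] if has_suffix else [base + FILM_SUFFIX]
    first = release_year - max_dist_years
    last = release_year + max_dist_years
    candidates += [f"{base}_({year}_film)" for year in range(first, last + 1)]
    return candidates


carpool = film_uri_candidates("http://dbpedia.org/resource/Carpool_(film)", 1996)
```

Each candidate would then be passed to `dbpedia.is_exist([candidate])`, stopping at the first hit.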


In [6]:
# Try removing duplicate _(film) or _(yyyy_film) endings
ctr = len(fixes)
for movie in missing:
    match = re.search(r"(?:_\((?:\d{4}_|)film\)){2}", movie)
    if match:
        doubled = match.group(0)
        # Take the first of the two endings (assumes both have equal length)
        first_ending = doubled[:len(doubled) // 2]
        fix = movie.replace(first_ending, "", 1)
        if dbpedia.is_exist([fix]) and movie not in [f["original"] for f in fixes]:
            fixes.append({"original": movie, "fix": fix})
print(f"{len(fixes)-ctr} can be fixed by removing duplicate (film) part of URI.")

2 can be fixed by removing duplicate (film) part of URI.
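
The slicing in the cell above takes the first half of the matched text, which yields a valid ending only when both endings have the same length. A sketch that captures each ending in its own group avoids that assumption. The helper name `drop_first_film_suffix` is hypothetical, and dropping the first ending (rather than the second) simply mirrors the behaviour of the cell above.

```python
import re

# One film ending, with or without a four-digit year: _(film) or _(1996_film)
FILM_SUFFIX = r"_\((?:\d{4}_)?film\)"
DOUBLE_SUFFIX = re.compile("(" + FILM_SUFFIX + ")(" + FILM_SUFFIX + ")")


def drop_first_film_suffix(uri):
    """Remove the first of two consecutive film endings; None if not doubled."""
    match = DOUBLE_SUFFIX.search(uri)
    if match is None:
        return None
    # Keep everything before the first ending and from the second ending on.
    return uri[: match.start(1)] + uri[match.start(2):]
```

As before, the result is only a candidate and still needs to pass `dbpedia.is_exist`.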


##### Specific fixes for metacritic-albums

In [30]:
# Try adding _(album)
ctr = len(fixes)
for album in missing:
    for ending in ["_(album)"]:
        fix = album + ending
        if dbpedia.is_exist([fix]) and album not in [f["original"] for f in fixes]:
            fixes.append({"original": album, "fix": fix})
print(f"{len(fixes)-ctr} can be fixed by adding _(album).")

9 can be fixed by adding _(album).


##### Specific fixes for forbes-companies

In [113]:
# Try various legal forms
ctr = len(fixes)
for company in missing:
    for ending in ["_Corporation", "_(company)", "_Group", "_AG", "_SE", "_Inc.", "_EMC", "_SA", "_ASA", "_N.V.", "_plc"]:
        fix = company + ending
        if dbpedia.is_exist([fix]) and company not in [f["original"] for f in fixes]:
            fixes.append({"original": company, "fix": fix})
print(f"{len(fixes)-ctr} can be fixed by adding legal forms.")

0 can be fixed by adding legal forms.


##### Specific fixes for aaup-salaries

In [124]:
# Replace the percent-encoded en dash (%E2%80%93) with the literal character
ctr = len(fixes)
for university in missing:
    fix = university.replace("%E2%80%93","–")
    if dbpedia.is_exist([fix]) and university not in [f["original"] for f in fixes]:
        fixes.append({"original": university, "fix": fix})
print(f"{len(fixes)-ctr} can be fixed by replacing URL encoded part.")

29 can be fixed by replacing URL encoded part.
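
The replacement above handles a single escape sequence, `%E2%80%93`, which is the UTF-8 percent-encoding of the en dash. The standard library's `urllib.parse.unquote` decodes every percent-escape in one pass. Whether the fully decoded form matches the URIs stored in the graph is an assumption that still has to be confirmed with `dbpedia.is_exist`. The URI in this sketch is hypothetical.

```python
from urllib.parse import unquote

# Hypothetical URI containing a percent-encoded en dash (U+2013).
encoded = "http://dbpedia.org/resource/Example_University%E2%80%93Campus"

# unquote decodes percent-escapes as UTF-8 by default.
decoded = unquote(encoded)
print(decoded)
```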


#### Manual matches

In [125]:
ctr = len(fixes)

file = os.path.join(cfg["location"], "manual_fixes.json")
with open(file, "r") as f:
    manual_fixes = json.load(f)

for fix in manual_fixes:
    if fix["original"] not in [f["original"] for f in fixes]:
        fixes.append(fix)

print(f"{len(fixes)-ctr} can be fixed manually.")

11 can be fixed manually.
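
The loop above implies that `manual_fixes.json` holds a list of objects with `original` and `fix` keys. A minimal sketch of validating such content before merging it follows. The `validate_manual_fixes` helper is hypothetical, and the sample entry reuses the Carpool example from the introduction purely for illustration; it is not taken from an actual `manual_fixes.json`.

```python
import json

REQUIRED_KEYS = {"original", "fix"}


def validate_manual_fixes(entries):
    """Return the entries unchanged if each one has the keys the loop reads."""
    if not isinstance(entries, list):
        raise ValueError("manual fixes must be a JSON list")
    for position, entry in enumerate(entries):
        if not isinstance(entry, dict) or not REQUIRED_KEYS.issubset(entry):
            raise ValueError(f"entry {position} lacks 'original' or 'fix'")
    return entries


# Illustrative content only, in the shape the notebook expects.
sample = """
[
  {"original": "http://dbpedia.org/resource/Carpool_(film)",
   "fix": "http://dbpedia.org/resource/Carpool_(1996_film)"}
]
"""
manual_fixes = validate_manual_fixes(json.loads(sample))
```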


In [126]:
remaining = [m for m in missing if m not in [f["original"] for f in fixes]]
print(remaining)
print(len(remaining))

[]
0


### Check and create fixed datasets

In [127]:
dataset_full["DBpedia_URI16"] = dataset_full.DBpedia_URI

for fix in fixes: 
    dataset_full["DBpedia_URI16"] = dataset_full["DBpedia_URI16"].replace(fix["original"], fix["fix"])
    
len(dataset_full[dataset_full.DBpedia_URI != dataset_full.DBpedia_URI16])

71

In [128]:
# All issues fixed?
still_missing = [entity for entity in dataset_full.DBpedia_URI16.values if not dbpedia.is_exist([entity])]
assert len(still_missing) == 0, "Not every broken URI has been fixed!"

# No duplicate fixes? This check is disabled because the dataset itself contains
# duplicate entries, e.g. Concordia College - http://www.wikidata.org/entity/Q5158956
# assert len(fixes) == len(set([fix["original"] for fix in fixes])), "Some entities have more than one fix!"
dataset_full.to_csv(os.path.join(cfg["location"], "data_fixed.tsv"), sep="\t", index=False)

print("Created fixed dataset.")

Created fixed dataset.


In [129]:
still_missing

[]
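
The commented-out duplicate check can only report that some original URI occurs more than once. A sketch that lists the affected originals instead of asserting may make such cases easier to review. `duplicated_originals` is a hypothetical helper, and the sample data below is made up for illustration.

```python
from collections import Counter


def duplicated_originals(fixes):
    """Map each original URI that occurs more than once to its count."""
    counts = Counter(fix["original"] for fix in fixes)
    return {original: n for original, n in counts.items() if n > 1}


# Made-up fixes: "A" appears twice, "B" once.
example_fixes = [
    {"original": "A", "fix": "A_(album)"},
    {"original": "B", "fix": "B_(album)"},
    {"original": "A", "fix": "A_(album)"},
]
print(duplicated_originals(example_fixes))
```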