## DBpedia Fixes
The gold-standard datasets (metacritic-movies etc.) reference some entity URIs, for example movie URIs, that no longer exist in our version of the DBpedia Knowledge Graph (a subset of the 2016-04 dump).

Example:
http://dbpedia.org/resource/Carpool_(film) -> http://dbpedia.org/resource/Carpool_(1996_film)

This notebook aims to identify and fix these issues. It outputs a fixed version of the corresponding dataset in the original dataset folder.

Make sure that Jena Fuseki is up and running.

In [141]:
import pandas as pd
from pyrdf2vec.graphs import KG
import re
import os.path
from rdflimeConfig import dbpediaLocation, datasets, load_dataset

# 0: Metacritic-Movies
# 1: Metacritic-Albums
# 2: Forbes-Companies
# 3: Mercer-Cities
# 4: AAUP-Salaries
# -> See rdflimeConfig.py
explored_dataset_idx = 4
dataset_full, dataset_entities = load_dataset(datasets[explored_dataset_idx], fixed=False)
cfg = datasets[explored_dataset_idx]
entity_kind = cfg["entity"]

### Identify broken URIs

In [142]:
# Initiate dbpedia connection
dbpedia = KG(dbpediaLocation)

# Check gold-standard datasets
missing = [entity for entity in dataset_entities if not dbpedia.is_exist([entity])]
print(f"{len(missing)} out of {len(dataset_entities)} {entity_kind} have broken URIs.")

71 out of 960 salaries have broken URIs.
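
For intuition, an existence check like the one above can be phrased as a SPARQL `ASK` query against the endpoint. The sketch below only builds the query string and sends nothing over the network. The helper name `build_ask_query` is hypothetical, and it is an assumption that `KG.is_exist` issues a query of this shape.

```python
def build_ask_query(uri):
    """Build a SPARQL ASK query that is true if `uri` occurs as a subject.

    Hypothetical helper for illustration; not part of pyrdf2vec.
    """
    # Characters that would break the <...> IRI syntax are rejected early.
    if any(ch in uri for ch in '<>" '):
        raise ValueError(f"URI contains characters not allowed in an IRI: {uri!r}")
    return f"ASK {{ <{uri}> ?p ?o }}"


print(build_ask_query("http://dbpedia.org/resource/Carpool_(1996_film)"))
```

A URI that contains a literal space is rejected here, which is one reason a replacement URI may need normalising before it is checked.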


### Apply URI fixes so that DBpedia and gold-standard dataset match

In [145]:
fixes = []

In [146]:
# Try using the (fixed) DBpedia_URI15
ctr = len(fixes)
for entity in missing:
    fix = dataset_full[dataset_full["DBpedia_URI"] == entity].DBpedia_URI15.values[0]
    fix = fix.replace(" ", "+")  # Some URIs contain spaces instead of "+"
    if dbpedia.is_exist([fix]):
        fixes.append({"original": entity, "fix": fix})

print(f"{len(fixes)-ctr} can be fixed by using the fixed DBpedia_URI15 instead.")

30 can be fixed by using the fixed DBpedia_URI15 instead.


#### Specific fixes for metacritic-movies

In [103]:
# Try adding _(film) or _(yyyy_film) to URI
ctr = len(fixes)
for movie in missing:
    
    # Simply add _(film)
    simpleFix = movie + "_(film)"
    if dbpedia.is_exist([simpleFix]) and movie not in [f["original"] for f in fixes]: 
        fixes.append({"original": movie, "fix": simpleFix})
    
    # Try to add _(yyyy_film)
    else:
        releaseDate = dataset_full[dataset_full.DBpedia_URI == movie]["Release date"].iloc[0]
        releaseYear = int(re.search(r"\d{4}", releaseDate).group(0))

        maxDistYears = 1
        for y in range(releaseYear-maxDistYears, releaseYear+maxDistYears+1):

            # _(film) -> _(yyyy_film) or {blank} -> _(yyyy_film)
            fix = movie.replace("(film)", f"({y}_film)")
            if not fix.endswith("film)"):
                fix += f"_({y}_film)"

            if dbpedia.is_exist([fix]) and movie not in [f["original"] for f in fixes]:
                fixes.append({"original": movie, "fix": fix})
                break
print(f"{len(fixes)-ctr} can be fixed by adding _(film) or _(yyyy_film) to the URI.")

26 can be fixed by adding _(film) or _(yyyy_film) to the URI.
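
The candidate generation in the cell above can be separated from the DBpedia lookup, which makes it easy to inspect without a running endpoint. This is a minimal sketch that mirrors the loop's string handling; the function name `film_uri_candidates` and its parameters are hypothetical.

```python
def film_uri_candidates(uri, release_year, max_dist_years=1):
    """Return candidate URIs in the order the loop above would try them.

    Hypothetical helper: it only builds strings and performs no lookup.
    """
    # First candidate: simply append _(film).
    candidates = [uri + "_(film)"]
    for year in range(release_year - max_dist_years, release_year + max_dist_years + 1):
        # _(film) -> _(yyyy_film); a URI without a suffix gets _(yyyy_film) appended.
        candidate = uri.replace("(film)", f"({year}_film)")
        if not candidate.endswith("film)"):
            candidate += f"_({year}_film)"
        candidates.append(candidate)
    return candidates


for candidate in film_uri_candidates("http://dbpedia.org/resource/Carpool_(film)", 1996):
    print(candidate)
```

Each candidate would then be passed to `dbpedia.is_exist`, and the first hit kept.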


In [104]:
# Try removing duplicate _(film) or _(yyyy_film) endings
ctr = len(fixes)
for movie in missing:
    match = re.search(r"(?:_\((?:\d{4}_|)film\)){2}", movie)
    if match:
        fix = match.group(0)
        fix = fix[:len(fix) // 2]
        fix = movie.replace(fix, "", 1)
        if dbpedia.is_exist([fix]) and movie not in [f["original"] for f in fixes]:
            fixes.append({"original": movie, "fix": fix})
print(f"{len(fixes)-ctr} can be fixed by removing duplicate (film) part of URI.")

2 can be fixed by removing duplicate (film) part of URI.
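
The cell above halves the matched text, which isolates the first suffix only when both suffixes have the same length. A variant that captures the two suffixes as separate groups avoids that assumption. This is a sketch with a hypothetical helper name, not the notebook's method.

```python
import re

# One _(film) or _(yyyy_film) suffix, and two of them in direct succession.
SUFFIX = r"_\((?:\d{4}_)?film\)"
DOUBLE_SUFFIX = re.compile(rf"({SUFFIX})({SUFFIX})")


def drop_first_film_suffix(uri):
    """Remove the first of two consecutive film suffixes, or return None.

    Hypothetical helper: the second suffix is kept, as in the cell above.
    """
    match = DOUBLE_SUFFIX.search(uri)
    if match is None:
        return None
    # Cut out exactly the span of the first captured suffix.
    return uri[:match.start(1)] + uri[match.end(1):]


print(drop_first_film_suffix(
    "http://dbpedia.org/resource/As_Luck_Would_Have_It_(2012_film)_(2012_film)"
))
```

The result is still only a candidate and would need the same `dbpedia.is_exist` check before it is recorded as a fix.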


#### Manual matches

In [105]:
# Manual matches for remaining MOVIES
ctr = len(fixes)
mappings = [
    {"original": "http://dbpedia.org:8890/resource/Lone_Star_(1996_film)", "fix": "http://dbpedia.org/resource/Lone_Star_(1996_film)"},
    {"original": "http://dbpedia.org/resource/Good_Hair_(film)", "fix": "http://dbpedia.org/resource/Good_Hair"},
    {"original": "http://dbpedia.org/resource/Alice_and_Martin_(1998_film)", "fix": "http://dbpedia.org/resource/Alice_and_Martin"},
    {"original": "http://dbpedia.org/resource/Yours,_Mine_and_Ours_(2005_film)", "fix": "http://dbpedia.org/resource/Yours,_Mine_&_Ours_(2005_film)"},
    {"original": "http://dbpedia.org/resource/As_Luck_Would_Have_It_(2012_film)_(2012_film)", "fix": "http://dbpedia.org/resource/As_Luck_Would_Have_It_(2011_film)"},
    {"original": "http://dbpedia.org/resource/Das_Wilde_Leben", "fix": "http://dbpedia.org/resource/Eight_Miles_High_(film)"},
    {"original": "http://dbpedia.org/resource/Someone_Like_You_(film)", "fix": "http://dbpedia.org/resource/Someone_like_You_(film)"},
    {"original": "http://dbpedia.org/resource/John_Q", "fix": "http://dbpedia.org/resource/John_Q."},
]

for mapping in mappings:
    if mapping["original"] not in [f["original"] for f in fixes]:
        fixes.append(mapping)
print(f"{len(fixes)-ctr} can be fixed by manual matching.")

7 can be fixed by manual matching.


In [118]:
remaining = [m for m in missing if m not in [f["original"] for f in fixes]]
print(remaining)
print(len(remaining))

['http://dbpedia.org/resource/Illinois_(album)', 'http://dbpedia.org/resource/Untrue', 'http://dbpedia.org/resource/Hell_Hath_No_Fury', 'http://dbpedia.org/resource/Fleet_Foxes_(EP)', 'http://dbpedia.org/resource/Franz_Ferdinand_(DVD)', 'http://dbpedia.org/resource/Laulu_Laakson_Kukista', 'http://dbpedia.org/resource/Life..._The_Best_Game_in_Town', 'http://dbpedia.org/resource/Mar_Dulce_(album)', 'http://dbpedia.org/resource/Second_Chance_(album)', 'http://dbpedia.org/resource/Pattern+Grid_World', 'http://dbpedia.org/resource/Untitled_23', 'http://dbpedia.org/resource/03/07%E2%80%9309/07', 'http://dbpedia.org/resource/Hold_on_Now,_Youngster', 'http://dbpedia.org/resource/Love_Remains', 'http://dbpedia.org/resource/The_Evangelist', 'http://dbpedia.org/resource/Life_Cycles_(album)', 'http://dbpedia.org/resource/Flaws', 'http://dbpedia.org/resource/The_Age_of_Miracles', 'http://dbpedia.org/resource/Time_to_Die', 'http://dbpedia.org/resource/Audioslave_(DVD)', 'http://dbpedia.org/resource/

In [131]:
dataset_full[dataset_full.DBpedia_URI.isin(remaining)]

Unnamed: 0,Wikidata_URI15,id,album,artist,date,rating,DBpedia_URI,label,DBpedia_URI15,YAGO_URI15,DBpedia_URI15_Base32
22,http://www.wikidata.org/entity/Q1658831,23.0,Illinois,Sufjan Stevens,5-Jul-05,90.0,http://dbpedia.org/resource/Illinois_(album),good,http://dbpedia.org/resource/Illinois_(album),http://yago-knowledge.org/resource/Illinois_(a...,NB2HI4B2F4XWIYTQMVSGSYJON5ZGOL3SMVZW65LSMNSS6S...
25,http://www.wikidata.org/entity/Q2445090,26.0,Untrue,Burial,6-Nov-07,90.0,http://dbpedia.org/resource/Untrue,good,http://dbpedia.org/resource/Untrue,http://yago-knowledge.org/resource/Untrue,NB2HI4B2F4XWIYTQMVSGSYJON5ZGOL3SMVZW65LSMNSS6V...
29,,30.0,Hell Hath No Fury,Clipse,28-Nov-06,89.0,http://dbpedia.org/resource/Hell_Hath_No_Fury,good,http://dbpedia.org/resource/Hell_hath_no_fury,http://yago-knowledge.org/resource/Hell_Hath_N...,NB2HI4B2F4XWIYTQMVSGSYJON5ZGOL3SMVZW65LSMNSS6S...
79,http://www.wikidata.org/entity/Q2340936,80.0,Fleet Foxes,Fleet Foxes,3-Jun-08,87.0,http://dbpedia.org/resource/Fleet_Foxes_(EP),good,http://dbpedia.org/resource/Fleet_Foxes_(EP),http://yago-knowledge.org/resource/Fleet_Foxes...,NB2HI4B2F4XWIYTQMVSGSYJON5ZGOL3SMVZW65LSMNSS6R...
80,http://www.wikidata.org/entity/Q2795556,81.0,Franz Ferdinand,Franz Ferdinand,9-Mar-04,87.0,http://dbpedia.org/resource/Franz_Ferdinand_(DVD),good,http://dbpedia.org/resource/Franz_Ferdinand_(DVD),http://yago-knowledge.org/resource/Franz_Ferdi...,NB2HI4B2F4XWIYTQMVSGSYJON5ZGOL3SMVZW65LSMNSS6R...
187,http://www.wikidata.org/entity/Q4042654,188.0,Laulu Laakson Kukista,Paavoharju,22-Jul-08,85.0,http://dbpedia.org/resource/Laulu_Laakson_Kukista,good,http://dbpedia.org/resource/Laulu_Laakson_Kukista,http://yago-knowledge.org/resource/Laulu_Laaks...,NB2HI4B2F4XWIYTQMVSGSYJON5ZGOL3SMVZW65LSMNSS6T...
188,http://www.wikidata.org/entity/Q6544592,189.0,Life...The Best Game In Town,Harvey Milk,3-Jun-08,85.0,http://dbpedia.org/resource/Life..._The_Best_G...,good,http://dbpedia.org/resource/Life..._The_Best_G...,http://yago-knowledge.org/resource/Life..._The...,NB2HI4B2F4XWIYTQMVSGSYJON5ZGOL3SMVZW65LSMNSS6T...
264,http://www.wikidata.org/entity/Q9028412,265.0,Mar Dulce,Bajofondo,15-Jul-08,84.0,http://dbpedia.org/resource/Mar_Dulce_(album),good,http://dbpedia.org/resource/Mar_Dulce_(album),http://yago-knowledge.org/resource/Mar_Dulce_(...,NB2HI4B2F4XWIYTQMVSGSYJON5ZGOL3SMVZW65LSMNSS6T...
274,http://www.wikidata.org/entity/Q7443165,275.0,Second Chance,El DeBarge,30-Nov-10,84.0,http://dbpedia.org/resource/Second_Chance_(album),good,http://dbpedia.org/resource/Second_Chance_(album),http://yago-knowledge.org/resource/Second_Chan...,NB2HI4B2F4XWIYTQMVSGSYJON5ZGOL3SMVZW65LSMNSS6U...
440,http://www.wikidata.org/entity/Q7148355,441.0,Pattern + Grid World,Flying Lotus,21-Sep-10,82.0,http://dbpedia.org/resource/Pattern+Grid_World,good,http://dbpedia.org/resource/Pattern Grid_World,http://yago-knowledge.org/resource/Pattern+Gri...,NB2HI4B2F4XWIYTQMVSGSYJON5ZGOL3SMVZW65LSMNSS6U...


In [129]:
for x in remaining:
    print('{"original": "'+x+'", "fix": ""},')

{"original": "http://dbpedia.org/resource/Illinois_(album)", "fix": ""},
{"original": "http://dbpedia.org/resource/Untrue", "fix": ""},
{"original": "http://dbpedia.org/resource/Hell_Hath_No_Fury", "fix": ""},
{"original": "http://dbpedia.org/resource/Fleet_Foxes_(EP)", "fix": ""},
{"original": "http://dbpedia.org/resource/Franz_Ferdinand_(DVD)", "fix": ""},
{"original": "http://dbpedia.org/resource/Laulu_Laakson_Kukista", "fix": ""},
{"original": "http://dbpedia.org/resource/Life..._The_Best_Game_in_Town", "fix": ""},
{"original": "http://dbpedia.org/resource/Mar_Dulce_(album)", "fix": ""},
{"original": "http://dbpedia.org/resource/Second_Chance_(album)", "fix": ""},
{"original": "http://dbpedia.org/resource/Pattern+Grid_World", "fix": ""},
{"original": "http://dbpedia.org/resource/Untitled_23", "fix": ""},
{"original": "http://dbpedia.org/resource/03/07%E2%80%9309/07", "fix": ""},
{"original": "http://dbpedia.org/resource/Hold_on_Now,_Youngster", "fix": ""},
{"original": "http://dbpe

In [130]:
# Manual matches for remaining ALBUMS
ctr = len(fixes)
mappings = [
    {"original": "http://dbpedia.org/resource/Illinois_(album)", "fix": "http://dbpedia.org/resource/Illinois_(Sufjan_Stevens_album)"},
    {"original": "http://dbpedia.org/resource/Untrue", "fix": "http://dbpedia.org/resource/Untrue_(album)"},
    {"original": "http://dbpedia.org/resource/Hell_Hath_No_Fury", "fix": "http://dbpedia.org/resource/Hell_Hath_No_Fury_(Clipse_album)"},
    {"original": "http://dbpedia.org/resource/Fleet_Foxes_(EP)", "fix": "http://dbpedia.org/resource/Fleet_Foxes_(album)"},
    {"original": "http://dbpedia.org/resource/Franz_Ferdinand_(DVD)", "fix": ""},
    {"original": "http://dbpedia.org/resource/Laulu_Laakson_Kukista", "fix": ""},
    {"original": "http://dbpedia.org/resource/Life..._The_Best_Game_in_Town", "fix": ""},
    {"original": "http://dbpedia.org/resource/Mar_Dulce_(album)", "fix": ""},
    {"original": "http://dbpedia.org/resource/Second_Chance_(album)", "fix": ""},
    {"original": "http://dbpedia.org/resource/Pattern+Grid_World", "fix": ""},
    {"original": "http://dbpedia.org/resource/Untitled_23", "fix": ""},
    {"original": "http://dbpedia.org/resource/03/07%E2%80%9309/07", "fix": ""},
    {"original": "http://dbpedia.org/resource/Hold_on_Now,_Youngster", "fix": ""},
    {"original": "http://dbpedia.org/resource/Love_Remains", "fix": ""},
    {"original": "http://dbpedia.org/resource/The_Evangelist", "fix": ""},
    {"original": "http://dbpedia.org/resource/Life_Cycles_(album)", "fix": ""},
    {"original": "http://dbpedia.org/resource/Flaws", "fix": ""},
    {"original": "http://dbpedia.org/resource/The_Age_of_Miracles", "fix": ""},
    {"original": "http://dbpedia.org/resource/Time_to_Die", "fix": ""},
    {"original": "http://dbpedia.org/resource/Audioslave_(DVD)", "fix": ""},
    {"original": "http://dbpedia.org/resource/Morning_View_Sessions", "fix": ""},
    {"original": "http://dbpedia.org/resource/Let_Them_Talk", "fix": ""},
    {"original": "http://dbpedia.org/resource/Love_Notes/Letter_Bombs", "fix": ""},
    {"original": "http://dbpedia.org/resource/Dead_Drunk", "fix": ""},
    {"original": "http://dbpedia.org/resource/Folker", "fix": ""},
    {"original": "http://dbpedia.org/resource/Generation_Freakshow_(album)", "fix": ""},
    {"original": "http://dbpedia.org/resource/The_Wombats_Proudly_Present:_This_Modern_Glitch", "fix": ""},
    {"original": "http://dbpedia.org/resource/Ariels", "fix": ""},
    {"original": "http://dbpedia.org/resource/Mr._A%E2%80%93Z", "fix": ""},
    {"original": "http://dbpedia.org/resource/The_Sellout", "fix": ""},
    {"original": "http://dbpedia.org/resource/Seal_(2003_album)", "fix": ""},
    {"original": "http://dbpedia.org/resource/The_Temper_Trap_EP", "fix": ""},
    {"original": "http://dbpedia.org/resource/What_Is_Love%3F_(album)", "fix": ""},
    {"original": "http://dbpedia.org/resource/Wild_Ones_(album)", "fix": ""},
    {"original": "http://dbpedia.org/resource/Rebelution_(Pitbull_album)", "fix": ""},
    {"original": "http://dbpedia.org/resource/Lotusflower_(album)", "fix": ""},
    {"original": "http://dbpedia.org/resource/Sorry_for_Party_Rocking", "fix": ""},
]

for mapping in mappings:
    if mapping["original"] not in [f["original"] for f in fixes]:
        fixes.append(mapping)
print(f"{len(fixes)-ctr} can be fixed by manual matching.")

37 can be fixed by manual matching.
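
Most entries in the album list above still have an empty `"fix"`, yet they are appended and counted like resolved ones. One way to keep such placeholders out of `fixes` is to split the list first. This is a minimal sketch and the helper name `split_mappings` is hypothetical.

```python
def split_mappings(mappings):
    """Separate resolved mappings from placeholders with an empty "fix".

    Hypothetical helper: returns (resolved mappings, unresolved original URIs).
    """
    resolved = [m for m in mappings if m["fix"].strip()]
    unresolved = [m["original"] for m in mappings if not m["fix"].strip()]
    return resolved, unresolved


example = [
    {"original": "http://dbpedia.org/resource/Illinois_(album)",
     "fix": "http://dbpedia.org/resource/Illinois_(Sufjan_Stevens_album)"},
    {"original": "http://dbpedia.org/resource/Franz_Ferdinand_(DVD)", "fix": ""},
]
resolved, unresolved = split_mappings(example)
print(f"{len(resolved)} resolved, {len(unresolved)} still open")
```

Only the resolved part would then be appended to `fixes`, and the open URIs remain visible for further manual work.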


### Check and create fixed datasets

In [107]:
dataset_full["DBpedia_URI16"] = dataset_full.DBpedia_URI

for fix in fixes: 
    dataset_full["DBpedia_URI16"] = dataset_full["DBpedia_URI16"].replace(fix["original"], fix["fix"])
    
len(dataset_full[dataset_full.DBpedia_URI != dataset_full.DBpedia_URI16])

38

In [81]:
# All issues fixed?
still_missing = [entity for entity in dataset_full.DBpedia_URI16.values if not dbpedia.is_exist([entity])]
assert len(still_missing) == 0, "Not every broken URI has been fixed!"

# No duplicate fixes?
assert len(fixes) == len({fix["original"] for fix in fixes}), "Some entities have more than one fix!"

dataset_full.to_csv(os.path.join(cfg["location"], cfg["file"]), sep="\t", index=False)

print("Created fixed dataset and stored fixes to file.")

AssertionError: Some entities have more than one fix!
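
The assertion reports that duplicates exist but not which entities are affected. A small sketch that lists the originals occurring more than once in `fixes` could help to track them down. The helper name `duplicated_originals` is hypothetical.

```python
from collections import Counter


def duplicated_originals(fixes):
    """Return {original URI: count} for originals that occur more than once.

    Hypothetical helper for diagnosing the failed assertion.
    """
    counts = Counter(fix["original"] for fix in fixes)
    return {uri: n for uri, n in counts.items() if n > 1}


example = [
    {"original": "http://dbpedia.org/resource/John_Q", "fix": "http://dbpedia.org/resource/John_Q."},
    {"original": "http://dbpedia.org/resource/John_Q", "fix": "http://dbpedia.org/resource/John_Q."},
    {"original": "http://dbpedia.org/resource/Good_Hair_(film)", "fix": "http://dbpedia.org/resource/Good_Hair"},
]
print(duplicated_originals(example))
```

Running it on the notebook's `fixes` list before the assertion shows which step added a second entry for the same entity.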