## DBpedia Fixes
The _metacritic-movies_ dataset references some movie URIs that no longer exist in our version of the DBpedia Knowledge Graph (a subset of the 2016-04 dump).

Example (broken URI -> URI that exists in the graph):
http://dbpedia.org/resource/Carpool_(film) -> http://dbpedia.org/resource/Carpool_(1996_film)

This notebook identifies and fixes these broken URIs. It writes a fixed version of the _metacritic-movies_ dataset (`movies_fixed.tsv`) and the list of applied fixes (`datasetFixes.json`) to the original dataset folder.

Make sure that Jena Fuseki is up and running before executing the cells below.

In [4]:
import pandas as pd
from pyrdf2vec.graphs import KG
import re
import os.path
import json
from rdflimeConfig import dbpediaLocation, movieLocation

### Identify broken URIs

In [5]:
# Initiate dbpedia connection
dbpedia = KG(dbpediaLocation)

# Read movie dataset
movieFull = pd.read_csv(os.path.join(movieLocation, "movies.tsv"), sep="\t")
movies = movieFull["DBpedia_URI"].tolist()

# Check movies
missing = [movie for movie in movies if not dbpedia.is_exist([movie])]
print(f"{len(missing)} out of {len(movies)} movies have broken URIs.")

38 out of 2000 movies have broken URIs.
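
For reference, an existence check against a SPARQL endpoint can be phrased as an ASK query. The sketch below only builds the query string. The helper name `build_ask_query` is hypothetical, and it is an assumption (not verified here) that `KG.is_exist` sends a query of this shape to the endpoint.

```python
def build_ask_query(uri):
    """Build a SPARQL ASK query that is true if `uri` occurs as a subject.

    Hypothetical helper for illustration; not part of pyrdf2vec.
    """
    return f"ASK WHERE {{ <{uri}> ?p ?o . }}"


print(build_ask_query("http://dbpedia.org/resource/Carpool_(1996_film)"))
```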


### Try to automagically fix some URIs

In [6]:
fixes = []

In [7]:
# Try adding _(film) or _(yyyy_film) to URI
for movie in missing:
    
    # Simply add _(film)
    simpleFix = movie + "_(film)"
    if dbpedia.is_exist([simpleFix]):
        fixes.append({"original": movie, "fix": simpleFix})
    
    # Try to add _(yyyy_film)
    else:
        releaseDate = movieFull[movieFull.DBpedia_URI == movie]["Release date"].iloc[0]
        releaseYear = int(re.search(r"\d{4}", releaseDate).group(0))

        maxDistYears = 1
        for y in range(releaseYear-maxDistYears, releaseYear+maxDistYears+1):

            # _(film) -> _(yyyy_film) or {blank} -> _(yyyy_film)
            fix = movie.replace("(film)", f"({y}_film)")
            if not fix.endswith("film)"):
                fix += f"_({y}_film)"

            if dbpedia.is_exist([fix]):
                fixes.append({"original": movie, "fix": fix})
                break
print(f"{len(fixes)} can be fixed by adding _(film) or _(yyyy_film) to the URI.")

27 can be fixed by adding _(film) or _(yyyy_film) to the URI.
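
To make the search order explicit, the candidate URIs that the loop above tries can be listed without touching the knowledge graph. This is a minimal sketch: `candidate_uris` is a hypothetical helper that mirrors the string handling of the cell, and the release year is passed in directly instead of being parsed from the dataset.

```python
from typing import List


def candidate_uris(uri: str, release_year: int, max_dist_years: int = 1) -> List[str]:
    """List candidate URIs in the order the loop would check them."""
    # First candidate: simply append _(film)
    candidates = [uri + "_(film)"]
    # Then try _(yyyy_film) for every year close to the release year
    for y in range(release_year - max_dist_years, release_year + max_dist_years + 1):
        fix = uri.replace("(film)", f"({y}_film)")
        if not fix.endswith("film)"):
            fix += f"_({y}_film)"
        candidates.append(fix)
    return candidates


for candidate in candidate_uris("http://dbpedia.org/resource/Carpool_(film)", 1996):
    print(candidate)
```

In the notebook each candidate is checked with `dbpedia.is_exist` and the first hit is kept, so a year closer to the lower bound wins if several candidates exist.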


In [8]:
# Try removing duplicate _(film) or _(yyyy_film) endings
ctr = len(fixes)
for movie in missing:
    match = re.search(r"(?:_\((?:\d{4}_|)film\)){2}", movie)
    if match:
        duplicate = match.group(0)
        # Both halves are assumed to be identical; keep only one of them
        suffix = duplicate[:len(duplicate) // 2]
        fix = movie.replace(suffix, "", 1)
        if dbpedia.is_exist([fix]):
            fixes.append({"original": movie, "fix": fix})
print(f"{len(fixes)-ctr} can be fixed by removing duplicate (film) part of URI.")

2 can be fixed by removing duplicate (film) part of URI.
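
The string handling of this step can be tried in isolation. The sketch below uses a hypothetical helper, `drop_duplicate_suffix`, and made-up example URIs. It returns the shortened URI only and does not check whether that URI exists in DBpedia.

```python
import re

# Two consecutive _(film) or _(yyyy_film) endings
DUPLICATE_SUFFIX = re.compile(r"(?:_\((?:\d{4}_|)film\)){2}")


def drop_duplicate_suffix(uri):
    """Return `uri` with one of two repeated suffixes removed, or None if there is no repeat."""
    match = DUPLICATE_SUFFIX.search(uri)
    if match is None:
        return None
    doubled = match.group(0)
    # Assumes both suffixes are identical, so the first half equals one suffix
    suffix = doubled[: len(doubled) // 2]
    return uri.replace(suffix, "", 1)


print(drop_duplicate_suffix("http://example.org/resource/Some_Movie_(2012_film)_(2012_film)"))
print(drop_duplicate_suffix("http://example.org/resource/Some_Movie_(film)"))
```

Halving the match only works when both suffixes are the same. For a mixed pair such as `_(film)_(2012_film)` the two parts differ in length, so the halves no longer line up with the suffixes and the result is not a valid URI.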


### Manual matches for remaining movies

In [9]:
ctr = len(fixes)
fixes.extend([
    {"original": "http://dbpedia.org:8890/resource/Lone_Star_(1996_film)", "fix": "http://dbpedia.org/resource/Lone_Star_(1996_film)"},
    {"original": "http://dbpedia.org/resource/La_grande_strada_azzurra", "fix": "http://dbpedia.org/resource/La_grande_strada"}, # appeared in US in 2001, but is from 1957
    {"original": "http://dbpedia.org/resource/Good_Hair_(film)", "fix": "http://dbpedia.org/resource/Good_Hair"},
    {"original": "http://dbpedia.org/resource/Alice_and_Martin_(1998_film)", "fix": "http://dbpedia.org/resource/Alice_and_Martin"},
    {"original": "http://dbpedia.org/resource/Yours,_Mine_and_Ours_(2005_film)", "fix": "http://dbpedia.org/resource/Yours,_Mine_&_Ours_(2005_film)"},
    {"original": "http://dbpedia.org/resource/As_Luck_Would_Have_It_(2012_film)_(2012_film)", "fix": "http://dbpedia.org/resource/As_Luck_Would_Have_It_(2011_film)"},
    {"original": "http://dbpedia.org/resource/Das_Wilde_Leben", "fix": "http://dbpedia.org/resource/Eight_Miles_High_(film)"},
    {"original": "http://dbpedia.org/resource/Someone_Like_You_(film)", "fix": "http://dbpedia.org/resource/Someone_like_You_(film)"},
    {"original": "http://dbpedia.org/resource/John_Q", "fix": "http://dbpedia.org/resource/John_Q."},
    #{"original": "http://dbpedia.org/resource/Moolaad%C3%A9", "fix": "http://dbpedia.org/resource/Moolaadé"},
])
print(f"{len(fixes)-ctr} can be fixed by manual matching.")

9 can be fixed by manual matching.


### Check and create fixed datasets

In [10]:
# All issues fixed?
assert len(fixes) >= len(missing), "Not every broken URI has been fixed!"

# No duplicate fixes?
assert len(fixes) == len({fix["original"] for fix in fixes}), "Some movies have more than one fix!"

# Do all fixes exist?
assert dbpedia.is_exist([fix["fix"] for fix in fixes]), "Not all fixes are found in dbpedia!"

for fix in fixes:
    movieFull = movieFull.replace(fix["original"], fix["fix"])

movieFull.to_csv(os.path.join(movieLocation, "movies_fixed.tsv"), sep="\t", index=False)
with open(os.path.join(movieLocation, "datasetFixes.json"), "w") as file:
    json.dump(fixes, file, indent=4)

print("Created fixed dataset and stored fixes to file.")

Created fixed dataset and stored fixes to file.
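
The length comparison in the first assertion passes whenever there are at least as many fixes as broken URIs, even if some fixes belong to URIs that were never reported as missing. A stricter variant compares the URIs themselves. The following is a sketch with a hypothetical helper, `check_fixes`, run on a two-entry toy sample rather than the real `missing` and `fixes` lists.

```python
def check_fixes(missing, fixes):
    """Return (broken URIs without a fix, originals that have more than one fix)."""
    originals = [fix["original"] for fix in fixes]
    unfixed = set(missing) - set(originals)
    duplicated = {uri for uri in originals if originals.count(uri) > 1}
    return unfixed, duplicated


# Toy sample: two broken URIs, only one of them has a fix
missing = [
    "http://dbpedia.org/resource/John_Q",
    "http://dbpedia.org/resource/Good_Hair_(film)",
]
fixes = [
    {"original": "http://dbpedia.org/resource/John_Q", "fix": "http://dbpedia.org/resource/John_Q."},
]

unfixed, duplicated = check_fixes(missing, fixes)
print(unfixed)     # broken URIs that still need a fix
print(duplicated)  # originals that appear more than once in fixes
```

Reporting the offending URIs, rather than only failing an assertion, makes it easier to see which manual matches are still needed.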
