A set of scripts that is a functional approach to creating a domain specific LOD name directory

Linked Jazz

Name Directory Creation

This set of scripts is a functional approach to creating a domain-specific LOD (Linked Open Data) name directory. It works with extract files that are processed sequentially; no database interface is needed, just the extracts and the scripts. The scripts use keywords to build our jazz directory, but the keywords could easily be replaced to create a name directory for another domain. Much of the process is designed to work on a VPS, but some parts (filterLOCskos.py) needed to be done locally.

Installing:

Requires OS X/Linux command-line tools: grep, wget, etc.

Extracts Needed:

The process requires a number of extract files from DBpedia and the Library of Congress.

DBpedia:

(When a new version of the DBpedia extract is released, the URLs below will need to be updated.)

Library of Congress:

Extract these files into the data directory (you are going to need a lot of disk space).

Running:

Building the directory is just a matter of running the scripts in order.

python filterDBpediaJazzFile.py

This takes an article-category approach: it starts from everything related to jazz and filters it down to people. The process is diagrammed in filterDBpediaJazzFile.pdf.
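As a rough illustration of the category approach, the sketch below collects every subject whose category URI mentions a keyword and then keeps only those typed as a person. It is not the code in filterDBpediaJazzFile.py: the keyword list, the function names, and the use of `http://dbpedia.org/ontology/Person` as the person class are assumptions made for this example.

```python
import re

# Hypothetical keyword list; the real script's keywords are not shown in this README.
JAZZ_KEYWORDS = ("jazz",)

# Assumed class URI used to recognise people in the DBpedia types extract.
PERSON_CLASS = "http://dbpedia.org/ontology/Person"

# Matches a simple N-Triples statement whose object is a URI.
TRIPLE_RE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$')


def jazz_subjects(category_lines, keywords=JAZZ_KEYWORDS):
    """Return subject URIs whose category URI contains any keyword."""
    subjects = set()
    for line in category_lines:
        match = TRIPLE_RE.match(line)
        if not match:
            continue  # skip comments, literals, and malformed lines
        subject, _predicate, category = match.groups()
        if any(keyword in category.lower() for keyword in keywords):
            subjects.add(subject)
    return subjects


def filter_to_people(subjects, type_lines, person_class=PERSON_CLASS):
    """Keep only the subjects that the type triples declare to be people."""
    people = set()
    for line in type_lines:
        match = TRIPLE_RE.match(line)
        if match and match.group(3) == person_class and match.group(1) in subjects:
            people.add(match.group(1))
    return people
```

Swapping the keyword tuple for another domain's terms is the only change this sketch would need to target a different subject area.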

python filterLOCskos.py

Takes the enormous LC data file and creates a new, more manageable LC lookup. The first step creates personURIs.nt; this could be done locally and added to the extract data on a server to reduce the space needed. Making this file will take a long time, as it is grepping a 30GB extract. The process is diagrammed in filter_LOC_filterLOCskos.pdf.
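The grep step can be pictured as a streaming filter that never holds more than one line of the extract in memory. The following is a minimal sketch, not the script's actual code: the marker string is a stand-in (the README does not say what pattern is grepped for), and both function names are hypothetical.

```python
# Hypothetical marker: the README does not say what the script greps for.
PERSON_MARKER = "http://www.loc.gov/mads/rdf/v1#PersonalName"


def matching_lines(lines, marker=PERSON_MARKER):
    """Yield only the lines that contain marker, one at a time."""
    for line in lines:
        if marker in line:
            yield line


def filter_file(src_path, dst_path, marker=PERSON_MARKER):
    """Stream src_path to dst_path, keeping matching lines; return the count kept."""
    kept = 0
    with open(src_path, encoding="utf-8", errors="replace") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # Iterating over the file object reads it lazily, line by line.
        for line in matching_lines(src, marker):
            dst.write(line)
            kept += 1
    return kept
```

Because the output is far smaller than the input, a file produced this way on a local machine can be copied to a server in place of the full extract.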

python addDatesToJazzPeople.py

This adds birth and death dates to the name directory for people who lack that data in structured form but have it in their abstract. Only the year is considered.
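One way to pull years out of an abstract is to look at the first parenthetical that contains a four-digit year, which is where Wikipedia-style abstracts usually put life dates. The sketch below assumes that convention; the function name and the regular expressions are illustrative and may differ from what addDatesToJazzPeople.py does.

```python
import re

# Four-digit years from 1600 to 2099; anything else is ignored.
YEAR_RE = re.compile(r'\b(1[6-9]\d{2}|20\d{2})\b')
# Text inside a single, non-nested pair of parentheses.
PAREN_RE = re.compile(r'\(([^()]*)\)')


def years_from_abstract(abstract):
    """Guess (birth_year, death_year) from the first parenthetical with a year.

    Either value is None when it cannot be found.
    """
    for group in PAREN_RE.findall(abstract):
        years = [int(year) for year in YEAR_RE.findall(group)]
        if years:
            birth = years[0]
            # Accept a second year as the death year only if it is not earlier.
            death = years[1] if len(years) > 1 and years[1] >= birth else None
            return birth, death
    return None, None


print(years_from_abstract(
    "Miles Dewey Davis III (May 26, 1926 – September 28, 1991) was an "
    "American jazz trumpeter."
))  # (1926, 1991)
```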

python mergeLOCandDBpedia.py

This attempts to merge the two authorities based on name and dates. It produces a number of final name directory sameAs_*.nt files, grouped by the confidence of the match. Documented in mergeLOCandDBpedia.pdf.
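To make the idea of confidence-based matching concrete, here is a minimal sketch that compares normalized names and then grades the match by how many dates agree. The tier names, the record layout (`name`, `birth`, `death` keys), and the use of `owl:sameAs` as the linking predicate are all assumptions for illustration; the actual rules are the ones documented in mergeLOCandDBpedia.pdf.

```python
import re
import unicodedata

# Assumed predicate for the output triples, inferred from the sameAs_*.nt file names.
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def normalize_name(name):
    """Strip accents, digits, and punctuation, then sort the lowercase tokens.

    Sorting makes 'Davis, Miles, 1926-1991' and 'Miles Davis' compare equal.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    tokens = re.findall(r"[a-z]+", no_accents.lower())
    return " ".join(sorted(tokens))


def match_confidence(dbpedia_person, loc_person):
    """Return 'high', 'medium', 'low', or None for a pair of person records."""
    if normalize_name(dbpedia_person["name"]) != normalize_name(loc_person["name"]):
        return None
    # Compare only the dates that both records actually have.
    shared = [
        (dbpedia_person.get(key), loc_person.get(key))
        for key in ("birth", "death")
        if dbpedia_person.get(key) is not None and loc_person.get(key) is not None
    ]
    if any(left != right for left, right in shared):
        return None  # same name but conflicting dates: treat as different people
    return {2: "high", 1: "medium", 0: "low"}[len(shared)]


def same_as_triple(dbpedia_uri, loc_uri):
    """Format one N-Triples line linking the two URIs."""
    return "<%s> <%s> <%s> ." % (dbpedia_uri, SAME_AS, loc_uri)
```

In a scheme like this, each tier would be written to its own file, so that low-confidence matches can be reviewed separately from the high-confidence ones.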

python filterToJazzData.py

Optional. This script creates an auxiliary file for the sameAs files containing each person's image (if one exists in Wikipedia) and their short abstract.
