WikidataTreeBuilderSPARQL.py contains a class that can:
- Query Wikidata for all descendants of a node;
- Structure the result as an arborescence;
- Create a 'flare' dictionary object, suitable for writing to a JSON file and visualising with D3.js;
- Create a table with all descendants of a node, their properties as extracted from Wikidata, and all the paths from the root to each node.
Go to the folder where you want to install the project and run:
git clone https://github.com/petartodorov/WikidataTreeBuilderSPARQL
You are then ready to run the Jupyter notebooks provided with the project!
For more context, check out my Medium blog post: https://medium.com/@Montag86/exploring-wikidata-for-nlp-24c4a7babf0f
WikidataSampleRun-Software.ipynb shows how to use the class to get the arborescence of a given node (software, whose Wikidata ID is Q7397) and a table with all the properties we define as relevant (this list of properties is the lookup_claims parameter of the class constructor). The default parameters are suited to extracting information from the 'software' node and its arborescence. To explore the arborescence of another node, simply replace "Q7397" in the call to from_root with the desired root. To get a table of relevant properties for that node, consult the Wikidata documentation to find out which properties matter for your problem, build their list, and pass it as an input parameter to the class constructor.
During initialization, the following parameters are set to these default values:
query_endpoint = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"
debug = False
The labels to retrieve from Wikidata. These three should be of general interest for any root node:
query_labels = ["rdfs:label", "skos:altLabel", "schema:description"]
labels_languages = ["en", "fr"]
(Reference for property names: https://www.wikidata.org/wiki/Wikidata:List_of_properties/all_in_one_table)
lookup_claims = ["P571", "P275", "P101", "P135", "P348", "P306", "P1482", "P277", "P577", "P366", "P178", "P31", "P279", "P2572", "P3966", "P144", "P170", "P1324"]
properties_set_membership = ["P31", "P279"]
default_language = "en"
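To make the role of these defaults more concrete, here is a minimal sketch of how a query for the direct children of a node could be assembled from properties_set_membership and labels_languages. The helper build_children_query is hypothetical and is not taken from the class; the query the class actually sends may well differ. The sketch only builds the query string and does not contact the endpoint.

```python
def build_children_query(root, membership=("P31", "P279"), languages=("en", "fr")):
    """Return a SPARQL query string selecting the direct children of `root`.

    Hypothetical helper for illustration; not part of WikidataTreeBuilderSPARQL.
    """
    # A child is any item linked to the root by one of the membership
    # properties: "instance of" (P31) or "subclass of" (P279).
    path = "|".join(f"wdt:{prop}" for prop in membership)
    # Keep only labels written in one of the requested languages.
    lang_filter = ", ".join(f'"{lang}"' for lang in languages)
    return (
        "SELECT ?item ?label WHERE {\n"
        f"  ?item ({path}) wd:{root} .\n"
        "  ?item rdfs:label ?label .\n"
        f"  FILTER(LANG(?label) IN ({lang_filter}))\n"
        "}"
    )


query = build_children_query("Q7397")
print(query)
```

The printed string can be pasted into the Wikidata Query Service web interface to inspect the results by hand.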
Depending on your goal, you might not need to run all of these commands.
To get a flare.json file for a given root, suitable for visualisation with the D3.js tree layout (https://bl.ocks.org/mbostock/4339083), you can type:
> from WikidataTreeBuilderSPARQL import WikidataTreeQuery
> tree = WikidataTreeQuery()
> flare = tree.from_root("Q7397", forbidden=["Q7889", "Q28923"])
The forbidden parameter tells the tree's recursive exploration function not to explore the given nodes.
> import json
> with open("flare.json", "w") as f: json.dump(flare, f, indent=4)
Now you have the flare.json file!
If you want to make the labels human-readable:
> flare = tree.add_labels(flare)
> with open("flareHR.json", "w") as f: json.dump(flare, f, indent=4)
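As a rough idea of what such a relabelling step involves, the sketch below replaces each node name with a label taken from a lookup dictionary and falls back to the original name when no label is known. It assumes node names are Wikidata IDs stored under a "name" key; the function relabel, the lookup dictionary, and the placeholder child ID are hypothetical and say nothing about how add_labels is actually implemented.

```python
def relabel(flare, labels):
    """Return a copy of `flare` with names replaced by labels where known."""
    node = {"name": labels.get(flare["name"], flare["name"])}
    if "children" in flare:
        node["children"] = [relabel(child, labels) for child in flare["children"]]
    return node


# Q7397 is the 'software' node; "Q_EXAMPLE" is a made-up placeholder ID
# with no known label, so it is left unchanged.
labels = {"Q7397": "software"}
flare = {"name": "Q7397", "children": [{"name": "Q_EXAMPLE"}]}

readable = relabel(flare, labels)
print(readable)
```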
And finally, if you want the table with all the descendant nodes and all the properties you need as a pandas DataFrame, you can use:
> df = tree.get_pretty_DF()
> df.to_excel("tableComputerScience.xlsx")
And voilà!
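Since a node can be reached from the root by several routes, the table lists all the paths to each node. A minimal sketch of that idea, on a toy graph and in plain Python, is given below. The function all_paths, the node names, and the row layout are assumptions made for illustration; the columns produced by get_pretty_DF are not documented here.

```python
def all_paths(root, children):
    """Yield every path from `root` to each node reachable from it."""
    stack = [[root]]
    while stack:
        path = stack.pop()
        yield path
        for child in children.get(path[-1], []):
            if child not in path:  # do not walk around cycles
                stack.append(path + [child])


# Toy graph in which "c" has two parents, so it is reachable by two paths.
CHILDREN = {"root": ["a", "b"], "a": ["c"], "b": ["c"]}

# Group the paths by the node they end at.
paths_by_node = {}
for path in all_paths("root", CHILDREN):
    paths_by_node.setdefault(path[-1], []).append(" > ".join(path))

# One row per node, in a shape that pandas.DataFrame accepts directly.
rows = [{"node": node, "paths": paths} for node, paths in paths_by_node.items()]
for row in rows:
    print(row)
```

Passing rows to pandas.DataFrame would give one line per node with a list of paths in the second column.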