# Querying Wikidata

Wikidata is a free linked database that serves as central storage for the structured data in Wikipedia and other Wikimedia projects. Its [query service](https://query.wikidata.org/) is officially live.

This service allows you to execute [SPARQL](https://en.wikipedia.org/wiki/SPARQL) queries for answering questions like *What are the heights of all the mountains in California?* or *What are the most populated cities whose mayors are women?* or even *For each country, how many ministers are alive who are themselves children of a minister?*  For more query examples see [this page](https://www.mediawiki.org/wiki/Wikibase/Indexing/SPARQL_Query_Examples).



## Wikidata's data model

Wikidata is trying to build a structured database of every claim about every entity on Wikipedia -- and in every language. The data model gets complex, but the most basic distinction is that there are:
* entities (things in the world: California, Mount Tamalpais, George Washington, Harry Potter, carbon-14, python)
* properties (types of claims: 'instance of', 'coordinate location', 'cause of death', 'population')
* statements (an entity-property-data relation: 'python is an instance of a programming language', 'California has a population of 39,144,818'). 

Everything that would get its own Wikipedia article is an entity, and the Wikidata project is about importing all the unstructured statements from those articles into a database. Because Wikidata was built to be language-independent, everything has a unique alphanumeric identifier. For example, California is Q99.

The best way to get a feel for Wikidata's data model is to browse an individual entity, so let's look at the entry for Mount Tamalpais to see what is there. You can search for any entity or property in the search bar at wikidata.org, or you can click the "Wikidata item" link in the left-hand sidebar of any Wikipedia article. The Wikidata page for Mount Tamalpais lives under its unique identifier, [Q785665](https://www.wikidata.org/wiki/Q785665).

### Mount Tam's statements
<img src="mount-tam-wikidata.png">

We see that the first statement is one of the most common and foundational statements in Wikidata: instance of. If you hover over the 'instance of' link, you can see that it links to Property P31, which is the structured identifier for this kind of relation between entities and data. Mount Tam is an instance of a mountain, and a mountain is also an entity in Wikidata. 

For many statements in Wikidata, the data in the statement is another Wikidata entity, which has its own kinds of statements. One of Mount Tam's other statements is the property 'located in the administrative territorial entity' (or P131), with the data for that statement being the Wikidata entity 'California' (or Q99). Other Wikidata statements have raw data, like the 'coordinate location' (Property P625) statement.
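
To make the entity-property-value structure concrete, here is a minimal Python sketch that models three of Mount Tam's statements as tuples. The `Statement` class, its field names, and the `is_entity_value` helper are hypothetical and chosen only for illustration; this is not how Wikidata stores data internally, and the coordinate value is a placeholder rather than the real location.

```python
from typing import NamedTuple, Union


class Statement(NamedTuple):
    """Hypothetical container for one entity-property-value claim."""
    entity: str               # Q-identifier of the subject
    prop: str                 # P-identifier of the property
    value: Union[str, tuple]  # another Q-identifier, or raw data


statements = [
    Statement("Q785665", "P31", "Q8502"),   # Mount Tam: instance of: mountain
    Statement("Q785665", "P131", "Q99"),    # Mount Tam: located in: California
    Statement("Q785665", "P625", ("<lat>", "<lon>")),  # raw data (placeholder)
]


def is_entity_value(stmt):
    """True when the value is itself a Wikidata entity (a Q-identifier)."""
    value = stmt.value
    return isinstance(value, str) and value.startswith("Q") and value[1:].isdigit()


for stmt in statements:
    kind = "entity" if is_entity_value(stmt) else "raw data"
    print(stmt.entity, stmt.prop, "->", kind)
```

The first two statements point at other entities, which can be looked up in turn; the third carries raw data and ends the chain.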

### Querying Wikidata 

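To query these data, you can use a structured query language called SPARQL. Its syntax resembles SQL, but it is a separate language designed for graph data stored as subject-predicate-object triples. Suppose we want the coordinates of every mountain in California. The pseudoquery for this would be something like: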

    Return all statements about coordinate locations
    For all entities that are instances of mountains
    That are located in the administrative territorial entity 'California'

We then have to translate these statements and entities into language-neutral identifiers, which becomes:

    For all entities that are instances of (P31) mountains (Q8502)
    That are located in the administrative territorial entity (P131) 'California' (Q99)
    Return all statements about coordinate locations (P625)

The way we do this in SPARQL is:

    SELECT ?mountain ?coord 
    WHERE {
        ?mountain wdt:P31 wd:Q8502 .     # define ?mountain as all entities that are instances of (P31) mountains (Q8502) ...
        ?mountain wdt:P131 wd:Q99 .      # that are in the administrative territorial entity (P131) 'California' (Q99)...
        ?mountain wdt:P625 ?coord        # for ?mountain, find all coordinate statements (P625) in the variable ?coord 
    }
    
(We also have to put in a bunch of prefix declarations, which are similar to importing a library. The Python section at the end of this notebook shows them in full.)
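
As a rough illustration of what those declarations look like, the sketch below prepends `PREFIX` lines to the query body. The helper name `with_prefixes` is hypothetical; the two URIs are the ones Wikidata uses for entities (`wd:`) and direct property values (`wdt:`).

```python
# Prefixes map short names to the full URIs that identify entities and properties.
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",        # entities (Q-identifiers)
    "wdt": "http://www.wikidata.org/prop/direct/",  # direct property values (P-identifiers)
}


def with_prefixes(body, prefixes=PREFIXES):
    """Prepend PREFIX declarations to a SPARQL query body (hypothetical helper)."""
    header = "\n".join(f"PREFIX {name}: <{uri}>" for name, uri in prefixes.items())
    return header + "\n\n" + body


query_body = """SELECT ?mountain ?coord
WHERE {
    ?mountain wdt:P31 wd:Q8502 .
    ?mountain wdt:P131 wd:Q99 .
    ?mountain wdt:P625 ?coord
}"""

full_query = with_prefixes(query_body)
print(full_query)
```

With the prefixes in place, `wd:Q99` is simply shorthand for `<http://www.wikidata.org/entity/Q99>`.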

### Using Wikidata's web query service

There is a great way to test out your queries in the browser at https://query.wikidata.org. [Here](http://tinyurl.com/ca-mountain-nolabel) is the above SPARQL query in the web query service. One of the great things about the web query service is that you can hover over every property or entity and see what it is. You can also directly download the data in a number of formats.
<img src="ca-mountain-wikidata.png">
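
If you want to build such a link yourself, the sketch below shows one way to do it. It assumes the web interface reads the query from the URL fragment (the part after `#`) in percent-encoded form, which is how its shared links appear to work; the helper name `query_service_link` is hypothetical.

```python
from urllib.parse import quote, unquote


def query_service_link(sparql):
    """Build a link that opens a query in the web query service.

    Assumption: the web UI reads the percent-encoded query from the URL fragment.
    """
    return "https://query.wikidata.org/#" + quote(sparql, safe="")


sparql = "SELECT ?mountain WHERE { ?mountain wdt:P31 wd:Q8502 . ?mountain wdt:P131 wd:Q99 }"
link = query_service_link(sparql)
print(link)

# Decoding the fragment gives back the original query text.
print(unquote(link.split("#", 1)[1]) == sparql)
```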

### Labels

The first thing you'll notice is that the ?mountain variable is the unique identifier for each mountain, not the English name (or Spanish or Japanese or Arabic...). To get that, you have to add another block to the SPARQL query.

    # Out of the following query, select the variables: ?mountain ?mountainLabel ?coord
    
    SELECT ?mountain ?mountainLabel ?coord 
    WHERE {
        
        # define ?mountain as all entities that are instances of (P31) mountains (Q8502)
        ?mountain wdt:P31 wd:Q8502 .     
       
        # that are in the administrative territorial entity (P131) 'California' (Q99)
        ?mountain wdt:P131 wd:Q99 .      
        
        # Then for every ?mountain, return data for all coordinate 
        # statements (P625) in the variable ?coord
        ?mountain wdt:P625 ?coord .
        
        # Then for every ?mountain, return data for all labels (rdfs:label)
        # into the variable ?mountainLabel, but filter for only english language labels
        ?mountain rdfs:label ?mountainLabel filter (lang(?mountainLabel) = "en")
    }
    


[Here](http://tinyurl.com/ca-mountain) is the labeled query in the Wikidata web query service.
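
The label pattern follows the same shape for any variable and any language, so it can be generated. The helper below is a hypothetical sketch, not part of any Wikidata library; it only builds the text of the pattern.

```python
def label_pattern(var, lang="en"):
    """Return the SPARQL pattern that binds ?<var>Label to labels in one language."""
    label_var = f"?{var}Label"
    return f'?{var} rdfs:label {label_var} filter (lang({label_var}) = "{lang}")'


print(label_pattern("mountain"))
print(label_pattern("mountain", lang="es"))
```

Swapping `"en"` for another language code such as `"es"` asks for the labels in that language instead.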

## Extend it!

Looking at the [Mount Tam Wikidata page](https://www.wikidata.org/wiki/Q785665), we can see there is a property called "elevation above sea level." How would we extend the SPARQL query above to also return this data?


    # Out of the following query, select the variables: 
    # ?mountain ?mountainLabel ?coord ?elevation
    SELECT ?mountain ?mountainLabel ?coord ?elevation
    
    WHERE {
        
        # Define ?mountain as all entities that are instances of (P31) mountains (Q8502) 
        ?mountain wdt:P31 wd:Q8502 .     
       
        # that are in the administrative territorial entity (P131) 'California' (Q99).
        ?mountain wdt:P131 wd:Q99 .      
        
        # Then for every ?mountain, return data for all coordinate statements (P625)
        # in the variable ?coord
        ?mountain wdt:P625 ?coord .
        
        # Then for every ?mountain, return data for all 
        # 'elevation above sea level' statements (P2044) in the variable ?elevation
        ?mountain wdt:P2044 ?elevation .
        
        # Then for every ?mountain, return data for all labels (rdfs:label)
        # into the variable ?mountainLabel, but filter for only english language labels
        ?mountain rdfs:label ?mountainLabel filter (lang(?mountainLabel) = "en")
    }
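
One thing to keep in mind when extending a query this way: every plain triple pattern is required, so a mountain with no elevation statement drops out of the results entirely. SPARQL's `OPTIONAL` keyword keeps such rows and leaves the variable unbound. The sketch below generates the query text with any extra properties wrapped in `OPTIONAL`; the function name `build_mountain_query` and its parameter are hypothetical.

```python
def build_mountain_query(optional_props):
    """Build the California mountains query with extra, optional properties.

    optional_props maps a variable name to a property id,
    for example {"elevation": "P2044"}.
    """
    variables = ["?mountain", "?mountainLabel", "?coord"]
    variables += [f"?{var}" for var in optional_props]
    lines = [
        "SELECT " + " ".join(variables),
        "WHERE {",
        "    ?mountain wdt:P31 wd:Q8502 .",
        "    ?mountain wdt:P131 wd:Q99 .",
        "    ?mountain wdt:P625 ?coord .",
    ]
    for var, prop in optional_props.items():
        # OPTIONAL keeps mountains that have no statement for this property
        lines.append(f"    OPTIONAL {{ ?mountain wdt:{prop} ?{var} }}")
    lines.append('    ?mountain rdfs:label ?mountainLabel filter (lang(?mountainLabel) = "en")')
    lines.append("}")
    return "\n".join(lines)


print(build_mountain_query({"elevation": "P2044"}))
```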

## Make your own query

For simple queries, it is best to first filter by a fundamental property, like 'instance of' or 'occupation'.

# Querying in Python

In [None]:
%matplotlib inline

import requests

import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from pprint import pprint
infosize = 12

## Define the SPARQL query

SPARQL is a declarative language for querying RDF stores. I don't have much experience with SPARQL myself, but I'll try to explain the query you see below. First we define prefixes, which serve as URL shortcuts pointing to Wikidata's resources. The names following the `SELECT` keyword are the variables that will be retrieved by the query, and every variable is marked by a `?` prefix.

What these variables mean is defined by the triple patterns that follow in the `WHERE` clause. The first triple basically says that `?mountain` stands for every entity that is an [P31](https://www.wikidata.org/wiki/P31) ("instance of") [Q8502](https://www.wikidata.org/wiki/Q8502) (mountain). Note that the `wdt:` prefix comes before predicates (prefixed with a P- in Wikidata URLs) and the `wd:` prefix comes before entities (prefixed with a Q- in Wikidata URLs).

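The next two triples restrict `?mountain` to entities located in California (P131 and Q99) and bind each mountain's 'elevation above sea level' (P2044) to the variable `?elevation`.

The final pattern assigns natural language labels to `?mountain`: it matches every `rdfs:label` into the variable `?mountainLabel` and filters for English-language labels only.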

As I said, I'm by no means a SPARQL expert, so if you can improve my explanation, feel free to edit the [notebook on GitHub](https://github.com/yaph/ipython-notebooks/blob/master/us-presidents-causes-of-death.ipynb) and submit a pull request.

In [None]:
query = '''
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
PREFIX q: <http://www.wikidata.org/prop/qualifier/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?mountain ?mountainLabel ?elevation 
WHERE {
    ?mountain wdt:P31 wd:Q8502 .
    ?mountain wdt:P131 wd:Q99 .
    ?mountain wdt:P2044 ?elevation .
    ?mountain rdfs:label ?mountainLabel filter (lang(?mountainLabel) = "en")
}
'''

## Get and process the data

Next we send an HTTP request to the SPARQL endpoint, providing the query as a URL parameter. We also specify that we want the result encoded as JSON rather than the default XML. Thanks to the [requests library](http://docs.python-requests.org/en/latest/) this is practically self-explanatory code.

Now we iterate through the result, creating a list of dictionaries, each of which contains values for the query variables defined above. Then we create a [Pandas DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) from this list, print its length and the first few rows.

In [None]:
url = 'https://query.wikidata.org/bigdata/namespace/wdq/sparql'
data = requests.get(url, params={'query': query, 'format': 'json'}).json()
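
For comparison, here is a rough standard-library sketch of the same request that also sets a descriptive `User-Agent` header, which Wikimedia asks automated clients to send. The function names and the user agent string are placeholders invented for this example; `build_request` only constructs the request, and nothing is sent until `run_query` is called.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"


def build_request(query, user_agent="example-notebook/0.1 (placeholder contact)"):
    """Build (but do not send) a GET request for a SPARQL query."""
    url = ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
    headers = {
        "User-Agent": user_agent,  # placeholder; replace with your own details
        "Accept": "application/sparql-results+json",
    }
    return Request(url, headers=headers)


def run_query(query, timeout=60):
    """Send the query and decode the JSON response; raises on HTTP errors."""
    with urlopen(build_request(query), timeout=timeout) as response:
        return json.load(response)


req = build_request("SELECT * WHERE { ?s ?p ?o } LIMIT 1")
print(req.full_url)
```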

In [None]:
data
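
The response follows the standard SPARQL JSON results layout: a `head` listing the variable names, and `results.bindings` holding one dictionary per row, where each variable maps to a small dictionary with a `type` and a `value`. The sample below is hand-written to show that shape; the elevation number is made up, not the real figure, and the `flatten` helper is hypothetical.

```python
sample = {
    "head": {"vars": ["mountain", "mountainLabel", "elevation"]},
    "results": {
        "bindings": [
            {
                "mountain": {
                    "type": "uri",
                    "value": "http://www.wikidata.org/entity/Q785665",
                },
                "mountainLabel": {
                    "type": "literal",
                    "xml:lang": "en",
                    "value": "Mount Tamalpais",
                },
                "elevation": {
                    "type": "literal",
                    "datatype": "http://www.w3.org/2001/XMLSchema#decimal",
                    "value": "100",  # made-up value, not the real elevation
                },
            }
        ]
    },
}


def flatten(result):
    """Reduce each row to a plain {variable: value} dictionary."""
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in result["results"]["bindings"]
    ]


rows = flatten(sample)
print(rows[0]["mountainLabel"])
# The Q-identifier is the last path segment of the entity URI.
print(rows[0]["mountain"].rsplit("/", 1)[-1])
```

Note that every value arrives as a string, including numbers such as the elevation.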

In [None]:
mountains = []
for item in data['results']['bindings']:
    # Each binding maps a query variable to a dict; the actual value is under 'value'
    mountains.append({
        'name': item['mountainLabel']['value'],
        'elevation': item['elevation']['value']
    })


In [None]:

df = pd.DataFrame(mountains)
print(len(df))
df.head()

Let's also see the data types of the columns.

In [None]:
df.dtypes
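
Because the JSON response delivers every value as a string, the elevation column will not be numeric until it is converted. The sketch below does the conversion in plain Python on made-up rows, with a hypothetical `to_float` helper; in pandas, `pd.to_numeric(df['elevation'], errors='coerce')` should give a comparable result on the DataFrame itself.

```python
def to_float(text):
    """Convert a string to float, returning None when it cannot be parsed."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


# Made-up rows in the same shape as the list built from the query result
rows = [
    {"name": "Example Peak A", "elevation": "100"},
    {"name": "Example Peak B", "elevation": "not a number"},
]

cleaned = [{**row, "elevation": to_float(row["elevation"])} for row in rows]
print(cleaned)
```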