## Introduction to Databases

### Using SPARQL

Based on [this blog post](https://rebeccabilbro.github.io/sparql-from-python/), [the Wikidata SPARQL tutorial](https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial), and [this blog post](https://towardsdatascience.com/where-do-mayors-come-from-querying-wikidata-with-python-and-sparql-91f3c0af22e2).  

In [17]:
import os
import sys
import time
import datetime
import numpy as np
import pandas as pd

from SPARQLWrapper import SPARQLWrapper, JSON

SPARQL (“SPARQL Protocol And RDF Query Language”) is a W3C standard for querying RDF and can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware.  

RDF represents data as triples of the form [Subject - Predicate - Object](https://en.wikipedia.org/wiki/Semantic_triple).
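As a rough illustration, a triple can be modeled in Python as a three-field tuple. The `Triple` class and the sample facts below are assumptions made for this sketch; they are not part of any RDF library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A hypothetical, minimal stand-in for an RDF triple."""
    subject: str
    predicate: str
    obj: str  # "object" is a Python builtin name, so the field is called obj


# Illustrative facts only; real RDF data would use URIs and typed literals
facts = [
    Triple("Canada", "capital", "Ottawa"),
    Triple("Ottawa", "country", "Canada"),
]

for fact in facts:
    print(f"{fact.subject} --{fact.predicate}--> {fact.obj}")
```

Real RDF stores identify subjects and predicates by URI rather than by plain strings, which is what the PREFIX declarations in a query abbreviate.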

SPARQL contains capabilities for querying required and optional graph patterns along with their conjunctions and disjunctions.  
SPARQL also supports extensible value testing and constraining queries by source RDF graph.  
The results of SPARQL queries can be result sets or RDF graphs.

SPARQL allows us to express queries as three-part statements:


> PREFIX ... # identifies & nicknames namespace URIs of desired variables  
> SELECT ... # lists variables to be returned (start with a ?)  
> WHERE  ... # contains restrictions on variables expressed as triples  

#### SPARQL basics

A simple SPARQL query looks like this:

> SELECT ?a ?b ?c  
> WHERE  
> {  
>   x y ?a.  
>   m n ?b.  
>   ?b f ?c.  
> }  

The SELECT clause lists variables that you want returned (variables start with a question mark), and the WHERE clause contains restrictions on them, mostly in the form of triples. A triple can be read like a sentence (which is why it ends with a period), with a subject, a predicate, and an object:   

> SELECT ?fruit  
> WHERE  
> {  
>   ?fruit color yellow.  
>   ?fruit taste sour.  
> }  

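This query assumes that `color` and `taste` are properties used in some triples of the dataset. It returns every `?fruit` that appears both in a triple with color yellow and in a triple with taste sour.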
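To make the matching idea concrete, here is a minimal sketch of how triple patterns with variables can be evaluated against a small in-memory list of triples. This is a toy illustration, not how a SPARQL engine is actually implemented; the function names `match_pattern` and `select` and the fruit data are invented for this example.

```python
def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            # A constant must match the triple exactly
            return None
    return extended


def select(variables, patterns, triples):
    """Return one tuple of values per solution that satisfies all patterns."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return [tuple(b[v] for v in variables) for b in bindings]


# Made-up data for the fruit query above
TRIPLES = [
    ("lemon", "color", "yellow"),
    ("lemon", "taste", "sour"),
    ("banana", "color", "yellow"),
    ("banana", "taste", "sweet"),
    ("lime", "color", "green"),
    ("lime", "taste", "sour"),
]

fruits = select(
    ["?fruit"],
    [("?fruit", "color", "yellow"), ("?fruit", "taste", "sour")],
    TRIPLES,
)
print(fruits)  # [('lemon',)]
```

Each pattern narrows the set of candidate bindings, so only `lemon`, which satisfies both patterns, survives.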

### First Example: Query DBpedia

In [3]:
# Specify the DBpedia endpoint
sparql = SPARQLWrapper("http://dbpedia.org/sparql")

In [4]:
# Query for the description of "Capsaicin", filtered by language

sparql.setQuery(
    """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?comment
    WHERE { <http://dbpedia.org/resource/Capsaicin> rdfs:comment ?comment
    FILTER (LANG(?comment)='en')
    } 
    """
)

In [5]:
# Request JSON from the endpoint, run the query, and parse the response into a Python dict
sparql.setReturnFormat(JSON)
result = sparql.query().convert()

In [8]:
# The returned dict has two top-level keys; the rows are in result["results"]["bindings"] (a list of dictionaries)
print(result.keys())

dict_keys(['head', 'results'])


In [10]:
result

{'head': {'link': [], 'vars': ['comment']},
 'results': {'distinct': False,
  'ordered': True,
  'bindings': [{'comment': {'type': 'literal',
     'xml:lang': 'en',
     'value': 'Capsaicin (/kæpˈseɪ.ᵻsɪn/ (INN); 8-methyl-N-vanillyl-6-nonenamide) is an active component of chili peppers, which are plants belonging to the genus Capsicum. It is an irritant for mammals, including humans, and produces a sensation of burning in any tissue with which it comes into contact. Capsaicin and several related compounds are called capsaicinoids and are produced as secondary metabolites by chili peppers, probably as deterrents against certain mammals and fungi. Pure capsaicin is a volatile, hydrophobic, colorless, odorless, crystalline to waxy compound.'}}]}}

In [6]:
for hit in result["results"]["bindings"]:
    # We want the "value" attribute of the "comment" field
    print(hit["comment"]["value"])

Capsaicin (/kæpˈseɪ.ᵻsɪn/ (INN); 8-methyl-N-vanillyl-6-nonenamide) is an active component of chili peppers, which are plants belonging to the genus Capsicum. It is an irritant for mammals, including humans, and produces a sensation of burning in any tissue with which it comes into contact. Capsaicin and several related compounds are called capsaicinoids and are produced as secondary metabolites by chili peppers, probably as deterrents against certain mammals and fungi. Pure capsaicin is a volatile, hydrophobic, colorless, odorless, crystalline to waxy compound.
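The loop above can be generalized into a small helper that flattens any SELECT result into a list of plain dictionaries. This is a sketch: the name `bindings_to_rows` is made up here, and the `sample` dict is a shortened stand-in for the real response so the code runs without network access.

```python
def bindings_to_rows(result):
    """Flatten a SPARQL JSON result into a list of {variable: value} dicts."""
    variables = result["head"]["vars"]
    rows = []
    for binding in result["results"]["bindings"]:
        # Unbound variables are absent from a binding, so default to None
        rows.append({
            var: binding[var]["value"] if var in binding else None
            for var in variables
        })
    return rows


# Shortened stand-in with the same shape as the DBpedia response above
sample = {
    "head": {"link": [], "vars": ["comment"]},
    "results": {
        "bindings": [
            {
                "comment": {
                    "type": "literal",
                    "xml:lang": "en",
                    "value": "Capsaicin is an active component of chili peppers.",
                }
            }
        ]
    },
}

rows = bindings_to_rows(sample)
print(rows[0]["comment"])
```

Reading the variable names from `head.vars` keeps the helper independent of any particular query.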


### Second Example: [Querying Wikidata](https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial)

We can also use the Wikidata Query Service (WDQS) endpoint to query Wikidata.  
Wikidata is a knowledge database. It contains millions of statements, such as “the capital of Canada is Ottawa”, or “the Mona Lisa is painted in oil paint on poplar wood”, or “gold has a melting point of 1,064.18 degrees Celsius”.  

Let’s say we want to continue our research into spicy things by searching for information about hot sauces in Wikidata. The first step is to find the unique identifier that Wikidata uses to reference “hot sauce”, which we can do by searching on Wikidata. It turns out to be “Q522171”. This identifier names an item (an “entity”), so in a query it is written with the “wd” prefix.  

To get back all of the kinds of hot sauce cataloged in Wikidata, we query for items whose “subclass of” property, encoded as “P279” in Wikidata, points to hot sauce. Because we only need the simple value of that property, we use the direct-property prefix, “wdt”.  

NOTE: For simple WDQS triples, items should be prefixed with wd:, and properties with wdt:. We don’t need to explicitly alias any prefixes in this case because WDQS already knows many shortcut abbreviations commonly used externally (e.g. rdf, skos, owl, schema, etc.) as well as ones internal to Wikidata, such as:  
    
> PREFIX wd: <http://www.wikidata.org/entity/>  
> PREFIX wds: <http://www.wikidata.org/entity/statement/>  
> PREFIX wdv: <http://www.wikidata.org/value/>  
> PREFIX wdt: <http://www.wikidata.org/prop/direct/>  
> PREFIX wikibase: <http://wikiba.se/ontology#>  
> PREFIX p: <http://www.wikidata.org/prop/>  
> PREFIX ps: <http://www.wikidata.org/prop/statement/>  
> PREFIX pq: <http://www.wikidata.org/prop/qualifier/>  
> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>  
> PREFIX bd: <http://www.bigdata.com/rdf#>  
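A prefix is only shorthand for the start of a full URI. The sketch below shows the expansion for two of the prefixes listed above; the helper name `expand` and the `PREFIXES` dict are assumptions made for illustration and are not part of SPARQLWrapper.

```python
# Two of the WDQS prefixes listed above
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "wdt": "http://www.wikidata.org/prop/direct/",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'wd:Q522171' into its full URI."""
    prefix, _, local_name = prefixed_name.partition(":")
    return prefixes[prefix] + local_name


print(expand("wd:Q522171"))  # http://www.wikidata.org/entity/Q522171
print(expand("wdt:P279"))    # http://www.wikidata.org/prop/direct/P279
```

The expanded item URI has the same form as the `item` values that the endpoint returns in its results.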

In [11]:
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

# Below we SELECT both the hot sauce items and their labels.
# The SERVICE wikibase:label clause in WHERE fills in ?itemLabel for each ?item.

sparql.setQuery("""
SELECT ?item ?itemLabel
WHERE {?item wdt:P279 wd:Q522171.
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
""")

sparql.setReturnFormat(JSON)
results = sparql.query().convert()

In [12]:
results

{'head': {'vars': ['item', 'itemLabel']},
 'results': {'bindings': [{'item': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q249114'},
    'itemLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'salsa'}},
   {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q335016'},
    'itemLabel': {'xml:lang': 'en',
     'type': 'literal',
     'value': 'Tabasco sauce'}},
   {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q360459'},
    'itemLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'Adobo'}},
   {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q460439'},
    'itemLabel': {'xml:lang': 'en',
     'type': 'literal',
     'value': "Blair's 16 Million Reserve"}},
   {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q966327'},
    'itemLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'harissa'}},
   {'item': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q1026822'},
    'itemLabel': {

In [16]:
results_df = pd.json_normalize(results['results']['bindings'])
results_df[['item.value', 'itemLabel.value']]

|   | item.value | itemLabel.value |
|---|---|---|
| 0 | http://www.wikidata.org/entity/Q249114 | salsa |
| 1 | http://www.wikidata.org/entity/Q335016 | Tabasco sauce |
| 2 | http://www.wikidata.org/entity/Q360459 | Adobo |
| 3 | http://www.wikidata.org/entity/Q460439 | Blair's 16 Million Reserve |
| 4 | http://www.wikidata.org/entity/Q966327 | harissa |
| 5 | http://www.wikidata.org/entity/Q1026822 | Chili oil |
| 6 | http://www.wikidata.org/entity/Q1392674 | sriracha sauce |
| 7 | http://www.wikidata.org/entity/Q2227032 | mojo |
| 8 | http://www.wikidata.org/entity/Q2279518 | Shito |
| 9 | http://www.wikidata.org/entity/Q2402909 | Valentina |
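The `item.value` column holds full entity URIs. If only the short identifier is needed, for example to reuse it with the `wd:` prefix in a follow-up query, it can be taken from the end of the URI. This is a minimal sketch in plain Python; the helper name `qid_from_uri` is hypothetical and the sample URIs are copied from the output above.

```python
def qid_from_uri(uri):
    """Return the last path segment of a Wikidata entity URI, e.g. 'Q249114'."""
    return uri.rsplit("/", 1)[-1]


# The first three URIs from the results above
uris = [
    "http://www.wikidata.org/entity/Q249114",
    "http://www.wikidata.org/entity/Q335016",
    "http://www.wikidata.org/entity/Q360459",
]

qids = [qid_from_uri(uri) for uri in uris]
print(qids)  # ['Q249114', 'Q335016', 'Q360459']
```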


### A complete query example with Wikidata