# Building a dataset for retrieval question answering from Wikidata

In theory, Wikidata can provide an excellent dataset for retrieval question answering (QA). The workflow is the following:
+ For a given set of predicates (I elaborate below) ...
+ That have string values ...
+ With references such as a URL

... we can collect the data in the form
+ Question: generated synthetically from the predicate
+ Answer: the predicate value
+ Source text: the content of the referenced webpage
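
The record described above could be represented roughly as follows. This is a minimal sketch: the `QAExample` class, the `QUESTION_TEMPLATES` mapping, the `make_example` helper, and the sample values are all hypothetical and only illustrate the shape of the data.

```python
from dataclasses import dataclass

# Hypothetical question templates keyed by Wikidata property ID.
# P571 ("inception") is the property queried later in this notebook.
QUESTION_TEMPLATES = {
    "P571": "When was {label} founded?",
}


@dataclass
class QAExample:
    question: str      # synthetic question generated from the predicate
    answer: str        # predicate value
    source_url: str    # reference URL attached to the statement
    source_text: str   # text content of the referenced webpage


def make_example(label, prop, value, source_url, source_text):
    """Build one QA record from an item label, a property and its value."""
    question = QUESTION_TEMPLATES[prop].format(label=label)
    return QAExample(question, value, source_url, source_text)


# Placeholder values for illustration only, not real Wikidata content.
example = make_example(
    label="Example Org",
    prop="P571",
    value="1900",
    source_url="https://example.org/about",
    source_text="Example Org was established in 1900.",
)
print(example.question)
```

Keeping the template per property makes it easy to add more predicates later without touching the record structure.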

The search can be performed simply on the referenced page, among all mentions of the answer, or in a broader space of mentions.
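
The simplest variant, searching a single page for mentions of the answer, might look like the sketch below. The `find_mentions` helper and its `window` parameter are assumptions made for illustration; exact string matching is only a baseline and will miss paraphrased or reformatted values (for example, dates written in another format).

```python
import re


def find_mentions(text, answer, window=80):
    """Return a passage around each case-insensitive mention of `answer`.

    `window` is the number of characters of context kept on each side.
    """
    passages = []
    for match in re.finditer(re.escape(answer), text, flags=re.IGNORECASE):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        passages.append(text[start:end].strip())
    return passages


# Placeholder page text for illustration only.
text = "The mine was founded in 1873. Production peaked long after 1873."
passages = find_mentions(text, "1873", window=20)
for passage in passages:
    print(passage)
```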

***Correctly retrieved statements (passages), once confirmed by a human reviewer, can further populate the reference with a matching quote:***  
https://www.wikidata.org/wiki/Property:P1683 

### Properties
Wikidata maintains a list of properties with the desired qualities (string datatype, not an external identifier):
+ https://www.wikidata.org/wiki/Wikidata:List_of_properties/Wikidata_property_with_datatype_string_that_is_not_an_external_identifier
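
Once a property has been picked from such a list, a query for its referenced statements could be generated from a template. The sketch below mirrors the shape of the query used later in this notebook; the `build_query` helper and its parameters are hypothetical, and the restriction to instances of `wd:Q43229` is simply carried over from that query.

```python
import re
from string import Template

# SPARQL template; ${prop} and ${limit} are filled in by build_query.
_QUERY = Template("""
SELECT ?item ?itemLabel ?val ?ref
WHERE
{
  ?item wdt:P31 wd:Q43229 .
  ?item p:${prop} [
    prov:wasDerivedFrom [ pr:P854 ?ref ] ;
    ps:${prop} ?val
  ].
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT ${limit}
""")


def build_query(prop, limit=20):
    """Return a SPARQL query for statements of `prop` that have a reference URL."""
    # Accept only well-formed property IDs such as "P571".
    if not re.fullmatch(r"P[1-9]\d*", prop):
        raise ValueError(f"not a Wikidata property ID: {prop!r}")
    return _QUERY.substitute(prop=prop, limit=int(limit))


print(build_query("P571", limit=5))
```

Validating the property ID before substitution keeps arbitrary text out of the query string.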

In [1]:
from pprint import pprint
from SPARQLWrapper import SPARQLWrapper, JSON

In [2]:
# Wikidata Query Service SPARQL endpoint (HTTPS)
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setReturnFormat(JSON)

In [6]:
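query = """
SELECT ?company ?companyLabel ?val ?ref 
WHERE
{
  # Items that are an instance of (P31) organization (Q43229)
  ?company wdt:P31 wd:Q43229 .
  # Inception (P571) statements that carry a reference URL (P854)
  ?company p:P571 [ 
    prov:wasDerivedFrom [ pr:P854 ?ref ] ;
    ps:P571 ?val
  ].
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,ru". }
} LIMIT 20
"""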

In [7]:
sparql.setQuery(query)

results = sparql.query().convert()
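
The JSON bindings returned by the endpoint are nested (each variable maps to a dict with `type` and `value`). A small helper can flatten them into plain rows before building QA records. This is a sketch; `flatten_bindings` is a hypothetical helper, and the sample row copies the first binding from the output of this query.

```python
def flatten_bindings(bindings):
    """Turn SPARQL JSON bindings into a list of {variable: value} dicts."""
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in bindings
    ]


# One binding in the shape returned under results["results"]["bindings"].
sample_bindings = [
    {
        "company": {"type": "uri",
                    "value": "http://www.wikidata.org/entity/Q169747"},
        "companyLabel": {"type": "literal",
                         "value": "Dahlbusch Verwaltungs-AG",
                         "xml:lang": "en"},
        "ref": {"type": "uri",
                "value": "https://www.gelsenkirchener-geschichten.de/wiki/Zeche_Dahlbusch"},
        "val": {"datatype": "http://www.w3.org/2001/XMLSchema#dateTime",
                "type": "literal",
                "value": "1873-01-01T00:00:00Z"},
    }
]

rows = flatten_bindings(sample_bindings)
print(rows[0]["companyLabel"], rows[0]["val"], rows[0]["ref"])
```

In the notebook itself, the same helper would be applied to `results["results"]["bindings"]`.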

In [8]:
pprint(results["results"]["bindings"])

[{'company': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q169747'},
  'companyLabel': {'type': 'literal',
                   'value': 'Dahlbusch Verwaltungs-AG',
                   'xml:lang': 'en'},
  'ref': {'type': 'uri',
          'value': 'https://www.gelsenkirchener-geschichten.de/wiki/Zeche_Dahlbusch'},
  'val': {'datatype': 'http://www.w3.org/2001/XMLSchema#dateTime',
          'type': 'literal',
          'value': '1873-01-01T00:00:00Z'}},
 {'company': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q160112'},
  'companyLabel': {'type': 'literal',
                   'value': 'Museo del Prado',
                   'xml:lang': 'en'},
  'ref': {'type': 'uri',
          'value': 'https://www.museodelprado.es/museo/historia-del-museo'},
  'val': {'datatype': 'http://www.w3.org/2001/XMLSchema#dateTime',
          'type': 'literal',
          'value': '1819-01-01T00:00:00Z'}},
 {'company': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q183629'},
  'com