This notebook adds items for the countries of the world to a Wikibase instance. It uses Wikidata as its source, which can be somewhat problematic. The main thing we are establishing here is the set of ISO 3166-1 codes (2-letter alpha, 3-letter alpha, and 3-digit numeric) used in other data sources we want to integrate into this Wikibase instance. The ISO 3166 standard itself does not offer the codes in a form that code can readily pull from, so we have to rely on some other open, public source.

I'm experimenting further here with the idea of encoding every data source as an item in the knowledgebase itself so it can be referenced from item claims. In this case, I added a new property for "query string" that contains the full URL for the Wikidata SPARQL endpoint, including the query itself. This is parsed into an endpoint and a query, which Python then uses to retrieve records for processing. There's likely more that could be encoded to drive this process, such as the class of item these records are going into, but I'll leave some of that off for the time being.
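
The parsing step can be sketched with the standard library alone. This is a minimal sketch, not the `wbmaker` implementation: `split_sparql_url` is a hypothetical helper, and the example URL carries a short illustrative query rather than the one stored on the data source item.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit


def split_sparql_url(url):
    """Split a full SPARQL GET URL into (endpoint, query_text).

    Hypothetical helper for illustration; assumes the query is passed
    in a URL parameter named "query".
    """
    parts = urlsplit(url)
    # parse_qs percent-decodes the parameter values for us
    params = parse_qs(parts.query)
    query_text = params["query"][0]
    # Rebuild the URL without the query string or fragment
    endpoint = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return endpoint, query_text


# Illustrative URL only (not the stored query string)
example_url = (
    "https://query.wikidata.org/sparql?query="
    "SELECT%20%3Fcountry%20WHERE%20%7B%20%3Fcountry%20wdt%3AP297%20%3Fcode%20%7D"
)
endpoint, query_text = split_sparql_url(example_url)
print(endpoint)
print(query_text)
```

Storing the whole URL in one property keeps the data source item simple, at the cost of needing this split before the query can be sent to the endpoint.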

In [1]:
from wbmaker import WikibaseConnection

from wikibaseintegrator import models, datatypes

In [2]:
eew = WikibaseConnection('eew')

In [3]:
properties = eew.properties()
classes = eew.classification()
datasources = eew.datasources()

In [4]:
display(properties)
display(classes)
display(datasources)

{'instance of': 'P1',
 'subclass of': 'P2',
 'related wikidata property': 'P3',
 'related wikidata item': 'P4',
 'formatter URL': 'P5',
 'data source': 'P6',
 'data format': 'P7',
 'item of this property': 'P8',
 'equivalent property': 'P9',
 'reference URL': 'P10',
 'coordinate location': 'P11',
 'geographical location': 'P12',
 'country': 'P13',
 'U.S. state': 'P14',
 'U.S. county': 'P15',
 'municipality': 'P16',
 'ISO 3166-1 numeric code': 'P17',
 'ISO 3166-1 alpha-2 code': 'P18',
 'ISO 3166-1 alpha-3 code': 'P19',
 'ISO 3166-2 code': 'P20',
 'FIPS 6-4': 'P21',
 'FIPS 5-2 numeric': 'P22',
 'FIPS 5-2 alpha': 'P23',
 'FIPS 55-3': 'P24',
 'GNIS ID': 'P25',
 'NAICS Code': 'P26',
 'SIC code': 'P28',
 'query string': 'P29',
 'caveat': 'P30'}

{'property': 'Q2',
 'spatio-temporal entity': 'Q3',
 'location': 'Q4',
 'object': 'Q8',
 'temporal entity': 'Q5',
 'spatial entity': 'Q6',
 'human activity': 'Q7',
 'country': 'Q14',
 'artificial entity': 'Q9',
 'work': 'Q10',
 'dataset': 'Q11'}

Unnamed: 0,ds,dsLabel,query_string
0,https://eew-edgi.wikibase.cloud/entity/Q13,Wikidata listing of world countries,https://query.wikidata.org/sparql?query=SELECT...


In [5]:
country_datasource = datasources[datasources.dsLabel == 'Wikidata listing of world countries'].iloc[0]

sparql_endpoint, sparql_query = eew.parse_sparql_url(country_datasource.query_string)

wd_countries = eew.sparql_query(
    endpoint=sparql_endpoint,
    query=sparql_query,
    output='dataframe'
)

### Check for duplicates

I recorded a caveat on the data source query string indicating that the results should be checked for duplicate identifiers. Since our main purpose here is to establish country items we can link to from other data, this is the main thing we need to check about what we are bringing in from Wikidata. The check shows two Wikidata items carrying the same codes for the Netherlands: an alternate record was created at some point that conflicts with the earlier one. Because the query service orders results on the "country" QID by default, the earlier record comes first and we can drop the later one. Something more robust would probably be wise, but this is good enough for now.

In [6]:
ids = wd_countries['iso3166_alpha2']
wd_countries[ids.isin(ids[ids.duplicated()])].sort_values('iso3166_alpha2')

Unnamed: 0,country,countryLabel,countryDescription,countryAltLabel,iso3166_alpha2,iso3166_alpha3,iso3166_num
4,http://www.wikidata.org/entity/Q55,Netherlands,country in Northwestern Europe with territorie...,"NL, Holland, Nederland, NED, nl, the Netherlan...",NL,NLD,528
47,http://www.wikidata.org/entity/Q29999,Kingdom of the Netherlands,sovereign state and constitutional monarchy,"NL, Netherlands, the Netherlands, 🇳🇱",NL,NLD,528


In [7]:
# Keep the first record for each alpha-2 code and drop any later duplicates
wd_countries.drop_duplicates(subset='iso3166_alpha2', keep="first", inplace=True)
wd_countries[wd_countries.iso3166_alpha2 == 'NL']

Unnamed: 0,country,countryLabel,countryDescription,countryAltLabel,iso3166_alpha2,iso3166_alpha3,iso3166_num
4,http://www.wikidata.org/entity/Q55,Netherlands,country in Northwestern Europe with territorie...,"NL, Holland, Nederland, NED, nl, the Netherlan...",NL,NLD,528
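
The drop above depends on the order in which the query service returns rows. One way to make it more robust is to sort on the numeric part of the QID before dropping. The following is a sketch under stated assumptions: `drop_duplicate_codes` is a hypothetical helper, the sample frame is made up from the two rows shown above plus one more, and "lowest QID wins" is only a stand-in for "earlier record". Whether the lower QID is the right item to keep remains a judgment call.

```python
import pandas as pd


def drop_duplicate_codes(df, code_col="iso3166_alpha2", uri_col="country"):
    """Keep, for each code, the row whose Wikidata QID number is lowest.

    Hypothetical helper; assumes every value in uri_col ends in a QID.
    """
    # Pull the digits after the final "Q" out of each entity URI
    qid_number = df[uri_col].str.extract(r"Q(\d+)$", expand=False).astype(int)
    ordered = df.assign(_qid_number=qid_number).sort_values(
        "_qid_number", kind="stable"
    )
    deduped = ordered.drop_duplicates(subset=code_col, keep="first")
    # Drop the helper column and restore the original row order
    return deduped.drop(columns="_qid_number").sort_index()


# Made-up sample with the later record deliberately listed first
sample = pd.DataFrame(
    {
        "country": [
            "http://www.wikidata.org/entity/Q29999",
            "http://www.wikidata.org/entity/Q55",
            "http://www.wikidata.org/entity/Q17",
        ],
        "countryLabel": ["Kingdom of the Netherlands", "Netherlands", "Japan"],
        "iso3166_alpha2": ["NL", "NL", "JP"],
    }
)
result = drop_duplicate_codes(sample)
print(result)
```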


### Processing workflow

I can and maybe should abstract the WikibaseIntegrator functionality into the Python class I'm building for this work. I could do something like simply present the data structure to be used in building items. However, there are a number of things that need to be worked out in terms of hardening the process to that point. For now, the key decisions are pretty straightforward when worked directly against built-in WBI functionality.
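
As a rough idea of what "presenting the data structure" could look like, the sketch below describes one country item as plain data, leaving the WikibaseIntegrator calls to a separate step. Everything here is an assumption for illustration: `country_item_spec` and the dictionary layout are hypothetical and not part of `wbmaker` or WikibaseIntegrator. The property and class IDs in the example are copied from the listings displayed earlier.

```python
def country_item_spec(row, properties, classes, source_qid):
    """Describe one country item as plain data (hypothetical layout)."""
    claims = [
        {
            "datatype": "item",
            "prop_nr": properties["instance of"],
            "value": classes["country"],
        }
    ]
    # One external-id claim per ISO 3166-1 code column
    code_columns = {
        "ISO 3166-1 alpha-2 code": "iso3166_alpha2",
        "ISO 3166-1 alpha-3 code": "iso3166_alpha3",
        "ISO 3166-1 numeric code": "iso3166_num",
    }
    for prop_label, column in code_columns.items():
        claims.append(
            {
                "datatype": "external-id",
                "prop_nr": properties[prop_label],
                "value": row[column],
            }
        )
    return {
        "label": row["countryLabel"],
        "description": row["countryDescription"],
        "claims": claims,
        # Every claim would cite the same data source item
        "reference": {"prop_nr": properties["data source"], "value": source_qid},
    }


# Example values taken from the outputs shown earlier in the notebook
example_properties = {
    "instance of": "P1",
    "data source": "P6",
    "ISO 3166-1 numeric code": "P17",
    "ISO 3166-1 alpha-2 code": "P18",
    "ISO 3166-1 alpha-3 code": "P19",
}
example_classes = {"country": "Q14"}
example_row = {
    "countryLabel": "Netherlands",
    "countryDescription": "country in Northwestern Europe",
    "iso3166_alpha2": "NL",
    "iso3166_alpha3": "NLD",
    "iso3166_num": "528",
}
spec = country_item_spec(example_row, example_properties, example_classes, "Q13")
print(spec)
```

A structure like this can be inspected or tested before anything is written to the Wikibase, which is part of the hardening mentioned above.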

In [10]:
wd_references = models.References()
wd_reference = datatypes.Item(
    prop_nr=properties['data source'],
    value=country_datasource.ds.split('/')[-1]
)
wd_references.add(wd_reference)

for index, row in wd_countries.iterrows():
    print("PROCESSING:", row.countryLabel)

    item = eew.wbi.item.new()
    
    # Set label and description
    item.labels.set('en', row.countryLabel)
    item.descriptions.set('en', row.countryDescription)

    # Basic classification
    item.claims.add(
        datatypes.Item(
            prop_nr=properties['instance of'],
            value=classes['country'],
            references=wd_references
        )
    )

    # Identifier claims
    item.claims.add(
        datatypes.ExternalID(
            prop_nr=properties['ISO 3166-1 alpha-2 code'],
            value=row.iso3166_alpha2,
            references=wd_references
        )
    )

    item.claims.add(
        datatypes.ExternalID(
            prop_nr=properties['ISO 3166-1 alpha-3 code'],
            value=row.iso3166_alpha3,
            references=wd_references
        )
    )

    item.claims.add(
        datatypes.ExternalID(
            prop_nr=properties['ISO 3166-1 numeric code'],
            value=row.iso3166_num,
            references=wd_references
        )
    )

    # Link to related wikidata item
    wd_link_qualifiers = models.Qualifiers()
    wd_link_caveat = datatypes.String(
        prop_nr=properties['caveat'],
        value='Some statements about countries in Wikidata may be inaccurate or invalid and should be used with caution.'
    )
    wd_link_qualifiers.add(wd_link_caveat)

    item.claims.add(
        datatypes.ExternalID(
            prop_nr=properties['related wikidata item'],
            value=row.country.split("/")[-1],
            qualifiers=wd_link_qualifiers
        )
    )

    item.write(summary="Establishing initial country record with claims from corresponding Wikidata item")

PROCESSING: Japan
PROCESSING: Republic of Ireland
PROCESSING: United States of America
PROCESSING: Italy
PROCESSING: Netherlands
PROCESSING: Uruguay
PROCESSING: Egypt
PROCESSING: Ethiopia
PROCESSING: Ghana
PROCESSING: Andorra
PROCESSING: Cyprus
PROCESSING: Kazakhstan
PROCESSING: Uzbekistan
PROCESSING: Australia
PROCESSING: Chad
PROCESSING: Samoa
PROCESSING: Fiji
PROCESSING: Paraguay
PROCESSING: Guyana
PROCESSING: Ecuador
PROCESSING: Jamaica
PROCESSING: Haiti
PROCESSING: Iran
PROCESSING: Yemen
PROCESSING: Kuwait
PROCESSING: Maldives
PROCESSING: Nepal
PROCESSING: Oman
PROCESSING: Sri Lanka
PROCESSING: Taiwan
PROCESSING: Turkmenistan
PROCESSING: Tanzania
PROCESSING: Central African Republic
PROCESSING: Zimbabwe
PROCESSING: Botswana
PROCESSING: Burkina Faso
PROCESSING: Republic of the Congo
PROCESSING: Djibouti


Service unavailable (HTTP Code 502). Sleeping for 60 seconds.


PROCESSING: Eritrea
PROCESSING: Guinea
PROCESSING: Cameroon
PROCESSING: Madagascar
PROCESSING: Malawi
PROCESSING: Western Sahara
PROCESSING: Northern Mariana Islands
PROCESSING: Martinique
PROCESSING: Sint Maarten
PROCESSING: New Caledonia
PROCESSING: Saint Martin
PROCESSING: Heard Island and McDonald Islands
PROCESSING: Canada
PROCESSING: Norway
PROCESSING: Hungary
PROCESSING: Belgium
PROCESSING: Luxembourg
PROCESSING: Finland
PROCESSING: Switzerland
PROCESSING: Austria
PROCESSING: Greece
PROCESSING: Turkey
PROCESSING: Portugal
PROCESSING: Kenya
PROCESSING: France
PROCESSING: People's Republic of China
PROCESSING: Brazil


Service unavailable (HTTP Code 502). Sleeping for 60 seconds.


PROCESSING: Russia
PROCESSING: Germany
PROCESSING: Iceland
PROCESSING: Estonia
PROCESSING: Slovenia
PROCESSING: Romania
PROCESSING: Bulgaria
PROCESSING: North Macedonia
PROCESSING: Albania
PROCESSING: Greenland
PROCESSING: Bosnia and Herzegovina
PROCESSING: Malta
PROCESSING: Monaco
PROCESSING: Montenegro
PROCESSING: Vatican City
PROCESSING: Indonesia
PROCESSING: South Africa
PROCESSING: Algeria
PROCESSING: Chile
PROCESSING: Singapore
PROCESSING: Bahrain
PROCESSING: Argentina
PROCESSING: North Korea
PROCESSING: New Zealand
PROCESSING: Tuvalu
PROCESSING: Tonga
PROCESSING: Solomon Islands
PROCESSING: Vanuatu
PROCESSING: Papua New Guinea
PROCESSING: Nauru
PROCESSING: Marshall Islands
PROCESSING: Kiribati
PROCESSING: Mongolia
PROCESSING: Venezuela
PROCESSING: Saint Vincent and the Grenadines
PROCESSING: Saint Lucia
PROCESSING: Grenada
PROCESSING: Dominica
PROCESSING: Costa Rica
PROCESSING: Israel
PROCESSING: Panama
PROCESSING: Laos
PROCESSING: Lebanon
PROCESSING: Syria
PROCESSING: Tajikista