The actual build-out process for ecoregion items is going to be somewhat involved. I've started the process of getting a new [bot user account](https://www.wikidata.org/wiki/User:BristolBot) approved for running bot-based processes and submitted an [approval request](https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/BristolBot). I've also submitted proposals for new external identifiers to handle the simple ID scheme for each of the 5 classification systems represented in the EPA ecoregion data.

Identifier management seems like kind of a mess in Wikidata right now. There are hundreds of properties used as different kinds of internal and external identifiers. There's a nice overall schema for describing identifiers, and there's really no reason we shouldn't have as many identifier schemes as the system needs, especially if they are well described, configured, referenced, and qualified. However, the process of establishing a new identifier is completely community based, so how quickly anything gets done depends on just how active a few individual people are. It would be nice to have some kind of generic identifier property that could be qualified with its own property value settings so that, in a case like mine of building out a whole new concept pool from source data, we could at least put identifiers where they belong. In the meantime, I worked with aliases/alternative labels as the only reasonable place to shove ID strings, and this approach seems to be in use elsewhere as well.

I looked at the code for pywikibot, which is kind of the Swiss Army knife for all things MediaWiki. After struggling to find working examples of Wikidata processing, I went with the [wikidataintegrator](https://github.com/SuLab/WikidataIntegrator) from SuLab (Scripps). I forked it and started a [branch](https://github.com/skybristol/WikidataIntegrator/tree/ecoregion_helper) to build out what I needed for this case. I also got added to SuLab's Slack channel, as there seems to be some interesting work going on there to jump in on.

The wikidataintegrator has taken an interesting approach of building out "wdi_helpers" that seem to deal with the specific business logic needed for individual collections of items and their Wikidata layout and requirements. I'm not sure that tying everything together in one codebase is the best approach; it might be better to externalize the specific cases and keep a clean foundation that does the background work in a configurable way. However, the two use cases built out by the SuLab authors were pretty useful in helping me understand the usage pattern, and I added a new one for the [ecoregion use case](https://github.com/skybristol/WikidataIntegrator/blob/ecoregion_helper/wikidataintegrator/wdi_helpers/ecoregion.py).

The main thing I had to adjust from what the SuLab folks have done is the handling of the "foreign keys" that I'm bringing into the system from source data. In my prep work, I generated what I called a "contextual identifier": essentially a URI stub that combines the specific classification system and the original source ID in a URN-like string (e.g., US_L4CODE:51g). The SuLab work on documenting genomics data all seems to involve cases where they can use established identifier properties. I had to add a new function for working with aliases/alt-labels in Wikidata and use it in place of their methods for determining whether an item already exists in Wikidata, which is what controls create/update operations. I did try to mostly mimic the constructs they came up with so as not to monkey around in the code too much.
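
To make the "contextual identifier" idea concrete, here is a minimal sketch of how such a string could be composed and split back apart. The function names are hypothetical and are not taken from my prep code or from wikidataintegrator; the only thing carried over from the real data is the `SYSTEM:source_id` layout seen in values like `US_L4CODE:51g`.

```python
def make_contextual_identifier(system, source_id):
    """Join a classification system code and a source ID into one string.

    Hypothetical helper: the real prep work may have built these differently.
    """
    system = str(system).strip()
    source_id = str(source_id).strip()
    # The colon is the separator, so it cannot appear in the system code.
    if not system or not source_id or ":" in system:
        raise ValueError("need a non-empty system (without ':') and source ID")
    return f"{system}:{source_id}"


def parse_contextual_identifier(identifier):
    """Split a contextual identifier back into (system, source_id)."""
    # Split on the first colon only, so source IDs may contain dots or colons.
    system, sep, source_id = identifier.partition(":")
    if not sep or not system or not source_id:
        raise ValueError(f"not a contextual identifier: {identifier!r}")
    return system, source_id


if __name__ == "__main__":
    print(make_contextual_identifier("US_L4CODE", "51g"))
    print(parse_contextual_identifier("NA_L3CODE:9.4.5"))
```

Keeping the classification system in the string is what lets a single alias stand in for a proper identifier property: the same raw source ID can appear in more than one system without colliding.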

I've successfully tested this process a few times to build out items (e.g., [Door Peninsula L4 Ecoregion](https://www.wikidata.org/wiki/Q98687985)). While I could probably get away with running the full set of items into the system, I'm holding off for a few days to see if someone gets around to my bot task and identifier property requests. I'm curious to figure out the level of activation energy needed in this community. The following code sequence grabs up my prepped ecoregion data (that's really where all the work went on), splits out items with existing identifiers, and creates a new item from source. It took a little bit of work to get here, but I'm pleased so far with the progress. Being able to build minimally on the wikidataintegrator work was a huge boost.

In [1]:
import os
import pickle

from wikidataintegrator import wdi_core, wdi_login, wdi_helpers

In [2]:
# Load the prepped ecoregion records; the context manager closes the file.
with open("data_cache/ecoregions.pkl", "rb") as infile:
    ecoregions = pickle.load(infile)

In [3]:
ecoregions_in_wikidata = [i for i in ecoregions if i["wikidata_id"] is not None]
ecoregions_to_create = [i for i in ecoregions if i["wikidata_id"] is None]

In [4]:
login_instance = wdi_login.WDLogin(
    user=os.environ["WD_BOT"], 
    pwd=os.environ["WD_PASS"]
)

In [5]:
ecoregion = ecoregions_to_create[2]
display(ecoregion)

{'contextual_identifier': 'NA_L3CODE:9.4.5',
 'wikidata_id': None,
 'source': 'NA_CEC_Eco_Level3',
 'common_name': 'Cross Timbers',
 'part_of_identifiers': ['NA_L1CODE:9', 'NA_L2CODE:9.4'],
 'x': -97.5065596504837,
 'y': 33.5884848315732,
 'admin_intersects': ['Q484603',
  'Q484586',
  'Q495867',
  'Q495873',
  'Q110500',
  'Q495925',
  'Q495951',
  'Q379540',
  'Q484596',
  'Q112149',
  'Q111844',
  'Q486651',
  'Q491616',
  'Q108827',
  'Q484616',
  'Q495570',
  'Q110384',
  'Q111368',
  'Q485024',
  'Q110412',
  'Q114550',
  'Q489471',
  'Q108784',
  'Q489312',
  'Q489457',
  'Q156270',
  'Q495479',
  'Q108424',
  'Q495930',
  'Q495880',
  'Q109457',
  'Q109308',
  'Q485042',
  'Q491590',
  'Q495581',
  'Q489306',
  'Q495906',
  'Q495885',
  'Q497854',
  'Q109170',
  'Q495899',
  'Q485038',
  'Q112673',
  'Q485912',
  'Q108793',
  'Q495891',
  'Q484567',
  'Q110779',
  'Q484590',
  'Q1649',
  'Q109265',
  'Q156613',
  'Q1439',
  'Q375008',
  'Q113843',
  'Q374256',
  'Q113919',
  'Q

In [6]:
r = wdi_helpers.Ecoregion(
    identifier=ecoregion["contextual_identifier"],
    title=ecoregion["common_name"],
    instance_of=ecoregion["source"],
    country=["Q30"],
    located_in=ecoregion["admin_intersects"],
    latitude=ecoregion['y'],
    longitude=ecoregion['x']
)

In [7]:
wd_item = r.get_or_create(login=login_instance)
print(wd_item)

Q98688136
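
The create-or-reuse decision behind a call like `get_or_create` can be sketched roughly as follows. This is not the code in my ecoregion helper; it is a simplified illustration that assumes the alias lookup is done with a SPARQL query on `skos:altLabel` and that results come back in the standard SPARQL JSON layout. The names `build_alias_query`, `extract_qids`, `run_query`, and `create_item` are all hypothetical. The query runner and the item writer are passed in as plain callables so the logic can be exercised without touching Wikidata.

```python
def build_alias_query(identifier, lang="en"):
    """Build a SPARQL query for items carrying the identifier as an alias."""
    # Escape backslashes and quotes so the identifier is a safe string literal.
    escaped = identifier.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?item WHERE { "
        f'?item skos:altLabel "{escaped}"@{lang} . '
        "} LIMIT 2"
    )


def extract_qids(sparql_json):
    """Pull QIDs out of a standard SPARQL JSON result with an ?item variable."""
    bindings = sparql_json["results"]["bindings"]
    # Entity URIs end in the QID, e.g. http://www.wikidata.org/entity/Q42
    return [b["item"]["value"].rsplit("/", 1)[-1] for b in bindings]


def get_or_create(identifier, run_query, create_item):
    """Return (qid, created) for the item that owns this identifier.

    run_query: callable taking a SPARQL string, returning SPARQL JSON.
    create_item: callable taking the identifier, returning the new QID.
    """
    qids = extract_qids(run_query(build_alias_query(identifier)))
    if len(qids) > 1:
        # Aliases are not unique keys, so refuse to guess between matches.
        raise ValueError(f"{identifier} matches several items: {qids}")
    if qids:
        return qids[0], False
    return create_item(identifier), True


if __name__ == "__main__":
    # Stand-in for a real query runner: pretends nothing matches yet.
    def empty_result(query):
        return {"results": {"bindings": []}}

    print(get_or_create("NA_L3CODE:9.4.5", empty_result, lambda i: "Q-new"))
```

The explicit error on multiple matches matters more here than it would with a real identifier property, since nothing in Wikidata stops two items from sharing an alias.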


In [8]:
created_item = wdi_core.WDItemEngine(wd_item_id=wd_item)
created_item.get_wd_json_representation()

{'labels': {'en': {'language': 'en', 'value': 'Cross Timbers'}},
 'descriptions': {'en': {'language': 'en',
   'value': 'an instance of Level III ecoregion'}},
 'claims': {'P131': [{'mainsnak': {'snaktype': 'value',
     'property': 'P131',
     'datavalue': {'value': {'entity-type': 'item',
       'numeric-id': 484603,
       'id': 'Q484603'},
      'type': 'wikibase-entityid'},
     'datatype': 'wikibase-item'},
    'type': 'statement',
    'rank': 'normal',
    'qualifiers': {},
    'qualifiers-order': [],
    'references': [],
    'id': 'Q98688136$61397D49-5648-4B13-B1CF-12EB893D12FE'},
   {'mainsnak': {'snaktype': 'value',
     'property': 'P131',
     'datavalue': {'value': {'entity-type': 'item',
       'numeric-id': 484586,
       'id': 'Q484586'},
      'type': 'wikibase-entityid'},
     'datatype': 'wikibase-item'},
    'type': 'statement',
    'rank': 'normal',
    'qualifiers': {},
    'qualifiers-order': [],
    'references': [],
    'id': 'Q98688136$8DCCC587-6F26-42BB-97D