Knowledgebase Experimentation for the Environmental Enforcement Watch project

This project develops a workflow for building the foundational data that the EEW project integrates and makes sense of, drawn from the USEPA's ECHO platform and other sources. It operates against the https://eew-edgi.wikibase.cloud instance of Wikibase. Read more about the project on that Wikibase instance's front page.

Re-use of this code

Anyone is welcome to take and build on anything I'm doing here - it is Unlicensed. I'm not currently building any of this as a deployable codebase, and you won't be able to run everything here unless you build it to operate on your own Wikibase instance (local or on wikibase.cloud). I use environment variables in whatever platform I'm executing this on (currently a Pangeo environment via the ESIPLab) to store access information. You'd have to rework that according to your own preferences.
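
If you do adapt this, a minimal sketch of the environment-variable approach might look like the following; the variable names (WB_BOT_USER, WB_BOT_PASS) are hypothetical stand-ins rather than names actually used in these notebooks.

```python
import os

# Hypothetical environment variable names for Wikibase access information;
# set them through whatever secrets mechanism your platform provides.
WIKIBASE_URL = "https://eew-edgi.wikibase.cloud"
BOT_USER = os.environ["WB_BOT_USER"]
BOT_PASSWORD = os.environ["WB_BOT_PASS"]
```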

Data and Processing

I finally figured out (remembered?) that one of the datasets the Microsoft Planetary Computer folks had set up for use was a bunch of the Census data. These are built products from the 2020 census for demographic work, but they are perfectly suitable for related work I need to do here. I may still leverage the methods developed previously for building out reference sources on tribal lands, states, counties, etc., but when I need to run geospatial processes to clean up and link FRS facilities to these areas, I can do that very efficiently working with what MPC has in the cloud.
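
As a rough illustration of that linkage step, the spatial join could be as simple as the following GeoPandas sketch; the file paths and column names are placeholders, not the actual FRS extracts or MPC assets.

```python
import geopandas as gpd

# Placeholder inputs: FRS facility points and 2020 Census county polygons
# (the real county layer would come from the cloud-hosted MPC products).
facilities = gpd.read_file("frs_facilities.geojson")
counties = gpd.read_parquet("census_counties_2020.parquet")

# Align coordinate reference systems before joining
facilities = facilities.to_crs(counties.crs)

# Attach the containing county's attributes to each facility point
linked = gpd.sjoin(facilities, counties, how="left", predicate="within")
print(linked[["facility_name", "NAME", "STATEFP"]].head())  # assumed column names
```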

I also figured out something I was clueless about - I can hit a remote Jupyter server with a notebook from my local VSCode environment. The MPC folks have some nice instructions on that. I have to decide whether or not I'm going to push items and claims directly to my Wikibase instance from processing on MPC or dump out an intermediary file to be loaded elsewhere. There's not a great way to manage custom environments in the MPC Hub (a Pangeo/JupyterLab instance), so it's a little difficult to set up necessary Python packages and environment variables there. It would be much cleaner to do data processing there and then send results right into the knowledge graph.
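
If I do end up pushing directly into the knowledge graph, the write step would look roughly like this sketch, which assumes the wikibaseintegrator Python library, bot credentials in environment variables, and the "instance of" (P1) convention described in the principles below; the label and target item are illustrative only.

```python
import os
from wikibaseintegrator import WikibaseIntegrator, wbi_login
from wikibaseintegrator.wbi_config import config as wbi_config
from wikibaseintegrator.datatypes import Item

# Point the client at this project's Wikibase instance
wbi_config["MEDIAWIKI_API_URL"] = "https://eew-edgi.wikibase.cloud/w/api.php"
wbi_config["SPARQL_ENDPOINT_URL"] = "https://eew-edgi.wikibase.cloud/query/sparql"
wbi_config["WIKIBASE_URL"] = "https://eew-edgi.wikibase.cloud"
wbi_config["USER_AGENT"] = "eew-edgi-experiment/0.1"  # placeholder user agent

# Credentials from environment variables (hypothetical names)
login = wbi_login.Login(user=os.environ["WB_BOT_USER"], password=os.environ["WB_BOT_PASS"])
wbi = WikibaseIntegrator(login=login)

# Create a new item, label it, and classify it with "instance of" (P1)
item = wbi.item.new()
item.labels.set("en", "Example County, Example State")
item.claims.add(Item(prop_nr="P1", value="Q42"))  # Q42 stands in for a class item
item.write()
```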

Some principles I'm figuring out here

My notebooks will have specific text and notes on what I'm working through. I sometimes come back to the readme with things I've worked out as general principles.

  • We need a few hard and fast constraints to make all of the automated parts of this work.
    • We need to count on a few rules so we can understand and navigate through the graph while letting some other things develop organically through time. I'm using some basic conventions like the following:
      • Everything is either an "instance of" something or a "subclass of" something except for the notional point of origin, which I called entity (could have been something else, but that's the frame I used)
      • Classes are all "subclass of" either entity itself or something below that. Once "entity" (Q1), "instance of" (P1), and "subclass of" (P2) are established, we can get busy with everything else.
  • Every source needs to be an item that can be pointed to, with sufficiently detailed characteristics to link to how things came to be in the knowledgebase. I'm trying to drive actual processing from the content of the items to make this a reality. I'm working toward a balance between configuration stored as "data" (details within a source item's structure) and the code workflows that act on it.
  • It's better to break processes up for clarity and simplicity. Do one thing and one thing only, leaving other processing for subsequent operations. For instance, bring in a county based solely on what is in the specific source. We can then run further processing to add in linkage and other information as separate bots operating through time.
  • Sometimes "simplistic" data access methods are okay
    • It initially seemed illogical to use what amounts to a web scraping method to get tabular data from the U.S. Census TIGER data rather than using their web services. However, in exploring those services, I found them to be in the same kind of shape I've found elsewhere - they are fundamentally GIS services set up and tuned to drive GIS applications; they are not data distribution or general access data services. Sure, I could write code that uses the ArcGIS Server REST APIs to return JSON and process that just fine. But that requires a bunch of complicated parameterization and fundamentally isn't very different from any other HTTP call. We might also find that those services are even more "brittle" in terms of ongoing change than the cached HTML tables I ended up using. Since those pages with single HTML tables are actually advertised on the website as a point of data access, it seems like a reasonable way to go about this. I did use the Pandas read_html method here (see the sketch after this list), which is a nontrivial dependency, but it works and makes things fairly efficient.
    • I'm trying to be pragmatic about how deep the class structure goes. Many ontologies are too academic and philosophical for practical use without translating concepts into something normal people can understand when looking at the information. Wikidata is a mess of sometimes conflicting ideas. I'm shooting for something in the middle.
  • Labels are not great identifiers for computers, but they are what people look at to recognize something. In the giant context of a knowledge graph, contextual labels are important. We'll end up with duplicate labels over time, and Wikibase has some nice disambiguation features. Within a given context (meaning an "instance of" classification), I'm trying to keep labels unique. This does mean a little frontloading, like combining county names with state names to derive a unique label.
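
As a sketch of the "simplistic" table access and the label frontloading mentioned above, the TIGER-related scraping boils down to something like the following; the URL is a placeholder for whichever Census page advertises the table, and the column names are assumptions about its layout.

```python
import pandas as pd

# Placeholder URL standing in for a Census page that publishes a single HTML table
SOURCE_URL = "https://www.census.gov/path/to/advertised/table.html"

# read_html returns one DataFrame per <table> element found on the page
tables = pd.read_html(SOURCE_URL)
counties = tables[0]

# Frontload a contextually unique label by combining county and state names
# (column names are assumptions about the source table)
counties["label"] = counties["County Name"] + ", " + counties["State Name"]
print(counties["label"].head())
```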
