One of the sources for information we are interested in from a mining perspective in the GeoKB is a relatively new type of technical report required by the U.S. Securities and Exchange Commission (S-K 1300 Technical Report). These are similar to the National Instrument 43-101 Reports required by the Canadian Government, and they contain useful details on mineral prospects that can feed into mineral resource assessments.

The SEC provides an API called EDGAR that facilitates access to S-K 1300 reports. Report format varies from PDF files (like the NI 43-101 reports) to HTML "documents" (essentially, small web sites with many HTML and image elements).

From a knowledgebase perspective, working with these records indicates several entity types we need to make sense of:

* Documents that we'll classify as something like "government legal filing"
    * Classification properties for these documents
    * Identifier properties that provide reasonably persistent, resolvable access to the documents
    * Extracted subject matters indicated in the document texts
* Commercial companies that will be a type of organizational entity
    * Classification properties for these companies
    * Identifier properties for companies that make them linkable to other systems

Company/organization information is important to nail down as we work through these sources. It will be useful in other circumstances as well, and we will need it to track down and link in other information from both public and proprietary sources. The SEC assigns its own unique identifier, the Central Index Key (CIK).

Along with the CIK identifiers, we may also want to explore the following additional international registry sources that can help us in disambiguating organizational entities and reaching out for further details:

* [Global Legal Entity Identifier Foundation](https://www.gleif.org/en) - provides a persistent, resolvable identifier, API, and data dump with a ton of details on many commercial entities and their relationships, including a trace to ultimate parent entities
* [OpenAlex](https://explore.openalex.org/) - a fully open registry mostly focused on academic publishing but including its own take on institutions, identifiers, and mapping

There are other identifier systems/registries beyond these that apply here, like the ISIN system, but most of them are either closed access or do not have a reasonable API or method for getting to useful data. Wikidata has a ton of the institutions we care about, like mining companies, but it is somewhat fraught with semantic confusion issues. Where we can nail down a usable Wikidata Q identifier for a company, it may provide us with an avenue to get other identifiers, but we'll still need to run our own resolution to ensure they are valid mappings.

## Approaching from the Company Identifier

In looking over the SEC filings and in some discussions with assessment geologists, there are other artifacts from the SEC EDGAR source beyond the more formal S-K 1300 reports that could be useful (e.g., news releases with tables of production information). So, coming at this from the perspective of first identifying mining companies and then running routine queries to look for relevant filings may be a better way to approach the problem. We can establish entities for the companies in the knowledgebase, including relevant identifiers, and use those as a launching point to mine for useful documentation to bring into the knowledgebase and parse for claims.

In [1]:
import os
from sec_api import QueryApi, FullTextSearchApi, MappingApi
import requests
import pandas as pd
import swifter

In [2]:
sec_query_api = QueryApi(api_key=os.environ['SEC_EDGAR_KEY'])
sec_mapping_api = MappingApi(api_key=os.environ['SEC_EDGAR_KEY'])

# Mining Companies

Figuring out exactly which companies registered with the SEC are involved in mining operations likely to be filing reports of use to the GeoKB is a little bit tricky as they might be classified in different sectors. "Sector" is one of the available query parameters in the [EDGAR Mapping API](https://sec-api.io/docs/mapping-api), which is where we can go after company information directly.

Companies registered with the SEC are classified using the [Standard Industrial Classification (SIC) code](https://www.sec.gov/corpfin/division-of-corporation-finance-standard-industrial-classification-sic-code-list) system. These codes are available in the data, but unfortunately, they are not available as a query parameter. With a little bit of sleuthing, we can figure out the "sector" values where the appropriate SIC coded companies would be classified and then filter from there.

I'm starting with a simplified list of SIC codes, from which we may need to expand if we find companies of interest in other sectors.

In [3]:
mining_sector_codes = [
    1000,
    1040,
    1090,
    1400
]

df_companies = pd.DataFrame(sec_mapping_api.resolve("sector", "Basic Materials"))
df_mining_companies = df_companies[df_companies.sic.isin([str(i) for i in mining_sector_codes])].copy()
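The sleuthing step mentioned above can be sketched in plain Python. This is a minimal illustration, assuming we already hold a batch of company records shaped like the Mapping API output (dicts with `sector` and `sic` keys). The function name and the toy records are hypothetical, not real SEC data.

```python
from collections import Counter


def sectors_for_sic_codes(records, sic_codes):
    """Count how many records in each sector carry one of the target SIC codes."""
    # SIC codes appear as strings in these records, so compare as strings
    wanted = {str(code) for code in sic_codes}
    return Counter(
        rec["sector"] for rec in records if str(rec.get("sic", "")) in wanted
    )


# Toy records (hypothetical, for illustration only)
toy_records = [
    {"name": "EXAMPLE GOLD CO", "sector": "Basic Materials", "sic": "1040"},
    {"name": "EXAMPLE URANIUM CO", "sector": "Energy", "sic": "1090"},
    {"name": "EXAMPLE SOFTWARE CO", "sector": "Technology", "sic": "7372"},
    {"name": "EXAMPLE METALS CO", "sector": "Basic Materials", "sic": "1000"},
]

sector_counts = sectors_for_sic_codes(toy_records, [1000, 1040, 1090, 1400])
print(sector_counts.most_common())
```

Any sector that shows up in the counts is a candidate for another `resolve("sector", ...)` call if we decide to widen the net.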

In [4]:
df_mining_companies.head(10)

| | name | ticker | cik | cusip | exchange | isDelisted | category | sector | industry | sic | sicSector | sicIndustry | famaSector | famaIndustry | currency | location | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | ALMADEN MINERALS LTD | AAU | 1015647 | 020283305 020283107 | NYSEMKT | False | ADR Common Stock | Basic Materials | Gold | 1000 | Mining | Metal Mining | | Non-Metallic and Industrial Metal Mining | CAD | British Columbia; Canada | 80ac18a59ea33ac49a4b4eb0c508e78c |
| 4 | ABTECH HOLDINGS INC | ABHD | 1405858 | 00400H108 00400H207 | OTC | True | Domestic Common Stock | Basic Materials | Uranium | 1090 | Mining | Miscellaneous Metal Ores | | Non-Metallic and Industrial Metal Mining | USD | Arizona; U.S.A | 97df33c6904298396f07475f5d484227 |
| 14 | AGNICO EAGLE MINES LTD | AEM | 2809 | 008474108 | NYSE | False | Canadian Common Stock | Basic Materials | Gold | 1040 | Mining | Gold And Silver Ores | | Precious Metals | USD | Ontario; Canada | 78de930fa8ae4077b3e540765185443b |
| 18 | FIRST MAJESTIC SILVER CORP | AG | 1308648 | 32076V103 32076R102 | NYSE | False | Canadian Common Stock | Basic Materials | Silver | 1040 | Mining | Gold And Silver Ores | | Precious Metals | USD | British Columbia; Canada | f933bb34a6eb75d8a744d8bc19e853fd |
| 21 | ALAMOS GOLD INC | AGI | 1178819 | 011532108 011527108 | NYSE | False | Canadian Common Stock | Basic Materials | Gold | 1040 | Mining | Gold And Silver Ores | | Precious Metals | USD | Ontario; Canada | a5d252de2675a3ed99ed7dd0fe930a75 |
| 30 | ALIO GOLD INC | ALO | 1502154 | 88741P103 01627X108 | NYSEMKT | True | ADR Common Stock | Basic Materials | Gold | 1040 | Mining | Gold And Silver Ores | | Precious Metals | USD | British Columbia; Canada | f7f6b0f5d64f20af70da91b6614cc352 |
| 35 | AMERICAN LITHIUM CORP | AMLI | 1699880 | 027259209 027259100 | NASDAQ | False | Canadian Common Stock | Basic Materials | Other Industrial Metals & Mining | 1000 | Mining | Metal Mining | | Non-Metallic and Industrial Metal Mining | USD | British Columbia; Canada | 15af68251c201b0bab0e800e12478e21 |
| 40 | AMERICAN EAGLE ENERGY CORP | AMZGQ | 1282613 | 02554F300 02554F102 29759Y107 | NYSEMKT | True | Domestic Common Stock | Basic Materials | Other Industrial Metals & Mining | 1000 | Mining | Metal Mining | | Non-Metallic and Industrial Metal Mining | USD | Colorado; U.S.A | 12d3b11e04222a7d0fc3159ab41df766 |
| 43 | ALLIED NEVADA GOLD CORP | ANVGQ | 1376610 | 019344100 448629105 | NYSEMKT | True | Domestic Common Stock | Basic Materials | Gold | 1040 | Mining | Gold And Silver Ores | | Precious Metals | USD | Nevada; U.S.A | a512cb7c6b11ef8ecfa2b8875d27257e |
| 50 | GOLDEN MINERALS CO | APXSQ | 1011509 | G04074103 | NYSEMKT | True | Domestic Common Stock | Basic Materials | Other Precious Metals & Mining | 1040 | Mining | Gold And Silver Ores | | Precious Metals | USD | Colorado; U.S.A | 58b8768e35aee777011cdbe0e863e5c4 |


## Knowledgebase Mapping

Looking at these records, we have a somewhat useful start to stubbing out organization/commercial company records in the GeoKB. This at least probably narrows down to companies we may care about for current use cases related to mineral resource assessments.

We could go further and look for relevant SEC filings first or check other sources where we may have mention of these companies already. At some point, it probably doesn't hurt to have a start to records in our GeoKB for all of the companies filed with the SEC classified in one of the mining categories where we are likely to encounter them somewhere in our work. If we don't tie anything else to these entities for a while, that's okay.

We could also track down records for these companies in other registries, like the LEI system mentioned above, which may give us more robust information to start our records with:

* Perhaps better primary name
* Additional company names (aliases)
* Some type of status indicator (have to track this down further and compare with other sources)
* Jurisdiction may be the most useful indication of where the company is based or operating from. Many of these have "legal addresses" in Delaware (like thousands of other "holding companies").

In [5]:
# Starter function to tease out useful LEI information
def lei_lookup(company_name):
    # Pass the filter via params so requests URL-encodes names containing "&", "#", etc.
    r_lei = requests.get(
        "https://api.gleif.org/api/v1/lei-records",
        params={
            "page[size]": 10,
            "page[number]": 1,
            "filter[entity.names]": company_name,
        },
        timeout=30,
    )
    lei_entity = r_lei.json()

    # Only keep unambiguous results: skip names with zero or multiple hits
    if 'data' not in lei_entity or len(lei_entity['data']) != 1:
        return None

    attributes = lei_entity['data'][0]['attributes']
    entity = attributes['entity']
    lei_info = {
        'lei': attributes['lei'],
        'legal_name': entity['legalName']['name'],
        'other_names': entity['otherNames'],
        'status': entity['status'],
        'jurisdiction': entity['jurisdiction']
    }

    return lei_info


In [7]:
sample_name = df_mining_companies.sample().iloc[0]["name"]
print(sample_name)

display(lei_lookup(
    company_name=sample_name
))

PARAMOUNT GOLD NEVADA CORP


{'lei': '5493000CWEBEVDLIW256',
 'legal_name': 'Paramount Gold Nevada Corp.',
 'other_names': [],
 'status': 'ACTIVE',
 'jurisdiction': 'US-NV'}

This additional information introduces some nuances we need to think about in modeling these entities into our knowledgebase. The other names field carries more than just a name string and language (both needed to encode into Wikibase). It also gives us a qualifier indicating the type of name, often indicating a former name. I've seen examples in Wikidata where someone tossed this into the alias as a parenthetical, but that's not a great way to handle things. We may want to introduce these as claims where we can properly encode the qualifier, potentially in addition to incorporating the name string into aliases for standard search operations.
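As a rough sketch of that idea, the function below splits LEI other-name entries into plain alias strings and claim-like dicts that keep the name type for use as a qualifier. It assumes each entry is a dict with `name`, `language`, and `type` keys; that shape, the function name, and the sample entries are assumptions for illustration and should be checked against real LEI records.

```python
def split_other_names(other_names, accepted_languages=("en",)):
    """Turn LEI other-name entries into alias strings plus claim-like dicts."""
    aliases, name_claims = [], []
    for entry in other_names:
        name = entry.get("name")
        if not name:
            continue
        language = entry.get("language")
        # Keep aliases only for languages our Wikibase instance currently supports
        if language in accepted_languages and name not in aliases:
            aliases.append(name)
        # Every name becomes a candidate claim; the type would be encoded as a qualifier
        name_claims.append(
            {"value": name, "language": language, "name_type": entry.get("type")}
        )
    return aliases, name_claims


# Hypothetical entries, for illustration only
sample_other_names = [
    {"name": "Example Mining Corp", "language": "en", "type": "PREVIOUS_LEGAL_NAME"},
    {"name": "Exemple Miniere Inc", "language": "fr", "type": "ALTERNATIVE_LANGUAGE_LEGAL_NAME"},
]
aliases, name_claims = split_other_names(sample_other_names)
print(aliases)
print(name_claims)
```

Keeping the two outputs separate lets us decide later whether every name type deserves an alias or only a claim.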

We could introduce an overall status type of property at a higher level of semantics, applying across a range of item types, using either specific item objects as values (e.g., "LEI ACTIVE") or general values with a qualifier indicating what kind of "ACTIVE" we mean or just use a reference on the claim to indicate where it comes from. Whatever we decide, we'll need to write down the design principle and use it consistently.

We also have suggested claims pointing to a country (e.g., CA, US, etc.) and state/province (e.g., British Columbia, Nevada, etc.). The target object for these will come from a place name reference we need to build into the GeoKB for many purposes. The "jurisdiction" property from the LEI system represents the governmental jurisdictions under which the company operates, which leads to some kind of unique property in the GeoKB for this type of information.
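To compare the two sources, we first need both in a common shape. The sketch below assumes the formats seen in the sample records: an LEI jurisdiction like `US-NV` (country code, optional subdivision code) and an SEC location like `Nevada; U.S.A` (subdivision name, then country name). Treating a single-part SEC location as a country is my assumption, and both function names are hypothetical.

```python
def parse_lei_jurisdiction(code):
    """Split an LEI jurisdiction such as 'US-NV' into (country, subdivision)."""
    country, _, subdivision = code.partition("-")
    return country, (subdivision or None)


def parse_sec_location(location):
    """Split an SEC location such as 'Nevada; U.S.A' into (subdivision, country)."""
    parts = [part.strip() for part in location.split(";")]
    if len(parts) == 1:
        # Assumption: a lone value is a country with no subdivision given
        return None, parts[0]
    return parts[0], parts[-1]


print(parse_lei_jurisdiction("US-NV"))
print(parse_sec_location("Nevada; U.S.A"))
```

The actual alignment (e.g., "NV" to "Nevada") still depends on the place name reference we build into the GeoKB.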

In [8]:
df_mining_companies['lei'] = df_mining_companies['name'].swifter.apply(lambda x: lei_lookup(x))

Pandas Apply:   0%|          | 0/317 [00:00<?, ?it/s]

### Central point of facts

One of the interesting dynamics we are pursuing with the GeoKB idea is to establish what will amount to a "central point of facts" that is dynamic in nature. Some of the operations we may undertake will be costly in terms of computational processing time like the above example. But once we go to the effort and encode what we want to use into our knowledgebase, we can then leverage it in lots of different ways. Some sources will "demand" that we build something to revisit them periodically for new and updated information, while others will be more of a one-time deal. Some sources will need to be revisited when we determine that there is other useful information to be had or we need to re-think how we encoded something.

The above process essentially worked through 300+ records in a pretty inefficient way, hitting a REST API with one query per record. We can actually send multiple names at once to the same API endpoint, but the results that come back have to be processed further to figure out matches to those names. We also ignored cases for now where a name comes up with more than one hit. That will all have to get worked out.
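One way to start working out the multiple-hit cases is to normalize names on both sides and accept a candidate only when exactly one normalized legal name equals the query. This is a minimal sketch under that assumption; the function names and toy candidates are hypothetical, and the candidates reuse the `legal_name` key from the lookup function above.

```python
import re


def normalize_name(name):
    """Casefold and drop punctuation so 'EXAMPLE GOLD INC' matches 'Example Gold Inc.'."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.casefold())
    return " ".join(cleaned.split())


def pick_exact_match(company_name, candidates):
    """Return the single candidate whose normalized legal name equals the query, else None."""
    target = normalize_name(company_name)
    hits = [c for c in candidates if normalize_name(c["legal_name"]) == target]
    return hits[0] if len(hits) == 1 else None


# Hypothetical candidates, for illustration only
toy_candidates = [
    {"lei": "TOYLEI0001", "legal_name": "Example Gold Inc."},
    {"lei": "TOYLEI0002", "legal_name": "Example Gold Holdings Inc."},
]
print(pick_exact_match("EXAMPLE GOLD INC", toy_candidates))
```

The same matcher would also help sort out results when several names are sent in one request, since each returned record can be tested against each submitted name.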

If this particular case of working the LEI system is important enough in other use cases, we may want to consider a different approach where we load their bulk data (the "Gold File") somewhere into our own infrastructure (e.g., an Elasticsearch index) where we can run processing more efficiently at scale in the cloud with our own custom output. At a few hundred records, this is no big deal. Once we nail down the identifier mapping and gather a few useful details, we have our single point of facts to operate with. But we have to keep the reprocessing dynamic in mind, recognizing that nothing we put into the knowledgebase will ever remain completely static.

In [20]:
df_mining_companies_with_lei = df_mining_companies[df_mining_companies.lei.notnull()][["name","lei"]].copy()
df_mining_companies_with_lei['lei_name'] = df_mining_companies_with_lei.lei.apply(lambda x: x['legal_name'])
df_mining_companies_with_lei[["name","lei_name"]]

| | name | lei_name |
|---|---|---|
| 2 | ALMADEN MINERALS LTD | ALMADEN MINERALS LTD. |
| 18 | FIRST MAJESTIC SILVER CORP | First Majestic Silver Corp. |
| 21 | ALAMOS GOLD INC | ALAMOS GOLD INC. |
| 30 | ALIO GOLD INC | Alio Gold Inc. |
| 35 | AMERICAN LITHIUM CORP | AMERICAN LITHIUM CORP. |
| ... | ... | ... |
| 1011 | PARAMOUNT GOLD NEVADA CORP | Paramount Gold Nevada Corp. |
| 1013 | RICHMONT MINES INC | Richmont Mines Inc. |
| 1020 | SEABRIDGE GOLD INC | SEABRIDGE GOLD INC. |
| 1052 | VIZSLA SILVER CORP | VIZSLA SILVER CORP. |


Out of our 317 "mining companies" registered with the SEC, we were able to find 114 pretty solid matches to the LEI system based on name. We don't absolutely know those are the same entity, but chances are pretty good because of the way we ran the searches. Looking at the results above, we can see some interesting things like the match on "WMC RESOURCES LTD" from the SEC EDGAR and the much more comprehensive legal name from the LEI registry. Maybe this doesn't matter all that much for our current use cases, but as we work to build on the knowledgebase foundation and go after other linked information, this might make a difference.

# SEC Filings

Assuming we use the combination of SEC mining company records with some matches to the LEI registry to stub out the start to company records in our GeoKB, we could then initiate routine data mining for relevant SEC filings based on the CIK identifiers and the SEC EDGAR API. We can start by exploring everything associated with a given CIK identifier to develop a better query/processing pathway to tease out what we want.

Filings are also essentially static "government legal filing" entities from the standpoint of our knowledge model. Once we identify what we want and bring them into our system, we can build on them from that point.

In [52]:
def query_sec_filings(cik):
    query = {
        "query": {"query_string": {
            # f-string so the actual CIK value is interpolated into the query
            "query": f"cik:{cik}",
        }},
        "from": "0",
        "size": "10",
        "sort": [{"filedAt": {"order": "desc"}}]
    }

    return sec_query_api.get_filings(query)
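Once we know which filing types matter, the query could be narrowed rather than pulling everything for a CIK. The sketch below only builds the query dict in the same shape as the function above. The `formType` field name and the example form types are assumptions to verify against the API documentation, and `build_filings_query` is a hypothetical helper.

```python
def build_filings_query(cik, form_types=None, start=0, size=10):
    """Build a query dict for one CIK, optionally restricted to certain form types."""
    query_string = f"cik:{cik}"
    if form_types:
        # Lucene-style grouping: formType:("10-K" OR "8-K")
        quoted = " OR ".join(f'"{form_type}"' for form_type in form_types)
        query_string += f" AND formType:({quoted})"
    return {
        "query": {"query_string": {"query": query_string}},
        "from": str(start),
        "size": str(size),
        "sort": [{"filedAt": {"order": "desc"}}],
    }


print(build_filings_query(2809, form_types=["10-K", "8-K"]))
```

Since each call counts against the request quota, a narrower query per company is also cheaper than paging through every filing.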


In [53]:
df_mining_companies['sec_filings'] = df_mining_companies.cik.swifter.apply(lambda x: query_sec_filings(x))

Exception: API error: 429 - {"status":429,"error":"You send a lot of requests. We like that. But you exceeded the free query limit of 100 requests. Upgrade your account to get unlimited access. Visit sec-api.io for more."}

Wah...wah...wah...

I should have anticipated this issue; I just hadn't keyed in on the "Pricing" menu item at the top of the SEC EDGAR API site. Many groups like this (including government) are going this route to control costs for offering these kinds of services. It sucks and kind of runs counter to open data policies, but it's the reality we have to deal with.

Raw EDGAR data for bulk download seems to be available in a [monthly archive](https://www.sec.gov/Archives/edgar/monthly/). (This is how some government agencies are able to claim that they provide open data with things like their APIs considered to be over and above that basic service. Grrr!) It's stored in their foundational XML structure, and it looks like there are some methods in the [Python package](https://pypi.org/project/sec-api/#xbrl-to-json-converter-api) for transforming this to JSON. Notionally, we could figure out some kind of bulk download and do whatever we want with the data from that point on our own infrastructure. But we only need a very small slice of what they have for a small fraction of companies the SEC regulates. Another approach could be to set up a process that limits daily requests (I assume it's 100 requests/day), slowly gathering what we need as a baseline, and then picking up updates incrementally as needed.
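The incremental approach could look something like the sketch below: each run processes only CIKs that are not yet in a local JSON cache and stops at a request budget. It assumes the limit resets daily (my guess above, not confirmed) and that the fetch function returns JSON-serializable data. `fetch_batch` and the cache file name are hypothetical.

```python
import json
from pathlib import Path


def fetch_batch(ciks, fetch, state_path, daily_limit=100):
    """Fetch results for up to daily_limit CIKs not yet cached; return the CIKs processed."""
    state_file = Path(state_path)
    done = json.loads(state_file.read_text()) if state_file.exists() else {}
    processed = []
    try:
        for cik in ciks:
            key = str(cik)
            if key in done:
                continue  # already gathered on an earlier run
            if len(processed) >= daily_limit:
                break  # budget spent; pick up the rest on the next run
            done[key] = fetch(cik)
            processed.append(key)
    finally:
        # Save progress even if a request fails partway through (e.g., a 429)
        state_file.write_text(json.dumps(done))
    return processed


# Example call (not run here):
# fetch_batch(df_mining_companies.cik, query_sec_filings, "sec_filings_cache.json")
```

Running this once a day from a scheduler would slowly build the baseline, and the cache doubles as the record of what we have already pulled.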

# Building GeoKB Items

Stymied with the API throttle for SEC EDGAR, we can revisit the process of building items for our companies. I'm also somewhat stuck there at the moment because of an issue I'm having with our pilot Wikibase instance and an error I'm getting on items after building claims (see the Initialize GeoKB notebook). But it would be useful to take this exercise to the point of exploring how we'd model commercial company items into the GeoKB.

There are a couple of fundamental approaches we might take to the process of loading items and claims into the GeoKB. One approach could be to get data into a common format and then present records to a processor somewhere. This is essentially how things work with the [QuickStatements](https://www.wikidata.org/wiki/Help:QuickStatements) tool. This is a longstanding tool developed to operate Wikidata, and we're working to get it spun up now as an option for our GeoKB.

QuickStatements scripts are essentially read line by line as a series of commands to take actions on a Wikibase instance. The QuickStatements interface will also accept tabular records in a particular structure as CSV, and there are some nice mature tools in OpenRefine for building out a QuickStatements structure. Rather than coming up with our own tooling for this, we're going to stick with the QuickStatements/OpenRefine tooling as far as we can for that approach.
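To make the line-by-line idea concrete, here is a minimal sketch that writes QuickStatements V1 commands (tab-separated) to create one company item with a label, description, alias, and a string-valued claim. The property ID `P999` is a placeholder, not a real GeoKB property, the description text is only illustrative, and the helper does not escape quotes inside values.

```python
def company_to_quickstatements(label, description, aliases=(), claims=()):
    """Build QuickStatements V1 commands that create one item."""
    lines = [
        "CREATE",
        f'LAST\tLen\t"{label}"',        # English label
        f'LAST\tDen\t"{description}"',  # English description
    ]
    lines += [f'LAST\tAen\t"{alias}"' for alias in aliases]
    # String-valued claims such as external identifiers
    lines += [f'LAST\t{prop}\t"{value}"' for prop, value in claims]
    return "\n".join(lines)


commands = company_to_quickstatements(
    label="Paramount Gold Nevada Corp.",
    description="commercial company registered with the SEC",
    aliases=["PARAMOUNT GOLD NEVADA CORP"],
    claims=[("P999", "5493000CWEBEVDLIW256")],  # P999 is a placeholder property ID
)
print(commands)
```

The resulting text can be pasted into the QuickStatements interface as a batch, which is the same route OpenRefine exports take.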

The other approach to this is following a path that is also pretty mature in Wikidata/Wikibase - building bots. These are essentially purpose-built tools designed to take some type of action, either introducing new entities, claims, qualifiers, and references or taking some transformational action on existing information in a Wikibase instance. This is what I've been exploring for initializing the GeoKB with our properties and classification options.

This notebook exploration of SEC companies and filings could go either route. I'll take a few steps here down the path of working this up as a bot using the pywikibot package. I'm including a developing set of functions that we'll eventually move out into their own package somewhere to import into a process like this once we 

## Company Modeling and Semantics

Building on the notes above on "knowledgebase mapping," we need to make some decisions about what to send into the GeoKB. Here is some of my reasoning:

* The primary label is important as it will show up in all kinds of reports and will be used as a human-centric identifier for companies. It looks like the LEI legal name is sometimes a little better or more comprehensive, so we'll prefer that as the primary label where we could get a match. We could try and title case the names that are all caps and may come back to that later, but I'll leave those for now.
* Where the SEC name is different from the LEI legal name, we'll add that in as an alias along with any other names turned up from the LEI records. We'll just use English language names for now until we work through multilingual support in our Wikibase instance.
* Descriptions are also important basic information on every item. When we put information together into different contexts, we often want to see a short description for things to help distinguish similarly named entities. This can also be done through classification (e.g., "instance of" claims), but a description can also suffice. Making up descriptions can get a bit tricky, though, in that we don't want to introduce conflicting semantics or necessarily have to keep up with changes in other parts of a record.
* We are trying to keep higher level classification as simple as possible and not go down the rabbit hole that Wikidata has on incredibly specific properties and classifiers. We already have entity>organization from our initialization work so far. We might want to build in entity>organization>commercial company, but I don't think we want to go deeper than that at this point. Rather, we can build on the classification by adding in other logical properties that our exploration of the SEC and LEI sources suggest. For instance, the sicSector and sicIndustry properties from the SEC EDGAR records seem to be useful descriptive concepts to further classify commercial companies that we should be able to apply in other cases. Process-wise, that means we need to build out additional items as a reference base so we have something to link to on the other end of a property.
* The SEC records include both the SIC classification for industries and an older industry classification system (FAMA). The terminology used in the FAMA classification seems like it might be more useful for our purposes; less prone to conflation with other concepts like the mineral commodities that a specific mining company might be associated with through other data linking. We can include both and see how it plays out in practice. Taking a more simplistic approach, we might put these in as "subject matter" claims, letting the characterization of the objects for those claims, references, and any qualifiers that might be appropriate handle the deeper level significance. In any case, we need to also build out the SIC and FAMA industry classification references in our GeoKB for use.
* The "sector" and "industry" classifiers that appear to be specific to SEC EDGAR seem to be pertinent to SEC's way of looking at the world and perhaps not as useful for us to bring into the GeoKB. Values like "Gold" and "Silver" as "industries" could complicate our semantics, essentially meaning we would need to make those items in the GeoKB not only instances of "mineral commodity" and "chemical element" but also something totally different like "SEC industry." 
* We need to handle the intersection of the SEC EDGAR location property with the LEI jurisdiction property. These look like they logically align, but we'll have to see if that's actually the case. In any case, we will want to introduce these concepts into our system along with their associated reference foundation (place names). Both "location" and "jurisdiction" are somewhat problematic and contextual as concepts. The more detailed location information (legal address, headquarters) from LEI is much more specific but perhaps less useful. One approach would be to take the very high level concept of any kind of location as the property, point it to reference items like "United States of America" and "Nevada," and qualify those with some kind of descriptor that indicates that the values indicate a governmental jurisdiction under which the commercial company entity is governed. Or we could create a specific property like "governance jurisdiction" that comes along with that specific semantic significance and simply point at the reference for the information.
* There are a number of different identifiers for our recordset. The CIK from SEC EDGAR and LEI identifier seem to be the most important as we can leverage those as unique, persistent, resolvable identifiers for linkages to other information. The stock ticker and CUSIP identifiers could be useful for some kinds of linkages, but we may just ignore those for now as not really pertinent to our current GeoKB use cases. What stock exchange a company trades on is not really all that useful for our purposes either.
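The label and alias rules from the first two points can be sketched as a small function. It assumes the `lei_info` dict shape returned by the lookup function earlier in this notebook and that other-name entries carry `name` and `language` keys (an assumption to verify). `choose_label_and_aliases` is a hypothetical name.

```python
def choose_label_and_aliases(sec_name, lei_info=None):
    """Prefer the LEI legal name as the label; keep differing English names as aliases."""
    label = lei_info["legal_name"] if lei_info else sec_name
    candidates = [sec_name]
    if lei_info:
        candidates += [
            entry["name"]
            for entry in lei_info.get("other_names", [])
            if entry.get("language") == "en"
        ]
    aliases = []
    for name in candidates:
        # Skip the label itself and any repeated names
        if name != label and name not in aliases:
            aliases.append(name)
    return label, aliases


print(choose_label_and_aliases(
    "PARAMOUNT GOLD NEVADA CORP",
    {"legal_name": "Paramount Gold Nevada Corp.", "other_names": []},
))
```

Title casing the all-caps names is deliberately left out here, matching the decision above to leave those for now.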

## Reference Sources Specific to this Exploration

Before proceeding too far with building out company items, we need to introduce some fundamental reference sources to the knowledgebase. Company items themselves are a reference source for the document items (e.g., S-K 1300 reports) that we are really after here as the heart of our use case. The notes above suggest we need to build a process to add in the following:

* Standard Industrial Classification (SIC) code list
    * Usable source - https://www.sec.gov/corpfin/division-of-corporation-finance-standard-industrial-classification-sic-code-list
    * Needs a new high level classifier - "industrial classification"
    * Need to decide if we will bring all of these into the GeoKB now or just those we're focused on for mining industries
* FAMA ("French") codes
    * This is an interesting case that we will run into elsewhere in that the fundamental source for this classification system is essentially copyrighted with no clear license for use. It is part of an academic project and longstanding financial analysis effort by a Dartmouth professor (https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/Data_Library/det_48_ind_port.html). We can take SEC values and put those terms into our GeoKB source for use, but we couldn't really legitimately go and process the entire source for these if we wanted to do that without clarifying the license issue.
    * I think we can basically just take the values we encounter on companies we care about for our use case, introduce them as instances of "industrial classification," attribute them in their description, and reference them to where we get them from SEC EDGAR. There might be some better ways to model these, but we'll have to think that through as a general rule in dealing with copyrighted/ambiguous license information content.
* CIK and LEI identifiers
    * We're following the Wikidata approach on identifiers, setting them up with their resolvers built in so that humans and machines can follow them to their source, resolve them, and get whatever functionality they indicate. We need to work that out for this case still to see if we can get to something that supports content negotiation (meaning the same URL pattern can work for humans and machines).
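As a starting point for the resolver question, the sketch below formats machine-readable URLs for the two identifiers. It assumes the SEC submissions endpoint on data.sec.gov (which expects a 10-digit, zero-padded CIK) and the GLEIF API record endpoint used earlier in this notebook. Whether either supports content negotiation for human readers is still an open question, and the function names are hypothetical.

```python
# Assumed URL patterns; verify before registering them as formatter URLs
CIK_URL = "https://data.sec.gov/submissions/CIK{cik}.json"
LEI_URL = "https://api.gleif.org/api/v1/lei-records/{lei}"


def cik_url(cik):
    """Zero-pad the CIK to 10 digits and build the submissions URL."""
    return CIK_URL.format(cik=str(int(cik)).zfill(10))


def lei_url(lei):
    """Build the GLEIF API record URL for an LEI."""
    return LEI_URL.format(lei=lei.strip().upper())


print(cik_url(2809))
print(lei_url("5493000CWEBEVDLIW256"))
```

In Wikibase terms, these patterns would become the formatter URL on each identifier property, with the stored value being the bare identifier.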

## Other Reference Sources

The place name source is not at all specific here, so I'll be working that one up in a separate notebook.

In [56]:
sparql_endpoint = os.environ['SPARQL_ENDPOINT']
wb_domain = os.environ['WB_DOMAIN']
geokb_init_sheet_id = '1dbuKc4cZJz0YY81B2xWXM5fId6gWgzmQar3hg3CI0Rw'

accepted_languages = ['en']

In [65]:
# Start to deployable functions
from utils import (
    get_wb
)

In [63]:
geokb_site = get_wb('geokb')