This notebook explores a new source of information we need to work through for the "Assessment Analytical Framework" we are developing to support mineral resource assessments in the USGS. The framework consists of various technologies designed to produce analysis-ready data for the assessment process. A big part of that is developing workflows, as automated as possible, that pull source data and information and "mobilize" them for use.

For many years, we've been collecting and working with a type of technical report required of mining companies based in Canada (NI 43-101 Technical Report) as a source for mineral deposit type information and other geoscientific details. The Securities and Exchange Commission (SEC) in the U.S. now also [requires this type of information](https://www.sec.gov/corpfin/secg-modernization-property-disclosures-mining-registrants) to be filed as a regular submission by any publicly traded mining company in the U.S. Similar to the NI 43-101 Reports that we have assembled for use in a Zotero collection, we also need to start pulling and organizing the S-K 1300 Technical Reports required by the SEC. All of the same basic requirements apply:

* Provide an online locale for the reports that is easily accessible and manageable by assessment geologists and science support staff
* Provide for consistent, publicly citable references in reports and articles, including ensuring long-term viability of the citable references
* Provide a mechanism for routine annotation of reports by assessment geologists and science support staff
* Provide a means for the report contents to be fed into AI/ML pipelines for further processing to identify/link entities (mineral deposit sites, etc.) and automate aspects of turning information into analyzable data

As an improvement over the situation with the [SEDAR](https://www.sedar.com/) source for NI 43-101 Reports, where the web site is expressly protected against robot interfaces, the SEC's EDGAR platform provides some decent [direct data interfaces](https://www.sec.gov/edgar/sec-api-documentation) to support automated retrieval of the company submissions we are interested in. Data from the SEC is of great interest in many different circumstances, and the data handling and availability consequently seem quite mature.
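
As a rough sketch of what a call against those interfaces could look like, the snippet below builds (but does not send) a request for one company's submissions document. The `data.sec.gov/submissions/` URL pattern and the expectation that callers declare a `User-Agent` reflect my reading of the SEC's API documentation and should be checked there; the helper name and the contact string are placeholders.

```python
from urllib.request import Request

def build_submissions_request(cik, user_agent):
    """Build a request for one company's submissions JSON (not sent here)."""
    # CIKs are zero-padded to 10 digits, matching the bulk file names
    url = f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"
    return Request(url, headers={"User-Agent": user_agent})

# Placeholder contact string; substitute a real organization and email
req = build_submissions_request("1942693", "Example Org contact@example.org")
print(req.full_url)
```

Passing the result to `urllib.request.urlopen` would perform the actual retrieval.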

Our initial starting point is likely the set of two bulk data downloads compiled nightly. Both of these are zip files with JSON documents for each unique identified company over which the SEC exercises regulatory authority. One of these contains a set of "company facts," and the other lists the submissions the companies have filed with the SEC (which should include the S-K 1300 Technical Reports).

Bulk downloads are often a better starting point, or even a long-term source, for data from an origin like this. They give us a big tranche of data that we have to download and stand up in our own data infrastructure, but they help us avoid problems that can arise when operating against purpose-built APIs that may be less than fully stable or may not provide the most efficient route to the answers we want. In this case, as with many others, we really only want a small slice of what the overall source contains (the SEC regulates many companies, only some of which are involved in mineral resources).
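
One way to work with a bulk file without unpacking thousands of JSON documents to disk is to read members straight out of the zip archive. This is only a sketch: `iter_bulk_json` is a hypothetical helper, and it assumes every `.json` member is a single JSON document, as the description of the downloads suggests.

```python
import json
import zipfile

def iter_bulk_json(zip_path, limit=None):
    """Yield (member_name, parsed_document) pairs from a bulk zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # Only consider JSON members; anything else in the archive is skipped
        names = [n for n in archive.namelist() if n.endswith(".json")]
        for name in names[:limit]:
            with archive.open(name) as member:
                yield name, json.load(member)
```

Setting `limit` to a small number is a cheap way to peek at a handful of companies before committing to a full pass over the archive.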

The nightly files appear to contain current information, and it's not yet clear whether we will need to tap some other source (like the live APIs) for historic facts (if we are interested in any of those) or submissions. One way or the other, we need to narrow in on those companies required to file S-K 1300 Technical Reports. The remainder of this notebook starts to crack open the bulk downloads to explore how we might go about effectively working this source into our analytical framework.

In [2]:
import json
from glob import glob

# Company Facts
The company facts download pulled on 8/15/2022 contains 16,410 files, presumably representing each of the uniquely identified companies with a "CIK" identifier that the SEC tracks. I pulled just a few of these to take a look.

In [3]:
sample_companyfacts_files = glob('/home/skybristol/experiments/data/sec_edgar/companyfacts/*')
sample_companyfacts_files

['/home/skybristol/experiments/data/sec_edgar/companyfacts/CIK0001892274.json',
 '/home/skybristol/experiments/data/sec_edgar/companyfacts/CIK0001898496.json',
 '/home/skybristol/experiments/data/sec_edgar/companyfacts/CIK0001885827.json']

In [4]:
with open(sample_companyfacts_files[0], "r") as f:
    sample_companyfacts = json.load(f)

Beyond the CIK identifier the SEC uses, each company has a name, which raises a separate issue we may or may not need to deal with. Some downstream uses of these data may require linking together multiple sources of data tied to a given company entity. That could mean handling different variants of a name, or temporal issues where company names or identifiers change.

In [5]:
sample_companyfacts["entityName"]

'Visionary Education Technology Holdings Group Inc.'

## Company Facts leading to "Mining Registrant"
The bulk of the data in the company facts information is contained in a part of the structure coded as "facts/us-gaap." A short bit of digging shows that the properties here track to a [standard financial reporting taxonomy](https://www.fasb.org/xbrl) the SEC has [adopted](https://www.sec.gov/info/edgar/edgartaxonomies.shtml#USGAAP2019). This gives us a nice starting point to understand the data and figure out what we can use for what purposes.

We can approach our problem from a couple of different perspectives. It might be useful to identify all of the companies that are considered "mining registrants." This is the term the SEC uses in its [guidance](https://www.sec.gov/corpfin/secg-modernization-property-disclosures-mining-registrants). So, one question is, how do we figure out which companies (of the, presumably, 16,410) are mining registrants? We need to do some digging to see if there is something specific or a combination of attributes that would help us work this out.

In [6]:
list(sample_companyfacts["facts"]['us-gaap'].keys())

['AccountsPayableCurrent',
 'AccountsReceivableGross',
 'AccountsReceivableNet',
 'AccountsReceivableNetCurrent',
 'AccountsReceivableRelatedParties',
 'AccountsReceivableRelatedPartiesCurrent',
 'AccruedIncomeTaxes',
 'AccruedIncomeTaxesCurrent',
 'AccruedLiabilitiesCurrent',
 'AccruedLiabilitiesForCommissionsExpenseAndTaxes',
 'AccumulatedDepreciationDepletionAndAmortizationPropertyPlantAndEquipment',
 'AccumulatedOtherComprehensiveIncomeLossNetOfTax',
 'AdditionalPaidInCapital',
 'AllowanceForDoubtfulAccountsReceivable',
 'AntidilutiveSecuritiesExcludedFromComputationOfEarningsPerShareAmount',
 'AssetImpairmentCharges',
 'Assets',
 'AssetsCurrent',
 'BusinessCombinationPriceOfAcquisitionExpected',
 'BusinessCombinationRecognizedIdentifiableAssetsAcquiredAndLiabilitiesAssumedIndefiniteLivedIntangibleAssets',
 'BusinessCombinationRecognizedIdentifiableAssetsAcquiredGoodwillAndLiabilitiesAssumedNet',
 'CapitalExpenditureDiscontinuedOperations',
 'Cash',
 'CashCashEquivalentsRestrictedC

In [9]:
sample_companyfacts["facts"]['us-gaap']['Assets']

{'label': 'Assets',
 'description': 'Sum of the carrying amounts as of the balance sheet date of all assets that are recognized. Assets are probable future economic benefits obtained or controlled by an entity as a result of past transactions or events.',
 'units': {'USD': [{'end': '2021-03-31',
    'val': 13667102,
    'accn': '0001683168-22-005640',
    'fy': 2022,
    'fp': 'FY',
    'form': '20-F',
    'filed': '2022-08-12',
    'frame': 'CY2021Q1I'},
   {'end': '2022-03-31',
    'val': 36226584,
    'accn': '0001683168-22-005640',
    'fy': 2022,
    'fp': 'FY',
    'form': '20-F',
    'filed': '2022-08-12',
    'frame': 'CY2022Q1I'}]}}
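
The nesting shown above (concept, then `units`, then a list of observations per unit) is awkward to analyze directly. A minimal sketch of flattening one concept into rows follows; `flatten_concept` is a hypothetical helper, and the sample dictionary is a trimmed copy of the 'Assets' output above.

```python
def flatten_concept(concept):
    """Turn one us-gaap concept into a flat list of row dictionaries."""
    rows = []
    for unit, observations in concept.get("units", {}).items():
        for obs in observations:
            # Carry the concept label and unit down onto every observation
            rows.append({"label": concept.get("label"), "unit": unit, **obs})
    return rows

# Trimmed copy of the 'Assets' structure printed above
assets = {
    "label": "Assets",
    "units": {"USD": [
        {"end": "2021-03-31", "val": 13667102, "form": "20-F"},
        {"end": "2022-03-31", "val": 36226584, "form": "20-F"},
    ]},
}
for row in flatten_concept(assets):
    print(row["end"], row["unit"], row["val"])
```

Rows in this shape could be handed to a table library or a database loader with no further reshaping.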

In [8]:
sample_companyfacts["facts"]['us-gaap']['Land']

{'label': 'Land',
 'description': 'Amount before accumulated depletion of real estate held for productive use, excluding land held for sale.',
 'units': {'USD': [{'end': '2022-03-31',
    'val': 4400000,
    'accn': '0001683168-22-005640',
    'fy': 2022,
    'fp': 'FY',
    'form': '20-F',
    'filed': '2022-08-12',
    'frame': 'CY2022Q1I'}]}}
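
One exploratory way to hunt for attributes that hint at a mining registrant is to search a company's reported concept names for keywords. The sketch below is only that: `find_concepts` is a hypothetical helper, and keywords such as "mineral" or "mining" would be guesses to try, not confirmed taxonomy elements.

```python
def find_concepts(companyfacts, keywords, taxonomy="us-gaap"):
    """Return sorted concept names containing any keyword (case-insensitive)."""
    concepts = companyfacts.get("facts", {}).get(taxonomy, {})
    lowered = [k.lower() for k in keywords]
    return sorted(
        name for name in concepts
        if any(k in name.lower() for k in lowered)
    )

# Tiny stand-in document; real files hold many more concepts
doc = {"facts": {"us-gaap": {"Assets": {}, "Land": {}, "AssetsCurrent": {}}}}
print(find_concepts(doc, ["land"]))
```

Run across many company files, counts of matching concepts could show whether any single concept, or a combination, separates mining companies from the rest.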

# Submissions
The main thing we are after from the SEC EDGAR source is the set of S-K 1300 Technical Reports. These should be contained and referenced in the submissions part of the EDGAR data model. Here, we grab a few of these files to take a look at what they contain.

In [11]:
sample_submissions_files = glob('/home/skybristol/experiments/data/sec_edgar/submissions/*')
sample_submissions_files

['/home/skybristol/experiments/data/sec_edgar/submissions/CIK0001942630.json',
 '/home/skybristol/experiments/data/sec_edgar/submissions/CIK0001942661.json',
 '/home/skybristol/experiments/data/sec_edgar/submissions/CIK0001942693.json']

In [16]:
with open(sample_submissions_files[2], "r") as f:
    sample_submissions = json.load(f)

In [17]:
sample_submissions

{'cik': '1942693',
 'entityType': 'other',
 'sic': '',
 'sicDescription': '',
 'insiderTransactionForOwnerExists': 0,
 'insiderTransactionForIssuerExists': 0,
 'name': 'Imagen Lakeshore Dental Support Services, LLC',
 'tickers': [],
 'exchanges': [],
 'ein': '883222397',
 'description': '',
 'website': '',
 'investorWebsite': '',
 'category': '',
 'fiscalYearEnd': '1231',
 'stateOfIncorporation': 'DE',
 'stateOfIncorporationDescription': 'DE',
 'addresses': {'mailing': {'street1': '16220 NORTH SCOTTSDALE ROAD',
   'street2': 'SUITE 300',
   'city': 'SCOTTSDALE',
   'stateOrCountry': 'AZ',
   'zipCode': '85254',
   'stateOrCountryDescription': 'AZ'},
  'business': {'street1': '16220 NORTH SCOTTSDALE ROAD',
   'street2': 'SUITE 300',
   'city': 'SCOTTSDALE',
   'stateOrCountry': 'AZ',
   'zipCode': '85254',
   'stateOrCountryDescription': 'AZ'}},
 'phone': '(480) 590-3396',
 'flags': '',
 'formerNames': [],
 'filings': {'recent': {'accessionNumber': ['0001942693-22-000001'],
   'filingDa

This structure contains some useful additional company details that may or may not be in the company facts information. I'm particularly intrigued by the formerNames array, which could be quite helpful.
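
To make those details easier to compare across companies, a small summary can be pulled from each submissions document. This sketch rests on two assumptions that need checking against real data: that `formerNames` entries are dictionaries with a `name` key (the sample above only shows an empty list), and that SIC codes starting with 10, 12, or 14 are a reasonable first proxy for mining companies. `summarize_company` and `MINING_SIC_PREFIXES` are hypothetical names.

```python
# Assumption: SIC major groups 10 (metal mining), 12 (coal mining) and
# 14 (nonmetallic minerals) roughly approximate "mining registrants"
MINING_SIC_PREFIXES = ("10", "12", "14")

def summarize_company(submissions):
    """Pull identifying details from one submissions document."""
    sic = submissions.get("sic") or ""
    # Assumption: each former name entry is a dict carrying a 'name' key
    former = [
        entry.get("name") if isinstance(entry, dict) else str(entry)
        for entry in submissions.get("formerNames", [])
    ]
    return {
        "cik": submissions.get("cik"),
        "names": [submissions.get("name")] + former,
        "sic": sic,
        "mining_sic": sic.startswith(MINING_SIC_PREFIXES),
    }
```

The sample company above has an empty `sic`, so a SIC-based filter alone would miss any company with that field left blank.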

We're going to need to dig through some more examples to figure out how to use the submission information effectively. It looks like we might be able to find what we need within the filings part of the document structure if we can suss out what an S-K 1300 filing looks like. We can do this by digging up a company that we know has to file the reports, tracing them through the SEC EDGAR data structures, and seeing if we can learn the query pattern we could use. We'll start that next using an [example](https://www.sec.gov/Archives/edgar/data/1001838/000155837022002995/scco-20211231ex968d77397.pdf) Mike Zientek provided.
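
As a starting sketch for that tracing, the `filings.recent` block appears to be column-oriented (parallel lists such as `accessionNumber`), so the first step is to transpose it into one row per filing. The `form` key and both helper names are assumptions; the folder URL pattern is inferred from the example link above. The technical report in that example looks like an exhibit (its file name contains "ex96"), so filtering by form type would only narrow the candidates, not locate the report itself.

```python
def recent_filings(submissions, forms=None):
    """Transpose the column-oriented 'recent' filings into row dictionaries."""
    recent = submissions.get("filings", {}).get("recent", {})
    columns = list(recent)
    # Assumes every column is a list of the same length
    rows = [dict(zip(columns, values)) for values in zip(*recent.values())]
    if forms is not None:
        rows = [r for r in rows if r.get("form") in forms]
    return rows

def filing_folder_url(cik, accession_number):
    """Build the EDGAR archive folder URL for one filing."""
    return (
        "https://www.sec.gov/Archives/edgar/data/"
        f"{int(cik)}/{accession_number.replace('-', '')}/"
    )

# Made-up stand-in values, for illustration only
sub = {"filings": {"recent": {
    "accessionNumber": ["0000000000-22-000001", "0000000000-22-000002"],
    "form": ["10-K", "8-K"],
}}}
print(recent_filings(sub, forms={"10-K"}))
```

Calling `filing_folder_url("1001838", "0001558370-22-002995")` reproduces the folder portion of the example link.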