This notebook explores a new source of information we need to work through for the "Assessment Analytical Framework" we are developing to support mineral resource assessments in the USGS. The framework consists of various technologies designed to produce analysis-ready data for the assessment process. A big part of that is the development of workflows, as automated as possible, that pull source data and information and "mobilize" them for use.

For many years, we've been collecting and working with a type of technical report required of mining companies based in Canada (NI 43-101 Technical Report) as a source for mineral deposit type information and other geoscientific details. The Securities and Exchange Commission (SEC) in the U.S. now also [requires this type of information](https://www.sec.gov/corpfin/secg-modernization-property-disclosures-mining-registrants) to be filed as a regular submission from any publicly traded mining companies in the U.S. Similar to the NI 43-101 Reports that we have assembled for use in a Zotero collection, we also need to start pulling and organizing the S-K 1300 Technical Reports required by the SEC. All of the same basic requirements apply:

* Provide an online locale for the reports that is easily accessible and manageable by assessment geologists and science support staff
* Provide for consistent, publicly citable references in reports and articles, including ensuring long-term viability of the citable references
* Provide a mechanism for routine annotation of reports by assessment geologists and science support staff
* Provide a means for the report contents to be fed into AI/ML pipelines for further processing to identify/link entities (mineral deposit sites, etc.) and automate aspects of turning information into analyzable data

As an improvement over the situation with the [SEDAR](https://www.sedar.com/) source for NI 43-101 Reports, where we have a web site expressly protected against robot interfaces, the SEC's EDGAR platform provides some decent [direct data interfaces](https://www.sec.gov/edgar/sec-api-documentation) to support automated retrieval of the company submissions we are interested in. Data from the SEC is of great interest in a lot of different circumstances, and their data handling and availability seem to be quite mature as a consequence.

Our initial starting point is likely the set of two bulk data downloads compiled nightly. Both of these are zip files with JSON documents for each unique identified company over which the SEC exercises regulatory authority. One of these contains a set of "company facts," and the other lists the submissions the companies have filed with the SEC (which should include the S-K 1300 Technical Reports).

Many times, bulk downloads are a much better starting point or even long-term source for data from an origin point like this. They give us a big tranche of data that we have to download initially and spin up into some kind of our own data infrastructure for use, but they help us avoid a number of problems that can arise when trying to operate against what are often purpose-built APIs that may be less than fully stable or may not provide the most efficient means for us to obtain the answers we want. In this case as with many others, we really only want a small slice of what the overall source contains (the SEC regulates many companies, only some of which are involved in mining).

After working through the two collections of bulk download files in previous versions of this notebook, we really only need to focus on the submissions data. These files give us useful details about the companies regulated by the SEC, including a code we can use to identify companies involved in mining, along with the filings the companies have made, which will include our target documents. We'll be setting up a data processing workflow that will involve the following steps:

1. Download and unzip the submissions bulk download (likely to S3 storage). We'll store everything initially but expunge records (JSON documents) for non-mining companies once we tease those out. The bucket containing individual company submission JSON source files can be versioned so that new versions of documents (they are file-named with the CIK identifier) trigger processing. In steps 2 and 3, we build a database from the extracted parts of these records that are important to us.
2. Work through all submissions files using the Standard Industrial Classification (SIC) code to identify mining-related companies and store basic company details in a registry (database). Once this initial pass is done, we can shift to a lighter weight regular process of using the Central Index Key (CIK) identifiers to check the submissions for updates as opposed to needing to work through the entire set of records. We will put a process in place to periodically check for new mining companies to add to our registry.
3. Within the submissions files for each company, there are records for "10-K Annual Filings." These are the submissions that contain S-K 1300 Technical Reports when those are available. We have to first identify all 10-K annual filings and then run a separate process to examine two HTML pages associated with the archive accession numbers for the filings. From these, we get identifiers, titles, and links for the technical reports themselves. We'll create another data store for the basic information on annual reports and technical reports from this process so we have an overall inventory of what we're going after, and then have a process that retrieves and stores the files.
4. The technical report inventory data store will give us titles, dates, and provenance information that we can use to stub out bibliographic records. Once files are retrieved, we can have another process that writes bibliographic metadata and file content to a Zotero group library for the S-K 1300 reports. This will put them in the same infrastructure as the NI 43-101 Technical Reports, allowing us to feed them into xDD and other processing as well as using them as citable references in assessments.

The newer version of the code notebook below works through the logic in these steps with a start to the functional code we'll build into a package and then automate. I'm shifting development work to the USGS CHS Pangeo environment where we are able to work on CHS resources in the us-west-2 region.

## Note on Company Facts
There are potentially interesting details in the "company facts" data that we could reexamine at a later time. There are many accounting details, including some attributes specific to mining such as the expense outlays on mining equipment and mineral revenues. Once the registry of mining companies and their associated identifiers are in place, we could build another process to periodically pull company facts and extract out the interesting bits we'd want to use in analyses.

In [None]:
!pip install awswrangler

In [34]:
# Standard library
import io
import json
import os
import zipfile
from glob import glob
from pathlib import Path

# Third-party packages
import awswrangler as wr
import boto3
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Establish AWS Connection

In order to work with anything in AWS, we have to have an authenticated session that picks up and establishes roles for access that are authorized to do certain things. When we build this out as an automated process, we'll be operating the code as a Lambda or container application on AWS through which a role will be established specific to that process. For development and testing purposes, we need to use our own authenticated access tied to our login. CHS policy has limited this type of access to using SAML with Active Directory credentials that can only be established when connected to the network. In order to work with this from the Pangeo environment, I've worked up a process whereby I authenticate from my local workstation and then upload the necessary token information that is generated by that process. This is a temporary set of tokens that lasts only for a session duration (8 hours) and then must be refreshed.

When operating this notebook from the CHS Pangeo environment, the environment it is running in has a connection established to CHS with a role that has access to certain resources. We need to establish a different session using our own credentials in order to work with the AWS resources we are using for this exercise. The following functions check the file with our temporary cache of tokens, set those as environment variables if necessary, and then use them to establish a session using the Python Boto3 package. We assign the session to a variable and then use it in our connections.

In [30]:
def set_env():
    home = str(Path.home())
    env_json_path = f"{home}/.aws/env.json"

    if "AWS_ACCESS_KEY_ID" in os.environ:
        print("Required environment variables set.")
        return True

    print("Required environment variables not set. Attempting to read from file.")
    
    if os.path.exists(env_json_path):
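        # Use a context manager so the token file handle is closed after reading
        with open(env_json_path, "r") as env_file:
            env = json.load(env_file)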
        for k,v in env.items():
            os.environ[k] = str(v)
        return True
    else:
        print(f"Required file with environment variables not available: {env_json_path}")
        return False

def boto3_session():
    if set_env():
        session = boto3.Session(
            aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
            aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
            aws_session_token=os.environ["AWS_SESSION_TOKEN"]
        )
        return session


# Interacting with SEC EDGAR
Much of the SEC EDGAR system is designed to be crawled by "polite bots" that declare themselves appropriately. We need to send a user agent string that lets the SEC system know who we are. Otherwise, they will pick up that we're coming from a script and block our access. We're also limited to 10 requests per second when we need to go after information in bulk.

In [36]:
user_agent = "US Geological Survey;https://www.usgs.gov/"
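
To stay under the 10 requests per second limit once we start crawling in bulk, we could put a small throttle in front of every request. The following is a minimal sketch rather than the approach this notebook settles on; `RateLimiter` and `polite_get` are hypothetical helper names, and the `session` argument is assumed to be a `requests.Session` that already carries our user agent header.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between calls (hypothetical helper)."""

    def __init__(self, max_per_second=10, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


def polite_get(session, url, limiter):
    """Throttle, then issue a GET through a requests.Session-like object."""
    limiter.wait()
    response = session.get(url)
    response.raise_for_status()
    return response
```

The clock and sleep functions are injectable only so the throttle can be exercised without real delays. A single limiter instance would need to be shared by everything that talks to the SEC for the limit to hold overall.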

In [31]:
aws_session = boto3_session()

Required environment variables set.


# Mining Company Registry
In order to identify and download the S-K 1300 Technical Reports on individual mining properties, we have to essentially crawl the appropriate SEC filings (10-K annual reports), pick out the technical reports included in those accessions, and put a couple pieces of information together. To do that effectively over time and at scale, we need to home in on the mining companies and those involved in mining who will be filing these particular reports. That means building our own registry of mining companies and their associated CIK identifiers.

Fortunately, there are many different ways of accessing data programmatically from the SEC EDGAR system. Not all routes into the system, though, contain the information we need for the document gathering process we are undertaking. The [company search interface] for human users provides what seems to be the most direct way of using a Standard Industrial Classification (SIC) code to retrieve a list of company names and identifiers. I haven't, however, found a programmatic way to do this as the APIs don't offer that functionality, and writing a web scraper against the search results would be kind of brittle.

I've decided to take the bulk data download approach for these reasons and the following additional benefits:
* The bulk submissions data (JSON documents for each company) contains some additional details about the companies that might be useful in our registry in connecting dots to other information sources
* The submissions data does contain a full listing of all the filing submissions for a given company, providing our initial pointers to the annual report filings where we'll be able to search for the technical reports we're after

## 1) Get Bulk Submissions Data
The following code block executes a function that downloads the daily compilation of submissions, unzips that in memory, and loads the individual JSON documents to an S3 path for later use. We only need to do this periodically when we need to assemble our mining company registry. There may also be an RSS feed method that would let us get notifications of newly registered companies to go and evaluate individually.

Note: There should be a more efficient way to handle this kind of large-file unzip operation, but I haven't figured it out yet. I think that the zip iterable will support parallel operation so that once we download the bytes from the source and read them, we could parallelize the process of reading and writing each file object from the memory-stored Zip bytes to S3. For now, this code just runs the whole file sequentially, which takes a ton of time.

In [None]:
def download_upload_zip(download_url, upload_path, user_agent):
    headers = {"User-Agent": user_agent}
    response = requests.get(download_url, headers=headers)
    # Fail early on a blocked or failed download instead of unzipping an error page
    response.raise_for_status()
    with zipfile.ZipFile(io.BytesIO(response.content)) as thezip:
        for zipinfo in thezip.infolist():
            with thezip.open(zipinfo) as thefile:
                target_path = f"{upload_path}/{zipinfo.filename}"
                print(f"Uploading: {target_path}")
                wr.s3.upload(
                    local_file=thefile,
                    path=target_path,
                    boto3_session=aws_session
                )

download_upload_zip(
    download_url="https://www.sec.gov/Archives/edgar/daily-index/bulkdata/submissions.zip",
    upload_path="s3://usgs-geoarchive/sec_edgar/submissions",
    user_agent=user_agent
)

Uploading: s3://usgs-geoarchive/sec_edgar/submissions/CIK0001221093.json
Uploading: s3://usgs-geoarchive/sec_edgar/submissions/CIK0001221091.json
Uploading: s3://usgs-geoarchive/sec_edgar/submissions/CIK0001221130.json
Uploading: s3://usgs-geoarchive/sec_edgar/submissions/CIK0001221146.json
Uploading: s3://usgs-geoarchive/sec_edgar/submissions/CIK0001221158.json
Uploading: s3://usgs-geoarchive/sec_edgar/submissions/CIK0001221092.json
Uploading: s3://usgs-geoarchive/sec_edgar/submissions/CIK0001221178.json
Uploading: s3://usgs-geoarchive/sec_edgar/submissions/CIK0001221094.json
Uploading: s3://usgs-geoarchive/sec_edgar/submissions/CIK0001221152.json
Uploading: s3://usgs-geoarchive/sec_edgar/submissions/CIK0001221161.json
Uploading: s3://usgs-geoarchive/sec_edgar/submissions/CIK0001221095.json
Uploading: s3://usgs-geoarchive/sec_edgar/submissions/CIK0001221096.json
Uploading: s3://usgs-geoarchive/sec_edgar/submissions/CIK0001221098.json
Uploading: s3://usgs-geoarchive/sec_edgar/submissio
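
The parallelization idea from the note above could look something like the following. This is only a sketch under a few assumptions: members are still read from the in-memory zip one at a time, and only the uploads are handed to a thread pool. `upload_fn` is a hypothetical callable taking a file name and its bytes; in our case it would presumably wrap `wr.s3.upload` with an `io.BytesIO` around the bytes, but that has not been tested here.

```python
import io
import zipfile
from concurrent.futures import ThreadPoolExecutor


def iter_zip_members(zip_bytes):
    """Yield (filename, bytes) for each file in an in-memory zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as thezip:
        for zipinfo in thezip.infolist():
            if zipinfo.is_dir():
                continue
            yield zipinfo.filename, thezip.read(zipinfo)


def upload_members_parallel(zip_bytes, upload_fn, max_workers=8):
    """Read members sequentially and run upload_fn(filename, data) on a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(upload_fn, name, data)
            for name, data in iter_zip_members(zip_bytes)
        ]
        # Results come back in submission order; any upload error is re-raised here
        return [future.result() for future in futures]
```

One caveat with this shape is that every decompressed member is queued in memory until its upload finishes, so a very large archive may need to be submitted in batches.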

## 2) Identify Mining Companies and Send to Registry
Now that we have our submissions files somewhere, we can figure out which of these are "mining registrants" and companies we want in our registry. The individual company submission JSON documents (the primary ones, not the auxiliary -submissions- overflow files) contain some basic information about the companies, including a [Standard Industrial Classification](https://www.sec.gov/corpfin/division-of-corporation-finance-standard-industrial-classification-sic-code-list) (SIC) code. A big bunch of files like this can somewhat act like a database, but a huge batch of JSON documents can't be accessed very efficiently. What we want to do in this application is read all of the primary company JSON files in as efficient a manner as possible and pull some company information, including the CIK identifier, from all of those that we consider mining companies into a different kind of database structure that can be queried and worked with efficiently.

For now, we are going to work with only two SIC codes: 1000 for "Metal Mining" and 1400 for "Mining and Quarrying of Non-Metallic Minerals (No Fuels)." There may be other classifications of companies that we need to include, but this gives us a decent start. If we were to look at this as a resource we want to use across EMMA or all of USGS, we might consider adding SIC codes for other industries of interest whose SEC information we might want to tap.

We'll grab a few other company details while we're at this in case we want to use those for some other purpose down the road. At this point, it makes reasonable sense to organize our company registry as a simple table of properties. We can store that as something more efficient like a Parquet file or send it to a database platform like RDS. For now, we'll just stash it as a file that we can read into memory or query via AWS Glue.

In [None]:
mining_sic_codes = [1000,1400]

def cik_str(cik_id, prefix="CIK"):
    """
    Simple function to transform an SEC EDGAR CIK identifier into a string.
    """
    padded_id = str(cik_id).zfill(10)
    return prefix + padded_id

def cik_data_files(
    cik_id, 
    source_type="companyfacts", 
    source_path="/home/skybristol/experiments/data/sec_edgar"
):
    cik_id_str = cik_str(cik_id)
    data_files = glob("/".join([source_path, source_type, cik_id_str + "*"]))
    return data_files

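def load_data_file(path=None, cik_id=None, source_type="companyfacts"):
    """
    Load one JSON file by path, or every file matching a CIK identifier.
    """
    if path is not None:
        with open(path, "r") as f:
            return json.load(f)

    if cik_id is not None:
        # source_type defaults to the same value as cik_data_files; passing
        # None through would break the path join there
        data = []
        for p in cik_data_files(cik_id, source_type):
            with open(p, "r") as f:
                data.append(json.load(f))
        return data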
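
The filtering step itself can be sketched roughly as follows. This assumes the primary submissions documents carry `cik`, `name`, `sic`, `sicDescription`, and `tickers` keys, with `sic` stored as a string, and that the overflow files have no `sic` key and so drop out on their own; those assumptions should be checked against real files. `company_record` and `build_registry` are hypothetical names.

```python
MINING_SIC_CODES = {"1000", "1400"}


def company_record(doc):
    """Pull a flat registry row out of one primary submissions document."""
    return {
        "cik": str(doc.get("cik", "")).zfill(10),
        "name": doc.get("name"),
        "sic": str(doc.get("sic", "")),
        "sic_description": doc.get("sicDescription"),
        "tickers": ",".join(doc.get("tickers") or []),
    }


def build_registry(docs, sic_codes=MINING_SIC_CODES):
    """Keep only documents whose SIC code is in sic_codes."""
    return [
        company_record(doc)
        for doc in docs
        if str(doc.get("sic", "")) in sic_codes
    ]
```

The resulting list of flat dictionaries can be handed straight to `pd.DataFrame` and written out as a Parquet file for the registry.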
            

# Mining Company Filing Registry

The bulk submissions data also contains a historical listing of all filings a company has made with the SEC since 1994. For our purposes, we're interested in what are called the 10-K Annual Reports. These are where we will find S-K 1300 Technical Reports included as part of the filing submission. The bulk data does not include the information on the individual contents (files) within a given submission, so we need a two-part process that first identifies the annual reports and then checks a separate data source for each annual report to get the contents of the filing.

## 1) Identify and Register Annual Reports

From the Mining Company Registry, we have the necessary CIK identifiers to determine exactly which parts of the submissions data we need to examine for 10-K Annual Reports. As we build this into an automated process, we will be able to fully parallelize the operation with a microservice approach such that each individual company in the registry can go get "its" latest submissions data via the SEC EDGAR API, check for any new annual reports, and tee those up for processing. To start with, we need to go back through our now filtered set of primary company submission JSON documents, pull the filings, and select any 10-K Annual Reports that could have included S-K 1300 Technical Reports. This should only be filings since 2018 when the new rule was put in place, so we will concentrate on those.

Since we have to run this as a multi-part process that is essentially adding data together iteratively, we will treat the annual reports as another registry in our architecture. This will give us an accession identifier that we can use in the subsequent crawling process to go check for and retrieve any technical reports associated with the annual report filing and included in the archival accession. This dynamic also lends itself well to an automated data processing pipeline where we can send the company CIK identifier and accession number to a message queue when something needs to be checked, kicking off the next stage in the pipeline.
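
Selecting the annual reports from a company's submissions document could be sketched like this. It assumes the filings sit under `filings` and `recent` as parallel lists named `accessionNumber`, `form`, and `filingDate`, and that amended annual reports use the form name "10-K/A"; both are assumptions to verify against the data. `annual_report_filings` is a hypothetical name.

```python
def annual_report_filings(doc, forms=("10-K", "10-K/A"), since="2018-01-01"):
    """Return one dict per matching filing from the columnar 'recent' block."""
    recent = doc.get("filings", {}).get("recent", {})
    cik = str(doc.get("cik", "")).zfill(10)
    rows = zip(
        recent.get("accessionNumber", []),
        recent.get("form", []),
        recent.get("filingDate", []),
    )
    return [
        {"cik": cik, "accession_number": acc, "form": form, "filing_date": date}
        for acc, form, date in rows
        # ISO-formatted dates compare correctly as plain strings
        if form in forms and date >= since
    ]
```

Each returned dictionary holds the CIK and accession number pair that the next stage needs, which is also the natural payload for a message queue.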

## 2) Identify and Register S-K 1300 Technical Reports

To find the technical reports, we have to get into the actual contents of the annual reports. This is another area where the SEC EDGAR system explicitly supports crawling operations. A given accession for a company is always stored in a URL path like the following:

https://www.sec.gov/Archives/edgar/data/{CIK identifier}/{accession identifier}

A helpful index.json document is available from this path, giving us the file contents contained within the accession folder. From the file contents listing, we can quickly determine if there are any technical reports included using what appears to be a consistent file naming convention that uses a code for these specific parts of a filing ("...ex96.{n}...pdf"). This alone would give us the PDFs we need, but we also need to consult an HTML page that provides a full list of the contents in the annual report including titles for the documents. These all appear to be in a standard format "Technical Report Summary of Mineral...for {x} Mine". This provides us with improved bibliographic metadata to start our records. We can use the title and parse the mine name for tagging.
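
A rough sketch of the first half of that check follows. It assumes the accession folder name is the accession number with dashes removed, that index.json lists files under `directory` and `item` with a `name` key, and that the exhibit 96 naming convention is "ex96", one separator character, a number, and a .pdf suffix. All three are assumptions drawn from a small number of filings, and the function names are hypothetical.

```python
import re

ARCHIVE_ROOT = "https://www.sec.gov/Archives/edgar/data"
# Assumed convention: "ex96", any single separator character, a number, then .pdf
EX96_PATTERN = re.compile(r"ex96.(\d+).*\.pdf$", re.IGNORECASE)


def accession_index_url(cik, accession_number):
    """Build the index.json URL for an accession folder (dashes removed)."""
    folder = accession_number.replace("-", "")
    return f"{ARCHIVE_ROOT}/{cik}/{folder}/index.json"


def technical_report_files(index_doc, base_url):
    """Return name, exhibit number, and URL for items matching the ex96 pattern."""
    items = index_doc.get("directory", {}).get("item", [])
    found = []
    for item in items:
        name = item.get("name", "")
        match = EX96_PATTERN.search(name)
        if match:
            found.append({
                "name": name,
                "exhibit": f"96.{match.group(1)}",
                "url": f"{base_url}/{name}",
            })
    return found
```

The index document itself would be fetched with the same user agent header used elsewhere in this notebook, then passed in as parsed JSON.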

Some of the technical reports and other contents in annual reports are filed in amendments, which are separate accessions in the system. Within a given accession, technical reports are numbered sequentially with the "exhibit 96" identifier and additional seemingly unique identification in the file name. We'll need to explore this fully to make sure we understand the pattern.

# Building and Maintaining the GeoArchive Collection

Once we have our Mining Company Registry and Annual Report/Technical Report registry, we can use these together to assemble our online collection of these additional resources. We have decent enough bibliographic metadata to stub out usable records that will include titles, company names, mine name tags, and provenance information leading back to exactly where these records originated. We are lacking some things at this point to align more fully with a still notional target schema for this type of information:

* We could assume a date for the reports, probably based on the filing date of the annual report submission. There is a date modified on the file listing within the accession, but we have no way of knowing what this means in relation to the report.
* We don't have author information at this point. We might be able to work with the more standardized "consent of qualified persons" forms that are also a part of the submission contents to get these pulled together somehow, but that will take more investigation.
* We don't have a geographic location for the mines documented in the technical reports, but being able to pull mine names as distinct properties may help us correlate other information sources.
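
Pulling mine names out of titles in the standard format described earlier could be sketched with a regular expression like the one below. The pattern is an assumption based on the handful of titles seen so far (ending in either "Mine" or "Project") and will likely need to be loosened as more filings are examined. `parse_report_title` is a hypothetical name.

```python
import re

# Assumed title shape: "Technical Report Summary of Mineral ... for [the] {name} Mine|Project"
TITLE_PATTERN = re.compile(
    r"Technical Report Summary of (?P<scope>Mineral .+?) "
    r"for (?:the )?(?P<site>.+?) (?P<kind>Mine|Project)\b",
    re.IGNORECASE,
)


def parse_report_title(text):
    """Return the report title and site name found in text, or None if absent."""
    match = TITLE_PATTERN.search(text)
    if match is None:
        return None
    return {
        "title": match.group(0),
        "site_name": match.group("site"),
        "site_kind": match.group("kind"),
    }
```

Because the pattern is searched rather than anchored, it also works on the longer "Consent of Qualified Persons for..." exhibit descriptions that embed the same title.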

## To Store or Not to Store Technical Report Files?

Notionally, the SEC EDGAR system is reasonably stable and robust enough to be available for the long run. We may not need to store the actual PDF file content for the S-K 1300 reports. They should remain available within the SEC archives at the paths we can record in URL links from the bibliographic records. We should also be able to direct our partners with the xDD system on how to retrieve the documents from their source for processing as they will need to download and run them through the xDD pipelines. As we've done here, any other automated system will need to declare itself via a user-agent header.

We will reevaluate this once we start getting more engaged in doing things like annotating and marking up documents and then extracting that annotation for AI training data. That will push us into storing and working from copies of the PDF files on board with our online library. For now, we can use a link in the bibliographic record to facilitate access via the Zotero library, and we'll do some experimentation with how the Zotero client experience works with the links to make it as seamless as possible.

## Building the Zotero Group Library

Similar to the NI 43-101 Technical Reports, we've created a new group library for the [S-K 1300 Technical Reports](https://www.zotero.org/groups/4754160/s-k_1300_technical_reports/library). We use the geoarchive package here to populate the collection with our previously assembled bibliographic metadata that we've stashed on CHS.
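
As a rough illustration of what a stubbed-out record might contain, the sketch below builds a plain dictionary using field names from Zotero's "report" item type. It does not use the geoarchive package, whose interface is not shown in this notebook, and the mapping of company name to `institution` and filing date to `date` is an assumption rather than a settled schema. `stub_zotero_report` is a hypothetical name.

```python
def stub_zotero_report(title, company_name, filing_date, url, site_name=None):
    """Build a minimal Zotero 'report' item dictionary from registry fields."""
    tags = [{"tag": "S-K 1300"}]
    if site_name:
        # Tag the record with the mine or project name when we could parse one
        tags.append({"tag": site_name})
    return {
        "itemType": "report",
        "title": title,
        "institution": company_name,  # assumption: the filing company as institution
        "date": filing_date,          # assumption: annual report filing date
        "url": url,
        "reportType": "S-K 1300 Technical Report Summary",
        "tags": tags,
    }
```

Keeping the source URL in the record preserves the provenance trail back to the SEC archive even when we don't store the PDF ourselves.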

This particular use case brings up an interesting dynamic where we may end up with a mixed situation in terms of where and from whom information for the collection is originating. We'll be putting an automated data processing pipeline in place to do all the things in this notebook, keeping up with new filings discovered from the SEC EDGAR system. From the point where these items get added to our GeoArchive collection, other activities may commence where group members are editing and improving bibliographic metadata, we may start annotating documents (prompting the need to store a copy), and we may build in feedback from xDD or other processors to add in tags or additional value-added information. Consequently, even though we have a cloud-based instance of some of the metadata for this collection, we'll want to run the same process to routinely pull information from Zotero and store that in a different cloud data asset for both backup/future-proofing and other uses.

It appears that the API and the bulk data may not necessarily give us a pathway to pull the actual technical report files themselves. We can get to the level of identifying a mining company and finding the annual filing where S-K 1300 Technical Reports might be included as one of many "exhibits." We'd have to then find a way to crack open the filing and 

In [23]:
s = requests.Session()
s.headers['User-Agent'] = "U.S. Geological Survey;https://www.usgs.gov"

r = s.get("https://www.sec.gov/Archives/edgar/data/0001001838/000155837022002995/scco-20211231x10ka.htm")

sample_soup = BeautifulSoup(r.content, 'html.parser')

tables = sample_soup.find_all("table")

In [24]:
for tr in tables[7].find_all("tr"):
    for td in tr.find_all("td"):
        print(td.text)

23.8
​
Consent of Qualified Persons for Technical Report Summary of Mineral Reserves and Mineral Resources for the La Caridad Mine (Filed as Exhibit 23.8 to the Company’s Annual Report on Form 10-K for the fiscal year ended December 31, 2021 and incorporated herein by reference).
​
​
23.9
​
Consent of Qualified Persons for Technical Report Summary of Mineral Resources for the Pilares Project (Filed as Exhibit 23.9 to the Company’s Annual Report on Form 10-K for the fiscal year ended December 31, 2021 and incorporated herein by reference).
​
​
23.10
​
Consent of Qualified Persons for Technical Report Summary of Mineral Reserves and Mineral Resources for the El Pilar Project (Filed as Exhibit 23.10 to the Company’s Annual Report on Form 10-K for the fiscal year ended December 31, 2021 and incorporated herein by reference).
​
​
23.11
​
Consent of Qualified Persons for Technical Report Summary of Mineral Reserves and Mineral Resources for the El Arco Project (Filed as Exhibit 23.11 to the 