<a href="https://colab.research.google.com/github/sensationalspace/colab/blob/main/unstructured_10k_pipeline_section.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SEC Filing Section Pipeline

This notebook defines the pipeline for extracting narrative text sections
from 10-K, 10-Q, and S-1 filings. It contains both
exploration code and the code that defines the API. Code cells marked
with `# pipeline-api` are included in the API definition.

To demonstrate how off-the-shelf Unstructured Bricks extract
meaningful data from complex source documents, we will apply
a series of Bricks with explanations before defining the API.


#### Table of Contents

1. [Pulling in Raw Documents](#raw)
1. [Reading the Document](#reading)
1. [Custom Partitioning Bricks](#custom)
1. [Cleaning Bricks](#cleaning)
1. [Staging Bricks](#staging)
1. [Define the Pipeline API](#pipeline)

In [1]:
# Install pipeline-sec-filings
!git clone https://github.com/Unstructured-IO/pipeline-sec-filings.git --depth=1
%cd pipeline-sec-filings

Cloning into 'pipeline-sec-filings'...
remote: Enumerating objects: 69, done.
remote: Counting objects: 100% (69/69), done.
remote: Compressing objects: 100% (63/63), done.
remote: Total 69 (delta 2), reused 43 (delta 1), pack-reused 0
Receiving objects: 100% (69/69), 216.03 KiB | 3.43 MiB/s, done.
Resolving deltas: 100% (2/2), done.
/content/pipeline-sec-filings


In [None]:
# Install Python requirements
!pip install -q ratelimit unstructured==0.4.6
# upgrade to the latest, though has not been tested
# !pip install -q --upgrade ratelimit unstructured


In [None]:
# Install NLTK Data
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

## Section 1: Pulling in Raw Documents <a name="raw"></a>

First, let's pull in a filing from the SEC EDGAR database.
In this case, we'll pull in the most recent 10-K for Royal Gold (RGLD),
a publicly traded precious metals company.

In [None]:
from prepline_sec_filings.fetch import (
    get_form_by_ticker, open_form_by_ticker
)

text = get_form_by_ticker(
    'rgld',
    '10-K',
    company='Unstructured Technologies',
    email='support@unstructured.io'
)

In [None]:
print(text[1375:3284])

<XBRL>
<?xml version='1.0' encoding='UTF-8'?>

      <!-- iXBRL document created with: Toppan Merrill Bridge iXBRL 9.6.7811.37134 -->
      <!-- Based on: iXBRL 1.1 -->
      <!-- Created on: 8/11/2021 10:45:07 PM -->
      <!-- iXBRL Library version: 1.0.7811.37150 -->
      <!-- iXBRL Service Job ID: 19f8db26-9ac2-4427-9c71-ed8734c57db7 -->

  <html xmlns:us-gaap="http://fasb.org/us-gaap/2020-01-31" xmlns:link="http://www.xbrl.org/2003/linkbase" xmlns:country="http://xbrl.sec.gov/country/2020-01-31" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:rgld="http://www.royalgold.com/20210630" xmlns:xbrldt="http://xbrl.org/2005/xbrldt" xmlns:ixt-sec="http://www.sec.gov/inlineXBRL/transformation/2015-08-31" xmlns:srt="http://fasb.org/srt/2020-01-31" xmlns:ix="http://www.xbrl.org/2013/inlineXBRL" xmlns:ref="http://www.xbrl.org/2006/ref" xmlns:utr="http://www.xbrl.org/2009/utr" xmlns:xbrli="http://www.xbrl.org/2003/instance" xmlns:dei="http://xbrl.sec.gov/dei/2020-01-31" xmlns:iso4

Wow! We're able to pull in the document, but it's really messy.
To help, we'll apply Unstructured Bricks to extract the information we're
most interested in. Ultimately, we want to be able to ask the API
for a section and get back the narrative text within that section, like
the JSON below. Once we extract the narrative text, we can spin up a
labeling task or send it to a downstream ML service for inference.

```json
[
  {
    "text": "You should carefully consider the risks described in this section. Our future performance is subject to risks and uncertainties that could have a material adverse effect on our business, results of operations, and financial condition and the trading price of our common stock. We may be subject to other risks and uncertainties not presently known to us. In addition, please see our note about forward-looking statements included in the MD&A.",
    "type": "NarrativeText"
  },
  {
    "text": "Our revenue is subject to volatility in metal prices, which could negatively affect our results of operations or cash flow.",
    "type": "NarrativeText"
  },
  {
    "text": "Market prices for gold, silver, copper, nickel, and other metals may fluctuate widely over time and are affected by numerous factors beyond our control. These factors include metal supply and demand, industrial and jewelry fabrication, investment demand, central banking actions, inflation expectations, currency values, interest rates, forward sales by metal producers, and political, trade, economic, or banking conditions.",
    "type": "NarrativeText"
  },
```
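
As a rough illustration of how a downstream consumer might use a response in this shape, the sketch below parses a small hand-written payload and keeps only the text of the `NarrativeText` elements. The `payload` string (including its `Title` entry) and the `narrative_texts` helper are made up for this example and are not part of the pipeline.

```python
import json

# Hand-written, abbreviated stand-in for an API response. The "Title"
# entry is added here only so the filtering step has something to drop.
payload = """
[
  {"text": "ITEM 1A. RISK FACTORS", "type": "Title"},
  {"text": "You should carefully consider the risks described in this section.", "type": "NarrativeText"},
  {"text": "Our revenue is subject to volatility in metal prices.", "type": "NarrativeText"}
]
"""


def narrative_texts(raw_json):
    """Return the text of every element whose type is NarrativeText."""
    elements = json.loads(raw_json)
    return [el["text"] for el in elements if el["type"] == "NarrativeText"]


texts = narrative_texts(payload)
for t in texts:
    print("-", t)
```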

## Section 2: Reading the Document <a name="reading"></a>

The first step is to identify and categorize text elements within the
document. When we get the SEC document from EDGAR, it is in `.xml` format.
Like most text documents, XML documents contain the text data we're
interested in along with other information that we'd like to discard. In
an XML or HTML document, that could be styling or formatting tags.
Other document types, like PDFs, might have headers, footers, and page
numbers we're not interested in.

To help with this, Unstructured has created a library of
***partitioning bricks*** to break down a document into coherent chunks.
Different document types have different partitioning methods. For HTML
or XML documents, we identify sections based on text tags and use
NLP to differentiate between titles, narrative text, and other section
types. For PDF documents, we use visual clues such as the layout of the
document.
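
To make the tag-based idea concrete, here is a minimal sketch that uses only the standard library's `html.parser`. It collects the text inside a few block-level tags and ignores everything else. This illustrates the general approach only; it is not how `HTMLDocument` is implemented, and the `TextBlockParser` class and the sample markup are invented for this example.

```python
from html.parser import HTMLParser


class TextBlockParser(HTMLParser):
    """Collect the text found inside selected block-level tags."""

    BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.blocks = []     # finished text blocks, in document order
        self._buffer = None  # pieces of the block being read, if any

    def handle_starttag(self, tag, attrs):
        # Start buffering at the outermost block tag (nesting is not handled).
        if tag in self.BLOCK_TAGS and self._buffer is None:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS and self._buffer is not None:
            # Join the pieces and collapse internal whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:  # skip blocks that contain only whitespace
                self.blocks.append(text)
            self._buffer = None


sample = """
<html><body>
  <h2>ITEM 1A. RISK FACTORS</h2>
  <p>Our revenue is subject to <b>volatility</b> in metal prices.</p>
  <p style="margin:0">   </p>
</body></html>
"""

parser = TextBlockParser()
parser.feed(sample)
print(parser.blocks)
```

Inline tags such as `<b>` are ignored, so their text stays inside the surrounding block, and the whitespace-only paragraph is dropped.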

In [None]:
from unstructured.documents.html import HTMLDocument

html_document = HTMLDocument.from_string(text).doc_after_cleaners(skip_headers_and_footers=True, skip_table_text=True)

Here, we see that the `HTMLDocument` Brick was able to extract
text from the raw source XML document. This is a generic
`HTMLDocument` class that does not have any special knowledge about
the structure of EDGAR documents.

In [None]:
for element in html_document.pages[0].elements[71:75]:
    print(element)
    print("\n")

Acquisition and Management of Stream Interests—A metal stream is a purchase agreement that provides, in exchange for an upfront deposit payment, the right to purchase all or a portion of one or more metals produced from a mine, at a price determined for the life of the transaction by the purchase agreement. As of June 30, 2021, we owned eight stream interests, which are on six producing properties and two development stage properties. Our stream interests accounted for approximately 69% and 72% of our total revenue for each of the fiscal years ended June 30, 2021, and 2020, respectively. We expect stream interests to represent a significant portion of our total revenue.


Acquisition and Management of Royalty Interests—Royalties are non-operating interests in mining projects that provide the right to a percentage of revenue or metals produced from the project after deducting specified costs, if any. As of June 30, 2021, we owned royalty interests on 35 producing properties, 15 developm

Internally, the sections of text are represented using `Title` and
`NarrativeText` classes, as shown below. Bricks that are executed while
processing the document distinguish titles from narrative text.

In [None]:
html_document.pages[0].elements[71:75]

[<unstructured.documents.html.HTMLListItem at 0x7f9f53d2e190>,
 <unstructured.documents.html.HTMLListItem at 0x7f9f53d2e100>,
 <unstructured.documents.html.HTMLTitle at 0x7f9f53d2e1c0>,
 <unstructured.documents.html.HTMLNarrativeText at 0x7f9f53d2e310>]

Below, we show Bricks identifying possible titles and narrative text
sections. `"Regulation"` is identified as a title and not narrative text
because it is short and does not contain any verbs. Conversely, longer
sections with multiple, complex sentences are identified as narrative.

In [None]:
from unstructured.nlp.partition import is_possible_title

is_possible_title("Regulation")

True

In [None]:
is_possible_title("""Operators of the mines that are subject to our
stream and royalty interests must comply with numerous environmental,
mine safety, land use, waste disposal, remediation and public health
laws and regulations promulgated by federal, state, provincial and
local governments in the United States, Canada, Chile, the Dominican
Republic, Ghana, Mexico, Botswana, Australia and other countries where
we hold interests. Although we, as a stream or royalty interest owner,
are not""")

False

In [None]:
from unstructured.nlp.partition import is_possible_narrative_text

is_possible_narrative_text("Regulation")

False

In [None]:
is_possible_narrative_text("""Operators of the mines that are subject to our
stream and royalty interests must comply with numerous environmental,
mine safety, land use, waste disposal, remediation and public health
laws and regulations promulgated by federal, state, provincial and
local governments in the United States, Canada, Chile, the Dominican
Republic, Ghana, Mexico, Botswana, Australia and other countries where
we hold interests. Although we, as a stream or royalty interest owner,
are not""")

True
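
The library's checks rely on NLP (for example, whether the text contains a verb). To show the shape of such a classifier without any NLP dependency, the sketch below substitutes much cruder rules based on word counts and punctuation. The function names and thresholds are assumptions made for this example, and the results will not always agree with `is_possible_title` and `is_possible_narrative_text`.

```python
def looks_like_title(text, max_words=12):
    """Crude stand-in: short text that does not end like a sentence."""
    words = text.split()
    if not words or len(words) > max_words:
        return False
    return not text.rstrip().endswith((".", "?", "!", ","))


def looks_like_narrative(text, min_words=8):
    """Crude stand-in: long enough and contains sentence punctuation."""
    return len(text.split()) >= min_words and any(p in text for p in ".?!")


short = "Regulation"
long_text = (
    "Operators of the mines that are subject to our stream and royalty "
    "interests must comply with numerous environmental laws. Although we, "
    "as a stream or royalty interest owner, are not"
)

print(looks_like_title(short), looks_like_narrative(short))
print(looks_like_title(long_text), looks_like_narrative(long_text))
```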

## Section 3: Custom Partitioning Bricks <a name="custom"></a>

In addition to the partitioning bricks Unstructured provides out of the
box, a given recipe may require custom partitioning bricks. In this case,
we're interested in identifying sections within the SEC filing. To
support that, we'll create a custom partitioning brick to identify
section titles. This will help distinguish section headings from sub-headings.

In [None]:
import re
from unstructured.documents.elements import Title

In [None]:
ITEM_TITLE_RE = re.compile(
    r"(?i)item \d{1,3}(?:[a-z]|\([a-z]\))?(?:\.)?(?::)?"
)

In [None]:
def is_10k_item_title(title: str) -> bool:
    """Determines if a title corresponds to a 10-K item heading."""
    return ITEM_TITLE_RE.match(title) is not None

Running this check over the document's titles (see the next cell) looks
promising: every match is a section heading we're looking for. But most
of the items are missing. It turns out some of the titles contain extra
whitespace characters, so the regex does not capture them. Fortunately,
Unstructured has tools to help with that, which we'll cover in the next
section.

In [None]:
for element in html_document.elements:
    if isinstance(element, Title) and is_10k_item_title(element.text):
        print(element)

ITEM 1A. RISK FACTORS
ITEM 16.    FORM 10-K SUMMARY
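
To see which heading variants `ITEM_TITLE_RE` accepts, we can run it against a few strings directly. The sample strings below are hypothetical and chosen to exercise each optional part of the pattern; the last two show the kind of irregular whitespace that could cause a miss, though we have not checked which characters the filing actually contains.

```python
import re

ITEM_TITLE_RE = re.compile(
    r"(?i)item \d{1,3}(?:[a-z]|\([a-z]\))?(?:\.)?(?::)?"
)

# Hypothetical headings, not taken from the filing.
samples = [
    "ITEM 1A. RISK FACTORS",   # letter suffix and period
    "Item 7(a): Market Risk",  # parenthesized suffix and colon
    "Overview",                # sub-heading, not an item
    "ITEM\xa01. BUSINESS",     # non-breaking space after ITEM
    "ITEM  2. PROPERTIES",     # two spaces after ITEM
]

results = {s: ITEM_TITLE_RE.match(s) is not None for s in samples}
for heading, matched in results.items():
    print(repr(heading), matched)
```

The pattern contains a single literal space after `item`, so any other whitespace at that position prevents a match.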


## Section 4: Cleaning Bricks <a name="cleaning"></a>

In addition to partitioning bricks, the Unstructured library has
***cleaning*** bricks for removing unwanted content from text. In this
case, we'll solve our whitespace problem by using the
`clean_extra_whitespace` brick. Other uses for cleaning bricks include
cleaning out boilerplate, sentence fragments, and other segments
of text that could impact labeling tasks or the accuracy of
machine learning models. As with partitioning bricks, users can
include custom cleaning bricks in a pipeline.

In [None]:
from unstructured.cleaners.core import clean_extra_whitespace

In [None]:
titles = []
for element in html_document.elements:
    element.text = clean_extra_whitespace(element.text)
    if isinstance(element, Title) and is_10k_item_title(element.text):
        titles.append(element)
        print(element)

ITEM 1. BUSINESS
ITEM 1A. RISK FACTORS
ITEM 1B. UNRESOLVED STAFF COMMENTS
ITEM 2. PROPERTIES
ITEM 3. LEGAL PROCEEDINGS
ITEM 4. MINE SAFETY DISCLOSURE
ITEM 7A. QUANTITATIVE AND QUALITATIVE DISCLOSURE ABOUT MARKET RISK
ITEM 8. FINANCIAL STATEMENTS AND SUPPLEMENTARY DATA
ITEM 9. CHANGES IN AND DISAGREEMENTS WITH ACCOUNTANTS ON ACCOUNTING AND FINANCIAL DISCLOSURE
ITEM 9A. CONTROLS AND PROCEDURES
ITEM 9B. OTHER INFORMATION
ITEM 10. DIRECTORS, EXECUTIVE OFFICERS AND CORPORATE GOVERNANCE
ITEM 11. EXECUTIVE COMPENSATION
ITEM 13. CERTAIN RELATIONSHIPS AND RELATED TRANSACTIONS AND DIRECTOR INDEPENDENCE
ITEM 14. PRINCIPAL ACCOUNTANT FEES AND SERVICES
ITEM 15. EXHIBITS AND FINANCIAL STATEMENT SCHEDULES
ITEM 16. FORM 10-K SUMMARY
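
For intuition, whitespace cleaning of this kind can be approximated with two regular-expression substitutions. The `normalize_whitespace` function below is a stand-in written for this example, not the library's `clean_extra_whitespace`, and the raw titles are hypothetical.

```python
import re


def normalize_whitespace(text):
    """Replace non-breaking spaces and newlines, then collapse runs of spaces."""
    text = re.sub(r"[\xa0\n]", " ", text)
    text = re.sub(r" {2,}", " ", text)
    return text.strip()


# Hypothetical raw titles with irregular whitespace.
raw_titles = [
    "ITEM\xa01.\xa0\xa0BUSINESS",
    "ITEM 2.\n  PROPERTIES ",
    "ITEM 16.    FORM 10-K SUMMARY",
]

cleaned = [normalize_whitespace(t) for t in raw_titles]
for title in cleaned:
    print(title)
```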


In [None]:
for i, el in enumerate(html_document.elements):
  if el.id == titles[0].id:
    break
first_title_index = i
for i in range(first_title_index, first_title_index+10):
  print(html_document.elements[i])

ITEM 1. BUSINESS
Overview
We acquire and manage precious metal streams, royalties, and similar interests. We seek to acquire existing stream and royalty interests or to finance projects that are in production or in the development stage in exchange for stream or royalty interests. We do not conduct mining operations on the properties in which we hold stream and royalty interests and are not required to contribute to capital costs, environmental costs, or other operating costs on the properties. Please refer to Item 2, Properties, for a discussion of the developments at our principal properties.
In the ordinary course of business, we engage in a continual review of opportunities to acquire existing stream and royalty interests, to establish new stream and royalty interests on operating mines, to create new stream and royalty interests through the financing of mine development or exploration, or to acquire companies that hold stream and royalty interests. We currently, and generally at a

In [None]:
{type(el) for el in html_document.elements}

{unstructured.documents.html.HTMLListItem,
 unstructured.documents.html.HTMLNarrativeText,
 unstructured.documents.html.HTMLText,
 unstructured.documents.html.HTMLTitle}

In [None]:
from unstructured.documents.html import HTMLListItem
for i, el in enumerate(html_document.elements):
  if isinstance(el, HTMLListItem):
    break
first_list_item_idx = i
for i in range(first_list_item_idx-1, first_list_item_idx+10):
  print(html_document.elements[i], type(html_document.elements[i]))

As discussed in further detail throughout this report, some key takeaways and developments for our business since the beginning of fiscal year 2021 were as follows: <class 'unstructured.documents.html.HTMLNarrativeText'>
We had record revenue of $615.9 million during fiscal year 2021, compared to $498.8 million during fiscal year 2020. This was a 23% increase year over year. <class 'unstructured.documents.html.HTMLListItem'>
We increased our calendar year dividend to $1.20 per basic share, which is paid in quarterly installments throughout calendar year 2021. This represents a 7% increase compared with the dividend paid during calendar year 2020. <class 'unstructured.documents.html.HTMLListItem'>
On September 30, 2020, we announced we had entered into an agreement with Kinross Gold Corporation to sell our interest in the Manh Choh Project (formerly known as the Peak Gold Project) and our common share position in Contango Ore, Inc., our partner in Peak Gold, LLC, the owner of the Manh C

After applying the cleaning brick, we have a good-looking list of
section titles! With our set of bricks in place, we have what we
need to identify the risk factors section. For convenience, we've packaged them
up in the `SECDocument` class in the Python module for the pipeline. Notice
that we add the `# pipeline-api` comment so the `SECDocument`
import is included in the pipeline definition. Applying the full pipeline, we're
able to cleanly extract the risk narrative portion of the SEC filing.
The function is parameterized, so we can easily get other sections
as well.

In [None]:
# pipeline-api
from prepline_sec_filings.sections import section_string_to_enum, validate_section_names, SECSection
from prepline_sec_filings.sec_document import SECDocument, REPORT_TYPES, VALID_FILING_TYPES

In [None]:
sec_document = SECDocument.from_string(text)
risk_narrative = sec_document.get_section_narrative(SECSection.RISK_FACTORS)

In [None]:
for element in risk_narrative[:3]:
    print(element)
    print("\n")

You should carefully consider the risks described in this section. Our future performance is subject to risks and uncertainties that could have a material adverse effect on our business, results of operations, and financial condition and the trading price of our common stock. We may be subject to other risks and uncertainties not presently known to us. In addition, please see our note about forward-looking statements included in the MD&A.


Our revenue is subject to volatility in metal prices, which could negatively affect our results of operations or cash flow.


Market prices for gold, silver, copper, nickel, and other metals may fluctuate widely over time and are affected by numerous factors beyond our control. These factors include metal supply and demand, industrial and jewelry fabrication, investment demand, central banking actions, inflation expectations, currency values, interest rates, forward sales by metal producers, and political, trade, economic, or banking conditions.
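
Conceptually, pulling out one section means finding its item heading and collecting the narrative text up to the next item heading. The sketch below shows that idea on a toy element list. It is not the `SECDocument` implementation: the `Element` dataclass, the `section_narrative` function, and the sample elements are all hypothetical, and a real filing needs extra care (for example, a table of contents that repeats the headings).

```python
import re
from dataclasses import dataclass


@dataclass
class Element:
    kind: str  # "Title" or "NarrativeText"
    text: str


ITEM_TITLE_RE = re.compile(r"(?i)item \d{1,3}(?:[a-z]|\([a-z]\))?(?:\.)?(?::)?")


def section_narrative(elements, section_pattern):
    """Return narrative elements between the matching item heading and the next one."""
    collected = []
    in_section = False
    for el in elements:
        is_item_heading = el.kind == "Title" and ITEM_TITLE_RE.match(el.text)
        if is_item_heading:
            if in_section:
                break  # reached the next item heading, so the section is over
            in_section = bool(section_pattern.search(el.text))
        elif in_section and el.kind == "NarrativeText":
            collected.append(el)  # sub-heading titles are skipped
    return collected


elements = [
    Element("Title", "ITEM 1. BUSINESS"),
    Element("NarrativeText", "We acquire and manage precious metal streams."),
    Element("Title", "ITEM 1A. RISK FACTORS"),
    Element("NarrativeText", "You should carefully consider the risks described in this section."),
    Element("Title", "Risks Related to Our Business"),  # hypothetical sub-heading
    Element("NarrativeText", "Our revenue is subject to volatility in metal prices."),
    Element("Title", "ITEM 1B. UNRESOLVED STAFF COMMENTS"),
    Element("NarrativeText", "None."),
]

risk = section_narrative(elements, re.compile(r"(?i)risk factors"))
for el in risk:
    print(el.text)
```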




## Section 5: Staging Bricks <a name="staging"></a>

After we've extracted the information we want, the next step is to
prepare the data for downstream tasks. This can include preparation
to:

- Feed the data into a downstream ML model
- Convert the data into a labeling task
- Upload the data into a database

To help with this, Unstructured has created ***staging bricks***
that prepare extracted data for downstream tasks. In this case,
we'll show how to format the data so that it can be easily
uploaded into Label Studio as a labeling task. Other future use cases include:

- Preparing data for downstream inference by Primer, Co:here, or Vannevar
- Chunking text to fit into attention windows for Hugging Face models
- Uploading data to Palantir Foundry
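
To give a feel for the chunking use case, here is a minimal sketch that greedily packs whole passages into chunks under a word budget. It is not an Unstructured staging brick: `chunk_by_words` is a hypothetical helper, and it counts whitespace-separated words, which only approximates the token counts a real model would use.

```python
def chunk_by_words(texts, max_words):
    """Greedily pack whole passages into chunks of at most max_words words.

    A single passage longer than max_words is kept whole as its own chunk.
    """
    chunks, current, count = [], [], 0
    for text in texts:
        n = len(text.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(text)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


passages = ["a b c", "d e", "f g h i", "j"]
chunks = chunk_by_words(passages, max_words=5)
print(chunks)
```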

In [None]:
from unstructured.staging.label_studio import stage_for_label_studio

In [None]:
label_studio_data = stage_for_label_studio(risk_narrative)
label_studio_data[:5]

[{'data': {'text': 'You should carefully consider the risks described in this section. Our future performance is subject to risks and uncertainties that could have a material adverse effect on our business, results of operations, and financial condition and the trading price of our common stock. We may be subject to other risks and uncertainties not presently known to us. In addition, please see our note about forward-looking statements included in the MD&A.',
   'ref_id': '7a912bb639b547404be4ceaf5d9083a9'}},
 {'data': {'text': 'Our revenue is subject to volatility in metal prices, which could negatively affect our results of operations or cash flow.',
   'ref_id': 'd4cc8e0e0c2b68ef69282c5250b721c9'}},
 {'data': {'text': 'Market prices for gold, silver, copper, nickel, and other metals may fluctuate widely over time and are affected by numerous factors beyond our control. These factors include metal supply and demand, industrial and jewelry fabrication, investment demand, central bank
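
Each record above wraps the passage in a `data` dictionary with `text` and `ref_id` keys. The sketch below builds records in the same shape by hand and serializes them to JSON, which is roughly what would be needed before importing them as labeling tasks. The `to_labeling_tasks` helper is hypothetical, and deriving `ref_id` from an MD5 digest of the text is an assumption made for this example; the library may compute its identifiers differently.

```python
import hashlib
import json


def to_labeling_tasks(texts):
    """Wrap each text in a {"data": {...}} record like the output above."""
    tasks = []
    for text in texts:
        # Assumption: use an MD5 hex digest of the text as the identifier.
        ref_id = hashlib.md5(text.encode("utf-8")).hexdigest()
        tasks.append({"data": {"text": text, "ref_id": ref_id}})
    return tasks


tasks = to_labeling_tasks([
    "You should carefully consider the risks described in this section.",
    "Our revenue is subject to volatility in metal prices.",
])

serialized = json.dumps(tasks, indent=2)
print(serialized)
```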

## Section 6: Define the Pipeline API <a name="pipeline"></a>

Once we have everything ready to go, we can package up the pipeline API
using the `# pipeline-api` comment. Unstructured has tooling that will
convert this notebook directly into a REST API. By convention, the REST
API calls the `pipeline_api` function when the API is invoked.
Once the REST API is up and running, users can make an HTTP request
to parse new documents.

In [None]:
# pipeline-api
from enum import Enum
import re
import signal

from unstructured.staging.base import convert_to_isd
from prepline_sec_filings.sections import (
    ALL_SECTIONS,
    SECTIONS_10K,
    SECTIONS_10Q,
    SECTIONS_S1,
)

In [None]:
# pipeline-api
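class timeout:
    """Context manager that raises TimeoutError after `seconds` seconds.

    Relies on SIGALRM, so it only takes effect in the main thread on Unix.
    """
    def __init__(self, seconds=1, error_message='Timeout'):
        self.seconds = seconds
        self.error_message = error_message
    def handle_timeout(self, signum, frame):
        raise TimeoutError(self.error_message)
    def __enter__(self):
        try:
            signal.signal(signal.SIGALRM, self.handle_timeout)
            signal.alarm(self.seconds)
        except ValueError:
            # signal.signal raises ValueError outside the main thread;
            # in that case the block runs without a time limit.
            pass
    def __exit__(self, type, value, traceback):
        try:
            # Cancel any pending alarm.
            signal.alarm(0)
        except ValueError:
            pass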
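
The time limit matters because user-supplied regular expressions can be slow to evaluate. The sketch below shows the same alarm-based technique in a compact, standalone form and exercises it with a fast and a slow task. `time_limit` and `run_with_limit` are names invented for this example; like the class above, the approach assumes a Unix system and the main thread.

```python
import signal
import time
from contextlib import contextmanager


@contextmanager
def time_limit(seconds):
    """Raise TimeoutError if the body runs longer than `seconds` (Unix, main thread)."""
    def _on_alarm(signum, frame):
        raise TimeoutError(f"timed out after {seconds}s")

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)                          # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


def run_with_limit(work, seconds):
    """Run work() under a time limit and report what happened."""
    try:
        with time_limit(seconds):
            work()
        return "completed"
    except TimeoutError:
        return "timed out"


fast = run_with_limit(lambda: time.sleep(0.01), 1)
slow = run_with_limit(lambda: time.sleep(3), 1)
print(fast, slow)
```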

In [None]:
# pipeline-api
def get_regex_enum(section_regex):
    class CustomSECSection(Enum):
        CUSTOM = re.compile(section_regex)

        @property
        def pattern(self):
            return self.value

    return CustomSECSection.CUSTOM
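
`get_regex_enum` wraps a user-supplied regex in a one-member enum whose `pattern` property returns the compiled expression. Presumably this lets a custom regex be passed wherever a built-in section enum is expected; that is an inference from how it is used in this notebook. The snippet below repeats the function so it runs on its own, and the regex and heading strings are illustrative.

```python
import re
from enum import Enum


def get_regex_enum(section_regex):
    class CustomSECSection(Enum):
        CUSTOM = re.compile(section_regex)

        @property
        def pattern(self):
            return self.value

    return CustomSECSection.CUSTOM


# Illustrative regex and headings.
custom = get_regex_enum(r"(?i)item 1a\.? risk factors")

hit = custom.pattern.search("ITEM 1A. RISK FACTORS") is not None
miss = custom.pattern.search("ITEM 2. PROPERTIES") is not None
print(custom.name, hit, miss)
```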

In [None]:
# pipeline-api
def pipeline_api(text, m_section=[], m_section_regex=[]):
    """Many supported sections including: RISK_FACTORS, MANAGEMENT_DISCUSSION, and many more"""
    validate_section_names(m_section)

    sec_document = SECDocument.from_string(text)
    if sec_document.filing_type not in VALID_FILING_TYPES:
        raise ValueError(
            f"SEC document filing type {sec_document.filing_type} is not supported, "
            f"must be one of {','.join(VALID_FILING_TYPES)}"
        )
    results = {}
    if m_section == [ALL_SECTIONS]:
        filing_type = sec_document.filing_type
        if filing_type in REPORT_TYPES:
            if filing_type.startswith("10-K"):
                m_section = [enum.name for enum in SECTIONS_10K]
            elif filing_type.startswith("10-Q"):
                m_section = [enum.name for enum in SECTIONS_10Q]
            else:
                raise ValueError(f"Invalid report type: {filing_type}")

        else:
            m_section = [enum.name for enum in SECTIONS_S1]
    for section in m_section:
        results[section] = sec_document.get_section_narrative(
            section_string_to_enum[section]
        )
    for i, section_regex in enumerate(m_section_regex):
        regex_enum = get_regex_enum(section_regex)
        with timeout(seconds=5):
            section_elements = sec_document.get_section_narrative(regex_enum)
            results[f"REGEX_{i}"] = section_elements
    return {section:convert_to_isd(section_narrative) for section, section_narrative in results.items()}

As we can see, the output now matches what we wanted at the beginning
of the notebook.

In [None]:
risk_narrative = pipeline_api(text, ["RISK_FACTORS"])["RISK_FACTORS"]
risk_narrative[:5]

[{'text': 'You should carefully consider the risks described in this section. Our future performance is subject to risks and uncertainties that could have a material adverse effect on our business, results of operations, and financial condition and the trading price of our common stock. We may be subject to other risks and uncertainties not presently known to us. In addition, please see our note about forward-looking statements included in the MD&A.',
  'type': 'NarrativeText'},
 {'text': 'Our revenue is subject to volatility in metal prices, which could negatively affect our results of operations or cash flow.',
  'type': 'NarrativeText'},
 {'text': 'Market prices for gold, silver, copper, nickel, and other metals may fluctuate widely over time and are affected by numerous factors beyond our control. These factors include metal supply and demand, industrial and jewelry fabrication, investment demand, central banking actions, inflation expectations, currency values, interest rates, for

In [None]:
all_narratives = pipeline_api(text, ["_ALL"])
for section, elems in all_narratives.items():
    print(section)
    print(elems[:4])
    print("---------------")

BUSINESS
[]
---------------
RISK_FACTORS
[{'text': 'You should carefully consider the risks described in this section. Our future performance is subject to risks and uncertainties that could have a material adverse effect on our business, results of operations, and financial condition and the trading price of our common stock. We may be subject to other risks and uncertainties not presently known to us. In addition, please see our note about forward-looking statements included in the MD&A.', 'type': 'NarrativeText'}, {'text': 'Our revenue is subject to volatility in metal prices, which could negatively affect our results of operations or cash flow.', 'type': 'NarrativeText'}, {'text': 'Market prices for gold, silver, copper, nickel, and other metals may fluctuate widely over time and are affected by numerous factors beyond our control. These factors include metal supply and demand, industrial and jewelry fabrication, investment demand, central banking actions, inflation expectations, c
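
Note that `BUSINESS` came back empty in the output above. When requesting every section, it can help to summarize the result before passing it downstream. The sketch below counts elements per section and lists the empty ones; `summarize_sections` and the abbreviated `sample_results` dictionary are hypothetical stand-ins for the real `pipeline_api` output.

```python
def summarize_sections(results):
    """Map each section name to its element count and list the empty sections."""
    counts = {name: len(elements) for name, elements in results.items()}
    empty = [name for name, n in counts.items() if n == 0]
    return counts, empty


# Abbreviated stand-in for the dictionary returned by pipeline_api.
sample_results = {
    "BUSINESS": [],
    "RISK_FACTORS": [
        {"text": "You should carefully consider the risks described in this section.", "type": "NarrativeText"},
        {"text": "Our revenue is subject to volatility in metal prices.", "type": "NarrativeText"},
    ],
}

counts, empty = summarize_sections(sample_results)
print(counts)
print("Empty sections:", empty)
```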