## Example Use Case: 10-K Processing

:::{admonition} [10-K filings](https://www.investopedia.com/terms/1/10-k.asp)
:class: important

Our objective: produce summaries of portions of 10-K filings, here the Management's Discussion and Analysis (MD&A) section.

```{figure} ./images/10-K-investopedia.png
---
width: auto
name: 10-K-investopedia
---
```
:::

:::{note} All 10-K documents are available on KLC (up to 2024)

```{figure} ./images/10-K-klc.png
---
width: 900px
name: 10-K-klc
---
```
:::

:::{admonition} Type of Use Case: Summarize


```{figure} ./images/LLM-use-cases-summarize.png
---
width: 900px
name: LLM-use-cases-summarize
---
```
:::

:::{admonition} Find a Model

We can search the [Hugging Face Model Hub](https://huggingface.co/models) for a summarization model. This example uses [Falconsai/text_summarization](https://huggingface.co/Falconsai/text_summarization).

```{figure} ./images/model-hub-summarize.png
---
width: auto
name: model-hub-summarize
---
```
:::

:::{admonition} Create a Summarization Pipeline

The full script is available [here](https://github.com/rs-kellogg/krs-openllm-cookbook/blob/main/scripts/10K_use_case/process_annual_reports_transfomer.py).

```python
import os
from pathlib import Path

from datasets import Dataset
from transformers import pipeline


def main(
    cache_dir: Path = Path("/projects/kellogg/.cache"),
    input_dir: Path = Path("/kellogg/data/EDGAR/10-K/2023"),
    output_file: Path = Path("/projects/kellogg/output/annual_report_output.csv"),
    model_checkpoint: str = "Falconsai/text_summarization",
    num_files: int = 10,
):
    # validate input parameters
    assert cache_dir.exists() and cache_dir.is_dir()
    assert input_dir.exists() and input_dir.is_dir()
    assert num_files > 0
    output_file.touch(exist_ok=True)

    # set the huggingface model directory
    os.environ["HF_HOME"] = str(cache_dir)

    # get a sorted listing of 10-K files, then keep the first num_files
    # (sorting before slicing makes the selection deterministic)
    files = sorted(input_dir.glob("*.txt"))[:num_files]

    # load and clean text, extract the MD&A section
    data_dict = {"doc": [], "text": []}
    for f in files:
        print(f"loading: {f.name}")
        mda_text = extract_mda(clean_html(f.read_text()))
        if mda_text is None:
            continue
        data_dict["doc"].append(f.name)
        data_dict["text"].append(mda_text)

    # create a dataset object
    dataset_10k = Dataset.from_dict(data_dict)
    print(f"created dataset: {dataset_10k}")

    # apply summarization pipeline to dataset
    summarizer = pipeline("summarization", model=model_checkpoint)
    dataset_10k = dataset_10k.map(
        lambda batch: {
            "summary": summarizer(
                batch["text"],
                max_length=50,
                min_length=30,
                do_sample=False,
                truncation=True,
            )
        },
        batched=True,
    )

    # output to file
    dataset_10k.to_csv(output_file)
```
:::
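A `transformers` summarization pipeline returns one dict per input, with the text under the `summary_text` key, so the `summary` column above holds dicts rather than plain strings. The sketch below shows one way to unwrap them. The helper name `unwrap_summaries` and the stand-in outputs are hypothetical and not part of the original script.

```python
def unwrap_summaries(outputs):
    """Turn [{'summary_text': '...'}, ...] into ['...', ...]."""
    return [item["summary_text"] for item in outputs]


# Stand-in for what summarizer(batch["text"], ...) returns for two documents
# (made-up values, used here so the example runs without downloading a model)
fake_outputs = [
    {"summary_text": "first summary"},
    {"summary_text": "second summary"},
]

summaries = unwrap_summaries(fake_outputs)
print(summaries)
```

Inside `dataset_10k.map`, the lambda could then return `{"summary": unwrap_summaries(summarizer(batch["text"], ...))}` so that the CSV contains plain text in the `summary` column.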

:::{toggle}
```python
import re


def clean_html(html):
    # First we remove inline JavaScript/CSS:
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    
    # Then we remove html comments. This has to be done before removing regular
    # tags since comments can contain '>' characters.
    cleaned = re.sub(r"(?s)<!--(.*?)-->[\n]?", "", cleaned)
    
    # Next we can remove the remaining tags:
    cleaned = re.sub(r"(?s)<.*?>", " ", cleaned)
    
    # Finally, we deal with whitespace: replace non-breaking-space entities,
    # then collapse runs of spaces into a single space
    cleaned = re.sub(r"&nbsp;", " ", cleaned)
    cleaned = re.sub(r" {2,}", " ", cleaned)
    return cleaned.strip()
```
:::
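`clean_html` only replaces the named `&nbsp;` entity, so numeric character references such as `&#160;` or `&#x2019;` pass through unchanged. Below is a minimal sketch of one way to decode them with the standard library's `html.unescape`. The helper name `decode_entities` is hypothetical and not part of the original script.

```python
import html


def decode_entities(text):
    """Decode HTML entities and normalize whitespace (illustrative helper)."""
    # html.unescape handles named (&nbsp;) and numeric (&#160;, &#x2019;) references
    decoded = html.unescape(text)
    # A decoded non-breaking space is the character U+00A0; make it a plain space
    decoded = decoded.replace("\u00a0", " ")
    # Collapse any run of whitespace into a single space
    return " ".join(decoded.split())


sample = "the virus&#x2019;s &#160; economic&nbsp;impact"
print(decode_entities(sample))
```

Such a helper would be applied after the tags have been removed, for example as `decode_entities(clean_html(raw_text))`.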

:::{toggle}
```python
import re


def extract_mda(text):
    mda_text = None
    
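    # find the second occurrence of "Discussion and Analysis of Financial Condition",
    # allowing whitespace or punctuation between the words, and keep the first
    # 250 words that follow it; return None if there are fewer than two matches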
    pattern = r"Discussion[\s,.-]*and[\s,.-]*Analysis[\s,.-]*of[\s,.-]*Financial[\s,.-]*Condition"
    mda_matches = list(re.finditer(pattern, text, re.IGNORECASE))
    if len(mda_matches) >= 2:
        m = mda_matches[1]
        mda_text = text[m.end():]
        return " ".join(mda_text.split()[:250])
    return mda_text
```
:::
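To make the behavior of `extract_mda` concrete, here is a quick check on a made-up toy document. The block uses a compact copy of the function so that it runs on its own. The `max_words` parameter and the toy text are assumptions added for illustration.

```python
import re

PATTERN = (
    r"Discussion[\s,.-]*and[\s,.-]*Analysis[\s,.-]*of"
    r"[\s,.-]*Financial[\s,.-]*Condition"
)


def extract_mda(text, max_words=250):
    """Return up to max_words words after the second heading match, else None."""
    matches = list(re.finditer(PATTERN, text, re.IGNORECASE))
    if len(matches) < 2:
        return None
    return " ".join(text[matches[1].end():].split()[:max_words])


# Toy document: the heading appears once in a contents line, once as a section title
toy = (
    "Item 7. Management's Discussion and Analysis of Financial Condition ... 25 "
    "Item 7. MANAGEMENT'S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION "
    "and Results of Operations Revenue grew this year."
)

excerpt = extract_mda(toy)
print(excerpt)
```

The excerpt starts right after the matched phrase, so it begins with the rest of the heading ("and Results of Operations") before the body text. Documents with fewer than two matches yield `None` and are skipped by the loop in `main`.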

:::{note}
A [screen recording](https://kellogg-shared.s3.us-east-2.amazonaws.com/videos/quest-on-demand-10-k.mp4) shows the script being run on a Quest GPU node via a Slurm script.
:::

:::{admonition} Original text snippet:
> In addition, the spread of COVID-19 has caused us to modify our business practices (including restricting employee travel, developing social distancing plans for our employees and cancelling physical participation in meetings, events and conferences), and we may take further actions as may be required by government authorities or as we determine is in the best interests of our employees, partners and customers. The outbreak has adversely impacted and may further adversely impact our workforce and operations and the operations of our partners, customers, suppliers and third-party vendors, throughout the time period during which the spread of COVID-19 continues and related restrictions remain in place, and even after the COVID-19 outbreak has subsided. Even after the COVID-19 outbreak has subsided and despite the formal declaration of the end of the COVID-19 global health emergency by the World Health Organization in May 2023, our business may continue to experience materially adverse impacts as a result of the virus’s economic impact, including the availability and cost of funding and any recession that has occurred or may occur in the future. There are no comparable recent events that provide guidance as to the effect COVID-19 as a global pandemic may have, and, as a result, the ultimate impact of the outbreak is highly uncertain and subject to change. Additionally, many of the other risk factors described below are heightened by the effects of the COVID-19 pandemic and related economic conditions, which in turn could materially adversely affect...
:::

:::{admonition} Summary:
> the spread of COVID-19 has caused us to modify our business practices . The outbreak has adversely impacted and may further adversely impact our workforce and operations and the operations of our partners, customers, suppliers and third-party vendors
:::