## Example Use Case: 10-K Processing

:::{admonition} [10-K filings](https://www.investopedia.com/terms/1/10-k.asp)
:class: important

Our objective: produce summaries of selected portions of 10-K filings.

```{figure} ./images/10-K-investopedia.png
---
width: auto
name: 10-K-investopedia
---
```
:::

:::{note} All 10-K documents are available on KLC (up to 2024)

```{figure} ./images/10-K-klc.png
---
width: 900px
name: 10-K-klc
---
```
:::

:::{admonition} Find a Model

We can search the Hugging Face Model Hub for a summarization model, such as [Falconsai/text_summarization](https://huggingface.co/Falconsai/text_summarization).

```{figure} ./images/model-hub-summarize.png
---
width: auto
name: model-hub-summarize
---
```
:::

:::{admonition} Create a Summarization Pipeline
```python
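from datasets import Dataset
from transformers import pipeline

# input_dir (a pathlib.Path), output_file, and model_checkpoint are assumed
# to be defined at module level


def main(num_files=10):
    # get a listing of 10-K files; sort before slicing so the selection is deterministic
    files = sorted(input_dir.glob("*.txt"))[:num_files]

    # load and clean text, extract the MD&A section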
    data_dict = {"doc": [], "text": []}
    for f in files:
        print(f"loading: {f.name}")
        text = clean_html(f.read_text())
        mda_text = extract_mda(text)
        if mda_text is None:
            continue
        data_dict["doc"].append(f.name)
        data_dict["text"].append(mda_text)

    # create a dataset object
    dataset_10k = Dataset.from_dict(data_dict)
    print(f"created dataset: {dataset_10k}")

    # apply summarization pipeline to dataset
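    summarizer = pipeline("summarization", model=model_checkpoint)
    dataset_10k = dataset_10k.map(
        lambda batch: {
            # the pipeline returns one {"summary_text": ...} dict per input;
            # keep only the summary string
            "summary": [
                result["summary_text"]
                for result in summarizer(
                    batch["text"],
                    max_length=100,
                    min_length=30,
                    do_sample=False,
                    truncation=True,
                )
            ]
        },
        batched=True,
    )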

    # output to file
    dataset_10k.to_csv(output_file)
```
:::

:::{toggle}
```python
import re


def clean_html(html):
    # First we remove inline JavaScript/CSS:
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    
    # Then we remove html comments. This has to be done before removing regular
    # tags since comments can contain '>' characters.
    cleaned = re.sub(r"(?s)<!--(.*?)-->[\n]?", "", cleaned)
    
    # Next we can remove the remaining tags:
    cleaned = re.sub(r"(?s)<.*?>", " ", cleaned)
    
    # Finally, we deal with whitespace: replace non-breaking-space entities
    # and collapse runs of spaces into a single space
    cleaned = re.sub(r"&nbsp;", " ", cleaned)
    cleaned = re.sub(r" {2,}", " ", cleaned)
    return cleaned.strip()
```
:::
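`clean_html` above replaces only `&nbsp;`; other character references such as `&amp;` or `&#8217;` pass through unchanged. The following is a minimal sketch of one way to decode them, assuming the standard-library `html.unescape` is acceptable for this purpose. The helper name `decode_entities` and the sample string are hypothetical and not part of the original script.

```python
import html
import re


def decode_entities(text):
    """Decode HTML character references, then collapse whitespace.

    Hypothetical helper: intended for text whose tags were already removed,
    so that a decoded "&lt;" is not mistaken for the start of a tag.
    """
    decoded = html.unescape(text)
    # unescape turns &nbsp; into U+00A0, so map it to a plain space
    decoded = decoded.replace("\u00a0", " ")
    return re.sub(r"\s+", " ", decoded).strip()


# Sample string invented for illustration
sample = "Research &amp; development&nbsp;costs rose in the company&#8217;s  fiscal year"
print(decode_entities(sample))
```

Unlike `clean_html`, this sketch collapses all whitespace, including newlines, into single spaces. That is harmless here because `extract_mda` splits the text on whitespace anyway.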

:::{toggle}
```python
import re


def extract_mda(text):
    # Match "Discussion and Analysis of Financial Condition", allowing whitespace
    # and punctuation between words. The second occurrence is used, presumably
    # because the first one is the table-of-contents entry.
    pattern = r"Discussion[\s,.-]*and[\s,.-]*Analysis[\s,.-]*of[\s,.-]*Financial[\s,.-]*Condition"
    mda_matches = list(re.finditer(pattern, text, re.IGNORECASE))
    if len(mda_matches) < 2:
        return None

    # keep the first 250 words that follow the second match
    mda_text = text[mda_matches[1].end():]
    return " ".join(mda_text.split()[:250])
```
:::
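A likely reason for taking the second match in `extract_mda` is that the heading also appears in the table of contents before the section itself. The sketch below applies the same pattern to a toy document, invented for illustration and not taken from a real filing. The names `toy_text`, `matches`, and `section` are illustrative.

```python
import re

# Same pattern as in extract_mda
pattern = r"Discussion[\s,.-]*and[\s,.-]*Analysis[\s,.-]*of[\s,.-]*Financial[\s,.-]*Condition"

# Toy text: one table-of-contents entry, then the actual section heading
toy_text = (
    "Table of Contents Item 7. Management's Discussion and Analysis of "
    "Financial Condition and Results of Operations 25 "
    "Item 7. MANAGEMENT'S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION "
    "AND RESULTS OF OPERATIONS Revenue increased during the year."
)

matches = list(re.finditer(pattern, toy_text, re.IGNORECASE))
print(len(matches))

# Text following the second match: the body of the section
section = toy_text[matches[1].end():]
print(" ".join(section.split()[:250]))
```

Because the pattern stops at "Financial Condition", the extracted text starts with the remainder of the heading ("AND RESULTS OF OPERATIONS" in this toy case) before the section body.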