(input)=
# Managing Input Data
```{epigraph}
One home run is much better than two doubles.

-- Steve Jobs
```
```{contents}
```

## Introduction




When building applications with language models, developers often default to complex architectures involving retrieval systems, chunking strategies, and sophisticated pipelines. However, these approaches can add unnecessary complexity when simpler solutions exist. This is where long-context language models (LCLMs) {cite}`lee2024longcontextlanguagemodelssubsume` come in. LCLMs are a new class of models that can process massive amounts of text, up to millions of tokens, in a single forward pass. This capability means they can directly ingest and reason about entire documents or datasets without requiring external tools or complex preprocessing steps. The implications are significant: developers can build more maintainable systems by feeding raw text to the model rather than orchestrating complicated retrieval and chunking pipelines. Recent benchmarks have shown that this straightforward approach can match or exceed the performance of more complex systems like RAG, even though the models were never explicitly trained for such tasks. Before implementing sophisticated architectures, developers should first evaluate whether an LCLM's native capabilities might offer a simpler path to their goals.
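
As a first step in that evaluation, it can help to check whether a document is even likely to fit in a model's context window. The sketch below is a rough illustration only: the four-characters-per-token ratio is an assumed heuristic for English text (not a real tokenizer), and the function names, window sizes, and output reserve are hypothetical values chosen for the example.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. chars_per_token is an assumed heuristic, not a tokenizer."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(text: str, context_window: int, reserved_for_output: int = 1_000) -> bool:
    """Return True if the estimated prompt size leaves room for the model's reply."""
    return estimate_tokens(text) <= context_window - reserved_for_output


# Placeholder document: 50,000 characters of filler text.
document = "word " * 10_000

print(estimate_tokens(document))
print(fits_in_context(document, context_window=128_000))  # hypothetical large window
print(fits_in_context(document, context_window=8_000))    # hypothetical small window
```

For a real decision, the estimate should be replaced with the tokenizer of the model actually in use.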

## Parsing Documents

When discussing document processing with LLMs, the focus is often on sophisticated algorithms, from chunking to contextual inference to RAG. However, this misses the core challenge in production systems, which is roughly 80% about cleaning and normalizing the input and 20% about the actual algorithmic inference.

Building robust data ingestion and preprocessing pipelines is essential for any LLM application. This section explores powerful tools and frameworks like MarkItDown, Docling, and LangChain that streamline document processing. These tools provide unified interfaces for converting diverse document formats into standardized representations that LLMs can effectively process. By abstracting away format-specific complexities, they allow developers to focus on core application logic rather than parsing implementation details.
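
To make "cleaning and normalizing" concrete, here is a minimal sketch of the kind of cleanup that often precedes any LLM call. The specific steps (Unicode folding, rejoining hyphenated line breaks, whitespace collapsing) are assumptions chosen for illustration; they are not what MarkItDown or Docling do internally, and `normalize_text` is a hypothetical helper.

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Apply a few conservative cleanup steps to extracted text."""
    text = unicodedata.normalize("NFKC", raw)            # fold ligatures and non-breaking spaces
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify line endings
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)         # rejoin words hyphenated across lines
    text = re.sub(r"[ \t]+", " ", text)                  # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)                 # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)               # keep at most one blank line
    return text.strip()


sample = "The \ufb01nal  fore-\ncast\u00a0is   2.4%.\r\n\r\n\r\nNext  paragraph."
print(normalize_text(sample))
```

Note that the de-hyphenation rule is deliberately naive: it would also join a genuinely hyphenated compound that happens to break across lines, so it should be checked against the documents at hand.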


### MarkItDown

MarkItDown is a Python package and CLI tool developed by the Microsoft AutoGen team for converting various file formats to Markdown. It supports a wide range of formats including PDF, PowerPoint, Word, Excel, images (with OCR and EXIF metadata), audio (with transcription), HTML, and other text-based formats. The tool is particularly useful for document indexing and text analysis tasks.

Key features:
- Simple command-line and Python API interfaces
- Support for multiple file formats
- Optional LLM integration for enhanced image descriptions
- Batch processing capabilities
- Docker support for containerized usage

Sample usage:
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("test.xlsx")
print(result.text_content)
```

### Docling

Docling is a Python package developed by IBM Research for parsing and converting documents into various formats. It provides advanced document understanding capabilities with a focus on maintaining document structure and formatting.

Key features:
- Support for multiple document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, etc.)
- Advanced PDF parsing including layout analysis and table extraction
- Unified document representation format
- Integration with LlamaIndex and LangChain
- OCR support for scanned documents
- Simple CLI interface

Sample usage:
```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")
print(result.document.export_to_markdown())
```

### Frameworks-Based Parsing


### Case Study: Structured Data Extraction

A common use case where document parsing matters is extracting structured data from documents, particularly in the presence of complex formatting and layout. In this case study, we will extract the economic forecasts from Merrill Lynch's CIO Capital Market Outlook released on December 16, 2024 {cite:p}`merrill2024`. {numref}`forecast` shows page 7 of that document, which contains several economic variables.


```{figure} ../data/input/forecast.png
---
name: forecast
alt: Forecast
scale: 50%
align: center
---
Forecast
```

We will focus on the page containing the economic forecasts.

In [76]:
FORECAST_FILE_PATH = "../data/input/forecast.pdf"


First, we will use MarkItDown to extract the text content from the document.

In [83]:
from markitdown import MarkItDown

md = MarkItDown()
forecast_result_md = md.convert(FORECAST_FILE_PATH).text_content

Next, we will do the same with Docling.

In [None]:
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
forecast_result_docling = converter.convert(FORECAST_FILE_PATH).document.export_to_markdown()

How similar are the two results? We can use the Levenshtein distance to measure the similarity between them. We will also calculate a naive score using `SequenceMatcher` from the `difflib` package, a simple similarity measure based on the matching blocks found by repeatedly locating the longest contiguous matching subsequence of the two strings.

In [79]:
import Levenshtein
def levenshtein_similarity(text1: str, text2: str) -> float:
    """
    Calculate normalized Levenshtein similarity (1 - distance / max length)
    Returns value between 0 (completely different) and 1 (identical)
    """
    max_len = max(len(text1), len(text2))
    if max_len == 0:
        return 1.0  # two empty strings are identical
    distance = Levenshtein.distance(text1, text2)
    return 1 - (distance / max_len)

from difflib import SequenceMatcher
def simple_similarity(text1: str, text2: str) -> float:
    """
    Calculate similarity ratio using SequenceMatcher
    Returns value between 0 (completely different) and 1 (identical)
    """
    return SequenceMatcher(None, text1, text2).ratio()

In [80]:
levenshtein_similarity(forecast_result_md, forecast_result_docling)

0.13985705461925346

In [81]:
simple_similarity(forecast_result_md, forecast_result_docling)

0.17779960707269155

It turns out that the two results are quite different, with similarity scores of about 14.0% and 17.8% for Levenshtein and `SequenceMatcher`, respectively.
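
To build intuition for what these two scores measure, the sketch below works through both on tiny strings. The `levenshtein_distance` function is a plain-Python dynamic-programming version written for illustration; it is not the implementation used by the `Levenshtein` package. For `SequenceMatcher`, the ratio is computed as `2 * M / T`, where `M` is the number of matched characters and `T` is the total length of both strings.

```python
from difflib import SequenceMatcher


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


# "kitten" -> "sitting" needs 3 edits; the longer string has 7 characters.
distance = levenshtein_distance("kitten", "sitting")
similarity = 1 - distance / max(len("kitten"), len("sitting"))
print(distance, round(similarity, 3))  # 3 0.571

# "abcd" and "bcde" share the block "bcd": M = 3, T = 8, ratio = 2 * 3 / 8.
ratio = SequenceMatcher(None, "abcd", "bcde").ratio()
print(ratio)  # 0.75
```

Both scores operate on raw characters, so two extractions that contain the same information in a different order or layout can still score very low.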

Docling's result is fairly readable Markdown that displays the key economic variables and their forecasts in tables. MarkItDown's result, by contrast, is messier and harder to read: the information is there, just not in a structured format. Does it matter? That's what we will explore next.

**Docling's result**

In [85]:
from IPython.display import display, Markdown
display(Markdown(forecast_result_docling))

## MARKETS IN REVIEW

## Equities

|                       | Total Return in USD (%)   | Total Return in USD (%)   | Total Return in USD (%)   | Total Return in USD (%)   |
|-----------------------|---------------------------|---------------------------|---------------------------|---------------------------|
|                       | Current                   | WTD                       | MTD                       | YTD                       |
| DJIA                  | 43,828.06                 | -1.8                      | -2.3                      | 18.4                      |
| NASDAQ                | 19,926.72                 | 0.4                       | 3.7                       | 33.7                      |
| S&P 500               | 6,051.09                  | -0.6                      | 0.4                       | 28.6                      |
| S&P 400 Mid Cap       | 3,277.20                  | -1.6                      | -2.6                      | 19.5                      |
| Russell 2000          | 2,346.90                  | -2.5                      | -3.5                      | 17.3                      |
| MSCI World            | 3,817.24                  | -1.0                      | 0.2                       | 22.1                      |
| MSCI EAFE             | 2,319.05                  | -1.5                      | 0.2                       | 6.4                       |
| MSCI Emerging Markets | 1,107.01                  | 0.3                       | 2.7                       | 10.6                      |

## Fixed Income †

|                              | Total Return in USD (%)   | Total Return in USD (%)   | Total Return in USD (%)   | Total Return in USD (%)   |
|------------------------------|---------------------------|---------------------------|---------------------------|---------------------------|
|                              | Current                   | WTD                       | MTD                       | YTD                       |
| Corporate & Government       | 4.66                      | -1.34                     | -0.92                     | 1.94                      |
| Agencies                     | 4.54                      | -0.58                     | -0.31                     | 3.35                      |
| Municipals                   | 3.55                      | -0.87                     | -0.54                     | 1.99                      |
| U.S. Investment Grade Credit | 4.79                      | -1.38                     | -0.93                     | 1.97                      |
| International                | 5.17                      | -1.40                     | -0.90                     | 3.20                      |
| High Yield                   | 7.19                      | -0.22                     | 0.20                      | 8.87                      |
| 90 Day Yield                 | 4.32                      | 4.39                      | 4.49                      | 5.33                      |
| 2 Year Yield                 | 4.24                      | 4.10                      | 4.15                      | 4.25                      |
| 10 Year Yield                | 4.40                      | 4.15                      | 4.17                      | 3.88                      |
| 30 Year Yield                | 4.60                      | 4.34                      | 4.36                      | 4.03                      |

## Commodities & Currencies

|                       | Total Return in USD (%)   | Total Return in USD (%)   | Total Return in USD (%)   | Total Return in USD (%)   |
|-----------------------|---------------------------|---------------------------|---------------------------|---------------------------|
| Commodities           | Current                   | WTD                       | MTD                       | YTD                       |
| Bloomberg Commodity   | 237.90                    | 1.3                       | 0.7                       | 5.1                       |
| WTI Crude $/Barrel †† | 71.29                     | 6.1                       | 4.8                       | -0.5                      |
| Gold Spot $/Ounce ††  | 2648.23                   | 0.6                       | 0.2                       | 28.4                      |

## Total Return in USD (%)

| Currencies   |   Current |   Prior   Week End |   Prior   Month End |   2022   Year End |
|--------------|-----------|--------------------|---------------------|-------------------|
| EUR/USD      |      1.05 |               1.06 |                1.06 |              1.1  |
| USD/JPY      |    153.65 |             150    |              149.77 |            141.04 |
| USD/CNH      |      7.28 |               7.28 |                7.25 |              7.13 |

## S&P Sector Returns

<!-- image -->

Sources: Bloomberg, Factset. Total Returns from the period of 12/9/2024 to 12/13/2024. †Bloomberg Barclays Indices. ††Spot price returns. All data as of the 12/13/2024 close. Data would differ if a different time period was displayed. Short-term performance shown to illustrate more recent trend. Past performance is no guarantee

of future results.

## Economic Forecasts (as of 12/13/2024)

|                                    | Q4 2024E   |   2024E | Q1 2025E   | Q2 2025E   | Q3 2025E   | Q4 2025E   |   2025E |
|------------------------------------|------------|---------|------------|------------|------------|------------|---------|
| Real global GDP (% y/y annualized) | -          |    3.1  | -          | -          | -          | -          |    3.2  |
| Real U.S. GDP (% q/q annualized)   | 2.0        |    2.7  | 2.5        | 2.3        | 2.2        | 2.2        |    2.4  |
| CPI inflation (% y/y)              | 2.7        |    2.9  | 2.3        | 2.3        | 2.7        | 2.5        |    2.5  |
| Core CPI inflation (% y/y)         | 3.3        |    3.4  | 3.0        | 2.9        | 3.2        | 3.1        |    3    |
| Unemployment rate (%)              | 4.2        |    4    | 4.3        | 4.3        | 4.4        | 4.4        |    4.3  |
| Fed funds rate, end period (%)     | 4.38       |    4.38 | 4.13       | 3.88       | 3.88       | 3.88       |    3.88 |

The forecasts in the table above are the base line view from BofA Global Research. The Global Wealth & Investment Management (GWIM) Investment Strategy Committee (ISC) may make adjustments to this view over the course of the year and can express upside/downside to these forecasts. Historical data is sourced from Bloomberg, FactSet, and

Haver Analytics. There can be no assurance that the forecasts will be achieved. Economic or financial forecasts are inherently limited and should not be relied on as indicators of future investment performance.

A = Actual. E/* = Estimate.

Sources: BofA Global Research; GWIM ISC as of December 13, 2024.

## Asset Class Weightings (as of 12/3/2024)

|                                        | CIO View                     | CIO View    | CIO View   | CIO View   | CIO View   |
|----------------------------------------|------------------------------|-------------|------------|------------|------------|
| Asset Class                            | Underweight                  | Underweight | Neutral    | Overweight | Overweight |
| Global Equities                        | slight over weight green    |            |           |            |           |
| U.S. Large Cap Growth                  |                             |            |            |           |           |
| U.S. Large Cap Value                   | Slight over weight green    |            |           |            |           |
| U.S. Small Cap Growth                  | slight over weight green    |            |           |            |           |
| U.S. Small Cap Value                   | slight over weight green    |            |           |            |           |
| International Developed                | Slight underweight orange   |             |           |           |           |
| Emerging Markets                       |                             |            |            |           |           |
| Global Fixed Income                    | slight underweight orange   |             |           |           |           |
| U.S. Governments                       | slight over weight green    |            |           |            |           |
| U.S. Mortgages                         | Slight over weight green    |            |           |            |           |
| U.S. Corporates                        | Slight underweight orange   |             |           |           |           |
| International Fixed Income             |                             |            |            |           |           |
| High Yield                             | Slight underweight orange   |             |           |           |           |
| U.S. Investment-grade                  | Neutral yellow              |            |            |           |           |
| Tax Exempt  U.S. High Yield Tax Exempt | Slight underweight orange   |             |           |           |           |
| Cash                                   |                              |             |            |            |            |

## CIO Equity Sector Views

|                         | CIO View                     | CIO View    | CIO View   | CIO View   | CIO View   |
|-------------------------|------------------------------|-------------|------------|------------|------------|
| Sector                  |                              | Underweight | Neutral    |            | Overweight |
| Utilities               | slight over weight green    |            |           |            |           |
| Financials              | slight over weight green    |            |           |            |           |
| Healthcare              | slight over weight green    |            |           |            |           |
| Consumer  Discretionary | Slight over weight green    |            |           |            |           |
| Information  Technology | Neutral yellow              |            |            |           |           |
| Communication  Services | Neutral yellow              |            |            |           |           |
| Industrials             | Neutral yellow              |            |            |           |           |
| Real Estate             | Neutral yellow              |            |            |           |           |
| Energy                  | slight underweight orange   |             |           |           |           |
| Materials               | slight underweight orange   |             |           |           |           |
| Consumer  Staples       | underweight red              |            |           |           |           |

CIO asset class views are relative to the CIO Strategic Asset Allocation (SAA) of a multi-asset portfolio. Source: Chief Investment Office as of December 3, 2024. All sector and asset allocation recommendations must be considered in the context of an individual investor's goals, time horizon, liquidity needs and risk tolerance. Not all recommendations will be in the best interest of all investors.

**MarkItDown's result**

In [96]:
from IPython.display import display, Markdown
display(Markdown(forecast_result_md[:500]))

Economic Forecasts (as of 12/13/2024)

Real global GDP (% y/y annualized)
Real U.S. GDP (% q/q annualized)
CPI inflation (% y/y)
Core CPI inflation (% y/y)
Unemployment rate (%)
Fed funds rate, end period (%)

Q4 2024E
-
2.0
2.7
3.3
4.2
4.38

2024E
3.1
2.7
2.9
3.4
4.0
4.38

Q1 2025E  Q2 2025E  Q3 2025E  Q4 2025E

-
2.5
2.3
3.0
4.3
4.13

-
2.3
2.3
2.9
4.3
3.88

-
2.2
2.7
3.2
4.4
3.88

-
2.2
2.5
3.1
4.4
3.88

2025E
3.2
2.4
2.5
3.0
4.3
3.88

The forecasts in the table above are the base line view f
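
The MarkItDown excerpt above is laid out column by column: one block of row labels, then one block per column holding a header followed by its values. The sketch below shows how such blocks could be regrouped into rows. It is only an illustration of the layout, not something MarkItDown or the LLM does; `blocks_to_rows` is a hypothetical helper and the excerpt is shortened to the first three variables and the first two columns.

```python
def blocks_to_rows(text: str) -> dict[str, dict[str, str]]:
    """Regroup blank-line-separated blocks: first block = row labels, later blocks = header + values."""
    blocks = [block.splitlines() for block in text.strip().split("\n\n")]
    labels, columns = blocks[0], blocks[1:]
    table: dict[str, dict[str, str]] = {label: {} for label in labels}
    for header, *values in columns:
        if len(values) != len(labels):
            raise ValueError(f"Column {header!r} has {len(values)} values for {len(labels)} labels")
        for label, value in zip(labels, values):
            table[label][header] = value
    return table


# Shortened excerpt of the MarkItDown output shown above.
excerpt = """Real global GDP (% y/y annualized)
Real U.S. GDP (% q/q annualized)
CPI inflation (% y/y)

Q4 2024E
-
2.0
2.7

2024E
3.1
2.7
2.9"""

rows = blocks_to_rows(excerpt)
print(rows["Real U.S. GDP (% q/q annualized)"])
```

The full output is harder to handle than this excerpt: the four quarterly 2025 headers share a single line while their values follow in separate blocks, which this sketch does not attempt to resolve.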

Now, let's focus on the economic forecasts. In particular, we are interested in the CIO's 2025E forecasts.

```{figure} ../data/input/2025.png
---
name: forecast2025
alt: Forecast 2025
scale: 60%
align: center
---
Forecast 2025
```

We will define a `Forecast` pydantic model to represent an economic forecast composed of a `financial_variable` and a `financial_forecast`. We will also define an `EconForecast` pydantic model to represent the list of economic forecasts we want to extract from the document.


In [12]:
from pydantic import BaseModel
class Forecast(BaseModel):
    financial_variable: str
    financial_forecast: float
class EconForecast(BaseModel):
    forecasts: list[Forecast]


We write a simple function that extracts the economic forecasts from the document using an LLM with structured output. It uses the following prompt template, where `extract_prompt` describes the kind of data the user would like to extract and `doc` is the input document to analyze.

```python
BASE_PROMPT = f"""
    ROLE: You are an expert at structured data extraction. 
    TASK: Extract the following data {extract_prompt} from input DOCUMENT
    FORMAT: The output should be a JSON object with 'financial_variable' as key and 'financial_forecast' as value.
    """
prompt = f"{BASE_PROMPT} \n\n DOCUMENT: {doc}"
```
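
The FORMAT line of the template is terse, so it may help to see the JSON shape that the `EconForecast` schema implies: an object with a `forecasts` list whose items each carry a `financial_variable` and a `financial_forecast`. The sketch below parses such a payload with the standard library only. It is a stand-in written for illustration, with hypothetical names (`ForecastRecord`, `parse_forecasts`); in practice Pydantic performs this validation, for example via `EconForecast.model_validate_json` in Pydantic v2.

```python
import json
from dataclasses import dataclass


@dataclass
class ForecastRecord:
    """Stdlib mirror of the Forecast model, for illustration only."""
    financial_variable: str
    financial_forecast: float


def parse_forecasts(payload: str) -> list[ForecastRecord]:
    """Parse a JSON string shaped like EconForecast and coerce values to the declared types."""
    data = json.loads(payload)
    records = []
    for item in data["forecasts"]:  # KeyError if the top-level key is missing
        records.append(ForecastRecord(
            financial_variable=str(item["financial_variable"]),
            financial_forecast=float(item["financial_forecast"]),  # ValueError if not numeric
        ))
    return records


payload = '{"forecasts": [{"financial_variable": "CPI inflation (% y/y)", "financial_forecast": 2.5}]}'
print(parse_forecasts(payload))
```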

In [84]:
def extract_from_doc(extract_prompt: str,  doc: str, client) -> EconForecast:
    """
    Extract structured data from a financial document using an LLM.
    
    Args:
        extract_prompt: Description of the data to extract
        doc: The financial document text to analyze
        client: The OpenAI client used to call the model
        
    Returns:
        EconForecast object containing the extracted variables and values
    """

    BASE_PROMPT = f"""
    ROLE: You are an expert at structured data extraction. 
    TASK: Extract the following data {extract_prompt} from input DOCUMENT
    FORMAT: The output should be a JSON object with 'financial_variable' as key and 'financial_forecast' as value.
    """
    prompt = f"{BASE_PROMPT} \n\n DOCUMENT: {doc}"
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": prompt
            },
            {"role": "user", "content": doc}
        ],
        response_format=EconForecast
    )
    return completion.choices[0].message.parsed

In [22]:
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv(override=True)
from openai import OpenAI
client = OpenAI()

The user then calls the `extract_from_doc` function, simply specifying that "Economic Forecasts for 2025E" is the data they would like to extract from the document. We perform the extraction twice, once with MarkItDown's output and once with Docling's.

In [30]:
extract_prompt = "Economic Forecasts for 2025E"
md_financials = extract_from_doc(extract_prompt, forecast_result_md, client)
docling_financials = extract_from_doc(extract_prompt, forecast_result_docling, client)

The response is an `EconForecast` object containing a list of `Forecast` objects. We can then convert the response to a pandas DataFrame for easier comparison.

In [99]:
md_financials

EconForecast(forecasts=[Forecast(financial_variable='Real global GDP (% y/y annualized)', financial_forecast=3.2), Forecast(financial_variable='Real U.S. GDP (% q/q annualized)', financial_forecast=2.4), Forecast(financial_variable='CPI inflation (% y/y)', financial_forecast=2.5), Forecast(financial_variable='Core CPI inflation (% y/y)', financial_forecast=3.0), Forecast(financial_variable='Unemployment rate (%)', financial_forecast=4.3), Forecast(financial_variable='Fed funds rate, end period (%)', financial_forecast=3.88)])

In [None]:
import pandas as pd

df_md_forecasts = pd.DataFrame([(f.financial_variable, f.financial_forecast) for f in md_financials.forecasts], 
                      columns=['Variable', 'Forecast'])
df_docling_forecasts = pd.DataFrame([(f.financial_variable, f.financial_forecast) for f in docling_financials.forecasts], 
                      columns=['Variable', 'Forecast'])


In [97]:
df_md_forecasts

|   | Variable                           | Forecast |
|---|------------------------------------|----------|
| 0 | Real global GDP (% y/y annualized) | 3.2      |
| 1 | Real U.S. GDP (% q/q annualized)   | 2.4      |
| 2 | CPI inflation (% y/y)              | 2.5      |
| 3 | Core CPI inflation (% y/y)         | 3.0      |
| 4 | Unemployment rate (%)              | 4.3      |
| 5 | Fed funds rate, end period (%)     | 3.88     |


In [98]:
df_docling_forecasts

|   | Variable                           | Forecast |
|---|------------------------------------|----------|
| 0 | Real global GDP (% y/y annualized) | 3.2      |
| 1 | Real U.S. GDP (% q/q annualized)   | 2.4      |
| 2 | CPI inflation (% y/y)              | 2.5      |
| 3 | Core CPI inflation (% y/y)         | 3.0      |
| 4 | Unemployment rate (%)              | 4.3      |
| 5 | Fed funds rate, end period (%)     | 3.88     |


The results from both MarkItDown and Docling are identical and accurately match the true values from the document. This demonstrates that despite MarkItDown's output appearing less readable from a human perspective, both approaches successfully extracted the economic forecast data with equal precision. The formatting differences between the two methods did not impact their ability to capture and structure the underlying information, at least in this particular case.

Now, let's focus on the asset class weightings. We will extract them from the document and compare the results from MarkItDown and Docling. This information is presented in a quite different structure: the CIO view is shown on a spectrum from "Underweight" through "Neutral" to "Overweight", and the actual view is marked by colored dots. Let's see if we can extract the information from the document.
```{figure} ../data/input/asset_class.png
---
name: asset_class
alt: Asset Class Weightings
scale: 60%
align: center
---
Asset Class Weightings
```

The user will simply define the following data to extract: "Asset Class Weightings (as of 12/3/2024) in a scale from -2 to 2". With this, we expect "Underweight" to be mapped to -2, "Neutral" to 0, and "Overweight" to 2, with intermediate positions taking values in between.
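
The intended mapping can be written down explicitly. The sketch below converts the textual labels that appear in Docling's parsed table (for example "slight over weight green") into scores on the -2 to 2 scale. The five-point assignment is an assumption: the text only fixes the endpoints and the midpoint. `label_to_score` is a hypothetical helper, not part of the extraction pipeline, which relies on the LLM to do this mapping.

```python
from typing import Optional

# Assumed five-point scale; only -2, 0, and 2 are fixed by the extraction prompt.
SCALE = {
    "underweight": -2,
    "slight underweight": -1,
    "neutral": 0,
    "slight overweight": 1,
    "overweight": 2,
}
# Color words that accompany the labels in the parsed table and carry no extra meaning here.
COLOR_WORDS = {"red", "orange", "yellow", "green"}


def label_to_score(label: str) -> Optional[int]:
    """Map a label such as 'Slight over weight green' to a score, or None if unrecognized."""
    words = [w for w in label.lower().split() if w.isalpha() and w not in COLOR_WORDS]
    normalized = " ".join(words)
    normalized = normalized.replace("over weight", "overweight").replace("under weight", "underweight")
    return SCALE.get(normalized)


for label in ["slight over weight green", "Neutral yellow", "Slight underweight orange", "underweight red", ""]:
    print(repr(label), "->", label_to_score(label))
```

Empty cells return `None` rather than a guessed score, which leaves the decision about unlabeled rows to the caller.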

In [41]:
extract_prompt = "Asset Class Weightings (as of 12/3/2024) in a scale from -2 to 2"
asset_class_docling = extract_from_doc(extract_prompt, forecast_result_docling, client)
asset_class_md = extract_from_doc(extract_prompt, forecast_result_md, client)

In [None]:

df_md = pd.DataFrame([(f.financial_variable, f.financial_forecast) for f in asset_class_md.forecasts], 
                 columns=['Variable', 'Forecast'])
df_docling = pd.DataFrame([(f.financial_variable, f.financial_forecast) for f in asset_class_docling.forecasts], 
                 columns=['Variable', 'Forecast'])

Now we construct a DataFrame to compare the results from MarkItDown and Docling with an added "true_value" column containing the true values from the document.

In [72]:
# Create DataFrame with specified columns
df_comparison = pd.DataFrame({
    'variable': df_docling['Variable'].iloc[:-1],
    'markitdown': df_md['Forecast'],
    'docling': df_docling['Forecast'].iloc[:-1],  # Drop last row
    'true_value': [1.0, 0.0, 1.0, 1.0, 1.0, -1.0, 0.0, -1.0, 1.0, 1.0, -1.0, 0.0, -1.0, 0.0, -1.0]
})

display(df_comparison)


|   | variable                | markitdown | docling | true_value |
|---|-------------------------|------------|---------|------------|
| 0 | Global Equities         | 1.0        | 1.0     | 1.0        |
| 1 | U.S. Large Cap Growth   | 1.0        | 1.0     | 0.0        |
| 2 | U.S. Large Cap Value    | 1.0        | 1.0     | 1.0        |
| 3 | U.S. Small Cap Growth   | 1.0        | 1.0     | 1.0        |
| 4 | U.S. Small Cap Value    | 1.0        | 1.0     | 1.0        |
| 5 | International Developed | 1.0        | -1.0    | -1.0       |
| 6 | Emerging Markets        | 1.0        | 0.0     | 0.0        |
| 7 | Global Fixed Income     | -1.0       | -1.0    | -1.0       |
| 8 | U.S. Governments        | -1.0       | 1.0     | 1.0        |
| 9 | U.S. Mortgages          | -1.0       | 1.0     | 1.0        |


In [73]:
# Calculate accuracy for markitdown and docling
markitdown_accuracy = (df_comparison['markitdown'] == df_comparison['true_value']).mean()
docling_accuracy = (df_comparison['docling'] == df_comparison['true_value']).mean()

print(f"Markitdown accuracy: {markitdown_accuracy:.2%}")
print(f"Docling accuracy: {docling_accuracy:.2%}") 


Markitdown accuracy: 53.33%
Docling accuracy: 93.33%


Docling performs considerably better, at 93.33% accuracy, missing only one value. MarkItDown achieves 53.33% accuracy, struggling with the nuanced asset class weightings. In this case, Docling's structured output did help the LLM extract the information more accurately than MarkItDown's unstructured output.
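
An accuracy figure alone does not show which rows went wrong. The sketch below is a small, dependency-free helper that returns both the accuracy and the mismatching rows, comparing floats with a tolerance instead of exact equality. The helper name and tolerance are assumptions made for illustration; the sample data is the first five rows of the comparison table above.

```python
import math


def report_mismatches(names, predicted, expected, tolerance=1e-9):
    """Return (accuracy, [(name, predicted, expected), ...]) for rows that disagree."""
    if not names or not (len(names) == len(predicted) == len(expected)):
        raise ValueError("Inputs must be non-empty and of equal length")
    misses = [
        (name, p, e)
        for name, p, e in zip(names, predicted, expected)
        if not math.isclose(p, e, abs_tol=tolerance)
    ]
    accuracy = 1 - len(misses) / len(names)
    return accuracy, misses


# First five rows of the comparison table above (Docling column vs. true values).
names = ["Global Equities", "U.S. Large Cap Growth", "U.S. Large Cap Value",
         "U.S. Small Cap Growth", "U.S. Small Cap Value"]
docling = [1.0, 1.0, 1.0, 1.0, 1.0]
truth = [1.0, 0.0, 1.0, 1.0, 1.0]

accuracy, misses = report_mismatches(names, docling, truth)
print(f"{accuracy:.0%}", misses)
```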

What if we want to systematically extract all tables from the document? Docling makes this straightforward: the document returned by `DocumentConverter.convert` exposes a `tables` attribute that we can iterate over.

Docling extracts 7 tables from the document, exported top to bottom and left to right in order of appearance.
Below we display the first table (Equities returns), the second (Fixed Income returns), and the last, which contains the CIO Equity Sector Views.


In [47]:
import time
from pathlib import Path
import pandas as pd
from docling.document_converter import DocumentConverter

In [50]:
def convert_and_export_tables(file_path: Path) -> list[pd.DataFrame]:
    """
    Convert document and export tables to DataFrames.
    
    Args:
        file_path: Path to input document
        
    Returns:
        List of pandas DataFrames containing the tables
    """
    doc_converter = DocumentConverter()
    start_time = time.time()
    
    conv_res = doc_converter.convert(file_path)
    
    tables = []
    # Export tables
    for table in conv_res.document.tables:
        table_df: pd.DataFrame = table.export_to_dataframe()
        tables.append(table_df)

    end_time = time.time() - start_time
    print(f"Document converted in {end_time:.2f} seconds.")
    
    return tables


In [None]:
# Convert and export tables
tables = convert_and_export_tables(Path(FORECAST_FILE_PATH))

In [100]:
len(tables)

7

In [59]:
display(tables[0])

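|   |                       | Total Return in USD (%).Current | Total Return in USD (%).WTD | Total Return in USD (%).MTD | Total Return in USD (%).YTD |
|---|-----------------------|---------------------------------|-----------------------------|-----------------------------|-----------------------------|
| 0 | DJIA                  | 43828.06                        | -1.8                        | -2.3                        | 18.4                        |
| 1 | NASDAQ                | 19926.72                        | 0.4                         | 3.7                         | 33.7                        |
| 2 | S&P 500               | 6051.09                         | -0.6                        | 0.4                         | 28.6                        |
| 3 | S&P 400 Mid Cap       | 3277.2                          | -1.6                        | -2.6                        | 19.5                        |
| 4 | Russell 2000          | 2346.9                          | -2.5                        | -3.5                        | 17.3                        |
| 5 | MSCI World            | 3817.24                         | -1.0                        | 0.2                         | 22.1                        |
| 6 | MSCI EAFE             | 2319.05                         | -1.5                        | 0.2                         | 6.4                         |
| 7 | MSCI Emerging Markets | 1107.01                         | 0.3                         | 2.7                         | 10.6                        |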


In [102]:
display(tables[1])

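|   |                              | Total Return in USD (%).Current | Total Return in USD (%).WTD | Total Return in USD (%).MTD | Total Return in USD (%).YTD |
|---|------------------------------|---------------------------------|-----------------------------|-----------------------------|-----------------------------|
| 0 | Corporate & Government       | 4.66                            | -1.34                       | -0.92                       | 1.94                        |
| 1 | Agencies                     | 4.54                            | -0.58                       | -0.31                       | 3.35                        |
| 2 | Municipals                   | 3.55                            | -0.87                       | -0.54                       | 1.99                        |
| 3 | U.S. Investment Grade Credit | 4.79                            | -1.38                       | -0.93                       | 1.97                        |
| 4 | International                | 5.17                            | -1.4                        | -0.9                        | 3.2                         |
| 5 | High Yield                   | 7.19                            | -0.22                       | 0.2                         | 8.87                        |
| 6 | 90 Day Yield                 | 4.32                            | 4.39                        | 4.49                        | 5.33                        |
| 7 | 2 Year Yield                 | 4.24                            | 4.1                         | 4.15                        | 4.25                        |
| 8 | 10 Year Yield                | 4.4                             | 4.15                        | 4.17                        | 3.88                        |
| 9 | 30 Year Yield                | 4.6                             | 4.34                        | 4.36                        | 4.03                        |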


In [61]:
display(tables[6])

|   | Sector                 | CIO View.                 | CIO View.Underweight | CIO View.Neutral | CIO View..1 | CIO View.Overweight |
|---|------------------------|---------------------------|----------------------|------------------|-------------|---------------------|
| 0 | Utilities              | slight over weight green  |                      |                  |             |                     |
| 1 | Financials             | slight over weight green  |                      |                  |             |                     |
| 2 | Healthcare             | slight over weight green  |                      |                  |             |                     |
| 3 | Consumer Discretionary | Slight over weight green  |                      |                  |             |                     |
| 4 | Information Technology | Neutral yellow            |                      |                  |             |                     |
| 5 | Communication Services | Neutral yellow            |                      |                  |             |                     |
| 6 | Industrials            | Neutral yellow            |                      |                  |             |                     |
| 7 | Real Estate            | Neutral yellow            |                      |                  |             |                     |
| 8 | Energy                 | slight underweight orange |                      |                  |             |                     |
| 9 | Materials              | slight underweight orange |                      |                  |             |                     |
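
The exported tables carry flattened two-level headers such as `Total Return in USD (%).Current` and `CIO View..1`, which are awkward to work with downstream. The sketch below shows one possible cleanup that keeps the sub-header and falls back to the group header when there is none. `simplify_columns` is a hypothetical helper, not a Docling or pandas function, and the rule of splitting on the last dot is an assumption that would mangle headers containing dots of their own.

```python
def simplify_columns(columns: list[str]) -> list[str]:
    """Reduce flattened 'Group.Sub' headers to 'Sub', keeping the resulting names unique."""
    seen: dict[str, int] = {}
    simplified = []
    for position, name in enumerate(columns):
        prefix, sep, suffix = name.rpartition(".")
        suffix = suffix.strip()
        # A purely numeric suffix after a dot is treated as a duplicate-name marker.
        is_marker = bool(sep) and suffix.isdigit()
        base = suffix if suffix and not is_marker else prefix.strip(". ")
        base = base or f"column_{position}"  # placeholder for empty headers
        seen[base] = seen.get(base, 0) + 1
        simplified.append(base if seen[base] == 1 else f"{base}_{seen[base]}")
    return simplified


returns_columns = ["", "Total Return in USD (%).Current", "Total Return in USD (%).WTD"]
sector_columns = ["Sector", "CIO View.", "CIO View.Underweight", "CIO View.Neutral",
                  "CIO View..1", "CIO View.Overweight"]

print(simplify_columns(returns_columns))
print(simplify_columns(sector_columns))
```

With a DataFrame `df`, the result could be applied as `df.columns = simplify_columns(list(df.columns))`.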


Coming back to MarkItDown, one interesting feature to explore is its ability to extract information from images by passing an image-capable LLM.

In [55]:
md_llm = MarkItDown(llm_client=client, llm_model="gpt-4o-mini")

In [None]:
result = md_llm.convert("../data/input/forecast.png")

Here's the description we obtain from the image of our input document. Overall, the description is somewhat accurate but contains a few inaccuracies including:

- For the asset class weightings, the description states there are "underweight positions in U.S. Small Cap Growth" but looking at the Asset Class Weightings chart, U.S. Small Cap Growth actually shows an overweight position (green circle).
- The description mentions "overweight positions in certain sectors such as Utilities and Financials" but looking at the CIO Equity Sector Views, both these sectors show neutral positions, not overweight positions.
- For fixed income, the description cites a "10-Year (4.03%)" yield, but the image shows the 30-Year Yield at 4.03%, while the 10-Year Yield is actually 4.40%.

Arguably, the description's inaccuracies could be a consequence of the underlying LLM's limited ability to process the image. Further research is needed to determine if this is the case.


In [64]:
display(Markdown(result.text_content))


# Description:
**Markets in Review: Economic Forecasts and Asset Class Weightings (as of 12/13/2024)**

This detailed market overview presents key performance metrics and economic forecasts as of December 13, 2024.

**Equities Overview:**
- **Total Returns:** Highlights returns for major indices such as the DJIA (18.4% YTD), NASDAQ (33.7% YTD), and S&P 500 (28.6% YTD), showcasing strong performance across the board.
- **Forecasts:** Economic indicators reveal a projected real global GDP growth of 3.1%, with inflation rates expected to stabilize around 2.2% in 2025. Unemployment rates are anticipated to remain low at 4.4%.

**Fixed Income:**
- Focuses on various segments, including Corporate & Government bonds, which offer an annualized return of 4.66% and indicate shifting trends in interest rates over 2-Year (4.25%) and 10-Year (4.03%) bonds.

**Commodities & Currencies:**
- Commodities such as crude oil and gold show varied performance, with oil increasing by 4.8% and gold prices sitting at $2,648.23 per ounce.
- Currency metrics highlight the Euro and USD trends over the past year.

**S&P Sector Returns:**
- A quick reference for sector performance indicates a significant 2.5% return in Communication Services, while other sectors like Consumer Staples and Materials display minor fluctuations.

**CIO Asset Class Weightings:**
- Emphasizes strategic asset allocation recommendations which are crucial for an investor's portfolio. Underweight positions in U.S. Small Cap Growth and International Developed contrast with overweight positions in certain sectors such as Utilities and Financials, signaling tactical shifts based on ongoing economic assessments.

**Note:** This summary is sourced from BofA Global Research and aims to provide a comprehensive view of current market conditions and forecasts to assist investors in making informed decisions.


## Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) {cite}`lewis2021retrievalaugmentedgenerationknowledgeintensivenlp` is a technique that lets an LLM retrieve information from an external knowledge base and use it to answer questions. It is a popular choice for building LLM applications that handle knowledge-intensive tasks.
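
To make the retrieve-then-generate idea concrete, here is a minimal sketch that uses only the standard library. It ranks passages by bag-of-words cosine similarity and pastes the best match into a prompt. The function names (`retrieve`, `build_prompt`) and the toy knowledge base are assumptions made for illustration; a real system would typically use embeddings and a vector store instead of word counts.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, knowledge_base: list[str], k: int = 1) -> list[str]:
    """Return the k passages most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        knowledge_base,
        key=lambda passage: cosine(q, Counter(tokenize(passage))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, passages: list[str]) -> str:
    """Augment the question with the retrieved passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context below.\n\nCONTEXT:\n{context}\n\nQUESTION: {query}"


# Toy knowledge base, made up for this example.
knowledge_base = [
    "MarkItDown converts documents such as PDF and DOCX files to Markdown.",
    "Fixed-size chunking splits text into pieces with a set number of tokens.",
    "The S&P 500 is a stock market index of large US companies.",
]
query = "How does fixed-size chunking split text?"
top = retrieve(query, knowledge_base, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

The resulting `prompt` is what would be sent to the LLM, so the model answers from the retrieved passage rather than from its training data alone.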

## Case Studies

This section presents three case studies that demonstrate practical solutions to common LLM limitations:

First, Content Chunking with Contextual Linking showcases how intelligent chunking strategies can overcome both context window and output token limitations. This case study illustrates techniques for breaking down and reassembling content while maintaining coherence, enabling the generation of high-quality long-form outputs despite model constraints.

Second, a Retrieval Augmented Generation case study addresses the challenge of stale or outdated model knowledge. By implementing semantic search over a GitHub repository, this example demonstrates how to augment LLM responses with current, accurate information - allowing users to query and receive up-to-date answers about code repository contents.

Third, the final case study builds a quiz generator with citations. This case study explores additional input management techniques that become particularly useful when a long context window is available, including prompt caching for efficiency and citations to enhance response accuracy and verifiability. These approaches show how to maximize the benefits of larger context models while maintaining response quality.

### Case Study I: Content Chunking with Contextual Linking

Content chunking with contextual linking is a technique used to manage the `max_output_tokens` limitation by breaking down long-form content into smaller, manageable chunks while keeping chunk-specific context. This approach tackles three problems:
1. The LLM's inability to process long inputs due to context-size limits.
2. The LLM's inability to generate long-form content due to the `max_output_tokens` limitation.
3. The LLM's inability to maintain coherence and context when generating responses per chunk.

Implementing content chunking with contextual linking involves the following steps:
1. **Chunking the Content**: The input content is split into smaller chunks. This allows the LLM to process each chunk individually, focusing on generating a complete and detailed response for that specific section of the input.

2. **Maintaining Context**: Each chunk is linked with contextual information from the previous chunks. This helps in maintaining the flow and coherence of the content across multiple chunks.

3. **Generating Linked Prompts**: For each chunk, a prompt is generated that includes the chunk's content and its context. This prompt is then used to generate the output for that chunk.

4. **Combining the Outputs**: The outputs of all chunks are combined to form the final long-form content.

Let's examine an example implementation of this technique.

#### Generating long-form content

- Goal: Generate a long-form report analyzing a company's financial statement.
- Input: A company's 10-K SEC filing.

```{figure} ../_static/structured_output/diagram1.png
---
name: content-chunking-with-contextual-linking
alt: Content Chunking with Contextual Linking
scale: 50%
align: center
---
Content Chunking with Contextual Linking Schematic Representation.
```

The diagram in {numref}`content-chunking-with-contextual-linking` illustrates the process we will follow for handling long-form content generation with Large Language Models through "Content Chunking with Contextual Linking." It shows how input content is first split into manageable chunks using a chunking function (e.g. `CharacterTextSplitter` with `tiktoken` tokenizer), then each chunk is processed sequentially while maintaining context from previous chunks. For each chunk, the system updates the context, generates a dynamic prompt with specific parameters, makes a call to the LLM chain, and stores the response. After all chunks are processed, the individual responses are combined with newlines to create the final report, effectively working around the token limit constraints of LLMs while maintaining coherence across the generated content.

**Step 1: Chunking the Content**

There are different methods for chunking, and each of them might be appropriate for different situations. However, we can broadly group chunking strategies into two types:
- **Fixed-size Chunking**: This is the most common and straightforward approach to chunking. We simply decide the number of tokens in our chunk and, optionally, whether there should be any overlap between them. In general, we will want to keep some overlap between chunks to make sure that the semantic context doesn’t get lost between chunks. Fixed-size chunking may be a reasonable path in many common cases. Compared to other forms of chunking, fixed-size chunking is computationally cheap and simple to use since it doesn’t require the use of any specialized techniques or libraries.
- **Content-aware Chunking**: These are a set of methods for taking advantage of the nature of the content we’re chunking and applying more sophisticated chunking to it. Examples include:
  - **Sentence Splitting**: Many models are optimized for embedding sentence-level content. Naturally, we would use sentence chunking, and there are several approaches and tools available to do this, including naive splitting (e.g. splitting on periods), NLTK, and spaCy.
  - **Recursive Chunking**: Recursive chunking divides the input text into smaller chunks in a hierarchical and iterative manner using a set of separators.
  - **Semantic Chunking**: This is a class of methods that leverages embeddings to extract the semantic meaning present in your data, creating chunks that are made up of sentences that talk about the same theme or topic.

Here, we will use `langchain` to chunk the input. LangChain offers several text splitters {cite}`langchain_text_splitters`, including JSON-, Markdown-, and HTML-based splitters as well as token-based splitting. We will use the `CharacterTextSplitter` with `tiktoken` as our tokenizer: it splits the text on a separator and measures chunk length in tokens, which helps us stay within the input token limit of our model.


In [None]:
def get_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """
    Split input text into chunks of specified size with specified overlap.

    Args:
        text (str): The input text to be chunked.
        chunk_size (int): The maximum size of each chunk in tokens.
        chunk_overlap (int): The number of tokens to overlap between chunks.

    Returns:
        list: A list of text chunks.
    """
    from langchain_text_splitters import CharacterTextSplitter

    text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    return text_splitter.split_text(text)
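
To illustrate how chunk size and overlap interact in fixed-size chunking, the following dependency-free sketch slides a window over a list of tokens. It is not how `CharacterTextSplitter` works internally; `fixed_size_chunks` is a hypothetical helper, and whitespace-separated words stand in for real tokenizer output.

```python
def fixed_size_chunks(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Whitespace "tokens" stand in for a real tokenizer such as tiktoken.
tokens = "a b c d e f g h i j".split()
chunks = fixed_size_chunks(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

With a chunk size of 4 and an overlap of 1, each window repeats the last token of the previous one, which is the mechanism that keeps some shared context across chunk boundaries.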


**Step 2: Writing the Base Prompt Template**

We will write a base prompt template which will serve as a foundational structure for all chunks, ensuring consistency in the instructions and context provided to the language model. The template includes the following parameters:
- `role`: Defines the role or persona the model should assume.
- `context`: Provides the background information or context for the task.
- `instruction`: Specifies the task or action the model needs to perform.
- `input`: Contains the actual text input that the model will process.
- `requirements`: Lists any specific requirements or constraints for the output.

In [None]:
from langchain_core.prompts import PromptTemplate

def get_base_prompt_template() -> PromptTemplate:
    """Return the base prompt template shared by all chunks."""
    base_prompt = """
    ROLE: {role}
    CONTEXT: {context}
    INSTRUCTION: {instruction}
    INPUT: {input}
    REQUIREMENTS: {requirements}
    """
    
    prompt = PromptTemplate.from_template(base_prompt)
    return prompt
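
To see what the model receives once the placeholders are filled, the sketch below renders the same five fields with Python's built-in `str.format` as a stand-in for `PromptTemplate`. The parameter values are placeholders invented for illustration.

```python
# Same placeholders as the base prompt template, written as a plain string.
BASE_PROMPT = (
    "ROLE: {role}\n"
    "CONTEXT: {context}\n"
    "INSTRUCTION: {instruction}\n"
    "INPUT: {input}\n"
    "REQUIREMENTS: {requirements}\n"
)

# Illustrative values only; in the case study these are set per chunk.
params = {
    "role": "Financial Analyst",
    "context": "(output of earlier parts goes here)",
    "instruction": "Analyze the INPUT below.",
    "input": "(current chunk goes here)",
    "requirements": "Use a readable, structured format.",
}

rendered = BASE_PROMPT.format(**params)
print(rendered)
```

Only `context`, `instruction`, and `input` change from one chunk to the next; `role` and `requirements` stay fixed for the whole report.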

We will write a simple function that returns an LLM chain: a `langchain` construct that composes a prompt template, a language model, and an output parser with the `|` operator.

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_community.chat_models import ChatLiteLLM

def get_llm_chain(prompt_template: PromptTemplate, model_name: str, temperature: float = 0):
    """
    Returns a chain (prompt template | model | output parser) built with langchain.

    Args:
        prompt_template (PromptTemplate): The prompt template to use.
        model_name (str): The name of the model to use, in "provider/model" format.
        temperature (float): The temperature setting for the model.

    Returns:
        llm_chain: A runnable chain that returns the model output as a string.
    """
    
    from dotenv import load_dotenv
    import os

    # Load environment variables from .env file
    load_dotenv()
    
    api_key_label = model_name.split("/")[0].upper() + "_API_KEY"
    llm = ChatLiteLLM(
        model=model_name,
        temperature=temperature,
        api_key=os.environ[api_key_label],
    )
    llm_chain = prompt_template | llm | StrOutputParser()
    return llm_chain

**Step 3: Constructing Dynamic Prompt Parameters**

Now, we will write a function (`get_dynamic_prompt_params`) that constructs prompt parameters dynamically for each chunk.

In [None]:
from typing import Dict
def get_dynamic_prompt_params(prompt_params: Dict, 
                            part_idx: int, 
                            total_parts: int,
                            chat_context: str,
                            chunk: str) -> Dict:
    """
    Construct prompt parameters dynamically per chunk while maintaining the chat context of the response generation.
    
    Args:
        prompt_params (Dict): Original prompt parameters
        part_idx (int): Index of current conversation part
        total_parts (int): Total number of conversation parts
        chat_context (str): Chat context from previous parts
        chunk (str): Current chunk of text to be processed
    Returns:
        Dict: Copy of the prompt parameters updated with part-specific values
    """
    dynamic_prompt_params = prompt_params.copy()
    # saves the chat context from previous parts
    dynamic_prompt_params["context"] = chat_context
    # saves the current chunk of text to be processed as input
    dynamic_prompt_params["input"] = chunk
    
    # Add part-specific instructions
    if part_idx == 0: # Introduction part
        dynamic_prompt_params["instruction"] = f"""
        You are generating the Introduction part of a long report.
        Don't cover any topics yet, just define the scope of the report.
        """
    elif part_idx == total_parts - 1: # Conclusion part
        dynamic_prompt_params["instruction"] = f"""
        You are generating the last part of a long report. 
        For this part, first discuss the below INPUT. Second, write a "Conclusion" section summarizing the main points discussed given in CONTEXT.
        """
    else: # Main analysis part
        dynamic_prompt_params["instruction"] = f"""
        You are generating part {part_idx+1} of {total_parts} parts of a long report.
        For this part, analyze the below INPUT.
        Organize your response in a way that is easy to read and understand either by creating new or merging with previously created structured sections given in CONTEXT.
        """
    
    return dynamic_prompt_params


**Step 4: Generating the Report**

Finally, we will write a function that generates the actual report by invoking the chain with the dynamically updated prompt parameters for each chunk and concatenating the results at the end.

In [None]:
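def generate_report(input_content: str, llm_model_name: str, 
                    role: str, requirements: str,
                    chunk_size: int, chunk_overlap: int) -> str:
    """
    Generate a long-form report by processing the input chunk by chunk.

    Each chunk is sent to the LLM together with the running chat context,
    and the per-chunk responses are joined with newlines.
    """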
    # stores the parts of the report, each generated by an individual LLM call
    report_parts = [] 
    # split the input content into chunks
    chunks = get_chunks(input_content, chunk_size, chunk_overlap)
    # initialize the chat context with the input content
    chat_context = input_content
    # number of parts to be generated
    num_parts = len(chunks)

    prompt_params = {
        "role": role, # user-provided
        "context": "", # dynamically updated per part
        "instruction": "", # dynamically updated per part
        "input": "", # dynamically updated per part
        "requirements": requirements # user-provided
    }

    # get the LLMChain with the base prompt template
    llm_chain = get_llm_chain(get_base_prompt_template(), 
                                 llm_model_name)

    # dynamically update prompt_params per part
    print(f"Generating {num_parts} report parts")
    for i, chunk in enumerate(chunks):
        dynamic_prompt_params = get_dynamic_prompt_params(
            prompt_params,
            part_idx=i,
            total_parts=num_parts,
            chat_context=chat_context,
            chunk=chunk
        )
        
        # invoke the LLMChain with the dynamically updated prompt parameters
        response = llm_chain.invoke(dynamic_prompt_params)

        # update the chat context with the cumulative response
        if i == 0:
            chat_context = response
        else:
            chat_context = chat_context + response
            
        print(f"Generated part {i+1}/{num_parts}.")
        report_parts.append(response)

    report = "\n".join(report_parts)
    return report
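
The control flow of this loop can be exercised offline, without an API key, by swapping the model for a stub. The sketch below is a simplified, self-contained variant, not the function above: `run_linked_chunks` and `fake_llm` are hypothetical names, and unlike `generate_report` it starts with an empty context instead of the full input.

```python
from typing import Callable


def run_linked_chunks(chunks: list[str], llm: Callable[[str, str], str]) -> tuple[str, list[str]]:
    """Process chunks in order, passing the accumulated output as context.

    Returns the joined report and the context that each call received.
    """
    context = ""
    parts: list[str] = []
    contexts_seen: list[str] = []
    for chunk in chunks:
        contexts_seen.append(context)
        response = llm(context, chunk)
        parts.append(response)
        context += response  # later parts see everything generated so far
    return "\n".join(parts), contexts_seen


def fake_llm(context: str, chunk: str) -> str:
    """Hypothetical stand-in for a model call: returns a tagged echo of the chunk."""
    return f"[summary of {chunk}]"


report, contexts_seen = run_linked_chunks(["A", "B", "C"], fake_llm)
print(report)
```

Inspecting `contexts_seen` shows that the context grows with every part, so for very long reports the accumulated context can itself become a significant share of the input budget.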

**Example Usage**


In [None]:
# Load the text from sample 10K SEC filing
with open('../data/apple.txt', 'r') as file:
    text = file.read()

In [None]:
# Define the chunk and chunk overlap size
MAX_CHUNK_SIZE = 10000
MAX_CHUNK_OVERLAP = 0

In [None]:
report = generate_report(text, llm_model_name="gemini/gemini-1.5-flash-latest", 
                           role="Financial Analyst", 
                           requirements="The report should be in a readable, structured format, easy to understand and follow. Focus on finding risk factors and market moving insights.",
                           chunk_size=MAX_CHUNK_SIZE, 
                           chunk_overlap=MAX_CHUNK_OVERLAP)

In [None]:
# Save the generated report to a local file
with open('../data/apple_report.txt', 'w') as file:
    file.write(report)


In [105]:
# Read and display the generated report
with open('../data/apple_report.txt', 'r') as file:
    report_content = file.read()
    
from IPython.display import Markdown

# Display first and last 25% of the report content
report_lines = report_content.splitlines()
total_lines = len(report_lines)
quarter_lines = total_lines // 4

top_portion = '\n'.join(report_lines[:quarter_lines])
bottom_portion = '\n'.join(report_lines[-quarter_lines:])

display(Markdown(f"{top_portion}\n\n (...) \n\n {bottom_portion}"))


**Introduction**

This report provides a comprehensive analysis of Apple Inc.'s financial performance and position for the fiscal year ended September 28, 2024, as disclosed in its Form 10-K filing with the United States Securities and Exchange Commission.  The analysis will focus on identifying key risk factors impacting Apple's business, evaluating its financial health, and uncovering market-moving insights derived from the provided data.  The report will delve into Apple's various segments, product lines, and services, examining their performance and contributions to overall financial results.  Specific attention will be paid to identifying trends, potential challenges, and opportunities for future growth.  The analysis will also consider the broader macroeconomic environment and its influence on Apple's operations and financial outlook.  Finally, the report will incorporate relevant information from Apple's definitive proxy statement for its 2025 annual meeting of shareholders, as incorporated by reference in the Form 10-K.

**PART 2: Key Risk Factors and Market-Moving Insights**

This section analyzes key risk factors disclosed in Apple Inc.'s 2024 Form 10-K, focusing on their potential impact on financial performance and identifying potential market-moving insights.  The analysis is structured around the major risk categories identified in the filing.

**2.1 Dependence on Third-Party Developers:**

Apple's success is heavily reliant on the continued support and innovation of third-party software developers.  The Form 10-K highlights several critical aspects of this dependence:

* **Market Share Vulnerability:** Apple's relatively smaller market share in smartphones, personal computers, and tablets compared to competitors (Android, Windows, gaming consoles) could discourage developers from prioritizing Apple's platform, leading to fewer high-quality apps and potentially impacting customer purchasing decisions.  This is a significant risk, especially given the rapid pace of technological change.  A decline in app availability or quality could negatively impact sales and market share.  **Market-moving insight:**  Monitoring developer activity and app quality across competing platforms is crucial for assessing this risk.  Any significant shift in developer focus away from iOS could be a negative market signal.

* **App Store Dynamics:** While Apple allows developers to retain most App Store revenue, its commission structure and recent changes (e.g., complying with the Digital Markets Act (DMA) in the EU) introduce uncertainty.  Changes to the App Store's policies or fee structures could materially affect Apple's revenue and profitability.  **Market-moving insight:**  Closely monitoring regulatory developments (especially concerning the DMA) and their impact on App Store revenue is essential.  Any significant changes to Apple's App Store policies or revenue streams could trigger market reactions.

* **Content Acquisition and Creation:** Apple's reliance on third-party digital content providers for its services introduces risks related to licensing agreements, competition, and pricing.  The cost of producing its own digital content is also increasing due to competition for talent and subscribers.  Failure to secure or create appealing content could negatively impact user engagement and revenue.  **Market-moving insight:**  Analyzing the success of Apple's original content initiatives and the renewal rates of third-party content agreements will provide insights into this risk.

**2.2 Operational Risks:**

Several operational risks could significantly impact Apple's performance:

* **Employee Retention:**  Competition for highly skilled employees, particularly in Silicon Valley, poses a significant risk.  Failure to retain key personnel or maintain its distinctive culture could negatively affect innovation, product development, and overall operational efficiency.  **Market-moving insight:**  Any significant changes in employee turnover rates or negative press regarding Apple's workplace culture could negatively impact investor sentiment.

* **Reseller Dependence:** Apple's reliance on carriers, wholesalers, and retailers for product distribution introduces risks related to their financial health, distribution decisions, and potential changes in financing or subsidy programs.  **Market-moving insight:**  Monitoring the financial performance of key resellers and any changes in their distribution strategies is crucial.

* **Information Technology and Cybersecurity:**  Apple's dependence on complex IT systems makes it vulnerable to system failures, network disruptions, and cybersecurity threats (including ransomware attacks).  These events could disrupt operations, damage reputation, and impact sales.  The Form 10-K highlights the company's proactive measures, but acknowledges that these may not be sufficient to prevent all incidents.  **Market-moving insight:**  Any major cybersecurity breach or significant service outage could trigger a negative market reaction.

**2.3 Legal and Regulatory Risks:**

Apple faces significant legal and regulatory challenges:

* **Antitrust Litigation:**  The ongoing antitrust lawsuits in the U.S. and investigations in Europe concerning App Store practices pose a substantial risk.  Adverse outcomes could result in significant fines, changes to business practices, and reputational damage.  **Market-moving insight:**  The progress and outcomes of these legal proceedings will be closely watched by the market.  Any negative developments could significantly impact Apple's stock price.

* **Digital Markets Act (DMA) Compliance:**  Apple's efforts to comply with the DMA in the EU introduce uncertainty and potential costs.  Non-compliance could lead to substantial fines.  **Market-moving insight:**  The Commission's ongoing investigations and any subsequent decisions will be closely monitored.

* **Data Privacy and Protection:**  Increasingly stringent data privacy regulations worldwide impose significant compliance costs and risks.  Non-compliance could result in penalties and reputational harm.  **Market-moving insight:**  Any significant fines or negative publicity related to data privacy violations could negatively impact Apple's stock price.

* **Other Legal Proceedings:**  The Form 10-K notes that Apple is subject to various other legal proceedings, the outcomes of which are uncertain and could materially affect its financial condition.

**2.4 Financial Risks:**

Several financial risks could impact Apple's performance:

* **Sales and Profit Margin Volatility:**  Apple's quarterly net sales and profit margins are subject to fluctuations due to various factors, including pricing pressures, competition, product life cycles, supply chain issues, and macroeconomic conditions.  **Market-moving insight:**  Any significant deviation from expected sales or profit margins could trigger market reactions.

* **Foreign Exchange Rate Fluctuations:**  Apple's international operations expose it to risks associated with changes in the value of the U.S. dollar.  Fluctuations in exchange rates can impact sales, earnings, and gross margins.  **Market-moving insight:**  Significant movements in major currency exchange rates relative to the USD should be monitored for their potential impact on Apple's financial results.

* **Credit Risk and Investment Portfolio:**  Apple's exposure to credit risk on trade receivables and fluctuations in the value of its investment portfolio could lead to losses.  **Market-moving insight:**  Any significant deterioration in the creditworthiness of key customers or a substantial decline in the value of Apple's investment portfolio could be viewed negatively by the market.

* **Tax Risks:**  Changes in tax rates, new tax legislation, and tax audits could materially affect Apple's financial performance.

 (...) 

 **4.10 Debt and Share Repurchases:**

Note 9 details Apple's debt structure, including commercial paper and term debt.  While the company has a strong credit rating, the significant amount of debt and the high weighted-average interest rate on commercial paper (5.00% in 2024) indicate potential interest rate risk.  Note 10 highlights the substantial share repurchase program ($95 billion in 2024), which, while returning value to shareholders, could limit funds available for future investments or acquisitions.  **Market-moving insight:**  Investors will monitor the balance between debt levels, share repurchases, and investments in future growth.

**4.11 Share-Based Compensation:**

Note 11 shows a steady increase in share-based compensation expense, reflecting Apple's reliance on equity-based incentives to attract and retain talent.  The significant unrecognized compensation cost related to outstanding RSUs ($19.4 billion in 2024) represents a future expense commitment.  **Market-moving insight:**  Changes in share-based compensation policies or unexpected increases in expense could impact future profitability.

**4.12 Commitments and Supply Concentrations:**

Note 12 reveals Apple's substantial unconditional purchase obligations, primarily for suppliers, licensed intellectual property, and content.  These commitments represent significant future cash outflows and highlight the company's dependence on its supply chain.  **Market-moving insight:**  Any disruptions in the supply chain or changes in supplier relationships could negatively impact Apple's production and sales.


This detailed analysis reveals several key risk factors and market-moving insights beyond those identified in Part 3.  Investors and analysts should carefully consider these factors when assessing Apple's future performance and valuation.

**PART 5: Contingencies, Supply Chain, and Segment Analysis**

This section analyzes additional information from Apple Inc.'s 2024 Form 10-K, focusing on contingencies, supply chain risks, and a deeper dive into segment performance.

**5.1 Contingencies and Legal Proceedings:**

The Form 10-K acknowledges that Apple is involved in various legal proceedings and claims. While management believes no material loss is reasonably possible beyond existing accruals, the inherent uncertainty of litigation remains a risk.  Adverse outcomes in any of these cases could negatively impact Apple's financial condition and reputation.  **Market-moving insight:**  Any significant legal developments or settlements should be closely monitored for their potential market impact.  Increased legal expenses or negative publicity could affect investor sentiment.

**5.2 Supply Chain Concentration:**

Apple's reliance on a concentrated network of outsourcing partners, primarily located in a few Asian countries, presents significant risks.  The dependence on single or limited sources for certain custom components exposes Apple to supply chain disruptions, shortages, and price fluctuations.  While Apple uses multiple sources for most components, the unique nature of some components used in new products creates vulnerability.  Suppliers might prioritize common components over custom ones, impacting Apple's ability to produce its innovative products.  **Market-moving insight:**  Any significant supply chain disruptions, geopolitical instability in key manufacturing regions, or changes in supplier relationships could negatively impact Apple's production and sales, triggering a negative market reaction.

**5.3 Detailed Segment Analysis:**

Note 13 provides a detailed breakdown of Apple's segment performance.  While the Americas and Europe showed growth, primarily driven by Services revenue, Greater China experienced a decline due to lower iPhone and iPad sales and currency headwinds.  This highlights the regional economic and currency risks impacting Apple's revenue.  The relatively flat year-over-year iPhone sales, despite growth in other product lines, warrants further investigation into market saturation and competitive pressures.  The significant contribution of the Services segment to overall revenue and profitability underscores both its importance and the risk associated with its dependence on this segment.

The reconciliation of segment operating income to consolidated operating income reveals that research and development (R&D) and other corporate expenses significantly impact overall profitability.  While increased R&D is generally positive, it reduces short-term profits.  The geographical breakdown of net sales and long-lived assets further emphasizes the concentration of Apple's business in the U.S. and China.  **Market-moving insight:**  Continued weakness in the Greater China market, sustained flat iPhone sales, or any significant changes in R&D spending should be closely monitored for their potential impact on Apple's financial performance and investor sentiment.


**5.4 Auditor's Report and Internal Controls:**

The auditor's report expresses an unqualified opinion on Apple's financial statements and internal control over financial reporting.  However, it identifies uncertain tax positions as a critical audit matter.  The significant amount of unrecognized tax benefits ($22.0 billion) and the complexity involved in evaluating these positions highlight a substantial risk.  Management's assessment of these positions involves significant judgment and relies on interpretations of complex tax laws.  Apple's management also asserts that its disclosure controls and procedures are effective.  **Market-moving insight:**  Any changes in tax laws, unfavorable rulings on uncertain tax positions, or weaknesses in internal controls could materially affect Apple's financial results and investor confidence.


**Conclusion**

This report provides a comprehensive analysis of Apple Inc.'s financial performance and position for fiscal year 2024.  While Apple maintains a strong financial position with substantial cash reserves and a robust capital return program, several key risk factors could significantly impact its future performance.  These risks include:

* **Dependence on third-party developers:**  A shift in developer focus away from iOS or changes to the App Store's policies could negatively impact Apple's revenue and profitability.
* **Operational risks:**  Employee retention challenges, reseller dependence, and cybersecurity threats pose significant operational risks.
* **Legal and regulatory risks:**  Ongoing antitrust litigation, the Digital Markets Act (DMA) compliance, and data privacy regulations introduce substantial legal and regulatory uncertainties.
* **Financial risks:**  Volatility in sales and profit margins, foreign exchange rate fluctuations, credit risk, and tax risks could impact Apple's financial performance.
* **Supply chain concentration:**  Apple's reliance on a concentrated network of outsourcing partners, primarily located in a few Asian countries, and dependence on single or limited sources for certain custom components, exposes the company to significant supply chain risks.
* **Uncertain tax positions:**  The significant amount of unrecognized tax benefits represents a substantial uncertainty that could materially affect Apple's financial results.

Despite these risks, Apple's strong liquidity position, continued growth in its Services segment, and robust capital return program provide a degree of resilience.  However, investors and analysts should closely monitor the market-moving insights identified throughout this report, including developer activity, regulatory developments, regional economic conditions, supply chain stability, and the resolution of uncertain tax positions, to assess their potential impact on Apple's future performance and valuation.  The significant short-term obligations, while manageable given Apple's cash position, highlight the need for continued financial discipline and effective risk management.  A deeper, more granular analysis of the financial statements and notes is recommended for a more complete assessment.

---

#### Discussion

Results from the generated report present a few interesting aspects:

- **Coherence**: The generated report demonstrates an apparent level of coherence. The sections are logically structured, and the flow of information is smooth. Each part of the report builds upon the previous sections, providing a comprehensive analysis of Apple Inc.'s financial performance and key risk factors. The use of headings and subheadings helps in maintaining clarity and organization throughout the document.

- **Adherence to Instructions**: The LLM followed the provided instructions effectively. The report is in a readable, structured format, and it focuses on identifying risk factors and market-moving insights as requested. The analysis is detailed and covers various aspects of Apple's financial performance, including revenue segmentation, profitability, liquidity, and capital resources. The inclusion of market-moving insights adds value to the report, aligning with the specified requirements.

Despite the seemingly good quality of the results, there are some limitations to consider:

- **Depth of Analysis**: While the report covers a wide range of topics, the depth of analysis in certain sections may not be as comprehensive as a human expert's evaluation. Some nuances and contextual factors might be overlooked by the LLM. Splitting the report into multiple parts helps in mitigating this issue.

- **Chunking Strategy**: The current approach splits the text into chunks based on size, which ensures that each chunk fits within the model's token limit. However, this method may disrupt the logical flow of the document, as sections of interest might be split across multiple chunks. An alternative approach could be "structured" chunking, where the text is divided based on meaningful sections or topics. This would preserve the coherence of each section, making it easier to follow and understand. Implementing structured chunking requires additional preprocessing to identify and segment the text appropriately, but it can significantly enhance the readability and logical flow of the generated report.
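
As a rough illustration of the "structured" alternative, the sketch below splits a Markdown-like document at its headings and falls back to size-based splitting only when a single section exceeds the budget. The function name `chunk_by_sections`, the heading regex, and the character-based budget are assumptions made for this example; the case study itself chunks purely by size.

```python
import re
from typing import List

# Assumption: sections are introduced by Markdown-style headings ("# Title")
HEADING = re.compile(r"^#{1,6}\s+.*$", re.MULTILINE)


def chunk_by_sections(text: str, max_chars: int = 2000) -> List[str]:
    """Split text at headings, keeping each section whole when it fits the budget."""
    # Section boundaries: start of the text plus the start of every heading line
    starts = sorted({0, *[m.start() for m in HEADING.finditer(text)]})
    bounds = starts + [len(text)]
    sections = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]

    chunks: List[str] = []
    for section in sections:
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
        else:
            # Fallback: size-based split for sections larger than the budget
            chunks.extend(
                section[i:i + max_chars] for i in range(0, len(section), max_chars)
            )
    return chunks


doc = "# Risk Factors\nSupply chain concentration.\n\n# Liquidity\nStrong cash position.\n"
chunks = chunk_by_sections(doc)
print(len(chunks))  # → 2
```

A token-based budget would match the model's limit more closely than characters; characters are used here only to keep the example dependency-free.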


### Case Study II: GitHub RAG


### Case Study III: Quiz Generation with Citations

In this case study, we will build a quiz generator with citations that explores additional input management techniques particularly useful with long context windows. The implementation includes prompt caching for efficiency and citation tracking to enhance accuracy and verifiability. We will use Gemini 1.5 Pro, which has a context window of 2M tokens, as our LLM.

#### Use Case

Let's assume you are a Harvard student enrolled in GOV 1039 "The Birth of Modern Democracy" (see {numref}`harvard-class`) and you face a daunting reading list for next Tuesday's class on Rights. The readings include foundational documents like the Magna Carta, Declaration of Independence, and US Bill of Rights, each with specific sections to analyze.

```{figure} ../_static/input/harvard.png
---
name: harvard-class
alt: Harvard Class
scale: 50%
align: center
---
Harvard's Democratic Theory Class
```

Instead of trudging through these dense historical texts sequentially, we would like to:
- Extract key insights and connections between these documents, conversationally.
- Engage with the material through a quiz format.
- Add citations to help with verifying answers.


#### Implementation

The full implementation is available in the book's [GitHub repository](https://github.com/souzatharsis/tamingLLMs/tamingllms/notebooks/src/gemini_duo.py). Here, we will cover the most relevant parts of the implementation.

**Client Class**

First, we will define the `Client` class which will provide the key interface users will interact with. It has the following summarized interface:

- Initialization:
    - `__init__(knowledge_base: List[str] = [])`: Initialize with optional list of URLs as knowledge base

- Core Methods:
    - `add_knowledge_base(urls: List[str]) -> None`: Add URLs to the knowledge base
    - `add(urls: List[str]) -> None`: Extract content from URLs and add to conversation input
    - `msg(msg: str = "", add_citations: bool = False) -> str`: Enables users to send messages to the client
    - `quiz(add_citations: bool = True, num_questions: int = 10) -> Quiz`: Return a `Quiz` instance built from the full input memory

- Key Attributes:
    - `knowledge_base`: List of URLs providing foundation knowledge
    - `input`: Current input being studied (short-term memory)
    - `input_memory`: Cumulative input + knowledge base (long-term memory) 
    - `response`: Latest response from LLM
    - `response_memory`: Cumulative responses (long-term memory)
    - `urls_memory`: Cumulative list of processed URLs


**Corpus-in-Context Prompting**

The `add()` method is key since it is used to add content to the client. It takes a list of URLs and extracts the content from each URL using a content extractor (MarkItDown in our case). The content is then added to the conversation input in a way that enables citations, using "Corpus-in-Context" (CiC) prompting {cite}`lee2024longcontextlanguagemodelssubsume`.

{numref}`cic` shows how the CiC format is used to enable citations. It inserts a corpus into the prompt. Each candidate citable part (e.g., passage, chapter) in a corpus is assigned a unique identifier (ID) that can be referenced as needed for that task.

```{figure} ../_static/input/cic.png
---
name: cic
alt: CIC Format
scale: 50%
align: center
---
Example of Corpus-in-Context Prompting for retrieval. 
```

CiC prompting leverages the LLM's capacity to follow instructions by carefully annotating the corpus with document IDs. It benefits from strong, capable models that can retrieve over large corpora provided in context.

```python
    def add(self, urls: List[str]) -> None:
        self.urls = urls

        # Add new content to input following CIC format to enable citations
        for url in urls:
            self.urls_memory.append(url)
            content = self.extractor.convert(url).text_content
            formatted_content = f"ID: {self.reference_id} | {content} | END ID: {self.reference_id}"
            self.input += formatted_content + "\n" 
            self.reference_id += 1
        
        # Update memory
        self.input_memory = self.input_memory + self.input
```
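
To make the resulting prompt layout concrete, here is a minimal standalone sketch that applies the same ID wrapping to in-memory strings instead of URLs. The helper name `to_cic` and the sample passages are assumptions made for this example; only the `ID: n | ... | END ID: n` pattern comes from the `add()` method above.

```python
from typing import List


def to_cic(documents: List[str], start_id: int = 1) -> str:
    """Wrap each document in ID markers, mirroring the format used in `add()`."""
    parts = []
    for offset, content in enumerate(documents):
        ref_id = start_id + offset
        parts.append(f"ID: {ref_id} | {content} | END ID: {ref_id}")
    return "\n".join(parts) + "\n"


corpus = to_cic([
    "We hold these truths to be self-evident.",
    "Congress shall make no law respecting an establishment of religion.",
])
print(corpus)
# ID: 1 | We hold these truths to be self-evident. | END ID: 1
# ID: 2 | Congress shall make no law respecting an establishment of religion. | END ID: 2
```

Because every passage is delimited by its own ID on both sides, the model can be asked to quote the ID alongside any statement it derives from that passage.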

The method `add_knowledge_base()` is a simple wrapper around the `add()` method. It is used to add URLs to the knowledge base, which are cached by the LLM, as we will see later.

```python
    def add_knowledge_base(self, urls: List[str]) -> None:
        self.add(urls)
```


Later, when the user sends a message to the client, the `msg()` method is used to generate a response while enabling citations. `self.content_generator` is an instance of our LLM backend, which we will cover next.

```python
    def msg(self, msg: str = "", add_citations: bool = False) -> str:
        if add_citations:
            msg = msg + "\n\n For key statements, add Input ID to the response."

        self.response = self.content_generator.generate(
            input_content=self.input,
            user_instructions=msg
        )

        self.response_memory = self.response_memory + self.response.text

        return self.response.text
```

**Prompt Caching**

LLM-based applications often involve repeatedly passing the same input tokens to a model, which can be inefficient and costly. Context caching addresses this by allowing you to cache input tokens after their first use and reference them in subsequent requests. This approach significantly reduces costs compared to repeatedly sending the same token corpus, especially at scale.

Context caching proves especially valuable when a large initial context needs to be referenced multiple times by smaller requests. By caching the context upfront, these applications can maintain high performance while optimizing token usage and associated costs.
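
A back-of-the-envelope sketch of the token accounting may help here. It counts how many input tokens are transmitted as fresh input with and without caching. The 38,470-token context matches the knowledge base used later in this case study, while the 50 tokens per request and the 20 requests are illustrative assumptions. The function deliberately ignores pricing: providers typically still bill cached tokens (at a reduced rate) and may charge for cache storage, so it should not be read as a cost model.

```python
def input_tokens_sent(
    context_tokens: int, request_tokens: int, num_requests: int, cached: bool
) -> int:
    """Count input tokens transmitted as fresh (uncached) input across all requests."""
    if cached:
        # The context is sent once when the cache is created;
        # later requests send only their own new tokens.
        return context_tokens + request_tokens * num_requests
    # Without caching, the full context is resent with every request.
    return (context_tokens + request_tokens) * num_requests


without_cache = input_tokens_sent(38_470, 50, 20, cached=False)
with_cache = input_tokens_sent(38_470, 50, 20, cached=True)
print(without_cache, with_cache)  # → 770400 39470
```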

In our application, the user might pass a large knowledge base to the client that can be referenced multiple times by smaller user requests. Our `Client` class is composed of an `LLMBackend` class that takes the `input_memory`, which contains the entire knowledge base and any additional user-added content.
```python
self.llm = LLMBackend(input=self.input_memory)
```

In our `LLMBackend` class, we cache the input tokens and reuse them for subsequent requests.

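```python
import datetime

import google.generativeai as genai
from google.generativeai import caching

class LLMBackend:
    def __init__(self, model_name: str, input: str, cache_ttl: int = 60):
        # Cache the composed prompt (knowledge base + instructions) for `cache_ttl` minutes.
        # `compose_prompt` and `conversation_config` are defined in the full implementation.
        self.cache = caching.CachedContent.create(
            model=model_name,
            display_name='due_knowledge_base', # used to identify the cache
            system_instruction=(
                self.compose_prompt(input, conversation_config)
            ),
            ttl=datetime.timedelta(minutes=cache_ttl),
        )

        # Build a model that reuses the cached tokens on every request
        self.model = genai.GenerativeModel.from_cached_content(cached_content=self.cache)
```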

**Quiz Generation**

Coming back to our `Client` class, we implement the `quiz()` method to generate a quiz based on the full input memory, i.e. the initial knowledge base and any additional user-added content.

The `quiz()` method returns a `Quiz` instance, which caches input tokens behind the scenes. The user can then invoke its `generate()` method to generate a quiz, passing the user instructions in the `msg` parameter, as we will see later.

```python
    def quiz(self, add_citations: bool = True, num_questions: int = 10) -> "Quiz":
        """
        Returns a quiz instance based on full input memory.
        """
        self.quiz_instance = Quiz(
                         input=self.input_memory,
                         add_citations=add_citations,
                         num_questions=num_questions)
        return self.quiz_instance
```

We write a simple prompt template for quiz generation:

> ROLE:
> - You are a Harvard Professor providing a quiz.
> INSTRUCTIONS:
> - Generate a quiz with {num_questions} questions based on the input.
> - The quiz should be multi-choice.
> - Answers should be provided at the end of the quiz.
> - Questions should have broad coverage of the input including multiple Input IDs.
> - Level of difficulty is advanced/hard.
> - {citations}
> STRUCTURE:
> - Sequence of questions and alternatives.
> - At the end provide the correct answers.

where `{citations}` instructs the model to add CiC citations to the response if the user requests them.
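
A minimal sketch of how such a template could be filled in is shown below. The names `QUIZ_TEMPLATE`, `CITATION_INSTRUCTION`, and `build_quiz_prompt`, as well as the exact wording of the citation instruction, are hypothetical and chosen for illustration; the actual `Quiz` class may compose its prompt differently.

```python
QUIZ_TEMPLATE = """ROLE:
- You are a Harvard Professor providing a quiz.
INSTRUCTIONS:
- Generate a quiz with {num_questions} questions based on the input.
- The quiz should be multi-choice.
- Answers should be provided at the end of the quiz.
- Questions should have broad coverage of the input including multiple Input IDs.
- Level of difficulty is advanced/hard.
- {citations}
STRUCTURE:
- Sequence of questions and alternatives.
- At the end provide the correct answers.
"""

# Hypothetical wording for the citation instruction
CITATION_INSTRUCTION = "For each answer, add the Input ID that supports it."


def build_quiz_prompt(num_questions: int = 10, add_citations: bool = True) -> str:
    """Fill the template, dropping the citation bullet when citations are disabled."""
    citations = CITATION_INSTRUCTION if add_citations else ""
    prompt = QUIZ_TEMPLATE.format(num_questions=num_questions, citations=citations)
    # Remove the bullet that is left empty when no citation instruction is inserted
    return "\n".join(line for line in prompt.splitlines() if line.strip() != "-")


print(build_quiz_prompt(num_questions=5, add_citations=True))
```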

#### Example Usage


**Dataset**

First, we will define our knowledge base. 

- Harvard Class: [GOV 1039 Syllabus](https://scholar.harvard.edu/files/dlcammack/files/gov_1039_syllabus.pdf)
- Class / Topic: "Rights"
- Reading List:
    - ID 1. The Declaration of Independence of the United States of America
    - ID 2. The United States Bill of Rights
    - ID 3. John F. Kennedy's Inaugural Address
    - ID 4. Lincoln's Gettysburg Address
    - ID 5. The United States Constitution
    - ID 6. Give Me Liberty or Give Me Death
    - ID 7. The Mayflower Compact
    - ID 8. Abraham Lincoln's Second Inaugural Address
    - ID 9. Abraham Lincoln's First Inaugural Address

We will take advantage of Project Gutenberg to create our knowledge base.

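```python
# range(1, 9) yields Project Gutenberg ebook IDs 1 through 8 (range(1, 10) would also include ID 9)
kb = [f"https://www.gutenberg.org/cache/epub/{i}/pg{i}.txt" for i in range(1, 9)]
```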

We will import our module as `genai_duo` and initialize the `Client` class with our knowledge base.

```python
import gemini_duo as genai_duo
from IPython.display import Markdown, display
```

```python
duo = genai_duo.Client(knowledge_base=kb)
```

At this point, each book has been converted to Markdown using MarkItDown and its content cached by the model. We can check how many tokens are cached by inspecting the `usage_metadata` attribute of Gemini's model response. So far, we have cached a total of 38,470 tokens.
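
As a hedged sketch of that check, the helper below reads the cached-token count from a response object. It assumes the response exposes `usage_metadata.cached_content_token_count`, the field name used by the `google-generativeai` SDK at the time of writing; the helper name `cached_token_count` and the stand-in response object are illustrative, so the example runs without an API key.

```python
from types import SimpleNamespace


def cached_token_count(response) -> int:
    """Return the cached input token count reported by a response, or 0 if absent."""
    usage = getattr(response, "usage_metadata", None)
    # Assumed field name; falls back to 0 when the attribute is missing or None
    return getattr(usage, "cached_content_token_count", 0) or 0


# Stand-in for a real model response
fake_response = SimpleNamespace(
    usage_metadata=SimpleNamespace(cached_content_token_count=38470)
)
print(cached_token_count(fake_response))  # → 38470
```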

Now, we can add references to our knowledge base at any time by calling the `add()` method. We add the following references:
1. The Magna Carta
2. William Sharp McKechnie's book on the Magna Carta

```python
study_references = ["https://www.gutenberg.org/cache/epub/10000/pg10000.txt", "https://www.gutenberg.org/cache/epub/65363/pg65363.txt"]

duo.add(study_references)
```

Now we can instantiate a `Quiz` object and generate a quiz based on the full input memory.

```python
quiz = duo.quiz(add_citations=True)
display(Markdown(quiz.generate()))
```

{numref}`quiz` shows a sample quiz with citations. Marked in yellow are the citations, which refer to the input IDs of the resources we added to the model.

```{figure} ../_static/input/quiz.png
---
name: quiz
alt: Quiz with Citations
scale: 50%
align: center
---
Sample Quiz with Citations.
```


#### Discussion

The experiment demonstrated the ability to build a knowledge base from multiple sources and generate quizzes with citations. The system successfully ingested content from Project Gutenberg texts, including historical documents like the Magna Carta, and used them to create interactive educational content.

However, several limitations emerged during this process:

1. Memory Management: The system currently loads all content into memory, which could become problematic with larger knowledge bases. A more scalable approach might involve chunking or streaming the content.

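2. Context Window Constraints: The knowledge base alone accounts for 38,470 cached tokens, which already exceeds the 32K-token windows of some smaller models. Although this is far below Gemini 1.5 Pro's 2M-token limit, context size still caps how much knowledge can be referenced simultaneously during generation.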

3. Citation Quality: While the system provides citations, they lack specificity - pointing to entire documents rather than specific passages or page numbers. This limits the ability to fact-check or verify specific claims.

4. Content Verification: The system does not currently verify the accuracy of generated quiz questions against the source material. This could lead to potential hallucinations or misinterpretations.

5. Input Format Limitations: The current implementation works well with plain text but may struggle with more complex document formats or structured data sources.

These limitations highlight opportunities for future improvements in knowledge management and citation systems when building LLM-powered educational tools.
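
One small step toward better verification is to check that every ID cited in a response actually exists in the corpus. The sketch below assumes citations appear in the response as `ID: n` or `ID n`; the real output format depends on the model and prompt, so the regular expression and the helper names `cited_ids` and `unknown_citations` are illustrative only. It checks that a cited ID exists, not that the cited document supports the claim.

```python
import re
from typing import Set

# Assumption: citations look like "ID: 3", "ID 3" or "Input ID: 3"
ID_PATTERN = re.compile(r"\bID:?\s*(\d+)")


def cited_ids(response_text: str) -> Set[int]:
    """Collect all input IDs mentioned in a response."""
    return {int(match) for match in ID_PATTERN.findall(response_text)}


def unknown_citations(response_text: str, known_ids: Set[int]) -> Set[int]:
    """Return cited IDs that do not correspond to any document in the corpus."""
    return cited_ids(response_text) - known_ids


answer = (
    "Trial by jury is guaranteed (Input ID: 2). "
    "King John sealed the charter (ID 10). See also ID: 42."
)
print(sorted(unknown_citations(answer, known_ids=set(range(1, 11)))))  # → [42]
```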



## Conclusion

[![CC BY-NC-SA 4.0][cc-by-nc-sa-image]][cc-by-nc-sa]

[cc-by-nc-sa]: http://creativecommons.org/licenses/by-nc-sa/4.0/
[cc-by-nc-sa-image]: https://licensebuttons.net/l/by-nc-sa/4.0/88x31.png
[cc-by-nc-sa-shield]: https://img.shields.io/badge/License-CC-BY--NC--SA-4.0-lightgrey.svg

```
@misc{tharsistpsouza2024tamingllms,
  author = {Tharsis T. P. Souza},
  title = {Taming LLMs: A Practical Guide to LLM Pitfalls with Open Source Software},
  year = {2024},
  chapter = {Managing Input Data},
  journal = {GitHub repository},
  url = {https://github.com/souzatharsis/tamingLLMs}
}
```
## References
```{bibliography}
:filter: docname in docnames
```