<table style="border:none; border-collapse:collapse; cellspacing:0; cellpadding:0">
<tr>
    <td width=30% style="border:none">
        <center>
            <img src="../images/iapau_icon.png" width="30%"/><br>
            <a href="https://iapau.org/">Association IA Pau</a><br>
            <a href="https://iapau.org/events/festival/">Festival IAPau 7</a>
        </center>
    </td>
    <td style="border:none">
        <center>
            <h1>Workshop - Agentic RAG</h1>
            <h2>The Knowledge Core</h2>
            <h2>Ingestion, Enrichment, and Multi-Modal Indexing</h2>
        </center>
    </td>
    <td width=20% style="border:none">
    </td>
</tr>
</table>

---

**Prerequisite:** Complete Phase 0 (data acquisition) first.

<img src="../images/agentic-rag-data-ingestion-iapau.png" alt="data-ingestion" width="70%"/>

---

## 📋 Table of contents

- [**Library imports and data loading**](#import-donnees)

1. [**Advanced document segmentation**](#analyse-documents)
   - Extracting document content
   - Segmentation with the `unstructured` library

2. [**Semantic/structured chunking**](#decoupage-intelligent)
   - Structure-aware chunking
   - Preserving hierarchical information

3. [**Enrichment with an LLM**](#enrichissement-llm)
   - Generating structured metadata
   - Enriching chunks with summaries and keywords

4. [**Embeddings & Vector Database (Qdrant)**](#magasin-vectoriel)
   - Generating embeddings
   - Populating the Qdrant vector database

5. [**Creating the SQL database**](#base-donnees-sql)
   - Creating the SQLite database
   - Structuring the financial data

- [**Next steps**](#phase-terminee)

---

<a id="import-donnees"></a>
Library imports and data loading (Phase 0)

In [1]:
import os
import pandas as pd
import sqlite3
import re
import json
from tqdm.notebook import tqdm
from typing import List, Dict, Any, Optional
from pathlib import Path

from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field
from fastembed import TextEmbedding
import qdrant_client
from langchain_community.utilities import SQLDatabase

from IPython.display import display, Markdown

In [2]:
# Load data from Phase 0
COMPANY_TICKER = "NVDA"
DATA_PATH = Path(f"sec-edgar-filings/{COMPANY_TICKER}/")
CSV_PATH = "revenue_summary.csv"

# Find all SEC submission files
all_files = list(DATA_PATH.rglob("full-submission.txt"))
print(f"Loaded {len(all_files)} files from Phase 0")

# Load the CSV data
df = pd.read_csv(CSV_PATH)
print(f"\nLoaded structured data with {len(df)} rows")

Loaded 7 files from Phase 0

Loaded structured data with 20 rows


<a id="analyse-documents"></a>
### <b><div style='padding:15px;background-color:#4A5568;color:white;border-radius:2px;font-size:110%;text-align: left'>1. Advanced document segmentation</div></b>

<div style='display:flex;align-items:center;gap:16px;background:linear-gradient(90deg,#f0f7ff,#ffffff);padding:16px;border-left:6px solid #2b6cb0;border-radius:8px;'>
  <div style='width:5%;min-width:64px;text-align:center;font-size:44px;line-height:1;'>🚀</div>
  <div style='width:90%;color:#000;'>
    <h2 style='margin:0 0 6px 0;color:#000;font-size:1.15em;'>What we are going to do</h2>
    <p style='margin:0 0 8px 0;color:#000;'>We will use the <a href="https://docs.unstructured.io">unstructured</a> library to parse the raw HTML filings. Unlike plain text extraction, <a href="https://docs.unstructured.io">unstructured</a> partitions the document into a list of meaningful "elements" such as <code>Title</code>, <code>NarrativeText</code>, <code>ListItem</code>, and <code>Table</code>. Preserving this structural information is the first and most critical step toward semantic chunking.</p>
  </div>
</div>
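
To make the idea of "typed elements" concrete, here is a minimal, dependency-free sketch. It is not how `unstructured` works internally: the `ToyPartitioner` class, the `toy_partition` helper, and the tag-to-category mapping are all hypothetical names invented for this illustration, and real filings need far more robust heuristics.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List, Optional

# Hypothetical tag-to-category mapping, chosen for this sketch only.
TAG_TO_CATEGORY = {
    "h1": "Title", "h2": "Title", "h3": "Title",
    "p": "NarrativeText",
    "li": "ListItem",
    "table": "Table",
}


@dataclass
class Element:
    category: str
    text: str


class ToyPartitioner(HTMLParser):
    """Collects one Element per recognized block-level tag."""

    def __init__(self) -> None:
        super().__init__()
        self.elements: List[Element] = []
        self._open_tag: Optional[str] = None
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Only open a new element when we are not already inside one,
        # so cells nested in a <table> stay part of that table.
        if self._open_tag is None and tag in TAG_TO_CATEGORY:
            self._open_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._open_tag is not None and data.strip():
            self._buffer.append(data.strip())

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            text = " ".join(self._buffer)
            self.elements.append(Element(TAG_TO_CATEGORY[tag], text))
            self._open_tag = None


def toy_partition(html: str) -> List[Element]:
    parser = ToyPartitioner()
    parser.feed(html)
    parser.close()
    return parser.elements


sample = """
<h1>Item 7. Management's Discussion</h1>
<p>Revenue grew year over year.</p>
<ul><li>Data Center</li><li>Gaming</li></ul>
<table><tr><td>Segment</td><td>Revenue</td></tr></table>
"""
for el in toy_partition(sample):
    print(f"{el.category}: {el.text}")
```

The point of the sketch is the output shape: a flat list of `(category, text)` items instead of one undifferentiated string, which is what the chunking step later relies on.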

In [3]:
def extract_html_from_sec_file(file_path) -> str:
    """Extracts the HTML content from an SEC submission file."""
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
        content = f.read()
    
    match = re.search(r'<html[^>]*>.*?</html>', content, re.DOTALL | re.IGNORECASE)
    if match:
        return match.group(0)
    return ""

def parse_html_file(file_path):
    """Parses an HTML file using unstructured and returns a list of elements."""
    try:
        html_content = extract_html_from_sec_file(file_path)
        
        if not html_content:
            print("No HTML content found in file")
            return []
        
        print(f"Extracted HTML content: {len(html_content):,} characters")
        
        from io import BytesIO
        # Partition with the default settings. hi_res is a PDF/image
        # partitioning strategy, so it is not used here.
        elements = partition_html(
            file=BytesIO(html_content.encode('utf-8')),
            include_metadata=True,
        )
        return elements
    except Exception as e:
        print(f"Error parsing {file_path}: {e}")
        import traceback
        traceback.print_exc()
        return []

# Test parsing on first 10-K file
ten_k_file = next(f for f in all_files if "10-K" in str(f))
print(f"Parsing file: {ten_k_file}...")

parsed_elements = parse_html_file(ten_k_file)

print(f"\nSuccessfully parsed into {len(parsed_elements)} elements.")

# Show element type distribution
from collections import Counter
element_types = Counter(elem.category if hasattr(elem, 'category') else type(elem).__name__ 
                        for elem in parsed_elements)
print("\n--- Element type distribution ---")
for elem_type, count in element_types.most_common():
    print(f"{elem_type}: {count}")

# Find and display specific element types
print("\n" + "="*80)
print("EXAMPLES OF DIFFERENT ELEMENT TYPES")
print("="*80)

# Find examples of each type
title_example = None
narrative_example = None
table_example = None

for element in parsed_elements:
    elem_type = element.category if hasattr(element, 'category') else type(element).__name__
    
    if elem_type == "Title" and title_example is None:
        title_example = element
    elif elem_type == "NarrativeText" and narrative_example is None and len(str(element)) > 100:
        narrative_example = element
    elif elem_type == "Table" and table_example is None:
        table_example = element
    
    if title_example and narrative_example and table_example:
        break

# Display Title example
if title_example:
    print("\n--- TITLE example (Title) ---")
    print(f"Content: {str(title_example)}")
else:
    print("\n--- TITLE example (Title) ---")
    print("No Title element found - headings were probably classified as UncategorizedText")

# Display NarrativeText example
if narrative_example:
    print("\n--- NARRATIVE TEXT example (NarrativeText) ---")
    print(f"Content: {str(narrative_example)[:500]}...")
else:
    print("\n--- NARRATIVE TEXT example (NarrativeText) ---")
    print("No NarrativeText element found")

# Display Table example
if table_example:
    print("\n--- TABLE example (Table) ---")
    # Check if table has HTML metadata
    if hasattr(table_example, 'metadata'):
        table_metadata = table_example.metadata.to_dict() if hasattr(table_example.metadata, 'to_dict') else {}
        if 'text_as_html' in table_metadata:
            print(f"HTML representation:\n{table_metadata['text_as_html'][:800]}...")
        else:
            print(f"Text representation:\n{str(table_example)[:500]}...")
    else:
        print(f"Text representation:\n{str(table_example)[:500]}...")
else:
    print("\n--- TABLE example (Table) ---")
    print("No Table element found")

print("\n" + "="*80)
print("NOTE: If you only see 'UncategorizedText', that is normal for SEC files.")
print("The HTML of SEC filings is very complex and unstructured struggles to detect")
print("the structure automatically. Chunking will still work correctly.")
print("="*80)

Parsing file: sec-edgar-filings/NVDA/10-K/0001045810-25-000023/full-submission.txt...
Extracted HTML content: 2,067,275 characters

Successfully parsed into 947 elements.

--- Element type distribution ---
UncategorizedText: 426
NarrativeText: 381
ListItem: 109
Table: 29
Image: 2

EXAMPLES OF DIFFERENT ELEMENT TYPES

--- TITLE example (Title) ---
No Title element found - headings were probably classified as UncategorizedText

--- NARRATIVE TEXT example (NarrativeText) ---
Content: Indicate by check mark if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act. Yes ☐ ☒...

--- TABLE example (Table) ---
HTML representation:
<table><tr><td/><td/><td/><td/><td/><td/></tr><tr><td>☒</td><td>ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934</td></tr></table>...

NOTE: If you only see 'UncategorizedText', that is normal for SEC files.
The HTML of SEC filings is very complex and unstructured struggles to detect
the structure automatically. Chunking will still work correctly.

<div style='display:flex;align-items:center;gap:16px;background:linear-gradient(90deg,#f0fff4,#ffffff);padding:16px;border-left:6px solid #16a34a;border-radius:8px;'>
  <div style='width:5%;min-width:64px;text-align:center;font-size:44px;line-height:1;'>✅</div>
  <div style='width:90%;color:#000;'>
    <h2 style='margin:0 0 6px 0;color:#000;font-size:1.15em;'>Discussion of the output</h2>
    <p style='margin:0 0 8px 0;color:#000;'>The output shows that we successfully partitioned the 10-K into 947 individual elements. The type distribution is the key result: <a href="https://docs.unstructured.io">unstructured</a> distinguished several kinds of content, here <code>NarrativeText</code>, <code>ListItem</code>, and <code>Table</code>. No <code>Title</code> element was detected in this filing, and the remaining elements fell into <code>UncategorizedText</code>. This structural awareness is what we will exploit in the next step to create semantic chunks, in particular to preserve tables.</p>
  </div>
</div>

<a id="decoupage-intelligent"></a>
### <b><div style='padding:15px;background-color:#4A5568;color:white;border-radius:2px;font-size:110%;text-align: left'>2. Semantic chunking</div></b>

<div style='display:flex;align-items:center;gap:16px;background:linear-gradient(90deg,#f0f7ff,#ffffff);padding:16px;border-left:6px solid #2b6cb0;border-radius:8px;'>
  <div style='width:5%;min-width:64px;text-align:center;font-size:44px;line-height:1;'>🚀</div>
  <div style='width:90%;color:#000;'>
    <h2 style='margin:0 0 6px 0;color:#000;font-size:1.15em;'>What we are going to do</h2>
    <p style='margin:0 0 8px 0;color:#000;'>Standard chunking methods (such as splitting on a fixed number of tokens) can be destructive, especially for financial documents where tables are critical. A table cut in half loses its meaning. We will use the <code>chunk_by_title</code> strategy. It groups text under its headings and, above all, tries to keep tables whole by treating them as atomic units.</p>
  </div>
</div>
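
The grouping idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the algorithm implemented by `chunk_by_title`: the `Chunk` class and `toy_chunk_by_title` function are hypothetical, elements are reduced to `(category, text)` pairs, and a single text longer than the budget is left unsplit.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Chunk:
    title: str
    parts: List[str] = field(default_factory=list)
    is_table: bool = False

    @property
    def text(self) -> str:
        return "\n".join(self.parts)


def toy_chunk_by_title(
    elements: List[Tuple[str, str]], max_characters: int = 200
) -> List[Chunk]:
    """Group (category, text) pairs under their most recent title.

    Tables are always emitted as their own chunk, never split or merged.
    """
    chunks: List[Chunk] = []
    current = Chunk(title="")

    def flush() -> None:
        # Emit the chunk being built (if any) and start a new one
        # under the same title.
        nonlocal current
        if current.parts:
            chunks.append(current)
        current = Chunk(title=current.title)

    for category, text in elements:
        if category == "Title":
            flush()
            current = Chunk(title=text)
        elif category == "Table":
            flush()
            chunks.append(Chunk(title=current.title, parts=[text], is_table=True))
        else:
            # Start a new chunk if adding this text would exceed the budget.
            if current.parts and len(current.text) + 1 + len(text) > max_characters:
                flush()
            current.parts.append(text)
    flush()
    return chunks


elements = [
    ("Title", "Revenue"),
    ("NarrativeText", "A" * 120),
    ("NarrativeText", "B" * 120),
    ("Table", "<table><tr><td>Data Center</td><td>100</td></tr></table>"),
    ("Title", "Risks"),
    ("NarrativeText", "Supply constraints."),
]
for c in toy_chunk_by_title(elements, max_characters=200):
    print(c.title, "| table" if c.is_table else "| text", "|", len(c.text), "chars")
```

Two behaviours are worth noticing: the character budget splits the long "Revenue" narrative into two chunks, while the table passes through untouched regardless of its neighbours.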

In [4]:
chunks = chunk_by_title(
    parsed_elements,
    max_characters=2048,
    combine_text_under_n_chars=256,
    new_after_n_chars=1800
)

print(f"Document chunked into {len(chunks)} sections.")

# Find sample chunks
text_chunk_sample = None
table_chunk_sample = None

for chunk in chunks:
    chunk_metadata = chunk.metadata.to_dict() if hasattr(chunk.metadata, 'to_dict') else {}
    if 'text_as_html' not in chunk_metadata and text_chunk_sample is None and len(str(chunk)) > 500:
        text_chunk_sample = chunk
    if 'text_as_html' in chunk_metadata and table_chunk_sample is None:
        table_chunk_sample = chunk
    if text_chunk_sample and table_chunk_sample:
        break

# Display with Markdown formatting
from IPython.display import display, Markdown, HTML

if text_chunk_sample:
    display(Markdown(f"""
    ---
    ### 📄 Text Chunk Example

    **Type:** Narrative text  
    **Length:** {len(str(text_chunk_sample))} characters

    **Content:**
    ```
    {str(text_chunk_sample)[:700]}...
    ```
    ---
    """))

if table_chunk_sample:
    table_metadata = table_chunk_sample.metadata.to_dict() if hasattr(table_chunk_sample.metadata, 'to_dict') else {}
    
    display(Markdown("""
    ---
    ### 📊 Table Chunk Example

    **Type:** Table (preserved as HTML)  
    **Metadata:** `text_as_html` present = ✅
    """))
    
    # Display the actual HTML table if available
    if 'text_as_html' in table_metadata:
        display(Markdown("**Rendered table:**"))
        display(HTML(table_metadata['text_as_html']))
        
        display(Markdown(f"""
    **HTML source (excerpt):**
    ```html
    {table_metadata['text_as_html'][:600]}...
    ```
    """))
    else:
        display(Markdown(f"""
    **Text content:**
    ```
    {str(table_chunk_sample)[:500]}...
    ```
    """))
    
    display(Markdown("---"))

Document chunked into 168 sections.



    ---
    ### 📄 Text Chunk Example

    **Type:** Narrative text  
    **Length:** 871 characters

    **Content:**
    ```
    The aggregate market value of the voting stock held by non-affiliates of the registrant as of July 26, 2024 was approximately $ trillion (based on the closing sales price of the registrant's common stock as reported by the Nasdaq Global Select Market on July 26, 2024). This calculation excludes 1.0 billion shares held by directors and executive officers of the registrant. This calculation does not exclude shares held by such organizations whose ownership exceeds 5% of the registrant's outstanding common stock that have represented to the registrant that they are registered investment advisers or investment companies registered under section 8 of the Investment Company Act of 1940.

The numbe...
    ```
    ---
    


    ---
    ### 📊 Table Chunk Example

    **Type:** Table (preserved as HTML)  
    **Metadata:** `text_as_html` present = ✅
    

**Rendered table:**

<table>
<tr><td>☒</td><td>ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934</td></tr>
</table>

<table>
<tr><td>☐</td><td>TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934</td></tr>
</table>

<table>
<tr><td>Delaware</td><td>94-3177549</td></tr>
<tr><td>(State or other jurisdiction of</td><td>(I.R.S. Employer</td></tr>
<tr><td>incorporation or organization)</td><td>Identification No.)</td></tr>
<tr><td>2788 San Tomas Expressway , Santa Clara , California</td><td>95051</td></tr>
<tr><td>(Address of principal executive offices)</td><td>(Zip Code)</td></tr>
</table>

<table>
<tr><td>Title of each class</td><td>Trading Symbol(s)</td><td>Name of each exchange on which registered</td></tr>
<tr><td>Common Stock, $0.001 par value per share</td><td>NVDA</td><td>The Nasdaq Global Select Market</td></tr>
</table>



    **HTML source (excerpt):**
    ```html
    <table><tr><td/><td/><td/><td/><td/><td/></tr><tr><td>☒</td><td>ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934</td></tr></table> <table><tr><td/><td/><td/><td/><td/><td/></tr><tr><td>☐</td><td>TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934</td></tr></table> <table><tr><td/><td/><td/><td/><td/><td/></tr><tr><td>Delaware</td><td>94-3177549</td></tr><tr><td>(State or other jurisdiction of</td><td>(I.R.S. Employer</td></tr><tr><td>incorporation or organization)</td><td>Identification No.)</td></tr><tr><td/><td/></tr><t...
    ```
    

---

<div style='display:flex;align-items:center;gap:16px;background:linear-gradient(90deg,#f0fff4,#ffffff);padding:16px;border-left:6px solid #16a34a;border-radius:8px;'>
  <div style='width:5%;min-width:64px;text-align:center;font-size:44px;line-height:1;'>✅</div>
  <div style='width:90%;color:#000;'>
    <h2 style='margin:0 0 6px 0;color:#000;font-size:1.15em;'>Discussion of the output</h2>
    <p style='margin:0 0 8px 0;color:#000;'>The output shows that we condensed 947 elements into 168 chunks. The key point is in the sample chunks: we see a standard text chunk and, more importantly, a table chunk whose metadata includes <code>text_as_html</code>. This indicates that <a href="https://docs.unstructured.io">unstructured</a> identified the tables and kept their HTML structure, which is a major advantage for data quality. We avoided destroying critical tabular data during chunking.</p>
  </div>
</div>
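
Because `text_as_html` keeps the row and column markup, the table can be turned back into rows later on. The sketch below shows one way to do that with the standard library only; `TableRows` and `html_table_to_rows` are hypothetical helpers written for this illustration, and the input string is a shortened excerpt of the HTML shown above.

```python
from html.parser import HTMLParser
from typing import List


class TableRows(HTMLParser):
    """Collects the cell texts of each <tr> in an HTML table string."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self.rows and self.rows[-1]:
            self.rows[-1][-1] += data


def html_table_to_rows(table_html: str) -> List[List[str]]:
    parser = TableRows()
    parser.feed(table_html)
    parser.close()
    # Drop rows whose cells are all empty (layout-only rows).
    return [
        [cell.strip() for cell in row]
        for row in parser.rows
        if any(cell.strip() for cell in row)
    ]


text_as_html = (
    "<table><tr><td/><td/></tr>"
    "<tr><td>Delaware</td><td>94-3177549</td></tr>"
    "<tr><td>(State or other jurisdiction of</td><td>(I.R.S. Employer</td></tr></table>"
)
rows = html_table_to_rows(text_as_html)
for row in rows:
    print(row)
```

A plain-text rendering of the same table would have lost the cell boundaries, so this kind of recovery is only possible when the HTML is preserved in the chunk metadata.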

<a id="enrichissement-llm"></a>
### <b><div style='padding:15px;background-color:#4A5568;color:white;border-radius:2px;font-size:110%;text-align: left'>3. Enrichment with an LLM</div></b>

<div style='display:flex;align-items:center;gap:16px;background:linear-gradient(90deg,#f0f7ff,#ffffff);padding:16px;border-left:6px solid #2b6cb0;border-radius:8px;'>
  <div style='width:5%;min-width:64px;text-align:center;font-size:44px;line-height:1;'>🚀</div>
  <div style='width:90%;color:#000;'>
    <h2 style='margin:0 0 6px 0;color:#000;font-size:1.15em;'>What we are going to do</h2>
    <p style='margin:0 0 8px 0;color:#000;'>This is a cornerstone of our advanced RAG pipeline. Instead of simply embedding raw text, we will use a fast, capable LLM to generate metadata for each chunk. This metadata acts as additional "signals" for retrieval, letting the system match a chunk on what it is about and not only on its literal wording.</p>
    <p style='margin:0 8px 0 0;font-weight:600;color:#000;'>For each chunk, we will generate:</p>
    <ul style='margin:8px 0 0 18px;'>
      <li><strong>Summary</strong>: A concise summary of 1 to 2 sentences.</li>
      <li><strong>Keywords</strong>: A list of key topics.</li>
      <li><strong>Hypothetical questions</strong>: A list of questions the chunk can answer.</li>
      <li><strong>Table summary (tables only)</strong>: A natural-language description of the table's main information.</li>
    </ul>
  </div>
</div>

In [None]:
# Pydantic lets us declare the JSON structure we want, so the LLM output can be parsed and validated against it.

class ChunkMetadata(BaseModel):
    """Structured metadata for a document chunk."""
    summary: str = Field(description="A concise 1-2 sentence summary of the chunk.")
    keywords: List[str] = Field(description="A list of 5-7 key topics or entities mentioned.")
    hypothetical_questions: List[str] = Field(description="A list of 3-5 questions this chunk could answer.")
    table_summary: Optional[str] = Field(description="If the chunk is a table, a natural language summary of its key insights.", default=None)

print("Pydantic model for metadata defined.")
print(json.dumps(ChunkMetadata.model_json_schema(), indent=2))

Pydantic model for metadata defined.
{
  "description": "Structured metadata for a document chunk.",
  "properties": {
    "summary": {
      "description": "A concise 1-2 sentence summary of the chunk.",
      "title": "Summary",
      "type": "string"
    },
    "keywords": {
      "description": "A list of 5-7 key topics or entities mentioned.",
      "items": {
        "type": "string"
      },
      "title": "Keywords",
      "type": "array"
    },
    "hypothetical_questions": {
      "description": "A list of 3-5 questions this chunk could answer.",
      "items": {
        "type": "string"
      },
      "title": "Hypothetical Questions",
      "type": "array"
    },
    "table_summary": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "If the chunk is a table, a natural language summary of its key insights.",
      "title": "Table Summary"
    }
  },
  "required": [
    "summary",
    "keywords",
    "hypothetical_questions"
  ],
  "title": "ChunkMetadata",
  "type": "object"
}

<div style='display:flex;align-items:center;gap:16px;background:linear-gradient(90deg,#f0fff4,#ffffff);padding:16px;border-left:6px solid #16a34a;border-radius:8px;'>
  <div style='width:5%;min-width:64px;text-align:center;font-size:44px;line-height:1;'>✅</div>
  <div style='width:90%;color:#000;'>
    <h2 style='margin:0 0 6px 0;color:#000;font-size:1.15em;'>Discussion of the output</h2>
    <p style='margin:0 0 8px 0;color:#000;'>We have defined the Pydantic model <code>ChunkMetadata</code>. The printed JSON schema shows the exact structure, including field names, types, and descriptions, that we will request from the LLM. Structured output is far more reliable than prompt engineering alone, because the response is parsed and validated against the schema instead of being trusted as free text.</p>
  </div>
</div>
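
To show what "validated against the schema" buys us, here is a dependency-free stand-in for the kind of check Pydantic performs automatically. It is only a sketch: the `SCHEMA` table and the `validate_metadata` function are hypothetical names for this illustration, and the real pipeline relies on `ChunkMetadata` itself.

```python
import json
from typing import Any, Dict

# Field name -> (expected type, required?), mirroring ChunkMetadata above.
SCHEMA = {
    "summary": (str, True),
    "keywords": (list, True),
    "hypothetical_questions": (list, True),
    "table_summary": (str, False),
}


def validate_metadata(raw: str) -> Dict[str, Any]:
    """Parse a JSON string and check it against SCHEMA.

    Raises ValueError when the response does not fit the expected structure.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    result: Dict[str, Any] = {}
    for name, (expected_type, required) in SCHEMA.items():
        value = data.get(name)
        if value is None:
            if required:
                raise ValueError(f"missing required field: {name}")
            result[name] = None
            continue
        if not isinstance(value, expected_type):
            raise ValueError(f"{name} should be {expected_type.__name__}")
        if expected_type is list and not all(isinstance(v, str) for v in value):
            raise ValueError(f"{name} should contain only strings")
        result[name] = value
    return result


good = '{"summary": "s", "keywords": ["a"], "hypothetical_questions": ["q?"]}'
bad = '{"summary": "s", "keywords": "a, b", "hypothetical_questions": []}'

print(validate_metadata(good))
try:
    validate_metadata(bad)
except ValueError as err:
    print(f"rejected: {err}")
```

With free-text prompting, the second response (keywords returned as one comma-separated string) would slip through and break downstream code; with a schema it is rejected at the boundary.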

In [6]:
enrichment_llm = ChatOpenAI(model='gpt-4o-mini', api_key=os.getenv("OPENAI_API_KEY"), temperature=0.).with_structured_output(ChunkMetadata)

def generate_enrichment_prompt(chunk_text: str, is_table: bool) -> str:
    table_instruction = """
    This chunk is a TABLE. Your summary should describe the main data points and trends.
    """ if is_table else ""

    prompt = f"""
    You are an expert financial analyst. Please analyze the following document chunk and generate the specified metadata.
    {table_instruction}
    Chunk Content:
    ---
    {chunk_text}
    ---
    """
    return prompt

def enrich_chunk(chunk) -> Optional[Dict[str, Any]]:
    """Generate structured metadata for one chunk; return None if the LLM call fails."""
    chunk_metadata = chunk.metadata.to_dict() if hasattr(chunk.metadata, 'to_dict') else {}
    is_table = 'text_as_html' in chunk_metadata
    # Tables are sent as HTML so the row/column structure reaches the LLM.
    content = chunk_metadata.get('text_as_html') if is_table else str(chunk)
    
    # Keep the prompt small: only the first 3000 characters are sent.
    truncated_content = content[:3000]
    prompt = generate_enrichment_prompt(truncated_content, is_table)
    
    try:
        metadata_obj = enrichment_llm.invoke(prompt)
        return metadata_obj.model_dump()
    except Exception as e:
        print(f"  ❌ Error enriching chunk: {type(e).__name__}: {str(e)[:200]}")
        import traceback
        traceback.print_exc()
        return None

print("Enrichment functions and LLM are ready.")

Enrichment functions and LLM are ready.


<div style='display:flex;align-items:center;gap:16px;background:linear-gradient(90deg,#f0fff4,#ffffff);padding:16px;border-left:6px solid #16a34a;border-radius:8px;'>
  <div style='width:5%;min-width:64px;text-align:center;font-size:44px;line-height:1;'>✅</div>
  <div style='width:90%;color:#000;'>
    <h2 style='margin:0 0 6px 0;color:#000;font-size:1.15em;'>Discussion of the output</h2>
    <p style='margin:0 0 8px 0;color:#000;'>We have set up the core enrichment logic. We instantiated an LLM (<i>gpt-4o-mini</i>) and bound it to our Pydantic model. The <code>enrich_chunk</code> function detects whether a chunk is a table, truncates its content to 3,000 characters, and calls the LLM to generate the structured metadata. Now let's test it.</p>
  </div>
</div>

In [7]:
# Test enrichment on text chunk
enriched_text_meta = enrich_chunk(text_chunk_sample)

display(Markdown("""
---
### 🔍 Enriching a Text Chunk

**LLM enrichment result:**
"""))

if enriched_text_meta:
    display(Markdown(f"""
**📝 Summary:**
> {enriched_text_meta.get('summary', 'N/A')}

**🏷️ Keywords:**
{', '.join([f'`{kw}`' for kw in enriched_text_meta.get('keywords', [])])}

**❓ Hypothetical questions:**
"""))
    for i, q in enumerate(enriched_text_meta.get('hypothetical_questions', []), 1):
        display(Markdown(f"{i}. *{q}*"))
    
    display(Markdown(f"""
**📊 Full JSON:**
```json
{json.dumps(enriched_text_meta, indent=2, ensure_ascii=False)}
```
---
"""))

# Test enrichment on table chunk
display(Markdown("""
### 📊 Enriching a Table Chunk
"""))

enriched_table_meta = enrich_chunk(table_chunk_sample)

if enriched_table_meta:
    display(Markdown(f"""
**📝 Summary:**
> {enriched_table_meta.get('summary', 'N/A')}

**🏷️ Keywords:**
{', '.join([f'`{kw}`' for kw in enriched_table_meta.get('keywords', [])])}

**❓ Hypothetical questions:**
"""))
    for i, q in enumerate(enriched_table_meta.get('hypothetical_questions', []), 1):
        display(Markdown(f"{i}. *{q}*"))
    
    if enriched_table_meta.get('table_summary'):
        display(Markdown(f"""
**📋 Table summary (table-specific):**
> {enriched_table_meta.get('table_summary')}
"""))
    
    display(Markdown(f"""
**📊 Full JSON:**
```json
{json.dumps(enriched_table_meta, indent=2, ensure_ascii=False)}
```
---
"""))


---
### 🔍 Enriching a Text Chunk

**LLM enrichment result:**



**📝 Summary:**
> The document chunk provides information on the market value of voting stock held by non-affiliates of a registrant as of July 26, 2024, and mentions the number of shares outstanding as of February 21, 2025.

**🏷️ Keywords:**
`market value`, `voting stock`, `non-affiliates`, `common stock`, `NVIDIA Corporation`, `Investment Company Act`, `shares outstanding`

**❓ Hypothetical questions:**


1. *What was the market value of the voting stock as of July 26, 2024?*

2. *How many shares of common stock were outstanding as of February 21, 2025?*

3. *What exclusions were made in the calculation of the market value?*

4. *Who are considered non-affiliates in this context?*

5. *What is the significance of the Investment Company Act of 1940 in this document?*


**📊 Full JSON:**
```json
{
  "summary": "The document chunk provides information on the market value of voting stock held by non-affiliates of a registrant as of July 26, 2024, and mentions the number of shares outstanding as of February 21, 2025.",
  "keywords": [
    "market value",
    "voting stock",
    "non-affiliates",
    "common stock",
    "NVIDIA Corporation",
    "Investment Company Act",
    "shares outstanding"
  ],
  "hypothetical_questions": [
    "What was the market value of the voting stock as of July 26, 2024?",
    "How many shares of common stock were outstanding as of February 21, 2025?",
    "What exclusions were made in the calculation of the market value?",
    "Who are considered non-affiliates in this context?",
    "What is the significance of the Investment Company Act of 1940 in this document?"
  ],
  "table_summary": null
}
```
---



### 📊 Enriching a Table Chunk



**📝 Summary:**
> The document chunk contains key information from a financial report, including the type of report (annual), the company's state of incorporation (Delaware), its IRS identification number, address, and details about its common stock listing on the Nasdaq.

**🏷️ Keywords:**
`Annual Report`, `Securities Exchange Act`, `Delaware`, `IRS Identification Number`, `Common Stock`, `Trading Symbol`, `Nasdaq`

**❓ Hypothetical questions:**


1. *What type of report is being filed?*

2. *What is the company's state of incorporation?*

3. *What is the trading symbol for the company's stock?*

4. *Which exchange is the company's stock listed on?*

5. *What is the address of the company's principal executive offices?*


**📋 Table summary (table-specific):**
> The tables provide essential details about the company's annual report filing, its incorporation in Delaware, its IRS identification number, and the specifics of its common stock listing on the Nasdaq.



**📊 Full JSON:**
```json
{
  "summary": "The document chunk contains key information from a financial report, including the type of report (annual), the company's state of incorporation (Delaware), its IRS identification number, address, and details about its common stock listing on the Nasdaq.",
  "keywords": [
    "Annual Report",
    "Securities Exchange Act",
    "Delaware",
    "IRS Identification Number",
    "Common Stock",
    "Trading Symbol",
    "Nasdaq"
  ],
  "hypothetical_questions": [
    "What type of report is being filed?",
    "What is the company's state of incorporation?",
    "What is the trading symbol for the company's stock?",
    "Which exchange is the company's stock listed on?",
    "What is the address of the company's principal executive offices?"
  ],
  "table_summary": "The tables provide essential details about the company's annual report filing, its incorporation in Delaware, its IRS identification number, and the specifics of its common stock listing on the Nasdaq."
}
```
---


<div style='display:flex;align-items:center;gap:16px;background:linear-gradient(90deg,#f0fff4,#ffffff);padding:16px;border-left:6px solid #16a34a;border-radius:8px;'>
  <div style='width:5%;min-width:64px;text-align:center;font-size:44px;line-height:1;'>✅</div>
  <div style='width:90%;color:#000;'>
    <h2 style='margin:0 0 6px 0;color:#000;font-size:1.15em;'>Discussion of the output</h2>
    <p style='margin:0 0 8px 0;color:#000;'>This is a strong result. The output shows two JSON objects, one for each chunk type.</p>
    <p style='margin:0 8px 0 0;font-weight:600;color:#000;'></p>
    <ul style='margin:8px 0 0 18px;color:#000;'>
      <li>For the text chunk, we get a clear summary, relevant keywords, and useful hypothetical questions.</li><br>
      <li>For the table chunk, the LLM treated the content as tabular and produced a <code>table_summary</code> that restates the data in natural language. A semantic search for "state of incorporation" or "stock exchange listing" can now match this cover-page table through its summary and questions. The same mechanism lets a query such as "revenue growth by segment" reach a financial table even when those exact words do not appear in the raw HTML.</li>
    </ul>
    <br>
    <p style='margin:0 0 8px 0;color:#000;'>Now we will apply this to all our documents.</p>
  </div>
</div>

In [8]:
ENRICHED_CHUNKS_PATH = 'enriched_chunks.json'

if os.path.exists(ENRICHED_CHUNKS_PATH):
    print("Found existing enriched chunks file. Loading from disk.")
    with open(ENRICHED_CHUNKS_PATH, 'r') as f:
        all_enriched_chunks = json.load(f)
else:
    all_enriched_chunks = []

Found existing enriched chunks file. Loading from disk.
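
Since a cached file may already hold some filings, it can help to skip files that were enriched in a previous run. The following is a minimal sketch of that idea, not part of the original notebook: `load_checkpoint`, `source_key`, and `files_to_process` are hypothetical helpers, and the key format assumes the `source` field that the processing loop attaches to every chunk (`<form type>/<accession number>`, derived from the file's parent folders).

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Set


def load_checkpoint(path: Path) -> List[Dict[str, Any]]:
    """Return previously saved enriched chunks, or an empty list."""
    if path.exists():
        with path.open("r", encoding="utf-8") as f:
            return json.load(f)
    return []


def source_key(file_path: Path) -> str:
    """Build the '<form type>/<accession number>' key from the file's parent folders."""
    return f"{file_path.parent.parent.name}/{file_path.parent.name}"


def files_to_process(all_files: List[Path], chunks: List[Dict[str, Any]]) -> List[Path]:
    """Keep only the files whose source key has no chunk in the checkpoint yet."""
    done: Set[str] = {c["source"] for c in chunks if "source" in c}
    return [f for f in all_files if source_key(f) not in done]


# Example with made-up accession numbers (no files are read here).
candidate_files = [
    Path("sec-edgar-filings/NVDA/10-K/0000000000-00-000001/full-submission.txt"),
    Path("sec-edgar-filings/NVDA/10-Q/0000000000-00-000002/full-submission.txt"),
]
saved_chunks = [{"source": "10-K/0000000000-00-000001", "content": "..."}]

remaining = files_to_process(candidate_files, saved_chunks)
print([source_key(f) for f in remaining])
```

Under this scheme a rerun only pays for filings that have no chunk in the checkpoint, and already enriched filings are not appended a second time.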


In [None]:
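# Process all files (slow). Skip this cell if enriched chunks were loaded from disk above:
# rerunning it appends every file's chunks again and creates duplicates.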
from IPython.display import display, Markdown
import time

def process_file(file_path: Path) -> tuple[List[Dict[str, Any]], int, int]:
    """Process a single SEC filing and return enriched chunks with stats.
    
    Returns:
        tuple: (enriched_chunks, success_count, error_count)
    """
    try:
        parsed_elements = parse_html_file(file_path)
        if not parsed_elements:
            print(f"⚠️  No elements parsed from {file_path.name}")
            return [], 0, 0
        
        # Note: unlike the demo cell above, new_after_n_chars is not set here.
        doc_chunks = chunk_by_title(
            parsed_elements, 
            max_characters=2048, 
            combine_text_under_n_chars=256
        )
        
        print(f"  📄 Created {len(doc_chunks)} chunks, starting enrichment...")
        
        enriched_chunks = []
        success_count = 0
        error_count = 0
        
        for idx, chunk in enumerate(tqdm(doc_chunks, desc="Enriching chunks", leave=False), 1):
            try:
                enrichment_data = enrich_chunk(chunk)
                
                if enrichment_data:
                    chunk_metadata = chunk.metadata.to_dict() if hasattr(chunk.metadata, 'to_dict') else {}
                    is_table = 'text_as_html' in chunk_metadata
                    content = chunk_metadata.get('text_as_html') if is_table else str(chunk)
                    
                    enriched_chunks.append({
                        'source': f"{file_path.parent.parent.name}/{file_path.parent.name}",
                        'content': content,
                        'is_table': is_table,
                        **enrichment_data
                    })
                    success_count += 1
                else:
                    error_count += 1
                    print(f"    ⚠️  Chunk {idx}/{len(doc_chunks)} failed enrichment")
                
                # Small delay to avoid overwhelming the LLM endpoint
                time.sleep(0.1)
                
            except Exception as e:
                error_count += 1
                print(f"    ❌ Chunk {idx}/{len(doc_chunks)} raised exception: {type(e).__name__}")
        
        return enriched_chunks, success_count, error_count
    
    except Exception as e:
        print(f"❌ Critical error processing {file_path.name}: {type(e).__name__}: {str(e)}")
        import traceback
        traceback.print_exc()
        return [], 0, 0

# Process all files
display(Markdown(f"""
### 🔄 Processing {len(all_files)} SEC files

This process will:
1. Parse each HTML file
2. Split it into semantic chunks
3. Enrich each chunk with the LLM
4. Save progress incrementally to `{ENRICHED_CHUNKS_PATH}`

**⏱️ Estimated time:** ~{len(all_files) * 20} minutes
"""))

total_success = 0
total_errors = 0
start_time = time.time()

for i, file_path in enumerate(tqdm(all_files, desc="Processing files"), 1):
    file_start = time.time()
    
    file_chunks, success, errors = process_file(file_path)
    all_enriched_chunks.extend(file_chunks)
    total_success += success
    total_errors += errors
    
    # Save progress after each file
    with open(ENRICHED_CHUNKS_PATH, 'w') as f:
        json.dump(all_enriched_chunks, f, indent=2)
    
    file_time = time.time() - file_start
    elapsed = time.time() - start_time
    avg_time = elapsed / i
    eta = avg_time * (len(all_files) - i)
    
    display(Markdown(f"""✅ **File {i}/{len(all_files)}**: `{file_path.name}`  
→ {len(file_chunks)} enriched chunks ({success} ✅, {errors} ❌)  
→ Running total: {len(all_enriched_chunks)} chunks  
→ Time: {file_time:.1f}s | ETA: {eta/60:.1f} min"""))

# Guard against division by zero when no chunk was attempted
total_attempted = total_success + total_errors
success_rate = 100 * total_success / total_attempted if total_attempted else 0.0

display(Markdown(f"""
---
### ✨ Processing complete!

**Total:** {len(all_enriched_chunks)} enriched chunks  
**Succeeded:** {total_success} ✅  
**Failed:** {total_errors} ❌  
**Success rate:** {success_rate:.1f}%  
**Total time:** {(time.time()-start_time)/60:.1f} minutes  
**Saved to:** `{ENRICHED_CHUNKS_PATH}`

---
"""))

<div style='display:flex;align-items:center;gap:16px;background:linear-gradient(90deg,#f0fff4,#ffffff);padding:16px;border-left:6px solid #16a34a;border-radius:8px;'>
  <div style='width:5%;min-width:64px;text-align:center;font-size:44px;line-height:1;'>✅</div>
  <div style='width:90%;color:#000;'>
    <h2 style='margin:0 0 6px 0;color:#000;font-size:1.15em;'>Discussion of the output</h2>
    <p style='margin:0 0 8px 0;color:#000;'>This cell takes a while to run because it involves many LLM calls. The output contains all the metadata-enriched chunks created from all the SEC filings. Crucially, we saved this result to a JSON file after each file was processed. This is an essential best practice: it checkpoints our progress and avoids re-running expensive data-processing steps.</p>
  </div>
</div>
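
The checkpointing idea discussed above can be pushed one step further. The following is a minimal sketch, not the notebook's own implementation: `save_checkpoint` and `processed_sources` are hypothetical helpers, and the `ACME/10-K` style labels are made-up stand-ins for the `source` field built in `process_file`. It writes the JSON file atomically (so an interrupted run cannot leave a half-written checkpoint) and shows how the saved `source` labels could be used to skip filings that were already enriched.

```python
import json
import os
import tempfile
from typing import Any, Dict, List, Set


def save_checkpoint(chunks: List[Dict[str, Any]], path: str) -> None:
    """Write chunks to a temporary file, then swap it into place atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(chunks, f, indent=2)
        # os.replace is atomic when source and target share a filesystem
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the partial temp file and re-raise the original error
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def processed_sources(chunks: List[Dict[str, Any]]) -> Set[str]:
    """Collect the 'source' labels already present in a checkpoint."""
    return {chunk["source"] for chunk in chunks if "source" in chunk}


# Demo with hypothetical source labels
with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint_path = os.path.join(tmp_dir, "enriched_chunks.json")
    save_checkpoint([{"source": "ACME/10-K", "content": "..."}], checkpoint_path)
    with open(checkpoint_path, "r", encoding="utf-8") as f:
        restored = json.load(f)

done = processed_sources(restored)
pending = [label for label in ["ACME/10-K", "ACME/10-Q"] if label not in done]
print(pending)
```

Filtering on `source` assumes that each filing maps to exactly one label; if several files can share a label, a finer-grained key (for example the file path) would be needed.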

<a id="magasin-vectoriel"></a>
### <b><div style='padding:15px;background-color:#4A5568;color:white;border-radius:2px;font-size:110%;text-align: left'>4. Embeddings & Vector Database (Qdrant)</div></b>

<div style='display:flex;align-items:center;gap:16px;background:linear-gradient(90deg,#f0f7ff,#ffffff);padding:16px;border-left:6px solid #2b6cb0;border-radius:8px;'>
  <div style='width:5%;min-width:64px;text-align:center;font-size:44px;line-height:1;'>🚀</div>
  <div style='width:90%;color:#000;'>
    <h2 style='margin:0 0 6px 0;color:#000;font-size:1.15em;'>What we are going to do</h2>
    <p style='margin:0 0 8px 0;color:#000;'>Now that we have our enriched data, it is time to build our "Unified Memory".</p>
    <p style='margin:0 8px 0 0;font-weight:600;color:#000;'>Vector store (Qdrant)</p>
    <p style='margin:0 0 8px 0;color:#000;'>We will embed our chunks and store them in <code>Qdrant</code>. The key here is <i>what</i> we embed. Instead of the raw text alone, we build a combined text for embedding that includes the summary and the keywords. This injects the LLM's understanding directly into the vector representation.
    </p>
  </div>
</div>

In [7]:
# Load enriched chunks
with open('enriched_chunks.json', 'r') as f:
    all_enriched_chunks = json.load(f)
print(f"Loaded {len(all_enriched_chunks)} enriched chunks from file.")

# Initialize embedding model
embedding_model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
embedding_dim = len(list(embedding_model.embed(["test"]))[0])
print(f"Embedding dimension: {embedding_dim}")

# Configure Qdrant with persistent storage
QDRANT_PATH = "./qdrant_storage"
COLLECTION_NAME = "financial_docs"

client = qdrant_client.QdrantClient(path=QDRANT_PATH)

# Recreate collection if it exists
try:
    client.get_collection(collection_name=COLLECTION_NAME)
    print(f"Qdrant collection '{COLLECTION_NAME}' already exists. Deleting and recreating...")
    client.delete_collection(collection_name=COLLECTION_NAME)
except Exception:
    print(f"Creating new collection '{COLLECTION_NAME}'...")

# Create collection with vector configuration
client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=qdrant_client.http.models.VectorParams(
        size=embedding_dim,
        distance=qdrant_client.http.models.Distance.COSINE
    )
)

print(f"Qdrant collection '{COLLECTION_NAME}' created and saved to '{QDRANT_PATH}'.")

Loaded 566 enriched chunks from file.
Embedding dimension: 384
Creating new collection 'financial_docs'...
Qdrant collection 'financial_docs' created and saved to './qdrant_storage'.


<div style='display:flex;align-items:center;gap:16px;background:linear-gradient(90deg,#f0fff4,#ffffff);padding:16px;border-left:6px solid #16a34a;border-radius:8px;'>
  <div style='width:5%;min-width:64px;text-align:center;font-size:44px;line-height:1;'>✅</div>
  <div style='width:90%;color:#000;'>
    <h2 style='margin:0 0 6px 0;color:#000;font-size:1.15em;'>Discussion of the output</h2>
    <p style='margin:0 0 8px 0;color:#000;'>Our vector database is ready. We initialized an open-source embedding model (<code>BAAI/bge-small-en-v1.5</code>, 384 dimensions) and created a <code>Qdrant</code> collection configured for cosine similarity, which is well suited to semantic search over text.</p>
  </div>
</div>
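
To make the choice of distance metric concrete, here is a small illustrative sketch of cosine similarity in plain Python. It is not how Qdrant computes it internally; `cosine_similarity` is a hypothetical helper and the three-dimensional vectors are toy values, far smaller than the 384-dimensional embeddings used in this notebook. The point it illustrates is that cosine similarity depends only on the direction of the vectors, not on their length.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (illustrative values only)
query = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]   # twice as long, same direction
orthogonal = [0.0, 0.0, 3.0]       # shares no component with the query

score_same = cosine_similarity(query, same_direction)
score_orthogonal = cosine_similarity(query, orthogonal)
print(f"same direction: {score_same:.3f}, orthogonal: {score_orthogonal:.3f}")
```

Because scaling a vector does not change the score, two chunks that express the same idea at different lengths can still rank close to each other.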

In [8]:
def create_embedding_text(chunk: Dict) -> str:
    """Combine the LLM summary, the keywords, and the first 1,000 characters of content."""
    return f"""
    Summary: {chunk['summary']}
    Keywords: {', '.join(chunk['keywords'])}
    Content: {chunk['content'][:1000]}
    """

texts_to_embed = [create_embedding_text(chunk) for chunk in all_enriched_chunks]

print(f"Prepared {len(texts_to_embed)} texts for embedding.")
print("Generating embeddings...")

embeddings = list(embedding_model.embed(texts_to_embed, batch_size=32))

print("Creating points for upsert...")
points_to_upsert = []
for i, (chunk, embedding) in enumerate(zip(all_enriched_chunks, embeddings)):
    points_to_upsert.append(qdrant_client.http.models.PointStruct(
        id=i,
        vector=embedding.tolist(),
        payload=chunk
    ))

print("Upserting into Qdrant...")
client.upsert(
    collection_name=COLLECTION_NAME,
    points=points_to_upsert,
    wait=True
)

print("\nUpsert complete!")
collection_info = client.get_collection(collection_name=COLLECTION_NAME)
print(f"Points in collection: {collection_info.points_count}")

Prepared 566 texts for embedding.
Generating embeddings...
Creating points for upsert...
Upserting into Qdrant...

Upsert complete!
Points in collection: 566


<div style='display:flex;align-items:center;gap:16px;background:linear-gradient(90deg,#f0fff4,#ffffff);padding:16px;border-left:6px solid #16a34a;border-radius:8px;'>
  <div style='width:5%;min-width:64px;text-align:center;font-size:44px;line-height:1;'>✅</div>
  <div style='width:90%;color:#000;'>
    <h2 style='margin:0 0 6px 0;color:#000;font-size:1.15em;'>Discussion of the output</h2>
    <p style='margin:0 0 8px 0;color:#000;'>Our knowledge base of enriched chunks is now indexed in the vector database. The last line checks the number of points (document chunks) in the collection, which should match the total number of enriched chunks we created (566 here). Our <i>Librarian</i> agent now has a fully populated library to search.</p>
  </div>
</div>

<a id="base-donnees-sql"></a>
### <b><div style='padding:15px;background-color:#4A5568;color:white;border-radius:2px;font-size:110%;text-align: left'>5. Creating the SQL database</div></b>

<div style='display:flex;align-items:center;gap:16px;background:linear-gradient(90deg,#f0f7ff,#ffffff);padding:16px;border-left:6px solid #2b6cb0;border-radius:8px;'>
  <div style='width:5%;min-width:64px;text-align:center;font-size:44px;line-height:1;'>🚀</div>
  <div style='width:90%;color:#000;'>
    <h2 style='margin:0 0 6px 0;color:#000;font-size:1.15em;'>Relational database (SQLite)</h2>
    <p style='margin:0 0 8px 0;color:#000;'>We now load our structured data, <i>revenue_summary.csv</i>, into a SQLite database so that our <i>Analyst</i> agent can query it.</p>
  </div>
</div>

In [10]:
# Configuration
DB_PATH = "financials.db"
TABLE_NAME = "revenue_summary"

# Create and populate SQLite database
with sqlite3.connect(DB_PATH) as conn:
    df.to_sql(TABLE_NAME, conn, if_exists="replace", index=False)

print(f"SQLite database created at '{DB_PATH}'.")

# Verify database and schema
db = SQLDatabase.from_uri(f"sqlite:///{DB_PATH}")

print("\nVerifying table schema:")
print(db.get_table_info())
print("\nVerifying sample rows:")
print(db.run(f"SELECT * FROM {TABLE_NAME} LIMIT 5"))

SQLite database created at 'financials.db'.

Verifying table schema:

CREATE TABLE revenue_summary (
	year INTEGER, 
	quarter TEXT, 
	revenue_usd_billions REAL, 
	net_income_usd_billions REAL
)

/*
3 rows from revenue_summary table:
year	quarter	revenue_usd_billions	net_income_usd_billions
2020	Q1	3.11	0.95
2020	Q2	3.08	0.92
2020	Q3	3.87	0.62
*/

Verifying sample rows:
[(2020, 'Q1', 3.11, 0.95), (2020, 'Q2', 3.08, 0.92), (2020, 'Q3', 3.87, 0.62), (2020, 'Q4', 4.73, 1.34), (2021, 'Q1', 5.0, 1.46)]


<div style='display:flex;align-items:center;gap:16px;background:linear-gradient(90deg,#f0fff4,#ffffff);padding:16px;border-left:6px solid #16a34a;border-radius:8px;'>
  <div style='width:5%;min-width:64px;text-align:center;font-size:44px;line-height:1;'>✅</div>
  <div style='width:90%;color:#000;'>
    <h2 style='margin:0 0 6px 0;color:#000;font-size:1.15em;'>Discussion of the output</h2>
    <p style='margin:0 0 8px 0;color:#000;'>We created the SQLite database and populated it with our structured data. The <code>langchain_community.utilities.SQLDatabase</code> wrapper makes it straightforward to connect this database to an LLM agent. Our <i>Analyst</i> agent is now ready to use. This concludes Phase 1.</p>
  </div>
</div>
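
As an illustration of the kind of question the <i>Analyst</i> agent could answer from this table, the sketch below runs an aggregate query with the standard-library `sqlite3` module. It is an assumption-laden example rather than part of the pipeline: it rebuilds the table in memory from the five sample rows printed above (not the full CSV), so the totals only cover those rows.

```python
import sqlite3

# The five sample rows shown in the verification output above
sample_rows = [
    (2020, "Q1", 3.11, 0.95),
    (2020, "Q2", 3.08, 0.92),
    (2020, "Q3", 3.87, 0.62),
    (2020, "Q4", 4.73, 1.34),
    (2021, "Q1", 5.0, 1.46),
]

# In-memory database so the example does not touch financials.db
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE revenue_summary (
        year INTEGER,
        quarter TEXT,
        revenue_usd_billions REAL,
        net_income_usd_billions REAL
    )
    """
)
conn.executemany("INSERT INTO revenue_summary VALUES (?, ?, ?, ?)", sample_rows)

# Total revenue and number of quarters available per year
yearly_totals = conn.execute(
    """
    SELECT year,
           ROUND(SUM(revenue_usd_billions), 2) AS total_revenue,
           COUNT(*) AS quarters
    FROM revenue_summary
    GROUP BY year
    ORDER BY year
    """
).fetchall()
conn.close()

for year, total_revenue, quarters in yearly_totals:
    print(f"{year}: {total_revenue} USD billions over {quarters} quarter(s)")
```

Including the quarter count next to each total makes partial years, such as 2021 in this sample, easy to spot before comparing totals.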

---

<a id="phase-terminee"></a>
### Next steps

**What we built:**
- Parsed and chunked the SEC documents while preserving their structure
- Enriched every chunk with LLM-generated metadata
- Created a `Qdrant` vector database (persisted to disk)
- Created a `SQLite` database with structured financial data

**Key outputs:**
- `enriched_chunks.json` - All enriched document chunks
- `financials.db` - SQLite database
- `qdrant_storage/` - Persistent Qdrant collection with all embeddings

**Next:** [Continue to Phase 2](phase_2_specialist_agents.ipynb) to build the specialist agents (our tools).