# Document Ingestion with LiteIngestion 📚

This notebook demonstrates how to use the `LiteIngestion` class from `lite_agents` to perform optimized ingestion of company policy documents into a ChromaDB database.

## Ingestion Pipeline ⚙️

The `LiteIngestion` class manages the ingestion side of the RAG pipeline:
1.  **Reading**: Loads files from the specified directory.
2.  **Chunking**: Splits text into manageable chunks.
3.  **Embedding**: Generates vector embeddings for each chunk.
4.  **Storage**: Saves chunks and embeddings to the Vector DB (ChromaDB).
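
To make the chunking step concrete, here is a minimal sketch of fixed-size character chunking with overlap, using the same 800/200 values configured later in this notebook. The `chunk_text` helper is hypothetical; the notebook does not show how `LiteIngestion` splits text internally, and it may well split on section boundaries instead.

```python
def chunk_text(text: str, chunk_size: int = 800, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap (hypothetical helper)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Each new window starts (chunk_size - chunk_overlap) characters after the previous one
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "".join(str(i % 10) for i in range(2000))
chunks = chunk_text(sample)
print([len(c) for c in chunks])          # [800, 800, 800]
print(chunks[0][-200:] == chunks[1][:200])  # True: neighbours share 200 characters
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.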

## Contextual Retrieval (Optional) 🧠

Contextual Retrieval (inspired by Anthropic) improves retrieval quality by adding a short, LLM-generated context to each chunk before embedding. The added context answers three questions:
- Which document does this chunk belong to?
- What is the main topic of this chunk?
- How does it relate to the overall document context?

You can enable this feature by setting `add_context=True` in `LiteIngestion`.
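
As a rough illustration of the idea, the sketch below prepends a context string to a chunk before it would be embedded. The `build_contextual_chunk` helper and the sample strings are hypothetical; the notebook does not show how `LiteIngestion` combines context and chunk internally.

```python
def build_contextual_chunk(context: str, chunk: str) -> str:
    """Prepend a generated context to the raw chunk text (hypothetical helper)."""
    context = context.strip()
    if not context:
        # No context available: embed the chunk as-is
        return chunk
    return f"{context}\n\n{chunk}"


# Illustrative strings only, not taken from the ingested policies
context = "From the vacation policy, section on planning and requests."
chunk = "Requests must be submitted through the HR portal."

text_to_embed = build_contextual_chunk(context, chunk)
print(text_to_embed)
```

Embedding the combined text lets a query match on the document and section even when the chunk itself never names them.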

## 1. Imports and Setup 🛠️

In [1]:
from pathlib import Path
from typing import List
import json
import time
from lite_agents.llm import LiteLLM
from lite_agents.db import ChromaDB
from lite_agents.ingestion import LiteIngestion
from litellm import embedding
from tqdm.auto import tqdm

# Configuration
DATA_DIR = Path("../data")
POLICIES_DIR = DATA_DIR / "company_policies"
CHROMA_PATH = DATA_DIR / "chroma_db"
COLLECTION_NAME = "company_policies"
CHUNK_SIZE = 800  # characters per chunk
CHUNK_OVERLAP = 200  # overlap between chunks
EMBEDDING_MODEL = "text-embedding-3-small"  # OpenAI embedding model
LLM_MODEL = "gpt-5-nano-2025-08-07"  # Model for contextual retrieval
ADD_CONTEXT = True  # Enable contextual retrieval



In [2]:
def create_embeddings(texts: List[str]) -> List[List[float]]:
    """Creates embeddings using litellm.
    
    Args:
        texts (List[str]): list of texts to embed
        
    Returns:
        List[List[float]]: list of embedding vectors
    """
    embeddings = []
    
    # Batch of 100 to avoid rate limits
    batch_size = 100
    for i in tqdm(range(0, len(texts), batch_size), desc="Generating embeddings"):
        batch = texts[i:i+batch_size]
        response = embedding(model=EMBEDDING_MODEL, input=batch)
        batch_embeddings = [item['embedding'] for item in response.data]
        embeddings.extend(batch_embeddings)
    
    return embeddings
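
The batching loop in `create_embeddings` relies on plain slice arithmetic. The sketch below isolates that logic without calling any embedding API; `batch_ranges` is a hypothetical helper introduced only for illustration.

```python
def batch_ranges(n_items: int, batch_size: int = 100) -> list[tuple[int, int]]:
    """Return the (start, end) slice bounds used to split n_items into batches."""
    return [
        (start, min(start + batch_size, n_items))
        for start in range(0, n_items, batch_size)
    ]


print(batch_ranges(52))   # [(0, 52)]
print(batch_ranges(250))  # [(0, 100), (100, 200), (200, 250)]
```

With 52 chunks and a batch size of 100, everything fits in a single request, which is why the progress bar later in this notebook shows `1/1`.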

## 2. LiteIngestion Initialization 🚀

We create an instance of the class that will handle the entire process.

In [3]:
# Initialize LiteLLM
print("🤖 Initializing LiteLLM...")
llm = LiteLLM(model=LLM_MODEL)
print(f"✅ LiteLLM initialized with model: {LLM_MODEL}\n")

# Initialize ChromaDB
print("💾 Initializing ChromaDB...")
db = ChromaDB(
    collection_name=COLLECTION_NAME,
    path=CHROMA_PATH,
    persistent=True
)
print(f"✅ ChromaDB initialized: {CHROMA_PATH}/{COLLECTION_NAME}\n")

# Create LiteIngestion instance
print(f"🚀 Initializing LiteIngestion (add_context={ADD_CONTEXT})...")
ingestion = LiteIngestion(
    llm=llm,
    vector_db=db,
    embedding_function=create_embeddings,
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    add_context=ADD_CONTEXT
)
print("✅ LiteIngestion ready!\n")

🤖 Initializing LiteLLM...
✅ LiteLLM initialized with model: gpt-5-nano-2025-08-07

💾 Initializing ChromaDB...
✅ ChromaDB initialized: ../data/chroma_db/company_policies

🚀 Initializing LiteIngestion (add_context=True)...
✅ LiteIngestion ready!



In [4]:
t_start = time.time()
all_chunks = ingestion.process_directory(
    directory=POLICIES_DIR,
    file_pattern="*.md"
)
t_end = time.time()

print(f"\n✅ Processing complete! Total chunks: {len(all_chunks)}. Time taken: {t_end - t_start:.2f} seconds.\n")

INFO | __init__ | process_directory:248 | 📂 Found 5 files in ../data/company_policies
INFO | __init__ | process_document:161 | 📄 Processing document: POL-01-hr-ferie.md
INFO | __init__ | process_document:172 | 📝 Generating summary for: Gestione Ferie, Permessi e Assenze (POL-HR-001)
INFO | __init__ | generate_document_summary:121 | ⏱️ Document summary generated in 11.58 seconds.
INFO | __init__ | process_document:192 | ✂️ Created 13 chunks from POL-01-hr-ferie.md
INFO | __init__ | process_document:207 | ⏱️ Generated context for chunk 1/13 in 10.36 seconds.
INFO | __init__ | process_document:207 | ⏱️ Generated context for chunk 2/13 in 8.66 seconds.
INFO | __init__ | process_document:207 | ⏱️ Ge

✅ Processing complete! Total chunks: 52. Time taken: 548.31 seconds.
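
Almost all of that time goes to LLM calls, so a back-of-the-envelope estimate is useful when sizing larger runs. The arithmetic below uses the figures from the output above and assumes one summary call per document plus one context call per chunk; that call count is inferred from the log lines, not confirmed by the library.

```python
total_seconds = 548.31   # from the output above
total_chunks = 52        # from the output above
total_documents = 5      # "Found 5 files" in the log

# Assumption: one summary call per document + one context call per chunk
llm_calls = total_documents + total_chunks

seconds_per_chunk = total_seconds / total_chunks
seconds_per_call = total_seconds / llm_calls

print(f"{seconds_per_chunk:.2f} s per chunk")   # 10.54 s per chunk
print(f"{seconds_per_call:.2f} s per LLM call")  # 9.62 s per LLM call
```

At roughly ten seconds per chunk, setting `ADD_CONTEXT = False` is worth considering for quick experiments where retrieval quality is not yet the focus.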



In [5]:
# Ingest chunks into ChromaDB
print("\n🔄 Ingesting into ChromaDB...\n")

ingestion.ingest_chunks(all_chunks)

print("\n🎉 INGESTION COMPLETED SUCCESSFULLY! 🎉")

INFO | __init__ | ingest_chunks:270 | 🚀 Starting ingestion of 52 chunks
INFO | __init__ | ingest_chunks:281 | 🧠 Creating embeddings...

🔄 Ingesting into ChromaDB...

Generating embeddings: 100%|██████████| 1/1 [00:01<00:00,  1.45s/it]
INFO | __init__ | ingest_chunks:285 | 💾 Adding to vector database...

🎉 INGESTION COMPLETED SUCCESSFULLY! 🎉


In [6]:
ingestion.save_chunks_to_json(
    chunks=all_chunks,
    output_path=DATA_DIR/"ingested_chunks.json"
)

INFO | __init__ | save_chunks_to_json:357 | 💾 Saving 52 chunks to ../data/ingested_chunks.json
INFO | __init__ | save_chunks_to_json:365 | ✅ Chunks saved successfully to ../data/ingested_chunks.json


In [7]:
# Show stats
stats = ingestion.get_statistics(all_chunks)
print(json.dumps(stats, indent=2, ensure_ascii=False))

{
  "total_documents": 5,
  "total_chunks": 52,
  "avg_chunks_per_document": 10.4,
  "documents": {
    "POL-01-hr-ferie": {
      "title": "Gestione Ferie, Permessi e Assenze (POL-HR-001)",
      "chunks": 13,
      "total_chars": 3132,
      "context_chars": 6039
    },
    "POL-02-lavoro-da-remoto": {
      "title": "Lavoro Agile e Smart Working (POL-HR-002)",
      "chunks": 10,
      "total_chars": 2607,
      "context_chars": 4549
    },
    "POL-03-spese-note-spese": {
      "title": "Gestione Spese e Rimborsi (POL-FIN-003)",
      "chunks": 10,
      "total_chars": 2835,
      "context_chars": 4813
    },
    "POL-04-viaggi-di-lavoro": {
      "title": "Viaggi e Trasferte (POL-FIN-004)",
      "chunks": 9,
      "total_chars": 2504,
      "context_chars": 4190
    },
    "POL-05-acquisti-fornitori": {
      "title": "Acquisti e Gestione Fornitori (POL-PROC-005)",
      "chunks": 10,
      "total_chars": 2534,
      "context_chars": 5130
    }
  }
}
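
A quick sanity check on these numbers: the sketch below copies the per-document figures from the output above, confirms that they add up to the reported totals, and computes how large the generated context is relative to the original text. It assumes `total_chars` counts the raw chunk text and `context_chars` counts the generated context, which is inferred from the field names only.

```python
# Figures copied from the statistics output above
documents = {
    "POL-01-hr-ferie":           {"chunks": 13, "total_chars": 3132, "context_chars": 6039},
    "POL-02-lavoro-da-remoto":   {"chunks": 10, "total_chars": 2607, "context_chars": 4549},
    "POL-03-spese-note-spese":   {"chunks": 10, "total_chars": 2835, "context_chars": 4813},
    "POL-04-viaggi-di-lavoro":   {"chunks": 9,  "total_chars": 2504, "context_chars": 4190},
    "POL-05-acquisti-fornitori": {"chunks": 10, "total_chars": 2534, "context_chars": 5130},
}

total_chunks = sum(d["chunks"] for d in documents.values())
avg_chunks = total_chunks / len(documents)
print(total_chunks, avg_chunks)  # 52 10.4

# Ratio of generated context to original text, per document
for name, d in documents.items():
    ratio = d["context_chars"] / d["total_chars"]
    print(f"{name}: context is {ratio:.2f}x the original text")

total_text = sum(d["total_chars"] for d in documents.values())
total_context = sum(d["context_chars"] for d in documents.values())
print(f"Overall: {total_context / total_text:.2f}x")  # Overall: 1.82x
```

Under that reading, the generated context is nearly twice the size of the source text, which helps explain why contextual ingestion dominates the processing time.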


## 3. Test Retrieval 🔎

With the chunks saved to JSON for future reference, we can now test retrieval quality against the populated collection.

In [None]:
def test_retrieval(db: ChromaDB, query: str, n_results: int = 3) -> None:
    """Runs a query against the vector DB and prints the top matches."""
    print(f"\n{'='*80}")
    print(f"🔍 QUERY: {query}")
    print(f"{'='*80}\n")
    
    # Create query embedding
    response = embedding(model=EMBEDDING_MODEL, input=[query])
    query_embedding = response.data[0]['embedding']
    
    # Execute query
    results = db.query(query_embeddings=query_embedding, n_results=n_results)
    
    if not results:
        print("❌ No results found.\n")
        return
    
    for i, result in enumerate(results, 1):
        metadata = result['metadata']
        print(f"\n{'─'*80}")
        print(f"RESULT #{i} (similarity: {result['similarity']:.4f})")
        print(f"{'─'*80}")
        print(f"📄 Document: {metadata['document_title']}")
        print(f"📍 Chunk: {metadata['chunk_index'] + 1}/{metadata['total_chunks']}")
        if 'section_header' in metadata:
            print(f"📑 Section: {metadata['section_header']}")
        # 'context' may be absent (e.g. if ingestion ran with add_context=False),
        # so read it defensively instead of indexing directly
        context = metadata.get('context')
        if context:
            print("\n🔍 Context:")
            print(context)
        print("\n📝 Content:")
        print(result['content'])
    
    print(f"\n{'='*80}\n")
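
The `similarity` value printed for each result is a score between the query embedding and the chunk embedding. This notebook does not show how the `ChromaDB` wrapper computes it; cosine similarity is a common choice for text embeddings, so the sketch below shows that metric on toy vectors purely as an illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Similarity is undefined for a zero vector; treat it as no similarity
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
print(round(cosine_similarity([1.0, 1.0], [1.0, 0.0]), 4))  # 0.7071
```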

In [14]:
# Test with sample queries
test_queries = [
    "Quanto preavviso va dato per richiedere le ferie?",
]
for query in test_queries:
    test_retrieval(db, query, n_results=2)


🔍 QUERY: Quanto preavviso va dato per richiedere le ferie?


────────────────────────────────────────────────────────────────────────────────
RESULT #1 (similarity: 0.7158)
────────────────────────────────────────────────────────────────────────────────
📄 Document: Gestione Ferie, Permessi e Assenze (POL-HR-001)
📍 Chunk: 5/13
📑 Section: 2.2 Pianificazione e Richiesta

🔍 Context:
Nel documento Gestione Ferie, Permessi e Assenze (POL-HR-001), sezione 2.2 Pianificazione e Richiesta. Riassume come pianificare e richiedere le ferie, indicandone il canale (portale HR Zucchetti), i preavvisi in base alla durata e i periodi di picco (agosto e periodi natalizi). Si