Summary #71

Merged
hmahmood24 merged 14 commits into main from summary on Sep 24, 2025

Conversation

@hmahmood24
Member

No description provided.

…on_id, paragraph_id, etc. Also removed content_text from the columns to embed because of a token-limit issue (sticking to summary for now)
…ed entries for the provided base log event IDs only. Used this to add an embed_along mode to the Intranet RAG agent, which embeds at the end of each batch of documents being ingested (rather than running once at the end of the global ingestion flow) so that errors from the embedding call can be scoped to a single batch. Also renamed the Intranet table from content to Content
- Token-aware summarization with iterative compression to meet embedding limits; uses a fast 0.25 tokens/byte estimator and re-requests with concise directives until within budget (post-generation media context added only if it fits)
- Rich media extraction from PDFs (tables/images) with Docling; auto image descriptions via SmolVLM-Instruct and inclusion in searchable summaries for better recall
- Efficient sentence splitting via a lightweight spaCy sentencizer (plus enum-prefix fix); clean fallbacks to LangChain/regex and clear warnings when optional deps are absent
- Hierarchy-first chunking leveraging Docling headings/refs; stable parent-child mapping so images/tables reliably attach to the correct sections
- Unified short 5-char IDs and strictly hierarchical content_ids/titles (doc>sec>para>sent/img/tbl); removed incremental IDs and any slicing in IDs
- Batch parsing support in the parser; FileManager leverages batch mode for multi-file runs (sync and streaming async) with ordered results
- Pre-parse .doc/.docx→.pdf conversion using OS-appropriate backends (LibreOffice/win32com/docx2pdf), with timeouts, serialized execution where needed, and optional cleanup of temporary PDFs
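A minimal sketch of the token-aware compression loop described in the first bullet, for reviewers who want to see the shape of it. Only the 0.25 tokens/byte ratio, the re-request-with-concise-directive loop, and the "add media context only if it fits" rule come from the notes above. The names `estimate_tokens`, `summarize_within_budget`, the `summarize` callable, `max_rounds`, and the final hard truncation are assumptions made for illustration, not the code in this PR.

```python
from typing import Callable

# Signature assumed for illustration: (text, directive) -> summary.
Summarizer = Callable[[str, str], str]


def estimate_tokens(text: str) -> int:
    """Fast estimate: 0.25 tokens per UTF-8 byte, rounded up."""
    n_bytes = len(text.encode("utf-8"))
    return -(-n_bytes // 4)  # integer ceil of n_bytes * 0.25


def summarize_within_budget(
    text: str,
    summarize: Summarizer,
    max_tokens: int,
    max_rounds: int = 3,       # hypothetical cap on re-requests
    media_context: str = "",
) -> str:
    """Re-request a tighter summary until the estimate fits the budget."""
    summary = summarize(text, "")
    for round_no in range(1, max_rounds + 1):
        if estimate_tokens(summary) <= max_tokens:
            break
        directive = (
            f"Be more concise; stay under about {max_tokens * 4} bytes "
            f"(attempt {round_no})."
        )
        summary = summarize(text, directive)

    # Assumption: if the model never complies, truncate on a byte boundary.
    if estimate_tokens(summary) > max_tokens:
        clipped = summary.encode("utf-8")[: max_tokens * 4]
        summary = clipped.decode("utf-8", errors="ignore")

    # Media context is appended after generation, and only if it still fits.
    if media_context:
        combined = f"{summary}\n{media_context}"
        if estimate_tokens(combined) <= max_tokens:
            summary = combined
    return summary
```

Passing the summarizer in as a callable keeps the sketch independent of any particular LLM client; in practice the directive would be folded into the summarization prompt.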
…valuation script to be consistent with the chosen test set format with multi-tiered difficulty levels. Fixed race conditions when running the Intranet API on multiple parallel workers. Updated the RAGHTTPClient (used for evals) to reuse the HTTP session for connection pooling when running parallel evals
…only a single table in the loaded schema. Disabled Contacts inclusion in the KM for the Intranet deployment. Adjusted the prompts accordingly to include join-related instructions only when needed
- Introduced a parallel-safe usage logging context for intranet usage, created idempotently and recording query, answer, sources, confidence, response_time, success, and error for all RAG calls.
- Standardized a success/error contract across RAG responses so successful calls set success=true and error=null, and failures set answer=null with a descriptive error; all downstream processing now checks success before formatting or post‑processing. Extended the RAG API service and the evaluation pipeline to support this.
- Added a direct LLM retrieval path alongside the tool‑loop in the RAG agent with in-context document dumping.
- Improved the evaluation pipeline with concurrent streaming queries, rolling saves, validation only for successful non‑empty answers, minimal records for failures, and wall‑clock time tracking for accurate performance metrics.
- Clarified semantic‑validator guidance to not penalize accurate, non‑contradictory extra information and added supporting examples.
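The success/error contract from the second bullet can be sketched roughly as follows. The field names `success`, `answer`, `error`, and `sources` are taken from the notes above; the `RAGResponse` class and the helpers `ok`, `fail`, and `format_for_display` are hypothetical names used only to illustrate the rule that downstream code checks `success` before touching `answer`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RAGResponse:
    """Hypothetical response shape illustrating the contract."""
    success: bool
    answer: Optional[str] = None
    error: Optional[str] = None
    sources: List[str] = field(default_factory=list)


def ok(answer: str, sources: Optional[List[str]] = None) -> RAGResponse:
    # Successful calls: success=True and error=None.
    return RAGResponse(True, answer=answer, error=None, sources=list(sources or []))


def fail(error: str) -> RAGResponse:
    # Failed calls: answer=None plus a descriptive error.
    return RAGResponse(False, answer=None, error=error)


def format_for_display(resp: RAGResponse) -> str:
    # Downstream code checks success before any formatting or post-processing.
    if not resp.success:
        return f"Error: {resp.error}"
    return (resp.answer or "").strip()
```

With this shape, an evaluation loop can skip validation for any response where `success` is false and store only a minimal record for it, as the evaluation bullet describes.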
@hmahmood24 hmahmood24 merged commit a8e7b1a into main Sep 24, 2025
5 of 16 checks passed
@hmahmood24 hmahmood24 deleted the summary branch December 2, 2025 19:24

1 participant