A Python-based Q&A system for analyzing legal documents and contracts.
BizBrain processes legal documents into searchable chunks, creates vector embeddings, and uses LangChain to answer questions about the documents. The system maintains citations to source documents for verification, enabling accurate answers to complex queries across multiple business and legal documents.
- Cross-document reasoning for complex questions
- Source citations for all answers
- Focus on accuracy over speed
- Designed for internal business use
- Batch document processing with effective dates
- Run main script: `python src/main.py`
- Process documents in batches: `python src/main.py --batch-process`
- Check document status: `python src/main.py --status`
- Ask a question: `python src/main.py --question "your question"`
- Interactive mode: `python src/main.py --interactive`
- Web interface: `python src/interface/ctx.py`
- Set up directories: `python src/utils/dir_setup.py`
- Install dependencies: `pip install -r requirements.txt`
- `src/`: Main source code with modular architecture
- `raw_documents/`: Original document files (PDF, DOCX)
- `processed_documents/`: Extracted text and chunks
- `vector_store/`: Vector embeddings for retrieval
BizBrain is organized into five distinct layers:
1. Document Processing Layer
   - Loads documents from various formats
   - Extracts and cleans text
   - Chunks documents into manageable segments
   - Extracts metadata for citation tracking
2. Storage & Indexing Layer
   - Creates vector embeddings for text chunks
   - Manages the vector database
   - Stores document metadata
   - Maps between chunks and source documents
3. Retrieval Layer
   - Processes user queries
   - Performs hybrid retrieval (semantic + keyword); see the sketch after this list
   - Re-ranks results for relevance
   - Connects related information across documents
4. Reasoning Layer
   - Assembles context from retrieved chunks
   - Integrates with an LLM for reasoning
   - Generates answers with citations
   - Tracks sources for verification
5. Interface Layer
   - Provides an API for internal integration
   - Implements a simple user interface
   - Includes a web interface using Gradio
   - Collects feedback for improvement
   - Logs interactions for analysis
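A minimal sketch of the hybrid retrieval step in Layer 3, assuming query and chunk embeddings are already available and that chunks follow the JSON structure shown later in this document; the score-blending scheme and the `alpha` weight are illustrative assumptions, not the system's actual configuration:

```python
import numpy as np

def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms that also appear in the chunk text."""
    terms = set(query.lower().split())
    words = set(text.lower().split())
    return len(terms & words) / max(len(terms), 1)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hybrid_rank(query: str, query_vec: np.ndarray,
                chunks: list[dict], chunk_vecs: list[np.ndarray],
                alpha: float = 0.7, top_k: int = 5) -> list[dict]:
    """Blend semantic similarity with keyword overlap, then re-rank."""
    scored = [
        (alpha * cosine(query_vec, vec)
         + (1 - alpha) * keyword_score(query, chunk["text"]), chunk)
        for chunk, vec in zip(chunks, chunk_vecs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```

Blending a keyword score into the semantic score helps exact terms of art (section numbers, defined terms) outrank merely similar passages, which matters for legal text.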
- Python
- LangChain
- Vector embeddings
- Large Language Models
- Gradio (web interface)
- `/raw_documents/` - Original, unprocessed legal documents (contracts, agreements, etc.)
- `/processed_documents/` - Processed document data
  - `/full_text/` - Cleaned and extracted complete text of documents before chunking
  - `/chunks/` - JSON files with chunked text from documents
  - `document_index.json` - Master index of all documents with metadata
  - `document_registry.json` - Tracks processing status of documents
- `/vector_store/` - Storage for vector embeddings
  - `faiss.index` - The vector database (created during processing)
  - `document_to_id_map.json` - Maps between vector IDs and document chunks
- `/conversation_history/` - Records of Q&A interactions with the system
- `/logs/` - System logs for debugging and monitoring
- `/src/` - Python source code
  - `/processors/` - Document processing scripts
  - `/indexers/` - Vector storage and indexing scripts
  - `/retrievers/` - Query and retrieval scripts
  - `/reasoners/` - LLM integration and answer generation
  - `/interface/` - API and UI implementation
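The Usage section mentions `python src/utils/dir_setup.py` for setting up directories. Its actual contents are not shown here, so the following is only a plausible sketch that creates the tree above:

```python
# Plausible sketch of src/utils/dir_setup.py (actual contents unknown).
from pathlib import Path

DIRECTORIES = [
    "raw_documents",
    "processed_documents/full_text",
    "processed_documents/chunks",
    "vector_store",
    "conversation_history",
    "logs",
]

def setup_directories(root: str = ".") -> None:
    # mkdir -p semantics: create parents, ignore directories that exist
    for rel in DIRECTORIES:
        Path(root, rel).mkdir(parents=True, exist_ok=True)

if __name__ == "__main__":
    setup_directories()
```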
The system tracks document processing status in `/processed_documents/document_registry.json`:
```json
{
  "documents": {
    "contract_123.pdf": {
      "status": "processed",
      "last_processed": "2025-04-12T14:32:05",
      "chunk_count": 45,
      "document_id": "doc_001",
      "md5_hash": "e8d4e5e2f0a3c1b2d3a4e5f6a7b8c9d0",
      "batch_id": "batch_001",
      "effective_date": "2025-04-12"
    }
  },
  "batches": {
    "batch_001": {
      "created_at": "2025-04-12T14:30:00",
      "effective_date": "2025-04-12",
      "document_count": 1
    }
  },
  "last_update": "2025-04-12T14:32:05",
  "total_documents": 1,
  "total_chunks": 45,
  "total_batches": 1
}
```

This registry enables incremental processing: when new documents are added, only they need to be processed, rather than reprocessing the entire collection. It also tracks document batches and their effective dates.
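A minimal sketch of how the registry's `md5_hash` field might drive that incremental check (paths and field names follow the example above; the helper functions themselves are hypothetical):

```python
import hashlib
import json
from pathlib import Path

REGISTRY_PATH = Path("processed_documents/document_registry.json")

def md5_of(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()

def needs_processing(doc_path: Path) -> bool:
    """A document is (re)processed only if it is new or its content changed."""
    registry = json.loads(REGISTRY_PATH.read_text())
    entry = registry["documents"].get(doc_path.name)
    if entry is None:
        return True  # never seen before
    return entry["md5_hash"] != md5_of(doc_path)  # content changed since last run
```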
- Batch Creation: User creates a batch with an effective date
- Document Selection: User selects which documents to include in the batch
- Atomic Processing: Each document is processed through the complete pipeline:
  - Extraction: Text is extracted from documents and processed in memory
  - Chunking: Documents are split into semantic chunks with overlap (see the sketch after this list)
  - Embedding: Vector embeddings are created for each chunk
  - Storage: All data is written to disk only after successful processing
- Registry Update: The document registry is updated as the final step
- Completion: Documents are fully processed and ready for querying
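A minimal sketch of the overlapping chunking step above; the window size and overlap are illustrative assumptions, and a real implementation aiming for semantic chunks would cut on sentence or section boundaries rather than raw character offsets:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that overlap, so content near a
    boundary appears in two adjacent chunks and is never lost to retrieval."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last window already reached the end of the document
    return chunks
```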
The system uses an atomic processing model where documents are either fully processed or not processed at all. This ensures system consistency and eliminates intermediate states. All documents must be processed through the batch interface, which requires specifying an effective date.
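At the file level, this all-or-nothing guarantee is commonly achieved with a write-then-rename pattern; the sketch below illustrates the idea for the registry and is an assumption, not necessarily how BizBrain implements it:

```python
import json
from pathlib import Path

def atomic_write_json(path: Path, data: dict) -> None:
    """Write to a temporary sibling file, then rename it into place. The
    rename is atomic on POSIX filesystems, so readers never observe a
    half-written registry even if the process crashes mid-write."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(data, indent=2))
    tmp.replace(path)
```

Updating the registry with this pattern as the final pipeline step means a crash at any earlier point leaves the previous registry intact, so a document is never marked "processed" without its data on disk.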
Each document chunk is stored in a JSON structure:
```json
{
  "chunk_id": "doc_001_chunk_023",
  "text": "The funding schedule outlined in Section 3.2 requires...",
  "metadata": {
    "document_id": "doc_001",
    "title": "Series A Agreement",
    "section": "Funding Terms",
    "chunk_num": 23,
    "batch_id": "batch_001",
    "effective_date": "2025-04-12",
    "filename": "contract_123.pdf"
  }
}
```

When new documents are added:
- Only new documents are processed through Layer 1
- New embeddings are added to the existing vector store in Layer 2
- The Retrieval (Layer 3) and Reasoning (Layer 4) layers automatically incorporate new documents into future queries
- No changes needed to Interface Layer (Layer 5)
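Because the vector store is a FAISS index (`vector_store/faiss.index`), appending embeddings for new documents without touching existing ones could look like this sketch (the shape and dtype of the new vectors are assumptions about the embedding model):

```python
import faiss
import numpy as np

INDEX_PATH = "vector_store/faiss.index"

def add_new_embeddings(new_vectors: np.ndarray) -> None:
    """new_vectors: float32 array of shape (n_new_chunks, embedding_dim)."""
    index = faiss.read_index(INDEX_PATH)
    index.add(new_vectors)               # appends; existing vectors are untouched
    faiss.write_index(index, INDEX_PATH)
```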
BizBrain supports processing documents in batches with effective dates. Users can specify when documents become valid, allowing for time-sensitive document analysis and retrieval.
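One plausible way to honor effective dates at query time, inferred from the chunk metadata above rather than from documented behavior, is to filter candidate chunks by an as-of date before ranking:

```python
from datetime import date

def filter_by_effective_date(chunks: list[dict], as_of: date) -> list[dict]:
    """Keep only chunks that were in effect on the given date."""
    return [
        c for c in chunks
        if date.fromisoformat(c["metadata"]["effective_date"]) <= as_of
    ]
```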
When answering questions, the system:
- Retrieves relevant chunks from the vector store
- Provides citations to specific document sections
- Includes document title, effective date, section name, and page number when available
- Enables verification by tracing back to original documents
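A minimal sketch of turning chunk metadata into such a citation; the output format is illustrative, and the `page` field is an assumption for documents where page numbers were extracted:

```python
def format_citation(chunk: dict) -> str:
    m = chunk["metadata"]
    parts = [m["title"], f"effective {m['effective_date']}"]
    if m.get("section"):
        parts.append(m["section"])
    if m.get("page"):                      # assumed optional field
        parts.append(f"p. {m['page']}")
    return f"[{', '.join(parts)}] ({m['filename']})"
```

For the example chunk shown earlier this yields `[Series A Agreement, effective 2025-04-12, Funding Terms] (contract_123.pdf)`.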
If you need to reset the system and reprocess all documents from scratch, run:
```bash
rm -rf processed_documents/ vector_store/ && mkdir -p processed_documents/full_text processed_documents/chunks vector_store/
```

This command removes all processed documents and vector embeddings, then recreates the necessary directory structure.