EZRAG is a lightweight Retrieval-Augmented Generation playground powered by Streamlit, Chroma DB, and SentenceTransformers.
Drop Markdown or text files into documents/, fire up the UI, and start querying your knowledge base.
main.py– Streamlit UI that exposes query input, collection refresh, and results display.chroma_impl.py– VectorStore wrapper responsible for loading documents, chunking, embedding, and talking to Chroma.documents/– Place.md,.markdown, or.txtsources here; everything inside is eligible for ingestion.chroma_store/– Created on first run to persist the Chroma collection locally.
Documents are chunked in chroma_impl.py using a tiered approach:
- Split by Markdown headers (
#-level 1–3) to preserve meaningful sections. - If the file lacks headers, fall back to paragraph splitting on blank lines.
- As a final guard, split into long sentences (>50 characters) to ensure even sparse text is captured.
- Each chunk is embedded with
sentence-transformers/all-MiniLM-L6-v2(384-dimensional cosine space). - Chunks are packaged with metadata (
source,type,chunk_number,file_hash) and inserted into a persistent Chroma collection (chroma_store/). - Chroma IDs include the file hash so updates to document content automatically invalidate old entries.
- Before inserting, the store inspects Chroma for existing entries that share the same filename and MD5 hash.
- Unchanged files are skipped; modified files trigger a delete + reinsert cycle
- Queries are encoded with the same model and sent to Chroma’s cosine-based similarity search.
- Results include chunk metadata and raw text, which the Streamlit app surfaces with relevance hints and source details.
-
Create & activate a virtual environment
python -m venv venv source venv/bin/activate # Windows: venv\Scripts\activate
-
Install dependencies
pip install -r requirements.txt
-
Add your documents
- Drop Markdown/text files into the
documents/folder.
- Drop Markdown/text files into the
-
Run the Streamlit interface
streamlit run main.py