A Retrieval-Augmented Generation (RAG) based Documentation Assistant built with Streamlit, LangChain, Pinecone, and Google Vertex AI. It allows users to ingest documentation (local or scraped from the web) and query it using an LLM with relevant, context-grounded responses.

## Features

- 🔍 Document Ingestion from local files or web scraping
- 🧹 Processing: Cleaning, chunking, and metadata extraction
- 📊 Vector Store: Embeddings stored & retrieved using Pinecone
- 🤖 RAG System: Retrieve & augment queries with document context
- 💻 Streamlit Web UI: Simple Q&A and Chat interface with history
- ⚙️ Configurable: Manage API keys & settings via `.env` and `config.py`
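The cleaning-and-chunking step above can be sketched roughly as below. The function name and defaults are illustrative assumptions, not the actual `document_processor.py` API (which may well delegate to a LangChain text splitter instead):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list:
    """Split cleaned text into overlapping chunks for embedding.

    chunk_size and overlap are hypothetical defaults; the real pipeline
    may use different values or a library splitter.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start = end - overlap  # overlap preserves context across boundaries
    return chunks
```

The overlap matters for retrieval quality: a sentence cut at a hard chunk boundary would otherwise be invisible to queries that match its missing half.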
## Project Structure

```
clean_version/
├── .env                     # Environment variables (DO NOT COMMIT)
├── app.py                   # Streamlit web app entrypoint
├── config.py                # Configuration management
├── document_processor.py    # Document loading, cleaning, and chunking
├── ingestion.py             # Ingestion pipeline for documents
├── rag_system.py            # Retrieval-Augmented Generation system
├── vector_store.py          # Embedding + Pinecone vector DB manager
├── web_scraper.py           # Web scraper for documentation
├── __init__.py              # Package initializer
└── __pycache__/             # Compiled cache files
```
## Installation

```bash
git clone <your-repo-url>
cd clean_version
python -m venv venv
source venv/bin/activate   # Linux/Mac
venv\Scripts\activate      # Windows
pip install -r requirements.txt
```
If `requirements.txt` is missing, install the dependencies manually:

```bash
pip install streamlit requests beautifulsoup4 python-dotenv langchain langchain-community langchain-pinecone langchain-google-vertexai
```
## Configuration

Create a `.env` file in the root directory:

```env
INDEX_NAME=langchain-doc-index
GOOGLE_CLOUD_PROJECT=your-gcp-project-id
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
GEMINI_API_KEY=your-gemini-api-key
PINECONE_ENVIRONMENT=your-pinecone-env
PINECONE_API_KEY=your-pinecone-api-key
LANGCHAIN_TRACING_V2=false
```
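A `config.py` along these lines would expose those variables to the rest of the code. The class and method names here are assumptions for illustration, not necessarily what the project's `config.py` defines:

```python
import os

try:
    from dotenv import load_dotenv  # provided by python-dotenv
    load_dotenv()  # read .env from the working directory
except ImportError:
    pass  # fall back to real environment variables

class Config:
    """Environment-driven settings (names mirror the .env file above)."""
    INDEX_NAME = os.getenv("INDEX_NAME", "langchain-doc-index")
    GOOGLE_CLOUD_PROJECT = os.getenv("GOOGLE_CLOUD_PROJECT", "")
    GOOGLE_APPLICATION_CREDENTIALS = os.getenv("GOOGLE_APPLICATION_CREDENTIALS", "")
    GEMINI_API_KEY = os.getenv("GEMINI_API_KEY", "")
    PINECONE_ENVIRONMENT = os.getenv("PINECONE_ENVIRONMENT", "")
    PINECONE_API_KEY = os.getenv("PINECONE_API_KEY", "")

    @classmethod
    def validate(cls):
        """Return the names of required settings that are still unset."""
        required = ("GOOGLE_CLOUD_PROJECT", "PINECONE_API_KEY", "GEMINI_API_KEY")
        return [name for name in required if not getattr(cls, name)]
```

A `validate()`-style check at startup turns a missing key into a clear error message instead of a cryptic failure deep inside a Pinecone or Vertex AI call.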
⚠️ **Do not commit** `.env` or credentials to GitHub.
## Usage

```bash
streamlit run app.py
```
**Q&A mode**

- Enter a query (e.g., "How do I integrate Pinecone with LangChain?")
- The app retrieves relevant chunks, augments the query with them, and returns a grounded answer.

**Chat mode**

- Conversational interface with memory
- Supports follow-up questions using context
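The real pipeline embeds with Vertex AI and searches Pinecone, but the retrieve-and-augment pattern itself can be shown with stand-in vectors. Everything below (the in-memory index, the prompt template) is a toy sketch, not the project's `rag_system.py`:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, index, k=2):
    """Rank stored (vector, chunk) pairs by similarity; return top-k chunks.

    In the real system this is a Pinecone similarity search.
    """
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

def augment(question, chunks):
    """Build a context-grounded prompt for the LLM (template is illustrative)."""
    context = "\n\n".join(chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Grounding the prompt in retrieved chunks is what keeps answers tied to the ingested documentation rather than the model's general knowledge.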
## Security

- Exclude `.env` and sensitive files using `.gitignore`
- Rotate keys if exposed
- Use a secrets manager for production deployments

## Roadmap

- Add unit tests for ingestion & RAG
- Support additional vector DBs (FAISS, Weaviate)
- Improve error handling for network & LLM calls
- Add analytics (latency, token usage, retrieval quality)