
📚 Documentation Assistant (RAG System)

A Retrieval-Augmented Generation (RAG) based Documentation Assistant built with Streamlit, LangChain, Pinecone, and Google Vertex AI. It lets users ingest documentation (local files or scraped web pages) and query it through an LLM that returns relevant, context-grounded answers.


✨ Features

  • 🔍 Document Ingestion from local files or web scraping
  • 🧹 Processing: Cleaning, chunking, and metadata extraction
  • 📊 Vector Store: Embeddings stored & retrieved using Pinecone
  • 🤖 RAG System: Retrieve & augment queries with document context
  • 💻 Streamlit Web UI: Simple Q&A and Chat interface with history
  • ⚙️ Configurable: Manage API keys & settings via .env and config.py

📂 Project Structure

clean_version/
├── .env                      # Environment variables (DO NOT COMMIT)
├── app.py                    # Streamlit web app entrypoint
├── config.py                 # Configuration management
├── document_processor.py     # Document loading, cleaning, and chunking
├── ingestion.py              # Ingestion pipeline for documents
├── rag_system.py             # Retrieval-Augmented Generation system
├── vector_store.py           # Embedding + Pinecone vector DB manager
├── web_scraper.py            # Web scraper for documentation
├── __init__.py               # Package initializer
└── __pycache__/              # Compiled cache files
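The cleaning and chunking step handled by `document_processor.py` can be sketched roughly as follows. This is a minimal, dependency-free sketch: the function names and the chunk size/overlap values are illustrative, not the module's actual API.

```python
import re

def clean_text(text: str) -> str:
    """Collapse runs of whitespace left over from HTML extraction."""
    return re.sub(r"\s+", " ", text).strip()

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split cleaned text into overlapping chunks so each embedded
    chunk keeps some of its neighbor's context for retrieval."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

In a real pipeline each chunk would also carry metadata (source URL, section title) before being embedded and upserted into Pinecone.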

⚙️ Setup

1️⃣ Clone the Repository

git clone <your-repo-url>
cd clean_version

2️⃣ Create a Virtual Environment

python -m venv venv
source venv/bin/activate   # Linux/Mac
venv\Scripts\activate      # Windows

3️⃣ Install Dependencies

pip install -r requirements.txt

If requirements.txt is missing, install manually:

pip install streamlit requests beautifulsoup4 python-dotenv langchain langchain-community langchain-pinecone langchain-google-vertexai

4️⃣ Configure Environment Variables

Create a .env file in the root directory:

INDEX_NAME=langchain-doc-index
GOOGLE_CLOUD_PROJECT=your-gcp-project-id
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
GEMINI_API_KEY=your-gemini-api-key
PINECONE_ENVIRONMENT=your-pinecone-env
PINECONE_API_KEY=your-pinecone-api-key
LANGCHAIN_TRACING_V2=false

⚠️ Important: Never commit .env or credentials to GitHub.
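These settings are typically loaded by `config.py` (python-dotenv populates `os.environ` from `.env`). A minimal sketch, using the variable names from the `.env` above; the `Config` class itself is illustrative, not the repository's actual code:

```python
import os

# load_dotenv() from python-dotenv would normally populate os.environ
# from .env; this sketch reads the variables directly to stay stdlib-only.

class Config:
    """Illustrative settings holder; the real config.py may differ."""
    def __init__(self) -> None:
        self.index_name = os.getenv("INDEX_NAME", "langchain-doc-index")
        self.pinecone_api_key = os.getenv("PINECONE_API_KEY")
        self.gemini_api_key = os.getenv("GEMINI_API_KEY")
        self.tracing = os.getenv("LANGCHAIN_TRACING_V2", "false").lower() == "true"

    def validate(self) -> list[str]:
        """Return the names of required keys that are missing."""
        missing = []
        if not self.pinecone_api_key:
            missing.append("PINECONE_API_KEY")
        if not self.gemini_api_key:
            missing.append("GEMINI_API_KEY")
        return missing
```

Failing fast on `validate()` at startup gives a clearer error than a deep stack trace from the Pinecone or Vertex AI client.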

5️⃣ Run the Application

streamlit run app.py

🚀 Usage

Simple Q&A Mode

  • Enter a query (e.g., "How do I integrate Pinecone with LangChain?")
  • The assistant retrieves relevant chunks, augments the query with them, and returns a grounded answer.
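Under the hood, the "augment" step usually means stitching the retrieved chunks into the prompt before the LLM call. A minimal sketch of that assembly (the prompt wording is illustrative; the real `rag_system.py` may phrase it differently):

```python
def build_augmented_prompt(query: str, chunks: list[str]) -> str:
    """Combine retrieved documentation chunks with the user query
    so the LLM answers from the provided context only."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

Numbering the chunks makes it easy to ask the model to cite which source passage supports its answer.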

Chat Mode

  • Conversational interface with memory
  • Supports follow-up questions using context
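Follow-up questions work because prior turns are folded back into each new prompt. One common pattern, sketched below; the real chat memory likely uses LangChain's message-history abstractions rather than this hand-rolled version:

```python
def render_history(
    history: list[tuple[str, str]], query: str, max_turns: int = 3
) -> str:
    """Render the last few (user, assistant) turns plus the new query
    into one prompt, dropping older turns to bound context length."""
    lines = []
    for user, assistant in history[-max_turns:]:
        lines.append(f"User: {user}")
        lines.append(f"Assistant: {assistant}")
    lines.append(f"User: {query}")
    lines.append("Assistant:")
    return "\n".join(lines)
```

Capping `max_turns` keeps token usage predictable; a production system might summarize older turns instead of dropping them.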

🔒 Security

  • Exclude .env and sensitive files using .gitignore
  • Rotate keys if exposed
  • Use a secrets manager for production deployments

📌 Roadmap

  • Add unit tests for ingestion & RAG
  • Support additional vector DBs (FAISS, Weaviate)
  • Improve error handling for network & LLM calls
  • Add analytics (latency, token usage, retrieval quality)
