GitHab is a Retrieval-Augmented Generation (RAG) system designed to "understand" entire Python GitHub repositories. By combining AST parsing, modular chunking, and multi-agent orchestration via LangGraph, GitHab lets developers hold high-context conversations with their Python codebases.
- Deep AST Parsing: Extracts functions, classes, and dependencies to preserve code logic, not just raw text.
- Dual-Stream Indexing: Stores both raw code chunks and LLM-generated summaries in Pinecone for superior retrieval accuracy.
- Intelligent Query Optimization: Uses a specialized "Understand" node to rewrite messy user questions into optimized search queries.
- Stateful Multi-Agent Workflow: Orchestrated by LangGraph, ensuring a reliable path from question → retrieval → analysis → answer.
- Persistent Context: Remembers which repository you are discussing across chat sessions.
```
.
├── AI/              # Agentic logic & LangGraph
│   ├── graph/       # Workflow definitions
│   └── nodes/       # Specialized LLM tasks
├── ingestion/       # Data processing pipeline
├── pipeline/        # Orchestration layer
├── templates/       # Flask frontend (HTML)
├── vectorestore/    # Database & Embeddings
├── main.py          # Flask server entry point
└── .env             # Environment secrets
```
The system is organized into specialized modules to ensure scalability and maintainability:
- `repository_loader.py`: Handles cloning and local management of GitHub repos.
- `ast_parser.py`: Navigates the Abstract Syntax Tree to identify code structures.
- `chunker.py`: Breaks code into "semantic" chunks with rich metadata.
- `summary.py`: Uses LLMs to generate high-level summaries of code modules.
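To illustrate the parsing and chunking stages, here is a minimal sketch using Python's built-in `ast` module. The `extract_chunks` helper is hypothetical, not the project's actual `ast_parser.py`/`chunker.py` code; it shows the general idea of emitting one metadata-rich chunk per top-level function or class.

```python
import ast

def extract_chunks(source: str, path: str) -> list[dict]:
    """Walk a module's AST and emit one semantic chunk per
    top-level function or class, with metadata attached."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "file": path,
                "name": node.name,
                "kind": type(node).__name__,
                "lineno": node.lineno,
                "docstring": ast.get_docstring(node),
                "code": ast.get_source_segment(source, node),
            })
    return chunks

sample = '''
def greet(name):
    """Say hello."""
    return f"Hello, {name}!"

class Repo:
    pass
'''

chunks = extract_chunks(sample, "example.py")
```

Chunking on AST boundaries rather than fixed token windows is what keeps each stored chunk logically self-contained.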
- `graph/stategraph.py`: The "brain" of the app. Defines the LangGraph flow and state transitions.
- `nodes/`: Individual processing units:
  - `understand_question.py`: Optimizes user intent.
  - `retrieve_code_context.py`: Queries Pinecone namespaces.
  - `analyze_code.py`: Reasons over retrieved snippets.
  - `generate_answer.py`: Produces the final natural-language response.
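The four-node flow above can be sketched as a plain-Python state machine: each node reads a shared state dict, updates it, and hands off to the next. This is a stdlib illustration of the question → retrieval → analysis → answer path, not the project's actual LangGraph definition; the function bodies are stand-ins.

```python
# Each node mutates a shared state dict, mirroring a LangGraph StateGraph
# where nodes are wired in sequence and pass state along edges.

def understand_question(state):
    # Stand-in for LLM query rewriting: normalize the raw question.
    state["query"] = state["question"].lower().rstrip("?")
    return state

def retrieve_code_context(state):
    # Stand-in for a Pinecone query: naive keyword match over a tiny corpus.
    corpus = {"chunker": "def chunk(code): ..."}
    state["snippets"] = [v for k, v in corpus.items() if k in state["query"]]
    return state

def analyze_code(state):
    state["analysis"] = f"Found {len(state['snippets'])} relevant snippet(s)."
    return state

def generate_answer(state):
    state["answer"] = state["analysis"]
    return state

PIPELINE = [understand_question, retrieve_code_context, analyze_code, generate_answer]

def run(question):
    state = {"question": question}
    for node in PIPELINE:
        state = node(state)
    return state

result = run("How does the chunker work?")
```

In the real system, LangGraph adds what this sketch lacks: typed state, conditional edges, and checkpointing for persistent context.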
- `pinecone_client.py`: Manages index lifecycle and serverless configurations.
- `vectordb.py`: Handles embedding generation and namespace-isolated storage.
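Namespace isolation and dual-stream indexing can be pictured with an in-memory stand-in: raw code chunks and LLM summaries live in separate namespaces keyed per repository. This is a stdlib sketch, not the Pinecone client API; namespace names are invented for illustration.

```python
# In-memory stand-in for namespace-isolated storage: raw code chunks and
# summaries are kept in separate namespaces so queries can target either stream.
from collections import defaultdict

class NamespaceStore:
    def __init__(self):
        self._data = defaultdict(dict)  # namespace -> {id: record}

    def upsert(self, namespace, record_id, record):
        self._data[namespace][record_id] = record

    def query(self, namespace, predicate):
        # Real retrieval would rank by vector similarity; a predicate
        # filter is enough to show the namespace isolation.
        return [r for r in self._data[namespace].values() if predicate(r)]

store = NamespaceStore()
store.upsert("repo1-code", "chunk-1", {"text": "def parse(): ..."})
store.upsert("repo1-summaries", "sum-1", {"text": "Parses the repo AST."})

code_hits = store.query("repo1-code", lambda r: "parse" in r["text"])
```

Keeping the two streams separate lets the retriever blend exact-code matches with higher-level summary matches at query time.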
- Framework: Flask (Backend), Jinja2 (Frontend)
- Orchestration: LangChain & LangGraph
- LLMs: Nvidia Nemotron (via OpenRouter/NIM)
- Embeddings: HuggingFace `all-MiniLM-L6-v2`
- Vector Database: Pinecone
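Retrieval ultimately reduces to cosine similarity between embedding vectors. A minimal sketch with toy 3-dimensional vectors (standing in for the 384-dimensional `all-MiniLM-L6-v2` embeddings; the file names and values are invented):

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = (a · b) / (|a| |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0, 1.0]
docs = {"chunker.py": [0.9, 0.1, 0.8], "README.md": [0.0, 1.0, 0.1]}
best = max(docs, key=lambda name: cosine_similarity(query, docs[name]))
```

Pinecone performs this ranking server-side at scale; the sketch only shows the metric it optimizes.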
- Python 3.10+
- Pinecone API Key
- OpenAI-Compatible API Key (e.g., OpenRouter or NVIDIA)
```bash
# Clone this repository
git clone https://github.com/steve601/codebase-Understanding-agent.git
cd codebase-Understanding-agent

# Install dependencies (using uv or pip)
pip install -r requirements.txt
```

Create a `.env` file and add:
```bash
OPENAI_API_KEY=your_key_here
OPENAI_API_BASE=https://openrouter.ai  # or your preferred provider
PINECONE_API_KEY=your_key_here
```

Run the server:

```bash
uv run python main.py
```

We are constantly working to make GitHab more powerful and accessible. Upcoming features include:
- Multi-Language AST Parsing: Expanding beyond Python to support TypeScript, Go, and Java using Tree-sitter for universal codebase compatibility.
- Streaming Responses: Transitioning from batch processing to real-time token streaming (SSE) to provide an "instant-reply" chat experience.
- Voice-to-Code Interaction: Integrating Speech-to-Text (STT) for hands-free codebase navigation and natural voice querying.
- Incremental Ingestion: Automatically detecting file changes in a repository to update Pinecone vectors without re-parsing the entire project.
- Local LLM Support: Optional integration with Ollama or vLLM for 100% private, on-premise code analysis.
Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Stephen Odhiambo
Building the future of AI-assisted software engineering.