An end-to-end Retrieval-Augmented Generation (RAG) chatbot for consumer complaints.
The system processes raw CFPB complaint data, builds embeddings, and provides natural language responses via a RAG pipeline.
- Project Overview
- Business Context
- Objectives
- Dataset Overview
- Project Structure
- Architecture
- Pipeline & Modeling Approach
- MLOps & Engineering Practices
- Setup & Installation
- Running the Project
- Technologies Used
- Author
This project implements a RAG chatbot for consumer complaints.
It enables efficient information retrieval from large complaint datasets while generating coherent answers in natural language.
The system covers:
- Data ingestion and cleaning of CFPB complaint datasets
- Text chunking and embedding generation
- FAISS/Chroma vector store construction
- RAG pipeline for retrieval + generation
- Qualitative evaluation of responses
- Streamlit/Gradio user interface
- Containerized deployment with Docker
- Data versioning and reproducibility with DVC
Financial institutions and consumer protection agencies need fast, accurate responses to complaints.
Manual review is slow, error-prone, and expensive.
This chatbot supports:
- Automated complaint responses
- Insight extraction from large datasets
- Research and monitoring of consumer issues
- Enhanced customer support efficiency
- Build a retrieval-augmented system for complaint data
- Generate accurate and context-aware responses
- Maintain a scalable vector store for embeddings
- Ensure reproducibility with DVC and versioned datasets
- Deploy via API or interactive UI
Source: Consumer Financial Protection Bureau (CFPB) consumer complaints
Key fields:
| Column | Description |
|---|---|
| ComplaintID | Unique complaint identifier |
| Product | Complaint category |
| Issue | Detailed complaint issue |
| ConsumerComplaint | Complaint narrative |
| Company | Company involved |
| DateReceived | Complaint submission timestamp |
Derived features:
- Cleaned and normalized complaint text
- Chunked complaint segments for embedding
- Metadata for retrieval and filtering
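As a rough illustration of how the derived features could fit together, the sketch below turns one complaint row into chunk records that carry metadata for filtering. The column names come from the table above; `ChunkRecord`, `chunk_words`, `build_chunk_records`, and the overlapping word-window strategy are all hypothetical and are not taken from the project's code.

```python
from dataclasses import dataclass

# Columns copied into each chunk's metadata (names from the key-fields table).
METADATA_FIELDS = ("ComplaintID", "Product", "Issue", "Company", "DateReceived")


@dataclass
class ChunkRecord:
    """One embeddable segment of a complaint plus its metadata (hypothetical type)."""
    chunk_id: str
    text: str
    metadata: dict


def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def build_chunk_records(row: dict, chunk_size: int = 50, overlap: int = 10) -> list[ChunkRecord]:
    """Chunk the narrative of one complaint row and attach its metadata."""
    narrative = (row.get("ConsumerComplaint") or "").strip()
    if not narrative:
        return []  # rows without a narrative produce no chunks
    metadata = {field: row.get(field) for field in METADATA_FIELDS}
    return [
        ChunkRecord(
            chunk_id=f"{row['ComplaintID']}-{i}",
            text=chunk,
            metadata=dict(metadata, chunk_index=i),
        )
        for i, chunk in enumerate(chunk_words(narrative, chunk_size, overlap))
    ]
```

Calling `build_chunk_records(row)` on a dictionary with the fields above returns a list of records whose `metadata` can later be used to filter retrieval results, for example by `Product` or `Company`.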
rag-complaint-chatbot/
│
├── config/ # YAML configs: paths, embeddings, RAG, UI
├── data/ # Raw, interim, processed, external datasets
├── vector_store/ # Persisted FAISS / Chroma embeddings
├── notebooks/ # EDA, preprocessing, embeddings, evaluation
├── scripts/ # CLI scripts to run pipelines and UI
├── src/
│   └── rag_chatbot/ # Main Python package
│       ├── core/ # Settings and configuration loader
│       ├── data/ # Data loading and filtering
│       ├── preprocessing/ # Text cleaning & normalization
│       ├── chunking/ # Text chunking logic
│       ├── embeddings/ # Embedding generation
│       ├── vectorstore/ # FAISS / ChromaDB wrappers
│       ├── rag/ # RAG pipeline (retriever + generator)
│       ├── evaluation/ # Qualitative evaluation logic
│       ├── ui/ # Streamlit / Gradio interface
│       └── utils/ # Shared utilities
├── tests/ # Unit & integration tests
├── docker/ # Dockerfile and compose
├── dvc.yaml # DVC pipeline
├── params.yaml # Global parameters
├── requirements.txt
├── pyproject.toml
└── README.md
- EDA – Explore distribution of complaint categories, narrative length, missing values.
- Preprocessing – Text cleaning: lowercasing, punctuation removal, stopword handling.
- Chunking – Split long complaint narratives into manageable text chunks.
- Embedding – Generate vector representations for each chunk.
- Vector Store – Store embeddings in FAISS or Chroma for efficient retrieval.
- RAG Pipeline – Combine retrieval and generation to answer queries.
- Evaluation – Qualitative analysis of responses against test complaints.
- UI / API – Interactive interface for testing the chatbot or integrating into systems.
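The retrieval and prompt-assembly steps above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the project's implementation: `embed` is a toy hashed bag-of-words stand-in for a real embedding model, the brute-force scan in `retrieve` stands in for a FAISS or Chroma index, and `build_prompt` only prepares the text that would be sent to a language model. All three function names and the prompt wording are hypothetical.

```python
import hashlib
import math
import re


def embed(text: str, dim: int = 256) -> list[float]:
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # Map each token to a stable bucket so the same word always lands in the same slot.
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Return the k chunks most similar to the query (brute-force cosine similarity)."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, embed(chunk))), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def build_prompt(query: str, retrieved: list[tuple[float, str]]) -> str:
    """Assemble the generation prompt from the retrieved excerpts."""
    context = "\n".join(f"- {text}" for _, text in retrieved)
    return (
        "Answer the question using only the complaint excerpts below.\n"
        f"Excerpts:\n{context}\n"
        f"Question: {query}\nAnswer:"
    )


if __name__ == "__main__":
    chunks = [
        "late fee charged on my credit card",
        "mortgage escrow payment was misapplied",
        "debt collector called repeatedly",
    ]
    question = "credit card late fee"
    print(build_prompt(question, retrieve(question, chunks)))
```

In a real pipeline the chunk embeddings would be computed once and persisted in the vector store, so only the query is embedded at request time.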
- Data Versioning: DVC tracks raw and processed datasets, embeddings, and vector stores
- Containerization: Docker & docker-compose for easy deployment
- Testing: Unit and integration tests for preprocessing, chunking, embedding, and RAG pipeline
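A unit test for the preprocessing step might look like the sketch below. The `clean_text` function and its small stopword list are hypothetical, defined inline so the example is self-contained; in the repository the function would presumably be imported from the preprocessing module instead. The tests use plain `assert` statements, so pytest can collect them without extra setup.

```python
import string

# Tiny illustrative stopword list; a real project would likely use a fuller one.
STOPWORDS = frozenset({"the", "a", "an", "and", "or", "to", "of", "is", "was", "my"})

# Translation table that replaces every ASCII punctuation character with a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text: str, remove_stopwords: bool = False) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace, optionally drop stopwords."""
    tokens = text.lower().translate(_PUNCT_TO_SPACE).split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)


def test_lowercases_and_strips_punctuation():
    assert clean_text("Late FEE!!  Charged, twice.") == "late fee charged twice"


def test_stopword_removal_is_optional():
    assert clean_text("The fee was charged to my card", remove_stopwords=True) == "fee charged card"
    assert clean_text("The fee") == "the fee"


def test_empty_input():
    assert clean_text("") == ""
```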
Clone the repository:
git clone https://github.com/<username>/rag-complaint-chatbot.git
cd rag-complaint-chatbot

Create a virtual environment:
python -m venv .venv
source .venv/bin/activate

Install dependencies:
pip install -r requirements.txt

Or install in editable mode:
pip install -e .

Run the DVC pipeline and start the application:
dvc repro
python app.py

- Python 3.10+
- OpenAI API / LangChain
- FAISS / Chroma
- DVC (data version control)
- FastAPI
- Streamlit / Gradio
- Docker & docker-compose
- Pytest for testing
Tibebu Kaleb – Full-stack AI/ML engineer with experience in NLP and RAG pipelines