RAG (Retrieval-augmented generation) ChatBot

Important

Disclaimer: The code has been tested on Ubuntu 22.04.2 LTS running on a Lenovo Legion 5 Pro with a 12th Gen Intel® Core™ i7-12700H (20 logical CPUs) and an NVIDIA GeForce RTX 3060. If you are using another operating system or different hardware and you can't load the models, please take a look at the official CTransformers GitHub issues or at the official llama-cpp-python GitHub issues.

Warning

Note: the large language model sometimes generates hallucinations or false information.

Introduction

This project combines the power of CTransformers, llama.cpp, LangChain (used only for document chunking and querying the vector database; we plan to eliminate it entirely), Chroma and Streamlit to build:

a conversation-aware Chatbot (ChatGPT-like experience).
a RAG (Retrieval-augmented generation) ChatBot.

The RAG Chatbot works by taking a collection of Markdown files as input and, when asked a question, provides the corresponding answer based on the context provided by those files.

The Memory Builder component of the project loads Markdown pages from the docs folder. It then divides these pages into smaller sections, calculates the embeddings (a numerical representation) of these sections with the all-MiniLM-L6-v2 sentence-transformer, and saves them in an embedding database called Chroma for later use.
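The steps above (load, split, embed, store) can be sketched in plain Python. This is a minimal illustration, not the project's implementation: `split_into_chunks`, `toy_embed` and `build_memory_index` are hypothetical names, the hashed bag-of-words vector is only a stand-in for the all-MiniLM-L6-v2 sentence-transformer, a plain list stands in for Chroma, and the overlap value is an assumption.

```python
import hashlib
import math


def split_into_chunks(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into character windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters so that a sentence cut at
    a boundary still appears whole in one of the two neighbouring chunks.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text: str, dim: int = 384) -> list[float]:
    """Stand-in embedding: hashed bag of words, L2-normalized.

    A real setup would call the sentence-transformer here; only the output
    size (384) matches that model.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.sha256(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_memory_index(pages: dict[str, str], chunk_size: int = 1000) -> list[dict]:
    """Chunk every page and keep each chunk next to its embedding."""
    index = []
    for source, text in pages.items():
        for position, chunk in enumerate(split_into_chunks(text, chunk_size)):
            index.append(
                {
                    "source": source,
                    "position": position,
                    "text": chunk,
                    "embedding": toy_embed(chunk),
                }
            )
    return index
```

Keeping the source file name and chunk position with each record makes it possible to show the user where a retrieved section came from.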

When a user asks a question, the RAG ChatBot retrieves the most relevant sections from the embedding database. Because the original question is not always optimal for retrieval, we first prompt an LLM to rewrite the question, then conduct retrieval-augmented reading. The most relevant sections are then used as context to generate the final answer using a local large language model (LLM). Additionally, the chatbot is designed to remember previous interactions. It saves the chat history and considers the relevant context from previous conversations to provide more accurate answers.
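The rewrite, retrieve, read flow described above can be sketched as follows. Everything here is an assumption made for illustration: `embed` and `llm` are caller-supplied callables (stand-ins for the embedding model and the local LLM), the index is a plain list of records with `text` and `embedding` keys rather than Chroma, and the prompt wording is invented.

```python
import math
from typing import Callable


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query_vec: list[float], index: list[dict], k: int = 2) -> list[dict]:
    """Brute-force search: score every record and keep the k most similar."""
    ranked = sorted(
        index,
        key=lambda record: cosine_similarity(query_vec, record["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def answer(
    question: str,
    history: list[tuple[str, str]],
    index: list[dict],
    embed: Callable[[str], list[float]],
    llm: Callable[[str], str],
    k: int = 2,
) -> str:
    # Step 1 (rewrite): turn the follow-up into a standalone question,
    # using the chat history so pronouns and references get resolved.
    history_text = "\n".join(f"{role}: {text}" for role, text in history)
    standalone = llm(
        f"Chat history:\n{history_text}\n\n"
        f"Rewrite this follow-up as a standalone question: {question}"
    )
    # Step 2 (retrieve): look up the sections closest to the rewritten question.
    hits = retrieve_top_k(embed(standalone), index, k)
    # Step 3 (read): answer using only the retrieved sections as context.
    context = "\n---\n".join(record["text"] for record in hits)
    reply = llm(f"Context:\n{context}\n\nQuestion: {standalone}\nAnswer:")
    # Remember the exchange so later questions can refer back to it.
    history.append(("user", question))
    history.append(("assistant", reply))
    return reply
```

The brute-force ranking is the simplest correct choice for a small document set; a vector database replaces it with an approximate index once the collection grows.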

To deal with context overflows, we implemented two approaches:

Create And Refine the Context: synthesize a response sequentially through all retrieved contents.
Hierarchical Summarization of Context: generate an answer for each relevant section independently, and then hierarchically combine the answers.
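A minimal sketch of the two strategies, under stated assumptions: `llm` is any callable that maps a prompt string to a completion, the function names and prompt wording are hypothetical, and the hierarchical version here is synchronous for brevity (it does not reproduce the project's asynchronous variant).

```python
from typing import Callable

LLM = Callable[[str], str]


def create_and_refine(question: str, sections: list[str], llm: LLM) -> str:
    """Answer from the first section, then refine with each following one."""
    answer = ""
    for position, section in enumerate(sections):
        if position == 0:
            prompt = f"Context:\n{section}\n\nQuestion: {question}\nAnswer:"
        else:
            prompt = (
                f"Existing answer:\n{answer}\n\n"
                f"New context:\n{section}\n\n"
                f"Refine the existing answer to the question: {question}"
            )
        # Each call sees one section only, so the prompt stays small.
        answer = llm(prompt)
    return answer


def tree_summarize(question: str, sections: list[str], llm: LLM, fan_in: int = 2) -> str:
    """Answer each section independently, then merge answers level by level."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not sections:
        return ""
    # Leaf level: one independent answer per retrieved section.
    answers = [
        llm(f"Context:\n{section}\n\nQuestion: {question}\nAnswer:")
        for section in sections
    ]
    # Upper levels: combine groups of fan_in answers until one remains.
    while len(answers) > 1:
        merged = []
        for start in range(0, len(answers), fan_in):
            group = answers[start:start + fan_in]
            if len(group) == 1:
                merged.append(group[0])  # odd one out moves up unchanged
            else:
                joined = "\n".join(group)
                merged.append(
                    llm(
                        f"Partial answers:\n{joined}\n\n"
                        f"Combine them into one answer to: {question}"
                    )
                )
        answers = merged
    return answers[0]
```

The first approach needs one LLM call per section but runs strictly in sequence. The second makes more calls in total, yet the calls within a level are independent of each other, which is what makes a concurrent implementation attractive.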

Prerequisites

Python 3.10+
GPU supporting CUDA 12 and up.
Poetry

Install Poetry

Install Poetry by following this link.

Bootstrap Environment

To make installing the dependencies easy, we created a Makefile.

How to use the make file

Important

Run make setup to install sentence-transformers with pip; this works around Poetry's problem installing torch (it doesn't install the CUDA dependencies).

Check: make check
- Use it to check that which pip3 and which python3 point to the right path.
Setup: make setup
- Creates an environment and installs all dependencies.
Update: make update
- Updates the environment and installs all updated dependencies.
Tidy up the code: make tidy
- Runs Ruff check and format.
Clean: make clean
- Removes the environment and all cached files.
Test: make test
- Runs all tests using pytest.

Note

Run Setup as your init command (or after Clean).

Using the Open-Source Models Locally

We utilize two open-source libraries, CTransformers and llama.cpp, which allow us to work efficiently with transformer-based models. Running these LLMs at full precision on a typical local PC is impractical due to the large number of parameters (~7 billion). These libraries enable us to run them on either a CPU or a GPU. Additionally, we use quantization to 4-bit precision to reduce the number of bits required to represent the weights. The quantized models are stored in GGML/GGUF format.
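To make the effect of 4-bit precision concrete, here is a small sketch of the arithmetic plus a toy block quantizer. It is an illustration under simplifying assumptions: the symmetric scheme with one scale per block is not the exact GGML/GGUF layout, the size estimate ignores the storage needed for the scales, and all function names are hypothetical.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def quantize_block(weights: list[float]) -> tuple[float, list[int]]:
    """Symmetric 4-bit quantization of one block of weights.

    Returns a float scale and integers in [-8, 7], the range of a signed
    4-bit value. The largest magnitude in the block maps to +/-7.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, quantized


def dequantize_block(scale: float, quantized: list[int]) -> list[float]:
    """Recover approximate weights; the error is at most about scale / 2."""
    return [scale * q for q in quantized]


if __name__ == "__main__":
    # 7 billion weights: 16-bit versus 4-bit storage.
    print(model_size_gb(7e9, 16))  # about 14 GB
    print(model_size_gb(7e9, 4))   # about 3.5 GB
    scale, q = quantize_block([0.7, -0.3, 0.0, 0.1])
    print(q, dequantize_block(scale, q))
```

Going from 16 to 4 bits per weight cuts the storage for a 7-billion-parameter model from roughly 14 GB to roughly 3.5 GB, at the cost of a small rounding error in each weight.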

Supported Models

Example Data

You could download some Markdown pages from the Blendle Employee Handbook and put them under docs.

Build the memory index

Run:

python chatbot/memory_builder.py --chunk-size 1000

Run the Chatbot

To interact with a GUI, type:

streamlit run chatbot/chatbot_app.py -- --model openchat

Run the RAG Chatbot

To interact with a GUI, type:

streamlit run chatbot/rag_chatbot_app.py -- --model openchat --k 2 --synthesis-strategy async_tree_summarization
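In the command above, the options after the bare `--` are passed to the script rather than to Streamlit itself. A script could read them with `argparse` roughly as sketched below. This is an assumption about how such parsing might look, not the project's actual code: `parse_args` is a hypothetical name, the defaults and help texts are invented, and `--k` is presumed to be the number of retrieved sections.

```python
import argparse
from typing import Optional


def parse_args(argv: Optional[list[str]] = None) -> argparse.Namespace:
    """Parse the script-level options; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="RAG chatbot options (illustrative)")
    parser.add_argument(
        "--model",
        type=str,
        default="openchat",  # assumed default, taken from the example command
        help="Name of the local model to load.",
    )
    parser.add_argument(
        "--k",
        type=int,
        default=2,  # assumed default
        help="Presumably the number of sections to retrieve per question.",
    )
    parser.add_argument(
        "--synthesis-strategy",
        type=str,
        default="async_tree_summarization",  # the only value named in this README
        help="How the retrieved sections are combined into an answer.",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    # argparse exposes --synthesis-strategy as args.synthesis_strategy.
    print(args.model, args.k, args.synthesis_strategy)
```

Accepting an explicit `argv` list keeps the function easy to test without touching `sys.argv`.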

How to debug the Streamlit app on Pycharm

References

LLMs:
LLM integration and Modules:
- LangChain:
Embeddings:
- all-MiniLM-L6-v2
  - This is a sentence-transformers model: It maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search.
Vector Databases:
- Chroma
- Food Discovery with Qdrant
- Indexing algorithms:
  - There are many algorithms for building indexes to optimize vector search. Most vector databases implement Hierarchical Navigable Small World (HNSW) and/or Inverted File Index (IVF). Here are some great articles explaining them, and the trade-off between speed, memory and quality:
    - Nearest Neighbor Indexes for Similarity Search
    - Hierarchical Navigable Small World (HNSW)
    - From NVIDIA - Accelerating Vector Search: Using GPU-Powered Indexes with RAPIDS RAFT
    - From NVIDIA - Accelerating Vector Search: Fine-Tuning GPU Index Algorithms
    - PS: Flat indexes (i.e. no optimisation) can be used to maintain 100% recall and precision, at the expense of speed.
Retrieval Augmented Generation (RAG):
- Rewrite-Retrieve-Read
  - Because the original query is not always optimal for retrieval, especially in the real world, we first prompt an LLM to rewrite the query, then conduct retrieval-augmented reading.
- Rerank
- Conversational awareness
- Summarization: Improving RAG quality in LLM apps while minimizing vector storage costs
Chatbot Development:
Text Processing and Cleaning:
- clean-text
Open Source Repositories:
- CTransformers
- GPT4All
- llama.cpp
- llama-cpp-python
- pyllamacpp
- chroma
- Inspirational repos:


umbertogriffo/rag-chatbot
