In this tutorial, you will learn how to design and build a chatbot backed by retrieval-augmented generation (RAG) using LlamaIndex and Chainlit. You will first use the high-level APIs in LlamaIndex and Chainlit to build the chatbot in just a few lines of code. Finally, you will create the RAG system from scratch to understand how things work under the hood.

The article will cover the following topics:

1. Introduction to RAG-enabled chatbots
2. Overview of LlamaIndex and Chainlit
3. Designing and building a RAG-enabled chatbot using LlamaIndex and Chainlit
4. Creating the RAG system from scratch

By the end of this article, you will understand how to design and build a RAG-enabled chatbot using LlamaIndex and Chainlit, both with the high-level APIs and from scratch. Happy learning! 🤖

## RAG
A RAG system consists of the following components:

* Document loader
* Splitting
* Embedding storage
* Query engine
* LLM

## Steps of RAG
* Load the documents.
* Split the documents into chunks and compute an embedding for each chunk.
* Store the embeddings in a vector store.
* Let the user submit queries through a frontend.
* Convert the query to an embedding and run a semantic search against the embeddings in the vector store.
* Augment the user query with the search results (the context).
* Pass the augmented text to the LLM.
* The LLM returns an answer to the user query.
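
The steps above can be sketched in plain Python to make the data flow concrete. This is only an illustration under loud assumptions: the `embed` function is a toy bag-of-words counter standing in for a real embedding model, the "vector store" is a plain list, the example sentences are made up, and every function name here is hypothetical rather than part of LlamaIndex or Chainlit. The final LLM call is left out; the sketch stops at the augmented prompt.

```python
import math
import re
from collections import Counter


def split_into_chunks(text):
    """Split a document into sentence-sized chunks (a stand-in for a real splitter)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_vector_store(docs):
    """Chunk every document and store (chunk, embedding) pairs in a list."""
    return [(chunk, embed(chunk)) for doc in docs for chunk in split_into_chunks(doc)]


def retrieve(query, store, top_k=2):
    """Return the top_k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(query, context_chunks):
    """Augment the user query with the retrieved context."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Made-up example documents, used only for this illustration.
documents = [
    "Dune is a science fiction novel. It was written by Frank Herbert.",
    "Pride and Prejudice is a romance novel. It was written by Jane Austen.",
]

store = build_vector_store(documents)
question = "Which novel is science fiction?"
prompt = build_prompt(question, retrieve(question, store, top_k=1))
print(prompt)  # In a real pipeline, this prompt would be sent to the LLM.
```

Word counts only match exact tokens, so this toy retriever cannot relate "wrote" to "written"; a learned embedding model is what makes the search semantic.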

## Let's implement
We are going to use LlamaIndex to build our RAG pipeline and Chainlit for the chatbot frontend. The vector store will be the default vector store in LlamaIndex.

*LlamaIndex is a Python library that provides APIs for building a RAG system.*

*Chainlit is a Python library that lets you create a frontend for your LLM-based applications.*

## Installations


In [1]:
!pip install llama-index

Collecting llama-index
  Downloading llama_index-0.9.44-py3-none-any.whl.metadata (8.4 kB)
Collecting SQLAlchemy>=1.4.49 (from SQLAlchemy[asyncio]>=1.4.49->llama-index)
  Downloading SQLAlchemy-2.0.25-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.6 kB)
Collecting aiohttp<4.0.0,>=3.8.6 (from llama-index)
  Downloading aiohttp-3.9.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.4 kB)
Collecting dataclasses-json (from llama-index)
  Downloading dataclasses_json-0.6.4-py3-none-any.whl.metadata (25 kB)
Collecting deprecated>=1.2.9.3 (from llama-index)
  Downloading Deprecated-1.2.14-py2.py3-none-any.whl.metadata (5.4 kB)
Collecting dirtyjson<2.0.0,>=1.0.8 (from llama-index)
  Downloading dirtyjson-1.0.8-py3-none-any.whl (25 kB)
Collecting httpx (from llama-index)
  Downloading httpx-0.26.0-py3-none-any.whl.metadata (7.6 kB)
Collecting nest-asyncio<2.0.0,>=1.5.8 (from llama-index)
  Downloading nest_asyncio-1.6.0-py3-none-any.whl.metadata (2.

In [2]:
!pip install chainlit

Collecting chainlit
  Downloading chainlit-1.0.200-py3-none-any.whl.metadata (5.0 kB)
Collecting aiofiles<24.0.0,>=23.1.0 (from chainlit)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting asyncer<0.0.3,>=0.0.2 (from chainlit)
  Downloading asyncer-0.0.2-py3-none-any.whl (8.3 kB)
Collecting dataclasses_json<0.6.0,>=0.5.7 (from chainlit)
  Downloading dataclasses_json-0.5.14-py3-none-any.whl.metadata (22 kB)
Collecting fastapi<0.109.0,>=0.100 (from chainlit)
  Downloading fastapi-0.108.0-py3-none-any.whl.metadata (24 kB)
Collecting fastapi-socketio<0.0.11,>=0.0.10 (from chainlit)
  Downloading fastapi_socketio-0.0.10-py3-none-any.whl (7.4 kB)
Collecting filetype<2.0.0,>=1.2.0 (from chainlit)
  Downloading filetype-1.2.0-py2.py3-none-any.whl (19 kB)
Collecting httpx<0.25.0,>=0.23.0 (from chainlit)
  Downloading httpx-0.24.1-py3-none-any.whl.metadata (7.4 kB)
Collecting lazify<0.5.0,>=0.4.0 (from chainlit)
  Downloading Lazify-0.4.0-py2.py3-none-any.whl (3.1 kB)
C

## Download the data
* We are going to use the "Goodreads books dataset" from [Kaggle](https://www.kaggle.com/datasets/bahramjannesarr/goodreads-book-datasets-10m).
* This dataset contains information about books in the Goodreads database, such as author name, publish date, and description.
* Download the [data](https://www.kaggle.com/datasets/bahramjannesarr/goodreads-book-datasets-10m) and unzip it.

## Reading data
* The dataset consists of CSV files, which we will read into a dataframe.
* We will read the files in parallel and combine them into a single dataframe.

In [19]:
from fastai.text.all import pd, Path

In [4]:
path = "./archive"

In [5]:
files_path = Path(path)

In [6]:
files_path.ls()

(#30) [Path('archive/book1-100k.csv'),Path('archive/book1000k-1100k.csv'),Path('archive/book100k-200k.csv'),Path('archive/book1100k-1200k.csv'),Path('archive/book1200k-1300k.csv'),Path('archive/book1300k-1400k.csv'),Path('archive/book1400k-1500k.csv'),Path('archive/book1500k-1600k.csv'),Path('archive/book1600k-1700k.csv'),Path('archive/book1700k-1800k.csv')...]
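
The `.ls()` method used above is added to `Path` by fastai's imports; it is not part of the standard library. If fastai is not available, a rough equivalent can be sketched with `pathlib` alone. The helper name `list_csv_files` is hypothetical, and the demonstration below runs on a throwaway directory with empty placeholder files rather than the real dataset.

```python
import tempfile
from pathlib import Path


def list_csv_files(directory):
    """Return the CSV files directly inside `directory`, sorted by name."""
    return sorted(Path(directory).glob("*.csv"))


# Demonstration on a temporary directory; the files are empty placeholders.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("book1-100k.csv", "book100k-200k.csv", "notes.txt"):
        (Path(tmp) / name).touch()
    csv_names = [p.name for p in list_csv_files(tmp)]

print(csv_names)
```

For the tutorial's data, the call would be `list_csv_files("./archive")`, and the resulting list could be passed to `pd.read_csv` one file at a time.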

In [16]:
from concurrent.futures import ThreadPoolExecutor

In [22]:
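def create_df(files):
    """Read every CSV file in the `files` directory and combine them into one dataframe."""
    # Read the files concurrently; each worker thread parses one CSV.
    with ThreadPoolExecutor() as executor:
        dfs = list(executor.map(pd.read_csv, files.ls()))

    # Stack the per-file dataframes and renumber the rows from 0.
    combined_df = pd.concat(dfs, ignore_index=True)
    return combined_df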

In [23]:
bigger_df = create_df(files_path)

## Extracting and storing data
* In a project setting, you won't use data straight from an in-memory dataframe.
* Instead, you should extract it and store it in a database.
* Here we will make use of a database.

## Managing Database
* We will use SQLAlchemy to manage our database.
* SQLAlchemy is a Python toolkit for interacting with databases using Python objects.

In [7]:
!pip install sqlalchemy



Create the database engine. Here it is SQLite.

In [9]:
from sqlalchemy import create_engine

In [10]:
engine = create_engine('sqlite:///goodreads.db')

Use the `to_sql` method in pandas to export the dataframe into the database.

In [11]:
table_name = "goodreads"

In [12]:
# The combined dataframe was named `bigger_df` above; `df` was never defined.
bigger_df.to_sql(table_name, engine, if_exists='replace', index=False)
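
Once the export has run successfully, it can be useful to confirm that the table actually contains rows. The sketch below uses Python's built-in `sqlite3` module for that check. The helper `count_rows` is a hypothetical name, and the demonstration builds a small temporary database with made-up columns and rows, since the real column names are not shown here.

```python
import os
import sqlite3
import tempfile


def count_rows(db_path, table_name):
    """Return the number of rows in `table_name` of the SQLite file at `db_path`."""
    conn = sqlite3.connect(db_path)
    try:
        # Table names cannot be bound as parameters, so quote the identifier explicitly.
        quoted = '"' + table_name.replace('"', '""') + '"'
        (count,) = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()
        return count
    finally:
        conn.close()


# Demonstration on a temporary database with made-up columns and rows.
with tempfile.TemporaryDirectory() as tmp:
    demo_db = os.path.join(tmp, "demo.db")
    conn = sqlite3.connect(demo_db)
    conn.execute("CREATE TABLE goodreads (title TEXT, author TEXT)")
    conn.executemany(
        "INSERT INTO goodreads VALUES (?, ?)",
        [("Dune", "Frank Herbert"), ("Emma", "Jane Austen")],
    )
    conn.commit()
    conn.close()
    row_count = count_rows(demo_db, "goodreads")

print(row_count)
```

Against the tutorial's database, the equivalent call would be `count_rows("goodreads.db", "goodreads")`, which should match the number of rows in the exported dataframe.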