In today's digital era, where businesses increasingly rely on technology to improve customer interactions, AI-powered chatbots have emerged as a game-changer. These chatbots can hold a natural conversation with users, providing real-time support and information. Although chatbots have become popular over the last two years, most of them are designed to interact in English. However, in a country like India, where millions of people speak Hindi as a first language, there is a need for chatbots that can interact in Hindi. Building a Hindi-language chatbot can help businesses reach a wider audience and provide better customer service. In this blog, we will walk through the technical journey of building a Hindi-language AI chatbot for enterprises. By the end, you will understand the challenges involved in building a Hindi-language chatbot and how to overcome them.

Building an AI chatbot is a two-step process: indexing and querying. In the indexing phase, we create a database of Hindi-language documents for the chatbot to refer to. This data forms the chatbot's knowledge base. It can be a collection of FAQs, product manuals, or any other information the chatbot needs while interacting with users. In the querying phase, we use this indexed data to answer user queries with the help of an LLM.

In this blog, I will be using the following tools and frameworks to build the RAG-based (retrieval-augmented generation), AI-powered Hindi chatbot:

- LangChain: I'll be using LangChain to build the RAG application, which will enhance the chatbot's ability to generate responses by leveraging information retrieved from a knowledge base.
- Qdrant: I'll be using Qdrant as the vector database to store the documents and their corresponding embeddings.
- FastText: I'll be using FastText as the language embedding framework for loading the Hindi language embeddings model.
- Ollama: Ollama makes it easy to load the LLM. We'll integrate Ollama with LangChain to do so.
- MLflow: I'll be using MLflow to manage the configurations of the RAG pipeline.

In [1]:
data_path = '../Hindi-Aesthetics-Corpus/Corpus'
chunk_size = 500
chunk_overlap = 50
batch_size = 4000
host = 'localhost'
port = 6333
embedding_model_path = '../wiki.hi.bin' #https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.hi.zip
# embedding_model_path = '../indicnlp.ft.hi.300.bin' #https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/embedding-v2/indicnlp.ft.hi.300.bin

To create the knowledge base for our chatbot, I'll be using the Hindi Aesthetic Corpus dataset. This dataset contains a large number of Hindi texts, spread across more than 1000 text files. You can replace this dataset with your own business-related data. It can be a collection of FAQs, product manuals, or any other information that you want your chatbot to have.

To start indexing, we first need to load the dataset, which here is the Hindi Aesthetic Corpus mentioned earlier. Once the dataset is loaded, we split the text into chunks using the RecursiveCharacterTextSplitter. Creating smaller chunks is essential because LLMs come with a limited context size. Smaller, more relevant context helps in two ways. First, the LLM only sees high-quality, relevant context when answering. Second, larger chunks mean more tokens to process, which increases both the total runtime and the cost.
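
To make the role of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding-window splitter. It is a simplification and not what `RecursiveCharacterTextSplitter` does internally (that class also tries to break on separators such as paragraph and line breaks). The function name `sliding_window_chunks` and the sample text are hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "अ" * 1200  # stand-in for a 1200-character Hindi document
chunks = sliding_window_chunks(sample, chunk_size=500, chunk_overlap=50)
print([len(c) for c in chunks])  # → [500, 500, 300]
```

Note that `len` counts Unicode code points, so a Hindi syllable written with a matra (for example "कि") counts as more than one character toward `chunk_size`.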

In [2]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the documents from the directory
loader = DirectoryLoader(data_path, loader_cls=TextLoader)

# Split the documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    length_function=len,
    is_separator_regex=False,
)
docs = loader.load_and_split(text_splitter=text_splitter)

Once we have converted the raw data into smaller chunks of text, we convert these chunks into embeddings. For this blog, we experimented with two embedding models: the Hindi model by FastText and IndicFT. IndicFT did not perform as well in our experiments, so we decided to go with the FastText model. The resulting embeddings are stored in a Qdrant vector database and are later used to retrieve the most relevant documents for a given query.
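
Retrieval by embedding comes down to ranking stored vectors by their similarity to the query vector. The sketch below illustrates the idea with cosine similarity (the distance configured for the Qdrant collection later in this post) on toy 3-dimensional vectors. The helper names and the toy data are hypothetical, and Qdrant performs this search for us at scale, so this is only an illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Return the k (id, score) pairs with the highest cosine similarity to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for chunk embeddings (hypothetical values)
toy_index = {
    "chunk_1": [1.0, 0.0, 0.0],
    "chunk_2": [0.7, 0.7, 0.0],
    "chunk_3": [0.0, 0.0, 1.0],
}
query_vector = [1.0, 0.1, 0.0]
results = top_k(query_vector, toy_index, k=2)
print([doc_id for doc_id, _ in results])  # → ['chunk_1', 'chunk_2']
```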

In [3]:
import fasttext as ft

# You will need to download these models from the URL mentioned below
embedding_model_path = '../wiki.hi.bin' #https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.hi.zip
# embedding_model_path = '../indicnlp.ft.hi.300.bin' #https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/embedding-v2/indicnlp.ft.hi.300.bin
embed_model = ft.load_model(embedding_model_path)



Now that the Hindi embedding model is loaded, let's generate the embeddings for each chunk.

In [4]:
import pandas as pd

# convert the documents to a dataframe
# This dataframe will be used to create the embeddings
# And later will be used to update the Qdrant Vector Database
data = []
for doc in docs:
    # Get the page content and metadata for each chunk
    # The metadata contains the chunk's source (file name)
    row_data = {
        "page_content": doc.page_content,
        "metadata": doc.metadata
    }
    data.append(row_data)

df = pd.DataFrame(data)

# Replace the new line characters with space
df['page_content'] = df['page_content'].replace('\\n', ' ', regex=True)

# Create a unique id for each document.
# This id will be used to update the Qdrant Vector Database
df['id'] = range(1, len(df) + 1)

# Create a payload column in the dataframe
# This payload column includes the page content and metadata
# This payload will be used when LLM needs to answer a query
df['payload'] = df[['page_content', 'metadata']].to_dict(orient='records')

# Create embeddings for each chunk
# These embeddings will be used for the similarity search against the user query
df['embeddings'] = df['page_content'].apply(lambda x: (embed_model.get_sentence_vector(x)).tolist())
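
For intuition about what `get_sentence_vector` returns: a rough mental model, as far as the fastText source indicates for unsupervised models like the ones used here, is that each word vector is divided by its L2 norm and the results are averaged. The sketch below mimics that with a hypothetical three-word vocabulary of 2-dimensional vectors. It does not call fastText and ignores subword handling, so treat it as an approximation of the idea rather than the library's implementation.

```python
import math

# Hypothetical toy word vectors; the real FastText Hindi vectors have 300 dimensions.
toy_vectors = {
    "नमस्ते": [3.0, 4.0],
    "दुनिया": [0.0, 2.0],
    "है": [1.0, 0.0],
}


def sentence_vector(sentence: str, word_vectors: dict[str, list[float]]) -> list[float]:
    """Average the L2-normalised vectors of the whitespace-separated words."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    count = 0
    for word in sentence.split():
        vec = word_vectors.get(word)
        if vec is None:
            continue  # toy behaviour: skip unknown words (fastText would fall back to subwords)
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0:
            total = [t + x / norm for t, x in zip(total, vec)]
            count += 1
    # Divide by the number of words that contributed; keep zeros for an empty sentence
    return [t / count for t in total] if count else total


print(sentence_vector("नमस्ते दुनिया", toy_vectors))
```

Because every word is normalised before averaging, long and short chunks end up on a comparable scale, which suits the cosine distance used for the similarity search.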

Great. Now that we have the embeddings, we need to store them in a vector database. We will use Qdrant for this purpose. Qdrant is an open-source vector database that lets you store and query high-dimensional vectors. The easiest way to get started with Qdrant is to run it with Docker. Follow the steps below to get the Qdrant database up and running:

```
# Run the following command in terminal to get the docker image of the qdrant
docker pull qdrant/qdrant


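# Run the following command in terminal to start the qdrant server
# Docker needs an absolute host path for the storage mount; $(pwd) assumes
# the project directory sits inside your current working directory
docker run -p 6333:6333 -v "$(pwd)/Hindi-Language-AI-Chatbot-for-Enterprises-using-Qdrant-MLFlow-and-LangChain:/qdrant/storage" qdrant/qdrant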
```

Now let's open a connection to the Qdrant database using the qdrant_client. We then need to create a new collection in the Qdrant database in which we will store the embeddings. Once this is done, we will insert the embeddings, along with the corresponding document IDs and payloads, into the collection. The document IDs will be used to identify the documents, the payloads will contain the actual text of the document and the embeddings will be used to retrieve the most relevant documents for a given query.

In [7]:
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, Batch

# Create a QdrantClient object
host = 'localhost'
port = 6333
client = QdrantClient(host=host, port=port)

# Create a fresh collection in Qdrant
# recreate_collection drops any existing collection with this name first
# The vector size of 300 matches the dimensionality of the FastText Hindi vectors
client.recreate_collection(
    collection_name="my_collection",
    vectors_config=VectorParams(size=300, distance=Distance.COSINE),
)

# Insert the embeddings into the Qdrant collection
# Since the data is large, we upload only the first batch of 4000 chunks here
batch_size = 4000
client.upsert(
    collection_name="my_collection",
    points=Batch(
        ids=df['id'].to_list()[:batch_size],
        payloads=df['payload'].to_list()[:batch_size],
        vectors=df['embeddings'].to_list()[:batch_size],
    ),
)

# Close the QdrantClient
client.close()
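
The cell above uploads only the first `batch_size` rows. If you want to index the whole corpus, one option is to loop over the data in slices and call `client.upsert` once per slice. The sketch below shows only the slicing logic on toy data. `iter_batches` is a hypothetical helper, and the commented-out call marks where the upsert from the cell above would go (with the client still open).

```python
from typing import Iterator, Sequence


def iter_batches(
    ids: Sequence, payloads: Sequence, vectors: Sequence, batch_size: int
) -> Iterator[tuple[list, list, list]]:
    """Yield aligned (ids, payloads, vectors) slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        yield list(ids[start:end]), list(payloads[start:end]), list(vectors[start:end])


# Toy data standing in for df['id'], df['payload'] and df['embeddings']
ids = list(range(1, 11))
payloads = [{"page_content": f"chunk {i}"} for i in ids]
vectors = [[float(i), 0.0] for i in ids]

for batch_ids, batch_payloads, batch_vectors in iter_batches(ids, payloads, vectors, batch_size=4):
    # In the real pipeline, the upload would happen here, for example:
    # client.upsert(
    #     collection_name="my_collection",
    #     points=Batch(ids=batch_ids, payloads=batch_payloads, vectors=batch_vectors),
    # )
    print(len(batch_ids), batch_ids[0], batch_ids[-1])
```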

After saving the embeddings in the Qdrant database, we can view the collection in the Qdrant dashboard. The dashboard shows that each chunk has three pieces of information: metadata, chunk text, and embeddings.