Now let's build the next part of the chatbot. In this part, we will take an LLM served by Ollama and integrate it with the chatbot. More specifically, we will use the Llama3 model. At the time of writing, Llama 3 is Meta's latest and most advanced open-source large language model (LLM). It is the successor to Llama 2 and shows a significant improvement in performance across a variety of benchmarks and tasks. Llama 3 comes in two main versions: an 8 billion parameter model and a 70 billion parameter model. It supports a context length of up to 8,192 tokens (roughly 8K).

We will use MLflow to track all the configurations and results of the model. Let's first install Ollama, pull the llama3 model through Ollama, and install MLflow.

```
# install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# get the llama3 model
ollama pull llama3

# install MLflow
pip install mlflow
```

In [1]:

embed_model_path = '../wiki.hi.bin'
# embed_model_path = '../indicnlp.ft.hi.300.bin'

collection_name = 'my_collection'
limit = 1

model_name = 'llama3'
num_predict = 100
num_ctx = 3000
num_gpu = 2
temperature = 0.7
top_k = 50
top_p = 0.95

Now let's load the Qdrant client that will be used to retrieve the context for a given query. We will also start logging the configurations and results of the workflow with MLflow.

In [2]:
import mlflow
from qdrant_client import QdrantClient

mlflow_lgging = True

if mlflow_lgging:
    # set the experiment name in the mlflow
    mlflow.set_experiment("Hindi Chatbot")
    # start the mlflow run
    mlflow.start_run()

# load the Qdrant client from the same host and port
# this client will be used to interact with the Qdrant server
host = "localhost"
port = 6333
client = QdrantClient(host=host, port=port)

# log the parameters in the mlflow
if mlflow_lgging:
    mlflow.log_param("qdrant_host", host)
    mlflow.log_param("qdrant_port", port)

2024/04/30 18:48:45 INFO mlflow.tracking.fluent: Experiment with name 'Hindi Chatbot' does not exist. Creating a new experiment.


We also need to load the embedding model. It converts the query into an embedding that can be used for a similarity search in Qdrant. The ultimate goal is to retrieve the context for a given query based on how similar the query embedding is to the stored context embeddings.

In [3]:
import fasttext as ft

# You will need to download these models from the URLs in the trailing comments
# The variable name matches embed_model_path from the configuration cell above
embed_model_path = '../wiki.hi.bin'  # https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.hi.zip
# embed_model_path = '../indicnlp.ft.hi.300.bin'  # https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/embedding-v2/indicnlp.ft.hi.300.bin
embed_model = ft.load_model(embed_model_path)

if mlflow_lgging:
    mlflow.log_param("embed_model_path", embed_model_path)
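
To make the idea of "similarity between embeddings" concrete, here is a minimal, pure-Python sketch of a cosine-similarity search. It is only an illustration: Qdrant does this work on the server, the tiny 3-dimensional vectors stand in for the real FastText vectors, and the use of cosine distance is an assumption, since the collection's distance metric is not shown in this part. The names `cosine_similarity`, `top_matches`, and `stored_chunks` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)

def top_matches(query_vector, stored, limit=1):
    """Return the `limit` (score, text) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]

# Toy vectors standing in for the real sentence embeddings (assumption).
stored_chunks = [
    ("chunk about films", [0.9, 0.1, 0.0]),
    ("chunk about cricket", [0.0, 0.2, 0.9]),
]
query_vec = [1.0, 0.0, 0.0]

best = top_matches(query_vec, stored_chunks, limit=1)
print(best[0][1])  # chunk about films
```

The `limit` argument plays the same role as the `limit` value in the configuration cell: it caps how many chunks come back as context.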



LangChain does not, as far as I know, ship a built-in FastText embedding integration, so we cannot plug FastText straight into its stock retrievers. That is why we define a custom LangChain retriever class to fetch the context for a given query. The class has one method, _get_relevant_documents, which embeds the query with the FastText model, runs a similarity search in Qdrant, and returns the matching context.

In [4]:
from typing import List
from qdrant_client import QdrantClient
import fasttext as ft
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever

# Define a custom retriever class that uses Qdrant for document retrieval
# Since we're using FastText embeddings, we query Qdrant directly instead of relying on a built-in LangChain embedding integration
class QdrantRetriever(BaseRetriever):
    client: QdrantClient
    embed_model: ft.FastText._FastText
    collection_name: str
    limit: int

    def _get_relevant_documents(self, query: str, *, run_manager: CallbackManagerForRetrieverRun) -> List[Document]:
        """Converts query to a vector and retrieves relevant documents using Qdrant."""
        # Get the vector representation of the query using the FastText model
        query_vector = self.embed_model.get_sentence_vector(query).tolist()

        # Search for the most similar documents in the Qdrant collection
        # The search method returns a list of hits, where each hit contains the most similar document
        # we can limit the number of hits to return using the limit parameter
        search_results = self.client.search(
            collection_name=self.collection_name,
            query_vector=query_vector,
            limit=self.limit
        )
        # Finally, we convert the search results to a list of Document objects
        # that can be used by the pipeline
        return [Document(page_content=hit.payload['page_content']) for hit in search_results]

# use the QdrantRetriever class to create a retriever object
retriever = QdrantRetriever(
    client=client,
    embed_model=embed_model,
    collection_name=collection_name,
    limit=limit
)

if mlflow_lgging:
    mlflow.log_param("collection_name", collection_name)
    mlflow.log_param("limit", limit)

Now we need to load the Llama3 model. We will use the 8 billion parameter model. Instead of loading it through Hugging Face, we will use Ollama, which provides a simple way to run models without much hassle. The Ollama class takes a number of arguments; the most important ones here are num_predict (the maximum number of tokens to generate) and num_ctx (the size of the context window in tokens).

In [5]:
from langchain_community.llms.ollama import Ollama

# Create an Ollama object from the parameters defined in the configuration cell,
# so that the values logged to MLflow below always match what the model actually uses
# This loads the llama3 8B model without handling a tokenizer separately, as we would with Hugging Face
llm = Ollama(model=model_name, num_predict=num_predict, num_ctx=num_ctx, num_gpu=num_gpu, temperature=temperature, top_k=top_k, top_p=top_p)

if mlflow_lgging:
    mlflow.log_param("model_name", model_name)
    mlflow.log_param("num_predict", num_predict)
    mlflow.log_param("num_ctx", num_ctx)
    mlflow.log_param("num_gpu", num_gpu)
    mlflow.log_param("temperature", temperature)
    mlflow.log_param("top_k", top_k)
    mlflow.log_param("top_p", top_p)
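
The temperature, top_k, and top_p values above shape how the next token is sampled. The sketch below illustrates the general idea on a toy set of scores. It is a conceptual illustration only, not Ollama's actual implementation, and the function names and toy tokens are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [score / temperature for score in logits.values()]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {token: e / total for token, e in zip(logits, exps)}

def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose cumulative probability reaches top_p, and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Toy next-token scores (assumption, for illustration only).
toy_logits = {"yes": 2.0, "no": 1.0, "maybe": 0.5, "never": -1.0}
probs = softmax_with_temperature(toy_logits, temperature=0.7)
candidates = filter_top_k_top_p(probs, top_k=3, top_p=0.95)
print(candidates)
```

The model would then draw the next token at random from `candidates`. Lowering temperature, top_k, or top_p narrows that pool and makes the output more predictable.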

Great. So far we have set up the retriever, which fetches context from the database based on the similarity between the query embedding and the context embeddings. We have also loaded the Llama3 model. There is just one more thing left to do: create a chat template.

A chat template includes two types of prompts: system prompts and user prompts. System prompts control the behavior of the chatbot or LLM. A good system prompt is essential for getting responses that match your expectations, while a bad one can lead to poor or incorrect behavior. I spent some time optimizing the system prompt to get the best results.

User prompts are the questions or queries that the user wants to ask the chatbot. Just like the system prompt, a user prompt should be concise, informative, and to the point. Next, we create the chat template from these two prompts.

In [6]:
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    """<s>[INST] आप एक सम्मानीय सहायक हैं। आपका काम नीचे दिए गए संदर्भ से प्रश्नों का उत्तर देना है। आप केवल हिंदी भाषा में उत्तर दे सकते हैं। धन्यवाद।
    
    You are never ever going to generate response in English. You are always going to generate response in Hindi no matter what. You also need to keep your answer short and to the point.

    संदर्भ: {context} </s>
"""
) 

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

if mlflow_lgging:
    mlflow.log_param("system_prompt", system_prompt)

Now let's tie everything together and create a chain of actions. We first want to retrieve the relevant documents for the user's query. We then want to generate the response based on that context and the prompt. create_stuff_documents_chain and create_retrieval_chain are exactly what we need for this.

In [7]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# Create a chain that combines the retriever and the question-answer chain
# essentially, this chain retrieves relevant documents using the retriever,
# inserts them into the prompt as the context, and passes the result to the LLM
question_answer_chain = create_stuff_documents_chain(llm, prompt)
chain = create_retrieval_chain(retriever, question_answer_chain)
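
Conceptually, the combined chain does something like the following. This is a simplified sketch with stand-in functions, not LangChain's actual code: `fake_retriever` and `fake_llm` are hypothetical placeholders for the Qdrant retriever and the Ollama model, and joining the documents with a blank line reflects what I understand to be the default separator of the "stuff" chain.

```python
def fake_retriever(query):
    """Stand-in for QdrantRetriever: returns a list of text chunks."""
    return ["First retrieved chunk.", "Second retrieved chunk."]

def fake_llm(prompt_text):
    """Stand-in for the Ollama LLM: reports the prompt size instead of generating."""
    return f"(model reply to a {len(prompt_text)}-character prompt)"

def stuff_documents(docs, separator="\n\n"):
    """'Stuff' all retrieved chunks into one context string."""
    return separator.join(docs)

def run_retrieval_chain(query, system_template):
    docs = fake_retriever(query)                      # 1. retrieve
    context = stuff_documents(docs)                   # 2. build the context
    prompt_text = system_template.format(context=context) + "\n" + query  # 3. fill the prompt
    answer = fake_llm(prompt_text)                    # 4. generate
    # Same keys as the real chain's output: input, context, answer
    return {"input": query, "context": docs, "answer": answer}

result = run_retrieval_chain("example question", "Answer from this context: {context}")
print(sorted(result))  # ['answer', 'context', 'input']
```

The returned dictionary mirrors the shape of the real chain's response, which is why the answer is later read from the 'answer' key.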

Finally, we have built the chatbot using the Llama3 model. Let's now test it and see how it performs.

In [11]:
query = 'किस तरह के किरदार और कहानी तत्व रचनाकारों और फिल्म निर्माताओं को आकर्षित करते हैं?'

if mlflow_lgging:
    mlflow.log_param("query", query)

response = chain.invoke({"input": query})

if mlflow_lgging:
    mlflow.log_param("context", response['context'])
    mlflow.log_param("response", response['answer'])

print(response)

# end the logging of the mlflow
mlflow.end_run()

{'input': 'किस तरह के किरदार और कहानी तत्व रचनाकारों और फिल्म निर्माताओं को आकर्षित करते हैं?', 'context': [Document(page_content='अक्सर रचनाकारों और फिल्म निर्माताओं को ऐसी कहानियाँ आकर्षित करती रही हैं  जिनके जांबाज नायक नामी हैं और जीवित हैं  शहीदों से लेकर डाकुओं तक के जीवन ने कई फार्मूला फिल्म निर्देशकों से लेकर कला निर्देशकों तक को प्रेरित किया है  जब मैंने सुना कि राजस्थान के छोटे से गाँव भटेरी में महिला विकास कार्यक्रम में काम करने वाली  साथिन  भँवरी देवी के जीवन पर फिल्म का निर्माण हो रहा है  तो मेरे लिए यह आश्चर्य का विषय नहीं था')], 'answer': 'सामान्य तौर पर, रचनाकारों और फिल्म निर्माताओं को ऐसे किरदार और कहानी तत्व आकर्षित करते हैं जिनके साथ सम्बंधित लोग हों, या जिनके साथ उनका अपना अनुभव हो। इसके अलावा, रचनाकारों और फिल्म निर्माताओं को ऐसे किरदार'}



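Note that the answer printed above stops mid-sentence, most likely because num_predict caps generation at 100 tokens. Raising num_predict should allow longer, complete answers.

We have also been logging the parameters of the chatbot in MLflow. Let's check that out and see how it looks.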

```
# launch the MLflow dashboard
mlflow ui --port 5000
```

Great. In this blog we saw how to use LangChain, Ollama, Qdrant, MLflow, and the Llama3 model to build a Hindi-language chatbot. We also saw how to track the chatbot's parameters and results with MLflow. As a bonus, let's also build a Gradio UI for the chatbot.

In [None]:
import gradio as gr

def answer_question(query, history):
    response = chain.invoke({"input": query})
    return response['answer']

gr.ChatInterface(answer_question).launch(share=True)

That's it for this blog. I hope you enjoyed it.

# Second Approach - Simple and Straightforward

In [None]:
# from qdrant_client import QdrantClient

# client = QdrantClient(host="localhost", port=6333)

# import fasttext as ft
# # Loading the model for Hindi.
# embed_model = ft.load_model('wiki.hi.bin')

# query = 'किस तरह के किरदार और कहानी तत्व रचनाकारों और फिल्म निर्माताओं को आकर्षित करते हैं?'

# hits = client.search(
# collection_name="my_collection",
# query_vector= embed_model.get_sentence_vector(query).tolist(),
# limit=1,
# )


# context = ''
# for hit in hits:
#     context += hit.payload['page_content'] + '\n'


# prompt = f"""<s>[INST] आप एक सम्मानीय सहायक हैं। आपका काम नीचे दिए गए संदर्भ से प्रश्नों का उत्तर देना है। आप केवल हिंदी भाषा में उत्तर दे सकते हैं। धन्यवाद।
#     संदर्भ: {context}
#     प्रश्न: {query} [/INST] </s>
# """


# from langchain_community.llms.ollama import Ollama
# llm=Ollama(model='llama3', num_predict=100, num_ctx=3000, num_gpu=2, temperature=0.7, top_k=50, top_p=0.95)

# llm.invoke(prompt)