## Summary:

A **novel chatbot** driven by **Large Language Models (LLMs)** is developed in this project to improve the user experience when asking questions about the Italian population in 2023. The **Retrieval-Augmented Generation (RAG) pipeline** draws on the **complementary strengths** of two frameworks, **LangChain and LlamaIndex**. Although the two overlap in some functions, especially retrieval, each can focus on a different part of the pipeline when they are combined: LlamaIndex is used to build the search and retrieval layer, and LangChain is used to build the agent on top of it. To handle user queries, the pipeline uses **two open-source LLMs**: **Llama3 by Meta and Mixtral by Mistral**. These models let users ask questions in natural language and receive conversational answers grounded in the retrieved data.



In [None]:
# Install required libraries and frameworks for the project
! pip install langchain
! pip install llama_index
! pip install langchain_community
! pip install langchain-groq==0.1.6
! pip install llama-index-llms-groq
! pip install llama_index.embeddings.huggingface
! pip install llama_index.vector_stores.chroma
! pip install chromadb
! pip install datasets
! pip install ragas

In [None]:
import pandas as pd #Used for handling and manipulating tabular data (e.g., CSV files for dataset loading and preprocessing).
import chromadb #A library for working with Chroma, a vector database designed for managing embeddings and performing similarity search.
from pathlib import Path #Provides an object-oriented interface for handling file system paths.
from llama_index.core import Document #Represents a text document to be indexed. These documents are later processed into embeddings for similarity search.
from llama_index.core import Settings #Manages configuration settings for LlamaIndex components and operations.
from llama_index.core.node_parser import SentenceSplitter #Splits text into sentences or smaller units for processing or indexing.
from llama_index.core import VectorStoreIndex #A class for creating and managing vector indices, which are used to efficiently retrieve relevant documents.
from llama_index.core.storage.storage_context import StorageContext #Manages the storage and retrieval of index data, embeddings, and other related artifacts.
from llama_index.embeddings.huggingface import HuggingFaceEmbedding #Embedding generator using HuggingFace models, useful for encoding documents into vector representations.
from llama_index.vector_stores.chroma import ChromaVectorStore #A vector store backend that uses Chroma for embedding storage and similarity searches.
from llama_index.llms.groq import Groq #Integrates Groq-based LLMs for high-performance inference and language generation.
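from llama_index.llms.openai import OpenAI #OpenAI LLM wrapper, used later as the judge model in the RAGAS evaluation.
from datasets import Dataset #Hugging Face Dataset container used to assemble the evaluation data.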
from llama_index.core.tools import QueryEngineTool #Defines tools for querying and retrieving data from specific sources.
from llama_index.core.tools import ToolMetadata #Stores metadata about a tool, like its name and description.
from langchain.agents import AgentExecutor #Manages the execution flow of agents and their interactions with tools.
from langchain.agents import create_tool_calling_agent #Creates agents that can dynamically select and use tools.
from langchain_core.prompts import ChatPromptTemplate #Structures and formats prompts for conversational interactions with LLMs.
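import os #Gives access to environment variables, used below to store the OpenAI API key.
from getpass import getpass #Prompts for a secret without echoing it on screen.
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ") # Personal API key must be added
from langchain_groq import ChatGroq #LangChain chat-model wrapper for Groq-hosted LLMs.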
from ragas.testset.generator import TestsetGenerator  #Generates test sets for evaluating retrieval-augmented generation systems.
from ragas.integrations.llama_index import evaluate #Evaluates RAG systems built with LlamaIndex using metrics.

**Dataset preparation**

The following code processes a CSV file containing population data for locations in Italy, converting it into a structured collection of Document objects for use in a Retrieval-Augmented Generation (RAG) pipeline. It first reads the dataset using pandas and iterates through each row to generate descriptive text summaries for each location, including details like the type of place, its region and province, and population statistics (male, female, and total). The generated text is split into smaller sections based on a ##### separator, with empty sections removed. Each section is then transformed into a Document object, which serves as a retrievable unit for embedding and querying in a RAG system.

In [None]:
# Step 1: Load the population dataset
df = pd.read_csv('/content/popolazione_Italia_2023_Places_updated.csv')

# Step 2: Build one descriptive sentence per row, each terminated by the '#####' separator
text = ""
for ind in df.index:
    text += f"{df['Luogo'][ind]} is kind of {df['Type of place'][ind]} of the part {df['group_of_region'][ind]} of Italy in province {df['province'][ind]} and region of {df['region'][ind]} that has {df['Maschi'][ind]} male population and {df['Femmine'][ind]} female population and {df['Totale'][ind]} persons as total population#####"

# Step 3: Split the text into individual sections based on the '#####' separator
sections = text.split('#####')
# Remove any empty sections
sections = [section.strip() for section in sections if section.strip()]

# Step 4: Convert each section into a Document object
documents = [Document(text=section) for section in sections]
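
As a variation on the cell above, the sentences can be built one per row and kept in a list, which avoids joining everything with `#####` and splitting it again. The sketch below uses only the standard library and two rows copied from outputs shown later in this notebook; `row_to_sentence` is a hypothetical helper name, not part of pandas or LlamaIndex.

```python
from typing import Dict, List


def row_to_sentence(row: Dict[str, object]) -> str:
    """Render one dataset row with the same sentence template as the cell above."""
    return (
        f"{row['Luogo']} is kind of {row['Type of place']} "
        f"of the part {row['group_of_region']} of Italy "
        f"in province {row['province']} and region of {row['region']} "
        f"that has {row['Maschi']} male population "
        f"and {row['Femmine']} female population "
        f"and {row['Totale']} persons as total population"
    )


# Two rows taken from the retrieved contexts printed later in the notebook.
rows: List[Dict[str, object]] = [
    {"Luogo": "Prato", "Type of place": "Province", "group_of_region": "Centro",
     "province": "Prato", "region": "Toscana",
     "Maschi": 127190, "Femmine": 132054, "Totale": 259244},
    {"Luogo": "Assisi", "Type of place": "Cities", "group_of_region": "Centro",
     "province": "Perugia", "region": "Umbria",
     "Maschi": 13339, "Femmine": 14332, "Totale": 27671},
]

# One sentence per row; no separator is needed.
sentences = [row_to_sentence(r) for r in rows]
print(sentences[0])
```

With the real DataFrame, the rows could come from `df.to_dict("records")`, and each sentence could then be wrapped in `Document(text=...)` as in Step 4.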


The following cell sets up a persistent Chroma vector store for document retrieval. A Chroma client is initialized with its data stored in ./chroma_db_data, and a collection named reviewwwwiii is created to hold the embeddings. ChromaVectorStore wraps this collection, and StorageContext makes it available to the indexing step. The VectorStoreIndex itself is built from the documents in a later cell.

In [None]:
client = chromadb.PersistentClient(path="./chroma_db_data")
chroma_collection = client.create_collection(name="reviewwwwiii")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
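
To make "similarity search" concrete: a vector store ranks stored embeddings by how close they are to the query embedding. The standard-library sketch below ranks three made-up 3-dimensional vectors by cosine similarity, one common closeness measure. The vectors, their labels, and the `top_k` helper are illustrative assumptions; real all-MiniLM-L6-v2 embeddings have 384 dimensions, and Chroma relies on an approximate nearest-neighbour index rather than this brute-force loop.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: List[float], store: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k stored items most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings" (hypothetical values, for illustration only).
store = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [1.0, 1.0, 0.0],
}
query = [1.0, 0.1, 0.0]

ranked = top_k(query, store, k=2)
print(ranked)
```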

The next cell configures the LlamaIndex settings that control how documents are processed and which models are used. The SentenceSplitter splits text into chunks of at most 250 tokens with a 20-token overlap and uses [#####] as its separator. The Groq-hosted Mixtral-8x7b-32768 model is set as the default LLM; the second model, Llama3-8b-8192, can be used instead by changing the model name. The embedding model is HuggingFaceEmbedding with sentence-transformers/all-MiniLM-L6-v2. Setting these globally keeps text processing and model behavior consistent throughout the pipeline.

LLM deployment: I used Groq to serve the LLMs because it offers fast inference for large models and a simple API for calling hosted models such as Llama3 and Mixtral.

In [None]:
# One splitter, shared by the global settings and the indexing transformations.
splitter = SentenceSplitter(chunk_size=250, chunk_overlap=20, separator="[#####]")
Settings.text_splitter = splitter
# Default LLM served by Groq; replace the placeholder with a personal API key.
Settings.llm = Groq(model='Mixtral-8x7b-32768', api_key='Personal API key')
# Embedding model used to vectorize documents and queries.
Settings.embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
transformations = [splitter]
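
How `chunk_size` and `chunk_overlap` interact can be seen with a deliberately simplified sliding window over a list of words. This is not SentenceSplitter's actual algorithm, which counts tokenizer tokens and tries to respect sentence boundaries; `window_chunks` is a hypothetical helper used only to illustrate the overlap.

```python
from typing import List


def window_chunks(words: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split words into windows of chunk_size that share chunk_overlap items."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[List[str]] = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the input
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = window_chunks(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window starts with the last word of the previous one, which is the role `chunk_overlap=20` plays at token level. The sentences generated in this notebook are far shorter than 250 tokens, so each Document should normally end up as a single chunk.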


The next cell creates a vector index from the documents, applying the transformations and using the storage context defined above.

In [None]:
index = VectorStoreIndex.from_documents(documents, transformations=transformations, storage_context=storage_context)

In [None]:
query_engine = index.as_query_engine() #Creates a query engine from the index, enabling efficient search and retrieval operations.

The following cell defines a QueryEngineTool named population_lookup, which exposes the LlamaIndex query engine to LangChain for the Retrieval-Augmented Generation (RAG) use case. The query_engine comes from the VectorStoreIndex created earlier and retrieves population data for Italy. The ToolMetadata supplies the tool's name and a description that tells the agent what the tool covers and that its input should be a detailed plain-text question. With this in place, retrieval can be invoked from within the LangChain agent, combining the strengths of LlamaIndex and LangChain.

In [None]:
query_engine_tools = QueryEngineTool(
        query_engine = query_engine,
        metadata=ToolMetadata(
            name="population_lookup",
            description=(
                "Provides information about population in Italy for different areas. "
                "Use a detailed plain text question as input to the tool."
            ),
        ),
    )

The next line of code converts the QueryEngineTool (query_engine_tools) from LlamaIndex into a format compatible with LangChain using the to_langchain_tool() method. This conversion allows the tool to be seamlessly integrated into LangChain's agent framework, enabling the tool to be used within LangChain-based workflows while maintaining its underlying functionality for querying the LlamaIndex engine.

In [None]:
llamaindex_to_langchain_converted_tools = query_engine_tools.to_langchain_tool()

In [None]:
tools = llamaindex_to_langchain_converted_tools

The system_context defines the assistant's role, instructing it to recognize the type of location (such as provinces and cities) and respond based on the provided information. A ChatPromptTemplate is created to structure the conversation, combining system instructions, placeholders for chat history and agent scratchpad, and user input. The create_tool_calling_agent() function then creates an agent that can call the appropriate tools, like the population_lookup, to answer questions. Finally, an AgentExecutor is instantiated, which ties the agent and tools together, enabling the execution of the agent's tasks, managing errors, and ensuring smooth interaction with the user, including controlling iterations and verbosity.

In [None]:
system_context = "you are an assistant to answer the questions about population in different types of areas. recognize the type of location and answer the question based on its information. consider that each province chunk include cities chunks listed after until the next province."
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            system_context,
        ),
        ("placeholder", "{chat_history}"),
        ("human", "{input}"),
        ("placeholder", "{agent_scratchpad}"),
    ]
)

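# LangChain-side chat model for the agent (assumed here to be the same Groq-hosted Mixtral model as Settings.llm).
llm = ChatGroq(model="mixtral-8x7b-32768", api_key="Personal API key")

agent = create_tool_calling_agent(llm, [tools], prompt)

# Create an agent executor by passing in the agent and tools
agent_executor = AgentExecutor(agent=agent, tools=[tools], verbose=True, return_intermediate_steps=True, handle_parsing_errors=True, max_iterations=10)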
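
The role of `max_iterations` and of the intermediate steps can be pictured with a toy tool-calling loop. The sketch below is a plain-Python illustration under stated assumptions, not LangChain's implementation: `run_tool_loop`, `decide`, and the stop message are hypothetical, and the hard-coded `decide` function stands in for the LLM. The two city values are taken from the agent trace shown later in this notebook.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str]  # (tool_name, tool_input, observation)


def run_tool_loop(
    decide: Callable[[str, List[Step]], Tuple[str, ...]],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_iterations: int = 10,
) -> Dict[str, object]:
    """Call tools until decide() returns a final answer or the budget runs out."""
    scratchpad: List[Step] = []
    for _ in range(max_iterations):
        decision = decide(question, scratchpad)
        if decision[0] == "finish":
            return {"output": decision[1], "intermediate_steps": scratchpad}
        _, tool_name, tool_input = decision
        observation = tools[tool_name](tool_input)
        scratchpad.append((tool_name, tool_input, observation))
    return {"output": "Stopped: iteration limit reached.", "intermediate_steps": scratchpad}


# Female population values copied from the agent trace in this notebook.
FEMALE = {"Cortona": 10953, "Montevarchi": 12241}
tools = {"population_lookup": lambda city: str(FEMALE[city])}


def decide(question: str, scratchpad: List[Step]) -> Tuple[str, ...]:
    """Hard-coded stand-in for the LLM: look up each city, then add the results."""
    cities = ["Cortona", "Montevarchi"]
    if len(scratchpad) < len(cities):
        return ("call", "population_lookup", cities[len(scratchpad)])
    total = sum(int(obs) for _, _, obs in scratchpad)
    return ("finish", f"Combined female population: {total}")


result = run_tool_loop(decide, tools, "female population of two cities")
print(result["output"])
```

Lowering `max_iterations` below the number of lookups the plan needs makes the loop stop early, which is the safeguard the executor's iteration limit provides.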

The next cell repeats the preceding steps in a single cell. It converts the LlamaIndex QueryEngineTool into a LangChain-compatible tool with to_langchain_tool(), assigns the result to tools, creates the tool-calling agent from the language model (llm) and the structured prompt, and wraps it in an AgentExecutor with verbose output enabled, intermediate steps returned, parsing errors handled, and at most 10 iterations.

In [None]:
llamaindex_to_langchain_converted_tools = query_engine_tools.to_langchain_tool()
tools = llamaindex_to_langchain_converted_tools
agent = create_tool_calling_agent(llm, [tools], prompt)
agent_executor = AgentExecutor(agent=agent, tools=[tools], verbose=True, return_intermediate_steps=True,
                               handle_parsing_errors=True, max_iterations=10)


Next, an example query is passed to the agent executor and the agent's response is printed.

In [None]:
question =  "tell me about the female population of province Arezzo??"

response = agent_executor.invoke({"input": question})
#print("\nFinal Response:", response['output'])
print(response['output'])



> Entering new AgentExecutor chain...

Invoking: `population_lookup` with `{'input': 'What is the female population of province Arezzo?'}`


To determine the female population of the entire province of Arezzo, the specific information about the female population of each city within the province would be required. The context provided gives the female population of the city of Arezzo, which is 49928. However, without information about the female population of other cities in the province, it's not possible to give an accurate answer for the entire province.
Invoking: `population_lookup` with `{'input': 'What is the female population of the city of Cortona?'}`


The female population of Cortona is 10953.
Invoking: `population_lookup` with `{'input': 'What is the female population of the city of Montevarchi?'}`


The female population of Montevarchi is 12241.
Invoki

## RAGAS assessment:

RAGAS (Retrieval Augmented Generation Assessment) evaluates a RAG system along several dimensions: the retriever's ability to identify pertinent passages, the LLM's ability to use them correctly, and the overall quality of the generated output. Because the performance of each component strongly influences the overall experience, Ragas provides metrics designed to assess each element of the RAG pipeline independently.


**Metrics definition:**

These metrics fall into two groups, generation assessment and retrieval assessment, which are described below.


1. Generation metrics:

*   Faithfulness: This metric measures the factual consistency of the generated answer against the given context.
*   Answer Correctness: This metric measures the accuracy of the generated answer compared to the ground truth.
*   Answer Relevancy: This metric measures how pertinent the generated answer is to the question; incomplete or redundant answers score lower.

2. Retrieval metrics:

*   Context Recall: This metric measures how much the retrieved context aligns with the annotated answer, treated as the ground truth.
*   Context Precision: This metric measures whether the relevant chunks in the retrieved contexts are ranked above the irrelevant ones.
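
The ratio-style metrics above can be illustrated with small hand computations. The sketch below assumes the relevance and support judgments are already known (in Ragas they are produced by a judge LLM); `context_precision_at_k` follows the rank-weighted form documented by Ragas as far as I understand it, and both function names are hypothetical.

```python
from typing import List


def context_precision_at_k(relevance: List[int]) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant chunk.

    relevance[i] is 1 if the chunk at rank i + 1 is relevant, else 0.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / rank  # precision@rank, counted only at relevant ranks
    return weighted / total_relevant


def supported_ratio(supported: int, total: int) -> float:
    """Share of items that are supported, as in faithfulness or context recall."""
    return supported / total if total else 0.0


# Relevant chunks at ranks 1 and 3, an irrelevant one at rank 2.
print(context_precision_at_k([1, 0, 1]))
# Two of three claims in an answer are supported by the retrieved context.
print(supported_ratio(2, 3))
```

Ranking a relevant chunk lower reduces the score: `[0, 1]` gives 0.5 while `[1, 0]` gives 1.0, even though both lists contain one relevant chunk.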


In [None]:
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall, answer_correctness

In the context of RAGAS (Retrieval Augmented Generation Assessment), ground truth refers to the reference or correct answer against which the system's performance is evaluated. It serves as the benchmark for assessing the accuracy and relevance of the generated responses. For some RAGAS metrics, the model's output is compared to the ground truth to measure how well the system retrieves the correct information from external sources and uses it to generate an appropriate answer.

In [None]:
eval_questions = [
     "What is the total population of province Prato?",
     "where is Assisi and what is the male population of there?",
     "which city in Napoli province is the most populous?",
     "Compare the population of men and women in the city of Roma.",
     "Tell me about the difference in sex between the people who live in Cusano Milanino?",
     "What is the total population of Leini?",
     "What is the total and Male population of Novara province?",
     "What is the female population of Palmi?",
     "What is the percentage of the total population of Italy that resides in the region Lombardy",
     "What is the ratio of male to female population in the province Latina?",
     "In Sicilia region, how does the female population compare to the male population in terms of percentage",
     "tell me about the population of women in Belmonte Mezzagno?",
     "what is the exact female population of Tivoli?",
     "how many male populations do reside in Ercolano?",
     "Tell me about the total population of Bari?",
     "How many people live in the city of Castellaneta?",
     "Which region in the Nord-est group has the most evenly balanced gender ratio?",
     "What is the male population of the region Piemonte?",
     "what is population of the region Emilia-Romagna?",
     "How does the male population of Alseno city compare to the female population"
]

eval_answers = [
     "The total population in Prato is 259244",
     "Assisi as a city in province Perugia and region of Umbria in center of Italy has 13339 male population",
     "The city of Napoli with total population of 917510 is the most populous city in the province of Napoli",
     "According to the provided data, the population of men in the city of Roma is 1308818, and the population of women is 1446491.",
     "The male population of Cusano Milanino is 8991, while the female population is 9900. Thus, the difference between the male and female population is 909.",
     "The total population of Leini is 16294.",
     "the total population Novara province is 362502 and male population in Novara province is 176980",
     "The female population of Palmi is 8733",
     "About 17% of all the people who live in Italy live in the area of Lombardy.",
     "The male to female population ratio in province Latina is 98%",
     "In Sicilia, the female population is approximately 51.3%, while the male population is 48.7%",
     "in the city of Belmonte Mezzagno, the women population is 5530",
     "The female population of Tivoli is 28032",
     "the male population in Ercolano is 24407",
     "Bari has a total population of 1225048 as a province, and 316736 as a city.",
     "the total population of Castellaneta is 16220 people",
     "The most balanced gender ratio in the Nord-est group is found in Veneto",
     "The male population of the region Piemonte is 2,072,771",
     "The total population of the region Emilia-Romagna is 4435758",
     "the male population in Alseno city is 2315, and the female population is 2374"
]

examples = [
    {"query": q, "ground_truth": [eval_answers[i]]}
    for i, q in enumerate(eval_questions)
]
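
Because most ground-truth answers here hinge on exact figures, a simple numeric check can complement the LLM-judged metrics. The sketch below is an assumption-laden sanity check, not a RAGAS metric: `extract_numbers` and `numbers_match` are hypothetical helpers, and the three answer/ground-truth pairs are copied from the lists above and from the agent outputs shown later in this notebook.

```python
import re
from typing import List


def extract_numbers(text: str) -> List[int]:
    """Return all integers in text, treating 1,308,818 as a single number."""
    # Remove commas that sit between a digit and a group of exactly three digits.
    cleaned = re.sub(r"(?<=\d),(?=\d{3}\b)", "", text)
    return [int(n) for n in re.findall(r"\d+", cleaned)]


def numbers_match(answer: str, ground_truth: str) -> bool:
    """True if every number in the ground truth also appears in the answer."""
    return set(extract_numbers(ground_truth)) <= set(extract_numbers(answer))


# (agent answer, ground truth) pairs.
pairs = [
    ("The total population of Province Prato is 259244.",
     "The total population in Prato is 259244"),
    ("In the city of Roma, there are 1,308,818 men and 1,446,491 women.",
     "According to the provided data, the population of men in the city of Roma "
     "is 1308818, and the population of women is 1446491."),
    ("The female population of Palmi is 9217.",
     "The female population of Palmi is 8733"),
]

checks = [numbers_match(answer, truth) for answer, truth in pairs]
print(checks)
```

A ground truth without any digits passes trivially, so this check is only meaningful for questions with numeric answers.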

In [None]:
# using GPT 3.5 : LLM as a judge
evaluator_llm = OpenAI(model="gpt-3.5-turbo")

In [None]:
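import re

results = []   # final agent answers, one per question
contexts = []  # text extracted from the first tool call, one per question
for query in eval_questions:
    response = agent_executor.invoke({"input": query})
    results.append(response['output'])
    # String form of the first tool call and its result.
    sources = str(response["intermediate_steps"][0])

    # Pull out the first text='...' fragment; fall back to an empty string.
    match = re.search(r"text='(.*?)'", sources)
    extracted_text = match.group(0) if match else ""
    print(extracted_text)
    # Always append, so that contexts stays aligned with results.
    contexts.append(extracted_text)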
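
The behaviour of the regular expression used above is easy to check in isolation. In this sketch the sample strings are hypothetical stand-ins for `str(response["intermediate_steps"][0])`, whose exact layout is not shown in the notebook; only the `text='...'` fragment matters.

```python
import re

PATTERN = r"text='(.*?)'"

# Hypothetical stand-in for the string form of an intermediate step.
sample = "Node(text='Prato is kind of Province of the part Centro of Italy', score=0.5)"
match = re.search(PATTERN, sample)

with_prefix = match.group(0)  # whole match, including the text='...' wrapper
text_only = match.group(1)    # only the captured text between the quotes

# The non-greedy group stops at the first apostrophe it meets.
tricky = "Node(text='L'Aquila is kind of Province', score=0.5)"
truncated = re.search(PATTERN, tricky).group(1)

print(with_prefix)
print(text_only)
print(truncated)
```

`group(0)` explains why the stored contexts begin with `text='`; using `group(1)` would keep only the sentence. If the quoted text itself contains an apostrophe, the pattern captures only the part before it.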



> Entering new AgentExecutor chain...

Invoking: `population_lookup` with `{'input': 'What is the total population of province Prato?'}`


The total population of Province Prato can be found by adding the male and female populations together. According to the information provided, the male population is 127190 and the female population is 132054. Therefore, the total population of Province Prato is 127190 + 132054 = 259244.
The total population of Province Prato is 259244.

> Finished chain.
text='Prato is kind of Province of the part Centro of Italy in province Prato and region of Toscana that has 127190 male population and 132054 female population and 259244 persons as total population'


> Entering new AgentExecutor chain...

Invoking: `population_lookup` with `{'input': 'What is the male population of Assisi?'}`


The male population of Assisi is 13339.
Assisi i

In [None]:
results

['The total population of Province Prato is 259244.',
 'Assisi is a city in the province of Perugia, in the Umbria region of Italy. The male population of Assisi is 13339.',
 'The most populous city in Napoli province is Napoli itself, with a total population of 917510 people.',
 'In the city of Roma, there are 1,308,818 men and 1,446,491 women.',
 'There are 909 more females than males in Cusano Milanino, with 8991 male inhabitants and 9900 female inhabitants.',
 'The total population of Leini is 16294.',
 'The total population of Novara province is 362502, and the male population is 176980.',
 'The female population of Palmi is 9217.',
 "Now, let's calculate the percentage of the total population of Italy that resides in the region Lombardy. \n\nThe total population of Lombardy is 3,228,006, and the total population of Italy is 58,997,201. \n\nTo find the percentage, divide the population of Lombardy by the total population of Italy and multiply by 100: \n\n(3,228,006 / 58,997,201) *

In [None]:
contexts

["text='Prato is kind of Province of the part Centro of Italy in province Prato and region of Toscana that has 127190 male population and 132054 female population and 259244 persons as total population'",
 "text='Assisi is kind of Cities of the part Centro of Italy in province Perugia and region of Umbria that has 13339 male population and 14332 female population and 27671 persons as total population'",
 "text='Napoli is kind of Cities of the part Sud of Italy in province Napoli and region of Campania that has 440910 male population and 476600 female population and 917510 persons as total population'",
 "text='Roma is kind of Cities of the part Centro of Italy in province Roma and region of Lazio that has 1308818 male population and 1446491 female population and 2755309 persons as total population'",
 "text='Cusano Milanino is kind of Cities of the part Nord-ovest of Italy in province Milano and region of Lombardia that has 8991 male population and 9900 female population and 18891 pers

In [None]:
from datasets import Dataset

# Assemble the evaluation data: questions, agent answers, retrieved contexts, and ground-truth answers
d = {
    "question": eval_questions,
    "answer": results,
    "contexts": contexts,
    "ground_truth": eval_answers
}

# Create a Dataset object from the dictionary
dataset = Dataset.from_dict(d)

# Score the query engine defined above on the five RAGAS metrics
score = evaluate(dataset=dataset, query_engine=query_engine, metrics=[faithfulness, answer_relevancy, context_precision, context_recall, answer_correctness])

# Convert score to a pandas DataFrame
score_df = score.to_pandas()

# Save the DataFrame to CSV
score_df.to_csv("EvaluationScores.csv", encoding="utf-8", index=False)




In [None]:
score_df

| | question | contexts | answer | ground_truth | faithfulness | answer_relevancy | context_precision | context_recall | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What is the total population of province Prato? | [Prato is kind of Province of the part Centro ... | The total population of Province Prato can be ... | The total population in Prato is 259244 | 1.0 | 0.997537 | 1.0 | 1.0 | 0.530154 |
| 1 | where is Assisi and what is the male populatio... | [Assisi is kind of Cities of the part Centro o... | Assisi is located in the central part of Italy... | Assisi as a city in province Perugia and regio... | 1.0 | 0.908353 | 1.0 | 1.0 | 0.986377 |
| 2 | which city in Napoli province is the most popu... | [Napoli is kind of Cities of the part Sud of I... | The most populous city in the province of Napo... | The city of Napoli with total population of 91... | 0.5 | 0.973211 | 1.0 | 1.0 | 0.991314 |
| 3 | Compare the population of men and women in the... | [Roma is kind of Cities of the part Centro of ... | The population of men in the city of Roma is s... | According to the provided data, the population... | 0.666667 | 0.97122 | 1.0 | 1.0 | 0.541884 |
| 4 | Tell me about the difference in sex between th... | [Cusano Milanino is kind of Cities of the part... | In Cusano Milanino, there is a slightly higher... | The male population of Cusano Milanino is 8991... | 0.0 | 0.9274 | 1.0 | 1.0 | 0.491668 |
| 5 | What is the total population of Leini? | [Leini is kind of Cities of the part Nord-oves... | The total population of Leini can be found by ... | The total population of Leini is 16294. | 1.0 | 1.0 | 1.0 | 1.0 | 0.485421 |
| 6 | What is the total and Male population of Novar... | [Novara is kind of Province of the part Nord-o... | The total population of Novara province is 362... | the total population Novara province is 362502... | 1.0 | 0.989012 | 1.0 | 1.0 | 0.995627 |
| 7 | What is the female population of Palmi? | [Palmi is kind of Cities of the part Sud of It... | The female population of Palmi is 9217. | The female population of Palmi is 8733 | 1.0 | 1.0 | 1.0 | 0.0 | 0.24077 |
| 8 | What is the percentage of the total population... | [Milano is kind of Province of the part Nord-o... | To find the percentage of the total population... | About 17% of all the people who live in Italy ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.218156 |
| 9 | What is the ratio of male to female population... | [Malé is kind of Cities of the part Nord-est o... | Based on the given context, the query asks abo... | The male to female population ratio in provinc... | 0.666667 | 0.0 | 0.0 | 0.0 | 0.198112 |
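
One way to read the score table is to separate retrieval failures from generation failures. The sketch below copies three columns of the ten rows displayed above into plain Python and flags the rows whose context recall is zero; it covers only the displayed rows, not the full set of 20 questions, and the variable names are illustrative.

```python
from statistics import mean

# (row index, context_precision, context_recall, answer_correctness) from the table above.
rows = [
    (0, 1.0, 1.0, 0.530154),
    (1, 1.0, 1.0, 0.986377),
    (2, 1.0, 1.0, 0.991314),
    (3, 1.0, 1.0, 0.541884),
    (4, 1.0, 1.0, 0.491668),
    (5, 1.0, 1.0, 0.485421),
    (6, 1.0, 1.0, 0.995627),
    (7, 1.0, 0.0, 0.24077),
    (8, 0.0, 0.0, 0.218156),
    (9, 0.0, 0.0, 0.198112),
]

# Rows where the retrieved context did not cover the ground truth at all.
retrieval_failures = [idx for idx, _, recall, _ in rows if recall == 0.0]

# Compare answer correctness with and without a retrieval failure.
mean_failed = mean([corr for _, _, recall, corr in rows if recall == 0.0])
mean_ok = mean([corr for _, _, recall, corr in rows if recall > 0.0])

print(retrieval_failures)
print(round(mean_failed, 3), round(mean_ok, 3))
```

In the displayed rows, the questions with zero context recall also have the lowest answer correctness, which suggests looking at retrieval first for those questions. With the real DataFrame the same filter would be `score_df[score_df["context_recall"] == 0]`.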


In [None]:
score_df = score.to_pandas()

# Save the DataFrame to CSV
score_df.to_csv("EvaluationScores.csv", encoding="utf-8", index=False)

score_df[['faithfulness','answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness']].mean(axis=0)


| metric | mean |
|---|---|
| faithfulness | 0.808333 |
| answer_relevancy | 0.86877 |
| context_precision | 0.8 |
| context_recall | 0.7 |
| answer_correctness | 0.651867 |


In [None]:
score_df[['faithfulness','answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness']].mean(axis=0)

| metric | mean |
|---|---|
| faithfulness | 0.791667 |
| answer_relevancy | 0.869001 |
| context_precision | 0.75 |
| context_recall | 0.7 |
| answer_correctness | 0.654162 |
