# Summary

I built a chatbot as an **AI-based conversational tool** to improve the user experience in a research project for ISTAT on the Italian population in 2023. The chatbot follows the **Retrieval Augmented Generation (RAG)** approach: **two open-source Large Language Models (LLMs), Llama3 from Meta and Mixtral from Mistral**, interpret user queries and generate human-like responses from the retrieved information. The **LlamaIndex orchestration framework** ties the pipeline together.

In [None]:
# Install required libraries and frameworks for the project
! pip install llama-index-llms-groq
! pip install llama-index
! pip install llama-index-embeddings-huggingface
! pip install llama-index-vector-stores-chroma
! pip install chromadb
! pip install datasets
! pip install ragas

In [None]:
# Import required modules and classes

import pandas as pd #Used for handling and manipulating tabular data (e.g., CSV files for dataset loading and preprocessing).
import chromadb #A library for working with Chroma, a vector database designed for managing embeddings and performing similarity search.
from pathlib import Path #Provides an object-oriented interface for handling file system paths.
from llama_index.core import Document #Represents a text document to be indexed. These documents are later processed into embeddings for similarity search.
from llama_index.core import VectorStoreIndex #A class for creating and managing vector indices, which are used to efficiently retrieve relevant documents.
from llama_index.core import ServiceContext #Provides a centralized way to configure and manage services like LLMs and embedding models.
from llama_index.core.prompts.prompts import SimpleInputPrompt #Allows for the creation of custom input prompts for guiding the behavior of LLMs.
from llama_index.core.storage.storage_context import StorageContext #Manages the storage and retrieval of index data, embeddings, and other related artifacts.
from llama_index.embeddings.huggingface import HuggingFaceEmbedding #Embedding generator using HuggingFace models, useful for encoding documents into vector representations.
from llama_index.vector_stores.chroma import ChromaVectorStore #A vector store backend that uses Chroma for embedding storage and similarity searches.
from llama_index.llms.groq import Groq #Integrates Groq-based LLMs for high-performance inference and language generation.
from llama_index.llms.openai import OpenAI
from datasets import Dataset

import os #Used to set the environment variable that holds the OpenAI API key.
from getpass import getpass #Prompts for a secret without echoing it to the screen.
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ") # Enter a personal API key when prompted.

from ragas.testset.generator import TestsetGenerator #Generates test sets for evaluating retrieval-augmented generation systems.
from ragas.testset.evolutions import simple, reasoning, multi_context #Provides tools for evolving test sets with additional complexity (e.g., multi-context, reasoning).
from ragas.integrations.llama_index import evaluate #Evaluates RAG systems built with LlamaIndex using metrics.


**Dataset preparation**

The following code processes a CSV file containing population data for locations in Italy, converting it into a structured collection of Document objects for use in a Retrieval-Augmented Generation (RAG) pipeline. It first reads the dataset using pandas and iterates through each row to generate descriptive text summaries for each location, including details like the type of place, its region and province, and population statistics (male, female, and total). The generated text is split into smaller sections based on a ##### separator, with empty sections removed. Each section is then transformed into a Document object, which serves as a retrievable unit for embedding and querying in a RAG system.

In [None]:
# Step 1: Load the population dataset
df = pd.read_csv('/content/popolazione_Italia_2023_Places_updated.csv')

# Step 2: Build one descriptive sentence per row, ending each with the '#####' separator
text = ""
for ind in df.index:
    text += f"{df['Luogo'][ind]} is kind of {df['Type of place'][ind]} of the part {df['group_of_region'][ind]} of Italy in province {df['province'][ind]} and region of {df['region'][ind]} that has {df['Maschi'][ind]} male population and {df['Femmine'][ind]} female population and {df['Totale'][ind]} persons as total population#####"

# Step 3: Split the text into individual sections based on the '#####' separator
sections = text.split('#####')
# Remove any empty sections
sections = [section.strip() for section in sections if section.strip()]

# Step 4: Convert each section into a Document object
documents = [Document(text=section) for section in sections]
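
The row-to-text step above can also be sketched with the standard library alone, building one sentence per row directly instead of concatenating and re-splitting. This is a minimal sketch, assuming a CSV with the same column names as the notebook; the two sample rows contain made-up placeholder values (not ISTAT figures), and `row_to_sentence` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical sample using the notebook's column names; the values are placeholders.
SAMPLE_CSV = """Luogo,Type of place,group_of_region,province,region,Maschi,Femmine,Totale
Alpha,city,Nord-est,ProvA,RegA,10,12,22
Beta,province,Sud,ProvB,RegB,30,31,61
"""


def row_to_sentence(row):
    """Turn one CSV row (a dict keyed by column name) into a descriptive sentence."""
    return (
        f"{row['Luogo']} is a {row['Type of place']} in the {row['group_of_region']} "
        f"part of Italy, in the province of {row['province']} and the region of "
        f"{row['region']}. It has {row['Maschi']} male residents, "
        f"{row['Femmine']} female residents, and {row['Totale']} residents in total."
    )


# One sentence per row; each sentence would become the text of one Document.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
sentences = [row_to_sentence(row) for row in rows]

for sentence in sentences:
    print(sentence)
```

Keeping one sentence per row means each location ends up in its own retrievable unit, which is the same outcome the separator-based split produces.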


The following cell sets up a persistent Chroma vector store for document retrieval. A Chroma database stored in ./chroma_db_data is initialized, and a collection named revieww is created (or reopened if it already exists) to store the embeddings. The ChromaVectorStore wraps this collection, and the StorageContext makes it available to the indexing step. The VectorStoreIndex itself is built further below, once the text splitter and the models have been configured.

In [None]:
client = chromadb.PersistentClient(path="./chroma_db_data")
# get_or_create_collection avoids an error when the cell is re-run against an existing database
chroma_collection = client.get_or_create_collection(name="revieww")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

The next cell configures the settings for the LlamaIndex pipeline, specifying how documents are processed and which models are used. The SentenceSplitter is configured to split text into chunks of at most 250 tokens with a 20-token overlap, using [#####] as the separator. The Groq-hosted LLM (Mixtral-8x7b-32768) is set as the default language model, with a second LLM (Llama3-8b-8192) provided as a commented-out alternative. The embedding model is HuggingFaceEmbedding with sentence-transformers/all-MiniLM-L6-v2, which creates the vector representations. Setting these globally keeps text processing and model behavior consistent throughout the pipeline.

LLM deployment: I used Groq to serve the LLMs because it provides fast hosted inference for large models, and its API integrates directly with LlamaIndex, which makes it straightforward to switch between models such as Llama3 and Mixtral.

In [None]:
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter

Settings.text_splitter = SentenceSplitter(chunk_size=250, chunk_overlap=20, separator="[#####]")
Settings.llm = Groq(model='mixtral-8x7b-32768', api_key='') # a personal Groq API key must be added
# Settings.llm = Groq(model='llama3-8b-8192', api_key='') # the second LLM
Settings.embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
transformations = [Settings.text_splitter] # reuse the splitter configured above for indexing
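
To illustrate what a chunk size and a chunk overlap mean, here is a minimal sliding-window sketch. It is not SentenceSplitter's actual algorithm, which counts tokens with a tokenizer and prefers sentence boundaries; this sketch assumes the text has already been tokenized into a list, and `chunk_tokens` is a hypothetical helper name.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


# Ten dummy tokens, windows of four tokens, one token shared between neighbours.
tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)

for chunk in chunks:
    print(chunk)
```

Each window starts with the last token of the previous one, so content that straddles a boundary still appears intact in at least one chunk.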


This cell creates a VectorStoreIndex from the provided documents, enabling efficient retrieval in the RAG pipeline. It applies the specified transformations to preprocess the documents before embedding and stores the resulting vectorized data in the storage_context, which links to the configured Chroma vector store. This setup allows the system to efficiently search and retrieve relevant documents based on similarity to a query.

In [None]:
index = VectorStoreIndex.from_documents(documents, transformations=transformations, storage_context=storage_context)
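
The idea behind similarity-based retrieval can be sketched with toy vectors. This is an illustration only: cosine similarity is one common choice, while the distance function actually used depends on how the Chroma collection is configured, and real embeddings from all-MiniLM-L6-v2 have 384 dimensions rather than three. The names `cosine_similarity`, `top_k`, and the document ids are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy three-dimensional "embeddings" standing in for real document vectors.
doc_vecs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
best = top_k([1.0, 0.0, 0.0], doc_vecs, k=2)
print(best)
```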

The next cell converts the VectorStoreIndex into a query engine using index.as_query_engine(). It queries the indexed data for information about Bari's total population, retrieving relevant document chunks. Finally, the result is converted to a string and printed, providing an answer based on the indexed content.

In [None]:
query_engine = index.as_query_engine()
response=query_engine.query("Tell me about the total population of Bari?")
print(str(response))

An example:

In [None]:
response=query_engine.query("Tell me about the total population of Bari?")
print(str(response))

The total population of the province of Bari is 1,225,048.


## RAGAS assessment:

RAGAS (Retrieval Augmented Generation Assessment) evaluates a RAG system along several dimensions: the retrieval system's ability to identify pertinent passages, the LLM's ability to use them correctly, and the overall quality of the generated output. Because each component of the RAG pipeline affects the overall experience, Ragas provides metrics that assess each one independently.


**Metrics definition:**

These metrics fall into two groups, generation assessment and retrieval assessment, described below.


1. Generation metrics:

*   Faithfulness: This metric measures the factual consistency of the generated answer against the given context.
*   Answer Correctness: This metric measures the accuracy of the generated answer compared to the ground truth.
*   Answer Relevancy: This metric measures how pertinent the generated answer is to the question; incomplete or redundant answers score lower.

2. Retrieval metrics:

*   Context Recall: This metric measures how much of the annotated answer, treated as the ground truth, can be attributed to the retrieved context.
*   Context Precision: This metric measures whether the relevant chunks in the retrieved context are ranked ahead of the irrelevant ones.
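
As a rough illustration of how three of these metrics reduce to simple ratios, the sketch below computes them from binary verdicts. In Ragas the verdicts come from the judge LLM; here they are hard-coded hypothetical values, and the function names are assumptions rather than the Ragas API. The formulas follow the ratio definitions in the Ragas documentation as I understand them.

```python
def faithfulness_score(claim_supported):
    """Fraction of claims in the answer that the retrieved context supports."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0


def context_recall_score(gt_sentence_attributed):
    """Fraction of ground-truth sentences that can be attributed to the retrieved context."""
    if not gt_sentence_attributed:
        return 0.0
    return sum(gt_sentence_attributed) / len(gt_sentence_attributed)


def context_precision_score(chunk_relevant):
    """Mean of precision@k taken at each rank k that holds a relevant chunk."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision among the top `rank` chunks
    return precision_sum / hits if hits else 0.0


# Hypothetical verdicts: three of four claims supported; chunks ranked 1 and 3 relevant.
print(faithfulness_score([True, True, False, True]))
print(context_recall_score([True, False]))
print(context_precision_score([True, False, True]))
```

Context precision rewards rankings that place relevant chunks first: the same two relevant chunks score higher at ranks 1 and 2 than at ranks 1 and 3.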


In [None]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from ragas.integrations.llama_index import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall, context_entity_recall, answer_similarity, answer_correctness

In RAGAS, ground truth refers to the reference (correct) answer against which the system's performance is evaluated. It serves as the benchmark for assessing the accuracy and relevance of the generated responses. For some metrics, the model's output is compared with the ground truth to measure how well the system retrieves the correct information from external sources and uses it to generate an appropriate answer.
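
Because most ground-truth answers in this project are population figures, a crude numeric comparison can serve as a quick sanity check alongside the LLM-judged metrics. The sketch below is not a RAGAS metric; `extract_numbers` and `numbers_match` are hypothetical helpers, and the approach ignores decimals and percentages (a value such as 51.3 would be read as two separate integers).

```python
import re


def extract_numbers(text):
    """Return the set of integers found in text, ignoring thousands separators."""
    # Remove commas that sit between a digit and a group of exactly three digits.
    cleaned = re.sub(r"(?<=\d),(?=\d{3}\b)", "", text)
    return {int(match) for match in re.findall(r"\d+", cleaned)}


def numbers_match(answer, ground_truth):
    """True when every number in the ground truth also appears in the answer."""
    expected = extract_numbers(ground_truth)
    return bool(expected) and expected <= extract_numbers(answer)


ground_truth = "The total population of Leini is 16294."
print(numbers_match("Leini has 16,294 residents.", ground_truth))
print(numbers_match("Leini has about 16,000 residents.", ground_truth))
```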

In [None]:
eval_questions = [
     "What is the total population of province Prato?",
     "where is Assisi and what is the male population of there?",
     "which city in Napoli province is the most populous?",
     "Compare the population of men and women in the city of Roma.",
     "Tell me about the difference in sex between the people who live in Cusano Milanino?",
     "What is the total population of Leini?",
     "What is the total and Male population of Novara province?",
     "What is the female population of Palmi?",
     "What is the percentage of the total population of Italy that resides in the region Lombardy",
     "What is the ratio of male to female population in the province Latina?",
     "In Sicilia region, how does the female population compare to the male population in terms of percentage",
     "tell me about the population of women in Belmonte Mezzagno?",
     "what is the exact female population of Tivoli?",
     "how many male populations do reside in Ercolano?",
     "Tell me about the total population of Bari?",
     "How many people live in the city of Castellaneta?",
     "Which region in the Nord-est group has the most evenly balanced gender ratio?",
     "What is the male population of the region Piemonte?",
     "what is population of the region Emilia-Romagna?",
     "How does the male population of Alseno city compare to the female population"
]

eval_answers = [
     "The total population in Prato is 259244",
     "Assisi as a city in province Perugia and region of Umbria in center of Italy has 13339 male population",
     "The city of Napoli with total population of 917510 is the most populous city in the province of Napoli",
     "According to the provided data, the population of men in the city of Roma is 1308818, and the population of women is 1446491.",
     "The male population of Cusano Milanino is 8991, while the female population is 9900. Thus, the difference between the male and female population is 909.",
     "The total population of Leini is 16294.",
     "the total population Novara province is 362502 and male population in Novara province is 176980",
     "The female population of Palmi is 8733",
     "About 17% of all the people who live in Italy live in the area of Lombardy.",
     "The male to female population ratio in province Latina is 98%",
     "In Sicilia, the female population is approximately 51.3%, while the male population is 48.7%",
     "in the city of Belmonte Mezzagno, the women population is 5530",
     "The female population of Tivoli is 28032",
     "the male population in Ercolano is 24407",
     "Bari has a total population of 1225048 as a province, and 316736 as a city.",
     "the total population of Castellaneta is 16220 people",
     "The most balanced gender ratio in the Nord-est group is found in Veneto",
     "The male population of the region Piemonte is 2,072,771",
     "The total population of the region Emilia-Romagna is 4435758",
     "the male population in Alseno city is 2315, and the female population is 2374"
]

examples = [
    {"query": q, "ground_truth": [eval_answers[i]]}
    for i, q in enumerate(eval_questions)
]

In [None]:
# Use GPT-3.5 as the LLM judge
evaluator_llm = OpenAI(model="gpt-3.5-turbo")

In [None]:
def process_queries(queries):
    results = []
    for query in queries:
        # Call query_engine.query() for each query
        response = query_engine.query(query)
        results.append(response)
    return results
query_results = process_queries(eval_questions)
query_results

In [None]:
results = []
contexts = []
for question in eval_questions:
    result = query_engine.query(question)
    results.append(result.response)
    context_list = [source_node.text for source_node in result.source_nodes]
    contexts.append(context_list)

In [None]:
d = {
    "question": eval_questions,
    "answer": results,
    "contexts": contexts,
    "ground_truth": eval_answers
}

dataset = Dataset.from_dict(d)


# Score the query engine on the evaluation dataset, using the GPT-3.5 judge defined above
score = evaluate(
    query_engine=query_engine,
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall, answer_correctness],
    llm=evaluator_llm,
)

# Convert to pandas DataFrame and save to CSV
score_df = score.to_pandas()
score_df.to_csv("EvaluationScores.csv", encoding="utf-8", index=False)

# Print the types of objects returned by the query engine for debugging
print([type(r) for r in query_results])

In [None]:
score_df[['faithfulness','answer_relevancy', 'context_precision', 'context_recall','answer_correctness']].mean(axis=0)

| Metric | Mean score |
|---|---|
| faithfulness | 0.908333 |
| answer_relevancy | 0.822824 |
| context_precision | 0.85 |
| context_recall | 0.675 |
| answer_correctness | 0.659971 |
