**Summary**

This part combines the LangChain framework with the PandasAI library to build a RAG-style
question-answering system over tabular data. PandasAI has built-in support for LangChain
models: it detects an LLM created through LangChain and wraps it as a PandasAI-compatible
LLM. This integration makes it possible to query the contents of data frames in natural
language.

In [None]:
# Install required libraries and frameworks for the project
! pip install langchain==0.4.0
! pip install langchain_community
! pip install pandasai
! pip install langchain_groq==0.1.6
! pip install ragas
! pip install langchain_openai

This approach creates a LangChain chat model with the ChatGroq class, and PandasAI uses that LLM to answer user questions. The data frame is then wrapped in a SmartDataframe so that it can be queried in natural language and the answers are produced from the underlying data.

In [None]:
import pandas as pd
from pandasai import SmartDataframe
from pandasai import Agent
from langchain_groq.chat_models import ChatGroq
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = "Personal API key"
from ragas.integrations.langchain import EvaluatorChain
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall, answer_correctness
from datasets import Dataset

In [None]:
llm = ChatGroq(temperature = 0.2 , model_name='llama3-70b-8192', groq_api_key='Personal API key')
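The cells above place the API keys directly in the code as placeholder strings, while `getpass` is imported but never used. A minimal sketch of reading a key from the environment and prompting only when it is missing could look like this. The helper name `load_api_key` and the variable name `GROQ_API_KEY` are assumptions made for illustration, not part of the original notebook.

```python
import os
from getpass import getpass


def load_api_key(env_var, prompt=getpass):
    """Return the key stored in env_var, asking the user only if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed value so it does not end up in the cell output
        key = prompt(f"Enter value for {env_var}: ")
        os.environ[env_var] = key
    return key


# Hypothetical usage in this notebook (variable name is an assumption):
# groq_api_key = load_api_key("GROQ_API_KEY")
# llm = ChatGroq(temperature=0.2, model_name='llama3-70b-8192', groq_api_key=groq_api_key)
```

The `prompt` parameter exists only so the helper can be exercised without an interactive prompt.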

In [None]:
data = pd.read_csv('/content/popolazione_Italia_2023_Places 2.csv')
data.head()

Unnamed: 0,Type of place,Codice,Luogo,Codice_Luogo,Maschi,Femmine,Totale
0,Country,IT,Italia,[IT] Italia,28814832,30182369,58997201
1,Group of regions,ITCD,Nord,[ITCD] Nord,13429002,13988146,27417148
2,Group of regions,ITC,Nord-ovest,[ITC] Nord-ovest,7759911,8098715,15858626
3,Region,ITC1,Piemonte,[ITC1] Piemonte,2072771,2178580,4251351
4,Province,ITC11,Torino,[ITC11] Torino,1069885,1134747,2204632


This line creates a SmartDataframe from the loaded data and configures it to use the specified LLM. A SmartDataframe lets the LLM answer natural-language questions about the contents of the data frame.

In [None]:
df = SmartDataframe(data, config = {'llm':llm})

Some examples

In [None]:
response = df.chat("Tell me about the female population of Bologna city and tell from which row you took this answer?")
print(response)

The female population of Bologna city is 522509. I took this answer from row where Luogo is 'Bologna'.


In [None]:
response = df.chat('What is the total population of Bari?')
print(response)

The total population of province Prato is 259244 and the row number used for finding the answer is 4725.


**Assessment**

I ran the RAGAS assessment of the LangChain and PandasAI integration manually, because the RAGAS framework does not provide a built-in integration for PandasAI. This meant building the contexts myself by locating the source row of each answer, so that the relevant information was structured properly for evaluation.

In [None]:
faithfulness_chain = EvaluatorChain(metric=faithfulness)
answer_rel_chain = EvaluatorChain(metric=answer_relevancy)
context_rel_chain = EvaluatorChain(metric=context_precision)
context_recall_chain = EvaluatorChain(metric=context_recall)

In [None]:
eval_questions = [

     "which city in Napoli province is the most populous?",
     "Tell me about the difference in sex between the people who live in Cusano Milanino?",
     "What is the total and Male population of Novara province?",
     "What is the female population of Palmi?",
     "In Sicilia region, how does the female population compare to the male population in terms of percentage",
     "tell me about the population of women in Belmonte Mezzagno?",
     "what is the exact female population of Tivoli?",
     "How many people live in the city of Castellaneta?",
     "Which region in the Nord-est group has the most evenly balanced gender ratio?",
     "How does the male population of Alseno city compare to the female population"
]

eval_answers = [
     "The city of Napoli with total population of 917510 is the most populous city in the province of Napoli",
     "The male population of Cusano Milanino is 8991, while the female population is 9900. Thus, the difference between the male and female population is 909.",
     "the total population Novara province is 362502 and male population in Novara province is 176980",
     "The female population of Palmi is 9217",
     "In Sicilia, the female population is approximately 51.3%, while the male population is 48.7%",
     "in the city of Belmonte Mezzagno, the women population is 5530",
     "The female population of Tivoli is 28032",
     "the total population of Castellaneta is 16220 people",
     "The most balanced gender ratio in the Nord-est group is found in Veneto",
     "the male population in Alseno city is 2315, and the female population is 2374"
]

examples = [
    {"query": q, "ground_truth": [eval_answers[i]]}
    for i, q in enumerate(eval_questions)
]
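The Sicilia ground truth expresses each sex as a share of the total population, whereas a model may instead report females as a percentage of males. The short check below works through both readings using the Sicilia figures printed in the contexts output of this notebook. It is only an arithmetic illustration; the variable names are chosen here.

```python
# Sicilia figures as printed in the contexts output of this notebook
maschi = 2345397
femmine = 2468619
totale = 4814016

# Reading 1: each sex as a share of the total population
female_share = 100 * femmine / totale
male_share = 100 * maschi / totale

# Reading 2: females expressed as a percentage of males
female_vs_male = 100 * femmine / maschi

print(f"female share of total: {female_share:.1f}%")
print(f"male share of total:   {male_share:.1f}%")
print(f"females as % of males: {female_vs_male:.2f}%")
```

Both readings are numerically valid, so a mismatch between them reflects a difference in definition rather than a wrong lookup.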

In [None]:
response1 = df.chat('What is the total population of Prato province and please tell from which row you took this answer?')
print(response1)

response2 = df.chat('what is the male population of Assisi and tell from which row you took this answer?')
print(response2)

response3 = df.chat('which city in Napoli province is the most populous and tell from which row you took this answer?')
print(response3)

response4 = df.chat('Compare the population of men and women in the city of Roma and tell from which row you took this answer?')
print(response4)

response5 = df.chat('Tell me about the difference in gender between the people who live in Cusano Milanino and tell from which row you took this answer?')
print(response5)

response6 = df.chat('What is the total population of Leini and tell from which row you took this answer?')
print(response6)

response7 = df.chat('What is the total and Male population of Novara province and tell from which row you took this answer?')
print(response7)

response8 = df.chat('What is the female population of Palmi and tell from which row you took this answer?')
print(response8)

response9 = df.chat('What is the percentage of the total population of Italy that resides in the region Lombardy and tell from which row you took this answer?')
print(response9)

response10 = df.chat('What is the ratio of male to female population in the province Latina and tell from which row you took this answer?')
print(response10)

response11 = df.chat('In Sicilia region, how does the female population compare to the male population in terms of percentage and tell from which row you took this answer?')
print(response11)

response12 = df.chat('tell me about the population of women in Belmonte Mezzagno and tell from which row you took this answer?')
print(response12)

response13 = df.chat('what is the exact female population of Tivoli and tell from which row you took this answer?')
print(response13)

response14 = df.chat('how many male populations do reside in Ercolano and tell from which row you took this answer?')
print(response14)

response15 = df.chat('Tell me about the total population of Bari and tell from which row you took this answer?')
print(response15)

response16 = df.chat('How many people live in the city of Castellaneta and tell from which row you took this answer?')
print(response16)

response17 = df.chat('Which region in the Nord-est group has the most evenly balanced gender ratio and tell from which row you took this answer?')
print(response17)

response18 = df.chat('What is the male population of the region Piemonte and tell from which row you took this answer?')
print(response18)

response19 = df.chat('what is population of the region Emilia-Romagna and tell from which row you took this answer?')
print(response19)

response20 = df.chat('How does the male population of Alseno city compare to the female population and tell from which row you took this answer?')
print(response20)

259244
13339
The most populous city in Napoli province is Napoli and I took this answer from row 6079.
The female population of Roma city is higher than the male population by 159385 people.
The female population of Cusano Milanino is higher than the male population by 909 people. I took this answer from row 1927.
16294
The total population of Novara province is 362502 and the male population is 176980. I took this answer from row 401.
The female population of Palmi is 9217 and I took this answer from row 7138.
0.0
0.981471191582008
In Sicilia region, the female population is 105.25% of the male population. I took this answer from row 7259.
The female population of Belmonte Mezzagno is 5530 and I took this answer from row 7295.
The female population of Tivoli is 28032 and I took this answer from row 5306.
24407
1225048
There are 16220 people living in Castellaneta, and I took this answer from row 6558.
The region in the Nord-est group with the most evenly balanced gender ratio is Ligur

The questions selected for the evaluation are those whose replies include the number of the source row, which makes it straightforward to build the contexts from the rows the answers were reportedly retrieved from.

In [None]:
results = [response3, response5,response7, response8, response11, response12, response13,response16, response17, response20]

In [None]:
results

['The most populous city in Napoli province is Napoli and I took this answer from row 6079.',
 'The female population of Cusano Milanino is higher than the male population by 909 people. I took this answer from row 1927.',
 'The total population of Novara province is 362502 and the male population is 176980. I took this answer from row 401.',
 'The female population of Palmi is 9217 and I took this answer from row 7138.',
 'In Sicilia region, the female population is 105.25% of the male population. I took this answer from row 7259.',
 'The female population of Belmonte Mezzagno is 5530 and I took this answer from row 7295.',
 'The female population of Tivoli is 28032 and I took this answer from row 5306.',
 'There are 16220 people living in Castellaneta, and I took this answer from row 6558.',
 'The region in the Nord-est group with the most evenly balanced gender ratio is Liguria, and I took this answer from row 1270.',
 'The male population of Alseno city is less than the female popu

In [None]:
import re
def extract_row_number(text):
    # Define a pattern to capture the row number
    pattern = r'row (\d+)'

    # Search for the pattern in the text
    match = re.search(pattern, text)

    if match:
        row_number = int(match.group(1))  # Convert matched number to integer
        return row_number
    else:
        return None  # Return None if no match is found
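The pattern `row (\d+)` only matches when the number follows the word "row" directly. Replies in this notebook also use other phrasings, such as "the row number used for finding the answer is 4725", and some replies are plain numbers rather than strings. A more tolerant variant, offered as a sketch and not as a replacement for the function above, might look like this. The name `extract_row_number_tolerant` and the 60-character window are assumptions.

```python
import re

# Allow up to 60 non-digit characters between the word "row" and the number.
ROW_PATTERN = re.compile(r"\brow\b\D{0,60}?(\d+)", re.IGNORECASE)


def extract_row_number_tolerant(reply):
    """Return the first row number mentioned after the word 'row', or None."""
    match = ROW_PATTERN.search(str(reply))  # str() guards against numeric replies
    return int(match.group(1)) if match else None


samples = [
    "The most populous city in Napoli province is Napoli and I took this answer from row 6079.",
    "The total population of province Prato is 259244 and the row number used for finding the answer is 4725.",
    "The female population of Bologna city is 522509. I took this answer from row where Luogo is 'Bologna'.",
    259244,
]
for sample in samples:
    print(extract_row_number_tolerant(sample))
```

The wider window also raises the chance of picking up an unrelated number that happens to follow the word "row", so the extracted values are worth checking against the data.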


In [None]:


# Extract the reported row number from each response used in the evaluation

row_number3 = extract_row_number(response3)
row_number5 = extract_row_number(response5)
row_number7 = extract_row_number(response7)
row_number8 = extract_row_number(response8)
row_number11 = extract_row_number(response11)
row_number12 = extract_row_number(response12)
row_number13 = extract_row_number(response13)
row_number16 = extract_row_number(response16)
row_number17 = extract_row_number(response17)
row_number20 = extract_row_number(response20)

print(row_number3)
print(row_number5)
print(row_number7)
print(row_number8)
print(row_number11)
print(row_number12)
print(row_number13)
print(row_number16)
print(row_number17)
print(row_number20)


6079
1927
401
7138
7259
7295
5306
6558
1270
4112


In [None]:
text = ""
for ind in data.index:
    text += f"{data['Luogo'][ind]} is of the type of {data['Type of place'][ind]} that has {data['Maschi'][ind]} male population and {data['Femmine'][ind]} female population and {data['Totale'][ind]} persons as total population#####"

In [None]:
text

"Italia is of the type of Country that has 28814832 male population and 30182369 female population and 58997201 persons as total population#####Nord is of the type of Group of regions that has 13429002 male population and 13988146 female population and 27417148 persons as total population#####Nord-ovest is of the type of Group of regions that has 7759911 male population and 8098715 female population and 15858626 persons as total population#####Piemonte is of the type of Region that has 2072771 male population and 2178580 female population and 4251351 persons as total population#####Torino is of the type of Province that has 1069885 male population and 1134747 female population and 2204632 persons as total population#####Agliè is of the type of Cities that has 1229 male population and 1339 female population and 2568 persons as total population#####Airasca is of the type of Cities that has 1871 male population and 1798 female population and 3669 persons as total population#####Ala di Stu

In [None]:
data_list = []

for ind in data.index:
    text = f"{data['Luogo'][ind]} is of the type of {data['Type of place'][ind]} that has {data['Maschi'][ind]} male population and {data['Femmine'][ind]} female population and {data['Totale'][ind]} persons as total population"
    data_list.append(text)

# Print the list of formatted strings
for entry in data_list:
    print(entry)

Streaming output truncated to the last 5000 lines.
Bressanone/Brixen is of the type of Cities that has 11133 male population and 11736 female population and 22869 persons as total population
Bronzolo/Branzoll is of the type of Cities that has 1363 male population and 1363 female population and 2726 persons as total population
Brunico/Bruneck is of the type of Cities that has 8413 male population and 8596 female population and 17009 persons as total population
Caines/Kuens is of the type of Cities that has 191 male population and 194 female population and 385 persons as total population
Caldaro sulla strada del vino/Kaltern an der Weinstraße is of the type of Cities that has 4012 male population and 4135 female population and 8147 persons as total population
Campo di Trens/Freienfeld is of the type of Cities that has 1391 male population and 1288 female population and 2679 persons as total population
Campo Tures/Sand in Taufers is of the type of Cities that has 2839 male p
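The same sentence template appears in two separate loops above. One way to keep the wording in a single place is a small helper applied to each row, sketched below on two rows copied from the `data.head()` output. The helper name `describe_row` is an assumption; with the real data frame, `data.to_dict("records")` would supply the row dictionaries.

```python
def describe_row(row):
    """Turn one row (a mapping of column name to value) into a context sentence."""
    return (
        f"{row['Luogo']} is of the type of {row['Type of place']} that has "
        f"{row['Maschi']} male population and {row['Femmine']} female population "
        f"and {row['Totale']} persons as total population"
    )


# Two rows taken from the data.head() output shown earlier
rows = [
    {"Type of place": "Country", "Luogo": "Italia",
     "Maschi": 28814832, "Femmine": 30182369, "Totale": 58997201},
    {"Type of place": "Province", "Luogo": "Torino",
     "Maschi": 1069885, "Femmine": 1134747, "Totale": 2204632},
]

sentences = [describe_row(row) for row in rows]
joined_text = "#####".join(sentences)  # same separator as the cell above, without a trailing one
print(sentences[0])
```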

In [None]:
context3 = data_list[row_number3]
context5 = data_list[row_number5]
context7 = data_list[row_number7]
context8 = data_list[row_number8]
context11 = data_list[row_number11]
context12 = data_list[row_number12]
context13 = data_list[row_number13]
context16 = data_list[row_number16]
context17 = data_list[row_number17]
context20 = data_list[row_number20]

In [None]:
contexts = [context3, context5, context7, context8, context11, context12, context13, context16, context17, context20]
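The row number reported by the LLM is not guaranteed to be the row that actually holds the answer, so it may be worth checking each context before using it. The sketch below tests whether the place named in the context sentence also appears in the answer. The function name `place_at_row_matches` and the toy list indices are assumptions; the sentences and answers are copied from this notebook's outputs.

```python
def place_at_row_matches(sentences, row_number, answer):
    """True when the place named at the start of the context sentence appears in the answer."""
    if row_number is None or not 0 <= row_number < len(sentences):
        return False
    place = sentences[row_number].split(" is of the type of ")[0]
    return place.lower() in answer.lower()


# Toy list: indices 0 and 1 are hypothetical stand-ins for the real row numbers
toy_sentences = [
    "Dairago is of the type of Cities that has 3143 male population and 3277 female population and 6420 persons as total population",
    "Palmi is of the type of Cities that has 8733 male population and 9217 female population and 17950 persons as total population",
]
answer_cusano = "The female population of Cusano Milanino is higher than the male population by 909 people. I took this answer from row 1927."
answer_palmi = "The female population of Palmi is 9217 and I took this answer from row 7138."

print(place_at_row_matches(toy_sentences, 0, answer_cusano))
print(place_at_row_matches(toy_sentences, 1, answer_palmi))
```

This is a name check only: it would not catch a row of the wrong type, such as a province row used for a question about the city of the same name.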

In [None]:
contexts

['Napoli is of the type of Province that has 1449594 male population and 1530744 female population and 2980338 persons as total population',
 'Dairago is of the type of Cities that has 3143 male population and 3277 female population and 6420 persons as total population',
 'Agrate Conturbia is of the type of Cities that has 800 male population and 746 female population and 1546 persons as total population',
 'Palmi is of the type of Cities that has 8733 male population and 9217 female population and 17950 persons as total population',
 'Sicilia is of the type of Region that has 2345397 male population and 2468619 female population and 4814016 persons as total population',
 'Belmonte Mezzagno is of the type of Cities that has 5363 male population and 5530 female population and 10893 persons as total population',
 'Tivoli is of the type of Cities that has 26988 male population and 28032 female population and 55020 persons as total population',
 'Castellaneta is of the type of Cities that 

In [None]:
class Ragas:
    def __init__(self, contexts: list[list[str]]):
        self.contexts = contexts

# Original list of contexts


# Function to transform the contexts into a nested list of strings
def transform_contexts(contexts):
    transformed = []
    for context in contexts:
        parts = context.split(' that has ')
        location_type = parts[0].split(' is of the type of ')
        populations = parts[1].split(' and ')
        transformed.append([location_type[0], location_type[1], populations[0], populations[1], populations[2]])
    return transformed

# Transform the contexts
transformed_contexts = transform_contexts(contexts)

# Create an instance of Ragas with the transformed contexts
ragas_instance = Ragas(transformed_contexts)

# Display the contexts attribute of the Ragas instance
print(ragas_instance.contexts)

[['Napoli', 'Province', '1449594 male population', '1530744 female population', '2980338 persons as total population'], ['Dairago', 'Cities', '3143 male population', '3277 female population', '6420 persons as total population'], ['Agrate Conturbia', 'Cities', '800 male population', '746 female population', '1546 persons as total population'], ['Palmi', 'Cities', '8733 male population', '9217 female population', '17950 persons as total population'], ['Sicilia', 'Region', '2345397 male population', '2468619 female population', '4814016 persons as total population'], ['Belmonte Mezzagno', 'Cities', '5363 male population', '5530 female population', '10893 persons as total population'], ['Tivoli', 'Cities', '26988 male population', '28032 female population', '55020 persons as total population'], ['Castellaneta', 'Cities', '7917 male population', '8303 female population', '16220 persons as total population'], ['Liguria', 'Region', '726267 male population', '781369 female population', '150763
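RAGAS expects the `contexts` column to be a list of lists of strings, with one inner list per question. Splitting each sentence into five fragments satisfies that shape, but a simpler alternative is to keep every sentence whole and wrap it in a one-element list. The sketch below shows that alternative on two contexts from this notebook. Whether it changes the scores was not tested here, and the name `wrapped_contexts` is an assumption.

```python
# Two context sentences copied from the contexts output above
example_contexts = [
    "Palmi is of the type of Cities that has 8733 male population and 9217 female population and 17950 persons as total population",
    "Tivoli is of the type of Cities that has 26988 male population and 28032 female population and 55020 persons as total population",
]

# One inner list per question, each holding the full context sentence
wrapped_contexts = [[context] for context in example_contexts]

print(wrapped_contexts)
```

Keeping the sentence intact preserves the link between the place name and its figures inside a single context string.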

In [None]:
ragas_instance.contexts

[['Napoli',
  'Province',
  '1449594 male population',
  '1530744 female population',
  '2980338 persons as total population'],
 ['Dairago',
  'Cities',
  '3143 male population',
  '3277 female population',
  '6420 persons as total population'],
 ['Agrate Conturbia',
  'Cities',
  '800 male population',
  '746 female population',
  '1546 persons as total population'],
 ['Palmi',
  'Cities',
  '8733 male population',
  '9217 female population',
  '17950 persons as total population'],
 ['Sicilia',
  'Region',
  '2345397 male population',
  '2468619 female population',
  '4814016 persons as total population'],
 ['Belmonte Mezzagno',
  'Cities',
  '5363 male population',
  '5530 female population',
  '10893 persons as total population'],
 ['Tivoli',
  'Cities',
  '26988 male population',
  '28032 female population',
  '55020 persons as total population'],
 ['Castellaneta',
  'Cities',
  '7917 male population',
  '8303 female population',
  '16220 persons as total population'],
 ['Liguria',

In [None]:
d = {
    "question": eval_questions,
    "answer": results,
    "contexts": ragas_instance.contexts,
    "ground_truth": eval_answers
}

dataset = Dataset.from_dict(d)
dataset


Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truth'],
    num_rows: 10
})

In [None]:
score = evaluate(dataset=dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall, answer_correctness])
score_df = score.to_pandas()
score_df.to_csv("EvaluationScores.csv", encoding="utf-8", index=False)

Evaluating:   0%|          | 0/50 [00:00<?, ?it/s]

In [None]:
score_df[['faithfulness','answer_relevancy', 'context_precision', 'context_recall','answer_correctness']].mean(axis=0)

metric,mean
faithfulness,0.2
answer_relevancy,0.972416
context_precision,0.12
context_recall,0.4
answer_correctness,0.599415
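With only ten questions, each mean is sensitive to individual rows, and pandas' `mean` skips NaN values by default, so a metric that could not be computed for some rows would be averaged over fewer than ten values. The sketch below reproduces that behaviour with the standard library and reports how many values were actually used. The scores are hypothetical and the helper name `nan_aware_mean` is an assumption.

```python
import math


def nan_aware_mean(values):
    """Return (mean of the non-NaN values, number of values used)."""
    valid = [v for v in values if not math.isnan(v)]
    if not valid:
        return math.nan, 0
    return sum(valid) / len(valid), len(valid)


# Hypothetical per-question scores for one metric, not results from this notebook
hypothetical_scores = [1.0, 0.0, float("nan"), 0.5]

mean_value, used = nan_aware_mean(hypothetical_scores)
print(f"mean over {used} of {len(hypothetical_scores)} rows: {mean_value}")
```

Reporting the count next to the mean makes it visible when an average rests on fewer rows than expected.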
