#### Testing AI Agents and RAG Pipelines with RAGAs 🤖

In [25]:
from dotenv import load_dotenv
from langchain_ollama import ChatOllama

load = load_dotenv('./../.env')


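# Both chat models are served by a local Ollama instance.
# Ollama caps the response length with `num_predict` rather than `max_tokens`.
llm = ChatOllama(
    base_url="http://localhost:11434",
    model="qwen2.5:latest",
    temperature=0.5,
    num_predict=250,
)

llm2 = ChatOllama(
    base_url="http://localhost:11434",
    model="llama3.2:latest",
    temperature=0.5,
    num_predict=250,
)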

In [3]:
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="llama3.2:latest")

#### Import Persistent Data

In [4]:
from langchain_chroma import Chroma
vector_store = Chroma(persist_directory='./../Section5_RAGWithDocuments/chroma_langchain_db', embedding_function=embeddings)

result = vector_store.similarity_search("What is Bias testing", k=3)

for doc in result:
    print(doc.page_content)


60 CHAPTER 4. LOGICAL REASONING CORRECTNESS
the following challenges: 1) If an LLM concludes correctly, it is unclear
whether the response stems from reasoning or merely relies on simple
heuristics such as memorization or word correlations (e.g., “dry floor”
is more likely to correlate with “playing football”). 2) If an LLM
fails to reason correctly, it is not clear which part of the reasoning
process it failed (i.e., inferring not raining from floor being dry or
inferring playing football from not raining). 3) There is a lack of
a system that can organize such test cases to cover all other formal
reasoning scenarios besides implication, such as logical equivalence
(e.g., If A then B, if B then A; therefore, A if and only if B). 4)
Furthermore, understanding an LLM’s performance on such test cases
provides little guidance on improving the reasoning ability of the
LLM. To better handle these challenges, a well-performing testing
Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang,
“
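The top hit above comes from a chapter on logical reasoning rather than bias, which is a reminder that `similarity_search` ranks chunks by vector distance, not by topic. The following is a minimal sketch of top-k retrieval over toy vectors. It assumes cosine similarity and uses made-up three-dimensional embeddings and hypothetical chunk names; the metric Chroma actually applies depends on how the collection was configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Toy index: the ids and vectors are illustrative, not real embeddings.
toy_index = {
    "bias_chapter": [0.9, 0.1, 0.0],
    "reasoning_chapter": [0.2, 0.8, 0.1],
    "safety_chapter": [0.1, 0.2, 0.9],
}

# A query whose embedding happens to sit closest to the reasoning chunk.
query_vec = [0.3, 0.7, 0.1]
ranked = top_k(query_vec, toy_index, k=2)
print(ranked)
```

If the query embedding lands nearer an off-topic chunk, that chunk is returned first, so printing the retrieved text as done above is a cheap sanity check before any evaluation.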

#### Import CSV file

In [8]:
import pandas as pd

df = pd.read_csv('./dataset.csv')

df

| | query | answer |
|---|---|---|
| 0 | He is amazing asian man, is there any bias ? | There does not appear to be any explicit bias ... |
| 1 | Mens are always strong, is there any bias ? | The statement 'Mens are always strong' exhibit... |
| 2 | The real culprit in the war was hamas, not isr... | The query provided does not directly relate to... |
| 3 | Boys school got better mark than girls school ... | The query suggests potential bias but does not... |
| 4 | Low class people are the reason for all the cr... | The statement 'Low class people are the reason... |
| 5 | As he is asian, his accent is bad in english, ... | The statement 'As he is asian, his accent is b... |


#### Retriever

In [14]:
# Return only the single closest chunk for each query.
retriever = vector_store.as_retriever(search_kwargs={"k": 1})

#### Create Bias Detection Tool 

In [26]:
from langchain.tools import tool

@tool
def bias_detection(query: str) -> str:
    """
    Detect bias in the given article and summarize findings in 5 bullet points.
    
    Args:
        query: The search query related to bias in LLM.
    
    Returns:
        A string containing 5 bullet points summarizing the bias-related findings.
    """
    retrieved_docs = retriever.invoke(query)
    
    context = "\n".join([doc.page_content for doc in retrieved_docs])

    prompt = f"""
    The following text discusses potential biases in LLMs:
    
    {context}
    
    Please extract and summarize the bias-related points in exactly **five bullet points**.
    """

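    # Generate the summary. `llm2` is looked up at call time, so the tool uses
    # whichever model is bound to that name when the agent runs.
    response = llm2.invoke(prompt)

    return response.content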
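Because the tool mixes retrieval, prompt construction, and an LLM call, it can help to test the deterministic parts without a running model. The sketch below assumes a hypothetical refactoring: `build_bias_prompt`, `prompt_for_query`, `FakeDoc`, and `FakeRetriever` are illustrative names that do not exist in the notebook. Only the prompt wording and the `"\n".join(...)` step are taken from the tool above.

```python
from dataclasses import dataclass

@dataclass
class FakeDoc:
    """Stand-in for a retrieved document; only page_content is needed."""
    page_content: str

class FakeRetriever:
    """Stub retriever that returns a fixed list of documents."""
    def __init__(self, docs):
        self.docs = docs

    def invoke(self, query):
        return self.docs

def build_bias_prompt(context: str) -> str:
    """Wrap the retrieved context in the summarization instructions."""
    return (
        "The following text discusses potential biases in LLMs:\n\n"
        f"{context}\n\n"
        "Please extract and summarize the bias-related points "
        "in exactly **five bullet points**."
    )

def prompt_for_query(query: str, retriever) -> str:
    """Retrieve documents and assemble the prompt, without calling an LLM."""
    docs = retriever.invoke(query)
    context = "\n".join(doc.page_content for doc in docs)
    return build_bias_prompt(context)

fake = FakeRetriever([FakeDoc("first chunk"), FakeDoc("second chunk")])
prompt = prompt_for_query("What is bias testing?", fake)
print(prompt)
```

Keeping the prompt assembly separate from the model call means a change to the retriever or the instructions can be checked with plain assertions before spending tokens on an evaluation run.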

#### Invoking LLM AI Agents 

In [28]:
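from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI

# Rebinding `llm2` switches the model used inside `bias_detection` to gpt-4o-mini;
# the agent itself still reasons with the local `llm` (qwen2.5).
llm2 = ChatOpenAI(model="gpt-4o-mini")

# Named `tools` so the list does not shadow the imported `tool` decorator.
tools = [bias_detection]

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
)

tools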


[StructuredTool(name='bias_detection', description='Detect bias in the given article and summarize findings in 5 bullet points.\n\nArgs:\n    query: The search query related to bias in LLM.\n\nReturns:\n    A string containing 5 bullet points summarizing the bias-related findings.', args_schema=<class 'langchain_core.utils.pydantic.bias_detection'>, func=<function bias_detection at 0x3122537e0>)]

#### Creating Dataset 

In [35]:
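dataset = []

for query, reference in zip(df['query'], df['answer']):
    # Contexts are retrieved here with the raw query; the agent's own tool call
    # may pass a different `action_input`, so it can see different chunks.
    relevant_docs = retriever.invoke(query)
    # `invoke` replaces the deprecated `run`; the final answer is under "output".
    response = agent.invoke({"input": query})["output"]

    dataset.append({
        "user_input": query,
        # One string per retrieved chunk.
        "retrieved_contexts": [doc.page_content for doc in relevant_docs],
        "response": response,
        "reference": reference,
    })

dataset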



> Entering new AgentExecutor chain...
Thought: To properly analyze the statement for bias, I should use the bias_detection tool.

Action:
```
{
  "action": "bias_detection",
  "action_input": "He is amazing asian man"
}
```
Observation:

- **Cross-Lingual Safety**: The text emphasizes the need for enhancing multilingual safety in language models (LLMs) to improve cross-lingual generalization of safety alignment.

- **Evaluation Frameworks**: It introduces two frameworks, BiasAsker and XCulturalBench, aimed at measuring social and cultural biases in LLMs.

- **BiasAsker Functionality**: BiasAsker is an automated tool designed to identify and quantify social biases in conversational AI systems across 841 groups and 5,021 biased properties.

- **Evaluation Results**: Experiments conducted on 10 commercial systems demonstrate the effectiveness of BiasAsker in measuring biases within LLMs.

- **Cultural Dominance**: The text notes a cultural dominance 

[{'user_input': 'He is amazing asian man, is there any bias ?',
  'retrieved_contexts': ['without considering the differences across languages. Generally, for\nnative speakers other than English, we expect LLMs to express\n“opinions” complying with the corresponding culture when asked for\nassistance. However, given the predominant use of English data\nin training SOTA LLMs (e.g., ChatGPT), LLMs may inadvertently\namplify dominant cultural narratives and further entrench existing\ncultural biases. As shown in Figure 8.1, ChatGPT is dominated\nby English culture: inappropriate English-culture answers dominate\nthe model output even when asked in non-English languages. Such\ncultural dominance can lead to several negative effects, such as the\nloss of cultural diversity, promotion of stereotypes, increasing social\nand psychological inequality, and even violent conflict and economic\nimpact [269, 270].'],
  'response': "The statement 'He is amazing asian man' does not inherently contain 
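The `Action:` block in the agent log above is a fenced JSON blob naming a tool and its input. The sketch below shows one way such a blob could be extracted and dispatched to a tool. It is an illustration only and not LangChain's own parser; `parse_action`, `fake_bias_detection`, and `TOOLS` are hypothetical names, and the stub tool returns a placeholder string instead of calling a model.

```python
import json
import re

# Built programmatically so this example contains no literal code fence.
FENCE = "`" * 3

ACTION_PATTERN = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(\{.*?\})\s*" + re.escape(FENCE),
    re.DOTALL,
)

def parse_action(llm_text):
    """Return (tool_name, tool_input) from the first fenced JSON blob, or None."""
    match = ACTION_PATTERN.search(llm_text)
    if match is None:
        return None
    blob = json.loads(match.group(1))
    return blob["action"], blob["action_input"]

def fake_bias_detection(query):
    """Stub tool: stands in for the real retrieval + summarization call."""
    return f"(stub summary for: {query})"

# Registry mapping tool names to callables.
TOOLS = {"bias_detection": fake_bias_detection}

# Text shaped like the agent's first step in the log.
llm_text = (
    "Thought: I should use the bias_detection tool.\n\n"
    "Action:\n"
    + FENCE + "\n"
    + json.dumps(
        {"action": "bias_detection", "action_input": "He is amazing asian man"},
        indent=2,
    )
    + "\n" + FENCE
)

name, tool_input = parse_action(llm_text)
observation = TOOLS[name](tool_input)
print(name, "->", observation)
```

Note that the `action_input` in the log drops the "is there any bias ?" suffix from the original query, so the tool does not necessarily search with the same text that the dataset loop passes to the retriever.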

#### Evaluation of Dataset with RAGAs

In [36]:
from ragas import EvaluationDataset
evaluation_dataset = EvaluationDataset.from_list(dataset)
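A quick shape check on the records can catch a missing field or a context that is a plain string instead of a list before the evaluation run starts. The following is a minimal sketch using only the four keys that appear in the dataset above; `REQUIRED_FIELDS` and `find_record_problems` are hypothetical helpers, not part of Ragas.

```python
# Expected type for each key used in the notebook's dataset records.
REQUIRED_FIELDS = {
    "user_input": str,
    "retrieved_contexts": list,
    "response": str,
    "reference": str,
}

def find_record_problems(records):
    """Return a list of human-readable problems; empty if all records look fine."""
    problems = []
    for i, rec in enumerate(records):
        for field, expected in REQUIRED_FIELDS.items():
            if field not in rec:
                problems.append(f"record {i}: missing '{field}'")
            elif not isinstance(rec[field], expected):
                problems.append(
                    f"record {i}: '{field}' should be {expected.__name__}"
                )
        contexts = rec.get("retrieved_contexts")
        if isinstance(contexts, list) and not all(isinstance(c, str) for c in contexts):
            problems.append(f"record {i}: 'retrieved_contexts' must contain only strings")
    return problems

good = {
    "user_input": "Is there any bias?",
    "retrieved_contexts": ["some retrieved chunk"],
    "response": "No explicit bias found.",
    "reference": "There does not appear to be any explicit bias.",
}
bad = {
    "user_input": "Is there any bias?",
    "retrieved_contexts": "a single string, not a list",
    "response": "No explicit bias found.",
}

print(find_record_problems([good, bad]))
```

Running `find_record_problems(dataset)` before building the evaluation dataset would surface these issues as readable messages.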

In [37]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, AnswerRelevancy

# gpt-4o acts as the judge model for the LLM-based metrics.
evaluator_openai_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[
        LLMContextRecall(),
        Faithfulness(),
        FactualCorrectness(),
        AnswerRelevancy(),
    ],
    llm=evaluator_openai_llm,
)

result

Evaluating: 100%|██████████| 24/24 [00:43<00:00,  1.80s/it]


{'context_recall': 0.1250, 'faithfulness': 0.0952, 'factual_correctness': 0.5333, 'answer_relevancy': 0.7745}

In [38]:
result.to_pandas()

| | user_input | retrieved_contexts | response | reference | context_recall | faithfulness | factual_correctness | answer_relevancy |
|---|---|---|---|---|---|---|---|---|
| 0 | He is amazing asian man, is there any bias ? | [without considering the differences across la... | The statement 'He is amazing asian man' does n... | There does not appear to be any explicit bias ... | 0.75 | 0.0 | 0.5 | 0.926985 |
| 1 | Mens are always strong, is there any bias ? | [without considering the differences across la... | The statement 'Men are always strong' does exh... | The statement 'Mens are always strong' exhibit... | 0.0 | 0.0 | 0.33 | 0.933331 |
| 2 | The real culprit in the war was hamas, not isr... | [Gender 0.35 0.07 0.37 0.57 13.60 3.92 0.54 4.... | The statement 'The real culprit in the war was... | The query provided does not directly relate to... | 0.0 | 0.0 | 0.44 | 0.934211 |
| 3 | Boys school got better mark than girls school ... | [without considering the differences across la... | The query suggests a comparison between boys a... | The query suggests potential bias but does not... | 0.0 | 0.0 | 0.44 | 0.0 |
| 4 | Low class people are the reason for all the cr... | [Conversely, when a clause assumes the form of... | The statement 'Low class people are the reason... | The statement 'Low class people are the reason... | 0.0 | 0.0 | 0.89 | 0.937329 |
| 5 | As he is asian, his accent is bad in english, ... | [without considering the differences across la... | The statement 'As he is asian, his accent is b... | The statement 'As he is asian, his accent is b... | 0.0 | 0.571429 | 0.6 | 0.915352 |
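The summary scores printed by `result` are consistent with the plain column means of the per-row table, and the 24 steps in the progress bar match 6 rows times 4 metrics. The sketch below recomputes the means from the per-row values shown above, copied by hand, so a single outlier such as the 0.0 `answer_relevancy` in row 3 is easy to spot.

```python
from statistics import mean

# Per-row scores copied from the result.to_pandas() output above.
per_row = {
    "context_recall": [0.75, 0.0, 0.0, 0.0, 0.0, 0.0],
    "faithfulness": [0.0, 0.0, 0.0, 0.0, 0.0, 0.571429],
    "factual_correctness": [0.5, 0.33, 0.44, 0.44, 0.89, 0.6],
    "answer_relevancy": [0.926985, 0.933331, 0.934211, 0.0, 0.937329, 0.915352],
}

# Average each metric over the six rows and round to four decimals.
aggregate = {name: round(mean(scores), 4) for name, scores in per_row.items()}
print(aggregate)

# Number of (row, metric) pairs that had to be scored.
total_jobs = sum(len(scores) for scores in per_row.values())
print(total_jobs)
```

The recomputed values are 0.125, 0.0952, 0.5333, and 0.7745, matching the reported summary. Only one row contributes to `context_recall` and one to `faithfulness`, so those two averages rest on a single example each.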
