# Class 9_10 - Exercise: implementing ReAct for Agentic RAG

- Implement ReAct with Llama 3 70B (Groq)
- Test on the IIRC dataset: the first 50 questions that have an answer (test_questions.json attached)
- Use the LlamaIndex prompt: https://github.com/run-llama/llama_index/blob/a87b63fce3cc3d24dc71ae170a8d431440025565/llama_index/agent/react/prompts.py
- Save the final answers to the 50 questions in a JSON file for a future evaluation exercise
- Instruct the model to follow the sequence Thought, Action, Input, Observation (the observation is not produced by the model itself; it is the search result)
- You need to use the parameter stop_sequence="Observation:" so the model stops generating text and waits for the search result. Implement the search code and return the top-k documents to the model (suggestion: k=5).
- Instruct the model to act step by step (question decomposition).
- You may use LangChain, LlamaIndex, or another framework, or implement it by hand.
- Use search as a tool
- Use BM25 as the retriever (repeat the indexing from the previous exercise)
- Use the Visconde indexing: https://github.com/neuralmind-ai/visconde/blob/main/iirc_create_indices.ipynb
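
The Thought, Action, Input, Observation cycle listed above can be sketched as a plain loop. This is a minimal sketch under stated assumptions, not the implementation used later in this notebook: `llm` and `search` are hypothetical callables supplied by the caller, and the `Action Input:` and `Answer:` labels are assumed to follow the usual ReAct text format. With the Groq client, `llm` could wrap a chat completion call that passes `stop=["Observation:"]`, so generation halts before the model invents its own observation.

```python
import re
from typing import Callable, List

STOP_SEQUENCE = "Observation:"

def react_loop(question: str,
               llm: Callable[[str], str],           # hypothetical: transcript -> next step text
               search: Callable[[str, int], List[str]],  # hypothetical: (query, k) -> documents
               k: int = 5,
               max_steps: int = 7) -> str:
    """Alternate model steps and retriever observations until an answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        # Defensive cut in case the backend ignored the stop sequence.
        step = step.split(STOP_SEQUENCE)[0].rstrip()
        transcript += step + "\n"

        final = re.search(r"Answer:\s*(.+)", step, re.DOTALL)
        if final:
            return final.group(1).strip()

        query = re.search(r"Action Input:\s*(.+)", step)
        if not query:
            # Neither a tool call nor an answer: stop instead of looping forever.
            break
        docs = search(query.group(1).strip(), k)
        # The observation comes from the retriever, never from the model.
        transcript += STOP_SEQUENCE + " " + " | ".join(docs) + "\n"
    return ""
```

Passing the model and the retriever as arguments keeps the loop easy to try with stubs before spending API calls.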


In [1]:
# Imports
import json
from bs4 import BeautifulSoup
# from rank_bm25 import BM25Okapi
from tqdm import tqdm
import spacy
import os
from groq import Groq

import argparse
import collections
import numpy as np

import re
import string
import sys
import unicodedata

import sentence_transformers
from pyserini.search.lucene import LuceneSearcher

## Load and process Dataset

### Load Data

In [2]:
# Load the context articles (the context manager closes the file afterwards)
with open("./data/context_articles.json", "r") as context_articles_file:
    context_articles = json.load(context_articles_file)


In [3]:

# Load the test questions (the context manager closes the file afterwards)
with open("./data/test_questions.json", "r") as test_questions_file:
    test_questions = json.load(test_questions_file)

### Process data

In [4]:
# Remove HTML tags from the context articles and keep only those referenced by question_links in test_questions.json
# This does the same as Visconde, but a little simpler

# First get all question_links
question_links = set()
for test_question in test_questions:
    for link in test_question["question_links"]:
        question_links.add(link.lower())

# Then we will extract the context articles of the fetched links while cleaning them
cleaned_context_articles = {}
for article, content in context_articles.items():
    if article in question_links:
        parser = BeautifulSoup(content, "html.parser")
        cleaned_context_articles[article] = parser.get_text()

# Testing functionality
assert 'san diego padres' not in cleaned_context_articles.keys() 
assert cleaned_context_articles['zeus']

# Creating a list of dicts with all articles that will be used
articles_list = []
for article, content in cleaned_context_articles.items():
    articles_list.append(
        {
            "title": article,
            "content": content
        }
    )

print(articles_list[0])

{'title': 'arizona', 'content': 'Arizona (; ; ) is a state in the southwestern region of the United States. It is also part of the Western and the Mountain states. It is the 6th largest and the 14th most populous of the 50 states. Its capital and largest city is Phoenix. Arizona shares the Four Corners region with Utah, Colorado, and New Mexico; its other neighboring states are Nevada and California to the west and the Mexican states of Sonora and Baja California to the south and southwest.\n\nArizona is the 48th state and last of the contiguous states to be admitted to the Union, achieving statehood on February 14, 1912, coinciding with Valentine\'s Day. Historically part of the territory of Alta California in New Spain, it became part of independent Mexico in 1821. After being defeated in the Mexican–American War, Mexico ceded much of this territory to the United States in 1848. The southernmost portion of the state was acquired in 1853 through the Gadsden Purchase.\n\nSouthern Arizo

In [5]:
# We are now going to extract all questions with their respective answer

def process_answer(answer):
    match answer["type"]:
        case "span":
            return answer["answer_spans"][0]["text"]
        case "value":
            return answer["answer_value"] + " " + answer["answer_unit"]
        case "binary":
            return answer["answer_value"]
        case _:
            print("Unsupported type", answer["type"])

questions_list = []
for test_question in test_questions:
    question = test_question["question"]
    answer = process_answer(test_question["answer"])
    questions_list.append(
        {
            "question" : question,
            "answer" : answer
        }
    )

## Indexer - BM25 - pyserini

In [6]:
# Transform the raw text of each article into multiple windows of content

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

stride = 3
max_length = 5

def window(documents, stride, max_length):
    treated_documents = []

    for j,document in enumerate(tqdm(documents)):
        doc_text = document['content']
        doc = nlp(doc_text[:10000])
        sentences = [sent.text.strip() for sent in doc.sents]
        for i in range(0, len(sentences), stride):
            segment = ' '.join(sentences[i:i + max_length])
            treated_documents.append({
                "title": document['title'],
                "contents": document['title']+". "+segment,
                "segment": segment
            })
            if i + max_length >= len(sentences):
                break
    return treated_documents

treated_documents = window(articles_list, stride, max_length)

100%|██████████████████████████████████████████████████████████████████████| 62/62 [00:00<00:00, 113.11it/s]
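
With `stride = 3` and `max_length = 5`, consecutive windows share two sentences. The sketch below reproduces only the index arithmetic of the loop in `window`; `window_spans` is a hypothetical helper introduced for illustration and is not part of the notebook.

```python
def window_spans(n_sentences, stride=3, max_length=5):
    """Return the (start, end) sentence indices covered by each window."""
    spans = []
    for start in range(0, n_sentences, stride):
        end = min(start + max_length, n_sentences)
        spans.append((start, end))
        # Same early exit as in `window`: stop once a window reaches the last sentence.
        if start + max_length >= n_sentences:
            break
    return spans

# A 10-sentence article yields three overlapping windows.
print(window_spans(10))  # [(0, 5), (3, 8), (6, 10)]
```

The early exit avoids emitting a trailing window that would be fully contained in the previous one.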


In [7]:
# Creates the jsonl that will be used by pyserini with the windowed data

os.makedirs("data/iirc_indices", exist_ok=True)

# Use a context manager so the file is flushed and closed before indexing
with open("data/iirc_indices/contents.jsonl", "w") as f:
    for i, doc in enumerate(treated_documents):
        doc['id'] = i
        if doc['segment'] != "":
            f.write(json.dumps(doc) + "\n")

In [8]:
# Creates the indices with BM25 (DefaultLuceneDocumentGenerator)
!python3 -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator -threads 1 -input data/iirc_indices -index data/iirc_index -storeRaw

pyserini.index is deprecated, please use pyserini.index.lucene.
2024-05-15 15:09:51,127 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:204) - Setting log level to INFO
2024-05-15 15:09:51,129 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:208) - AbstractIndexer settings:
2024-05-15 15:09:51,130 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:209) -  + DocumentCollection path: data/iirc_indices
2024-05-15 15:09:51,130 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:210) -  + CollectionClass: JsonCollection
2024-05-15 15:09:51,130 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:211) -  + Index path: data/iirc_index
2024-05-15 15:09:51,130 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:212) -  + Threads: 1
2024-05-15 15:09:51,131 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:213) -  + Optimize (merge segments)? false
May 15, 2024 3:09:51 PM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemoryS

## Groq API

In [9]:
class GroqAPI():
    def __init__(self, system_message=None, json_format=False):
        # Read the key from the environment; never hard-code credentials in the notebook
        self.client = Groq(api_key=os.environ.get("GROQ_API_KEY"))
        
        self.model = "llama3-70b-8192"
        self.system_message = system_message
        self.json_format = json_format

    
    def get_answer(self, prompt : str):
        messages = []

        if self.system_message:
            messages.append(
                {
                    "role" : "system",
                    "content" : self.system_message
                }
            )

        messages.append(
            {
                "role" : "user",
                "content" : prompt
            }
        )


        answer = self.client.chat.completions.create(
            messages=messages,
            model=self.model,
            response_format=({"type": "json_object"} if self.json_format else None)
        )

        return answer.choices[0].message.content
    
api = GroqAPI()
api.get_answer("Opa, beleza?")

'Você é brasileiro, né? "Opa, beleza?" é uma expressão muito comum no Brasil, especialmente entre os jovens. "Opa" é uma interjeição que pode ser usada para expressar surpresa, alegria ou emoção, e "beleza" significa "lindo", "bonito" ou "incrível". Portanto, "Opa, beleza?" pode ser traduzido para "Incrível, certo?" ou "Lindo, né?".'

## Evaluation Metrics

### Llama 3

In [10]:
groq_evaluator = GroqAPI(
    system_message="""
    You will compare two answers given in JSON format.
    The first answer is correct and was given by a human, while the second one was given by an LLM.
    Respond in the following JSON format: {"correct": bool}
    """,
    json_format=True
)

In [11]:
sample_question = "What is Zeus know for in Greek mythology?"
sample_gold = "sky and thunder god"
sample_correct = "Zeus is known as the king of the gods and the god of the sky and thunder in Greek mythology."
sample_wrong = "Zeus is been depicted as using violence to get his way and terrorize humans."

def groq_eval(correct, given):
    return groq_evaluator.get_answer(
        prompt=f"""
        First answer: {correct}
        Second answer: {given}
        """
    )

# The human (gold) answer goes first, the model answer second
groq_eval(sample_gold, sample_correct), groq_eval(sample_gold, sample_wrong)

('{"correct": true}', '{"correct": false}')

### Visconde

In [12]:
def normalize_answer(s):
    """Lower text and remove punctuation, articles and extra whitespace."""
    def remove_articles(text):
        regex = re.compile(r'\b(a|an|the)\b', re.UNICODE)
        return re.sub(regex, ' ', text)
    def white_space_fix(text):
        return ' '.join(text.split())
    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)
    def lower(text):
        return text.lower()
    def remove_accents(input_str):
        nfkd_form = unicodedata.normalize('NFKD', input_str)
        only_ascii = nfkd_form.encode('ASCII', 'ignore')
        return only_ascii.decode("utf-8")

    return white_space_fix(remove_articles(remove_punc(lower(remove_accents(s)))))

def get_tokens(s):
    if not s: return []
    return normalize_answer(s).split()

def compute_exact(a_gold, a_pred):
    return int(normalize_answer(a_gold) == normalize_answer(a_pred))

def compute_f1(a_gold, a_pred):
    gold_toks = get_tokens(a_gold)
    pred_toks = get_tokens(a_pred)
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())
    if len(gold_toks) == 0 or len(pred_toks) == 0:
        # If either is no-answer, then F1 is 1 if they agree, 0 otherwise
        return int(gold_toks == pred_toks)
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(pred_toks)
    recall = 1.0 * num_same / len(gold_toks)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

In [13]:
compute_f1(sample_gold, sample_correct), compute_exact(sample_gold, sample_correct)

(0.4, 0)

In [14]:
compute_f1(sample_gold, sample_wrong), compute_exact(sample_gold, sample_wrong)

(0.11111111111111112, 0)
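
The two F1 values above follow directly from the token counts after normalization. As a worked check, with $G$ the gold tokens, $A$ the predicted tokens, and $|G \cap A|$ the size of their multiset intersection: the gold answer has 4 tokens, the correct sample has 16 tokens (4 shared), and the wrong sample has 14 tokens (only "and" shared).

```latex
\[
P = \frac{|G \cap A|}{|A|}, \qquad
R = \frac{|G \cap A|}{|G|}, \qquad
F_1 = \frac{2PR}{P + R}
\]
\[
\text{correct sample: } P = \tfrac{4}{16},\; R = \tfrac{4}{4},\;
F_1 = \frac{2 \cdot 0.25 \cdot 1}{0.25 + 1} = 0.4
\]
\[
\text{wrong sample: } P = \tfrac{1}{14},\; R = \tfrac{1}{4},\;
F_1 = \frac{2 \cdot \tfrac{1}{14} \cdot \tfrac{1}{4}}{\tfrac{1}{14} + \tfrac{1}{4}} = \tfrac{1}{9} \approx 0.111
\]
```

The exact-match score is 0 in both cases because neither normalized prediction equals the normalized gold string.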

## ReAct

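The agent is built from the following components:
- Router: decides the next tool based on the given context
- Search: searches the index for content related to the query
- Think: reflects on the context gathered so far to help decide the next tool
- Answer: produces the final answer from all thoughts and search results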

### Router

In [105]:

def get_next_tool(context, step, last_decision="None"):
    router = GroqAPI(
        system_message=f"""
You are the router of a ReAct agent, also known as the tool selector.
In your selection you must provide a thought explaining why you are selecting this tool.
You must not answer the question if the context only contains the initial question.
The last decision of the router was {last_decision}.
It is not useful to think twice in a row, so you must not do that.

The available tools are the following:
    - Search: 
        Query: "str"
        Returns: Useful and relevant information about the query.

    - Think: 
        Query: None
        Returns: A reflection of the previous content that can be helpful for deciding the next tool.

    - Answer:
        Query: None
        Returns: The final answer based on all thoughts and searches.

You must respond with a json object in the following format:

"tool" : "str" (can be search, think or answer)
"thought" : "str"
"query" : "str" (can be the search query or empty if selected tool is think or answer)
""",
        json_format=True
)
    

    router_answer = router.get_answer(context + f"\n")

    if step >= 7 and json.loads(router_answer)["tool"] != "answer":
        return """
{
    "tool": "answer",
    "thought": "Maximum number of steps reached. Must give an answer",
    "query": ""
}
"""
    
    return router_answer



In [None]:

sample_context = """
Question: Who was Zeus in the greek mythology?
"""

router_answer = get_next_tool(sample_context, 1)

In [92]:

json.loads(router_answer)

{'tool': 'search',
 'thought': "I want to know more about Zeus in Greek mythology, so I'll search for relevant information.",
 'query': 'Zeus in Greek mythology'}
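
The router's reply is free-form model output, so it can be malformed or name a tool that does not exist. The sketch below shows one way to validate it before dispatching. `parse_router_decision` and `VALID_TOOLS` are hypothetical names introduced here, and falling back to `answer` on invalid output is an assumed policy, not something the notebook does.

```python
import json

VALID_TOOLS = {"search", "think", "answer"}

def parse_router_decision(raw: str) -> dict:
    """Parse the router's JSON reply; fall back to 'answer' when it is unusable."""
    fallback = {"tool": "answer", "thought": "Router output was invalid.", "query": ""}
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(decision, dict):
        return fallback

    # Normalize casing so "Search" and "search" are treated alike.
    tool = str(decision.get("tool", "")).strip().lower()
    if tool not in VALID_TOOLS:
        return fallback

    query = str(decision.get("query") or "").strip()
    if tool == "search" and not query:
        # A search without a query cannot run, so treat it as invalid.
        return fallback
    return {"tool": tool, "thought": str(decision.get("thought", "")), "query": query}
```

Normalizing the tool name in one place also avoids comparing a raw value in one function and a lowercased value in another.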

### Search

In [54]:
# Define our embedding model
embedding_model = sentence_transformers.SentenceTransformer('paraphrase-distilroberta-base-v1')

# Generate embeddings for a given text
def generate_embeddings(text):
    return embedding_model.encode(text, convert_to_tensor=True)

# Retrieve relevant documents given a query
def retrieve_documents(query, searcher, embedding_model, top_k=10):
    query_embedding = generate_embeddings(query)
    hits = searcher.search(query, k=2*top_k)
    
    relevant_docs = []
    for hit in hits:
        raw_content = searcher.doc(hit.docid).raw()
        # Use a separate name so the loop variable is not overwritten
        doc = json.loads(raw_content)
        relevant_docs.append(doc["segment"])

    relevant_docs_embeddings = generate_embeddings(relevant_docs)
    
    rerank = sentence_transformers.util.semantic_search(query_embedding.reshape(1, -1), relevant_docs_embeddings, top_k=top_k)

    final_documents = []
    for doc in rerank[0]:
        idx = doc["corpus_id"]
        score = doc["score"]
        final_documents.append(
            {
                "idx" : idx,
                "score" : score,
                "content": relevant_docs[idx]
            }
        )

    return final_documents


# Define the retriever
index_dir = "./data/iirc_index"
searcher = LuceneSearcher(index_dir)

# Search tool
def get_relevant_content(query, top_k=10):
    documents = retrieve_documents(query, searcher, embedding_model, top_k=top_k)
    final_content = ""
    for doc in documents:
        final_content += doc["content"]+"\n"
    return final_content




In [38]:
query = "What about Zeus, the greek god?"
relevant_docs = retrieve_documents(query, searcher, embedding_model)
print(len(relevant_docs), relevant_docs)

10 [{'idx': 5, 'score': 0.5736244320869446, 'content': "Zeus (British English , North American English ; , Zeús ) is the sky and thunder god in ancient Greek religion, who rules as king of the gods of Mount Olympus. His name is cognate with the first element of his Roman equivalent Jupiter. His mythologies and powers are similar, though not identical, to those of Indo-European deities such as Jupiter, Perkūnas, Perun, Indra and Thor. Zeus is the child of Cronus and Rhea, the youngest of his siblings to be born, though sometimes reckoned the eldest as the others required disgorging from Cronus's stomach. In most traditions, he is married to Hera, by whom he is usually said to have fathered Ares, Hebe, and Hephaestus."}, {'idx': 4, 'score': 0.5297212600708008, 'content': 'He was equated with many foreign weather gods, permitting Pausanias to observe "That Zeus is king in heaven is a saying common to all men". Zeus\' symbols are the thunderbolt, eagle, bull, and oak. In addition to his In
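
The rerank step orders the BM25 candidates by embedding similarity to the query; `semantic_search` scores with cosine similarity by default. The sketch below shows that ordering in plain Python on toy vectors. `cosine` and `rerank` are hypothetical helpers for illustration only, and real embeddings would come from the sentence-transformers model.

```python
import math
from typing import List, Sequence, Tuple

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rerank(query_vec: Sequence[float],
           doc_vecs: List[Sequence[float]],
           top_k: int) -> List[Tuple[int, float]]:
    """Return (candidate index, score) pairs, best first, truncated to top_k."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy example: candidate 1 points in the same direction as the query.
ranking = rerank([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], top_k=2)
print([idx for idx, _ in ranking])  # [1, 2]
```

Fetching `2 * top_k` BM25 hits before reranking, as the notebook does, gives the reranker some room to promote candidates that lexical matching ranked lower.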

### Think

In [99]:
def think(context):
    think_api = GroqAPI(
        system_message="""
You will reflect on the given information and provide a useful reflection.
Output JSON in the following format:
{
    "reflection" : "str"
}
        """,
        json_format=True
    )

    reflection = think_api.get_answer(
        f"""
Provide a concise, relevant and precise reflection about the following context:

{context}
        """
    )

    return reflection


### Answer

In [100]:
def answer(context):
    answer_api = GroqAPI(
        system_message="""
You will answer a question based on additional context.
The context will (ideally) contain:
    - Search results that will help you find information
    - Reflections that capture key points of the context
Your answer must be short and focus only on the key points of the question, so be precise and relevant.
Try to keep the answer under 200 characters.
Output JSON in the following format:
{
    "answer" : "str"
}
""",
        json_format=True
    )

    # Named `response` to avoid shadowing the enclosing function `answer`
    response = answer_api.get_answer(f"""
Based on the following context, answer the initial question:

{context}
"""
    )

    return response

### Pipeline

In [107]:
def ReAct(query, debug=0, top_k=10):
    tool = None
    context = f"Question: {query}"
    step = 0

    if debug > 0:
        print(f"\nStarting ReAct agent step {step}.")
    if debug > 1:
        print(f"\nThe context is the following\n----- Start of context -----\n{context}\n----- End of context -----\n")

    decision_history = []
    while tool != "answer":
        step += 1

        if debug > 0:
            print(f"\nCalling ROUTER! Step {step}")
        if debug > 1:
            print(f"The context is the following\n----- Start of context -----\n{context}\n----- End of context -----\n")

        last_decision = decision_history[-1] if len(decision_history) else "None"
        
        router_answer = json.loads(get_next_tool(context, step, last_decision))
        # decision_history.append("router")
        context += "\nThought: " + router_answer["thought"]
        
        if debug > 0:
            print(f"Router gave the following answer {router_answer}")

        step += 1

        tool = router_answer["tool"].lower()
        if tool == "search":
            if debug > 0:
                print(f"\nCalling SEARCH! Step {step}, Query: {router_answer['query']}")
            if debug > 1:
                print(f"The context is the following\n----- Start of context -----\n{context}\n----- End of context -----\n")

            search_query = router_answer["query"]
            search_answer = get_relevant_content(search_query, top_k=top_k)

            if debug > 0:
                print(f"Search gave the following answer {search_answer}")

            decision_history.append("search")
            context += "\n- Start of search results -\n" + search_answer + "\n- End of search results -\n"

        if tool == "think":
            if debug > 0:
                print(f"\nCalling THINK! Step {step}")
            if debug > 1:
                print(f"The context is the following\n----- Start of context -----\n{context}\n----- End of context -----\n")

            think_answer = json.loads(think(context))

            if debug > 0:
                print(f"Think gave the following answer {think_answer}")

            decision_history.append("think")
            context += "\nThink result: " + think_answer["reflection"]
    
    if debug > 0:
        print(f"\nCalling ANSWER! Step {step}")
    if debug > 1:
        print(f"The context is the following\n----- Start of context -----\n{context}\n----- End of context -----\n")

    decision_history.append("answer")
    final_answer = json.loads(answer(context))["answer"]

    if debug > 0:
        print(f"Answer gave the following answer: {final_answer}")

    return final_answer, decision_history
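
The exercise asks for the final answers to the first 50 questions to be saved as JSON for later evaluation. A minimal sketch of that step follows, assuming question items shaped like the entries of `questions_list`. `run_and_save` and its `agent` argument are hypothetical; since `ReAct` returns a tuple, it could be wrapped as `lambda q: ReAct(q, top_k=5)[0]`.

```python
import json
from typing import Callable, Dict, List

def run_and_save(questions: List[Dict[str, str]],
                 agent: Callable[[str], str],   # hypothetical: question -> final answer
                 path: str,
                 limit: int = 50) -> List[Dict[str, str]]:
    """Answer the first `limit` questions and write the results to a JSON file."""
    results = []
    for item in questions[:limit]:
        results.append({
            "question": item["question"],
            "gold": item["answer"],
            "prediction": agent(item["question"]),
        })
    # ensure_ascii=False keeps accented characters readable in the file.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    return results
```

Storing the gold answer next to each prediction lets the evaluation metrics defined earlier run on the file without reloading the dataset.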

In [109]:
sample_question = questions_list[0]["question"]
sample_answer = questions_list[0]["answer"]
print(sample_question, sample_answer, sep='\n')
_answer, _history = ReAct(sample_question, debug=1, top_k=5)
print(_answer, _history, sep='\n')

What is Zeus know for in Greek mythology?
sky and thunder god

Starting ReAct agent step 0.

Calling ROUTER! Step 1
Router gave the following answer {'tool': 'search', 'thought': "I'll search for information about Zeus to provide a helpful response", 'query': 'Zeus in Greek mythology'}

Calling SEARCH! Step 2, Query: Zeus in Greek mythology
Search gave the following answer Zeus (British English , North American English ; , Zeús ) is the sky and thunder god in ancient Greek religion, who rules as king of the gods of Mount Olympus. His name is cognate with the first element of his Roman equivalent Jupiter. His mythologies and powers are similar, though not identical, to those of Indo-European deities such as Jupiter, Perkūnas, Perun, Indra and Thor. Zeus is the child of Cronus and Rhea, the youngest of his siblings to be born, though sometimes reckoned the eldest as the others required disgorging from Cronus's stomach. In most traditions, he is married to Hera, by whom he is usually said