# Class 9_10 - Exercise: Implementing ReAct for Agentic RAG

- Implement ReAct with Llama 3 70B (via Groq).
- Test on the IIRC dataset, using the first 50 questions that have an answer (test_questions.json, attached).
- Use the LlamaIndex prompt: https://github.com/run-llama/llama_index/blob/a87b63fce3cc3d24dc71ae170a8d431440025565/llama_index/agent/react/prompts.py
- Save the final answers to the 50 questions in a JSON file for a future evaluation exercise.
- Instruct the model to follow the sequence Thought, Action, Input, Observation (the observation is not produced by the model itself; it is the result of the search).
- The parameter stop_sequence="Observation:" is required so that the model stops generating text and waits for the search result. Implement the search code and return the top-k documents to the model (suggestion: k=5).
- Instruct the model to act step by step (question decomposition).
- You may use LangChain, LlamaIndex, or another framework, or implement it by hand.
- Use search as a tool.
- Use BM25 as the retriever (repeat the indexing from the previous exercise).
- Use the Visconde indexing: https://github.com/neuralmind-ai/visconde/blob/main/iirc_create_indices.ipynb
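
The Thought / Action / Observation cycle described in the list above can be sketched as a small loop. This is only a minimal illustration under stated assumptions, not the assignment's reference solution: `react_loop`, `scripted_llm`, and the `llm(prompt, stop)` callable are hypothetical names, and the regular expressions assume the model labels its output with `Action:`, `Action Input:`, and `Answer:`; they would need adapting to the prompt actually used. With a real model, `llm` would wrap the provider's completion call and pass `"Observation:"` as the stop sequence; here a scripted stand-in keeps the example runnable offline.

```python
import re
from typing import Callable, Dict, List, Optional

STOP = "Observation:"

# Assumed output labels (Action / Action Input / Answer); adapt to the prompt in use.
ACTION_RE = re.compile(r"Action:\s*(.+?)\s*\n\s*Action Input:\s*(.+)", re.DOTALL)
ANSWER_RE = re.compile(r"Answer:\s*(.+)", re.DOTALL)


def react_loop(
    question: str,
    llm: Callable[[str, List[str]], str],  # hypothetical: (prompt, stop sequences) -> completion
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> Optional[str]:
    """Alternate model turns and tool calls until the model emits an answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # Cut at the stop sequence in case the backend did not enforce it.
        completion = llm(transcript, [STOP]).split(STOP)[0]
        transcript += completion

        action = ACTION_RE.search(completion)
        if action is None:
            # No tool call requested: either a final answer or an unparseable turn.
            answer = ANSWER_RE.search(completion)
            return answer.group(1).strip() if answer else None

        tool_name, tool_input = action.group(1).strip(), action.group(2).strip()
        tool = tools.get(tool_name)
        observation = tool(tool_input) if tool else f"Unknown tool: {tool_name}"
        # The observation comes from the tool, never from the model.
        transcript += f"\n{STOP} {observation}\n"
    return None  # step budget exhausted without a final answer


def scripted_llm(prompt: str, stop: List[str]) -> str:
    """Stand-in for a real model call, used only to exercise the loop."""
    if STOP not in prompt:
        return "Thought: I need to look this up.\nAction: search\nAction Input: Zeus\n"
    return "Thought: I can answer now.\nAnswer: Zeus is the king of the Greek gods."


print(react_loop("Who was Zeus?", scripted_llm, {"search": lambda q: f"(top documents for {q!r})"}))
```

In this notebook, the `search` tool would presumably format the top-5 BM25 hits as text, so that the model receives document contents rather than only scores.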


In [25]:
# Imports
import json
from bs4 import BeautifulSoup
from rank_bm25 import BM25Okapi

## Load and process Dataset

### Load Data

In [13]:
# Load files (context managers close the file handles once the JSON is parsed)
with open("./context_articles.json", "r") as context_articles_file:
    context_articles = json.load(context_articles_file)

with open("./test_questions.json", "r") as test_questions_file:
    test_questions = json.load(test_questions_file)

### Process data

In [22]:
# Remove HTML tags from the context articles, keeping only those referenced by question_links in test_questions.json

# First get all question_links
question_links = set()
for test_question in test_questions:
    for link in test_question["question_links"]:
        question_links.add(link.lower())

# Then we will extract the context articles of the fetched links while cleaning them
cleaned_context_articles = {}
for article, content in context_articles.items():
    if article in question_links:
        # Name the parser explicitly so the result does not depend on which parsers are installed
        parser = BeautifulSoup(content, "html.parser")
        cleaned_context_articles[article] = parser.get_text()

# Sanity checks: an article absent from the question links is dropped, a linked one is kept
assert 'san diego padres' not in cleaned_context_articles
assert cleaned_context_articles['zeus']

# Build a list of dicts with every article that will be used
articles_list = [
    {"article": article, "content": content}
    for article, content in cleaned_context_articles.items()
]

print(articles_list[0])

{'article': 'arizona', 'content': 'Arizona (; ; ) is a state in the southwestern region of the United States. It is also part of the Western and the Mountain states. It is the 6th largest and the 14th most populous of the 50 states. Its capital and largest city is Phoenix. Arizona shares the Four Corners region with Utah, Colorado, and New Mexico; its other neighboring states are Nevada and California to the west and the Mexican states of Sonora and Baja California to the south and southwest.\n\nArizona is the 48th state and last of the contiguous states to be admitted to the Union, achieving statehood on February 14, 1912, coinciding with Valentine\'s Day. Historically part of the territory of Alta California in New Spain, it became part of independent Mexico in 1821. After being defeated in the Mexican–American War, Mexico ceded much of this territory to the United States in 1848. The southernmost portion of the state was acquired in 1853 through the Gadsden Purchase.\n\nSouthern Ari

## BM25

In [32]:
# Tokenizing all content to add to BM25
tokenized_content = {article: content.split() for article, content in cleaned_context_articles.items()}

# Creating BM25 index (pass the corpus as a list of token lists)
bm25_index = BM25Okapi(list(tokenized_content.values()))

# Create an index map for each article (same order as the corpus above)
article_index = {article: idx for idx, article in enumerate(cleaned_context_articles.keys())}

In [41]:
# Testing the Indexer

def get_rankings(query, index=bm25_index, k=None):
    """Score every article against the query and return {article: score}, best first.

    If k is given, only the top-k articles are returned.
    """
    tokenized_query = query.split()

    # Get the BM25 scores for the query
    scores = index.get_scores(tokenized_query)

    ranking = {article: scores[article_index[article]] for article in cleaned_context_articles.keys()}
    sorted_ranking = dict(sorted(ranking.items(), key=lambda item: item[1], reverse=True))
    return dict(list(sorted_ranking.items())[:k]) if k else sorted_ranking


def print_ranking(ranking, sort=False):
    if sort:
        ranking = dict(sorted(ranking.items(), key=lambda item: item[1], reverse=True))
    
    for rank, (article, score) in enumerate(ranking.items(), start=1):
        print(f"Rank {rank}: {article} - BM25 Score: {score}")

ranking = get_rankings("Who was Zeus in greek mythology?", k=5)
print_ranking(ranking)
# query = "Who was Zeus in greek mythology?"

Rank 1: zeus - BM25 Score: 11.982062708491494
Rank 2: greek mythology - BM25 Score: 10.492936571216058
Rank 3: bob bourne - BM25 Score: 10.206749955837116
Rank 4: white hart lane - BM25 Score: 4.165540078786103
Rank 5: 1936 summer olympics - BM25 Score: 4.164142891866513
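
Both the corpus and the query above are tokenized with a plain `split()`, which keeps case and punctuation: `mythology?` is a different token from `mythology`, and `Who` from `who`. A possible refinement, shown only as a sketch, is to lowercase the text and keep word characters only. `normalize_tokens` is a hypothetical helper, not part of `rank_bm25`; it would have to be applied to documents and queries alike, and the scores printed above would change.

```python
import re
from typing import List

# One token per run of letters, digits, or underscores.
TOKEN_RE = re.compile(r"\w+")


def normalize_tokens(text: str) -> List[str]:
    """Lowercase the text and drop punctuation before BM25 indexing or querying."""
    return TOKEN_RE.findall(text.lower())


print(normalize_tokens("Who was Zeus in greek mythology?"))
```

To try it, the index would be rebuilt from `normalize_tokens(content)` for each article, and `get_rankings` would tokenize the query with the same function.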
