# Code4rena (C4) Question-Answering Bot

### Objective
- This notebook contains the proof-of-concept (PoC) work for a question-answering bot built on C4's knowledge bases.
- The goal of the PoC is to prototype an LLM implementation that answers questions accurately enough to meet the C4 team's expectations and, at a minimum, outperforms their current bot from [Mava](https://www.mava.app/).

### Observations from the usage of Mava
- The platform offers Discord support management with ticketing and AI help bot features
- For the AI help bot, the user is able to specify links to multiple knowledge sources that can be used for answering questions.
- Based on C4's testing of the Mava bot in the private channel, the following stats were observed:-
    - Total questions asked: 29
    - Total questions mis-answered based on emoji reactions: 13
    - Accuracy - ~55%
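
As a quick check of the figure above, accuracy here is taken to be the share of questions that were not flagged as mis-answered. The variable names are illustrative only.

```python
# Figures taken from the Mava test-channel stats listed above.
total_questions = 29
mis_answered = 13

correct = total_questions - mis_answered
accuracy = correct / total_questions
print(f"{correct}/{total_questions} correct = {accuracy:.1%}")  # 16/29 correct = 55.2%
```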

### Knowledge Bases
Based on conversations with their team, the following knowledge bases were identified to be relevant and are the same ones that Mava is using:-
- [Main Website](https://code4rena.com/)
- [Docs](https://docs.code4rena.com/) 


### High-level Approach
- Crawl and scrape C4's website and docs using the Scrapy library
- Convert the HTML content to Markdown so that the model can better understand the context
- Use the LangChain library to do the following:-
    - Split the markdown header-separated sections into semantic chunks
    - Embed and store the semantic chunks in an in-memory vector db
    - Use the retrieval augmented functionality to answer the question
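
The split, embed, and retrieve steps listed above can be sketched with the standard library alone. This is a toy illustration under loud assumptions: `embed` is a word-count stand-in for a real embedding model, the sample text is made up, and none of these helper names come from LangChain.

```python
import math
import re
from collections import Counter

def split_by_headers(markdown):
    """Split markdown into sections, starting a new one at each header line."""
    sections, current = [], []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections

def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, k=1):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

# Made-up sample text, not taken from the C4 knowledge bases.
doc = """# Billing
Invoices are sent at the start of each month.
# Support
Open a ticket to reach the support team."""

chunks = split_by_headers(doc)
top = retrieve("How do I reach the support team?", chunks)
print(top[0])
```

In the notebook itself the retrieved chunks are then passed to the LLM as context; the sketch stops at retrieval.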

In [59]:
# Install all the third-party packages

!pip install 'langchain[llms]'
!pip install Scrapy
!pip install html2text
!pip install lxml
!pip install python-dotenv
!pip install "unstructured[all-docs]"
!pip install tiktoken
!pip install faiss-cpu 
!pip install GitPython
!pip install notebook
!pip install chromadb
!pip install pandas

Collecting tqdm==4.64.1
  Using cached tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
Installing collected packages: tqdm
  Attempting uninstall: tqdm
    Found existing installation: tqdm 4.66.1
    Uninstalling tqdm-4.66.1:
      Successfully uninstalled tqdm-4.66.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
chromadb 0.4.8 requires tqdm>=4.65.0, but you have tqdm 4.64.1 which is incompatible.
Successfully installed tqdm-4.64.1

[notice] A new release of pip available: 22.2.2 -> 23.2.1
[notice] To update, run: pip install --upgrade pip

In [27]:
# General setup - you can specify OPENAI_API_KEY in .env file

import logging
from dotenv import load_dotenv
from IPython.display import display, Markdown, Latex

logging.getLogger().setLevel(logging.INFO)
load_dotenv()

True

In [28]:
import getpass
import os

OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY') or getpass.getpass('Enter your OpenAI API key: ')

assert OPENAI_API_KEY, "Please set OPENAI_API_KEY in your environment variables"

In [29]:
# Paths to the data

C4_WEBSITE_STORAGE_DIR = "knowledge_base/c4/website"
C4_DOCS_STORAGE_DIR = "knowledge_base/c4/docs"
C4_GH_DOCS_STORAGE_DIR = "knowledge_base/c4/gh_docs"

### Crawling and Scraping using Scrapy

In [30]:
import os
import scrapy
import html2text
import lxml.html
import json
from urllib.parse import urlparse

class GenericSpider(scrapy.Spider):
    name = 'generic'

    def __init__(self, domain='', storage_dir='.', *args, **kwargs):
        super(GenericSpider, self).__init__(*args, **kwargs)
        self.allowed_domains = [domain]
        self.start_urls = [f'http://{domain}/']
        self.storage_dir = storage_dir
    
    def parse(self, response):
        # Remove unwanted elements using lxml
        tree = lxml.html.fromstring(response.text)
        
        # Remove non-text related tags
        for unwanted in tree.xpath('//script|//img|//video|//audio|//iframe|//object|//embed|//canvas|//svg|//link|//source|//track|//map|//area'):
            unwanted.drop_tree()

        cleaned_html = lxml.html.tostring(tree).decode('utf-8')

        # Convert HTML to Markdown
        converter = html2text.HTML2Text()
        markdown_text = converter.handle(cleaned_html)

        # Save to a JSON file in the specified directory, creating it if needed
        os.makedirs(self.storage_dir, exist_ok=True)

        url = response.url
        page_name = response.url.split("/")[-1] if response.url.split("/")[-1] else "index"

        filename = os.path.join(self.storage_dir, f'{page_name}.json')

        with open(filename, 'w') as f:
            # Store the URL and markdown text in JSON format
            json.dump({'url': url, 'md_content': markdown_text}, f)

        # Recursively follow links (relative or absolute) that stay on the same domain
        for href in response.css('a::attr(href)').getall():
            url = response.urljoin(href)
            if urlparse(url).netloc in self.allowed_domains:
                yield scrapy.Request(url, self.parse)
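
The spider names each output file after the last URL path segment, so two pages such as `/a/overview` and `/b/overview` would write to the same file. One possible alternative, offered as a sketch only, derives the name from the whole path. `page_name_from_url` is a hypothetical helper that is not part of the notebook, and the `example.com` URLs are placeholders.

```python
from urllib.parse import urlparse

def page_name_from_url(url):
    """Build a file-system-friendly page name from the full URL path."""
    path = urlparse(url).path.strip("/")
    # The root URL has an empty path; mirror the spider's "index" fallback.
    if not path:
        return "index"
    # Join path segments so pages in different directories do not collide.
    return path.replace("/", "__")

print(page_name_from_url("https://example.com/"))            # index
print(page_name_from_url("https://example.com/a/overview"))  # a__overview
print(page_name_from_url("https://example.com/b/overview/")) # b__overview
```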


NOTE: The data has already been scraped and saved locally as JSON files in the 'knowledge_base/c4' directory. To re-run the scraping, uncomment the code in the cell below.

If re-running the crawler raises a 'ReactorNotRestartable' error, restart the notebook kernel first: the Twisted reactor cannot be started twice in the same process.

In [11]:
# from scrapy.crawler import CrawlerRunner
# from scrapy.utils.project import get_project_settings
# from twisted.internet import reactor

# settings = get_project_settings()

# runner = CrawlerRunner(settings)
# runner.crawl(GenericSpider, domain="code4rena.com", storage_dir=C4_WEBSITE_STORAGE_DIR)
# runner.crawl(GenericSpider, domain="docs.code4rena.com", storage_dir=C4_DOCS_STORAGE_DIR)
# d = runner.join()
# d.addBoth(lambda _: reactor.stop())
# reactor.run()

#### Get docs from Github Repo

In [None]:
# from git import Repo

# repo = Repo.clone_from(
#     "https://github.com/code-423n4/docs", to_path=C4_GH_DOCS_STORAGE_DIR
# )

### Retrieval Augmented Generation using LangChain

#### Load locally saved scraped data

In [31]:
import json
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import TextLoader

def load_json_files(dir):
    loader = DirectoryLoader(dir, loader_cls=TextLoader)
    documents = loader.load()
    for d in documents:
        page_content_dict = json.loads(d.page_content)
        d.page_content = page_content_dict['md_content']
        d.metadata['url'] = page_content_dict['url']
    return documents

c4_website_data_list = load_json_files(C4_WEBSITE_STORAGE_DIR)
c4_docs_data_list = load_json_files(C4_DOCS_STORAGE_DIR)

In [32]:
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import TextLoader

loader = DirectoryLoader(C4_GH_DOCS_STORAGE_DIR, loader_cls=TextLoader)
c4_gh_docs_data_list = loader.load()


#### Split the markdown content into semantic chunks

In [322]:
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language,
)

md_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN, chunk_size=2000, chunk_overlap=200
)


website_chunks =  md_splitter.split_documents(c4_website_data_list)
docs_chunks =  md_splitter.split_documents(c4_docs_data_list)
gh_docs_chunks = md_splitter.split_documents(c4_gh_docs_data_list)

print(len(website_chunks))
print(len(docs_chunks))
print(len(gh_docs_chunks))

89
97
72
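
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a deliberately simplified character-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also prefers to break at separators such as headings and paragraph breaks); it only shows how consecutive chunks share `chunk_overlap` characters. `chunk_with_overlap` is a hypothetical helper.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(text, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the store.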


#### Embed the semantic chunks and store in an in-memory vector db

In [271]:
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# NOTE: At times, the OpenAI Embedding service can fail intermittently and return erroneous values such as [NaN], more info: https://github.com/langchain-ai/langchain/pull/7070

embeddings = OpenAIEmbeddings()
vectorstore = Chroma("vectorstore_1", embeddings, collection_metadata={"hnsw:space": "cosine"})

vectorstore.add_documents(website_chunks)
#vectorstore.add_documents(docs_chunks)
vectorstore.add_documents(gh_docs_chunks)


['72e47726-5246-11ee-8d9d-367dda1ae1c5',
 '72e4782a-5246-11ee-8d9d-367dda1ae1c5',
 '72e47866-5246-11ee-8d9d-367dda1ae1c5',
 '72e47898-5246-11ee-8d9d-367dda1ae1c5',
 '72e478c0-5246-11ee-8d9d-367dda1ae1c5',
 '72e478e8-5246-11ee-8d9d-367dda1ae1c5',
 '72e47910-5246-11ee-8d9d-367dda1ae1c5',
 '72e47938-5246-11ee-8d9d-367dda1ae1c5',
 '72e47960-5246-11ee-8d9d-367dda1ae1c5',
 '72e47988-5246-11ee-8d9d-367dda1ae1c5',
 '72e479b0-5246-11ee-8d9d-367dda1ae1c5',
 '72e479d8-5246-11ee-8d9d-367dda1ae1c5',
 '72e47a00-5246-11ee-8d9d-367dda1ae1c5',
 '72e47a28-5246-11ee-8d9d-367dda1ae1c5',
 '72e47a50-5246-11ee-8d9d-367dda1ae1c5',
 '72e47a78-5246-11ee-8d9d-367dda1ae1c5',
 '72e47a96-5246-11ee-8d9d-367dda1ae1c5',
 '72e47abe-5246-11ee-8d9d-367dda1ae1c5',
 '72e47ae6-5246-11ee-8d9d-367dda1ae1c5',
 '72e47b0e-5246-11ee-8d9d-367dda1ae1c5',
 '72e47b36-5246-11ee-8d9d-367dda1ae1c5',
 '72e47b5e-5246-11ee-8d9d-367dda1ae1c5',
 '72e47b86-5246-11ee-8d9d-367dda1ae1c5',
 '72e47ba4-5246-11ee-8d9d-367dda1ae1c5',
 '72e47bcc-5246-
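
Given the note in the cell above about intermittent `[NaN]` embeddings, one defensive option is to check vectors before they are written to the store. This is a sketch under assumptions: `find_invalid_embeddings` is a hypothetical helper, and the sample vectors are made up. In the notebook the vectors would come from something like `embeddings.embed_documents([...])`.

```python
import math

def find_invalid_embeddings(vectors):
    """Return indexes of vectors that are empty or contain NaN/infinite values."""
    bad = []
    for i, vec in enumerate(vectors):
        if not vec or any(not math.isfinite(x) for x in vec):
            bad.append(i)
    return bad

# Made-up vectors for illustration; index 1 mimics the [NaN] failure mode.
sample = [[0.1, 0.2], [float("nan")], [0.3, -0.4]]
print(find_invalid_embeddings(sample))  # [1]
```

Chunks at the reported indexes could then be re-embedded before calling `add_documents`.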

#### Retrieval Augmented Generation
Workflow:
1. Use a faster LLM (GPT-3.5) to generate 3 rephrased variants of the original user question; clearer questions should in turn improve retrieval
2. Try the rephrased questions one at a time and generate the final answer using RAG, stopping at the first one that yields an answer

##### Generate rephrased questions
Ask GPT-3.5 for 3 rephrased variants of the original question and parse them from the JSON object it returns.

In [383]:
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

prompt_template = """You are a teacher who is helping a student ask the right questions about a service so that they can look in the most relevant places to find the answer. 
# INSTRUCTIONS
- You are given student's question below
- Using the original question, generate 3 alternative questions that are rephrased to be not vague or ambiguous so as to clearly convey the same meaning and context as the original question
- Return the final result as a JSON object containing a list of rephrased questions as "new_questions" field

# QUESTION
{question}

# RESULT
"""


def generate_rephrased_questions(question):
    chat = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
    llm_chain = LLMChain(llm=chat, prompt=PromptTemplate.from_template(prompt_template))

    result = llm_chain(inputs={"question": question}, return_only_outputs=True)
    result_dict = json.loads(result['text'])
    new_questions = result_dict['new_questions']
    return new_questions

generate_rephrased_questions("What are scout awards?")

['What is the meaning of scout awards?',
 'Can you explain what scout awards are?',
 'Could you provide a description of scout awards?']
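
`generate_rephrased_questions` assumes the model replies with bare JSON. Chat models sometimes wrap JSON in extra prose, in which case `json.loads` raises. A more tolerant parser could look like the sketch below; `parse_new_questions` is a hypothetical helper, and the sample reply is made up.

```python
import json
import re

def parse_new_questions(raw_text):
    """Extract the "new_questions" list from an LLM reply, tolerating extra text."""
    # Grab the outermost {...} span in case the reply wraps the JSON in prose.
    match = re.search(r"\{.*\}", raw_text, flags=re.DOTALL)
    if not match:
        return []
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    questions = data.get("new_questions", [])
    if not isinstance(questions, list):
        return []
    # Keep only non-empty strings.
    return [q for q in questions if isinstance(q, str) and q.strip()]

reply = 'Here you go: {"new_questions": ["Can you explain what scout awards are?"]} Hope this helps.'
print(parse_new_questions(reply))
```

Returning an empty list on failure lets the caller decide whether to fall back to the original question.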

##### Generate final answer using RAG

In [296]:
def display_result(question, result):
    display(Markdown(f"### Question"))
    display(Markdown("ORIGINAL: " + question))
    display(Markdown("REPHRASED: " + f"{result['rephrased_question'] if result['rephrased_question'] else 'None'}"))

    display(Markdown(f"### Answer"))
    display(Markdown(result["result"]))

    display(Markdown(f"### Sources"))
    sources = [r.metadata['url'] if 'url' in r.metadata else r.metadata['source'] for r in result["source_documents"] ]
    print(", ".join(sources))

In [408]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(model_name="gpt-4", temperature=0), chain_type="stuff", retriever=vectorstore.as_retriever(), return_source_documents=True)


def call_llm(question, use_rephrased_questions=True):
    if not use_rephrased_questions:
        result = qa({"query": question})
        result['rephrased_question'] = None
        return result


    # Get rephrased questions
    rephrased_questions = generate_rephrased_questions(question)

    # Attempt each rephrased question until a valid result is found
    result = None
    for q in rephrased_questions:
        result = qa({"query": q})
        result['rephrased_question'] = None

        # If the model is unable to find an answer, it returns 'sorry' in the response,
        # so we try again with a different question
        if 'sorry' not in result['result'].lower():
            result['rephrased_question'] = q
            break

    # Fall back to the original question if no rephrased questions were generated
    if result is None:
        result = qa({"query": question})
        result['rephrased_question'] = None

    return result
 

def ask(question, use_rephrased_questions=True):
    result = call_llm(question, use_rephrased_questions)
    display_result(question, result)
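
The retry logic can also be separated from the LLM calls so that it is testable with stubs. The sketch below is one way to do that, not the notebook's implementation: `first_acceptable_answer` and its callable parameters are hypothetical, and trying the original question last is an assumption about the desired behaviour.

```python
def first_acceptable_answer(question, rephrase, answer, is_refusal):
    """Try each rephrased question, then the original; return (answer, question_used)."""
    candidates = list(rephrase(question)) + [question]
    last = None
    for q in candidates:
        last = answer(q)
        if not is_refusal(last):
            return last, q
    # Every attempt was a refusal; return the last one with no accepted question.
    return last, None

# Stubs standing in for the rephrasing chain and the RetrievalQA chain.
def fake_rephrase(question):
    return ["variant a", "variant b"]

def fake_answer(question):
    return "Sorry, I don't know." if question == "variant a" else f"Answer to: {question}"

def looks_like_refusal(text):
    return "sorry" in text.lower()

print(first_acceptable_answer("original", fake_rephrase, fake_answer, looks_like_refusal))
# ('Answer to: variant b', 'variant b')
```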


#### AutoEvaluator
This section uses LangChain's [AutoEvaluator technique](https://autoevaluator.langchain.com/) to evaluate the bot's performance. The dataset consists of C4 questions that Mava answered correctly according to team feedback, so Mava's answers serve as the reference answers.


In [228]:
import yaml

# load yaml file
with open('knowledge_base/c4/c4_test_qa.yaml') as file:
    # FullLoader converts the YAML document into Python objects;
    # here that is a list of dictionaries with 'question' and 'answer' keys
    yaml_data = yaml.load(file, Loader=yaml.FullLoader)

mava_questions = [d['question'] for d in yaml_data]


In [229]:
from langchain.prompts import PromptTemplate

template = """ 
    You are a grader trying to determine if a set of retrieved documents will help a student answer a question. \n

    Here is the question: \n
    {query}

    Here are the documents retrieved to answer question: \n
    {result}
    
    Here is the correct answer to the question: \n 
    {answer}
   
    Criteria: 
      relevance: Do all of the documents contain information that will help the student arrive at the correct answer to the question?

    Your response should be as follows:

    GRADE: (Correct or Incorrect, depending if all of the documents retrieved meet the criterion)
    (line break)
    JUSTIFICATION: (Write out in a step by step manner your reasoning about the criterion to be sure that your conclusion is correct. Use three sentences maximum. Keep the answer as concise as possible.)
    """

GRADE_DOCS_PROMPT = PromptTemplate(input_variables=['result', 'answer', 'query'], template=template)

template = """You are a teacher grading a quiz. 
You are given a question, the student's answer, and the true answer, and are asked to score the student answer as either Correct or Incorrect.

Example Format:
QUESTION: question here
STUDENT ANSWER: student's answer here
TRUE ANSWER: true answer here
GRADE: Correct or Incorrect here

Grade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. If the student answers that there is no specific information provided in the context, then the answer is Incorrect. Begin! 

QUESTION: {query}
STUDENT ANSWER: {result}
TRUE ANSWER: {answer}
GRADE:

Your response should be as follows:

GRADE: (Correct or Incorrect)
(line break)
JUSTIFICATION: (Without mentioning the student/teacher framing of this prompt, explain why the STUDENT ANSWER is Correct or Incorrect. Use one or two sentences maximum. Keep the answer as concise as possible.)
"""

GRADE_ANSWER_PROMPT = PromptTemplate(input_variables=["query", "result", "answer"], template=template)
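
Both prompts ask for a `GRADE:` line followed by a `JUSTIFICATION:` line. A small parser for that reply format is sketched below. `parse_grade` is a hypothetical helper and the sample reply is made up. Reading the grade from its own line avoids misclassifying a reply whose justification happens to contain the word "Incorrect".

```python
import re

def parse_grade(text):
    """Split a grader reply into (grade, justification); grade is None if missing."""
    grade_match = re.search(r"GRADE:\s*(Correct|Incorrect)\b", text, flags=re.IGNORECASE)
    just_match = re.search(r"JUSTIFICATION:\s*(.*)", text, flags=re.IGNORECASE | re.DOTALL)
    grade = grade_match.group(1).capitalize() if grade_match else None
    justification = just_match.group(1).strip() if just_match else ""
    return grade, justification

reply = "GRADE: Correct\nJUSTIFICATION: Nothing in the answer is Incorrect."
print(parse_grade(reply))
# ('Correct', 'Nothing in the answer is Incorrect.')
```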

In [230]:
from langchain.evaluation.qa import QAEvalChain

def grade_model_answer(predicted_dataset, predictions):

    # Create an evaluation chain
    eval_chain = QAEvalChain.from_llm(
        llm=ChatOpenAI(model_name="gpt-4", temperature=0),
        prompt=GRADE_ANSWER_PROMPT
    )

    # Evaluate the predictions and ground truth using the evaluation chain
    graded_outputs = eval_chain.evaluate(
        predicted_dataset,
        predictions,
        question_key="question",
        prediction_key="result"
    )

    return graded_outputs


def grade_model_retrieval(gt_dataset, predictions):
    # Create an evaluation chain
    eval_chain = QAEvalChain.from_llm(
        llm=ChatOpenAI(model_name="gpt-4", temperature=0),
        prompt=GRADE_DOCS_PROMPT
    )

    # Evaluate the predictions and ground truth using the evaluation chain
    graded_outputs = eval_chain.evaluate(
        gt_dataset,
        predictions,
        question_key="question",
        prediction_key="result"
    )
    return graded_outputs

In [409]:
bot_answers = []
source_docs = []
for d in yaml_data:
    result = call_llm(d['question'])
    bot_answers.append(result['result'])
    source_docs.append(result['source_documents'])


In [411]:
predictions = [{'result': a} for a in bot_answers]

answer_grades = grade_model_answer(yaml_data, predictions)

In [415]:
retrieved_docs = []
for i, d in enumerate(yaml_data):
    retrieved_doc_text = ""
    for j, doc in enumerate(source_docs[i]):
        retrieved_doc_text += "Doc %s: " % str(j + 1) + doc.page_content + " "
    retrieved = {"question": d["question"], "answer": d["answer"], "result": retrieved_doc_text}
    retrieved_docs.append(retrieved)

In [416]:
retrieval_grades = grade_model_retrieval(yaml_data, retrieved_docs)

In [417]:
import pandas as pd

df = pd.DataFrame({
    "question": [d['question'] for d in yaml_data],
    "Mava correct answer (True value)": [d['answer'] for d in yaml_data],
    "Bot answers": [p['result'] for p in predictions],
    "Retrieval relevancy score": ['Incorrect' if 'Incorrect' in g['results'] else 'Correct' for g in retrieval_grades],
    "Answer similarity score": ['Incorrect' if 'Incorrect' in g['results'] else 'Correct' for g in answer_grades]
})
df

| | question | Mava correct answer (True value) | Bot answers | Retrieval relevancy score | Answer similarity score |
|---|---|---|---|---|---|
| 0 | Hi, how can I get backstage access? | To get backstage access, you need to become a ... | To obtain +Backstage access, you need to meet ... | Correct | Incorrect |
| 1 | how long does it take until findings are relea... | Based on the context provided, the findings fr... | The audit report is published and audit issues... | Correct | Correct |
| 2 | When can I talk about findings? | You can talk about your findings after the con... | You can discuss the findings after the audit r... | Incorrect | Correct |
| 3 | How do I change my wallet address? | To change your wallet address, follow these st... | To update your wallet address, you need to:\n\... | Correct | Correct |
| 4 | What are scouts? | In the context of Code4rena, Scouts are indivi... | Scouts in the context of Code4rena are individ... | Correct | Correct |
| 5 | How long does the contest process usually take? | Based on the provided context, the contest pro... | Most audits typically run for 3-7 days. | Correct | Incorrect |
| 6 | how does certification work? | The certification process at Code4rena works i... | The certification process is as follows:\n\n1.... | Correct | Correct |
| 7 | Can I use bots to analyze code? | Yes, you can use bots to analyze code. In fact... | Yes, it is possible to utilize bots for code a... | Correct | Correct |
| 8 | What is a lookout? | In the context provided, a lookout is a role i... | A Lookout in the context of Code4rena's compet... | Incorrect | Correct |


### HyDE technique
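HyDE (Hypothetical Document Embeddings) can help improve retrieval: an LLM first writes a hypothetical passage that answers the question, and that passage, rather than the raw question, is embedded and used for the similarity search.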

https://python.langchain.com/docs/use_cases/question_answering/how_to/hyde
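
The idea can be illustrated without any LLM or vector store. In the toy sketch below everything is a stand-in: `fake_llm` returns a canned passage, `keyword_search` ranks by shared words instead of embeddings, and the corpus is made up. It only shows why searching with an answer-shaped passage can match documents that the question itself does not.

```python
import re

def tokens(text):
    """Lowercased word set used by the toy search."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def hyde_retrieve(question, write_passage, search):
    """Retrieve with a generated hypothetical passage instead of the raw question."""
    hypothetical = write_passage(question)  # in practice, an LLM call
    return search(hypothetical)             # in practice, a vector-store search

# Made-up corpus and canned "LLM" output, for illustration only.
corpus = [
    "Billing: invoices are emailed monthly.",
    "Account settings: use the reset password button to change your password.",
]

def fake_llm(question):
    return "You can reset your password from the account settings page."

def keyword_search(text):
    query = tokens(text)
    return max(corpus, key=lambda doc: len(query & tokens(doc)))

question = "I'm locked out, what now?"
print(hyde_retrieve(question, fake_llm, keyword_search))
```

Here the raw question shares no words with either document, while the hypothetical passage overlaps heavily with the account-settings entry.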

In [None]:
vectorstore_hyde = Chroma("store_hyde_1", embeddings, collection_metadata={"hnsw:space": "cosine"})
vectorstore_hyde.add_documents(website_chunks)
vectorstore_hyde.add_documents(gh_docs_chunks)

In [None]:
from langchain.vectorstores.base import VectorStoreRetriever
from langchain.callbacks.manager import (
    AsyncCallbackManagerForRetrieverRun,
    CallbackManagerForRetrieverRun,
)
from langchain.docstore.document import Document
from typing import List

class HydeRetriever(VectorStoreRetriever):

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

        web_search_template = """Please write a passage to answer the question 
        Question: {QUESTION}
        Passage:"""

        web_search = PromptTemplate(template=web_search_template, input_variables=["QUESTION"])

        llm_chain = LLMChain(llm=llm, prompt=web_search)

        result = llm_chain(inputs={"QUESTION": query}, return_only_outputs=True)
        hyquery = result['text']

        return super()._get_relevant_documents(hyquery, run_manager=run_manager)


hyde_retriever = HydeRetriever(vectorstore=vectorstore_hyde)

hyde_retriever.get_relevant_documents("How can I access findings.csv")

In [299]:

qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(model_name="gpt-4", temperature=0), chain_type="stuff", retriever=hyde_retriever, return_source_documents=True)


def call_hyde_llm(question):
    result = qa({"query": question})
    result['rephrased_question'] = None
    return result

def ask_hyde(question):
    result = call_hyde_llm(question)
    display_result(question, result)

#### MultiQuery approach

In [None]:
# from langchain.chat_models import ChatOpenAI
# from langchain.retrievers.multi_query import MultiQueryRetriever

# question = "What are scout awards?"
# llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
# multiquery_retriever = MultiQueryRetriever.from_llm(
#     retriever=vectorstore.as_retriever(), llm=llm
# )
# import logging

# logging.basicConfig()
# logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

#### Final Implementation

In [325]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# NOTE: At times, the OpenAI Embedding service can fail intermittently and return erroneous values such as [NaN], more info: https://github.com/langchain-ai/langchain/pull/7070

embeddings = OpenAIEmbeddings()
vectorstore_with_sources = Chroma("vectorstore_with_sources3", embeddings, collection_metadata={"hnsw:space": "cosine"})

for i, d in enumerate(website_chunks):
    d.metadata['source'] = f"w{i}-pl"
    vectorstore_with_sources.add_documents([d])

for i, d in enumerate(gh_docs_chunks):
    local_path = d.metadata['source']
    d.metadata['source'] = f"g{i}-pl"
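    # Map the local path to its GitHub URL; the replacement has no trailing slash
    # because the remainder of local_path already starts with one
    d.metadata['url'] = local_path.replace(C4_GH_DOCS_STORAGE_DIR, 'https://github.com/code-423n4/docs/blob/main')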
    vectorstore_with_sources.add_documents([d])

In [422]:
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chat_models import ChatOpenAI


model = ChatOpenAI(model_name="gpt-4", temperature=0)

qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(model, chain_type="stuff", retriever=vectorstore_with_sources.as_retriever(), return_source_documents=True)


def run_qa_with_sources(question, use_rephrased_questions=False):

    rephrased_question = None

    if not use_rephrased_questions:
        result = qa_with_sources({"question": question}, return_only_outputs=True)
    else:
    
        rephrased_questions = generate_rephrased_questions(question)

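        # Attempt each question until a valid result is found
        result = None
        for q in rephrased_questions:
            result = qa_with_sources({"question": q}, return_only_outputs=True)
            # If the model is unable to find an answer, it returns 'sorry' in the response,
            # so we try again with a different question
            if 'sorry' not in result['answer'].lower():
                rephrased_question = q
                break

        # Fall back to the original question if no rephrased questions were generated
        if result is None:
            result = qa_with_sources({"question": question}, return_only_outputs=True)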

    answer = result['answer']
    source_ids = result['sources']
    source_docs = result['source_documents']

    source_urls = set()
    for d in source_docs:
        metadata = d.metadata
        source_id = metadata['source']
        url = metadata['url']
        if source_id in source_ids:
            source_urls.add(url)
    return dict(answer=answer, source_urls=source_urls, rephrased_question=rephrased_question, source_docs=source_docs)

def ask_with_sources(question, use_rephrased_questions=False):
    result = run_qa_with_sources(question, use_rephrased_questions)

    display(Markdown(f"### Question"))
    display(Markdown("ORIGINAL: " + question))
    display(Markdown("REPHRASED: " + (result['rephrased_question'] or 'None')))

    display(Markdown(f"### Answer"))
    display(Markdown(result["answer"]))

    display(Markdown(f"### Sources"))
    print(", ".join(result['source_urls']))
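
`run_qa_with_sources` matches source ids with a substring test against the `sources` value, which works because every id ends in `-pl`. If that value is a comma-separated string (an assumption about the chain's output, which is model-generated and can vary), the ids could instead be parsed once and compared exactly. `parse_source_ids` is a hypothetical helper.

```python
import re

def parse_source_ids(sources):
    """Split a sources string such as "w3-pl, g12-pl" into a set of exact ids."""
    return {s for s in re.split(r"[,\s]+", sources.strip()) if s}

cited = parse_source_ids("w3-pl, g12-pl")
print("w3-pl" in cited, "g1-pl" in cited)  # True False
```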

In [None]:
# Questions that were answered incorrectly by the Mava bot as per emoji reaction in the test channel
MAVA_MISANSWERED_QUES = [
    "what's a scout?",
    "Am I allowed to use AI in an audit?",
    "Can I change my Code4rena username?",
    "How do I book a solo audit?",
    "Do I need to be certified to participate in an audit?",
    "How do bot races work?",
    "Can I change my Code4rena profile name?",
    "What are scout awards?",
    "What are analysis reports?",
    "what is an analysis finding?",
    "My name wasn't in the award announcements. When can I check on my results?",
    "How long does the certification process take?",
    "How can I access findings.csv?",
    "Can I use chatgpt?"
]

for q in MAVA_MISANSWERED_QUES:
    ask_with_sources(q, use_rephrased_questions=False)

In [425]:
def auto_eval():
    bot_answers = []
    source_docs = []
    for d in yaml_data:
        result = run_qa_with_sources(d['question'])
        bot_answers.append(result['answer'])
        source_docs.append(result['source_docs'])
    
    predictions = [{'result': a} for a in bot_answers]

    answer_grades = grade_model_answer(yaml_data, predictions)

    retrieved_docs = []
    for i, d in enumerate(yaml_data):
        retrieved_doc_text = ""
        for j, doc in enumerate(source_docs[i]):
            retrieved_doc_text += "Doc %s: " % str(j + 1) + doc.page_content + " "
        retrieved = {"question": d["question"], "answer": d["answer"], "result": retrieved_doc_text}
        retrieved_docs.append(retrieved)

    retrieval_grades = grade_model_retrieval(yaml_data, retrieved_docs)

    df = pd.DataFrame({
        "question": [d['question'] for d in yaml_data],
        "Mava correct answer (True value)": [d['answer'] for d in yaml_data],
        "Bot answers": [p['result'] for p in predictions],
        "Retrieval relevancy score": ['Incorrect' if 'Incorrect' in g['results'] else 'Correct' for g in retrieval_grades],
        "Answer similarity score": ['Incorrect' if 'Incorrect' in g['results'] else 'Correct' for g in answer_grades]
    })
    # Comparing against 'Correct' avoids a KeyError when no answer is graded Correct
    print(f"Bot Accuracy: {(df['Answer similarity score'] == 'Correct').mean()}")
    
    return df

In [426]:
auto_eval()

Bot Accuracy: 0.7777777777777778


| | question | Mava correct answer (True value) | Bot answers | Retrieval relevancy score | Answer similarity score |
|---|---|---|---|---|---|
| 0 | Hi, how can I get backstage access? | To get backstage access, you need to become a ... | The documents provided do not contain informat... | Incorrect | Incorrect |
| 1 | how long does it take until findings are relea... | Based on the context provided, the findings fr... | The findings from the audit are typically rele... | Correct | Correct |
| 2 | When can I talk about findings? | You can talk about your findings after the con... | You can talk about findings after they have be... | Incorrect | Incorrect |
| 3 | How do I change my wallet address? | To change your wallet address, follow these st... | You can change your payment information at any... | Correct | Correct |
| 4 | What are scouts? | In the context of Code4rena, Scouts are indivi... | In the context of Code4rena, scouts are indivi... | Correct | Correct |
| 5 | How long does the contest process usually take? | Based on the provided context, the contest pro... | The contest process usually takes between 42 t... | Correct | Correct |
| 6 | how does certification work? | The certification process at Code4rena works i... | Certification works through a process where an... | Correct | Correct |
| 7 | Can I use bots to analyze code? | Yes, you can use bots to analyze code. In fact... | Yes, you can use bots to analyze code. Code4re... | Correct | Correct |
| 8 | What is a lookout? | In the context provided, a lookout is a role i... | In the context of Code4rena, a lookout is a ro... | Incorrect | Correct |
