# Code4rena (C4) Question Answer bot

### Objective
- This notebook contains the proof-of-concept (PoC) work for a question-answering bot built on C4's knowledge bases.
- The objective of the PoC is to prototype an LLM implementation that answers questions accurately enough to meet C4's expectations and, at the very least, performs better than their current bot from [Mava](https://www.mava.app/).

### Observations from the usage of Mava
- The platform offers Discord support management with ticketing and AI help bot features.
- For the AI help bot, the user can specify links to multiple knowledge sources that are used for answering questions.
- Based on C4's testing of the Mava bot in a private channel, the following stats were observed:
    - Total questions asked: 29
    - Total questions mis-answered based on emoji reactions: 13
    - Accuracy: ~55%
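
The accuracy figure follows from the two counts above, assuming that every question without a negative emoji reaction was counted as answered correctly (an assumption about how the ~55% was derived):

```python
# Counts taken from the Mava test channel stats above
total_questions = 29
mis_answered = 13

# Assumption: every question that was not flagged by an emoji reaction counts as correct
correct = total_questions - mis_answered
accuracy = correct / total_questions

print(f"{correct}/{total_questions} correct, accuracy = {accuracy:.1%}")
```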

### Knowledge Bases
Based on conversations with the C4 team, the following knowledge bases were identified as relevant. They are the same ones that Mava uses:
- [Main Website](https://code4rena.com/)
- [Docs](https://docs.code4rena.com/) 


### High-level Approach
- Crawl and scrape C4’s website and docs using the Scrapy library
- Convert the HTML content to Markdown so that the model can better understand the context
- Use the LangChain library to do the following:
    - Split the header-separated Markdown sections into semantic chunks
    - Embed the chunks and store them in an in-memory vector DB
    - Use retrieval-augmented generation to answer the question
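
To make the retrieval steps above concrete, here is a minimal, dependency-free sketch of the same flow. It is not the LangChain implementation used later in this notebook: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for the vector DB, and the sample chunks and all helper names are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, store, k=2):
    """Return the k chunks whose vectors are most similar to the question."""
    q_vec = embed(question)
    ranked = sorted(store, key=lambda item: cosine_similarity(q_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Hypothetical chunks standing in for the scraped Markdown sections
chunks = [
    "Wardens review code and submit findings during an audit.",
    "Sponsors create a prize pool to attract wardens to audit their project.",
    "Judges decide the severity and validity of findings.",
]

# "Vector DB": a list of (chunk, vector) pairs held in memory
store = [(chunk, embed(chunk)) for chunk in chunks]

top_chunks = retrieve("Who decides the severity of findings?", store, k=1)
print(top_chunks)
```

In the real pipeline, the retrieved chunks are then passed to the LLM as context alongside the question.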

In [None]:
# Install all the third-party packages

!pip install 'langchain[llms]'
!pip install Scrapy
!pip install html2text
!pip install lxml
!pip install python-dotenv
!pip install "unstructured[all-docs]"
!pip install tiktoken
!pip install faiss-cpu 
!pip install GitPython
!pip install notebook
!pip install chromadb

In [17]:
# General setup - you can specify OPENAI_API_KEY in .env file

import logging
from dotenv import load_dotenv
from IPython.display import display, Markdown, Latex

logging.getLogger().setLevel(logging.INFO)
load_dotenv()

True

In [None]:
import getpass
import os

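# Read the key from the environment (populated by load_dotenv above) or prompt for it
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY') or getpass.getpass('Enter your OpenAI API key: ')

assert OPENAI_API_KEY, "Please set OPENAI_API_KEY in your environment variables or enter it when prompted"

# Export the key so that the LangChain OpenAI clients below can read it even when it was typed in
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY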

### Crawling and Scraping using Scrapy

In [None]:
import os
import scrapy
import html2text
import lxml.html
import json
from urllib.parse import urlparse

C4_WEBSITE_STORAGE_DIR = "knowledge_base/c4/website"
C4_DOCS_STORAGE_DIR = "knowledge_base/c4/docs"

class GenericSpider(scrapy.Spider):
    name = 'generic'

    def __init__(self, domain='', storage_dir='.', *args, **kwargs):
        super(GenericSpider, self).__init__(*args, **kwargs)
        self.allowed_domains = [domain]
        self.start_urls = [f'http://{domain}/']
        self.storage_dir = storage_dir
    
    def parse(self, response):
        # Remove unwanted elements using lxml
        tree = lxml.html.fromstring(response.text)
        
        # Remove non-text related tags
        for unwanted in tree.xpath('//script|//img|//video|//audio|//iframe|//object|//embed|//canvas|//svg|//link|//source|//track|//map|//area'):
            unwanted.drop_tree()

        cleaned_html = lxml.html.tostring(tree).decode('utf-8')

        # Convert HTML to Markdown
        converter = html2text.HTML2Text()
        markdown_text = converter.handle(cleaned_html)

        # Save to a JSON file in the specified directory
        os.makedirs(self.storage_dir, exist_ok=True)

        url = response.url
        page_name = response.url.split("/")[-1] if response.url.split("/")[-1] else "index"

        filename = os.path.join(self.storage_dir, f'{page_name}.json')

        with open(filename, 'w') as f:
            # Store the URL and markdown text in JSON format
            json.dump({'url': url, 'md_content': markdown_text}, f)

        # Recursively follow links (relative or absolute) to other pages on the same domain
        for href in response.css('a::attr(href)').getall():
            url = response.urljoin(href)
            if urlparse(url).netloc in self.allowed_domains:
                yield scrapy.Request(url, self.parse)
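
One thing to watch in `parse` above: the file name is derived from the last URL segment only, so two pages whose URLs end in the same segment would overwrite each other, and any URL with a trailing slash collapses to `index`. A possible alternative, sketched here with a hypothetical helper `url_to_filename` (not part of the original spider), builds the name from the full path plus a short hash of the URL:

```python
import hashlib
import re
from urllib.parse import urlparse

def url_to_filename(url, extension=".json"):
    """Build a file name from the whole URL path so that pages do not overwrite each other."""
    parsed = urlparse(url)
    # Use the full path (not just the last segment); fall back to "index" for the root page
    path = parsed.path.strip("/") or "index"
    # Replace anything that is unsafe in a file name with a dash
    slug = re.sub(r"[^A-Za-z0-9._-]+", "-", path)
    # A short hash of the full URL keeps names unique even when query strings differ
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return f"{slug}-{digest}{extension}"

print(url_to_filename("https://docs.code4rena.com/roles/wardens"))
print(url_to_filename("https://docs.code4rena.com/"))
```

Since the URL itself is stored inside each JSON file, the loader further down does not depend on the file name, so a change like this should not affect the later cells.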


NOTE: The data has already been scraped and saved locally as JSON files in the 'knowledge_base/c4' directory. To re-run the scraping, uncomment the code in the cell below.

If you get a 'ReactorNotRestartable' error when re-running the crawler, restart the notebook kernel: the Twisted reactor cannot be started a second time within the same process.

In [None]:
# from scrapy.crawler import CrawlerRunner
# from scrapy.utils.project import get_project_settings
# from twisted.internet import reactor

# settings = get_project_settings()

# runner = CrawlerRunner(settings)
# runner.crawl(GenericSpider, domain="code4rena.com", storage_dir=C4_WEBSITE_STORAGE_DIR)
# runner.crawl(GenericSpider, domain="docs.code4rena.com", storage_dir=C4_DOCS_STORAGE_DIR)
# d = runner.join()
# d.addBoth(lambda _: reactor.stop())
# reactor.run()

### Retrieval Augmented Generation using LangChain

#### Load locally saved scraped data

In [8]:
import json
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import TextLoader

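def load_json_files(directory):
    """Load the scraped JSON files and unpack the Markdown content and source URL."""
    loader = DirectoryLoader(directory, loader_cls=TextLoader)
    documents = loader.load()
    for d in documents:
        # Each file holds {'url': ..., 'md_content': ...} as written by the spider
        page_content_dict = json.loads(d.page_content)
        d.page_content = page_content_dict['md_content']
        d.metadata['url'] = page_content_dict['url']
    return documents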

c4_website_data_list = load_json_files(C4_WEBSITE_STORAGE_DIR)
c4_docs_data_list = load_json_files(C4_DOCS_STORAGE_DIR)

#### Split the markdown content into semantic chunks

In [10]:
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language,
)

md_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN, chunk_size=2000, chunk_overlap=200
)


website_chunks = md_splitter.split_documents(c4_website_data_list)
docs_chunks = md_splitter.split_documents(c4_docs_data_list)

print(len(website_chunks))
print(len(docs_chunks))

89
97
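
As a rough illustration of what `chunk_size` and `chunk_overlap` mean, the sketch below uses a plain sliding window over characters. This is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators (for Markdown, things like headings and blank lines) rather than at fixed offsets, and the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows; each window repeats the
    last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(text, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.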


#### Embed the semantic chunks and store in an in-memory vector db

In [None]:
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# NOTE: At times, the OpenAI embedding service can fail intermittently and return erroneous values such as [NaN]. More info: https://github.com/langchain-ai/langchain/pull/7070

embeddings = OpenAIEmbeddings()
vectorstore = Chroma("langchain_store", embeddings)

vectorstore.add_documents(website_chunks)
vectorstore.add_documents(docs_chunks)
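
Given the NaN issue noted in the cell above, one defensive option is to validate embedding vectors before they are stored. The sketch below is only an illustration: `find_invalid_vectors` is a hypothetical helper, and it assumes the vectors are available as plain lists of floats (for example, the return value of `embeddings.embed_documents(texts)`).

```python
import math

def find_invalid_vectors(vectors):
    """Return the indices of vectors that are empty or contain NaN/inf values."""
    bad = []
    for index, vector in enumerate(vectors):
        if not vector or any(not math.isfinite(value) for value in vector):
            bad.append(index)
    return bad

# Hypothetical vectors; the second one mimics the intermittent [NaN] response
sample_vectors = [[0.1, 0.2, 0.3], [float("nan")], [0.0, -0.5, 0.25]]
invalid = find_invalid_vectors(sample_vectors)
print(invalid)
```

Chunks at the reported indices could then be re-embedded before being added to the store.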


#### Retrieval QA chain

In [39]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-4", temperature=0),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    return_source_documents=True,
)

def ask(question):
    """Run the QA chain and render the question, the answer and the source URLs."""
    result = qa({"query": question})
    display(Markdown("### Question"))
    display(Markdown(question))

    display(Markdown("### Answer"))
    display(Markdown(result["result"]))

    display(Markdown("### Sources"))
    sources = [r.metadata['url'] for r in result["source_documents"]]
    print(", ".join(sources))
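
Because several retrieved chunks can come from the same page, the printed source list may repeat a URL. If that is unwanted, the list can be de-duplicated while keeping the retrieval order. A small sketch, using a hypothetical helper name `unique_in_order`:

```python
def unique_in_order(items):
    """Remove duplicates while preserving first-seen order (dicts keep insertion order)."""
    return list(dict.fromkeys(items))

# Example URL list with a repeated page
sources = [
    "https://docs.code4rena.com/roles/wardens/submission-policy",
    "https://docs.code4rena.com/awarding/fairness-and-validity",
    "https://docs.code4rena.com/roles/wardens/submission-policy",
]
unique_sources = unique_in_order(sources)
print(", ".join(unique_sources))
```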

In [41]:
# Questions that were answered incorrectly by the Mava bot as per emoji reaction in the test channel
MAVA_MISANSWERED_QUES = [
    "what's a scout?",
    "Am I allowed to use AI in an audit?",
    "Can I change my Code4rena username?",
    "How do I book a solo audit?",
    "Do I need to be certified to participate in an audit?",
    "How do bot races work?",
    "Can I change my Code4rena profile name?",
    "What are scout awards?",
    "What are analysis reports?",
    "what is an analysis finding?",
    "My name wasn't in the award announcements. When can I check on my results?",
    "How long does the certification process take?",
    "How can I access findings.csv?"
]

In [42]:
for q in MAVA_MISANSWERED_QUES:
    ask(q)

### Question

what's a scout?

### Answer

A Scout in the context of Code4rena is a role that focuses on scoping and pre-audit intel. Currently, Scouts are hand-picked by the C4 team as it's a highly sensitive role.

### Sources

https://docs.code4rena.com/roles/certified-contributors/lookouts, https://docs.code4rena.com/structure/frequently-asked-questions, https://code4rena.com/how-it-works, https://code4rena.com/how-it-works


### Question

Am I allowed to use AI in an audit?

### Answer

Yes, you are allowed to use AI in an audit, but there are some restrictions. Code4rena runs a Bot Race at the start of each audit where wardens compete to see whose AI-driven bot can create the highest quality and most thorough audit report. The winning report is shared with all C4 wardens and all findings in the winning Bot Report will be declared publicly known issues, and therefore ineligible for awards. 

However, using the output of AI tools like ChatGPT, GPT-3, or other automated tools for audit submissions is highly discouraged as it often leads to a high ratio of nonsense submissions. Wardens may use automated tools as a first pass, and build on these findings to identify High and Medium severity issues. But, submissions based on automated tools will have a higher burden of proof for demonstrating to sponsors a relevant exploit path in order to be considered satisfactory.

### Sources

https://docs.code4rena.com/roles/wardens/submission-policy, https://docs.code4rena.com/awarding/fairness-and-validity, https://docs.code4rena.com/roles/wardens/submission-policy, https://docs.code4rena.com/awarding/incentive-model-and-awards


### Question

Can I change my Code4rena username?How do I book a solo audit?

### Answer

The text does not provide information on whether you can change your Code4rena username.

To book a solo audit, a project team member needs to click the "Get a quote" button on a warden's profile and share scoping details with the Code4rena team. Code4rena staff will then consult with the warden and project team to firm up scoping, pricing, and dates.

### Sources

https://docs.code4rena.com/structure/frequently-asked-questions, https://docs.code4rena.com/roles/wardens/solo-audits, https://code4rena.com/register, https://code4rena.com/register


### Question

Do I need to be certified to participate in an audit?

### Answer

Yes, to participate in an audit as a Certified Warden, you need to be certified. The certification process involves submitting the Certified Contributor Application form and providing necessary documents such as a local authority document that is less than 3 months old. Once your application is approved, you can participate in audits.

### Sources

https://docs.code4rena.com/roles/certified-contributors, https://docs.code4rena.com/roles/certified-contributors, https://docs.code4rena.com/roles/wardens, https://docs.code4rena.com/roles/wardens/solo-audits


### Question

How do bot races work?

### Answer

I'm sorry, but the provided context does not contain information on how bot races work.

### Sources

https://docs.code4rena.com/awarding/fairness-and-validity, https://docs.code4rena.com/roles/judges, https://docs.code4rena.com/roles/judges/how-to-judge-a-contest, https://docs.code4rena.com/roles/sponsors


### Question

Can I change my Code4rena profile name?

### Answer

The provided context does not include information on whether you can change your Code4rena profile name.

### Sources

https://docs.code4rena.com/roles/wardens/warden-auth, https://code4rena.com/help, https://code4rena.com/help, https://code4rena.com/contests/2023-05-chainlink-cross-chain-services-ccip-and-arm-network


### Question

What are scout awards?

### Answer

I'm sorry, but the provided context does not contain any information about "scout awards."

### Sources

https://docs.code4rena.com/philosophy/security-is-about-people, https://docs.code4rena.com/roles/certified-contributors/lookouts, https://docs.code4rena.com/awarding/incentive-model-and-awards/awarding-process, https://docs.code4rena.com/awarding/judging-criteria


### Question

What are analysis reports?

### Answer

Analysis reports are written submissions that outline the Wardens' analysis of the codebase as a whole, any observations or advice they have about architecture, mechanism, or approach, broader concerns like systemic risks or centralization risks, and the approach taken in reviewing the code. They also include new insights and learnings from the audit. These reports provide wardens with an opportunity to contribute value through high level insights and advice that aren't necessarily covered by specific bugs. Analyses are judged A/B/C, with the top Analysis selected for inclusion in the audit report.

### Sources

https://docs.code4rena.com/awarding/judging-criteria, https://docs.code4rena.com/awarding/incentive-model-and-awards, https://docs.code4rena.com/awarding/fairness-and-validity, https://docs.code4rena.com/awarding/incentive-model-and-awards


### Question

what is an analysis finding?

### Answer

An analysis is a written submission that outlines the Wardens' analysis of the codebase as a whole and any observations or advice they have about architecture, mechanism, or approach. It also includes broader concerns like systemic risks or centralization risks, the approach taken in reviewing the code, and new insights and learnings from the audit. Analyses are judged A/B/C, with the top Analysis selected for inclusion in the audit report. They provide wardens with an opportunity to contribute value through high level insights and advice that aren't necessarily covered by specific bugs.

### Sources

https://docs.code4rena.com/awarding/judging-criteria, https://docs.code4rena.com/awarding/incentive-model-and-awards, https://docs.code4rena.com/awarding/fairness-and-validity, https://docs.code4rena.com/structure/frequently-asked-questions


### Question

My name wasn't in the award announcements. When can I check on my results?

### Answer

Based on the audit timeline provided, the judging QA is completed and awards are announced between Day 25-34 after audit submissions close. If your name wasn't in the award announcements, you may want to wait until this period is over to check on your results. If you still don't see your award after this time, there might be other issues at play and you may need to contact the Code4rena Foundation for further assistance.

### Sources

https://docs.code4rena.com/awarding/incentive-model-and-awards/awarding-process, https://docs.code4rena.com/roles/wardens/warden-auth, https://docs.code4rena.com/awarding/incentive-model-and-awards/qa-gas-report-faq, https://docs.code4rena.com/structure/our-process


### Question

How long does the certification process take?

### Answer

Once you submit the Certified Contributor Application form, Provenance typically emails you within one business day. If you have all the available documents, the process can usually be completed within a day. However, it will take longer if you need to assemble the necessary documents.

### Sources

https://docs.code4rena.com/roles/certified-contributors, https://docs.code4rena.com/structure/our-process, https://docs.code4rena.com/roles/certified-contributors, https://docs.code4rena.com/roles/wardens


### Question

How can I access findings.csv?

### Answer

I'm sorry, but the provided context does not contain information on how to access findings.csv.

### Sources

https://docs.code4rena.com/structure/frequently-asked-questions, https://docs.code4rena.com/roles/wardens/submission-policy, https://docs.code4rena.com/roles/wardens/submission-policy, https://docs.code4rena.com/roles/wardens/warden-auth
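
Several answers above state that the retrieved context does not contain the requested information. To tally such cases across a larger question set without reading every answer, a simple phrase check could be used. This is a heuristic sketch only: the marker phrases are assumptions based on the wording seen in this run, and `is_unanswered` is a hypothetical helper.

```python
# Assumed marker phrases, taken from the wording of the answers in this run
NO_ANSWER_MARKERS = (
    "does not contain",
    "does not include information",
    "does not provide information",
)

def is_unanswered(answer):
    """Heuristically flag answers where the model said the context lacked the information."""
    lowered = answer.lower()
    return any(marker in lowered for marker in NO_ANSWER_MARKERS)

answers = [
    "I'm sorry, but the provided context does not contain information on how bot races work.",
    "Analyses are judged A/B/C, with the top Analysis selected for inclusion in the audit report.",
]
flags = [is_unanswered(a) for a in answers]
print(flags)
```

A check like this would also flag answers that are only partly unanswered, so flagged items would still need a manual look.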
