### High-level approach
- Parse the messages and serialize each one, along with any message it replies to, as `timestamp;author;message;reply_to_author;reply_to_message`
- Prompt an LLM to generate questions and answers, each with a timestamp and the author who supplied the answer
- To handle duplicate questions and keep the latest answer:
    - Add all the questions to the embeddings index
    - For each question, look for highly similar questions in the index, using a relevance score threshold of 0.9
    - For the similar questions, choose the one with the latest timestamp
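
The deduplication step in the last bullet can be illustrated without any external service. This is only a minimal sketch: it assumes small hand-written vectors in place of real embeddings, and `cosine_similarity` and `keep_latest` are hypothetical helpers that do not appear in the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def keep_latest(items, threshold=0.9):
    """Group near-duplicate questions and keep the newest one per group.

    Each item is a dict with 'question', 'vector' and 'epoch_time' keys
    (an assumed shape, chosen for this illustration).
    """
    seen = set()
    kept = []
    for i, item in enumerate(items):
        if i in seen:
            continue
        # Every not-yet-grouped question that is similar enough, including item itself
        group = [
            j for j, other in enumerate(items)
            if j not in seen
            and cosine_similarity(item["vector"], other["vector"]) >= threshold
        ]
        seen.update(group)
        kept.append(max((items[j] for j in group), key=lambda x: x["epoch_time"]))
    return kept


items = [
    {"question": "How are QA reports graded?", "vector": [1.0, 0.0], "epoch_time": 100},
    {"question": "How do judges grade QA reports?", "vector": [0.99, 0.1], "epoch_time": 200},
    {"question": "Where do I submit findings?", "vector": [0.0, 1.0], "epoch_time": 150},
]
for item in keep_latest(items):
    print(item["question"])
```

The first two toy questions are near-duplicates, so only the more recent of the two survives, while the unrelated third question is kept as-is.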

In [None]:
%pip install beautifulsoup4 lxml
%pip install matplotlib openai plotly pandas scipy scikit-learn python-dotenv langchain tiktoken chromadb

In [None]:
import os
from dotenv import load_dotenv
from getpass import getpass

load_dotenv()

openai_api_key = os.environ.get('OPENAI_API_KEY') or getpass('Enter your OpenAI API key: ')
os.environ['OPENAI_API_KEY'] = openai_api_key

In [None]:
# Read the exported chat log (HTML) into memory; it is parsed in the next section
with open('./Code4rena_-_Main_-_questions.html', 'r', encoding='utf-8') as file:
    html_content = file.read()


### Parse messages

In [None]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'html.parser')

message_groups = soup.find_all('div', class_='chatlog__message-group')

parsed_messages = []

for message_group in message_groups:
    author = message_group.find('span', class_='chatlog__author')
    if author:
        author = author.text
    else:
        author = ''
    timestamp = message_group.find('span', class_='chatlog__timestamp')
    if timestamp:
        timestamp = timestamp.text
    else:
        timestamp = ''
    markdown_preserves = message_group.find_all('span', class_='chatlog__markdown-preserve')
    message = '\n'.join([mp.text for mp in markdown_preserves])
    reply_to_author = message_group.find('div', class_='chatlog__reply-author')
    if reply_to_author:
        reply_to_author = reply_to_author.text
    else:
        reply_to_author = ''
    reply_to_message = message_group.find('span', class_='chatlog__reply-link')
    if reply_to_message:
        reply_to_message = reply_to_message.text
    else:
        reply_to_message = ''
    parsed_messages.append({'author': author, 'timestamp': timestamp, 'message': message, 'reply_to_author': reply_to_author, 'reply_to_message': reply_to_message})

len(parsed_messages)

### Choose the last N messages for testing

In [None]:
latest_messages = parsed_messages[-1000:]

### Generate message blocks using sliding window with overlap

In [None]:

message_blocks = []
step = 10
lookout_range = 5
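for i in range(0, len(latest_messages), step):
    # Context before the block; clamp at 0 so the slice never wraps around
    before = latest_messages[max(0, i - lookout_range):i]
    main_block = latest_messages[i:i + step]
    # Context after the block starts where the main block ends
    after = latest_messages[i + step:i + step + lookout_range]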

    message_block = []
    message_block.extend(before)
    message_block.extend(main_block)
    message_block.extend(after)

    message_blocks.append(message_block)
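
The intended windowing can be checked on a short list of integers. This is a minimal sketch, assuming each block should consist of a `step`-sized core plus up to `lookout_range` items of context on each side; `sliding_blocks` is a hypothetical helper name, not something the notebook defines.

```python
def sliding_blocks(items, step=10, lookout_range=5):
    """Split items into overlapping blocks: a core of `step` items
    plus up to `lookout_range` items of context before and after."""
    blocks = []
    for i in range(0, len(items), step):
        start = max(0, i - lookout_range)                 # context before the core
        end = min(len(items), i + step + lookout_range)   # context after the core
        blocks.append(items[start:end])
    return blocks


blocks = sliding_blocks(list(range(25)), step=10, lookout_range=5)
for block in blocks:
    print(block[0], "to", block[-1])
```

With 25 items, the three blocks cover items 0 to 14, 5 to 24, and 15 to 24, so neighbouring blocks share their context messages.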
        

### Format messages for prompt

In [None]:
formatted_msg_blocks = []
for mb in message_blocks:
    lines = ""
    
    for m in mb:
        timestamp = m['timestamp']
        author = m['author']
        message = m['message'].replace('"', "").replace("'", "")
        reply_to_author = m['reply_to_author']
        reply_to_message = m['reply_to_message'].replace('"', "").replace("'", "")
        formatted_message = f"{timestamp};{author};{message};{reply_to_author};{reply_to_message}"
        lines += formatted_message + "---"
        
    formatted_msg_blocks.append(lines)
len(formatted_msg_blocks)
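
Because `;` separates fields and `---` separates messages, message text that contains either sequence could make a line ambiguous to the model. The sketch below shows one possible way to neutralize both before formatting. `sanitize` and `format_message` are hypothetical helpers, and the replacement characters are an arbitrary choice made for this illustration.

```python
import re

FIELDS = ("timestamp", "author", "message", "reply_to_author", "reply_to_message")


def sanitize(text):
    """Remove sequences that collide with the field and message separators."""
    text = re.sub(r"-{3,}", "--", text)  # "---" is the message separator
    return text.replace(";", ",")        # ";" is the field separator


def format_message(m):
    """Serialize one parsed message dict into the five-field line format."""
    return ";".join(sanitize(m[field]) for field in FIELDS)


example = {
    "timestamp": "06/28/2023 5:40 PM",
    "author": "lsaudit",
    "message": "one; two --- three",
    "reply_to_author": "",
    "reply_to_message": "",
}
print(format_message(example))
```

After sanitizing, every line contains exactly four `;` characters, so the five fields can always be told apart.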

In [None]:
SYSTEM_PROMPT = """You are an intelligent analyst capable of looking at chat messages and generating questions and answers from it to create an FAQ.

- For your task, you have been given chat messages from an organization called Code4rena (a.k.a. C4) that specializes in crowdsourced smart contract audits.
- You are given chat messages below, each message is formatted as timestamp;author;message;reply_to_author;reply_to_message and separated by "---"
- To generate questions and answers, think step-by-step: first base them on the reply-to author and reply-to message; if those are not available, base them solely on the messages before and after.
- **DO NOT** use any follow-on questions as an answer to previous question.
- If a message seems like a casual conversation and unrelated to the general subject, skip it.
- If a question does not have a helpful answer, feel free to skip it.
- Rephrase the questions and answers to be professional, suitable enough to be used in a FAQ.
- Use the message timestamp from the author as the timestamp for the question and answer.
- Do not mention anything about the particular chat or author in the answer; it should be generic enough to be used in a FAQ.
- Any links mentioned in the messages are very important; please include them in the answer.
- Identify the true source author that contributed to the answer from the messages.
- Output the results as a JSON list with fields "timestamp", "question", "answer", "answer_source_author"
- **DO NOT** make up questions and answers, only use the chat messages as the source of truth.

## Example:
### Chat messages:
06/28/2023 5:39 PM;DadeKuma;thats old, it doesnt work like that anymore;lsaudit; according to that .cvs file, Low issues are ranked by uniquess too ---06/28/2023 5:40 PM;lsaudit;so if all As get the same award, no matter how many Low findings there are - why should auditors bother to put more than one Low findins in QA?\nif one Low finding is enough to be scored as A ?\nOr maybe Ill rephrase my question. Lets assume that there are only three QA reports. 1st reports issues: A, B, C, D. 2nd: B, C, D, E; 3rd: F. Can only one report be choosen for a final report?\nOr the report will merge: A, B, C, D, E, F. So 1st report will get bonus for A uniquness, and 3rd report, would get bonus for reporting F issue?;;---06/28/2023 5:47 PM;🦙 liveactionllama | C4;The info here might be helpful:\nhttps://docs.code4rena.com/awarding/judging-criteria#qa-reports-low-non-critical\nhttps://docs.code4rena.com/awarding/incentive-model-and-awards#qa-and-gas-optimization-reports\n\nJudges look at both quantity and quality when judging QA reports. If a wardens QA submission only had 1 item, it would be pretty unlikely to receive a high grade. Especially if other wardens QA submissions within that audit contained many high quality items in comparison.;lsaudit; so if all As get the same award, no matter how many Low findings there are - why should auditors bother to put more than one Low findins in QA?

### JSON result:
[
{{
"timestamp": "06/28/2023 5:40 PM",
"question": "Why should auditors bother to put more than one Low findings in QA if all As get the same award, no matter how many Low findings there are?",
"answer": "Judges look at both quantity and quality when judging QA reports. If a warden's QA submission only had 1 item, it would be pretty unlikely to receive a high grade. Especially if other wardens' QA submissions within that audit contained many high-quality items in comparison. More information can be found at https://docs.code4rena.com/awarding/judging-criteria#qa-reports-low-non-critical and https://docs.code4rena.com/awarding/incentive-model-and-awards#qa-and-gas-optimization-reports.",
"answer_source_author": "🦙 liveactionllama | C4"
}}
]

## Chat messages:
{chat_messages}

## JSON result:"""

from langchain.prompts import PromptTemplate

prompt = PromptTemplate(
    input_variables=["chat_messages"],
    template=SYSTEM_PROMPT,
)

In [None]:
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4",
    temperature=0,
)

In [None]:
from langchain.chains import LLMChain
import json
from langchain.callbacks import get_openai_callback

NUM_MESSAGE_BLOCKS = 10


chain = LLMChain(llm=llm, prompt=prompt)

qa_list = []
total_tokens = 0
total_cost = 0
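with get_openai_callback() as cb:
    for block in formatted_msg_blocks[:NUM_MESSAGE_BLOCKS]:
        result = chain.run(chat_messages=block)
        print(result)
        # The prompt asks for a JSON list, so each result extends the running list
        qa_list.extend(json.loads(result))
    # The callback accumulates usage across every call made inside this context,
    # so read the totals once after the loop instead of summing them per iteration
    total_tokens = cb.total_tokens
    total_cost = cb.total_cost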

print(f"Total tokens: {total_tokens}")
print(f"Total cost: {total_cost}")
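
Model output is not guaranteed to be a bare JSON list. It may come back as a single object or wrapped in a Markdown code fence, in which case `json.loads` raises an error or `extend` adds dictionary keys instead of records. The following is a minimal defensive sketch that assumes only those failure modes; `parse_qa_output` is a hypothetical helper, not part of LangChain.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker


def parse_qa_output(raw):
    """Return a list of Q&A dicts from raw model output, or [] if it cannot be parsed."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (it may carry a language tag) and the closing fence
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines)
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return []
    if isinstance(parsed, dict):
        # A single object instead of a list: wrap it
        return [parsed]
    if isinstance(parsed, list):
        return [item for item in parsed if isinstance(item, dict)]
    return []


print(parse_qa_output('{"question": "q", "answer": "a"}'))
print(parse_qa_output("not json"))
```

In the loop, `qa_list.extend(parse_qa_output(result))` could stand in for the direct `json.loads` call, at the cost of silently skipping blocks whose output cannot be parsed.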

### Generate LangChain Document objects from the resulting questions and answers

In [None]:
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.schema import Document
from datetime import datetime


docs = []
for i, qa in enumerate(qa_list):
    question = qa['question']
    doc = Document(page_content=question, metadata={
        'ques_id': i,
        'timestamp': qa['timestamp'],
        'epoch_time': int(datetime.strptime(qa['timestamp'], '%m/%d/%Y %I:%M %p').timestamp()),
        'answer': qa['answer'],
        'answer_source_author': qa['answer_source_author']
    })
    docs.append(doc)
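
The `epoch_time` field depends on every timestamp matching `%m/%d/%Y %I:%M %p`. The sketch below isolates that parse in a hypothetical `to_epoch` helper, which returns `None` instead of raising when the model emits a timestamp in a different shape. Note that `.timestamp()` on a naive `datetime` interprets it in the machine's local time zone, which is fine for ordering as long as all values are converted on the same machine.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%m/%d/%Y %I:%M %p"


def to_epoch(timestamp):
    """Convert a chat timestamp string to epoch seconds, or None if it does not parse."""
    try:
        return int(datetime.strptime(timestamp, TIMESTAMP_FORMAT).timestamp())
    except ValueError:
        return None


earlier = to_epoch("06/28/2023 5:39 PM")
later = to_epoch("06/28/2023 5:40 PM")
print(later - earlier)       # difference in seconds between the two messages
print(to_epoch("yesterday")) # unparseable input
```

Questions whose timestamp comes back as `None` could then be skipped or given a fallback value before they are added to the index.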

### Add Documents to the vector index

In [None]:
import chromadb

embeddings = OpenAIEmbeddings()
collection_name = "questions"

chroma = chromadb.Client()
# Start from a clean collection so that reruns do not accumulate duplicate questions
try:
    chroma.delete_collection(collection_name)
except Exception:
    # The collection does not exist yet, so there is nothing to delete
    pass

ques_db = Chroma(collection_name=collection_name, embedding_function=embeddings)
ques_db.add_documents(docs)

### Filter for latest questions

In [None]:
skip_question_ids = set()
final_qa_docs = []

for d in docs:
    q = d.page_content
    ques_id = d.metadata['ques_id']

    # Already grouped with an earlier near-duplicate
    if ques_id in skip_question_ids:
        continue

    # The query question is in the index too, so it normally appears in its own results
    results = ques_db.similarity_search_with_relevance_scores(q, k=4, score_threshold=0.9)
    latest_question = d
    for match, _score in results:
        skip_question_ids.add(match.metadata['ques_id'])
        # Keep whichever near-duplicate has the most recent timestamp
        if match.metadata['epoch_time'] > latest_question.metadata['epoch_time']:
            latest_question = match
    final_qa_docs.append(latest_question)

### Write the final results to JSON and Markdown files

In [None]:
qa_to_store = []
for d in final_qa_docs:
    qa_to_store.append({
        'question': d.page_content,
        'answer': d.metadata['answer'],
        'timestamp': d.metadata['timestamp'],
        'answer_source_author': d.metadata['answer_source_author']
    })

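# Make sure the output directory exists before writing
os.makedirs('./output', exist_ok=True)

with open('./output/faq.json', 'w', encoding='utf-8') as f:
    json.dump(qa_to_store, f, indent=4)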

with open('./output/faq.md', 'w', encoding='utf-8') as f:  # author names may contain emoji
    for i, qa in enumerate(final_qa_docs):
        question = qa.page_content
        answer = qa.metadata['answer']
        author = qa.metadata['answer_source_author']
        timestamp = qa.metadata['timestamp']
        f.write(f"#### {i+1}. {question}\n")
        f.write(f"{answer}\n\n")
        f.write(f"*Source Author: {author}*\n\n")
        f.write(f"*Source Timestamp: {timestamp}*\n\n")