# Interrogation Benchmark
Anecdotally, converting documents from raw text into question and answer pairs seems to yield better results when building [RAG applications](https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-foundation-models-customize-rag.html).

We created a basic benchmark using LangChain and FAISS to determine whether these performance improvements are real, and under what conditions.

In [None]:
! pip install langchain
! pip install doctran_openai
! pip install pandas

In [29]:
import pandas as pd
import csv
import json
import itertools
import os
from email import message_from_string
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.schema import Document
from langchain.vectorstores import FAISS
from doctran_openai import Doctran, ExtractProperty


In [71]:
from dotenv import load_dotenv

load_dotenv(".env")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
OPENAI_MODEL = "gpt-4"
OPENAI_TOKEN_LIMIT = 8000

### Load Email Dataset
The email dataset used in this notebook is not included because it is over 1.4GB. However, it can be downloaded from https://www.cs.cmu.edu/~./enron/.

In [72]:
filename = "emails.csv"
start_row = 1200
end_row = 1210

# newline="" lets the csv module handle line breaks inside quoted email bodies
with open(filename, "r", newline="") as f:
    reader = csv.reader(f)
    rows = list(itertools.islice(reader, start_row, end_row))
    emails = [message_from_string(row[1]) for row in rows]

print(emails[0])

Message-ID: <12536946.1075855667184.JavaMail.evans@thyme>
Date: Tue, 31 Oct 2000 07:00:00 -0800 (PST)
From: phillip.allen@enron.com
To: david.delainey@enron.com
Subject: 
Cc: john.lavorato@enron.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Bcc: john.lavorato@enron.com
X-From: Phillip K Allen
X-To: David W Delainey
X-cc: John J Lavorato
X-bcc: 
X-Folder: \Phillip_Allen_Dec2000\Notes Folders\All documents
X-Origin: Allen-P
X-FileName: pallen.nsf

Dave,

The back office is having a hard time dealing with the $11 million dollars 
that is to be recognized as transport expense by the west desk then recouped 
from the Office of the Chairman.    Is your understanding that the West desk 
will receive origination each month based on the schedule below.

 
 The Office of the Chairman agrees to grant origination to the Denver desk as 
follows:

October 2000  $1,395,000
November 2000 $1,350,000
December 2000 $1,395,000
January 2001  $   669,600
Fe

### Run emails through interrogation

In [73]:
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
doctran = Doctran(openai_api_key=OPENAI_API_KEY)
docs = []
for email in emails:
    text = email.as_string()
    doc = doctran.parse(content=text)
    docs.append(await doc.interrogate().execute())

In [91]:
print(json.dumps(docs[0].extracted_properties, indent=2))

{
  "questions_and_answers": [
    {
      "question": "Who is the email from?",
      "answer": "The email is from phillip.allen@enron.com."
    },
    {
      "question": "Who is the email to?",
      "answer": "The email is to david.delainey@enron.com."
    },
    {
      "question": "What is the issue the back office is having?",
      "answer": "The back office is having a hard time dealing with the $11 million dollars that is to be recognized as transport expense by the west desk then recouped from the Office of the Chairman."
    },
    {
      "question": "What is the Office of the Chairman agreeing to?",
      "answer": "The Office of the Chairman agrees to grant origination to the Denver desk as per the schedule mentioned in the email."
    },
    {
      "question": "What does the schedule represent?",
      "answer": "The schedule represents a demand charge payable to NBP Energy Pipelines by the Denver desk."
    },
    {
      "question": "Who has agreed to reimburse the w

## Benchmark with raw document as vectors
Vectorize as normal, indexing the emails based on vectors generated from the raw email strings.

In [83]:
regular_docs = []
for doc in docs:
    regular_docs.append(Document(page_content=doc.raw_content))
email_db = FAISS.from_documents(regular_docs, embeddings)
query = "What are common topics phillip allen talks to his colleagues about?"
results = email_db.similarity_search_with_score(query, k=20)

### ❌ Result
Vectors computed from the raw emails have high distance from the query vector (0.44 - 0.56)

In [84]:
df = pd.DataFrame(results, columns=["email", "distance"])
df

| | email | distance |
|---|---|---|
| 0 | `page_content='Message-ID: <14486500.1075863687...` | 0.439669 |
| 1 | `page_content='Message-ID: <5469026.10758556672...` | 0.468482 |
| 2 | `page_content='Message-ID: <12338129.1075855667...` | 0.49188 |
| 3 | `page_content='Message-ID: <27394834.1075855703...` | 0.492825 |
| 4 | `page_content='Message-ID: <31461370.1075855667...` | 0.502488 |
| 5 | `page_content='Message-ID: <20188486.1075855667...` | 0.511412 |
| 6 | `page_content='Message-ID: <12536946.1075855667...` | 0.513907 |
| 7 | `page_content="Message-ID: <558883.107585566726...` | 0.517571 |
| 8 | `page_content='Message-ID: <20513208.1075855667...` | 0.522639 |
| 9 | `page_content='Message-ID: <18752461.1075855665...` | 0.565841 |
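To help interpret the distance column, here is a minimal sketch. It assumes (the notebook does not state this) that the index built by `FAISS.from_documents` uses a squared Euclidean (L2) metric and that the embedding vectors have unit length. Under those assumptions, the distance relates to cosine similarity as `distance = 2 - 2 * cosine`. The vectors and helper names below are hypothetical toy values, not real embeddings.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def distance_to_cosine(distance):
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    return 1 - distance / 2


# Toy 3-dimensional "embeddings" (assumption: real embeddings are much larger)
query_vec = normalize([0.2, 0.9, 0.4])
doc_vec = normalize([0.1, 0.8, 0.6])

distance = squared_l2(query_vec, doc_vec)
similarity = cosine_similarity(query_vec, doc_vec)

# For unit vectors the two measures carry the same information
assert math.isclose(distance, 2 - 2 * similarity)
```

If those assumptions hold, a distance of 0.44 would correspond to a cosine similarity of 0.78, so lower distances mean the document vector points in a direction closer to the query vector.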


## Benchmark with questions and answers as vectors
Vectorize questions and index each email based on question/answer pairs generated from the email, instead of the raw email itself.

In [85]:
qa_docs = []
for doc in docs:
    qa_docs.extend([Document(page_content=json.dumps(qa), metadata={"raw_email": doc.raw_content}) for qa in doc.extracted_properties["questions_and_answers"]])
qa_db = FAISS.from_documents(qa_docs, embeddings)
query = "What are common topics phillip allen talks to his colleagues about?"
results = qa_db.similarity_search_with_score(query, k=20)

### 🎯 Result
Vectors computed from question/answer pairs have significantly lower distance from the query vector (0.32 - 0.46)

In [86]:
df = pd.DataFrame(results, columns=["qa_pair", "distance"])
df

| | qa_pair | distance |
|---|---|---|
| 0 | `page_content='{"question": "Who did Phillip Al...` | 0.326656 |
| 1 | `page_content='{"question": "What does Phillip ...` | 0.341358 |
| 2 | `page_content='{"question": "Who sent the email...` | 0.363514 |
| 3 | `page_content='{"question": "Who sent the email...` | 0.363665 |
| 4 | `page_content='{"question": "Who forwarded the ...` | 0.377645 |
| 5 | `page_content='{"question": "Who forwarded the ...` | 0.394742 |
| 6 | `page_content='{"question": "What is Phillip as...` | 0.398435 |
| 7 | `page_content='{"question": "What does Phillip ...` | 0.411117 |
| 8 | `page_content='{"question": "What does Phillip ...` | 0.411117 |
| 9 | `page_content='{"question": "What does Phillip ...` | 0.417999 |
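Because each email contributes several question/answer pairs, the same email can appear more than once in these results. The sketch below shows one possible way to collapse the hits back to one entry per source email, keeping the closest pair. The function name `best_hit_per_email` and the sample values are hypothetical and not part of the notebook.

```python
def best_hit_per_email(hits):
    """Collapse (qa_text, raw_email, distance) hits to one entry per raw email.

    Keeps the question/answer pair with the lowest distance for each email
    and returns (raw_email, qa_text, distance) tuples sorted by distance.
    """
    best = {}
    for qa_text, raw_email, distance in hits:
        if raw_email not in best or distance < best[raw_email][1]:
            best[raw_email] = (qa_text, distance)
    collapsed = [(email, qa, dist) for email, (qa, dist) in best.items()]
    return sorted(collapsed, key=lambda item: item[2])


# Hypothetical sample hits. In the notebook they could be built with:
# hits = [(doc.page_content, doc.metadata["raw_email"], score) for doc, score in results]
sample_hits = [
    ("qa pair 1", "email A", 0.41),
    ("qa pair 2", "email B", 0.33),
    ("qa pair 3", "email A", 0.36),
]
unique_emails = best_hit_per_email(sample_hits)
```

Keeping the raw email in `metadata` is what makes this mapping possible: retrieval happens on the question/answer text, while the original email is still available to pass downstream.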


## Benchmarking with additional questions

In [88]:
questions = [
    "What are common topics phillip allen talks to his colleagues about?",
    "What are the top concerns expressed in these emails?",
    "Who exhibits the most leadership in these emails?",
    "How would you describe the company culture based on these emails?",
    "Summarize the decisions made",
    "List all the people who were involved in the decision making process",
    "List the two most exciting events of 2023.",
    "Name the president of the United States in 1995.",
    "Which city has the highest population?",
    "How many legs does the city of San Francisco have?",
    "How many kilocalories does the subject line of this email thread contain?",
    "What is the melody of this email?",
]
def average_distance(hits):
    # Mean of the distance scores in a list of (document, distance) pairs
    return sum(score for _, score in hits) / len(hits)


results = []
for question in questions:
    email_results = email_db.similarity_search_with_score(question, k=20)
    qa_results = qa_db.similarity_search_with_score(question, k=20)
    results.append((question, average_distance(email_results), average_distance(qa_results)))
df = pd.DataFrame(results, columns=["question", "email_distance", "qa_distance"])
df

Unnamed: 0,question,email_distance,qa_distance
0,What are common topics phillip allen talks to ...,0.502671,0.41333
1,What are the top concerns expressed in these e...,0.486858,0.402318
2,Who exhibits the most leadership in these emails?,0.501326,0.399064
3,How would you describe the company culture bas...,0.502998,0.423434
4,Summarize the decisions made,0.563469,0.509776
5,List all the people who were involved in the d...,0.551474,0.497622
6,List the two most exciting events of 2023.,0.636214,0.574591
7,Name the president of the United States in 1995.,0.570765,0.513492
8,Which city has the highest population?,0.614459,0.573331
9,How many legs does the city of San Francisco h...,0.620565,0.575173
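One way to read the table above is to look at the gap between `email_distance` and `qa_distance` per question. The sketch below copies the values from the table and groups them by whether the emails could plausibly answer the question. That `answerable` flag is an assumption made for this illustration (rows 0 to 5 marked answerable, rows 6 to 9 not), not something the notebook records.

```python
from statistics import mean

# (row index, email_distance, qa_distance, answerable)
# Distances are copied from the table; `answerable` is an assumed label.
rows = [
    (0, 0.502671, 0.41333, True),
    (1, 0.486858, 0.402318, True),
    (2, 0.501326, 0.399064, True),
    (3, 0.502998, 0.423434, True),
    (4, 0.563469, 0.509776, True),
    (5, 0.551474, 0.497622, True),
    (6, 0.636214, 0.574591, False),
    (7, 0.570765, 0.513492, False),
    (8, 0.614459, 0.573331, False),
    (9, 0.620565, 0.575173, False),
]


def gap(row):
    """How much closer the question/answer index is than the raw email index."""
    _, email_distance, qa_distance, _ = row
    return email_distance - qa_distance


answerable_gaps = [gap(row) for row in rows if row[3]]
unanswerable_gaps = [gap(row) for row in rows if not row[3]]

print(f"mean gap, answerable questions:   {mean(answerable_gaps):.3f}")
print(f"mean gap, unanswerable questions: {mean(unanswerable_gaps):.3f}")
```

The gap is positive for every row, including the questions the emails cannot answer, which is worth keeping in mind when reading the conclusion.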


## Conclusion
Vectorizing the question/answer pairs generated from this email dataset consistently achieves a higher true positive rate than vectorizing the raw emails, as indicated by `qa_distance` being consistently lower than `email_distance` for questions whose answers were present in the dataset.

However, for queries whose answers were not present in the dataset ("List the two most exciting events of 2023.") or that were completely nonsensical ("How many legs does the city of San Francisco have?"), `qa_distance` was also lower than `email_distance`, so converting documents to question/answer pairs before vectorizing tends to yield more false positives.
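To illustrate how lower distances can turn into false positives, the sketch below applies a fixed distance cutoff to two rows from the benchmark table. The cutoff `MAX_DISTANCE = 0.6` is a hypothetical value chosen for illustration, and the averaged distances from the table are used as stand-ins for the distance of a single retrieved hit.

```python
MAX_DISTANCE = 0.6  # hypothetical cutoff, not taken from the notebook


def passes_cutoff(distance, max_distance=MAX_DISTANCE):
    """Return True when a hit is close enough to be returned to the user."""
    return distance <= max_distance


# Average distances copied from the benchmark table
observed = {
    "What are common topics phillip allen talks to his colleagues about?": {
        "email": 0.502671,
        "qa": 0.41333,
    },
    "List the two most exciting events of 2023.": {
        "email": 0.636214,
        "qa": 0.574591,
    },
}

decisions = {
    question: {index: passes_cutoff(distance) for index, distance in distances.items()}
    for question, distances in observed.items()
}

for question, decision in decisions.items():
    print(question, decision)
```

With this cutoff, both indexes return results for the answerable question, but only the question/answer index returns results for the 2023 question, which the emails cannot answer. A stricter cutoff for the question/answer index is one possible mitigation, at the cost of dropping some relevant hits.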

For use cases where users are expected to **ask questions** rather than provide instructions to an LLM, converting source documents into question and answer pairs is likely to yield more reliable results when performing vector retrieval. For example, this preprocessing technique is likely better suited for customer support or knowledge base search use cases, and less well suited for agent-based workflows.