In [4]:
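import os

# URL = 'https://arxiv.org/pdf/2403.18802.pdf'
# A leading `~` is not expanded automatically when a path is opened, so expand it explicitly.
URL = os.path.expanduser('~/Downloads/ud923-rosenblum-garfinkel-paper.pdf')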

# When extracting text from the PDF, INCLUDE_AT and EXCLUDE_AT determine where extraction
# starts and stops. For example, if INCLUDE_AT is 'Abstract' and EXCLUDE_AT is
# 'Acknowledgements', extraction starts at (and includes) the first occurrence of
# 'Abstract' and stops at (and excludes) the first occurrence of 'Acknowledgements'.
INCLUDE_AT = "ABSTRACT"
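# EXCLUDE_AT = "Acknowledgements"
# The marker has to match the extracted text exactly; the stray space in "C ONCLUSION" is
# presumably how the PDF's small-caps heading comes out of text extraction.
EXCLUDE_AT = "9 C ONCLUSION"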

# MODEL = 'gpt-3.5-turbo-0125'
# MODEL = 'gpt-4-0125-preview'
MODEL = 'gpt-4o-mini'
SYSTEM_MESSAGE = 'You are an AI assistant that gives detailed and intuitive explanations.'
MAX_TOKENS = None
TEMPERATURE = 0.8
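
The INCLUDE_AT / EXCLUDE_AT behavior described in the comment above can be sketched as follows. This is a minimal illustration, not the implementation of `clean_text_from_pdf`: the function name `trim_between` and the sample string are hypothetical, and searching for the exclude marker only after the include marker is an assumption made here so that an earlier mention (for example in a table of contents) does not cut the text short.

```python
from typing import Optional


def trim_between(text: str, include_at: Optional[str] = None,
                 exclude_at: Optional[str] = None) -> str:
    """Keep text from the first `include_at` (inclusive) to the next `exclude_at` (exclusive).

    Hypothetical helper for illustration; a marker that is not found is ignored.
    """
    start = 0
    if include_at:
        found = text.find(include_at)
        if found != -1:
            start = found
    end = len(text)
    if exclude_at:
        # Assumption: only look for the exclude marker after the include marker.
        found = text.find(exclude_at, start)
        if found != -1:
            end = found
    return text[start:end]


# Made-up sample text, standing in for the raw PDF extraction.
sample = "Title page\nABSTRACT\nBody text.\n9 C ONCLUSION\nClosing remarks."
trimmed = trim_between(sample, include_at="ABSTRACT", exclude_at="9 C ONCLUSION")
print(repr(trimmed))
```

Both markers are matched literally and case-sensitively in this sketch, which is why the marker strings need to be copied from the extracted text rather than from the rendered PDF.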

---

In [5]:
from IPython.display import Markdown, display
from source.library.pdf import clean_text_from_pdf, extract_text_from_pdf
from llm_workflow.openai import OpenAIChat, num_tokens, MODEL_COST_PER_TOKEN


def create_model():
    return OpenAIChat(MODEL, SYSTEM_MESSAGE, max_tokens=MAX_TOKENS, temperature=TEMPERATURE)

In [6]:
# download and extract text of pdf
text = extract_text_from_pdf(pdf_path=URL)
n_tokens = num_tokens(model_name=MODEL, value=text)
print(f"# of tokens: {n_tokens:,}")
print(f"Cost if input tokens == {n_tokens:,}:  ${MODEL_COST_PER_TOKEN[MODEL]['input'] * n_tokens:.3f}")
print(f"Cost if output tokens == {n_tokens:,}: ${MODEL_COST_PER_TOKEN[MODEL]['output'] * n_tokens:.3f}")

FileNotFoundError: [Errno 2] No such file or directory: '~/Downloads/ud923-rosenblum-garfinkel-paper.pdf'
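
The traceback above is consistent with the `~` in the path being passed through literally: Python's file APIs do not expand it the way a shell does. A minimal sketch of one way to normalize the source before extraction is shown below. The helper name `resolve_pdf_source` is hypothetical, and the assumption that remote sources start with `http://` or `https://` is made here for illustration only.

```python
import os


def resolve_pdf_source(source: str) -> str:
    """Expand a leading '~' in local paths; leave URLs untouched (hypothetical helper)."""
    if source.startswith(("http://", "https://")):
        return source
    return os.path.expanduser(source)


local = resolve_pdf_source("~/Downloads/ud923-rosenblum-garfinkel-paper.pdf")
remote = resolve_pdf_source("https://arxiv.org/pdf/2403.18802.pdf")
print(local)
print(remote)
```

The resolved value could then be passed as `pdf_path`; whether the file actually exists at that location still has to be checked separately.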

In [18]:
# Remove text before `INCLUDE_AT` and from `EXCLUDE_AT` onward
chars_before = len(text)
print(f"{chars_before:,} characters before")
text = clean_text_from_pdf(text=text, include_at=INCLUDE_AT, exclude_at=EXCLUDE_AT)
chars_after = len(text)
print(f"{chars_after:,} characters after")
print(f"Removed {abs((chars_after - chars_before) / chars_before):.2%} of text")
print("Preview:\n---\n")
print(text[:500])

191,545 characters before
32,612 characters after
Removed 82.97% of text
Preview:
---

ABSTRACT

Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model’s long-form factuality in open domains, we first use GPT-4 to generate

LongFact, a prompt set comprising thousands of questions spanning 38 topics.

We then propose that LLM agents can be used as automated evaluators for long- form factuality through a method which we call Search-Augmented Factuality

Evaluator (SAFE). SAFE


In [19]:
n_tokens = num_tokens(model_name=MODEL, value=text)
print(f"# of tokens: {n_tokens:,}")
print(f"Input cost if input tokens == {n_tokens:,}:   ${MODEL_COST_PER_TOKEN[MODEL]['input'] * n_tokens:.3f}")
print(f"Output cost if output tokens == {n_tokens:,}: ${MODEL_COST_PER_TOKEN[MODEL]['output'] * n_tokens:.3f}")

# of tokens: 7,906
Input cost if input tokens == 7,906:   $0.079
Output cost if output tokens == 7,906: $0.237


In [10]:
model = create_model()
prompt = f"""
Rewrite this entire paper and each individual section using concise, simple, and intuitive language. Remove unnecessary jargon but keep relevant and important terminology. Write at least 5-7 sentences for each section. Keep all important details, insights, and key points, while removing or simplifying immaterial details. The point of view should still be the authors', and the rewrite should contain all relevant information while being much easier to understand.

Here is the paper:

{text}
"""

response = model(prompt)
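# Write with an explicit encoding so non-ASCII characters in the response are saved consistently
with open('summary.txt', 'w', encoding='utf-8') as f:
    f.write(response)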
cost = model.cost
print(f"Cost: ${cost:.3f}")
display(Markdown(response))

Cost: $0.102


**Simplified Paper Overview**

**Abstract Simplified**
We created a new way to test if large language models (like GPT-4) can provide accurate, long answers across a variety of topics. We made a big list of questions (called LongFact) covering 38 topics. Then, we introduced a new method (named SAFE) that uses a language model to check if the answers are true by searching the web. We also came up with a scoring system that considers both the number of correct facts and the desired length of an answer. Our tests showed that this method is better and much cheaper than asking people to check the answers. We also found that bigger language models tend to give more accurate long answers.

**Introduction Simplified**
Even though large language models have gotten really good, they often make mistakes when giving detailed answers. They might get dates wrong or make up things about well-known people. We aim to improve how we test these models for accuracy by introducing a new set of questions (LongFact), a checking method (SAFE), and a scoring system that looks at answer accuracy and length. Our tests show that our method works well and is much cheaper than using people to check answers.

**LongFact Simplified**
There aren't many good tests for seeing if language models can provide long, accurate answers on a wide range of topics. So, we made LongFact by asking GPT-4 to come up with lots of questions that need detailed answers. These questions cover 38 different topics. This is the first big test of its kind for checking detailed answer accuracy across many subjects.

**SAFE Simplified**
Judging the accuracy of long answers is tricky. We suggest focusing on checking each fact in an answer separately and using web searches to see if they're true. Our method, SAFE, breaks down answers into facts, decides if they're relevant, and then checks them online. This approach helps us accurately judge if a detailed answer is true or not.

**Comparing SAFE to Human Checkers Simplified**
When we compared our SAFE method to answers checked by humans, SAFE agreed with the human checkers 72% of the time. In cases where they disagreed, SAFE was right 76% of the time. Also, SAFE was more than 20 times less expensive than using human checkers. This shows that our method not only works well but can also save a lot of money.

**F1@K: A New Scoring System Simplified**
We want detailed answers to be both accurate and comprehensive. Our new scoring system, F1@K, helps us measure both by considering how many facts in an answer are correct and how the length of the answer matches what we're looking for. This helps us better understand how well a model can provide detailed, accurate information.

**Testing Larger Language Models Simplified**
We tested 13 different language models to see how well they could provide detailed, accurate answers using our new methods. Bigger models generally did better, giving more accurate and comprehensive answers. This suggests that as models get larger, they might also become more reliable in giving long, detailed answers.

**Related Work and Limitations Simplified**
Other studies have tried to test models' accuracy with short answers or on specific topics. Our work aims to check accuracy in long, detailed answers across many topics. While our method is a big improvement, it relies on the ability of language models to follow instructions and on the assumption that web searches can always provide the right facts to check an answer. There might also be cases where Google Search doesn’t have all the answers, so there's still room for improvement.
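
The summary above describes F1@K only informally. The sketch below assumes the definition used in the summarized paper: precision is the share of supported facts among all checked facts, recall is the number of supported facts divided by a preferred count K and capped at 1, and the score is their harmonic mean (0 when no fact is supported). The function name and the example counts are made up for illustration.

```python
def f1_at_k(supported: int, not_supported: int, k: int) -> float:
    """F1@K under the assumed definition: harmonic mean of precision and capped recall."""
    if supported <= 0:
        return 0.0
    precision = supported / (supported + not_supported)
    recall = min(supported / k, 1.0)  # K is the preferred number of supported facts
    return 2 * precision * recall / (precision + recall)


# Illustrative numbers only: 30 supported facts, 10 unsupported, K = 64.
score = f1_at_k(supported=30, not_supported=10, k=64)
print(f"F1@64 = {score:.4f}")
```

Under this definition a response gains nothing from supported facts beyond K, while every unsupported fact still lowers precision.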

In [11]:
print(f"Total Cost:            ${model.cost:.5f}")
print(f"Total Tokens:          {model.total_tokens:,}")
print(f"Total Prompt Tokens:   {model.input_tokens:,}")
print(f"Total Response Tokens: {model.response_tokens:,}")

Total Cost:            $0.10213
Total Tokens:          8,753
Total Prompt Tokens:   8,023
Total Response Tokens: 730
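
The totals above can be reconciled by hand. The sketch below assumes per-token prices of $10 per million input tokens and $30 per million output tokens; these are inferred from the printed estimates (for example, $0.079 for 7,906 input tokens), not read from `MODEL_COST_PER_TOKEN`, whose contents are not shown in this notebook. The names `PRICE_PER_TOKEN` and `usage_cost` are hypothetical.

```python
# Assumed prices in USD per token, inferred from the printed figures in this notebook.
PRICE_PER_TOKEN = {"input": 10 / 1_000_000, "output": 30 / 1_000_000}


def usage_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one call: each token count times its per-token price."""
    return (input_tokens * PRICE_PER_TOKEN["input"]
            + output_tokens * PRICE_PER_TOKEN["output"])


# Token counts copied from the printed totals above.
total = usage_cost(input_tokens=8_023, output_tokens=730)
print(f"Total Tokens: {8_023 + 730:,}")
print(f"Total Cost:   ${total:.5f}")
```

With these assumed rates, 8,023 × $0.00001 + 730 × $0.00003 = $0.08023 + $0.02190 = $0.10213, which matches the printed total.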


---

In [None]:
prompt = "Do the authors discuss the specifics or different scenarios of when not to use cosine similarity?"
response = model(prompt)
cost = model.cost
print(f"Cost: ${cost:.3f}")
display(Markdown(response))

---