<a href="https://colab.research.google.com/github/shallotly/news-eval-cookbook/blob/main/Info_Extraction_Scenario_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Journalism Benchmark Cookbook: Information Extraction**
This notebook demonstrates how to evaluate an information extraction task, using a scenario that involves **textual information extraction** from **unstructured PDF files**.

This notebook can be copied and modified to fit similar evaluation scenarios. For other specific information extraction tasks, see here (link under construction).

# Evaluation Overview

*   **What is the use scenario?** In this use case, we aim to test the *information extraction* abilities of generative AI using a scenario in [MuckRock's reporting on governmental agencies' AI usage](https://www.muckrock.com/news/archives/2024/nov/26/foia-annual-reports-crowdsourced-ai/). This story examines the use of AI to assist with fulfilling FOIA requests in government agencies by looking at [Chief FOIA Officer Reports](https://www.documentcloud.org/documents/?q=%2Bproject%3Achief-foia-officer-report-219382%20) from 2023 and 2024, specifically looking for the answer to the question: "Does your agency currently use any technology to automate record processing? For example, does your agency use machine learning, predictive coding, technology assisted review or similar tools to conduct searches or make redactions?"

*   **What is the dataset?** The dataset can be accessed at: https://github.com/MuckRock/crowdsource-foia-ai-reports. MuckRock enlisted volunteers to read through these reports and annotate how each report describes the agency's use of AI. [Below](#data), we provide a brief explanation of the data fields.

*   **How is the test case set up?**
We use generative AI to extract information such as the year, the agency name, and the freeform text describing AI usage from [PDF reports](https://www.documentcloud.org/documents/?q=%2Bproject%3Achief-foia-officer-report-219382%20) submitted by agencies and obtained by MuckRock. <br> We prompt each system with a link to the report PDF and a text prompt: *Given the FOIA Officer report, please extract information about the year of the report, responding agency, whether the agency used machine learning or AI. If the agency reports using AI or machine learning, include the text in the report describing the agency's use of AI. Please generate output following this example json format: {'year': (string), 'agency': (string), 'ai_use': (boolean), 'original_text': (string)}*
*   **How are we assessing the performance of gen AI?** For fields such as year and agency name, we calculate an accuracy score by matching the model's answer against the ground truth (collected by MuckRock through crowdsourcing). For freeform text extraction, we measure the overlap between the model's response and the ground truth, relative to the length of the ground truth. Since the ground truth answers are crowdsourced and provided by users of MuckRock, we also compare a few ground truth entries that express uncertainty with the model's response for the same documents.


<a name="data"></a>
# 1. Data

The dataset contains the following relevant fields:

  * **datum**: documentcloud link to the pdf file (*e.g. https://www.documentcloud.org/documents/25178374-doc-2024-chief-foia-officer-report2*)
  * **id**: documentcloud file id (e.g. *25178374*)
  * **Which year is this report from?**: volunteer's response to which year the report is from. [*2023/2024/unknown*]
  * **Which government office is this report from?**: volunteer's response for the government agency that produced the report (e.g. *U.S. Department of Commerce*)
  * **Did the government office report that they do use machine learning or AI?**: volunteer's response for whether the government agency reported using AI or not [*yes/no/unknown*]
  * **If the government office does use AI or machine learning, copy and paste the text from the report that describes the office's use of the technology.**: volunteer's chosen excerpt from the document describing AI usage by the agency. (e.g. *We began using Relativity for deduplication.*)

In total, the dataset has 120 rows, corresponding to 120 reports produced by government agencies, each reviewed by 1 volunteer. This dataset is open sourced and used by MuckRock for a series of stories. Note that the story editor did not verify the volunteers' annotations beyond face value, and that the dataset contains ambiguous answers, which are excluded from the evaluation below.
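The exclusion of ambiguous answers can be sketched as follows. This is a minimal illustration rather than the notebook's own filtering code: the helper name `is_valid_year`, the constant `VALID_YEARS`, and the toy rows are hypothetical, and the sketch assumes that only `2023` and `2024` count as unambiguous answers for the year field.

```python
# Hypothetical toy rows standing in for volunteer annotations.
rows = [
    {"id": 1, "year": "2024"},
    {"id": 2, "year": "unknown"},
    {"id": 3, "year": None},  # missing answer
    {"id": 4, "year": "2023"},
]

# Assumption: any value outside this set is treated as ambiguous.
VALID_YEARS = {"2023", "2024"}

def is_valid_year(value):
    """Return True when the annotation is an unambiguous report year."""
    return isinstance(value, str) and value.strip() in VALID_YEARS

# Keep only the rows whose year annotation can serve as ground truth.
valid_rows = [row for row in rows if is_valid_year(row["year"])]
print(len(valid_rows), "of", len(rows), "rows have an unambiguous year")
```

The same idea applies to the other fields: a row only contributes to a field's accuracy when its annotation for that field is usable.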

## 1.1 Data Code
The following code loads the data from GitHub. The `datum` field is used to scrape the URL of each PDF file, which serves as model input later.

In [None]:
!curl -O https://raw.githubusercontent.com/MuckRock/crowdsource-foia-ai-reports/refs/heads/main/data/manual/results.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 84327  100 84327    0     0   351k      0 --:--:-- --:--:-- --:--:--  353k


In [None]:
#!curl -O https://raw.githubusercontent.com/MuckRock/crowdsource-foia-ai-reports/refs/heads/main/data/manual/hand_annotated_flags.csv

In [None]:
from bs4 import BeautifulSoup
import pandas as pd
import requests

# load dataset containing links to pdfs
df = pd.read_csv('results.csv')

tasks = []
for index, row in df.iterrows():
  task = {}

  # add id
  task['id'] = row['id']

  # get pdf url
  url = row['datum']
  html = requests.get(url)
  soup = BeautifulSoup(html.text, 'html.parser')
  for link in soup.find_all('a'):
    href = link.get('href')
    # skip anchors without an href; keep the link that points to the pdf
    if href and href.endswith("pdf"):
      task["url"] = href

  # create groundtruth
  groundtruth = {
      'year': row['Which year is this report from?'],
      'agency': row['Which government office is this report from?'],
      'ai_use': row['Did the government office report that they do use machine learning or AI?'],
      'original_text': row["If the government office does use AI or machine learning, copy and paste the text from the report that describes the office's use of the technology."]
  }
  task['groudtruth'] = groundtruth

  # append to list
  tasks.append(task)

print(tasks[0])

{'id': 25178374, 'url': 'https://s3.documentcloud.org/documents/25178374/doc-2024-chief-foia-officer-report2.pdf', 'groudtruth': {'year': '2024', 'agency': 'U.S. Department of Commerce', 'ai_use': 'Yes', 'original_text': 'The Department’s technological priorities for the FOIA program over Fiscal Year 2024 include exploring possible artificial intelligence applications towards more efficient and timely processing while keeping transparency at the forefront and eDiscovery solutions for the FOIA program.\n\n Also, some components began using the EDR module within FOIAXpress to assist with responsive reviews of record sets found responsive to requests.'}}




---



# 2. Task Details
The scenario above outlines an *information extraction task*. This task is for the AI model to correctly extract specific information or excerpts from unstructured PDF documents and classify the information.

This scenario fits a specific journalistic task context: it is an *internal task*, meaning its output is used by internal newsroom staff, and it draws on *unstructured sources* as opposed to structured data such as the row-column format of a CSV file. The scenario focuses on textual data stored in PDF files.

We summarize the task context here:

| Task Context | Value |
| :----------- | :---- |
| Task Usage   | Internal Usage (not public facing) |
| File Format  | PDF |
| Modality     | Text |
| Data Structure | Unstructured |

## 2.1 Task Code   
The following code blocks first define a textual prompt for the models and then use the [OpenRouter API](https://openrouter.ai/docs/quickstart#using-the-openrouter-api-directly) to collect model responses.

The prompt is sent with each document to 6 different models (`openai/gpt-5`, `openai/gpt-5-mini`, `openai/gpt-oss-20b:free`, `anthropic/claude-opus-4.1`, `deepseek/deepseek-r1-0528:free`, `meta-llama/llama-3.3-70b-instruct`). The code uses OpenRouter's [file parser](https://openrouter.ai/docs/features/multimodal/pdfs) to include the PDF file in each prompt, and provides a template for [structured output](https://openrouter.ai/docs/features/structured-outputs).

To authenticate with the OpenRouter API using your own API key, follow [this authentication guide](https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Authentication.ipynb#scrollTo=yeadDkMiISin) to create a separate name-value pair for OpenRouter; the code below reads the key stored under the name `openrouter_api`. The total cost of running 120 prompts on 5 models came out to around $20.

The last two code blocks load the model outputs as Python `dict` objects for use in the next section and export the outputs as a JSON file.
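For readers running this outside Colab, the authentication step described above could be adapted along these lines. This is only a sketch: the helper `build_headers` and the environment variable name `OPENROUTER_API_KEY` are hypothetical choices for illustration, not something the notebook or OpenRouter prescribes.

```python
import os

def build_headers(api_key=None):
    """Build request headers from an explicit key or from the environment."""
    # OPENROUTER_API_KEY is a hypothetical variable name chosen for this sketch.
    key = api_key or os.environ.get("OPENROUTER_API_KEY")
    if not key:
        raise RuntimeError("No API key found; pass api_key or set OPENROUTER_API_KEY.")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }

# Example with a placeholder key; never hard-code a real key in a notebook.
headers_example = build_headers("sk-placeholder")
print(sorted(headers_example))
```

Failing early when no key is present avoids sending a batch of requests that would all be rejected.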

In [None]:
prompt = "Given the FOIA Officer report, please extract information about the year of the report, responding agency, whether the agency used machine learning or AI. If the agency reports using AI or machine learning, include the text in the report describing the agency's use of AI. Please generate output following this example json format: {'year': (string), 'agency': (string), 'ai_use': (boolean), 'original_text': (string)}"

In [None]:
import json

with open('outputs_8:20.json','r') as f:
  tasks = json.load(f)

In [None]:
models = ["openai/gpt-5","openai/gpt-5-mini", "openai/gpt-oss-20b:free", "anthropic/claude-opus-4.1", "deepseek/deepseek-r1-0528:free", "meta-llama/llama-3.3-70b-instruct"]

In [None]:
from google.colab import userdata
import requests
import json
import tqdm

url="https://openrouter.ai/api/v1/chat/completions"
headers = {
    "Authorization": f"Bearer {userdata.get('openrouter_api')}",
    "Content-Type": "application/json"
}
plugins = [
  {
    "id": "file-parser",
    "pdf": {
        "engine": "pdf-text" # there is an option to natively input file, but not all models we are testing have file processing capabilities
    }
  }
]

response_format ={
    "type": "json_schema",
    "json_schema": {
      "name": "report_annotate",
      "strict": True,
      "schema": {
        "type": "object",
        "properties": {
          "year": {
            "type": "string",
            "description": "year of the report"
          },
          "agency": {
            "type": "string",
            "description": "name of the agency"
          },
          "ai_use": {
            "type": "boolean",
            "description": "whether the report indicate use of AI"
          },
          "original_text": {
            "type": "string",
            "description": "excerpt from the report indicating use of AI"
          }
        },
        "required": ["year", "agency", "ai_use", "original_text"],
        "additionalProperties": False
      }
    }
  }

for model in models:
  print(f"currently prompting {model}")
  for task in tqdm.tqdm(tasks[87:]):
    #print(f"task #{tasks.index(task)}")
    messages=[
        {
          "role": "user",
          "content": [
                {
                    "type": "text",
                    "text": prompt,
                },
                {
                    "type": "file",
                    "file": {
                      "filename": "document.pdf",
                      "file_data": task['url']
                      },
                },
              ],
        }
    ]
    payload = {
      "model": model,
      "messages": messages,
      "response_format": response_format,
      "plugins": plugins
    }
    try:
      response = requests.post(url, headers=headers, data=json.dumps(payload))
      raw_response = response.json()
      task[model+'_output'] = raw_response['choices'][0]['message']['content']
    except KeyError:
      print(payload)
      print(response.json())
    except json.JSONDecodeError:
      print(payload)
      print(response)


currently prompting openai/gpt-5


100%|██████████| 33/33 [14:38<00:00, 26.61s/it]


In [None]:
import re

for model in models:
  for task in tqdm.tqdm(tasks):
    if task[model+'_output'].startswith("{") and task[model+'_output'].endswith("}"):
      task[model+"_json"] = json.loads(task[model+'_output'])
    else:
      try:
        matches = re.findall(r"\{(.*?)\}", task[model+'_output'], re.S)
        task[model+"_json"] = json.loads('{'+matches[0]+'}')
      except IndexError:
        print("index error")
        print(model)
        print(tasks.index(task))
        task[model+"_json"] = {}

100%|██████████| 120/120 [00:00<00:00, 102341.70it/s]
100%|██████████| 120/120 [00:00<00:00, 51077.38it/s]
100%|██████████| 120/120 [00:00<00:00, 95560.37it/s]
100%|██████████| 120/120 [00:00<00:00, 40906.74it/s]
100%|██████████| 120/120 [00:00<00:00, 94466.31it/s]
100%|██████████| 120/120 [00:00<00:00, 76480.24it/s]

index error
meta-llama/llama-3.3-70b-instruct
33
index error
meta-llama/llama-3.3-70b-instruct
43
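The parsing step above can fail when a response is missing or is not clean JSON. A more defensive variant might look like the sketch below. The helper name `parse_model_output` is hypothetical, and the fallback (taking the span from the first `{` to the last `}`) is an assumption that differs slightly from the non-greedy pattern used in the notebook.

```python
import json
import re

def parse_model_output(raw):
    """Best-effort parse of a model response into a dict; returns {} on failure."""
    if not isinstance(raw, str):
        return {}  # e.g. the API call failed and no output was stored
    text = raw.strip()
    # First try the whole response, which succeeds for clean structured output.
    try:
        parsed = json.loads(text)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass
    # Otherwise fall back to the span from the first '{' to the last '}',
    # which covers JSON wrapped in prose or in a markdown fence.
    match = re.search(r"\{.*\}", text, re.S)
    if match is None:
        return {}
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}

clean = parse_model_output('{"year": "2024", "ai_use": true}')
wrapped = parse_model_output('Here is the answer:\n{"year": "2023", "ai_use": false}\nDone.')
broken = parse_model_output("no json here")
print(clean, wrapped, broken)
```

Returning an empty `dict` keeps the later scoring loops simple, since a missing key is then counted as an incorrect answer.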







---



# 3. Evaluation Metrics
We focus on evaluating the *accuracy* of the models against annotations provided by volunteers on the FOIA report documents.

*   For "year," "government agency name," and "whether the agency uses AI," we compare the generated answer against the volunteer-provided ground truth. The year must match exactly; the agency name counts as correct when the volunteer's answer appears (case-insensitively) within the generated name; and the generated boolean for AI use must agree with the volunteer's Yes/No answer. Accuracy ranges from 0 to 1, where 0 means that none of the system's answers matched the volunteers'.
*   For the excerpt extracted from the original report that describes the use of AI, we compute a longest-common-substring score against the volunteers' answers. For each document, we divide the length of the longest common substring between the generation and the ground truth by the length of the ground truth, and we sum these ratios per system. A system's score can therefore be at most the number of documents with a ground truth excerpt.
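The two metrics described in this list can be illustrated on toy inputs. The sketch below uses hypothetical helper names (`exact_match_accuracy`, `lcs_ratio`) and made-up strings. It also passes `autojunk=False` to `SequenceMatcher`, which is an assumption of this sketch: the default heuristic treats very frequent characters in long strings specially, which can change the match that is found.

```python
from difflib import SequenceMatcher

def exact_match_accuracy(truths, predictions):
    """Share of predictions identical to their ground truth (0 to 1)."""
    correct = sum(1 for truth, pred in zip(truths, predictions) if truth == pred)
    return correct / len(truths)

def lcs_ratio(groundtruth, generated):
    """Longest common substring length divided by the ground truth length."""
    if not groundtruth:
        return 0.0
    matcher = SequenceMatcher(None, groundtruth, generated, autojunk=False)
    match = matcher.find_longest_match(0, len(groundtruth), 0, len(generated))
    return match.size / len(groundtruth)

# Toy year answers: two of three match exactly.
year_accuracy = exact_match_accuracy(["2023", "2024", "2024"], ["2023", "2024", "2023"])

# Toy excerpts: the first generation contains the full ground truth,
# the second shares only the substring "abc" with it.
full_overlap = lcs_ratio(
    "We began using Relativity for deduplication.",
    "The agency stated: We began using Relativity for deduplication.",
)
partial_overlap = lcs_ratio("abcdef", "xxabcxx")
print(year_accuracy, full_overlap, partial_overlap)
```

Because the ratio is normalized by the ground truth length, a generation that quotes extra surrounding text is not penalized, while one that paraphrases the excerpt scores low.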

#### Other Potential Metrics
Other factors that could be measured include speed, which would rely on measuring how fast the model produces a response, and uncertainty, which could be estimated either by directly prompting systems for a confidence score or by prompting multiple times to obtain a sample of outputs and computing their distribution. However, neither of these metrics is implemented here.
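As a rough illustration of these two unimplemented metrics, the sketch below times a call and summarizes repeated answers by their agreement rate. Everything here is hypothetical: the helper names `agreement_rate` and `timed`, and the five sample answers, are made up for illustration and do not come from the evaluation runs.

```python
import time
from collections import Counter

def agreement_rate(samples):
    """Return the most common answer and the share of samples that gave it."""
    counts = Counter(samples)
    answer, freq = counts.most_common(1)[0]
    return answer, freq / len(samples)

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Hypothetical `ai_use` answers from prompting one model five times on one report.
samples = [True, True, False, True, True]
(answer, rate), seconds = timed(agreement_rate, samples)
print(answer, rate)
```

In a real run, `timed` would wrap the API request itself, and a low agreement rate would flag documents where the model's answer is unstable.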





---



# 4. Performance Measurements

The following code blocks calculate a score for each model on each of the fields in the generated data (`year`, `ai_use`, `agency`, and `original_text`).

In [None]:
from prettytable import PrettyTable

In [None]:
#some ground truth entries do not include year, or include vague answers for the year
valid_truth = 0
for task in tasks:
  if task['groudtruth']['year'] == "2023" or task['groudtruth']['year'] == "2024":
    valid_truth += 1

t = PrettyTable(['Model', 'Correct #', 'Accuracy'])

# calculating accuracy with extracting the year of the report
for model in models:
  correct = 0
  for task in tasks:
    try:
      correct += 1 if task['groudtruth']['year'] == task[model+'_json']['year'] else 0
    except KeyError:
      # print(tasks.index(task))
      # print(model)
      correct += 0
  t.add_row([model, correct, correct/valid_truth])

print(f"The following table shows each model's accuracy score on {valid_truth} groundtruth answers for 'year'.")
print(t)

The following table shows each model's accuracy score on 102 groundtruth answers for 'year'.
+-----------------------------------+-----------+--------------------+
|               Model               | Correct # |      Accuracy      |
+-----------------------------------+-----------+--------------------+
|            openai/gpt-5           |     92    | 0.9019607843137255 |
|         openai/gpt-5-mini         |     85    | 0.8333333333333334 |
|      openai/gpt-oss-20b:free      |     87    | 0.8529411764705882 |
|     anthropic/claude-opus-4.1     |     94    | 0.9215686274509803 |
|   deepseek/deepseek-r1-0528:free  |     89    | 0.8725490196078431 |
| meta-llama/llama-3.3-70b-instruct |     89    | 0.8725490196078431 |
+-----------------------------------+-----------+--------------------+


In [None]:
# calculating accuracy with identifying mention of ai uses

t = PrettyTable(['Model', 'Correct #', 'Accuracy'])

for model in models:
  pos = 0
  neg = 0
  tp = 0
  tn = 0
  for task in tasks:
    if task['groudtruth']['ai_use'] == 'Yes':
      pos += 1
      try:
        if task[model+'_json']['ai_use'] == True:
          tp += 1
      except KeyError:
        tp += 0
    elif task['groudtruth']['ai_use'] == 'No':
      neg += 1
      try:
        if task[model+'_json']['ai_use'] == False:
          tn += 1
      except KeyError:
        tn += 0

  t.add_row([model, tp+tn, (tp+tn)/(pos+neg)])

print(f"The following table shows each model's accuracy score on {pos+neg} groundtruth answers for 'ai use'.")
print(t)

The following table shows each model's accuracy score on 97 groundtruth answers for 'ai use'.
+-----------------------------------+-----------+--------------------+
|               Model               | Correct # |      Accuracy      |
+-----------------------------------+-----------+--------------------+
|            openai/gpt-5           |     67    | 0.6907216494845361 |
|         openai/gpt-5-mini         |     69    | 0.711340206185567  |
|      openai/gpt-oss-20b:free      |     69    | 0.711340206185567  |
|     anthropic/claude-opus-4.1     |     67    | 0.6907216494845361 |
|   deepseek/deepseek-r1-0528:free  |     72    | 0.7422680412371134 |
| meta-llama/llama-3.3-70b-instruct |     71    | 0.7319587628865979 |
+-----------------------------------+-----------+--------------------+
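Accuracy alone does not show whether a model errs by over-reporting or under-reporting AI use. One way to break the result down is sketched below on toy labels. The helper names `confusion_counts` and `safe_div` are hypothetical, and unlike the cell above, this sketch skips documents with a missing prediction instead of counting them as incorrect.

```python
def confusion_counts(truths, predictions):
    """Count tp, fp, tn, fn for 'Yes'/'No' ground truth vs boolean predictions."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(truths, predictions):
        # Skip ambiguous annotations and missing predictions (assumption of this sketch).
        if truth not in ("Yes", "No") or not isinstance(pred, bool):
            continue
        if truth == "Yes":
            counts["tp" if pred else "fn"] += 1
        else:
            counts["fp" if pred else "tn"] += 1
    return counts

def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

# Toy labels, not taken from the dataset.
truths = ["Yes", "Yes", "No", "No", "unknown", "Yes"]
predictions = [True, False, False, True, True, True]

counts = confusion_counts(truths, predictions)
precision = safe_div(counts["tp"], counts["tp"] + counts["fp"])
recall = safe_div(counts["tp"], counts["tp"] + counts["fn"])
print(counts, precision, recall)
```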


In [None]:
import math

valid_truth = 0
for task in tasks:
  if type(task['groudtruth']['agency']) == str:
    valid_truth += 1

t = PrettyTable(['Model', 'Correct #', 'Accuracy'])
# calculating accuracy with extracting the agency name of the report
for model in models:
  matches = 0
  for task in tasks:
    if type(task['groudtruth']['agency']) == str:
      try:
        matches += 1 if task['groudtruth']['agency'].casefold() in task[model+'_json']['agency'].casefold() else 0
      except KeyError:
        matches += 0
  t.add_row([model, matches, matches/valid_truth])

print(f"The following table shows each model's accuracy score on {valid_truth} groundtruth answers for 'agency name.'")
print(t)

The following table shows each model's accuracy score on 109 groundtruth answers for 'agency name.'
+-----------------------------------+-----------+--------------------+
|               Model               | Correct # |      Accuracy      |
+-----------------------------------+-----------+--------------------+
|            openai/gpt-5           |     73    | 0.6697247706422018 |
|         openai/gpt-5-mini         |     77    | 0.7064220183486238 |
|      openai/gpt-oss-20b:free      |     72    | 0.6605504587155964 |
|     anthropic/claude-opus-4.1     |     71    | 0.6513761467889908 |
|   deepseek/deepseek-r1-0528:free  |     68    | 0.6238532110091743 |
| meta-llama/llama-3.3-70b-instruct |     73    | 0.6697247706422018 |
+-----------------------------------+-----------+--------------------+


In [None]:
from difflib import SequenceMatcher

#code adapted from google
def find_longest_common_substring(groundtruth, generated):
    matcher = SequenceMatcher(None, groundtruth, generated)
    overlap = matcher.find_longest_match(0, len(groundtruth), 0, len(generated))

    if overlap.size == 0:
        return 0
    else:
        return overlap.size/len(groundtruth)

t = PrettyTable(['Model', 'Score'])
for model in models:
  score = 0
  groundtruth = 0
  for task in tasks:
    try:
      groundtruth_string = task['groudtruth']['original_text']
      generated_string = task[model+'_json']['original_text']
      if type(groundtruth_string) == str:
        groundtruth += 1
        if type(generated_string) == str:
          score += find_longest_common_substring(groundtruth_string, generated_string)
    except KeyError:
      score += 0
  t.add_row([model,score])

print(f"The following table shows each model's accuracy score on {groundtruth} groundtruth answers for 'original text.'")
print(t)

The following table shows each model's accuracy score on 79 groundtruth answers for 'original text.'
+-----------------------------------+--------------------+
|               Model               |       Score        |
+-----------------------------------+--------------------+
|            openai/gpt-5           | 11.307042554095451 |
|         openai/gpt-5-mini         | 17.177281269369416 |
|      openai/gpt-oss-20b:free      | 10.471069697704849 |
|     anthropic/claude-opus-4.1     | 17.79350369623537  |
|   deepseek/deepseek-r1-0528:free  | 21.332857790887104 |
| meta-llama/llama-3.3-70b-instruct | 35.146809715687674 |
+-----------------------------------+--------------------+
