<a href="https://colab.research.google.com/github/shallotly/news-eval-cookbook/blob/main/Info_Extraction_Scenario_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Journalism Benchmark Cookbook: Information Extraction**
In this notebook, we demonstrate how to evaluate an information extraction task based on a scenario with **textual information extraction** from **unstructured** **textual documents**.

This notebook can be copied and modified to fit other similar scenarios of evaluation. For other specific tasks in information extraction, see [here](https://colab.research.google.com/drive/1a0802LBBCeVTOcKdTdWTzwdr6FxAUXgY#scrollTo=Q-5JGF1TqZFB).

# Evaluation Overview

*   **What is the use scenario?** In this use case, we aim to test the *information extraction* abilities of generative AI using a scenario in [the Washington Post's reporting on unwanted sexual approaches on chat apps](https://www.washingtonpost.com/technology/2019/11/22/apple-says-its-app-store-is-safe-trusted-place-we-found-reports-unwanted-sexual-behavior-six-apps-some-targeting-minors/). This story examines the phenomenon of unwanted sexual approaches on social media apps through iOS app store reviews of six social media apps.

*   **What is the dataset?** The dataset can be accessed at: https://github.com/AlJohri/data-random-chat-apps/raw/master/reviews.csv. WaPo first used a machine learning algorithm to sift through more than 130,000 reviews of six random chat apps and then manually coded 1500 reviews mentioning unwanted sexual behavior on these apps. [Below](#data), we provide a brief explanation of data fields.

*   **How is the test case set up?**
We employ generative AI to extract information regarding whether an app review contains sexually explicit information or mentions unwanted sexual behaviors in a set of human annotated reviews containing the app being reviewed, review title, review rating, and review body text. <br> We prompt systems with a text prompt: *Given the following review entry for {app} on App Store: review rating - {rating}, review title - {title}, review text - {body}, please extract information about whether this review mentions something sexually explicit occurred or is itself sexually explicit. Additionally, extract information about if the review mentions that some unwanted sexual behavior occured or generally occurs. Please adhere to the provided example json format and output 1 if answer is positive, 0 if answer is negative, and ? if you are unsure.* We also provide a structured response format for model outputs.

*   **How are we assessing the performance of gen AI?**
We calculate an accuracy score based on exact match with the ground truth (the annotators' labels). We also count the reviews that human annotators were unsure about and use that count as a reference point for how often a model gives uncertain answers.


<a name="data"></a>
# 1. Data

The dataset contains the following relevant fields:

  * **reviewId**: Unique identifier for a review (e.g. 5175110487)
  * **app**: Social Media App being reviewed [*Monkey/Yubo/ChatLive/Chat for Strangers/Skout/Holla*]
  * **date**: Date and time of review
  * **rating**: 1-5 rating on App Store
  * **userId**: Apple assigned user id (e.g. 485829199)
  * **userName**: User chosen nickname (e.g. jasminebadbabg)
  * **title**: Title of review (e.g. "Needs Improvement")
  * **body**: Main text of review (e.g. "It’s fun to troll and talk to friends on this bro")
  * **sexually_explicit**: 1 if the review mentions something sexually explicit occurred or the review itself is sexually explicit, 0 if not, ? if the manual coder was unable to decide, and blank if the review was not manually coded.
  * **unwanted_sexual**: 1 if the review mentions that some unwanted sexual behavior occurred or generally occurs, 0 if not, ? if the manual coder was unable to decide, and blank if the review was not manually coded.
  * **racism**: 1 if the review mentions that some racist behavior occurred or the review itself is racist, 0 if not, ? if the manual coder was unable to decide, and blank if the review was not manually coded.
  * **bullying**: 1 if the review mentions that some form of bullying occurred, 0 if not, ? if the manual coder was unable to decide, and blank if the review was not manually coded.
  * **spam**: 1 if the review talks about spammy behavior from the app, its users, or other third parties, 0 if not, ? if the manual coder was unable to decide, and blank if the review was not manually coded.
  

In total, the dataset has 131749 rows, each corresponding to one review. Among them, 3840 rows are annotated for whether the review contains sexually explicit content or mentions unwanted sexual behavior. For this case study, we filter out the rows that are not annotated and truncate the remainder to 1500 rows to reduce the cost of evaluation.

## 1.1 Data Code
The following code loads the data from GitHub into a Python list of dictionaries, then filters and truncates it.

In [1]:
!curl -O https://raw.githubusercontent.com/washingtonpost/data-random-chat-apps/refs/heads/master/reviews.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 17.5M  100 17.5M    0     0  16.9M      0  0:00:01  0:00:01 --:--:-- 17.0M


In [2]:
import csv

data = []
with open('reviews.csv', 'r', encoding='utf-8') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for row in csv_reader:
        data.append(row)

In [6]:
cleaned_data = [r for r in data if r['unwanted_sexual'] != ""]

In [10]:
data_lite = cleaned_data[:1500]

In [11]:
print(data_lite[3])

{'reviewId': '5130012958', 'app': 'Monkey', 'date': '2019-11-12T21:38:42-05:00', 'rating': '5', 'userId': '1099248928', 'userName': 'inlike girl', 'title': 'Money', 'body': 'Don’t do lad it because there is a lot of perverts on there', 'sexually_explicit': '1', 'unwanted_sexual': '1', 'racism': '', 'bullying': '', 'spam': ''}


# 2. Task Details
The scenario above outlines an *information extraction task*. This task is for the AI model to correctly extract specific information or excerpts from unstructured texts and classify the information.

This scenario sits in a specific journalistic task context: it is an *internal task*, meaning the output is used by internal newsroom staff, and it draws on *unstructured sources*, since reviews are freeform. The data is textual and stored in a CSV file.

We summarize the task context here:

| Task Context | Value |
| :----------- | :---- |
| Task Usage   | Internal Usage (not public facing) |
| File Format  | csv |
| Modality     | Text |
| Data Structure | Unstructured |

## 2.1 Task Code   
The following code blocks first define a textual prompt input for models and then use the [OpenRouter API](https://openrouter.ai/docs/quickstart#using-the-openrouter-api-directly) to collect model responses.

The prompt is sent with each review to 5 different models (`openai/gpt-5`, `openai/gpt-oss-20b`, `anthropic/claude-sonnet-4`, `deepseek/deepseek-chat-v3-0324`, `meta-llama/llama-3.3-70b-instruct`). The code provides a template for [structured output](https://openrouter.ai/docs/features/structured-outputs).

To authenticate the OpenRouter API with your own API key, refer to [this](https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Authentication.ipynb#scrollTo=yeadDkMiISin) to create a separate name-value pair for OpenRouter.

The last two code blocks load model outputs as Python `dict` objects for access in the next portion and export the outputs as a JSON file.

In [None]:
def create_prompt(app, rating, title, body):
  return f"Given the following review entry for {app} on App Store: review rating - {rating}, review title - {title}, review text - {body}, please extract information about whether this review mentions something sexually explicit occurred or is itself sexually explicit. Additionally, extract information about if the review mentions that some unwanted sexual behavior occured or generally occurs. Please adhere to the provided example json format and output 1 if answer is positive, 0 if answer is negative, and ? if you are unsure."

In [6]:
models = ["openai/gpt-5", "openai/gpt-oss-20b","anthropic/claude-sonnet-4", "deepseek/deepseek-chat-v3-0324", "meta-llama/llama-3.3-70b-instruct"] #"deepseek/deepseek-r1-0528:free (could not use this one bc out of free credits)"

In [None]:
from google.colab import userdata
import requests
import json
import tqdm

url="https://openrouter.ai/api/v1/chat/completions"
headers = {
    "Authorization": f"Bearer {userdata.get('openrouter_api')}",
    "Content-Type": "application/json"
}

response_format ={
    "type": "json_schema",
    "json_schema": {
      "name": "sexual_explicit",
      "strict": True,
      "schema": {
        "type": "object",
        "properties": {
          "sexually_explicit": {
            "type": "string",
            "description": "whether the review mentions something sexually explicit occured, values 0, 1 or ?"
          },
          "unwanted_sexual": {
            "type": "string",
            "description": "whether the review mentions that some unwanted sexual behavior occured, values 0, 1 or ?"
          }
        },
        "required": ["sexually_explicit", "unwanted_sexual"],
        "additionalProperties": False
      }
    }
  }

# Send every review to every model and store the raw reply on the task dict.
for model in models:
  print(f"currently prompting {model}")
  for task in tqdm.tqdm(data_lite):
    messages=[
        {
          "role": "user",
          "content": [
                {
                    "type": "text",
                    "text": create_prompt(task['app'],task['rating'],task['title'], task['body']),
                }],
        }
    ]
    payload = {
      "model": model,
      "messages": messages,
      "response_format": response_format
    }
    response = requests.post(url, headers=headers, data=json.dumps(payload))
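    try:
      raw_response = response.json()
      # Keep the raw reply text; it is parsed into a dict later.
      task[model+'_output'] = raw_response['choices'][0]['message']['content']
    except KeyError:
      # No 'choices' field in the reply (for example, an API error); log and continue.
      print(payload)
      print(response.json())
    except requests.JSONDecodeError:
      # The response body was not valid JSON; log the raw bytes.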
      print(response.content)
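
The evaluation cells in Section 4 read a parsed dictionary from `task[model+'_json']`, while the loop above only stores the raw reply under `task[model+'_output']`. The following is a minimal sketch of how that parsing and export step could look, not necessarily the notebook's own implementation. The helper names `parse_model_output`, `attach_parsed_outputs`, and `export_tasks`, and the file name `model_outputs.json`, are assumptions made for illustration. The sketch assumes replies contain a flat JSON object, possibly wrapped in a Markdown code fence or followed by explanatory prose.

```python
import json
import re


def parse_model_output(raw):
    """Return the first JSON object in a raw model reply as a dict, or None."""
    # Non-greedy match from the first "{" to the first "}"; enough for the
    # flat, two-field objects requested in the schema (assumes no nesting).
    match = re.search(r"\{.*?\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        # For example, an unquoted ? is not valid JSON.
        return None
    return parsed if isinstance(parsed, dict) else None


def attach_parsed_outputs(tasks, model_names):
    """Store each parseable reply under the '<model>_json' key."""
    for task in tasks:
        for model in model_names:
            raw = task.get(model + "_output")
            parsed = parse_model_output(raw) if raw is not None else None
            if parsed is not None:
                task[model + "_json"] = parsed
    return tasks


def export_tasks(tasks, path="model_outputs.json"):
    """Write the annotated tasks to a JSON file (hypothetical file name)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(tasks, f, ensure_ascii=False, indent=2)
```

Under these assumptions, calling `attach_parsed_outputs(data_lite, models)` followed by `export_tasks(data_lite)` would populate the `_json` keys and save the results. Replies that cannot be parsed get no `_json` key, which later surfaces as a `KeyError` during scoring.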


# 3. Evaluation Metrics
We focus on evaluating the *accuracy* of the models against the annotations that reporters provided on the dataset.

*   We calculate each model's accuracy at extracting whether a review contains sexually explicit language and whether it mentions unwanted sexual behavior. Against the human annotation, we count true positives (reviews annotators marked "yes" that the model also marked "yes") and true negatives (reviews annotators marked "no" that the model also marked "no"). Accuracy is the sum of true positives and true negatives divided by the number of reviews annotators marked as either "yes" (1) or "no" (0), so it ranges from 0 to 1.
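
The accuracy definition above can be sketched as a small function. This is an illustrative sketch rather than the notebook's own scoring code; the function name `exact_match_accuracy` is hypothetical, and it assumes labels are the strings `"0"`, `"1"`, and `"?"` as in the dataset.

```python
def exact_match_accuracy(gold_labels, model_labels):
    """Accuracy over reviews whose human label is "0" or "1".

    Reviews the annotators marked "?" are excluded from the denominator.
    A model answer of "?" on a scored review counts as incorrect.
    """
    scored = [(g, m) for g, m in zip(gold_labels, model_labels) if g in ("0", "1")]
    if not scored:
        return None  # nothing to score
    correct = sum(1 for g, m in scored if g == m)  # true positives + true negatives
    return correct / len(scored)


# Small made-up example: three scored reviews, one of them matched.
gold = ["1", "0", "?", "1"]
model = ["1", "1", "0", "?"]
print(exact_match_accuracy(gold, model))
```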

#### Other Potential Metrics
We also report the number of reviews that annotators marked as "unsure" (?) alongside the number of reviews that each model marked as "unsure", as a rough indication of how confident a model is in its outputs.

Other factors that could be measured include speed, that is, how quickly the model produces a response; this metric is not implemented here.
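
As a rough illustration of how response speed could be captured, the sketch below wraps any callable and returns its wall-clock duration. It is not part of the evaluation in this notebook; the helper name `timed_call` is hypothetical, and wall-clock time for an API call includes network latency as well as model generation time.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Example with a local function standing in for an API request.
result, seconds = timed_call(sum, [1, 2, 3])
print(result, f"{seconds:.6f}s")
```

In the prompting loop, the same wrapper could be placed around the `requests.post` call and the elapsed time stored on the task next to the model output.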


# 4. Performance Measurements

The following code blocks calculate the accuracy score for each model on each of the variables in the generated data (`sexually_explicit` and `unwanted_sexual`). Note that the Anthropic model did not generate responses following the provided data schema, which made its score 0 for each variable. At the end, we score the Anthropic model separately by explicitly extracting the keys that correspond to the original dataset.

In [6]:
from prettytable import PrettyTable

In [5]:
data_lite[4]

{'reviewId': '5124867218',
 'app': 'Monkey',
 'date': '2019-11-11T17:30:24-05:00',
 'rating': '5',
 'userId': '482778816',
 'userName': 'Luisprs',
 'title': 'I love it',
 'body': 'I got nudes and thats why i like it',
 'sexually_explicit': '1',
 'unwanted_sexual': '?',
 'racism': '',
 'bullying': '',
 'spam': '',
 'openai/gpt-5_output': '{"sexually_explicit":"1","unwanted_sexual":"0"}',
 'anthropic/claude-sonnet-4_output': '```json\n{\n  "sexually_explicit_content": 1,\n  "unwanted_sexual_behavior": 0\n}\n```\n\nThe review explicitly mentions receiving nude images, which constitutes sexually explicit content. However, the reviewer expresses positive sentiment about this experience ("I love it", "thats why i like it"), indicating this was wanted rather than unwanted sexual behavior from their perspective.',
 'deepseek/deepseek-r1-0528:free_output': '{\n    "sexually_explicit": \t\t\t\t\t"...",\n    "unwanted_sexual": \t\t\t\t\t"..."\n}',
 'meta-llama/llama-3.3-70b-instruct_output': '{"s

In [8]:
#calculate sexually explicit accuracy

t = PrettyTable(['Model', 'Correct #', 'Accuracy', 'human_unsure', 'model_unsure'])
models = ["openai/gpt-5", "openai/gpt-oss-20b","anthropic/claude-sonnet-4", "deepseek/deepseek-chat-v3-0324", "meta-llama/llama-3.3-70b-instruct"] #"deepseek/deepseek-r1-0528:free (could not use this one bc out of free credits)"
for model in models:
  pos = 0
  neg = 0
  human_unsure = 0
  tp = 0
  tn = 0
  model_unsure = 0
  for task in data_lite:
    try:
      if task[model+'_json']['sexually_explicit'] == '?':
        model_unsure += 1
    except KeyError:
      print(model+': key error')
    if task['sexually_explicit'] == '1':
      pos += 1
      try:
        if task[model+'_json']['sexually_explicit'] == '1':
          tp += 1
      except KeyError:
        tp += 0
    elif task['sexually_explicit'] == '0':
      neg +=1
      try:
        if task[model+'_json']['sexually_explicit'] == '0':
          tn += 1
      except KeyError:
        tn += 0
    elif task['sexually_explicit'] == '?':
      human_unsure += 1
  t.add_row([model, tp+tn, (tp+tn)/(pos+neg), human_unsure, model_unsure])

print(f"The following table shows each model's accuracy score on {pos+neg} groundtruth answers for 'sexually explicit'.")
print(t)

openai/gpt-oss-20b: key error
[the same line repeats; output truncated]

In [9]:
#calculate unwanted sexual accuracy

t = PrettyTable(['Model', 'Correct #', 'Accuracy', 'human_unsure', 'model_unsure'])

for model in models:
  pos = 0
  neg = 0
  human_unsure = 0
  tp = 0
  tn = 0
  model_unsure = 0
  for task in data_lite:
    try:
      if task[model+'_json']['unwanted_sexual'] == '?':
        model_unsure += 1
    except KeyError:
      print(model+': key error')
    if task['unwanted_sexual'] == '1':
      pos += 1
      try:
        if task[model+'_json']['unwanted_sexual'] == '1':
          tp += 1
      except KeyError:
        tp += 0
    elif task['unwanted_sexual'] == '0':
      neg +=1
      try:
        if task[model+'_json']['unwanted_sexual'] == '0':
          tn += 1
      except KeyError:
        tn += 0
    elif task['unwanted_sexual'] == '?':
      human_unsure += 1
  t.add_row([model, tp+tn, (tp+tn)/(pos+neg), human_unsure, model_unsure])

print(f"The following table shows each model's accuracy score on {pos+neg} groundtruth answers for 'unwanted sexual'.")
print(t)

openai/gpt-oss-20b: key error
[the same line repeats; output truncated]

In [19]:
t = PrettyTable(['Model', 'Correct #', 'Accuracy', 'human_unsure', 'model_unsure'])

pos = 0
neg = 0
human_unsure = 0
tp = 0
tn = 0
model_unsure = 0
for task in data_lite:
  try:
    keys = list(task['anthropic/claude-sonnet-4_json'].keys())
  except KeyError:
    print(task)
    continue
  if task['sexually_explicit'] == '?':
    human_unsure += 1

  if len(keys)< 3:
    if task['anthropic/claude-sonnet-4_json'][keys[0]] == '?':
      model_unsure += 1
    else:
      if task['sexually_explicit'] == '1':
        pos += 1
        if int(task['anthropic/claude-sonnet-4_json'][keys[0]]) == 1:
          tp += 1
      if task['sexually_explicit'] == '0':
        neg +=1
        if int(task['anthropic/claude-sonnet-4_json'][keys[0]]) == 0:
          tn += 1
  if len(keys) == 3:
    if task['anthropic/claude-sonnet-4_json'][keys[0]] == '?' or task['anthropic/claude-sonnet-4_json'][keys[1]] == '?':
      model_unsure += 1
    if task['sexually_explicit'] == '1':
      pos += 1
      if task['anthropic/claude-sonnet-4_json'][keys[0]] == 1 or task['anthropic/claude-sonnet-4_json'][keys[1]] == 1:
        tp += 1
    if task['sexually_explicit'] == '0':
      neg +=1
      if task['anthropic/claude-sonnet-4_json'][keys[0]] == 0 and task['anthropic/claude-sonnet-4_json'][keys[1]] == 0:
        tn += 1
  else:
    # No further handling here; outputs with more than three keys stay unscored.
    pass


t.add_row(['anthropic/claude-sonnet-4', tp+tn, (tp+tn)/(pos+neg), human_unsure, model_unsure])

print(f"The following table shows the Anthropic model's accuracy score on {pos+neg} groundtruth answers for 'sexually explicit'.")
print(t)


{'reviewId': '1098731126', 'app': 'Skout', 'date': '2014-11-17T17:46:00-05:00', 'rating': '1', 'userId': '377470845', 'userName': 'Angryperson2345335', 'title': 'Lame site', 'body': 'Full of creeps', 'sexually_explicit': '1', 'unwanted_sexual': '1', 'racism': '', 'bullying': '', 'spam': '', 'anthropic/claude-sonnet-4_output': '```json\n{\n  "sexually_explicit_content": 0,\n  "unwanted_sexual_behavior": ?\n}\n```\n\nThe review does not contain sexually explicit language or descriptions. However, the term "creeps" could potentially refer to unwanted sexual behavior, but it\'s ambiguous - it could also refer to generally inappropriate, annoying, or suspicious users without necessarily implying sexual misconduct. Given this ambiguity, I marked the unwanted sexual behavior field as uncertain (?).', 'openai/gpt-5_output': '{"sexually_explicit":"0","unwanted_sexual":"?"}', 'meta-llama/llama-3.3-70b-instruct_output': '{"sexually_explicit": ", " ,"unwanted_sexual": "1"}', 'deepseek/deepseek-cha
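
The Anthropic outputs shown above use keys such as `sexually_explicit_content` and `unwanted_sexual_behavior` with integer values, rather than the dataset's field names with string values. One possible alternative to indexing keys by position is to map them back to the dataset fields by prefix. The sketch below illustrates that idea under the assumption that each returned key starts with the corresponding dataset field name; the helper name `normalize_keys` is hypothetical and this is not the scoring approach used in the cell above.

```python
def normalize_keys(parsed, fields=("sexually_explicit", "unwanted_sexual")):
    """Map model-chosen keys onto dataset field names by prefix match.

    Values are converted to strings so that 1 and "1" compare equal to the
    string labels used in the dataset. Keys matching no field are dropped.
    """
    normalized = {}
    for key, value in parsed.items():
        for field in fields:
            if key.startswith(field):
                normalized[field] = str(value)
    return normalized


example = {"sexually_explicit_content": 1, "unwanted_sexual_behavior": 0}
print(normalize_keys(example))
# {'sexually_explicit': '1', 'unwanted_sexual': '0'}
```

After normalization, the Anthropic outputs could be scored with the same loop used for the other models.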