# Generative AI for News Media

The purpose of this notebook is to provide an overview and some examples of how generative models such as GPT-3 can be used for productive purposes in the domain of journalism. It's important to get a handle on the capabilities and limitations of these models in order to use them responsibly. 

You can try out OpenAI's interactive "playground" (https://beta.openai.com/playground) and try variations of the tasks I walk through below. Before we get to tasks though, I'll talk about prompts and model limitations. 


## Prompting
The main way you interact with these kinds of models is through *prompting* -- writing commands in natural language. **The prompt is the textual input to the model used to describe the task that the model should perform.** For example, you could prompt the model with "Brainstorm three article ideas on the topic of generative AI in news media" and it should generate a list of three ideas. 

Prompting is your main way of controlling the machine and providing context or data for the task. You need to communicate clearly and precisely what you want, and what you don't. When writing prompts think about the verbs and objects to describe what you want to happen, like "<u>Summarize</u> this scientific <u>abstract</u> using lay terminology: \<abstract text\>". 

Prompting is sometimes referred to as "prompt design" or "prompt engineering" since you often need to carefully craft and iterate or "debug" your prompts so that they accomplish what you want. This is similar to how we explain things to other people: if someone doesn't understand you, you try again and explain it a bit differently, trying to clarify. 

Here are a few tips for prompting: 

-- Use plain language, be concrete with verbs and objects

-- Iterate on different prompts in the playground so you can quickly see feedback on whether it's working as expected. You might need to try different ways of expressing what you want the model to do.

-- Start by being as descriptive as possible and then work backwards to remove extra words and see if performance stays consistent

-- Play with the scope and complexity of task: “Write an article about GPT-3” vs. “List three specific advantages of GPT-3 for news media”
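The tips above can be sketched in code. This is a minimal illustration, not part of the original notebook: `build_prompt` is a hypothetical helper that joins an instruction (the verb and object) to the text it should act on, which makes it easy to swap in different phrasings while iterating.

```python
def build_prompt(instruction: str, text: str = "") -> str:
    """Join an instruction (verb + object) and optional input text into one prompt."""
    # Normalize the instruction so it never ends in a stray colon or whitespace
    instruction = instruction.strip().rstrip(":")
    if not text:
        return instruction
    return f"{instruction}: {text.strip()}"


# Two variants that differ in scope, taken from the tips above
variants = [
    "Write an article about GPT-3",
    "List three specific advantages of GPT-3 for news media",
]
for variant in variants:
    print(build_prompt(variant))

# An instruction with a concrete verb and object, followed by the input text
summary_prompt = build_prompt(
    "Summarize this scientific abstract using lay terminology",
    "<abstract text>",
)
print(summary_prompt)
```

Keeping the instruction separate from the input text means you can change one while holding the other fixed, which helps when you are checking whether a shorter prompt performs as well as a longer one.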

## Model Limitations
Journalists need to be aware of the limitations of the AI systems they use. Generative AI models like GPT-3 generally come with a few caveats including:

-- They don't "reason" well, especially for complex tasks with multiple steps that require planning. 

-- They can "hallucinate" information leading to factual inaccuracies. They generally cannot provide accurate references or links back to where information came from. 

-- They can exhibit societal biases based on the data they're trained on.

-- They won't "know" anything about the world after the date their training data ends (for GPT-3.5 this is June 2021).

-- They don't generally do well on math problems.

-- There are length limitations (~3000 words for GPT-3) and generally the longer the output text, the less coherent.

-- They may memorize text from their training that could result in copyright issues.
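The length limitation is the easiest of these to check before you send anything to the model. The sketch below is a rough pre-flight check, assuming the ~3000-word figure above applies to the prompt and the requested output together. Counting whitespace-separated words is only a crude stand-in for how the model measures length, and `fits_length_limit` is a hypothetical helper name.

```python
def rough_word_count(text: str) -> int:
    """Count whitespace-separated words as a crude proxy for model length."""
    return len(text.split())


def fits_length_limit(prompt: str, max_output_words: int, limit_words: int = 3000) -> bool:
    """Return True if the prompt plus the requested output stays under the limit."""
    return rough_word_count(prompt) + max_output_words <= limit_words


short_prompt = "Summarize the following bill: " + "word " * 100
long_prompt = "Summarize the following bill: " + "word " * 2900

print(fits_length_limit(short_prompt, max_output_words=200))
print(fits_length_limit(long_prompt, max_output_words=200))
```

If a document fails the check, you would need to shorten it or split it into pieces before prompting.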

# Capabilities

Next I'll walk through some basic use-cases for generative AI in news media, focusing on text output using the GPT-3 model. I'll demonstrate six basic tasks that should have some journalistic value: (1) Rewriting, (2) Summarization, (3) Brainstorming, (4) Classification, (5) Extraction, and (6) Data-to-text. There are other example tasks you can check out in the OpenAI documentation: https://beta.openai.com/examples as well as in the OpenAI Cookbook: https://github.com/openai/openai-cookbook.  

**To edit and run the code, create a copy of this notebook in your Google Drive, then sign up on OpenAI for an API key and add it below where it says `openai.api_key`.** Alternatively, you can copy-paste the prompts into the OpenAI playground to see the output there.

If you're keen to build on these tasks or develop your own, see the Generative AI in the Newsroom Challenge: https://medium.com/@ndiakopoulos/the-generative-ai-in-the-newsroom-challenge-9fe2dc5fb2a7

In [1]:
!pip install openai

Collecting openai
  Downloading openai-0.27.2-py3-none-any.whl (70 kB)
Installing collected packages: openai
Successfully installed openai-0.27.2



In [1]:
import pandas as pd
import openai
import os

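# Set your OpenAI secret key. Reading it from the OPENAI_API_KEY environment variable keeps it out of the notebook;
# alternatively, paste it between the quotation marks of the fallback value. Never share a notebook that contains a real key.
openai.api_key = os.environ.get("OPENAI_API_KEY", "YOUR_API_KEY_HERE")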

import warnings
warnings.filterwarnings('ignore')

from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)


## 1. Rewriting

These models tend to be good at paraphrasing text, which can be useful for removing jargon or simplifying the language of a text. 

**Use Case**: You're a science reporter trying to see if a recently published paper deals with a topic of interest. So you prompt the model to dejargonize a scientific abstract from a pre-print server. 

⚠ **WARNING** ⚠ : Output like this should only be used to get a quick sense for what the research is about and help you decide if you want to read the full study.  




In [2]:
# Here is the prompt command
prompt = "Rewrite the following abstract to avoid heavy scientific jargon and use simpler vocabulary: "

# Here is a jargony scientific abstract taken from bioRxiv: https://www.biorxiv.org/content/10.1101/581280v3 
abstract = \
"Current treatments for depression are limited by suboptimal efficacy, delayed response, \
and frequent side effects. Intermittent theta-burst stimulation (iTBS) is a non-invasive brain stimulation treatment that is \
FDA-approved for treatment-resistant depression (TRD). Recent methodological advancements suggest iTBS could be improved \
through 1) treating with multiple sessions per day at optimally-spaced intervals, 2) applying a higher overall pulse-dose \
of stimulation and 3) precision targeting of the left dorsolateral prefrontal cortex (L-DLPFC) to subgenual anterior cingulate \
cortex (sgACC) circuit. We examined the feasibility, tolerability, and preliminary efficacy of an accelerated, high-dose, \
resting-state functional connectivity MRI (fcMRI)-guided iTBS protocol for TRD termed \
‘Stanford Accelerated Intelligent Neuromodulation Therapy (SAINT)’."

response = openai.Completion.create(
  model="text-davinci-003",
  prompt= prompt + abstract, # Concatenate the prompt and abstract
  temperature=0.0, # Set the temperature to zero to minimize variation in the output (repeated runs should give the same or nearly the same result)
  max_tokens=300,
  top_p=1.0,
  frequency_penalty=0.0,
  presence_penalty=0.0
)

# Output the response
print (response["choices"][0]["text"])




---



## 2. Summarization

These models are quite proficient at crunching a text down so you can scan it more quickly. 

**Use Case**: You're an AI reporter and want to keep tabs on state legislatures that may be introducing relevant bills. You want to set up an alert so that any bill mentioning "AI" gets summarized and sent to your email. 

⚠ **WARNING** ⚠ : Output like this should only be used to get a quick sense for what the bill is about and help you decide if you want to read the full bill in detail.  
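The alert described in the use case needs a filter that decides which bills are worth sending to the model at all. Here is one possible sketch of that step. `mentions_ai` is a hypothetical helper, and matching the phrase "artificial intelligence" in addition to the acronym is an assumption beyond the use case as stated.

```python
import re

# Match "AI" only as a standalone, upper-case word so that words like
# "MAINTAINING" or "said" do not trigger a false alert.
ACRONYM = re.compile(r"\bAI\b")
# Assumption: also treat the spelled-out phrase as a mention, in any letter case.
PHRASE = re.compile(r"\bartificial\s+intelligence\b", re.IGNORECASE)


def mentions_ai(text: str) -> bool:
    """Return True if the text mentions AI by acronym or by full phrase."""
    return bool(ACRONYM.search(text) or PHRASE.search(text))


titles = [
    "An Act regulating AI systems in hiring",
    "An Act relative to MAINTAINING public roads",
    "An Act studying Artificial Intelligence in schools",
]
for title in titles:
    print(mentions_ai(title), "-", title)
```

Only the bills that pass this filter would then be summarized with a prompt like the one in the cell below.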



In [8]:
# Import packages needed to get and parse the data
import requests
from bs4 import BeautifulSoup

# Here is the prompt command
prompt = "Summarize what is journalistically newsworthy about the following proposed bill: "

# Download the text for a proposed Massachusetts bill
url = "http://malegislature.gov/Bills/193/SD1827/Senate/Bill/Text"
response = requests.get(url, verify=False)
soup = BeautifulSoup(response.text, 'html.parser')
bill_text = soup.get_text()

# Uncomment this if you want to see the original bill text
#print (bill_text)

response = openai.Completion.create(
  model="text-davinci-003",
  prompt= prompt + bill_text, # Concatenate the prompt and bill text
  temperature=0.7, # A moderate temperature, which can result in some variability from run to run
  max_tokens=250,
  top_p=1.0,
  frequency_penalty=0.0,
  presence_penalty=0.0
)

# Output the response
print (response["choices"][0]["text"])




---



## 3. Brainstorming

These models can be helpful for generating diverse ideas that can spark your creativity. 

**Use Case**: A press release just landed in your inbox from the NYC Mayor's office. You haven't covered the topic before and don't have much background so you use the model to jumpstart your thinking about possible story angles you could pursue.  

**Note**: The `presence_penalty` parameter is used here to nudge the model to produce distinct ideas.

⚠ **WARNING** ⚠ : The more you know about a topic, the less the model will probably surprise you. Don't expect any truly creative thinking from the model, but it might offer the right spark for your own creativity.



In [None]:
# Import packages needed to get and parse the data
import requests
from bs4 import BeautifulSoup

# Here is the prompt command
prompt = "Brainstorm three specific news article ideas by critically assessing the following press release, explaining each idea: "

# Download the text for the NYC Press Release
url = "https://www.nyc.gov/office-of-the-mayor/news/077-23/mayor-adams-helps-new-yorkers-save-up-150-million-overdue-water-bills"
response = requests.get(url, verify=False)
soup = BeautifulSoup(response.text, 'html.parser')
press_release_text = soup.select_one(".col-content").get_text()

# Uncomment this if you want to see the original press release text
#print (press_release_text)

response = openai.Completion.create(
  model="text-davinci-003",
  prompt= prompt + press_release_text, # Concatenate the prompt and press release text
  temperature=0.7, # A moderate temperature, which can result in some variability from run to run
  max_tokens=500,
  top_p=1.0,
  frequency_penalty=0.0,
  presence_penalty=0.8 # Setting the presence penalty to a relatively high value should help the model brainstorm distinct ideas
)

# Output the response
print (response["choices"][0]["text"])


1. "Mayor Adams Offers Relief to Nearly 200,000 New Yorkers With Unpaid Water Bills": This article would explore how Mayor Adams' amnesty program is offering relief to nearly 200,000 customers with overdue water bills by forgiving up to 100 percent of interest when customers pay a portion or all of their outstanding water bills. 

2. "New Yorkers Could Save up to $150 Million With Temporary Water Bill Amnesty Program": This article would focus on the potential savings for New Yorkers of up to $150 million that are available through Mayor Adams' temporary water bill amnesty program. 

3. "NYC Water Infrastructure Threatened By $1.2 Billion Owed in Unpaid Water Bills": This article would explain how the $1.2 billion owed in unpaid water bills poses a threat to NYC's water infrastructure and could lead to additional rate hikes if left unpaid.




---



## 4. Classification

Generative models like GPT-3 can also be used for analytic tasks like classifying a text into different categories. 

**Use Case**: You're an engaged journalist doing a follow-up story on recent coverage around the need to replace gas cooktops, such as https://www.washingtonpost.com/climate-solutions/2023/02/04/how-to-use-gas-stove-safely/. You want to run a quick survey with people who have switched to induction stoves and ask them why they switched. To find these people you decide to scan the more than 2,700 comments made on the last story. 

**Note**: The prompt here is designed to output structured data, which can be used easily in downstream processing, like filtering. 

⚠ **WARNING** ⚠ : Don't make assumptions about what your classifier might be missing. You might need to devise a small study to assess accuracy. 
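One way to run the small accuracy study mentioned in the warning is to label a sample of comments by hand and compare those labels to the model's. The sketch below assumes that setup. The label lists are hypothetical placeholders, not results from this notebook, and `accuracy` is an illustrative helper.

```python
def accuracy(predicted, expected):
    """Share of items where the model label matches the hand label (case-insensitive)."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("need at least one labeled example")
    matches = sum(
        p.strip().lower() == e.strip().lower()
        for p, e in zip(predicted, expected)
    )
    return matches / len(expected)


# Hypothetical placeholder labels for six comments
hand_labels = ["yes", "yes", "no", "no", "no", "no"]
model_labels = ["yes", "Yes", "no", "yes", "no", "no"]

print(f"Accuracy: {accuracy(model_labels, hand_labels):.2f}")
```

Overall accuracy can hide the kind of error that matters most here, so it is also worth reading the disagreements individually, especially comments the model labeled "no" that you labeled "yes".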

In [None]:
import json
import time
from openai.error import RateLimitError

# For purposes of the demo we just scan 6 comments, but this could easily be scaled up.
sample_comments = ["We were considering converting to gas when replacing our stove. We went with induction instead. It’s amazing. Cooks food/boils water faster. Is 90% efficient vs. gas 30% efficacy. Equipped with an air fryer function. We had to purchase new cook wear that is required because of the magnification properties on the burners. You need pots that are made to adhere so to speak to burners while heating. But it’s worth it! Love our new pots and get dinner ready faster. Now if it would only clean -up.",
     "When we first moved into our house (we weren't the first owner) it had a conventional electric cooktop. Conventional electric stoves just plain suck for a variety of reasons. But we were lucky, because the builder of our house included both a gas pipe and a 220 volt outlet in the island where the cooktop sat. We considered replacing the electric cooktop with an induction one but didn't want to spend the time, effort and money to replace the majority of our cookware with pots and pans that were magnetic. So in our case it was easy work to yank the electric cooktop and replace it with a gas one. It took the appliance store guys about 30 minutes to make the switch. A couple of years ago we remodeled our kitchen, and in the big scheme of cost and inconvenience, replacing our cookware was pretty trivial. So we finally got our induction cooktop. It combines the best features of electric (more efficient than gas, no air pollution) with the best features of gas (easy and quick to regulate temperature), and includes advantages neither one has (doesn't heat up the kitchen, cooktop doesn't stay burning hot for long when cooking is done), and it heats things faster than either electric or gas. Without a doubt induction cooking is the way of the future. And I'm sure prices will come down as more home owners who can't or don't want to spend on high-end appliances start to demand them.",
     "We have a gas stove, and like it. Our range hood is vented to the outside. We will be getting an induction range within 2 to 4 years. We are older, and it is a safer option since there will not be an open flame. We took this decision before the \"controversy\" arose. Gas stoves are better than standard electric in some areas (where gas is cheaper), and work very well. It is also true that the products of combustion are a problem. Houseplants help, opening windows helps, range hoods help. Nothing works perfectly.",
     "Unless you have a small home with little ventilation, the risk from gas stove emissions is quite small. After cooking on a induction stove top for a year in a rental we opted for gas when we put a new stove in our own house. It is well ventilated and this old house provides a lot of unplanned ventilation too.",
     "And just so we're all clear: the suitability of a pan for induction cooking is not \"adhesion\". Testing with a magnet is a quick way to determine whether the pan has enough ferrous metal to respond to the magnetic field generate by the element.",
     "The gas stove hullabaloo is absurd. The matter of the plastic straws is infinitely more serious because they stay out there essentially forever and affect other creatures. I have no problem banning them."
    ]

# Here is the prompt command
prompt = 'Does the following comment describe someone\'s experience of switching to an induction cooktop? Respond with JSON structured data with a "label" field that is either "yes" or "no" and an "explanation" field that provides an explanation of why it\'s labeled that way: '

# This array will store our categorized data
output_data = []

# Loop through our list of sample_comments
for ncomment, comment_text in enumerate(sample_comments):
  print ("Testing comment", ncomment)

  # Construct the API request  
  try:
    response = openai.Completion.create(
      model="text-davinci-003",
      prompt= prompt + comment_text, 
      temperature=0.0, # We don't want any variability in the responses 
      max_tokens=256,
      top_p=1.0,
      frequency_penalty=0.0,
      presence_penalty=0.0
    )

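    data = {}
    data["response"] = json.loads(response["choices"][0]["text"])
    data["comment"] = comment_text
    output_data.append(data)
    #print (data)

    # Sleep between API calls to avoid being rate limited (the rate limit is 20 calls / minute for the free tier, so at least 3 seconds per call)
    time.sleep(5)
  except RateLimitError as e:
    print (e)
    time.sleep (10) # Wait some more (note: the current comment is skipped, not retried)
  except json.JSONDecodeError as e:
    # The model is not guaranteed to return valid JSON; skip this comment rather than crash
    print ("Could not parse response for comment", ncomment, e)

# Output the final data
print (json.dumps(output_data, indent=2))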

Testing comment 0
Testing comment 1
Testing comment 2
Testing comment 3
Testing comment 4
Testing comment 5
[
  {
    "response": {
      "label": "yes",
      "explanation": "The comment describes a positive experience with an induction cooktop, including its efficiency, air fryer function, and the need to purchase new cookware."
    },
    "comment": "We were considering converting to gas when replacing our stove. We went with induction instead. It\u2019s amazing. Cooks food/boils water faster. Is 90% efficient vs. gas 30% efficacy. Equipped with an air fryer function. We had to purchase new cook wear that is required because of the magnification properties on the burners. You need pots that are made to adhere so to speak to burners while heating. But it\u2019s worth it! Love our new pots and get dinner ready faster. Now if it would only clean -up."
  },
  {
    "response": {
      "label": "Yes",
      "explanation": "The comment describes the experience of switching from an electri
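Note that the labels in the output above are not consistently cased ("yes" and "Yes"), so any downstream filtering should normalize them first. The following is a minimal sketch of that filtering step. `filter_by_label` is a hypothetical helper, and `sample_output` is a shortened stand-in with the same shape as `output_data`.

```python
def filter_by_label(records, wanted="yes"):
    """Keep records whose label matches `wanted`, ignoring case and whitespace."""
    return [
        record
        for record in records
        if str(record["response"].get("label", "")).strip().lower() == wanted
    ]


# Shortened stand-in for output_data, using placeholder comment text
sample_output = [
    {"response": {"label": "yes", "explanation": "..."}, "comment": "comment A"},
    {"response": {"label": "Yes", "explanation": "..."}, "comment": "comment B"},
    {"response": {"label": "no", "explanation": "..."}, "comment": "comment C"},
]

switchers = filter_by_label(sample_output)
for record in switchers:
    print(record["comment"])
```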



---



## 5. Extraction

Models can also take unstructured data like documents and structure it for further analysis. 

**Use Case**: Roberto Rocha posted an interesting use case here: https://robertorocha.info/getting-tabular-data-from-unstructured-text-with-gpt-3-an-ongoing-experiment/. The idea is to extract structured data about who is lobbying government officials from the Canadian federal lobbyist registry. This is quite a messy data source.  

⚠ **WARNING** ⚠ : It took a lot of prompt iteration for this one to get it to work as intended even just on the small sample of documents in the demo. Further testing would be required to confidently generalize it to a larger dataset.  

In [None]:
import time
from openai.error import RateLimitError

# A sample of 4 documents we might want to parse
sample_documents = ["1996-1997 EXECUTIVE ASSISTANT MINISTER OF TRANSPORT",
"Special Assistant 1991 to 1993 Hon. Robert Kaplan",
"September 1984 to February 1988 Senior Policy Analyst - various assignments related to federal procurement and trade policy Department of Supply and Services",
"January 2002 to May 2002 Chief of Staff Office of the Minister of Public Works and Government Services Canada"
]

# Here is the prompt command
prompt = 'Extract two dates or years from the input data. Also extract a job description from each line of text in the input data. Create a three-column table with the first date, second date, and job description. If there is no date or job description, leave the column blank. \
Use the following format: first date | second date | job description \
input data: '

# Loop through our list of sample_documents
for ndoc, document_text in enumerate(sample_documents):
  # Construct the API request  
  try:
    response = openai.Completion.create(
      model="text-davinci-003",
      prompt= prompt + document_text, 
      temperature=0.7,  
      max_tokens=256,
      top_p=1.0,
      frequency_penalty=0.0,
      presence_penalty=0.0
    )

    print (response["choices"][0]["text"])

    # You need to sleep for 3 seconds in between API calls to avoid being rate limited (the rate limit is 20 calls / minute for the free tier)
    time.sleep(3)
  except RateLimitError as e:
    print (e)
    time.sleep (10) # Wait some more




1996 | 1997 | Executive Assistant Minister of Transport


1991 | 1993 | Special Assistant Hon. Robert Kaplan


September 1984 | February 1988 | Senior Policy Analyst


January 2002 | May 2002 | Chief of Staff, Office of the Minister of Public Works and Government Services Canada
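To analyze output like this further, the pipe-delimited lines need to be turned into records. The sketch below shows one way to do that, assuming the model follows the three-column format requested in the prompt. `parse_row` and the column names are illustrative choices, and lines that do not have exactly three fields are skipped rather than guessed at.

```python
COLUMNS = ("first_date", "second_date", "job_description")


def parse_row(line):
    """Split one 'first | second | job' line into a dict, or return None if malformed."""
    parts = [part.strip() for part in line.split("|")]
    if len(parts) != len(COLUMNS):
        return None
    return dict(zip(COLUMNS, parts))


# Two of the responses shown above, including the leading blank lines the model returns
raw_outputs = [
    "\n\n1996 | 1997 | Executive Assistant Minister of Transport",
    "\n\nSeptember 1984 | February 1988 | Senior Policy Analyst",
]

rows = []
for raw in raw_outputs:
    for line in raw.splitlines():
        row = parse_row(line)
        if row is not None:
            rows.append(row)

for row in rows:
    print(row)
```

A list of dicts like `rows` can be passed directly to `pd.DataFrame(rows)` if you want to continue in pandas.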




---



## 6. Structured Data to Text

These models can take structured data and output readable text to support various data journalism projects. 

**Use Case**: You want to create a quick blurb for your website that summarizes the CDC weekly Influenza Surveillance Data (https://www.cdc.gov/flu/weekly/index.htm) and personalizes it to the location (e.g. state) of the user. 

⚠ **WARNING** ⚠ : When generating text, always be sure to double-check it for factual accuracy before publishing. 

In [None]:
# Here is the prompt command
prompt = "Write a short description of this week's influenza activity based on the following data: "

# And some sample data
data = "State: Virginia \
ILI Activity Level: Moderate \
ILI Rate for visits to emergency departments or urgent care centers: 3.2%"

response = openai.Completion.create(
  model="text-davinci-003",
  prompt= prompt + data, 
  temperature=0.7, 
  max_tokens=256,
  top_p=1.0,
  frequency_penalty=0.0,
  presence_penalty=0.0
)

# Output the response
print (response["choices"][0]["text"])



This week, influenza activity in Virginia has been moderate. Visits to emergency departments or urgent care centers have reported an ILI rate of 3.2%.
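Part of the fact-check called for in the warning can be automated by confirming that every value from the input data appears in the generated text. The sketch below is a simple substring check under that assumption, and `missing_values` is a hypothetical helper. It only catches omitted or altered values, so a human still needs to read the text for claims the data does not support.

```python
def missing_values(generated_text, expected_values):
    """Return the expected values that do not appear in the generated text (case-insensitive)."""
    lowered = generated_text.lower()
    return [value for value in expected_values if value.lower() not in lowered]


# The blurb generated above and the values from the sample data
blurb = (
    "This week, influenza activity in Virginia has been moderate. "
    "Visits to emergency departments or urgent care centers have reported "
    "an ILI rate of 3.2%."
)
expected = ["Virginia", "Moderate", "3.2%"]

problems = missing_values(blurb, expected)
if problems:
    print("Check before publishing, missing:", problems)
else:
    print("All input values appear in the text.")
```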
