# Testing OCR correction of Trove newspaper articles using GPT3

In [241]:
import requests
import os
import html2text
import re
from IPython.display import HTML
from html_diff import diff

HTML(
    """<style>
    ins {background-color: #ccffcc; }
    del {background-color: #ffcccc; } 
    </style>"""
)

In [242]:
%%capture
# Load variables from the .env file if it exists
# Use %%capture to suppress messages
%load_ext dotenv
%dotenv

In [243]:
# Insert your Trove & GPT3 API keys
GPT3_KEY = "YOUR API KEY"
TROVE_API_KEY = "YOUR API KEY"

# Use api key value from environment variables if it is available
if os.getenv("GPT3_KEY"):
    GPT3_KEY = os.getenv("GPT3_KEY")
    
if os.getenv("TROVE_API_KEY"):
    TROVE_API_KEY = os.getenv("TROVE_API_KEY")

## Either get a random newspaper article

In [250]:
# Get a random article
# Pass the query values via params so requests handles the URL encoding
trove_response = requests.get(
    "https://trove-proxy.herokuapp.com/random/",
    params={"key": TROVE_API_KEY, "word": "100 - 1000 Words"},
)
if trove_response.ok:
    article = trove_response.json()
    print(f"Article found! -- {article['heading']}: {article['troveUrl']}")
else:
    print("Try again!")

Article found! -- The Factory Infant's Prayer.: https://trove.nla.gov.au/ndp/del/article/138918125?searchTerm=%22weren%27t%22


## Or get a specific newspaper article using its identifier

In [251]:
# Or else get a specific article by setting the id value below
article_id = None

if article_id:
    trove_params = {
        "include": "articletext",
        "encoding": "json",
        "key": TROVE_API_KEY
    }

    trove_url = f"https://api.trove.nla.gov.au/v2/newspaper/{article_id}"

    trove_response = requests.get(trove_url, params=trove_params)
    if trove_response.ok:
        article = trove_response.json()["article"]
        print(f"Article found! -- {article['heading']}: {article['troveUrl']}")
    else:
        print("There was a problem!")

## Clean up the input text

Remove HTML and line breaks.

In [252]:
text = html2text.html2text(article["articleText"])
# Use a raw string so "\s" isn't treated as an invalid string escape
text = re.sub(r"\s+", " ", text)
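
To illustrate what the whitespace step does, here is a minimal sketch that applies the same regular expression to a made-up string. The helper name `collapse_whitespace` and the sample text are hypothetical and not part of the notebook. It also strips leading and trailing spaces, which the cell above does not.

```python
import re


def collapse_whitespace(value: str) -> str:
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", value).strip()


# Hypothetical sample imitating text with hard line breaks
sample = "The Factory\nInfant's   Prayer.\n\nGentle Jesus,\tmeek and mild,"
print(collapse_whitespace(sample))
```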

## Query GPT3

At first I tried the `edits` endpoint, which uses a different model. That didn't seem to work, so I switched to the `completions` endpoint with the `text-davinci-003` model.

In [253]:
params = {
  "model": "text-davinci-003",
  "prompt": f"Correct the OCR errors in this text: {text}",
  "max_tokens": 1500
}

headers = {"Authorization": f"Bearer {GPT3_KEY}"}

gpt_response = requests.post("https://api.openai.com/v1/completions", json=params, headers=headers)
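
The request can fail (for example, because of a bad key or a prompt that is too long), and a long article can hit the `max_tokens` limit. The sketch below shows one way to check the decoded JSON before using it. It assumes the response follows the documented completions shape (an `error` object on failure, and `choices[0]` with `text` and `finish_reason` on success). The helper name `extract_completion` and the sample payloads are hypothetical.

```python
def extract_completion(payload: dict) -> str:
    """Return the first completion's text, or raise if the API reported an error."""
    if "error" in payload:
        message = payload["error"].get("message", "unknown error")
        raise RuntimeError(f"API error: {message}")
    choice = payload["choices"][0]
    # A finish_reason of "length" means the output stopped at max_tokens
    if choice.get("finish_reason") == "length":
        print("Warning: output reached max_tokens and may be truncated")
    return choice["text"]


# Hypothetical payload standing in for gpt_response.json()
ok_payload = {
    "choices": [{"text": "\n\nGentle Jesus, meek and mild", "finish_reason": "stop"}]
}
print(extract_completion(ok_payload))
```

In the notebook this would be called as `extract_completion(gpt_response.json())`.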

## Reformat the corrected text

Remove the leading label (any text up to the first colon) and the line breaks.

In [254]:
corrected = gpt_response.json()["choices"][0]["text"]
corrected = re.sub(r"^\s*.+?:\s*", "", corrected)
# GPT3 inserts line breaks, remove them so we can compare texts
corrected = re.sub(r"\s+", " ", corrected)
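
One thing to watch: the label pattern `^\s*.+?:\s*` removes everything up to the first colon on the first line of text, so if the model returns no label it can remove real content instead. A more conservative variant, sketched here as an assumption rather than the notebook's method, only strips a short prefix that contains no full stop. The helper name `strip_label`, the 40-character limit, and the sample strings are hypothetical.

```python
import re

# A "label" here is at most 40 characters with no colon, full stop, or newline,
# followed by a colon. The limit is an arbitrary choice for this sketch.
LABEL = re.compile(r"^\s*[^:\n.]{1,40}:\s*")


def strip_label(value: str) -> str:
    """Drop a short leading 'Label:' if present, then collapse whitespace."""
    value = LABEL.sub("", value, count=1)
    return re.sub(r"\s+", " ", value).strip()


print(strip_label("\n\nCorrected text: Gentle Jesus,\nmeek and mild"))
print(strip_label("\n\nThe Factory Infant's Prayer. Gentle Jesus, tell me why: they work"))
```

The first call removes the label. The second leaves the text intact because the sentence before the colon contains a full stop.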

## Display the differences

In [255]:
HTML(diff(text, corrected))
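
If `html_diff` isn't available, a rough word-level comparison can be built with `difflib` from the standard library. This is a substitute technique, not what the cell above does, and its output will differ from `html_diff`. The function name `word_diff_html` and the sample strings are hypothetical. It emits the same `<ins>` and `<del>` tags, so the styles defined at the top of the notebook still apply.

```python
import difflib
from html import escape


def word_diff_html(before: str, after: str) -> str:
    """Mark removed words with <del> and added words with <ins>."""
    a, b = before.split(), after.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.append(escape(" ".join(a[i1:i2])))
        if tag in ("delete", "replace"):
            parts.append(f"<del>{escape(' '.join(a[i1:i2]))}</del>")
        if tag in ("insert", "replace"):
            parts.append(f"<ins>{escape(' '.join(b[j1:j2]))}</ins>")
    return " ".join(parts)


# Hypothetical OCR error: "Jcsus" instead of "Jesus"
print(word_diff_html("Gentle Jcsus, meek and mild", "Gentle Jesus, meek and mild"))
```

In the notebook the result would be wrapped in `HTML(...)` to render it.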

Just the corrected output:

In [256]:
corrected

" The Factory Infant's Prayer. Gentle Jesus, meek and mild, Gaze down on the factory child, Attending loom most all the day I 'spect it's naught but waste o'time to play, I leave my bed for weaver's stool, Then leave my work to go to school, But nothing won't keep in my head 'Cos I'm thinking of the weaving shed. Gentle Jesus, tell me why They work such little girls as I; Would the folks all naked be If 'tweren't for little kids like me. Our master says it is a crime To stay at school more'n half the time, We cannot believe all he say, 'Cos his kin go to school all day. Gentle Jesus, tell me why Your great disciples rave and cry About the poor Armenian's state, Yet silently condone our fate. Of course, dear Jesus, you're aware That their temples bright and fair Are endowed with golden spoil, Wring from our ungodly toil. Gentle Jesus, Lord how long Are us wee mites to suffer wrong? When will shame thrust forth her sting, Awaken men to this mean thing? Gentle Jesus, meek and mild, Pity t

## Cost estimate

Based on $0.0200 / 1,000 tokens.

In [263]:
# The GPT3 response is stored in gpt_response (there is no `response` variable)
total_tokens = gpt_response.json()["usage"]["total_tokens"]
print(f"Estimated cost: USD${(total_tokens/1000) * .02:.2f}")

Estimated cost: $0.01
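
The arithmetic is the token count divided by 1,000, multiplied by the per-1,000-token rate. As a small sketch, with a hypothetical helper name `estimate_cost` and a made-up token count of 500 (not the value from the run above):

```python
def estimate_cost(total_tokens: int, usd_per_1k_tokens: float = 0.02) -> float:
    """Estimate the cost in USD for a given number of tokens."""
    return total_tokens / 1000 * usd_per_1k_tokens


# Hypothetical token count, for illustration only
example_tokens = 500
print(f"Estimated cost: USD${estimate_cost(example_tokens):.2f}")
```

Note that `total_tokens` covers both the prompt and the completion, so the article text is counted twice: once as input and once as corrected output.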
