<a href="https://colab.research.google.com/github/stwagner07/atr-historical-research/blob/main/colab-notebooks/colab_textcorrection_ner_nlp_own-output.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## OCR correction and Named Entity Recognition (NER) with PySpellChecker and Stanza

This script offers Python users a more traditional NLP alternative to LLM-based OCR post-processing. PySpellChecker and Stanza serve as sample packages here, but they are by no means the only options.




In [1]:
# Install the packages needed

!pip install pyspellchecker stanza
import requests
import stanza
from spellchecker import SpellChecker

print("Installation successful!")

Collecting pyspellchecker
  Downloading pyspellchecker-0.8.2-py3-none-any.whl.metadata (9.4 kB)
Collecting stanza
  Downloading stanza-1.10.1-py3-none-any.whl.metadata (13 kB)

### Download dictionary for spell checking

The challenge here is to find a dictionary for your language that is comprehensive enough and, for historical research, also captures old word forms. Unfortunately, an extensive German dictionary I used in the past is no longer available. For testing purposes, we therefore work with a more limited German word list from GitHub.

In [2]:
# download German word list provided by Marvin J. Wendt
dwds_url = "https://gist.githubusercontent.com/MarvinJWendt/2f4f4154b8ae218600eb091a5706b5f4/raw/36b70dd6be330aa61cd4d4cdfda6234dcb0b8784/wordlist-german.txt"
response = requests.get(dwds_url)

if response.status_code == 200:
    # Save the word list as a local file
    with open("word_list.txt", "w", encoding="utf-8") as file:
        file.write(response.text)
    print("Word list successfully downloaded.")
else:
    print("Failed to download word list.")

# create a SpellChecker without the default English dictionary (language=None)
# and load the German word list instead

spell = SpellChecker(language=None)
spell.word_frequency.load_text_file("word_list.txt")
print("German word list loaded into PySpellChecker.")

Word list successfully downloaded.
German word list loaded into PySpellChecker.
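
One way to judge whether a word list is comprehensive enough for a given source is to measure how many tokens it covers. The sketch below is an illustration only: `word_list_coverage` is a hypothetical helper, not part of PySpellChecker, and its simple lower-casing and punctuation stripping are assumptions that may not suit every corpus.

```python
import string

def word_list_coverage(text, known_words):
    """Return the share of tokens in `text` that appear in `known_words`."""
    vocabulary = {word.lower() for word in known_words}
    # Strip surrounding ASCII punctuation and lower-case each whitespace-separated token
    tokens = [token.strip(string.punctuation).lower() for token in text.split()]
    tokens = [token for token in tokens if token]  # drop tokens that were only punctuation
    if not tokens:
        return 0.0
    covered = sum(1 for token in tokens if token in vocabulary)
    return covered / len(tokens)

# Tiny made-up example; in the notebook you could pass the lines of word_list.txt instead
sample_words = ["der", "freund", "bei", "anstalt"]
print(word_list_coverage("Der Freund bei der Anstalt , xqzt", sample_words))
```

A low value for your own transcriptions would suggest that many corrections will be guesses against an incomplete vocabulary.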


### Correct OCR errors

In [3]:
# function for error correction with PySpellChecker
# note that the original word is kept when PySpellChecker returns no correction

def correct_ocr_errors(text):
    words = text.split()
    # look up each word only once; spell.correction() may return None if it finds no candidate
    corrected_words = [spell.correction(word) or word for word in words]
    return " ".join(corrected_words)
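
Because OCR output repeats many tokens, one possible speed-up is to look up each distinct token only once. The following is a minimal sketch under that assumption: `correct_tokens_once` is a hypothetical helper, the `correct` argument stands in for a function such as `spell.correction`, and the demo corrector with its single mapping is made up for illustration.

```python
def correct_tokens_once(text, correct):
    """Correct each distinct token a single time and reuse the result."""
    cache = {}
    corrected_words = []
    for word in text.split():
        if word not in cache:
            # Fall back to the original token when no correction is returned
            cache[word] = correct(word) or word
        corrected_words.append(cache[word])
    return " ".join(corrected_words)

# Stand-in corrector for demonstration; in the notebook you would pass spell.correction
calls = []

def demo_correct(word):
    calls.append(word)
    return {"bnd": "und"}.get(word)

print(correct_tokens_once("bnd ich bnd ich", demo_correct))  # prints "und ich und ich"
print(len(calls))  # prints 2: two lookups instead of four
```

How much time this saves depends on how repetitive the text is.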

### Read sample text from GitHub

In [4]:
# Here you can load one of the samples provided in the atr-historical-research repository
# or load raw text from your own repository

github_raw_url = "https://raw.githubusercontent.com/stwagner07/atr-historical-research/refs/heads/main/sample_data_txt/ThS.L.V.413u14Text.txt"
response = requests.get(github_raw_url)
if response.status_code == 200:
    input_text = response.text
    corrected_text = correct_ocr_errors(input_text)
    print("Corrected OCR Text:", corrected_text[:700])  # Print first 700 characters
else:
    print("Failed to access sample text.")

# Be patient! Execution can take several minutes because checking each token against the word list is slow.
# A regex-based replacement of frequently misread characters or character combinations could help here, but building suitable patterns takes time.

Corrected OCR Text: harrende iot mir an po gen purhia, acer ante gera bnd bwl nachen ich an fehlkanten 4. 182kanpefanz am tb Paadenfeber flehe at ſm̅eln, bor uns laach 1 ei Pahwn- Zrestnamal arsgrarbetten. baden dea ab alle allfenrchen Snlaeſachzen darre da Etlashenrerftl. aal blond aalte aa gänze beſchrelan ad zen geballt ei kw arppühehe bin leite irveweifs, ali haar ader fichte da achats wegs hd ub 1. bevftaat eller don gssdhials swr ben necke batch helen foren ach alf ihren ahr an agp zu bebte pen . da agens ec aa ub nettetal der an 170 hier au da anstalt als bohr anpeſtelld, satan in Arzerharl ar re acer auwcherss pofe band puarb⸗Etrt hd aar abfege freund der merk ade ad Pcbmlicheler au mündels bei ahmet Bo
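
The regex-based replacement mentioned in the comments above could start from a small lookup table of character substitutions. This is a sketch, not a tested normalization scheme: the table `OCR_REPLACEMENTS` and the function `normalize_ocr_characters` are hypothetical, and the two mappings shown (the long s and the double oblique hyphen, both visible in the output above) are only examples.

```python
import re

# Hypothetical replacement table; adjust it to the error patterns of your own OCR output
OCR_REPLACEMENTS = {
    "ſ": "s",  # long s, common in older prints
    "⸗": "-",  # double oblique hyphen
}

# One alternation pattern that matches any key of the table
_replacement_pattern = re.compile("|".join(re.escape(key) for key in OCR_REPLACEMENTS))

def normalize_ocr_characters(text):
    """Replace every character listed in OCR_REPLACEMENTS in a single pass."""
    return _replacement_pattern.sub(lambda match: OCR_REPLACEMENTS[match.group(0)], text)

print(normalize_ocr_characters("beſchrelan puarb⸗Etrt"))  # prints "beschrelan puarb-Etrt"
```

Running such a step before `correct_ocr_errors` might reduce the number of tokens the spell checker has to guess at.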


### Named Entity Recognition (NER) with Stanza

In [5]:
stanza.download("de")  # download German model
nlp = stanza.Pipeline(lang="de", processors="tokenize,ner")

INFO:stanza:Loading these models for language: de (German):
| Processor | Package      |
----------------------------
| tokenize  | combined     |
| mwt       | combined     |
| ner       | germeval2014 |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


### Extract named entities

In [6]:
# function for NER with focus on persons and locations

def extract_named_entities(text):
    doc = nlp(text)
    named_entities = {"Person": [], "Location": []}
    for ent in doc.ents:
        if ent.type == "PER":
            named_entities["Person"].append(ent.text)
        elif ent.type in ["LOC", "GPE"]:
            named_entities["Location"].append(ent.text)
    return named_entities
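
Because the same name often occurs several times in a text, it can help to count how often each entity was found. The sketch below assumes input in the shape returned by `extract_named_entities`; `count_entities` is a hypothetical helper and the example values are made up.

```python
from collections import Counter

def count_entities(named_entities):
    """Map each category to (entity, frequency) pairs, most frequent first."""
    return {
        category: Counter(names).most_common()
        for category, names in named_entities.items()
    }

# Made-up input in the same shape as the return value of extract_named_entities
example = {"Person": ["Jahns", "Dean", "Jahns"], "Location": ["Bozen"]}
print(count_entities(example))
```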

### Apply NER to corrected text from the previous step

In [7]:
# This prints the extracted entities as a Python dictionary (use json.dumps() if you need JSON output).

entities = extract_named_entities(corrected_text)
print("Named Entities:", entities)

Named Entities: {'Person': ['Haddolad', 'dean', 'jahns'], 'Location': ['baden dea', 'dornum', 'benin', 'Seüduge', 'bozen']}
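
The printed result above is the text representation of a Python dictionary. If strict JSON is needed, for example to save the entities to a file, the dictionary can be serialized with Python's standard `json` module. The sketch below reuses the values printed above as a literal; in the notebook you would pass the `entities` variable directly.

```python
import json

# Literal copy of the result shown above; in the notebook, use the `entities` variable
entities = {
    "Person": ["Haddolad", "dean", "jahns"],
    "Location": ["baden dea", "dornum", "benin", "Seüduge", "bozen"],
}

# ensure_ascii=False keeps characters such as "ü" readable instead of escaping them
entities_json = json.dumps(entities, ensure_ascii=False, indent=2)
print(entities_json)
```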


**Note**: This is a sample script for testing purposes; it still needs updates before it is fully functional for research. Especially when processing sensitive text, consider using a local Python installation or running the code in an institutional environment.