<a href="https://colab.research.google.com/github/soberbichler/Notebooks4Historical_Newspapers/blob/main/Llama3_OCR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Researching German Historical Newspapers with Llama AI Model
## Example: OCR Post-Correction

Created by Sarah Oberbichler [![ORCID](https://info.orcid.org/wp-content/uploads/2019/11/orcid_16x16.png)](https://orcid.org/0000-0002-1031-2759)

This notebook shows how LLMs can support research with historical newspapers. In this example, the Llama 3 model is used to correct the OCR text of previously OCR'd historical newspaper pages.

OCR quality has been a long-standing issue in digitization efforts. Historical newspapers are particularly affected due to their complexity, historical fonts, and degradation. Additionally, OCR technology has faced limitations when dealing with historical scripts.


### 1.   Query the German Historical Newspaper Portal

German historical newspapers from the German Digital Library can be accessed via the DDB API. This API is open access and allows users to query the historical newspapers available in the German Newspaper Portal ([Deutsches Zeitungsportal](https://www.deutsche-digitale-bibliothek.de/newspaper)). Instructions provided by the German Newspaper Portal (written by Karl Krägerlin) can be found [here](https://deepnote.com/app/karl-kragelin-b83c/Zeitungsportal-API-d9224dda-8e26-4b35-a6d7-40e9507b1151).

In [None]:
# @markdown #####  Launch this cell and get access to the API of the Newspaper Portal from the German Digital Library
!pip install ddbapi

In [None]:
# @markdown ####  Import the necessary packages
import pandas as pd
from ddbapi import zp_issues, zp_pages, list_column, filter

In [None]:
# @markdown ### Possible kwargs for the functions are:
# @markdown - language: Use ISO codes, currently ger, eng, fre, spa
# @markdown - place_of_distribution: Search inside "Verbreitungsort"; use a list for multiple search words
# @markdown - publication_date: Get newspapers by publication date
# @markdown - zdb_id: Search by ZDB-ID
# @markdown - provider: Search by data provider
# @markdown - paper_title: Search inside the title of the newspaper
# @markdown - plainpagefulltext: Search inside the OCR text
# Get the data
df = zp_pages(
    publication_date='[1906-01-01T12:00:00Z TO 1906-12-31T12:00:00Z]',
    plainpagefulltext=["Rückwanderer*"],
    #paper_title='Deutsche allgemeine Zeitung'
    )

df.head()

In [None]:
# @markdown #### We can narrow down the text surrounding the keyword in order to reduce the input tokens for the model.
def extract_context(keywords, text, tokens_before, tokens_after):
    # Note: tokens_before and tokens_after are counted in characters,
    # because the window is cut with plain string slicing.

    # Find all positions of the keywords in the text
    keyword_positions = []

    for keyword in keywords:
        keyword_start = text.find(keyword)
        while keyword_start != -1:
            keyword_positions.append(keyword_start)
            keyword_start = text.find(keyword, keyword_start + len(keyword))

    if not keyword_positions:
        return "Keywords not found in text."

    # Determine the first and last occurrence of any keyword
    first_occurrence = min(keyword_positions)
    last_occurrence = max(keyword_positions)

    # Compute the start of the context window
    start_index = max(0, first_occurrence - tokens_before)

    # Compute the end of the context window (measured from the start of the last match)
    end_index = min(len(text), last_occurrence + tokens_after)

    # Extract the context window
    context = text[start_index:end_index]

    return context

# List of keywords; the leading "R" is left out, so both upper- and lower-case spellings match
keywords = ["ückwander"]

df['context'] = [extract_context(keywords, text, 2000, 3000) for text in df['plainpagefulltext']]
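
To make the effect of the window sizes concrete, here is a minimal, self-contained sketch of the same slicing logic applied to a made-up sentence. The helper name `char_window` and the sample text are assumptions for illustration only and are not part of the notebook or of `ddbapi`. Note that both margins are counted in characters and that the trailing margin is measured from the start of the last match.

```python
def char_window(text, keyword, before, after):
    """Hypothetical helper: slice `text` around all occurrences of `keyword`.

    `before` and `after` are character counts, not model tokens.
    """
    positions = []
    start = text.find(keyword)
    while start != -1:
        positions.append(start)
        start = text.find(keyword, start + len(keyword))

    if not positions:
        return ""

    lo = max(0, min(positions) - before)          # left edge of the window
    hi = min(len(text), max(positions) + after)   # right edge, from the last match start
    return text[lo:hi]


# Made-up sample sentence (not taken from the newspaper data)
sample = "Die Zahl der Rückwanderer aus Amerika stieg im Jahre 1906 stark an."
snippet = char_window(sample, "ückwander", 5, 20)
print(snippet)
```

With 5 characters before and 20 after, the window is 25 characters long and is cut in the middle of a word, which is why the notebook uses generous margins (2000 and 3000 characters).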



In [None]:
# @markdown #### Save the results as Excel file
df.to_excel('newspaper_rückkehrer.xlsx', index=False)

## Setting up the requirements for the Llama model

Llama 3 is a family of models developed by Meta. Llama 3 instruction-tuned models are fine-tuned and optimized for dialogue/chat use cases and outperform many of the available open-source chat models on common benchmarks.

In [None]:
!pip install replicate


In [None]:
# @markdown ##### Get an API key at https://replicate.com/, activate billing, and save your key in a .env file. To do so, take the following steps:
# @markdown - Open a text editor (e.g. Notepad) and write REPLICATE_API_TOKEN = "your key"
# @markdown - Click on Save and change the file type to 'All files'
# @markdown - Set the file name to .env
# @markdown - Hit Save. The file is now a .env file.


!pip install python-dotenv

import os
import dotenv

# Mount Google Drive so that the .env file stored there can be read
from google.colab import drive
drive.mount('/content/drive')


In [None]:
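# @markdown Upload the .env file to drive/MyDrive, then run this cell to load it
dotenv.load_dotenv('/content/drive/MyDrive/.env')

# Check that the token was loaded without printing the secret itself
print('REPLICATE_API_TOKEN loaded:', os.getenv('REPLICATE_API_TOKEN') is not None)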

## Run the model for OCR post-correction

To run OCR post-correction, it is essential to formulate a precise prompt. For example, the prompt needs to specify that the whole text should be corrected, while summaries and any other additions must be avoided. A guide on how to write effective prompts can be found [here](https://support.google.com/a/users/answer/14200040?hl=en).

Depending on the size of the dataframe, the correction can take a while to run.
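
As a rough illustration of such a prompt, the sketch below assembles the instructions and the page text in one place. The function name `build_ocr_prompt` and the exact German wording are assumptions made for this example; they are not the prompt used in the cell that follows.

```python
def build_ocr_prompt(page_text: str) -> str:
    """Hypothetical helper: combine explicit instructions with the OCR text."""
    # Illustrative wording (an assumption): correct everything,
    # output the full text, do not summarize, do not add anything.
    instructions = (
        "Korrigiere die OCR-Fehler im folgenden deutschen Text. "
        "Gib den gesamten korrigierten Text aus. "
        "Fasse nichts zusammen und füge nichts hinzu."
    )
    # A separator line marks the end of the text to be corrected
    return f"{instructions}\n\n{page_text.strip()}\n\n---"


# Made-up OCR snippet with typical recognition errors
print(build_ocr_prompt("  Dcr Kaifer reifte nach Bcrlin.  "))
```

Keeping the instructions in one helper makes it easier to try out different wordings on a handful of pages before running the whole dataframe.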

In [None]:
# Work on the first 10 rows only; .copy() avoids pandas warnings when a column is added later
df = df[:10].copy()

In [None]:
import json
import replicate

def OCR_correction(newspaper_page):
    # Define the prompt and generation settings for the OCR correction
    model_input = {
        "prompt": f"Korrigiere OCR Fehler des gesamten deutschen Textes und drucke den gesamten korrigierten Text \n\n{newspaper_page}\n\n---\n",
        "prompt_template": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are an OCR correction expert. Please don't ask for feedback or questions <|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "max_new_tokens": 8000,
    }

    # Initialize an empty string to collect the streamed response
    text = ""

    # Generate the response using the Llama model
    for event in replicate.stream(
        "meta/meta-llama-3-70b-instruct",
        input=model_input
    ):
        if event:
            text += str(event)
        else:
            print("Received empty event data")

    # Return the corrected text
    return text

# Create an empty list to store the corrected texts
post_OCR = []

# Loop through each row in the dataframe
for index, row in df.iterrows():
    # Extract the keyword context of the newspaper page from the current row
    newspaper_page = row['context']

    # Only call the model if there is text to correct
    if isinstance(newspaper_page, str) and newspaper_page.strip():
        text = OCR_correction(newspaper_page)
    else:
        print("Skipping empty newspaper page")
        text = ""

    # Append a value for every row, even if it is empty, so the list matches the dataframe length
    post_OCR.append(text)

# Add the corrected texts as a new column 'article_corrected' in the dataframe
df['article_corrected'] = post_OCR

# Display the modified dataframe
df


In [None]:
# Drop everything before the first blank line, e.g. an introductory sentence added by the model
df['article_corrected'] = df['article_corrected'].apply(lambda x: x.split('\n\n', 1)[1].lstrip() if isinstance(x, str) and '\n\n' in x else x)

df.to_excel('article_corrected.xlsx', index=False)
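
Since the prompt asks the model to correct the whole text without summarizing or adding anything, it can be useful to flag outputs that deviate strongly from their input. The following is a minimal sketch using only the Python standard library. The function name `correction_looks_plausible` and both thresholds are assumptions chosen for illustration, not validated values.

```python
from difflib import SequenceMatcher


def correction_looks_plausible(original, corrected,
                               min_similarity=0.6, max_length_change=0.3):
    """Hypothetical check: True if `corrected` is still close to `original`.

    Both thresholds are illustrative assumptions and should be tuned on real data.
    """
    if not original or not corrected:
        return False

    # Relative change in length; large values hint at a summary or added text
    length_change = abs(len(corrected) - len(original)) / len(original)

    # Character-level similarity between 0 and 1
    similarity = SequenceMatcher(None, original, corrected, autojunk=False).ratio()

    return length_change <= max_length_change and similarity >= min_similarity


# Made-up example strings
ocr_text = "Dcr Kaifer reifte geftern nach Bcrlin."
print(correction_looks_plausible(ocr_text, "Der Kaiser reiste gestern nach Berlin."))
print(correction_looks_plausible(ocr_text, "Der Kaiser reiste."))
```

In the notebook, such a check could be applied row by row to the `context` and `article_corrected` columns to decide which pages to inspect manually.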