<a href="https://colab.research.google.com/github/soberbichler/Notebooks4Historical_Newspapers/blob/main/Nemotron_Article_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Researching German Historical Newspapers using the Nvidia Nemotron Model
## Example: Article Extraction

*Notebook created by Sarah Oberbichler (oberbichler@ieg-mainz.de)*

This notebook shows how LLMs can support research with historical newspapers. In this example, the NVIDIA Nemotron model is used to extract articles on earthquakes from OCR'd historical newspaper pages.

Article segmentation for historical newspapers can be based on layout information and graphical elements (image) as well as on textual context (data). While the former is very challenging due to the changing and complex layouts of historical newspapers, the latter seems to be especially promising for topic-specific corpus building. Qualitative research relies on correctly separated articles. An article, in this context, is defined as a coherent text covering a specific topic, no more and no less.



### 1.   Query the German Historical Newspaper Portal

German historical newspapers from the German Digital Library can be accessed via the DDB API. The API is open access and allows querying the historical newspapers available in the German Newspaper Portal ([Deutsches Zeitungsportal](https://www.deutsche-digitale-bibliothek.de/newspaper)). A tutorial provided by the German Newspaper Portal (by Karl Krägelin) can be found [here](https://deepnote.com/app/karl-kragelin-b83c/Zeitungsportal-API-d9224dda-8e26-4b35-a6d7-40e9507b1151).

Python > 3.8 is required

In [1]:
# @markdown ####  Launch this cell and get access to the API of the Newspaper Portal from the German Digital Library
!pip install ddbapi

Collecting ddbapi
  Downloading ddbapi-0.1.2.tar.gz (5.2 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: ddbapi
  Building wheel for ddbapi (setup.py) ... done
  Created wheel for ddbapi: filename=ddbapi-0.1.2-py3-none-any.whl size=5385 sha256=55902a3f06e7a3d452df9676cdc375416999b8ff19f7d76407483b20788cd737
  Stored in directory: /root/.cache/pip/wheels/0a/93/7e/69ec8f7396174c1532d0f9c5b9a343c6df0353071db93e4b2b
Successfully built ddbapi
Installing collected packages: ddbapi
Successfully installed ddbapi-0.1.2


In [2]:
# @markdown ####  Import the necessary packages
import pandas as pd
from ddbapi import zp_issues, zp_pages, list_column, filter

In [3]:

# @markdown ### Possible kwargs for the functions are:
# @markdown - language: Use ISO Codes, currently ger, eng, fre, spa
# @markdown - place_of_distribution: Search inside "Verbreitungsort" (place of distribution); use a list for multiple search words
# @markdown - publication_date: Get newspapers by publication date.
# @markdown - zdb_id: Search by ZDB-ID
# @markdown - provider: Search by Data Provider
# @markdown - paper_title: Search inside the title of the Newspaper
# @markdown - plainpagefulltext: Search inside the OCR text
# Get the data
df = zp_pages(
    publication_date='[1909-01-01T12:00:00Z TO 1912-01-01T12:00:00Z]',
    plainpagefulltext=["Erdstoß"],
    paper_title='Kölnische Zeitung'
    )
df.head()

https://api.deutsche-digitale-bibliothek.de/search/index/newspaper-issues/select?rows=1000&sort=id+ASC&q=type%3Apage+AND+publication_date%3A%22%5B1909-01-01T12%3A00%3A00Z%5C+TO%5C+1912-01-01T12%3A00%3A00Z%5D%22+AND+%28plainpagefulltext%3AErdsto%C3%9F%29+AND+paper_title%3A%22K%C3%B6lnische%5C+Zeitung%22&cursorMark=%2A
Got 102 items.


Unnamed: 0,page_id,pagenumber,paper_title,provider_ddb_id,provider,zdb_id,publication_date,place_of_distribution,language,thumbnail,pagefulltext,pagename,preview_reference,plainpagefulltext
0,3ML37O5BXQD3EYOR5S777GQZKGDOCDIM-ALTO8633337_D...,2,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1911-12-23 12:00:00,"[Köln, Kleve (Kreis Kleve), Jülich]",[ger],dac9b430-b364-4fa9-9367-0e9c795c1103,[/data/altos/3M/L3/3ML37O5BXQD3EYOR5S777GQZKGD...,ALTO8633337_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Samstag , 23 . Dezember Offiziere , und über 3..."
1,42H5V33ALNFVOIG4SBM4YW3WQVKKHDMO-ALTO8384861_D...,10,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1910-06-07 12:00:00,"[Köln, Kleve (Kreis Kleve), Jülich]",[ger],0b5c3ed6-af50-4ea1-a206-78bf2e260dc1,[/data/altos/42/H5/42H5V33ALNFVOIG4SBM4YW3WQVK...,ALTO8384861_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Dienstag , 7 . Juni Kölnische Zeitung s Mittag..."
2,42H5V33ALNFVOIG4SBM4YW3WQVKKHDMO-ALTO8384865_D...,14,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1910-06-07 12:00:00,"[Köln, Kleve (Kreis Kleve), Jülich]",[ger],0b5c3ed6-af50-4ea1-a206-78bf2e260dc1,[/data/altos/42/H5/42H5V33ALNFVOIG4SBM4YW3WQVK...,ALTO8384865_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Dienstag , 7 . Juni Kölnische Zeitung 8 Abend ..."
3,477TOWWZGBGVO2T47FBUNYAPVERZTJQC-ALTO8170232_D...,2,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1909-04-18 12:00:00,"[Köln, Kleve (Kreis Kleve), Jülich]",[ger],457082cd-ea69-4273-a359-82e5ec7191d9,[/data/altos/47/7T/477TOWWZGBGVO2T47FBUNYAPVER...,ALTO8170232_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Sonntag , 18 . April — für das Königreich Sach..."
4,4R2P7K6IV6RKWVC2FKYMDFB3HPL7WMOH-ALTO8596509_D...,10,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1911-11-17 12:00:00,"[Köln, Kleve (Kreis Kleve), Jülich]",[ger],ad02f11f-3813-4992-8128-5146c67c4642,[/data/altos/4R/2P/4R2P7K6IV6RKWVC2FKYMDFB3HPL...,ALTO8596509_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Freitag , 17 . November selbständigen Zugangs ..."
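
The request URL printed above is percent-encoded, which makes it hard to see how the keyword arguments of `zp_pages` were turned into a search query. The following sketch decodes it with the standard library only. The URL is copied from the output above; the variable names are illustrative, and splitting on `" AND "` is a simplification that happens to work for this particular query.

```python
from urllib.parse import urlsplit, parse_qs

# Request URL exactly as printed by zp_pages() in the output above
url = (
    "https://api.deutsche-digitale-bibliothek.de/search/index/newspaper-issues/select"
    "?rows=1000&sort=id+ASC"
    "&q=type%3Apage+AND+publication_date%3A%22%5B1909-01-01T12%3A00%3A00Z%5C+TO%5C+"
    "1912-01-01T12%3A00%3A00Z%5D%22+AND+%28plainpagefulltext%3AErdsto%C3%9F%29"
    "+AND+paper_title%3A%22K%C3%B6lnische%5C+Zeitung%22"
    "&cursorMark=%2A"
)

# parse_qs decodes both the percent escapes and the "+" signs
params = parse_qs(urlsplit(url).query)
query = params["q"][0]

# Naive split into the individual search clauses (sufficient for this query)
clauses = query.split(" AND ")
for clause in clauses:
    print(clause)
```

Each printed clause corresponds to one filter: the record type, the date range, the full-text keyword, and the newspaper title.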


In [4]:
# @markdown #### Save the results as Excel file
df.to_excel('dataset.xlsx', index=False)


In [5]:
# @markdown #### We can narrow down the text surrounding the keyword in order to reduce the input tokens for the model.
def extract_context(keywords, text, tokens_before, tokens_after):
    # Note: tokens_before and tokens_after are applied to string indices,
    # so they count characters, not model tokens.
    # Find all positions of the keywords in the text
    keyword_positions = []

    for keyword in keywords:
        keyword_start = text.find(keyword)
        while keyword_start != -1:
            keyword_positions.append(keyword_start)
            keyword_start = text.find(keyword, keyword_start + len(keyword))

    if not keyword_positions:
        return "Keywords not found in text."

    # Determine the first and last occurrence of any keyword
    first_occurrence = min(keyword_positions)
    last_occurrence = max(keyword_positions)

    # Compute the start of the context window
    start_index = max(0, first_occurrence - tokens_before)

    # Compute the end of the context window
    end_index = min(len(text), last_occurrence + tokens_after)

    # Extract the context window
    context = text[start_index:end_index]

    return context

# List of keywords
keywords = ["Erdstoß"]

df['context'] = [extract_context(keywords, row['plainpagefulltext'], 2000, 3000) for _, row in df.iterrows()]


In [6]:
# @markdown #### Save the results as Excel file
df.to_excel('dataset_context.xlsx', index=False)
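
To make the windowing behaviour of `extract_context` concrete, here is a minimal sketch on a toy string. `char_window` is a hypothetical single-keyword helper that mirrors the logic above; it is not part of the notebook or of `ddbapi`. It illustrates two details: the window sizes are measured in characters, and the trailing window is counted from the start of the last keyword occurrence, so it includes the keyword itself.

```python
def char_window(text: str, keyword: str, before: int, after: int) -> str:
    """Hypothetical helper mirroring extract_context for a single keyword."""
    positions = []
    start = text.find(keyword)
    while start != -1:
        positions.append(start)
        start = text.find(keyword, start + len(keyword))
    if not positions:
        return ""
    # One window spanning from the first to the last occurrence
    lo = max(0, min(positions) - before)
    hi = min(len(text), max(positions) + after)
    return text[lo:hi]

sample = "aaaa Erdstoß bbbb Erdstoß cccc"
window = char_window(sample, "Erdstoß", before=2, after=9)
print(repr(window))  # 'a Erdstoß bbbb Erdstoß c'
```

With `after=9`, seven characters are used up by the last "Erdstoß" itself, so only two characters of following text are kept. The same applies to the value 3000 used above.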


## Setting up the requirements for the NVIDIA Nemotron model

Llama-3.1-Nemotron-70B-Instruct is a large language model customized by NVIDIA to improve the helpfulness of LLM-generated responses to user queries.

In [7]:
# Install the pinned httpx version first, so that the openai import below picks it up
!pip uninstall -y httpx
!pip install httpx==0.27.2

import pandas as pd
from typing import List, Dict
from openai import OpenAI
from google.colab import userdata

Found existing installation: httpx 0.28.1
Uninstalling httpx-0.28.1:
  Successfully uninstalled httpx-0.28.1
Collecting httpx==0.27.2
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Downloading httpx-0.27.2-py3-none-any.whl (76 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76.4/76.4 kB 3.0 MB/s eta 0:00:00
Installing collected packages: httpx
Successfully installed httpx-0.27.2


In [8]:
# @markdown ##### Get a free API key at https://build.nvidia.com/nvidia/llama-3_1-nemotron-70b-instruct and add it to the secrets of this Colab notebook (right bar). Name the secret NVIDIA_TOKEN and enter the key as its value.

# Initialize OpenAI client with NVIDIA API settings
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key = userdata.get('NVIDIA_TOKEN')
)
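
The cell above reads the key from Colab secrets, which only works inside Google Colab. As a minimal sketch for running the notebook elsewhere, the hypothetical helper `get_nvidia_key` below tries Colab secrets first and then falls back to an environment variable of the same name. The fallback is an assumption added for illustration and is not part of the original notebook.

```python
import os

def get_nvidia_key(name: str = "NVIDIA_TOKEN") -> str:
    """Return the API key from Colab secrets, else from an environment variable."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
        key = userdata.get(name)
    except Exception:
        # Not in Colab, or the secret is missing or not accessible
        key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"No API key found under the name {name!r}.")
    return key
```

The result can then be passed to the client as `api_key=get_nvidia_key()`.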


## Extract Articles

To extract articles on earthquakes, the prompt must state precisely that the articles are to be returned in their original form, without translations or corrections. A guide to writing effective prompts can be found [here](https://support.google.com/a/users/answer/14200040?hl=en).

Depending on the size of the DataFrame, processing can take a while, because each row triggers one API request.
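
Because every row results in a separate request, a single temporary failure leaves an empty result for that row (the extraction code returns an empty string on errors). One possible way to make long runs more robust is a small retry wrapper with exponential backoff. `call_with_retry` below is a hypothetical helper, not part of the notebook or of the `openai` package, and the default values are arbitrary.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_retry(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call fn(); on failure wait base_delay * 2**attempt seconds and try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            # Re-raise the original error once the last attempt has failed
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")  # keeps type checkers satisfied
```

A request could then be wrapped as `call_with_retry(lambda: client.chat.completions.create(...))`, leaving the surrounding error handling unchanged.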

In [9]:
# Limit the demo to the first five pages; copy so that a new column can be added safely
df = df[:5].copy()

In [10]:
import pandas as pd
from typing import List, Dict
from openai import OpenAI

# Initialize OpenAI client with NVIDIA API settings
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=userdata.get('NVIDIA_TOKEN')
)

def analyze_dataframe(df: pd.DataFrame, text_column: str) -> pd.DataFrame:
    def analyze_text(text: str) -> str:
        system_prompt = """
# System Instructions
You are an expert text analyst and information retrieval specialist and hate summarization as well as enumerations.
Your task is to carefully analyze given texts and extract complete articles that contain specific themes. You never change original texts.

Classify as relevant if the text contains:
- Primary earthquake terminology from the 19th and 20th century
- Official earthquake reports
- geology and seismology
- Impact descriptions
- Solution description
- Technical description
- Aid
- Honors
- Political discussion and opinions on earthquakes
- Stories from victims and refugees
- Reports on refugees and victims
- Lives of victims
- historical references
- comparisons

Your output should consist of nothing else but the XML structure <article></article><verification></verification><human_verification_needed></human_verification_needed>

Maintain a neutral, objective stance throughout the analysis. Focus on accuracy and completeness in your extractions.
"""
        user_prompt = f"""
Bitte befolgen Sie diese Spezifikationen:
1. Definition eines Artikels: Ein Artikel ist eine semantische Einheit im Text, die sich deutlich von vorangehendem und nachfolgendem Inhalt abgrenzt (z.B. durch eine eigene Überschrift).
2. Antwortformat:
- Wenn ein oder mehrere relevante Artikel gefunden werden, strukturieren Sie Ihre Antwort mit XML-Tags wie im folgenden Beispiel, unter Verwendung der Tags article, verification und human_verification_needed (True oder False): <article>vollständiger extrahierter Artikelinhalt</article><verification>Ist die Einheit kohärent? Ist das Thema vorhanden? Ist der Artikel vollständig? Wurden alle Artikel gefunden?</verification><human_verification_needed>False</human_verification_needed>
- Geben Sie alle relevanten Artikel in ihrer Originalform zurück, ohne Ergänzungen, Auslassungen, Korrekturen oder Kommentare.
- Wenn keine relevanten Artikel gefunden werden, ist keine besondere Strukturierung erforderlich; geben Sie einfach "Kein relevanter Artikel gefunden." ohne weitere Erklärungen zurück.
3. Hinweise zur Segmentierung:
- Stellen Sie sicher, dass über mehrere Absätze verteilte Artikel als eine Einheit behandelt werden.
4. Menschliche Überprüfung notwendig:
- Kann die Werte "True" oder "False" haben
- False: Wenn Sie glauben, den Artikel korrekt segmentiert und seine Relevanz richtig eingeschätzt zu haben.
- True: Wenn Sie unsicher sind, ob Sie den vollständigen Inhalt des Artikels, wie er im Zeitungsdokument enthalten ist, erfasst haben oder ob er relevant ist.

Hier ist das Zeitungsdokument:

{text}
"""
        try:
            messages = [
                {
                    'role': 'system',
                    'content': system_prompt
                },
                {
                    'role': 'user',
                    'content': user_prompt
                }
            ]

            completion = client.chat.completions.create(
                model="nvidia/llama-3.1-nemotron-70b-instruct",
                messages=messages,
                temperature=0.0,
                max_tokens=20000
            )
            return completion.choices[0].message.content
        except Exception as e:
            print(f"Error in API call: {str(e)}")
            return ""

    # Apply the analysis to each row in the DataFrame
    df['separated_articles'] = df[text_column].apply(lambda x: analyze_text(x) if pd.notna(x) else "")

    return df

# Usage example (assuming df is your input DataFrame)
if __name__ == "__main__":
    # Process the DataFrame
    text_column = 'context'  # or your text column name
    result_df = analyze_dataframe(df, text_column)

    # Save the results
    result_df.to_excel('analyzed_results.xlsx', index=False)

    # Display sample results
    print("\nSample of processed articles:")
    print(result_df['separated_articles'].head())


Sample of processed articles:
0    <article>\nIn der Stadt Mexiko wurde heute ein...
1    <article>\nNeapel, 7. Juni. (Telegr.) Ein well...
2    <article>\nFoggia, 7. Juni. (Telegr.) Ein heft...
3    <article>\nV Brancaleone (Kalabrien), 17. Apri...
4    <article>\nErdbeben. \n# # \nMannheim, 16. Nov...
Name: separated_articles, dtype: object


In [11]:
result_df.to_excel('extracted_articles.xlsx', index=False)
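
The responses stored in `separated_articles` follow the tag structure requested in the prompt (`article`, `verification`, `human_verification_needed`). As a minimal sketch, they can be split into one record per article with a regular expression. A regular expression is used instead of an XML parser on the assumption that the responses are not guaranteed to be well-formed XML (there is no root element, and OCR text may contain characters such as `&`). `parse_response` and `BLOCK` are hypothetical names, and the pattern assumes the three tags always appear in the order given in the prompt.

```python
import re
from typing import Dict, List

# One article block: article text, verification note, and review flag
BLOCK = re.compile(
    r"<article>(?P<article>.*?)</article>\s*"
    r"<verification>(?P<verification>.*?)</verification>\s*"
    r"<human_verification_needed>(?P<flag>.*?)</human_verification_needed>",
    re.DOTALL,
)

def parse_response(response: str) -> List[Dict[str, object]]:
    """Split one model response into a list of article records."""
    records = []
    for match in BLOCK.finditer(response or ""):
        records.append({
            "article": match.group("article").strip(),
            "verification": match.group("verification").strip(),
            # The prompt asks for the literal strings "True" or "False"
            "human_verification_needed": match.group("flag").strip().lower() == "true",
        })
    return records
```

Applied to the results, `result_df['separated_articles'].apply(parse_response)` yields a list of records per page. Responses without the tag structure, such as "Kein relevanter Artikel gefunden.", produce an empty list.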