# Researching German Historical Newspapers with Llama AI Model
## Example: OCR Post-Correction

*Notebook created by Sarah Oberbichler (oberbichler@ieg-mainz.de)*

This notebook shows how LLMs can be used to support research with historical newspapers. In this example, the Llama 3 model is used to to correct OCR of previously OCR'd historical newspapers pages.

OCR quality has been a long-standing issue in digitization efforts. Historical newspapers are particularly affected due their complexity, historical fonts, or degradation. Additionally, OCR technology faced limitations when dealing with historical scripts.


### 1.   Query the German Historical Newspaper Portal

German historical newspapers from the German Digital Library can be accessed via the DDB-API. This API is open access and allows to query the Historical Newspapers available in the German Newspaper Portal ([Deutsches Zeitungsportal](https://https://www.deutsche-digitale-bibliothek.de/newspaper)). An instruction, provided by the German Newspaper Portal, can be found [here](https://https://deepnote.com/app/karl-kragelin-b83c/Zeitungsportal-API-d9224dda-8e26-4b35-a6d7-40e9507b1151).

In [13]:
!git clone https://github.com/ieg-dhr/NLP-Kurs_DMGK_Digitale-Geisteswissenschaften.git

Cloning into 'NLP-Kurs_DMGK_Digitale-Geisteswissenschaften'...
remote: Enumerating objects: 162, done.[K
remote: Counting objects: 100% (159/159), done.[K
remote: Compressing objects: 100% (104/104), done.[K
remote: Total 162 (delta 69), reused 126 (delta 53), pack-reused 3 (from 1)[K
Receiving objects: 100% (162/162), 65.75 KiB | 21.92 MiB/s, done.
Resolving deltas: 100% (69/69), done.


In [None]:
# prompt: please explain what the sklearn package is

# The scikit-learn (sklearn) package is a powerful and widely used Python library for machine learning.
# It provides a comprehensive set of tools for various machine learning tasks, including:

# * Classification:  Predicting the category or class of a data point (e.g., spam detection, image recognition).
# * Regression: Predicting a continuous value (e.g., predicting house prices, stock prices).
# * Clustering: Grouping similar data points together (e.g., customer segmentation, document clustering).
# * Dimensionality reduction: Reducing the number of features in a dataset while preserving important information (e.g., principal component analysis).
# * Model selection: Choosing the best machine learning model for a given task (e.g., comparing different algorithms, tuning hyperparameters).
# * Preprocessing: Preparing data for machine learning (e.g., handling missing values, feature scaling).


# Key Features:

# * User-Friendly: Scikit-learn is designed to be easy to use, even for beginners.
# * Consistent API:  It has a consistent and well-defined API that makes it easy to switch between different models and algorithms.
# * Extensive Documentation: It offers comprehensive documentation, tutorials, and examples.
# * Open Source: It is an open-source library, meaning it's free to use and distribute.


# Examples of Usage:

# You can use scikit-learn to train models like:

# * Linear Regression
# * Logistic Regression
# * Support Vector Machines (SVM)
# * Decision Trees
# * Random Forests
# * K-Nearest Neighbors (KNN)
# * Naive Bayes
# * And many more...

# In essence, scikit-learn is a valuable resource for building and deploying machine learning models in Python.


In [None]:
# @markdown #####  Launch this cell and get access to the API of the Newspaper Portal from the German Digital Library
!pip install ddbapi

Collecting ddbapi
  Downloading ddbapi-0.1.2.tar.gz (5.2 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: ddbapi
  Building wheel for ddbapi (setup.py) ... [?25l[?25hdone
  Created wheel for ddbapi: filename=ddbapi-0.1.2-py3-none-any.whl size=5384 sha256=398ab430c89597b9b0c3185860aca54a1fa8f5ae6bb972521c8451ac86b64a29
  Stored in directory: /root/.cache/pip/wheels/0a/93/7e/69ec8f7396174c1532d0f9c5b9a343c6df0353071db93e4b2b
Successfully built ddbapi
Installing collected packages: ddbapi
Successfully installed ddbapi-0.1.2


In [None]:
# @markdown ####  Import the necessary packages
import pandas as pd
from ddbapi import zp_issues, zp_pages, list_column, filter

In [None]:
# @markdown ### Possible kwargs for the functions are:
# @markdown - language: Use ISO Codes, currently ger, eng, fre, spa
# @markdown - place_of_distribution: Search inside "Verbreitungsort"
# @markdown - use a list for multiple search-words
# @markdown - publication_date: Get newspapers by publication date.
# @markdown - zdb_id: Search by ZDB-ID
# @markdown - provider: Search by Data Provider
# @markdown - paper_title: Search inside the title of the Newspaper
# @markdown - plainpagefulltex: search inside the OCR
# Get the data
# Get the data
df = zp_pages(
    publication_date='[1909-01-01T12:00:00Z TO 1909-12-31T12:00:00Z]',
    #plainpagefulltext=["Erdbeben"],
    paper_title='Norddeutsche allgemeine Zeitung'
    )

df.head()

https://api.deutsche-digitale-bibliothek.de/search/index/newspaper-issues/select?rows=1000&sort=id+ASC&q=type%3Apage+AND+publication_date%3A%22%5B1909-01-01T12%3A00%3A00Z%5C+TO%5C+1909-12-31T12%3A00%3A00Z%5D%22+AND+paper_title%3A%22Norddeutsche%5C+allgemeine%5C+Zeitung%22&cursorMark=%2A
Getting 1000 of 3389
Getting 2000 of 3389
Getting 3000 of 3389
Getting 3389 of 3389
Got 3389 items.


Unnamed: 0,page_id,pagenumber,paper_title,provider_ddb_id,provider,zdb_id,publication_date,place_of_distribution,language,thumbnail,pagefulltext,pagename,preview_reference,plainpagefulltext
0,24VESKH5KBXN5CQM53ERPEEPZJ35BEDV-FID-F_SBB_000...,1,Norddeutsche allgemeine Zeitung,6GFV3I4ELFEEFQIN2WECOXMTI5FUWHCK,Staatsbibliothek zu Berlin - Preußischer Kultu...,2802868-5,1909-06-26 12:00:00,[Berlin],[ger],465fe093-d249-4a89-9cb6-5bf77033ba7b,[/data/altos/24/VE/24VESKH5KBXN5CQM53ERPEEPZJ3...,FID-F_SBB_00007_19090626_048_147_0_001-ALTO_DD...,https://api.deutsche-digitale-bibliothek.de/bi...,"B 11 Dir „Norddeutsche Allgemeine Zeitung"" ers..."
1,24VESKH5KBXN5CQM53ERPEEPZJ35BEDV-FID-F_SBB_000...,2,Norddeutsche allgemeine Zeitung,6GFV3I4ELFEEFQIN2WECOXMTI5FUWHCK,Staatsbibliothek zu Berlin - Preußischer Kultu...,2802868-5,1909-06-26 12:00:00,[Berlin],[ger],465fe093-d249-4a89-9cb6-5bf77033ba7b,[/data/altos/24/VE/24VESKH5KBXN5CQM53ERPEEPZJ3...,FID-F_SBB_00007_19090626_048_147_0_002-ALTO_DD...,https://api.deutsche-digitale-bibliothek.de/bi...,wärtn — well dauernd — ein schwerer Schlug für...
2,24VESKH5KBXN5CQM53ERPEEPZJ35BEDV-FID-F_SBB_000...,3,Norddeutsche allgemeine Zeitung,6GFV3I4ELFEEFQIN2WECOXMTI5FUWHCK,Staatsbibliothek zu Berlin - Preußischer Kultu...,2802868-5,1909-06-26 12:00:00,[Berlin],[ger],465fe093-d249-4a89-9cb6-5bf77033ba7b,[/data/altos/24/VE/24VESKH5KBXN5CQM53ERPEEPZJ3...,FID-F_SBB_00007_19090626_048_147_0_003-ALTO_DD...,https://api.deutsche-digitale-bibliothek.de/bi...,M 147. 26. Im» 1909. Norddeutsche Allgemeine Z...
3,24VESKH5KBXN5CQM53ERPEEPZJ35BEDV-FID-F_SBB_000...,4,Norddeutsche allgemeine Zeitung,6GFV3I4ELFEEFQIN2WECOXMTI5FUWHCK,Staatsbibliothek zu Berlin - Preußischer Kultu...,2802868-5,1909-06-26 12:00:00,[Berlin],[ger],465fe093-d249-4a89-9cb6-5bf77033ba7b,[/data/altos/24/VE/24VESKH5KBXN5CQM53ERPEEPZJ3...,FID-F_SBB_00007_19090626_048_147_0_004-ALTO_DD...,https://api.deutsche-digitale-bibliothek.de/bi...,immer günstiger. Die Kandidaten de« gegenwärti...
4,24VESKH5KBXN5CQM53ERPEEPZJ35BEDV-FID-F_SBB_000...,5,Norddeutsche allgemeine Zeitung,6GFV3I4ELFEEFQIN2WECOXMTI5FUWHCK,Staatsbibliothek zu Berlin - Preußischer Kultu...,2802868-5,1909-06-26 12:00:00,[Berlin],[ger],465fe093-d249-4a89-9cb6-5bf77033ba7b,[/data/altos/24/VE/24VESKH5KBXN5CQM53ERPEEPZJ3...,FID-F_SBB_00007_19090626_048_147_0_005-ALTO_DD...,https://api.deutsche-digitale-bibliothek.de/bi...,5ttw^tn:|t JJreb. Jtlcole .10 118* Habt: Pred....


In [None]:
# prompt: export excel

# Assuming 'df' is your DataFrame
df.to_excel('output.xlsx', index=False)


In [None]:
# prompt: create embeddings with sklearn package

from sklearn.feature_extraction.text import TfidfVectorizer

# Assuming 'df' contains a column named 'plainpagefulltext' with the text data
corpus = df['plainpagefulltext'].tolist()

# Create a TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Fit and transform the corpus to create embeddings
embeddings = vectorizer.fit_transform(corpus)

# Print the shape of the embeddings matrix
print(embeddings.shape)


(3389, 1111557)


In [None]:
pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-3.1.1-py3-none-any.whl.metadata (10 kB)
Downloading sentence_transformers-3.1.1-py3-none-any.whl (245 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m245.3/245.3 kB[0m [31m18.1 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: sentence-transformers
Successfully installed sentence-transformers-3.1.1


In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import re
from sentence_transformers import SentenceTransformer

# List of natural disaster-related terms in German
disaster_terms = [
    'Naturkatastrophe', 'Erdbeben', 'Überschwemmung', 'Hochwasser', 'Hurrikan', 'Orkan',
    'Tornado', 'Tsunami', 'Waldbrand', 'Dürre', 'Erdrutsch', 'Vulkanausbruch',
    'Zyklone', 'Taifun', 'Schneesturm', 'Lawine', 'Sturmflut', 'Hitzewelle',
    'Kältewelle', 'Hungersnot', 'Epidemie', 'Pandemie', 'Sturm', 'Unwetter'
]

def preprocess_text(text):
    if pd.isna(text):
        return ""
    text = str(text)
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with single space
    return text.strip()

def find_disaster_terms(text):
    return ', '.join([term for term in disaster_terms if term.lower() in str(text).lower()])

def find_relevant_disaster_pages(df, query, model, max_pages=20):
    df['processed_text'] = df['plainpagefulltext'].apply(preprocess_text)

    # Encode the query and all documents
    query_embedding = model.encode([query])
    document_embeddings = model.encode(df['processed_text'].tolist())

    # Calculate semantic similarities
    similarities = cosine_similarity(query_embedding, document_embeddings)[0]

    df['semantic_relevanz'] = similarities

    # Sort by semantic relevance score in descending order
    df_sorted = df.sort_values('semantic_relevanz', ascending=False)

    # Select top results
    relevant_df = df_sorted.head(max_pages)

    return relevant_df

# Load the pre-trained multilingual model
print("Loading the sentence transformer model...")
model = SentenceTransformer('distiluse-base-multilingual-cased-v2')
print("Model loaded successfully.")

# Assuming df is your DataFrame with a 'plainpagefulltext' column
# If it's not loaded, uncomment and adjust the following line:
# df = pd.read_csv('your_data.csv')

query = "Naturkatastrophen und ihre Auswirkungen auf Menschen und Umwelt"

print("\nFinding relevant pages...")
df_result = find_relevant_disaster_pages(df, query, model)

# Add columns for disaster term analysis
df_result['enthält_katastrophenbegriffe'] = df_result['plainpagefulltext'].apply(
    lambda x: any(term.lower() in str(x).lower() for term in disaster_terms)
)
df_result['gefundene_katastrophenbegriffe'] = df_result['plainpagefulltext'].apply(find_disaster_terms)

# Create a summary column
df_result['zusammenfassung'] = df_result.apply(
    lambda row: f"{'Mit' if row['enthält_katastrophenbegriffe'] else 'Ohne'} explizite Katastrophenbegriffe. "
                f"Gefundene Begriffe: {row['gefundene_katastrophenbegriffe'] if row['enthält_katastrophenbegriffe'] else 'Keine'}. "
                f"Relevanz: {row['semantic_relevanz']:.4f}",
    axis=1
)

# Select and reorder columns for the final output
columns_to_display = ['semantic_relevanz', 'enthält_katastrophenbegriffe', 'gefundene_katastrophenbegriffe', 'zusammenfassung', 'plainpagefulltext']
df_output = df_result[columns_to_display]

# Display the resulting DataFrame
print("\nRelevante Seiten:")
print(df_output)

# Print summary statistics
print(f"\nGesamtanzahl der gefundenen relevanten Seiten: {len(df_output)}")
print(f"Seiten mit expliziten Katastrophenbegriffen: {df_output['enthält_katastrophenbegriffe'].sum()}")
print(f"Seiten ohne explizite Katastrophenbegriffe: {len(df_output) - df_output['enthält_katastrophenbegriffe'].sum()}")

# Optionally, save the result to a CSV file
df_output.to_csv('disaster_relevant_pages.csv', index=False)
print("\nErgebnisse wurden in 'disaster_relevant_pages.csv' gespeichert.")

KeyboardInterrupt: 

In [None]:
# prompt: import excel 'output'

import pandas as pd

# Assuming 'output.xlsx' is the name of your Excel file
df = pd.read_excel('output (1).xlsx')

# Now you can work with the data in the DataFrame 'df_excel'
print(df.head())


                                             page_id  pagenumber  \
0  24VESKH5KBXN5CQM53ERPEEPZJ35BEDV-FID-F_SBB_000...           1   
1  24VESKH5KBXN5CQM53ERPEEPZJ35BEDV-FID-F_SBB_000...           2   
2  24VESKH5KBXN5CQM53ERPEEPZJ35BEDV-FID-F_SBB_000...           3   
3  24VESKH5KBXN5CQM53ERPEEPZJ35BEDV-FID-F_SBB_000...           4   
4  24VESKH5KBXN5CQM53ERPEEPZJ35BEDV-FID-F_SBB_000...           5   

                       paper_title                   provider_ddb_id  \
0  Norddeutsche allgemeine Zeitung  6GFV3I4ELFEEFQIN2WECOXMTI5FUWHCK   
1  Norddeutsche allgemeine Zeitung  6GFV3I4ELFEEFQIN2WECOXMTI5FUWHCK   
2  Norddeutsche allgemeine Zeitung  6GFV3I4ELFEEFQIN2WECOXMTI5FUWHCK   
3  Norddeutsche allgemeine Zeitung  6GFV3I4ELFEEFQIN2WECOXMTI5FUWHCK   
4  Norddeutsche allgemeine Zeitung  6GFV3I4ELFEEFQIN2WECOXMTI5FUWHCK   

                                            provider     zdb_id  \
0  Staatsbibliothek zu Berlin - Preußischer Kultu...  2802868-5   
1  Staatsbibliothek zu B

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import re
from sentence_transformers import SentenceTransformer
from collections import Counter

def preprocess_text(text):
    if pd.isna(text):
        return ""
    text = str(text).lower()
    text = re.sub(r'[^a-zäöüß\s]', '', text)
    return text

def get_unique_words(text):
    words = text.split()
    return list(set(words))

def find_similar_words(df, target_word, model, top_n=40):
    # Preprocess the text
    df['processed_text'] = df['plainpagefulltext'].apply(preprocess_text)

    # Get unique words from all texts
    all_words = []
    for text in df['processed_text']:
        all_words.extend(get_unique_words(text))

    # Get unique words and their frequencies
    word_freq = Counter(all_words)
    unique_words = list(word_freq.keys())

    # Remove very common and very rare words
    min_freq = 2
    max_freq = len(df) * 0.5  # Words appearing in more than 50% of documents
    filtered_words = [word for word in unique_words if min_freq <= word_freq[word] <= max_freq]

    print(f"Number of unique words after filtering: {len(filtered_words)}")

    # Encode the target word and filtered words
    target_embedding = model.encode([target_word])
    word_embeddings = model.encode(filtered_words)

    # Calculate similarities
    similarities = cosine_similarity(target_embedding, word_embeddings)[0]

    # Create a DataFrame with words and their similarities
    word_sim_df = pd.DataFrame({
        'word': filtered_words,
        'similarity': similarities
    })

    # Sort by similarity and get top N results
    top_similar = word_sim_df.sort_values('similarity', ascending=False).head(top_n)

    return top_similar

# Load the pre-trained multilingual model
print("Loading the sentence transformer model...")
model = SentenceTransformer('sentence-transformers/LaBSE')
print("Model loaded successfully.")

# Assuming df is your DataFrame with a 'plainpagefulltext' column
# If it's not loaded, uncomment and adjust the following line:
# df = pd.read_csv('your_data.csv')

target_word = "Umweltkatastrophe"

print(f"\nFinding words similar to '{target_word}'...")
similar_words = find_similar_words(df, target_word, model)

print("\nMost similar words to 'Naturkatastrophe':")
print(similar_words)

# Optionally, save the result to a CSV file
similar_words.to_csv('similar_words_to_naturkatastrophe.csv', index=False)
print("\nResults saved to 'similar_words_to_naturkatastrophe.csv'")

Loading the sentence transformer model...


modules.json:   0%|          | 0.00/461 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/2.22k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/804 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.88G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/397 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/5.22M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.62M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]



1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.36M [00:00<?, ?B/s]

2_Dense/config.json:   0%|          | 0.00/114 [00:00<?, ?B/s]

Model loaded successfully.

Finding words similar to 'Umweltkatastrophe'...
Number of unique words after filtering: 232930

Most similar words to 'Naturkatastrophe':
                          word  similarity
70319      erdbebenkatastrophe    0.727620
198414   hochwasserkatastrophe    0.693724
214670        brandkatastrophe    0.688872
139068       radbodkatastrophe    0.676319
116535    eisenbahnkatastrophe    0.670850
119853     hochbahnkatastrophe    0.646157
88078          erdbebenunglück    0.631625
178541      katasterlandmeffer    0.626045
109313        wirtschaftskrise    0.621209
185299            klimatologie    0.617173
162399        trauerkundgebung    0.614843
36262            ministerkrise    0.613366
183187  ueberschwemmungsgefahr    0.612411
223997         ohnmachtsanfall    0.610751
173141                pestfall    0.605104
44344              katastrophe    0.602942
146716          kabinettskrise    0.602591
217394             katastropbe    0.601382
205978           

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import re
from sentence_transformers import SentenceTransformer
from collections import Counter

def preprocess_text(text):
    if pd.isna(text):
        return ""
    text = str(text).lower()
    text = re.sub(r'[^a-zäöüß\s]', '', text)
    return text

def get_unique_words(text):
    words = text.split()
    return list(set(words))

def find_similar_words(df, target_word, model, top_n=20):
    df['processed_text'] = df['plainpagefulltext'].apply(preprocess_text)

    all_words = []
    for text in df['processed_text']:
        all_words.extend(get_unique_words(text))

    word_freq = Counter(all_words)
    unique_words = list(word_freq.keys())

    min_freq = 2
    max_freq = len(df) * 0.5
    filtered_words = [word for word in unique_words if min_freq <= word_freq[word] <= max_freq]

    print(f"Number of unique words after filtering: {len(filtered_words)}")

    target_embedding = model.encode([target_word])
    word_embeddings = model.encode(filtered_words)

    similarities = cosine_similarity(target_embedding, word_embeddings)[0]

    word_sim_df = pd.DataFrame({
        'word': filtered_words,
        'similarity': similarities
    })

    top_similar = word_sim_df.sort_values('similarity', ascending=False).head(top_n)

    return top_similar

def find_pages_with_similar_words(df, similar_words):
    df['processed_text'] = df['plainpagefulltext'].apply(preprocess_text)

    def contains_similar_words(text):
        return [word for word in similar_words if word in text]

    df['similar_words_found'] = df['processed_text'].apply(contains_similar_words)
    df['contains_similar_words'] = df['similar_words_found'].apply(len) > 0

    return df[df['contains_similar_words']]

# Load the pre-trained multilingual model
print("Loading the sentence transformer model...")
model = SentenceTransformer('distiluse-base-multilingual-cased-v2')
print("Model loaded successfully.")

# Assuming df is your DataFrame with a 'plainpagefulltext' column
# If it's not loaded, uncomment and adjust the following line:
# df = pd.read_csv('your_data.csv')

target_word = "Naturkatastrophe"

print(f"\nFinding words similar to '{target_word}'...")
similar_words_df = find_similar_words(df, target_word, model)

print("\nMost similar words to 'Naturkatastrophe':")
print(similar_words_df)

similar_words_list = similar_words_df['word'].tolist()

print("\nFinding pages containing similar words...")
relevant_pages_df = find_pages_with_similar_words(df, similar_words_list)

print(f"\nNumber of pages containing similar words: {len(relevant_pages_df)}")

# Select and reorder columns for the final output
columns_to_display = ['similar_words_found', 'plainpagefulltext']
df_output = relevant_pages_df[columns_to_display]

# Display a sample of the resulting DataFrame
print("\nSample of relevant pages (first 5 rows):")
print(df_output.head())

# Optionally, save the full result to a CSV file
df_output.to_csv('disaster_related_pages.csv', index=False)
print("\nFull results saved to 'disaster_related_pages.csv'")

# Print some statistics
print("\nStatistics:")
print(f"Total number of pages in the original dataset: {len(df)}")
print(f"Number of pages containing words similar to 'Naturkatastrophe': {len(df_output)}")
print(f"Percentage of relevant pages: {len(df_output) / len(df) * 100:.2f}%")

# Display the distribution of similar words found
word_counts = relevant_pages_df['similar_words_found'].apply(len).value_counts().sort_index()
print("\nDistribution of number of similar words found per page:")
print(word_counts)

ModuleNotFoundError: No module named 'sentence_transformers'

In [None]:
# prompt: df with only contains_similar_words = True

relevant_pages_df = df[df['contains_similar_words'] == True]


In [None]:
# prompt: export df as excel

df.to_excel('output.xlsx', index=False)


In [None]:
# @markdown #### We can narrow down the text surrounding the keyword in order to reduce the input tokens for the model. Choose the size of the context window here:
context_window = 2000 # @param {type:"number"}
def extract_context(keyword, text, window_size=context_window):
    index = text.find(keyword)
    if index == -1:
        return "Keyword not found in text."

    start_index = max(0, index - window_size)
    end_index = min(len(text), index + len(keyword) + window_size)

    context = text[start_index:end_index]

    return context

# Extract context for each row
contexts = []
for index, row in relevant_pages_df.iterrows():
    text = row['plainpagefulltext']
    keyword = "uswander"  # You can modify this
    context = extract_context(keyword, text)
    contexts.append(context)

# Add the context to the dataframe
relevant_pages_df['context'] = contexts

# Print the dataframe with context

relevant_pages_df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  relevant_pages_df['context'] = contexts


Unnamed: 0,page_id,pagenumber,paper_title,provider_ddb_id,provider,zdb_id,publication_date,place_of_distribution,language,thumbnail,...,pagename,preview_reference,plainpagefulltext,processed_text,katastrophen_relevanz,relevante_begriffe,semantic_relevanz,similar_words_found,contains_similar_words,context
4,24VESKH5KBXN5CQM53ERPEEPZJ35BEDV-FID-F_SBB_000...,5,Norddeutsche allgemeine Zeitung,6GFV3I4ELFEEFQIN2WECOXMTI5FUWHCK,Staatsbibliothek zu Berlin - Preußischer Kultu...,2802868-5,1909-06-26 12:00:00,[Berlin],[ger],465fe093-d249-4a89-9cb6-5bf77033ba7b,...,FID-F_SBB_00007_19090626_048_147_0_005-ALTO_DD...,https://api.deutsche-digitale-bibliothek.de/bi...,5ttw^tn:|t JJreb. Jtlcole .10 118* Habt: Pred....,ttwtnt jjreb jtlcole habt pred pöronc lo uhr...,0.000791,Sturm,0.008278,[naturtreue],True,Keyword not found in text.
53,2A75T5MUBJD6VXNEPUT4HGXFTLKLW7PN-FID-F_SBB_000...,3,Norddeutsche allgemeine Zeitung,6GFV3I4ELFEEFQIN2WECOXMTI5FUWHCK,Staatsbibliothek zu Berlin - Preußischer Kultu...,2802868-5,1909-05-04 12:00:00,[Berlin],[ger],75a7bae5-9f9b-4f92-8f3a-e23ebd3437aa,...,FID-F_SBB_00007_19090504_048_103_0_003-ALTO_DD...,https://api.deutsche-digitale-bibliothek.de/bi...,M 103 . 4. Mai 4909. Norddeutsche Allgemeine Z...,m mai norddeutsche allgemeine zeitung beib...,0.005974,"Erdbeben, Hochwasser",0.048659,"[erdbebenkatastrophe, katastrophe]",True,Keyword not found in text.
61,2A75T5MUBJD6VXNEPUT4HGXFTLKLW7PN-FID-F_SBB_000...,11,Norddeutsche allgemeine Zeitung,6GFV3I4ELFEEFQIN2WECOXMTI5FUWHCK,Staatsbibliothek zu Berlin - Preußischer Kultu...,2802868-5,1909-05-04 12:00:00,[Berlin],[ger],75a7bae5-9f9b-4f92-8f3a-e23ebd3437aa,...,FID-F_SBB_00007_19090504_048_103_0_011-ALTO_DD...,https://api.deutsche-digitale-bibliothek.de/bi...,V \ Parlaments-Beilage Der Nachdruck der Beric...,v parlamentsbeilage der nachdruck der bericht...,0.000272,,-0.059698,[naturwiffen],True,Keyword not found in text.
66,2APQOSQID72CSGBSKIVC2JURZPPBQIVR-FID-F_SBB_000...,4,Norddeutsche allgemeine Zeitung,6GFV3I4ELFEEFQIN2WECOXMTI5FUWHCK,Staatsbibliothek zu Berlin - Preußischer Kultu...,2802868-5,1909-05-06 12:00:00,[Berlin],[ger],c0928d42-22f6-4251-838a-b9231f49bd37,...,FID-F_SBB_00007_19090506_048_105_0_004-ALTO_DD...,https://api.deutsche-digitale-bibliothek.de/bi...,"Folgen haben, die heute noch niemand auch annä...",folgen haben die heute noch niemand auch annäh...,0.000731,,0.008888,"[katastrophen, katastrophe]",True,Keyword not found in text.
88,2F47H7PEOJ2SIK72EEK6VHPFMBLKGVJS-FID-F_SBB_000...,6,Norddeutsche allgemeine Zeitung,6GFV3I4ELFEEFQIN2WECOXMTI5FUWHCK,Staatsbibliothek zu Berlin - Preußischer Kultu...,2802868-5,1909-07-09 12:00:00,[Berlin],[ger],fe5772e5-c431-44d9-8142-127b8a392159,...,FID-F_SBB_00007_19090709_048_158_0_006-ALTO_DD...,https://api.deutsche-digitale-bibliothek.de/bi...,"S t, und da für die Aiaretten, die unsere Dame...",s t und da für die aiaretten die unsere damen ...,0.011707,Zyklone,0.00484,[katastrophe],True,Keyword not found in text.


In [None]:
# @markdown #### Save the results as Excel file
df.to_excel('newspaper_rückkehrer.xlsx', index=False)

## Setting up the requirements for the Llama model

Llama 3 is a family of models developed by Meta. Llama 3 instruction-tuned models are fine-tuned and optimized for dialogue/chat use cases and outperform many of the available open-source chat models on common benchmarks.

In [None]:
pip install groq

Collecting groq
  Downloading groq-0.11.0-py3-none-any.whl.metadata (13 kB)
Collecting httpx<1,>=0.23.0 (from groq)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->groq)
  Downloading httpcore-1.0.6-py3-none-any.whl.metadata (21 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->groq)
  Downloading h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Downloading groq-0.11.0-py3-none-any.whl (106 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m106.5/106.5 kB[0m [31m7.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading httpx-0.27.2-py3-none-any.whl (76 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m76.4/76.4 kB[0m [31m7.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading httpcore-1.0.6-py3-none-any.whl (78 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m78.0/78.0 kB[0m [31m6.7 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading h11-0.14.0-py3-none-any.whl (58 kB

In [None]:
import os
os.environ["GROQ_API_KEY"] = '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'

In [None]:
import os

from groq import Groq

client = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Explain the importance of fast language models",
        }
    ],
    model="llama3-8b-8192",
)

print(chat_completion.choices[0].message.content)

Fast language models, also known as efficient language models or lightweight language models, are variants of traditional language models that are designed to be faster and more efficient in their computations while maintaining similar or even improved performance. The importance of fast language models can be seen from the following perspectives:

1. **Scalability**: Fast language models enable the processing of large amounts of data and the generation of responses at a quicker pace, making them suitable for real-time applications such as chatbots, virtual assistants, and language translation tools.
2. **Resource Efficiency**: Fast language models require less computational resources, such as memory and processing power, which is essential for deployment on low-resource devices, mobile applications, or edge computing environments.
3. **Improved Latency**: Faster response times and lower latency are essential for applications that require instant feedback, such as text-based conversati

ConnectionError: HTTPSConnectionPool(host='api.croq.app', port=443): Max retries exceeded with url: /v1/chat/completions (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7c207dc0ed10>: Failed to resolve 'api.croq.app' ([Errno -2] Name or service not known)"))

In [None]:
pip install replicate


Collecting replicate
  Downloading replicate-0.34.1-py3-none-any.whl.metadata (25 kB)
Downloading replicate-0.34.1-py3-none-any.whl (45 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m45.9/45.9 kB[0m [31m702.1 kB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: replicate
Successfully installed replicate-0.34.1


In [None]:
# @markdown ##### Get an API key at https://replicate.com/, activate the billing, save your key as .env file. To do so, take following steps:
# @markdown - Open a Notepad and write REPLICATE_API_TOKEN = "your key"
# @markdown - Click on Save option and change the file type to 'All files'
# @markdown - Keep the file name as .env.
# @markdown - Hit Save Now the file is an .env file.


!pip install python-dotenv

import os
import dotenv

#Set the REPLICATE_API_TOKEN environment variable
from google.colab import drive
drive.mount('/content/drive')


Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1
Mounted at /content/drive


In [None]:
# @markdown Load the .env file into the drive/MyDrive
dotenv.load_dotenv('/content/drive/MyDrive/.env')

os.getenv('GROQ_API_KEY')

In [None]:
import os

from groq import Groq

client = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Explain the importance of fast language models",
        }
    ],
    model="llama3-8b-8192",
)

print(chat_completion.choices[0].message.content)

Fast language models, also known as large-scale pre-trained language models, have revolutionized the field of natural language processing (NLP) and have numerous important applications. The significance of fast language models can be seen in their:

1. **Improved Performance**: They have surpassed human-level performance in various NLP tasks, such as language translation, question answering, and text classification.
2. **Scalability**: Fast language models can be fine-tuned for specific downstream tasks, allowing them to adapt to new domains and tasks with ease, making them a powerful tool for many applications.
3. **Advancements in AI Research**: They have enabled researchers to explore new areas of AI, such as multimodal processing, sentiment analysis, and dialogue systems, among others.
4. **Breakthroughs in Conversational AI**: Fast language models have enabled the development of more sophisticated conversational AI systems, such as chatbots, virtual assistants, and personal assist

In [None]:
df = df[:10]

In [None]:
import os
from groq import Groq
import pandas as pd
from tqdm import tqdm

def ocr_correction(client, newspaper_page):
    prompt = f"""Bitte gebe nichts als den extrahierten Text wieder. Du bist ein Experte in Artikelseparierung für Zeitungen. Bitte separiere nur Zeitungsartikel in ihrer deutschen ORIGINALFORM (keine Zusammenfassungen und Zusätze) zum Thema Naturkatastrophen und korrigiere OCR fehler in

{newspaper_page}

---"""

    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": "You are expert in text extraction and OCR. Please don't comment, ask for feedback or questions."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        model="llama-3.1-70b-versatile",
        max_tokens=8000,
    )

    return chat_completion.choices[0].message.content

def process_dataframe(relevant_pages_df, api_key, context_column='context', result_column='article_corrected'):
    client = Groq(api_key=api_key)

    def process_row(row):
        newspaper_page = row[context_column]
        if newspaper_page.strip():
            return ocr_correction(client, newspaper_page)
        else:
            print("Skipping empty newspaper page")
            return ""

    tqdm.pandas(desc="Processing rows")
    relevant_pages_df[result_column] = relevant_pages_df.progress_apply(process_row, axis=1)
    return relevant_pages_df

# Usage example
if __name__ == "__main__":
    # Set up your Groq API key
    api_key = os.environ.get("GROQ_API_KEY")
    if not api_key:
        raise ValueError("GROQ_API_KEY environment variable not set")

    # Assuming you have a DataFrame 'df' with a 'context' column
    # df = pd.read_csv('your_data.csv')  # Uncomment and modify this line to load your data

    # Process the DataFrame
    relevant_pages_df = process_dataframe(relevant_pages_df, api_key)

    # Print the modified DataFrame
    relevant_pages_df

Processing rows: 100%|██████████| 326/326 [11:22<00:00,  2.09s/it]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  relevant_pages_df[result_column] = relevant_pages_df.progress_apply(process_row, axis=1)


In [None]:
# prompt: export the df as excel

# Export the DataFrame to an Excel file
relevant_pages_df.to_excel('ocr_corrected_newspaper.xlsx', index=False)


In [None]:
# prompt: I would like to chat with the column 'plainpagefulltext'

def chat_with_column(df, column_name, api_key):
  """Chats with the content of a specific column in a DataFrame using a language model.

  Args:
    df: The pandas DataFrame containing the data.
    column_name: The name of the column to chat with.
    api_key: The API key for the language model service.

  Returns:
    None
  """

  client = Groq(api_key=api_key)

  def chat_with_row(row):
    """Chats with the content of a single row in a DataFrame."""

    text = row[column_name]
    if not isinstance(text, str) or not text.strip():
      print("Skipping empty or non-string value.")
      return

    prompt = f""" Finde Zeitungssseiten die zum Thema Migration, Flucht, oder Rückkehrmigration berichten:

      {text}

      ---
    """

    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": "You are a helpful and informative assistant."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        model="llama-3.1-70b-versatile",
        max_tokens=8000,
    )

    print(chat_completion.choices[0].message.content)


  for index, row in df.iterrows():
    chat_with_row(row)


# Usage example
if __name__ == "__main__":
  # Set up your Groq API key
  api_key = os.environ.get("GROQ_API_KEY")
  if not api_key:
    raise ValueError("GROQ_API_KEY environment variable not set")

  # Chat with the 'plainpagefulltext' column of the DataFrame
  chat_with_column(df, 'plainpagefulltext', api_key)


RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for model `llama-3.1-70b-versatile` in organization `org_01j7111dyneqa889hytfqhked7` on : Limit 500000, Used 497056, Requested 9839. Please try again in 19m51.3278s. Visit https://console.groq.com/docs/rate-limits for more information.', 'type': '', 'code': 'rate_limit_exceeded'}}

# Run model for OCR-post correction

To run OCR-post correction, it is essential to formulate a precise prompt. For example, it needs to be specified that the whole text should be corrected, while summarizations and any other addition need to be avoided. A guide on how to write effective prompts can be found also [here](https://https://support.google.com/a/users/answer/14200040?hl=en).

Depending on the size of the dataframe, it can take a while to load.

In [None]:
import os
from groq import Groq

def test_groq_connection():
    # Try to get the API key from the environment variable
    api_key = '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'

    if not api_key:
        print("Error: GROQ_API_KEY environment variable is not set.")
        return

    print(f"API Key found: {api_key[:5]}...{api_key[-5:]}")  # Print first and last 5 characters of the API key

    try:
        # Initialize the Groq client
        client = Groq(api_key=api_key)
        print("Groq client created successfully")

        # Make a simple API call
        response = client.chat.completions.create(
            messages=[{"role": "user", "content": "Hello, world!"}],
            model="llama3-8b-8192",
            max_tokens=10
        )

        print("API call successful")
        print(f"Response: {response.choices[0].message.content}")

    except Exception as e:
        print(f"Error occurred: {str(e)}")
        print(f"Error type: {type(e).__name__}")

        if hasattr(e, 'response'):
            print(f"Response status code: {e.response.status_code}")
            print(f"Response content: {e.response.content}")

if __name__ == "__main__":
    test_groq_connection()

API Key found: gsk_F...wMx8Q
Groq client created successfully
API call successful
Response: Hello! It's nice to meet you, world


In [None]:
import json
import replicate

def OCR_correction(newspaper_page):
    # Define the prompt for separating articles

    input = {
    "prompt": f"Bitte separiere Artikel als Originaltext, die über Migration/Rückwanderung/Flucht/Auswanderung sprechen in \n\n{newspaper_page}\n\n---\n\ , ohne Zusammenfassung oder Ergänzungen aber mit OCR Korrektur",
    "prompt_template": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are an text extraction expert. Please don't ask for feedback or questions <|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    "max_new_tokens": 8000,
    }

    # Initialize an empty string to collect the response
    text = ""

    # Generate the response using the LLaMA model
    for event in replicate.stream(
        "meta/meta-llama-3-70b-instruct",
        input=input
    ):
        if event:
            text += str(event)
        else:
            print("Received empty event data")

    # Return the separated articles
    return text

# Assuming `df` is your dataframe
# Create an empty list to store the separated articles
post_OCR = []

# Loop through each row in the dataframe
for index, row in df.iterrows():
    # Extract the text of the newspaper page from the current row
    newspaper_page = row['context']

    # Separate articles for the current newspaper page only if newspaper_page is not empty
    if newspaper_page.strip():
        text = OCR_correction(newspaper_page)

        # Append the separated articles to the list, even if it’s empty
        post_OCR.append(text)
    else:
        print("Skipping empty newspaper page")

# Add the list of separated articles as a new column 'article' in the dataframe
df['article_corrected'] = post_OCR

# Print the modified dataframe
df


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['article_corrected'] = post_OCR


Unnamed: 0,page_id,pagenumber,paper_title,provider_ddb_id,provider,zdb_id,publication_date,place_of_distribution,language,thumbnail,pagefulltext,pagename,preview_reference,plainpagefulltext,context,article_corrected
0,2452IP73L263T7EUP326MLDLWGMWA6PL-FILE_0009_DDB...,9,Hamburger Fremdenblatt,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3024925-9,1906-09-11 12:00:00,[Hamburg],[ger],d13db2eb-59a1-494f-b6d1-4ec4f9ac647a,[/data/altos/24/52/2452IP73L263T7EUP326MLDLWGM...,FILE_0009_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,Aweite Beilage z«« Hamöuvgev Fvemden-Blatt Ne....,m 2 Uhr 27 Minuten nachmittags. Die Gesamtstär...,Here is the extracted article related to migra...
1,26ZENHX64GQFWANFKPXWMMDCLX7D5AV6-FILE_0009_DDB...,9,Schwäbischer Merkur : mit Schwäbischer Kronik ...,VNHXUCEEKHOUSYH4NVOUBHJGSRMOGK7J,Württembergische Landesbibliothek,2751625-8,1906-01-26 12:00:00,[Stuttgart],[ger],c7c3e5a5-1265-4d45-a7c3-241409568f66,[/data/altos/26/ZE/26ZENHX64GQFWANFKPXWMMDCLX7...,FILE_0009_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,«r. 4S. SVMVE MMW. ANMllM. Würdigung dteJtm.ql...,nd dr'irsie» diesem für die Autorität de- Sult...,Here are the articles separated into original ...
2,2ETGNYUAPVHLLE4SVVD66TK345MM4ACR-FILE_0009_DDB...,9,Hamburger Fremdenblatt,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3024925-9,1906-11-13 12:00:00,[Hamburg],[ger],0f18037d-34a3-4660-a20d-13baea8d26f9,[/data/altos/2E/TG/2ETGNYUAPVHLLE4SVVD66TK345M...,FILE_0009_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,Zweite Beilage zum Hamburger Frem-en-Blatt Nr....,"m Hause Paralleelstraße 12 zum Ausvruch. Da, u...",Here is the extracted article related to migra...
3,2KPSZQ2ZEXLY5EKLT36ZNXUNVXJPGUCD-ALTO6659890_D...,1,Aachener Anzeiger : politisches Tageblatt : be...,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2975858-0,1906-01-04 12:00:00,"[Aachen, Regierungsbezirk Aachen]",[ger],51e77301-8216-489b-a90c-1a1972c57efe,[/data/altos/2K/PS/2KPSZQ2ZEXLY5EKLT36ZNXUNVXJ...,ALTO6659890_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,90 nd und Afrika berschrift und Herz erk “ . W...,"lig mittellosen Arbeitern und Handwerkern , di...",Here is the extracted article about migration/...
4,2MWORD7UEUFHSGZOQZ6TQYPEIPKQGG73-FILE_0009_DDB...,9,Hamburger Fremdenblatt,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3024925-9,1906-09-04 12:00:00,[Hamburg],[ger],b789cbd1-4d59-4575-aac6-803023299945,[/data/altos/2M/WO/2MWORD7UEUFHSGZOQZ6TQYPEIPK...,FILE_0009_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,Zweite Beilage zum Hamburger Aremden-GLatt Nr....,"ranken hauses, Herr Pros. Dr. Lenhartz, hat am...",Here are the articles related to migration/ret...
5,2NPFSEZPOTZEGAQS25MUACOX5LQCFZGU-FILE_0002_DDB...,2,Hamburger Echo ; [...] ; Abend-Ausgabe,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3060377-8,1906-09-04 12:00:00,[Hamburg],[ger],f078d8fd-4c88-4ef6-b520-a3611087b995,[/data/altos/2N/PF/2NPFSEZPOTZEGAQS25MUACOX5LQ...,FILE_0002_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Reichen"" ihr Vermögen höchst unpraktisch festg...","die berühmte „Ouvertüre 1812"" von Tschai kowsk...",Here are the articles related to migration/ret...
6,2R4XMGQPTR3CPNSIQAO3DWCJBS7PSPPH-ALTO953185_DD...,3,Dortmunder Zeitung. 1874-1939,4EV676FQPACNVNHFEJHGKUY55BXC3QMB,Westfälische Wilhelms-Universität Münster Univ...,2941861-6,1906-04-28 12:00:00,[Dortmund],[ger],7d09f4f2-c3a9-4140-a859-770f92b93473,[/data/altos/2R/4X/2R4XMGQPTR3CPNSIQAO3DWCJBS7...,ALTO953185_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,* * * * * S a . * DEES * S9 # SS m Dage Spss —...,thers mit 335 000 Dollars . Die höchst versich...,Here is the extracted article related to migra...
7,2VAVMTKZ4S3DAPC5J254V5RQHY3JZMOV-uuid-2ca07a46...,2,Wochenblatt für Zschopau und Umgegend : Zschop...,265BI7NE7QBS4NQMZCCGIVLFR73OCOSL,Sächsische Landesbibliothek - Staats- und Univ...,2947898-4,1906-04-14 12:00:00,[Zschopau],[ger],db8adbe6-5c3d-4a87-bb12-cca5d35735a1,[/data/altos/2V/AV/2VAVMTKZ4S3DAPC5J254V5RQHY3...,uuid-2ca07a46-b68d-464f-99db-cd244fe40736_DDB_...,https://api.deutsche-digitale-bibliothek.de/bi...,— Am ersten Osterleierlog nachmittag» 4 Uhr fi...,uch dort ein Streik des unteren Personal» vorb...,Here are the articles related to migration/ret...
8,2YYFQNDWZ7HM4I7MFB54EECZ5STTDX2G-ALTO7267289_D...,8,Badische Schulzeitung : Vereinsbl. d. Badische...,INLVDM4I3AMZLTG6AE6C5GZRJKGOF75K,Badische Landesbibliothek,3108888-0,1906-09-01 12:00:00,,[ger],34278dcb-93b2-4d75-9c23-2a183a33861c,[/data/altos/2Y/YF/2YYFQNDWZ7HM4I7MFB54EECZ5ST...,ALTO7267289_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,604 Die Frage der Volksbildung in ihren versch...,"kreises , Bereicherung seines Geisteslebens , ...",Here is the extracted article related to migra...
9,32ZFXGXIXPZ4WFNTDKNAEPTAKMYFWGUE-FILE_0002_DDB...,2,Hamburger Echo ; [...] ; Abend-Ausgabe,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3060377-8,1906-05-22 12:00:00,[Hamburg],[ger],35765f3f-5d34-4004-bd9d-9620f50df70c,[/data/altos/32/ZF/32ZFXGXIXPZ4WFNTDKNAEPTAKMY...,FILE_0002_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,Religionsgemeinschaften auf die Gestaltung des...,me ber Pflanze dient. Der Ausstellungsausschuß...,Here are the articles related to migration/ret...


In [None]:
# prompt: export file

# Export the DataFrame to an Excel file
df.to_excel('ocr_corrected_newspaper_final.xlsx', index=False)

# Download the Excel file
from google.colab import files
files.download('ocr_corrected_newspaper_final.xlsx')


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [None]:
df['article_corrected'] = df['article_corrected'].apply(lambda x: x.split('\n\n', 1)[1].lstrip() if isinstance(x, str) and '\n\n' in x else x)
df['article_corrected'] = df['article_corrected'].apply(lambda x: x.split('\n\n', 1)[1].lstrip() if isinstance(x, str) and '\n\n' in x else x)

df.to_excel('article_corrected.xlsx', index=False)
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['article_corrected'] = df['article_corrected'].apply(lambda x: x.split('\n\n', 1)[1].lstrip() if isinstance(x, str) and '\n\n' in x else x)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['article_corrected'] = df['article_corrected'].apply(lambda x: x.split('\n\n', 1)[1].lstrip() if isinstance(x, str) and '\n\n' in x else x)


Unnamed: 0,page_id,pagenumber,paper_title,provider_ddb_id,provider,zdb_id,publication_date,place_of_distribution,language,thumbnail,pagefulltext,pagename,preview_reference,plainpagefulltext,context,article_corrected
0,2452IP73L263T7EUP326MLDLWGMWA6PL-FILE_0009_DDB...,9,Hamburger Fremdenblatt,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3024925-9,1906-09-11 12:00:00,[Hamburg],[ger],d13db2eb-59a1-494f-b6d1-4ec4f9ac647a,[/data/altos/24/52/2452IP73L263T7EUP326MLDLWGM...,FILE_0009_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,Aweite Beilage z«« Hamöuvgev Fvemden-Blatt Ne....,m 2 Uhr 27 Minuten nachmittags. Die Gesamtstär...,* 30 mittellose Rückwanderer trafen mit dem Da...
1,26ZENHX64GQFWANFKPXWMMDCLX7D5AV6-FILE_0009_DDB...,9,Schwäbischer Merkur : mit Schwäbischer Kronik ...,VNHXUCEEKHOUSYH4NVOUBHJGSRMOGK7J,Württembergische Landesbibliothek,2751625-8,1906-01-26 12:00:00,[Stuttgart],[ger],c7c3e5a5-1265-4d45-a7c3-241409568f66,[/data/altos/26/ZE/26ZENHX64GQFWANFKPXWMMDCLX7...,FILE_0009_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,«r. 4S. SVMVE MMW. ANMllM. Würdigung dteJtm.ql...,nd dr'irsie» diesem für die Autorität de- Sult...,* Die Umwälzung in Rußland. \nIn einer Einsend...
2,2ETGNYUAPVHLLE4SVVD66TK345MM4ACR-FILE_0009_DDB...,9,Hamburger Fremdenblatt,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3024925-9,1906-11-13 12:00:00,[Hamburg],[ger],0f18037d-34a3-4660-a20d-13baea8d26f9,[/data/altos/2E/TG/2ETGNYUAPVHLLE4SVVD66TK345M...,FILE_0009_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,Zweite Beilage zum Hamburger Frem-en-Blatt Nr....,"m Hause Paralleelstraße 12 zum Ausvruch. Da, u...",* 48 mittellose Rückwanderer kehrten mit dem D...
3,2KPSZQ2ZEXLY5EKLT36ZNXUNVXJPGUCD-ALTO6659890_D...,1,Aachener Anzeiger : politisches Tageblatt : be...,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2975858-0,1906-01-04 12:00:00,"[Aachen, Regierungsbezirk Aachen]",[ger],51e77301-8216-489b-a90c-1a1972c57efe,[/data/altos/2K/PS/2KPSZQ2ZEXLY5EKLT36ZNXUNVXJ...,ALTO6659890_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,90 nd und Afrika berschrift und Herz erk “ . W...,"lig mittellosen Arbeitern und Handwerkern , di...",**Article 2**\nTitle: Flüchtlinge auf dem Schw...
4,2MWORD7UEUFHSGZOQZ6TQYPEIPKQGG73-FILE_0009_DDB...,9,Hamburger Fremdenblatt,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3024925-9,1906-09-04 12:00:00,[Hamburg],[ger],b789cbd1-4d59-4575-aac6-803023299945,[/data/altos/2M/WO/2MWORD7UEUFHSGZOQZ6TQYPEIPK...,FILE_0009_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,Zweite Beilage zum Hamburger Aremden-GLatt Nr....,"ranken hauses, Herr Pros. Dr. Lenhartz, hat am...",Immer noch treffen mit den von England kommend...
5,2NPFSEZPOTZEGAQS25MUACOX5LQCFZGU-FILE_0002_DDB...,2,Hamburger Echo ; [...] ; Abend-Ausgabe,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3060377-8,1906-09-04 12:00:00,[Hamburg],[ger],f078d8fd-4c88-4ef6-b520-a3611087b995,[/data/altos/2N/PF/2NPFSEZPOTZEGAQS25MUACOX5LQ...,FILE_0002_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Reichen"" ihr Vermögen höchst unpraktisch festg...","die berühmte „Ouvertüre 1812"" von Tschai kowsk...","* ""Mit dem englischen Dampfer „N 0 t t i n g h..."
6,2R4XMGQPTR3CPNSIQAO3DWCJBS7PSPPH-ALTO953185_DD...,3,Dortmunder Zeitung. 1874-1939,4EV676FQPACNVNHFEJHGKUY55BXC3QMB,Westfälische Wilhelms-Universität Münster Univ...,2941861-6,1906-04-28 12:00:00,[Dortmund],[ger],7d09f4f2-c3a9-4140-a859-770f92b93473,[/data/altos/2R/4X/2R4XMGQPTR3CPNSIQAO3DWCJBS7...,ALTO953185_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,* * * * * S a . * DEES * S9 # SS m Dage Spss —...,thers mit 335 000 Dollars . Die höchst versich...,Mehr noch als Königsberg i. Pr. ist die Stadt ...
7,2VAVMTKZ4S3DAPC5J254V5RQHY3JZMOV-uuid-2ca07a46...,2,Wochenblatt für Zschopau und Umgegend : Zschop...,265BI7NE7QBS4NQMZCCGIVLFR73OCOSL,Sächsische Landesbibliothek - Staats- und Univ...,2947898-4,1906-04-14 12:00:00,[Zschopau],[ger],db8adbe6-5c3d-4a87-bb12-cca5d35735a1,[/data/altos/2V/AV/2VAVMTKZ4S3DAPC5J254V5RQHY3...,uuid-2ca07a46-b68d-464f-99db-cd244fe40736_DDB_...,https://api.deutsche-digitale-bibliothek.de/bi...,— Am ersten Osterleierlog nachmittag» 4 Uhr fi...,uch dort ein Streik des unteren Personal» vorb...,* Deutsch-Afrika. — Der Plan de» Psarrer» Rose...
8,2YYFQNDWZ7HM4I7MFB54EECZ5STTDX2G-ALTO7267289_D...,8,Badische Schulzeitung : Vereinsbl. d. Badische...,INLVDM4I3AMZLTG6AE6C5GZRJKGOF75K,Badische Landesbibliothek,3108888-0,1906-09-01 12:00:00,,[ger],34278dcb-93b2-4d75-9c23-2a183a33861c,[/data/altos/2Y/YF/2YYFQNDWZ7HM4I7MFB54EECZ5ST...,ALTO7267289_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,604 Die Frage der Volksbildung in ihren versch...,"kreises , Bereicherung seines Geisteslebens , ...",(No further text available)\n\nThis is the onl...
9,32ZFXGXIXPZ4WFNTDKNAEPTAKMYFWGUE-FILE_0002_DDB...,2,Hamburger Echo ; [...] ; Abend-Ausgabe,BZVTR553HLJBDMQD5NCJ6YKP3HMBQRF4,Staats- und Universitätsbibliothek Hamburg Car...,3060377-8,1906-05-22 12:00:00,[Hamburg],[ger],35765f3f-5d34-4004-bd9d-9620f50df70c,[/data/altos/32/ZF/32ZFXGXIXPZ4WFNTDKNAEPTAKMY...,FILE_0002_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,Religionsgemeinschaften auf die Gestaltung des...,me ber Pflanze dient. Der Ausstellungsausschuß...,This article reports on 52 destitute returnees...


In [None]:
# prompt: export file

from google.colab import files

# Assuming 'df' is your DataFrame and you want to export it as an Excel file named 'my_data.xlsx'
df.to_excel('my_data.xlsx', index=False)

# Download the file
files.download('my_data.xlsx')


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>