Deutsche Digitale Bibliothek REST API: https://labs.deutsche-digitale-bibliothek.de/app/ddbapi/#/search/getSolrSearch

Fragen & Antworten zum Deutschen Zeitungsportal: https://www.deutsche-digitale-bibliothek.de/content/newspaper/fragen-antworten

Wrapper for the DDB API: https://pypi.org/project/ddbapi/

# Functions for extracting and analyzing data from Das Deutsche Zeitungsportal

In [84]:
import pandas as pd
from ddbapi import zp_pages
import folium
from geopy import geocoders 

In [85]:
def article_extractor(search_dict):
    """
    Fetch newspaper pages in a given language published between two given dates.

    Takes a dictionary with three keys as its only argument, for example:

    search_dict = {
        'language': 'ger',
        'date_begin': f'{year}-01-01',
        'date_end': f'{year}-12-31'
        }

    Returns the DataFrame produced by zp_pages (one row per page).
    """
    # Solr-style date range; both bounds are set to 12:00 UTC
    df = zp_pages(language=search_dict['language'],
                  publication_date=f"[{search_dict['date_begin']}T12:00:00Z TO {search_dict['date_end']}T12:00:00Z]")
    return df
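
The `publication_date` filter passed to `zp_pages` above is a Solr-style range string. As a minimal sketch, the helper below builds that string and checks that both dates are valid and in order. `build_date_range` is a hypothetical name and not part of `ddbapi`; the fixed `T12:00:00Z` suffix simply mirrors the function above.

```python
from datetime import date

def build_date_range(date_begin: str, date_end: str) -> str:
    """Build a '[<begin> TO <end>]' range string from two ISO dates (YYYY-MM-DD)."""
    begin = date.fromisoformat(date_begin)  # raises ValueError on malformed input
    end = date.fromisoformat(date_end)
    if begin > end:
        raise ValueError(f"date_begin {date_begin} is after date_end {date_end}")
    return f"[{begin.isoformat()}T12:00:00Z TO {end.isoformat()}T12:00:00Z]"

print(build_date_range("1933-01-01", "1933-05-31"))
```

Validating the dates before the request is sent surfaces typos such as `1933-13-01` locally rather than as an empty or failed query.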

Downloading newspaper data for each year in three chunks: 

In [86]:
year_start = 1933
year_end = 1933
path= "./data_deutsches_zeitungsportal_misc"

In [83]:
search_dict= {
     'language': 'ger',
     'date_begin': f'{year_start}-01-01',
     'date_end': f'{year_end}-05-31'
     }
    
df_challenge= article_extractor(search_dict)
df_challenge
df_challenge.to_pickle(f"{path}/newspapers_{search_dict['language']}_{year_start}_part_1.pkl")
#df_challenge.to_parquet(
#    f"{path}/newspapers_{search_dict['language']}_{year_start}-{year_end}_part_1.parquet",
#    index=False,
#    compression='snappy'  # or 'gzip' for stronger compression
#)

KeyboardInterrupt: 

In [None]:
# search_dict= {
#     'language': 'ger',
#     'date_begin': f'{year}-05-01',
#     'date_end': f'{year}-08-31'
#     }
    
# df_challenge= article_extractor(search_dict)
# df_challenge.to_pickle(f"{path}/newspapers_{search_dict['language']}_{year}_part_2")

In [None]:
# search_dict= {
#     'language': 'ger',
#     'date_begin': f'{year}-09-01',
#     'date_end': f'{year}-12-31'
#     }
    
# df_challenge_ger= article_extractor(search_dict)
# df_challenge_ger.to_pickle(f"{path}/newspapers_{search_dict['language']}_{year}_part_3")

In [None]:
# # testing if the pickled dataframes are loadable: 

# test_year= f"{year}_part_3"
# columns= ['paper_title', 'publication_date', 'place_of_distribution']
# try:
#     print (len(pd.read_pickle(f"{path}/newspapers_{search_dict['language']}_{test_year}")[columns]))
                  
# except EOFError:
#     print(f"Error: EOFError occurred while loading data for year {test_year}.")
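
The chunk boundaries in the cells above are written out by hand. As a rough sketch, they could also be generated so that the ranges are consecutive and never overlap, which avoids downloading the same pages twice. The helper name `year_chunks` and the choice of equal four-month chunks are assumptions made for illustration.

```python
import calendar
from datetime import date

def year_chunks(year: int, n_chunks: int = 3) -> list[tuple[str, str]]:
    """Split a year into n_chunks consecutive, non-overlapping (begin, end) date ranges."""
    if n_chunks < 1 or 12 % n_chunks != 0:
        raise ValueError("n_chunks must divide 12 evenly")
    months_per_chunk = 12 // n_chunks
    chunks = []
    for k in range(n_chunks):
        first_month = k * months_per_chunk + 1
        last_month = first_month + months_per_chunk - 1
        # monthrange returns (weekday of first day, number of days in month)
        last_day = calendar.monthrange(year, last_month)[1]
        chunks.append((date(year, first_month, 1).isoformat(),
                       date(year, last_month, last_day).isoformat()))
    return chunks

for part, (begin, end) in enumerate(year_chunks(1933), 1):
    print(f"part_{part}: {begin} to {end}")
```

Each `(begin, end)` pair can be placed into `search_dict` as `date_begin` and `date_end` before calling `article_extractor`.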

Downloading newspapers from 1980–1994 to compare text quality

In [None]:
year_start = 1980
year_end = 1994
path= "./data_deutsches_zeitungsportal_misc"

In [None]:
search_dict= {
     'language': 'ger',
     'date_begin': f'{year_start}-01-01',
     'date_end': f'{year_end}-05-31'
     }
    
df_challenge= article_extractor(search_dict)
df_challenge
df_challenge.to_pickle(f"{path}/newspapers_{search_dict['language']}_{year_start}_part_1.pkl")
#df_challenge.to_parquet(
#    f"{path}/newspapers_{search_dict['language']}_{year_start}-{year_end}_part_1.parquet",
#    index=False,
#    compression='snappy'  # or 'gzip' for stronger compression
#)
df_challenge

https://api.deutsche-digitale-bibliothek.de/search/index/newspaper-issues/select?rows=1000&sort=id+ASC&q=type%3Apage+AND+language%3A%22ger%22+AND+publication_date%3A%22%5B1980-01-01T12%3A00%3A00Z%5C+TO%5C+1994-05-31T12%3A00%3A00Z%5D%22&cursorMark=%2A
Getting 1000 of 2807
Getting 2000 of 2807
Getting 2807 of 2807
Got 2807 items.


Unnamed: 0,page_id,pagenumber,paper_title,provider_ddb_id,provider,zdb_id,publication_date,place_of_distribution,language,thumbnail,pagefulltext,pagename,preview_reference,plainpagefulltext
0,22BDAOGGUQML4XYWD424L5CQGBEPUOPC-alto_001_DDB_...,1,Der Motor : Zeitung der Dieselmotorenwerk Rost...,UPHR66ECKLOQBHTW23IVD2SE4UBEF2XY,Schifffahrtsmuseum Rostock,3139118-7,1980-03-11 12:00:00,[Rostock],[ger],3c17273a-1495-4345-8bf5-df324f159dce,[/data/altos/22/BD/22BDAOGGUQML4XYWD424L5CQGBE...,alto_001_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,= Organ der Parteileitung der SED im VEB Diese...
1,22BDAOGGUQML4XYWD424L5CQGBEPUOPC-alto_002_DDB_...,2,Der Motor : Zeitung der Dieselmotorenwerk Rost...,UPHR66ECKLOQBHTW23IVD2SE4UBEF2XY,Schifffahrtsmuseum Rostock,3139118-7,1980-03-11 12:00:00,[Rostock],[ger],3c17273a-1495-4345-8bf5-df324f159dce,[/data/altos/22/BD/22BDAOGGUQML4XYWD424L5CQGBE...,alto_002_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,Am Arbeitsplatz mehr Vorsicht! 25 betriebliche...
2,22BDAOGGUQML4XYWD424L5CQGBEPUOPC-alto_003_DDB_...,3,Der Motor : Zeitung der Dieselmotorenwerk Rost...,UPHR66ECKLOQBHTW23IVD2SE4UBEF2XY,Schifffahrtsmuseum Rostock,3139118-7,1980-03-11 12:00:00,[Rostock],[ger],3c17273a-1495-4345-8bf5-df324f159dce,[/data/altos/22/BD/22BDAOGGUQML4XYWD424L5CQGBE...,alto_003_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Einer kennt ihn als Vertrauensmann, den Jugend..."
3,22BDAOGGUQML4XYWD424L5CQGBEPUOPC-alto_004_DDB_...,4,Der Motor : Zeitung der Dieselmotorenwerk Rost...,UPHR66ECKLOQBHTW23IVD2SE4UBEF2XY,Schifffahrtsmuseum Rostock,3139118-7,1980-03-11 12:00:00,[Rostock],[ger],3c17273a-1495-4345-8bf5-df324f159dce,[/data/altos/22/BD/22BDAOGGUQML4XYWD424L5CQGBE...,alto_004_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,der KDT und URANIA im DMR „MOTOR“ Leserausspra...
4,22BDAOGGUQML4XYWD424L5CQGBEPUOPC-alto_005_DDB_...,5,Der Motor : Zeitung der Dieselmotorenwerk Rost...,UPHR66ECKLOQBHTW23IVD2SE4UBEF2XY,Schifffahrtsmuseum Rostock,3139118-7,1980-03-11 12:00:00,[Rostock],[ger],3c17273a-1495-4345-8bf5-df324f159dce,[/data/altos/22/BD/22BDAOGGUQML4XYWD424L5CQGBE...,alto_005_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,At ontage) heißt es so zügiger kann der a scho...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2802,ZZUSBBUNDOSMBSOV6HZT4GBZKLMCUYOZ-alto_004_DDB_...,4,Der Motor : Zeitung der Dieselmotorenwerk Rost...,UPHR66ECKLOQBHTW23IVD2SE4UBEF2XY,Schifffahrtsmuseum Rostock,3139118-7,1981-10-13 12:00:00,[Rostock],[ger],f82d7cb1-5606-43c2-a43b-4db2c85ddf15,[/data/altos/ZZ/US/ZZUSBBUNDOSMBSOV6HZT4GBZKLM...,alto_004_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"|Ehrentafel Anlaßlich des 32, Jahrestages DDR ..."
2803,ZZUSBBUNDOSMBSOV6HZT4GBZKLMCUYOZ-alto_005_DDB_...,5,Der Motor : Zeitung der Dieselmotorenwerk Rost...,UPHR66ECKLOQBHTW23IVD2SE4UBEF2XY,Schifffahrtsmuseum Rostock,3139118-7,1981-10-13 12:00:00,[Rostock],[ger],f82d7cb1-5606-43c2-a43b-4db2c85ddf15,[/data/altos/ZZ/US/ZZUSBBUNDOSMBSOV6HZT4GBZKLM...,alto_005_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,icheren Händen ısatz und Disziplin ausgezeichn...
2804,ZZUSBBUNDOSMBSOV6HZT4GBZKLMCUYOZ-alto_006_DDB_...,6,Der Motor : Zeitung der Dieselmotorenwerk Rost...,UPHR66ECKLOQBHTW23IVD2SE4UBEF2XY,Schifffahrtsmuseum Rostock,3139118-7,1981-10-13 12:00:00,[Rostock],[ger],f82d7cb1-5606-43c2-a43b-4db2c85ddf15,[/data/altos/ZZ/US/ZZUSBBUNDOSMBSOV6HZT4GBZKLM...,alto_006_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"SE NETTE ET TEE | | GELOBNIS % Wir, junge Bürg..."
2805,ZZUSBBUNDOSMBSOV6HZT4GBZKLMCUYOZ-alto_007_DDB_...,7,Der Motor : Zeitung der Dieselmotorenwerk Rost...,UPHR66ECKLOQBHTW23IVD2SE4UBEF2XY,Schifffahrtsmuseum Rostock,3139118-7,1981-10-13 12:00:00,[Rostock],[ger],f82d7cb1-5606-43c2-a43b-4db2c85ddf15,[/data/altos/ZZ/US/ZZUSBBUNDOSMBSOV6HZT4GBZKLM...,alto_007_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,port Am 28.September standen sich auf dem Lok-...
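
The request URL logged above carries `rows=1000`, `sort=id ASC` and `cursorMark=*`, which suggests that the results are fetched with Solr's cursor-based paging. The sketch below illustrates that loop against an in-memory stand-in. `fake_select` and its integer cursor encoding are hypothetical and replace the real HTTP request; this is not the `ddbapi` implementation.

```python
def fake_select(docs, cursor_mark, rows):
    """Stand-in for one Solr request: return (batch, next_cursor_mark)."""
    start = 0 if cursor_mark == "*" else int(cursor_mark)
    batch = docs[start:start + rows]
    # Like Solr, hand back the unchanged cursor mark once no results are left
    next_cursor = str(start + len(batch)) if batch else cursor_mark
    return batch, next_cursor

def fetch_all(docs, rows=1000):
    """Collect all documents by following the cursor until it stops changing."""
    items, cursor_mark = [], "*"
    while True:
        batch, next_cursor = fake_select(docs, cursor_mark, rows)
        items.extend(batch)
        if next_cursor == cursor_mark:  # cursor did not advance: all pages fetched
            break
        cursor_mark = next_cursor
        print(f"Getting {len(items)} of {len(docs)}")
    return items

pages = fetch_all(list(range(2807)))
print(f"Got {len(pages)} items.")
```

With 2807 documents and 1000 rows per request, the loop issues three requests that return data and a fourth that returns nothing, at which point the cursor stops advancing.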


# Perplexity
In this part of the notebook, we estimate text quality by computing perplexity with the 'distilgpt2' model. We calculate the perplexity of the column 'plainpagefulltext' for pages from 1933 and from 1980–1994, based on the assumption that the quality of the OCR text improved with the transition from the old German blackletter typeface (Fraktur) to the modern Latin typeface (Antiqua). Because calculating perplexity for every row is computationally expensive, we draw a random sample of 1.5% of the pages from 1933 and 20% of the pages from 1980–1994. Each text is truncated to its first 512 tokens before scoring.
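
For reference, the perplexity of a token sequence is the exponential of the mean negative log-probability that the model assigns to each observed token. A minimal sketch with hand-picked probabilities shows the idea without loading a model; the function name `perplexity_from_probs` and the example values are illustrative only.

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity = exp(mean negative log-probability of the observed tokens)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that gives every observed token probability 1/4 is as uncertain as
# a uniform choice among four options, so its perplexity is about 4.
print(perplexity_from_probs([0.25] * 8))

# Tokens the model finds unlikely (for example OCR noise) raise the value.
print(perplexity_from_probs([0.25, 0.25, 0.01, 0.01]))
```

Lower values mean the model finds the text more predictable, which is why perplexity is used here as a rough proxy for text quality.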
## 1933

In [87]:
df_1933 = pd.read_pickle("./data_deutsches_zeitungsportal_misc/newspapers_ger_1933_part_1.pkl")  
#read_in_df = pd.read_parquet("./data_deutsches_zeitungsportal_misc/newspapers_ger_1933_part_1.pkl")  

In [88]:
df_1933

Unnamed: 0,page_id,pagenumber,paper_title,provider_ddb_id,provider,zdb_id,publication_date,place_of_distribution,language,thumbnail,pagefulltext,pagename,preview_reference,plainpagefulltext
0,225ZVEUFAMVCRQLBYKG2WTI3WYTSORYZ-ALTO1541386_D...,1,Morgen-Zeitung. 1925-1949,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2928805-8,1933-03-21 12:00:00,"[Velbert, Neviges, Velbert-Langenberg, Wülfrat...",[ger],26384eff-247a-4088-a290-d128d35fe611,[/data/altos/22/5Z/225ZVEUFAMVCRQLBYKG2WTI3WYT...,ALTO1541386_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,Einzelnummer 15 Pig . Mit Ill . = Blatt 30 Pfg...
1,225ZVEUFAMVCRQLBYKG2WTI3WYTSORYZ-ALTO1541387_D...,2,Morgen-Zeitung. 1925-1949,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2928805-8,1933-03-21 12:00:00,"[Velbert, Neviges, Velbert-Langenberg, Wülfrat...",[ger],26384eff-247a-4088-a290-d128d35fe611,[/data/altos/22/5Z/225ZVEUFAMVCRQLBYKG2WTI3WYT...,ALTO1541387_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Dienstag , den 21 . März Morgen - Zeiltunh 193..."
2,225ZVEUFAMVCRQLBYKG2WTI3WYTSORYZ-ALTO1541388_D...,3,Morgen-Zeitung. 1925-1949,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2928805-8,1933-03-21 12:00:00,"[Velbert, Neviges, Velbert-Langenberg, Wülfrat...",[ger],26384eff-247a-4088-a290-d128d35fe611,[/data/altos/22/5Z/225ZVEUFAMVCRQLBYKG2WTI3WYT...,ALTO1541388_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,ine im wer arla damit lüssen eine ichs gexer 1...
3,225ZVEUFAMVCRQLBYKG2WTI3WYTSORYZ-ALTO1541389_D...,4,Morgen-Zeitung. 1925-1949,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2928805-8,1933-03-21 12:00:00,"[Velbert, Neviges, Velbert-Langenberg, Wülfrat...",[ger],26384eff-247a-4088-a290-d128d35fe611,[/data/altos/22/5Z/225ZVEUFAMVCRQLBYKG2WTI3WYT...,ALTO1541389_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Dienskag , de R O MAN von WERNER SCHEFF Inhalt..."
4,225ZVEUFAMVCRQLBYKG2WTI3WYTSORYZ-ALTO1541390_D...,5,Morgen-Zeitung. 1925-1949,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2928805-8,1933-03-21 12:00:00,"[Velbert, Neviges, Velbert-Langenberg, Wülfrat...",[ger],26384eff-247a-4088-a290-d128d35fe611,[/data/altos/22/5Z/225ZVEUFAMVCRQLBYKG2WTI3WYT...,ALTO1541390_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Dienskag , den 21 . 1933 — Nummer 79 , Seite 9..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
133379,ZZZZPIGJEYV3UWA7CW7RHXVDVJPRMXV5-uuid-a31b4e74...,2,Weißeritz-Zeitung : Tageszeitung und Anzeiger ...,265BI7NE7QBS4NQMZCCGIVLFR73OCOSL,Sächsische Landesbibliothek - Staats- und Univ...,3074708-9,1933-01-31 12:00:00,"[Dippoldiswalde, Frauenstein, Schmiedeberg (La...",[ger],fd1a6503-73b1-41c5-bacb-8442437bbd1e,[/data/altos/ZZ/ZZ/ZZZZPIGJEYV3UWA7CW7RHXVDVJP...,uuid-a31b4e74-67e9-4769-b4c4-48634ae29ddb_DDB_...,https://api.deutsche-digitale-bibliothek.de/bi...,"Platz hinweg nach der Wilhelmstraße eilten, wo..."
133380,ZZZZPIGJEYV3UWA7CW7RHXVDVJPRMXV5-uuid-b889ea81...,6,Weißeritz-Zeitung : Tageszeitung und Anzeiger ...,265BI7NE7QBS4NQMZCCGIVLFR73OCOSL,Sächsische Landesbibliothek - Staats- und Univ...,3074708-9,1933-01-31 12:00:00,"[Dippoldiswalde, Frauenstein, Schmiedeberg (La...",[ger],fd1a6503-73b1-41c5-bacb-8442437bbd1e,[/data/altos/ZZ/ZZ/ZZZZPIGJEYV3UWA7CW7RHXVDVJP...,uuid-b889ea81-49ef-4bbd-949a-c2df0134b796_DDB_...,https://api.deutsche-digitale-bibliothek.de/bi...,Leutnant in Vas Insauterieregimen« Nr 73 (Hann...
133381,ZZZZPIGJEYV3UWA7CW7RHXVDVJPRMXV5-uuid-c11dedff...,3,Weißeritz-Zeitung : Tageszeitung und Anzeiger ...,265BI7NE7QBS4NQMZCCGIVLFR73OCOSL,Sächsische Landesbibliothek - Staats- und Univ...,3074708-9,1933-01-31 12:00:00,"[Dippoldiswalde, Frauenstein, Schmiedeberg (La...",[ger],fd1a6503-73b1-41c5-bacb-8442437bbd1e,[/data/altos/ZZ/ZZ/ZZZZPIGJEYV3UWA7CW7RHXVDVJP...,uuid-c11dedff-0487-48ef-aeec-7a40d95ab9c5_DDB_...,https://api.deutsche-digitale-bibliothek.de/bi...,> Im übrigen bringe dieser Artikel eine Verdre...
133382,ZZZZPIGJEYV3UWA7CW7RHXVDVJPRMXV5-uuid-c5492995...,7,Weißeritz-Zeitung : Tageszeitung und Anzeiger ...,265BI7NE7QBS4NQMZCCGIVLFR73OCOSL,Sächsische Landesbibliothek - Staats- und Univ...,3074708-9,1933-01-31 12:00:00,"[Dippoldiswalde, Frauenstein, Schmiedeberg (La...",[ger],fd1a6503-73b1-41c5-bacb-8442437bbd1e,[/data/altos/ZZ/ZZ/ZZZZPIGJEYV3UWA7CW7RHXVDVJP...,uuid-c5492995-33ed-4c6f-8c84-cb12044e62f5_DDB_...,https://api.deutsche-digitale-bibliothek.de/bi...,"Aaveveui. Demonstrationsverb o t. Der SMj>, Ut..."


In [89]:
unique_titles = df_1933["paper_title"].unique()

titles_df = pd.DataFrame(unique_titles, columns=["paper_title"])

titles_df.to_excel("unique_paper_titles.xlsx", index=False)

df_1933["paper_title"].unique()

array(['Morgen-Zeitung. 1925-1949', 'Deutsche Reichs-Zeitung. 1871-1934',
       'Der Erft-Bote. 1890-1950',
       'Dresdner Nachrichten, 02-Abendausgabe',
       'Westfälische neueste Nachrichten mit Bielefelder General-Anzeiger und Handelsblatt',
       'Neue Mannheimer Zeitung : NMZ : Mannheimer Neues Tageblatt, Abendblatt',
       'Laupheimer Verkündiger : verbunden mit dem Laupheimer Volksblatt',
       'Kölnische Zeitung. 1803-1945',
       'Westfälische Zeitung : Bielefelder Tageblatt',
       'Mittelbadischer Courier : Ettlinger Tagblatt ; mit den neuesten Handels-Nachrichten für Stadt und Bezirk Ettlingen',
       'Erzgebirgischer Volksfreund : mit Schwarzenberger Tageblatt',
       'Neckar-Bote : Heimatzeitung für Seckenheim und Umgebung',
       'Iserlohner Kreisanzeiger und Zeitung. 1898-1949',
       'Sächsische Dorfzeitung und Elbgaupresse : mit Loschwitzer Anzeiger ; Tageszeitung für das östliche Dresden u. seine Vororte',
       'Hörder Volksblatt. 1884-1934',
       '

In [90]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import time

# Use the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# AutoModelForCausalLM replaces the deprecated AutoModelWithLMHead
model = AutoModelForCausalLM.from_pretrained(model_name)
model.to(device)
model.eval()

def get_perplexity(text: str) -> float:
    """Return the perplexity of the first 512 tokens of `text` under the model."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    input_ids = inputs["input_ids"].to(device)
    with torch.no_grad():
        # With labels=input_ids the model returns the mean next-token cross-entropy
        outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss
    return torch.exp(loss).item()

rows = []  # collect results here and build the DataFrame once after the loop
start_time = time.time()
length = int(len(df_1933["plainpagefulltext"]) * 0.015)  # 1.5% sample
random_samples = df_1933["plainpagefulltext"].sample(n=length, random_state=42)
empty = 0
for i, t in enumerate(random_samples, 1):
    # skip pages without text
    if not isinstance(t, str) or not t.strip():
        empty += 1
        continue
    perplexity_value = get_perplexity(t)
    rows.append({"Text": t, "Perplexity": perplexity_value})

    # report progress and a time estimate every 100 texts
    if i % 100 == 0:
        elapsed = time.time() - start_time
        remaining = len(random_samples) - i
        estimated_time = remaining * (elapsed / 100) / 60
        print(f"Processed {i} texts — Last 100 took {elapsed:.2f} seconds")
        print(f"{remaining} left, which will take approximately {estimated_time:.2f} min\n")
        start_time = time.time()

Perplexity = pd.DataFrame(rows, columns=["Text", "Perplexity"])
print("Empty plainpagefulltext columns " + str(empty))
Perplexity.to_csv("Perplexity_1933.csv")
Perplexity["Perplexity"].mean()



Processed 100 texts — Last 100 took 41.41 seconds
1900 left, which will take approximately 13.11 min

Processed 200 texts — Last 100 took 38.02 seconds
1800 left, which will take approximately 11.41 min

Processed 300 texts — Last 100 took 38.11 seconds
1700 left, which will take approximately 10.80 min

Processed 400 texts — Last 100 took 37.22 seconds
1600 left, which will take approximately 9.92 min

Processed 500 texts — Last 100 took 37.89 seconds
1500 left, which will take approximately 9.47 min

Processed 600 texts — Last 100 took 37.94 seconds
1400 left, which will take approximately 8.85 min

Processed 700 texts — Last 100 took 38.51 seconds
1300 left, which will take approximately 8.34 min

Processed 800 texts — Last 100 took 38.50 seconds
1200 left, which will take approximately 7.70 min

Processed 900 texts — Last 100 took 37.65 seconds
1100 left, which will take approximately 6.90 min

Processed 1000 texts — Last 100 took 37.99 seconds
1000 left, which will take approximat

np.float64(192.22110851542124)

## 1980-1994

In [91]:
df_1980 = pd.read_pickle("./data_deutsches_zeitungsportal_misc/newspapers_ger_1980_part_1.pkl")  
#read_in_df = pd.read_parquet("./data_deutsches_zeitungsportal_misc/newspapers_ger_1933_part_1.pkl")  
df_1980

Unnamed: 0,page_id,pagenumber,paper_title,provider_ddb_id,provider,zdb_id,publication_date,place_of_distribution,language,thumbnail,pagefulltext,pagename,preview_reference,plainpagefulltext
0,22BDAOGGUQML4XYWD424L5CQGBEPUOPC-alto_001_DDB_...,1,Der Motor : Zeitung der Dieselmotorenwerk Rost...,UPHR66ECKLOQBHTW23IVD2SE4UBEF2XY,Schifffahrtsmuseum Rostock,3139118-7,1980-03-11 12:00:00,[Rostock],[ger],3c17273a-1495-4345-8bf5-df324f159dce,[/data/altos/22/BD/22BDAOGGUQML4XYWD424L5CQGBE...,alto_001_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,= Organ der Parteileitung der SED im VEB Diese...
1,22BDAOGGUQML4XYWD424L5CQGBEPUOPC-alto_002_DDB_...,2,Der Motor : Zeitung der Dieselmotorenwerk Rost...,UPHR66ECKLOQBHTW23IVD2SE4UBEF2XY,Schifffahrtsmuseum Rostock,3139118-7,1980-03-11 12:00:00,[Rostock],[ger],3c17273a-1495-4345-8bf5-df324f159dce,[/data/altos/22/BD/22BDAOGGUQML4XYWD424L5CQGBE...,alto_002_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,Am Arbeitsplatz mehr Vorsicht! 25 betriebliche...
2,22BDAOGGUQML4XYWD424L5CQGBEPUOPC-alto_003_DDB_...,3,Der Motor : Zeitung der Dieselmotorenwerk Rost...,UPHR66ECKLOQBHTW23IVD2SE4UBEF2XY,Schifffahrtsmuseum Rostock,3139118-7,1980-03-11 12:00:00,[Rostock],[ger],3c17273a-1495-4345-8bf5-df324f159dce,[/data/altos/22/BD/22BDAOGGUQML4XYWD424L5CQGBE...,alto_003_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Einer kennt ihn als Vertrauensmann, den Jugend..."
3,22BDAOGGUQML4XYWD424L5CQGBEPUOPC-alto_004_DDB_...,4,Der Motor : Zeitung der Dieselmotorenwerk Rost...,UPHR66ECKLOQBHTW23IVD2SE4UBEF2XY,Schifffahrtsmuseum Rostock,3139118-7,1980-03-11 12:00:00,[Rostock],[ger],3c17273a-1495-4345-8bf5-df324f159dce,[/data/altos/22/BD/22BDAOGGUQML4XYWD424L5CQGBE...,alto_004_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,der KDT und URANIA im DMR „MOTOR“ Leserausspra...
4,22BDAOGGUQML4XYWD424L5CQGBEPUOPC-alto_005_DDB_...,5,Der Motor : Zeitung der Dieselmotorenwerk Rost...,UPHR66ECKLOQBHTW23IVD2SE4UBEF2XY,Schifffahrtsmuseum Rostock,3139118-7,1980-03-11 12:00:00,[Rostock],[ger],3c17273a-1495-4345-8bf5-df324f159dce,[/data/altos/22/BD/22BDAOGGUQML4XYWD424L5CQGBE...,alto_005_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,At ontage) heißt es so zügiger kann der a scho...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2802,ZZUSBBUNDOSMBSOV6HZT4GBZKLMCUYOZ-alto_004_DDB_...,4,Der Motor : Zeitung der Dieselmotorenwerk Rost...,UPHR66ECKLOQBHTW23IVD2SE4UBEF2XY,Schifffahrtsmuseum Rostock,3139118-7,1981-10-13 12:00:00,[Rostock],[ger],f82d7cb1-5606-43c2-a43b-4db2c85ddf15,[/data/altos/ZZ/US/ZZUSBBUNDOSMBSOV6HZT4GBZKLM...,alto_004_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"|Ehrentafel Anlaßlich des 32, Jahrestages DDR ..."
2803,ZZUSBBUNDOSMBSOV6HZT4GBZKLMCUYOZ-alto_005_DDB_...,5,Der Motor : Zeitung der Dieselmotorenwerk Rost...,UPHR66ECKLOQBHTW23IVD2SE4UBEF2XY,Schifffahrtsmuseum Rostock,3139118-7,1981-10-13 12:00:00,[Rostock],[ger],f82d7cb1-5606-43c2-a43b-4db2c85ddf15,[/data/altos/ZZ/US/ZZUSBBUNDOSMBSOV6HZT4GBZKLM...,alto_005_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,icheren Händen ısatz und Disziplin ausgezeichn...
2804,ZZUSBBUNDOSMBSOV6HZT4GBZKLMCUYOZ-alto_006_DDB_...,6,Der Motor : Zeitung der Dieselmotorenwerk Rost...,UPHR66ECKLOQBHTW23IVD2SE4UBEF2XY,Schifffahrtsmuseum Rostock,3139118-7,1981-10-13 12:00:00,[Rostock],[ger],f82d7cb1-5606-43c2-a43b-4db2c85ddf15,[/data/altos/ZZ/US/ZZUSBBUNDOSMBSOV6HZT4GBZKLM...,alto_006_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"SE NETTE ET TEE | | GELOBNIS % Wir, junge Bürg..."
2805,ZZUSBBUNDOSMBSOV6HZT4GBZKLMCUYOZ-alto_007_DDB_...,7,Der Motor : Zeitung der Dieselmotorenwerk Rost...,UPHR66ECKLOQBHTW23IVD2SE4UBEF2XY,Schifffahrtsmuseum Rostock,3139118-7,1981-10-13 12:00:00,[Rostock],[ger],f82d7cb1-5606-43c2-a43b-4db2c85ddf15,[/data/altos/ZZ/US/ZZUSBBUNDOSMBSOV6HZT4GBZKLM...,alto_007_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,port Am 28.September standen sich auf dem Lok-...


In [92]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import time

# Use the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# AutoModelForCausalLM replaces the deprecated AutoModelWithLMHead
model = AutoModelForCausalLM.from_pretrained(model_name)
model.to(device)
model.eval()

def get_perplexity(text: str) -> float:
    """Return the perplexity of the first 512 tokens of `text` under the model."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    input_ids = inputs["input_ids"].to(device)
    with torch.no_grad():
        # With labels=input_ids the model returns the mean next-token cross-entropy
        outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss
    return torch.exp(loss).item()

rows = []  # collect results here and build the DataFrame once after the loop
start_time = time.time()
length = int(len(df_1980["plainpagefulltext"]) * 0.2)  # 20% sample
random_samples = df_1980["plainpagefulltext"].sample(n=length, random_state=42)
empty = 0
for i, t in enumerate(random_samples, 1):
    # skip pages without text
    if not isinstance(t, str) or not t.strip():
        empty += 1
        continue
    perplexity_value = get_perplexity(t)
    rows.append({"Text": t, "Perplexity": perplexity_value})

    # report progress and a time estimate every 100 texts
    if i % 100 == 0:
        elapsed = time.time() - start_time
        remaining = len(random_samples) - i
        estimated_time = remaining * (elapsed / 100) / 60
        print(f"Processed {i} texts — Last 100 took {elapsed:.2f} seconds")
        print(f"{remaining} left, which will take approximately {estimated_time:.2f} min\n")
        start_time = time.time()

Perplexity = pd.DataFrame(rows, columns=["Text", "Perplexity"])
print("Empty plainpagefulltext columns " + str(empty))
Perplexity.to_csv("Perplexity_1980.csv")
Perplexity["Perplexity"].mean()



Processed 100 texts — Last 100 took 36.80 seconds
461 left, which will take approximately 2.83 min

Processed 200 texts — Last 100 took 36.88 seconds
361 left, which will take approximately 2.22 min

Processed 300 texts — Last 100 took 37.06 seconds
261 left, which will take approximately 1.61 min

Processed 400 texts — Last 100 took 36.86 seconds
161 left, which will take approximately 0.99 min

Processed 500 texts — Last 100 took 36.98 seconds
61 left, which will take approximately 0.38 min

Empty plainpagefulltext columns 0


np.float64(132.82616769991245)

# Comparing Perplexity

In [93]:
Perplexity_1933 = pd.read_csv('Perplexity_1933.csv')
Perplexity_1980 = pd.read_csv('Perplexity_1980.csv')


In [94]:
print("Mean 1933: " + str(Perplexity_1933["Perplexity"].mean()))
print("Mean 1980: " + str(Perplexity_1980["Perplexity"].mean()))
print(f"_____________________________________________________")
print("Median 1933: " + str(Perplexity_1933["Perplexity"].median()))
print("Median 1980: " + str(Perplexity_1980["Perplexity"].median()))

Mean 1933: 192.22110851542124
Mean 1980: 132.82616769991245
_____________________________________________________
Median 1933: 156.34481811523438
Median 1980: 126.85960388183594
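
To put the gap between the two samples into proportion, the relative decrease can be computed directly from the printed values. This is a small worked calculation, not a significance test; the helper name `relative_drop` is illustrative, and the numbers are copied from the output above.

```python
def relative_drop(old: float, new: float) -> float:
    """Fractional decrease from old to new (0.25 means 25% lower)."""
    return (old - new) / old

# Values copied from the cell output above
mean_1933, mean_1980 = 192.22110851542124, 132.82616769991245
median_1933, median_1980 = 156.34481811523438, 126.85960388183594

print(f"Mean drop:   {relative_drop(mean_1933, mean_1980):.1%}")
print(f"Median drop: {relative_drop(median_1933, median_1980):.1%}")
```

The mean falls by roughly 31% and the median by roughly 19%. In both samples the mean lies above the median, which may indicate that a few pages with very high perplexity pull the mean upward, especially in 1933.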
