# Script 1 - Extract Articles

_Script by Tim Hebestreit, thebestr@smail.uni-koeln.de_

This notebook extracts the articles from the raw exported docx and pdf files. To reproduce the csv export, run all cells; note that this takes 30-60 minutes depending on your machine. Running the notebook is optional, because the output file (*articles_raw.csv*) is already present in the data/csv folder.

First, we install the needed packages for this script:

- *pdfplumber* helps with reading in and parsing pdf files
- *tqdm* shows a progress bar, so the progress of the import stays visible
- *pandas* is used to create a dataframe of the parsed articles
- *python-docx* (imported as *docx*) helps with the import of docx files

In [1]:
# --- INSTALLATION ---
!pip install pdfplumber tqdm pandas python-docx



Now we import the libraries that are used to parse the articles.

In [2]:
# --- IMPORTS ---
import os
import re
import pandas as pd
from docx import Document
import pdfplumber
from tqdm import tqdm

Here, we define the input folders, as well as the output file name.

In [3]:
# --- CONFIG ---
DOCX_FOLDER = "../data/docx/"          # Folder for Docx Files from Nexis
PDF_FOLDER = "../data/pdf/"            # Folder for Pdf Files from Wiso
OUTPUT_FILE = "../data/csv/articles_raw.csv" # Save extracted article data in a single file

We declare two lists, one for the articles parsed from docx files and one for the articles parsed from pdf files.

In [4]:
# --- GLOBAL VARIABLES ---
articles_docx = []
articles_pdf = []

This first helper function parses a single docx file.

In [5]:
# --- HELPER FUNCTION 1 ---
def docx_parse_file(path):
    """Parses a single Nexis DOCX file using 'Ende des Dokuments' as a splitter."""
    doc = Document(path)
    local_articles = []
    current_lines = []
    splitter = "Ende des Dokuments"

    # Regular expression to find dates in German format (e.g., 15. Januar 2025)
    date_pattern = re.compile(r'\d{1,2}\.?\s+(?:Januar|Februar|März|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember)\s+\d{4}', re.IGNORECASE)

    for para in doc.paragraphs:
        txt = para.text.strip()
        if not txt: continue 

        # Check if the end of an article block is reached
        if splitter in txt:
            if current_lines:
                data = docx_extract_block(current_lines, date_pattern)
                if data: local_articles.append(data)
            current_lines = []
        else:
            current_lines.append(txt)

    # Process the last block
    if current_lines:
        data = docx_extract_block(current_lines, date_pattern)
        if data: local_articles.append(data)

    return local_articles
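
The splitting logic can be tried out without a docx file. The following is a minimal sketch, assuming a plain list of strings in place of `doc.paragraphs`; the function name `split_blocks` and the sample paragraphs are made up for illustration and are not part of the notebook.

```python
import re

# Same pattern as in docx_parse_file: day, German month name, four-digit year
date_pattern = re.compile(
    r'\d{1,2}\.?\s+(?:Januar|Februar|März|April|Mai|Juni|Juli|August|'
    r'September|Oktober|November|Dezember)\s+\d{4}',
    re.IGNORECASE,
)

def split_blocks(paragraphs, splitter="Ende des Dokuments"):
    """Group non-empty paragraphs into blocks, closing a block at each splitter line."""
    blocks, current = [], []
    for text in paragraphs:
        text = text.strip()
        if not text:
            continue
        if splitter in text:
            if current:
                blocks.append(current)
            current = []
        else:
            current.append(text)
    if current:  # trailing block without a closing splitter
        blocks.append(current)
    return blocks

# Made-up paragraphs standing in for the text of doc.paragraphs
sample = [
    "Erster Titel", "Beispielzeitung", "15. Januar 2025",
    "Copyright 2025 Beispielverlag", "Body", "Erster Text.",
    "Ende des Dokuments",
    "", "Zweiter Titel", "Beispielzeitung", "3 März 2024",
    "Copyright 2024 Beispielverlag", "Body", "Zweiter Text.",
]

blocks = split_blocks(sample)
# First line in each block that contains a German-format date
dates = [next(line for line in block if date_pattern.search(line)) for block in blocks]
print(len(blocks), dates)
```

The second block has no closing splitter line and is still collected, which corresponds to the "Process the last block" step in the helper function. Purely numeric dates such as `15.01.2025` are not matched by this pattern.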

In [6]:
# --- HELPER FUNCTION 2 ---
def docx_extract_block(lines, date_regex):
    """Extracts the metadata and text from a Nexis article block."""

    # Find the Copyright line as an anchor
    copyright_index = -1
    for i, line in enumerate(lines):
        if line.lower().startswith("copyright"):
            copyright_index = i
            break     
    if copyright_index == -1: return None

    # Search backwards from the copyright anchor to find the date
    date_index = -1
    date_str = ""
    start_search = max(0, copyright_index - 10)
    for i in range(copyright_index - 1, start_search, -1):
        if date_regex.search(lines[i]):
            date_index = i
            date_str = lines[i]
            break
            
    if date_index == -1:
        date_index = copyright_index - 1
        date_str = lines[date_index]

    # Identify the source which is usually the line above the date
    source = lines[date_index - 1] if date_index > 0 else "Unbekannt"

    # Identify the title which is everything before the source
    if date_index > 1:
        raw_title_lines = lines[:date_index - 1]
        title = " ".join(raw_title_lines)
    else:
        title = "Unknown"

    # Extract the body text (starts after 'Body' keyword or after Copyright)
    body_index = -1
    for i in range(copyright_index, len(lines)):
        if lines[i] == "Body":
            body_index = i
            break
            
    text_start = body_index + 1 if body_index != -1 else copyright_index + 1

    # Clean the text by removing the metadata lines
    clean_text = []
    for line in lines[text_start:]:
        if line.startswith(("Load-Date:", "Link to PDF", "Graphic")): continue
        if any(x in line for x in ["Section:", "Length:", "Byline:", "Highlight:"]): continue
        clean_text.append(line)

    # We then return the extracted date, source, title, and the cleaned text
    return {
        "Date_Raw": date_str,
        "Source": source,
        "Title": title,
        "Text": " ".join(clean_text)
    }
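
To make the anchor-based lookup easier to follow, here is a simplified stand-in for the metadata part of `docx_extract_block`. It is a sketch under assumptions: the name `locate_metadata`, the `window` parameter, and the sample block are hypothetical, and the body-text cleaning is left out.

```python
import re

date_regex = re.compile(
    r'\d{1,2}\.?\s+(?:Januar|Februar|März|April|Mai|Juni|Juli|August|'
    r'September|Oktober|November|Dezember)\s+\d{4}',
    re.IGNORECASE,
)

def locate_metadata(lines, date_regex, window=10):
    """Find date, source and title relative to the copyright line (illustration only)."""
    # 1. Anchor: first line that starts with "copyright" (case-insensitive)
    copyright_index = next(
        (i for i, line in enumerate(lines) if line.lower().startswith("copyright")),
        -1,
    )
    if copyright_index == -1:
        return None

    # 2. Walk upwards from the anchor until a line contains a date
    date_index = copyright_index - 1  # fallback: line directly above the anchor
    for i in range(copyright_index - 1, max(0, copyright_index - window), -1):
        if date_regex.search(lines[i]):
            date_index = i
            break

    # 3. Source is the line above the date, the title is everything before the source
    source = lines[date_index - 1] if date_index > 0 else "Unbekannt"
    title = " ".join(lines[:date_index - 1]) if date_index > 1 else "Unknown"
    return {"Date_Raw": lines[date_index], "Source": source, "Title": title}

# Made-up article block in the layout the helper function expects
block = [
    "Titelzeile eins", "Titelzeile zwei",
    "Beispielzeitung",
    "15. Januar 2025",
    "Copyright 2025 Beispielverlag",
    "Body", "Text des Artikels.",
]
result = locate_metadata(block, date_regex)
print(result)
```

A block without a copyright line yields `None`, which is why such blocks never reach the article list.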

In [7]:
# --- HELPER FUNCTION 3 ---
def pdf_parse_file(path):
    """Parses PDFs from Wiso"""
    full_text_lines = []

    # Extract the text from the pdf using the pdfplumber library
    with pdfplumber.open(path) as pdf:
        
        # The first page is the table of contents so it can be skipped
        start_page = 1 if len(pdf.pages) > 1 else 0
        
        for i, page in enumerate(pdf.pages):
            if i < start_page: continue 
            text = page.extract_text()
            if text:
                lines = text.split('\n')
                # Filter out headers and footers
                clean_lines = [l for l in lines if not l.startswith("Dokumente") and not re.search(r'Seite \d+ von \d+', l)]
                full_text_lines.extend(clean_lines)

    local_articles = []
    current_buffer = []
    # Regex for dates (with format xx.xx.xx or xx.xx.xxxx)
    date_regex = re.compile(r'(\d{2}\.\d{2}\.\d{2,4})')

    # Iterate through lines to build articles
    for line in full_text_lines:
        stripped_line = line.strip()

        # Check if a line marks the end of an article, which is the case when it starts with "Quelle:" (source)
        if stripped_line.startswith("Quelle:") or stripped_line.startswith("Quelle :"):
            if current_buffer:
                # Extract metadata from the footer line
                raw_meta = stripped_line
                date_match = date_regex.search(raw_meta)
                date_str = date_match.group(1) if date_match else None

                # Clean up source name
                source_clean = raw_meta.replace("Quelle:", "").replace("Quelle :", "").strip()
                if date_match:
                    source_clean = source_clean.split(date_str)[0].strip()
                    source_clean = re.split(r',|Nr\.', source_clean)[0].strip()

                # We add a WISO tag so we can identify the extracted Wiso articles later
                if "(WISO)" not in source_clean:
                    source_clean = f"{source_clean} (WISO)"

                # Now the title and text will be extracted
                title = "Unbekannt"
                body_text = ""

                # Remove empty lines at the start
                while current_buffer and not current_buffer[0].strip():
                    current_buffer.pop(0)

                # Set title as the first line and text as the rest
                if current_buffer:
                    title = current_buffer[0]
                    body_text = " ".join(current_buffer[1:])

                # The extracted date, source, title, and text are stored as one article
                local_articles.append({
                    "Date_Raw": date_str,
                    "Source": source_clean,
                    "Title": title,
                    "Text": body_text
                })

                # Reset the buffer for the next article
                current_buffer = []

        # Skip the metadata block lines that follow the source
        elif any(marker in stripped_line for marker in ["Ressort:", "Dokumentnummer:", "Dauerhafte Adresse", "Alle Rechte vorbehalten", "GENIOS"]):
            continue
        else:
            # Collect text line 
            current_buffer.append(stripped_line)

    return local_articles
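
The handling of the `Quelle:` footer line and of the page headers can be checked in isolation. This is a minimal sketch that repeats the same string operations as `pdf_parse_file`; the function names `parse_wiso_footer` and `drop_page_furniture` are hypothetical, and the sample lines are made up, so real Wiso footers may look different.

```python
import re

date_regex = re.compile(r'(\d{2}\.\d{2}\.\d{2,4})')

def parse_wiso_footer(line):
    """Split a 'Quelle:' footer line into (source, date_string)."""
    date_match = date_regex.search(line)
    date_str = date_match.group(1) if date_match else None

    source = line.replace("Quelle:", "").replace("Quelle :", "").strip()
    if date_match:
        # Keep only what precedes the date, then cut at the first comma or "Nr."
        source = source.split(date_str)[0].strip()
        source = re.split(r',|Nr\.', source)[0].strip()
    if "(WISO)" not in source:
        source = f"{source} (WISO)"
    return source, date_str

def drop_page_furniture(lines):
    """Remove page header lines ('Dokumente ...') and footers ('Seite x von y')."""
    return [
        line for line in lines
        if not line.startswith("Dokumente") and not re.search(r'Seite \d+ von \d+', line)
    ]

# Made-up sample lines
with_date = parse_wiso_footer("Quelle: Beispiel Zeitung, 12.03.2021, Nr. 60, S. 5")
without_date = parse_wiso_footer("Quelle: Beispiel Zeitung")
kept = drop_page_furniture(["Dokumente 1-50", "Ein Satz aus dem Artikel.", "Seite 3 von 12"])
print(with_date, without_date, kept)
```

When the footer contains no date, the source string is not cut at commas, so anything after the source name stays part of the `Source` value.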

This cell reads all .docx files from the docx input folder. Running this cell takes about 5-10 minutes.

In [8]:
# --- PARSE DOCX FILES (TAKES 5-10 MINUTES) ---

if os.path.exists(DOCX_FOLDER):
    docx_files = [f for f in os.listdir(DOCX_FOLDER) if f.lower().endswith(".docx")]
    print(f"Found {len(docx_files)} DOCX Files.")
    
    # Start with empty list
    articles_docx = []

    # Use tqdm for pretty progress bars
    for filename in tqdm(docx_files, desc="Parsing DOCX"):
        path = os.path.join(DOCX_FOLDER, filename)
        try:
            extracted = docx_parse_file(path)
            articles_docx.extend(extracted)
        except Exception as e:
            print(f"Error reading {filename}: {e}")
            
    print(f"Read in of DOCX complete. {len(articles_docx)} articles stored.")
else:
    print(f"Folder {DOCX_FOLDER} not found.")

Found 588 DOCX Files.


Parsing DOCX: 100%|██████████| 588/588 [06:42<00:00,  1.46it/s]

Read in of DOCX complete. 70669 articles stored.





Here, all pdf files are read from the pdf input folder. Running this cell can take a long time; depending on the machine, it might take 30-45 minutes.

In [9]:
# --- PARSE PDF FILES (TAKES AROUND 30-45 MINUTES) ---

if os.path.exists(PDF_FOLDER):
    pdf_files = [f for f in os.listdir(PDF_FOLDER) if f.lower().endswith(".pdf")]
    print(f"Found {len(pdf_files)} PDF Files.")
    
    # Start with empty list
    articles_pdf = []

    # Use tqdm for pretty progress bars
    for filename in tqdm(pdf_files, desc="Parsing PDF"):
        path = os.path.join(PDF_FOLDER, filename)
        try:
            extracted = pdf_parse_file(path)
            articles_pdf.extend(extracted)
        except Exception as e:
            print(f"Error reading {filename}: {e}")
            
    print(f"Read in of PDF complete. {len(articles_pdf)} articles stored.")
else:
    print(f"Folder {PDF_FOLDER} not found.")

Found 117 PDF Files.


Parsing PDF: 100%|██████████| 117/117 [40:43<00:00, 20.88s/it]

Read in of PDF complete. 6424 articles stored.





All that is left is to combine the extracted data, filter junk, and create a pandas DataFrame.
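
The cover-page filter relies on `Series.str.contains` with a regular expression. The following toy example, with made-up rows, sketches how the mask behaves, including for a missing title:

```python
import pandas as pd

# Made-up rows: one Nexis cover page, one regular article, one article without a title
df = pd.DataFrame({
    "Title": ["Job Number: 123 Search Terms: KI", "Maschine als Mensch", None],
    "Source": ["Nexis", "Beispielzeitung", "Beispielzeitung"],
})

# case=False ignores upper/lower case, na=False treats missing titles as "no match"
mask_junk = df["Title"].str.contains(
    "Job Number|Search Terms|Request ID", case=False, na=False
)
df_clean = df[~mask_junk]
print(len(df_clean))
```

Because of `na=False`, rows with a missing title are kept rather than dropped; only rows whose title matches one of the three cover-page phrases are removed.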

In [10]:
# --- JOIN, CLEAN AND SAVE DATA ---

# Merge the two data lists
all_data = articles_docx + articles_pdf
print(f"Total number of raw extracted articles: {len(all_data)}")

# Convert to a pandas DataFrame
if all_data:
    df = pd.DataFrame(all_data)
    
    # Remove the cover pages from Nexis, as they are not articles and we do not want to save them
    print("Filtering Nexis cover pages...")
    mask_junk = df['Title'].str.contains("Job Number|Search Terms|Request ID", case=False, na=False)
    df_clean = df[~mask_junk]
    
    print(f"Articles after cleaning: {len(df_clean)}")
    
    # Save the DataFrame to CSV
    os.makedirs(os.path.dirname(OUTPUT_FILE), exist_ok=True)
    df_clean.to_csv(OUTPUT_FILE, index=False)
    print(f"Data successfully saved to: {OUTPUT_FILE}")
    
    # Show a preview of the data
    print("\nData Preview:")
    print(df_clean[['Source', 'Title']].head())
    print("\nTop 5 Sources:")
    print(df_clean['Source'].value_counts().head())

else:
    print("No data found. Please make sure to run all cells beforehand.")

Total number of raw extracted articles: 77093
Filtering Nexis cover pages...
Articles after cleaning: 76954
Data successfully saved to: ../data/csv/articles_raw.csv

Data Preview:
               Source                                              Title
0       Urner Zeitung           Das Lexikon für den gepflegten Smalltalk
1   Groß-Gerauer Echo                                Maschine als Mensch
2         Focus-Money  BUCHHALTUNGSPROGRAMME IM TEST; Software mit Gü...
3  Allgemeine Zeitung                                Maschine als Mensch
4    Berliner Zeitung  Wie in Hollywood; Das Computerspiel "Total War...

Top 5 Sources:
Source
dpa-AFX ProFeed                                                   4385
Rheinische Post                                                   2136
Neue Zürcher Zeitung (Internationale Ausgabe) & NZZ am Sonntag    1751
SDA - Basisdienst Deutsch                                         1173
Die Presse                                                        1112
Name