***This is a Jupyter Notebook for the project ```Historical Topic Modelling```***

Let's first create a new clean environment! If prompted to select the environment, click yes.

In [None]:
!python -m venv HistTopMod

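Next, let's install notebook and ipykernel, so that the virtual environment can be used as a kernel.
Depending on your internet connection, this might take a while. The commands below use the Windows path layout; on macOS or Linux the interpreter is located at HistTopMod/bin/python instead.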

In [None]:
!HistTopMod\Scripts\python -m pip install notebook ipykernel

Now let's register the virtual environment with Jupyter:

In [None]:
!HistTopMod\Scripts\python -m ipykernel install --user --name=HistTopMod --display-name "Python (HistTopMod)"


**This step is crucial, and involves you changing the kernel manually inside your IDE.**

1. Select the current kernel (usually base)
2. Choose "Select Another Kernel"
3. Choose "Python Environments"
4. Choose the newly created HistTopMod kernel

Now we are all set with our clean virtual environment!
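
To double-check that the notebook really runs inside the new environment, a quick test like the following can help. This is only a minimal sketch: the helper name `in_virtualenv` is made up for illustration, and it relies on the fact that inside a venv `sys.prefix` differs from `sys.base_prefix`.

```python
import sys


def in_virtualenv():
    """Return True if the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points to the environment directory,
    # while sys.base_prefix still points to the base installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Running inside a virtual environment:", in_virtualenv())
```

If the second line prints `False`, the kernel is most likely still the base interpreter and the kernel selection steps above should be repeated.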

***Get all required packages***

A requirements.txt file is provided here:

``https://github.com/raoul-zeno/Historical-Topic-Modeling/blob/main/requirements.txt``

You can either copy and paste the requirements.txt file into this directory, or alternatively run the code below to do the fetch automatically.

In [None]:
import requests

# Raw-file URL of requirements.txt in the project's GitHub repository
github_req_url = "https://raw.githubusercontent.com/raoul-zeno/Historical-Topic-Modeling/refs/heads/main/requirements.txt"

try:
    # A timeout keeps the cell from hanging on a stalled connection
    response = requests.get(github_req_url, timeout=30)
    response.raise_for_status()

    with open("requirements.txt", "w", encoding="utf-8") as file:
        file.write(response.text)

    print("requirements.txt was fetched and saved to the directory!")

except requests.exceptions.RequestException as e:
    print(f"Failed to fetch the file: {e}")

To install all the necessary libraries from requirements.txt run the following code:

As quite a few libraries are needed due to the complexity of the models, this might take a while, depending on your internet connection. Don't worry, as long as you keep working in this directory, this step only needs to be done once.

In [None]:
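import subprocess
import sys

try:
    # sys.executable is the interpreter of the active kernel, so the packages
    # are installed into the HistTopMod environment rather than into whichever
    # pip happens to be first on PATH
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"])
    print("All packages installed successfully!")

except subprocess.CalledProcessError as e:
    print(f"An error occurred: {e}")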
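
If the installation was interrupted, it can be useful to see which packages are still missing. The following is a rough sketch using only the standard library; the helper names `requirement_names` and `missing_packages` are made up for illustration, the sample lines and version numbers are hypothetical, and the name parsing is deliberately simple (it does not cover every requirements.txt feature).

```python
import re
from importlib import metadata


def requirement_names(lines):
    """Extract bare package names from requirements.txt-style lines."""
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        # The name ends at the first version specifier, extra, or marker
        names.append(re.split(r"[\s=<>!~\[;@]", line, maxsplit=1)[0])
    return names


def missing_packages(names):
    """Return the names that are not installed in the current environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Hypothetical sample lines; in the notebook, read them from requirements.txt
sample = ["bertopic==0.16.0", "# a comment", "", "pandas>=2.0"]
print(requirement_names(sample))
```

In the notebook itself, the lines would come from `open("requirements.txt").read().splitlines()`, and `missing_packages(requirement_names(lines))` would list whatever still needs to be installed.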


Great! So we are all set for some topic modelling!

Next, let's create a directory to store our source XML files in!


In [None]:
!mkdir source_files

Please copy all the source files (in XML-Format) into this directory.

This next step does three things:
1. An SQL-database is created inside this directory
2. Some key metadata and the full text of each XML-File are extracted
3. This data is added to the SQL-Database

Depending on the size of the corpus this might take a while, as it loops over every individual XML-File and extracts the data.

In [None]:
import os
import sqlite3
import xml.etree.ElementTree as ET

import pandas as pd

xml_directory = r"./source_files/"

# TEI namespace used by all lookups below
ns = {"ns0": "http://www.tei-c.org/ns/1.0"}

data = []

con = sqlite3.connect("Database.db")


def extract_text(text_element):
    """Collect the text and tail of every element below text_element."""
    text_content = []

    def recursive_extract(text_elem):
        if text_elem.text:
            text_content.append(text_elem.text)
        for child in text_elem:
            recursive_extract(child)
            if child.tail:
                text_content.append(child.tail)

    # A file without a <text> element yields an empty string instead of an error
    if text_element is not None:
        recursive_extract(text_element)

    return " ".join(text_content)


# sorted() keeps the row order (and thus the index) reproducible between runs
for file_name in sorted(os.listdir(xml_directory)):
    if file_name.endswith(".xml"):
        file_path = os.path.join(xml_directory, file_name)

        tree = ET.parse(file_path)
        root = tree.getroot()

        # Metadata from the TEI header; findtext returns None if a field is missing
        main_title = root.findtext('.//ns0:title[@type="main"]', namespaces=ns)
        sub_title = root.findtext('.//ns0:title[@type="sub"]', namespaces=ns)
        volume_title = root.findtext('.//ns0:title[@type="volume"]', namespaces=ns)
        class_main = root.findtext(".//ns0:classCode[@scheme='https://www.deutschestextarchiv.de/doku/klassifikation#dwds1main']", namespaces=ns)
        class_sub = root.findtext(".//ns0:classCode[@scheme='https://www.deutschestextarchiv.de/doku/klassifikation#dwds1sub']", namespaces=ns)
        author_surname = root.findtext(".//ns0:surname", namespaces=ns)
        author_forename = root.findtext(".//ns0:forename", namespaces=ns)
        author = f"{author_surname}, {author_forename}"
        publication_date_str = root.findtext(".//ns0:sourceDesc/ns0:biblFull/ns0:publicationStmt/ns0:date[@type='publication']", namespaces=ns)
        language = root.findtext(".//ns0:language", namespaces=ns)

        # Full text of the document body
        text_element = root.find(".//ns0:text", namespaces=ns)
        plain_text = extract_text(text_element)

        data.append({
            "main_title": main_title,
            "sub_title": sub_title,
            "volume_title": volume_title,
            "author": author,
            "publication_date": publication_date_str,
            "class_main": class_main,
            "class_sub": class_sub,
            "language": language,
            "text": plain_text
        })

df = pd.DataFrame(data)

# index=True writes the DataFrame index as a column named "index"
df.to_sql("my_data", con, index=True, if_exists="replace")
con.close()
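
To see how the namespace-aware lookups in the cell above behave, here is a small self-contained sketch. The XML string is a made-up miniature TEI document, not one of the project's source files. It also uses `itertext()`, the built-in way in `xml.etree.ElementTree` to walk all nested text; unlike the recursive helper above, the pieces are joined here without inserting spaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature TEI document, for illustration only
sample_xml = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title type="main">Beispieltitel</title>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text><body><p>Erster <hi>hervorgehobener</hi> Satz.</p></body></text>
</TEI>"""

ns = {"ns0": "http://www.tei-c.org/ns/1.0"}
root = ET.fromstring(sample_xml)

# The prefix ns0 is mapped to the TEI namespace via the ns dictionary
main_title = root.findtext('.//ns0:title[@type="main"]', namespaces=ns)
text_element = root.find(".//ns0:text", namespaces=ns)

# itertext() yields the text and tail of every nested element in document order
plain_text = "".join(text_element.itertext())

print(main_title)
print(plain_text)
```

Without the `namespaces=ns` argument, the same lookups would return `None`, because every element in a TEI file lives in the TEI namespace.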

There is one issue: pandas names the index column index, which is also an SQL keyword. So let's rename it to text_index.

In [None]:
import sqlite3

con = sqlite3.connect("Database.db")
cur = con.cursor()

# Double quotes mark "index" as an identifier, since INDEX is an SQL keyword
cur.execute('ALTER TABLE my_data RENAME COLUMN "index" TO text_index;')

print("Renamed index to text_index")
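
To confirm that a rename like this worked, the column names can be listed with `PRAGMA table_info`. The sketch below runs against a throwaway in-memory database with a made-up, shortened table layout, so the real Database.db is not touched. Note that `RENAME COLUMN` needs SQLite 3.25 or newer.

```python
import sqlite3

# In-memory database so the real Database.db is not touched
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Mimic the layout pandas writes: an "index" column plus data columns
cur.execute('CREATE TABLE my_data ("index" INTEGER, main_title TEXT, text TEXT)')
cur.execute('ALTER TABLE my_data RENAME COLUMN "index" TO text_index')

# PRAGMA table_info returns one row per column; the name is the second field
columns = [row[1] for row in cur.execute("PRAGMA table_info(my_data)")]
print(columns)

con.close()
```

Running the same `PRAGMA table_info(my_data)` query against Database.db should show text_index as the first column.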

The groundwork for the modelling is now done. Let's restart the kernel to clear memory. Everything done so far (the database, the libraries installed from requirements.txt) is stored on disk and will persist.

**Text Preprocessing**
In this section, the plain texts in the database will be preprocessed for Machine Learning purposes. For this, we use regular expressions (regex) together with Unicode normalization. The preprocessed texts will be added automatically to the database.
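
To illustrate the idea on a single word: NFD normalization splits a character like ü into a plain u followed by a combining diaeresis (U+0308), which a regex can then replace with an e. The snippet below is a reduced sketch of that technique, not the full preprocessing; the function name `expand_umlauts` is made up for illustration.

```python
import re
import unicodedata


def expand_umlauts(text):
    """Lowercase text and rewrite umlauts and long s with basic Latin letters."""
    # NFD separates base letters from their combining marks
    decomposed = unicodedata.normalize("NFD", text.lower())
    # A combining diaeresis (U+0308) or superscript e (U+0364) after a, o, u becomes "e"
    decomposed = re.sub(r"(?<=[aou])[\u0308\u0364]", "e", decomposed)
    # The long s (U+017F) becomes a regular s
    return decomposed.replace("\u017f", "s")


for word in ["Für", "Wa\u017f\u017fer", "u\u0364ber"]:
    print(word, "->", expand_umlauts(word))
```

The order matters: the combining marks only exist as separate characters after the NFD step, so the replacements have to come after it.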

Let's first create a new column called preprocessed_text:

In [None]:
import sqlite3

con = sqlite3.connect("Database.db")
cur = con.cursor()

cur.execute("ALTER TABLE my_data ADD COLUMN preprocessed_text text")

Next, we preprocess the texts and write the results to the newly created column.

In [2]:
import pandas as pd
import unicodedata
import re
import sqlite3
import time

con = sqlite3.connect("Database.db")
cur = con.cursor()

def preprocess_text(text):
    # lowercase the text
    result_text = text.lower()

    # decompose characters so that diacritics become separate combining marks
    result_text = unicodedata.normalize("NFD", result_text)

    # rejoin words hyphenated across a line break; this has to run while the
    # line breaks are still in the text, i.e. before whitespace is collapsed
    result_text = re.sub(r"-\s*\n\s*", "", result_text)

    # expand diaeresis and superscript e to a following "e"
    result_text = re.sub(r"(?<=\w)\u0364", "e", result_text)
    result_text = re.sub(r"o\u0308", "oe", result_text)
    result_text = re.sub(r"a\u0308", "ae", result_text)
    result_text = re.sub(r"u\u0308", "ue", result_text)

    # replace long s
    result_text = re.sub(r"\u017F", "s", result_text)

    # replace round r
    result_text = re.sub(r"\uA75B", "r", result_text)

    # expand ligatures
    result_text = re.sub(r"\u00E6", "ae", result_text)
    result_text = re.sub(r"\u0153", "oe", result_text)

    # strip the abbreviation tilde from m and n
    result_text = re.sub(r"m\u0303", "m", result_text)
    result_text = re.sub(r"n\u0303", "n", result_text)

    # remove everything that is not a basic Latin letter or whitespace
    # (punctuation, digits, leftover combining marks)
    result_text = re.sub(r"[^a-zA-Z\s]", "", result_text)

    # take out two-letter and single-letter words (experimental); note that the
    # quantifier must be written {1,2} without a space, and that this runs after
    # the replacements above so that combining marks cannot split a word
    result_text = re.sub(r"\b\w{1,2}\b", "", result_text)

    # collapse line breaks and the gaps left behind by the removals
    result_text = re.sub(r"\s+", " ", result_text).strip()

    return result_text
    
def add_pp_texts_to_database():
    number_of_entries = cur.execute("SELECT COUNT(*) FROM my_data").fetchone()
    for i in range(number_of_entries[0]):
        start = time.time()
        example_textObj = cur.execute("SELECT text FROM my_data WHERE text_index=?", (i,))
        example_text = example_textObj.fetchone()
        result_text = preprocess_text(example_text[0])
        cur.execute("UPDATE my_data SET preprocessed_text=? WHERE text_index=?", (result_text, i,))
        end = time.time()
        print(f"ID {i} has taken {end - start} seconds")
    con.commit()

add_pp_texts_to_database()

ID 0 has taken 0.02062082290649414 seconds
ID 1 has taken 0.030745744705200195 seconds
ID 2 has taken 0.03801369667053223 seconds
ID 3 has taken 0.01793193817138672 seconds
ID 4 has taken 0.01477503776550293 seconds
ID 5 has taken 0.08770871162414551 seconds
ID 6 has taken 0.014225959777832031 seconds
ID 7 has taken 0.03520989418029785 seconds
ID 8 has taken 0.054923057556152344 seconds
ID 9 has taken 0.020576000213623047 seconds
ID 10 has taken 0.0 seconds
ID 11 has taken 0.012901067733764648 seconds
ID 12 has taken 0.01747608184814453 seconds
ID 13 has taken 0.007758140563964844 seconds
ID 14 has taken 0.017307043075561523 seconds
ID 15 has taken 0.016184568405151367 seconds
ID 16 has taken 0.00859379768371582 seconds
ID 17 has taken 0.0 seconds
ID 18 has taken 0.007204294204711914 seconds
ID 19 has taken 0.05232715606689453 seconds
ID 20 has taken 0.03126859664916992 seconds
ID 21 has taken 0.007688760757446289 seconds
ID 22 has taken 0.007014751434326172 seconds
ID 23 has taken 0.0

The texts have now been preprocessed, so let's implement the model. To get started, let's first try BERTopic on the well-known 20 Newsgroups dataset, which scikit-learn provides through the function fetch_20newsgroups.

In [None]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]
topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(docs)

topic_model.visualize_topics().show()

The real deal:

Download German stop words and extend them with historical spellings:

In [3]:
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("stopwords")

german_stop_words = stopwords.words("german")
specific_stop_words = ["vnd", "fuer", "vns", "ueber", "koennte", "vnsere", "vnserem", "vnser", "vnserer", "vnter", "waehrend", "vnnd", "bey", "sey", "auff", "allda", "allwo", "alsdann", "also", "demnach", "demselben", "desgleichen", "hinwiederum", "ohn", "obgleich", "obschon", "solches", "nachdem", "dergestalt", "mithin", "weswegen", "weshalb", "indem", "sodann", "wanns", "darob", "daher", "hierin", "daraus", "drum", "obgleich", "allzumal"]
german_stop_words += specific_stop_words
vect = CountVectorizer(stop_words=german_stop_words)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\raoul\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
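
Because the preprocessed texts spell umlauts as ae, oe and ue, a stop word such as "für" only matches if it is also listed as "fuer". The cell above adds some of these spellings by hand. As a possible alternative, the sketch below derives them automatically; the function name `transliterate` and the short sample list are made up for illustration.

```python
import re
import unicodedata


def transliterate(word):
    """Apply the same umlaut expansion as the preprocessing step to one word."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    expanded = re.sub(r"(?<=[aou])\u0308", "e", decomposed)
    # Drop anything that is not a basic Latin letter, as the preprocessing does
    return re.sub(r"[^a-z]", "", expanded)


# Small hand-picked sample; in the notebook this would be stopwords.words("german")
sample_stop_words = ["für", "über", "und", "können"]

# Keep the original spellings and add the transliterated ones
extended = sorted(set(sample_stop_words) | {transliterate(w) for w in sample_stop_words})
print(extended)
```

The resulting list could be passed to `CountVectorizer(stop_words=...)` in the same way as `german_stop_words`.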


Make the documents for the model!

In [4]:
import sqlite3

con = sqlite3.connect("Database.db")
cur = con.cursor()

docus = []
range_index = cur.execute("SELECT COUNT(*) FROM my_data").fetchone()
for i in range(range_index[0]):
    text = cur.execute("SELECT preprocessed_text FROM my_data WHERE text_index=?", (i,)).fetchone()
    docus.append(text[0])
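
The loop above sends one query per document. As an alternative, a single ordered query can return all texts at once and skip rows whose preprocessed_text is still empty. The sketch below shows this on a throwaway in-memory table with made-up rows; only the table and column names are taken from this notebook.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical stand-in for the my_data table, including one unprocessed row
cur.execute("CREATE TABLE my_data (text_index INTEGER, preprocessed_text TEXT)")
cur.executemany(
    "INSERT INTO my_data VALUES (?, ?)",
    [(1, "zweiter text"), (0, "erster text"), (2, None)],
)

# One ordered query replaces the per-row lookups; NULL rows are skipped
rows = cur.execute(
    "SELECT preprocessed_text FROM my_data "
    "WHERE preprocessed_text IS NOT NULL ORDER BY text_index"
).fetchall()
sample_docus = [row[0] for row in rows]
print(sample_docus)

con.close()
```

One thing to keep in mind with this variant: if rows are skipped, the position of a document in the list no longer equals its text_index.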

Now let's train the model!

In [7]:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

sentence_transformer = SentenceTransformer("distiluse-base-multilingual-cased-v2")

topic_model = BERTopic(language="multilingual", verbose=True, low_memory=True, vectorizer_model=vect, embedding_model=sentence_transformer)
topics, probs = topic_model.fit_transform(docus)

2024-12-14 10:53:19,859 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 139/139 [12:52<00:00,  5.56s/it]
2024-12-14 11:07:06,756 - BERTopic - Embedding - Completed ✓
2024-12-14 11:07:06,758 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-12-14 11:07:30,077 - BERTopic - Dimensionality - Completed ✓
2024-12-14 11:07:30,078 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-12-14 11:07:30,340 - BERTopic - Cluster - Completed ✓
2024-12-14 11:07:30,356 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-12-14 11:11:10,058 - BERTopic - Representation - Completed ✓
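
Before visualizing, it can help to check how the documents are spread over the topics. `fit_transform` returns one topic id per document, and BERTopic assigns the id -1 to outliers that fit no topic. The sketch below counts these ids with the standard library; the helper name `topic_sizes` and the sample list are made up for illustration.

```python
from collections import Counter


def topic_sizes(topics):
    """Count documents per topic id, listing the largest topics first."""
    return Counter(topics).most_common()


# Hypothetical assignments; in the notebook, pass the `topics` list from fit_transform
sample_topics = [0, 1, -1, 0, 2, 0, -1, 1]

for topic_id, size in topic_sizes(sample_topics):
    label = "outliers" if topic_id == -1 else f"topic {topic_id}"
    print(f"{label}: {size} documents")
```

BERTopic also offers `topic_model.get_topic_info()`, which returns a similar overview together with the top words of each topic.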


***Now, there are a lot of ways in which you can visualize the outcomes***

Intertopic distance map:

In [8]:
topic_model.visualize_topics().show()

Barchart:

In [9]:
topic_model.visualize_barchart()

In [11]:
embeddings = sentence_transformer.encode(docus, show_progress_bar=True)

Batches: 100%|██████████| 139/139 [14:43<00:00,  6.35s/it]


In [1]:
topic_model.visualize_heatmap()

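NameError: name 'topic_model' is not defined

This error appears because the cell was run in a fresh kernel (note the execution count In [1]), where topic_model does not exist yet. Re-run the training cell above first, then run this cell again.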