
# Code for Master's Thesis: Topic Modeling

## Research Questions

1. Which topics can be identified in the DHd conference abstracts from 2014 to 2023 using topic modeling?

2. Which topics frequently co-occur in the same document and therefore show a high topic similarity? **Hierarchical Clustering**

3. How have the thematic priorities changed over the years, and which trends can be identified? **Mann-Kendall-Test**

4. Which developments can be observed in the use of different research methods?

5. Which researchers contribute abstracts particularly often, in which author teams do they appear, and how do these teams change over time?

6. Which clusters of researchers can be identified with regard to topics, and how do these clusters change over time? **Network Analysis**

### Imports

In [1]:
# reading in the necessary PDF and XML files
import zipfile
from bs4 import BeautifulSoup
# note: PyPDF2 did not recognize the characters well, so some words were dropped
import PyPDF2
import fitz
from io import BytesIO

# (pre)processing the files
import re
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
import spacy
from langdetect import detect
from gensim.models import TfidfModel
import pickle
# note: the in-script function did not work
from ocrfixr import spellcheck

# LDA
import gensim
import gensim.corpora as corpora
import os

# corpus statistics (pd.DataFrame is used further below)
import pandas as pd

### General functions: opening lists, saving and reopening objects, function to get conference names from file names

In [2]:
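def open_list(doc_name):
    # reads a text file and splits its content at ", " into a list
    # the with-statement closes the file (the former file.close() after return was never reached)
    with open(doc_name, "r", encoding='utf-8') as file:
        data = file.read()

    return data.split(", ")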

In [3]:
def save_object(dirname, filename, varname):
    # pickles the object varname to the file dirname + filename
    filename = dirname + filename
    with open(filename, 'wb') as g:
        pickle.dump(varname, g)

In [4]:
def open_variable(dirname, filename):
    # loads a pickled object from the file dirname + filename
    path = str(dirname) + str(filename)
    with open(path, 'rb') as f:
        variable = pickle.load(f)

    return variable
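
The two helpers above form a simple pickle round trip. The following is a minimal sketch of that round trip, not the notebook's own code: the function names `dump_to_file` and `load_from_file`, the file name and the example data are hypothetical, and a temporary directory is used so that nothing is written to the project folder.

```python
import os
import pickle
import tempfile


def dump_to_file(dirname, filename, obj):
    # os.path.join avoids problems with a missing trailing slash in dirname
    with open(os.path.join(dirname, filename), 'wb') as handle:
        pickle.dump(obj, handle)


def load_from_file(dirname, filename):
    with open(os.path.join(dirname, filename), 'rb') as handle:
        return pickle.load(handle)


# hypothetical example data, written to a temporary directory
example = [['topic', 'modeling'], ['digital', 'humanities']]
with tempfile.TemporaryDirectory() as tmp_dir:
    dump_to_file(tmp_dir, 'example.pckl', example)
    restored = load_from_file(tmp_dir, 'example.pckl')

print(restored == example)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.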

In [5]:
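def get_conference_names(file_list):
    # turns paths such as 'Corpus/DHd_2014.zip' into names such as 'DHd 2014'
    # (parameter renamed from 'list' so that the built-in is not shadowed)
    files = []
    for element in file_list:
        new_name = element.split('.zip')[0]
        new_name = new_name.split('Corpus/')[1]
        new_name = re.sub('_', ' ', new_name)
        files.append(new_name)

    return files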
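
As an illustration of the name conversion, the following sketch derives the same conference names with `pathlib`. The helper name `conference_name` is hypothetical, and the sketch assumes that every path ends in `.zip` and that the file name itself carries the conference name.

```python
from pathlib import Path


def conference_name(zip_path):
    # 'Corpus/DHd_2014.zip' -> stem 'DHd_2014' -> 'DHd 2014'
    return Path(zip_path).stem.replace('_', ' ')


names = [conference_name(p) for p in ['Corpus/DHd_2014.zip', 'Corpus/DHd_2015.zip']]
print(names)
```

Unlike splitting at `'Corpus/'`, this variant does not depend on the name of the folder that contains the archives.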

### Preprocessing function: determining text language

In [6]:
def detect_language(text):
    
    #gets text as input
    lang = detect(text)

    #returns the language tag of detected language
    return lang        

### Preprocessing function: cleaning the texts

In [112]:
def clean_text(text):

    # removing line breaks from the text
    clean = re.sub(r'\n', "", str(text))

    # filtering weblinks, digits and markup from XML
    patterns = [r'http(.*?) ', r'\d', r'<(.*?)>', r'https(.*?) ']
    for pattern in patterns:
        clean = re.sub(pattern, '', clean)

    # filtering punctuation character by character
    punctuation = '''!“()´`¨[]{}\\;:”",<>/.?@#$%^&*_~'''
    for char in clean:
        if char in punctuation:
            clean = clean.replace(char, "")

    # convert a document into a list of lowercase tokens, ignoring tokens that are
    # shorter than 3 characters (min_len=3) or longer than 25 characters (max_len=25),
    # no deaccentation (by default)
    clean = gensim.utils.simple_preprocess(clean, min_len=3, max_len=25)

    # returns the cleaned-up text as a list of tokens
    return clean
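
The cleaning steps can be tried out on a single string without gensim. The sketch below is a simplified variant under stated assumptions, not the function above: it removes markup before links, matches links with `https?\S*` (so a link does not need a trailing space), deletes punctuation with `str.translate`, and splits at whitespace instead of calling `simple_preprocess`. The function name and the sample sentence are hypothetical.

```python
import re

PUNCTUATION = '''!“()´`¨[]{}\\;:”",<>/.?@#$%^&*_~'''


def strip_markup_and_noise(text):
    text = re.sub(r'<.*?>', '', text)      # XML markup
    text = re.sub(r'https?\S*', '', text)  # weblinks up to the next whitespace
    text = re.sub(r'\d', '', text)         # digits
    # delete every punctuation character in a single pass
    text = text.translate(str.maketrans('', '', PUNCTUATION))
    return text.lower().split()


sample = '<p>Topic Modeling 2023: siehe https://example.org/x für Details!</p>'
tokens = strip_markup_and_noise(sample)
print(tokens)
```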

### Preprocessing function: removing stopwords

In [8]:
def remove_stopwords(text, language, additional_stops):

    # build the German and English stopword sets, each extended by the additional stopwords
    stops_de = set(stopwords.words('german'))
    stops_de.update(additional_stops)

    stops_en = set(stopwords.words('english'))
    stops_en.update(additional_stops)

    # filter stopwords; for any language other than 'de' or 'en' no words are kept
    words_filtered = []
    for w in text:
        if language == 'de':
            if w not in stops_de:
                words_filtered.append(w)

        elif language == 'en':
            if w not in stops_en:
                words_filtered.append(w)

    # return list of words that are NOT stopwords
    return words_filtered
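
The language-dependent filtering can be sketched with a dictionary of stopword sets. The miniature stopword lists and the name `filter_stopwords` are hypothetical; the notebook uses the NLTK lists instead. Note one deliberate difference: in this sketch, texts in an unknown language keep all their tokens, whereas the function above returns an empty list for them.

```python
# hypothetical miniature stopword lists; the notebook uses the NLTK lists instead
STOPWORDS = {
    'de': {'die', 'von', 'ist', 'ein', 'und'},
    'en': {'the', 'of', 'is', 'a', 'and'},
}


def filter_stopwords(tokens, language, additional_stops=()):
    # unknown languages fall back to an empty stopword set (tokens are kept)
    stops = STOPWORDS.get(language, set()) | set(additional_stops)
    return [token for token in tokens if token not in stops]


tokens = ['die', 'begutachtung', 'von', 'forschungsbeiträgen', 'ist', 'zentral']
filtered = filter_stopwords(tokens, 'de', additional_stops=['zentral'])
print(filtered)
```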


### Preprocessing function: (morpho-syntactic) lemmatization
- Lemmatizing the words in the texts to their dictionary form according to the detected language
- Hint: the spaCy models 'de_core_news_md' and 'en_core_web_sm' have to be installed beforehand (e.g. with `python -m spacy download de_core_news_md`)

In [9]:
def lemmatization(texts, language):

    # only words tagged as nouns, verbs and adjectives are considered
    allowed_tags = ['NOUN', 'VERB', 'ADJ']

    # disabling parser and ner-tool to accelerate computing
    nlp_de = spacy.load('de_core_news_md', disable=['parser', 'ner'])
    nlp_en = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
    texts_out = []
    for text in texts:
        # reset for every text, so that other languages cannot raise a NameError
        new_text = []
        if language == 'de':
            doc = nlp_de(text)
        elif language == 'en':
            doc = nlp_en(text)
        else:
            # languages other than German and English are skipped
            continue
        for token in doc:
            if token.pos_ in allowed_tags:
                new_text.append(token.lemma_.lower())

        # skip texts in which no token had an allowed pos-tag
        if new_text != []:
            final = " ".join(new_text)
            texts_out.append(final)

    # return list of lemmatized words
    return texts_out

### Function: Extracting Keywords from XML-File
- extracts tags \<keywords n="topics" scheme="ConfTool"> and \<keywords n="keywords" scheme="ConfTool"> to get keywords of the texts
- checks validity of keywords

In [10]:
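def clean_keywords(keywords):
    # remove all tags and split the remaining string at line breaks
    keywords = re.sub("<(.*?)>", "", keywords)
    keywords = keywords.split("\n")
    # drop entries of two characters or fewer; a new list is built because
    # removing items while iterating over the same list skips entries
    keywords = [item for item in keywords if len(item) > 2]
    return keywords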
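
Filtering a list while iterating over it is a common pitfall, because every removal shifts the remaining items and the loop then skips the next one. The sketch below contrasts both approaches; the function names and the sample list are hypothetical.

```python
def drop_short_in_place(items):
    # pitfall: removing while iterating shifts the remaining items
    for item in items:
        if len(item) <= 2:
            items.remove(item)
    return items


def drop_short(items):
    # building a new list checks every item exactly once
    return [item for item in items if len(item) > 2]


sample = ['', 'ab', 'OCR', 'x', 'Netzwerkanalyse']
in_place_result = drop_short_in_place(list(sample))
comprehension_result = drop_short(sample)
print(in_place_result)       # 'ab' survives although it is too short
print(comprehension_result)
```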

In [11]:
def extract_keywords(xmldata, conf_tool_methods):

    # finds all tags <keywords n="keywords"> and <keywords n="topics">, removes all tags within
    keywords_free = str(xmldata.find_all('keywords', n='keywords'))
    keywords_conf = str(xmldata.find_all('keywords', n='topics'))

    keywords_free = clean_keywords(keywords_free)
    keywords_conf = clean_keywords(keywords_conf)

    # keep only the topics that are valid ConfTool options; a new list is built
    # because removing items while iterating over the same list skips entries
    keywords_conf = [item for item in keywords_conf if item in conf_tool_methods]

    # returns two lists: the free keywords and the validated ConfTool topics
    return keywords_free, keywords_conf

### Function: Counting number of extracted keywords/authors/etc.
- function creates dictionary from the input list
- counts how often each method/author is used or appears
- returns the dictionary

In [12]:
def count_appearances(input_list):
    
    methods_dict = {}
    # for each item in the input list, check if it is already in the dictionary
    # if not, add it and set its count to 1; if yes, add 1 to its count
    for item in input_list:
        if item not in methods_dict.keys():
            methods_dict[item] = 1
        else:
            methods_dict[item] += 1
    # sort dictionary according to highest count in the values
    sorted_dict = sorted(methods_dict.items(), key=lambda x: x[1], reverse=True)

    # return the sorted dictionary (becomes list through sorting though)
    return sorted_dict
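
The same counting and sorting can be expressed with `collections.Counter` from the standard library. This is an equivalent sketch with hypothetical example data, not a replacement the notebook depends on.

```python
from collections import Counter

# hypothetical keyword list
keywords = ['Annotieren', 'Visualisierung', 'Annotieren', 'Modellierung', 'Annotieren', 'Visualisierung']

# most_common() returns (item, count) tuples sorted by descending count
counts = Counter(keywords).most_common()
print(counts)
```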

### Function: Extracting the author names
Extracts the author names from the title statement of a single document and returns them as a list. Collected over all documents, this yields a list of lists with the authors of each text.

In [13]:
def extract_authors(title_stmt):
    # returns a list of authors for each of the texts
    
    # navigating to the title statement and finding all tags <author>
    authors = title_stmt.find_all("author")
    fore_and_surnames = []
    
    # extracting the <surname> and <forename> tags and cleaning the outcome from the tags and the brackets
    for element in authors:
        names = element.find_all(['surname', 'forename'])
        names =  re.sub("<(.*?)>", "", str(names))
        names = re.sub("</(.*?)>", "", str(names))
        names = re.sub(r'\]', "", names)
        names = re.sub(r'\[', "", names)
        fore_and_surnames.append(names)
    
    return fore_and_surnames
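
To illustrate which structure the function expects, here is a minimal sketch that reads author names with the standard library's `xml.etree.ElementTree` instead of BeautifulSoup. The XML excerpt is a hypothetical, namespace-free simplification; real TEI files declare a namespace, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# hypothetical, namespace-free excerpt of a TEI title statement
SAMPLE = """
<titleStmt>
  <title>Beispielbeitrag</title>
  <author><persName><surname>Muster</surname><forename>Erika</forename></persName></author>
  <author><persName><surname>Beispiel</surname><forename>Max</forename></persName></author>
</titleStmt>
"""


def author_names(title_stmt_xml):
    root = ET.fromstring(title_stmt_xml.strip())
    names = []
    for author in root.iter('author'):
        surname = author.findtext('.//surname', default='').strip()
        forename = author.findtext('.//forename', default='').strip()
        names.append(f'{surname}, {forename}')
    return names


extracted = author_names(SAMPLE)
print(extracted)
```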

### Function: Extracting text from XML-files

In [14]:
def extract_xml_text(soup):
    
    # extract <p> tags from body of xml-document to find the actual text 
    document_body = soup.body
    p_tags = document_body.find_all("p")
    
    # return the text from p-tags
    return p_tags

### Function: Extract List Items from Textfile After Manual OCR-Postprocessing

In [15]:
def post_processing_ocr(textfile):

    pprocessed_pdf = []
    pdf_ocr_corr = textfile.split(r']')
    for item in pdf_ocr_corr:
        item = re.sub(r', \[', '', item)
        item = re.sub(r'\ufeff', '', item)
        item = re.sub(r'\[', '', item)
        item = re.sub(r'\'', '', item)
        item = re.sub(r'\’', '', item)
        item = re.sub(r'\‘', '', item)
        clean = item.split(', ')
        pprocessed_pdf.append(clean)
    # cutting off the last item as it is not a text item
    pprocessed_pdf = pprocessed_pdf[:-1]
    
    return pprocessed_pdf
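
The text file parsed here was originally produced by writing `str()` of a list of lists. As long as the manually corrected file is still a valid Python literal (an assumption that manual editing can easily break), the standard library's `ast.literal_eval` can restore the nested list directly. The sketch below uses hypothetical example data.

```python
import ast

# hypothetical example: two tokenized texts, serialized the same way as in the notebook
texts = [['begutachtung', 'forschungsbeitrag'], ['modell', 'gutachten']]
serialized = '\ufeff' + str(texts)  # some editors prepend a byte order mark

# strip the byte order mark, then parse the literal back into a list of lists
restored = ast.literal_eval(serialized.lstrip('\ufeff'))
print(restored)
```

If the file contains unbalanced quotes or brackets after manual correction, `literal_eval` raises an error, whereas the string-based splitting above is more tolerant.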

### Functions: Making bigrams and trigrams

In [16]:
def make_bigrams(texts, bigram):
    return([bigram[doc] for doc in texts])

def make_trigrams(texts, trigram,bigram):
    return ([trigram[bigram[doc]] for doc in texts])

In [110]:
def create_bigrams_trigrams(texts):
   
    bigram_phrases = gensim.models.Phrases(texts, min_count=8, threshold=100)
    trigram_phrases = gensim.models.Phrases(bigram_phrases[texts], threshold=100)

    bigram = gensim.models.phrases.Phraser(bigram_phrases)
    trigram = gensim.models.phrases.Phraser(trigram_phrases)

    data_bigrams = make_bigrams(texts, bigram)
    data_bigrams_trigrams = make_trigrams(data_bigrams, trigram, bigram)

    return data_bigrams_trigrams
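
The parameters `min_count` and `threshold` decide which word pairs are merged. The sketch below assumes gensim's default scoring function, which its documentation describes as (bigram count − min_count) / (count of word a × count of word b) × vocabulary size; a pair is merged when its score exceeds `threshold`. The function name and all counts are hypothetical.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    # assumed default scoring of gensim's Phrases model
    return (count_ab - min_count) / count_a / count_b * vocab_size


# hypothetical counts for a word pair such as ('digital', 'humanities')
score = phrase_score(count_a=20, count_b=25, count_ab=19, vocab_size=5000, min_count=8)
is_phrase = score > 100  # threshold=100 as in create_bigrams_trigrams
print(score, is_phrase)
```

With these counts, a pair that co-occurs only slightly more often than `min_count` stays below the threshold, which is why a high threshold such as 100 yields few, but reliable, phrases.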

### Function: Creating bag of words
Note: marked for deletion. The bag of words is now created inside the function `tf_idf`.

In [18]:
# def create_bow(data_words): 
    
#     # mapping the documents' words to a dictionary   
#     id2word = corpora.Dictionary(data_words)

#     # creating a bag of words by using index of dictionary
#     bag_of_words_corpus = []
#     for text in data_words:
#         new = id2word.doc2bow(text)
#         bag_of_words_corpus.append(new)

#     # returning id2word-reference as well as bag of word itself, both needed for LDA    
#     return id2word, bag_of_words_corpus

### Function: TF-IDF weighting

In [19]:
def tf_idf(id2word, texts):
    # simple bag of words for each document, containing tuples with (index, number of appearances of the word in the document)
    corpus = [id2word.doc2bow(text) for text in texts]

    # calculates term frequency (TF) weighted by the inverse document frequency (IDF) for every word/index in the bag of words
    tfidf = TfidfModel(corpus, id2word=id2word)

    # low_value as threshold
    low_value = 0.03
    # collects the words that are removed from the bags of words (for inspection only)
    words = []

    # for every single bag of words
    for i in range(0, len(corpus)):
        # consider each bow for each document
        bow = corpus[i]

        # for each tuple (index, tfidf-value) in the tf-idf-weighted bag of words, extract index (tfidf_ids)
        tfidf_ids = [word_id for word_id, value in tfidf[bow]]

        # for each tuple (index, bow-value without tfidf), extract index
        bow_ids = [word_id for word_id, value in bow]

        # if the value in the (index, tfidf-value) tuple is lower than 0.03, put id into list low_value_words
        low_value_words = [word_id for word_id, value in tfidf[bow] if value < low_value]

        # words with a tf-idf score of 0 are missing from the tf-idf-weighted bag of words
        # (computed before 'drops', so that the current document is used instead of the previous one)
        words_missing_in_tfidf = [word_id for word_id in bow_ids if word_id not in tfidf_ids]

        # which words will be deleted from the bow?
        drops = low_value_words + words_missing_in_tfidf
        for item in drops:
            words.append(id2word[item])

        # add words whose indexes are neither in low_value_words nor in words_missing_in_tfidf to the new bag of words
        new_bow = [b for b in bow if b[0] not in low_value_words and b[0] not in words_missing_in_tfidf]

        # new bow is missing certain indexes
        corpus[i] = new_bow

    return corpus
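
The effect of the threshold can be reproduced without gensim. The sketch below assumes the weighting that gensim's `TfidfModel` uses by default according to its documentation (raw term frequency × log2(N / document frequency), followed by normalization to unit length); the function name and the example documents are hypothetical.

```python
import math
from collections import Counter


def filter_low_tfidf(docs, low_value=0.03):
    n_docs = len(docs)
    # document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for doc in docs for word in set(doc))
    filtered = []
    for doc in docs:
        counts = Counter(doc)
        weights = {w: tf * math.log2(n_docs / doc_freq[w]) for w, tf in counts.items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        kept = []
        for word, tf in counts.items():
            score = weights[word] / norm if norm > 0 else 0.0
            # words occurring in every document score 0 and are dropped as well
            if score >= low_value:
                kept.append((word, tf))
        filtered.append(kept)
    return filtered


docs = [['text', 'topic', 'topic'], ['text', 'netzwerk'], ['text', 'korpus']]
result = filter_low_tfidf(docs)
print(result)
```

The word `text` occurs in every example document, so its inverse document frequency is 0 and it disappears from all bags of words, which is the intended effect of the weighting step.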

==========================================================================================================================


## Main Code:

Creating directories in which variables, models and figures can be saved later

In [20]:
if not os.path.isdir("Variables/"):
    os.mkdir('Variables/')
    print('Created new directory: Variables')

rqs = ['RQ1', 'RQ2', 'RQ3', 'RQ4', 'RQ5', 'RQ6']
for question in rqs:
    if not os.path.isdir('Figures/' + question):
        # makedirs also creates the parent directory 'Figures/' if it is missing
        os.makedirs('Figures/' + question)
        print('Created new directory: Figures/' + question)

if not os.path.isdir('Models/'):
    os.mkdir('Models/')
    print('Created new directory: Models')
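
Nested output folders can also be created in a single call. The sketch below uses `os.makedirs` with `exist_ok=True`, which creates missing parent directories and does not fail when the directory already exists; it runs inside a temporary directory, and the shortened list of research questions is only an example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    for question in ['RQ1', 'RQ2']:
        # creates 'Figures' and the subdirectory in one step
        os.makedirs(os.path.join(base, 'Figures', question), exist_ok=True)
    # calling it again for an existing directory does not raise an error
    os.makedirs(os.path.join(base, 'Figures', 'RQ1'), exist_ok=True)
    created = sorted(os.listdir(os.path.join(base, 'Figures')))

print(created)
```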

Reading in zip-files of DHd-conferences where only PDF-files are accessible

In [22]:
filenames_pdf = ['Corpus/DHd_2014.zip', 'Corpus/DHd_2015.zip']
doc_statistics = []

# extracting text from pdf-files
all_pdf_texts = []
doc_names_pdf = []
# number of English texts per conference (nine conferences in total), k is the index of the conference
en_count = [0] * 9
k = 0
for conference_file in filenames_pdf:
    # the with-statement closes the archive after reading
    with zipfile.ZipFile(conference_file, 'r') as archive:
        doc_names_year = []
        doc_statistics.append(len(archive.namelist()))
        for name in archive.namelist():
            if name.endswith('.pdf'):
                doc_names_year.append(name)
                pdf_data = BytesIO(archive.read(name))
                # reading each pdf-file in the zip-archive
                with fitz.open(stream=pdf_data, filetype='pdf') as doc:
                    text = ''
                    for page in doc:
                        text += page.get_text()

                    all_pdf_texts.append(text)
                    lang = detect_language(text)
                    if lang == 'en':
                        en_count[k] += 1

    doc_names_pdf.append(doc_names_year)
    k += 1
    

filenames_pdf = get_conference_names(filenames_pdf)

Reading in the zip-files of the DHd-Conferences where XML-files were published

In [23]:
filenames_xml = ['Corpus/DHd_2016.zip', 'Corpus/DHd_2017.zip', 'Corpus/DHd_2018.zip', 'Corpus/DHd_2019.zip', 'Corpus/DHd_2020.zip',
             'Corpus/DHd_2022.zip', 'Corpus/DHd_2023.zip']

all_xml_files = []
doc_names_xml = []
# read in all zip-folders
for conference_file in filenames_xml:
    doc_names_year = []
    xml_per_year = []
    # the with-statement closes the archive after reading
    with zipfile.ZipFile(conference_file, 'r') as archive:
        doc_statistics.append(len(archive.namelist()))
        # read in all files in the zip-file, keeping only xml-files whose names do not end in 'final.xml'
        for name in archive.namelist():
            if name.endswith('.xml') and not name.endswith('final.xml'):
                xml_per_year.append(archive.read(name))
                doc_names_year.append(name)
    all_xml_files.append(xml_per_year)
    # creating a list of all documents' names
    doc_names_xml.append(doc_names_year)
   

docnames = doc_names_pdf + doc_names_xml
filenames_xml = get_conference_names(filenames_xml)
filenames = filenames_pdf + filenames_xml
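
The selection of archive members can be tested without the actual corpus. The sketch below builds a small zip archive in memory and applies the same suffix filter as above; the member names and contents are hypothetical.

```python
import zipfile
from io import BytesIO

# build a hypothetical archive in memory
buffer = BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('DHd_2016/beitrag_1.xml', '<TEI/>')
    archive.writestr('DHd_2016/beitrag_1_final.xml', '<TEI/>')
    archive.writestr('DHd_2016/readme.txt', 'info')

# keep xml-files only, but skip names ending in 'final.xml'
with zipfile.ZipFile(buffer, 'r') as archive:
    selected = [n for n in archive.namelist()
                if n.endswith('.xml') and not n.endswith('final.xml')]
    contents = [archive.read(n) for n in selected]

print(selected)
```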

XML-Files: 

Because of their extensive markup, the XML-files are used for more than text extraction. The following information is extracted from them in the next steps:
- Text 
- Authors of the documents
- Keywords given in the metadata of the abstracts in order to find the scientific methods used

In [24]:
all_xml_texts = []

# importing the provided list of all selectable ConfTool options, which is used to validate the <keywords n='topics'> entries
conf_tool_methods = open_list('Misc/conf_tool_methods.txt')

# contains a list per year, this list contains a list of keywords extracted per text
all_free_keywords = []
used_keywords_free = []
used_keywords_conf = []
authors = []
authors_full_list = []

for year in all_xml_files:
    keywords_free_year = []
    keywords_conf_year = []
    authors_year = []
    for doc in year:
        
        soup = BeautifulSoup(doc, 'xml')
        
        # Code for extracting the actual text from xml-files
        xml_text = extract_xml_text(soup)
        all_xml_texts.append(xml_text)
        
        lang = detect_language(str(xml_text))
        if lang == 'en':
            en_count[k] += 1
        
        
        # Code for extracting the author names       
        title_stmt = soup.titleStmt
        authors_in_doc = extract_authors(title_stmt)
        authors_year.append(authors_in_doc) 

        
        # Code for extracting the keywords used in xml-files  (per year)
        keywords_free, keywords_conf = extract_keywords(soup, conf_tool_methods)  
        keywords_free_year = keywords_free_year + keywords_free
        keywords_conf_year = keywords_conf_year + keywords_conf
        
        
    # saving all keywords that were given in the <keywords n="keywords"> tags in the XML-files
    all_free_keywords = list(dict.fromkeys(all_free_keywords + keywords_free_year))
    used_keywords_conf.append(keywords_conf_year)    
    used_keywords_free.append(keywords_free_year)     
    
    # saves each text's authors in a list, sorted by year of the text
    authors.append(authors_year)
    
    
    for element in authors_year:
        authors_full_list = authors_full_list + element
    authors_full_list = list(dict.fromkeys(authors_full_list))
    k += 1
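
The loop above removes duplicates with `list(dict.fromkeys(...))`, which keeps the first occurrence of every entry and preserves the original order (a `set` would not guarantee the order). A minimal sketch with hypothetical author names:

```python
# hypothetical authors of two documents from one year
authors_year = [['Muster, Erika', 'Beispiel, Max'], ['Beispiel, Max', 'Probe, Kim']]

# flatten the list of lists, then drop duplicates while keeping the order
flat = [name for doc_authors in authors_year for name in doc_authors]
unique = list(dict.fromkeys(flat))
print(unique)
```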

Merging the extracted PDF and XML texts for further processing of the textual content:

- Cleaning up
- Removing stopwords depending on the detected language (English or German)
- Lemmatizing the texts depending on the detected language (English or German) --> time-consuming step

In [109]:
''' Do not execute this cell if the variables are already stored! '''
whole_texts = all_pdf_texts + all_xml_texts

additional_stops = open_list('Misc/additional_stopwords.txt')

list_all_texts = []
for text in whole_texts:
    # detecting language in order to remove the stopwords and lemmatize according to language
    lang = detect_language(str(text))   
    text_item = clean_text(text)
    text_item = lemmatization(text_item, lang)
    text_item = remove_stopwords(text_item, lang, additional_stops)
    list_all_texts.append(text_item)


''' Do not execute this cell if the variables are already stored! '''

['die', 'begutachtung', 'von', 'forschungsbeiträgen', 'ist', 'ein', 'zentraler', 'pfeiler', 'wissenschaftlicher', 'qualitätssicherung', ...]

' Do not execute this cell if the variables are already stored! '

In [29]:
# how many texts are in each year's corpus taken into account?
number_pdf_docs = [len(sublist) for sublist in doc_names_pdf]
number_xml_docs = [len(sublist) for sublist in doc_names_xml]
number_docs = number_pdf_docs + number_xml_docs

Saving the variable *list_all_texts* as a pickle file and additionally as a txt-file, in order to conduct manual post-processing of the OCR-created text from the pdf-files

In [30]:
save_object('Variables/', 'list_all_texts.pckl', list_all_texts)
# list_all_texts = open_variable('Variables/', 'list_all_texts.pckl')

with open('Misc/list_all_texts.txt', 'w', encoding='utf-8') as f:
    f.write(str(list_all_texts[:231]))

Opening the post-processed file and finally merging all texts again in the variable *corr_list_of_texts*

In [31]:
# opening the post-processed file
with open('Misc/ocr_correction.txt', 'r', encoding='utf-8') as p:
    pdf_ocr_corr = p.read()
corr_list_of_texts = post_processing_ocr(pdf_ocr_corr)
corr_list_of_texts = corr_list_of_texts[:-1]

corr_list_of_texts = corr_list_of_texts + list_all_texts[231:]

Creating bigrams and trigrams, id2word and bag of words, which are necessary for the actual topic modeling algorithm (LDA)

In [32]:
# creating bigrams and trigrams from lemmatized words
data_bigrams_trigrams = create_bigrams_trigrams(corr_list_of_texts)

# id2word as dictionary where every word/bi-/trigram is referenced with id
id2word = corpora.Dictionary(data_bigrams_trigrams)


# corpus as list that contains a list of tuples for each document, tuples contain (word id, no. of appearances of the word)
# some index numbers are missing because the tf-idf weighting removes low-value words
corpus = tf_idf(id2word, data_bigrams_trigrams)

[[(0, 1), (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 7), (7, 4), (8, 1), (9, 1), (10, 1), (11, 1), (13, 3), (14, 1), (15, 1), ...], ...]

Saving the variables for further use

In [78]:
# writing the files to save the variables
save_object('Variables/', 'corpus.pckl', corpus)
save_object('Variables/', 'id2word.pckl', id2word)
save_object('Variables/', 'data_bigrams_trigrams.pckl', data_bigrams_trigrams)
save_object('Variables/', 'corr_list_of_texts.pckl', corr_list_of_texts)

For information/transparency purposes: Saving information on the corpus (e.g. for mentioning in the Thesis text)

In [38]:
# corpus statistics
statistics = pd.DataFrame([doc_statistics, number_docs, en_count], index=["Total No. of Documents", "No. of Documents After Filtering", "Documents in English"], 
                   columns=['2014', '2015', '2016', '2017', '2018', '2019', '2020', '2022', '2023'])
statistics.to_csv('Figures/Statistics_Corpus.csv')

In [41]:
%store number_pdf_docs
%store number_xml_docs
%store number_docs
%store docnames
%store filenames_xml
%store all_free_keywords
%store used_keywords_free
%store used_keywords_conf
%store authors
%store authors_full_list

Stored 'number_pdf_docs' (list)
Stored 'number_xml_docs' (list)
Stored 'number_docs' (list)
Stored 'docnames' (list)
Stored 'filenames_xml' (list)
Stored 'all_free_keywords' (list)
Stored 'used_keywords_free' (list)
Stored 'used_keywords_conf' (list)
Stored 'authors' (list)
Stored 'authors_full_list' (list)
