
# Code for Master's Thesis: Topic Modeling

## Research Questions

1. Which topics can be found in the abstracts from DHd conferences between 2014 and 2023 with Topic Modeling?

2. How have the topics been changing throughout the years - which trends are perceptible?

3. Which topics appear in the same abstracts frequently and therefore have a high topic similarity?

4. With regard to the use of different scientific methods, which developments are perceptible?

5. Which researchers contribute to the conference particularly frequently with abstracts, in which teams do they contribute and how have the teams been changing?

6. Which clusters of researchers can be found with regard to topics and how have the clusters been changing?

### Prerequisites

<span style='color:violet'><font size='2'>Cell 1</span></font> Imports

In [3]:
#general imports
import re
import numpy as np
import logging
import pandas as pd

#Visualisations
import pygal 
from pygal.style import Style
import matplotlib.pyplot as plt

#importing functions from MA_Preprocessing (Both imports necessary!)
import import_ipynb
from MA_Preprocessing import open_list, save_object, open_variable, check_directory

#LDA
from gensim.models import CoherenceModel
from gensim.models.ldamodel import LdaModel
import pyLDAvis
import pyLDAvis.gensim_models

#RQ2: Mann-Kendall Test
import pymannkendall as mk

#RQ3: Topic Similarity
from scipy.cluster.hierarchy import dendrogram, linkage, set_link_color_palette

#RQ5: Network Analysis
from pyvis.network import Network
import networkx as nx

#RQ6: Authors-Topic-Analysis
import itertools
from operator import itemgetter

<span style='color:violet'><font size='2'>Cell 2</span></font> Importing all variables needed from the preprocessing notebook to this one and importing the saved variables from the folder.

In [4]:
%store -r number_pdf_docs
%store -r number_xml_docs
%store -r number_docs
%store -r docnames
%store -r filenames
%store -r filenames_xml
%store -r filenames_pdf
%store -r all_freely_selectable_keywords
%store -r used_keywords_freely_selectable
%store -r used_keywords_predetermined
%store -r authors

<span style='color:violet'><font size='2'>Cell 3</span></font> Opening stored variables and checking if the necessary folders for variables, models and figures exist (necessary if the preprocessing notebook was not executed)

In [5]:
#Variables that are stored in the Thesis folders
corpus = open_variable('Variables/', 'corpus.pckl')
id2word = open_variable('Variables/', 'id2word.pckl')
corrected_list_of_texts = open_variable('Variables/', 'corrected_list_of_texts.pckl')
data_bigrams_trigrams = open_variable('Variables/', 'data_bigrams_trigrams.pckl')

check_directory('Models/')

rqs = ['RQ1', 'RQ2', 'RQ3', 'RQ4', 'RQ5', 'RQ6', ]    
for section in rqs:
    check_directory('Figures/' + section)

<span style='color:violet'><font size='2'>Cell 4</span></font> Implementing a list of cumulative document indexes (one boundary per conference year) as well as a flat list containing all document names.

In [6]:
indexes = [0]
textnames = []
for sublist in docnames:
    indexes.append(len(sublist) + indexes[-1])
    textnames = textnames + sublist
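
The following sketch illustrates how the cumulative index list can be used to slice out the documents of one conference year. The file names and variable names prefixed with `toy_` are hypothetical and only serve as an illustration.

```python
# Toy stand-in for `docnames` (hypothetical file names, two "years").
toy_docnames = [["a.xml", "b.xml", "c.xml"], ["d.pdf", "e.pdf"]]

toy_indexes = [0]
toy_textnames = []
for sublist in toy_docnames:
    toy_indexes.append(len(sublist) + toy_indexes[-1])
    toy_textnames.extend(sublist)

# Documents of year i occupy positions toy_indexes[i]:toy_indexes[i + 1].
second_year = toy_textnames[toy_indexes[1]:toy_indexes[2]]
print(toy_indexes)   # [0, 3, 5]
print(second_year)   # ['d.pdf', 'e.pdf']
```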

<span style='color:violet'><font size='2'>Cell 5</span></font> Implementing several lists for authors and author teams which are needed for later tasks.

In [7]:
# contains all authors' names, with duplicates if an author contributed several times
all_authors = []
# contains all teams that contributed to the conference, not split by year
all_author_teams = []

# 'authors' contains all teams that contributed to the conference, split by year
for year in authors:
    all_author_teams = all_author_teams + year
    for author_team in year:
        all_authors = all_authors + author_team

#contains all authors' names, no duplicates
authors_no_duplicates = list(dict.fromkeys(all_authors))

<span style='color:violet'><font size='2'>Cell 6</span></font> Creating a visualization style in order to be consistent with the visualizations later in the notebook.

In [8]:
custom_style = Style(
legend_font_size = 15,
legend_box_size = 18,
font_family = 'sans-serif',
label_font_size = 14,
major_label_font_size = 14,
background='white',
plot_background='white',
foreground='black',
foreground_strong='#53A0E8',
foreground_subtle='#630C0D',
opacity='0.8',
opacity_hover='0.5',
transition='400ms ease-in',
colors=('#C70039', '#FFC300', '#70C700', 
        '#001EC7', '#7F3ACD', '#3ACDB4', '#EEFC11', 
        '#F2336A', '#E0BD4A', '#66A712',
        '#1232E9', '#A883D1', '#097764', '#717800',
        '#FE0033', '#FE6200', '#124B01'))

==========================================================================================================================

==========================================================================================================================

==========================================================================================================================

## Topic Modeling with Latent Dirichlet Allocation (LDA): Tuning parameters

The ideal parameters for the topic model are determined by comparing the quality measures *perplexity* and *topic coherence* across candidate models.


**Quality measure perplexity**

- Perplexity can be used to measure how well the LDA model generalizes on the text corpus (Blei et al., 2003, p. 1008)
- the lower the perplexity, the better the model
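
One point worth keeping in mind when reading the values collected later in this notebook: gensim's `LdaModel.log_perplexity` returns a per-word likelihood bound rather than the perplexity itself, and the gensim documentation gives the relation perplexity = 2^(-bound). The sketch below converts such bounds; the bound values and the helper name `bound_to_perplexity` are assumptions for illustration, not results from this corpus.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound to perplexity via 2 ** (-bound)."""
    return 2 ** (-per_word_bound)

# Hypothetical bounds for two models; not taken from the thesis runs.
toy_bounds = {"model_a": -8.0, "model_b": -9.0}
toy_perplexities = {name: bound_to_perplexity(b) for name, b in toy_bounds.items()}

# The less negative bound corresponds to the lower (better) perplexity.
best = min(toy_perplexities, key=toy_perplexities.get)
print(toy_perplexities)  # {'model_a': 256.0, 'model_b': 512.0}
print(best)              # model_a
```

Under this relation, a bound closer to zero corresponds to a lower perplexity.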

**Quality measure topic coherence** 

- Topic Coherence can be determined by various measurements - e.g. UMass, C_V, UCI, NPMI - which all use different measurements to calculate the coherence of topics ([Röder et al., 2015, p. 2](https://doi.org/10.1145/2684822.2685324)).
- In this workflow, C_V is used. It yields values between 0 and 1, with 1 being the best achievable coherence.

**Parameter topic number k**

- According to [Kumar (2018)](https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/), "[c]hoosing a ‘k’ that marks the end of a rapid growth of topic coherence usually offers meaningful and interpretable topics. Picking an even higher value can sometimes provide more granular sub-topics". However, "[i]f you see the same keywords being repeated in multiple topics, it’s probably a sign that the ‘k’ is too large" (ibid).

**Parameter update_every**

- "Number of documents to be iterated through for each update. Set to 0 for batch learning, > 1 for online iterative learning" ([Rehurek, 2022b](https://radimrehurek.com/gensim/models/ldamodel.html)).
- Here, the parameter is set to *update_every=1* as this is the default

**Parameter alpha**

- "A-priori belief on document-topic distribution" ([Rehurek, 2022b](https://radimrehurek.com/gensim/models/ldamodel.html))
- "Alpha is the parameter, which has the smoothing effect on the topic-document distribution and ensures that the probability of each topic in each document is not 0 throughout the entire inference procedure" ([Du, 2022, p. 1](https://doi.org/10.5281/zenodo.6327965)). Du's study results indicate that coherence results of models deteriorate with increasing Alpha-parameter, and Du concludes that Alpha of each topic should not be higher than 1 ([Du, 2022, p. 2](https://doi.org/10.5281/zenodo.6327965)).
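
To make the string options for alpha more concrete: the gensim documentation describes `'symmetric'` as a fixed prior of `1.0 / num_topics` and `'asymmetric'` as a normalized prior of `1.0 / (topic_index + sqrt(num_topics))`, while `'auto'` learns an asymmetric prior from the corpus. The sketch below recomputes the two fixed priors in plain Python; the function names are hypothetical and this is not gensim's own code.

```python
import math

def symmetric_prior(num_topics):
    # Documented as a fixed symmetric prior of 1.0 / num_topics.
    return [1.0 / num_topics] * num_topics

def asymmetric_prior(num_topics):
    # Documented as a normalized prior of 1.0 / (topic_index + sqrt(num_topics)).
    raw = [1.0 / (i + math.sqrt(num_topics)) for i in range(num_topics)]
    total = sum(raw)
    return [value / total for value in raw]

sym = symmetric_prior(4)
asym = asymmetric_prior(4)
print(sym)
print([round(a, 3) for a in asym])
```

In both cases every per-topic alpha stays below 1, in line with Du's recommendation.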

**Parameter eta/beta**

- "[D]istributional profile of topics in each document" ([Schöch, 2017, para. 20](http://www.digitalhumanities.org/dhq/vol/11/2/000291/000291.html)),
- "A-priori belief on topic-word distribution" ([Rehurek, 2022b](https://radimrehurek.com/gensim/models/ldamodel.html))

**Parameter iterations**
- "Maximum number of iterations through the corpus when inferring the topic distribution of a corpus" ([Rehurek, 2022b](https://radimrehurek.com/gensim/models/ldamodel.html))
- "Iterations (...) essentially it controls how often we repeat a particular loop over each document. It is important to set the number of 'passes' and 'iterations' high enough" ([Rehurek, 2022a](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html)) to train the best LDA Topic Model


**Parameter passes**
- "Number of passes through the corpus during training" ([Rehurek, 2022b](https://radimrehurek.com/gensim/models/ldamodel.html))
- "Passes controls how often we train the model on the entire corpus. Another word for passes might be 'epochs'" ([Rehurek, 2022a](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html)).

### Topic Modeling: Functions

- Function *modeling_and_quality*:

This function takes as input the dictionary, corpus and texts created in the preprocessing notebook, on which the Topic Models are based. It further takes the coherence measure to be used (*coherence*) and the parameters to be tuned: the number of iterations (*iterations*), the candidate numbers of topics (*topic_optim*), the candidate values for alpha (*alpha_optim*) and beta (*eta_optim*), the number of passes (*passes*) and the frequency of evaluations (*eval_every*). It trains one model per parameter combination and returns the models together with their perplexity values, coherence values, per-topic coherence values and names.

<span style='color:violet'><font size='2'>Cell 7</span></font> modeling_and_quality

In [124]:
def modeling_and_quality(dictionary, corpus, texts, coherence, iterations, topic_optim, alpha_optim, eta_optim, passes, eval_every):

    coherence_values = []
    model_list = []
    perplexity_values = []
    topic_coherence_values=[]
    model_names = []
    
    for topics_num in topic_optim:
        for alpha_value in alpha_optim:
            for eta_value in eta_optim:
                model = LdaModel(corpus=corpus,
                                 id2word=dictionary, 
                                 iterations=iterations, 
                                 num_topics=topics_num,
                                 alpha=alpha_value, 
                                 eta=eta_value, 
                                 passes=passes, 
                                 eval_every=eval_every, 
                                 minimum_probability=1e-8)
                model_list.append(model)
                # note: log_perplexity returns the per-word likelihood bound; perplexity = 2^(-bound)
                perplexity_values.append(model.log_perplexity(corpus))
                coherencemodel = CoherenceModel(model=model, 
                                                texts=texts, 
                                                dictionary=dictionary, 
                                                corpus=corpus, 
                                                coherence=coherence, 
                                                topn=60, 
                                                window_size=150)
                # compute the coherence once and reuse it for the list and the log message
                coherence_value = coherencemodel.get_coherence()
                coherence_values.append(coherence_value)
                topic_coherence_values.append(coherencemodel.get_coherence_per_topic())
                # model name: (k, alpha, beta) as strings
                model_names.append((str(topics_num), str(alpha_value), str(eta_value)))
                
                # For monitoring the progress of this very time-consuming step in the workflow, log every parameter combination that has been checked
                logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
                logging.info('Topic number k: %s', topics_num)
                logging.info('Alpha: %s', alpha_value)
                logging.info('Beta: %s', eta_value) 
                logging.info('Coherence: %s', coherence_value)
        

    return model_list, perplexity_values, coherence_values, topic_coherence_values, model_names

### Finding the best settings for *passes* and *iterations*:

<span style='color:violet'><font size='2'>Cell 8</span></font> Training of the Topic Model:

 In order to train the model, Rehurek ([2022a](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html)) proposes to select a random number of topics *k* and try several options for the parameters *passes* and *iterations*, while keeping all other parameters equal. Default values are *passes=1* and *iterations=50*. 
 
 The tutorial proposes to use logging at the DEBUG level to see how many documents have converged during training. One should choose settings for the two parameters with which most, or ideally all, documents converge ([Rehurek, 2022a](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html)).

Here, convergence of 1202/1203 documents occurred with *k=20*, *iterations=190* and *passes=4*.

In [126]:
# initialize logging in DEBUG-mode to see the convergence of documents
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)

# initializing random seed to be able to reproduce the results
np.random.seed(3)

# running the function with one fixed topic number and the tested values for passes and iterations
k_model_list_cv, k_perplexity_values_cv, k_coherence_values_cv, k_topic_coherence_values_cv, k_model_names_cv = modeling_and_quality(dictionary = id2word, 
                                                                                        corpus = corpus, 
                                                                                        texts = data_bigrams_trigrams, 
                                                                                        coherence = 'c_v', 
                                                                                        iterations = 190, 
                                                                                        topic_optim = [20],
                                                                                        alpha_optim = ['auto'],
                                                                                        eta_optim = ['auto'],
                                                                                        passes = 4,
                                                                                        eval_every = 1)
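
At DEBUG level, gensim logs lines of the form `N/M documents converged within K iterations`. The sketch below shows one way to extract those counts from captured log text. The sample lines are illustrative: the timestamps and the first two counts are made up, and only the last line mirrors the 1202/1203 result reported above.

```python
import re

# Illustrative log lines in the format gensim emits at DEBUG level.
sample_log = """\
2023-01-01 10:00:00,000 : DEBUG : 950/1203 documents converged within 190 iterations
2023-01-01 10:05:00,000 : DEBUG : 1150/1203 documents converged within 190 iterations
2023-01-01 10:10:00,000 : DEBUG : 1202/1203 documents converged within 190 iterations
"""

pattern = re.compile(r"(\d+)/(\d+) documents converged within (\d+) iterations")

# Collect (converged, total) pairs in the order they were logged.
convergence = [
    (int(m.group(1)), int(m.group(2)))
    for m in pattern.finditer(sample_log)
]
final_converged, final_total = convergence[-1]
final_ratio = final_converged / final_total
print(convergence)
print(round(final_ratio, 4))
```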

### Finding the best topic number *k*:

<span style='color:violet'><font size='2'>Cell 9</span></font> After finding good settings for *passes* and *iterations*, the next step is finding the ideal topic number *k*. While trying out different values for *k* between 10 and 35 and keeping the other parameters constant, one should look for the highest coherence value and lowest perplexity value ([Kumar, 2018](https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/)).

In previous experiments, topic numbers of up to 70 were tried as well. However, with that many topics, the individual topics became too small and too hard to interpret for a meaningful Topic Model. It was therefore decided to try 35 topics at most.

In [None]:
topic_optim = list(np.arange(10, 36, 1))

# initializing random seed to be able to reproduce the results
k_seed = 1
np.random.seed(k_seed)

# running the function with fixed passes and iterations while varying only the topic number
k_model_list_cv, k_perplexity_values_cv, k_coherence_values_cv, k_topic_coherence_values_cv, k_model_names_cv = modeling_and_quality(dictionary = id2word, 
                                                                                        corpus = corpus, 
                                                                                        texts = data_bigrams_trigrams, 
                                                                                        coherence = 'c_v', 
                                                                                        iterations = 190, 
                                                                                        topic_optim = topic_optim,
                                                                                        alpha_optim = ['auto'],
                                                                                        eta_optim = ['auto'],
                                                                                        passes = 4,
                                                                                        eval_every = None)

<span style='color:violet'><font size='2'>Cell 10</span></font> Saving the variables for further use

In [None]:
check_directory('Models/find_k_seed_' + str(k_seed) + '/')

save_object('Models/find_k_seed_' + str(k_seed) + '/', 'k_model_list_cv.pckl', k_model_list_cv)
save_object('Models/find_k_seed_' + str(k_seed) + '/', 'k_perplexity_values_cv.pckl', k_perplexity_values_cv)
save_object('Models/find_k_seed_' + str(k_seed) + '/', 'k_coherence_values_cv.pckl', k_coherence_values_cv)
save_object('Models/find_k_seed_' + str(k_seed) + '/', 'k_topic_coherence_values_cv.pckl', k_topic_coherence_values_cv)
save_object('Models/find_k_seed_' + str(k_seed) + '/', 'k_model_names_cv.pckl', k_model_names_cv)

<span style='color:violet'><font size='2'>Cell 11</span></font> Plotting the results of optimizing *k* in a line graph:

In [38]:
# Plotting Coherence Scores
line_chart = pygal.Line(style=custom_style, width=1400, x_title='Number of Topics k', y_title='Coherence Score', show_legend=False)
line_chart.title = 'Coherence Scores for Different Number of Topics k'
line_chart.x_labels = map(str, range(10, 36))
line_chart.add('Coherence c_v', k_coherence_values_cv)
line_chart.render_to_file('Models/find_k_seed_' + str(k_seed) + '/k_Coherence.svg')

# Plotting Perplexity Scores
line_chart = pygal.Line(style=custom_style, width=1400, x_title='Number of Topics k', y_title='Perplexity Score', show_legend=False)
line_chart.title = 'Perplexity Scores for Different Number of Topics k'
line_chart.x_labels = map(str, range(10, 36))
line_chart.add('Perplexity', k_perplexity_values_cv)
line_chart.render_to_file('Models/find_k_seed_'+ str(k_seed)+'/k_Perplexity.svg')
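
Besides reading the plots, the candidates for *k* can also be ranked programmatically. The following sketch pairs each topic number with its coherence score and sorts the pairs. The scores are hypothetical and the `toy_` names are not part of the notebook; the final choice of *k* still benefits from inspecting the topics themselves.

```python
# Hypothetical coherence scores for k = 10..15 (not results from the thesis).
toy_topic_numbers = list(range(10, 16))
toy_coherences = [0.41, 0.43, 0.47, 0.46, 0.48, 0.44]

# Pair each k with its score and sort by coherence, best first.
ranked = sorted(
    zip(toy_topic_numbers, toy_coherences),
    key=lambda pair: pair[1],
    reverse=True,
)
best_k, best_score = ranked[0]
print(ranked[:3])  # [(14, 0.48), (12, 0.47), (13, 0.46)]
```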

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 

### Alpha and Beta Parameter Optimization with Optimal *k*:

<span style='color:violet'><font size='2'>Cell 12</span></font> Conducting another Topic Modeling Process with the optimal number for *k* from the training above. Now, the focus lies on the parameters *alpha* and *eta*, which can take on float values as well as string values. In the following, a mixture over several floats as well as strings is created, which is then given to the function *modeling_and_quality* in order to iterate over those varying values. The best *alpha* and *beta* values are again indicated by a high coherence and low perplexity.

In [None]:
a_ints = list(np.arange(0.01, 1, 0.3))
a_strings = ['symmetric', 'asymmetric', 'auto']
alpha = a_ints + a_strings

e_ints = list(np.arange(0.01, 1, 0.3))
e_strings = ['symmetric', 'auto']
eta = e_ints + e_strings

# initializing random seed to be able to reproduce the results
seed = 3
np.random.seed(seed)
topic_number_k = 32

# running the function to find the optimal combination of parameters
model_list_cv, perplexity_values_cv, coherence_values_cv, topic_coherence_values_cv, model_names_cv = modeling_and_quality(dictionary=id2word, 
                                                                                                                         corpus=corpus, 
                                                                                        texts=data_bigrams_trigrams, 
                                                                                        coherence='c_v', 
                                                                                        iterations = 190,
                                                                                        topic_optim = [topic_number_k],                                                                                         
                                                                                        alpha_optim = alpha,
                                                                                        eta_optim = eta,
                                                                                        passes = 4,
                                                                                        eval_every=None)
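
Because *modeling_and_quality* iterates over alpha in the outer loop and eta in the inner loop, each position in the returned lists corresponds to one (alpha, eta) pair. The sketch below reconstructs that mapping for a single *k*. The `toy_` names and the helper `parameters_for_index` are hypothetical, and the float values are rounded stand-ins for the output of `np.arange(0.01, 1, 0.3)`.

```python
import itertools

# Values as produced by np.arange(0.01, 1, 0.3), rounded for readability.
toy_floats = [round(0.01 + 0.3 * step, 2) for step in range(4)]
toy_alpha = toy_floats + ['symmetric', 'asymmetric', 'auto']
toy_eta = toy_floats + ['symmetric', 'auto']

# itertools.product yields pairs in the same order as the nested loops:
# alpha varies slowest, eta fastest.
grid = list(itertools.product(toy_alpha, toy_eta))

def parameters_for_index(index):
    """Return the (alpha, eta) pair trained at a given model index."""
    return grid[index]

print(len(grid))                # 42
print(parameters_for_index(5))  # (0.01, 'auto')
```

The 7 alpha values and 6 eta values yield 42 models, which matches the 42 x-axis labels used when plotting the results.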

<span style='color:violet'><font size='2'>Cell 13</span></font> Saving the variables created while optimizing the parameters *alpha* and *beta*

In [10]:
check_directory('Models/' + 'k_' + str(topic_number_k) + '_seed_' + str(seed)+ '/')

save_object('Models/' + 'k_' + str(topic_number_k) + '_seed_' + str(seed)+ '/', 'model_list.pckl', model_list_cv)
save_object('Models/' + 'k_' + str(topic_number_k) + '_seed_' + str(seed)+ '/', 'perplexity_values.pckl', perplexity_values_cv)
save_object('Models/' + 'k_' + str(topic_number_k) + '_seed_' + str(seed)+ '/', 'coherence_values.pckl', coherence_values_cv)
save_object('Models/' + 'k_' + str(topic_number_k) + '_seed_' + str(seed)+ '/', 'topic_coherence_values.pckl', topic_coherence_values_cv)
save_object('Models/' + 'k_' + str(topic_number_k) + '_seed_' + str(seed)+ '/', 'model_names.pckl', model_names_cv)

<span style='color:violet'><font size='2'>Cell 14</span></font> Plotting the results of the runs finding the optimal *alpha* and *beta* parameters

In [37]:
topic_number_k = 32
seed = 3
# Plotting Coherence Scores
line_chart = pygal.Line(style=custom_style, width=1600, x_title='Index of Trained Models', y_title='Coherence Score', show_legend=False)
line_chart.title = 'Coherence Scores for Different Models with Optimized Alpha and Beta Values'
line_chart.x_labels = map(str, range(0, 42))
line_chart.add('Coherence c_v', coherence_values_cv)
line_chart.render_to_file('Models/'+ 'k_' + str(topic_number_k) + '_seed_' + str(seed)+ '/' + 'Coherence.svg')

# Plotting Perplexity Scores
line_chart = pygal.Line(style=custom_style, width=1600, x_title='Index of Trained Models', y_title='Perplexity Score', show_legend=False)
line_chart.title = 'Perplexity Scores for Different Models with Optimized Alpha and Beta Values'
line_chart.x_labels = map(str, range(0, 42))
line_chart.add('Perplexity', perplexity_values_cv)
line_chart.render_to_file('Models/'+ 'k_' + str(topic_number_k) + '_seed_' + str(seed)+ '/' + 'Perplexity.svg')

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

### Determining the optimal model for further use:

- Function *get_topic_coherences*:

This function takes as input the list of topic coherence values and the index of the model one is interested in, so that the coherences of the single topics in one certain model can be printed out, e.g. for analysis.

- Function *get_model_info*:

The function takes an index as input as well as the lists of model names, models, coherence values, perplexity values and topic coherence values. In combination with the used index, those lists are used to log the most important information about the models. 

<span style='color:violet'><font size='2'>Cell 15</span></font> get_topic_coherences

In [12]:
def get_topic_coherences(topic_coherence_values_cv, index_best_model):
    
    coherences = []
    i = 1
    for value in topic_coherence_values_cv[index_best_model]:
        coherences.append((i, round(value, 2)))
        i +=1 
        
    return coherences

<span style='color:violet'><font size='2'>Cell 16</span></font> get_model_info

In [13]:
def get_model_info(selected_index, model_names, model_list, coherence_values, perplexity_values, topic_coherence_values):
    
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    
    logging.info('Index of the selected model: %s', selected_index)
    logging.info('Model name (k, alpha, beta parameters): %s', model_names[selected_index])
    logging.info('Number of topics in the selected model: %s', model_list[selected_index].num_topics)
    logging.info('Coherence of the selected Topic Model: %s', coherence_values[selected_index])
    logging.info('Perplexity value: %s', perplexity_values[selected_index])
    logging.info('Topic coherences: %s \n', get_topic_coherences(topic_coherence_values, selected_index)) 

<span style='color:violet'><font size='2'>Cell 17</span></font> Reopening the models with *k = 32* and *seed = 3* if needed

In [9]:
model_list_cv = open_variable('Models/k_32_seed_3/', 'model_list.pckl')
coherence_values_cv = open_variable('Models/k_32_seed_3/', 'coherence_values.pckl')
topic_coherence_values_cv = open_variable('Models/k_32_seed_3/', 'topic_coherence_values.pckl')
model_names_cv = open_variable('Models/k_32_seed_3/', 'model_names.pckl')
perplexity_values_cv = open_variable('Models/k_32_seed_3/', 'perplexity_values.pckl')

<span style='color:violet'><font size='2'>Cell 18</span></font> Choosing the optimal Topic Model:

Inspecting the models with the highest coherences and selecting one of them for further use. The highest coherence in this list of models is approximately 0.5, as shown in the visualizations created above. To structure the search for a meaningful model, the models are first filtered by coherence, using a threshold of 0.475.

In [1]:
possible_models = []
# enumerate yields the correct index even if two models share the same coherence value
for selected_index, value in enumerate(coherence_values_cv):
    if value > 0.475:
        possible_models.append(selected_index)
        get_model_info(selected_index, 
                       model_names_cv, 
                       model_list_cv, 
                       coherence_values_cv, 
                       perplexity_values_cv, 
                       topic_coherence_values_cv)

<span style='color:violet'><font size='2'>Cell 19</span></font> Visualizing and saving the generated models with pyLDAvis, so that the optimal model can be chosen.

In [16]:
# enabling the notebook display once is sufficient
pyLDAvis.enable_notebook()
for model in possible_models:
    vis = pyLDAvis.gensim_models.prepare(model_list_cv[model], corpus, id2word)
    name_visualization = 'Visualization_model_index_' + str(model) + '.html'
    pyLDAvis.save_html(vis, 'Models/' + name_visualization)

<span style='color:violet'><font size='2'>Cell 20</span></font> Optimal model:

After close inspection, the model with index = 5 seems to fit the purposes best. Thus, it is determined as the optimal model:

In [11]:
optimal_model = model_list_cv[5]
final_num_topics = optimal_model.num_topics
# the second positional argument of LdaModel.save is 'ignore', not a file mode, so no 'w' is passed
optimal_model.save('Models/optimal_model_k32.model')

==========================================================================================================================

==========================================================================================================================

==========================================================================================================================

## Research Question 1: Topics of the DHd

### *Which topics can be found in the abstracts from DHd conferences between 2014 and 2023 with Topic Modeling?*

### RQ1: Functions


- Function *find_document_topics*:

For each document in the corpus, this function first determines which topic has the highest probability in that text. In addition, it extracts the whole topic probability distribution of each document. The function returns a dictionary that maps each text id to the (1-based) number of its most prominent topic, and a list containing the topic probabilities of every document.

<span style='color:violet'><font size='2'>Cell 21</span></font> find_document_topics

In [14]:
def find_document_topics(corpus, model):
    
    topics_per_document = []
    salient_topics = {}
    i = 0

    for document in corpus:
        doc_topics = model.get_document_topics(document)
        
        #finding the most salient topic in each text
        most_salient_topics = max(doc_topics, key=itemgetter(1))
        salient_topic_number, salient_topic_probability = most_salient_topics
        salient_topics[i] = salient_topic_number +1
        
        #saving the complete topic distribution of each text in a list
        probabilities = []
        for topic in doc_topics:
            topic_num, probability = topic
            probabilities.append(probability)
        topics_per_document.append(probabilities)
        i += 1    
    
    return salient_topics, topics_per_document
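
The core step of the function can be illustrated on a single toy document. The `(topic_id, probability)` pairs below are hypothetical stand-ins for what `get_document_topics` returns.

```python
from operator import itemgetter

# Hypothetical (topic_id, probability) pairs with 0-based topic ids.
toy_doc_topics = [(0, 0.10), (1, 0.65), (2, 0.25)]

# max with itemgetter(1) compares the pairs by probability.
topic_id, probability = max(toy_doc_topics, key=itemgetter(1))
salient_topic = topic_id + 1  # shift to 1-based numbering as in the function above

# Keep only the probabilities, in the order the topics were returned.
toy_probabilities = [prob for _, prob in toy_doc_topics]
print(salient_topic)       # 2
print(toy_probabilities)   # [0.1, 0.65, 0.25]
```

Note that positions in the probability list only line up with topic ids if every topic is returned for the document. The models in this notebook are trained with `minimum_probability=1e-8`, so this should hold in practice.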

### RQ1: Main

<span style='color:violet'><font size='2'>Cell 22</span></font> If necessary, the optimal model can be loaded from the folder with the following line

In [None]:
optimal_model = LdaModel.load('Models/optimal_model_k32.model')

<span style='color:violet'><font size='2'>Cell 23</span></font> Using the function *find_document_topics* to find out which texts have a certain topic as prevalent topic

In [15]:
main_topics, topics_per_document = find_document_topics(corpus, optimal_model)

for item in main_topics.items():
    key, value = item
    if value == 1:
        logging.info('%s %s', key, textnames[key])

<span style='color:violet'><font size='2'>Cell 24</span></font> Dictionary of all topic numbers and interpreted themes

In [12]:
topics = {1: 'General Research Process and Challenges',
2: 'Research Objects in DH',
3: 'Machine Translation',
4: 'Automatic Text Annotation',
5: 'Text Genres',
6: 'Multimodality and Multimedia',
7: 'Structure and Schedule of Projects',
8: 'Geography',
9: 'Literary Studies',
10: 'Linguistics and Discourse Analysis',
11: 'Interdisciplinarity and Science',
12: 'Data Models and Data Management',
13: 'Mass Digitization Projects',
14: 'Compilation of Corpora and Data Sustainability',
15: 'Data Bases and Research Infrastructures',
16: 'XML',
17: 'Automatic Detection of Persons, Visual Objects, Qualities',
18: 'Authorship Attribution',
19: 'Network Analysis',
20: 'Classification',
21: 'Large Textual Resources',
22: 'Drama and Comparative Drama Analysis',
23: 'Art History',
24: 'Music and Computized Music Annotation',
25: 'OCR',
26: 'Transcription',
27: 'Morphology, Lexicography, Archaeology',
28: 'Research Softwares and Tools',
29: 'Collocations and Metaphors',
30: '3D Reconstruction',
31: 'Topic Modeling',
32: 'Font Recognition, Figures in Plays'}

==========================================================================================================================

==========================================================================================================================

==========================================================================================================================

## Research Question 2: Topic Trends

### *How have the topics been changing throughout the years - which trends are perceptible?*


Mann-Kendall (MK) Test:

- MK Test analyzes whether there is a trend in a topic's appearance over time: increasing, decreasing, no trend (Chen et al., 2020; [Mann, 1945](https://www.jstor.org/stable/1907187))
- results are statistically significant if *p <= 0.05*
- results are highly statistically significant if *p <= 0.01*
- [Further Information on Mann-Kendall Test:](https://www.geeksforgeeks.org/how-to-perform-a-mann-kendall-trend-test-in-python/) (geetansh044, n.d.)
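
To show what the test computes, the sketch below implements a simplified Mann-Kendall test in plain Python. It omits the tie correction that `pymannkendall` applies, so it is an illustration under that assumption rather than a replacement for the library. The function name `mann_kendall` and the toy series are hypothetical.

```python
import math

def mann_kendall(series, alpha=0.05):
    """Simplified Mann-Kendall test without tie correction."""
    n = len(series)
    # S counts increasing minus decreasing pairs over all i < j.
    s = sum(
        (series[j] > series[i]) - (series[j] < series[i])
        for i in range(n - 1)
        for j in range(i + 1, n)
    )
    variance = n * (n - 1) * (2 * n + 5) / 18
    # Continuity-corrected standard score.
    if s > 0:
        z = (s - 1) / math.sqrt(variance)
    elif s < 0:
        z = (s + 1) / math.sqrt(variance)
    else:
        z = 0.0
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    if p <= alpha:
        trend = 'increasing' if z > 0 else 'decreasing'
    else:
        trend = 'no trend'
    return trend, s, z, p

# Hypothetical yearly averages of one topic (not thesis results).
toy_series = [1.0, 1.4, 1.3, 1.9, 2.2, 2.1, 2.8, 3.0, 3.3]
trend, s, z, p = mann_kendall(toy_series)
print(trend, s)  # increasing 32
```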


### RQ2: Functions

- Function *calculate_topic_average*:

This function takes the list of indexes for each year's corpus as well as the list of topics per text as input. The latter contains, for every document in the corpus, a list of probabilities for every topic. On the basis of those probabilities, and separated by the DHd year the documents belong to, the average probability of each topic within each conference year is calculated (in percent). Those averages are then stored in another list, which is returned.

<span style='color:violet'><font size='2'>Cell 25</span></font> calculate_topic_average

In [16]:
def calculate_topic_average(indexes, topics_per_document):
    
    averages_per_topic_per_year = []
    # neighbouring entries in indexes mark the first document of one year and the first document of the next year
    for start, end in zip(indexes, indexes[1:]):
        all_probabilities = topics_per_document[start:end]
        # dividing the sum of all first, second, third... values by the number of values to get the average of all n topics for one year (in %)
        averages_per_topic = (np.sum(all_probabilities, axis=0)/len(all_probabilities))*100
        averages_per_topic_per_year.append(averages_per_topic)
        
    return averages_per_topic_per_year

### RQ2: Main

<span style='color:violet'><font size='2'>Cell 26</span></font>

In [18]:
# calculating the average probability of every topic for each conference year
topic_averages = calculate_topic_average(indexes, topics_per_document)

# creating a dataframe which shows the average probability of a topic in each year
df_author_analysis = pd.DataFrame(topic_averages, index = filenames).T
df_author_analysis['Mean'] = df_author_analysis.mean(axis=1)

# conducting the Mann-Kendall Test on the values of each of the topics, excluding the 'mean' column
mk_test = []
for i in range(0, final_num_topics):
    mk_test.append(mk.original_test(df_author_analysis.values[i][:-1]))
mk_results = pd.DataFrame(mk_test)

# joining the averages-DF and the Mann-Kendall Test-DF in order to have complete csv-file
mk_df = df_author_analysis.join(mk_results)
# relabel the rows (index) so that topics are 1-22 and not 0-21 (similar in all other uses)
mk_df.index = list(topics.values())
mk_df.round(2).to_csv('Figures/RQ2/RQ2__Mann-Kendall.csv')

<span style='color:violet'><font size='2'>Cell 27</span></font> Creating several line plots

In [19]:
# dividing up the number of topics in order to plot understandable graphs 
parts = [0, round(final_num_topics*0.25), round(final_num_topics*0.5), round(final_num_topics*0.75), final_num_topics]

for k in range(len(parts)-1):
    line_chart = pygal.Line(style=custom_style, value_font_size=40, x_title = 'Years Observed', 
                            y_title = 'Average Probability in %', truncate_legend = 25, 
                            legend_at_bottom = True, width = 1000)
    line_chart.title = 'Average Probabilities of Topics over the Years'
    line_chart.x_labels = filenames
    line_chart.font_size = 40
    for i in range(parts[k], parts[k+1]):
        line_chart.add((list(topics.values())[i]), df_author_analysis.round(2).values[i][:-1])
    
    name = 'Figures/RQ2/RQ2__TopicProbabilitiesPerYear_' + str(parts[k]+1) + '-' + str((parts[k+1])) + '.svg'
    line_chart.render_to_file(name) 

==========================================================================================================================

==========================================================================================================================

==========================================================================================================================

## Research Question 3: Topic Similarity

### *Which topics appear together frequently in one abstract and therefore have a high topic similarity?*

### RQ3: Functions

- Function *create_dendrogram*:

The purpose of this function is to create a dendrogram for hierarchical clustering. As input, it is given the topic vectors (for a single year or for all years combined), a label for the x-axis, a title for the whole visualization, the color threshold, labels for the leaves and the resulting file name. For the clustering process itself, the library scipy is used with its functions *cluster.hierarchy.linkage* and *cluster.hierarchy.dendrogram*. *linkage* does the actual clustering: it calculates the Euclidean distances between the vectors and then merges the clusters with the Ward method. *dendrogram* is then used for visualizing the result.
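
To illustrate what the Euclidean distance between topic vectors expresses, the following sketch computes all pairwise distances for three invented topics in plain Python. The data and the names `toy_topic_vectors` and `pairwise_euclidean` are hypothetical; the notebook itself leaves this step to scipy.

```python
import math
from itertools import combinations

# toy topic vectors: one vector per topic, one entry per document (illustrative values)
toy_topic_vectors = {
    'Topic A': [0.8, 0.1, 0.7],
    'Topic B': [0.7, 0.2, 0.6],
    'Topic C': [0.1, 0.9, 0.1],
}

def pairwise_euclidean(vectors):
    distances = {}
    # every unordered pair of topics is compared exactly once
    for (name1, vec1), (name2, vec2) in combinations(vectors.items(), 2):
        distances[(name1, name2)] = math.dist(vec1, vec2)
    return distances

toy_distances = pairwise_euclidean(toy_topic_vectors)
# the pair with the smallest distance is the first one to be merged
closest_pair = min(toy_distances, key=toy_distances.get)
```

Topics that receive similar probabilities in the same documents end up close to each other, which is why a small distance is read as high topic similarity.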

<span style='color:violet'><font size='2'>Cell 28</span></font> create_dendrogram

In [15]:
def create_dendrogram(vector, xlabel, plot_title, color_threshold, leaf_labels, filename):
    
    plt.ioff()
    plt.figure(figsize=(16, 16))
    plt.xlabel(xlabel, fontsize = 26)  
    plt.title(plot_title, fontsize = 26) 
    set_link_color_palette(['#C70039', '#FFC300', '#70C700', '#001EC7', '#7F3ACD', '#3ACDB4', '#EEFC11'])
    dend = dendrogram(linkage(vector, 
                              metric='euclidean', 
                              method='ward'), 
                      color_threshold = color_threshold, 
                      labels = leaf_labels, 
                      leaf_font_size = 26, 
                      orientation = 'right')
    plt.tick_params(axis='x', which='major', labelsize=18)
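    plt.savefig(filename, format='svg', bbox_inches = 'tight')
    # closing the figure so that figures do not accumulate in memory when the function is called in a loop
    plt.close()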

### RQ3: Main

<span style='color:violet'><font size='2'>Cell 29</span></font> Creating a dendrogram with the topic distances for every year observed



In [18]:
for i in range(0, len(indexes)-1):
    # separating the data according to the conference year the texts belong to
    df_topics_per_document = topics_per_document[indexes[i]:indexes[i+1]]
    
    # creating and transposing DataFrame to easily extract each first, second, third value in the lists
    df_topics_per_year = pd.DataFrame(data=df_topics_per_document).T
    
    # topic_vectors_years contains one vector for each topic
    # each vector holds the probability of the respective topic for each text of the year, therefore each vector is as long as the indexes indicate
    # i.e. topic_vectors_years[0] contains topic 0's probabilities from all texts of the analyzed year, topic_vectors_years[1] contains topic 1's probabilities from that year
    topic_vectors_years = []
    for j in range(0, final_num_topics):
        topic_vectors_years.append(df_topics_per_year.iloc[j].values)
    
    create_dendrogram(topic_vectors_years, 
                      '', 
                      filenames[i]+': Hierarchical Clustering of All Topics Based on Euclidean Distance', 
                      2,
                      list(topics.values()),
                      'Figures/RQ3/RQ3__Clustering'+ filenames[i]+'.svg')

<span style='color:violet'><font size='2'>Cell 30 </span></font> Creating a dendrogram with the topic distances for all years combined

In [19]:
# creating DataFrame to easily extract each first, second, third value in the lists
df_topics_all_years = pd.DataFrame(data = topics_per_document).T

topic_vectors_total = []
# basically same operations as above, but without splitting by years
for i in range(0, final_num_topics):
    topic_vectors_total.append(df_topics_all_years.iloc[i].values)

create_dendrogram(topic_vectors_total, 
                  '', 
                  '2014-2023: Hierarchical Clustering of Topics Based on Euclidean Distance', 
                  2,
                  list(topics.values()),
                  'Figures/RQ3/RQ3__Clustering_AllYears.svg')

==========================================================================================================================

==========================================================================================================================

==========================================================================================================================

## Research Question 4: Keywords and Methods
### *With regard to the use of different scientific methods and keywords, which developments are perceptible?*

There are two types of keywords used in the xml files of the DHd conferences: \<keywords scheme="ConfTool" n="keywords"> and \<keywords scheme="ConfTool" n="topics">. The former may be used by authors to freely annotate their xml file with words they want to appear in the metadata, while the latter only allows a selection from a predetermined list of keywords.
From 2016 on, authors contributing to the conference could select keywords to be used in the xml file of their contribution. The list of keywords is 74 items long, out of which six can currently be selected per xml file. In 2016 and 2017, this restriction did not exist, so that authors could select more than six keywords for the metadata. The full list of usable keywords is called 'conf_tool_methods' and will be used in the following.


### RQ4: Functions

- Function *count_keywords*:

This function is the first to be called for RQ4 and takes two lists as input arguments. The first one contains all keywords that can occur, e.g. all freely assigned keywords which were used in the xml documents from 2016 until 2023. This list is used to create a dictionary for counting the keywords: first, the values of all keys, which are the keywords' names, are set to 0. Then the second list, which holds the keywords actually used in each year, is iterated over and the count of every keyword found is increased by 1. This happens for each DHd year separately, so that the list returned by *count_keywords* provides, for each year, an overview of the keywords sorted by how often they were used.

- Function *calculate_relative_count*:

The function takes as input the complete list of keywords, the sorted list of counted keywords from *count_keywords* as well as a list stating how many xml files are in each conference year's corpus. The function's purpose is to calculate how often each keyword is used in each year in relation to the corpus size. Taking the corpus size into account is important because the number of documents varies strongly between the corpora, so that the numbers would be distorted if corpus size were ignored.

- Function *retrieve_information_on_keywords*:

This function calls the other two functions sequentially and returns both the absolute and the relative counts of keyword use.

- Function *chart_and_csv*:

*chart_and_csv* takes as input the data calculated beforehand as well as titles and file names for the resulting output. It turns the given data into a line chart showing only the ten keywords with the highest mean relative count, and into a csv file that contains all data.
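
The counting and normalizing steps described for *count_keywords* and *calculate_relative_count* can be illustrated with a compact sketch. It uses `collections.Counter` instead of the manually initialized dictionary, and all data and names (`toy_keywords_per_year`, `toy_docs_per_year`, `relative_keyword_counts`) are invented for this example.

```python
from collections import Counter

# toy data: the keywords used in two conference years and the corpus size of each year
toy_keywords_per_year = [
    ['annotation', 'tei', 'annotation'],
    ['tei', 'ocr'],
]
toy_docs_per_year = [4, 2]

def relative_keyword_counts(keywords_per_year, docs_per_year):
    # every keyword that occurs in any year gets one value per year
    all_keywords = sorted({kw for year in keywords_per_year for kw in year})
    relative = {kw: [] for kw in all_keywords}
    for year_keywords, n_docs in zip(keywords_per_year, docs_per_year):
        counts = Counter(year_keywords)
        for kw in all_keywords:
            # a Counter returns 0 for keywords that were not used in this year
            relative[kw].append(round(counts[kw] / n_docs * 100, 2))
    return relative

toy_relative = relative_keyword_counts(toy_keywords_per_year, toy_docs_per_year)
# toy_relative['tei'] == [25.0, 50.0]
```

In the toy data, 'tei' is used once in both years, but the relative count doubles in the second year because that corpus is only half as large. This is the distortion the normalization is meant to make visible.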

<span style='color:violet'><font size='2'>Cell 31</span></font> count_keywords

In [44]:
def count_keywords(complete_list, year_list):
    # for each year, a large list of all possible, freely annotated keywords is set up and the count for each of the keywords is set to 0
    # if the method appears in the specific year's list, +1 will be added to the count

    counted_keywords_per_year = []
    for year in year_list:
        # keyword_counts is used as a name to avoid shadowing the built-in dict
        keyword_counts = {}
        for method in complete_list:
            keyword_counts[method] = 0
        for method in year:
            keyword_counts[method] += 1
        # sorting the counts in descending order, so that the item with the highest count comes first
        sorted_counts = sorted(keyword_counts.items(), key=lambda x: x[1], reverse=True)
        counted_keywords_per_year.append(sorted_counts)
    
    return counted_keywords_per_year

<span style='color:violet'><font size='2'>Cell 32</span></font> calculate_relative_count

In [45]:
def calculate_relative_count(complete_list, absolute_count_keywords, number_xml_docs):
    
    # for the used keywords, it is looked at how often they appear in each year's corpus
    # then, that count will be divided by the total number of texts in that corpus, which results in a proportional frequency of the keyword to the corpus size 
    
    relative_dict = {}
    for item in complete_list:
        item_count = []
        # i is the position of the year and selects the matching corpus size
        for i, year in enumerate(absolute_count_keywords):
            for name, count in year:
                if name == item:
                    item_count.append(round((count/number_xml_docs[i])*100, 2))
        relative_dict[item] = item_count

    return relative_dict

<span style='color:violet'><font size='2'>Cell 33</span></font> retrieve_information_on_keywords

In [46]:
def retrieve_information_on_keywords(total_list, used_keywords_list, number_xml_docs):
    
    absolute_count_keywords = count_keywords(total_list, used_keywords_list)
    relative_count_keywords = calculate_relative_count(total_list, absolute_count_keywords, number_xml_docs)
    
    return absolute_count_keywords, relative_count_keywords

<span style='color:violet'><font size='2'>Cell 34</span></font> chart_and_csv

In [62]:
def chart_and_csv(chart_title, filenames_xml, chart_filename, all_relative_counts, csv_filename):
    
    # creating a plot for visualisation
    line_chart = pygal.Line(truncate_legend=-1, 
                            x_title='Years Observed', 
                            y_title= 'Percentage of Term Usages' + '\n' + '(Relative to Corpus Size)', 
                            width = 1000, 
                            style = custom_style,
                            legend_at_bottom = True)

    
    # for complete information, a csv-file is provided with all used keywords and their relative counts
    df_keywords = pd.DataFrame(all_relative_counts.values(), index=all_relative_counts.keys(),columns=filenames_xml)
    
    df_keywords['Mean'] = df_keywords.mean(axis=1).round(2)
    df_keywords = df_keywords.sort_values(by=['Mean'], ascending=False)
    
    if '' in df_keywords.index:
        df_keywords = df_keywords.drop('')

    df_keywords.to_csv(csv_filename)
    
    line_chart.title = chart_title
    line_chart.x_labels = filenames_xml
    # plotting only the (up to) ten keywords with the highest mean
    for i in range(0, min(10, len(df_keywords))):
        line_chart.add(df_keywords.index[i], df_keywords.iloc[i, :-1])
    line_chart.render_to_file(chart_filename)   

### RQ4: Main

<span style='color:violet'><font size='2'>Cell 35</span></font> Collecting all necessary information on the keywords

In [63]:
# importing the provided list, which contains all selectable options for <keywords n='topics'>
all_predetermined_keywords = open_list('Misc/predetermined_keywords.txt')

absolute_count_free_keywords, relative_count_free_keywords = retrieve_information_on_keywords(all_freely_selectable_keywords, 
                                                                                              used_keywords_freely_selectable, 
                                                                                              number_xml_docs)

absolute_count_predetermined_keywords, relative_count_predetermined_keywords = retrieve_information_on_keywords(all_predetermined_keywords, 
                                                                                                                used_keywords_predetermined, 
                                                                                                                number_xml_docs)

<span style='color:violet'><font size='2'>Cell 36 </span></font> Creating a line chart and csv file

In [64]:
chart_and_csv('Used Terms in <keywords n=\'keywords\'>', filenames_xml, 
             'Figures/RQ4/RQ4__Free_ResearchMethodsPerYear.svg', relative_count_free_keywords, 'Figures/RQ4/RQ4__Free_Keywords_All.csv')

chart_and_csv('Used Terms in <keywords n=\'topics\'>', filenames_xml, 
             'Figures/RQ4/RQ4__Predetermined_ResearchMethodsPerYear.svg', relative_count_predetermined_keywords,'Figures/RQ4/RQ4__Predetermined_Keywords_All.csv')

==========================================================================================================================

==========================================================================================================================

==========================================================================================================================

## Research Question 5: Authors and Networks


### *Which researchers contribute to the conference particularly frequently with abstracts, in which teams do they contribute and how have the teams been changing?*

### RQ5: Functions - Part 1: Data Processing

- Function *count_appearances_descending*:

*count_appearances_descending* takes a list as input and counts how often each item appears within the list. The function returns a list of (item, count) tuples, sorted so that the items with the highest counts come first.

- Function *rank_authors*:

*rank_authors* takes as input the nested list of authors per text and per year, and turns it into one large list per year containing every author who contributed to the conference in that year. This list contains authors twice or three times if they contributed twice or three times within one conference, which is how a count of every author and their contributions to the conference is eventually set up. The function returns a list of lists, in which the authors' names and their respective counts of contributions for one year are stored. This function is necessary to set up the network nodes in a last step.

- Function *find_coauthors*:

The function *find_coauthors* takes the same list, *authors*, as input. Unlike *rank_authors*, however, this function uses the input list to determine the coauthors of one year. This means that it extracts, in pairs, the authors who worked together on one single contribution and counts how often these pairs appear within one conference year. This function prepares the data for adding edges and their weights to the network.

- Function *find_new_authors*:

*find_new_authors* takes the output list from *rank_authors* as input and uses it to determine which authors contribute to a conference without having contributed to any earlier DHd conference in the corpus. This is done by comparing the author list of one year with those of all previous years. Those new authors are stored in a list, which is used when setting up the network in order to mark them.

- Function *convert_tuples_to_dict*:

Converts a list of tuples into a dictionary where the first tuple-item is the key and the second tuple-item is the value.

- Function *check_for_significant_teams*:

Takes the list of the 25 most significant, i.e. most contributing authors as input and the list of coauthors per year. Iterating over these lists, the function then checks whether two of the significant authors collaborated on a conference document and saves this team of authors. 

- Function *count_team_appearances*:

Gets the list of significant teams returned by *check_for_significant_teams* as input and counts how often each team collaborated.
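
The pair extraction behind *find_coauthors* can be sketched with `itertools.combinations` and `collections.Counter`. This is an illustrative variant rather than the notebook's implementation: the names `toy_authors` and `count_coauthor_pairs` are hypothetical, and, unlike the notebook, the sketch sorts each team so that both orders of the same two names are counted as one pair.

```python
from collections import Counter
from itertools import combinations

# toy data: one conference year with three contributions (invented names)
toy_authors = [
    [['Ada', 'Ben'], ['Ben', 'Ada', 'Cleo'], ['Dana']],
]

def count_coauthor_pairs(authors_per_year):
    pairs_per_year = []
    for year in authors_per_year:
        pair_counts = Counter()
        for team in year:
            # sorting makes ('Ada', 'Ben') and ('Ben', 'Ada') the same pair
            # single-author contributions yield no pairs
            for pair in combinations(sorted(team), 2):
                pair_counts[pair] += 1
        # most_common() returns (pair, count) tuples, highest count first
        pairs_per_year.append(pair_counts.most_common())
    return pairs_per_year

toy_pairs = count_coauthor_pairs(toy_authors)
# toy_pairs[0][0] == (('Ada', 'Ben'), 2)
```

Each (pair, count) tuple corresponds to one edge in the later network, with the count serving as the edge weight.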

<span style='color:violet'><font size='2'>Cell 37</span></font> count_appearances_descending

In [65]:
def count_appearances_descending(input_list):
    
    count_dict = {}
    # for each item in the input list, check if it is already in the dictionary
    # if not, add and set count to 1, if yes add +1 to count
    for item in input_list:
        if item not in count_dict.keys():
            count_dict[item] = 1
        else:
            count_dict[item] += 1
    # sort dictionary according to highest count in the values
    sorted_dict = sorted(count_dict.items(), key=lambda x: x[1], reverse=True)

    # return the sorted dictionary (becomes list through sorting though)
    return sorted_dict

<span style='color:violet'><font size='2'>Cell 38</span></font> rank_authors

In [66]:
def rank_authors(authors):

    all_counted_authors = []
    for year in authors:
        authors_list_per_year = [] 
        for team in year:
            # create one long list of all authors of one year (not separated by author team)
            authors_list_per_year = authors_list_per_year + team
        all_counted_authors.append(count_appearances_descending(authors_list_per_year))
        
    return all_counted_authors

<span style='color:violet'><font size='2'>Cell 39</span></font> find_coauthors

In [67]:
def find_coauthors(authors):
    
    coauthors_per_year = []
    for conference_year in authors:
        coauthors_list = []
        for document_authors in conference_year:
            # check if text has at least two authors
            if len(document_authors) >= 2:
                for i in range(len(document_authors)-1):
                    for j in range(i+1, len(document_authors), 1):
                        coauthors_list.append((document_authors[i], document_authors[j]))              
        dict_coauthors_count = count_appearances_descending(coauthors_list)      
        coauthors_per_year.append(dict_coauthors_count) 
        
    return coauthors_per_year

<span style='color:violet'><font size='2'>Cell 40</span></font> find_new_authors

In [68]:
def find_new_authors(all_counted_authors):
    
    # returns, for each year, the list of people who did not contribute to any previous DHd conference in the corpus
    # the first year has no predecessor, so its list of new authors stays empty
    new_authors = [[]]
    all_prev_authors = set()
    for i in range(1, len(all_counted_authors)):
        current_year = all_counted_authors[i]
        # adding the previous year's authors to the set of all authors seen so far
        # (taken from the function's own argument instead of the global variable authors)
        for name, count in all_counted_authors[i-1]:
            all_prev_authors.add(name)
        
        completely_new_authors = []
        for name, count in current_year:
            if name not in all_prev_authors:
                completely_new_authors.append(name)
        new_authors.append(completely_new_authors)
    
    return new_authors

<span style='color:violet'><font size='2'>Cell 41</span></font> convert_tuples_to_dict

In [69]:
def convert_tuples_to_dict(tuple_list):
    
    new_dict = {}
    # unpacking directly avoids shadowing the built-in tuple
    for item1, item2 in tuple_list:
        new_dict[item1] = item2

    return new_dict   

<span style='color:violet'><font size='2'>Cell 42</span></font> check_for_significant_teams

In [70]:
def check_for_significant_teams(list_significant_DHumanists, coauthors):
    
    teams = []
    for i in range(len(list_significant_DHumanists)-1):
        for j in range(i+1, len(list_significant_DHumanists), 1):
            name_i = list_significant_DHumanists[i][0]
            name_j = list_significant_DHumanists[j][0]
            for year in coauthors:
                for team in year:
                    #check if team of authors consists of only the most significant authors (i.e. the 25 most contributing ones)
                    # team[0] holds the two names, which may appear in either order
                    # names are compared for equality, so that a name contained in a longer name does not count as a match
                    if (name_i == team[0][0] and name_j == team[0][1]) or (name_i == team[0][1] and name_j == team[0][0]):
                        teams.append(team)
                
    return teams

<span style='color:violet'><font size='2'>Cell 43</span></font> count_team_appearances

In [71]:
def count_team_appearances(teams):

    teams_dict = {}
    for element in teams:
        author_team, team_count = element        
        if author_team not in teams_dict:
            teams_dict[author_team] = team_count
        else: 
            teams_dict[author_team] += team_count

    return teams_dict

### RQ5: Functions - Part 2: Network 

- Function *determine_node_shape*:

Depending on how often the author contributed to the DHd conference in the respective year, i.e. whether they are a very frequent author or not, this function returns a string which then determines the author's node shape in the final network. 

- Function *add_nodes*:

Creates a node for every author that contributed to the DHd in that specific year. Depending on how often that author contributed to the conference in the form of writing an abstract, the shape of the node is determined. 

- Function *change_color_new_authors*:

For every author appearing in the new_authors list, the node color is changed from pink to orange, so that they can be clearly distinguished in the network.

- Function *determine_edge_color*:

The function *determine_edge_color* is applied while setting up the network. Based on the variable *coauthor_count*, which is determined by the function *find_coauthors*, this function determines the color of the network's edges depending on how often the respective authors have worked together in one conference year.

- Function *add_edges*:

For every team of coauthors, the edges between the nodes are added by referring to the authors' nodes by their names. The edge colors are determined by how often the respective authors collaborated in that DHd year, which happens in the function *determine_edge_color*.

- Function *generate_html_file*:

Generates an html file containing all the nodes and edges created previously. Additionally, a legend is added to the html, so that the colors and shapes within the network can be understood. 
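
The styling rules described above can be summarized in one sketch that uses plain dictionaries in place of the pyvis network. The function `sketch_network` and the toy data are hypothetical; the color codes and thresholds are taken from the functions in this notebook.

```python
def sketch_network(counted_authors, coauthor_pairs, new_authors):
    # counted_authors is sorted by count, so the first entry holds the top count
    top_count = counted_authors[0][1]
    nodes = {}
    for name, count in counted_authors:
        nodes[name] = {
            'size': count * 10,
            # authors with the highest count of the year become stars
            'shape': 'star' if count == top_count else 'dot',
            # new authors are orange, all others keep the default color
            'color': 'orange' if name in new_authors else '#C70039',
        }
    edges = []
    for (name1, name2), pair_count in coauthor_pairs:
        if pair_count >= 3:
            color = '#360BFA'   # dark blue
        elif pair_count == 2:
            color = '#6CADFB'   # light blue
        else:
            color = '#1FA3B5'   # turquoise
        edges.append({'from': name1, 'to': name2, 'width': pair_count, 'color': color})
    return nodes, edges

# toy data for one conference year (invented names)
toy_nodes, toy_edges = sketch_network(
    [('Ada', 2), ('Ben', 1)],
    [(('Ada', 'Ben'), 2)],
    ['Ben'],
)
```

Keeping the rules in small separate functions, as the notebook does, makes it possible to change a color or threshold without touching the code that builds the network.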

<span style='color:violet'><font size='2'>Cell 44</span></font> determine_node_shape

In [72]:
def determine_node_shape(total_count, topcount):
    
    if total_count == topcount:
        return 'star'
    else:
        return 'dot'

<span style='color:violet'><font size='2'>Cell 45</span></font> add_nodes

In [73]:
def add_nodes(all_counted_authors, i):
    
    top_name, top_count = all_counted_authors[i][0]
    for author in all_counted_authors[i]:
        name, total_count = author
        title = (name, total_count)
        g.add_node(name, 
                title=title, 
                label=name, 
                size=(total_count*10), 
                # pink
                color='#C70039',
                borderWidth=1, 
                borderWidthSelected=3,
                # determine the nodes' shape according to whether author is top author with most contributions in that year (==star) or not (==dot)
                shape = determine_node_shape(total_count, top_count))
        
    return

<span style='color:violet'><font size='2'>Cell 46</span></font> change_color_new_authors

In [74]:
def change_color_new_authors(new_authors, i):
       
    if len(new_authors[i]) > 0:
        for author in new_authors[i]:
            g.get_node(author)['color'] = 'orange'
        
    return len(new_authors[i])    

<span style='color:violet'><font size='2'>Cell 47</span></font> determine_edge_color

In [75]:
def determine_edge_color(coauthor_count):
      
      # function to determine the color of the network's edges, depending on how often authors worked together in that year
      # dark blue
      if coauthor_count >= 3:
            return '#360BFA'
      # light blue
      elif coauthor_count == 2:
            return '#6CADFB'
      # turquoise
      else: 
        return '#1FA3B5'

<span style='color:violet'><font size='2'>Cell 48</span></font> add_edges

In [76]:
def add_edges(coauthors_per_year, i):
    
    for item in coauthors_per_year[i]:
        name1 = item[0][0]
        name2 = item[0][1]
        coauthor_count = item[1]
    
        # edge color depending on how often authors worked together
        g.add_edge(name1, name2, 
                    width=(coauthor_count),
                    title=(coauthor_count),
                    color = determine_edge_color(coauthor_count))
    
    return

<span style='color:violet'><font size='2'>Cell 49</span></font> generate_html_file

In [80]:
def generate_html_file(filenames_xml, i):
    
    # opening the provided HTML code which has to be added to the network html file
    with open('Misc/LegendHTML.txt', 'r', encoding='utf-8') as legend:
        html_addition = legend.read()
    
    # writing an html-file
    html = g.generate_html()
    name = 'Figures/RQ5/RQ5__Authors_Networks_' + str(filenames_xml[i][-4:]) + '.html'
    
    with open(str(name), mode='w', encoding='utf-8') as fp:        
        # finding the proper place in the html document and inserting the additional markup for legend
        find = re.search(r'<div id="mynetwork" class="card-body"></div>', html)
        end = find.end()+1

        html = html[:end] + html_addition + html[end:]   
        fp.write(html)

### RQ5: Main

<span style='color:violet'><font size='2'>Cell 50</span></font>

In [81]:
#needed for network
all_counted_authors = rank_authors(authors)
coauthors_per_year = find_coauthors(authors)
new_authors = find_new_authors(all_counted_authors)

#needed for cooccurrence matrix
most_significant_DHumanists_list = count_appearances_descending(all_authors)[:25]
most_significant_DHumanists_dict = convert_tuples_to_dict(most_significant_DHumanists_list)
significant_DH_teams = check_for_significant_teams(most_significant_DHumanists_list, coauthors_per_year)
significant_teams_count = count_team_appearances(significant_DH_teams)

<span style='color:violet'><font size='2'>Cell 51</span></font> Creating the networks:

The nodes have to be added as well as the edges and their weights. In the end, the general algorithm of the network as well as some other parameters are defined. 

In [82]:
# collecting the number of completely new authors for each year
statistics_completely_new_authors = []

for i, year in enumerate(filenames_xml):
    
    # implementing the network itself for each year
    g = Network(height='600px', width='100%', cdn_resources='remote', select_menu=True, font_color='black', filter_menu=True, neighborhood_highlight=True)
    nxg = nx.complete_graph(0)
    g.from_nx(nxg)
    
    # adding nodes and edges, determining colors and shapes, writing html file
    add_nodes(all_counted_authors, i)
    number_new_authors = change_color_new_authors(new_authors, i)
    statistics_completely_new_authors.append(number_new_authors)
    add_edges(coauthors_per_year, i)
    generate_html_file(filenames_xml, i)

<span style='color:violet'><font size='2'>Cell 52</span></font> Creating DataFrames:

DataFrames for the collaborations of significant authors and for a statistic of (new) authors

In [86]:
#creating a dataFrame that contains the names of significant authors and the times they collaborated
cooccurrence_matrix = pd.DataFrame(index = most_significant_DHumanists_dict.keys(), columns = most_significant_DHumanists_dict.keys())
cooccurrence_matrix['Total contributions to conference'] = most_significant_DHumanists_dict.values()

#determining the right cell to enter the count of the team
for duo in significant_teams_count:
    cooccurrence_matrix.loc[duo[0], duo[1]] = significant_teams_count[duo]
cooccurrence_matrix.fillna(0).to_csv('Figures/RQ5/RQ5__Cooccurrence_Matrix_Significant_Authors.csv')

<span style='color:violet'><font size='2'>Cell 53</span></font> DataFrame and csv file for analyzing new authors

In [87]:
total_number_authors = []
for year in all_counted_authors:
     total_number_authors.append(len(year))
     
d = {'Total Number of Contributing Authors': total_number_authors,
     'New Authors': statistics_completely_new_authors}

df_author_analysis = pd.DataFrame(data = d, index = filenames_xml)
df_author_analysis['% of New Authors'] = round(df_author_analysis['New Authors'].div(df_author_analysis['Total Number of Contributing Authors'])*100 , 2)
df_author_analysis['Average Number of Authors Per Text'] = round(df_author_analysis['Total Number of Contributing Authors'].div(number_xml_docs) , 2)
df_author_analysis.loc['Mean'] = df_author_analysis.mean(axis=0).round(2)

df_author_analysis.T.to_csv('Figures/RQ5/RQ5__AuthorsStatistics.csv')

<span style='color:violet'><font size='2'>Cell 54</span></font> Bar chart for new authors

In [88]:
bar_chart = pygal.Bar(style=custom_style, 
                      x_title='Years Observed', 
                      y_title='Total Numbers', 
                      truncate_legend = -1, 
                      legend_at_bottom=True, 
                      print_values=True, 
                      print_values_position='bottom')
bar_chart.title = 'DHd Authors'
bar_chart.x_labels = filenames_xml
bar_chart.add('All Authors', df_author_analysis.T.iloc[0][:-1])
bar_chart.add('New Authors', df_author_analysis.T.iloc[1][:-1])
bar_chart.render_to_file('Figures/RQ5/RQ5__ContributorsAnalysis.svg')

==========================================================================================================================

==========================================================================================================================

==========================================================================================================================

## Research Question 6: Author-Topic Clustering

### *Which clusters of researchers can be found with regard to topics and how have the clusters been changing?* 

### RQ6: Functions

- Function *get_authors_and_topics*:

This function takes two lists and one dictionary as input, as well as the index of the first XML document. The first list, *authors_no_duplicates*, contains all author names extracted from the XML files; the function iterates over it in alphabetical order. The second list, *all_author_teams*, contains the authors of each text, so that it can be determined which text was written by whom. The input dictionary *main_topics* is the one returned by the function *find_document_topics*, already applied in RQ1, and contains the most prominent topic of each document. These inputs are used to infer which author has written on which (most salient) topic. This information is stored in the dictionary *authors_and_topics*, where each author name is a key and the value is a list of the main topics of the documents that author contributed to. The function returns this dictionary. 
To keep later visualizations readable, the data is reduced to authors who have contributed to DHd conferences at least nine times. This reduction is not performed here but in *create_vectors*.

Only the documents from index 231 onwards are taken into account in this function. These documents stem from XML files, in which the authors of each document are recorded in the XML markup.

- Function *create_vectors*:

*create_vectors* takes the dictionary from *get_authors_and_topics* and a minimum number of contributions as input and transforms the information on the authors' contributions to topics into a vector representation. Only authors with more contributions than this threshold are kept. For each remaining author (i.e. key of the dictionary), a vector with the length of the final topic number is created: it contains a 0 for every topic the author did not contribute to and, for every other topic, the number of times the author contributed to that topic. 
The function returns the dictionary with authors (keys) and vectors (values), as well as a list of only the vectors and a list of only the authors' names. 
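
The two steps described above can be illustrated on toy data. The following is a minimal sketch, not the notebook's implementation: the author names, document indices and topic numbers are invented, and `num_topics` merely stands in for the notebook's `final_num_topics`.

```python
# Toy data (hypothetical names and topic numbers, not taken from the corpus).
all_author_teams = [["Ada", "Ben"], ["Ada"], ["Cleo", "Ada"]]
first_index = 231  # index of the first XML-based document, as in the notebook
main_topics = {231: 2, 232: 2, 233: 4}  # document index -> most salient topic
num_topics = 4  # assumption: stands in for final_num_topics

# Step 1: collect the main topic of every document an author contributed to.
authors_and_topics = {}
for document_id, team in enumerate(all_author_teams, start=first_index):
    for author in team:
        authors_and_topics.setdefault(author, []).append(main_topics[document_id])

# Step 2: count contributions per topic (topic n is stored at list index n-1).
vectors = {}
for author, topic_list in authors_and_topics.items():
    counts = [0] * num_topics
    for topic in topic_list:
        counts[topic - 1] += 1
    vectors[author] = counts

print(authors_and_topics)  # {'Ada': [2, 2, 4], 'Ben': [2], 'Cleo': [4]}
print(vectors)             # {'Ada': [0, 2, 0, 1], 'Ben': [0, 1, 0, 0], 'Cleo': [0, 0, 0, 1]}
```

In this toy example no contribution threshold is applied, so all three authors keep their vectors.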

<span style='color:violet'><font size='2'>Cell 55</span></font> get_authors_and_topics

In [89]:
def get_authors_and_topics(authors_no_duplicates, all_author_teams, main_topics, index_first_xml_document):
    
    authors_and_topics = {}
    for name in sorted(authors_no_duplicates):
        topics_per_author = []
        # document ids start at index_first_xml_document (231), because only from that document on
        # the authors are noted down in the markup of the xml files
        # iterating over all authors in all texts, looking for the author name currently examined
        for document_id, document in enumerate(all_author_teams, start=index_first_xml_document):
            for author in document:
                # if the name matches an author of the text, use the text id to find the salient topic of the text
                if name == author:
                    topics_per_author.append(main_topics[document_id])
        # storing the result once per author, after all documents have been checked
        authors_and_topics[name] = topics_per_author

    return authors_and_topics

<span style='color:violet'><font size='2'>Cell 56</span></font> create_vectors

In [90]:
def create_vectors(authors_and_topics, minimum_number_of_contributions):

    # transforming the data from authors_and_topics into a vector representation
    # advantage: easier to plot and counts how often each topic was written on by the authors
    vector = {}
    for key in authors_and_topics:
        # keeping only authors with more contributions than the given threshold
        if len(authors_and_topics[key]) > minimum_number_of_contributions:
            # creating a vector with a length corresponding to the number of topics (global variable final_num_topics)
            vector[key] = [0]*final_num_topics
            for digit in authors_and_topics[key]:
                # digit-1 maps the topic number to its position in the vector
                vector[key][digit-1] += 1
    # retrieving the author names to use them as labels and the vectors to determine distances between the vectors
    only_authors = [key for key in vector]
    only_vectors = [vector[key] for key in vector]

    return vector, only_vectors, only_authors

### RQ6: Main

<span style='color:violet'><font size='2'>Cell 57</span></font>

In [91]:
index_first_xml_document = 231

# slicing the dict main_topics because for this RQ only the documents from 2016-2023 are needed, beginning with index 231
main_topics_xml = dict(itertools.islice(main_topics.items(), index_first_xml_document, len(textnames)))  
authors_and_topics = get_authors_and_topics(authors_no_duplicates, all_author_teams, main_topics_xml, index_first_xml_document)
vector, only_vectors, only_authors = create_vectors(authors_and_topics, 8)

<span style='color:violet'><font size='2'>Cell 58</span></font> Creating a dot chart visualization

In [97]:
i = 1
for digit in np.arange(0, 1, 0.5):
    # dividing the dictionary into two halves, to make the output plots more readable
    part = dict(list(vector.items())[round(len(vector)*digit) : round(len(vector)*(digit+0.5))])

    dot_chart = pygal.Dot(human_readable = True, 
                            width = 800, height= 800, 
                            truncate_legend = 20,
                            truncate_label = -1, 
                            style = custom_style, 
                            legend_box_size = 6, 
                            x_label_rotation= 90)
    dot_chart.title = 'Authors with >8 contributions and the topics they wrote about'
    dot_chart.x_labels = list(topics.values())

    for key in part:
        dot_chart.add(key, part[key])

    name = 'Figures/RQ6/RQ6__Authors_and_Topics_' + str(i) + '.svg'
    dot_chart.render_to_file(name)
    i += 1
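
The slicing used in Cell 58 can be checked on a small example. The sketch below uses a hypothetical five-entry dictionary in place of `vector` and only the standard library; it illustrates that the two halves neither overlap nor leave a gap, because the upper bound of the first slice and the lower bound of the second are computed by the same expression. Note that Python's `round` rounds halves to the nearest even number, so `round(2.5)` is 2.

```python
# Hypothetical stand-in for the `vector` dictionary (author -> topic counts).
vector = {"A": [1, 0], "B": [0, 2], "C": [3, 0], "D": [0, 1], "E": [2, 2]}

parts = []
for start in (0, 0.5):  # same values as np.arange(0, 1, 0.5)
    lower = round(len(vector) * start)
    upper = round(len(vector) * (start + 0.5))
    parts.append(dict(list(vector.items())[lower:upper]))

print([list(part) for part in parts])  # [['A', 'B'], ['C', 'D', 'E']]
```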

<span style='color:violet'><font size='2'>Cell 59</span></font> Creating a dendrogram

In [99]:
create_dendrogram(only_vectors, 
                  ' ', 
                  'Closeness of Authors Calculated by Topic Vectors', 
                  None, 
                  only_authors,
                  'Figures/RQ6/RQ6__Dendrogram.svg')
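
The dendrogram groups authors whose topic vectors lie close together. *create_dendrogram* is defined earlier in the notebook, and its distance metric and linkage method are not shown in this section, so the following is only a sketch under the assumption of Euclidean distance, with invented author vectors. It computes all pairwise distances and picks the closest pair, which is the pair an agglomerative clustering would merge first.

```python
import math
from itertools import combinations

# Hypothetical author vectors (contributions per topic), not taken from the corpus.
author_vectors = {
    "Ada": [3, 0, 1],
    "Ben": [3, 1, 1],
    "Cleo": [0, 4, 0],
}

# Euclidean distance between every pair of authors (assumed metric).
distances = {
    (a, b): math.dist(author_vectors[a], author_vectors[b])
    for a, b in combinations(author_vectors, 2)
}

# The pair with the smallest distance is merged first in agglomerative clustering.
closest_pair = min(distances, key=distances.get)
print(closest_pair, distances[closest_pair])  # ('Ada', 'Ben') 1.0
```

Authors with similar topic profiles, such as the first two in this example, therefore end up on neighbouring branches of the dendrogram.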

# References

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. *Journal of Machine Learning Research, 3*, 993–1022.

Chen, X., Zou, D., & Xie, H. (2020). Fifty years of British Journal of Educational Technology: A topic modeling based bibliometric perspective. *British Journal of Educational Technology, 51*(3), 692–708.

Du, K. (2022, March 7). Evaluating Hyperparameter Alpha of LDA Topic Modeling. Zenodo. Digital Humanities im deutschsprachigen Raum 2022. https://doi.org/10.5281/zenodo.6327965

geetansh044. (n.d.). *How to Perform a Mann-Kendall Trend Test in Python*. https://www.geeksforgeeks.org/how-to-perform-a-mann-kendall-trend-test-in-python/ (29.09.2023).

Kumar, K. (2018). *Evaluation of Topic Modeling: Topic Coherence*. https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/ (29.09.2023).

Mann, H. B. (1945). Nonparametric Tests Against Trend. *Econometrica, 13*(3), 245–259. https://www.jstor.org/stable/1907187

Řehůřek, R. (2022). *LDA model*. https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html (29.09.2023).

Řehůřek, R. (2022). *models.ldamodel – Latent Dirichlet Allocation*. https://radimrehurek.com/gensim/models/ldamodel.html (29.09.2023).

Röder, M., Both, A., & Hinneburg, A. (2015). Exploring the Space of Topic Coherence Measures. In X. Cheng (Ed.), *ACM Digital Library, Proceedings of the Eighth ACM International Conference on Web Search and Data Mining* (pp. 399–408). ACM. https://doi.org/10.1145/2684822.2685324

Schöch, C. (2017). Topic Modeling Genre: An Exploration of French Classical and Enlightenment Drama. *Digital Humanities Quarterly, 11*(2). http://www.digitalhumanities.org/dhq/vol/11/2/000291/000291.html