
# Code for Master's Thesis: Topic Modeling

## Research Questions

1. Which topics can be found in the abstracts from DHd-conferences between 2014 and 2023 with Topic Modeling?

2. Which topics appear together frequently in one abstract and therefore have a high topic similarity?

3. How have the topics been changing throughout the years - which trends are perceptible?

4. With regard to the use of different scientific methods, which developments are perceptible?

5. Which researchers contribute to the conference particularly frequently with abstracts, in which teams do they contribute and how have the teams been changing?

6. Which clusters of researchers can be found with regard to topics and how have the clusters been changing?

### Imports

In [6]:
#general imports
import re
import pickle
import numpy as np
import matplotlib.pyplot as plt
import logging

#Visualisations
import pandas as pd
import pyLDAvis
import pyLDAvis.gensim_models

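#importing functions from MA_Preprocessing
import MA_Preprocessing
# assumption: the helpers open_variable and save_object used below are defined in MA_Preprocessing
from MA_Preprocessing import open_variable, save_object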

#LDA
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
from gensim.models.ldamodel import LdaModel

#RQ2: Topic Similarity
import math
import scipy.cluster.hierarchy as shc

#RQ3: Mann-Kendall-Test
import pymannkendall as mk

#RQ5: Network Analysis
from pyvis.network import Network
import networkx as nx

#RQ6: Authors-Topic-Analysis
import itertools
from operator import itemgetter
import pygal 
from pygal.style import Style
from scipy.spatial import distance_matrix
from scipy.cluster import hierarchy


ModuleNotFoundError: No module named 'MA_Preprocessing'

Importing all variables needed from the preprocessing notebook to this one

In [4]:
%store -r number_pdf_docs
%store -r number_xml_docs
%store -r number_docs
%store -r docnames
%store -r filenames
%store -r filenames_xml
%store -r filenames_pdf
%store -r all_freely_selectable_keywords
%store -r used_keywords_freely_selectable
%store -r used_keywords_predetermined
%store -r authors
%store -r authors_full_list

#Variables that are stored in the Thesis folders
corpus = open_variable('Variables/', 'corpus.pckl')
id2word = open_variable('Variables/', 'id2word.pckl')
corrected_list_of_texts = open_variable('Variables/', 'corrected_list_of_texts.pckl')
data_bigrams_trigrams = open_variable('Variables/', 'data_bigrams_trigrams.pckl')

Implementing a list of indexes (the cumulative start offsets of each conference year) as well as a list containing all document names.

In [None]:
indexes = [0]
textnames = []
for sublist in docnames:
    indexes.append(len(sublist) + indexes[-1])
    textnames = textnames + sublist
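
The offsets built above make it possible to slice any per-document list by conference year. The following is a minimal sketch of the same idea on toy data; `docnames_demo` and the other `*_demo` names are hypothetical stand-ins, not variables from this notebook.

```python
from itertools import accumulate, chain

# Hypothetical stand-in for `docnames`: one list of file names per conference year
docnames_demo = [["a.pdf", "b.pdf"], ["c.xml"], ["d.xml", "e.xml", "f.xml"]]

# Cumulative start offsets: year i covers the slice [indexes_demo[i], indexes_demo[i + 1])
indexes_demo = [0] + list(accumulate(len(sublist) for sublist in docnames_demo))

# Flat list of all document names in corpus order
textnames_demo = list(chain.from_iterable(docnames_demo))

# Selecting all documents of the third year (position 2)
year = 2
third_year_docs = textnames_demo[indexes_demo[year]:indexes_demo[year + 1]]
print(indexes_demo, third_year_docs)
```

The same slicing pattern is what later allows the topic distributions to be separated by year.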

Implementing a list containing all authors (including duplicates for authors who appear more than once) and a list containing all author teams across all years.

In [5]:
all_authors = []
all_author_teams = []
for year in authors:
    all_author_teams = all_author_teams + year
    for author_team in year:
        all_authors = all_authors + author_team

['Kepper, Johannes', 'Hoppe, Stephan', 'Pfarr-Harfst, Mieke', 'Münster, Sander', 'Kuroczyński, Piotr', 'Blümel, Ina', 'Hauck, Oliver', 'Lutteroth, Jan', 'Pfeil, Patrick', 'Aehnlich, Barbara', 'Sahle, Patrick', 'Trippel, Thorsten', 'Neumann, Gerald', 'Engelhardt, Claudia', 'Kurzawe, Daniel', 'Schäfer, Felix', 'Wörner, Kai', 'Wiedemann, Gregor', 'Gloning, Thomas', 'Blätte, Andreas', 'Keller, Maret', 'Haaf, Susanne', 'Würzner, Kay-Michael', 'Andorfer, Peter', 'Durco, Matej', 'Stäcker, Thomas', 'Thomas, Christian', 'Hildenbrandt, Vera', 'Stigler, Hubert', 'Söring, Sibylle', 'Rosenthaler, Lukas', 'Keim, Daniel A.', 'Zweig, Katharina Anna', 'Aehnlich, Barbara', 'Kösser, Sylwia', 'Bürgermeister, Martina', 'Makowski, Stephan', 'Strecker, Bernhard', 'Jeller, Daniel', 'Schneider, Gerlinde', 'Bigalke, Jan', 'Büttner, Stephan', 'Heger, Martin', 'Heinrich, Marcus', 'Keller, Carolin', 'Lehmann, Anna', 'Meyer, Michaela', 'Barzen, Johanna', 'Falkenthal, Michael', 'Hentschel, Frank', 'Leymann, Frank', 

Creating a visualization style in order to be consistent with the visualizations later in the notebook.

In [37]:
# Visualization Style
custom_style = Style(
legend_font_size = 12,
legend_box_size = 12,
background='white',
plot_background='white',
foreground='black',
foreground_strong='#53A0E8',
foreground_subtle='#630C0D',
opacity='.6',
opacity_hover='0.3',
transition='400ms ease-in',
colors=('#C70039', '#FFC300', '#70C700', '#001EC7', '#7F3ACD', '#3ACDB4', '#EEFC11'))

==========================================================================================================================

==========================================================================================================================

==========================================================================================================================

### Topic Modeling (LDA): Tuning (hyper-)parameters

The ideal parameters for the topic model are determined with the help of two quality measures, *perplexity* and *topic coherence*.


**Quality Measure: Perplexity**

- Perplexity can be used to measure how well the LDA model generalizes on the text corpus (Blei et al., 2003, p. 1008)
- the lower the perplexity, the better the model
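
One detail is worth keeping in mind when reading the values in this notebook: Gensim's `log_perplexity` returns a per-word likelihood bound (a negative number), not the perplexity itself. As far as the Gensim logging output indicates, the perplexity estimate is obtained as 2 to the power of the negative bound. The sketch below illustrates this conversion under that assumption; `perplexity_from_bound` is a hypothetical helper name.

```python
import math


def perplexity_from_bound(per_word_bound: float) -> float:
    """Convert a per-word likelihood bound (base 2) into a perplexity estimate."""
    return math.pow(2.0, -per_word_bound)


# Bound that is logged for one of the models further down in this notebook
bound = -9.960594973875374
print(round(perplexity_from_bound(bound), 1))
```

Under this reading, a bound closer to zero corresponds to a lower perplexity and therefore to a better model.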

**Quality Measure: Topic Coherence** 

- Topic Coherence can be determined by various measures, e.g. UMass, C_V, UCI, NPMI, each of which calculates the coherence of topics differently (Röder et al., 2015, p. 2)
- In this workflow, C_V is used: this measure yields values between 0 and 1, with 1 being the best coherence that can be reached

**Parameter topic number**

- According to [Kumar (2018)](https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/), "[c]hoosing a ‘k’ that marks the end of a rapid growth of topic coherence usually offers meaningful and interpretable topics. Picking an even higher value can sometimes provide more granular sub-topics". However, "[i]f you see the same keywords being repeated in multiple topics, it’s probably a sign that the ‘k’ is too large". (ibid.)
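
The heuristic quoted above can be sketched in a few lines. This is only one possible reading of "the end of a rapid growth": the function name, the threshold `min_gain` and the toy values are assumptions for illustration and do not come from the thesis.

```python
def k_at_end_of_rapid_growth(ks, coherences, min_gain=0.005):
    """Return the first k after which coherence no longer grows by at least min_gain."""
    for i in range(len(ks) - 1):
        if coherences[i + 1] - coherences[i] < min_gain:
            return ks[i]
    # coherence kept growing until the end of the tested range
    return ks[-1]


# Toy values for illustration only, not results from this thesis
ks_demo = [10, 11, 12, 13, 14]
coherences_demo = [0.40, 0.43, 0.46, 0.462, 0.45]
chosen_k = k_at_end_of_rapid_growth(ks_demo, coherences_demo)
print(chosen_k)
```

On noisy coherence curves such a rule stops at the first small dip, so the plot and the topic keywords should still be inspected manually.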

**Parameter update_every**

- "Number of documents to be iterated through for each update. Set to 0 for batch learning, > 1 for online iterative learning." ([Gensim Documentation, Rehurek, 2022](https://radimrehurek.com/gensim/models/ldamodel.html))
- Here set to update_every=1 as this is the default

**Parameter Alpha**

- "A-priori belief on document-topic distribution" ([Gensim Documentation](https://radimrehurek.com/gensim/models/ldamodel.html))
- "Alpha is the parameter, which has the smoothing effect on the topic-document distribution and ensures that the probability of each topic in each document is not 0 throughout the entire inference procedure" (Du, 2022, p. 1). Du's study results indicate that coherence results of models deteriorate with increasing Alpha-parameter, and Du concludes that Alpha of each topic should not be higher than 1 (Du, 2022, p. 2)

**Parameter Eta**

- "[D]istributional profile of topics in each document" (Schöch, 2017, para. 20)
- "A-priori belief on topic-word distribution" ([Gensim Documentation](https://radimrehurek.com/gensim/models/ldamodel.html))

**Parameter Iterations**
- "Maximum number of iterations through the corpus when inferring the topic distribution of a corpus" ([Gensim Documentation](https://radimrehurek.com/gensim/models/ldamodel.html))
- "Iterations (...) essentially it controls how often we repeat a particular loop over each document. It is important to set the number of 'passes' and 'iterations' high enough" ([Tutorial: LDA Model](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html)) to train the best LDA Topic Model


**Parameter Passes**
- "Number of passes through the corpus during training" ([Gensim Documentation](https://radimrehurek.com/gensim/models/ldamodel.html))
- "Passes controls how often we train the model on the entire corpus. Another word for passes might be 'epochs'." ([Tutorial: LDA Model](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html))

- Function *modeling_and_quality*:

This function takes as input the created dictionary, corpus and texts on which the later Topic Models will be based. In addition, it takes several parameters which can be tuned to create the optimal Topic Model: the coherence measure to be used (*coherence*), the number of iterations (*iterations*), the number of topics (*topic_optim*), the values for alpha (*alpha_optim*) and eta, also called beta (*eta_optim*), the number of passes (*passes*) and the frequency of evaluations (*eval_every*). It returns the trained models together with their perplexity values, coherence values, per-topic coherence values and names.

In [10]:
def modeling_and_quality(dictionary, corpus, texts, coherence, iterations, topic_optim, alpha_optim, eta_optim, passes, eval_every):

    coherence_values = []
    model_list = []
    perplexity_values = []
    topic_coherence_values = []
    model_names = []
    
    # training one model for every combination of topic number, alpha and eta
    for topics_num in topic_optim:
        for alpha_value in alpha_optim:
            for eta_value in eta_optim:
                model = LdaModel(corpus=corpus, id2word=dictionary, iterations=iterations, num_topics=topics_num, 
                                 alpha=alpha_value, eta=eta_value, passes=passes, eval_every=eval_every, minimum_probability=1e-8)
                model_list.append(model)
                # log_perplexity returns the per-word likelihood bound, not the perplexity itself
                perplexity_values.append(model.log_perplexity(corpus))
                coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, corpus=corpus, coherence=coherence, topn=60, window_size=150)
                # computing the coherence only once, as this step is expensive
                coherence_value = coherencemodel.get_coherence()
                coherence_values.append(coherence_value)
                topic_coherence_values.append(coherencemodel.get_coherence_per_topic())
                model_names.append((str(topics_num), str(alpha_value), str(eta_value)))
                
                # printing every parameter combination that has been checked to monitor the progress of this very time-consuming step
                print('   topics_num: ', topics_num, '   alpha_value: ', alpha_value, '   eta_value: ', eta_value, 
                        '   coherence', coherence_value)

    return model_list, perplexity_values, coherence_values, topic_coherence_values, model_names

### Finding the best settings for *passes* and *iterations*:
Before tuning the other parameters, suitable settings for *passes* and *iterations* have to be found. To do so, the LDA Tutorial proposes to select an arbitrary number of topics *k* and try several options for the parameters *passes* and *iterations*, while keeping all other parameters equal. The tutorial proposes to use logging on the DEBUG level to see how many documents have converged during the training. One should opt for settings of the two parameters with which most, or ideally all, documents converge ([Tutorial: LDA Model](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html)).

Here, convergence of 1202/1203 documents occurred with *k=20*, *iterations=190* and *passes=4*.
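
Reading the convergence counts out of a long DEBUG log by hand is tedious, so they can also be extracted with a regular expression. The sketch below assumes that the log contains lines of the form "x/y documents converged within z iterations"; the sample lines are illustrative and were not copied from an actual run, and `convergence_counts` is a hypothetical helper name.

```python
import re

# Assumed message format of the DEBUG lines that report convergence
CONVERGED = re.compile(r"(\d+)/(\d+) documents converged within (\d+) iterations")


def convergence_counts(log_lines):
    """Collect (converged, total) pairs from log lines that report convergence."""
    pairs = []
    for line in log_lines:
        match = CONVERGED.search(line)
        if match:
            pairs.append((int(match.group(1)), int(match.group(2))))
    return pairs


# Illustrative lines only (assumed format, invented timestamps and first count)
sample_log = [
    "2023-08-16 08:00:00,000 : DEBUG : 950/1203 documents converged within 190 iterations",
    "2023-08-16 08:00:05,000 : INFO : some unrelated message",
    "2023-08-16 08:00:09,000 : DEBUG : 1202/1203 documents converged within 190 iterations",
]

counts = convergence_counts(sample_log)
converged, total = counts[-1]
print(f"last pass: {converged}/{total} documents converged")
```

Comparing the last pair across different settings of *passes* and *iterations* shows which combination lets most documents converge.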

In [10]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)

In [1]:
# initializing random seed to be able to reproduce the results
np.random.seed(3)

# running the function with k=20 and alpha/eta set to 'auto' to test the settings for passes and iterations
k_model_list_cv, k_perplexity_values_cv, k_coherence_values_cv, k_topic_coherence_values_cv, k_model_names_cv = modeling_and_quality(dictionary = id2word, 
                                                                                        corpus = corpus, 
                                                                                        texts = data_bigrams_trigrams, 
                                                                                        coherence = "c_v", 
                                                                                        # default: iterations = 50
                                                                                        iterations = 190, 
                                                                                        topic_optim = [20],
                                                                                        alpha_optim = ['auto'],
                                                                                        eta_optim = ['auto'],
                                                                                        # default: passes = 1
                                                                                        passes = 4,
                                                                                        eval_every = 1)

### Finding the best topic number *k*:
After finding good settings for *passes* and *iterations*, the next step is finding the ideal topic number *k*. While trying out different values for *k* between 10 and 35 and keeping the other parameters constant, one should look for the highest coherence value and lowest perplexity value.

In previous experiments, topic numbers up to 70 were tried as well. However, with that many topics, the individual topics proved too small and too hard to interpret for a meaningful Topic Model. Therefore, the maximum was set to 35 topics.

In [40]:
topic_optim = list(np.arange(10, 36, 1))

# initializing random seed to be able to reproduce the results
np.random.seed(1)

# running the function with varying topic numbers while keeping all other parameters constant
k_model_list_cv, k_perplexity_values_cv, k_coherence_values_cv, k_topic_coherence_values_cv, k_model_names_cv = modeling_and_quality(dictionary = id2word, 
                                                                                        corpus = corpus, 
                                                                                        texts = data_bigrams_trigrams, 
                                                                                        coherence = "c_v", 
                                                                                        iterations = 190, 
                                                                                        topic_optim = topic_optim,
                                                                                        alpha_optim = ['auto'],
                                                                                        eta_optim = ['auto'],
                                                                                        passes = 4,
                                                                                        eval_every = None)

<span style="color: red;">Remove? </span>

In [13]:
save_object('D:/Models/HanTa/', 'seed1_k_model_list_cv.pckl', k_model_list_cv)
save_object('D:/Models/HanTa/', 'seed1_k_perplexity_values_cv.pckl', k_perplexity_values_cv)
save_object('D:/Models/HanTa/', 'seed1_k_coherence_values_cv.pckl', k_coherence_values_cv)
save_object('D:/Models/HanTa/', 'seed1_k_topic_coherence_values_cv.pckl', k_topic_coherence_values_cv)
save_object('D:/Models/HanTa/', 'seed1_k_model_names_cv.pckl', k_model_names_cv)

### Plotting the Results for Optimizing *k*:

In [14]:
# Plotting Coherence Scores
line_chart = pygal.Line(style=custom_style, width=1400, x_title='Number of Topics k', y_title='Coherence Score')
line_chart.title = 'Coherence Scores for Different Number of Topics k'
line_chart.x_labels = map(str, range(10, 36))
line_chart.add('Coherence c_v', k_coherence_values_cv)
line_chart.render_to_file('Models/find_k/Coherence.svg')

# Plotting Perplexity Scores
line_chart = pygal.Line(style=custom_style, width=1400, x_title='Number of Topics k', y_title='Perplexity Score')
line_chart.title = 'Perplexity Scores for Different Number of Topics k'
line_chart.x_labels = map(str, range(10, 36))
line_chart.add('Perplexity', k_perplexity_values_cv)
line_chart.render_to_file('D:/Models/find_k/Perplexity.svg')

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

### Alpha and Beta Parameter Optimization with Optimal *k*:

Conducting another Topic Modeling Process with the optimal number for *k* from the training above. Now, the focus lies on the parameters *alpha* and *eta*, which can take on float values as well as string values. In the following, a mixture over several floats as well as strings is created, which is then given to the function *modeling_and_quality* in order to iterate over those varying values. 
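
To make the string options more tangible, the following sketch reproduces the two fixed priors as they are described in the Gensim documentation: 'symmetric' as 1.0 / num_topics for every topic, and 'asymmetric' as a normalized prior of 1.0 / (topic_index + sqrt(num_topics)). The option 'auto' is learned from the corpus during training and is not reproduced here. The helper names are hypothetical and not part of Gensim.

```python
import math


def symmetric_prior(num_topics):
    """Fixed symmetric prior: the same value 1 / num_topics for every topic."""
    return [1.0 / num_topics] * num_topics


def asymmetric_prior(num_topics):
    """Fixed asymmetric prior: 1 / (topic_index + sqrt(num_topics)), normalized to sum to 1."""
    raw = [1.0 / (index + math.sqrt(num_topics)) for index in range(num_topics)]
    total = sum(raw)
    return [value / total for value in raw]


sym = symmetric_prior(32)
asym = asymmetric_prior(32)
print(sym[0], asym[0], asym[-1])
```

With the asymmetric prior, topics with a low index receive more prior weight than topics with a high index, whereas the symmetric prior treats all topics equally.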

In [39]:
a_ints = list(np.arange(0.01, 1, 0.3))
a_strings = ['symmetric', 'asymmetric', 'auto']
alpha = a_ints + a_strings

e_ints = list(np.arange(0.01, 1, 0.3))
e_strings = ['symmetric', 'auto']
eta = e_ints + e_strings

# initializing random seed to be able to reproduce the results
np.random.seed(3)

# running the function to find the optimal combination of parameters
model_list_cv, perplexity_values_cv, coherence_values_cv, topic_coherence_values_cv, model_names_cv = modeling_and_quality(dictionary=id2word, 
                                                                                                                         corpus=corpus, 
                                                                                        texts=data_bigrams_trigrams, 
                                                                                        coherence="c_v", 
                                                                                        iterations = 190,
                                                                                        topic_optim = [32],                                                                                         
                                                                                        alpha_optim = alpha,
                                                                                        eta_optim = eta,
                                                                                        passes = 4,
                                                                                        eval_every=None)
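
Because *modeling_and_quality* iterates over topic numbers, alpha values and eta values in nested loops, the position of a model in the returned lists can be mapped back to its parameters. The sketch below rebuilds that order with `itertools.product`; the float values are written out as rounded approximations of what `np.arange(0.01, 1, 0.3)` produces, and the `*_demo` names are hypothetical.

```python
from itertools import product

topics_demo = [32]
# rounded approximations of np.arange(0.01, 1, 0.3), followed by the string options
alpha_demo = [0.01, 0.31, 0.61, 0.91, 'symmetric', 'asymmetric', 'auto']
eta_demo = [0.01, 0.31, 0.61, 0.91, 'symmetric', 'auto']

# same nesting order as the loops: topics (outer), alpha (middle), eta (inner)
grid = list(product(topics_demo, alpha_demo, eta_demo))

print(len(grid))
print(grid[5])
```

The grid contains 7 x 6 = 42 combinations, which matches the 42 x-axis labels used in the plots of this run.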

<span style="color: red;">Models still need to be stored in the correct folder! </span> 

In [316]:
save_object('Models/k32_seed3/', '32_model_list_cv.pckl', model_list_cv)
save_object('Models/k32_seed3/', '32_perplexity_values.pckl', perplexity_values_cv)
save_object('Models/k32_seed3/', '32_coherence_values_cv.pckl', coherence_values_cv)
save_object('Models/k32_seed3/', '32_topic_coherence_values.pckl', topic_coherence_values_cv)
save_object('Models/k32_seed3/', '32_model_names.pckl', model_names_cv)

### Plotting the Results of the Alpha and Eta Optimization with Optimal *k*:

In [317]:
# Plotting Coherence Scores
line_chart = pygal.Line(style=custom_style, width=1600, x_title='Index of Trained Models', y_title='Coherence Score')
line_chart.title = 'Coherence Scores for Different Alpha and Eta Settings (k=32)'
line_chart.x_labels = map(str, range(0, 42))
line_chart.add('Coherence c_v', coherence_values_cv)
line_chart.render_to_file('Models/k32_seed3/32_Coherence.svg')

# Plotting Perplexity Scores
line_chart = pygal.Line(style=custom_style, width=1600, x_title='Index of Trained Models', y_title='Perplexity Score')
line_chart.title = 'Perplexity Scores for Different Alpha and Eta Settings (k=32)'
line_chart.x_labels = map(str, range(0, 42))
line_chart.add('Perplexity', perplexity_values_cv)
line_chart.render_to_file('Models/k32_seed3/32_Perplexity.svg')

==========================================================================================================================

==========================================================================================================================

==========================================================================================================================

## Research Question 1: Topic Modeling

### *Which topics can be found in the abstracts from DHd-conferences between 2014 and 2023 with Topic Modeling?*

### RQ1: Functions

- Function *get_topic_coherences*:

This function takes as input the list of topic coherence values and an index, so that the coherences of the single topics in one certain model can be printed out, e.g. for analysis.

- Function *get_model_info*:

The function takes an index as input as well as the lists of model names, models, coherence values, perplexity values and topic coherence values. In combination with the index, those lists are used to log the most important information about the selected model.

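- Function *get_main_topics*:

This function takes a list of topic distributions (one list of (topic, probability) tuples per text) and returns a dictionary with the text indexes as keys and the number of the most salient topic as values.

- Function *find_document_topics*:

For each text from the corpus, this function first determines which topic has the highest probability in that text. In addition, it extracts the whole topic probability distribution for each document. The function returns a dictionary which contains the text ids as keys and the number of the most prominent topic (counted from 1) as values, as well as a list containing the topic probabilities of every document.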

In [43]:
def get_topic_coherences(topic_coherence_values_cv, index_best_model):
    
    coherences = []
    i = 1
    for value in topic_coherence_values_cv[index_best_model]:
        coherences.append((i, round(value, 3)))
        i +=1 
        
    return coherences

In [42]:
def get_model_info(selected_index, model_names, model_list, coherence_values, perplexity_values, topic_coherence_values):
    
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    
    logging.info('Index of the selected model: %s', selected_index)
    logging.info('Model name (k, alpha, beta parameters): %s', model_names[selected_index])
    logging.info('Number of topics in the selected model: %s', model_list[selected_index].num_topics)
    logging.info('Coherence of the selected Topic Model: %s', coherence_values[selected_index])
    logging.info('Perplexity value: %s', perplexity_values[selected_index])
    logging.info('Topic coherences: %s \n', get_topic_coherences(topic_coherence_values, selected_index)) 

In [41]:
def get_main_topics(all_topics):
    
    main_topics = {}
    i = 0
    for doc_topics in all_topics:
        # retrieve the item in doc_topics with highest second value in tuple
        most_salient_topics = max(doc_topics, key=itemgetter(1))
        topic_no, probability = most_salient_topics
        # filling the dictionary with the text IDs as keys and the most salient topics as values
        main_topics[i] = topic_no +1
        i += 1
        
    # returns a dictionary with text indexes as keys and the number of the salient topic as values
    return main_topics

In [40]:
def find_document_topics(corpus, model):
    
    topics_per_document = []
    salient_topics = {}
    i = 0

    for document in corpus:
        doc_topics = model.get_document_topics(document)
        
        #finding the most salient topic in each text
        most_salient_topics = max(doc_topics, key=itemgetter(1))
        salient_topic_number, salient_topic_probability = most_salient_topics
        salient_topics[i] = salient_topic_number +1
        
        #saving the complete topic distribution of each text in a list
        probabilities = []
        for topic in doc_topics:
            topic_num, probability = topic
            probabilities.append(probability)
        topics_per_document.append(probabilities)
        i += 1    
    
    return salient_topics, topics_per_document
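
The core step of the function above can be tried out on toy data. The sketch below uses made-up (topic, probability) pairs in the shape returned by `get_document_topics`. It also shows one way to build a probability vector that stays aligned by topic id even if a topic is missing from the pairs, which may happen when a probability falls below the model's `minimum_probability`. This is an illustration, not the function used in the thesis.

```python
from operator import itemgetter

# Toy output in the shape of get_document_topics: (topic_id, probability) pairs
doc_topics_demo = [(0, 0.05), (1, 0.70), (2, 0.25)]

# The pair with the highest probability is the most salient topic
topic_id, probability = max(doc_topics_demo, key=itemgetter(1))
salient_topic = topic_id + 1  # 1-based numbering as used in this notebook

# Dense vector indexed by topic id; topics missing from the pairs keep 0.0
num_topics_demo = 4
dense = [0.0] * num_topics_demo
for tid, prob in doc_topics_demo:
    dense[tid] = prob

print(salient_topic, dense)
```

Keeping the vectors aligned by topic id matters later on, when the probabilities of the same topic are compared across documents.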

### RQ1: Main

<font color=red> Models still need to be moved to the correct folder </font>

In [38]:
# reopening the models with k=32 again
model_list_cv = open_variable('Models/k32_seed3/', '32_model_list_cv.pckl')
coherence_values_cv = open_variable('Models/k32_seed3/', '32_coherence_values_cv.pckl')
topic_coherence_values_cv = open_variable('Models/k32_seed3/', '32_topic_coherence_values.pckl')
model_names_cv = open_variable('Models/k32_seed3/', '32_model_names.pckl')
perplexity_values_cv = open_variable('Models/k32_seed3/', '32_perplexity_values.pckl')

Choosing the optimal Topic Model:
Inspecting the models with the highest coherences and selecting one of the models for further use. The highest coherence in this modeling pass is ca. 0.5 (see the visualization created before). To structure the search for a meaningful model, only models with a coherence above a threshold of 0.475 are inspected.

In [44]:
# inspecting every model whose coherence exceeds the threshold
# enumerate provides the correct index even if two models share the same coherence value
for selected_index, value in enumerate(coherence_values_cv):
    if value > 0.475:
        get_model_info(selected_index, model_names_cv, model_list_cv, coherence_values_cv, perplexity_values_cv, topic_coherence_values_cv)

2023-08-18 16:26:17,722 : INFO : Index of the selected model: 5
2023-08-18 16:26:17,770 : INFO : Model name (k, alpha, beta parameters): ('32', '0.01', 'auto')
2023-08-18 16:26:17,774 : INFO : Number of topics in the selected model: 32
2023-08-18 16:26:17,778 : INFO : Coherence of the selected Topic Model: 0.47518148079689115
2023-08-18 16:26:17,782 : INFO : Perplexity value: -9.960594973875374
2023-08-18 16:26:17,788 : INFO : Topic coherences: [(1, 0.535), (2, 0.477), (3, 0.534), (4, 0.533), (5, 0.432), (6, 0.502), (7, 0.415), (8, 0.462), (9, 0.372), (10, 0.549), (11, 0.455), (12, 0.516), (13, 0.546), (14, 0.448), (15, 0.386), (16, 0.432), (17, 0.378), (18, 0.421), (19, 0.45), (20, 0.44), (21, 0.576), (22, 0.509), (23, 0.511), (24, 0.569), (25, 0.532), (26, 0.28), (27, 0.52), (28, 0.402), (29, 0.511), (30, 0.592), (31, 0.495), (32, 0.425)] 

2023-08-18 16:26:17,791 : INFO : Index of the selected model: 15
2023-08-18 16:26:17,793 : INFO : Model name (k, alpha, beta parameters): ('32', 

Visualizing the generated models with pyLDAvis, so that the optimal model can be chosen.

In [16]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(model_list_cv[5], corpus, id2word)
pyLDAvis.show(vis, local=False)

After close inspection, the model with index=5 seems to fit the purposes best. Thus, it is determined as the optimal model:

In [17]:
optimal_model = model_list_cv[5]
final_num_topics = optimal_model.num_topics
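# save() does not take a file mode; a second positional argument would be interpreted as the `ignore` parameter
optimal_model.save("Models/optimal_model_k32.model")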

2023-08-16 08:58:30,043 : INFO : LdaState lifecycle event {'fname_or_handle': 'Models/optimal_model_k32.model.state', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2023-08-16T08:58:30.043089', 'gensim': '4.2.0', 'python': '3.10.5 (tags/v3.10.5:f377153, Jun  6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19045-SP0', 'event': 'saving'}
2023-08-16 08:58:30,061 : INFO : saved Models/optimal_model_k32.model.state
2023-08-16 08:58:30,126 : INFO : LdaModel lifecycle event {'fname_or_handle': 'Models/optimal_model_k32.model', 'separately': "['expElogbeta', 'sstats']", 'sep_limit': 10485760, 'ignore': ['id2word', 'state', 'w', 'dispatcher'], 'datetime': '2023-08-16T08:58:30.126188', 'gensim': '4.2.0', 'python': '3.10.5 (tags/v3.10.5:f377153, Jun  6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19045-SP0', 'event': 'saving'}
2023-08-16 08:58:30,127 : INFO : storing np array 'expElogbeta' to Models/optim

If needed, the model can be loaded from the folder with the following line

In [19]:
optimal_model = LdaModel.load("Models/optimal_model_k32.model")

Which topic is the most salient topic in which text?

In [34]:
main_topics, topics_per_document = find_document_topics(corpus, optimal_model)

for item in main_topics.items():
    key, value = item
    if value == 24:
        print(key, textnames[key])

27 DHd-Verband-DHd-Abstracts-2014-667e2de/PDF-files/FORKEL_Robert_CLLD_Datenpublikationen_in_der_Linguistik.pdf
57 DHd-Verband-DHd-Abstracts-2014-667e2de/PDF-files/LÜCKE_Stephan_Das_Projekt_VerbaAlpina.pdf
163 DHd-Verband-DHd-Abstracts-2015-615abf8/PDF-files/150227_EmpirischeAnsätze_3_Elwert-Beziehung_und_Bedeutung_Soziale_und_semantische_Netzwerkanalyse_religionshistorischer_Korpora-1311163.pdf
197 DHd-Verband-DHd-Abstracts-2015-615abf8/PDF-files/HOMBURG_Timo_Learning_cuneiform_the_modern_way.pdf
230 DHd-Verband-DHd-Abstracts-2015-615abf8/PDF-files/CAPELLE_im_19._Jahrhunderts.pdf
234 DHd-Verband-DHd-Abstracts-2016-2fa852e/XML-files/panels-004.xml
239 DHd-Verband-DHd-Abstracts-2016-2fa852e/XML-files/posters-001.xml
272 DHd-Verband-DHd-Abstracts-2016-2fa852e/XML-files/posters-034.xml
277 DHd-Verband-DHd-Abstracts-2016-2fa852e/XML-files/posters-039.xml
288 DHd-Verband-DHd-Abstracts-2016-2fa852e/XML-files/posters-051.xml
300 DHd-Verband-DHd-Abstracts-2016-2fa852e/XML-files/posters-063.xml

<span style="color: red;"> Delete if not needed </span>

For all: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/topic_methods.ipynb

Getting the most frequent words per topic and the probability of a term per topic

In [186]:
# optimal_model.show_topics(num_topics=final_num_topics)
# # optimal_model.get_term_topics('kybernetik')

==========================================================================================================================

==========================================================================================================================

==========================================================================================================================

## Research Question 2: Hierarchical Clustering and Topic Similarity

### *Which topics appear together frequently in one abstract and therefore have a high topic similarity?*

- Function *calculate_euclidean_distances*:

This function takes as input one vector per topic, each containing that topic's probability in every document. This means that with k=32 and, for example, 100 documents from 2014, 32 vectors of length 100 are created. The distance between each pair of vectors is then calculated as the Euclidean distance. This finally enables the hierarchical clustering to show which topics occur together frequently.

- Function *create_dendrogram*:

The function creates a dendrogram for hierarchical clustering of the input data. It is given a vector as input, as well as titles for the x-axis and the whole visualization, labels for the leaves, the degree of label rotation and the resulting file name. For the clustering process itself, the library scipy is used with its functions cluster.hierarchy.dendrogram and cluster.hierarchy.linkage (shc.dendrogram/shc.linkage).

In [180]:
def calculate_euclidean_distances(topic_vectors_years):
    
    # calculating the Euclidean distance between each topic's probability vector and every other topic's vector
    euclidean_distances_years = []
    for top in topic_vectors_years:
        vec_distances = []
        # comparing the vector of one topic to the vectors of all topics (including itself)
        for comparison in topic_vectors_years:
            vec_distances.append(math.dist(top, comparison))
        # building a list of lists that holds the distance of each topic to every other topic
        euclidean_distances_years.append(vec_distances)
    
    return euclidean_distances_years
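
How the topic vectors and their pairwise distances come about can be illustrated on a small example. The sketch below uses an invented document-topic matrix with 4 documents and 3 topics; all names and values are made up for illustration.

```python
import math

# Toy document-topic matrix: 4 documents (rows) x 3 topics (columns); values are made up
doc_topic_demo = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.7],
]

# Transposing so that each vector holds one topic's probabilities across all documents
topic_vectors_demo = [list(column) for column in zip(*doc_topic_demo)]

# Pairwise Euclidean distances between the topic vectors
distances_demo = [[math.dist(a, b) for b in topic_vectors_demo] for a in topic_vectors_demo]

for row in distances_demo:
    print([round(value, 3) for value in row])
```

The result is a symmetric matrix with zeros on the diagonal. A small distance means that two topics have similar probabilities across the same documents.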

In [298]:
def create_dendrogram(vector, xlabel, plot_title, leaf_labels, rotation, filename):
    
    plt.figure(figsize=(20, 18))
    plt.xlabel(xlabel)  
    plt.title(plot_title, fontsize = 18) 
    dend = shc.dendrogram(shc.linkage(vector, method='ward'), labels = leaf_labels, leaf_font_size=14, leaf_rotation = rotation)
    plt.savefig(filename, format='svg')

### RQ2: Main

Creating a dendrogram with the topic distances for every year observed

In [303]:
for i in range(0, len(indexes)-1):
    # separating the data according to the conference year the texts belong to
    data = topics_per_document[indexes[i]:indexes[i+1]]
    
    # creating and transposing DataFrame to easily extract each first, second, third value in the lists
    df_topic_averages = pd.DataFrame(data=data).T
    
    # topic_vectors contains a list for each year
    # each list contains the probability of the respective topic for each text of the year, therefore each list is as long as the indexes indicate
    # therefore, topic_vectors_years[0] contains topic 0's probabilities from all texts from the analyzed year, topic_vectors_years[1] contains topic 1's probabilities from the year
    topic_vectors_years = []
    for j in range(0, final_num_topics):
        topic_vectors_years.append(df_topic_averages.iloc[j].values)

    # calculating the distance of each topic's topic distribution to any other topics' distribution with Euclidean Distance 
    euclidean_distances_years = calculate_euclidean_distances(topic_vectors_years)
    
    create_dendrogram(euclidean_distances_years, 'Topics', filenames[i]+': Hierarchical Clustering of All Topics Based on Euclidean Distance', 
                       range(1, final_num_topics+1), 45,'Figures/RQ2/RQ2_Clustering'+ filenames[i]+'.svg')

Creating a dendrogram with the topic distances for all years combined

In [302]:
# creating DataFrame to easily extract each first, second, third value in the lists
df_averages = pd.DataFrame(data=topics_per_text).T
topic_vectors_total = []

# basically same operations as above, but without splitting by years
for i in range(0, final_num_topics):
    topic_vectors_total.append(df_averages.iloc[i].values)

# calculating the distance of each topic's topic distribution to any other topics' distribution with Euclidean Distance 
euclidean_distances_years = calculate_euclidean_distances(topic_vectors_total)

create_dendrogram(euclidean_distances_years, 'Topics', '2016-2023: Hierarchical Clustering of Topics Based on Euclidean Distance', range(1, final_num_topics+1), 45 ,'Figures/RQ2/RQ2__Clustering_AllYears.svg')

==========================================================================================================================

==========================================================================================================================

==========================================================================================================================

## Research Question 3: Mann-Kendall-Test

### *How have the topics been changing throughout the years - which trends are perceptible?*


Mann-Kendall-Test:

- MK-Test analyzes whether there is a trend in a topic's appearance over time: increasing, decreasing, no trend (Chen et al., 2020; Mann, 1945)
- results are statistically significant if p <= 0.05
- results are highly statistically significant if p <= 0.01
- [Further Information on Mann-Kendall-Test](https://www.geeksforgeeks.org/how-to-perform-a-mann-kendall-trend-test-in-python/)
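
To make the idea behind the test concrete, the following is a minimal sketch of the Mann-Kendall statistic in plain Python. It is an illustration under stated assumptions, not the routine used in this notebook (which relies on pymannkendall): the function name `mann_kendall_sketch` and the toy series are hypothetical, and the variance formula assumes that the series contains no tied values.

```python
import math


def mann_kendall_sketch(series, alpha=0.05):
    """Simplified Mann-Kendall trend test (assumption: no tied values)."""
    n = len(series)
    # S sums the signs of all later-minus-earlier differences
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)
    # variance of S without tie correction
    var_s = n * (n - 1) * (2 * n + 5) / 18
    # continuity-corrected z-score
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    # two-sided p-value from the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2))
    if p <= alpha:
        trend = 'increasing' if z > 0 else 'decreasing'
    else:
        trend = 'no trend'
    return trend, s, z, p


# hypothetical average probabilities of one topic over eight conference years
toy_series = [0.02, 0.03, 0.05, 0.06, 0.08, 0.09, 0.11, 0.12]
toy_trend, toy_s, toy_z, toy_p = mann_kendall_sketch(toy_series)
```

For a strictly increasing series every pair contributes +1 to S, so S equals the number of pairs; with short series such as one value per conference year, only very consistent developments reach significance.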


### RQ3: Functions

- Function *calculate_topic_average*:

This function takes the list of indexes for each year's corpus as well as the list of topics per text as input. The latter contains, for every text in the corpus, a list of probabilities for every topic. On the basis of those probabilities and separated by the DHd-year the texts belong to, the average probability of each topic within each conference year is calculated. Those averages are then stored in another list, which is returned.

In [19]:
def calculate_topic_average(indexes, topics_per_text):
    
    averages_per_topic_per_year = []
    i = 0 
    j = 1
    while i < int(len(indexes)-1):
        all_probabilities = []
        for text in topics_per_text[indexes[i]:indexes[j]]:
            all_probabilities.append(text)
        # dividing the sum of all first, second, third... values by the number of values to get the average of all n topics for one year
        averages_per_topic = (np.sum(all_probabilities, axis=0)/len(all_probabilities))
        averages_per_topic_per_year.append(averages_per_topic)
        i +=1
        j += 1
        
    return averages_per_topic_per_year
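
The averaging step can be checked on a small example. This is a sketch in plain Python with hypothetical names (`yearly_topic_means`, the toy lists); it assumes, as the function above does, that `indexes` holds the positions at which each year's documents start, followed by the end of the last year.

```python
def yearly_topic_means(indexes, topics_per_text):
    """Average every topic's probability over the documents of each year."""
    means_per_year = []
    # consecutive index pairs mark where one year's documents start and end
    for start, end in zip(indexes, indexes[1:]):
        year_rows = topics_per_text[start:end]
        # zip(*rows) groups the values topic by topic (column-wise)
        means_per_year.append([sum(col) / len(year_rows) for col in zip(*year_rows)])
    return means_per_year


# hypothetical corpus: two documents in the first year, three in the second, two topics
toy_indexes = [0, 2, 5]
toy_topics_per_text = [
    [0.2, 0.8],
    [0.4, 0.6],
    [0.9, 0.1],
    [0.6, 0.4],
    [0.3, 0.7],
]
toy_means = yearly_topic_means(toy_indexes, toy_topics_per_text)
```

Each inner list of the result holds one average per topic, so the outer list has one entry per conference year.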

### RQ3: Main

In [20]:
# using the start indexes of each year's documents to calculate the average topic probabilities per conference year
topic_averages = calculate_topic_average(indexes, topics_per_text)

# creating a dataframe which shows the average probability of a topic in each year
df_averages = pd.DataFrame(topic_averages, index=filenames).T

# conducting the Mann-Kendall-Test for each of the topics
df_mk = []
for i in range(0, final_num_topics):
    df_mk.append(mk.original_test(df_averages.values[i]))
mk_results = pd.DataFrame(df_mk)

# joining the averages-DF and the Mann-Kendall-Test-DF in order to have complete csv-file
mk_df = df_averages.join(mk_results)
# rename the index so that topics are 1-22 and not 0-21 (similar in all other uses)
mk_df.index = list(np.arange(1, final_num_topics+1, 1))
mk_df.to_csv('Figures/RQ3/RQ3__Mann-Kendall.csv')

### RQ3: Line Plot Visualization

In [22]:
# dividing up the number of topics in order to plot understandable graphs 
parts = [0, round(final_num_topics*0.25), round(final_num_topics*0.5), round(final_num_topics*0.75), final_num_topics]

for k in range(len(parts)-1):
    line_chart = pygal.Line(x_title='Years Observed', y_title='Average Probability of the Topic')
    line_chart.title = 'Average Probabilities of Topics over the Years'
    line_chart.x_labels = filenames
    for i in range(parts[k], parts[k+1]):
        line_chart.add(str('Topic ' + str(i+1)), df_averages.values[i])
    
    name = 'Figures/RQ3/RQ3__TopicProbabilitiesPerYear_' + str(parts[k]+1) + '-' + str((parts[k+1])) + '.svg'
    line_chart.render_to_file(name) 

==========================================================================================================================

==========================================================================================================================

==========================================================================================================================

## Research Question 4: Use and Development of Research Methods
### *With regard to the use of different scientific methods, which developments are perceptible?*

There are two types of keywords used in the xml-files of the DHd-Conferences: \<keywords scheme="ConfTool" n="keywords"> and \<keywords scheme="ConfTool" n="topics">. The former may be used by authors to freely annotate their xml-file with words which they want to appear in the metadata, while the latter only allows for a selection from a predetermined list of keywords.
From 2016 on, authors could select keywords from this list for the xml-file of their contribution. The list is 75 items long, out of which six can currently be selected per xml-file. In 2016 and 2017, however, this restriction did not yet apply, so that authors could select more than six keywords for the metadata. The full list of usable keywords is called 'conf_tool_methods' and will be used in the following.


### RQ4: Functions

- Function *count_keywords*:

This function is the first to be called for RQ4 and takes two lists as input arguments. The first one contains all keywords of the respective type (e.g. all freely-assigned keywords) which were used in all xml-texts from 2016 until 2023. This list is used to create a dictionary for counting the used methods. At first, all values of the keys, which are the methods' names, are set to 0. Then the second list is iterated over and the count of each method it contains is increased by 1. This happens for each DHd-year separately, so that the sorted list returned by *count_keywords* provides an overview of which methods were used how often in which year.

- Function *find_most_popular_methods*:

*find_most_popular_methods* takes the list provided by *count_keywords* as input and extracts the first five entries for each year. Those entries are the five most frequently used keywords for the respective year, and they are added to a list which stores the most popular methods (each method only once). The list is returned by the function.

- Function *calculate_relative_count*:

The function takes as input the popular keywords extracted in the previous function, the sorted list of counted keywords from *count_keywords* as well as a list stating how many xml-files are in each conference year's corpus. The function's purpose, in the first step, is to find out how often each of the popular keywords is used over the years, i.e. over all corpora. As a second step, this count of the method is then divided by the number of documents in the corpus in order to compute relative instead of absolute numbers. This is necessary because there are larger and smaller corpora in the collection, which could distort the interpretation of numbers. 

- Function *retrieve_information_on_keywords*:

This function calls the other three functions sequentially and returns both the absolute and the relative counts of the keywords.

- Function *chart_and_csv*:

*chart_and_csv* takes as input data calculated beforehand as well as titles for the resulting output. It turns the given data into a line chart containing only a small amount of selected data as well as a csv file that contains all data. 
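
The counting and normalisation steps described above can be sketched compactly with `collections.Counter`. This is an illustration with hypothetical names and toy data, not the code used in this notebook; it assumes one flat list of keywords per year and one document count per year.

```python
from collections import Counter


def relative_keyword_counts(keywords_per_year, docs_per_year):
    """Return, per keyword, its count in each year divided by that year's corpus size."""
    all_keywords = sorted({kw for year in keywords_per_year for kw in year})
    relative = {kw: [] for kw in all_keywords}
    # zip pairs every year's keywords with the size of the same year's corpus
    for year_keywords, n_docs in zip(keywords_per_year, docs_per_year):
        counts = Counter(year_keywords)
        for kw in all_keywords:
            # Counter returns 0 for keywords not used in this year
            relative[kw].append(round(counts[kw] / n_docs, 3))
    return relative


# hypothetical keywords of two conference years with 4 and 10 documents
toy_keywords_per_year = [
    ['annotation', 'visualisation', 'annotation'],
    ['annotation', 'modelling'],
]
toy_docs_per_year = [4, 10]
toy_relative = relative_keyword_counts(toy_keywords_per_year, toy_docs_per_year)
```

Dividing by the size of the respective year's corpus is what makes the values comparable across larger and smaller conference years.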

In [23]:
def count_keywords(complete_list, year_list):
    # for each year, a large list of all possible, freely annotated keywords is set up and the count for each of the keywords is set to 0
    # if the method appears in the specific year's list, +1 will be added to the count
     
    counted_keywords_per_year = []
    for year in year_list:
        # named keyword_counts to avoid shadowing the built-in dict
        keyword_counts = {}
        for method in complete_list:
            keyword_counts[method] = 0
        for method in year:
            keyword_counts[method] += 1
        # sorting the dict according to the count, so that the item with the highest count comes first
        sorted_dict = sorted(keyword_counts.items(), key=lambda x: x[1], reverse=True)
        counted_keywords_per_year.append(sorted_dict)
        
    return counted_keywords_per_year

In [24]:
def find_most_popular_methods(methods):
    # takes sorted list as input in which the methods used and their count (how often were they used/stated as keyword in each year?) is saved
    most_popular_methods = []
    for year in methods:
        # extracts the first five methods (the ones with the highest count) and appends them to another list, in which only the method itself is saved
        best_five = year[:5]
        for item in best_five: 
            research_method, count = item
            if research_method not in most_popular_methods:
                most_popular_methods.append(research_method)
                
    return most_popular_methods

In [25]:
def calculate_relative_count(most_popular_methods, counted_keywords, number_xml_docs):
    
    # for the popular methods, it is looked at how often they appear in each year's corpus
    # then, that count will be divided by the total number of texts in that corpus, which results in a proportional frequency of the keyword to the corpus size 
    relative_dict = {}
    for item in most_popular_methods:
        item_count = []
        # enumerate provides the position of the year, so that each count is divided by the size of its own corpus
        for i, year in enumerate(counted_keywords):
            for name, count in year:
                if name == item:
                    item_count.append(round(count/number_xml_docs[i], 3))
        relative_dict[item] = item_count
        
    return relative_dict

In [26]:
def retrieve_information_on_keywords(total_list, used_keywords_list, number_xml_docs):
    
    absolute_count_keywords = count_keywords(total_list, used_keywords_list)
    most_popular_methods = find_most_popular_methods(absolute_count_keywords)
    relative_count_keywords = calculate_relative_count(most_popular_methods, absolute_count_keywords, number_xml_docs)
    
    return absolute_count_keywords, relative_count_keywords

In [27]:
def chart_and_csv(chart_title, filenames_xml, relative_count, chart_filename, all_relative_counts, csv_filename):
    # creating a plot for visualisation

    line_chart = pygal.Line(truncate_legend=-1, x_title='Years Observed', y_title= 'Number of Usages Per Method' + '\n' + '(Relative to Corpus Size)')

    line_chart.title = chart_title
    line_chart.x_labels = filenames_xml
    for key in relative_count:
        line_chart.add(key, relative_count[key])

    line_chart.render_to_file(chart_filename)
    
    # for complete information, a csv-file is provided with all used keywords and their relative counts
    df_averages = pd.DataFrame(all_relative_counts.values(), index=all_relative_counts.keys(),columns=filenames_xml)
    df_averages.to_csv(csv_filename)

### RQ4: Main

In [28]:
# importing the list provided, which contains all selectable options for <keywords n='topics'>
all_predetermined_keywords = open_list('Misc/predetermined_keywords.txt')

absolute_count_free_keywords, relative_count_free_keywords = retrieve_information_on_keywords(all_freely_selectable_keywords, used_keywords_freely_selectable, number_xml_docs)
absolute_count_predetermined_keywords, relative_count_predetermined_keywords = retrieve_information_on_keywords(all_predetermined_keywords, used_keywords_predetermined, number_xml_docs)

In [29]:
# for complete information, a csv-file is provided with all used keywords and their relative counts, in line chart only the most popular ones were taken
free_relative_all = calculate_relative_count(all_freely_selectable_keywords, absolute_count_free_keywords, number_xml_docs)
predetermined_relative_all = calculate_relative_count(all_predetermined_keywords, absolute_count_predetermined_keywords, number_xml_docs)


chart_and_csv('Use of methods in <keywords n=\'keywords\'>', filenames_xml, relative_count_free_keywords, 
             'Figures/RQ4/RQ4__Free_ResearchMethodsPerYear.svg', free_relative_all, 'Figures/RQ4/RQ4__Free_Keywords_All.csv')

chart_and_csv('Use of methods in <keywords n=\'topics\'>', filenames_xml, relative_count_predetermined_keywords,
             'Figures/RQ4/RQ4__Predetermined_ResearchMethodsPerYear.svg', predetermined_relative_all, 'Figures/RQ4/RQ4__Predetermined_Keywords_All.csv')

==========================================================================================================================

==========================================================================================================================

==========================================================================================================================

## Research Question 5: Analysis of Authors and Teams of Authors


*Which researchers contribute to the conference particularly frequently with abstracts, in which teams do they contribute and how have the teams been changing?*

### RQ5: Functions - Part 1: Data Processing

- Function *count_appearances_descending*:

*count_appearances_descending* takes a list as input and counts how often a list item appears within the list. The function returns a sorted dictionary in which the keys with the highest values come first.

- Function *rank_authors*:

*rank_authors* takes as input the nested list of authors per text and per year, and turns it into one large list per year containing every author who contributed to the conference in that year. This list contains authors twice/three times if they contributed twice/three times within one conference, which is how a count of every author and their contributions to the conference is eventually set up. The function returns a list of lists, in which the authors' names and their respective count of contributions for one year are stored. This function is necessary to set up the network nodes in a last step.

- Function *find_coauthors*:

The function *find_coauthors* takes the same list, *authors*, as input. Unlike *rank_authors*, however, this function uses the input list to determine the coauthors of one year. This means that it extracts, in pairs, the authors who worked together on one single contribution and counts how often these pairs appear within one conference year. This function prepares the data for adding edges and their weight to the network.

- Function *find_new_authors*:

*find_new_authors* takes the output list from *rank_authors* as input and uses it to determine which authors have contributed to one conference but not the previous one. Further, it identifies authors who have never contributed to any DHd conference. This is done by comparing the authors lists of one year with the one from the previous year(s). Those new authors are stored in a list, which is used in the setting up of the network in order to mark those new authors.

- Function *convert_tuples_to_dict*:

Converts a list of tuples into a dictionary where the first tuple-item is the key and the second tuple-item is the value.

- Function *check_for_significant_teams*:

Takes the list of the 25 most significant, i.e. most contributing authors as input and the list of coauthors per year. Iterating over these lists, the function then checks whether two of the significant authors collaborated on a conference document and saves this team of authors. 

- Function *count_team_appearances*:

Gets the list of significant teams returned by *check_for_significant_teams* as input and counts how often each team collaborated.
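
The pairing of coauthors described above can be sketched with `itertools.combinations` and `collections.Counter`. This is a minimal illustration with hypothetical author names, not the implementation below; like the notebook's version, it keeps the order in which the authors are listed in a document, so ('Author A', 'Author B') and ('Author B', 'Author A') would be counted as different pairs.

```python
from collections import Counter
from itertools import combinations


def coauthor_pairs(documents):
    """Count how often each pair of authors appears together in one year's documents."""
    pairs = Counter()
    for document_authors in documents:
        # combinations yields every pair once; single-author documents yield none
        pairs.update(combinations(document_authors, 2))
    # most_common sorts the pairs by their count in descending order
    return pairs.most_common()


# hypothetical conference year with three contributions
toy_year = [
    ['Author A', 'Author B', 'Author C'],
    ['Author A', 'Author B'],
    ['Author D'],
]
toy_pairs = coauthor_pairs(toy_year)
```

Each resulting tuple has the form ((name1, name2), count), which is the structure the network code later reads when it adds weighted edges.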

In [190]:
def count_appearances_descending(input_list):
    
    count_dict = {}
    # for each item in the input list, check if it is already in dictionary
    # if not, add and set count to 1, if yes add +1 to count
    for item in input_list:
        if item not in count_dict.keys():
            count_dict[item] = 1
        else:
            count_dict[item] += 1
    # sort dictionary according to highest count in the values
    sorted_dict = sorted(count_dict.items(), key=lambda x: x[1], reverse=True)

    # return the sorted dictionary (becomes list through sorting though)
    return sorted_dict

In [199]:
def rank_authors(authors):

    all_counted_authors = []
    for year in authors:
        authors_list_per_year = [] 
        for team in year:
            # create one long list of all authors of one year (not separated by author team)
            authors_list_per_year = authors_list_per_year + team
        all_counted_authors.append(count_appearances_descending(authors_list_per_year))
        
    return all_counted_authors

In [204]:
def find_coauthors(authors):
    
    coauthors_per_year = []
    for conference_year in authors:
        coauthors_list = []
        for document_authors in conference_year:
            # check if text has at least two authors
            if len(document_authors) >= 2:
                for i in range(len(document_authors)-1):
                    for j in range(i+1, len(document_authors), 1):
                        coauthors_list.append((document_authors[i], document_authors[j]))              
        dict_coauthors_count = count_appearances_descending(coauthors_list)      
        coauthors_per_year.append(dict_coauthors_count) 
        
        
    return coauthors_per_year

In [12]:
def find_new_authors(all_counted_authors):
    
    # returns, per year, the list of people who did not participate in any previous DHd conference (the first year has no reference year and stays empty)
    new_authors = [[]]
    all_prev_authors = []
    for i in range(1, len(all_counted_authors)):
        current_year = all_counted_authors[i]
        prev_year = authors[i-1]
        
        prev_authors = []
        completely_new_authors = []
        
        for item in prev_year:
            prev_authors = prev_authors + item
        all_prev_authors = all_prev_authors + prev_authors        
        for element in current_year:
            name, count = element
            if name not in all_prev_authors:
                completely_new_authors.append(name)
        new_authors.append(completely_new_authors)
    
    return new_authors
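
The comparison with all earlier years can also be expressed with sets. The following sketch is an assumption-laden illustration, not the function above: `first_time_authors` and the toy names are hypothetical, and the input is taken to be one flat list of author names per year.

```python
def first_time_authors(authors_per_year):
    """Return, per year, the authors who appear in no earlier year."""
    seen = set()
    new_per_year = []
    for position, year_authors in enumerate(authors_per_year):
        current = set(year_authors)
        # the first year has no earlier conference to compare against
        new_per_year.append([] if position == 0 else sorted(current - seen))
        # remember everyone seen so far for the following years
        seen |= current
    return new_per_year


# hypothetical author lists of three consecutive conference years
toy_authors_per_year = [['A', 'B'], ['B', 'C'], ['A', 'D', 'C']]
toy_new = first_time_authors(toy_authors_per_year)
```

Set membership avoids repeatedly scanning a growing list of earlier authors, which matters little at this corpus size but keeps the intent of the comparison explicit.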

In [101]:
def convert_tuples_to_dict(tuple_list):
    
    new_dict = {}
    for tuple in tuple_list:
        item1, item2 = tuple
        new_dict[item1] = item2

    return new_dict   

In [105]:
def check_for_significant_teams(list_significant_DHumanists, coauthors):
    
    teams = []
    for i in range(len(list_significant_DHumanists)-1):
        for j in range(i+1, len(list_significant_DHumanists), 1):
            k = 0
            for year in coauthors:
                for team in year:
                    #check if team of authors consists of only the most significant authors (i.e. the 25 most contributing ones)
                    if list_significant_DHumanists[i][0] in team[0][0] and list_significant_DHumanists[j][0] in team[0][1]:
                            teams.append(team)
                k += 1
                
    return teams

In [108]:
def count_team_appearances(teams):

    teams_dict = {}
    for element in teams:
        author_team, team_count = element
        if author_team not in teams_dict.keys():
            teams_dict[author_team] = team_count
        else:
            teams_dict[author_team] += team_count

    return teams_dict

### RQ5: Functions - Part 2: Network 

- Function *determine_node_shape*:

Depending on how often the author contributed to the DHd conference in the respective year, i.e. whether they are a very frequent author or not, this function returns a string which then determines the author's node's shape in the final network. 

- Function *add_nodes*:

Creates a node for every author that contributed to the DHd in that specific year. Depending on how often that author contributed to the conference in the form of writing an abstract, the shape of the node is determined. 

- Function *change_color_new_authors*:

For every author appearing in the new_authors list, the node color is changed from pink to orange, so that they can be clearly seen in the network.

- Function *determine_edge_color*:

The function *determine_edge_color* is applied while setting up the network. Based on the so-called coauthor-count, which is determined by the function *find_coauthors*, this function determines the color of the network's edges depending on how often the two authors have worked together in one conference year. 

- Function *add_edges*:

For every team of coauthors, the edges between the nodes are added by referring to the authors' nodes by their names. The edges' colors are determined by how often the two authors collaborated in that DHd year, which happens in the function *determine_edge_color*.

- Function *generate_html_file*:

Generates an html file containing all the nodes and edges created previously. Additionally, a legend is added to the html, so that the colors and shapes within the network can be understood. 



In [14]:
def determine_node_shape(total_count, topcount):
    
    if total_count == topcount:
        return 'star'
    else:
        return 'dot'

In [207]:
def add_nodes(all_counted_authors, i):
    
    top_name, top_count = all_counted_authors[i][0]
    for author in all_counted_authors[i]:
        name, total_count = author
        title = (name, total_count)
        g.add_node(name, 
                title=title, 
                label=name, 
                size=(total_count*10), 
                # pink
                color='#C70039',
                borderWidth=1, 
                borderWidthSelected=3,
                # determine the nodes' shape according to whether author is top author with most contributions in that year (==star) or not (==dot)
                shape = determine_node_shape(total_count, top_count))
        
    return

In [215]:
def change_color_new_authors(new_authors, i):
       
    if len(new_authors[i]) > 0:
        for author in new_authors[i]:
            g.get_node(author)['color'] = 'orange'
        
    return len(new_authors[i])    

In [205]:
def determine_edge_color(coauthor_count):
      
      # function to determine the color of the network's edges, depending on how often authors worked together in that year
      # dark blue
      if coauthor_count >= 3:
            return '#360BFA'
      # light blue
      elif coauthor_count == 2:
            return '#6CADFB'
      # turquoise
      else: 
        return '#1FA3B5'

In [227]:
def add_edges(coauthors_per_year, i):
    
    for item in coauthors_per_year[i]:
        name1 = item[0][0]
        name2 = item[0][1]
        coauthor_count = item[1]
    
        # edge color depending on how often authors worked together
        g.add_edge(name1, name2, 
                    width=(coauthor_count),
                    title=(coauthor_count),
                    color = determine_edge_color(coauthor_count))
    
    return

In [225]:
def generate_html_file(filenames_xml, i):
    
    # opening the provided HTML code which has to be added to the network html file
    with open('Misc/LegendHTML.txt', "r", encoding='utf-8') as legend:
        html_addition = legend.read()
    
    # writing an html-file
    html = g.generate_html()
    name = 'Figures/RQ5/RQ5__Authors_Networks_' + str(filenames_xml[i][-4:]) + '.html'
    
    with open(str(name), mode='w', encoding='utf-8') as fp:        
        # finding the proper place in the html document and inserting the additional markup for legend
        find = re.search(r'<div id="mynetwork" class="card-body"></div>', html)
        end = find.end()+1

        html = html[:end] + html_addition + html[end:]   
        fp.write(html)

### RQ5: Main

In [203]:
# needed for network
all_counted_authors = rank_authors(authors)
coauthors_per_year = find_coauthors(authors)
new_authors = find_new_authors(all_counted_authors)

#needed for cooccurrence matrix
most_significant_DHumanists_list = count_appearances_descending(all_authors)[:25]
most_significant_DHumanists_dict = convert_tuples_to_dict(most_significant_DHumanists_list)
significant_DH_teams = check_for_significant_teams(most_significant_DHumanists_list, coauthors_per_year)
significant_teams_count = count_team_appearances(significant_DH_teams)

[[(('Jannidis, Fotis', 'Reger, Isabella'), 3), (('Engelhardt, Claudia', 'Kurzawe, Daniel'), 2), (('Kleineberg, Michael', 'Kaden, Ben'), 2), (('Fischer, Frank', 'Göbel, Mathias'), 2), (('Fischer, Frank', 'Kampkaspar, Dario'), 2), (('Göbel, Mathias', 'Kampkaspar, Dario'), 2), (('Trilcke, Peer', 'Kampkaspar, Dario'), 2), (('Jannidis, Fotis', 'Pielström, Steffen'), 2), (('Pielström, Steffen', 'Reger, Isabella'), 2), (('Stiller, Juliane', 'Thoden, Klaus'), 2), (('Meißner, Cordula', 'Wallner, Franziska'), 2), (('Gradl, Tobias', 'Henrich, Andreas'), 2), (('Hoppe, Stephan', 'Pfarr-Harfst, Mieke'), 1), (('Hoppe, Stephan', 'Münster, Sander'), 1), (('Hoppe, Stephan', 'Kuroczyński, Piotr'), 1), (('Hoppe, Stephan', 'Blümel, Ina'), 1), (('Hoppe, Stephan', 'Hauck, Oliver'), 1), (('Hoppe, Stephan', 'Lutteroth, Jan'), 1), (('Pfarr-Harfst, Mieke', 'Münster, Sander'), 1), (('Pfarr-Harfst, Mieke', 'Kuroczyński, Piotr'), 1), (('Pfarr-Harfst, Mieke', 'Blümel, Ina'), 1), (('Pfarr-Harfst, Mieke', 'Hauck, Oliv

### RQ5: Network Visualizations
In the next few lines of code, the network itself is being created. The nodes have to be added as well as the edges and their weights. In the end, the general algorithm of the network as well as some other parameters are defined. 

In [229]:
# implementing the statistics for each year
statistics_completely_new_authors = []

i = 0
for year in filenames_xml:
      
  # implementing the network itself for each year
  g = Network(height='600px', width='100%', cdn_resources='remote', select_menu=True, font_color='black', filter_menu=True, neighborhood_highlight=True)
  nxg = nx.complete_graph(0)
  g.from_nx(nxg)
  
  # adding nodes and edges, determining colors and shapes, writing html file
  add_nodes(all_counted_authors, i)
  number_new_authors = change_color_new_authors(new_authors, i)
  statistics_completely_new_authors.append(number_new_authors)
  add_edges(coauthors_per_year, i)
  generate_html_file(filenames_xml, i)
  
  i += 1

### RQ5: DataFrames for the collaborations of significant authors and for a statistic of (new) authors

In [118]:
#creating a dataFrame that contains the names of significant authors and the times they collaborated
cooccurrence_matrix = pd.DataFrame(index = most_significant_DHumanists_dict.keys(), columns = most_significant_DHumanists_dict.keys())
cooccurrence_matrix['Total contributions to conference'] = most_significant_DHumanists_dict.values()

#determining the right cell to enter the count of the team
for duo in significant_teams_count:
    cooccurrence_matrix.loc[duo[0], duo[1]] = significant_teams_count[duo]
cooccurrence_matrix.fillna(0).to_csv('Figures/RQ5/RQ5__Cooccurrence_Matrix_Significant_Authors.csv')

In [222]:
total_number_authors = []
for year in all_counted_authors:
     total_number_authors.append(len(year))
     
d = {'Total Number of Contributing Authors': total_number_authors,
     'New Authors': statistics_completely_new_authors}

df_averages = pd.DataFrame(data = d, index = filenames_xml)
df_averages['% of Completely New Authors'] = df_averages['New Authors'].div(df_averages['Total Number of Contributing Authors'])

df_averages.T.to_csv('Figures/RQ5/RQ5__NewAuthors.csv')

### RQ5: Bar Chart Visualization of New Authors

In [223]:
bar_chart = pygal.Bar(style=custom_style, x_title='Years Observed', y_title='Total Numbers', title='Detailed Analysis of Authors and Contributions', truncate_legend = -1)
bar_chart.title = 'Contributors to DHd Conferences'
bar_chart.x_labels = filenames_xml
bar_chart.add('All Contributors', df_averages.T.iloc[0])
bar_chart.add('No Previous Contribution', df_averages.T.iloc[1])
bar_chart.render_to_file('Figures/RQ5/RQ5__ContributorsAnalysis.svg')

==========================================================================================================================

==========================================================================================================================

==========================================================================================================================

## Research Question 6: Clustering of (Teams of) Authors and Certain Research Topics

### *Which clusters of researchers can be found with regard to topics and how have the clusters been changing?* 

### RQ6: Functions

- Function *get_authors_and_topics*:

This function takes two lists and one dictionary as input. One list - authors_full_list - contains all author names extracted from the xml files, sorted alphabetically. The second list contains the authors of each text so that it can be determined which text was written by whom. The input dictionary is the one returned by the function *find_document_topics*, which is already applied in RQ1, containing the most prominent topic of each document. These three input variables are used to infer which author has written on which (most salient) topic. This information is stored in the dictionary authors_and_topics, where each author name is a key and the values are lists containing the topics an author has written about. 
In order to make later visualizations more readable and understandable, the data is reduced to authors who have contributed to DHd-conferences frequently. In the code below, this reduction takes place in *create_vectors*, which only keeps authors with more than the given minimum number of contributions (8 in the main part).

In this function, only the documents from index 231 on are taken into account. Those documents are the ones springing from xml files where the documents' authors are noted down in the xml markup.

- Function *create_vectors*:

*create_vectors* takes the dictionary from *get_authors_and_topics* as input and transforms the information on the authors' contributions to topics into a sparse vector representation. Through this, a vector with the length of the final topic number is created for each author (i.e. key of the dictionary), meaning that the vector contains a 0 where the author did not contribute to a topic. For topics the author contributed to, the vector contains an integer stating how often the author contributed to that topic. 
The function returns the dictionary with authors (keys) and vectors (values), as well as a list of only the vectors and a list of only the authors' names. 
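
The vector representation can be illustrated with a small example. This is a sketch with hypothetical names and data, not the function below; it assumes that topics are numbered from 1, so that topic n is counted at position n-1 of the vector.

```python
def topic_count_vectors(authors_and_topics, num_topics, min_contributions):
    """Turn each author's list of topics into a vector of counts per topic."""
    vectors = {}
    for author, topics in authors_and_topics.items():
        # only authors with more than min_contributions documents are kept
        if len(topics) > min_contributions:
            counts = [0] * num_topics
            for topic in topics:
                # topics are assumed to be numbered from 1
                counts[topic - 1] += 1
            vectors[author] = counts
    return vectors


# hypothetical authors with the main topics of their documents
toy_authors_and_topics = {'Author A': [1, 1, 3], 'Author B': [2]}
toy_vectors = topic_count_vectors(toy_authors_and_topics, num_topics=3, min_contributions=1)
```

Because every vector has the same length, the authors can be compared directly with a distance measure, which is what the later clustering relies on.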

In [236]:
def get_authors_and_topics(authors_full_list, all_author_teams, main_topics, index_first_xml_document):
    
    authors_and_topics = {}
    for name in sorted(authors_full_list):
        topics_per_author = []
        # document ids start at index_first_xml_document (231), because the authors are only
        # noted down in the markup of the xml files, i.e. from document 231 on
        document_id = index_first_xml_document
        # iterating over the author team of every xml document, looking for the current name
        for document in all_author_teams:
            for author in document:
                # if the name matches an author of the document, use the document id to look up its salient topic
                if name == author:
                    document_topic = main_topics[document_id]
                    topics_per_author.append(document_topic)
            document_id += 1
        # storing the collected topics once per author, after all documents have been checked
        authors_and_topics[name] = topics_per_author

    return authors_and_topics

In [260]:
def create_vectors(authors_and_topics, minimum_number_of_contributions):

    # transforming the data from authors_and_topics into a vector representation
    # advantage: easier to plot and counts how often each topic was written on by the authors
    vector = {}
    for key in authors_and_topics:
        # keeping only authors with more than minimum_number_of_contributions contributions
        if len(authors_and_topics[key]) > minimum_number_of_contributions:
            # creating a vector with a length corresponding to the number of topics
            vector[key] = [0] * final_num_topics
            # topics are numbered from 1, so topic n is counted at position n-1
            for digit in authors_and_topics[key]:
                vector[key][digit - 1] += 1
    # retrieving the author names to use them as labels and the vectors to determine distances between the vectors
    only_authors = [key for key in vector]
    only_vectors = [vector[key] for key in vector]

    return vector, only_vectors, only_authors

### RQ6: Main

In [261]:
index_first_xml_document = 231
# slicing the dict main_topics because this RQ only needs the documents from 2016-2023, which begin at index 231
main_topics_xml = dict(itertools.islice(main_topics.items(), index_first_xml_document, len(textnames)))  
authors_and_topics = get_authors_and_topics(authors_full_list, all_author_teams, main_topics_xml, index_first_xml_document)
vector, only_vectors, only_authors = create_vectors(authors_and_topics, 8)

### RQ6: Dot Chart Visualization

In [249]:
# dividing the dictionary into at most three parts of roughly equal size, to make the output plots more readable
number_of_parts = 3
vector_items = list(vector.items())
part_size = max(1, math.ceil(len(vector_items) / number_of_parts))
for i, start in enumerate(range(0, len(vector_items), part_size), start=1):
    part = dict(vector_items[start : start + part_size])
    
    dot_chart = pygal.Dot(human_readable=True, width=800, height=700, truncate_legend=20, style=custom_style, legend_box_size=6)
    dot_chart.title = 'Authors with >8 contributions and the topics they wrote about'
    dot_chart.x_labels = range(1, final_num_topics+1)
    
    # adding one row of dots per author
    for key in part:
        dot_chart.add(key, part[key])

    # dot_chart.render_in_browser(human_readable=True)
    name = 'Figures/RQ6/RQ6__Authors_and_Topics_' + str(i) + '.svg'
    dot_chart.render_to_file(name)

### RQ6: Dendrogram Visualization

In [301]:
create_dendrogramm(only_vectors, 'Authors', 'Closeness of Authors Calculated by Topic Vectors', only_authors, 70,'Figures/RQ6/RQ6__Dendrogram.svg')
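
The body of *create_dendrogramm* is not shown in this section, so the following is only a minimal sketch of how topic-count vectors could be clustered hierarchically with `scipy.cluster.hierarchy`. The author names and vectors are invented. Ward linkage on Euclidean distances and the cut into two clusters are assumptions made for illustration and need not match the settings used in the notebook.

```python
import numpy as np
from scipy.cluster import hierarchy

# Hypothetical author names and topic-count vectors (one row per author)
author_names = ["Ada", "Ben", "Cleo", "Dan"]
topic_vectors = np.array([
    [4, 0, 0],
    [3, 1, 0],
    [0, 0, 5],
    [0, 1, 4],
])

# Assumption: Ward linkage on Euclidean distances between the vectors
linkage_matrix = hierarchy.linkage(topic_vectors, method="ward")

# no_plot=True computes the tree layout without drawing a figure
tree = hierarchy.dendrogram(linkage_matrix, labels=author_names, no_plot=True)
print(tree["ivl"])  # leaf labels in the order they would appear in the plot

# Cutting the tree into two flat clusters of authors
cluster_ids = hierarchy.fcluster(linkage_matrix, t=2, criterion="maxclust")
print(dict(zip(author_names, cluster_ids)))
```

Authors with similar topic profiles (here Ada and Ben, and Cleo and Dan) are merged first and end up in the same flat cluster.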