<a href="https://colab.research.google.com/github/ryderwishart/biblical-machine-learning/blob/main/domain_topic_modelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install pyLDAvis
!pip install gensim==4.3.0
!pip install pandas

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyLDAvis
  Downloading pyLDAvis-3.4.0-py3-none-any.whl (2.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.6/2.6 MB[0m [31m15.7 MB/s[0m eta [36m0:00:00[0m
Collecting funcy
  Downloading funcy-1.18-py2.py3-none-any.whl (33 kB)
Collecting joblib>=1.2.0
  Downloading joblib-1.2.0-py3-none-any.whl (297 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m298.0/298.0 KB[0m [31m12.4 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: funcy, joblib, pyLDAvis
  Attempting uninstall: joblib
    Found existing installation: joblib 1.1.1
    Uninstalling joblib-1.1.1:
      Successfully uninstalled joblib-1.1.1
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pandas-profiling 3.2.0 requires joblib~=1.1.0, but

## Imports

In [2]:
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel
from gensim.models import EnsembleLda
from gensim.corpora import Dictionary
from collections import Counter
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis
import os
import codecs
import pandas as pd

## Build Corpus

In [3]:
if 'macula-greek.tsv' not in [path for path in os.listdir()]:
    !wget -q 'https://raw.githubusercontent.com/Clear-Bible/macula-greek/main/Nestle1904/TSV/macula-greek.tsv'
if 'marble-domain-label-mapping.json' not in [path for path in os.listdir()]:
    !wget -q 'https://raw.githubusercontent.com/Clear-Bible/macula-greek/main/sources/MARBLE/SDBG/marble-domain-label-mapping.json'

In [4]:
os.listdir()

['.config',
 'macula-greek.tsv',
 'marble-domain-label-mapping.json',
 'sample_data']

In [18]:
# Import Macula Greek data
mg = pd.read_csv('macula-greek.tsv', sep='\t', header=0, dtype='str')
mg['domains'] = mg['domains'].astype(str).fillna('missing')

# Extract book, chapter, and verse into separate columns
mg[['book', 'chapter', 'verse']] = mg['ref'].str.extract(r'(\d?[A-Z]+)\s(\d+):(\d+)')

# Add columns for book + chapter, and book + chapter + verse for easier grouping
mg['book_chapter'] = mg['book'] + ' ' + mg['chapter'].astype(str)
mg['book_chapter_verse'] = mg['book_chapter'] + ':' + mg['verse'].astype(str)

# Display the updated data frame
mg.head()

Unnamed: 0,xml:id,ref,role,class,type,gloss,text,after,lemma,normalized,...,domain,ln,frame,subjref,referent,book,chapter,verse,book_chapter,book_chapter_verse
0,n40001001001,MAT 1:1!1,,noun,common,[The] book,Βίβλος,,βίβλος,Βίβλος,...,033005,33.38,,,,MAT,1,1,MAT 1,MAT 1:1
1,n40001001002,MAT 1:1!2,,noun,common,of [the] genealogy,γενέσεως,,γένεσις,γενέσεως,...,010002 033003,10.24 33.19,,,,MAT,1,1,MAT 1,MAT 1:1
2,n40001001003,MAT 1:1!3,,noun,proper,of Jesus,Ἰησοῦ,,Ἰησοῦς,Ἰησοῦ,...,093001,93.169a,,,,MAT,1,1,MAT 1,MAT 1:1
3,n40001001004,MAT 1:1!4,,noun,proper,Christ,Χριστοῦ,,Χριστός,Χριστοῦ,...,093001,93.387,,,,MAT,1,1,MAT 1,MAT 1:1
4,n40001001005,MAT 1:1!5,,noun,common,son,υἱοῦ,,υἱός,υἱοῦ,...,010002,10.30,,,,MAT,1,1,MAT 1,MAT 1:1


In [64]:
# Import domain-label mapping
import json

# Open the JSON file
with open('marble-domain-label-mapping.json', 'r') as f:

    # Load the contents of the file as a dictionary
    domain_labels = json.load(f)

domain_labels['missing'] = 'no domain'

# Display the resulting dictionary
count = 0
for d, l in domain_labels.items():
    print(d, l)
    if count > 5:
        break
    count += 1

001 Geographical Objects and Features
001001 Universe, Creation
001002 Regions Above the Earth
001003 Regions Below the Surface of the Earth
001004 Heavenly Bodies
001005 Atmospheric Objects
001006 The Earth's Surface


Let's filter out stopwords using a list from [Perseus](https://wiki.digitalclassicist.org/Stopwords_for_Greek_and_Latin).

In [65]:
perseus_stopwords = "μή, ἑαυτοῦ, ἄν, ἀλλ', ἀλλά, ἄλλος, ἀπό, ἄρα, αὐτός, δ', δέ, δή, διά, δαί, δαίς, ἔτι, ἐγώ, ἐκ, ἐμός, ἐν, ἐπί, εἰ, εἰμί, εἴμι, εἰς, γάρ, γε, γα, ἡ, ἤ, καί, κατά, μέν, μετά, μή, ὁ, ὅδε, ὅς, ὅστις, ὅτι, οὕτως, οὗτος, οὔτε, οὖν, οὐδείς, οἱ, οὐ, οὐδέ, οὐκ, περί, πρός, σύ, σύν, τά, τε, τήν, τῆς, τῇ, τι, τί, τις, τίς, τό, τοί, τοιοῦτος, τόν, τούς, τοῦ, τῶν, τῷ, ὑμός, ὑπέρ, ὑπό, ὡς, ὦ, ὥστε, ἐάν, παρά, σός".split(', ')
perseus_stopwords += "συ".split(' ')
# filter the DataFrame to exclude rows with class values in the exclude_list
filtered_mg = mg[~mg['normalized'].isin(perseus_stopwords)] # Check normalized forms
filtered_mg = filtered_mg[~filtered_mg['lemma'].isin(perseus_stopwords)] # Check lemmas
difference = len(mg) - len(filtered_mg)
print(f'{difference} rows removed using stopwords')
# Also exclude certain parts of speech
pos_exclude_list = ['det', 'prep', 'pron', 'conj', 'ptcl']
cur_len = len(filtered_mg)
filtered_mg = filtered_mg[~filtered_mg['class'].isin(pos_exclude_list)]
print(f'{cur_len - len(filtered_mg)} more rows removed using parts of speech')

69098 rows removed using stopwords
3462 more rows removed using parts of speech


In [66]:
assert filtered_mg['lemma'].isin(perseus_stopwords).unique() == [False]

Create an ensemble topic model for each book of the New Testament, using lemmas and ignoring stopwords.

In [103]:
book_domains = {}  # Initialize a dictionary to hold the corpora for each book

def get_label(domainstr):
    if type(domainstr) != float: 
        first_domain = ''.join(domainstr.split()[0])
        return domain_labels[first_domain]
    else:
        return 'missing'

def get_domains(group):
    return group['domain'].apply(get_label).values.tolist()

all_domains = []

for book in filtered_mg['book'].unique():  # Loop over unique books in the dataframe
    print(f'Processing {book}...')
    book_df = filtered_mg[filtered_mg['book'] == book]  # Filter the dataframe for the current book
    corpus = []  # Initialize an empty list to hold the tokens for the current book
    grouped = book_df.groupby('book_chapter').apply(get_domains)
    # Loop over rows in the filtered dataframe
    for chapter in grouped:  
        corpus += [[str(value) for value in chapter]]  # Add the tokens to the corpus for the current book
        all_domains += [[str(value) for value in chapter]]
    book_domains[book] = corpus  # Add the corpus for the current book to the dictionary

Processing MAT...
Processing MRK...
Processing LUK...
Processing JHN...
Processing ACT...
Processing ROM...
Processing 1CO...
Processing 2CO...
Processing GAL...
Processing EPH...
Processing PHP...
Processing COL...
Processing 1TH...
Processing 2TH...
Processing 1TI...
Processing 2TI...
Processing TIT...
Processing PHM...
Processing HEB...
Processing JAS...
Processing 1PE...
Processing 2PE...
Processing 1JN...
Processing 2JN...
Processing 3JN...
Processing JUD...
Processing REV...


In [104]:
# Create a dictionary for all words in each corpus
dictionary = corpora.Dictionary(all_domains)
dictionary


<gensim.corpora.dictionary.Dictionary at 0x7f4005089580>

In [105]:
models = dict()

for book, domains in book_domains.items():
    for d in domains:
        for e in d:
            if type(e) != str:
                print(d)
    models[book] = {
        'corpus': [dictionary.doc2bow(chapter) for chapter in domains], 
        'model': None 
    }

## Train LDA Model

In [106]:
ensemble_workers = 4
num_models = 3
distance_workers = 4 # Note from Gensim: After training all the models, some distance computations are required which can take quite some time as well. You can speed this up by using workers for that as well.
num_topics = 100
passes = 20

synoptics = ['MAT', 'MRK', 'LUK']
johannine = ['JHN', '1JN', '2JN', '3JN', 'REV']

subcorpora = {
    'synoptics': ['MAT', 'MRK', 'LUK'], 
    'johannine': ['JHN', '1JN', '2JN', '3JN', 'REV']
}

for subcorp in subcorpora.keys():
    models[subcorp] = {
        'corpus': [dictionary.doc2bow(chapter) for chapter in domains for book in subcorpora[subcorp]],
        'model': None
    }

In [87]:
# domain_labels[dictionary[token_id]]
for i in dictionary.items():
    print(i)
    break

(0, '009003')


In [107]:
# Select book or subcorpus
selected_corpus = 'synoptics'

print(dictionary[0])
# Map domain string tokens to semantic labels

# dictionary.id2token = {k: domain_labels[''.join(v.split()[0])] for k, v in dictionary.items() if v != 'nan'}

print(domain_labels[dictionary[0]])

A Point of Time with Reference to Other Points of Time: Before, Long Ago, Now, At the Same Time, When, About, After


KeyError: ignored

In [108]:
for book, data in models.items():
    if book == selected_corpus:
        print(f'Generating ensemble models for {book}...')
        ensemble = EnsembleLda(
            corpus=data['corpus'],
            id2word=dictionary,
            num_topics=num_topics,
            passes=passes,
            num_models=num_models,
            topic_model_class=LdaModel,
            ensemble_workers=ensemble_workers,
            distance_workers=distance_workers
        )
        data['ensemble'] = ensemble
        data['model'] = ensemble.generate_gensim_representation()

Generating ensemble models for synoptics...




In [109]:
vis_model = models[selected_corpus]['model']
vis_corpus = models[selected_corpus]['corpus']

Since we trained an ensemble model, we need to use the `generate_gensim_representation()` method to return an `LdaModel` class as the visualization library is expecting.

## Evaluate Model

In [110]:
# Compute the coherence score
coherence_model_lda = CoherenceModel(
    model=vis_model,
    corpus=vis_corpus,
    dictionary=dictionary,
    coherence='u_mass'
)
coherence_lda = coherence_model_lda.get_coherence()
print(f'Coherence Score: {coherence_lda:.3f}')

Coherence Score: -0.657


## Visualize Topics

In [111]:
# Visualize the topics using pyLDAvis
vis = gensimvis.prepare(vis_model, vis_corpus, dictionary)
pyLDAvis.display(vis)

  default_term_info = default_term_info.sort_values(


## Tuning the model ([source](https://radimrehurek.com/gensim/auto_examples/tutorials/run_ensemblelda.html#tuning))

Different from `LdaModel`, the number of resulting topics varies greatly depending on the clustering parameters.

You can provide those in the `recluster()` function or the `EnsembleLda` constructor.

Play around until you get as many topics as you desire, which however may reduce their quality. If your ensemble doesn't have enough topics to begin with, you should make sure to make it large enough.

Having an epsilon that is smaller than the smallest distance doesn't make sense. Make sure to chose one that is within the range of values in `asymmetric_distance_matrix`.

In [None]:
import numpy as np
shape = ensemble.asymmetric_distance_matrix.shape
without_diagonal = ensemble.asymmetric_distance_matrix[~np.eye(shape[0], dtype=bool)].reshape(shape[0], -1)
print(without_diagonal.min(), without_diagonal.mean(), without_diagonal.max())

ensemble.recluster(eps=0.09, min_samples=2, min_cores=2)

print(len(ensemble.get_topics()))

0.0 0.3163309121943778 1.0




17
