<a href="https://colab.research.google.com/github/ryderwishart/biblical-machine-learning/blob/main/topic_modelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# !pip install pyLDAvis
# !pip install gensim==4.3.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Imports

In [None]:
import gensim
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel
from gensim.models import EnsembleLda
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis
import os
import codecs

## Build Corpus

In [None]:
import os
os.listdir('ancient-greek-word2vec')

['.git',
 'query_vector_space_models.ipynb',
 'data',
 'build_greek_fasttext_model.ipynb',
 '.gitignore',
 'build_greek_w2v_model.ipynb',
 'README.md',
 'LICENSE',
 'models']

In [None]:
!git clone https://github.com/ryderwishart/ancient-greek-word2vec.git

Cloning into 'ancient-greek-word2vec'...
remote: Enumerating objects: 32795, done.
remote: Counting objects: 100% (914/914), done.
remote: Compressing objects: 100% (885/885), done.
remote: Total 32795 (delta 27), reused 910 (delta 26), pack-reused 31881
Receiving objects: 100% (32795/32795), 308.04 MiB | 22.80 MiB/s, done.
Resolving deltas: 100% (107/107), done.
Updating files: 100% (35227/35227), done.


In [None]:
corpus_directories = [path for path in os.listdir('ancient-greek-word2vec/data') if not(path.startswith('.'))]
print('Directories found in data folder:', corpus_directories)

Directories found in data folder: ['papyri', 'corpus']


In [None]:
force_lowercase = True
use_lemma_disambiguation = False # Some lemmas are indicated with a numeric suffix (e.g., 'ὅτι2')

def tokenize(string):
    output = string
    if use_lemma_disambiguation:
        pass
    else:
        # Filter numeric digits from token
        output = ''.join(filter(lambda x: not x.isdigit(), string))
    if force_lowercase:
        return [token.lower() for token in output.split()]
    else:
        return output.split()
    
class MySentences(object):
    def __iter__(self):
        for corpus_dir in corpus_directories: # the directories where the text files are.
            for file in os.listdir(f'ancient-greek-word2vec/data/{corpus_dir}'): 
                if file.endswith(".txt"):
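                    path = f'ancient-greek-word2vec/data/{corpus_dir}/{file}'
                    # Open read-only and close the file when done; the Greek text is assumed to be UTF-8
                    with open(path, encoding='utf-8') as handle:
                        for line in handle:
                            tokens = tokenize(line)
                            if tokens:
                                yield tokens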
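
The effect of the two flags above can be checked on a short sample line. The following is a minimal sketch, not part of the original pipeline: `tokenize_line` is a hypothetical standalone variant of `tokenize` that takes the flags as arguments instead of reading globals, and the sample string is invented for illustration.

```python
def tokenize_line(line, force_lowercase=True, use_lemma_disambiguation=False):
    """Hypothetical standalone variant of tokenize() with the flags as arguments."""
    if not use_lemma_disambiguation:
        # Drop numeric suffixes such as the '2' in 'ὅτι2'
        line = ''.join(ch for ch in line if not ch.isdigit())
    tokens = line.split()
    return [t.lower() for t in tokens] if force_lowercase else tokens

# Invented sample line: one capitalized word and one disambiguated lemma
sample = 'Λόγος ὅτι2 θεός'
default_tokens = tokenize_line(sample)
disambiguated_tokens = tokenize_line(sample, use_lemma_disambiguation=True)
print(default_tokens)
print(disambiguated_tokens)
```

With the default flags the numeric suffix is stripped, so `ὅτι` and `ὅτι2` collapse into a single vocabulary entry; enabling lemma disambiguation keeps them distinct.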

In [None]:
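# Instantiate corpus reader (streams one tokenized line at a time)
sentences = MySentences()

# Build dictionary mapping each token to an integer id
dictionary = corpora.Dictionary(sentences)
# dictionary.filter_extremes(no_below=20, no_above=0.5)

# Gensim's LDA models expect bag-of-words vectors, not raw token lists.
# Note: this list holds the whole corpus in memory.
corpus = [dictionary.doc2bow(tokens) for tokens in sentences]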
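
Gensim's topic models consume each document as a bag of words: a list of `(token_id, count)` pairs rather than a list of strings. The sketch below illustrates that shape in plain Python. `build_vocab` and `to_bow` are hypothetical helpers written for illustration only; they are not Gensim's `Dictionary` implementation, and the toy documents are invented.

```python
from collections import Counter

def build_vocab(documents):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for tokens in documents:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens missing from vocab."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

# Two invented toy documents, already tokenized
docs = [['λόγος', 'θεός', 'λόγος'], ['θεός', 'κόσμος']]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(bows)
```

With Gensim itself, the corresponding step is `dictionary.doc2bow(tokens)`.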

## Train LDA Model

In [None]:
ensemble_workers = 4
num_models = 8
# Note from Gensim: after training all the models, some distance computations are required,
# which can also take quite some time. Using workers speeds this step up as well.
distance_workers = 4
num_topics = 10
passes = 20

ensemble = EnsembleLda(
    corpus=corpus,
    id2word=dictionary,
    num_topics=num_topics,
    passes=passes,
    num_models=num_models,
    topic_model_class=LdaModel,
    ensemble_workers=ensemble_workers,
    distance_workers=distance_workers
)

print(len(ensemble.ttda))
print(len(ensemble.get_topics()))

Process Process-21:
Traceback (most recent call last):
  File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.9/dist-packages/gensim/models/ensemblelda.py", line 437, in _generate_topic_models_worker
    _generate_topic_models(ensemble=ensemble, num_models=num_models, random_states=random_states)
  File "/usr/local/lib/python3.9/dist-packages/gensim/models/ensemblelda.py", line 412, in _generate_topic_models
    tm = ensemble.get_topic_model_class()(**kwargs)
  File "/usr/local/lib/python3.9/dist-packages/gensim/models/ldamodel.py", line 521, in __init__
    self.update(corpus, chunks_as_numpy=use_numpy)
  File "/usr/local/lib/python3.9/dist-packages/gensim/models/ldamodel.py", line 1006, in update
    gammat = self.do_estep(chunk, other)
  File "/usr/local/lib/python3.9/dist-packages/gensim/models/

Because we trained an ensemble model, we call `generate_gensim_representation()` to obtain a standard `LdaModel` instance, which is what the visualization library expects.

In [None]:
vis_model = ensemble.generate_gensim_representation()

## Evaluate Model

In [None]:
# Compute the coherence score
coherence_model_lda = CoherenceModel(
    model=vis_model,
    corpus=corpus,
    dictionary=dictionary,
    coherence='u_mass'
)
coherence_lda = coherence_model_lda.get_coherence()
print(f'Coherence Score: {coherence_lda:.3f}')
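
For reference, the `u_mass` measure is commonly defined as below. This is a sketch of the usual textbook form, and Gensim's smoothing constant and averaging may differ in detail. Here `w_1 ... w_N` are the top words of a topic, `D(w)` is the number of documents containing `w`, `D(w_i, w_j)` is the number containing both words, and `\epsilon` is a small smoothing constant.

```latex
C_{\mathrm{UMass}} = \frac{2}{N(N-1)} \sum_{i=2}^{N} \sum_{j=1}^{i-1} \log \frac{D(w_i, w_j) + \epsilon}{D(w_j)}
```

Because a word pair can never occur in more documents than either word alone, the scores are typically negative, and values closer to zero indicate more coherent topics.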

## Visualize Topics

In [None]:
# Visualize the topics using pyLDAvis
vis = gensimvis.prepare(vis_model, corpus, dictionary)
pyLDAvis.display(vis)

## Tuning the model ([source](https://radimrehurek.com/gensim/auto_examples/tutorials/run_ensemblelda.html#tuning))

Unlike with `LdaModel`, the number of resulting topics varies greatly depending on the clustering parameters.

You can pass those parameters either to the `recluster()` method or to the `EnsembleLda` constructor.

Adjust them until you get the number of topics you want, bearing in mind that forcing more topics may reduce their quality. If the ensemble contains too few topics to begin with, train a larger ensemble.

An epsilon smaller than the smallest distance makes no sense. Choose one that lies within the range of values in `asymmetric_distance_matrix`.

In [None]:
import numpy as np
shape = ensemble.asymmetric_distance_matrix.shape
without_diagonal = ensemble.asymmetric_distance_matrix[~np.eye(shape[0], dtype=bool)].reshape(shape[0], -1)
print(without_diagonal.min(), without_diagonal.mean(), without_diagonal.max())

ensemble.recluster(eps=0.09, min_samples=2, min_cores=2)

print(len(ensemble.get_topics()))
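
The range check described above can be tried on a toy matrix without training anything. This is a minimal sketch in plain Python: `off_diagonal_range` and `eps_in_range` are hypothetical helper names, and the distances are invented for illustration.

```python
def off_diagonal_range(matrix):
    """Return (min, mean, max) over all entries of a square matrix except the diagonal."""
    values = [
        value
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if i != j
    ]
    return min(values), sum(values) / len(values), max(values)

def eps_in_range(eps, matrix):
    """True if eps lies between the smallest and largest off-diagonal distance."""
    low, _, high = off_diagonal_range(matrix)
    return low <= eps <= high

# Toy asymmetric distances, invented for illustration only
toy_distances = [
    [0.0, 0.2, 0.6],
    [0.1, 0.0, 0.5],
    [0.4, 0.3, 0.0],
]
low, mean, high = off_diagonal_range(toy_distances)
print(low, mean, high)
print(eps_in_range(0.09, toy_distances), eps_in_range(0.3, toy_distances))
```

With the real ensemble, the same check could be run on `ensemble.asymmetric_distance_matrix.tolist()` before calling `recluster()`.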