# Using odyCy, spaCy and displaCy on the first words of *1 Clement*

[OdyCy](https://centre-for-humanities-computing.github.io/odyCy/) is an NLP library for Ancient Greek, capable of part-of-speech tagging, morphological analysis, dependency parsing, lemmatization and more. It is based on the popular [spaCy](https://spacy.io/) framework, which makes odyCy easy to use, scalable, reliable and modular.

A popular course on how to use spaCy is available here: https://course.spacy.io/en/

Install odyCy from the Hugging Face Hub using `pip`:

```bash
pip install https://huggingface.co/chcaa/grc_odycy_joint_trf/resolve/main/grc_odycy_joint_trf-0.7.0-py3-none-any.whl
```


## Starting
We'll start by importing both `spacy` and `displacy`; the latter will help us visualize dependencies and entities in this notebook.

In [1]:
import spacy
from spacy import displacy

The `data` folder contains two files: the full Greek text of Clement of Rome's First Epistle to the Corinthians (*1 Clement*) and a short extract with only its opening lines. Since this notebook only explores Natural Language Processing (NLP) techniques, the short extract keeps processing fast.
We'll load both texts so that we can build documents of different sizes:

In [2]:
# Full text (assumes the files are UTF-8 encoded)
clement_source = 'data/clement-i-greek.txt'
with open(clement_source, 'r', encoding='utf-8') as clement_file:
    clement_text = clement_file.read()

# Short extract
clement_source_short = 'data/clement-i-greek.short.txt'
with open(clement_source_short, 'r', encoding='utf-8') as clement_file_short:
    clement_text_short = clement_file_short.read()
# print(clement_text_short)

## Load and call spaCy pipelines

From there, we create a `clement_doc_short` document from the short extract, which we'll use for quick tests; the full-text `clement_doc` is built later in this notebook. Processing the short extract is fast: the NLP step takes about 3-4 seconds.

In [3]:
nlp = spacy.load("grc_odycy_joint_trf")
clement_doc_short = nlp(clement_text_short) # Short extract, takes about 3 seconds to process

After tokenization, spaCy parses the Doc and tags each token with its part of speech (PoS).
Linguistic annotations are available as token attributes ([full list here](https://spacy.io/api/token#attributes)).
The morphological features that are present vary with the part of speech.

Inflectional morphology is the process by which a root form of a word is modified by adding prefixes or suffixes that specify its grammatical function but do not change its part-of-speech. We say that a lemma (root form) is inflected (modified/combined) with one or more morphological features to create a surface form.
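
To make the feature notation concrete, here is a minimal sketch in plain Python (no spaCy needed) that splits a Universal Dependencies-style feature string, like the ones printed as `MORPH (full)` in this notebook, into a dictionary. The helper name `parse_feats` is hypothetical; it only mirrors the way `token.morph.get('Gender')` returns a list of values.

```python
def parse_feats(feats):
    """Split 'Case=Nom|Gender=Fem' into {'Case': ['Nom'], 'Gender': ['Fem']}."""
    result = {}
    if not feats:
        return result
    for pair in feats.split("|"):
        name, _, values = pair.partition("=")
        # A feature may carry several values separated by commas
        result[name] = values.split(",")
    return result

# Feature string reported by the model for the determiner τοῦ
feats = parse_feats("Case=Gen|Definite=Def|Gender=Masc,Neut|Number=Sing|PronType=Dem")
print(feats["Gender"])  # ['Masc', 'Neut']: the form is ambiguous between two genders
```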

Here are the morphological features found for just the first three words of the short extract (which is faster to process):

In [4]:
def parse(token):
    # token.i is the position of the token within its Doc
    print("\nINDEX", token.i, "\t",
          "TEXT:", token.text, "\t",
          "ORTH:", token.orth_, "\t",
          "LEMMA:", token.lemma_, "\n",

          "TAG:", token.tag_, "\t",
          "DEP:", token.dep_, "\t",
          "SHAPE:", token.shape_, "\t",
          "IS_ALPHA:", token.is_alpha, "\t",
          "IS_STOP:", token.is_stop, "\n",

          "POS:", token.pos_, "\t",
          "MORPH (full):", token.morph, "\n",
          "Case:", token.morph.get('Case'), "\t",
          "Gender:", token.morph.get('Gender'), "\t",
          "Number:", token.morph.get('Number'), "\t",
          "HEAD:", token.head)

index_max = 2
print("Processing the first 3 words:")
for token in clement_doc_short:
    if token.i > index_max:
        break
    parse(token)


Processing the first 3 words:

INDEX 0 	 TEXT: Ἡ 	 ORTH: Ἡ 	 LEMMA: ὁ 
 TAG: l-s---fn- 	 DEP: det 	 SHAPE: X 	 IS_ALPHA: True 	 IS_STOP: True 
 POS: DET 	 MORPH (full): Case=Nom|Gender=Fem|Number=Sing 
 Case: ['Nom'] 	 Gender: ['Fem'] 	 Number: ['Sing'] 	 HEAD: ἐκκλησία

INDEX 1 	 TEXT: ἐκκλησία 	 ORTH: ἐκκλησία 	 LEMMA: ἐκκλησία 
 TAG: n-s---fn- 	 DEP: ROOT 	 SHAPE: xxxx 	 IS_ALPHA: True 	 IS_STOP: False 
 POS: NOUN 	 MORPH (full): Case=Nom|Gender=Fem|Number=Sing 
 Case: ['Nom'] 	 Gender: ['Fem'] 	 Number: ['Sing'] 	 HEAD: ἐκκλησία

INDEX 2 	 TEXT: τοῦ 	 ORTH: τοῦ 	 LEMMA: ὁ 
 TAG: l-s---mg- 	 DEP: det 	 SHAPE: xxx 	 IS_ALPHA: True 	 IS_STOP: True 
 POS: DET 	 MORPH (full): Case=Gen|Definite=Def|Gender=Masc,Neut|Number=Sing|PronType=Dem 
 Case: ['Gen'] 	 Gender: ['Masc', 'Neut'] 	 Number: ['Sing'] 	 HEAD: θεοῦ


Look at the first two tokens above, "Ἡ ἐκκλησία":
* the determiner (`DEP: det`) `Ἡ` (Greek dictionary form, or `LEMMA`: `ὁ`) is attached (`HEAD`) to a noun, `ἐκκλησία`; both have the feminine `Gender` (`Fem`).
* that determiner is identified as a *stop word* (`IS_STOP: True`); stop words are commonly **removed** in text mining and Natural Language Processing (NLP) because they are so widely used that they carry very little useful information.
* as expected, the noun ἐκκλησία isn't a *stop word* (`IS_STOP: False`).
* as expected, the determiner has the same case, gender and number as its head, the noun ἐκκλησία.

Tagging a verb provides even more information, such as its lemma, tense, form and voice. `HEAD` is the word the verb attaches to in the dependency tree, which is not necessarily its subject: here the participle παροικοῦσα depends on the noun ἐκκλησία.

In [5]:
parse(clement_doc_short[5]) # the first verb of that sentence: παροικοῦσα


INDEX 5 	 TEXT: παροικοῦσα 	 ORTH: παροικοῦσα 	 LEMMA: παροικέω 
 TAG: v-sppafn- 	 DEP: nmod 	 SHAPE: xxxx 	 IS_ALPHA: True 	 IS_STOP: False 
 POS: VERB 	 MORPH (full): Case=Nom|Gender=Fem|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Act 
 Case: ['Nom'] 	 Gender: ['Fem'] 	 Number: ['Sing'] 	 HEAD: ἐκκλησία
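
The agreement noted above (a determiner or participle sharing case, gender and number with its head) can be checked mechanically. The sketch below is an illustration on hand-copied feature values, not part of odyCy; the helper `agrees` is hypothetical and treats a feature as agreeing when the two value lists share at least one value, which accommodates ambiguous forms such as `Masc,Neut`.

```python
AGREEMENT_FEATURES = ("Case", "Gender", "Number")

def agrees(dependent, head, features=AGREEMENT_FEATURES):
    """True when both words share at least one value for every listed feature."""
    return all(set(dependent.get(f, [])) & set(head.get(f, [])) for f in features)

# Feature values copied from the outputs printed in this notebook
ekklesia = {"Case": ["Nom"], "Gender": ["Fem"], "Number": ["Sing"]}
paroikousa = {"Case": ["Nom"], "Gender": ["Fem"], "Number": ["Sing"]}
tou = {"Case": ["Gen"], "Gender": ["Masc", "Neut"], "Number": ["Sing"]}

print(agrees(paroikousa, ekklesia))  # True
print(agrees(tou, ekklesia))         # False: τοῦ belongs with θεοῦ, not with ἐκκλησία
```

With real tokens, the dictionaries could be filled from `token.morph.get(...)`, which already returns lists.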


spaCy also provides the list of sentences it found, which we'll print as a comma-separated list:

In [6]:
sentence_spans = list(clement_doc_short.sents)
print(sentence_spans)

[Ἡ ἐκκλησία τοῦ θεοῦ ἡ παροικοῦσα ῾Ρώμην τῇ ἐκκλησίᾳ τοῦ θεοῦ τῇ παροικούσῃ Κόρινθον, κλητοῖς ἡγιασμένοις ἐν θελήματι θεοῦ διὰ τοῦ κυρίου ἡμῶν Ἰησοῦ Χριστοῦ., χάρις ὑμῖν καὶ εἰρήνη ἀπὸ παντοκράτορος θεοῦ διὰ Ἰησοῦ Χριστοῦ πληθυνθείη., 

I

1. Διὰ τὰς αἰφνιδίους καὶ ἐπαλλήλους γενομένας ἡμῖν συμφορὰς καὶ περιπτώσεις, βράδιον νομίζομεν ἐπιστροφὴν πεποιῆσθαι περὶ τῶν ἐπιζητουμένων παρ’ ὑμῖν πραγμάτων, ἀγαπητοί, τῆς τε ἀλλοτρίας καὶ ξένης τοῖς ἐκλεκτοῖς τοῦ θεοῦ, μιαρᾶς καὶ ἀνοσίου στάσεως ἣν ὀλίγα πρόσωπα προπετῆ καὶ αὐθάδη ὑπάρχοντα εἰς τοσοῦτον ἀπονοίας ἐξέκαυσαν, ὥστε τὸ σεμνὸν καὶ περιβόητον καὶ πᾶσιν ἀνθρώποις ἀξιαγάπητον ὄνομα ὑμῶν μεγάλως βλασφημηθῆναι., 2., τίς γὰρ παρεπιδημήσας πρὸς ὑμᾶς τὴν πανάρετον καὶ βεβαίαν ὑμῶν πίστιν οὐκ ἐδοκίμασεν; τήν τε σώφρονα καὶ ἐπιεικῆ ἐν Χριστῷ εὐσέβειαν οὐκ ἐθαύμασεν; καὶ τὸ μεγαλοπρεπὲς τῆς φιλοξενίας ὑμῶν ἦθος οὐκ ἐκήρυξεν;, καὶ τὴν τελείαν καὶ ἀσφαλῆ γνῶσιν οὐκ ἐμακάρισεν; 3. ἀπροσωπολήμπτως γὰρ πάντα ἐποιεῖτε καὶ ἐν τοῖς νομίμοις τοῦ θεοῦ ἐπο
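
The output above shows the chapter heading `I` and verse numbers such as `2.` ending up inside sentences, or even forming a sentence of their own. One possible remedy, sketched here under the assumption that these reference numbers are not needed for the analysis, is to strip them before calling `nlp`. The function name `strip_reference_numbers` and the two patterns are illustrative guesses based on the extract shown, so they should be checked against the full file.

```python
import re

def strip_reference_numbers(text):
    """Remove Roman-numeral chapter headings and 'N.' verse numbers, then tidy whitespace."""
    # Lines holding nothing but a Roman numeral, e.g. "I" or "XIV"
    text = re.sub(r"(?m)^[ \t]*[IVXLC]+[ \t]*$", "", text)
    # Stand-alone verse numbers such as "1." or "12."
    text = re.sub(r"(?<!\S)\d+\.(?=\s|$)", "", text)
    # Collapse runs of whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()

# Tiny sample assembled from fragments of the extract
parts = ["χάρις ὑμῖν.", "Διὰ τὰς συμφοράς.", "τίς γὰρ;"]
sample = parts[0] + "\n\nI\n\n1. " + parts[1] + " 2. " + parts[2]
print(strip_reference_numbers(sample))
```

The cleaned string could then be passed to `nlp` in place of the raw text; keeping the raw text around preserves the chapter and verse references for later lookup.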

# Lemmatization used for counting the most frequent occurrences

## Using the short extract first 
Lemmatization converts each word to its meaningful base form. 

In the following example, we'll use the short extract of *1 Clement*.

As we list the lemmas, we'll drop those of no interest: stop words and punctuation.
We'll then count the occurrences of each remaining lemma and keep only the lemmas found more than `limit` times (here, more than once).

In [7]:
from collections import Counter, defaultdict

# Handle on the lemmatizer component; not used below, since token.lemma_ is already set by the pipeline
lemmatizer = nlp.get_pipe("frequency_lemmatizer")

def display_occurrences(doc, limit):
    # Keep content tokens only: no punctuation, no stop words
    filtered_tokens = [token for token in doc if token.pos_ != "PUNCT" and not token.is_stop]
    lemmas = [token.lemma_ for token in filtered_tokens]

    # Count occurrences and group lemmas by frequency
    lemma_counts = Counter(lemmas)
    count_to_words = defaultdict(list)

    # Only consider lemmas occurring more than [limit] times
    for word, count in lemma_counts.items():
        if count > limit:
            count_to_words[count].append(word)

    # Display results by descending frequency, alphabetically within the same count
    print(len(doc), "tokens in total, but only", len(filtered_tokens), "are useful:")
    for count, words in sorted(count_to_words.items(), reverse=True):
        print(f"\n{count} occurrence{'s' if count > 1 else ''}:")
        print(", ".join(sorted(words)))

display_occurrences(clement_doc_short, 1)

208 tokens in total, but only 113 are useful:

6 occurrences:
θεός

4 occurrences:
σεμνός

3 occurrences:
χριστός

2 occurrences:
παρά, ποιέω, πάντα, ἐκκλησία, ἰησοῦς


# Lemmatization over the full text of *1 Clement*

Now let's do the same, counting the most frequent occurrences on the full document of *1 Clement*, considering only lemmas that appear more than 20 times.

Mind the time required to process the full text: this step takes around 30 seconds when run locally on a MacBook (the grey dot on the right side of this Jupyter notebook's header turns back to white when processing has finished).

In [8]:
clement_doc = nlp(clement_text) # Full version, takes 30 seconds to get processed
display_occurrences(clement_doc, 20)

11408 tokens in total, but only 5318 are useful:

91 occurrences:
θεός

40 occurrences:
χριστός

33 occurrences:
πᾶς, τίς

32 occurrences:
κύριος

28 occurrences:
γῆ, πάντα

26 occurrences:
ἰησοῦς

25 occurrences:
διά

24 occurrences:
ἀγάπη

23 occurrences:
ποιέω, πολύς

22 occurrences:
δόξα

21 occurrences:
λέγω


# Explicit citations of the Gospels in *1 Clement*

To find explicit citations of the Gospels in *1 Clement* from the Greek texts, we'll combine computational text analysis with philological methods. Here’s how:

- We'll use the Greek text of both *1 Clement* and the four Gospels
- For consistency, we'll tokenize and lemmatize all texts using the same pipeline, `spaCy`, with the same `grc_odycy_joint_trf` Greek model and lemmatizer we've used so far.

## Identify Candidate Gospel Citations

We're looking for:
- Direct Quotations: Look for passages in 1 Clement that closely match phrases or sentences in the Gospels. Some well-known examples include 1 Clem. 13:2 and 46:8, which are often discussed as possible citations or allusions to the Synoptics.
- Composite or Paraphrased Citations: Note that Clement often paraphrases or combines Gospel sayings, so look for partial matches or rewordings.

## Confirming previous studies

The results should at least agree with previous studies and with the opinions expressed by authors such as:
- Donald Alfred Hagner in his [chapter 8 of *The Use of the Old and New Testaments in Clement of Rome*, pages 272–312](https://brill.com/display/book/9789004266162/B9789004266162-s010.xml)
- Jacob J. Prahlow [in this blogpost on Scripture citations in *1 Clement*](https://pursuingveritas.com/2016/09/26/scripture-in-1-clement-composite-citation-of-the-gospels-part-i/)
- Glenn Miller [in this blogpost on *1 Clement* gospels citations](https://www.christian-thinktank.com/dumbdad2.html)

## Automated Search Strategies

- N-gram Matching: Generate n-grams (e.g., 5-10 words) from both 1 Clement and each Gospel. Use fuzzy matching (e.g., Levenshtein distance or cosine similarity on lemmatized n-grams) to find overlapping sequences.
- Lemmatized Comparison: Compare lemmatized versions to account for inflectional differences in Greek.
- Thresholds: Set similarity thresholds to reduce false positives but still catch paraphrases.
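
As one way to picture the strategies listed above, the following sketch scores pairs of lemma n-grams with cosine similarity and keeps those above a threshold. It is a minimal illustration, not a tuned method: the function names are hypothetical, and the two lemma lists are hand-typed toy data (loosely modelled on a "measure for measure" saying), not output from the odyCy pipeline.

```python
from collections import Counter
from math import sqrt

def ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

def cosine_similarity(a, b):
    """Cosine similarity between two bags of lemmas (word order is ignored)."""
    count_a, count_b = Counter(a), Counter(b)
    dot = sum(count_a[k] * count_b[k] for k in count_a.keys() & count_b.keys())
    norm = sqrt(sum(v * v for v in count_a.values())) * sqrt(sum(v * v for v in count_b.values()))
    return dot / norm if norm else 0.0

def similar_ngrams(clement, gospel, n=5, threshold=0.8):
    """Pairs of n-grams whose cosine similarity exceeds the threshold."""
    pairs = []
    gospel_ngrams = ngrams(gospel, n)
    for c in ngrams(clement, n):
        for g in gospel_ngrams:
            score = cosine_similarity(c, g)
            if score > threshold:
                pairs.append((c, g, score))
    return pairs

# Toy lemmas, hand-typed for illustration only
hos, metron, metreo, en, autos, sy = "ὅς", "μέτρον", "μετρέω", "ἐν", "αὐτός", "σύ"
clement_lemmas = [hos, metron, metreo, en, autos, metreo, sy]
gospel_lemmas = [en, hos, metron, metreo, metreo, sy]

print(round(cosine_similarity(clement_lemmas, gospel_lemmas), 3))
for c, g, score in similar_ngrams(clement_lemmas, gospel_lemmas):
    print(round(score, 3), " ".join(c), "|", " ".join(g))
```

Because cosine similarity ignores word order, it tolerates the reordering typical of paraphrase; an order-sensitive measure such as an edit distance would be stricter.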

## Manual Review and Contextual Analysis

From there, we could review all high-similarity matches manually, since Clement often adapts or combines sayings, and context is crucial for determining if a passage is a true citation or an allusion. Additionally, there's value in checking for *introductory formulas* (e.g., “the Lord said,” “He himself said,” “as it is written”) which may indicate a citation in *1 Clement*.
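
Checking for introductory formulas could also be partly automated. The sketch below scans a list of lemmas for short cue sequences; the cue patterns in `FORMULAS` are assumptions chosen for illustration (the lemmas the model actually assigns, and the word order Clement uses, would need to be verified), and the sample is a made-up lemma sequence.

```python
# Hypothetical cue patterns, written as sequences of lemmas
GRAPHO, KYRIOS, LEGO = "γράφω", "κύριος", "λέγω"
FORMULAS = {
    "it is written": (GRAPHO,),
    "the Lord said": (KYRIOS, LEGO),
}

def find_formulas(lemmas, formulas=FORMULAS):
    """Return sorted (position, label) pairs where a formula's lemmas occur consecutively."""
    hits = []
    for label, pattern in formulas.items():
        size = len(pattern)
        for i in range(len(lemmas) - size + 1):
            if tuple(lemmas[i:i + size]) == pattern:
                hits.append((i, label))
    return sorted(hits)

# Made-up lemma sequence, for demonstration only
sample = ["γάρ", LEGO, "μακάριος", GRAPHO, "γάρ", "ὁ", KYRIOS, LEGO]
print(find_formulas(sample))  # [(3, 'it is written'), (6, 'the Lord said')]
```

A hit near a high-similarity match would strengthen the case for a citation, but the final call remains a matter for manual review.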

## Document and Categorize Results

For each match, we could record

- The passage in Clement and the Gospel(s)
- The Greek text of both
- The degree of similarity (verbatim, paraphrase, composite)
- Whether an introductory formula is present
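
The fields listed above map naturally onto a small record type. This is only a sketch: the class name `CitationCandidate` and its field names are hypothetical, and the values in the example are placeholders, not findings.

```python
from dataclasses import dataclass, asdict

@dataclass
class CitationCandidate:
    clement_ref: str      # passage in 1 Clement, e.g. "1 Clem. 13:2"
    gospel_ref: str       # passage in the Gospel(s)
    clement_greek: str    # Greek text of the Clement passage
    gospel_greek: str     # Greek text of the Gospel passage
    similarity: float     # score produced by the automated search
    category: str         # "verbatim", "paraphrase" or "composite", decided on manual review
    has_formula: bool     # whether an introductory formula is present

# Placeholder values only
candidate = CitationCandidate(
    clement_ref="1 Clem. 13:2",
    gospel_ref="(to be filled in)",
    clement_greek="(Greek text)",
    gospel_greek="(Greek text)",
    similarity=0.0,
    category="paraphrase",
    has_formula=False,
)
print(asdict(candidate)["clement_ref"])  # 1 Clem. 13:2
```

Records of this kind convert to dictionaries with `asdict`, which makes them easy to export to CSV or JSON for review.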

## A first approach

Assume `lemmatized_clement` and `lemmatized_gospels` hold the lemmatized texts as lists of tokens. Adjust `n` and `threshold` in the code below to tune the sensitivity:
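
These lists are not built anywhere earlier in this notebook, so here is one possible way to derive them from a processed document. The helper `doc_lemmas` is hypothetical and relies only on the token attributes used earlier (`lemma_`, `pos_`, `is_stop`); stand-in tokens are used so that the sketch runs without loading the model.

```python
from types import SimpleNamespace

def doc_lemmas(doc, drop_stop_words=False):
    """Lemmas of all non-punctuation tokens, optionally without stop words."""
    return [
        token.lemma_
        for token in doc
        if token.pos_ != "PUNCT" and not (drop_stop_words and token.is_stop)
    ]

# Stand-in tokens with values copied from the parse output earlier in this notebook
fake_doc = [
    SimpleNamespace(lemma_="ὁ", pos_="DET", is_stop=True),
    SimpleNamespace(lemma_="ἐκκλησία", pos_="NOUN", is_stop=False),
    SimpleNamespace(lemma_="ὁ", pos_="DET", is_stop=True),
]
print(doc_lemmas(fake_doc))                        # ['ὁ', 'ἐκκλησία', 'ὁ']
print(doc_lemmas(fake_doc, drop_stop_words=True))  # ['ἐκκλησία']
```

In the notebook this could be called as `doc_lemmas(clement_doc)`. Whether to keep stop words is a design choice: they add noise to frequency counts, but they can matter when matching a quotation word for word.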


In [9]:
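from difflib import SequenceMatcher

def ngrams(tokens, n):
    # All runs of n consecutive tokens
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

def find_matches(clement, gospel, n=7, threshold=0.8, gospel_name="gospel"):
    # clement and gospel are lists of lemmas; gospel_name only labels the results.
    # Brute force: every Clement n-gram is compared with every Gospel n-gram,
    # which is slow on full texts.
    matches = []
    gospel_ngrams = ngrams(gospel, n)
    for clement_ngram in ngrams(clement, n):
        for gospel_ngram in gospel_ngrams:
            ratio = SequenceMatcher(None, clement_ngram, gospel_ngram).ratio()
            if ratio > threshold:
                matches.append({
                    "gospel": gospel_name,
                    "gospel_ngram": gospel_ngram,
                    "clement_ngram": clement_ngram,
                    "similarity": ratio
                })
    return matches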

# Noun chunks

Noun chunks are “base noun phrases”: flat phrases that have a noun as their head. You can think of a noun chunk as a noun plus the words describing it. The cell below inspects a single token, its head and its children:

In [10]:
def nounize(token):
    # Token text, dependency label, head text, head PoS and the token's direct children
    print(token.text, token.dep_, token.head.text, token.head.pos_, list(token.children))

nounize(clement_doc_short[2])

τοῦ det θεοῦ NOUN []


τοῦ is a determiner attached to its head word, θεοῦ, which is tagged as a noun; the empty list shows that τοῦ has no children of its own.
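
To show how a noun and "the words describing it" can be gathered from head links like these, here is a small sketch in plain Python. The head indices are hand-written: the links Ἡ → ἐκκλησία and τοῦ → θεοῦ come from the outputs in this notebook, while attaching θεοῦ to ἐκκλησία is an assumption made for the example. With real documents, spaCy's `token.subtree` provides the same kind of traversal.

```python
def subtree(index, heads):
    """Indices of a token and all its direct and indirect dependents, in text order."""
    found = {index}
    changed = True
    while changed:
        changed = False
        for i, head in enumerate(heads):
            if head in found and i not in found:
                found.add(i)
                changed = True
    return sorted(found)

words = ["Ἡ", "ἐκκλησία", "τοῦ", "θεοῦ"]
# Index of each word's head; the root points to itself, as in spaCy
heads = [1, 1, 3, 1]

print(" ".join(words[i] for i in subtree(3, heads)))  # τοῦ θεοῦ
print(" ".join(words[i] for i in subtree(1, heads)))  # Ἡ ἐκκλησία τοῦ θεοῦ
```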

# Visualization

Using spaCy’s built-in displaCy [visualizer](https://spacy.io/usage/visualizers),
here’s what our first sentence and its dependencies look like:

In [11]:
displacy.serve(sentence_spans[0], style="dep", host="localhost", auto_select_port=True)


Using the 'dep' visualizer
Serving on http://localhost:5001 ...

Shutting down server on port 5001.


# Topic Modelling
Let's now explore the topics covered in *1 Clement* and present them as a graph. The cells below are placeholders and are currently disabled:

In [13]:
# from bertopic import BERTopic

In [14]:
# Disabled draft: tagging each line of the text with a flair model
# from flair.data import Sentence
# from flair.models import SequenceTagger
# tagger = SequenceTagger.load('SuperPeitho-FLAIR-v2/final-model.pt')
#
# with open("data/clement-i-greek.txt", "r", encoding="utf-8") as testfile:
#     test_list = testfile.readlines()
#
# with open("morph_analysis_outputs.txt", "w", encoding="utf-8") as outfile:
#     for testitem in test_list:
#         sentence = Sentence(testitem)
#         tagger.predict(sentence)
#         outputs = sentence.get_spans('pos')
#         for output in outputs:
#             outfile.write(str(output) + "\n")
#         outfile.write("\n")