# Finding All Ancient-Greek Verbs from The First Epistle of Clement, using NLP

Please read [1-Clement-introduction](1-Clement-introduction.ipynb) to understand the tools in use.

**Spoiler**: 1 Clement counts 150 unique Ancient-Greek verbs.

## spaCy and odyCy
We start by importing `spaCy` and use it to load an `odyCy` pipeline for Ancient Greek (here `grc_odycy_joint_trf`).
Read more about odyCy [here](https://centre-for-humanities-computing.github.io/odyCy/).

In [1]:
import spacy
nlp = spacy.load("grc_odycy_joint_trf")

Next, we read the full Greek text of Clement of Rome's First Epistle to the Corinthians (*1 Clement*), available in the `data` folder.
Passing that text to the pipeline produces `clement_doc`, a spaCy `Doc` holding every token of *1 Clement* together with its lemma and other annotations.

Processing usually takes about 30 seconds.

In [2]:
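clement_source = 'data/clement-i-greek.txt'
# Read the Greek text as UTF-8; the `with` block closes the file afterwards
with open(clement_source, 'r', encoding='utf-8') as clement_file:
    clement_text = clement_file.read()
clement_doc = nlp(clement_text)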

After tokenization, the pipeline tags each token with its part of speech (PoS) and parses the text.
For verbs, the annotations carry further information: the lemma (the verb's dictionary form), its tense, form and voice, and its HEAD, the token that governs it in the dependency parse (not necessarily its subject).

We'll use the spaCy token attributes ([full list here](https://spacy.io/api/token#attributes)) to filter on `VERB` tokens.

In [3]:
def parse(token):
    print("\nINDEX", token.i, "\t",  # token.i is the token's position in the Doc
        "TEXT:",token.text,"\t",
         "ORTH:",token.orth_,"\t",
         "LEMMA:",token.lemma_,"\n",
         
         "TAG:",token.tag_,"\t",
         "DEP:",token.dep_,"\t",
         "SHAPE:",token.shape_,"\t",
         "IS_ALPHA:",token.is_alpha,"\t",
         "IS_STOP:",token.is_stop,"\n",
         
         "POS:",token.pos_,"\t",
         "MORPH (full):",token.morph,"\n",
         "Case:",token.morph.get('Case'),"\t",
         "Gender:",token.morph.get('Gender'),"\t",
         "Number:",token.morph.get('Number'),"\t",
         "HEAD:",token.head)

Here is the morphology of the first verb in the opening sentence of the document, *παροικοῦσα*, at index 5:

In [4]:
index = 5
parse(clement_doc[index]) # the first verb of that sentence: παροικοῦσα


INDEX 5 	 TEXT: παροικοῦσα 	 ORTH: παροικοῦσα 	 LEMMA: παροικέω 
 TAG: v-sppafn- 	 DEP: nmod 	 SHAPE: xxxx 	 IS_ALPHA: True 	 IS_STOP: False 
 POS: VERB 	 MORPH (full): Case=Nom|Gender=Fem|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Act 
 Case: ['Nom'] 	 Gender: ['Fem'] 	 Number: ['Sing'] 	 HEAD: ἐκκλησία
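
The `MORPH (full)` value printed above is a `|`-separated list of `Feature=Value` pairs. As a rough illustration of how that string is structured, here is a plain-Python sketch that splits it into a dictionary. The helper name `morph_to_dict` is hypothetical and not part of spaCy; inside the notebook, `token.morph.get(...)` as used in `parse` reads the features directly.

```python
def morph_to_dict(morph_string):
    """Split a 'Feature=Value|Feature=Value' string into a dict (hypothetical helper)."""
    features = {}
    for pair in morph_string.split("|"):
        if not pair:
            continue  # tolerate an empty string (a token without morphology)
        name, _, value = pair.partition("=")
        features[name] = value
    return features


# The morphology string printed for παροικοῦσα in the output above
paroikousa = morph_to_dict(
    "Case=Nom|Gender=Fem|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Act"
)
print(paroikousa["VerbForm"], paroikousa["Tense"], paroikousa["Voice"])
```

Read this way, the six features describe *παροικοῦσα* as a present active participle in the nominative feminine singular.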


# All Ancient Greek Verbs from Clement of Rome's First Letter

The function below performs the following actions:
1. Filters tokens on an optional part-of-speech argument (`VERB` in our case), skipping punctuation and stop words.
2. Counts the occurrences of each lemma.
3. Keeps only the lemmas occurring more than `limit` times; we will call it with `limit` = 0, as we want all the verbs.
4. Sorts the results by frequency, in descending order.
5. Displays the results.
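
The steps above can be sketched on toy data without loading the model. In this sketch the `(lemma, pos)` pairs are made up for illustration and are not counts from *1 Clement*, and `count_lemmas` is a hypothetical helper that returns the grouped counts instead of printing them.

```python
from collections import Counter, defaultdict


def count_lemmas(tagged_lemmas, limit, pos=None):
    """Group lemmas by frequency, keeping those occurring more than `limit` times."""
    # Step 1: filter on the optional part of speech
    lemmas = [lemma for lemma, tag in tagged_lemmas if pos is None or tag == pos]
    # Step 2: count the occurrences of each lemma
    lemma_counts = Counter(lemmas)
    # Step 3: keep only lemmas above the threshold, grouped by count
    count_to_words = defaultdict(list)
    for lemma, count in lemma_counts.items():
        if count > limit:
            count_to_words[count].append(lemma)
    # Step 4: sort by descending frequency, by code point within a count
    ordered = sorted(count_to_words.items(), key=lambda item: -item[0])
    return [(count, sorted(words)) for count, words in ordered]


# Illustrative data only, not taken from the text
toy = [("λέγω", "VERB"), ("ποιέω", "VERB"), ("λέγω", "VERB"),
       ("θεός", "NOUN"), ("ἔχω", "VERB"), ("λέγω", "VERB")]

# Step 5: display the results
for count, words in count_lemmas(toy, 0, "VERB"):
    print(count, ", ".join(words))
```

Because the comparison is `count > limit`, a limit of `0` keeps every lemma that occurs at least once, while a limit of `1` would already drop the lemmas that occur only once.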

In [5]:
from collections import Counter, defaultdict

def display_occurrences(doc, limit, pos=None):
    """
    Displays lemma occurrences in a spaCy Doc object, filtered by part-of-speech (POS)
    and an occurrence threshold.

    Args:
        doc (spacy.tokens.Doc): The spaCy Doc object to analyze.
        limit (int): The occurrence threshold; only lemmas occurring strictly more
                     than `limit` times are displayed.
        pos (str or None): An optional part-of-speech tag to filter by (e.g., "NOUN", "VERB").
                           If None, all POS are considered.
    """

    filtered_tokens = []
    if pos:
        filtered_tokens = [token for token in doc if (token.pos_ == pos and not token.is_punct and not token.is_stop)]
    else:
        filtered_tokens = [token for token in doc if (token.pos_ != "PUNCT" and not token.is_stop)]

    lemmas = [token.lemma_ for token in filtered_tokens]

    # Count occurrences and frequency
    lemma_counts = Counter(lemmas)
    count_to_words = defaultdict(list)

    # Only consider words occurring more than [limit]
    for word, count in lemma_counts.items():
        if count > limit:  
            count_to_words[count].append(word)

    # Sort by frequency (descending) and alphabetically within the same counts
    sorted_counts = sorted(count_to_words.items(), key=lambda x: (-x[0], x[1]))
    
    # Display results
    print(f"{len(doc)} tokens in total, but only {len(filtered_tokens)} are useful (filtered by POS: {pos if pos else 'All'}):")
    verbs_count = 0
    for count, words in sorted_counts:
        sorted_words = sorted(words)
        # Note: adds each distinct frequency value once, not once per lemma
        verbs_count += count
        print(f"\n{count} occurrence{'s' if count > 1 else ''}:")
        print(", ".join(sorted_words))
    print(f"The search found {verbs_count} verbs.")

The `display_occurrences` function is ready for our test.
We'll look only for *verbs*. Since we want to retrieve them all, we pass `0` as the `limit` argument, so every lemma with at least one occurrence is kept.

In [6]:
display_occurrences(clement_doc, 0, 'VERB')

11408 tokens in total, but only 1608 are useful (filtered by POS: VERB):

23 occurrences:
ποιέω

21 occurrences:
λέγω

18 occurrences:
ὁράω

16 occurrences:
εἶπον

15 occurrences:
ἐθέλω, ἔχω

12 occurrences:
δίδωμι

9 occurrences:
εὑρίσκω, λαμβάνω

8 occurrences:
λέγει·

7 occurrences:
γενέσθαι, γιγνώσκω, ζῶ, σώζω, ἐπιτελέω, ἐποίησεν, ὑπάρχω

6 occurrences:
γέγραπται, ἀτενίζω

5 occurrences:
γέγνομαι, λαμβάνω, μέλλω, στασιάζω, φημί, ἀκούω, ἐδόθη, ἐξέρχομαι, ἥκω, ὑποτάσσω

4 occurrences:
βούλομαι, γενόμενος, γίγνομαι, εἰσέρχομαι, εἶδον, πιστεύω, ποιήσαντος, ποιήσωμεν, ταπεινοφρονέω, ἄπειμι, ἐκέλευσεν, ἔρχομαι, ἡγέω

3 occurrences:
βουλομένοις, γέγραπται·, δεδομένην, διδάσκω, δοκέω, δύναμαι, δώσω, εὑρέθη, κολληθέω, λέγω, λέληθεν, νομίζω, νοήσωμεν, παρέρχομαι, ποιήσας, πολιτευόμενοι, πορεύομαι, προσέχω, φοβέω, φέρω, ψεύω, ἀγαπάω, ἀείρω, ἀναγγέλω, ἀναιρέω, ἀφίημι, ἐκήρυξεν, ἐλάλησεν, ἐξάλειψον, ἡγούμενοι, ἱκετεύω, ὀφείλω, ὑπακούω

2 occurrences:
Παιδεύω, βάλλω, γεννάω, γενόμενα, γιγνώσκω, 

Finding: **1 Clement counts 150 unique Ancient-Greek verbs.**
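
In the listing above, some lemmas appear under more than one count (for instance λαμβάνω under both 9 and 5 occurrences), and a few entries carry a trailing ano teleia (λέγει·). One possible cause of the repeated lemmas, which is an assumption here and has not been checked against the data, is that visually identical Greek characters can be encoded differently (for example an accent stored as tonos in one place and as oxia in another). The sketch below uses Python's standard `unicodedata` module to bring lemmas to NFC form and strip trailing punctuation before counting; `normalize_lemma` is a hypothetical helper, not part of spaCy or odyCy.

```python
import unicodedata
from collections import Counter


def normalize_lemma(lemma):
    """Return the lemma in NFC form without trailing punctuation (hypothetical helper)."""
    normalized = unicodedata.normalize("NFC", lemma)
    # NFC maps U+0387 (Greek ano teleia) to U+00B7 (middle dot), so strip the latter
    return normalized.rstrip("\u00b7.,;")


# Two encodings of the same word: alpha with oxia (U+1F71) vs alpha with tonos (U+03AC)
with_oxia = "λαμβ\u1f71νω"
with_tonos = "λαμβ\u03acνω"

raw_counts = Counter([with_oxia, with_tonos])
merged_counts = Counter(normalize_lemma(w) for w in [with_oxia, with_tonos])

print(len(raw_counts), "distinct strings before normalization")
print(len(merged_counts), "distinct strings after normalization")
```

If this were the cause, applying such a helper to `token.lemma_` before counting would merge the repeated entries. Entries that are inflected forms rather than dictionary forms (such as ἐποίησεν) would be unaffected by it.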

## The most common verbs
Now, to get only the most frequent verbs, i.e., those found *more than 10 times*, we pass `10` as the limit.

In [7]:
display_occurrences(clement_doc, 10, 'VERB')


11408 tokens in total, but only 1608 are useful (filtered by POS: VERB):

23 occurrences:
ποιέω

21 occurrences:
λέγω

18 occurrences:
ὁράω

16 occurrences:
εἶπον

15 occurrences:
ἐθέλω, ἔχω

12 occurrences:
δίδωμι
The search found 105 verbs.
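
To make explicit what the closing line of the output reports, the sketch below recomputes three different totals from the seven lemmas listed above: the number of distinct lemmas, the total number of occurrences, and the sum of the distinct frequency values, which is what `verbs_count` accumulates in `display_occurrences`. The dictionary is copied by hand from the output above; the variable names are illustrative.

```python
# Lemma frequencies copied from the output above (limit = 10)
top_verbs = {
    "ποιέω": 23, "λέγω": 21, "ὁράω": 18, "εἶπον": 16,
    "ἐθέλω": 15, "ἔχω": 15, "δίδωμι": 12,
}

distinct_lemmas = len(top_verbs)                       # one per lemma
total_occurrences = sum(top_verbs.values())            # one per token
sum_of_distinct_counts = sum(set(top_verbs.values()))  # one per frequency group

print("distinct lemmas:", distinct_lemmas)
print("total occurrences:", total_occurrences)
print("sum of distinct counts:", sum_of_distinct_counts)
```

Here the third figure, 105, matches the closing line of the output, while the seven distinct lemmas listed account for 120 occurrences.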
