## Test syllabification mods to Anna Conser's Greek-Poetry code

This notebook tests changes to Anna Conser's syllabification logic. It runs our "ground truth" sample through her routine, then runs the same sample through the new `use_moore_rules` syllabification routine. Tim Moore's rules, unlike Anna Conser's, essentially ignore punctuation and therefore allow syllables that span word boundaries.

The "ground truth" sample contains 87 lines of verse. The new code should process 85 of them correctly; the other two contain "it depends" exceptions to the rule against dividing a diphthong.

### Load the ground truth dataset

In [1]:
import csv

# Read the ground-truth CSV: the first row is the header, the rest are samples.
with open('SMP_ground_truth_corrected.csv', 'r', encoding='utf-8', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)
    data = list(reader)

print(header)
print(len(data))
print(data[0])

['play', 'line n', 'text', 'Ground truth', 'same syllabification?', 'Correct', 'difference', 'Notes']
87
['Hekabe', '1', 'ἥκω νεκρῶν κευθμῶνα καὶ σκότου πύλας', 'ἥ κω νε κρῶν κευ θμῶ να καὶσ κό του πύ λας', 'False', 'AC', 'καὶσ κό', 'division need between σ (ς) and ensuing letter']
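
The cells that follow index columns by position (`row[2]`, `row[3]`). As an alternative, the minimal sketch below reads rows with named columns using `csv.DictReader`. It works on an in-memory stand-in for the file, built from the header and first row printed above; the function name `read_samples` is hypothetical, and the Notes field is left empty for brevity.

```python
import csv
import io

def read_samples(handle):
    """Read ground-truth rows as dicts keyed by the CSV header."""
    return list(csv.DictReader(handle))

# In-memory stand-in for the CSV file, using the header printed above
# and the first sample row (Notes column left empty here).
sample = io.StringIO(
    'play,line n,text,Ground truth,same syllabification?,Correct,difference,Notes\n'
    'Hekabe,1,ἥκω νεκρῶν κευθμῶνα καὶ σκότου πύλας,'
    'ἥ κω νε κρῶν κευ θμῶ να καὶσ κό του πύ λας,False,AC,καὶσ κό,\n'
)

rows = read_samples(sample)
first = rows[0]
print(first['line n'], first['Ground truth'].split(' '))
```

Named access makes it harder to confuse the raw text with the expected syllables if the column order ever changes.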


### Run Anna Conser's rules

The cell prints only the "ground truth" samples (20 of 87) that were not syllabified as we expected.

In [2]:
import re
from collections import defaultdict
from Greek_Prosody.syllables import get_syllables

true_false_counts = defaultdict(int)
problem_line_n = []

for row in data:

    # Column 3 holds the expected syllables, separated by single spaces.
    gt_syllables = row[3].split(' ')

    # Strip all whitespace from the raw text (column 2) before syllabifying.
    ac_syllables = get_syllables(re.sub(r'\s+', '', row[2]))

    same_result = (gt_syllables == ac_syllables)

    true_false_counts[same_result] += 1

    if not same_result:

        problem_line_n.append(row[1])

        print()
        print('    ', row[1])
        print('LINE', row[2])
        print()
        print('gt  ', ' '.join(gt_syllables))
        print('ac  ', ' '.join(ac_syllables))
        print()

print()
for k, v in true_false_counts.items():
    print(k, v)

print()
print(problem_line_n)


     2
LINE λιπών, ἵνʼ ἅιδης χωρὶς ᾤκισται θεῶν,

gt   λι πώ ν,ἵ νʼἅι δης χω ρὶ ςᾤ κισ ται θε ῶν,
ac   λι πών, ἵ νʼἅ ι δης χω ρὶ ςᾤ κισ ται θε ῶν,


     3
LINE Πολύδωρος, ἑκάβης παῖς γεγὼς τῆς κισσέως

gt   Πο λύ δω ρο ς,ἑ κά βης παῖς γε γὼς τῆς κισ σέ ως
ac   Πο λύ δω ρος, ἑ κά βης παῖς γε γὼς τῆς κισ σέ ως


     4
LINE πριάμου τε πατρός, ὅς μʼ, ἐπεὶ φρυγῶν πόλιν

gt   πρι ά μου τε πα τρό ς,ὅς μʼ,ἐ πεὶ φρυ γῶν πό λιν
ac   πρι ά μου τε πα τρός, ὅςμʼ, ἐ πεὶ φρυ γῶν πό λιν


     6
LINE δείσας ὑπεξέπεμψε τρωικῆς χθονὸς

gt   δεί σα ςὑ πε ξέ πεμ ψε τρω ι κῆςχ θο νὸς
ac   δεί σα ςὑ πε ξέ πεμ ψε τρωι κῆςχ θο νὸς


     11
LINE πατήρ, ἵνʼ, εἴ ποτʼ ἰλίου τείχη πέσοι,

gt   πα τή ρ,ἵ νʼ,εἴ πο τʼἰ λί ου τεί χη πέ σοι,
ac   πα τήρ, ἵνʼ, εἴ πο τʼἰ λί ου τεί χη πέ σοι,


     13
LINE νεώτατος δʼ ἦ Πριαμιδῶν, ὃ καί με γῆς

gt   νε ώ τα τος δʼἦ Πρι α μι δῶ ν,ὃ καί με γῆς
ac   νε ώ τα τος δʼἦ Πρι α μι δῶν, ὃ καί με γῆς


     14
LINE ὑπεξέπεμψεν· οὔτε γὰρ φέρειν ὅπλα

gt   ὑ πε ξέ πεμ ψε ν·οὔ τε γ

### Run Tim Moore's rules

The cell prints only the "ground truth" samples (2 of 87) that were not syllabified as we expected. We expect these two to fail because they contain exceptions to the usual rule against splitting diphthongs.

In [3]:
import re
from collections import defaultdict
from Greek_Prosody.syllables import get_syllables

true_false_counts = defaultdict(int)
problem_line_n = []

for row in data:

    # Column 3 holds the expected syllables, separated by single spaces.
    gt_syllables = row[3].split(' ')

    # Same call as in the previous cell, but with the Moore rules enabled.
    ac_syllables = get_syllables(re.sub(r'\s+', '', row[2]), use_moore_rules=True)

    same_result = (gt_syllables == ac_syllables)

    true_false_counts[same_result] += 1

    if not same_result:

        problem_line_n.append(row[1])

        print()
        print('    ', row[1])
        print('LINE', row[2])
        print()
        print('gt  ', ' '.join(gt_syllables))
        print('ac  ', ' '.join(ac_syllables))
        print()

print()
for k, v in true_false_counts.items():
    print(k, v)

print()
print(problem_line_n)


     24
LINE σφαγεὶς ἀχιλλέως παιδὸς ἐκ μιαιφόνου,

gt   σφα γεὶ ςἀ χιλ λέως παι δὸ ςἐ κμι αι φό νου,
ac   σφα γεὶ ςἀ χιλ λέ ως παι δὸ ςἐ κμι αι φό νου,


     31
LINE ἑκάβης ἀίσσω, σῶμʼ ἐρημώσας ἐμόν,

gt   ἑ κά βη ςἀ ίσ σω, σῶ μʼἐ ρη μώ σα ςἐ μόν,
ac   ἑ κά βη ςἀίσ σω, σῶ μʼἐ ρη μώ σα ςἐ μόν,


True 85
False 2

['24', '31']
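
Since the two remaining mismatches are expected, the result can be turned into a simple regression check. The sketch below is one possible way to do that, not part of the original notebook; the names `EXPECTED_FAILURES` and `unexpected_failures` are hypothetical, and the line numbers are the ones printed above.

```python
# Line numbers expected to differ under use_moore_rules=True (from the output above).
EXPECTED_FAILURES = frozenset({'24', '31'})

def unexpected_failures(problem_lines, expected=EXPECTED_FAILURES):
    """Return (new_failures, newly_passing) relative to the expected set."""
    observed = set(problem_lines)
    new_failures = sorted(observed - expected, key=int)
    newly_passing = sorted(expected - observed, key=int)
    return new_failures, newly_passing

# Stand-in for the problem_line_n list produced by the cell above.
problem_line_n = ['24', '31']

new_failures, newly_passing = unexpected_failures(problem_line_n)
print('new failures:', new_failures)
print('newly passing:', newly_passing)
```

Both lists being empty means the routine still fails on exactly the two known exceptions and nothing else.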
