# Example Whole-File Analysis

The following cells load the text of the Rāmāyaṇa, which has been prepared ahead of time to have one verse per line (verses consisting of three or four half-verses are broken into two), for a total of 19,354 verses.

In [1]:
from tqdm import tqdm # for progress bar, pip install if desired
from datetime import datetime, date

In [2]:
input_fn = 'R_cleaned.txt'
output_fn = input_fn[:input_fn.find('.')] + '_results' + input_fn[input_fn.find('.'):]
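
The same output name can also be derived with the standard library's `os.path.splitext`, which splits at the last dot rather than the first. The following is only a minimal sketch; the helper name `results_filename` is hypothetical and not part of the notebook.

```python
import os

def results_filename(input_fn, suffix='_results'):
    """Insert `suffix` between the base name and the extension of `input_fn`."""
    base, ext = os.path.splitext(input_fn)  # ext is '' when there is no dot
    return base + suffix + ext

print(results_filename('R_cleaned.txt'))  # prints R_cleaned_results.txt
```

For a file name without any dot, `find('.')` returns -1 and the slice-based expression would misplace the suffix, whereas `splitext` simply returns an empty extension.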

In [3]:
with open(input_fn, 'r', encoding='utf-8') as input_f:
    input_data = input_f.read()
verses = input_data.split('\n')
print(len(verses))

19354


Here **skrutable**'s MeterIdentifier object is imported and instantiated.

In [4]:
from skrutable.meter_identification import MeterIdentifier
MI = MeterIdentifier()

And here the verses are fed one at a time to the MeterIdentifier. To achieve maximum speed while maintaining accuracy, we can take advantage of preexisting expert annotation. Those verses (generally samavṛtta) that are already marked with all pāda breaks (ab ';', bc '/', cd ';') do not need to be resplit, so resplit_option='none' (i.e., a single identification step) suffices for them. On the other hand, those verses (generally anuṣṭubh) for which only the halfway point is marked (bc '/') must be resplit to find the exact location of the breaks (e.g., in the case of jāti verses, or because some verses may be hypo- or hypermetric). The correct resplit is generally not very far away, however, so resplit_option='resplit_lite', aided by the further config variable resplit_lite_keep_midpoint, strikes the right balance. (If no breaks at all had been marked, resplit_option='resplit_max' would give basically the same results, although it is much less computationally efficient.)
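
The decision rule just described can be isolated into two small helpers, which makes it easy to check without running skrutable at all. This is a sketch under assumptions: the function names `split_verse_line` and `choose_resplit_option` are hypothetical, and the sample lines are schematic placeholders rather than real verses.

```python
def split_verse_line(line, marker='// '):
    """Split a line into (verse content including the marker, verse label).

    Uses str.partition, so a line without the marker comes back whole
    with an empty label.
    """
    content, sep, label = line.partition(marker)
    return content + sep, label

def choose_resplit_option(v_content):
    """Return 'none' when both ';' pāda breaks are marked, else 'resplit_lite'."""
    return 'none' if v_content.count(';') == 2 else 'resplit_lite'

# Schematic placeholder lines (not real verses): a, b, c, d stand for the four pādas.
fully_marked = 'a ; b / c ; d // 1.001.001'
midpoint_only = 'a b / c d // 1.001.002'

for line in (fully_marked, midpoint_only):
    v_content, v_label = split_verse_line(line)
    print(v_label, choose_resplit_option(v_content))
```

Keeping the rule in its own function also makes it simple to swap in a different fallback (for instance 'resplit_max') when experimenting.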

In [5]:
with open(output_fn, 'w', encoding='utf-8') as output_f:

    starting_time = datetime.now()
    for v in tqdm(verses):
        # everything up to and including '// ' is the verse; the rest is its label, e.g. "1.001.001"
        split_at = v.find('// ') + 3
        v_content, v_label = v[:split_at], v[split_at:]
        if v_content.count(";") == 2:
            resplit_option = 'none' # all pāda breaks already marked
        else:
            resplit_option = 'resplit_lite' # only the midpoint is marked
        result = MI.identify_meter(v_content, resplit_option=resplit_option, from_scheme='IAST')
        # result = MI.identify_meter(v_label, resplit_option=resplit_option, from_scheme='IAST') # or this
        # result = MI.identify_meter(v, resplit_option=resplit_option, from_scheme='IAST') # or this
        output_f.write(v + '\t' + result.meter_label + '\n')
        # output_f.write(v + '\n\n' + result.summarize() + '\n') # or this

    # subtracting two datetimes directly stays correct even if the run crosses midnight
    duration_secs = (datetime.now() - starting_time).total_seconds()
    output_f.write("samāptam: %d padyāni, %f kṣaṇāḥ" % (len(verses), duration_secs))

100%|██████████| 19354/19354 [00:02<00:00, 7387.01it/s]


## Discussion

On a MacBook Pro 2020 (2 GHz Quad-Core Intel Core i5 processor), the total time to ascertain the meter of all 19,354 verses depends on how many resplits are performed. The speed baseline, which corresponds to performing no resplits and attempting to identify each verse once as given (flat resplit_option='none'), is under 3 seconds; without explicit pāda breaks, however, this attempt almost always fails. Given the good quality of the input data, either of the two other resplit options ("max" or "lite") corrects equally well for the missing pāda breaks. The "max" option tries all possible splits and clocks in at 93 seconds, whereas the "lite" option tries only relevant splits and thereby finishes much faster, in only 6 seconds, with the same accuracy. Conditionally resplitting only when necessary (i.e., skipping the resplit when v_content.count(";") == 2) provides another small improvement, bringing the best time to about 5 seconds.
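
A comparison like the one above can be reproduced with a small timing harness. The following is only a sketch under stated assumptions: `time_strategy` and `fake_identify` are hypothetical names, the stand-in identifier does no real work (so the timings it prints say nothing about skrutable), and the sample lines are schematic placeholders.

```python
import time

def time_strategy(verses, choose_option, identify):
    """Run `identify` on every verse with the option picked by `choose_option`.

    Returns (elapsed seconds, list of results).
    """
    start = time.perf_counter()
    results = [identify(v, choose_option(v)) for v in verses]
    return time.perf_counter() - start, results

# Stand-in for the real identifier so that the sketch runs without skrutable;
# it simply echoes the option it was given.
def fake_identify(verse, resplit_option):
    return resplit_option

strategies = {
    'flat none': lambda v: 'none',
    'flat lite': lambda v: 'resplit_lite',
    'flat max': lambda v: 'resplit_max',
    'conditional': lambda v: 'none' if v.count(';') == 2 else 'resplit_lite',
}

sample = ['a ; b / c ; d // ', 'a b / c d // ']  # schematic, not real verses
for name, choose in strategies.items():
    secs, results = time_strategy(sample, choose, fake_identify)
    print('%-12s %s %.6f s' % (name, results, secs))
```

To time the real identifier, `fake_identify` could be replaced by a wrapper such as `lambda v, opt: MI.identify_meter(v, resplit_option=opt, from_scheme='IAST')`, using the `MI` object instantiated earlier.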