# Language article code

## Generating parse data using slingshot and prosodic

This section calls functions defined in the slingshot sling [metrical_parsing.ipynb](https://github.com/quadrismegistus/slingshot/blob/master/slings/metrical_parsing.ipynb).

### LSA Corpus ("Six Authors")

#### Ngram

In [116]:
#!slingshot -code metrical_parsing -func parse_by_ngram -path small_corpus/txt -ext txt -savedir small_corpus/data_slingshot/new_parse_by_ngram -overwrite -parallel 8



#### Nsyll

In [117]:
#!slingshot -code metrical_parsing -func parse_by_nsyll -path txt -ext txt -savedir small_corpus/data_slingshot/new_parse_by_nsyll -overwrite -parallel 8



### AMP Corpus (Five Authors, Prose vs. Verse)

#### Ngram

In [118]:
#!slingshot -code metrical_parsing -func parse_by_ngram -path original_corpus/txt -ext txt -savedir original_corpus/data_slingshot/new_parse_by_ngram -overwrite -parallel 8



#### Nsyll

In [119]:
#!slingshot -code metrical_parsing -func parse_by_nsyll -path original_corpus/txt -ext txt -savedir original_corpus/data_slingshot/new_parse_by_nsyll -overwrite -parallel 8



### Metadata

In [120]:
folder2data = {}

# v1
# folder2data['small_corpus/data_slingshot/parse_by_ngram']={'corpus':'LSA', 'method':'ngram'}
# folder2data['small_corpus/data_slingshot/parse_by_nsyll']={'corpus':'LSA','method':'nsyll'}
# folder2data['original_corpus/data_slingshot/parse_by_ngram']={'corpus':'AMP','method':'ngram'}
# folder2data['original_corpus/data_slingshot/parse_by_nsyll']={'corpus':'AMP','method':'nsyll'}

# constraining windows to fall within punctuation marks
#folder2data['small_corpus/data_slingshot/parse_by_nsyll_withinphrase_2']={'corpus':'LSA','method':'nsyll','within_phrase':True}
#folder2data['original_corpus/data_slingshot/parse_by_nsyll_withinphrase']={'corpus':'AMP','method':'nsyll','within_phrase':True}
#folder2data['small_corpus/data_slingshot/parse_by_ngram_withinphrase']={'corpus':'LSA','method':'ngram','within_phrase':True}
#folder2data['original_corpus/data_slingshot/parse_by_ngram_withinphrase']={'corpus':'AMP','method':'ngram','within_phrase':True}

folder2data['small_corpus/data_slingshot/new_parse_by_nsyll']={'corpus':'LSA','method':'nsyll','within_phrase':True}
folder2data['original_corpus/data_slingshot/new_parse_by_nsyll']={'corpus':'AMP','method':'nsyll','within_phrase':True}
folder2data['small_corpus/data_slingshot/new_parse_by_ngram']={'corpus':'LSA','method':'ngram','within_phrase':True}
folder2data['original_corpus/data_slingshot/new_parse_by_ngram']={'corpus':'AMP','method':'ngram','within_phrase':True}


## Postprocessing

In [121]:
import os
import pandas as pd
import mpi_slingshot as sl

In [122]:
def path2meta_LSA(path):
    fn=os.path.splitext(os.path.basename(path))[0]
    author,text_type=fn.split('-')
    genre='verse' if 'shakespeare' in fn else 'prose'
    return {'author':author, 'text_type':text_type, 'id':fn, 'genre':genre,'lang':'en','title':''}

In [123]:
def path2meta_AMP(path):
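    # Filenames follow lang.genre.author.title.txt; split the basename so
    # that dots in directory names cannot break the unpacking.
    base=os.path.basename(path)
    lang,genre,author,title,_ = base.split('.')
    fn=os.path.splitext(base)[0]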
    text_type='O'
    genre = genre if genre!='poetry' else 'verse'
    return {'author':author, 'text_type':text_type, 'id':fn, 'lang':lang, 'genre':genre, 'title':title}
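
The two helpers above rely on two different filename conventions: `author-text_type.txt` for LSA and `lang.genre.author.title.txt` for AMP. The following is a minimal sketch of just the splitting step; the function names `parse_hyphen_name` and `parse_dotted_name` and both example file names are hypothetical and chosen only for illustration.

```python
import os

def parse_hyphen_name(path):
    """Split an LSA-style name, 'author-text_type.txt', into two fields."""
    stem = os.path.splitext(os.path.basename(path))[0]
    author, text_type = stem.split('-')  # raises ValueError on extra hyphens
    return author, text_type

def parse_dotted_name(path):
    """Split an AMP-style name, 'lang.genre.author.title.txt', into four fields."""
    lang, genre, author, title, _ext = os.path.basename(path).split('.')
    return lang, genre, author, title

# Hypothetical file names, not taken from either corpus.
lsa_fields = parse_hyphen_name('small_corpus/txt/shakespeare-sample.txt')
amp_fields = parse_dotted_name('original_corpus/txt/en.poetry.milton.lycidas.txt')
print(lsa_fields)
print(amp_fields)
```

Both splits unpack into a fixed number of names, so a file whose name has an extra hyphen or dot raises a `ValueError` rather than being silently mislabeled.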

In [124]:
def writegen_folder(ifolder,ifolder_data):
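    # Point at the folder's 'cache' subdirectory unless it is already given.
    if '/cache' not in ifolder: ifolder=os.path.join(ifolder,'cache')

    # Look up the corpus-specific parser: path2meta_LSA or path2meta_AMP.
    path2meta = globals()["path2meta_"+ifolder_data['corpus']]
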
    for path,path_ld in sl.stream_results(ifolder):
        if not path.endswith('.txt'): continue
        if 'ipynb_checkpoints' in path: continue
        path_meta = path2meta(path)

        for path_dx in path_ld:
            # Merge folder, file, and record fields; on a key clash the later dict wins.
            row_dx=dict( list(ifolder_data.items()) + list(path_meta.items()) + list(path_dx.items()))
            yield row_dx
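
Each output row in `writegen_folder` is built by concatenating the items of three dictionaries, so when a key appears more than once the value from the later dictionary is kept. The sketch below isolates that merge; the helper name `merge_rows` and the contents of `record` are hypothetical, since the actual keys returned by `sl.stream_results` are not shown here.

```python
def merge_rows(folder_meta, path_meta, record):
    """Merge three dicts into one row; later dicts override earlier ones."""
    return dict(
        list(folder_meta.items()) + list(path_meta.items()) + list(record.items())
    )

folder_meta = {'corpus': 'LSA', 'method': 'nsyll', 'within_phrase': True}
path_meta = {'author': 'shakespeare', 'genre': 'verse', 'id': 'shakespeare-sample'}
# Hypothetical parse record; the real field names may differ.
record = {'id': 'window-0', 'num_sylls': 10}

row = merge_rows(folder_meta, path_meta, record)
print(row['id'])
```

In this sketch the record's `id` replaces the file-level `id`, which is worth keeping in mind if the parse records share any key with the metadata dictionaries.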

In [125]:
def writegen():
    for fldr,fldr_data in folder2data.items():
        for dx in writegen_folder(fldr,fldr_data):
            # Overwrite 'method' with a label derived from the folder name.
            dx['method']=os.path.basename(fldr).replace('_2','').replace('parse_by_','')
            yield dx
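
Because `writegen` replaces the `method` value from `folder2data` with a label derived from the folder name, the labels in the output file can differ from `'ngram'` and `'nsyll'`. The sketch below applies the same string operations to the four active folders; the helper name `method_label` is hypothetical.

```python
import os

def method_label(folder):
    """Derive the method label from a results folder name, as writegen does."""
    return os.path.basename(folder).replace('_2', '').replace('parse_by_', '')

folders = [
    'small_corpus/data_slingshot/new_parse_by_nsyll',
    'original_corpus/data_slingshot/new_parse_by_nsyll',
    'small_corpus/data_slingshot/new_parse_by_ngram',
    'original_corpus/data_slingshot/new_parse_by_ngram',
]
labels = sorted({method_label(f) for f in folders})
print(labels)
```

For the `new_parse_by_*` folders this yields labels with a `new_` prefix, so any later filtering on `method` needs to match those strings rather than the values stored in `folder2data`.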

In [126]:
#sl.writegen('data.parse_by_ngram.txt', lambda: writegen_folder(folder_output1))
#sl.writegen('data.parse_by_nsyll.txt', lambda: writegen_folder(folder_output2))

In [127]:
# # ??
# os.chdir('/Users/ryan/Dropbox/PHD/Prose-Verse/experiments/language_article')
# !pwd

In [128]:
sl.writegen('data.parse_multi_methods.txt', writegen)

>> saved: data.parse_multi_methods.txt


In [130]:
!wc -l data.parse_multi_methods.txt

  271640 data.parse_multi_methods.txt
