In [1]:
# Code
import sys
sys.path.insert(0,'..')
from generative_formalism import *

printm('# Detecting memorized poems')

printm('## From open-source training data')
printm('### Antoniak et al (Dolma)')

documentation(get_antoniak_et_al_memorization_data)
documentation(preprocess_antoniak_et_al_memorization_data)

df_antoniak_et_al_memorization_data = get_antoniak_et_al_memorization_data()
display(df_antoniak_et_al_memorization_data)

printm('### From Dolma + Chadwyck-Healey')
documentation(get_memorized_poems_in_dolma)
df_mem_chadwyck_open = get_memorized_poems_in_dolma()
display(df_mem_chadwyck_open)

printm('## From memorization detection from closed-source models')
documentation(get_memorized_poems_in_completions)
df_mem_chadwyck_closed = get_memorized_poems_in_completions()
display(df_mem_chadwyck_closed)

printm('## All together')
documentation(get_all_memorization_data)
df_mem = get_all_memorization_data(force=True)
display(df_mem.groupby(['found_corpus', 'found_source','found']).size())
display(df_mem)

# Detecting memorized poems

## From open-source training data

### Antoniak et al (Dolma)

##### `get_antoniak_et_al_memorization_data`: Load Antoniak et al. memorization data with caching support.

##### `preprocess_antoniak_et_al_memorization_data`: Preprocess Antoniak et al. memorization data from raw files.

* Loading from `{REPO}/data/raw/memorization/data.antoniak_et_al_memorization_results.csv.gz`

id,found,found_source,found_corpus,author,birth_death_dates,title,txt,form,form_group,tags,...,author_link,pub_year,extracted_birth_year,extracted_death_year,form_tags,theme_tags,occasion_tags,collected_from,author_dob_str,author_dob
poem/alas-tis-true-i-have-gone-here-and-there-sonnet-110,True,open,antoniak-et-al,William Shakespeare,1564 – 1616,"Alas, 'tis true I have gone here and there (So...","Alas, ’tis true I have gone here and there\nAn...",sonnet,verse forms,"['Sonnet', 'Public Domain']",...,https://poets.org/poet/william-shakespeare,1904.0,1564.0,1616.0,['Sonnet'],['Public Domain'],[],Academy of American Poets,1564,1564.0
43742/sonnets-from-the-portuguese-43-how-do-i-love-thee-let-me-count-the-ways,False,open,antoniak-et-al,Elizabeth Barrett Browning,1806–1861,Sonnets from the Portuguese 43: How do I love ...,How do I love thee? Let me count the ways.\nI ...,sonnet,verse forms,"['Related Audio', 'Living', 'Marriage & Compan...",...,https://www.poetryfoundation.org/poets/elizabe...,,1806.0,1861.0,[],[],[],Poetry Foundation,1806,1806.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142852/essay-on-craft,False,closed,antoniak-et-al,,,,,,,,...,,,,,,,,,,
54525/kora-in-hell-improvisations-ii,False,closed,antoniak-et-al,,,,,,,,...,,,,,,,,,,
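
The "caching support" mentioned for `get_antoniak_et_al_memorization_data` presumably follows a load-or-rebuild pattern: read the preprocessed `.csv.gz` if it exists, otherwise preprocess the raw files and write the result. The sketch below illustrates that pattern with the standard library only. `load_with_cache`, `fake_preprocess`, and the `force` handling are hypothetical stand-ins, not the package's actual implementation.

```python
import csv
import gzip
import os
import tempfile


def load_with_cache(cache_path, preprocess, force=False):
    """Return rows from cache_path, rebuilding them with preprocess() when needed."""
    if os.path.exists(cache_path) and not force:
        # Cache hit: read the gzipped CSV back as a list of dicts.
        with gzip.open(cache_path, "rt", newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    # Cache miss (or force=True): recompute and overwrite the cache.
    # Assumes preprocess() returns a non-empty list of dicts with string values.
    rows = preprocess()
    with gzip.open(cache_path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return rows


calls = []  # records how often the (hypothetical) preprocessing step runs


def fake_preprocess():
    calls.append(1)
    return [
        {"id": "poem/a", "found": "True"},
        {"id": "poem/b", "found": "False"},
    ]


cache_file = os.path.join(tempfile.mkdtemp(), "demo.csv.gz")
first = load_with_cache(cache_file, fake_preprocess)               # builds the cache
second = load_with_cache(cache_file, fake_preprocess)              # reads the cache
third = load_with_cache(cache_file, fake_preprocess, force=True)   # rebuilds
```

Under this pattern a `force=True` call always re-runs preprocessing and rewrites the cached file.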


### From Dolma + Chadwyck-Healey

##### `get_memorized_poems_in_dolma`: Get memorized poems detected in the Dolma training corpus.

* Loading from `{REPO}/data/raw/memorization/data.memorized_poems_in_dolma.csv.gz`

id,txt,num_lines,num_rhyming_lines,perc_rhyming_lines,lines,count,found,found_source,found_corpus,id_hash,...,author,author_dob,title,year,num_lines_from_corpus,volume,line,rhyme,genre,period
c20-american/am20114/Z300221220,Tambourines!\nTambourines!\nTambourines\nTo th...,14,0,0.000000,"['Tambourines!', 'Tambourines!', 'Tambourines'...",219,True,open,chadwyck,323918,...,"Hughes, Langston, 1902-1967.",1902.0,Tambourines,1932,16,,Tambourines!,,,1900-1950
english/wattsisa/Z400522989,When I survey the wondrous cross\nOn which the...,20,16,80.000000,"['When I survey the wondrous cross', 'On which...",99,True,open,chadwyck,656874,...,"Watts, Isaac, 1674-1748",1674.0,HYMN 7. (L. M.) Crucifixion to the World by th...,1704,20,The Works (1810),"On which the prince of glory dy'd,",y,Lyric,1650-1700
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
american/am0076/Z200145561,Thine eyes shall see the light of distant skie...,14,10,71.428571,['Thine eyes shall see the light of distant sk...,0,False,open,chadwyck,310302,...,"Bryant, William Cullen, 1794-1878",1794.0,"TO COLE, THE PAINTER, DEPARTING FOR EUROPE.",1824,14,The Poetical Works (1903),"&indent;A living image of our own bright land,",y,Sonnet,1750-1800
english/rawnsley/Z300472911,"The corn was yellow by the Lowland braes,\n ...",14,12,85.714286,"['The corn was yellow by the Lowland braes,', ...",0,False,open,chadwyck,736654,...,"Rawnsley, H. D. (Hardwicke Drummond), 1851-1920",1851.0,Lord Justice General Inglis.,1881,14,Valete: Tennyson [etc.] (1893),"&indent;The Highland heather purple on the hill,",y,Sonnet,1850-1900


## From memorization detection from closed-source models

##### `get_memorized_poems_in_completions`: Core function for detecting memorized poems in GenAI completions using similarity.

* Loading legacy genai rhyme completions from `{REPO}/data/data_as_in_paper/genai_rhyme_completions.csv.gz`

* Converting to poem txt format (not keeping first lines from original poem)

id,model,first_n_lines,id_gen,keep_first_n_lines,id_hash,txt,num_lines,line_sim,found,found_source,found_corpus
african-american/belljame/Z200277067,ollama/llama3.1:8b,5,44e64e0c,False,464890,Reap the moments that are your own time's gift...,9,44.680851,False,closed,chadwyck
african-american/benjamin/Z200277076,ollama/olmo2:latest,5,b6451886,False,675304,"\n\n\n Embrace equality, let no chains confoun...",15,43.678161,False,closed,chadwyck
...,...,...,...,...,...,...,...,...,...,...,...
modern/sci0101/Z200480980,ollama/llama3.1:8b,5,02891a61,False,345397,I'd rather be elsewhere tonight\nwhere darknes...,19,46.376812,False,closed,chadwyck
modern/sci0101/Z200480982,ollama/olmo2:latest,5,0a73cde6,False,320878,"""of hope,"" for they held so much promise,\nthr...",80,46.753247,False,closed,chadwyck
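
The `line_sim` column above appears to be a 0-100 similarity score between generated lines and the original poem's lines, with `found` set when the score is high enough. The sketch below shows one way such a check could work, using `difflib.SequenceMatcher` from the standard library. The metric, the line pairing, and the 95.0 threshold are all assumptions for illustration; the package's actual similarity measure and cutoff are not shown in this output.

```python
from difflib import SequenceMatcher


def line_similarity(a, b):
    """Similarity of two lines on a 0-100 scale, ignoring case and extra whitespace."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return 100.0 * SequenceMatcher(None, norm_a, norm_b).ratio()


def looks_memorized(original_lines, generated_lines, threshold=95.0):
    """Flag a completion when its lines, paired in order, average >= threshold.

    The threshold is a hypothetical value chosen for this sketch.
    """
    pairs = list(zip(original_lines, generated_lines))
    if not pairs:
        return False, 0.0
    mean_sim = sum(line_similarity(o, g) for o, g in pairs) / len(pairs)
    return mean_sim >= threshold, mean_sim


# Two lines of the Watts hymn that appear in the Dolma table earlier in this output.
original = [
    "When I survey the wondrous cross",
    "On which the prince of glory dy'd,",
]
verbatim = list(original)
# Invented lines standing in for a non-memorized completion.
unrelated = [
    "The harbor lights go out at dawn",
    "A kettle sings beside the door",
]

found_verbatim, sim_verbatim = looks_memorized(original, verbatim)
found_other, sim_other = looks_memorized(original, unrelated)
```

A verbatim continuation scores 100 and is flagged, while the invented lines fall well below the cutoff, which matches the pattern of mid-40s `line_sim` values paired with `found = False` in the rows shown.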


## All together

##### `get_all_memorization_data`: Aggregate memorization detection results from all available data sources.

* Preprocessing from `{REPO}/data/raw/memorization/antoniak-et-al`

* Preprocessing Antoniak et al. memorization data from `{REPO}/data/raw/memorization/antoniak-et-al`

* Writing to `{REPO}/data/raw/memorization/data.antoniak_et_al_memorization_results.csv.gz`

* Loading legacy genai rhyme completions from `{REPO}/data/data_as_in_paper/genai_rhyme_completions.csv.gz`

* Converting to poem txt format (not keeping first lines from original poem)

* Loading from `{REPO}/data/raw/memorization/data.memorized_poems_in_dolma.csv.gz`

* Writing to `{REPO}/data/raw/memorization/data.all_memorization_data.csv.gz`

* Breakdown of poems found, by corpus and source
                                   count
found_corpus   found_source found       
antoniak-et-al closed       False   2330
                            True    1723
...                                  ...
chadwyck       open         False   4229
                            True     406

[8 rows x 1 columns]



found_corpus    found_source  found
antoniak-et-al  closed        False    2330
                              True     1723
                                       ... 
chadwyck        open          False    4229
                              True      406
Length: 8, dtype: int64
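
The raw counts above are easier to compare as rates. The sketch below computes the share of poems marked `found` within each corpus and source, using only the four rows visible in the breakdown; the four elided rows are left out rather than guessed. `found_rates` is a hypothetical helper, not part of the package.

```python
# Counts copied from the four visible rows of the breakdown.
counts = {
    ("antoniak-et-al", "closed", False): 2330,
    ("antoniak-et-al", "closed", True): 1723,
    ("chadwyck", "open", False): 4229,
    ("chadwyck", "open", True): 406,
}


def found_rates(counts):
    """Map (corpus, source) to the fraction of its poems with found == True."""
    totals = {}
    found = {}
    for (corpus, source, is_found), n in counts.items():
        key = (corpus, source)
        totals[key] = totals.get(key, 0) + n
        if is_found:
            found[key] = found.get(key, 0) + n
    return {key: found.get(key, 0) / total for key, total in totals.items()}


rates = found_rates(counts)
for (corpus, source), rate in sorted(rates.items()):
    print(f"{corpus:15s} {source:7s} {rate:.1%}")
```

On these rows the found rate is about 42.5% for `antoniak-et-al` / `closed` and about 8.8% for `chadwyck` / `open`.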

id,found,found_source,found_corpus,txt,author,birth_death_dates,title,form,form_group,tags,...,period_meta,subcorpus,year,num_lines_from_corpus,volume,line,rhyme,genre,period,found_source_corpus
poem/alas-tis-true-i-have-gone-here-and-there-sonnet-110,True,open,antoniak-et-al,"Alas, ’tis true I have gone here and there\nAn...",William Shakespeare,1564 – 1616,"Alas, 'tis true I have gone here and there (So...",sonnet,verse forms,"['Sonnet', 'Public Domain']",...,,,,,,,,,,open|antoniak-et-al
43742/sonnets-from-the-portuguese-43-how-do-i-love-thee-let-me-count-the-ways,False,open,antoniak-et-al,How do I love thee? Let me count the ways.\nI ...,Elizabeth Barrett Browning,1806–1861,Sonnets from the Portuguese 43: How do I love ...,sonnet,verse forms,"['Related Audio', 'Living', 'Marriage & Compan...",...,,,,,,,,,,open|antoniak-et-al
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
american/am0076/Z200145561,False,open,chadwyck,Thine eyes shall see the light of distant skie...,"Bryant, William Cullen, 1794-1878",,"TO COLE, THE PAINTER, DEPARTING FOR EUROPE.",,,,...,1835-1869 Mid Nineteenth-Century,American Poetry,1824.0,14.0,The Poetical Works (1903),"&indent;A living image of our own bright land,",y,Sonnet,1750-1800,open|chadwyck
english/rawnsley/Z300472911,False,open,chadwyck,"The corn was yellow by the Lowland braes,\n ...","Rawnsley, H. D. (Hardwicke Drummond), 1851-1920",,Lord Justice General Inglis.,,,,...,1870-1899 Later Nineteenth-Century,English Poetry,1881.0,14.0,Valete: Tennyson [etc.] (1893),"&indent;The Highland heather purple on the hill,",y,Sonnet,1850-1900,open|chadwyck
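
In the combined table, the `found_source_corpus` column appears to join `found_source` and `found_corpus` with a `|` separator (for example `open|antoniak-et-al`). The sketch below shows one way such an aggregation could be put together from per-source row lists. `combine_sources` is a hypothetical helper and the rows are abbreviated from the tables above; the real `get_all_memorization_data` may work differently.

```python
def combine_sources(*sources):
    """Concatenate row lists and add a combined found_source_corpus label."""
    combined = []
    for rows in sources:
        for row in rows:
            merged = dict(row)  # copy so the input rows are not mutated
            merged["found_source_corpus"] = (
                f"{row['found_source']}|{row['found_corpus']}"
            )
            combined.append(merged)
    return combined


# Abbreviated rows taken from the tables shown in this output.
antoniak_rows = [
    {
        "id": "poem/alas-tis-true-i-have-gone-here-and-there-sonnet-110",
        "found": True,
        "found_source": "open",
        "found_corpus": "antoniak-et-al",
    },
]
chadwyck_open_rows = [
    {
        "id": "english/rawnsley/Z300472911",
        "found": False,
        "found_source": "open",
        "found_corpus": "chadwyck",
    },
]
chadwyck_closed_rows = [
    {
        "id": "modern/sci0101/Z200480980",
        "found": False,
        "found_source": "closed",
        "found_corpus": "chadwyck",
    },
]

all_rows = combine_sources(antoniak_rows, chadwyck_open_rows, chadwyck_closed_rows)
labels = [row["found_source_corpus"] for row in all_rows]
```

A single combined label makes it straightforward to group or plot results by source and corpus together.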
