In [1]:
# !pip install -r ../../prosodic/requirements.txt
import sys
sys.path.insert(0,'..')
from generative_formalism import *

## Previous data

### Antoniak et al.


In [7]:
documentation(get_antoniak_et_al_memorization_data)
documentation(preprocess_antoniak_et_al_memorization_data)
df_antoniak_et_al_memorization_data = get_antoniak_et_al_memorization_data(overwrite=False)
df_antoniak_et_al_memorization_data

**`get_antoniak_et_al_memorization_data`**

```md
Load Antoniak et al. memorization data with caching support.

    Convenience wrapper around preprocess_antoniak_et_al_memorization_data().

    Returns processed Antoniak et al. dataset containing public domain poems 
    with memorization detection results from both closed and open language models.
    
```
----


**`preprocess_antoniak_et_al_memorization_data`**

```md
Preprocess Antoniak et al. memorization data from raw files.

    This function processes the raw Antoniak et al. dataset by combining:
    1. Public domain poetry metadata and text
    2. Closed-model memorization detection results
    3. Open-model memorization detection from Walsh et al. (wimbd) data

    The function handles data cleaning, ID extraction, date processing,
    and integration of multiple data sources into a unified format.

    Parameters
    ----------
    data_fldr : str, default PATH_ANTONIAK_ET_AL_DIR
        Path to directory containing Antoniak et al. raw data files:
        - poetry-evaluation_public-domain-poems.csv
        - memorization_results.csv
        - wimbd*.csv files (from Walsh et al.)
    out_path : str, default PATH_ANTONIAK_ET_AL_CSV
        Path where processed data will be saved as compressed CSV.
    verbose : bool, default DEFAULT_VERBOSE
        If True, print progress messages during processing.

    Returns
    -------
    pd.DataFrame
        Processed DataFrame with columns:
        - 'id': Poem identifier (extracted from poem_link)
        - 'found': Boolean indicating memorization detection
        - 'found_source': 'closed' or 'open'
        - 'found_corpus': 'antoniak-et-al'
        - 'txt': Full poem text
        - 'title': Poem title
        - 'author_dob_str': Author birth year as string
        - 'author_dob': Author birth year as numeric
        - Additional metadata columns from original dataset

    Notes
    -----
    The preprocessing involves:
    - Extracting poem IDs from URLs in poem_link column
    - Processing author birth/death dates to extract birth years
    - Loading closed-model memorization predictions
    - Processing wimbd CSV files to detect open-model memorization
    - Merging all data sources into unified format

    Results are cached to out_path for subsequent loads.
    
```
----


| id | found | found_source | found_corpus | author | birth_death_dates | title | txt | form | form_group | tags | ... | author_link | pub_year | extracted_birth_year | extracted_death_year | form_tags | theme_tags | occasion_tags | collected_from | author_dob_str | author_dob |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| poem/alas-tis-true-i-have-gone-here-and-there-sonnet-110 | True | open | antoniak-et-al | William Shakespeare | 1564 – 1616 | Alas, 'tis true I have gone here and there (So... | Alas, ’tis true I have gone here and there\nAn... | sonnet | verse forms | ['Sonnet', 'Public Domain'] | ... | https://poets.org/poet/william-shakespeare | 1904.0 | 1564.0 | 1616.0 | ['Sonnet'] | ['Public Domain'] | [] | Academy of American Poets | 1564 | 1564.0 |
| 43742/sonnets-from-the-portuguese-43-how-do-i-love-thee-let-me-count-the-ways | False | open | antoniak-et-al | Elizabeth Barrett Browning | 1806–1861 | Sonnets from the Portuguese 43: How do I love ... | How do I love thee? Let me count the ways.\nI ... | sonnet | verse forms | ['Related Audio', 'Living', 'Marriage & Compan... | ... | https://www.poetryfoundation.org/poets/elizabe... |  | 1806.0 | 1861.0 | [] | [] | [] | Poetry Foundation | 1806 | 1806.0 |
| 43749/cleon | False | open | antoniak-et-al | Robert Browning | 1812–1889 | Cleon | "As certain also of your own poets have said"—... | dramatic monologue | types/modes | ['Living', 'Death', 'Growing Old', 'Arts & Sci... | ... | https://www.poetryfoundation.org/poets/robert-... |  | 1812.0 | 1889.0 | [] | [] | [] | Poetry Foundation | 1812 | 1812.0 |
| 43749/cleon | False | open | antoniak-et-al | Robert Browning | 1812–1889 | Cleon | "As certain also of your own poets have said"—... | blank verse | meters | ['Living', 'Death', 'Growing Old', 'Arts & Sci... | ... | https://www.poetryfoundation.org/poets/robert-... |  | 1812.0 | 1889.0 | [] | [] | [] | Poetry Foundation | 1812 | 1812.0 |
| 45016/the-house-of-life-73-the-choice-iii | False | open | antoniak-et-al | Dante Gabriel Rossetti | 1828–1882 | The House of Life: 73. The Choice, III | Think thou and act; to-morrow thou shalt die\n... | sonnet | verse forms | ['Living', 'Death', 'Time & Brevity', 'Nature'... | ... | https://www.poetryfoundation.org/poets/dante-g... |  | 1828.0 | 1882.0 | [] | [] | [] | Poetry Foundation | 1828 | 1828.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 44483/robin-hood | True | closed | antoniak-et-al | John Keats | 1795–1821 | Robin Hood | TO A FRIEND\n\nNo! those days are gone away \n... | couplet | stanza forms | ['Nature', 'Stars, Planets, Heavens', 'Summer'... | ... | https://www.poetryfoundation.org/poets/john-keats |  | 1795.0 | 1821.0 | [] | [] | [] | Poetry Foundation | 1795 | 1795.0 |
| 46889/turning-forty-56d226f965d75 | False | closed | antoniak-et-al |  |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 55856/the-magic-trick | False | closed | antoniak-et-al |  |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 142852/essay-on-craft | False | closed | antoniak-et-al |  |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
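
The preprocessing notes mention extracting poem IDs from the `poem_link` URLs and birth years from the `birth_death_dates` strings. The package's own implementation is not shown here, so the following is only a minimal sketch of how those two steps might look. The helper names `extract_poem_id` and `extract_birth_year` are hypothetical, and the URL shapes are assumptions inferred from the IDs in the table (`poem/<slug>` and `<number>/<slug>`).

```python
import re
from urllib.parse import urlparse


def extract_poem_id(poem_link):
    """Hypothetical helper: derive an ID like those in the table from a URL."""
    parts = [p for p in urlparse(poem_link).path.split("/") if p]
    # Assumption: Poetry Foundation links look like /poems/<number>/<slug>,
    # while poets.org links look like /poem/<slug> and keep their prefix.
    if parts and parts[0] == "poems":
        parts = parts[1:]
    return "/".join(parts)


def extract_birth_year(birth_death_dates):
    """Hypothetical helper: first 4-digit year in a string such as '1564 – 1616'."""
    if not birth_death_dates:
        return None
    match = re.search(r"\d{4}", birth_death_dates)
    return int(match.group()) if match else None


# Example inputs; the URLs are assumed shapes, the date strings come from the table.
print(extract_poem_id("https://www.poetryfoundation.org/poems/43749/cleon"))
print(extract_birth_year("1564 – 1616"), extract_birth_year("1806–1861"))
```

Taking the first four-digit run handles both the spaced (`1564 – 1616`) and unspaced (`1806–1861`) date formats that appear in the data without depending on the dash character.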


## Chadwyck-Healey

### Memorized by closed models

In [8]:
documentation(get_memorized_poems_in_completions_as_in_paper)
df_mem_chadwyck_closed = get_memorized_poems_in_completions_as_in_paper()
df_mem_chadwyck_closed

**`get_memorized_poems_in_completions_as_in_paper`**

```md
Get memorized poems from GenAI completions using original paper methodology.

    This function identifies poems that were memorized by language models by analyzing
    GenAI rhyme completions data using the methodology from the original paper.
    It uses similarity-based detection to find poems that appear to be directly
    copied from training data.

    Parameters
    ----------
    threshold : int, default 95
        Similarity threshold (0-100) for determining memorization.
        Poems with line_sim > threshold are considered memorized.
    verbose : bool, default True
        If True, print progress messages during processing.
    overwrite : bool, default False
        If True, force reprocessing instead of using cached results.

    Returns
    -------
    pd.DataFrame
        DataFrame of GenAI completions scored for memorization, indexed by
        poem ID. Includes:
        - Original poem metadata and text
        - Similarity scores and detection results
        - 'found': True where line_sim > threshold, False otherwise
        - 'found_source': 'closed'
        - 'found_corpus': 'chadwyck'

    Notes
    -----
    This function uses the original paper's methodology for detecting memorization:
    1. Loads GenAI rhyme completions data (get_genai_rhyme_completions_as_in_paper)
    2. Applies similarity-based detection with specified threshold
    3. Flags poems that exceed the memorization threshold in the 'found' column

    The results are cached in a global variable for performance. This represents
    the "closed-source" detection method using direct model outputs rather than
    searching training corpora.

    See Also
    --------
    get_memorized_poems_in_completions : Core memorization detection logic
    get_genai_rhyme_completions_as_in_paper : Source data for detection
    
```
----


| id | model | first_n_lines | date | id_gen | keep_first_n_lines | id_hash | txt | num_lines | line_sim | found | found_source | found_corpus |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| english/arminrob/Z300263156 | gpt-3.5-turbo | 5 | 2025-03-24 | 000098c6 | True | 424102 | I cannot tell for certain, yet Isle guess.\nYo... | 12 | 42.500000 | False | closed | chadwyck |
| english/heywoodt/Z200396007 | ollama/olmo2:latest | 5 | 2025-03-16 | 0004444f | True | 266924 | Excellent Princes may you ever be,\nAs great a... | 20 | 45.454545 | False | closed | chadwyck |
| english/sewardan/Z300482291 | gpt-3.5-turbo | 5 | 2025-03-16 | 0007c94e | True | 497962 | Up this bleak hill, in wintery night's dread h... | 14 | 43.956044 | False | closed | chadwyck |
| c20-english/ep56001/Z400307768 | ollama/mistral:text | 5 | 2025-03-20 | 0008fbe6 | True | 826002 | Oh hard is the bed they have made him,\n An... | 16 | 53.191489 | False | closed | chadwyck |
| c20-english/ep20004/Z300593359 | claude-3-sonnet-20240229 | 5 | 2025-03-16 | 000e56b8 | True | 486728 | Achilles grieves. A soldier, weeping, seems\nN... | 10 | 44.210526 | False | closed | chadwyck |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| c20-english/ep40001/Z300306087 | ollama/olmo2:latest | 5 | 2025-03-17 | ffe6c6b0 | True | 606477 | Learn to know the mind-behind-\nMind that sees... | 16 | 44.067797 | False | closed | chadwyck |
| c20-american/am23033/Z300253614 | ollama/olmo2:latest | 5 | 2025-03-16 | ffe6f12a | True | 575872 | Driving north ninety miles\nmy daughter to col... | 13 | 41.463415 | False | closed | chadwyck |
| c20-american/am22097/Z200236446 | ollama/olmo2:latest | 5 | 2025-03-16 | fff3b957 | True | 715797 | The blackberries that ripened\nsoon after you ... | 10 | 39.393939 | False | closed | chadwyck |
| english/rawnsley/Z200471879 | ollama/llama3.1:8b | 5 | 2025-03-24 | fffb372e | True | 145671 | The moat is dry, the drawbridge solid stone;\n... | 14 | 44.705882 | False | closed | chadwyck |
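
The docstring states that poems with `line_sim > threshold` count as memorized, with a default threshold of 95. As a minimal sketch of that rule only (how `line_sim` itself is computed is not shown here), the hypothetical helper `flag_memorized` below applies the strict inequality to a list of scores. The first three scores are taken from the rows above; the last two are made-up values added to show the boundary case.

```python
def flag_memorized(line_sims, threshold=95):
    """Return True for each score strictly greater than the threshold."""
    return [sim > threshold for sim in line_sims]


# First three scores appear in the table; 95.0 and 97.5 are hypothetical.
scores = [42.5, 45.454545, 43.956044, 95.0, 97.5]
flags = flag_memorized(scores)
print(flags)  # [False, False, False, False, True]
```

Because the comparison is strict, a score of exactly 95 is not flagged, which is consistent with every visible row (all with scores in the 39 to 54 range) having `found` set to False.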


### Found in open training data

In [9]:
documentation(get_memorized_poems_in_dolma)
df_mem_chadwyck_open = get_memorized_poems_in_dolma()
head(df_mem_chadwyck_open)

**`get_memorized_poems_in_dolma`**

```md
Get memorized poems detected in the Dolma training corpus.

    This function provides access to poems that were detected as memorized in the
    Dolma training corpus. Note that Dolma is no longer publicly accessible, so
    this function works with pre-computed detection results.

    Parameters
    ----------
    *args
        Positional arguments passed to preprocess_memorized_poems_in_dolma().
    overwrite : bool, default False
        If True, force reprocessing of cached data instead of using existing results.
    verbose : bool, default True
        If True, print progress messages during data loading/preprocessing.
    **kwargs
        Keyword arguments passed to preprocess_memorized_poems_in_dolma().

    Returns
    -------
    pd.DataFrame
        DataFrame containing poems detected as memorized in Dolma training corpus,
        indexed by poem ID. See preprocess_memorized_poems_in_dolma() for details.

    Notes
    -----
    Dolma is no longer publicly accessible, so this function relies on pre-computed
    memorization detection results stored in the repository. The detection was
    performed by searching for poem sequences within the Dolma training corpus.

    This represents "open-source" memorization detection, where poems were found
    by direct corpus search rather than analyzing language model outputs.

    See Also
    --------
    preprocess_memorized_poems_in_dolma : Core preprocessing function
    get_memorized_poems_in_completions_as_in_paper : Closed-source detection method
    
```
----


* Loading from /Users/rj416/github/generative-formalism/data/data.memorized_poems_in_dolma.csv.gz


*DataFrame with 4,635 rows and 22 columns*

| id | txt | num_lines | num_rhyming_lines | perc_rhyming_lines | lines | count | found | found_source | found_corpus | id_hash | ... | author | author_dob | title | year | num_lines_from_corpus | volume | line | rhyme | genre | period |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| c20-american/am20114/Z300221220 | Tambourines!\nTambourines!\nTambourines\nTo th... | 14 | 0 | 0.0 | ['Tambourines!', 'Tambourines!', 'Tambourines'... | 219 | True | open | chadwyck | 225943 | ... | Hughes, Langston, 1902-1967. | 1902.0 | Tambourines | 1932 | 16 |  | Tambourines! |  |  | 1900-1950 |
| english/wattsisa/Z400522989 | When I survey the wondrous cross\nOn which the... | 20 | 16 | 80.0 | ['When I survey the wondrous cross', 'On which... | 99 | True | open | chadwyck | 412329 | ... | Watts, Isaac, 1674-1748 | 1674.0 | HYMN 7. (L. M.) Crucifixion to the World by th... | 1704 | 20 | The Works (1810) | On which the prince of glory dy'd, | y | Lyric | 1650-1700 |
| english-ed2/miscell3/Z300440750 | Go, lovely Rose!\nTell her, that wastes her ti... | 20 | 17 | 85.0 | ['Go, lovely Rose!', 'Tell her, that wastes he... | 97 | True | open | chadwyck | 417111 | ... | Waller, Edmund, 1606-1687 | 1606.0 | CXV [Go, lovely Rose!] | 1636 | 20 |  | &indent;Go, lovely Rose! | y |  | 1600-1650 |
| c20-american/am20114/Z300220672 | I must say\nYes, sir,\nTo you all the time.\nY... | 13 | 0 | 0.0 | ['I must say', 'Yes, sir,', 'To you all the ti... | 60 | True | open | chadwyck | 805969 | ... | Hughes, Langston, 1902-1967. | 1902.0 | Porter | 1932 | 15 |  | I must say |  |  | 1900-1950 |
| english/wattsisa/Z400522722 | Come let us join our cheerful songs\n With ... | 20 | 6 | 30.0 | ['Come let us join our cheerful songs', 'With ... | 31 | True | open | chadwyck | 627141 | ... | Watts, Isaac, 1674-1748 | 1674.0 | HYMN 62. (C. M.) Christ Jesus, the Lamb of God... | 1704 | 20 | The Works (1810) | &indent;With angels round the throne; | y | Lyric | 1650-1700 |
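
The docstring describes the open-data detection as searching for poem sequences within the training corpus. The actual search procedure and the exact meaning of the `count` column are not stated here, so the following is only a toy sketch of the general idea: normalize text, then count documents that contain a poem's opening lines as one contiguous sequence. The function names, the `n_lines` parameter, and the two stand-in documents are all hypothetical.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def count_corpus_hits(poem_lines, documents, n_lines=2):
    """Count documents containing the poem's first n_lines as one sequence."""
    query = normalize(" ".join(poem_lines[:n_lines]))
    return sum(query in normalize(doc) for doc in documents)


# Toy stand-in for a training corpus (hypothetical documents).
documents = [
    "Hymnal excerpt: When I survey the wondrous cross\n"
    "On which the prince of glory dy'd, ...",
    "An unrelated web page about gardening.",
]
poem_lines = [
    "When I survey the wondrous cross",
    "On which the prince of glory dy'd,",
]
hits = count_corpus_hits(poem_lines, documents)
print(hits, hits > 0)  # 1 True
```

A real corpus search would need an index rather than a linear scan, and would have to decide how to treat punctuation and markup such as the `&indent;` prefix visible in the `line` column.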


## All together

In [10]:
documentation(get_all_memorization_data)
df_mem = get_all_memorization_data()
df_mem.groupby(['found_corpus', 'found_source','found']).size()

**`get_all_memorization_data`**

```md
Aggregate memorization detection results from all available data sources.

    This function combines memorization detection results from three different
    sources and data types: Antoniak et al. study, Chadwyck poetry corpus
    completions, and Dolma training corpus. It handles data integration,
    column ordering, and caching.

    Parameters
    ----------
    overwrite : bool, default False
        If True, force reprocessing of all data sources instead of using cached results.
    verbose : bool, default True
        If True, print progress messages and summary statistics.

    Returns
    -------
    pd.DataFrame
        DataFrame containing aggregated memorization data from all sources, indexed by poem ID.
        Includes columns for:
        - 'found': Boolean indicating if poem was detected as memorized
        - 'found_source': Source type ('closed' or 'open')
        - 'found_corpus': Corpus identifier ('antoniak-et-al' or 'chadwyck')
        - Poem metadata (title, author, dates, text, etc.)
        - 'id_hash': Unique hash identifier for each poem

    Notes
    -----
    Data Sources:
    - Antoniak et al.: Public domain poems with closed/open memorization detection
    - Chadwyck closed: Similarity-based detection in GenAI completions (original paper method)
    - Chadwyck open: Detection in Dolma training corpus

    The function performs column reordering to prioritize shared columns across
    all sources, followed by source-specific columns. Poems are kept whether or
    not they were detected as memorized; the 'found' column distinguishes them.

    Cached results are stored in PATH_ALL_MEMORIZATION_DATA for performance.
    
```
----


* Loading from /Users/rj416/github/generative-formalism/data/data.all_memorization_data.csv.gz


found_corpus    found_source  found
antoniak-et-al  closed        False     2330
                              True      1723
                open          False     2294
                              True      1759
chadwyck        closed        False    11227
                              True        71
                open          False     4229
                              True       406
dtype: int64
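
The raw counts are easier to compare as rates. The sketch below copies the eight counts from the output above into a plain dictionary (rather than calling the package, whose data files may not be at hand) and computes the share of poems flagged as found in each corpus and source.

```python
# Counts copied from the groupby output above:
# (found_corpus, found_source) -> (found == False, found == True)
counts = {
    ("antoniak-et-al", "closed"): (2330, 1723),
    ("antoniak-et-al", "open"): (2294, 1759),
    ("chadwyck", "closed"): (11227, 71),
    ("chadwyck", "open"): (4229, 406),
}

totals = {}
rates = {}
for key, (not_found, found) in counts.items():
    totals[key] = not_found + found
    # Percentage of rows flagged as memorized, to one decimal place.
    rates[key] = round(100 * found / totals[key], 1)
    print(key, totals[key], f"{rates[key]}%")
```

This gives roughly 42.5% and 43.4% for the Antoniak et al. closed and open results, against 0.6% and 8.8% for Chadwyck closed and open. With the DataFrame loaded, `df_mem.groupby(['found_corpus', 'found_source'])['found'].mean()` should give the same proportions, assuming `found` is stored as a boolean column.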