# Prompting for completions of poems

In [1]:
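import sys
sys.path.append('../')  # make the repository root importable from this notebook
import pandas as pd  # used explicitly below (pd.testing.assert_frame_equal)
from generative_formalism.rhyme_completions import *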

## Preprocessing raw data

In [8]:
documentation(preprocess_legacy_genai_rhyme_completions)

**`preprocess_legacy_genai_rhyme_completions`**

```md
Preprocess legacy generative AI rhyme completions from raw pickle files.

    This function loads and preprocesses legacy rhyme completion data from multiple
    pickle files (v3-v7), combines them, deduplicates, and saves to CSV format.
    It also generates unique IDs for generated poems and provides statistics
    about the dataset.

    Args:
        path (str, optional): Path to save the processed CSV file.
            Defaults to PATH_GENAI_RHYME_COMPLETIONS.
        overwrite (bool, optional): Whether to overwrite existing processed data.
            Defaults to False.
        first_n_lines (int, optional): Number of first lines from original poems
            to consider. Defaults to FIRST_N_LINES.

    Returns:
        pd.DataFrame: Processed DataFrame with MultiIndex containing completion data,
            indexed by GENAI_RHYME_COMPLETIONS_INDEX.

    Note:
        This function uses a global cache (PREPROCESSED_LEGACY_COMPLETION_DATA)
        to avoid reprocessing data within the same session.
    
```
----
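
The combine, deduplicate, and cache steps described in the docstring can be sketched roughly as below. This is an illustration under assumptions, not the package's implementation: `_CACHE`, `combine_and_dedupe`, and the toy frames `v3` and `v4` are hypothetical stand-ins for the global cache and the per-version pickle contents.

```python
import pandas as pd

# Hypothetical session-level cache, standing in for PREPROCESSED_LEGACY_COMPLETION_DATA
_CACHE = None

def combine_and_dedupe(frames, index_cols, overwrite=False):
    """Concatenate per-version frames, drop duplicate rows, and cache the result."""
    global _CACHE
    if _CACHE is not None and not overwrite:
        return _CACHE  # reuse the result computed earlier in this session
    df = pd.concat(frames, ignore_index=True)
    df = df.drop_duplicates(subset=index_cols, keep="first")
    _CACHE = df.set_index(index_cols).sort_index()
    return _CACHE

# Toy stand-ins for two legacy versions that share one row
v3 = pd.DataFrame({"id": ["a", "a"], "line_num": [1, 2], "line_gen": ["x", "y"]})
v4 = pd.DataFrame({"id": ["a", "b"], "line_num": [2, 1], "line_gen": ["y", "z"]})
df_demo = combine_and_dedupe([v3, v4], index_cols=["id", "line_num"])
```

A second call without `overwrite=True` returns the cached frame, which mirrors the note in the docstring about avoiding reprocessing within a session.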


In [9]:
# Load the preprocessed legacy completions (built from the raw pickles of earlier generation runs)
df_preprocessed_rhyme_completions = preprocess_legacy_genai_rhyme_completions(overwrite=False)
df_preprocessed_rhyme_completions

id_human,model,first_n_lines,version,date,id,stanza_num,line_num,line_real,line_gen
african-american/dunbarpa/Z200343481,claude-3-sonnet-20240229,5,1,2025-03-24,0f81663a,1,1,"Long years ago, within a distant climb,",
african-american/dunbarpa/Z200343481,claude-3-sonnet-20240229,5,1,2025-03-24,0f81663a,1,2,"Ere Love had touched me with his wand sublime,",
african-american/dunbarpa/Z200343481,claude-3-sonnet-20240229,5,1,2025-03-24,0f81663a,1,3,I dreamed of one to make my life's calm May,
african-american/dunbarpa/Z200343481,claude-3-sonnet-20240229,5,1,2025-03-24,0f81663a,1,4,The panting passion of a summer's day.,
african-american/dunbarpa/Z200343481,claude-3-sonnet-20240229,5,1,2025-03-24,0f81663a,1,5,"And ever since, in almost sad suspense,",
...,...,...,...,...,...,...,...,...,...
modern/sci0101/Z200480982,ollama/olmo2:latest,5,1,2025-03-15,9e7e696c,5,13,"""I enclose some photographs,",into lives interwoven by sight.
modern/sci0101/Z200480982,ollama/olmo2:latest,5,1,2025-03-15,9e7e696c,5,14,all done with a camera.,"Now, with modern lenses, capturing light,"
modern/sci0101/Z200480982,ollama/olmo2:latest,5,1,2025-03-15,9e7e696c,6,15,"If they get crushed in the Post Office bags,","millions upon millions of images,"
modern/sci0101/Z200480982,ollama/olmo2:latest,5,1,2025-03-15,9e7e696c,6,16,they may easily be smoothed either by ironing,"stored in bytes unseen by human sight,"


In [10]:
documentation(get_genai_rhyme_completions_as_in_paper)

**`get_genai_rhyme_completions_as_in_paper`**

```md
Get generative AI rhyme completions data as used in the paper.

    This function retrieves preprocessed legacy rhyme completion data and
    optionally converts it to poem text format for analysis. It provides
    statistics about the dataset including line counts and poem length distributions.

    Args:
        by_line (bool, optional): If True, returns line-by-line data.
            If False, converts to poem text format. Defaults to True.
        keep_first_n_lines (bool, optional): Whether to keep the first N lines
            from original poems when converting to poem format. Defaults to True.

    Returns:
        pd.DataFrame: DataFrame containing rhyme completion data, either in
            line-by-line format or poem text format depending on by_line parameter.
    
```
----
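
The conversion from line-by-line data to poem text can be pictured with the sketch below. It assumes, based on the table above, that the prompt lines live in `line_real` and the completion lines in `line_gen`. The helper `lines_to_poem_txt` and the toy frame are hypothetical, and the real function may differ in detail.

```python
import pandas as pd

def lines_to_poem_txt(df_lines, first_n_lines=5, keep_first_n_lines=True):
    """Collapse a line-level frame into one row per completion (hypothetical helper)."""
    rows = []
    for poem_id, grp in df_lines.sort_values("line_num").groupby("id"):
        is_prompt = grp["line_num"] <= first_n_lines
        prompt_lines = grp.loc[is_prompt, "line_real"].tolist()
        generated_lines = grp.loc[~is_prompt, "line_gen"].tolist()
        # Optionally prepend the original opening lines to the generated ones
        lines = (prompt_lines if keep_first_n_lines else []) + generated_lines
        rows.append({"id": poem_id, "txt": "\n".join(lines), "num_lines": len(lines)})
    return pd.DataFrame(rows).set_index("id")

# Toy completion: two prompt lines followed by two generated lines
df_toy = pd.DataFrame({
    "id": ["p1"] * 4,
    "line_num": [1, 2, 3, 4],
    "line_real": ["A", "B", "C", "D"],
    "line_gen": [None, None, "c", "d"],
})
with_prompt = lines_to_poem_txt(df_toy, first_n_lines=2, keep_first_n_lines=True)
without_prompt = lines_to_poem_txt(df_toy, first_n_lines=2, keep_first_n_lines=False)
```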


In [14]:
# Data as in paper
df_rhyme_completions_by_line = get_genai_rhyme_completions_as_in_paper(by_line=True, filter_recognized=False)
df_rhyme_completions_by_line.head()

id_human,model,first_n_lines,version,date,id,stanza_num,line_num,line_real,line_gen
african-american/dunbarpa/Z200343481,claude-3-sonnet-20240229,5,1,2025-03-24,0f81663a,1,1,"Long years ago, within a distant climb,",
african-american/dunbarpa/Z200343481,claude-3-sonnet-20240229,5,1,2025-03-24,0f81663a,1,2,"Ere Love had touched me with his wand sublime,",
african-american/dunbarpa/Z200343481,claude-3-sonnet-20240229,5,1,2025-03-24,0f81663a,1,3,I dreamed of one to make my life's calm May,
african-american/dunbarpa/Z200343481,claude-3-sonnet-20240229,5,1,2025-03-24,0f81663a,1,4,The panting passion of a summer's day.,
african-american/dunbarpa/Z200343481,claude-3-sonnet-20240229,5,1,2025-03-24,0f81663a,1,5,"And ever since, in almost sad suspense,",


In [15]:
# With filter_recognized=False, the data as in the paper should be identical to the preprocessed data
pd.testing.assert_frame_equal(df_rhyme_completions_by_line, df_preprocessed_rhyme_completions)
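
For readers unfamiliar with this check: `pd.testing.assert_frame_equal` returns `None` when the two frames match and raises `AssertionError` otherwise, so a silent cell means the frames are equal. A small self-contained illustration on toy frames (not the notebook's data):

```python
import pandas as pd

left = pd.DataFrame({"line_gen": ["a", "b"]})
right = left.copy()

# Identical frames: the call returns None and raises nothing
pd.testing.assert_frame_equal(left, right)

# Changing a single cell makes the same call raise AssertionError
right.loc[1, "line_gen"] = "c"
try:
    pd.testing.assert_frame_equal(left, right)
    frames_match = True
except AssertionError:
    frames_match = False
```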

In [16]:
# Show data by poem
df_rhyme_completions_by_poem_including_first_n_lines = get_genai_rhyme_completions_as_in_paper(by_line=False, keep_first_n_lines=True)
df_rhyme_completions_by_poem_including_first_n_lines.head()

* Computing line similarity


100%|██████████| 326862/326862 [00:03<00:00, 85811.75it/s]


* Filtered out 169 recognized poems
* Converting to poem txt format (keeping first lines from original poem)


id_human,model,first_n_lines,date,id,keep_first_n_lines,id_hash,txt,num_lines
english/arminrob/Z300263156,gpt-3.5-turbo,5,2025-03-24,000098c6,True,747080,"I cannot tell for certain, yet Isle guess.\nYo...",12
english/heywoodt/Z200396007,ollama/olmo2:latest,5,2025-03-16,0004444f,True,181378,"Excellent Princes may you ever be,\nAs great a...",20
english/sewardan/Z300482291,gpt-3.5-turbo,5,2025-03-16,0007c94e,True,586411,"Up this bleak hill, in wintery night's dread h...",14
c20-english/ep56001/Z400307768,ollama/mistral:text,5,2025-03-20,0008fbe6,True,154692,"Oh hard is the bed they have made him,\n An...",16
c20-english/ep20004/Z300593359,claude-3-sonnet-20240229,5,2025-03-16,000e56b8,True,188822,"Achilles grieves. A soldier, weeping, seems\nN...",10


In [17]:
# Show data by poem excluding first n lines
df_rhyme_completions_by_poem_excluding_first_n_lines = get_genai_rhyme_completions_as_in_paper(by_line=False, keep_first_n_lines=False)
df_rhyme_completions_by_poem_excluding_first_n_lines.head()

* Computing line similarity


100%|██████████| 326862/326862 [00:03<00:00, 87676.52it/s]


* Filtered out 169 recognized poems
* Converting to poem txt format (keeping first lines from original poem)


id_human,model,first_n_lines,date,id,keep_first_n_lines,id_hash,txt,num_lines
english/arminrob/Z300263156,gpt-3.5-turbo,5,2025-03-24,000098c6,True,747080,"I cannot tell for certain, yet Isle guess.\nYo...",12
english/heywoodt/Z200396007,ollama/olmo2:latest,5,2025-03-16,0004444f,True,181378,"Excellent Princes may you ever be,\nAs great a...",20
english/sewardan/Z300482291,gpt-3.5-turbo,5,2025-03-16,0007c94e,True,586411,"Up this bleak hill, in wintery night's dread h...",14
c20-english/ep56001/Z400307768,ollama/mistral:text,5,2025-03-20,0008fbe6,True,154692,"Oh hard is the bed they have made him,\n An...",16
c20-english/ep20004/Z300593359,claude-3-sonnet-20240229,5,2025-03-16,000e56b8,True,188822,"Achilles grieves. A soldier, weeping, seems\nN...",10


### Replicated data

In [18]:
get_genai_rhyme_completions_as_replicated()


* Collecting genai rhyme promptings as replicated here
* Collecting from /Users/rj416/github/generative-formalism/data/stash/genai_rhyme_completions.jsonl
  * 16 generated completions


model,first_n_lines,id,keep_first_n_lines,id_hash,txt,num_lines,temperature,prompt,system_prompt
gemini-pro,5,00d43b92,True,80009,"Hence Sickness, nor about my weary head\n T...",14,0.7,"NUMBER OF LINES: 14\n\n1\tHence Sickness, nor ...",The following is the first 5 lines from a poem...
gpt-3.5-turbo,5,616128c1,True,308304,"Through gloomy grove, along the Lawn,\n Or ...",32,0.7,NUMBER OF LINES: 32\n\n1\tThrough gloomy grove...,The following is the first 5 lines from a poem...
claude-3-opus-20240229,5,4137d80a,True,470680,"""FOR THEY SAY THAT LITTLE INFANTS\n REPLY B...",24,0.7,"NUMBER OF LINES: 24\n\n1\t""FOR THEY SAY THAT L...",The following is the first 5 lines from a poem...
gpt-4-turbo,5,d4396f1a,True,47121,The poet's fancy takes from Flora's realm\n ...,14,0.7,NUMBER OF LINES: 14\n\n1\tThe poet's fancy tak...,The following is the first 5 lines from a poem...
gpt-3.5-turbo,5,6cd00f5c,True,383586,"First, feel, then feel, then\nread, or read, t...",21,0.7,"NUMBER OF LINES: 21\n\n1\tFirst, feel, then fe...",The following is the first 5 lines from a poem...
ollama/llama3.1:8b,5,4da1d3f3,True,463236,"Sing, Muse, though' feeble be thy strain,\nTho...",76,0.7,"NUMBER OF LINES: 76\n\n1\tSing, Muse, though' ...",The following is the first 5 lines from a poem...
claude-3-opus-20240229,5,0e1f415a,True,326189,"Baby came toddling up to my knee,\n His chu...",16,0.7,NUMBER OF LINES: 16\n\n1\tBaby came toddling u...,The following is the first 5 lines from a poem...
gpt-4-turbo,5,7889ef2a,True,474514,"An't please your Majesty, I'm overjoyed\n T...",34,0.7,NUMBER OF LINES: 34\n\n1\tAn't please your Maj...,The following is the first 5 lines from a poem...
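
The `prompt` column above appears to follow a simple layout: a `NUMBER OF LINES: N` header, a blank line, then the opening lines numbered and tab-separated. The sketch below reconstructs that layout from the visible prefix only. `build_completion_prompt` is a hypothetical helper, and the assumption that every prompt line follows the `number<TAB>text` pattern comes from the truncated output, not from the package source.

```python
def build_completion_prompt(first_lines, total_lines):
    """Format the opening lines of a poem as a numbered completion prompt (hypothetical)."""
    header = f"NUMBER OF LINES: {total_lines}"
    # Number each supplied line from 1 and separate number and text with a tab
    numbered = [f"{i}\t{line}" for i, line in enumerate(first_lines, start=1)]
    return header + "\n\n" + "\n".join(numbered)

prompt = build_completion_prompt(["Baby came toddling up to my knee,"], total_lines=16)
```

In the data above, five opening lines are supplied (`first_n_lines` is 5) and the header carries the target length of the full poem.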
