# Prompting for completions of poems

In [1]:
import sys
sys.path.append('../')
from generative_formalism.rhyme_completions import *

## Preprocessing raw data

In [2]:
documentation(preprocess_legacy_genai_rhyme_completions)

**Documentation for `preprocess_legacy_genai_rhyme_completions`**

*Description*

```md
Preprocess legacy generative AI rhyme completions from raw pickle files.
    
    This function loads and preprocesses legacy rhyme completion data from multiple
    pickle files (v3-v7), combines them, deduplicates, and saves to CSV format.
    It also generates unique IDs for generated poems and provides statistics
    about the dataset.
    
    Args:
        path (str, optional): Path to save the processed CSV file.
            Defaults to PATH_GENAI_RHYME_COMPLETIONS.
        overwrite (bool, optional): Whether to overwrite existing processed data.
            Defaults to False.
        first_n_lines (int, optional): Number of first lines from original poems
            to consider. Defaults to FIRST_N_LINES.
    
    Returns:
        pd.DataFrame: Processed DataFrame with MultiIndex containing completion data,
            indexed by GENAI_RHYME_COMPLETIONS_INDEX.
    
    Note:
        This function uses a global cache (PREPROCESSED_LEGACY_COMPLETION_DATA)
        to avoid reprocessing data within the same session.
    
```

*Call signature*

```md
preprocess_legacy_genai_rhyme_completions(
    path='/Users/rj416/github/generative-formalism/data/corpus_genai_rhyme_completions.csv.gz'
    overwrite=False
    first_n_lines=5
)
```

In [9]:
# Load the preprocessed completions built from the raw data of earlier generation runs
df_preprocessed_rhyme_completions = preprocess_legacy_genai_rhyme_completions(overwrite=False)
df_preprocessed_rhyme_completions

id_human,model,first_n_lines,version,date,id,stanza_num,line_num,line_real,line_gen
african-american/dunbarpa/Z200343481,claude-3-sonnet-20240229,5,1,2025-03-24,0f81663a,1,1,"Long years ago, within a distant climb,",
african-american/dunbarpa/Z200343481,claude-3-sonnet-20240229,5,1,2025-03-24,0f81663a,1,2,"Ere Love had touched me with his wand sublime,",
african-american/dunbarpa/Z200343481,claude-3-sonnet-20240229,5,1,2025-03-24,0f81663a,1,3,I dreamed of one to make my life's calm May,
african-american/dunbarpa/Z200343481,claude-3-sonnet-20240229,5,1,2025-03-24,0f81663a,1,4,The panting passion of a summer's day.,
african-american/dunbarpa/Z200343481,claude-3-sonnet-20240229,5,1,2025-03-24,0f81663a,1,5,"And ever since, in almost sad suspense,",
...,...,...,...,...,...,...,...,...,...
modern/sci0101/Z200480982,ollama/olmo2:latest,5,1,2025-03-15,9e7e696c,5,13,"""I enclose some photographs,",into lives interwoven by sight.
modern/sci0101/Z200480982,ollama/olmo2:latest,5,1,2025-03-15,9e7e696c,5,14,all done with a camera.,"Now, with modern lenses, capturing light,"
modern/sci0101/Z200480982,ollama/olmo2:latest,5,1,2025-03-15,9e7e696c,6,15,"If they get crushed in the Post Office bags,","millions upon millions of images,"
modern/sci0101/Z200480982,ollama/olmo2:latest,5,1,2025-03-15,9e7e696c,6,16,they may easily be smoothed either by ironing,"stored in bytes unseen by human sight,"


In [10]:
documentation(get_genai_rhyme_completions_as_in_paper)

**Documentation for `get_genai_rhyme_completions_as_in_paper`**

*Description*

```md
Get generative AI rhyme completions data as used in the paper.
    
    This function retrieves preprocessed legacy rhyme completion data and
    optionally converts it to poem text format for analysis. It provides
    statistics about the dataset including line counts and poem length distributions.
    
    Args:
        by_line (bool, optional): If True, returns line-by-line data.
            If False, converts to poem text format. Defaults to True.
        keep_first_n_lines (bool, optional): Whether to keep the first N lines
            from original poems when converting to poem format. Defaults to True.
    
    Returns:
        pd.DataFrame: DataFrame containing rhyme completion data, either in
            line-by-line format or poem text format depending on by_line parameter.
    
```

*Call signature*

```md
get_genai_rhyme_completions_as_in_paper(
    by_line=True
    keep_first_n_lines=True
)
```

In [13]:
# Data as in paper
df_rhyme_completions_by_line = get_genai_rhyme_completions_as_in_paper(by_line=True)
df_rhyme_completions_by_line.head()

id_human,model,first_n_lines,version,date,id,stanza_num,line_num,line_real,line_gen
african-american/dunbarpa/Z200343481,claude-3-sonnet-20240229,5,1,2025-03-24,0f81663a,1,1,"Long years ago, within a distant climb,",
african-american/dunbarpa/Z200343481,claude-3-sonnet-20240229,5,1,2025-03-24,0f81663a,1,2,"Ere Love had touched me with his wand sublime,",
african-american/dunbarpa/Z200343481,claude-3-sonnet-20240229,5,1,2025-03-24,0f81663a,1,3,I dreamed of one to make my life's calm May,
african-american/dunbarpa/Z200343481,claude-3-sonnet-20240229,5,1,2025-03-24,0f81663a,1,4,The panting passion of a summer's day.,
african-american/dunbarpa/Z200343481,claude-3-sonnet-20240229,5,1,2025-03-24,0f81663a,1,5,"And ever since, in almost sad suspense,",


In [14]:
# Preprocessed data should be equal to data as in paper
pd.testing.assert_frame_equal(df_rhyme_completions_by_line, df_preprocessed_rhyme_completions)
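
For readers unfamiliar with `pd.testing.assert_frame_equal`: it returns nothing when two frames match in values, dtypes, index, and column labels, and raises `AssertionError` otherwise. A small sketch on toy frames (the frames here are invented for illustration):

```python
import pandas as pd

left = pd.DataFrame(
    {"line_gen": ["a", "b"]},
    index=pd.Index([1, 2], name="line_num"),
)
right = left.copy()

# Passes silently because the two frames are identical.
pd.testing.assert_frame_equal(left, right)

# Change one cell and confirm that the comparison now fails.
changed = right.copy()
changed.loc[2, "line_gen"] = "c"
try:
    pd.testing.assert_frame_equal(left, changed)
    frames_differ = False
except AssertionError:
    frames_differ = True
```

The silent pass in the cell above is therefore the expected outcome when the two loading paths agree.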

In [4]:
# Show data by poem
df_rhyme_completions_by_poem_including_first_n_lines = get_genai_rhyme_completions_as_in_paper(by_line=False, keep_first_n_lines=True)
df_rhyme_completions_by_poem_including_first_n_lines

* Converting to poem txt format (keeping first lines from original poem)
* Total lines: 326,862
* Distribution of output poem lengths
  10  -- | 17 ]--                                                     100


id_human,model,first_n_lines,id,keep_first_n_lines,id_hash,txt,num_lines
english/arminrob/Z300263156,gpt-3.5-turbo,5,000098c6,True,23345,"I cannot tell for certain, yet Isle guess.\nYo...",12
english/heywoodt/Z200396007,ollama/olmo2:latest,5,0004444f,True,215823,"Excellent Princes may you ever be,\nAs great a...",20
english/sewardan/Z300482291,gpt-3.5-turbo,5,0007c94e,True,653777,"Up this bleak hill, in wintery night's dread h...",14
c20-english/ep56001/Z400307768,ollama/mistral:text,5,0008fbe6,True,11221,"Oh hard is the bed they have made him,\n    An...",16
c20-english/ep20004/Z300593359,claude-3-sonnet-20240229,5,000e56b8,True,651728,"Achilles grieves. A soldier, weeping, seems\nN...",10
...,...,...,...,...,...,...,...
c20-american/am22097/Z200236446,ollama/olmo2:latest,5,fff3b957,True,282770,The blackberries that ripened\nsoon after you ...,10
english/rawnsley/Z200471879,ollama/llama3.1:8b,5,fffb372e,True,416910,"The moat is dry, the drawbridge solid stone;\n...",14
english-ed2/ep2418/Z300660513,claude-3-sonnet-20240229,5,fffd3c3a,True,302949,"BATTLE OF Sinope.\n""Assyrios complexa sinus st...",16
english/paynejoh/Z300458758,ollama/llama3.1:8b,5,fffe1d15,True,183105,"HERE, for such as will, are roses;\nNone of th...",14


In [5]:
# Show data by poem excluding first n lines
df_rhyme_completions_by_poem_excluding_first_n_lines = get_genai_rhyme_completions_as_in_paper(by_line=False, keep_first_n_lines=False)
df_rhyme_completions_by_poem_excluding_first_n_lines

* Converting to poem txt format (not keeping first lines from original poem)
* Total lines: 221,212
* Distribution of output poem lengths
  10  |12 ]--                                                         100


id_human,model,first_n_lines,id,keep_first_n_lines,id_hash,txt,num_lines
english/arminrob/Z300263156,gpt-3.5-turbo,5,000098c6,False,23345,For in my heart she never will depart.\nHer la...,7
english/heywoodt/Z200396007,ollama/olmo2:latest,5,0004444f,False,215823,Give to our hearts the blessings you can rear....,15
english/sewardan/Z300482291,gpt-3.5-turbo,5,0007c94e,False,653777,"Of Desolation, where she holds her home,\nAnd ...",9
c20-english/ep56001/Z400307768,ollama/mistral:text,5,0008fbe6,False,11221,And the world’s at his feet.\nHe had a    fin...,11
c20-english/ep20004/Z300593359,claude-3-sonnet-20240229,5,000e56b8,False,651728,"The fields of war bring nothing but dismay,\nF...",5
...,...,...,...,...,...,...,...
c20-american/am22097/Z200236446,ollama/olmo2:latest,5,fff3b957,False,282770,"I watch for lightning strikes, recall\nmoments...",5
english/rawnsley/Z200471879,ollama/llama3.1:8b,5,fffb372e,False,416910,The rusty sword that hung beside it still glea...,9
english-ed2/ep2418/Z300660513,claude-3-sonnet-20240229,5,fffd3c3a,False,302949,And left thee but a shadow of thy might.\nYet ...,11
english/paynejoh/Z300458758,ollama/llama3.1:8b,5,fffe1d15,False,183105,"Within their secret, scented bowers;\nTheir be...",9
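
The two "Total lines" figures printed above can be cross-checked with simple arithmetic. The sketch below assumes every poem loses exactly `first_n_lines` rows when the prompt is dropped; the implied poem count is an inference from that assumption, not a number reported by the notebook.

```python
# Totals copied from the printed output of the two conversions.
total_with_prompt = 326_862
total_without_prompt = 221_212
first_n_lines = 5

# Lines removed when the human-written prompt is excluded.
dropped_lines = total_with_prompt - total_without_prompt

# If each poem drops exactly first_n_lines rows, this is the poem count.
implied_poems, remainder = divmod(dropped_lines, first_n_lines)
```

The difference is 105,650 lines, which divides evenly by 5 and would correspond to 21,130 poems under that assumption. This agrees with the `num_lines` columns in the two tables, where each displayed poem differs by exactly 5.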


### Replicating

In [None]: