# Rhyme

In [7]:
# !pip install -r ../requirements.txt
import sys
sys.path.append('../')  # put the parent directory on the import path so the package below resolves
from generative_formalism import *

In [9]:
documentation(get_all_memorization_data_with_rhyme_data)
df_mem_rhyme = get_all_memorization_data_with_rhyme_data()

**`get_all_memorization_data_with_rhyme_data`**

```md
Get comprehensive memorization data with rhyme analysis for all poems.

    This function combines memorization detection results from multiple sources
    (Antoniak et al., Chadwyck completions, Dolma) and enriches them with
    rhyme analysis data.

    Parameters
    ----------
    overwrite : bool, default False
        If True, force reprocessing of all data sources instead of using cached results.
    verbose : bool, default True
        If True, print progress messages during processing.

    Returns
    -------
    pd.DataFrame
        DataFrame containing memorization data with rhyme analysis, indexed by poem ID.
        Includes columns for:
        - Memorization detection flags and sources
        - Poem metadata (title, author, dates, etc.)
        - Rhyme analysis metrics and features
        - Unique ID hash for each poem

    Notes
    -----
    The returned DataFrame combines data from:
    - Antoniak et al. memorization study (closed and open source detection)
    - Chadwyck poetry corpus completions (similarity-based detection)
    - Dolma training corpus (open source detection)

    Rhyme data is computed using get_rhyme_for_sample() and joined with
    left suffix '_from_sample' to avoid column name conflicts.
    
```
----
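
The docstring's note about joining with a left suffix can be illustrated with plain pandas. This is a minimal sketch, not the library's implementation: the frames `left` and `right`, their columns, and their values are hypothetical, and which table sits on the left inside `get_all_memorization_data_with_rhyme_data` is not shown in this notebook.

```python
import pandas as pd

# Hypothetical toy frames that share the column name "num_lines".
left = pd.DataFrame(
    {"id": ["a", "b"], "found": [True, False], "num_lines": [14, 20]}
).set_index("id")
right = pd.DataFrame(
    {"id": ["a", "b"], "num_lines": [10, 10], "rhyme_pred_perc": [80.0, 20.0]}
).set_index("id")

# Index-on-index join: overlapping columns from the left frame get the suffix,
# while the right frame's columns keep their original names.
joined = left.join(right, lsuffix="_from_sample")
print(joined.columns.tolist())
```

Without a suffix, pandas raises an error when the two frames share a column name, which is the conflict the docstring refers to.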


* Loading from /Users/rj416/github/generative-formalism/data/data.all_memorization_data.csv.gz


* Getting rhymes for sample: 100%|██████████| 24039/24039 [00:02<00:00, 9354.56it/s]


In [10]:
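compute_all_stat_signif(
    df_mem_rhyme,
    groupby='found_source_corpus',  # one comparison per source|corpus value (e.g. closed|chadwyck)
    groupby_stat='found',           # within each group, split poems on the boolean `found` flag
    valname='rhyme_pred_perc',      # value column whose group means are compared
    verbose=True,
)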

Computing comparisons: 100%|██████████| 1/1 [00:00<00:00,  2.35it/s]
Computing comparisons: 100%|██████████| 1/1 [00:01<00:00,  1.06s/it]
Computing comparisons: 100%|██████████| 1/1 [00:00<00:00,  2.67it/s]
Computing comparisons: 100%|██████████| 1/1 [00:00<00:00,  2.16it/s]


| comparison | n1 | n2 | p_value | effect_size | effect_size_str | mean1 | mean2 | significant | found | groupby |
|---|---|---|---|---|---|---|---|---|---|---|
| True vs False | 1723 | 2330 | 0.0 | 0.902182 | large | 53.105049 | 14.935622 | True | False | closed\|antoniak-et-al |
| False vs True | 11227 | 71 | 0.0483 | 0.245205 | small | 72.111873 | 83.098592 | True | True | closed\|chadwyck |
| True vs False | 1759 | 2294 | 0.0 | 0.13005 |  | 34.565094 | 28.552746 | True | False | open\|antoniak-et-al |
| True vs False | 406 | 4229 | 0.813 | 0.01437 |  | 71.428571 | 72.073776 | False | False | open\|chadwyck |
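
The output does not say how `effect_size` and `effect_size_str` are derived. The labels are consistent with the usual Cohen's d conventions (0.2 small, 0.5 medium, 0.8 large, blank below 0.2), so the sketch below assumes that reading. The helpers `cohens_d` and `effect_size_label` are hypothetical names, not functions from `generative_formalism`.

```python
import numpy as np

def cohens_d(a, b):
    """Absolute standardized mean difference using a pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return abs(a.mean() - b.mean()) / np.sqrt(pooled_var)

def effect_size_label(d):
    """Conventional thresholds (assumed): 0.2 small, 0.5 medium, 0.8 large."""
    if d >= 0.8:
        return "large"
    if d >= 0.5:
        return "medium"
    if d >= 0.2:
        return "small"
    return ""  # below the "small" threshold, matching the blank cells

# Effect sizes copied from the four rows of the output.
for d in (0.902182, 0.245205, 0.13005, 0.01437):
    print(d, repr(effect_size_label(d)))
```

Under this assumption, the open|antoniak-et-al row is a case where a difference is statistically significant yet falls below the "small" threshold: with roughly 4,000 poems, even a six-point gap in means yields a very low p-value.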
