# Sampling Chadwyck-Healey poetry collections

In [1]:
import sys
sys.path.append('../')
from generative_formalism import *

## Data as in paper

In [2]:
documentation(get_chadwyck_corpus_sampled_by)

**`get_chadwyck_corpus_sampled_by`**

```md
Load or generate a sampled corpus by the specified criteria.

    Parameters
    - sample_by: Sampling criteria ('period', 'period_subcorpus', 'rhyme', 'sonnet_period')
    - as_in_paper: If True, load precomputed sample from paper
    - as_replicated: If True, load/generate replicated sample
    - as_regenerated: If True, load/generate regenerated sample
    - **kwargs: Additional arguments passed to generation/display functions

    Returns
    - pd.DataFrame containing the sampled corpus

    Calls
    -----
    - get_path(data_name, as_in_paper=True, as_replicated=False)

    
```
----


### Sampled by period

In [3]:
# Run
df_smpl_by_period_in_paper = get_chadwyck_corpus_sampled_by(
    'period', 
    as_in_paper=True, 
    verbose=True, 
    display=True
)

# Test
assert len(df_smpl_by_period_in_paper) == 8000

# Display
# df_smpl_by_period_in_paper

* Breakdown for period
           count
period          
1600-1650   1000
1650-1700   1000
1700-1750   1000
1750-1800   1000
1800-1850   1000
1850-1900   1000
1900-1950   1000
1950-2000   1000
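
The breakdown above and the `assert len(...) == 8000` check can be combined into a per-group balance check. The following is a minimal sketch in plain Python on toy rows, not the package's own code; `breakdown` and `is_balanced` are hypothetical helper names introduced here for illustration.

```python
from collections import Counter

def breakdown(rows, key):
    """Count rows per value of `key`, sorted by group label (hypothetical helper)."""
    counts = Counter(row[key] for row in rows)
    return dict(sorted(counts.items()))

def is_balanced(rows, key, expected_per_group):
    """Return True if every group has exactly `expected_per_group` rows."""
    return all(n == expected_per_group for n in breakdown(rows, key).values())

# Toy data standing in for a sampled corpus: 3 poems in each of two periods.
toy_rows = [
    {"id": f"poem{i}", "period": period}
    for period in ("1600-1650", "1650-1700")
    for i in range(3)
]

period_counts = breakdown(toy_rows, "period")
balanced = is_balanced(toy_rows, "period", expected_per_group=3)
```

With the real period sample, the same check would use 1000 as the expected group size.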



### Sampled by rhyme

In [4]:
# Run
df_smpl_by_rhyme_in_paper = get_chadwyck_corpus_sampled_by(
    'rhyme', 
    as_in_paper=True, 
    verbose=True, 
    display=True
)

# Test
assert len(df_smpl_by_rhyme_in_paper) == 2000

# Display
# df_smpl_by_rhyme_in_paper

* Breakdown for rhyme
       count
rhyme       
n       1000
y       1000



### Sampled by period/subcorpus

In [5]:
# Run
df_smpl_by_period_subcorpus_in_paper = get_chadwyck_corpus_sampled_by(
    'period_subcorpus',
    as_in_paper=True, 
    display=True, 
    verbose=True
)

# Test
assert len(df_smpl_by_period_subcorpus_in_paper) > 20_000

# Display
# df_smpl_by_period_subcorpus_in_paper

* Breakdown for period_subcorpus
                                    count
period    subcorpus                      
1600-1650 American Poetry             361
          English Poetry             1000
1650-1700 American Poetry              74
          English Poetry             1000
1700-1750 African-American Poetry       3
          American Poetry             340
          English Poetry             1000
          The Faber Poetry Library      2
1750-1800 African-American Poetry     284
          American Poetry            1000
          English Poetry             1000
          The Faber Poetry Library      2
1800-1850 African-American Poetry     542
          American Poetry            1000
          English Poetry             1000
          Modern Poetry                 1
          The Faber Poetry Library      1
1850-1900 African-American Poetry    1000
          American Poetry            1000
          English Poetry             1000
          Modern Poetry               809
 

### Sampled by sonnet/period

In [6]:
# Run
df_smpl_by_sonnet_period_in_paper = get_chadwyck_corpus_sampled_by('sonnet_period', as_in_paper=True, display=True, verbose=True)


# Display
# df_smpl_by_sonnet_period_in_paper

* Breakdown for sonnet_period
           count
period          
1600-1650    152
1650-1700     65
1700-1750    325
1750-1800   1000
1800-1850   1000
1850-1900   1000
1900-1950    180
1950-2000     12



## Replicating new samples

Generating replicated or regenerated samples requires access to the Chadwyck-Healey corpora.

In [7]:
documentation(get_chadwyck_corpus_sampled_by_replicated)
documentation(gen_chadwyck_corpus_sampled_by)
documentation(sample_chadwyck_corpus)

**`get_chadwyck_corpus_sampled_by_replicated`**

```md
Load or generate a stratified sample with disk caching.

    Loads a pre-generated stratified sample from disk if available, otherwise
    generates a new sample and caches it. This ensures efficient reuse of
    expensive sampling operations.

    Parameters
    ----------
    sample_by : str
        Sampling criteria ('rhyme', 'period', 'period_subcorpus', 'sonnet_period').
    force : bool, default=False
        If True, regenerate the sample even if a cached version exists.
    display : bool, default=False
        If True, display summary tables for certain sample types.
    verbose : bool, default=False
        If True, print progress information.
    as_in_paper : bool, default=True
        If True, use precomputed sample from paper.
    as_replicated : bool, default=False
        If True, use replicated sample.
    as_regenerated : bool, default=False
        If True, use regenerated sample.
    **kwargs: Additional arguments passed to gen_chadwyck_corpus_sampled_by

    Returns
    -------
    pd.DataFrame
        DataFrame containing the stratified sample.

    Calls
    -----
    - gen_chadwyck_corpus_sampled_by(sample_by, display=display) [if generating new sample]
    - save_sample(odf, path, overwrite=True) [if saving generated sample]
    - pd.read_csv(path).set_index('id').sort_values('id_hash') [if loading cached sample]
    - get_period_subcorpus_table(odf, return_display=True) [if display=True for period_subcorpus]
    - display(img) [if display=True and IPython available]
    
```
----
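
The load-or-generate behaviour described in the docstring above can be sketched with the standard library alone. This is an illustrative pattern, not the package's implementation: `load_or_generate` and `toy_generate` are hypothetical names, and the cache is a gzipped CSV chosen only because the saved sample paths in this notebook end in `.csv.gz`.

```python
import csv
import gzip
import os
import tempfile

def load_or_generate(path, generate, force=False):
    """Return rows cached at `path` if present; otherwise generate and cache them.

    Hypothetical helper: `generate` is any zero-argument function returning a
    non-empty list of dicts with string values.
    """
    if os.path.exists(path) and not force:
        # Cache hit: read the gzipped CSV back into a list of dicts.
        with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    # Cache miss (or force=True): run the expensive step, then write the cache.
    rows = generate()
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return rows

calls = []  # records how many times the generator actually ran

def toy_generate():
    """Stand-in for an expensive sampling step."""
    calls.append(1)
    return [
        {"id": "a", "period": "1600-1650"},
        {"id": "b", "period": "1650-1700"},
    ]

cache_path = os.path.join(tempfile.mkdtemp(), "samples", "toy_sample.csv.gz")
first = load_or_generate(cache_path, toy_generate)               # generates and caches
second = load_or_generate(cache_path, toy_generate)              # loads from disk
third = load_or_generate(cache_path, toy_generate, force=True)   # regenerates
```

The `force` flag here mirrors the documented `force` parameter: without it, a cached file short-circuits generation.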


**`gen_chadwyck_corpus_sampled_by`**

```md
Generate a stratified sample from the full Chadwyck-Healey corpus.

    Creates a balanced sample of poems using the specified stratification criteria.
    Handles different sampling types including rhyme, period, period×subcorpus,
    and sonnet-based sampling.

    Parameters
    ----------
    sample_by : str
        Sampling criteria ('rhyme', 'period', 'period_subcorpus', 'sonnet_period').
    display : bool, default=False
        If True, display summary tables for certain sample types (e.g., period).
    **kwargs: Additional arguments passed to sample_chadwyck_corpus

    Returns
    -------
    pd.DataFrame
        DataFrame containing the stratified sample with balanced representation.

    Calls
    -----
    - get_chadwyck_corpus() [to load the full corpus]
    - sample_chadwyck_corpus(df_corpus, sample_by=...) [to create stratified sample]
    - get_period_subcorpus_table(df, return_display=True) [if display=True for period samples]
    - display(img) [if display=True and IPython available]
    
```
----


**`sample_chadwyck_corpus`**

```md
Deterministically sample the corpus by one or more grouping criteria.

    Creates a balanced sample from the corpus by grouping on specified criteria,
    filtering groups by size constraints, and taking deterministic subsets within
    each group. Uses id_hash sorting to ensure reproducible results across runs.

    Parameters
    ----------
    df_corpus : pd.DataFrame
        Corpus DataFrame to sample from (e.g., from get_chadwyck_corpus()).
        Must contain the columns specified in sample_by plus 'id_hash'.
    sample_by : str or list[str]
        Column name(s) to group by for stratified sampling.
    min_sample_n : int, default=MIN_SAMPLE_N
        Minimum number of items required in a group to be included.
    max_sample_n : int, default=MAX_SAMPLE_N
        Maximum number of items to take from each group.
    prefer_min_id_hash : bool, default=False
        If True, prefer items with smaller id_hash values when sampling.
    sort_id_hash : bool, default=True
        If True, sort the sample by id_hash.
    verbose : bool, default=False
        If True, print progress information.


    Returns
    -------
    pd.DataFrame
        Sampled DataFrame containing the selected rows from df_corpus.

    Calls
    -----
    - describe_qual(s, count=False) [to display group size distribution]
    
```
----
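
The grouping, size filtering, and hash-ordered selection described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the package's code: `hash_id` and `sample_by_group` are hypothetical names, SHA-256 stands in for however `id_hash` is actually computed, and the sketch always keeps the rows with the smallest hashes, which may differ from the library's default when `prefer_min_id_hash=False`.

```python
import hashlib
from collections import defaultdict

def hash_id(poem_id):
    """Hypothetical stand-in for id_hash: a stable hex digest of the id."""
    return hashlib.sha256(poem_id.encode("utf-8")).hexdigest()

def sample_by_group(rows, group_key, min_n=1, max_n=1000):
    """Keep groups with at least `min_n` rows and take up to `max_n` rows from each.

    Rows within a group are ordered by id_hash before truncation, so the
    selection does not depend on the input order.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    sample = []
    for key in sorted(groups):
        members = groups[key]
        if len(members) < min_n:
            continue  # group too small to include
        members = sorted(members, key=lambda r: r["id_hash"])
        sample.extend(members[:max_n])
    # Final sort by id_hash, analogous to the documented sort_id_hash=True.
    return sorted(sample, key=lambda r: r["id_hash"])

# Toy corpus: 8 poems in one period, 4 in another.
toy_corpus = [
    {"id": f"poem{i}", "period": "1600-1650" if i % 3 else "1650-1700"}
    for i in range(12)
]
for row in toy_corpus:
    row["id_hash"] = hash_id(row["id"])

sample = sample_by_group(toy_corpus, "period", min_n=2, max_n=3)
# Shuffling the input order must not change the result.
sample_reversed = sample_by_group(list(reversed(toy_corpus)), "period", min_n=2, max_n=3)
```

Ordering by a hash of the id rather than drawing random numbers means the same rows are selected on every run without managing a random seed.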


In [8]:
## By period

# Regenerated (attempts to reapply random sorting preference)
df_regen_smpl_by_period = get_chadwyck_corpus_sampled_by('period', as_regenerated=True, display=True, verbose=True, force=True)

# Replicated (entirely new sample)
df_replicated_smpl_by_period = get_chadwyck_corpus_sampled_by('period', as_replicated=True, display=True, verbose=True, force=True)

* Generating period sample
* Loading Chadwyck-Healey corpus (metadata + txt)


  : 100%|██████████| 204514/204514 [00:06<00:00, 29447.48it/s]


* Saved sample to /Users/rj416/github/generative-formalism/data/data_as_regenerated/corpus_sample_by_period.csv.gz
* Breakdown for period
           count
period          
1600-1650   1000
1650-1700   1000
1700-1750   1000
1750-1800   1000
1800-1850   1000
1850-1900   1000
1900-1950   1000
1950-2000   1000

* Generating period sample
* Loading Chadwyck-Healey corpus (metadata + txt)
* Loading corpus from memory
* Saved sample to /Users/rj416/github/generative-formalism/data/data_as_replicated/corpus_sample_by_period.csv.gz
* Breakdown for period
           count
period          
1600-1650   1000
1650-1700   1000
1700-1750   1000
1750-1800   1000
1800-1850   1000
1850-1900   1000
1900-1950   1000
1950-2000   1000



In [9]:
## Others

df_smpl_by_rhyme_replicated = get_chadwyck_corpus_sampled_by('rhyme', as_replicated=True, display=True, verbose=True, force=True)
df_smpl_by_period_subcorpus_replicated = get_chadwyck_corpus_sampled_by('period_subcorpus', as_replicated=True, display=True, verbose=True)
df_smpl_by_sonnet_period_replicated = get_chadwyck_corpus_sampled_by('sonnet_period', as_replicated=True, as_in_paper=False, display=True, verbose=True, force=True)

* Generating rhyme sample
* Loading Chadwyck-Healey corpus (metadata + txt)
* Loading corpus from memory
* Saved sample to /Users/rj416/github/generative-formalism/data/data_as_replicated/corpus_sample_by_rhyme.csv.gz
* Breakdown for rhyme
       count
rhyme       
n       1000
y       1000

* Loading period_subcorpus sample from /Users/rj416/github/generative-formalism/data/data_as_replicated/corpus_sample_by_period_subcorpus.csv.gz
* Breakdown for period_subcorpus
                                    count
period    subcorpus                      
1600-1650 American Poetry             361
          English Poetry             1000
1650-1700 American Poetry              74
          English Poetry             1000
1700-1750 American Poetry             340
          English Poetry             1000
1750-1800 African-American Poetry     284
          American Poetry            1000
          English Poetry             1000
1800-1850 African-American Poetry     542
          American Poetry