# Corpus: Chadwyck-Healey poetry collections

## Loading corpus from source

In [1]:
import sys
sys.path.append('../')
from generative_formalism import *

In [2]:
documentation(check_paths,signature=False)
check_paths()

**Documentation for `check_paths`**

*Description*

```md
Check if the paths to the Chadwyck-Healey corpus and metadata are set and exist.
    Uses constants from `constants.py`.
    
```

✓ Chadwyck-Healey corpus path: /Users/ryan/github/generative-formalism/data/chadwyck_poetry/txt
✓ Chadwyck-Healey metadata path: /Users/ryan/github/generative-formalism/data/chadwyck_poetry/metadata.csv
✓ Metadata file URL set in environment (.env or shell)
✓ Corpus text file URL set in environment (.env or shell)


### Loading corpus metadata

In [3]:
documentation(get_chadwyck_corpus_metadata)

**Documentation for `get_chadwyck_corpus_metadata`**

*Description*

```md
Load and normalize Chadwyck-Healey corpus metadata.

    This function reads `PATH_CHADWYCK_HEALEY_METADATA`, downloading and unzipping
    if missing. It coerces numeric fields, derives `id_hash` and binned `period`,
    applies min/max filters, and caches the resulting DataFrame in `CORPUS_METADATA`.

    Parameters
    - fields: Mapping from raw column names to canonical names used downstream.
    - period_by: Size of year bin for `period` derived from `author_dob`.
    - download_if_necessary: If True, download metadata when not present on disk.
    - overwrite: If True, force re-download when files exist.
    - min_num_lines, max_num_lines: Optional poem-length filters.
    - min_author_dob, max_author_dob: Optional birth-year filters.

    Returns
    - pd.DataFrame indexed by `id`, sorted by `id_hash`, including normalized fields
      and derived `period`.
    - Caches the DataFrame in the module-level `CORPUS_METADATA`.
    
```

*Call signature*

```md
get_chadwyck_corpus_metadata(
    fields={   'attdbase_str': 'subcorpus',
    'attgenre': 'genre',
    'attperi_str': 'period_meta',
    'attrhyme': 'rhyme',
    'author': 'author',
    'author_dob': 'author_dob',
    'id': 'id',
    'id_hash': 'id_hash',
    'l': 'line',
    'num_lines': 'num_lines',
    'title': 'title',
    'volhead': 'volume',
    'year': 'year'}
    period_by=50
    download_if_necessary=True
    overwrite=False
    min_num_lines=10
    max_num_lines=100
    min_author_dob=1600
    max_author_dob=2000
)
```
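
The binning and filtering described above can be sketched in plain Python. This is a minimal illustration, not the package's implementation: `period_label` and `passes_filters` are hypothetical helpers, and floor-binning on `author_dob` is an assumption inferred from the output in this notebook (for example, a birth year of 1861 appears under `1850-1900`). Dropping rows with a missing birth year is likewise an assumption.

```python
def period_label(author_dob, period_by=50):
    """Label a birth year with its bin, e.g. 1861 -> '1850-1900'.

    Assumes floor-binning; the real function may differ in edge cases.
    """
    start = (int(author_dob) // period_by) * period_by
    return f"{start}-{start + period_by}"


def passes_filters(row, min_num_lines=10, max_num_lines=100,
                   min_author_dob=1600, max_author_dob=2000):
    """Apply inclusive min/max filters like those in the call signature."""
    dob = row.get("author_dob")
    if dob is None:  # assumption: rows without a birth year are dropped
        return False
    return (min_author_dob <= dob <= max_author_dob
            and min_num_lines <= row["num_lines"] <= max_num_lines)


# Toy rows, invented for illustration only.
rows = [
    {"id": "a", "author_dob": 1861.0, "num_lines": 39},
    {"id": "b", "author_dob": 1784.0, "num_lines": 48},
    {"id": "c", "author_dob": 1550.0, "num_lines": 20},   # born before 1600
    {"id": "d", "author_dob": 1912.0, "num_lines": 250},  # longer than 100 lines
]

# Keep the rows that pass every filter and attach the derived period.
kept = [dict(r, period=period_label(r["author_dob"]))
        for r in rows if passes_filters(r)]
```

With the defaults shown in the call signature, only rows `a` and `b` survive, labeled `1850-1900` and `1750-1800`.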

In [4]:
df_meta = get_chadwyck_corpus_metadata()
df_meta

* Loading metadata from /Users/ryan/github/generative-formalism/data/chadwyck_poetry/metadata.csv
* Loaded 336180 rows of metadata
* Filtering: 259,310 rows after author birth year >= 1600
* Filtering: 259,310 rows after author birth year <= 2000
* Filtering: 225,986 rows after number of lines >= 10
* Filtering: 204,514 rows after number of lines <= 100


id,id_hash,period_meta,subcorpus,author,author_dob,title,year,num_lines,volume,line,rhyme,genre,period
c20-english/ep30176/Z300606950,3,1900-1999 Twentieth-Century,English Poetry,"Tagore, Rabindranath, 1861-1941 (orig.) / Dyso...",1861.0,Memory,1891,39,,A town in the west country.,,,1850-1900
english/cunning2/Z200325930,4,1800-1834 Early Nineteenth-Century,English Poetry,"Cunningham, Allan, 1784-1842",1784.0,THE MOURNING LADY. SONG XV.,1814,48,Songs (1813),&indent;And ruddy hung the clust'ring rowan;,y,,1750-1800
modern/car3501/Z300553019,10,1900-1999 Twentieth-Century,Modern Poetry,"Prince, F. T. (Frank Templeton), 1912-",1912.0,The Intention,1942,14,,That at last the illustrious child,,,1900-1950
c20-american/am20110/Z300220313,22,1900-1999 Twentieth-Century,American Poetry,"Fitzgerald, Robert, 1910-1985",1910.0,SPRING SHADE,1940,22,,Lash one another's green in rinsing light.,,,1900-1950
c20-american/am23088/Z300256709,22,1900-1999 Twentieth-Century,American Poetry,"Peacock, Molly, 1947-",1947.0,Lullaby,1977,19,,Big as a down duvet the night,,Lyric,1900-1950
...,...,...,...,...,...,...,...,...,...,...,...,...,...
modern/ent2601/Z400573992,999991,1900-1999 Twentieth-Century,Modern Poetry,"Pitter, Ruth, 1897-1992",1897.0,"Female Yew&hyphen;tree, shedding condensed dro...",1927,16,,See how my yew&hyphen;tree,,,1850-1900
english/domettal/Z200340473,999992,1835-1869 Mid Nineteenth-Century,English Poetry,"Domett, Alfred, 1811-1887",1811.0,[Oft when I read the lays],1841,42,Poems (1833),“For a Muse is ours too.”,y,,1800-1850
c20-english/ep30038/Z300602451,999993,1900-1999 Twentieth-Century,English Poetry,"Dunmore, Helen, 1952-",1952.0,The parachute packers,1982,55,,The parachute packers with white faces,,,1950-2000
american/am0556/Z300164406,999995,1870-1899 Later Nineteenth-Century,American Poetry,"Dodge, Mary Abigail, 1833-1896",1833.0,"[Let not thy heart, O noble friend]",1863,16,"Chips, fragments and vestiges (1902)","&indent;My humble gift despise,",y,,1800-1850


### Loading corpus texts

In [5]:
documentation(get_chadwyck_corpus)
df_corpus = get_chadwyck_corpus(df_meta)
df_corpus

**Documentation for `get_chadwyck_corpus`**

*Description*

```md
Load metadata and poem texts into a single corpus DataFrame.

    Parameters
    - clean_poem: If True, clean poem texts after reading.
    - force: If True, ignore in-memory cache and rebuild corpus.
    - args/kwargs: Passed to `get_chadwyck_corpus_metadata`.

    Returns
    - pd.DataFrame with metadata plus a `txt` column containing poem text.

    Side Effects
    - Caches the result in the module-level `CORPUS`.
    
```

*Call signature*

```md
get_chadwyck_corpus(
    df_meta=None
    args
    clean_poem=True
    force=False
    download_if_necessary=True
    kwargs
)
```

* Loading Chadwyck-Healey corpus (metadata + txt)
* Loading 204514 texts


  : 100%|██████████| 204514/204514 [00:29<00:00, 6840.02it/s]


id,id_hash,period_meta,subcorpus,author,author_dob,title,year,num_lines,volume,line,rhyme,genre,period,txt
c20-english/ep30176/Z300606950,3,1900-1999 Twentieth-Century,English Poetry,"Tagore, Rabindranath, 1861-1941 (orig.) / Dyso...",1861.0,Memory,1891,39,,A town in the west country.,,,1850-1900,A town in the west country.\n On its se...
english/cunning2/Z200325930,4,1800-1834 Early Nineteenth-Century,English Poetry,"Cunningham, Allan, 1784-1842",1784.0,THE MOURNING LADY. SONG XV.,1814,48,Songs (1813),&indent;And ruddy hung the clust'ring rowan;,y,,1750-1800,"Bright shone the birks with morning Due,\n ..."
modern/car3501/Z300553019,10,1900-1999 Twentieth-Century,Modern Poetry,"Prince, F. T. (Frank Templeton), 1912-",1912.0,The Intention,1942,14,,That at last the illustrious child,,,1900-1950,"That at last the illustrious child\nYou, I wou..."
c20-american/am20110/Z300220313,22,1900-1999 Twentieth-Century,American Poetry,"Fitzgerald, Robert, 1910-1985",1910.0,SPRING SHADE,1940,22,,Lash one another's green in rinsing light.,,,1900-1950,"The April winds rise, and the willow whips\nLa..."
c20-american/am23088/Z300256709,22,1900-1999 Twentieth-Century,American Poetry,"Peacock, Molly, 1947-",1947.0,Lullaby,1977,19,,Big as a down duvet the night,,Lyric,1900-1950,Big as a down duvet the night\npulls the close...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
modern/ent2601/Z400573992,999991,1900-1999 Twentieth-Century,Modern Poetry,"Pitter, Ruth, 1897-1992",1897.0,"Female Yew&hyphen;tree, shedding condensed dro...",1927,16,,See how my yew&hyphen;tree,,,1850-1900,See how my yew-tree\nIn utter drought and heat...
english/domettal/Z200340473,999992,1835-1869 Mid Nineteenth-Century,English Poetry,"Domett, Alfred, 1811-1887",1811.0,[Oft when I read the lays],1841,42,Poems (1833),“For a Muse is ours too.”,y,,1800-1850,Oft when I read the lays\nOf many a deathless ...
c20-english/ep30038/Z300602451,999993,1900-1999 Twentieth-Century,English Poetry,"Dunmore, Helen, 1952-",1952.0,The parachute packers,1982,55,,The parachute packers with white faces,,,1950-2000,The parachute packers with white faces\nswathe...
american/am0556/Z300164406,999995,1870-1899 Later Nineteenth-Century,American Poetry,"Dodge, Mary Abigail, 1833-1896",1833.0,"[Let not thy heart, O noble friend]",1863,16,"Chips, fragments and vestiges (1902)","&indent;My humble gift despise,",y,,1800-1850,"Let not thy heart, O noble friend\n My humb..."
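
In the table above, the `line` and `title` columns still carry markup entities such as `&hyphen;` and `&indent;`, while the `txt` column shows them resolved (for example, `yew&hyphen;tree` becomes `yew-tree`). A minimal sketch of that kind of cleanup follows. `clean_entities` and the replacement table are hypothetical; the actual `clean_poem` logic is not shown in this notebook, and the width used for `&indent;` is an assumption.

```python
# Hypothetical entity table; only the two entities visible in the output above.
ENTITY_REPLACEMENTS = {
    "&hyphen;": "-",
    "&indent;": "    ",  # assumption: one indent level rendered as four spaces
}


def clean_entities(text, replacements=ENTITY_REPLACEMENTS):
    """Replace each known entity with its plain-text equivalent."""
    for entity, replacement in replacements.items():
        text = text.replace(entity, replacement)
    return text


cleaned = clean_entities("See how my yew&hyphen;tree")
```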


## Sampling corpus

### By period

#### As in paper

In [6]:
documentation(get_chadwyck_corpus_sampled_by_period_as_in_paper, source=True)

**Documentation for `get_chadwyck_corpus_sampled_by_period_as_in_paper`**

*Description*

```md
Load the period-based sample used in the paper (precomputed).
```

*Source code*

```py
def get_chadwyck_corpus_sampled_by_period_as_in_paper() -> pd.DataFrame:
    """Load the period-based sample used in the paper (precomputed)."""
    return pd.read_csv(PATH_SAMPLE_PERIOD_IN_PAPER).fillna('').set_index('id').sort_values('id_hash')

```

In [7]:
df_smpl_by_period_in_paper = get_chadwyck_corpus_sampled_by_period_as_in_paper()
df_smpl_by_period_in_paper.head()

id,id_hash,period_meta,subcorpus,author,author_dob,title,year,num_lines,volume,line,rhyme,genre,period,txt
english-ed2/ep2438/Z300661875,1,,English Poetry,"Price, Herbert, b. 1858",1858.0,THE FORSAKEN GARDEN,1888,35,Poems and Sonnets by Herbert Price (1914),"In the garden we loved that is now a waste,",y,,1850-1900,"Ah! sweet were the days, and the nights and th..."
english/pennecu1/Z200459978,1,1660-1700 Restoration,English Poetry,"Pennecuik, Alexander, 1652-1722",1652.0,THE CITY AND COUNTRY MOUSE.,1682,50,The Works (1815),"&indent;Met with a city mouse, right smooth an...",y,,1650-1700,"A country mouse, upon a winter's day,\n Met..."
english/wattsisa/Z300523040,2,1750-1799 Later Eighteenth-Century,English Poetry,"Watts, Isaac, 1674-1748",1674.0,SONG 11. Heaven and Hell.,1704,16,The Works (1810),&indent;A heav'n of joy and love;,y,Lyric,1650-1700,There is beyond the sky\n A heaven of joy a...
english/hardytho/Z200137433,3,1870-1899 Later Nineteenth-Century,English Poetry,"Hardy, Thomas, 1840-1928",1840.0,WHEN DEAD,1870,16,,&indent;&indent;I am under the bough;,y,,1800-1850,It will be much better when\n I am unde...
c20-american/da22040/Z300203417,3,1900-1999 Twentieth-Century,American Poetry,"Walker, Margaret, 1915-1998",1915.0,BALLAD OF THE HOPPY&hyphen;TOAD,1945,84,,Ain't been on Market Street for nothing,,,1900-1950,Ain't been on Market Street for nothing\nWith ...


In [8]:
assert len(df_smpl_by_period_in_paper) == 8000

#### Regenerating a new sample

In [9]:
documentation(get_chadwyck_corpus_sampled_by_period_as_replicated, source=True)
documentation(gen_chadwyck_corpus_sampled_by_period, source=True)
documentation(sample_chadwyck_corpus)

**Documentation for `get_chadwyck_corpus_sampled_by_period_as_replicated`**

*Description*

```md
Convenience wrapper to compute or load period-stratified sample (replication).
```

*Source code*

```py
def get_chadwyck_corpus_sampled_by_period_as_replicated(overwrite=False) -> pd.DataFrame:
    """Convenience wrapper to compute or load period-stratified sample (replication)."""
    df_smpl = get_chadwyck_corpus_sampled_by_period(force=overwrite)
    return df_smpl

```

**Documentation for `gen_chadwyck_corpus_sampled_by_period`**

*Description*

```md
Generate a period-stratified sample from the full corpus.
```

*Source code*

```py
def gen_chadwyck_corpus_sampled_by_period() -> pd.DataFrame:
    """Generate a period-stratified sample from the full corpus."""
    df_corpus = get_chadwyck_corpus()
    df = sample_chadwyck_corpus(
        df_corpus,
        sample_by='period',
    )
    return df

```

**Documentation for `sample_chadwyck_corpus`**

*Description*

```md
Deterministically sample `df_corpus` by one or more grouping keys.

    Rules
    - Keep only groups with at least `min_sample_n` items (if provided).
    - Within each group, sort by `id_hash` and take the first `max_sample_n` rows
      (if provided). This ensures stable sampling across runs.

    Parameters
    - df_corpus: Corpus DataFrame (e.g., from `get_chadwyck_corpus`).
    - sample_by: Column name or list of names to group by.
    - min_sample_n, max_sample_n: Group size constraints.

    Returns
    - pd.DataFrame containing the sampled rows.
    
```

*Call signature*

```md
sample_chadwyck_corpus(
    df_corpus
    sample_by
    min_sample_n=10
    max_sample_n=1000
    prefer_min_id_hash=False
)
```
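
The sampling rules above (drop small groups, sort each group by `id_hash`, keep the first `max_sample_n` rows) can be sketched in plain Python as follows. `sample_by_group` is a hypothetical stand-in written for illustration; it does not model the undocumented `prefer_min_id_hash` parameter, and the final sort by `id_hash` is an assumption based on the displayed output.

```python
from collections import defaultdict


def sample_by_group(rows, sample_by, min_sample_n=10, max_sample_n=1000):
    """Deterministically sample rows within groups defined by `sample_by`."""
    keys = [sample_by] if isinstance(sample_by, str) else list(sample_by)

    # Bucket rows by their grouping key(s).
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)

    sampled = []
    for key in sorted(groups):
        members = groups[key]
        # Rule 1: skip groups that are too small.
        if min_sample_n and len(members) < min_sample_n:
            continue
        # Rule 2: sort by id_hash and keep the first max_sample_n rows,
        # so repeated runs return the same rows.
        members = sorted(members, key=lambda r: r["id_hash"])
        sampled.extend(members[:max_sample_n] if max_sample_n else members)

    return sorted(sampled, key=lambda r: r["id_hash"])


# Toy rows, invented for illustration only.
rows = [
    {"id": "p1", "period": "1800-1850", "id_hash": 40},
    {"id": "p2", "period": "1800-1850", "id_hash": 7},
    {"id": "p3", "period": "1800-1850", "id_hash": 19},
    {"id": "p4", "period": "1850-1900", "id_hash": 3},
]
sample = sample_by_group(rows, "period", min_sample_n=2, max_sample_n=2)
```

Because selection depends only on the `id_hash` ordering rather than on a random seed, the same corpus yields the same sample on every run.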

In [10]:
df_smpl_by_period = get_chadwyck_corpus_sampled_by_period_as_replicated(overwrite=REPLICATE_OVERWRITE)
df_smpl_by_period.head()


* Generating period sample
* Loading Chadwyck-Healey corpus (metadata + txt)
* Loading corpus from memory
* Sampling corpus by period (min 10, max 1000)
* Original sample size: 204514
* Final sample size: 8000

* Breakdown for period
1600-1650    1000
1650-1700    1000
1700-1750    1000
1750-1800    1000
1800-1850    1000
1850-1900    1000
1900-1950    1000
1950-2000    1000

* Saved sample to /Users/ryan/github/generative-formalism/data/corpus_sample_by_period.replicated.csv.gz


id,id_hash,period_meta,subcorpus,author,author_dob,title,year,num_lines,volume,line,rhyme,genre,period,txt
c20-english/ep30176/Z300606950,3,1900-1999 Twentieth-Century,English Poetry,"Tagore, Rabindranath, 1861-1941 (orig.) / Dyso...",1861.0,Memory,1891,39,,A town in the west country.,,,1850-1900,A town in the west country.\n On its se...
english/cunning2/Z200325930,4,1800-1834 Early Nineteenth-Century,English Poetry,"Cunningham, Allan, 1784-1842",1784.0,THE MOURNING LADY. SONG XV.,1814,48,Songs (1813),&indent;And ruddy hung the clust'ring rowan;,y,,1750-1800,"Bright shone the birks with morning Due,\n ..."
modern/car3501/Z300553019,10,1900-1999 Twentieth-Century,Modern Poetry,"Prince, F. T. (Frank Templeton), 1912-",1912.0,The Intention,1942,14,,That at last the illustrious child,,,1900-1950,"That at last the illustrious child\nYou, I wou..."
c20-american/am20110/Z300220313,22,1900-1999 Twentieth-Century,American Poetry,"Fitzgerald, Robert, 1910-1985",1910.0,SPRING SHADE,1940,22,,Lash one another's green in rinsing light.,,,1900-1950,"The April winds rise, and the willow whips\nLa..."
c20-american/am23088/Z300256709,22,1900-1999 Twentieth-Century,American Poetry,"Peacock, Molly, 1947-",1947.0,Lullaby,1977,19,,Big as a down duvet the night,,Lyric,1900-1950,Big as a down duvet the night\npulls the close...


In [11]:
if len(df_smpl_by_period):
    assert len(df_smpl_by_period) == 8000
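
The assertions above check only the sample sizes. The `head()` previews begin with different rows (`id_hash` 1 in the paper sample, 3 in the replicated one), so the two samples may not contain the same poems. A small, hypothetical helper like the one below could quantify the overlap; it works on any two collections of ids and is not part of the package.

```python
def sample_overlap(ids_a, ids_b):
    """Summarize how many ids two samples share."""
    a, b = set(ids_a), set(ids_b)
    shared = a & b
    union = a | b
    return {
        "n_a": len(a),
        "n_b": len(b),
        "n_shared": len(shared),
        # Jaccard similarity: shared ids divided by all distinct ids.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Toy ids, invented for illustration only.
overlap = sample_overlap(["a", "b", "c"], ["b", "c", "d"])
```

In the notebook, the same helper could be called as `sample_overlap(df_smpl_by_period_in_paper.index, df_smpl_by_period.index)`, since both DataFrames are indexed by `id`.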

### By rhyme

#### As in paper

In [12]:
documentation(get_chadwyck_corpus_sampled_by_rhyme_as_in_paper, source=True)

**Documentation for `get_chadwyck_corpus_sampled_by_rhyme_as_in_paper`**

*Description*

```md
Load the rhyme-based sample used in the paper (precomputed).
```

*Source code*

```py
def get_chadwyck_corpus_sampled_by_rhyme_as_in_paper() -> pd.DataFrame:
    """Load the rhyme-based sample used in the paper (precomputed)."""
    return pd.read_csv(PATH_SAMPLE_RHYMES_IN_PAPER).fillna('').set_index('id').sort_values('id_hash')

```

In [13]:
df_smpl_by_rhyme_in_paper = get_chadwyck_corpus_sampled_by_rhyme_as_in_paper()
assert len(df_smpl_by_rhyme_in_paper) == 2000
df_smpl_by_rhyme_in_paper.head()

id,id_hash,period_meta,subcorpus,author,author_dob,title,year,num_lines,volume,line,rhyme,genre,period,txt
english-ed2/ep2438/Z300661875,1,,English Poetry,"Price, Herbert, b. 1858",1858.0,THE FORSAKEN GARDEN,1888,35,Poems and Sonnets by Herbert Price (1914),"In the garden we loved that is now a waste,",y,,1850-1900,"Ah! sweet were the days, and the nights and th..."
english/pennecu1/Z200459978,1,1660-1700 Restoration,English Poetry,"Pennecuik, Alexander, 1652-1722",1652.0,THE CITY AND COUNTRY MOUSE.,1682,50,The Works (1815),"&indent;Met with a city mouse, right smooth an...",y,,1650-1700,"A country mouse, upon a winter's day,\n Met..."
english/wattsisa/Z300523040,2,1750-1799 Later Eighteenth-Century,English Poetry,"Watts, Isaac, 1674-1748",1674.0,SONG 11. Heaven and Hell.,1704,16,The Works (1810),&indent;A heav'n of joy and love;,y,Lyric,1650-1700,There is beyond the sky\n A heaven of joy a...
english/hardytho/Z200137433,3,1870-1899 Later Nineteenth-Century,English Poetry,"Hardy, Thomas, 1840-1928",1840.0,WHEN DEAD,1870,16,,&indent;&indent;I am under the bough;,y,,1800-1850,It will be much better when\n I am unde...
english/fawkesfr/Z300372956,4,1750-1799 Later Eighteenth-Century,English Poetry,"Fawkes, Francis, 1720-1777",1720.0,"III. ON A WORTHY FRIEND, Who was accomplished...",1750,10,Original Poems and Translations (1761),"Thou friendly, candid, virtuous mind, farewel!",y,,1700-1750,"Oh born in liberal studies to excel,\nThou fri..."


#### Replicated

In [14]:
documentation(get_chadwyck_corpus_sampled_by_rhyme_as_replicated, source=True)
documentation(gen_chadwyck_corpus_sampled_by_rhyme, source=True)

**Documentation for `get_chadwyck_corpus_sampled_by_rhyme_as_replicated`**

*Description*

```md
Convenience wrapper to compute or load rhyme-stratified sample (replication).
```

*Source code*

```py
def get_chadwyck_corpus_sampled_by_rhyme_as_replicated(overwrite=False) -> pd.DataFrame:
    """Convenience wrapper to compute or load rhyme-stratified sample (replication)."""
    df_smpl = get_chadwyck_corpus_sampled_by_rhyme(force=overwrite)
    return df_smpl

```

**Documentation for `gen_chadwyck_corpus_sampled_by_rhyme`**

*Description*

```md
Generate a rhyme-stratified sample from the full corpus.
```

*Source code*

```py
def gen_chadwyck_corpus_sampled_by_rhyme() -> pd.DataFrame:
    """Generate a rhyme-stratified sample from the full corpus."""
    df_corpus = get_chadwyck_corpus()
    df_corpus = df_corpus[df_corpus.rhyme.isin({'y','n'})]
    df = sample_chadwyck_corpus(
        df_corpus,
        sample_by='rhyme',
    )
    return df

```

In [15]:
df_smpl_by_rhyme = get_chadwyck_corpus_sampled_by_rhyme_as_replicated(overwrite=REPLICATE_OVERWRITE)
df_smpl_by_rhyme.head()

* Generating rhyme sample
* Loading Chadwyck-Healey corpus (metadata + txt)
* Loading corpus from memory
* Sampling corpus by rhyme (min 10, max 1000)
* Original sample size: 146309
* Final sample size: 2000

* Breakdown for rhyme
n    1000
y    1000

* Saved sample to /Users/ryan/github/generative-formalism/data/corpus_sample_by_rhyme.replicated.csv.gz


id,id_hash,period_meta,subcorpus,author,author_dob,title,year,num_lines,volume,line,rhyme,genre,period,txt
english/cunning2/Z200325930,4,1800-1834 Early Nineteenth-Century,English Poetry,"Cunningham, Allan, 1784-1842",1784.0,THE MOURNING LADY. SONG XV.,1814,48,Songs (1813),&indent;And ruddy hung the clust'ring rowan;,y,,1750-1800,"Bright shone the birks with morning Due,\n ..."
english/swaincha/Z200502215,37,1835-1869 Mid Nineteenth-Century,English Poetry,"Swain, Charles, 1801-1874",1801.0,"ELIZABETH WILSON; OR, SOMETHING TO DO.",1831,28,Rhymes for Childhood [1846],&indent;For pleasure by labour is won;,y,,1800-1850,"Tis well to have something to do,\n For ple..."
english/mallockw/Z200424116,43,1870-1899 Later Nineteenth-Century,English Poetry,"Mallock, W. H. (William Hurrell), 1849-1923",1849.0,A MAY IDYL.,1879,40,Poems (1880),"Thus by green shadow of hazels murmured o'er,",y,,1800-1850,"Would I might lean and dream here evermore,\nT..."
english/heathrob/Z300390769,49,1603-1660 Jacobean and Caroline,English Poetry,"Heath, Robert, fl. 1636-1659",1636.0,On the loss of Clarastella's black fan.,1666,48,Clarastella (1650),"&indent;Courted your wanton hair,",y,,1600-1650,Tel me (fair wonder!) when the gentle air\n ...
english/chandler/Z200310600,61,1700-1749 Early Eighteenth-Century,English Poetry,"Chandler, Mary, 1687-1745",1687.0,My WISH.,1717,53,The Description of Bath (1736),"For future Life, it shou'd be this;",y,,1650-1700,Wou'd Heav'n indulgent grant my Wish\nFor futu...


In [16]:
assert len(df_smpl_by_rhyme) == 2000

### By period/subcorpus

#### As in paper

In [17]:
documentation(get_chadwyck_corpus_sampled_by_period_subcorpus_as_in_paper, source=True)
documentation(display_period_subcorpus_tables, source=True)
documentation(get_period_subcorpus_table)


**Documentation for `get_chadwyck_corpus_sampled_by_period_subcorpus_as_in_paper`**

*Description*

```md
Load the period×subcorpus sample used in the paper and optionally display a table.
```

*Source code*

```py
def get_chadwyck_corpus_sampled_by_period_subcorpus_as_in_paper(display=False) -> pd.DataFrame:
    """Load the period×subcorpus sample used in the paper and optionally display a table."""
    odf = pd.read_csv(PATH_SAMPLE_PERIOD_SUBCORPUS_IN_PAPER).fillna('').set_index('id').sort_values('id_hash')
    if display:
        display_period_subcorpus_tables(odf)
    return odf

```

**Documentation for `display_period_subcorpus_tables`**

*Description*

```md
Display summary tables for a sampled DataFrame (IPython rich display if available).
```

*Source code*

```py
def display_period_subcorpus_tables(df):
    """Display summary tables for a sampled DataFrame (IPython rich display if available)."""
    try_display(get_period_subcorpus_table(df, return_display=True))

```

**Documentation for `get_period_subcorpus_table`**

*Description*

```md
Build a period×subcorpus summary table and optionally save LaTeX.

    Parameters
    - df_smpl: Sampled DataFrame containing `period`, `subcorpus`, `author`, `id`.
    - save_latex_to: Base path for LaTeX/table image output; if falsy, skip saving.
    - save_latex_to_suffix: Filename suffix for differentiation.
    - return_display: If True, return a display object suitable for notebooks.
    - table_num: Optional table number for LaTeX captioning.

    Returns
    - A formatted DataFrame (if not returning display object) or a display/image object.
    
```

*Call signature*

```md
get_period_subcorpus_table(
    df_smpl
    save_latex_to='/Users/ryan/github/generative-formalism/data/tex/table_5.period_subcorpus_counts.tex'
    save_latex_to_suffix='tmp'
    return_display=False
    table_num=None
)
```
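
As a rough illustration of what a period×subcorpus summary involves, the sketch below counts distinct poems and distinct authors per cell from the four columns the description names. `period_subcorpus_counts` is hypothetical, and which statistics the real table reports is an assumption; the LaTeX and image output are not modeled.

```python
from collections import defaultdict


def period_subcorpus_counts(rows):
    """Count distinct poems and authors for each (period, subcorpus) pair."""
    poems = defaultdict(set)
    authors = defaultdict(set)
    for row in rows:
        key = (row["period"], row["subcorpus"])
        poems[key].add(row["id"])
        authors[key].add(row["author"])
    return {
        key: {"n_poems": len(poems[key]), "n_authors": len(authors[key])}
        for key in sorted(poems)
    }


# Toy rows, invented for illustration only.
rows = [
    {"id": "1", "period": "1800-1850", "subcorpus": "English Poetry", "author": "A"},
    {"id": "2", "period": "1800-1850", "subcorpus": "English Poetry", "author": "A"},
    {"id": "3", "period": "1800-1850", "subcorpus": "American Poetry", "author": "B"},
]
table = period_subcorpus_counts(rows)
```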

In [21]:
df_smpl_by_period_subcorpus_in_paper = get_chadwyck_corpus_sampled_by_period_subcorpus_as_in_paper(display=True)
df_smpl_by_period_subcorpus_in_paper.head()

* Loading corpus metadata from memory
* Writing LaTeX to /Users/ryan/github/generative-formalism/data/tex/table_5.period_subcorpus_counts.tmp.tex
* Rendering PNG to /Users/ryan/github/generative-formalism/data/tex/table_5.period_subcorpus_counts.tmp.png
* LaTeX compile failed: Command '['/Library/TeX/texbin/pdflatex', '-interaction=nonstopmode', '-halt-on-error', 'table.tex']' returned non-zero exit status 1.


'/Users/ryan/github/generative-formalism/data/tex/table_5.period_subcorpus_counts.tmp.tex'

id,id_hash,period_meta,subcorpus,author,author_dob,title,year,num_lines,volume,line,rhyme,genre,period,txt
c20-english/ep20152/Z200586158,2,1900-1999 Twentieth-Century,English Poetry,"Rosenberg, Isaac, 1890-1918",1890.0,‘I KNOW YOU GOLDEN’,1920,12,,I know you golden,,,1850-1900,I know you golden\nAs summer and pale\nAs the ...
english/kerpeter/Z300410015,3,1660-1700 Restoration,English Poetry,"Ker, Patrick, fl. 1691",1691.0,On the Memory of a Married Maid.,1721,16,Flosculum Poeticum (1684),A Marrie'd&hyphen;Virgin to remain.,y,,1650-1700,"Within this Coffin here does lie,\nA Pattern o..."
american/am1258/Z200196105,7,1835-1869 Mid Nineteenth-Century,American Poetry,"Emerson, Ralph Waldo, 1803-1882",1803.0,SEPTEMBER,1833,16,Poems [1904],"&indent;Of a gusty Autumn day,",y,,1800-1850,In the turbulent beauty\n Of a gusty Autumn...
english/gilfilla/Z400379001,8,1800-1834 Early Nineteenth-Century,English Poetry,"Gilfillan, Robert, 1798-1850",1798.0,NORWEGIAN SMUGGLER'S SONG.,1828,36,Poems and Songs (1851),"&indent;The storm is loud and high,",y,,1750-1800,"Awake, you midnight mariners!\n The storm i..."
english/wattwill/Z300523577,18,1800-1834 Early Nineteenth-Century,English Poetry,"Watt, William, 1793-1859",1793.0,BAB AT THE BOWSTER.,1823,40,Poems and Songs (1860),Wi' touslet hair and drowsy een?,y,Ballad,1750-1800,"Lassie, whare were you yestreen,\nWi' touslet ..."


#### Replicated

In [22]:
documentation(get_chadwyck_corpus_sampled_by_period_subcorpus_as_replicated, source=True)
documentation(get_chadwyck_corpus_sampled_by_period_subcorpus, source=True)
documentation(gen_chadwyck_corpus_sampled_by_period_subcorpus, source=True)

**Documentation for `get_chadwyck_corpus_sampled_by_period_subcorpus_as_replicated`**

*Description*

```md
Convenience wrapper to compute or load period×subcorpus sample (replication).
```

*Source code*

```py
def get_chadwyck_corpus_sampled_by_period_subcorpus_as_replicated(overwrite=False, display=False) -> pd.DataFrame:
    """Convenience wrapper to compute or load period×subcorpus sample (replication)."""
    df_smpl = get_chadwyck_corpus_sampled_by_period_subcorpus(force=overwrite)
    if display:
        display_period_subcorpus_tables(df_smpl)
    return df_smpl

```

**Documentation for `get_chadwyck_corpus_sampled_by_period_subcorpus`**

*Description*

```md
Load or generate period×subcorpus sample; cache on disk at `PATH_SAMPLE_PERIOD_SUBCORPUS_REPLICATED`.
```

*Source code*

```py
def get_chadwyck_corpus_sampled_by_period_subcorpus(force=False, display=False) -> pd.DataFrame:
    """Load or generate period×subcorpus sample; cache on disk at `PATH_SAMPLE_PERIOD_SUBCORPUS_REPLICATED`."""
    path = PATH_SAMPLE_PERIOD_SUBCORPUS_REPLICATED
    if force or not os.path.exists(path):
        print(f'* Generating period subcorpus sample')
        odf = gen_chadwyck_corpus_sampled_by_period_subcorpus()
        if len(odf):
            save_sample(odf, path, overwrite=True)
    else:
        print(f'* Loading period subcorpus sample from {path}')
        odf = pd.read_csv(path).set_index('id').sort_values('id_hash')
    if display:
        try:
            # Import under an alias so the boolean `display` flag is not shadowed.
            from IPython.display import display as ipy_display
            img = get_period_subcorpus_table(odf, return_display=True)
            ipy_display(img)
        except (NameError, ImportError):
            print('* Warning: Could not display image')
    return odf

```
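
The function above follows a load-or-generate pattern: rebuild when forced or when the cache file is missing, otherwise read the cached copy. A minimal, self-contained sketch of that pattern is shown below, using JSON instead of a compressed CSV so that it runs with the standard library alone. `load_or_generate` and `fake_generate` are hypothetical names.

```python
import json
import os
import tempfile


def load_or_generate(path, generate, force=False):
    """Return cached data from `path`, or call `generate()` and cache the result."""
    if force or not os.path.exists(path):
        data = generate()
        if len(data):  # only write non-empty results, as in the source above
            with open(path, "w", encoding="utf-8") as f:
                json.dump(data, f)
    else:
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
    return data


calls = []


def fake_generate():
    """Stand-in for an expensive sampling step; records each invocation."""
    calls.append(1)
    return [{"id": "x", "id_hash": 1}]


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "sample.json")
    first = load_or_generate(cache_path, fake_generate)               # generates
    second = load_or_generate(cache_path, fake_generate)              # reads cache
    third = load_or_generate(cache_path, fake_generate, force=True)   # regenerates
```

Passing `force=True` plays the role that `overwrite=REPLICATE_OVERWRITE` plays in the notebook cells: it bypasses the cached file and rebuilds the sample.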

**Documentation for `gen_chadwyck_corpus_sampled_by_period_subcorpus`**

*Description*

```md
Generate a period×subcorpus-stratified sample from the full corpus.
```

*Source code*

```py
def gen_chadwyck_corpus_sampled_by_period_subcorpus() -> pd.DataFrame:
    """Generate a period×subcorpus-stratified sample from the full corpus."""
    df_corpus = get_chadwyck_corpus()
    df = sample_chadwyck_corpus(
        df_corpus,
        sample_by=['period','subcorpus'],
    )
    return df

```

In [23]:
df_smpl_by_period_subcorpus_replicated = get_chadwyck_corpus_sampled_by_period_subcorpus_as_replicated(display=True, overwrite=REPLICATE_OVERWRITE)
df_smpl_by_period_subcorpus_replicated

* Generating period subcorpus sample
* Loading Chadwyck-Healey corpus (metadata + txt)
* Loading corpus from memory
* Sampling corpus by ['period', 'subcorpus'] (min 10, max 1000)
* Original sample size: 204514
* Final sample size: 22709

* Breakdown for period/subcorpus
1600-1650  American Poetry              361
           English Poetry              1000
1650-1700  American Poetry               74
           English Poetry              1000
1700-1750  American Poetry              340
           English Poetry              1000
1750-1800  African-American Poetry      284
           American Poetry             1000
           English Poetry              1000
1800-1850  African-American Poetry      542
           American Poetry             1000
           English Poetry              1000
1850-1900  African-American Poetry     1000
           American Poetry             1000
           English Poetry              1000
           Modern Poetry                809
           The Faber Poe

'/Users/ryan/github/generative-formalism/data/tex/table_5.period_subcorpus_counts.tmp.tex'

id,id_hash,period_meta,subcorpus,author,author_dob,title,year,num_lines,volume,line,rhyme,genre,period,txt
c20-english/ep30176/Z300606950,3,1900-1999 Twentieth-Century,English Poetry,"Tagore, Rabindranath, 1861-1941 (orig.) / Dyso...",1861.0,Memory,1891,39,,A town in the west country.,,,1850-1900,A town in the west country.\n On its se...
english/cunning2/Z200325930,4,1800-1834 Early Nineteenth-Century,English Poetry,"Cunningham, Allan, 1784-1842",1784.0,THE MOURNING LADY. SONG XV.,1814,48,Songs (1813),&indent;And ruddy hung the clust'ring rowan;,y,,1750-1800,"Bright shone the birks with morning Due,\n ..."
modern/car3501/Z300553019,10,1900-1999 Twentieth-Century,Modern Poetry,"Prince, F. T. (Frank Templeton), 1912-",1912.0,The Intention,1942,14,,That at last the illustrious child,,,1900-1950,"That at last the illustrious child\nYou, I wou..."
c20-american/am23088/Z300256709,22,1900-1999 Twentieth-Century,American Poetry,"Peacock, Molly, 1947-",1947.0,Lullaby,1977,19,,Big as a down duvet the night,,Lyric,1900-1950,Big as a down duvet the night\npulls the close...
c20-american/am20110/Z300220313,22,1900-1999 Twentieth-Century,American Poetry,"Fitzgerald, Robert, 1910-1985",1910.0,SPRING SHADE,1940,22,,Lash one another's green in rinsing light.,,,1900-1950,"The April winds rise, and the willow whips\nLa..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
c20-african-american/da22011/Z300262786,999638,1900-1999 Twentieth-Century,African-American Poetry,"Jackson, Angela, 1951-",1951.0,other evenings,1981,14,,other evenings when we called,,,1950-2000,other evenings when we called\nourselves ladie...
faber/fa20402/Z300564445,999668,1900-1999 Twentieth-Century,The Faber Poetry Library,"Sassoon, Siegfried, 1886-1967",1886.0,Concert&hyphen;Interpretation (LE SACRE DU PR...,1916,47,,The audience pricks an intellectual Ear ...,,,1850-1900,The audience pricks and intellectual Ear ...\n...
american/am0073/Z200145415,999736,1750-1799 Later Eighteenth-Century,American Poetry,"Evans, Nathaniel, 1742-1767",1742.0,A SONG To MIRA; ON PARTING.,1772,32,Poems on several occasions (1772),&indent;Two long&hyphen;ling'ring months to part—,y,,1700-1750,Can my Mira leave her lover?\n Two long-lin...
american/am1218/Z200193670,999881,1750-1799 Later Eighteenth-Century,American Poetry,"Hopkinson, Francis, 1737-1791",1737.0,AN ELEGY SACRED TO THE MEMORY OF Mrs. ANN GRÆ...,1767,88,The miscellaneous essays and occasional writin...,Why gleams the day light on her sacred gloom?,y,,1700-1750,Why move the marble jaws of yonder tomb?\nWhy ...
