# Corpus: Chadwyck-Healey poetry collections

## Loading corpus from source

In [1]:
import sys
sys.path.append('../')
from generative_formalism import *

In [2]:
documentation(check_paths,signature=False)
check_paths()

**`check_paths`**

```md
Check if the paths to the Chadwyck-Healey corpus and metadata are set and exist.
    Uses constants from `constants.py`.
    
```
----


✓ Chadwyck-Healey corpus path: /Users/ryan/github/generative-formalism/data/chadwyck_poetry/txt
✓ Chadwyck-Healey metadata path: /Users/ryan/github/generative-formalism/data/chadwyck_poetry/metadata.csv
✓ Metadata file URL set in environment (.env or shell)
✓ Corpus text file URL set in environment (.env or shell)


### Loading corpus metadata

In [3]:
documentation(get_chadwyck_corpus_metadata)

**`get_chadwyck_corpus_metadata`**

```md
Load and normalize Chadwyck-Healey corpus metadata.

    This function reads `PATH_CHADWYCK_HEALEY_METADATA`, downloading and unzipping
    if missing. It coerces numeric fields, derives `id_hash` and binned `period`,
    applies min/max filters, and caches the resulting DataFrame in `CORPUS_METADATA`.

    Parameters
    - fields: Mapping from raw column names to canonical names used downstream.
    - period_by: Size of year bin for `period` derived from `author_dob`.
    - download_if_necessary: If True, download metadata when not present on disk.
    - overwrite: If True, force re-download when files exist.
    - min_num_lines, max_num_lines: Optional poem-length filters.
    - min_author_dob, max_author_dob: Optional birth-year filters.

    Returns
    - pd.DataFrame indexed by `id`, sorted by `id_hash`, including normalized fields
      and derived `period`.
    - Caches the DataFrame in the module-level `CORPUS_METADATA`.
    
```
----
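
The binning and filtering steps described in the docstring can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the helper names `add_period` and `filter_range` are hypothetical, and the 50-year bin width is an assumption inferred from the `period` labels in the output (e.g. `1900-1950`).

```python
import pandas as pd


def add_period(df: pd.DataFrame, period_by: int = 50) -> pd.DataFrame:
    """Hypothetical helper: derive a binned `period` label from `author_dob`."""
    out = df.copy()
    # Coerce birth years to numbers; unparseable values become NaN and are dropped.
    out["author_dob"] = pd.to_numeric(out["author_dob"], errors="coerce")
    out = out.dropna(subset=["author_dob"])
    # Floor each birth year to the start of its bin, e.g. 1917 -> 1900.
    start = (out["author_dob"] // period_by * period_by).astype(int)
    out["period"] = start.astype(str) + "-" + (start + period_by).astype(str)
    return out


def filter_range(df: pd.DataFrame, column: str, min_value=None, max_value=None) -> pd.DataFrame:
    """Hypothetical helper: keep rows whose `column` lies in [min_value, max_value]."""
    mask = pd.Series(True, index=df.index)
    if min_value is not None:
        mask &= df[column] >= min_value
    if max_value is not None:
        mask &= df[column] <= max_value
    return df[mask]


# Toy rows, not real corpus data.
toy = pd.DataFrame({"author_dob": ["1917", "1729", "n/a"], "num_lines": [26, 42, 12]})
toy = add_period(toy)
toy = filter_range(toy, "num_lines", min_value=10, max_value=100)
print(toy)
```

Under these assumptions a birth year of 1917 falls into `1900-1950` and 1729 into `1700-1750`, which agrees with the `period` column displayed in this notebook.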


In [4]:
df_meta = get_chadwyck_corpus_metadata()
df_meta

* Loading metadata from /Users/ryan/github/generative-formalism/data/chadwyck_poetry/metadata.csv
* Loaded 336180 rows of metadata
* Filtering: 259,310 rows after author birth year >= 1600
* Filtering: 259,310 rows after author birth year <= 2000
* Filtering: 225,986 rows after number of lines >= 10
* Filtering: 204,514 rows after number of lines <= 100


id,id_hash,period_meta,subcorpus,author,author_dob,title,year,num_lines,volume,line,rhyme,genre,period
c20-american/am22053/Z300233356,1,1900-1999 Twentieth-Century,American Poetry,"Murray, Joan, 1917-1942",1917.0,I FEEL ONLY THE DESOLATION OF WIDE WATER,1947,26,,Its back a silver dimpling from the sun.,,,1900-1950
english/keategeo/Z200407830,1,1750-1799 Later Eighteenth-Century,English Poetry,"Keate, George, 1729-1797",1729.0,TO A LADY; FROM HER DEAD BULLFINCH.,1759,42,The Poetical Works (1781),Or murmur at my alter'd state?,y,,1700-1750
english/beaujose/Z200275687,2,1603-1660 Jacobean and Caroline,English Poetry,"Beaumont, Joseph, 1616-1699",1616.0,The Candle,1646,84,The Minor Poems (1914),Of a wax Candle in y&supere; Dark:,y,,1600-1650
english-ed2/clarejoh/Z300313661,8,1800-1834 Early Nineteenth-Century,English Poetry,"Clare, John, 1793-1864",1793.0,ON SEEING SOME MOSS IN FLOWER EARLY IN SPRING,1823,56,The Midsummer Cushion (1990),Wood walks are pleasant every day,y,,1750-1800
english/shipmant/Z200485468,10,1660-1700 Restoration,English Poetry,"Shipman, Thomas, 1632-1680",1632.0,To the Reader of the following Poem.,1662,35,"Carolina: or, Loyal Poems (1683)","For all that can be done or said,",y,,1600-1650
...,...,...,...,...,...,...,...,...,...,...,...,...,...
english/nesbitep/Z300136537,999962,1870-1899 Later Nineteenth-Century,English Poetry,"Nesbit, E. (Edith), 1858-1924",1858.0,“THIS DESIRABLE MANSION”,1888,16,,"&indent;Across the sodden, tangled grass,",y,,1850-1900
c20-american/am20048/Z300370315,999971,1900-1999 Twentieth-Century,American Poetry,"Wheelwright, John, 1897-1940",1897.0,DEFENDER OF THE FAITH,1927,59,,While voices sift from the infrangile air:,,,1850-1900
english/loversam/Z300418290,999973,1800-1834 Early Nineteenth-Century,English Poetry,"Lover, Samuel, 1797-1868",1797.0,ABSENCE. To &lblank;.,1827,16,Songs and ballads (1858),"Then, all is night;",y,,1750-1800
c20-english/ep30101/Z200604786,999978,1900-1999 Twentieth-Century,English Poetry,"McGuckian, Medbh, 1950-",1950.0,"Road 32, Roof 13–23, Grass 23",1980,38,,The dark wound her chestnut hair,,,1950-2000


### Loading corpus texts

In [5]:
documentation(get_chadwyck_corpus)
df_corpus = get_chadwyck_corpus(df_meta)
df_corpus

**`get_chadwyck_corpus`**

```md
Load metadata and poem texts into a single corpus DataFrame.

    Parameters
    - clean_poem: If True, clean poem texts after reading.
    - force: If True, ignore in-memory cache and rebuild corpus.
    - args/kwargs: Passed to `get_chadwyck_corpus_metadata`.

    Returns
    - pd.DataFrame with metadata plus a `txt` column containing poem text.

    Side Effects
    - Caches the result in the module-level `CORPUS`.
    
```
----


* Loading Chadwyck-Healey corpus (metadata + txt)
* Loading 204514 texts


  : 100%|██████████| 204514/204514 [00:35<00:00, 5770.47it/s]


id,id_hash,period_meta,subcorpus,author,author_dob,title,year,num_lines,volume,line,rhyme,genre,period,txt
c20-american/am22053/Z300233356,1,1900-1999 Twentieth-Century,American Poetry,"Murray, Joan, 1917-1942",1917.0,I FEEL ONLY THE DESOLATION OF WIDE WATER,1947,26,,Its back a silver dimpling from the sun.,,,1900-1950,"I Feel only the desolation of wide water,\nIts..."
english/keategeo/Z200407830,1,1750-1799 Later Eighteenth-Century,English Poetry,"Keate, George, 1729-1797",1729.0,TO A LADY; FROM HER DEAD BULLFINCH.,1759,42,The Poetical Works (1781),Or murmur at my alter'd state?,y,,1700-1750,Why does my Mistress mourn my fate?\nOr murmur...
english/beaujose/Z200275687,2,1603-1660 Jacobean and Caroline,English Poetry,"Beaumont, Joseph, 1616-1699",1616.0,The Candle,1646,84,The Minor Poems (1914),Of a wax Candle in y&supere; Dark:,y,,1600-1650,The Life and Death I once did mark\nOf a wax C...
english-ed2/clarejoh/Z300313661,8,1800-1834 Early Nineteenth-Century,English Poetry,"Clare, John, 1793-1864",1793.0,ON SEEING SOME MOSS IN FLOWER EARLY IN SPRING,1823,56,The Midsummer Cushion (1990),Wood walks are pleasant every day,y,,1750-1800,Wood walks are pleasant every day\nWhere thoug...
english/shipmant/Z200485468,10,1660-1700 Restoration,English Poetry,"Shipman, Thomas, 1632-1680",1632.0,To the Reader of the following Poem.,1662,35,"Carolina: or, Loyal Poems (1683)","For all that can be done or said,",y,,1600-1650,Favour I shall not hawk to gain;\nThe Quarry i...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
english/nesbitep/Z300136537,999962,1870-1899 Later Nineteenth-Century,English Poetry,"Nesbit, E. (Edith), 1858-1924",1858.0,“THIS DESIRABLE MANSION”,1888,16,,"&indent;Across the sodden, tangled grass,",y,,1850-1900,The long white windows blankly stare\n Acro...
c20-american/am20048/Z300370315,999971,1900-1999 Twentieth-Century,American Poetry,"Wheelwright, John, 1897-1940",1897.0,DEFENDER OF THE FAITH,1927,59,,While voices sift from the infrangile air:,,,1850-1900,"While voices sift from the infrangile air:\n""I..."
english/loversam/Z300418290,999973,1800-1834 Early Nineteenth-Century,English Poetry,"Lover, Samuel, 1797-1868",1797.0,ABSENCE. To &lblank;.,1827,16,Songs and ballads (1858),"Then, all is night;",y,,1750-1800,"As when the sun withdraweth quite,\nThen, all ..."
c20-english/ep30101/Z200604786,999978,1900-1999 Twentieth-Century,English Poetry,"McGuckian, Medbh, 1950-",1950.0,"Road 32, Roof 13–23, Grass 23",1980,38,,The dark wound her chestnut hair,,,1950-2000,The dark wound her chestnut hair\nAround her n...


## Sampling corpus

### By period

#### As in paper

In [6]:
documentation(get_chadwyck_corpus_sampled_by_period_as_in_paper, source=True)

**`get_chadwyck_corpus_sampled_by_period_as_in_paper`**

```md
Load the period-based sample used in the paper (precomputed).
```
----


*Source code*

```py
def get_chadwyck_corpus_sampled_by_period_as_in_paper() -> pd.DataFrame:
    """Load the period-based sample used in the paper (precomputed)."""
    return pd.read_csv(PATH_SAMPLE_PERIOD_IN_PAPER).fillna('').set_index('id').sort_values('id_hash')

```

In [7]:
df_smpl_by_period_in_paper = get_chadwyck_corpus_sampled_by_period_as_in_paper()
df_smpl_by_period_in_paper.head()

id,id_hash,period_meta,subcorpus,author,author_dob,title,year,num_lines,volume,line,rhyme,genre,period,txt
english-ed2/ep2438/Z300661875,1,,English Poetry,"Price, Herbert, b. 1858",1858.0,THE FORSAKEN GARDEN,1888,35,Poems and Sonnets by Herbert Price (1914),"In the garden we loved that is now a waste,",y,,1850-1900,"Ah! sweet were the days, and the nights and th..."
english/pennecu1/Z200459978,1,1660-1700 Restoration,English Poetry,"Pennecuik, Alexander, 1652-1722",1652.0,THE CITY AND COUNTRY MOUSE.,1682,50,The Works (1815),"&indent;Met with a city mouse, right smooth an...",y,,1650-1700,"A country mouse, upon a winter's day,\n Met..."
english/wattsisa/Z300523040,2,1750-1799 Later Eighteenth-Century,English Poetry,"Watts, Isaac, 1674-1748",1674.0,SONG 11. Heaven and Hell.,1704,16,The Works (1810),&indent;A heav'n of joy and love;,y,Lyric,1650-1700,There is beyond the sky\n A heaven of joy a...
english/hardytho/Z200137433,3,1870-1899 Later Nineteenth-Century,English Poetry,"Hardy, Thomas, 1840-1928",1840.0,WHEN DEAD,1870,16,,&indent;&indent;I am under the bough;,y,,1800-1850,It will be much better when\n I am unde...
c20-american/da22040/Z300203417,3,1900-1999 Twentieth-Century,American Poetry,"Walker, Margaret, 1915-1998",1915.0,BALLAD OF THE HOPPY&hyphen;TOAD,1945,84,,Ain't been on Market Street for nothing,,,1900-1950,Ain't been on Market Street for nothing\nWith ...


In [8]:
assert len(df_smpl_by_period_in_paper) == 8000

#### Regenerating new sample

In [9]:
documentation(get_chadwyck_corpus_sampled_by_period_as_replicated, source=True)
documentation(gen_chadwyck_corpus_sampled_by_period, source=True)
documentation(sample_chadwyck_corpus)

**`get_chadwyck_corpus_sampled_by_period_as_replicated`**

```md
Convenience wrapper to compute or load period-stratified sample (replication).
```
----


*Source code*

```py
def get_chadwyck_corpus_sampled_by_period_as_replicated(overwrite=False) -> pd.DataFrame:
    """Convenience wrapper to compute or load period-stratified sample (replication)."""
    df_smpl = get_chadwyck_corpus_sampled_by_period(force=overwrite)
    return df_smpl

```

**`gen_chadwyck_corpus_sampled_by_period`**

```md
Generate a period-stratified sample from the full corpus.
```
----


*Source code*

```py
def gen_chadwyck_corpus_sampled_by_period() -> pd.DataFrame:
    """Generate a period-stratified sample from the full corpus."""
    df_corpus = get_chadwyck_corpus()
    df = sample_chadwyck_corpus(
        df_corpus,
        sample_by='period',
    )
    return df

```

**`sample_chadwyck_corpus`**

```md
Deterministically sample `df_corpus` by one or more grouping keys.

    Rules
    - Keep only groups with at least `min_sample_n` items (if provided).
    - Within each group, sort by `id_hash` and take the first `max_sample_n` rows
      (if provided). This ensures stable sampling across runs.

    Parameters
    - df_corpus: Corpus DataFrame (e.g., from `get_chadwyck_corpus`).
    - sample_by: Column name or list of names to group by.
    - min_sample_n, max_sample_n: Group size constraints.

    Returns
    - pd.DataFrame containing the sampled rows.
    
```
----
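
The sampling rules in the docstring can be sketched as below. This is an illustrative re-implementation under stated assumptions, not the package's source: the function name `sample_by_group` is hypothetical, and sorting the final result by `id_hash` is an assumption based on the row order of the sampled tables shown in this notebook.

```python
import pandas as pd


def sample_by_group(df, sample_by, min_sample_n=None, max_sample_n=None, hash_col="id_hash"):
    """Hypothetical sketch: deterministic per-group sampling ordered by `hash_col`."""
    keys = [sample_by] if isinstance(sample_by, str) else list(sample_by)
    kept = []
    for _, group in df.groupby(keys, sort=True):
        # Rule 1: drop groups that are too small.
        if min_sample_n is not None and len(group) < min_sample_n:
            continue
        # Rule 2: order by the hash column and keep the first `max_sample_n` rows,
        # so the same rows are chosen on every run.
        group = group.sort_values(hash_col, kind="stable")
        if max_sample_n is not None:
            group = group.head(max_sample_n)
        kept.append(group)
    if not kept:
        return df.iloc[0:0]  # empty frame with the same columns
    return pd.concat(kept).sort_values(hash_col, kind="stable")


# Toy rows, not real corpus data.
toy = pd.DataFrame(
    {
        "id": list("abcdefg"),
        "period": ["p1"] * 4 + ["p2"] * 2 + ["p3"],
        "id_hash": [40, 10, 30, 20, 5, 6, 1],
    }
).set_index("id")
print(sample_by_group(toy, "period", min_sample_n=2, max_sample_n=3))
```

Because selection depends only on the `id_hash` ordering and not on a random seed, shuffling the input rows leaves the sample unchanged as long as `id_hash` values are unique within a group. The rows displayed in this notebook include repeated `id_hash` values, so ties would need a secondary sort key to be fully order-independent.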


In [10]:
df_smpl_by_period = get_chadwyck_corpus_sampled_by_period_as_replicated(overwrite=REPLICATE_OVERWRITE)
df_smpl_by_period.head()


* Generating period sample
* Loading Chadwyck-Healey corpus (metadata + txt)
* Loading corpus from memory
* Sampling corpus by period (min 10, max 1000)
* Original sample size: 204514
* Final sample size: 8000

* Breakdown for period
1600-1650    1000
1650-1700    1000
1700-1750    1000
1750-1800    1000
1800-1850    1000
1850-1900    1000
1900-1950    1000
1950-2000    1000

* Saved sample to /Users/ryan/github/generative-formalism/data/corpus_sample_by_period.replicated.csv.gz


id,id_hash,period_meta,subcorpus,author,author_dob,title,year,num_lines,volume,line,rhyme,genre,period,txt
english/keategeo/Z200407830,1,1750-1799 Later Eighteenth-Century,English Poetry,"Keate, George, 1729-1797",1729.0,TO A LADY; FROM HER DEAD BULLFINCH.,1759,42,The Poetical Works (1781),Or murmur at my alter'd state?,y,,1700-1750,Why does my Mistress mourn my fate?\nOr murmur...
c20-american/am22053/Z300233356,1,1900-1999 Twentieth-Century,American Poetry,"Murray, Joan, 1917-1942",1917.0,I FEEL ONLY THE DESOLATION OF WIDE WATER,1947,26,,Its back a silver dimpling from the sun.,,,1900-1950,"I Feel only the desolation of wide water,\nIts..."
english/beaujose/Z200275687,2,1603-1660 Jacobean and Caroline,English Poetry,"Beaumont, Joseph, 1616-1699",1616.0,The Candle,1646,84,The Minor Poems (1914),Of a wax Candle in y&supere; Dark:,y,,1600-1650,The Life and Death I once did mark\nOf a wax C...
english-ed2/clarejoh/Z300313661,8,1800-1834 Early Nineteenth-Century,English Poetry,"Clare, John, 1793-1864",1793.0,ON SEEING SOME MOSS IN FLOWER EARLY IN SPRING,1823,56,The Midsummer Cushion (1990),Wood walks are pleasant every day,y,,1750-1800,Wood walks are pleasant every day\nWhere thoug...
english/shipmant/Z200485468,10,1660-1700 Restoration,English Poetry,"Shipman, Thomas, 1632-1680",1632.0,To the Reader of the following Poem.,1662,35,"Carolina: or, Loyal Poems (1683)","For all that can be done or said,",y,,1600-1650,Favour I shall not hawk to gain;\nThe Quarry i...


In [11]:
if len(df_smpl_by_period):
    assert len(df_smpl_by_period) == 8000

### By rhyme

#### As in paper

In [12]:
documentation(get_chadwyck_corpus_sampled_by_rhyme_as_in_paper, source=True)

**`get_chadwyck_corpus_sampled_by_rhyme_as_in_paper`**

```md
Load the rhyme-based sample used in the paper (precomputed).
```
----


*Source code*

```py
def get_chadwyck_corpus_sampled_by_rhyme_as_in_paper() -> pd.DataFrame:
    """Load the rhyme-based sample used in the paper (precomputed)."""
    return pd.read_csv(PATH_SAMPLE_RHYMES_IN_PAPER).fillna('').set_index('id').sort_values('id_hash')

```

In [13]:
df_smpl_by_rhyme_in_paper = get_chadwyck_corpus_sampled_by_rhyme_as_in_paper()
assert len(df_smpl_by_rhyme_in_paper) == 2000
df_smpl_by_rhyme_in_paper.head()

id,id_hash,period_meta,subcorpus,author,author_dob,title,year,num_lines,volume,line,rhyme,genre,period,txt
english-ed2/ep2438/Z300661875,1,,English Poetry,"Price, Herbert, b. 1858",1858.0,THE FORSAKEN GARDEN,1888,35,Poems and Sonnets by Herbert Price (1914),"In the garden we loved that is now a waste,",y,,1850-1900,"Ah! sweet were the days, and the nights and th..."
english/pennecu1/Z200459978,1,1660-1700 Restoration,English Poetry,"Pennecuik, Alexander, 1652-1722",1652.0,THE CITY AND COUNTRY MOUSE.,1682,50,The Works (1815),"&indent;Met with a city mouse, right smooth an...",y,,1650-1700,"A country mouse, upon a winter's day,\n Met..."
english/wattsisa/Z300523040,2,1750-1799 Later Eighteenth-Century,English Poetry,"Watts, Isaac, 1674-1748",1674.0,SONG 11. Heaven and Hell.,1704,16,The Works (1810),&indent;A heav'n of joy and love;,y,Lyric,1650-1700,There is beyond the sky\n A heaven of joy a...
english/hardytho/Z200137433,3,1870-1899 Later Nineteenth-Century,English Poetry,"Hardy, Thomas, 1840-1928",1840.0,WHEN DEAD,1870,16,,&indent;&indent;I am under the bough;,y,,1800-1850,It will be much better when\n I am unde...
english/fawkesfr/Z300372956,4,1750-1799 Later Eighteenth-Century,English Poetry,"Fawkes, Francis, 1720-1777",1720.0,"III. ON A WORTHY FRIEND, Who was accomplished...",1750,10,Original Poems and Translations (1761),"Thou friendly, candid, virtuous mind, farewel!",y,,1700-1750,"Oh born in liberal studies to excel,\nThou fri..."


#### Replicated

In [14]:
documentation(get_chadwyck_corpus_sampled_by_rhyme_as_replicated, source=True)
documentation(gen_chadwyck_corpus_sampled_by_rhyme, source=True)

**`get_chadwyck_corpus_sampled_by_rhyme_as_replicated`**

```md
Convenience wrapper to compute or load rhyme-stratified sample (replication).
```
----


*Source code*

```py
def get_chadwyck_corpus_sampled_by_rhyme_as_replicated(overwrite=False) -> pd.DataFrame:
    """Convenience wrapper to compute or load rhyme-stratified sample (replication)."""
    df_smpl = get_chadwyck_corpus_sampled_by_rhyme(force=overwrite)
    return df_smpl

```

**`gen_chadwyck_corpus_sampled_by_rhyme`**

```md
Generate a rhyme-stratified sample from the full corpus.
```
----


*Source code*

```py
def gen_chadwyck_corpus_sampled_by_rhyme() -> pd.DataFrame:
    """Generate a rhyme-stratified sample from the full corpus."""
    df_corpus = get_chadwyck_corpus()
    df_corpus = df_corpus[df_corpus.rhyme.isin({'y','n'})]
    df = sample_chadwyck_corpus(
        df_corpus,
        sample_by='rhyme',
    )
    return df

```

In [15]:
df_smpl_by_rhyme = get_chadwyck_corpus_sampled_by_rhyme_as_replicated(overwrite=REPLICATE_OVERWRITE)
df_smpl_by_rhyme.head()

* Generating rhyme sample
* Loading Chadwyck-Healey corpus (metadata + txt)
* Loading corpus from memory
* Sampling corpus by rhyme (min 10, max 1000)
* Original sample size: 146309
* Final sample size: 2000

* Breakdown for rhyme
n    1000
y    1000

* Saved sample to /Users/ryan/github/generative-formalism/data/corpus_sample_by_rhyme.replicated.csv.gz


id,id_hash,period_meta,subcorpus,author,author_dob,title,year,num_lines,volume,line,rhyme,genre,period,txt
english/keategeo/Z200407830,1,1750-1799 Later Eighteenth-Century,English Poetry,"Keate, George, 1729-1797",1729.0,TO A LADY; FROM HER DEAD BULLFINCH.,1759,42,The Poetical Works (1781),Or murmur at my alter'd state?,y,,1700-1750,Why does my Mistress mourn my fate?\nOr murmur...
english/beaujose/Z200275687,2,1603-1660 Jacobean and Caroline,English Poetry,"Beaumont, Joseph, 1616-1699",1616.0,The Candle,1646,84,The Minor Poems (1914),Of a wax Candle in y&supere; Dark:,y,,1600-1650,The Life and Death I once did mark\nOf a wax C...
english-ed2/clarejoh/Z300313661,8,1800-1834 Early Nineteenth-Century,English Poetry,"Clare, John, 1793-1864",1793.0,ON SEEING SOME MOSS IN FLOWER EARLY IN SPRING,1823,56,The Midsummer Cushion (1990),Wood walks are pleasant every day,y,,1750-1800,Wood walks are pleasant every day\nWhere thoug...
english/shipmant/Z200485468,10,1660-1700 Restoration,English Poetry,"Shipman, Thomas, 1632-1680",1632.0,To the Reader of the following Poem.,1662,35,"Carolina: or, Loyal Poems (1683)","For all that can be done or said,",y,,1600-1650,Favour I shall not hawk to gain;\nThe Quarry i...
american/am1172/Z300192013,11,1835-1869 Mid Nineteenth-Century,American Poetry,"Tuckerman, Henry T. (Henry Theodore), 1813-1871",1813.0,XIX. STEINHAUSEN'S HERO AND LEANDER.,1843,15,Poems (1851),&indent;Behold life mantle in his glowing face,y,,1800-1850,STEINHAUSEN'S HERO AND LEANDER.\nFaint from th...


In [16]:
assert len(df_smpl_by_rhyme) == 2000

### By period/subcorpus

#### As in paper

In [17]:
documentation(get_chadwyck_corpus_sampled_by_period_subcorpus_as_in_paper, source=True)
documentation(display_period_subcorpus_tables, source=True)
documentation(get_period_subcorpus_table)


**`get_chadwyck_corpus_sampled_by_period_subcorpus_as_in_paper`**

```md
Load the period×subcorpus sample used in the paper and optionally display a table.
```
----


*Source code*

```py
def get_chadwyck_corpus_sampled_by_period_subcorpus_as_in_paper(display=False) -> pd.DataFrame:
    """Load the period×subcorpus sample used in the paper and optionally display a table."""
    odf = pd.read_csv(PATH_SAMPLE_PERIOD_SUBCORPUS_IN_PAPER).fillna('').set_index('id').sort_values('id_hash')
    if display:
        display_period_subcorpus_tables(odf)
    return odf

```

**`display_period_subcorpus_tables`**

```md
Display summary tables for a sampled DataFrame (IPython rich display if available).
```
----


*Source code*

```py
def display_period_subcorpus_tables(df):
    """Display summary tables for a sampled DataFrame (IPython rich display if available)."""
    try_display(get_period_subcorpus_table(df, return_display=True))

```

**`get_period_subcorpus_table`**

```md
Build a period×subcorpus summary table and optionally save LaTeX.

    Parameters
    - df_smpl: Sampled DataFrame containing `period`, `subcorpus`, `author`, `id`.
    - save_latex_to: Base path for LaTeX/table image output; if falsy, skip saving.
    - save_latex_to_suffix: Filename suffix for differentiation.
    - return_display: If True, return a display object suitable for notebooks.
    - table_num: Optional table number for LaTeX captioning.

    Returns
    - A formatted DataFrame (if not returning display object) or a display/image object.
    
```
----
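
As a rough idea of what a period×subcorpus summary might contain, the sketch below counts poems and distinct authors per cell. The helper `count_period_subcorpus` and the output column names `num_poems` and `num_authors` are hypothetical; the actual table layout and LaTeX export of `get_period_subcorpus_table` are not shown here and may differ.

```python
import pandas as pd


def count_period_subcorpus(df_smpl: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: poems and distinct authors per (period, subcorpus)."""
    # The samples in this notebook are indexed by `id`; make it a regular column.
    df = df_smpl.reset_index() if df_smpl.index.name == "id" else df_smpl
    return (
        df.groupby(["period", "subcorpus"])
        .agg(num_poems=("id", "nunique"), num_authors=("author", "nunique"))
        .reset_index()
    )


# Toy rows, not real corpus data.
toy = pd.DataFrame(
    {
        "id": ["a", "b", "c", "d"],
        "period": ["1800-1850"] * 3 + ["1850-1900"],
        "subcorpus": ["English Poetry", "English Poetry", "American Poetry", "English Poetry"],
        "author": ["X", "X", "Y", "Z"],
    }
).set_index("id")
print(count_period_subcorpus(toy))
```

Applied to a sampled DataFrame such as the one loaded in the next cell, this would give one row per period and subcorpus combination.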


In [18]:
df_smpl_by_period_subcorpus_in_paper = get_chadwyck_corpus_sampled_by_period_subcorpus_as_in_paper(display=True)
df_smpl_by_period_subcorpus_in_paper.head()

* Loading corpus metadata from memory


'/Users/ryan/github/generative-formalism/data/tex/table_5.period_subcorpus_counts.tmp.tex'

id,id_hash,period_meta,subcorpus,author,author_dob,title,year,num_lines,volume,line,rhyme,genre,period,txt
c20-english/ep20152/Z200586158,2,1900-1999 Twentieth-Century,English Poetry,"Rosenberg, Isaac, 1890-1918",1890.0,‘I KNOW YOU GOLDEN’,1920,12,,I know you golden,,,1850-1900,I know you golden\nAs summer and pale\nAs the ...
english/kerpeter/Z300410015,3,1660-1700 Restoration,English Poetry,"Ker, Patrick, fl. 1691",1691.0,On the Memory of a Married Maid.,1721,16,Flosculum Poeticum (1684),A Marrie'd&hyphen;Virgin to remain.,y,,1650-1700,"Within this Coffin here does lie,\nA Pattern o..."
american/am1258/Z200196105,7,1835-1869 Mid Nineteenth-Century,American Poetry,"Emerson, Ralph Waldo, 1803-1882",1803.0,SEPTEMBER,1833,16,Poems [1904],"&indent;Of a gusty Autumn day,",y,,1800-1850,In the turbulent beauty\n Of a gusty Autumn...
english/gilfilla/Z400379001,8,1800-1834 Early Nineteenth-Century,English Poetry,"Gilfillan, Robert, 1798-1850",1798.0,NORWEGIAN SMUGGLER'S SONG.,1828,36,Poems and Songs (1851),"&indent;The storm is loud and high,",y,,1750-1800,"Awake, you midnight mariners!\n The storm i..."
english/wattwill/Z300523577,18,1800-1834 Early Nineteenth-Century,English Poetry,"Watt, William, 1793-1859",1793.0,BAB AT THE BOWSTER.,1823,40,Poems and Songs (1860),Wi' touslet hair and drowsy een?,y,Ballad,1750-1800,"Lassie, whare were you yestreen,\nWi' touslet ..."


#### Replicated

In [19]:
documentation(get_chadwyck_corpus_sampled_by_period_subcorpus_as_replicated, source=True)
documentation(get_chadwyck_corpus_sampled_by_period_subcorpus, source=True)
documentation(gen_chadwyck_corpus_sampled_by_period_subcorpus, source=True)

**`get_chadwyck_corpus_sampled_by_period_subcorpus_as_replicated`**

```md
Convenience wrapper to compute or load period×subcorpus sample (replication).
```
----


*Source code*

```py
def get_chadwyck_corpus_sampled_by_period_subcorpus_as_replicated(overwrite=False, display=False) -> pd.DataFrame:
    """Convenience wrapper to compute or load period×subcorpus sample (replication)."""
    df_smpl = get_chadwyck_corpus_sampled_by_period_subcorpus(force=overwrite)
    if display:
        display_period_subcorpus_tables(df_smpl)
    return df_smpl

```

**`get_chadwyck_corpus_sampled_by_period_subcorpus`**

```md
Load or generate period×subcorpus sample; cache on disk at `PATH_SAMPLE_PERIOD_SUBCORPUS_REPLICATED`.
```
----


*Source code*

```py
def get_chadwyck_corpus_sampled_by_period_subcorpus(force=False, display=False) -> pd.DataFrame:
    """Load or generate period×subcorpus sample; cache on disk at `PATH_SAMPLE_PERIOD_SUBCORPUS_REPLICATED`."""
    path = PATH_SAMPLE_PERIOD_SUBCORPUS_REPLICATED
    if force or not os.path.exists(path):
        print(f'* Generating period subcorpus sample')
        odf = gen_chadwyck_corpus_sampled_by_period_subcorpus()
        if len(odf):
            save_sample(odf, path, overwrite=True)
    else:
        print(f'* Loading period subcorpus sample from {path}')
        odf = pd.read_csv(path).set_index('id').sort_values('id_hash')
    if display:
        try:
            from IPython.display import display
            img = get_period_subcorpus_table(odf, return_display=True)
            display(img)
        except (NameError, ImportError):
            print(f'* Warning: Could not display image')
            pass
    return odf

```

**`gen_chadwyck_corpus_sampled_by_period_subcorpus`**

```md
Generate a period×subcorpus-stratified sample from the full corpus.
```
----


*Source code*

```py
def gen_chadwyck_corpus_sampled_by_period_subcorpus() -> pd.DataFrame:
    """Generate a period×subcorpus-stratified sample from the full corpus."""
    df_corpus = get_chadwyck_corpus()
    df = sample_chadwyck_corpus(
        df_corpus,
        sample_by=['period','subcorpus'],
    )
    return df

```

In [20]:
df_smpl_by_period_subcorpus_replicated = get_chadwyck_corpus_sampled_by_period_subcorpus_as_replicated(display=True, overwrite=REPLICATE_OVERWRITE)
df_smpl_by_period_subcorpus_replicated

* Generating period subcorpus sample
* Loading Chadwyck-Healey corpus (metadata + txt)
* Loading corpus from memory
* Sampling corpus by ['period', 'subcorpus'] (min 10, max 1000)
* Original sample size: 204514
* Final sample size: 22709

* Breakdown for period/subcorpus
1600-1650  American Poetry              361
           English Poetry              1000
1650-1700  American Poetry               74
           English Poetry              1000
1700-1750  American Poetry              340
           English Poetry              1000
1750-1800  African-American Poetry      284
           American Poetry             1000
           English Poetry              1000
1800-1850  African-American Poetry      542
           American Poetry             1000
           English Poetry              1000
1850-1900  African-American Poetry     1000
           American Poetry             1000
           English Poetry              1000
           Modern Poetry                809
           The Faber Poe

'/Users/ryan/github/generative-formalism/data/tex/table_5.period_subcorpus_counts.tmp.tex'

id,id_hash,period_meta,subcorpus,author,author_dob,title,year,num_lines,volume,line,rhyme,genre,period,txt
c20-american/am22053/Z300233356,1,1900-1999 Twentieth-Century,American Poetry,"Murray, Joan, 1917-1942",1917.0,I FEEL ONLY THE DESOLATION OF WIDE WATER,1947,26,,Its back a silver dimpling from the sun.,,,1900-1950,"I Feel only the desolation of wide water,\nIts..."
english/keategeo/Z200407830,1,1750-1799 Later Eighteenth-Century,English Poetry,"Keate, George, 1729-1797",1729.0,TO A LADY; FROM HER DEAD BULLFINCH.,1759,42,The Poetical Works (1781),Or murmur at my alter'd state?,y,,1700-1750,Why does my Mistress mourn my fate?\nOr murmur...
english/beaujose/Z200275687,2,1603-1660 Jacobean and Caroline,English Poetry,"Beaumont, Joseph, 1616-1699",1616.0,The Candle,1646,84,The Minor Poems (1914),Of a wax Candle in y&supere; Dark:,y,,1600-1650,The Life and Death I once did mark\nOf a wax C...
english-ed2/clarejoh/Z300313661,8,1800-1834 Early Nineteenth-Century,English Poetry,"Clare, John, 1793-1864",1793.0,ON SEEING SOME MOSS IN FLOWER EARLY IN SPRING,1823,56,The Midsummer Cushion (1990),Wood walks are pleasant every day,y,,1750-1800,Wood walks are pleasant every day\nWhere thoug...
english/shipmant/Z200485468,10,1660-1700 Restoration,English Poetry,"Shipman, Thomas, 1632-1680",1632.0,To the Reader of the following Poem.,1662,35,"Carolina: or, Loyal Poems (1683)","For all that can be done or said,",y,,1600-1650,Favour I shall not hawk to gain;\nThe Quarry i...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
c20-african-american/da22025/Z200571283,999251,1900-1999 Twentieth-Century,African-American Poetry,"Jones, Patricia Spears, 1955-",1955.0,Taking the Curve (A COUNTRY SONG),1985,18,,The faster you go,,,1950-2000,The faster you go\nthe smoother the turn\nwhen...
american/am0999/Z300181365,999407,1660-1700 Restoration,American Poetry,"Tompson, Benjamin, 1642-1714",1642.0,On A FORTIFICATION At Boston begun by Women. ...,1672,24,"[Poems, in] Benjamin Tompson [1980]",A Grand attempt some Amazonian Dames,y,,1600-1650,A Grand attempt some Amazonian Dames\nContrive...
african-american/hortonge/Z200399842,999416,1835-1869 Mid Nineteenth-Century,African-American Poetry,"Horton, George Moses, 1798?-ca.1880",1798.0,"THE SPECTATOR OF THE BATTLE OF BELMONT, NOVEMB...",1828,24,Naked Genius (1865),"O, brother spectators, I long shall remember,",y,,1750-1800,"O, brother spectators, I long shall remember,\..."
american/am1218/Z200193703,999588,1750-1799 Later Eighteenth-Century,American Poetry,"Hopkinson, Francis, 1737-1791",1737.0,SONG VIII. The traveller benighted and lost,1767,18,The miscellaneous essays and occasional writin...,&indent;O'er the mountains pursues his lone way;,y,,1700-1750,"The traveller benighted and lost,\n O'er th..."
