# Corpus: Chadwyck-Healey poetry collections

## Loading corpus from source

In [1]:
import sys
sys.path.append('../')
from generative_formalism import *

In [2]:
documentation(check_paths,signature=False)
check_paths()

##### `check_paths`

```md
Check if the paths to the Chadwyck-Healey corpus and metadata are set and exist.

    Validates the configuration and availability of corpus files and URLs.
    Prints status indicators for each required path and URL.

    Returns
    -------
    None
        Prints status information but doesn't return a value.

    Calls
    -----
    - os.path.exists(PATH_CHADWYCK_HEALEY_TXT)
    - os.path.exists(PATH_CHADWYCK_HEALEY_METADATA)
    
```
----


✓ Chadwyck-Healey corpus path: /Users/ryan/github/generative-formalism/data/chadwyck_poetry/txt
✓ Chadwyck-Healey metadata path: /Users/ryan/github/generative-formalism/data/chadwyck_poetry/metadata.csv
✓ Metadata file URL set in environment (.env or shell)
✓ Corpus text file URL set in environment (.env or shell)
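
The check above can be approximated in a few lines. This is only a minimal sketch of the idea, not the package's implementation: the function name `check_paths_sketch`, its arguments, and the environment variable name used in the usage note are hypothetical. The only detail taken from the docstring is that the path checks rely on `os.path.exists`.

```python
import os


def check_paths_sketch(paths, env_vars, environ=None):
    """Report which local paths exist and which environment variables are set.

    Hypothetical helper for illustration only.
    paths:    mapping of label -> filesystem path
    env_vars: mapping of label -> environment variable name
    environ:  mapping to look variables up in (defaults to os.environ)
    """
    environ = os.environ if environ is None else environ
    status = {}
    # A path passes if it exists on disk.
    for label, path in paths.items():
        status[label] = os.path.exists(path)
    # A URL setting passes if the variable is present and non-empty.
    for label, var in env_vars.items():
        status[label] = bool(environ.get(var))
    # One status line per item, mirroring the output shown above.
    for label, ok in status.items():
        print(f"{'✓' if ok else '✗'} {label}")
    return status
```

Unlike `check_paths`, which returns `None`, this sketch returns the status mapping so it can be inspected programmatically, for example `check_paths_sketch({"corpus": "data/txt"}, {"metadata URL": "SOME_URL_VAR"})`, where both the path and the variable name are placeholders.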


### Loading corpus metadata

In [3]:
documentation(get_chadwyck_corpus_metadata)

##### `get_chadwyck_corpus_metadata`

```md
Load and normalize Chadwyck-Healey corpus metadata.

    This function reads `PATH_CHADWYCK_HEALEY_METADATA`, downloading and unzipping
    if missing. It coerces numeric fields, derives `id_hash` and binned `period`,
    applies min/max filters, and caches the resulting DataFrame in `CORPUS_METADATA`.

    Parameters
    - fields: Mapping from raw column names to canonical names used downstream.
    - period_by: Size of year bin for `period` derived from `author_dob`.
    - download_if_necessary: If True, download metadata when not present on disk.
    - overwrite: If True, force re-download when files exist.
    - min_num_lines, max_num_lines: Optional poem-length filters.
    - min_author_dob, max_author_dob: Optional birth-year filters.

    Returns
    - pd.DataFrame indexed by `id`, sorted by `id_hash`, including normalized fields
      and derived `period`.
    - Caches the DataFrame in the module-level `CORPUS_METADATA`.
    
```
----
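
To make the `period` binning and the min/max filters concrete, here is a minimal sketch on a toy DataFrame. It assumes that `period` is the `period_by`-year bin containing `author_dob`, labelled `start-end`, which matches the rows shown in the table below (for example 1814 falls in `1800-1850`). It also assumes the filters are inclusive. The helper names `add_period` and `filter_meta` are hypothetical and not part of `generative_formalism`.

```python
import pandas as pd


def add_period(df, period_by=50):
    """Add a `period` column by binning `author_dob` (assumed rule)."""
    out = df.copy()
    dob = pd.to_numeric(out["author_dob"], errors="coerce")
    # Floor each birth year to the start of its bin, e.g. 1814 -> 1800.
    start = (dob // period_by) * period_by
    out["period"] = [
        f"{int(s)}-{int(s) + period_by}" if pd.notna(s) else ""
        for s in start
    ]
    return out


def filter_meta(df, min_num_lines=None, max_num_lines=None,
                min_author_dob=None, max_author_dob=None):
    """Apply optional, inclusive (assumed) length and birth-year filters."""
    mask = pd.Series(True, index=df.index)
    if min_num_lines is not None:
        mask &= df["num_lines"] >= min_num_lines
    if max_num_lines is not None:
        mask &= df["num_lines"] <= max_num_lines
    if min_author_dob is not None:
        mask &= df["author_dob"] >= min_author_dob
    if max_author_dob is not None:
        mask &= df["author_dob"] <= max_author_dob
    return df[mask]


# Toy rows using birth years and line counts that appear in the table below.
toy = pd.DataFrame(
    {"author_dob": [1814.0, 1784.0, 1850.0, 1920.0],
     "num_lines": [14, 36, 14, 30]},
    index=["a", "b", "c", "d"],
)
print(add_period(toy)["period"].tolist())
# ['1800-1850', '1750-1800', '1850-1900', '1900-1950']
print(filter_meta(toy, min_num_lines=14, max_num_lines=14).index.tolist())
# ['a', 'c']
```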


In [4]:
df_meta = get_chadwyck_corpus_metadata()
df_meta

id,id_hash,period_meta,subcorpus,author,author_dob,title,year,num_lines,volume,line,rhyme,genre,period
english/devereau/Z400337366,0,1870-1899 Later Nineteenth-Century,English Poetry,"De Vere, Aubrey, 1814-1902",1814.0,V. POETIC RESERVE.,1844,14,The Poetical Works (1884),"But, ere their Songs disperse o'er man's domain,",y,Sonnet,1800-1850
american/am1021/Z200182419,0,1835-1869 Mid Nineteenth-Century,American Poetry,"Fields, James Thomas, 1817-1881",1817.0,SONG IN A DREAM.,1847,10,"[Poems, in] The poets of Portsmouth (1865)","&indent;Drifting o'er our darling's bed,—",y,,1800-1850
english/kenyonjo/Z300409894,1,1800-1834 Early Nineteenth-Century,English Poetry,"Kenyon, John, 1784-1856",1784.0,FOR THE SISTERS' ALBUM,1814,36,Poems (1838),Long since with me have had their day;,y,,1750-1800
american/am0978/Z200179167,13,1870-1899 Later Nineteenth-Century,American Poetry,"Miller, Joaquin, 1837-1913",1837.0,MY COUNTRY.,1867,20,In Classic Shades (1890),"&indent;From holy traditions of dear baby land,",y,,1800-1850
english/woodford/Z300542488,16,1660-1700 Restoration,English Poetry,"Woodford, Samuel, 1636-1700",1636.0,Psalm CXLVIII. Laudate Dominum de Cœlis.,1666,96,A paraphrase upon the psalms of David (1667),"&indent;Th' Eternal King, and so long see",y,Metrical Psalm,1600-1650
...,...,...,...,...,...,...,...,...,...,...,...,...,...
english/marstonp/Z400426075,999976,1870-1899 Later Nineteenth-Century,English Poetry,"Marston, Philip Bourke, 1850-1887",1850.0,SHIPWRECK.,1880,14,The Collected Poems (1892),The night is dense; the waves climb wild and h...,y,,1850-1900
english/havergal/Z600388371,999984,1870-1899 Later Nineteenth-Century,English Poetry,"Havergal, Frances Ridley, 1836-1879",1836.0,[But Eternity is long],1866,41,The Poetical Works (1884),"But Eternity is long,",y,,1800-1850
c20-english/car2902/Z300134680,999991,1900-1999 Twentieth-Century,Modern Poetry,"Morgan, Edwin, 1920-",1920.0,An Atrium,1950,30,,At first we loved the plate&hyphen;glass glare...,,,1900-1950
english/priormat/Z300465772,999998,1700-1749 Early Eighteenth-Century,English Poetry,"Prior, Matthew, 1664-1721",1664.0,Lamentation for DORINDA.,1694,70,Dialogues of the Dead (1907),"Sinking vallies, rising mountains:",y,,1650-1700


### Loading corpus texts

In [5]:
documentation(get_chadwyck_corpus)
df_corpus = get_chadwyck_corpus(df_meta)
df_corpus

##### `get_chadwyck_corpus`

```md
Load metadata and poem texts into a single corpus DataFrame.

    Combines corpus metadata with poem text content into a single DataFrame.
    Uses in-memory caching to avoid repeated expensive loading operations.

    Parameters
    ----------
    df_meta : pd.DataFrame, optional
        Pre-loaded metadata DataFrame. If None, loads using get_chadwyck_corpus_metadata.
    clean_poem : bool, default=True
        If True, apply text cleaning/normalization to poem texts.
    force : bool, default=False
        If True, ignore in-memory cache and rebuild corpus.
    download_if_necessary : bool, default=True
        If True, download corpus files if not present locally.
    *args, **kwargs
        Additional arguments passed to get_chadwyck_corpus_metadata.

    Returns
    -------
    pd.DataFrame
        DataFrame with metadata plus a 'txt' column containing poem text.

    Calls
    -----
    - get_chadwyck_corpus_metadata(*args, **kwargs) [if df_meta is None]
    - download_chadwyck_corpus_txt() [if download_if_necessary=True and corpus text not found]
    - get_chadwyck_corpus_texts(df_meta, clean_poem=clean_poem) [to load poem texts]
    
```
----


* Loading Chadwyck-Healey corpus (metadata + txt)


  : 100%|██████████| 204514/204514 [00:32<00:00, 6371.10it/s]


id,id_hash,period_meta,subcorpus,author,author_dob,title,year,num_lines,volume,line,rhyme,genre,period,txt
english/devereau/Z400337366,0,1870-1899 Later Nineteenth-Century,English Poetry,"De Vere, Aubrey, 1814-1902",1814.0,V. POETIC RESERVE.,1844,14,The Poetical Works (1884),"But, ere their Songs disperse o'er man's domain,",y,Sonnet,1800-1850,"Not willingly the Muses sing of love:\nBut, er..."
american/am1021/Z200182419,0,1835-1869 Mid Nineteenth-Century,American Poetry,"Fields, James Thomas, 1817-1881",1817.0,SONG IN A DREAM.,1847,10,"[Poems, in] The poets of Portsmouth (1865)","&indent;Drifting o'er our darling's bed,—",y,,1800-1850,"Winter rose-leaves, silver-white,\n Driftin..."
english/kenyonjo/Z300409894,1,1800-1834 Early Nineteenth-Century,English Poetry,"Kenyon, John, 1784-1856",1784.0,FOR THE SISTERS' ALBUM,1814,36,Poems (1838),Long since with me have had their day;,y,,1750-1800,"Soft lays, that dwell on lips and\neyes.\nLong..."
american/am0978/Z200179167,13,1870-1899 Later Nineteenth-Century,American Poetry,"Miller, Joaquin, 1837-1913",1837.0,MY COUNTRY.,1867,20,In Classic Shades (1890),"&indent;From holy traditions of dear baby land,",y,,1800-1850,"My country, what is it? A place that is dear\n..."
english/woodford/Z300542488,16,1660-1700 Restoration,English Poetry,"Woodford, Samuel, 1636-1700",1636.0,Psalm CXLVIII. Laudate Dominum de Cœlis.,1666,96,A paraphrase upon the psalms of David (1667),"&indent;Th' Eternal King, and so long see",y,Metrical Psalm,1600-1650,"You blessed Souls, who stand before\n Th' E..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
english/marstonp/Z400426075,999976,1870-1899 Later Nineteenth-Century,English Poetry,"Marston, Philip Bourke, 1850-1887",1850.0,SHIPWRECK.,1880,14,The Collected Poems (1892),The night is dense; the waves climb wild and h...,y,,1850-1900,The night is dense; the waves climb wild and h...
english/havergal/Z600388371,999984,1870-1899 Later Nineteenth-Century,English Poetry,"Havergal, Frances Ridley, 1836-1879",1836.0,[But Eternity is long],1866,41,The Poetical Works (1884),"But Eternity is long,",y,,1800-1850,"But Eternity is long,\n And its joys are ma..."
c20-english/car2902/Z300134680,999991,1900-1999 Twentieth-Century,Modern Poetry,"Morgan, Edwin, 1920-",1920.0,An Atrium,1950,30,,At first we loved the plate&hyphen;glass glare...,,,1900-1950,"At first we loved the plate-glass glare, the c..."
english/priormat/Z300465772,999998,1700-1749 Early Eighteenth-Century,English Poetry,"Prior, Matthew, 1664-1721",1664.0,Lamentation for DORINDA.,1694,70,Dialogues of the Dead (1907),"Sinking vallies, rising mountains:",y,,1650-1700,"Farewell you shady walks, and fountains,\nSink..."
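
Comparing the `line` and `txt` columns above suggests what part of the cleaning does: `line` still contains markup entities such as `&indent;` and `&hyphen;`, while `txt` shows leading whitespace and a plain hyphen (`plate&hyphen;glass` becomes `plate-glass`). The sketch below reproduces only that substitution. The function name, the replacement table, and the four-space indent width are assumptions, and the actual `clean_poem` step may do more than this.

```python
# Assumed replacements, inferred from the `line` vs `txt` columns above.
ENTITY_REPLACEMENTS = {
    "&indent;": "    ",  # indent width is a guess
    "&hyphen;": "-",
}


def replace_entities(text, replacements=ENTITY_REPLACEMENTS):
    """Replace known markup entities in a line of verse with plain text."""
    for entity, plain in replacements.items():
        text = text.replace(entity, plain)
    return text


print(replace_entities("At first we loved the plate&hyphen;glass glare"))
# At first we loved the plate-glass glare
```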
