# Corpus: Chadwyck-Healey poetry collections

## Loading corpus from source

In [None]:
import sys
sys.path.append('../')
from generative_formalism import *

In [2]:
documentation(check_paths,signature=False)
check_paths()

**`check_paths`**

```md
Check if the paths to the Chadwyck-Healey corpus and metadata are set and exist.
    Uses constants from `constants.py`.
    
```
----


✓ Chadwyck-Healey corpus path: /Users/ryan/github/generative-formalism/data/chadwyck_poetry/txt
✓ Chadwyck-Healey metadata path: /Users/ryan/github/generative-formalism/data/chadwyck_poetry/metadata.csv
✓ Metadata file URL set in environment (.env or shell)
✓ Corpus text file URL set in environment (.env or shell)
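
The kind of check reported above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: `check_paths_sketch`, its arguments, and the environment variable names in the demo are hypothetical stand-ins for whatever `constants.py` actually defines.

```python
import os

def check_paths_sketch(paths, env_vars=()):
    """Report whether each labelled path exists and each env var is set.

    `paths` maps a human-readable label to a filesystem path;
    `env_vars` is an iterable of environment variable names.
    Returns a dict of label/name -> bool.
    """
    results = {}
    for label, path in paths.items():
        ok = bool(path) and os.path.exists(path)
        print(f"{'✓' if ok else '✗'} {label}: {path}")
        results[label] = ok
    for name in env_vars:
        ok = bool(os.environ.get(name))
        print(f"{'✓' if ok else '✗'} {name} set in environment")
        results[name] = ok
    return results

if __name__ == "__main__":
    # Hypothetical labels and variable name, for illustration only.
    check_paths_sketch(
        {"current directory": os.getcwd(), "missing path": "/no/such/path"},
        env_vars=["HYPOTHETICAL_METADATA_URL"],
    )
```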


### Loading corpus metadata

In [3]:
documentation(get_chadwyck_corpus_metadata)

**`get_chadwyck_corpus_metadata`**

```md
Load and normalize Chadwyck-Healey corpus metadata.

    This function reads `PATH_CHADWYCK_HEALEY_METADATA`, downloading and unzipping
    if missing. It coerces numeric fields, derives `id_hash` and binned `period`,
    applies min/max filters, and caches the resulting DataFrame in `CORPUS_METADATA`.

    Parameters
    - fields: Mapping from raw column names to canonical names used downstream.
    - period_by: Size of year bin for `period` derived from `author_dob`.
    - download_if_necessary: If True, download metadata when not present on disk.
    - overwrite: If True, force re-download when files exist.
    - min_num_lines, max_num_lines: Optional poem-length filters.
    - min_author_dob, max_author_dob: Optional birth-year filters.

    Returns
    - pd.DataFrame indexed by `id`, sorted by `id_hash`, including normalized fields
      and derived `period`.

    Side Effects
    - Caches the DataFrame in the module-level `CORPUS_METADATA`.
    
```
----
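
The docstring says `period` is a binned label derived from `author_dob`. One plausible way to compute such a label is floor division by the bin size. The sketch below assumes a 50-year bin, which is inferred from the `period` values in this notebook's output and is not confirmed by the docstring; `period_label` is a hypothetical helper, not a function from `generative_formalism`.

```python
import math

def period_label(author_dob, period_by=50):
    """Return a label such as '1900-1950' for a birth year, or '' if missing.

    Assumption: bins start at multiples of `period_by` and are labelled
    'start-end', where end = start + period_by.
    """
    if author_dob is None:
        return ""
    if isinstance(author_dob, float) and math.isnan(author_dob):
        return ""
    start = int(author_dob // period_by) * period_by
    return f"{start}-{start + period_by}"

if __name__ == "__main__":
    for dob in (1917.0, 1729.0, 1616.0, 1793.0, 1950.0):
        print(dob, period_label(dob))
```

With pandas, the same helper could be applied column-wise, for example `df["author_dob"].map(period_label)`. A birth year that falls exactly on a boundary (such as 1950) lands in the bin that starts there.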


In [4]:
df_meta = get_chadwyck_corpus_metadata()
df_meta

* Loading metadata from /Users/ryan/github/generative-formalism/data/chadwyck_poetry/metadata.csv
* Loaded 336180 rows of metadata
* Filtering: 259,310 rows after author birth year >= 1600
* Filtering: 259,310 rows after author birth year <= 2000
* Filtering: 225,986 rows after number of lines >= 10
* Filtering: 204,514 rows after number of lines <= 100


| id | id_hash | period_meta | subcorpus | author | author_dob | title | year | num_lines | volume | line | rhyme | genre | period |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| c20-american/am22053/Z300233356 | 1 | 1900-1999 Twentieth-Century | American Poetry | Murray, Joan, 1917-1942 | 1917.0 | I FEEL ONLY THE DESOLATION OF WIDE WATER | 1947 | 26 | | Its back a silver dimpling from the sun. | | | 1900-1950 |
| english/keategeo/Z200407830 | 1 | 1750-1799 Later Eighteenth-Century | English Poetry | Keate, George, 1729-1797 | 1729.0 | TO A LADY; FROM HER DEAD BULLFINCH. | 1759 | 42 | The Poetical Works (1781) | Or murmur at my alter'd state? | y | | 1700-1750 |
| english/beaujose/Z200275687 | 2 | 1603-1660 Jacobean and Caroline | English Poetry | Beaumont, Joseph, 1616-1699 | 1616.0 | The Candle | 1646 | 84 | The Minor Poems (1914) | Of a wax Candle in y&supere; Dark: | y | | 1600-1650 |
| english-ed2/clarejoh/Z300313661 | 8 | 1800-1834 Early Nineteenth-Century | English Poetry | Clare, John, 1793-1864 | 1793.0 | ON SEEING SOME MOSS IN FLOWER EARLY IN SPRING | 1823 | 56 | The Midsummer Cushion (1990) | Wood walks are pleasant every day | y | | 1750-1800 |
| english/shipmant/Z200485468 | 10 | 1660-1700 Restoration | English Poetry | Shipman, Thomas, 1632-1680 | 1632.0 | To the Reader of the following Poem. | 1662 | 35 | Carolina: or, Loyal Poems (1683) | For all that can be done or said, | y | | 1600-1650 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| english/nesbitep/Z300136537 | 999962 | 1870-1899 Later Nineteenth-Century | English Poetry | Nesbit, E. (Edith), 1858-1924 | 1858.0 | “THIS DESIRABLE MANSION” | 1888 | 16 | | &indent;Across the sodden, tangled grass, | y | | 1850-1900 |
| c20-american/am20048/Z300370315 | 999971 | 1900-1999 Twentieth-Century | American Poetry | Wheelwright, John, 1897-1940 | 1897.0 | DEFENDER OF THE FAITH | 1927 | 59 | | While voices sift from the infrangile air: | | | 1850-1900 |
| english/loversam/Z300418290 | 999973 | 1800-1834 Early Nineteenth-Century | English Poetry | Lover, Samuel, 1797-1868 | 1797.0 | ABSENCE. To &lblank;. | 1827 | 16 | Songs and ballads (1858) | Then, all is night; | y | | 1750-1800 |
| c20-english/ep30101/Z200604786 | 999978 | 1900-1999 Twentieth-Century | English Poetry | McGuckian, Medbh, 1950- | 1950.0 | Road 32, Roof 13–23, Grass 23 | 1980 | 38 | | The dark wound her chestnut hair | | | 1950-2000 |
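
The sequence of `* Filtering:` log lines above suggests that the min/max bounds are applied one after another, with the row count reported after each step. A minimal pandas sketch of that pattern follows. The function name `apply_filters` and the toy data are hypothetical. The bounds used in the demo (1600, 2000, 10, 100) are read off the log above, not taken from the package source.

```python
import pandas as pd

def apply_filters(df, min_author_dob=None, max_author_dob=None,
                  min_num_lines=None, max_num_lines=None):
    """Apply optional bounds in order, logging the row count after each."""
    steps = [
        ("author_dob", ">=", min_author_dob, "author birth year"),
        ("author_dob", "<=", max_author_dob, "author birth year"),
        ("num_lines", ">=", min_num_lines, "number of lines"),
        ("num_lines", "<=", max_num_lines, "number of lines"),
    ]
    for col, op, bound, label in steps:
        if bound is None:
            continue  # this filter was not requested
        # Comparisons against NaN are False, so rows with a missing
        # value in `col` are dropped by the first filter that touches it.
        mask = df[col] >= bound if op == ">=" else df[col] <= bound
        df = df[mask]
        print(f"* Filtering: {len(df):,} rows after {label} {op} {bound}")
    return df

if __name__ == "__main__":
    toy = pd.DataFrame({
        "author_dob": [1550, 1616, 1917, None, 1950],
        "num_lines": [20, 84, 5, 30, 150],
    })
    apply_filters(toy, min_author_dob=1600, max_author_dob=2000,
                  min_num_lines=10, max_num_lines=100)
```

In this sketch, a row with no recorded birth year is removed by the first birth-year filter. Whether the real function treats missing values the same way is an assumption.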


### Loading corpus texts

In [5]:
documentation(get_chadwyck_corpus)
df_corpus = get_chadwyck_corpus(df_meta)
df_corpus

**`get_chadwyck_corpus`**

```md
Load metadata and poem texts into a single corpus DataFrame.

    Parameters
    - clean_poem: If True, clean poem texts after reading.
    - force: If True, ignore in-memory cache and rebuild corpus.
    - args/kwargs: Passed to `get_chadwyck_corpus_metadata`.

    Returns
    - pd.DataFrame with metadata plus a `txt` column containing poem text.

    Side Effects
    - Caches the result in the module-level `CORPUS`.
    
```
----


* Loading Chadwyck-Healey corpus (metadata + txt)
* Loading 204514 texts


  : 100%|██████████| 204514/204514 [00:35<00:00, 5770.47it/s]


| id | id_hash | period_meta | subcorpus | author | author_dob | title | year | num_lines | volume | line | rhyme | genre | period | txt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| c20-american/am22053/Z300233356 | 1 | 1900-1999 Twentieth-Century | American Poetry | Murray, Joan, 1917-1942 | 1917.0 | I FEEL ONLY THE DESOLATION OF WIDE WATER | 1947 | 26 | | Its back a silver dimpling from the sun. | | | 1900-1950 | I Feel only the desolation of wide water,\nIts... |
| english/keategeo/Z200407830 | 1 | 1750-1799 Later Eighteenth-Century | English Poetry | Keate, George, 1729-1797 | 1729.0 | TO A LADY; FROM HER DEAD BULLFINCH. | 1759 | 42 | The Poetical Works (1781) | Or murmur at my alter'd state? | y | | 1700-1750 | Why does my Mistress mourn my fate?\nOr murmur... |
| english/beaujose/Z200275687 | 2 | 1603-1660 Jacobean and Caroline | English Poetry | Beaumont, Joseph, 1616-1699 | 1616.0 | The Candle | 1646 | 84 | The Minor Poems (1914) | Of a wax Candle in y&supere; Dark: | y | | 1600-1650 | The Life and Death I once did mark\nOf a wax C... |
| english-ed2/clarejoh/Z300313661 | 8 | 1800-1834 Early Nineteenth-Century | English Poetry | Clare, John, 1793-1864 | 1793.0 | ON SEEING SOME MOSS IN FLOWER EARLY IN SPRING | 1823 | 56 | The Midsummer Cushion (1990) | Wood walks are pleasant every day | y | | 1750-1800 | Wood walks are pleasant every day\nWhere thoug... |
| english/shipmant/Z200485468 | 10 | 1660-1700 Restoration | English Poetry | Shipman, Thomas, 1632-1680 | 1632.0 | To the Reader of the following Poem. | 1662 | 35 | Carolina: or, Loyal Poems (1683) | For all that can be done or said, | y | | 1600-1650 | Favour I shall not hawk to gain;\nThe Quarry i... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| english/nesbitep/Z300136537 | 999962 | 1870-1899 Later Nineteenth-Century | English Poetry | Nesbit, E. (Edith), 1858-1924 | 1858.0 | “THIS DESIRABLE MANSION” | 1888 | 16 | | &indent;Across the sodden, tangled grass, | y | | 1850-1900 | The long white windows blankly stare\n Acro... |
| c20-american/am20048/Z300370315 | 999971 | 1900-1999 Twentieth-Century | American Poetry | Wheelwright, John, 1897-1940 | 1897.0 | DEFENDER OF THE FAITH | 1927 | 59 | | While voices sift from the infrangile air: | | | 1850-1900 | While voices sift from the infrangile air:\n"I... |
| english/loversam/Z300418290 | 999973 | 1800-1834 Early Nineteenth-Century | English Poetry | Lover, Samuel, 1797-1868 | 1797.0 | ABSENCE. To &lblank;. | 1827 | 16 | Songs and ballads (1858) | Then, all is night; | y | | 1750-1800 | As when the sun withdraweth quite,\nThen, all ... |
| c20-english/ep30101/Z200604786 | 999978 | 1900-1999 Twentieth-Century | English Poetry | McGuckian, Medbh, 1950- | 1950.0 | Road 32, Roof 13–23, Grass 23 | 1980 | 38 | | The dark wound her chestnut hair | | | 1950-2000 | The dark wound her chestnut hair\nAround her n... |
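
The docstring for `get_chadwyck_corpus` describes two behaviours: attaching a `txt` column to the metadata, and caching the result at module level unless `force` is set. The sketch below illustrates that pattern under stated assumptions. `build_corpus`, `_CORPUS_CACHE`, and the `read_text` callable are hypothetical names. How the real function maps an `id` to a file on disk is not shown in this notebook, so the reader is passed in as a parameter.

```python
import pandas as pd

_CORPUS_CACHE = None  # hypothetical stand-in for the module-level `CORPUS`

def build_corpus(df_meta, read_text, force=False):
    """Return metadata plus a `txt` column, reusing a cached copy if present.

    `read_text` is any callable mapping a poem id to its text.
    With force=True the cache is ignored and the corpus is rebuilt.
    """
    global _CORPUS_CACHE
    if _CORPUS_CACHE is not None and not force:
        return _CORPUS_CACHE
    texts = [read_text(poem_id) for poem_id in df_meta.index]
    _CORPUS_CACHE = df_meta.assign(txt=texts)  # new frame; df_meta is unchanged
    return _CORPUS_CACHE

if __name__ == "__main__":
    meta = pd.DataFrame(
        {"title": ["Poem A", "Poem B"]},
        index=pd.Index(["demo/a", "demo/b"], name="id"),
    )
    corpus = build_corpus(meta, lambda pid: f"(text of {pid})", force=True)
    print(corpus)
```

One consequence of this design is that a cached result is returned even when a different `df_meta` is passed in later, so `force=True` is needed after changing the metadata filters. Whether the package behaves this way is an assumption of the sketch.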
