## Introduction and setup

To get started, we need to import the Python modules we'll use throughout this notebook.

In [154]:
from htrc_features import FeatureReader
from htrc_features import utils
from htrc_features import Volume

import numpy as np
import pandas as pd
import glob


### Data being used:
* **Poetry:**
    * coo.31924054824473 - Harper's magazine. v.142 1920/21
         * "Fire and Ice" - Robert Frost - page 67, sequence 79
    * uc1.32106006795980 - *The last song : poems* - Joy Harjo
    * mdp.39015000639800 - *Hard words, and other poems* - Ursula K. Le Guin

* **Prose:**
    * mdp.39015037761049 - *The spiral of memory : interviews* - Joy Harjo; edited by Laura Coltelli
    * mdp.39015052467530 - *The left hand of darkness* - Ursula K. LeGuin
    * Rest of that issue of Harper's Magazine


### General workflow:
* Show differentiating between volumes of prose and poetry for the same author:
    * Use Extracted Features (EF) files to show a higher proportion of capitalized beginLineChars in poetry vs. prose (if valid)
* Show IDing poetry pages in a volume of mixed poetry and prose
    * Find a volume with both, make note of page of poem(s)
    * check beginLineChars for pages with higher rates of capitalized beginLineChars
    * check for average tokens per page
    * if both criteria are met, mark as possible poetry on page

In [311]:
fr_poetry = FeatureReader(ids=['uc1.32106006795980','mdp.39015000639800'])
fr_prose = FeatureReader(ids=['mdp.39015037761049','mdp.39015052467530'])
fr_harpers = FeatureReader(ids=['coo.31924054824473'])

fr_got = FeatureReader(ids=['mdp.39015050507618','mdp.39015046463629','mdp.39015054095784'])


None
['uc1.32106006795980', 'mdp.39015000639800']
None
['mdp.39015037761049', 'mdp.39015052467530']
None
['coo.31924054824473']
None
['mdp.39015050507618', 'mdp.39015046463629', 'mdp.39015054095784']


Let's see what titles have been loaded as Volumes. Because the *Harper's Magazine* volume is one volume of a serial, we also print its enumeration and chronology to tell it apart from other volumes with the same title.

In [470]:
for vol in fr_poetry.volumes():
    print(vol.title + vol.id)
    
for vol in fr_prose.volumes():
    print(vol.title + vol.id)
    
for vol in fr_harpers.volumes():
    print(vol.title, vol.enumeration_chronology, vol.id)


The last song : [poems] / Joy Harjo.uc1.32106006795980
Hard words, and other poems / Ursula K. Le Guin.mdp.39015000639800
The spiral of memory : interviews / Joy Harjo ; edited by Laura Coltelli.mdp.39015037761049
The left hand of darkness / by Ursula K. LeGuin.mdp.39015052467530
Harper's magazine. v.142 1920/21 coo.31924054824473


## File and page structure

We can also load one Volume at a time in order to examine its contents. Here we load each of our five volumes individually.

In [32]:
uklg_poetry = Volume('mdp.39015000639800')
print(uklg_poetry)

uklg_prose = Volume('mdp.39015052467530')
print(uklg_prose)

harpers = Volume('coo.31924054824473')
print(harpers)

jh_poetry = Volume('uc1.32106006795980')
print(jh_poetry)

jh_prose = Volume('mdp.39015037761049')
print(jh_prose)

<Volume: Hard words, and other poems /... (1981) by Le Guin, Ursula K. 1929->
<Volume: The left hand of darkness / by... (1969) by Le Guin, Ursula K. 1929->
<Volume: Harper's magazine. (1921) by Allen, Frederick Lewis 1890-1954 ed.>
<Volume: The last song : [poems] / Joy... (1975) by Harjo, Joy.>
<Volume: The spiral of memory : intervi... (1996) by Harjo, Joy.>


In [82]:
uklg_poetry_blc = uklg_poetry.begin_line_chars()
# uklg_poetry_blc.head(25)

harpers_blc = harpers.begin_line_chars()
# harpers_blc.head(25)


In [471]:
uklg_blc_caps = 0
uklg_blc_lower = 0

# Each row's index is a (page, section, place, char) tuple; the character is the last element
for idx, counts in uklg_poetry.begin_line_chars().iterrows():
    blc = idx[3]
    if blc.isupper():
        uklg_blc_caps += 1
    else:
        # Note: this branch also counts non-alphabetic characters (digits, punctuation)
        uklg_blc_lower += 1

print(f"Found {uklg_blc_caps} capitalized begin line characters")
print(f"Found {uklg_blc_lower} lowercase begin line characters")
print(f"Found {uklg_blc_caps + uklg_blc_lower} total begin line characters")

uklg_caps_pct = uklg_blc_caps / (uklg_blc_caps + uklg_blc_lower)
print(f"Uppercase characters make up {uklg_caps_pct:.2%} of begin line characters")

Found 390 capitalized begin line characters
Found 389 lowercase begin line characters
Found 779 total begin line characters
Uppercase characters make up 50.06% of begin line characters


In [472]:
uklg_prose_blc_caps = 0
uklg_prose_blc_lower = 0

# Same tally as above, this time for the prose volume
for idx, counts in uklg_prose.begin_line_chars().iterrows():
    blc = idx[3]
    if blc.isupper():
        uklg_prose_blc_caps += 1
    else:
        # Note: this branch also counts non-alphabetic characters (digits, punctuation)
        uklg_prose_blc_lower += 1

print(f"Found {uklg_prose_blc_caps} capitalized begin line characters")
print(f"Found {uklg_prose_blc_lower} lowercase begin line characters")
print(f"Found {uklg_prose_blc_caps + uklg_prose_blc_lower} total begin line characters")

uklg_prose_caps_pct = uklg_prose_blc_caps / (uklg_prose_blc_caps + uklg_prose_blc_lower)
print(f"Uppercase characters make up {uklg_prose_caps_pct:.2%} of begin line characters")

Found 1081 capitalized begin line characters
Found 3520 lowercase begin line characters
Found 4601 total begin line characters
Uppercase characters make up 23.49% of begin line characters


Given the two sets of results, it seems highly likely that our first volume, where just over 50% of begin line characters are capitalized, is poetry, while the second volume, where begin line characters are less than half as likely to be capitalized, is prose. Though this is a simple and broad method, it is generalizable and relatively quick in both human and compute time.
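
The comparison above can be sketched on plain Python data. This is a minimal illustration rather than part of the notebook's pipeline: `uppercase_share` is a hypothetical helper and the sample characters are made up. It also shows how the result shifts if non-alphabetic characters, which the loops above place in the "lowercase" bucket, are excluded.

```python
def uppercase_share(chars, letters_only=False):
    """Return the share of begin-line characters that are uppercase.

    chars: iterable of single-character strings (hypothetical sample data).
    letters_only: if True, ignore digits and punctuation entirely;
                  if False, they count toward the denominator, as in the loops above.
    """
    chars = list(chars)
    if letters_only:
        chars = [c for c in chars if c.isalpha()]
    if not chars:
        return 0.0
    return sum(c.isupper() for c in chars) / len(chars)


sample = ["A", "T", "b", "(", "w", "1"]  # made-up begin-line characters
print(f"{uppercase_share(sample):.2%}")                     # all characters in the denominator
print(f"{uppercase_share(sample, letters_only=True):.2%}")  # letters only
```

Restricting the denominator to letters raises the uppercase share for any volume that has lines beginning with punctuation or digits, so the two variants should not be compared against each other.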

### Finding poetry within a mixed volume
We've shown above that, given two volumes from a single author, we can use EF files--specifically the page-level metadata--to identify which is poetry and which is prose. But what about a single volume with mixed poetry and prose included? A similar workflow should be successful, but let's try.

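First, we'll look at how many tokens appear on each page, since pages with large token counts are unlikely to contain traditionally-structured poetry. Afterwards, we'll replicate the begin line characters analysis, modified slightly to separate pages that are suspected poetry from suspected prose.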

In [474]:
h_tokens = harpers.tokens_per_page()

h_avg_token_ct = h_tokens.mean()
print(f"Average tokens in this volume: {round(h_avg_token_ct)}")

h_tokens.head(15)

Average tokens in this volume: 619


page
1       0
2      44
3     502
4       0
5      25
6     666
7     752
8     155
9       0
10     18
11    502
12    780
13    200
14    756
15     11
Name: tokenCount, dtype: int64

In [477]:
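# Benchmark: one standard deviation below the mean tokens per page
benchmark = h_avg_token_ct - h_tokens.std()

# Summary statistics for tokens per page (displayed as the cell output)
h_tokens.describe()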

count     990.000000
mean      619.421212
std       242.466424
min         0.000000
25%       462.000000
50%       732.500000
75%       793.000000
max      1398.000000
Name: tokenCount, dtype: float64

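The summary statistics show a mean of about 619 tokens per page, a median of 732.5, and a standard deviation of about 242. Next, we'll take the mean minus one standard deviation as a benchmark and flag every page whose token count falls within 100 tokens of it as possible poetry.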

In [172]:
poetry_pages = []
prose_pages = []
benchmark = (h_avg_token_ct - h_tokens.std())

for page, count in h_tokens.items():
    # Flag pages whose token count is within 100 tokens of the benchmark
    if (benchmark - 100) <= count <= (benchmark + 100):
        poetry_pages.append(page)
    else:
        prose_pages.append(page)
        
print(f"Found {len(poetry_pages)} suspected pages of poetry")
print(f"Found {len(prose_pages)} suspected pages of prose")

print(poetry_pages)

Found 147 suspected pages of poetry
Found 843 suspected pages of prose
[16, 18, 35, 37, 39, 40, 69, 70, 71, 72, 73, 74, 75, 76, 79, 81, 82, 101, 105, 108, 141, 147, 148, 156, 157, 171, 173, 175, 180, 182, 183, 184, 185, 187, 188, 189, 190, 242, 251, 253, 276, 277, 293, 297, 301, 307, 337, 339, 341, 342, 344, 350, 357, 361, 363, 365, 368, 371, 373, 375, 377, 379, 381, 383, 389, 411, 413, 415, 416, 428, 435, 437, 459, 463, 493, 494, 495, 497, 498, 551, 553, 554, 559, 560, 563, 564, 582, 583, 586, 587, 588, 615, 616, 617, 618, 620, 652, 656, 659, 660, 691, 699, 700, 702, 731, 734, 753, 757, 758, 777, 781, 782, 827, 829, 830, 844, 882, 884, 886, 887, 888, 890, 891, 894, 895, 904, 912, 915, 919, 925, 931, 933, 935, 938, 939, 943, 951, 955, 958, 959, 965, 966, 967, 969, 972, 974, 977]


The code above makes a first pass over the volume, adding each page number to either the list of possible poetry pages or the list of likely prose pages. We identified possible poetry by defining a benchmark, the mean tokens per page minus one standard deviation for this volume of *Harper's Magazine*, and flagging pages whose token counts fall within 100 tokens of it. This is a rough estimate and should only be used as a first-pass method for weeding out pages that are very unlikely to contain traditionally-structured poetry (those with large token counts). Because this method alone is not reliable, we must combine it with further analysis. We'll do this by looking at what percentage of begin-line characters (letters) are capitalized, and see if we can find a trend.
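
To make the benchmark concrete, here is the arithmetic using the mean and standard deviation reported by `describe()` above. `in_poetry_window` is a hypothetical helper name; the 100-token tolerance is the value used in the loop.

```python
MEAN_TOKENS = 619.421212  # mean tokens per page, from describe()
STD_TOKENS = 242.466424   # standard deviation, from describe()
TOLERANCE = 100           # width on either side of the benchmark


def in_poetry_window(count, mean=MEAN_TOKENS, std=STD_TOKENS, tolerance=TOLERANCE):
    """True if a page's token count lies within `tolerance` of (mean - std)."""
    benchmark = mean - std
    return benchmark - tolerance <= count <= benchmark + tolerance


benchmark = MEAN_TOKENS - STD_TOKENS
print(f"benchmark: {benchmark:.2f}")
print(f"window: {benchmark - TOLERANCE:.2f} to {benchmark + TOLERANCE:.2f}")

# Token counts for pages 3, 8 and 13 as shown in the head() output earlier
for count in (502, 155, 200):
    print(count, in_poetry_window(count))
```

With these figures the benchmark is about 377 tokens, so the window runs from roughly 277 to 477 tokens. None of the three sample pages fall inside it: page 3 is too long, and pages 8 and 13 are too short.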

We'll be re-using the general code and workflow from the Le Guin example, with some modifications:

In [478]:
h_blc = harpers.begin_line_chars()
h_blc = h_blc.reset_index() # flattening the dataframe for a bit more clarity in code

h_blc.head()

|   | page | section | place | char | count |
|---|---|---|---|---|---|
| 0 | 2 | body | begin | ( | 1 |
| 1 | 2 | body | begin | B | 1 |
| 2 | 2 | body | begin | H | 1 |
| 3 | 2 | body | begin | I | 1 |
| 4 | 2 | body | begin | J | 1 |


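Resetting the index turns page, section, place, and char into ordinary columns, so each row records one begin-line character on one page along with how many lines it begins. Next, we'll loop over these rows and tally, for each page, how many distinct begin-line characters are uppercase letters, lowercase letters, or non-alphabetic.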

In [384]:
h_blc_up = {}
h_blc_low = {}
h_blc_nonalpha = {}

for _, row in h_blc.iterrows():
    page = row['page']
    char = row['char']

    # Tally each distinct begin-line character once per page (not weighted by its count)
    if char.isalpha():
        if char.isupper():
            h_blc_up[page] = h_blc_up.get(page, 0) + 1
        else:
            h_blc_low[page] = h_blc_low.get(page, 0) + 1
    else:
        h_blc_nonalpha[page] = h_blc_nonalpha.get(page, 0) + 1

In [None]:
# verifying results

for k, v in h_blc_up.items():
    print(k,v)
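
The tally above counts each distinct begin-line character once per page, regardless of its `count` value. As a hedged sketch of an alternative, the snippet below can weight each character by how many lines it begins. The rows are made-up tuples in a (page, char, count) shape, and `classify` and `tally` are hypothetical helper names.

```python
from collections import Counter, defaultdict


def classify(char):
    """Label a begin-line character as 'uppercase', 'lowercase' or 'non-alpha'."""
    if not char.isalpha():
        return "non-alpha"
    return "uppercase" if char.isupper() else "lowercase"


def tally(rows, weighted=False):
    """Count character classes per page from (page, char, count) rows."""
    per_page = defaultdict(Counter)
    for page, char, count in rows:
        per_page[page][classify(char)] += count if weighted else 1
    return per_page


# Made-up rows: page 2 has "(" and two uppercase letters; on page 3, "t" begins 12 lines
rows = [(2, "(", 1), (2, "B", 1), (2, "H", 1), (3, "t", 12), (3, "A", 2)]

print(dict(tally(rows)[3]))                 # per-row tally
print(dict(tally(rows, weighted=True)[3]))  # count-weighted tally
```

On the made-up page 3, the per-row tally sees one lowercase and one uppercase character, while the weighted tally sees twelve lowercase line starts against two uppercase ones. Which version is more useful for spotting poetry is a judgment call to test against the data.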

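With the three dictionaries built, we can combine them into a single DataFrame with one row per page and one column per character class. A page that is missing from a dictionary had no characters of that class, so we fill those gaps with 0.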

In [448]:
dictionaries = [h_blc_nonalpha, h_blc_low, h_blc_up]
harpers_blc_df = pd.DataFrame.from_dict(dictionaries)
# harpers_blc_df.head(15)

harpers_blc_df = harpers_blc_df.rename(index={0: "non-alpha", 1: "lowercase", 2: "uppercase"})
harpers_blc_df = harpers_blc_df.T
harpers_blc_df = harpers_blc_df.fillna(0)
harpers_blc_df.head(15)
# harpers_blc_df.shape

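|   | non-alpha | lowercase | uppercase |
|---|---|---|---|
| 2 | 1.0 | 0.0 | 5.0 |
| 3 | 11.0 | 15.0 | 7.0 |
| 5 | 2.0 | 0.0 | 5.0 |
| 6 | 7.0 | 9.0 | 16.0 |
| 7 | 8.0 | 6.0 | 21.0 |
| 8 | 1.0 | 1.0 | 11.0 |
| 10 | 1.0 | 0.0 | 2.0 |
| 11 | 0.0 | 15.0 | 13.0 |
| 12 | 2.0 | 21.0 | 9.0 |
| 13 | 0.0 | 14.0 | 3.0 |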


In [458]:
blc_poss_poetry = []

for page, row in harpers_blc_df.iterrows():
    non_alpha = row['non-alpha']
    low = row['lowercase']
    up = row['uppercase']

    # Keep only pages where uppercase letters form a strict plurality
    if up > non_alpha and up > low:
        blc_poss_poetry.append(page)
        

In [None]:
# verifying results

blc_poss_poetry
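
The plurality test can also be written as a small function, which makes the handling of ties explicit. This is an illustrative sketch: `dominant_class` is a hypothetical name, and the three rows are taken from the table above.

```python
def dominant_class(non_alpha, lowercase, uppercase):
    """Return the class with strictly the most characters, or None on a tie for first."""
    counts = {"non-alpha": non_alpha, "lowercase": lowercase, "uppercase": uppercase}
    best = max(counts, key=counts.get)
    top = counts[best]
    if sum(1 for value in counts.values() if value == top) > 1:
        return None  # tie for first place: no plurality, so the page is not flagged
    return best


# (non-alpha, lowercase, uppercase) counts for pages 2, 3 and 11
pages = {2: (1.0, 0.0, 5.0), 3: (11.0, 15.0, 7.0), 11: (0.0, 15.0, 13.0)}
flagged = [page for page, counts in pages.items() if dominant_class(*counts) == "uppercase"]
print(flagged)
```

Of these three pages only page 2 is flagged. Like the loop above, a page where uppercase ties with another class is left out.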

Now that we have two lists of potential pages, one based on token counts by page (`poetry_pages`) and one based on capital letters making up the plurality of begin line characters (`blc_poss_poetry`), we can identify pages that appear in both lists and are thus the most likely candidates to be poetry instead of prose. We can do this using one simple piece of code:

In [462]:
poetry = set(poetry_pages).intersection(blc_poss_poetry)

poetry

{79,
 81,
 82,
 242,
 307,
 350,
 416,
 691,
 884,
 890,
 895,
 943,
 955,
 958,
 966,
 969,
 977}
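
Set operations can also show which pages satisfied only one criterion, which may help when tuning either threshold. A minimal sketch with made-up page numbers; the two lists are hypothetical stand-ins for `poetry_pages` and `blc_poss_poetry`:

```python
token_candidates = [16, 18, 79, 81, 82]  # stand-in for poetry_pages
caps_candidates = [2, 79, 81, 82, 90]    # stand-in for blc_poss_poetry

both = set(token_candidates) & set(caps_candidates)         # same as .intersection()
tokens_only = set(token_candidates) - set(caps_candidates)  # passed the token test only
caps_only = set(caps_candidates) - set(token_candidates)    # passed the capitals test only

print(sorted(both))
print(sorted(tokens_only))
print(sorted(caps_only))
```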

It's important to remember that this workflow is part of the exploratory process, so we should manually check the results above and then refine our process to make sure it's as accurate as possible based on our data. Since this issue of *Harper's Magazine* is in the public domain, we can check the pages above and see what we find. To make this easy in our Jupyter notebook, we'll take the URL for the volume in HathiTrust's page turner and append each of our possible pages as the sequence number at the very end:

In [469]:
import webbrowser # optional library to use which will open each of our page URLs in a new browser tab

harpers_url = 'https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq='


for page in sorted(poetry):
    page_url = harpers_url + str(page)
#     print(page_url) # instead of opening tabs, we could just print the URLs and open them manually
    webbrowser.open_new_tab(page_url) # optional: will open each page URL in a new browser tab, for investigation
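
As an alternative to string concatenation, the URL can be assembled with the standard library's `urllib.parse.urlencode`, which percent-encodes any reserved characters in the query values. This is a sketch: `page_turner_url` is a hypothetical helper, and the base URL and parameters are the ones used above.

```python
from urllib.parse import urlencode

BASE_URL = "https://babel.hathitrust.org/cgi/pt"


def page_turner_url(volume_id, seq):
    """Build a page turner URL for one sequence number of a volume."""
    query = urlencode({"id": volume_id, "view": "1up", "seq": seq})
    return f"{BASE_URL}?{query}"


print(page_turner_url("coo.31924054824473", 79))
```

For this volume the result is identical to the concatenated URL. The encoding step only changes the output for identifiers that contain characters such as `:` or `/`.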
