## Introduction and setup

This notebook contains example use cases for HTRC's [Extracted Features Dataset](https://wiki.htrc.illinois.edu/display/COM/Extracted+Features+Dataset). This derived dataset, generated from all volumes in HathiTrust, contains volume-level bibliographic metadata as well as page-level metadata and features. You can read the full description of the dataset and its fields at the above URL. Given Python's popularity and the expertise of HTRC staff, we'll use Python in this Jupyter notebook to demo the use cases. However, the dataset can be used with any language, and with or without the libraries used below. This notebook also has [an accompanying web page](https://wiki.htrc.illinois.edu/display/COM/EF+Use+Cases+and+Examples) with more general information about the dataset and how it can be used. If you're new to the dataset, we strongly encourage you to start there.

Our sample use cases in this notebook are centered around using metadata and features to identify poetry amongst prose. We'll focus on two specific examples:
1. Identifying which volume is poetry and which is prose when given mixed volumes by a single author
2. Identifying pages of poetry within a single volume that mixes both prose and poetry.

As with most Python notebooks, we'll start by importing the libraries we need for these examples. These are [Pandas](https://pandas.pydata.org/), a very common data science library in Python, and the [HTRC FeatureReader](https://github.com/htrc/htrc-feature-reader), a library written by HTRC to ease the use of the Extracted Features Dataset:

In [93]:
from htrc_features import FeatureReader
from htrc_features import Volume
import pandas as pd

### The data

For our first use case, we'll be using two volumes by Ursula K. Le Guin--one volume of prose (*The Left Hand of Darkness*) and one volume of poetry (*Hard Words and Other Poems*). Since the goal of this notebook is to demo a possible use case, and not actually achieve the task of differentiating these two volumes, our results are spoiled, since we know the volume titles. But, in the name of learning, we'll continue as if we are ignorant of the giveaway in the title of one of these volumes!

For our second use case, we'll be using one bound volume of *Harper's Magazine*: volume 142, which includes issues from 1920 and 1921. This particular volume is verified as having poetry within it, including a relatively famous poem by Robert Frost (but probably not the one you're thinking of!).



The volumes used in this notebook, grouped by genre:
* **Poetry**
    * coo.31924054824473 - Harper's magazine. v.142 1920/21
         * "Fire and Ice" - Robert Frost - page 67, sequence 79
    * Volume 1: mdp.39015000639800 - *Hard words, and other poems* - Ursula K. Le Guin

* **Prose:**
    * Volume 2: mdp.39015052467530 - *The left hand of darkness* - Ursula K. Le Guin
    * coo.31924054824473 - Harper's magazine. v.142 1920/21 (mostly prose!)


### General workflow:
* Differentiate between volumes of prose and poetry by the same author:
    * Use EF files to show a higher proportion of capitalized begin line characters (`beginLineChars`) in poetry vs. prose (if valid)
* Identify poetry pages in a volume of mixed poetry and prose:
    * Find a volume with both, and make note of the page(s) of the poem(s)
    * Check `beginLineChars` for pages with higher rates of capitalized begin line characters
    * Check the token count per page against the volume average
    * If both criteria are met, mark the page as possible poetry

Let's load each of the three volumes by its HathiTrust ID and print its title to confirm that we have the right files. Creating one `Volume` at a time lets us examine each volume's contents separately.

In [37]:
# Le Guin, volume 1
uklg_v1 = Volume('mdp.39015000639800')
print(uklg_v1.title, uklg_v1.id)

# Le Guin, volume 2
uklg_v2 = Volume('mdp.39015052467530')
print(uklg_v2.title, uklg_v2.id)

# Harper's Magazine, v.142
harpers = Volume('coo.31924054824473')
print(harpers.title, harpers.enumeration_chronology, harpers.id)

Hard words, and other poems / Ursula K. Le Guin. mdp.39015000639800
The left hand of darkness / by Ursula K. LeGuin. mdp.39015052467530
Harper's magazine. v.142 1920/21 coo.31924054824473


In [38]:
uklg_v1_blc = uklg_v1.begin_line_chars()
uklg_v2_blc = uklg_v2.begin_line_chars()

uklg_v1_blc.head(15)
# uklg_v2_blc.head(15)

page,section,place,char,count
1,body,begin,0,1
3,body,begin,H,2
3,body,begin,U,1
3,body,begin,a,2
9,body,begin,H,1
11,body,begin,',3
11,body,begin,/,1
11,body,begin,D,1
11,body,begin,H,1
11,body,begin,J,2


In [39]:
uklg_v1_blc_caps = 0
uklg_v1_blc_lower = 0

# Each row is one distinct (page, section, place, char) combination;
# the character is the fourth level of the index.
for idx, count in uklg_v1_blc.iterrows():
    blc = idx[3]
    if blc.isupper():
        uklg_v1_blc_caps += 1
    else:
        # Everything that is not uppercase lands here, including digits and punctuation
        uklg_v1_blc_lower += 1

print(f"Found {uklg_v1_blc_caps} capitalized begin line characters")
print(f"Found {uklg_v1_blc_lower} lowercase begin line characters")
print(f"Found {uklg_v1_blc_caps + uklg_v1_blc_lower} total begin line characters")

uklg_v1_caps_pct = uklg_v1_blc_caps / (uklg_v1_blc_caps + uklg_v1_blc_lower)
print(f"Uppercase characters make up {uklg_v1_caps_pct:.2%} of begin line characters")

Found 390 capitalized begin line characters
Found 389 lowercase begin line characters
Found 779 total begin line characters
Uppercase characters make up 50.06% of begin line characters


In [40]:
uklg_v2_blc_caps = 0
uklg_v2_blc_lower = 0

# Same counting loop as for the first volume
for idx, count in uklg_v2_blc.iterrows():
    blc = idx[3]
    if blc.isupper():
        uklg_v2_blc_caps += 1
    else:
        # Everything that is not uppercase lands here, including digits and punctuation
        uklg_v2_blc_lower += 1

print(f"Found {uklg_v2_blc_caps} capitalized begin line characters")
print(f"Found {uklg_v2_blc_lower} lowercase begin line characters")
print(f"Found {uklg_v2_blc_caps + uklg_v2_blc_lower} total begin line characters")

# Use the v2 counter here; the earlier name uklg_prose_blc_caps is not defined in this notebook
uklg_v2_caps_pct = uklg_v2_blc_caps / (uklg_v2_blc_caps + uklg_v2_blc_lower)
print(f"Uppercase characters make up {uklg_v2_caps_pct:.2%} of begin line characters")

Found 1081 capitalized begin line characters
Found 3520 lowercase begin line characters
Found 4601 total begin line characters
Uppercase characters make up 23.49% of begin line characters


Given the two sets of results, it seems highly likely that our first volume, where just over 50% of begin line characters are capitalized, is poetry, while the second volume, where begin line characters are roughly half as likely to be capitalized (about 23%), is prose. Though this is a simple and broad method, it was relatively quick to write and run, which could make it a generalizable process for exploring a workset or collection.
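The counting loop used for both volumes can be factored into a single helper. The sketch below is a minimal, standard-library illustration, assuming the begin line characters have already been extracted as `(char, count)` pairs. The function name `uppercase_share` and its `weighted` flag are hypothetical and are not part of the HTRC FeatureReader. The cells above count each distinct row once, so the sketch keeps that as the default and offers a count-weighted variant for comparison.

```python
def uppercase_share(rows, weighted=False):
    """Return the fraction of begin line characters that are uppercase.

    rows: iterable of (char, count) pairs (an assumed input shape).
    weighted=False counts each row once, as the notebook cells do;
    weighted=True weights each row by its occurrence count.
    """
    upper = 0
    total = 0
    for char, count in rows:
        weight = count if weighted else 1
        total += weight
        if char.isupper():
            upper += weight
    return upper / total if total else 0.0


# Toy rows for illustration only; they are not taken from either volume.
sample_rows = [("H", 2), ("a", 2), ("'", 3), ("D", 1)]
print(f"Unweighted: {uppercase_share(sample_rows):.2%}")
print(f"Weighted:   {uppercase_share(sample_rows, weighted=True):.2%}")
```

Whether the weighted or unweighted share separates poetry from prose more cleanly is an open question that would need checking against real volumes.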

### Finding poetry within a mixed volume
We've shown above that, given two volumes from a single author, we can use EF files--specifically the page-level metadata--to identify which is poetry and which is prose. But what about a single volume with mixed poetry and prose included? A similar workflow should be successful, but let's try.

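First, we'll look at the number of tokens on each page, since pages with large token counts are unlikely to contain traditionally structured poetry. After that, we'll replicate the begin line characters analysis from above, modified slightly to separate pages of suspected poetry from pages of suspected prose.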


In [42]:
h_tokens = harpers.tokens_per_page()

h_avg_token_ct = h_tokens.mean()
print(f"Average tokens in this volume: {round(h_avg_token_ct)}")

h_tokens.head(15)

Average tokens in this volume: 619


page
1       0
2      44
3     502
4       0
5      25
6     666
7     752
8     155
9       0
10     18
11    502
12    780
13    200
14    756
15     11
Name: tokenCount, dtype: int64

In [43]:
# Benchmark: one standard deviation below the mean tokens per page
benchmark = h_avg_token_ct - h_tokens.std()

print(benchmark)

376.95478765907876


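The cell above computes a benchmark token count: the volume's mean tokens per page minus one standard deviation, or about 377 tokens. Next, we'll go through every page and flag it as possible poetry if its token count falls within 100 tokens of that benchmark; all other pages are treated as suspected prose.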

In [45]:
poetry_pages = []
prose_pages = []
benchmark = (h_avg_token_ct - h_tokens.std())

# Series.items() replaces iteritems(), which was removed in pandas 2.0
for page, count in h_tokens.items():
    # Flag pages whose token count is within 100 of the benchmark
    if (benchmark - 100) <= count <= (benchmark + 100):
        poetry_pages.append(page)
    else:
        prose_pages.append(page)
        
print(f"Found {len(poetry_pages)} suspected pages of poetry")
print(f"Found {len(prose_pages)} suspected pages of prose")

print(poetry_pages)

Found 147 suspected pages of poetry
Found 843 suspected pages of prose
[16, 18, 35, 37, 39, 40, 69, 70, 71, 72, 73, 74, 75, 76, 79, 81, 82, 101, 105, 108, 141, 147, 148, 156, 157, 171, 173, 175, 180, 182, 183, 184, 185, 187, 188, 189, 190, 242, 251, 253, 276, 277, 293, 297, 301, 307, 337, 339, 341, 342, 344, 350, 357, 361, 363, 365, 368, 371, 373, 375, 377, 379, 381, 383, 389, 411, 413, 415, 416, 428, 435, 437, 459, 463, 493, 494, 495, 497, 498, 551, 553, 554, 559, 560, 563, 564, 582, 583, 586, 587, 588, 615, 616, 617, 618, 620, 652, 656, 659, 660, 691, 699, 700, 702, 731, 734, 753, 757, 758, 777, 781, 782, 827, 829, 830, 844, 882, 884, 886, 887, 888, 890, 891, 894, 895, 904, 912, 915, 919, 925, 931, 933, 935, 938, 939, 943, 951, 955, 958, 959, 965, 966, 967, 969, 972, 974, 977]


The code above makes a first-pass split of the volume, adding each page number to a list of possible poetry pages or possible prose pages. We identified these pages by first defining a benchmark that could indicate the presence of poetry: one standard deviation below the mean tokens per page for this volume of *Harper's Magazine*. Pages whose total tokens fall within 100 of that benchmark are treated as possible poetry. This is a rough estimate, and it should only be used as a first-pass method for weeding out pages that are very unlikely to contain traditionally structured poetry (those with large token counts). This method alone is not sufficient, so we must combine it with further analysis to be more confident in the results. We'll do this by looking at the percentage of begin line characters (letters) that are capitalized, and seeing if we can find a trend.
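The benchmark logic can also be expressed without pandas, which may make the arithmetic easier to follow. This is a minimal sketch, assuming token counts are available as a plain `{page: count}` dictionary; the function name `pages_near_benchmark` and its `tolerance` parameter are hypothetical. It uses `statistics.stdev`, which computes the sample standard deviation, the same default that pandas' `Series.std()` uses.

```python
from statistics import mean, stdev


def pages_near_benchmark(tokens_per_page, tolerance=100):
    """Flag pages whose token count is within `tolerance` of (mean - std).

    tokens_per_page: dict mapping page number to token count (assumed shape).
    Returns the benchmark and the list of flagged page numbers.
    """
    counts = list(tokens_per_page.values())
    benchmark = mean(counts) - stdev(counts)
    flagged = [
        page
        for page, count in tokens_per_page.items()
        if abs(count - benchmark) <= tolerance
    ]
    return benchmark, flagged


# Toy counts for illustration only; they are not taken from the Harper's volume.
toy_counts = {1: 100, 2: 200, 3: 300}
benchmark, flagged = pages_near_benchmark(toy_counts, tolerance=50)
print(benchmark, flagged)
```

Keeping the tolerance as a parameter makes it easy to try narrower or wider windows during exploration; the value of 100 used in this notebook is an estimate rather than a tuned threshold.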

We'll be re-using the general code and workflow from the Le Guin example, with some modifications:

In [46]:
h_blc = harpers.begin_line_chars()
h_blc = h_blc.reset_index() # flattening the dataframe for a bit more clarity in code

h_blc.head()

,page,section,place,char,count
0,2,body,begin,(,1
1,2,body,begin,B,1
2,2,body,begin,H,1
3,2,body,begin,I,1
4,2,body,begin,J,1


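After resetting the index, each row of the DataFrame holds a page, a section, a position, a character, and a count. Next, we'll loop through the rows and, for each page, tally how many of its distinct begin line characters are uppercase letters, lowercase letters, or non-alphabetic characters, storing each tally in its own dictionary keyed by page: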

In [70]:
h_blc_up = {}
h_blc_low = {}
h_blc_nonalpha = {}

for i, r in h_blc.iterrows():
    # Access columns by label; positional access such as r[0] is deprecated in recent pandas
    page = r['page']
    char = r['char']

    # Tally distinct begin line characters per page, by category
    if char.isalpha():
        if char.isupper():
            h_blc_up[page] = h_blc_up.get(page, 0) + 1
        else:
            h_blc_low[page] = h_blc_low.get(page, 0) + 1
    else:
        h_blc_nonalpha[page] = h_blc_nonalpha.get(page, 0) + 1

# len() of each dictionary is the number of pages with at least one character of that type
print(f"Found {len(h_blc_up)} uppercase BLC")
print(f"Found {len(h_blc_low)} lowercase BLC")
print(f"Found {len(h_blc_nonalpha)} non-alpha BLC")

Found 977 uppercase BLC
Found 956 lowercase BLC
Found 871 non-alpha BLC
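The per-page tallying can also be sketched with the standard library's `Counter`, which returns 0 for a missing category and so avoids the need for three separate dictionaries. This is an illustrative alternative, not the notebook's method; the function names `classify_char` and `tally_by_page` are hypothetical, and the input is assumed to be `(page, char)` pairs.

```python
from collections import Counter, defaultdict


def classify_char(char):
    """Label a begin line character as uppercase, lowercase, or non-alpha."""
    if not char.isalpha():
        return "non-alpha"
    return "uppercase" if char.isupper() else "lowercase"


def tally_by_page(rows):
    """Count distinct begin line characters per page and category.

    rows: iterable of (page, char) pairs (an assumed input shape).
    """
    tallies = defaultdict(Counter)
    for page, char in rows:
        tallies[page][classify_char(char)] += 1
    return tallies


# Toy rows for illustration only.
toy_rows = [(2, "("), (2, "B"), (2, "H"), (3, "a")]
tallies = tally_by_page(toy_rows)
for page in sorted(tallies):
    print(page, dict(tallies[page]))
```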


Now, with three different dictionaries that share a common key (the sequence/page number), we'll create a new DataFrame with each page sequence as a row, and the tallies for each type of begin line character as columns. To do this, we create a DataFrame from the dictionaries, label its three rows, transpose its axes to move pages to rows, and then fill all `NaN` values (which result from pages with no lower/upper/non-alpha begin line characters) with 0 so that they can be compared numerically to the other values:

In [83]:
dictionaries = [h_blc_nonalpha, h_blc_low, h_blc_up]
harpers_blc_df = pd.DataFrame.from_dict(dictionaries)
# harpers_blc_df.head(15)

harpers_blc_df = harpers_blc_df.rename(index={0: "non-alpha", 1: "lowercase", 2: "uppercase"})
harpers_blc_df = harpers_blc_df.T
harpers_blc_df = harpers_blc_df.fillna(0)
harpers_blc_df.head(15)
# harpers_blc_df.shape

page,non-alpha,lowercase,uppercase
2,1.0,0.0,5.0
3,11.0,15.0,7.0
5,2.0,0.0,5.0
6,7.0,9.0,16.0
7,8.0,6.0,21.0
8,1.0,1.0,11.0
10,1.0,0.0,2.0
11,0.0,15.0,13.0
12,2.0,21.0,9.0
13,0.0,14.0,3.0


With this flattened DataFrame, we can now go through the rows and check which category of begin line characters is the most common on each page. For this example, since we're only interested in pages of possible poetry, we skip any page where lowercase or non-alpha begin line characters are most common, and only add a page number to our list when uppercase characters are most common:

In [84]:
blc_poss_poetry = []

for i, r in harpers_blc_df.iterrows():
    # Access values by column label rather than by position
    non_alpha = r['non-alpha']
    low = r['lowercase']
    up = r['uppercase']

    # Keep the page only when uppercase is the strict plurality; ties are skipped
    if up > non_alpha and up > low:
        blc_poss_poetry.append(i)

blc_poss_poetry[:15]

[2, 5, 6, 7, 8, 10, 15, 17, 79, 80, 81, 82, 231, 237, 242]

Now that we have two lists of potential pages, one based on token counts by page (`poetry_pages`) and one based on capital letters making up the plurality of begin line characters (`blc_poss_poetry`), we can identify pages that appear in both lists, and thus are the most likely candidates to be poetry instead of prose. We can do this using one simple piece of code:

In [92]:
poetry = set(poetry_pages).intersection(blc_poss_poetry)

poetry

{79,
 81,
 82,
 242,
 307,
 350,
 416,
 691,
 884,
 890,
 895,
 943,
 955,
 958,
 966,
 969,
 977}
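Before reviewing pages by hand, one quick sanity check is to confirm that pages already known to contain poetry appear among the candidates. The sketch below assumes the candidate set shown above and the one known page listed earlier in this notebook ("Fire and Ice", sequence 79); the variable names `candidates` and `known_poetry` are hypothetical.

```python
# Candidate sequence numbers copied from the intersection output above.
candidates = {79, 81, 82, 242, 307, 350, 416, 691, 884,
              890, 895, 943, 955, 958, 966, 969, 977}

# Known poetry pages by sequence number. Only "Fire and Ice" (sequence 79)
# is listed in this notebook; extend this set as pages are verified by hand.
known_poetry = {79}

found = known_poetry & candidates
missed = known_poetry - candidates

print(f"Known poetry pages recovered: {sorted(found)}")
print(f"Known poetry pages missed: {sorted(missed)}")
```

A single known page says little about overall accuracy, so this check complements manual review rather than replacing it.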

From 990 pages, we now have 17 strong candidates for pages that contain poetry! However, it's important to remember that this workflow is part of the exploratory process, so we should manually check our results and, if applicable, improve our process and code to maximize accuracy given what we know about our data. Since this volume of *Harper's Magazine* is in the public domain, we can check our result pages and see what we find. To make this easy within our Jupyter notebook, we'll use the URL of the first page of the volume in HathiTrust's page turner as a base, and then replace the sequence number at the very end with each of our candidate pages in order to print a URL for each one (or see the alternative in the code for opening each page in a browser tab):

In [90]:
# import webbrowser # OPTIONAL: import a library we can use to open each page URL in a new browser tab

harpers_url = 'https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq='


for page in sorted(poetry):
    page_url = harpers_url + str(page)
    print(page_url)
    # webbrowser.open_new_tab(page_url) # OPTIONAL: this code will open each page URL in a new browser 
                                        # tab, for investigation


https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=79
https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=81
https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=82
https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=242
https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=307
https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=350
https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=416
https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=691
https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=884
https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=890
https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=895
https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=943
https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=955
https://babel.hathitrust.org