## Example Use Cases for HTRC's Extracted Features Dataset 

This notebook contains example use cases for HTRC's [Extracted Features Dataset](https://wiki.htrc.illinois.edu/display/COM/Extracted+Features+Dataset). This derived dataset, generated from all volumes in HathiTrust, contains volume-level bibliographic metadata as well as page-level metadata and features. You can read the full description of the dataset and its fields at the above URL. Given Python's popularity and the expertise of HTRC staff, we'll be using Python in this Jupyter notebook to demo the use cases; however, this dataset can be used with any language, and with or without the libraries used below. This notebook also has [an accompanying web page](https://wiki.htrc.illinois.edu/display/COM/EF+Use+Cases+and+Examples) with more general information about the dataset and how it can be used. If you're new to the dataset, we strongly encourage you to start there.

Our sample use cases in this notebook are centered around using metadata and features to identify poetry amongst prose. We'll focus on two specific examples:
1. Identifying which volume is poetry and which is prose when given mixed volumes by a single author
2. Identifying pages of poetry within a single volume that mixes both prose and poetry

### The data

For our first use case, we'll be using two volumes by Ursula K. Le Guin, who wrote both prose and poetry. Our goal with these volumes will be to try to identify which volume is prose and which poetry using the Extracted Features Dataset.

For our second use case, we'll be using one volume of *Harper's Magazine*: volume 142, which includes issues from 1920 and 1921. This particular volume is verified as having poetry within it, including a relatively famous poem by Robert Frost (but probably not the one you're thinking of).

### Example 1: Using page-level features to identify poetry and prose in different volumes
Now that we know about our data and our goals, let's dive into our first use case: using Extracted Features to identify volumes of prose or poetry given a mixed workset of volumes. Though this use case may seem a little strange, since we could simply open a book and see if we find poetry or prose, that method will not work for volumes under copyright nor will it practically work for hundreds and thousands of volumes, as it would take ages to manually verify each in a library. However, since the Extracted Features Dataset is [non-consumptive](https://www.hathitrust.org/htrc_ncup), we can use this dataset, without restriction, to study volumes we cannot read due to copyright. Additionally, this workflow could be used to study up to millions of volumes and identify works of poetry amongst prose, something humanly impossible for a single researcher.

As with most Jupyter notebooks, we'll start by importing the libraries we'll need to tackle these examples: [Pandas](https://pandas.pydata.org/), a very common data science library in Python, and the [HTRC FeatureReader](https://github.com/htrc/htrc-feature-reader), a library written by HTRC to ease the use of the Extracted Features Dataset.

In [101]:
from htrc_features import FeatureReader, Volume
import pandas as pd

Our first step using our data will be to create FeatureReader Volume objects for each of our volumes of interest. A Volume in FeatureReader has specific properties and built-in methods that let us do a number of common tasks. We can create Volumes simply by using `Volume()` and adding the HathiTrust ID (HTID) in the parentheses:

In [102]:
leguin_v1 = Volume('mdp.39015000639800')
# print(leguin_v1)
print(f"{leguin_v1.author}, {leguin_v1.pub_date}, {leguin_v1.id}")

leguin_v2 = Volume('mdp.39015052467530')
# print(leguin_v2)
print(f"{leguin_v2.author}, {leguin_v2.pub_date}, {leguin_v2.id}")


['Le Guin, Ursula K. 1929- '], 1981, mdp.39015000639800
['Le Guin, Ursula K. 1929- '], 1969, mdp.39015052467530


Print statements are included to make sure things went as planned. They also illustrate that Volumes have specific metadata that we can access very quickly using the FeatureReader. To explore all of the options available, try going back to the above code cell and typing `leguin_v1.` (don't forget the period!) and then pressing `Tab`. A small pop-up will appear after the period showing all of the properties and methods available for that object.
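
If Tab completion is not available (for example, when running the code as a plain script), Python's built-in `dir()` gives a similar listing. The sketch below uses a hypothetical stand-in class, `DemoVolume`, rather than a real FeatureReader `Volume`, so that it runs without downloading any data; the names on a real Volume will differ.

```python
class DemoVolume:
    """Hypothetical stand-in for a FeatureReader Volume (illustration only)."""

    def __init__(self, htid, title):
        self.id = htid
        self.title = title

    def tokens_per_page(self):
        return []


demo = DemoVolume("demo.0001", "An example title")

# Keep only public names, i.e. those that do not start with an underscore
public_names = [name for name in dir(demo) if not name.startswith("_")]
print(public_names)
```

Running the same list comprehension on `leguin_v1` should list the properties and methods that the Tab pop-up shows.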

We'll be focusing on one particular field of features in the EF Dataset: begin-line characters. We'll be working off of the assumption that a page of poetry is likely to contain more capitalized letters at the start of each line on a page, as this is a common convention for poetry that has endured over hundreds of years. To start, we'll create a Pandas DataFrame with each begin line character, by page, for both of our volumes:

In [103]:
leguin_v1_blc = leguin_v1.begin_line_chars() 
leguin_v2_blc = leguin_v2.begin_line_chars()

leguin_v1_blc.head(15)
# leguin_v2_blc.head(15)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,count
page,section,place,char,Unnamed: 4_level_1
1,body,begin,0,1
3,body,begin,H,2
3,body,begin,U,1
3,body,begin,a,2
9,body,begin,H,1
11,body,begin,',3
11,body,begin,/,1
11,body,begin,D,1
11,body,begin,H,1
11,body,begin,J,2


With this DataFrame, we have every begin-line character from our volumes, their count by page, which section of the page they occur in (header, body, footer) and the page on which they occur. FeatureReader makes getting this data very easy (only 2 lines of code!), and retrieves it in a format that lets us go through and make our own counts with just a simple `for` loop:

In [125]:
leguin_v1_blc_up = 0 # a variable that will count uppercase BLCs
leguin_v1_blc_low = 0 # a variable that will count all other (non-uppercase) BLCs

for idx, count in leguin_v1_blc.iterrows(): # iterate through rows of our BLC dataframe
    # print(idx)
    # print(count)
    blc = idx[3] # the fourth index level (position 3, counting from 0) holds the BLC itself
    # print(blc)
    if blc.isupper(): # check if the BLC is uppercase
        leguin_v1_blc_up += 1 # if uppercase, add 1 to the number of uppercase BLCs
    else:
        leguin_v1_blc_low += 1 # if not uppercase (lowercase, digit, or punctuation), add 1 to the other count


# The next section has print statements that will report our results
        
print(f"Found {leguin_v1_blc_up} capitalized begin line characters")
print(f"Found {leguin_v1_blc_low} lowercase begin line characters")
print(f"Found {leguin_v1_blc_up + leguin_v1_blc_low} total begin line characters")

leguin_v1_caps_pct = leguin_v1_blc_up / (leguin_v1_blc_up + leguin_v1_blc_low) # using counts to generate a percentage

print(f"Uppercase characters make up {leguin_v1_caps_pct:.2%} of begin line characters")

Found 390 capitalized begin line characters
Found 389 lowercase begin line characters
Found 779 total begin line characters
Uppercase characters make up 50.06% of begin line characters
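
Note that the loop above adds 1 for every row, and each row is one distinct character on one page, so a character that begins twenty lines on a page counts the same as one that begins a single line. One possible variant, sketched below, weights each row by its `count` column instead. The DataFrame here is hand-made with hypothetical values, shaped like the output of `begin_line_chars()`, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical begin-line-character rows (illustrative values only)
rows = [
    (1, "body", "begin", "T", 12),
    (1, "body", "begin", "a", 3),
    (1, "body", "begin", '"', 1),
    (2, "body", "begin", "W", 8),
    (2, "body", "begin", "o", 6),
]
toy_blc = pd.DataFrame(rows, columns=["page", "section", "place", "char", "count"])
toy_blc = toy_blc.set_index(["page", "section", "place", "char"])

# Flatten the index so the character is an ordinary column
flat = toy_blc.reset_index()
is_upper = flat["char"].str.isupper()

# Row-based share (what the loop above computes) versus count-weighted share
row_share = is_upper.mean()
weighted_share = flat.loc[is_upper, "count"].sum() / flat["count"].sum()

print(f"Row-based share: {row_share:.2%}")
print(f"Count-weighted share: {weighted_share:.2%}")
```

Which version is more appropriate depends on the research question; the weighted share reflects how many lines actually start with a capital letter.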


Ok, so we found that over 50% of all BLC in Volume 1 by Le Guin are capitalized. That seems like a substantial amount, but it's hard to know for sure without comparison to our other volume. Let's repeat the code for Volume 2 and see what our results are:

In [123]:
leguin_v2_blc_up = 0
leguin_v2_blc_low = 0

for idx, count in leguin_v2_blc.iterrows():
    # print(count)
    blc = idx[3]
    # print(blc)
    if blc.isupper():
        leguin_v2_blc_up +=1
    else:
        leguin_v2_blc_low +=1

print(f"Found {leguin_v2_blc_up} capitalized begin line characters")
print(f"Found {leguin_v2_blc_low} lowercase begin line characters")
print(f"Found {leguin_v2_blc_up + leguin_v2_blc_low} total begin line characters")
        
        
leguin_v2_caps_pct = leguin_v2_blc_up / (leguin_v2_blc_up + leguin_v2_blc_low)
print(f"Uppercase characters make up {leguin_v2_caps_pct:.2%} of begin line characters")

Found 1081 capitalized begin line characters
Found 3520 lowercase begin line characters
Found 4601 total begin line characters
Uppercase characters make up 23.49% of begin line characters


From 50% capitalization of BLC to 23.5%! It now seems highly likely that our first volume is poetry, while the second volume is likely to be prose. Since this process is still part of our exploratory data analysis phase, we should do our best to verify these results manually, if possible. Since both volumes came from HathiTrust, we have bibliographic metadata for each volume, and can access certain fields within FeatureReader. Let's see what we can find out by retrieving the titles of our mystery volumes, along with a link back to the item in HathiTrust, in case we want to take a closer look:

In [130]:
print(f"leguin_v1: {leguin_v1.title}, {leguin_v1.ht_bib_url}")

print(f"leguin_v2: {leguin_v2.title}, {leguin_v2.ht_bib_url}")

leguin_v1: Hard words, and other poems / Ursula K. Le Guin., http://catalog.hathitrust.org/api/volumes/full/htid/mdp.39015000639800.json
leguin_v2: The left hand of darkness / by Ursula K. LeGuin., http://catalog.hathitrust.org/api/volumes/full/htid/mdp.39015052467530.json


Success! `leguin_v1` makes verification easy by including "and other poems" in its title. We could also use the returned URL to examine the volumes' bibliographic record and see if there is a marker for poetry. (If these volumes were in the public domain, we could also get their handle URL using `leguin_v1.handle_url` and examine the texts firsthand.)

Though this is a simple and broad method, it was relatively quick to write and run, which could make it a generalizable process for exploring a workset or collection, even at scale. With a digital library as large as HathiTrust, reliably finding a workset of materials can be a real challenge. This is one of the tasks that Extracted Features, and the FeatureReader, can help with, especially in cases where users cannot access the full volume directly.
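
To apply the same idea to more than two volumes, the counting logic can be wrapped in small functions. The following is a minimal sketch under stated assumptions: the function names, the 0.4 cutoff, and the volume IDs and characters in `workset` are all hypothetical and chosen only for illustration.

```python
def uppercase_share(chars):
    """Return the fraction of begin-line characters that are uppercase letters."""
    chars = list(chars)
    if not chars:
        return 0.0
    return sum(1 for c in chars if c.isupper()) / len(chars)


def label_volume(share, threshold=0.4):
    """Label a volume by its uppercase share; the 0.4 cutoff is an assumption."""
    return "possible poetry" if share >= threshold else "possible prose"


# Hypothetical workset: volume ID -> begin-line characters (illustrative values only)
workset = {
    "demo.vol1": ["T", "A", "W", "o", "I", "S"],
    "demo.vol2": ["t", "a", "T", "w", "o", "s", "h", "e"],
}

labels = {htid: label_volume(uppercase_share(chars)) for htid, chars in workset.items()}
for htid, label in labels.items():
    print(htid, label)
```

With real data, the characters for each volume could be taken from the `char` level of the DataFrame returned by `begin_line_chars()`, and the cutoff would need to be tuned against volumes whose genre is already known.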

### Example 2: Finding poetry within a mixed volume
We've shown above that, given two volumes from a single author, we can use EF files--specifically the page-level metadata--to identify which is poetry and which is prose. But what about a single volume with mixed poetry and prose included? A similar workflow should be successful, but let's try.

Broadly, we'll generate a benchmark tokens-per-page value, based on the mean number of tokens per page in this issue of Harper's, and then replicate the begin line characters analysis we did for Le Guin's work, but modified slightly to delineate pages that are suspected poetry from suspected prose. Pages that appear as suspected poetry by both metrics will be our results, which we'll then do our best to manually verify. 

But first, we'll import the libraries we need (in case you've jumped ahead to this example!) and then we'll load our volume of *Harper's* as a FeatureReader Volume:

In [134]:
from htrc_features import FeatureReader, Volume
import pandas as pd

harpers = Volume('coo.31924054824473')
# print(harpers)
print(f"{harpers.title}, {harpers.enumeration_chronology}, {harpers.id}")

Harper's magazine., v.142 1920/21, coo.31924054824473


All was successful, and we now have a FeatureReader Volume for our volume of *Harper's*, which lets us take advantage of all of the methods that the FeatureReader includes. One of these is `tokens_per_page()`, which will return a compact Pandas Series (a single column of data) with just the token count for each page. It will also allow us to identify the mean number of tokens per page in this volume, which we'll use for our benchmarking later on:

In [137]:
h_tokens = harpers.tokens_per_page() # creating a new DataFrame from our Volume, using tokens_per_page()

h_avg_token_ct = h_tokens.mean() # assigning the mean tokens per page to a variable
print(f"Average tokens per page in this volume: {round(h_avg_token_ct)}")

h_tokens.head(15)

Average tokens per page in this volume: 619


page
1       0
2      44
3     502
4       0
5      25
6     666
7     752
8     155
9       0
10     18
11    502
12    780
13    200
14    756
15     11
Name: tokenCount, dtype: int64

Next, we're going to use our mean per-page token count to calculate a benchmark figure that is one standard deviation below the mean. Pandas allows you to get this number very easily for a given column (what Pandas calls a "series") of numerical data using `.std()`. We'll then subtract the standard deviation from the mean to get a benchmark number of tokens that we will use to compare each page of the volume:

In [141]:
h_tokens.std()
benchmark = (h_avg_token_ct - h_tokens.std())

print(benchmark)

376.95478765907876


So we'll be looking for pages with a number of tokens near our benchmark of ~377 tokens. Because this is imprecise, and we should consider this part of our project exploratory, we will be considering pages within 100 tokens of this number, either above or below, as possible pages of poetry. For any given project, these types of decisions will need to be made by the researcher given their preferences, understanding of the data, and approach. 
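
The benchmark arithmetic can be sketched with the standard library alone. The token counts below are hypothetical, and the 100-token tolerance is the researcher's choice described above. One detail worth knowing: pandas' `.std()` computes the sample standard deviation by default (`ddof=1`), which corresponds to `statistics.stdev` rather than `statistics.pstdev`.

```python
import statistics

# Hypothetical per-page token counts (illustrative values only)
token_counts = [0, 44, 502, 666, 752, 155, 502, 780, 200, 756]

mean_tokens = statistics.mean(token_counts)
# Sample standard deviation, matching the pandas default (ddof=1)
std_tokens = statistics.stdev(token_counts)

toy_benchmark = mean_tokens - std_tokens  # one standard deviation below the mean
window = 100                              # tolerance chosen by the researcher
low, high = toy_benchmark - window, toy_benchmark + window

print(f"benchmark={toy_benchmark:.1f}, window=({low:.1f}, {high:.1f})")
```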

In the next code cell, we'll be using our benchmark token count to gather a preliminary list of suspected poetry pages by iterating through our tokens-per-page DataFrame `h_tokens` and recording rows where the token count is within 100 tokens of our benchmark:

In [143]:
poetry_pages = [] # a list that will hold page numbers for suspected poetry
other_pages = [] # a list that will hold page numbers that are not suspected to be poetry

for page, count in h_tokens.items(): # iterating through (page, token count) pairs; items() replaces iteritems(), which was removed in pandas 2.0
    # print(page)
    # print(count)
    if (benchmark - 100) <= count <= (benchmark + 100): # if count is within 100 of benchmark, we add to list of possible poetry
        # print(f"YOWIE! Page {page} has {count} tokens!")
        poetry_pages.append(page)
    else:
        other_pages.append(page) # if it's not in this range, we add to our other list

# printing out the number of pages suspected of poetry and those not       
print(f"Found {len(poetry_pages)} suspected pages of poetry")
print(f"Found {len(other_pages)} other pages")


Found 147 suspected pages of poetry
Found 843 other pages
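
The loop above can also be expressed without explicit iteration. The sketch below uses pandas' `Series.between()`, which is inclusive on both ends by default, just like the chained `<=` comparison. The Series and the benchmark value of 377 are hypothetical stand-ins so that the example runs on its own.

```python
import pandas as pd

# Hypothetical tokens-per-page Series indexed by page number (illustrative values only)
toy_tokens = pd.Series([0, 300, 420, 650, 380, 90], index=[1, 2, 3, 4, 5, 6], name="tokenCount")
toy_benchmark = 377  # assumed benchmark for this sketch

# True for pages whose token count lies within 100 of the benchmark
in_window = toy_tokens.between(toy_benchmark - 100, toy_benchmark + 100)

toy_poetry_pages = toy_tokens[in_window].index.tolist()
toy_other_pages = toy_tokens[~in_window].index.tolist()

print(toy_poetry_pages)
print(toy_other_pages)
```

For a single volume the difference in speed is negligible, but a vectorized filter like this tends to matter when the same step is repeated across many volumes.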


The code above makes a first pass through the volume, sorting each page number into a list of possible poetry pages or a list of other pages. We identified candidate pages by defining a benchmark that could indicate the presence of poetry: the total tokens on a page must be within 100 of our benchmark, which sits one standard deviation below the mean tokens per page for this volume of *Harper's Magazine*. This is a rough estimate, and it should only be used as a first-pass method for weeding out pages that are very unlikely to contain traditionally-structured poetry (those with large token counts). This method alone is not reliable, so we must combine it with further analysis to be more confident in our results. We'll do this by looking at what share of begin-line characters (letters) are capitalized on each page, and seeing if we can find a trend.

We'll be re-using the general code and workflow from the Le Guin example, with some modifications:

In [46]:
h_blc = harpers.begin_line_chars()
h_blc = h_blc.reset_index() # flattening the dataframe for a bit more clarity in code

h_blc.head()

Unnamed: 0,page,section,place,char,count
0,2,body,begin,(,1
1,2,body,begin,B,1
2,2,body,begin,H,1
3,2,body,begin,I,1
4,2,body,begin,J,1


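Resetting the index turns the page, section, place, and character levels into ordinary columns, so each row now describes one begin-line character on one page. Next, we'll loop through these rows and, for each page, count how many rows hold an uppercase letter, a lowercase letter, or a non-alphabetic character, storing the counts in three dictionaries keyed by page: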

In [70]:
h_blc_up = {}
h_blc_low = {}
h_blc_nonalpha = {}

for i, r in h_blc.iterrows():
    page = r['page'] # access values by column name rather than position
    section = r['section']
    token_place = r['place']
    char = r['char']
#     print(char)
    count = r['count']
#     print(count)
    
    if char.isalpha():
        if char.isupper():
            if page not in h_blc_up:
                h_blc_up[page] = 1
            else:
                h_blc_up[page] += 1
    
        else:
            if page not in h_blc_low:
                h_blc_low[page] = 1
            else:
                h_blc_low[page] += 1
    else:
        if page not in h_blc_nonalpha:
            h_blc_nonalpha[page] = 1
        else:
            h_blc_nonalpha[page] += 1

print(f"Found {len(h_blc_up)} pages with uppercase BLC")
print(f"Found {len(h_blc_low)} pages with lowercase BLC")
print(f"Found {len(h_blc_nonalpha)} pages with non-alpha BLC")

Found 977 pages with uppercase BLC
Found 956 pages with lowercase BLC
Found 871 pages with non-alpha BLC


Now, with three different dictionaries that share common keys (sequence/page numbers), we'll create a new DataFrame with each page sequence as a row, and values for each type of begin line character as columns. To do this, we'll create a DataFrame from the dictionaries, rename the rows, transpose the axes to move pages to rows (which turns the renamed rows into columns), and then fill all `NaN` values (which result from pages with no lower/upper/non-alpha begin line characters) with 0 so we can compare them numerically to the other values:

In [83]:
dictionaries = [h_blc_nonalpha, h_blc_low, h_blc_up]
harpers_blc_df = pd.DataFrame.from_dict(dictionaries)
# harpers_blc_df.head(15)

harpers_blc_df = harpers_blc_df.rename(index={0: "non-alpha", 1: "lowercase", 2: "uppercase"})
harpers_blc_df = harpers_blc_df.T
harpers_blc_df = harpers_blc_df.fillna(0)
harpers_blc_df.head(15)
# harpers_blc_df.shape

Unnamed: 0,non-alpha,lowercase,uppercase
2,1.0,0.0,5.0
3,11.0,15.0,7.0
5,2.0,0.0,5.0
6,7.0,9.0,16.0
7,8.0,6.0,21.0
8,1.0,1.0,11.0
10,1.0,0.0,2.0
11,0.0,15.0,13.0
12,2.0,21.0,9.0
13,0.0,14.0,3.0
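
The dictionary-and-transpose steps above can also be sketched with a pandas `groupby`. This is an alternative formulation, not the approach used in this notebook: the `categorize` helper and the toy DataFrame (with hypothetical values) are assumptions made for illustration. As in the loop above, it counts rows per page rather than summing the `count` column.

```python
import pandas as pd

# Hypothetical flattened begin-line-character rows (illustrative values only)
toy_flat = pd.DataFrame(
    {
        "page": [2, 2, 2, 3, 3, 3, 3],
        "char": ["(", "B", "H", "a", "t", "T", "1"],
        "count": [1, 1, 1, 4, 2, 3, 1],
    }
)


def categorize(char):
    """Assign a begin-line character to one of three categories."""
    if not char.isalpha():
        return "non-alpha"
    return "uppercase" if char.isupper() else "lowercase"


toy_flat["category"] = toy_flat["char"].map(categorize)

# Count rows per page and category; pages missing a category get 0 instead of NaN
toy_table = (
    toy_flat.groupby(["page", "category"])
    .size()
    .unstack(fill_value=0)
    .reindex(columns=["non-alpha", "lowercase", "uppercase"], fill_value=0)
)
print(toy_table)
```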


With this flattened DataFrame, we will go through the rows and check which category of begin line characters is the most common on each page, adding the page number to a list when uppercase characters come out on top. For this example, since we're only interested in pages of possible poetry, we simply continue the loop if a page has more lowercase or non-alpha begin line characters, and only track pages where uppercase characters are most common:

In [84]:
blc_poss_poetry = []

for i, r in harpers_blc_df.iterrows():
#     print(i)
    non_alpha = r["non-alpha"] # access values by column name rather than position
    low = r["lowercase"]
    up = r["uppercase"]
    total = non_alpha + low + up
#     print(total)
    if non_alpha > low and non_alpha > up:
#         print('non-alpha')
        continue
        
    elif low > non_alpha and low > up:
#         print('low')
        continue
        
    elif up > non_alpha and up > low:
        blc_poss_poetry.append(i)

blc_poss_poetry[:15]

[2, 5, 6, 7, 8, 10, 15, 17, 79, 80, 81, 82, 231, 237, 242]
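
The same selection can be written as a boolean filter. In the sketch below, the DataFrame holds hypothetical counts in the same three-column layout; uppercase must strictly exceed both other categories, so a page where the categories tie (page 20 here) is left out, as it would be in the loop above.

```python
import pandas as pd

# Hypothetical per-page counts (illustrative values only)
toy_counts = pd.DataFrame(
    {
        "non-alpha": [1, 11, 0, 4],
        "lowercase": [0, 15, 15, 4],
        "uppercase": [5, 7, 13, 4],
    },
    index=[2, 3, 11, 20],
)

# True only where uppercase strictly exceeds both other categories
upper_dominant = (toy_counts["uppercase"] > toy_counts["lowercase"]) & (
    toy_counts["uppercase"] > toy_counts["non-alpha"]
)
toy_poss_poetry = toy_counts[upper_dominant].index.tolist()
print(toy_poss_poetry)
```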

Now that we have two lists of potential pages, one based on token counts by page (`poetry_pages`) and one based on capital letters making up the plurality of begin line characters (`blc_poss_poetry`), we can identify pages that appear in both lists, and thus are the most likely candidates to be poetry instead of prose. We can do this using one simple piece of code:

In [92]:
poetry = set(poetry_pages).intersection(blc_poss_poetry)

poetry

{79,
 81,
 82,
 242,
 307,
 350,
 416,
 691,
 884,
 890,
 895,
 943,
 955,
 958,
 966,
 969,
 977}

From 990 pages, we now have 17 strong candidates for pages that contain poetry! However, it's important to be reminded again that this workflow is part of the exploratory process, so we should manually check our results, and, if applicable, improve our process and code to maximize accuracy given what we know about our data.

This issue of *Harper's Magazine* is in the public domain, so we can check our result pages and see what we find. To make this easy for use in our Jupyter notebook, we'll use the URL to the first page of the volume in HathiTrust's page turner as a base, and then substitute the sequence number at the very end with our possible pages in order to print URLs to each page (or see an alternative in the code for opening each page in a browser tab):

In [90]:
# import webbrowser # OPTIONAL: import a library we can use to open each page URL in a new browser tab

harpers_url = 'https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq='


for page in sorted(poetry):
    page_url = harpers_url + str(page)
    print(page_url)
    # webbrowser.open_new_tab(page_url) # OPTIONAL: this code will open each page URL in a new browser 
                                        # tab, for investigation


https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=79
https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=81
https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=82
https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=242
https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=307
https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=350
https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=416
https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=691
https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=884
https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=890
https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=895
https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=943
https://babel.hathitrust.org/cgi/pt?id=coo.31924054824473&view=1up&seq=955
https://babel.hathitrust.org

Our results look... not bad! There are some false positives in there, as there are pictures and ads in *Harper's* which can skew our method (though these features in the text are relatively unique when compared with most volumes), but we did find valid poetry, especially for pages that are not close to the end of the volume. These results also help to remind us that any general text analysis workflow will require customization based on knowledge of our target data--both what we hope to find and the pool of data in which we hope to find it!