Notebook for working with HathiTrust metadata from a search with the following criteria:

    - Title: liber libri libris libro OR All Fields: opus opera operibus OR Title: carmen carmina carminibus
    - Language: (Latin)
    - Original Format: (Book)

In [5]:
'''
author: Samuel J. Huskey
'''
# Import the necessary modules
import pandas as pd
import re

In [8]:
# Read in the tab-delimited file downloaded from Hathi and turn it into a dataframe
df = pd.read_csv('input/1908698974-1722799169.txt', sep='\t')
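
A side note on dtypes: the call above lets pandas infer a type for each column, so `ht_bib_key` comes back as `int64`. If any identifier in an export like this carries leading zeros, inference would silently drop them. The sketch below shows the difference on a tiny in-memory sample. The `local_id` column and its values are hypothetical, and passing `dtype=str` is only one possible choice, not what this notebook does.

```python
import io
import pandas as pd

# Hypothetical two-column sample standing in for a tab-delimited export
sample = "htid\tlocal_id\nabc.001\t0012345\n"

# Default behaviour: pandas infers an integer column and drops the zeros
inferred = pd.read_csv(io.StringIO(sample), sep="\t")

# Reading everything as text keeps identifiers exactly as written
as_text = pd.read_csv(io.StringIO(sample), sep="\t", dtype=str)

print(inferred["local_id"].iloc[0])
print(as_text["local_id"].iloc[0])
```

Reading every column as text means numeric columns such as `rights_date_used` would need an explicit conversion later, so this is a trade-off rather than a clear improvement.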

In [9]:
# Examine the basic structure of the file
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24799 entries, 0 to 24798
Data columns (total 28 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   htid                     24799 non-null  object 
 1   access                   24799 non-null  int64  
 2   rights                   24799 non-null  object 
 3   ht_bib_key               24799 non-null  int64  
 4   description              10074 non-null  object 
 5   source                   24799 non-null  object 
 6   source_bib_num           24729 non-null  object 
 7   oclc_num                 17400 non-null  object 
 8   isbn                     164 non-null    object 
 9   issn                     0 non-null      float64
 10  lccn                     3208 non-null   object 
 11  title                    24799 non-null  object 
 12  imprint                  24788 non-null  object 
 13  rights_reason_code       24799 non-null  object 
 14  rights_timestamp      

## Analysis of columns

In [11]:
# Set the display options to show all columns
pd.set_option('display.max_columns', None)
# Examine the first five rows
df.head()

Unnamed: 0,htid,access,rights,ht_bib_key,description,source,source_bib_num,oclc_num,isbn,issn,lccn,title,imprint,rights_reason_code,rights_timestamp,us_gov_doc_flag,rights_date_used,pub_place,lang,bib_fmt,collection_code,content_provider_code,responsible_entity_code,digitization_agent_code,access_profile_code,author,catalog_url,handle_url
0,aeu.ark:/13960/t25b10270,1,pd,100281057,,AEU,6264341,768320676,"0665476825,9780665476822",,,"Historiæ canadensis, seu Novæ-Franciæ libri de...",Apud Sebastianum Cramoisy et Sebast. Mabre-Cra...,bib,2014-09-17 03:25:33,0,1664,fr,lat,BK,AEU,ualberta,ualberta,ia,open,"Du Creux, François, 1596?-1666.",https://catalog.hathitrust.org/Record/100281057,https://hdl.handle.net/2027/aeu.ark:/13960/t25...
1,aeu.ark:/13960/t5q82gg1r,1,pd,100288370,,AEU,6279449,861561317,"0665616929,9780665616921",,,Ernesti Meyer de plantis labradoricis libri tres.,"Sumtibus Leopoldi Vossii, 1830.",bib,2014-09-17 03:26:35,0,1830,gw,lat,BK,AEU,ualberta,ualberta,ia,open,"Meyer, Ernst H. F. 1791-1858.",https://catalog.hathitrust.org/Record/100288370,https://hdl.handle.net/2027/aeu.ark:/13960/t5q...
2,aeu.ark:/13960/t6155m888,1,pd,100315300,,AEU,6374963,85791860,"0665940106,9780665940101",,,"Novus orbis, seu Descriptionis Indiae Occident...","Apud Elzevirios, 1633.",bib,2014-09-19 03:25:59,0,1633,ne,lat,BK,AEU,ualberta,ualberta,ia,open,"Laet, Joannes de, 1593-1649.",https://catalog.hathitrust.org/Record/100315300,https://hdl.handle.net/2027/aeu.ark:/13960/t61...
3,aeu.ark:/13960/t6tx4326r,1,pd,100266272,,AEU,4964437,719990409,"0665352107,9780665352102",,,C. Julii Cæsaris commentariorum De Bello Galli...,"Armour and Ramsay, 1849.",bib,2014-09-17 03:26:53,0,1849,quc,lat,BK,AEU,ualberta,ualberta,ia,open,"Caesar, Julius",https://catalog.hathitrust.org/Record/100266272,https://hdl.handle.net/2027/aeu.ark:/13960/t6t...
4,aeu.ark:/13960/t77s8mb8n,1,pd,100312296,,AEU,6368951,867440434,"066589693X,9780665896934",,,Collectanea latina seu ecclesiasticæ antiquita...,"[s.n.], 1853.",bib,2014-09-19 03:26:14,0,1853,onc,lat,BK,AEU,ualberta,ualberta,ia,open,,https://catalog.hathitrust.org/Record/100312296,https://hdl.handle.net/2027/aeu.ark:/13960/t77...


### Columns that could be jettisoned

- `access`: its value is always "1". 
- `rights`: is the value always "pd" (public domain)?
- `description`: are all values "NaN"?
- `issn`: are all values "NaN"?
- `lccn`: are all values "NaN"?
- `us_gov_doc_flag`: the values should be "0"
- `lang`: the search criteria specified "lat"

I'll check which columns have multiple or single values.

In [17]:
# Use nunique() to check the number of unique values in each column
unique_values = df.nunique()

# Identify columns with only one unique value
single_value_columns = unique_values[unique_values == 1].index.tolist()

# Print the results
print("Columns with multiple values:")
print(unique_values[unique_values > 1])
print("\nColumns with a single unique value:")
print(unique_values[unique_values == 1])

Columns with multiple values:
htid                       24799
rights                         4
ht_bib_key                 14917
description                 1831
source                        35
source_bib_num             16403
oclc_num                    9336
isbn                          41
lccn                        1095
title                      14482
imprint                    13111
rights_reason_code             7
rights_timestamp            9090
rights_date_used             508
pub_place                     97
collection_code               51
content_provider_code         35
responsible_entity_code       35
digitization_agent_code       16
access_profile_code            2
author                      6017
catalog_url                14917
handle_url                 24799
dtype: int64

Columns with a single unique value:
access             1
us_gov_doc_flag    1
lang               1
bib_fmt            1
dtype: int64
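
One thing worth noticing in this output is that `issn` appears in neither list. By default `nunique()` ignores missing values, so a column that is entirely `NaN` reports zero unique values and slips past both the `> 1` and `== 1` filters. The following is a minimal sketch on a toy frame (the values are made up for illustration) of how such columns could be caught separately with `isna().all()`.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values: one constant column,
# one entirely missing column, and one varied column
toy = pd.DataFrame({
    "access": [1, 1, 1],
    "issn": [np.nan, np.nan, np.nan],
    "title": ["a", "b", "c"],
})

# nunique() skips NaN by default, so the all-missing column counts as 0
counts = toy.nunique()
print(counts)

# Columns where every value is missing
all_missing = toy.columns[toy.isna().all()].tolist()

# Columns with exactly one distinct non-missing value
constant = counts[counts == 1].index.tolist()

print("All missing:", all_missing)
print("Constant:", constant)
```

Passing `dropna=False` to `nunique()` is another option: it counts `NaN` as a value, so an all-missing column would then report 1 instead of 0.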


I can safely jettison `access`, `us_gov_doc_flag`, `lang`, and `bib_fmt`.
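
If I wanted to drop those columns directly instead of selecting the ones to keep, the `single_value_columns` list computed above could be passed to `DataFrame.drop`. This is only a sketch on a toy frame with invented rows, not a step taken in this notebook.

```python
import pandas as pd

# Toy frame with hypothetical rows; "access" and "lang" never vary
toy = pd.DataFrame({
    "htid": ["x1", "x2"],
    "access": [1, 1],
    "lang": ["lat", "lat"],
    "title": ["Liber primus", "Liber secundus"],
})

# Same approach as in the cell above: find columns with one unique value
unique_values = toy.nunique()
single_value_columns = unique_values[unique_values == 1].index.tolist()

# drop() returns a new dataframe and leaves the original untouched
trimmed = toy.drop(columns=single_value_columns)
print(trimmed.columns.tolist())
```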

I'm not interested right now in `rights`, `rights_reason_code`, `rights_timestamp`, `rights_date_used`, `collection_code`, `content_provider_code`, `responsible_entity_code`, `digitization_agent_code`, or `access_profile_code`.

In fact, as long as I have one unique identifier to tie the records to the original dataframe, I can eliminate most of the columns so that I can focus on authors and titles. According to the counts above, `htid` and `handle_url` each have a unique value in every row (24,799 values for 24,799 rows). I'll use `handle_url` as the identifier.
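
Whether a column can serve as a row identifier can also be checked directly with `Series.is_unique`, which is stricter than comparing counts by eye. The sketch below runs on a toy frame. The identifiers in it are invented, and the column names are borrowed from this dataset only for readability.

```python
import pandas as pd

# Toy frame with hypothetical identifiers: two volumes share one bib record
toy = pd.DataFrame({
    "htid": ["aaa.001", "aaa.002", "bbb.001"],
    "ht_bib_key": [100, 100, 200],
    "handle_url": [
        "https://example.org/aaa.001",
        "https://example.org/aaa.002",
        "https://example.org/bbb.001",
    ],
})

# A column is a usable identifier only if no value repeats
candidates = [col for col in toy.columns if toy[col].is_unique]
print(candidates)
```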

I'll make a new dataframe with only the columns needed: `author`, `title`, `imprint`, `pub_place`, `rights_date_used` (a.k.a. publication date), and `handle_url`.

In [23]:
# Make a new dataframe with the required columns.
# .copy() makes it independent of df, so later edits don't trigger SettingWithCopyWarning.
hathidata = df[['author','title','imprint','pub_place','rights_date_used','handle_url']].copy()

In [24]:
# Inspect the first five records
hathidata.head()

Unnamed: 0,author,title,imprint,pub_place,rights_date_used,handle_url
0,"Du Creux, François, 1596?-1666.","Historiæ canadensis, seu Novæ-Franciæ libri de...",Apud Sebastianum Cramoisy et Sebast. Mabre-Cra...,fr,1664,https://hdl.handle.net/2027/aeu.ark:/13960/t25...
1,"Meyer, Ernst H. F. 1791-1858.",Ernesti Meyer de plantis labradoricis libri tres.,"Sumtibus Leopoldi Vossii, 1830.",gw,1830,https://hdl.handle.net/2027/aeu.ark:/13960/t5q...
2,"Laet, Joannes de, 1593-1649.","Novus orbis, seu Descriptionis Indiae Occident...","Apud Elzevirios, 1633.",ne,1633,https://hdl.handle.net/2027/aeu.ark:/13960/t61...
3,"Caesar, Julius",C. Julii Cæsaris commentariorum De Bello Galli...,"Armour and Ramsay, 1849.",quc,1849,https://hdl.handle.net/2027/aeu.ark:/13960/t6t...
4,,Collectanea latina seu ecclesiasticæ antiquita...,"[s.n.], 1853.",onc,1853,https://hdl.handle.net/2027/aeu.ark:/13960/t77...
