In [1]:
import pandas as pd

# Exploring Hathi Data

I'm using this notebook to investigate the title strings in the data saved from HathiTrust.

In [None]:
df = pd.read_csv('../../data/1908698974-1722799169.txt', sep='\t')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24799 entries, 0 to 24798
Data columns (total 28 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   htid                     24799 non-null  object 
 1   access                   24799 non-null  int64  
 2   rights                   24799 non-null  object 
 3   ht_bib_key               24799 non-null  int64  
 4   description              10074 non-null  object 
 5   source                   24799 non-null  object 
 6   source_bib_num           24729 non-null  object 
 7   oclc_num                 17400 non-null  object 
 8   isbn                     164 non-null    object 
 9   issn                     0 non-null      float64
 10  lccn                     3208 non-null   object 
 11  title                    24799 non-null  object 
 12  imprint                  24788 non-null  object 
 13  rights_reason_code       24799 non-null  object 
 14  rights_timestamp      

In [4]:
df['title']

0        Historiæ canadensis, seu Novæ-Franciæ libri de...
1        Ernesti Meyer de plantis labradoricis libri tres.
2        Novus orbis, seu Descriptionis Indiae Occident...
3        C. Julii Cæsaris commentariorum De Bello Galli...
4        Collectanea latina seu ecclesiasticæ antiquita...
                               ...                        
24794    Prolegomena in Juliani imperatoris libros quib...
24795    Cvlex carmen Vergilio ascriptvm; recensvit et ...
24796    A. Persii flacci Satirarum liber Ex recensione...
24797    Liber decem quaestionum contra Christianos auc...
24798    Imperatoris Iustiniani Institutionum libri qua...
Name: title, Length: 24799, dtype: object

In [6]:
from collections import Counter
import re

# Lowercase each title and strip punctuation (anything that is not a word character or whitespace)
df['cleaned_text'] = df['title'].apply(lambda x: re.sub(r'[^\w\s]', '', str(x).lower()))
# Pool the cleaned titles into one list of words and count occurrences
all_words = ' '.join(df['cleaned_text']).split()
word_counts = Counter(all_words)
most_common_words = word_counts.most_common(10)  # Change 10 to get a different number of top words
print(most_common_words)

[('et', 14661), ('de', 12087), ('libri', 9766), ('in', 7988), ('opera', 7526), ('ad', 5115), ('cum', 4445), ('ex', 4139), ('omnia', 3939), ('liber', 3262)]


In [None]:
# What is the longest title string, and how long is it?
title_list = df['title'].to_list()  # defined here so this cell does not rely on a later one
longest_string = max(title_list, key=len)
print(longest_string)
print(len(longest_string))

Aristotelis omnia quae extant opera : selectis translationibus, collatisq́ ; cum Graecis emendatissimis, ac vetustissimis exemplaribus, illustrata, prȩstantissimorumq́; aetatis nostrȩ philosophorum industria diligentissime recognita. Averrois Cordvbensis in ea opera omnes, qui ad haec vsq; tempora peruenere, commentarij. Nonnulli etiam ipsius in logica, philosophia, & medicina libri, cum Leui Gersonidis in libros logicos annotationibus, quorum plurimi sunt, à Iacob Mantino, in Latinum conuersi. Graecorum, Arabum & Latinorum lucubrationes quaedam, ad hoc opus pertinentes. Marciantonii Zimarae philosophi, in Aristotelis, & Auerrois dicta in philosophia contradictionum solutiones, proprijs locis annexae. Bernardini Tomitani logici, atqve philosophi prȩstantissimi, in Arist. & Auer. dicta in primo libro poster. resolut. contradictionum solutiones: nec non eiusdem libri locorum, qui obscuriores habentur conuersiones, & animaduersiones. in Auer. quaesita demonstratiua, argumenta, & magno

In [None]:
# What is the shortest one?
shortest_string = min(title_list, key=len)
print(shortest_string)
print(len(shortest_string))

Opera.
6


In [None]:
# What is the average length?
average_length = sum(len(s) for s in title_list) / len(title_list)
print(average_length)

153.1196419210452
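
The mean alone can hide the spread between a 6-character title and the longest one, so it may help to look at the minimum, median, mean, and maximum together. A minimal sketch using only the standard library; `sample_titles` is a hypothetical stand-in for `df['title'].to_list()`, built from three titles shown in this notebook:

```python
import statistics

# Hypothetical sample; in the notebook this would be df['title'].to_list().
sample_titles = [
    'Opera.',
    'Ernesti Meyer de plantis labradoricis libri tres.',
    'C. Julii Cæsaris commentariorum De Bello Gallico libri IV / from the text of Herzog, carefully revised.',
]

# Character length of every title.
lengths = [len(title) for title in sample_titles]

# Collect the summary statistics in one place.
summary = {
    'min': min(lengths),
    'median': statistics.median(lengths),
    'mean': statistics.mean(lengths),
    'max': max(lengths),
}
print(summary)
```

If the median came out well below the mean on the full data, that would suggest a small number of very long titles are pulling the average up.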


In [7]:
title_list = df['title'].to_list()
for title in title_list:
    print(title)

Historiæ canadensis, seu Novæ-Franciæ libri decem, ad annum usque Christi MDCLVI auctore P. Francisco Creuxio ..
Ernesti Meyer de plantis labradoricis libri tres.
Novus orbis, seu Descriptionis Indiae Occidentalis libri XVIII authore Ioanne de Laet Antverp ; novis tabulis geographicis et variis animantium, plantarum fructuumque iconibus illustrati ...
C. Julii Cæsaris commentariorum De Bello Gallico libri IV / from the text of Herzog, carefully revised.
Collectanea latina seu ecclesiasticæ antiquitatis monumenta eximia : ex patrum operibus, in usum classis theologicoe, excerpta ...
R.P. Hieremiae Drexelii e Societate Jesu Opera omnia : in XXVIII tractatibus distributa & figuris aeneis adornata.
Athanasii Kircheri e Soc. Jesu mundus subterraneus, in XII libros digestus : quo divinum subterrestris mundi opificium, mira ergasteriorum naturae in eo distributio, verbo pantámorphou Protei regnum, universae denique naturae majestas & divitiae summa rerum varietate exponuntur : abditorum effe

It looks like the following patterns are prevalent:

- Title words ([:/...]) subtitle or further description
- Author ... title
- Title ...
- Title.
- Title, subtitle
- Title / Author ; publishing information

I'm going to see what happens if I split the title strings on `:`, `/`, `;`, or `...`. I'll use `re.split` with the pattern `r'[:/;]|\.{3}'`, which matches a single `:`, `/`, or `;`, or a run of three periods.

In [17]:
# Split each title on `:`, `/`, `;`, or three consecutive periods
df['title_list'] = df['title'].apply(lambda x: re.split(r'[:/;]|\.{3}', x))

In [21]:
split_titles = df['title_list'].to_list()
for title in split_titles:
    for item in title:
        print(item)

Historiæ canadensis, seu Novæ-Franciæ libri decem, ad annum usque Christi MDCLVI auctore P. Francisco Creuxio ..
Ernesti Meyer de plantis labradoricis libri tres.
Novus orbis, seu Descriptionis Indiae Occidentalis libri XVIII authore Ioanne de Laet Antverp 
 novis tabulis geographicis et variis animantium, plantarum fructuumque iconibus illustrati 

C. Julii Cæsaris commentariorum De Bello Gallico libri IV 
 from the text of Herzog, carefully revised.
Collectanea latina seu ecclesiasticæ antiquitatis monumenta eximia 
 ex patrum operibus, in usum classis theologicoe, excerpta 

R.P. Hieremiae Drexelii e Societate Jesu Opera omnia 
 in XXVIII tractatibus distributa & figuris aeneis adornata.
Athanasii Kircheri e Soc. Jesu mundus subterraneus, in XII libros digestus 
 quo divinum subterrestris mundi opificium, mira ergasteriorum naturae in eo distributio, verbo pantámorphou Protei regnum, universae denique naturae majestas & divitiae summa rerum varietate exponuntur 
 abditorum effectuu
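
The split pieces printed above keep their leading spaces, and a title that ends in `...` produces an empty final piece. A possible cleanup step, sketched with a hypothetical helper `split_title` that strips whitespace and drops empty pieces:

```python
import re

# Same pattern as in the cell above: ':', '/', ';', or three consecutive periods.
SPLIT_PATTERN = re.compile(r'[:/;]|\.{3}')

def split_title(title):
    """Split a title on the delimiters, then strip whitespace and drop empty pieces."""
    parts = SPLIT_PATTERN.split(str(title))
    return [part.strip() for part in parts if part.strip()]

# Sample title copied from the output above.
sample = 'C. Julii Cæsaris commentariorum De Bello Gallico libri IV / from the text of Herzog, carefully revised.'
print(split_title(sample))
```

In the notebook this could be applied with `df['title'].apply(split_title)` in place of the lambda, assuming the cleaned pieces are what is wanted downstream.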

In [22]:
authors = df['author'].to_list()
for author in authors:
    print(author)

Du Creux, François, 1596?-1666.
Meyer, Ernst H. F. 1791-1858.
Laet, Joannes de, 1593-1649.
Caesar, Julius
nan
Drexel, Jeremias, 1581-1638,
Kircher, Athanasius, 1602-1680
Drexel, Jeremias, 1581-1638,
Drexel, Jeremias, 1581-1638,
Drexel, Jeremias, 1581-1638,
Drexel, Jeremias, 1581-1638,
Hincmar, Archbishop of Reims, approximately 806-882
Acosta, José de, 1540-1600,
Hincmar, Archbishop of Reims, approximately 806-882
Lessius, Leonardus, 1554-1623
Hincmar, Archbishop of Reims, approximately 806-882
Drexel, Jeremias, 1581-1638,
Drexel, Jeremias, 1581-1638,
Riccioli, Giovanni Battista, 1598-1671,
Drexel, Jeremias, 1581-1638,
Drexel, Jeremias, 1581-1638,
Drexel, Jeremias, 1581-1638,
Drexel, Jeremias, 1581-1638,
Guazzo, Francesco Maria,
Drexel, Jeremias, 1581-1638,
Drexel, Jeremias, 1581-1638,
Drexel, Jeremias, 1581-1638,
Drexel, Jeremias, 1581-1638,
Kircher, Athanasius, 1602-1680.
Drexel, Jeremias, 1581-1638,
Mersenne, Marin, 1588-1648,
Hincmar, Archbishop of Reims, approximately 806-882
Dre