# <center>Introduction to BookNLP</center>

<center>Dr. W.J.B. Mattingly</center>

<center>Smithsonian Data Science Lab and United States Holocaust Memorial Museum</center>

<center>March 2022</center>

## Key Concepts in this Notebook

1) What is BookNLP?<br>
2) How to Install BookNLP<br>

## About the Author

I am Dr. William Mattingly. I hold a PhD in Medieval History from the University of Kentucky where I explored early medieval social networks. A lot of my research was aided by my ability to code, specifically in Python, for data cleaning and analysis. I was even able to use Python to plot and visualize social networks and data. During the fourth year of my PhD, I used Python to create an app that I could use to plot and analyze these social networks. Currently, I am a Postdoctoral Fellow for the Analysis of Historical Documents at the Smithsonian Institution with a joint appointment at the United States Holocaust Memorial Museum. In both institutions, I use Python, machine learning, and natural language processing (NLP) to analyze historical texts in large quantities to generate new insights about the documents held in the archives. In all, I have nearly a decade of experience using Python as a historian.

When I first started to explore Python, there were not many available tutorials geared towards humanists and, for that reason, four years ago I started PythonHumanities.com and Python Tutorials for Digital Humanities on YouTube. I geared these resources to humanists who had no prior knowledge about computing or coding. This new JupyterBook is the third iteration of this textbook that brings a lot of the material that first appeared on PythonHumanities.com years ago into a new, more accessible JupyterBook. It will forever remain free to all as will the video lectures embedded in this book.

## About this Textbook

Because this textbook is not peer-reviewed, typos and errors may remain. I openly and freely admit to this. This textbook is community-inspired and I would like it to be community-supported. If you see a mistake, you can click the GitHub logo in the top right corner of the screen to submit a pull request or make a note for edits. I highly encourage this, and I welcome any criticism that improves this textbook for all.

```{image} ./images/install/github_logo.JPG
:alt: github_logo
:class: bg-primary
:width: 500px
:align: center
```

## What is BookNLP?

<a href="https://github.com/booknlp/booknlp">BookNLP</a> is a new Python library created by <a href="https://github.com/dbamman">David Bamman</a>. It began in 2014 as a Java library of the same name, <a href="https://github.com/dbamman/book-nlp">BookNLP</a>, by David Bamman, Ted Underwood, and Noah Smith (see David Bamman, Ted Underwood, and Noah Smith, "A Bayesian Mixed Effects Model of Literary Character," ACL 2014). While Java is a powerful language, in both speed and ease of use, few digital humanists code primarily in Java. I suspect (and I want to emphasize that I could be wrong) that the Python library was created to reach the larger Python-coding community, both in general and within the digital humanities specifically. This textbook deals strictly with the Python library.

In the documentation, Bamman states:

"BookNLP is a natural language processing pipeline that scales to books and other long documents (in English), including:

- Part-of-speech tagging
- Dependency parsing
- Entity recognition
- Character name clustering (e.g., "Tom", "Tom Sawyer", "Mr. Sawyer", "Thomas Sawyer" -> TOM_SAWYER) and coreference resolution
- Quotation speaker identification
- Supersense tagging (e.g., "animal", "artifact", "body", "cognition", etc.)
- Event tagging
- Referential gender inference (TOM_SAWYER -> he/him/his)"

Unlike its predecessor, the Java library, the Python library leverages the Python NLP library spaCy and the Python Transformers library from HuggingFace, rather than Stanford's tools, to perform many of these tasks. In the last few years, spaCy has proven itself a dominant force within the NLP community, outperforming many of its predecessors in accuracy and at scale.

BookNLP delivers in all of the areas listed above.

In [10]:
from booknlp.booknlp import BookNLP

In [26]:
model_params={
        "pipeline":"entity,quote,supersense,event,coref",
        "model":"big"}
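
Because the `pipeline` value is a single comma-separated string, a shorter list of stages can be requested when only some of the output is needed. The sketch below builds such a dictionary from a list of stage names. The helper `make_model_params` and the `KNOWN_STAGES` tuple are hypothetical conveniences, not part of BookNLP; the stage names are the ones used in the cell above, and the `"small"` model name is an assumption based on the BookNLP documentation.

```python
# Hypothetical helper (not part of BookNLP): build a model_params dictionary
# from a list of stage names instead of typing the comma-separated string.
KNOWN_STAGES = ("entity", "quote", "supersense", "event", "coref")


def make_model_params(stages, model="big"):
    """Return a dictionary in the same shape as model_params above."""
    unknown = [stage for stage in stages if stage not in KNOWN_STAGES]
    if unknown:
        raise ValueError(f"Unknown pipeline stage(s): {unknown}")
    return {"pipeline": ",".join(stages), "model": model}


# Assumption: "small" is an accepted model name; "big" is the one used above.
quick_params = make_model_params(["entity", "event"], model="small")
print(quick_params)
```

A dictionary built this way could be passed to `BookNLP("en", ...)` in place of `model_params`; the result is stored under a different name here so that it does not overwrite the full pipeline defined above.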

In [33]:
booknlp=BookNLP("en", model_params)

{'pipeline': 'entity,event', 'model': 'big'}
--- startup: 2.055 seconds ---


In [30]:
# Input file to process
input_file="harry_potter.txt"

# Output directory to store resulting files in
output_directory="harry_potter"

# Files within this directory will be named ${book_id}.entities, ${book_id}.tokens, etc.
book_id="harry_potter"
booknlp.process(input_file, output_directory, book_id)

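
Once `process` has finished, the output directory should hold several files whose names start with the `book_id`. A minimal sketch for checking what was written, assuming the `harry_potter` directory and `book_id` from the cell above; the helper `list_outputs` is hypothetical and uses only the standard library.

```python
from pathlib import Path


def list_outputs(output_directory, book_id):
    """Return sorted (file name, size in bytes) pairs for files named book_id.*"""
    folder = Path(output_directory)
    if not folder.is_dir():
        # Nothing has been written yet (or the path is wrong)
        return []
    return sorted(
        (path.name, path.stat().st_size)
        for path in folder.iterdir()
        if path.is_file() and path.name.startswith(book_id + ".")
    )


# Prints one line per output file; prints nothing if the directory is missing
for name, size in list_outputs("harry_potter", "harry_potter"):
    print(f"{size:>12,d}  {name}")
```

An empty result is a quick sign that processing did not complete or that the directory name does not match.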

In [5]:
import json
from collections import Counter

In [11]:
def proc(filename):
    with open(filename) as file:
        data=json.load(file)
    return data

In [12]:
data=proc("harry_potter/harry_potter.book")
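
Before looping over every character, it can help to see how a single entry is laid out. The sketch below condenses one character dictionary into a few numbers. The keys it reads (`id`, `count`, `agent`, `patient`, `poss`, `mod`) are the ones used in the rest of this notebook; the helper `summarize_character` and the `sample` record are made up for illustration so that the block runs without the real file.

```python
def summarize_character(character):
    """Return the id, the mention count, and the length of each dependency list."""
    summary = {"id": character.get("id"), "count": character.get("count")}
    for key in ("agent", "patient", "poss", "mod"):
        summary[key] = len(character.get(key, []))
    return summary


# Invented record that imitates the shape of one entry in data["characters"]
sample = {
    "id": 1,
    "count": 3,
    "agent": [{"w": "said", "i": 10}, {"w": "looked", "i": 42}],
    "patient": [{"w": "told", "i": 77}],
    "poss": [],
    "mod": [],
}
print(summarize_character(sample))
```

With the real file loaded, calling `summarize_character` on the first item of `data["characters"]` should give the same kind of overview, assuming the file follows the layout used in this notebook.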

In [8]:
def get_counter_from_dependency_list(dep_list):
    counter=Counter()
    for token in dep_list:
        term=token["w"]
        tokenGlobalIndex=token["i"]
        counter[term]+=1
    return counter
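
The function above tallies the `"w"` field of every token in a list. The same tally can be written as a single `Counter` expression, shown here on a small hand-made list. The name `count_terms` and the values in `sample_deps` are invented for illustration; only the `"w"` and `"i"` keys come from the function above.

```python
from collections import Counter


def count_terms(dep_list):
    """Tally the "w" (word) field of each token, like the function above."""
    return Counter(token["w"] for token in dep_list)


# Invented tokens in the same shape as the entries in a dependency list
sample_deps = [
    {"w": "said", "i": 12},
    {"w": "looked", "i": 48},
    {"w": "said", "i": 103},
]
print(count_terms(sample_deps).most_common(2))
# [('said', 2), ('looked', 1)]
```

`most_common(n)` returns the `n` most frequent terms in descending order of count, which is how the top ten words per category are selected when the characters are printed.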

In [21]:
for character in data["characters"]:
    agentList=character["agent"]
    patientList=character["patient"]
    possList=character["poss"]
    modList=character["mod"]
    
    character_id=character["id"]
    count=character["count"]

    # Default both values so the print below never hits an undefined or stale name
    referential_gender_distribution=referential_gender="unknown"

    if character["g"] is not None and character["g"] != "unknown":
        referential_gender_distribution=character["g"]["inference"]
        referential_gender=character["g"]["argmax"]

    mentions=character["mentions"]
    proper_mentions=mentions["proper"]
    max_proper_mention=""

    # just print out information about named characters
    if len(mentions["proper"]) > 0:
        max_proper_mention=mentions["proper"][0]["n"]
        
        print(character_id, count, max_proper_mention, referential_gender)

        print()
        printTop=10
        for k, v in get_counter_from_dependency_list(possList).most_common(printTop):
            print("\tposs\t%s %s" % (v,k))
        print()
        for k, v in get_counter_from_dependency_list(agentList).most_common(printTop):
            print("\tagent\t%s %s" % (v,k))        
        print()
        for k, v in get_counter_from_dependency_list(patientList).most_common(printTop):
            print("\tpatient\t%s %s" % (v,k))       
        print()
        for k, v in get_counter_from_dependency_list(modList).most_common(printTop):
            print("\tmod\t%s %s" % (v,k))    
        print()

95 2022 Harry he/him/his

	poss	18 head
	poss	15 eyes
	poss	12 cupboard
	poss	12 hand
	poss	11 life
	poss	10 parents
	poss	9 aunt
	poss	9 mind
	poss	8 heart
	poss	7 uncle

	agent	91 said
	agent	43 had
	agent	29 know
	agent	22 felt
	agent	21 got
	agent	21 saw
	agent	20 looked
	agent	20 going
	agent	17 heard
	agent	17 have

	patient	11 told
	patient	5 take
	patient	5 asked
	patient	4 kill
	patient	4 reminded
	patient	4 got
	patient	3 took
	patient	3 caught
	patient	3 tell
	patient	2 left

	mod	9 sure
	mod	5 able
	mod	3 glad
	mod	2 name
	mod	2 famous
	mod	2 surprised
	mod	2 baby
	mod	2 stupid
	mod	2 afraid
	mod	2 safe

282 1594 Hagrid he/him/his

	poss	13 head
	poss	11 face
	poss	8 feet
	poss	7 eyes
	poss	7 hand
	poss	6 coat
	poss	6 broom
	poss	5 parents
	poss	5 back
	poss	5 hut

	agent	166 said
	agent	24 looked
	agent	23 told
	agent	20 know
	agent	19 had
	agent	18 got
	agent	13 pulled
	agent	10 knew
	agent	10 think
	agent	10 see

	patient	12 told
	patient	7 tell
	patient	5 followed
	pati