# Research Exercise #9: Finding place names in *The Adventures of Sherlock Holmes* with Named Entity Recognition

This week, we've been looking at projects that develop models to test out hypotheses about literary works.  

We're going to learn how to use a technique called Named Entity Recognition, and we're going to compare two different statistical models: the default model in a Python library for natural language processing called `spaCy`, and a model developed for "book-length documents" called `BookNLP`. We'll also think a bit about what each of these models defines as a place name.

Below, find the table of contents, including the questions for you to complete:

**Table of Contents:**

- [What is Named Entity Recognition?](#What-is-Named-Entity-Recognition?)
    - [How NER works](#How-NER-works:)
- [Method 1: Using spaCy's training model for named entity recognition](#Method-1:-Using-spaCy's-training-model-for-named-entity-recognition)
    - [spaCy's model and data](#spaCy's-model-&-data)
        - [Check-in](#💡-Check-in-💡)
    - [Getting spaCy set up](#Getting-spaCy-set-up)
    - [Running spaCy](#Running-spaCy)
        - [What entities can spaCy's 'Named Entity Recognition' model detect?](#What-entities-can-spaCy's-'Named-Entity-Recognition'-model-detect?)
    - [Defining a filepath and creating a spaCy document](#Defining-a-filepath-and-creating-a-spaCy-document)
    - [Output tagged entities in our text](#Output-tagged-entities-in-our-text)
    - [Visualizing the tagged named entities](#Visualizing-the-tagged-named-entities)
    - [Output an HTML file with our tagged text](#Output-an-HTML-file-with-our-tagged-text)
    - [Generating a list of place names from our spaCy document](#Generate-a-list-of-place-names-from-our-spaCy-document)
- [Method 2: Using BookNLP for named entity recognition](#Method-2:-Using-BookNLP-for-named-entity-recognition)
    - [BookNLP's model and data](#BookNLP's-model-&-data)
        - [Reflection:](#💡-Reflection-💡-:)
        - [What entities can BookNLP's 'Named Entity Recognition' model detect?](#What-entities-can-BookNLP's-'Named-Entity-Recognition'-model-detect?)
    - [Install BookNLP](#Install-BookNLP)
    - [Import BookNLP, define parameters and filepaths](#Import-BookNLP,-define-parameters-and-filepaths)
    - [Run BookNLP](#Run-BookNLP)
    - [Read in the BookNLP output as a dataframe](#Read-in-the-BookNLP-output-files-as-a-dataframe)
    - [Generating a list of place names from our BookNLP dataframe](#Generating-a-list-of-place-names-from-our--BookNLP-dataframe)
- [Your turn!](#Your-turn!)
    - [Question 1](#Question-1:)
    - [Question 2](#Question-2:)
    - [Question 3](#Question-3:)
    - [Question 4](#Question-4:)
    - [Question 5](#Question-5:)

## What is Named Entity Recognition?

"Named Entity Recognition" (NER) is a form of **natural language processing (NLP)**, that is, the use of computational methods to study "natural language," or language that has evolved through human use.

NER is one technique used to extract information from a set of texts using machine learning. Rather than looking for particular words (like all appearances of the word "London" in Arthur Conan Doyle's Sherlock Holmes stories), this technique relies on a model that has been trained to recognize whole categories of entities, such as people, places, and dates, wherever they appear in a text.

We've encountered named entity recognition (NER) before: Elson, Dames, and McKeown, in ["Extracting Social Networks from Literary Fiction,"](https://princeton.instructure.com/courses/6331/files?preview=1351449) use NER to extract character names (see pages 141-142). Matthew Wilkens, in ["The Geographic Imagination of Civil-War-Era American Fiction,"](https://princeton.instructure.com/courses/6331/files?preview=1351450) uses NER to extract city, state, and country names from his corpus of American fiction (see page 833). Both projects use the [Stanford NER](https://nlp.stanford.edu/software/CRF-NER.html), a named entity recognizer developed by Jenny Finkel and trained on a corpus of tagged [Reuters](https://www.clips.uantwerpen.be/conll2003/ner/) and [Wall Street Journal](https://catalog.ldc.upenn.edu/LDC2003T13) articles (for more on how the model was trained, see the ["Models" section of the Stanford NER page](https://nlp.stanford.edu/software/CRF-NER.html)).

For this week, we're going to see what a natural language processing algorithm identifies as a place. We're going to use the [named entity recognition model](https://spacy.io/usage/linguistic-features#named-entities) from the [`spaCy` library](https://spacy.io/usage/spacy-101/#whats-spacy).

### How NER works: 

From `spaCy`'s [description of the pipeline used to train a model:](https://spacy.io/usage/training) 

![image](../_images/spacy-pipeline.png)

+ Named entity recognition is a particular kind of supervised machine learning, which starts by deciding on a fixed set of categories for entities. These are the **labels**.
+ Then, researchers assemble a set of texts (usually at least a few hundred). This is the **training dataset**. They then go through these texts, carefully labeling (or "tagging") words within each text that fit the categories of named entities (e.g., persons, places, dates).
+ A statistical **machine learning algorithm** is trained on this dataset, learning to recognize entities by predicting them from examples of text and the labeled examples in the training data. This training process is iterative: the algorithm examines the labels and their frequencies within the corpus, predicts and assigns weights to the probability that a given piece of text fits a label, and corrects those weights as it develops a model that can be generalized.
+ The statistical **model** produced by the algorithm is then run on a new dataset (an "evaluation dataset") that is different from the training dataset. To evaluate how well the model performs, a sample of the evaluation data is checked by hand, with humans verifying how well the model tagged entities correctly and recording how much the model missed.
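
The evaluation step in the last bullet can be made a little more concrete. The snippet below is only a sketch with invented data, not how `spaCy` or any particular library evaluates its models. It compares hypothetical hand-checked entities (`gold`) with hypothetical model output (`predicted`) for one sentence and computes two commonly used summary measures: precision (how many of the model's tags were right) and recall (how many of the hand-checked entities the model found).

```python
# Invented example sentence: "Holmes walked down Baker Street in London."
# Entities are stored as (start_character, end_character, label) triples.

# Hypothetical hand-checked ("gold") annotations.
gold = {(0, 6, "PERSON"), (19, 31, "FAC"), (35, 41, "GPE")}

# Hypothetical model output: "Holmes" is mislabeled, and "walked" is a false alarm.
predicted = {(0, 6, "GPE"), (19, 31, "FAC"), (35, 41, "GPE"), (7, 13, "ORG")}

# An entity counts as correct only if both its span and its label match.
correct = gold & predicted

precision = len(correct) / len(predicted)  # share of the model's tags that were right
recall = len(correct) / len(gold)          # share of gold entities the model found

print(f"correct: {len(correct)}, precision: {precision:.2f}, recall: {recall:.2f}")
```

In this made-up case the model gets two of its four tags right and finds two of the three hand-checked entities.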


A **machine learning algorithm** is a procedure that "learns," or updates itself, based on a sample "training dataset".

A **label** (or "tag") is an annotation added by a human to a dataset. For instance, in a training dataset for NER, a human might annotate with tags around all words that they classify as PERSONS or DATES in a text.
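
To picture what a label looks like as data, here is a minimal sketch. The sentence and its annotations are invented for illustration, and the (start, end, category) layout is just one plausible way of recording a tag; real training datasets use their own file formats.

```python
# An invented sentence with hand-made annotations (not from a real training set).
text = "Holmes walked down Baker Street in London."

# Each label records where an entity starts and ends (as character offsets)
# and which category the annotator assigned to it.
labels = [
    (0, 6, "PERSON"),
    (19, 31, "FAC"),
    (35, 41, "GPE"),
]

# Slicing the text with the offsets recovers the tagged words.
for start, end, label in labels:
    print(text[start:end], "->", label)
```

Storing offsets rather than just the words means the same string (say, "Holmes") can carry different labels in different places.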

A **training dataset** is a large (at least 100 documents) set of texts used to train a statistical model to recognize and label parts of a text.

An **evaluation dataset** is a dataset, different from the training dataset, used to evaluate how well the model performs at the task (be it recognizing named entities or recognizing parts of speech in a text).

A statistical **model** is a mathematical representation of assumptions about patterns in observed data: it takes an input and produces an output. A model is generated by a learning algorithm and can then be used to predict patterns (like classifications) in a dataset.

A **named entity** is a span of text that names something, such as a person, place, date, or currency. Here's [how to view the named entity categories in the basic spaCy model](#What-entities-can-spaCy's-'Named-Entity-Recognition'-model-detect?).

## Method 1: Using `spaCy`'s training model for named entity recognition

### `spaCy`'s model & data


Within spaCy, we're going to use the [EntityRecognizer algorithm](https://spacy.io/api/entityrecognizer). Named entities are one of many different kinds of tags that spaCy can apply to a text (it can also tag parts of speech or syntactic dependencies).

[`spaCy`'s model](https://spacy.io/usage/training) is trained on a dataset consisting of a corpus (that is, a collection of documents) called [**OntoNotes 5.0**](https://catalog.ldc.upenn.edu/LDC2013T19). Click here to read a little more about what's in this training dataset: https://catalog.ldc.upenn.edu/LDC2013T19



> ### 💡 Check-in 💡
Take a look at the training dataset for the default `spaCy` model. What do you notice? What might we need to keep in mind in using this model? What if we wanted to use this model to detect named entities in a literary work, say, a collection of short stories like *The Adventures of Sherlock Holmes*?


**[TYPE YOUR RESPONSE HERE]**

### Getting `spaCy` set up

In [None]:
## Let's install spaCy
!pip install -U spacy

### Importing `spaCy`

In [3]:
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400

### Downloading & loading the trained `spaCy` model for English
Here we're going to download the trained pipelines and weights for `spaCy`'s English language model, `en_core_web_sm`. `spaCy` has [models for a wide (though by no means comprehensive) set of languages, including Japanese, Spanish, Catalan, among many others](https://spacy.io/usage/models#languages).

For more about a Princeton CDH project building models for languages without a `spacy` model, see here: https://newnlp.princeton.edu/about/

In [None]:
# Download the model for English
!python -m spacy download en_core_web_sm

In [5]:
# Import and load the model for English 
import en_core_web_sm
nlp = en_core_web_sm.load()

### Running `spaCy`

#### What entities can `spaCy`'s 'Named Entity Recognition' model detect?

Some of the most common are `PERSON` (proper names of people), `GPE` (geo-political entities, which are defined as countries, cities, and states), `TIME` (temporal information), `LOC` (non-geopolitical locations, e.g., regions, bodies of water, mountains), `FAC` (built facilities, like streets, roads, and bridges), and `ORG` (organizations or corporate entities).

Let's look at what other default labels come with spaCy:

In [39]:
# Let's define a variable "list_of_ner_labels" and set that equal to the NLP pipe labels in spaCy's 'NER' algorithm
list_of_ner_labels = nlp.pipe_labels['ner']


# Print out the number of labels
print('Number of labels in the default spaCy Named Entity Recognizer:\n', len(list_of_ner_labels))
# Print out the names of the different labels
print('Available labels in spaCy default Named Entity Recognizer:\n', list_of_ner_labels)

Number of labels in the default spaCy Named Entity Recognizer:
 18
Available labels in spaCy default Named Entity Recognizer:
 ['CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART']


### Defining a filepath and running `spaCy` on our file

Below, we define our file paths and create a `spaCy` document by running `nlp()` on our text file. 

In [97]:
filepath = "../_datasets/texts/literature/Arthur-Conan-Doyle-The-Adventures-of-Sherlock-Holmes.txt"
# Open the file, read its full contents into a string, and close it again
with open(filepath, encoding='utf-8') as f:
    text = f.read()
spacy_document = nlp(text)

### Output tagged entities in our text 
Once we have our `spaCy` document, we can access the tagged text using `spacy_document.ents`. This `spacy_document.ents` stores information about every entity: **the string text of that entity**, the **starting character of the text string for that entity**, the **ending character of that entity**, and the **label** assigned to that entity (e.g., PERSON, GPE, DATE, TIME).

We can query it to retrieve information about our tagged text, such as the text string (`ent.text`), the starting (`ent.start_char`) and ending (`ent.end_char`) characters of the tagged text, and the label that the entity has been tagged with (`ent.label_`).

Let's use a for loop to iterate over all the entities that the spaCy algorithm tagged:

In [73]:
print("entity | entity_start_character | entity_end_character | entity_label\n")
for ent in spacy_document.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

entity | entity_start_character | entity_end_character | entity_label

The Adventures of Sherlock Holmes 1 34 WORK_OF_ART
Arthur Conan Doyle 39 57 ORG
Bohemia 93 100 GPE
The Red-Headed League 111 132 ORG
Identity 153 161 ORG
The Boscombe Valley Mystery 172 199 FAC
Five 214 218 CARDINAL
The Man 241 248 WORK_OF_ART
The Adventure of the Blue Carbuncle
   VIII 280 323 WORK_OF_ART
The Adventure of the Speckled Band 326 360 WORK_OF_ART
The Adventure of the Engineer’s Thumb 371 408 WORK_OF_ART
X.      412 419 PERSON
The Adventure of the Noble Bachelor
   XI 419 460 WORK_OF_ART
The Adventure of the Beryl Coronet 465 499 WORK_OF_ART
XII 503 506 ORG
The Adventure of the Copper Beeches




I. A SCANDAL IN BOHEMIA 510 573 WORK_OF_ART
Irene Adler 801 812 PERSON
one 837 840 CARDINAL
Irene 1679 1684 PERSON
Holmes 1750 1756 GPE
first 1905 1910 ORDINAL
Holmes 2008 2014 ORG
Bohemian 2065 2073 NORP
Baker Street 2108 2120 FAC
from week to week 2166 2183 DATE
Odessa 2627 2633 PERSON
Trepoff 2653 2660 PERSO

Saturday 72167 72175 DATE
Monday 72196 72202 DATE
Well, Watson 72235 72247 WORK_OF_ART
Holmes 72255 72261 PERSON
Holmes 72422 72428 PERSON
three 72753 72758 CARDINAL
fifty minutes 72814 72827 TIME
the St. James’s Hall 73273 73293 ORG
this afternoon 73294 73308 TIME
Watson 73344 73350 PERSON
a few
hours 73386 73397 TIME
first 73530 73535 ORDINAL
German 73615 73621 NORP
Italian 73684 73691 NORP
French 73695 73701 NORP
Underground 73784 73795 FAC
Aldersgate 73806 73816 ORG
four 73990 73994 CARDINAL
two 74010 74013 CARDINAL
laurel bushes 74134 74147 ORG
Three 74216 74221 CARDINAL
two 74739 74742 CARDINAL
three 74746 74751 CARDINAL
Holmes 74912 74918 PERSON
Strand 74980 74986 ORG
Third 74991 74996 ORDINAL
fourth 75004 75010 ORDINAL
Holmes 75101 75107 PERSON
fourth 75155 75161 ORDINAL
London 75178 75184 GPE
third 75245 75250 ORDINAL
Wilson 75318 75324 PERSON
Saxe-Coburg Square 75743 75761 ORG
Saxe-Coburg Square 75899 75917 ORG
Holmes 76490 76496 PERSON
London 76664 76670 GPE
Coburg 76741 767

Herefordshire 146104 146117 PERSON
James McCarthy 146378 146392 PERSON
three days 146501 146511 DATE
the morning of last Monday 146556 146582 TIME
Ross 146709 146713 PERSON
John
Cobb 146719 146728 PERSON
William Crowder 147118 147133 PERSON
about a hundred yards 147296 147317 QUANTITY
Hatherley Farm 147771 147785 ORG
more than 150 yards 147802 147821 QUANTITY
some minutes 148082 148094 TIME
Turner 148124 148130 PERSON
The Coroner: Did your father 148459 148487 WORK_OF_ART
The Coroner: 148614 148626 WORK_OF_ART
The Coroner: 148736 148748 WORK_OF_ART
The Coroner: I am afraid 148867 148891 WORK_OF_ART
The Coroner: That 149051 149068 WORK_OF_ART
The Coroner: 149268 149280 WORK_OF_ART
The Coroner: How 149388 149404 WORK_OF_ART
Bristol 149508 149515 ORG
A Juryman: Did 149575 149589 WORK_OF_ART
The Coroner: 149743 149755 WORK_OF_ART
A dozen yards 150348 150361 QUANTITY
a dozen yards 150488 150501 QUANTITY
McCarthy 150718 150726 PERSON
Holmes 151029 151035 PERSON
Swindon 151875 151882 ORG
twen

Lone Star 224760 224769 PRODUCT
Atlantic 224860 224868 LOC
VI 225063 225065 ORG
Isa Whitney 225098 225109 PERSON
Elias Whitney 225131 225144 PERSON
D.D. 225146 225150 ORG
De Quincey’s 225340 225352 PERSON
many years
 225583 225594 DATE
One night 225842 225851 TIME
June 225862 225866 DATE
the
hour 225908 225916 TIME
first 225938 225943 ORDINAL
a weary day 226180 226191 DATE
Kate Whitney 226682 226694 PERSON
Kate 226717 226721 PERSON
James 227081 227086 PERSON
Isa 227167 227170 ORG
two days 227197 227205 DATE
first 227254 227259 ORDINAL
Hitherto 227695 227703 PERSON
one
day 227743 227750 DATE
eight 227851 227856 CARDINAL
the Bar of Gold 228034 228049 ORG
Upper Swandam Lane 228054 228072 FAC
one 228288 228291 CARDINAL
second 228360 228366 ORDINAL
Isa Whitney’s 228410 228423 ORG
two hours 228590 228599 TIME
ten minutes 228667 228678 TIME
first 228930 228935 ORDINAL
Upper Swandam Lane 228959 228977 PERSON
London Bridge 229080 229093 FAC
three 230382 230387 CARDINAL
two 230470 230473 CARDINA

Tottenham Court Road 289237 289257 ORG
Watson 289281 289287 PERSON
Henry Baker 289459 289470 PERSON
first 289744 289749 ORDINAL
all the evening 289800 289815 TIME
Goodge Street 289979 289992 FAC
Henry Baker 290028 290039 PERSON
6:30 this evening 290073 290090 TIME
Baker 290100 290105 PERSON
Peterson 290357 290365 PERSON
Peterson 290641 290649 PERSON
evening 290711 290718 TIME
Globe 290760 290765 ORG
Pall Mall 290777 290786 PERSON
Evening News 290813 290825 ORG
Peterson 290976 290984 PERSON
Holmes 291186 291192 PERSON
twenty years old 291504 291520 DATE
the Amoy River 291551 291565 LOC
China 291578 291583 GPE
two 291776 291779 CARDINAL
forty 291877 291882 CARDINAL
Countess 292070 292078 ORG
Horner 292133 292139 PERSON
Henry Baker 292222 292233 PERSON
Henry Baker 292312 292323 PERSON
seven 292846 292851 CARDINAL
Hudson 292956 292962 PERSON
a little after 293027 293041 DATE
half-past six 293042 293055 DATE
Baker Street 293079 293091 FAC
Scotch 293151 293157 NORP
Holmes 293360 293366 GPE
H

Paddington 372753 372763 GPE
Victor Hatherley 372960 372976 PERSON
16A 372998 373001 DATE
Victoria Street 373003 373018 PERSON
3rd 373020 373023 ORDINAL
morning 373075 373082 TIME
four 374238 374242 CARDINAL
one 374634 374637 CARDINAL
Sherlock Holmes 376128 376143 PERSON
five minutes 376819 376831 TIME
Baker Street 376889 376901 FAC
Times 377027 377032 ORG
the day before 377145 377159 DATE
Pray 377588 377592 WORK_OF_ART
Holmes 378013 378019 PERSON
London 378329 378335 GPE
seven
years 378439 378450 DATE
Venner & Matheson 378477 378494 ORG
Greenwich 378520 378529 GPE
Two years ago 378531 378544 DATE
Victoria Street 378726 378741 FAC
first 378779 378784 ORDINAL
two
years 378879 378888 DATE
three 378900 378905 CARDINAL
one 378924 378927 CARDINAL
27 10_s 379030 379037 DATE
Every day 379040 379049 DATE
nine 379056 379060 CARDINAL
four 379082 379086 CARDINAL
Yesterday 379240 379249 DATE
nearer forty than
thirty 379985 380009 DATE
Hatherley 380018 380027 PERSON
German 380059 380065 NORP
Hather

one 464410 464413 CARDINAL
morocco 465145 465152 GPE
the Beryl Coronet 465223 465240 FAC
One 465246 465249 CARDINAL
I. 465311 465313 ORG
thirty-nine 465471 465482 CARDINAL
four days 466041 466050 DATE
Holder 466158 466164 PERSON
Monday 466794 466800 DATE
morning 466801 466808 TIME
fifty £ 1000 466927 466939 MONEY
evening 467467 467474 TIME
the next few days 467750 467767 DATE
Streatham 467943 467952 ORG
Holmes 468127 468133 PERSON
three 468275 468280 CARDINAL
Lucy Parr 468401 468410 PERSON
second 468416 468422 ORDINAL
a few months 468465 468477 DATE
Arthur 468917 468923 PERSON
Holmes 468965 468971 PERSON
George Burnwell 470112 470127 PERSON
George Burnwell 470221 470236 PERSON
Arthur 470418 470424 PERSON
Mary 470823 470827 PERSON
five years ago 470964 470978 DATE
only one 471294 471302 CARDINAL
Holmes 471655 471661 PERSON
night 471807 471812 TIME
Arthur 471834 471840 GPE
Mary 471845 471849 PERSON
Lucy Parr 471963 471972 PERSON
Mary 472081 472085 PERSON
Arthur 472090 472096 PERSON
Arthu

For an alternate method, we can create a list that contains pairs of entity text and the tag applied to that text:

In [79]:
# Let's create a list of pairs (the_text_of_the_entity, the_tag_applied_to_that_entity) for all of the entities
list_of_named_entities_data=[(ent, ent.label_) for ent in spacy_document.ents]

# Let's look at just the first 10 entries in our list
print("List of the first 10 tagged entities:\n")
list_of_named_entities_data[:10]

List of the first 10 tagged entities:



[(The Adventures of Sherlock Holmes, 'WORK_OF_ART'),
 (Arthur Conan Doyle, 'ORG'),
 (Bohemia, 'GPE'),
 (The Red-Headed League, 'ORG'),
 (Identity, 'ORG'),
 (The Boscombe Valley Mystery, 'FAC'),
 (Five, 'CARDINAL'),
 (The Man, 'WORK_OF_ART'),
 (The Adventure of the Blue Carbuncle
     VIII,
  'WORK_OF_ART'),
 (The Adventure of the Speckled Band, 'WORK_OF_ART')]
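
A list of pairs like this is easy to summarize with `Counter`, which we imported earlier. The sketch below is an illustration rather than part of the original workflow: it copies the ten pairs printed above as plain strings (the variable name `sample_pairs` is invented here) and tallies how often each label appears. With the full document, the same tally could presumably be built from `ent.label_` for every entity in `spacy_document.ents`.

```python
from collections import Counter

# The first ten (entity text, label) pairs from the output above, as plain strings.
sample_pairs = [
    ("The Adventures of Sherlock Holmes", "WORK_OF_ART"),
    ("Arthur Conan Doyle", "ORG"),
    ("Bohemia", "GPE"),
    ("The Red-Headed League", "ORG"),
    ("Identity", "ORG"),
    ("The Boscombe Valley Mystery", "FAC"),
    ("Five", "CARDINAL"),
    ("The Man", "WORK_OF_ART"),
    ("The Adventure of the Blue Carbuncle\n   VIII", "WORK_OF_ART"),
    ("The Adventure of the Speckled Band", "WORK_OF_ART"),
]

# Count how many times each label was assigned.
label_counts = Counter(label for _, label in sample_pairs)

for label, count in label_counts.most_common():
    print(label, count)
```

Even in this tiny sample, the tally shows the model reading the table of contents mostly as works of art and organizations, which is worth keeping in mind when interpreting the rest of the output.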

### Visualizing the tagged named entities

spaCy has a built-in visualizer called the [displaCy entity visualizer](https://spacy.io/usage/visualizers#ent) which we can use to visualize all of these tagged entities within the full text of our document.

In [57]:
displacy.render(spacy_document, style="ent")

### Output an HTML file with our tagged text

In [18]:
# Let's save this visualization to an HTML file
html = displacy.render(spacy_document, style="ent", page=True, jupyter=False)
with open("sherlock_holmes_spacy_output.html", "w", encoding="utf-8") as fp:
    fp.write(html)  # the file is closed automatically when the "with" block ends

### Generate a list of place names from our `spaCy` document

Here, we're going to look at place names as defined by the "GPE" (geo-political entities) tag.

In [50]:
print("Geopolitical place names according to spaCy:")
for named_entity in spacy_document.ents:
    if named_entity.label_ == "GPE": # Look at just the geopolitical entities, labeled "GPE"
        print(named_entity)

Geopolitical place names according to spaCy:
Bohemia
Holmes
Holland
Holmes
London
Europe
Egria
Bohemia
Bohemia
Holmes
Hercules
England
Europe
Europe
Prague
Warsaw
New Jersey
Prima
Warsaw
London
London
Regent Street
Serpentine
Holmes
Holmes
Arnsworth Castle
England
Esq
Holmes
China
China
China
China
Lebanon
Pennsylvania
U.S.A.
London
south
London
London
Holmes
Kensington
Holmes
London
Scotland
Cornwall
Holmes
London
London
Archie
London
Holland
London
Devonshire
New Zealand
France
France
France
France
England
Hague
Holmes
France
France
Horace
England
Boscombe Valley
Afghanistan
London
Boscombe Valley
Australia
Hatherley
colonies
the Boscombe Valley
Boscombe Valley
Boscombe Pool
Stroud Valley
Holmes
Victoria
Victoria
the Boscombe Pool
Holmes
Holmes
London
Bristol
Australia
Rotterdam
the Ballarat Gang
Melbourne
England
England
Uffa
London
Horsham
America
Florida
Europe
Sussex
Horsham
States
Horsham
England
India
Horsham
America
North
Fareham
Horsham
London
Horsham
London
America
Florida
E

In [104]:
print("Non-geopolitical locations according to spaCy:")
for named_entity in spacy_document.ents:
    if named_entity.label_ == "LOC": # Look at just the non-geopolitical locations, labeled "LOC"
        print(named_entity)

Non-geopolitical locations according to spaCy:
Europe
Europe
Europe
Regent Street
Boscombe Valley
Hatherley
the Boscombe Valley
Stroud Valley
Europe
North
South
South
Atlantic
Cannon Street
Europe
the Amoy River
Covent Garden
Covent Garden Market
Covent Garden
Atlantic
Pacific
Europe
Rockies
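
Looking at the two outputs above (entities tagged `GPE` and entities tagged `LOC`), some strings show up under both labels. The sketch below is one way to check that, using a handful of strings copied by hand from each output; the variable names are invented for this illustration, and with the full document the two lists would be built from `spacy_document.ents` instead.

```python
# A hand-copied sample of strings that appear in the "GPE" output above.
gpe_sample = [
    "Bohemia", "Holmes", "London", "Europe", "Regent Street",
    "Boscombe Valley", "Hatherley", "Stroud Valley", "North",
]

# A hand-copied sample of strings that appear in the "LOC" output above.
loc_sample = [
    "Europe", "Regent Street", "Boscombe Valley", "Hatherley", "Stroud Valley",
    "North", "South", "Atlantic", "the Amoy River", "Covent Garden",
    "Pacific", "Rockies",
]

# Strings the model tagged as GPE in one place and LOC in another.
tagged_as_both = sorted(set(gpe_sample) & set(loc_sample))

# Strings that appear only under LOC in this sample.
only_loc = sorted(set(loc_sample) - set(gpe_sample))

print("Tagged as both GPE and LOC:", tagged_as_both)
print("Tagged only as LOC:", only_loc)
```

The overlap is a reminder that the model assigns a label to each occurrence based on its context, so the same name can be classified differently at different points in the text.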


## Method 2: Using `BookNLP` for named entity recognition

What if we wanted to use a different model, one trained with a different machine learning algorithm for recognizing entities?

David Bamman developed [a natural language processing pipeline just for book-length works in English called `BookNLP`](https://github.com/booknlp/booknlp). (Unlike `spaCy`, the BookNLP model is not currently available for other languages, though Bamman's team is developing "Multilingual BookNLP" for Spanish, German, Japanese, and Russian.)



### `BookNLP`'s model & data

Bamman's `BookNLP` named entity recognition model uses a similar pipeline to `spaCy` (and even uses the same algorithm as `spaCy` for tagging parts of speech), but it uses a different machine learning model for recognizing entities (the [BERT model](https://huggingface.co/docs/transformers/model_doc/bert#bert), developed by Google) and is trained on a different training dataset.

From the `BookNLP` [description of the training dataset](https://github.com/booknlp/booknlp#entity-annotations):

>"For more, see: David Bamman, Sejal Popat and Sheng Shen, ["An Annotated Dataset of Literary Entities,"](https://people.ischool.berkeley.edu/~dbamman/pubs/pdf/naacl2019_literary_entities.pdf) NAACL 2019.  
The entity tagging model within BookNLP is trained on an annotated dataset of 968K tokens, including the public domain materials in [LitBank](https://github.com/dbamman/litbank) and a new dataset of ~500 contemporary books, including bestsellers, Pulitzer Prize winners, works by Black authors, global Anglophone books, and genre fiction (article forthcoming)."

> ### 💡 Reflection 💡 :
Take a look at the `BookNLP` training dataset as described above, and Bamman's [LitBank](https://github.com/dbamman/litbank) GitHub repository. How does this training dataset compare to the training dataset for the default `spaCy` model? How might that change the contexts in which we might use it?

**[TYPE YOUR RESPONSE HERE]**

###  What entities can `BookNLP`'s 'Named Entity Recognition' model detect?
For a complete list of the six categories of entities that `BookNLP` tags, see here: https://github.com/booknlp/booknlp#entity-annotations 

### Install `BookNLP`

In [None]:
!pip install booknlp

### Import `BookNLP`, define parameters and filepaths

Take a look at the way `BookNLP` describes its usage here: https://github.com/booknlp/booknlp#usage 

In [91]:
# Import BookNLP
from booknlp.booknlp import BookNLP

# Define the model parameters
# We can choose between the "big" or "small" model
# Below we are only using the named entity recognition pipeline ("entity")
# But there are other options that we might use, like quotations, or "coreference" resolution
model_params = {
    "pipeline": "entity",
    "model": "small"
}

booknlp = BookNLP("en", model_params)

# Input file to process
input_file="../_datasets/texts/literature/Arthur-Conan-Doyle-The-Adventures-of-Sherlock-Holmes.txt"

# Output directory to store resulting files in
output_directory="sherlock_holmes/"

# Files within this directory will be named ${book_id}.entities, ${book_id}.tokens, etc.
book_id="adventures_of_sherlock_holmes"

{'pipeline': 'entity', 'model': 'small'}
--- startup: 4.997 seconds ---
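
The comment about `${book_id}` describes how the output filenames are put together. As a small sketch of that naming pattern (using `PurePosixPath` from the standard library, which is a choice made for this illustration and not something `BookNLP` requires), the output directory and book ID defined above combine like this:

```python
from pathlib import PurePosixPath

# The same values defined in the cell above.
output_directory = "sherlock_holmes/"
book_id = "adventures_of_sherlock_holmes"

# Output files are named after the book ID, with an extension for each kind of output.
entities_path = PurePosixPath(output_directory) / f"{book_id}.entities"
tokens_path = PurePosixPath(output_directory) / f"{book_id}.tokens"

print(entities_path)
print(tokens_path)
```

The first of these is the path that gets read into a dataframe later on, once the model has been run.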


### Run BookNLP

Click and run the cell below, `booknlp.process(input_file, output_directory, book_id)`, to run the NLP model.

Because the `BookNLP` model is a fair bit bigger than the standard `spaCy` model, it will take 2-3 minutes to run. Go get yourself a cup of tea or a cup of coffee in the meantime!

In [93]:
booknlp.process(input_file, output_directory, book_id)

--- spacy: 25.552 seconds ---
--- entities: 97.942 seconds ---
--- quotes: 0.164 seconds ---
--- name coref: 0.610 seconds ---
--- TOTAL (excl. startup): 127.606 seconds ---, 127661 words


### Read in the BookNLP output files as a dataframe

In the cell below, we read in the `adventures_of_sherlock_holmes.entities` output from running the `BookNLP` model on *The Adventures of Sherlock Holmes* as a dataframe, so that we can work with the data about tagged entities in tabular format.

In [98]:
sherlock_holmes_entities_df = pd.read_csv("sherlock_holmes/adventures_of_sherlock_holmes.entities", delimiter="\t")
sherlock_holmes_entities_df

Unnamed: 0,COREF,start_token,end_token,prop,cat,text
0,240,3,4,PROP,PER,Sherlock Holmes
1,241,6,8,PROP,PER,Arthur Conan Doyle
2,1,14,15,PROP,FAC,Bohemia II
3,-1,17,22,NOM,FAC,The Red - Headed League III
4,2,35,39,PROP,VEH,The Five Orange Pips VI
...,...,...,...,...,...,...
18063,-1,127638,127638,PRON,PER,she
18064,-1,127644,127648,NOM,FAC,a private school at Walsall
18065,239,127648,127648,PROP,GPE,Walsall
18066,0,127651,127651,PRON,PER,I


### Generating a list of place names from our  `BookNLP` dataframe

We've just created a `pandas` DataFrame, so we can apply any of the `pandas` methods that we've learned to sort, filter, and analyze the dataframe.

Below, we're going to filter our dataframe to look for values in the "cat" column (i.e., the NER labels) that are "GPE" (i.e., geopolitical entities).

In [99]:
sherlock_holmes_entities_df[sherlock_holmes_entities_df['cat'] == 'GPE']

Unnamed: 0,COREF,start_token,end_token,prop,cat,text
9,4,99,99,PROP,GPE,BOHEMIA
62,6,509,509,PROP,GPE,Odessa
65,7,531,531,PROP,GPE,Trincomalee
68,8,551,551,PROP,GPE,Holland
83,9,657,657,PROP,GPE,Scarlet
174,10,1214,1214,PROP,GPE,London
229,11,1638,1638,PROP,GPE,Europe
264,-1,2025,2029,NOM,GPE,a German - speaking country
265,4,2032,2032,PROP,GPE,Bohemia
266,12,2037,2037,PROP,GPE,Carlsbad
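
Notice that not every row in this output is a name: "a German - speaking country" is tagged `NOM` in the `prop` column, while the names are tagged `PROP`. Judging from the rows shown here, `prop` appears to separate proper names (`PROP`) from common noun phrases (`NOM`) and pronouns (`PRON`), though the `BookNLP` documentation is the place to confirm that. The sketch below uses Python's built-in `csv` module on a few rows copied from the output above (rather than the real output file) to show how the tab-separated structure works and how the two columns could be combined to keep only proper place names.

```python
import csv
import io

# A few rows copied from the output above, written out as tab-separated text.
sample = (
    "COREF\tstart_token\tend_token\tprop\tcat\ttext\n"
    "4\t99\t99\tPROP\tGPE\tBOHEMIA\n"
    "10\t1214\t1214\tPROP\tGPE\tLondon\n"
    "-1\t2025\t2029\tNOM\tGPE\ta German - speaking country\n"
    "4\t2032\t2032\tPROP\tGPE\tBohemia\n"
)

# Read each row into a dictionary keyed by the column names in the header line.
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

# Keep only rows that are both geopolitical entities and proper names.
proper_places = [
    row["text"] for row in rows
    if row["cat"] == "GPE" and row["prop"] == "PROP"
]

print(proper_places)
```

In this sample, "BOHEMIA" and "Bohemia" also share the same `COREF` value (4), which suggests that the model is treating them as references to the same entity.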


## Your turn!

### Question 1: 
Compare the [list of place names generated by `spaCy`](#Generate-a-list-of-place-names-from-our-spaCy-document) with the [list of place names generated by `BookNLP`](#Generating-a-list-of-place-names-from-our--BookNLP-dataframe). What do you notice?

[TYPE YOUR REFLECTION HERE]

### Question 2:
Look at the dataframe we created above, `sherlock_holmes_entities_df`. Using the `pandas` methods that we've learned thus far, output the number of times each value appears.

In [None]:
## YOUR CODE HERE

### Question 3:

+ 3a. Sticking with the models we've just run on *The Adventures of Sherlock Holmes*, generate a list of "facility" names in `spaCy` (this is the 'FAC' tag) and generate a list of "facility" names from the `BookNLP` dataframe.
+ 3b. How do they compare?

In [None]:
## YOUR CODE HERE

[TYPE YOUR REFLECTION HERE]

### Question 4:

+ 4a. Using `spaCy`, read in a new text from our "../_datasets/texts/literature/" folder.
+ 4b. Run the NER model on that text
+ 4c. Generate a list of all the "GPE" tagged strings of text
+ 4d. Using the `displacy` visualizer, save an HTML visualization of the full tagged text of your chosen work



In [None]:
## YOUR CODE HERE

In [None]:
## YOUR CODE HERE

In [None]:
## YOUR CODE HERE

In [None]:
## YOUR CODE HERE

### Question 5:

+ 5a. Run `BookNLP` on the text from Question 4 (remember to replace the filepath, book directory and book ID fields in the code) 
+ 5b. Generate a dataframe from the BookNLP output on your chosen text
+ 5c. Using that dataframe, generate a list of all the "GPE" tagged strings of text


In [None]:
## YOUR CODE HERE

In [None]:
## YOUR CODE HERE

In [None]:
## YOUR CODE HERE