---
title: NLP
subtitle: "Natural Language Processing and Named Entity Recognition"
author:
  - name: Charles Pletcher
    affiliations: Tufts University
    orcid: 0000-0003-2734-5511
    email: charles.pletcher@tufts.edu
license:
  code: MIT
date: 2025-04-06
---

# Natural Language Processing

NLP is a large field — we'll only be able to scratch the surface today. We've already started working with elements of NLP, however: tokenization is a common first step in NLP, and it is also an NLP problem in itself.

As we've discussed, tokenization is not as simple as breaking on whitespace, nor even as simple as breaking on whitespace and punctuation. How should we handle hyphenated words, for example? What about "U.K." or "U.S."?
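To make the problem concrete, here is a minimal sketch using only Python's built-in `re` module. The example sentence and the regular expression are my own assumptions for illustration; this is not how spaCy tokenizes text, just a way to see where the naive approaches go wrong.

```python
import re

# An invented example sentence containing abbreviations and hyphenated words.
sentence = "The U.K. and the U.S. signed a long-term, well-known agreement."

# Naive approach 1: split on whitespace only.
# Punctuation stays glued to the neighboring word ("long-term,").
whitespace_tokens = sentence.split()

# Naive approach 2: keep only runs of word characters.
# This breaks "U.K." into "U" and "K" and splits hyphenated words.
word_char_tokens = re.findall(r"\w+", sentence)

# A slightly more careful pattern (still a sketch, not a real tokenizer).
# Alternatives are tried in order: dotted abbreviations, hyphenated words,
# plain words, and finally any single punctuation character.
pattern = r"(?:[A-Za-z]\.){2,}|\w+(?:-\w+)+|\w+|[^\w\s]"
careful_tokens = re.findall(pattern, sentence)

print(whitespace_tokens)
print(word_char_tokens)
print(careful_tokens)
```

Even the third pattern is fragile: if a sentence ends with "U.S.", it cannot tell whether the final period belongs to the abbreviation, to the sentence, or to both.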

## Named Entity Recognition

Another subtask of NLP is Named Entity Recognition (NER). NER is itself composed of smaller subtasks, such as named entity classification ("What kind of entity is this?") and named entity linking ("To what specific entity does this refer?").

Today, we'll be focusing on named entity classification of Book 1 of Pausanias' _Periegesis_. We'll be able to look up these entities in a data dump from the Pleiades project and feed them back into ArcGIS along with their coordinates and other relevant information.

## Loading the data

First, let's read in the transcription of Book 1 that we'll be using.

In [2]:
from pathlib import Path

paus_filename = Path("./txt/tlg0525.tlg001.theoi-eng.txt")

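# Specify the encoding explicitly so that the file reads the same way on every platform.
with open(paus_filename, encoding="utf-8") as f: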
    book_1 = f.read()

Simple enough. Let's just peek at the data to make sure it looks sane.

In [3]:
book_1[100:200]

'e Sunium promontory stands out from the Attic land. When you have rounded the promontory you see a h'

Seems pretty reasonable to me!

## Installing spaCy

[spaCy](https://spacy.io) is a Python library for NLP. Unlike [NLTK](https://nltk.org), which prioritizes teaching and research, spaCy is opinionated: it generally provides one way of performing a given task. For our purposes, spaCy's guided approach will be more than sufficient.

To get started, install spaCy like any other Python library:

In [3]:
%pip install spacy

Collecting spacy
  Downloading spacy-3.8.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (27 kB)
Collecting spacy-legacy<3.1.0,>=3.0.11 (from spacy)
  Downloading spacy_legacy-3.0.12-py2.py3-none-any.whl.metadata (2.8 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0 (from spacy)
  Downloading spacy_loggers-1.0.5-py3-none-any.whl.metadata (23 kB)
Collecting murmurhash<1.1.0,>=0.28.0 (from spacy)
  Downloading murmurhash-1.0.12-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB)
Collecting cymem<2.1.0,>=2.0.2 (from spacy)
  Downloading cymem-2.0.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.5 kB)
Collecting preshed<3.1.0,>=3.0.2 (from spacy)
  Downloading preshed-3.0.9-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.2 kB)
Collecting thinc<8.4.0,>=8.3.4 (from spacy)
  Downloading thinc-8.3.4-cp312-cp312-manylinux_2_17_x86

But for spaCy to do anything useful, we also need to download a pretrained _model_. For our purposes, a model is essentially a large mapping of tokens (or subtokens) to long vectors (lists) of numbers, together with the trained components that use those numbers to make predictions. In general, the larger the model, the more accurately it can represent a text in numerical terms, but also the more expensive it is to run.
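The idea of mapping tokens to lists of numbers can be illustrated with a toy example. The three-dimensional values below are invented purely for illustration (they are assumptions, not taken from any spaCy model), and cosine similarity is just one common way of comparing such lists.

```python
import math

# Hypothetical 3-dimensional values, made up for this sketch.
# Real models learn hundreds of dimensions from large corpora.
toy_vectors = {
    "Athens": [0.9, 0.1, 0.0],
    "Sparta": [0.8, 0.2, 0.1],
    "Zeus": [0.1, 0.9, 0.2],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


city_city = cosine_similarity(toy_vectors["Athens"], toy_vectors["Sparta"])
city_god = cosine_similarity(toy_vectors["Athens"], toy_vectors["Zeus"])

print(f"Athens vs. Sparta: {city_city:.2f}")
print(f"Athens vs. Zeus:   {city_god:.2f}")
```

Because the two "city" entries were given similar numbers, they score as more alike than a city and a god. A trained model arrives at such numbers from data rather than by hand.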

We'll use the medium model today, as it hits the sweet spot for accuracy and usability.

In [3]:
%run -m spacy download en_core_web_md

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33.5/33.5 MB 60.5 MB/s eta 0:00:00
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


With the model downloaded, we can now run the text of `book_1` through spaCy's NER pipeline.

In [4]:
import spacy
from spacy import displacy

# Load the model that we downloaded.
# If this line fails, make sure that
# you have downloaded the model that's
# referenced here.
nlp = spacy.load("en_core_web_md")

# Analyze `book_1` — this might take a bit.
doc = nlp(book_1)

In [9]:
# We're running these lines in a separate cell so that we don't
# need to run the full analysis each time we inspect the results.

ents = [(e.text, e.label_) 
        for e in doc.ents
        if e.label_ not in ("CARDINAL", "ORDINAL")]

for ent in ents:
    print(ent)

('SUNIUM & LAURIUM', 'ORG')
('I.', 'PERSON')
('Greek', 'NORP')
('the Cyclades Islands', 'LOC')
('the Aegean Sea', 'LOC')
('Sunium', 'LOC')
('Attic', 'NORP')
('Athena of Sunium', 'ORG')
('Laurium', 'PERSON')
('Athenians', 'NORP')
('the Island of Patroclus', 'LOC')
('Patroclus', 'ORG')
('Egyptian', 'NORP')
('Ptolemy', 'PERSON')
('Ptolemy', 'GPE')
('Lagus', 'PERSON')
('Athenians', 'NORP')
('Antigonus', 'LOC')
('Demetrius', 'PERSON')
('PEIRAEUS', 'ORG')
('Peiraeus', 'PERSON')
('Themistocles', 'ORG')
('Phalerum', 'GPE')
('Athens', 'GPE')
('Menestheus', 'PERSON')
('Troy', 'PERSON')
('Theseus', 'PERSON')
('Minos', 'ORG')
('Androgeos', 'ORG')
('Themistocles', 'ORG')
('Peiraeus', 'PERSON')
('Phalerum', 'GPE')
('Athenian', 'NORP')
('Themistocles', 'ORG')
('Athenians', 'NORP')
('Themistocles', 'ORG')
('Magnesia', 'LOC')
('Themistocles', 'ORG')
('Parthenon', 'GPE')
('Themistocles', 'ORG')
('Peiraeus', 'PERSON')
('Athena', 'GPE')
('Zeus', 'PERSON')
('Zeus', 'PERSON')
('Athena', 'ORG')
('Arcesilaus'
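A long printed list is hard to evaluate by eye. One way to summarize it, sketched here with the standard library's `Counter`, is to count how often each label occurs and to flag strings that received more than one label. The `sample_ents` list is a handful of pairs copied from the output above; in the notebook you would pass the full `ents` list instead.

```python
from collections import Counter

# A few (text, label) pairs copied from the output above.
sample_ents = [
    ("Sunium", "LOC"),
    ("Attic", "NORP"),
    ("Laurium", "PERSON"),
    ("Themistocles", "ORG"),
    ("Themistocles", "ORG"),
    ("Athens", "GPE"),
    ("Zeus", "PERSON"),
    ("Athena", "GPE"),
    ("Athena", "ORG"),
]

# How many entities received each label?
label_counts = Counter(label for _, label in sample_ents)

# Which strings were given more than one label?
labels_by_text = {}
for text, label in sample_ents:
    labels_by_text.setdefault(text, set()).add(label)

inconsistent = {
    text: sorted(labels)
    for text, labels in labels_by_text.items()
    if len(labels) > 1
}

print(label_counts.most_common())
print(inconsistent)
```

Strings with conflicting labels, like "Athena" here, are good candidates for a manual check before mapping.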

In [13]:
import random

my_ents = random.sample(ents,20)

my_ents

[('Demetrius', 'PERSON'),
 ('two years', 'DATE'),
 ('Athenians', 'NORP'),
 ('Demetrius', 'PERSON'),
 ('Eumolpus', 'ORG'),
 ('Phamenoph', 'PERSON'),
 ('Italy', 'GPE'),
 ('Nisus', 'PERSON'),
 ('Trophonius', 'GPE'),
 ('Hieronymus', 'PERSON'),
 ('Megarian', 'NORP'),
 ('Oeneadae', 'GPE'),
 ('Pheidias', 'PERSON'),
 ('Nicias', 'PERSON'),
 ('Laconian', 'NORP'),
 ('Neoptolemus', 'GPE'),
 ('Hyperochus', 'PERSON'),
 ('Cranaus', 'PERSON'),
 ('Oropus', 'ORG'),
 ('Delians', 'NORP')]

In [16]:
ents = [(e.text, e.label_) 
        for e in doc.ents
        if e.label_ in ("LOC", "GPE")]

for ent in ents:
    print(ent)

('the Cyclades Islands', 'LOC')
('the Aegean Sea', 'LOC')
('Sunium', 'LOC')
('the Island of Patroclus', 'LOC')
('Ptolemy', 'GPE')
('Antigonus', 'LOC')
('Phalerum', 'GPE')
('Athens', 'GPE')
('Phalerum', 'GPE')
('Magnesia', 'LOC')
('Parthenon', 'GPE')
('Athena', 'GPE')
('the united', 'GPE')
('Boeotia', 'GPE')
('Munychia', 'GPE')
('Phalerum', 'GPE')
('Heros', 'GPE')
('Ionia', 'GPE')
('Phalerum', 'GPE')
('Athens', 'GPE')
('Hera', 'GPE')
('Thermodon', 'LOC')
('Macedonia', 'GPE')
('Syracuse', 'GPE')
('Macedonia', 'GPE')
('ATHENS', 'GPE')
('Poseidon', 'GPE')
('Polybotes', 'GPE')
('Poseidon', 'GPE')
('Melpomenus', 'LOC')
('Eubulides', 'GPE')
('Athens', 'GPE')
('Athens', 'GPE')
('Attica', 'GPE')
('Herse', 'GPE')
('Cecrops', 'GPE')
('Attica', 'GPE')
('Earth', 'LOC')
('Phaethon', 'LOC')
('Cleidicus', 'GPE')
('Mantinea', 'GPE')
('Leuctra', 'GPE')
('Thermopylae', 'GPE')
('Europe', 'LOC')
('Eridanus', 'GPE')
('the Ionian Sea', 'LOC')
('Macedonia', 'GPE')
('Thessaly', 'GPE')
('Thermopylae', 'GPE')
('

In [18]:
import random

my_ents = random.sample(ents,20)

my_ents

[('Bel', 'GPE'),
 ('Caria', 'GPE'),
 ('Eileithyia', 'GPE'),
 ('Asia', 'LOC'),
 ('Argives', 'GPE'),
 ('Megareus', 'GPE'),
 ('Pagae', 'GPE'),
 ('Eileithyia', 'GPE'),
 ('Tauri', 'GPE'),
 ('Immaradus', 'GPE'),
 ('Tauri', 'GPE'),
 ('the Ionian Sea', 'LOC'),
 ('Thrace', 'GPE'),
 ('Neoptolemus', 'GPE'),
 ('Lycus', 'GPE'),
 ('Athena', 'GPE'),
 ('Hera', 'GPE'),
 ('Egypt', 'GPE'),
 ('EGYPT', 'GPE'),
 ('Crete', 'GPE')]
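Notice that the sample above contains repeats ("Eileithyia" and "Tauri" each appear twice), because `ents` holds one entry per mention. If you want 20 distinct candidates, one option is to remove duplicates before sampling. The sketch below uses a short list copied from the output above; the seed value and the names `ents_with_repeats`, `unique_ents`, and `my_unique_ents` are arbitrary choices of mine.

```python
import random

# A few (text, label) pairs copied from the sample above, including repeats.
ents_with_repeats = [
    ("Tauri", "GPE"),
    ("Eileithyia", "GPE"),
    ("Tauri", "GPE"),
    ("Eileithyia", "GPE"),
    ("Egypt", "GPE"),
    ("EGYPT", "GPE"),
    ("Crete", "GPE"),
    ("the Ionian Sea", "LOC"),
]

# dict.fromkeys drops exact duplicates while keeping first-seen order.
unique_ents = list(dict.fromkeys(ents_with_repeats))

# A seeded generator makes the "random" sample reproducible between runs.
rng = random.Random(42)

# Never ask for more items than exist, or random.sample raises ValueError.
sample_size = min(5, len(unique_ents))
my_unique_ents = rng.sample(unique_ents, sample_size)

print(my_unique_ents)
```

Exact-match deduplication still treats "Egypt" and "EGYPT" as different entries, so some manual cleanup may remain.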

:::{note}
What is the type of the results in `ents`?
:::

## Looking up coordinates

While these results are far from perfect — "Hyllus," at least in my practice runs, was classified as a "PRODUCT" rather than a "PERSON" — they're fairly useful in broad strokes for our purposes.

But we still need to add coordinates, and we have over 4000 entities to link. How can we go about doing this scalably?

## Build a search tool

All of the data we need is available through [Pleiades](https://pleiades.stoa.org) and [ToposText](https://topostext.org), but the strings that are labeled by our NER model might not match the titles of places available from these sources. We could build a search index that lets us match titles more flexibly, but that is beyond the scope of our work for today.
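For the curious, here is a rough sketch of what "matching more flexibly" could look like using `difflib` from the standard library. It is not a real search index, and the `gazetteer_titles` list is hypothetical: the spellings are stand-ins I chose for illustration, not titles pulled from Pleiades or ToposText. The helper name `find_candidates` and the `cutoff` value are likewise assumptions.

```python
import difflib

# Hypothetical gazetteer titles; a real list would come from a data export.
gazetteer_titles = ["Sounion", "Piraeus", "Phaleron", "Athenae", "Megara"]


def find_candidates(name, titles, n=3, cutoff=0.6):
    """Return up to n titles whose spelling is close to name, best match first."""
    # Compare in lowercase, but return titles in their original form.
    lowered = {title.lower(): title for title in titles}
    matches = difflib.get_close_matches(
        name.lower(), list(lowered), n=n, cutoff=cutoff
    )
    return [lowered[match] for match in matches]


# Spellings as they appear in our translation of Pausanias.
for name in ["Peiraeus", "Phalerum", "Sunium"]:
    print(name, "->", find_candidates(name, gazetteer_titles))
```

String similarity only catches spelling variants. It will not connect names that differ entirely, and it can suggest wrong matches, so every candidate would still need to be checked by a person.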

## Annotate by hand

Instead, working in groups, choose about **20** places from the NER list that you would like to map. You could even pull them out randomly, if you'd like.

Then, using Pleiades's own search tool, find the coordinates for each location. Store this data, along with any contextual information or descriptions that you deem relevant, in a CSV or spreadsheet that you can upload to ArcGIS.
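If you would rather start from a file than a blank spreadsheet, the following sketch writes a CSV template with one row per chosen place and empty cells to fill in by hand. The file name and the column names (`name`, `pleiades_uri`, `latitude`, `longitude`, `notes`) are my assumptions; adjust them to whatever your ArcGIS upload expects.

```python
import csv
from pathlib import Path

# Place names taken from the NER output; swap in your group's own choices.
chosen_places = ["Sunium", "Phalerum", "Munychia"]

# Assumed column names for the spreadsheet.
fieldnames = ["name", "pleiades_uri", "latitude", "longitude", "notes"]

# Hypothetical output file in the current working directory.
out_path = Path("book_1_places.csv")

# newline="" lets the csv module control line endings itself.
with open(out_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for name in chosen_places:
        # Coordinates and notes are left blank on purpose:
        # they are filled in by hand after looking each place up.
        writer.writerow(
            {
                "name": name,
                "pleiades_uri": "",
                "latitude": "",
                "longitude": "",
                "notes": "",
            }
        )

print(out_path.read_text(encoding="utf-8"))
```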

:::{note}
Can you also include a `count` column recording how often each place is mentioned in Book 1?
:::
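One possible approach to the `count` question, sketched with `collections.Counter`: because the entity list has one entry per mention, counting the strings gives the number of mentions. The `place_ents` list below is a small sample copied from the output earlier in these notes, so the numbers it prints are not the real totals for Book 1. Folding case so that "ATHENS" and "Athens" are counted together is my own choice.

```python
from collections import Counter

# A small sample of (text, label) pairs copied from the LOC/GPE output.
# In the notebook, use the full `ents` list instead.
place_ents = [
    ("Phalerum", "GPE"),
    ("Athens", "GPE"),
    ("Phalerum", "GPE"),
    ("ATHENS", "GPE"),
    ("Athens", "GPE"),
    ("Attica", "GPE"),
    ("Egypt", "GPE"),
    ("EGYPT", "GPE"),
]

# casefold() makes the comparison case-insensitive,
# so section headings in capitals count toward the same place.
counts = Counter(text.casefold() for text, _ in place_ents)

for name, count in counts.most_common():
    print(f"{name}: {count}")
```

These are counts of what the model labeled, so any mention the model missed or mislabeled will not be included.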

If you find that your group is working particularly quickly, grab another 10 placenames, or experiment with mapping specific sections of Pausanias' text.

## Readings

- <https://doi.org/10.5334/johd.150>
- <https://doi.org/10.5281/zenodo.1193921>
- Kirsch, Adam. "Technology Is Taking Over English Departments." _The New Republic_, May 2, 2014.
- @Blei2012
- @Brett2012
- @Mimno2012
- @Wellmon2015

## Homework

- Finish annotating placenames and uploading the results to an ArcGIS map
- Share a link to the map on Canvas