# spaCy Text Analysis for Named Entity Recognition
This notebook explores named entity recognition (NER), focusing mostly on geographic places, using some typical DAMS metadata from the Pacific Basin Nautical Charts.

In [1]:
import spacy
import pandas as pd
import numpy as np

In [3]:
# Loading the model might take a little while.
# If it is not installed yet, download it first with:
#   $ python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

## Setting up the `pandas` dataframe

In [26]:
# Load the Excel spreadsheet as a pandas DataFrame
df = pd.read_excel('data/sio_hist_charts.xlsx')

In [27]:
df.head()

Unnamed: 0,Row,File Name,JPEG URL,ARK,Title,Survey Note,Scale,N,S,W,...,Edition,Date Created,Date Issued,Corporate author/Publisher,Contents,Insets,Geographic Subject (geonames),Country (geonames),Notes,Genre
0,16,HY00000123.tif,https://libraries.ucsd.edu/apps/public/#DOV&bb...,bb07179095,Abaiang or Charlotte Island; Tarawa or Cook Is...,U.S. Ex. Ex. 1844,Scales differ,2.0,1.333333,172.45,...,12th ed.,1932-05-01,1951-10-01,United States Hydrographic Office,,,Abaiang Island; Tarawa Atoll,Kiribati,Depths shown by soundings and pictorially,Nautical charts
1,634,HY00000082.tif,https://libraries.ucsd.edu/apps/public/#DOV&bb...,bb3823744m,Ahe and Manihi or Peacock and Wilsons Islands,By the U.S.Ex.Ex. 1839,"1:143,441",-14.33,-14.5,-146.5,...,11th ed.,1921-07-01,1949-12-01,United States Hydrographic Office,,Entrance to Ahe Lagoon; Entrance to Manihi Lagoon,Ahe; Manihi,"Tuamotu Archipelago, French Polynesia",Relief shown by shading; depths shown by sound...,Nautical charts
2,633,HY00005861.tif,https://libraries.ucsd.edu/apps/public/#DOV&bb...,bb51206842,Anchorages in the Netherlands West Indies,From Netherlands Government surveys to 1933,Scales differ,-4.62,-8.78,115.82,...,1st ed.,1944-03-01,1944-03-01,United States Hydrographic Office,"Laurot Islands (Poelau Laoet Ketjil), Mata Sir...",,,Indonesia,Relief shown by contours and spot heights; dep...,Nautical charts
3,1043,HY00005862.tif,https://libraries.ucsd.edu/apps/public/#DOV&bb...,bb16735538,Anchorages on the south coast of Java,From Netherlands Government survey in 1928,Scales differ,-8.48,-8.6,113.82,...,1st ed.,1942-12-01,1944-05-01,United States Hydrographic Office,Permisan Bay; Bandi Alit Bay; Radjeg Wesi Bay,,Java,Indonesia,Relief shown by form lines and spot heights; d...,Nautical charts
4,725,HY00015010.tif,https://libraries.ucsd.edu/apps/public/#DOV&bb...,bb2868103r,Anchorages on the southeastern coast of Kamchatka,Emergency reprodution of Russian chart,Scales differ,,,,...,,1941-01-01,1941-01-01,United States Hydrographic Office,Anchorage at Cape Olga; Entrance to the mouth ...,,Kamchatka Peninsula,Russia,Relief shown by hachures; depths shown by soun...,Nautical charts


## Using spaCy's `nlp` to get tokens
We might be able to do NER directly, but first let's apply `nlp` to one column, "Title", and store the result in a new column. Each cell of the new column holds a spaCy `Doc`, which displays as a sequence of tokens.

In [28]:
df['tokens'] = df['Title'].apply(nlp)
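
Calling `nlp` once per row through `apply` works, but spaCy also provides `nlp.pipe`, which processes texts in batches and is usually faster on larger tables. The sketch below shows one way this could look. `titles_to_docs` is a hypothetical helper (not part of spaCy or of this notebook), and replacing non-string cells with an empty string is an assumption about how missing titles should be handled.

```python
def titles_to_docs(nlp, titles, batch_size=256):
    """Run a spaCy pipeline over an iterable of titles in batches.

    `nlp` is a loaded spaCy pipeline; `titles` can be a list or a pandas Series.
    """
    # Assumption: non-string cells (e.g. NaN from empty spreadsheet cells)
    # are replaced by "" so that nlp.pipe only ever receives strings.
    texts = [t if isinstance(t, str) else "" for t in titles]
    # nlp.pipe yields one Doc per input text, in the same order
    return list(nlp.pipe(texts, batch_size=batch_size))
```

With the objects defined in the cells above, this could be used as `df['tokens'] = titles_to_docs(nlp, df['Title'])`.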

In [29]:
df.head()

Unnamed: 0,Row,File Name,JPEG URL,ARK,Title,Survey Note,Scale,N,S,W,...,Date Created,Date Issued,Corporate author/Publisher,Contents,Insets,Geographic Subject (geonames),Country (geonames),Notes,Genre,tokens
0,16,HY00000123.tif,https://libraries.ucsd.edu/apps/public/#DOV&bb...,bb07179095,Abaiang or Charlotte Island; Tarawa or Cook Is...,U.S. Ex. Ex. 1844,Scales differ,2.0,1.333333,172.45,...,1932-05-01,1951-10-01,United States Hydrographic Office,,,Abaiang Island; Tarawa Atoll,Kiribati,Depths shown by soundings and pictorially,Nautical charts,"(Abaiang, or, Charlotte, Island, ;, Tarawa, or..."
1,634,HY00000082.tif,https://libraries.ucsd.edu/apps/public/#DOV&bb...,bb3823744m,Ahe and Manihi or Peacock and Wilsons Islands,By the U.S.Ex.Ex. 1839,"1:143,441",-14.33,-14.5,-146.5,...,1921-07-01,1949-12-01,United States Hydrographic Office,,Entrance to Ahe Lagoon; Entrance to Manihi Lagoon,Ahe; Manihi,"Tuamotu Archipelago, French Polynesia",Relief shown by shading; depths shown by sound...,Nautical charts,"(Ahe, and, Manihi, or, Peacock, and, Wilsons, ..."
2,633,HY00005861.tif,https://libraries.ucsd.edu/apps/public/#DOV&bb...,bb51206842,Anchorages in the Netherlands West Indies,From Netherlands Government surveys to 1933,Scales differ,-4.62,-8.78,115.82,...,1944-03-01,1944-03-01,United States Hydrographic Office,"Laurot Islands (Poelau Laoet Ketjil), Mata Sir...",,,Indonesia,Relief shown by contours and spot heights; dep...,Nautical charts,"(Anchorages, in, the, Netherlands, West, Indies)"
3,1043,HY00005862.tif,https://libraries.ucsd.edu/apps/public/#DOV&bb...,bb16735538,Anchorages on the south coast of Java,From Netherlands Government survey in 1928,Scales differ,-8.48,-8.6,113.82,...,1942-12-01,1944-05-01,United States Hydrographic Office,Permisan Bay; Bandi Alit Bay; Radjeg Wesi Bay,,Java,Indonesia,Relief shown by form lines and spot heights; d...,Nautical charts,"(Anchorages, on, the, south, coast, of, Java)"
4,725,HY00015010.tif,https://libraries.ucsd.edu/apps/public/#DOV&bb...,bb2868103r,Anchorages on the southeastern coast of Kamchatka,Emergency reprodution of Russian chart,Scales differ,,,,...,1941-01-01,1941-01-01,United States Hydrographic Office,Anchorage at Cape Olga; Entrance to the mouth ...,,Kamchatka Peninsula,Russia,Relief shown by hachures; depths shown by soun...,Nautical charts,"(Anchorages, on, the, southeastern, coast, of,..."


In [30]:
df['tokens']

0       (Abaiang, or, Charlotte, Island, ;, Tarawa, or...
1       (Ahe, and, Manihi, or, Peacock, and, Wilsons, ...
2        (Anchorages, in, the, Netherlands, West, Indies)
3           (Anchorages, on, the, south, coast, of, Java)
4       (Anchorages, on, the, southeastern, coast, of,...
5       (Asia, :, anchorages, on, the, south, coast, o...
6       (Asia, :, anchorages, on, the, west, coast, of...
7       (Asia, :, anchorages, on, the, west, coast, of...
8       (Asia, :, Cambodia, -, Thailand, :, Gulf, of, ...
9       (Asia, :, China, :, approaches, to, the, Yangt...
10      (Asia, :, China, :, Chu, Kiang, or, Canton, Ri...
11      (Asia, :, China, :, Gulf, of, Liaotung, :, Hu,...
12      (Asia, :, China, :, Gulf, of, Liaotung, :, Hun...
13      (Asia, :, China, :, Gulf, of, Liaotung, :, Tai...
14      (Asia, :, China, :, Gulf, of, Pohai, :, Pei, H...
15      (Asia, :, China, :, Kwantung, Peninsula, :, Da...
16      (Asia, :, China, :, Liaodong, Wan, (, Liao, -,...
17      (Asia,
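
Each cell in `df['tokens']` is a spaCy `Doc`, and each token in it carries attributes such as `text`, `is_punct`, and `is_space`. As a rough sketch, the hypothetical helper below (`token_texts` is not a spaCy function) pulls out the token strings and drops punctuation such as the `;` and `:` separators visible in the output above.

```python
def token_texts(doc, keep_punct=False):
    """Return the text of each token in a spaCy Doc as a plain list of strings."""
    return [
        tok.text
        for tok in doc
        # Whitespace-only tokens are always skipped;
        # punctuation is skipped unless keep_punct is True.
        if not tok.is_space and (keep_punct or not tok.is_punct)
    ]
```

Applied to the column from the cells above, this might look like `df['tokens'].apply(token_texts)`.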

In [31]:
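# Collect the named entities spaCy found in each title, reusing the
# Doc objects already stored in 'tokens' instead of re-running the pipeline
df['entities'] = df['tokens'].apply(lambda doc: list(doc.ents))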

In [32]:
df['entities']

0       [(Abaiang), (Charlotte, Island), (Tarawa), (Co...
1                   [(Ahe), (Manihi), (Wilsons, Islands)]
2                                                      []
3                                                      []
4                                           [(Kamchatka)]
5                                       [(Asia), (China)]
6                             [(Asia), (Chosen), (Korea)]
7                             [(Asia), (Chosen), (Korea)]
8       [(Asia), (Cambodia), (Thailand), (Thailand), (...
9                [(Asia), (China), (the, Yangtze, River)]
10      [(Asia), (China), (Chu, Kiang), (Canton, River...
11                         [(Asia), (China), (Hu, -, Lu)]
12      [(Asia), (China), (Fort, Head), (Chin, -, Chou...
13      [(Asia), (China), (Tai, -, Tzu), (Fu, -, Chou,...
14      [(Asia), (China), (Pohai), (Hai, Ho), (2), (Ko...
15      [(Asia), (China), (Kwantung, Peninsula), (Dair...
16      [(Asia), (China), (Liaodong, Wan), (Liao, -, T...
17            
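
The `entities` column mixes all entity types: row 14 above, for example, includes `(2)`, which is unlikely to be a place. Since this notebook is mostly interested in geographic places, one option is to keep only the spans whose `label_` is geographic. The sketch below assumes the labels `GPE`, `LOC`, and `FAC` from the OntoNotes scheme used by `en_core_web_sm`. `GEO_LABELS` and `geo_entities` are hypothetical names, and which labels count as "geographic" is a judgment call.

```python
# Assumed set of "geographic" labels; adjust as needed
GEO_LABELS = frozenset({"GPE", "LOC", "FAC"})


def geo_entities(doc, labels=GEO_LABELS):
    """Return (text, label) pairs for the entities in a spaCy Doc whose label is in `labels`."""
    return [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in labels]
```

Using the `tokens` column from earlier, this could be applied as `df['geo_entities'] = df['tokens'].apply(geo_entities)`.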

## Using `displacy` to visualize entities
We can also use `displacy`, spaCy's built-in visualizer, to highlight the named entities found in a passage of text.

In [34]:
from spacy import displacy

In [39]:
text = """
Louis "Lubo" Pechi was born in the Croatian city of Zagreb. He was seven years old when the Germans invaded Yugoslavia. 
In response to the mounting anti-Semitic repression and strict laws prohibiting Jews from traveling, 
the Pechis converted to Catholicism so that they could escape to safety in Italy. The move marked a lengthy process of hiding: 
Lubo had to change his name, religion, and identity. The Pechi family finally managed to escape and make their way to Rome. 
Decades later, Lubo began the arduous process of recovering the memories of his hidden life by writing 
his memoir "I am Lubo: A Child Survivor from Yugoslavia."
"""
doc = nlp(text)
displacy.render(doc, style="ent", jupyter=True)
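
`displacy.render` also accepts an `options` dictionary. For the `ent` style, the `ents` key limits which labels are highlighted and `colors` maps labels to highlight colors. The sketch below builds such a dictionary for place-like labels. `geo_displacy_options` is a hypothetical helper, and the label set and color are assumptions chosen for illustration.

```python
def geo_displacy_options(labels=("GPE", "LOC", "FAC"), color="#aa9cfc"):
    """Build a displacy options dict that highlights only the given entity labels."""
    return {
        # Labels to display; entities with other labels are left unmarked
        "ents": list(labels),
        # Use the same highlight color for every selected label
        "colors": {label: color for label in labels},
    }
```

With the `doc` from the cell above, this could be passed as `displacy.render(doc, style="ent", options=geo_displacy_options(), jupyter=True)`.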