# NER, Geocoding, and Mapping

In today's in-class practicum, we're going to explore moving between a list of place names and techniques for mapping those place names.


- [Part 1: Reviewing Named Entity Recognition (NER)](#Part-1:-Reviewing-Named-Entity-Recognition-(NER))
- [Part 2: Geocoding: From a list of place names to a map](#Part-2:-Geocoding:-From-a-list-of-place-names-to-a-map)
    - [Setting up](#Setting-up)
        - [Install GeoPy](#Install-GeoPy)
        - [Import GeoPy's Nominatim modules](#Import-GeoPy's-Nominatim-modules)
        - [Initializing Nominatim](#Initializing-Nominatim)
    - [Geocode a location](#Geocode-a-location)
        - [Get location address](#Get-location-address)
        - [Get Latitude and Longitude](#Get-Latitude-and-Longitude)
        - [Get Location Class](#Get-Location-Class)
        - [Retrieve multiple possible matches](#Retrieve-multiple-possible-matches)
    - [Geocoding a CSV file list of multiple locations with pandas](#Geocoding-a-CSV-file-list-of-multiple-locations-with-pandas)
        - [Read in a CSV with a list of places](#Read-in-a-CSV-with-a-list-of-places)
        - [Filter our Data Frame so we just have a list of geo-political place names](#Filter-our-Data-Frame-so-we-just-have-a-list-of-geo-political-place-names)
        - [Create a function that uses geolocator.geocode()](#Create-a-function-that-uses-geolocator.geocode())
        - [Apply our find_location function to our dataframe to geocode multiple place names](#Apply-our-find_location-fuction-to-our-dataframe-to-geocode-multiple-place-names)
    - [Creating Interactive Maps](#Creating-Interactive-Maps)
        - [Install Folium](#Install-Folium)
        - [Import Folium](#Import-Folium)
        - [Make a Base Map](#Make-a-Base-Map)
        - [Add a Marker](#Add-a-Marker)
        - [Add Markers from dataframe](#Add-Markers-from-dataframe)
        - [To save a map as an HTML file](#To-save-map-as-an-HTML-file)
        - [Another example: *Robinson Crusoe*](#Another-Example:-Mapping-place-names-from-the-opening-of-Robinson-Crusoe-(1719))
    - [Your turn!](#Your-Turn!)
        - [Step 1: Assemble a list of place names](#Step-1:-Assemble-a-list-of-place-names)
        - [Step 2: Geocode the place names](#Step-2:-Geocode-the-place-names)
        - [Step 3: Create a map to visualize the geocoded place names](#Step-3:-Create-a-map-to-visualize-the-geo-coded-locations)
        


    


## Part 1: Reviewing Named Entity Recognition (NER)

Let's discuss how the research exercise on named entity recognition went!

In [None]:
Questions?

## Part 2: Geocoding: From a list of place names to a map

In today's in-class practicum, we're going to learn how to analyze and visualize geographic data. Geocoding is the process of converting a place name or address into geographic coordinates (latitude and longitude) that identify a location.

We'll be using two libraries, `GeoPy` and `Folium`.

[GeoPy](https://geopy.readthedocs.io/en/stable/) is a Python library that makes it easier to use a range of third-party geocoding API services, such as Google, Bing, ArcGIS, and OpenStreetMap, to generate geographic coordinates.

[Folium](https://python-visualization.github.io/folium/) is a library that integrates with the open-source JavaScript mapping software [Leaflet.js](https://leafletjs.com/) to create interactive maps.


### Setting up

#### Install `GeoPy`

In [127]:
!pip install geopy



#### Import `GeoPy`'s Nominatim modules

In [6]:
from geopy.geocoders import Nominatim

[Nominatim](https://nominatim.org/release-docs/develop/api/Overview/) (Latin for “by name”) uses [OpenStreetMap](https://www.openstreetmap.org/about) data to match addresses with geographic coordinates. Though we don’t need an API key to use Nominatim, we do need to create a unique application name.
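To make the matching step a little more concrete, here is a rough sketch of the kind of search URL a Nominatim lookup involves. It only builds the URL and sends nothing; GeoPy constructs and sends the real request for us. The helper name `build_search_url` is hypothetical, and the public endpoint and the `q`, `format`, and `limit` parameters are assumptions based on the Nominatim search API documentation, so check that documentation before relying on them.

```python
from urllib.parse import urlencode

# Public Nominatim search endpoint (assumed here for illustration).
NOMINATIM_SEARCH_URL = "https://nominatim.openstreetmap.org/search"


def build_search_url(place_name, limit=1):
    """Build (but do not send) a Nominatim search URL for a place name."""
    params = {"q": place_name, "format": "jsonv2", "limit": limit}
    return f"{NOMINATIM_SEARCH_URL}?{urlencode(params)}"


print(build_search_url("Nassau Hall"))
```

In practice we never need to build this URL ourselves; the sketch is only meant to show that a geocoding request is a place name plus a few options sent to a web service.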

#### Initializing Nominatim 
Here we're initializing Nominatim as a variable called `geolocator`. Change the application name below to your own application name.

In [7]:
geolocator = Nominatim(user_agent="Data and Literary Study mapping app", timeout=2)

## Geocode a location
To geocode an address or location, we simply use the `.geocode()` method:

In [100]:
location = geolocator.geocode("Nassau Hall")

In [101]:
location

Location(Nassau Hall, Chapel Drive, Princeton, Mercer County, New Jersey, 08544, United States, (40.34868635, -74.65939173555758, 0.0))

### Get location address

In [10]:
print(location.address)

Nassau Hall, Chapel Drive, Princeton, Mercer County, New Jersey, 08544, United States


### Get Latitude and Longitude

In [11]:
print(location.latitude, location.longitude)

40.34868635 -74.65939173555758


### Get Location Class

In [12]:
print(f"Class: {location.raw['class']} \nType: {location.raw['type']}")

Class: building 
Type: university
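`location.raw` is a plain Python dictionary holding the fields Nominatim returned, which is why we index it with keys such as `'class'` and `'type'`. A minimal sketch, assuming a hand-built dictionary that contains only the values printed above (a real `raw` dictionary has more fields), shows how `.get()` can guard against a missing key. The `describe_place` helper is hypothetical.

```python
# Stand-in for location.raw, using only values shown in this notebook.
example_raw = {"class": "building", "type": "university"}


def describe_place(raw):
    """Return 'class / type', falling back to 'unknown' for missing keys."""
    return f"{raw.get('class', 'unknown')} / {raw.get('type', 'unknown')}"


print(describe_place(example_raw))
print(describe_place({"type": "city"}))
```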


### Retrieve multiple possible matches
Here we're going to retrieve *all* possible matches for "Nassau Hall," rather than just the most likely. We're going to use a for loop to retrieve all matching entries, along with the exact address, latitude and longitude, and "importance" score of each location. (For more on "importance" and other address rankings in Nominatim, see [here](https://nominatim.org/release-docs/develop/customize/Ranking/).)

In [13]:
possible_locations = geolocator.geocode("Nassau Hall", exactly_one=False)

for location in possible_locations:
    print(location.address)
    print(location.latitude, location.longitude)
    print(f"Importance: {location.raw['importance']}")

Nassau Hall, Chapel Drive, Princeton, Mercer County, New Jersey, 08544, United States
40.34868635 -74.65939173555758
Importance: 0.5681802197739116
Nassau Hall, Muttontown, Town of Oyster Bay, Nassau County, New York, United States
40.8236664 -73.53870246658838
Importance: 0.35000000000000003
Nassau Hall, Muttontown Road, Muttontown, Town of Oyster Bay, Nassau County, New York, 11732, United States
40.822946 -73.53940501363016
Importance: 0.201
Nassau Hall, Red Jacket Drive, Village of Geneseo, Town of Geneseo, Livingston County, New York, 14454, United States
42.7922786 -77.82415458673518
Importance: 0.201
Nassau Hall, Marburger Drive, Stony Brook University, Suffolk County, New York, 11794, United States
40.90611495 -73.12092235439
Importance: 0.201
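When several matches come back, one simple way to choose among them is to keep the one with the highest importance score. A minimal sketch, assuming the candidates have been copied into plain dictionaries (addresses abbreviated, scores taken from the output above); with real GeoPy results you would read `location.raw['importance']` instead.

```python
# Candidates copied by hand from the output above (addresses shortened).
candidates = [
    {"address": "Nassau Hall, Princeton, New Jersey", "importance": 0.5681802197739116},
    {"address": "Nassau Hall, Muttontown, New York", "importance": 0.35000000000000003},
    {"address": "Nassau Hall, Geneseo, New York", "importance": 0.201},
    {"address": "Nassau Hall, Stony Brook University, New York", "importance": 0.201},
]

# Keep the candidate with the largest importance score.
best = max(candidates, key=lambda candidate: candidate["importance"])
print(best["address"])
```

Highest importance is only a heuristic: it favors well-known places, which will not always be the place a particular text refers to.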


## Geocoding a CSV file list of multiple locations with `pandas`

What if we had a CSV list of multiple place names, like the one we created in Research Exercise #9 using NER?

To geocode every location in a CSV file, we can load the file with pandas, write a Python function, and `.apply()` it to every row.

In [14]:
import pandas as pd
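# Use the full option names; the short form "max_rows" is ambiguous in newer pandas versions
pd.set_option("display.max_rows", 400)
pd.set_option("display.max_colwidth", 400)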

### Read in a CSV with a list of places
Let's read in our CSV with named entities in *The Adventures of Sherlock Holmes* as detected by BookNLP:

In [18]:
sherlock_holmes_df = pd.read_csv("../_week9/sherlock_holmes/adventures_of_sherlock_holmes.entities", delimiter="\t")

In [19]:
sherlock_holmes_df

Unnamed: 0,COREF,start_token,end_token,prop,cat,text
0,240,3,4,PROP,PER,Sherlock Holmes
1,241,6,8,PROP,PER,Arthur Conan Doyle
2,1,14,15,PROP,FAC,Bohemia II
3,-1,17,22,NOM,FAC,The Red - Headed League III
4,2,35,39,PROP,VEH,The Five Orange Pips VI
...,...,...,...,...,...,...
18063,-1,127638,127638,PRON,PER,she
18064,-1,127644,127648,NOM,FAC,a private school at Walsall
18065,239,127648,127648,PROP,GPE,Walsall
18066,0,127651,127651,PRON,PER,I


#### Filter our Data Frame so we just have a list of geo-political place names 
We're going to filter the dataframe so that it contains only rows in the GPE (geopolitical entity) category.

In [82]:
GPE_cat_in_sherlock_holmes_df = sherlock_holmes_df[sherlock_holmes_df['cat'] == 'GPE']
GPE_cat_in_sherlock_holmes_df

Unnamed: 0,COREF,start_token,end_token,prop,cat,text
9,4,99,99,PROP,GPE,BOHEMIA
62,6,509,509,PROP,GPE,Odessa
65,7,531,531,PROP,GPE,Trincomalee
68,8,551,551,PROP,GPE,Holland
83,9,657,657,PROP,GPE,Scarlet
...,...,...,...,...,...,...
17999,10,127281,127281,PROP,GPE,London
18033,230,127474,127474,PROP,GPE,Winchester
18049,237,127578,127578,PROP,GPE,Southampton
18054,238,127599,127599,PROP,GPE,Mauritius


In [67]:
just_place_names_in_sherlock_holmes = GPE_cat_in_sherlock_holmes_df[['cat', 'text']]
just_place_names_in_sherlock_holmes


Unnamed: 0,cat,text
9,GPE,BOHEMIA
62,GPE,Odessa
65,GPE,Trincomalee
68,GPE,Holland
83,GPE,Scarlet
...,...,...
17999,GPE,London
18033,GPE,Winchester
18049,GPE,Southampton
18054,GPE,Mauritius
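Note that `just_place_names_in_sherlock_holmes` is built by selecting from another dataframe, so adding new columns to it later can trigger pandas' `SettingWithCopyWarning`, depending on the pandas version. One common way to avoid this is to take an explicit `.copy()` of the selection. A minimal sketch, assuming a small toy dataframe rather than the BookNLP output (the `toy_` names and the `text_length` column are hypothetical):

```python
import pandas as pd

toy_entities_df = pd.DataFrame(
    {"cat": ["PER", "GPE", "GPE"], "text": ["Sherlock Holmes", "London", "Odessa"]}
)

# .copy() makes an independent dataframe, so adding columns is unambiguous.
toy_places_df = toy_entities_df[toy_entities_df["cat"] == "GPE"][["cat", "text"]].copy()
toy_places_df["text_length"] = toy_places_df["text"].str.len()

print(toy_places_df)
```

The original `toy_entities_df` is left untouched; only the copy gains the new column.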


### Create a function that uses `geolocator.geocode()`

Here we make a function with `geolocator.geocode()` and ask it to return the address, lat/lon, and importance score:

In [22]:
def find_location(row):
    
    place = row['text']
    
    location = geolocator.geocode(place)
    
    if location is not None:
        return location.address, location.latitude, location.longitude, location.raw['importance']
    else:
        return "Not Found", "Not Found", "Not Found", "Not Found"
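Two practical concerns arise when a function like `find_location` is applied to thousands of rows: the same place name (London, for example) is looked up over and over, and the public Nominatim service asks clients to limit their request rate (its usage policy sets a maximum of one request per second). A minimal sketch of one approach, caching results per unique name and pausing before each uncached lookup, is shown below. `make_cached_geocoder` and `fake_geocode` are hypothetical names; `fake_geocode` stands in for `geolocator.geocode` so the example runs without network access.

```python
import time


def make_cached_geocoder(geocode_fn, min_delay_seconds=1.0):
    """Wrap a geocoding function so each distinct name is looked up once."""
    cache = {}

    def cached_geocode(place_name):
        if place_name not in cache:
            time.sleep(min_delay_seconds)  # pause before each real lookup
            cache[place_name] = geocode_fn(place_name)
        return cache[place_name]

    return cached_geocode


lookup_log = []


def fake_geocode(place_name):
    """Stand-in for geolocator.geocode; records each call it receives."""
    lookup_log.append(place_name)
    return {"London": (51.5073, -0.127647)}.get(place_name)


# The delay is set to 0 here only so the demonstration finishes instantly.
geocode_once = make_cached_geocoder(fake_geocode, min_delay_seconds=0)
results = [geocode_once(name) for name in ["London", "London", "Atlantis"]]
print(results)
print(lookup_log)
```

GeoPy also ships a `RateLimiter` helper in `geopy.extra.rate_limiter`, which may be a better fit for real use than this hand-rolled version.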

### Apply our `find_location` fuction to our dataframe to geocode multiple place names

Now let’s `.apply()` the function we created above, `find_location`, to the `just_place_names_in_sherlock_holmes` dataframe and see what results Nominatim’s geocoding service returns. Run the cell below.

In [68]:
just_place_names_in_sherlock_holmes[['address', 'lat', 'lon', 'importance']] = just_place_names_in_sherlock_holmes.apply(find_location, axis="columns", result_type="expand")
just_place_names_in_sherlock_holmes

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]


Unnamed: 0,cat,text,address,lat,lon,importance
9,GPE,BOHEMIA,"Bohemia, Suffolk County, New York, 11716, United States",40.7702,-73.1198,0.497461
62,GPE,Odessa,"Одеса, Одеська міська громада, Одеський район, Одеська область, Україна",46.4859,30.6836,0.656234
65,GPE,Trincomalee,"திருகோணமலை, තිරිකුණාමළය දිස්ත්‍රික්කය, கிழக்கு மாகாணம், 32100, ශ්‍රී ලංකාව இலங்கை",8.57643,81.2345,0.47141
68,GPE,Holland,Nederland,52.1552,5.38721,0.829417
83,GPE,Scarlet,"Scarlet, Mingo County, West Virginia, United States",37.7759,-82.1471,0.3793
...,...,...,...,...,...,...
17999,GPE,London,"London, Greater London, England, United Kingdom",51.5073,-0.127647,0.940783
18033,GPE,Winchester,"Winchester, Hampshire, South East England, England, SO23 9LF, United Kingdom",51.0613,-1.31317,0.661129
18049,GPE,Southampton,"Southampton, South East England, England, SO14 2BY, United Kingdom",50.9025,-1.40419,0.716154
18054,GPE,Mauritius,Mauritius,-20.2759,57.5704,0.779831
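Some of these matches deserve a second look: BOHEMIA has been matched to Bohemia, New York, although the story presumably means the historical region in Central Europe, and Scarlet may not be a place name at all. One rough way to surface candidates for manual checking is to flag rows whose importance score falls below some threshold. A minimal sketch, assuming the rows shown above copied into plain tuples and an arbitrary, hypothetical cutoff of 0.5:

```python
# (place name, importance) pairs copied from the output above.
geocoded_rows = [
    ("BOHEMIA", 0.497461),
    ("Odessa", 0.656234),
    ("Trincomalee", 0.47141),
    ("Holland", 0.829417),
    ("Scarlet", 0.3793),
    ("London", 0.940783),
    ("Winchester", 0.661129),
    ("Southampton", 0.716154),
    ("Mauritius", 0.779831),
]

REVIEW_THRESHOLD = 0.5  # arbitrary cutoff chosen for illustration

# Collect the names whose score falls below the cutoff.
needs_review = [name for name, importance in geocoded_rows if importance < REVIEW_THRESHOLD]
print(needs_review)
```

A low score does not mean a match is wrong (Trincomalee appears to be matched correctly here), and a high score does not guarantee it is right, so a flag like this is a prompt for human checking rather than a verdict.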


## Creating Interactive Maps
We're going to use a Python library called `Folium` to visualize our coordinate points on maps from OpenStreetMap.

### Install Folium

In [38]:
!pip install folium



### Import Folium

In [39]:
import folium

### Make a Base Map
First, we need to establish a base map. This is where we’ll map our geocoded locations from *The Adventures of Sherlock Holmes*. To do so, we’re going to call `folium.Map()` and enter the general latitude/longitude coordinates of the Marylebone area of London at a particular zoom level.

(To find latitude/longitude coordinates for a particular location, you can use Google Maps, or geocode the place name with Nominatim as in the cell below.)

In [119]:
marylebone = geolocator.geocode("Marylebone")

In [120]:
marylebone

Location(Marylebone, City of Westminster, Greater London, England, NW1 5LQ, United Kingdom, (51.5232789, -0.155596, 0.0))

In [106]:
sherlock_holmes_map = folium.Map(location=[51.52, -0.16], zoom_start=14)
sherlock_holmes_map

### Add a Marker

Adding a marker to a map is easy with Folium! We’ll simply call `folium.Marker()` at a particular lat/lon, enter some text to display when the marker is clicked on, and then add it to our base map.

In [121]:
folium.Marker(location=[51.5233879, -0.1582367], popup="221B Baker Street").add_to(sherlock_holmes_map)
sherlock_holmes_map


## Add Markers from dataframe

To add markers for every location in our Pandas dataframe, we can make a Python function and `.apply()` it to every row in the dataframe.

In [122]:
def create_map_markers(row, map_name):
    folium.Marker(location=[row['lat'], row['lon']], popup=row['text']).add_to(map_name)

Before we apply this function to our dataframe, we’re going to drop any locations that were “Not Found” (which would cause `folium.Marker()` to raise an error).

In [123]:
found_sherlock_holmes_locations = just_place_names_in_sherlock_holmes[just_place_names_in_sherlock_holmes['address'] != "Not Found"]
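Filtering on the string "Not Found" works for the output of `find_location`. A slightly more general check, sketched here under the assumption that a row is plottable only when both values are numbers within the valid latitude and longitude ranges, is shown below. `is_plottable` is a hypothetical helper, not part of Folium or pandas.

```python
def is_plottable(lat, lon):
    """Return True when lat/lon are numbers within valid coordinate ranges."""
    if not isinstance(lat, (int, float)) or not isinstance(lon, (int, float)):
        return False
    return -90 <= lat <= 90 and -180 <= lon <= 180


print(is_plottable(51.5073, -0.127647))
print(is_plottable("Not Found", "Not Found"))
print(is_plottable(120.0, 10.0))
```

With a dataframe, a check like this could be applied row by row, for example inside a function passed to `.apply()` with `axis='columns'`.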

In [124]:
found_sherlock_holmes_locations.apply(create_map_markers, map_name=sherlock_holmes_map, axis='columns')
sherlock_holmes_map

### To save map as an HTML file

In [22]:
sherlock_holmes_map.save("Sherlock_Holmes-Place-Names-map.html")

## Another Example: Mapping place names from the opening of *Robinson Crusoe* (1719)

Daniel Defoe's *Robinson Crusoe* is often considered one of the first novels in English. The cell below repeats the workflow above for place names from its opening pages.

In [126]:
# Read in the dataframe
crusoe_df = pd.read_csv("robinson-crusoe-place-names-in-first-40-pages/robinson-crusoe-ner-pages-1-40.csv")
# Use Nominatim with our `find_location` function to get the addresses, latitudes, and longitudes of the place names
crusoe_df[['address', 'lat', 'lon', 'importance']] = crusoe_df.apply(find_location, axis="columns", result_type="expand")
# Set a base map
crusoe_map = folium.Map(location=[40.35, -74.65], zoom_start=2)
# Filter to keep only the locations that Nominatim found
found_crusoe_locations = crusoe_df[crusoe_df['address'] != "Not Found"]
# Add the remaining locations to our base map
found_crusoe_locations.apply(create_map_markers, map_name=crusoe_map, axis='columns')
crusoe_map
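The base maps above use a hand-picked center and zoom level. An alternative is to compute a bounding box from the geocoded points themselves. A minimal sketch, assuming a short list of (lat, lon) pairs copied from the Sherlock Holmes output earlier; `bounding_box` is a hypothetical helper. This simple min/max approach does not handle sets of points that straddle the 180° meridian.

```python
def bounding_box(points):
    """Return [[south, west], [north, east]] for a list of (lat, lon) pairs."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return [[min(lats), min(lons)], [max(lats), max(lons)]]


# London, Winchester, and Southampton, as geocoded earlier in this notebook.
points = [(51.5073, -0.127647), (51.0613, -1.31317), (50.9025, -1.40419)]
print(bounding_box(points))
```

Folium maps have a `fit_bounds()` method that accepts a pair of corners in this form, so the result could be passed to it to frame all of the markers at once.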

---


## Your Turn!

- Choose a text or group of texts. Create a CSV file with your group of place names.
    - These could be place names you've extracted through NER, by hand, or through some other means. (For instance, you might consider recording the location of the publisher, as detailed in one of the CSVs of bibliographic metadata for Amardeep Singh's [African American Literature corpus](https://github.com/sceckert/Data-and-Literary-Study-Spring2022/blob/main/_datasets/texts/literature/African-American-Literature-Text-Corpus/African-American-Literature-Corpus-Metadata-Amardeep-Singh.csv), or Singh's [Colonial South Asian Literature corpus](https://github.com/sceckert/Data-and-Literary-Study-Spring2022/blob/main/_datasets/texts/literature/Colonial-South-Asian-Literature-1850-1923/Colonial-South-Asian-Literature-1850-1923-Corpus-Metadata-Amardeep-Singh.csv).)
- Use GeoPy to geocode the place names in that CSV.
- Plot the results on a map using Folium.

### Step 1: Assemble a list of place names
Choose a text. Create a CSV file of place names, either by hand or by using NER.
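If you assemble the list by hand, one option is to write it out with Python's built-in `csv` module. A minimal sketch, assuming a single column named `text` (so that the `find_location` function above, which reads `row['text']`, works unchanged) and a hypothetical filename `my_place_names.csv`; the place names are just examples taken from this notebook.

```python
import csv
import tempfile
from pathlib import Path

place_names = ["London", "Winchester", "Southampton", "Mauritius"]

# Written to the temp directory for this demo; any writable path works.
output_path = Path(tempfile.gettempdir()) / "my_place_names.csv"

with open(output_path, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["text"])  # header matches the column find_location reads
    writer.writerows([name] for name in place_names)

# Read the file back to confirm its contents.
with open(output_path, newline="", encoding="utf-8") as csv_file:
    rows = list(csv.DictReader(csv_file))
print(rows)
```

The resulting file can then be loaded with `pd.read_csv(output_path)` and geocoded in Step 2.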

In [87]:
### Your code here

### Step 2: Geocode the place names

In [None]:
### Your code here

### Step 3: Create a map to visualize the geo-coded locations

In [None]:
### Your code here