# Explore Colenda Items by Place

[Colenda Digital Repository at Penn Libraries](https://colenda.library.upenn.edu/) is a digital repository for digitized and born-digital material. It provides direct access and long-term stewardship for these important resources. Much of Colenda’s content consists of materials owned and digitized by the Penn Libraries, including significant collections that have been donated.

In this notebook we'll explore the spatial dimensions of data harvested from Colenda. What places are associated with these items? To do that we'll extract the spatial data, see what's there, and create a few maps.

[See here](kaplan_explore_records.ipynb) for an introduction to exploring Colenda data, and [here to explore items in a collection over time](kaplan_explore_time.ipynb).


* [Import What We Need](#Import-What-We-Need)
* [Load the Data](#Load-the-Data)
* [Concatenate and Split `metadata.geographic_subject` Fields](#Concatenate-and-Split-`metadata.geographic_subject`-Fields)
* [Geocode `geographic_subject` Data with Nominatim](#Geocode-geographic_subject-Data-with-Nominatim)
* [Map Geographic Subjects on a World Map](#Map-Geographic-Subjects-on-a-World-Map)
* [Map Geographic Subjects on a US Map](#Map-Geographic-Subjects-on-a-US-Map)
* [Enrich Geographic Subject Data](#Enrich-Geographic-Subject-Data)
* [Filter Items by US State](#Filter-Items-by-US-State)
* [Count Items by US State](#Count-Items-by-US-State)
* [Map Geographic Subjects on a US State Map](#Map-Geographic-Subjects-on-a-US-State-Map)
* [Need Help?](#Need-Help?)
* [Credits](#Credits)

<div class="alert alert-block alert-warning">
<p><b>Yellow blocks like this provide additional information about Python and Jupyter notebooks.</b></p>
    
<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>

<p>
    Some tips:
    <ul>
        <li>Code cells have boxes around them.</li>
        <li>To run a code cell click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>
        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.</li>
        <li>In most cases you'll want to start from the top of the notebook and work your way down running each cell in turn. Later cells might depend on the results of earlier ones.</li>
        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>
    </ul>
</p>

<p><b>Is this thing on?</b> If you can't edit or run any of the code cells, you might be viewing a static (read only) version of this notebook. Click here to <a href="https://mybinder.org/v2/gh/GLAM-Workbench/national-museum-australia/master?urlpath=lab%2Ftree%2Fexplore_collection_object_over_time.ipynb">load a <b>live</b> version</a> running on Binder.</p>
</div>

## Import What We Need

<div>
    <p>In order to use this notebook, you first need to `import` modules and packages from Python.</p>
<div class="alert alert-block alert-warning">
<p>These modules and packages are units of code with specific tools or skills that we use in the script. If you're running this notebook on your own computer, you may need to install these packages in your Python environment before you can `import` them. Find assistance for that <a href="https://packaging.python.org/tutorials/installing-packages/">here</a>.</p>
    </div>
</div>

In [2]:
!pip install -r requirements.txt

# Pandas is a Python package that provides numerous tools for data analysis
import pandas as pd

# IpyLeaflet is a Python library that enables interactive geospatial data visualization in Jupyter Notebook
from ipyleaflet import Map, Marker, Popup, MarkerCluster, basemap_to_tiles, CircleMarker

# IpyWidgets is a Python library that enables interactive HTML widgets for Jupyter notebooks.
import ipywidgets as widgets

# Reverse Geocode is a Python module that takes latitude / longitude coordinate and returns the country and city 
import reverse_geocode

# Altair is a Python library for declarative statistical visualization
import altair as alt

# IPython is a Python interpreter to display content
from IPython.display import display, HTML, FileLink

# Vega_Datasets is a Python package for example datasets - we use it for maps
from vega_datasets import data as vega_data


## Load the Data

This pre-harvested dataset from Colenda includes many gifts of [Arnold and Deanne Kaplan](https://kaplan.exhibits.library.upenn.edu/thekaplans), which are concentrated in two collections. In this notebook we will only work with records from the Arnold and Deanne Kaplan Collection of **Early American Judaica**. We can select those items by filtering the `metadata.collection[1]` column on the collection's name.

In [None]:
# Convert to a dataframe
df = pd.read_csv("data/kaplan-test-data.csv", encoding= 'unicode_escape')

# Print the number of rows in the dataframe
print('There are {:,} items in this dataset from Colenda.'.format(df.shape[0]))

<div class="alert alert-block alert-warning">
<p>You may see a warning above stating that some columns have "mixed types". This means that those CSV columns contain a mix of value types, such as strings and integers, so pandas wasn't sure which type to assign to the column when building the dataframe. It's OK to ignore this warning for now; we may need to state the column types explicitly later.</p>
</div>

In [None]:
# Filter for items from the Early American Judaica Collection
df = df.loc[df['metadata.collection[1]'] == "Arnold and Deanne Kaplan Collection of Early American Judaica (University of Pennsylvania)"]

# Return the first 5 rows of the dataframe
df.head()

## Concatenate and Split `metadata.geographic_subject` Fields

Now that we have this dataset in a dataframe, we can manipulate it. This dataset contains **descriptive metadata** about the items in the collection. Descriptive metadata documents the intellectual content of a digital object and supports the search and discovery of items within Colenda. The most important descriptive metadata field is the unique identifier for the object. Other fields may include title, author, date of publication, subject, publisher, and description.

Locations are linked to item records through the `metadata.geographic_subject` columns. One item record could reference multiple locations - for example, an item from `United States -- Pennsylvania -- Philadelphia` would also be listed as `United States -- Pennsylvania`. For comparative and quantitative data analysis, we need to split those items into multiple rows instead of columns.
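To make the column layout concrete, here is a minimal sketch on a toy dataframe. The numbered column names (`metadata.geographic_subject[1]`, `metadata.geographic_subject[2]`) and the `unique_identifier` column are assumptions modelled on the `metadata.collection[1]` column used above; the real export may have more or differently numbered columns.

```python
import pandas as pd

# Toy stand-in for the Colenda export; the numbered column names are an assumption
toy = pd.DataFrame({
    "unique_identifier": ["item-1", "item-2"],
    "metadata.geographic_subject[1]": [
        "United States -- Pennsylvania -- Philadelphia",
        "United States -- New York",
    ],
    "metadata.geographic_subject[2]": ["United States -- Pennsylvania", None],
})

# Collect every column whose name starts with the shared prefix
geo_columns = [c for c in toy.columns if c.startswith("metadata.geographic_subject")]
print(geo_columns)

# Count how many of those columns are filled in for each item
places_per_item = toy[geo_columns].notna().sum(axis=1)
print(places_per_item.tolist())
```

Running the same two steps on `df` should show how many geographic subject columns the dataset has and how many places a typical item references.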

We'll write two **functions** to help us do that:
* `tidy_split` splits the values of each cell on a "|" so that there is one split value per row
* `tidy_concat` concatenates (combines) the values of columns that begin with a similar phrase into one cell with a "|" before using `tidy_split`. 

Now instead of having one row for each item with multiple linked locations, we can have one row for each linked location associated with an item.

The `tidy_split` function comes from [Project Cognoma](http://cognoma.org/).
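As a rough illustration of the one-row-per-location idea, the sketch below uses pandas' built-in `str.split` and `explode` on a made-up two-row dataframe. It is not the notebook's `tidy_split` function, only a compact way to see the same reshaping; the `title` column and its values are hypothetical.

```python
import pandas as pd

# Toy data: each cell holds one or more places joined with "|"
toy = pd.DataFrame({
    "title": ["Letter", "Receipt"],
    "geographic_subject": [
        "United States -- Pennsylvania -- Philadelphia|United States -- Pennsylvania",
        "United States -- New York",
    ],
})

# Split each cell into a list of places, then give every place its own row
tidy = (
    toy.assign(geographic_subject=toy["geographic_subject"].str.split("|"))
       .explode("geographic_subject")
       .reset_index(drop=True)
)
print(tidy)
```

The item titled "Letter" now appears twice, once for each place it is linked to, while the other columns are repeated unchanged.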

<div class="alert alert-block alert-warning">
<p>A function is a block of reusable code that is used to perform a single, related action. Learn more about functions <a href="https://www.w3schools.com/python/python_functions.asp">here</a>.</p>
</div>

In [None]:
# Split the values of a column and expand so that the new DataFrame has one split value per row. Filters rows where column is empty 
def tidy_split(df, column, sep='|', keep=False):
    """
    Params
    ------
    df : pandas.DataFrame
        dataframe with the column to split and expand
    column : str
        the column to split and expand
    sep : str
        the string used to split the column's values
    keep : bool
        whether to retain the presplit value as its own row

    Returns
    -------
    pandas.DataFrame
        Returns a dataframe with the same columns as `df`.
    """
    indexes = list()
    new_values = list()
    df = df.dropna(subset=[column])
    for i, presplit in enumerate(df[column].astype(str)):
        values = presplit.split(sep)
        if keep and len(values) > 1:
            indexes.append(i)
            new_values.append(presplit)
        for value in values:
            indexes.append(i)
            new_values.append(value)
    new_df = df.iloc[indexes, :].copy()
    new_df[column] = new_values
    return new_df

In [None]:
# Concatenate the values of columns beginning with a string and then use the tidy_split function to expand so that the new DataFrame has one split value per row
def tidy_concat(df, column_starts_with, sep="|"):
    """
    Params
    ------
    df : pandas.DataFrame
        dataframe with the columns to split and expand
    column_starts_with : str
        the string at the beginning of the column(s) to split
    sep : str
        the string used to split the column's values

    Returns
    -------
    pandas.DataFrame
        Returns a dataframe with the same columns as `df`.
    """
    list_of_columns = df.columns.to_list()
    columns_to_concat = [x for x in list_of_columns if x.startswith(column_starts_with)]
    column_starts_with = column_starts_with.split('.')[1]
    df[column_starts_with] = df[columns_to_concat[0]]
    for column in columns_to_concat[1:]:
        df[column_starts_with] = df[column_starts_with].astype(str) + sep + df[column].astype(str)
    new_df = tidy_split(df, column_starts_with, sep=sep)
    new_df = new_df.drop(columns_to_concat, axis=1)
    return new_df

In [None]:
# Use the function to concatenate the `metadata.geographic_subject` columns and expand them so that the new DataFrame has one place per row
df_places = tidy_concat(df, 'metadata.geographic_subject', sep='|')

# Report the dimensionality of the new dataframe (number of rows, number of columns)
df_places.shape

### How many places are recorded in the Kaplan Collection?
This list will include many duplicates as more than one object will be linked to a particular location. Let's drop duplicates based on the `geographic_subject` and count how many there are.

In [None]:
# Keep only items that do not have a null value in the `geographic_subject` column
df_places = df_places[df_places['geographic_subject'].notna()]

In [None]:
# Drop the duplicate geographic subjects from the dataframe and report the dimensionality of the dataframe.
value = df_places.drop_duplicates(subset=['geographic_subject']).shape[0]

print('There are {:,} unique locations represented in the collection.'.format(value))

## Geocode `geographic_subject` Data with Nominatim

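Let's put the locations on a map. First, we'll have to geocode these entries, that is, look up coordinates for each place name, using Nominatim.
Nominatim's usage policy asks each application to identify itself. This notebook uses an email address as that identifier; you can enter yours below: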

In [None]:
email = input()

<div>
    <p>In order to use this portion of the notebook, you first need to `import` modules and packages from Python for working with geographic data.</p>
<div class="alert alert-block alert-warning">
<p>These modules and packages are units of code with specific tools or skills that we use in the script. If you're running this notebook on your own computer, you may need to install these packages in your Python environment before you can `import` them. Find assistance for that <a href="https://packaging.python.org/tutorials/installing-packages/">here</a>.</p>
    </div>
</div>

In [None]:
# Geopandas is a Python package for working with geospatial data
import geopandas

# Geopy is a Python client for several popular geocoding web services
import geopy

# From Geopy, import AsyncRateLimiter & RateLimiter to perform bulk operations gracefully
from geopy.extra.rate_limiter import AsyncRateLimiter, RateLimiter

# From Geopy, import Nominatim, a specific geocoder for OpenStreetMap data
from geopy.geocoders import Nominatim

# NumPy is a library for adding support for manipulating and operating on large, multi-dimensional arrays
import numpy as np

Now that we have imported these modules and packages, let's prepare our dataset. Let's drop all the duplicate locations from the `geographic_subject` column and add all the linked locations to a list called `list_of_places`.

In [None]:
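# Replace empty strings and 'nan' placeholders (left over from concatenating empty columns) in the `geographic_subject` column with np.nan
df_places['geographic_subject'] = df_places['geographic_subject'].replace(['', 'nan'], np.nan)

# Drop the rows that have no geographic subject
df_places = df_places.dropna(subset=['geographic_subject'])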

# Return a list of the values in the `Geographic Subject` column and convert into a set, removing duplicate values 
list_of_places = set(df_places['geographic_subject'].to_list())

Now that we have this list of locations, we will submit them to **Nominatim**, a geocoding service built on OpenStreetMap data. We'll submit each location to Nominatim and save the latitude, longitude, and combined coordinates for use in some visualizations.

In [None]:
# Set Nominatim as the geocoding web service
locator = Nominatim(user_agent=email)

# Wrap the geocoder so requests are spaced at least one second apart, in line with Nominatim's usage policy
geocode = RateLimiter(locator.geocode, min_delay_seconds=1)

# Create dictionaries for Latitude, Longitude, and Coordinates values
lat_dict = {}
lon_dict = {}
coords_dict = {}

# For each place in list_of_places:
for place in list_of_places:
    # Look up the place; the result is None if nothing is found
    location = geocode(place)
    if location:
        # Add a new key/value pair to each dictionary, where the key is the place and the value is the latitude, the longitude, or both
        coords_dict[place] = (location.latitude, location.longitude)
        lat_dict[place] = location.latitude
        lon_dict[place] = location.longitude
    else:
        # Record missing values for places that could not be geocoded
        coords_dict[place] = np.nan
        lat_dict[place] = np.nan
        lon_dict[place] = np.nan

# Map the dictionary into the Coordinates column by matching the Geographic Subject key
df_places['Coordinates'] = df_places['geographic_subject'].map(coords_dict)
df_places['Latitude'] = df_places['geographic_subject'].map(lat_dict)
df_places['Longitude'] = df_places['geographic_subject'].map(lon_dict)

# Return the first 5 lines of the df_places dataframe.
df_places.head()
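Not every place name will be found by the geocoder, so it can help to list the ones that came back empty before mapping. The sketch below shows one way to do that on a toy dataframe with the same `geographic_subject`, `Latitude`, and `Longitude` columns; the place names and coordinates are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_places after geocoding; values are illustrative only
geocoded = pd.DataFrame({
    "geographic_subject": [
        "United States -- Pennsylvania",
        "Unknown place",
        "United States -- Pennsylvania",
    ],
    "Latitude": [40.97, np.nan, 40.97],
    "Longitude": [-77.73, np.nan, -77.73],
})

# List the distinct places that have no latitude, i.e. were not geocoded
not_found = sorted(geocoded.loc[geocoded["Latitude"].isna(), "geographic_subject"].unique())
print(not_found)

# Share of rows that did receive coordinates
share_found = geocoded["Latitude"].notna().mean()
print(f"{share_found:.0%} of rows have coordinates")
```

Applied to `df_places`, the same two steps would show which subject strings may need cleaning or manual lookup.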

## Map Geographic Subjects on a World Map

Now that we have coordinates for these locations, let's make a world map. We'll be using the world map boundaries from the data provided by the `Vega datasets` Python package installed earlier.

In [None]:
# By default, Altair raises an error for datasets with more than 5,000 rows, so we're limiting this map to just one entry per Geographic Subject
map1 = df_places.drop_duplicates(subset='geographic_subject')

In [None]:
# Load the country boundaries data
countries = alt.topo_feature(vega_data.world_110m.url, feature='countries')

# Create the world map using the boundaries
background = alt.Chart(countries).mark_geoshape(
    fill='lightgray',
    stroke='white'
).project('equirectangular').properties(width=700)

# Plot the positions of places using circles
points = alt.Chart(map1).mark_circle(
    
    # Style the circles
    size=10,
    color='steelblue'
).encode(
    
    # Provide the coordinates
    longitude='Longitude:Q',
    latitude='Latitude:Q',
    
    # Provide the location name on hover
    tooltip=[alt.Tooltip('geographic_subject', title='Location')]
).properties(width=700)

# Layer the plotted points on top of the background map
alt.layer(background, points)

### Why are there items outside of America? 

The collection is called **Early American Judaica**, so it makes sense that the bulk of the locations fall within the present-day United States of America. How many of them do not?

In [None]:
# Remove the 'nan' placeholder rows, then create a new dataframe of rows whose Geographic Subject does not start with "United States"
df_places = df_places[df_places['geographic_subject']!='nan']
df_not_usa = df_places[~df_places['geographic_subject'].str.startswith("United States")]

'{:.2%} of linked places are located outside the current-day United States of America'.format(df_not_usa.shape[0]/df_places.shape[0])
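To see which regions account for the non-US share, one option is to take the first segment of each subject string, since the headings are written as ` -- `-separated hierarchies. This is a minimal sketch on a made-up series; the place names are illustrative and not drawn from the collection.

```python
import pandas as pd

# Toy subject strings in the "Country -- Region -- City" style
places = pd.Series([
    "United States -- Pennsylvania -- Philadelphia",
    "United States -- New York",
    "Jamaica -- Kingston",
    "England -- London",
    "Jamaica",
], name="geographic_subject")

# Keep only the first segment of each hierarchy
top_level = places.str.split(" -- ").str[0]

# Count how often each top-level region appears
counts = top_level.value_counts()
print(counts)
```

Running this on `df_not_usa['geographic_subject']` would give a quick ranking of the regions outside the United States.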

## Enrich Geographic Subject Data

The `geographic_subject` field now contains a string that often (but not always) includes information about the city, state, and country of an item. We could split these into separate columns by hand, but an alternative is to use the geo-coordinates. Through a process known as reverse-geocoding, we can look up the city, state, and country that contain a set of coordinates and store each in a separate column.

In [None]:
# Create a list of the values in the `Coordinates` column and convert it into a set
list_of_coords = set(df_places['Coordinates'].to_list())

# Create a dictionary for City, State, and Country values.
city_dict = {}
state_dict = {}
country_dict = {}

# Wrap the reverse geocoder so requests are spaced at least one second apart
reverse = RateLimiter(locator.reverse, min_delay_seconds=1)

# Start a for-loop to iterate over each entry in list_of_coords
for coord in list_of_coords:
    # Geocoded entries are (latitude, longitude) tuples; missing entries are NaN
    if isinstance(coord, tuple):
        # Use locator to reverse-geocode
        location = reverse(coord)
        
        # Receive the raw location data (an empty dict if nothing was found) and then save each specific value to its respective dictionary, with the coordinates as the key and the city/state/country as value
        address = location.raw.get('address', {}) if location else {}
        city_dict[coord] = address.get('city', '')
        state_dict[coord] = address.get('state','')
        country_dict[coord] = address.get('country','')
    else: 
        # If the entry is blank or Not a Number (nan), save the value as np.nan
        city_dict[coord] = np.nan
        state_dict[coord] = np.nan
        country_dict[coord] = np.nan
        
        
# Make a new column for City, State, and Country by mapping values from the corresponding dictionary according to Coordinates column as the key        
df_places['City'] = df_places['Coordinates'].map(city_dict)
df_places['State'] = df_places['Coordinates'].map(state_dict)
df_places['Country'] = df_places['Coordinates'].map(country_dict)

Did it work? Let's look at the `Country` values.

In [None]:
# Return a Series containing counts of unique rows in the dataframe for each Country
#df_places['Country'].value_counts()
#df_places['State'].value_counts()[:25]
df_places['Country'].value_counts()[:25]
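If many `City` values come back empty, one possible cause is that Nominatim files smaller settlements under address keys such as `town`, `village`, or `hamlet` rather than `city`. The helper below is a hypothetical sketch (the name `pick_settlement` and the sample address are not part of the notebook) showing how a fallback across those keys could work.

```python
def pick_settlement(address):
    """Return the first settlement-like value found in a Nominatim-style address dict."""
    # Try the keys from largest to smallest kind of settlement
    for key in ("city", "town", "village", "hamlet"):
        if address.get(key):
            return address[key]
    # Nothing matched: return an empty string, as the notebook does for missing cities
    return ""


# Hand-written sample address; a real response may contain more keys
sample_address = {"town": "Exampletown", "state": "Pennsylvania", "country": "United States"}
print(pick_settlement(sample_address))
```

In the reverse-geocoding loop, a call like `pick_settlement(address)` could stand in for `address.get('city', '')`.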

## Map Geographic Subjects on a US Map

Now that we have these additional columns, we can use them to filter our data. Let's look at the places located in the United States of America.

First we'll filter our data by `Country`.

In [None]:
# Locate the rows that have a `Country` column value of 'United States'. 
df_places_us = df_places.loc[(df_places['Country'] == 'United States')]
df_places_us.shape

Now we can create a map. Note that we're changing the map layer in this chart to use just United States boundaries.

In [None]:
# Remove duplicate places
places = df_places_us[['geographic_subject', 'Latitude', 'Longitude']].drop_duplicates()

# Load United States boundaries
us = alt.topo_feature(vega_data.us_10m.url, feature='states')

# Create the map of United States using the boundaries
us_background = alt.Chart(us).mark_geoshape(
    
    # Style the map
    fill='lightgray',
    stroke='white'
).project('equirectangular').properties(width=700)

# Plot the places
points = alt.Chart(places).mark_circle(
    
    # Style circle markers
    size=10,
    color='steelblue'
).encode(
    
    # Set position of each place using lat and lon
    longitude='Longitude:Q',
    latitude='Latitude:Q',
    
    # Provide location details on hover
    tooltip=[alt.Tooltip('geographic_subject', title='Location')]
).properties(width=700)

# Combine map and points
alt.layer(us_background, points)

## Count Items by US State

So far we've only looked at the places themselves, but we can also find out how many items are associated with each place. To do this, we'll group the items by `State` and count the number of grouped items.

In [None]:
# Count the number of grouped records and add as a 'State-Count' column
df_places_us['State-Count'] = df_places_us['State'].map(df_places_us['State'].value_counts())
df_places_us.columns.values

# Drop the `City` column from the dataframe
df_places_states = df_places_us.drop(['City'], axis=1)
df_places_states

# Drop rows with duplicate values from the `State` Column
map2 = df_places_states.drop_duplicates(subset='State')
map2.shape

<div class="alert alert-block alert-warning">
<p>You may see a warning above that says 'A value is trying to be set on a copy of a slice from a DataFrame.' pandas wants to make sure that you intended to make those changes to a copy rather than the original dataframe. In this case, you did intend to make those changes.</p>
</div>
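
If you'd rather not see that warning, one common approach is to take an explicit copy when filtering. The sketch below uses a small, made-up dataframe (`df_demo` and its values are hypothetical) and assumes the same `Country` and `State` column names used in this notebook.

```python
import pandas as pd

# Hypothetical sample data for illustration only
df_demo = pd.DataFrame({
    'State': ['Pennsylvania', 'New York', 'Ontario', 'Pennsylvania'],
    'Country': ['United States', 'United States', 'Canada', 'United States'],
})

# .copy() makes the filtered dataframe independent of the original,
# so adding a column to it is unambiguous
df_demo_us = df_demo.loc[df_demo['Country'] == 'United States'].copy()

# Count how many rows share each State and attach that count to every row
df_demo_us['State-Count'] = df_demo_us['State'].map(df_demo_us['State'].value_counts())
print(df_demo_us)
```

The original `df_demo` is left unchanged; only the copy gains the `State-Count` column.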

Now we can map the results.

In [None]:
# Load United States boundaries
usa = alt.topo_feature(vega_data.us_10m.url, feature='states')

# Create the map of the United States
us_background = alt.Chart(usa).mark_geoshape(
    fill='lightgray',
    stroke='white'
).project('equirectangular').properties(width=700)

# First we'll plot the created places
points = alt.Chart(map2).mark_circle().encode(
    
    # Position the circles
    longitude='Longitude:Q',
    latitude='Latitude:Q',
    
    # Hover for more details about the number of items and the location
    tooltip=[alt.Tooltip('State-Count:Q', title='Number of Items'), alt.Tooltip('State', title='State')],
    
    # The size of the circles is determined by the number of objects
    size=alt.Size('State-Count:Q',
        scale=alt.Scale(range=[0, 1000]),
        legend=alt.Legend(title='Number of Items')
    )
).properties(height = 700, width=700)

# Create a map by combining the background map and the points
map_us_counts = alt.layer(us_background, points)
map_us_counts

Now we'll show this as a choropleth map, which uses shading to show the number of items associated with each state.

In [None]:
# We'll use the US map as a background to the choropleth, otherwise states with no items will be invisible!
background = alt.Chart(usa).mark_geoshape(
    fill='lightgray',
    stroke='white'
).project('equirectangular').properties(width=700)

# Chart the item counts by state - once again we use the US boundaries to define states
choro = alt.Chart(usa).mark_geoshape(
    stroke='white'
).encode(
    
    # Color is determined by the number of objects
    color=alt.Color('State-Count:Q', scale=alt.Scale(scheme='greenblue'), legend=alt.Legend(title='Number of Items')),
    
    # Hover for more details about the number of items and the state
    tooltip=[alt.Tooltip('State:N', title='State'), alt.Tooltip('State-Count:Q', title='Number of Items')]
    
    # This is the critical section that links the map to the object data
).transform_lookup(
    
    # This is the field that contains the state ids in the boundaries file
    lookup='id',
    
    # This is where we link the dataframe with the counts by state
    # The numeric field is the state identifier and will be used to connect data with state
    # We also need the count and state fields
    from_=alt.LookupData(map2, 'numeric', ['State-Count', 'State'])
).project('equirectangular').properties(width=700, title='States Associated with Items')

# Create the map by combining the background and the choropleth
created_choro = alt.layer(background, choro)
created_choro
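
The lookup in this cell only works if the dataframe has a `numeric` column whose values match the `id` field in the boundaries file. For the US states file, those ids are, as far as we know, the states' FIPS codes. Here's a hedged sketch of one way to add such a column. The `STATE_FIPS` dictionary is a small illustrative subset, and the counts in `state_counts` are invented for the example.

```python
import pandas as pd

# Illustrative subset of state FIPS codes; a real lookup would need every state
STATE_FIPS = {'New York': 36, 'Pennsylvania': 42, 'South Carolina': 45}

# Hypothetical counts, invented for illustration
state_counts = pd.DataFrame({
    'State': ['Pennsylvania', 'New York', 'Puerto Rico'],
    'State-Count': [120, 45, 3],
})

# Add the identifier column that the lookup will join on
state_counts['numeric'] = state_counts['State'].map(STATE_FIPS)

# List any states that did not get an id, so we know what will be missing from the map
unmatched = state_counts[state_counts['numeric'].isna()]['State'].tolist()

# Keep only matched rows and store the ids as integers
lookup_ready = state_counts.dropna(subset=['numeric']).astype({'numeric': int})
print(lookup_ready)
print('No id found for:', unmatched)
```

Checking `unmatched` before mapping helps explain why a state might stay gray even though it has items.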

### Save Counts by State as a CSV File

It might be handy to have a CSV file that shows the count of items by `State`. Let's save it.

In [None]:
# Save one row per state to a comma-separated values (CSV) file.
df_places_us[['State', 'State-Count']].drop_duplicates().to_csv('data/kaplan_item_counts_by_state.csv', index=False)

# Display a link to the CSV.
display(FileLink('data/kaplan_item_counts_by_state.csv'))
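
As an alternative, you could build a compact counts table with one row per state before saving. This is only a sketch: the `items` dataframe is made up, and the CSV is written to an in-memory buffer rather than a file so you can inspect the result right away.

```python
import io
import pandas as pd

# Hypothetical sample data; None stands in for a place with no state
items = pd.DataFrame({'State': ['Pennsylvania', 'New York', 'Pennsylvania', None]})

# Group by State, count the rows in each group, and sort from most to fewest items
state_counts = (
    items.groupby('State')
    .size()
    .reset_index(name='State-Count')
    .sort_values('State-Count', ascending=False)
)

# Write the CSV to a text buffer instead of a file
buffer = io.StringIO()
state_counts.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Rows without a state are dropped by `groupby` by default, so they don't appear in the saved counts.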

## Map Geographic Subjects on a US State Map

We can zoom in on Pennsylvania by using the `State` to filter the data. As before, we can then group by place and count the number of objects in each group.

In [None]:
# Locate the rows that have the value of Pennsylvania in the `State` column
df_places_pa = df_places_us.loc[(df_places_us['State'] == 'Pennsylvania')]

# Count the number of grouped records by City and add as a 'City-Count' column
df_places_pa['City-Count'] = df_places_pa['City'].map(df_places_pa['City'].value_counts())

# Drop duplicates in the `City` column
map3 = df_places_pa.drop_duplicates(subset='City')

# Create a list of columns to keep
col_list = ['geographic_subject', 'City-Count','Latitude','Longitude']

# Save the dataframe to contain only the columns in col_list
map3 = map3[col_list]

# Drop any geographic subject that is only 'United States -- Pennsylvania'
map3 = map3[map3["geographic_subject"]!='United States -- Pennsylvania']
map3.shape
map3

<div class="alert alert-block alert-warning">
<p>You may see a warning above that says 'A value is trying to be set on a copy of a slice from a DataFrame.' pandas wants to make sure that you intended to make those changes to a copy rather than the original dataframe. In this case, you did intend to make those changes.</p>
</div>
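
It's worth knowing that `drop_duplicates(subset='City')` keeps the first row it finds for each city, so the coordinates and `geographic_subject` shown on the map come from that first row. Here's a small sketch with made-up rows (the `pa_demo` values are hypothetical) to show the effect.

```python
import pandas as pd

# Hypothetical sample data for illustration only
pa_demo = pd.DataFrame({
    'geographic_subject': [
        'United States -- Pennsylvania -- Philadelphia',
        'United States -- Pennsylvania -- Pittsburgh',
        'United States -- Pennsylvania -- Philadelphia',
    ],
    'City': ['Philadelphia', 'Pittsburgh', 'Philadelphia'],
    'Latitude': [39.95, 40.44, 39.96],
    'Longitude': [-75.16, -80.00, -75.17],
})

# Attach the number of rows per city to every row
pa_demo['City-Count'] = pa_demo['City'].map(pa_demo['City'].value_counts())

# Keep one row per city; pandas keeps the first occurrence by default
one_per_city = pa_demo.drop_duplicates(subset='City')
print(one_per_city)
```

Because the count was added before the duplicates were dropped, each remaining row still carries the full count for its city.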

Since we're focusing on Pennsylvania, we need to load a map of Pennsylvania. We'll use this TopoJSON file of county boundaries.

In [None]:
pa_geojson = 'https://raw.githubusercontent.com/deldersveld/topojson/master/countries/us-states/PA-42-pennsylvania-counties.json'

Now we can map the results!

In [None]:
# Load PA state boundaries
pennsylvania = alt.topo_feature(pa_geojson, 'cb_2015_pennsylvania_county_20m')

# Make map of Pennsylvania
pa_background = alt.Chart(pennsylvania).mark_geoshape(
    fill='lightgray',
    stroke='white'
).project('equirectangular').properties(width=700)

# Plot points for created places
points = alt.Chart(map3).mark_circle().encode(
    
    # Position the markers
    longitude='Longitude:Q',
    latitude='Latitude:Q',
    
    # Hover for more details about the number of items and the location
    tooltip=[alt.Tooltip('geographic_subject', title='Location'),alt.Tooltip('City-Count:Q', title='Number of Items')],
    
    # Size determined by the number of objects
    size=alt.Size('City-Count:Q',
        scale=alt.Scale(range=[0, 1000]),
        legend=alt.Legend(title='Number of Items')
    )
).properties(width=700)

# Create a map by combining background and points
alt.layer(pa_background, points)

### Save Counts by State as a CSV File

It might be handy to have a CSV file that shows the count of items by `State`. Let's save it.

In [None]:
# Save one row per state to a comma-separated values (CSV) file.
df_places_us[['State', 'State-Count']].drop_duplicates().to_csv('data/kaplan_item_counts_by_state.csv', index=False)

# Display a link to the CSV.
display(FileLink('data/kaplan_item_counts_by_state.csv'))

# Need Help?
<div class="alert alert-block alert-warning">
    <p>For additional Python and Digital Scholarship resources:</p>
    <ul>
        <li><a href="https://www.w3schools.com/python/pandas/default.asp">Pandas Tutorial from W3 Schools</a></li>
        <li><a href="https://altair-viz.github.io/altair-tutorial/README.html">Altair Tutorial</a></li>
        <li><a href="https://guides.library.upenn.edu/digital-scholarship">Center for Research Data and Digital Scholarship</a></li>
    </ul>
    <p>For help with this notebook:</p>    
<ul>
    <li>If you encounter any errors in this notebook, you can open an issue on GitHub or email estene@upenn.edu and reference this notebook.</li>

<li>If you encounter any errors while working with the collection metadata (an incorrect date or broken ARK identifier), you can email estene@upenn.edu.</li>

<li>Colenda is still a beta service. If you encounter issues with accessing any of the IIIF images or links, visit
    <a href="https://colenda.library.upenn.edu/">Colenda</a></li>
    </ul>
</div>

----

# Credits

Created by [Emily Esten](https://www.library.upenn.edu/people/staff/emily-esten). 

Judaica Digital Humanities at the <a href="http://library.upenn.edu">Penn Libraries</a> (also referred to as Judaica DH) is a robust program of projects and tools for experimental digital scholarship with Judaica collections, informed by digital humanities, Jewish studies, and cultural heritage approaches. Visit our [website](https://judaicadh.library.upenn.edu).

The pre-harvested dataset for this notebook works with items from the **Arnold and Deanne Kaplan Collection of Early American Judaica**. Donated to the University of Pennsylvania Libraries in 2012 by the Kaplans, and growing each year, this collection teaches us about the everyday lives, families, communal institutions, religious organizations, voluntary associations, businesses, and political circumstances of Jewish life throughout the western hemisphere over four centuries. More information about the collection can be found at [https://kaplan.exhibits.library.upenn.edu](https://kaplan.exhibits.library.upenn.edu).

This notebook references existing code and Jupyter notebooks, including: 
* [GLAM Workbench for the National Museum of Australia](https://doi.org/10.5281/zenodo.3544747) sponsored by the [Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab](https://tinker.edu.au/).
* [Library of Congress Data Exploration: IIIF](https://github.com/LibraryOfCongress/data-exploration/blob/26510c3f4da0bc85dfa87e82141173b1830e9d64/IIIF.ipynb).
* Gustavo Candela, María Dolores Sáez, Pilar Escobar, Manuel Marco-Such, & Rafael C. Carrasco. (2020, May 8). hibernator11/notebook-iiif-images: release1.1 (Version 1.1). Zenodo. [http://doi.org/10.5281/zenodo.3816611](https://zenodo.org/badge/latestdoi/255172461).
* [Genes for Project Cognoma](https://github.com/cognoma/genes/blob/721204091a96e55de6dcad165d6d8265e67e2a48/2.process.py)