# Explore Colenda Items Over Time

[Colenda Digital Repository at Penn Libraries](https://colenda.library.upenn.edu/) is a digital repository for digitized and born-digital material. It provides direct access and long-term stewardship for these important resources. Much of Colenda’s content consists of materials owned and digitized by the Penn Libraries, including significant collections that have been donated.

In this notebook we'll explore the temporal dimensions of data harvested from Colenda. When were items created, collected, or used? To do that we'll extract the nested temporal data, see what's there, and create a few charts.

[See here](kaplan_explore_records.ipynb) for an introduction to exploring Colenda data, and [here to explore spatial dimensions](kaplan_explore_places.ipynb) of the data.

* [Import What We Need](#Import-What-We-Need)
* [Load the Data](#Load-the-Data)
* [Concatenate and Split `metadata.date` Fields](#Concatenate-and-Split-metadata.date-Fields)
* [Explore Items by Year](#Explore-Items-by-Year)
* [Explore Items Distributed Over Time](#Explore-Items-Distributed-Over-Time)
* [Focus on Items in a Particular Year](#Focus-on-Items-in-a-Particular-Year)
* [Explore Items by Year and Type](#Explore-Items-by-Year-and-Type)
* [Explore Items by Year and Location](#Explore-Items-by-Year-and-Location)
* [Need Help?](#Need-Help?)
* [Credits](#Credits)

<div class="alert alert-block alert-warning">
<p><b>Yellow blocks like this provide additional information about Python and Jupyter notebooks.</b></p>
    
<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>

<p>
    Some tips:
    <ul>
        <li>Code cells have boxes around them.</li>
        <li>To run a code cell click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>
        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.</li>
        <li>In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.</li>
        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>
    </ul>
</p>

<p><b>Is this thing on?</b> If you can't edit or run any of the code cells, you might be viewing a static (read only) version of this notebook. Click here to <a href="https://mybinder.org/v2/gh/GLAM-Workbench/national-museum-australia/master?urlpath=lab%2Ftree%2Fexplore_collection_object_over_time.ipynb">load a <b>live</b> version</a> running on Binder.</p>
</div>

## Import What We Need

<div>
    <p>In order to use this notebook, you first need to <code>import</code> Python modules and packages. These are units of code that provide the specific tools we use in the notebook.</p>
<div class="alert alert-block alert-warning">
<p>If you're running this notebook on your own computer, you may first need to install these packages in your Python environment. Find assistance for that <a href="https://packaging.python.org/tutorials/installing-packages/">here</a>.</p>
    </div>
</div>

In [1]:
!pip install -r requirements.txt

# Pandas is a Python package that provides numerous tools for data analysis. 
import pandas as pd

# IpyLeaflet is a Python library that enables interactive geospatial data visualization in Jupyter Notebook.
from ipyleaflet import Map, Marker, Popup, MarkerCluster

# IpyWidgets is a Python library that enables interactive HTML widgets for Jupyter notebooks.
import ipywidgets as widgets

# Altair is a Python library for declarative statistical visualization. 
import altair as alt

# IPython.display provides tools for showing rich content (HTML, file links) in the notebook.
from IPython.display import display, HTML, FileLink


## Load the Data

This pre-harvested dataset from Colenda includes all items donated by Arnold and Deanne Kaplan, which span two collections. In this notebook we will only work with records from the Arnold and Deanne Kaplan Collection of **Early American Judaica**. Let's load our data into Python using [Pandas](https://www.w3schools.com/python/pandas/default.asp), and then filter for the Early American Judaica collection using the `metadata.collection[1]` column.

In [None]:
# Convert to a dataframe (a two-dimensional data structure with rows and columns)
df = pd.read_csv("data/kaplan-test-data.csv", encoding= 'unicode_escape')

# Print the number of rows in the dataframe
print('There are {:,} items in this dataset from Colenda.'.format(df.shape[0]))

<div class="alert alert-block alert-warning">
<p>You may see a warning appear above stating that some columns have "mixed types". This means that those CSV columns contain a mix of strings and integers. When converting the CSV into a dataframe, Pandas wasn't sure how to declare the column type. It's OK to ignore this warning for now. We may need to state this information directly later.</p>
</div>
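
If the warning ever needs to be addressed, one option is to declare a column's type when reading the CSV. The sketch below is only an illustration: the CSV text and the `box` column are made up, not taken from the Kaplan data.

```python
import io

import pandas as pd

# Hypothetical CSV text: the 'box' column mixes numbers and text.
csv_text = "unique_identifier,box\nitem-1,12\nitem-2,A3\n"

# Declaring the column type up front means Pandas does not have to guess it.
toy = pd.read_csv(io.StringIO(csv_text), dtype={"box": str})

# Every value in the column is now read as a string.
print(toy["box"].tolist())
```

With the real file, the same `dtype` argument could be passed to the `pd.read_csv` call above, using whichever column names the warning mentions.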

In [None]:
# Filter for items from the Early American Judaica Collection. 
df = df.loc[df['metadata.collection[1]'] == "Arnold and Deanne Kaplan Collection of Early American Judaica (University of Pennsylvania)"]

# Return the first 5 rows of the dataframe.
df.head()

## Concatenate and Split `metadata.date` Fields

Specific dates or years are linked to items through the `metadata.date` columns. Each item may have multiple dates associated with it, stored in separate columns. For comparative and quantitative data analysis, we need to spread those dates across multiple rows instead of multiple columns.

We'll write two **functions** to help us do that: 
* `tidy_split` splits the values of each cell on a "|" so that there is one split value per row, and 
* `tidy_concat` concatenates (joins) the values of columns that begin with a similar phrase into one cell with a "|" before using `tidy_split`. 

Now instead of having one row for each item with multiple dates, we can have one row for each date associated with an item.
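
To make the reshaping concrete, here is a rough sketch on a toy dataframe. The identifiers and dates are invented, and it uses Pandas' built-in `str.split` and `explode` rather than the helper functions defined in this notebook, so treat it as an illustration of the idea only.

```python
import pandas as pd

# Hypothetical items: the first has two dates, the second has one.
toy = pd.DataFrame({
    "unique_identifier": ["item-1", "item-2"],
    "metadata.date[1]": ["1881", "1899-01-05"],
    "metadata.date[2]": ["1882", None],
})

# Find every column whose name begins with "metadata.date".
date_columns = [c for c in toy.columns if c.startswith("metadata.date")]

# Join the non-empty dates of each row into one "|"-separated string.
joined = toy[date_columns].apply(
    lambda row: "|".join(row.dropna().astype(str)), axis=1
)

# Split the joined string into a list, then give each date its own row.
tidy = (
    toy[["unique_identifier"]]
    .assign(**{"metadata.date": joined.str.split("|")})
    .explode("metadata.date")
    .reset_index(drop=True)
)

print(tidy)
```

The first toy item ends up on two rows (one per date) and the second on a single row.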

<div class="alert alert-block alert-warning">
<p>A function is a block of reusable code that is used to perform a single, related action. Learn more about functions <a href="https://www.w3schools.com/python/python_functions.asp">here</a>.</p>
</div>

In [None]:
# Split the values of a column and expand so that the new DataFrame has one split value per row. Filters rows where column is empty. 
def tidy_split(df, column, sep='|', keep=False):
    """
    Params
    ------
    df : pandas.DataFrame
        dataframe with the column to split and expand
    column : str
        the column to split and expand
    sep : str
        the string used to split the column's values
    keep : bool
        whether to retain the presplit value as its own row

    Returns
    -------
    pandas.DataFrame
        Returns a dataframe with the same columns as `df`.
    """
    indexes = list()
    new_values = list()
    df = df.dropna(subset=[column])
    for i, presplit in enumerate(df[column].astype(str)):
        values = presplit.split(sep)
        if keep and len(values) > 1:
            indexes.append(i)
            new_values.append(presplit)
        for value in values:
            indexes.append(i)
            new_values.append(value)
    new_df = df.iloc[indexes, :].copy()
    new_df[column] = new_values
    return new_df

In [None]:
# Concatenate the values of columns beginning with a string and then use the tidy_split function to expand so that the new DataFrame has one split value per row.
def tidy_concat(df, column_starts_with, sep="|"):
    """
    Params
    ------
    df : pandas.DataFrame
        dataframe with the columns to split and expand
    column_starts_with : str
        the string at the beginning of the column(s) to split
    sep : str
        the string used to split the column's values

    Returns
    -------
    pandas.DataFrame
        Returns a dataframe with the concatenated columns replaced by
        a single `column_starts_with` column.
    """
    # Work on a copy so that the dataframe passed in is left unchanged
    df = df.copy()
    list_of_columns = df.columns.to_list()
    columns_to_concat = [x for x in list_of_columns if x.startswith(column_starts_with)]
    df[column_starts_with] = df[columns_to_concat[0]]
    for column in columns_to_concat[1:]:
        df[column_starts_with] = df[column_starts_with].astype(str) + sep + df[column].astype(str)
    # Split on the same separator that was used to join the values
    new_df = tidy_split(df, column_starts_with, sep=sep)
    new_df = new_df.drop(columns_to_concat, axis=1)
    return new_df

Now that we've defined the `tidy_concat` and `tidy_split` functions, we can use them to concatenate the multiple `metadata.date` columns into one column, and then split it so that the new dataframe has one split value per row.

In [None]:
# Concatenate the values of columns beginning with 'metadata.date' and then use the tidy_split function to expand so that the new DataFrame has one split value per row.
df_dates = tidy_concat(df, 'metadata.date', sep='|')

# Filter the dataframe to only include dates that are NOT 'unknown'
df_dates = df_dates[df_dates["metadata.date"]!='unknown']

# Filter the dataframe to only include dates that are NOT 'nan' (the text left behind by empty cells)
df_dates = df_dates[df_dates["metadata.date"]!='nan']

# Report the dimensionality of the dataframe (number of rows, number of columns).
df_dates.shape



How many **unique** dates are represented in the collection?

In [None]:
# Drop the duplicate dates from the dataframe and report the dimensionality of the dataframe.
unique_df_dates = df_dates.drop_duplicates(subset=['metadata.date']).shape

print('There are {:,} unique dates represented in the Arnold and Deanne Kaplan Collection of Early American Judaica (University of Pennsylvania).'.format(unique_df_dates[0]))

## Explore Items by Year

Dates in this Colenda dataset are formatted as YYYY-MONTH-DD when a specific date is given. Let's extract years from the dates to make comparisons a bit easier.

In [None]:
# Use a regular expression (a specific text pattern) to find the first four digits in the date fields
df_dates['metadata.year'] = df_dates['metadata.date'].str.extract(r'^(\d{4})').fillna(0).astype('int')

<div class="alert alert-block alert-warning">
<p>A regular expression is a sequence of characters that forms a search pattern, often used to check whether a string contains a specific pattern. Find more information about regular expressions <a href="https://www.w3schools.com/python/python_regex.asp">here</a>.</p>
</div>
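
As a small illustration of what the pattern `^(\d{4})` matches, the sketch below runs it on a few made-up date strings. It uses `pd.to_numeric` to convert the results, which is a slightly different route from the cell above, and the sample values are assumptions rather than real Colenda dates.

```python
import pandas as pd

# Hypothetical date strings, some of which do not begin with a year.
dates = pd.Series(["1881-January-05", "1899", "circa 1850", "undated"])

# Capture four digits, but only when they appear at the very start of the string.
years = dates.str.extract(r"^(\d{4})", expand=False)

# Convert to numbers; anything that did not match becomes 0.
years = pd.to_numeric(years, errors="coerce").fillna(0).astype(int)

print(years.tolist())
```

Note that "circa 1850" yields 0 because the year is not at the start of the string, which is why years of 0 are filtered out in later cells.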

What's the earliest `year` represented in the collection?

In [None]:
# Locate the minimum year (above 0) represented in the dataframe's Year column. 
earliest_year = df_dates.loc[df_dates['metadata.year'] > 0]['metadata.year'].min()

print(earliest_year)

What items are from 1555? 

To identify these items, let's create a dataframe of items from 1555. Then we'll use the item's ARK identifier to generate a link to the item's record in Colenda. More about ARK identifiers can be found in [Explore Colenda Records](kaplan_explore_records.ipynb). 

In [None]:
# Create a dataframe called df_earliest, which is a subsection of the df_dates dataframe that contains only those where the year = earliest_year
df_earliest = df_dates[df_dates['metadata.year'] == earliest_year]

# Select the first row in the dataframe and access the Unique Identifier
identifier = df_earliest.iloc[0]['unique_identifier']

# Split the string up to the second occurrence of "/" and join all but the first element of the split string 
identifier = "-".join(identifier.split("/", 2)[1:])

# Display the link to the item in Colenda, with the item-specific URL and the item's title as the hyperlinked text 
display(HTML('<a href="https://colenda.library.upenn.edu/catalog/{}">{}</a>'.format(identifier, df_earliest.iloc[0]['metadata.title[1]'])))
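
The identifier handling can be hard to follow in a single line, so here is a minimal sketch that wraps it in a function. The helper name `ark_to_catalog_url` and the sample ARK are hypothetical, and the sketch assumes identifiers have the shape `ark:/<number>/<name>`.

```python
def ark_to_catalog_url(ark):
    """Hypothetical helper: turn an ARK identifier into a Colenda catalog link."""
    # Split on the first two "/" characters: "ark:", the number, and the name.
    parts = ark.split("/", 2)
    # Drop the leading "ark:" and join the remaining parts with a hyphen.
    slug = "-".join(parts[1:])
    return "https://colenda.library.upenn.edu/catalog/{}".format(slug)


# A made-up ARK, used only to show the transformation.
example_ark = "ark:/99999/fk4example"
print(ark_to_catalog_url(example_ark))
```

Keeping the transformation in one place would make it easier to reuse wherever a link is built.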

What's the latest year represented in the collection?

In [None]:
# Return the maximum year of the years represented in the dataframe
latest_year = df_dates['metadata.year'].max()

print(latest_year)

How many items are there from the latest year? 

In [None]:
# Create a dataframe called df_latest, which is a subsection of the df_dates dataframe that contains only those where the year = latest_year
df_latest = df_dates[df_dates['metadata.year'] == latest_year]

print('There are {:,} items from 1899 represented in the collection'.format(df_latest.shape[0]))

The year `1899` is used as a boundary year between two Kaplan Collections at the University of Pennsylvania Libraries: that of **Early American Judaica** (pre-1899) and **Modern American Judaica** (post-1899). What items are from 1899? 

Since there may be multiple items from 1899, we'll use a *for* loop to iterate over all the items from that year. 

<div class="alert alert-block alert-warning">
    <p>A <i>for</i> loop is used when iterating over a sequence in order. For each item in the sequence, the same action(s) will be performed. Find more information on <i>for</i> loops <a href="https://www.w3schools.com/python/python_for_loops.asp">here</a>.</p>
</div>
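
As a small sketch of the pattern, the loop below walks over a toy dataframe and reads values directly from each `row`. The identifiers and titles are invented for illustration.

```python
import pandas as pd

# Hypothetical records with the same column names used in this notebook.
toy = pd.DataFrame({
    "unique_identifier": ["ark:/99999/fk4aaa", "ark:/99999/fk4bbb"],
    "metadata.title[1]": ["First toy title", "Second toy title"],
})

labels = []

# iterrows() yields the index and the row, one row at a time.
for idx, row in toy.iterrows():
    # Values can be read from the row by column name.
    labels.append("{}: {}".format(row["unique_identifier"], row["metadata.title[1]"]))

print(labels)
```

Because each `row` already holds the values for the current record, a separate counter is not strictly needed to look them up.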

In [None]:
count = 0

# Create a for loop that iterates over rows in the df_latest dataframe
for idx, row in df_latest.iterrows():

    # Select the current row (by position) and access the Unique Identifier
    identifier = df_latest.iloc[count]['unique_identifier']

    # Split the string up to the second occurrence of "/" and join all but the first element of the split string
    identifier = "-".join(identifier.split("/", 2)[1:])

    # Display the link to the item in Colenda, with the item-specific URL and the item's title as the hyperlinked text
    display(HTML('<a href="https://colenda.library.upenn.edu/catalog/{}">{}</a>'.format(identifier, df_latest.iloc[count]['metadata.title[1]'])))

    # Move on to the next row
    count += 1

## Explore Items Distributed Over Time

Let's use Python's `statistics` module to gather additional information about the years represented in the collection.

In [None]:
# Statistics is a Python module that you can use to calculate mathematical statistics of numeric data
import statistics

# Convert the values of the Year column to a list
col_dates_list = df_dates['metadata.year'].tolist()

# Remove all the 0 values from the list.
col_dates_list = [i for i in col_dates_list if i != 0]

# sorted() is a built-in Python function that returns a new, numerically sorted list
col_dates_list = sorted(col_dates_list)

# Calculate the median year in the list 
median = statistics.median(col_dates_list)

# Calculate the mode year in the list
mode = statistics.mode(col_dates_list)

print('The median year is ' + str(median) + ' for items in the collection.')
print('The most common year is ' + str(mode) + ' for items in the collection.')
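
To see how the two measures differ, here is a tiny example on a made-up list of years (not taken from the collection).

```python
import statistics

# Hypothetical years, already in ascending order.
years = [1850, 1860, 1875, 1881, 1881]

# The median is the middle value of the sorted list.
median_year = statistics.median(years)

# The mode is the value that appears most often.
mode_year = statistics.mode(years)

print(median_year, mode_year)
```

The median depends on the position of values in the sorted list, while the mode depends only on how often each value occurs.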

The median and mode suggest that the majority of items in this collection are associated with the late nineteenth century. Let's chart the number of items per year to see this visually. 

In [None]:
# Count how many rows there are for each year in the Year column, and save the result as a dataframe called year_counts
year_counts = df_dates['metadata.year'].value_counts().to_frame().reset_index()
year_counts.columns = ['Year', 'Count']
year_counts.head(10)
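
The counting step can be tried on a toy series first. The years below are invented, and the sketch names the output columns with `rename_axis` and `reset_index(name=...)`, which is one alternative to assigning `columns` afterwards.

```python
import pandas as pd

# Hypothetical years; 0 stands for a date with no recognisable year.
years = pd.Series([1881, 1881, 1899, 0, 1881], name="metadata.year")

# Count the rows per year and turn the result into a two-column dataframe.
counts = years.value_counts().rename_axis("Year").reset_index(name="Count")

print(counts)
```

`value_counts` sorts by count, so the most frequent year appears in the first row.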

In [None]:
# Create a bar chart (limit to years greater than 0)
alt.Chart(year_counts.loc[year_counts['Year'] > 0]).mark_bar(size=2).encode(
    
    # Year on the X axis
    x=alt.X('Year:Q', axis=alt.Axis(format='c', title='Year')),
    
    # Number of objects on the Y axis
    y=alt.Y('Count:Q', title='Number of Items'),
    
    # Show details on hover
    tooltip=[alt.Tooltip('Year:Q', title='Year'), alt.Tooltip('Count:Q', title='Count', format=',')]
).properties(width=700)

## Focus on Items in a Particular Year

In another notebook, we [used the `metadata.item_type` columns](kaplan_explore_records.ipynb#The-metadata.item_type-Fields) to learn about the types of items in the collection. Let's use it to see what types of objects are associated with a particular year. For the Kaplan Collection, we will use the year 1881.

Let's select the rows that have a value in `metadata.item_type[1]` and create a new dataframe with the results.

In [None]:
# Identify the rows where the `metadata.item_type[1]` column is not null, and create a new dataframe that maintains four of the columns 
df_dates_types = df_dates.loc[df_dates['metadata.item_type[1]'].notnull()][['unique_identifier', 'metadata.title[1]', 'metadata.year', 'metadata.item_type[1]']]

# Return the first 5 lines of the df_dates_types dataframe
df_dates_types.head()

Now we can filter by year to see what types of items are associated with 1881.

In [None]:
# Create a new dataframe from the df_dates_types dataframe of the rows where the year is 1881. 
year_1881 = df_dates_types.loc[df_dates_types['metadata.year'] == 1881]
year_1881.head()

Let's look at the top twenty-five types of things created in 1881.

In [None]:
# Return a Series containing counts of unique rows in the dataframe for each item type (up to 25)
year_1881['metadata.item_type[1]'].value_counts()[:25]

The vast majority of items are 'Trade cards'. Let's look at one of the 'Trade cards' in more detail.

The images in Colenda for these Trade cards are available under the [**International Image Interoperability Framework (IIIF)**](https://iiif.io/), which makes these images accessible and interoperable between image repositories. 
Let's take a look at the images for one of these trade cards. 

To display those images, we'll write two **functions** to help us: 
* `_src_from_data` encodes image data so that it can be embedded in an HTML `img` element, and 
* `gallery` shows a set of images in a gallery that flexes with the width of the notebook. 

These functions for working with IIIF images come from [BVMC Labs](http://data.cervantesvirtual.com/). 

In [None]:
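# Requests is a Python package that allows you to send HTTP/1.1 requests.
import requests

# Image wraps raw image bytes so that they can be displayed in the notebook.
from IPython.display import Image

# Encode image bytes for inclusion in an HTML img element
def _src_from_data(data):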
    img_obj = Image(data=data)
    for bundle in img_obj._repr_mimebundle_():
        for mimetype, b64value in bundle.items():
            if mimetype.startswith('image/'):
                return f'data:{mimetype};base64,{b64value}'

#  Shows a set of images in a gallery that flexes with the width of the notebook.
def gallery(dictionary, row_height='auto'):
    figures = []
    for image, label in dictionary.items():
        src = image
        figures.append(f'''<figure style="margin: 5px !important;">
        <img src="{src}" 
        style="height: {row_height}">
        <figcaption style="font-size: 1em">{label}</figcaption>
        </figure>''')
    return HTML(data=f'''<div style="display: flex; flex-flow: row wrap; text-align: center;">{''.join(figures)}</div>''')

In [None]:
# Locate in the year_1881 dataframe of the rows where the type is Trade cards
year_1881.loc[year_1881['metadata.item_type[1]'] == 'Trade cards'].head()

# Select the first row in the dataframe. 
item = year_1881.loc[year_1881['metadata.item_type[1]'] == 'Trade cards'].iloc[0]

# Access the unique identifier of the selected item.
identifier = item['unique_identifier']

# Split the string at its first two occurrences of "/", drop the first element, and join the rest with "-".
identifier = "-".join(identifier.split("/", 2)[1:])

# Create a string that is the link to the item-specific IIIF manifest.
manifest = "https://colenda.library.upenn.edu/phalt/iiif/2/" + identifier + "/manifest"

# Get the manifest
r = requests.get(manifest)

# Get the information about all the images for this item as a list
results = r.json()["sequences"][0]['canvases']

# Create a dictionary to collect each image URL (key) and corresponding label (value) for this item.
imagesDict = {}

# Iterate over each image in the results list to extract its URL and label, adding them to the dictionary above
for canvas in results:
    label = canvas['label']
    resource = canvas['images'][0]['resource']
    images = resource['@id']
    imagesDict[images] = label

# Display the link to the item in Colenda, with the item-specific URL and the item's title as the hyperlinked text.
display(HTML('<a href="https://colenda.library.upenn.edu/catalog/{}">{}</a>'.format(identifier, item['metadata.title[1]'])))

# Display the images as a gallery    
gallery(imagesDict, row_height='150px')
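
To see what the cell above pulls out of a manifest without making a network request, here is a minimal sketch that runs on a hand-written dictionary. The sample manifest, its URLs, the example identifier, and the helper name `images_from_manifest` are all made up for illustration. Only the nesting (`sequences`, `canvases`, `images`, `resource`, `@id`) mirrors what the notebook reads from Colenda.

```python
# A tiny hand-written manifest (hypothetical values) with the same nesting
# that the notebook reads: sequences -> canvases -> images -> resource -> @id
sample_manifest = {
    "sequences": [{
        "canvases": [
            {"label": "Front",
             "images": [{"resource": {"@id": "https://example.org/iiif/front.jpg"}}]},
            {"label": "Back",
             "images": [{"resource": {"@id": "https://example.org/iiif/back.jpg"}}]},
        ]
    }]
}

def images_from_manifest(manifest):
    """Return {image_url: label} for each canvas in the first sequence."""
    images = {}
    sequences = manifest.get("sequences") or [{}]
    for canvas in sequences[0].get("canvases", []):
        canvas_images = canvas.get("images", [])
        if not canvas_images:
            continue  # skip canvases that carry no image
        url = canvas_images[0]["resource"]["@id"]
        images[url] = canvas.get("label", "")
    return images

images_dict = images_from_manifest(sample_manifest)

# The identifier step, shown on a made-up ARK-style string
example_identifier = "ark:/99999/fk4example"
slug = "-".join(example_identifier.split("/", 2)[1:])

print(images_dict)
print(slug)
```

Using `.get()` with defaults means a manifest with no sequences or an empty canvas returns an empty dictionary instead of raising an error, which can be handy when looping over many items.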


## Explore Items by Year and Type

Now that we have a dataframe that combines creation dates with object types, we can look at how the creation of particular object types changes over time. Let's look at the 1880s as an example.

In [None]:
# Create a dataframe containing the years 1880-1889
df_1880s = df_dates_types.loc[(df_dates_types['metadata.year'] > 1879) & (df_dates_types['metadata.year'] < 1890)]

# Create a list of columns to keep
col_list = ['metadata.year', 'metadata.item_type[1]']

# Keep only the columns in col_list
df_1880s1 = df_1880s[col_list]

# Group by year and item type, count the rows in each group, and save the result as the column `Count`
df_1880s1 = df_1880s1.groupby(['metadata.year','metadata.item_type[1]']).size().reset_index(name="Count")

# Sort the rows by year
df_1880s1 = df_1880s1.sort_values(by=['metadata.year'])

# Display the grouped counts
df_1880s1
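
If the group-and-count step is hard to picture, this small sketch applies the same pattern to a few made-up rows. The values are invented. Only the column names match the notebook.

```python
import pandas as pd

# Toy records (made-up values) using the same column names as the notebook
toy = pd.DataFrame({
    'metadata.year': [1881, 1881, 1881, 1885, 1885],
    'metadata.item_type[1]': ['Trade cards', 'Trade cards', 'Letters',
                              'Letters', 'Trade cards'],
})

# Count the rows for every year/type pair and store the result in `Count`
toy_counts = (
    toy.groupby(['metadata.year', 'metadata.item_type[1]'])
       .size()
       .reset_index(name='Count')
       .sort_values(by=['metadata.year'])
)

print(toy_counts)
```

Each row of `toy_counts` is one year/type combination, so the two 1881 trade cards collapse into a single row with a `Count` of 2.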

In [None]:
# We have to rename the columns to remove the `.` or else the charts below won't work
df_1880s = df_1880s.rename(columns={"metadata.year": "year", "metadata.item_type[1]": "additionalType"})

In [None]:
# Create a stacked bar chart
alt.Chart(df_1880s).mark_bar(size=3).encode(
    
    # Year on the X axis
    x=alt.X('year:Q', axis=alt.Axis(format='c', title='Year')),
    
    # Number of objects on the Y axis
    y=alt.Y('count()', title='Number of objects'),
    
    # Color according to the type
    color='additionalType:N',
    
    # Details on hover
    tooltip=[alt.Tooltip('additionalType:N', title='Type'), alt.Tooltip('year:Q', title='Year'), alt.Tooltip('count():Q', title='Objects', format=',')]
).properties(width=700)

Let's try another way of charting changes in the creation of the most common object types over time.

First we'll get the top ten object types (which have years) as a list.

In [None]:
# Get most common 10 values and convert to a list
top_types = df_dates_types['metadata.item_type[1]'].value_counts()[:10].index.to_list()
top_types

Now we'll use the list of `top_types` to filter the creation dates, so we only have records relating to those types of items.

In [None]:
# Only include records with a year greater than zero whose item type is in the list of top_types
df_top_types = df_dates_types.loc[(df_dates_types['metadata.year'] > 0) & (df_dates_types['metadata.item_type[1]'].isin(top_types))]
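
Here is a minimal sketch of the same two steps, finding the most common types and then filtering with `isin()`, on invented data. It keeps the top two types rather than ten so the effect is easy to see. The `toy_` names are placeholders.

```python
import pandas as pd

# Made-up rows; a year of 0 stands in for a record with no usable date
toy = pd.DataFrame({
    'metadata.year': [1850, 1860, 0, 1870, 1880, 1890],
    'metadata.item_type[1]': ['Letters', 'Letters', 'Letters',
                              'Trade cards', 'Trade cards', 'Maps'],
})

# Keep the two most common types (the notebook keeps ten)
toy_top_types = toy['metadata.item_type[1]'].value_counts()[:2].index.to_list()

# Keep rows with a year greater than zero whose type is in the list
toy_filtered = toy.loc[
    (toy['metadata.year'] > 0)
    & (toy['metadata.item_type[1]'].isin(toy_top_types))
]

print(toy_top_types)
print(toy_filtered)
```

The `Maps` row is dropped because its type is not in the top two, and the undated `Letters` row is dropped because its year is 0.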

In [None]:
# Get the counts for year / type
top_type_counts = df_top_types.groupby('metadata.year')['metadata.item_type[1]'].value_counts().to_frame()
top_type_counts.columns = ['Count']
top_type_counts.reset_index(inplace=True)
print(top_type_counts)

To chart this data, we're going to use circles for each point and create 'bubble lines' for each item type to show how the number of items collected for that type varied year by year.

In [None]:
# Rename the columns to remove the `.` and `[ ]` characters, which Altair would otherwise read as nested field access
top_type_counts = top_type_counts.rename(columns={"metadata.year": "year", "metadata.item_type[1]": "additionalType"})

In [None]:
# Create a chart
alt.Chart(top_type_counts).mark_circle(
    
    # Style the circles
    opacity=0.8,
    stroke='black',
    strokeWidth=1
).encode(
    
    # Year on the X axis
    x=alt.X('year:Q', axis=alt.Axis(format='c', title='Year', labelAngle=0), scale=alt.Scale(zero=False)),
    
    # Object type on the Y axis
    y=alt.Y('additionalType:N', title='Item Type'),
    
    # Size of the circles represents the number of objects
    size=alt.Size('Count:Q',
        scale=alt.Scale(range=[0, 2000]),
        legend=alt.Legend(title='Number of Items')
    ),
    
    # Color the circles by object type
    color=alt.Color('additionalType:N'),
    
    # Provide type, year, and count details on hover
    tooltip=[alt.Tooltip('additionalType:N', title='Type'), alt.Tooltip('year:Q', title='Year'), alt.Tooltip('Count:Q', title='Number of Items', format=',')]
).properties(
    width=700
)

This chart shows us how item types are distributed over time. Let's calculate the earliest and latest year for each of these item types and save the result to a CSV for review.

In [None]:
# Group the dataframe by the `additionalType` column and find the earliest and latest year for each type
top_type_counts_grouped = top_type_counts.groupby('additionalType')['year'].agg(earliest_year='min', latest_year='max')

# Write the dataframe to a comma-separated values (csv) file, keeping the index so each row is labelled with its item type
top_type_counts_grouped.to_csv('data/colenda_item_type_years.csv')

# Display a link to the CSV.
display(FileLink('data/colenda_item_type_years.csv'))
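
To preview what a per-type year range looks like as CSV without writing a file, this sketch runs the aggregation on made-up counts and writes to an in-memory buffer. The data and the column names `earliest_year` and `latest_year` are illustrative choices. Grouping makes the item type the index, so the sketch writes the index to keep each row labelled.

```python
import io
import pandas as pd

# Made-up year/type counts in the same shape as top_type_counts
toy_counts = pd.DataFrame({
    'year': [1850, 1881, 1885, 1860, 1902],
    'additionalType': ['Letters', 'Trade cards', 'Trade cards',
                       'Letters', 'Letters'],
    'Count': [3, 10, 4, 1, 2],
})

# Earliest and latest year for each item type
year_range = toy_counts.groupby('additionalType')['year'].agg(
    earliest_year='min', latest_year='max'
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
year_range.to_csv(buffer)
csv_text = buffer.getvalue()

print(csv_text)
```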

## Explore Items by Year and Location

In preparation for the next Jupyter notebook, let's take a look at location data instead of item types. To chart this data we're going to use circles for each point and create 'bubble lines' for each state to show how the number of items varies over time.

In [None]:
# Use the function to split the values of the `metadata.geographic_subject` column and expand so that the new dataframe has one split value per row.
df_places = tidy_concat(df_dates, 'metadata.geographic_subject', sep='|')


In [None]:
# Filter the dataframe to only include non-null values
df_places = df_places[df_places['metadata.geographic_subject'].notna()]

# Filter the dataframe to only include geographic subjects that start with `United States`
df_usa = df_places[df_places['metadata.geographic_subject'].str.startswith("United States")]

# Filter the dataframe to only include values that have two dashes (drop anything more detailed than the State level)
df_usa = df_usa[df_usa["metadata.geographic_subject"].str.count("-")==2]

# Filter the dataframe to only include rows where the year is not zero
df_usa = df_usa.loc[df_usa["metadata.year"] != 0]

# Rename the columns to remove the `.` or else the charts below won't work
df_usa = df_usa.rename(columns={'metadata.geographic_subject': 'geographic_subject','metadata.year':'year'})

# Get the counts for year / geographic_subject
df_usa_counts = df_usa.groupby('year')['geographic_subject'].value_counts().to_frame()
df_usa_counts.columns = ['Count']
df_usa_counts.reset_index(inplace=True)
print(df_usa_counts)
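
The string filters above can be tried on a handful of values first. This sketch assumes geographic subjects are written with ` -- ` between levels, as in `United States -- Pennsylvania`. The sample values are invented, so check them against the real column before relying on the dash count.

```python
import pandas as pd

# Made-up geographic subjects, including a missing value
toy_places = pd.Series([
    'United States -- Pennsylvania',
    'United States -- Pennsylvania -- Philadelphia',
    'United States',
    'Jamaica -- Kingston',
    None,
], name='metadata.geographic_subject')

# Drop missing values
non_null = toy_places[toy_places.notna()]

# Keep values that start with `United States`
usa = non_null[non_null.str.startswith('United States')]

# Keep values with exactly two dashes, i.e. one ` -- ` separator (state level)
state_level = usa[usa.str.count('-') == 2]

print(state_level.tolist())
```

Counting dashes is a simple proxy for depth. A place name that itself contains a hyphen would be miscounted, so counting the ` -- ` separator would be a stricter alternative.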

In [None]:
# Create a chart
alt.Chart(df_usa_counts).mark_circle(
    
    # Style the circles
    opacity=0.8,
    stroke='black',
    strokeWidth=1
).encode(
    
    # Year on the X axis
    x=alt.X('year:Q', axis=alt.Axis(format='c', title='Year', labelAngle=0), scale=alt.Scale(zero=False)),
    
    # State on the Y axis
    y=alt.Y('geographic_subject:N', title='State'),
    
    # Size of the circles represents the number of objects
    size=alt.Size('Count:Q',
        scale=alt.Scale(range=[0, 2000]),
        legend=alt.Legend(title='Number of Items')
    ),
    
    # Uncomment the line below to color the circles by state
    #color=alt.Color('geographic_subject:N'),
    
    # Provide state, year, and count details on hover
    tooltip=[alt.Tooltip('geographic_subject:N', title='State'), alt.Tooltip('year:Q', title='Year'), alt.Tooltip('Count:Q', title='Number of Items', format=',')]
).properties(
    width=700
)

## Need Help?
<div class="alert alert-block alert-warning">
    <p>For additional Python and Digital Scholarship resources:</p>
    <ul>
        <li><a href="https://www.w3schools.com/python/pandas/default.asp">Pandas Tutorial from W3Schools</a></li>
        <li><a href="https://altair-viz.github.io/altair-tutorial/README.html">Altair Tutorial</a></li>
        <li><a href="https://guides.library.upenn.edu/digital-scholarship">Center for Research Data and Digital Scholarship</a></li>
    </ul>
    <p>For help with this notebook:</p>    
<ul>
    <li>If you encounter any errors in this notebook, you can open an issue on GitHub or email estene@upenn.edu and reference this notebook.</li>

<li>If you encounter any errors while working with the collection metadata (an incorrect date or broken ARK identifier), you can email estene@upenn.edu.</li>

<li>Colenda is still a beta service. If you encounter issues with accessing any of the IIIF images or links, visit
    <a href="https://colenda.library.upenn.edu/">Colenda</a></li>
    </ul>
</div>

----

## Credits

Created by [Emily Esten](https://www.library.upenn.edu/people/staff/emily-esten). 

Judaica Digital Humanities at the <a href="http://library.upenn.edu">Penn Libraries</a> (also referred to as Judaica DH) is a robust program of projects and tools for experimental digital scholarship with Judaica collections, informed by digital humanities, Jewish studies, and cultural heritage approaches. Visit our [website](https://judaicadh.library.upenn.edu).

The pre-harvested dataset for this notebook works with items from the **Arnold and Deanne Kaplan Collection of Early American Judaica**. Donated to the University of Pennsylvania Libraries in 2012 by the Kaplans, and growing each year, this collection teaches us about the everyday lives, families, communal institutions, religious organizations, voluntary associations, businesses, and political circumstances of Jewish life throughout the western hemisphere over four centuries. More information about the collection can be found at [https://kaplan.exhibits.library.upenn.edu](https://kaplan.exhibits.library.upenn.edu). 

This notebook references existing code and Jupyter notebooks, including: 
* [GLAM Workbench for the National Museum of Australia](https://doi.org/10.5281/zenodo.3544747) sponsored by the [Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab](https://tinker.edu.au/).
* [Library of Congress Data Exploration: IIIF](https://github.com/LibraryOfCongress/data-exploration/blob/26510c3f4da0bc85dfa87e82141173b1830e9d64/IIIF.ipynb).
* Gustavo Candela, María Dolores Sáez, Pilar Escobar, Manuel Marco-Such, & Rafael C. Carrasco. (2020, May 8). hibernator11/notebook-iiif-images: release1.1 (Version 1.1). Zenodo. [http://doi.org/10.5281/zenodo.3816611](https://zenodo.org/badge/latestdoi/255172461). 
* [Genes for Project Cognoma](https://github.com/cognoma/genes/blob/721204091a96e55de6dcad165d6d8265e67e2a48/2.process.py)


This work is modeled after Tim Sherratt's work with the National Museum of Australia, which was issued under an MIT License. 