# Explore Colenda Records

## Welcome!

[Colenda Digital Repository at Penn Libraries](https://colenda.library.upenn.edu/) is a digital repository for digitized and born-digital material. It provides direct access and long-term stewardship for these important resources. Much of Colenda’s content consists of materials owned and digitized by the Penn Libraries, including significant collections that have been donated.

In this notebook we'll have a preliminary look at data harvested from Colenda. What kind of data and files can we access in Colenda? We'll introduce how to calculate basic shape/stats of the data, split and concatenate columns, and access images in the collection. Other notebooks will explore data from Colenda over [time](kaplan_explore_time.ipynb) and [space](kaplan_explore_places.ipynb).

* [Import What We Need](#Import-What-We-Need)
* [Load the Data](#Load-the-Data)
* [Reviewing the Data](#Reviewing-the-Data)
* [Concatenate and Split Columns](#Concatenate-and-Split-Columns)
* [The `metadata.item_type` Field](#The-metadata.item_type-Field)
* [Access Images of Items in the Collection](#Access-Images-of-Items-in-the-Collection)
* [Need Help?](#Need-Help?)
* [Credits](#Credits)

<div class="alert alert-block alert-warning">
<p><b>Yellow blocks like this provide additional information about Python and Jupyter notebooks.</b></p>
    
<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>

<p>
    Some tips:
    <ul>
        <li>Code cells have boxes around them.</li>
        <li>To run a code cell click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>
        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.</li>
        <li>In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.</li>
        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>
    </ul>
</p>

<p><b>Is this thing on?</b> If you can't edit or run any of the code cells, you might be viewing a static (read only) version of this notebook. Click here to <a href="https://mybinder.org/v2/gh/GLAM-Workbench/national-museum-australia/master?urlpath=lab%2Ftree%2Fexplore_collection_object_over_time.ipynb">load a <b>live</b> version</a> running on Binder.</p>
</div>

## Import What We Need

<div>
    <p>To use this notebook, you first need to <code>import</code> the Python modules and packages it relies on.</p>
<div class="alert alert-block alert-warning">
<p>These modules and packages are units of code that provide specific tools for the script. If you're running this notebook on your own computer, you may first need to install these packages in your Python environment. Find assistance for that <a href="https://packaging.python.org/tutorials/installing-packages/">here</a>.</p>
    </div>
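
Before running the imports, it can help to check which packages are already available. The snippet below is a minimal sketch that uses only the standard library; the list of names is an assumption based on the imports in the next cell.

```python
# Check whether the packages this notebook imports are installed.
from importlib.util import find_spec

# Top-level module names imported in the next cell (assumed list).
required = ["pandas", "IPython"]

# find_spec returns None when a top-level module cannot be found.
missing = [name for name in required if find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

Any name reported as missing needs to be installed before the import cell will run.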

In [1]:
!pip install -r requirements.txt

# Pandas is a Python package that provides numerous tools for data analysis. 
import pandas as pd

# IPython.display provides functions for showing rich content (HTML, links, images) in the notebook.
from IPython.display import display, HTML, FileLink, Image


## Load the Data

This pre-harvested dataset from Colenda includes many gifts of [Arnold and Deanne Kaplan](https://kaplan.exhibits.library.upenn.edu/thekaplans), which are concentrated in two collections. In this notebook we will work only with records from the Arnold and Deanne Kaplan Collection of **Early American Judaica**. We can access those items by filtering the `metadata.collection[1]` column on the Early American Judaica collection.

In [2]:
# Read the CSV file into a dataframe
df = pd.read_csv("data/kaplan-test-data.csv", encoding='unicode_escape')

# Print the number of rows in the dataframe
print('There are {:,} items in this dataset from Colenda.'.format(df.shape[0]))

There are 9,418 items in this dataset from Colenda.


<div class="alert alert-block alert-warning">
<p>You may see a warning appear above stating that some columns have "mixed types". This means that those CSV columns contain a mix of strings and integers. When converting the CSV into a dataframe, pandas wasn't sure which type to assign to the column. It's OK to ignore this warning for now; we may need to state this information explicitly later.</p>
</div>
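
If you would rather state the column type up front, `pd.read_csv` accepts a `dtype` argument. The sketch below uses a small made-up CSV held in memory, not the Colenda file, and the column names are assumptions for illustration only.

```python
import io

import pandas as pd

# Toy CSV (made up for illustration): the 'date' column mixes a year and the word "unknown".
csv_text = "title,date\nTrade card,1885\nLetter,unknown\n"

# dtype tells pandas to read the 'date' column as text instead of guessing its type.
toy = pd.read_csv(io.StringIO(csv_text), dtype={"date": str})

print(toy["date"].tolist())
```

Reading such a column as text keeps values like "unknown" and numeric-looking years side by side without pandas having to guess.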

In [3]:
# Filter for items from the Early American Judaica Collection. 
df = df.loc[df['metadata.collection[1]'] == "Arnold and Deanne Kaplan Collection of Early American Judaica (University of Pennsylvania)"]

# Return the first 5 rows of the dataframe.
df.head()

Unnamed: 0,metadata.call_number[1],metadata.collection[1],metadata.contributor[1],metadata.corporate_name[1],metadata.corporate_name[2],metadata.corporate_name[3],metadata.corporate_name[4],metadata.corporate_name[5],metadata.corporate_name[6],metadata.date[1],...,metadata.subject[2],metadata.subject[3],metadata.subject[4],metadata.subject[5],metadata.subject[6],metadata.subject[7],metadata.subject[8],metadata.subject[9],metadata.title[1],unique_identifier
2,Arc.MS.56,Arnold and Deanne Kaplan Collection of Early A...,,Baum & Bernstein,,,,,,unknown,...,Jewish merchants,Trade cards (advertising),,,,,,,"Trade card; Baum & Bernstein; Meriden, Connect...",ark:/81431/p3000003f
3,Arc.MS.56,Arnold and Deanne Kaplan Collection of Early A...,,,,,,,,1837-11-07,...,Jewish merchants,Family papers,Manuscripts (documents),,,,,,"Letter; Tobias, Henry; Liverpool, United Kingd...",ark:/81431/p3000006w
4,Arc.MS.56,Arnold and Deanne Kaplan Collection of Early A...,,I. H. Brounstein's One Price Clothing House,,,,,,unknown,...,Advertisements,Jewish merchants,Clothing trade,,,,,,Trade card; I. H. Brounstein's One Price Cloth...,ark:/81431/p3000013w
5,Arc.MS.56,Arnold and Deanne Kaplan Collection of Early A...,,Kast's Fine Shoes,,,,,,unknown,...,Jewish merchants,Trade cards (advertising),,,,,,,"Trade card; Kast's Fine Shoes; San Francisco, ...",ark:/81431/p3000019s
6,Arc.MS.56,Arnold and Deanne Kaplan Collection of Early A...,,Mandel Bro's,,,,,,unknown,...,General stores,Dry-goods,Clothing trade,Jewish merchants,Trade cards (advertising),,,,Trade card; Mandel Bro's; undated,ark:/81431/p3000020w


## Reviewing the Data

In [4]:
print('There are {:,} items in Colenda from the Arnold and Deanne Kaplan Collection of Early American Judaica (University of Pennsylvania).'.format(df.shape[0]))

There are 8,494 items in Colenda from the Arnold and Deanne Kaplan Collection of Early American Judaica (University of Pennsylvania).


Now that we have this dataset in a dataframe, we can manipulate it. 

This dataset contains **descriptive metadata** about the items in the collection. Descriptive metadata documents the intellectual content of a digital object and supports the search and discovery of items within Colenda. The most important descriptive metadata field is the unique identifier, which distinguishes each object from every other. Other descriptive metadata fields may include title, author, date of publication, subject, publisher, and description.

What descriptive metadata fields are in this dataframe?

In [5]:
# Retrieve the column names as a list
df.columns.to_list()

['metadata.call_number[1]',
 'metadata.collection[1]',
 'metadata.contributor[1]',
 'metadata.corporate_name[1]',
 'metadata.corporate_name[2]',
 'metadata.corporate_name[3]',
 'metadata.corporate_name[4]',
 'metadata.corporate_name[5]',
 'metadata.corporate_name[6]',
 'metadata.date[1]',
 'metadata.date[2]',
 'metadata.date[3]',
 'metadata.description[1]',
 'metadata.format[1]',
 'metadata.format[2]',
 'metadata.format[3]',
 'metadata.format[4]',
 'metadata.geographic_subject[1]',
 'metadata.geographic_subject[2]',
 'metadata.geographic_subject[3]',
 'metadata.geographic_subject[4]',
 'metadata.geographic_subject[5]',
 'metadata.geographic_subject[6]',
 'metadata.geographic_subject[7]',
 'metadata.geographic_subject[8]',
 'metadata.geographic_subject[9]',
 'metadata.geographic_subject[10]',
 'metadata.geographic_subject[11]',
 'metadata.geographic_subject[12]',
 'metadata.geographic_subject[13]',
 'metadata.geographic_subject[14]',
 'metadata.geographic_subject[15]',
 'metadata.geogra

That is a long list of columns! You'll notice some similar fields:

* **metadata.call_number** includes the physical item's location in the library
* **metadata.collection** includes the name of the collection(s) with which the item is associated
* **metadata.contributor** includes any person or organization responsible for making contributions to the resource
* **metadata.corporate_name** includes any businesses, organizations, or institutions that are mentioned or associated with the item 
* **metadata.date** includes any dates that appear on the item, or that may be associated with the item
* **metadata.description** includes a short account of the item
* **metadata.format** is the singular version of the *metadata.item_type* fields
* **metadata.geographic_subject** includes any geographic locations associated with the item
* **metadata.identifier** includes any other "names" or unique terms used to refer to this item
* **metadata.item_type** includes the genre(s) of the resource
* **metadata.language** includes any languages that appear on the item
* **metadata.notes** includes any notes associated with the item
* **metadata.personal_name** includes any individuals that are mentioned or associated with the item
* **metadata.provenance** includes the history of the physical item
* **metadata.publisher** includes the publisher of the item
* **metadata.rights** includes information about rights held in and over the item
* **metadata.subject** includes any terms used to categorize the item
* **metadata.title** includes a short name given to the resource
* **unique_identifier** is the [Archival Resource Key identifier](https://n2t.net/e/ark_ids.html) (more on this later)

Not every item has a value for every column. Let's create a quick count of the number of values in each column.

In [6]:
# Count non-NA cells for each column (cells not missing data)
df.count()

metadata.call_number[1]       8493
metadata.collection[1]        8494
metadata.contributor[1]          3
metadata.corporate_name[1]    6331
metadata.corporate_name[2]     565
                              ... 
metadata.subject[7]            105
metadata.subject[8]             32
metadata.subject[9]             12
metadata.title[1]             8494
unique_identifier             8494
Length: 98, dtype: int64

Let's express those counts as a percentage of the total number of records, and display them as a bar chart using Pandas.

In [7]:
# Get the counts for each column and convert to a new dataframe
field_counts = df.count().to_frame().reset_index()

# Change column headings
field_counts.columns = ['Field', 'Count']

# Calculate proportion of the total
field_counts['Proportion'] = field_counts['Count'] / df.shape[0]

# Style the results as a barchart
field_counts.style.bar(subset=['Proportion'], color='#d65f5f').format({'Proportion': '{:.2%}'.format})

Unnamed: 0,Field,Count,Proportion
0,metadata.call_number[1],8493,99.99%
1,metadata.collection[1],8494,100.00%
2,metadata.contributor[1],3,0.04%
3,metadata.corporate_name[1],6331,74.53%
4,metadata.corporate_name[2],565,6.65%
5,metadata.corporate_name[3],29,0.34%
6,metadata.corporate_name[4],7,0.08%
7,metadata.corporate_name[5],3,0.04%
8,metadata.corporate_name[6],0,0.00%
9,metadata.date[1],8494,100.00%
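
The same proportions can also be computed in a single step. The sketch below uses a small made-up dataframe rather than the Colenda data: `notna()` marks each filled cell as `True`, and the mean of those `True`/`False` values is the share of rows that have a value.

```python
import pandas as pd

# Toy data (made up for illustration): every row has a title, one in four has a publisher.
toy = pd.DataFrame({
    "title": ["Trade card", "Letter", "Billhead", "Receipt"],
    "publisher": ["Example Press", None, None, None],
})

# notna() marks filled cells as True; averaging True/False gives the filled share per column.
proportions = toy.notna().mean()

print(proportions)
```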


## Concatenate and Split Columns

Some column headings appear multiple times, distinguished by a number at the end. For example, the `metadata.item_type` column appears twice (`metadata.item_type[1]` and `metadata.item_type[2]`), indicating that some items have two item types. For comparative and quantitative data analysis, we may need to split those values into multiple rows instead of multiple columns.

Let's use `metadata.item_type` as an example of how we can concatenate and split these columns. How many items have more than one item type?

In [8]:
# Count how many rows are not blank in the 'metadata.item_type[2]' column
df['metadata.item_type[2]'].count()

48

Let's take a look at those items.

In [9]:
# Create a filtered dataframe by 'metadata.item_type[2]', including only those that have data in that column
df1 = df[df['metadata.item_type[2]'].notnull()]

# Return the first 5 lines of the df1 dataframe
df1.head()

Unnamed: 0,metadata.call_number[1],metadata.collection[1],metadata.contributor[1],metadata.corporate_name[1],metadata.corporate_name[2],metadata.corporate_name[3],metadata.corporate_name[4],metadata.corporate_name[5],metadata.corporate_name[6],metadata.date[1],...,metadata.subject[2],metadata.subject[3],metadata.subject[4],metadata.subject[5],metadata.subject[6],metadata.subject[7],metadata.subject[8],metadata.subject[9],metadata.title[1],unique_identifier
2031,Arc.MS.56,Arnold and Deanne Kaplan Collection of Early A...,,Savannah Republican,,,,,,1852-11-26,...,Broadsides (notices),Jewish merchants,General stores,,,,,,Broadside; Letter; Cohen; Savannah Republican;...,ark:/81431/p37p8tc5c
7308,Arc.MS.56,Arnold and Deanne Kaplan Collection of Early A...,,Pollak Bros.,,,,,,1890-02-19,...,Letterheads,Jewish merchants,Jewelry trade,Jewelry stores,,,,,"Envelope; Pollak, Chas.; Pollak Bros.; Kansas ...",ark:/81431/p3374v
7313,Arc.MS.56,Arnold and Deanne Kaplan Collection of Early A...,,New Orleans Wholesale Price Current,Benjamin Levy,,,,,1835-08-01,...,Correspondence,Jewish printers,Printing industry,Lists (document genres),,,,,"Periodical; Levy, Benjamin; New Orleans Wholes...",ark:/81431/p33b1r
7346,Arc.MS.56,Arnold and Deanne Kaplan Collection of Early A...,,Levensohn & Galland,,,,,,1869-10-20,...,Letterheads,Food industry and trade,General stores,Dry-goods,Clothing trade,,,,"Envelope; Levensohn, Mayer; Levensohn & Gallan...",ark:/81431/p33q9k
7378,Arc.MS.56,Arnold and Deanne Kaplan Collection of Early A...,,S. M. Rosenbaum,Fry & Stebbins,,,,,1871-09-18,...,Receipts (financial records),Jewish merchants,Money,,,,,,"Envelope; Rosenbaum, S. M.; S. M. Rosenbaum; R...",ark:/81431/p3418m


In order to accurately count how many items of each item type are included in this dataset, we need to split the values in those cells into individual rows. 

We'll write two **functions** to help us do that: 
* `tidy_split` splits the values of each cell on a "|" so that there is one split value per row
* `tidy_concat` concatenates (combines) the values of columns whose names begin with the same phrase into one cell, separated by "|", and then calls `tidy_split`

Now instead of having one row for each item with multiple types, we can have one row for each type associated with an item.

The `tidy_split` function comes from [Project Cognoma](http://cognoma.org/).

<div class="alert alert-block alert-warning">
<p>A function is a block of reusable code that is used to perform a single, related action. Learn more about functions <a href="https://www.w3schools.com/python/python_functions.asp">here</a>.</p>
</div>
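
To picture the concatenate-then-split idea before reading the functions themselves, here is a toy sketch built from pandas' own `str.split` and `explode` methods. It is not the notebook's implementation: the dataframe and its column names are made up, and unlike `tidy_concat` it skips empty cells when joining.

```python
import pandas as pd

# Toy data (made up): Item A has two types, Item B has only one.
toy = pd.DataFrame({
    "title": ["Item A", "Item B"],
    "type[1]": ["Envelopes", "Trade cards"],
    "type[2]": ["Letterheads", None],
})

# Concatenate: join the non-empty values of the numbered columns with "|".
toy["type"] = toy[["type[1]", "type[2]"]].apply(
    lambda row: "|".join(row.dropna()), axis=1
)

# Split: turn each "|"-separated string into a list, then give every list element its own row.
long_df = toy.assign(type=toy["type"].str.split("|")).explode("type")
long_df = long_df.drop(columns=["type[1]", "type[2]"])

print(long_df)
```

Item A now appears on two rows, one per type, while Item B keeps its single row.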

In [10]:
# Split the values of a column and expand so that the new DataFrame has one split value per row
# Filters rows where column is empty 
def tidy_split(df, column, sep='|', keep=False):
    """
    Params
    ------
    df : pandas.DataFrame
        dataframe with the column to split and expand
    column : str
        the column to split and expand
    sep : str
        the string used to split the column's values
    keep : bool
        whether to retain the presplit value as its own row

    Returns
    -------
    pandas.DataFrame
        Returns a dataframe with the same columns as `df`.
    """
    indexes = list()
    new_values = list()
    df = df.dropna(subset=[column])
    for i, presplit in enumerate(df[column].astype(str)):
        values = presplit.split(sep)
        if keep and len(values) > 1:
            indexes.append(i)
            new_values.append(presplit)
        for value in values:
            indexes.append(i)
            new_values.append(value)
    new_df = df.iloc[indexes, :].copy()
    new_df[column] = new_values
    return new_df

In [11]:
# Concatenate the values of columns beginning with a string, then use the tidy_split function to expand them so that the new DataFrame has one split value per row
def tidy_concat(df, column_starts_with, sep="|"):
    """
    Params
    ------
    df : pandas.DataFrame
        dataframe with the columns to split and expand
    column_starts_with : str
        the string at the beginning of the column(s) to split
    sep : str
        the string used to split the column's values

    Returns
    -------
    pandas.DataFrame
        Returns a dataframe with the same columns as `df`.
    """
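    # Find every column whose name starts with the given string
    list_of_columns = df.columns.to_list()
    columns_to_concat = [x for x in list_of_columns if x.startswith(column_starts_with)]
    # Start the combined column from the first matching column
    df[column_starts_with] = df[columns_to_concat[0]]
    # Append each remaining column; astype(str) turns empty cells into the text 'nan'
    for column in columns_to_concat[1:]:
        df[column_starts_with] = df[column_starts_with].astype(str) + sep + df[column].astype(str)
    # Split the combined values into one row per value, then drop the numbered columns
    new_df = tidy_split(df, column_starts_with, sep=sep)
    new_df = new_df.drop(columns_to_concat, axis=1)
    return new_df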

In [12]:
# Use tidy_concat to combine the metadata.item_type columns and expand them so that the new DataFrame has one split value per row
df = tidy_concat(df, 'metadata.item_type', sep='|')

# Report the dimensionality of the dataframe (number of rows, number of columns)
df.shape

(16988, 97)
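
The row count is exactly twice the number of items. This appears to be because `tidy_concat` converts empty cells to the text `nan` before splitting, so each item with a single type also gets a placeholder row. If you want to set those rows aside, one option is to filter on that text. The sketch below uses a made-up dataframe, not the Colenda data.

```python
import pandas as pd

# Toy data (made up): item B had no second type, so it picked up a 'nan' placeholder row.
toy = pd.DataFrame({
    "title": ["A", "A", "B", "B"],
    "metadata.item_type": ["Envelopes", "Letterheads", "Trade cards", "nan"],
})

# Keep only the rows whose item type is not the placeholder text 'nan'.
cleaned = toy.loc[toy["metadata.item_type"] != "nan"]

print(cleaned.shape)
```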

## The `metadata.item_type` Field

The `metadata.item_type` field refers to the type of item: books, manuscripts, sound recordings, etc. Let's look at the 25 most common item types in the collection.

In [13]:
# Return a Series containing counts of unique rows in the dataframe for each Type (up to 25 Types)
df['metadata.item_type'].value_counts()[:25]

nan                       8446
Trade cards               3845
Letters                    978
Billheads                  485
Periodicals                377
Receipts                   294
Billhead                   293
Envelopes                  192
Letterheads                181
Pamphlets                  155
Broadsides                 150
Monetary                   141
Deeds                      114
Cartes-de-visite           107
Legal documents             97
Ports of entry              90
Negotiable instruments      67
Official documents          66
Miscellaneous               63
Court records               55
Invitations                 41
Photographs                 36
Manuscripts                 34
Trade tokens                34
Sheet music                 34
Name: metadata.item_type, dtype: int64

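`nan` refers to empty values, meaning that no item type is listed for those rows. Most items have only one item type, so the row created from their empty `metadata.item_type[2]` cell has no type. 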
Of all the item types that appear in the collection, how many item types only appear once?

In [14]:
# Create a new dataframe called type_counts, which includes a Type column and a Count column
type_counts = df['metadata.item_type'].value_counts().rename_axis('type').reset_index(name='count')

# Locate the rows that have a 'unique' type, or a count of 1
unique_types = type_counts.loc[type_counts['count'] == 1]

# Print the number of rows in the dataframe
print('There are {:,} item types that appear only once in the collection.'.format(unique_types.shape[0]))

There are 57 item types that appear only once in the collection.


Interesting! Let's save the complete list of types as a CSV file for future reference.

In [15]:
# Write the type_counts dataframe to a comma-separated values (csv) file.
type_counts.to_csv('data/colenda_item_type_counts.csv', index=False)

# Display a link to the CSV.
display(FileLink('data/colenda_item_type_counts.csv'))

Browsing the CSV, I noticed that there was one item with the type `Clocks`. Let's find out more about it.

In [16]:
# Find the item in the complete data set
clocks = df.loc[df['metadata.item_type'].notnull()]['metadata.item_type'].apply(lambda x: 'Clocks' in x)
clock = df.loc[df['metadata.item_type'].notnull()][clocks]
clock

Unnamed: 0,metadata.call_number[1],metadata.collection[1],metadata.contributor[1],metadata.corporate_name[1],metadata.corporate_name[2],metadata.corporate_name[3],metadata.corporate_name[4],metadata.corporate_name[5],metadata.corporate_name[6],metadata.date[1],...,metadata.subject[3],metadata.subject[4],metadata.subject[5],metadata.subject[6],metadata.subject[7],metadata.subject[8],metadata.subject[9],metadata.title[1],unique_identifier,metadata.item_type
850,Arc.MS.56,Arnold and Deanne Kaplan Collection of Early A...,,J. Warshawsky,,,,,,unknown,...,Jewelry trade,Jewelry stores,,,,,,Clock; J. Warshawsky; undated,ark:/81431/p3319s54c,Clocks
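As an alternative to filtering out nulls and applying a `lambda`, pandas string methods can do the match in one step. The sketch below uses a made-up dataframe (`toy_df` and its rows are assumptions for illustration) rather than the harvested data.

```python
import pandas as pd

# Made-up rows that mimic two of the notebook's columns
toy_df = pd.DataFrame({
    'metadata.title[1]': ['Clock; undated', 'Letter; 1850', 'Untitled'],
    'metadata.item_type': ['Clocks', 'Letters', None],
})

# na=False treats a missing item type as "no match";
# regex=False matches the literal text rather than a pattern
is_clock = toy_df['metadata.item_type'].str.contains('Clocks', na=False, regex=False)

# Keep only the matching rows
toy_clock = toy_df.loc[is_clock]
print(toy_clock)
```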


We can create a link to the item's record in Colenda using its `unique_identifier`. The value in this column is an [Archival Resource Key identifier](https://n2t.net/e/ark_ids.html), designed to support long-term access to information objects. This identifier can be divided into three parts, separated by `/`: the ARK label (`ark:`), the Name Assigning Authority Number (NAAN), which identifies the organization that assigned the identifier, and the name that organization assigned to the item.

To create the link, we only need the second and third parts of the unique identifier, joined with a hyphen.

In [17]:
# Select the first row in the dataframe
identifier = clock.iloc[0]['unique_identifier']

# Split the string at the first two occurrences of "/" and join all but the first element with a hyphen
identifier = "-".join(identifier.split("/", 2)[1:])

# Display the link to the item in Colenda, with the item-specific URL and the item's title as the hyperlinked text 
display(HTML('<a href="https://colenda.library.upenn.edu/catalog/{}">{}</a>'.format(identifier, clock.iloc[0]['metadata.title[1]'])))
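If you want to build links for many items, the split-and-join step can be wrapped in a small function. This is a sketch only: `ark_to_colenda_url` is a hypothetical helper name, and the URL pattern is simply the one used in the cell above.

```python
def ark_to_colenda_url(ark, base='https://colenda.library.upenn.edu/catalog/'):
    """Hypothetical helper: turn an ARK such as 'ark:/81431/p3319s54c' into a catalog URL."""
    # Split into at most three parts: the label, the second part, and the item name
    parts = ark.split('/', 2)
    if len(parts) != 3 or parts[0] != 'ark:':
        raise ValueError(f'Not an ARK identifier: {ark!r}')
    # Join the second and third parts with a hyphen, as in the cell above
    return base + '-'.join(parts[1:])

# The clock's identifier from the output above
url = ark_to_colenda_url('ark:/81431/p3319s54c')
print(url)
```

Raising a `ValueError` for anything that does not look like an ARK makes bad rows easy to spot when the function is applied to a whole column.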

## Access Images of Items in the Collection

The images in Colenda for these items are available through the [**International Image Interoperability Framework (IIIF)**](https://iiif.io/), a set of standards that makes images accessible and interoperable between image repositories.

To display those images, we'll write two **functions**:
* `_src_from_data` encodes image bytes as a base64 data URI so they can be embedded in an HTML `img` element, and
* `gallery` shows a set of images in a gallery that flexes with the width of the notebook.

These functions for working with IIIF images come from [BVMC Labs](http://data.cervantesvirtual.com/). 

In [18]:
# Encode image bytes for inclusion in an HTML img element
def _src_from_data(data):
    img_obj = Image(data=data)
    for bundle in img_obj._repr_mimebundle_():
        for mimetype, b64value in bundle.items():
            if mimetype.startswith('image/'):
                return f'data:{mimetype};base64,{b64value}'

#  Shows a set of images in a gallery that flexes with the width of the notebook.
def gallery(dictionary, row_height='auto'):
    figures = []
    for image, label in dictionary.items():
        src = image
        figures.append(f'''<figure style="margin: 5px !important;">
        <img src="{src}" 
        style="height: {row_height}">
        <figcaption style="font-size: 1em">{label}</figcaption>
        </figure>''')
    return HTML(data=f'''<div style="display: flex; flex-flow: row wrap; text-align: center;">{''.join(figures)}</div>''')
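To make the idea behind `_src_from_data` more concrete, here is a minimal sketch of data URI encoding using only the standard library. The `data_uri` function is a hypothetical stand-in, not the function above, and the bytes are just the eight-byte PNG signature rather than a real image.

```python
import base64

def data_uri(image_bytes, mimetype='image/png'):
    """Hypothetical helper: embed raw image bytes in a data URI."""
    # base64 turns arbitrary bytes into plain ASCII text
    encoded = base64.b64encode(image_bytes).decode('ascii')
    return f'data:{mimetype};base64,{encoded}'

# The first eight bytes of every PNG file, used here as stand-in data
png_signature = b'\x89PNG\r\n\x1a\n'
uri = data_uri(png_signature)
print(uri)
```

A string like this can be used as the `src` of an `img` element, so the image travels inside the HTML instead of being fetched from a URL.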
    

Let's take a look at the images of the clock. 

In [19]:
# Requests is a Python package that allows you to send HTTP/1.1 requests.
import requests

# Create a string that is the link to the item-specific IIIF manifest. 
manifest = "https://colenda.library.upenn.edu/phalt/iiif/2/" + identifier + "/manifest"

# Get the manifest
r = requests.get(manifest)

# Get the information about all the images for this item as a list 
results = r.json()["sequences"][0]['canvases']

# Create a dictionary to collect each image URL (key) and corresponding label (value) for this item. 
imagesDict = {}

# Iterate over each canvas in the results list to extract the image URL and label, adding them to the dictionary above
for canvas in results:
    label = canvas['label']
    resource = canvas['images'][0]['resource']
    images = resource['@id']
    imagesDict[images] = label
    
# Display the images as a gallery    
gallery(imagesDict, row_height='150px')
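The manifest parsing can also be tried without a network connection. The sketch below assumes the same nesting the cell above relies on (`sequences`, then `canvases`, then `images`, then `resource`); `canvas_images` is a hypothetical helper and `toy_manifest` is made up. Unlike the cell above, it walks every sequence and every image on a canvas, and it skips entries that lack a URL instead of raising an error.

```python
def canvas_images(manifest_json):
    """Hypothetical helper: map each image URL to the label of its canvas."""
    images = {}
    for sequence in manifest_json.get('sequences', []):
        for canvas in sequence.get('canvases', []):
            for image in canvas.get('images', []):
                # Missing keys yield None here rather than a KeyError
                url = image.get('resource', {}).get('@id')
                if url:
                    images[url] = canvas.get('label', '')
    return images

# Made-up manifest fragment with the same nesting as the one requested above
toy_manifest = {
    'sequences': [{
        'canvases': [
            {'label': 'Front', 'images': [{'resource': {'@id': 'https://example.org/front.jpg'}}]},
            {'label': 'Back', 'images': [{'resource': {'@id': 'https://example.org/back.jpg'}}]},
            {'label': 'Empty', 'images': []},
        ]
    }]
}

toy_images = canvas_images(toy_manifest)
print(toy_images)
```

The resulting dictionary has the same shape as `imagesDict`, so it could be passed to `gallery` in the same way.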

Nice work! We can now use these basic instructions to explore more aspects of the collection.

# Need Help?
<div class="alert alert-block alert-warning">
    <p>For additional Python and Digital Scholarship resources:</p>
    <ul>
        <li><a href="https://www.w3schools.com/python/pandas/default.asp">Pandas Tutorial from W3 Schools</a></li>
        <li><a href="https://altair-viz.github.io/altair-tutorial/README.html">Altair Tutorial</a></li>
        <li><a href="https://guides.library.upenn.edu/digital-scholarship">Center for Research Data and Digital Scholarship</a></li>
    </ul>
    <p>For help with this notebook:</p>    
<ul>
    <li>If you encounter any errors in this notebook, you can open an issue on GitHub or email estene@upenn.edu and reference this notebook.</li>

<li>If you encounter any errors while working with the collection metadata (an incorrect date or broken ARK identifier), you can email estene@upenn.edu.</li>

<li>Colenda is still a beta service. If you encounter issues accessing any of the IIIF images or links, visit
    <a href="https://colenda.library.upenn.edu/">Colenda</a>.</li>
    </ul>
</div>

----

# Credits

Created by [Emily Esten](https://www.library.upenn.edu/people/staff/emily-esten). 

Judaica Digital Humanities at the <a href="http://library.upenn.edu">Penn Libraries</a> (also referred to as Judaica DH) is a robust program of projects and tools for experimental digital scholarship with Judaica collections, informed by digital humanities, Jewish studies, and cultural heritage approaches. Visit our [website](https://judaicadh.library.upenn.edu).

The pre-harvested dataset for this notebook draws on items from the **Arnold and Deanne Kaplan Collection of Early American Judaica**. Donated to the University of Pennsylvania Libraries in 2012 by the Kaplans, and growing each year, this collection teaches us about the everyday lives, families, communal institutions, religious organizations, voluntary associations, businesses, and political circumstances of Jewish life throughout the western hemisphere over four centuries. More information about the collection can be found at [https://kaplan.exhibits.library.upenn.edu](https://kaplan.exhibits.library.upenn.edu).

This notebook references existing code and Jupyter notebooks, including: 
* [GLAM Workbench for the National Museum of Australia](https://doi.org/10.5281/zenodo.3544747) sponsored by the [Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab](https://tinker.edu.au/).
* [Library of Congress Data Exploration: IIIF](https://github.com/LibraryOfCongress/data-exploration/blob/26510c3f4da0bc85dfa87e82141173b1830e9d64/IIIF.ipynb).
* Gustavo Candela, María Dolores Sáez, Pilar Escobar, Manuel Marco-Such, & Rafael C. Carrasco. (2020, May 8). hibernator11/notebook-iiif-images: release1.1 (Version 1.1). Zenodo. [http://doi.org/10.5281/zenodo.3816611](https://zenodo.org/badge/latestdoi/255172461).
* [Genes for Project Cognoma](https://github.com/cognoma/genes/blob/721204091a96e55de6dcad165d6d8265e67e2a48/2.process.py)
* https://mindtrove.info/jupyter-tidbit-image-gallery/