# Exploring Colenda Records

In this notebook we'll have a preliminary look at the data harvested from the [Colenda Digital Repository at Penn Libraries](https://colenda.library.upenn.edu/). I'll focus here on the basic shape/stats of the data. Other notebooks will explore data from Colenda over [time](kaplan_explore_time.ipynb) and [space](kaplan_explore_places.ipynb).

If you haven't already, you'll need to [download a pre-harvested dataset](unzip_preharvested_data.ipynb) for use with this notebook. 

* [Import What We Need](#Import-What-We-Need)
* [Load the Data](#Load-the-Data)
* [The Shape of the Data](#The-Shape-of-the-Data)
* [Concatenate and Split Columns](#Concatenate-and-Split-Columns)
* [The `metadata.item_type` Field](#The-metadata.item_type-Field)
* [Access Images of Items in the Collection](#Access-Images-of-Items-in-the-Collection)
* [Credits](#Credits)

<div class="alert alert-block alert-warning">
<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>

<p>
    Some tips:
    <ul>
        <li>Code cells have boxes around them.</li>
        <li>To run a code cell click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>
        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.</li>
        <li>In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.</li>
        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>
    </ul>
</p>

<p><b>Is this thing on?</b> If you can't edit or run any of the code cells, you might be viewing a static (read only) version of this notebook. Click here to <a href="https://mybinder.org/v2/gh/GLAM-Workbench/national-museum-australia/master?urlpath=lab%2Ftree%2Fexploring_object_records.ipynb">load a <b>live</b> version</a> running on Binder.</p>

</div>

## Import What We Need

In [1]:
# Pandas is a Python package that provides numerous tools for data analysis. 
import pandas as pd

# IPython.display provides helpers for showing rich content (HTML, file links, images) in the notebook.
from IPython.display import display, HTML, FileLink, Image

## Load the Data

This pre-harvested dataset from Colenda includes every gift of Arnold and Deanne Kaplan, which covers two collections. In this notebook we will only work with records from the Arnold and Deanne Kaplan Collection of **Early American Judaica**. We can access those items by using the `metadata.collection[1]` column and filtering on the Early American Judaica collection. 


In [2]:
# Read the CSV file into a dataframe
df = pd.read_csv("kaplan-style.csv", encoding='unicode_escape')

# Print the number of rows in the dataframe
print('There are {:,} items in this dataset from Colenda.'.format(df.shape[0]))

There are 9,418 items in this dataset from Colenda.



In [3]:
# Filter for items from the Early American Judaica Collection. 
df = df.loc[df['metadata.collection[1]'] == "Arnold and Deanne Kaplan Collection of Early American Judaica (University of Pennsylvania)"]

# Return the first 5 rows of the dataframe.
df.head()

Unnamed: 0,action,metadata.call_number[1],metadata.collection[1],metadata.contributor[1],metadata.corporate_name[1],metadata.corporate_name[2],metadata.corporate_name[3],metadata.corporate_name[4],metadata.corporate_name[5],metadata.corporate_name[6],...,metadata.subject[3],metadata.subject[4],metadata.subject[5],metadata.subject[6],metadata.subject[7],metadata.subject[8],metadata.subject[9],metadata.title[1],structural.filenames,unique_identifier
2,MIGRATE,Arc.MS.56,Arnold and Deanne Kaplan Collection of Early A...,,Baum & Bernstein,,,,,,...,Trade cards (advertising),,,,,,,"Trade card; Baum & Bernstein; Meriden, Connect...",tc_br9s56_1r.tif; tc_br9s56_1v.tif,ark:/81431/p3000003f
3,MIGRATE,Arc.MS.56,Arnold and Deanne Kaplan Collection of Early A...,,,,,,,,...,Family papers,Manuscripts (documents),,,,,,"Letter; Tobias, Henry; Liverpool, United Kingd...",p3000006w_0001.tif; p3000006w_0002.tif; p30000...,ark:/81431/p3000006w
4,MIGRATE,Arc.MS.56,Arnold and Deanne Kaplan Collection of Early A...,,I. H. Brounstein's One Price Clothing House,,,,,,...,Jewish merchants,Clothing trade,,,,,,Trade card; I. H. Brounstein's One Price Cloth...,p3000013w_body0001.tif; p3000013w_body0002.tif,ark:/81431/p3000013w
5,MIGRATE,Arc.MS.56,Arnold and Deanne Kaplan Collection of Early A...,,Kast's Fine Shoes,,,,,,...,Trade cards (advertising),,,,,,,"Trade card; Kast's Fine Shoes; San Francisco, ...",tc_br7s42_3r.tif; tc_br7s42_3v.tif,ark:/81431/p3000019s
6,MIGRATE,Arc.MS.56,Arnold and Deanne Kaplan Collection of Early A...,,Mandel Bro's,,,,,,...,Dry-goods,Clothing trade,Jewish merchants,Trade cards (advertising),,,,Trade card; Mandel Bro's; undated,tc_br10s17_2r.tif; tc_br10s17_2v.tif,ark:/81431/p3000020w


## The Shape of the Data

In [4]:
print('There are {:,} items in Colenda from the Arnold and Deanne Kaplan Collection of Early American Judaica (University of Pennsylvania).'.format(df.shape[0]))

There are 8,494 items in Colenda from the Arnold and Deanne Kaplan Collection of Early American Judaica (University of Pennsylvania).


Now that we have this dataset in a dataframe, we can manipulate it. This dataset contains **descriptive metadata** about the items in the collection, which allow us to discover and identify items. What columns are in this dataframe?

In [5]:
# Retrieve the column names and return them as a list
df.columns.to_list()

['action',
 'metadata.call_number[1]',
 'metadata.collection[1]',
 'metadata.contributor[1]',
 'metadata.corporate_name[1]',
 'metadata.corporate_name[2]',
 'metadata.corporate_name[3]',
 'metadata.corporate_name[4]',
 'metadata.corporate_name[5]',
 'metadata.corporate_name[6]',
 'metadata.date[1]',
 'metadata.date[2]',
 'metadata.date[3]',
 'metadata.description[1]',
 'metadata.format[1]',
 'metadata.format[2]',
 'metadata.format[3]',
 'metadata.format[4]',
 'metadata.geographic_subject[1]',
 'metadata.geographic_subject[2]',
 'metadata.geographic_subject[3]',
 'metadata.geographic_subject[4]',
 'metadata.geographic_subject[5]',
 'metadata.geographic_subject[6]',
 'metadata.geographic_subject[7]',
 'metadata.geographic_subject[8]',
 'metadata.geographic_subject[9]',
 'metadata.geographic_subject[10]',
 'metadata.geographic_subject[11]',
 'metadata.geographic_subject[12]',
 'metadata.geographic_subject[13]',
 'metadata.geographic_subject[14]',
 'metadata.geographic_subject[15]',
 'meta

That is a long list of columns! Not every item has a value for every column. Let's create a quick count of the number of values in each column.

In [6]:
# Count non-NA cells for each column
df.count()

action                        8494
metadata.call_number[1]       8493
metadata.collection[1]        8494
metadata.contributor[1]          3
metadata.corporate_name[1]    6331
                              ... 
metadata.subject[8]             31
metadata.subject[9]             12
metadata.title[1]             8494
structural.filenames          8494
unique_identifier             8494
Length: 100, dtype: int64

Let's express those counts as a percentage of the total number of records, and display them as a bar chart using Pandas.

In [7]:
# Get the counts for each column and convert to a new dataframe
field_counts = df.count().to_frame().reset_index()

# Change column headings
field_counts.columns = ['Field', 'Count']

# Calculate proportion of the total
field_counts['Proportion'] = field_counts['Count'].apply(lambda x: x / df.shape[0])

# Style the results as a barchart
field_counts.style.bar(subset=['Proportion'], color='#d65f5f').format({'Proportion': '{:.2%}'.format})

Unnamed: 0,Field,Count,Proportion
0,action,8494,100.00%
1,metadata.call_number[1],8493,99.99%
2,metadata.collection[1],8494,100.00%
3,metadata.contributor[1],3,0.04%
4,metadata.corporate_name[1],6331,74.53%
5,metadata.corporate_name[2],565,6.65%
6,metadata.corporate_name[3],29,0.34%
7,metadata.corporate_name[4],7,0.08%
8,metadata.corporate_name[5],3,0.04%
9,metadata.corporate_name[6],0,0.00%
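
The same proportions can also be computed in one step, because the mean of a True/False mask is the share of `True` values. Here is a minimal sketch on a made-up four-row dataframe; the column names and values are invented for illustration and are not taken from the Colenda data.

```python
import pandas as pd

# Toy data standing in for the Colenda dataframe (values are invented)
toy = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'corporate_name': ['X', None, 'Y', None],
    'contributor': [None, None, None, 'Z'],
})

# notna() gives True/False per cell; the mean of booleans is the share of True
proportions = toy.notna().mean()

# Same layout as field_counts: one row per field
toy_counts = pd.DataFrame({
    'Field': proportions.index,
    'Count': toy.count().values,
    'Proportion': proportions.values,
})
print(toy_counts)
```

This avoids the row-by-row `apply` and gives the same numbers, which may be handy on larger datasets.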


## Concatenate and Split Columns

You may notice that some columns appear multiple times, distinguished by a number in square brackets at the end of the name. For example, there are two `metadata.item_type` columns, indicating that some items have two item types. For comparative and quantitative analysis, we may need to split those values into multiple rows instead of multiple columns.

How many items have more than one item type?

In [8]:
# Count how many rows are not blank in the 'metadata.item_type[2]' column
df['metadata.item_type[2]'].count()

48

Let's take a look at those items.

In [9]:
# Create a filtered dataframe by 'metadata.item_type[2]', including only those that have data in that column
df1 = df[df['metadata.item_type[2]'].notnull()]

# Return the first 5 lines of the df1 dataframe
df1.head()

Unnamed: 0,action,metadata.call_number[1],metadata.collection[1],metadata.contributor[1],metadata.corporate_name[1],metadata.corporate_name[2],metadata.corporate_name[3],metadata.corporate_name[4],metadata.corporate_name[5],metadata.corporate_name[6],...,metadata.subject[3],metadata.subject[4],metadata.subject[5],metadata.subject[6],metadata.subject[7],metadata.subject[8],metadata.subject[9],metadata.title[1],structural.filenames,unique_identifier
2031,MIGRATE,Arc.MS.56,Arnold and Deanne Kaplan Collection of Early A...,,Savannah Republican,,,,,,...,Jewish merchants,General stores,,,,,,Broadside; Letter; Cohen; Savannah Republican;...,p37p8tc5c_0001.tif; p37p8tc5c_0002.tif; p37p8t...,ark:/81431/p37p8tc5c
7308,MIGRATE,Arc.MS.56,Arnold and Deanne Kaplan Collection of Early A...,,Pollak Bros.,,,,,,...,Jewish merchants,Jewelry trade,Jewelry stores,,,,,"Envelope; Pollak, Chas.; Pollak Bros.; Kansas ...",doc_bhl_mo_324_1r.tif; doc_bhl_mo_324_1v.tif; ...,ark:/81431/p3374v
7313,MIGRATE,Arc.MS.56,Arnold and Deanne Kaplan Collection of Early A...,,New Orleans Wholesale Price Current,Benjamin Levy,,,,,...,Jewish printers,Printing industry,Lists (document genres),,,,,"Periodical; Levy, Benjamin; New Orleans Wholes...",lbr_fl_s26_1r.tif; lbr_fl_s26_1v.tif; lbr_fl_s...,ark:/81431/p33b1r
7346,MIGRATE,Arc.MS.56,Arnold and Deanne Kaplan Collection of Early A...,,Levensohn & Galland,,,,,,...,Food industry and trade,General stores,Dry-goods,Clothing trade,,,,"Envelope; Levensohn, Mayer; Levensohn & Gallan...",doc_bhe_ca_753_1r.tif; doc_bhe_ca_753_1v.tif,ark:/81431/p33q9k
7378,MIGRATE,Arc.MS.56,Arnold and Deanne Kaplan Collection of Early A...,,S. M. Rosenbaum,Fry & Stebbins,,,,,...,Jewish merchants,Money,,,,,,"Envelope; Rosenbaum, S. M.; S. M. Rosenbaum; R...",doc_bhe_va_3044_1r.tif; doc_bhe_va_3044_1v.tif...,ark:/81431/p3418m


Because the second item type sits in a separate column, counting items by type would miss it. We can fix that! We need to split the values in those cells into individual rows.

These functions help us do that: `tidy_split` splits the values of each cell so that there is one split value per row, and `tidy_concat` concatenates the values of columns that begin with a similar phrase into one cell before using `tidy_split`.

The `tidy_split` function comes from [Project Cognoma](http://cognoma.org/).

In [10]:
# Split the values of a column and expand so that the new DataFrame has one split value per row
# Filters rows where column is empty 
def tidy_split(df, column, sep='|', keep=False):
    """
    Params
    ------
    df : pandas.DataFrame
        dataframe with the column to split and expand
    column : str
        the column to split and expand
    sep : str
        the string used to split the column's values
    keep : bool
        whether to retain the presplit value as its own row

    Returns
    -------
    pandas.DataFrame
        Returns a dataframe with the same columns as `df`.
    """
    indexes = list()
    new_values = list()
    df = df.dropna(subset=[column])
    for i, presplit in enumerate(df[column].astype(str)):
        values = presplit.split(sep)
        if keep and len(values) > 1:
            indexes.append(i)
            new_values.append(presplit)
        for value in values:
            indexes.append(i)
            new_values.append(value)
    new_df = df.iloc[indexes, :].copy()
    new_df[column] = new_values
    return new_df
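
To see what "one split value per row" means in practice, here is a minimal sketch that does the same kind of reshaping with the built-in `Series.str.split` and `DataFrame.explode` (available in pandas 0.25 and later). It is an alternative illustration, not the function used in this notebook, and the toy values are invented.

```python
import pandas as pd

# Toy data: one cell holds two values joined by '|' (values are invented)
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'item_type': ['Envelopes|Letters', 'Trade cards', None],
})

# Drop empty cells, split each string into a list,
# then give each list element its own row
exploded = (
    toy.dropna(subset=['item_type'])
       .assign(item_type=lambda d: d['item_type'].str.split('|'))
       .explode('item_type')
)
print(exploded)
```

The row with `id` 1 appears twice in the result, once per item type, and the row with the empty cell is dropped, which mirrors what `tidy_split` does.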

In [11]:
# Concatenate the values of columns beginning with a string, then use the tidy_split function to expand so that the new DataFrame has one split value per row
def tidy_concat(df, column_starts_with, sep="|"):
    """
    Params
    ------
    df : pandas.DataFrame
        dataframe with the columns to split and expand
    column_starts_with : str
        the string at the beginning of the column(s) to split
    sep : str
        the string used to split the column's values

    Returns
    -------
    pandas.DataFrame
        Returns a dataframe in which the numbered columns are replaced by a
        single `column_starts_with` column holding one value per row.
    """
    # Work on a copy so the caller's dataframe is not modified
    df = df.copy()
    list_of_columns = df.columns.to_list()
    columns_to_concat = [x for x in list_of_columns if x.startswith(column_starts_with)]
    df[column_starts_with] = df[columns_to_concat[0]]
    for column in columns_to_concat[1:]:
        # Note: astype(str) turns empty cells into the text 'nan'
        df[column_starts_with] = df[column_starts_with].astype(str) + sep + df[column].astype(str)
    # Pass the caller's separator through instead of hard-coding '|'
    new_df = tidy_split(df, column_starts_with, sep=sep)
    new_df = new_df.drop(columns_to_concat, axis=1)
    return new_df

In [12]:
# Use the function to combine the numbered metadata.item_type columns and expand them so that the new DataFrame has one item type per row
df = tidy_concat(df, 'metadata.item_type', sep='|')

# Report the dimensionality of the dataframe (number of rows, number of columns)
df.shape

(16988, 99)
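
The row count above is exactly twice the 8,494 filtered records. That is what you would expect when `astype(str)` turns each empty cell into the text `'nan'`, which then survives the split as a row of its own. One way to avoid this, sketched here on invented data, is to reshape with `DataFrame.melt` and drop empty cells before anything is converted to text. The column names mimic the dataset, but the values are made up.

```python
import pandas as pd

# Toy data with two numbered columns (values are invented)
toy = pd.DataFrame({
    'unique_identifier': ['a', 'b', 'c'],
    'metadata.item_type[1]': ['Envelopes', 'Trade cards', 'Letters'],
    'metadata.item_type[2]': ['Letters', None, None],
})

prefix = 'metadata.item_type'
numbered = [c for c in toy.columns if c.startswith(prefix)]
other = [c for c in toy.columns if c not in numbered]

long_df = (
    toy.melt(id_vars=other, value_vars=numbered, value_name=prefix)
       .drop(columns='variable')            # the source column name is not needed
       .dropna(subset=[prefix])             # drop empty cells while they are still NaN
       .sort_values(other, kind='mergesort')  # stable sort keeps each item's types in order
       .reset_index(drop=True)
)
print(long_df)
```

With this approach only real values become rows, so three items with four item types between them yield four rows.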

## The `metadata.item_type` Field

The `metadata.item_type` field refers to the type of item: books, manuscripts, sound recordings, etc. Let's look at the 25 most common item types in the collection.

In [13]:
# Return the 25 most common values in the metadata.item_type column, with a count for each
df['metadata.item_type'].value_counts()[:25]

nan                       8446
Trade cards               3845
Letters                    978
Billheads                  485
Periodicals                377
Receipts                   294
Billhead                   293
Envelopes                  192
Letterheads                181
Pamphlets                  155
Broadsides                 150
Monetary                   141
Deeds                      114
Cartes-de-visite           107
Legal documents             97
Ports of entry              90
Negotiable instruments      67
Official documents          66
Miscellaneous               63
Court records               55
Invitations                 41
Photographs                 36
Sheet music                 34
Manuscripts                 34
Trade tokens                34
Name: metadata.item_type, dtype: int64

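`nan` is not a real item type: it is the text left behind when an empty `metadata.item_type[2]` cell was converted to a string during concatenation. How many item types only appear once?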

In [14]:
# Create a new dataframe called type_counts, with a 'type' column and a 'count' column
# (rename_axis and reset_index(name=...) give the same column names across pandas versions)
type_counts = df['metadata.item_type'].value_counts().rename_axis('type').reset_index(name='count')

# Locate the rows that have a 'unique' type, or a count of 1
unique_types = type_counts.loc[type_counts['count'] == 1]

# Print the number of item types that appear only once
print('There are {:,} item types that appear only once in the collection.'.format(unique_types.shape[0]))

There are 57 item types that appear only once in the collection.


Let's save the complete list of types as a CSV file.

In [15]:
# Write the type_counts dataframe to a comma-separated values (csv) file.
type_counts.to_csv('colenda_item_type_counts.csv', index=False)

# Display a link to the CSV.
display(FileLink('colenda_item_type_counts.csv'))

Browsing the CSV, I noticed that there was one item with the type `Clocks`. Let's find out more about it.

In [16]:
# Find the rows in the complete dataset whose item type contains 'Clocks' (na=False skips empty values)
clock = df[df['metadata.item_type'].str.contains('Clocks', na=False, regex=False)]
clock

Unnamed: 0,action,metadata.call_number[1],metadata.collection[1],metadata.contributor[1],metadata.corporate_name[1],metadata.corporate_name[2],metadata.corporate_name[3],metadata.corporate_name[4],metadata.corporate_name[5],metadata.corporate_name[6],...,metadata.subject[4],metadata.subject[5],metadata.subject[6],metadata.subject[7],metadata.subject[8],metadata.subject[9],metadata.title[1],structural.filenames,unique_identifier,metadata.item_type
850,MIGRATE,Arc.MS.56,Arnold and Deanne Kaplan Collection of Early A...,,J. Warshawsky,,,,,,...,Jewelry stores,,,,,,Clock; J. Warshawsky; undated,p3319s54c_001.tif; p3319s54c_002.tif,ark:/81431/p3319s54c,Clocks


We can create a link to the item's record in Colenda using its `unique_identifier`. The value in this column is an [Archival Resource Key identifier](https://n2t.net/e/ark_ids.html), designed to support long-term access to information objects. This identifier can be divided into three parts, separated by `/`: the ARK label (`ark:`), the Name Assigning Authority Number (NAAN) that identifies the organization that assigned the identifier, and the name that organization gave to the item.

To create the link, we only need the second and third part of the unique identifier. 

In [17]:
# Select the first row in the dataframe
identifier = clock.iloc[0]['unique_identifier']

# Split the string at the first two occurrences of "/" and join all but the first element with a hyphen
identifier = "-".join(identifier.split("/", 2)[1:])

# Display the link to the item in Colenda, with the item-specific URL and the item's title as the hyperlinked text 
display(HTML('<a href="https://colenda.library.upenn.edu/catalog/{}">{}</a>'.format(identifier, clock.iloc[0]['metadata.title[1]'])))
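
Here is the same transformation wrapped in a small function, so you can see it work on a single identifier without the dataframe. The function name `ark_to_colenda_id` is hypothetical, and the catalog URL pattern is the one used in the cell above.

```python
def ark_to_colenda_id(ark):
    """Turn 'ark:/<middle>/<name>' into '<middle>-<name>' (hypothetical helper)."""
    # Split at the first two slashes only, giving exactly three parts
    label, middle, name = ark.split('/', 2)
    if label != 'ark:':
        raise ValueError(f'Not an ARK identifier: {ark!r}')
    return f'{middle}-{name}'

# The clock's identifier from the table above
example = ark_to_colenda_id('ark:/81431/p3319s54c')
print(example)
print(f'https://colenda.library.upenn.edu/catalog/{example}')
```

Checking the label first means a value that is not an ARK raises an error instead of quietly producing a broken link.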

## Access Images of Items in the Collection

The images in Colenda for these items are available under the [**International Image Interoperability Framework (IIIF)**](https://iiif.io/), which makes these images accessible and interoperable between image repositories. 
Let's take a look at the images of the clock. 

These functions for working with IIIF images come from [BVMC Labs](http://data.cervantesvirtual.com/). 

In [18]:
# Encode image bytes for inclusion in an HTML img element
def _src_from_data(data):
    img_obj = Image(data=data)
    for bundle in img_obj._repr_mimebundle_():
        for mimetype, b64value in bundle.items():
            if mimetype.startswith('image/'):
                return f'data:{mimetype};base64,{b64value}'

#  Shows a set of images in a gallery that flexes with the width of the notebook.
def gallery(dictionary, row_height='auto'):
    figures = []
    for image, label in dictionary.items():
        src = image
        figures.append(f'''<figure style="margin: 5px !important;">
        <img src="{src}" 
        style="height: {row_height}">
        <figcaption style="font-size: 1em">{label}</figcaption>
        </figure>''')
    return HTML(data=f'''<div style="display: flex; flex-flow: row wrap; text-align: center;">{''.join(figures)}</div>''')
    

In [19]:
# Requests is a Python package that allows you to send HTTP/1.1 requests.
import requests

# Create a string that is the link to the item-specific IIIF manifest. 
manifest = "https://colenda.library.upenn.edu/phalt/iiif/2/" + identifier + "/manifest"

# Get the manifest, stopping with an error if the request fails
r = requests.get(manifest)
r.raise_for_status()

# Get the information about all the images for this item as a list 
results = r.json()["sequences"][0]['canvases']

# Create a dictionary to collect each image URL (key) and corresponding label (value) for this item. 
imagesDict = {}

# Iterate over each canvas in the results list, extracting the image URL and label and adding them to the dictionary above
for canvas in results:
    label = canvas['label']
    resource = canvas['images'][0]['resource']
    imagesDict[resource['@id']] = label
    
# Display the images as a gallery    
gallery(imagesDict, row_height='150px')
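
Images served through the IIIF Image API can be requested at different sizes using URLs of the form `{base}/{region}/{size}/{rotation}/{quality}.{format}`. The sketch below builds a thumbnail URL from a canvas. It assumes a Presentation API 2.x canvas shaped like the ones used above, and it assumes the image resource carries a `service` entry, which has not been checked against Colenda's manifests. The function name and the sample canvas are hypothetical.

```python
def iiif_thumbnail_url(canvas, width=200):
    """Build an IIIF Image API URL that scales the image to `width` pixels wide."""
    resource = canvas['images'][0]['resource']
    # Assumption: the resource advertises an Image API service with a base '@id'
    base = resource['service']['@id'].rstrip('/')
    # region=full, size='<width>,' (height scales to match), rotation=0, quality=default
    return f'{base}/full/{width},/0/default.jpg'

# Invented canvas for illustration; example.org is a placeholder, not a real endpoint
sample_canvas = {
    'label': 'Front',
    'images': [{
        'resource': {
            '@id': 'https://example.org/iiif/abc/full/full/0/default.jpg',
            'service': {'@id': 'https://example.org/iiif/abc'},
        }
    }],
}

thumb = iiif_thumbnail_url(sample_canvas, width=150)
print(thumb)
```

Requesting a smaller size this way lets the server do the scaling, so a gallery of many images would load faster than one that downloads full-size files.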

Nice work! We can now use these basic instructions to explore more aspects of the collection.

----

## Credits

Created by [Emily Esten](https://www.library.upenn.edu/people/staff/emily-esten). 

Judaica Digital Humanities at the <a href="http://library.upenn.edu">Penn Libraries</a> (also referred to as Judaica DH) is a robust program of projects and tools for experimental digital scholarship with Judaica collections, informed by digital humanities, Jewish studies, and cultural heritage approaches. Visit our [website](https://judaicadh.library.upenn.edu).


This notebook references existing code and Jupyter notebooks, including: 
* [GLAM Workbench for the National Museum of Australia](https://doi.org/10.5281/zenodo.3544747) sponsored by the [Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab](https://tinker.edu.au/).
* [Library of Congress Data Exploration: IIIF](https://github.com/LibraryOfCongress/data-exploration/blob/26510c3f4da0bc85dfa87e82141173b1830e9d64/IIIF.ipynb).
* Gustavo Candela, María Dolores Sáez, Pilar Escobar, Manuel Marco-Such, & Rafael C. Carrasco. (2020, May 8). hibernator11/notebook-iiif-images: release1.1 (Version 1.1). Zenodo. [http://doi.org/10.5281/zenodo.3816611](https://zenodo.org/badge/latestdoi/255172461).
* [Genes for Project Cognoma](https://github.com/cognoma/genes/blob/721204091a96e55de6dcad165d6d8265e67e2a48/2.process.py)