## Welcome!

[OPenn](https://openn.library.upenn.edu/) contains complete sets of high-resolution archival images of manuscripts from the University of Pennsylvania Libraries and other institutions, along with machine-readable TEI P5 descriptions and technical metadata. All materials on this site are in the public domain or released under Creative Commons licenses as Free Cultural Works.

In this notebook we'll have a preliminary look at metadata harvested from OPenn. How are manuscripts distributed over time? How are manuscripts described? Which places are associated with manuscripts? To answer these questions and more, we'll extract relevant data, see what's there, and create visualizations. 

This notebook works with a CSV file created in the [introductory notebook](). Go back for an introduction to accessing OPenn images and metadata, or [skip this step to work with manuscript images]().

* [Import What We Need](#Import-What-We-Need)
* [Load the Data](#Load-the-Data)
* [Build a Gantt Chart](#Build-a-Gantt-Chart)
* [Build a Stacked Bar Chart](#Build-a-Stacked-Bar-Chart)
* [Need Help?](#Need-Help?)
* [Credits](#Credits)

<div class="alert alert-block alert-warning">
<p><b>Yellow blocks like this provide additional information about Python and Jupyter notebooks.</b></p>
    
<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>

<p>
    Some tips:
    <ul>
        <li>Code cells have boxes around them.</li>
        <li>To run a code cell click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>
        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.</li>
        <li>In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.</li>
        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>
    </ul>
</p>

<p><b>Is this thing on?</b> If you can't edit or run any of the code cells, you might be viewing a static (read only) version of this notebook. Click here to <a href="https://mybinder.org/v2/gh/GLAM-Workbench/national-museum-australia/master?urlpath=lab%2Ftree%2Fexplore_collection_object_over_time.ipynb">load a <b>live</b> version</a> running on Binder.</p>
</div>

## Import What We Need


<div>
    <p>In order to use this notebook, you first need to <code>import</code> Python modules and packages.</p>
<div class="alert alert-block alert-warning">
<p>These modules and packages are units of code that provide specific tools used in this notebook. If you're running this notebook on your own computer, you may first need to install these packages in your Python environment. Find assistance for that <a href="https://packaging.python.org/tutorials/installing-packages/">here</a>.</p>
    </div>
</div>

In [29]:
# PIP is a package manager for Python packages. This command installs the list of libraries contained in the `requirements.txt` file.
!pip install -r requirements.txt

# Pandas is a Python package that provides numerous tools for data analysis
import pandas as pd

# Altair is a Python library for declarative statistical visualization
import altair as alt

# Country converter (coco) is a Python package to convert and match country names between different classifications
import country_converter as coco

# Vega Datasets is a Python package for offline access to vega datasets.
from vega_datasets import data as vega_data

# Geopandas is a Python package for working with geospatial data
import geopandas

# Geopy is a Python client for several popular geocoding web services
import geopy

# From Geopy, import AsyncRateLimiter & RateLimiter to perform bulk operations gracefully
from geopy.extra.rate_limiter import AsyncRateLimiter, RateLimiter

# From Geopy, import Nominatim, a specific geocoder for OpenStreetMap data
from geopy.geocoders import Nominatim

# NumPy is a library for adding support for manipulating and operating on large, multi-dimensional arrays
import numpy as np

# Time is a module for handling time-related tasks.
import time

# Altair Saver saves a chart to an external file
from altair_saver import save


ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'

In [30]:
# Read a comma-separated values (csv) file into a Pandas dataframe
collections_contents = pd.read_csv("/Users/estene/Documents/GitHub/collections-as-data-notebooks/testing notebooks/collections_contents_w_metadata.csv")

## Build a Gantt Chart 

Now that we've explored the dataset, we'll start by taking a look at the dates represented in the collection. 
For many manuscripts, it can be difficult to pinpoint an exact year of origin. A manuscript may have been completed over an extended period of time or drafted as separate manuscripts, so its date is often recorded as a range rather than a single year. Rather than build a traditional timeline, we'll build a Gantt chart to illustrate the start, end, and duration of each manuscript's date range.

In [31]:
# Fill NA/NaN values with a 0 value in the `OrigDate_Start` and `OrigDate_End` columns
collections_contents.fillna({'OrigDate_Start':0, 'OrigDate_End':0}, inplace=True)

# Cast the `OrigDate_Start` and `OrigDate_End` columns as integer columns
collections_contents['OrigDate_Start'] = collections_contents['OrigDate_Start'].astype(int)
collections_contents['OrigDate_End'] = collections_contents['OrigDate_End'].astype(int)

# Find the earliest and latest years in the collection (ignoring the 0 placeholder for missing start dates) and print them
earliest_year = collections_contents.loc[collections_contents['OrigDate_Start'] > 0, 'OrigDate_Start'].min()
latest_year = collections_contents['OrigDate_End'].max()
print(earliest_year, latest_year)

# Filter the dataframe for all non-zero values in the `OrigDate_Start` column
date_ranges = collections_contents[collections_contents['OrigDate_Start'] != 0]
# Return a random sample of 50 items from the date_ranges dataframe
sample_dates = date_ranges.sample(n = 50)

800 1920
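
Since each manuscript is dated with a range rather than a single year, it can help to check how wide those ranges are before charting them. The following is a minimal sketch on a made-up dataframe: the titles are invented, and `Span` is a hypothetical helper column that does not exist in the OPenn data.

```python
import pandas as pd

# Toy data for illustration only; column names mirror the notebook's dataframe
toy_dates = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'OrigDate_Start': [1400, 1200, 0],   # 0 stands in for a missing date
    'OrigDate_End': [1415, 1299, 0],
})

# Keep only rows with a known start year
dated = toy_dates[toy_dates['OrigDate_Start'] != 0].copy()

# `Span` is a hypothetical helper column: the length of each date range in years
dated['Span'] = dated['OrigDate_End'] - dated['OrigDate_Start']

print(dated[['title', 'OrigDate_Start', 'OrigDate_End', 'Span']])
```

A wide span corresponds to a long bar in the Gantt chart, so sorting by `Span` is one way to spot the manuscripts with the least certain dating.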


In [32]:
# Build a Gantt chart using Altair and the sample_dates dataframe
gantt = alt.Chart(sample_dates).mark_bar().encode(
    # Start each bar at the `OrigDate_Start` year on the X axis
    alt.X('OrigDate_Start',
    # Set the X axis scale to run from the earliest to the latest year
    scale=alt.Scale(domain=(earliest_year, latest_year))),
    # End each bar at the `OrigDate_End` year
    x2='OrigDate_End',
    # Place one bar per manuscript `path` on the Y axis
    y='path',
    # Show the start and end years on hover
    tooltip=[alt.Tooltip('OrigDate_Start:Q', title='Start Year'), alt.Tooltip('OrigDate_End:Q', title='End Year')]
)

# Display the Gantt Chart
gantt

In [33]:
# Save the chart
save(gantt, 'charts/gantt-chart-origDate.html')

## Create a Bar Chart

Because the year ranges of a manuscript can be difficult to identify, the BiblioPhilly cataloguers also noted the century (or centuries) to which a manuscript can be dated. This information is contained inside the `Keywords` column of the dataset. We'll build a Bar Chart to illustrate the number of manuscripts in the collection by century.

In [34]:
# Add a new column to the dataframe with a blank string as the value for each row 
collections_contents['Century'] = ''
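# Parse the `Keywords` column so that each cell value becomes a Python list
# (`ast.literal_eval` only evaluates literals, so it is safer than `eval`)
import ast
collections_contents['Keywords'] = collections_contents['Keywords'].apply(ast.literal_eval)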

# Iterate over each index/row in the dataframe 
for idx, row in collections_contents.iterrows():
    # For that row, make a list of all the keywords in the `Keywords` column that contain 'century', join the list by a pipe, and save to the `Century` column
    collections_contents.at[idx, 'Century'] = "|".join(map(str,[ele for ele in row.Keywords if "century" in ele]))

# Print the first 5 rows of the dataframe
collections_contents.head()

Unnamed: 0,curated_collection,document_id,path,repository_id,metadata_type,title,added,document_created,document_updated,Summary,...,Support Desc,Extent,Script Note,OrigDate_When,OrigDate_Start,OrigDate_End,OrigPlace,Keywords,ImgNames,Century
0,bibliophilly,4221,0023/lewis_e_018,23,TEI,Liber de vinis,2017-05-10T14:47:02+00:00,2017-05-10T14:27:01+00:00,2018-08-17T19:07:56+00:00,This manuscript is an early 15th-century Germa...,...,paper,ii+24+i; 187 x 135 mm,Gothic--cursiva,,1400,1415,Germany,"[15th century, German, Germany, Science -- Med...","['4221_0000_web.jpg', '4221_0001_web.jpg', '42...",15th century
1,bibliophilly,4222,0023/lewis_e_057,23,TEI,Carmen in honorem Beatae Mariae Virginis,2017-05-10T20:19:36+00:00,2017-05-10T18:22:19+00:00,2018-08-17T19:15:35+00:00,This manuscript contains fragments of a poem i...,...,parchment,iii+12+iii; 163 x 110 mm bound to 177 x 130 mm,Caroline minuscule,,1200,1299,Germany,"[13th century, German, Germany, Literature -- ...","['4222_0000_web.jpg', '4222_0001_web.jpg', '42...",13th century
2,bibliophilly,4223,0023/lewis_e_083,23,TEI,Historia belli civilis inter Caesarem et Pompeium,2017-05-10T20:19:42+00:00,2017-05-10T18:52:40+00:00,2018-08-17T19:18:33+00:00,This manuscript contains an account of the civ...,...,parchment,ii+30+ii; 165 x 110 mm bound to 170 x 120 mm,Gothic--textualis,,1440,1460,Italy,"[History, 15th century, Italian, Italy, Gothic...","['4223_0000_web.jpg', '4223_0001_web.jpg', '42...",15th century
3,bibliophilly,4225,0023/lewis_e_009,23,TEI,Processional; Astronomical Text binding fragment,2017-05-10T20:19:47+00:00,2017-05-10T19:10:48+00:00,2018-08-17T19:07:03+00:00,This manuscript is a late fifteenth-century pr...,...,parchment,51+i; 125 x 86 mm bound to 130 x 90 mm,Gothic--textualis quadrata,,1450,1499,Germany,"[Processional, Astrology, Science, 15th centur...","['4225_0000_web.jpg', '4225_0001_web.jpg', '42...",15th century|13th century|10th century
4,bibliophilly,4226,0023/lewis_e_003,23,TEI,Canon super almanach; De 12 signis et eorum na...,2017-05-10T20:19:52+00:00,2017-05-10T19:24:04+00:00,2018-08-17T19:05:44+00:00,This manuscript consists of a series of astron...,...,mixed,19+i; 217 x 145 mm bound to 233 x 160 mm,Gothic--textualis,,1340,1599,England,"[Astrology, Tables, Science, 14th century, 15t...","['4226_0000_web.jpg', '4226_0001_web.jpg', '42...",14th century|15th century
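
The row-by-row loop above can also be written with `Series.apply`, which avoids `iterrows`. This is an optional variation rather than the notebook's method, sketched on a small made-up dataframe:

```python
import pandas as pd

# Toy data for illustration only; each cell holds a list of keywords
toy = pd.DataFrame({
    'Keywords': [['15th century', 'German', 'Germany'],
                 ['Astrology', '14th century', '15th century']],
})

# Keep the keywords that mention 'century' and join them with a pipe
toy['Century'] = toy['Keywords'].apply(
    lambda kws: '|'.join(k for k in kws if 'century' in k)
)

print(toy['Century'].tolist())
```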


In [35]:
# Create a function that splits a dataframe column into rows

def tidy_split(df, column, sep='|', keep=False):
    """
    Params
    ------
    df : pandas.DataFrame
        dataframe with the column to split and expand
    column : str
        the column to split and expand
    sep : str
        the string used to split the column's values
    keep : bool
        whether to retain the presplit value as its own row

    Returns
    -------
    pandas.DataFrame
        Returns a dataframe with the same columns as `df`.
    """
    indexes = list()
    new_values = list()
    df = df.dropna(subset=[column])
    for i, presplit in enumerate(df[column].astype(str)):
        values = presplit.split(sep)
        if keep and len(values) > 1:
            indexes.append(i)
            new_values.append(presplit)
        for value in values:
            indexes.append(i)
            new_values.append(value)
    new_df = df.iloc[indexes, :].copy()
    new_df[column] = new_values
    return new_df

In [36]:
# Return the dimensionality (number of rows & columns) of the dataframe
collections_contents.shape

(3534, 22)

In [37]:
# Use the tidy_split function defined above to split the `Century` column into multiple rows
centuries_contents = tidy_split(collections_contents, "Century", sep='|', keep=False)

# Return the dimensionality (number of rows & columns) of the new dataframe
centuries_contents.shape

# Print the first 5 rows of the dataframe
centuries_contents.head()

Unnamed: 0,curated_collection,document_id,path,repository_id,metadata_type,title,added,document_created,document_updated,Summary,...,Support Desc,Extent,Script Note,OrigDate_When,OrigDate_Start,OrigDate_End,OrigPlace,Keywords,ImgNames,Century
0,bibliophilly,4221,0023/lewis_e_018,23,TEI,Liber de vinis,2017-05-10T14:47:02+00:00,2017-05-10T14:27:01+00:00,2018-08-17T19:07:56+00:00,This manuscript is an early 15th-century Germa...,...,paper,ii+24+i; 187 x 135 mm,Gothic--cursiva,,1400,1415,Germany,"[15th century, German, Germany, Science -- Med...","['4221_0000_web.jpg', '4221_0001_web.jpg', '42...",15th century
1,bibliophilly,4222,0023/lewis_e_057,23,TEI,Carmen in honorem Beatae Mariae Virginis,2017-05-10T20:19:36+00:00,2017-05-10T18:22:19+00:00,2018-08-17T19:15:35+00:00,This manuscript contains fragments of a poem i...,...,parchment,iii+12+iii; 163 x 110 mm bound to 177 x 130 mm,Caroline minuscule,,1200,1299,Germany,"[13th century, German, Germany, Literature -- ...","['4222_0000_web.jpg', '4222_0001_web.jpg', '42...",13th century
2,bibliophilly,4223,0023/lewis_e_083,23,TEI,Historia belli civilis inter Caesarem et Pompeium,2017-05-10T20:19:42+00:00,2017-05-10T18:52:40+00:00,2018-08-17T19:18:33+00:00,This manuscript contains an account of the civ...,...,parchment,ii+30+ii; 165 x 110 mm bound to 170 x 120 mm,Gothic--textualis,,1440,1460,Italy,"[History, 15th century, Italian, Italy, Gothic...","['4223_0000_web.jpg', '4223_0001_web.jpg', '42...",15th century
3,bibliophilly,4225,0023/lewis_e_009,23,TEI,Processional; Astronomical Text binding fragment,2017-05-10T20:19:47+00:00,2017-05-10T19:10:48+00:00,2018-08-17T19:07:03+00:00,This manuscript is a late fifteenth-century pr...,...,parchment,51+i; 125 x 86 mm bound to 130 x 90 mm,Gothic--textualis quadrata,,1450,1499,Germany,"[Processional, Astrology, Science, 15th centur...","['4225_0000_web.jpg', '4225_0001_web.jpg', '42...",15th century
3,bibliophilly,4225,0023/lewis_e_009,23,TEI,Processional; Astronomical Text binding fragment,2017-05-10T20:19:47+00:00,2017-05-10T19:10:48+00:00,2018-08-17T19:07:03+00:00,This manuscript is a late fifteenth-century pr...,...,parchment,51+i; 125 x 86 mm bound to 130 x 90 mm,Gothic--textualis quadrata,,1450,1499,Germany,"[Processional, Astrology, Science, 15th centur...","['4225_0000_web.jpg', '4225_0001_web.jpg', '42...",13th century
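
For comparison, pandas can produce a similar one-row-per-value result with its built-in `str.split` and `DataFrame.explode` methods. The sketch below is an alternative to `tidy_split`, not a replacement for it in this notebook, and it runs on a small made-up dataframe:

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'title': ['Processional', 'Liber de vinis'],
    'Century': ['15th century|13th century', '15th century'],
})

# Split each pipe-delimited string into a list, then give every list item its own row
exploded = (
    toy.assign(Century=toy['Century'].str.split('|'))
       .explode('Century')
)

print(exploded)
```

As with `tidy_split`, the other columns are repeated for every value, so a manuscript tagged with two centuries is counted once in each.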


In [38]:
# Filter out any rows that contain two dashes or commas in the `Century` column 
centuries_contents = centuries_contents[~centuries_contents['Century'].str.contains("--")]
centuries_contents = centuries_contents[~centuries_contents['Century'].str.contains(",")]

# Group the dataframe by the `Century` column and create a new dataframe
centuries_count =  centuries_contents.groupby(['Century']).size().reset_index(name="Count")

# Replace any empty strings in the `Century` column with a NaN value
centuries_count['Century'] = centuries_count['Century'].replace('', np.nan)
# Drop the rows with NA/NaN values in the `Century` column
centuries_count = centuries_count.dropna(subset=['Century'])
# Sort the values by the `Century` column (assign the result, since `sort_values` returns a new dataframe)
centuries_count = centuries_count.sort_values('Century')
# Display the centuries_count dataframe
centuries_count

Unnamed: 0,Century,Count
1,10th century,5
2,11th century,16
3,12th century,95
4,13th century,264
5,14th century,405
6,15th century,1402
7,16th century,833
8,17th century,108
9,18th century,36
10,19th century,15


In [39]:
# Create a Bar Chart of the centuries
counts_over_time = alt.Chart(centuries_count).mark_bar(size=10).encode(
    
    # Century on the X axis
    x=alt.X('Century:N', axis=alt.Axis(format='c', title='Century')),
    
    # Number of manuscripts on the Y axis
    y=alt.Y('Count:Q', title='Number of Items'),
    
    # Show details on hover
    tooltip=[alt.Tooltip('Century:N', title='Century'), alt.Tooltip('Count:Q', title='Count', format=',')]
).properties(width=700)

# Display the Bar Chart
counts_over_time

In [40]:
# Save the chart
save(counts_over_time, 'charts/counts_per_century.html')

## Build a Stacked Bar Chart 

A stacked bar chart is a type of bar chart for representing part-to-whole comparison over time. Instead of one variable (century), we can show two variables. In this case, we'll use the `Script Note` column to display the use of different scripts in manuscripts across centuries.

In [41]:
# Count how often each value appears in the `Script Note` column and add the 10 most common to a list called `top_scripts`
top_scripts = centuries_contents['Script Note'].value_counts()[:10].index.to_list()

# Filter the dataframe for rows whose `Script Note` value is in the `top_scripts` list
centuries_top_scripts = centuries_contents.loc[centuries_contents['Script Note'].isin(top_scripts)]

# Group by `Century` and count how often each `Script Note` value appears in each century
centuries_scripts_counts = centuries_top_scripts.groupby('Century')['Script Note'].value_counts().to_frame()
centuries_scripts_counts.columns = ['Count']
centuries_scripts_counts.reset_index(inplace=True)

# Replace the empty strings in the `Century` column and the '0' strings in the `Script Note` column with NaN values
centuries_scripts_counts['Century'] = centuries_scripts_counts['Century'].replace('', np.nan)
centuries_scripts_counts['Script Note'] = centuries_scripts_counts['Script Note'].replace('0', np.nan)

# Drop the rows with NaN values in the `Century` or `Script Note` columns
centuries_scripts_counts.dropna(subset=['Century', 'Script Note'], inplace=True)
# Filter the dataframe to only include non-zero values in the `Script Note` column
centuries_scripts_counts = centuries_scripts_counts[centuries_scripts_counts['Script Note'] != 0]

In [42]:
# Create a stacked bar chart
script_notes = alt.Chart(centuries_scripts_counts).mark_bar(size=10).encode(
    
    # Century on the X axis
    x=alt.X('Century:N', axis=alt.Axis(format='c', title='Century')),
    
    # Number of manuscripts on the Y axis
    y=alt.Y('Count:Q', title='Number of objects'),
    
    # Color according to the `Script Note` value
    color='Script Note:N',
    
    # Details on hover
    tooltip=[alt.Tooltip('Script Note:N', title='Script'), alt.Tooltip('Century:N', title='Century'), alt.Tooltip('Count:Q', title='Count', format=',')]
).properties(width=700)

script_notes
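
To read the numbers behind a stacked bar chart as a table, a cross-tabulation is one option. The sketch below uses `pd.crosstab` on a small made-up dataframe. The script names echo values seen in the `Script Note` column, but the rows themselves are invented.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'Century': ['13th century', '15th century', '15th century', '15th century'],
    'Script Note': ['Caroline minuscule', 'Gothic--cursiva',
                    'Gothic--textualis', 'Gothic--textualis'],
})

# Rows are centuries, columns are scripts, and each cell is a count
script_table = pd.crosstab(toy['Century'], toy['Script Note'])

print(script_table)
```

Each row of the table corresponds to one bar in the chart, and each column to one color segment.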


In [43]:
# Save the chart
save(script_notes, 'charts/script_notes_per_century.html')

## Keywords Over Time

In [44]:
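# Join each row's list of keywords into a single pipe-delimited string and save it to a new `Keyword` column
for idx, row in collections_contents.iterrows():
    collections_contents.at[idx, 'Keyword'] = "|".join(map(str,row.Keywords))

# Use the tidy_split function to split the `Keyword` column into one row per keyword
keywords_contents = tidy_split(collections_contents, "Keyword", sep='|', keep=False)

# Count the number of rows for each keyword and sort from most to least common
keywords_count =  keywords_contents.groupby(['Keyword']).size().reset_index(name="Count")
keywords_count.sort_values(by='Count', ascending=False, inplace=True)

# Display the 50 most common keywords
keywords_count.head(50)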

Unnamed: 0,Keyword,Count
610,Fragment,1837
7,15th century,1402
824,Italian,1259
840,Italy,1164
1079,"Manuscripts, Renaissance",1087
8,16th century,833
365,Codices,765
759,Illumination,741
973,Liturgy,697
774,Illustration,625


## Build a Choropleth Map

Choropleth maps are thematic maps that use different shading patterns for geographical areas, based on data. In this case, we'll build a choropleth map of modern European countries to illustrate the number of manuscripts from each country of origin.

Geographical information is contained in two places in this dataset: the `OrigPlace` and `Keywords` columns. The `OrigPlace` column typically contains names of places xyz, while the `Keywords` column typically contains names of places xyz. Let's make country columns based on both of these before deciding which to map.

In [45]:
# Create a list of European countries (sourced from xyz)
list_of_countries = ['Albania','Andorra','Armenia','Austria','Azerbaijan','Belarus','Belgium','Bosnia and Herzegovina','Bulgaria','Croatia','Cyprus','Czech Republic','Denmark','England','Estonia','Finland','France','Georgia','Germany','Greece','Hungary','Iceland','Ireland','Italy','Kazakhstan','Kosovo','Latvia','Liechtenstein','Lithuania','Luxembourg','Malta','Moldova','Monaco','Montenegro','Netherlands','North Macedonia','Norway','Poland','Portugal','Romania','Russia','San Marino','Scotland','Serbia','Slovakia','Slovenia','Spain','Sweden','Switzerland','Turkey','Ukraine','United Kingdom','Vatican City','Wales']

# Create a new column called `Country_OrigPlace` and set the value as an empty string for each row
collections_contents['Country_OrigPlace'] = ''
# Fill NaN values in the `OrigPlace` column with empty strings
collections_contents.fillna({'OrigPlace':''}, inplace=True)

# Iterate over each index/row in the dataframe
for idx, row in collections_contents.iterrows():
    # Collect every country from list_of_countries that appears in the row's `OrigPlace` value
    countries_mentioned = []
    for country in list_of_countries: 
        # If the country name is contained in the `OrigPlace` string
        if country in row.OrigPlace:
            # Add that country to the list for this row
            countries_mentioned.append(country)
    # Save the list of matching countries to the `Country_OrigPlace` column
    collections_contents.at[idx, 'Country_OrigPlace'] = countries_mentioned

# Create a new column called `Country_Keywords` and set the value as an empty string for each row
collections_contents['Country_Keywords'] = ''

# Iterate over each index/row in the dataframe
for idx, row in collections_contents.iterrows():
    # Set the value of `Country_Keywords` as the list of keywords that appear in list_of_countries
    collections_contents.at[idx, 'Country_Keywords'] = [ele for ele in row.Keywords if ele in list_of_countries]

# Print the first 5 rows of the dataframe
collections_contents.head()

Unnamed: 0,curated_collection,document_id,path,repository_id,metadata_type,title,added,document_created,document_updated,Summary,...,OrigDate_When,OrigDate_Start,OrigDate_End,OrigPlace,Keywords,ImgNames,Century,Keyword,Country_OrigPlace,Country_Keywords
0,bibliophilly,4221,0023/lewis_e_018,23,TEI,Liber de vinis,2017-05-10T14:47:02+00:00,2017-05-10T14:27:01+00:00,2018-08-17T19:07:56+00:00,This manuscript is an early 15th-century Germa...,...,,1400,1415,Germany,"[15th century, German, Germany, Science -- Med...","['4221_0000_web.jpg', '4221_0001_web.jpg', '42...",15th century,15th century|German|Germany|Science -- Medicin...,[Germany],[Germany]
1,bibliophilly,4222,0023/lewis_e_057,23,TEI,Carmen in honorem Beatae Mariae Virginis,2017-05-10T20:19:36+00:00,2017-05-10T18:22:19+00:00,2018-08-17T19:15:35+00:00,This manuscript contains fragments of a poem i...,...,,1200,1299,Germany,"[13th century, German, Germany, Literature -- ...","['4222_0000_web.jpg', '4222_0001_web.jpg', '42...",13th century,13th century|German|Germany|Literature -- Poet...,[Germany],[Germany]
2,bibliophilly,4223,0023/lewis_e_083,23,TEI,Historia belli civilis inter Caesarem et Pompeium,2017-05-10T20:19:42+00:00,2017-05-10T18:52:40+00:00,2018-08-17T19:18:33+00:00,This manuscript contains an account of the civ...,...,,1440,1460,Italy,"[History, 15th century, Italian, Italy, Gothic...","['4223_0000_web.jpg', '4223_0001_web.jpg', '42...",15th century,History|15th century|Italian|Italy|Gothic|Lite...,[Italy],[Italy]
3,bibliophilly,4225,0023/lewis_e_009,23,TEI,Processional; Astronomical Text binding fragment,2017-05-10T20:19:47+00:00,2017-05-10T19:10:48+00:00,2018-08-17T19:07:03+00:00,This manuscript is a late fifteenth-century pr...,...,,1450,1499,Germany,"[Processional, Astrology, Science, 15th centur...","['4225_0000_web.jpg', '4225_0001_web.jpg', '42...",15th century|13th century|10th century,Processional|Astrology|Science|15th century|13...,[Germany],[]
4,bibliophilly,4226,0023/lewis_e_003,23,TEI,Canon super almanach; De 12 signis et eorum na...,2017-05-10T20:19:52+00:00,2017-05-10T19:24:04+00:00,2018-08-17T19:05:44+00:00,This manuscript consists of a series of astron...,...,,1340,1599,England,"[Astrology, Tables, Science, 14th century, 15t...","['4226_0000_web.jpg', '4226_0001_web.jpg', '42...",14th century|15th century,Astrology|Tables|Science|14th century|15th cen...,[England],[England]
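
The two columns are matched in slightly different ways: `OrigPlace` is searched by substring, so a country name is found anywhere inside a longer description, while `Keywords` is checked by exact list membership. The standalone sketch below shows the difference on made-up values. The place string, the keyword list, and the shortened country list are all invented for illustration.

```python
# A shortened country list, for illustration only
countries = ['France', 'Germany', 'Italy']

# Hypothetical values, shaped like the `OrigPlace` and `Keywords` columns
orig_place = 'Northern Italy or southern France'
keywords = ['15th century', 'Italian', 'Italy']

# Substring matching, as used for `OrigPlace`
from_orig_place = [c for c in countries if c in orig_place]

# Exact membership, as used for `Keywords`
from_keywords = [k for k in keywords if k in countries]

print(from_orig_place)
print(from_keywords)
```

Here the substring search finds both France and Italy in the free-text description, while the keyword check returns only Italy because 'Italian' is not an exact match for any country name.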


In [46]:
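# Count manuscripts per country, based on `OrigPlace`
# Join each row's `Country_OrigPlace` list into a single pipe-delimited string
for idx, row in collections_contents.iterrows():
    orig_string = '|'.join(row['Country_OrigPlace'])
    collections_contents.at[idx, 'Country_OrigPlace'] = orig_string

# Use the tidy_split function to split the `Country_OrigPlace` column into one row per country
countries_contents_origPlace = tidy_split(collections_contents, "Country_OrigPlace", sep='|', keep=False)
# Count the number of rows for each country and sort from most to least common
country_origPlace_count = countries_contents_origPlace.groupby(['Country_OrigPlace']).size().reset_index(name="Count")
country_origPlace_count.sort_values(by='Count', ascending=False, inplace=True)
# Display the counts (up to 50 rows)
country_origPlace_count.head(50)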

Unnamed: 0,Country_OrigPlace,Count
0,,1514
11,Italy,874
7,France,439
8,Germany,261
6,England,167
2,Austria,110
16,Spain,77
3,Belgium,56
13,Netherlands,31
15,Portugal,11


In [47]:
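# Count manuscripts per country, based on `Keywords`
# Join each row's `Country_Keywords` list into a single pipe-delimited string
for idx, row in collections_contents.iterrows():
    orig_string = '|'.join(row['Country_Keywords'])
    collections_contents.at[idx, 'Country_Keywords'] = orig_string
# Use the tidy_split function to split the `Country_Keywords` column into one row per country
countries_contents_keywords = tidy_split(collections_contents, "Country_Keywords", sep='|', keep=False)
# Count the number of rows for each country and sort from most to least common
country_keywords_count = countries_contents_keywords.groupby(['Country_Keywords']).size().reset_index(name="Count")
country_keywords_count.sort_values(by='Count', ascending=False, inplace=True)
# Display the counts (up to 50 rows)
country_keywords_count.head(50)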



Unnamed: 0,Country_Keywords,Count
11,Italy,1164
0,,1033
7,France,491
8,Germany,291
6,England,184
2,Austria,114
17,Spain,113
3,Belgium,110
13,Netherlands,32
5,Czech Republic,13


In [48]:
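# Merge the two count dataframes on country name, keeping countries that appear in either one (an outer join)
country_counts = pd.merge(country_keywords_count, country_origPlace_count, left_on="Country_Keywords", right_on="Country_OrigPlace", how="outer")
# Display the merged dataframe
country_counts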

Unnamed: 0,Country_Keywords,Count_x,Country_OrigPlace,Count_y
0,Italy,1164.0,Italy,874.0
1,,1033.0,,1514.0
2,France,491.0,France,439.0
3,Germany,291.0,Germany,261.0
4,England,184.0,England,167.0
5,Austria,114.0,Austria,110.0
6,Spain,113.0,Spain,77.0
7,Belgium,110.0,Belgium,56.0
8,Netherlands,32.0,Netherlands,31.0
9,Czech Republic,13.0,,
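
The `Count_x` and `Count_y` names in the merged table come from the default suffixes pandas adds when both dataframes have a column with the same name. As an optional variation (not what the cell above does), the `suffixes` argument of `pd.merge` can give clearer names. The sketch below uses a few of the counts shown above in two toy dataframes, where `Difference` is a hypothetical helper column.

```python
import pandas as pd

# Toy data for illustration only, using a few of the counts shown above
by_keywords = pd.DataFrame({'Country': ['Italy', 'France', 'Czech Republic'],
                            'Count': [1164, 491, 13]})
by_orig_place = pd.DataFrame({'Country': ['Italy', 'France'],
                              'Count': [874, 439]})

# An outer merge keeps countries that appear in only one of the two tables
compared = pd.merge(by_keywords, by_orig_place, on='Country', how='outer',
                    suffixes=('_keywords', '_origplace'))

# `Difference` is a hypothetical helper column; it is NaN where one side is missing
compared['Difference'] = compared['Count_keywords'] - compared['Count_origplace']

print(compared)
```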


Both columns give us a similar range of countries, though the `Keywords` column identifies a country for more manuscripts: 1,033 rows have no country, against 1,514 for the `OrigPlace` column. We'll use the `Keywords` column to map our locations.

First, we'll have to geocode these entries using Nominatim. Nominatim asks each application to identify itself with a user agent; this notebook uses an email address, which you can input below:

In [49]:
# Replace 'England' in the `Country_Keywords` column with 'United Kingdom'
countries_contents_keywords['Country_Keywords'] = countries_contents_keywords['Country_Keywords'].replace('England', 'United Kingdom')
country_keywords_count['Country_Keywords'] = country_keywords_count['Country_Keywords'].replace('England', 'United Kingdom')

In [50]:
# Ask for an email address to identify these requests to Nominatim
email = input("Email address: ")

# Set Nominatim as the geocoding web service
locator = Nominatim(user_agent=email)
# Create a set of the unique values in the `Country_Keywords` column
list_of_places = set(countries_contents_keywords['Country_Keywords'].to_list())

# Create dictionaries for Latitude, Longitude, and Coordinates values
lat_dict = {}
lon_dict = {}
coords_dict = {}

# For each place in the list_of_places:
for place in list_of_places:
    # Look up the location of the place
    location = locator.geocode(place, timeout=None)
    if location:
        # Add a new key/value pair to each dictionary, where the key is the place and the value is its latitude and/or longitude
        coords_dict[place] = (location.latitude, location.longitude)
        lat_dict[place] = location.latitude
        lon_dict[place] = location.longitude
    else: 
        # If no location was found, store NaN values instead
        coords_dict[place] = np.nan
        lat_dict[place] = np.nan
        lon_dict[place] = np.nan
    # Pause for 2 seconds before sending the next request
    time.sleep(2)

# Map the dictionaries into the `Coordinates`, `Latitude` and `Longitude` columns by matching the `Country_Keywords` value as key
countries_contents_keywords['Coordinates'] = countries_contents_keywords['Country_Keywords'].map(coords_dict)
countries_contents_keywords['Latitude'] = countries_contents_keywords['Country_Keywords'].map(lat_dict)
countries_contents_keywords['Longitude'] = countries_contents_keywords['Country_Keywords'].map(lon_dict)

# Return the first 5 rows of the countries_contents_keywords dataframe
countries_contents_keywords.head()



Unnamed: 0,curated_collection,document_id,path,repository_id,metadata_type,title,added,document_created,document_updated,Summary,...,OrigPlace,Keywords,ImgNames,Century,Keyword,Country_OrigPlace,Country_Keywords,Coordinates,Latitude,Longitude
0,bibliophilly,4221,0023/lewis_e_018,23,TEI,Liber de vinis,2017-05-10T14:47:02+00:00,2017-05-10T14:27:01+00:00,2018-08-17T19:07:56+00:00,This manuscript is an early 15th-century Germa...,...,Germany,"[15th century, German, Germany, Science -- Med...","['4221_0000_web.jpg', '4221_0001_web.jpg', '42...",15th century,15th century|German|Germany|Science -- Medicin...,Germany,Germany,"(51.0834196, 10.4234469)",51.08342,10.423447
1,bibliophilly,4222,0023/lewis_e_057,23,TEI,Carmen in honorem Beatae Mariae Virginis,2017-05-10T20:19:36+00:00,2017-05-10T18:22:19+00:00,2018-08-17T19:15:35+00:00,This manuscript contains fragments of a poem i...,...,Germany,"[13th century, German, Germany, Literature -- ...","['4222_0000_web.jpg', '4222_0001_web.jpg', '42...",13th century,13th century|German|Germany|Literature -- Poet...,Germany,Germany,"(51.0834196, 10.4234469)",51.08342,10.423447
2,bibliophilly,4223,0023/lewis_e_083,23,TEI,Historia belli civilis inter Caesarem et Pompeium,2017-05-10T20:19:42+00:00,2017-05-10T18:52:40+00:00,2018-08-17T19:18:33+00:00,This manuscript contains an account of the civ...,...,Italy,"[History, 15th century, Italian, Italy, Gothic...","['4223_0000_web.jpg', '4223_0001_web.jpg', '42...",15th century,History|15th century|Italian|Italy|Gothic|Lite...,Italy,Italy,"(42.6384261, 12.674297)",42.638426,12.674297
3,bibliophilly,4225,0023/lewis_e_009,23,TEI,Processional; Astronomical Text binding fragment,2017-05-10T20:19:47+00:00,2017-05-10T19:10:48+00:00,2018-08-17T19:07:03+00:00,This manuscript is a late fifteenth-century pr...,...,Germany,"[Processional, Astrology, Science, 15th centur...","['4225_0000_web.jpg', '4225_0001_web.jpg', '42...",15th century|13th century|10th century,Processional|Astrology|Science|15th century|13...,Germany,,,,
4,bibliophilly,4226,0023/lewis_e_003,23,TEI,Canon super almanach; De 12 signis et eorum na...,2017-05-10T20:19:52+00:00,2017-05-10T19:24:04+00:00,2018-08-17T19:05:44+00:00,This manuscript consists of a series of astron...,...,England,"[Astrology, Tables, Science, 14th century, 15t...","['4226_0000_web.jpg', '4226_0001_web.jpg', '42...",14th century|15th century,Astrology|Tables|Science|14th century|15th cen...,England,United Kingdom,"(54.7023545, -3.2765753)",54.702354,-3.276575


In [51]:
# Make a copy of the dataframe
unique_countries = countries_contents_keywords.copy()
# Drop duplicate rows based on the values of the `Country_Keywords` column - keep first
unique_countries.drop_duplicates(subset=['Country_Keywords'], keep='first',inplace=True)

# Iterate over index/row in dataframe
for idx, row in unique_countries.iterrows():
    # Set the value of a 'numeric' column as the ISO-numeric country code based on the `Country_Keywords` value
    unique_countries.at[idx, 'numeric'] = coco.convert(names=row['Country_Keywords'], to='ISOnumeric')

 not found in regex


In [52]:
# Merge the dataframes on the `Country_Keywords` column, keeping every row from the latter dataframe (a right join)
manuscript_origin = pd.merge(unique_countries, country_keywords_count, on="Country_Keywords", how="right")

# Filter the dataframe for all values EXCEPT 'not found' in the numeric column
manuscript_origin = manuscript_origin[manuscript_origin['numeric']!='not found']
# Keep only the columns in the list
manuscript_origin = manuscript_origin[['Country_Keywords', 'Count', 'numeric']]

# Display the dataframe
manuscript_origin

Unnamed: 0,Country_Keywords,Count,numeric
0,Italy,1164,380.0
2,France,491,250.0
3,Germany,291,276.0
4,United Kingdom,184,826.0
5,Austria,114,40.0
6,Spain,113,724.0
7,Belgium,110,56.0
8,Netherlands,32,528.0
9,Czech Republic,13,203.0
10,Greece,10,300.0
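
To make the merge-and-filter step concrete, here is a small sketch with made-up data. The frames `left` and `right` are hypothetical stand-ins for `unique_countries` and `country_keywords_count`, and "Atlantis" is an invented keyword with no match. It shows that a right merge keeps every row of the right-hand frame, that filtering on `'not found'` leaves a gap in the index, and that rows with no match at all carry `NaN` and survive that filter unless they are dropped explicitly.

```python
import pandas as pd

# Hypothetical stand-in for `unique_countries`
left = pd.DataFrame({
    "Country_Keywords": ["Italy", "France", ""],
    "numeric": [380, 250, "not found"],
})
# Hypothetical stand-in for `country_keywords_count`
right = pd.DataFrame({
    "Country_Keywords": ["Italy", "", "France", "Atlantis"],
    "Count": [3, 2, 1, 1],
})

# A right merge keeps all four rows of `right`, in their original order
merged = pd.merge(left, right, on="Country_Keywords", how="right")

# Same filter as the notebook: removes the blank keyword, but NaN rows remain
kept = merged[merged["numeric"] != "not found"]

# Optional extra step (an assumption, not in the notebook): drop unmatched rows
clean = kept.dropna(subset=["numeric"])

print(kept[["Country_Keywords", "Count", "numeric"]])
print(clean[["Country_Keywords", "Count", "numeric"]])
```

A filtered-out row of this kind would be consistent with the displayed index above jumping from 0 to 2.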


In [53]:
# We'll use the world map as a background to the choropleth, otherwise countries with no manuscripts will be invisible!

countries = alt.topo_feature(vega_data.world_110m.url, feature='countries')
background = alt.Chart(countries).mark_geoshape(
    fill='lightgray',
    stroke='white'
).project('equirectangular').properties(width=700)

# Chart the manuscript counts by country - we use the world boundaries to define countries
choro = alt.Chart(countries).mark_geoshape(
    stroke='white'
).encode(
    
    # Color is determined by the number of manuscripts
    color=alt.Color('Count:Q', scale=alt.Scale(scheme='greenblue'), legend=alt.Legend(title='Number of manuscripts')),
    
    # Hover for details
    tooltip=[alt.Tooltip('Country_Keywords:N', title='Country'), alt.Tooltip('Count:Q', title='Number of manuscripts')]
    
    # This is the critical section that links the map to the object data
).transform_lookup(
    
    # This is the field that contains the country ids in the boundaries file
    lookup='id',
    
    # This is where we link the dataframe with the counts by country
    # The numeric field is the country identifier and will be used to connect data with country
    # We also need the count and country fields
    from_=alt.LookupData(manuscript_origin, 'numeric', ['Count', 'Country_Keywords'])
).project('equirectangular').properties(width=700, title='Countries where manuscripts originate')

choro

In [54]:
# Create the map by combining the background and the choropleth
manuscript_origin_choro = alt.layer(background, choro)
manuscript_origin_choro
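
Conceptually, `transform_lookup` behaves like a join: each country shape is matched by its `id` to the dataframe row with the same `numeric` value, and the requested fields are copied onto the shape. The plain-Python sketch below imitates that idea with tiny data. The function `lookup` and the three sample shapes are hypothetical, and this is not how Altair implements the lookup internally. The Italy and France counts come from the table above; id 36 stands for any country without manuscripts.

```python
# Hypothetical miniature versions of the two inputs
features = [{"id": 380}, {"id": 250}, {"id": 36}]  # shapes from the boundaries file
rows = [
    {"numeric": 380.0, "Country_Keywords": "Italy", "Count": 1164},
    {"numeric": 250.0, "Country_Keywords": "France", "Count": 491},
]


def lookup(features, rows, key, fields):
    """Copy `fields` from the row whose `key` equals each feature's id."""
    index = {row[key]: row for row in rows}
    joined = []
    for feature in features:
        match = index.get(feature["id"])
        # Shapes with no matching row get None for every requested field
        extra = {field: (match[field] if match else None) for field in fields}
        joined.append({**feature, **extra})
    return joined


joined = lookup(features, rows, "numeric", ["Count", "Country_Keywords"])
for item in joined:
    print(item)
```

The third shape ends up with no count, which is why the gray background layer is useful for countries that have no matching row.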

In [55]:
save(manuscript_origin_choro, 'charts/choro-map-manuscript-origin.html')
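
The `save` call above writes into a `charts/` folder, and the write may fail if that folder does not exist yet. A minimal sketch of a guard, using only the standard library, is shown below. The helper `ensure_parent_dir` is hypothetical, and the demonstration uses a temporary folder rather than the notebook's own directory.

```python
from pathlib import Path
import tempfile


def ensure_parent_dir(path):
    """Create the folder that will hold `path` if it does not exist yet."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target


# Demonstrate in a temporary folder so nothing is written to the project
with tempfile.TemporaryDirectory() as tmp:
    out_path = ensure_parent_dir(Path(tmp) / "charts" / "choro-map-manuscript-origin.html")
    parent_exists = out_path.parent.is_dir()

print(parent_exists)
```

In the notebook, calling `ensure_parent_dir('charts/choro-map-manuscript-origin.html')` before `save` would serve the same purpose.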

# Need Help?
<div class="alert alert-block alert-warning">
    <p>For additional Python and Digital Scholarship resources:</p>
    <ul>
        <li><a href="https://www.w3schools.com/python/pandas/default.asp">Pandas Tutorial from W3Schools</a></li>
        <li><a href="https://guides.library.upenn.edu/digital-scholarship">Center for Research Data and Digital Scholarship</a></li>
    </ul>
    <p>For help with this notebook:</p>
    <ul>
        <li>If you encounter any errors in this notebook, you can open an issue on GitHub or email estene@upenn.edu and reference this notebook.</li>
        <li>If you encounter any errors while working with the BiblioPhilly metadata, you can email dorp@upenn.edu.</li>
        <li>If you encounter issues with accessing data from OPenn, visit <a href="https://openn.library.upenn.edu/TechnicalReadMe.html">OPenn</a>.</li>
    </ul>
</div>

----

# Credits

Created by [Emily Esten](https://www.library.upenn.edu/people/staff/emily-esten) and [Dot Porter](https://www.library.upenn.edu/people/staff/dot-porter). 

Judaica Digital Humanities at the <a href="http://library.upenn.edu">Penn Libraries</a> (also referred to as Judaica DH) is a robust program of projects and tools for experimental digital scholarship with Judaica collections, informed by digital humanities, Jewish studies, and cultural heritage approaches. Visit our [website](https://judaicadh.library.upenn.edu).

The dataset for this notebook draws on items from the **Arnold and Deanne Kaplan Collection of Early American Judaica**. Members of the [Philadelphia Area Consortium of Special Collections Libraries (PACSCL)](http://pacscl.org/) catalogued and digitized medieval Western European manuscripts with the generous support of the [Council on Library and Information Resources (CLIR)](https://www.clir.org/), via its Digitizing Hidden Special Collections and Archives initiative. All images have been released into the public domain. More information about the collection can be found at [https://bibliophilly.library.upenn.edu/](https://bibliophilly.library.upenn.edu/). 

This notebook references existing code and Jupyter notebooks, including: 
* [GLAM Workbench for the National Museum of Australia](https://doi.org/10.5281/zenodo.3544747) sponsored by the [Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab](https://tinker.edu.au/).
* [Library of Congress Data Exploration: IIIF](https://github.com/LibraryOfCongress/data-exploration/blob/26510c3f4da0bc85dfa87e82141173b1830e9d64/IIIF.ipynb).
* Gustavo Candela, María Dolores Sáez, Pilar Escobar, Manuel Marco-Such, & Rafael C. Carrasco. (2020, May 8). hibernator11/notebook-iiif-images: release1.1 (Version 1.1). Zenodo. [http://doi.org/10.5281/zenodo.3816611](https://zenodo.org/badge/latestdoi/255172461).
* [Genes for Project Cognoma](https://github.com/cognoma/genes/blob/721204091a96e55de6dcad165d6d8265e67e2a48/2.process.py).
* https://mindtrove.info/jupyter-tidbit-image-gallery/