# Explore Shakespeare and Company Project data

For convenience, we're going to load the official Shakespeare and Company Project version 1.2 datasets from the [code repository](https://github.com/rlskoeser/shxco-missingdata-specreading) associated with the ["Missing Data, Speculative Reading"](https://doi.org/10.22148/001c.116926) paper. 

There are three kinds of data:
- members (people who subscribed)
- books (books and periodicals that people borrowed)
- events (membership activity like subscriptions as well as borrowing activity)

The datasets are published as CSV; let's load them into pandas DataFrames. The events dataset is large, so we pass `low_memory=False`, which makes pandas read the whole file before inferring column types rather than guessing chunk by chunk.

In [2]:
import pandas as pd

# use v1.2 datasets; load from our repo for convenience
csv_urls = {
    # official published versions
    'members': 'https://raw.githubusercontent.com/rlskoeser/shxco-missingdata-specreading/main/data/source_data/SCoData_members_v1.2_2022-01.csv',
    'books': 'https://raw.githubusercontent.com/rlskoeser/shxco-missingdata-specreading/main/data/source_data/SCoData_books_v1.2_2022-01.csv',
    'events': 'https://raw.githubusercontent.com/rlskoeser/shxco-missingdata-specreading/main/data/source_data/SCoData_events_v1.2_2022-01.csv',
}

members_df = pd.read_csv(csv_urls['members'])
books_df = pd.read_csv(csv_urls['books'])
events_df = pd.read_csv(csv_urls['events'], low_memory=False)

Filter the events dataset to entries with fully known start and end dates, and convert those date strings to datetime objects.

Then create a filtered subset of borrow events with known dates, and limit to relevant fields.

In [3]:
# Filter the events dataframe to include only events with complete start and end dates.
# Note: this excludes every event with a missing or partial start or end date, including borrows with no recorded end date. Consider the impact of this exclusion on the analysis.
date_events = events_df[(events_df.start_date.str.len() > 9) & (events_df.end_date.str.len() > 9)].copy()

# Convert the 'start_date' and 'end_date' columns to datetime format. With errors='coerce', any value that cannot be parsed becomes NaT instead of raising an error.
date_events['start_datetime'] = pd.to_datetime(date_events.start_date, format='%Y-%m-%d', errors='coerce')
date_events['end_datetime'] = pd.to_datetime(date_events.end_date, format='%Y-%m-%d', errors='coerce')

# Sort the dataframe by 'start_datetime'.
date_events = date_events.sort_values(by=['start_datetime'])

# Filter the 'date_events' DataFrame to include only borrow events.
borrow_events = date_events[date_events.event_type == 'Borrow']
# limit to columns that are relevant for borrows; copy so new columns can be added later without a SettingWithCopyWarning
borrow_events = borrow_events[["start_datetime", "end_datetime", "member_uris", "member_names", "member_sort_names", "borrow_status", "borrow_duration_days", "item_uri", "item_title", "item_volume", "item_authors", "item_year", "start_date", "end_date"]].copy()
borrow_events.head()

Unnamed: 0,start_datetime,end_datetime,member_uris,member_names,member_sort_names,borrow_status,borrow_duration_days,item_uri,item_title,item_volume,item_authors,item_year,start_date,end_date
674,1919-11-18,1919-11-28,https://shakespeareandco.princeton.edu/members...,Denise Ulmann,"Ulmann, Denise",Returned,10.0,https://shakespeareandco.princeton.edu/books/w...,De Profundis,,"Wilde, Oscar",1905.0,1919-11-18,1919-11-28
673,1919-11-18,1919-11-28,https://shakespeareandco.princeton.edu/members...,Denise Ulmann,"Ulmann, Denise",Returned,10.0,https://shakespeareandco.princeton.edu/books/m...,Diana of the Crossways,,"Meredith, George",1885.0,1919-11-18,1919-11-28
675,1919-11-18,1919-11-22,https://shakespeareandco.princeton.edu/members...,Henri Regnier,"Regnier, Henri",Returned,4.0,https://shakespeareandco.princeton.edu/books/h...,The Trumpet-Major,,"Hardy, Thomas",1880.0,1919-11-18,1919-11-22
678,1919-11-19,1919-11-22,https://shakespeareandco.princeton.edu/members...,Claude Cahun / Mlle Lucie Schwob,"Cahun, Claude",Returned,3.0,https://shakespeareandco.princeton.edu/books/j...,Roderick Hudson,Vol. 1,"James, Henry",1875.0,1919-11-19,1919-11-22
680,1919-11-19,1919-11-22,https://shakespeareandco.princeton.edu/members...,Maurice Oerthel,"Oerthel, Maurice",Returned,3.0,https://shakespeareandco.princeton.edu/books/h...,Characters of Shakespeare's Plays,Vol. 1,"Hazlitt, William",1817.0,1919-11-19,1919-11-22
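To see what the length filter and `errors='coerce'` each do, here is a minimal sketch on invented toy data (the values below are not from the dataset). It assumes that partially known dates appear as shorter strings such as a year or a year and month, which is what the length test above implies.

```python
import pandas as pd

# toy data, invented for illustration: full, partial, missing, and invalid dates
toy = pd.DataFrame({
    "start_date": ["1921-03-15", "1921-03", "1921", None, "1921-02-30"],
})

# same test as above: a fully known date has at least 10 characters;
# missing values fail the comparison and are dropped
is_full = toy.start_date.str.len() > 9
full = toy[is_full].copy()

# errors='coerce' turns strings that cannot be parsed into NaT
full["start_datetime"] = pd.to_datetime(
    full.start_date, format="%Y-%m-%d", errors="coerce"
)

excluded = len(toy) - len(full)
print(f"{excluded} of {len(toy)} toy rows excluded by the length filter")
print(full)
```

Running the same count against `events_df` gives a sense of how much activity the filter leaves out.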


## Borrowing data exploration

You can use this dataset of borrowing activity to answer some questions:

- Which books are borrowed most? How many times were they borrowed?
- Which authors are borrowed most?
- What books were borrowed most in a particular year between 1919 and 1942? How many books were borrowed that year?
- How many books were only borrowed once?

As you generate these numbers, think about how you would take into account the fact that we don't have all of the borrowing information and all of the book titles, and that we excluded events with dates that were not fully known.


In [29]:
# group borrow events and get counts for the grouped events
# you can group by item_title, item_authors, item_year

# here's a start
borrow_events.groupby(by='item_title').size().reset_index(name='counts')


Unnamed: 0,item_title,counts
0,'Twixt Land and Sea,1
1,14a,1
2,1914 and Other Poems,1
3,1919,8
4,1933: A Year Magazine,1
...,...,...
5341,[unclear]ile,1
5342,[unclear]y,1
5343,[unknown],1
5344,the transatlantic review,2
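A few of the questions above can be sketched with `value_counts`. The example below uses invented toy data with placeholder titles, so the numbers mean nothing; swap in `borrow_events` to work with the real data. Note that counting by `item_title` treats different works with the same title as one book; counting by `item_uri` may be a safer choice.

```python
import pandas as pd

# toy borrow events, invented for illustration
toy_borrows = pd.DataFrame({
    "item_title": ["Book A", "Book A", "Book A", "Book B", "Book B", "Book C"],
    "start_datetime": pd.to_datetime([
        "1931-05-01", "1932-01-10", "1932-06-02",
        "1932-03-15", "1933-02-20", "1932-09-09",
    ]),
})

# how many times was each title borrowed, most borrowed first
title_counts = toy_borrows.item_title.value_counts()
print(title_counts.head(2))

# how many titles were borrowed exactly once
borrowed_once = (title_counts == 1).sum()
print(f"{borrowed_once} title(s) borrowed only once")

# restrict to a single year, then count again
toy_1932 = toy_borrows[toy_borrows.start_datetime.dt.year == 1932]
top_1932 = toy_1932.item_title.value_counts().head(1)
print(f"{len(toy_1932)} borrows in 1932; most borrowed:")
print(top_1932)
```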


### Extended borrowing data exploration

More specific analysis on borrowing can be done if we combine the borrowing data with member data, which includes gender, birth and death years, and addresses for some members.

- Which books were borrowed most by men or by women?
- Which books were borrowed most in specific arrondissements or districts in Paris? *

\* *Note*: members may have multiple known addresses and arrondissements or they may have none.

In [4]:
# a few accounts were shared by members, so to join the borrow events on member data we need to split that out
borrow_events[['first_member_uri','second_member_uri']] = borrow_events.member_uris.str.split(';', expand=True)

# subset member data to information that might be interesting in combination with the books
member_demographics = members_df[['uri', 'name', 'sort_name', 'title', 'gender', 'is_organization', 'birth_year', 'death_year', 'membership_years', 'nationalities', 'addresses', 'arrondissements']]

# join borrow events on member data
# for shared accounts (multiple member URIs), merge on the first member only
# the default inner join keeps only borrows whose first member URI matches a member record
borrow_events_with_member_data = pd.merge(left=borrow_events, right=member_demographics, left_on="first_member_uri", right_on="uri")
borrow_events_with_member_data.head(3)

Unnamed: 0,start_datetime,end_datetime,member_uris,member_names,member_sort_names,borrow_status,borrow_duration_days,item_uri,item_title,item_volume,...,sort_name,title,gender,is_organization,birth_year,death_year,membership_years,nationalities,addresses,arrondissements
0,1919-11-18,1919-11-28,https://shakespeareandco.princeton.edu/members...,Denise Ulmann,"Ulmann, Denise",Returned,10.0,https://shakespeareandco.princeton.edu/books/w...,De Profundis,,...,"Ulmann, Denise",Mlle,Female,False,1902.0,1994.0,1919,France,"28 place Saint-Ferdinand, Paris",17.0
1,1919-11-18,1919-11-28,https://shakespeareandco.princeton.edu/members...,Denise Ulmann,"Ulmann, Denise",Returned,10.0,https://shakespeareandco.princeton.edu/books/m...,Diana of the Crossways,,...,"Ulmann, Denise",Mlle,Female,False,1902.0,1994.0,1919,France,"28 place Saint-Ferdinand, Paris",17.0
2,1919-11-18,1919-11-22,https://shakespeareandco.princeton.edu/members...,Henri Regnier,"Regnier, Henri",Returned,4.0,https://shakespeareandco.princeton.edu/books/h...,The Trumpet-Major,,...,"Regnier, Henri",M.,Male,False,,,1919,,"9 rue de la Gare, Cachan",
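One way to approach the gender question is to group the merged data by gender and title. The sketch below uses invented toy data with placeholder URIs and titles. It also uses a left join with `indicator=True`, which is a choice made for this sketch rather than the approach in the cell above, to check how many borrows have no matching member record.

```python
import pandas as pd

# toy data, invented for illustration; "m/x" has no member record
toy_borrows = pd.DataFrame({
    "first_member_uri": ["m/a", "m/a", "m/b", "m/c", "m/x"],
    "item_title": ["Book A", "Book B", "Book A", "Book A", "Book C"],
})
toy_members = pd.DataFrame({
    "uri": ["m/a", "m/b", "m/c"],
    "gender": ["Female", "Male", "Female"],
})

# a left join keeps every borrow; the _merge column flags unmatched rows
merged = pd.merge(
    toy_borrows, toy_members,
    left_on="first_member_uri", right_on="uri",
    how="left", indicator=True,
)
unmatched = (merged["_merge"] == "left_only").sum()
print(f"{unmatched} borrow(s) without member data")

# count borrows per gender and title; rows with unknown gender are dropped
by_gender = (
    merged.groupby(["gender", "item_title"])
    .size()
    .reset_index(name="counts")
    .sort_values("counts", ascending=False)
)
# most borrowed title for each gender
print(by_gender.groupby("gender").head(1))
```

Because gender is only known for some members, the counts describe the members with recorded gender, not the whole membership.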


## Borrowing duration exploration

The borrowing data includes a `borrow_duration_days` field for many entries. 

How long did members keep the books they borrowed?

Let's generate some raincloud plots.



In [5]:
# the raincloud plot uses Altair; if you don't have it installed you'll need to do that first
# uncomment and run here, or run in your local python environment
#%pip install altair

In [18]:
import altair as alt

def raincloud_plot(dataset, fieldname):
    """Create a raincloud plot for the density of the specified field
    in the given dataset. 
    Takes a pandas DataFrame and the name of a column in that dataset.
    Returns an altair chart."""

    # create a density area plot of specified fieldname
    duration_density = (
        alt.Chart(dataset)
        .transform_density(
            fieldname,
            as_=[fieldname, "density"],
        )
        .mark_area(orient="vertical")
        .encode(
            x=alt.X(fieldname, title=None, axis=alt.Axis(labels=False, ticks=False)),
            y=alt.Y(
                "density:Q",
                # suppress labels and ticks because we're going to combine two plots
                title=None,
                axis=alt.Axis(labels=False, values=[0], grid=False, ticks=False),
            ),
        )
        .properties(height=100, width=800)
    )

    # Now create jitter plot of the same field
    stripplot = (
        alt.Chart(dataset)
        .mark_circle(size=50)
        .encode(
            x=alt.X(
                fieldname,
                #title=field_label,
                axis=alt.Axis(labels=True),
            ),
            y=alt.Y("jitter:Q", title=None, axis=None)
        )
        .transform_calculate(jitter="(random() / 200) - 0.0052")
        .properties(
            height=120,
            width=800,
        )
    )

    # use vertical concat to combine the two plots together
    raincloud_plot = alt.vconcat(duration_density, stripplot).configure_concat(
        spacing=0
    )
    return raincloud_plot

In [17]:
# how many data points are we working with?
print(f"{len(borrow_events):,} total borrowing events")

19,789 total borrowing events


By default, Altair raises an error when asked to plot a dataset with more than 5,000 rows. We can override that limit when needed, but let's try plotting some subsets first.
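Besides filtering, a random sample is another way to stay under the row limit. This is a minimal sketch on an invented stand-in table; the `MAX_ROWS` name and the fixed `random_state` are choices made for the example. Altair also documents `alt.data_transformers.disable_max_rows()` for lifting the limit entirely, at the cost of larger charts.

```python
import pandas as pd

# toy stand-in for a large table: 20,000 rows with a made-up duration column
big = pd.DataFrame({"borrow_duration_days": [d % 60 + 1 for d in range(20_000)]})

MAX_ROWS = 5000  # Altair's default row limit

# sample only when the table is over the limit; fix the seed so the
# same rows are drawn each time the cell is run
if len(big) > MAX_ROWS:
    sample = big.sample(n=MAX_ROWS, random_state=42)
else:
    sample = big

print(f"{len(sample):,} of {len(big):,} rows kept")
```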

Because we already converted our dates to datetime objects, we can filter by year.

In [24]:
# get pandas statistics for borrow durations in 1932
borrow_events[borrow_events['start_datetime'].dt.year == 1932].borrow_duration_days.describe()

count    620.000000
mean      10.345161
std       20.827010
min        1.000000
25%        3.000000
50%        6.000000
75%       12.000000
max      384.000000
Name: borrow_duration_days, dtype: float64

In [25]:
# generate a raincloud plot for the same set of borrows
raincloud_plot(borrow_events[borrow_events['start_datetime'].dt.year == 1932], 'borrow_duration_days')

Prompts for more exploration:

- Is there any difference in how long members borrow books when you look at different years?
   - How do you think the missing data affects this? Remember we have a higher proportion of borrowing data for the 1930s than the 1920s.
- What about different months? People often went on vacation in August; were books checked out in August kept for longer than in other months?
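The year and month comparisons can be sketched with the `.dt` accessor. The data below is invented for illustration, so the durations carry no meaning. Reporting the count next to the median is a reminder of how many borrows each figure rests on, which matters when coverage differs between decades.

```python
import pandas as pd

# toy borrows, invented for illustration
toy = pd.DataFrame({
    "start_datetime": pd.to_datetime([
        "1925-08-03", "1925-08-20", "1925-11-02",
        "1932-08-10", "1932-03-05", "1932-03-18",
    ]),
    "borrow_duration_days": [30, 20, 7, 28, 5, 9],
})

# number of borrows and median duration for each year
by_year = toy.groupby(toy.start_datetime.dt.year).borrow_duration_days.agg(
    ["count", "median"]
)
print(by_year)

# compare borrows that started in August with all other months
is_august = toy.start_datetime.dt.month == 8
august_median = toy[is_august].borrow_duration_days.median()
other_median = toy[~is_august].borrow_duration_days.median()
print(f"August median: {august_median} days; other months: {other_median} days")
```

The same boolean masks can be used to pick the subset passed to `raincloud_plot`.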


In [12]:
# generate raincloud plots for different subsets of the borrowing data



## What about specific members?

The Shakespeare and Company Project lets us move between a web interface and the datasets to use different modes of analysis and work at different scales.

Browse the list of [members with lending library cards](https://shakespeareandco.princeton.edu/members/?has_card=on) - these are (generally) members with known borrowing activity.

Browse or search and choose a specific member. You can use the URI in your web browser location bar to find the identifier that is used in the datasets, and then filter the borrowing dataset to get events for that specific member.

Can you tell if famous members like Ernest Hemingway and James Joyce have different borrowing patterns than less famous members?
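Filtering by member can be sketched as follows. The helper name `borrows_for_member` and the `example.org` URIs are hypothetical, made up for this example; in practice you would pass `borrow_events` and a member URI copied from the Project site. Splitting on `;` and testing for an exact match avoids accidental hits on URIs that merely contain another member's identifier.

```python
import pandas as pd

# toy borrows with placeholder member URIs, invented for illustration
toy_borrows = pd.DataFrame({
    "member_uris": [
        "https://example.org/members/member-a/",
        "https://example.org/members/member-a/;https://example.org/members/member-b/",
        "https://example.org/members/member-c/",
    ],
    "item_title": ["Book A", "Book B", "Book C"],
    "borrow_duration_days": [14.0, 3.0, 7.0],
})


def borrows_for_member(borrows, member_uri):
    """Return rows whose semicolon-delimited member_uris include member_uri."""
    has_member = borrows.member_uris.str.split(";").apply(
        # missing values are not lists, so they never match
        lambda uris: isinstance(uris, list) and member_uri in uris
    )
    return borrows[has_member]


member_a = borrows_for_member(toy_borrows, "https://example.org/members/member-a/")
print(member_a[["item_title", "borrow_duration_days"]])
print(member_a.borrow_duration_days.describe())
```

The filtered DataFrame can then be passed to `raincloud_plot` with `'borrow_duration_days'` as the field.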



In [13]:
# generate raincloud plots for member-specific borrowing data

