In [17]:
from txtai.embeddings import Embeddings
import json
import numpy as np
import regex as re
import dateparser
import pandas as pd
from umap import UMAP
from sentence_transformers import SentenceTransformer

# Introduction

Phyllis Diller (1917–2012) was an iconic American female comedian of the mid-twentieth century. Her comedic style was succinct and rapid, and it directly challenged notions of the "American housewife" through self-deprecation.

In 2003, Diller reached out to the Smithsonian about donating some of her personal belongings. Soon afterwards, Dwight Blocker Bowers met with Diller at her home in California, where she presented him with her "Gag File", a collection of 50,000 3x5 index cards on which she prepared, modified, and recorded her jokes during her nearly 50-year-long career. For more information on this Gag File and Diller's career, see this <a href="https://americanhistory.si.edu/blog/help-us-transcribe-phyllis-dillers-jokes">blogpost</a>.

<center><img src="https://ids.si.edu/ids/deliveryService?max=500&id=NMAH-ET2010-28667-000002"></center>


Mike Wilkins and Sheila Duignan supported the digitization of these index cards, and the Smithsonian Transcription department then began crowdsourcing their transcription. Today, the transcriptions are complete and available to the public.

In this blogpost, we will view these index cards through a lens of data science to see what kind of insights we can glean from them.

# Exploring the Data

The Diller dataset is available as a JSON file, a common format for transmitting data over the internet. We can examine the data as a truncated spreadsheet below.

In [33]:
with open("data/pd.json", "r") as f:
    data = json.load(f)
unclean_df = pd.DataFrame(data)

In [34]:
unclean_df.head(5)

Unnamed: 0,id,url,content
0,trl-1488576309868-1488576322718-5,transasset:9000:NMAH-AHB2016q108500,MOVIE STARS: Phyllis Diller\r\nPhyllis Diller\...
1,trl-1488907532569-1488907580088-4,transasset:9120:NMAH-AHB2016q120798,FAMOUS PEOPLE: Phyllis Diller\r\nPhyllis Dille...
2,trl-1490027110884-1490027127889-12,transasset:9374:NMAH-AHB2016q145481,LOSER GAG\r\nPhyllis Diller\r\n10/MAR/1978\r\n...
3,trl-1489523118765-1489523129036-17,transasset:9287:NMAH-AHB2016q135760,PHYLLIS DILLER: COOKING\r\nPhyllis Diller Gag\...
4,trl-1489595133716-1489595169321-5,transasset:9299:NMAH-AHB2016q137367,PHYLLIS: HAIR\r\nPHYLLIS DILLER GAGS\r\nJUL/19...
...,...,...,...
52201,trl-1488734717316-1488734731044-5,transasset:9051:NMAH-AHB2016q114811,"Unknown\r\nBill Daley\r\n17/MAR/1967\r\nHello,..."
52202,trl-1488763510416-1488763520708-10,transasset:9060:NMAH-AHB2016q115715-01,Unknown\r\nJoe Lucas\r\n27/DEC/1963\r\nReporte...
52203,trl-1488792310527-1488792322189-0,transasset:9075:NMAH-AHB2016q116828-01,Unknown\r\nR. L. Parker\r\n25/MAR/1966\r\nAunt...
52204,trl-1488921928068-1488921967987-7,transasset:9123:NMAH-AHB2016q121288,Unknown\r\nBarrie Payne\r\n14/OCT/1969\r\nIF A...


In the above spreadsheet, we can see that we have 52,206 index cards each with 3 different pieces of metadata:

- id: the unique id for the index card
- url: the url for the index card
- content: the raw text of the card

The primary item that we are concerned with is the content column as we will be working with the raw text. To get a better sense of these cards, let's explore one specifically. Below is the image of the first card in the dataset followed by the raw text of the transcription.

<center><img src="https://ids.si.edu/ids/viewTile/node1/A/085/08516721655da26a128cf280264e0701/512/9/0_0.jpg"></center>

In [43]:
print(unclean_df.iloc[0].content)

MOVIE STARS: Phyllis Diller
Phyllis Diller
10/MAR/1978
The place where Phyllis Dillers' star is on Hollywood Blvd went out of business.


The data on these cards usually consists of four sections:

- category: (subcategory) => these are the categories that Diller used. Occasionally, we see a subcategory given.
- author => This is the attribution of the joke.
- date => This is the date recorded on the card, which is when the joke was first put in the Gag File.
- content => This is the joke itself. Sometimes these are multiple lines.

Unfortunately, the original JSON file does not separate out these important pieces of metadata. In its current form, we cannot, for example, map the way in which Diller developed her style or the types of subjects that she discussed. Fortunately, we have ways to parse the content field into structured metadata.
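To make the four sections concrete, here is a minimal sketch that splits the first card shown above into its parts using only the standard library. The `category`/`subcategory` split on the colon and the `%d/%b/%Y` date format are assumptions based on this one card; the cleaning code in the next section uses `dateparser` instead, which copes with more varied date formats.

```python
from datetime import datetime

# Raw transcription of the first card, with the \r\n line breaks seen in the JSON
raw_card = (
    "MOVIE STARS: Phyllis Diller\r\n"
    "Phyllis Diller\r\n"
    "10/MAR/1978\r\n"
    "The place where Phyllis Dillers' star is on Hollywood Blvd went out of business."
)

# splitlines() treats "\r\n" as a single line break
lines = [line.strip() for line in raw_card.splitlines()]
header, author, date_line = lines[0], lines[1], lines[2]
text = " ".join(lines[3:])  # the joke itself may span several lines

# Assumption: a colon in the header separates category from subcategory
category, _, subcategory = header.partition(":")

card = {
    "category": category.strip(),
    "subcategory": subcategory.strip() or None,
    # Assumption: dates look like 10/MAR/1978 (English month abbreviations)
    "date": datetime.strptime(date_line, "%d/%b/%Y"),
    "author": author,
    "text": text,
}
print(card)
```

Cards that deviate from this layout (for example, a date written as `JUL/1984`) would raise an error here, which is one reason a more forgiving date parser is a sensible choice for the full dataset.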

# Cleaning and Structuring the Transcribed Data

We can clean the content data using Python, Pandas, and Regular Expressions (RegEx). The rules below correctly format all but 21 of the index cards. For our purposes, we will treat these as rare exceptions.

In [44]:
jokes = [item["content"] for item in data]
structured_data = []
for joke in jokes:
    joke = joke.split("\n")
    if len(joke) > 3:
        header = joke[0].replace("\r", "").strip()
        author = joke[1].strip()
        date = joke[2]
        date = dateparser.parse(date)
        text = " ".join(joke[3:])
        text = text.replace("(RE: PD)", "").replace("RE: PD", "")
        text = re.sub(r'No\. \d{1,3}', '', text)
        text = text.replace("\r", " ").strip()
        # if {"header": header, "text": text} not in structured_data:
        structured_data.append({"header": header, "date": date, "author": author, "text": text})
    else:
        structured_data.append({"header": "UNKNOWN", "date": "UNKNOWN", "author": "UNKNOWN", "text": "UNKNOWN"})
print(len(structured_data))



52206


As we can see, our new spreadsheet has the metadata correctly aligned to four new categories:

- header
- date
- author
- text

In [45]:
structured_df = pd.DataFrame(structured_data)
structured_df.head(5)

Unnamed: 0,header,date,author,text
0,MOVIE STARS: Phyllis Diller,1978-03-10 00:00:00,Phyllis Diller,The place where Phyllis Dillers' star is on Ho...
1,FAMOUS PEOPLE: Phyllis Diller,1978-03-10 00:00:00,Phyllis Diller,The place where Phyllis Dillers' star is on Ho...
2,LOSER GAG,1978-03-10 00:00:00,Phyllis Diller,The place where Phyllis Dillers' star is on Ho...
3,PHYLLIS DILLER: COOKING,1982-07-10 00:00:00,Phyllis Diller Gag,The kids call my gravy boat the Titanic.
4,PHYLLIS: HAIR,1984-07-10 00:00:00,PHYLLIS DILLER GAGS,(Phyllis' hair and appearance) MY FALL FELL!!!
...,...,...,...,...
52201,Unknown,1967-03-17 00:00:00,Bill Daley,"Hello, Sweetheart. Yes, Mother's in New York...."
52202,Unknown,1963-12-27 00:00:00,Joe Lucas,Reporter: What do you think qualifies you to b...
52203,Unknown,1966-03-25 00:00:00,R. L. Parker,Aunt Frank... She's one of my favorite uncle'...
52204,Unknown,1969-10-14 00:00:00,Barrie Payne,IF ABE'S WIFE HAD BEEN ONE OF THOSE Dialog be...


We can now merge this new spreadsheet with the original data to have a new, structured dataset.

In [55]:
final_df = unclean_df.join(structured_df)
final_df.head(5)

Unnamed: 0,id,url,content,header,date,author,text
0,trl-1488576309868-1488576322718-5,transasset:9000:NMAH-AHB2016q108500,MOVIE STARS: Phyllis Diller\r\nPhyllis Diller\...,MOVIE STARS: Phyllis Diller,1978-03-10 00:00:00,Phyllis Diller,The place where Phyllis Dillers' star is on Ho...
1,trl-1488907532569-1488907580088-4,transasset:9120:NMAH-AHB2016q120798,FAMOUS PEOPLE: Phyllis Diller\r\nPhyllis Dille...,FAMOUS PEOPLE: Phyllis Diller,1978-03-10 00:00:00,Phyllis Diller,The place where Phyllis Dillers' star is on Ho...
2,trl-1490027110884-1490027127889-12,transasset:9374:NMAH-AHB2016q145481,LOSER GAG\r\nPhyllis Diller\r\n10/MAR/1978\r\n...,LOSER GAG,1978-03-10 00:00:00,Phyllis Diller,The place where Phyllis Dillers' star is on Ho...
3,trl-1489523118765-1489523129036-17,transasset:9287:NMAH-AHB2016q135760,PHYLLIS DILLER: COOKING\r\nPhyllis Diller Gag\...,PHYLLIS DILLER: COOKING,1982-07-10 00:00:00,Phyllis Diller Gag,The kids call my gravy boat the Titanic.
4,trl-1489595133716-1489595169321-5,transasset:9299:NMAH-AHB2016q137367,PHYLLIS: HAIR\r\nPHYLLIS DILLER GAGS\r\nJUL/19...,PHYLLIS: HAIR,1984-07-10 00:00:00,PHYLLIS DILLER GAGS,(Phyllis' hair and appearance) MY FALL FELL!!!
...,...,...,...,...,...,...,...
52201,trl-1488734717316-1488734731044-5,transasset:9051:NMAH-AHB2016q114811,"Unknown\r\nBill Daley\r\n17/MAR/1967\r\nHello,...",Unknown,1967-03-17 00:00:00,Bill Daley,"Hello, Sweetheart. Yes, Mother's in New York...."
52202,trl-1488763510416-1488763520708-10,transasset:9060:NMAH-AHB2016q115715-01,Unknown\r\nJoe Lucas\r\n27/DEC/1963\r\nReporte...,Unknown,1963-12-27 00:00:00,Joe Lucas,Reporter: What do you think qualifies you to b...
52203,trl-1488792310527-1488792322189-0,transasset:9075:NMAH-AHB2016q116828-01,Unknown\r\nR. L. Parker\r\n25/MAR/1966\r\nAunt...,Unknown,1966-03-25 00:00:00,R. L. Parker,Aunt Frank... She's one of my favorite uncle'...
52204,trl-1488921928068-1488921967987-7,transasset:9123:NMAH-AHB2016q121288,Unknown\r\nBarrie Payne\r\n14/OCT/1969\r\nIF A...,Unknown,1969-10-14 00:00:00,Barrie Payne,IF ABE'S WIFE HAD BEEN ONE OF THOSE Dialog be...
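The join above works because `DataFrame.join` aligns rows on the index by default, and the cleaning loop appended exactly one row per card (including the `UNKNOWN` fallback rows), so both frames share the same 0..N-1 index. A minimal sketch with made-up two-row frames (the values are placeholders, not real cards):

```python
import pandas as pd

# Hypothetical stand-ins for unclean_df and structured_df
left = pd.DataFrame({"id": ["card-0", "card-1"], "content": ["raw text 0", "raw text 1"]})
right = pd.DataFrame({"header": ["COOKING", "HAIR"], "author": ["A", "B"]})

# join() matches rows by index label, so row 0 pairs with row 0, and so on
merged = left.join(right)
print(merged)
```

If rows had been skipped during cleaning instead of filled with a fallback, the two indexes would drift apart and the metadata would attach to the wrong cards.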


In [62]:
final_df.to_csv("data/pd_final.csv", index=False)

# Converting Texts into Vectors

Once textual data is cleaned, it is often useful to get a quick sense of it via machine learning. We have many tools at our disposal to do this. The approach I am opting for is unsupervised machine learning. In this method, we convert all of our texts into large vectors. Vectors are numerical representations of texts that make it possible for a computer to process them. The vectors generated here are rich representations that capture not only each word in a text, but also each word's general meaning in the English language.

While there are many ways in which we can do this, I have opted to do this via transformer-based models, language models that can handle some of the more challenging aspects of language, such as typographical errors, misspellings, and idiomatic expressions. I will specifically be using the nli-mpnet-base-v2 model, which is available from Hugging Face, a widely used hub for transformer models.

In [63]:
model = SentenceTransformer('nli-mpnet-base-v2')

# Work from the structured dataset built above
df = pd.DataFrame(final_df)
sentences = df["text"].tolist()

# Calculate one embedding vector per joke
X = model.encode(sentences)

The code above comes from the documentation of the Python library <a href="https://github.com/koaning/bulk">bulk</a>, created by <a href="https://koaning.io/">Vincent D. Warmerdam</a>, a machine learning engineer at ExplosionAI, the creators of spaCy.

Once we have converted all our texts into vectors, we can begin to explore them in a graph. Before we can do that, however, we must first make these vectors readable to humans. At this stage, our vectors are high-dimensional representations of our texts (imagine hundreds of graphs that must be read simultaneously). To plot them, we must reduce these vectors to two dimensions. There are many ways to do this, but I am opting for <a href="https://umap-learn.readthedocs.io/en/latest/">UMAP (Uniform Manifold Approximation and Projection)</a>, an algorithm introduced in 2018.

In [65]:
# Reduce the dimensions with UMAP
umap = UMAP()
X_tfm = umap.fit_transform(X)
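To build intuition for what a dimensionality reduction does, here is a small sketch that uses PCA (a simpler, linear technique, not UMAP) on made-up data. The six toy vectors and their two groups are assumptions for illustration only; the point is that points which are close in the original space stay close in the 2D projection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Six toy "embeddings" in 5 dimensions: two groups centred at 0 and at 1
group_a = rng.normal(loc=0.0, scale=0.1, size=(3, 5))
group_b = rng.normal(loc=1.0, scale=0.1, size=(3, 5))
X_toy = np.vstack([group_a, group_b])

# PCA via SVD: centre the data, then project onto the top two directions
centered = X_toy - X_toy.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ vt[:2].T

print(coords_2d.shape)
print(coords_2d.round(2))
```

UMAP differs in that it is non-linear and focuses on preserving local neighbourhoods, which tends to give tighter visual clusters, but the input and output shapes are the same: many columns in, two columns out.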

With the UMAP dimensionality reduction complete, we can now load these two dimensions into our CSV file as X and Y coordinates.

In [66]:
# Apply coordinates
final_df['x'] = X_tfm[:, 0]
final_df['y'] = X_tfm[:, 1]
final_df.to_csv("data/pd_final_coords.csv", index=False)

# Visualizing the Data

At this stage, each joke has a pair of coordinates in a graph. In order to interact with that data, we need to visualize it. We can do this in a Jupyter Notebook via Bokeh. The code below is a modified version of code from the bulk library (mentioned above), adapted to work within a Jupyter Notebook.

In [67]:
from bokeh.application import Application
from bokeh.application.handlers import FunctionHandler
from bokeh.io import output_notebook
from bokeh.layouts import column, row
from bokeh.models import Button, ColumnDataSource, DataTable, TableColumn, TextInput
from bokeh.plotting import figure, show

In [68]:
output_notebook()

In [69]:
def bulk_text(path, keywords=None):
    df = pd.read_csv(path)
    df['alpha'] = 0.5
    if keywords:
        df['color'] = [determine_keyword(str(t), keywords) for t in df['text']]
        df['alpha'] = [0.4 if c == 'none' else 1 for c in df['color']]

    highlighted_idx = []

    # mapper, df = get_color_mapping(df)
    columns = [
        TableColumn(field="text", title="text"),
        TableColumn(field="header", title="header")
    ]

    def update(attr, old, new):
        """Callback used for plot update when lasso selecting"""
        # nonlocal (not global) so this rebinds the list defined in bulk_text
        nonlocal highlighted_idx
        subset = df.iloc[new]
        highlighted_idx = new
        subset = subset.iloc[np.random.permutation(len(subset))]
        source.data = subset

    def save():
        """Callback used to save highlighted data points"""
        df.iloc[highlighted_idx][['text']].to_csv(text_filename.value, index=False)

    source = ColumnDataSource(data=dict())
    source_orig = ColumnDataSource(data=df)

    data_table = DataTable(source=source, columns=columns, width=750 if "color" in df.columns else 800)
    source.data = df

    p = figure(title="", sizing_mode="scale_both", tools=["lasso_select", "box_select", "pan", "box_zoom", "wheel_zoom", "reset"])
    p.toolbar.active_drag = None
    p.toolbar.active_inspect = None

    circle_kwargs = {"x": "x", "y": "y", "size": 1, "source": source_orig, "alpha": "alpha"}

    scatter = p.circle(**circle_kwargs)
    p.plot_width = 300
    if "color" in df.columns:
        p.plot_width=350
    p.plot_height = 300

    scatter.data_source.selected.on_change('indices', update)

    text_filename = TextInput(value="out.csv", title="Filename:")
    save_btn = Button(label="SAVE")
    save_btn.on_click(save)

    controls = column(p, text_filename, save_btn)
    def make_doc(doc):
        doc.add_root(row(controls, data_table))

    handler = FunctionHandler(make_doc)
    app = Application(handler)
    return app


app = bulk_text("data/pd_final_coords.csv")
show(app)
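Note that `bulk_text` calls a helper named `determine_keyword` when the `keywords` argument is given, but that helper is not defined in this notebook. Below is a minimal sketch of what it might look like, assuming it should return the first matching keyword or the string `'none'` (the value `bulk_text` checks when setting `alpha`). This is an assumption, not necessarily the bulk library's own implementation.

```python
def determine_keyword(text, keywords):
    """Return the first keyword found in text (case-insensitive), else 'none'."""
    lowered = text.lower()
    for keyword in keywords:
        if keyword.lower() in lowered:
            return keyword
    return "none"


print(determine_keyword("The kids call my gravy boat the Titanic.", ["hair", "gravy"]))
print(determine_keyword("MY FALL FELL!!!", ["gravy"]))
```

With a helper like this defined before `bulk_text`, a call such as `bulk_text("data/pd_final_coords.csv", keywords=["gravy"])` would no longer raise a `NameError`.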

# Using Machine Learning to Search Texts with TxtAI

While the above visualization allows us to easily see how Diller's jokes cluster together, it does not allow us to easily search the jokes. Searching texts is an area of active research. Traditional search engines rely on a method known as TF-IDF, or term frequency-inverse document frequency. This is a keyword-based approach that reduces all documents to a set of terms. The frequency with which your search term appears in a single document, weighed against how many documents in the collection contain it, allows a search engine to rank documents in the results.

Scholars are developing newer approaches to this same problem. The key issue with TF-IDF is that it does not allow for semantic-based searching. In other words, you cannot necessarily search for concepts. Imagine you wanted to find all texts that dealt with hunger. While some results would be returned with the keyword "hunger", others would not, e.g. texts that do not use the word hunger but instead use synonyms or represent the idea of hunger in an abstract way.
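The limitation is easy to see in a small worked example. The sketch below implements one common textbook variant of TF-IDF (term frequency times the log of inverse document frequency) over three jokes quoted elsewhere in this post, lowercased and stripped of punctuation. Real search engines use more refined weighting, so treat this as an illustration only.

```python
import math
from collections import Counter

docs = [
    "the kids call my gravy boat the titanic",
    "i married for better or worse but not necessarily for keeps",
    "marriage counselor to old couple you must realize that there is "
    "more to marriage than mad unbridled sex",
]
tokenized = [doc.split() for doc in docs]


def tf_idf(term, tokens, corpus):
    """Score one term for one document against the whole corpus."""
    tf = Counter(tokens)[term] / len(tokens)
    df = sum(1 for other in corpus if term in other)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


scores = [tf_idf("marriage", tokens, tokenized) for tokens in tokenized]
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc[:40]}")
```

Only the third joke receives a non-zero score for the query "marriage". The second joke is plainly about marriage, but because it uses the word "married" instead, a purely keyword-based score misses it.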

Machine learning allows us to solve this issue by searching for texts based not on keywords alone, but on key concepts. Results in this type of search engine may or may not contain the words of the initial prompt; instead, they are ranked by how closely they match the concepts expressed in the search prompt.

For this blogpost, we will see this process work via TxtAI, a Python library that embeds our texts with the same model we used above and indexes the resulting vectors. We can then search the text embeddings (vectors), rather than the raw text, to retrieve broader and better results.
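Conceptually, searching embeddings means turning the query into a vector and ranking the indexed vectors by how similar they are to it. The sketch below shows the idea with cosine similarity over made-up three-dimensional vectors; the labels and numbers are hypothetical (real embeddings have hundreds of dimensions), and the library's own indexing is more sophisticated than this brute-force loop.

```python
import math


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical vectors standing in for embedded jokes
toy_index = {
    "joke about weddings": [0.9, 0.1, 0.0],
    "joke about cooking": [0.1, 0.9, 0.1],
    "joke about husbands": [0.8, 0.2, 0.1],
}
# Hypothetical vector standing in for the embedded query "marriage"
query_vector = [1.0, 0.0, 0.0]

ranked = sorted(
    toy_index,
    key=lambda label: cosine(query_vector, toy_index[label]),
    reverse=True,
)
for label in ranked:
    print(f"{cosine(query_vector, toy_index[label]):.3f}  {label}")
```

Neither of the top two toy entries needs to contain the query word itself; they rank highly because their vectors point in a similar direction to the query vector.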

In [9]:
from txtai.embeddings import Embeddings

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})

In [10]:
df = pd.read_csv("data/pd_final_coords.csv")
df.head(5)

Unnamed: 0.1,Unnamed: 0,id,url,content,header,date,author,text,x,y
0,0,trl-1488576309868-1488576322718-5,transasset:9000:NMAH-AHB2016q108500,MOVIE STARS: Phyllis Diller\r\nPhyllis Diller\...,MOVIE STARS: Phyllis Diller,1978-03-10 00:00:00,Phyllis Diller,The place where Phyllis Dillers' star is on Ho...,4.046152,3.980110
1,1,trl-1488907532569-1488907580088-4,transasset:9120:NMAH-AHB2016q120798,FAMOUS PEOPLE: Phyllis Diller\r\nPhyllis Dille...,FAMOUS PEOPLE: Phyllis Diller,1978-03-10 00:00:00,Phyllis Diller,The place where Phyllis Dillers' star is on Ho...,4.095754,3.908792
2,2,trl-1490027110884-1490027127889-12,transasset:9374:NMAH-AHB2016q145481,LOSER GAG\r\nPhyllis Diller\r\n10/MAR/1978\r\n...,LOSER GAG,1978-03-10 00:00:00,Phyllis Diller,The place where Phyllis Dillers' star is on Ho...,4.105895,3.916691
3,3,trl-1489523118765-1489523129036-17,transasset:9287:NMAH-AHB2016q135760,PHYLLIS DILLER: COOKING\r\nPhyllis Diller Gag\...,PHYLLIS DILLER: COOKING,1982-07-10 00:00:00,Phyllis Diller Gag,The kids call my gravy boat the Titanic.,1.823139,0.387008
4,4,trl-1489595133716-1489595169321-5,transasset:9299:NMAH-AHB2016q137367,PHYLLIS: HAIR\r\nPHYLLIS DILLER GAGS\r\nJUL/19...,PHYLLIS: HAIR,1984-07-10 00:00:00,PHYLLIS DILLER GAGS,(Phyllis' hair and appearance) MY FALL FELL!!!,5.877452,5.740893
...,...,...,...,...,...,...,...,...,...,...
52201,52201,trl-1488734717316-1488734731044-5,transasset:9051:NMAH-AHB2016q114811,"Unknown\r\nBill Daley\r\n17/MAR/1967\r\nHello,...",Unknown,1967-03-17 00:00:00,Bill Daley,"Hello, Sweetheart. Yes, Mother's in New York....",7.407110,2.933207
52202,52202,trl-1488763510416-1488763520708-10,transasset:9060:NMAH-AHB2016q115715-01,Unknown\r\nJoe Lucas\r\n27/DEC/1963\r\nReporte...,Unknown,1963-12-27 00:00:00,Joe Lucas,Reporter: What do you think qualifies you to b...,4.182758,6.000739
52203,52203,trl-1488792310527-1488792322189-0,transasset:9075:NMAH-AHB2016q116828-01,Unknown\r\nR. L. Parker\r\n25/MAR/1966\r\nAunt...,Unknown,1966-03-25 00:00:00,R. L. Parker,Aunt Frank... She's one of my favorite uncle'...,4.951826,1.974605
52204,52204,trl-1488921928068-1488921967987-7,transasset:9123:NMAH-AHB2016q121288,Unknown\r\nBarrie Payne\r\n14/OCT/1969\r\nIF A...,Unknown,1969-10-14 00:00:00,Barrie Payne,IF ABE'S WIFE HAD BEEN ONE OF THOSE Dialog be...,5.583723,2.097227


In [11]:
jokes = df.text.tolist()

In [12]:
# Create an index for the list of text
embeddings.index([(uid, text, None) for uid, text in enumerate(jokes)])

## Exploring our Search Engine

In [16]:
res = embeddings.search("marriage", 20)
for i, r in enumerate(res):
    print()
    print(f"Text {i:03}: {jokes[r[0]]}")
    print(f"Similarity: {r[1]}")
    print()


Text 000: I married for better or worse ... but not necessarily for keeps.
Similarity: 0.6346619129180908


Text 001: I married for better or worse ... but not necessarily for keeps.
Similarity: 0.634661853313446


Text 002: Marriage counselor to old couple: "You must realize that there is more to marriage than mad, unbridled sex."
Similarity: 0.622109055519104


Text 003: Marriage counselor to old couple:  "You must realize that there is more to marriage than mad, unbridled sex."
Similarity: 0.622109055519104


Text 004: Marriage counselor to old couple: "You must realize that there is more to marriage than mad, unbridled sex."
Similarity: 0.622109055519104


Text 005: Marriage counselor to old couple: "You must realize that there is more to marriage than mad, unbridled sex."
Similarity: 0.622109055519104


Text 006: I married for better or for worse - but not necessarily for keeps.
Similarity: 0.6216991543769836


Text 007: I married for better or for worse - but not necessarily for

In [15]:
embeddings.save("models/pd_index")

# Exploring Trends in the Data