#### Sociology 128D: Mining Culture Through Text Data: Introduction to Social Data Science – Summer '22

# Notebook 5: Document Similarity 

In this notebook, we're going to go a bit further with vector semantics, which is one of the main approaches we'll use in this class and which has had an enormous influence in cultural sociology. Specifically, we are going to build on Notebook 3 by using document-term matrices and tf-idf weighting as a basis for directly measuring how similar or dissimilar documents are.

Regarding the exercises at the end, don't worry if you aren't a history buff. The exercises are just meant to reinforce how text data can be used for social inquiry. 

Please download the [State of the Union Corpus (1790-2018)](https://www.kaggle.com/rtatman/state-of-the-union-corpus-1989-2017), which was posted to Kaggle by Rachael Tatman and Liling Tan. 

In [None]:
import copy
import matplotlib.pyplot as plt
import os
import pandas as pd
import numpy as np
import re
import seaborn as sns
import spacy

from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

sns.set_theme(style="darkgrid")

## Loading and Cleaning the Data

We're going to load the data a bit like we did in Notebook 3. First, we're going to create a `list` called <tt>text_files</tt> using the built-in [`filter()` function](https://www.geeksforgeeks.org/filter-in-python/). `filter()` takes two arguments: a function (here, a [lambda function](https://www.geeksforgeeks.org/python-lambda-anonymous-functions-filter-map-reduce/)) and an iterable (such as a list). It keeps only the elements of the iterable for which the function returns <tt>True</tt>.

In [None]:
text_files = list(filter(lambda x: x.endswith(".txt"), os.listdir("data/sotu")))

In [None]:
text_files[:10]

In this application, ```lambda x: x.endswith(".txt")``` returns <tt>True</tt> if the string <tt>x</tt> ends with ".txt" and <tt>False</tt> otherwise.

The second part, ```os.listdir("data/sotu")```, uses the [`os` module's](https://www.geeksforgeeks.org/os-module-python-examples/) `listdir()` function to return a list of the filenames in that directory.

Our call to `filter()` checks whether each filename in that directory ends with ".txt" and returns only the files for which that is true. We then cast the result as a `list`.
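As a small illustration of the same pattern, the sketch below applies `filter()` to a made-up list of filenames (these names are hypothetical, not the actual contents of the directory) and shows that a list comprehension gives the same result.

```python
# Hypothetical filenames standing in for the output of os.listdir()
toy_filenames = ["Adams_1797.txt", "notes.md", "Polk_1845.txt", ".DS_Store"]

# filter() keeps only the elements for which the lambda returns True
with_filter = list(filter(lambda x: x.endswith(".txt"), toy_filenames))

# An equivalent list comprehension
with_comprehension = [x for x in toy_filenames if x.endswith(".txt")]

print(with_filter)
print(with_filter == with_comprehension)
```

Which version to use is mostly a matter of taste; many Python programmers find the comprehension easier to read.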

The result, <tt>text_files</tt>, is a list of the filenames in the directory "data/sotu" that end with ".txt", but the filenames don't include the path from our working directory to the files. We need to add the directory "data/sotu" to each filename to access the files.

We're going to use [`os.path.join()`](https://www.geeksforgeeks.org/python-os-path-join-method/) to create a list of file paths. We'll call this list <tt>address_paths</tt> because it's a list of the file paths for State of the Union Addresses (stored as files ending in .txt). We'll use a list comprehension, but this is the same as looping through <tt>text_files</tt> with a for loop, using `os.path.join()`, and appending the result to the list we are creating.

In [None]:
address_paths = [os.path.join("data/sotu/", f) for f in text_files]

In [None]:
print(address_paths[:10])
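To make the equivalence between the list comprehension and a for loop concrete, here is a minimal sketch. The two filenames and the variable names are made up for illustration; in the notebook the filenames come from <tt>text_files</tt>.

```python
import os

# Hypothetical filenames; in the notebook these come from text_files
toy_files = ["Adams_1797.txt", "Polk_1845.txt"]

# The for-loop version: start with an empty list and append each joined path
loop_paths = []
for f in toy_files:
    loop_paths.append(os.path.join("data", "sotu", f))

# The list-comprehension version does the same thing in one line
comprehension_paths = [os.path.join("data", "sotu", f) for f in toy_files]

print(loop_paths == comprehension_paths)
```

Passing the directory names as separate arguments lets `os.path.join()` pick the right separator for your operating system.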

Now we're going to use these file paths to create a data frame with the text of each State of the Union as well as the president and year. The function <tt>return_sotu_name_year_text()</tt> accepts just one argument, a file path.

The file path points us to the text, but the *filename* includes the president and the year joined by an underscore.

> Adams_1797.txt

Since the filename is just a string, we can easily associate the text of each State of the Union with the corresponding president and year.

In [None]:
def return_sotu_name_year_text(f: str):
    """
    Return the name, year, and text of a SOTU.
    """
    doc = open(f, "r").read().strip()
    f = os.path.split(f)[-1]
    f = f.replace(".txt", "")
    pres, year = f.split("_")
    
    return pres, year, doc

<div class="alert alert-info">
The line
    
```python
doc = open(f, "r").read().strip()
```
opens the file path <tt>f</tt>, reads the full file into memory as a string, and strips whitespace like linebreaks from the ends.<br>

```python
f = os.path.split(f)[-1]
```

replaces the file path we've stored as the variable <tt>f</tt> with the last part of the file path, in this case the filename ending in .txt, by splitting the file path and taking the last element by using the index -1.

```python
pres, year = f.split("_")
```
creates the variables <tt>pres</tt> and <tt>year</tt> by splitting the filename <tt>f</tt> on underscores.
</div>
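If it helps to see the intermediate values, the sketch below traces the same string operations on a single example path. The path is hypothetical and the file is never opened, so this runs even without the data.

```python
import os

# A hypothetical file path; nothing is read from disk here
toy_path = os.path.join("data", "sotu", "Adams_1797.txt")

toy_filename = os.path.split(toy_path)[-1]    # "Adams_1797.txt"
toy_stem = toy_filename.replace(".txt", "")   # "Adams_1797"
pres, year = toy_stem.split("_")              # two strings: "Adams" and "1797"

print(pres, year, type(year))
```

Note that <tt>year</tt> is still a string at this point, which is why the notebook later converts the column to integers with `.astype()`.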

Now we can load the data. First, we create a dataframe with only one column: the file path. Next, we apply our function <tt>return_sotu_name_year_text()</tt> to each row's file path, creating three new columns.

In [None]:
df = pd.DataFrame(address_paths, columns = ["file_path"])

In [None]:
df.head()

In [None]:
df[["president", "year", "text"]] = df.file_path.apply(lambda x: pd.Series(return_sotu_name_year_text(x)))
df.drop(columns = ["file_path"], inplace = True)

We also drop the original column containing the file path. Now we have a dataframe with just the president, year, and text for each State of the Union.

In [None]:
df.head()

In [None]:
df.shape

Next, we remove any rows without actual speeches (i.e., where the <tt>text</tt> column is an empty string), convert the year to an integer, identify any missing years, sort the dataframe by year, and reset the index.

In [None]:
df = df[df["text"] != ""]
df = df.astype({"year": int})
df.year.min(), df.year.max()

In [None]:
[i for i in range(1791,2019) if i not in df.year.values]

In [None]:
len(df.index)

In [None]:
df.shape

In [None]:
df.sort_values(by="year", inplace=True)
df.reset_index(inplace=True, drop=True)

In [None]:
df.head()

Finally, we want to make sure that we're distinguishing between presidents with the same last names. There were two Harrisons, but William Henry Harrison didn't live long enough to give a State of the Union address. Neither did James A. Garfield. Since Grover Cleveland counts twice, that gives us 42 unique presidents who had given a State of the Union address in the time period covered by the corpus, which stops in 2018.

In [None]:
df.president = np.where(df.president.eq("Adams") & df["year"].gt(1800), "Adams2", df.president)
df.president = np.where(df.president.eq("Bush") & df["year"].gt(2000), "Bush2", df.president)
df.president = np.where(df.president.eq("Johnson") & df["year"].gt(1900), "Johnson2", df.president)
df.president = np.where(df.president.eq("Roosevelt") & df["year"].gt(1930), "Roosevelt2", df.president)

In [None]:
len(df.president.unique())

In [None]:
df.president.unique()

We can use a `dict` and the `.apply()` method to create a column for the party of the president who gave the speech.

In [None]:
party_dict = {
    'Washington': "Unaffiliated", 
    'Adams': "Federalist", 
    'Jefferson': "Democratic-Republican", 
    'Madison': "Democratic-Republican", 
    'Monroe': "Democratic-Republican", 
    'Adams2': "Democratic-Republican", 
    'Jackson': "Democrat", 
    'Buren': "Democrat", 
    'Tyler': "Whig", 
    'Polk': "Democrat", 
    'Taylor': "Whig", 
    'Fillmore': "Whig", 
    'Pierce': "Democrat", 
    'Buchanan': "Democrat", 
    'Lincoln': "Republican", 
    'Johnson': "Democrat", 
    'Grant': "Republican", 
    'Hayes': "Republican", 
    'Arthur': "Republican", 
    'Cleveland': "Democrat", 
    'Harrison': "Republican", 
    'McKinley': "Republican", 
    'Roosevelt': "Republican", 
    'Taft': "Republican", 
    'Wilson': "Democrat", 
    'Harding': "Republican", 
    'Coolidge': "Republican", 
    'Hoover': "Republican", 
    'Roosevelt2': "Democrat", 
    'Truman': "Democrat", 
    'Eisenhower': "Republican", 
    'Kennedy': "Democrat", 
    'Johnson2': "Democrat", 
    'Nixon': "Republican", 
    'Ford': "Republican", 
    'Carter': "Democrat", 
    'Reagan': "Republican", 
    'Bush': "Republican", 
    'Clinton': "Democrat", 
    'Bush2': "Republican", 
    'Obama': "Democrat", 
    'Trump': "Republican"
}

df["party"] = df.president.apply(lambda x: party_dict[x])

In [None]:
df[["party", "year"]].groupby("party").count() # number of speeches by party in the dataset

In [None]:
df[["president", "year"]].groupby("president").count() # number of speeches by each pres in the dataset

## Preprocessing the Text

In [None]:
def preprocess_post(post: str) -> str:
    """
    Tokenize, lemmatize, remove stop words, 
    remove non-alphabetic characters.
    """
    post = " ".join([word.lemma_ for word in nlp(post) if not word.is_stop])
    post = re.sub("[^a-z]", " ", post.lower())
    
    return re.sub(r"\s+", " ", post).strip() # raw string avoids an invalid-escape warning for \s


nlp = spacy.load("en_core_web_sm", disable=["ner"])

The object we call <tt>nlp</tt> is a language model from [spaCy](https://spacy.io/). It does part-of-speech tagging, named entity recognition, and more. `disable=["ner"]` tells it not to perform named entity recognition. Turning things off might speed it up!


<div class="alert alert-info">
The function <tt>preprocess_post()</tt> is equivalent to the following:

```python
def preprocess_post(post: str) -> str:
    """
    Tokenizes and returns the lowercase lemmas of
    tokens that are not stop words, minus any 
    non-alphabetic characters
    """
    words = []
    for word in nlp(post): # each "word" in nlp(post) has been part-of-speech tagged, etc.
        if not word.is_stop: # ".is_stop" checks whether spacy has determined it's a stop word
            words.append(word.lemma_) # adding the lemma of the word, not the word itself, to the list
    post = " ".join(words) # converting the list of words to a string variable separated by spaces
    post = post.lower() # make everything lowercase
    post = re.sub("[^a-z]", " ", post) # now we replace non-alphabetic chars with spaces
    post = re.sub(r"\s+", " ", post) # now we replace long stretches of whitespace with a single space
    post = post.strip() # now we strip whitespace from the edges
    
    return post
```
    
</div>
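To see what the two regular expressions do without loading spaCy, here is a small sketch on a made-up string. The string is only a stand-in for the space-joined lemmas; it is not real output from <tt>nlp</tt>.

```python
import re

# A made-up string standing in for the space-joined lemmas
toy_text = "Fellow-Citizens of the Senate, 1790:   the  Union's STATE"

lowered = toy_text.lower()
letters_only = re.sub("[^a-z]", " ", lowered)    # every non-letter becomes a space
collapsed = re.sub(r"\s+", " ", letters_only)    # runs of whitespace become one space
cleaned = collapsed.strip()                      # trim the edges

print(cleaned)
```

Notice that replacing the apostrophe with a space leaves a stray one-letter token ("s") behind. Tokens like that are worth keeping in mind when comparing vocabularies later on.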

In [None]:
%%time

df["preprocessed"] = df.text.apply(preprocess_post)

In [None]:
df.to_json("df_with_preprocessed_sotu.json")
df = pd.read_json("df_with_preprocessed_sotu.json")

In [None]:
df.head()

## Creating a Document-Term Matrix

Converting a corpus of documents to a `document-term matrix` is a core step for many NLP tasks. We saw how to do that in Notebook 3, but we'll review it now before introducing a faster way to complete this step.

We'll start by getting the number of times each *type* (unique word) occurs in the entire corpus. We'll save this as a `dict`-like object called <tt>term_frequencies</tt>, which we'll create using the [`Counter` class](https://www.geeksforgeeks.org/counters-in-python-set-1/) from the `collections` module.

In [None]:
term_frequencies = Counter(" ".join(df["preprocessed"]).split())
vocabulary = list(term_frequencies.keys())
print(f"There are {len(vocabulary):,} unique words in the corpus.")

<div class="alert alert-info">
Let's break this down a bit:

```python
" ".join(df["preprocessed"]).split()
```


joins each of the preprocessed speeches with a single space, creating one big document with all of the *tokens* in the entire corpus. This would be a single string. The `str.split()` method then splits that string on whitespace (like spaces), returning a `list` containing all the tokens.

We then use `Counter()` to count the number of times each type occurs in that list, saving it as <tt>term_frequencies</tt>.

```python
vocabulary = list(term_frequencies.keys())
```

accesses the keys (types) and casts the result as a `list`, giving us a list of the unique words.
</div>
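The same steps on a tiny, made-up corpus may make this easier to follow. The two "documents" and the <tt>toy_</tt> names below are invented for illustration.

```python
from collections import Counter

# Two made-up "preprocessed" documents
toy_docs = ["union state union", "state law"]

toy_tokens = " ".join(toy_docs).split()       # every token in the corpus, in order
toy_term_frequencies = Counter(toy_tokens)    # maps each type to its corpus-wide count
toy_vocabulary = list(toy_term_frequencies.keys())

print(toy_tokens)
print(toy_term_frequencies)
print(toy_vocabulary)
```

There are five tokens but only three types, and the keys of the `Counter` are exactly those types.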

Let's take a look at some of the most frequent words. Here, we create a `list` called <tt>tups</tt> (short for tuples) by accessing the `.items()` from <tt>term_frequencies</tt>. Each "item" is a tuple like <tt>(key, value)</tt>, where the key and value are the type and count. We sort this list using a lambda function that checks the value at index 1. The value at index 1 of each tuple is the number of times the word occurs in the corpus. This means that

```python
key = lambda x: x[1]
```

says we want to sort by the frequencies. We also set `reverse=True` to get a list in order from most frequent to least frequent. We then display the first ten using `tups[:10]`.

In [None]:
tups = sorted(list(term_frequencies.items()), key=lambda x: x[1], reverse=True)
tups[:10]

Now we filter the vocabulary to exclude words that occur only once.

In [None]:
vocabulary = list(filter(lambda x: term_frequencies[x] > 1, vocabulary))
print(f"{len(vocabulary):,} unique words occur more than once.")

Next we define a function that accepts a string as an argument and returns a version of that string without any duplicate words.

In [None]:
def set_of_types(document: str) -> str:
    """
    Returns a string with all duplicates of words removed
    """

    return " ".join(set(document.split()))

The function <tt>set_of_types()</tt> uses the `str.split()` method to split the speech (a string) on whitespace, casts it as a `set()` (removing any duplicates), then uses the `.join()` method to join the tokens into a single string again using whitespace.

Now we create a column called <tt>types</tt> by applying this function to the preprocessed text, row by row. This creates a copy of each speech without any duplicate words.

In [None]:
df["types"] = df.preprocessed.apply(set_of_types)

Why do we do that? It's just a convenient way to count the number of documents each word occurs in. Just like the code we used to create <tt>term_frequencies</tt>, the code to create <tt>document_frequencies</tt> first joins the <tt>types</tt> with a single space, creating one big document. It then uses `str.split()` to convert that giant document into a list of tokens and uses `Counter()` to count how many times each type occurs. Because each type appears at most once per speech in <tt>types</tt>, that count is the number of documents containing the word.

We can get rid of the <tt>types</tt> column. It has served its purpose: helping avoid going word by word through each document to get the number of documents in which each word occurs.

In [None]:
document_frequencies = Counter(" ".join(df.types).split())
df.drop(columns=["types"], inplace=True)
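A toy version of this trick shows how document frequency differs from term frequency. The three short "documents" are made up for illustration.

```python
from collections import Counter

toy_docs = ["union state union", "state law", "law law law"]

# Term frequency: how many times each word occurs in the whole corpus
toy_tf = Counter(" ".join(toy_docs).split())

# Remove within-document duplicates so each word counts at most once per document
toy_types = [" ".join(set(doc.split())) for doc in toy_docs]

# Document frequency: how many documents each word occurs in
toy_doc_freq = Counter(" ".join(toy_types).split())

print(toy_tf["law"], toy_doc_freq["law"])
```

"law" occurs four times in the corpus but in only two documents, so its term frequency and document frequency differ.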

Words that occur in only one document don't give us a lot of information, so we'll filter them out.

In [None]:
vocabulary = list(filter(lambda x: document_frequencies[x] > 1, vocabulary))
print(f"The vocabulary now has {len(vocabulary):,} words.")

Now we're almost ready to create the `document-term matrix`. We'll create a copy of our dataframe and call it <tt>dtm</tt> (for "document-term matrix"). We're only going to keep the column with the preprocessed text. We're also going to convert this from a string for each speech to a list of tokens for each speech using `str.split()` and then rename the column "<preprocessed\>". The new name is somewhat arbitrary, but the angle brackets matter: we're about to create a column for each unique word, and because preprocessing strips non-alphabetic characters, no word in the vocabulary can collide with "<preprocessed\>". We might not expect "preprocessed" to be in the vocabulary, but guarding against the collision is good practice.

In [None]:
dtm = copy.copy(df)
dtm.preprocessed = dtm.preprocessed.apply(str.split)
dtm = dtm[["preprocessed"]]
dtm.rename(columns={"preprocessed": "<preprocessed>"}, inplace = True)
dtm.head()

Now we're going to create a column for each word in the vocabulary. Each row still corresponds to a State of the Union, and the value for each cell will be the number of times that word occurs in that (preprocessed) speech. Since we've converted the preprocessed text to a list, we can use the `.count()` method to get the number of times each word in the vocabulary occurs in each document.

The function <tt>term_frequency()</tt> uses a list comprehension and returns a list of word counts for each speech. We store these counts in a new dataframe, <tt>dtm_counts</tt>, and rename its columns so that each one is labeled with the corresponding word. Because <tt>dtm_counts</tt> is a separate dataframe, it doesn't include the "<preprocessed\>" column, which we no longer need for the document-term matrix.

In [None]:
def term_frequency(doc: list, vocab: list) -> list:
    """
    Returns counts of each term in a list
    """
    
    return [doc.count(term) for term in vocab]

In [None]:
%%time

dtm_counts = dtm["<preprocessed>"].apply(lambda x: pd.Series(term_frequency(x, vocabulary)))
dtm_counts.rename(mapper={i:vocabulary[i] for i in range(len(vocabulary))}, inplace=True, axis=1)

In [None]:
dtm_counts.to_json("dtm_counts.json")
dtm_counts = pd.read_json("dtm_counts.json")

In [None]:
dtm_counts.head()

## A Faster Way: CountVectorizer

We can do this much more quickly with `scikit-learn's` [`CountVectorizer` class](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

The argument `min_df=2` excludes words that occur in fewer than two documents.

Let's compare the results!

In [None]:
%%time

count_vectorizer = CountVectorizer(min_df=2)
counts = count_vectorizer.fit_transform(df["preprocessed"])
counts.shape

We'll use `pd.DataFrame.sparse.from_spmatrix()` to convert the result to a dataframe.

The vocabulary is in a different order, and it's 25 columns narrower. Let's see what's missing.

In [None]:
new_df = pd.DataFrame.sparse.from_spmatrix(counts, columns=count_vectorizer.get_feature_names_out())

In [None]:
new_df = pd.DataFrame(new_df.to_numpy(), columns=new_df.columns)

In [None]:
new_df.head()

In [None]:
diffs = list(set(dtm_counts.columns).difference(set(new_df.columns)))

In [None]:
len(diffs)

In [None]:
print(sorted(diffs))

`CountVectorizer()` has removed single characters from the vocabulary.
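This follows from the vectorizer's default `token_pattern`, `(?u)\b\w\w+\b`, which only matches tokens made of two or more word characters. The sketch below applies that same pattern with the standard library's `re` module to a made-up string. It imitates only the tokenization step and is not how you would normally call scikit-learn.

```python
import re

# The default token_pattern documented for CountVectorizer
toy_token_pattern = r"(?u)\b\w\w+\b"

# A made-up preprocessed string containing some single-letter tokens
toy_text = "union s state u s law"

toy_tokens = re.findall(toy_token_pattern, toy_text)
print(toy_tokens)
```

If you wanted to keep single characters, you could pass your own `token_pattern` when creating the vectorizer.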

## Measuring Similarity

This brings us to a few of the key intuitions of text analysis. We now have **vectors** instead of strings as representations of the speeches, and these vectors can be compared quantitatively. More specifically, the documents can now be compared quantitatively as if they were points in a high-dimensional space.

Let's start with a simple case of a vocabulary of two words. You can assign different words to <tt>keyword1</tt> and <tt>keyword2</tt> below, and documents are selected at random. We're going to show where documents are in a *two*-dimensional space using counts of these two words. The function <tt>plot_distances()</tt> handles that for us.

In [None]:
def plot_distances(dtm, doc1_idx: int, doc2_idx: int, keyword1: str, keyword2: str, extend=False, cosine=False):
    """
    Plots an arrow illustrating the distance between two
    2D "word vectors" based on term frequencies
    """
    x1 = dtm.loc[doc1_idx, keyword1]
    y1 = dtm.loc[doc1_idx, keyword2]
    x2 = dtm.loc[doc2_idx, keyword1]
    y2 = dtm.loc[doc2_idx, keyword2]
    
    doc1 = min([[x1, y1], [x2, y2]], key=lambda x: np.sqrt(x[0]**2 + x[1]**2))
    doc2 = max([[x1, y1], [x2, y2]], key=lambda x: np.sqrt(x[0]**2 + x[1]**2))
    
    if extend==True:
        doc3 = [doc2[0]*2, doc2[1]*2]
        plt.xlim(0, max(x1, x2, doc3[0])*1.2)
        plt.ylim(0, max(y1, y2, doc3[1])*1.2)
        plt.text(x = doc3[0], y = doc3[1], s = "Doc 3")
        plt.arrow(doc1[0], doc1[1], doc3[0]-doc1[0], doc3[1]-doc1[1], width=0.5, length_includes_head=True)
    else:
        plt.xlim(0, max(x1, x2) * 1.2)
        plt.ylim(0, max(y1, y2) * 1.2)
    plt.arrow(doc1[0], doc1[1], doc2[0]-doc1[0], doc2[1]-doc1[1], width=0.5, length_includes_head=True)
    plt.text(x = doc1[0], y = doc1[1], s = "Doc 1")
    plt.text(x = doc2[0], y = doc2[1], s = "Doc 2")
    plt.xlabel(f'Frequency of "{keyword1}"')
    plt.ylabel(f'Frequency of "{keyword2}"')
    
    print(f"Document 1 features '{keyword1}' {doc1[0]} times and '{keyword2}' {doc1[1]} times.")
    print(f"Document 2 features '{keyword1}' {doc2[0]} times and '{keyword2}' {doc2[1]} times.")
    print(f"Documents 1 and 2 are {euclidean_distances([doc1], [doc2])[0][0]:.1f} units apart in this 2D space.")

    if extend==True:
        print(f"\nDocument 3 features '{keyword1}' {doc3[0]} times and '{keyword2}' {doc3[1]} times.")
        print(f"Documents 1 and 3 are {euclidean_distances([doc1], [doc3])[0][0]:.1f} units apart in this 2D space.")
        if cosine==True:
            print(f"\nDocuments 1 and 2 have a cosine similarity of {cosine_similarity([doc1], [doc2])[0][0]:.2f}.")
            print(f"Documents 1 and 3 have a cosine similarity of {cosine_similarity([doc1], [doc3])[0][0]:.2f}.")
            print(f"Documents 2 and 3 have a cosine similarity of {cosine_similarity([doc2], [doc3])[0][0]:.2f}.")

In [None]:
doc1_idx = np.random.randint(0, dtm.shape[0])
doc2_idx = np.random.randint(0, dtm.shape[0])

keyword1 = "people"
keyword2 = "law"

plot_distances(dtm_counts, doc1_idx, doc2_idx, keyword1, keyword2)

plt.show()

This arrow represents the distance between the two documents in this two-dimensional space, and we can calculate the length of the arrow directly using the Pythagorean theorem. This gives us the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) between the points.
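For two documents $\mathbf{a} = (a_1, a_2)$ and $\mathbf{b} = (b_1, b_2)$, the Pythagorean theorem gives $d(\mathbf{a}, \mathbf{b}) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2}$. Here is a minimal sketch with made-up keyword counts (they are not taken from the corpus):

```python
import math

# Hypothetical counts of ("people", "law") in two documents
toy_doc1 = (3, 4)
toy_doc2 = (6, 8)

dx = toy_doc2[0] - toy_doc1[0]
dy = toy_doc2[1] - toy_doc1[1]
toy_distance = math.sqrt(dx**2 + dy**2)   # Pythagorean theorem

# math.dist() computes the same quantity for any number of dimensions
print(toy_distance, math.dist(toy_doc1, toy_doc2))
```

The same formula extends to more dimensions by adding one squared difference per word.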

If we make the assumption that the meaning of the documents is related to the distribution of words in them, then we can make the additional assumption that the distance between the points is related to how semantically similar the documents are. In other words, we might assume that two documents close together mean something similar, whereas documents farther apart are less likely to mean (or be about) the same thing.

What about document length, though?

Consider the (contrived) example below. Document 3 is identical to Document 2, but twice as long. It's farther from Document 1, but it's in the exact same direction as Document 2 from the origin.

In [None]:
doc1_idx = np.random.randint(0, dtm.shape[0])
doc2_idx = np.random.randint(0, dtm.shape[0])

keyword1 = "people"
keyword2 = "law"

plot_distances(dtm_counts, doc1_idx, doc2_idx, keyword1, keyword2, extend=True)

plt.show()

If we use Euclidean distance as our measure of similarity, this makes it look like Document 3 is much less similar to Document 1 than Document 2 is to Document 1, and like Documents 2 and 3 are not very similar. Euclidean distance is a problem in this case.

We can use [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) instead. Whereas Euclidean distance measures the length of the arrow between the two points, cosine similarity depends only on the *angle* between the two document vectors (the arrows from the origin to each point), not on their lengths. Document 3 is longer than Document 2, but it has the same proportions of the keywords, so it points in the same direction.

If we use cosine, two important points emerge:
1. The similarity of Document 2 and Document 3 will be 1.0
2. The similarity of Document 1 to Document 2 will be the same as the similarity of Document 1 to Document 3

These are both desirable because Document 2 and 3 only differ in length.
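Cosine similarity is defined as $\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}$. The sketch below checks both points with a hand-written helper, <tt>toy_cosine()</tt>, and made-up counts. The helper is for illustration only and assumes neither vector is all zeros.

```python
import math

def toy_cosine(a, b):
    """Cosine similarity of two equal-length, non-zero sequences of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

toy_doc1 = (10, 2)
toy_doc2 = (3, 4)
toy_doc3 = (6, 8)   # Document 2, doubled

sim_23 = toy_cosine(toy_doc2, toy_doc3)
sim_12 = toy_cosine(toy_doc1, toy_doc2)
sim_13 = toy_cosine(toy_doc1, toy_doc3)

print(f"{sim_23:.2f} {sim_12:.2f} {sim_13:.2f}")
```

Doubling every count doubles both the dot product and the vector's length, so the ratio is unchanged.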

In [None]:
doc1_idx = np.random.randint(0, dtm.shape[0])
doc2_idx = np.random.randint(0, dtm.shape[0])

keyword1 = "people"
keyword2 = "law"

plot_distances(dtm_counts, doc1_idx, doc2_idx, keyword1, keyword2, extend=True, cosine=True)

plt.show()

We can also use these measures beyond this two-dimensional case.

## Tf-idf Revisited

> In Section 6.2...we developed the notion of a document vector that captures the relative importance of the terms in a document. The representation of a set of documents as vectors in a common vector space is known as the vector space model and is fundamental to a host of information retrieval (IR) operations including scoring documents on a query, document classification, and document clustering.
<br><br>    [Manning, Raghavan, & Schutze (2008, p. 110)](https://nlp.stanford.edu/IR-book/information-retrieval-book.html)

In Notebook 3, we also encountered tf-idf weighting. The key idea is that having rare things in common is more informative than having common things in common. It is not terribly informative if two documents share really common words. Inverse document frequency (idf) weighting helps us assign less importance to words that appear in many documents. We also experimented a bit with weighting the frequencies of terms within documents by logging them. There are [many ways to calculate term frequency and inverse document frequency](https://en.wikipedia.org/wiki/SMART_Information_Retrieval_System).

`sklearn` doesn't offer the most flexibility in terms of weighting schemes, but it does make tf-idf weighting fast. If we want to calculate a document-term matrix and apply tf-idf weighting, we can use [`sklearn's TfidfVectorizer()`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

In [None]:
tfidf_vectorizer = TfidfVectorizer(min_df=2, sublinear_tf=True) # sublinear_tf logs the term frequencies
tfidf = tfidf_vectorizer.fit_transform(df["preprocessed"])
tfidf.shape

In [None]:
tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf, columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df = pd.DataFrame(tfidf_df.to_numpy(), columns=tfidf_df.columns)
tfidf_df.head()
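To make the weighting less of a black box, here is a hand-rolled sketch of the scheme these settings correspond to, as I understand the scikit-learn documentation: a sublinear term frequency of $1 + \ln(\text{tf})$, a smoothed idf of $\ln\frac{1 + n}{1 + \text{df}} + 1$, and L2 normalization of each document vector. The three toy documents and the <tt>toy_tfidf()</tt> helper are made up, and the sketch ignores `min_df`.

```python
import math
from collections import Counter

toy_docs = ["union union law", "union state", "state state law tax"]
n_docs = len(toy_docs)

toy_counts = [Counter(doc.split()) for doc in toy_docs]
# Number of documents containing each term
toy_doc_freq = Counter(term for c in toy_counts for term in c)

def toy_tfidf(doc_counts):
    """Weight one document: (1 + ln tf) * smoothed idf, then L2-normalize."""
    weights = {}
    for term, tf in doc_counts.items():
        idf = math.log((1 + n_docs) / (1 + toy_doc_freq[term])) + 1
        weights[term] = (1 + math.log(tf)) * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

toy_weights = [toy_tfidf(c) for c in toy_counts]
print(toy_weights[2])
```

In the third document, "tax" and "law" each occur once, but "tax" gets the larger weight because it appears in fewer documents.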

## Finding Similar Documents

In [None]:
def find_most_similar(query: str, num_matches: int=1, cosine: bool=True) -> list:
    """
    Preprocess a query and find `num_matches` using tf-idf and either cosine or Euclidean distance
    """
    query = preprocess_post(query)
    query = tfidf_vectorizer.transform([query])
    
    if cosine==True:
        matches = [(idx, cosine_similarity(query, np.array(post).reshape(1,-1))[0][0]) for idx, post in tfidf_df.iterrows()]
    else:
        matches = [(idx, euclidean_distances(query, np.array(post).reshape(1,-1))[0][0]) for idx, post in tfidf_df.iterrows()]
    
    # Higher cosine = more similar, but lower Euclidean distance = more similar
    matches = sorted(matches, key=lambda x: x[1], reverse=cosine)
    
    return matches[:num_matches]
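One thing to keep in mind when switching between the two measures is that they point in opposite directions: a *higher* cosine similarity means two documents are more alike, whereas a *lower* Euclidean distance does. A small sketch with made-up (index, score) pairs:

```python
# Hypothetical (index, score) pairs for three documents
toy_similarities = [(0, 0.12), (1, 0.87), (2, 0.45)]   # cosine: higher = more alike
toy_distances = [(0, 1.33), (1, 0.51), (2, 1.05)]      # Euclidean: lower = more alike

# Sort descending for a similarity, ascending for a distance
best_by_cosine = sorted(toy_similarities, key=lambda x: x[1], reverse=True)[0]
best_by_distance = sorted(toy_distances, key=lambda x: x[1])[0]

print(best_by_cosine, best_by_distance)
```

In this toy case both measures pick document 1, but only because each list is sorted in the direction that matches its measure.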

Let's make sure everything is working. The most similar document to any given speech should be the speech itself!

Here's a random selection:

In [None]:
query = df.sample(1)["text"].values[0]

In [None]:
print(query[:500])

Now let's get the tf-idf-weighted version of it. Notably, *these weights are based on the document frequencies in the original corpus*. While the resulting vector represents the new query, the relative importance assigned to each word means we have already incorporated information from the other documents.

In [None]:
query_tfidf = tfidf_vectorizer.transform([preprocess_post(query)]) # preprocess first, as in find_most_similar()

In [None]:
query_tfidf

Now let's find the most similar speech.

In [None]:
find_most_similar(query, num_matches=1)

In [None]:
print(df.loc[55].text[:500]) # the speech is selected at random, so you will need to change the index

We can also compare the speeches to new documents, such as texts by other figures who influenced social thought. This quote is an arbitrary example (it comes from Simmel's essay "Fashion" in *Georg Simmel on Individuality and Social Forms*).

In [None]:
query = """The charm of imitation in the first place is to be found in the fact that it makes possible an expedient 
test of power, which, however, requires no great personal or creative application, but is displayed easily and 
smoothly, because its content is a given quantity. We might define it as the child of thought and thoughtlessness. 
It affords the pregnant possibility of continually extending the greatest creations of the human spirit, without the 
aid of the forces which were originally the very condition of their birth. Imitation, furthermore, gives to the 
individual the satisfaction of not standing alone in his actions. Whenever we imitate, we transfer not only the 
demand for creative activity, but also the responsibility for the action from ourselves to another. Thus the 
individual is freed from the worry of choosing and appears simply as a creature of the group, as a vessel of the 
social contents..."""

In [None]:
find_most_similar(query, num_matches=1)

In [None]:
print(f"President: {df.loc[179].president}")
print(f"Year: {df.loc[179].year}")
print(f"Party: {df.loc[179].party}")
print(f"Snippet of text:\n\n{df.loc[179].text[:500]}")

## Document Similarity as an Outcome Variable

The ability to position documents in a shared vector space also means we can treat document similarity as an outcome. We can compare many things (like an entire corpus!) to a single document, calculating the similarity (or distance) for each comparison. We can look at relationships between document similarity and other variables.

### Example 1: Similarity to First Document
Let's pick an obvious starting point: the first document. We'll calculate the similarity between each subsequent document and the first document and then observe trends in that measure over time.

In [None]:
df.loc[0]

In [None]:
query = df.loc[0].text # the text of the first row in the dataframe

In [None]:
print(query[:500])

We already have the vector. Let's assign it to a variable called <tt>query_tfidf</tt> and check the `shape`.

In [None]:
query_tfidf = tfidf_df.loc[0]
query_tfidf.shape

In [None]:
tfidf_df.loc[0]

When we retrieve it using `.loc` and the index (row), it gives us a one-dimensional pandas Series with an entry for every word in the vocabulary. Functions like `cosine_similarity()` expect a two-dimensional array, so we want a single row with a column for every word. We can use `numpy's .reshape()` method. `.reshape(1, -1)` will do the trick.

In [None]:
query_tfidf = np.array(query_tfidf).reshape(1, -1)
query_tfidf.shape

In [None]:
query_tfidf
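A tiny sketch of what `.reshape(1, -1)` does, using a made-up three-element vector instead of the full tf-idf vector:

```python
import numpy as np

toy_vector = np.array([0.0, 0.3, 0.7])   # 1-D: shape (3,)

# 1 means "one row"; -1 means "work out the number of columns for me"
toy_row = toy_vector.reshape(1, -1)       # 2-D: shape (1, 3)

print(toy_vector.shape, toy_row.shape)
```

The values are unchanged; only the shape differs.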

Now let's create a new variable and call it <tt>sim_to_first</tt>. We'll initially save this as a list, and we'll create it using a list comprehension that compares the tf-idf-weighted vector representation of the first speech to the vector for each speech in order.

In [None]:
sim_to_first = [cosine_similarity(query_tfidf, np.array(post).reshape(1,-1))[0][0] for idx, post in tfidf_df.iterrows()]

We've kept everything in the same order, but let's put our minds at ease by first confirming there are the same number of speeches.

In [None]:
len(sim_to_first) == df.shape[0] == dtm.shape[0] == tfidf_df.shape[0] == new_df.shape[0]

And to be extra safe, let's make sure the first result--which should be comparing the first speech to itself--has a similarity of 1.0. Floating-point arithmetic can introduce tiny rounding errors, so the value may not be exactly 1.0, but it is effectively 1.0:

In [None]:
print(sim_to_first[0])

Looks good! Now we can add the new variable to our original dataframe, which has useful metadata, namely the president who gave the speech and the year it was given.

In [None]:
df["sim_to_first"] = sim_to_first

In [None]:
df.head()

<div class="alert alert-info">
When plotting with seaborn, pandas, or pyplot itself, a line plot simply connects the points on either side of a missing year, which looks the same as interpolating the missing value. We can see this if we explicitly interpolate a value for the missing year, 1933.
</div>

In [None]:
[i for i in range(1791,2019) if i not in df.year.values]

In [None]:
tmp = copy.copy(df)
tmp.loc[len(tmp.index)] = ["Roosevelt2", 1933, np.nan, "Democrat", np.nan, np.nan]
tmp = tmp.sort_values("year")
tmp = tmp.reset_index()
tmp.sim_to_first = tmp.sim_to_first.interpolate()

tmp = tmp[tmp.year.isin(range(1930,1940))]
display(tmp.head())

sns.lineplot(x="year", y="sim_to_first", data=tmp[tmp.index != 0])
plt.title("Similarity to 1791 State of the Union")
plt.xlabel("Year")
plt.ylabel("Cosine")
plt.show()

In [None]:
sns.lineplot(x="year", y="sim_to_first", data=tmp[tmp.index != 0])
plt.title("Similarity to 1791 State of the Union")
plt.xlabel("Year")
plt.ylabel("Cosine")
plt.ylim(0.0, 1.0)
plt.show()

In [None]:
sns.lineplot(x="year", y="sim_to_first", data=df[df.index != 0])
plt.title("Similarity to 1791 State of the Union")
plt.xlabel("Year")
plt.ylabel("Cosine")
plt.show()

In [None]:
sns.lineplot(x="year", y="sim_to_first", data=df[df.index != 0])
plt.title("Similarity to 1791 State of the Union")
plt.xlabel("Year")
plt.ylabel("Cosine")
plt.ylim(0, 1.0)
plt.show()

### Example 2: Similarity to a President
Now let's try a less obvious starting point: the "average" speech for a particular president. I've put "average" in scare quotes because the idea of just averaging these representations may seem a little sketchy, but we are, in fact, going to average them. You can think of the average as the centroid (center) of the cluster of points belonging to a particular president's speeches in this vector space.

In [None]:
df[df.president=="Polk"]

James K. Polk's speeches have the indices 54, 55, 56, and 57.

We could have also used the following line:

In [None]:
df[df.president=="Polk"].index

In [None]:
polk_tfidf = tfidf_df[tfidf_df.index.isin([54, 55, 56, 57])]

In [None]:
polk_tfidf

Now we'll average them, and it's *really* important to keep checking the shape.

In [None]:
polk_average = polk_tfidf.mean()
polk_average.shape

In [None]:
polk_average = np.array(polk_average).reshape(1, -1)
polk_average.shape
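The reshape matters because averaging over rows yields a one-dimensional result with one value per word, whereas scikit-learn's `cosine_similarity` expects two-dimensional inputs with one row per document. The sketch below shows the same shape change on a small made-up array (the numbers are placeholders, not tf-idf values).

```python
import numpy as np

# Hypothetical stand-in for one president's rows: 4 speeches, 6 words.
rows = np.arange(24, dtype=float).reshape(4, 6)

avg = rows.mean(axis=0)       # one value per word
print(avg.shape)              # (6,): one-dimensional

avg_2d = avg.reshape(1, -1)   # -1 lets NumPy infer the number of columns
print(avg_2d.shape)           # (1, 6): one row, six columns
```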

Finally, we'll create a variable <tt>sim_to_polk</tt> in the same manner as the previous example.

In [None]:
df["sim_to_polk"] = [cosine_similarity(polk_average, np.array(post).reshape(1,-1))[0][0] for idx, post in tfidf_df.iterrows()]
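The list comprehension above compares one speech at a time. The same quantity can also be computed for all rows in a single matrix operation. The sketch below does this in plain NumPy on a toy matrix; `cosine_to_all` is a name invented for this illustration, and it assumes no row is all zeros. (scikit-learn's `cosine_similarity` likewise accepts a whole matrix as its second argument and returns one column per row of that matrix.)

```python
import numpy as np

def cosine_to_all(target, matrix):
    """Cosine similarity between one vector and every row of an (N, V) matrix."""
    target = np.asarray(target, dtype=float).reshape(1, -1)
    matrix = np.asarray(matrix, dtype=float)
    dots = matrix @ target.T  # (N, 1) dot products with the target
    norms = np.linalg.norm(matrix, axis=1, keepdims=True) * np.linalg.norm(target)
    return (dots / norms).ravel()  # (N,) similarities, one per row

# Hypothetical 3-document, 2-word matrix.
toy = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
sims = cosine_to_all(toy[0], toy)
print(sims.round(3))
```

With the real data, the call would presumably take the form `cosine_to_all(polk_average, tfidf_df.values)`, and the result could be assigned to a dataframe column just like the list above.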

In [None]:
sns.lineplot(x="year", y="sim_to_polk", data=df[df.president != "Polk"])
plt.title("Similarity to James K. Polk's State of the Union Addresses")
plt.xlabel("Year\n(Polk Administration in Orange)")
plt.ylabel("Cosine")
plt.axvspan(1845, 1848, alpha = 0.7, color = "orange")
plt.show()

In [None]:
sns.lineplot(x="year", y="sim_to_polk", data=df[df.president != "Polk"])
plt.title("Similarity to James K. Polk's State of the Union Addresses")
plt.xlabel("Year\n(Polk Administration in Orange)")
plt.ylabel("Cosine")
plt.axvspan(1845, 1848, alpha = 0.7, color = "orange")
plt.ylim(0.0, 1.0)
plt.show()

In [None]:
df.sort_values("sim_to_polk", ascending=False) # most similar

In [None]:
df[df.president != "Polk"].sort_values("sim_to_polk", ascending=False).head(10) # most similar, excluding Polk himself

### Example 3: Similarity to the Whigs

We also know the party of the president who gave the speech (although this variable may not be meaningful for the first several presidents). We can calculate the centroid for a party and then calculate the similarity of every speech to that.
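As an aside, if you wanted centroids for every party at once rather than one party at a time, a grouped mean is one option. The sketch below uses two tiny made-up dataframes (`meta` and `vectors` are hypothetical stand-ins for `df` and `tfidf_df`) and relies on the two sharing the same row index, as they do in this notebook.

```python
import pandas as pd

# Hypothetical metadata and tf-idf values sharing the same row index.
meta = pd.DataFrame({"party": ["Whig", "Whig", "Democrat"]})
vectors = pd.DataFrame({"bank": [0.2, 0.4, 0.0], "tariff": [0.6, 0.2, 0.5]})

# Group the vectors by party and average within each group:
# the result has one centroid row per party.
centroids = vectors.groupby(meta["party"]).mean()
print(centroids)
```

Each row of `centroids` can then be reshaped and compared to the individual speeches exactly as in the examples here.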

In [None]:
df[df.party=="Whig"].index

In [None]:
whig_indices = df[df.party=="Whig"].index

In [None]:
whig_tfidf = tfidf_df[tfidf_df.index.isin(whig_indices)]
whig_tfidf

In [None]:
whig_average = whig_tfidf.mean()
whig_average.shape

In [None]:
whig_average = np.array(whig_average).reshape(1, -1)
whig_average.shape

In [None]:
df["sim_to_whig"] = [cosine_similarity(whig_average, np.array(post).reshape(1,-1))[0][0] for idx, post in tfidf_df.iterrows()]

In [None]:
sns.lineplot(x="year", y="sim_to_whig", data=df)
plt.title("Similarity to Average Whig Address")
plt.xlabel("Year")
plt.ylabel("Cosine")
plt.show()

## Exercises

<div class="alert alert-warning">
    <b>Exercise 1</b><br><br>
    For this exercise, pick a year when you think the State of the Union Address may have referred to consequential events. (Hint: You may want to pick the year after, depending on when the events happened and the month the State of the Union was given that year.)<br><br>
    1.1 What sociologically significant events, institutions, or processes might make this year distinctive?
</div>

_Your answer here_

<div class="alert alert-warning">
    1.2 Identify the index of the speech given in that year.
</div>

In [None]:
df[df.year == PICK_A_YEAR].index # replace "PICK_A_YEAR" with a year

<div class="alert alert-warning">
    1.3 Get the tf-idf-weighted vector from the dataframe <tt>tfidf_df</tt>
</div>

In [None]:
chosen_year_vec = tfidf_df.loc[YOUR_INDEX] # replace "YOUR_INDEX" with the index from the previous step
chosen_year_vec = np.array(chosen_year_vec).reshape(1,-1)
chosen_year_vec.shape

<div class="alert alert-warning">
    1.4 Run the cell below to calculate the cosine similarity of each speech to the speech from your chosen year
</div>

In [None]:
df["sim_to_chosen_year"] = [cosine_similarity(chosen_year_vec, np.array(post).reshape(1,-1))[0][0] for idx, post in tfidf_df.iterrows()]

<div class="alert alert-warning">
    1.5 What trend do you expect in how similar other speeches are? For example, will earlier or later speeches be more or less similar? Are there other years when the State of the Union may have been much more or much less similar?
</div>

_Your answer here_

<div class="alert alert-warning">
    1.6 Plot the trend as in Example 1 above
</div>

In [None]:
# YOUR CODE HERE

<div class="alert alert-warning">
    1.7 What do you notice about the trend? Does the plot reflect your expectations? What might explain any differences?
</div>

_Your answer here_

<div class="alert alert-warning">
    <b>Exercise 2</b><br><br>
    Now, as in Example 2, you will examine trends in the similarity of State of the Union addresses to the average for a particular president.<br><br>
    2.1 Pick a president other than James K. Polk. What sociologically significant events happened during that president's administration?
</div>

_Your answer here_

<div class="alert alert-warning">
    2.2 Get the index or indices of that president's speeches. Save them to the variable <tt>pres_indices</tt>.
</div>

In [None]:
pres_indices = df[df.president=="PRESIDENT"].index # replace "PRESIDENT" with the name as it appears in the data
df.loc[pres_indices]

<div class="alert alert-warning">
    2.3 Get the subset of tf-idf-weighted vectors corresponding to those indices and save the result as the dataframe <tt>pres_tfidf</tt>.
</div>

In [None]:
pres_tfidf = tfidf_df[tfidf_df.index.isin(pres_indices)]
pres_tfidf

<div class="alert alert-warning">
    2.4 Now calculate the centroid (average) of the vectors and display the shape using the <tt>.shape</tt> attribute.
</div>

In [None]:
pres_average = pres_tfidf.mean()
pres_average.shape

<div class="alert alert-warning">
    2.5 Reshape the vector (as in the examples above) using the <tt>.reshape</tt> method. The first number (the number of rows) should be 1, and the second number should be the number of words in the vocabulary.
</div>

In [None]:
pres_average = np.array(pres_average).reshape(1, -1)
pres_average.shape

<div class="alert alert-warning">
    2.6 Now calculate the similarity of each speech to that average, just as we have done in the examples above. Store this in a variable with an informative name like <tt>sim_to_pres</tt>.
</div>

In [None]:
# YOUR CODE HERE

<div class="alert alert-warning">
    2.7 What trends do you expect in the similarity of other documents to this average?
</div>

_Your answer here_

<div class="alert alert-warning">
    2.8 Plot the trend.
</div>

In [None]:
# YOUR CODE HERE

<div class="alert alert-warning">
    2.9 Does the plot match your expectations? What might explain any differences?
</div>

_Your answer here_

<div class="alert alert-warning">
    <b>Exercise 3</b><br><br>
    Now pick another group of speeches that were given in years during which sociologically significant events occurred. For example, you could pick the years during which a particular event occurred (e.g., the Civil War) or when a political party (other than the Whigs) held the office of the president. You will calculate the centroid of the vectors representing these speeches and plot the trend in similarity of all of the speeches to it, just as in the previous exercise. <br><br>
    3.1 What group of years are you choosing, and what makes that period of time interesting?
</div>

_Your answer here_

<div class="alert alert-warning">
    3.2 What do you expect the plotted trends in similarity to look like? Why?
</div>

_Your answer here_

<div class="alert alert-warning">
    3.3 Following the steps in the previous exercise, calculate the centroid of the appropriate vectors, calculate the similarity of each speech to the centroid, and save the results in a variable with an informative name like <tt>sim_to_civil_war</tt> (but with a name that matches the period or group you chose).
</div>

In [None]:
# YOUR CODE HERE

<div class="alert alert-warning">
    3.4 Plot the trend.
</div>

In [None]:
# YOUR CODE HERE

<div class="alert alert-warning">
    3.5 Does the plot match your expectations? What might explain any differences? If you've offered potential explanations for any differences you observe, how could they be tested?
</div>

_Your answer here_