#### Sociology 128D: Mining Culture Through Text Data: Introduction to Social Data Science

# Notebook 3: Stylometry

In this notebook, we are going to take our first step toward topics such as [distributional semantics](https://en.wikipedia.org/wiki/Distributional_semantics), [vector space models](https://en.wikipedia.org/wiki/Vector_space_model), and [vector semantics](https://web.stanford.edu/~jurafsky/slp3/6.pdf). This involves quantifying information about linguistic units (whether individual words or entire documents) within the context of information about the entire corpus in order to [learn representations](https://en.wikipedia.org/wiki/Feature_learning) of the linguistic units as vectors of numbers. The vectors of numbers may have relatively few dimensions (e.g., 50) or may have many thousands. The vectors representing words, documents, parts of documents, or even latent dimensions can then be mathematically compared to measure the similarity of different things. Words used in similar ways (i.e., used with similar context words) will have similar vectors in a model of word embeddings. Documents that use similar language will also have similar vectors in a model of documents. We're going to start by exploring similarity among documents, which we'll return to in Notebook 5. We will turn to [word embeddings](https://en.wikipedia.org/wiki/Word_embedding) in Notebooks 9 and 10.

More broadly, taken together, these ideas are one of the main approaches we'll use in this class. They have exerted enormous influence on computational social science, including within cultural sociology. Specifically, in this notebook we will build on Notebook 2 by using word and document frequencies to visualize how similar or dissimilar documents are.

Please download the [State of the Union Corpus (1790-2018)](https://www.kaggle.com/rtatman/state-of-the-union-corpus-1989-2017), which was posted to Kaggle by Rachael Tatman and Liling Tan. 

In [None]:
import copy
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns

from collections import Counter
from scipy.stats import pearsonr, spearmanr
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

sns.set_theme(style="darkgrid")

First things first: There are a lot of speeches in this *corpus* (collection of documents), and we have to put them somewhere. For Notebooks 1 and 2, I encouraged you to keep datasets in your working directory (where Jupyter looks automatically) to keep things simple. If we do that this time, though, we'll have hundreds of individual text files in the same directory as your notebooks, and that would be a mess to look through later. I'm simply creating a variable called `dir_` (short for "directory") that stores the location of the speeches as a string variable. (`dir_` has an underscore on the end because `dir` without an underscore is the name of a built-in Python function, and reusing the name would shadow that function.)

Modify `dir_` to point to where you have saved the State of the Union addresses. I'm using a relative path: a path from my working directory to the data. I have a folder for the notebooks, and within that folder there is a folder called `data`. Within *that* folder, I have a folder called `sotu` containing the speeches. Since my working directory is the directory in which I have both the notebooks *and* the `data` folder (which contains the `sotu` folder), I only need to tell Jupyter to look in `data/sotu/` for the files.

In [None]:
dir_ = "data/sotu/"
os.listdir(dir_)

`dir_` points to the *directory* (or folder) in which the speeches are stored, but we also want to be able to tell Python the names of the files. [`os.path.join` (link)](https://www.geeksforgeeks.org/python-os-path-join-method/) is a clean way of merging file paths with file names. The line of code below is a list comprehension that iterates through every file in `dir_`, checks the file extension, keeps those that end with the proper extension, and joins the file path to the file name. The result is a list we can use to tell Python exactly where each file is by name.

In [None]:
sotu_paths = [os.path.join(dir_, f) for f in os.listdir(dir_) if f.endswith(".txt")]

In [None]:
sotu_paths[0]
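
To see the filtering and joining in isolation, here is a minimal sketch that builds a throwaway directory and then applies the same pattern. The file names and the variables `example_paths` and `example_names` are made up for illustration; only `os.listdir`, `os.path.join`, and `str.endswith` come from the cell above.

```python
import os
import tempfile

# Build a temporary directory containing a few hypothetical files.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["Adams_1797.txt", "Adams_1798.txt", "notes.md"]:
        with open(os.path.join(tmp_dir, name), "w") as handle:
            handle.write("placeholder text")

    # Keep only the .txt files and attach the directory to each file name.
    example_paths = sorted(
        os.path.join(tmp_dir, f) for f in os.listdir(tmp_dir) if f.endswith(".txt")
    )
    # Strip the directory again so the result is easy to read.
    example_names = [os.path.basename(p) for p in example_paths]

print(example_names)
```

The `sorted()` call is there because `os.listdir()` does not promise any particular order.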

In [None]:
with open(sotu_paths[0], "r") as infile:  # the file is closed automatically afterwards
    print(infile.read())

The function defined below takes a single parameter (`f`) that should be a string variable containing the file path and name of the text file for a given speech. This function will `open()` the file, use `os.path.split` to take the *last* element (just the file name, excluding the path), remove the file extension, and then split the file name on underscores. Because the files are named things like `Adams_1797.txt`, we can split on underscores after these other steps to get the name of the president who gave the speech (e.g., Adams) and the year in which it was given (e.g., 1797). The function returns the president, the year, and the text of the speech.

In [None]:
def return_sotu_name_year_text(f: str):
    """Returns the name, year, and text of a SOTU."""
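    with open(f, "r") as infile:  # close the file once it has been read
        doc = infile.read().strip()
    f = os.path.split(f)[-1]  # keep only the file name, dropping the directory
    f = os.path.splitext(f)[0]  # drop the extension; rstrip(".txt") would strip characters, not a suffix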
    pres, year = f.split("_")
    return pres, year, doc

In [None]:
return_sotu_name_year_text(sotu_paths[0])
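
One detail worth knowing when removing file extensions: `str.rstrip()` removes any trailing characters that appear in its argument, not a literal suffix, so it can eat into the name itself. The sketch below parses a file name with `os.path.splitext()` instead. `parse_sotu_filename` is a hypothetical helper written for this illustration, and `Taft.txt` is a made-up file name.

```python
import os


def parse_sotu_filename(path: str):
    """Hypothetical helper: split 'some/dir/Name_1797.txt' into ('Name', 1797)."""
    file_name = os.path.split(path)[-1]              # drop the directory part
    stem, _extension = os.path.splitext(file_name)   # drop the extension
    pres, year = stem.split("_")
    return pres, int(year)


print(parse_sotu_filename("data/sotu/Adams_1797.txt"))

# rstrip treats ".txt" as the set of characters {".", "t", "x"}:
print("Taft.txt".rstrip(".txt"))        # the final "t" of the name is lost too
print(os.path.splitext("Taft.txt")[0])  # the name survives intact
```

File names that end in a year happen to be safe from this problem because digits are not in the stripped set.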

Next, we are going to store all of the speeches (along with metadata, such as the president and year) as a dataframe. We can do that in a number of ways. The cell below uses a [for loop](https://cs.stanford.edu/people/nick/py/python-for.html) to iterate through all of the file paths in our list of file paths (`sotu_paths`), calls the function we just defined to get the president, year, and text, and then appends each of these elements to a separate list. The code then uses `zip` to combine these lists and converts the result to a dataframe.

In [None]:
presidents = []
years = []
docs = []

for path in sotu_paths:
    pres, year, doc = return_sotu_name_year_text(path)
    presidents.append(pres)
    years.append(year)
    docs.append(doc)
    
data = list(zip(presidents, years, docs))

pd.DataFrame(data, columns = ["president", "year", "text"])

The cell below accomplishes the same task, but it does so a bit differently. It first creates a dataframe called `df` with only the file paths in our list of file paths. It then uses the `apply()` method to apply the function we defined above to each file path, creating new columns in the process. It then drops the `file_path` column.

In [None]:
df = pd.DataFrame(sotu_paths, columns = ["file_path"])
df[["president", "year", "text"]] = df.file_path.apply(lambda x: pd.Series(return_sotu_name_year_text(x)))
df.drop(columns = ["file_path"], inplace = True)

df

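The cell below will sort the speeches by the year in which they were given and then reset the index (i.e., the numbers on the lefthand side). Until now, the index has reflected the order in which `os.listdir()` returned the files (alphabetical by president's name in my case, although `os.listdir()` does not guarantee any particular order).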

In [None]:
df.sort_values(by="year", inplace=True)
df.reset_index(inplace=True, drop=True)
df.head()

In [None]:
?df.reset_index

The row for 1790 has no text, so we will use the `drop()` method to remove that row by its index, which is 0 because it is the first row and Python counts from 0.

In [None]:
df.drop(index=0, inplace = True)
df.reset_index(inplace=True, drop=True)
df.head()

There are many presidents who have the same last names. This is a bit of a problem if we want to compare the speeches from different presidents. We need to disambiguate the names.

In [None]:
df[df.president=="Adams"]

The cell below demonstrates that we can use a question mark to look up information about all sorts of things, in this case the `np.where()` function. We can use question marks or docstrings to look at documentation without having to leave the notebook, although searching the internet for ways to do things is natural.

We are going to use `np.where()` to identify locations where presidents have a certain last name (e.g., Adams) in a certain time period (e.g., after 1800). Since time has the useful property of appearing to be linear and there are only a handful of US presidents who share names with earlier US presidents, we can manually disambiguate the names of the presidents who share names.

In [None]:
?np.where

In [None]:
print(np.where.__doc__)

The cell below looks complicated, but if you clean data in this way it'll become familiar. There are three parts to our call to `np.where()`: the *conditions*, the name (a string) we want to replace the value with if the conditions are `True`, and the value we want to use if the conditions are `False`. The cell below identifies rows with "Adams" in the president column and a year greater than 1800 and relabels them "Adams2". The second US president, John Adams, gave State of the Union addresses in 1797, 1798, 1799, and 1800. His son, John Quincy Adams, was the sixth US president and gave all his State of the Union addresses *after* 1800.

In [None]:
df.year = df.year.apply(int)
df.president = np.where(df.president.eq("Adams") & df["year"].gt(1800), "Adams2", df.president)

In [None]:
df[df.president=="Adams"]

In [None]:
df[df.president=="Adams2"]
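
Here is the same `np.where()` pattern on two small arrays, so you can see the three parts (condition, value if true, value if false) without the dataframe. The arrays `names` and `years` are made-up toy data.

```python
import numpy as np

names = np.array(["Adams", "Adams", "Jefferson", "Adams"])
years = np.array([1797, 1800, 1801, 1825])

# Where both conditions hold, use "Adams2"; everywhere else, keep the original name.
relabeled = np.where((names == "Adams") & (years > 1800), "Adams2", names)

print(relabeled)
```

Only the last element satisfies both conditions, so only that one changes.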

We also need to distinguish George W. Bush from George H.W. Bush, Andrew Johnson (served 1865-1869) from Lyndon Johnson (served 1963-1969), and Theodore Roosevelt (served 1901-1909) from Franklin Roosevelt (served 1933-1945).

We don't need to distinguish William Henry Harrison from Benjamin Harrison because the former never gave a State of the Union address.

In [None]:
df.president = np.where(df.president.eq("Bush") & df["year"].gt(2000), "Bush2", df.president)
df.president = np.where(df.president.eq("Johnson") & df["year"].gt(1900), "Johnson2", df.president)
df.president = np.where(df.president.eq("Roosevelt") & df["year"].gt(1930), "Roosevelt2", df.president)

We can call the `unique()` method on a column to see the distinct values. There are only 42 presidents here because the dataset stops at 2018 (excluding Biden), two presidents did not give State of the Union addresses (William Henry Harrison and James A. Garfield), and Grover Cleveland held non-consecutive terms as the 22nd and 24th president.

In [None]:
df.president.unique()

In [None]:
len(df.president.unique())

Now we will do some pretty crude *preprocessing* of the text itself. Specifically, we are going to lowercase all of the text, remove non-alphabetical characters (like punctuation and numbers), and then *tokenize* by splitting on whitespace.

In [None]:
df.text = df.text.apply(str.lower)

In [None]:
df.head()

The built-in `ord()` function returns the Unicode code point for a given character. We can use it to identify the range of the letters in the alphabet we are using (i.e., the number for "a" and the number for "z") as well as the number for spaces, which we also want to keep prior to splitting documents on them.

In [None]:
?ord

In [None]:
print(f'a = {ord("a")}, z = {ord("z")}, and space = {ord(" ")}')

The cell below demonstrates how this works by iterating through characters in the string variable `s`, checking whether each is a space or is in the range we identified, and either adding the character (if we want to keep it) or a space (if we don't) to a new string variable, `s2`.

In [None]:
s = "This is a test string, and it has some punctuation--not a lot, but some--that we're going to remove."

s2 = ""
for char in s.lower():
    if (char == " ") or (ord(char) in range(97,123)):
        s2 += char
    else:
        s2 += " "
        
print(s2)
print() # this gives us a blank line

print(s2.split())

Below, we define a function called `keep_alphabetical()` that preserves only spaces and lowercase letters a through z.  Later we will do this kind of thing using [regular expressions](https://en.wikipedia.org/wiki/Regular_expression), but we'll keep it simple for now. We then use the `apply()` method to apply this function to the text column.

In [None]:
def keep_alphabetical(text: str) -> str:
    """Keep only lowercase a-z and spaces; replace every other character with a space"""
    return "".join([char if (ord(char) in range(97,123) or char == " ") else " " for char in text])


df.text = df.text.apply(keep_alphabetical)

In [None]:
df
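
As a preview of the regular-expression approach mentioned above, the sketch below shows that a single `re.sub()` call can do the same job as the character-by-character check. Both function names and the sample string are made up for this comparison; the pattern `[^a-z ]` means "any character that is not a lowercase letter a through z or a space."

```python
import re


def keep_alphabetical_loop(text: str) -> str:
    """Character-by-character version: anything outside a-z or space becomes a space."""
    return "".join(char if ("a" <= char <= "z" or char == " ") else " " for char in text)


def keep_alphabetical_regex(text: str) -> str:
    """Regular-expression version of the same rule."""
    return re.sub(r"[^a-z ]", " ", text)


sample = "It's 1797--a test, isn't it?".lower()

print(keep_alphabetical_regex(sample).split())
print(keep_alphabetical_loop(sample) == keep_alphabetical_regex(sample))
```

Note that apostrophes are replaced too, so contractions like "isn't" end up as two tokens.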

## Word Frequencies over the Entire Corpus

Let's take a look at how frequent different words are in the corpus as a whole. First, we're going to select the column in our dataframe that has the text of the speeches. Right now, each row just has a string for that column containing the text of a speech. We're going to connect them all together using whitespace and the `join()` method.

This will create one mega document that combines all of the documents in the corpus; we'll then split on whitespace and use `Counter()` like we did last week. Since `Counter()` saves the result as a special counter object by default, we'll cast it as a `dict` and save the result in a variable called `word_frequencies`. This will be a word-to-frequency mapping where the keys are words and the values are the total frequency of each word in the whole corpus.

After that, we'll create another variable to store a list of `tuples` containing a word and its frequency. We'll sort that in descending order based on the frequency using a common lambda function.

In [None]:
all_text = " ".join(df.text)

word_frequencies = dict(Counter(all_text.split()))

types_and_counts = sorted(list(word_frequencies.items()), reverse = True, key = lambda x: x[1])
print(types_and_counts[:20])

In [None]:
print(f"The corpus has {sum(word_frequencies.values()):,} total words and a vocabulary of size {len(word_frequencies.keys()):,}.") 

We saw last time that the `zip()` function combines iterables by putting the first item in each one together, putting the second item in each one together, and so on. If we add an asterisk (`*`) before the name of the variable we're calling `zip()` on, it "unzips" the contents instead. `types_and_counts` was defined as a list of tuples with two things in each (a word and its frequency), so we can unzip that into two lists: the words and the frequencies. We can do that by putting a name for each new variable on the lefthand side separated by a comma.

Technically, the results are of the type `tuple` by default, as you can see when we examine the first 10 elements of each: they are enclosed within parentheses, not brackets as we would expect for a list.

In [None]:
types_, token_counts = zip(*types_and_counts)

In [None]:
types_[:10]

In [None]:
token_counts[:10]

In [None]:
print(type(types_))
print(type(token_counts))
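
A tiny example makes the zip-and-unzip round trip easier to see. The list `pairs` is made-up toy data.

```python
pairs = [("the", 3), ("of", 2), ("union", 1)]

# "Unzip": the asterisk passes each tuple to zip() as a separate argument.
words, counts = zip(*pairs)
print(words)
print(counts)

# Zipping the two results again recovers the original pairs.
rezipped = list(zip(words, counts))
print(rezipped == pairs)
```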

Now let's take a look at the *distribution* of word frequencies. We'll zoom in on the first 100 words first. The result is more or less in line with [Zipf's law](https://en.wikipedia.org/wiki/Zipf's_law), which posits that a word's frequency is roughly inversely proportional to its rank by frequency: the second most common word occurs about half as often as the most common one, the third about a third as often, and so on. The most common words are *incredibly* frequent, while words ranked lower by frequency are much, much less frequent.

In [None]:
plt.figure(figsize=(14, 8))
plt.bar(x = range(100), height = token_counts[:100])
plt.title("Frequencies of Top 100 Terms in Corpus")
plt.show()

In [None]:
plt.figure(figsize=(14, 8))
plt.bar(x = types_[:20], height = token_counts[:20])
plt.xticks(rotation = 90)
plt.title("Frequencies of Top 20 Terms in Corpus")
plt.show()
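
To make the Zipfian pattern concrete, the sketch below uses idealized counts rather than our corpus. It assumes frequency is exactly `C / rank`, with an arbitrary made-up constant `C`. Under that assumption, rank times frequency is constant and the relationship between log rank and log frequency is a straight line with a slope of -1. Real corpora only approximate this.

```python
import math

# Idealized Zipfian counts (an assumption for illustration): frequency = C / rank.
C = 120_000
ranks = list(range(1, 11))
ideal_counts = [C / r for r in ranks]

# Under this idealization, rank * frequency is (numerically almost exactly) constant...
products = [r * f for r, f in zip(ranks, ideal_counts)]

# ...and log(frequency) falls by one unit per unit of log(rank).
slope = (math.log(ideal_counts[-1]) - math.log(ideal_counts[0])) / (
    math.log(ranks[-1]) - math.log(ranks[0])
)

print([round(p) for p in products])
print(round(slope, 6))
```

This is why word-frequency distributions are often plotted on log-log axes, where a Zipf-like distribution looks roughly linear.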

## Document Frequency

Next, we'll look at document frequency: the number of documents in the corpus in which a word occurs. We'll do this using a simple, two-stage approach: we'll create a version of each document that has all duplicate words removed, and then we'll combine everything and count how many times a word occurs. The function below accepts one parameter (a string variable, intended to be a document) and splits it on whitespace, converting it to a `list`. A `list` can contain duplicates. Next, it casts the resulting list as a `set`, which removes all duplicate elements. It then casts it as a `list` again so that we can use the `.join()` method to convert it back to a string. The result of this series of transformations is that we have removed all repetitions of any word. Each word occurring in a document is now represented exactly once in the new version of the document.

In [None]:
def set_of_types(document: str) -> str:
    """Returns a string with only the unique types (words) in the supplied document"""
    return " ".join(list(set(document.split())))

In [None]:
s = "this is a string that repeats some words, like string and words and some"

print(Counter(s.split())) # three types occur twice

In [None]:
s2 = set_of_types(s)

print(s2)
print()

print(Counter(s2.split())) # each type occurs only once

Now, we'll use the `apply()` method to add a column to our dataframe. Specifically, we will apply the function we just defined—`set_of_types()`—to the text column, creating a version of the text with all duplicate words removed.

In [None]:
df["types"] = df.text.apply(set_of_types)
df.head()

Now we will use `Counter()` after selecting this new column and splitting on whitespace. We'll cast this as a `dict` and store it as the variable `document_frequencies`.

In [None]:
document_frequencies = dict(Counter(" ".join(df.types).split()))
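
The difference between overall frequency and document frequency is easiest to see on a tiny corpus. The sketch below uses three made-up documents and counts document frequency by updating a `Counter` with the *set* of words in each document, which is a slightly more direct route to the same numbers.

```python
from collections import Counter

toy_docs = ["the cat sat on the mat", "the dog sat", "a cat and a dog"]

# Overall frequency: count every token in the combined corpus.
term_freq = Counter(" ".join(toy_docs).split())

# Document frequency: each document contributes each of its words at most once.
doc_freq = Counter()
for doc in toy_docs:
    doc_freq.update(set(doc.split()))

print(term_freq["the"], doc_freq["the"])
print(term_freq["a"], doc_freq["a"])
```

"the" occurs three times but in only two documents, and "a" occurs twice but in only one.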

As we did for overall word frequencies, we'll create a list of tuples (namely, word—frequency pairs) sorted by document frequency and then 'unzip' that into two iterables: the words and the document frequencies. (Again, document frequency refers to the number of documents the word appears in at least once.)

In [None]:
types_and_doc_freqs = sorted(list(document_frequencies.items()), reverse = True, key = lambda x: x[1])
types_, doc_freqs = zip(*types_and_doc_freqs)

If we only looked at the first 100 words again, we wouldn't see any variation because they're all in pretty much all of the documents. We'll zoom out and look at the 500 most frequent words by document frequency. We again see a negative association between document frequency and rank based on that frequency.

In [None]:
plt.figure(figsize=(14, 8))
plt.plot(range(500), doc_freqs[:500])
plt.title("Document Frequencies of Top 500 Terms in Corpus")
plt.show()

In [None]:
plt.figure(figsize=(14, 8))
plt.bar(x = types_[:20], height = doc_freqs[:20])
plt.xticks(rotation = 90)
plt.title("Document Frequencies of Top 20 Terms in Corpus")
plt.show()

In [None]:
df.drop(columns=["types"], inplace=True)

The cell below stores the unique words as the variable `vocabulary`. It then defines a variable `x` using a list comprehension; this contains the word frequencies (or term frequencies) in the overall corpus. The list comprehension iterates through `vocabulary` and looks up the frequency of each word in the `dict` we stored as `word_frequencies`.

We then define the variable `y` similarly, but using a list comprehension with the `dict` we saved as `document_frequencies` to create a list of document frequencies.

Finally, we can see that the two measures of frequency are positively correlated using two different correlation coefficients.

In [None]:
vocabulary, _ = zip(*types_and_counts)
vocabulary = list(vocabulary)

x = [word_frequencies[word] for word in vocabulary]
y = [document_frequencies[word] for word in vocabulary]

print("Correlation between each word's frequency in the overall corpus and its document frequency:")
print(f"Pearson's correlation coefficient: {pearsonr(x, y)[0]:.2f}")
print(f"Spearman's rank-order correlation: {spearmanr(x, y)[0]:.2f}")
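
The two coefficients answer slightly different questions: Pearson's measures how *linear* the relationship is, while Spearman's is Pearson's correlation computed on ranks, so it only asks whether the ordering agrees. The sketch below uses made-up frequencies and two hand-written helpers (`pearson` and `ranks` are illustrative, not the `scipy` implementations, and `ranks` assumes there are no ties).

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient, written out by hand."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank from 1 (smallest) upward; assumes no ties, for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


# Made-up numbers: both fall together, but not in a straight line.
toy_corpus_freq = [10000, 1000, 100, 10, 1]
toy_doc_freq = [200, 150, 60, 8, 1]

r = pearson(toy_corpus_freq, toy_doc_freq)
rho = pearson(ranks(toy_corpus_freq), ranks(toy_doc_freq))

print(f"Pearson: {r:.2f}, Spearman: {rho:.2f}")
```

Because the two toy lists are ordered identically, the rank-based coefficient is 1 even though the linear one is lower.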

## Pruning the Vocabulary

In [None]:
print(len(vocabulary))

If we are interested in analyzing meaning from a corpus, in practice we will often remove words that appear only once or in only one document (which aren't the same thing!). Strictly speaking, a [hapax](https://en.wikipedia.org/wiki/Hapax_legomenon) is a word that occurs only once; here we use the term loosely for words that appear in only one document. We can't say that two documents have a word in common if only one document in the entire corpus has the word!

The cell below uses a list comprehension to create a list of words that appear in only one document, which we save as the variable `hapaxes`.

In [None]:
hapaxes = [word for word in vocabulary if document_frequencies[word] == 1]
print(len(hapaxes))

We may often exclude words that appear in *every* document for similar reasons.

Let's remove hapaxes. The cell below overwrites `word_frequencies` and `document_frequencies` using a `dict comprehension` to check whether the *key* for each key—value pair in the dictionary is in our list of hapaxes. We keep every entry in these dictionaries if they occur in at least two documents.

In [None]:
hapax_set = set(hapaxes)  # membership checks on a set are much faster than on a list
word_frequencies = {key:value for key, value in word_frequencies.items() if key not in hapax_set}
document_frequencies = {key:value for key, value in document_frequencies.items() if key not in hapax_set}

assert word_frequencies.keys() == document_frequencies.keys()

types_and_counts = sorted(list(word_frequencies.items()), reverse = True, key = lambda x: x[1])
vocabulary, _ = zip(*types_and_counts)

In [None]:
print(len(vocabulary))
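
The sketch below repeats the pruning step on two small made-up dictionaries. It also illustrates why "appears only once" and "appears in only one document" differ: the toy word "zeal" occurs twice overall but in a single document, so it is pruned along with "quay".

```python
# Made-up toy frequencies for illustration.
toy_word_freq = {"the": 5, "union": 3, "zeal": 2, "quay": 1}
toy_doc_freq = {"the": 3, "union": 2, "zeal": 1, "quay": 1}

# Words found in exactly one document (stored as a set for fast lookups).
single_doc_words = {word for word, n_docs in toy_doc_freq.items() if n_docs == 1}

# Keep only the words that occur in at least two documents.
pruned_word_freq = {w: c for w, c in toy_word_freq.items() if w not in single_doc_words}
pruned_doc_freq = {w: c for w, c in toy_doc_freq.items() if w not in single_doc_words}

print(sorted(single_doc_words))
print(pruned_word_freq)
```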

## Assigning an Identifier to Each Speech Based on the President and Year

The function below is designed to accept a row from a dataframe, extracting the president and year. It lowercases the president's last name and then stores a local variable called `title` that unites the president and year with an underscore using an [f-string](https://www.geeksforgeeks.org/formatted-string-literals-f-strings-python/). The function call returns that title.

We then create a new column by using the `apply()` method to apply this function to our dataframe.

I've also included a line of code that does this with a lambda function so you can see alternative approaches. Both work. We'll make sure they give the same result using an `assert` statement (specifically, asserting that two lists are identical), and then we'll drop one of the columns.

We'll also use a lambda function and `apply()` to create a column for the wordcount of the speech, like we did last week.

In [None]:
def remake_speech_title(row):
    pres = row["president"].lower()
    year = row["year"]
    title = f"{pres}_{year}"
    return title

df["speech_title"] = df.apply(remake_speech_title, axis=1)

df["speech_title_lambda"] = df.apply(lambda row: f"{row['president'].lower()}_{row['year']}", axis=1)

df["wordcount"] = df.text.apply(lambda x: len(x.split()))

df.head()

In [None]:
assert df.speech_title.tolist()==df.speech_title_lambda.tolist()

In [None]:
df.drop(columns="speech_title_lambda", inplace=True)
df.head()

In [None]:
plt.figure(figsize=(14, 8))
sns.scatterplot(x = "year", y = "wordcount", data = df)
plt.title("Wordcount of State of the Union Address by Year")
plt.xlabel("Year")
plt.ylabel("Words")
plt.show()

We can use the `max()` method to identify the wordcount of the longest speech, and then the next line to identify the speech with that exact wordcount.

In [None]:
df.wordcount.max()

In [None]:
df[df.wordcount.eq(df.wordcount.max())]

## Document-Term Matrix

The first step to comparing documents based on the frequencies of terms is to construct a [document-term matrix](https://en.wikipedia.org/wiki/Document-term_matrix). Simply put, we will have a row for each document and a column for each individual type (i.e., unique word). The cells will include the number of times that word occurs in that document.

The code below uses `copy.copy()` to create a copy of our original dataframe, `df`, that won't change if `df` changes. See Nick Parlante's guide on this point [here](https://cs.stanford.edu/people/nick/py/python-nocopy.html).

In [None]:
x = [1, 2, 3]
y = x

print(x)
print(y)

x.append(4)
print(y) # y changes when x does, even though we changed x after creating y
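
For contrast, here is the same experiment with `copy.copy()`, plus one caveat about nested objects. For plain Python containers, `copy.copy()` makes a *shallow* copy: the outer list is new, but any lists inside it are still shared, which is what `copy.deepcopy()` is for. The variable names are made up for this illustration.

```python
import copy

x = [1, 2, 3]
y = copy.copy(x)  # a new list object with the same elements

x.append(4)
print(y)  # y is unaffected this time

# Caveat: a shallow copy shares any nested objects with the original.
nested = [[1, 2], [3]]
shallow = copy.copy(nested)
deep = copy.deepcopy(nested)

nested[0].append(99)
print(shallow[0])  # the inner list is shared, so the change shows up here
print(deep[0])     # the deep copy has its own inner list
```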

In [None]:
dtm = copy.copy(df)
dtm.text = dtm.text.apply(str.split)
dtm = dtm[["speech_title", "text"]]
dtm.head()

The function below takes a document (as a list of words) and our vocabulary (also as a list of words) and returns a list of the term frequencies—the frequency of each word *within the document*, rather than in the corpus overall.

In [None]:
def term_frequency(doc: list, vocab: list) -> list:
    """Returns a list of term frequencies, given a document and vocab list"""
    return [doc.count(term) for term in vocab]

In [None]:
s = ["the", "cat", "in", "the", "hat"]

term_frequency(s, vocabulary[:10])

The cell below prints the first 10 words in our vocabulary list on one line and the term frequencies of those 10 words in the first speech in the corpus on the second line.

In [None]:
print(vocabulary[:10])

for idx, row in dtm.iterrows():
    print(term_frequency(row.text, vocabulary[:10]))
    break
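
Calling `doc.count(term)` once per vocabulary word rescans the whole document every time. One possible alternative, sketched below, counts the document once with `Counter()` and then looks each word up. `term_frequency_counter` is a hypothetical name, and the toy document and vocabulary are made up; it should return the same list as the function above.

```python
from collections import Counter


def term_frequency_counter(doc: list, vocab: list) -> list:
    """Term frequencies via a single pass over the document."""
    counts = Counter(doc)                      # count every word once
    return [counts[term] for term in vocab]    # missing words default to 0


toy_doc = ["the", "cat", "in", "the", "hat"]
toy_vocab = ["the", "hat", "dog"]

print(term_frequency_counter(toy_doc, toy_vocab))
print([toy_doc.count(term) for term in toy_vocab])  # same result, computed the slower way
```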

Because we have such a large vocabulary still, we are going to use a subset of 3,000 words (an arbitrary number) for the examples below. One justification for this is that it will reduce the burden to compute everything; another is that most words, as we have seen, are rare, and we are keeping the 3,000 most frequent words.

In [None]:
vocab_subset = vocabulary[:3000]

The cell below uses the `apply()` method with a lambda function, with a twist: rather than working on a single variable, it's going to work through all 3,000 words in `vocab_subset`—creating 3,000 columns.

There's a lot of overhead and this may not be the best way to do it (see [this blog post](https://ys-l.github.io/posts/2015/08/28/how-not-to-use-pandas-apply/), for instance, about the dangers of using `apply()` in this way), but I timed this approach and it took less than half the time of a loop-based method. We are saving it as a new object, `dtm_counts`, and will then merge it with the columns we saved in our copy of `df`. If we add the new columns directly to our copy of `df`, `pandas` will give us many warnings (potentially 3,000!).

In [None]:
%%time

dtm_counts = dtm.text.apply(lambda x: pd.Series(term_frequency(x, vocab_subset)))

dtm_counts.head()

Note that the column names are 0 through 2999. These are the indices of the columns. Below, we use a `dict comprehension` to create a mapping of our vocabulary words to their indices (i.e., the first one is 0, the second is 1, and so on). We then rename the columns in `dtm_counts` using this dictionary. The "mapper" argument in the `rename()` method accepts the dictionary for this purpose. It simply finds columns that match keys in the dictionary (like 0) and replaces them with the value in the dictionary for that key (like "the").

In [None]:
index_to_vocab_word_dict = {i:vocab_subset[i] for i in range(len(vocab_subset))}

dtm_counts.rename(mapper=index_to_vocab_word_dict, axis=1, inplace=True)

In [None]:
dtm_counts.head()

In [None]:
dtm = pd.concat([dtm, dtm_counts], axis=1)
dtm.head()

In [None]:
dtm.drop(columns="text", inplace=True)
dtm.set_index("speech_title", inplace=True)

In [None]:
dtm.head()

In [None]:
dtm.shape
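
For reference, a document-term matrix can also be assembled by building the rows as plain lists and handing them to `pd.DataFrame()` along with the row and column labels, which avoids the renaming and concatenation steps. This is a minimal sketch on made-up toy data, not a drop-in replacement for the cells above.

```python
from collections import Counter

import pandas as pd

# Made-up toy corpus: two tokenized "speeches" and a three-word vocabulary.
toy_titles = ["adams_1797", "adams_1798"]
toy_docs = [["the", "union", "the"], ["union", "of", "states"]]
toy_vocab = ["the", "union", "of"]

# One row of term frequencies per document.
rows = []
for doc in toy_docs:
    counts = Counter(doc)
    rows.append([counts[term] for term in toy_vocab])

# Rows are labeled by speech title and columns by vocabulary word.
toy_dtm = pd.DataFrame(rows, index=toy_titles, columns=toy_vocab)
toy_dtm.index.name = "speech_title"

print(toy_dtm)
```

Words outside the vocabulary (like "states" here) are simply not counted.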

## Plotting Speeches in a 2D Space using Principal Component Analysis

We will talk more about dimensionality reduction later in this class. If you are not familiar with [principal component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis), that's okay!

The problem we are trying to solve is that we have represented the speeches as vectors with a length of 3,000 (i.e., term frequencies for 3,000 terms). We can really only visualize stuff in two or three dimensions. PCA is one method for *reducing dimensionality*. It isn't perfect, or even necessarily the best approach, but we can use it to plot the locations of the speeches in a two-dimensional space. First, we are going to create another copy of our document-term matrix, `dtm_std`, in which we will standardize all of the frequencies by subtracting the mean and dividing by the standard deviation (here, the mean and standard deviation of the matrix as a whole rather than of each column). This is one form of [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)). We will standardize things later for ease of interpretation; here we are doing this only to facilitate PCA.

As we discussed before, `numpy` (here imported with the alias `np`) computes standard deviations by dividing by *n*, not *n*-1 as R, Stata, and even various other Python modules do. The `ddof` argument allows us to adjust for this.

In [None]:
dtm_std = copy.copy(dtm)
titles = dtm_std.index # we create an iterable with all the titles...
dtm_std = dtm_std.to_numpy() # ...and then convert dtm_std to a numpy array

sd = np.std(dtm.to_numpy(), ddof = 1, axis = None)

dtm_std = dtm_std - dtm_std.mean()
dtm_std = dtm_std/sd

In [None]:
dtm_std

In [None]:
dtm_std.mean() # this is basically zero, just never perfectly zero
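
The effect of `ddof` is easy to check on a few numbers. The sketch below uses a small made-up array, compares the two versions of the standard deviation for one column, and then shows a common alternative to the cell above: standardizing each *column* separately (a z-score per term) by passing `axis=0`.

```python
import numpy as np

# Made-up counts: three "documents" (rows) and two "terms" (columns).
toy_counts = np.array([[2.0, 0.0], [4.0, 1.0], [6.0, 5.0]])

first_column = toy_counts[:, 0]
sd_population = np.std(first_column)        # divides by n (numpy's default, ddof=0)
sd_sample = np.std(first_column, ddof=1)    # divides by n - 1

print(sd_population, sd_sample)

# Column-wise standardization: each term gets its own mean and standard deviation.
column_means = toy_counts.mean(axis=0)
column_sds = toy_counts.std(axis=0, ddof=1)
toy_z = (toy_counts - column_means) / column_sds

print(toy_z.mean(axis=0))          # each column is centered at (about) zero
print(toy_z.std(axis=0, ddof=1))   # each column has a standard deviation of one
```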

Now we'll implement PCA and do some basic visualizations before turning to TF-IDF weighting to try to improve the results. We only need two dimensions for the visualization, but in practice you might run PCA with more components and try to select the best set of results and only *then* create a plot using the first two dimensions (or three, if you're creating a 3D visualization).

In [None]:
pca = PCA(n_components=2)
components = pca.fit_transform(dtm_std)

pca_df = pd.DataFrame(data = components, columns = ["component1", "component2"])

In [None]:
pca_df["title"] = titles
pca_df[["president", "year"]] = pca_df.title.apply(lambda x: pd.Series(x.split("_")))
pca_df.year = pca_df.year.apply(int)
pca_df
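
For intuition about what `PCA` computes, the sketch below reproduces the core steps with `numpy` alone on made-up random data: center each column, take a singular value decomposition, and project the rows onto the first two directions. It is an illustration of the idea rather than a replacement for scikit-learn, whose results should agree up to the sign of each component.

```python
import numpy as np

rng = np.random.default_rng(0)
toy = rng.normal(size=(20, 5))   # made-up data: 20 "documents", 5 "terms"

# PCA works on centered data: subtract each column's mean.
centered = toy - toy.mean(axis=0)

# Singular value decomposition of the centered matrix.
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# Coordinates of each row on the first two principal components.
scores = centered @ vt[:2].T

# Share of the total variance captured by each component.
explained = (s ** 2) / np.sum(s ** 2)

print(scores.shape)
print(explained.round(3))
```

The components are ordered so that the first captures the most variance, which is why plotting only the first two is a common default.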

The variable `mask` we define below returns `True` or `False` for each row based on the condition that the year is greater than 2000. We're simply filtering out some of the data points so we can see what's going on more clearly with the more recent State of the Union addresses.

We can see Trump's addresses are close to one another in this space, but they are in the middle of George W. Bush's. Obama's speeches appear to be clustered on their own, but they aren't super distant from the others.

In [None]:
mask = pca_df["year"] > 2000

label_points = False

plt.figure(figsize=(14, 8))
sns_plot = sns.scatterplot(x = "component1", y = "component2", data = pca_df[mask], hue="president")
plt.title("Distribution of Speeches According to First Two Components")
if label_points:
    for idx, row in pca_df[mask].iterrows():
        sns_plot.text(x = row["component1"], y = row["component2"], s = row["title"])
plt.show()

The function below will help us create a column for the decade so we can group speeches by decade for visualization purposes.

In [None]:
def return_decade(year: int) -> str:
    """Given a year, returns the decade as a string"""
    return str(year)[:-1] + "0s"

In [None]:
return_decade(1984)

In [None]:
pca_df["decade"] = pca_df.year.apply(return_decade)

In [None]:
pca_df.head()

We plot the results below. With PCA, we might sometimes have some intuitions about what the components (latent variables) capture, but we never really know for sure. In this instance, we shouldn't expect much because, so far, we are only looking at raw word counts. We do see that the more recent speeches cluster together, but they are also near the earliest, with speeches from the mid-1800s to mid-1900s widely dispersed along the first component.

In [None]:
plt.figure(figsize=(14, 8))

sns.scatterplot(x = "component1", y = "component2", data = pca_df, hue="decade", palette="flare")
plt.title("Distribution of State of the Union Addresses\nAccording to First Two Components")
plt.legend(bbox_to_anchor=(1, 1))
plt.show()

## Using TF-IDF to Compare Documents

Let's see if things improve if we use [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) weighting. This takes the term frequencies we've just employed and weights them by the inverse of the document frequency of the given term. Since document frequency is based on the whole corpus, this approach links document-level information (term frequencies in the document) with corpus-level information (document frequencies of the vocabulary). If two speeches use the same common word a lot, they'll seem more similar regardless. With tf-idf weighting, however, two speeches that use the same *rare* word will seem even more similar. Weighting with tf-idf makes rare words matter more.

We define three helper functions below. `compute_idf` computes the inverse document frequency, given the number of documents (`N`) and the document frequency of a word. `compute_tfidf` uses the frequency of a term within a document and the inverse document frequency of the term overall to compute the tf-idf score for a single word for a single document. `compute_all_tfidf` computes the tf-idf scores for all words in a list of vocabulary words for a single document by applying `compute_tfidf` repeatedly.

We define the variable `N` using the `shape` attribute, which is a tuple, with `[0]` to select its first element: the number of rows (documents) in the document-term matrix.

`idf_dict` is a `dict` we define with a dict comprehension, applying the `compute_idf()` function we just defined to each key-value pair in `document_frequencies.items()`.

In [None]:
def compute_idf(N: int, doc_freq: int) -> float:
    """Given the number of documents, N, and the document frequency, return the
    inverse document frequency (IDF)"""
    return np.log10(N/(1 + doc_freq))


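def compute_tfidf(doc: list, word: str) -> float:
    """Given a document and a vocabulary word, returns the tfidf score for that word for that document.

    Relies on the global `idf_dict`, which must be defined before this function is called."""
    term_freq = np.log(1 + doc.count(word))  # log-scaled term frequency (natural log)
    idf = idf_dict[word]  # precomputed inverse document frequency (log base 10)
    return term_freq * idf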


def compute_all_tfidf(doc: list, vocab: list) -> list:
    """Given a document and a list of words, returns the tfidf scores for each word for that document"""
    return [compute_tfidf(doc, word) for word in vocab]
    
    
N = dtm.shape[0]
    
idf_dict = {word:compute_idf(N, frequency) for word, frequency in document_frequencies.items()}
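
To see what this weighting does, it can help to run the same formulas on a tiny corpus. The cell below is a self-contained sketch that uses only the standard library. `toy_docs`, `toy_idf`, and `toy_tfidf` are hypothetical names that mimic `compute_idf` and `compute_tfidf` without touching the notebook's variables.

```python
import math

# Hypothetical toy corpus: three tokenized "speeches"
toy_docs = [
    "the union is strong the union endures".split(),
    "the economy is strong".split(),
    "the treaty is signed".split(),
]


def toy_idf(n_docs: int, doc_freq: int) -> float:
    """Same formula as compute_idf: log10(N / (1 + document frequency))"""
    return math.log10(n_docs / (1 + doc_freq))


def toy_tfidf(doc: list, word: str, idf: dict) -> float:
    """Same formula as compute_tfidf: ln(1 + count) * idf"""
    return math.log(1 + doc.count(word)) * idf[word]


n_docs = len(toy_docs)
vocab = sorted({word for doc in toy_docs for word in doc})

# Document frequency: the number of documents that contain each word
doc_freq = {word: sum(word in doc for doc in toy_docs) for word in vocab}
idf = {word: toy_idf(n_docs, freq) for word, freq in doc_freq.items()}

# Compare a word found in every document, in two documents, and in only one
for word in ["the", "strong", "union"]:
    score = toy_tfidf(toy_docs[0], word, idf)
    print(f"{word:>7}: doc_freq={doc_freq[word]}, idf={idf[word]:.3f}, tf-idf in first doc={score:.3f}")
```

The rarest word ("union") gets the highest weight. Because of the `1 +` in the denominator, a word that appears in every document ends up with a slightly negative idf. The effect is exaggerated with only three documents and shrinks toward zero as the corpus grows.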

Now we will make a copy of our original dataframe. We will call it `weighted_dtm` because it will be based on a document-term matrix, but we will weight the values using tf-idf instead of the raw counts.

In [None]:
weighted_dtm = copy.copy(df)
weighted_dtm.text = weighted_dtm.text.apply(str.split)
weighted_dtm = weighted_dtm[["speech_title", "text"]]
weighted_dtm.head()

The cell below again uses `apply()` to create 3,000 columns in a new object, `tfidf_scores`, that we will merge with the dataframe we just created. Everything we discussed the first time we did this applies. We will reuse `index_to_vocab_word_dict` because the vocabulary is in the same order.

In [None]:
%%time

tfidf_scores = weighted_dtm.text.apply(lambda x: pd.Series(compute_all_tfidf(x, vocab_subset)))

tfidf_scores.head()

In [None]:
tfidf_scores.rename(mapper=index_to_vocab_word_dict, axis=1, inplace=True)
weighted_dtm = pd.concat([weighted_dtm, tfidf_scores], axis=1)

In [None]:
weighted_dtm.drop(columns="text", inplace=True)
weighted_dtm.head()
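
Computing the scores one document at a time with `apply()` is easy to follow, but it can be slow. As a sketch of an alternative, the same two formulas can be applied to an entire matrix of raw counts at once with NumPy. The `counts` array below is a hypothetical toy matrix, not the notebook's data, and this is one possible approach rather than the method the notebook relies on.

```python
import numpy as np

# Hypothetical toy count matrix: rows are documents, columns are vocabulary words
counts = np.array([
    [2, 1, 2, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 3],
])

n_docs = counts.shape[0]

# Document frequency: number of documents in which each word appears at least once
doc_freq = np.count_nonzero(counts, axis=0)

# Same formulas as compute_idf and compute_tfidf, applied to whole arrays
idf = np.log10(n_docs / (1 + doc_freq))
tfidf = np.log1p(counts) * idf  # the idf vector broadcasts across every row

print(tfidf.round(3))
```

Here `np.log1p(counts)` computes `ln(1 + count)` for every cell, and multiplying by `idf` weights each column by its word's inverse document frequency.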

Now we will try PCA with the tf-idf weights!

In [None]:
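weighted_dtm.set_index("speech_title", inplace=True)
titles = weighted_dtm.index  # keep the titles so we can label the points later
weighted_dtm = weighted_dtm.to_numpy()

# Standardize using a single mean and standard deviation computed over every
# cell in the matrix (axis=None), not column by column
sd = np.std(weighted_dtm, ddof = 1, axis = None)

weighted_dtm = weighted_dtm - weighted_dtm.mean()
weighted_dtm = weighted_dtm/sd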

In [None]:
tfidf_pca = PCA(n_components=2)
components = tfidf_pca.fit_transform(weighted_dtm)

tfidf_pca_df = pd.DataFrame(data = components, columns = ["component1", "component2"])
tfidf_pca_df["title"] = titles
tfidf_pca_df[["president", "year"]] = tfidf_pca_df.title.apply(lambda x: pd.Series(x.split("_")))
tfidf_pca_df.year = tfidf_pca_df.year.apply(int)
tfidf_pca_df

In contrast to the visualization based on term frequencies alone, we see that the Obama and Trump speeches are farther apart, with Bush's speeches in between. Obama's speeches appear to be much more separated from the others.

In [None]:
mask = tfidf_pca_df["year"] > 2000
tfidf_pca_df[mask]

label_points = False

plt.figure(figsize=(14, 8))
sns_plot = sns.scatterplot(x = "component1", y = "component2", data = tfidf_pca_df[mask], hue="president")
plt.title("Distribution of State of the Union Addresses\nAccording to First Two Components")
if label_points:
    for idx, row in tfidf_pca_df[mask].iterrows():
        sns_plot.text(x = row["component1"], y = row["component2"], s = row["title"])
plt.show()

In [None]:
tfidf_pca_df["decade"] = tfidf_pca_df.year.apply(return_decade)

In [None]:
tfidf_pca_df

In [None]:
plt.figure(figsize=(14, 8))
sns.scatterplot(x = "component1", y = "component2", data = tfidf_pca_df, hue="decade", palette="flare")
plt.title("Distribution of State of the Union Addresses\nAccording to First Two Components")
plt.legend(bbox_to_anchor=(1, 1))
plt.show()

In [None]:
mask = tfidf_pca_df.decade.isin(["1790s", "1890s", "1990s"])

label_points = False

plt.figure(figsize=(14, 8))
sns_plot = sns.scatterplot(x = "component1", y = "component2", data = tfidf_pca_df[mask], hue="decade")
plt.title("Distribution of State of the Union Addresses\nAccording to First Two Components")
plt.legend(bbox_to_anchor=(1.25, 1))
if label_points:
    for idx, row in tfidf_pca_df[mask].iterrows():
        sns_plot.text(x = row["component1"], y = row["component2"], s = row["title"])
plt.show()

## Sparse versus Dense Vectors

In [None]:
n_nonzero = np.count_nonzero(dtm)
print(f"Number of non-zero values in the (truncated) document-term matrix: {n_nonzero:,}")
print(f"Number of entries in the (truncated) document-term matrix: {dtm.size:,}")
# The share of zeros is one minus the share of non-zero entries
print(f"{(1 - n_nonzero/dtm.size) * 100:.0f}% of entries are zeros, and that's based on "
      "the 3,000 most frequent words!")

We will discuss tf-idf more when we talk about document similarity, but there are many other ways of representing words, documents, or parts of documents as vectors. What we have done in this notebook is create what are considered sparse vectors. Because most words do not appear in most documents, there are a lot of dimensions, there are many zeros, and in general cell counts in a document-term matrix will be low. Dimensionality reduction is a core part of the approaches we will take going forward, including the use of 'dense' vectors, which are based on powerful algorithms that produce better representations of words or documents using fewer dimensions.
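
To make the contrast concrete, the cell below turns a small sparse matrix into short dense vectors. It is only a sketch: truncated SVD is used here as a simple stand-in for dimensionality reduction and is not necessarily the method used in later notebooks. `toy_dtm` and `share_of_zeros` are hypothetical names for this illustration.

```python
import numpy as np

# Hypothetical toy document-term matrix: 4 documents by 6 vocabulary words
toy_dtm = np.array([
    [3, 0, 0, 1, 0, 1],
    [0, 2, 0, 0, 1, 1],
    [2, 0, 0, 2, 0, 0],
    [0, 1, 4, 0, 0, 0],
], dtype=float)


def share_of_zeros(matrix: np.ndarray) -> float:
    """Return the fraction of entries that are exactly zero"""
    return 1 - np.count_nonzero(matrix) / matrix.size


# Truncated SVD: keep only the 2 strongest singular directions
U, S, Vt = np.linalg.svd(toy_dtm, full_matrices=False)
dense_vectors = U[:, :2] * S[:2]  # one 2-dimensional vector per document

print(f"Sparse representation: shape {toy_dtm.shape}, {share_of_zeros(toy_dtm):.0%} zeros")
print(f"Dense representation:  shape {dense_vectors.shape}")
print(dense_vectors.round(2))
```

Each document goes from six mostly empty dimensions to two dimensions that both carry information.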