In this lab, we'll expand on our wordclouds by using *groupby* to make a new wordcloud for each year. We'll also use TF-IDF weighting to give us a better sense of what's changing.

In [None]:
from text_analytics import text_analytics
from text_analytics import load
import os
import pandas as pd

ai = text_analytics()
print("Done!")

This dataset contains speeches in the US Congress from 1931 to 1969. It will take a bit to load!

In [None]:
file = "US.Congress.1931-1969.gz"
file = os.path.join(ai.data_dir, file)
df = pd.read_csv(file)
print(df)
print("Done!")

First, we'll iterate over the data by year, using the pandas *groupby* method, and count how many speeches there are in each year.

In [None]:
counts = []
for year, year_df in df.groupby("Year"):
    counts.append([year, len(year_df)])

counts = pd.DataFrame(counts, columns = ["Year", "N"])
counts = counts.set_index("Year", drop = True)
print(counts)
print("Done!")
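
As a side note, the loop above can also be written in one line. The sketch below uses a small made-up DataFrame (the "Text" column name is an assumption, not taken from the real dataset) to show that both versions give the same counts.

```python
import pandas as pd

# Toy stand-in for the Congress data; the "Text" column name is an assumption
toy = pd.DataFrame({
    "Year": [1931, 1931, 1932, 1932, 1932, 1933],
    "Text": ["a", "b", "c", "d", "e", "f"],
})

# Loop version, mirroring the cell above
rows = []
for year, year_df in toy.groupby("Year"):
    rows.append([year, len(year_df)])
loop_counts = pd.DataFrame(rows, columns=["Year", "N"]).set_index("Year")

# One-line version: size() counts the rows in each group
short_counts = toy.groupby("Year").size().to_frame("N")

print(loop_counts)
print(short_counts)
```

The loop is still useful whenever we want to do more with each year's data than count it, such as building a wordcloud.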

And then we'll plot this to get an idea of how the number of speeches changes from year to year.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [20,12]
plt.rcParams['figure.dpi'] = 120

sns.barplot(x = counts.index, y = "N", data = counts)
print("Done!")

Now, because it takes a while to fit the TF-IDF model, let's load one that is pre-trained. We first load the pre-trained version, and then we tell the *ai* object to use that version. There are two parts here: the vectorizer and a phrase model that uses PMI (pointwise mutual information) to find sequences like "New York."

In [None]:
ai_state = load("tf-idf.US.Congress.1931-2016")
ai.tfidf_vectorizer = ai_state.tfidf_vectorizer
ai.phrases = ai_state.phrases
print("Done!")
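
To get a feel for what TF-IDF weighting does, here is a minimal sketch on three made-up "speeches." It uses the textbook formula (a word's share of the document times the log of how rare the word is across documents). The pre-trained vectorizer may well use a different variant, so treat this as an illustration of the idea, not of its internals.

```python
import math
from collections import Counter

# Three tiny made-up "speeches", already split into words
docs = [
    ["the", "senate", "passed", "the", "bill"],
    ["the", "house", "debated", "the", "tariff"],
    ["the", "tariff", "bill", "failed"],
]

n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(doc))

def tfidf(doc):
    """Textbook TF-IDF: (share of the document) * log(N / document frequency)."""
    term_counts = Counter(doc)
    return {
        word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
        for word, count in term_counts.items()
    }

# Weights for the first speech, largest first
weights = tfidf(docs[0])
for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word:8s} {weight:.3f}")
```

A word like "the" appears in every document, so its weight drops to zero, while words that are specific to one speech rise to the top. That is why TF-IDF wordclouds highlight what is distinctive about a year.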

In [None]:
year_df = df.loc[df["Year"] == 1955]
print(year_df)
ai.wordclouds(year_df, stage = 4, features = "tfidf")
print("Done!")

This will take a moment to calculate. Then, we'll see a wordcloud for Congress in 1955. If you want to see a different year, change *1955* in the first line of the cell!
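
If you pick a year that isn't in the data, the selection comes back empty. The sketch below wraps the selection in a small helper that checks for this. The name `select_year` and the toy DataFrame (including its "Text" column) are made up for illustration and are not part of the lab's package.

```python
import pandas as pd

def select_year(df, year):
    """Return the rows for one year, or raise if the year is not in the data."""
    year_df = df.loc[df["Year"] == year]
    if year_df.empty:
        available = f"{df['Year'].min()}-{df['Year'].max()}"
        raise ValueError(f"No speeches for {year}; data covers {available}")
    return year_df

# Toy stand-in for the Congress DataFrame
toy = pd.DataFrame({"Year": [1954, 1955, 1955], "Text": ["a", "b", "c"]})

print(select_year(toy, 1955))
```

In the notebook, the result of a helper like this could be passed to *ai.wordclouds* in place of *year_df*.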