# Hist 3368 
## Cleaning, Lemmatizing, and Visualization with Congress in Pandas

## Load Some Data

In [1]:
import pandas as pd

In [2]:
cd /scratch/group/history/hist_3368-jguldi

/scratch/group/history/hist_3368-jguldi


***Give this several minutes; we're reading in big data:***

In [3]:
congress = pd.read_csv("congress1967-2010.csv")
#congress = pd.read_csv("eighties_data.csv")

Let's do a couple of basic cleaning steps. First, let's look at the actual text of the 'speech' column to get an idea of what we're dealing with.

In [4]:
for contenttext in congress['speech'].head(3): # for the first three entries in the 'speech' column
    print(contenttext[-100:]) # print the last 100 characters

Those who do not enjoy the privilege of the floor will please retire from the Chamber.
cleared of all attaches. unless they have absolutely important business to attend to in the Chamber.
lly needed for the next few minutes of the deliberations of the Senate will tetire from the Chamber.


You'll notice that there are uppercase words, punctuation marks, and stopwords that will interfere with our analysis unless we do away with them.

**Let's package all of these commands into a function, defined with "def," and use .apply() to apply the function to each item in the 'speech' column.**

We know that we can split the text of the 'speech' column into words, lowercase them, stopword them, and lemmatize them using some familiar commands.

    .lower()
    .split()
    wn.morphy()
    if word not in stopwords

We could also add some steps to screen out digits and initials:

    if not word.isdigit()
    if len(word) > 1
    
Note the use of "len()", which asks the "length" of a string in characters.  If the length of a word -- len(word) -- is greater than 1, we keep it:
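Before running these filters on the whole corpus, it can help to watch them work on a single sentence. The sketch below is only an illustration: the sentence and the `toy_stopwords` set are made up for this example (the notebook itself uses NLTK's English stopword list), and lemmatizing is left out so the example runs without NLTK.

```python
# Toy stopword list for illustration only; the notebook uses NLTK's English list.
toy_stopwords = {"the", "of", "will", "from"}

# A made-up sentence, not taken from the congress data.
sentence = "The 3 Senators of J Street will retire from the Chamber"

kept = [
    word
    for word in sentence.lower().split()  # lowercase, then split on whitespace
    if word not in toy_stopwords          # drop stopwords
    if not word.isdigit()                 # drop pure numbers such as "3"
    if len(word) > 1                      # drop single characters such as "j"
]

print(kept)
```

Note that the text is lowercased before the stopword check, so that "The" is compared against the list as "the".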

In [5]:
# load stopwords and software
from nltk.corpus import stopwords # this calls all multilingual stopword lists from NLTK
from nltk.corpus import wordnet as wn
stop = stopwords.words('english') # this command calls only the English stopwords, labeling them "stop"
stop_set = set(stop) # use the Python native command "set" to streamline how the stopwords are stored, improve performance

In [6]:
# create a function that does all the cleanup
import re

def cleaning_step(row):
    
    clean_row = re.sub(r'[^\w\s]', '', row) # remove punctuation (str.replace() would treat the pattern as literal text)
    clean_row = clean_row.lower().split() # lowercase, then split into words
    clean_row = [wn.morphy(word) for word in clean_row  # lemmatize
                 if word not in stop_set # if it isn't a stopword
                 if not word.isdigit() # if it isn't a number
                 if len(word) > 1] # if it's longer than one character
    clean_row = filter(None, clean_row) # remove any 'None's that result from cases such as wn.morphy("the")
    clean_row = ' '.join(clean_row) # glue the words back together into one string per row
    
    return clean_row
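One detail worth checking when stripping punctuation: Python's built-in string method `.replace()` matches its first argument literally, while `re.sub()` interprets it as a regular expression. A minimal sketch of the difference, using a made-up sentence rather than the congress data:

```python
import re

text = "Mr. President, I yield the floor."

# str.replace() searches for the literal characters "[^\w\s]",
# finds none in the sentence, and returns it unchanged.
literal = text.replace(r"[^\w\s]", "")

# re.sub() treats the pattern as a regular expression:
# any character that is neither a word character nor whitespace is removed.
stripped = re.sub(r"[^\w\s]", "", text)

print(literal)
print(stripped)
```

The pandas method `Series.str.replace()` is a different command from the plain string method, and whether it reads patterns as regular expressions depends on its `regex` argument, so it is worth testing on a few rows first.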

***This may take some time.  Lemmatizing is computation intensive. Allot 30 minutes.***

In [None]:
congress['speech'] = congress['speech'].apply(cleaning_step) 
congress[:5]

Inspect the data to see what we've done. 

In [None]:
for contenttext in congress['speech'].head(3): # for the first three entries in the 'speech' column
    print(contenttext[-1000:]) # print the last 1000 characters

#### Save the data for later.

In [None]:
cd ~/digital-history

In [None]:
congress.to_csv("lemmatized-congress1968.csv")

In [None]:
# for use if you need to re-load (the filename must match the one saved above)
# congress = pd.read_csv("lemmatized-congress1968.csv")

# Overall Visualisation

Let's paste together all the words in the 'speech' column to get one long string that we'll call 'allwords.'

In [None]:
allwords = " ".join(congress['speech'])

Let's get a rough sense of what's in the 'speech' column by creating a wordcloud.

The wordcloud package has its own built-in function to split a block of text.  It just needs one big block of text assembled from all the rows in the 'speech' column, which is why we used the join() command to paste together all the entries in congress['speech'], calling the result 'allwords.'  Now we'll use the WordCloud().generate() command to make a wordcloud from the variable 'allwords'.

In [None]:
#import software 
!pip install wordcloud --user
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
%matplotlib inline
stop_words = set(STOPWORDS)

# make a wordcloud
wordcloud = WordCloud(stopwords=stop_words, background_color="white").generate(allwords)
plt.figure(figsize=(12, 12))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Next, let's visualize the most frequent words, breaking the variable 'allwords' down into individual words using split().  

In [None]:
wordlist = allwords.split()
wordlist[:10] # look at the first ten elements of the list only

Next, count the individual words using the pandas commands "Series()" and "value_counts()", keeping only the 20 most frequent.

In [None]:
wordcounts = pd.Series(wordlist).value_counts()[:20]
wordcounts[:10]
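As a cross-check that doesn't depend on pandas, the same kind of tally can be sketched with `Counter` from Python's standard library. The word list below is made up for illustration; `most_common(n)` returns the n most frequent items, highest count first, which is comparable to slicing the output of value_counts().

```python
from collections import Counter

# A made-up word list standing in for 'wordlist'.
toy_words = "senate bill senate vote bill senate".split()

# Count every word, then keep the two most frequent.
toy_counts = Counter(toy_words).most_common(2)

print(toy_counts)
```

On this toy list, "senate" appears three times and "bill" twice, so those are the two pairs returned.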

Now, plot those values as a well-labeled barchart.  Notice that the axes are well-labeled and that the chart has a title that describes the data.

In [None]:
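ax = wordcounts.plot(kind='bar', 
                     title="20 most frequent words in the 'speech' column of the Congress data, 1967-2010",
                     figsize=(6, 6)
                    )
ax.set_xlabel('Word') # label the horizontal axis
ax.set_ylabel('Number of occurrences') # label the vertical axis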