# Document-Term Matrices

So far we've been building our own pandas dataframes from custom data. That's great, but one particular kind of dataframe is useful for a whole range of text-analysis tasks: the document-term matrix.

## What is a document-term matrix?

A document-term matrix (DTM) is simply a dataframe with documents (texts) as rows, terms (words) as columns, and cells holding the count of each term in each document. For instance, if we had three documents:

* D1 = "I like this class"
* D2 = "I love this class"
* D3 = "I tolerate this class"

Then the document-term matrix would be:

| # |I|like|love|tolerate|this|class|
|--|-|----|----|--------|----|-----|
|D1|1|1   |0   |0       |1   |1    |
|D2|1|0   |1   |0       |1   |1    |
|D3|1|0   |0   |1       |1   |1    |
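
As a minimal sketch (not the routine used later in this notebook), the table above can be reproduced by counting the words in each document and handing the counts to pandas. Splitting on whitespace is an assumption that only works here because the toy documents contain no punctuation, and the column order pandas produces may differ from the table.

```python
from collections import Counter
import pandas as pd

# The three toy documents from above
docs = {
    "D1": "I like this class",
    "D2": "I love this class",
    "D3": "I tolerate this class",
}

# Build one dictionary of word counts per document
rows = []
for name, text in docs.items():
    rows.append(dict(Counter(text.split())))

# Documents become rows and words become columns;
# a word missing from a document is filled in as 0
toy_dtm = pd.DataFrame(rows, index=list(docs.keys())).fillna(0).astype(int)
print(toy_dtm)
```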

Why are these useful? With this type of dataframe, we can more easily:

* find the words which distinguish one group of texts from another group of texts.
* find which words often appear together in documents.
* calculate the "distance" between texts in terms of their word usage.
* cluster texts based on their word usage.
* ...and more!

## Review

In [None]:
# import some things
import os
import pandas as pd
from textblob import TextBlob
pd.set_option("display.max_rows", 20)

In [None]:
# just for an exercise below
letters = ['a','b','c','d','e']

### Reviewing how to make a dataframe (from lists of dictionaries)

We've been using the same routine to make dataframes, so let's spell it out step by step.

In [None]:
# (1) Make a new list (for all result dictionaries)

# (2) Loop over something
    
    # (3) For each thing in the loop, make a dictionary
    
    # (4) Add some things to the dictionary
    
    # (5) add the individual result dictionary to the list of result dictionaries
    
# (6) make a dataframe from the list of dictionaries
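
Here is one way the six steps might be filled in. The `words` list and the two column names are made up for illustration; any list and any dictionary keys would follow the same pattern.

```python
import pandas as pd

# Hypothetical input: any list of things to loop over
words = ["to", "be", "or", "not"]

# (1) Make a new list (for all result dictionaries)
results = []

# (2) Loop over something
for word in words:

    # (3) For each thing in the loop, make a dictionary
    result = {}

    # (4) Add some things to the dictionary
    result["word"] = word
    result["length"] = len(word)

    # (5) Add the individual result dictionary to the list of result dictionaries
    results.append(result)

# (6) Make a dataframe from the list of dictionaries
df = pd.DataFrame(results)
print(df)
```

Each dictionary becomes one row, and each dictionary key becomes one column.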


#### @TODO: Make a dataframe that looks like this:

| name | status |
|------|--------|
|Rosencrantz|dead|
|Guildenstern|dead|

In [None]:
# To do so, loop over this list
names = ['Rosencrantz', 'Guildenstern']

# (1) Make a new list (for all result dictionaries)

# (2) Loop over something
    
    # (3) For each thing in the loop, make a dictionary
    
    # (4) Add some things to the dictionary
    
    # (5) add the individual result dictionary to the list of result dictionaries
    
# (6) make a dataframe from the list of dictionaries


### Reviewing how to loop over files

#### Method 1: Loop over metadata column

In [None]:
# For example:
df_meta = pd.read_excel('../corpora/harry_potter/metadata.xls')
df_meta

In [None]:
# This is the filename column:
df_meta.fn
#
# (or)
#
df_meta['fn']

In [None]:
# @TODO: Finish this:

# 1) Set a folder for this corpus
text_folder = '../corpora/harry_potter/texts/'

# 2) Loop over the filename column

    
    # 3) Get and print the full path
    
    

#### Method 2: Loop over the text files

We can also loop over files in a text folder directly.

In [None]:
# Get the filenames in a folder
filenames = os.listdir(text_folder)
filenames

In [None]:
# @TODO: Finish this:

# 1) Set a folder for this corpus
text_folder = '../corpora/harry_potter/texts/'

# 2) Loop over the filename list

    # 3) Get and print the full path
    
    

**Note**: Sometimes we have to check if the file is a text file:

In [None]:
# Check if filename endswith .txt
example_filename = 'The Bible.txt'
example_filename.endswith('.txt')

In [None]:
# Check if filename ends with .txt?
example_filename = 'The Bible.jesus'
example_filename.endswith('.txt')
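
Putting the check to work, a sketch of filtering a folder listing might look like the following. The folder name and filenames are invented for illustration; in practice the list would come from `os.listdir`.

```python
import os

# Hypothetical folder and listing; a real listing would come from os.listdir(folder)
example_folder = "../corpora/example/texts/"
example_filenames = ["book1.txt", "book2.txt", ".DS_Store", "notes.docx"]

# Keep only the text files, and join each one to the folder to get a full path
full_paths = []
for fn in example_filenames:
    if not fn.endswith(".txt"):
        continue  # skip anything that is not a .txt file
    full_paths.append(os.path.join(example_folder, fn))

print(full_paths)
```

Using `continue` to skip unwanted files early keeps the rest of the loop body unindented, which is the same pattern the main loop below uses.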

## Quick detour: Stopwords!

What's a stopword? A word we don't want to count! Typically these are function words, pronouns, and common verbs and adverbs. Stopword lists vary a lot from one source to another. NLTK gives us one, which we need to download first:

In [None]:
# Download stopwords list
import nltk
nltk.download('stopwords')

In [None]:
# NLTK's stopwords
from nltk.corpus import stopwords
stopword_list=stopwords.words('english')
print(stopword_list)

In [None]:
# Checking whether a word is in a set is faster than checking a list
stopword_set = set(stopword_list)

# is 'us' in the stopwords?
'us' in stopword_set

In [None]:
# is 'you' in the stopwords?
'you' in stopword_set
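
To see what filtering with a stopword set does to a text, here is a small sketch. It uses a tiny hand-made stopword set rather than NLTK's much longer list, and plain whitespace splitting rather than a real tokenizer; both are simplifying assumptions.

```python
# A tiny hand-made stopword set, standing in for NLTK's much longer list
toy_stopwords = {"i", "this", "the", "a", "and", "you"}

sentence = "I love this class and you tolerate this class"

# Lowercase the text, split it on whitespace, and drop anything in the stopword set
tokens = sentence.lower().split()
content_words = [word for word in tokens if word not in toy_stopwords]

print(content_words)
```

Note that the text is lowercased before the check: the stopwords are all lowercase, so "I" would otherwise slip through.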

## How to make a document-term matrix

For this notebook, we'll be working with the 118 State of the Union speeches given by U.S. Presidents from 1900 to 2018. You can [download this corpus here](https://www.dropbox.com/sh/xd854hgyvbysqlm/AAAhbS6r7MFe4SVg1BFuuMTCa?dl=1). Please unzip it to your "corpora" folder.

In [None]:
# Set text folder and metadata path
# (If you don't have this corpus, please download it here: https://www.dropbox.com/sh/xd854hgyvbysqlm/AAAhbS6r7MFe4SVg1BFuuMTCa?dl=1)

text_folder = '../corpora/sotu_1900-2018/texts'
path_to_metadata='../corpora/sotu_1900-2018/metadata.xls'

### Major step 1: Make a list of dictionaries (of counts per text)

In [None]:
# 0) make a counter for a corpus-wide word count
from collections import Counter
all_counts = Counter()

# 1) make an empty results list
all_results = []

# 2) Loop over the filenames
filenames=sorted(os.listdir(text_folder))
for i,fn in enumerate(filenames):
    
    # make sure filename is a text file
    if not fn.endswith('.txt'): continue
    
    # just for a progress report:
    if not i%10:   # if i is divisible by 10
        # print some progress
        print('>> looping through #',i,'of',len(filenames),'files:',fn)
    
    # 3) get full path
    full_path = os.path.join(text_folder,fn)

    # 4) open the file
    with open(full_path) as file:
        txt=file.read()

    # 5) make a blob
    blob = TextBlob(txt.lower())

    # 6) make a result dictionary
    text_result = {}

    # 7) set the filename
    text_result['fn']=fn

    # 8) get the number of words
    num_words = len(blob.words)

    # 9) for each word,count pair in the blob.word_counts dictionary...
    for word,count in blob.word_counts.items():
        
        # is the word in the stopwords? if so, skip it
        if word in stopword_set: continue

        # does the word start with a non-letter (punctuation, a digit)? if so, skip it
        if not word[0].isalpha(): continue

        
        # 10) set the normalized count for this word to the text_result dictionary
        text_result[word] = count / num_words
            
        # 11) add the count to the dictionary of counts for all words
        all_counts[word]+=count

    # 12) add result dictionary to all_results
    all_results.append(text_result)

In [None]:
# So here's what the first dictionary looks like in our list of dictionaries
#all_results[0]
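
To make the shape of these dictionaries concrete, here is a scaled-down sketch of the same loop on two invented one-line "texts". Whitespace splitting stands in for TextBlob, and the stopword set is a tiny hand-made one; both are assumptions made so the example runs without any corpus or downloads.

```python
from collections import Counter

# Hypothetical stand-ins for the stopword list and the text files
toy_stopwords = {"the", "of", "is", "and"}
toy_texts = {
    "a.txt": "the state of the union is strong",
    "b.txt": "the economy is strong and the union is growing",
}

toy_counts = Counter()   # corpus-wide raw counts
toy_results = []         # one dictionary per text

for fn, txt in toy_texts.items():
    words = txt.lower().split()   # stand-in for blob.words
    num_words = len(words)        # total words, stopwords included

    result = {"fn": fn}
    for word, count in Counter(words).items():
        if word in toy_stopwords:
            continue
        result[word] = count / num_words   # relative frequency in this text
        toy_counts[word] += count          # raw count across the corpus
    toy_results.append(result)

print(toy_results[0])
print(toy_counts)
```

Because `num_words` counts every word, stopwords included, the values in one dictionary are shares of the whole text and will not add up to 1 once the stopwords are dropped.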

### Major step 2: Convert list of dictionaries to a dataframe

There are many, many different words in these texts. Here's the number of unique words in our all_counts counter:

In [None]:
len(all_counts)

That's too many columns for our document-term matrix! So let's get the most common words:

In [None]:
# because all_counts is a Counter (see above), we can find the 10 most common words
all_counts.most_common(10)
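
For a sense of what `most_common` returns, here is a sketch on a small made-up counter. The result is a list of `(word, count)` pairs sorted from most to least frequent, so getting just the words takes one more step.

```python
from collections import Counter

# Made-up counts for illustration
toy_counts = Counter({"union": 5, "world": 3, "jobs": 2, "poverty": 1})

# The two most common words, as (word, count) pairs
top_two = toy_counts.most_common(2)
print(top_two)

# Keep only the words, dropping the counts
top_words = [word for word, count in top_two]
print(top_words)
```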

So let's convert the results to a dataframe *while also limiting the number of columns*.

In [None]:
###
# Convert all_results to dataframe while limiting number of columns
#

# set number of words we want
n_top_words = 1000

# 13) Get the most frequent words
most_common_words_plus_counts = all_counts.most_common(n_top_words)

# 14) Get only the words
words_we_want = []
for word,count in most_common_words_plus_counts:
    words_we_want.append(word)

# 15) set a list for the columns, which is the words we want plus the 'fn' column
# (built as a new list so that words_we_want itself is left unchanged)
columns = words_we_want + ['fn']

# 16) Make dataframe
dtm = pd.DataFrame(all_results, columns=columns)

# 17) Set the filename as the index and fill empty values with 0
dtm=dtm.set_index('fn').fillna(0)

# show!
dtm
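
Two things happen in steps 16 and 17 that are easy to miss, so here is a sketch on invented data. Passing `columns=` to `pd.DataFrame` keeps only the listed keys and ignores the rest, and a word that never occurs in a text starts out as a missing value until `fillna(0)` turns it into a zero.

```python
import pandas as pd

# Invented result dictionaries: note that each text has different words
toy_results = [
    {"fn": "a.txt", "union": 0.2, "world": 0.1, "rare": 0.05},
    {"fn": "b.txt", "union": 0.1, "jobs": 0.3},
]

# Only these columns are kept; "rare" is not listed, so it is dropped
toy_columns = ["union", "world", "jobs", "fn"]
toy_dtm = pd.DataFrame(toy_results, columns=toy_columns)

# Use the filename as the index and replace missing values with 0
toy_dtm = toy_dtm.set_index("fn").fillna(0)
print(toy_dtm)
```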

In [None]:
# Can sort by particular words
dtm.sort_values('poverty',ascending=False)

In [None]:
# Can sort by particular words
dtm.sort_values('jobs',ascending=False)

### Combining DTMs with metadata

In [None]:
# Get the metadata for this corpus
df_meta = pd.read_excel(path_to_metadata).set_index('fn')
df_meta

In [None]:
# Merge the metadata and the DTM on the shared 'fn' index (an inner join by default)
dtm_meta=df_meta.merge(dtm,on='fn')
dtm_meta
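
One detail worth knowing about `merge`: by default it performs an inner join, so a file that appears in only one of the two dataframes is silently dropped. The sketch below shows this on two invented dataframes; it assumes a reasonably recent pandas, where `on=` may name an index level as well as a column.

```python
import pandas as pd

# Invented metadata for three files...
toy_meta = pd.DataFrame(
    {"fn": ["a.txt", "b.txt", "c.txt"], "Year": [1900, 1901, 1902]}
).set_index("fn")

# ...but word frequencies for only two of them
toy_dtm = pd.DataFrame(
    {"fn": ["a.txt", "b.txt"], "union": [0.2, 0.1]}
).set_index("fn")

# Default (inner) merge: c.txt disappears because it has no DTM row
toy_merged = toy_meta.merge(toy_dtm, on="fn")
print(toy_merged)

# A left merge keeps every metadata row and leaves the missing values empty
toy_left = toy_meta.merge(toy_dtm, on="fn", how="left")
print(toy_left)
```

Comparing `len(dtm_meta)` with `len(dtm)` after merging is a quick way to check that no texts were lost.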

#### Plotting meta+DTM

In [None]:
# Plot poverty over time
dtm_meta.plot(x='Year',y='poverty',figsize=(10,6))

In [None]:
# Multi-line graphs
from matplotlib import pyplot as plt
fig, ax = plt.subplots(figsize=(10, 6))

# The Long History of America First?
dtm_meta.plot(x='Year',y='america',ax=ax)
dtm_meta.plot(x='Year',y='world',ax=ax)

In [None]:
# Boxplots by party

dtm_meta.boxplot('world',by='Party',figsize=(8,5))

In [None]:
# Poverty?
dtm_meta.boxplot('poverty',by='Party',figsize=(8,5))

In [None]:
# Immigration?
dtm_meta.boxplot('immigration',by='Party',figsize=(8,5))
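
A boxplot shows the distribution of a word's frequency within each group. The same comparison can be sketched numerically with `groupby`, shown here on invented data; the party labels and values are placeholders, not figures from the corpus.

```python
import pandas as pd

# Invented frequencies for four speeches from two hypothetical parties
toy = pd.DataFrame({
    "Party": ["A", "A", "B", "B"],
    "world": [0.002, 0.004, 0.001, 0.003],
})

# Median frequency of 'world' per party (the line inside each box of a boxplot)
medians = toy.groupby("Party")["world"].median()
print(medians)
```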