<span style="color:red;font-weight:bold;font-size: 150%">These notes are a substantial revision of the notes from April 11, 2018.</span>

## textacy, with the Inaugural Addresses . . . 

[textacy Quick Start](https://chartbeat-labs.github.io/textacy/getting_started/quickstart.html)

[textacy API documentation](https://chartbeat-labs.github.io/textacy/api_reference.html#)

[textacy github repo](https://github.com/chartbeat-labs/textacy)


## . . . and topic modeling . . . 

**[Mallet](http://mallet.cs.umass.edu/), the Java topic modeling package we use most often.**  Note: we have a super-easy web interface for Mallet.

[David Mimno explains Topic Modeling](https://vimeo.com/53080123).  Mimno is the maintainer of Mallet.

[Ben Schmidt applies topic modeling to ship logs](http://sappingattention.blogspot.com/2012/11/when-you-have-mallet-everything-looks.html)

[Scott Weingart, "Topic Modeling for Humanists: A Guided Tour"](http://www.scottbot.net/HIAL/index.html@p=19113.html)

[Mining the Dispatch](http://dsl.richmond.edu/dispatch/), an exemplary application of topic modeling to a set of historical newspaper data.

[My toy topic modeller](https://talus.artsci.wustl.edu/malletTalk/toyTopicModeller.py), written in Python.

[My github repo for "understanding_mallet"](https://github.com/spenteco/understanding_mallet)

## Load textacy and spacy . . . 

. . . check their versions.

In [2]:
import textacy, spacy

print textacy.__version__
print spacy.__version__

nlp = spacy.load('en')

0.6.0
2.0.11


## The Inaugural Address corpus

The metadata for each address (its sequence number, the president's name, and the year) is "buried" in the file name . . . 

In [3]:
!ls corpora/inaugural_addresses

ls: cannot access corpora/inaugural_addresses: No such file or directory


## Dig out the metadata

In [4]:
import glob

metadata = []

for path_to_file in glob.glob('corpora/inaugural_addresses/*.txt'):
    
    address_n = int(path_to_file.split('/')[-1].split('_')[0])
    address_year = int(path_to_file.split('/')[-1].split('_')[-1].replace('.txt', ''))
    president = '_'.join(path_to_file.split('/')[-1].split('_')[1:-1])
    
    metadata.append({'address_n': address_n, 'address_year': address_year, 'president': president, 'path_to_file': path_to_file})
    
metadata.sort(key=lambda address: address['address_n'])

for m in metadata[:5]:
    print m
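
The parsing above assumes file names shaped like `<number>_<president>_<year>.txt`. As a minimal sketch, the same logic can be wrapped in a function that fails loudly on a name that doesn't fit. The function name and the sample file name below are hypothetical, chosen only for illustration.

```python
from __future__ import print_function
import os

def parse_address_filename(path_to_file):
    """Split '<n>_<president>_<year>.txt' into its metadata fields."""
    # Drop the directory and the '.txt' extension.
    stem = os.path.splitext(os.path.basename(path_to_file))[0]
    parts = stem.split('_')
    if len(parts) < 3:
        raise ValueError('unexpected file name: %r' % path_to_file)
    return {
        'address_n': int(parts[0]),          # first field: sequence number
        'address_year': int(parts[-1]),      # last field: year
        'president': '_'.join(parts[1:-1]),  # everything in between
        'path_to_file': path_to_file,
    }

# Hypothetical file name, for illustration only.
example = parse_address_filename(
    'corpora/inaugural_addresses/1_George_Washington_1789.txt')
print(example['address_n'], example['president'], example['address_year'])
```

If `glob.glob` finds no files (for example, because the notebook is run from the wrong directory), `metadata` stays empty and every later cell works on an empty corpus, so it is worth checking `len(metadata)` before going on.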
    

## Load a spacy corpus

https://chartbeat-labs.github.io/textacy/getting_started/quickstart.html#working-with-many-texts

https://chartbeat-labs.github.io/textacy/api_reference.html#module-textacy.corpus


In [5]:
import codecs, re

corpus = textacy.Corpus(
            u'en', 
            texts = [unicode(re.sub(r'\s+', ' ', codecs.open(m['path_to_file'], 'r', encoding='utf-8').read())) for m in metadata],
            metadatas = metadata)

print 'len(corpus)', len(corpus)

print
print type(corpus)

print
print type(corpus[0])

len(corpus) 0

<class 'textacy.corpus.Corpus'>



IndexError: list index out of range

### Inspecting the resulting textacy corpus . . . 

In [None]:
for a in range(0, 5):
    print corpus.docs[a].metadata, corpus.docs[a]
    
print
print 'docs in corpus', corpus.n_docs
print 'sentences in corpus', corpus.n_sents
print 'tokens in corpus', corpus.n_tokens

In [None]:
!wc -w corpora/inaugural_addresses/1_*
!wc -w corpora/inaugural_addresses/2_*
!wc -w corpora/inaugural_addresses/3_*
!wc -w corpora/inaugural_addresses/4_*
!wc -w corpora/inaugural_addresses/5_*

!wc -w corpora/inaugural_addresses/* | grep total
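
The `wc -w` counts are a sanity check on `corpus.n_tokens`. The two numbers should be close but not equal: `wc -w` counts whitespace-delimited words, while spaCy splits punctuation into separate tokens, so `n_tokens` will usually come out higher. A rough Python equivalent of `wc -w`, with a sample string used only for illustration:

```python
from __future__ import print_function

def count_words(text):
    """Count whitespace-delimited words, as `wc -w` does."""
    return len(text.split())

# Sample string, for illustration only.
sample = u'Fellow-Citizens of the Senate and of the House of Representatives:'
print(count_words(sample))
```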

### Keywords, too easy.

Note the explicit submodule import.  Simply importing "textacy" doesn't make "textacy.keyterms" available.

In [None]:
import textwrap
import textacy.keyterms

for doc in corpus[:5]:
    
    print
    print doc.metadata['address_year'], doc.metadata['president']
    print
    
    top_words = []
    for w in textacy.keyterms.textrank(doc, n_keyterms=20):
        top_words.append(w[0])
    
    print '\n'.join(textwrap.wrap(', '.join(top_words), 80))

## Readability

In [None]:
from textacy.text_stats import TextStats

for doc in corpus[-5:]:

    print
    print doc.metadata['address_year'], doc.metadata['president']
    print
    
    ts = TextStats(doc)
    print 'flesch_kincaid_grade_level', ts.readability_stats['flesch_kincaid_grade_level']
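
For reference, the Flesch-Kincaid grade level is a simple function of three counts: words, sentences, and syllables. The sketch below applies the standard formula to made-up counts. How textacy arrives at its own word and syllable counts may differ in detail, so don't expect this to reproduce `TextStats` exactly.

```python
from __future__ import print_function

def flesch_kincaid_grade_level(n_words, n_sents, n_syllables):
    """Standard Flesch-Kincaid formula: longer sentences and longer
    words both push the grade level up."""
    words_per_sentence = float(n_words) / n_sents
    syllables_per_word = float(n_syllables) / n_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

# Made-up counts, for illustration only.
grade = flesch_kincaid_grade_level(n_words=100, n_sents=5, n_syllables=150)
print(round(grade, 2))
```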

### Corpus to document-term matrix . . . 

Lots of settings.  The documentation is pretty good:

https://chartbeat-labs.github.io/textacy/getting_started/quickstart.html#analyze-a-corpus

https://chartbeat-labs.github.io/textacy/api_reference.html#vectorizers

(In the API reference, see [**textacy.vsm.vectorizers.Vectorizer**](https://chartbeat-labs.github.io/textacy/api_reference.html#textacy.vsm.vectorizers.Vectorizer).)

Also, **please see [the to_terms_list doc](https://chartbeat-labs.github.io/textacy/api_reference.html#textacy.doc.Doc.to_terms_list).**


In [None]:

# "apply_dl" MEANS "apply document length"

# FOR TF-IDF WEIGHTS
#vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=True, apply_dl=True)

# FOR RAW WORD COUNTS
#vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=False, apply_dl=False)

# FOR RELATIVE FREQUENCY, SET apply_dl=True.  AS WRITTEN (apply_dl=False), THIS GIVES RAW WORD COUNTS, FILTERED BY min_df AND max_df
vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=False, apply_dl=False, min_df=2, max_df=0.90)

doc_term_matrix = vectorizer.fit_transform((doc.to_terms_list(ngrams=1, 
                                                                named_entities=True,
                                                                normalize='lemma',
                                                                filter_stops=True,
                                                                filter_punct=True,
                                                                filter_nums=True,
                                                                as_strings=True) 
                                               for doc in corpus))

print
print repr(doc_term_matrix)
print
print doc_term_matrix.shape
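
To make the three weighting options concrete, here is a minimal pure-Python sketch on a made-up count matrix. Relative frequency divides each row by its document length. The tf-idf version uses one common smoothed idf, `log((1 + n_docs) / (1 + df)) + 1`. That choice is an assumption for illustration; textacy's exact formula may differ.

```python
from __future__ import division, print_function
import math

# Toy raw-count matrix: rows are documents, columns are terms (made-up numbers).
counts = [
    [2, 0, 1],
    [0, 3, 1],
    [4, 0, 0],
]

def relative_frequency(matrix):
    """Divide each count by its document's length (the row sum)."""
    result = []
    for row in matrix:
        total = sum(row) or 1  # avoid dividing by zero for an empty document
        result.append([c / total for c in row])
    return result

def document_frequency(matrix):
    """Number of documents in which each term occurs at least once."""
    n_terms = len(matrix[0])
    return [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]

def tfidf(matrix):
    """Weight raw counts by a smoothed inverse document frequency."""
    n_docs = len(matrix)
    idf = [math.log((1 + n_docs) / (1 + df)) + 1
           for df in document_frequency(matrix)]
    return [[c * w for c, w in zip(row, idf)] for row in matrix]

print(relative_frequency(counts)[0])
print(document_frequency(counts))
print(tfidf(counts)[0])
```

Terms that occur in fewer documents get a larger idf, so in the tf-idf matrix a count of a rare term weighs more than the same count of a common one.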

### Inspect the document-term matrix

Convert the sparse matrix to a dense matrix (i.e., one that stores every cell, zeros included).

Inspect in various ways.  Does it look reasonable?

In [None]:
dense_doc_term_matrix = doc_term_matrix.todense()

print
print repr(dense_doc_term_matrix)
print
print dense_doc_term_matrix.shape

list_doc_term_matrix = dense_doc_term_matrix.tolist()

print
print list_doc_term_matrix[0][:100]
print
print len(list_doc_term_matrix), len(list_doc_term_matrix[0])
print
for a in range(len(list_doc_term_matrix[0][:750])):
    if list_doc_term_matrix[0][a] > 0:
        #print vectorizer.id_to_term[a], list_doc_term_matrix[0][a], ';',
        print vectorizer.id_to_term[a],
print

###  Let's do some actual text analysis

We do three things here:

1.  Create a vectorizer, then use it to create a document-term matrix.
2.  Topic model using the document-term matrix.
3.  List the words associated with each resulting topic.

Lots of experimentation with Vectorizer parameters.  Raw word counts seem to work best.

And lots of experimentation with "n_topics" in creating the topic model.  The code below uses 15, which seemed<br/>
reasonable for this demonstration.

### The docs:

https://chartbeat-labs.github.io/textacy/getting_started/quickstart.html#analyze-a-corpus

https://chartbeat-labs.github.io/textacy/api_reference.html#topic-models

Tuning the topic model requires [a visit to the sklearn site](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html).

In [None]:
# FOR TF-IDF WEIGHTS
#vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=True, apply_dl=True,  min_df=2, max_df=25)

# FOR RAW WORD COUNTS
#vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=False, apply_dl=False,  min_df=2, max_df=40)
#vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=False, apply_dl=False)

# FOR RELATIVE FREQUENCY
#vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=False, apply_dl=True,  min_df=2, max_df=50)

# FOR RAW WORD COUNT.   LOW max_df VALUE REMOVES THINGS LIKE 'government' AND 'america' AND SEEMS TO GIVE THE BEST RESULT
vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=False, apply_dl=False,  min_df=5, max_df=30)

doc_term_matrix = vectorizer.fit_transform([doc.to_terms_list(ngrams=1,
                                                                named_entities=True,
                                                                normalize='lemma',
                                                                filter_stops=True,
                                                                filter_punct=True,
                                                                filter_nums=True,
                                                                as_strings=True) 
                                               for doc in corpus])

model = textacy.TopicModel('lda', n_topics=15)
model.fit(doc_term_matrix)

doc_topic_matrix = model.transform(doc_term_matrix)

print 
for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, top_n=20):
    print 'topic', topic_idx, ':', ' '.join(top_terms)
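
Conceptually, listing the top terms for a topic means ranking the vocabulary by its weight in that topic and keeping the first few. The sketch below does that on a made-up topic-term weight matrix. It illustrates the idea only; it is not textacy's implementation, and the function name is hypothetical.

```python
from __future__ import print_function

def top_terms_per_topic(topic_term_weights, id_to_term, top_n=3):
    """Yield (topic index, the top_n terms with the largest weights)."""
    for topic_idx, weights in enumerate(topic_term_weights):
        # Rank term ids from the heaviest weight to the lightest.
        ranked_ids = sorted(range(len(weights)),
                            key=lambda i: weights[i], reverse=True)
        yield topic_idx, [id_to_term[i] for i in ranked_ids[:top_n]]

# Made-up vocabulary and weights, for illustration only.
id_to_term = {0: 'law', 1: 'war', 2: 'peace', 3: 'union'}
topic_term_weights = [
    [0.1, 2.5, 1.7, 0.2],
    [3.0, 0.1, 0.3, 1.9],
]

for topic_idx, terms in top_terms_per_topic(topic_term_weights, id_to_term, top_n=2):
    print('topic', topic_idx, ':', ' '.join(terms))
```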

In [None]:
!grep -li law corpora/inaugural_addresses/* | wc -l

### What are the document-topic percentages?

I.e., which topics make up what percentage of each document?

In [None]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

import tabletext

def make_printable(topic_pcts):
    printable_pcts = []
    for pct in topic_pcts:
        formatted_pct = '%.2f' % pct
        if formatted_pct == '0.00':
            formatted_pct = '    '
        printable_pcts.append(formatted_pct)
    return printable_pcts

topic_headings = []
for a in range(len(doc_topic_matrix[0])):
    topic_headings.append(str(a).rjust(5))
    
results =[['', '', ''] + topic_headings]
    
for a in range(len(doc_topic_matrix)):
    results.append([a, corpus[a].metadata['address_year'], corpus[a].metadata['president'][:15]] + make_printable(doc_topic_matrix[a]))
    
print
print tabletext.to_text(results)
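
The full table gets wide as the number of topics grows. An alternative view, sketched below, lists only the largest topics for each document. It assumes each row of the document-topic matrix holds non-negative weights and rescales them to sum to 1. The rows here are made up, and the function name is hypothetical.

```python
from __future__ import print_function

def dominant_topics(doc_topic_rows, top_n=2):
    """For each document, return its top_n (topic index, share) pairs."""
    result = []
    for row in doc_topic_rows:
        total = float(sum(row)) or 1.0  # guard against an all-zero row
        shares = [(topic_idx, weight / total)
                  for topic_idx, weight in enumerate(row)]
        shares.sort(key=lambda pair: pair[1], reverse=True)
        result.append(shares[:top_n])
    return result

# Made-up document-topic weights, for illustration only.
doc_topic_rows = [
    [0.02, 0.90, 0.08],
    [0.60, 0.10, 0.30],
]

for doc_idx, topics in enumerate(dominant_topics(doc_topic_rows)):
    print(doc_idx, ', '.join('topic %d: %.2f' % pair for pair in topics))
```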

### List the words associated with each topic

We did this once.  I do it again, because I want to see more words, and I'd<br/>
like something that doesn't result in a wide display.

In [None]:
import textwrap

print

print 
for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, top_n=100):
    print
    print 'topic', topic_idx, ':', '\n'.join(textwrap.wrap(' '.join(top_terms), 80))
    