# Introduction to Text Analysis

Today's workshop will address concepts in text analysis. A fundamental understanding of Python is necessary. We will cover:

1. term-document model
2. regex
3. POS tagging
4. sentiment analysis
5. topic modeling
6. word2vec

Python packages you will need:

* NLTK ( `$ pip install nltk` )
* TextBlob ( `$ pip install textblob` )
* gensim ( `$ pip install gensim` )

## Introduction

We've spent a lot of time in Python dealing with text data, and that's because text data is everywhere. It is the primary form of communication between persons and persons, persons and computers, and computers and computers. The kind of inferential methods that we apply to text data, however, are different from those applied to tabular data. 

This is partly because documents are typically specified in a way that expresses both structure and content using text (i.e. the document object model).

Largely, however, it's because text is difficult to turn into numbers in a way that preserves the information in the document. Today, we'll talk about the dominant language models in NLP and the basics of how to implement them in Python.

# Part 1: The term-document model

This is also sometimes referred to as "bag-of-words" by those who don't think very highly of it. The term-document model looks at language as individual communicative efforts (documents) that contain one or more tokens. The kind and number of tokens in a document tell you something about what the author is trying to communicate, and the order of those tokens is ignored.

This is the primary method still used for most text analysis, although models utilizing word embeddings are beginning to take hold. We will discuss word embeddings briefly at the end.
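
To make "the order of those tokens is ignored" concrete, here is a minimal sketch using only the standard library. The helper `bag_of_words` and the two toy sentences are made up for illustration, and splitting on whitespace is a deliberate simplification of real tokenizing.

```python
from collections import Counter

def bag_of_words(text):
    """Count tokens, ignoring their order (naive whitespace tokenization)."""
    return Counter(text.lower().split())

# Two toy "documents" with the same words in a different order
doc_a = "the knight rides the horse"
doc_b = "the horse rides the knight"

# Under the term-document model these two are indistinguishable
print(bag_of_words(doc_a) == bag_of_words(doc_b))  # True
print(bag_of_words(doc_a).most_common(1))          # [('the', 2)]
```

Who rides whom is lost, which is exactly the information this model gives up.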

To start with, let's import NLTK and load a document from their toy corpus.

In [None]:
import nltk
nltk.download('webtext')
document = nltk.corpus.webtext.open('grail.txt').read()

Let's see what's in this document:

In [None]:
print(document[:1000])

It looks like we've gotten ourselves a bit of the script from *Monty Python and the Holy Grail*. Note that when we are looking at the text, part of the structure of the document is written in tokens. For example, stage directions have been placed in brackets, and the name of the person speaking is in all caps.

## Regular expressions

If we wanted to pull out all of the stage directions for analysis, or just King Arthur's lines, doing so with base Python string processing would be cumbersome. Instead, we are going to use regular expressions. Regular expressions are a method for string manipulation that matches patterns instead of literal bytes.

In [None]:
import re
snippet = document.split("\n")[8]
print(snippet)

In [None]:
re.search(r'coconuts', snippet)

Just like with `str.find`, we can search for plain text. But `re` also gives us the option of searching for patterns of characters, like only alphabetic characters.

In [None]:
re.search(r'[a-z]', snippet)

In this case, we've told `re` to search for the first character that is a lowercase letter between `a` and `z`. We could instead find a lowercase letter that comes right before an exclamation mark by adding a bang to the end of the pattern.

In [None]:
re.search(r'[a-z]!', snippet)

There are two things happening here:

1. `[` and `]` do not mean 'bracket'; they are special characters which mean 'anything of this class'
2. we've only matched one letter each

`re` is flexible about how you specify repetition: you can match none, some, a range, or any number of repetitions of a character, group, or character class.

character | meaning
----------|--------
`{x}`     | exactly x repetitions
`{x,y}`   | between x and y repetitions
`?`       | 0 or 1 repetition
`*`       | 0 or many repetitions
`+`       | 1 or many repetitions
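
Here is a small sketch of each quantifier from the table in action. The test string is invented for this example; only the standard `re` module is used.

```python
import re

# Made-up string: "N" followed by 1, 3, 0 and 6 copies of "i"
text = "Ni! Niii! N! Niiiiii!"

exactly_three = re.findall(r'Ni{3}!', text)    # {x}: exactly 3 'i's
one_to_three = re.findall(r'Ni{1,3}!', text)   # {x,y}: between 1 and 3 'i's
zero_or_one = re.findall(r'Ni?!', text)        # ?: 0 or 1 'i'
zero_or_many = re.findall(r'Ni*!', text)       # *: 0 or many 'i's
one_or_many = re.findall(r'Ni+!', text)        # +: 1 or many 'i's

print(exactly_three)  # ['Niii!']
print(one_to_three)   # ['Ni!', 'Niii!']
print(zero_or_one)    # ['Ni!', 'N!']
print(zero_or_many)   # ['Ni!', 'Niii!', 'N!', 'Niiiiii!']
print(one_or_many)    # ['Ni!', 'Niii!', 'Niiiiii!']
```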

Part of the power of regular expressions is their special characters. Common ones that you'll see are:

character | meaning
----------|--------
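`.`       | match anything except a newline
`^`       | match the start of the string (or of each line with `re.MULTILINE`)
`$`       | match the end of the string (or of each line with `re.MULTILINE`)
`\s`      | match any whitespace character, including newlines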
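
A quick sketch of these special characters, using a made-up two-line script rather than the real corpus. Note that in Python, `^` and `$` only anchor to individual lines when the `re.MULTILINE` flag is passed; otherwise they anchor to the whole string.

```python
import re

# Invented two-line snippet in the style of the script
script = "ARTHUR: Whoa there!\nSOLDIER: Halt! Who goes there?"

# Without flags, ^ only matches at the very start of the string
first_char = re.findall(r'^.', script)

# With re.MULTILINE, ^ and $ match at the start and end of every line
line_starts = re.findall(r'^.', script, flags=re.MULTILINE)
line_ends = re.findall(r'.$', script, flags=re.MULTILINE)

# \s matches spaces and newlines alike
pieces = re.split(r'\s+', script)

print(first_char)   # ['A']
print(line_starts)  # ['A', 'S']
print(line_ends)    # ['!', '?']
print(pieces)       # ['ARTHUR:', 'Whoa', 'there!', 'SOLDIER:', 'Halt!', 'Who', 'goes', 'there?']
```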

What if we wanted to grab all of Arthur's speech without grabbing the name `ARTHUR` itself?

If we wanted to do this using base string manipulation, we would need to do something like:

```
split the document into lines
create a new list of just lines that start with ARTHUR
create a newer list with ARTHUR removed from the front of each element
```
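
For comparison, those three steps might look like this in plain Python. This is only a sketch on an invented three-line script (the variable names are ours), and it assumes every speaker label is followed by exactly `": "`.

```python
# Invented stand-in for the real document
script = "ARTHUR: Whoa there!\nSOLDIER: Halt!\nARTHUR: It is I, Arthur."

prefix = "ARTHUR: "

# 1. split the document into lines
lines = script.split("\n")

# 2. keep just the lines that start with ARTHUR
arthur_lines = [line for line in lines if line.startswith(prefix)]

# 3. remove ARTHUR from the front of each element
arthur_speech = [line[len(prefix):] for line in arthur_lines]

print(arthur_speech)  # ['Whoa there!', 'It is I, Arthur.']
```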

Regex gives us a way of doing this in one line, by using something called groups. Groups are pieces of a pattern that can be ignored, negated, or given names for later retrieval.

character | meaning
----------|--------
`(x)`     | match x
`(?:x)`   | match x but don't capture it
`(?P<x>...)` | match something and give the group the name x
`(?=x)`   | match only if string is followed by x
`(?!x)`   | match only if string is not followed by x

In [None]:
re.findall(r'(?:ARTHUR: )(.+)', document)[0:10]

Because we are using `findall`, the regex engine is capturing and returning the normal groups, but not the non-capturing group. For complicated, multi-piece regular expressions, you may need to pull groups out separately. You can do this with names.

In [None]:
p = re.compile(r'(?P<name>[A-Z ]+)(?::)(?P<line>.+)')
match = re.search(p, document)
print(match)

In [None]:
print(match.group('name'))
print(match.group('line'))
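
The lookahead forms from the groups table haven't appeared in an example yet, so here is a small sketch of both. The script below is invented for illustration. A lookahead checks what comes next without including it in the match.

```python
import re

# Invented stand-in for the real document
script = "ARTHUR: Whoa there!\nSOLDIER: Halt!\nARTHUR: It is I, Arthur."

# (?=:) keeps runs of capitals only if a colon follows; the colon is not returned
speakers = re.findall(r'[A-Z]+(?=:)', script)

# (?!:) keeps whole all-caps words only if a colon does NOT follow
shouted = re.findall(r'\b[A-Z]+\b(?!:)', script)

print(speakers)  # ['ARTHUR', 'SOLDIER', 'ARTHUR']
print(shouted)   # ['I']
```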

## Challenge 1: Regex parsing

Use the regex pattern `p` above to print the `set` of unique character names in *Monty Python*:

In [None]:
matches = re.findall(p, document)
chars = set([x[0] for x in matches])
print(chars)
print(len(chars))

You should have 84 different characters.

Now use the `set` you made above to gather all dialogue into a character `dictionary`, with the keys being the character name and the value being a list of dialogues:

In [None]:
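# Option 1: search the document again for each character's lines
char_dict = {}
for c in chars:
    # re.escape guards against names containing regex special characters
    char_dict[c] = re.findall(r'(?:' + re.escape(c) + ': )(.+)', document)

# Option 2: reuse the (name, line) tuples already captured in `matches`
char_dict_2 = {}
for c in chars:
    char_dict_2[c] = [x[1] for x in matches if x[0] == c]

# Not actually the same: the second way doesn't match both the long and short versions of names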

In [None]:
char_dict["ARTHUR"]

## Tokenizing

Let's grab Arthur's speech from above, and see what we can learn about Arthur from it.

In [None]:
arthur = ' '.join(char_dict["ARTHUR"])
arthur[0:100]

In our model for natural language, we're interested in words. The document is currently a continuous string of bytes, which isn't ideal.

The practice of pulling apart a continuous string into units is called "tokenizing", and it creates "tokens". NLTK, the canonical library for NLP in Python, has a couple of implementations for tokenizing a string into sentences, and sentences into words.

In [None]:
nltk.download('punkt')
from nltk import word_tokenize, sent_tokenize
word_tokenize(snippet)

Look at what happened to "You're". It's been separated into "You" and "'re", which keeps with the way contractions work in English. While we know we could just use `snippet.split()` to split on white space, or write a complicated regex, word tokenizers allow for a more accurate representation of words based on additional rules.

Word tokenizers also separate punctuation from words, so unlike splitting on whitespace, they won't treat `there!` and `there` as different words.
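
To see the difference without any downloads, here is a rough sketch comparing `str.split` with a crude regex tokenizer. The regex is our own stand-in, not the algorithm NLTK uses, and the sentence is made up.

```python
import re

sentence = "Whoa there! You're using coconuts!"

# Splitting on whitespace leaves punctuation glued to the words
by_whitespace = sentence.split()

# Crude tokenizer: runs of word characters, or single punctuation marks
by_regex = re.findall(r"\w+|[^\w\s]", sentence)

print(by_whitespace)  # ['Whoa', 'there!', "You're", 'using', 'coconuts!']
print(by_regex)       # ['Whoa', 'there', '!', 'You', "'", 're', 'using', 'coconuts', '!']
```

The regex version frees `there` from its exclamation mark, but it also chops `You're` into three pieces, which is the kind of case the extra rules in a real word tokenizer are there to handle.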

At this point, we can start asking questions like what are the most common words, how many unique words are there, and what words tend to occur together.

In [None]:
tokens = word_tokenize(arthur)
len(tokens), len(set(tokens))

So we can see right away that Arthur is using the same words a whole bunch - on average, each unique word is used four times. This is typical of natural language. 

> Not necessarily the value, but that the number of unique words in any corpus increases much more slowly than the total number of words.

> A corpus with 100M tokens, for example, probably only has 100,000 unique tokens in it.
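
The ratio of unique words to total words is often called the type-token ratio. Here is a minimal sketch on a made-up sentence; the function name is ours.

```python
def type_token_ratio(tokens):
    """Unique tokens (types) divided by total tokens."""
    return len(set(tokens)) / len(tokens)

# Invented example, split naively on whitespace
toy_tokens = "we are the knights who say ni we are the keepers of the sacred words".split()

print(len(toy_tokens), len(set(toy_tokens)))  # 15 11
print(type_token_ratio(toy_tokens))
```

On a text this short the ratio is still high; it tends to drop as the text gets longer and more words are repeated.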

For more complicated metrics, it's easier to use NLTK's classes and methods.

In [None]:
from nltk import collocations
fd = collocations.FreqDist(tokens)
fd.most_common()[:10]

That's not so interesting. A common step in text analysis is to remove noise. *However*, what you deem "noise" is not only very important but also very subjective. For the purposes of today, we will discuss two common categories of strings often considered "noise".

- Punctuation: While important for sentence analysis, punctuation will get in the way of word frequency and n-gram analyses. It will also affect any clustering or topic modeling.

- Stopwords: Stopwords are the most frequent words in any given language. Words like "the", "a", "that", etc. are considered not semantically important, and would also skew any frequency or n-gram analysis.
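
As a rough sketch of both filters, the snippet below drops punctuation and stopwords from an already-tokenized list. The stopword set here is a tiny hand-picked stand-in and `remove_noise` is a made-up name; in practice you would use a fuller list such as NLTK's.

```python
from string import punctuation

# Tiny illustrative stopword set, NOT a real stopword list
TOY_STOPWORDS = {"the", "a", "that", "is", "of", "and"}

def remove_noise(tokens, stopwords=TOY_STOPWORDS):
    """Drop tokens that are pure punctuation or (case-insensitive) stopwords."""
    kept = []
    for tok in tokens:
        if all(ch in punctuation for ch in tok):
            continue  # token consists only of punctuation characters
        if tok.lower() in stopwords:
            continue  # token is a stopword
        kept.append(tok)
    return kept

toy_tokens = ["The", "Knights", "of", "the", "Round", "Table", "!"]
print(remove_noise(toy_tokens))  # ['Knights', 'Round', 'Table']
```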

## Challenge 2: Removing noise

Write a function below that takes a string as an argument and returns a list of words without punctuation or stopwords:

In [None]:
nltk.download("stopwords")
def rem_punc_stop(text_string):
    from string import punctuation
    from nltk.corpus import stopwords
    # YOUR CODE HERE
    

Now we can rerun our frequency analysis without the noise:

In [None]:
tokens_reduced = rem_punc_stop(arthur)
fd2 = collocations.FreqDist(tokens_reduced)
fd2.most_common()[:10]

We can also look at collocations:

In [None]:
measures = collocations.BigramAssocMeasures()
c = collocations.BigramCollocationFinder.from_words(tokens_reduced)
c.nbest(measures.pmi, 10)

In [None]:
c.nbest(measures.likelihood_ratio, 10)

We see here that the collocation finder is pulling out some things that have face validity. When Arthur is talking about peasants, he calls them "bloody" more often than not. However, collocations like "Brother Maynard" and "BLACK KNIGHT" are less informative to us, because we know that they are proper names.
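
To give a feel for what a score like PMI (pointwise mutual information) measures, here is a simplified sketch: the log of how much more often a pair occurs than we would expect if its two words were independent. This is a bare-bones version written for this example, and NLTK's implementation may differ in its details.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Simplified PMI: log2( P(w1, w2) / (P(w1) * P(w2)) ) for each adjacent pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Invented token list
toy_tokens = "bloody peasant bloody peasant old woman old man".split()
scores = bigram_pmi(toy_tokens)

# The pair that always occurs together outscores an incidental neighbour
print(scores[("bloody", "peasant")] > scores[("peasant", "old")])  # True
```

PMI rewards pairs whose words rarely appear apart, which is one reason proper names tend to rank highly.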

## Part of Speech Tagging

Many applications require text to be in the form of a list of sentences. NLTK's `sent_tokenize` should do the trick:

In [None]:
sents = sent_tokenize(arthur)
sents[0:10]

A common step in the NLP pipeline is tagging for part of speech, which can help begin to rectify our "bag of words" approach by retaining some idea of syntax. While training a POS tagger is a workshop in itself, NLTK also provides a trained tagger for us:

In [None]:
nltk.download("averaged_perceptron_tagger")
from nltk import pos_tag

toks_and_sents = [word_tokenize(s) for s in sent_tokenize(arthur)]
tagged_sents = [pos_tag(s) for s in toks_and_sents]

print()
print(tagged_sents[4])

## Challenge 3: POS Frequency

Create a frequency distribution for Arthur's parts of speech:

In [None]:
def freq_pos(test_string):
    toks = [word_tokenize(s) for s in sent_tokenize(test_string)]
    tagged_sents = [pos_tag(s) for s in toks]
    freqs_pos = {}
    for s in tagged_sents:
        for word in s:
            if word[1] in freqs_pos:
                freqs_pos[word[1]] = freqs_pos[word[1]] + 1
            else:
                freqs_pos[word[1]] = 1

    return freqs_pos

print(freq_pos(arthur))

## Stemming and Lemmatizing

In NLP it is often the case that the specific form of a word is not as important as the idea to which it refers. For example, if you are trying to identify the topic of a document, counting 'running', 'runs', 'ran', and 'run' as four separate words is not useful. Reducing words to their stems is a process called stemming.

A popular stemming implementation is the Snowball Stemmer, which is based on the Porter Stemmer. Its algorithm looks at word forms and does things like drop final 's's, 'ed's, and 'ing's.
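
To show the flavor of suffix stripping, here is a deliberately naive toy. It is not the Snowball or Porter algorithm, just a three-suffix sketch with a made-up name, and its mistakes hint at why real stemmers carry many more rules.

```python
def toy_stem(word):
    """Strip one of a few common suffixes, keeping at least three letters."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["running", "walked", "eats", "run"]])
# ['runn', 'walk', 'eat', 'run']
```

Notice `runn`: a real stemmer also needs a rule for the doubled consonant.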

Just like the tokenizers, we first have to create a stemmer object with the language we are using.

In [None]:
snowball = nltk.SnowballStemmer('english')

Now, we can try stemming some words

In [None]:
snowball.stem('running')

In [None]:
snowball.stem('eats')

In [None]:
snowball.stem('embarassed')

Snowball is a very fast algorithm, but it has a lot of edge cases. In some cases, words with the same stem are reduced to two different stems.

In [None]:
snowball.stem('cylinder'), snowball.stem('cylindrical')

In other cases, two different words are reduced to the same stem.

> This is sometimes referred to as a 'collision'

In [None]:
snowball.stem('vacation'), snowball.stem('vacate')

A more accurate approach is to use an English word bank like WordNet to call dictionary lookups on word forms, in a process called lemmatization.

In [None]:
nltk.download('wordnet')
wordnet = nltk.WordNetLemmatizer()

In [None]:
wordnet.lemmatize('vacation'), wordnet.lemmatize('vacate')

In [None]:
tok_red_lem = [snowball.stem(w) for w in tokens_reduced]
fd3 = collocations.FreqDist(tok_red_lem)
fd3.most_common()[:15]

# Part 2: High-level analysis

The rest of this class will focus on high-level analyses, which do most of what we just covered for you, often in one quick step. It is important to remember that these tools still perform the steps above first. To interpret your analysis correctly, remember that at some point the computer decided certain things weren't important!

## Sentiment

Frequently, we are interested in text to learn something about the person who is speaking. One of these things we've talked about already - linguistic diversity. A similar metric was used a couple of years ago to settle the question of who has the [largest vocabulary in Hip Hop](http://poly-graph.co/vocabulary.html).

> Unsurprisingly, top spots go to Canibus, Aesop Rock, and the Wu Tang Clan. E-40 is also in the top 20, but mostly because he makes up a lot of words; as are OutKast, who print their lyrics with words slurred in the actual typography

Another thing we can learn is about how the speaker is feeling, with a process called sentiment analysis. Before we start, be forewarned that this is not a robust method by any stretch of the imagination. Sentiment classifiers are often trained on product reviews, which limits their ecological validity.

We're going to use TextBlob's built-in sentiment classifier, because it is super easy.

In [None]:
from textblob import TextBlob

In [None]:
blob = TextBlob(arthur)

To check the polarity of a string, we can just iterate through Arthur's sentences:

In [None]:
net_pol = 0
for sentence in blob.sentences:
    pol = sentence.sentiment.polarity
    print(pol, sentence)
    net_pol += pol
print()
print("Net polarity of Arthur: ", net_pol)

What's happening behind the scenes? While there are new algorithms for sentiment analysis emerging (cf. `VADER`), most algorithms currently rely only on a `dictionary` of words, each with a corresponding `positive`, `negative`, or `neutral` label. Based on all the words in a sentence, a value is calculated for the sentence as a whole. Not super fancy, I know. Of course, you can change the `dictionary` used in the library itself, or opt for more advanced algorithms that aim to capture context.
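
As a sketch of the dictionary idea, here is a toy scorer that averages the scores of the words it recognizes. The lexicon and its values are invented for this example; this is not TextBlob's lexicon or its exact scoring rule.

```python
# Invented word scores: positive > 0, negative < 0
TOY_LEXICON = {"brave": 1.0, "good": 0.5, "silly": -0.5, "bloody": -1.0}

def toy_polarity(sentence, lexicon=TOY_LEXICON):
    """Average the scores of known words; unknown words are ignored."""
    words = sentence.lower().split()
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("brave sir robin"))           # 1.0
print(toy_polarity("you silly bloody peasant"))  # -0.75
print(toy_polarity("bring out your dead"))       # 0.0
```

The last sentence scores as neutral only because none of its words are in the lexicon, which is a good reminder of how limited this approach is.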

## Challenge 4: Sentiment

How about we look at all characters? Create an empty list `collected_stats` and iterate through `char_dict`, calculate the net polarity of each character, and append a tuple of e.g. `(ARTHUR, 11.45)` back to `collected_stats`:

In [None]:
collected_stats = []
# YOUR CODE HERE


Now `sort` this list of tuples by polarity, and print the list of characters in *Monty Python* according to their sentiment:

In [None]:
# YOUR CODE HERE


## Topic Modeling

Another common NLP task is topic modeling. The math behind this is beyond the scope of this course, but the basic strategy is to represent each document as a one-dimensional array, where the indices correspond to integer ids of tokens in the document. Then, some measure of semantic similarity, like the cosine of the angle between unitized versions of the document vectors, is calculated. Finally, distinct topics are identified as the themes driving certain groups of documents. The result is a list of `n` topics with the driving words for each topic, and a list of documents with their relation to each topic (how strongly a document fits that topic).
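
The cosine measure mentioned above can be sketched with the standard library alone. The three toy documents are made up, and this shows only the similarity step, not a topic model.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Invented documents
doc1 = "knight sword horse knight".split()
doc2 = "knight horse castle".split()
doc3 = "coconut swallow coconut".split()

print(round(cosine_similarity(doc1, doc2), 3))  # 0.707
print(cosine_similarity(doc1, doc3))            # 0.0
```

Documents that share vocabulary score closer to 1, and documents with no words in common score 0.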

Let's run a topic model on the characters of *Monty Python*.

Luckily for us there is another Python library that takes care of the heavy lifting for us.

In [None]:
from gensim import corpora, models, similarities

First we need to separate the speeches and people, but keep them ordered so we index correctly when done. For the speeches, we'll need all speech as one string, then tokenized. We also need to remove punctuation and stop words so that Python can identify the words that are important to each document. It seems we've gotten lucky again: we already wrote *rem_punc_stop*!

In [None]:
people = []
speeches = []
for k,v in char_dict.items():
    people.append(k)
    new_string = ' '.join(v)  # join all dialogue pieces
    toks = rem_punc_stop(new_string)  # remove punctuation and stop words, and tokenize
    stems = [snowball.stem(tok) for tok in toks]  # change words to stems
    speeches.append(stems)

Now we create the dictionary of words used to create the matrices, and set thresholds for word frequencies within the corpus:

In [None]:
#create a Gensim dictionary from the texts
dictionary = corpora.Dictionary(speeches)

#remove extremes (similar to the min/max df step used when creating the tf-idf matrix)
#no_below is absolute # of docs, no_above is fraction of corpus
dictionary.filter_extremes(no_below=2, no_above=.70)

#convert the dictionary to a bag of words corpus for reference
corpus = [dictionary.doc2bow(i) for i in speeches]

Finally we set the parameters for the LDA topic modelling (other algorithms such as LSI do exist, but we won't get into the differences today):

In [None]:
#we run chunks of 15 documents, and update after every 2 chunks, and make 10 passes
lda = models.LdaModel(corpus, num_topics=6, 
                            update_every=2,
                            id2word=dictionary, 
                            chunksize=15, 
                            passes=10)

lda.show_topics()

To match characters to their topics we just index the corpus:

In [None]:
# the LDA model was trained on bag-of-words counts, so apply it to the same representation
corpus_lda = lda[corpus]
for i, doc in enumerate(corpus_lda):  # the bow -> LDA transformation is executed here, on the fly
    print(people[i], doc)
    print()

## Word embeddings and word2vec

Word embeddings are the first successful attempt to move away from the "bag of words" model of language. Instead of looking at word frequencies, and vocabulary usage, word embeddings aim to retain syntactic information. Generally, a word2vec model *will not* remove stopwords or punctuation, because they are vital to the model itself.

As a first step, word2vec changes each tokenized sentence into a sequence of numbers, with each unique token being assigned its own number.

e.g.:

~~~
[["I", "like", "coffee", "."], ["I", "like", "my", "coffee", "without", "sugar", "."]]
~~~

is transformed to:

~~~
[[43, 75, 435, 98], [43, 75, 10, 435, 31, 217, 98]]
~~~

Notice, the "I"s, the "likes", the "coffees", and the "."s, all have the same assignment.
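
A minimal sketch of this token-to-number step is below. Here ids are handed out in order of first appearance, so the actual numbers differ from the illustrative ones above; how a real library assigns ids is an implementation detail we are not reproducing.

```python
sentences = [["I", "like", "coffee", "."],
             ["I", "like", "my", "coffee", "without", "sugar", "."]]

vocab = {}    # token -> integer id
encoded = []  # the same sentences, as lists of ids

for sentence in sentences:
    ids = []
    for token in sentence:
        if token not in vocab:
            vocab[token] = len(vocab)  # next unused id
        ids.append(vocab[token])
    encoded.append(ids)

print(encoded)  # [[0, 1, 2, 3], [0, 1, 4, 2, 5, 6, 3]]
```

As before, every occurrence of the same token gets the same id.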

The model is then built from these numbers: every word is mapped to a high-dimensional vector based on its surrounding words, creating a sort of "cloud" of words in which words used in a similar syntactic, and often semantic, fashion cluster closer together.

One of the drawbacks of word2vec is the volume of data necessary for a decent analysis. We will read in a copy of the King James Bible and hope it provides enough data. It then needs to be broken into sentences and tokenized:

In [None]:
with open("King_James_Bible.txt", "r") as f:
    bible = f.read()

from nltk.tokenize import sent_tokenize

bible = sent_tokenize(bible)
bible = [word_tokenize(s) for s in bible]

In [None]:
bible[10]

Now we can actually train the model on the language of the Bible:

In [None]:
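import gensim
# gensim >= 4.0 names the dimensionality `vector_size`; older releases call it `size`
# passing the sentences to the constructor builds the vocabulary and trains the model
model = gensim.models.word2vec.Word2Vec(bible, vector_size=300, window=5, min_count=5, workers=4)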

Once the model is trained, we can look at how words are situated in this cloud:

In [None]:
model.wv.most_similar('man')

In [None]:
model.wv.most_similar('woman')

We can even create little equations. What would this be?

KING + WOMAN - MAN = ?

In [None]:
model.wv.most_similar(positive=['king', 'woman'], negative=['man'])
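
To see roughly what this kind of query does, here is a toy sketch with hand-picked two-dimensional vectors. These vectors are invented rather than learned, and the arithmetic is a simplification of what gensim does internally: add the positive vectors, subtract the negative ones, and return the nearest remaining word by cosine similarity.

```python
import math

# Hand-picked toy vectors (NOT learned embeddings)
TOY_VECTORS = {
    "man": (1.0, 0.0),
    "woman": (1.0, 1.0),
    "king": (3.0, 0.0),
    "queen": (3.0, 1.0),
    "castle": (0.0, 3.0),
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(positive, negative, vectors=TOY_VECTORS):
    """Return the word closest to sum(positive) - sum(negative), excluding the query words."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for w in positive:
        target = [t + x for t, x in zip(target, vectors[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vectors[w])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy(["king", "woman"], ["man"]))  # queen
```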

## Challenge 5: word2vec

Play around with the word2vec model above and try to put into words exactly what the model does, and how one should interpret the results. How would you contrast this with the "bag of words" model?