# Intro to Natural Language Processing with Python

## Info
- Scott Bailey (CIDR), *scottbailey@stanford.edu*
- Javier de la Rosa (CIDR), *versae@stanford.edu*
- Ashley Jester (CIDR/SSDS), *ajester@stanford.edu*
- Green Library 121A, 2pm-4pm

## What are we covering today?
- What is NLP?
- Options for NLP in Python
- Tokenization
- Part of Speech Tagging
- Word transformations (lemmatization, pluralization)
- Sentiment Analysis
- Readability indices

## What is NLP

NLP stands for Natural Language Processing, and it covers a wide variety of tasks, such as:
- Automatic summarization.
- Coreference resolution.
- Discourse analysis.
- Machine translation.
- Morphological segmentation.
- Named entity recognition.
- Natural language understanding.
- Part-of-speech tagging.
- Parsing.
- Question answering.
- Relationship extraction.
- Sentiment analysis.
- Speech recognition.
- Topic segmentation.
- Word segmentation.
- Word sense disambiguation.
- Information retrieval.
- Information extraction.
- Speech processing.

One of the key ideas is to be able to process text without reading it.

## NLP in Python

Python ships with a mature regular expression module, `re`, which is a basic building block for many natural language processing tasks. More advanced tasks, however, need dedicated libraries. Traditionally, the Natural Language Toolkit, abbreviated as `NLTK`, was the only practical choice in the Python ecosystem. Unfortunately, the library has not aged well: although it has been updated to work with newer versions of Python, it does not provide the speed we might need to process large corpora.
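As a rough illustration of regular expressions as a building block, here is a minimal sketch that splits a sentence into word-like tokens using only the standard library. The pattern `\w+` is an assumption chosen for simplicity; it is not how NLTK or spaCy tokenize.

```python
import re

sentence = "We have built a new economy."

# \w+ matches runs of letters, digits, and underscores, so punctuation is dropped
tokens = re.findall(r"\w+", sentence)
print(tokens)
```

A pattern this simple breaks down quickly on contractions, hyphenated words, and abbreviations, which is one reason dedicated libraries exist.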

A more recent alternative is `spaCy`, which is much faster because it is written in Cython, a Python-like language that compiles to C and is optimized for speed.

Both of these libraries are complex, so wrappers exist that simplify their APIs. The two most popular are `Textblob`, which wraps NLTK and the CLiPS Pattern library, and `textacy`, which wraps spaCy. In this workshop we will use Textblob, since it is more established and mature and provides everything we need to start learning some basic NLP tasks.

In [1]:
from textblob import TextBlob

In [2]:
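# Helper functions
import requests
from urllib.request import urlopen

def get_page(url):
    """Download `url` and return its contents as text."""
    try:
        return requests.get(url).text
    except Exception:
        # Fall back to the standard library if the requests call fails
        return urlopen(url).read().decode("utf8")

def get_speech(url):
    """Return the speech at `url` as one string, skipping its first two lines."""
    page = get_page(url)
    full_text = page.split('\n')
    return " ".join(full_text[2:])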

In [3]:
clinton_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/nlp/data/clinton2000.txt"
clinton_speech = get_speech(clinton_url)
clinton_speech

'Mr. Speaker, Mr. Vice President, members of Congress, honored guests, my fellow Americans:  We are fortunate to be alive at this moment in history. Never before has our nation enjoyed, at once, so much prosperity and social progress with so little internal crisis and so few external threats. Never before have we had such a blessed opportunity and, therefore, such a profound obligation to build the more perfect Union of our Founders’ dreams.  We begin the new century with over 20 million new jobs; the fastest economic growth in more than 30 years; the lowest unemployment rates in 30 years; the lowest poverty rates in 20 years; the lowest African-American and Hispanic unemployment rates on record; the first back-to-back surpluses in 42 years; and next month, America will achieve the longest period of economic growth in our entire history. We have built a new economy.  And our economic revolution has been matched by a revival of the American spirit: crime down by 20 percent, to its lowes

In [4]:
clinton_blob = TextBlob(clinton_speech[:446])
clinton_blob.string == clinton_speech[:446]

True

## Tokenization

In NLP, the act of splitting text is called tokenization, and each of the individual chunks is called a token. Therefore, we can talk about word tokenization or sentence tokenization depending on what it is that we need to divide the text into.

In [5]:
clinton_blob.words

WordList(['Mr', 'Speaker', 'Mr', 'Vice', 'President', 'members', 'of', 'Congress', 'honored', 'guests', 'my', 'fellow', 'Americans', 'We', 'are', 'fortunate', 'to', 'be', 'alive', 'at', 'this', 'moment', 'in', 'history', 'Never', 'before', 'has', 'our', 'nation', 'enjoyed', 'at', 'once', 'so', 'much', 'prosperity', 'and', 'social', 'progress', 'with', 'so', 'little', 'internal', 'crisis', 'and', 'so', 'few', 'external', 'threats', 'Never', 'before', 'have', 'we', 'had', 'such', 'a', 'blessed', 'opportunity', 'and', 'therefore', 'such', 'a', 'profound', 'obligation', 'to', 'build', 'the', 'more', 'perfect', 'Union', 'of', 'our', 'Founders’', 'dreams'])

In [6]:
clinton_blob.sentences

[Sentence("Mr. Speaker, Mr. Vice President, members of Congress, honored guests, my fellow Americans:  We are fortunate to be alive at this moment in history."),
 Sentence("Never before has our nation enjoyed, at once, so much prosperity and social progress with so little internal crisis and so few external threats."),
 Sentence("Never before have we had such a blessed opportunity and, therefore, such a profound obligation to build the more perfect Union of our Founders’ dreams.")]
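To see why sentence tokenization needs more than splitting on periods, here is a deliberately naive regex splitter. It is a sketch for comparison only, not the method Textblob uses, and the sample text is a shortened adaptation of the speech.

```python
import re

text = "Mr. Speaker, Mr. Vice President, we are fortunate. Never before has our nation enjoyed so much."

# Split wherever whitespace follows a period, question mark, or exclamation point
naive_sentences = re.split(r"(?<=[.!?])\s+", text)
for s in naive_sentences:
    print(repr(s))
```

The naive splitter treats each `Mr.` as a sentence boundary and returns four pieces for two sentences, whereas the Textblob output above keeps the opening salutation in one sentence.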

In [7]:
clinton_blob.noun_phrases

WordList(['mr.', 'mr.', 'vice president', 'congress', 'never', 'social progress', 'internal crisis', 'external threats', 'never', 'profound obligation', 'perfect union', 'founders’'])

A special way of dividing text into tuples of consecutive words or letters is usually referred to as n-grams.

In [8]:
clinton_blob.ngrams(n=3)

[WordList(['Mr', 'Speaker', 'Mr']),
 WordList(['Speaker', 'Mr', 'Vice']),
 WordList(['Mr', 'Vice', 'President']),
 WordList(['Vice', 'President', 'members']),
 WordList(['President', 'members', 'of']),
 WordList(['members', 'of', 'Congress']),
 WordList(['of', 'Congress', 'honored']),
 WordList(['Congress', 'honored', 'guests']),
 WordList(['honored', 'guests', 'my']),
 WordList(['guests', 'my', 'fellow']),
 WordList(['my', 'fellow', 'Americans']),
 WordList(['fellow', 'Americans', 'We']),
 WordList(['Americans', 'We', 'are']),
 WordList(['We', 'are', 'fortunate']),
 WordList(['are', 'fortunate', 'to']),
 WordList(['fortunate', 'to', 'be']),
 WordList(['to', 'be', 'alive']),
 WordList(['be', 'alive', 'at']),
 WordList(['alive', 'at', 'this']),
 WordList(['at', 'this', 'moment']),
 WordList(['this', 'moment', 'in']),
 WordList(['moment', 'in', 'history']),
 WordList(['in', 'history', 'Never']),
 WordList(['history', 'Never', 'before']),
 WordList(['Never', 'before', 'has']),
 WordList([

In [9]:
clinton_blob.ngrams(n=5)

[WordList(['Mr', 'Speaker', 'Mr', 'Vice', 'President']),
 WordList(['Speaker', 'Mr', 'Vice', 'President', 'members']),
 WordList(['Mr', 'Vice', 'President', 'members', 'of']),
 WordList(['Vice', 'President', 'members', 'of', 'Congress']),
 WordList(['President', 'members', 'of', 'Congress', 'honored']),
 WordList(['members', 'of', 'Congress', 'honored', 'guests']),
 WordList(['of', 'Congress', 'honored', 'guests', 'my']),
 WordList(['Congress', 'honored', 'guests', 'my', 'fellow']),
 WordList(['honored', 'guests', 'my', 'fellow', 'Americans']),
 WordList(['guests', 'my', 'fellow', 'Americans', 'We']),
 WordList(['my', 'fellow', 'Americans', 'We', 'are']),
 WordList(['fellow', 'Americans', 'We', 'are', 'fortunate']),
 WordList(['Americans', 'We', 'are', 'fortunate', 'to']),
 WordList(['We', 'are', 'fortunate', 'to', 'be']),
 WordList(['are', 'fortunate', 'to', 'be', 'alive']),
 WordList(['fortunate', 'to', 'be', 'alive', 'at']),
 WordList(['to', 'be', 'alive', 'at', 'this']),
 WordList(
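The sliding-window idea behind n-grams can be sketched in plain Python. The `ngrams` function below is a hypothetical helper written for illustration, not Textblob's implementation.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive items as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = ["Mr", "Speaker", "Mr", "Vice", "President"]
word_trigrams = ngrams(words, 3)
print(word_trigrams)

# Letter n-grams work the same way on a string
letter_bigrams = ngrams("Vice", 2)
print(letter_bigrams)
```

A sequence of `k` tokens yields `k - n + 1` n-grams, and none at all when `n` is larger than `k`.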

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Write a function `count_chars(text)` that receives `text` and returns the total number of characters ignoring spaces and punctuation marks. For example, `count_chars("Well, I am not 30 years old.")` should return `20`.
<br/>
* **Hint**: You could count the characters in the words.*
</p>
</div>

In [10]:
def count_chars(text):
    return sum(len(w) for w in TextBlob(text).words)

count_chars("Well, I am not 30 years old.")

20

## Part of Speech Tagging

Textblob also lets you perform part-of-speech (POS) tagging, which labels each word with its grammatical category, out of the box. By default it uses the Penn Treebank tag set, but other taggers can be plugged in using NLTK classes.

In [11]:
clinton_blob.tags

[('Mr.', 'NNP'),
 ('Speaker', 'NNP'),
 ('Mr.', 'NNP'),
 ('Vice', 'NNP'),
 ('President', 'NNP'),
 ('members', 'NNS'),
 ('of', 'IN'),
 ('Congress', 'NNP'),
 ('honored', 'VBD'),
 ('guests', 'NNS'),
 ('my', 'PRP$'),
 ('fellow', 'JJ'),
 ('Americans', 'NNPS'),
 ('We', 'PRP'),
 ('are', 'VBP'),
 ('fortunate', 'JJ'),
 ('to', 'TO'),
 ('be', 'VB'),
 ('alive', 'JJ'),
 ('at', 'IN'),
 ('this', 'DT'),
 ('moment', 'NN'),
 ('in', 'IN'),
 ('history', 'NN'),
 ('Never', 'NN'),
 ('before', 'IN'),
 ('has', 'VBZ'),
 ('our', 'PRP$'),
 ('nation', 'NN'),
 ('enjoyed', 'VBN'),
 ('at', 'IN'),
 ('once', 'RB'),
 ('so', 'RB'),
 ('much', 'JJ'),
 ('prosperity', 'NN'),
 ('and', 'CC'),
 ('social', 'JJ'),
 ('progress', 'NN'),
 ('with', 'IN'),
 ('so', 'RB'),
 ('little', 'JJ'),
 ('internal', 'JJ'),
 ('crisis', 'NN'),
 ('and', 'CC'),
 ('so', 'RB'),
 ('few', 'JJ'),
 ('external', 'JJ'),
 ('threats', 'NNS'),
 ('Never', 'RB'),
 ('before', 'RB'),
 ('have', 'VBP'),
 ('we', 'PRP'),
 ('had', 'VBD'),
 ('such', 'JJ'),
 ('a', 'DT'),
 (

In [12]:
for word, pos in clinton_blob.tags:
    print(word, pos)

Mr. NNP
Speaker NNP
Mr. NNP
Vice NNP
President NNP
members NNS
of IN
Congress NNP
honored VBD
guests NNS
my PRP$
fellow JJ
Americans NNPS
We PRP
are VBP
fortunate JJ
to TO
be VB
alive JJ
at IN
this DT
moment NN
in IN
history NN
Never NN
before IN
has VBZ
our PRP$
nation NN
enjoyed VBN
at IN
once RB
so RB
much JJ
prosperity NN
and CC
social JJ
progress NN
with IN
so RB
little JJ
internal JJ
crisis NN
and CC
so RB
few JJ
external JJ
threats NNS
Never RB
before RB
have VBP
we PRP
had VBD
such JJ
a DT
blessed JJ
opportunity NN
and CC
therefore RB
such PDT
a DT
profound JJ
obligation NN
to TO
build VB
the DT
more RBR
perfect JJ
Union NNP
of IN
our PRP$
Founders’ NNP
dreams NN


For what these tags mean, you might check out http://www.clips.ua.ac.be/pages/mbsp-tags
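For quick reference while reading the output, a small lookup table can translate the most frequent tags. This is a partial sketch covering only some Penn Treebank tags seen above; `PENN_TAGS` and `describe` are names introduced here for illustration.

```python
# A few common Penn Treebank tags; the full tag set is larger
PENN_TAGS = {
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "NNPS": "proper noun, plural",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBN": "verb, past participle",
    "VBP": "verb, non-3rd person singular present",
    "VBZ": "verb, 3rd person singular present",
    "JJ": "adjective",
    "RB": "adverb",
    "IN": "preposition or subordinating conjunction",
    "DT": "determiner",
    "PRP": "personal pronoun",
    "PRP$": "possessive pronoun",
    "CC": "coordinating conjunction",
    "TO": "to",
}

def describe(tag):
    """Return a human-readable description, or a fallback for unlisted tags."""
    return PENN_TAGS.get(tag, "unknown tag")

for word, tag in [("members", "NNS"), ("are", "VBP"), ("fortunate", "JJ")]:
    print(word, tag, describe(tag))
```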

In [13]:
clinton_blob.parse()

'Mr./NNP/B-NP/O Speaker/NNP/I-NP/O ,/,/O/O Mr./NNP/B-NP/O Vice/NNP/I-NP/O President/NNP/I-NP/O ,/,/O/O members/NNS/B-NP/O of/IN/B-PP/B-PNP Congress/NNP/B-NP/I-PNP ,/,/O/O honored/VBN/B-VP/O guests/NNS/B-NP/O ,/,/O/O my/PRP$/B-NP/O fellow/NN/I-NP/O Americans/NNPS/I-NP/O :/:/O/O We/PRP/B-NP/O are/VBP/B-VP/O fortunate/JJ/B-ADJP/O to/TO/B-PP/O be/VB/B-VP/O alive/JJ/B-ADJP/O at/IN/B-PP/B-PNP this/DT/B-NP/I-PNP moment/NN/I-NP/I-PNP in/IN/B-PP/B-PNP history/NN/B-NP/I-PNP ././O/O\nNever/RB/B-ADVP/O before/IN/B-PP/O has/VBZ/B-VP/O our/PRP$/B-NP/O nation/NN/I-NP/O enjoyed/VBD/B-VP/O ,/,/O/O at/IN/B-PP/O once/RB/B-ADVP/O ,/,/O/O so/RB/B-NP/O much/JJ/I-NP/O prosperity/NN/I-NP/O and/CC/O/O social/JJ/B-NP/O progress/NN/I-NP/O with/IN/B-PP/B-PNP so/RB/B-NP/I-PNP little/JJ/I-NP/I-PNP internal/JJ/I-NP/I-PNP crisis/NN/I-NP/I-PNP and/CC/O/O so/RB/B-NP/O few/JJ/I-NP/O external/JJ/I-NP/O threats/NNS/I-NP/O ././O/O\nNever/RB/B-ADVP/O before/IN/B-PP/O have/VBP/B-VP/O we/PRP/B-NP/O had/VBD/B-VP/O such/JJ/B-AD

## Word transformations

In [14]:
from textblob import Word
w = Word("octopi")
w.lemmatize()

'octopus'

In [15]:
w.lemma

'octopus'

In [16]:
v = Word("is")
v.lemmatize("v")

'be'

In [17]:
for word in clinton_blob.words:
    print(word, word.lemmatize())

Mr Mr
Speaker Speaker
Mr Mr
Vice Vice
President President
members member
of of
Congress Congress
honored honored
guests guest
my my
fellow fellow
Americans Americans
We We
are are
fortunate fortunate
to to
be be
alive alive
at at
this this
moment moment
in in
history history
Never Never
before before
has ha
our our
nation nation
enjoyed enjoyed
at at
once once
so so
much much
prosperity prosperity
and and
social social
progress progress
with with
so so
little little
internal internal
crisis crisis
and and
so so
few few
external external
threats threat
Never Never
before before
have have
we we
had had
such such
a a
blessed blessed
opportunity opportunity
and and
therefore therefore
such such
a a
profound profound
obligation obligation
to to
build build
the the
more more
perfect perfect
Union Union
of of
our our
Founders’ Founders’
dreams dream


In [18]:
for word in clinton_blob.words:
    print(word, word.lemmatize("v"))

Mr Mr
Speaker Speaker
Mr Mr
Vice Vice
President President
members members
of of
Congress Congress
honored honor
guests guests
my my
fellow fellow
Americans Americans
We We
are be
fortunate fortunate
to to
be be
alive alive
at at
this this
moment moment
in in
history history
Never Never
before before
has have
our our
nation nation
enjoyed enjoy
at at
once once
so so
much much
prosperity prosperity
and and
social social
progress progress
with with
so so
little little
internal internal
crisis crisis
and and
so so
few few
external external
threats threats
Never Never
before before
have have
we we
had have
such such
a a
blessed bless
opportunity opportunity
and and
therefore therefore
such such
a a
profound profound
obligation obligation
to to
build build
the the
more more
perfect perfect
Union Union
of of
our our
Founders’ Founders’
dreams dream


In [19]:
for word, pos in clinton_blob.tags:
    if pos == "VBP":
        print(word, word.lemmatize("v"))

are be
have have


In [20]:
for word in clinton_blob.words:
    print(word, word.pluralize())

Mr Mrs
Speaker Speakers
Mr Mrs
Vice Vices
President Presidents
members memberss
of ofs
Congress Congresses
honored honoreds
guests guestss
my our
fellow fellows
Americans Americanss
We Wes
are ares
fortunate fortunates
to toes
be bes
alive alives
at ats
this these
moment moments
in ins
history histories
Never Nevers
before befores
has hass
our ours
nation nations
enjoyed enjoyeds
at ats
once onces
so soes
much muches
prosperity prosperities
and ands
social socials
progress progress
with withs
so soes
little littles
internal internals
crisis crises
and ands
so soes
few fews
external externals
threats threatss
Never Nevers
before befores
have haves
we wes
had hads
such suches
a some
blessed blesseds
opportunity opportunities
and ands
therefore therefores
such suches
a some
profound profounds
obligation obligations
to toes
build builds
the thes
more mores
perfect perfects
Union Unions
of ofs
our ours
Founders’ Founders’s
dreams dreamss


## Counting

In [21]:
clinton_blob.word_counts

defaultdict(int,
            {'a': 2,
             'alive': 1,
             'americans': 1,
             'and': 3,
             'are': 1,
             'at': 2,
             'be': 1,
             'before': 2,
             'blessed': 1,
             'build': 1,
             'congress': 1,
             'crisis': 1,
             'dreams': 1,
             'enjoyed': 1,
             'external': 1,
             'fellow': 1,
             'few': 1,
             'fortunate': 1,
             'founders’': 1,
             'guests': 1,
             'had': 1,
             'has': 1,
             'have': 1,
             'history': 1,
             'honored': 1,
             'in': 1,
             'internal': 1,
             'little': 1,
             'members': 1,
             'moment': 1,
             'more': 1,
             'mr': 2,
             'much': 1,
             'my': 1,
             'nation': 1,
             'never': 2,
             'obligation': 1,
             'of': 2,
             'once': 1,


In [22]:
clinton_blob.word_counts['congress']

1

In [23]:
clinton_blob.words.count('Mr', case_sensitive=True)

2

In [24]:
clinton_blob.noun_phrases.count('internal crisis')

1
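The `word_counts` output above shows lowercased keys, while `words.count(..., case_sensitive=True)` matches the exact form. The difference can be sketched with `collections.Counter` from the standard library; the short word list is taken from the tokenized excerpt, and this is an illustration rather than Textblob's implementation.

```python
from collections import Counter

words = ["Never", "before", "has", "our", "nation", "Never", "before", "have", "we"]

exact_counts = Counter(words)                       # case-sensitive
folded_counts = Counter(w.lower() for w in words)   # lowercased before counting

print(exact_counts["never"], folded_counts["never"])
```

Which behavior is appropriate depends on the question: folding case merges sentence-initial words with their lowercase forms, but it also merges proper nouns with common nouns.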

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Let's define the lexicon of a person as the set of different words she uses to speak. Write a function `get_lexicon(text, n)` that receives `text` and `n` and returns the lemmas of nouns, verbs, and adjectives that are used at least `n` times. For example, `get_lexicon(clinton_speech, 10)` should return

```
{'A',
 'America',
 'New',
 'So',
 'Thank',
 'Tonight',
 'ask',
 'be',
 'child',
 'do',
 'have',
 'help',
 'make',
 'more',
 'new',
 'people',
 'thank',
 'tonight',
 'want',
 'work',
 'year'}
```
<br/>
* **Hint**: In Textblob, when a tag refers to nouns, verbs, or adjectives, the first letter of the tag starts with `n`, `v`, or `j`.*
</p>
</div>

In [25]:
def get_lexicon(text, n):
    blob = TextBlob(text)
    return {word.lemma for word, tag in blob.tags
            if tag[0].lower() in ["n", "j", "v"] and blob.words.count(word) >= n}
    
get_lexicon(clinton_speech, 25)

{'A',
 'America',
 'New',
 'So',
 'Thank',
 'Tonight',
 'ask',
 'be',
 'child',
 'do',
 'have',
 'help',
 'make',
 'more',
 'new',
 'people',
 'thank',
 'tonight',
 'want',
 'work',
 'year'}
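The lemmatizer examples earlier passed `"v"` to get verb lemmas, while the default treated every word as a noun. A small helper could translate Penn Treebank tags into the single-letter codes `n`, `v`, `a`, and `r` (WordNet's part-of-speech codes), so that each word is lemmatized according to its own tag. `penn_to_wordnet` is a hypothetical name, not part of Textblob.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech code, or None."""
    if tag.startswith("NN"):
        return "n"   # nouns
    if tag.startswith("VB"):
        return "v"   # verbs
    if tag.startswith("JJ"):
        return "a"   # adjectives
    if tag.startswith("RB"):
        return "r"   # adverbs
    return None      # no WordNet equivalent (prepositions, determiners, ...)

for tag in ["NNS", "VBP", "JJ", "RB", "IN"]:
    print(tag, penn_to_wordnet(tag))
```

Under that assumption, something like `word.lemmatize(penn_to_wordnet(tag) or "n")` inside a loop over `blob.tags` would lemmatize `are` as a verb and `members` as a noun in a single pass.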

## Sentiment analysis

Sentiment analysis is a basic form of text classification in which sentences are commonly sorted into two categories: positive and negative.

In [26]:
clinton_blob.sentiment

Sentiment(polarity=0.17351190476190476, subjectivity=0.4476190476190477)

In [27]:
for sentence in clinton_blob.sentences:
    print(sentence, sentence.sentiment.polarity)

Mr. Speaker, Mr. Vice President, members of Congress, honored guests, my fellow Americans:  We are fortunate to be alive at this moment in history. 0.25
Never before has our nation enjoyed, at once, so much prosperity and social progress with so little internal crisis and so few external threats. 0.049404761904761896
Never before have we had such a blessed opportunity and, therefore, such a profound obligation to build the more perfect Union of our Founders’ dreams. 0.3166666666666667


In [28]:
sad_sent = "Life is sad."
sad_blob = TextBlob(sad_sent)
sad_blob.sentiment.polarity

-0.5

Textblob includes an alternate sentiment analyzer, `NaiveBayesAnalyzer`, an NLTK classifier trained on a movie reviews corpus, that you can use out of the box.

In [29]:
from textblob.sentiments import NaiveBayesAnalyzer
blob = TextBlob(clinton_speech[:446], analyzer=NaiveBayesAnalyzer())
for sentence in blob.sentences:
    print(sentence, sentence.sentiment)

Mr. Speaker, Mr. Vice President, members of Congress, honored guests, my fellow Americans:  We are fortunate to be alive at this moment in history. Sentiment(classification='pos', p_pos=0.9869582403531282, p_neg=0.013041759646874915)
Never before has our nation enjoyed, at once, so much prosperity and social progress with so little internal crisis and so few external threats. Sentiment(classification='pos', p_pos=0.9933821787261897, p_neg=0.006617821273806525)
Never before have we had such a blessed opportunity and, therefore, such a profound obligation to build the more perfect Union of our Founders’ dreams. Sentiment(classification='pos', p_pos=0.914381720440424, p_neg=0.08561827955957835)


In [30]:
para = "Life is good. Life sucks. John hates soda. John hates nasty soda. John likes good soda. John loves soda. John loves sweet soda."
sent_blob = TextBlob(para)
for sent in sent_blob.sentences:
    print(sent, sent.sentiment.polarity)

Life is good. 0.7
Life sucks. -0.3
John hates soda. 0.0
John hates nasty soda. -1.0
John likes good soda. 0.7
John loves soda. 0.0
John loves sweet soda. 0.35
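One way to build intuition for polarity scores is a toy lexicon-based scorer: look up each word in a table of scores and average the hits. This sketch is not Textblob's analyzer, and the `TOY_LEXICON` values are invented for illustration.

```python
import re

# Hypothetical polarity scores in [-1, 1]; invented for this example
TOY_LEXICON = {"good": 0.7, "nasty": -1.0, "sweet": 0.35}

def toy_polarity(sentence):
    """Average the scores of known words; 0.0 when no word is in the lexicon."""
    words = re.findall(r"[a-z']+", sentence.lower())
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

for s in ["Life is good.", "John hates soda.", "John hates nasty soda."]:
    print(s, toy_polarity(s))
```

In this toy version, a sentence whose words are all missing from the lexicon scores 0.0, regardless of how a human reader would judge it.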


In [31]:
sent_blob_nb = TextBlob(para, analyzer=NaiveBayesAnalyzer())
for sent in sent_blob_nb.sentences:
    print(sent, sent.sentiment)

Life is good. Sentiment(classification='pos', p_pos=0.5995917017800413, p_neg=0.4004082982199584)
Life sucks. Sentiment(classification='neg', p_pos=0.12196027933237585, p_neg=0.8780397206676244)
John hates soda. Sentiment(classification='neg', p_pos=0.22370657856753343, p_neg=0.7762934214324665)
John hates nasty soda. Sentiment(classification='neg', p_pos=0.24953488372093016, p_neg=0.7504651162790696)
John likes good soda. Sentiment(classification='neg', p_pos=0.248901977799218, p_neg=0.7510980222007825)
John loves soda. Sentiment(classification='neg', p_pos=0.3654245396557949, p_neg=0.6345754603442052)
John loves sweet soda. Sentiment(classification='neg', p_pos=0.48973374677919485, p_neg=0.5102662532208047)


These examples used the built-in analyzers, but you can also build your own Textblob analyzer around a classifier object. A classifier exposes its own methods, some of which are very useful for model selection when you are building your own. The Textblob docs give an example of how to build a basic sentiment classifier if you're interested.

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Rather than just getting the sentiment of individual sentences, we could try to calculate the average sentiment of a text by averaging the sentiment of its sentences. Write a function `avg_sentiment(text)` that receives `text` and returns the average positive sentiment, computed as the sum of the positive probabilities (`p_pos`) of all its sentences divided by the number of sentences. For example, `avg_sentiment(para)` should return ~`0.3284`.
<br/>
* **Hint**: Remember to use the `NaiveBayesAnalyzer` analyzer.*
</p>
</div>

In [32]:
def avg_sentiment(text):
    sentences = TextBlob(text, analyzer=NaiveBayesAnalyzer()).sentences
    total = len(sentences)
    sent = sum(s.sentiment.p_pos for s in sentences)
    return sent / total

para = "Life is good. Life sucks. John hates soda. John hates nasty soda. John likes good soda. John loves soda. John loves sweet soda."
avg_sentiment(para)

0.32840767251929837
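The result can be checked by hand from the per-sentence output printed earlier. This sketch simply averages the seven `p_pos` values copied from that output, without calling Textblob.

```python
from statistics import mean

# p_pos values copied from the per-sentence NaiveBayesAnalyzer output above
p_pos = [
    0.5995917017800413,   # Life is good.
    0.12196027933237585,  # Life sucks.
    0.22370657856753343,  # John hates soda.
    0.24953488372093016,  # John hates nasty soda.
    0.248901977799218,    # John likes good soda.
    0.3654245396557949,   # John loves soda.
    0.48973374677919485,  # John loves sweet soda.
]

average = mean(p_pos)
print(round(average, 4))
```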

Textblob also lets you simply get the sentiment of a whole text, but you'll notice that this and the average calculated from sentence sentiment are not the same.

In [33]:
sent_blob_nb.sentiment

Sentiment(classification='neg', p_pos=0.1082876450842774, p_neg=0.8917123549157216)

## Readability indices

Readability indices are ways of assessing how easy or complex it is to read a particular text based on the words and sentences it has. They usually output scores that correlate with grade levels.

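Two indices that are relatively easy to calculate are the Automated Readability Index (ARI) and the Coleman-Liau Index (CL):

$$
ARI = 4.71\frac{chars}{words} + 0.5\frac{words}{sentences} - 21.43
$$
$$
CL = 0.0588\left(100\frac{letters}{words}\right) - 0.296\left(100\frac{sentences}{words}\right) - 15.8
$$

In the Coleman-Liau formula, the two terms in parentheses are the average number of letters per 100 words and the average number of sentences per 100 words.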

In [34]:
def coleman_liau_index(blob):
    chars = count_chars(blob.words)
    words = len(blob.words)
    sentences = len(blob.sentences)
    return (0.0588 * letters_per_100(chars, words)) - (0.296 * sentences_per_100(sentences, words)) - 15.8

def letters_per_100(chars, words):
    return (chars / words) * 100
    
def sentences_per_100(sentences, words):
    return (sentences / words) * 100

# Redefines the earlier count_chars(text): this version takes a sequence of words
def count_chars(words):
    return sum(len(w) for w in words)

In [35]:
coleman_liau_index(sent_blob)

0.2452173913043474
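The same score can be reproduced step by step without Textblob. This sketch counts words and sentences with simple regular expressions, an assumption that only holds for a paragraph as regular as `para`; real text needs a proper tokenizer.

```python
import re

para = "Life is good. Life sucks. John hates soda. John hates nasty soda. John likes good soda. John loves soda. John loves sweet soda."

word_list = re.findall(r"[A-Za-z]+", para)
words = len(word_list)                          # number of words
letters = sum(len(w) for w in word_list)        # letters, ignoring spaces and punctuation
sentences = len(re.findall(r"[.!?]+", para))    # sentence-ending marks

letters_per_100_words = (letters / words) * 100
sentences_per_100_words = (sentences / words) * 100

cl = (0.0588 * letters_per_100_words) - (0.296 * sentences_per_100_words) - 15.8
print(words, letters, sentences, round(cl, 4))
```

With 23 words, 98 letters, and 7 sentences, the formula gives roughly 0.2452, matching the value returned by `coleman_liau_index(sent_blob)`.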

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Write a function `auto_readability_index(blob)` that receives a Textblob `blob` and returns the Automated Readability Index (ARI) score as defined above. For example, `auto_readability_index(sent_blob)` should return ~`0.2815`.
</p>
</div>

In [36]:
def auto_readability_index(blob):
    chars = count_chars(blob.words)
    words = len(blob.words)
    sentences = len(blob.sentences)
    return (4.71 * (chars / words)) + (0.5 * (words / sentences)) - 21.43

auto_readability_index(sent_blob)

0.28155279503105746

## Corpus
  
We will work with State of the Union speeches by Barack Obama, George W. Bush, and Bill Clinton, each from the last year of their presidency, along with the recent address to Congress by Donald Trump.

In [37]:
clinton_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/nlp/data/clinton2000.txt"
bush_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/nlp/data/bush2008.txt"
obama_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/nlp/data/obama2016.txt"
trump_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/nlp/data/trump.txt"
clinton_speech = get_speech(clinton_url)
bush_speech = get_speech(bush_url)
obama_speech = get_speech(obama_url)
trump_speech = get_speech(trump_url)

In [38]:
speeches = {
    "clinton": TextBlob(clinton_speech, analyzer=NaiveBayesAnalyzer()),
    "bush": TextBlob(bush_speech, analyzer=NaiveBayesAnalyzer()),
    "obama": TextBlob(obama_speech, analyzer=NaiveBayesAnalyzer()),
    "trump": TextBlob(trump_speech, analyzer=NaiveBayesAnalyzer()),
}

Let's get some basic data about the speeches.

In [39]:
print("Name", "Chars", "Words", "Unique", "Sentences", sep="\t")
for speaker, speech in speeches.items():
    print(speaker, count_chars(speech.words), len(speech.words), len(set(speech.words)), len(speech.sentences), sep="\t")

Name	Chars	Words	Unique	Sentences
bush	27317	5754	1673	311
obama	27880	6124	1698	433
clinton	42021	9067	2109	497
trump	22394	4796	1528	259


We can calculate the average number of words per sentence for each speech.

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Write a function `avg_sentence_length(blob)` that receives a Textblob `blob` and returns the average sentence length, computed as the total number of words divided by the total number of sentences. For example, `avg_sentence_length(sent_blob)` should return ~`3.2857`.
</p>
</div>

In [40]:
def avg_sentence_length(blob):
    return sum(len(s.words) for s in blob.sentences) / len(blob.sentences)

avg_sentence_length(sent_blob)

3.2857142857142856

In [41]:
for speaker, speech in speeches.items():
    print(speaker, avg_sentence_length(speech))

bush 18.5016077170418
obama 14.143187066974596
clinton 18.243460764587525
trump 18.517374517374517


We can also get the most used words. We are going to filter out some common stopwords first.

In [42]:
stopwords_url = "https://raw.githubusercontent.com/sul-cidr/python_workshops/nlp/data/english_stopwords.txt"
stopwords = get_page(stopwords_url).split("\n")
stopwords[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']

In [43]:
def most_used_words(blob, n):
    word_counts = sorted(blob.word_counts.items(), key=lambda p: p[1], reverse=True)
    return list(filter(lambda p: p[0].lower() not in stopwords, word_counts))[:n]

for speaker, speech in speeches.items():
    print(speaker, most_used_words(speech, 10), "\n")

bush [('america', 34), ('people', 31), ('must', 29), ('congress', 27), ('new', 25), ('year', 25), ('us', 23), ('ve', 21), ('iraq', 21), ('nation', 19)] 

obama [('applause', 89), ('us', 34), ('america', 28), ('people', 26), ('that’s', 25), ('world', 24), ('work', 22), ('american', 22), ('it’s', 20), ('new', 19)] 

clinton [('new', 47), ('ask', 43), ('people', 40), ('make', 38), ('us', 35), ('help', 35), ('years', 32), ('children', 31), ('must', 31), ('america', 29)] 

trump [('america', 31), ('american', 30), ('must', 20), ('new', 19), ('country', 18), ('us', 18), ('world', 17), ('people', 15), ('one', 15), ('americans', 15)] 



This sort of exploratory work is often the first step in figuring out how to clean a text for text analysis. 
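For instance, the Obama counts above are dominated by `applause`, and contractions appear with curly apostrophes. A minimal cleaning sketch might strip parenthesized audience annotations and normalize apostrophes before counting. It assumes the transcript marks reactions as `(Applause.)`, which is a guess about the file's format, and `clean_transcript` is a hypothetical helper.

```python
import re

def clean_transcript(text):
    """Drop parenthesized annotations and normalize curly apostrophes."""
    text = re.sub(r"\([^)]*\)", " ", text)      # e.g. "(Applause.)"
    text = text.replace("\u2019", "'")          # curly apostrophe -> straight
    return re.sub(r"\s+", " ", text).strip()    # collapse leftover whitespace

# Made-up sample sentence, not a quotation from the corpus
sample = "That\u2019s the plan. (Applause.) It\u2019s a start."
cleaned = clean_transcript(sample)
print(cleaned)
```

Removing every parenthesized span is a blunt choice: it would also delete genuine parenthetical remarks, so it is worth inspecting what the pattern matches before applying it to a whole corpus.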

Let's assess the lexical richness, computed here as the ratio of the total number of words to the number of unique words, so lower values indicate a more varied vocabulary.

In [44]:
def lexical_richness(words):
    return len(words) / len(set(words))

In [45]:
for speaker, speech in speeches.items():
    print(speaker, lexical_richness(speech.words))

bush 3.4393305439330546
obama 3.6065959952885747
clinton 4.299193930772878
trump 3.1387434554973823


What about sentiment?

In [46]:
for speaker, speech in speeches.items():
    print(speaker, avg_sentiment(speech.string))

bush 0.8006456103438158
obama 0.7063063471504729
clinton 0.7185620550464893
trump 0.7594267280342685


Readability scores

For the Automated Readability Index, you can get the appropriate grade level here: https://en.wikipedia.org/wiki/Automated_readability_index

In [47]:
for speaker, speech in speeches.items():
    print(speaker, "ARI:", auto_readability_index(speech), "CL:", coleman_liau_index(speech))

bush ARI: 10.18143472400578 CL: 10.515321515467498
obama ARI: 7.084245395015714 CL: 8.87629000653168
clinton ARI: 9.52021940843251 CL: 9.82835336936142
trump ARI: 9.821126791631379 CL: 10.05703085904921


<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Write a function `stats(url)` that receives a `url` for a plain text version of a book in Project Gutenberg and returns a dictionary with statistics (Automated Readability Index, Coleman-Liau Index, lexical richness, average sentence length in words, average sentiment, number of characters, number of words, number of unique words, number of sentences, and 10 most used words) for the text at that URL. For example, `stats("http://www.gutenberg.org/cache/epub/345/pg345.txt")` should return `{'ari': 7.051237118685233,
 'average_sentiment': 0.6216963558545169,
 'characters': 883114,
 'cl': 6.151579188686984,
 'lexical_richness': 15.130625285257873,
 'sentence_length': 19.343680709534368,
 'sentences': 8569,
 'top_words': ['said',
  'could',
  'one',
  'us',
  'must',
  'would',
  'may',
  'shall',
  'see',
  'know'],
 'unique_words': 10955,
 'words': 165756}`.
<br/>
* **Hint**: Remember to use the `get_page()` function. Be careful with what parameters to pass to each function.*
</p>
</div>

In [48]:
def stats(url):
    text = get_page(url)
    blob = TextBlob(text)
    return {
        "ari": auto_readability_index(blob),
        "cl": coleman_liau_index(blob),
        "lexical_richness": lexical_richness(blob.words),
        "sentence_length": avg_sentence_length(blob),
        "average_sentiment": avg_sentiment(blob.string),
        # blob.string is a plain str, so this counts every character, spaces and punctuation included
        "characters": count_chars(blob.string),
        "words": len(blob.words),
        "unique_words": len(set(blob.words)),
        "sentences": len(blob.sentences),
        "top_words": [w for (w, f) in most_used_words(blob, 10)],
    }

stats("http://www.gutenberg.org/cache/epub/345/pg345.txt")  # Dracula

{'ari': 7.051237118685233,
 'average_sentiment': 0.6216963558545169,
 'characters': 883114,
 'cl': 6.151579188686984,
 'lexical_richness': 15.130625285257873,
 'sentence_length': 19.343680709534368,
 'sentences': 8569,
 'top_words': ['said',
  'could',
  'one',
  'us',
  'must',
  'would',
  'may',
  'shall',
  'see',
  'know'],
 'unique_words': 10955,
 'words': 165756}