## Natural language processing

Our word counts would be more interesting if we could reason better about the *language* in the text, not just the individual characters. For example, if we knew the parts of speech of individual words, we could exclude words that are determiners, conjunctions, etc. from the count. If we knew what kinds of things the words were referring to, we could count how many times particular characters or settings are referenced.

To do this, we need to do a bit of Natural Language Processing. [More notes and opinions on this.](https://gist.github.com/aparrish/f21f6abbf2367e8eb23438558207e1c3)

Most natural language processing is done with the aid of third-party libraries. We're going to use one called spaCy. To use spaCy, you first need to import it and load an English language model:

In [203]:
import spacy
nlp = spacy.load('en_core_web_sm')

Load in your text using the cell below! (Remember to replace `sotu2012_edited.txt` with the filename of your own text file.)

In [204]:
# Replace with the name of your own text file, then run this cell with Ctrl+Enter.
with open("sotu2012_edited.txt") as f:
    text = f.read()

Now, use spaCy to parse it. (This might take a while, depending on the size of your text.)

In [205]:
doc = nlp(text)

Right off the bat, the spaCy library gives us access to a number of interesting units of text:

* All of the sentences (`doc.sents`)
* All of the words (`doc`)
* All of the "named entities," like names of places, people, brands, etc. (`doc.ents`)
* All of the "noun chunks," i.e., nouns in the text plus surrounding matter like adjectives and articles (`doc.noun_chunks`)

In the cell below, we extract these into variables so we can play around with them a little bit.

In [206]:
sentences = list(doc.sents)
words = [w for w in list(doc) if w.is_alpha]
noun_chunks = list(doc.noun_chunks)
entities = list(doc.ents)

With this information in hand, we can answer interesting questions like: how many sentences are in the text?

In [207]:
len(sentences)

26

Now we can go through every sentence in `sentences` and print each one.

In [208]:
for item in sentences:
    print(item)


Mr. Speaker, Mr. Vice President, members of Congress, distinguished
guests, and fellow Americans.


cats like cheese.
dogs like potatos.


Last month, I went to Andrews Air Force Base and welcomed home some of
our last troops to serve in Iraq.
Together, we offered a final, proud
salute to the colors under which more than a million of our fellow
citizens fought - and several thousand gave their lives.


chair is smelly.


We gather tonight knowing that this generation of heroes has made the
United States safer and more respected around the world.
For the first
time in nine years, there are no Americans fighting in Iraq.
For the
first time in two decades, Osama bin Laden is not a threat to this
country.
Most of al Qaeda's top lieutenants have been defeated.
The
Taliban's momentum has been broken, and some troops in Afghanistan have
begun to come home.


These achievements are a testament to the courage, selflessness, and
teamwork of America's Armed Forces.
At a time when too many of our

What if we only printed sentences that started with the word "We"?

In [209]:
for item in sentences:
    if item[0].text == "We":
        print(item.text)

    

We gather tonight knowing that this generation of heroes has made the
United States safer and more respected around the world.
We can do this.
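
The same test can be widened to several opening words at once. The sketch below is only an illustration on stand-in data: the sentences are plain lists of word strings rather than spaCy sentences, and the names `sample_sentences`, `openers`, and `matches` are made up for this example. It lowercases the first word so that "We" and "we" both match.

```python
# Stand-in data: each sentence is a plain list of word strings,
# not a spaCy sentence, so this cell runs without loading a model.
sample_sentences = [
    ["We", "gather", "tonight"],
    ["cats", "like", "cheese"],
    ["Together", "we", "offered", "a", "salute"],
    ["we", "can", "do", "this"],
]

# Hypothetical set of opening words to look for, written in lowercase.
openers = {"we", "together"}

matches = []
for sentence_words in sample_sentences:
    # Lowercase the first word so the comparison ignores capitalization.
    if sentence_words[0].lower() in openers:
        matches.append(" ".join(sentence_words))

for match in matches:
    print(match)
```

With the real `sentences` list from this notebook, the equivalent test would be `item[0].text.lower() in openers`.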


In [210]:
# Keep only the words that spaCy tagged as nouns.
nouns = [w for w in words if w.pos_ == "NOUN"]

# What type is an item in nouns? It is a spaCy Token object, not a plain string.
print(type(nouns[0]))

# Print a blank line as a break.
print()

# Print all the nouns.
print(nouns)

<class 'spacy.tokens.token.Token'>

[members, guests, cats, cheese, dogs, potatos, month, troops, salute, colors, fellow, citizens, lives, chair, tonight, generation, heroes, world, time, years, time, decades, threat, country, lieutenants, momentum, troops, achievements, testament, courage, selflessness, teamwork, time, institutions, expectations, ambition, differences, mission, hand, what, example, reach, country, world, people, generation, tech, manufacturing, jobs, future, control, energy, security, prosperity, parts, world, economy, work, responsibility, end, generation, heroes, combat, economy, class, world, grandfather, veteran, chance, college, grandmother, who, bomber, assembly, line, part, workforce, products]


In [213]:
# Adding .text after w turns each spaCy token into a plain string.
nouns_str = [w.text for w in words if w.pos_ == "NOUN"]

# What type is an item in nouns_str? It is now a str, which is what we want.
print(type(nouns_str[0]))

# Print a blank line as a break.
print()

# Print all the nouns; as strings, they now appear in quotes.
print(nouns_str)

<class 'str'>

['members', 'guests', 'cats', 'cheese', 'dogs', 'potatos', 'month', 'troops', 'salute', 'colors', 'fellow', 'citizens', 'lives', 'chair', 'tonight', 'generation', 'heroes', 'world', 'time', 'years', 'time', 'decades', 'threat', 'country', 'lieutenants', 'momentum', 'troops', 'achievements', 'testament', 'courage', 'selflessness', 'teamwork', 'time', 'institutions', 'expectations', 'ambition', 'differences', 'mission', 'hand', 'what', 'example', 'reach', 'country', 'world', 'people', 'generation', 'tech', 'manufacturing', 'jobs', 'future', 'control', 'energy', 'security', 'prosperity', 'parts', 'world', 'economy', 'work', 'responsibility', 'end', 'generation', 'heroes', 'combat', 'economy', 'class', 'world', 'grandfather', 'veteran', 'chance', 'college', 'grandmother', 'who', 'bomber', 'assembly', 'line', 'part', 'workforce', 'products']
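
Once the nouns are plain strings, they can be counted like any other list of words. The sketch below is a minimal illustration using `collections.Counter` from the Python standard library. The list `sample_nouns` is a short excerpt copied by hand from the output above so the cell runs on its own; the variable names are made up for this example.

```python
from collections import Counter

# A short excerpt of the noun strings printed above, copied by hand as
# stand-in data; in the notebook you would pass nouns_str itself.
sample_nouns = ['troops', 'world', 'time', 'years', 'time', 'decades',
                'country', 'troops', 'time', 'world']

# Counter maps each distinct string to the number of times it appears.
noun_counts = Counter(sample_nouns)

# most_common(1) returns a list holding the most frequent item and its count.
top_noun, top_count = noun_counts.most_common(1)[0]
print(top_noun, top_count)
```

Counting the `nouns` list directly would not group repeated words in the same way, because each item there is a separate Token object; counting strings is one reason to convert with `.text` first.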


In [212]:
# For every sentence in sentences, print it if its first word appears in the nouns_str list.

for item in sentences:
    if item[0].text in nouns_str:
        print(item.text)


cats like cheese.
dogs like potatos.


chair is smelly.
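
One thing to keep in mind about this test: it asks whether the first word's text matches any noun string found anywhere in the text, not whether the first word is itself used as a noun. The sketch below shows the difference on hand-labeled stand-in data. The tags are written by hand for illustration and are not spaCy output, and every name in the block is made up for this example.

```python
# Stand-in data: each sentence is a list of (text, part_of_speech) pairs.
# The tags are hand-written for illustration; they are not spaCy output.
tagged_sentences = [
    [("cats", "NOUN"), ("like", "VERB"), ("cheese", "NOUN")],
    [("work", "VERB"), ("hard", "ADV")],
    [("we", "PRON"), ("value", "VERB"), ("work", "NOUN")],
]

# Every word tagged NOUN anywhere in the data, stored as a set for fast lookup.
noun_strings = {text for sent in tagged_sentences for text, pos in sent if pos == "NOUN"}

# Approach 1: is the first word's text found among the noun strings?
by_string = [sent for sent in tagged_sentences if sent[0][0] in noun_strings]

# Approach 2: is the first word itself tagged as a noun?
by_tag = [sent for sent in tagged_sentences if sent[0][1] == "NOUN"]

print(len(by_string))  # 2: "work hard" matches because "work" is a noun elsewhere
print(len(by_tag))     # 1: only "cats like cheese" starts with a noun
```

With the real `sentences` list from this notebook, the second approach corresponds to testing `item[0].pos_ == "NOUN"`.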


