# (5A) Natural Language Processing (NLP) Cookbook, part 1

So far, our three primary steps of text analysis have consisted of:

1. **Reading** the file:
    * Filename --> String of entire text
2. **Tokenizing** the string:
    * String --> List of individual words
3. **Counting** the tokens:
    * List --> Dictionary of word counts (word as key, count as value)
    
    
Cool. But how can we get more granular information from texts? What about all the other things Natural Language Processing can do, like:

* Identify parts of speech? (nouns, verbs, adjectives, ...)
* Detect proper nouns of places and people?
* Identify sentence boundaries?
* Describe sentence syntax? (clauses, subject-object and other "dependencies", etc)
* Identify noun phrases? ("natural language processing", "education department", etc)
* Sentiment analysis? (estimating positive/negative sentiment of text)

We don't need to reinvent the wheel. (Although, if you *did* reinvent the wheel, wouldn't that be kind of cool?) There are a number of great software packages out there for natural language processing in Python.

This notebook is organized a bit differently. On the top level are basic NLP tasks, and then within each, I'll identify a few different ways to accomplish that NLP task. For instance, how can we tokenize words again? (Perhaps the simplest NLP task.)

-----

## How can I tokenize a string into a list of words?

Let's compare a number of ways to tokenize a string.

In [None]:
# English
string = "We don't need to re-invent the wheel!"

# Spanish
string_es = '¡No necesitamos reinventar la rueda!'

# German
string_de = 'Wir brauchen nicht das Rad neu erfinden!'

# French
string_fr = "Nous n'avons pas besoin de réinventer la roue!"

# Chinese (simplified)
string_zh = '我们不需要重新发明轮子！'

### (0) Just using string's .split()

In [None]:
# "tokenize"
words = string.split()

# print words
print(words)

# print number of words
print(len(words))
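To see why `.split()` is only a rough tokenizer, here is a small sketch comparing it with a regular-expression tokenizer. The regex pattern and the names `split_words` and `regex_words` are my own illustrations (they don't come from any of the packages below), and the pattern is just one reasonable choice among many.

```python
import re

string = "We don't need to re-invent the wheel!"
string_zh = '我们不需要重新发明轮子！'

# .split() only cuts on whitespace, so punctuation stays glued to words
split_words = string.split()
print(split_words)

# illustrative pattern: runs of word characters, allowing an internal ' or -
regex_words = re.findall(r"\w+(?:['-]\w+)*", string)
print(regex_words)

# neither approach can segment text that is written without spaces
print(len(string_zh.split()))
print(re.findall(r"\w+(?:['-]\w+)*", string_zh))
```

The last two lines are the reason dedicated tokenizers exist: a language like Chinese needs a model of where words begin and end, not just a rule about spaces.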

### (1) NLTK

This is the one we've worked with so far. NLTK is used most often in academic settings, and by people prototyping algorithms and processing pipelines. Depending on the task, it can support many languages, if you download the appropriate language models.

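For tokenization, `nltk.word_tokenize` relies on rules about whitespace and punctuation: it first splits the text into sentences, then splits each sentence into words and punctuation marks. This should work decently well, but not perfectly, for any language in which spaces and punctuation marks delimit words and sentences.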

In [None]:
# import nltk once to use its functions
import nltk

# tokenize
words = nltk.word_tokenize(string)

# print words
print(words)

# print number of words
print(len(words))

### (2) TextBlob (recommended)

I recommend that we start to move from NLTK to TextBlob, which is built on top of NLTK and [pattern](https://github.com/clips/pattern), and which makes a number of basic NLP tasks extremely easy. TextBlob works for English, French, and German. To install, type in the Terminal:

```
pip install textblob
```

For French or German, add:
```
pip install textblob-fr    # French
pip install textblob-de    # German
```

In [None]:
# from textblob, import the main 'class': TextBlob
from textblob import TextBlob

# create a textblob object
blob = TextBlob(string)

# print words
print(blob.words)

# print number of words
print(len(blob.words))

In [None]:
# print tokens (words with punctuation)
print(blob.tokens)

# print number of tokens (words with punctuation)
print(len(blob.tokens))

In [None]:
blob.word_counts

### (3) Polyglot (for non-English, non-French, non-German text)

[Polyglot](https://polyglot.readthedocs.io/) is a really cool package, with an interface modeled on TextBlob's, which supports up to 140 different languages depending on the NLP task. This increase in linguistic range comes at a cost of accuracy, however. The tool was trained using Wikipedia as a Rosetta Stone, calibrating languages' models against each other by using the "same" articles in those languages. The other costs of using Polyglot: the documentation isn't that great, and it doesn't seem to be actively updated.

*Installation is also kind of a pain in the neck.* I recommend installing this only if you are planning to work with non-English, non-French, non-German text. To do so, paste the following into Terminal:

    conda install -c conda-forge pyicu
    pip install pycld2
    pip install morfessor
    pip install polyglot
    polyglot download LANG:en   # for english
    polyglot download LANG:es   # for spanish (optional)
    polyglot download LANG:xx   # where xx is the two-letter language code
   
See [the website](https://polyglot.readthedocs.io/) for more details.

In [None]:
# let's try this...
try:
    # to use polyglot, import its "Text" object:
    from polyglot.text import Text

    # then wrap that Text object around any string
    pg_text = Text(string_es)

    # print words
    print(pg_text.words)

    # print number of words
    print(len(pg_text.words))
except ImportError:
    print('Polyglot not installed! To do so, follow the instructions above.')

### (4) Spacy

[spaCy](http://spacy.io) is industrial-strength NLP. Of the packages covered in this notebook, it is generally the fastest, the most full-featured, and the most accurate. It can also work on [several languages besides English](https://spacy.io/models). But it's also kinda ugly and confusing to use. I recommend using this only if you are working on hundreds of texts and feel extremely comfortable with all the things we've been doing so far.

To install:

    pip install spacy
    python -m spacy download en_core_web_sm

Here's a toy example of spacy:

In [None]:
try:
    # import spacy
    import spacy

    # load its default English model
    nlp = spacy.load("en_core_web_sm")

    # create a spacy text object
    doc = nlp(string)

    # print words
    print(list(doc))

    # print number of words
    print(len(doc))
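except (ImportError, OSError):
    # ImportError: spacy itself is missing; OSError: the English model is missing
    print("spacy (or its English model) not installed. Please follow directions above.")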

### Practice

In [None]:
## @TODO: 
# - Import textblob
# - Make a 'blob' from the string (string_arendt) below
# - Print the number of words
# - Print the number of UNIQUE words
# - Calculate the TTR (type-token ratio: number of unique words / total number of words)

string_arendt = """For the sciences today
have been forced to adopt a "language" of mathematical symbols
which, though it was originally meant only as an abbreviation for
spoken statements, now contains statements that in no way can be
translated back into speech. The reason why it may be wise to
distrust the political judgment of scientists qua scientists is not
primarily their lack of "character" -- that they did not refuse to
develop atomic weapons -- or their naivete -- that they did not
understand that once these weapons were developed they would
be the last to be consulted about their use -- but precisely the fact
that they move in a world where speech has lost its power."""
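As a reminder of what the type-token ratio (TTR) measures, here is a sketch on a toy list rather than on the Arendt passage. `toy_words` is a made-up example, and lowercasing before counting is an assumption you may or may not want for your own analysis.

```python
# a made-up list of tokens (not the Arendt passage)
toy_words = ['A', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose']

# tokens: every word occurrence, lowercased so 'A' and 'a' count as the same type
tokens = [w.lower() for w in toy_words]

# types: the set of distinct words
types = set(tokens)

# type-token ratio: distinct words divided by total words
ttr = len(types) / len(tokens)

print(len(tokens), len(types), ttr)
```

A lower TTR means more repetition; very repetitive text like this toy list scores far below typical prose.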



## How can I count words?

In [None]:
# sample list of words
list_stein = ['a','rose','is','a','rose','is','a','rose','.']

### (0) Python

This was the first way we were doing things.

In [None]:
# We can write our own counter function
def count(words):
    # make an empty dictionary
    dict_of_counts = {}
    
    # loop over words
    for word in words:
        # if the word is in the dictionary
        if word in dict_of_counts:
            # add 1 to its entry
            dict_of_counts[word]+=1
        else:
            # initialize its entry with 1
            dict_of_counts[word]=1
            
    # return dict of counts
    return dict_of_counts

In [None]:
# We can create a dictionary of counts using our function
count_dict = count(list_stein)
count_dict

In [None]:
# But this will break (with a KeyError), because 'daffodil' is not in the count dictionary
count_dict['daffodil']

In [None]:
# This will make sure we get a default value (0, in this case) if 'daffodil' is not in the count dictionary
count_dict.get('daffodil',0)
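The same `.get()` trick can also shorten the counting function itself. This is just a sketch of an alternative; the name `count_with_get` is hypothetical.

```python
list_stein = ['a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']

def count_with_get(words):
    # make an empty dictionary
    counts = {}
    for word in words:
        # .get() returns 0 for unseen words, so no if/else is needed
        counts[word] = counts.get(word, 0) + 1
    return counts

count_dict = count_with_get(list_stein)
print(count_dict)
```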

### (1) Counter

Python includes a very helpful variant on the dictionary called 'Counter'.

In [None]:
# Counter must be imported
from collections import Counter

# This is the first way you can make a Counter: just pass it a list of tokens
count_dict = Counter(list_stein)

# show
count_dict

In [None]:
# Another way is to start from an empty one
count_dict = Counter()

# ...and then loop through and add 1 each time
for word in list_stein:
    count_dict[word]+=1    # no need to check if it's there!
    
# show
count_dict

In [None]:
# A counter will automatically give us a default value of 0 if the word is not there
count_dict['daffodil']
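A Counter also has a few conveniences that a plain dictionary lacks. Here is a brief sketch using its `.most_common()` method; the variable names `top_two`, `total`, and `rel_freq_rose` are just for illustration.

```python
from collections import Counter

list_stein = ['a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', '.']
count_dict = Counter(list_stein)

# the two most frequent tokens, as (token, count) pairs
top_two = count_dict.most_common(2)
print(top_two)

# total number of tokens counted
total = sum(count_dict.values())

# relative frequency of one word
rel_freq_rose = count_dict['rose'] / total
print(total, rel_freq_rose)
```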

### (2) textblob

The simplest way to count all the words in a text is to just grab the `.word_counts` attribute of a textblob.

In [None]:
# first, make a blob
blob = TextBlob("a rose is a rose is a rose.")

# then grab the counter
count_dict = blob.word_counts

# show (note the lack of period: this is without punctuation)
count_dict

In [None]:
# this will also give us a default value of 0
count_dict['daffodil']

## How can I tokenize sentences?

If word tokenization is the process of splitting an undifferentiated string into a list of words, then sentence tokenization is the process of splitting an undifferentiated string into a list of sentences.

Here are a few different ways.

In [None]:
para_labor = """Ever since he's been here, never stopped working. Always working.
Washing dishes. Chopping vegetables. Cleaning floors. Cooking hamburgers. Painting walls.
Laying brick. Cutting hedges. Mowing lawn. Digging ditches. Sweeping trash."""
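As a baseline for comparison, here is a rough sketch of a do-it-yourself sentence splitter built on a regular expression. The function `naive_sent_tokenize` is hypothetical and deliberately simple: it splits on whitespace that follows a period, exclamation mark, or question mark.

```python
import re

para_labor = """Ever since he's been here, never stopped working. Always working.
Washing dishes. Chopping vegetables. Cleaning floors. Cooking hamburgers. Painting walls.
Laying brick. Cutting hedges. Mowing lawn. Digging ditches. Sweeping trash."""

def naive_sent_tokenize(text):
    # split wherever whitespace follows sentence-ending punctuation
    pieces = re.split(r'(?<=[.!?])\s+', text.strip())
    # drop any empty strings
    return [piece for piece in pieces if piece]

sentences = naive_sent_tokenize(para_labor)
print(len(sentences))

# the naive rule breaks on abbreviations
tricky = naive_sent_tokenize("Dr. Smith arrived late. Nobody minded.")
print(tricky)
```

The rule works on this paragraph but wrongly cuts after "Dr." in the second example, which is the kind of case the packages below are designed to handle better.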

### (1) NLTK

In [None]:
# tokenize
sentences = nltk.sent_tokenize(para_labor)

# print number of sentences
print(len(sentences))

# show sentences
sentences

### (2) textblob (recommended)

In [None]:
# first make a blob
blob = TextBlob(para_labor)

# Then the sentences are just magically at .sentences
sentences = blob.sentences

# Print number of sentences
print(len(sentences))

# Show sentences
sentences

In [None]:
# Get first sentence
first_sent = blob.sentences[0]

In [None]:
# Get that sentence's words
first_sent.words

#### How can I calculate the average length of sentences?

In [None]:
# Get average words per sentence
num_words = len(blob.words)
num_sents = len(blob.sentences)
wps = num_words / num_sents

print(wps)

#### How can I calculate the median length of sentences?

In [None]:
### Calculate the MEDIAN length of sentences

# numpy has a lot of useful stats functions
import numpy as np

# set an empty list which we'll use to store the lengths of the sentences
sent_lens = []

# for every sentence...
for sent in blob.sentences:
    
    # get the number of words
    num_sent_words = len(sent.words)
        
    # add this length to the list of sentence lengths
    sent_lens.append(num_sent_words)
    
# once we're done looping, print sent_lens
print(sent_lens)

# print the median
median_len = np.median(sent_lens)
print(median_len)
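If you'd rather not depend on numpy, Python's built-in `statistics` module offers the same calculations. In this sketch, `sent_lens` is a made-up list of sentence lengths (one long sentence followed by eleven two-word ones), not the output of any tokenizer above.

```python
import statistics

# hypothetical sentence lengths, in words
sent_lens = [8, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]

# the mean is pulled upward by the one long sentence
mean_len = statistics.mean(sent_lens)

# the median is the middle value, so it ignores that outlier
median_len = statistics.median(sent_lens)

print(mean_len, median_len)
```

Comparing the two numbers is a quick way to tell whether a few unusually long sentences are skewing the average.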

## How can I get the parts of speech?

### (1) NLTK

English only (by default). NLTK's tagset (as well as textblob's) is drawn from the [Penn Treebank Tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

In [None]:
# first you need to tokenize
words = nltk.word_tokenize(string)

# then you can part of speech tag
tags = nltk.pos_tag(words)

# show tags
tags

In [None]:
# Each of these is a 'tuple': a baby list with just two things in it
babylist = tags[0]
babylist

In [None]:
# This means we can get the first thing in this baby list the same way
babylist[0]

In [None]:
# And the second thing
babylist[1]

In [None]:
# A tuple is a 'frozen' list. Nothing can be added to it, so this line raises an AttributeError
babylist.append('this will not work')

In [None]:
# We can loop over the words and tags like this:
for babylist in tags:
    word = babylist[0]
    tag = babylist[1]
    if tag == 'NN':
        print(word)

In [None]:
# We can also loop over it like this
for word,tag in tags:
    # Print if singular common noun (tag 'NN')
    if tag=='NN':
        print(word)
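Note that `'NN'` is only one of the Penn Treebank noun tags: plural nouns are tagged `'NNS'` and proper nouns `'NNP'` or `'NNPS'`. Here is a sketch of catching all of them with `.startswith()`. The list `toy_tags` is hand-written for illustration; it is not real tagger output.

```python
# hand-written (word, tag) pairs for illustration, not real tagger output
toy_tags = [('Manzanar', 'NNP'), ('conducts', 'VBZ'), ('traffic', 'NN'),
            ('on', 'IN'), ('the', 'DT'), ('freeways', 'NNS')]

# exact match: only singular common nouns
singular_nouns = [word for word, tag in toy_tags if tag == 'NN']
print(singular_nouns)

# prefix match: every tag that begins with 'N' (NN, NNS, NNP, NNPS)
all_nouns = [word for word, tag in toy_tags if tag.startswith('N')]
print(all_nouns)
```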

### (2) textblob (recommended)

In [None]:
# first make a blob
blob = TextBlob(string)

# then the tags are just magically at .tags
tags = blob.tags

# show tags
tags

In [None]:
# Loop over words and tags in the same way as above
for word,tag in tags:
    # Print if singular common noun (tag 'NN')
    if tag=='NN':
        print(word)

### (3) polyglot

Polyglot uses a [simplified tag set](https://polyglot.readthedocs.io/en/latest/POS.html).

In [None]:
try:
    # First make a polyglot text
    from polyglot.text import Text
    text = Text(string_es)

    # Then get the tags at .pos_tags
    tags = text.pos_tags

    # print tags
    from pprint import pprint   # this is a prettier print function
    pprint(tags)
except ImportError:
    print('Polyglot not installed. Please see installation instructions above.')

In [None]:
# You can loop over these in the same way, too
for word,tag in tags:
    if tag=='NOUN':
        print(word)

### (4) spacy

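spaCy attaches part-of-speech information to each token of a processed document. See the [spaCy 101 guide](https://spacy.io/usage/spacy-101) for details.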
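Here is a minimal sketch of part-of-speech tagging in spaCy, assuming spaCy and its `en_core_web_sm` model are installed as described earlier. Each token's `.pos_` attribute holds a coarse tag (like `NOUN`) and `.tag_` a fine-grained one. The variable name `spacy_tags` is just for illustration.

```python
try:
    import spacy

    # load the small English model
    nlp = spacy.load("en_core_web_sm")

    # process the string
    doc = nlp("We don't need to re-invent the wheel!")

    # collect (word, coarse tag, fine-grained tag) for every token
    spacy_tags = [(token.text, token.pos_, token.tag_) for token in doc]

    # print only the nouns, using the coarse tag
    for word, pos, tag in spacy_tags:
        if pos == 'NOUN':
            print(word)
except (ImportError, OSError):
    # ImportError: spacy is missing; OSError: the model is missing
    spacy_tags = None
    print("spacy (or its English model) not installed.")
```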

### Practice

In [None]:
## @TODO: Count only the nouns in this latest Manzanar chapter (ch35)
#

# open the file
with open('../corpora/tropic_of_orange/texts/ch35.txt') as file:
    string_manzanar = file.read()
    
# make a text blob


# make an empty dictionary or counter
count_nouns = Counter()     # with a Counter, you don't have to check if an entry is already there


# loop over the words and tags
# and, if it is a noun,
# add 1 to a word's entry in count_nouns 



In [None]:
## @TODO: Count not words, but parts of speech, in this same chapter
# hint: follow the pattern above for the most part



## Overall practice

In [None]:
## @TODO: 
# - Make a pandas dataframe for the Tropic of Orange metadata
# - For each file in that dataframe,
#    - Open that file
#    - Read it to a string
#    - Make a textblob of it
#    - Make a text_results dictionary (remember to include 'fn' as key)
#    - Add in the results dictionary:
#      - % of words which are nouns (tag starts with "N")
#      - % of words which are verbs (tag starts with "V")
#      - % of words which are adjectives (tag starts with "J")
#    - Store the results dictionaries in a list
#    - Make a results dataframe
#    - Merge that dataframe to the original metadata frame
#    - Make 3 boxplots of nouns/verbs/adjectives by narrator




## Appendix

### textblob

#### textblob auf Deutsch

In [None]:
# let's try the following lines out, but if they don't work, we'll take plan B
try:
    # import German textblob
    from textblob_de import TextBlobDE

    # make a German blob
    blob_de = TextBlobDE(string_de)

    # print words
    print(blob_de.words)

    # print number of words
    print(len(blob_de.words))

    # print part of speech tags
    print(blob_de.tags)
except ImportError:
    # here's our Plan B
    print('textblob-de not installed. run in terminal: pip install textblob-de')

#### textblob en français

In [None]:
# let's try the following lines out, but if they don't work, we'll take plan B
try:
    # import French textblob
    from textblob_fr import PatternTagger, PatternAnalyzer
    
    # make a French blob
    blob_fr = TextBlob(string_fr, pos_tagger=PatternTagger(), analyzer=PatternAnalyzer())
    
    # print words
    print(blob_fr.words)
    
    # print number of words
    print(len(blob_fr.words))
    
    # print part of speech tags
    print(blob_fr.tags)
except ImportError:
    # here's our Plan B
    print('textblob-fr not installed. run in terminal: pip install textblob-fr')