## Categorizing and Tagging Words

Back in elementary school you learnt the difference between nouns, verbs, adjectives, and adverbs. These are very useful categories for many language processing tasks. Our goal in this chapter is to answer the following questions:

1. What are lexical categories and how are they used in natural language processing?
2. What is a good Python data structure for storing words and their categories?
3. How can we automatically tag each word of a text with its word class?

The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. 


### Using a POS tagger

A part-of-speech tagger, or POS-tagger, processes a sequence of words and attaches a part-of-speech tag to each word:

In [None]:
import nltk

text = nltk.word_tokenize("And now for something completely different")

nltk.pos_tag(text)

In [None]:
text = nltk.word_tokenize("Manager worked well")
nltk.pos_tag(text)

In the first example, _and_ is `CC`, a coordinating conjunction; _now_ and _completely_ are `RB`, adverbs; _for_ is `IN`, a preposition; _something_ is `NN`, a noun; and _different_ is `JJ`, an adjective.

NLTK provides documentation for each tag, which can be queried using the tag, e.g. `nltk.help.upenn_tagset('RB')`, or a regular expression, e.g. `nltk.help.upenn_tagset('NN.*')`.

In [None]:
nltk.help.upenn_tagset('NN')

In [None]:
nltk.help.upenn_tagset('.*')

Let's look at another example, this time including some **homographs** (words that are spelled alike but differ in meaning):

In [None]:
text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
nltk.pos_tag(text)

Notice that _refuse_ and _permit_ each appear both as a verb (`VBP`, present tense, or `VB`, base form) and as a noun (`NN`). For example, refUSE is a verb meaning "deny," while REFuse is a noun meaning "trash"; the two are spelled alike but pronounced differently, so they are not homophones. We therefore need to know which word is being used in order to pronounce the text correctly, which is why text-to-speech systems usually perform POS-tagging.
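
One way to spot such ambiguous words programmatically is to group the tags by word. The sketch below is only an illustration: the `tagged` list is typed in by hand in the `(word, tag)` format that `nltk.pos_tag` returns, so the tags are assumptions and real tagger output may differ, and `tags_by_word` and `ambiguous` are names introduced here.

```python
from collections import defaultdict

# Hand-written sample in the (word, tag) format that nltk.pos_tag returns.
# The tags are illustrative; run the tagger to see the real output.
tagged = [
    ("They", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("permit", "VB"),
    ("us", "PRP"), ("to", "TO"), ("obtain", "VB"), ("the", "DT"),
    ("refuse", "NN"), ("permit", "NN"),
]

# Collect the set of tags seen for each (lower-cased) word.
tags_by_word = defaultdict(set)
for word, tag in tagged:
    tags_by_word[word.lower()].add(tag)

# Keep only the words that were tagged in more than one way.
ambiguous = {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}
print(ambiguous)
```

On a longer text the same grouping gives a quick list of words whose tag depends on context.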

This information is also useful when trying to figure out the sense of a word in WordNet:

In [None]:
from nltk.corpus import wordnet as wn
wn.synsets('refuse')

In [None]:
senses = [(s.lemma_names(), s.definition(), s.examples()) for s in wn.synsets('refuse')]
for s in senses:
    print("Lemma name:", s[0])
    print("Definition:", s[1])
    print("Examples  :", s[2])
    print("=======================")

Only one sense of _refuse_ is a noun (`garbage.n.01`), and the most common verb sense means "show unwillingness towards", which is the correct interpretation in our context.
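
To use a tag for this kind of lookup, the Penn Treebank tag first has to be translated into one of WordNet's part-of-speech letters. A minimal sketch, assuming the usual convention that `wn.synsets()` accepts `pos='n'`, `'v'`, `'a'` or `'r'`; the helper name `penn_to_wordnet` is made up for this example:

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if there is no match."""
    if tag.startswith("NN"):
        return "n"  # noun (the value of wn.NOUN)
    if tag.startswith("VB"):
        return "v"  # verb (wn.VERB)
    if tag.startswith("JJ"):
        return "a"  # adjective (wn.ADJ)
    if tag.startswith("RB"):
        return "r"  # adverb (wn.ADV)
    return None     # determiners, prepositions, etc. are not in WordNet

for tag in ["VBP", "NN", "JJ", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

With the mapping in hand, a call such as `wn.synsets('refuse', pos=penn_to_wordnet('NN'))` would restrict the lookup to noun senses.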

### Exercise

Many words, like _ski_ and _race_, can be used as nouns or verbs with no difference in pronunciation. Can you think of others? Make up a sentence that uses one such word both ways, and run the POS-tagger on it.

In [None]:
# Your code is here
text = nltk.word_tokenize("I love that I have love for data.")
tagged = nltk.pos_tag(text)
tagged

### Representing Tagged Tokens

By convention in NLTK, a tagged token is represented as a **tuple** consisting of the token and the tag. Such a tuple can be created from the standard string representation of a tagged token (for example `fly/NN`) with the function `nltk.tag.str2tuple()`; the output of `nltk.pos_tag()` is already a list of these tuples:

In [None]:
text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
tagged = nltk.pos_tag(text)
tagged_token = tagged[0]

print(tagged_token)
print(tagged_token[0])
print(tagged_token[1])
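
For the string-to-tuple direction mentioned above, NLTK offers `nltk.tag.str2tuple()`. To make the idea concrete without relying on the library, here is a hand-rolled stand-in that splits a `word/TAG` string at the last slash. The name `parse_tagged_token` is hypothetical, and this is only a sketch of the idea, not NLTK's implementation:

```python
def parse_tagged_token(s, sep="/"):
    """Split 'word/TAG' at the last separator into a (word, tag) tuple."""
    word, found, tag = s.rpartition(sep)
    if not found:
        # No separator: treat the whole string as an untagged word.
        return (s, None)
    return (word, tag.upper())

sentence = "The/DT grand/JJ jury/NN commented/VBD"
tagged_tokens = [parse_tagged_token(t) for t in sentence.split()]
print(tagged_tokens)
```

Splitting at the last separator rather than the first keeps tokens that themselves contain a slash, such as `1/2/CD`, in one piece.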

In [None]:
print("Text = ", text)

tokens = [a for (a, b) in tagged]
print("Tokens = ",tokens)

tags = [b for (a, b) in tagged]
print("POS Tags = ", tags)

fd = nltk.FreqDist(tags)
print(fd)
fd.tabulate()

In [None]:
text = nltk.corpus.gutenberg.words('shakespeare-hamlet.txt')
tagged = nltk.pos_tag(text)

In [None]:
tokens = [a for (a, b) in tagged]
tags = [b for (a, b) in tagged]
fd = nltk.FreqDist(tags)
fd.tabulate()

In [None]:
from nltk.book import *

print(type(text7))
tagged_wsj = nltk.pos_tag(text7)

In [None]:
tokens_wsj = [a for (a, b) in tagged_wsj]
tags_wsj = [b for (a, b) in tagged_wsj]
fd_wsj = nltk.FreqDist(tags_wsj)
fd_wsj.tabulate()

### Exercise 

Load a text of your choice, tokenize it, and perform part-of-speech tagging on it. Then extract the nouns from the text and perform a frequency analysis to identify the most common nouns in the text. (Warning: POS tagging takes a good amount of time when processing long texts.)

Repeat the exercise for adjectives.

In [None]:
# Your code is here
text = nltk.Text([word for (word,pos) in nltk.pos_tag(text1) if pos.startswith('NN')])
fdist = text.vocab()
print(fdist.most_common(10))
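
The same filter-then-count pattern works for any tag family. The sketch below wraps it in a helper, using `collections.Counter` from the standard library in place of `FreqDist`. The name `most_common_with_prefix` and the tiny hand-tagged `sample` are made up for illustration; in practice the list would come from `nltk.pos_tag`.

```python
from collections import Counter

def most_common_with_prefix(tagged, prefix, n=10):
    """Return the n most frequent lower-cased words whose tag starts with prefix."""
    counts = Counter(word.lower() for word, tag in tagged if tag.startswith(prefix))
    return counts.most_common(n)

# Hand-tagged toy data standing in for real tagger output.
sample = [
    ("The", "DT"), ("old", "JJ"), ("whale", "NN"), ("saw", "VBD"),
    ("the", "DT"), ("old", "JJ"), ("ship", "NN"), ("and", "CC"),
    ("bigger", "JJR"), ("whales", "NNS"), ("near", "IN"), ("the", "DT"),
    ("ship", "NN"),
]

print(most_common_with_prefix(sample, "NN", n=2))  # nouns (NN, NNS, ...)
print(most_common_with_prefix(sample, "JJ", n=2))  # adjectives (JJ, JJR, JJS)
```

Matching on a prefix rather than an exact tag means plural nouns (`NNS`) and comparative adjectives (`JJR`) are counted too.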

## Primitive sentiment analysis

Adjectives are often among the main carriers of sentiment. So let's pick a piece of text and identify the adjectives that appear in it, together with their sentiment scores. For that, we will use SentiWordNet, a lexical resource for opinion mining.

In [None]:
# See http://www.nltk.org/_modules/nltk/corpus/reader/sentiwordnet.html for the documentation

from nltk.corpus import sentiwordnet as swn
print(swn.senti_synset('fast.a.02'))

Now let's analyze a review text:

In [None]:
# Amazon review for Samsung Galaxy S5, White 16GB
# http://www.amazon.com/review/R3UULR1IWEUS4I/ref=cm_cr_dp_title?ie=UTF8&ASIN=B00IZ1X21K&nodeID=2335752011&store=wireless


content = u'''
First off, I am not a professional reviewer, nor am I employed or compensated by Samsung or any other company. Instead of boring you with facts - which you can find anywhere on the Net - I will just give you some real-world impressions on how it looks, feels, and runs. With that out of the way, let's get to the point and the nitty gritty, shall we?

* THE SCREEN - that is the very first thing you will notice when you look at the S5. Samsung has found its niche with AMOLED screens, which are BRIGHT & SATURATED. Everything almost literally jumps out at you, and sometimes even too much so. I had to switch to the "natural" setting, as the "vivid" and even "standard" profiles are too saturated(and FAKE) for me. It's better as a demo unit to draw you in, but for everyday use, I recommend switching to the natural profile.
FACTS: The Galaxy S5 has a 5.1-inch Super AMOLED capacitive touchscreen with Full HD resolution - 1080 x 1920 pixels or ~432 ppi pixel density, plus Gorilla Glass 3 to protect the screen from scratches.

* The Look - the S5 has a more squared-off edges look than the S4, which is more squared off than the S3, but all three are not as angular as the S2. In terms of roundness-to square-ness, it goes from the S3 - S4 - S5 - S2 (the original S just looks like an iPhone 3GS). Check out my images for an easier comparison. The S5 is the tallest and widest, but not the thickest of the Galaxy S's. The best thing I can say about this is it's an evolution. Beauty is subjective, so judge for yourself. The front side is almost the same as any other Galaxy phone: You have the physical Home button, flanked by the "back" and "menu" capacitive buttons. Probably the most improved aspect of the design is in its functionality - it is now dust-proof, and water-proof up to 3 feet!
FACTS: The dimensions are 5.59" x 2.85" x 0.32"(142cm x 72.5cm x 8.1cm), and weighs 5.11oz(145g).

* The Feel - Samsung has taken a lot of flack for making the Galaxy S line so cheap looking and feeling with its plastic bodies, for being the top Android phone maker. HTC has been known to have the best craftsmanship with their all-metal One phones. Perhaps Samsung feel they are so dominant that they don't have to spend more to mass-produce metal phones, but since they don't want to come off as too arrogant, so their compromise is a dimpled, faux-rubber backside like the Nexus 7(2012) and its very own Galaxy Note 3. It definitely gives a better feel - it doesn't slip and slide in your hands or pockets anymore - but it cannot compare to the feel and craftsmanship of the HTC One(both the m7 and m8). It is on the right track though, so let's hope that rumored luxury "F" line or next year's S6 will continue to get better.

* How it Runs - This phone is fast, fast, FAST! With a 2.5gHz Snapdragon 801, it has the fastest processor out there right now. It terms of real speed, I cannot say if it is faster than the HTC One m8 or the Sony Xperia Z2, but it is definitely up there. When you touch an app icon to launch it, it launches nearly instantly. To really see how this phone flies, just open the gallery app and scroll through all your photos and you'll see what I mean. Usually the gallery is where most phones stutter as it tries to load all your photos and albums - but NOT the S5!

* The Camera - FINALLY! Samsung has decided to make a decent camera, and not just as an afterthought. This 16mp camera is really awesome, so much better than the S4. I would always get washed out images with my S3/S4/Note 2, but with the S5, it actually looks like it's from a decent point-and-shoot dedicated camera with crisp, bright, and saturated images. Low-light shooting is also vastly improved, although not as good as the new HTC One m8. 16mp means 5312 x 2988 -resolution images, so you can actually blow them up or crop them down without fearing the dreaded pixelation monster. There are a myriad of other cool and useful camera features that I will save for you to find out(like macro and "Google Street View" modes :]). And lastly, the focus is quick, quick, QUICK! Nearly instantaneous focus allows you to capture those hard-to-capture moments easier. A definitely thumbs up to Samsung for paying attention to the camera and its functions.

* Software - I'm still trying to figure out everything, as there is A LOT of stuff under the hood. Samsung's TouchWiz user interface this time around is A LOT less intrusive though, as much as can be without being totally stock Android, I guess. The layout and iconography are flatter and simpler, and for the better in my view. There is also a new sensor on the back, just beneath the camera lens. It is a heart-rate monitor/pedometer, and it comes with its own health app called S Health. There is a new battery-saving mode which can save you precious minutes when you're caught in a bind. All in all, I think this version is a lot nicer-looking, more responsive, and better than the precious S phones.

The ultimate question is whether this phone is a worthy upgrade over the S4. As my review title suggests, it is an evolution, an incremental upgrade over the S4. So with that said I cannot whole-heartedly recommend it if you already have a good phone, or even over the S4. But I do feel this upgrade is more vast and much better than from the S3 to the S4, so in that sense Samsung has done a much better job this year. If you are switching from an older phone that was made at least 2 years ago, then I would tell you jump right in and try the S5 - it will not disappoint you. But for those with already a good phone, and/or say you just finished year one of your 2-year contract, then I would say think hard before you make the leap. For my money, I think the Note 4 and S6 will be the bigger upgrades more worth waiting for.
'''

In [None]:
tokens = nltk.word_tokenize(content)
text = nltk.Text(tokens)
tagged = nltk.pos_tag(tokens)


In [None]:
# Let's keep the adjectives only
# (tag JJ; comparatives, JJR, and superlatives, JJS, are left out by this exact match)
adjectives = [word for (word, tag) in tagged if tag == 'JJ']
print(adjectives)

In [None]:
# Now we use SentiWordNet and drop the words that do not appear in its lexicon.
# Since we do not have much information for further disambiguation, we keep only the
# most popular interpretation (list element 0) if there are multiple ones.
resolved_adjectives = []
for w in adjectives:
    synsets = list(swn.senti_synsets(w, 'a'))  # look each word up only once
    if synsets:
        resolved_adjectives.append((w, synsets[0]))
print(resolved_adjectives)

In [None]:
# SentiWordNet assigns to each synset of WordNet three
# sentiment scores: positivity, negativity, and objectivity.

for (w,a) in resolved_adjectives:
    print("Word:", w)
    print("Synset:", a)
    print("Pos score:",  a.pos_score())
    print("Neg score:",  a.neg_score())
    print("Objectivity score:",  a.obj_score())
    print("======================================")
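
One simple way to condense the three numbers into a single value is to subtract the negativity score from the positivity score. This is just one possible heuristic, not something SentiWordNet prescribes. The `(positive, negative)` pairs below are invented placeholders rather than real SentiWordNet values; with the loop above you would use `a.pos_score()` and `a.neg_score()` instead.

```python
# Invented (pos_score, neg_score) pairs, NOT real SentiWordNet values.
scores = {
    "awesome": (0.75, 0.0),
    "cheap": (0.125, 0.5),
    "physical": (0.0, 0.0),
}

def polarity(pos, neg):
    """Net polarity in [-1, 1]: positive minus negative score."""
    return pos - neg

# Sort from most positive to most negative.
ranked = sorted(scores, key=lambda w: polarity(*scores[w]), reverse=True)
for word in ranked:
    pos, neg = scores[word]
    # The three SentiWordNet scores sum to 1, so objectivity is the remainder.
    print(f"{word:10s} polarity={polarity(pos, neg):+.3f} objectivity={1 - (pos + neg):.3f}")
```

A word with polarity close to zero can be either balanced or simply objective, which is why the objectivity score is printed alongside.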

In [None]:
# But let's take a look at what we rejected
rejected_adjectives = [w for w in adjectives if len(list(swn.senti_synsets(w, 'a')))==0]
print(rejected_adjectives)


Perhaps we would also like to figure out what the adjectives in the text refer to. A simple heuristic is to print each adjective together with the noun that immediately follows it:

In [None]:
# Print every adjective (JJ) that is immediately followed by a noun (NN).
# Start at 1 so that tagged[i-1] never wraps around to the last token.
for i in range(1, len(tagged)):
    previous_word, previous_pos = tagged[i-1]
    current_word, current_pos = tagged[i]
    if previous_pos == 'JJ' and current_pos == 'NN':
        print(previous_word + " " + current_word)
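
The same scan can also be phrased over consecutive pairs with `zip`, which avoids index arithmetic altogether. A sketch, with a hand-tagged `sample` standing in for the tagger output and a made-up helper name `adjective_noun_pairs`:

```python
def adjective_noun_pairs(tagged):
    """Return 'adjective noun' strings for every JJ token directly followed by an NN token."""
    return [
        f"{w1} {w2}"
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if t1 == "JJ" and t2 == "NN"
    ]

# Hand-tagged toy data; real input would be the list returned by nltk.pos_tag.
sample = [
    ("a", "DT"), ("decent", "JJ"), ("camera", "NN"), ("with", "IN"),
    ("crisp", "JJ"), ("images", "NNS"), ("and", "CC"), ("quick", "JJ"),
    ("focus", "NN"),
]
print(adjective_noun_pairs(sample))
```

Note that "crisp images" is not reported, because the exact comparison with `NN` does not match the plural tag `NNS`.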

### Exercise 1

Instead of adjective-noun pairs, we can look at verbs and adverbs (e.g., "works nicely"). Let's modify the code above to extract patterns involving verbs and adverbs.

In [None]:
# Your code is here


### Exercise 2

How can you modify the code to find more patterns, instead of just JJ-NN (adjective followed by noun)?

In [None]:
# Your code is here
