## What is natural language?

A natural language is a language that humans use for everyday communication, for example:
- English
- Korean
- Chinese

This is in contrast to artificial languages such as programming languages.

## What is natural language processing?

Any computation on, or manipulation of, natural language.

Natural languages evolve over time:
- new words get added, e.g. "selfie"
- old words lose popularity, e.g. "thou"
- meanings of words change, e.g. "learn"
- language rules themselves may change, e.g. the position of verbs in sentences
    
## NLP Tasks: A broad spectrum
- Counting words, counting frequency of words
- Finding sentence boundaries
- Part of speech tagging
- Parsing the sentence structure
- Identifying semantic roles
- Identifying entities in a sentence (entity recognition)
- Finding which pronoun refers to which entity (co-reference resolution)


## 1. Basic NLP Tasks with NLTK

**NLTK : Natural Language Toolkit**

1) Open source library in Python

2) Has support for most NLP tasks

3) Also provides access to numerous text corpora

### 1) Download Corpus

In [1]:
import nltk
# open the NLTK downloader to fetch some text corpora
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [2]:
from nltk.book import *   

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [5]:
# one sample sentence from each of the nine corpora
sents()

sent1: Call me Ishmael .
sent2: The family of Dashwood had long been settled in Sussex .
sent3: In the beginning God created the heaven and the earth .
sent4: Fellow - Citizens of the Senate and of the House of Representatives :
sent5: I have a problem with people PMing me to lol JOIN
sent6: SCENE 1 : [ wind ] [ clop clop clop ] KING ARTHUR : Whoa there !
sent7: Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
sent8: 25 SEXY MALE , seeks attrac older single lady , for discreet encounters .
sent9: THE suburb of Saffron Park lay on the sunset side of London , as red and ragged as a cloud of sunset .


In [6]:
sent1

['Call', 'me', 'Ishmael', '.']

### 2) Simple NLP Tasks

##### - Counting the vocabulary

In [14]:
text7

<Text: Wall Street Journal>

In [15]:
sent7

['Pierre',
 'Vinken',
 ',',
 '61',
 'years',
 'old',
 ',',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'Nov.',
 '29',
 '.']

In [16]:
# number of tokens in the sentence
len(sent7)

18

In [17]:
# number of tokens in the whole text
len(text7)

100676

In [18]:
# number of unique tokens
len(set(text7))

12408

In [25]:
list(set(text7))[:10]

['export-oriented',
 'cease-fire',
 'trimmed',
 'Rico',
 'implied',
 'psychiatric',
 'lenders',
 'PAPERS',
 'AN',
 'pennies']
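The difference between tokens (every occurrence) and unique tokens (the vocabulary) can be sketched without NLTK. The short token list below is a made-up sample, not taken from the corpus; the two corpus counts are the ones printed above. Note that `set` has no defined order, which is why `list(set(text7))[:10]` can show different words from run to run.

```python
# Hypothetical mini-sample of tokens (not from text7).
tokens = ["The", "market", "fell", "and", "the", "Market", "recovered", "."]

n_tokens = len(tokens)                          # every occurrence counts
n_types = len(set(tokens))                      # distinct tokens, case-sensitive
n_types_folded = len({t.lower() for t in tokens})  # distinct tokens after lowercasing

print(n_tokens, n_types, n_types_folded)

# Ratio of unique tokens to total tokens, using the counts reported for text7.
wsj_ratio = 12408 / 100676
print(round(wsj_ratio, 3))
```

Whether "The" and "the" should count as one vocabulary entry depends on the task, which is the normalization question discussed later.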

##### - Frequency of words

In [21]:
dist = FreqDist(text7)
len(dist) #same as len(set(text7))

12408

In [28]:
vocab1 = dist.keys()
# In Python 3, dict.keys() returns a view rather than a list,
# so vocab1[:10] fails; convert to a list before slicing.
list(vocab1)[:10]

['Pierre', 'Vinken', ',', '61', 'years', 'old', 'will', 'join', 'the', 'board']

In [32]:
# 'four' appears 20 times
dist['four']

20

In [33]:
# words longer than 5 characters that appear more than 100 times
freqwords = [w for w in vocab1 if len(w) > 5 and dist[w] > 100]
freqwords

['billion',
 'company',
 'president',
 'because',
 'market',
 'million',
 'shares',
 'trading',
 'program']
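The same kind of frequency lookup can be sketched with the standard library alone. In NLTK 3, `FreqDist` builds on `collections.Counter`, so the indexing and filtering shown above carry over directly. The token list here is a made-up sample, and the threshold of 2 is chosen only to suit its small size.

```python
from collections import Counter

# Hypothetical sample sentence, split on whitespace.
tokens = ("the company said the market fell because "
          "the company sold shares").split()

dist = Counter(tokens)

print(dist["company"])      # how often a given word occurs
print(dist["missing"])      # unseen words count as 0 instead of raising KeyError
print(dist.most_common(2))  # the two most frequent tokens

# Same filter as above, with a smaller threshold for the tiny sample:
# words longer than 5 characters that appear at least twice.
frequent = [w for w in dist if len(w) > 5 and dist[w] >= 2]
print(frequent)
```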

### 3) Normalization and stemming

The same word can appear in several different forms.

In [36]:
# Normalization: lowercase the text, then split on spaces
input1 = "List listed lists listing listings"
words1 = input1.lower().split(' ')
words1

['list', 'listed', 'lists', 'listing', 'listings']

In [35]:
# Stemming: reduce each word to its root form
porter = nltk.PorterStemmer()
[porter.stem(t) for t in words1]

['list', 'list', 'list', 'list', 'list']

**The important question here is: do you really want to treat these different forms as distinct words?**

It's a matter of choice.
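To make the idea of stemming concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm; the suffix list and the minimum stem length of three characters are assumptions picked for this example.

```python
# Hypothetical suffix list, longest first so "ings" is tried before "ing" and "s".
SUFFIXES = ("ings", "ing", "ed", "s")

def toy_stem(token):
    """Lowercase the token and strip the first matching suffix."""
    word = token.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words like "bed" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

stems = [toy_stem(t) for t in "List listed lists listing listings".split()]
print(stems)

# Blind suffix stripping also damages unrelated words.
print(toy_stem("string"))
```

The second call shows the cost of this choice: a rule that merges "listing" into "list" also cuts "string" down to something that is not a word.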

### 4) Lemmatization

Lemmatization is a slight variant of stemming in which the words that come out are themselves valid, meaningful words.

In [37]:
# Universal Declaration of Human Rights
udhr = nltk.corpus.udhr.words('English-Latin1')
udhr[:20]

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'rights',
 'of']

In [38]:
# Apply Porter stemming
[porter.stem(t) for t in udhr[:20]]

# Some of the stems (e.g. 'univers', 'declar') are not valid words

['univers',
 'declar',
 'of',
 'human',
 'right',
 'preambl',
 'wherea',
 'recognit',
 'of',
 'the',
 'inher',
 'digniti',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalien',
 'right',
 'of']

In [40]:
# So we use a lemmatizer instead

WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in udhr[:20]]

# All the output words are valid:
# 'Rights' -> 'Rights' (the capitalized form is left unchanged)
# 'rights' -> 'right'
# The lemmatizer follows rules for these changes.

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'right',
 'of']
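One way to picture why a lemmatizer only returns valid words is a lookup-based sketch: a candidate form is accepted only if it appears in a dictionary. The vocabulary, exception table, and suffix rules below are tiny hand-made stand-ins, not WordNet data, and the function is an illustration rather than how `WordNetLemmatizer` is implemented.

```python
# Hypothetical mini-dictionary standing in for a real lexical resource.
VOCABULARY = {"right", "child", "dignity", "universal", "recognition"}

# Irregular forms need an explicit exception table.
EXCEPTIONS = {"children": "child"}

# (suffix, replacement) pairs, tried in order.
SUFFIX_RULES = [("ies", "y"), ("es", ""), ("s", "")]

def toy_lemmatize(word):
    """Return a dictionary form of the word, or the word itself if none is found."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            if candidate in VOCABULARY:   # accept only known words
                return candidate
    return word

lemmas = [toy_lemmatize(w) for w in ["Rights", "rights", "dignities", "children", "universal"]]
print(lemmas)
```

Because the lookup is case-sensitive, "Rights" is left alone while "rights" becomes "right", which mirrors the behaviour seen in the output above.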

### 5) Tokenization

Recall that we earlier split a sentence into words (tokens) on spaces.

In [42]:
text11 = "Children shouldn't drink a sugary drink before bed."
text11.split(' ')
# Not a good job: 'bed.' keeps its period and "shouldn't" stays a single token

['Children', "shouldn't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed.']

In [29]:
nltk.word_tokenize(text11)

['Children',
 'should',
 "n't",
 'drink',
 'a',
 'sugary',
 'drink',
 'before',
 'bed',
 '.']
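A regular expression gets part of the way there. The sketch below is an assumption-laden stand-in, not what `nltk.word_tokenize` does internally: it separates punctuation from words but keeps contractions such as "shouldn't" as a single token.

```python
import re

# A run of word characters (optionally with internal apostrophes),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize("Children shouldn't drink a sugary drink before bed.")
print(tokens)

# Prices show the limits of the pattern: the number is split at the decimal point.
print(simple_tokenize("costs $2.99."))
```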

**How would you split a long text string into sentences?**

In [43]:
# NLTK has a built-in sentence tokenizer
text12 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!"
sentences = nltk.sent_tokenize(text12)
len(sentences)

4

In [31]:
sentences

['This is the first sentence.',
 'A gallon of milk in the U.S. costs $2.99.',
 'Is this the third sentence?',
 'Yes, it is!']
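To see why sentence splitting needs more than a simple rule, compare the result above with a naive splitter that breaks after every `.`, `!`, or `?` followed by whitespace. This is only a sketch of the naive approach, using the same example string.

```python
import re

text = ("This is the first sentence. A gallon of milk in the U.S. costs $2.99. "
        "Is this the third sentence? Yes, it is!")

# Split on whitespace that follows sentence-ending punctuation.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for sentence in naive_sentences:
    print(repr(sentence))
print(len(naive_sentences))
```

The naive rule yields five pieces instead of four because it treats the period in "U.S." as a sentence boundary. The decimal point in "$2.99" is safe only because no whitespace follows it.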

## Syntactic Ambiguity

**Visiting aunts can be a nuisance.**

This sentence is an example of syntactic ambiguity.

Depending on how the sentence is parsed, two interpretations are possible: going to visit aunts can be a nuisance, or aunts who are visiting can be a nuisance.

The ambiguity arises because the word "Visiting" can be either an adjective or a gerund, which leads to two different parses of the sentence and two different meanings.

## 2. Advanced NLP Tasks with NLTK

- Part-of-speech tagging
- Parsing the sentence structure
- Semantic role labeling
- Named entity recognition
- Co-reference and pronoun resolution

### 1) Part-of-speech (POS) tagging

Each tag stands for a word class:
- CC: conjunction
- CD: cardinal
- NN: noun
- PRP: pronoun
- ...

There are many more tags and word classes.

In [44]:
# look up the description of a tag (word class)
nltk.help.upenn_tagset('MD')

MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would


In [46]:
# POS tagging
text13 = nltk.word_tokenize(text11)
nltk.pos_tag(text13)

[('Children', 'NNP'),
 ('should', 'MD'),
 ("n't", 'RB'),
 ('drink', 'VB'),
 ('a', 'DT'),
 ('sugary', 'JJ'),
 ('drink', 'NN'),
 ('before', 'IN'),
 ('bed', 'NN'),
 ('.', '.')]
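The tagger returns a plain list of `(word, tag)` tuples, so ordinary Python is enough to work with it. The sketch below copies the output above into a literal (so it runs without NLTK) and shows two typical follow-up steps: finding words that received more than one tag, and selecting tokens by tag prefix.

```python
from collections import defaultdict

# Tagged tokens copied from the output above.
tagged = [("Children", "NNP"), ("should", "MD"), ("n't", "RB"), ("drink", "VB"),
          ("a", "DT"), ("sugary", "JJ"), ("drink", "NN"), ("before", "IN"),
          ("bed", "NN"), (".", ".")]

# Collect every tag each word received.
tags_by_word = defaultdict(set)
for word, tag in tagged:
    tags_by_word[word].add(tag)

# Words tagged in more than one way within this sentence.
multi_tagged = {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}
print(multi_tagged)

# All noun tags start with "NN" (NN, NNS, NNP, ...).
nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(nouns)
```

Here "drink" is tagged once as a verb and once as a noun, which is the distinction the tagger has to infer from context.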

In [45]:
# Ambiguity in POS tagging

text14 = nltk.word_tokenize("Visiting aunts can be a nuisance")
nltk.pos_tag(text14)

[('Visiting', 'VBG'),
 ('aunts', 'NNS'),
 ('can', 'MD'),
 ('be', 'VB'),
 ('a', 'DT'),
 ('nuisance', 'NN')]

### 2) Parsing sentence structure

Making sense of sentences is easy if they follow a well-defined grammatical structure.



In [60]:
# Parsing sentence structure
text15 = nltk.word_tokenize("Alice loves Bob")

# NP : Noun Phrase, VP : Verb Phrase
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Alice' | 'Bob'
V -> 'loves'
""")

parser = nltk.ChartParser(grammar)
trees = parser.parse_all(text15)
for tree in trees:
    print(tree)

(S (NP Alice) (VP (V loves) (NP Bob)))


In [66]:
# ambiguity in parsing
text16 = nltk.word_tokenize("I saw the man with a telescope")

mygrammar1 = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | VP PP
PP -> P NP
NP -> DT N | DT N PP | 'I'
DT -> 'a' | 'the'
N -> 'man' | 'telescope'
V -> 'saw'
P -> 'with'
""")
parser = nltk.ChartParser(mygrammar1)

trees = parser.parse_all(text16)
for tree in trees:
    print(tree)

(S
  (NP I)
  (VP
    (VP (V saw) (NP (DT the) (N man)))
    (PP (P with) (NP (DT a) (N telescope)))))
(S
  (NP I)
  (VP
    (V saw)
    (NP (DT the) (N man) (PP (P with) (NP (DT a) (N telescope))))))
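The two trees above can be cross-checked by counting parses directly. The following is a minimal span-based counter, not NLTK's chart parser. It assumes the grammar has no empty productions and no cycles of single-symbol rules between nonterminals, and it treats any symbol that is not a key of `GRAMMAR` as a terminal. `GRAMMAR` is a hand transcription of `mygrammar1`.

```python
from functools import lru_cache

# mygrammar1 rewritten as a dict: nonterminal -> list of right-hand sides.
GRAMMAR = {
    "S": [("NP", "VP")],
    "VP": [("V", "NP"), ("VP", "PP")],
    "PP": [("P", "NP")],
    "NP": [("DT", "N"), ("DT", "N", "PP"), ("I",)],
    "DT": [("a",), ("the",)],
    "N": [("man",), ("telescope",)],
    "V": [("saw",)],
    "P": [("with",)],
}

def count_parses(tokens, start="S"):
    """Count the distinct parse trees of tokens rooted at the start symbol."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count_symbol(symbol, i, j):
        # Number of ways `symbol` can derive tokens[i:j].
        if symbol not in GRAMMAR:  # terminal: must match exactly one token
            return 1 if j - i == 1 and tokens[i] == symbol else 0
        return sum(count_sequence(rhs, i, j) for rhs in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def count_sequence(rhs, i, j):
        # Number of ways the symbol sequence `rhs` can derive tokens[i:j].
        if len(rhs) == 1:
            return count_symbol(rhs[0], i, j)
        total = 0
        # The first symbol covers tokens[i:k]; every remaining symbol
        # needs at least one token, which bounds k from above.
        for k in range(i + 1, j - len(rhs) + 2):
            left = count_symbol(rhs[0], i, k)
            if left:
                total += left * count_sequence(rhs[1:], k, j)
        return total

    return count_symbol(start, 0, len(tokens))

n_ambiguous = count_parses("I saw the man with a telescope".split())
n_simple = count_parses("I saw the man".split())
print(n_ambiguous, n_simple)
```

The count for the full sentence matches the number of trees printed above: one where the prepositional phrase attaches to the verb phrase and one where it attaches to "the man".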


In [67]:
# A large collection of parsed trees: the Penn Treebank sample included with NLTK

from nltk.corpus import treebank
text17 = treebank.parsed_sents('wsj_0001.mrg')[0]
print(text17)

(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))
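The bracketed notation printed above is simple enough to read back with a few lines of Python. The sketch below is not NLTK's tree reader. It assumes that words and labels never contain literal parentheses, and it returns nested `(label, children)` tuples, a representation chosen just for this example.

```python
def parse_bracketed(text):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def read_node():
        nonlocal position
        position += 1                        # skip the opening "("
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(read_node())        # nested subtree
            else:
                children.append(tokens[position])   # leaf word
                position += 1
        position += 1                        # skip the closing ")"
        return (label, children)

    return read_node()

def leaves(tree):
    """Return the words of the tree from left to right."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words

tree = parse_bracketed("(S (NP Alice) (VP (V loves) (NP Bob)))")
print(tree[0], leaves(tree))

fragment = parse_bracketed("(NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,))")
print(leaves(fragment))
```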


### 3) POS tagging and parsing ambiguity

In [68]:
# Uncommon word usage, e.g. "The old man the boat" (here 'old' is the noun and 'man' is the verb)

text18 = nltk.word_tokenize("The old man the boat")
nltk.pos_tag(text18)

[('The', 'DT'), ('old', 'JJ'), ('man', 'NN'), ('the', 'DT'), ('boat', 'NN')]

In [70]:
# Well-formed sentences may still be meaningless

text19 = nltk.word_tokenize("Colorless green ideas sleep furiously")
nltk.pos_tag(text19)

# Wrong classification: 'Colorless' is tagged NNP (proper noun), but it is an adjective

[('Colorless', 'NNP'),
 ('green', 'JJ'),
 ('ideas', 'NNS'),
 ('sleep', 'VBP'),
 ('furiously', 'RB')]