# **Module 2: Basic Natural Language Processing**

## Section I Basic Natural Language Processing (NLP) with NLTK

#### 1. **Natural language and its processing**

##### (1) **Natural language**
- Language used for everyday communication by *humans*
- Contrasted with **computer languages**, which are artificial

##### (2) **Natural language processing (NLP)**
- Computation/manipulation of natural language to:
    - Gain insight into what words mean and how sentences are constructed
- Natural languages evolve
    - New words are added
    - Old words lose popularity
    - Meaning of words changes
    - Language rules may change

##### (3) **NLP tasks**
- Counting words (frequency/unique)
- Finding sentence boundaries
- Part of speech tagging
- Parsing the sentence structure
- Identifying semantic roles (e.g., *finding **S,V,O** in "Mary loves John"*)
- Identifying entities in a sentence (e.g., *finding **Mary** in "Mary loves John"*)
- Finding which pronoun refers to which entity (**co-reference resolution**)

#### 2. **An introduction to Natural Language Toolkit (NLTK)**

##### (1) **NLTK**

- An open source library in Python
- Supports most NLP tasks
- Provides access to numerous text corpora

##### (2) **Install and import NLTK**
- Install NLTK (with either pip or conda):
```bash
    pip3 install nltk      # using pip
    conda install nltk     # or, using conda
```

- Import NLTK and download the required data (corpora and models):

In [2]:
import nltk

In [2]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [3]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [4]:
text1

<Text: Moby Dick by Herman Melville 1851>

In [5]:
sents()

sent1: Call me Ishmael .
sent2: The family of Dashwood had long been settled in Sussex .
sent3: In the beginning God created the heaven and the earth .
sent4: Fellow - Citizens of the Senate and of the House of Representatives :
sent5: I have a problem with people PMing me to lol JOIN
sent6: SCENE 1 : [ wind ] [ clop clop clop ] KING ARTHUR : Whoa there !
sent7: Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
sent8: 25 SEXY MALE , seeks attrac older single lady , for discreet encounters .
sent9: THE suburb of Saffron Park lay on the sunset side of London , as red and ragged as a cloud of sunset .


In [6]:
sent1

['Call', 'me', 'Ishmael', '.']

#### 3. **Simple NLP tasks**

- Counting the vocabulary

In [8]:
text7

<Text: Wall Street Journal>

In [9]:
sent7

['Pierre',
 'Vinken',
 ',',
 '61',
 'years',
 'old',
 ',',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'Nov.',
 '29',
 '.']

In [11]:
len(sent7) # number of tokens in sentence 7

18

In [12]:
len(text7) # number of tokens in the whole text 7 "Wall Street Journal"

100676

In [14]:
len(set(text7)) # number of unique words in the whole text 7

12408

In [26]:
list(set(text7))[:10] # 10 unique tokens from text 7 (a set is unordered, so the selection can vary between runs)

['Torrington',
 'Krenz',
 '27-year',
 'Sebastian',
 'airlines',
 'Needham',
 'Senior',
 'supplies',
 'LTV',
 'popularly']

- Counting frequency of words

In [16]:
dist = FreqDist(text7)

In [18]:
len(dist) # number of unique tokens in text 7

12408

In [27]:
vocabl = dist.keys()
list(vocabl)[:10] # the first 10 unique words in text 7

['Pierre', 'Vinken', ',', '61', 'years', 'old', 'will', 'join', 'the', 'board']

In [30]:
dist['four'] # number of times the token 'four' occurs in text 7

20

In [35]:
freqwords = [w for w in vocabl if len(w) > 5 and dist[w] > 100] # words longer than 5 characters that occur more than 100 times
# Reason for the length filter: very short tokens (function words, punctuation) are frequent but carry little meaning
freqwords

['billion',
 'company',
 'president',
 'because',
 'market',
 'million',
 'shares',
 'trading',
 'program']
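
NLTK's `FreqDist` is a subclass of Python's `collections.Counter`, so the same counting pattern can be tried without downloading any corpus. The sketch below is a minimal illustration; the sample sentence and the thresholds are made up for the example and are not taken from text 7.

```python
from collections import Counter

# Made-up sample text (an assumption for illustration, not corpus data).
sample = "the market fell because the company sold shares and the market reacted"
tokens = sample.split()

dist = Counter(tokens)              # maps each token to its count

total_tokens = len(tokens)          # all tokens, repeats included
unique_tokens = len(dist)           # size of the vocabulary

# Tokens longer than 5 characters that occur at least twice.
frequent_long = [w for w in dist if len(w) > 5 and dist[w] >= 2]

print(total_tokens, unique_tokens)
print(dist.most_common(2))
print(frequent_long)
```

`most_common(n)` is often a convenient alternative to filtering by hand when only the top few tokens are needed.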

- **Normalization**
    - Normalize all words to the same casing (e.g., *lower-case*)

In [37]:
input1 = 'List listed lists listing listings'

In [39]:
words1 = input1.lower().split(' ') # normalize all words to lower-case
words1

['list', 'listed', 'lists', 'listing', 'listings']

- **Stemming**
    - Remove common suffixes from a word to reduce it to its root form (e.g., *removing "ed" and "ing" from verbs*)

In [42]:
porter = nltk.PorterStemmer()
[porter.stem(t) for t in words1]

['list', 'list', 'list', 'list', 'list']

- **Lemmatization**
    - Transform a word to its dictionary form, the lemma (e.g., *converting "running" and "ran" to "run"*)
    - Difference from stemming: the results are always valid words

In [45]:
udhr = nltk.corpus.udhr.words('English-Latin1')

In [47]:
udhr[:20]

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'rights',
 'of']

In [48]:
[porter.stem(t) for t in udhr[:20]]

['univers',
 'declar',
 'of',
 'human',
 'right',
 'preambl',
 'wherea',
 'recognit',
 'of',
 'the',
 'inher',
 'digniti',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalien',
 'right',
 'of']

In [53]:
WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in udhr[:20]] # Note: the capitalized word "rights" (the 5th word) is not lemmatized

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'right',
 'of']

In [55]:
[WNlemma.lemmatize(t.lower()) for t in udhr[:20]]

['universal',
 'declaration',
 'of',
 'human',
 'right',
 'preamble',
 'whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'right',
 'of']
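
The difference between stemming and lemmatization can be illustrated with a deliberately tiny, standard-library sketch. This is a toy, not the Porter algorithm and not WordNet: the suffix list, the lookup table, and the names `toy_stem` and `toy_lemmatize` are all assumptions made for the example.

```python
# Suffixes are checked in order, longest first.
SUFFIXES = ("ings", "ing", "ed", "s")

def toy_stem(word):
    """Strip one common suffix; the result need not be a real word."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# A hand-written lookup table standing in for a real lexical database.
LEMMAS = {"ran": "run", "running": "run", "rights": "right"}

def toy_lemmatize(word):
    """Look the word up; unknown words are returned unchanged (lower-cased)."""
    word = word.lower()
    return LEMMAS.get(word, word)

for w in ["listing", "running", "ran"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

Here the toy stemmer turns "running" into "runn", which is not a word, while the lookup returns "run"; irregular forms such as "ran" are out of reach for suffix stripping altogether.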

- **Tokenization**
    - Split a text into tokens (words and punctuation marks); punctuation such as the final period becomes a separate token rather than staying attached to a word.

In [6]:
text11 = "Children shouldn't drink a sugary drink before bed."

In [60]:
text11.split(' ')

['Children', "shouldn't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed.']

In [63]:
nltk.word_tokenize(text11) # NLTK inbuilt tokenizer

['Children',
 'should',
 "n't",
 'drink',
 'a',
 'sugary',
 'drink',
 'before',
 'bed',
 '.']
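
To see what a tokenizer adds over `split(' ')`, a regular expression can approximate one using only the standard library. This is a rough sketch and not how `nltk.word_tokenize` works internally; the pattern and the name `simple_tokenize` are illustrative assumptions. Unlike the NLTK output above, it keeps a contraction such as "shouldn't" as a single token.

```python
import re

# Either a word (optionally followed by an apostrophe part, as in "shouldn't"),
# or a single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def simple_tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize("Children shouldn't drink a sugary drink before bed.")
print(tokens)
```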

- **Sentence splitting**

In [62]:
text12 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!"

In [65]:
sentences = nltk.sent_tokenize(text12) # NLTK inbuilt sentence splitter
print(sentences,len(sentences))

['This is the first sentence.', 'A gallon of milk in the U.S. costs $2.99.', 'Is this the third sentence?', 'Yes, it is!'] 4
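
Sentence splitting is harder than it looks because a period does not always end a sentence. The naive splitter below (a standard-library sketch; `naive_sent_split` is a hypothetical name) breaks after every `.`, `!`, or `?` that is followed by whitespace, and so it wrongly splits after the abbreviation "U.S.".

```python
import re

text = ("This is the first sentence. A gallon of milk in the U.S. costs $2.99. "
        "Is this the third sentence? Yes, it is!")

def naive_sent_split(text):
    """Split wherever '.', '!' or '?' is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

pieces = naive_sent_split(text)
for piece in pieces:
    print(repr(piece))
print(len(pieces))
```

The period inside "$2.99" is handled correctly only because it is not followed by whitespace; the abbreviation case is the kind of input where a trained splitter such as `nltk.sent_tokenize` does better.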


#### 4. **Advanced NLP tasks**

- **Part-of-speech (POS) tagging**
    - POS tags: nouns, verbs, adjectives...

| Tag | Word Class | Tag | Word Class | Tag | Word Class |
| :-----:| :----: | :----: | :-----:| :----: | :----: |
| CC | Conjunction | JJ | Adjective | PRP | Pronoun |
| CD | Cardinal | MD | Modal | RB | Adverb |
| DT | Determiner | NN | Noun | SYM | Symbol |
| IN | Preposition | POS | Possessive | VB | Verb |

In [5]:
nltk.help.upenn_tagset('MD')

MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would


In [7]:
text13 = nltk.word_tokenize(text11)

In [8]:
nltk.pos_tag(text13)

[('Children', 'NNP'),
 ('should', 'MD'),
 ("n't", 'RB'),
 ('drink', 'VB'),
 ('a', 'DT'),
 ('sugary', 'JJ'),
 ('drink', 'NN'),
 ('before', 'IN'),
 ('bed', 'NN'),
 ('.', '.')]
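
Since `nltk.pos_tag` returns a list of `(word, tag)` pairs, the result can be filtered with ordinary Python. The sketch below copies the tagged output shown above into a literal so that it runs without NLTK; the helper name `words_with_tag_prefix` is a hypothetical choice for this example.

```python
# Tagged tokens copied from the nltk.pos_tag output above.
tagged = [
    ('Children', 'NNP'), ('should', 'MD'), ("n't", 'RB'), ('drink', 'VB'),
    ('a', 'DT'), ('sugary', 'JJ'), ('drink', 'NN'), ('before', 'IN'),
    ('bed', 'NN'), ('.', '.'),
]

def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose POS tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]

nouns = words_with_tag_prefix(tagged, 'NN')  # matches NN, NNS, NNP, NNPS
verbs = words_with_tag_prefix(tagged, 'VB')  # matches VB, VBD, VBG, ...

print(nouns)
print(verbs)
```

Note that "drink" appears in both lists: the tagger assigned it `VB` in one position and `NN` in the other.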

- **Ambiguity in POS tagging**
    - The <u>same word</u> can take <u>different POS tags</u>, so one sentence can have more than one reading  

In [9]:
text14 = nltk.word_tokenize('Visiting aunts can be a nuisance.')

In [11]:
nltk.pos_tag(text14)
# Note: 'Visiting' is tagged VBG here, but it can also be read as an adjective ('visiting aunts' = aunts who are coming to visit us).

[('Visiting', 'VBG'),
 ('aunts', 'NNS'),
 ('can', 'MD'),
 ('be', 'VB'),
 ('a', 'DT'),
 ('nuisance', 'NN'),
 ('.', '.')]

- **Parsing sentence structure**
    - For sentences with a well-defined structure, parsing can be straightforward

In [23]:
text15 = nltk.word_tokenize('Alice loves Bob')

In [24]:
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Alice' | 'Bob'
V -> 'loves'
""")

In [17]:
parser = nltk.ChartParser(grammar)

In [25]:
trees = parser.parse_all(text15)

In [26]:
for tree in trees:
    print(tree)

(S (NP Alice) (VP (V loves) (NP Bob)))


- **Ambiguity in Parsing**
    - Ambiguity may arise even if the sentence is grammatically correct
        - e.g., *I saw the man with a telescope* (either I have the telescope, or the man has it)

In [28]:
text16 = nltk.word_tokenize('I saw the man with a telescope')

In [29]:
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | VP PP
PP -> P NP
NP -> DT N | DT N PP | 'I'
DT -> 'a' | 'the'
N -> 'man' | 'telescope'
V -> 'saw'
P -> 'with'
""")

In [30]:
parser = nltk.ChartParser(grammar)
trees = parser.parse_all(text16)

In [31]:
for tree in trees:
    print(tree)

(S
  (NP I)
  (VP
    (VP (V saw) (NP (DT the) (N man)))
    (PP (P with) (NP (DT a) (N telescope)))))
(S
  (NP I)
  (VP
    (V saw)
    (NP (DT the) (N man) (PP (P with) (NP (DT a) (N telescope))))))
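
The two trees come from the grammar itself, not from the parser implementation. As a cross-check, the sketch below counts derivations for the same grammar using only the standard library. It is a simple memoized recursion, not the chart-parsing algorithm NLTK uses; the dictionary encoding of the grammar and the name `count_parses` are assumptions made for this example, and the approach relies on the grammar having no cycles of unary rules.

```python
from functools import lru_cache

# The grammar from the cell above: nonterminal -> list of right-hand sides.
# Any symbol that is not a key is treated as a terminal (a literal token).
GRAMMAR = {
    "S": [("NP", "VP")],
    "VP": [("V", "NP"), ("VP", "PP")],
    "PP": [("P", "NP")],
    "NP": [("DT", "N"), ("DT", "N", "PP"), ("I",)],
    "DT": [("a",), ("the",)],
    "N": [("man",), ("telescope",)],
    "V": [("saw",)],
    "P": [("with",)],
}

def count_parses(tokens, start="S"):
    """Count the distinct parse trees of `tokens` rooted at `start`."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def derive(symbol, i, j):
        # Number of ways `symbol` can produce tokens[i:j].
        if symbol not in GRAMMAR:  # terminal: must match exactly one token
            return 1 if j - i == 1 and tokens[i] == symbol else 0
        return sum(split(rhs, i, j) for rhs in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def split(rhs, i, j):
        # Number of ways the symbol sequence `rhs` can produce tokens[i:j].
        if len(rhs) == 1:
            return derive(rhs[0], i, j)
        total = 0
        # The first symbol covers tokens[i:k]; every remaining symbol
        # needs at least one token, which bounds k from above.
        for k in range(i + 1, j - len(rhs) + 2):
            left = derive(rhs[0], i, k)
            if left:
                total += left * split(rhs[1:], k, j)
        return total

    return derive(start, 0, len(tokens))

print(count_parses("I saw the man with a telescope".split()))
print(count_parses("I saw the man".split()))
```

The first count corresponds to the two attachments of the prepositional phrase: to the verb phrase (`VP -> VP PP`) or to the noun phrase (`NP -> DT N PP`).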


- **NLTK and parse tree collection**

In [32]:
from nltk.corpus import treebank
text17 = treebank.parsed_sents('wsj_0001.mrg')[0]

In [33]:
print(text17)

(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))


- **POS tagging & parsing complexity**
    - Uncommon usages of words
        - e.g., *The old man the boat* ("man" is a verb here)
    - Well-formed sentences can still be meaningless (grammatical but not semantically sensible)
        - e.g., *Colorless green ideas sleep furiously*

#### ***\* Take Home Concepts 1***

$\qquad$ - <u>POS tagging</u> provides insights into the word classes/types in a sentence  
$\qquad$ - <u>Parsing</u> the grammatical structures helps derive meaning  
$\qquad$ - Both <u>POS tagging</u> and <u>parsing</u> have ambiguous components that increase the difficulty  
$\qquad$ - Better models can be learned with **supervised learning**  
$\qquad$ - NLTK provides access to tools and data for training

#### ***\*Special Application: Spell Correction***

- Find valid words sharing similar spelling
    - **Requirement**: 
        - a <u>dictionary</u> of valid words, available in NLTK as `nltk.corpus.words`
        - some way to <u>measure spelling similarity</u>, e.g., the "***edit distance***" between the *misspelled word* and each *candidate correct word*
    - **Edit operations**:
        - Insertions
        - Deletions
        - Substitutions
    - **Edit distance**:
        - Number of changes needed to turn string A into string B (A -> B)
        - Ways to measure similarity: 
          - *Levenshtein distance*: align the common parts of A and B, then count the substitutions/insertions/deletions needed for the rest
          - *N-grams*: contiguous sequences of *n* characters in a word
            - Two words with similar spelling share many character *n-grams*; the same idea applies to sequences of words
          - *Jaccard similarity*: measures the overlap of two sets (e.g., the n-gram sets of two words)
            - $J(A, B) = \frac{|A\cap B|}{|A\cup B|}$
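
The Levenshtein distance can be computed with dynamic programming. NLTK provides `nltk.edit_distance` for this; the version below is a standard-library sketch of the usual row-by-row formulation, with unit cost assumed for every insertion, deletion, and substitution.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))
```

Keeping only two rows of the table is enough because each cell depends only on the current and the previous row.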
            


In [1]:
import nltk
from nltk.corpus import words
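
One possible recommender combines character n-grams with Jaccard similarity: pick the dictionary word whose n-gram set overlaps most with that of the misspelled word. The sketch below uses only the standard library. The short `vocabulary` list is a hypothetical stand-in for the word list returned by `words.words()`, and the function names are assumptions for this example. NLTK also offers `nltk.ngrams` and `nltk.jaccard_distance`, which could be used in place of the helpers here.

```python
def char_ngrams(word, n=2):
    """Set of contiguous character sequences of length n in word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def suggest(misspelled, vocabulary, n=2):
    """Return the vocabulary word whose n-gram set is most similar."""
    target = char_ngrams(misspelled, n)
    return max(vocabulary, key=lambda w: jaccard(target, char_ngrams(w, n)))

# Hypothetical mini-dictionary standing in for nltk.corpus.words.
vocabulary = ["corpulent", "incendiary", "validate"]

print(suggest("cormulent", vocabulary))
print(jaccard(char_ngrams("cormulent"), char_ngrams("corpulent")))
```

With a full dictionary it usually helps to restrict candidates first, for example to words starting with the same letter, so that fewer comparisons are needed.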

#### ***\*Take Home Concepts 2***

$\qquad$ - <u>NLTK</u> and <u>simple text processing</u> can be used to build a simple spell checker  
$\qquad$ - <u>N-grams</u> can be used to capture word phrases to find "near duplicate" passages   
$\qquad$ - <u>N-grams</u> are also fundamental to many NLP pre-processing steps like *suffix/prefix matching*, *character-level embedding* etc.