# Primitive constructs in Text

* Sentences / input strings 
* Words or Tokens
* Characters
* Documents, larger files

In [3]:
text1 = "Ethics are built right into the ideals and objectives of the United Nations "
text1

'Ethics are built right into the ideals and objectives of the United Nations '

In [4]:
len(text1)

76

What if you want the individual words?

In [5]:
text2 = text1.split(' ')

In [6]:
len(text2)

14

In [7]:
text2

['Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations',
 '']

# Finding specific words

* **Long words: words that are more than 3 letters long**

In [8]:
[w for w in text2 if len(w) > 3 ]

['Ethics',
 'built',
 'right',
 'into',
 'ideals',
 'objectives',
 'United',
 'Nations']

* **Capitalized words**

In [9]:
# .istitle() checks whether a string is title-cased: each word starts
# with an uppercase letter and the rest of its letters are lowercase
[w for w in text2 if w.istitle()]

['Ethics', 'United', 'Nations']
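
The difference between "starts with a capital" and title case shows up on words such as acronyms. Here is a small sketch comparing the two checks; the sample strings are made up for illustration and are not part of the text above.

```python
# Compare a first-character check with str.istitle() on a few sample strings.
samples = ['Ethics', 'USA', 'McDonald', 'United Nations', 'are']

# Map each sample to (first character is uppercase, string is title-cased).
results = {w: (w[:1].isupper(), w.istitle()) for w in samples}

for w, (first_upper, title_case) in results.items():
    print(f'{w!r}: first_upper={first_upper}, istitle={title_case}')
```

If acronyms such as 'USA' should count as capitalized words, the first-character check is probably the better filter.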

* **Words that end with s** 

In [10]:
[w for w in text2 if w.endswith('s')]

['Ethics', 'ideals', 'objectives', 'Nations']

The examples above find individual words that match a condition. Next, let's find the unique words.

## Finding unique words: using set()

In [11]:
text3 = 'To be or not to be'

In [13]:
text4 = text3.split(' ')
text4

['To', 'be', 'or', 'not', 'to', 'be']

In [14]:
len(text4)

6

In [15]:
len(set(text4))

5

In [16]:
set(text4)

{'To', 'be', 'not', 'or', 'to'}

In [21]:
# .lower() converts each token to lowercase
[w.lower() for w in text4]

['to', 'be', 'or', 'not', 'to', 'be']

In [22]:
len(set([w.lower() for w in text4]))

4

In [23]:
set([w.lower() for w in text4])

{'be', 'not', 'or', 'to'}
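
The lowercase-then-deduplicate steps can be wrapped in one helper. This is a minimal sketch; the function name `unique_words` is a hypothetical choice, and it splits on any whitespace instead of a single space.

```python
def unique_words(sentence):
    """Return the set of distinct lowercased tokens in a sentence."""
    # A set comprehension lowercases and deduplicates in one pass.
    return {w.lower() for w in sentence.split()}

vocab = unique_words('To be or not to be')
print(sorted(vocab))   # sorted() gives a stable order for display
print(len(vocab))
```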

## Some word comparison functions

* **s.startswith()**
* **s.endswith()**
* **t in s**
* **s.isupper(); s.islower(); s.istitle()**
* **s.isalpha(); s.isdigit(); s.isalnum()**

s.isalpha() returns True if the string consists only of alphabetic characters.

s.isdigit() returns True if the string consists only of digits.

s.isalnum() returns True if the string consists only of alphabetic characters or digits.

In [34]:
# Let's see some examples
s = '12123'
print(s)
print(s.isdigit())
print('\n')
s = 'abc'
print(s)
print(s.isalpha())
print('\n')
s = 'abc123'
print(s)
print(s.isalpha())
print('\n')
s = '121acv'
print(s)
print(s.isdigit())
print('\n')
s = '12avb'
print(s)
print(s.isalnum())
print('\n')
s = '12avb&&'
print(s)
print(s.isalnum())

12123
True


abc
True


abc123
False


121acv
False


12avb
True


12avb&&
False


# String Operations

* **s.lower(); s.upper(); s.title()**
* **s.split(t)**
* **s.splitlines()** -- splits a string at line boundaries such as '\n'
* **s.join(t)**
* **s.strip(); s.rstrip()**

In [35]:
# remove whitespace from both the beginning and the end
s = ' I am a robot. '
s.strip()

'I am a robot.'

In [36]:
# only the end
s = ' I am a robot. '
s.rstrip()

' I am a robot.'

* **s.find(t); s.rfind(t)**
* **s.replace(u, v)**
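
A short sketch of the operations in this list that are not demonstrated elsewhere in these notes, using an invented three-line string:

```python
s = 'first line\nsecond line\nthird line'

lines = s.splitlines()            # split at line boundaries
print(lines)

rejoined = ' | '.join(lines)      # join is called on the separator string
print(rejoined)

titled = 'to be or not'.title()   # capitalize the first letter of every word
print(titled)

first = s.find('line')            # index of the first occurrence
last = s.rfind('line')            # index of the last occurrence
print(first, last)

replaced = s.replace('line', 'row')
print(replaced)
```

Note that `join` is a method of the separator, not of the list, which is easy to get backwards.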

## From words to characters

In [37]:
text5 = 'ouagadougou'

In [39]:
text6 = text5.split('ou')
text6

['', 'agad', 'g', '']

In [40]:
'ou'.join(text6)

'ouagadougou'

In [43]:
# text5.split('') does not work: it raises ValueError (empty separator)
list(text5)

['o', 'u', 'a', 'g', 'a', 'd', 'o', 'u', 'g', 'o', 'u']

In [44]:
[c for c in text5]

['o', 'u', 'a', 'g', 'a', 'd', 'o', 'u', 'g', 'o', 'u']
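
Because iterating over a string yields its characters, character-level counting needs no splitting at all. A minimal sketch using `collections.Counter` from the standard library, which also shows the error raised by an empty separator:

```python
from collections import Counter

word = 'ouagadougou'

# An empty separator is rejected by str.split.
try:
    word.split('')
    split_failed = False
except ValueError as err:
    split_failed = True
    print('split failed:', err)

# Counter iterates over the string character by character.
char_counts = Counter(word)
print(char_counts)
```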

## Cleaning text

In [45]:
text8 = '    A quick brown fox jumped over the lazy dog. '
text8

'    A quick brown fox jumped over the lazy dog. '

In [46]:
text8.split(' ')

['',
 '',
 '',
 '',
 'A',
 'quick',
 'brown',
 'fox',
 'jumped',
 'over',
 'the',
 'lazy',
 'dog.',
 '']

In [48]:
text9 = text8.strip()
text9

'A quick brown fox jumped over the lazy dog.'

In [49]:
text9.split(' ')

['A', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.']
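
Calling `split()` with no argument is another way to get clean tokens: it splits on runs of whitespace and drops leading and trailing whitespace, so no `strip()` is needed first. A small sketch on the same sentence:

```python
text = '    A quick brown fox jumped over the lazy dog. '

with_separator = text.split(' ')   # explicit separator: empty strings appear
tokens = text.split()              # no separator: whitespace runs collapse

print(with_separator)
print(tokens)
```

This variant also copes with tabs and repeated spaces inside the sentence, which `split(' ')` does not.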

* **Find and replace**

In [50]:
text9

'A quick brown fox jumped over the lazy dog.'

In [51]:
# the index counts every character, including whitespace
text9.find('o')

10

In [53]:
# search from the end: returns the index of the last occurrence
text9.rfind('o')

40

In [54]:
text9.replace('o', 'O')

'A quick brOwn fOx jumped Over the lazy dOg.'

## Handling larger texts

* **Reading files line by line**

In [64]:
f = open('Toefl essay.txt', 'r')
f.readline()

'YOUNG PEOPLE ABILITY TO MANAGE AND PLAN\n'

* **Reading the full file**

In [65]:
f.seek(0)
text12 = f.read()
len(text12)

2093

In [66]:
text13 = text12.splitlines()
len(text13)

13

In [68]:
text13[0]

'YOUNG PEOPLE ABILITY TO MANAGE AND PLAN'

# File operations

* **f = open(filename, mode)**
* r - read; w - write, etc.
* **f.readline(); f.read(); f.read(n)**
* **for line in f: doSomething(line)**
* **f.seek(n)**
* **f.write(message)** -- under writing mode
* **f.close()**
* **f.closed** -- check whether it is closed or not
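
The operations in this list are often combined with a `with` block, which closes the file automatically. The sketch below writes a small throwaway file first so it can run on its own; the file name `sample.txt` and its contents are invented for the example.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('TITLE LINE\nsecond line\nthird line\n')

# The with-block closes the file when it ends, even if an error occurs.
with open(path, 'r', encoding='utf-8') as f:
    first = f.readline()                          # read one line
    f.seek(0)                                     # go back to the start
    lines = [line.rstrip('\n') for line in f]     # iterate line by line

print(repr(first))
print(lines)
print(f.closed)
```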

## Issues with reading text files


In [69]:
f = open('Toefl essay.txt', 'r')
text14 = f.readline()
text14

'YOUNG PEOPLE ABILITY TO MANAGE AND PLAN\n'

* **How do you remove the last newline character?**

In [73]:
# the newline character '\n' counts as whitespace, so rstrip() removes it
text14.rstrip()

'YOUNG PEOPLE ABILITY TO MANAGE AND PLAN'
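
`rstrip()` with no argument removes every kind of trailing whitespace. If only the newline should go, pass it explicitly. A small sketch on an invented line:

```python
line = '  indented title \t\n'

all_stripped = line.rstrip()          # removes trailing space, tab and newline
newline_only = line.rstrip('\n')      # removes only trailing newline characters

print(repr(all_stripped))
print(repr(newline_only))
```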

## Take home concepts

* **Handling text sentences**
* **Splitting sentences into words, words into characters**
* **Finding unique words**
* **Handling text from documents**

# What is Natural Language?

* **Language used for everyday communication by humans**

*e.g. English, 中文，にほんご...*

* **In contrast to artificial computer languages such as Python and R**

# What is Natural Language Processing?

* **Any computation, manipulation of natural language**

* **Natural languages evolve**
 - **new words get added**
 - **old words lose popularity**
 - **meanings of words change**
 - **language rules themselves may change**

## NLP Tasks: A Broad Spectrum

* **Counting words, counting frequency of words**
* **Finding sentence boundaries**
* **Part of speech tagging**
* **Parsing the sentence structure**
* **Identifying semantic roles**
* **Identifying entities in a sentence**
* **Finding which pronoun refers to which entity**

# An Introduction to NLTK

* **NLTK: Natural Language Toolkit**
* **Open source library in Python**

### Advantages of NLTK:
* **Has support for most NLP tasks**
* **Also provides access to numerous text corpora**

In [74]:
import nltk

In [75]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [77]:
# load the example texts and sentences from the NLTK book
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [78]:
text1

<Text: Moby Dick by Herman Melville 1851>

In [79]:
sents()

sent1: Call me Ishmael .
sent2: The family of Dashwood had long been settled in Sussex .
sent3: In the beginning God created the heaven and the earth .
sent4: Fellow - Citizens of the Senate and of the House of Representatives :
sent5: I have a problem with people PMing me to lol JOIN
sent6: SCENE 1 : [ wind ] [ clop clop clop ] KING ARTHUR : Whoa there !
sent7: Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
sent8: 25 SEXY MALE , seeks attrac older single lady , for discreet encounters .
sent9: THE suburb of Saffron Park lay on the sunset side of London , as red and ragged as a cloud of sunset .


In [81]:
sent1

['Call', 'me', 'Ishmael', '.']

* **Counting vocabulary of words**

In [82]:
text7

<Text: Wall Street Journal>

In [84]:
# show one sentence from text7
sent7

['Pierre',
 'Vinken',
 ',',
 '61',
 'years',
 'old',
 ',',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'Nov.',
 '29',
 '.']

In [86]:
len(sent7)

18

In [87]:
len(text7)

100676

In [88]:
len(set(text7))

12408

In [91]:
list(set(text7))[:10]

['Otero',
 '1.5805',
 'regarded',
 'settlements',
 'LONDON',
 'Komatsu',
 'Nelson',
 'amazingly',
 '645,000',
 'fresh']

* **Frequency of words**

In [93]:
# dist: stands for distribution
dist = FreqDist(text7)
len(dist)

12408

In [100]:
vocabl = dist.keys()
list(vocabl)[:10]

['Pierre', 'Vinken', ',', '61', 'years', 'old', 'will', 'join', 'the', 'board']

In [104]:
dist['four']

20

In [108]:
# keep only words that are longer than 5 characters and occur more than 100 times
freqwords = [w for w in vocabl if len(w) > 5 and dist[w] > 100]
freqwords

['billion',
 'company',
 'president',
 'because',
 'market',
 'million',
 'shares',
 'trading',
 'program']
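
The same counting and filtering can be tried without downloading a corpus. In NLTK 3, `FreqDist` is a subclass of `collections.Counter`, so the standard-library `Counter` behaves much the same for this purpose. The sketch below uses an invented sentence as its data and a lower frequency threshold than the cell above.

```python
from collections import Counter

# Invented sample text; any list of tokens works here.
tokens = ('the market fell because the company said the shares '
          'of the company lost value in the market').split()

counts = Counter(tokens)          # token -> number of occurrences
print(counts['the'])
print(counts.most_common(3))

# Same filter as above, with a threshold suited to this tiny sample.
frequent_long = [w for w in counts if len(w) > 5 and counts[w] >= 2]
print(frequent_long)
```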

## Normalization and Stemming

* **Different forms of the same "word"**

In [109]:
input1 = 'List listed lists listing listings'

In [111]:
words1 = input1.lower().split(' ')
words1

['list', 'listed', 'lists', 'listing', 'listings']

In [124]:
# Stemming reduces a word to its root form (stem); the stem need not be a valid word.
porter = nltk.PorterStemmer()

In [114]:
[porter.stem(t) for t in words1]

['list', 'list', 'list', 'list', 'list']
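
To get a feel for what a stemmer does, here is a toy suffix stripper. It is only an illustration under simple assumptions (a hand-picked suffix list and a minimum stem length) and is not the Porter algorithm that `nltk.PorterStemmer` implements.

```python
SUFFIXES = ('ings', 'ing', 'ed', 's')   # checked longest first

def toy_stem(word, min_stem=3):
    """Strip one known suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

stems = [toy_stem(w) for w in 'List listed lists listing listings'.split()]
print(stems)

# Blind suffix stripping can produce stems that are not real words.
print(toy_stem('this'), toy_stem('bed'))
```

The stem produced for 'this' shows why rule-based stemming can yield non-words.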

# Lemmatization

In [123]:
# Lemmatization: Stemming, but resulting stems are all valid words.
udhr = nltk.corpus.udhr.words('English-Latin1')
udhr[:20]

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'rights',
 'of']

In [125]:
[porter.stem(t) for t in udhr[:20]]

['univers',
 'declar',
 'of',
 'human',
 'right',
 'preambl',
 'wherea',
 'recognit',
 'of',
 'the',
 'inher',
 'digniti',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalien',
 'right',
 'of']

* **Lemmatization: Stemming, but resulting stems are all valid words**

In [126]:
WNLemma = nltk.WordNetLemmatizer()

In [127]:
[WNLemma.lemmatize(t) for t in udhr[:20]]

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'right',
 'of']

## Tokenization

* **Recall splitting a sentence into words / tokens**

In [130]:
text1 = 'Children shouldn\'t drink a sugary drink before bed.'
text1.split(' ')

['Children', "shouldn't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed.']

* **NLTK has a built-in tokenizer**

In [131]:
nltk.word_tokenize(text1)

['Children',
 'should',
 "n't",
 'drink',
 'a',
 'sugary',
 'drink',
 'before',
 'bed',
 '.']

In [132]:
# tokenize based on punctuation
nltk.wordpunct_tokenize(text1)

['Children',
 'shouldn',
 "'",
 't',
 'drink',
 'a',
 'sugary',
 'drink',
 'before',
 'bed',
 '.']
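
Punctuation-based tokenization can be approximated with a single regular expression from the standard library. The pattern below alternates between runs of word characters and runs of punctuation; for this sentence it gives the same tokens as the `wordpunct_tokenize` output above, though it is offered as a sketch rather than as NLTK's implementation.

```python
import re

sentence = "Children shouldn't drink a sugary drink before bed."

# \w+ matches a run of word characters; [^\w\s]+ matches a run of punctuation.
tokens = re.findall(r"\w+|[^\w\s]+", sentence)
print(tokens)
```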

## Sentence Splitting

### How do you split sentences from a long text string?

In [134]:
text12 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!"
text12

'This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!'

In [135]:
nltk.sent_tokenize(text12)

['This is the first sentence.',
 'A gallon of milk in the U.S. costs $2.99.',
 'Is this the third sentence?',
 'Yes, it is!']
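
To see why a dedicated sentence tokenizer helps, compare it with a naive rule that splits after every '.', '!' or '?' followed by whitespace. This is a sketch of the naive approach only; it breaks the second sentence at the abbreviation 'U.S.'.

```python
import re

text = ("This is the first sentence. A gallon of milk in the U.S. costs $2.99. "
        "Is this the third sentence? Yes, it is!")

# Split at whitespace that follows sentence-ending punctuation.
naive = re.split(r'(?<=[.!?])\s+', text)

for piece in naive:
    print(repr(piece))
print(len(naive))
```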

# NLP Tasks

* **Part of speech tagging**
* **Parsing the sentence structure**

## Part-of-speech (POS) Tagging
* **Recall high school grammar: nouns, verbs, adjectives, ...**

* **Reference**

[Alphabetical list of part-of-speech tags used in the Penn Treebank Project](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

In [1]:
import nltk

In [5]:
# list every tag in the tagset
# e.g. nltk.help.upenn_tagset()

# look up a particular tag
nltk.help.upenn_tagset('MD')

MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would


In [14]:
text11 = 'Children shouldn\'t drink a sugary drink before bed.'
text12 = nltk.word_tokenize(text11)
print(nltk.pos_tag(text12))
nltk.help.upenn_tagset('IN')  # prints the description itself and returns None

[('Children', 'NNP'), ('should', 'MD'), ("n't", 'RB'), ('drink', 'VB'), ('a', 'DT'), ('sugary', 'JJ'), ('drink', 'NN'), ('before', 'IN'), ('bed', 'NN'), ('.', '.')]
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...


### Ambiguity in POS Tagging

* **Ambiguity is common in English**

In [17]:
text14 = 'Visiting aunts can be a nuisance'
nltk.pos_tag(nltk.word_tokenize(text14))  # not echoed: only the last expression in a cell is displayed
nltk.help.upenn_tagset('VBG')

VBG: verb, present participle or gerund
    telegraphing stirring focusing angering judging stalling lactating
    hankerin' alleging veering capping approaching traveling besieging
    encrypting interrupting erasing wincing ...


## Parsing Sentence Structure
* **Making sense of sentences is easy if they follow a well-defined grammatical structure**

In [31]:
text15 = nltk.word_tokenize('Alice loves Bob')
# create a context-free grammar
grammer = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Alice' | 'Bob' | 'Jane'
V -> 'loves'
""")

In [32]:
parser = nltk.ChartParser(grammer)
trees = parser.parse_all(text15)
for tree in trees:
    print(tree)

(S (NP Alice) (VP (V loves) (NP Bob)))


In [33]:
text16 = nltk.word_tokenize('Alice loves Jane')
trees = parser.parse_all(text16)
for tree in trees:
    print(tree)

(S (NP Alice) (VP (V loves) (NP Jane)))


### Ambiguity in Parsing
* **Ambiguity may exist even if sentences are grammatically correct!**

In [59]:
#nltk.help.upenn_tagset()

In [58]:
text17 = nltk.word_tokenize('I saw a man with a telescope.')
garmmer = nltk.CFG.fromstring("""
S -> N VP
VP -> V NP | VP PP
PP -> P NP SYM
NP -> DT N | NP PP
N -> 'I' | 'man' | 'telescope'
DT -> 'the' | 'a'
V -> 'saw'
P -> 'with'
SYM -> '.'
""")
parser = nltk.ChartParser(garmmer)
trees = parser.parse_all(text17)
for tree in trees:
    print(tree)

(S
  (N I)
  (VP
    (VP (V saw) (NP (DT a) (N man)))
    (PP (P with) (NP (DT a) (N telescope)) (SYM .))))
(S
  (N I)
  (VP
    (V saw)
    (NP
      (NP (DT a) (N man))
      (PP (P with) (NP (DT a) (N telescope)) (SYM .)))))
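
The two trees differ only in where the prepositional phrase attaches: to the verb phrase or to the noun phrase. The number of parses can also be counted with a small CYK-style chart in plain Python. This is a sketch under a simplifying assumption: it uses a copy of the grammar with the `SYM` rule dropped, so every rule has exactly two symbols on the right-hand side. It is not how `nltk.ChartParser` works internally, and the names `BINARY_RULES`, `LEXICON` and `count_parses` are hypothetical.

```python
from collections import defaultdict

# Simplified copy of the grammar above (no SYM), as (parent, left, right).
BINARY_RULES = [
    ('S', 'N', 'VP'),
    ('VP', 'V', 'NP'),
    ('VP', 'VP', 'PP'),
    ('PP', 'P', 'NP'),
    ('NP', 'DT', 'N'),
    ('NP', 'NP', 'PP'),
]
LEXICON = {
    'I': ['N'], 'man': ['N'], 'telescope': ['N'],
    'the': ['DT'], 'a': ['DT'], 'saw': ['V'], 'with': ['P'],
}

def count_parses(tokens, start='S'):
    """Count the distinct parse trees for tokens rooted at start."""
    n = len(tokens)
    # chart[(i, j, symbol)] = number of trees for symbol over tokens[i:j]
    chart = defaultdict(int)
    for i, tok in enumerate(tokens):
        for sym in LEXICON.get(tok, []):
            chart[(i, i + 1, sym)] += 1
    # Build longer spans from every split point of shorter spans.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    chart[(i, j, parent)] += chart[(i, k, left)] * chart[(k, j, right)]
    return chart[(0, n, start)]

ambiguous = count_parses('I saw a man with a telescope'.split())
simple = count_parses('I saw a man'.split())
print(ambiguous, simple)
```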


In [64]:
# load a grammar written to a .cfg file in advance - CFG stands for context-free grammar
grammer = nltk.data.load('mygrammer.cfg')
grammer
parser = nltk.ChartParser(grammer)
trees = parser.parse_all(text17)
for tree in trees:
    print(tree)

(S
  (N I)
  (VP
    (VP (V saw) (NP (DT a) (N man)))
    (PP (P with) (NP (DT a) (N telescope)) (SYM .))))
(S
  (N I)
  (VP
    (V saw)
    (NP
      (NP (DT a) (N man))
      (PP (P with) (NP (DT a) (N telescope)) (SYM .)))))


* **check here** [What is CFG?](https://www.cs.rochester.edu/~nelson/courses/csc_173/grammars/cfg.html)

## NLTK and Parse Tree Collection

In [66]:
from nltk.corpus import treebank
text17 = treebank.parsed_sents('wsj_0001.mrg')[0]
print(text17)

(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))


### POS Tagging & Parsing Complexity

* **Uncommon usages of words**

In [67]:
text18 = nltk.word_tokenize('The old man the boat.')
nltk.pos_tag(text18)

[('The', 'DT'),
 ('old', 'JJ'),
 ('man', 'NN'),
 ('the', 'DT'),
 ('boat', 'NN'),
 ('.', '.')]

## Take Home Concepts

* **POS tagging provides insights into the word classes / types in a sentence**
* **Parsing the grammatical structures helps derive meaning**
* **Both tasks are difficult, and linguistic ambiguity makes them even harder**
* **Better models could be learned with supervised learning**
* **NLTK provides access to tools and data for training**