# SCS 3546 Week 6 - Natural Language Processing (NLP)
(NLP Part 1)

In [None]:
from IPython.display import Image

# Introduction

- Develop some familiarity with key concepts in NLP
- Understand regular expressions, n-grams, annotators, word and document representation, language and topic modeling
- Have a look at some of the features of the Natural Language Toolkit (NLTK), spaCy, gensim, CoreNLP and Keras

## What is Natural Language Processing?

- Natural Language Processing (NLP) is the study of computational treatment of natural (human) language

- NLP applications include:
  - Search: Web, documents, autocomplete
  - Editing: Spelling, grammar, style
  - Dialog: Chatbots, assistants
  - Writing: Index, concordance, table of contents
  - Email: Spam filter, classification, prioritization
  - Text mining: Summarization, knowledge extraction, medical diagnosis
  - Law: Legal inference, precedent search, subpoena classification
  - News: Event detection, fact checking, headline composition
  - Attribution: Plagiarism detection, literary forensics, style coaching
  - Sentiment analysis: Community morale monitoring, product review triage, customer care
  - Behavior prediction: Finance, election forecasting, marketing
  - Creative writing: Movie scripts, poetry, song lyrics
  
Source: Natural Language Processing in Action

## Ambiguity in Natural Language

Ambiguity is one of the main obstacles to interpreting natural language, because the same words can represent more than one concept. Each of the following has at least two readings:

- "Teacher Strikes Idle Kids"
- "Eats shoots and leaves"
- "Stolen Painting Found by Tree"
- "Local High School Dropouts Cut in Half"


Grammar is a description, not a prescription

## Other Issues
- **Synonyms**: words or phrases that mean exactly or nearly the same as another word or phrase in the same language
- **Homonyms**: words that sound alike or are spelled alike but have different meanings
- **Misspellings**: incorrect spellings of a word
- **Sarcasm**: the use of irony to mock or convey contempt
- **Allegory**: a story, poem, or picture that can be interpreted to reveal a hidden meaning, typically a moral or political one
- **Dialects**: particular forms of a language that are peculiar to a specific region or social group

## Corpus

In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed).

<table border="1" class="docutils" id="tab-corpora">
<colgroup>
<col width="31%">
<col width="18%">
<col width="51%">
</colgroup>
<thead valign="bottom">
<tr><th class="head">Corpus</th>
<th class="head">Compiler</th>
<th class="head">Contents</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>Brown Corpus</td>
<td>Francis, Kucera</td>
<td>15 genres, 1.15M words, tagged, categorized</td>
</tr>
<tr><td>CESS Treebanks</td>
<td>CLiC-UB</td>
<td>1M words, tagged and parsed (Catalan, Spanish)</td>
</tr>
<tr><td>Chat-80 Data Files</td>
<td>Pereira &amp; Warren</td>
<td>World Geographic Database</td>
</tr>
<tr><td>CMU Pronouncing Dictionary</td>
<td>CMU</td>
<td>127k entries</td>
</tr>
<tr><td>CoNLL 2000 Chunking Data</td>
<td>CoNLL</td>
<td>270k words, tagged and chunked</td>
</tr>
<tr><td>CoNLL 2002 Named Entity</td>
<td>CoNLL</td>
<td>700k words, pos- and named-entity-tagged (Dutch, Spanish)</td>
</tr>
<tr><td>CoNLL 2007 Dependency Treebanks (sel)</td>
<td>CoNLL</td>
<td>150k words, dependency parsed (Basque, Catalan)</td>
</tr>
<tr><td>Dependency Treebank</td>
<td>Narad</td>
<td>Dependency parsed version of Penn Treebank sample</td>
</tr>
<tr><td>Floresta Treebank</td>
<td>Diana Santos et al</td>
<td>9k sentences, tagged and parsed (Portuguese)</td>
</tr>
<tr><td>Gazetteer Lists</td>
<td>Various</td>
<td>Lists of cities and countries</td>
</tr>
<tr><td>Genesis Corpus</td>
<td>Misc web sources</td>
<td>6 texts, 200k words, 6 languages</td>
</tr>
<tr><td>Gutenberg (selections)</td>
<td>Hart, Newby, et al</td>
<td>18 texts, 2M words</td>
</tr>
<tr><td>Inaugural Address Corpus</td>
<td>CSpan</td>
<td>US Presidential Inaugural Addresses (1789-present)</td>
</tr>
<tr><td>Indian POS-Tagged Corpus</td>
<td>Kumaran et al</td>
<td>60k words, tagged (Bangla, Hindi, Marathi, Telugu)</td>
</tr>
<tr><td>MacMorpho Corpus</td>
<td>NILC, USP, Brazil</td>
<td>1M words, tagged (Brazilian Portuguese)</td>
</tr>
<tr><td>Movie Reviews</td>
<td>Pang, Lee</td>
<td>2k movie reviews with sentiment polarity classification</td>
</tr>
<tr><td>Names Corpus</td>
<td>Kantrowitz, Ross</td>
<td>8k male and female names</td>
</tr>
<tr><td>NIST 1999 Info Extr (selections)</td>
<td>Garofolo</td>
<td>63k words, newswire and named-entity SGML markup</td>
</tr>
<tr><td>NPS Chat Corpus</td>
<td>Forsyth, Martell</td>
<td>10k IM chat posts, POS-tagged and dialogue-act tagged</td>
</tr>
<tr><td>PP Attachment Corpus</td>
<td>Ratnaparkhi</td>
<td>28k prepositional phrases, tagged as noun or verb modifiers</td>
</tr>
<tr><td>Proposition Bank</td>
<td>Palmer</td>
<td>113k propositions, 3300 verb frames</td>
</tr>
<tr><td>Question Classification</td>
<td>Li, Roth</td>
<td>6k questions, categorized</td>
</tr>
<tr><td>Reuters Corpus</td>
<td>Reuters</td>
<td>1.3M words, 10k news documents, categorized</td>
</tr>
<tr><td>Roget's Thesaurus</td>
<td>Project Gutenberg</td>
<td>200k words, formatted text</td>
</tr>
<tr><td>RTE Textual Entailment</td>
<td>Dagan et al</td>
<td>8k sentence pairs, categorized</td>
</tr>
<tr><td>SEMCOR</td>
<td>Rus, Mihalcea</td>
<td>880k words, part-of-speech and sense tagged</td>
</tr>
<tr><td>Senseval 2 Corpus</td>
<td>Pedersen</td>
<td>600k words, part-of-speech and sense tagged</td>
</tr>
<tr><td>Shakespeare texts (selections)</td>
<td>Bosak</td>
<td>8 books in XML format</td>
</tr>
<tr><td>State of the Union Corpus</td>
<td>CSPAN</td>
<td>485k words, formatted text</td>
</tr>
<tr><td>Stopwords Corpus</td>
<td>Porter et al</td>
<td>2,400 stopwords for 11 languages</td>
</tr>
<tr><td>Swadesh Corpus</td>
<td>Wiktionary</td>
<td>comparative wordlists in 24 languages</td>
</tr>
<tr><td>Switchboard Corpus (selections)</td>
<td>LDC</td>
<td>36 phonecalls, transcribed, parsed</td>
</tr>
<tr><td>Univ Decl of Human Rights</td>
<td>United Nations</td>
<td>480k words, 300+ languages</td>
</tr>
<tr><td>Penn Treebank (selections)</td>
<td>LDC</td>
<td>40k words, tagged and parsed</td>
</tr>
<tr><td>TIMIT Corpus (selections)</td>
<td>NIST/LDC</td>
<td>audio files and transcripts for 16 speakers</td>
</tr>
<tr><td>VerbNet 2.1</td>
<td>Palmer et al</td>
<td>5k verbs, hierarchically organized, linked to WordNet</td>
</tr>
<tr><td>Wordlist Corpus</td>
<td>OpenOffice.org et al</td>
<td>960k words and 20k affixes for 8 languages</td>
</tr>
<tr><td>WordNet 3.0 (English)</td>
<td>Miller, Fellbaum</td>
<td>145k synonym sets</td>
</tr>
</tbody>


</table>

[Source: Natural Language Processing with Python, Steven Bird, Ewan Klein, and Edward Loper](https://www.nltk.org/book/ch02.html)

### Stop Words

Stop words are generally the most common words in a language; there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. 

In [None]:
import warnings
warnings.filterwarnings("ignore")  # suppress warnings in the notebook output

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stopwords.words('english')[:10]

  


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
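
As a rough illustration of what a stop list is used for, the sketch below filters stop words out of a token list. The stop list here is a small hand-picked set chosen for this example (an assumption, not NLTK's list), and `remove_stop_words` is a hypothetical helper name.

```python
# Hypothetical, hand-picked stop list for illustration only;
# NLTK's English list is considerably longer.
STOP_WORDS = {"a", "for", "in", "is", "no", "of", "the"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens whose lower-cased form is not in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ("strange women lying in ponds distributing swords "
          "is no basis for a system of government").split()
content_tokens = remove_stop_words(tokens)
print(content_tokens)
```

Note that dropping "no" reverses the meaning of the sentence, which is one reason why not every tool applies a stop list.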

### WordNet
WordNet is **a large lexical database of English**. Nouns, verbs, adjectives and adverbs are grouped into **sets of cognitive synonyms (synsets)**, each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. 

<img src="https://www.nltk.org/images/wordnet-hierarchy.png">

[Source Natural Language Processing with Python Steven Bird, Ewan Klein, and Edward Loper](https://www.nltk.org/book/ch02.html)


In [None]:
nltk.download('wordnet')
from nltk.corpus import wordnet
motorcar = wordnet.synset('car.n.01')
types_of_motorcar = motorcar.hyponyms()
print(types_of_motorcar[0])

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
Synset('ambulance.n.01')


In [None]:
print(sorted(lemma.name() for synset in types_of_motorcar for lemma in synset.lemmas()))

['Model_T', 'S.U.V.', 'SUV', 'Stanley_Steamer', 'ambulance', 'beach_waggon', 'beach_wagon', 'bus', 'cab', 'compact', 'compact_car', 'convertible', 'coupe', 'cruiser', 'electric', 'electric_automobile', 'electric_car', 'estate_car', 'gas_guzzler', 'hack', 'hardtop', 'hatchback', 'heap', 'horseless_carriage', 'hot-rod', 'hot_rod', 'jalopy', 'jeep', 'landrover', 'limo', 'limousine', 'loaner', 'minicar', 'minivan', 'pace_car', 'patrol_car', 'phaeton', 'police_car', 'police_cruiser', 'prowl_car', 'race_car', 'racer', 'racing_car', 'roadster', 'runabout', 'saloon', 'secondhand_car', 'sedan', 'sport_car', 'sport_utility', 'sport_utility_vehicle', 'sports_car', 'squad_car', 'station_waggon', 'station_wagon', 'stock_car', 'subcompact', 'subcompact_car', 'taxi', 'taxicab', 'tourer', 'touring_car', 'two-seater', 'used-car', 'waggon', 'wagon']


### Gutenberg

In [None]:
nltk.download('gutenberg')
from nltk.corpus import gutenberg

gutenberg.fileids()

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [None]:
gutenberg.raw('blake-poems.txt')[0:100]

'[Poems by William Blake 1789]\n\n \nSONGS OF INNOCENCE AND OF EXPERIENCE\nand THE BOOK of THEL\n\n\n SONGS '

### Web Text

In [None]:
from nltk.corpus import webtext
nltk.download('webtext')
webtext.fileids()

[nltk_data] Downloading package webtext to /root/nltk_data...
[nltk_data]   Unzipping corpora/webtext.zip.


['firefox.txt',
 'grail.txt',
 'overheard.txt',
 'pirates.txt',
 'singles.txt',
 'wine.txt']

In [None]:
webtext.sents('firefox.txt')

[['Cookie', 'Manager', ':', '"', 'Don', "'", 't', 'allow', 'sites', 'that', 'set', 'removed', 'cookies', 'to', 'set', 'future', 'cookies', '"', 'should', 'stay', 'checked', 'When', 'in', 'full', 'screen', 'mode', 'Pressing', 'Ctrl', '-', 'N', 'should', 'open', 'a', 'new', 'browser', 'when', 'only', 'download', 'dialog', 'is', 'left', 'open', 'add', 'icons', 'to', 'context', 'menu', 'So', 'called', '"', 'tab', 'bar', '"', 'should', 'be', 'made', 'a', 'proper', 'toolbar', 'or', 'given', 'the', 'ability', 'collapse', '/', 'expand', '.'], ['[', 'XUL', ']', 'Implement', 'Cocoa', '-', 'style', 'toolbar', 'customization', '.'], ...]

### Brown Text

15 genres, 1.15M words, tagged, categorized

In [None]:
from nltk.corpus import brown
nltk.download('brown')
brown.categories()

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

In [None]:
brown.words(categories='news')

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

In [None]:
cfd = nltk.ConditionalFreqDist(
           (genre, word)
           for genre in brown.categories()
           for word in brown.words(categories=genre))

In [None]:
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)

                  can could   may might  must  will 
           news    93    86    66    38    50   389 
       religion    82    59    78    12    54    71 
        hobbies   268    58   131    22    83   264 
science_fiction    16    49     4    12     8    16 
        romance    74   193    11    51    45    43 
          humor    16    30     8     8     9    13 


### Pronouncing Dictionary

In [None]:
nltk.download('cmudict')
entries = nltk.corpus.cmudict.entries()
entries[300:310]

[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Unzipping corpora/cmudict.zip.


[('abreu', ['AH0', 'B', 'R', 'UW1']),
 ('abridge', ['AH0', 'B', 'R', 'IH1', 'JH']),
 ('abridged', ['AH0', 'B', 'R', 'IH1', 'JH', 'D']),
 ('abridgement', ['AH0', 'B', 'R', 'IH1', 'JH', 'M', 'AH0', 'N', 'T']),
 ('abridges', ['AH0', 'B', 'R', 'IH1', 'JH', 'AH0', 'Z']),
 ('abridging', ['AH0', 'B', 'R', 'IH1', 'JH', 'IH0', 'NG']),
 ('abril', ['AH0', 'B', 'R', 'IH1', 'L']),
 ('abroad', ['AH0', 'B', 'R', 'AO1', 'D']),
 ('abrogate', ['AE1', 'B', 'R', 'AH0', 'G', 'EY2', 'T']),
 ('abrogated', ['AE1', 'B', 'R', 'AH0', 'G', 'EY2', 'T', 'IH0', 'D'])]
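
In these entries each vowel phone ends in a stress digit (0 = no stress, 1 = primary, 2 = secondary), so the number of syllables can be estimated by counting the phones that end in a digit. The sketch below does this for two entries copied from the output above; `count_syllables` and `stress_pattern` are hypothetical helper names.

```python
# Two entries copied from the output above: (word, list of phones)
entries = [
    ('abridge', ['AH0', 'B', 'R', 'IH1', 'JH']),
    ('abridgement', ['AH0', 'B', 'R', 'IH1', 'JH', 'M', 'AH0', 'N', 'T']),
]

def count_syllables(phones):
    # Vowel phones carry a trailing stress digit; consonants do not
    return sum(1 for phone in phones if phone[-1].isdigit())

def stress_pattern(phones):
    # Sequence of stress digits, one per vowel phone
    return [int(phone[-1]) for phone in phones if phone[-1].isdigit()]

syllables = {word: count_syllables(phones) for word, phones in entries}
print(syllables)
print(stress_pattern(entries[1][1]))
```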

### Comparative Wordlists

In [None]:
from nltk.corpus import swadesh
nltk.download('swadesh')
swadesh.fileids()[:10]

[nltk_data] Downloading package swadesh to /root/nltk_data...
[nltk_data]   Unzipping corpora/swadesh.zip.


['be', 'bg', 'bs', 'ca', 'cs', 'cu', 'de', 'en', 'es', 'fr']

In [None]:
fr2en = swadesh.entries(['fr', 'en'])
len(fr2en) , fr2en[3]

(207, ('nous', 'we'))

## Tokenization

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.

### Concept
- Split on white space (in most languages)
- **Regular expressions** and **finite state automata** are often used
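
The two approaches in this list can be compared with a minimal sketch that uses only the standard library. The pattern below is an assumption made for this example (a word with an optional internal apostrophe, or a single punctuation mark); real tokenizers handle many more cases.

```python
import re

text = "There's a moon in the sky. It's called The Moon."

# Naive approach: split on whitespace; punctuation stays glued to words
whitespace_tokens = text.split()

# Regex approach: a word (optionally with an internal apostrophe)
# or a single non-space, non-word character such as punctuation
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(text)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, "sky." and "Moon." keep their trailing period; the regex version separates the period into its own token while keeping contractions such as "It's" together.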


### Issues in Tokenization

Consider: "There's a moon in the sky. It's called The Moon." (The B-52's)

- Special cases: Names (particularly multi-word), initials, hyphenated words, abbreviations, special forms (dates, phone numbers, URLs, etc.)
- Punctuation
- Contractions: is "isn't" one word or two? (Many tokenizers treat it as two.)
- Named Entities
- Rare words
- Stop words

- Other languages
  - German: compound words such as "Schadenfreude"
  - Chinese: words average about 2 characters each; greedy segmentation works quite well, but there are better algorithms
  - Japanese: several scripts are mixed (kanji, hiragana, katakana, romaji)

### Popular Python tokenizers:
  - NLTK
  - SpaCy
  - Keras
  - CoreNLP

### Example

In [None]:
import nltk
from nltk import word_tokenize
from nltk import sent_tokenize
nltk.download('punkt')
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = sent_tokenize(raw)
print(tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
['DENNIS: Listen, strange women lying in ponds distributing swords\nis no basis for a system of government.', 'Supreme executive power derives from\na mandate from the masses, not from some farcical aquatic ceremony.']


In [None]:
tokens = word_tokenize(raw)
print(tokens)

['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'lying', 'in', 'ponds', 'distributing', 'swords', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', 'masses', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']


## Normalization

The idea is to replace similar words with a single token, which shrinks the vocabulary and therefore the size of the word vectors. This is a kind of dimensionality reduction: it reduces the processing cost and the likelihood of overfitting.


### Normalization Techniques
- Case folding: Force all lower case 
  - Can make named entity resolution more difficult
  - Becoming less common as a result
- Stemming: Crude chopping of affixes
  - e.g. remove -s or -ing at end
  - Can cut vocabulary size in half (or more if aggressive)
  - Many algorithms: Porter's is the most common English stemmer and has a lot of knowledge of English hardcoded into it
  - Useful for search where we are looking for similar, not exact matches (will improve recall but reduce precision)
- Lemmatization: Extraction of the base form
  - e.g. “are”, “am”, “is” replaced with “be”
  - Better for most applications than stemmers which might take “better” and convert it to “bet”
- Hashtag expansions
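
To make the first two techniques concrete, here is a deliberately crude sketch that combines case folding with suffix chopping. The suffix list and the minimum stem length of 3 are assumptions chosen for this example; this is not the Porter algorithm.

```python
def crude_stem(word):
    """Lower-case a word and chop one common suffix, if any."""
    word = word.lower()  # case folding
    for suffix in ("ing", "es", "s"):
        # Only strip when at least 3 characters of stem remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["Swords", "distributing", "masses", "basis", "is"]
stems = [crude_stem(w) for w in words]
print(stems)
```

The result for "basis" shows the cost of this approach: the stem is not a word and could collide with unrelated stems, which is the loss of precision mentioned above.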

### When to Use Stemming vs. Lemmatization
- If you want the recall benefit of stemming, try putting a lemmatizer before the stemmer
- The NLTK lemmatizer uses the Princeton WordNet graph of word meanings
- Newer packages like spaCy don't provide a stemmer, only a lemmatizer
- Stemmers and lemmatizers (like stop words) are used less and less as computers become more powerful

### Example

In [None]:
porter = nltk.PorterStemmer()
print([porter.stem(t) for t in tokens])

['denni', ':', 'listen', ',', 'strang', 'women', 'lie', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem', 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']


In [None]:
lancaster = nltk.LancasterStemmer()
print([lancaster.stem(t) for t in tokens])

['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem', 'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not', 'from', 'som', 'farc', 'aqu', 'ceremony', '.']


In [None]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Map a Penn Treebank POS tag to the corresponding WordNet POS constant
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None  # no WordNet equivalent (punctuation, determiners, ...)

tagged_sent = pos_tag(tokens)  # (token, Penn Treebank tag) pairs
lmtzr = WordNetLemmatizer()
# Lemmatize only the tokens whose tag maps to a WordNet part of speech
[(word, lmtzr.lemmatize(word, get_wordnet_pos(tag)))
 for word, tag in tagged_sent if get_wordnet_pos(tag) is not None]

[('DENNIS', 'DENNIS'),
 ('Listen', 'Listen'),
 ('strange', 'strange'),
 ('women', 'woman'),
 ('lying', 'lie'),
 ('ponds', 'pond'),
 ('distributing', 'distribute'),
 ('swords', 'sword'),
 ('is', 'be'),
 ('basis', 'basis'),
 ('system', 'system'),
 ('government', 'government'),
 ('Supreme', 'Supreme'),
 ('executive', 'executive'),
 ('power', 'power'),
 ('derives', 'derive'),
 ('mandate', 'mandate'),
 ('masses', 'mass'),
 ('not', 'not'),
 ('farcical', 'farcical'),
 ('aquatic', 'aquatic'),
 ('ceremony', 'ceremony')]

## Annotators 

<img src="https://stanfordnlp.github.io/CoreNLP/assets/images/pipeline.png">

[source CoreNLP Website](https://stanfordnlp.github.io/CoreNLP/)


You can try different annotators [here](http://corenlp.run/).

For spaCy, run the following commands:

* `pip install spacy`
* `python -m spacy download en_core_web_sm`

### POS (Part Of Speech) Tagger 

Part-of-speech tagging (POS tagging or PoS tagging or POST), also called **grammatical tagging or word-category disambiguation**, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. [source](https://en.wikipedia.org/wiki/Part-of-speech_tagging)


<img src="https://drive.google.com/uc?id=19I5kJPRfQ6OdnWVHr9BqYWYB11B9Mrcl">



 

In [None]:
text = word_tokenize("They refuse to permit us to obtain the refuse permit")
nltk.download('averaged_perceptron_tagger')
nltk.pos_tag(text)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]

In [None]:
nltk.download('tagsets')
nltk.help.upenn_tagset('DT')

[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those


### Named Entity Recognition

**Named-entity recognition (NER)** is a subtask of information extraction that seeks to locate and classify named entity mentions in unstructured text. [source](https://en.wikipedia.org/wiki/Named-entity_recognition)

In information extraction, a **named entity** is a **real-world object**, such as persons, locations, organizations, products, etc., that can be denoted with a proper name. It can be abstract or have a physical existence. Examples of named entities include Barack Obama, New York City, Volkswagen Golf, or anything else that can be named. [source](https://en.wikipedia.org/wiki/Named_entity)

<img src="https://stanfordnlp.github.io/CoreNLP/assets/images/ner.png">

### Constituent Parsing

Constituent parsing is the task of **recognizing a sentence and assigning a syntactic structure** to it.[source](https://stanfordnlp.github.io/CoreNLP/parse.html) 


[Source Natural Language Processing with Python Steven Bird, Ewan Klein, and Edward Loper](https://www.nltk.org/book/ch07.html)

One of the common problems is **ubiquitous ambiguity**: a single sentence can often be assigned more than one valid syntactic structure.



[Source Natural Language Processing with Python Steven Bird, Ewan Klein, and Edward Loper](https://www.nltk.org/book/ch07.html)
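
Ambiguity of this kind can be reproduced with a small context-free grammar. The sketch below uses a toy grammar for the well-known sentence "I shot an elephant in my pajamas", as presented in the NLTK book; the grammar covers only this sentence and is not meant as a realistic grammar of English. It needs only the `nltk` package itself, no corpus download.

```python
import nltk

# Toy grammar: the PP "in my pajamas" can attach to the noun phrase
# (NP -> Det N PP) or to the verb phrase (VP -> VP PP)
groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")

sentence = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
parser = nltk.ChartParser(groucho_grammar)
trees = list(parser.parse(sentence))

# One tree per reading of the sentence
for tree in trees:
    print(tree)
```

The parser returns two trees: in one the elephant is in the pajamas, in the other the shooting happens in the pajamas.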

### Dependency Parsing 

Dependency parsing is the task of extracting a dependency parse of a sentence, which represents its grammatical structure and **defines the relationships between "head" words and the words that modify those heads**. [source](http://nlpprogress.com/english/dependency_parsing.html)

<img src="https://stanfordnlp.github.io/CoreNLP/assets/images/depparse.png">

### Coreference Resolution

Coreference resolution is the task of finding **all expressions that refer to the same entity** in a text. 

<img src="https://nlp.stanford.edu/projects/corefexample.png">

[source:nlpCore website](https://nlp.stanford.edu/projects/coref.shtml)

### Open information extraction (open IE) 

**Open information extraction (open IE)** refers to the **extraction of relation tuples**, typically binary relations, from plain text, such as (Mark Zuckerberg; founded; Facebook).[source](https://nlp.stanford.edu/software/openie.html)



##  Regular Expressions

A **regular expression** or **regex** is a **sequence of characters that defines a search pattern**.

- Helps you find and match patterns in text
    - Example: finding all the email addresses in a webpage
    - Email addresses are strings that have exactly one @ sign and at least one . in the part after the @
    - Based on this description (which is too general, but mostly enough for a basic check), a regular expression for an email address is as follows
    - Email pattern: **[^@]+@[^@]+\.[^@]+**
    - This pattern means: at least one character that is not @ ([^@]+), followed by an @ sign, followed by at least one character that is not @ ([^@]+), followed by a single . , and again followed by at least one character that is not @ ([^@]+)
    - A stricter pattern that also disallows whitespace inside email addresses is as follows:
    - Email pattern: **[^@\s]+@[^@\s]+\.[^@\s]+**
    
- Some Regular expression syntax examples 
    - **.**	Matches any character
    - **^abc**	Matches some pattern abc at start of a string
    - **abc$**	Matches some pattern abc at the end of a string
    - **[abc]**	Matches one of a set of characters
    - **[A-Z0-9]**	Matches one of a range of characters
    - **ed|ing|s**	Matches one of the specified strings
    - **\***	Matches zero or more of the previous item
    - **\+**	Matches one or more of the previous item
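
Each entry in the syntax list above can be tried out with Python's built-in `re` module. The test strings in this sketch are made up for illustration.

```python
import re

# .  matches any single character (except a newline by default)
dot_matches = re.findall(r"c.t", "cat cot c-t ct")

# ^abc anchors at the start of the string, abc$ at the end
starts_with_abc = re.search(r"^abc", "abcdef") is not None
ends_with_abc = re.search(r"abc$", "xyzabc") is not None

# [aeiou] matches one character from a set, [A-Z0-9] one from a range
set_matches = re.findall(r"[aeiou]", "regex")
range_matches = re.findall(r"[A-Z0-9]", "Zebra9")

# ed|ing|s matches one of the alternatives (anchored to the word end here)
suffixes = [re.search(r"(ed|ing|s)$", w).group()
            for w in ["walked", "running", "cats"]]

# * allows zero or more of the previous item, + requires at least one
star_allows_zero = re.fullmatch(r"ab*", "a") is not None
plus_needs_one = re.fullmatch(r"ab+", "a") is not None

print(dot_matches, set_matches, range_matches, suffixes)
print(starts_with_abc, ends_with_abc, star_allows_zero, plus_needs_one)
```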

###  Regular Expressions in Python

- Take a crash course to learn the most commonly used syntax for writing the patterns
    - [Regex tutorial — A quick cheatsheet by examples](https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285)
    - [Python Regex Cheatsheet](https://www.debuggex.com/cheatsheet/regex/python)
- In Python, the **re** library can be used. It has many functions; the following two are the most common:
    - re.match()
    - re.search()
- match() only checks whether the pattern matches at the beginning of the string, while search() scans forward through the string for a match. It is important to keep this distinction in mind: match() only reports a match that starts at position 0; if the match would start anywhere else, match() returns None.

In [None]:
# Let's see how we can make use of re library to extract part of text
import re
print(re.match('super', 'superstition').span())
print(re.match('super', 'insuperable'))
print(re.search('super', 'superstition').span())
print(re.search('super', 'insuperable').span())

(0, 5)
None
(0, 5)
(2, 7)


### Finding email addresses in a text document using Regular Expressions

The following code shows how to use the **re** library to extract everything that looks like a **valid** email address from a source document.

In [None]:
# Finding email addresses in a string
import re
string = "Hello friend, You can send me an email either to example@me.com or to me@example.com or me[at]gmail[dot]com"
# findall returns all non-overlapping matches of pattern in string, as a list of strings.
# The raw string (r"...") passes the backslashes in \s and \. through to the regex engine.
res = re.findall(r"[^@\s]+@[^@\s]+\.[^@\s]+", string)
res

['example@me.com', 'me@example.com']

### Relation Extraction

In [None]:
import re
nltk.download('ieer')
# Match "in" unless it is followed by a gerund (e.g. "success in supervising ...")
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))

[nltk_data] Downloading package ieer to /root/nltk_data...
[nltk_data]   Unzipping corpora/ieer.zip.
[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
[ORG: 'WGBH'] 'in' [LOC: 'Boston']
[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
[ORG: 'Omnicom'] 'in' [LOC: 'New York']
[ORG: 'DDB Needham'] 'in' [LOC: 'New York']
[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']


## Language Models

A statistical language model is a **probability distribution over sequences of words**. Given such a sequence, say of length m, it assigns a probability $P(w_1, \ldots, w_m)$ to the whole sequence.

The language model provides context to distinguish between words and phrases that sound similar. For example, in American English, the phrases "recognize speech" and "wreck a nice beach" sound similar, but mean different things.

[source](https://en.wikipedia.org/wiki/Language_model)
- What is the probability of the next word given what we’ve seen so far?
- Can’t just count instances: too many sentences; never have enough data
- So we use a simplifying assumption: The Markov property (only the previous few words matter)
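
With the Markov assumption restricted to one previous word (a bigram model), the sequence probability is approximated as $P(w_1, \ldots, w_m) \approx \prod_{i} P(w_i \mid w_{i-1})$, where each factor is estimated from counts. The sketch below is a minimal, unsmoothed version; the three-sentence corpus, the `<s>` and `</s>` boundary markers and the function names are assumptions made for this example.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count contexts and bigrams, with sentence boundary markers."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts

def bigram_prob(prev, word, context_counts, bigram_counts):
    """Maximum-likelihood estimate of P(word | prev); 0 for unseen contexts."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

def sentence_prob(sentence, context_counts, bigram_counts):
    """Product of bigram probabilities over the whole sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word, context_counts, bigram_counts)
    return prob

toy_corpus = ["i like green eggs", "i like ham", "sam likes green ham"]
contexts, bigrams = train_bigram_model(toy_corpus)

print(bigram_prob("like", "ham", contexts, bigrams))
print(sentence_prob("i like ham", contexts, bigrams))
print(sentence_prob("ham like i", contexts, bigrams))
```

Any sentence containing an unseen bigram gets probability 0 here, which illustrates the "never have enough data" problem; practical models add smoothing.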


## n-Grams

In the fields of computational linguistics and probability, an n-gram is a **contiguous sequence of n items from a given sample of text or speech**. [source](https://en.wikipedia.org/wiki/N-gram)

- The Distributional Hypothesis (1950’s): There is a link between how words are distributed and what they mean
- Sapir-Whorf Hypothesis: The structure of a language determines a native speaker's perception and categorization of experience
- Collocations: Words that go together to express a concept
- Concordances: An alphabetical index that shows words in their context
- n-grams are sequences of words found in natural language text
- Useful for translation, spelling correction, speech recognition, question answering
- Consider the meaning of “not old” vs. not and old individually: perhaps we should add “not old” to our vocabulary if it occurs
- We can add 2, 3, etc.-grams
- Can have letter n-grams as well (e.g. used internally in DBMS’s for wildcard lookups) but for our purposes we are only concerned with word n-grams
- No point in trying to generate all possible n-grams in advance (combinatorial explosion) so we are only interested in the ones that actually occur in our corpus
- Consider very rare collocations: Can’t determine their frequency of occurrence so not useful for classification problems
- Consider very common collocations (e.g. “is a”): Carry almost no information (but useful for language detection)
- A vocabulary of about 20,000 words is sufficient to track 95% of words in a corpus of tweets, blog posts and news articles

In [None]:
from nltk import ngrams

sentence = 'DENNIS: Listen, strange women lying in ponds distributing swords is no basis for a system of government'
n = 2  # n = 2 gives bigrams
bigrams = ngrams(sentence.split(), n)

for gram in bigrams:
    print(gram)

('DENNIS:', 'Listen,')
('Listen,', 'strange')
('strange', 'women')
('women', 'lying')
('lying', 'in')
('in', 'ponds')
('ponds', 'distributing')
('distributing', 'swords')
('swords', 'is')
('is', 'no')
('no', 'basis')
('basis', 'for')
('for', 'a')
('a', 'system')
('system', 'of')
('of', 'government')
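
Since only the n-grams that actually occur in the corpus matter, a natural next step is to count them and keep those above a frequency threshold. The sketch below does this with the standard library only; the sample sentence, the threshold of 2 and the helper name `word_ngrams` are assumptions for this example.

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the cat sat on the mat and the cat slept".split()

bigram_counts = Counter(word_ngrams(tokens, 2))

# Keep only bigrams seen at least twice; rarer ones give no
# reliable frequency estimate
frequent_bigrams = [gram for gram, count in bigram_counts.items() if count >= 2]

print(bigram_counts.most_common(3))
print(frequent_bigrams)
```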


## Representing Words

- Primary 
    - One Hot Encoding 
    
- Advanced Word Embeddings
    - Word2Vec
    - GloVe
    - TagLM
    - ELMo
    - GPT
    - BERT


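#### One-Hot Encoding Example

The following example uses the Keras `one_hot` function. Despite its name, it does not return one-hot vectors: it hashes each word to an integer index below the given size (400 here), so two different words can collide on the same index, as 'dennis' and 'of' do in the output below (both map to 60). Because the default hash function is not stable, the indices may differ between runs.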

In [None]:
import tensorflow as tf
sentence = 'DENNIS: Listen, strange women lying in ponds distributing swords is no basis for a system of government'

tf.keras.preprocessing.text.one_hot(
    sentence,
    400,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=True,
    split=' '
)

[60,
 311,
 302,
 268,
 303,
 277,
 352,
 57,
 250,
 134,
 381,
 277,
 124,
 233,
 217,
 60,
 79]
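
For comparison, a literal one-hot encoding assigns each vocabulary word its own position in a vector of vocabulary size. The sketch below builds such vectors in plain Python; the alphabetical vocabulary order and the helper names are assumptions for this example.

```python
def build_vocab(tokens):
    """Map each distinct word to an index (alphabetical order)."""
    return {word: index for index, word in enumerate(sorted(set(tokens)))}

def one_hot_vector(word, vocab):
    """Vector of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector

tokens = "strange women lying in ponds".split()
vocab = build_vocab(tokens)

print(vocab)
print(one_hot_vector("ponds", vocab))
```

Each vector is as long as the vocabulary and contains a single 1, so no two words can collide, at the cost of very sparse, very large vectors.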

## Representing Documents
### Pre-Deep Learning Approaches
- Bag of Words
- Bags of n-Grams
- TF-IDF vectors
- Topic modeling

### Bag of Words or n-Grams

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and word order but keeping multiplicity.

- We would like to model words, n-grams and documents in a way that is amenable to computer processing
- First attempt
  - One Hot encoding for words and n-grams
  - Bag of Words (BOW) encoding for documents: sum or logical-OR of word vectors
- Advantages:
  - Numeric representation
  - Simple to compute
  - Easy to interpret and use
  - Can compare two documents by comparing their BOW encodings
- Disadvantages:
  - Loses context and therefore meaning
  - Hugely wasteful of memory space
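
The "sum or logical-OR of word vectors" idea can be sketched in a few lines of plain Python. The two documents and the helper name `bow_vector` are made up for this example; summing one-hot vectors is equivalent to counting words, which is what the code does.

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# Shared vocabulary across all documents, in alphabetical order
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc, vocab):
    """Count vector: equivalent to summing the one-hot vectors of the words."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

count_vectors = [bow_vector(doc, vocab) for doc in docs]

# Logical-OR variant: 1 if the word occurs at all, else 0
binary_vectors = [[min(c, 1) for c in vec] for vec in count_vectors]

print(vocab)
print(count_vectors)
print(binary_vectors)
```

Both documents are now vectors of the same length, so they can be compared directly, but the word order ("the cat sat" versus "sat the cat") is lost.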


In [None]:
import gensim.downloader as api
from gensim.models import TfidfModel
from gensim.corpora import Dictionary

dataset = api.load("text8")
dct = Dictionary(dataset)  # fit dictionary
corpus = [dct.doc2bow(line) for line in dataset]  # convert corpus to BoW format




In [None]:
for word_id,count in corpus[0][:10]:
    print(dct[word_id], count)

a 184
abacus 1
abilities 1
ability 3
able 7
abnormal 1
abolished 1
abolition 1
about 12
above 2


### TF-IDF
In information retrieval, tf–idf or TFIDF, short for **term frequency–inverse document frequency**, is a **numerical statistic** that is intended to **reflect how important a word is to a document in a collection or corpus**.
- Let's say we want to divide up a corpus (collection) of documents into similar clusters
- How should we decide how similar two documents are?
  - How many words they have in common
  - How specialized those words are
- To capture these two aspects we need two measures for each word in our vocabulary:
  - Term Frequency: How frequently the word occurs in a given document
  - Document Frequency: How many documents in the corpus contain the word
- We can combine these measures as Term Frequency × Inverse Document Frequency (hence TF-IDF): a word scores highly when it is frequent in a document but rare across the corpus
- The inverse document frequency is usually measured on a log scale


$$
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
$$

$$\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D: t \in d\}|}$$

where $N = |D|$ is the total number of documents in the corpus.
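
To make the formulas concrete, here is a minimal sketch that computes them directly on a three-document toy corpus. It assumes that tf is the raw count of the term in the document and that the logarithm is the natural log. Libraries such as gensim and scikit-learn typically use different log bases, smoothing or normalization, so their numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

def tf(term, doc):
    """Raw count of the term in one document (an assumed choice of tf)."""
    return Counter(doc)[term]

def idf(term, docs):
    """log(N / number of documents containing the term).

    Assumes the term occurs in at least one document.
    """
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, docs)

for term in ["the", "sat", "mat"]:
    print(term, round(tfidf(term, docs[0], docs), 3))
```

In the first document, "mat" gets the highest weight because it appears in only one of the three documents, while "the" and "sat" are shared with the second document and are discounted.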

In [None]:
model = TfidfModel(corpus)  # fit model

In [None]:
vector = model[corpus[0]]  # apply model to the first corpus document

In [None]:
for word_id,weight in vector[:10]:
    print(dct[word_id], weight)

abacus 0.006704047545684609
abilities 0.0030255603220721273
ability 0.003156168449586299
able 0.0036673470201144674
abnormal 0.004575122435127926
abolished 0.0028052608258295926
abolition 0.004064820137019515
about 0.00014963587508918375
above 0.0007492665180478759
absence 0.004142807322609117


## Similarity Measure 

### Cosine Similarity
- We can measure the similarity of vectors (such as TF-IDF vectors) using Cosine Similarity
- The more similar the vectors are, the smaller the angle between them
- The cosine similarity of two vectors, x and y, is easily calculated using the dot product

\begin{align}
\cos(\pmb x, \pmb y) = \frac {\pmb x \cdot \pmb y}{||\pmb x|| \cdot ||\pmb y||}
\end{align}
- For TF-IDF vectors, cosine similarity ranges from 0 (nothing in common) to 1 (identical). Note: identical here means identical TF-IDF vectors, not necessarily identical documents
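
The formula can be sketched in a few lines of plain Python. The two count vectors `a` and `b` are made-up examples, and the function assumes neither vector is all zeros (otherwise the denominator would be zero).

```python
import math

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Two hypothetical word-count vectors over the same vocabulary.
a = [1, 0, 0, 1, 1, 1, 2]
b = [0, 1, 1, 0, 1, 1, 2]

print(cosine_similarity(a, a))            # a vector compared with itself
print(cosine_similarity(a, b))            # partially overlapping vectors
print(cosine_similarity([1, 0], [0, 1]))  # no shared non-zero dimensions
```

Because the result depends only on the angle between the vectors, scaling a vector (for example, doubling every count) does not change its similarity to other vectors.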

## NLP Pipelines

Natural language processing is usually organized into a pipeline of operations that parse and analyze the text.
The figure below shows an example of an NLP pipeline:
<img src="https://miro.medium.com/max/1838/1*CbzCcP3XFtYVJmWowZLugQ.png">
[source: medium.com](https://medium.com/mlearning-ai/basic-steps-in-natural-language-processing-pipeline-763cd299dd99)

Additional simple examples:

- [NLP Pipeline: Building an NLP Pipeline, Step-by-Step](https://suneelpatel-in.medium.com/nlp-pipeline-building-an-nlp-pipeline-step-by-step-7f0576e11d08)

- [NLP Text Preprocessing and Cleaning Pipeline in Python](https://towardsdatascience.com/nlp-text-preprocessing-and-cleaning-pipeline-in-python-3bafaf54ac35)
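
As a rough sketch of the pipeline idea, the stages can be written as small functions that are applied in sequence. The example below uses only the standard library. The tiny stop-word set, the whitespace tokenizer and the helper `run_pipeline` are simplifying assumptions for illustration, and the cells that follow use NLTK for the same steps.

```python
import re

# A tiny, hypothetical stop-word list; real pipelines use a fuller list.
STOP_WORDS = {"the", "a", "of", "in", "and", "is"}

def normalize(text):
    """Lowercase and replace every non-alphanumeric character with a space."""
    return re.sub(r"[^a-z0-9]", " ", text.lower())

def tokenize(text):
    """Split on whitespace."""
    return text.split()

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]

def run_pipeline(text, stages):
    """Feed the output of each stage into the next one."""
    result = text
    for stage in stages:
        result = stage(result)
    return result

tokens = run_pipeline(
    "The corpus incorporates a total of 681,288 posts.",
    [normalize, tokenize, remove_stop_words],
)
print(tokens)
```

Keeping each stage as a separate function makes it easy to reorder steps or swap one implementation for another, for example replacing `tokenize` with NLTK's `word_tokenize`.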


In [None]:
'''Normalization '''
text = '''The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. 
The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person [1]. 
Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger's self-provided gender, age, industry and astrological sign. (All are labeled for gender and age but for many, industry and/or sign is marked as unknown.)'''
text = text.lower()
print(text)

In [None]:
''' And remove punctuation '''
import re
text = re.sub(r"[^a-zA-Z0-9]", " ", text) 
print(text)

In [None]:
'''Tokenize words'''
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize


words = word_tokenize(text)
print(words)

In [None]:
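'''Tokenize sentences'''
# Note: if punctuation has already been stripped from `text`, the sentence
# tokenizer has no boundaries to work with and will typically return the whole
# text as one sentence. Run it on the raw text to get real sentence splits.
sentences = sent_tokenize(text)
print(sentences)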

In [None]:
'''Stop word removal '''
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

# Normalize text
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
# Tokenize text
words = word_tokenize(text)
# Remove stop words (build the set once instead of reloading the list for every word)
stop_words = set(stopwords.words("english"))
words = [w for w in words if w not in stop_words]
print(words)

In [None]:
'''POS and NER '''
import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize

nltk.download('words')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')

# tokenize text
sentence = word_tokenize(text)
# tag each word with part of speech
pos_tag(sentence)

In [None]:
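'''use ne_chunk to find named entities '''
# tokenize, pos tag, then recognize named entities in text
# Note: `text` was lowercased and stripped of punctuation in earlier cells, which
# removes capitalization cues, so the chunker may find few or no entities here.
tree = ne_chunk(pos_tag(word_tokenize(text)))
print(tree[1:10])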

In [None]:
'''Stemming and Lemmatization '''
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

nltk.download('wordnet')  # WordNet data is required by the lemmatizer

# Reduce words to their stems
words = word_tokenize(text)
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in words]

# Reduce words to their lemmas, treating each word as a verb (pos='v')
lemmatizer = WordNetLemmatizer()
lemmed = [lemmatizer.lemmatize(w, pos='v') for w in words]

print("Stemmed:", stemmed)
print("Lemmas:", lemmed)

In [None]:
import nltk
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["Natural Language Processing or NLP is a field of Artificial Intelligence.",
        "NLP gives the machines the ability to read, understand and derive meaning from human languages",
        "Data Scientists work with tons of data, and many times that data includes natural language text.",
        "Modern organizations work with huge amounts of data.",
        "Is AI a bad thing ?"]

# initialize count vectorizer object
# Option 1: use your own tokenize function (it must be a callable; here NLTK's word_tokenize)
vect = CountVectorizer(tokenizer=nltk.word_tokenize)

# Option 2: use the default tokenizer
vectorizer = CountVectorizer()
# get counts of each token (word) in text data
X = vectorizer.fit_transform(corpus)

# convert sparse matrix to numpy array to view
print(X.toarray())
# view token vocabulary (maps each token to its column index in X)
print(vectorizer.vocabulary_)

In [None]:
''' TF-IDF'''
from sklearn.feature_extraction.text import TfidfTransformer
# initialize tf-idf transformer object
transformer = TfidfTransformer(smooth_idf=False)
# use counts from count vectorizer results to compute tf-idf values
tfidf = transformer.fit_transform(X)
# convert sparse matrix to numpy array to view
tfidf.toarray()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
# initialize tf-idf vectorizer object
vectorizer = TfidfVectorizer()
# compute bag of word counts and tf-idf values
X = vectorizer.fit_transform(corpus)
# convert sparse matrix to numpy array to view
X.toarray()

# The Natural Language Toolkit: NLTK

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.

- Originally developed in 2001 at the University of Pennsylvania
- Version 3.3 released in 2018
- Natural Language Processing with Python book: https://www.nltk.org/book/  


## References
- http://nlpprogress.com/english/dependency_parsing.html
- https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html
- https://wordnet.princeton.edu/
- https://www.nltk.org/
- https://github.com/mikhailklassen/Mining-the-Social-Web-3rd-Edition/tree/master/notebooks
- https://en.wikipedia.org/wiki/Regular_expression
- Lane, Howard & Hapke. Natural Language Processing in Action. Manning. 2019.
- Jurafsky & Martin. Speech and Language Processing, 3rd Ed. https://web.stanford.edu/~jurafsky/slp3/
- spaCy: https://spacy.io/
- gensim: https://radimrehurek.com/gensim/
- Bird, Klein & Loper. Natural Language Processing with Python. https://www.nltk.org/book/ch07.html