# NLP (Natural Language Processing)

Natural languages (English, Japanese, etc.) differ from computer languages: they evolve and change over time.
In NLP we are concerned with how to program computers to process and analyze natural language data (text).
NLP is a field of computer science and artificial intelligence with roots in linguistics.

# Why?
- We have large volumes of unstructured text data
    + How can we apply statistical analysis or machine learning to extract useful insights from it?
- Most data analysis techniques are numeric
    + Text needs specialized techniques, such as NLP, before those methods can be applied

# Applications
- Machine translation
- Speech Recognition Systems
- Question Answering Systems
- Text summarization, categorization/classification/clustering
    - Sentiment analysis
    - Chatbots
    - Spam detection



Building any of the above applications is a fairly involved process, because text is free-flowing, unstructured data.
It typically requires:
- cleaning (fixing misspellings, removing duplicates and stop words),
- tokenization: splitting the text into a list of words,
- POS tagging, stemming, and lemmatization, and
- conversion to word vectors before applying any machine learning or statistical technique.

- POS tagging: each word gets a POS tag indicating its part of speech.
  + A list of tags: [Penn POS tags](https://www.clips.uantwerpen.be/pages/mbsp-tags)
  + A demo: [Parts-of-speech.Info](https://parts-of-speech.info/)
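
The steps listed above can be sketched end to end with nothing but the standard library. This is a deliberately simplified illustration, not how nltk does it: the stop word set `STOP_WORDS` and the helper names `preprocess` and `bag_of_words` are hypothetical, POS tagging and stemming are skipped, and the "word vector" is a plain count vector.

```python
import re
from collections import Counter

# Hypothetical, tiny stop word list for illustration; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "at", "it", "has"}

def preprocess(text):
    """Clean and tokenize: lowercase, keep alphanumeric runs, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def bag_of_words(tokens, vocabulary):
    """Turn a token list into a count vector with one slot per vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]

docs = ["The plan is simple.", "A simple plan and a simple price."]
token_lists = [preprocess(d) for d in docs]

# The vocabulary is every distinct token seen in the corpus, in sorted order.
vocabulary = sorted({t for toks in token_lists for t in toks})
vectors = [bag_of_words(toks, vocabulary) for toks in token_lists]

print(vocabulary)  # ['plan', 'price', 'simple']
print(vectors)     # [[1, 0, 1], [1, 1, 2]]
```

Once every document is a numeric vector of the same length, ordinary statistical and machine learning techniques can be applied to it.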



# Stemming and lemmatization
In NLP, stemming and lemmatization are text normalization techniques that prepare text (words, sentences, etc.) for further processing. In web search and information retrieval they are commonly used to increase recall. They are usually applied after the text has been tokenized.

- Stem: the part of a word to which affixes such as [inflectional](https://en.wikipedia.org/wiki/Inflection) endings **(ed, ing, ize, s, de)** can be attached. Stemming is the process of reducing inflected (or sometimes derived) words to their word stem or root, which may not itself be a word. For example, the [Porter stemmer](https://tartarus.org/martin/PorterStemmer/) reduces both apple and apples to appl.
- Lemma: the base (dictionary) form of a set of words, for example goose for geese.
    + Note: the Porter stemmer stems these two words to gees and goos. Lemmatization (a more careful approach to removing inflections) produces the base form, typically with the help of a lexical knowledge base like [WordNet](https://wordnet.princeton.edu/).

Stemming is not perfect. Porter stemming stems both meanness and meaning to mean, creating a false equivalence.
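
The examples above can be checked with a short sketch, assuming the nltk package is installed (the Porter stemmer is rule based, so no corpus download should be needed). The word list is just the examples mentioned in this section.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Print the stem of each example word used in this section.
for word in ['apple', 'apples', 'geese', 'goose', 'meanness', 'meaning']:
    print(word, '->', stemmer.stem(word))

# The false equivalence: two unrelated words end up with the same stem.
collides = stemmer.stem('meanness') == stemmer.stem('meaning')
print('meanness and meaning share a stem:', collides)
```

A collision like this is the price of a purely rule-based approach: the stemmer strips suffixes without knowing what the words mean.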


Let's look at some of the available text data (text corpora).
# Text corpora

A text corpus is a large collection of written or spoken text. It usually comes with some associated metadata.


# Some popular corpora
- Brown Corpus: This was the first million-word corpus for the English language, compiled by Francis and Kucera at Brown University from American English texts published in 1961.
- WordNet: This corpus is a semantically oriented lexical database for the English language. It was created at Princeton University.
- Penn Treebank: This corpus consists of tagged and parsed English sentences including annotations like POS tags and grammar-based parse trees.
- Google N-gram Corpus: The Google N-gram Corpus consists of over a trillion words from various sources including books, web pages etc.
- Web, chat, email, tweets: We can gather this kind of textual data from social media.



# Some popular frameworks for text analysis
- nltk: the Natural Language Toolkit
- gensim: The gensim library has a rich set of capabilities for semantic analysis, including topic modeling and similarity analysis.
- textblob: text processing, phrase extraction, classification, POS tagging, and sentiment analysis
- spacy: claims to provide industrial-strength NLP capabilities by providing the best implementation of each technique and algorithm

# Let's use nltk to access some corpora

In [1]:
!pip install nltk



In [2]:
import nltk


In [3]:
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /home/pooran/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /home/pooran/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /home/pooran/nltk_data...
[nltk_data]    |   Unzipping corpora/biocreative_ppi.zip.
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /home/pooran/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     /home/pooran/nltk_data...
[nltk_data]    |   Unzipping corpora/brown_tei.zip.
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     /home/pooran/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_cat.zip.
[nltk_data]    | Downloading package cess_esp to
[nltk_data]    |   

[nltk_data]    |   Unzipping corpora/sentence_polarity.zip.
[nltk_data]    | Downloading package shakespeare to
[nltk_data]    |     /home/pooran/nltk_data...
[nltk_data]    |   Unzipping corpora/shakespeare.zip.
[nltk_data]    | Downloading package sinica_treebank to
[nltk_data]    |     /home/pooran/nltk_data...
[nltk_data]    |   Unzipping corpora/sinica_treebank.zip.
[nltk_data]    | Downloading package smultron to
[nltk_data]    |     /home/pooran/nltk_data...
[nltk_data]    |   Unzipping corpora/smultron.zip.
[nltk_data]    | Downloading package state_union to
[nltk_data]    |     /home/pooran/nltk_data...
[nltk_data]    |   Package state_union is already up-to-date!
[nltk_data]    | Downloading package stopwords to
[nltk_data]    |     /home/pooran/nltk_data...
[nltk_data]    |   Package stopwords is already up-to-date!
[nltk_data]    | Downloading package subjectivity to
[nltk_data]    |     /home/pooran/nltk_data...
[nltk_data]    |   Unzipping corpora/subjectivity.zip.
...

True

# Accessing the Brown Corpus

In [235]:
from nltk.corpus import brown
brown.readme()

'BROWN CORPUS\n\nA Standard Corpus of Present-Day Edited American\nEnglish, for use with Digital Computers.\n\nby W. N. Francis and H. Kucera (1964)\nDepartment of Linguistics, Brown University\nProvidence, Rhode Island, USA\n\nRevised 1971, Revised and Amplified 1979\n\nhttp://www.hit.uib.no/icame/brown/bcm.html\n\nDistributed with the permission of the copyright holder,\nredistribution permitted.\n'

In [5]:
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

We can access the tokenized sentences like this:

In [8]:
sentences = brown.sents(categories='adventure')

In [None]:
# Exercise: get the original sentence back from its tokens
# write code here
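
One possible approach to this exercise, sketched on a stand-in token list so it runs without the corpus: each element of `brown.sents(...)` is a list of word tokens, so joining on spaces gives an approximation of the sentence. The `tokens` list here is copied from the first humor sentence printed later in this notebook; with the real corpus you would use something like `sentences[0]` instead.

```python
# Stand-in for one element of brown.sents(...): a list of word tokens.
tokens = ['It', 'was', 'among', 'these', 'that', 'Hinkle', 'identified',
          'a', 'photograph', 'of', 'Barco', '!', '!']

# Joining on spaces approximates the original sentence; punctuation stays
# detached because the corpus stores each punctuation mark as its own token.
sentence = ' '.join(tokens)
print(sentence)  # It was among these that Hinkle identified a photograph of Barco ! !
```

The result is not a perfect reconstruction, since the original spacing around punctuation is lost once the text is tokenized.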


# POS tagged sentences

In [21]:
tagged_sents= brown.tagged_sents(categories='humor')
tagged_sents

[[('It', 'PPS'), ('was', 'BEDZ'), ('among', 'IN'), ('these', 'DTS'), ('that', 'CS'), ('Hinkle', 'NP'), ('identified', 'VBD'), ('a', 'AT'), ('photograph', 'NN'), ('of', 'IN'), ('Barco', 'NP'), ('!', '.'), ('!', '.')], [('For', 'CS'), ('it', 'PPS'), ('seems', 'VBZ'), ('that', 'CS'), ('Barco', 'NP'), (',', ','), ('fancying', 'VBG'), ('himself', 'PPL'), ('a', 'AT'), ("ladies'", 'NNS$'), ('man', 'NN'), ('(', '('), ('and', 'CC'), ('why', 'WRB'), ('not', '*'), (',', ','), ('after', 'IN'), ('seven', 'CD'), ('marriages', 'NNS'), ('?', '.'), ('?', '.')], ...]

# Let's get the top nouns in humor and see their distribution

In [19]:
# write code here

[('time', 43), ('Mr.', 36), ('way', 28)]
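
A sketch of how the counting could be done, run here on a stand-in for `tagged_sents` (the two sentences printed above) so that it works without the corpus. Which tags count as nouns is an assumption on my part: this version keeps every Brown tag starting with `N` (for example NN, NNS, NP), which may not be the exact filter behind the output shown above.

```python
from collections import Counter

# Stand-in for brown.tagged_sents(categories='humor'): the two sentences above.
tagged_sents = [
    [('It', 'PPS'), ('was', 'BEDZ'), ('among', 'IN'), ('these', 'DTS'),
     ('that', 'CS'), ('Hinkle', 'NP'), ('identified', 'VBD'), ('a', 'AT'),
     ('photograph', 'NN'), ('of', 'IN'), ('Barco', 'NP'), ('!', '.'), ('!', '.')],
    [('For', 'CS'), ('it', 'PPS'), ('seems', 'VBZ'), ('that', 'CS'),
     ('Barco', 'NP'), (',', ','), ('fancying', 'VBG'), ('himself', 'PPL'),
     ('a', 'AT'), ("ladies'", 'NNS$'), ('man', 'NN'), ('(', '('), ('and', 'CC'),
     ('why', 'WRB'), ('not', '*'), (',', ','), ('after', 'IN'), ('seven', 'CD'),
     ('marriages', 'NNS'), ('?', '.'), ('?', '.')],
]

# Assumption: any tag starting with 'N' marks a noun.
nouns_tags = [word for sent in tagged_sents
              for word, tag in sent if tag.startswith('N')]

# Count the nouns and show the most frequent one in this small sample.
top_noun = Counter(nouns_tags).most_common(1)
print(top_noun)  # [('Barco', 2)]
```

On the full humor category the same list comprehension can be applied to `brown.tagged_sents(categories='humor')` directly.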

# We can also use nltk's FreqDist

In [23]:
# nouns_tags: the list of noun tokens collected from tagged_sents in the exercise above
nouns_freq = nltk.FreqDist(nouns_tags)

In [26]:
nouns_freq.most_common(3)

[('time', 43), ('Mr.', 36), ('way', 28)]

In [31]:
# number of original documents (file ids) in the corpus
len(brown.fileids())

500

# WordNet

In [32]:
from nltk.corpus import wordnet

Look up the synsets (sets of synonyms that share a meaning) of a word:

In [55]:
word_synsets= wordnet.synsets('camping')
word_synsets

[Synset('camping.n.01'),
 Synset('camp.v.01'),
 Synset('camp.v.02'),
 Synset('camp.v.03')]

In [56]:
for synset in word_synsets:
    print('name:',synset.name())
    print('definition:',synset.definition())
    print('examples:',synset.examples())

name: camping.n.01
definition: the act of encamping and living in tents in a camp
examples: []
name: camp.v.01
definition: live in or as if in a tent
examples: ['Can we go camping again this summer?', 'The circus tented near the town', 'The houseguests had to camp in the living room']
name: camp.v.02
definition: establish or set up a camp
examples: []
name: camp.v.03
definition: give an artificially banal or sexual quality to
examples: []


# Text Preprocessing

# Tokenization

Tokenization is the process of breaking down or splitting textual data into smaller meaningful components.
- Sentence tokenization
- Word tokenization

# Tokenizers in nltk
- sent_tokenize
- RegexpTokenizer

Check the documentation for other tokenizers.

In [57]:
import nltk
from nltk.corpus import gutenberg

In [83]:
edgeworth = gutenberg.raw(fileids='edgeworth-parents.txt')
edgeworth[0:1000]

'[The Parent\'s Assistant, by Maria Edgeworth]\r\n\r\n\r\nTHE ORPHANS.\r\n\r\nNear the ruins of the castle of Rossmore, in Ireland, is a small cabin,\r\nin which there once lived a widow and her four children.  As long as she\r\nwas able to work, she was very industrious, and was accounted the best\r\nspinner in the parish; but she overworked herself at last, and fell ill,\r\nso that she could not sit to her wheel as she used to do, and was obliged\r\nto give it up to her eldest daughter, Mary.\r\n\r\nMary was at this time about twelve years old.  One evening she was\r\nsitting at the foot of her mother\'s bed spinning, and her little brothers\r\nand sisters were gathered round the fire eating their potatoes and milk\r\nfor supper.  "Bless them, the poor young creatures!" said the widow, who,\r\nas she lay on her bed, which she knew must be her deathbed, was thinking\r\nof what would become of her children after she was gone.  Mary stopped\r\nher wheel, for she was afraid that the nois

In [84]:
edgeworth_sent_tkn = nltk.sent_tokenize(edgeworth)

In [86]:
print('Total sentences  {}'.format(len(edgeworth_sent_tkn)))
for sidx, s in enumerate(edgeworth_sent_tkn[0:3]):
    print(sidx,"::", s, '\n')

Total sentences  10096
0 :: [The Parent's Assistant, by Maria Edgeworth]


THE ORPHANS. 

1 :: Near the ruins of the castle of Rossmore, in Ireland, is a small cabin,
in which there once lived a widow and her four children. 

2 :: As long as she
was able to work, she was very industrious, and was accounted the best
spinner in the parish; but she overworked herself at last, and fell ill,
so that she could not sit to her wheel as she used to do, and was obliged
to give it up to her eldest daughter, Mary. 



# RegexpTokenizer

In [240]:
s ="Price of a gallon milk is $3.50.  I'll buy 2. Thanks."


In [241]:
# raw string, so that backslashes reach the regex engine unchanged
sent_regex = nltk.tokenize.RegexpTokenizer(r'\w+|\$\d+\.\d+|\S+')

In [242]:
sent_regex.tokenize(s)

['Price',
 'of',
 'a',
 'gallon',
 'milk',
 'is',
 '$3.50',
 '.',
 'I',
 "'ll",
 'buy',
 '2',
 '.',
 'Thanks',
 '.']
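
To see why this pattern yields those tokens, here is a sketch using the standard `re` module on the same string. As far as I understand, a RegexpTokenizer that matches tokens (rather than gaps) behaves much like `re.findall`, but treat that equivalence as an assumption. The point of the example is that the alternatives are tried from left to right at each position, so their order matters.

```python
import re

s = "Price of a gallon milk is $3.50.  I'll buy 2. Thanks."

# Alternatives are tried left to right at each position:
#   \w+          a run of word characters
#   \$\d+\.\d+   a dollar amount such as $3.50
#   \S+          any other run of non-space characters (punctuation, 'll)
tokens = re.findall(r'\w+|\$\d+\.\d+|\S+', s)
print(tokens)

# With \S+ first, it wins at every position and the other alternatives
# are never reached, so the result is plain whitespace splitting.
greedy_tokens = re.findall(r'\S+|\w+|\$\d+\.\d+', s)
print(greedy_tokens == s.split())  # True
```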

# Word Tokenization

Word tokenization converts a sentence into word tokens. It is typically done before stemming or lemmatizing.

- word_tokenize
- TreebankWordTokenizer. Based on the Penn Treebank and uses various regular expressions to tokenize the text.
- RegexpTokenizer

In [243]:
s = 'I just absolutely adore Denver and the Boulder area.'

In [244]:
nltk.word_tokenize(s)

['I',
 'just',
 'absolutely',
 'adore',
 'Denver',
 'and',
 'the',
 'Boulder',
 'area',
 '.']

In [108]:
word_regex= nltk.RegexpTokenizer(pattern=r'\w+', gaps=False)
word_regex.tokenize(s)

['I', 'just', 'absolutely', 'adore', 'Denver', 'and', 'the', 'Boulder', 'area']

We can also get the start and end indices of each token:

In [112]:
list(word_regex.span_tokenize(s))

[(0, 1),
 (2, 6),
 (7, 17),
 (18, 23),
 (24, 30),
 (31, 34),
 (35, 38),
 (39, 46),
 (47, 51)]

In [114]:
[s[st:en] for st, en in word_regex.span_tokenize(s)]

['I', 'just', 'absolutely', 'adore', 'Denver', 'and', 'the', 'Boulder', 'area']

There are other word tokenizer classes; check the documentation for the full list. To give you a flavour, here is WordPunctTokenizer, which tokenizes sentences into independent alphabetic and non-alphabetic tokens.

In [116]:
wordpunkt_tkn = nltk.WordPunctTokenizer()
wordpunkt_tkn.tokenize("He couldn't swim" )

['He', 'couldn', "'", 't', 'swim']
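
The same splitting behaviour can be approximated with a single regular expression. This is a rough standard-library equivalent for illustration, not nltk's own code, and the function name `word_punct_tokenize` is made up for this sketch: a token is either a run of word characters or a run of characters that are neither word characters nor whitespace.

```python
import re

def word_punct_tokenize(text):
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

print(word_punct_tokenize("He couldn't swim"))  # ['He', 'couldn', "'", 't', 'swim']
print(word_punct_tokenize("Wait... what?!"))    # ['Wait', '...', 'what', '?!']
```

Because the apostrophe is punctuation, contractions are split into three tokens, which is usually not what you want before stemming or lemmatizing.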

# Text normalization or wrangling
Apart from tokenization, text normalization includes:
- Cleaning
- Case conversion
- Spell correction
- Removing stop words
- Stemming
- Lemmatization

# Cleaning text
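Remove any unnecessary tokens.

For example, when the text comes from HTML we don't care about the tags:
- use a regex or Beautiful Soup
- clean_html from nltk (older releases only; in NLTK 3 it is no longer implemented and the library points to Beautiful Soup instead)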

In [142]:
# example to get text from html
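
One way to get the text out of HTML without extra dependencies is the standard library's `html.parser`. This is a minimal sketch, not a replacement for Beautiful Soup: the class name `TextExtractor`, the helper `html_to_text`, and the sample markup are all made up for illustration, and real pages need more care (entities, whitespace, malformed markup).

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

sample_html = ("<html><head><style>p {color: red}</style></head>"
               "<body><h1>Google Fi</h1>"
               "<p>A <b>different</b> kind of phone plan.</p></body></html>")
print(html_to_text(sample_html))  # Google Fi A different kind of phone plan.
```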


# Working on a corpus
# Tokenization

In [163]:
corpus = ['Meet Google Fi, a different kind of phone plan @@plan',  '*Simpler* pricing and smarter coverage. It has unlimited call and text at $20!']
corpus

['Meet Google Fi, a different kind of phone plan @@plan',
 '*Simpler* pricing and smarter coverage. It has unlimited call and text at $20!']

In [164]:
sent_tokens = []
for doc in corpus:
    sent_tokens.append(nltk.sent_tokenize(doc))
sent_tokens    

[['Meet Google Fi, a different kind of phone plan @@plan'],
 ['*Simpler* pricing and smarter coverage.',
  'It has unlimited call and text at $20!']]

In [167]:
words_tokens= []
for doc in corpus:
    sent_tokens= nltk.sent_tokenize(doc)
    words_tokens.append([nltk.word_tokenize(sent) for sent in sent_tokens])
        

In [169]:
words_tokens

[[['Meet',
   'Google',
   'Fi',
   ',',
   'a',
   'different',
   'kind',
   'of',
   'phone',
   'plan',
   '@',
   '@',
   'plan']],
 [['*Simpler*', 'pricing', 'and', 'smarter', 'coverage', '.'],
  ['It', 'has', 'unlimited', 'call', 'and', 'text', 'at', '$', '20', '!']]]

In [174]:
import re
import string
pattern = '[{}]'.format(re.escape(string.punctuation))

In [175]:
## ? build a regex to remove  punctuations 
## work with words_tokens[0][0] sentence
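
A possible solution sketch for this exercise. To keep it runnable on its own, the token list is copied from the `words_tokens[0][0]` output shown above rather than computed; the name `clean_sent` matches the variable used in the stop word cell that follows.

```python
import re
import string

# Character class matching any single ASCII punctuation character.
pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))

# Stand-in for words_tokens[0][0], copied from the output above.
sentence = ['Meet', 'Google', 'Fi', ',', 'a', 'different', 'kind', 'of',
            'phone', 'plan', '@', '@', 'plan']

# Strip punctuation from every token, then drop tokens that became empty.
stripped = [pattern.sub('', token) for token in sentence]
clean_sent = [token for token in stripped if token]
print(clean_sent)
# ['Meet', 'Google', 'Fi', 'a', 'different', 'kind', 'of', 'phone', 'plan', 'plan']
```

Filtering out empty strings matters here: tokens such as ',' and '@' consist only of punctuation and would otherwise remain in the list as ''.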




# Removing stop words

Stop words are the words that occur most often, such as a, the, and am, and that carry little meaning on their own.

In [185]:
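stopwords = nltk.corpus.stopwords.words('english')
# clean_sent: the punctuation-free tokens of words_tokens[0][0] from the exercise above
# the stop word list is lowercase, so compare lowercased tokens
stop_clean_sent = [w for w in clean_sent if w.lower() not in stopwords]
print(stopwords)
stop_clean_sent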

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

['Meet', 'Google', 'Fi', 'different', 'kind', 'phone', 'plan', 'plan']

# Stemming


In [188]:
from nltk.stem import PorterStemmer
pstemmer = PorterStemmer()

In [191]:
pstemmer.stem('helped'), pstemmer.stem('helping')

('help', 'help')

In [193]:
pstemmer.stem('strange')

'strang'

In [195]:
from nltk.stem import LancasterStemmer
ls_stemmer = LancasterStemmer()
ls_stemmer.stem('strange')

'strange'

# Regex based stemmer

In [220]:
# regex for words ending with ed or ing (one possible pattern)
from nltk.stem import RegexpStemmer
regex_stemmer = RegexpStemmer(r'(ed|ing)$', min=3)

In [221]:
regex_stemmer.stem('played'), regex_stemmer.stem('apples')

('play', 'apples')

# Lemmatization
Lemmatization returns the root word (lemma) as it appears in the dictionary.

In [222]:
from nltk.stem import WordNetLemmatizer
wnetl = WordNetLemmatizer()

In [223]:
# noun
wnetl.lemmatize('buses', 'n')

'bus'

In [226]:
# verb
wnetl.lemmatize('running', 'v'), wnetl.lemmatize('ate', 'v')

('run', 'eat')

In [228]:
# adjective
wnetl.lemmatize('easier', 'a')

'easy'

Use the right part of speech; with the wrong one, the word comes back unchanged:

In [234]:
wnetl.lemmatize('ate','n')

'ate'
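
Since the lemmatizer needs a WordNet part of speech while taggers usually emit Penn Treebank tags, a small mapping helper is a common companion. The sketch below is one possible version: the function name `penn_to_wordnet` and the fallback to noun are my own choices, and the single-letter codes are the ones used in the lemmatize calls above ('n', 'v', 'a') plus 'r' for adverbs.

```python
def penn_to_wordnet(tag, default='n'):
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'n' or 'r')."""
    prefix_map = {'JJ': 'a', 'VB': 'v', 'NN': 'n', 'RB': 'r'}
    # Penn tags share a two-letter prefix per word class (VB, VBD, VBG, ...).
    return prefix_map.get(tag[:2], default)

for tag in ['VBD', 'NNS', 'JJR', 'RB', 'DT']:
    print(tag, '->', penn_to_wordnet(tag))
# VBD -> v
# NNS -> n
# JJR -> a
# RB -> r
# DT -> n
```

With a helper like this, a call such as `wnetl.lemmatize(word, penn_to_wordnet(tag))` passes the verb code for a word tagged VBD like ate, instead of silently treating it as a noun.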

# Aside: getting bash output into Python

In [60]:
%%bash --out out
curl -s http://www.gutenberg.org/cache/epub/18674/pg18674.txt

In [61]:
out

