# Introduction to Natural Language Processing (NLP)

NLP is a discipline that has been developed over many decades and has become an important piece of the puzzle for solving challenging text-based AI problems, such as classifying text documents, generating text, language understanding, chatbots, speech and more.


## Terminology
We'll go over the typical terminology used in NLP, and the tools used to decompose NLP problems.

### Methods
- Tokenizing
- Part Of Speech (POS) tagging
- Stemming & Lemmatization
- Stop Words
- Named Entity Recognition (NER)
- Dependency Parsing
- Vocabulary
- Embeddings

also:
- Bag Of Words (BOW)
- bigrams, trigrams, ngrams
- skip-gram


### Tokenizing & Tokens
Text problems begin with text. A text document is composed of sentences, themselves composed of words and punctuation.

In order to start analyzing text, one usually needs to decompose the text into its basic components: sentences, and then words.

Splitting sentences is not always trivial: simply splitting on `.` or on a capitalized first word just won't cut it in the real world.
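
To see why, here is a minimal sketch of the naive approach, using only the Python standard library. The sample sentence is made up for illustration.

```python
# Naive sentence splitting: cut the text at every period.
sample = "Mr. Smith met Johann S. Bach. They talked."

naive_sentences = [part.strip() for part in sample.split(".") if part.strip()]

# The sample holds 2 sentences, but the naive split also cuts after
# the title "Mr." and the initial "S."
for i, piece in enumerate(naive_sentences):
    print(f"Piece {i}: {piece}")
```

The abbreviation and the initial are treated as sentence ends, which is the kind of case a trained sentence tokenizer is meant to handle.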

Then, words come in many flavors, and some tokens are not words at all: abbreviations (U.K., U.S.A.), contractions ('s in "it's good"), or hyphenated words (like "on-time", "mother-in-law"). The basic components of text are therefore generally called 'tokens'.
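
As a rough illustration of what a token can be, the sketch below compares a plain whitespace split with a small regular expression. The pattern is an assumption written for this example; it is not the rule set used by `nltk` or `spaCy`, which handle many more cases.

```python
import re

# Toy pattern (an assumption for illustration only):
#  - runs of letters, optionally joined by hyphens or apostrophes
#  - runs of digits
#  - any single character that is neither a word character nor white space
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*|\d+|[^\w\s]")

sample = "My mother-in-law isn't on-time."

whitespace_tokens = sample.split()
regex_tokens = TOKEN_PATTERN.findall(sample)

print(whitespace_tokens)  # the final period stays glued to 'on-time'
print(regex_tokens)       # the final period becomes its own token
```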

#### Tokenizing
Tokenizing is the action of splitting text into tokens.

There are many frameworks for NLP out there. One of the most popular is `nltk` (Natural Language Toolkit), 
which is very versatile and geared towards research.

`spaCy` is another one, which is much more opinionated and targets engineers looking for 'production-ready NLP'.

`scikit-learn` is a more generic data science framework that includes NLP functionalities.

Then with Deep Learning came a flurry of other libraries focusing on deep learning methods, yet the text decomposition 
is typically still done using the libraries above.

Neural-network-based libraries include `gensim`, which implements the popular Word2Vec model (more on this later).


In [None]:
text = '''Punkt knows that the periods in Mr. Smith and Johann S. Bach do not mark sentence boundaries. 
And sometimes sentences can start with non-capitalized words.  i is a good variable name.
'''

In [None]:
# to load the punkt package, if not already loaded, use:
# import nltk
# nltk.download('punkt')

# Tokenizing with nltk
from nltk.data import load

# Tokenize sentences (i.e. split on proper punctuation)

sent_detector = load('nltk:tokenizers/punkt/english.pickle')
tokens = sent_detector.tokenize(text)

for i, token in enumerate(tokens):
    print(f'Sentence {i}: {token}')

In [None]:
# Tokenize words:

from nltk.tokenize import word_tokenize

word_tokenize(text)

In [None]:
# Tokenize sentences with spaCy

import spacy
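# the 'en' shortcut is not available in spaCy 3+; load the small English model by its full name
# (install it with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')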

doc = nlp(text)

for i, token in enumerate(doc.sents):
    print(f'Sentence {i}: {token}')

In [None]:
# Tokenize words with spaCy

import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(text)

for i, token in enumerate(doc):
    print(token.text)

Note that the output may be slightly different. spaCy did not keep `non-capitalized` as a single token, and it kept the newline characters as tokens.

### Tagging: Part Of Speech Tagging (POS tagging)

Part Of Speech (POS) tagging is a standard method to recognize token types (nouns, verbs, adjectives, punctuation, etc.).


In [None]:
# POS tagging with nltk
import nltk

# if data not loaded, load with:
# nltk.download('averaged_perceptron_tagger')

tokens = word_tokenize(text)
nltk.pos_tag(tokens)

#### The full list of tags and their meanings:

- CC: coordinating conjunction
- CD: cardinal number
- DT: determiner
- EX: existential there (like "there is"; think of it as "there exists")
- FW: foreign word
- IN: preposition/subordinating conjunction
- JJ: adjective ('big')
- JJR: adjective, comparative ('bigger')
- JJS: adjective, superlative ('biggest')
- LS: list marker ('1)')
- MD: modal ('could', 'will')
- NN: noun, singular ('desk')
- NNS: noun, plural ('desks')
- NNP: proper noun, singular ('Harrison')
- NNPS: proper noun, plural ('Americans')
- PDT: predeterminer ('all' in 'all the kids')
- POS: possessive ending ('s in "parent's")
- PRP: personal pronoun ('I', 'he', 'she')
- PRP`$`: possessive pronoun ('my', 'his', 'hers')
- RB: adverb ('very', 'silently')
- RBR: adverb, comparative ('better')
- RBS: adverb, superlative ('best')
- RP: particle ('up' in 'give up')
- TO: to ('to' in 'go to the store')
- UH: interjection ('errrrrrrrm')
- VB: verb, base form ('take')
- VBD: verb, past tense ('took')
- VBG: verb, gerund/present participle ('taking')
- VBN: verb, past participle ('taken')
- VBP: verb, singular present, non-3rd person ('take')
- VBZ: verb, 3rd person singular present ('takes')
- WDT: wh-determiner ('which')
- WP: wh-pronoun ('who', 'what')
- WP`$`: possessive wh-pronoun ('whose')
- WRB: wh-adverb ('where', 'when')

from: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/
    

In [None]:
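# POS tagging with spaCy
# token.pos_ holds coarse-grained Universal POS tags (NOUN, VERB, ...), not the Penn Treebank tags listed above;
# the fine-grained tags are available as token.tag_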

import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(text)

for i, token in enumerate(doc):
    print(f'{token.text} -> {token.pos_}')

### What do we do with POS tags?

In most ML models, only the important words are used; otherwise the model vocabulary becomes so large that it is difficult to train a general model.

Many will start by using just nouns/proper nouns and verbs, getting rid of all other words and punctuation.


In [None]:
# with spaCy

import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(text)
useful_tokens = [t.text for t in doc if t.pos_ in ['NOUN', 'VERB', 'PROPN']]
print(useful_tokens)

In some other cases, we might know that the tokens of interest are only nouns, or only verbs.

POS tags are also commonly used to filter out white space and punctuation.

### Stop words

Removing stop words is a very common step in NLP.

Stop words are words so common that they bring no useful information to
typical classification methods: they are just everywhere.

NLP packages therefore include curated lists of stop words.

In [None]:
# removing stop words in nltk

# if data not loaded, load with:
# nltk.download('stopwords')
from nltk.corpus import stopwords
print(set(stopwords.words('english')))

In [None]:
tokens = word_tokenize(text)
# build the stop word set once: membership tests on a set are fast
stop_words = set(stopwords.words('english'))
without_stopwords = [t for t in tokens if t.lower() not in stop_words]
print(without_stopwords)

In [None]:
# with spaCy

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

print(STOP_WORDS)


In [None]:
# spaCy tags the tokens with a flag `is_stop` that can be used more readily for filtering

nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print([t.text for t in doc if not t.is_stop])

### Stemming and Lemmatization

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

Stemming is a somewhat crude method that chops the end of a word to get down to a root, while lemmatization usually refers to using a vocabulary and morphological analysis of words to return their dictionary form (the lemma).
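
The difference can be sketched with a toy example in plain Python. Both the suffix list and the lemma dictionary below are assumptions made up for illustration; real stemmers and lemmatizers rely on far richer rules and vocabularies.

```python
# Toy suffix-stripping "stemmer": the suffix list is an arbitrary assumption,
# much simpler than the rules of a real stemmer.
SUFFIXES = ("ation", "ing", "ed", "er", "s")

def crude_stem(word, min_stem_length=3):
    """Chop the first matching suffix, keeping at least min_stem_length characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

# Toy lemma dictionary (hypothetical entries): lemmatization is a vocabulary lookup
LEMMAS = {"was": "be", "optimizing": "optimize", "optimized": "optimize"}

def toy_lemmatize(word):
    """Return the dictionary form when known, else the word unchanged."""
    return LEMMAS.get(word, word)

for word in ["optimizing", "optimized", "optimization", "challenge", "was"]:
    print(f"{word} -> stem: {crude_stem(word)}, lemma: {toy_lemmatize(word)}")
```

The stems are not necessarily real words, while the lemmas are, but only for words that the dictionary knows about.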

In [None]:
text = """This was a challenging computational challenge for the computer challenger: 
optimizing the computed optimized optimization"""

In [None]:
# with nltk
from nltk.stem import PorterStemmer

ps = PorterStemmer()
tokens = word_tokenize(text)
for token in tokens:
    print(ps.stem(token))

In [None]:
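# if data not loaded, load with:
# nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# note: lemmatize() treats each token as a noun unless a `pos` argument is given
for token in tokens:
    print(lemmatizer.lemmatize(token))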

In [None]:
# with spaCy

# spaCy DOES NOT DO STEMMING

nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print([t.lemma_ for t in doc])

### Named Entity Recognition (NER)

Named Entity Recognition aims at detecting entities in text, such as organizations, people, places, dates or amounts of money.


In [None]:
text = """Hackers are exploiting vulnerable Jira and Exim servers with the end goal of infecting 
them with a new Watchbog Linux Trojan variant and using the resulting botnet 
as part of a Monero cryptomining operation.

European authorities fined Google a record $5.1 billion on Wednesday 
for abusing its power in the mobile phone market 
and ordered the company to alter its practices
"""

In [None]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print([(t.text, t.label_) for t in doc.ents])

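### Dependency Parsing

Dependency parsing analyzes the grammatical structure of a sentence by linking each token to the token it depends on (its head). The example below prints, for each noun chunk, its root token, the root's dependency label and the root's head.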


In [None]:
text = "Autonomous cars shift insurance liability toward manufacturers"

In [None]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
for chunk in doc.noun_chunks:
    print(f'{chunk.text} -> {chunk.root.text} -> {chunk.root.dep_} -> {chunk.root.head.text}')

### Vocabulary

The vocabulary of a dataset is the set of unique tokens in the dataset.

If the dataset is not filtered after tokenization, the vocabulary amounts to all words and punctuation.

The larger the vocabulary, the more likely it is that some tokens just add noise. Hence the need to filter the dataset tokens:
removing punctuation, which usually doesn't bring any information, and stop words, which are too common to be meaningful, keeping only words that may be meaningful,
then stemming or lemmatizing the remaining tokens to further reduce the vocabulary.


#### Token Frequency

Token frequency can be a useful piece of information. It can also be used to further reduce the vocabulary, by picking only
tokens that appear enough times in the dataset to be meaningful.

In order to do this, we need to count occurrences. This is easily done with the Python `Counter` class, part of the `collections` module.
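
As a minimal sketch of the filtering idea, the snippet below counts a toy token list and keeps only the tokens that appear often enough. The token list and the `MIN_COUNT` threshold are assumptions chosen for illustration.

```python
from collections import Counter

# Toy token list, made up for illustration
toy_tokens = ["data", "model", "data", "text", "model", "data", "rare"]
toy_freq = Counter(toy_tokens)

MIN_COUNT = 2  # arbitrary threshold, to be tuned on a real dataset
reduced_vocab = {token for token, count in toy_freq.items() if count >= MIN_COUNT}

print(toy_freq.most_common())
print(sorted(reduced_vocab))
```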

In [None]:
from collections import Counter
import spacy

nlp = spacy.load('en_core_web_sm')

all_text = ''
line_count = 0
tokens = []
with open('big.txt', 'r') as f:
    text = ''
    for line in f.readlines():
        text += line
        line_count += 1
        if line_count % 1000 == 0:
            doc = nlp(text)
            tokens += [t.text for t in doc if not t.is_stop and not t.is_punct and not t.is_space]
            all_text += text
            text = ''
        if line_count > 5000:
            break

token_freq = Counter(tokens)

# Note: this is a quick hack to load a large file, but this may split sentences mid-way.

In [None]:
token_freq.most_common(20)

In [None]:
# when capitalization doesn't bring additional information,
# lowercasing all the words is a useful way to further reduce the vocabulary

token_freq = Counter([t.lower() for t in tokens])

#### Using the vocabulary in a model

The vocabulary is composed of tokens. Those tokens are words, so they are difficult to work with for a regression model
or any neural network that expects numbers only.

A very basic way to encode the vocabulary is to use an index.

In [None]:
# create a vocabulary index

vocab_index = {}
for token in token_freq.keys():
    vocab_index[token] = len(vocab_index)


In [None]:
vocab_index

The input to a model would then consist of a series of numbers: the indices, in the vocabulary index, of the words of the sentence.

For example:

In [None]:
text = 'the quick brown fox jumps over the lazy dog'

doc = nlp(text)
tokens = [t.text for t in doc if not t.is_punct and not t.is_space and not t.is_stop]
print(tokens)
token_idx = [vocab_index.get(t, -1) for t in tokens]
print(token_idx)

We see that words that are not in the vocabulary index don't return a meaningful index (here `-1`).
Ideally, the vocabulary is built from the same dataset that is used to train the model.
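
One common way to deal with unknown words, sketched here as an assumption rather than as the method used in this notebook, is to reserve a dedicated index for them instead of returning `-1`. The `<unk>` name and the toy vocabulary are illustrative.

```python
# Reserve index 0 for unknown tokens; "<unk>" is a naming convention assumed here
UNK = "<unk>"
toy_vocab = [UNK, "quick", "brown", "fox"]
toy_index = {token: i for i, token in enumerate(toy_vocab)}

def encode(tokens, index):
    """Map tokens to indices, sending out-of-vocabulary tokens to the <unk> index."""
    return [index.get(token, index[UNK]) for token in tokens]

encoded = encode(["quick", "purple", "fox"], toy_index)
print(encoded)
```

With this design every token maps to a valid position, so the encoded values can be used directly as array indices.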

In [None]:
text = all_text[:301]
print(text)

doc = nlp(text)
tokens = [t.text.lower() for t in doc if not t.is_punct and not t.is_space and not t.is_stop]
print(tokens)
token_idx = [vocab_index.get(t, -1) for t in tokens]
print(token_idx)

### Embeddings

#### 1-hot encoder

The index method is not very useful directly, as a model might interpret this numerical index as an ordering of the words.

Here `ebook` has index 2 and `country` has index 18. A model would interpret this as `country` having 9x more weight
than `ebook`, while `project`, with index 0, would not even count.

Since tokens are values of a category, they usually need to be transformed into a vector where all values are 0, except for the 
column of the word index.

In [None]:
# 1 hot encoding of vocabulary
import numpy as np
np.set_printoptions(threshold=np.inf) # this is so the full array can be displayed

def one_hot_encoder(token):
    vector = np.zeros(len(vocab_index))
    index = vocab_index.get(token, -1)
    if index != -1:
        vector[index] = 1.0
    return vector


In [None]:
text = all_text[:301]
print(text)

doc = nlp(text)
tokens = [t.text.lower() for t in doc if not t.is_punct and not t.is_space and not t.is_stop]
print(tokens)
token_vectors = np.array([one_hot_encoder(t) for t in tokens])
print(token_vectors)

Obviously, this is not very efficient, as the array for each word is the size of the vocabulary.

To circumvent this issue, one can use embeddings that have reduced dimensionality, like Word2Vec or GloVe.

spaCy also provides this functionality by default, through the `vector` attribute of each token.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

text = all_text[:301]
print(text)

doc = nlp(text)
token_vectors = [t.vector for t in doc if not t.is_punct and not t.is_space and not t.is_stop]
print(token_vectors)

In [None]:
token_vectors[0].shape

You'll notice immediately that the vectors are not 0's and 1's anymore.

This is because each token is now represented by a dense vector (in this case 96 values) instead of one column per word of the vocabulary. These values are learned when the model is trained on a large corpus, rather than computed with a classic dimensionality reduction technique like PCA.


Embeddings now represent the words in a 96-dimensional space.

Similar words should be found in a similar area of that space.
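
Closeness in that space is often measured with cosine similarity. The sketch below uses hand-written 3-dimensional vectors, which are hypothetical values and not real embeddings, to show the idea.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings, hand-written for illustration only
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
print(f"cat/dog: {cat_dog:.3f}, cat/car: {cat_car:.3f}")
```

With these toy values, `cat` and `dog` point in nearly the same direction, while `cat` and `car` do not.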

### Embeddings with Word2Vec

Word2Vec is an embeddings model that can be trained on your own corpus of data.

In [10]:
# Some logging definition so as to be able to trace what's going on under the hood
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [11]:
from gensim.models.word2vec import Word2Vec

In [31]:
import nltk
from nltk.corpus import brown

nltk.download('brown')
# model = Word2Vec()

[nltk_data] Downloading package brown to /Users/emmanuel/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [55]:
model = Word2Vec(
#     brown.sents(),
#     min_count=10,   # minimum frequency of token
    window=5,   # how many words at a time when scanning a sentence (more on this later)
#     size=300,   # size of the output vector (named `vector_size` in gensim 4+)
#     sample=6e-5,  # threshold for downsampling very frequent words
#     alpha=0.03,  # learning rate
#     min_alpha=0.0007, 
#     negative=20,  #
    workers=4)

In [56]:
model.build_vocab(brown.sents(), progress_per=10000)

2019-07-23 12:30:23,973 : INFO : collecting all words and their counts
2019-07-23 12:30:23,975 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-07-23 12:30:24,822 : INFO : PROGRESS: at sentence #10000, processed 219770 words, keeping 23488 word types
2019-07-23 12:30:25,743 : INFO : PROGRESS: at sentence #20000, processed 430477 words, keeping 34367 word types
2019-07-23 12:30:26,632 : INFO : PROGRESS: at sentence #30000, processed 669056 words, keeping 42365 word types
2019-07-23 12:30:27,336 : INFO : PROGRESS: at sentence #40000, processed 888291 words, keeping 49136 word types
2019-07-23 12:30:27,936 : INFO : PROGRESS: at sentence #50000, processed 1039920 words, keeping 53024 word types
2019-07-23 12:30:28,398 : INFO : collected 56057 word types from a corpus of 1161192 raw words and 57340 sentences
2019-07-23 12:30:28,399 : INFO : Loading a fresh vocabulary
2019-07-23 12:30:28,500 : INFO : effective_min_count=5 retains 15173 unique words (27% of orig

In [57]:
model.train(brown.sents(), total_examples=model.corpus_count, epochs=5, report_delay=1)

2019-07-23 12:30:37,331 : INFO : training model with 4 workers on 15173 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2019-07-23 12:30:38,343 : INFO : EPOCH 1 - PROGRESS: at 17.44% examples, 145236 words/s, in_qsize 0, out_qsize 0
2019-07-23 12:30:39,382 : INFO : EPOCH 1 - PROGRESS: at 36.33% examples, 146911 words/s, in_qsize 0, out_qsize 0
2019-07-23 12:30:40,389 : INFO : EPOCH 1 - PROGRESS: at 49.43% examples, 137679 words/s, in_qsize 0, out_qsize 0
2019-07-23 12:30:41,437 : INFO : EPOCH 1 - PROGRESS: at 67.44% examples, 142224 words/s, in_qsize 0, out_qsize 0
2019-07-23 12:30:42,482 : INFO : EPOCH 1 - PROGRESS: at 88.13% examples, 137078 words/s, in_qsize 0, out_qsize 0
2019-07-23 12:30:43,005 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-07-23 12:30:43,007 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-07-23 12:30:43,008 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-07-23 1

(3908888, 5805960)

In [58]:
model.wv.most_similar('man')

2019-07-23 12:31:13,211 : INFO : precomputing L2-norms of word weight vectors


[('woman', 0.8763573169708252),
 ('girl', 0.870948851108551),
 ('boy', 0.8382799625396729),
 ('child', 0.7810909748077393),
 ('young', 0.7795072197914124),
 ('paradise', 0.7738704681396484),
 ('himself', 0.7477854490280151),
 ('person', 0.7475799918174744),
 ('good', 0.7463479042053223),
 ('old', 0.7426416873931885)]

In [59]:
model.wv.most_similar(positive=['woman','king'], negative=['man'], topn = 3)

[('sold', 0.942639172077179),
 ('boyhood', 0.9409275054931641),
 ('Class', 0.9401830434799194)]

### Using pre-trained Word2Vec

In [61]:
from gensim.models import KeyedVectors
from nltk.data import find
nltk.download('word2vec_sample')

word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
model = KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)

[nltk_data] Downloading package word2vec_sample to
[nltk_data]     /Users/emmanuel/nltk_data...
[nltk_data]   Unzipping models/word2vec_sample.zip.
2019-07-23 12:32:28,595 : INFO : loading projection weights from /Users/emmanuel/nltk_data/models/word2vec_sample/pruned.word2vec.txt
2019-07-23 12:32:49,944 : INFO : loaded (43981, 300) matrix from /Users/emmanuel/nltk_data/models/word2vec_sample/pruned.word2vec.txt


In [62]:
model.most_similar('man')

2019-07-23 12:32:59,317 : INFO : precomputing L2-norms of word weight vectors


[('woman', 0.7664012908935547),
 ('boy', 0.6824869513511658),
 ('teenager', 0.6586930155754089),
 ('girl', 0.5921713709831238),
 ('robber', 0.5585117936134338),
 ('men', 0.5489763617515564),
 ('guy', 0.5420036315917969),
 ('person', 0.5342026948928833),
 ('gentleman', 0.5337991714477539),
 ('Man', 0.5316052436828613)]

In [63]:
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=3)


[('queen', 0.7118192911148071),
 ('monarch', 0.6189673542976379),
 ('princess', 0.5902431011199951)]

In [67]:
model['man']

array([ 0.141162  ,  0.0566339 ,  0.0150038 , -0.0359245 ,  0.038883  ,
       -0.0178566 , -0.0857962 ,  0.0029849 ,  0.0621283 ,  0.00084198,
        0.0124679 , -0.108196  , -0.0363472 , -0.0655094 , -0.044166  ,
        0.0176453 , -0.0422641 ,  0.0256755 ,  0.0128906 , -0.0435321 ,
       -0.0566339 ,  0.00056132,  0.0113057 , -0.117494  ,  0.027683  ,
       -0.0828377 , -0.0338113 ,  0.112423  ,  0.162294  , -0.0196528 ,
        0.0701585 ,  0.0591698 , -0.027683  , -0.0089283 , -0.0418415 ,
        0.109887  ,  0.107351  , -0.0549434 ,  0.0310641 ,  0.138626  ,
        0.0136302 , -0.0166943 ,  0.0917132 , -0.00351321,  0.0963622 ,
       -0.0583245 , -0.032966  ,  0.0045434 , -0.0224    ,  0.016483  ,
       -0.0579019 ,  0.0540981 ,  0.0241962 , -0.0790339 ,  0.0352906 ,
       -0.0365585 , -0.0336    , -0.0188075 ,  0.0350792 , -0.0047283 ,
        0.0756528 ,  0.132709  , -0.0187019 , -0.0061283 ,  0.0393056 ,
       -0.00401509, -0.0148981 , -0.0498717 ,  0.0538868 , -0.01

### Bag Of Words (BOW)

Bag Of Words is a method that represents text over a fixed-size vocabulary while ignoring word order.

It creates a set of unique words, and words are vectorized by their index in the vocabulary (as seen before).
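
A minimal sketch of the idea in plain Python, assuming a made-up two-sentence corpus and a simple whitespace split: each document becomes a vector of word counts, one position per vocabulary word.

```python
from collections import Counter

# Toy corpus, made up for illustration
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

tokenized = [document.split() for document in documents]
# The vocabulary is the sorted set of unique words across all documents
vocabulary = sorted({token for tokens in tokenized for token in tokens})

def bag_of_words(tokens, vocabulary):
    """Count how many times each vocabulary word appears, ignoring word order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

bow_vectors = [bag_of_words(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for vector in bow_vectors:
    print(vector)
```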

When looking for keywords or common terms, various scoring methods can be used.

Term Frequency - Inverse Document Frequency (TF-IDF) is a common way to score terms in documents:
a term's frequency in the document is weighted by the inverse of its frequency across all documents in the corpus.
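
The scoring can be sketched in a few lines. The formula below (term count divided by document length, times the log of the number of documents over the number of documents containing the term) is one simple variant chosen for illustration; libraries often use smoothed or normalized versions. The toy corpus is made up.

```python
import math
from collections import Counter

# Toy corpus of already tokenized documents, made up for illustration
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

def tf_idf(term, document, corpus):
    """Score a term in a document with a basic, unsmoothed TF-IDF variant."""
    tf = Counter(document)[term] / len(document)
    doc_freq = sum(1 for d in corpus if term in d)
    if doc_freq == 0:
        return 0.0  # term never seen in the corpus
    idf = math.log(len(corpus) / doc_freq)
    return tf * idf

score_the = tf_idf("the", corpus[0], corpus)  # appears in every document
score_cat = tf_idf("cat", corpus[0], corpus)  # appears in 2 of 3 documents
score_dog = tf_idf("dog", corpus[1], corpus)  # appears in 1 of 3 documents
print(score_the, score_cat, score_dog)
```

A word present in every document scores zero, while the rarest word gets the highest score.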



### unigrams, bigrams, trigrams, n-grams

A single-word token is a unigram.

A two-word token is called a bigram.

A three-word token is called a trigram.

So an n-word token is called an n-gram.

Using single words often strips the meaning from word combinations (for example "New York"), so n-grams are useful, although they tend to have much lower frequencies.
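
A minimal sketch of n-gram extraction in plain Python follows. The `ngrams` helper is written for this example (it is not a library function) and the sentence is made up.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "new york is a big city".split()

bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(bigrams)
print(trigrams)
```

The bigram `('new', 'york')` keeps together two words whose combined meaning would be lost as separate unigrams.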


#### Note about Word2Vec

Word2Vec can be trained with 2 methods. One of them, called CBOW (Continuous Bag Of Words), uses a bag of words method,
scanning a window across the sentences used as input. The other is called skip-gram.

This preserves some of the relationships between words, and allows for clustering them in space by similarity.
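
The window scan can be sketched as follows. This only builds the (context, target) pairs that a CBOW-style model would learn from; it does not train anything, and the `cbow_pairs` helper is an illustrative assumption rather than part of `gensim`.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word) pairs for each position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after the target
        yield left + right, target

sentence = "the quick brown fox jumps".split()
pairs = list(cbow_pairs(sentence, window=2))

for context, target in pairs:
    print(f"{context} -> {target}")
```

In CBOW the context words are treated as an unordered bag used to predict the target word, which is where the "bag of words" in the name comes from.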
