# NLP Objectives and Basic Steps
- Objectives
   * Split documents into tokens or segments
   * Enable a machine to read and interpret text by extracting numerical features from it
   
- Common Terminology
    * Corpus
    * Document
    * Token
    * Bag of Words
    * Vocab
    * Stop Words
    
## 1. Basic processing steps
   * Read in documents (corpus).
   * Tokenization & Lower case: split documents into individual words or segments.
   * Strip out spacing and punctuation.
   * Remove stop words.
   * Stemming & Lemmatization.
   * Bag of words: document-to-term matrix.
   * Term Frequency and Inverse Document Frequency (TF-IDF).

## 2. Document searching - by similarity measure

## Example

In [1]:
doc1 = 'Wise people think they are foolish.'
doc2 = 'Foolish foolish people think they are wise wise.'
doc3 = 'I am definitely wise so this irritates me.'
doc4 = 'ABC is for sure like definitely foolish.'

# 1. Basic processing steps

## 1.1 Corpus
- A corpus is a collection (here, a list) of documents.

In [2]:
corpus = [doc1, doc2, doc3, doc4]

## 1.2 Tokenization and Lower case

In [3]:
# Tokenization: without any package.
for doc in corpus:
    lower_doc = doc.lower()
    tokens = lower_doc.split(' ')
    print("Tokens: {}".format(tokens))

Tokens: ['wise', 'people', 'think', 'they', 'are', 'foolish.']
Tokens: ['foolish', 'foolish', 'people', 'think', 'they', 'are', 'wise', 'wise.']
Tokens: ['i', 'am', 'definitely', 'wise', 'so', 'this', 'irritates', 'me.']
Tokens: ['abc', 'is', 'for', 'sure', 'like', 'definitely', 'foolish.']


In [4]:
# Tokenization: with nltk package
from nltk.tokenize import word_tokenize

[word_tokenize(doc.lower()) for doc in corpus]

[['wise', 'people', 'think', 'they', 'are', 'foolish', '.'],
 ['foolish', 'foolish', 'people', 'think', 'they', 'are', 'wise', 'wise', '.'],
 ['i', 'am', 'definitely', 'wise', 'so', 'this', 'irritates', 'me', '.'],
 ['abc', 'is', 'for', 'sure', 'like', 'definitely', 'foolish', '.']]
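
When NLTK or its tokenizer data is not available, a regular expression can approximate this result for simple English text. The sketch below is a stand-in under that assumption, not a description of what `word_tokenize` does internally; the names `simple_tokenize` and `TOKEN_PATTERN` are illustrative only.

```python
import re

# Hypothetical pattern: a run of word characters is one token, and every
# other non-space character (e.g. punctuation) is a token of its own.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(doc):
    """Lowercase a document and split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(doc.lower())

corpus = [
    'Wise people think they are foolish.',
    'Foolish foolish people think they are wise wise.',
    'I am definitely wise so this irritates me.',
    'ABC is for sure like definitely foolish.',
]

tokenized = [simple_tokenize(doc) for doc in corpus]
for tokens in tokenized:
    print(tokens)
```

For these four documents the sketch separates the final period from the last word, as the NLTK output above does. It is cruder on contractions: `simple_tokenize("don't")` yields `['don', "'", 't']`, so real text usually benefits from a proper tokenizer.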

## 1.3 Strip out spacing and punctuation

In [5]:
import string

string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [6]:
tokenized_corpus = [word_tokenize(doc.lower()) for doc in corpus]
tokenized_corpus

[['wise', 'people', 'think', 'they', 'are', 'foolish', '.'],
 ['foolish', 'foolish', 'people', 'think', 'they', 'are', 'wise', 'wise', '.'],
 ['i', 'am', 'definitely', 'wise', 'so', 'this', 'irritates', 'me', '.'],
 ['abc', 'is', 'for', 'sure', 'like', 'definitely', 'foolish', '.']]

In [7]:
# Strip leading/trailing punctuation from each token and drop tokens that become empty.
token_ls = []
for document in tokenized_corpus:
    tokens = []
    for token in document:
        if token.strip(string.punctuation) != '':
            tokens.append(token.strip(string.punctuation))
    token_ls.append(tokens)

In [8]:
token_ls

[['wise', 'people', 'think', 'they', 'are', 'foolish'],
 ['foolish', 'foolish', 'people', 'think', 'they', 'are', 'wise', 'wise'],
 ['i', 'am', 'definitely', 'wise', 'so', 'this', 'irritates', 'me'],
 ['abc', 'is', 'for', 'sure', 'like', 'definitely', 'foolish']]
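
One detail worth keeping in mind is that `str.strip` only removes characters from the two ends of a token, so punctuation inside a token survives. The sketch below illustrates this on a few made-up tokens and shows one possible alternative, `str.translate`, that deletes punctuation everywhere; which behaviour is preferable depends on the data.

```python
import string

# Made-up tokens chosen to show the edge cases.
examples = ['foolish.', '(wise)', 'e-mail', "don't", '...']

# strip() trims punctuation at the start and end only.
stripped = [token.strip(string.punctuation) for token in examples]
print(stripped)

# Alternative (an assumption, not used in this notebook): delete every
# punctuation character, wherever it appears in the token.
remove_punct = str.maketrans('', '', string.punctuation)
translated = [token.translate(remove_punct) for token in examples]
print(translated)
```

With `strip`, `'e-mail'` and `"don't"` are kept intact, while `translate` turns them into `'email'` and `'dont'`. In both cases `'...'` becomes an empty string, which is why the loop above filters out empty results.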

## 1.4 Remove stop words

In [9]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
print(stop_words)
print()
print('{} stop words.'.format(len(stop_words)))

{'don', 'did', 'does', 'doesn', 'herself', 'yours', 'through', 'ours', 'shouldn', 've', 'by', 'above', 'was', 'haven', "haven't", 'just', 'during', 'the', 'hadn', 'we', "you'd", 'theirs', 'wouldn', 'again', 'our', 'has', 'before', 'at', 'mightn', 'those', 't', 'now', 'them', 'couldn', 'under', "mustn't", 'who', 'in', 'only', 'o', 'where', 'own', 'same', 'too', 'while', 'be', 'yourselves', 'of', 'about', 'on', 'any', 'hasn', 'can', 'shan', "it's", "shouldn't", 'd', "aren't", 'some', 'wasn', 'nor', 'themselves', "wasn't", 'an', 'this', 'am', 'over', 're', 'so', "shan't", 'had', 'but', 'when', 'that', 'she', "won't", 'to', 'isn', 'off', 'been', 'mustn', 'him', 'than', 'against', 'not', 'further', "weren't", 'which', 'each', 'y', 'such', 's', 'my', 'being', 'up', 'he', 'needn', 'himself', 'below', "hadn't", "doesn't", 'hers', "isn't", "mightn't", 'for', 'out', 'very', 'his', 'these', 'do', 'll', "couldn't", 'all', 'because', 'you', 'have', 'there', 'weren', 'ourselves', 'after', 'ma', "has

In [10]:
# Remove stop words.
token_ls = [[token for token in document if token not in stop_words] 
        for document in token_ls]

In [11]:
token_ls

[['wise', 'people', 'think', 'foolish'],
 ['foolish', 'foolish', 'people', 'think', 'wise', 'wise'],
 ['definitely', 'wise', 'irritates'],
 ['abc', 'sure', 'like', 'definitely', 'foolish']]
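
The same filtering step can be tried without downloading the NLTK stop-word data. The sketch below uses a hand-written `mini_stop_words` set (a hypothetical name) containing only the stop words that occur in this corpus; it is far smaller than the NLTK list and is meant purely for illustration.

```python
# Tokens after punctuation stripping, copied from the earlier output.
token_ls = [
    ['wise', 'people', 'think', 'they', 'are', 'foolish'],
    ['foolish', 'foolish', 'people', 'think', 'they', 'are', 'wise', 'wise'],
    ['i', 'am', 'definitely', 'wise', 'so', 'this', 'irritates', 'me'],
    ['abc', 'is', 'for', 'sure', 'like', 'definitely', 'foolish'],
]

# Hypothetical mini stop list. A set is used because membership tests on a
# set take constant time on average, unlike a list.
mini_stop_words = {'they', 'are', 'i', 'am', 'so', 'this', 'me', 'is', 'for'}

filtered = [[token for token in document if token not in mini_stop_words]
            for document in token_ls]
for document in filtered:
    print(document)
```

For this corpus the mini list produces the same four token lists as the NLTK stop-word set.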

## 1.5 Stemming & Lemmatization
- Stemming: reducing inflected (or sometimes derived) words to their stem, base, or root form.
- Lemmatization: determining the lemma for a given word.
   * A lemma is the form of a word that heads its dictionary entry, e.g. run (lemma); runs, ran, and running (inflections).
   * Lemmatization is a complex task that involves understanding context and determining the part of speech of a word in a sentence.
      * e.g. "organized" (verb or adjective?)
   * A widely used lemmatization method is based on WordNet, a large lexical database of English.
- Difference between Stemming & Lemmatization
   * A stemmer operates on a single word **without knowledge of the context**, and therefore cannot discriminate between words that have different meanings depending on part of speech. Lemmatization, by contrast, **requires context and POS tags**.
   * Stemming may not generate a real word, but lemmatization always generates real words.
   * Stemmers are typically easier to implement and run faster, at the cost of reduced accuracy.
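
The contrast can be made concrete with a deliberately tiny sketch. Both functions below are toys written for this illustration: `toy_stem` strips a few suffixes by rule (it is not the Porter algorithm), and `toy_lemmatize` looks words up in a hand-made table keyed by part of speech (it is not WordNet).

```python
def toy_stem(word):
    """Rule-based suffix stripping; needs no context and may return non-words."""
    for suffix in ('ing', 'ly', 'es', 'ed', 's'):
        # Keep at least three characters of the word after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hypothetical lookup table: (word, part of speech) -> lemma.
TOY_LEMMAS = {
    ('runs', 'v'): 'run',
    ('ran', 'v'): 'run',
    ('running', 'v'): 'run',
    ('organized', 'v'): 'organize',
    ('organized', 'a'): 'organized',
}

def toy_lemmatize(word, pos):
    """Dictionary lookup; the part-of-speech tag selects the lemma."""
    return TOY_LEMMAS.get((word, pos), word)

words = ['runs', 'ran', 'running', 'organized']
stems = [toy_stem(word) for word in words]
verb_lemmas = [toy_lemmatize(word, 'v') for word in words]
print(stems)
print(verb_lemmas)
print(toy_lemmatize('organized', 'a'))
```

The stemmer produces `'runn'` and `'organiz'`, which are not real words, and cannot map the irregular form `'ran'` to `'run'`. The lookup returns real words and gives different answers for "organized" depending on the tag, at the cost of needing a dictionary and a part-of-speech decision.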

### Stemming

In [12]:
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

stem_docs = [[porter_stemmer.stem(token) for token in document]
               for document in token_ls
            ]
stem_docs

[['wise', 'peopl', 'think', 'foolish'],
 ['foolish', 'foolish', 'peopl', 'think', 'wise', 'wise'],
 ['definit', 'wise', 'irrit'],
 ['abc', 'sure', 'like', 'definit', 'foolish']]

### Lemmatization

In [13]:
from nltk.stem.wordnet import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

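# Note: lemmatize() defaults to pos='n' (noun), so verb forms such as 'irritates' are left unchanged below.
lemma_docs = [[wordnet_lemmatizer.lemmatize(token) for token in document]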
               for document in token_ls
            ]
lemma_docs

[['wise', 'people', 'think', 'foolish'],
 ['foolish', 'foolish', 'people', 'think', 'wise', 'wise'],
 ['definitely', 'wise', 'irritates'],
 ['abc', 'sure', 'like', 'definitely', 'foolish']]

## 1.6 Bag of words

### Method 1 - Construct Vocab & use Counter

In [14]:
# Before building the bag of words, we need to build the vocabulary ('Vocab').

#### Vocab
- The set of unique tokens in the corpus.

In [15]:
# Use lemmatized documents as an example.
# Collect all tokens in a list.
vocabulary = [token for document in lemma_docs for token in document]
vocabulary

['wise',
 'people',
 'think',
 'foolish',
 'foolish',
 'foolish',
 'people',
 'think',
 'wise',
 'wise',
 'definitely',
 'wise',
 'irritates',
 'abc',
 'sure',
 'like',
 'definitely',
 'foolish']

In [16]:
# Extract unique tokens.
vocabulary = sorted(list(set(vocabulary)))

In [17]:
# vocabulary is the feature space.
print('Vocabulary (features):', vocabulary)

Vocabulary (features): ['abc', 'definitely', 'foolish', 'irritates', 'like', 'people', 'sure', 'think', 'wise']


#### Bag of words
- Document-to-term matrix.

In [18]:
from collections import Counter
import numpy as np

In [19]:
# Helper function.
def bow_vectorize(doc, vocabulary):
    # Counter returns a dict-like object mapping each word to its count.
    bag_of_words = Counter(doc)
    
    # Create a zero vector with one entry per vocabulary word.
    doc_vector = np.zeros(len(vocabulary))
    
    # Loop over the vocabulary; if a word occurs in the document, add its count at that word's index.
    for word_index, word in enumerate(vocabulary):
        if word in bag_of_words:
            doc_vector[word_index] += bag_of_words[word]
    return doc_vector

In [20]:
bag_of_words_matrix = list()
for document in lemma_docs:
    bag_of_words_matrix.append(bow_vectorize(document, vocabulary))

In [21]:
# Check final results
print('Features:', vocabulary)

for i in range(len(bag_of_words_matrix)):
    print('"%s":'% lemma_docs[i], '\n', bag_of_words_matrix[i], '\n')
          
print('Feature matrix:')
print(bag_of_words_matrix)

Features: ['abc', 'definitely', 'foolish', 'irritates', 'like', 'people', 'sure', 'think', 'wise']
"['wise', 'people', 'think', 'foolish']": 
 [0. 0. 1. 0. 0. 1. 0. 1. 1.] 

"['foolish', 'foolish', 'people', 'think', 'wise', 'wise']": 
 [0. 0. 2. 0. 0. 1. 0. 1. 2.] 

"['definitely', 'wise', 'irritates']": 
 [0. 1. 0. 1. 0. 0. 0. 0. 1.] 

"['abc', 'sure', 'like', 'definitely', 'foolish']": 
 [1. 1. 1. 0. 1. 0. 1. 0. 0.] 

Feature matrix:
[array([0., 0., 1., 0., 0., 1., 0., 1., 1.]), array([0., 0., 2., 0., 0., 1., 0., 1., 2.]), array([0., 1., 0., 1., 0., 0., 0., 0., 1.]), array([1., 1., 1., 0., 1., 0., 1., 0., 0.])]
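
For larger vocabularies it may be cheaper to loop over the words of the document instead of over the whole vocabulary. The sketch below is one way to do that with a word-to-index dictionary; `bow_vectorize_indexed` and `word_to_index` are hypothetical names, and plain Python lists are used so that the example runs without NumPy.

```python
from collections import Counter

vocabulary = ['abc', 'definitely', 'foolish', 'irritates', 'like',
              'people', 'sure', 'think', 'wise']

# Map each vocabulary word to its column index once, up front.
word_to_index = {word: index for index, word in enumerate(vocabulary)}

def bow_vectorize_indexed(doc, word_to_index):
    """Count vector for one tokenized document; unknown words are ignored."""
    vector = [0] * len(word_to_index)
    for word, count in Counter(doc).items():
        index = word_to_index.get(word)
        if index is not None:
            vector[index] = count
    return vector

doc2 = ['foolish', 'foolish', 'people', 'think', 'wise', 'wise']
doc2_vector = bow_vectorize_indexed(doc2, word_to_index)
print(doc2_vector)
```

The result for the second document matches the second row of the feature matrix above. Words that are not in the vocabulary are skipped, which is also how a fixed-vocabulary vectorizer treats unseen words.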


### Method 2 - CountVectorizer

In [22]:
# Preprocess in one step: lowercase, tokenize, and lemmatize.
def lemmatize(doc):
    return [wordnet_lemmatizer.lemmatize(word) for word in word_tokenize(doc.lower())]

In [23]:
# Run the whole pipeline in one step: lowercase, tokenize, remove stop words, lemmatize, and build the bag of words.
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(stop_words=stopwords.words('english'),
                                   vocabulary=vocabulary,
                                   tokenizer=lemmatize)

# Can perform on multiple documents.
print('Feature matrix - Method 2')
print(count_vectorizer.fit_transform(corpus).todense())

Feature matrix - Method 2
[[0 0 1 0 0 1 0 1 1]
 [0 0 2 0 0 1 0 1 2]
 [0 1 0 1 0 0 0 0 1]
 [1 1 1 0 1 0 1 0 0]]




In [24]:
# Compared with results using Method 1
print('Feature matrix - Method 1')
bag_of_words_matrix

Feature matrix - Method 1


[array([0., 0., 1., 0., 0., 1., 0., 1., 1.]),
 array([0., 0., 2., 0., 0., 1., 0., 1., 2.]),
 array([0., 1., 0., 1., 0., 0., 0., 0., 1.]),
 array([1., 1., 1., 0., 1., 0., 1., 0., 0.])]

## 1.7 TF-IDF

### Term Frequency (TF)
- The number of times a word appears in a document divided by the total number of words in that document. (Documents differ in length, so a term is likely to appear more often in long documents than in short ones; dividing by the document length corrects for this.)

In [25]:
def tf_vectorize(doc, vocabulary):
    bow_vector = bow_vectorize(doc, vocabulary)
    tf_vector = np.zeros(len(vocabulary))
    for idx, vec in enumerate(bow_vector):
        tf_vector[idx] = vec / len(doc)
    return tf_vector

In [26]:
tf_matrix = list()
for doc in lemma_docs:
    tf_matrix.append(tf_vectorize(doc, vocabulary))

In [27]:
tf_matrix

[array([0.  , 0.  , 0.25, 0.  , 0.  , 0.25, 0.  , 0.25, 0.25]),
 array([0.        , 0.        , 0.33333333, 0.        , 0.        ,
        0.16666667, 0.        , 0.16666667, 0.33333333]),
 array([0.        , 0.33333333, 0.        , 0.33333333, 0.        ,
        0.        , 0.        , 0.        , 0.33333333]),
 array([0.2, 0.2, 0.2, 0. , 0.2, 0. , 0.2, 0. , 0. ])]

### Inverse Document Frequency (IDF)
- Measures how important a term is within the corpus.
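- Frequent terms are assigned lower weights, while rare terms are assigned higher weights.
- Let $|D|$ denote the number of documents and $df(w,D)$ the number of documents that contain the term $w$. Then, $$idf(w) = \ln\left(\frac{|D|}{df(w,D)}\right)+1$$ Or a smoothed version of IDF: $$idf(w) = \ln\left(\frac{|D|+1}{df(w,D)+1}\right)+1$$
- Why plus 1? The 1 added outside the logarithm keeps the IDF from becoming zero for a term that appears in every document. In the smoothed version, the extra 1 in the numerator and denominator also prevents division by zero for a term with $df(w,D)=0$. The smoothed IDF is never larger than the unsmoothed IDF.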
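
The two formulas can be evaluated by hand for the four-document corpus used here. The sketch below uses only the standard library; `idf` and `smoothed_idf` are hypothetical helper names, and `doc_count` stands for $df(w,D)$.

```python
import math

def idf(doc_count, n_docs):
    """Unsmoothed IDF: ln(|D| / df) + 1."""
    return math.log(n_docs / doc_count) + 1

def smoothed_idf(doc_count, n_docs):
    """Smoothed IDF: ln((|D| + 1) / (df + 1)) + 1."""
    return math.log((n_docs + 1) / (doc_count + 1)) + 1

n_docs = 4
table = {
    doc_count: (round(idf(doc_count, n_docs), 8),
                round(smoothed_idf(doc_count, n_docs), 8))
    for doc_count in (1, 2, 3, 4)
}
for doc_count, (plain, smooth) in table.items():
    print(doc_count, plain, smooth)
```

A term found in only one document gets the highest weight, and a term found in all four documents gets a weight of exactly 1 under both versions. For every document count in between, the smoothed value is the smaller of the two.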

In [28]:
# Binary indicator matrix: 1 if the term occurs in the document, else 0 (column sums give the document frequency).
df = np.where(np.array(tf_matrix)>0,1,0)
df

array([[0, 0, 1, 0, 0, 1, 0, 1, 1],
       [0, 0, 1, 0, 0, 1, 0, 1, 1],
       [0, 1, 0, 1, 0, 0, 0, 0, 1],
       [1, 1, 1, 0, 1, 0, 1, 0, 0]])

In [29]:
# Binary indicator matrix: 1 if the term occurs in the document, else 0 (column sums give the document frequency).
df = np.where(np.array(tf_matrix)>0,1,0)
df

# Get IDF matrix.
idf = np.log(np.divide(len(corpus), np.sum(df, axis=0))) + 1
print("\nIDF Matrix:")
print(idf)

# Smoothed version of IDF matrix.
smoothed_idf=np.log(np.divide(len(corpus)+1, np.sum(df, axis=0)+1)) + 1
print("\nSmoothed IDF Matrix:")
print(smoothed_idf)


IDF Matrix:
[2.38629436 1.69314718 1.28768207 2.38629436 2.38629436 1.69314718
 2.38629436 1.69314718 1.28768207]

Smoothed IDF Matrix:
[1.91629073 1.51082562 1.22314355 1.91629073 1.91629073 1.51082562
 1.91629073 1.51082562 1.22314355]


### TF-IDF
- Let $s(w,d)=tf(w,d) * idf(w)$. Normalizing the scores of each document by their Euclidean norm gives
   $$tfidf(w,d)=\frac{s(w,d)}{\sqrt{\sum_{w \in d}{s(w,d)^2}}}$$
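
As a worked instance of this formula, the sketch below computes the TF-IDF vector of the first document by hand, using the unsmoothed IDF. The dictionaries `tf` and `doc_freq` are filled in from the preprocessed corpus of this notebook; the variable names are illustrative.

```python
import math

# Document 1 after preprocessing has four distinct terms, each occurring once.
tf = {'wise': 1 / 4, 'people': 1 / 4, 'think': 1 / 4, 'foolish': 1 / 4}

# Number of documents (out of 4) that contain each term.
doc_freq = {'wise': 3, 'people': 2, 'think': 2, 'foolish': 3}
n_docs = 4

# s(w, d) = tf(w, d) * idf(w), with idf(w) = ln(|D| / df) + 1.
scores = {word: tf[word] * (math.log(n_docs / doc_freq[word]) + 1)
          for word in tf}

# Divide by the Euclidean norm of the document's score vector.
norm = math.sqrt(sum(score ** 2 for score in scores.values()))
tfidf = {word: score / norm for word, score in scores.items()}

for word, value in tfidf.items():
    print(word, round(value, 4))
```

The rarer terms "people" and "think" end up with a higher weight (about 0.5628) than "wise" and "foolish" (about 0.4280), even though all four occur once in the document.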

In [30]:
# Use the sklearn package - unsmoothed IDF (smooth_idf=False).
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'),
                                   vocabulary=vocabulary,
                                   smooth_idf=False)
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus).todense()
print(tfidf_matrix.shape)
tfidf_matrix

(4, 9)


matrix([[0.        , 0.        , 0.42804604, 0.        , 0.        ,
         0.5628291 , 0.        , 0.5628291 , 0.42804604],
        [0.        , 0.        , 0.59085245, 0.        , 0.        ,
         0.38844998, 0.        , 0.38844998, 0.59085245],
        [0.        , 0.52964479, 0.        , 0.74647284, 0.        ,
         0.        , 0.        , 0.        , 0.40280852],
        [0.51335285, 0.36423919, 0.27701329, 0.        , 0.51335285,
         0.        , 0.51335285, 0.        , 0.        ]])

In [31]:
# Simply multiply tf with idf and normalize the result.
from sklearn.preprocessing import normalize
normalize(np.array(tf_matrix)*idf)

array([[0.        , 0.        , 0.42804604, 0.        , 0.        ,
        0.5628291 , 0.        , 0.5628291 , 0.42804604],
       [0.        , 0.        , 0.59085245, 0.        , 0.        ,
        0.38844998, 0.        , 0.38844998, 0.59085245],
       [0.        , 0.52964479, 0.        , 0.74647284, 0.        ,
        0.        , 0.        , 0.        , 0.40280852],
       [0.51335285, 0.36423919, 0.27701329, 0.        , 0.51335285,
        0.        , 0.51335285, 0.        , 0.        ]])

In [32]:
# Use sklearn package - smoothed IDF.
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'),
                                   vocabulary=vocabulary,
                                   smooth_idf=True)
smooth_tfidf_matrix = tfidf_vectorizer.fit_transform(corpus).todense()
print(smooth_tfidf_matrix.shape)
smooth_tfidf_matrix

(4, 9)


matrix([[0.        , 0.        , 0.44493104, 0.        , 0.        ,
         0.54957835, 0.        , 0.54957835, 0.44493104],
        [0.        , 0.        , 0.60161783, 0.        , 0.        ,
         0.37155886, 0.        , 0.37155886, 0.60161783],
        [0.        , 0.55349232, 0.        , 0.70203482, 0.        ,
         0.        , 0.        , 0.        , 0.44809973],
        [0.49819711, 0.39278432, 0.31799276, 0.        , 0.49819711,
         0.        , 0.49819711, 0.        , 0.        ]])

In [33]:
# Simply multiply tf with smoothed idf and normalize the result.
normalize(np.array(tf_matrix)*smoothed_idf)

array([[0.        , 0.        , 0.44493104, 0.        , 0.        ,
        0.54957835, 0.        , 0.54957835, 0.44493104],
       [0.        , 0.        , 0.60161783, 0.        , 0.        ,
        0.37155886, 0.        , 0.37155886, 0.60161783],
       [0.        , 0.55349232, 0.        , 0.70203482, 0.        ,
        0.        , 0.        , 0.        , 0.44809973],
       [0.49819711, 0.39278432, 0.31799276, 0.        , 0.49819711,
        0.        , 0.49819711, 0.        , 0.        ]])

# 2. Document searching - find similar documents by similarity measure

## 2.1 Cosine similarity
- Given two vectors A and B:
<img src='NLP/cosine_formula.svg' width='50%' />
- Example 1 : A=[0,2,1], B=[0,2,2], then
$$cosine(A,B)=\frac{0*0+2*2+1*2}{\sqrt{0+4+1}*\sqrt{0+4+4}} = \frac{6}{\sqrt{40}} \approx 0.9487$$

- Example 2 : A=[0,2,1], B=[0,2,1], then
$$cosine(A,B)=\frac{0*0+2*2+1*1}{\sqrt{0+4+1}*\sqrt{0+4+1}} = 1$$
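
The two examples can be checked numerically with a few lines of standard-library Python. The function name `cosine_similarity` is chosen for this sketch; it assumes both vectors have the same length and neither is all zeros (a zero vector would cause a division by zero).

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

example_1 = cosine_similarity([0, 2, 1], [0, 2, 2])
example_2 = cosine_similarity([0, 2, 1], [0, 2, 1])
print(round(example_1, 4))
print(round(example_2, 4))
```

Example 1 evaluates to about 0.9487 and Example 2 to 1. A cosine similarity can never exceed 1, and for vectors with non-negative entries, such as TF-IDF vectors, it lies between 0 and 1.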

In [34]:
# Package to measure 'distance'.
from scipy.spatial import distance

# Calculate cosine distance of every pair of documents.
# similarity is equal to '1 - distance'.
similarity = 1 - distance.squareform(distance.pdist(tfidf_matrix, 'cosine'))
similarity

array([[1.        , 0.943086  , 0.17242059, 0.11857444],
       [0.943086  , 1.        , 0.2380004 , 0.16367398],
       [0.17242059, 0.2380004 , 1.        , 0.19291739],
       [0.11857444, 0.16367398, 0.19291739, 1.        ]])

In [35]:
# Cosine similarity reduces to a dot product here because each TF-IDF row already has unit Euclidean length.
# Take first two vectors (documents) as an example.
np.dot(tfidf_matrix[0, :], tfidf_matrix[1, :].transpose())

matrix([[0.943086]])

In [36]:
# Document searching - find similar documents.
for idx, doc in enumerate(corpus):
    print('Query document: {}'.format(doc))
    print('Similar document: {}'.format(corpus[np.argsort(similarity)[:,::-1][idx,1]]))
    print('Cosine similarity score: {}'.format(round(similarity[idx,np.argsort(similarity)[:,::-1][idx,1]],4)))
    print()

Query document: Wise people think they are foolish.
Similar document: Foolish foolish people think they are wise wise.
Cosine similarity score: 0.9431

Query document: Foolish foolish people think they are wise wise.
Similar document: Wise people think they are foolish.
Cosine similarity score: 0.9431

Query document: I am definitely wise so this irritates me.
Similar document: Foolish foolish people think they are wise wise.
Cosine similarity score: 0.238

Query document: ABC is for sure like definitely foolish.
Similar document: I am definitely wise so this irritates me.
Cosine similarity score: 0.1929
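
The loop above takes the second entry of each sorted row, which assumes the query document itself always sorts first. A slightly more defensive variant excludes the query by index, which matters if two documents are identical and tie at a similarity of 1. The sketch below shows one way to do this in plain Python on the similarity values printed earlier; `most_similar` is a hypothetical helper name.

```python
# Cosine similarity matrix copied from the output above.
similarity = [
    [1.0, 0.943086, 0.17242059, 0.11857444],
    [0.943086, 1.0, 0.2380004, 0.16367398],
    [0.17242059, 0.2380004, 1.0, 0.19291739],
    [0.11857444, 0.16367398, 0.19291739, 1.0],
]

def most_similar(similarity, query_idx):
    """Return (index, score) of the best match, excluding the query itself."""
    candidates = [
        (score, idx)
        for idx, score in enumerate(similarity[query_idx])
        if idx != query_idx
    ]
    best_score, best_idx = max(candidates)
    return best_idx, best_score

matches = [most_similar(similarity, query_idx)
           for query_idx in range(len(similarity))]
for query_idx, (match_idx, score) in enumerate(matches):
    print(query_idx, '->', match_idx, round(score, 4))
```

For this corpus the matches agree with the results printed above: documents 1 and 2 pair with each other, document 3 is closest to document 2, and document 4 is closest to document 3.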



## 2.2 Euclidean distance

- Given two vectors q and p:
<img src='Euclidean_distance.png' width='50%' />

In [37]:
e_similarity = 1 - distance.squareform(distance.pdist(tfidf_matrix, 'euclidean'))
e_similarity

array([[ 1.        ,  0.66261594, -0.28652976, -0.32772404],
       [ 0.66261594,  1.        , -0.23450363, -0.29331049],
       [-0.28652976, -0.23450363,  1.        , -0.27049802],
       [-0.32772404, -0.29331049, -0.27049802,  1.        ]])

In [38]:
# Euclidean distance: square root of the sum of squared differences between two vectors.
# similarity = 1 - Euclidean distance (can be negative, because the distance can exceed 1).
# Take first two vectors (documents) as an example.
1 - (sum((np.array(tfidf_matrix[0, :].flatten() - tfidf_matrix[1, :]).flatten())**2)**(1/2))

0.6626159359466615
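
Because every TF-IDF row here has unit Euclidean length, the Euclidean distance between two rows is tied to their cosine similarity by $\|a-b\|^2 = 2 - 2\cos(a,b)$. The sketch below checks this relation against the cosine scores reported earlier; `euclidean_from_cosine` is a hypothetical helper, and the identity holds only for unit-length vectors.

```python
import math

def euclidean_from_cosine(cos_sim):
    """Euclidean distance between two unit-length vectors with this cosine."""
    return math.sqrt(2 - 2 * cos_sim)

# Cosine similarities from Section 2.1: documents (1, 2) and documents (1, 3).
cos_docs_1_2 = 0.943086
cos_docs_1_3 = 0.17242059

sim_1_2 = 1 - euclidean_from_cosine(cos_docs_1_2)
sim_1_3 = 1 - euclidean_from_cosine(cos_docs_1_3)
print(round(sim_1_2, 4))
print(round(sim_1_3, 4))
```

The two values reproduce the corresponding entries of `e_similarity` to four decimal places. Since the distance shrinks as the cosine grows, both measures rank documents in the same order on normalized vectors.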

In [39]:
# Document searching - find similar documents.
for idx, doc in enumerate(corpus):
    print('Query document: {}'.format(doc))
    print('Similar document: {}'.format(corpus[np.argsort(e_similarity)[:,::-1][idx,1]]))
    print('Euclidean similarity score: {}'.format(round(e_similarity[idx,np.argsort(e_similarity)[:,::-1][idx,1]],4)))
    print()

Query document: Wise people think they are foolish.
Similar document: Foolish foolish people think they are wise wise.
Euclidean similarity score: 0.6626

Query document: Foolish foolish people think they are wise wise.
Similar document: Wise people think they are foolish.
Euclidean similarity score: 0.6626

Query document: I am definitely wise so this irritates me.
Similar document: Foolish foolish people think they are wise wise.
Euclidean similarity score: -0.2345

Query document: ABC is for sure like definitely foolish.
Similar document: I am definitely wise so this irritates me.
Euclidean similarity score: -0.2705

