This tutorial walks through the general steps of a Natural Language Processing (NLP) pipeline using the NLTK package in Python. To install NLTK, run **"pip install nltk"**.

If you are running NLTK for the first time, some modules may fail to import because their data is missing. Download the required data with **nltk.download("module_name")**. For example, download the stop word list with **nltk.download("stopwords")**.

Optionally, you can download all NLTK data at once with **nltk.download("all")**. This step takes a while, but you only need to do it once: the data stays on your machine afterwards.

In [33]:
from collections import Counter
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Data and Corpus

There are five documents related to NLP.

In [11]:
# documents
doc1 = 'By resolving the reference resolution in the design document, the shallow natural language processing technique increases the amount of design information content extracted'
doc2 = 'It is also stimulating cross-fertilisation of ideas between researchers in natural language processing, information retrieval and artificial intelligence'
doc3 = 'We discuss how these findings can be integrated into a natural language processing system'
doc4 = 'In this paper, we discuss how a natural language processing system can take advantage of this information to understand pronominal references to quantified expressions'
doc5 = 'Natural language processing has also provided some tools for computerised scoring of essays, particularly relevant in large-scale language testing programs'

# corpus, each element is a string (document)
documents = [doc1, doc2, doc3, doc4, doc5]

# documents

# Preprocessing

The preprocessing steps are tokenization, lowercasing, stop word removal, stemming, and lemmatization.

## Tokenize and Lower case

In [10]:
from nltk.tokenize import word_tokenize
import numpy as np

# lower and tokenize
tokenized = [word_tokenize(doc.lower()) for doc in documents]

# document 1
# tokenized[0]

## Remove Stop Words and Punctuation

In [12]:
from nltk.corpus import stopwords

# get stop words of English
stop = stopwords.words('english')

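# punctuation marks to drop as well (extend as needed);
# a raw string keeps the backslash without an invalid-escape warning
punctuations = list(r'''!()-[]{};:'"\,<>./?@#$%^&*_~''')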

stop = stop + punctuations

# remove stop words and punctuation
docs = [[word for word in words if word not in stop] 
        for words in tokenized]

# docs[0]

## Stemming and Lemmatization

In [13]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

# initialize stemmer and lemmatizer
porter = PorterStemmer()
wordnet = WordNetLemmatizer()

# stemming 
docs_stem = [[porter.stem(word) for word in words]
               for words in docs]

# lemmatization
docs_lemma = [[wordnet.lemmatize(word) for word in doc]
                for doc in docs]

In [14]:
print(porter.stem('mice'))
print(wordnet.lemmatize('mice'))

mice
mouse


## Vocabulary

Now let's create a vocabulary for our corpus.

In [21]:
# first, flatten docs_lemma into a single list of words
flatten_docs = [word for doc in docs_lemma for word in doc]

# use set to remove duplicates
vocabulary = sorted(set(flatten_docs))

# print (vocabulary)
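
A sorted vocabulary gives every word a fixed position, which the Bag of Words step relies on. The sketch below uses two made-up token lists rather than the tutorial's corpus. It builds the same kind of vocabulary and adds a word-to-index dictionary (word_to_index is a name introduced here for illustration) so that a word's position can be looked up without scanning the list.

```python
# Toy token lists standing in for docs_lemma (hypothetical data).
toy_docs = [
    ["natural", "language", "processing", "system"],
    ["language", "testing", "program"],
]

# Flatten the nested lists, then deduplicate and sort.
toy_flat = [word for doc in toy_docs for word in doc]
toy_vocabulary = sorted(set(toy_flat))

# Map each word to its position in the sorted vocabulary.
word_to_index = {word: i for i, word in enumerate(toy_vocabulary)}

print(toy_vocabulary)
print(word_to_index["processing"])
```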

# Bag of Words

Write your own Bag of Words function. Then compare with the function in sklearn.

In [147]:
from collections import Counter
import numpy as np

## Hands-On

In [46]:
def bow_vectorize(doc, vocabulary):
    """
    doc: one document
    vocabulary: vocabulary
    return: BOW vector
    """
    # initialize a zero vector with one entry per vocabulary word
    doc_vector = np.zeros(len(vocabulary))
    
    # word count 
    word_count = dict(Counter(doc))
    
    # update vector
    for i, w in enumerate(vocabulary):
        if w in word_count:
            doc_vector[i] = word_count[w]
    
    return doc_vector

bow_matrix = [bow_vectorize(doc, vocabulary) for doc in docs_lemma]

print("Bag of words for document 1: \n")
print(vocabulary)
print()
print(bow_matrix[0])

Bag of words for document 1: 

['advantage', 'also', 'amount', 'artificial', 'computerised', 'content', 'cross-fertilisation', 'design', 'discus', 'document', 'essay', 'expression', 'extracted', 'finding', 'idea', 'increase', 'information', 'integrated', 'intelligence', 'language', 'large-scale', 'natural', 'paper', 'particularly', 'processing', 'program', 'pronominal', 'provided', 'quantified', 'reference', 'relevant', 'researcher', 'resolution', 'resolving', 'retrieval', 'scoring', 'shallow', 'stimulating', 'system', 'take', 'technique', 'testing', 'tool', 'understand']

[0. 0. 1. 0. 0. 1. 0. 2. 0. 1. 0. 0. 1. 0. 0. 1. 1. 0. 0. 1. 0. 1. 0. 0.
 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0.]
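
Most entries in the vector above are zero, which is typical for Bag of Words vectors and is why scikit-learn returns a sparse matrix. As a rough illustration (not how scikit-learn stores its matrices internally), the sketch below keeps only the non-zero counts in a dictionary keyed by vocabulary index. The names bow_sparse and to_dense and the toy data are assumptions made for this example.

```python
from collections import Counter

def bow_sparse(doc, vocabulary):
    """Return {vocabulary index: count} for the words of doc found in vocabulary."""
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    counts = Counter(word for word in doc if word in word_to_index)
    return {word_to_index[word]: n for word, n in counts.items()}

def to_dense(sparse_vector, size):
    """Expand the sparse dictionary back into a plain list of counts."""
    dense = [0] * size
    for index, n in sparse_vector.items():
        dense[index] = n
    return dense

# Hypothetical toy data; "unknown" is not in the vocabulary and is ignored.
toy_vocabulary = ["design", "document", "language", "system"]
toy_doc = ["design", "document", "design", "language", "unknown"]

sparse_vector = bow_sparse(toy_doc, toy_vocabulary)
print(sparse_vector)
print(to_dense(sparse_vector, len(toy_vocabulary)))
```

The dense form has one slot per vocabulary word, while the sparse form grows only with the number of distinct words in the document.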


## CountVectorizer in Scikit-Learn

In scikit-learn, the sklearn.feature_extraction.text module provides both BOW (CountVectorizer) and TF-IDF (TfidfVectorizer).

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text

In [45]:
from sklearn.feature_extraction.text import CountVectorizer

# define tokenizer function, which will be used in CountVectorizer()
def lemmatize(doc):
    return [wordnet.lemmatize(word) for word in word_tokenize(doc.lower())]

# initialize CountVectorizer
count_vectorizer = CountVectorizer(stop_words=stopwords.words('english'),
                                   vocabulary=vocabulary,
                                   tokenizer=lemmatize)

# transform documents
feature_matrix = count_vectorizer.fit_transform(documents)

print("Bag of words for document 1: \n")
print(vocabulary)
print()
print(feature_matrix.toarray()[0])

Bag of words for document 1: 

['advantage', 'also', 'amount', 'artificial', 'computerised', 'content', 'cross-fertilisation', 'design', 'discus', 'document', 'essay', 'expression', 'extracted', 'finding', 'idea', 'increase', 'information', 'integrated', 'intelligence', 'language', 'large-scale', 'natural', 'paper', 'particularly', 'processing', 'program', 'pronominal', 'provided', 'quantified', 'reference', 'relevant', 'researcher', 'resolution', 'resolving', 'retrieval', 'scoring', 'shallow', 'stimulating', 'system', 'take', 'technique', 'testing', 'tool', 'understand']

[0 0 1 0 0 1 0 2 0 1 0 0 1 0 0 1 1 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0 1 1 0 0 1
 0 0 0 1 0 0 0]


# TF-IDF

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer

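# initialize a TF-IDF vectorizer
# note: no custom tokenizer is passed here, so the default tokenizer is used;
# inflected or hyphenated forms (e.g. "increases", "large-scale") do not
# match the lemmatized vocabulary and therefore get a weight of 0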
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'),
                                   vocabulary=vocabulary)
# fit documents
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# sparse matrix to dense
tfidf_matrix = tfidf_matrix.todense()

In [57]:
print("TFIDF for document 1: \n")
print(vocabulary)
print()
print(tfidf_matrix[0])

TFIDF for document 1: 

['advantage', 'also', 'amount', 'artificial', 'computerised', 'content', 'cross-fertilisation', 'design', 'discus', 'document', 'essay', 'expression', 'extracted', 'finding', 'idea', 'increase', 'information', 'integrated', 'intelligence', 'language', 'large-scale', 'natural', 'paper', 'particularly', 'processing', 'program', 'pronominal', 'provided', 'quantified', 'reference', 'relevant', 'researcher', 'resolution', 'resolving', 'retrieval', 'scoring', 'shallow', 'stimulating', 'system', 'take', 'technique', 'testing', 'tool', 'understand']

[[0.         0.         0.26603192 0.         0.         0.26603192
  0.         0.53206384 0.         0.26603192 0.         0.
  0.26603192 0.         0.         0.         0.17816468 0.
  0.         0.12676564 0.         0.12676564 0.         0.
  0.12676564 0.         0.         0.         0.         0.26603192
  0.         0.         0.26603192 0.26603192 0.         0.
  0.26603192 0.         0.         0.         0.266
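
With its default settings (smooth_idf=True, norm='l2'), TfidfVectorizer multiplies each raw term count by idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t, and then scales each row to unit Euclidean length. The sketch below reimplements that calculation in plain Python on a made-up corpus. The helper tfidf_rows and the toy data are hypothetical and not part of scikit-learn. The formula is consistent with the output above: 'amount' occurs in one of the five documents and 'language' in all five, and 0.26603192 / 0.12676564 ≈ 2.0986 = ln(6/2) + 1.

```python
import math
from collections import Counter

def tfidf_rows(docs, vocabulary):
    """TF-IDF with smoothed idf and L2 row normalisation (assumed defaults)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = {w: sum(1 for doc in docs if w in doc) for w in vocabulary}
    # Smoothed inverse document frequency.
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[w] * idf[w] for w in vocabulary]
        # Scale the row to unit Euclidean length (leave all-zero rows as is).
        norm = math.sqrt(sum(x * x for x in row))
        rows.append([x / norm if norm else 0.0 for x in row])
    return rows

# Hypothetical tokenized corpus.
toy_docs = [["language", "design", "design"], ["language", "system"]]
toy_vocabulary = ["design", "language", "system"]

for row in tfidf_rows(toy_docs, toy_vocabulary):
    print([round(x, 4) for x in row])
```

Words that appear in every document (here "language") get the smallest idf of 1, so rarer words dominate each row after normalisation.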

# Similarity Comparison

## Euclidean Distance

In [75]:
from sklearn.metrics.pairwise import euclidean_distances

In [78]:
# BOW
print(euclidean_distances(bow_matrix, bow_matrix))

[[0.         4.69041576 4.35889894 4.69041576 5.19615242]
 [4.69041576 0.         3.60555128 4.24264069 4.35889894]
 [4.35889894 3.60555128 0.         3.31662479 4.        ]
 [4.69041576 4.24264069 3.31662479 0.         4.79583152]
 [5.19615242 4.35889894 4.         4.79583152 0.        ]]


In [77]:
# TFIDF
print(euclidean_distances(tfidf_matrix, tfidf_matrix))

[[0.         1.32287677 1.32765676 1.33585782 1.3525131 ]
 [1.32287677 0.         1.2763048  1.28961429 1.24170497]
 [1.32765676 1.2763048  0.         1.1724174  1.25690387]
 [1.33585782 1.28961429 1.1724174  0.         1.33039625]
 [1.3525131  1.24170497 1.25690387 1.33039625 0.        ]]
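
The function euclidean_distances returns, for every pair of rows x and y, the distance sqrt(sum_i (x_i - y_i)^2). A minimal by-hand version, using two made-up count vectors rather than the tutorial's matrices, might look like this:

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Two toy count vectors (hypothetical data).
u = [2, 1, 0, 1]
v = [0, 1, 1, 1]

print(euclidean(u, v))  # sqrt(4 + 0 + 1 + 0) = sqrt(5)
print(euclidean(u, u))  # 0.0, like the zero diagonal in the matrices above
```

The BOW distances are larger than the TF-IDF ones because BOW rows hold raw counts, whereas TF-IDF rows are scaled to unit length, which caps the distance between two non-negative rows at sqrt(2).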


## Cosine Similarity

In [81]:
from sklearn.metrics.pairwise import cosine_similarity

In [82]:
# BOW
print(cosine_similarity(bow_matrix, bow_matrix))

[[1.         0.27216553 0.26726124 0.31497039 0.22866478]
 [0.27216553 1.         0.32732684 0.3086067  0.35007002]
 [0.26726124 0.32732684 1.         0.50507627 0.3666794 ]
 [0.31497039 0.3086067  0.50507627 1.         0.25928149]
 [0.22866478 0.35007002 0.3666794  0.25928149 1.        ]]
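
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them and ignores their magnitude. The sketch below computes it by hand on two made-up count vectors; the function name cosine is introduced here for illustration.

```python
import math

def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Two toy count vectors (hypothetical data).
u = [2, 1, 0, 1]
v = [0, 1, 1, 1]

print(cosine(u, v))             # 2 / (sqrt(6) * sqrt(3))
print(cosine(u, [4, 2, 0, 2]))  # doubling a vector leaves the angle unchanged
```

This is why a long and a short document that use the same words in the same proportions score close to 1, even though their Euclidean distance can be large.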


In [83]:
# TFIDF
print(cosine_similarity(tfidf_matrix, tfidf_matrix))

[[1.         0.12499852 0.11866376 0.10774194 0.08535416]
 [0.12499852 1.         0.18552303 0.16844749 0.22908439]
 [0.11866376 0.18552303 1.         0.31271873 0.21009633]
 [0.10774194 0.16844749 0.31271873 1.         0.11502291]
 [0.08535416 0.22908439 0.21009633 0.11502291 1.        ]]
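
Because the TF-IDF rows are scaled to unit length (the TfidfVectorizer default), the two measures carry the same information for that matrix: for unit vectors, distance^2 = 2 - 2 * cosine similarity. The check below plugs in the document 1 vs. document 2 values printed above; only the variable names are introduced here.

```python
import math

# Values copied from the TF-IDF outputs above (document 1 vs. document 2).
cos_12 = 0.12499852
dist_12 = 1.32287677

# For unit-length vectors: distance^2 = 2 - 2 * cosine similarity.
dist_from_cos = math.sqrt(2 - 2 * cos_12)

print(dist_from_cos)
print(abs(dist_from_cos - dist_12) < 1e-6)
```

The same identity does not hold for the BOW matrix, whose rows are raw counts of varying length.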
