# CS-6570 Lecture 25 - Introduction to Natural Language Processing (NLP)
**Dylan Zwick**

*Weber State University*

***Introduction to Natural Language Processing (NLP)***

Natural language processing (NLP) is a branch of data science concerned with analyzing, understanding, and deriving information from text data. With NLP one can organize massive chunks of text data and solve a wide range of problems, including automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation, to name a few.

But, before we dive into any of these, we'll need to explain some important terms, and (of course) import our favorite libraries. The terms are:

* *Tokenization* – process of converting a text into tokens
* *Tokens* – words or entities present in the text
* *Text object* – a sentence or a phrase or a word or an article

The libraries are:

In [34]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In addition, we'll grab the Natural Language Toolkit (NLTK):

In [109]:
import nltk
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to
[nltk_data]     /home/e930b992-a9bf-43e9-98f5-
[nltk_data]     95aaaf79bdea/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /home/e930b992-a9bf-43e9-98f5-
[nltk_data]     95aaaf79bdea/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/e930b992-a9bf-43e9-98f5-
[nltk_data]     95aaaf79bdea/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

***Text Preprocessing***

Text data is some of the most unstructured data available. Many types of noise are typically present in it and the data is usually not readily analyzable. The entire process of cleaning and standardizing text - making it noise-free and ready for analysis - is known as text preprocessing.

It consists mainly of three steps:

1. Noise Removal
2. Lexicon Normalization
3. Object Standardization

**Noise Removal**

Any piece of text which is not relevant to the context of the data and the desired output can be viewed as noise.

Examples include language stopwords (commonly used words like "is", "am", "the", "of", "in", etc.), URLs or links, social media entities (mentions, hashtags), punctuation, and industry-specific words.

A general approach for noise removal is to prepare a dictionary of noisy entities and iterate over the text object token by token (or word by word), eliminating those tokens which are present in the noise dictionary.

The following is an example of how we can do this with Python:

In [48]:
# Sample code to remove noisy words from a text

noise_list = ["is", "a", "this", "..."] 

def _remove_noise(input_text):
    words = input_text.split() 
    noise_free_words = [word for word in words if word not in noise_list] 
    noise_free_text = " ".join(noise_free_words) 
    return noise_free_text

print(_remove_noise("this is a sample text"))

sample text


Another way would be to use regular expressions, which we haven't covered but maybe we should have, and will probably cover next semester.
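As a preview, here is a minimal sketch of regex-based noise removal using Python's built-in `re` module. The three patterns and the function name `remove_noise_regex` are our own illustrative choices, not a standard recipe, and they are deliberately crude (for instance, the mention/hashtag pattern would also clip the tail of an email address).

```python
import re

# Illustrative patterns (assumptions, tune them for real data)
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")   # links
SOCIAL_PATTERN = re.compile(r"[@#]\w+")              # mentions and hashtags
PUNCT_PATTERN = re.compile(r"[^\w\s]")               # anything that is not a word char or space

def remove_noise_regex(input_text):
    """Strip URLs, mentions/hashtags, and punctuation, then tidy whitespace."""
    text = URL_PATTERN.sub(" ", input_text)
    text = SOCIAL_PATTERN.sub(" ", text)
    text = PUNCT_PATTERN.sub(" ", text)
    # split/join collapses the runs of spaces left behind by the substitutions
    return " ".join(text.split())

print(remove_noise_regex("Check this out: https://example.com #nlp @someone, really great!"))
# prints: Check this out really great
```

Note that the URL pattern runs first; otherwise the punctuation pattern would break the link into pieces before it could be matched.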

**Lexicon Normalization**

Another type of textual noise concerns multiple representations of a single word.

For example, “play”, “player”, “played”, “plays” and “playing” are different variations of the word “play”. They differ in meaning but are contextually similar. This step converts all the versions of a word into a normalized form (also known as a lemma). Normalization is a pivotal step for feature engineering with text, as it converts high-dimensional features (N different features) to a lower-dimensional representation (1 feature).

The most common lexicon normalization practices are:

* *Stemming*:  Stemming is a rudimentary rule-based process of stripping the suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.
* *Lemmatization*: Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of the word. It makes use of vocabulary (the dictionary forms of words) and morphological analysis (word structure and grammar relations).

Below is sample code that performs lemmatization and stemming using NLTK.

In [71]:
from nltk.stem.porter import PorterStemmer 
stem = PorterStemmer()

word = "multiplying" 

print('Stemming:\n')
print(stem.stem(word))

from nltk.stem.wordnet import WordNetLemmatizer 
lem = WordNetLemmatizer()

print('Lemmatization:\n')
print(lem.lemmatize(word))

Stemming:

multipli
Lemmatization:

multiplying


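Note that the "[Porter stemmer](https://www.nltk.org/_modules/nltk/stem/porter.html)" is a popular stemming algorithm. Information about lemmatization can be found [here](https://www.nltk.org/_modules/nltk/stem/wordnet.html). The lemmatizer returned "multiplying" unchanged because `lemmatize` treats its input as a noun unless told otherwise; calling `lem.lemmatize(word, pos="v")` returns "multiply".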
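To see the difference between the two ideas without any library, here is a toy sketch. It is *not* the Porter algorithm or WordNet: `naive_stem` just strips the suffixes listed earlier, and `naive_lemmatize` uses a tiny hand-made lookup table (both names and the table contents are made up for illustration).

```python
# Suffixes from the stemming description above, checked in this order
SUFFIXES = ("ing", "ly", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# A hypothetical mini "vocabulary": real lemmatizers use a full dictionary
LEMMA_LOOKUP = {"played": "play", "plays": "play", "playing": "play", "better": "good"}

def naive_lemmatize(word):
    """Return the dictionary form if we know it, else the word unchanged."""
    return LEMMA_LOOKUP.get(word, word)

for w in ["playing", "plays", "quickly", "better"]:
    print(w, "->", naive_stem(w), "/", naive_lemmatize(w))
```

A rule-based stemmer has no way of mapping "better" to "good"; that takes vocabulary knowledge, which is what lemmatization brings to the table.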

**Object Standardization**

Text data often contains words or phrases which are not present in any standard lexical dictionaries. These pieces are not recognized by search engines and models.

Some examples are acronyms, hashtags with attached words, and colloquial slang. With the help of regular expressions and manually prepared data dictionaries, this type of noise can be fixed.

For example, the code below uses a dictionary lookup method to replace social media slang in a text.

In [81]:
lookup_dict = {"rt" : "Retweet", "dm" : "direct message", "awsm" : "awesome", "luv" :"love"}
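def _lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        # Replace the word if its lowercase form is a known slang term
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    # Join once, after the loop, so empty input also returns a string
    new_text = " ".join(new_words)
    return new_text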

_lookup_words("RT this is a retweeted tweet by Shivam Bansal")

'Retweet this is a retweeted tweet by Shivam Bansal'

Apart from the three steps discussed so far, other types of text preprocessing include handling encoding-decoding noise, grammar checking, and spelling correction.

***Text to Features (Feature Engineering on text data)***

Before preprocessed data can be analysed, it needs to be converted into features. Depending on the use case, text features can be constructed using assorted techniques including syntactic parsing, entities / n-grams / word-based features, statistical features, and word embeddings.

**Syntactic Parsing**

Syntactic parsing involves the analysis of words in a sentence for grammar and their arrangement in a manner that shows the relationships among the words. Dependency grammar and part of speech tags are the important attributes of text syntax.

* *Dependency Trees* – Sentences are composed of words stitched together. The relationship among the words in a sentence is determined by the basic dependency grammar. Dependency grammar is a class of syntactic text analysis that deals with (labeled) asymmetrical binary relations between two lexical items (words). Every relation can be represented in the form of a triplet (relation, governor, dependent). For example, consider the sentence “Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas.” The relationship among the words can be observed in the form of a tree representation as shown:

<center>
    <img src = "https://drive.google.com/uc?export=view&id=1iO_P7xe1rTarVSc920Ep2qx9Vl-846Yk">
</center>

The tree shows that “submitted” is the root word of this sentence and is linked to two subtrees (the subject and object subtrees). Each subtree is itself a dependency tree with relations such as (“Bills” <-> “ports” by a “preposition” relation) and (“ports” <-> “immigration” by a “conjunction” relation).

This type of tree, when parsed recursively in a top-down manner, gives grammar relation triplets as output, which can be used as features for many NLP problems such as entity-wise sentiment analysis, actor and entity identification, and text classification. The Python wrapper [StanfordCoreNLP](http://stanfordnlp.github.io/CoreNLP/) (by the Stanford NLP Group) and NLTK dependency grammars can be used to generate dependency trees.
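To make the "parse recursively, top-down, and emit triplets" idea concrete, here is a small sketch. The tree is typed in by hand for part of the example sentence, and the relation labels (`nsubjpass`, `prep_on`, and so on) are illustrative assumptions rather than the output of any real parser.

```python
# A hand-built (hypothetical) dependency tree: each node has a word and a
# list of (relation, child_node) pairs.
tree = {
    "word": "submitted",
    "children": [
        ("nsubjpass", {"word": "Bills", "children": [
            ("prep_on", {"word": "ports", "children": [
                ("conj_and", {"word": "immigration", "children": []}),
            ]}),
        ]}),
        ("auxpass", {"word": "were", "children": []}),
        ("agent", {"word": "Brownback", "children": []}),
    ],
}

def triplets(node):
    """Walk the tree top-down, yielding (relation, governor, dependent)."""
    result = []
    for relation, child in node["children"]:
        result.append((relation, node["word"], child["word"]))
        result.extend(triplets(child))  # recurse into the subtree
    return result

for t in triplets(tree):
    print(t)
```

Each printed triplet could then be used as a feature, for example by joining its parts into a single string token.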

* *Part of speech tagging* – Apart from the grammar relations, every word in a sentence is also associated with a part of speech (POS) tag (nouns, verbs, adjectives, adverbs, etc.). The POS tags define the usage and function of a word in the sentence. The following code uses Python's NLTK to perform POS tagging on input text:

In [111]:
from nltk import word_tokenize, pos_tag
text = "I am learning Natural Language Processing on Analytics Vidhya"
tokens = word_tokenize(text)
print(pos_tag(tokens))

[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP'), ('on', 'IN'), ('Analytics', 'NNP'), ('Vidhya', 'NNP')]


Part of Speech tagging is used for many important purposes in NLP:

* *Word sense disambiguation*: Some words have multiple meanings depending on their usage. For example, in the two sentences below:

    I. “Please book my flight for Delhi”

    II. “I am going to read this book in the flight”

    “Book” is used in different contexts, and the part of speech tag differs between the two cases. In sentence I, the word “book” is used as a verb, while in II it is used as a noun. (The Lesk algorithm is also used for similar purposes.)

* *Improving word-based features*: A learning model can learn different contexts of a word when words are used as features; however, if the part of speech tag is attached to each word, the context is preserved, making for stronger features. For example:

    Sentence – “book my flight, I will read this book”

    Tokens – (“book”, 2), (“my”, 1), (“flight”, 1), (“I”, 1), (“will”, 1), (“read”, 1), (“this”, 1)

    Tokens with POS – (“book_VB”, 1), (“my_PRP$”, 1), (“flight_NN”, 1), (“I_PRP”, 1), (“will_MD”, 1), (“read_VB”, 1), (“this_DT”, 1), (“book_NN”, 1)

* *Normalization and Lemmatization*: POS tags are the basis of lemmatization process for converting a word to its base form (lemma).

* *Efficient stopword removal*: POS tags are also useful in efficient removal of stopwords.

    For example, there are some tags which always mark the low-frequency / less important words of a language, such as (IN – “within”, “upon”, “except”), (CD – “one”, “two”, “hundred”), (MD – “may”, “must”, etc.).
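Two of the uses listed above (word-based features with POS tags attached, and tag-based stopword removal) can be sketched in a few lines. To keep the example self-contained, the tagged tokens are typed in by hand from the "book my flight" example rather than produced by `pos_tag`, and the set of "low value" tags is an assumption based on the tags mentioned above.

```python
from collections import Counter

# (word, tag) pairs copied from the example sentence above
tagged = [("book", "VB"), ("my", "PRP$"), ("flight", "NN"), ("I", "PRP"),
          ("will", "MD"), ("read", "VB"), ("this", "DT"), ("book", "NN")]

# Plain word counts merge the verb and noun uses of "book"
word_counts = Counter(word for word, _ in tagged)

# Attaching the tag keeps the two uses apart
pos_counts = Counter(f"{word}_{tag}" for word, tag in tagged)

# Tag-based stopword removal: drop prepositions, numbers, and modals
LOW_VALUE_TAGS = {"IN", "CD", "MD"}
filtered = [word for word, tag in tagged if tag not in LOW_VALUE_TAGS]

print(word_counts["book"])                         # prints: 2
print(pos_counts["book_VB"], pos_counts["book_NN"])  # prints: 1 1
print(filtered)
```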

**Entity Extraction (Entities as features)**

Entities are defined as the most important chunks of a sentence – noun phrases, verb phrases or both. Entity detection algorithms are generally ensemble models of rule-based parsing, dictionary lookups, POS tagging, and dependency parsing. Entity detection is applied in automated chat bots, content analyzers, and consumer insights.

* *Named Entity Recognition (NER)*:
    The process of detecting named entities such as person names, location names, company names, etc. in text is called NER. For example:

    Sentence – Sergey Brin, the manager of Google Inc. is walking in the streets of New York.

    Named Entities – (“person” : “Sergey Brin”), (“org” : “Google Inc.”), (“location” : “New York”)

    A typical NER model consists of three blocks:

    *Noun phrase identification*: This step deals with extracting all the noun phrases from a text using dependency parsing and part of speech tagging.

    *Phrase classification*: This is the classification step in which all the extracted noun phrases are classified into their respective categories (locations, names, etc.). The Google Maps API provides a good way to disambiguate locations, and open databases such as DBpedia and Wikipedia can be used to identify person names or company names. Apart from this, one can curate lookup tables and dictionaries by combining information from different sources.

    *Entity disambiguation*: Entities are sometimes misclassified, so a validation layer on top of the results is useful. Knowledge graphs can be used for this purpose. Popular knowledge graphs include Google Knowledge Graph, IBM Watson, and Wikipedia.

* *Topic modeling*:  
    Topic modeling is the process of automatically identifying the topics present in a text corpus; it derives the hidden patterns among the words in the corpus in an unsupervised manner. Topics are defined as “a repeating pattern of co-occurring terms in a corpus”. A good topic model yields “health”, “doctor”, “patient”, “hospital” for the topic “Healthcare”, and “farm”, “crops”, “wheat” for the topic “Farming”.

    Latent Dirichlet Allocation (LDA) is the most popular topic modelling technique. The following code implements topic modeling using LDA in Python:

In [127]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father." 
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc_complete = [doc1, doc2, doc3]
doc_clean = [doc.split() for doc in doc_complete]

import gensim 
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.  
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above. 
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and Training LDA model on the document term matrix
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

# Results 
print(ldamodel.print_topics())

[(0, '0.083*"to" + 0.058*"My" + 0.058*"sister" + 0.058*"my" + 0.033*"likes" + 0.033*"sugar," + 0.033*"not" + 0.033*"Sugar" + 0.033*"but" + 0.033*"have"'), (1, '0.029*"driving" + 0.029*"time" + 0.029*"of" + 0.029*"father" + 0.029*"around" + 0.029*"dance" + 0.029*"practice." + 0.029*"spends" + 0.029*"a" + 0.029*"lot"'), (2, '0.060*"driving" + 0.060*"cause" + 0.060*"Doctors" + 0.060*"and" + 0.060*"that" + 0.060*"blood" + 0.060*"increased" + 0.060*"may" + 0.060*"pressure." + 0.060*"stress"')]
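Notice that the topics above are cluttered with stopwords ("to", "my"), case variants ("My" and "my"), and tokens with punctuation stuck to them ("sugar,", "pressure."). That is because `doc_clean` was built with a bare `split()`. Here is a minimal sketch of a cleaning step that could be applied first; the `clean` function and the tiny stopword set are our own illustrative assumptions, and a real project would use a fuller stopword list.

```python
import string

# A small hand-picked stopword set (an assumption for this example only)
STOPWORDS = {"is", "to", "my", "a", "of", "that", "and", "but", "not", "have", "may", "the"}

def clean(doc):
    """Lowercase, strip punctuation, and drop stopwords; returns a token list."""
    no_punct = doc.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in STOPWORDS]

doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
print(clean(doc1))
# prints: ['sugar', 'bad', 'consume', 'sister', 'likes', 'sugar', 'father']
```

Using `[clean(doc) for doc in doc_complete]` in place of the `split()` version would feed the LDA model only these content words.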


**Statistical Features**

Text data can also be quantified directly into numbers using several techniques described in this section:

* *Term Frequency – Inverse Document Frequency (TF – IDF)*
    TF-IDF is a weighted model commonly used for information retrieval problems. It aims to convert text documents into vector models on the basis of the occurrence of words in the documents, without considering their exact ordering. For example, say there is a dataset of N text documents. In any document “D”, TF and IDF are defined as follows:

    * Term Frequency (TF) – TF for a term “t” is defined as the count of the term “t” in a document “D”.

    * Inverse Document Frequency (IDF) – IDF for a term “t” is defined as the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the term “t”.

    * TF-IDF – The TF-IDF formula gives the relative importance of a term in a corpus (list of documents) and is shown in the image below. The code after it uses Python’s scikit-learn package to convert a text into TF-IDF vectors:
    
<center>
    <img src = "https://drive.google.com/uc?export=view&id=1rj9FHRPkKYinsxN6x3SPBiZZ85ZZC-S6">
</center>

In [135]:
from sklearn.feature_extraction.text import TfidfVectorizer
obj = TfidfVectorizer()
corpus = ['This is sample document.', 'another random document.', 'third sample document text']
X = obj.fit_transform(corpus)
print(X)

  (0, 1)	0.34520501686496574
  (0, 4)	0.444514311537431
  (0, 2)	0.5844829010200651
  (0, 7)	0.5844829010200651
  (1, 3)	0.652490884512534
  (1, 0)	0.652490884512534
  (1, 1)	0.3853716274664007
  (2, 5)	0.5844829010200651
  (2, 6)	0.5844829010200651
  (2, 1)	0.34520501686496574
  (2, 4)	0.444514311537431


The model creates a vocabulary dictionary and assigns an index to each word. Each row in the output contains a tuple (i,j) and a tf-idf value of word at index j in document i.
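To see where these numbers come from, here is a sketch that recomputes them by hand using only the standard library. It assumes scikit-learn's default settings as we understand them: lowercasing, tokens of two or more word characters, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and L2 normalization of each document vector. The helper names are our own.

```python
import math
import re
from collections import Counter

corpus = ['This is sample document.', 'another random document.', 'third sample document text']

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(d) for d in corpus]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(doc):
    """Return {term: weight} for one document, L2-normalized."""
    tf = Counter(doc)
    weights = {term: tf[term] * idf[term] for term in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf(doc) for doc in docs]
for term, weight in sorted(vectors[0].items()):
    print(term, round(weight, 4))
```

The word "document" appears in every document, so it gets the smallest weight (about 0.3452 in the first document), matching the value at index 1 in the output above.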

**Word Embedding (text vectors)**

Word embedding is the modern way of representing words as vectors. The aim of word embedding is to map high-dimensional word features into low-dimensional feature vectors while preserving the contextual similarity in the corpus. Embeddings are widely used in deep learning models such as Convolutional Neural Networks and Recurrent Neural Networks.

[Word2Vec](https://code.google.com/archive/p/word2vec/) and [GloVe](http://nlp.stanford.edu/projects/glove/) are two popular models for creating word embeddings of a text. These models take a text corpus as input and produce word vectors as output.

The Word2Vec model is composed of a preprocessing module, a shallow neural network model called Continuous Bag of Words, and another shallow neural network model called skip-gram. These models are widely used in many other NLP problems. Word2Vec first constructs a vocabulary from the training corpus and then learns word embedding representations. The following code uses the gensim package to prepare word embeddings as vectors:

In [139]:
from gensim.models import Word2Vec
sentences = [['data', 'science'], ['vidhya', 'science', 'data', 'analytics'],['machine', 'learning'], ['deep', 'learning']]

# train the model on your corpus  
model = Word2Vec(sentences, min_count = 1)

# Similarity queries live on the trained word vectors (model.wv),
# not on the model object itself. The value varies from run to run.
print(model.wv.similarity('data', 'science'))

***Important Tasks of NLP***

This section talks about different use cases and problems in the field of natural language processing.

**Text Classification**

Text classification is one of the classical problems of NLP. Well-known examples include email spam identification, topic classification of news, sentiment classification, and organization of web pages by search engines.

Text classification, in plain terms, is a technique for systematically assigning a text object (document or sentence) to one of a fixed set of categories. It is especially helpful when the amount of data is large, for organizing, information filtering, and storage purposes.

A typical natural language classifier consists of two parts: (a) training and (b) prediction. First, the text input is processed and features are created. The machine learning model then learns from these features and is used to make predictions on new text.

Text classification models depend heavily on the quality and quantity of features, and when applying any machine learning model it is always good practice to include more training data.
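To make the training and prediction steps concrete, here is a minimal sketch of a text classifier. It assumes a multinomial Naive Bayes model with add-one smoothing, which is our choice for illustration rather than something prescribed above; the class name and the tiny spam/ham dataset are made up.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTextClassifier:
    """Toy multinomial Naive Bayes over whitespace-separated words."""

    def fit(self, texts, labels):
        # Training: count documents per class and words per class
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        # Prediction: pick the class with the highest log-probability
        words = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # class prior
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                # Add-one smoothing avoids log(0) for unseen words
                score += math.log((count + 1) / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

train_texts = ["win money now", "free money offer", "meeting at noon", "lunch at noon tomorrow"]
train_labels = ["spam", "spam", "ham", "ham"]
clf = NaiveBayesTextClassifier().fit(train_texts, train_labels)

print(clf.predict("free money"))    # prints: spam
print(clf.predict("noon meeting"))  # prints: ham
```

Here the "features" are simply word counts; any of the feature types from the previous section (TF-IDF weights, POS-tagged tokens, embeddings) could be fed to a classifier instead.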

**Text Matching / Similarity**

One of the important areas of NLP is the matching of text objects to find similarities. Important applications of text matching include automatic spelling correction, data de-duplication, and genome analysis.

A number of text matching techniques are available depending upon the requirement. This section describes the important techniques in detail.

A. Levenshtein Distance – The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It can be computed with dynamic programming using memory proportional to the length of the shorter string.
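Here is one way the Levenshtein distance could be implemented with memory efficiency in mind. This sketch uses the standard dynamic programming approach but keeps only two rows of the table at a time; the function name is our own.

```python
def levenshtein(s1, s2):
    """Minimum number of single-character edits turning s1 into s2."""
    # Make s2 the shorter string so the rows we store are as short as possible
    if len(s1) < len(s2):
        s1, s2 = s2, s1

    # previous[j] = distance between the empty prefix of s1 and s2[:j]
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # distance between s1[:i] and the empty string
        for j, c2 in enumerate(s2, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (c1 != c2)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # prints: 3
```

The three edits in this example are k→s, e→i, and inserting a final "g".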

B. Phonetic Matching – A phonetic matching algorithm takes a keyword as input (a person’s name, location name, etc.) and produces a character string that identifies a set of words that are (roughly) phonetically similar. It is very useful for searching large text corpora, correcting spelling errors, and matching relevant names. Soundex and Metaphone are the two main phonetic algorithms used for this purpose. Python’s Fuzzy module can be used to compute Soundex strings for different words.
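To show what a Soundex code looks like, here is a hand-written sketch of the classic American Soundex rules in plain Python. It does not use the Fuzzy module, and it is a simplified illustration rather than a reference implementation: keep the first letter, encode the remaining consonants as digits, skip vowels, collapse repeated digits, and pad or trim to four characters.

```python
def soundex(name):
    """Return a four-character Soundex code such as 'R163'."""
    # Build the letter-to-digit table
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0].upper()
    previous = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w are ignored and do not separate equal codes
        digit = codes.get(ch, "")  # vowels (and y) map to ""
        if digit and digit != previous:
            result += digit
        previous = digit  # a vowel resets this, so a repeat after it counts
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # prints: R163 R163
```

"Robert" and "Rupert" are spelled differently but receive the same code, which is what makes this kind of matching useful for names.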

C. Flexible String Matching – A complete text matching system pipelines different algorithms together to handle a variety of text variations. Regular expressions are helpful for this purpose as well. Other common techniques include exact string matching, lemmatized matching, and compact matching (which takes care of spaces, punctuation, slang, etc.).

D. Cosine Similarity – When text is represented in vector notation, cosine similarity can also be applied to measure the similarity of the vectors. A common approach converts each text to a term-frequency vector and takes the cosine of the angle between the two vectors as the closeness score.
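Here is a minimal sketch of that approach using only the standard library. Each text becomes a term-frequency vector stored in a `Counter`, and the similarity is the dot product divided by the product of the vector lengths. The function names and the whitespace tokenization are our own simplifying assumptions.

```python
import math
from collections import Counter

def text_to_vector(text):
    """Term-frequency vector: word -> count (lowercased, split on whitespace)."""
    return Counter(text.lower().split())

def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two term-frequency vectors."""
    common = set(vec1) & set(vec2)
    dot = sum(vec1[w] * vec2[w] for w in common)
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm1 * norm2)

v1 = text_to_vector("This is a foo bar sentence")
v2 = text_to_vector("This sentence is similar to a foo bar sentence")
print(cosine_similarity(v1, v2))
```

Scores range from 0 (no words in common) to 1 (identical word proportions), regardless of how long the two texts are.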

**Coreference Resolution**

Coreference resolution is the process of finding relational links among the words (or phrases) within sentences. Consider an example: “Donald went to John’s office to see the new table. He looked at it for an hour.”

Humans can quickly figure out that “he” denotes Donald (and not John), and that “it” denotes the table (and not John’s office). Coreference resolution is the component of NLP that does this job automatically. It is used in document summarization, question answering, and information extraction. Stanford CoreNLP provides coreference resolution, and Python wrappers for it are available.

**Other NLP Problems / Tasks**

* *Text Summarization* – Given a text article or paragraph, summarize it automatically to produce the most important and relevant sentences in order.
* *Machine Translation* – Automatically translate text from one human language to another, taking care of grammar, semantics, information about the real world, etc.
* *Natural Language Generation and Understanding* – Converting information from computer databases or semantic intents into readable human language is called language generation. Converting chunks of text into more logical structures that are easier for computer programs to manipulate is called language understanding.
* *Optical Character Recognition* – Given an image representing printed text, determine the corresponding text.
* *Document to Information* – This involves parsing textual data present in documents (websites, files, PDFs, and images) into an analyzable and clean format.