# Natural Language Processing - Examples

There are many Python packages for NLP, but mastering a handful of them covers the important bases.

   * NLTK Book: https://www.nltk.org/
   * TextBlob: https://textblob.readthedocs.io/en/dev/index.html
   * spaCy: https://spacy.io/

In addition to these, there are a few other libraries, such as Gensim and Stanford’s CoreNLP, that can be explored as well.

<a id='1'></a>
# 1. Load libraries and Packages

In [1]:
!pip install nltk==3.4
!pip install textblob==0.15.3
!pip install gensim==3.8.2
!pip install -U SpaCy==2.2.0
!python -m spacy download en_core_web_lg

Collecting nltk==3.4
  Downloading nltk-3.4.zip (1.4 MB)
  Preparing metadata (setup.py) ... done
Collecting singledispatch (from nltk==3.4)
  Downloading singledispatch-4.1.0-py2.py3-none-any.whl.metadata (3.8 kB)
Downloading singledispatch-4.1.0-py2.py3-none-any.whl (6.7 kB)
Building wheels for collected packages: nltk
  Building wheel for nltk (setup.py) ... done
  Created wheel for nltk: filename=nltk-3.4-py3-none-any.whl size=1436383 sha256=8c0a6270b32c54d2d539d268fcd7ca93cc6893addc3b14fd1cd6df40323a1b16
  Stored in directory: /root/.cache/pip/wheels/30/cb/3e/ae17c28eb286abfcd886fdab69f9533bb060dc0a74f3b41d0d
Successfully built nltk
Installing collected packages: singledispatch, nltk
  Attempting uninstall: nltk
    Found existing installation: nltk 3.8.1
    Uninstalling nltk-3.8.1:
      Successfully uninstalled nltk-3.8.1
Successfu

In [2]:
import nltk
import nltk.data
nltk.download('punkt')
from textblob import TextBlob
import spacy
#Run the command python -m spacy download en_core_web_lg to download this
import en_core_web_lg
nlp = en_core_web_lg.load()

#Other helper packages
import pandas as pd
import numpy as np

#Download nltk data libraries. All can be downloaded by using nltk.download('all')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [3]:
#Disable the warnings
import warnings
warnings.filterwarnings('ignore')

<a id='2'></a>
# 2. Preprocessing

<a id='2.1'></a>
## 2.1. Tokenization
Tokenization is the process of splitting a text string into a list of tokens, i.e. the units (usually words) that we want to work with. A sentence tokenizer splits a text into sentences, and a word tokenizer splits a string into words.

In [4]:
#Text to tokenize
text = "This is a tokenize test"

### NLTK

The NLTK data package includes a pre-trained Punkt tokenizer for English, which was already downloaded above.

In [5]:
from nltk.tokenize import word_tokenize
word_tokenize(text)

['This', 'is', 'a', 'tokenize', 'test']

### TextBlob

In [None]:
TextBlob(text).words

WordList(['This', 'is', 'a', 'tokenize', 'test'])
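
To make the distinction between sentence and word tokenization concrete, here is a minimal sketch that uses only Python's `re` module. The regular expressions, the helper names, and the `sample` text are illustrative assumptions. This is a naive stand-in, not the Punkt algorithm that NLTK uses, and it will mis-split abbreviations such as "U.K.".

```python
import re

def split_sentences(text):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    # Runs of word characters become tokens; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", sentence)

sample = "This is a tokenize test. It has two sentences."
sentences = split_sentences(sample)
words = split_words(sentences[0])
print(sentences)
print(words)
```

In practice a trained tokenizer is preferable, since it handles abbreviations, decimals, and contractions that simple rules like these get wrong.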

<a id='2.2'></a>
## 2.2. Stop Words Removal

Extremely common words that add little value when selecting documents that match a user's need are sometimes excluded from the vocabulary entirely. These words are called stop words. The code for removing stop words using the NLTK library is shown below:

### NLTK

We first load NLTK's default list of English stop words, stopwords.words('english'), and store it as a set in the stop_words variable. Next, we iterate through each token in the input text and drop it if it appears in that set.

In [6]:
text = "S&P and NASDAQ are the two most popular indices in US"

In [7]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english'))
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if word not in stop_words]

print(tokens_without_sw)

['S', '&', 'P', 'NASDAQ', 'two', 'popular', 'indices', 'US']


As we can see, stop words such as "and", "are", "the", "most", and "in" are removed from the sentence.
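
One detail worth noting is that the membership test in the cell above is case-sensitive, while NLTK's stop words are all lowercase, so a capitalized "The" at the start of a sentence would survive. The sketch below shows a case-insensitive variant. The `STOP_WORDS` set is a small hand-picked assumption for illustration (NLTK's list is much longer), and the regex tokenizer is a simple stand-in for `word_tokenize`.

```python
import re

# Hypothetical, hand-picked stop word set for illustration only.
STOP_WORDS = {"a", "and", "are", "in", "is", "most", "of", "the"}

def remove_stop_words(text):
    # Simple stand-in tokenizer: words, plus each punctuation mark on its own.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Compare in lowercase so "The" and "the" are treated alike,
    # but return the tokens in their original casing.
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]

filtered = remove_stop_words(
    "The S&P and NASDAQ are the two most popular indices in US"
)
print(filtered)
```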

<a id='2.3'></a>
## 2.3. Stemming
Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, which is generally a written word form but need not be a valid word. For example, stemming the words “Stems”, “Stemming”, and “Stemmed” yields the single stem “stem”.

In [8]:
text = "It's a Stemming testing"

### NLTK

In [9]:
parsed_text = word_tokenize(text)

In [10]:
# Initialize stemmer.
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')

# Pair each token with its stem, keeping only the tokens the stemmer changed.
[(word, stemmer.stem(word)) for word in parsed_text
 if word.lower() != stemmer.stem(word)]

[('Stemming', 'stem'), ('testing', 'test')]
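
To give a feel for what a rule-based stemmer does, here is a deliberately tiny suffix-stripping sketch. It is not the Snowball algorithm used above: the suffix list, the minimum stem length, and the `toy_stem` name are all assumptions chosen for illustration, and the rules will fail on many real words.

```python
def toy_stem(word):
    """Strip one common English suffix; a toy illustration, not Snowball."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            break
    # Undo consonant doubling, e.g. "stemm" -> "stem" (keep "ll" and "ss").
    if len(w) >= 2 and w[-1] == w[-2] and w[-1] not in "aeiouls":
        w = w[:-1]
    return w

stems = [toy_stem(w) for w in ["Stems", "Stemming", "Stemmed", "testing"]]
print(stems)
```

Real stemmers such as Snowball apply many more ordered rules with conditions on where in the word a suffix may be removed.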

<a id='2.4'></a>
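## 2.4. Lemmatization

Lemmatization reduces a word to its dictionary form (its lemma) using a vocabulary and morphological analysis, rather than by stripping suffixes as a stemmer does.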

### TextBlob

In [11]:
text = "This world has a lot of faces "

In [12]:
from textblob import Word
parsed_data= TextBlob(text).words
parsed_data

WordList(['This', 'world', 'has', 'a', 'lot', 'of', 'faces'])

In [13]:
[(word, word.lemmatize()) for word in parsed_data
 if word != word.lemmatize()]

[('has', 'ha'), ('faces', 'face')]

Note that lemmatize() treats every word as a noun unless a part of speech is given, which is why 'has' is reduced to 'ha'. Passing the part of speech, for example Word('has').lemmatize('v'), returns 'have'.

<a id='2.5'></a>
## 2.5. POS Tagging

Part-of-speech (POS) tagging assigns each token a grammatical category, such as noun, verb, or preposition, based on the word itself and its context in the sentence.

In [14]:
text = 'Google is looking at buying U.K. startup for $1 billion'

### TextBlob

In [15]:
TextBlob(text).tags

[('Google', 'NNP'),
 ('is', 'VBZ'),
 ('looking', 'VBG'),
 ('at', 'IN'),
 ('buying', 'VBG'),
 ('U.K.', 'NNP'),
 ('startup', 'NN'),
 ('for', 'IN'),
 ('1', 'CD'),
 ('billion', 'CD')]

This is a list of tokens with their corresponding Part-of-Speech (POS) tags.

('Google', 'NNP'): Google is a proper noun.

('is', 'VBZ'): is is a verb, third person singular present.

('looking', 'VBG'): looking is a verb, gerund or present participle.

('at', 'IN'): at is a preposition.

## spaCy - doing it all at once

In [16]:
text = 'Google is looking at buying U.K. startup for $1 billion'
doc = nlp(text)

In [17]:
pd.DataFrame([[t.text, t.is_stop, t.lemma_, t.pos_]
              for t in doc],
             columns=['Token', 'is_stop_word','lemma', 'POS'])

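|   | Token | is_stop_word | lemma | POS |
|---|---|---|---|---|
| 0 | Google | False | Google | PROPN |
| 1 | is | True | be | AUX |
| 2 | looking | False | look | VERB |
| 3 | at | True | at | ADP |
| 4 | buying | False | buy | VERB |
| 5 | U.K. | False | U.K. | PROPN |
| 6 | startup | False | startup | NOUN |
| 7 | for | True | for | ADP |
| 8 | $ | False | $ | SYM |
| 9 | 1 | False | 1 | NUM |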


<a id='2.6'></a>
## 2.6. Named Entity Recognition

In [19]:
text = 'Google is looking at buying U.K. startup for $1 billion'

### SpaCy

In [18]:
for entity in nlp(text).ents:
    print("Entity: ", entity.text)
    print("Entity Type: %s | %s" % (entity.label_, spacy.explain(entity.label_)))
    print("--")

Entity:  Google
Entity Type: ORG | Companies, agencies, institutions, etc.
--
Entity:  U.K.
Entity Type: GPE | Countries, cities, states
--
Entity:  $1 billion
Entity Type: MONEY | Monetary values, including unit
--


In [20]:
from spacy import displacy
displacy.render(nlp(text), style="ent", jupyter = True)

<a id='3'></a>
# 3. Feature Representation


<a id='3.1'></a>
## 3.1. Bag of Words - Word Count

In natural language processing, a common technique for extracting features from text is to place all of the words that occur in the text in a bucket. This approach is called a bag-of-words model, or BoW for short. It is referred to as a “bag” of words because any information about the structure of the sentence is lost. The CountVectorizer from sklearn provides a simple way to both tokenize a collection of text documents and encode new documents using the learned vocabulary. The fit_transform function learns the vocabulary from one or more documents and encodes each document as a vector of word counts.

In [21]:
sentences = [
'The stock price of google jumps on the earning data today',
'We really love FINC612', 'This semester is over'
]

In [22]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
print( vectorizer.fit_transform(sentences).todense() )
print( vectorizer.vocabulary_ )

[[1 1 0 1 0 1 0 1 1 0 1 0 0 1 2 0 1 0]
 [0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1]
 [0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0]]
{'the': 14, 'stock': 13, 'price': 10, 'of': 7, 'google': 3, 'jumps': 5, 'on': 8, 'earning': 1, 'data': 0, 'today': 16, 'we': 17, 'really': 11, 'love': 6, 'finc612': 2, 'this': 15, 'semester': 12, 'is': 4, 'over': 9}
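
To show what the encoding above amounts to, the following sketch rebuilds the same vocabulary and count matrix with the standard library only. It assumes CountVectorizer's default settings (lowercasing, tokens of two or more word characters, vocabulary indexed alphabetically); the helper names `tokenize` and `encode` are illustrative.

```python
import re
from collections import Counter

sentences = [
    "The stock price of google jumps on the earning data today",
    "We really love FINC612",
    "This semester is over",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, indexed in alphabetical order.
all_tokens = {tok for s in sentences for tok in tokenize(s)}
vocabulary = {word: i for i, word in enumerate(sorted(all_tokens))}

def encode(text):
    # One slot per vocabulary word; unknown words are ignored.
    vector = [0] * len(vocabulary)
    for word, count in Counter(tokenize(text)).items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector

matrix = [encode(s) for s in sentences]
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word, so word order within a sentence plays no role in the representation.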


<a id='3.2'></a>
## 3.2. TF-IDF

An alternative is to weight word frequencies, and by far the most popular method is TF-IDF. This is an acronym that stands for “Term Frequency – Inverse Document Frequency”, the two components of the score assigned to each word.

* Term Frequency: This summarizes how often a given word appears within a document.
* Inverse Document Frequency: This downscales words that appear a lot across documents.

Without going into the math, TF-IDF scores try to highlight words that are more interesting, e.g. frequent in a document but not across documents.

The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents.

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
TFIDF = vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names_out()[-10:]) # Use get_feature_names_out() instead of get_feature_names()
print(TFIDF.shape)
print(TFIDF.toarray())

['earning' 'finc612' 'google' 'jumps' 'love' 'price' 'really' 'semester'
 'stock' 'today']
(3, 11)
[[0.37796447 0.37796447 0.         0.37796447 0.37796447 0.
  0.37796447 0.         0.         0.37796447 0.37796447]
 [0.         0.         0.57735027 0.         0.         0.57735027
  0.         0.57735027 0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         1.         0.         0.        ]]
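
For readers who do want to see the arithmetic, the scores above can be reproduced by hand. The sketch below assumes scikit-learn's default settings, namely a smoothed IDF of ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each document vector. The token lists are hard-coded as an assumption (the sentences after lowercasing and stop word removal) so that no external stop word list is needed.

```python
import math
from collections import Counter

# Tokens left after lowercasing and stop word removal (hard-coded assumption).
docs = [
    ["stock", "price", "google", "jumps", "earning", "data", "today"],
    ["really", "love", "finc612"],
    ["semester"],
]

n_docs = len(docs)
# Document frequency: the number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    tf = Counter(doc)
    # Raw score: term count times smoothed inverse document frequency.
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    # L2-normalize so that each document vector has unit length.
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

weights = [tfidf(doc) for doc in docs]
print(round(weights[0]["stock"], 8))
print(round(weights[1]["love"], 8))
print(round(weights[2]["semester"], 8))
```

Because every term here occurs in exactly one document, all terms share the same IDF, and after normalization each weight reduces to 1 divided by the square root of the number of terms in the document.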


<a id='5'></a>
# 5. NLP Recipes

<a id='5.1'></a>
## 5.1. Sentiment Analysis

Sentiment analysis is the contextual mining of text to identify and extract subjective information from source material, helping us understand the sentiment behind a text.

With TextBlob, sentiment analysis can be performed in a few lines of code. TextBlob provides polarity and subjectivity estimates for parsed documents using dictionaries provided by the Pattern library. Polarity measures whether the emotion expressed in the analyzed sentence is negative or positive. Polarity alone is not enough to deal with complex sentences, so subjectivity is reported as well: it helps determine the personal state of the speaker, including emotions, beliefs, and opinions. Subjectivity ranges from 0 to 1, where a value closer to 0 indicates an objective sentence and a value closer to 1 a subjective one.

The TextBlob sentiment function relies on a prebuilt lexicon that maps adjectives frequently found in reviews (source code: https://textblob.readthedocs.io/en/dev/_modules/textblob/en/sentiments.html) to sentiment polarity scores ranging from -1 to +1 (negative ↔ positive) and to a similar subjectivity score (objective ↔ subjective).

The .sentiment attribute provides the average of each score over the relevant tokens, whereas the .sentiment_assessments attribute lists the underlying values for each token.

In [24]:
text1 = "Bayer (OTCPK:BAYRY) started the week up 3.5% to €74/share in Frankfurt, touching their highest level in 14 months, after the U.S. government said a $25M glyphosate decision against the company should be reversed."
text2 = "Apple declares poor in revenues"

In [25]:
TextBlob(text1).sentiment.polarity

0.5

In [26]:
TextBlob(text1).sentiment_assessments

Sentiment(polarity=0.5, subjectivity=0.5, assessments=[(['touching'], 0.5, 0.5, None)])

In [27]:
TextBlob(text2).sentiment.polarity

-0.4

In [28]:
TextBlob(text2).sentiment_assessments

Sentiment(polarity=-0.4, subjectivity=0.6, assessments=[(['poor'], -0.4, 0.6, None)])
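
The averaging idea behind these scores can be sketched with a toy lexicon. The `LEXICON` below is a hypothetical two-entry dictionary that simply reuses the values shown in the assessments above; it is not TextBlob's actual lexicon, and the sketch ignores the negation and intensifier handling that the real analyzer performs.

```python
import re

# Hypothetical mini-lexicon: word -> (polarity, subjectivity).
# The two entries reuse the values from the assessments shown above.
LEXICON = {
    "touching": (0.5, 0.5),
    "poor": (-0.4, 0.6),
}

def sentiment(text):
    words = re.findall(r"[a-z]+", text.lower())
    hits = [LEXICON[w] for w in words if w in LEXICON]
    if not hits:
        # No known words: treat the text as neutral and objective.
        return 0.0, 0.0
    # Average each score over the words found in the lexicon.
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return polarity, subjectivity

print(sentiment("Apple declares poor in revenues"))
```

This also explains the score for text1: only "touching" is found in the lexicon, so the whole sentence inherits its polarity even though the word is used in a non-emotional sense.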

<a id='5.2'></a>
## 5.2. Text Similarity
Finding similarity between texts is at the heart of many text mining methods, for example text classification, clustering, and recommendation. To calculate the similarity between two text snippets, the usual approach is to convert each text into a vector representation, for instance with word embeddings, and then compare the vectors using a metric such as cosine similarity or Euclidean distance. The underlying vector representations come from a word embedding model, which generally produces a dense multi-dimensional semantic representation of words (as shown in the example). Using this vector representation, we can calculate similarities and dissimilarities between tokens, named entities, noun phrases, sentences, and documents. The example below shows how to calculate similarities between two documents and between tokens.

In [29]:
text1 = "Barack Obama was the 44th president of the United States of America."
text2 = "Donald Trump is the 45th president of the United States of America."
text3 = "SpaCy and NLTK are two popular NLP libraries in Python community."
doc1 = nlp(text1)
doc2 = nlp(text2)
doc3 = nlp(text3)

In [30]:
def text_similarity(inp_obj1, inp_obj2):
    return inp_obj1.similarity(inp_obj2)

In [31]:
print("Similarity between doc1 and doc2: ", text_similarity(doc1, doc2))
print("Similarity between doc1 and doc3: ", text_similarity(doc1, doc3))

Similarity between doc1 and doc2:  0.9745025257196641
Similarity between doc1 and doc3:  0.6599942528535854
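
As far as I know, spaCy's default document similarity is the cosine similarity between the averaged word vectors of the two documents. The sketch below illustrates that computation with the standard library. The three-dimensional `toy_vectors` are made-up values for illustration only; real embeddings such as those in en_core_web_lg have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    # Component-wise average of a list of equal-length vectors.
    return [sum(component) / len(vectors) for component in zip(*vectors)]

# Hypothetical 3-dimensional word vectors, invented for this example.
toy_vectors = {
    "president": [0.9, 0.1, 0.0],
    "states": [0.8, 0.2, 0.1],
    "python": [0.0, 0.2, 0.9],
}

doc_a = mean_vector([toy_vectors["president"], toy_vectors["states"]])
doc_b = mean_vector([toy_vectors["python"]])

print(cosine_similarity(toy_vectors["president"], toy_vectors["states"]))
print(cosine_similarity(doc_a, doc_b))
```

Cosine similarity depends only on the angle between the vectors, not on their length, which is why it ranges from -1 to 1 and why negative values such as the one for "Apple" and "cats" can occur.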


In [32]:
def token_similarity(doc):
    for token1 in doc:
        for token2 in doc:
            print("Token 1: %s, Token 2: %s - Similarity: %f" % (token1.text, token2.text, token1.similarity(token2)))

doc4 = nlp("Apple orange cats")
token_similarity(doc4)

Token 1: Apple, Token 2: Apple - Similarity: 1.000000
Token 1: Apple, Token 2: orange - Similarity: 0.184876
Token 1: Apple, Token 2: cats - Similarity: -0.047864
Token 1: orange, Token 2: Apple - Similarity: 0.184876
Token 1: orange, Token 2: orange - Similarity: 1.000000
Token 1: orange, Token 2: cats - Similarity: 0.139896
Token 1: cats, Token 2: Apple - Similarity: -0.047864
Token 1: cats, Token 2: orange - Similarity: 0.139896
Token 1: cats, Token 2: cats - Similarity: 1.000000



---