# NLP-Master Template

How do you work through common NLP tasks using different Python NLP packages?

A typical NLP problem statement consists of the following steps and substeps: 
![image.png](attachment:image.png)

In this Jupyter notebook you will work through the data preprocessing and feature extraction steps shown above using different NLP libraries.

There are many Python packages for NLP, but mastering a handful of them covers the important bases. This notebook describes the three Python NLP libraries we have found most useful and will use in the case studies.


##### NLTK :
NLTK is recommended only as an education and research tool. Its modularized structure makes it excellent for learning and exploring NLP concepts, but it’s not very suitable for production.


##### TextBlob :  
TextBlob is built on top of NLTK and is more accessible. It is one of the best libraries for fast prototyping or for building applications that don’t require highly optimized performance.


##### SpaCy :  
spaCy is an NLP library that’s designed to be fast, streamlined, and production-ready. It is minimal and opinionated, and it doesn’t flood you with options the way NLTK does. Its philosophy is to present only one algorithm (the best one) for each purpose, so you don’t have to make choices and can focus on being productive.

In addition to these, there are a few other libraries, such as Gensim and Stanford’s CoreNLP, that can be explored as well but are not used much in the case studies. Gensim is used for a few specialised tasks; an example of using Gensim for word embedding is shown in a later section.



## Content

* [1. Loading Libraries and Packages](#1)
* [2. Data Preprocessing](#2)
    * [2.1. Tokenization](#2.1)    
    * [2.2. Removing Stop Words](#2.2)
    * [2.3. Stemming](#2.3)
    * [2.4. Lemmatization](#2.4)
    * [2.5. PoS tagging](#2.5)
    * [2.6. Named Entity Recognition](#2.6)  
* [3. Feature Representation](#3)
    * [3.1. Bag-of Words](#3.1)    
    * [3.2. TF-IDF](#3.2)
    * [3.3. Word Embedding](#3.3)
* [4. Inference](#4)
    * [4.1. Supervised (Example Naive Bayes)](#4.1)    
    * [4.2. Unsupervised (Example LDA)](#4.2)
* [5. NLP Recipes](#5)
    * [5.1. Sentiment Analysis](#5.1)
    * [5.2. Text Similarity](#5.2)

<a id='1'></a>
# 1. Load libraries and Packages 

* NLTK
Import NLTK and run nltk.download(). This opens the NLTK downloader, from which you can choose the corpora and models to download. You can also download all packages at once. Details are in the links below:
   *  NLTK Book: http://www.nltk.org/book/
   *  Dive into NLTK: https://textminingonline.com/dive-into-nltk-part-i-getting-started-with-nltk
    
* TextBlob
   * TextBlob Documentation: https://textblob.readthedocs.io/en/dev/index.html
    
* Spacy
   * spaCy Documentation: https://spacy.io/
   * Intro to NLP with SpaCy: https://nicschrading.com/project/Intro-to-NLP-with-spaCy/


In [136]:
import nltk
import nltk.data
nltk.download('punkt')
from textblob import TextBlob
import spacy
# Run `python -m spacy download en_core_web_lg` once to download the model loaded below
nlp = spacy.load("en_core_web_lg")

#Other helper packages
import pandas as pd
import numpy as np

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\tatsa\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [245]:
# Disable the warnings
import warnings
warnings.filterwarnings('ignore')

<a id='2'></a>
# 2. Preprocessing

<a id='2.1'></a>
## 2.1. Tokenization
Tokenization is the process of converting a normal text string into a list of tokens, i.e. the words that we actually want. A sentence tokenizer returns the list of sentences in a string, and a word tokenizer returns the list of words.

In [194]:
#Text to tokenize
text = "This is a tokenize test"

### NLTK

The NLTK data package includes a pre-trained Punkt tokenizer for English, which was already downloaded in the first section.

In [195]:
from nltk.tokenize import word_tokenize
word_tokenize(text)

['This', 'is', 'a', 'tokenize', 'test']

### TextBlob

In [196]:
TextBlob(text).words

WordList(['This', 'is', 'a', 'tokenize', 'test'])
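
The text above mentions sentence tokenizers as well as word tokenizers, but only word tokenization is demonstrated. As a rough, library-free illustration of both ideas, the sketch below uses Python's `re` module. The two regular expressions and the function names are simplifying assumptions made for this example; they are not how Punkt or TextBlob work internally, and the sentence splitter would break on abbreviations such as "U.K.".

```python
import re

def simple_sent_tokenize(text):
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(text):
    # A token is a run of word characters or a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

sample = "This is a tokenize test. It has two sentences!"
sentences = simple_sent_tokenize(sample)
words = simple_word_tokenize(sentences[0])
print(sentences)
print(words)
```

In practice a trained tokenizer is preferable, because rules this simple cannot tell a full stop from the dot in an abbreviation.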

<a id='2.2'></a>
## 2.2. Stop Words Removal

Some extremely common words are of little value in helping select the documents that match a user's need, so they are excluded from the vocabulary entirely. These words are called stop words. The code for removing stop words using the NLTK library is shown below:

### NLTK

We first load NLTK's default English stop word list, stopwords.words('english'), and store it as a set in the stop_words variable. Next, we iterate through each word in the input text and drop the word if it is in the stop word set.

In [197]:
text = "S&P and NASDAQ are the two most popular indices in US"

In [198]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
stop_words = set(stopwords.words('english'))
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if word not in stop_words]

print(tokens_without_sw)

['S', '&', 'P', 'NASDAQ', 'two', 'popular', 'indices', 'US']


As we can see, stop words such as "and", "are", "the", "most" and "in" are removed from the sentence.
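
One detail worth keeping in mind is that NLTK's English stop word list is lower-case, so a case-sensitive membership test keeps a capitalised "The" at the start of a sentence. The sketch below shows a case-insensitive filter. The tiny `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical and exist only for this illustration; the input is split on whitespace rather than with `word_tokenize`.

```python
# Hypothetical, deliberately tiny stop word list for illustration only
STOP_WORDS = {"and", "are", "the", "most", "in", "is", "a", "of"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Lower-case each token before the lookup so "The" and "the" are treated alike
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = "The S&P and NASDAQ are the two most popular indices in US".split()
filtered = remove_stop_words(tokens)
print(filtered)
```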

<a id='2.3'></a>
## 2.3. Stemming
Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form. For example, if we were to stem the words “Stems”, “Stemming”, “Stemmed” and “Stemtization”, the result would be the single word “stem”.

In [206]:
text = "It's a Stemming testing"

### NLTK

In [207]:
parsed_text = word_tokenize(text)

In [208]:
# Initialize stemmer.
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')

# Stem each word.
# Keep only the words that the stemmer actually changes
[(word, stemmer.stem(word)) for word in parsed_text
 if word.lower() != stemmer.stem(word)]

[('Stemming', 'stem'), ('testing', 'test')]

<a id='2.4'></a>
## 2.4. Lemmatization

A slight variant of stemming is lemmatization. The major difference between the two is that stemming can often create non-existent words, whereas lemmas are actual words. A stem is not necessarily something you can look up in a dictionary, but you can look up a lemma. For example, “run” is the base form of words like “running” and “ran”, and “better” and “good” share the same lemma, so they are considered the same.
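
To make the contrast concrete, here is a minimal sketch that puts a crude suffix-stripping stemmer next to a dictionary lookup. Both `crude_stem` and the `LEMMAS` table are hypothetical stand-ins written for this example; real stemmers (such as Snowball) and real lemmatizers (such as the WordNet-based one behind TextBlob) are considerably more elaborate.

```python
def crude_stem(word):
    # Strip the first matching suffix. No dictionary is consulted,
    # so the result need not be a real word.
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real lemma dictionary
LEMMAS = {"studies": "study", "running": "run", "ran": "run", "better": "good"}

def lookup_lemma(word):
    # Fall back to the lower-cased word when it is not in the table
    return LEMMAS.get(word.lower(), word.lower())

for w in ["studies", "running", "ran", "better"]:
    print(w, "-> stem:", crude_stem(w), "| lemma:", lookup_lemma(w))
```

The stemmer produces forms such as "studi" and "runn" that are not words and leaves "ran" and "better" untouched, while the lookup maps every example to a dictionary form.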

### TextBlob

In [209]:
text = "This world has a lot of faces "

In [210]:
from textblob import Word
parsed_data= TextBlob(text).words
parsed_data

WordList(['This', 'world', 'has', 'a', 'lot', 'of', 'faces'])

In [211]:
# Keep only the words that the lemmatizer actually changes
[(word, word.lemmatize()) for word in parsed_data
 if word != word.lemmatize()]

[('has', 'ha'), ('faces', 'face')]

<a id='2.5'></a>
## 2.5. POS Tagging

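Part-of-speech (PoS) tagging is the process of assigning each token to a grammatical category, such as noun, verb or adjective, so that the role a word plays in the sentence can be used in later steps. The tags in the TextBlob output below follow the Penn Treebank convention, e.g. NNP for a singular proper noun and VBZ for a third-person singular present-tense verb.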

In [212]:
text = 'Google is looking at buying U.K. startup for $1 billion'

### TextBlob

In [213]:
TextBlob(text).tags

[('Google', 'NNP'),
 ('is', 'VBZ'),
 ('looking', 'VBG'),
 ('at', 'IN'),
 ('buying', 'VBG'),
 ('U.K.', 'NNP'),
 ('startup', 'NN'),
 ('for', 'IN'),
 ('1', 'CD'),
 ('billion', 'CD')]

## spaCy: doing it all at once

When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the default models consists of a tagger, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.

![image.png](attachment:image.png)

All the preprocessing items including tokenization, stop words removal, lemmatization, getting POS and NER etc. can be performed in one go using spaCy. An example is demonstrated below. We will go through the example of NER in the next section.

In [174]:
text = 'Google is looking at buying U.K. startup for $1 billion'
doc = nlp(text)

In [171]:
pd.DataFrame([[t.text, t.is_stop, t.lemma_, t.pos_]
              for t in doc],
             columns=['Token', 'is_stop_word','lemma', 'POS'])

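| | Token | is_stop_word | lemma | POS |
|---|---|---|---|---|
| 0 | Google | False | Google | PROPN |
| 1 | is | True | be | VERB |
| 2 | looking | False | look | VERB |
| 3 | at | True | at | ADP |
| 4 | buying | False | buy | VERB |
| 5 | U.K. | False | U.K. | PROPN |
| 6 | startup | False | startup | NOUN |
| 7 | for | True | for | ADP |
| 8 | $ | False | $ | SYM |
| 9 | 1 | False | 1 | NUM |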


spaCy also performs NER and provides word embeddings, both of which we will cover in later sections. Given that spaCy performs a wide range of NLP tasks in one go, it is highly recommended, and we will be using it extensively in our case studies. The attributes and methods available on a processed Doc are listed below.

In [175]:
attributes = [a for a in dir(doc) if not a.startswith('_')]
print(attributes)

['cats', 'char_span', 'count_by', 'doc', 'ents', 'extend_tensor', 'from_array', 'from_bytes', 'from_disk', 'get_extension', 'get_lca_matrix', 'has_extension', 'has_vector', 'is_nered', 'is_parsed', 'is_sentenced', 'is_tagged', 'lang', 'lang_', 'mem', 'merge', 'noun_chunks', 'noun_chunks_iterator', 'print_tree', 'remove_extension', 'retokenize', 'sentiment', 'sents', 'set_extension', 'similarity', 'tensor', 'text', 'text_with_ws', 'to_array', 'to_bytes', 'to_disk', 'to_json', 'user_data', 'user_hooks', 'user_span_hooks', 'user_token_hooks', 'vector', 'vector_norm', 'vocab']


<a id='2.6'></a>
## 2.6. Named Entity Recognition

Named Entity Recognition, commonly abbreviated as NER, is a process that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations and locations, expressions of time, quantities, monetary values, percentages, etc. NER is used in many fields of Natural Language Processing (NLP), and it can help answer many real-world questions. NER performed using spaCy is shown below.

In [215]:
text = 'Google is looking at buying U.K. startup for $1 billion'

### SpaCy

In [216]:
for entity in nlp(text).ents:
    print("Entity: ", entity.text)
    print("Entity Type: %s | %s" % (entity.label_, spacy.explain(entity.label_)))
    print("--")

Entity:  Google
Entity Type: ORG | Companies, agencies, institutions, etc.
--
Entity:  U.K.
Entity Type: GPE | Countries, cities, states
--
Entity:  $1 billion
Entity Type: MONEY | Monetary values, including unit
--


In [220]:
from spacy import displacy
displacy.render(nlp(text), style="ent", jupyter = True)

<a id='3'></a>
# 3. Feature Representation

The vast majority of NLP-related data is created for human consumption and as such is stored in an unstructured format, such as news feed articles, PDF reports, social media posts and audio files, which cannot be readily processed by computers. For the information content to be conveyed to the statistical inference algorithm, the preprocessed tokens from the previous section need to be translated into predictive features. A model is used to embed the raw text into a vector space, where the usual data science tools can be applied.

Feature representation involves two things:
* A vocabulary of known words.
* A measure of the presence of known words.

The intuition behind the Feature Representation is that documents are similar if they have similar content. Also, we can learn something about the meaning of the document from its content alone.
For example, if our dictionary contains the words {Learning, is, the, not, great}, and we want to vectorize the text “Learning is great”, we would have the following vector: (1, 1, 0, 0, 1).
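
The vocabulary example above can be checked with a few lines of plain Python. This is a minimal sketch; `presence_vector` is a hypothetical helper written for this illustration, and it splits on whitespace rather than using a proper tokenizer.

```python
def presence_vector(text, vocabulary):
    # 1 if the vocabulary word occurs in the text, else 0 (case-insensitive)
    tokens = {tok.lower() for tok in text.split()}
    return tuple(int(word.lower() in tokens) for word in vocabulary)

vocabulary = ["Learning", "is", "the", "not", "great"]
vector = presence_vector("Learning is great", vocabulary)
print(vector)  # (1, 1, 0, 0, 1)
```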

Some of the feature representation methods are as follows: 
* Bag of Words (word count)
* TF-IDF
* Word Embedding
    * Pretrained word embedding models (Word2Vec, GloVe)
    * Customized deep learning based models

There are other feature representations (or vector representations), such as one-hot encoding of text and n-grams, which are similar to the types mentioned above.

<a id='3.1'></a>
## 3.1. Bag of Words - Word Count

In natural language processing, a common technique for extracting features from text is to place all of the words that occur in the text in a bucket. This approach is called a bag-of-words model, or BoW for short. It’s referred to as a “bag” of words because any information about the structure of the sentence is lost. The CountVectorizer from sklearn provides a simple way to both tokenize a collection of text documents and encode new documents using that vocabulary. The fit_transform function learns the vocabulary from one or more documents and encodes each document as a vector of word counts.

In [223]:
sentences = [
'The stock price of google jumps on the earning data today',
'Google plunge on China Data!'
]

In [224]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
print( vectorizer.fit_transform(sentences).todense() )
print( vectorizer.vocabulary_ )

[[0 1 1 1 1 1 1 0 1 1 2 1]
 [1 1 0 1 0 0 1 1 0 0 0 0]]
{'the': 10, 'stock': 9, 'price': 8, 'of': 5, 'google': 3, 'jumps': 4, 'on': 6, 'earning': 2, 'data': 1, 'today': 11, 'plunge': 7, 'china': 0}


The array shows the encoded count vectors: in the first sentence every word occurs once except “the” (index 10), which occurs twice. Word counts are a good starting point, but they are very basic. One issue with simple counts is that some words like “the” will appear many times and their large counts will not be very meaningful in the encoded vectors.

<a id='3.2'></a>
## 3.2. TF-IDF

An alternative is to calculate word frequencies, and by far the most popular method is called TF-IDF. This is an acronym that stands for “Term Frequency – Inverse Document Frequency”, which are the components of the resulting scores assigned to each word.

* Term Frequency: This summarizes how often a given word appears within a document.
* Inverse Document Frequency: This downscales words that appear a lot across documents.

Without going into the math, TF-IDF scores are word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents.

The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents.

In [226]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
TFIDF = vectorizer.fit_transform(sentences)
# Note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2
print(vectorizer.get_feature_names()[-10:])
print(TFIDF.shape)
print(TFIDF.toarray())

['china', 'data', 'earning', 'google', 'jumps', 'plunge', 'price', 'stock', 'today']
(2, 9)
[[0.         0.29017021 0.4078241  0.29017021 0.4078241  0.
  0.4078241  0.4078241  0.4078241 ]
 [0.57615236 0.40993715 0.         0.40993715 0.         0.57615236
  0.         0.         0.        ]]


A vocabulary of 9 words is learned from the documents and each word is assigned a unique integer index in the output vector. Each sentence is encoded as a 9-element vector (stored as a sparse matrix), and we can see that the scores differ from word to word: terms that occur in both sentences, such as “data” and “google”, score lower than terms that occur in only one.
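
For readers who do want to see the math, the scores in the output above can be reproduced by hand. The sketch below assumes scikit-learn's default settings: a smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row. The token lists are typed in by hand (lower-cased, with the English stop words already removed) rather than produced by the vectorizer, and `tfidf_row` is a hypothetical helper.

```python
import math

# Tokens left after lower-casing and English stop word removal (entered by hand)
docs = [
    ["stock", "price", "google", "jumps", "earning", "data", "today"],
    ["google", "plunge", "china", "data"],
]
vocab = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency and smoothed inverse document frequency per term
df = {term: sum(term in doc for doc in docs) for term in vocab}
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(doc):
    # Raw term count times idf, then scale the row to unit Euclidean length
    raw = [doc.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

tfidf = [tfidf_row(doc) for doc in docs]
print(vocab)
for row in tfidf:
    print([round(v, 4) for v in row])
```

Terms that appear in both sentences get an idf of exactly 1, while terms that appear in only one get a larger idf, which is why "data" and "google" end up with the smallest non-zero scores in each row.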

<a id='3.3'></a>
## 3.3. Word Embedding

A word embedding is a class of approaches for representing words and documents using a dense vector representation.

It is an improvement over the more traditional bag-of-words encoding schemes, where large sparse vectors were used to represent each word or to score each word within a vector to represent an entire vocabulary. These representations were sparse because the vocabularies were vast and a given word or document would be represented by a large vector comprised mostly of zero values.

Instead, in an embedding, words are represented by dense vectors where a vector represents the projection of the word into a continuous vector space. The position of a word within the vector space is learned from text and is based on the words that surround the word when it is used. The position of a word in the learned vector space is referred to as its embedding.

Two common ways of obtaining word embeddings are:
* Pretrained models (e.g. Word2Vec, GloVe)
* Developing custom models

In addition to these carefully designed methods, a word embedding can be learned as part of a deep learning model. This can be a slower approach, but tailors the model to a specific training dataset.

### 3.3.1 Pretrained word embedding models

### 3.3.1.1. Pretrained model: spaCy

spaCy comes with built-in vector representations of text at the word, sentence and document level. The underlying vector representations come from a word embedding model which generally produces a dense multi-dimensional semantic representation of words (as shown in the example below). The word embedding model includes 20k unique vectors with 300 dimensions. Using this vector representation, we can calculate similarities and dissimilarities between tokens, named entities, noun phrases, sentences and documents. 

Word embedding in spaCy is performed by first loading the model and then processing text. The vectors can be accessed directly using the .vector attribute of each processed token (word). The mean vector for the entire sentence is also available through .vector on the Doc, providing a very convenient input for machine learning models based on sentences.

In [231]:
doc = nlp("Apple orange cats dogs")

In [291]:
print("Vector representation of the sentence, features 1 to 9: \n", doc.vector[1:10])

Vector representation of the sentence, features 1 to 9: 
 [ 0.22351399 -0.110111   -0.367025   -0.13430001  0.13790375 -0.24379876
 -0.10736975  0.2715925   1.3117325 ]
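
The sentence-level vector mentioned above is the element-wise average of the token vectors. The following minimal sketch shows that averaging step in plain Python. The 3-dimensional `toy_vectors` are invented for the illustration (they are not values from any spaCy model) and `mean_vector` is a hypothetical helper.

```python
# Hypothetical 3-dimensional word vectors (real models use e.g. 300 dimensions)
toy_vectors = {
    "apple":  [0.9, 0.1, 0.0],
    "orange": [0.7, 0.3, 0.0],
    "cats":   [0.0, 0.2, 0.8],
    "dogs":   [0.0, 0.4, 0.6],
}

def mean_vector(tokens, vectors):
    # Element-wise average over the vectors of all tokens that have one
    known = [vectors[t] for t in tokens if t in vectors]
    return [sum(dim) / len(known) for dim in zip(*known)]

sentence_vec = mean_vector(["apple", "orange", "cats", "dogs"], toy_vectors)
print(sentence_vec)
```

Averaging discards word order, so two sentences with the same words in a different order get the same vector under this scheme.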


### 3.3.1.2. Word2Vec

In [251]:
from gensim.models import Word2Vec

In [378]:
sentences = [
['The','stock','price', 'of', 'Google', 'increases'],
['Google','plunge',' on','China',' Data!']]
# train model
model = Word2Vec(sentences, min_count=1)
# summarize the loaded model
print(model)
words = list(model.wv.vocab)  # gensim 4.0+: use list(model.wv.key_to_index)
print(words)
print(model.wv['Google'][1:5])  # index model.wv; model['Google'] is deprecated

Word2Vec(vocab=10, size=100, alpha=0.025)
['The', 'stock', 'price', 'of', 'Google', 'increases', 'plunge', ' on', 'China', ' Data!']
[-1.7868265e-03 -7.6242397e-04  6.0105987e-05  3.5568199e-03]


<a id='4'></a>
# 4. Inference
Like all other artificial intelligence tasks, the inference generated by an NLP application usually needs to be translated into a decision in order to be actionable. Inference in ML falls under three broad categories, namely supervised, unsupervised and reinforcement learning. While the type of inference required depends on the business problem and the type of training data, in NLP the most commonly used algorithms are supervised or unsupervised.

In recent years, neural network architectures, such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs), have dominated NLP-based inference. 

A popular unsupervised technique applied in NLP is LSA (Latent Semantic Analysis). LSA looks at relationships between a set of documents and the words they contain by producing a set of latent concepts related to the documents and terms. LSA has paved the way for a more sophisticated approach called LDA (Latent Dirichlet Allocation), under which documents are modelled as a finite mixture of topics and topics in turn are modelled as a finite mixture over words in the vocabulary. LDA has been extensively used for topic modeling, a growing area of research where NLP practitioners build probabilistic generative models to reveal likely topic attributions for words.

We have already discussed many supervised and unsupervised learning models in the previous chapters. We will just provide details about Naive Bayes and LDA models which are extensively used in NLP and were not covered in the previous chapters. 

<a id='4.1'></a>
## 4.1. Supervised Learning Example-Naive Bayes

One of the most commonly used supervised methodologies in NLP is the Naïve Bayes model, which assumes that all word features are independent of each other given the class labels. Due to this simplifying assumption, Naïve Bayes is very compatible with a bag-of-words representation. We do have other alternatives when coping with NLP problems, such as Support Vector Machines (SVM) and neural networks. However, the simple design of Naïve Bayes classifiers makes them very attractive for such problems. Moreover, they have been demonstrated to be fast, reliable and accurate in a number of NLP applications. Naïve Bayes is commonly described as ‘the punching bag’ of more complex algorithms in ML. However, despite its simplifying assumptions, it often goes head to head with, and at times even outperforms, more complicated classifiers. 

Naive Bayes is a family of algorithms based on applying Bayes' theorem with a strong (naive) assumption, namely that every feature is independent of the others, in order to predict the category of a given sample. They are probabilistic classifiers: they calculate the probability of each category using Bayes' theorem and output the category with the highest probability. 
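
To show what "calculate the probability of each category" means in practice, here is a minimal from-scratch sketch of a multinomial Naive Bayes classifier with additive smoothing. It is an illustration under simplifying assumptions (whitespace tokenization, lower-casing, words outside the training vocabulary ignored), not scikit-learn's implementation. The training sentences are lower-cased, punctuation-free versions of the two example sentences used in this notebook, and all function names are hypothetical.

```python
import math
from collections import Counter

train = [
    ("the stock price of google jumps on the earning data today", 1),
    ("google plunge on china data", 0),
]

vocab = sorted({w for text, _ in train for w in text.split()})
labels = sorted({label for _, label in train})
alpha = 0.1  # additive smoothing keeps unseen (word, class) pairs above zero

# Per-class word counts and log class priors
word_counts = {c: Counter() for c in labels}
for text, label in train:
    word_counts[label].update(text.split())
log_prior = {
    c: math.log(sum(1 for _, lab in train if lab == c) / len(train)) for c in labels
}

def log_likelihood(word, c):
    # log P(word | class) with additive smoothing over the vocabulary
    total = sum(word_counts[c].values())
    return math.log((word_counts[c][word] + alpha) / (total + alpha * len(vocab)))

def predict(text):
    # Sum the log prior and the log likelihood of every known word, pick the best class
    words = [w for w in text.lower().split() if w in vocab]
    scores = {c: log_prior[c] + sum(log_likelihood(w, c) for w in words) for c in labels}
    return max(scores, key=scores.get)

predictions = [predict("Apple price plunge"), predict("Amazon Price jumps")]
print(predictions)
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow for longer documents.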

In [371]:
sentences = [
'The stock price of google jumps on the earning data today',
'Google plunge on China Data!']
sentiment = (1, 0)
data = pd.DataFrame({'Sentence': sentences,
        'sentiment': sentiment})

In [372]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer().fit(data['Sentence'])
X_train_vectorized = vect.transform(data['Sentence'])

In [377]:
from sklearn.naive_bayes import MultinomialNB
clfrNB = MultinomialNB(alpha = 0.1)
clfrNB.fit(X_train_vectorized, data['sentiment'])

preds = clfrNB.predict(vect.transform(['Apple price plunge', 'Amazon Price jumps']))
preds

array([0, 1], dtype=int64)

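As we can see, the Naive Bayes model assigns the first test sentence to class 0, the label of the "plunge" training sentence, and the second to class 1, the label of the "jumps" training sentence.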

<a id='4.2'></a>
## 4.2. Unsupervised Learning Example-LDA 
LDA is the most popular topic model because it tends to produce meaningful topics that humans can relate to, can assign topics to new documents, and is extensible. Variants of LDA models can include metadata such as authors or image data, or learn hierarchical topics.

Given a set of documents, assume that there are some latent topics that are not observed. Each document has a distribution over these topics. For instance, suppose the latent topics are 'S&P500', 'Gold', 'Oil' and 'Cryptocurrency'. Then a document may have the following distribution over the topics: 50% S&P500, 40% Gold, 8% Oil, 2% Cryptocurrency. Another document might have a different distribution over the topics.

Also, for each topic, you have a distribution over the words in the vocabulary. For example, for the Cryptocurrency topic, the probability of the word 'bitcoin' would be higher than that of 'barrels'. For Oil, 'barrels' will have a higher probability than 'bitcoin', and so on.

Now, a document is assumed to be generated as follows: first, you select a distribution over the topics, say 50% S&P500, 40% Gold, 8% Oil, 2% Cryptocurrency, as above. You draw a topic from this distribution; say it turns out to be Gold. Then, you draw a word from the distribution over words corresponding to the Gold topic. This is the first word of the document. You repeat this process for all words.
(Clearly, this is not how a document would actually be generated, but this is a reasonable approximation.)
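
The generative story described above can be sketched in a few lines with Python's `random` module. This is only an illustration of the sampling steps: the topic mixture reuses the percentages from the text, the per-topic word probabilities are invented, and the document's topic mixture is fixed by hand here, whereas in the full LDA model it is itself drawn from a Dirichlet distribution.

```python
import random

# Topic mixture of one document (the percentages used in the text)
doc_topics = {"S&P500": 0.50, "Gold": 0.40, "Oil": 0.08, "Cryptocurrency": 0.02}

# Hypothetical word distributions per topic; each sums to 1
topic_words = {
    "S&P500": {"index": 0.5, "stocks": 0.3, "earnings": 0.2},
    "Gold": {"ounce": 0.6, "bullion": 0.4},
    "Oil": {"barrels": 0.7, "crude": 0.3},
    "Cryptocurrency": {"bitcoin": 0.8, "blockchain": 0.2},
}

def generate_document(n_words, rng):
    words = []
    for _ in range(n_words):
        # Step 1: draw a topic from the document's topic distribution
        topic = rng.choices(list(doc_topics), weights=list(doc_topics.values()))[0]
        # Step 2: draw a word from that topic's word distribution
        dist = topic_words[topic]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return words

rng = random.Random(0)  # fixed seed so the sketch is repeatable
document = generate_document(10, rng)
print(document)
```

Training LDA runs this story in reverse: given only the generated words, it estimates the topic mixtures and word distributions that most plausibly produced them.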

You estimate the topic distributions and the distribution of words for each topic during training.
Now given a new document, you can generate the most likely distribution over the topics that generated the document.

The algorithm will not label the topics for you. It will only return something like:
* topic 1 corresponds to words: 'bitcoin', 'litecoin', 'blockchain', etc.
* topic 2 corresponds to words: 'GDP', 'bank', 'dollar', 'stock', etc.

You have to figure out what these topics refer to.

The algorithm takes as input a bag-of-words matrix (i.e., each document is a row and each column contains the count of one word of the vocabulary). It then aims to produce two smaller matrices, a document-to-topic matrix and a word-to-topic matrix, that when multiplied together reproduce the bag-of-words matrix with the lowest error.

In [303]:
sentences = [
'The stock price of google jumps on the earning data today',
'Google plunge on China Data!'
]

In [302]:
# Getting the bag of words
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(ngram_range=(1, 1), stop_words='english')
sentences_vec = vect.fit_transform(sentences)

# Running LDA on the bag of words
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=3)
lda.fit_transform(sentences_vec)

array([[0.04283242, 0.91209846, 0.04506912],
       [0.06793339, 0.07059533, 0.86147128]])

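The model produces two smaller matrices. The array above is the document-to-topic matrix, with one row per sentence and one column per topic; the word-to-topic weights are available in lda.components_. We will be discussing the interpretation further in the third case study. 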

<a id='5'></a>
# 5. NLP Recipes

<a id='5.1'></a>
## 5.1. Sentiment Analysis

Sentiment analysis is the contextual mining of text that identifies and extracts subjective information in the source material, helping us understand the sentiment behind a text. 

With TextBlob, sentiment analysis can be performed in a few lines of code. TextBlob provides polarity and subjectivity estimates for parsed documents using dictionaries provided by the Pattern library. Polarity indicates whether the emotion expressed in the analyzed sentence is positive or negative. Polarity alone is not enough to deal with complex sentences. Subjectivity helps in determining the personal states of the speaker, including emotions, beliefs and opinions. It takes values from 0 to 1: a value closer to 0 indicates an objective sentence and a value closer to 1 a subjective one.

The TextBlob sentiment function is pretrained and maps adjectives frequently found in movie reviews (source code: https://textblob.readthedocs.io/en/dev/_modules/textblob/en/sentiments.html) to sentiment polarity scores, ranging from -1 to +1 (negative ↔ positive), and a similar subjectivity score (objective ↔ subjective).

The .sentiment attribute provides the average for each over the relevant tokens, whereas the .sentiment_assessments attribute lists the underlying values for each token.
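
The averaging idea can be sketched with a toy lexicon. This is a simplified illustration and not TextBlob's implementation, which also handles negation, intensifiers and more. The two lexicon entries reuse the scores for "touching" and "poor" that appear in the assessments further down in this section; `LEXICON` and `lexicon_sentiment` are hypothetical names.

```python
# Hypothetical two-entry lexicon: word -> (polarity, subjectivity)
LEXICON = {"touching": (0.5, 0.5), "poor": (-0.4, 0.6)}

def lexicon_sentiment(text):
    # Collect (word, polarity, subjectivity) for every token found in the lexicon
    assessments = [(tok, *LEXICON[tok]) for tok in text.lower().split() if tok in LEXICON]
    if not assessments:
        return 0.0, 0.0, assessments
    # The overall scores are plain averages over the matched tokens
    polarity = sum(a[1] for a in assessments) / len(assessments)
    subjectivity = sum(a[2] for a in assessments) / len(assessments)
    return polarity, subjectivity, assessments

polarity, subjectivity, assessments = lexicon_sentiment("Apple declares poor in revenues")
print(polarity, subjectivity, assessments)
```

Because only words present in the lexicon contribute, a sentence whose sentiment is carried by out-of-lexicon words scores as neutral.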

In [285]:
text1 = "Bayer (OTCPK:BAYRY) started the week up 3.5% to €74/share in Frankfurt, touching their highest level in 14 months, after the U.S. government said a $25M glyphosate decision against the company should be reversed."
text2 = "Apple declares poor in revenues"

In [283]:
TextBlob(text1).sentiment.polarity

0.5

In [284]:
TextBlob(text1).sentiment_assessments

Sentiment(polarity=0.5, subjectivity=0.5, assessments=[(['touching'], 0.5, 0.5, None)])

In [280]:
TextBlob(text2).sentiment.polarity

-0.4

In [281]:
TextBlob(text2).sentiment_assessments

Sentiment(polarity=-0.4, subjectivity=0.6, assessments=[(['poor'], -0.4, 0.6, None)])

We see that the first text has positive sentiment and the second has negative sentiment. Looking at the subjectivity, the second sentence is more subjective than the first. However, looking at the words that give rise to the sentiment, it is the word "touching", not "highest", that causes the positive sentiment in the first text. So a sentiment analysis algorithm pretrained on movie or product reviews might not perform well on news sentiment analysis, and additional training on stock-related sentiment would probably be needed. 

<a id='5.2'></a>
## 5.2. Text Similarity
Finding the similarity between texts is at the heart of almost all text mining methods, for example text classification, clustering, recommendation, and many more. To calculate the similarity between two text snippets, the usual way is to convert each text into a vector representation, for which there are many methods such as word embedding, and then calculate the similarity or difference using a metric applicable to vectors, such as cosine similarity or Euclidean distance.

The underlying vector representations come from a word embedding model which generally produces a dense multi-dimensional semantic representation of words (as shown in the example). Using this vector representation, we can calculate similarities and dissimilarities between tokens, named entities, noun phrases, sentences and documents. The example below shows how to calculate similarities between two documents and between tokens. If all of this does not make sense, don’t worry. We will cover the concepts behind word embeddings and text similarity in detail in a subsequent blog post.
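
The two metrics named above can be written out in plain Python. This is a minimal sketch on hypothetical 3-dimensional vectors standing in for real embeddings; the function names are chosen for this illustration.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    # Straight-line distance between the two points
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical vectors standing in for real embeddings
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a, twice the length
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))   # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))   # 0: no shared direction
print(euclidean_distance(vec_a, vec_b))  # sqrt(5): differs in length only
```

Cosine similarity ignores vector length, which is why `vec_a` and `vec_b` count as identical under it even though their Euclidean distance is not zero.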

In [109]:
text1 = "Barack Obama was the 44th president of the United States of America."
text2 = "Donald Trump is the 45th president of the United States of America."
text3 = "SpaCy and NLTK are two popular NLP libraries in Python community."
doc1 = nlp(text1); doc2 = nlp(text2); doc3 = nlp(text3); 

In [111]:
# text_similarity is not defined in this notebook; spaCy's built-in Doc.similarity is assumed here
print("Similarity between doc1 and doc2: ", doc1.similarity(doc2))
print("Similarity between doc1 and doc3: ", doc1.similarity(doc3))

Similarity between doc1 and doc2:  0.9525886414220489
Similarity between doc1 and doc3:  0.5184867892507579


In [113]:
def token_similarity(doc):
    for token1 in doc:
        for token2 in doc:
            print("Token 1: %s, Token 2: %s - Similarity: %f" % (token1.text, token2.text, token1.similarity(token2)))

doc4 = nlp("Apple orange cats")
token_similarity(doc4)

Token 1: Apple, Token 2: Apple - Similarity: 1.000000
Token 1: Apple, Token 2: orange - Similarity: 0.561892
Token 1: Apple, Token 2: cats - Similarity: 0.218511
Token 1: orange, Token 2: Apple - Similarity: 0.561892
Token 1: orange, Token 2: orange - Similarity: 1.000000
Token 1: orange, Token 2: cats - Similarity: 0.267099
Token 1: cats, Token 2: Apple - Similarity: 0.218511
Token 1: cats, Token 2: orange - Similarity: 0.267099
Token 1: cats, Token 2: cats - Similarity: 1.000000
