
# TextBlob

TextBlob is an object-oriented NLP text-processing library that is built on the NLTK and pattern NLP libraries and simplifies many of their capabilities. Some of the NLP tasks TextBlob can perform include:

* **Tokenization**: splitting text into pieces called tokens, which are meaningful units, such as words and numbers.
* **Parts-of-speech (POS) tagging**: identifying each word’s part of speech, such as noun, verb, adjective, etc.
* **Noun phrase extraction**: locating groups of words that represent nouns, such as “red brick factory.”
* **Sentiment analysis**: determining whether text has positive, neutral or negative sentiment.
* **Inter-language translation and language detection** powered by Google Translate.
* **Inflection**: pluralizing and singularizing words. There are other aspects of inflection that are not part of TextBlob.
* **Spell checking and spelling correction**.
* **Stemming**: reducing words to their stems by removing prefixes or suffixes. For example, the stem of “varieties” is “varieti.”
* **Lemmatization**: like stemming, but produces real words based on the original words’ context. For example, the lemmatized form of “varieties” is “variety.”
* **Word frequencies**: determining how often each word appears in a corpus.
* **WordNet integration** for finding word definitions, synonyms and antonyms.
* **Stop word elimination**: removing common words, such as a, an, the, I, we, you and more, to analyze the important words in a corpus.
* **n-grams**: producing sets of consecutive words in a corpus for use in identifying words that frequently appear adjacent to one another.

## Setup

First import TextBlob, then download the NLTK corpora it uses:

In [1]:
import textblob
from textblob import TextBlob

!python -m textblob.download_corpora

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.


## Create a TextBlob

TextBlob is the fundamental class for NLP with the textblob module. Let’s create a TextBlob containing two sentences:

In [2]:
text = 'Today is a beautiful day. Tomorrow looks like bad weather.'

blob = TextBlob(text)
blob

TextBlob("Today is a beautiful day. Tomorrow looks like bad weather.")

## Tokenizing Text into Sentences and Words

Natural language processing often requires tokenizing text before performing other NLP tasks. TextBlob provides convenient properties for accessing the sentences and words in TextBlobs. 

Let’s use the sentences property to get a list of Sentence objects:

In [3]:
blob.sentences

[Sentence("Today is a beautiful day."),
 Sentence("Tomorrow looks like bad weather.")]

The words property returns a WordList object containing a list of Word objects,
representing each word in the TextBlob with the punctuation removed:

In [4]:
blob.words

WordList(['Today', 'is', 'a', 'beautiful', 'day', 'Tomorrow', 'looks', 'like', 'bad', 'weather'])

In [5]:
# get every word count
blob.word_counts

defaultdict(int,
            {'a': 1,
             'bad': 1,
             'beautiful': 1,
             'day': 1,
             'is': 1,
             'like': 1,
             'looks': 1,
             'today': 1,
             'tomorrow': 1,
             'weather': 1})
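
To make the relationship between words and word_counts more concrete, here is a rough standard-library approximation. It is only a sketch: the regular expression and the tokenize_words helper are assumptions made for illustration, not TextBlob’s actual tokenizer, which relies on NLTK.

```python
import re
from collections import Counter

text = 'Today is a beautiful day. Tomorrow looks like bad weather.'

def tokenize_words(s):
    """Hypothetical helper: treat runs of letters, digits and apostrophes as words."""
    return re.findall(r"[A-Za-z0-9']+", s)

words = tokenize_words(text)

# Lowercase before counting, mirroring the lowercase keys shown in word_counts above
counts = Counter(word.lower() for word in words)

print(words)
print(counts['today'], counts['rain'])  # a Counter reports 0 for words it never saw
```

Like the defaultdict shown above, a Counter returns 0 for a missing word instead of raising a KeyError.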

## Parts-of-Speech Tagging

Parts-of-speech (POS) tagging is the process of evaluating words based on their context to determine each word’s part of speech. There are eight primary English parts of speech: nouns, pronouns, verbs, adjectives, adverbs, prepositions, conjunctions and interjections (words that express emotion and that are typically followed by punctuation, like “Yes!” or “Ha!”). Within each category there are many subcategories.

The tags property returns a list of tuples, each containing a word and a string representing its part-of-speech tag:

In [6]:
blob.tags

[('Today', 'NN'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('beautiful', 'JJ'),
 ('day', 'NN'),
 ('Tomorrow', 'NNP'),
 ('looks', 'VBZ'),
 ('like', 'IN'),
 ('bad', 'JJ'),
 ('weather', 'NN')]

Explanation:
* **Today**, **day** and **weather** are tagged as NN (a singular noun or mass noun).
* **is** and **looks** are tagged as VBZ (a third person singular present verb).
* **a** is tagged as DT (a determiner).
* **beautiful** and **bad** are tagged as JJ (an adjective).
* **Tomorrow** is tagged as NNP (a proper singular noun).
* **like** is tagged as IN (a subordinating conjunction or preposition).

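By default, TextBlob uses a PatternTagger to determine parts-of-speech. This class uses the parts-of-speech tagging capabilities of the pattern library. You can supply a different tagger, such as NLTKTagger from textblob.taggers, through the pos_tagger keyword argument of the TextBlob constructor.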
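
Because tags is an ordinary list of (word, tag) tuples, it can be filtered with plain Python. The sketch below hard-codes the tag list from the output above so that it runs without TextBlob; the grouping by tag prefix is an illustrative choice, not something the library does for you.

```python
# Tag list copied from the blob.tags output above
tags = [('Today', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('beautiful', 'JJ'),
        ('day', 'NN'), ('Tomorrow', 'NNP'), ('looks', 'VBZ'), ('like', 'IN'),
        ('bad', 'JJ'), ('weather', 'NN')]

# Penn Treebank noun tags all start with 'NN' (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in tags if tag.startswith('NN')]
# Adjective tags start with 'JJ' and verb tags with 'VB'
adjectives = [word for word, tag in tags if tag.startswith('JJ')]
verbs = [word for word, tag in tags if tag.startswith('VB')]

print(nouns)
print(adjectives)
print(verbs)
```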

## Extracting Noun Phrases

A TextBlob’s noun_phrases property returns a WordList object containing a list of Word objects—one for each noun phrase in the text:

In [7]:
blob.noun_phrases

WordList(['beautiful day', 'tomorrow', 'bad weather'])

Note that a Word representing a noun phrase can contain multiple words. A WordList is an extension of Python’s built-in list type. WordLists provide additional methods for stemming, lemmatizing, singularizing and pluralizing.

In [8]:
blob.tokens

WordList(['Today', 'is', 'a', 'beautiful', 'day', '.', 'Tomorrow', 'looks', 'like', 'bad', 'weather', '.'])

## Sentiment Analysis with TextBlob’s Default Sentiment Analyzer

One of the most common and valuable NLP tasks is sentiment analysis, which determines whether text is positive, neutral or negative. For instance, companies might use this to determine whether people are speaking positively or negatively online about their products. 

Consider the positive word “good” and the negative word “bad.” Just because a sentence contains “good” or “bad” does not mean the sentence’s sentiment necessarily is positive or negative. 

For example, the sentence “The food is not good.” clearly has negative sentiment.

Similarly, the sentence “The movie was not bad.” clearly has positive sentiment, though perhaps not as positive as something like “The movie was excellent!”

Sentiment analysis is a complex machine-learning problem. However, libraries like TextBlob have pretrained machine-learning models for performing sentiment analysis.

### Getting the Sentiment of a TextBlob

A TextBlob’s sentiment property returns a Sentiment object indicating whether the text is positive or negative and whether it’s objective or subjective:

In [9]:
blob

TextBlob("Today is a beautiful day. Tomorrow looks like bad weather.")

In [10]:
blob.sentiment

Sentiment(polarity=0.07500000000000007, subjectivity=0.8333333333333333)

The polarity indicates sentiment with a value from -1.0 (negative) to 1.0 (positive), with 0.0 being neutral. The subjectivity is a value from 0.0 (objective) to 1.0 (subjective). Based on the values for our TextBlob, the overall sentiment is close to neutral, and the text is mostly subjective.

### Getting the polarity and subjectivity from the Sentiment Object

The values displayed above probably provide more precision than you need in most cases, which can detract from the readability of numeric output. The IPython magic %precision allows you to specify the default precision for standalone float objects and float objects in built-in types like lists, dictionaries and tuples.

In [11]:
%precision 3

blob.sentiment.polarity

0.075

In [12]:
blob.sentiment.subjectivity

0.833
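
The %precision magic is only available in IPython and Jupyter. In a plain Python script, a similar effect can be had with a format specifier or the built-in round function. The sketch below hard-codes the two values from the Sentiment output above so that it runs on its own.

```python
# Values copied from the Sentiment output above
polarity = 0.07500000000000007
subjectivity = 0.8333333333333333

# A format specifier changes only how the number is displayed
polarity_text = f'{polarity:.3f}'
subjectivity_text = f'{subjectivity:.3f}'
print(polarity_text, subjectivity_text)

# round returns a new float rounded to the requested number of digits
rounded_subjectivity = round(subjectivity, 3)
print(rounded_subjectivity)
```

Formatting leaves the stored value untouched, so it is usually the safer choice when the number will be used in later calculations.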

### Getting the Sentiment of a Sentence

You also can get the sentiment at the individual sentence level. 

Let’s use the sentences property to get a list of Sentence objects, then iterate through them and display each Sentence’s sentiment property:

In [13]:
for sentence in blob.sentences:
  print(sentence.sentiment)

Sentiment(polarity=0.85, subjectivity=1.0)
Sentiment(polarity=-0.6999999999999998, subjectivity=0.6666666666666666)


This might explain why the entire TextBlob’s sentiment is close to 0.0 (neutral): one sentence is positive (0.85) and the other is negative (-0.6999999999999998).
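
As a quick arithmetic check, the sketch below averages the two per-sentence scores copied from the output above. For this particular text the means land on roughly the whole-text polarity and subjectivity reported earlier. Treat that as an observation about this example rather than a rule: the analyzer scores individual words and phrases, so a plain mean of sentence scores is not guaranteed to match in general.

```python
from statistics import mean

# (polarity, subjectivity) pairs copied from the per-sentence output above
sentence_scores = [(0.85, 1.0), (-0.6999999999999998, 0.6666666666666666)]

overall_polarity = mean(polarity for polarity, _ in sentence_scores)
overall_subjectivity = mean(subjectivity for _, subjectivity in sentence_scores)

print(f'{overall_polarity:.3f} {overall_subjectivity:.3f}')
```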

## Sentiment Analysis with the NaiveBayesAnalyzer

By default, a TextBlob and the Sentences and Words you get from it determine sentiment using a PatternAnalyzer, which uses the same sentiment analysis techniques as in the Pattern library. 

The TextBlob library also comes with a NaiveBayesAnalyzer which was trained on a database of movie reviews. Naive Bayes is a commonly used machine learning text classification algorithm.

In [14]:
from textblob.sentiments import NaiveBayesAnalyzer

blob = TextBlob(text, analyzer=NaiveBayesAnalyzer())
blob

TextBlob("Today is a beautiful day. Tomorrow looks like bad weather.")

In [15]:
blob.sentiment

Sentiment(classification='neg', p_pos=0.47662917962091056, p_neg=0.5233708203790892)

The overall sentiment is classified as negative (classification='neg'). The Sentiment object’s p_pos indicates that the TextBlob is 47.66% positive, and its p_neg indicates that the TextBlob is 52.34% negative. Since the overall sentiment is just slightly more negative, we’d probably view this TextBlob’s sentiment as neutral overall.

Now, let’s get the sentiment of each Sentence:

In [16]:
for sentence in blob.sentences:
  print(sentence.sentiment)

Sentiment(classification='pos', p_pos=0.8117563121751951, p_neg=0.18824368782480477)
Sentiment(classification='neg', p_pos=0.174363226578349, p_neg=0.8256367734216521)


Rather than polarity and subjectivity, the Sentiment objects we get from the NaiveBayesAnalyzer contain a classification, either 'pos' (positive) or 'neg' (negative), along with p_pos and p_neg, the estimated probabilities (from 0.0 to 1.0) that the text is positive or negative.
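
To give a feel for where a classification and a pair of probabilities can come from, here is a minimal multinomial Naive Bayes sketch written with the standard library only. It is not TextBlob’s implementation: the four training sentences, the whitespace tokenizer and the add-one smoothing are all assumptions chosen to keep the example small.

```python
import math
from collections import Counter

# Hypothetical toy training data; the real analyzer is trained on movie reviews
train = [
    ('a beautiful and moving film', 'pos'),
    ('great acting and a great story', 'pos'),
    ('bad plot and terrible acting', 'neg'),
    ('a boring and bad film', 'neg'),
]

def tokenize(text):
    return text.lower().split()

# Count documents and words per class
doc_counts = Counter()
word_counts = {'pos': Counter(), 'neg': Counter()}
for text, label in train:
    doc_counts[label] += 1
    word_counts[label].update(tokenize(text))

vocab = set(word_counts['pos']) | set(word_counts['neg'])

def classify(text):
    """Return (label, probabilities) using add-one smoothed word likelihoods."""
    log_scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        # Start from the log prior of the class
        score = math.log(doc_counts[label] / sum(doc_counts.values()))
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # Turn log scores into probabilities that sum to 1
    highest = max(log_scores.values())
    unnormalized = {label: math.exp(s - highest) for label, s in log_scores.items()}
    norm = sum(unnormalized.values())
    probs = {label: value / norm for label, value in unnormalized.items()}
    return max(probs, key=probs.get), probs

print(classify('a beautiful story'))
print(classify('terrible boring plot'))
```

The returned label plays the role of classification, and the two entries of the probabilities dictionary correspond loosely to p_pos and p_neg.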

## Language Detection and Translation

Inter-language translation is also great for people traveling to foreign countries. They can use translation apps to translate menus, road signs and more. There are even efforts at live speech translation so that you’ll be able to converse in real time with people who do not know your natural language. Some smartphones can now work together with in-ear headphones to provide near-live translation of many languages.

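The TextBlob library uses Google Translate to detect a text’s language and translate TextBlobs, Sentences and Words into other languages. Both features need an internet connection, and they were deprecated in TextBlob 0.16 and removed in later releases, so the cells in this section require an older TextBlob version. Let’s use the detect_language method to detect the language of the text we’re manipulating ('en' is English):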

In [17]:
blob.detect_language()

'en'

Next, let’s use the translate method to translate the text to Spanish ('es') then detect the language on the result. The to keyword argument specifies the target language.

In [18]:
spanish = blob.translate(to='es')
spanish

TextBlob("Hoy es un hermoso dia. Mañana parece mal tiempo.")

In [19]:
spanish.detect_language()

'es'

Next, let’s translate our TextBlob to simplified Chinese (specified as 'zh' or 'zh-CN') then detect the language on the result:

In [20]:
chinese = blob.translate(to='zh')
chinese

TextBlob("今天是美好的一天。明天看起来天气不好。")

In [21]:
chinese.detect_language()

'zh-CN'

In [22]:
hindi = blob.translate(to='hi')
hindi

TextBlob("आज का दिन बहुत ही अच्छा है। कल खराब मौसम की तरह लग रहा है।")

In [23]:
hindi.detect_language()

'hi'

In each of the preceding cases, Google Translate automatically detects the source language.
You can specify a source language explicitly by passing the from_lang keyword argument to
the translate method.

In [24]:
chinese = blob.translate(from_lang='en', to='zh')
chinese

TextBlob("今天是美好的一天。明天看起来天气不好。")

In [25]:
hindi = blob.translate(from_lang='en', to='hi')
hindi

TextBlob("आज का दिन बहुत ही अच्छा है। कल खराब मौसम की तरह लग रहा है।")

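Calling translate without arguments translates from the detected source language to English. Note that translate returns a new TextBlob and leaves the original unchanged, so the cells below, which end by evaluating the original variable, display the untranslated text: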

In [26]:
chinese.translate()
chinese

TextBlob("今天是美好的一天。明天看起来天气不好。")

In [27]:
hindi.translate()
hindi

TextBlob("आज का दिन बहुत ही अच्छा है। कल खराब मौसम की तरह लग रहा है।")

## Inflection: Pluralization and Singularization

Inflections are different forms of the same words, such as singular and plural (like “person”
and “people”) and different verb tenses (like “run” and “ran”). When you’re calculating word
frequencies, you might first want to convert all inflected words to the same form for more
accurate word frequencies. Words and WordLists each support converting words to their
singular or plural forms.

Let’s pluralize and singularize a couple of Word objects.

In [28]:
from textblob import Word

index = Word('index')
index

'index'

In [29]:
index.pluralize()

'indices'

In [0]:
cacti = Word('cacti')

In [31]:
cacti.singularize()

'cactus'

Pluralizing and singularizing are sophisticated tasks which, as you can see above, are not as
simple as adding or removing an “s” or “es” at the end of a word.

In [0]:
animals = TextBlob('dog cat fish bird').words

In [33]:
animals.pluralize()

WordList(['dogs', 'cats', 'fish', 'birds'])
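
To see why a simple suffix rule is not enough, consider the naive pluralizer sketched below. The naive_pluralize function is hypothetical and exists only for contrast with the pluralize results shown above.

```python
def naive_pluralize(word):
    """Hypothetical rule of thumb: add 'es' after s, x, z, ch or sh, otherwise add 's'."""
    if word.endswith(('s', 'x', 'z', 'ch', 'sh')):
        return word + 'es'
    return word + 's'

results = {word: naive_pluralize(word) for word in ['dog', 'index', 'fish', 'cactus']}
print(results)
```

The rule handles “dog” but produces “indexes”, “fishes” and “cactuses”, whereas the TextBlob cells above return “indices” for “index” and leave “fish” unchanged. Irregular forms like these have to come from extra rules or lookup tables.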

## Spell Checking and Correction

You can check a Word’s spelling with its spellcheck method, which returns a list of tuples
containing possible correct spellings and a confidence value. Let’s assume we meant to type
the word “they” but we misspelled it as “theyr.” The spell checking results show two possible
corrections with the word 'they' having the highest confidence value:

In [34]:
%precision 2

'%.2f'

In [36]:
word = Word('theyr')
word.spellcheck()

[('they', 0.57), ('their', 0.43)]

TextBlobs, Sentences and Words all have a correct method that you can call to correct
spelling. Calling correct on a Word returns the correctly spelled word that has the highest
confidence value (as returned by spellcheck):

In [37]:
# chooses word with the highest confidence value
word.correct()

'they'

Calling correct on a TextBlob or Sentence checks the spelling of each word. For each
incorrect word, correct replaces it with the correctly spelled one that has the highest
confidence value:

In [38]:
sentence = TextBlob('Ths sentense has missplled wrds.')
sentence.correct()

TextBlob("The sentence has misspelled words.")
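
TextBlob’s documentation describes its spelling correction as based on Peter Norvig’s “How to Write a Spelling Corrector”. The sketch below illustrates that general idea with the standard library: generate every string one edit away from the input, keep the ones found in a word-frequency table, and rank them by relative frequency. The WORD_COUNTS table is made up for illustration, so the confidence values it produces are not TextBlob’s.

```python
# Hypothetical word counts standing in for a real frequency dictionary
WORD_COUNTS = {'they': 60, 'their': 40, 'the': 500, 'there': 30}
LETTERS = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    """Return every string one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def spellcheck(word):
    """Rank known words one edit away from word by relative frequency."""
    if word in WORD_COUNTS:
        return [(word, 1.0)]  # already a known word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    total = sum(WORD_COUNTS[w] for w in candidates)
    ranked = sorted(candidates, key=WORD_COUNTS.get, reverse=True)
    return [(w, WORD_COUNTS[w] / total) for w in ranked]

suggestions = spellcheck('theyr')
print(suggestions)
```

Only “they” (delete the final “r”) and “their” (replace “y” with “i”) are one edit away from “theyr”, so the far more frequent “the” never becomes a candidate.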

## Normalization: Stemming and Lemmatization

Stemming removes a prefix or suffix from a word leaving only a stem, which may or may
not be a real word. Lemmatization is similar, but factors in the word’s part of speech and
meaning and results in a real word.

Stemming and lemmatization are normalization operations, in which you prepare words
for analysis. 

For example, before calculating statistics on words in a body of text, you might
convert all words to lowercase so that capitalized and lowercase words are not treated
differently. Sometimes, you might want to use a word’s root to represent the word’s many
forms. 

For example, in a given application, you might want to treat all of the following words
as “program”: program, programs, programmer, programming and programmed.

Words and WordLists each support stemming and lemmatization via the methods stem and
lemmatize. Let’s use both on a Word:

In [39]:
word = Word('varieties')
word.stem()

'varieti'

In [40]:
word.lemmatize()

'variety'
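
The stem 'varieti' looks odd until you see the kind of rule that produces it. The sketch below implements only the first plural-handling step (step 1a) of the classic Porter stemming algorithm, which rewrites a trailing “ies” as “i”. It assumes that a Porter-style stemmer is in use and leaves out all later steps, so it is not a substitute for the stem method.

```python
def porter_step_1a(word):
    """Apply Porter step 1a: SSES -> SS, IES -> I, SS -> SS, S -> (nothing)."""
    if word.endswith('sses'):
        return word[:-2]
    if word.endswith('ies'):
        return word[:-2]
    if word.endswith('ss'):
        return word
    if word.endswith('s'):
        return word[:-1]
    return word

stems = [porter_step_1a(w) for w in ['varieties', 'caresses', 'caress', 'cats']]
print(stems)
```

Rules like these never consult a dictionary, which is why a stem need not be a real word. Lemmatization, by contrast, looks the word up (in WordNet, for TextBlob) and therefore returns “variety”.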

## Word Frequencies

Various techniques for detecting similarity between documents rely on word frequencies. As
you’ll see here, TextBlob automatically counts word frequencies. 

First, let’s load the ebook
for Shakespeare’s Romeo and Juliet into a TextBlob. To do so, we’ll use the Path class
from the Python Standard Library’s pathlib module:

In [51]:
from pathlib import Path

# Read the whole play into a string; this assumes romeo-and-juliet.txt
# is in the current working directory
t = Path('romeo-and-juliet.txt').read_text()
print(type(t))

blob = TextBlob(t)

<class 'str'>


You can access the word frequencies through the TextBlob’s word_counts dictionary.
Let’s get the counts of several words in the play:

In [52]:
blob.word_counts['juliet']

68

In [53]:
blob.word_counts['romeo']

158

In [54]:
blob.word_counts['thou']

278

If you already have tokenized a TextBlob into a WordList, you can count specific words in
the list via the count method:

In [55]:
blob.words.count('joy')

14

In [58]:
blob.noun_phrases.count('lady capulet')

0

## Getting Definitions, Synonyms and Antonyms from WordNet

WordNet is a word database created by Princeton University. The TextBlob library uses
the NLTK library’s WordNet interface, enabling you to look up word definitions, and get
synonyms and antonyms.

In [0]:
happy = Word('happy')

The Word class’s definitions property returns a list of all the word’s definitions in the
WordNet database:

In [60]:
# Getting Definitions
happy.definitions

['enjoying or showing or marked by joy or pleasure',
 'marked by good fortune',
 'eagerly disposed to act or to be of service',
 'well expressed and to the point']

You can get a Word’s synsets—that is, its sets of synonyms—via the synsets property.

In [61]:
# Getting Synonyms
happy.synsets

[Synset('happy.a.01'),
 Synset('felicitous.s.02'),
 Synset('glad.s.02'),
 Synset('happy.s.04')]

You can iterate through the synsets list to find the original word’s synonyms. Each Synset
has a lemmas method that returns a list of Lemma objects representing the synonyms. A
Lemma’s name method returns the synonymous word as a string. 

In the following code, for
each Synset in the synsets list, the nested for loop iterates through that Synset’s
Lemmas (if any). Then we add the synonym to the set named synonyms. We used a set
collection because it automatically eliminates any duplicates we add to it:

In [63]:
synonyms = set()

for synset in happy.synsets:
  for lemma in synset.lemmas():
    synonyms.add(lemma.name())

synonyms    

{'felicitous', 'glad', 'happy', 'well-chosen'}

If the word represented by a Lemma has antonyms in the WordNet database, invoking the
Lemma’s antonyms method returns a list of Lemmas representing the antonyms (or an empty
list if there are no antonyms in the database).

In [64]:
# Getting Antonyms
lemmas = happy.synsets[0].lemmas()
lemmas

[Lemma('happy.a.01.happy')]

In this case, lemmas returned a list of one Lemma element. We can now check whether the
database has any corresponding antonyms for that Lemma:

In [65]:
lemmas[0].antonyms()

[Lemma('unhappy.a.01.unhappy')]

The result is a list of Lemmas representing the antonym(s). Here, we see that the one antonym for 'happy' in the database is 'unhappy'.

## Deleting Stop Words

Stop words are common words in text that are often removed from text before analyzing it
because they typically do not provide useful information.

In [0]:
import nltk

In [67]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

First import stopwords from the nltk.corpus module, then call its words method to load the 'english' stop words list:

In [0]:
from nltk.corpus import stopwords

stops = stopwords.words('english')

In [0]:
blob = TextBlob('Today is a beautiful day.')

In [70]:
[word for word in blob.words if word not in stops]

['Today', 'beautiful', 'day']
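
Stop-word removal is often combined with word counting. The sketch below shows that combination with the standard library only. The stops set here is a deliberately tiny, hypothetical stand-in for NLTK’s much longer English list, and the word list is copied from the earlier blob.words output.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word set; NLTK's English list is far longer
stops = {'a', 'an', 'the', 'is', 'like'}

# Words copied from the blob.words output earlier in this notebook
words = ['Today', 'is', 'a', 'beautiful', 'day',
         'Tomorrow', 'looks', 'like', 'bad', 'weather']

# Compare in lowercase so a capitalized stop word such as 'The' is removed too
content_words = [word for word in words if word.lower() not in stops]
counts = Counter(word.lower() for word in content_words)

print(content_words)
print(counts.most_common(3))
```

The lowercase comparison matters because stop-word lists are typically all lowercase: a test such as word not in stops would keep a sentence-initial “The”.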

## n-grams

An n-gram is a sequence of n text items, such as letters in words or words in a sentence. In natural language processing, n-grams can be used to identify letters or words that frequently appear adjacent to one another.

For text-based user input, this can help predict the next letter or word a user will type, such as when completing items in IPython with tab completion or when entering a message to a friend in your favorite smartphone messaging app.

For speech-to-text, n-grams might be used to improve the quality of the transcription. N-grams are a form of co-occurrence in which words or letters appear near each other in a body of text.

A TextBlob’s ngrams method produces n-grams of three words (trigrams) by default:

In [71]:
text = 'Today is a beautiful day. Tomorrow looks like bad weather.'
blob = TextBlob(text)
blob.ngrams()

[WordList(['Today', 'is', 'a']),
 WordList(['is', 'a', 'beautiful']),
 WordList(['a', 'beautiful', 'day']),
 WordList(['beautiful', 'day', 'Tomorrow']),
 WordList(['day', 'Tomorrow', 'looks']),
 WordList(['Tomorrow', 'looks', 'like']),
 WordList(['looks', 'like', 'bad']),
 WordList(['like', 'bad', 'weather'])]

The following produces n-grams consisting of five words:

In [72]:
blob.ngrams(n=5)

[WordList(['Today', 'is', 'a', 'beautiful', 'day']),
 WordList(['is', 'a', 'beautiful', 'day', 'Tomorrow']),
 WordList(['a', 'beautiful', 'day', 'Tomorrow', 'looks']),
 WordList(['beautiful', 'day', 'Tomorrow', 'looks', 'like']),
 WordList(['day', 'Tomorrow', 'looks', 'like', 'bad']),
 WordList(['Tomorrow', 'looks', 'like', 'bad', 'weather'])]
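
The outputs above are consistent with a simple sliding window over the word list. The sketch below reproduces that idea in plain Python. The ngrams function here is a standalone illustration, not TextBlob’s own code, and it returns ordinary lists rather than WordLists.

```python
def ngrams(words, n=3):
    """Return every run of n consecutive items from words, in order."""
    return [words[i:i + n] for i in range(len(words) - n + 1)]

# Words copied from the blob.words output earlier in this notebook
words = ['Today', 'is', 'a', 'beautiful', 'day',
         'Tomorrow', 'looks', 'like', 'bad', 'weather']

trigrams = ngrams(words)
five_grams = ngrams(words, n=5)

print(len(trigrams), trigrams[0])
print(len(five_grams), five_grams[-1])
```

A list of k words yields k - n + 1 n-grams, which is why the ten-word text above gives eight trigrams and six five-word n-grams.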