
# TextBlob

TextBlob is an object-oriented NLP text-processing library built on the NLTK and pattern NLP libraries. It simplifies many of their capabilities. Some of the NLP tasks TextBlob can perform include:

* **Tokenization**: splitting text into pieces called tokens, which are meaningful units, such as words and numbers.
* **Parts-of-speech (POS) tagging**: identifying each word’s part of speech, such as noun, verb or adjective.
* **Noun phrase extraction**: locating groups of words that represent nouns, such as “red brick factory.”
* **Sentiment analysis**: determining whether text has positive, neutral or negative sentiment.
* **Inter-language translation and language detection**, powered by Google Translate.
* **Inflection**: pluralizing and singularizing words. There are other aspects of inflection that are not part of TextBlob.
* **Spell checking and spelling correction**.
* **Stemming**: reducing words to their stems by removing prefixes or suffixes. For example, the stem of “varieties” is “varieti.”
* **Lemmatization**: like stemming, but produces real words based on the original words’ context. For example, the lemmatized form of “varieties” is “variety.”
* **Word frequencies**: determining how often each word appears in a corpus.
* **WordNet integration** for finding word definitions, synonyms and antonyms.
* **Stop word elimination**: removing common words, such as a, an, the, I, we, you and more, to analyze the important words in a corpus.
* **n-grams**: producing sets of consecutive words in a corpus for use in identifying words that frequently appear adjacent to one another.
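
To make two of the tasks above concrete, here is a minimal plain-Python sketch of stop-word elimination and n-gram generation. It does not use TextBlob and is not how TextBlob implements these features. The `STOP_WORDS` set and both function names are hypothetical, chosen only for illustration.

```python
# Tiny hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {'a', 'an', 'the', 'i', 'we', 'you', 'is'}


def remove_stop_words(words):
    """Keep only the words that are not in STOP_WORDS (case-insensitive)."""
    return [word for word in words if word.lower() not in STOP_WORDS]


def ngrams(words, n=3):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


words = ['Today', 'is', 'a', 'beautiful', 'day']
print(remove_stop_words(words))
print(ngrams(words, n=3))
```

With five input words and `n=3`, the sketch yields three overlapping tuples, one starting at each of the first three words.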

## Setup

Import TextBlob, then download the NLTK corpora it uses:

In [1]:
import textblob
from textblob import TextBlob

!python -m textblob.download_corpora

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.


## Create a TextBlob

TextBlob is the fundamental class for NLP with the textblob module. Let’s create a TextBlob containing two sentences:

In [2]:
text = 'Today is a beautiful day. Tomorrow looks like bad weather.'

blob = TextBlob(text)
blob

TextBlob("Today is a beautiful day. Tomorrow looks like bad weather.")

## Tokenizing Text into Sentences and Words

Natural language processing often requires tokenizing text before performing other NLP tasks. TextBlob provides convenient properties for accessing the sentences and words in TextBlobs. 

Let’s use the sentences property to get a list of Sentence objects:

In [3]:
blob.sentences

[Sentence("Today is a beautiful day."),
 Sentence("Tomorrow looks like bad weather.")]
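
As a rough illustration of what sentence tokenization involves, the sketch below splits text after sentence-ending punctuation using only the standard library. It is a deliberate simplification (it would mis-split abbreviations such as "Dr."), not the tokenizer TextBlob uses, and `split_sentences` is a hypothetical helper.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]


text = 'Today is a beautiful day. Tomorrow looks like bad weather.'
sentences = split_sentences(text)
print(sentences)
```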

The words property returns a WordList object containing a list of Word objects,
representing each word in the TextBlob with the punctuation removed:

In [4]:
blob.words

WordList(['Today', 'is', 'a', 'beautiful', 'day', 'Tomorrow', 'looks', 'like', 'bad', 'weather'])

In [5]:
# Count how often each word appears (keys are lowercased)
blob.word_counts

defaultdict(int,
            {'a': 1,
             'bad': 1,
             'beautiful': 1,
             'day': 1,
             'is': 1,
             'like': 1,
             'looks': 1,
             'today': 1,
             'tomorrow': 1,
             'weather': 1})
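
The lowercase keys in the output above show that counting is case-insensitive. A comparable result can be sketched with `collections.Counter`. The regular expression used to pick out words is an assumption made for this example, not TextBlob's tokenizer.

```python
import re
from collections import Counter

text = 'Today is a beautiful day. Tomorrow looks like bad weather.'

# Lowercase first so 'Today' and 'today' count as the same word,
# then keep runs of letters and apostrophes, dropping punctuation.
words = re.findall(r"[a-z']+", text.lower())
counts = Counter(words)

print(counts['today'])
print(counts.most_common(3))
```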

## Parts-of-Speech Tagging

Parts-of-speech (POS) tagging is the process of evaluating words based on their context to determine each word’s part of speech. There are eight primary English parts of speech: nouns, pronouns, verbs, adjectives, adverbs, prepositions, conjunctions and interjections (words that express emotion and that are typically followed by punctuation, like “Yes!” or “Ha!”). Within each category there are many subcategories.

The tags property returns a list of tuples, each containing a word and a string representing its part-of-speech tag:

In [6]:
blob.tags

[('Today', 'NN'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('beautiful', 'JJ'),
 ('day', 'NN'),
 ('Tomorrow', 'NNP'),
 ('looks', 'VBZ'),
 ('like', 'IN'),
 ('bad', 'JJ'),
 ('weather', 'NN')]

Explanation:
* **Today**, **day** and **weather** are tagged as NN, a singular noun or mass noun.
* **is** and **looks** are tagged as VBZ, a third-person singular present verb.
* **a** is tagged as DT, a determiner.
* **beautiful** and **bad** are tagged as JJ, an adjective.
* **Tomorrow** is tagged as NNP, a proper singular noun.
* **like** is tagged as IN, a subordinating conjunction or preposition.
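
Because the tags come back as ordinary (word, tag) tuples, they are easy to post-process. The sketch below groups words by tag. The list literal is copied from the output above so the example runs without TextBlob, and `words_by_tag` is a name chosen for this illustration.

```python
from collections import defaultdict

# (word, tag) pairs copied from the blob.tags output shown earlier.
tags = [
    ('Today', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('beautiful', 'JJ'),
    ('day', 'NN'), ('Tomorrow', 'NNP'), ('looks', 'VBZ'), ('like', 'IN'),
    ('bad', 'JJ'), ('weather', 'NN'),
]

# Map each tag to the words that received it, in order of appearance.
words_by_tag = defaultdict(list)
for word, tag in tags:
    words_by_tag[tag].append(word)

for tag, words in words_by_tag.items():
    print(f'{tag}: {words}')
```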

By default, TextBlob uses a PatternTagger to determine parts of speech.
This class uses the parts-of-speech tagging capabilities of the pattern library.

## Extracting Noun Phrases

A TextBlob’s noun_phrases property returns a WordList object containing a list of Word objects—one for each noun phrase in the text:

In [7]:
blob.noun_phrases

WordList(['beautiful day', 'tomorrow', 'bad weather'])

Note that a Word representing a noun phrase can contain multiple words. A WordList is an extension of Python’s built-in list type. WordLists provide additional methods for stemming, lemmatizing, singularizing and pluralizing.
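
To illustrate what extending Python's list type means, here is a minimal sketch of a list subclass that adds one extra method. `MiniWordList` and its `lower` method are hypothetical stand-ins written for this example. They are not TextBlob's WordList.

```python
class MiniWordList(list):
    """A list of strings with one extra convenience method."""

    def lower(self):
        """Return a new MiniWordList with every item lowercased."""
        return MiniWordList(word.lower() for word in self)


phrases = MiniWordList(['Beautiful Day', 'Tomorrow', 'Bad Weather'])
lowered = phrases.lower()

print(lowered)                   # the extra method
print(len(phrases), phrases[0])  # ordinary list operations still work
```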

In [8]:
blob.tokens

WordList(['Today', 'is', 'a', 'beautiful', 'day', '.', 'Tomorrow', 'looks', 'like', 'bad', 'weather', '.'])

## Sentiment Analysis with TextBlob’s Default Sentiment Analyzer

One of the most common and valuable NLP tasks is sentiment analysis, which determines whether text is positive, neutral or negative. For instance, companies might use this to determine whether people are speaking positively or negatively online about their products. 

Consider the positive word “good” and the negative word “bad.” Just because a sentence contains “good” or “bad” does not mean the sentence’s sentiment necessarily is positive or negative. 

For example, the sentence “The food is not good.” clearly has negative sentiment.

Similarly, the sentence “The movie was not bad.” clearly has positive sentiment, though perhaps not as positive as something like “The movie was excellent!”

Sentiment analysis is a complex machine-learning problem. However, libraries like TextBlob have pretrained machine-learning models for performing sentiment analysis.

### Getting the Sentiment of a TextBlob

A TextBlob’s sentiment property returns a Sentiment object indicating whether the text is positive or negative and whether it’s objective or subjective:

In [9]:
blob

TextBlob("Today is a beautiful day. Tomorrow looks like bad weather.")

In [10]:
blob.sentiment

Sentiment(polarity=0.07500000000000007, subjectivity=0.8333333333333333)

The polarity indicates sentiment with a value from -1.0 (negative) to 1.0 (positive), with 0.0 being neutral. The subjectivity is a value from 0.0 (objective) to 1.0 (subjective). Based on the values for our TextBlob, the overall sentiment is close to neutral, and the text is mostly subjective.
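
If a label is more convenient than a number, the polarity can be bucketed. The sketch below is one possible approach. The `describe_polarity` helper and its 0.1 threshold are assumptions made for this example, not part of TextBlob.

```python
def describe_polarity(polarity, threshold=0.1):
    """Map a polarity in [-1.0, 1.0] to a coarse label.

    Values within +/- threshold of 0.0 are treated as neutral.
    The threshold is an arbitrary choice for illustration.
    """
    if polarity > threshold:
        return 'positive'
    if polarity < -threshold:
        return 'negative'
    return 'neutral'


# Polarity copied from the blob.sentiment output above.
print(describe_polarity(0.07500000000000007))
```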

### Getting the polarity and subjectivity from the Sentiment Object

The values displayed above probably provide more precision than you need in most cases, which can detract from the output’s readability. The IPython magic %precision lets you specify the default precision for standalone float objects and for float objects in built-in types like lists, dictionaries and tuples.

In [13]:
%precision 3

blob.sentiment.polarity

0.075

In [14]:
blob.sentiment.subjectivity

0.833
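
The %precision magic is specific to IPython. In a plain Python script, similar rounding for display can be done with format specifiers or `round`. The values below are copied from the earlier output so the sketch runs without TextBlob.

```python
# Values copied from the blob.sentiment output shown earlier.
polarity = 0.07500000000000007
subjectivity = 0.8333333333333333

# Format specifiers change only how the number is displayed.
polarity_text = f'{polarity:.3f}'
subjectivity_text = f'{subjectivity:.3f}'
print(polarity_text, subjectivity_text)

# round() returns a new float rather than a string.
rounded = round(subjectivity, 3)
print(rounded)
```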

### Getting the Sentiment of a Sentence

You also can get the sentiment at the individual sentence level. 

Let’s use the sentences property to get a list of Sentence objects, then iterate through them and display each Sentence’s sentiment property:

In [15]:
for sentence in blob.sentences:
    print(sentence.sentiment)

Sentiment(polarity=0.85, subjectivity=1.0)
Sentiment(polarity=-0.6999999999999998, subjectivity=0.6666666666666666)


This might explain why the entire TextBlob’s sentiment is close to 0.0 (neutral): one sentence is positive (0.85) and the other is negative (about -0.70).
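
A quick check of that intuition: averaging the two per-sentence polarities lands near the overall value reported earlier. Using a simple mean is an assumption made for this sketch, not a statement of how TextBlob combines scores internally.

```python
from statistics import mean

# Per-sentence polarities copied from the output above.
sentence_polarities = [0.85, -0.6999999999999998]

# The strongly positive and strongly negative scores nearly cancel.
overall = mean(sentence_polarities)
print(f'{overall:.3f}')
```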

## Sentiment Analysis with the NaiveBayesAnalyzer

By default, a TextBlob and the Sentences and Words you get from it determine sentiment using a PatternAnalyzer, which uses the same sentiment analysis techniques as in the Pattern library. 

The TextBlob library also comes with a NaiveBayesAnalyzer, which was trained on a database of movie reviews. Naive Bayes is a commonly used machine-learning algorithm for text classification.

In [16]:
from textblob.sentiments import NaiveBayesAnalyzer

blob = TextBlob(text, analyzer=NaiveBayesAnalyzer())
blob

TextBlob("Today is a beautiful day. Tomorrow looks like bad weather.")

In [17]:
blob.sentiment

Sentiment(classification='neg', p_pos=0.47662917962091056, p_neg=0.5233708203790892)

The overall sentiment is classified as negative (classification='neg'). The Sentiment object’s p_pos indicates that the TextBlob is 47.66% positive, and its p_neg indicates that the TextBlob is 52.34% negative. Since the overall sentiment is only slightly more negative than positive, we’d probably view this TextBlob’s sentiment as neutral overall.

Now, let’s get the sentiment of each Sentence:

In [18]:
for sentence in blob.sentences:
    print(sentence.sentiment)

Sentiment(classification='pos', p_pos=0.8117563121751951, p_neg=0.18824368782480477)
Sentiment(classification='neg', p_pos=0.174363226578349, p_neg=0.8256367734216521)


Rather than polarity and subjectivity, the Sentiment objects we get from the NaiveBayesAnalyzer contain a classification, either 'pos' (positive) or 'neg' (negative), along with p_pos (how positive) and p_neg (how negative). Each is a value from 0.0 to 1.0, and in the outputs above the two sum to 1.0.
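
To give a feel for how a Naive Bayes classifier can arrive at a pair of values like p_pos and p_neg, here is a toy sketch in plain Python. It is not TextBlob's NaiveBayesAnalyzer. The four training sentences, the `train` and `predict` functions, and the use of add-one smoothing are all assumptions made for this illustration.

```python
import math
from collections import Counter

# Tiny hand-made training set (hypothetical, for illustration only).
TRAIN = [
    ('a beautiful and wonderful film', 'pos'),
    ('great acting and a beautiful story', 'pos'),
    ('a bad and boring film', 'neg'),
    ('terrible acting and a bad story', 'neg'),
]


def train(samples):
    """Count documents per label and word occurrences per label."""
    label_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for text, label in samples:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.lower().split():
            counts[word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def predict(text, model):
    """Return {label: probability} using add-one (Laplace) smoothing."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label, doc_count in label_counts.items():
        counts = word_counts[label]
        total_words = sum(counts.values())
        score = math.log(doc_count / total_docs)  # log prior
        for word in text.lower().split():
            if word in vocabulary:  # ignore words never seen in training
                score += math.log(
                    (counts[word] + 1) / (total_words + len(vocabulary)))
        log_scores[label] = score
    # Normalize the log scores into probabilities that sum to 1.
    peak = max(log_scores.values())
    exp_scores = {lab: math.exp(s - peak) for lab, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {lab: value / norm for lab, value in exp_scores.items()}


model = train(TRAIN)
probs = predict('a beautiful day', model)
print(max(probs, key=probs.get), probs)
```

The word "beautiful" appears only in the positive training sentences, so it pulls the result toward 'pos', while "a" is equally common in both classes and "day" is ignored because it never appears in the training data.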