From https://machinelearninggeek.com/text-analytics-for-beginner-using-textblob/

In [1]:
from textblob import TextBlob

# Tokenization

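Tokenization is the process of splitting a text document into smaller pieces, known as tokens. TextBlob's words property returns word-level tokens with punctuation and whitespace stripped out, while the sentences property splits the text into sentences.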

In [2]:
# Create TextBlob object
text = TextBlob("I want to be remembered not only as an entertainer but as a person who cared a lot, and I gave the best that I could. I tried to be the best role model that I possibly could.")

# Print the tokens
print(text.words)

['I', 'want', 'to', 'be', 'remembered', 'not', 'only', 'as', 'an', 'entertainer', 'but', 'as', 'a', 'person', 'who', 'cared', 'a', 'lot', 'and', 'I', 'gave', 'the', 'best', 'that', 'I', 'could', 'I', 'tried', 'to', 'be', 'the', 'best', 'role', 'model', 'that', 'I', 'possibly', 'could']
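
The word list above can be approximated with a regular expression. The following is only a rough sketch using Python's standard re module, not TextBlob's actual tokenizer. The pattern \w+ is an assumption that happens to work for this text; it would split contractions such as "can't" differently.

```python
import re

text = ("I want to be remembered not only as an entertainer but as a person "
        "who cared a lot, and I gave the best that I could. I tried to be the "
        "best role model that I possibly could.")

# Keep runs of word characters; punctuation and whitespace are dropped
tokens = re.findall(r"\w+", text)

print(tokens[:5])
print(len(tokens))
```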


In [3]:
# Print the tokenized sentences
print(text.sentences)

[Sentence("I want to be remembered not only as an entertainer but as a person who cared a lot, and I gave the best that I could."), Sentence("I tried to be the best role model that I possibly could.")]


# Noun Phrases
A noun phrase is a group of words built around a noun (its head) together with that noun's modifiers. It can act as the subject or the object of a sentence.

In [4]:
# Print noun phrases
print(text.noun_phrases)

['role model']


# Part of Speech (POS) Tagging

A part of speech (PoS) describes the grammatical role a word plays in a sentence. For example, a verb identifies an action, a noun names a person or thing, and an adjective describes a noun. Assigning such labels to each token is called PoS tagging.

In [5]:
# Print PoS tags
print(text.tags)

[('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('be', 'VB'), ('remembered', 'VBN'), ('not', 'RB'), ('only', 'RB'), ('as', 'IN'), ('an', 'DT'), ('entertainer', 'NN'), ('but', 'CC'), ('as', 'IN'), ('a', 'DT'), ('person', 'NN'), ('who', 'WP'), ('cared', 'VBD'), ('a', 'DT'), ('lot', 'NN'), ('and', 'CC'), ('I', 'PRP'), ('gave', 'VBD'), ('the', 'DT'), ('best', 'JJS'), ('that', 'IN'), ('I', 'PRP'), ('could', 'MD'), ('I', 'PRP'), ('tried', 'VBD'), ('to', 'TO'), ('be', 'VB'), ('the', 'DT'), ('best', 'JJS'), ('role', 'NN'), ('model', 'NN'), ('that', 'IN'), ('I', 'PRP'), ('possibly', 'RB'), ('could', 'MD')]
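
Because the tags come back as plain (word, tag) tuples, they can be filtered with ordinary Python. The sketch below copies the tags of the second sentence from the output above and relies on the Penn Treebank convention that noun tags start with "NN" and verb tags with "VB". The variable names are illustrative only.

```python
# (word, tag) pairs copied from the second sentence of the output above
tags = [('I', 'PRP'), ('tried', 'VBD'), ('to', 'TO'), ('be', 'VB'),
        ('the', 'DT'), ('best', 'JJS'), ('role', 'NN'), ('model', 'NN'),
        ('that', 'IN'), ('I', 'PRP'), ('possibly', 'RB'), ('could', 'MD')]

# Penn Treebank noun tags start with "NN", verb tags with "VB"
nouns = [word for word, tag in tags if tag.startswith("NN")]
verbs = [word for word, tag in tags if tag.startswith("VB")]

print(nouns)  # ['role', 'model']
print(verbs)  # ['tried', 'be']
```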


# Lemmatization

Lemmatization normalizes text in a linguistically informed way. Instead of simply chopping off suffixes (as stemming does), it uses a vocabulary and morphological analysis to return the dictionary form (lemma) of a word. The "v" argument tells lemmatize() to treat the word as a verb.

In [6]:
# Lemmatize the 16th token ("cared") as a verb
print(text.words[15].lemmatize("v"))

care


In [7]:
# Import word
from textblob import Word

# Create Word object
w = Word("remembered")

# Print lemmatized word
print(w.lemmatize("v"))

remember
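
To make "vocabulary plus morphological analysis" more concrete, here is a minimal sketch of a dictionary-backed verb lemmatizer. It is not the WordNet-based lemmatizer that TextBlob calls through NLTK; the tiny vocabulary, the exception table, the suffix rules, and the function name lemmatize_verb are all hypothetical and chosen for illustration.

```python
# Hypothetical toy data: a real lemmatizer consults a full dictionary
LEMMAS = {"care", "remember", "give", "try"}
IRREGULAR_VERBS = {"gave": "give", "tried": "try"}
# (suffix, replacement) pairs tried in order for verbs
VERB_RULES = [("ies", "y"), ("es", "e"), ("es", ""), ("ed", "e"),
              ("ed", ""), ("ing", "e"), ("ing", ""), ("s", "")]

def lemmatize_verb(word):
    """Return the lemma of a verb, or the word itself if none is found."""
    word = word.lower()
    # Irregular forms are looked up directly
    if word in IRREGULAR_VERBS:
        return IRREGULAR_VERBS[word]
    # Otherwise strip a suffix and accept the result only if it is a known lemma
    for suffix, replacement in VERB_RULES:
        if word.endswith(suffix):
            candidate = word[:len(word) - len(suffix)] + replacement
            if candidate in LEMMAS:
                return candidate
    return word

print(lemmatize_verb("cared"))
print(lemmatize_verb("remembered"))
print(lemmatize_verb("gave"))
```

The vocabulary check is what separates this from plain suffix stripping: "remembered" becomes "remember" rather than "remembere" because only the former is a known lemma.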


# Finding a word and counting its occurrence

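TextBlob has a find() method that searches for a substring and returns its start index, much like Python's str.find(), and a count() method that counts the occurrences of a word. Note that find() matches substrings, so searching for "care" also matches the start of "cared".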

In [8]:
# Find a string
text.find("care") # returns the start index of that string in original text

71

# n-grams
An n-gram is a contiguous sequence of n tokens. The simplest case, the unigram (n = 1), underlies the bag-of-words model, which represents a document by how often each word occurs in it.

In [9]:
# Count the number of times 'I' appears
print(text.words.count('I'))

5
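
For n greater than 1, TextBlob objects also provide an ngrams(n=...) method. The underlying idea can be sketched in plain Python as follows; the helper name ngrams and the whitespace split are assumptions made for this example, not TextBlob's implementation.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "I tried to be the best role model that I possibly could".split()

bigrams = ngrams(tokens, 2)        # pairs of adjacent words
unigram_counts = Counter(tokens)   # bag-of-words style frequencies

print(bigrams[:3])
print(unigram_counts["I"])
```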


# Sentiment Analysis

In TextBlob, the sentiment property returns a namedtuple with two scores: polarity and subjectivity. Polarity lies between -1 and +1: negative values indicate negative sentiment, and positive values indicate positive sentiment. Subjectivity ranges from 0 to 1, where 0 means fully objective and 1 means fully subjective.

TextBlob offers two implementations of sentiment analysis. The default, PatternAnalyzer, is based on the Pattern library; the other, NaiveBayesAnalyzer, is an NLTK classifier trained on a movie reviews corpus.

In [10]:
# Print the polarity and subjectivity
print(text.sentiment)

Sentiment(polarity=0.5, subjectivity=0.65)
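
The two scores are often turned into coarse labels for reporting. The sketch below shows one way to do that; the stand-in Sentiment namedtuple, the describe helper, the neutral band of 0.05, and the 0.5 subjectivity cut-off are all assumptions for illustration and are not defined by TextBlob.

```python
from collections import namedtuple

# Stand-in with the same fields as the tuple TextBlob returns
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])

def describe(sentiment, neutral_band=0.05):
    """Map raw scores to coarse labels (thresholds are arbitrary)."""
    if sentiment.polarity > neutral_band:
        tone = "positive"
    elif sentiment.polarity < -neutral_band:
        tone = "negative"
    else:
        tone = "neutral"
    stance = "subjective" if sentiment.subjectivity >= 0.5 else "objective"
    return tone, stance

print(describe(Sentiment(polarity=0.5, subjectivity=0.65)))
```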


# Spell Correction
TextBlob offers spell correction through the correct() method.

In [11]:
# Create TextBlob object
b = TextBlob("I havv goood speling!")
print(b.correct())

I have good spelling!
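
According to the TextBlob documentation, its spelling correction follows Peter Norvig's "How to Write a Spelling Corrector". The core idea is to generate every string one edit away from the misspelled word and pick the most frequent known candidate. The following is a simplified sketch of that idea, not TextBlob's code; the tiny WORD_COUNTS vocabulary and its frequencies are made up for this example.

```python
import string
from collections import Counter

# Hypothetical toy vocabulary with made-up frequencies
WORD_COUNTS = Counter({"i": 50, "have": 40, "good": 30,
                       "spelling": 5, "hive": 2, "gold": 3})

def edits1(word):
    """All strings one delete, transpose, replace, or insert away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Keep only the words that appear in the vocabulary."""
    return {w for w in words if w in WORD_COUNTS}

def correct_word(word):
    """Prefer the word itself, then known words one edit away."""
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=lambda w: WORD_COUNTS[w])

print([correct_word(w) for w in ["havv", "goood", "speling"]])
```

A word with no known candidate within one edit is returned unchanged, which is why unusual proper nouns can survive correction.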


# Language Detection and Translation
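TextBlob historically offered a detect_language() method for language detection and a translate() method for translating text from one language to another. Both relied on the Google Translate API and therefore required an internet connection. These methods have since been deprecated and removed in newer TextBlob releases, so the cell below is left commented out.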

In [6]:
# from textblob import TextBlob
# # Create TextBlob object
# text = TextBlob("नमस्ते आप कैसे हैं")

# # Detect Language
# print(text.detect_language())

# # Translate into english
# print(text.translate(to='en'))

## This is broken and it's just too much trouble to fix it. 
## It's better to use other libraries for translation, for example deep_translator

# Text Classification using TextBlob
In this section, we focus on text classification, one of the most important NLP techniques. Text classification is useful in many applications, such as document classification, sentiment classification, review rating prediction, spam filtering, support ticket classification, and fake news detection.

# Prepare Dataset
First, we prepare a dataset. Each example is a (sentence, label) tuple, where the label is either 'pos' or 'neg':

In [7]:
# Training examples: (sentence, label) tuples
train = [
    ('I love this sandwich.', 'pos'),
    ('this is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('this is my best work.', 'pos'),
    ("what an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('he is my sworn enemy!', 'neg'),
    ('my boss is horrible.', 'neg')
]

# Held-out examples used only for evaluation
test = [
    ('the beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg')
]

# Train Model
In this section, we create a Naive Bayes classifier using TextBlob. Passing the training data to the constructor trains the model.

In [8]:
# Import NaiveBayes Classifier
from textblob.classifiers import NaiveBayesClassifier

# Perform model training
cl = NaiveBayesClassifier(train)

# Make Prediction
Let’s make a prediction for a given input sentence:

In [9]:
# Make prediction
print(cl.classify("This is an amazing library!"))

pos


In [10]:
print(cl.classify("Gary is a friend of mine."))

neg


# Evaluate Model
Let’s evaluate the model performance using the accuracy method:

In [11]:
# Evaluate the model
cl.accuracy(test) 

0.8333333333333334

The classifier labels 5 of the 6 test sentences correctly, which gives an accuracy of 83.33%.
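
Accuracy is simply the fraction of test sentences whose predicted label matches the expected one. The sketch below recomputes the figure by hand. The predicted list is an inference rather than recorded output: a score of 5/6 implies exactly one miss, and the earlier cell shows "Gary is a friend of mine." being classified as neg. The helper name accuracy is also just illustrative.

```python
def accuracy(predicted, expected):
    """Fraction of predictions that match the expected labels."""
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)

# Labels of the six test sentences, in order
expected = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']
# Assumed predictions: only the "Gary" sentence (index 4) is wrong
predicted = ['pos', 'neg', 'neg', 'pos', 'neg', 'neg']

print(accuracy(predicted, expected))
```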

# Retraining Model
Let’s retrain the model using the update method. First, we will prepare the new dataset and then update the previously trained model.

In [13]:
# Prepare new data
new_data = [('She is my best friend.', 'pos'),
            ("I'm happy to have a new friend.", 'pos'),
            ("Stay thirsty, my friend.", 'pos'),
            ("He ain't from around here.", 'neg')]

# Update model with new data
cl.update(new_data)  # retrain the model with the new examples

# Test the model 
cl.classify("Gary is a friend of mine.")

'pos'

# Calculate Class Probabilities
We can also calculate the probability of each class using the prob_classify(text) method. Note that the cell below trains a fresh classifier on the original training data, so the update from the previous section is not included:

In [14]:
cl = NaiveBayesClassifier(train)
prob_dist = cl.prob_classify("I feel happy this morning.")

print("Positive and Negative Probabilities:", prob_dist.prob("pos"), prob_dist.prob("neg"))
print("Largest Probability:", prob_dist.max())

Positive and Negative Probabilities: 0.9256990307165033 0.07430096928349521
Largest Probability: pos
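
To see where such class probabilities can come from, here is a minimal from-scratch Naive Bayes sketch. It is a multinomial variant with add-one smoothing, which is an assumption made for simplicity and differs from the NLTK-based classifier TextBlob wraps, so its numbers will not match the output above. The function names and the four-sentence training subset are illustrative only.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def train_nb(examples):
    """Count documents per label and word occurrences per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab

def prob_classify(model, text):
    """Return {label: probability}; words outside the vocabulary are ignored."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in tokenize(text):
            if word in vocab:
                # Add-one smoothing keeps unseen words from zeroing the score
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalize in a numerically stable way so the values sum to 1
    top = max(log_scores.values())
    unnorm = {label: math.exp(s - top) for label, s in log_scores.items()}
    z = sum(unnorm.values())
    return {label: v / z for label, v in unnorm.items()}

sample = [('I love this sandwich.', 'pos'),
          ('this is an amazing place!', 'pos'),
          ('I do not like this restaurant', 'neg'),
          ('my boss is horrible.', 'neg')]
model = train_nb(sample)
print(prob_classify(model, "this is an amazing sandwich"))
```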


# Decision Tree Classifier
Let’s train a Decision Tree classifier using TextBlob and evaluate its performance with the accuracy method.

In [15]:
# Import Decision Tree
from textblob.classifiers import DecisionTreeClassifier

# Create Decision Tree Classifier
dt = DecisionTreeClassifier(train)

# Test the model
dt.accuracy(test)

1.0

# Maximum Entropy Classifier
Let’s train a Maximum Entropy classifier using TextBlob and evaluate its performance with the accuracy method.

In [16]:
# Import MaxEntClassifier
from textblob.classifiers import MaxEntClassifier

# Create Maximum Entropy Classifier
me = MaxEntClassifier(train)

# Test the model
print(me.accuracy(test))

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.571
             2          -0.64213        0.571
             3          -0.61362        0.571
             4          -0.58734        0.643
             5          -0.56305        0.857
             6          -0.54057        0.857
             7          -0.51971        0.857
             8          -0.50032        0.929
             9          -0.48226        0.929
            10          -0.46540        0.929
            11          -0.44963        0.929
            12          -0.43486        0.929
            13          -0.42099        1.000
            14          -0.40795        1.000
            15          -0.39567        1.000
            16          -0.38408        1.000
            17          -0.37312        1.000
            18          -0.36275        1.000
            19          -0.35293        1.000
 

# Pros and Cons
TextBlob is built on top of the NLTK and Pattern libraries. It provides a simple, intuitive interface for beginners. It also offers sentiment analysis, spell correction, and easy-to-use text classification; older releases additionally provided language detection and translation (powered by Google Translate).

TextBlob is generally reported to be slower than spaCy but faster than NLTK. It also lacks some NLP features, such as word vectors and dependency parsing.

# Summary
We have performed various NLP operations with TextBlob, including tokenization, noun phrase extraction, PoS tagging, lemmatization, sentiment analysis, and spell correction, looked at language detection and translation, and trained text classifiers using Naive Bayes, Decision Tree, and Maximum Entropy models.