<h1>CSCE 670 Spotlight: TextBlob</h1>

<h2>1. Introduction</h2>
<p>TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for common natural language processing (NLP) tasks such as part-of-speech (POS) tagging, noun phrase extraction, translation, text classification, and sentiment analysis.

TextBlob is built on NLTK and Pattern. It is a simple, easy-to-learn library for small NLP tasks.
</p>

<h2>2. Setup</h2>
<p>In the command prompt, enter the following two commands: <code>pip install -U textblob</code> installs TextBlob, and <code>python -m textblob.download_corpora</code> downloads the corpora that the TextBlob operations need.</p>

In [18]:
from textblob import TextBlob

<h2>3. Features</h2>
<p>TextBlob offers a number of features, and TextBlob objects behave like Python strings. The sample text below is used to demonstrate the NLP tasks that TextBlob provides.</p>
    

In [19]:
blob=TextBlob("Information storage and retrieval, the systematic process of collecting and cataloging data so that they can be located and displayed on request. Computers and data processing techniques have made possible the high-speed, selective retrieval of large amounts of information for government, commercial, and academic purposes.")

<h3>3.1 Tokenization</h3>
<p>TextBlob splits a paragraph into Sentence objects and sentences into Word objects. Tokenization is the basis for most NLP tasks, because the resulting tokens are what later steps use to find patterns in a text.</p>

<h4>Sentence Tokenization</h4>

In [20]:
sentences=blob.sentences
print(sentences)
print("Datatype returned: ",type(sentences))
print("Datatype of each sentence: ",type(sentences[0]))

[Sentence("Information storage and retrieval, the systematic process of collecting and cataloging data so that they can be located and displayed on request."), Sentence("Computers and data processing techniques have made possible the high-speed, selective retrieval of large amounts of information for government, commercial, and academic purposes.")]
Datatype returned:  <class 'list'>
Datatype of each sentence:  <class 'textblob.blob.Sentence'>


<h4>Word Tokenization</h4>

In [21]:
words=blob.words
print("Words:",words)
print("Datatype returned: ",type(words))
print("Datatype of each word: ",type(words[0]))


Words: ['Information', 'storage', 'and', 'retrieval', 'the', 'systematic', 'process', 'of', 'collecting', 'and', 'cataloging', 'data', 'so', 'that', 'they', 'can', 'be', 'located', 'and', 'displayed', 'on', 'request', 'Computers', 'and', 'data', 'processing', 'techniques', 'have', 'made', 'possible', 'the', 'high-speed', 'selective', 'retrieval', 'of', 'large', 'amounts', 'of', 'information', 'for', 'government', 'commercial', 'and', 'academic', 'purposes']
Datatype returned:  <class 'textblob.blob.WordList'>
Datatype of each word:  <class 'textblob.blob.Word'>


<p>It is also possible to get the words belonging to a particular sentence in TextBlob.</p>

In [22]:
words_sentence1=sentences[0].words
print(words_sentence1)

['Information', 'storage', 'and', 'retrieval', 'the', 'systematic', 'process', 'of', 'collecting', 'and', 'cataloging', 'data', 'so', 'that', 'they', 'can', 'be', 'located', 'and', 'displayed', 'on', 'request']


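<h3>3.2 Part of Speech Tagging</h3>
<p>Part-of-speech tagging returns the part of speech of each word in the TextBlob object, for instance whether a word is a noun, verb, or pronoun. The tags in the output follow the Penn Treebank convention (NN for a singular noun, VBG for a gerund, JJ for an adjective, and so on).</p>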

In [23]:
pos_blob=blob.tags
print(pos_blob)

[('Information', 'NN'), ('storage', 'NN'), ('and', 'CC'), ('retrieval', 'NN'), ('the', 'DT'), ('systematic', 'JJ'), ('process', 'NN'), ('of', 'IN'), ('collecting', 'VBG'), ('and', 'CC'), ('cataloging', 'VBG'), ('data', 'NNS'), ('so', 'RB'), ('that', 'IN'), ('they', 'PRP'), ('can', 'MD'), ('be', 'VB'), ('located', 'VBN'), ('and', 'CC'), ('displayed', 'VBN'), ('on', 'IN'), ('request', 'NN'), ('Computers', 'NNS'), ('and', 'CC'), ('data', 'NNS'), ('processing', 'VBG'), ('techniques', 'NNS'), ('have', 'VBP'), ('made', 'VBN'), ('possible', 'JJ'), ('the', 'DT'), ('high-speed', 'NN'), ('selective', 'JJ'), ('retrieval', 'NN'), ('of', 'IN'), ('large', 'JJ'), ('amounts', 'NNS'), ('of', 'IN'), ('information', 'NN'), ('for', 'IN'), ('government', 'NN'), ('commercial', 'JJ'), ('and', 'CC'), ('academic', 'JJ'), ('purposes', 'NNS')]


<h4>Applications</h4>
<p>Part-of-speech tags are very useful in word sense disambiguation, which identifies which meaning of a word is used in a sentence when the word has multiple meanings. They are also used in text-to-speech conversion, where the POS is needed to know how to pronounce a word.</p>
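<p>As a small illustration of working with the tagger output, the sketch below groups a few of the (word, tag) pairs printed above by coarse word class. The prefix-to-class mapping relies only on the Penn Treebank tag names seen in the output, and the helper name <code>coarse_class</code> is made up for this example.</p>

```python
from collections import defaultdict

# A few (word, tag) pairs copied from the blob.tags output above.
tags = [('Information', 'NN'), ('storage', 'NN'), ('and', 'CC'),
        ('systematic', 'JJ'), ('collecting', 'VBG'), ('data', 'NNS'),
        ('located', 'VBN'), ('Computers', 'NNS'), ('large', 'JJ')]

def coarse_class(tag):
    """Map a Penn Treebank tag to a coarse class (hypothetical helper)."""
    for prefix, label in (('NN', 'noun'), ('VB', 'verb'), ('JJ', 'adjective')):
        if tag.startswith(prefix):
            return label
    return 'other'

# Collect the words of each coarse class in their original order.
by_class = defaultdict(list)
for word, tag in tags:
    by_class[coarse_class(tag)].append(word)

for label, words in by_class.items():
    print(label, ":", words)
```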

<h3>3.3 Noun Phrases</h3>
<p>Part-of-speech tagging gives the part of speech of every word in a blob. Noun phrase extraction instead returns only the phrases built around nouns, such as 'data processing techniques'. This is useful when we want to find the main subjects of a text.</p>

In [24]:
noun_blob=blob.noun_phrases
print(noun_blob)

['information', 'systematic process', 'computers', 'data processing techniques', 'selective retrieval', 'large amounts', 'academic purposes']


<h4>Applications</h4>
<p>Noun phrases are useful in automatic book indexing: we collect the noun phrases and weight them by inverse document frequency to find the topics in the book and to summarize its content. They can also be used for adjective grouping.</p>
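<p>The indexing idea can be sketched roughly as follows. The chapter names and phrase lists are hypothetical; in practice each list would come from the noun_phrases of a blob built from that chapter. The score used here is the plain idf = log(N / df), which is one common variant and an assumption of this sketch.</p>

```python
import math

# Hypothetical noun phrases per chapter; in practice each list would come
# from TextBlob(chapter_text).noun_phrases.
chapters = {
    "ch1": ["information retrieval", "data processing techniques", "large amounts"],
    "ch2": ["information retrieval", "inverted index"],
    "ch3": ["information retrieval", "inverted index", "phrase queries"],
}

def inverse_document_frequency(docs):
    """Return idf = log(N / df) per phrase, where df counts chapters."""
    n_docs = len(docs)
    df = {}
    for phrases in docs.values():
        for phrase in set(phrases):      # count each phrase once per chapter
            df[phrase] = df.get(phrase, 0) + 1
    return {phrase: math.log(n_docs / count) for phrase, count in df.items()}

idf = inverse_document_frequency(chapters)
# A high idf means the phrase is specific to few chapters.
for phrase, score in sorted(idf.items(), key=lambda kv: -kv[1]):
    print(f"{phrase:28s} {score:.3f}")
```

<p>A phrase that appears in every chapter gets a score of zero, so it would not make a useful index entry, while a phrase confined to one chapter scores highest.</p>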

<h3>3.4 Lemmatization and Word Inflection</h3>
<p>Lemmatization and word inflection are used to prepare text for further natural language processing, typically as part of data cleaning. Lemmatization reduces a word to its base form. TextBlob Word and WordList objects also have functions called pluralize and singularize. However, these functions do not work properly for some words. As we can see below, singularize did not work for the word 'process': it simply removed one 's' from the end of the word. Pluralize did not work for already-plural words such as 'computers'.</p>

In [25]:
print("SINGULARIZE AND PLURALIZE")
for word in noun_blob:  #pluralizing/singularizing only nouns
    print(word," pluralizes to ",word.pluralize())
    print(word,"singularizes to ",word.singularize())
    print()
print("LEMMATIZATION")
for word,pos in pos_blob:
    if pos in ('VB','VBN','VBG'):
        print(word,":",word.lemmatize("v")) #v indicates verb

SINGULARIZE AND PLURALIZE
information  pluralizes to  information
information singularizes to  information

systematic process  pluralizes to  systematic processes
systematic process singularizes to  systematic proces

computers  pluralizes to  computerss
computers singularizes to  computer

data processing techniques  pluralizes to  data processing techniquess
data processing techniques singularizes to  data processing technique

selective retrieval  pluralizes to  selective retrievals
selective retrieval singularizes to  selective retrieval

large amounts  pluralizes to  large amountss
large amounts singularizes to  large amount

academic purposes  pluralizes to  academic purposess
academic purposes singularizes to  academic purpose

LEMMATIZATION
collecting : collect
cataloging : catalog
be : be
located : locate
displayed : display
processing : process
made : make


<h3>3.5 Spelling correction and Spellcheck</h3>

<p>TextBlob has a built-in function, correct(), for spelling correction. spellcheck() is a Word method that returns candidate spellings of a word, each with a confidence score, and correct() picks the candidate with the highest confidence. For example, "strage" should have been corrected to "storage", but "strange" has a higher confidence than "storage", so "strange" was selected, which is wrong. Similarly, the correctly spelled "retrieval" was changed to "retrieved". Most of the other misspelled words were corrected properly.</p>

In [26]:
incorrect_blob=TextBlob("Informtion strage and retrieval, the systmatic process of collectinggg and cataloging data so thait they can be locatted and displaiyed on requeist.")
print("Incorrect sentence: ",incorrect_blob)
print()
print("Corrected sentence: ",incorrect_blob.correct())

Incorrect sentence:  Informtion strage and retrieval, the systmatic process of collectinggg and cataloging data so thait they can be locatted and displaiyed on requeist.

Corrected sentence:  Information strange and retrieved, the systematic process of collecting and cataloging data so that they can be located and displayed on request.


In [27]:
from textblob import Word
storage_word=Word('strage')
print("Spellcheck for the word strage: ",storage_word.spellcheck())
print()
retrieval_word=Word('retrieval')
print("Spellcheck for the word retrieval:",retrieval_word.spellcheck())
print()
request_word=Word('reqeuist')
print("Spellcheck for the word reqeuist:",request_word.spellcheck())


Spellcheck for the word strage:  [('strange', 0.6646525679758308), ('stage', 0.32628398791540786), ('storage', 0.00906344410876133)]

Spellcheck for the word retrieval: [('retrieved', 0.6666666666666666), ('retrieve', 0.3333333333333333)]

Spellcheck for the word reqeuist: [('request', 1.0)]
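<p>The confidence values above look like each candidate's share of a word-frequency count taken over all candidates. The sketch below works under that assumption. The counts are not read from TextBlob's data; they were chosen only because they reproduce the ratios printed for "strage", and <code>rank_candidates</code> is a hypothetical helper.</p>

```python
# Assumed word counts, chosen only because they reproduce the ratios in the
# spellcheck() output above; they are not read from TextBlob's data files.
candidate_counts = {"strange": 220, "stage": 108, "storage": 3}

def rank_candidates(counts):
    """Turn raw counts into (word, confidence) pairs, highest first."""
    total = sum(counts.values())
    scored = [(word, count / total) for word, count in counts.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranked = rank_candidates(candidate_counts)
best_word, best_confidence = ranked[0]   # what a highest-confidence rule picks

for word, confidence in ranked:
    print(f"{word:8s} {confidence:.4f}")
print("Picked:", best_word)
```

<p>Under this reading, "storage" loses simply because it is the rarest candidate. The surrounding words, which would favor "storage" here, play no part in the choice.</p>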


<h3>3.6 WordNet Integration</h3>
<p>WordNet is a lexical database of English designed for use in natural language processing. TextBlob accesses it through NLTK's WordNet corpus reader.

Synsets are groupings of words that express the same concept (synonyms). Some words have only one synset and some have several. A synset is identified by a 3-part name of the form: word.pos.number

'word' is the word’s morphological stem.    
'pos' is one of the module attributes ADJ, ADJ_SAT, ADV, NOUN or VERB.   
'number' is the sense number; in the output below, numbering starts at 01.

Each synset provides definition() and examples() methods that return the definition and example sentences for that sense. Synsets can be filtered by part of speech by passing the pos argument to get_synsets().
The similarity between words can also be computed from their synsets.</p>

In [28]:
from textblob.wordnet import VERB
from textblob.wordnet import NOUN
synset_word=Word('act')
for word in synset_word.synsets:
    print("Synset: ",word)
    print("Definition:",word.definition())
    print("Example: ",word.examples())
    print()
print(synset_word.get_synsets(pos=VERB))
print()
print(synset_word.get_synsets(pos=NOUN))




Synset:  Synset('act.n.01')
Definition: a legal document codifying the result of deliberations of a committee or society or legislative body
Example:  []

Synset:  Synset('act.n.02')
Definition: something that people do or cause to happen
Example:  []

Synset:  Synset('act.n.03')
Definition: a subdivision of a play or opera or ballet
Example:  []

Synset:  Synset('act.n.04')
Definition: a short theatrical performance that is part of a longer program
Example:  ['he did his act three times every evening', 'she had a catchy little routine', 'it was one of the best numbers he ever did']

Synset:  Synset('act.n.05')
Definition: a manifestation of insincerity
Example:  ['he put on quite an act for her benefit']

Synset:  Synset('act.v.01')
Definition: perform an action, or work out or perform (an action)
Example:  ['think before you act', 'We must move quickly', 'The governor should act on the new energy bill', 'The nanny acted quickly by grabbing the toddler and covering him with a wet towe

<h4>Applications</h4>
<p>Synsets are useful for finding similar words, antonyms, hypernyms, hyponyms, etc., and for measuring the similarity between words.</p>
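<p>When post-processing synsets, it can help to split the 3-part name described above back into its parts. The following is a minimal sketch that works on the name strings only and does not call WordNet; <code>parse_synset_name</code> is a hypothetical helper.</p>

```python
def parse_synset_name(name):
    """Split 'word.pos.number' into its three parts.

    rsplit is used so that a word containing a dot stays intact.
    """
    word, pos, number = name.rsplit(".", 2)
    return {"word": word, "pos": pos, "sense": int(number)}

# Names copied from the synset output above.
for name in ["act.n.01", "act.n.05", "act.v.01"]:
    print(name, "->", parse_synset_name(name))
```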

<h3>3.7 Language Detection and Translation</h3>
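<p>TextBlob uses the Google Translate API to provide language detection and translation, so these calls need network access. Both methods were deprecated in later TextBlob releases, so this cell may only run on older versions.</p>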

In [29]:
#Language translation
to_spanish=blob.translate(to='es')
to_arabic=blob.translate(to='ar')
to_simplified_chinese=blob.translate(to='zh-CN')
print("Spanish:",to_spanish)
print()
print("Arabic:",to_arabic)
print()
print("Simplified Chinese:",to_simplified_chinese)
print()
#Language Detection
print(blob.detect_language())
print(to_spanish.detect_language())
print(to_arabic.detect_language())
print(to_simplified_chinese.detect_language())


Spanish: Almacenamiento y recuperación de información, el proceso sistemático de recopilación y catalogación de datos para que puedan ubicarse y visualizarse a pedido. Las computadoras y las técnicas de procesamiento de datos han hecho posible la recuperación selectiva y de alta velocidad de grandes cantidades de información para fines gubernamentales, comerciales y académicos.

Arabic: تخزين المعلومات واسترجاعها ، هي العملية المنهجية لجمع البيانات وفهرستها حتى يمكن تحديد موقعها وعرضها عند الطلب. مكنت أجهزة الكمبيوتر وتقنيات معالجة البيانات من الاسترجاع الانتقائي عالي السرعة لكميات كبيرة من المعلومات للأغراض الحكومية والتجارية والأكاديمية.

Simplified Chinese: 信息存储和检索，数据收集和分类的系统过程，以便可以根据需要定位和显示它们。计算机和数据处理技术使得为政府，商业和学术目的高速，选择性地检索大量信息成为可能。

en
es
ar
zh-CN


<h3>3.8 N-grams</h3>
<p>This built-in TextBlob feature returns every run of n consecutive words in the text. It is very useful in information retrieval because consecutive words carry more information than separate words, as we saw with phrase queries. The n argument sets how many consecutive words each n-gram contains.</p>

In [30]:
blob.ngrams(n=7)

[WordList(['Information', 'storage', 'and', 'retrieval', 'the', 'systematic', 'process']),
 WordList(['storage', 'and', 'retrieval', 'the', 'systematic', 'process', 'of']),
 WordList(['and', 'retrieval', 'the', 'systematic', 'process', 'of', 'collecting']),
 WordList(['retrieval', 'the', 'systematic', 'process', 'of', 'collecting', 'and']),
 WordList(['the', 'systematic', 'process', 'of', 'collecting', 'and', 'cataloging']),
 WordList(['systematic', 'process', 'of', 'collecting', 'and', 'cataloging', 'data']),
 WordList(['process', 'of', 'collecting', 'and', 'cataloging', 'data', 'so']),
 WordList(['of', 'collecting', 'and', 'cataloging', 'data', 'so', 'that']),
 WordList(['collecting', 'and', 'cataloging', 'data', 'so', 'that', 'they']),
 WordList(['and', 'cataloging', 'data', 'so', 'that', 'they', 'can']),
 WordList(['cataloging', 'data', 'so', 'that', 'they', 'can', 'be']),
 WordList(['data', 'so', 'that', 'they', 'can', 'be', 'located']),
 WordList(['so', 'that', 'they', 'can', 'be

<h4>Applications</h4>
<p>N-grams are widely used in NLP, for example for sentence auto-completion and automatic spell checking.</p>
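<p>To make the auto-completion idea concrete, here is a minimal pure-Python sketch. It builds n-grams with a sliding window, which is an assumption about the general technique rather than TextBlob's own implementation, and then uses bigram counts to suggest a next word. The function names are illustrative.</p>

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Tokens from the first sample sentence, lowercased for counting.
tokens = ("information storage and retrieval the systematic process of "
          "collecting and cataloging data so that they can be located "
          "and displayed on request").split()

# Count which word follows each word, using bigrams (n=2).
followers = defaultdict(Counter)
for first, second in ngrams(tokens, 2):
    followers[first][second] += 1

def suggest(word, k=3):
    """Suggest up to k of the most frequent next words seen after `word`."""
    return [w for w, _ in followers[word].most_common(k)]

print(ngrams(tokens, 3)[:2])
print(suggest("and"))
```

<p>With such a small sample every follower of "and" occurs once, so the suggestions are effectively unranked. A real auto-completer would count n-grams over a much larger corpus.</p>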

<h3>3.9 Sentiment Analysis</h3>
<p>This is one of the most important features offered by TextBlob. It is used in many applications, such as classifying movie reviews or tweets as positive or negative. The 'sentiment' property of a sentence gives its polarity and subjectivity.</p>
    
<h4>Polarity: </h4><p>Polarity ranges from -1 to +1. A negative value signifies that the sentence is more negative, and a positive value signifies that the sentence is more positive.</p>
<h4>Subjectivity: </h4><p>Subjectivity ranges from 0 to 1. If the sentence is factual, the subjectivity value will be closer to 0. If the sentence is an opinion or a personal statement, the subjectivity value will be closer to 1.</p>
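<p>One possible way to turn the two scores into readable labels is sketched below. The 0.1 and 0.5 cut-offs are assumptions made for illustration, not values defined by TextBlob, and <code>describe_sentiment</code> is a hypothetical helper.</p>

```python
def describe_sentiment(polarity, subjectivity, threshold=0.1):
    """Label numeric scores; the 0.1 and 0.5 cut-offs are assumptions."""
    if polarity > threshold:
        tone = "positive"
    elif polarity < -threshold:
        tone = "negative"
    else:
        tone = "neutral"
    kind = "opinion" if subjectivity >= 0.5 else "factual"
    return tone, kind

# Illustrative (polarity, subjectivity) pairs covering the three cases.
examples = {
    "positive opinion": (0.5, 0.6),
    "negative opinion": (-0.8, 0.9),
    "statement of fact": (0.0, 0.0),
}
for text, (polarity, subjectivity) in examples.items():
    print(text, "->", describe_sentiment(polarity, subjectivity))
```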

<h4>Pattern Analyzer and Naive Bayes Analyzer</h4>
<p>TextBlob uses PatternAnalyzer by default to calculate the sentiment. To use NaiveBayesAnalyzer instead, pass it through the analyzer parameter. PatternAnalyzer is based on the Pattern library, and NaiveBayesAnalyzer is an NLTK classifier trained on a movie reviews corpus.

NaiveBayesAnalyzer outputs the classification of the sentence (positive/negative) along with the probabilities p_pos and p_neg. In the output below it labels all three test sentences as positive, including "I hate reading books", which suggests that a model trained on movie reviews does not transfer well to short, general sentences.</p>
    

In [31]:
from textblob import Sentence  
from textblob.sentiments import NaiveBayesAnalyzer
positive_sentence=TextBlob("I love reading books")
negative_sentence=TextBlob("I hate reading books")
factual_sentence=TextBlob("Sentiment analysis is the process of determining emotions of a writer")
print("Using Pattern Analyser: ")
print(positive_sentence,":",positive_sentence.sentiment)
print(negative_sentence,":",negative_sentence.sentiment)
print(factual_sentence,":",factual_sentence.sentiment)   #pattern analyzer
print()

print("Using Naive Bayes Analyser: ")
nb_analyzer=NaiveBayesAnalyzer()   #create one analyzer and share it across the blobs
positive_sentence=TextBlob("I love reading books",analyzer=nb_analyzer)
negative_sentence=TextBlob("I hate reading books",analyzer=nb_analyzer)
factual_sentence=TextBlob("Sentiment analysis is the process of determining emotions of a writer",analyzer=nb_analyzer)
print(positive_sentence,":",positive_sentence.sentiment)
print(negative_sentence,":",negative_sentence.sentiment)
print(factual_sentence,":",factual_sentence.sentiment)   #Naive Bayes analyzer
print()

for sentence in sentences:
    print("Sentence:",sentence)
    print("Sentiment ",sentence.sentiment)
    print()

Using Pattern Analyser: 
I love reading books : Sentiment(polarity=0.5, subjectivity=0.6)
I hate reading books : Sentiment(polarity=-0.8, subjectivity=0.9)
Sentiment analysis is the process of determining emotions of a writer : Sentiment(polarity=0.0, subjectivity=0.0)

Using Naive Bayes Analyser: 
I love reading books : Sentiment(classification='pos', p_pos=0.7582388390506671, p_neg=0.24176116094933342)
I hate reading books : Sentiment(classification='pos', p_pos=0.7424393605051768, p_neg=0.25756063949482355)
Sentiment analysis is the process of determining emotions of a writer : Sentiment(classification='pos', p_pos=0.9530619620476906, p_neg=0.04693803795230751)

Sentence: Information storage and retrieval, the systematic process of collecting and cataloging data so that they can be located and displayed on request.
Sentiment  Sentiment(polarity=0.0, subjectivity=0.0)

Sentence: Computers and data processing techniques have made possible the high-speed, selective retrieval of large a

<h4>Applications</h4>
<p>TextBlob offers different models for sentiment analysis, and you can create a custom sentiment analyzer to suit your application. Sentiment analysis has many applications, including:
1. Reputation management: Brand monitoring shows an organization what it is good and bad at, so it can improve its products. 
2. Customer support: An organization can monitor social media and respond immediately to negative reviews. 
3. Competitor monitoring: An organization can monitor reviews of its competitors to improve its own products.</p>

<h2>4. Text Classification</h2>
<p>Creating a custom classifier is very simple with TextBlob. We can use a Naive Bayes classifier or a decision tree classifier to perform the text classification.</p>
<h4>Step 1:</h4><p>Import the dataset and store it for classification. 
In this example, the Names dataset from the NLTK corpus is used to classify names as male or female. Store the male and female names in lists, labeled 'male' and 'female' respectively.
Shuffle the dataset and construct a training set and a testing set.
</p>
<h4>Step 2:</h4><p>Create a custom feature extractor.
In this example, the feature extractor uses the last letter of the name as the feature for classification. We can add more features to improve the classification.</p>
<h4>Step 3:</h4><p>Train the classifier on the training set.
In this example, a Naive Bayes classifier is trained on the training set, with the custom feature extractor passed as a parameter.</p>
<h4>Step 4:</h4><p>Measure the accuracy on the test set.</p>

<p>The Naive Bayes classifier has another function called show_informative_features(), which displays the most informative features used for classification. From the example, we can see that if the last letter is 'a', the name is most likely female, and if the last letter is 'k', the name is most likely male.</p>
<p>The classifiers also have a function called 'update()', which can be used to add new training data.</p>

In [32]:
import nltk


In [33]:
nltk.download('names')

[nltk_data] Downloading package names to
[nltk_data]     C:\Users\naghm\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!


True

In [34]:
def last_letter_extractor(word):
    last_letter=word[-1]
    feats = {}
    feats["last({0})".format(last_letter)] = True
    return feats
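<p>Step 2 notes that more features can be added. The extractor below is one possible extension, written as a sketch. The extra features (last two letters, first letter, vowel ending, length) are illustrative choices rather than part of the original experiment, and whether they improve accuracy would have to be checked on the test set.</p>

```python
def name_features(name):
    """Extract several spelling features from a name (illustrative choices)."""
    lowered = name.lower()
    return {
        "last({0})".format(lowered[-1]): True,        # same idea as last_letter_extractor
        "last_two({0})".format(lowered[-2:]): True,   # e.g. 'ia', 'an'
        "first({0})".format(lowered[0]): True,
        "ends_with_vowel": lowered[-1] in "aeiou",
        "length": len(lowered),
    }

print(name_features("Sophia"))
```

<p>Like last_letter_extractor, this function takes a single string and returns a dictionary, so it could be passed to the classifier through the same feature_extractor parameter.</p>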

In [35]:
from nltk.corpus import names
from textblob.classifiers import NaiveBayesClassifier
from random import shuffle 
print ("Files in the names corpus:",names.fileids())
#creating a list of (name, label) pairs for male names
male_names=[(name,'male') for name in names.words('male.txt')]
len_male=len(male_names)
print("Number of male names:",len_male)

#creating a list of (name, label) pairs for female names
female_names=[(name,'female') for name in names.words('female.txt')]
len_female=len(female_names)   #computed once, after the list is built
print("Number of female names:",len_female)

#Shuffling the data set
shuffle(male_names)
shuffle(female_names)
#creating the training and testing set 
train_male=int((3/4)*len_male)
train_female=int((3/4)*len_female)
train_set=male_names[0:train_male]+female_names[0:train_female]
test_set=male_names[train_male:len_male]+female_names[train_female:len_female]
print("Length of trainset:",len(train_set))
print("Length of testset:",len(test_set))

#Calling the Naive Bayes classifier
classifier = NaiveBayesClassifier(train_set,feature_extractor=last_letter_extractor)

#Computing the accuracy on test set
accuracy = classifier.accuracy(test_set)
print("Accuracy:",accuracy)
      
#Showing the informative features
print (classifier.show_informative_features(5))
      
#Testing the classifier with random names
word1="Sophia"
word2="Ethan"
word3="Sophie"
word4="Peter"
print(word1,"is a",classifier.classify(word1))
print(word2,"is a ",classifier.classify(word2))
print(word3,"is a",classifier.classify(word3))
print(word4,"is a ",classifier.classify(word4))


Files in the names corpus: ['female.txt', 'male.txt']
Number of male names: 2943
Number of female names: 5001
Length of trainset: 5957
Length of testset: 1987
Accuracy: 0.758933064921993
Most Informative Features
                 last(k) = True             male : female =     60.6 : 1.0
                 last(a) = True           female : male   =     36.4 : 1.0
                 last(p) = True             male : female =     15.3 : 1.0
                 last(f) = True             male : female =     15.3 : 1.0
                 last(v) = True             male : female =     11.9 : 1.0
None
Sophia is a female
Ethan is a  male
Sophie is a female
Peter is a  male


<h4>Applications</h4>
<p>Text classification is one of the important tasks in supervised machine learning. It can be used for spam filtering, email routing, sentiment analysis, etc.</p>

<h2>5. Conclusion</h2>
<p>    
TextBlob is a simple language processing tool. Its main advantage is that it is built on top of NLTK and exposes many NLTK functions in a simplified manner. In contrast to NLTK, it provides a simple, intuitive interface that is easy for beginners to learn. TextBlob also uses functions from the Pattern library, and it uses the Google Translate interface for translating languages. 
    
It is used for text mining, text processing and text analysis, and it is best known for its sentiment analysis function. 
It also performs other important text processing functions such as tokenization, lemmatization, part-of-speech tagging, noun phrase extraction, spelling correction, finding similarity between words, and machine translation.
</p>

<h2>6. References</h2>
<p>
1. https://textblob.readthedocs.io/en/dev/quickstart.html
2. https://textblob.readthedocs.io/en/dev/classifiers.html 
3. http://www.nltk.org/howto/wordnet.html</p>