
# Guide to NLP in TextBlob with Disaster Tweets

# Introduction
TextBlob is a Python (2 and 3) library for processing textual data. It provides a consistent API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, and more. 

# Installation

In [1]:
!pip install -U textblob
!python -m textblob.download_corpora

[nltk_data] Downloading package brown to /usr/share/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /usr/share/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to /usr/share/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /usr/share/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
Finished.


For more details, refer to:
- [TextBlob Documentation](https://textblob.readthedocs.io/en/dev/)
- [GitHub](https://github.com/sloria/TextBlob)

# Features of TextBlob
*    Noun phrase extraction
*    Part-of-speech tagging
*    Sentiment analysis
*    Classification (Naive Bayes, Decision Tree)
*    Tokenization (splitting text into words and sentences)
*    Word and phrase frequencies
*    Parsing
*    n-grams
*    Word inflection (pluralization and singularization) and lemmatization
*    Spelling correction
*    Add new models or languages through extensions
*    WordNet integration



# Importing TextBlob

In [2]:
from textblob import TextBlob
import pandas as pd

In [3]:
df = pd.read_csv('../input/nlp-getting-started/train.csv')
df_test = pd.read_csv('../input/nlp-getting-started/test.csv')
df.sample(5)

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 2821 | 4055 | displaced | North Carolina | @lizhphoto When I have so much shit going on i... | 0 |
| 3760 | 5342 | fire | | #news Politifiact: Harry Reid's '30 Percent of... | 0 |
| 4370 | 6207 | hijacker | West Chester, PA | Remove the http://t.co/9Jxb3rx8mF and Linkury ... | 0 |
| 7488 | 10710 | wreck | new york | act my age was a MESS everyone was so wild it ... | 0 |
| 1131 | 1631 | bombing | | The United Kingdom and France are bombing Daes... | 1 |


In [4]:
text = df.text.values
text[2]

"All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected"

## Part-of-speech Tagging

Part-of-speech tags can be accessed through the tags property.

In [5]:
blob = TextBlob(text[2])
blob.tags           

[('All', 'DT'),
 ('residents', 'NNS'),
 ('asked', 'VBD'),
 ('to', 'TO'),
 ("'shelter", 'VB'),
 ('in', 'IN'),
 ('place', 'NN'),
 ('are', 'VBP'),
 ('being', 'VBG'),
 ('notified', 'VBN'),
 ('by', 'IN'),
 ('officers', 'NNS'),
 ('No', 'DT'),
 ('other', 'JJ'),
 ('evacuation', 'NN'),
 ('or', 'CC'),
 ('shelter', 'NN'),
 ('in', 'IN'),
 ('place', 'NN'),
 ('orders', 'NNS'),
 ('are', 'VBP'),
 ('expected', 'VBN')]
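
The tag list is ordinary Python data, so it can be filtered directly. As a small sketch, the snippet below keeps only the nouns from a few of the (word, tag) pairs shown above. The `tags` literal is copied from that output, and the `nouns` name is only illustrative.

```python
# A few (word, tag) pairs copied from the blob.tags output above.
tags = [('All', 'DT'), ('residents', 'NNS'), ('asked', 'VBD'),
        ('evacuation', 'NN'), ('or', 'CC'), ('orders', 'NNS'),
        ('are', 'VBP'), ('expected', 'VBN')]

# Noun tags in this tag set start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tags if tag.startswith('NN')]
print(nouns)
```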

## Noun Phrase Extraction
Similarly, noun phrases are accessed through the noun_phrases property.

In [6]:
blob.noun_phrases

WordList(['place orders'])

## Sentiment Analysis

The sentiment property returns a namedtuple of the form Sentiment(polarity, subjectivity). The polarity score is a float within the range [-1.0, 1.0]. The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.

In [7]:
print(blob.sentiment)

for sentence in blob.sentences:
    print(sentence.sentiment.polarity)

Sentiment(polarity=-0.018750000000000003, subjectivity=0.3875)
0.0
-0.018750000000000003
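
Because the result is a namedtuple, it can be read by attribute or unpacked like a regular tuple. The sketch below rebuilds the same shape with `collections.namedtuple` purely for illustration, using the values printed above. The `Sentiment` class defined here is a stand-in, not TextBlob's own object.

```python
from collections import namedtuple

# Stand-in with the same field names as the tuple printed above.
Sentiment = namedtuple('Sentiment', ['polarity', 'subjectivity'])
s = Sentiment(polarity=-0.018750000000000003, subjectivity=0.3875)

# Attribute access and tuple unpacking give the same values.
polarity, subjectivity = s
print(s.polarity == polarity, s.subjectivity == subjectivity)

# Both scores sit inside the documented ranges.
print(-1.0 <= polarity <= 1.0, 0.0 <= subjectivity <= 1.0)
```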


## Tokenization

You can break TextBlobs into words or sentences.

In [8]:
blob.words

WordList(['All', 'residents', 'asked', 'to', "'shelter", 'in', 'place', 'are', 'being', 'notified', 'by', 'officers', 'No', 'other', 'evacuation', 'or', 'shelter', 'in', 'place', 'orders', 'are', 'expected'])

In [9]:
blob.sentences

[Sentence("All residents asked to 'shelter in place' are being notified by officers."),
 Sentence("No other evacuation or shelter in place orders are expected")]

Sentence objects have the same properties and methods as TextBlobs.

In [10]:
for sentence in blob.sentences:
    print(sentence.sentiment)

Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=-0.018750000000000003, subjectivity=0.3875)


## Word Inflection and Lemmatization

Each word in TextBlob.words or Sentence.words is a Word object (a subclass of str in Python 3, unicode in Python 2) with useful methods, e.g. for word inflection.

In [11]:
sentence = TextBlob(text[2])
sentence.words

WordList(['All', 'residents', 'asked', 'to', "'shelter", 'in', 'place', 'are', 'being', 'notified', 'by', 'officers', 'No', 'other', 'evacuation', 'or', 'shelter', 'in', 'place', 'orders', 'are', 'expected'])

In [12]:
sentence.words[1].singularize()

'resident'

In [13]:
sentence.words[-4].pluralize()

'places'

Words can be lemmatized by calling the lemmatize method.

In [14]:
from textblob import Word
import nltk
nltk.download('omw-1.4')

w = Word("octopi")
w.lemmatize()

[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...


'octopus'

In [15]:
w = Word("went")
w.lemmatize("v")  # Pass in WordNet part of speech (verb)

'go'

You can access the definitions for each synset via the definitions property or the define() method, which can also take an optional part-of-speech argument.

In [16]:
Word("octopus").definitions

['tentacles of octopus prepared as food',
 'bottom-living cephalopod having a soft oval body with eight long tentacles']

You can also create synsets directly.

In [17]:
from textblob.wordnet import Synset
octopus = Synset('octopus.n.02')
shrimp = Synset('shrimp.n.03')
octopus.path_similarity(shrimp)

0.1111111111111111

## WordLists

A WordList is just a Python list with additional methods.

In [18]:
animals = TextBlob("cat dog octopus")
animals.words

WordList(['cat', 'dog', 'octopus'])

In [19]:
animals.words.pluralize()


WordList(['cats', 'dogs', 'octopodes'])

## Spelling Correction
Use the correct() method to attempt spelling correction.

In [20]:
b = TextBlob("I havv goood speling!")
print(b.correct())

I have good spelling!


Word objects have a spellcheck() method, Word.spellcheck(), that returns a list of (word, confidence) tuples with spelling suggestions.

In [21]:
from textblob import Word
w = Word('falibility')
w.spellcheck()

[('fallibility', 1.0)]

## Get Word and Noun Phrase Frequencies
There are two ways to get the frequency of a word or noun phrase in a TextBlob.<br>
The first is through the word_counts dictionary; the second is the count() method of the word list.

In [22]:
monty = TextBlob("We are no longer the Knights who say Ni. "
                     "We are now the Knights who say Ekki ekki ekki PTANG.")
monty.word_counts['ekki']

3

In [23]:
monty.words.count('ekki')

3

You can specify whether or not the search should be case-sensitive (default is False).

In [24]:
monty.words.count('ekki', case_sensitive=True)

2
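
The difference between the two counts can be reproduced with the standard library alone. This is a rough sketch, assuming a simple regex tokenizer (TextBlob's own tokenizer may split text differently); the variable names are illustrative.

```python
import re
from collections import Counter

monty_text = ("We are no longer the Knights who say Ni. "
              "We are now the Knights who say Ekki ekki ekki PTANG.")

# Simple regex tokenization; TextBlob's own tokenizer may split differently.
tokens = re.findall(r"\w+", monty_text)

case_sensitive = Counter(tokens)
case_insensitive = Counter(token.lower() for token in tokens)

print(case_insensitive['ekki'])  # counts both "Ekki" and "ekki"
print(case_sensitive['ekki'])    # counts only the lowercase "ekki"
```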

Each of these methods can also be used with noun phrases.

In [25]:
blob.noun_phrases.count('python')

0

## TextBlobs Are Like Python Strings!

In [26]:
#You can use Python’s substring syntax.

blob[0:19]

TextBlob("All residents asked")

In [27]:
#You can use common string methods.
blob.upper()

TextBlob("ALL RESIDENTS ASKED TO 'SHELTER IN PLACE' ARE BEING NOTIFIED BY OFFICERS. NO OTHER EVACUATION OR SHELTER IN PLACE ORDERS ARE EXPECTED")

In [28]:
blob.find("Simple")

-1

In [29]:
# You can make comparisons between TextBlobs and strings.

apple_blob = TextBlob('apples')
banana_blob = TextBlob('bananas')
apple_blob < banana_blob

True

In [30]:
apple_blob == 'apples'

True

In [31]:
#You can concatenate and interpolate TextBlobs and strings.

apple_blob + ' and ' + banana_blob

"{0} and {1}".format(apple_blob, banana_blob)

'apples and bananas'

## n-grams

The TextBlob.ngrams() method returns a list of WordList objects, each holding n successive words.

In [32]:
blob = TextBlob("Now is better than never.")
blob.ngrams(n=3)

[WordList(['Now', 'is', 'better']),
 WordList(['is', 'better', 'than']),
 WordList(['better', 'than', 'never'])]
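
The idea behind n-grams is a sliding window over the word list. A minimal pure-Python sketch follows; it is not TextBlob's implementation, and the `sliding_ngrams` helper is a hypothetical name used only here.

```python
def sliding_ngrams(words, n=3):
    """Return every run of n consecutive items from words."""
    return [words[i:i + n] for i in range(len(words) - n + 1)]

example_words = "Now is better than never".split()
for gram in sliding_ngrams(example_words, n=3):
    print(gram)
```

When the text has fewer than n words, the range is empty and the function returns an empty list.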

## Get Start and End Indices of Sentences

Use sentence.start and sentence.end to get the indices where a sentence starts and ends within a TextBlob.

In [33]:
for s in blob.sentences:
    print(s)
    print("---- Starts at index {}, Ends at index {}".format(s.start, s.end))


Now is better than never.
---- Starts at index 0, Ends at index 25


# Building Text Classification on Disaster Tweets

The textblob.classifiers module makes it simple to create custom classifiers.

As an example, let’s create a custom classifier for the disaster tweets and read its output as a sentiment label.

## Loading Data and Creating a Classifier

First we’ll create some training and test data.

**Target Labels**
* **0 - Positive** (the tweet is not about a real disaster)
* **1 - Negative** (the tweet is about a real disaster)

In [34]:
# Build (text, label) pairs for training; test is filled in the next cell.
train, test = [], []
for i in range(len(df)):
    row = df.iloc[i]
    train.append((row['text'], row['target']))
train[:5]

[('Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all', 1),
 ('Forest fire near La Ronge Sask. Canada', 1),
 ("All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected",
  1),
 ('13,000 people receive #wildfires evacuation orders in California ', 1),
 ('Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school ',
  1)]

In [35]:
# The test set has no labels, so collect the tweet text only.
for i in range(len(df_test)):
    test.append(df_test.iloc[i]['text'])
test[:5]

['Just happened a terrible car crash',
 'Heard about #earthquake is different cities, stay safe everyone.',
 'there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all',
 'Apocalypse lighting. #Spokane #wildfires',
 'Typhoon Soudelor kills 28 in China and Taiwan']



Now we’ll create a Naive Bayes classifier, passing the training data into the constructor.


In [36]:
from textblob.classifiers import NaiveBayesClassifier
cl = NaiveBayesClassifier(train)


## Loading Data from Files

You can also load data from common file formats including CSV, JSON, and TSV.

CSV files should be formatted like so:<br>



I love this sandwich.,pos
This is an amazing place!,pos
I do not like this restaurant,neg

JSON files should be formatted like so:

[
    {"text": "I love this sandwich.", "label": "pos"},
    {"text": "This is an amazing place!", "label": "pos"},
    {"text": "I do not like this restaurant", "label": "neg"}
]
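
To make the two layouts concrete, the sketch below parses both samples with the standard library and shows that they describe the same (text, label) pairs. This only illustrates the file formats; it is not how TextBlob reads them internally, and the variable names are illustrative.

```python
import csv
import io
import json

csv_text = """I love this sandwich.,pos
This is an amazing place!,pos
I do not like this restaurant,neg
"""
json_text = """[
    {"text": "I love this sandwich.", "label": "pos"},
    {"text": "This is an amazing place!", "label": "pos"},
    {"text": "I do not like this restaurant", "label": "neg"}
]"""

# Each CSV row is "text,label"; each JSON object has "text" and "label" keys.
from_csv = [(row[0], row[1]) for row in csv.reader(io.StringIO(csv_text))]
from_json = [(item["text"], item["label"]) for item in json.loads(json_text)]

print(from_csv == from_json)
```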

You can then pass the opened file into the constructor.

>>> with open('train.json', 'r') as fp:
...     cl = NaiveBayesClassifier(fp, format="json")


# Classifying Text

Call the classify(text) method to use the classifier.


In [37]:
cl.classify("This is an amazing library!")

0

The classifier returns 0 for "This is an amazing library!", which is the Positive label.

In [38]:
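# x.max() gives the most likely label of a probability distribution.
# It also works on the labels returned by classify() here, because they are NumPy integers.
output = lambda x: "positive" if x.max() == 0 else "negative"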


You can get the label probability distribution with the prob_classify(text) method.


In [39]:
prob_dist = cl.prob_classify(test[0])
print(f"{test[0]} statement is classified as {output(prob_dist)} statement")

Just happened a terrible car crash statement is classified as negative statement


## Probability distribution of Positive and Negative targets

In [40]:
round(prob_dist.prob(1), 2)

0.5

In [41]:
round(prob_dist.prob(0),2)


0.5
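
To give a feel for what a label probability distribution is, here is a minimal pure-Python sketch of a Naive Bayes classifier over word-presence features. It uses add-one smoothing and whitespace tokenization, both assumptions made for brevity; TextBlob delegates to NLTK, whose estimates may differ. The toy training examples and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_toy_nb(examples):
    """Count, per label, how many documents contain each word."""
    label_counts = Counter()
    word_doc_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        words = set(text.lower().split())
        label_counts[label] += 1
        word_doc_counts[label].update(words)
        vocabulary |= words
    return label_counts, word_doc_counts, vocabulary

def toy_prob_classify(model, text):
    """Return {label: probability} for text, normalized to sum to 1."""
    label_counts, word_doc_counts, vocabulary = model
    words = set(text.lower().split())
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # prior P(label)
        for word in vocabulary:
            # Add-one smoothed P(word present | label).
            p_present = (word_doc_counts[label][word] + 1) / (n_docs + 2)
            score += math.log(p_present if word in words else 1 - p_present)
        log_scores[label] = score
    # Exponentiate relative to the best score to avoid underflow, then normalize.
    best = max(log_scores.values())
    unnormalized = {label: math.exp(s - best) for label, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {label: value / z for label, value in unnormalized.items()}

toy_train = [("forest fire near town", 1),
             ("wildfire evacuation orders", 1),
             ("i love this sunny day", 0),
             ("what a lovely song", 0)]
toy_probs = toy_prob_classify(train_toy_nb(toy_train), "fire evacuation")
print({label: round(p, 2) for label, p in toy_probs.items()})
```

The probabilities across labels always sum to 1, which is why the two values printed by the notebook above are complementary.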


# Classifying TextBlobs

Another way to classify text is to pass a classifier into the constructor of TextBlob and call its classify() method.


In [42]:
from textblob import TextBlob
blob = TextBlob(test[2], classifier=cl)
print(f"{test[2]} statement is {output(blob.classify())} statement")

there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all statement is negative statement



The advantage of this approach is that you can classify sentences within a TextBlob.


In [43]:
for s in blob.sentences:
    print(s)
    print(output(s.classify()))

there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all
negative


# Evaluating Classifiers

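To compute the accuracy on a labeled test set, use the accuracy(test_data) method. The Kaggle test file has no target column, so the cells below label the first five test tweets by hand.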


In [44]:
sentiment_test = [1,0,1,1,1]

In [45]:
final_test = []
for i in range(len(test[:5])):
    final_test.append((test[i],sentiment_test[i]))
final_test

[('Just happened a terrible car crash', 1),
 ('Heard about #earthquake is different cities, stay safe everyone.', 0),
 ('there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all',
  1),
 ('Apocalypse lighting. #Spokane #wildfires', 1),
 ('Typhoon Soudelor kills 28 in China and Taiwan', 1)]

In [46]:
cl.accuracy(final_test)

0.8
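
Accuracy is simply the fraction of examples whose predicted label matches the expected one. The sketch below recomputes the 0.8 by hand, assuming the classifier's first five predictions are the ones this notebook prints further down (`[1, 1, 1, 1, 1]`); the `label_accuracy` helper is a hypothetical name, not part of TextBlob.

```python
def label_accuracy(predicted, expected):
    """Fraction of positions where the predicted label equals the expected one."""
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)

predicted_labels = [1, 1, 1, 1, 1]   # first five predictions printed later in the notebook
expected_labels = [1, 0, 1, 1, 1]    # hand-assigned labels from sentiment_test

print(label_accuracy(predicted_labels, expected_labels))
```

Four of the five labels agree, so the result matches the value returned by cl.accuracy(final_test).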

In [47]:
# Named predictions so the output() helper defined earlier is not overwritten.
predictions = []
for i in range(len(test)):
    blob = TextBlob(test[i], classifier=cl)
    predictions.append(blob.classify())

In [48]:
predictions[:5]

[1, 1, 1, 1, 1]

In [49]:
submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv").fillna(' ')

In [50]:
submission['target'] = pd.Series(predictions)
submission.to_csv('submission.csv', index=False)


**Note**

You can also pass a file object to the accuracy method. The file can be in any of the formats listed in the Loading Data from Files section.

Use the show_informative_features() method to display a listing of the most informative features.


In [51]:
cl.show_informative_features(5)  

Most Informative Features
     contains(Hiroshima) = True                1 : 0      =     74.8 : 1.0
         contains(Japan) = True                1 : 0      =     45.6 : 1.0
      contains(Malaysia) = True                1 : 0      =     42.9 : 1.0
      contains(wildfire) = True                1 : 0      =     35.8 : 1.0
        contains(killed) = True                1 : 0      =     32.8 : 1.0



# Updating Classifiers with New Data

Use the update(new_data) method to update a classifier with new training data.


In [52]:
'''
new_data = [('She is my best friend.', 0),
             ("I'm happy to have a new friend.", 0),
             ("Stay thirsty, my friend.", 0),
             ("He ain't from around here.", 1)]
cl.update(new_data)
cl.accuracy(final_test)
'''


# Feature Extractors

By default, the NaiveBayesClassifier uses a simple feature extractor that indicates which words in the training set are contained in a document.

For example, the sentence “I feel happy” might have the features contains(happy): True or contains(angry): False.
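
As a rough illustration of that idea, the sketch below builds contains(word) features for a document against a small vocabulary. It lowercases and splits on whitespace, which is a simplification and not necessarily what TextBlob's default extractor does; the `contains_features` helper and the two-word vocabulary are hypothetical.

```python
def contains_features(document, vocabulary):
    """Map each vocabulary word to whether it occurs in the document."""
    tokens = set(document.lower().split())
    return {"contains({0})".format(word): word in tokens for word in vocabulary}

toy_vocabulary = ["happy", "angry"]   # hypothetical training vocabulary
print(contains_features("I feel happy", toy_vocabulary))
```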

You can override this feature extractor by writing your own. A feature extractor is simply a function with document (the text to extract features from) as the first argument. The function may include a second argument, train_set (the training dataset), if necessary.

The function should return a dictionary of features for document.

For example, let’s create a feature extractor that just uses the first and last words of a document as its features.


In [53]:
'''
def end_word_extractor(document):
    tokens = document.split()
    first_word, last_word = tokens[0], tokens[-1]
    feats = {} 
    feats["first({0})".format(first_word)] = True
    feats["last({0})".format(last_word)] = False
    return feats
features = end_word_extractor("I feel happy")
assert features == {'last(happy)': False, 'first(I)': True}
'''



We can then use the feature extractor in a classifier by passing it as the second argument of the constructor.


In [54]:
'''
# Train on the labeled (text, label) pairs; the test list holds unlabeled text only.
cl2 = NaiveBayesClassifier(train, feature_extractor=end_word_extractor)
blob = TextBlob("I'm excited to try my new classifier.", classifier=cl2)
blob.classify()
'''