### TextBlob Library

#### Features

- Part-of-speech tagging
- Tokenization (splitting text into words and sentences)
- Word frequencies
- Spelling correction
- Word inflection (pluralization and singularization)
- Sentiment analysis
- n-grams
- Classification (Naive Bayes, Decision Tree)

### Start

#### Library Installation

In [1]:
pip install textblob

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Download the corpora and models that TextBlob relies on
!python -m textblob.download_corpora

[nltk_data] Downloading package brown to /home/justdial/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /home/justdial/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/justdial/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/justdial/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to
[nltk_data]     /home/justdial/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /home/justdial/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
Finished.


In [3]:
from textblob import TextBlob

In [4]:
data = TextBlob("I am going to New York today. Will enjoy my vacation")

#### Part of Speech Tagging

In [5]:
data.tags

[('I', 'PRP'),
 ('am', 'VBP'),
 ('going', 'VBG'),
 ('to', 'TO'),
 ('New', 'NNP'),
 ('York', 'NNP'),
 ('today', 'NN'),
 ('Will', 'MD'),
 ('enjoy', 'VB'),
 ('my', 'PRP$'),
 ('vacation', 'NN')]

#### Explanation of the tags

- PRP: Personal pronoun
- VBP: Verb, non-3rd person singular present
- VBG: Verb, gerund or present participle
- TO: The word "to" (as a preposition or infinitive marker)
- NNP: Proper noun, singular
- NN: Noun, singular or mass
- MD: Modal auxiliary
- VB: Verb, base form
- PRP$: Possessive pronoun
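Because related tags share a prefix (NN, NNP for nouns; VB, VBP, VBG for verbs), the tagged output can be filtered by prefix. The sketch below is only an illustration: the helper name `group_by_tag_prefix` is made up for this example, and the tag list is copied from the output above rather than recomputed.

```python
from collections import defaultdict

# Tagged output copied from the data.tags cell above.
tags = [('I', 'PRP'), ('am', 'VBP'), ('going', 'VBG'), ('to', 'TO'),
        ('New', 'NNP'), ('York', 'NNP'), ('today', 'NN'), ('Will', 'MD'),
        ('enjoy', 'VB'), ('my', 'PRP$'), ('vacation', 'NN')]

def group_by_tag_prefix(tagged_words, prefixes=("NN", "VB")):
    """Collect words whose tag starts with one of the given prefixes (hypothetical helper)."""
    groups = defaultdict(list)
    for word, tag in tagged_words:
        for prefix in prefixes:
            if tag.startswith(prefix):
                groups[prefix].append(word)
                break  # a tag belongs to at most one prefix group
    return dict(groups)

groups = group_by_tag_prefix(tags)
print(groups["NN"])  # nouns and proper nouns
print(groups["VB"])  # all verb forms
```

The same function should accept `data.tags` directly, since it only relies on each item being a (word, tag) pair.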

#### Tokenization: sentence and word tokenization

In [6]:
data.sentences

[Sentence("I am going to New York today."), Sentence("Will enjoy my vacation")]

In [7]:
data.words

WordList(['I', 'am', 'going', 'to', 'New', 'York', 'today', 'Will', 'enjoy', 'my', 'vacation'])

In [8]:
data.words[-1]

'vacation'

#### Word inflection (pluralization and singularization)

In [9]:
data.words[-1].pluralize()   # singularize() works the same way in the other direction

'vacations'

In [10]:
data.words[1].pluralize()   # pluralize() does not check the part of speech, so the verb 'am' becomes 'ams'

'ams'

#### Spelling correction and completion

In [11]:
data = 'May you havv a good mornin'
text = TextBlob(data)
text.correct()

TextBlob("May you have a good morning")

In [12]:
data = 'Ma you hav a good mornin'
text = TextBlob(data)
text.correct()

TextBlob("A you had a good morning")

In [13]:
# With more errors in the input the correction degrades: "Ma" becomes "A" and "hav" becomes "had"

In [14]:
from textblob import Word
k = Word('can')
k.spellcheck()

[('can', 1.0)]

In [15]:
# A confidence of 1.0 means 'can' is the only candidate: the spell checker treats it as a known, correctly spelled word

In [16]:
k = Word('Can')
k.spellcheck()

[('An', 0.5222764723832773),
 ('Man', 0.25205981080256334),
 ('Can', 0.1670735428745804),
 ('Ran', 0.049130302105584375),
 ('San', 0.003051571559353067),
 ('Van', 0.0028989929813854134),
 ('Fan', 0.0012206286237412267),
 ('Ban', 0.00091547146780592),
 ('Pan', 0.0006103143118706134),
 ('Dan', 0.0003051571559353067),
 ('Wan', 0.00015257857796765334),
 ('Nan', 0.00015257857796765334),
 ('Jan', 0.00015257857796765334)]

#### Explanation

TextBlob's spelling correction is based on Peter Norvig's "How to Write a Spelling Corrector", as implemented in the pattern library; the TextBlob documentation puts its accuracy at about 70%. It does not use WordNet. Instead, it looks the word up in a frequency dictionary of known words and, if the word is not found, generates candidates within a small edit distance (deletions, transpositions, replacements, insertions) and ranks the known ones by how often they occur.

In the case of k = Word('can'), the lowercase "can" is itself a known word, so it is the only candidate and gets a confidence of 1.0.

With k = Word('Can'), the results above suggest that the lookup is case-sensitive: the capitalized "Can" is not found, so the checker falls back to known words one edit away, such as "an" (delete the "C") and "man" or "can" (replace the "C"). Each confidence appears to be the candidate's share of the total frequency, which is why the very common "an" ranks first, and the suggestions are capitalized again to match the input. Lowercasing a word before calling spellcheck() avoids this effect.
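The difference between 'can' and 'Can' can be reproduced with a toy corrector in the style of Norvig's algorithm. This is a sketch under stated assumptions, not TextBlob's actual code: the word counts are made-up numbers, only edit distance 1 is considered, and the names `WORD_COUNTS`, `edits1`, `known`, and `suggest` are chosen for this example.

```python
from collections import Counter
import string

# Toy frequency dictionary with made-up counts; a real corrector loads a large corpus.
WORD_COUNTS = Counter({"can": 50, "an": 160, "man": 80, "ran": 15, "cat": 20})

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Keep only the words present in the (lowercase) dictionary."""
    return {w for w in words if w in WORD_COUNTS}

def suggest(word):
    """Return (candidate, confidence) pairs, most frequent candidate first."""
    # The lookup is case-sensitive: 'Can' is not a dictionary key, 'can' is.
    candidates = known([word]) or known(edits1(word)) or {word}
    total = sum(WORD_COUNTS[c] for c in candidates) or 1
    ranked = sorted(candidates, key=lambda c: WORD_COUNTS[c], reverse=True)
    return [(c, WORD_COUNTS[c] / total) for c in ranked]

print(suggest("can"))  # the word itself is known, so it is the only candidate
print(suggest("Can"))  # falls back to known words one edit away
```

With these counts, 'cat' is not suggested for 'Can' because it is two edits away (change the case of "C" and replace "n"), while 'an', 'man', 'can', and 'ran' each need a single edit.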

#### Word frequencies

In [17]:
from collections import Counter
text = "I love to code. Coding is fun and rewarding."
blob = TextBlob(text)

word_frequencies = Counter(blob.words)
print("Word Frequencies:")
print(word_frequencies)


Word Frequencies:
Counter({'I': 1, 'love': 1, 'to': 1, 'code': 1, 'Coding': 1, 'is': 1, 'fun': 1, 'and': 1, 'rewarding': 1})


In [18]:
from collections import Counter
text = "I love to code. code is fun and rewarding."
blob = TextBlob(text)

word_frequencies = Counter(blob.words)
print("Word Frequencies:")
print(word_frequencies)


Word Frequencies:
Counter({'code': 2, 'I': 1, 'love': 1, 'to': 1, 'is': 1, 'fun': 1, 'and': 1, 'rewarding': 1})
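`Counter` treats every distinct string as its own key, so 'code' and 'Coding' are counted separately in the first cell, and capitalization alone would also split a count. The sketch below handles only the casing part; the helper name `case_insensitive_counts` and the sample word list are made up for illustration. Merging different inflections such as 'code' and 'coding' would additionally need lemmatization or stemming.

```python
from collections import Counter

def case_insensitive_counts(words):
    """Count words after lowercasing, so 'Code' and 'code' share one entry."""
    return Counter(word.lower() for word in words)

# Hypothetical token list; in the notebook this could be blob.words.
words = ['Code', 'is', 'fun', 'I', 'love', 'to', 'code']

counts = case_insensitive_counts(words)
print(counts['code'])
print(counts.most_common(1))
```

Since the items of `blob.words` are string subclasses, passing `blob.words` instead of the plain list should work the same way.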


#### Sentiment Analysis

In [19]:
text = "worst movie"
blob = TextBlob(text)
sentiment = blob.sentiment
print(sentiment)
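# Polarity above 0 is positive, below 0 is negative, and exactly 0 is neutral
if sentiment.polarity > 0:
    print('positive')
elif sentiment.polarity < 0:
    print('negative')
else:
    print('neutral')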

Sentiment(polarity=-1.0, subjectivity=1.0)
negative


In [20]:
# Polarity ranges from -1.0 (negative) to 1.0 (positive);
# subjectivity ranges from 0.0 (objective, factual) to 1.0 (subjective, opinion-based)

#### Translation

In [21]:
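# Note: translate() was deprecated in TextBlob 0.16.0 and removed in 0.18.0,
# so this cell only runs on older versions of the library
b = TextBlob("All friends play together")
b.translate(from_lang='en', to='hi')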

TextBlob("सभी दोस्त एक साथ खेलते हैं")

#### Ngram handling

In [22]:
n = 2  # Size of each n-gram (2 for bigrams, 3 for trigrams, and so on)

text = "I am going to New York today. Will enjoy my vacation"
blob = TextBlob(text)
# Tokenize the text into words
words = blob.words

# Create n-grams with a sliding window; blob.ngrams(n=n) is the built-in equivalent
ngrams = [words[i:i + n] for i in range(len(words) - n + 1)]

print("Original Text:", text)
print("Word Tokenization:", words)
print(f"{n}-grams:", ngrams)

Original Text: I am going to New York today. Will enjoy my vacation
Word Tokenization: ['I', 'am', 'going', 'to', 'New', 'York', 'today', 'Will', 'enjoy', 'my', 'vacation']
2-grams: [WordList(['I', 'am']), WordList(['am', 'going']), WordList(['going', 'to']), WordList(['to', 'New']), WordList(['New', 'York']), WordList(['York', 'today']), WordList(['today', 'Will']), WordList(['Will', 'enjoy']), WordList(['enjoy', 'my']), WordList(['my', 'vacation'])]


#### Lemmatization

In [31]:
from textblob import TextBlob

text = "running cats are better than ran dogs"
blob = TextBlob(text)

# Lemmatize the words
lemmatized_words = [word.lemmatize() for word in blob.words]

print(lemmatized_words)

['running', 'cat', 'are', 'better', 'than', 'ran', 'dog']
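In the output above, 'running' and 'ran' are unchanged because `lemmatize()` treats a word as a noun when no part of speech is given. Passing a WordNet POS code ('n', 'v', 'a', 'r') lets verbs and adjectives be reduced as well. The sketch below maps the Penn Treebank tags returned by `.tags` to those codes; the helper name `penn_to_wordnet` is chosen for this example, and the mapping is a common simplification rather than a complete one.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code (simplified mapping)."""
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # nouns and everything else fall back to noun

# Example tags of the kind produced by blob.tags
for tag in ["VBG", "NNS", "JJR", "VBD"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Assuming the items in `blob.tags` are (word, tag) pairs of `Word` objects and tag strings, the lemmatization step could then be written as `[word.lemmatize(penn_to_wordnet(tag)) for word, tag in blob.tags]`.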


#### Write a function to clean the data

In [24]:
import nltk
from nltk.corpus import stopwords
from string import punctuation
from textblob import TextBlob

# Download necessary resources
nltk.download('stopwords')
nltk.download('punkt')

# Function to clean the data
def clean_data(text):
    # Tokenize the text into words
    blob = TextBlob(text)
    words = blob.words

    # Remove punctuation and stopwords, and convert words to lowercase in a single pass
    stop_words = set(stopwords.words('english'))
    clean_words = [word.lower() for word in words if word.lower() not in stop_words and word not in punctuation]

    # Join the clean words back into a cleaned text
    cleaned_text = " ".join(clean_words)
    return cleaned_text

# Test the clean_data function
text_to_clean = "Hello! This is a simple example, showing how to clean data using TextBlob."
cleaned_text = clean_data(text_to_clean)
print("Original Text:", text_to_clean)
print("Cleaned Text:", cleaned_text)


Original Text: Hello! This is a simple example, showing how to clean data using TextBlob.
Cleaned Text: hello simple example showing clean data using textblob


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/justdial/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/justdial/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Classifier

In [25]:
# Create a text classification system: labeled training and test sentences

train = [
    ("I love this movie!", "positive"),
    ("The food was delicious.", "positive"),
    ("The service was terrible.", "negative"),
    ("I had a great time at the party.", "positive"),
    ("The product did not meet my expectations.", "negative"),
    ("The weather is perfect today.", "positive"),
    ("The customer support was unhelpful.", "negative"),
    ("I feel disappointed with the outcome.", "negative"),
    ("She is a talented musician.", "positive"),
    ("The traffic jam ruined my morning.", "negative"),
    ("The book is captivating.", "positive"),
    ("The movie was boring and predictable.", "negative"),
    ("I am extremely satisfied with the product.", "positive"),
    ("The staff was friendly and helpful.", "positive"),
    ("The hotel room was dirty and smelly.", "negative"),
    ("The concert was amazing!", "positive"),
    ("I regret buying this item.", "negative"),
    ("The customer service was prompt and efficient.", "positive"),
    ("The performance was lackluster.", "negative")
]

test = [("The beach was crowded and noisy.", "negative"),
    ("I had a fantastic experience at the amusement park.", "positive"),
    ("The company's stock price plummeted.", "negative"),
    ("The new design is innovative and user-friendly.", "positive"),
    ("The job interview went well.", "positive"),
    ("The website is slow and frustrating to use.", "negative")]

In [None]:
# The classifier below uses TextBlob's own model, so we don't need any manual pre-processing,
# such as converting categorical data to numeric data or vectorizing the text
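Feature extraction still happens behind the scenes. According to the TextBlob documentation, the default extractor indicates which words from the training set are contained in a document. The sketch below imitates that idea in plain Python; it is not the library's code, and the tokenizer, the function names, and the two-sentence training sample are assumptions made for this example.

```python
import string

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation (simplified)."""
    stripped = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in stripped if w]

def word_presence_features(document, vocabulary):
    """Mark, for every vocabulary word, whether it occurs in the document."""
    tokens = set(tokenize(document))
    return {f"contains({word})": word in tokens for word in sorted(vocabulary)}

# Two sentences taken from the training data above
train_sample = [("I love this movie!", "positive"),
                ("The service was terrible.", "negative")]
vocabulary = {w for text, _ in train_sample for w in tokenize(text)}

features = word_presence_features("The movie was terrible", vocabulary)
print(features)
```

A Naive Bayes classifier then estimates, for each label, how likely each of these true/false features is, and combines them to score a new sentence.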

In [26]:
from textblob.classifiers import NaiveBayesClassifier
cl = NaiveBayesClassifier(train)

In [27]:
#Single statement prediction

cl.classify("good man")

'positive'

In [28]:
#evaluation

cl.accuracy(test)

0.6666666666666666

In [29]:
cl.update(test)   # adds the test examples to the training data and retrains the classifier

True

In [30]:
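# The test sentences are now part of the training data, so this score no longer measures generalization
cl.accuracy(test)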

1.0
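An accuracy of 1.0 here mostly reflects that the classifier has already seen these sentences. One way to keep the evaluation meaningful is to set aside a hold-out portion of the labeled data before any call to `update()` and to score only on that portion. The sketch below shows such a split in plain Python; the helper name `split_holdout`, the 25% fraction, and the placeholder sentences are assumptions for illustration.

```python
import random

def split_holdout(examples, holdout_fraction=0.25, seed=0):
    """Shuffle labeled examples and split them into (training, hold-out) lists."""
    shuffled = list(examples)              # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Placeholder data in the same (text, label) format as train and test above
labeled = [(f"sentence {i}", "positive" if i % 2 else "negative") for i in range(8)]

train_part, holdout_part = split_holdout(labeled)
print(len(train_part), len(holdout_part))
```

The training part can be passed to the classifier (including later `update()` calls), while the hold-out part is reserved for `accuracy()`.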