# NLP with TextBlob

TextBlob is a Python library that provides a simple API for common NLP tasks. It builds on the Natural Language Toolkit (NLTK) and the Pattern web mining library, and supports part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and translation, among other tasks.

## Imports & Settings

In [5]:
%matplotlib inline
import warnings
from pathlib import Path

import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# textblob and nltk for language processing
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer

# sklearn for feature extraction & modeling
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB         # Naive Bayes
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

In [6]:
np.random.seed(42)
pd.set_option('float_format', '{:,.2f}'.format)

## Load BBC Data

To illustrate the use of TextBlob, we load the BBC news articles and then sample one of them at random. Similar to spaCy and other libraries, the first step is to pass the document through a pipeline, represented by the TextBlob object, that assigns the annotations required for various tasks.

In [30]:
path = Path('..', 'data', 'bbc')
files = sorted(path.glob('**/*.txt'))
doc_list = []
for file in files:
    # The parent folder names the topic; the first line of the file is the heading.
    topic = file.parts[-2]
    article = file.read_text(encoding='latin1').split('\n')
    heading = article[0].strip()
    body = ' '.join(line.strip() for line in article[1:]).strip()
    doc_list.append([topic, heading, body])

In [31]:
docs = pd.DataFrame(doc_list, columns=['topic', 'heading', 'body'])
docs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   topic    2225 non-null   object
 1   heading  2225 non-null   object
 2   body     2225 non-null   object
dtypes: object(3)
memory usage: 52.3+ KB
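
The loading loop relies on a layout convention: each article sits in a folder named after its topic, and its first line is the headline. A minimal sketch of that parsing step as a standalone function follows. The name `parse_article` and the temporary demo file are hypothetical and not part of the original notebook.

```python
import tempfile
from pathlib import Path


def parse_article(file):
    """Return (topic, heading, body) for one article file.

    Assumes the parent folder names the topic and the first line is the heading.
    """
    file = Path(file)
    lines = file.read_text(encoding='latin1').split('\n')
    topic = file.parts[-2]
    heading = lines[0].strip()
    body = ' '.join(line.strip() for line in lines[1:]).strip()
    return topic, heading, body


# Demo on a throwaway file that mimics the assumed folder layout.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp, 'sport', '001.txt')
    demo.parent.mkdir()
    demo.write_text('A headline\n\nFirst paragraph.\n\nSecond paragraph.\n',
                    encoding='latin1')
    parsed = parse_article(demo)

print(parsed)
```

Because blank lines are joined with single spaces, paragraph breaks in the file turn into double spaces in the body.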


## Introduction to TextBlob

This section assumes that TextBlob is already installed. It walks through the library's core features using a single article.

### Select random article

In [32]:
article = docs.sample(1).squeeze()

In [33]:
print(f'Topic:\t{article.topic.capitalize()}\n\n{article.heading}\n')
print(article.body.strip())

Topic:	Sport

Man City 0-2 Man Utd

Manchester United reduced Chelsea's Premiership lead to nine points after a scrappy victory over Manchester City.  Wayne Rooney met Gary Neville's cross to the near post with a low shot, which went in via a deflection off Richard Dunne, to put United ahead. Seven minutes later, the unfortunate Dunne hooked a volley over David James' head and into his own net. Steve McManaman wasted City's best chance when he shot wide from three yards in the first half. In the opening 45 minutes United had looked unlikely to earn the win they needed to maintain any chance of catching Chelsea in the title race. Their approach play was more laboured than patient and they managed to fashion just one chance - a Paul Scholes header over the bar. And City seemed to be content to sit back and try and hit their rivals on the break as the game settled into a tepid pattern. Only Shaun Wright-Phillips appeared capable of interrupting the monotony, looking lively down the right 

In [34]:
parsed_body = TextBlob(article.body)

### Tokenization

In [35]:
parsed_body.words

WordList(['Manchester', 'United', 'reduced', 'Chelsea', "'s", 'Premiership', 'lead', 'to', 'nine', 'points', 'after', 'a', 'scrappy', 'victory', 'over', 'Manchester', 'City', 'Wayne', 'Rooney', 'met', 'Gary', 'Neville', "'s", 'cross', 'to', 'the', 'near', 'post', 'with', 'a', 'low', 'shot', 'which', 'went', 'in', 'via', 'a', 'deflection', 'off', 'Richard', 'Dunne', 'to', 'put', 'United', 'ahead', 'Seven', 'minutes', 'later', 'the', 'unfortunate', 'Dunne', 'hooked', 'a', 'volley', 'over', 'David', 'James', 'head', 'and', 'into', 'his', 'own', 'net', 'Steve', 'McManaman', 'wasted', 'City', "'s", 'best', 'chance', 'when', 'he', 'shot', 'wide', 'from', 'three', 'yards', 'in', 'the', 'first', 'half', 'In', 'the', 'opening', '45', 'minutes', 'United', 'had', 'looked', 'unlikely', 'to', 'earn', 'the', 'win', 'they', 'needed', 'to', 'maintain', 'any', 'chance', 'of', 'catching', 'Chelsea', 'in', 'the', 'title', 'race', 'Their', 'approach', 'play', 'was', 'more', 'laboured', 'than', 'patient', 

### Sentence boundary detection

In [36]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/stefan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [37]:
parsed_body.sentences

[Sentence("Manchester United reduced Chelsea's Premiership lead to nine points after a scrappy victory over Manchester City."),
 Sentence("Wayne Rooney met Gary Neville's cross to the near post with a low shot, which went in via a deflection off Richard Dunne, to put United ahead."),
 Sentence("Seven minutes later, the unfortunate Dunne hooked a volley over David James' head and into his own net."),
 Sentence("Steve McManaman wasted City's best chance when he shot wide from three yards in the first half."),
 Sentence("In the opening 45 minutes United had looked unlikely to earn the win they needed to maintain any chance of catching Chelsea in the title race."),
 Sentence("Their approach play was more laboured than patient and they managed to fashion just one chance - a Paul Scholes header over the bar."),
 Sentence("And City seemed to be content to sit back and try and hit their rivals on the break as the game settled into a tepid pattern."),
 Sentence("Only Shaun Wright-Phillips appea

### Stemming

To perform stemming, we instantiate the SnowballStemmer from the nltk library, call its .stem() method on each token and display tokens that were modified as a result:

In [38]:
# Initialize the stemmer (SnowballStemmer is imported in the first cell).
stemmer = SnowballStemmer('english')

# Stem each word and keep the tokens whose stem differs from the lowercased token.
[(word, stemmer.stem(word)) for word in parsed_body.words
 if word.lower() != stemmer.stem(word)]

[('Manchester', 'manchest'),
 ('United', 'unit'),
 ('reduced', 'reduc'),
 ('points', 'point'),
 ('scrappy', 'scrappi'),
 ('victory', 'victori'),
 ('Manchester', 'manchest'),
 ('City', 'citi'),
 ('Wayne', 'wayn'),
 ('Gary', 'gari'),
 ('Neville', 'nevill'),
 ('deflection', 'deflect'),
 ('Dunne', 'dunn'),
 ('United', 'unit'),
 ('minutes', 'minut'),
 ('unfortunate', 'unfortun'),
 ('Dunne', 'dunn'),
 ('hooked', 'hook'),
 ('James', 'jame'),
 ('wasted', 'wast'),
 ('City', 'citi'),
 ('chance', 'chanc'),
 ('yards', 'yard'),
 ('opening', 'open'),
 ('minutes', 'minut'),
 ('United', 'unit'),
 ('looked', 'look'),
 ('unlikely', 'unlik'),
 ('needed', 'need'),
 ('any', 'ani'),
 ('chance', 'chanc'),
 ('catching', 'catch'),
 ('title', 'titl'),
 ('laboured', 'labour'),
 ('managed', 'manag'),
 ('chance', 'chanc'),
 ('Scholes', 'schole'),
 ('City', 'citi'),
 ('seemed', 'seem'),
 ('try', 'tri'),
 ('rivals', 'rival'),
 ('settled', 'settl'),
 ('Only', 'onli'),
 ('Wright-Phillips', 'wright-phillip'),
 ('appear
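
As a compact illustration of the behaviour seen in this output, the following sketch stems a handful of tokens taken from the article and keeps only those that change. The helper name `changed_by_stemming` is hypothetical; it calls `.stem()` once per token instead of twice.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')


def changed_by_stemming(tokens):
    """Return (token, stem) pairs where the stem differs from the lowercased token."""
    pairs = ((token, stemmer.stem(token)) for token in tokens)
    return [(token, stem) for token, stem in pairs if token.lower() != stem]


# A few tokens from the article; 'a' and 'shot' are left untouched by the stemmer.
sample = ['United', 'victory', 'a', 'shot', 'minutes']
print(changed_by_stemming(sample))
```

Stems are lowercased and are not necessarily dictionary words, which is why 'victory' ends up as 'victori'.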

### Lemmatization

In [39]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/stefan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [40]:
[(word, word.lemmatize()) for word in parsed_body.words
 if word != word.lemmatize()]

[('points', 'point'),
 ('minutes', 'minute'),
 ('yards', 'yard'),
 ('minutes', 'minute'),
 ('was', 'wa'),
 ('rivals', 'rival'),
 ('as', 'a'),
 ('problems', 'problem'),
 ('feet', 'foot'),
 ('has', 'ha'),
 ('was', 'wa'),
 ('was', 'wa'),
 ('was', 'wa'),
 ('was', 'wa'),
 ('was', 'wa'),
 ('minutes', 'minute'),
 ('boss', 'bos'),
 ('was', 'wa'),
 ('chances', 'chance'),
 ('was', 'wa'),
 ('boss', 'bos'),
 ('was', 'wa'),
 ('months', 'month'),
 ('winners', 'winner'),
 ('times', 'time'),
 ('games', 'game'),
 ('was', 'wa')]

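Lemmatization relies on part-of-speech (POS) tagging. Without a `pos` argument, `.lemmatize()` treats every token as a noun, which explains artifacts such as 'was' → 'wa' in the output above. `spaCy` performs POS tagging as part of its pipeline; here we instead make an assumption, e.g. that each token is a verb.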

In [41]:
[(word, word.lemmatize(pos='v')) for word in parsed_body.words
 if word != word.lemmatize(pos='v')]

[('reduced', 'reduce'),
 ('points', 'point'),
 ('met', 'meet'),
 ('shot', 'shoot'),
 ('went', 'go'),
 ('hooked', 'hook'),
 ('wasted', 'waste'),
 ('shot', 'shoot'),
 ('opening', 'open'),
 ('had', 'have'),
 ('looked', 'look'),
 ('needed', 'need'),
 ('catching', 'catch'),
 ('was', 'be'),
 ('laboured', 'labour'),
 ('managed', 'manage'),
 ('seemed', 'seem'),
 ('rivals', 'rival'),
 ('settled', 'settle'),
 ('appeared', 'appear'),
 ('interrupting', 'interrupt'),
 ('looking', 'look'),
 ('causing', 'cause'),
 ('found', 'find'),
 ('embarrassed', 'embarrass'),
 ('took', 'take'),
 ('delivered', 'deliver'),
 ('demonstrated', 'demonstrate'),
 ('has', 'have'),
 ('scored', 'score'),
 ('was', 'be'),
 ('forced', 'force'),
 ('came', 'come'),
 ('caused', 'cause'),
 ('looked', 'look'),
 ('was', 'be'),
 ('being', 'be'),
 ('marshalled', 'marshal'),
 ('was', 'be'),
 ('poured', 'pour'),
 ('was', 'be'),
 ('renewed', 'renew'),
 ('delivered', 'deliver'),
 ('showed', 'show'),
 ('needed', 'need'),
 ('was', 'be'),
 (
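
Rather than assuming one part of speech for every token, the tags that TextBlob assigns can drive the lemmatizer. The sketch below is one possible approach, not the notebook's method: `penn_to_wordnet` and `lemmatize_with_pos` are hypothetical helpers, and the second one assumes that the NLTK tagger and WordNet data are already downloaded.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun and everything else


def lemmatize_with_pos(text):
    """Lemmatize each token using the POS tag assigned by TextBlob."""
    # Imported lazily: tagging needs the NLTK tagger and WordNet corpora (assumption).
    from textblob import TextBlob
    blob = TextBlob(text)
    return [(str(word), word.lemmatize(penn_to_wordnet(tag)))
            for word, tag in blob.tags]
```

With the data in place, `lemmatize_with_pos(article.body)` would return one `(token, lemma)` pair per tagged token.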

### Sentiment & Polarity

TextBlob provides polarity and subjectivity estimates for parsed documents using lexicons provided by the Pattern library. These lexicons map adjectives frequently found in product reviews to sentiment polarity scores, ranging from -1 to +1 (negative ↔ positive), and to a similar subjectivity score (objective ↔ subjective).

The `.sentiment` attribute provides the average of each score over the relevant tokens, whereas the `.sentiment_assessments` attribute lists the underlying values for each token.

In [42]:
parsed_body.sentiment

Sentiment(polarity=0.08138227513227513, subjectivity=0.45500264550264563)

In [43]:
parsed_body.sentiment_assessments

Sentiment(polarity=0.08138227513227513, subjectivity=0.45500264550264563, assessments=[(['cross'], 0.0, 0.0, None), (['near'], 0.1, 0.4, None), (['low'], 0.0, 0.3, None), (['later'], 0.0, 0.0, None), (['unfortunate'], -0.5, 1.0, None), (['own'], 0.6, 1.0, None), (['net'], 0.0, 0.0, None), (['wasted'], -0.2, 0.0, None), (['best'], 1.0, 0.3, None), (['wide'], -0.1, 0.4, None), (['first'], 0.25, 0.3333333333333333, None), (['half'], -0.16666666666666666, 0.16666666666666666, None), (['unlikely'], -0.5, 0.5, None), (['win'], 0.8, 0.4, None), (['catching'], 0.6, 0.9, None), (['more'], 0.5, 0.5, None), (['back'], 0.0, 0.0, None), (['game'], -0.4, 0.4, None), (['only'], 0.0, 1.0, None), (['capable'], 0.2, 0.4, None), (['lively', 'down'], -0.15555555555555559, 0.2888888888888889, None), (['right'], 0.2857142857142857, 0.5357142857142857, None), (['difficult'], -0.5, 1.0, None), (['near'], 0.1, 0.4, None), (['past'], -0.25, 0.25, None), (['former'], 0.0, 0.0, None), (['easy'], 0.433333333333333
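
To make the relationship between the two attributes concrete, the sketch below recomputes document-level scores from a few assessment tuples copied from the output above. It assumes the document scores are an unweighted mean of the per-assessment values, as the description suggests. The function name `average_sentiment` is hypothetical, and returning zeros for an empty list is a choice made for this sketch.

```python
from statistics import mean


def average_sentiment(assessments):
    """Average polarity and subjectivity over
    (tokens, polarity, subjectivity, label) tuples."""
    if not assessments:
        return 0.0, 0.0  # choice made for this sketch
    polarity = mean(a[1] for a in assessments)
    subjectivity = mean(a[2] for a in assessments)
    return polarity, subjectivity


# Three assessments taken from the article's output.
subset = [
    (['unfortunate'], -0.5, 1.0, None),
    (['best'], 1.0, 0.3, None),
    (['win'], 0.8, 0.4, None),
]
polarity, subjectivity = average_sentiment(subset)
print(polarity, subjectivity)
```

Applied to the full list in `parsed_body.sentiment_assessments.assessments`, the same function should reproduce the values in `parsed_body.sentiment` if the unweighted-mean assumption holds.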

### Combine TextBlob Lemmatization with `CountVectorizer`

In [28]:
def lemmatizer(text):
    words = TextBlob(text.lower()).words
    return [word.lemmatize() for word in words]

In [29]:
vectorizer = CountVectorizer(analyzer=lemmatizer, decode_error='replace')
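
The sketch below shows how a callable `analyzer` plugs into `CountVectorizer` on two toy sentences. To keep it runnable without WordNet data, it swaps the TextBlob lemmatizer for a crude stand-in, `toy_analyzer`, that only strips a trailing plural 's'. The stand-in and the toy documents are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer


def toy_analyzer(text):
    """Stand-in for the lemmatizer: lowercase, split on whitespace,
    and strip a trailing 's' from tokens longer than three characters."""
    tokens = text.lower().split()
    return [t[:-1] if len(t) > 3 and t.endswith('s') else t for t in tokens]


toy_docs = ['The cats chased the dogs', 'A dog chased a cat']
toy_vectorizer = CountVectorizer(analyzer=toy_analyzer)
dtm = toy_vectorizer.fit_transform(toy_docs)

print(sorted(toy_vectorizer.vocabulary_))  # the learned vocabulary
print(dtm.toarray())                       # one row per document, one column per term
```

When `analyzer` is a callable, it replaces the built-in preprocessing, tokenization, stop-word filtering and n-gram steps, so any lowercasing or filtering has to happen inside the function, as `lemmatizer` does with `text.lower()`.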