# Text Classification

# What is Text Classification?

Text (or document) classification assigns a category to a piece of text. The text can be a document, a tweet, a simple message, an email, and so on.

# Part 1: A Tweet Sentiment Analyzer (Simple classification)

Our first classifier will be a simple sentiment analyzer trained on a small dataset of fake tweets.

To begin, we'll import NaiveBayesClassifier from textblob.classifiers and create some training and test data.

* To install TextBlob:


* pip install -U textblob nltk


In [7]:
from textblob.classifiers import NaiveBayesClassifier

In [8]:
train = [
    ('I love this sandwich.', 'pos'),
    ('This is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('This is my best work.', 'pos'),
    ("What an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('He is my sworn enemy!', 'neg'),
    ('My boss is horrible.', 'neg')
]
test = [
    ('The beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg')
]

We create a new classifier by passing training data into the constructor for a NaiveBayesClassifier.

In [9]:
cl = NaiveBayesClassifier(train)

We can now classify arbitrary text using the NaiveBayesClassifier.classify(text) method.

In [12]:
cl.classify("Their burgers are amazing")  # "pos"


'pos'

In [13]:
cl.classify("I don't like their pizza.")  

'neg'

Another way to classify strings of text is to use TextBlob objects. We can pass classifiers into the constructor of a TextBlob.

In [14]:
from textblob import TextBlob
blob = TextBlob("The beer was amazing. "
                "But the hangover was horrible. My boss was not happy.",
                classifier=cl)

In [15]:
# We can then call the classify() method on the blob.

blob.classify() 

'neg'

We can also take advantage of TextBlob's sentence tokenization and classify each sentence individually.

In [16]:
for sentence in blob.sentences:
    print(sentence)
    print(sentence.classify())

The beer was amazing.
pos
But the hangover was horrible.
neg
My boss was not happy.
neg


Let's check the accuracy on the test set.

In [17]:
cl.accuracy(test)

0.8333333333333334
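Accuracy here is simply the fraction of test examples whose predicted label matches the true label. A minimal sketch of that calculation, assuming a hypothetical list of predictions with one mistake (these are not the classifier's actual per-example outputs, which the notebook does not show):

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# True labels of the six test tweets, in order
actual = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']
# Hypothetical predictions with exactly one error (illustration only)
predicted = ['pos', 'neg', 'neg', 'pos', 'neg', 'neg']

acc = accuracy(predicted, actual)
print(acc)  # 5 correct out of 6
```

With only six test examples, a single misclassification moves the score by about 17 percentage points, so the figure should be read loosely.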

We can also find the most informative features:

In [18]:
cl.show_informative_features(5)

Most Informative Features
          contains(this) = True              neg : pos    =      2.3 : 1.0
          contains(this) = False             pos : neg    =      1.8 : 1.0
          contains(This) = False             neg : pos    =      1.6 : 1.0
            contains(an) = False             neg : pos    =      1.6 : 1.0
             contains(I) = True              neg : pos    =      1.4 : 1.0


This indicates that, in this small training set, tweets containing the lowercase word "this" tend to be negative, while tweets that do not contain it tend to be positive.
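The ratios in the table can be reproduced by hand. The sketch below assumes add-0.5 (expected likelihood) smoothing over the two feature values True and False, and uses a simple regex tokenizer in place of TextBlob's own. The helper `feature_prob` is a hypothetical name introduced for illustration, not part of TextBlob or NLTK.

```python
import re

train = [
    ('I love this sandwich.', 'pos'),
    ('This is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('This is my best work.', 'pos'),
    ("What an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('He is my sworn enemy!', 'neg'),
    ('My boss is horrible.', 'neg')
]

def feature_prob(word, present, label, data, gamma=0.5):
    """Smoothed estimate of P(contains(word) == present | label)."""
    # Case-sensitive token sets for every training text with this label
    docs = [set(re.findall(r"\w+", text)) for text, lab in data if lab == label]
    matches = sum((word in tokens) == present for tokens in docs)
    # Two possible feature values (True/False), hence 2 * gamma below
    return (matches + gamma) / (len(docs) + 2 * gamma)

neg_vs_pos = (feature_prob("this", True, "neg", train)
              / feature_prob("this", True, "pos", train))
pos_vs_neg = (feature_prob("this", False, "pos", train)
              / feature_prob("this", False, "neg", train))
print(round(neg_vs_pos, 1), round(pos_vs_neg, 1))  # 2.3 1.8
```

Lowercase "this" appears in three of the five negative tweets but in only one positive tweet, which gives (3.5 / 6) / (1.5 / 6) ≈ 2.3 after smoothing.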

# Full Script

In [19]:
from textblob.classifiers import NaiveBayesClassifier
from textblob import TextBlob

train = [
    ('I love this sandwich.', 'pos'),
    ('This is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('This is my best work.', 'pos'),
    ("What an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('He is my sworn enemy!', 'neg'),
    ('My boss is horrible.', 'neg')
]
test = [
    ('The beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg')
]

cl = NaiveBayesClassifier(train)

# Classify some text
print(cl.classify("Their burgers are amazing."))  # "pos"
print(cl.classify("I don't like their pizza."))   # "neg"

# Classify a TextBlob
blob = TextBlob("The beer was amazing. But the hangover was horrible. "
                "My boss was not pleased.", classifier=cl)
print(blob)
print(blob.classify())

for sentence in blob.sentences:
    print(sentence)
    print(sentence.classify())

# Compute accuracy
print("Accuracy: {0}".format(cl.accuracy(test)))

# Show 5 most informative features
cl.show_informative_features(5)

pos
neg
The beer was amazing. But the hangover was horrible. My boss was not pleased.
neg
The beer was amazing.
pos
But the hangover was horrible.
neg
My boss was not pleased.
neg
Accuracy: 0.8333333333333334
Most Informative Features
          contains(this) = True              neg : pos    =      2.3 : 1.0
          contains(this) = False             pos : neg    =      1.8 : 1.0
          contains(This) = False             neg : pos    =      1.6 : 1.0
            contains(an) = False             neg : pos    =      1.6 : 1.0
             contains(I) = True              neg : pos    =      1.4 : 1.0


# Part 2: Adding More Data from NLTK

We can improve our classifier by adding more training and test data. Here we'll add data from the movie review corpus, which was downloaded along with NLTK's corpora.

In [20]:
import random
from nltk.corpus import movie_reviews

reviews = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]
new_train, new_test = reviews[0:100], reviews[101:200]

Let's see what one of these documents looks like.

In [21]:
print(new_train[0])

(['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.', 'they', 'get', 'into', 'an', 'accident', '.', 'one', 'of', 'the', 'guys', 'dies', ',', 'but', 'his', 'girlfriend', 'continues', 'to', 'see', 'him', 'in', 'her', 'life', ',', 'and', 'has', 'nightmares', '.', 'what', "'", 's', 'the', 'deal', '?', 'watch', 'the', 'movie', 'and', '"', 'sorta', '"', 'find', 'out', '.', '.', '.', 'critique', ':', 'a', 'mind', '-', 'fuck', 'movie', 'for', 'the', 'teen', 'generation', 'that', 'touches', 'on', 'a', 'very', 'cool', 'idea', ',', 'but', 'presents', 'it', 'in', 'a', 'very', 'bad', 'package', '.', 'which', 'is', 'what', 'makes', 'this', 'review', 'an', 'even', 'harder', 'one', 'to', 'write', ',', 'since', 'i', 'generally', 'applaud', 'films', 'which', 'attempt', 'to', 'break', 'the', 'mold', ',', 'mess', 'with', 'your', 'head', 'and', 'such', '(', 'lost', 'highway', '&', 'memento', ')', ',', 'but', 'there', 'are', 'good', 'and', 'b

Notice that unlike the data in Part 1, the text comes as a list of words instead of a single string. TextBlob is smart about this; it will treat both forms of data as expected.

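We can now update our classifier with the new training data using the update(new_data) method, and test it on the larger test dataset. Note that the slices above skip reviews[100], so new_test holds 99 reviews and the combined test set has 105 examples.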

In [22]:
cl.update(new_train)
accuracy = cl.accuracy(test + new_test)

In [23]:
accuracy

0.9714285714285714
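This figure deserves some caution. The reviews list is built category by category, so a slice of its first 200 entries most likely contains reviews from a single category, and the imported random module is never used. Shuffling before slicing is one way to get both labels into each split. The sketch below uses a stand-in list rather than the NLTK corpus. It assumes the corpus yields one category in full before the next and has far more than 200 reviews per category. `shuffled_split` and its `seed` parameter are hypothetical names, not part of TextBlob or NLTK.

```python
import random

def shuffled_split(examples, n_train, n_test, seed=0):
    """Shuffle a copy of the data, then cut disjoint train and test slices."""
    shuffled = list(examples)              # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]

# Stand-in for a category-ordered review list (all 'neg' first, then 'pos')
examples = ([("neg review %d" % i, "neg") for i in range(1000)]
            + [("pos review %d" % i, "pos") for i in range(1000)])

# Slicing the ordered list directly yields a single label
print({label for _, label in examples[:200]})  # {'neg'}

train_part, test_part = shuffled_split(examples, 100, 100)
print(sorted({label for _, label in train_part}))
print(sorted({label for _, label in test_part}))
```

With the real corpus, the same helper could be applied to reviews before calling cl.update(), though the accuracy reported above would then change.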

# Part 3: Language Detector (Custom Feature Extraction)

One important aspect we haven't covered yet is how features are extracted from the text.

For a given document and training set train, TextBlob's default behavior is to compute which words in train are present in document. For example, the sentence "It's just a flesh wound." might have features contains(flesh): True, contains(wound): True, and contains(knight): False.
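A rough approximation of this default behavior can be written in a few lines. This is only a sketch: it splits on whitespace and strips punctuation, whereas TextBlob's own tokenization may differ. The function name `contains_extractor` and the two training sentences are hypothetical.

```python
import string

def contains_extractor(document, train_set):
    """Map every training-set word to whether it occurs in the document."""
    # Vocabulary: every distinct word seen in the training texts
    vocabulary = {w.strip(string.punctuation)
                  for text, _ in train_set for w in text.split()}
    vocabulary.discard("")
    # Words present in the document being classified
    tokens = {w.strip(string.punctuation) for w in document.split()}
    return {"contains({0})".format(word): (word in tokens)
            for word in vocabulary}

# Hypothetical training data, for illustration only
train_set = [
    ("The knight guards the bridge", "quest"),
    ("A flesh wound is nothing", "boast"),
]

feats = contains_extractor("It's just a flesh wound.", train_set)
print(feats["contains(flesh)"], feats["contains(wound)"], feats["contains(knight)"])
# True True False
```

Every word in the training vocabulary becomes a feature, including the ones absent from the document. That is why the informative-features tables list entries with the value False.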

Of course, this simple feature extractor may not be appropriate for all problems. Here we'll create a custom feature extractor for a language detector.

Here's the training and test data.

In [24]:
train = [
    ("amor", "spanish"),
    ("perro", "spanish"),
    ("playa", "spanish"),
    ("sal", "spanish"),
    ("oceano", "spanish"),
    ("love", "english"),
    ("dog", "english"),
    ("beach", "english"),
    ("salt", "english"),
    ("ocean", "english")
]
test = [
    ("ropa", "spanish"),
    ("comprar", "spanish"),
    ("camisa", "spanish"),
    ("agua", "spanish"),
    ("telefono", "spanish"),
    ("clothes", "english"),
    ("buy", "english"),
    ("shirt", "english"),
    ("water", "english"),
    ("telephone", "english")
]

A feature extractor is simply a function that takes an argument text (the text to extract features from) and returns a dictionary of features.

Let's create a very simple extractor that uses the last letter of a given word as its only feature.

In [25]:
def extractor(word):
    feats = {}
    last_letter = word[-1]
    feats["last_letter({0})".format(last_letter)] = True
    return feats

print(extractor("python"))  # {'last_letter(n)': True}

{'last_letter(n)': True}


We can pass this feature extractor as the second argument to the constructor of a NaiveBayesClassifier.

In [26]:
lang_detector = NaiveBayesClassifier(train, feature_extractor=extractor)

Once again, we can compute the accuracy and show the most informative features.

In [27]:
lang_detector.accuracy(test)

0.7

In [28]:
lang_detector.show_informative_features(5)

Most Informative Features
          last_letter(o) = None           englis : spanis =      1.6 : 1.0
          last_letter(h) = None           spanis : englis =      1.2 : 1.0
          last_letter(r) = None           englis : spanis =      1.2 : 1.0
          last_letter(e) = None           spanis : englis =      1.2 : 1.0
          last_letter(t) = None           spanis : englis =      1.2 : 1.0


Not surprisingly, words that do not end with the letter "o" tend to be English. (A value of None means the feature is absent for that word.)
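A single last-letter feature is a very thin signal. As a hedged illustration of how an extractor can return several features at once, the hypothetical `suffix_extractor` below adds the last two letters as well. It is a sketch, not a tuned language detector, and no accuracy figure is claimed for it.

```python
def suffix_extractor(word):
    """Hypothetical extractor: last letter plus last two letters of the word."""
    word = word.lower()
    feats = {"last_letter({0})".format(word[-1]): True}
    if len(word) >= 2:
        # Two-letter endings such as "no" or "ar" carry more context
        feats["last_two({0})".format(word[-2:])] = True
    return feats

print(suffix_extractor("telefono"))
# {'last_letter(o)': True, 'last_two(no)': True}
```

It could be passed to the classifier in the same way as before, via feature_extractor=suffix_extractor.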

# Conclusion

TextBlob makes it easy to create our own custom text classifiers.