# Classifying documents

In this recipe, we will build a Naive Bayes classifier that can be used to classify documents. For this exercise, we will use Rich Site Summary (RSS) feeds as the source of documents.

The list of categories is known ahead of time, which is important for the classification task.
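
As a reminder of what the classifier computes, the standard Naive Bayes decision rule can be written as below. The symbols are introduced here for illustration only: $C$ is the set of known categories, $f_1, \dots, f_n$ are the features extracted from a document, and $\hat{c}$ is the predicted category. This is the textbook form, not a description of any library's internals.

```latex
\hat{c} = \arg\max_{c \in C} \; P(c) \prod_{i=1}^{n} P(f_i \mid c)
```

The product assumes the features are conditionally independent given the category, which is the "naive" part of the method.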

In [13]:
# install required libraries
!pip3 install feedparser --user



In [14]:
# import required libraries
import nltk
import random
import feedparser
from sklearn.model_selection import train_test_split

In [15]:
# we'll use two pre-categorized RSS feeds from Yahoo! Sports
urls = {
    'mlb': 'https://sports.yahoo.com/mlb/rss.xml',
    'nfl': 'https://sports.yahoo.com/nfl/rss.xml'
}

In [16]:
# initialize an empty dictionary that keeps the downloaded RSS feeds in memory until the program terminates
feedmap = {}

In [17]:
# get a list of English stopwords (run nltk.download('stopwords') once if the corpus is not installed)
stopwords = nltk.corpus.stopwords.words('english')

In [18]:
# define a function that creates features from words (a dictionary where key: word, value: True)
def featureExtractor(words):
    features = {}
    for word in words:
        # filter out English stopwords and build the feature dictionary
        # (the comparison is case-sensitive, so capitalized words such as 'He' are kept)
        if word not in stopwords:
            features["word({})".format(word)] = True
    return features
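
The extractor above compares raw tokens against a lowercase stopword list, so tokens such as `He` or `yet,"` pass through unchanged. A minimal sketch of a normalizing variant follows. The function name `normalized_feature_extractor` and the tiny `SAMPLE_STOPWORDS` set are hypothetical and exist only so the example runs without NLTK data; in the notebook you would pass the NLTK stopword list instead.

```python
import string

# Hypothetical, deliberately tiny stopword set so the sketch runs without NLTK data.
SAMPLE_STOPWORDS = {"he", "and", "with", "the", "his", "this", "but", "will"}

def normalized_feature_extractor(words, stopwords=SAMPLE_STOPWORDS):
    """Like featureExtractor, but lowercases and strips surrounding punctuation first."""
    features = {}
    for word in words:
        # remove leading/trailing punctuation, then lowercase
        token = word.strip(string.punctuation).lower()
        # skip empty tokens and stopwords
        if token and token not in stopwords:
            features["word({})".format(token)] = True
    return features

sample = ['He', 'called', 'this', 'his', '"best', 'year', 'yet,"', 'MVP.']
print(normalized_feature_extractor(sample))
```

Whether normalization helps depends on the data: lowercasing merges `Reds` with `reds`, which can remove a useful signal in sports text, so it is worth comparing both variants on the test set.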

In [19]:
# define an empty list to store the labeled sentences
sentences = []

In [20]:
# iterate over all dictionary url keys
for category in urls:
    # download the feed and store the result as the value for feedmap[category]
    feedmap[category] = feedparser.parse(urls[category])
    
    # display the url being downloaded
    print("downloading {}".format(urls[category]))
    
    # iterate over all RSS 'entries'
    for entry in feedmap[category]['entries']:
        # get 'summary' data from 'entries' list for each feed
        data = entry['summary']
        # get the list of words from 'summary'
        words = data.split()
        # append a tuple of (category, list of word tokens) to the sentences list
        sentences.append((category, words))

downloading https://sports.yahoo.com/mlb/rss.xml
downloading https://sports.yahoo.com/nfl/rss.xml
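
The loop above assumes every entry has a plain-text `summary`. Feed entries may lack that field or contain HTML markup, so a defensive variant can be sketched as follows. The helper name `summary_words` and the sample entries are hypothetical; plain dictionaries stand in for the parsed feed entries, which are assumed to support `.get` like a dictionary.

```python
import html
import re

# matches anything that looks like an HTML tag, e.g. <p> or </a>
TAG_RE = re.compile(r"<[^>]+>")

def summary_words(entry):
    """Return whitespace-separated tokens from an entry's summary, or [] if it has none."""
    raw = entry.get("summary", "")
    # drop tags first, then decode entities such as &amp;
    text = html.unescape(TAG_RE.sub(" ", raw))
    return text.split()

# Hypothetical entries standing in for parsed feed results
entries = [
    {"title": "A", "summary": "<p>Angels miss the playoffs &amp; more</p>"},
    {"title": "B"},  # no summary field
]
for entry in entries:
    print(summary_words(entry))
```

Entries that yield an empty token list could be skipped before they are appended to `sentences`, so that the classifier is not trained on empty feature sets.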


In [21]:
# check the format of the first item in the sentences list
sentences[:1]

[('mlb',
  ['Mike',
   'Trout',
   'called',
   'this',
   'season',
   'his',
   '"best',
   'year',
   'yet,"',
   'but',
   'his',
   'Angels',
   'will',
   'miss',
   'the',
   'playoffs',
   'again',
   'and',
   'he',
   'could',
   'lose',
   'out',
   'on',
   'another',
   'MVP.'])]

In [22]:
# check the number of items in the sentences list
len(sentences)

93

In [23]:
# next, extract the features of each sentence and store them in a list
featureset = [(featureExtractor(words), category) for category, words in sentences]

In [24]:
# check the first 5 items in featureset
featureset[:5]

[({'word(Mike)': True,
   'word(Trout)': True,
   'word(called)': True,
   'word(season)': True,
   'word("best)': True,
   'word(year)': True,
   'word(yet,")': True,
   'word(Angels)': True,
   'word(miss)': True,
   'word(playoffs)': True,
   'word(could)': True,
   'word(lose)': True,
   'word(another)': True,
   'word(MVP.)': True},
  'mlb'),
 ({'word(And)': True,
   'word(one)': True,
   'word(belongs)': True,
   'word(Marty)': True,
   'word(Brennaman.)': True,
   'word(With)': True,
   'word(fans)': True,
   'word(applauding)': True,
   'word(every)': True,
   'word(mention,)': True,
   'word(Hall)': True,
   'word(Fame)': True,
   'word(broadcaster)': True,
   'word(called)': True,
   'word(final)': True,
   'word(Cincinnati)': True,
   'word(Reds)': True,
   'word(game)': True,
   'word(Thursday,)': True,
   'word(ending)': True,
   'word(46-year)': True,
   'word(career)': True,
   "word(that's)": True,
   'word(featured)': True,
   'word(many)': True,
   'word(big)': True,


In [25]:
# shuffle the data in featureset
random.shuffle(featureset)

In [26]:
# next, we split the data into 70% training data and 30% test data
train_data, test_data = train_test_split(featureset, train_size=0.7, random_state=42)

In [27]:
len(train_data)

65

In [28]:
len(test_data)

28
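
With only 93 items, a random split can leave the two categories unevenly represented in either part. A quick check is to count labels per split, sketched below with hypothetical toy data. The helper name `label_counts` is an assumption for illustration.

```python
from collections import Counter

def label_counts(labeled_features):
    """Count how many (features, label) pairs carry each label."""
    return Counter(label for _features, label in labeled_features)

# Hypothetical toy data in the same (features, label) shape as featureset
toy = [
    ({'word(a)': True}, 'mlb'),
    ({'word(b)': True}, 'nfl'),
    ({'word(c)': True}, 'mlb'),
]
print(label_counts(toy))
```

In the notebook, calling this on `train_data` and `test_data` would show the balance of each split. If the split turns out skewed, `train_test_split` accepts a `stratify` argument that keeps label proportions similar in both parts.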

In [29]:
# next, we create a NaiveBayes classifier and train using train_data
classifier = nltk.NaiveBayesClassifier.train(train_data)

In [30]:
# next, check classifier accuracy using the test_data
print("Classifier accuracy: ",nltk.classify.accuracy(classifier, test_data))

Classifier accuracy:  0.8571428571428571


Looking at the result, we can see that our classifier reached an accuracy of about 86% on the test data, that is, 24 of the 28 test items were classified correctly.
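
The accuracy reported above is the fraction of test items whose predicted label matches the true label. A minimal sketch of that calculation follows. The function name `accuracy` and the label lists are hypothetical, chosen so that 24 of 28 predictions are correct, matching the size of the test set.

```python
def accuracy(predicted, gold):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Hypothetical labels: 28 items, 24 predicted correctly
gold = ['mlb'] * 14 + ['nfl'] * 14
predicted = ['mlb'] * 12 + ['nfl'] * 2 + ['nfl'] * 12 + ['mlb'] * 2
print(accuracy(predicted, gold))
```

On a test set this small, each misclassified item moves the score by roughly 3.6 percentage points, so the figure should be read as a rough estimate.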

In [31]:
# print the 5 most informative features
classifier.show_most_informative_features(5)

Most Informative Features
           word(Chicago) = True              mlb : nfl    =      3.6 : 1.0
                word(He) = True              nfl : mlb    =      3.1 : 1.0
         word(Wednesday) = True              mlb : nfl    =      2.9 : 1.0
              word(said) = True              nfl : mlb    =      2.4 : 1.0
              word(2019) = True              nfl : mlb    =      2.4 : 1.0
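
Each ratio in the table compares how likely a feature is under one category versus the other. The sketch below shows one way such a ratio can be computed from raw counts, assuming add-0.5 smoothing over two possible feature values (present or absent). The smoothing choice, the function names, and the counts are assumptions for illustration and are not taken from the notebook.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Smoothed estimate of P(feature present | label); gamma and bins are assumed values."""
    return (count + gamma) / (total + bins * gamma)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of P(feature | label A) to P(feature | label B)."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Hypothetical counts: the word appears in 5 of 30 mlb items and 1 of 35 nfl items
ratio = likelihood_ratio(5, 30, 1, 35)
print("mlb : nfl = {:.1f} : 1.0".format(ratio))
```

Smoothing keeps the ratio finite when a word never appears in one category, which is common with a training set of only 65 items.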


In [33]:
# next, we predict the category from the 'title' of the first 4 entries in the nfl RSS feed
for entry in feedmap['nfl']['entries'][:4]:
    # extract features from the 'title' data
    features = featureExtractor(entry['title'].split())
    # predict the category name
    category = classifier.classify(features)
    print('{} -> {}'.format(category, entry['title']))

nfl -> Giant belief: How Jones handled draft-night drama
nfl -> Chiefs' Patrick Mahomes more comfortable with his emotions
mlb -> Bruce Irvin set to play for Panthers this week
mlb -> Early bye week comes at right time for struggling Jets
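
Two of the four NFL titles above were labeled `mlb`. One plausible contributing factor, stated here as an assumption rather than something the output proves, is that titles are short and may share few words with the summaries used for training. The sketch below measures that overlap. The function name `known_feature_fraction` and the toy data are hypothetical.

```python
def known_feature_fraction(features, training_featuresets):
    """Fraction of a document's features that also occur in the training data."""
    vocabulary = set()
    for feats, _label in training_featuresets:
        vocabulary.update(feats)  # adds the feature names (dictionary keys)
    if not features:
        return 0.0
    known = sum(1 for name in features if name in vocabulary)
    return known / len(features)

# Hypothetical training data and title features
training = [
    ({'word(Chiefs)': True, 'word(week)': True}, 'nfl'),
    ({'word(Angels)': True}, 'mlb'),
]
title_features = {
    'word(Early)': True, 'word(bye)': True,
    'word(week)': True, 'word(Jets)': True,
}
print(known_feature_fraction(title_features, training))
```

A low fraction for a given title would suggest that the prediction rests on very little evidence, in which case training on titles as well as summaries may be worth trying.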
