# News Categorization using Multinomial Naive Bayes

The objective of this site is to show how to use the Multinomial Naive Bayes method to classify news stories into a set of predefined classes.

The News Aggregator Data Set comes from the UCI Machine Learning Repository. 

This dataset contains headlines, URLs, and categories for 422,937 news stories collected by a web aggregator between March 10th, 2014 and August 10th, 2014. News categories in this dataset are labelled:

* b: business; 
* t: science and technology; 
* e: entertainment; and 
* m: health. 

Using the Multinomial Naive Bayes method, we will try to predict the category (business, entertainment, etc.) of a news article given only its headline.
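Before turning to the library, here is a minimal pure-Python sketch of what a multinomial Naive Bayes classifier computes: a class prior plus a sum of smoothed per-word log-probabilities. The toy headlines, the helper names `log_posterior` and `predict`, and the choice to skip unseen words are assumptions made for illustration; this is not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy headlines, not taken from the dataset.
train = [
    ("stocks fall as fed hints at rate rise", "b"),
    ("bank profits rise on strong trading", "b"),
    ("new phone launch draws crowds", "t"),
    ("software update fixes phone security bug", "t"),
]

alpha = 1.0  # Laplace (add-one) smoothing strength

class_docs = Counter()              # number of headlines per class
word_counts = defaultdict(Counter)  # word counts per class
vocab = set()
for text, label in train:
    class_docs[label] += 1
    for word in text.split():
        word_counts[label][word] += 1
        vocab.add(word)

def log_posterior(text, label):
    """Unnormalized log P(label | text) under the multinomial model."""
    total = sum(word_counts[label].values())
    score = math.log(class_docs[label] / len(train))  # log prior
    for word in text.split():
        if word not in vocab:
            continue  # assumption: ignore words never seen in training
        count = word_counts[label][word]
        score += math.log((count + alpha) / (total + alpha * len(vocab)))
    return score

def predict(text):
    # pick the class with the highest unnormalized log posterior
    return max(class_docs, key=lambda label: log_posterior(text, label))

print(predict("fed hints at rate rise"))  # b
print(predict("phone security update"))   # t
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.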

Let's begin by importing the Pandas (Python Data Analysis Library) module. The import statement is the most common way to gain access to the code in another module.

In [1]:
%matplotlib inline
import pandas as pd 

This way we can refer to pandas by its alias 'pd'. Let's load the News Aggregator data with Pandas:

In [2]:
df = pd.read_csv('uci-news-aggregator.csv')
df.head()

| | ID | TITLE | URL | PUBLISHER | CATEGORY | STORY | HOSTNAME | TIMESTAMP |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Fed official says weak data caused by weather,... | http://www.latimes.com/business/money/la-fi-mo... | Los Angeles Times | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.latimes.com | 1394470370698 |
| 1 | 2 | Fed's Charles Plosser sees high bar for change... | http://www.livemint.com/Politics/H2EvwJSK2VE6O... | Livemint | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.livemint.com | 1394470371207 |
| 2 | 3 | US open: Stocks fall after Fed official hints ... | http://www.ifamagazine.com/news/us-open-stocks... | IFA Magazine | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371550 |
| 3 | 4 | Fed risks falling 'behind the curve', Charles ... | http://www.ifamagazine.com/news/fed-risks-fall... | IFA Magazine | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371793 |
| 4 | 5 | Fed's Plosser: Nasty Weather Has Curbed Job Gr... | http://www.moneynews.com/Economy/federal-reser... | Moneynews | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.moneynews.com | 1394470372027 |


The head function returns the first 5 rows of the DataFrame (or the first 5 items when called on a single column). Next, let's count how many headlines fall into each category:

In [3]:
df.CATEGORY.value_counts()

e    152469
b    115967
t    108344
m     45639
Name: CATEGORY, dtype: int64

### Function to normalize text

In [4]:
import re

def normalize_text(s):
    s = s.lower()

    # remove punctuation that touches whitespace, keeping word-internal
    # punctuation (e.g., hyphens, apostrophes); raw strings avoid
    # invalid-escape warnings for \s and \W
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)

    # make sure we didn't introduce any double spaces
    s = re.sub(r'\s+', ' ', s)

    return s

In [5]:
df['TEXT'] = [normalize_text(s) for s in df['TITLE']]
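To see what the normalization does and does not remove, here is a small self-contained check. The function is repeated so the snippet runs on its own, and the second input string is a made-up example rather than a headline from the dataset.

```python
import re

def normalize_text(s):
    s = s.lower()
    s = re.sub(r'\s\W', ' ', s)   # punctuation right after whitespace
    s = re.sub(r'\W\s', ' ', s)   # punctuation right before whitespace
    s = re.sub(r'\s+', ' ', s)    # collapse repeated whitespace
    return s

# A headline from the dataset: the colon is dropped, the apostrophe in "fed's" stays.
print(normalize_text("Fed's Plosser: Nasty Weather Has Curbed Job Growth"))
# fed's plosser nasty weather has curbed job growth

# Hypothetical string: each substitution removes one character per match, so the
# closing quote before the comma survives, and so does the final "!" because
# no whitespace follows it.
print(normalize_text("It's a 'quoted phrase', really!"))
# it's a quoted phrase' really!
```

The leftover quote mirrors the `curve'` visible in the normalized TEXT column, so the cleanup is best-effort rather than exhaustive.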

In [22]:
df.head()

| | ID | TITLE | URL | PUBLISHER | CATEGORY | STORY | HOSTNAME | TIMESTAMP | TEXT |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Fed official says weak data caused by weather,... | http://www.latimes.com/business/money/la-fi-mo... | Los Angeles Times | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.latimes.com | 1394470370698 | fed official says weak data caused by weather ... |
| 1 | 2 | Fed's Charles Plosser sees high bar for change... | http://www.livemint.com/Politics/H2EvwJSK2VE6O... | Livemint | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.livemint.com | 1394470371207 | fed's charles plosser sees high bar for change... |
| 2 | 3 | US open: Stocks fall after Fed official hints ... | http://www.ifamagazine.com/news/us-open-stocks... | IFA Magazine | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371550 | us open stocks fall after fed official hints a... |
| 3 | 4 | Fed risks falling 'behind the curve', Charles ... | http://www.ifamagazine.com/news/fed-risks-fall... | IFA Magazine | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371793 | fed risks falling behind the curve' charles pl... |
| 4 | 5 | Fed's Plosser: Nasty Weather Has Curbed Job Gr... | http://www.moneynews.com/Economy/federal-reser... | Moneynews | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.moneynews.com | 1394470372027 | fed's plosser nasty weather has curbed job growth |


In [6]:
df.columns

Index(['ID', 'TITLE', 'URL', 'PUBLISHER', 'CATEGORY', 'STORY', 'HOSTNAME',
       'TIMESTAMP', 'TEXT'],
      dtype='object')

In [7]:
df.CATEGORY.value_counts()

e    152469
b    115967
t    108344
m     45639
Name: CATEGORY, dtype: int64

## Feature Extraction

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(df['TEXT'])

In [23]:
x

<422419x54637 sparse matrix of type '<class 'numpy.int64'>'
	with 3747875 stored elements in Compressed Sparse Row format>
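Each row of this matrix is a bag-of-words count vector for one headline, and each column corresponds to one word in the vocabulary. The pure-Python sketch below approximates that transformation on two made-up headlines. It assumes CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); the helper names are hypothetical, and the real class returns a sparse matrix rather than nested lists.

```python
import re
from collections import Counter

# Assumed equivalent of CountVectorizer's default token pattern:
# runs of two or more word characters.
TOKEN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    return TOKEN.findall(text.lower())

# Hypothetical toy headlines, not taken from the dataset.
docs = [
    "stocks fall as fed hints at rate rise",
    "fed's chief sees rate rise",
]

# One column per distinct token, in sorted order.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

def count_vector(text):
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocabulary]

matrix = [count_vector(doc) for doc in docs]

print(vocabulary)
# ['as', 'at', 'chief', 'fall', 'fed', 'hints', 'rate', 'rise', 'sees', 'stocks']
for row in matrix:
    print(row)
# [1, 1, 0, 1, 1, 1, 1, 1, 0, 1]
# [0, 0, 1, 0, 1, 0, 1, 1, 1, 0]
```

Note that "fed's" contributes only the token "fed": the apostrophe splits the word and the single-character "s" is discarded.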

In [10]:
# LabelEncoder allows us to assign ordinal levels to categorical data
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y = encoder.fit_transform(df['CATEGORY'])

In [24]:
y

array([0, 0, 0, ..., 2, 2, 2])
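LabelEncoder numbers the classes in sorted order, so here b, e, m and t become 0, 1, 2 and 3. The following sketch reproduces that mapping in plain Python on a made-up sample of category codes; the variable names are illustrative only.

```python
# Hypothetical sample of CATEGORY values.
labels = ["b", "t", "e", "m", "e", "b"]

# Classes are numbered in sorted order.
classes = sorted(set(labels))
to_code = {label: code for code, label in enumerate(classes)}

encoded = [to_code[label] for label in labels]
decoded = [classes[code] for code in encoded]  # what inverse_transform does

print(classes)  # ['b', 'e', 'm', 't']
print(encoded)  # [0, 3, 1, 2, 1, 0]
print(decoded)  # ['b', 't', 'e', 'm', 'e', 'b']
```

The integers are only identifiers: the classifier treats them as distinct classes and the numeric order carries no meaning.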

### Train/Test Split

In [11]:
# function to split the data into training and test sets (a single hold-out split)
from sklearn.model_selection import train_test_split
# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [12]:
# take a look at the shape of each of these
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(337935, 54637)
(337935,)
(84484, 54637)
(84484,)


So the x training matrix contains 337935 observations with 54637 features each; the latter is the number of
unique words in the entire collection of headlines. The y training vector contains the 337935 labels associated with
the observations in the x training matrix.
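Because the split is random, the exact rows in each set (and therefore the score) will vary from run to run. As an optional variation, the sketch below passes `random_state` to make the split reproducible and `stratify` to keep the class proportions equal in both sets. The toy data and variable names are assumptions for illustration; the notebook's own split uses neither option.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten samples, two balanced classes.
features = [[i] for i in range(10)]
labels = [0] * 5 + [1] * 5

f_train, f_test, l_train, l_test = train_test_split(
    features,
    labels,
    test_size=0.2,     # 20% of the samples go to the test set
    random_state=0,    # fixed seed so the split is reproducible
    stratify=labels,   # preserve class proportions in both sets
)

print(len(f_train), len(f_test))  # 8 2
print(sorted(l_test))             # [0, 1]
```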

So we're ready to go. Let's make the classifier!

## Naive Bayes


In [13]:
# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(x_train, y_train)


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [14]:
nb.score(x_test, y_test)

0.9269329103735618

Over 92% accuracy on the held-out test set, just by treating word counts as independent features.
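To put that accuracy in context, it helps to compare it with a trivial baseline that always predicts the most frequent category. The sketch below computes that baseline from the category counts printed earlier by `value_counts()`; only the variable names are new.

```python
# Category counts as reported by df.CATEGORY.value_counts() above.
counts = {"e": 152469, "b": 115967, "t": 108344, "m": 45639}

total = sum(counts.values())
majority, majority_count = max(counts.items(), key=lambda kv: kv[1])
baseline = majority_count / total  # accuracy of always guessing the majority class

print(total, majority, round(baseline, 3))  # 422419 e 0.361
```

Always guessing entertainment would be right about 36% of the time over the whole dataset, so the classifier improves on that by a wide margin.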

If you feel like exploring which words are characteristic of each category, you can inspect the per-class word log-probabilities that the fitted Naive Bayes classifier stores in its feature_log_prob_ attribute.

A function to predict the category directly from a title:

In [15]:
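def predict_cat(title):
    # map the single-letter category codes to readable names
    cat_names = {'b' : 'Business', 't' : 'Science and Technology', 'e' : 'Entertainment', 'm' : 'Health'}
    # vectorize the title with the fitted vocabulary and predict its encoded class
    cod = nb.predict(vectorizer.transform([title]))
    # decode the integer class back to its category letter
    return cat_names[encoder.inverse_transform(cod)[0]]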

In [16]:
print(predict_cat("star was seen tv"))
print(predict_cat("stocks are on the rise"))
print(predict_cat("eggs and cholesterol"))
print(predict_cat("US equities just legged lower"))
print(predict_cat("potentially smaller corporate tax cut"))
print(predict_cat("over 23 million consumers who hold subprime"))
print(predict_cat("Roku is down 6% this morning"))
print(predict_cat("We had about nine open investigations of classified leaks"))
print(predict_cat("The last 48 hours has been quite a chaotic"))
print(predict_cat("shortly after the US open failed to spark "))
print(predict_cat("Well that escalated quickly as the USDJPY"))
print(predict_cat("the ailing foundation of the economy provides"))

Entertainment
Business
Health
Business
Business
Business
Science and Technology
Science and Technology
Entertainment
Business
Health
Business
