# Building a category predictor

#### - Goal: determine the category of a given document.
#### - Used in text classification to categorize documents.
#### - Used in search engines.
#### - Requires a corpus of labeled data to train the algorithm.

# To Build Category Predictor:

#### Term Frequency - Inverse Document Frequency (tf-idf): measures the importance of each word in a document.

#### Term Frequency (TF): a measure of how frequently each word appears in a given document.

#### We divide the count of each word by the total number of words in the document to obtain TF.
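
The TF definition above can be sketched in a few lines of plain Python. This is only an illustration: the function name `term_frequency` and the simple whitespace tokenization are assumptions made here, not part of the notebook or of scikit-learn.

```python
from collections import Counter


def term_frequency(document):
    """Return {word: count / total_words} for one document.

    Hypothetical helper: tokenizes by lowercasing and splitting on whitespace.
    """
    words = document.lower().split()
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


# "the" appears 2 times out of 6 words, every other word 1 time out of 6
tf = term_frequency("the car is on the road")
print(tf)
```

The TF values of one document always sum to 1, because every word occurrence is counted exactly once.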

#### Inverse Document Frequency (IDF): a measure of how unique a word is to this document within the given set of documents.

#### This helps us identify the words that are distinctive for each document.

#### To compute IDF, we take the ratio of the number of documents containing the given word to the total number of documents, and then take the negative logarithm of this ratio.
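
As a rough illustration of the IDF description above, here is a minimal sketch in plain Python. The helper name `inverse_document_frequency`, the toy documents, and the whitespace tokenization are assumptions introduced for this example.

```python
import math


def inverse_document_frequency(word, documents):
    """IDF as described above: -log(docs containing word / total docs).

    Hypothetical helper: tokenizes by lowercasing and splitting on whitespace.
    """
    tokenized = [set(doc.lower().split()) for doc in documents]
    containing = sum(1 for words in tokenized if word in words)
    if containing == 0:
        raise ValueError(f"{word!r} does not occur in any document")
    return -math.log(containing / len(documents))


docs = ["the car is fast", "the road is wet", "the puck hit the post"]

# "the" occurs in every document, so the ratio is 1 and the IDF is zero
idf_the = inverse_document_frequency("the", docs)
# "car" occurs in 1 of 3 documents, so the IDF is -log(1/3) = log(3)
idf_car = inverse_document_frequency("car", docs)
```

Words that occur everywhere get an IDF of zero, while rarer words get larger values.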

#### We then combine term frequency and inverse document frequency into a feature vector that is used to categorize documents.
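
One way to picture the combined feature vector is to multiply TF by IDF for every word of a shared vocabulary. The sketch below does this from scratch under the definitions given above; `tfidf_vectors` and the toy documents are hypothetical names used only for illustration, and scikit-learn's own weighting differs in its details.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one tf-idf vector per document.

    Hypothetical helper using TF = count / total words and
    IDF = -log(docs containing word / total docs).
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for words in tokenized for word in words})
    n_docs = len(documents)
    doc_freq = {w: sum(1 for words in tokenized if w in words) for w in vocabulary}
    idf = {w: -math.log(doc_freq[w] / n_docs) for w in vocabulary}

    vectors = []
    for words in tokenized:
        counts = Counter(words)
        total = len(words) or 1  # avoid dividing by zero for an empty document
        vectors.append([counts[w] / total * idf[w] for w in vocabulary])
    return vocabulary, vectors


docs = ["the car is fast", "the road is wet", "the puck hit the post"]
vocabulary, vectors = tfidf_vectors(docs)
car_weight = vectors[0][vocabulary.index("car")]  # (1/4) * log(3)
```

Each document becomes a vector with one entry per vocabulary word, which is the shape a classifier expects as input.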

In [14]:
# Import libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
# TfidfTransformer is used below to convert term counts into tf-idf weights
from sklearn.feature_extraction.text import TfidfTransformer

In [15]:
# Define the category map: newsgroup name -> readable label
categoryMap = {
    'talk.politics.misc': 'Politics', 'rec.autos': 'Autos',
    'rec.sport.hockey': 'Hockey', 'sci.electronics': 'Electronics',
    'sci.med': 'Medicine'
}

In [16]:
# Get the training dataset for the selected categories
training_data = fetch_20newsgroups(subset='train', categories=list(categoryMap.keys()), shuffle=True, random_state=5)

In [17]:
# Build the count vectorizer and extract term counts
count_vectorizer = CountVectorizer()
train_tc = count_vectorizer.fit_transform(training_data.data)
print("\nDimensions of training data : ", train_tc.shape)


Dimensions of training data :  (2844, 40321)


In [18]:
# Create the TF-IDF transformer and fit it on the training term counts
tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(train_tc)
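
As a side note, scikit-learn also provides `TfidfVectorizer`, which performs the counting and the tf-idf weighting in one step. The sketch below, on a toy corpus invented for this example, checks that both routes give the same matrix. With default settings scikit-learn uses a smoothed IDF and L2-normalizes each row, so the values differ from the plain formula described earlier.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy corpus, used only for illustration
toy_docs = ["the car is fast", "the road is wet", "the puck hit the post"]

# Two-step route: term counts first, then tf-idf weighting
counts = CountVectorizer().fit_transform(toy_docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does both
one_step = TfidfVectorizer().fit_transform(toy_docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
row_norms = np.linalg.norm(one_step.toarray(), axis=1)
print(same, row_norms)
```

Keeping the two steps separate, as this notebook does, makes the raw term counts available for inspection.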

In [19]:
# define test data
test_data = [
    'you need to be careful with cars when you are driving in a slippery road',
    'A lots of devices can be operated wirelessly',
    'Players need to be careful when they are close to goal posts',
    'Political debates help us understand the perspectives of both sides'
]

In [22]:
# Train a Multinomial Naive Bayes classifier on the tf-idf features
classifier = MultinomialNB()
classifier.fit(train_tfidf, training_data.target)
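
The three stages used here (count vectorizer, TF-IDF transformer, classifier) can also be chained with scikit-learn's `make_pipeline`, so that fitting and prediction each take one call. The sketch below is a minimal example on a tiny corpus made up for illustration. Its texts and labels are assumptions and are not drawn from the 20 newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hand-written corpus, for illustration only
toy_texts = [
    "the engine and brakes of the car",
    "driving the car on the road",
    "the goalie stopped the puck",
    "players skate on the ice rink",
]
toy_labels = ["Autos", "Autos", "Hockey", "Hockey"]

# fit() fits every stage in order; predict() only transforms, then classifies
model = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
model.fit(toy_texts, toy_labels)

predicted = model.predict(["the car needs new brakes", "the puck slid across the ice"])
print(predicted)
```

A pipeline also ensures that new input is transformed with the vocabulary and IDF weights learned during training.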

In [24]:
# Transform the input data using the fitted count vectorizer
input_tc = count_vectorizer.transform(test_data)

In [25]:
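# Transform the vectorized data using TF-IDF.
# Use transform(), not fit_transform(), so the IDF weights learned from the training data are reused.
input_tfidf = tfidf.transform(input_tc)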
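
When new input is vectorized, the IDF weights learned from the training data are normally reused with `transform()` rather than refit. The sketch below, on toy documents invented for this example, shows that `transform()` leaves the fitted `idf_` values untouched, while fitting on the new input alone produces different weights.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy documents, for illustration only
train_docs = ["the car is fast", "the road is wet", "the puck hit the post"]
new_docs = ["the car hit the post"]

vectorizer = CountVectorizer()
transformer = TfidfTransformer()
transformer.fit(vectorizer.fit_transform(train_docs))
idf_before = transformer.idf_.copy()

# transform() reuses the training vocabulary and IDF weights
new_counts = vectorizer.transform(new_docs)
new_tfidf = transformer.transform(new_counts)
idf_unchanged = np.array_equal(idf_before, transformer.idf_)

# Fitting a fresh transformer on the new input alone gives different IDF weights
refit = TfidfTransformer().fit(new_counts)
idf_differs = not np.array_equal(refit.idf_, idf_before)
print(idf_unchanged, idf_differs)
```

Refitting on the input discards the corpus-level statistics that the classifier was trained on, which can change the predictions.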

In [27]:
# Predict the output categories
predictions = classifier.predict(input_tfidf)

In [30]:
# print the outputs
for sent, category in zip(test_data, predictions):
    print('\nInput:', sent, '\nPredictions Category:', categoryMap[training_data.target_names[category]] )


Input: you need to be careful with cars when you are driving in a slippery road 
Predictions Category: Autos

Input: A lots of devices can be operated wirelessly 
Predictions Category: Electronics

Input: Players need to be careful when they are close to goal posts 
Predictions Category: Hockey

Input: Political debates help us understand the perspectives of both sides 
Predictions Category: Medicine
