## Text classification using a Naive Bayes classifier
In this notebook we will:
1. Load the scikit-learn 20 newsgroups dataset
2. Create a model using a pipeline with two components: a count vectorizer and a naive Bayes classifier
3. Save, load, and test the model/pipeline

### Setup

Below are the installation instructions for the libraries required for this notebook:

* !pip install -U scikit-learn
* !pip install joblib

In [1]:
# Common imports
import numpy as np

# import for dataset from sklearn
from sklearn.datasets import fetch_20newsgroups

# import for feature creation
from sklearn.feature_extraction.text import CountVectorizer

# import algorithms
from sklearn.naive_bayes import MultinomialNB

# import for creating pipelines
from sklearn.pipeline import Pipeline

# import for saving and loading model file
from joblib import dump, load

### Get the Data

In [2]:
# Loading the 20 newsgroups dataset from the sklearn library
# (restricted to four of the twenty categories)
label_categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
train_data = fetch_20newsgroups(subset='train', categories=label_categories, shuffle=True, random_state=42)
test_data = fetch_20newsgroups(subset='test', categories=label_categories, shuffle=True, random_state=42)

In [3]:
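# printing sample data
# target holds an integer index into target_names, so look the name up by that index
first_label = train_data.target[0]
print("Label Number: ", first_label)
print("Label Name: ", train_data.target_names[first_label])
print("\nText:\n")
print(train_data.data[0])

Label Number:  1
Label Name:  comp.graphics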

Text:

From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.



### Working with Pipeline

We create a pipeline that consists of two components: a vectorizer and a classifier.

The output of the first component is the input to the next: `vectorizer => classifier`.

1. `CountVectorizer` (the vectorizer) converts text data into numeric token counts using the `Bag Of Words` approach. It handles text preprocessing and tokenizing, and it filters stop words when the `stop_words` parameter is set.

2. `MultinomialNB` (the classifier) is a naive Bayes classifier that predicts the class from those counts.
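
To make the `Bag Of Words` step concrete, here is a minimal sketch of what the vectorizer produces on its own. The two sentences in `docs` are made up for illustration and are not part of the 20 newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only to illustrate the count matrix
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each kept token to its column index in the matrix
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense = counts.toarray()

print(vocab)
print(dense)
```

Each column corresponds to one token that survived stop-word filtering, and each cell holds how often that token appears in the document. Word order is discarded, which is why the approach is called a bag of words.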

In [4]:
# creating model pipeline
model = Pipeline([
    ('vectorizer', CountVectorizer(stop_words="english")),
    ('classifier', MultinomialNB()),
])

In [5]:
# training model pipeline
model.fit(train_data.data, train_data.target)

Pipeline(memory=None,
     steps=[('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
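
As a rough illustration of what `fit` and `predict` do inside the pipeline, the sketch below runs the two components by hand and compares the result with the pipeline. The texts and labels are hypothetical toy data, not the newsgroups dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 0 = medicine, 1 = graphics
texts = [
    "the doctor prescribed medicine for the patient",
    "the patient visited the clinic for treatment",
    "the graphics card renders images quickly",
    "this software converts image files to another format",
]
labels = [0, 0, 1, 1]
new_texts = ["the clinic gave the patient medicine", "software that renders image files"]

# Pipeline version: both steps are fitted and applied in order
pipeline = Pipeline([
    ("vectorizer", CountVectorizer(stop_words="english")),
    ("classifier", MultinomialNB()),
])
pipeline.fit(texts, labels)
pipeline_pred = pipeline.predict(new_texts)

# Manual version: fit the vectorizer, then feed its output to the classifier
vectorizer = CountVectorizer(stop_words="english")
classifier = MultinomialNB()
train_counts = vectorizer.fit_transform(texts)
classifier.fit(train_counts, labels)
# New text is only transformed (not re-fitted) with the learned vocabulary
manual_pred = classifier.predict(vectorizer.transform(new_texts))

print(np.array_equal(pipeline_pred, manual_pred))
```

The pipeline saves you from keeping the vectorizer and classifier in sync yourself, which matters later when the whole object is saved to a single file.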

In [6]:
# evaluating the model on the test data
predicted = model.predict(test_data.data)
print("Accuracy: ",np.mean(predicted == test_data.target))

Accuracy:  0.9420772303595206
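
The accuracy above is the fraction of test documents whose predicted label matches the true label. As a small sketch, the same number can be obtained with `accuracy_score` from `sklearn.metrics`, and `confusion_matrix` shows which classes get confused. The toy texts and labels below are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 0 = medicine, 1 = graphics
train_texts = [
    "the doctor prescribed medicine for the patient",
    "the patient visited the clinic for treatment",
    "the graphics card renders images quickly",
    "this software converts image files to another format",
]
train_labels = [0, 0, 1, 1]
test_texts = ["the clinic gave the patient new medicine", "which software renders image files"]
test_labels = [0, 1]

toy_model = Pipeline([
    ("vectorizer", CountVectorizer(stop_words="english")),
    ("classifier", MultinomialNB()),
])
toy_model.fit(train_texts, train_labels)
toy_predicted = toy_model.predict(test_texts)

# Manual accuracy, computed the same way as in the cell above
manual_accuracy = np.mean(toy_predicted == np.array(test_labels))
# Library equivalent
library_accuracy = accuracy_score(test_labels, toy_predicted)
# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(test_labels, toy_predicted, labels=[0, 1])

print(manual_accuracy, library_accuracy)
print(matrix)
```

On the real dataset, the equivalent calls would take `test_data.target` and `predicted` as arguments.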


### Saving and loading the model file

In [7]:
# saving the trained pipeline (vectorizer and classifier together) to disk
dump(model, 'naive_bayes_model')

['naive_bayes_model']

In [8]:
# loading the saved pipeline back from disk
naive_bayes_model = load('naive_bayes_model')

In [9]:
text = """Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files."""

predicted_value = naive_bayes_model.predict([text])
predicted_value

array([1], dtype=int64)

In [10]:
print("Predicted Number: ",predicted_value[0])
print("Predicted Class: ",train_data.target_names[predicted_value[0]])

Predicted Number:  1
Predicted Class:  comp.graphics
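
One way to check that saving and loading preserved the model is to compare predictions before and after the round trip. The sketch below does this on hypothetical toy data and writes to a temporary directory; the file name `toy_naive_bayes.joblib` is an arbitrary choice for the example.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 0 = medicine, 1 = graphics
toy_texts = [
    "the doctor prescribed medicine for the patient",
    "the graphics card renders images quickly",
]
toy_labels = [0, 1]

toy_model = Pipeline([
    ("vectorizer", CountVectorizer(stop_words="english")),
    ("classifier", MultinomialNB()),
])
toy_model.fit(toy_texts, toy_labels)

# Save to a temporary directory and load the pipeline back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_naive_bayes.joblib")
    dump(toy_model, model_path)
    restored_model = load(model_path)

samples = ["medicine for the patient", "card that renders images"]
original_pred = toy_model.predict(samples)
restored_pred = restored_model.predict(samples)

# The restored pipeline should give the same predictions as the original
print(np.array_equal(original_pred, restored_pred))
```

Because joblib files are pickle-based, `load` should only be used on files from a trusted source, and a saved model is best loaded with the same scikit-learn version that created it.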


##### Reference - https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html