## Naive Bayes Classifier with basic NLP techniques
The dataset is the well-known "20 Newsgroups" data set (the original 20 Newsgroups data set), available at http://qwone.com/~jason/20Newsgroups/

**Dataset used here**: https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html



1. Load the data set and import necessary libraries.
2. Extract features from text files. Each unique word in the document can be considered as a feature.
3. Compute TF and TF-IDF factors.
4. Demonstrate text classification with a Naive Bayes classifier (build an NB classifier in scikit-learn) and train it on the data provided.
5. Build a pipeline of TF-IDF and NB classification function.
6. Test the performance of the NB classifier on the test set.
7. Remove the stop words and build a pipeline of TF-IDF and classification function, then compare the performance with the previous process (i.e., without stop-word removal).


!pip install gensim contractions bs4 plotly wordcloud spacy python-Levenshtein

### Importing necessary libraries 

In [1]:
import re, string, unicodedata                          # Import Regex, string and unicodedata
import contractions                                     # Import contractions library
from bs4 import BeautifulSoup                           # Import BeautifulSoup
from pprint import pprint

import numpy as np                                      # Import numpy
import pandas as pd                                     # Import pandas
import nltk                                             # Import Natural Language Tool-Kit

## UN-COMMENT while running for the FIRST time
# nltk.download('stopwords')                            # Download Stopwords
# nltk.download('punkt')
# nltk.download('wordnet')

from nltk.corpus import stopwords                       # Import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize  # Import Tokenizer
from nltk.stem.wordnet import WordNetLemmatizer         # Import Lemmatizer

## Import visualization library
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
from collections import Counter


# Importing Multinomial Naive Bayes as asked
from sklearn.naive_bayes import MultinomialNB
# from sklearn.naive_bayes import GaussianNB
# from sklearn.linear_model import LogisticRegression
# from sklearn.tree import DecisionTreeClassifier 

# Import Scikit-learn utilities for data split and evaluation
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score, learning_curve
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, plot_confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay, f1_score

# Importing Scikit-learn Pipelines
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import Normalizer

# Importing TfidfVectorizer to convert text data to numbers
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer, HashingVectorizer
from sklearn.decomposition import TruncatedSVD


from gensim import corpora
from gensim.parsing import strip_tags, strip_numeric, strip_multiple_whitespaces, stem_text, strip_punctuation, \
    remove_stopwords, preprocess_string

### Loading the dataset (from sklearn.datasets)

In [2]:
# we are using the dataset from sklearn library 
from sklearn.datasets import fetch_20newsgroups

In [3]:
# TRAIN set
newsgroups_train = fetch_20newsgroups(subset='train')
# Trainset size
print(f"TRAINset shape : {newsgroups_train.filenames.shape}, Target shape: {newsgroups_train.target.shape}")
# pprint(list(newsgroups_train.target_names))

# TEST set
newsgroups_test = fetch_20newsgroups(subset='test')
# Testset size
print(f"TESTset shape : {newsgroups_test.filenames.shape}, Target shape: {newsgroups_test.target.shape}")
# pprint(list(newsgroups_test.target_names))
print()
categories = list(newsgroups_train.target_names)
pprint(categories)

TRAINset shape : (11314,), Target shape: (11314,)
TESTset shape : (7532,), Target shape: (7532,)

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


In [None]:
# making a dataframe for TRAIN dataset
news_train_df = pd.DataFrame(newsgroups_train.data, columns=['raw_data'])
# adding a target column
news_train_df['target'] = newsgroups_train.target
# adding length of data for visualizations
news_train_df['length'] = news_train_df['raw_data'].apply(len)

news_train_df

In [None]:
# making a dataframe for TEST dataset
news_test_df = pd.DataFrame(newsgroups_test.data, columns=['raw_data'])
# adding a target column
news_test_df['target'] = newsgroups_test.target
# adding length of data for visualizations
news_test_df['length'] = news_test_df['raw_data'].apply(len)

news_test_df

In [None]:
# Bar plot of the mean raw-text length per target label (TRAIN set)
sns.barplot(x='target', y='length', data=news_train_df)

In [None]:
# Bar plot of the mean raw-text length per target label (TEST set)
sns.barplot(x='target', y='length', data=news_test_df)

### Preprocess using gensim.parsing

As a first attempt, we use a gensim preprocessing pipeline.

* Stop words are NOT removed in this pass

In [8]:
# ref: https://www.kaggle.com/venkatkrishnan/gensim-text-mining-techniques

transform_to_lower = lambda s: s.lower()

remove_single_char = lambda s: re.sub(r'\s+\w{1}\s+', ' ', s)

# Filters to be executed in pipeline
CLEAN_FILTERS = [strip_tags,
                strip_numeric,
                strip_punctuation, 
                strip_multiple_whitespaces, 
                transform_to_lower,
#                 remove_stopwords,   # NOT REMOVING STOPWORDS at first go
                remove_single_char]


# Filter out all irrelevant text elements
def cleaning_text(document):
    # Invoking gensim.parsing.preprocess_string method with set of filters
    processed_words = preprocess_string(document, CLEAN_FILTERS)
    return ' '.join(processed_words)


print(f"Original text : \n---------------------------------\n{news_test_df['raw_data'][0]}")
print(f"Cleaned text : \n---------------------------------\n{cleaning_text(news_test_df['raw_data'][0])}")

Original text : 
From: v064mb9k@ubvmsd.cc.buffalo.edu (NEIL B. GANDLER)
Subject: Need info on 88-89 Bonneville
Organization: University at Buffalo
Lines: 10
News-Software: VAX/VMS VNEWS 1.41
Nntp-Posting-Host: ubvmsd.cc.buffalo.edu


 I am a little confused on all of the models of the 88-89 bonnevilles.
I have heard of the LE SE LSE SSE SSEI. Could someone tell me the
differences are far as features or performance. I am also curious to
know what the book value is for prefereably the 89 model. And how much
less than book value can you usually get them for. In other words how
much are they in demand this time of year. I have heard that the mid-spring
early summer is the best time to buy.

			Neil Gandler

Cleaned text : 
from vmbk ubvmsd cc buffalo edu neil gandler subject need info on bonneville organization university at buffalo lines news software vax vms vnews nntp posting host ubvmsd cc buffalo edu am little confused on all of the models of the bonnevilles have heard of the le se ls
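
To see what the `remove_single_char` filter does on its own, here is a minimal sketch on a made-up sentence (the input string is an assumption for illustration). It shows two edge cases: a single character at the very start of the string is kept because no whitespace precedes it, and of two adjacent single characters only the first is removed, because regex matches cannot overlap and the whitespace between them is consumed by the first match.

```python
import re

# Same pattern as remove_single_char in the cell above
remove_single_char = lambda s: re.sub(r'\s+\w{1}\s+', ' ', s)

sample = "i am a b c test"  # hypothetical input
cleaned = remove_single_char(sample)
print(cleaned)  # i am b test
```

In the notebook the filter runs after lowercasing and punctuation stripping, so most isolated letters it sees are leftovers such as "i" or "a".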

In [9]:
# adding processed/cleaned data to a new column
news_train_df['clean_data'] = news_train_df['raw_data'].apply(cleaning_text)
# adding length of data for visualizations
news_train_df['clean_length'] = news_train_df['clean_data'].apply(len)


news_test_df['clean_data'] = news_test_df['raw_data'].apply(cleaning_text)
# adding length of data for visualizations
news_test_df['clean_length'] = news_test_df['clean_data'].apply(len)

In [None]:
news_train_df

In [11]:
bow_transformer = CountVectorizer().fit(news_train_df['clean_data'])
# Print total number of vocab words
print(len(bow_transformer.vocabulary_))

85086


In [12]:
# Note: vocabulary_ maps each term to its column index in the
# document-term matrix, not to its frequency.
vocab_items = list(bow_transformer.vocabulary_.items())
term_index_df = pd.DataFrame(vocab_items, columns=['Words', 'Index'])
term_index_df.head(10)

     Words  Index
0     from  27040
1   lerxst  41661
2      wam  80022
3      umd  76387
4      edu  21442
5    where  80815
6       my  49039
7    thing  73401
8  subject  70632
9     what  80767
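
The idea that each unique word becomes a feature can be sketched without any library. The toy corpus below is hypothetical, and the alphabetical index assignment is a choice made for this sketch. The point is that the vocabulary maps each word to a column index of the document-term matrix, while the actual counts live in the matrix rows.

```python
from collections import Counter

docs = ["the car is fast", "the car is red red"]  # hypothetical toy corpus

# Vocabulary: each unique word gets a column index (alphabetical order here)
unique_words = sorted({w for d in docs for w in d.split()})
vocabulary = {word: idx for idx, word in enumerate(unique_words)}

# Document-term matrix: one row per document, one column per vocabulary word
counts = [Counter(d.split()) for d in docs]
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)  # {'car': 0, 'fast': 1, 'is': 2, 'red': 3, 'the': 4}
print(matrix)      # [[1, 1, 1, 0, 1], [1, 0, 1, 2, 1]]
```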


# Computing Tf-Idf for the given data

- t: term (word)
- d: document (set of words)
- N: number of documents in the corpus
- corpus: the total document set

## Term Frequency:
The number of times a term occurs in a document is called its term frequency.
The weight of a term in a document is proportional to its term frequency.
#### tf(t, d) = (count of t in d) / (number of words in d)


## Inverse Document Frequency:
IDF is the inverse of the document frequency and measures how informative term t is. It is very low for the most frequent words such as stop words: a word like "is" appears in almost every document, so N/df is close to 1 and its logarithm is close to 0. This gives the relative weighting we want.
#### idf(t) = log(N / (df(t) + 1))


### Formula: tf-idf(t, d) = tf(t, d) * log(N / (df(t) + 1))

Note that scikit-learn's `TfidfTransformer` uses a slightly different variant by default: raw counts for tf, idf(t) = ln((1 + N) / (1 + df(t))) + 1, and L2 normalization of each document vector.
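
As a worked example of the formula above, the following sketch computes tf-idf by hand on a tiny, hypothetical three-document corpus. It assumes the natural logarithm (the formula does not fix a base) and uses illustrative helper names `tf`, `idf`, and `tf_idf` that are not part of any library.

```python
import math

docs = [
    "the car is fast".split(),
    "the bike is slow".split(),
    "the car was sold".split(),
]  # hypothetical toy corpus
N = len(docs)

def tf(term, doc):
    # Share of the document's words that are `term`
    return doc.count(term) / len(doc)

def idf(term):
    # df = number of documents that contain `term`
    df = sum(1 for doc in docs if term in doc)
    return math.log(N / (df + 1))

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(round(tf_idf("fast", docs[0]), 4))  # 0.25 * ln(3/2) -> 0.1014
print(round(tf_idf("car", docs[0]), 4))   # 0.25 * ln(3/3) -> 0.0
print(round(tf_idf("the", docs[0]), 4))   # 0.25 * ln(3/4) -> -0.0719
```

The rare word "fast" gets the highest weight, while "the", which occurs in every document, ends up slightly negative because of the +1 in the denominator.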

Calculating the TF-IDF scores of the 25 highest-weighted words in the first training document

In [13]:
# TfidfVectorizer is already imported above; fit it on the cleaned training text
tfIdfVectorizer = TfidfVectorizer(use_idf=True)
tfIdf = tfIdfVectorizer.fit_transform(news_train_df['clean_data'])
# TF-IDF weights of the first training document, one row per vocabulary term
# (on scikit-learn >= 1.0, get_feature_names_out() replaces get_feature_names())
df = pd.DataFrame(tfIdf[0].T.todense(), index=tfIdfVectorizer.get_feature_names(), columns=["TF-IDF"])
df = df.sort_values('TF-IDF', ascending=False)
print(df.head(25))

                TF-IDF
car           0.388991
lerxst        0.362351
wam           0.264854
umd           0.216329
tellme        0.181176
bricklin      0.171154
rac           0.159623
funky         0.159623
was           0.148913
this          0.147964
enlighten     0.137357
bumper        0.136004
neighborhood  0.134738
the           0.125813
doors         0.118028
maryland      0.111996
production    0.111057
specs         0.111057
sports        0.110694
where         0.110630
anyone        0.105241
door          0.102800
park          0.101632
separate      0.101073
il            0.099690


In [15]:
messages_bow = bow_transformer.transform(news_train_df['clean_data'])
print('Shape of Sparse Matrix: ', messages_bow.shape)
print('Amount of Non-Zero occurences: ', messages_bow.nnz)

Shape of Sparse Matrix:  (11314, 85086)
Amount of Non-Zero occurences:  1652974


In [16]:
tfidf_transformer = TfidfTransformer().fit(messages_bow)

messages_tfidf = tfidf_transformer.transform(messages_bow)
print(messages_tfidf.shape)

(11314, 85086)


In [17]:
X_train = news_train_df["clean_data"]
X_test = news_test_df['clean_data']
y_train = news_train_df["target"] 
y_test = news_test_df['target']

In [18]:
np.logspace(-4, 3, 10)

array([1.00000000e-04, 5.99484250e-04, 3.59381366e-03, 2.15443469e-02,
       1.29154967e-01, 7.74263683e-01, 4.64158883e+00, 2.78255940e+01,
       1.66810054e+02, 1.00000000e+03])
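
These ten values serve as the candidate `alpha` settings in the grid search further down. As a rough illustration of what `np.logspace(-4, 3, 10)` computes, the sketch below reproduces it in plain Python: the exponents are evenly spaced between -4 and 3, so the candidates are evenly spaced on a log scale.

```python
# Pure-Python equivalent of np.logspace(-4, 3, 10):
# 10 exponents evenly spaced from -4 to 3, each used as a power of 10
start, stop, num = -4, 3, 10
step = (stop - start) / (num - 1)
alphas = [10 ** (start + i * step) for i in range(num)]

for a in alphas:
    print(f"{a:.8e}")
```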

Chaining CountVectorizer with TfidfTransformer is equivalent to using TfidfVectorizer.

In [19]:
# Create first pipeline for base without reducing features.
pipeline = Pipeline([
    ('bow', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])

# Create param grid.
param_grid = [
    {'classifier' : [MultinomialNB()],
     'classifier__alpha' : np.logspace(-4, 3, 10)   
    }
]

print(f"# Tuning hyper-parameters ")
mnb_clf = GridSearchCV(pipeline, param_grid = param_grid, cv=10, verbose=1, refit=True, n_jobs=-1, scoring='accuracy')
mnb_clf.fit(X_train, y_train)

print("Best parameters set found on development set:")
print()
print(mnb_clf.best_params_)
print(mnb_clf.best_score_)
print()

# Tuning hyper-parameters 
Fitting 10 folds for each of 10 candidates, totalling 100 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   28.9s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:  1.1min finished


Best parameters set found on development set:

{'classifier': MultinomialNB(alpha=0.021544346900318846, class_prior=None, fit_prior=True), 'classifier__alpha': 0.021544346900318846}
0.9134694272869002
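
The `alpha` tuned above is the additive (Laplace/Lidstone) smoothing parameter of multinomial Naive Bayes. The sketch below is a minimal from-scratch version on hypothetical toy data, using raw word counts rather than TF-IDF weights. The function names `fit` and `predict` are illustrative and do not mirror scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (tokens, label)
train = [
    ("engine car wheel".split(), "autos"),
    ("car brake engine".split(), "autos"),
    ("orbit rocket moon".split(), "space"),
    ("rocket launch orbit".split(), "space"),
]
vocab = sorted({w for tokens, _ in train for w in tokens})

def fit(train, alpha):
    class_docs = Counter(label for _, label in train)
    word_counts = defaultdict(Counter)
    for tokens, label in train:
        word_counts[label].update(tokens)
    log_prior = {c: math.log(n / len(train)) for c, n in class_docs.items()}
    log_lik = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        # Additive smoothing: alpha is added to every word count, so unseen
        # words get a small non-zero probability
        log_lik[c] = {
            w: math.log((word_counts[c][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
    return log_prior, log_lik

def predict(tokens, log_prior, log_lik):
    # Sum log-probabilities; words outside the vocabulary are ignored
    scores = {
        c: log_prior[c] + sum(log_lik[c][w] for w in tokens if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)

log_prior, log_lik = fit(train, alpha=0.02)
print(predict("rocket engine orbit".split(), log_prior, log_lik))  # space
```

A small `alpha` trusts the observed counts almost entirely, while a large `alpha` pushes every word probability toward uniform. That trade-off is what the grid search explores.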



In [None]:
best = mnb_clf.best_estimator_

print("Detailed classification report:")
print()
y_true, y_pred = y_test, best.predict(X_test)
print(classification_report(y_true, y_pred))
print()
print(f'Accuracy of MultinomialNB classifier on test set: {accuracy_score(y_true, y_pred)}')
print()

We achieved 83.4% accuracy on the test set without stop-word removal.

In [None]:
# Plot both the non-normalized and the normalized confusion matrix
titles_options = [("Confusion matrix, without normalization", None),
                  ("Normalized confusion matrix", 'true')]
for title, normalize in titles_options:
    disp = plot_confusion_matrix(best, X_test, y_test,
                                 display_labels=mnb_clf.classes_,
                                 cmap=plt.cm.Blues,
                                 normalize=normalize)
    disp.ax_.set_title(title)
    print("===================================================================================")
    print(title)
    print(disp.confusion_matrix)
    print()

plt.show()
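
The difference between the two plots can be sketched by hand. The labels below are hypothetical. With `normalize='true'`, each row (true class) is divided by its row sum, so the diagonal of the normalized matrix holds the per-class recall.

```python
# Hypothetical labels for a 3-class problem
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0, 2]
n_classes = 3

# Rows are true classes, columns are predicted classes
cm = [[0] * n_classes for _ in range(n_classes)]
for t, p in zip(y_true, y_pred):
    cm[t][p] += 1

# Row normalization: divide each row by the number of true samples of that class
cm_norm = [[count / sum(row) for count in row] for row in cm]

# Accuracy is the share of samples on the diagonal
accuracy = sum(cm[i][i] for i in range(n_classes)) / len(y_true)

print(cm)  # [[2, 1, 0], [0, 2, 0], [1, 0, 3]]
print(cm_norm)
print(accuracy)
```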


In [None]:
target_names = categories

fig, ax = plt.subplots(figsize=(30, 20))
plot_confusion_matrix(best, X_test, y_test,
                      display_labels=target_names,
                      cmap=plt.cm.Blues, ax=ax)


## Recreating dataset with STOPWORD removal

In [24]:
transform_to_lower = lambda s: s.lower()

remove_single_char = lambda s: re.sub(r'\s+\w{1}\s+', ' ', s)

# Filters to be executed in pipeline
CLEAN_FILTERS = [strip_tags,
                strip_numeric,
                strip_punctuation, 
                strip_multiple_whitespaces, 
                transform_to_lower,
                remove_stopwords,   # REMOVING STOPWORDS at second go
                remove_single_char]


# Filter out all irrelevant text elements
def cleaning_text(document):
    # Invoking gensim.parsing.preprocess_string method with set of filters
    processed_words = preprocess_string(document, CLEAN_FILTERS)
    return ' '.join(processed_words)


print(f"Original text : \n-------------------------------\n{news_test_df['raw_data'][0]}")
print(f"Cleaned text : \n---------------------------------\n{cleaning_text(news_test_df['raw_data'][0])}")

Original text : 
-------------------------------
From: v064mb9k@ubvmsd.cc.buffalo.edu (NEIL B. GANDLER)
Subject: Need info on 88-89 Bonneville
Organization: University at Buffalo
Lines: 10
News-Software: VAX/VMS VNEWS 1.41
Nntp-Posting-Host: ubvmsd.cc.buffalo.edu


 I am a little confused on all of the models of the 88-89 bonnevilles.
I have heard of the LE SE LSE SSE SSEI. Could someone tell me the
differences are far as features or performance. I am also curious to
know what the book value is for prefereably the 89 model. And how much
less than book value can you usually get them for. In other words how
much are they in demand this time of year. I have heard that the mid-spring
early summer is the best time to buy.

			Neil Gandler

Cleaned text : 
---------------------------------
vmbk ubvmsd cc buffalo edu neil gandler subject need info bonneville organization university buffalo lines news software vax vms vnews nntp posting host ubvmsd cc buffalo edu little confused models bonnevi

In [25]:
# adding processed/cleaned data to a new column
news_train_df['cleaned_stopword_data'] = news_train_df['raw_data'].apply(cleaning_text)
# adding length of data for visualizations
news_train_df['cleaned_stop_length'] = news_train_df['cleaned_stopword_data'].apply(len)


news_test_df['cleaned_stopword_data'] = news_test_df['raw_data'].apply(cleaning_text)
# adding length of data for visualizations
news_test_df['cleaned_stop_length'] = news_test_df['cleaned_stopword_data'].apply(len)

In [None]:
news_train_df

The column "cleaned_stop_length" shows how the text length (in characters) shrinks after stop-word removal.

Next, we build the train and test inputs from the stop-word-free text.

In [28]:
X_train2 = news_train_df["cleaned_stopword_data"]
X_test2 = news_test_df['cleaned_stopword_data']
y_train = news_train_df["target"] 
y_test = news_test_df['target']

In [None]:
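# Earlier pipeline (CountVectorizer + TfidfTransformer), kept for reference;
# the active pipeline below uses TfidfVectorizer with a capped vocabulary instead.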
# pipeline = Pipeline([
#     ('bow', CountVectorizer()),
#     ('tfidf', TfidfTransformer()),
#     ('classifier', MultinomialNB())
# ])

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=85000)),
    ('classifier', MultinomialNB())
])

# Create param grid.
param_grid = [
    {'classifier' : [MultinomialNB()],
     'classifier__alpha' : np.logspace(-5, 3, 8)   
    }
]

print(f"# Tuning hyper-parameters ")
mnb_clf2 = GridSearchCV(pipeline, param_grid = param_grid, cv=10, verbose=1, refit=True, n_jobs=-1, scoring='accuracy')
mnb_clf2.fit(X_train2, y_train)

print("Best parameters set found on development set:")
print()
print(mnb_clf2.best_params_)
print(mnb_clf2.best_score_)
print()

In [None]:
best2 = mnb_clf2.best_estimator_  # the model tuned on the stop-word-free text

print("Detailed classification report:")
print()
y_true, y_pred = y_test, best2.predict(X_test2)
print(classification_report(y_true, y_pred))
print()
print(f'Accuracy of MultinomialNB classifier on test set: {accuracy_score(y_true, y_pred)}')
print()

After stop-word removal, accuracy improved only marginally rather than significantly, which suggests that other preprocessing techniques are worth trying.

In [None]:
target_names = categories

fig, ax = plt.subplots(figsize=(30, 20))
plot_confusion_matrix(best2, X_test2, y_test,
                      display_labels=target_names,
                      cmap=plt.cm.Blues, ax=ax)


## Performing some NLP on the same data

In [32]:
# merging both train and test datasets
newsdf_whole = pd.concat([news_train_df, news_test_df])

In [40]:
stopword_list = stopwords.words('english')
# print(stopword_list)

## Custom exclusion list: negations and contractions such as "not" and "couldn't" carry meaning, so they are taken out of the stop-word list and stay in the text.
custom_excludelist = ["couldn't",'didn',"don't","didn't",'doesn',"doesn't",'down','hadn',"hadn't",'hasn',
                      "hasn't",'haven',"haven't",'isn',"isn't","it's",'ma','mightn',"mightn't",'mustn',"no","not",
                      "mustn't",'needn',"needn't",'shan',"shan't","she's",'shouldn',"should've","shouldn't",'s',
                      "that'll",'wasn',"wasn't",'weren',"weren't",'won',"won't",'wouldn',"wouldn't","you'd","you'll","y"]

## Add informal (SMS-style) abbreviations to the stop-word list
custom_new_stopwords = ["thru","whr","whn","hw","gonna","u","plz","hmm","cn","ur","cya","idk","ttyl",
                        "lol","lmao","sup","coz","bcoz","cum","fr","frm","shud","hs","agn","hav","thr"]

stoplist = set(set(stopword_list) - set(custom_excludelist)) 
stoplist = list(stoplist.union(custom_new_stopwords))
print(stoplist)

['does', 'each', 'do', 'its', 'now', 'here', 'or', 'my', 'aren', 'there', 'own', 'will', 'thru', 'too', 'at', 'against', 'themselves', 'was', 'nor', 'ourselves', 'm', 'for', 'are', 'below', 'from', 'why', 'some', 'that', 'because', 'most', 'lmao', 'her', 'after', 'ur', 'an', 'under', 'yourself', 't', 'we', 're', 'she', 'he', 'them', 'they', 'theirs', 'until', 'into', 've', 'couldn', 'plz', 'your', 'again', 'coz', 'whom', 'what', 'which', 'were', 'as', 'ours', 'same', 'and', "you're", 'had', 'a', 'yours', 'frm', 'herself', 'doing', 'idk', 'in', 'd', 'other', 'such', 'himself', 'being', 'agn', 'who', 'over', "you've", 'those', 'more', 'these', 'll', 'itself', 'but', 'it', 'when', 'above', 'his', 'on', 'both', 'only', 'him', 'once', 'how', "aren't", 'about', 'then', 'can', 'be', 'while', 'by', 'this', 'bcoz', 'hs', 'you', 'if', 'few', 'just', 'shud', 'our', 'hmm', 'o', 'did', 'during', 'before', 'gonna', 'lol', 'sup', 'myself', 'i', 'where', 'me', 'up', 'hers', 'fr', 'out', 'should', 'am'
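
The effect of the two custom lists can be checked on a small scale. The lists below are hypothetical, shortened stand-ins for the real ones. Words in the exclusion list survive filtering, while the extra informal words are removed along with the standard stop words.

```python
# Hypothetical, shortened stand-ins for the lists used above
base_stopwords = ["the", "is", "not", "no", "don't", "a"]
keep_words = ["not", "no", "don't"]       # negations that should stay in the text
extra_stopwords = ["plz", "lol", "the"]   # informal words to filter out as well

# Same set algebra as above: (base - excluded) union extra
stop_set = (set(base_stopwords) - set(keep_words)) | set(extra_stopwords)

tokens = "the service is not good plz fix".split()
filtered = [t for t in tokens if t not in stop_set]
print(filtered)  # ['service', 'not', 'good', 'fix']
```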

In [None]:
def replace_contractions(text):
    """Replace contractions in string of text"""
    return contractions.fix(text)


newsdf_whole['raw_news'] = newsdf_whole['raw_data'].apply(lambda x: replace_contractions(x))
newsdf_whole.head(10)

In [42]:
lemmatizer = WordNetLemmatizer()


def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words

def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    """Replace punctuation with spaces and drop tokens that become empty"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', ' ', word)
        # A pure-punctuation token turns into whitespace only, so test the stripped value
        if new_word.strip():
            new_words.append(new_word)
    return new_words

def remove_stopwords_from_tokens(words):
    """Remove stop words from list of tokenized words"""
    # Named differently from gensim's remove_stopwords (imported above) to avoid shadowing it
    stopword_set = set(stoplist)  # set membership checks are O(1)
    return [word for word in words if word not in stopword_set]

def lemmatize_list(words):
    """Lemmatize each token, treating it as a verb"""
    return [lemmatizer.lemmatize(word, pos='v') for word in words]

def preprocess(words):
    """Run all cleaning steps on a token list and join the result into one string"""
    words = remove_non_ascii(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = remove_stopwords_from_tokens(words)
    words = lemmatize_list(words)
    return ' '.join(words)


In [None]:

## Tokenize the news text for pre-processing
newsdf_whole['tokenized_news'] = newsdf_whole['raw_news'].apply(nltk.word_tokenize)

## Pre-process each row
newsdf_whole['cleaned_news'] = newsdf_whole['tokenized_news'].apply(preprocess)

## Drop the intermediate columns 'tokenized_news' and 'raw_news'
newsdf_whole = newsdf_whole.drop(['tokenized_news', 'raw_news'], axis=1)
newsdf_whole.head(10)

In [44]:
# Random 70/30 split of the merged data (no fixed random_state, so results vary between runs)
train, test = train_test_split(newsdf_whole, test_size=0.3, shuffle=True)

In [45]:
train['cleaned_news'].shape, train['target'].shape, test['cleaned_news'].shape, test['target'].shape

((13192,), (13192,), (5654,), (5654,))

In [46]:
X_train3 = train['cleaned_news']
y_train = train['target']
X_test3 = test['cleaned_news']
y_test = test['target']

In [47]:
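# Earlier pipeline (CountVectorizer + TfidfTransformer), kept for reference;
# the active pipeline below uses TfidfVectorizer with a capped vocabulary instead.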
# pipeline = Pipeline([
#     ('bow', CountVectorizer()),
#     ('tfidf', TfidfTransformer()),
#     ('classifier', MultinomialNB())
# ])

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=60000)),
    ('classifier', MultinomialNB())
])

# Create param grid.
param_grid = [
    {'classifier' : [MultinomialNB()],
     'classifier__alpha' : np.logspace(-5, 3, 8)   
    }
]

print(f"# Tuning hyper-parameters ")
mnb_clf3 = GridSearchCV(pipeline, param_grid = param_grid, cv=10, verbose=1, refit=True, n_jobs=-1, scoring='accuracy')
mnb_clf3.fit(X_train3, y_train)

print("Best parameters set found on development set:")
print()
print(mnb_clf3.best_params_)
print(mnb_clf3.best_score_)
print()

# Tuning hyper-parameters 
Fitting 10 folds for each of 8 candidates, totalling 80 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   39.1s
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed:  1.1min finished


Best parameters set found on development set:

{'classifier': MultinomialNB(alpha=0.026826957952797246, class_prior=None, fit_prior=True), 'classifier__alpha': 0.026826957952797246}
0.9025917821122521



In [48]:
best3 = mnb_clf3.best_estimator_

print("Detailed classification report:")
print()
y_true, y_pred = y_test, best3.predict(X_test3)
print(classification_report(y_true, y_pred))
print()
print(f'Accuracy of MultinomialNB classifier on test set: {accuracy_score(y_true, y_pred)}')
print()

Detailed classification report:

              precision    recall  f1-score   support

           0       0.90      0.93      0.91       238
           1       0.80      0.85      0.82       267
           2       0.85      0.83      0.84       309
           3       0.79      0.87      0.83       307
           4       0.92      0.87      0.89       293
           5       0.94      0.90      0.92       288
           6       0.86      0.84      0.85       287
           7       0.92      0.94      0.93       302
           8       0.97      0.97      0.97       299
           9       0.95      0.97      0.96       303
          10       0.96      0.97      0.97       276
          11       0.95      0.95      0.95       293
          12       0.91      0.89      0.90       308
          13       0.95      0.93      0.94       301
          14       0.93      0.97      0.95       272
          15       0.92      0.94      0.93       298
          16       0.94      0.96      0.95     

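#### After the additional NLP preprocessing, classification accuracy improved noticeably. [83% --> 91%]

Note that this run uses a random 70/30 split of the merged data rather than the original train/test split, so the two figures are not directly comparable.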

In [None]:
target_names = categories

fig, ax = plt.subplots(figsize=(30, 20))
plot_confusion_matrix(best3, X_test3, y_test,
                      display_labels=target_names,
                      cmap=plt.cm.Blues, ax=ax)

### Conclusion:

Stop-word removal alone improved accuracy only slightly; preprocessing tailored to the dataset was needed for a larger gain. Here, the additional NLP steps (contraction expansion, tokenization, a custom stop-word list, and lemmatization) improved performance more than stop-word removal alone, although that last experiment also used a different train/test split.