# Text Classification

We use the popular 20 Newsgroups dataset, which is available through Scikit-Learn. It comprises around 18,000 newsgroup posts spread across 20 categories or topics, which makes this a 20-class classification problem and considerably harder than a binary task such as predicting spam in emails. In general, the more classes there are, the harder it is to build an accurate classifier.

Details pertaining to the dataset can be found at `http://scikit-learn.org/0.19/datasets/twenty_newsgroups.html`. It is recommended to remove the headers, footers, and quotes from the text documents so that the model does not overfit to specific headers or email addresses and fail to generalize.

When loading the 20 Newsgroups data, Scikit-Learn accepts a parameter called `remove` that tells it what kinds of information to strip out of each file. The `remove` parameter should be a tuple containing any subset of `('headers', 'footers', 'quotes')`, which removes headers, signature blocks, and quotation blocks, respectively.

We will also remove documents that are empty or have no content after removing these three items during the data preprocessing stage, because it would be pointless to try to extract features from empty documents. 

### Load data

Let’s start by loading the necessary dataset and defining functions for building the training and testing datasets.

In [1]:
%run setup.ipynb
%run text_libraries.ipynb

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Elisabetta\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Elisabetta\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
from sklearn.datasets import fetch_20newsgroups
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [3]:
%run text_data_preprocessing_steps.ipynb

In [4]:
data = fetch_20newsgroups(subset='all', shuffle=True,
                          remove=('headers', 'footers', 'quotes'))
#Downloading 20news dataset. This may take a few minutes.
#Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)
data_labels_map = dict(enumerate(data.target_names))

# building the dataframe
corpus, target_labels, target_names = (data.data, data.target, [data_labels_map[label] for label in data.target])
data_df = pd.DataFrame({'Article': corpus, 'Target Label': target_labels, 'Target Name': target_names})
print(data_df.shape)
data_df.head(10)

(18846, 3)


Unnamed: 0,Article,Target Label,Target Name
0,\n\nI am sure some bashers of Pens fans are pr...,10,rec.sport.hockey
1,My brother is in the market for a high-perform...,3,comp.sys.ibm.pc.hardware
2,\n\n\n\n\tFinally you said what you dream abou...,17,talk.politics.mideast
3,\nThink!\n\nIt's the SCSI card doing the DMA t...,3,comp.sys.ibm.pc.hardware
4,1) I have an old Jasmine drive which I cann...,4,comp.sys.mac.hardware
5,\n\nBack in high school I worked as a lab assi...,12,sci.electronics
6,\n\nAE is in Dallas...try 214/241-6060 or 214/...,4,comp.sys.mac.hardware
7,"\n[stuff deleted]\n\nOk, here's the solution t...",10,rec.sport.hockey
8,"\n\n\nYeah, it's the second one. And I believ...",10,rec.sport.hockey
9,\nIf a Christian means someone who believes in...,19,talk.religion.misc


From this dataset, we can see that each document has some textual content and the label can be denoted by a specific number, which maps to a newsgroup category name.

In [5]:
total_nulls = data_df[data_df.Article.str.strip() == ''].shape[0]
print("Empty documents:", total_nulls)

Empty documents: 515


In [6]:
data_df = data_df[~(data_df.Article.str.strip() == '')]
data_df.shape

(18331, 3)

In [7]:
#import nltk
stopword_list = nltk.corpus.stopwords.words('english')
# normalize our corpus
norm_corpus = preprocess(data_df['Article'], cleaning = True, stemming = False, stem_type = None, 
                         lemmatization = True, remove_stopwords = True)
data_df['Clean Article'] = norm_corpus
# view sample data
data_df = data_df[['Article', 'Clean Article', 'Target Label', 'Target Name']]
data_df.head(10)

Unnamed: 0,Article,Clean Article,Target Label,Target Name
0,\n\nI am sure some bashers of Pens fans are pr...,sure bashers pen fan pretty confuse lack kind ...,10,rec.sport.hockey
1,My brother is in the market for a high-perform...,brother market high performance video card sup...,3,comp.sys.ibm.pc.hardware
2,\n\n\n\n\tFinally you said what you dream abou...,finally say what dream mediterranean new area ...,17,talk.politics.mideast
3,\nThink!\n\nIt's the SCSI card doing the DMA t...,think scsi card dma transfer disk scsi card dm...,3,comp.sys.ibm.pc.hardware
4,1) I have an old Jasmine drive which I cann...,1 old jasmine drive which cannot use new syste...,4,comp.sys.mac.hardware
5,\n\nBack in high school I worked as a lab assi...,back high school work lab assistant bunch expe...,12,sci.electronics
6,\n\nAE is in Dallas...try 214/241-6060 or 214/...,ae dallas try 214 241 6060 214 241 0055 tech s...,4,comp.sys.mac.hardware
7,"\n[stuff deleted]\n\nOk, here's the solution t...",stuff delete ok solution problem move canada y...,10,rec.sport.hockey
8,"\n\n\nYeah, it's the second one. And I believ...",yeah second one believe price try get good loo...,10,rec.sport.hockey
9,\nIf a Christian means someone who believes in...,christian mean someone who believe divinity je...,19,talk.religion.misc


We now have a preprocessed and normalized corpus of articles. Some documents may end up empty or null after preprocessing. We use the following code to test this assumption and remove these documents from our corpus.

In [8]:
data_df = data_df.replace(r'^\s*$', np.nan, regex=True)
data_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18331 entries, 0 to 18845
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Article        18331 non-null  object
 1   Clean Article  18306 non-null  object
 2   Target Label   18331 non-null  int32 
 3   Target Name    18331 non-null  object
dtypes: int32(1), object(3)
memory usage: 644.4+ KB


The `Clean Article` column now has 25 null entries (18,331 rows but only 18,306 non-null values), so some articles did become empty after preprocessing. We can safely remove these null documents using the following code.

In [9]:
data_df = data_df.dropna().reset_index(drop=True)
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18306 entries, 0 to 18305
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Article        18306 non-null  object
 1   Clean Article  18306 non-null  object
 2   Target Label   18306 non-null  int32 
 3   Target Name    18306 non-null  object
dtypes: int32(1), object(3)
memory usage: 500.7+ KB


We can now use this dataset for building our text classification system. Feel free to store the dataset using the following code if needed so you don’t need to run the preprocessing step every time.

In [10]:
data_df.to_csv(f'{RESULTS_PATH}/clean_newsgroups.csv', index=False)


### Building Train and Test Datasets

To build a machine learning system, we need to build our models on training data and then test and evaluate their performance on test data. Hence, we split our dataset into train and test datasets, using 67% of the data for training and 33% for testing.

In [11]:
from sklearn.model_selection import train_test_split

train_corpus, test_corpus, train_label_nums, test_label_nums, train_label_names, test_label_names =\
                                 train_test_split(np.array(data_df['Clean Article']), np.array(data_df['Target Label']),
                                                       np.array(data_df['Target Name']), test_size=0.33, random_state=42)

train_corpus.shape, test_corpus.shape

((12265,), (6041,))

You can also observe the distribution of articles across the newsgroup categories using the following code. This gives an idea of how many documents per category are used to train the model and how many are used to test it.

In [12]:
from collections import Counter
trd = dict(Counter(train_label_names))
tsd = dict(Counter(test_label_names))

(pd.DataFrame([[key, trd[key], tsd[key]] for key in trd], 
    columns=['Target Label', 'Train Count', 'Test Count']).sort_values(by=['Train Count', 'Test Count'], ascending=False))

Unnamed: 0,Target Label,Train Count,Test Count
17,sci.crypt,669,293
5,soc.religion.christian,664,310
15,rec.sport.hockey,660,313
7,comp.graphics,651,302
3,rec.autos,645,290
10,comp.windows.x,644,336
12,rec.sport.baseball,641,314
16,rec.motorcycles,640,329
6,sci.electronics,639,317
4,misc.forsale,638,321


Above you can see the distribution of train and test articles by the 20 newsgroups. 

We now briefly cover the various feature engineering techniques, which we use in this chapter to build our text classification models.

### Feature Engineering Techniques

There are various feature extraction or feature engineering techniques. 

In a dataset, there are typically many data points, which are usually the rows of the dataset, and the columns are various features or properties of the dataset with specific values for each row or observation. In machine learning terminology, features are unique measurable attributes or properties for each observation or data point in a dataset. Features are usually numeric. They can be absolute numeric values or categorical features; categorical features can be encoded as one binary feature per category using a process called `one-hot` encoding, or represented as distinct integers using a process called `label-encoding`. The process of extracting and selecting features is both an art and a science and is called `feature extraction` or `feature engineering`.
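To make the difference between the two encodings concrete, here is a minimal sketch in plain Python. The `colors` feature and its values are made up for illustration and do not come from the newsgroups data.

```python
# Toy categorical feature; the values are invented for illustration.
colors = ["red", "green", "blue", "green"]

# Label encoding: map each category to an integer ID (sorted for determinism).
categories = sorted(set(colors))
label_ids = {cat: i for i, cat in enumerate(categories)}
label_encoded = [label_ids[c] for c in colors]

# One-hot encoding: one binary column per category, exactly one 1 per row.
one_hot_encoded = [[1 if c == cat else 0 for cat in categories] for c in colors]

print(categories)       # ['blue', 'green', 'red']
print(label_encoded)    # [2, 1, 0, 1]
print(one_hot_encoded)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Label encoding is compact but implies an ordering between categories, whereas one-hot encoding avoids that at the cost of one column per category.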

Feature engineering is very important and is often called the secret sauce behind better performing machine learning models. Extracted features are fed into machine learning algorithms, which learn patterns that can be applied to new data points. These algorithms usually expect features in the form of numeric vectors because each algorithm is, at heart, a mathematical optimization that minimizes loss or error as it learns patterns from data points and observations. Hence, textual data brings the added challenge of transforming text into numeric features.

Traditional (count-based) feature engineering strategies for textual data belong to a family of models popularly known as the Bag of Words model. While these are effective methods for extracting features from text, the model treats each document as just a bag of unstructured words, so we lose additional information like the semantics, structure, sequence, and context around nearby words in each text document.

#### Bag of Words
The `Bag of Words` model represents each text document as a numeric vector where each dimension is a specific word from the corpus and the value could be its frequency in the document, its occurrence (denoted by 1 or 0), or even a weighted value. The model gets its name because each document is represented literally as a bag of its own words, disregarding word order, sequences, and grammar.
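As a rough illustration of the idea, the sketch below builds count and binary Bag of Words vectors by hand for a three-sentence toy corpus. The corpus and the helper name `bow_vector` are assumptions made for this example; this is not how `CountVectorizer` is implemented internally.

```python
from collections import Counter

# Toy corpus, invented for illustration.
docs = ["the sky is blue", "the sun is bright", "the sun in the sky"]

# Vocabulary: every distinct token in the corpus, in sorted order.
tokenized = [d.split() for d in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})

def bow_vector(tokens, binary=False):
    """Return one value per vocabulary word: its count, or 1/0 if binary."""
    counts = Counter(tokens)
    if binary:
        return [1 if counts[w] else 0 for w in vocab]
    return [counts[w] for w in vocab]

count_vectors = [bow_vector(t) for t in tokenized]
binary_vectors = [bow_vector(t, binary=True) for t in tokenized]

print(vocab)              # ['blue', 'bright', 'in', 'is', 'sky', 'sun', 'the']
print(count_vectors[2])   # [0, 0, 1, 0, 1, 1, 2]
print(binary_vectors[2])  # [0, 0, 1, 0, 1, 1, 1]
```

Note that "the sun in the sky" and "the sky in the sun" would produce identical vectors, which is exactly the loss of word order described above.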

Let’s start by using a basic Bag of Words, the term frequency-based feature engineering model, to extract features from our train and test datasets.

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

# build BOW features on train articles
cv = CountVectorizer(binary=False, min_df=0.0, max_df=1.0)
cv_train_features = cv.fit_transform(train_corpus)

In [14]:
# transform test articles into features
cv_test_features = cv.transform(test_corpus)

In [15]:
print('BOW model:> Train features shape:', cv_train_features.shape, ' Test features shape:', cv_test_features.shape)

BOW model:> Train features shape: (12265, 93559)  Test features shape: (6041, 93559)


#### Classification Models
Classification models are supervised machine learning algorithms that are used to classify, categorize, or label data points based on what they have observed in the past.

We now build several classifiers on these features using the training data and test their performance on the test dataset using all the classification models we discussed earlier. We also check model accuracies using five-fold cross validation just to see if the model performs consistently across the validation folds of data (we use this same strategy to tune the models later).
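To show what five-fold cross validation does with the training rows, here is a minimal sketch that splits sample indices into contiguous folds. The function name `k_fold_indices` is hypothetical, and the sketch does no shuffling or stratification; for classifiers, Scikit-Learn's `cross_val_score` with an integer `cv` uses stratified folds, which this sketch does not attempt.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=12, k=5))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "train on", len(train_idx), "samples")
```

Each sample is used for validation exactly once and for training in the other four rounds, which is why the five scores give a sense of how consistent a model is.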

##### Multinomial NB 
This is a variant of the popular Naïve Bayes algorithm suited to features that represent counts or frequencies, such as the word counts in a Bag of Words model.

The Naïve Bayes algorithm is a supervised learning algorithm that applies Bayes' theorem. It makes the `naïve` assumption that each feature is completely independent of the others given the class.

Multinomial Naïve Bayes models the features of each class with a multinomial distribution. It handles both binary and multi-class problems, so it applies directly to our 20 newsgroup categories.
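The following from-scratch sketch shows the Naïve Bayes idea on word counts with Laplace smoothing. The tiny training set, the two labels, and the function names are assumptions for illustration; it is meant to show the arithmetic, not to mirror the internals of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

# Toy training set, invented for illustration.
train_docs = [
    ("puck goal ice goal", "hockey"),
    ("ice skate puck", "hockey"),
    ("disk drive scsi", "hardware"),
    ("scsi card drive drive", "hardware"),
]
alpha = 1.0  # additive (Laplace) smoothing, playing the role of alpha=1

class_docs = Counter(label for _, label in train_docs)
word_counts = defaultdict(Counter)
for text, label in train_docs:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def log_posterior(text, label):
    """log P(class) plus the sum of smoothed log P(token | class)."""
    total = sum(word_counts[label].values())
    score = math.log(class_docs[label] / len(train_docs))
    for tok in text.split():
        if tok in vocab:  # tokens never seen in training are ignored
            score += math.log(
                (word_counts[label][tok] + alpha) / (total + alpha * len(vocab))
            )
    return score

def predict(text):
    """Pick the class with the highest (unnormalized) log posterior."""
    return max(class_docs, key=lambda label: log_posterior(text, label))

print(predict("puck on ice"))     # hockey
print(predict("new scsi drive"))  # hardware
```

Summing per-token log probabilities is where the independence assumption shows up: each word contributes on its own, regardless of the words around it.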

In [16]:
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB(alpha=1)
mnb.fit(cv_train_features, train_label_names)
mnb_bow_cv_scores = cross_val_score(mnb, cv_train_features, train_label_names, cv=5)
mnb_bow_cv_mean_score = np.mean(mnb_bow_cv_scores)
print('CV Accuracy (5-fold):', mnb_bow_cv_scores)
print('Mean CV Accuracy:', mnb_bow_cv_mean_score)
mnb_bow_test_score = mnb.score(cv_test_features, test_label_names)
print('Test Accuracy:', mnb_bow_test_score)

CV Accuracy (5-fold): [0.66408479 0.65674684 0.6583775  0.65103954 0.65144721]
Mean CV Accuracy: 0.6563391765185488
Test Accuracy: 0.6661148816421122


##### Logistic Regression 
The logistic regression model is actually a statistical model developed by statistician David Cox in 1958. It is also known as the logit or logistic model since it uses the logistic (also known as sigmoid) function to map a linear combination of the features to a probability. The parameters are the coefficients of our features, estimated so that the overall loss is minimized when predicting the outcome, in this case the newsgroup categories. Rather than minimizing prediction errors directly, the fit maximizes the likelihood of the observed values under the model using Maximum Likelihood Estimation (MLE).
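As a small illustration of the likelihood being maximized, the sketch below evaluates the log-likelihood of a binary logistic model for two hand-picked parameter settings. The data, the parameter values, and the function names are assumptions; the newsgroups problem itself has 20 classes, so this two-class version only conveys the idea.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(weights, bias, X, y):
    """Sum of log P(observed label | features) under the model."""
    total = 0.0
    for features, label in zip(X, y):
        z = bias + sum(w * x for w, x in zip(weights, features))
        p = sigmoid(z)
        total += label * math.log(p) + (1 - label) * math.log(1 - p)
    return total

# One feature, four observations; labels switch from 0 to 1 between x=1 and x=2.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

poor_fit = log_likelihood([0.0], 0.0, X, y)     # predicts 0.5 everywhere
better_fit = log_likelihood([2.0], -3.0, X, y)  # decision boundary at x = 1.5

print(poor_fit, better_fit)
```

Fitting amounts to searching for the weights and bias that make this quantity as large as possible, usually with a regularization penalty added.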

In [17]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l2', max_iter=100, C=1, random_state=42)
lr.fit(cv_train_features, train_label_names)
lr_bow_cv_scores = cross_val_score(lr, cv_train_features, train_label_names, cv=5)
lr_bow_cv_mean_score = np.mean(lr_bow_cv_scores)
print('CV Accuracy (5-fold):', lr_bow_cv_scores)
print('Mean CV Accuracy:', lr_bow_cv_mean_score)
lr_bow_test_score = lr.score(cv_test_features, test_label_names)
print('Test Accuracy:', lr_bow_test_score)

CV Accuracy (5-fold): [0.68772931 0.6742764  0.69629026 0.68854464 0.67549939]
Mean CV Accuracy: 0.6844679983693437
Test Accuracy: 0.6993875186227446


##### Support Vector Machines 
In machine learning, support vector machines, known popularly as SVMs, are supervised learning algorithms. They are used for classification, regression, and novelty, anomaly, or outlier detection. Considering a binary classification problem, if we have training data such that each data point or observation belongs to a specific class, the SVM algorithm can be trained on this data so that it can assign future data points to one of the two classes.

This algorithm represents the training data samples as points in space such that points belonging to either class are separated by a hyperplane with as wide a gap (margin) as possible on either side, and new data points are assigned classes based on which side of this hyperplane they fall on. This describes typical linear classification. However, SVM can also perform non-linear classification using an approach known as the kernel trick, where kernel functions implicitly map the data into a high-dimensional feature space in which classes that are not linearly separable in the original space can be separated. Usually, inner products between data points in the feature space help achieve this.

The SVM algorithm takes in a set of training data points and constructs a hyperplane or a collection of hyperplanes in a high-dimensional feature space. The larger the margin around the hyperplane, the better the separation.

The Scikit-Learn implementation of SVM can be found in SVC, LinearSVC, or SGDClassifier. With SGDClassifier, the hinge loss function (its default) gives a linear SVM, while LinearSVC uses the squared hinge loss by default. These loss functions allow some margin violations, which is why the resulting model is often known as a soft-margin SVM. You can also use different kernel functions to convert the existing feature space into an even higher dimensional feature space, where the data can be separated linearly. However, this is rarely needed for text data problems since you already deal with a huge number of dimensions right from the start.

For a multi-class classification problem with `n` classes, a binary classifier is trained for each class to separate it from the other `n-1` classes. During prediction, the scores (distances to hyperplanes) for each classifier are computed and the class with the maximum score is chosen as the label. Stochastic gradient descent is often used to minimize the loss function in SVM algorithms.
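The one-vs-rest scoring rule and the hinge loss can be sketched in a few lines. The three linear models below use hand-picked, hypothetical weights rather than trained ones, and the function names are invented for this example.

```python
def hinge_loss(score, target):
    """Hinge loss for one sample; target is +1 or -1."""
    return max(0.0, 1.0 - target * score)

# Hypothetical one-vs-rest linear models: one (weights, bias) pair per class.
ovr_models = {
    "hockey":   ([1.0, -0.5], 0.0),
    "hardware": ([-0.8, 1.2], -0.1),
    "politics": ([-0.2, -0.3], 0.2),
}

def decision_scores(x):
    """Signed score of x against each class's hyperplane."""
    return {
        label: sum(w * xi for w, xi in zip(weights, x)) + bias
        for label, (weights, bias) in ovr_models.items()
    }

def predict(x):
    """Choose the class whose classifier gives the highest score."""
    scores = decision_scores(x)
    return max(scores, key=scores.get)

x = [0.2, 1.5]
print(decision_scores(x))
print(predict(x))  # hardware

# Correct and outside the margin: no loss. Correct but inside the margin: some loss.
print(hinge_loss(2.0, +1), hinge_loss(0.5, +1), hinge_loss(-1.0, +1))  # 0.0 0.5 2.0
```

The hinge loss is zero only when a sample is on the correct side with a score of at least 1, which is what pushes the optimizer toward a wide margin.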

In [18]:
from sklearn.svm import LinearSVC

svm = LinearSVC(penalty='l2', C=1, random_state=42)
svm.fit(cv_train_features, train_label_names)
svm_bow_cv_scores = cross_val_score(svm, cv_train_features, train_label_names, cv=5)
svm_bow_cv_mean_score = np.mean(svm_bow_cv_scores)
print('CV Accuracy (5-fold):', svm_bow_cv_scores)
print('Mean CV Accuracy:', svm_bow_cv_mean_score)
svm_bow_test_score = svm.score(cv_test_features, test_label_names)
print('Test Accuracy:', svm_bow_test_score)

CV Accuracy (5-fold): [0.64573991 0.64410925 0.65878516 0.63962495 0.64573991]
Mean CV Accuracy: 0.6467998369343662
Test Accuracy: 0.6586657838106273


In [19]:
from sklearn.linear_model import SGDClassifier

svm_sgd = SGDClassifier(loss='hinge', penalty='l2', max_iter=5, random_state=42)
svm_sgd.fit(cv_train_features, train_label_names)
svmsgd_bow_cv_scores = cross_val_score(svm_sgd, cv_train_features, train_label_names, cv=5)
svmsgd_bow_cv_mean_score = np.mean(svmsgd_bow_cv_scores)
print('CV Accuracy (5-fold):', svmsgd_bow_cv_scores)
print('Mean CV Accuracy:', svmsgd_bow_cv_mean_score)
svmsgd_bow_test_score = svm_sgd.score(cv_test_features, test_label_names)
print('Test Accuracy:', svmsgd_bow_test_score)

CV Accuracy (5-fold): [0.64084794 0.64777823 0.64573991 0.66163881 0.632287  ]
Mean CV Accuracy: 0.6456583774969425
Test Accuracy: 0.6383049164045688


##### Random Forest

Decision trees are a family of supervised machine learning algorithms that can represent and interpret sets of rules automatically from the underlying data. They use metrics like information gain and the Gini index to build the tree. However, a major drawback of decision trees is that since they are non-parametric, the more data there is, the greater the depth of the tree. We can end up with really huge and deep trees that are prone to overfitting. The model might work really well on training data, but instead of learning general patterns, it memorizes the training samples and builds rules that are very specific to them. Hence, it performs poorly on the test data. Random forests try to tackle this problem.

A random forest is a meta-estimator or an ensemble model that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control overfitting. Each tree is built from a bootstrap sample, that is, a sample drawn with replacement from the training set and, by default, of the same size as the training set. The trees are independent of one another, so they can be trained in parallel (bagging, or bootstrap aggregation). Also, when splitting a node during the construction of a tree, the split that is chosen is no longer the best split among all features. Instead, it is the best split among a random subset of the features. Thus the randomness in a random forest comes both from random sampling of the data and from random selection of features when splitting nodes in each tree. Because of this randomness, the bias of the forest usually increases slightly (with respect to the bias of a single non-random decision tree). However, due to averaging, the variance of the model decreases by more than enough to compensate for the increase in bias, and hence we get an overall better model.
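The two sources of randomness, and the averaging step, can be sketched without any tree-building code. Everything below is a toy illustration under assumptions: the helper names are hypothetical, the per-tree probabilities are made up, and no actual decision trees are trained.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

def bootstrap_sample(rows):
    """Draw len(rows) rows with replacement (some repeat, some are left out)."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, k):
    """Pick k distinct feature indices to consider at one split."""
    return sorted(rng.sample(range(n_features), k))

def average_probabilities(per_tree_probs):
    """Average the class probabilities predicted by each tree."""
    labels = per_tree_probs[0].keys()
    n_trees = len(per_tree_probs)
    return {label: sum(p[label] for p in per_tree_probs) / n_trees for label in labels}

rows = list(range(10))
sample = bootstrap_sample(rows)
features = random_feature_subset(n_features=16, k=4)

# Made-up outputs of three trees for a single document.
tree_outputs = [
    {"sci.crypt": 0.9, "rec.autos": 0.1},
    {"sci.crypt": 0.4, "rec.autos": 0.6},
    {"sci.crypt": 0.5, "rec.autos": 0.5},
]
averaged = average_probabilities(tree_outputs)
prediction = max(averaged, key=averaged.get)

print(sample, features, averaged, prediction)
```

Individual trees may disagree, as the second tree does here, but averaging smooths out their individual errors, which is the variance reduction described above.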

When building a random forest, you can set specific model parameters for both the base decision trees and the overall forest. For the trees, you usually have the same parameters as a normal decision tree model, like the tree depth, number of leaves, number of features in each split, samples per leaf, and the criterion for node splits (such as information gain or Gini impurity). For the forest, you can tune the total number of trees needed, the number of features to be used per tree, and so on.

In [20]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=10, random_state=42)
rfc.fit(cv_train_features, train_label_names)
rfc_bow_cv_scores = cross_val_score(rfc, cv_train_features, train_label_names, cv=5)
rfc_bow_cv_mean_score = np.mean(rfc_bow_cv_scores)
print('CV Accuracy (5-fold):', rfc_bow_cv_scores)
print('Mean CV Accuracy:', rfc_bow_cv_mean_score)
rfc_bow_test_score = rfc.score(cv_test_features, test_label_names)
print('Test Accuracy:', rfc_bow_test_score)

CV Accuracy (5-fold): [0.50224215 0.51406441 0.52710966 0.52792499 0.51365675]
Mean CV Accuracy: 0.5169995923359152
Test Accuracy: 0.5158086409534846


##### Gradient boosting machines

Gradient Boosting Machines (GBMs) can be used for regression and classification. Typically, a GBM builds an additive model in a forward stage-wise fashion, which allows for the optimization of arbitrary differentiable loss functions. GBMs can usually work on any combination of models (weak learners) and loss functions. Scikit-Learn uses GBRTs (Gradient Boosted Regression Trees), which generalize boosting to arbitrary differentiable loss functions. The appeal of this model is that it is accurate and can be used for both regression and classification problems.
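To illustrate the forward stage-wise idea, here is a minimal sketch of boosting with one-split regression stumps under squared loss on a one-dimensional toy problem. The data, function names, and learning rate are assumptions; a classifier such as `GradientBoostingClassifier` optimizes a different loss, so this only conveys the mechanics of fitting each stage to the current residuals.

```python
def fit_stump(xs, residuals):
    """Best single-threshold regression stump under squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_estimators=10, learning_rate=0.5):
    """Additive model: a constant plus a shrunken sum of stumps."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_estimators):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]
one_stage = boost(xs, ys, n_estimators=1)
ten_stages = boost(xs, ys, n_estimators=10)
print(mse(one_stage, xs, ys), mse(ten_stages, xs, ys))
```

Each stage corrects part of what the previous stages got wrong, so the training error falls as stages are added; unlike a random forest, the stages depend on each other and cannot be trained in parallel.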

In [21]:
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(n_estimators=10, random_state=42)
gbc.fit(cv_train_features, train_label_names)
gbc_bow_cv_scores = cross_val_score(gbc, cv_train_features, train_label_names, cv=5)
gbc_bow_cv_mean_score = np.mean(gbc_bow_cv_scores)
print('CV Accuracy (5-fold):', gbc_bow_cv_scores)
print('Mean CV Accuracy:', gbc_bow_cv_mean_score)
gbc_bow_test_score = gbc.score(cv_test_features, test_label_names)
print('Test Accuracy:', gbc_bow_test_score)

CV Accuracy (5-fold): [0.5413779  0.54545455 0.53607827 0.56257644 0.54178557]
Mean CV Accuracy: 0.5454545454545454
Test Accuracy: 0.5596755504055619


It is interesting to see that simpler models like Naïve Bayes and Logistic Regression performed much better than the ensemble models. Let’s look at the next model pipeline now.

#### TF-IDF

`TF-IDF` stands for Term Frequency-Inverse Document Frequency and is a combination of two metrics, term frequency (TF) and inverse document frequency (IDF). This technique was originally developed as a metric for ranking search engine results based on user queries and has become part of information retrieval and text feature extraction.

We use TF-IDF features to train our classification models. Because TF-IDF down-weights terms that appear in many documents and therefore carry little discriminating information, we might get better performing models.
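The weighting can be worked through by hand on a toy corpus. The sketch below uses raw counts for TF, the smoothed IDF variant `ln((1 + n) / (1 + df)) + 1` that Scikit-Learn documents for its default settings, and L2 normalization of each document vector. The corpus and helper names are assumptions for illustration.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
docs = ["the sky is blue", "the sun is bright", "the sun in the sky"]
tokenized = [d.split() for d in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = Counter(tok for toks in tokenized for tok in set(toks))

# Smoothed inverse document frequency.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_vector(tokens):
    """Raw count times IDF per vocabulary word, scaled to unit L2 norm."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm else raw

vectors = [tfidf_vector(t) for t in tokenized]

for word in ("the", "sky", "blue"):
    print(word, round(idf[word], 4))
```

Here "the" appears in every document and gets the lowest possible IDF of 1, while "blue" appears in only one and is weighted most heavily, so rare, distinctive terms dominate each document vector.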

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

# build TF-IDF features on train articles
tv = TfidfVectorizer(use_idf=True, min_df=0.0, max_df=1.0)
tv_train_features = tv.fit_transform(train_corpus)

In [23]:
# transform test articles into features
tv_test_features = tv.transform(test_corpus)

In [24]:
print('TFIDF model:> Train features shape:', tv_train_features.shape, ' Test features shape:', tv_test_features.shape)

TFIDF model:> Train features shape: (12265, 93559)  Test features shape: (6041, 93559)


#### Classification Models

We now build several classifiers on these features using the training data and test their performance on the test dataset using all the classification models. We also check model accuracies using five-fold cross validation, just like we did earlier.

In [25]:
mnb = MultinomialNB(alpha=1)
mnb.fit(tv_train_features, train_label_names)
mnb_tfidf_cv_scores = cross_val_score(mnb, tv_train_features, train_label_names, cv=5)
mnb_tfidf_cv_mean_score = np.mean(mnb_tfidf_cv_scores)
print('CV Accuracy (5-fold):', mnb_tfidf_cv_scores)
print('Mean CV Accuracy:', mnb_tfidf_cv_mean_score)
mnb_tfidf_test_score = mnb.score(tv_test_features, test_label_names)
print('Test Accuracy:', mnb_tfidf_test_score)

CV Accuracy (5-fold): [0.71259682 0.70158989 0.7264574  0.70118223 0.70729719]
Mean CV Accuracy: 0.7098247044435386
Test Accuracy: 0.713623572256249


In [26]:
lr = LogisticRegression(penalty='l2', max_iter=100, C=1, random_state=42)
lr.fit(tv_train_features, train_label_names)
lr_tfidf_cv_scores = cross_val_score(lr, tv_train_features, train_label_names, cv=5)
lr_tfidf_cv_mean_score = np.mean(lr_tfidf_cv_scores)
print('CV Accuracy (5-fold):', lr_tfidf_cv_scores)
print('Mean CV Accuracy:', lr_tfidf_cv_mean_score)
lr_tfidf_test_score = lr.score(tv_test_features, test_label_names)
print('Test Accuracy:', lr_tfidf_test_score)

CV Accuracy (5-fold): [0.7480636  0.73909499 0.74561761 0.74847126 0.74072564]
Mean CV Accuracy: 0.7443946188340808
Test Accuracy: 0.7531865585168018


In [27]:
svm = LinearSVC(penalty='l2', C=1, random_state=42)
svm.fit(tv_train_features, train_label_names)
svm_tfidf_cv_scores = cross_val_score(svm, tv_train_features, train_label_names, cv=5)
svm_tfidf_cv_mean_score = np.mean(svm_tfidf_cv_scores)
print('CV Accuracy (5-fold):', svm_tfidf_cv_scores)
print('Mean CV Accuracy:', svm_tfidf_cv_mean_score)
svm_tfidf_test_score = svm.score(tv_test_features, test_label_names)
print('Test Accuracy:', svm_tfidf_test_score)

CV Accuracy (5-fold): [0.77048512 0.75703221 0.76477782 0.76559315 0.75947819]
Mean CV Accuracy: 0.7634732980024459
Test Accuracy: 0.7717265353418308


In [28]:
svm_sgd = SGDClassifier(loss='hinge', penalty='l2', max_iter=5, random_state=42)
svm_sgd.fit(tv_train_features, train_label_names)
svmsgd_tfidf_cv_scores = cross_val_score(svm_sgd, tv_train_features, train_label_names, cv=5)
svmsgd_tfidf_cv_mean_score = np.mean(svmsgd_tfidf_cv_scores)
print('CV Accuracy (5-fold):', svmsgd_tfidf_cv_scores)
print('Mean CV Accuracy:', svmsgd_tfidf_cv_mean_score)
svmsgd_tfidf_test_score = svm_sgd.score(tv_test_features, test_label_names)
print('Test Accuracy:', svmsgd_tfidf_test_score)

CV Accuracy (5-fold): [0.76151651 0.75580921 0.76029352 0.76314717 0.75662454]
Mean CV Accuracy: 0.7594781899714635
Test Accuracy: 0.7659327925840093


In [29]:
rfc = RandomForestClassifier(n_estimators=10, random_state=42)
rfc.fit(tv_train_features, train_label_names)
rfc_tfidf_cv_scores = cross_val_score(rfc, tv_train_features, train_label_names, cv=5)
rfc_tfidf_cv_mean_score = np.mean(rfc_tfidf_cv_scores)
print('CV Accuracy (5-fold):', rfc_tfidf_cv_scores)
print('Mean CV Accuracy:', rfc_tfidf_cv_mean_score)
rfc_tfidf_test_score = rfc.score(tv_test_features, test_label_names)
print('Test Accuracy:', rfc_tfidf_test_score)

CV Accuracy (5-fold): [0.52833265 0.50428047 0.50346514 0.52140236 0.51977171]
Mean CV Accuracy: 0.5154504688136975
Test Accuracy: 0.5302102300943552


In [30]:
gbc = GradientBoostingClassifier(n_estimators=10, random_state=42)
gbc.fit(tv_train_features, train_label_names)
gbc_tfidf_cv_scores = cross_val_score(gbc, tv_train_features, train_label_names, cv=5)
gbc_tfidf_cv_mean_score = np.mean(gbc_tfidf_cv_scores)
print('CV Accuracy (5-fold):', gbc_tfidf_cv_scores)
print('Mean CV Accuracy:', gbc_tfidf_cv_mean_score)
gbc_tfidf_test_score = gbc.score(tv_test_features, test_label_names)
print('Test Accuracy:', gbc_tfidf_test_score)

CV Accuracy (5-fold): [0.5413779  0.54341623 0.53933958 0.5629841  0.54586221]
Mean CV Accuracy: 0.5465960048919691
Test Accuracy: 0.5573580533024334


The test accuracy of several models increases considerably with TF-IDF features: Linear SVM rises from about 0.66 to 0.77, Logistic Regression from about 0.70 to 0.75, and Naïve Bayes from about 0.67 to 0.71. The ensemble models, in contrast, barely change and still trail the others. Using more estimators might improve them, but they would likely still lag behind the other models and would take considerably longer to train.

### Comparative Model Performance Evaluation

We can now do a nice comparison of all the models we have tried so far with the two different feature engineering techniques. We will build a dataframe from our modeling results and compare the results.

Combine the scores obtained with BoW and TF-IDF.

In [31]:
pd.DataFrame([['Naive Bayes', mnb_bow_cv_mean_score, mnb_bow_test_score, 
               mnb_tfidf_cv_mean_score, mnb_tfidf_test_score],
              ['Logistic Regression', lr_bow_cv_mean_score, lr_bow_test_score, 
               lr_tfidf_cv_mean_score, lr_tfidf_test_score],
              ['Linear SVM', svm_bow_cv_mean_score, svm_bow_test_score, 
               svm_tfidf_cv_mean_score, svm_tfidf_test_score],
              ['Linear SVM (SGD)', svmsgd_bow_cv_mean_score, svmsgd_bow_test_score, 
               svmsgd_tfidf_cv_mean_score, svmsgd_tfidf_test_score],
              ['Random Forest', rfc_bow_cv_mean_score, rfc_bow_test_score, 
               rfc_tfidf_cv_mean_score, rfc_tfidf_test_score],
              ['Gradient Boosted Machines', gbc_bow_cv_mean_score, gbc_bow_test_score, 
               gbc_tfidf_cv_mean_score, gbc_tfidf_test_score]],
             columns=['Model', 'CV Score (TF)', 'Test Score (TF)', 'CV Score (TF-IDF)', 'Test Score (TF-IDF)'],
             ).T

Unnamed: 0,0,1,2,3,4,5
Model,Naive Bayes,Logistic Regression,Linear SVM,Linear SVM (SGD),Random Forest,Gradient Boosted Machines
CV Score (TF),0.656339,0.684468,0.6468,0.645658,0.517,0.545455
Test Score (TF),0.666115,0.699388,0.658666,0.638305,0.515809,0.559676
CV Score (TF-IDF),0.709825,0.744395,0.763473,0.759478,0.51545,0.546596
Test Score (TF-IDF),0.713624,0.753187,0.771727,0.765933,0.53021,0.557358


The results show that the best performing models were the SVMs trained on TF-IDF features, followed by Logistic Regression and Naïve Bayes. Ensemble models did not perform as well on this dataset.

### Model Tuning
Model tuning is one of the key stages in the machine learning process and can lead to better performing models. Any machine learning model typically has hyperparameters, which are much like configuration settings that you can tune like knobs on a device. A very important point to remember is that hyperparameters are not learned directly within estimators and do not depend on the underlying data (as opposed to model parameters such as the coefficients of logistic regression, which change based on the underlying training data).

It is possible, and recommended, to search the hyperparameter space for the values that give the best cross-validation score. We use a five-fold cross validation scheme along with grid search for this. A typical search for the best hyperparameter values during tuning consists of the following major components:
* A model or estimator like LogisticRegression from Scikit-Learn
* A hyperparameter space that we can define with values and ranges
* A method for searching or sampling candidates like Grid Search
* A cross-validation scheme, like five-fold cross-validation
* A score function, like accuracy, for classification models

Scikit-Learn provides two very common approaches for sampling search candidates. GridSearchCV exhaustively considers all parameter combinations set by the user. RandomizedSearchCV, in contrast, samples a given number of candidates from a parameter space with a specified distribution instead of trying all combinations. We use grid search for our tuning experiments.
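
As a rough illustration of the difference between the two search strategies, here is a minimal sketch on a small synthetic dataset. The dataset, the `C` values, and the `loguniform` range are assumptions chosen for the example, not settings used elsewhere in this notebook.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Small synthetic classification problem (illustrative only).
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
model = LogisticRegression(max_iter=1000)

# Grid search: every listed value of C is evaluated.
grid = GridSearchCV(model, {'C': [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X, y)

# Randomized search: n_iter values of C are sampled from a distribution.
rand = RandomizedSearchCV(model, {'C': loguniform(1e-2, 1e1)},
                          n_iter=4, cv=5, random_state=42)
rand.fit(X, y)

print('grid candidates  :', len(grid.cv_results_['params']))
print('random candidates:', len(rand.cv_results_['params']))
print('best C (grid)    :', grid.best_params_['C'])
print('best C (random)  :', rand.best_params_['C'])
```

With a single hyperparameter the two approaches cost about the same; randomized search tends to pay off when the grid would otherwise grow across many hyperparameters.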

To tune the models, we also use a Scikit-Learn Pipeline object, which is an excellent way to chain multiple components together: it sequentially applies a list of transforms, such as data preprocessors and feature engineering methods, followed by a model estimator for predictions. Intermediate steps of the pipeline must be some form of “transformer,” that is, they must implement fit and transform methods.
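
A minimal sketch of a two-step pipeline, using a handful of made-up documents rather than the newsgroup corpus, shows how hyperparameters of individual steps are addressed with the `<step name>__<parameter>` naming scheme that grid search relies on.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus (illustrative only).
docs = ['the rocket reached orbit', 'the goalie made a save',
        'nasa launched a new rocket', 'the team won the hockey game']
labels = ['space', 'sport', 'space', 'sport']

# Step 1 is a transformer (fit/transform); step 2 is the final estimator.
pipe = Pipeline([('tfidf', TfidfVectorizer()),
                 ('mnb', MultinomialNB())])

# Parameters of each step are addressed as '<step name>__<parameter>'.
pipe.set_params(tfidf__ngram_range=(1, 2), mnb__alpha=0.1)

# fit() runs fit_transform on the vectorizer, then fit on the classifier.
pipe.fit(docs, labels)
pred = pipe.predict(['rocket launch'])
print(pred)
```

Because the vectorizer is refit inside every call to `fit`, cross-validating the whole pipeline keeps each validation fold out of the vocabulary and IDF statistics.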

We use the pipeline so that components like feature engineering and modeling are assembled into a single estimator that can be cross-validated as a whole, while grid search sets different hyperparameter values on each step. Let’s get started with tuning our Naïve Bayes model.

In [32]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

mnb_pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                        ('mnb', MultinomialNB())
                       ])

param_grid = {'tfidf__ngram_range': [(1, 1), (1, 2)],
              'mnb__alpha': [1e-5, 1e-4, 1e-2, 1e-1, 1]
}

gs_mnb = GridSearchCV(mnb_pipeline, param_grid, cv=5, verbose=2)
gs_mnb = gs_mnb.fit(train_corpus, train_label_names)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END ........mnb__alpha=1e-05, tfidf__ngram_range=(1, 1); total time=   2.6s
[CV] END ........mnb__alpha=1e-05, tfidf__ngram_range=(1, 1); total time=   2.4s
[CV] END ........mnb__alpha=1e-05, tfidf__ngram_range=(1, 1); total time=   2.7s
[CV] END ........mnb__alpha=1e-05, tfidf__ngram_range=(1, 1); total time=   2.5s
[CV] END ........mnb__alpha=1e-05, tfidf__ngram_range=(1, 1); total time=   2.6s
[CV] END ........mnb__alpha=1e-05, tfidf__ngram_range=(1, 2); total time=  13.9s
[CV] END ........mnb__alpha=1e-05, tfidf__ngram_range=(1, 2); total time=   9.8s
[CV] END ........mnb__alpha=1e-05, tfidf__ngram_range=(1, 2); total time=  10.2s
[CV] END ........mnb__alpha=1e-05, tfidf__ngram_range=(1, 2); total time=   9.9s
[CV] END ........mnb__alpha=1e-05, tfidf__ngram_range=(1, 2); total time=  10.1s
[CV] END .......mnb__alpha=0.0001, tfidf__ngram_range=(1, 1); total time=   2.4s
[CV] END .......mnb__alpha=0.0001, tfidf__ngram_

We can now inspect the hyperparameter values chosen for our best estimator/model using the following code.

In [33]:
gs_mnb.best_estimator_.get_params()

{'memory': None,
 'steps': [('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
  ('mnb', MultinomialNB(alpha=0.01))],
 'verbose': False,
 'tfidf': TfidfVectorizer(ngram_range=(1, 2)),
 'mnb': MultinomialNB(alpha=0.01),
 'tfidf__analyzer': 'word',
 'tfidf__binary': False,
 'tfidf__decode_error': 'strict',
 'tfidf__dtype': numpy.float64,
 'tfidf__encoding': 'utf-8',
 'tfidf__input': 'content',
 'tfidf__lowercase': True,
 'tfidf__max_df': 1.0,
 'tfidf__max_features': None,
 'tfidf__min_df': 1,
 'tfidf__ngram_range': (1, 2),
 'tfidf__norm': 'l2',
 'tfidf__preprocessor': None,
 'tfidf__smooth_idf': True,
 'tfidf__stop_words': None,
 'tfidf__strip_accents': None,
 'tfidf__sublinear_tf': False,
 'tfidf__token_pattern': '(?u)\\b\\w\\w+\\b',
 'tfidf__tokenizer': None,
 'tfidf__use_idf': True,
 'tfidf__vocabulary': None,
 'mnb__alpha': 0.01,
 'mnb__class_prior': None,
 'mnb__fit_prior': True}

Now you might be wondering how these specific hyperparameter values were selected for the best estimator. Grid search picked them based on model performance: each combination of hyperparameter values was scored on the five validation folds during cross-validation, and the combination with the best mean score was selected.

In [34]:
cv_results = gs_mnb.cv_results_
results_df = pd.DataFrame({'rank': cv_results['rank_test_score'],
                           'params': cv_results['params'], 
                           'cv score (mean)': cv_results['mean_test_score'], 
                           'cv score (std)': cv_results['std_test_score']} 
              )
results_df = results_df.sort_values(by=['rank'], ascending=True)
pd.set_option('display.max_colwidth', 100)
results_df

Unnamed: 0,rank,params,cv score (mean),cv score (std)
5,1,"{'mnb__alpha': 0.01, 'tfidf__ngram_range': (1, 2)}",0.776519,0.004271
4,2,"{'mnb__alpha': 0.01, 'tfidf__ngram_range': (1, 1)}",0.774643,0.007864
3,3,"{'mnb__alpha': 0.0001, 'tfidf__ngram_range': (1, 2)}",0.760701,0.003554
6,4,"{'mnb__alpha': 0.1, 'tfidf__ngram_range': (1, 1)}",0.760049,0.007064
7,5,"{'mnb__alpha': 0.1, 'tfidf__ngram_range': (1, 2)}",0.759234,0.008669
1,6,"{'mnb__alpha': 1e-05, 'tfidf__ngram_range': (1, 2)}",0.751406,0.005198
2,7,"{'mnb__alpha': 0.0001, 'tfidf__ngram_range': (1, 1)}",0.745862,0.004674
0,8,"{'mnb__alpha': 1e-05, 'tfidf__ngram_range': (1, 1)}",0.733062,0.005361
8,9,"{'mnb__alpha': 1, 'tfidf__ngram_range': (1, 1)}",0.711374,0.008632
9,10,"{'mnb__alpha': 1, 'tfidf__ngram_range': (1, 2)}",0.706237,0.007194


The table shows model performance across the different hyperparameter values in the search space. You can see that the best combination, which includes bi-gram TF-IDF features, gave the best cross-validation accuracy. Note that we never tune our models based on test data scores, because that would end up biasing our model toward the test dataset. We can now check our tuned model’s performance on the test data.

In [35]:
best_mnb_test_score = gs_mnb.score(test_corpus, test_label_names)
print('Test Accuracy :', best_mnb_test_score)

Test Accuracy : 0.7808309882469789


Looks like we have achieved a test accuracy of about 78.1%, which is an improvement of nearly 7 percentage points over the base Naïve Bayes model with TF-IDF features (71.4%)! Let’s look at how tuning performs for logistic regression now.

In [36]:
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC

In [37]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

lr_pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                        ('lr', LogisticRegression(penalty='l2', max_iter=100, random_state=42))
                       ])

param_grid = {'tfidf__ngram_range': [(1, 1), (1, 2)],
              'lr__C': [1, 5, 10]
}

gs_lr = GridSearchCV(lr_pipeline, param_grid, cv=5, verbose=2)
gs_lr = gs_lr.fit(train_corpus, train_label_names)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV] END .................lr__C=1, tfidf__ngram_range=(1, 1); total time= 1.1min
[CV] END .................lr__C=1, tfidf__ngram_range=(1, 1); total time=  55.7s
[CV] END .................lr__C=1, tfidf__ngram_range=(1, 1); total time=  58.5s
[CV] END .................lr__C=1, tfidf__ngram_range=(1, 1); total time= 1.2min
[CV] END .................lr__C=1, tfidf__ngram_range=(1, 1); total time=  58.4s
[CV] END .................lr__C=1, tfidf__ngram_range=(1, 2); total time= 6.5min
[CV] END .................lr__C=1, tfidf__ngram_range=(1, 2); total time= 6.3min
[CV] END .................lr__C=1, tfidf__ngram_range=(1, 2); total time= 7.9min
[CV] END .................lr__C=1, tfidf__ngram_range=(1, 2); total time= 7.4min
[CV] END .................lr__C=1, tfidf__ngram_range=(1, 2); total time= 7.6min
[CV] END .................lr__C=5, tfidf__ngram_range=(1, 1); total time=  48.6s
[CV] END .................lr__C=5, tfidf__ngram_r

In [38]:
gs_lr.best_estimator_

Pipeline(steps=[('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
                ('lr', LogisticRegression(C=10, random_state=42))])

In [39]:
best_lr_test_score = gs_lr.score(test_corpus, test_label_names)
print('Test Accuracy :', best_lr_test_score)

Test Accuracy : 0.7705677867902665


We get an overall test accuracy of approximately 77%, which is an improvement of about 1.7 percentage points over the base logistic regression model (75.3%). Finally, let’s tune our top two SVM models: the regular Linear SVM model and the SVM with Stochastic Gradient Descent.

In [40]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

svm_pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                        ('svm', LinearSVC(random_state=42))
                       ])

param_grid = {'tfidf__ngram_range': [(1, 1), (1, 2)],
              'svm__C': [0.01, 0.1, 1, 5]
}

gs_svm = GridSearchCV(svm_pipeline, param_grid, cv=5, verbose=2)
gs_svm = gs_svm.fit(train_corpus, train_label_names)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV] END .............svm__C=0.01, tfidf__ngram_range=(1, 1); total time=   2.5s
[CV] END .............svm__C=0.01, tfidf__ngram_range=(1, 1); total time=   2.3s
[CV] END .............svm__C=0.01, tfidf__ngram_range=(1, 1); total time=   2.4s
[CV] END .............svm__C=0.01, tfidf__ngram_range=(1, 1); total time=   2.4s
[CV] END .............svm__C=0.01, tfidf__ngram_range=(1, 1); total time=   2.4s
[CV] END .............svm__C=0.01, tfidf__ngram_range=(1, 2); total time=   8.8s
[CV] END .............svm__C=0.01, tfidf__ngram_range=(1, 2); total time=   8.7s
[CV] END .............svm__C=0.01, tfidf__ngram_range=(1, 2); total time=   9.2s
[CV] END .............svm__C=0.01, tfidf__ngram_range=(1, 2); total time=   8.9s
[CV] END .............svm__C=0.01, tfidf__ngram_range=(1, 2); total time=   9.2s
[CV] END ..............svm__C=0.1, tfidf__ngram_range=(1, 1); total time=   2.4s
[CV] END ..............svm__C=0.1, tfidf__ngram_r

In [41]:
gs_svm.best_estimator_.get_params()

{'memory': None,
 'steps': [('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
  ('svm', LinearSVC(C=5, random_state=42))],
 'verbose': False,
 'tfidf': TfidfVectorizer(ngram_range=(1, 2)),
 'svm': LinearSVC(C=5, random_state=42),
 'tfidf__analyzer': 'word',
 'tfidf__binary': False,
 'tfidf__decode_error': 'strict',
 'tfidf__dtype': numpy.float64,
 'tfidf__encoding': 'utf-8',
 'tfidf__input': 'content',
 'tfidf__lowercase': True,
 'tfidf__max_df': 1.0,
 'tfidf__max_features': None,
 'tfidf__min_df': 1,
 'tfidf__ngram_range': (1, 2),
 'tfidf__norm': 'l2',
 'tfidf__preprocessor': None,
 'tfidf__smooth_idf': True,
 'tfidf__stop_words': None,
 'tfidf__strip_accents': None,
 'tfidf__sublinear_tf': False,
 'tfidf__token_pattern': '(?u)\\b\\w\\w+\\b',
 'tfidf__tokenizer': None,
 'tfidf__use_idf': True,
 'tfidf__vocabulary': None,
 'svm__C': 5,
 'svm__class_weight': None,
 'svm__dual': True,
 'svm__fit_intercept': True,
 'svm__intercept_scaling': 1,
 'svm__loss': 'squared_hinge',
 'svm__max_iter'

In [42]:
best_svm_test_score = gs_svm.score(test_corpus, test_label_names)
print('Test Accuracy :', best_svm_test_score)

Test Accuracy : 0.7867902665121669


In [43]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

sgd_pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                        ('sgd', SGDClassifier(random_state=42))
                       ])

param_grid = {'tfidf__ngram_range': [(1, 1), (1, 2)],
              'sgd__alpha': [1e-7, 1e-6, 1e-5, 1e-4]
}

gs_sgd = GridSearchCV(sgd_pipeline, param_grid, cv=5, verbose=2)
gs_sgd = gs_sgd.fit(train_corpus, train_label_names)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV] END ........sgd__alpha=1e-07, tfidf__ngram_range=(1, 1); total time=   2.5s
[CV] END ........sgd__alpha=1e-07, tfidf__ngram_range=(1, 1); total time=   2.4s
[CV] END ........sgd__alpha=1e-07, tfidf__ngram_range=(1, 1); total time=   2.6s
[CV] END ........sgd__alpha=1e-07, tfidf__ngram_range=(1, 1); total time=   2.4s
[CV] END ........sgd__alpha=1e-07, tfidf__ngram_range=(1, 1); total time=   2.5s
[CV] END ........sgd__alpha=1e-07, tfidf__ngram_range=(1, 2); total time=   9.1s
[CV] END ........sgd__alpha=1e-07, tfidf__ngram_range=(1, 2); total time=   9.0s
[CV] END ........sgd__alpha=1e-07, tfidf__ngram_range=(1, 2); total time=   9.2s
[CV] END ........sgd__alpha=1e-07, tfidf__ngram_range=(1, 2); total time=   9.1s
[CV] END ........sgd__alpha=1e-07, tfidf__ngram_range=(1, 2); total time=   9.3s
[CV] END ........sgd__alpha=1e-06, tfidf__ngram_range=(1, 1); total time=   2.4s
[CV] END ........sgd__alpha=1e-06, tfidf__ngram_r

In [44]:
gs_sgd.best_estimator_.get_params()

{'memory': None,
 'steps': [('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
  ('sgd', SGDClassifier(alpha=1e-05, random_state=42))],
 'verbose': False,
 'tfidf': TfidfVectorizer(ngram_range=(1, 2)),
 'sgd': SGDClassifier(alpha=1e-05, random_state=42),
 'tfidf__analyzer': 'word',
 'tfidf__binary': False,
 'tfidf__decode_error': 'strict',
 'tfidf__dtype': numpy.float64,
 'tfidf__encoding': 'utf-8',
 'tfidf__input': 'content',
 'tfidf__lowercase': True,
 'tfidf__max_df': 1.0,
 'tfidf__max_features': None,
 'tfidf__min_df': 1,
 'tfidf__ngram_range': (1, 2),
 'tfidf__norm': 'l2',
 'tfidf__preprocessor': None,
 'tfidf__smooth_idf': True,
 'tfidf__stop_words': None,
 'tfidf__strip_accents': None,
 'tfidf__sublinear_tf': False,
 'tfidf__token_pattern': '(?u)\\b\\w\\w+\\b',
 'tfidf__tokenizer': None,
 'tfidf__use_idf': True,
 'tfidf__vocabulary': None,
 'sgd__alpha': 1e-05,
 'sgd__average': False,
 'sgd__class_weight': None,
 'sgd__early_stopping': False,
 'sgd__epsilon': 0.1,
 'sgd__eta0': 0.0

In [45]:
best_sgd_test_score = gs_sgd.score(test_corpus, test_label_names)
print('Test Accuracy :', best_sgd_test_score)

Test Accuracy : 0.7841416983943056


The tuned Linear SVM gives the highest overall test accuracy we have obtained so far, about 78.7%! However, this is not a huge improvement over the default Linear SVM model (about 77.2%). The SVM with SGD gives us a tuned model accuracy of about 78.4%.
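
To compare the effect of tuning across models, the test scores printed above can be collected into one small table. The sketch below hard-codes the TF-IDF baseline scores and the tuned scores reported in this notebook (rounded to six decimals); the column names are chosen here for illustration.

```python
import pandas as pd

# Test accuracies copied from the outputs above.
scores = pd.DataFrame({
    'Model': ['Naive Bayes', 'Logistic Regression',
              'Linear SVM', 'Linear SVM (SGD)'],
    'Base Test Score (TF-IDF)': [0.713624, 0.753187, 0.771727, 0.765933],
    'Tuned Test Score': [0.780831, 0.770568, 0.786790, 0.784142],
})

# Gain expressed in percentage points (tuned minus base, times 100).
scores['Gain (pct points)'] = (
    (scores['Tuned Test Score'] - scores['Base Test Score (TF-IDF)']) * 100
).round(2)

ranked = scores.sort_values('Tuned Test Score', ascending=False)
print(ranked.to_string(index=False))
```

Naïve Bayes benefits the most from tuning, which brings it within about half a percentage point of the tuned Linear SVM.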

### Compute performance metrics 

Choosing the best model for deployment depends on a number of factors, like model speed, accuracy, ease of use, interpretability, and so on. Of all the models we have built, the Naïve Bayes model is the fastest to train. Even though the SVM model might be slightly better on the test dataset in terms of accuracy, SVMs are notoriously slow and often hard to scale. Let’s run a detailed performance evaluation of our best, tuned Naïve Bayes model on the test dataset.

In [46]:
mnb_predictions = gs_mnb.predict(test_corpus)
unique_classes = list(set(test_label_names))

In [47]:
import model_evaluation_utils as meu
meu.get_metrics(true_labels=test_label_names, predicted_labels=mnb_predictions)

Accuracy: 0.7808
Precision: 0.7884
Recall: 0.7808
F1 Score: 0.7769
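
Accuracy and recall are identical above (both 0.7808), which is what one would expect if the helper averages per-class metrics weighted by class support. That is an assumption about `meu.get_metrics`, whose source is not shown here. A minimal sketch of the equivalent Scikit-Learn calls on made-up labels:

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Made-up labels (illustrative only).
y_true = ['a', 'a', 'a', 'b', 'b', 'c']
y_pred = ['a', 'a', 'b', 'b', 'c', 'c']

acc = accuracy_score(y_true, y_pred)
# average='weighted' weights each class's score by its number of true instances.
prec = precision_score(y_true, y_pred, average='weighted')
rec = recall_score(y_true, y_pred, average='weighted')
f1 = f1_score(y_true, y_pred, average='weighted')

print('Accuracy:', round(acc, 4))
print('Precision:', round(prec, 4))
print('Recall:', round(rec, 4))
print('F1 Score:', round(f1, 4))
```

Support-weighted recall always equals accuracy, because weighting each class's recall by its share of the samples sums the correct predictions over all samples.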


It is good to see that the classification metrics are consistent with each other. Besides this holistic view of model performance, a more granular view of per-class performance often helps. Let’s take a look at that.

In [48]:
meu.display_classification_report(true_labels=test_label_names, 
                                  predicted_labels=mnb_predictions, classes=unique_classes)

                          precision    recall  f1-score   support

            misc.forsale       0.81      0.79      0.80       321
      talk.politics.misc       0.76      0.62      0.68       258
           comp.graphics       0.66      0.76      0.71       302
             alt.atheism       0.75      0.57      0.65       258
         rec.motorcycles       0.86      0.78      0.82       329
               sci.crypt       0.77      0.86      0.82       293
               sci.space       0.80      0.82      0.81       320
      talk.religion.misc       0.79      0.28      0.42       194
          comp.windows.x       0.84      0.83      0.84       336
 comp.os.ms-windows.misc       0.78      0.71      0.74       311
      rec.sport.baseball       0.94      0.87      0.90       314
   comp.sys.mac.hardware       0.81      0.75      0.78       309
comp.sys.ibm.pc.hardware       0.72      0.75      0.73       341
      talk.politics.guns       0.70      0.79      0.74       301
  soc.rel

This gives us a nice overview of the model performance for each newsgroup class. Interestingly, some categories related to religion and atheism have noticeably lower performance (for example, a recall of only 0.28 for `talk.religion.misc`). Could it be that the model is getting some of these mixed up? The confusion matrix is a great way to test this assumption. Let’s first look at the newsgroup name to number mappings.

In [49]:
label_data_map = {v:k for k, v in data_labels_map.items()}
label_map_df = pd.DataFrame(list(label_data_map.items()), columns=['Label Name', 'Label Number'])
label_map_df

Unnamed: 0,Label Name,Label Number
0,alt.atheism,0
1,comp.graphics,1
2,comp.os.ms-windows.misc,2
3,comp.sys.ibm.pc.hardware,3
4,comp.sys.mac.hardware,4
5,comp.windows.x,5
6,misc.forsale,6
7,rec.autos,7
8,rec.motorcycles,8
9,rec.sport.baseball,9


We could now build a confusion matrix to show the correct and misclassified instances of each class label, which we represent by numbers for display purposes, due to the long names. 

In [50]:
unique_class_nums = label_map_df['Label Number'].values
mnb_prediction_class_nums = [label_data_map[item] for item in mnb_predictions]

In [51]:
unique_classes = label_map_df['Label Name'].values
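
A minimal sketch of how such a confusion matrix could be built with Scikit-Learn's `confusion_matrix` is shown below. The labels are made up for illustration and only cover three classes; the variable names are not from the notebook.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels (illustrative only).
classes = ['alt.atheism', 'soc.religion.christian', 'talk.religion.misc']
true_names = (['alt.atheism'] * 3 + ['soc.religion.christian'] * 3
              + ['talk.religion.misc'] * 3)
pred_names = ['alt.atheism', 'alt.atheism', 'talk.religion.misc',
              'soc.religion.christian', 'soc.religion.christian',
              'soc.religion.christian',
              'soc.religion.christian', 'alt.atheism', 'talk.religion.misc']

# Rows are true classes, columns are predicted classes, in 'classes' order.
cm = confusion_matrix(true_names, pred_names, labels=classes)

# Use short numeric labels for display, as the long names are hard to read.
class_nums = list(range(len(classes)))
cm_df = pd.DataFrame(cm,
                     index=pd.Index(class_nums, name='Actual'),
                     columns=pd.Index(class_nums, name='Predicted'))
print(cm_df)
```

Off-diagonal cells count misclassifications: a large value in row i, column j means class i is often predicted as class j.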

Let’s take a closer look at class labels 0, 15, and 19 to see what their newsgroup names are.

In [52]:
label_map_df[label_map_df['Label Number'].isin([0, 15, 19])]

Unnamed: 0,Label Name,Label Number
0,alt.atheism,0
15,soc.religion.christian,15
19,talk.religion.misc,19


All the newsgroups pertaining to different aspects of religion have more misclassifications. Let's explore some specific instances.

In [53]:
train_idx, test_idx = train_test_split(np.array(range(len(data_df['Article']))), test_size=0.33, random_state=42)
test_idx

array([ 9877, 12655,  7048, ...,  7624,  1594,  6293])

Let's add two columns to the dataframe of our test dataset. The first column is the predicted label from our Naïve Bayes model, and the second column is the confidence of the model when making the prediction, which is the probability the model assigns to the predicted class.

In [54]:
predict_probas = gs_mnb.predict_proba(test_corpus).max(axis=1)
test_df = data_df.iloc[test_idx].copy()  # copy, so new columns are not assigned to a slice of data_df
test_df['Predicted Name'] = mnb_predictions
test_df['Predicted Confidence'] = predict_probas
test_df.head()

Unnamed: 0,Article,Clean Article,Target Label,Target Name,Predicted Name,Predicted Confidence
9877,"\nCENTERS\n[...]\n[...]\n\nSanderson will be on Team Canada, but he'd be out of position as a ce...",center sanderson team canada position center although draft center play rookie sanderson score 4...,10,rec.sport.hockey,rec.sport.hockey,0.997253
12655,"\nI do not think they can use the eavesdropping as evidence at all. However,\nusing the info the...",think use eavesdrop evidence however use info gather listen go search right place find good stro...,11,sci.crypt,sci.crypt,0.99995
7048,"\nThere are a few bills not yet in the archive, but these are the main ones\nwe need to fight. ...",bill yet archive main ones need fight thank david robinson scan many us subdirectory bill store ...,16,talk.politics.guns,sci.crypt,0.342186
15056,\n\nUnfortunately there a *LOT* of such software. I also find it to be\nthe case that the major...,unfortunately lot software also find case majority software bad regard commercial software way m...,3,comp.sys.ibm.pc.hardware,comp.os.ms-windows.misc,0.388951
17485,I have looked through the FAQ sections and have not\nseen a answer for this.\n\nI have an X/Moti...,look faq section see answer x motif application write couple gif file pict scan color scanner wo...,5,comp.windows.x,comp.graphics,0.885876
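
The confidence column is the largest class probability in each row of `predict_proba`. A minimal sketch on a made-up corpus (not the newsgroup data) shows that the position of that maximum corresponds to the predicted class.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus (illustrative only).
docs = ['the rocket reached orbit', 'the goalie made a save',
        'nasa launched a new rocket', 'the team won the hockey game']
labels = ['space', 'sport', 'space', 'sport']

pipe = Pipeline([('tfidf', TfidfVectorizer()),
                 ('mnb', MultinomialNB(alpha=0.1))])
pipe.fit(docs, labels)

probas = pipe.predict_proba(docs)      # shape: (n_docs, n_classes)
confidence = probas.max(axis=1)        # probability of the winning class
winners = pipe.classes_[probas.argmax(axis=1)]

print(winners)
print(np.round(confidence, 3))
```

A low maximum probability, like the 0.34 for article 7048 above, signals that the model spread its probability mass over several classes.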


Let's now take a look at some articles that were from the newsgroup `talk.religion.misc`, but our model predicted `soc.religion.christian` with the highest confidence.

In [55]:
pd.set_option('display.max_colwidth', 200)
res_df = (test_df[(test_df['Target Name'] == 'talk.religion.misc') & (test_df['Predicted Name'] == 'soc.religion.christian')]
       .sort_values(by=['Predicted Confidence'], ascending=False).head(5))
res_df

Unnamed: 0,Article,Clean Article,Target Label,Target Name,Predicted Name,Predicted Confidence
6237,"Brian Ceccarelli wrote (that's me):\n\n\nKent Sandvik responds:\n\n\nI think I see where you are coming from Kent. Jesus doesn't view\nguilt like our modern venacular colors it. \n\n""Feelings"" ...",brian ceccarelli write kent sandvik respond think see where come kent jesus view guilt like modern venacular color feel nothing guilt feel arise state guilty feel guilt mutally exclusive feel reac...,19,talk.religion.misc,soc.religion.christian,0.99971
13307,"THE DIVINE MASTERS \n \n Most Christians would agree, and correctly so, that \n Jesus Christ was a Divine Master, and a projection of God \n into the phy...",divine master christians would agree correctly jesus christ divine master projection god physical world god incarnate important relate facts christians completely ignorant followers world religion...,19,talk.religion.misc,soc.religion.christian,0.999601
17282,"The primary problem in human nature is a ""fragmentation of being.""\nHumans are in a state of tension, a tension of opposites. Good and\nevil are the most thought provoking polarities that come to ...",primary problem human nature fragmentation humans state tension tension opposites good evil think provoke polarities come mind bible provide us many examples fragmentation war opposites within us ...,19,talk.religion.misc,soc.religion.christian,0.999427
4367,:\n (lots of stuff about the Nicene Creed deleted which can be read in the\n original basenote. I will also leave it up to other LDS netters to\n take Mr. Weiss to task on using Mormon Doctrine...,lot stuff nicene creed delete which read original basenote also leave lds netters take mr weiss task use mormon doctrine declare difinitive word what lds church teach doctrine hopefully lds netter...,19,talk.religion.misc,soc.religion.christian,0.999261
18218,"\n\nJesus also recognized other holy days, like the Passover. Acts 15 says \nthat no more should be layed on the Gentiles than that which is necessary.\nThe sabbath is not in the list, nor do any...",jesus also recognize holy days like passover act 15 say lay gentiles which necessary sabbath list epistles instruct people keep 7th day christians live among people who keep 7th day look like woul...,19,talk.religion.misc,soc.religion.christian,0.998448


This gives you an idea of which instances might be getting misclassified and why. It looks like some aspects of Christianity are also mentioned in these articles, which leads the model to predict the `soc.religion.christian` category.

Let's now take a look at some articles that were from the newsgroup `talk.religion.misc`, but our model predicted `alt.atheism` with the highest confidence.

In [56]:
pd.set_option('display.max_colwidth', 200)
res_df = (test_df[(test_df['Target Name'] == 'talk.religion.misc') & (test_df['Predicted Name'] == 'alt.atheism')]
       .sort_values(by=['Predicted Confidence'], ascending=False).head(5))
res_df

Unnamed: 0,Article,Clean Article,Target Label,Target Name,Predicted Name,Predicted Confidence
2467,"Why is the NT tossed out as info on Jesus. I realize it is normally tossed\nout because it contains miracles, but what are the other reasons?\n\nMAC\n--\n*****************************************...",why nt toss info jesus realize normally toss contain miracles what reason mac michael cobb raise tax middle university illinois class pay program champaign urbana bill clinton 3rd debate cobb alex...,19,talk.religion.misc,alt.atheism,0.999882
5746,": In my mind, to say that science has its basis in values is a bit of a\n: reach. Science has its basis in observable fact. \n\nI'd say that what one chooses to observe and how the observation is\...",mind say science basis value bite reach science basis observable fact say what one choose observe how observation interpret what significance give depend great deal value observer science human ac...,19,talk.religion.misc,alt.atheism,0.992743
11119,"\n\n\tUnless God admits that he didn't do it....\n\n\t=)\n\n\n--- \n\n "" I'd Cheat on Hillary Too.""",unless god admit cheat hillary,19,talk.religion.misc,alt.atheism,0.893857
6876,\nI'm for the moment interested in this notion of the 'leap of faith'\nestablished by Kierkegaard. It clearly points out a possible solution\nto transcendental values. What I don't understand is t...,moment interest notion leap faith establish kierkegaard clearly point possible solution transcendental value what understand also clearly show existentialism system where leap transcendental direc...,19,talk.religion.misc,alt.atheism,0.855462
11825,\nI think that if a theist were truly objective and throws out the notion that\nGod definitely exists and starts from scratch to prove to themselves that\nthe scriptures are the whole truth then t...,think theist truly objective throw notion god definitely exist start scratch prove scriptures whole truth person would longer theist miss something people who convert non theism theism bring non t...,19,talk.religion.misc,alt.atheism,0.844723


This is not surprising, considering that atheism and religion are related in several aspects when people talk about them, especially on online forums. You should check if there are some other interesting patterns in the articles.
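
One way to look for other patterns is to count how often each (true class, predicted class) pair occurs among the errors. The sketch below uses a small made-up dataframe with the same `Target Name` and `Predicted Name` columns as `test_df`; the rows themselves are illustrative.

```python
import pandas as pd

# Made-up predictions (illustrative only).
toy_df = pd.DataFrame({
    'Target Name': ['talk.religion.misc', 'talk.religion.misc',
                    'talk.religion.misc', 'alt.atheism',
                    'soc.religion.christian'],
    'Predicted Name': ['soc.religion.christian', 'soc.religion.christian',
                       'alt.atheism', 'talk.religion.misc',
                       'soc.religion.christian'],
})

# Keep only the misclassified rows.
errors = toy_df[toy_df['Target Name'] != toy_df['Predicted Name']]

# Count each (true, predicted) pair and list the most frequent first.
pairs = (errors.groupby(['Target Name', 'Predicted Name'])
               .size()
               .sort_values(ascending=False))
print(pairs)
```

Applied to the real `test_df`, the same grouping would list the most frequently confused newsgroup pairs, which can then be inspected article by article as done above.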

### Exercise

Compute the performance metrics for the other machine learning models trained above. The helper functions in the `model_evaluation_utils.py` file can be reused for this.