This lab uses the [20 Newsgroups dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html?highlight=newsgroup#sklearn.datasets.fetch_20newsgroups),
which is available directly in Scikit-Learn. It comprises around 18,000 newsgroup posts spread across 20 newsgroup categories.

    

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
import processing as pp  # local module

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.model_selection import train_test_split

warnings.filterwarnings('ignore')
%matplotlib inline

In [2]:
data = fetch_20newsgroups(subset='all', shuffle=True,
                          remove=('headers', 'footers', 'quotes'))
data_labels_map = dict(enumerate(data.target_names))

In [3]:
corpus, target_labels = data.data, data.target
target_names = [data_labels_map[label] for label in data.target]
data_df = pd.DataFrame({'Article': corpus, 'Target Label': target_labels,
                        'Target Name': target_names})

print(data_df.shape)
data_df.head(10)


(18846, 3)


| | Article | Target Label | Target Name |
|---|---|---|---|
| 0 | \n\nI am sure some bashers of Pens fans are pr... | 10 | rec.sport.hockey |
| 1 | My brother is in the market for a high-perform... | 3 | comp.sys.ibm.pc.hardware |
| 2 | \n\n\n\n\tFinally you said what you dream abou... | 17 | talk.politics.mideast |
| 3 | \nThink!\n\nIt's the SCSI card doing the DMA t... | 3 | comp.sys.ibm.pc.hardware |
| 4 | 1) I have an old Jasmine drive which I cann... | 4 | comp.sys.mac.hardware |
| 5 | \n\nBack in high school I worked as a lab assi... | 12 | sci.electronics |
| 6 | \n\nAE is in Dallas...try 214/241-6060 or 214/... | 4 | comp.sys.mac.hardware |
| 7 | \n[stuff deleted]\n\nOk, here's the solution t... | 10 | rec.sport.hockey |
| 8 | \n\n\nYeah, it's the second one. And I believ... | 10 | rec.sport.hockey |
| 9 | \nIf a Christian means someone who believes in... | 19 | talk.religion.misc |


In [4]:
print(data_df.iloc[0])
print(data_df.iloc[0].Article)


Article         \n\nI am sure some bashers of Pens fans are pr...
Target Label                                                   10
Target Name                                      rec.sport.hockey
Name: 0, dtype: object


I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!




# Data Preprocessing and Normalization

In [5]:
# How many records are empty?
total_nulls = data_df[data_df.Article.str.strip() == ""].shape[0]

print("Total number of empty documents = {} out of {} documents".format(
            total_nulls, data_df.shape[0]))

Total number of empty documents = 515 out of 18846 documents


In [6]:
# Remove the empty records; copy so a new column can be added safely later
data_df = data_df[~(data_df.Article.str.strip() == "")].copy()
data_df.shape

(18331, 3)
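
As a small illustration of the filter above, the following sketch applies the same `str.strip() == ""` test to a made-up three-row frame. The toy data and the names `toy_df`, `is_empty`, and `kept` are assumptions for illustration, not part of the lab. It shows that whitespace-only articles are treated as empty too.

```python
import pandas as pd

# Toy frame (hypothetical data): one real post, one empty, one whitespace-only
toy_df = pd.DataFrame({'Article': ['A real post.', '', ' \n\t ']})

# Same test as in the lab: strip whitespace, then compare with the empty string
is_empty = toy_df.Article.str.strip() == ""

n_empty = int(is_empty.sum())   # how many documents count as empty
kept = toy_df[~is_empty]        # rows that survive the filter

print(n_empty)
print(kept.shape)
```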

------------------------------------

In [7]:

norm_corpus = pp.normalize_corpus(corpus=data_df['Article'],
                                  html_stripping=True,
                                  contraction_expansion=True,
                                  accented_char_removal=True,
                                  text_lower_case=True,
                                  text_lemmatization=False,
                                  special_char_removal=True,
                                  stopword_removal=True)
    
data_df['Clean Article'] = norm_corpus
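
The `processing` module is local to this lab and its source is not shown here. To give a rough idea of what some of the enabled flags might do, here is a minimal standard-library sketch. The function `simple_normalize` and the tiny `STOPWORDS` set are hypothetical stand-ins, not the actual `pp.normalize_corpus` implementation, and contraction expansion is left out entirely.

```python
import re
import unicodedata

# Tiny illustrative stopword list (assumption; a real list is much longer)
STOPWORDS = {'a', 'an', 'the', 'is', 'are', 'of', 'to', 'and', 'in', 'i', 'am'}

def simple_normalize(doc):
    """Hypothetical stand-in for a text normalizer; not the lab's pp module."""
    doc = re.sub(r'<[^>]+>', ' ', doc)                       # strip HTML-like tags
    doc = unicodedata.normalize('NFKD', doc)                 # split accents from letters
    doc = doc.encode('ascii', 'ignore').decode('ascii')      # drop the non-ASCII marks
    doc = doc.lower()                                        # lower-case the text
    doc = re.sub(r'[^a-z0-9\s]', ' ', doc)                   # remove special characters
    tokens = [t for t in doc.split() if t not in STOPWORDS]  # remove stopwords
    return ' '.join(tokens)

print(simple_normalize('<p>The Café is OPEN!!!</p>'))  # prints "cafe open"
```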

# Data Splitting

In [8]:
(train_corpus, test_corpus,
 train_label_nums, test_label_nums,
 train_label_names, test_label_names) = train_test_split(
    np.array(data_df['Clean Article']),
    np.array(data_df['Target Label']),
    np.array(data_df['Target Name']),
    stratify=data_df['Target Label'],
    test_size=0.20, random_state=42)
train_corpus.shape, test_corpus.shape

((14664,), (3667,))
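
The `stratify` argument used above keeps the class proportions the same in the train and test parts. A minimal sketch on made-up labels (the arrays `docs` and `labels` are hypothetical, not the lab's data) makes the effect visible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (hypothetical): 80 documents of class 0, 20 of class 1
labels = np.array([0] * 80 + [1] * 20)
docs = np.array(['doc {}'.format(i) for i in range(100)])

train_docs, test_docs, train_labels, test_labels = train_test_split(
    docs, labels, stratify=labels, test_size=0.20, random_state=42)

# Stratification keeps the 80/20 class ratio in both parts
print(np.bincount(train_labels))   # class counts in the training part
print(np.bincount(test_labels))    # class counts in the test part
```

Without `stratify`, a random split could leave a rare class under-represented in the test set, which makes per-class results less reliable.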

# Building and Evaluating Our Text Classifier

We will use the following workflow to build our text classifiers:
- Traditional feature representations (BOW, TF-IDF) combined with classification models


# Bag of Words Features with Classification Models

Let’s start with a basic Bag of Words model, a feature engineering model based on
term frequency, to extract features from our train and test datasets.

In [9]:
# build BOW features on train articles
cv = CountVectorizer(binary=False, min_df=0.0, max_df=1.0)
cv_train_features = cv.fit_transform(train_corpus)

# transform test articles into features
cv_test_features = cv.transform(test_corpus)

In [10]:
print('BOW model:> Train features shape:', cv_train_features.shape,' Test features shape:', cv_test_features.shape)

BOW model:> Train features shape: (14664, 125133)  Test features shape: (3667, 125133)
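
To see what these count features look like, here is a minimal sketch on a made-up corpus (the documents and the `toy_` names are assumptions for illustration). It also shows why the train and test matrices have the same number of columns: the vocabulary is learned from the training documents only, and unseen test words are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical), already lower-cased like the cleaned articles
toy_train = ['hockey game tonight', 'hockey hockey playoffs']
toy_test = ['playoffs game overtime']   # 'overtime' never appears in training

toy_cv = CountVectorizer(binary=False)
toy_train_features = toy_cv.fit_transform(toy_train)
toy_test_features = toy_cv.transform(toy_test)

# Terms ordered by their column index in the feature matrix
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)
print(toy_train_features.toarray())
print(toy_test_features.toarray())
```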


We now train two classifiers on these features, Multinomial Naive Bayes and a linear
Support Vector Machine, and measure their accuracy on both the training and test
datasets. Five-fold cross-validation is a useful additional check of whether a model
performs consistently across the validation folds of data (we use this same strategy
to tune the models later); the cells below report train and test accuracy only.

In [11]:
# Naïve Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB(alpha=1)
mnb.fit(cv_train_features, train_label_names)

mnb_bow_train_scores = mnb.score(cv_train_features, train_label_names)
print('Train Accuracy:', mnb_bow_train_scores)

mnb_bow_test_scores = mnb.score(cv_test_features, test_label_names)
print('Test Accuracy:', mnb_bow_test_scores)

Train Accuracy: 0.8434260774686306
Test Accuracy: 0.7022088901008999


In [12]:
# Support Vector Machines
from sklearn.svm import LinearSVC
svm = LinearSVC(penalty='l2', C=1, random_state=42)
svm.fit(cv_train_features, train_label_names)

svm_bow_train_scores = svm.score(cv_train_features, train_label_names)
print('Train Accuracy:', svm_bow_train_scores)

svm_bow_test_scores = svm.score(cv_test_features, test_label_names)
print('Test Accuracy:', svm_bow_test_scores)

Train Accuracy: 0.997545008183306
Test Accuracy: 0.6686664848650122
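
The five-fold cross-validation check mentioned earlier is not run in the cells above. One way it could be sketched is shown here on a made-up two-class corpus; the documents, labels, and the choice to wrap the vectorizer and classifier in a pipeline are assumptions for illustration, not the lab's own setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy two-class corpus (hypothetical): 10 short documents per class
toy_docs = (['hockey game goal number{}'.format(i) for i in range(10)] +
            ['disk drive scsi number{}'.format(i) for i in range(10)])
toy_labels = ['sport'] * 10 + ['hardware'] * 10

# Keeping the vectorizer inside the pipeline means each fold
# learns its vocabulary from that fold's training part only
pipeline = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1))

cv_scores = cross_val_score(pipeline, toy_docs, toy_labels, cv=5,
                            scoring='accuracy')
print(cv_scores)          # one accuracy value per fold
print(cv_scores.mean())   # average accuracy across the five folds
```

On the real data, the same call could be made with `train_corpus` and `train_label_names`; a large spread between fold scores would suggest the model is sensitive to which documents it sees.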


# TF-IDF Features with Classification Models

We now use TF-IDF features to train our classification models. Because TF-IDF
down-weights terms that occur in many documents, we might get better-performing
models. Let’s test our assumption!

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
# build TF-IDF features on train articles
tv = TfidfVectorizer(use_idf=True, min_df=0.0, max_df=1.0)
tv_train_features = tv.fit_transform(train_corpus)

# transform test articles into features
tv_test_features = tv.transform(test_corpus)
print('TFIDF model:> Train features shape:', tv_train_features.shape,' Test features shape:', tv_test_features.shape)

TFIDF model:> Train features shape: (14664, 125133)  Test features shape: (3667, 125133)
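
The down-weighting idea can be checked on a made-up corpus (the documents and `toy_` names are assumptions for illustration). With Scikit-Learn's default smoothing, the inverse document frequency of a term is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` the number of documents containing the term, so a term that appears everywhere gets the lowest weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): 'article' is in every document, 'scsi' in only one
toy_docs = ['article hockey', 'article scsi', 'article hockey playoffs']

toy_tv = TfidfVectorizer(use_idf=True)
toy_tv.fit(toy_docs)

# Map each term to its learned inverse document frequency
idf = {term: toy_tv.idf_[idx] for term, idx in toy_tv.vocabulary_.items()}

# Print terms from most common (lowest idf) to rarest (highest idf)
for term in sorted(idf, key=idf.get):
    print(term, round(idf[term], 3))
```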


We now train the same two classifiers, Multinomial Naive Bayes and a linear Support
Vector Machine, on the TF-IDF features and measure their accuracy on the training and
test datasets, just like we did earlier.

In [14]:
# Naïve Bayes
mnb = MultinomialNB(alpha=1)
mnb.fit(tv_train_features, train_label_names)

mnb_tfidf_train_scores = mnb.score(tv_train_features, train_label_names)
print('Train Accuracy:', mnb_tfidf_train_scores)

mnb_tfidf_test_scores = mnb.score(tv_test_features, test_label_names)
print('Test Accuracy:', mnb_tfidf_test_scores)

Train Accuracy: 0.8743180578286961
Test Accuracy: 0.7395691300790838


In [15]:
# Support Vector Machines
svm = LinearSVC(penalty='l2', C=1, random_state=42)
svm.fit(tv_train_features, train_label_names)

svm_tfidf_train_scores = svm.score(tv_train_features, train_label_names)
print('Train Accuracy:', svm_tfidf_train_scores)

svm_tfidf_test_scores = svm.score(tv_test_features, test_label_names)
print('Test Accuracy:', svm_tfidf_test_scores)

Train Accuracy: 0.9955673758865248
Test Accuracy: 0.7706572129806382


# Comparison of Model Performance

We can now compare the models we have tried so far across the two feature engineering
techniques. We build a dataframe from our modeling results; the columns labeled TF
hold the Bag of Words (term frequency) scores.

In [16]:
pd.DataFrame([['Naive Bayes', mnb_bow_train_scores, mnb_bow_test_scores,
               mnb_tfidf_train_scores, mnb_tfidf_test_scores],
              ['Linear SVM', svm_bow_train_scores, svm_bow_test_scores,
               svm_tfidf_train_scores, svm_tfidf_test_scores]
              ],
             columns=['Model', 'Train Score (TF)', 'Test Score (TF)',
                      'Train Score (TF-IDF)', 'Test Score (TF-IDF)']).T

| | 0 | 1 |
|---|---|---|
| Model | Naive Bayes | Linear SVM |
| Train Score (TF) | 0.843426 | 0.997545 |
| Test Score (TF) | 0.702209 | 0.668666 |
| Train Score (TF-IDF) | 0.874318 | 0.995567 |
| Test Score (TF-IDF) | 0.739569 | 0.770657 |
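
One way to read this table is to look at the gap between train and test accuracy for each model. The sketch below recomputes that gap from the rounded scores reported above; the frame layout and the names `results` and `best_model` are assumptions for illustration.

```python
import pandas as pd

# Accuracies copied from the comparison table above (rounded to six decimals)
results = pd.DataFrame(
    {'Train': [0.843426, 0.997545, 0.874318, 0.995567],
     'Test': [0.702209, 0.668666, 0.739569, 0.770657]},
    index=['Naive Bayes (TF)', 'Linear SVM (TF)',
           'Naive Bayes (TF-IDF)', 'Linear SVM (TF-IDF)'])

# Gap between train and test accuracy; a large gap suggests overfitting
results['Gap'] = results['Train'] - results['Test']

best_model = results['Test'].idxmax()
print(results.sort_values('Test', ascending=False))
print(best_model)
```

On these numbers, the linear SVM with TF-IDF features has the highest test accuracy, while the linear SVM on raw counts shows the largest train-test gap.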
