# Text Classification - 20 Newsgroups Dataset (Notebook 2/2)

This lab uses the 20 Newsgroups dataset directly available in Scikit-Learn. It comprises around 18,000 newsgroups posts spread across 20 different news classes.

##### I, Samer Haidar, affirm that I completed this assignment on my own without receiving or giving any help.

## Importing Libraries

In [2]:
import processing as pp

import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import pickle

import nltk

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

from sklearn.datasets import fetch_20newsgroups

from sklearn.decomposition import TruncatedSVD

from gensim.models import word2vec

from sklearn.model_selection import RandomizedSearchCV

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import ComplementNB
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

## Loading Scores

In [2]:
voting_scores = pd.read_pickle("voting_scores.pkl")

In [3]:
voting_scores

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
Model,Voting 1,Voting 2,Voting 3,Voting 4,Voting 5,Voting 6,Voting 7,Voting 8,Voting 9,Voting 10,Voting 11,Voting 12
Train Score (TF),0.973336,0.973609,0.965357,0.973677,0.974564,0.973472,0.967744,0.967676,0.967608,0.975041,0.96713,0.972245
Test Score (TF),0.769839,0.771203,0.770112,0.767112,0.776929,0.769839,0.776111,0.775839,0.774475,0.775839,0.774475,0.773384


## Loading the Data

In [3]:
data = fetch_20newsgroups(subset='all', shuffle=True, remove=('headers', 'footers', 'quotes'))
data_labels_map = dict(enumerate(data.target_names))

In [5]:
data.target

array([10,  3, 17, ...,  3,  1,  7])

In [5]:
corpus, target_labels, target_names = (data.data, data.target, [data_labels_map[label] for label in data.target])
data_df = pd.DataFrame({'Article': corpus, 'Target Label': target_labels, 'Target Name': target_names})

print(data_df.shape)
data_df.head(10)

(18846, 3)


Unnamed: 0,Article,Target Label,Target Name
0,\n\nI am sure some bashers of Pens fans are pr...,10,rec.sport.hockey
1,My brother is in the market for a high-perform...,3,comp.sys.ibm.pc.hardware
2,\n\n\n\n\tFinally you said what you dream abou...,17,talk.politics.mideast
3,\nThink!\n\nIt's the SCSI card doing the DMA t...,3,comp.sys.ibm.pc.hardware
4,1) I have an old Jasmine drive which I cann...,4,comp.sys.mac.hardware
5,\n\nBack in high school I worked as a lab assi...,12,sci.electronics
6,\n\nAE is in Dallas...try 214/241-6060 or 214/...,4,comp.sys.mac.hardware
7,"\n[stuff deleted]\n\nOk, here's the solution t...",10,rec.sport.hockey
8,"\n\n\nYeah, it's the second one. And I believ...",10,rec.sport.hockey
9,\nIf a Christian means someone who believes in...,19,talk.religion.misc


In [6]:
data_df = data_df[~(data_df.Article.str.strip() == "")]
data_df.shape

(18331, 3)

## Preprocessing (Best Combination)

In [7]:
norm_corpus = pp.normalize_corpus(corpus=data_df['Article'], html_stripping=True, contraction_expansion=True,
                     text_lower_case=True, text_lemmatization=True, stopword_removal=True, accented_char_removal=True, 
                     special_char_removal=True)
    
data_df['Clean Article'] = norm_corpus

In [10]:
len(norm_corpus)

18331

## Splitting Data

In [8]:
train_corpus, test_corpus, train_label_nums, test_label_nums, train_label_names, test_label_names = train_test_split(np.array(data_df['Clean Article']),
                                                                                                                         np.array(data_df['Target Label']),
                                                                                                                         np.array(data_df['Target Name']),
                                                                                                                         stratify=data_df['Target Label'],
                                                                                                                         test_size=0.20, random_state=42)

train_corpus, val_corpus, train_label_nums, val_label_nums, train_label_names, val_label_names = train_test_split(train_corpus,
                                                                                                                          train_label_nums,
                                                                                                                          train_label_names,
                                                                                                                          stratify=train_label_nums,
                                                                                                                          test_size=0.30, random_state=42)

train_corpus.shape, val_corpus.shape, test_corpus.shape

((10264,), (4400,), (3667,))

# Feature Extraction

## 1.1 Bag of Words Features with Classification Models (Unigrams)

In [9]:
cv = CountVectorizer(binary=False, min_df=0.0, max_df=1.0)
cv_train_features = cv.fit_transform(train_corpus)

# transform test articles into features
cv_val_features = cv.transform(val_corpus)

In [10]:
print('BOW model:> Train features shape:', cv_train_features.shape,' Valid features shape:', cv_val_features.shape)

BOW model:> Train features shape: (10264, 78525)  Valid features shape: (4400, 78525)


## Models

#### Multinomial Naive Bayes Classifier

In [11]:
mnb = MultinomialNB(alpha=1)
mnb.fit(cv_train_features, train_label_names)

mnb_bow_train_scores = mnb.score(cv_train_features, train_label_names)
print('Train Accuracy:', mnb_bow_train_scores)

mnb_bow_valid_scores = mnb.score(cv_val_features, val_label_names)
print('Valid Accuracy:', mnb_bow_valid_scores)

Train Accuracy: 0.8068978955572876
Valid Accuracy: 0.6827272727272727


#### Logistic Regression

In [12]:
log = LogisticRegression()
log.fit(cv_train_features, train_label_names)

log_bow_train_scores = log.score(cv_train_features, train_label_names)
print('Train Accuracy:', log_bow_train_scores)

log_bow_valid_scores = log.score(cv_val_features, val_label_names)
print('Valid Accuracy:', log_bow_valid_scores)

Train Accuracy: 0.9943491816056118
Valid Accuracy: 0.6843181818181818


#### Support Vector Machine (SVM)

In [13]:
svm = LinearSVC(penalty='l2', C=1, random_state=42)
svm.fit(cv_train_features, train_label_names)

svm_bow_train_scores = svm.score(cv_train_features, train_label_names)
print('Train Accuracy:', svm_bow_train_scores)

svm_bow_valid_scores = svm.score(cv_val_features, val_label_names)
print('Valid Accuracy:', svm_bow_valid_scores)

Train Accuracy: 0.9974668745128605
Valid Accuracy: 0.6418181818181818


#### Stochastic Gradient Descent Classifier

In [14]:
sgd = SGDClassifier()
sgd.fit(cv_train_features, train_label_names)

sgd_bow_train_scores = sgd.score(cv_train_features, train_label_names)
print('Train Accuracy:', sgd_bow_train_scores)

sgd_bow_valid_scores = sgd.score(cv_val_features, val_label_names)
print('Valid Accuracy:', sgd_bow_valid_scores)

Train Accuracy: 0.9711613406079501
Valid Accuracy: 0.6377272727272727


#### Random Forest Classifier

In [15]:
rfc = RandomForestClassifier()
rfc.fit(cv_train_features, train_label_names)

rfc_bow_train_scores = rfc.score(cv_train_features, train_label_names)
print('Train Accuracy:', rfc_bow_train_scores)

rfc_bow_valid_scores = rfc.score(cv_val_features, val_label_names)
print('Valid Accuracy:', rfc_bow_valid_scores)

Train Accuracy: 0.9982462977396727
Valid Accuracy: 0.6636363636363637


#### Voting Classifier

In [16]:
voting = VotingClassifier(estimators=[('mnb', mnb), ('log', log), ('svm', svm), ('sgd', sgd), ('rfc', rfc)], voting='hard')
voting_fitted = voting.fit(cv_train_features, train_label_names)

voting_bow_train_scores = voting.score(cv_train_features, train_label_names)
print('Train Accuracy:', voting_bow_train_scores)

voting_bow_val_scores = voting.score(cv_val_features, val_label_names)
print('Valid Accuracy:', voting_bow_val_scores)

Train Accuracy: 0.9970771628994544
Valid Accuracy: 0.7127272727272728


## 1.2 Bag of Words Features with Classification Models (Unigrams, Bigrams)

In [17]:
cv_bi = CountVectorizer(ngram_range=(1,2), binary=False, min_df=0.0, max_df=1.0)
cv_bi_train_features = cv_bi.fit_transform(train_corpus)

# transform test articles into features
cv_bi_val_features = cv_bi.transform(val_corpus)

In [18]:
print('BOW model:> Train features shape:', cv_bi_train_features.shape,' Valid features shape:', cv_bi_val_features.shape)

BOW model:> Train features shape: (10264, 782403)  Valid features shape: (4400, 782403)


## Models

#### Multinomial Naive Bayes Classifier

In [19]:
mnb = MultinomialNB(alpha=1)
mnb.fit(cv_bi_train_features, train_label_names)

mnb_bi_bow_train_scores = mnb.score(cv_bi_train_features, train_label_names)
print('Train Accuracy:', mnb_bi_bow_train_scores)

mnb_bi_bow_valid_scores = mnb.score(cv_bi_val_features, val_label_names)
print('Valid Accuracy:', mnb_bi_bow_valid_scores)

Train Accuracy: 0.9623928293063133
Valid Accuracy: 0.6690909090909091


#### Logistic Regression

In [20]:
log = LogisticRegression()
log.fit(cv_bi_train_features, train_label_names)

log_bi_bow_train_scores = log.score(cv_bi_train_features, train_label_names)
print('Train Accuracy:', log_bi_bow_train_scores)

log_bi_bow_valid_scores = log.score(cv_bi_val_features, val_label_names)
print('Valid Accuracy:', log_bi_bow_valid_scores)

Train Accuracy: 0.9967848791893998
Valid Accuracy: 0.6929545454545455


#### Support Vector Machine (SVM)

In [21]:
svm = LinearSVC(penalty='l2', C=1, random_state=42)
svm.fit(cv_bi_train_features, train_label_names)

svm_bi_bow_train_scores = svm.score(cv_bi_train_features, train_label_names)
print('Train Accuracy:', svm_bi_bow_train_scores)

svm_bi_bow_valid_scores = svm.score(cv_bi_val_features, val_label_names)
print('Valid Accuracy:', svm_bi_bow_valid_scores)

Train Accuracy: 0.9982462977396727
Valid Accuracy: 0.6695454545454546


#### Stochastic Gradient Descent Classifier

In [22]:
sgd = SGDClassifier()
sgd.fit(cv_bi_train_features, train_label_names)

sgd_bi_bow_train_scores = sgd.score(cv_bi_train_features, train_label_names)
print('Train Accuracy:', sgd_bi_bow_train_scores)

sgd_bi_bow_valid_scores = sgd.score(cv_bi_val_features, val_label_names)
print('Valid Accuracy:', sgd_bi_bow_valid_scores)

Train Accuracy: 0.9916212003117693
Valid Accuracy: 0.6584090909090909


#### Random Forest Classifier

In [23]:
rfc = RandomForestClassifier()
rfc.fit(cv_bi_train_features, train_label_names)

rfc_bi_bow_train_scores = rfc.score(cv_bi_train_features, train_label_names)
print('Train Accuracy:', rfc_bi_bow_train_scores)

rfc_bi_bow_valid_scores = rfc.score(cv_bi_val_features, val_label_names)
print('Valid Accuracy:', rfc_bi_bow_valid_scores)

Train Accuracy: 0.9982462977396727
Valid Accuracy: 0.6586363636363637


#### Voting Classifier

In [24]:
voting = VotingClassifier(estimators=[('mnb', mnb), ('log', log), ('svm', svm), ('sgd', sgd), ('rfc', rfc)], voting='hard')
voting_fitted = voting.fit(cv_bi_train_features, train_label_names)

voting_bi_bow_train_scores = voting.score(cv_bi_train_features, train_label_names)
print('Train Accuracy:', voting_bi_bow_train_scores)

voting_bi_bow_val_scores = voting.score(cv_bi_val_features, val_label_names)
print('Valid Accuracy:', voting_bi_bow_val_scores)

Train Accuracy: 0.997564302416212
Valid Accuracy: 0.7206818181818182


## 2.1 TF-IDF Features with Classification Models (Unigrams)

In [25]:
tv = TfidfVectorizer(use_idf=True, min_df=0.0, max_df=1.0)
tv_train_features = tv.fit_transform(train_corpus)

# transform test articles into features
tv_valid_features = tv.transform(val_corpus)
print('TFIDF model:> Train features shape:', tv_train_features.shape,' Valid features shape:', tv_valid_features.shape)

TFIDF model:> Train features shape: (10264, 78525)  Valid features shape: (4400, 78525)


## Models

#### Multinomial Naive Bayes Classifier

In [26]:
mnb = MultinomialNB(alpha=1)
mnb.fit(tv_train_features, train_label_names)

mnb_tfidf_train_scores = mnb.score(tv_train_features, train_label_names)
print('Train Accuracy:', mnb_tfidf_train_scores)

mnb_tfidf_valid_scores = mnb.score(tv_valid_features, val_label_names)
print('Valid Accuracy:', mnb_tfidf_valid_scores)

Train Accuracy: 0.8685697583787997
Valid Accuracy: 0.7281818181818182


#### Logistic Regression

In [27]:
log = LogisticRegression()
log.fit(tv_train_features, train_label_names)

log_tfidf_train_scores = log.score(tv_train_features, train_label_names)
print('Train Accuracy:', log_tfidf_train_scores)

log_tfidf_valid_scores = log.score(tv_valid_features, val_label_names)
print('Valid Accuracy:', log_tfidf_valid_scores)

Train Accuracy: 0.92371395167576
Valid Accuracy: 0.7481818181818182


#### Support Vector Machine (SVM)

In [28]:
svm = LinearSVC(penalty='l2', C=1, random_state=42)
svm.fit(tv_train_features, train_label_names)

svm_tfidf_train_scores = svm.score(tv_train_features, train_label_names)
print('Train Accuracy:', svm_tfidf_train_scores)

svm_tfidf_valid_scores = svm.score(tv_valid_features, val_label_names)
print('Valid Accuracy:', svm_tfidf_valid_scores)

Train Accuracy: 0.9952260327357755
Valid Accuracy: 0.7654545454545455


#### Stochastic Gradient Descent Classifier

In [29]:
sgd = SGDClassifier()
sgd.fit(tv_train_features, train_label_names)

sgd_tfidf_train_scores = sgd.score(tv_train_features, train_label_names)
print('Train Accuracy:', sgd_tfidf_train_scores)

sgd_tfidf_valid_scores = sgd.score(tv_valid_features, val_label_names)
print('Valid Accuracy:', sgd_tfidf_valid_scores)

Train Accuracy: 0.9761301636788776
Valid Accuracy: 0.7670454545454546


#### Random Forest Classifier

In [30]:
rfc = RandomForestClassifier()
rfc.fit(tv_train_features, train_label_names)

rfc_tfidf_train_scores = rfc.score(tv_train_features, train_label_names)
print('Train Accuracy:', rfc_tfidf_train_scores)

rfc_tfidf_valid_scores = rfc.score(tv_valid_features, val_label_names)
print('Valid Accuracy:', rfc_tfidf_valid_scores)

Train Accuracy: 0.9982462977396727
Valid Accuracy: 0.6740909090909091


#### Voting Classifier

In [31]:
voting = VotingClassifier(estimators=[('mnb', mnb), ('log', log), ('svm', svm), ('sgd', sgd), ('rfc', rfc)], voting='hard')
voting_fitted = voting.fit(tv_train_features, train_label_names)

voting_tfidf_train_scores = voting.score(tv_train_features, train_label_names)
print('Train Accuracy:', voting_tfidf_train_scores)

voting_tfidf_val_scores = voting.score(cv_val_features, val_label_names)
print('Valid Accuracy:', voting_tfidf_val_scores)

Train Accuracy: 0.9796375681995323
Valid Accuracy: 0.7361363636363636


## 2.2 TF-IDF Features with Classification Models (Unigrams and Bigrams)

In [32]:
tv_bi = TfidfVectorizer(ngram_range=(1,2), use_idf=True, min_df=0.0, max_df=1.0)
tv_bi_train_features = tv_bi.fit_transform(train_corpus)

# transform test articles into features
tv_bi_valid_features = tv_bi.transform(val_corpus)
print('TFIDF model:> Train features shape:', tv_bi_train_features.shape,' Valid features shape:', tv_bi_valid_features.shape)

TFIDF model:> Train features shape: (10264, 782403)  Valid features shape: (4400, 782403)


## Models

#### Multinomial Naive Bayes Classifier

In [33]:
mnb = MultinomialNB(alpha=1)
mnb.fit(tv_bi_train_features, train_label_names)

mnb_bi_tfidf_train_scores = mnb.score(tv_bi_train_features, train_label_names)
print('Train Accuracy:', mnb_bi_tfidf_train_scores)

mnb_bi_tfidf_valid_scores = mnb.score(tv_bi_valid_features, val_label_names)
print('Valid Accuracy:', mnb_bi_tfidf_valid_scores)

Train Accuracy: 0.953526890101325
Valid Accuracy: 0.72


#### Logistic Regression

In [34]:
log = LogisticRegression()
log.fit(tv_bi_train_features, train_label_names)

log_bi_tfidf_train_scores = log.score(tv_bi_train_features, train_label_names)
print('Train Accuracy:', log_bi_tfidf_train_scores)

log_bi_tfidf_valid_scores = log.score(tv_bi_valid_features, val_label_names)
print('Valid Accuracy:', log_bi_tfidf_valid_scores)

Train Accuracy: 0.9674590802805924
Valid Accuracy: 0.7481818181818182


#### Support Vector Machine (SVM)

In [35]:
svm = LinearSVC(penalty='l2', C=1, random_state=42)
svm.fit(tv_bi_train_features, train_label_names)

svm_bi_tfidf_train_scores = svm.score(tv_bi_train_features, train_label_names)
print('Train Accuracy:', svm_bi_tfidf_train_scores)

svm_bi_tfidf_valid_scores = svm.score(tv_bi_valid_features, val_label_names)
print('Valid Accuracy:', svm_bi_tfidf_valid_scores)

Train Accuracy: 0.9974668745128605
Valid Accuracy: 0.7793181818181818


#### Stochastic Gradient Descent Classifier

In [36]:
sgd = SGDClassifier()
sgd.fit(tv_bi_train_features, train_label_names)

sgd_bi_tfidf_train_scores = sgd.score(tv_bi_train_features, train_label_names)
print('Train Accuracy:', sgd_bi_tfidf_train_scores)

sgd_bi_tfidf_valid_scores = sgd.score(tv_bi_valid_features, val_label_names)
print('Valid Accuracy:', sgd_bi_tfidf_valid_scores)

Train Accuracy: 0.9942517537022604
Valid Accuracy: 0.7797727272727273


#### Random Forest Classifier

In [37]:
rfc = RandomForestClassifier()
rfc.fit(tv_bi_train_features, train_label_names)

rfc_bi_tfidf_train_scores = rfc.score(tv_bi_train_features, train_label_names)
print('Train Accuracy:', rfc_bi_tfidf_train_scores)

rfc_bi_tfidf_valid_scores = rfc.score(tv_bi_valid_features, val_label_names)
print('Valid Accuracy:', rfc_bi_tfidf_valid_scores)

Train Accuracy: 0.9982462977396727
Valid Accuracy: 0.6704545454545454


#### Voting Classifier

In [38]:
voting = VotingClassifier(estimators=[('mnb', mnb), ('log', log), ('svm', svm), ('sgd', sgd), ('rfc', rfc)], voting='hard')
voting_fitted = voting.fit(tv_bi_train_features, train_label_names)

voting_bi_tfidf_train_scores = voting.score(tv_bi_train_features, train_label_names)
print('Train Accuracy:', voting_bi_tfidf_train_scores)

voting_bi_tfidf_val_scores = voting.score(tv_bi_valid_features, val_label_names)
print('Valid Accuracy:', voting_bi_tfidf_val_scores)

Train Accuracy: 0.9943491816056118
Valid Accuracy: 0.7718181818181818


## 3. BOW + TF-IDF Transformer  with Classification Models (Unigrams and Bigrams)

#### Voting Classifier

In [39]:
mnb = MultinomialNB(alpha=1)
log = LogisticRegression()
svm = LinearSVC(penalty='l2', C=1)
sgd = SGDClassifier()
rfc = RandomForestClassifier()

voting_bow_tfidf = Pipeline([('vect', CountVectorizer(ngram_range=(1,2), binary=False, min_df=0.0, max_df=1.0)),
                     ('tfidf', TfidfTransformer()),
                     ('voting', VotingClassifier(estimators=[('mnb', mnb), ('log', log), ('svm', svm), ('sgd', sgd), ('rfc', rfc)], voting='hard')),])

voting_bow_tfidf.fit_transform(train_corpus, train_label_names)

voting_bow_tfidf_train_score = voting_bow_tfidf.score(train_corpus, train_label_names)
print('Train Accuracy:', voting_bow_tfidf_train_score)

voting_bow_tfidf_valid_score = voting_bow_tfidf.score(val_corpus, val_label_names)
print('Valid Accuracy:', voting_bow_tfidf_valid_score)

Train Accuracy: 0.9944466095089634
Valid Accuracy: 0.7709090909090909


## 4. Word2Vec (Word Embedding) with Classification Models using GLOVE

The following file should be dowloaded for the Word2Vec code to work: 
https://www.kaggle.com/danielwillgeorge/glove6b100dtxt?select=glove.6B.100d.txt

In [40]:
def load_glove_data():
    # read the word to vec file
    GLOVE_6B_100D_PATH = "glove.6B.100d.txt"
    dim = 100
    glove_small = {}
    with open(GLOVE_6B_100D_PATH, "rb") as infile:
        for line in infile:
            parts = line.split()
            try:
                word = parts[0].decode("utf-8")
                x = []
                for i in range(len(parts)-1):
                    x.append(float(parts[i+1].decode("utf-8")))
                glove_small[word] = x
            except: 
                print('')
    return glove_small

In [41]:
def word2vec_transform(dataset, word2vec, dim):
    trans_data = []
    for doc in dataset:
        words = doc.lower().split()
        w_length = 1
        data = np.zeros(dim)
        for i in range(len(words)):
            if words[i] in word2vec and words[i] not in stop_words:
                data = data + word2vec[words[i]]
                w_length = w_length + 1
        data = data / float(w_length)
        trans_data.append(data)
    return trans_data

In [42]:
word2vec = load_glove_data()

train_docs = word2vec_transform(train_corpus, word2vec, 100)
valid_docs = word2vec_transform(val_corpus, word2vec, 100)

## Models

#### Logistic Regression

In [43]:
log = LogisticRegression()
log.fit(train_docs, train_label_names)

log_word_train_scores = log.score(train_docs, train_label_names)
print('Train Accuracy:', log_word_train_scores)

log_word_valid_scores = log.score(valid_docs, val_label_names)
print('Valid Accuracy:', log_word_valid_scores)

Train Accuracy: 0.6515003897116134
Valid Accuracy: 0.6013636363636363


#### Support Vector Machine (SVM)

In [44]:
svm = LinearSVC(penalty='l2', C=1, random_state=42)
svm.fit(train_docs, train_label_names)

svm_word_train_scores = svm.score(train_docs, train_label_names)
print('Train Accuracy:', svm_word_train_scores)

svm_word_valid_scores = svm.score(valid_docs, val_label_names)
print('Valid Accuracy:', svm_word_valid_scores)

Train Accuracy: 0.6403936087295401
Valid Accuracy: 0.5984090909090909


#### Stochastic Gradient Descent Classifier

In [45]:
sgd = SGDClassifier()
sgd.fit(train_docs, train_label_names)

sgd_word_train_scores = sgd.score(train_docs, train_label_names)
print('Train Accuracy:', sgd_word_train_scores)

sgd_word_valid_scores = sgd.score(valid_docs, val_label_names)
print('Valid Accuracy:', sgd_word_valid_scores)

Train Accuracy: 0.565958690568979
Valid Accuracy: 0.5295454545454545


#### Random Forest Classifier

In [46]:
rfc = RandomForestClassifier()
rfc.fit(train_docs, train_label_names)

rfc_word_train_scores = rfc.score(train_docs, train_label_names)
print('Train Accuracy:', rfc_word_train_scores)

rfc_word_valid_scores = rfc.score(valid_docs, val_label_names)
print('Valid Accuracy:', rfc_word_valid_scores)

Train Accuracy: 0.9983437256430242
Valid Accuracy: 0.5436363636363636


#### Voting Classifier

In [47]:
voting = VotingClassifier(estimators=[('log', log), ('svm', svm), ('sgd', sgd), ('rfc', rfc)], voting='hard')
voting_fitted = voting.fit(train_docs, train_label_names)

voting_word_train_scores = voting.score(train_docs, train_label_names)
print('Train Accuracy:', voting_word_train_scores)

voting_word_val_scores = voting.score(valid_docs, val_label_names)
print('Valid Accuracy:', voting_word_val_scores)

Train Accuracy: 0.6844310210444271
Valid Accuracy: 0.6018181818181818


In [48]:
fe_scores = pd.DataFrame([['BOW (Unigram)', mnb_bow_valid_scores, log_bow_valid_scores, svm_bow_valid_scores, sgd_bow_valid_scores, rfc_bow_valid_scores, voting_bow_val_scores],
          ['BOW (Uni-Bigram)', mnb_bi_bow_valid_scores, log_bi_bow_valid_scores, svm_bi_bow_valid_scores, sgd_bi_bow_valid_scores, rfc_bi_bow_valid_scores, voting_bi_bow_val_scores],
          ['TF-IDF (Unigram)', mnb_tfidf_valid_scores, log_tfidf_valid_scores, svm_tfidf_valid_scores, sgd_tfidf_valid_scores, rfc_tfidf_valid_scores, voting_tfidf_val_scores],
          ['TF-IDF (Uni-Bigram)', mnb_bi_tfidf_valid_scores, log_bi_tfidf_valid_scores, svm_bi_tfidf_valid_scores, sgd_bi_tfidf_valid_scores, rfc_bi_tfidf_valid_scores, voting_bi_tfidf_val_scores],
          ['BOW + TF-IDF', '', '', '', '', '', voting_bow_tfidf_valid_score],
          ['Word2Vec-GLOVE', '', log_word_valid_scores, svm_word_valid_scores, sgd_word_valid_scores, rfc_word_valid_scores, voting_word_val_scores]],
          columns = ['Model', 'MNB', 'LOG', 'SVM', 'SGD', 'RFC', 'Voting'])

fe_scores.to_pickle("fe_scores.pkl")

fe_scores

Unnamed: 0,Model,MNB,LOG,SVM,SGD,RFC,Voting
0,BOW (Unigram),0.682727,0.684318,0.641818,0.637727,0.663636,0.712727
1,BOW (Uni-Bigram),0.669091,0.692955,0.669545,0.658409,0.658636,0.720682
2,TF-IDF (Unigram),0.728182,0.748182,0.765455,0.767045,0.674091,0.736136
3,TF-IDF (Uni-Bigram),0.72,0.748182,0.779318,0.779773,0.670455,0.771818
4,BOW + TF-IDF,,,,,,0.770909
5,Word2Vec-GLOVE,,0.601364,0.598409,0.529545,0.543636,0.601818


##### TFIDF (Unigram and Bigram) seems to have the best accuracy scores

# Hyperparameter Tuning

##### For a better flow of work, i will recreate the models

In [36]:
tv = TfidfVectorizer(ngram_range=(1,2), use_idf=True, min_df=0.0, max_df=1.0)

tv_train_features = tv.fit_transform(train_corpus)

tv_valid_features = tv.transform(val_corpus)

tv_test_features = tv.transform(test_corpus)

print('TFIDF model:> Train features shape:', tv_train_features.shape,' Valid features shape:', tv_valid_features.shape, 'Test features shape:', tv_test_features.shape)

TFIDF model:> Train features shape: (10264, 782403)  Valid features shape: (4400, 782403) Test features shape: (3667, 782403)


##### 'refit = True' saves the best model to the GridSearch object

#### Multinomial Naive Bayes Classifier

In [61]:
param_grid_mnb = {'alpha': np.linspace(0.01,0.1,50)} 

random_mnb_class = RandomizedSearchCV(
    estimator = MultinomialNB(),
    param_distributions = param_grid_mnb,
    n_iter = 50,
    scoring='accuracy', n_jobs=4, cv = 5, refit=True, return_train_score = True)

random_mnb_class.fit(tv_train_features, train_label_names)

df_mnb = pd.concat([pd.DataFrame(random_mnb_class.cv_results_["params"]),pd.DataFrame(random_mnb_class.cv_results_["mean_test_score"], columns=["Accuracy"])],axis=1)
df_mnb

Unnamed: 0,alpha,Accuracy
0,0.01,0.772701
1,0.011837,0.773967
2,0.013673,0.772798
3,0.01551,0.772408
4,0.017347,0.771824
5,0.019184,0.771337
6,0.02102,0.770557
7,0.022857,0.770167
8,0.024694,0.769388
9,0.026531,0.768999


In [62]:
random_mnb_class_train_scores = random_mnb_class.score(tv_train_features, train_label_names)
print('Train Accuracy:', random_mnb_class_train_scores)

random_mnb_class_val_scores = random_mnb_class.score(tv_valid_features, val_label_names)
print('Valid Accuracy:', random_mnb_class_val_scores)

Train Accuracy: 0.9964925954793453
Valid Accuracy: 0.7818181818181819


#### Complement Naive Bayes Classifier

In [28]:
param_grid_cnb = {'alpha': np.linspace(0.001,0.01,50)} 

random_cnb_class = RandomizedSearchCV(
    estimator = ComplementNB(),
    param_distributions = param_grid_cnb,
    n_iter = 50,
    scoring='accuracy', n_jobs=4, cv = 5, refit=True, return_train_score = True)

random_cnb_class.fit(tv_train_features, train_label_names)

df_cnb = pd.concat([pd.DataFrame(random_cnb_class.cv_results_["params"]),pd.DataFrame(random_cnb_class.cv_results_["mean_test_score"], columns=["Accuracy"])],axis=1)
df_cnb

Unnamed: 0,alpha,Accuracy
0,0.001,0.737822
1,0.001184,0.739771
2,0.001367,0.740453
3,0.001551,0.741427
4,0.001735,0.743083
5,0.001918,0.744058
6,0.002102,0.745422
7,0.002286,0.746396
8,0.002469,0.74776
9,0.002653,0.748637


In [29]:
random_cnb_class_train_scores = random_cnb_class.score(tv_train_features, train_label_names)
print('Train Accuracy:', random_cnb_class_train_scores)

random_cnb_class_val_scores = random_cnb_class.score(tv_valid_features, val_label_names)
print('Valid Accuracy:', random_cnb_class_val_scores)

Train Accuracy: 0.9965900233826968
Valid Accuracy: 0.7706818181818181


#### Logistic Regression Classifier

In [30]:
param_grid_log = {'solver': ['lbfgs', 'liblinear'], 'penalty': ['none', 'l1', 'l2'], 'C': [100, 50, 10]} 

random_log_class = RandomizedSearchCV(
    estimator = LogisticRegression(),
    param_distributions = param_grid_log,
    n_iter = 10,
    scoring='accuracy', n_jobs=4, cv = 5, refit=True, return_train_score = True)

random_log_class.fit(tv_train_features, train_label_names)

df_log = pd.concat([pd.DataFrame(random_log_class.cv_results_["params"]),pd.DataFrame(random_log_class.cv_results_["mean_test_score"], columns=["Accuracy"])],axis=1)
df_log

Unnamed: 0,solver,penalty,C,Accuracy
0,liblinear,none,50,
1,lbfgs,l2,100,
2,lbfgs,none,100,
3,liblinear,l2,100,0.755261
4,lbfgs,l2,10,0.750194
5,liblinear,none,10,
6,liblinear,l2,10,0.751168
7,liblinear,l1,100,0.702942
8,lbfgs,l1,10,
9,liblinear,l1,50,0.700798


In [31]:
random_log_class_train_scores = random_log_class.score(tv_train_features, train_label_names)
print('Train Accuracy:', random_log_class_train_scores)

random_log_class_val_scores = random_log_class.score(tv_valid_features, val_label_names)
print('Valid Accuracy:', random_log_class_val_scores)

Train Accuracy: 0.9981488698363211
Valid Accuracy: 0.7763636363636364


#### Support Vector Machine Classifier (SVC)

In [45]:
param_grid_svm = {'kernel': ['linear', 'rbf'], 'C': [10, 1.0]} 

random_svm_class = RandomizedSearchCV(
    estimator = SVC(),
    param_distributions = param_grid_svm,
    scoring='accuracy', n_jobs=4, cv = 5, refit=True, return_train_score = True)

random_svm_class.fit(tv_train_features, train_label_names)

df_svm = pd.concat([pd.DataFrame(random_svm_class.cv_results_["params"]),pd.DataFrame(random_svm_class.cv_results_["mean_test_score"], columns=["Accuracy"])],axis=1)
df_svm

Unnamed: 0,kernel,C,Accuracy
0,linear,10.0,0.7424
1,rbf,10.0,0.72155
2,linear,1.0,0.740938
3,rbf,1.0,0.694368


In [46]:
random_svm_class_train_scores = random_svm_class.score(tv_train_features, train_label_names)
print('Train Accuracy:', random_svm_class_train_scores)

random_svm_class_val_scores = random_svm_class.score(tv_valid_features, val_label_names)
print('Valid Accuracy:', random_svm_class_val_scores)

Train Accuracy: 0.997954014029618
Valid Accuracy: 0.7672727272727272


#### Stochastic Gradient Descent Classifier

In [47]:
param_grid_sgd = {'penalty': ['l2', 'elasticnet'], 'alpha': [0.000001, 0.00001, 0.0001, 0.001, 0.01]} 

random_sgd_class = RandomizedSearchCV(
    estimator = SGDClassifier(),
    param_distributions = param_grid_sgd,
    scoring='accuracy', n_jobs=4, cv = 5, refit=True, return_train_score = True)

random_sgd_class.fit(tv_train_features, train_label_names)

df_sgd = pd.concat([pd.DataFrame(random_sgd_class.cv_results_["params"]),pd.DataFrame(random_sgd_class.cv_results_["mean_test_score"], columns=["Accuracy"])],axis=1)
df_sgd

Unnamed: 0,penalty,alpha,Accuracy
0,l2,1e-06,0.720088
1,elasticnet,1e-06,0.715218
2,l2,1e-05,0.753214
3,elasticnet,1e-05,0.74922
4,l2,0.0001,0.760522
5,elasticnet,0.0001,0.733047
6,l2,0.001,0.73597
7,elasticnet,0.001,0.570537
8,l2,0.01,0.73597
9,elasticnet,0.01,0.124511


In [48]:
random_sgd_class_train_scores = random_sgd_class.score(tv_train_features, train_label_names)
print('Train Accuracy:', random_sgd_class_train_scores)

random_sgd_class_val_scores = random_sgd_class.score(tv_valid_features, val_label_names)
print('Valid Accuracy:', random_sgd_class_val_scores)

Train Accuracy: 0.9942517537022604
Valid Accuracy: 0.7804545454545454


#### Random Forest Classifier

In [13]:
param_grid_rfc = {'n_estimators': [10, 50, 100, 200], 'criterion': ['gini', 'entropy']} 

random_rfc_class = RandomizedSearchCV(
    estimator = RandomForestClassifier(),
    param_distributions = param_grid_rfc,
    scoring='accuracy', n_jobs=4, cv = 2, refit=True, return_train_score = True)

random_rfc_class.fit(tv_train_features, train_label_names)

df_rfc = pd.concat([pd.DataFrame(random_rfc_class.cv_results_["params"]),pd.DataFrame(random_rfc_class.cv_results_["mean_test_score"], columns=["Accuracy"])],axis=1)
df_rfc

Unnamed: 0,n_estimators,criterion,Accuracy
0,10,gini,0.423617
1,50,gini,0.597428
2,100,gini,0.626559
3,200,gini,0.644291
4,10,entropy,0.277572
5,50,entropy,0.457229
6,100,entropy,0.52991
7,200,entropy,0.565277


In [14]:
random_rfc_class_train_scores = random_rfc_class.score(tv_train_features, train_label_names)
print('Train Accuracy:', random_rfc_class_train_scores)

random_rfc_class_val_scores = random_rfc_class.score(tv_valid_features, val_label_names)
print('Valid Accuracy:', random_rfc_class_val_scores)

Train Accuracy: 0.9982462977396727
Valid Accuracy: 0.6852272727272727


#### K-Nearest Neighbors Classifier (KNN)

In [15]:
param_grid_knn = {'n_neighbors': [100, 120, 150, 200], 'metric': ['euclidean', 'manhattan']} 

random_knn_class = RandomizedSearchCV(
    estimator = KNeighborsClassifier(),
    param_distributions = param_grid_knn,
    scoring='accuracy', n_jobs=4, cv = 5, refit=True, return_train_score = True)

random_knn_class.fit(tv_train_features, train_label_names)

df_knn = pd.concat([pd.DataFrame(random_knn_class.cv_results_["params"]),pd.DataFrame(random_knn_class.cv_results_["mean_test_score"], columns=["Accuracy"])],axis=1)
df_knn

Unnamed: 0,n_neighbors,metric,Accuracy
0,100,euclidean,0.660366
1,120,euclidean,0.657151
2,150,euclidean,0.652767
3,200,euclidean,0.647214
4,100,manhattan,0.054462
5,120,manhattan,0.058262
6,150,manhattan,0.059918
7,200,manhattan,0.05378


In [16]:
random_knn_class_train_scores = random_knn_class.score(tv_train_features, train_label_names)
print('Train Accuracy:', random_knn_class_train_scores)

random_knn_class_val_scores = random_knn_class.score(tv_valid_features, val_label_names)
print('Valid Accuracy:', random_knn_class_val_scores)

Train Accuracy: 0.6976812159002338
Valid Accuracy: 0.6765909090909091


# Testing the Models on the Test Dataset

#### Multinomial Naive Bayes

In [49]:
random_mnb_class_test_scores = random_mnb_class.score(tv_test_features, test_label_names)
print('Test Accuracy:', random_mnb_class_test_scores)

Test Accuracy: 0.7763839650940824


#### Complement Naive Bayes

In [50]:
random_cnb_class_test_scores = random_cnb_class.score(tv_test_features, test_label_names)
print('Test Accuracy:', random_cnb_class_test_scores)

Test Accuracy: 0.7657485683119717


#### Logistic Regression

In [51]:
random_log_class_test_scores = random_log_class.score(tv_test_features, test_label_names)
print('Test Accuracy:', random_log_class_test_scores)

Test Accuracy: 0.7725661303517862


#### Support Vector Machine

random_svm_class_test_scores = random_svm_class.score(tv_test_features, test_label_names)
print('Test Accuracy:', random_svm_class_test_scores)

#### Stochastic Gradient Descent

In [53]:
random_sgd_class_test_scores = random_sgd_class.score(tv_test_features, test_label_names)
print('Test Accuracy:', random_sgd_class_test_scores)

Test Accuracy: 0.7722934278701936


#### Random Forest

In [54]:
random_rfc_class_test_scores = random_rfc_class.score(tv_test_features, test_label_names)
print('Test Accuracy:', random_rfc_class_test_scores)

Test Accuracy: 0.676029451868012


# Best Model

By observing the accuracy of the models on the test dataset, it seems that the Multinomial Naive Bayes had the best predictive performance with a 0.776 accuracy

# Description

### Methodology

The purpose of this assignment was to optimize the basic text classification pipeline of the 20newsgroup dataset. Thus, i started first by testing 12 different preprocessing combinations in attempt to find the best preprocessing technique in addition to determining the preprocessing step which has the highest impact on the models' performance. The two main libraries that were used are Spacy and NLTK. Moreover, i also compared the lemmatization and the stemming techniques to see which one had a better effect on the models's accuracy. It's important to note that while testing different preprocessing combinations, TF-IDF was used as the default feature extraction techniques for all of the combinations. Besides, the preprocessing techniques were tested on six different classification algorithms (Naive Bayes, Logistic Regression, Support Vector Classifier, Stochastic Gradient Descent, Random Foreset, and a Voting Classifier which included all the previous algorithms).

Next after finding the best sequence of preprocessing techniques, i resplit the data into three different sets; training, validation, and testing. I did so because i found it more convenient to use a new notebook for the feature extraction task. I tested six different feature extraction techniques which are the following: 

               1- Bag of Words (Unigrams)
               2- Bag of Words (Unigrams and Bigrams)
               3- TF-IDF (Unigrams)
               4- TF-IDF (Unigrams and Bigrams)
               5- Bag of Words and TF-IDF (Unigrams and Bigrams)
               6- Word2Vec and Glove (Unigrams and Bigrams)

As before, those feature extraction techniques were tested on all six predefined models.
 
After selecting the best preprocessing and feature extraction techniques, i then moved into hyperparameter tuning. I tuned the hyperparameters of seven different models using a Randomized Grid Search. The seven models are:

               1- Multinomial Naive Bayes Classifier
               2- Complement Naive Bayes Classifier
               3- Logistic Regression Classifier
               4- Support Vector Machine Classifier 
               5- Stochastic Gradient Descent Classifier
               6 -Random Forest Classifier
               7- K-Nearest Neighbors Classifier

After finding the best hyperparameters for each of the seven models, i then moved to the final testing stage. Each of the seven models (which are now stored in the Grid Search object after setting 'refit = True') were tested against the testing dataset, and finally, the best model was chosen.

### Findings

For the preprocessing stage, the following combination was found to exhibit the highest prediction accuracy: 
                
               1- HTML Stripping
               2- Contraction Expansion
               3- Text Lower Case
               4- Text Lemmatization
               5- Stopword Removal
               6- Accented Char Removal
               7- Special Char Removal
               
As for the feature extraction stage, the TF-IDF with Unigrams and Bigrams seemed to have the highest effect on model performace as thus was picked.

Next, for the hyperparameter tuning, Multinomial Naive Bayes performed best with an alpha = 0.01. The Complement Naive Bayes performed best with an alpha = 0.01 also. As for the Logistic Regression, the model performed best with an l2 penalty, C = 100, and solver = liblinear. SVM did best with a kernel = linear and C = 10. SGD had the highest accuracy with a penalty = l2 and alpha = 0.0001. As for Random Forest, it did best with n-estimators = 200 and criterion = gini. Lastly, the best parameters for the KNN were n-neighbors = 200 and metric = euclidean.

Finally, after testing the models against the testing dataset, it was found the Multinomial Naive Bayes performed best achieving an accuracy of 0.776. 

Note: Due to the high computational cost, i wasnt able to test all the hyperparameters values, and thus was obliged to run the Grid Search only once.

### Recommendations

1- Redo the hyperparameter stage by including more parameters and a bigger range of values

2- The data needs more preprocessing as the corpus is still a bit messy and the words arent completely tockenized

3- More models should be tested, especially Neural Networks and RNN which are proved to perform very well on such type of problems

##### I, Samer Haidar, affirm that I completed this assignment on my own without receiving or giving any help.