# Text Classification with The 20 newsgroups text dataset

## 1. Import neccessary libraries
- MultinomialNB (The multinomial Naive Bayes classifier): is suitable for classification with discrete features
- CountVectorizer: transform text to vector by Bag of Words method
- TfidfVectorizer: transform text to vector by TF-IDF method
- TfidfTransformer: recalculate vectors of a BoW (by TF-IDF) to make vector more accurracy  

In [1]:
#from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
#import string
import pandas as pd
#import re
#import spacy
#nlp = spacy.load('en')

## 2. Get the 20 newsgroups text dataset
- Get train subset test subset
- Remove the footer part and the quote part of an email

In [32]:
from sklearn.datasets import fetch_20newsgroups
#cats = ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale']
newsgroups_train = fetch_20newsgroups(subset='train', shuffle = True, remove=('footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', shuffle = True, remove=('footers', 'quotes'))

#newsgroups_train = fetch_20newsgroups(subset='train', categories = cats, shuffle = True, remove=('footers', 'quotes'))
#newsgroups_test = fetch_20newsgroups(subset='test', categories = cats, shuffle = True, remove=('footers', 'quotes'))

## 3. Preprocess the data
- This process includes:
 - Clean data (Remove punctuation, and stopword)
 - Tokenize data
 - Vectorize data: use 2 methods: Bag of Words and TF-IDF
- __CountVectorizer Library:__ Support us to do all steps. It uses Bag of Words method to vectorize data
- __TfidfTransformer Library:__ Use TF-IDF methods to transform vector of BoW method

In [33]:
###Method 1: Bag of Words
vectorizer_bow = CountVectorizer(stop_words = "english")
vectors_bow = vectorizer_bow.fit_transform(newsgroups_train.data)
vectors_bow_t = vectorizer_bow.transform(newsgroups_test.data)
###Method 2: TF-IDF
tfidf_transformer = TfidfTransformer()
vectors_tfidf = tfidf_transformer.fit_transform(vectors_bow)
vectors_tfidf_t = tfidf_transformer.fit_transform(vectors_bow_t)
#vectorizer_tfidf = TfidfVectorizer(stop_words='english')
#vectors_tfidf = vectorizer_tfidf.fit_transform(newsgroups_train.data)
#vectors_tfidf_t = vectorizer_tfidf.transform(newsgroups_test.data)

# 4. Training
- Use The multinomial Naive Bayes classifier to train
- Predict the target group of a testing dataset and compare with the result to get the accuracy (F-score)

In [34]:
#Get the model
clf = MultinomialNB(alpha=.01)
#Train model with training dataset
clf.fit(vectors_bow, newsgroups_train.target)
#Predict the target of testing dataset
pred = clf.predict(vectors_bow_t)

print ("F-score of BoW:", metrics.f1_score(newsgroups_test.target, pred, average='macro'))

F-score of BoW: 0.7182918139019095


In [35]:
clf = MultinomialNB(alpha=.01)

clf.fit(vectors_tfidf, newsgroups_train.target)

pred = clf.predict(vectors_tfidf_t)

print ("F-score of BoW:", metrics.f1_score(newsgroups_test.target, pred, average='macro'))

F-score of BoW: 0.7819474884205162


## 5. Conclusion:
- I run above code with the number of categories k (2 < k < 20)
- BoW gets higher F-score in almost cases with k <= 3
- With k >= 4, TF-IDF gets higher score in all cases