## Exercise 0: Preprocessing Text Data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.utils.validation import check_X_y, check_array
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
import spacy 
import en_core_web_sm
import re
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score

To implement a Text Classifiers to categorize news items. The dataset
name and link:
• 20newsgroups dataset (A collection of 20,000 news items across 20 categories)
• Available via Scikit-Learn Datasets API.
• Subset the dataset to only the following two categories named as ’sci.med’ and ’comp.graphics’

To refer to the following resource 
https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html as given in question

In [2]:

#Fetching the data set and subset it to the two given categories

twenty_dataset = fetch_20newsgroups(subset='train',categories=['comp.graphics', 'sci.med'], shuffle=True, random_state=3116)



In [3]:
twenty_dataset #the retreived data set

{'data': ['From: kaminski@netcom.com (Peter Kaminski)\nSubject: Re: Krillean Photography\nLines: 101\nOrganization: The Information Deli - via Netcom / San Jose, California\n\n[Newsgroups: m.h.a added, followups set to most appropriate groups.]\n\nIn <1993Apr19.205615.1013@unlv.edu> todamhyp@charles.unlv.edu (Brian M.\nHuey) writes:\n\n>I am looking for any information/supplies that will allow\n>do-it-yourselfers to take Krillean Pictures.\n\n(It\'s "Kirlian".  "Krillean" pictures are portraits of tiny shrimp. :)\n\n[...]\n\n>One might extrapolate here and say that this proves that every object\n>within the universe (as we know it) has its own energy signature.\n\nI think it\'s safe to say that anything that\'s not at 0 degrees Kelvin\nwill have its own "energy signature" -- the interesting questions are\nwhat kind of energy, and what it signifies.\n\nI\'d check places like Edmund Scientific (are they still in business?) --\nor I wonder if you can find ex-Soviet Union equipment for sal

In [4]:
#The y values are stored in the target attribute. target_names store the 2 category values
twenty_dataset.target_names

['comp.graphics', 'sci.med']

#### Preprocessing textual data to remove punctuation, stop-words (list available via external libraries such as NLTK and spaCy). 


Here I am using the list from spacy.

In [12]:

#loading the english language small model of spacy
eng = spacy.load('en_core_web_sm')
stop_spacy = eng.Defaults.stop_words #to get the stopwords in english language
print(stop_spacy) #printing the same


{'n‘t', 'thereafter', 'were', 'whole', 'sometimes', 'thereby', 'fifty', 'back', 'along', 'perhaps', 'something', 'many', 'the', 'before', 'my', '’ve', 'thru', 'thereupon', 'will', 'see', "'d", "n't", 'off', 'towards', 'third', 'become', 'hereupon', 'them', 'therein', 'up', 'alone', 'we', 'ours', 'whatever', 'yet', 'make', 'his', 'who', 'how', 'go', 'most', 'you', "'m", 'upon', 'get', 'whether', 'fifteen', 'anywhere', 'this', 'amongst', 'being', 'been', 'nevertheless', 'has', 'to', 'call', 'why', 'full', 'enough', 'often', 'beyond', 'indeed', 'using', 'more', 'otherwise', 'least', 'name', 'no', 'only', 'nine', '’ll', 'when', 'any', 'though', 'its', 'neither', 'really', 'these', 'after', 'at', 'very', 'and', 'last', 'formerly', "'ve", 'must', 'himself', 'sometime', 'again', 'keep', 'except', 'some', 'just', 'then', 'thence', 'behind', 'always', 'less', 'mine', 'itself', 'out', 'anyhow', 'meanwhile', 'via', 'whoever', 'almost', '‘s', 'hence', 'well', 'against', 'own', '’d', 'both', 'is', 

In [13]:
#Removing the stop words and punctuations using the list we have.
new_text_list =[]

#Iterating for each row
for i in range(1178):
    new_words= [word for word in twenty_dataset.data[i] if not word in stop_stacy]
    new_text = " ".join(new_words)
    new_text_list.append(new_text)
print(new_text_list[50]) #to print the 50th news item, its old length and new length
print("Old length: ", len(twenty_dataset.data[50]))
print("New length: ", len(new_text_list[50]))

 besmith@uncc edu (Brian E Smith) Subject   Rayshade query Nntp Posting Host  ws   uncc edu Reply  besmith@uncc edu Organization  University NC Charlotte Lines      article     @sunvax sun ac za         @sunvax sun ac za () writes   looking surface chesspieces  board marble   Unfortunately black won't work  Anybody ideas  nice surfaces   brass silver   I've seen real chessboards use material      post finished chessboard     Right good place   Can't wait   use POV raytracer   compatible chessboard                                                                                        "I don't know you've got picture  doesn't        like he's running thrusters!"    Leonard McCoy       "A guess    Spock   That's extraordinary!"    James T  Kirk                                                                                    Brian Smith  (besmith@mosaic uncc edu)  
Old length:  1105
New length:  875


 Implementing a bag-of-words feature representation for each text sample

In [14]:
# Here I am using the CountVectorizer() to return the vocabulary and the occurences

count_vector = CountVectorizer()
X_dataset_counts = count_vector.fit_transform(new_text_list)
X_dataset_counts.shape

(1178, 21529)

In [15]:
count_vector.vocabulary_.get(u'algorithm')

718

 Implementing a TF-IDF feature representation for each text sample

In [16]:
# Here I am using the TfidfTransformer() and then fitting and transforming the data. It gives the frequency of each word

tf_transformer = TfidfTransformer(use_idf=False).fit(X_dataset_counts)
X_dataset_tf = tf_transformer.transform(X_dataset_counts)
X_dataset_tf.shape

(1178, 21529)

 Split the dataset randomly into train/validation/test splits according to ratios 80%:10%:10%

In [18]:
len1 = int(0.8*X_dataset_counts.shape[0])
len2 = int(len1+(0.1*X_dataset_counts.shape[0]))
X_train_counts = X_dataset_counts[:len1]
X_val_counts = X_dataset_counts[len1:len2]
X_test_counts = X_dataset_counts[len2:]

print(X_train_counts.shape)
print(X_test_counts.shape)
print(X_val_counts.shape)



(942, 21529)
(119, 21529)
(117, 21529)


## Exercise 2: Implementing SVM Classifier via Scikit-Learn

In [19]:
# Peforming Naive Bayes Classifier using sklearn on the tfidf transformed data

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_dataset_counts)
X_train_tfidf.shape
clf = MultinomialNB().fit(X_train_tfidf, twenty_dataset.target) # we get the fitted classifier

In [20]:
#Checking the prediction for 2 new values
docs_new = ['Science is love', 'Graphics on the GPU is fast']
X_new_counts = count_vector.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts) #Tfidf transformation

predicted = clf.predict(X_new_tfidf) #predciting using the already created model

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_dataset.target_names[category])) #Displaying the predicted categories
print("\n We see that the predictions match the text given")

'Science is love' => sci.med
'Graphics on the GPU is fast' => comp.graphics

 We see that the predictions match the text given


In [21]:
#To make the vectorizer,transformer and classifier easier to work with, scikit-learn provides a Pipeline class 
# It behaves as a compund classifier
text_clf = Pipeline([
('vector', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', MultinomialNB()),
])

In [22]:
text_clf.fit(twenty_dataset.data, twenty_dataset.target) #fitting directly on the pipeline


Pipeline(steps=[('vector', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('clf', MultinomialNB())])

In [23]:
# Finding the prediction accuracy of the Naive Bayes classifier
#fetching test data as shown in scikit learn documentation
two_test = fetch_20newsgroups(subset='test',  
categories=['comp.graphics', 'sci.med'], shuffle=True, random_state=3116)
docs_test = two_test.data
predicted = text_clf.predict(docs_test)
acc = np.mean(predicted == two_test.target)
print("Accuracy of Naive Bayes Classifier: ",acc)

Accuracy of Naive Bayes Classifier:  0.9477707006369427


Now implementing Support Vector Machines for the news item classification.

Here we only need to plug in the different classifier object into the pipeline.

SGDClassifier is a linear SVM classifier with SGD training.

In [24]:
#hinge refers to being linear model(default value), penalty is for regularisation, alpha is the penalty value

text_classification = Pipeline([
('vector', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=1e-3, random_state=3116,
                          max_iter=5, tol=None)),
])

text_classification.fit(new_text_list[:len1], twenty_dataset.target[:len1])#fitting the model with the train set


Pipeline(steps=[('vector', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('clf',
                 SGDClassifier(alpha=0.001, max_iter=5, random_state=3116,
                               tol=None))])

In [26]:

predicted = text_classification.predict(new_text_list[len2:])
accu = np.mean(predicted == twenty_dataset.target[len2:])
acc1 = accuracy_score(predicted, twenty_dataset.target[len2:])
print("Accuracy of SVM model with SGD training:",acc1)

Accuracy of SVM model with SGD training: 0.9663865546218487


We observe that SVM has better accuracy than naive bayes classifier.

In [27]:
#Finding accuracy using the various metrics provided by sklearn.
print(metrics.classification_report(twenty_dataset.target[len2:], predicted,
    target_names=twenty_dataset.target_names))

               precision    recall  f1-score   support

comp.graphics       0.96      0.99      0.97        69
      sci.med       0.98      0.94      0.96        50

     accuracy                           0.97       119
    macro avg       0.97      0.96      0.97       119
 weighted avg       0.97      0.97      0.97       119



Performing Grid search to tune the hyper parameters

In [31]:

# Here we try the classifier on either words or bi grams, with or without use_idf in tfidf transformer and penalty parameter
#Defining a parameter space
parameters = {
    'vector__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}

In [32]:
#cross validation folds set to 5
tune_clf = GridSearchCV(text_classification, parameters, cv=5, n_jobs=-1) #n_jobs = -1 for parallel training


In [33]:
tune_clf = tune_clf.fit(new_text_list[:len2], twenty_dataset.target[:len2]) #fitting the data with the train and validation sets


In [34]:
#Trying the prediction for one data
twenty_dataset.target_names[tune_clf.predict(['Light travels fast'])[0]]

'sci.med'

We observe that the prediction is correct. 

In [35]:
tune_clf.best_score_ #Gets the best accuracy value for train set

0.9801663238844676

In [36]:
for param_name in sorted(parameters.keys()): # to print the best params
    print("%s: %r" % (param_name, tune_clf.best_params_[param_name]))

clf__alpha: 0.001
tfidf__use_idf: True
vector__ngram_range: (1, 1)


In [41]:
# Now finding test accuracy on the best param model

best_clf = Pipeline([
('vector', CountVectorizer(ngram_range=(1,1))),
('tfidf', TfidfTransformer(use_idf=True)),
('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=0.001, random_state=3116,
                          max_iter=5, tol=None)),
])
best_clf.fit(new_text_list[:len2], twenty_dataset.target[:len2])
pred = best_clf.predict(new_text_list[len2:])
accu_best = np.mean(pred == twenty_dataset.target[len2:])

print("Accuracy of SVM model with best params:",accu_best)

Accuracy of SVM model with best params: 0.9663865546218487
