## 1) Exploring the dataset

#### We download the datasets

In [1]:
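from sklearn.datasets import fetch_20newsgroups

# Raw posts, including headers, footers and quoted replies
newsgroups_train_unclear = fetch_20newsgroups(subset='train')
# Same posts with the meta-information stripped
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))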

#### Check differences

Both datasets, with and without the meta-information, contain the same number of instances, so we continue with the second one, which is less prone to overfitting. The removed headers, footers and quotes may contain words or addresses that lead to high accuracy on the training set but lower accuracy on the test set.

In [2]:
if (newsgroups_train_unclear.filenames.shape) == (newsgroups_train.filenames.shape):
  print('Same length')
else:
  print('Different length')


Same length


#### Present the difference

We print the first instance of each of the two training sets to compare their contents.

In [3]:
print("With meta-information: \n")
print(newsgroups_train_unclear.data[0])

print("Without meta-information: \n")
print(newsgroups_train.data[0])

With meta-information: 

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





Without meta-information: 

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition
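
To make the effect of the `remove` argument more concrete, here is a minimal sketch of header stripping. It assumes, as in the sample above, that the header block ends at the first blank line. `strip_header` is a hypothetical helper written for illustration, not the scikit-learn implementation, and `raw_post` is an abbreviated copy of the first training instance.

```python
def strip_header(post: str) -> str:
    """Return the text that follows the first blank line (hypothetical helper)."""
    # partition splits at the first occurrence of the separator only;
    # a post without any blank line would yield an empty body
    _header, _blank, body = post.partition("\n\n")
    return body

raw_post = (
    "From: lerxst@wam.umd.edu (where's my thing)\n"
    "Subject: WHAT car is this!?\n"
    "Lines: 15\n"
    "\n"
    "I was wondering if anyone out there could enlighten me on this car I saw\n"
    "the other day."
)

body = strip_header(raw_post)
print(body)
```

Tokens such as the sender address or the organization name never reach the vectorizer in the stripped version, which is why a model cannot memorize them.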

#### Formulation of the dataset

In [4]:
print("Number of training instances: ", newsgroups_train.filenames.shape)
print("Number of labels: ", len(newsgroups_train.target_names))

print('Label names: \n')
for i in set(newsgroups_train.target_names):
  print(i)

print('Label ids inside the dataset: \n')
for i in set(newsgroups_train.target):
  print(i)

Number of training instances:  (11314,)
Number of labels:  20
Label names: 

sci.med
talk.politics.misc
comp.sys.mac.hardware
comp.os.ms-windows.misc
soc.religion.christian
rec.motorcycles
rec.sport.hockey
talk.politics.mideast
sci.crypt
sci.electronics
rec.autos
misc.forsale
talk.religion.misc
sci.space
comp.graphics
talk.politics.guns
comp.windows.x
alt.atheism
comp.sys.ibm.pc.hardware
rec.sport.baseball
Label ids inside the dataset: 

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19


### Download some packages for text-processing.

#### We also define a custom stopword list for further experiments

In [5]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords 
import nltk
nltk.download('punkt')
nltk.download('stopwords')

stopwords_our = ['a', 'about', 'above', 'after', 'again', 'against', 'all', 'am', 'an', 'and', 'any', 'are', "aren't", 'as', 'at',
 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 
 'can', "can't", 'cannot', 'could', "couldn't", 'did', "didn't", 'do', 'does', "doesn't", 'doing', "don't", 'down', 'during',
 'each', 'few', 'for', 'from', 'further', 
 'had', "hadn't", 'has', "hasn't", 'have', "haven't", 'having', 'he', "he'd", "he'll", "he's", 'her', 'here', "here's",
 'hers', 'herself', 'him', 'himself', 'his', 'how', "how's",
 'i', "i'd", "i'll", "i'm", "i've", 'if', 'in', 'into', 'is', "isn't", 'it', "it's", 'its', 'itself',
 "let's", 'me', 'more', 'most', "mustn't", 'my', 'myself',
 'no', 'nor', 'not', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'ought', 'our', 'ours', 'ourselves', 'out', 'over', 'own',
 'same', "shan't", 'she', "she'd", "she'll", "she's", 'should', "shouldn't", 'so', 'some', 'such', 
 'than', 'that',"that's", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', "there's", 'these', 'they', "they'd", 
 "they'll", "they're", "they've", 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 'very', 
 'was', "wasn't", 'we', "we'd", "we'll", "we're", "we've", 'were', "weren't", 'what', "what's", 'when', "when's", 'where',
 "where's", 'which', 'while', 'who', "who's", 'whom', 'why', "why's",'will', 'with', "won't", 'would', "wouldn't", 
 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 'yourselves', 
 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', 'hundred', 'thousand', '1st', '2nd', '3rd',
 '4th', '5th', '6th', '7th', '8th', '9th', '10th']

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Vaggelis\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Vaggelis\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
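
The custom list `stopwords_our` could be used in place of NLTK's English list. A minimal sketch, assuming the list is converted to a set for fast membership tests: `custom_stops` is a shortened stand-in for the full list, and `remove_custom_stops` is a hypothetical helper that is not part of the notebook.

```python
# Shortened stand-in for stopwords_our (the full list is defined above)
custom_stops = {'a', 'the', 'is', 'it', 'this', 'what', 'two', '1st'}

def remove_custom_stops(tokens, stop_set=custom_stops):
    """Keep only the tokens that are not in the stop set."""
    return [tok for tok in tokens if tok not in stop_set]

tokens = ['what', 'car', 'is', 'this', 'two', 'door', 'sports', 'car']
filtered = remove_custom_stops(tokens)
print(filtered)  # ['car', 'door', 'sports', 'car']
```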


## 2) Construct two Tokenizer functions 

We use the following text transformations, in the order in which the code applies them:
1.   Lowercase all characters
2.   Tokenize each text segment into words, using the built-in word_tokenize function
3.   Remove words shorter than 3 characters
4.   Remove common stopwords, to obtain a cleaner version of each segment
5.   Stem the remaining words, so that similar words map to the same token (e.g. faded or fading -> fade)

Two variants are built here and compared later: Tokenizer1 applies all of the steps by default, while Tokenizer2 only lowercases and tokenizes.



In [6]:
def Tokenizer1(str_input, remove_stops = True, apply_stem = True, remove_short = True):
    #lower all characters
    str_input = str_input.lower()
    
    #separate all words in the input string
    words = word_tokenize(str_input)

    #remove words shorter than 3 characters
    if remove_short:
      words = [word for word in words if len(word) > 2]
    
    #remove stopwords
    if remove_stops:
      stop_words = set(stopwords.words('english'))
      words = [w for w in words if w not in stop_words]
    
    #stem the words
    if apply_stem:
      porter_stemmer=nltk.PorterStemmer()
      words = [porter_stemmer.stem(word) for word in words]

    return words

def Tokenizer2(str_input, remove_stops = False, apply_stem = False, remove_short = False):
    #lower all characters
    str_input = str_input.lower()
    
    #separate all words in the input string
    words = word_tokenize(str_input)
    
    #remove words shorter than 3 characters
    if remove_short:
      words = [word for word in words if len(word) > 2]
    
    #remove stopwords
    if remove_stops:
      stop_words = set(stopwords.words('english'))
      words = [w for w in words if w not in stop_words]
    
    #stem the words
    if apply_stem:
      porter_stemmer=nltk.PorterStemmer()
      words = [porter_stemmer.stem(word) for word in words]

    return words
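
Since the two tokenizers differ only in their default flags, one possible refactoring is a single function combined with `functools.partial`. The sketch below is illustrative only: `tokenize`, `tokenizer_full`, `tokenizer_plain` and the tiny stop set are hypothetical names, `str.split` stands in for `word_tokenize` so that the snippet runs without NLTK data, and stemming is left out.

```python
from functools import partial

TOY_STOPS = frozenset({'the', 'to', 'in', 'what'})  # illustrative stop set

def tokenize(text, remove_stops=True, remove_short=True, stops=TOY_STOPS):
    """Lowercase, split on whitespace, then apply the optional filters."""
    words = text.lower().split()
    if remove_short:
        words = [w for w in words if len(w) > 2]
    if remove_stops:
        words = [w for w in words if w not in stops]
    return words

# Two variants that share one implementation and differ only in their flags
tokenizer_full = partial(tokenize, remove_stops=True, remove_short=True)
tokenizer_plain = partial(tokenize, remove_stops=False, remove_short=False)

sample = "Look what happened to Japanese citizens in the US"
print(tokenizer_full(sample))
print(tokenizer_plain(sample))
```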

#### examples of the above tokenizers

In [7]:
print('Tokenizer1 : lowers letters, tokenizes input strings, removes stop words, stems them and removes the short ones')
print(Tokenizer1(newsgroups_train.data[15]))
print('--------------------------------------------')
print('Tokenizer2 : only lowers letters and tokenizes the input string')
print(Tokenizer2(newsgroups_train.data[15]))

#variations (positional arguments: remove_stops, apply_stem)
#print(Tokenizer1(newsgroups_train.data[15], True,  False))
#print(Tokenizer1(newsgroups_train.data[15], False, True ))

Example of using or not each component of pre=processing into the Tokenizer
--------------------------------------------
Tokenizer1 : lowers letters, tokenizes input strings, removes stop words, stems them and removes the short ones
["n't", 'sure', 'look', 'happen', 'japanes', 'citizen', 'world', 'war', "'re", 'prepar', 'say', 'let', 'round', 'peopl', 'stick', 'concentr', 'camp', 'without', 'trial', 'short', 'step', 'gass', 'without', 'trial', 'seem', 'nazi', 'origin', 'intend', 'imprison', 'jew', 'final', 'solut', 'dreamt', 'partli', 'could', "n't", 'afford', 'run', 'camp', 'devast', 'caus', 'goer', 'total', 'war', "n't", 'gass', 'gener', 'die', 'malnutrit', 'diseas']
--------------------------------------------
Tokenizer2 : only lowers letters and tokenizes the input string
['do', "n't", 'be', 'so', 'sure', '.', 'look', 'what', 'happened', 'to', 'japanese', 'citizens', 'in', 'the', 'us', 'during', 'world', 'war', 'ii', '.', 'if', 'you', "'re", 'prepared', 'to', 'say', '``', 'let', "'

Remark: Looking at the printed results, we observe that the output of Tokenizer1 indeed contains no words like "they" (a stop word) or "it" (a short word). At the same time, the word "happened" in the output of Tokenizer2 is stemmed to "happen" by Tokenizer1.

## 3) Defining the text classifiers

We apply the tf-idf transformation with the 2 variants of the Tokenizers built above. To this end, we define pipelines that combine each variant of the TfidfVectorizer with 3 selected classifiers:

1.   SGD classifier (i.e. a linear classifier trained with stochastic gradient descent)
2.   Multinomial NB (MNB), a well-known classifier for textual data
3.   GradientBoosting (GB), a powerful predictive algorithm that combines several weak estimators, each one trying to correct the errors of the previous iterations.
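
For intuition about the tf-idf transformation, the weights can be computed by hand on a toy corpus. The sketch assumes scikit-learn's default settings, namely a smoothed idf of the form `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row. The three documents and the helper name `tfidf_rows` are made up for illustration.

```python
import math
from collections import Counter

docs = [['space', 'orbit', 'nasa'],
        ['space', 'image', 'pixel'],
        ['god', 'faith', 'space']]

def tfidf_rows(tokenized_docs):
    """Return the vocabulary, idf weights and L2-normalised tf-idf rows."""
    n_docs = len(tokenized_docs)
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    # document frequency: in how many documents each term occurs
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm for v in raw])
    return vocab, idf, rows

vocab, idf, rows = tfidf_rows(docs)
for term in vocab:
    print(f"{term:6s} idf={idf[term]:.3f}")
```

The term "space" occurs in every document and therefore receives the lowest idf weight, which is the same mechanism that lets `max_df` discard overly common terms.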




In [8]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.ensemble import GradientBoostingClassifier as GB
from sklearn.naive_bayes import MultinomialNB  as MNB
from sklearn.linear_model import SGDClassifier as SGD


text_clf1 = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer=Tokenizer1, max_df=0.3, min_df=0.0002, max_features=1000)),
    ('clf',   SGD(max_iter=1000)),
])


text_clf2 = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer=Tokenizer2, max_df=0.3, min_df=0.0002, max_features=1000)),
    ('clf',   SGD(max_iter=1000)),
])

text_clf3 = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer=Tokenizer1, max_df=0.3, min_df=0.0002, max_features=1000)),
    ('clf',   MNB()),
])


text_clf4 = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer=Tokenizer2, max_df=0.3, min_df=0.0002, max_features=1000)),
    ('clf',   MNB()),
])

text_clf5 = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer=Tokenizer1, max_df=0.3, min_df=0.0002, max_features=1000)),
    ('clf',   GB()),
])


text_clf6 = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer=Tokenizer2, max_df=0.3, min_df=0.0002, max_features=1000)),
    ('clf',   GB()),
])

## 4) Training the models (on a reduced version of the dataset, i.e. 4 labels only, due to limited computational power)

In [9]:
# Download the reduced training dataset along with its corresponding test set

categories = ['alt.atheism', 'talk.religion.misc',
              'comp.graphics', 'sci.space']

newsgroups_train = fetch_20newsgroups (subset='train',remove = ('headers', 'footers', 'quotes'), categories=categories)
newsgroups_test = fetch_20newsgroups (subset='test',  remove = ('headers', 'footers', 'quotes'), categories=categories)

print('Size of train data: ', newsgroups_train.filenames.shape)
print('Size of test  data: ', newsgroups_test.filenames.shape)

Size of train data:  (2034,)
Size of test  data:  (1353,)


In [10]:
#Tokenizer1 + SGD clf
text_clf1.fit(newsgroups_train.data, newsgroups_train.target)
predictions_train1 = text_clf1.predict(newsgroups_train.data)
predictions_test1 = text_clf1.predict(newsgroups_test.data)

#Tokenizer2 + SGD clf
text_clf2.fit(newsgroups_train.data, newsgroups_train.target)
predictions_train2 = text_clf2.predict(newsgroups_train.data)
predictions_test2 = text_clf2.predict(newsgroups_test.data)

#Tokenizer1 + MNB clf
text_clf3.fit(newsgroups_train.data, newsgroups_train.target)
predictions_train3 = text_clf3.predict(newsgroups_train.data)
predictions_test3 = text_clf3.predict(newsgroups_test.data)

#Tokenizer2 + MNB clf
text_clf4.fit(newsgroups_train.data, newsgroups_train.target)
predictions_train4 = text_clf4.predict(newsgroups_train.data)
predictions_test4 = text_clf4.predict(newsgroups_test.data)

#Tokenizer1 + GB clf
text_clf5.fit(newsgroups_train.data, newsgroups_train.target)
predictions_train5 = text_clf5.predict(newsgroups_train.data)
predictions_test5 = text_clf5.predict(newsgroups_test.data)

#Tokenizer2 + GB clf
text_clf6.fit(newsgroups_train.data, newsgroups_train.target)
predictions_train6 = text_clf6.predict(newsgroups_train.data)
predictions_test6 = text_clf6.predict(newsgroups_test.data)


## 5) Evaluating the models

#### a) A first sample (the first 20 texts) of the predictions versus the true labels:

In [11]:
print('Tokenizer1 + SGD clf, training set predictions : ' ,predictions_train1[0:20])
print('Tokenizer2 + SGD clf, training set predictions : ' ,predictions_train2[0:20])
print('                      training set real labels : ', newsgroups_train.target[0:20])
print('--------------------------------------------------------------------------------------------------')
print('Tokenizer1 + SGD clf, test set predictions : ' ,predictions_test1[0:20])
print('Tokenizer2 + SGD clf, test set predictions : ' ,predictions_test2[0:20])
print('                      test set real labels : ' ,newsgroups_test.target[0:20])


Tokenizer1 + SGD clf, training set predictions :  [1 3 2 3 2 0 2 1 2 1 2 1 1 2 1 2 0 2 2 3]
Tokenizer2 + SGD clf, training set predictions :  [1 3 2 3 2 0 2 1 2 1 2 1 2 2 1 2 0 2 2 3]
                      training set real labels :  [1 3 2 0 2 0 2 1 2 1 2 1 1 2 1 2 0 2 2 3]
--------------------------------------------------------------------------------------------------
Tokenizer1 + SGD clf, test set predictions :  [2 1 1 1 1 1 2 2 0 0 1 1 1 2 1 0 3 1 1 2]
Tokenizer2 + SGD clf, test set predictions :  [2 1 1 1 1 1 2 2 0 2 1 1 1 2 1 1 3 1 1 2]
                      test set real labels :  [2 1 1 1 1 1 2 2 0 2 1 1 1 2 1 0 0 0 1 2]


#### We can also inspect the tf-idf representation of the input data:

In [12]:
vectorizer = TfidfVectorizer(tokenizer=Tokenizer1, max_df=0.3, min_df=0.0002, max_features=1000)

X_tfidf_train = vectorizer.fit_transform(newsgroups_train.data)
print(X_tfidf_train.shape)

# Reuse the vocabulary and idf weights learned on the training set
X_tfidf_test = vectorizer.transform(newsgroups_test.data)
print(X_tfidf_test.shape)

(2034, 1000)
(1353, 1000)
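
Vocabulary and idf statistics are normally learned from the training texts only and then reused unchanged for the test texts, so that the columns of both matrices refer to the same terms. A toy sketch of this convention follows; `ToyCountVectorizer` is a hypothetical stand-in (plain counts, no idf) and not the real vectorizer.

```python
class ToyCountVectorizer:
    """Hypothetical stand-in: learns a vocabulary, then counts terms."""

    def fit(self, docs):
        terms = sorted({tok for doc in docs for tok in doc.split()})
        self.vocab_ = {term: col for col, term in enumerate(terms)}
        return self

    def transform(self, docs):
        rows = []
        for doc in docs:
            row = [0] * len(self.vocab_)
            for tok in doc.split():
                col = self.vocab_.get(tok)
                if col is not None:  # words unseen during fit are ignored
                    row[col] += 1
            rows.append(row)
        return rows

train_docs = ["space shuttle launch", "graphics card driver"]
test_docs = ["shuttle driver update"]

vec = ToyCountVectorizer().fit(train_docs)  # fit on the training data only
X_train = vec.transform(train_docs)
X_test = vec.transform(test_docs)           # reuse the learned vocabulary
print(sorted(vec.vocab_))
print(X_test)
```

Fitting a second time on the test texts would build a different vocabulary, so a given column would no longer mean the same term in both matrices, even when the shapes happen to match.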


#### b) Below we obtain the overall behavior of the examined models:

In [13]:
from sklearn.metrics import accuracy_score, precision_score, classification_report, confusion_matrix

print('Tokenizer1 + SGD clf')
print("Accuracy:", accuracy_score(newsgroups_test.target, predictions_test1))
print("Precision:", precision_score(newsgroups_test.target, predictions_test1, average='weighted'))
print(classification_report(newsgroups_test.target, predictions_test1))
print(confusion_matrix(newsgroups_test.target, predictions_test1))
print('-------------------------------------------------------------------------------------------')
print('Tokenizer2 + SGD clf')
print("Accuracy:", accuracy_score(newsgroups_test.target, predictions_test2))
print("Precision:", precision_score(newsgroups_test.target, predictions_test2, average='weighted'))
print(classification_report(newsgroups_test.target, predictions_test2))
print(confusion_matrix(newsgroups_test.target, predictions_test2))
print('-------------------------------------------------------------------------------------------')
print('Tokenizer1 + MNB clf')
print("Accuracy:", accuracy_score(newsgroups_test.target, predictions_test3))
print("Precision:", precision_score(newsgroups_test.target, predictions_test3, average='weighted'))
print(classification_report(newsgroups_test.target, predictions_test3))
print(confusion_matrix(newsgroups_test.target, predictions_test3))
print('-------------------------------------------------------------------------------------------')
print('Tokenizer2 + MNB clf')
print("Accuracy:", accuracy_score(newsgroups_test.target, predictions_test4))
print("Precision:", precision_score(newsgroups_test.target, predictions_test4, average='weighted'))
print(classification_report(newsgroups_test.target, predictions_test4))
print(confusion_matrix(newsgroups_test.target, predictions_test4))
print('-------------------------------------------------------------------------------------------')
print('Tokenizer1 + GB clf')
print("Accuracy:", accuracy_score(newsgroups_test.target, predictions_test5))
print("Precision:", precision_score(newsgroups_test.target, predictions_test5, average='weighted'))
print(classification_report(newsgroups_test.target, predictions_test5))
print(confusion_matrix(newsgroups_test.target, predictions_test5))
print('-------------------------------------------------------------------------------------------')
print('Tokenizer2 + GB clf')
print("Accuracy:", accuracy_score(newsgroups_test.target, predictions_test6))
print("Precision:", precision_score(newsgroups_test.target, predictions_test6, average='weighted'))
print(classification_report(newsgroups_test.target, predictions_test6))
print(confusion_matrix(newsgroups_test.target, predictions_test6))

Tokenizer1 + SGD clf
Accuracy: 0.6792313377679231
Precision: 0.6749082463610018
              precision    recall  f1-score   support

           0       0.59      0.55      0.57       319
           1       0.77      0.84      0.80       389
           2       0.75      0.73      0.74       394
           3       0.52      0.51      0.52       251

    accuracy                           0.68      1353
   macro avg       0.66      0.66      0.66      1353
weighted avg       0.67      0.68      0.68      1353

[[176  24  38  81]
 [ 20 327  31  11]
 [ 31  51 289  23]
 [ 72  25  27 127]]
-------------------------------------------------------------------------------------------
Tokenizer2 + SGD clf
Accuracy: 0.656319290465632
Precision: 0.6527905105178128
              precision    recall  f1-score   support

           0       0.56      0.52      0.54       319
           1       0.73      0.82      0.77       389
           2       0.75      0.71      0.72       394
           3       0
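
The reported accuracy and weighted precision can be recomputed from a confusion matrix alone. The sketch below uses the matrix printed above for Tokenizer1 + SGD and assumes the scikit-learn convention that rows are true labels and columns are predicted labels. The variable names are chosen for illustration.

```python
# Confusion matrix of Tokenizer1 + SGD (rows = true class, columns = predicted class)
cm = [[176, 24, 38, 81],
      [20, 327, 31, 11],
      [31, 51, 289, 23],
      [72, 25, 27, 127]]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
support = [sum(row) for row in cm]  # number of true instances per class
predicted = [sum(row[j] for row in cm) for j in range(n_classes)]
correct = [cm[k][k] for k in range(n_classes)]

# accuracy: correctly classified instances over all instances
accuracy = sum(correct) / total
# per-class precision, then averaged with the class supports as weights
precision = [correct[k] / predicted[k] for k in range(n_classes)]
weighted_precision = sum(p * s for p, s in zip(precision, support)) / total

print(f"accuracy           = {accuracy:.4f}")
print(f"weighted precision = {weighted_precision:.4f}")
```

Both values agree with the `Accuracy` and `Precision` lines printed for this model (about 0.679 and 0.675).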

**Summary:**

**1) In all cases, the first variant of the Tokenizer (i.e. removing stopwords and short words and applying stemming) performs best.**

**2) Among all six classifiers, the best performer is the Multinomial Naive Bayes classifier combined with the first variant of the Tokenizer.**

## 6) Further analysis of the best classifier (Tokenizer1 + MNB) using n-grams

So far, in all of the aforementioned classifiers, the ngram_range parameter was left at its default value of (1,1), meaning that the transformer considered only unigrams of the input text. In other words, each word was treated independently of the word preceding it.

However, it is natural to ask whether we could obtain better classification results by also allowing bigrams (i.e. pairs of successive words treated as a single feature) during training.

#### Below we examine 3 variants of the above classifier:

1.   Default arguments of TfidfVectorizer, i.e. ngram_range = (1,1)
2.   Use of unigrams and bigrams, i.e. ngram_range = (1,2)
3.   Use of bigrams only, i.e. ngram_range = (2,2)
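
To illustrate what the three settings mean, the following sketch lists the features generated from a short token sequence. `ngrams` is a hypothetical helper that only mimics the kind of features produced; the ordering inside the real vectorizer may differ.

```python
def ngrams(tokens, ngram_range=(1, 1)):
    """Return all n-grams with min_n <= n <= max_n, joined by spaces."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

tokens = ['world', 'war', 'camp']
print(ngrams(tokens, (1, 1)))  # ['world', 'war', 'camp']
print(ngrams(tokens, (1, 2)))  # ['world', 'war', 'camp', 'world war', 'war camp']
print(ngrams(tokens, (2, 2)))  # ['world war', 'war camp']
```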



####  "Define - Train - Test" each variant

In [16]:
# Only unigrams
text_clf3_variant1 = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer=Tokenizer1)),
    ('clf',   MNB()),])
text_clf3_variant1.fit(newsgroups_train.data, newsgroups_train.target)
predictions_test1 = text_clf3_variant1.predict(newsgroups_test.data)

# Unigrams + Bigrams
text_clf3_variant2 = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer=Tokenizer1, ngram_range = (1,2))),
    ('clf',   MNB()),])
text_clf3_variant2.fit(newsgroups_train.data, newsgroups_train.target)
predictions_test2 = text_clf3_variant2.predict(newsgroups_test.data)

# Only bigrams
text_clf3_variant3 = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer=Tokenizer1, ngram_range = (2,2))),
    ('clf',   MNB()),])
text_clf3_variant3.fit(newsgroups_train.data, newsgroups_train.target)
predictions_test3 = text_clf3_variant3.predict(newsgroups_test.data)

#### Evaluation of the models

In [17]:
print('Only unigrams')
print("Accuracy:", accuracy_score(newsgroups_test.target, predictions_test1))
print("Precision:", precision_score(newsgroups_test.target, predictions_test1, average='weighted'))
print(classification_report(newsgroups_test.target, predictions_test1))
print(confusion_matrix(newsgroups_test.target, predictions_test1))
print('---------------------------------------------------------------------------------------------')
print('Unigrams + Bigrams')
print("Accuracy:", accuracy_score(newsgroups_test.target, predictions_test2))
print("Precision:", precision_score(newsgroups_test.target, predictions_test2, average='weighted'))
print(classification_report(newsgroups_test.target, predictions_test2))
print(confusion_matrix(newsgroups_test.target, predictions_test2))
print('---------------------------------------------------------------------------------------------')
print('Only bigrams')
print("Accuracy:", accuracy_score(newsgroups_test.target, predictions_test3))
print("Precision:", precision_score(newsgroups_test.target, predictions_test3, average='weighted'))
print(classification_report(newsgroups_test.target, predictions_test3))
print(confusion_matrix(newsgroups_test.target, predictions_test3))


Only unigrams
Accuracy: 0.7287509238728751
Precision: 0.74671267456911
              precision    recall  f1-score   support

           0       0.59      0.71      0.65       319
           1       0.88      0.89      0.89       389
           2       0.71      0.91      0.80       394
           3       0.80      0.21      0.33       251

    accuracy                           0.73      1353
   macro avg       0.74      0.68      0.67      1353
weighted avg       0.75      0.73      0.70      1353

[[227  14  65  13]
 [  4 347  38   0]
 [ 14  21 359   0]
 [139  13  46  53]]
---------------------------------------------------------------------------------------------
Unigrams + Bigrams
Accuracy: 0.7184035476718403
Precision: 0.7473780990379352
              precision    recall  f1-score   support

           0       0.61      0.71      0.65       319
           1       0.89      0.87      0.88       389
           2       0.66      0.93      0.77       394
           3       0.84     

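### Results: Using both unigrams and bigrams gives a marginally higher weighted precision on the test data (0.7474 vs. 0.7467 with unigrams only), although its accuracy is slightly lower (0.718 vs. 0.729). We keep this variant for the tuning step.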

## 7) Tuning and Evaluating the final model

#### We take the best classifier found so far (Tokenizer1 + MNB + unigrams + bigrams) and use a grid search to find better parameters for the whole pipeline.

#### More settings could be explored; here we create a grid of 24 separate cases.

In [250]:
from sklearn.model_selection import GridSearchCV

parameters = {
    'tfidf__max_features': (5000, None),
    'tfidf__max_df': (1.0, 0.75, 0.5),  # floats are document proportions; an integer 1 would mean "at most 1 document"
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.9, 1),
    #'clf__n_estimators': (50, 100, 250),
}

text_tune = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer=Tokenizer1, ngram_range = (1,2))),
    ('clf',   MNB()),
])


# the `iid` argument was removed in scikit-learn 0.24, so it is not passed here
gs_clf = GridSearchCV(text_tune, parameters, cv=3, n_jobs=-1)
gs_clf.fit(newsgroups_train.data, newsgroups_train.target)

print("Best score: %0.3f" % gs_clf.best_score_)
print("Best parameters set:")
best_parameters = gs_clf.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))



Best score: 0.792
Best parameters set:
	clf__alpha: 0.9
	tfidf__max_df: 0.75
	tfidf__max_features: 5000
	tfidf__norm: 'l2'
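
The figure of 24 cases follows from multiplying the number of options per parameter (2 x 3 x 2 x 2). A short sketch that enumerates them with `itertools.product`; the dictionary `grid` has the same option counts as the grid above, and `cv_folds` matches the `cv=3` setting.

```python
from itertools import product

grid = {
    'tfidf__max_features': (5000, None),
    'tfidf__max_df': (1.0, 0.75, 0.5),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.9, 1),
}

# One dictionary per parameter combination
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
cv_folds = 3

print(len(combinations))             # 24 parameter combinations
print(len(combinations) * cv_folds)  # 72 cross-validation fits
```

This is why the search becomes expensive quickly: every additional option multiplies the number of pipelines that have to be fitted.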


### Below we apply the best parameters and indeed obtain better predictions (test accuracy 0.759, up from 0.718 before tuning)!

In [251]:
text_clf_tuned = text_tune.set_params(**best_parameters)
text_clf_tuned.fit(newsgroups_train.data, newsgroups_train.target)
predictions_train = text_clf_tuned.predict(newsgroups_train.data)
predictions_test = text_clf_tuned.predict(newsgroups_test.data)


print("Accuracy:", accuracy_score(newsgroups_test.target, predictions_test))
print("Precision:", precision_score(newsgroups_test.target, predictions_test, average='weighted'))
print(classification_report(newsgroups_test.target, predictions_test))
print(confusion_matrix(newsgroups_test.target, predictions_test))


Accuracy: 0.7590539541759054
Precision: 0.7646233573261035
              precision    recall  f1-score   support

           0       0.62      0.72      0.67       319
           1       0.88      0.90      0.89       389
           2       0.77      0.88      0.82       394
           3       0.77      0.39      0.52       251

    accuracy                           0.76      1353
   macro avg       0.76      0.72      0.72      1353
weighted avg       0.76      0.76      0.75      1353

[[231  13  47  28]
 [  8 351  29   1]
 [ 22  25 347   0]
 [113  11  29  98]]


## 8) Behaviour of the final model over the 20-label dataset

In [253]:
newsgroups_train = fetch_20newsgroups (subset='train',remove = ('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups (subset='test',  remove = ('headers', 'footers', 'quotes'))


text_clf_tuned.fit(newsgroups_train.data, newsgroups_train.target)
predictions_train2 = text_clf_tuned.predict(newsgroups_train.data)
predictions_test2 = text_clf_tuned.predict(newsgroups_test.data)


print("Accuracy:", accuracy_score(newsgroups_test.target, predictions_test2))
print("Precision:", precision_score(newsgroups_test.target, predictions_test2, average='weighted'))
print(classification_report(newsgroups_test.target, predictions_test2))
print(confusion_matrix(newsgroups_test.target, predictions_test2))


Accuracy: 0.6477695167286245
Precision: 0.6699598130689187
              precision    recall  f1-score   support

           0       0.54      0.32      0.40       319
           1       0.54      0.64      0.58       389
           2       0.61      0.55      0.58       394
           3       0.58      0.63      0.60       392
           4       0.66      0.58      0.62       385
           5       0.71      0.73      0.72       395
           6       0.78      0.74      0.76       390
           7       0.70      0.67      0.68       396
           8       0.75      0.70      0.73       398
           9       0.86      0.77      0.81       397
          10       0.55      0.91      0.69       399
          11       0.72      0.72      0.72       396
          12       0.61      0.50      0.55       393
          13       0.78      0.69      0.73       396
          14       0.78      0.72      0.75       394
          15       0.45      0.88      0.59       398
          16       0.5