# Exercise - 20 Newsgroups Dataset - Solution

### Introducing a solution

**In this notebook, we only give guidelines on how the problem can be solved. You can use it to resolve an error you might have encountered or just for inspiration. The notebook does not necessarily present the best solution.**

### Introducing the assignment

In this assignment, you will work with the **20 newsgroups text dataset**, one of the real-world datasets that can be fetched directly through sklearn. The full dataset consists of around 18,000 newsgroup posts on 20 topics. Note that `fetch_20newsgroups()` returns only the training subset (11,314 posts) by default.

The code in the following sections is already implemented:
* **Importing the necessary libraries** - **some** of the libraries necessary for the next section are imported. The rest we leave for you to import.
* **Reading the database** - in this section, we do the following:
    - fetch the 20 newsgroups dataset
    - display the type of the **newsgroups** variable
    - display the names of all classes
    - display the first post in the dataset to get an idea of what the data looks like
    - display the targets
    - using the Counter class, count the number of times each target has occurred in the list of targets
    
Your task is to build a Naive Bayes model in a similar fashion to the spam-filtering model we built during the course. Then, analyze your results with the help of a confusion matrix and a classification report. Test both the multinomial and the complement Naive Bayes classifiers.

*Hint: Make use of the **categories** variable to print out the classification report.*

Good luck and have fun!

### Importing the necessary libraries

In [None]:
from sklearn.datasets import fetch_20newsgroups

from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

import matplotlib.pyplot as plt
import seaborn as sns

import numpy as np

### Reading the database

In [None]:
newsgroups = fetch_20newsgroups()

In [None]:
type(newsgroups)

In [None]:
# Store the class names and display them as the cell output
categories = newsgroups.target_names
categories

In [None]:
newsgroups.data[0]

In [None]:
newsgroups.target

In [None]:
Counter(newsgroups.target)
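
The `Counter` output above is keyed by integer labels, which are hard to read on their own. A minimal sketch of mapping the counts back to class names follows. The `toy_target` and `toy_names` variables are hypothetical stand-ins for `newsgroups.target` and `newsgroups.target_names`, so the cell runs without downloading the data.

```python
from collections import Counter

import numpy as np

# Hypothetical stand-ins for newsgroups.target and newsgroups.target_names
toy_target = np.array([0, 2, 1, 0, 2, 2])
toy_names = ['alt.atheism', 'comp.graphics', 'sci.space']

# Count how often each integer label occurs
counts = Counter(toy_target.tolist())

# Map each integer label back to its class name, in label order
named_counts = {toy_names[label]: counts[label] for label in sorted(counts)}
print(named_counts)
# {'alt.atheism': 2, 'comp.graphics': 1, 'sci.space': 3}
```

On the real data, the same comprehension should work with `newsgroups.target` and `categories` in place of the toy variables.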

### Defining the inputs and the target

In [None]:
inputs = newsgroups.data
target = newsgroups.target

In [None]:
len(target)

### Creating the train-test split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(inputs, target,
                                                    test_size=0.2,
                                                    random_state=365,
                                                    stratify=target)
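
The `stratify` argument asks `train_test_split` to keep the class proportions the same in the training and test sets. The sketch below checks this on a small imbalanced toy dataset. The `toy_inputs` and `toy_target` names are hypothetical and only illustrate the behaviour.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 15 posts of class 0 and 5 posts of class 1 (75% / 25%)
toy_inputs = [f"post {i}" for i in range(20)]
toy_target = [0] * 15 + [1] * 5

x_tr, x_te, y_tr, y_te = train_test_split(toy_inputs, toy_target,
                                          test_size=0.2,
                                          random_state=365,
                                          stratify=toy_target)

# Both subsets keep the 3:1 ratio of the full dataset
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts[0], train_counts[1])  # 12 4
print(test_counts[0], test_counts[1])    # 3 1
```

Without `stratify`, a small test set could end up with few or no samples of a rare class, which would make the per-class metrics unreliable.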

### Tokenizing the newsgroup posts

In [None]:
vectorizer = CountVectorizer()

x_train_transf = vectorizer.fit_transform(x_train)
x_test_transf = vectorizer.transform(x_test)
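
Note that the vectorizer is fitted on the training posts only and merely applied to the test posts. The following sketch, using two made-up training sentences and one made-up test sentence, illustrates the consequence: words that never appear in the training data are ignored at test time.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents, used only for illustration
train_docs = ["the rocket reached orbit",
              "the graphics card renders the scene"]
test_docs = ["the rocket renders graphics on mars"]

vec = CountVectorizer()

# fit_transform learns the vocabulary from the training documents
train_matrix = vec.fit_transform(train_docs)

# transform reuses that vocabulary; "on" and "mars" were never seen, so they are dropped
test_matrix = vec.transform(test_docs)

print(sorted(vec.vocabulary_))
# ['card', 'graphics', 'orbit', 'reached', 'renders', 'rocket', 'scene', 'the']
print(test_matrix.toarray())
# [[0 1 0 0 1 1 0 1]]
```

Fitting the vectorizer on the test set as well would leak information about the test data into the model, so `fit_transform` is reserved for the training set.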

### Performing the classification

In [None]:
clf = MultinomialNB()

clf.fit(x_train_transf, y_train)

### Performing the evaluation on the test dataset

In [None]:
y_test_pred = clf.predict(x_test_transf)

In [None]:
sns.reset_orig()

ConfusionMatrixDisplay.from_predictions(
    y_test, y_test_pred,
    labels = clf.classes_,
    cmap = 'magma'
);
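
To read the plot, recall that each row corresponds to a true class and each column to a predicted class, so the diagonal holds the correctly classified posts. A small sketch with made-up labels shows the same matrix in numeric form through `confusion_matrix`. The `y_true` and `y_pred` lists are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 1]]

# The diagonal counts the correct predictions, so accuracy is trace / total
accuracy = cm.trace() / cm.sum()
print(round(accuracy, 3))  # 0.667
```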

In [None]:
print(classification_report(y_test, y_test_pred, target_names = categories))
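
The assignment also asks for the complement Naive Bayes classifier, while the cells above fit only `MultinomialNB`. One possible way to compare the two is to loop over both models and print a classification report for each. The sketch below does this on a tiny made-up corpus so that it runs on its own. The documents, labels, and `toy_categories` are hypothetical, and the scores on such a small sample say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import ComplementNB, MultinomialNB

# Hypothetical two-class corpus standing in for the newsgroup posts
toy_categories = ['sci.space', 'rec.autos']
train_docs = ["the shuttle reached orbit after launch",
              "nasa plans a new moon mission",
              "the telescope observed a distant galaxy",
              "my car engine needs new oil",
              "the dealer sold me a used sedan",
              "brake pads and tires wear out"]
train_labels = [0, 0, 0, 1, 1, 1]
test_docs = ["the rocket launch reached orbit",
             "the sedan needs new tires"]
test_labels = [0, 1]

vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)

# Fit both classifiers on the same features and collect one report per model
reports = {}
for model in (MultinomialNB(), ComplementNB()):
    model.fit(X_train, train_labels)
    pred = model.predict(X_test)
    name = type(model).__name__
    reports[name] = classification_report(test_labels, pred,
                                          target_names=toy_categories,
                                          zero_division=0)
    print(name)
    print(reports[name])
```

In the notebook itself, the same loop should work with `x_train_transf`, `y_train`, `x_test_transf`, `y_test`, and `categories` in place of the toy variables.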