# Naive Bayes for Document Classification

*This lab session is inspired by the online [scikit-learn tutorial](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html).*

In this session, we will apply a simple naive Bayes classifier to document classification. In particular, we will use the [*Twenty Newsgroups* dataset](http://qwone.com/~jason/20Newsgroups), which is a collection of approximately 20,000 documents, partitioned (nearly) evenly across 20 different newsgroups, such as `soc.religion.christian`, `comp.graphics`, etc. Some of the newsgroups are very closely related to each other (e.g., `comp.sys.ibm.pc.hardware` and `comp.sys.mac.hardware`), while others are largely unrelated (e.g., `misc.forsale` and `soc.religion.christian`).

## Data: 20 Newsgroups

The data can be downloaded using scikit-learn's `fetch_20newsgroups` function (the `categories` list is defined further below):

```python
from sklearn.datasets import fetch_20newsgroups
data_train = fetch_20newsgroups(subset='train',        # Which set to fetch (train/test)
                                categories=categories, # Which categories to fetch
                                shuffle=True,          # Shuffle the data?
                                random_state=42)       # Seed
```
For the purposes of this session, we will consider only 4 of the 20 categories, namely:
```python
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
```
**Download the data.** Putting these together, we download the training data as follows:

In [None]:
# This may take some time when executed for the first time
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
data_train = fetch_20newsgroups(subset='train',        # Which set to fetch (train/test)
                                categories=categories, # Which categories to fetch
                                shuffle=True,          # Shuffle the data?
                                random_state=42)       # Seed

**Getting familiar with the data format.** Now, `data_train` is a dictionary-like object (a scikit-learn `Bunch`) that contains the data. In particular,
- `data_train['data']` contains the raw documents.
- `data_train['target']` has the label (indicating the category) of each document.
- `data_train['target_names']` contains the category names.

In [None]:
# Print the number of documents
print('Number of documents: {}'.format(len(data_train['data'])))

In [None]:
# Print the first document
print(data_train['data'][0])

In [None]:
# Print the label of the first document
print('Label of the first document: {}'.format(data_train['target'][0]))

In [None]:
# Print the label names
print(data_train['target_names'])

**Text preprocessing.** Text preprocessing, tokenizing, and filtering of stop words are provided by a high-level component of scikit-learn, `CountVectorizer`, which builds a dictionary of features and transforms documents into feature vectors (stop-word filtering is off by default). Let's convert the data into a sparse matrix of word counts:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data_train['data'])
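
To see what `fit_transform` produces on something small enough to check by hand, here is a minimal sketch on a made-up corpus. The names `toy_docs`, `toy_vect`, and `toy_counts` are illustrative and not part of the lab.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, small enough to check by hand
toy_docs = ["the cat sat", "the cat sat on the mat", "dogs chase cats"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)  # SciPy sparse matrix of counts

print(toy_counts.shape)              # shape of the count matrix
print(sorted(toy_vect.vocabulary_))  # learned vocabulary, in alphabetical order
print(toy_counts.toarray())          # dense view, only sensible for tiny data
```

Calling `toarray()` is convenient here, but on the full newsgroups matrix it would allocate a large dense array, so prefer slicing the sparse matrix there.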

**[Task]** Print the size of `X_train_counts` (use cell below). What do the columns of `X_train_counts` correspond to? And the number of rows?

**[Task]** Print a block of the sparse matrix (use cell below) to verify that it contains integers. For instance, you can print rows `0:2` and columns `0:2000`. What's the meaning of the printed numbers?

`CountVectorizer` supports counts of N-grams of words or consecutive characters. In addition, it automatically builds a dictionary of feature indices. For instance, we can find out which index corresponds to the word "email" with `count_vect.vocabulary_.get('email')`:

In [None]:
print('The word "email" corresponds to index {}'.format(count_vect.vocabulary_.get('email')))
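
As a rough illustration of the N-gram options mentioned above, the sketch below fits two vectorizers on made-up strings: one with word unigrams and bigrams (`ngram_range=(1, 2)`), and one with character trigrams (`analyzer='char'`). The variable names and the toy strings are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents
toy_docs = ["naive bayes classifier", "bayes rule"]

# ngram_range=(1, 2) keeps single words and pairs of consecutive words
bigram_vect = CountVectorizer(ngram_range=(1, 2))
bigram_vect.fit(toy_docs)
print(sorted(bigram_vect.vocabulary_))

# analyzer='char' switches to N-grams of consecutive characters
char_vect = CountVectorizer(analyzer='char', ngram_range=(3, 3))
char_vect.fit(["bayes"])
print(sorted(char_vect.vocabulary_))
```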

## Training a Naive Bayes Classifier

Here, we use scikit-learn to fit a naive Bayes classifier to our data collection. Scikit-learn includes several variants of this classifier; the most suitable one for document classification is the multinomial (`MultinomialNB`).

**Train the classifier.** The `fit` method of `MultinomialNB` learns the parameters of the classifier via maximum a posteriori (MAP) estimation.

In [None]:
from sklearn.naive_bayes import MultinomialNB
naive_classifier = MultinomialNB().fit(X_train_counts, data_train['target'])
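
To make the parameter estimation less opaque, the following NumPy sketch computes additively smoothed word probabilities on a made-up count matrix and uses them to score a new document. It is a simplified illustration, assuming smoothing strength `alpha = 1.0` (the `MultinomialNB` default) and empirical class frequencies as priors. The arrays `X`, `y`, and `x_new` are hypothetical.

```python
import numpy as np

# Hypothetical word-count matrix: 4 documents, 3 vocabulary words
X = np.array([[2, 1, 0],
              [1, 0, 0],
              [0, 1, 3],
              [0, 0, 2]])
y = np.array([0, 0, 1, 1])   # class label of each document
alpha = 1.0                  # smoothing strength (assumed, matches the default)

classes = np.unique(y)
# Class priors: fraction of training documents in each class
log_prior = np.log(np.array([(y == c).mean() for c in classes]))
# Smoothed word probabilities: (count + alpha) / (total + alpha * vocabulary size)
word_counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
log_word_prob = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))

# Score a new document: log prior plus count-weighted word log-probabilities
x_new = np.array([0, 1, 1])
scores = log_prior + log_word_prob @ x_new
print(classes[np.argmax(scores)])
```

The smoothing keeps every word probability strictly positive, so a word that never appeared in a class during training does not force that class's score to minus infinity.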

**Classify documents from the test set.** As an example, here we fetch the test set and then classify a few of its documents.

In [None]:
# Fetch the test data
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42)

**[Task]** In the cell below, print the two first documents in the test set.

To predict the label of a new document, we need to extract the word counts, as we did before. The only difference is that we call `transform` instead of `fit_transform`, because the vectorizer has already been fit to the training set. The method that makes the actual predictions is called `predict`. We encapsulate this pipeline in a Python function:

In [None]:
def classify_pipeline(docs):
    X_test_counts = count_vect.transform(docs)
    X_test_predicted = naive_classifier.predict(X_test_counts)
    return X_test_predicted
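
As an alternative to a hand-written function, scikit-learn's `Pipeline` class can chain the vectorizer and the classifier into one object. The sketch below shows this on a made-up two-class corpus. The documents, labels, and the names `text_clf` and `toy_pred` are illustrative assumptions, not part of the lab.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training set with two classes (0 = graphics, 1 = medicine)
toy_docs = ["render the image with opengl", "3d graphics card driver",
            "the doctor prescribed medicine", "patient symptoms and treatment"]
toy_labels = [0, 0, 1, 1]

text_clf = Pipeline([
    ('vect', CountVectorizer()),   # raw text -> word counts
    ('clf', MultinomialNB()),      # word counts -> predicted label
])
text_clf.fit(toy_docs, toy_labels)  # fits both steps on the training text

# predict() applies transform (not fit_transform) before classifying
toy_pred = text_clf.predict(["graphics driver for opengl",
                             "medicine for the patient"])
print(toy_pred)
```

A pipeline makes it harder to accidentally refit the vectorizer on test data, since `predict` only ever calls `transform` on the fitted steps.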

**[Task]** In the cell below, classify the two first documents in the test set and print the predicted labels. For that, use the function `classify_pipeline(data_test['data'][0:2])`.

**[Question]** Based on reading the two documents that you printed above, do you think the predicted classes make sense?

**Evaluation: Accuracy.** We now classify *all* the documents in the test set and compute how many classifications were correct. The fraction of correct classifications is called *accuracy*.

In [None]:
import numpy as np
# Predict labels for all documents in test set
X_test_predicted = classify_pipeline(data_test['data'])
# Compute and print accuracy
acc = np.mean(X_test_predicted == data_test['target'])
print('The accuracy on the test set is: {:.6f}'.format(acc))
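
For reference, scikit-learn also provides `accuracy_score`, which computes the same quantity. The sketch below compares both on made-up label arrays (`y_true` and `y_pred` are hypothetical).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels for five documents
y_true = np.array([0, 1, 2, 2, 3])
y_pred = np.array([0, 1, 1, 2, 3])

acc_manual = np.mean(y_pred == y_true)        # fraction of matching entries
acc_sklearn = accuracy_score(y_true, y_pred)  # same quantity via scikit-learn
print(acc_manual, acc_sklearn)
```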

**Evaluation: Precision and recall.** Scikit-learn can also compute more advanced metrics, such as precision, recall, and F1-score.

In [None]:
from sklearn import metrics
print(metrics.classification_report(data_test['target'], X_test_predicted,
                                    target_names=data_test['target_names']))
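
To see where these numbers come from, here is a small sketch that computes precision, recall, and F1-score by hand for a single class, using made-up labels. The arrays and the choice of class `c = 1` are assumptions for illustration.

```python
import numpy as np

# Hypothetical true and predicted labels for six documents
y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 0, 2])

c = 1  # class under inspection
tp = np.sum((y_pred == c) & (y_true == c))  # predicted c, truly c
fp = np.sum((y_pred == c) & (y_true != c))  # predicted c, truly another class
fn = np.sum((y_pred != c) & (y_true == c))  # truly c, predicted another class

precision = tp / (tp + fp)  # of the documents predicted as c, how many are c
recall = tp / (tp + fn)     # of the documents that are c, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(precision, recall, f1)
```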

**[Question]** What is the "support" column above?

**Evaluation: Confusion matrix.** We can also print the confusion matrix using scikit-learn.

In [None]:
print(metrics.confusion_matrix(data_test['target'], X_test_predicted))
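
The layout of the matrix is easier to read on a tiny example. The sketch below builds a confusion matrix from made-up labels (`y_true` and `y_pred` are hypothetical). In scikit-learn's convention, rows correspond to true classes and columns to predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for six documents
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]

# Entry [i, j] counts documents of true class i that were predicted as class j
cm = confusion_matrix(y_true, y_pred)
print(cm)
```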

**[Question]** Take a moment to look at the confusion matrix. What does it indicate? Which are the two categories that are more likely to be confused?

## Exercises

**[Task]** Now replace the naive Bayes approach with a neural network classifier. For this, you need to use the following scikit-learn class:
```python
from sklearn.neural_network import MLPClassifier
nn_classifier = MLPClassifier(solver='adam',                 # which variant of gradient descent
                              alpha=1.0,                     # regularization term
                              activation='relu',             # which non-linear function
                              hidden_layer_sizes=(64, 64),   # units per hidden layer
                              random_state=42).fit(X_train_counts, data_train['target'])
```

Fit a neural network and evaluate its performance on the test set (accuracy, precision, recall, F1-score, and confusion matrix).

**[Task]** Now replace the naive Bayes approach with another, more complex classifier. In particular, use a support vector machine (SVM). The SVM is a discriminative classifier that uses a hyperplane to separate the classes, optionally after applying a non-linear transformation of the input (`LinearSVC`, used below, works directly on the input features). You don't need to know the details now. All you need to know is that scikit-learn also provides a class for the SVM:

```python
from sklearn.svm import LinearSVC

# Train the classifier
svm_classifier = LinearSVC(C=1.0).fit(X_train_counts, data_train['target'])
```

Fit an SVM and evaluate its performance on the test set (accuracy, precision, recall, F1-score, and confusion matrix).