# NLP Core 3 Exercise: Bagged News

In this exercise we will learn how to perform document classification: we will predict the categories of news articles from the Reuters Corpus using a **bag-of-words** model built from **one-hot encoding**. We will then see how **TF-IDF** weighting can improve our features for classification.

## The Reuters Corpus

The Reuters Corpus is a collection of news documents with category tags that is commonly used as a benchmark for document classification. It is split into two sets: the *training* documents, used to train a classification algorithm, and the *test* documents, used to evaluate the classifier's performance. Here we load the corpus and save the IDs of the training and test documents:

In [0]:
import nltk
nltk.download('reuters')
from nltk.corpus import reuters
train_ids = [d for d in reuters.fileids() if d.startswith('train')]
test_ids = [d for d in reuters.fileids() if d.startswith('test')]

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


**Questions**:
  1. How many documents are in the Reuters Corpus? What percentage are training and what percentage are testing documents?
  2. How many words are in the training documents? (use *reuters.words(file_id)*)

Let's have a look at the categories in the Reuters Corpus. Note that one document can have more than one category:

In [0]:
print(len(reuters.categories()), 'categories in the Reuters Corpus')
print('Categories of one sample document:', reuters.categories(train_ids[9]))
print('Sample from that document:', reuters.raw(train_ids[9])[:98])

90 categories in the Reuters Corpus
Categories of one sample document: ['cocoa', 'coffee', 'sugar']
Sample from that document: COFFEE, SUGAR AND COCOA EXCHANGE NAMES CHAIRMAN
  The New York Coffee, Sugar and Cocoa
  Exchange 


**Question:**
  3. What are the three most common categories in the training documents? (use *reuters.categories(file_id)*)

## Bag of words representations

We will now see how a sentence can be transformed into a feature vector using a bag of words model. Consider the following sentences:

In [0]:
sentences = [
  'This is the first document.',
  'This document is the second document.',
  'And this is the third one.',
  'Is this the first document?',
]

We can represent each word as a **one-hot** encoded vector (with a single 1 in the column for that word), and add their vectors together to get the feature vector for a sentence:

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)
X.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
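
To make the "add the one-hot vectors together" idea concrete, here is a minimal pure-Python sketch on two made-up sentences (not the ones above). The helper names `tokenize`, `one_hot` and `bag_of_words` are hypothetical, and the tokenizer is only a rough stand-in for CountVectorizer's default behaviour (lowercasing and keeping tokens of two or more word characters), so treat it as an illustration rather than the library's implementation:

```python
import re

def tokenize(text):
    # Rough approximation of CountVectorizer's defaults (an assumption):
    # lowercase the text and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def one_hot(word, vocabulary):
    # A vector with a single 1 in the position of `word`.
    return [1 if word == entry else 0 for entry in vocabulary]

def bag_of_words(text, vocabulary):
    # Sum the one-hot vectors of every token in the text.
    vector = [0] * len(vocabulary)
    for word in tokenize(text):
        vector = [a + b for a, b in zip(vector, one_hot(word, vocabulary))]
    return vector

toy_sentences = ["The cat sat.", "The cat saw the dog."]
vocabulary = sorted({w for s in toy_sentences for w in tokenize(s)})
toy_vectors = [bag_of_words(s, vocabulary) for s in toy_sentences]

print(vocabulary)   # ['cat', 'dog', 'sat', 'saw', 'the']
print(toy_vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Note that "the" appears twice in the second toy sentence, so its entry is 2: the summed vector counts occurrences and discards word order.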

**Questions:**
  4. What do the rows and columns of the feature matrix X represent?
  5. What word does the second column of X represent? What about the third column? (If you are stuck, look at *vectorizer.get_feature_names_out()*; in scikit-learn versions before 1.0 the method is called *get_feature_names()*)
 
**Bonus**: Try using TfidfVectorizer instead of CountVectorizer, and try to explain why some values of X become smaller than others.
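
As a hint for the bonus, the following sketch computes TF-IDF weights by hand on two made-up sentences. It assumes scikit-learn's documented defaults for TfidfVectorizer (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row); the helper names are hypothetical and the tokenizer is only an approximation of the library's default:

```python
import math
import re

def tokenize(text):
    # Rough approximation of the default tokenizer (an assumption).
    return re.findall(r"\b\w\w+\b", text.lower())

toy_docs = ["The cat sat.", "The cat saw the dog."]
n_docs = len(toy_docs)

# Document frequency: in how many documents does each word occur?
doc_freq = {}
for doc in toy_docs:
    for word in set(tokenize(doc)):
        doc_freq[word] = doc_freq.get(word, 0) + 1

# Smoothed inverse document frequency (assumed default, smooth_idf=True).
idf = {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}

def tfidf_vector(text, vocabulary):
    tokens = tokenize(text)
    # Term count times IDF, then scale the row to unit L2 length.
    weights = [tokens.count(w) * idf[w] for w in vocabulary]
    norm = math.sqrt(sum(x * x for x in weights))
    return [x / norm for x in weights] if norm else weights

tfidf_vocabulary = sorted(doc_freq)
tfidf_rows = [tfidf_vector(doc, tfidf_vocabulary) for doc in toy_docs]

print(tfidf_vocabulary)
for row in tfidf_rows:
    print([round(x, 3) for x in row])
```

Compare the weights of words that occur in every toy document with those that occur in only one.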

## Classifying Reuters

Now let's put these together in order to build a classifier for Reuters articles. Fill in the following code using the instructions in the questions below:

In [0]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

train_docs = [reuters.raw(train_id) for train_id in train_ids]
test_docs = [reuters.raw(test_id) for test_id in test_ids]

In [0]:
#### (A) add code here from question 6

In [0]:
# convert each document's list of categories into a binary indicator vector (one column per category)
mlb = MultiLabelBinarizer()
y = mlb.fit_transform([reuters.categories(train_id) for train_id in train_ids])
y2 = mlb.transform([reuters.categories(test_id) for test_id in test_ids])
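
To see what kind of matrix the binarizer produces, here is a pure-Python sketch on hypothetical toy labels. It assumes the columns follow the sorted category names, which is how MultiLabelBinarizer orders them by default (the actual order used above is stored in `mlb.classes_`):

```python
# Hypothetical label lists for three toy documents.
toy_labels = [["cocoa", "coffee"], ["sugar"], ["coffee"]]

# One column per distinct category, in sorted order (assumed default order).
toy_classes = sorted({label for labels in toy_labels for label in labels})

# Each row has a 1 for every category the document belongs to.
toy_indicator = [
    [1 if c in labels else 0 for c in toy_classes]
    for labels in toy_labels
]

print(toy_classes)  # ['cocoa', 'coffee', 'sugar']
for row in toy_indicator:
    print(row)
# [1, 1, 0]
# [0, 0, 1]
# [0, 1, 0]
```

A row may contain several 1s, which is what lets a single document carry more than one category.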

In [0]:
#### (B) add code here from question 7

In [0]:
# show classifier's performance (look at average scores at the bottom)
print(classification_report(y2, predictions))

**Questions:**
  6. In (A) above, add code to convert the training and testing documents into matrices X and X2 of feature vectors using CountVectorizer(). (Hint: use fit_transform() first on the training set, and then transform() on the testing set)
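  7. In (B) above, add code to fit a one-vs-rest SVM classifier (one binary SVM per category) on the training data (X, y), and store its predictions for the test data (X2) in a variable named *predictions*, which the next cell passes to classification_report(). (Hint: use *OneVsRestClassifier(LinearSVC())* as the classifier object, and then call its fit() and predict() methods.)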
  
**Bonus**: Try using TF-IDF (TfidfVectorizer) weighted features. Does the classifier's performance improve?