# Classification of Newswires: A Multiclass Classification Example

In the Internet Movie Database (IMDb) example we classified reviews into two classes. In this example we will build a network to classify Reuters newswires into 46 mutually exclusive topics. Because each newswire belongs to exactly one of many topics, this is a single-label, multiclass classification problem. This example comes from the book "Deep Learning with Python" by François Chollet.

## The Reuters Dataset

This is a set of short newswires and their topics, published by Reuters in 1986. Newswire services such as the Associated Press (AP), Reuters, and Bloomberg provide news and information to media outlets, businesses, and governments.

This dataset is widely used for text classification. Each of the 46 topics has at least 10 examples in the training set.

In [7]:
%config IPCompleter.greedy=True

In [2]:
from keras.datasets import reuters # comes as part of Keras
# keep only the 10,000 most frequently occurring words
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words = 10000)

Check the size of the training and test sets:

In [3]:
len(train_data)

8982

In [4]:
len(test_data)

2246

We have 8,982 training examples and 2,246 test examples. Just like in the IMDb set, each example is a list of integers (word indices).

In [6]:
# train_data[10]

We can decode an example back to words using the following code. The indices are offset by 3 because 0, 1, and 2 are reserved for "padding", "start of sequence", and "unknown".

In [15]:
word_index = reuters.get_word_index()
reverse_word_index = {value: key for (key, value) in word_index.items()}
# Note that our indices were offset by 3
# because 0, 1 and 2 are reserved indices for "padding", "start of sequence", and "unknown".
decoded_newswire = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[2312]])

In [16]:
decoded_newswire

'? data access systems inc said chairman david cohen has sold 1 800 000 common shares to phoenix financial corp for undisclosed terms and resigned as chairman and chief executive officer the company said phoenix financial now has a 27 pct interest in data access and effective control data access said phoenix chairman martin s ? has been named chairman of data access as well and two other phoenix representatives have been named to the data access board it said four directors other than cohen have resigned from the board reuter 3'
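To make the offset concrete, here is a minimal sketch that uses a tiny made-up word index. The names `toy_word_index` and `toy_sequence` are hypothetical and are not part of the Reuters data. Reserved indices have no entry once shifted by 3, so they decode to `?`, which is why the decoded newswire above starts with a `?`.

```python
# Hypothetical word index, in the same word -> rank format as reuters.get_word_index()
toy_word_index = {"the": 1, "said": 2, "company": 3}

# Invert the mapping so we can look up words by rank
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}

# An encoded sequence: 1 marks "start of sequence", 2 marks "unknown",
# and real words are stored as rank + 3
toy_sequence = [1, 4, 6, 2, 5]

decoded = " ".join(toy_reverse_index.get(i - 3, "?") for i in toy_sequence)
print(decoded)  # ? the company ? said
```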

### Data Prep: Vectorization

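The lists of integers cannot be fed into the network directly, so we vectorize them. Each newswire becomes a 10,000-dimensional vector with a 1 at every word index that occurs in it and 0 everywhere else.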

In [18]:
import numpy as np

def vectorize_sequences(sequences, dimension = 10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
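The behavior of `vectorize_sequences` is easier to see on a tiny input. The sketch below repeats the function and applies it to two made-up sequences over an 8-word vocabulary (`toy_sequences` and the `dimension=8` setting are assumptions chosen only for illustration).

```python
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # One row per sequence, one column per possible word index
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Integer-list indexing sets every listed column of row i at once
        results[i, sequence] = 1.0
    return results

# Hypothetical toy input: two short "newswires" over an 8-word vocabulary
toy_sequences = [[1, 3, 3, 5], [0, 7]]
toy_vectors = vectorize_sequences(toy_sequences, dimension=8)
print(toy_vectors)
```

Note that index 3 appears twice in the first sequence but still produces a single 1: this encoding records which words are present, not how often or in what order they occur.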

The labels also need to be vectorized. Here we use one-hot encoding, which is widely used for categorical data: each label becomes an all-zero vector of length 46 with a 1 at the position of the label index.

In [19]:
# One-hot encoding
def to_one_hot(labels, dimension=46):
    results = np.zeros((len(labels), dimension))
    for i, label in enumerate(labels):
        results[i, label] = 1.
    return results

# Our vectorized training labels
one_hot_train_labels = to_one_hot(train_labels)
# Our vectorized test labels
one_hot_test_labels = to_one_hot(test_labels)

Keras also has a built-in function for this:

In [20]:
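# to_categorical is exposed directly from keras.utils;
# the keras.utils.np_utils path is not available in newer Keras releases
from keras.utils import to_categorical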

one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)
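As a quick sanity check on what one-hot encoding produces, the sketch below encodes a handful of made-up labels with the `to_one_hot` function defined earlier and compares the result with rows picked from an identity matrix. `toy_labels` is hypothetical; the real labels are in `train_labels`. The `np.eye` comparison is only an illustration, not how Keras implements `to_categorical`.

```python
import numpy as np

def to_one_hot(labels, dimension=46):
    results = np.zeros((len(labels), dimension))
    for i, label in enumerate(labels):
        results[i, label] = 1.0
    return results

# Hypothetical labels in the range 0..45
toy_labels = np.array([3, 0, 45, 3])
encoded = to_one_hot(toy_labels)

# Row k of the identity matrix is the one-hot vector for class k
via_eye = np.eye(46)[toy_labels]

print(encoded.shape)                     # (4, 46)
print(np.array_equal(encoded, via_eye))  # True
print(encoded.argmax(axis=1))            # recovers the original labels
```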

### Constructing the Network

Compared with the IMDb example, the number of output classes has gone from 2 to 46, so the output space has a much higher dimensionality.

In a stack of Dense layers like the ones we have been using, each layer can only access information present in the output of the previous layer. If one layer drops information relevant to the classification problem, later layers can never recover it. Each layer can therefore become an information bottleneck.

In the IMDb example we used 16-dimensional intermediate layers. A 16-dimensional space may be too limited to learn to separate 46 classes, so for this example we widen the network and use 64 units in each hidden layer.

In [21]:
from keras import models
from keras import layers

# Two hidden layers of 64 units each, followed by a 46-way softmax output
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))
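The last layer has 46 units with a softmax activation, so for each input the network outputs 46 non-negative scores that sum to 1, one per topic. Below is a minimal NumPy sketch of that computation. It uses random made-up scores (`logits`) in place of real layer outputs, and the `softmax` helper is written here for illustration rather than taken from Keras.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability; the result is unchanged
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 46))  # hypothetical raw scores for 2 samples
probs = softmax(logits)

print(probs.shape)           # (2, 46)
print(probs.sum(axis=1))     # each row sums to 1 (up to floating-point error)
print(probs.argmax(axis=1))  # index of the most probable topic for each sample
```

Softmax preserves the ordering of the scores, so the predicted topic is simply the index of the largest output.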