# DATA20001 Deep Learning - Group Project
## Text project

**Due Wednesday December 13, before 23:59.**

The task is to learn to assign the correct labels to news articles.  The corpus contains ~850K articles from Reuters.  The test set is about 10% of the articles. The data is unextracted in XML files.

We're only giving you the code for downloading the data, and how to save the final model. The rest you'll have to do yourselves.

Some comments and hints particular to the project:

- One document may belong to many classes in this problem, i.e., it's a multi-label classification problem. In fact there are documents that don't belong to any class, and you should also be able to handle these correctly. Pay careful attention to how you design the outputs of the network (e.g., what activation to use) and what loss function should be used.
- You may use word embeddings to get better results. For example, you were already using a smaller version of the GloVe embeddings in exercise 4. Do note that these embeddings take a lot of memory, but if you use the Keras `Embedding` layer, it will be more efficient.
- Loading all documents into one big matrix as we have done in the exercises is not feasible (e.g. the virtual servers in CSC have only 3 GB of RAM). You need to load the documents in smaller chunks for the training. This shouldn't be a problem, as we are doing mini-batch training anyway, and thus we don't need to keep all the documents in memory. You can simply pass your current chunk of documents to `model.fit()`, as it remembers the weights from the previous run.
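
The chunked loading described in the last hint can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference solution: `iter_chunks` is a hypothetical helper, the document identifiers are made up, and the commented-out calls assume a compiled Keras model named `model` and a hypothetical `vectorize` function.

```python
def iter_chunks(items, chunk_size):
    """Yield successive slices of `items`, each holding at most `chunk_size` elements."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


# Made-up document identifiers standing in for the real file list.
doc_ids = ["doc%03d" % i for i in range(10)]

chunk_sizes = []
for chunk in iter_chunks(doc_ids, chunk_size=4):
    # In the real project only this chunk would be loaded and vectorized, e.g.
    #   x, y = vectorize(chunk)                    # hypothetical helper
    #   model.fit(x, y, epochs=1, batch_size=64)   # continues from the current weights
    chunk_sizes.append(len(chunk))

print(chunk_sizes)  # [4, 4, 2]
```

Only one chunk is held in memory at a time, so the chunk size can be tuned to the available RAM independently of the mini-batch size.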


## Download the data
Let's first set some paths & download the data set:

In [4]:
from src.data_utility import download_data 

database_path = 'train/'
corpus_path = database_path + 'REUTERS_CORPUS_2/'
data_path = corpus_path + 'data/'
codes_path = corpus_path + 'codes/'

download_data(database_path)

Data set already downloaded.
Data set already unzipped.


The above command downloads and extracts the data files into the `train` subdirectory.

The files can be found in `train/`, and are named as `19970405.zip`, etc. You will have to manage the content of these zips to get the data. There is a readme which has links to further descriptions on the data.
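
Reading the articles straight out of a zip, without extracting it to disk, could look roughly like the sketch below. It uses only the standard `zipfile` module. The archive is built in memory so that the example runs on its own, and the member names and content are made up; `iter_zip_members` is a hypothetical helper, not part of `src.data_utility`.

```python
import io
import zipfile

# Build a tiny in-memory archive standing in for a file such as 19970405.zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("12345newsML.xml", "<newsitem><headline>Example</headline></newsitem>")
    archive.writestr("readme.txt", "not an article")


def iter_zip_members(zip_source):
    """Yield (member_name, raw_bytes) for every XML file in the archive."""
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith(".xml"):
                # Keep the bytes undecoded so an XML parser can honour the declared encoding.
                yield name, archive.read(name)


members = list(iter_zip_members(buffer))
print(members[0][0])  # 12345newsML.xml
```

For the real data, `zip_source` would be a path such as `train/19970405.zip` instead of the in-memory buffer.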

The class labels, or topics, can be found in the archive `train/codes.zip`.  The zip contains a file called "topic_codes.txt", which lists the special codes for the topics (about 130 of them) together with an explanation of what each code means.

The XML document files contain the article's headline, the main body text, and the list of topic labels assigned to each article.  You will have to extract the topics of each article from the XML.  For example: 
`<code code="C18">` refers to the topic "OWNERSHIP CHANGES" (like a corporate buyout).

You should pre-process the XML to extract the words from the article: the `<headline>` element and the `<text>` element.  You should not need any other parts of the article.
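
The extraction step can be sketched with the standard `xml.etree.ElementTree` module. The sample document below is made up, and its layout is an assumption: paragraphs as `<p>` children of `<text>`, and topic codes inside a `<codes>` element whose `class` attribute mentions "topics". Check this against the real files and the readme before relying on it. `parse_article` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up article; the element layout is an assumption about the corpus format.
SAMPLE = """<newsitem itemid="1">
  <headline>Example Corp to buy Sample Ltd.</headline>
  <text>
    <p>Example Corp said on Monday it would buy Sample Ltd.</p>
    <p>Terms were not disclosed.</p>
  </text>
  <metadata>
    <codes class="bip:countries:1.0">
      <code code="USA"/>
    </codes>
    <codes class="bip:topics:1.0">
      <code code="C18"/>
      <code code="CCAT"/>
    </codes>
  </metadata>
</newsitem>"""


def parse_article(xml_source):
    """Return (headline, paragraphs, topic_codes) for one article."""
    root = ET.fromstring(xml_source)
    headline = (root.findtext("headline") or "").strip()
    text_node = root.find("text")
    paragraphs = []
    if text_node is not None:
        paragraphs = [(p.text or "").strip() for p in text_node.findall("p")]
    # Keep only the codes that sit under a topics block, skipping e.g. country codes.
    topics = [
        code.get("code")
        for codes in root.iter("codes")
        if "topics" in codes.get("class", "")
        for code in codes.findall("code")
    ]
    return headline, paragraphs, topics


headline, paragraphs, topics = parse_article(SAMPLE)
print(topics)  # ['C18', 'CCAT']
```

An article without any topic block simply yields an empty list, which matches the documents that belong to no class.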

## Reading the data set
First we will read the codes into the dictionary:

In [5]:
from src.data_utility import read_topics

(topics, topic_index, topic_labels) = read_topics(database_path)
n_class = len(topics)

print(n_class, ' different classes\n\n')
print(topics, '\n\n')           # topics codes as an array
print(topic_index, '\n\n')      # dictionary of topic code : index of this code in "topics array"

for key in topic_labels:        # dictionary of topic code : label of this topic code
    print(key, ' : ', topic_labels[key])


126  different classes


['1POL', '2ECO', '3SPO', '4GEN', '6INS', '7RSK', '8YDB', '9BNX', 'ADS10', 'BNW14', 'BRP11', 'C11', 'C12', 'C13', 'C14', 'C15', 'C151', 'C1511', 'C152', 'C16', 'C17', 'C171', 'C172', 'C173', 'C174', 'C18', 'C181', 'C182', 'C183', 'C21', 'C22', 'C23', 'C24', 'C31', 'C311', 'C312', 'C313', 'C32', 'C33', 'C331', 'C34', 'C41', 'C411', 'C42', 'CCAT', 'E11', 'E12', 'E121', 'E13', 'E131', 'E132', 'E14', 'E141', 'E142', 'E143', 'E21', 'E211', 'E212', 'E31', 'E311', 'E312', 'E313', 'E41', 'E411', 'E51', 'E511', 'E512', 'E513', 'E61', 'E71', 'ECAT', 'ENT12', 'G11', 'G111', 'G112', 'G113', 'G12', 'G13', 'G131', 'G14', 'G15', 'G151', 'G152', 'G153', 'G154', 'G155', 'G156', 'G157', 'G158', 'G159', 'GCAT', 'GCRIM', 'GDEF', 'GDIP', 'GDIS', 'GEDU', 'GENT', 'GENV', 'GFAS', 'GHEA', 'GJOB', 'GMIL', 'GOBIT', 'GODD', 'GPOL', 'GPRO', 'GREL', 'GSCI', 'GSPO', 'GTOUR', 'GVIO', 'GVOTE', 'GWEA', 'GWELF', 'M11', 'M12', 'M13', 'M131', 'M132', 'M14', 'M141', 'M142', 'M143', 'MCAT', 'MEUR', '

Let's read a small training and test set:

In [6]:
from src.data_utility import read_news

n_train = 10000
n_test = 10000

(news_train_t, tags_train_t, news_test_t, tags_test_t) = read_news(database_path, n_train, n_test, seed = 1234)

print(tags_train_t[0:3], '\n')
print(news_train_t[1], '\n')

print(tags_test_t[0:3], '\n')
print(news_test_t[0], '\n')

[['C15', 'C151', 'CCAT'], ['GCAT', 'GSPO'], ['E12', 'ECAT', 'M13', 'M132', 'MCAT']] 

['ICE HOCKEY-WORLD CHAMPIONSHIP STANDINGS.', 'Ice hockey world', "championship standings after Monday's games:", '    Pool A\t\t P   W   D   L   F   A  Pts', ' 1. Czech Republic     2   2   0   0   4   2   4', ' 2. Russia\t\t 2   1   1   0   7   3   3', ' 4. Slovakia\t     2   1   1   0   7   5   3', ' 3. Finland\t\t2   1   0   1   7   3   2', ' 5. France\t\t 2   0   0   2   4  11   0', ' 6. Germany\t\t2   0   0   2   2   7   0', 'Pool B', ' 1. Sweden\t\t 2   2   0   0  12   5   4', ' 2. U.S.\t\t   2   2   0   0   8   5   4', ' 3. Canada\t\t 2   1   0   1   9   7   2', ' 4. Italy\t\t  2   1   0   1   8   9   2', ' 5. Latvia\t\t 2   0   0   2   8  10   0', ' 6. Norway\t\t 2   0   0   2   1  10   0', 'Note: Top three teams qualify for medal round'] 

[['M13', 'M131', 'MCAT'], ['M13', 'M131', 'MCAT'], ['C15', 'C151', 'CCAT']] 

['Canadian T-bills open mostly weaker in quiet trade.', 'Canadian T-bills ope

In [7]:
from src.data_utility import download_glove
embeddings_path = "embeddings/"
download_glove(embeddings_path)

GloVe Zip found


In [8]:
from src.data_utility import unzip_glove
zip_file_name = "glove.6B.zip"
unzip_glove(embeddings_path, zip_file_name)

Already unzipped


In [9]:
from src.data_utility import get_glove_embeddings
p=200
embeddings = get_glove_embeddings(p, embeddings_path)

In [10]:
print(len(embeddings["the"]))
print(len(embeddings.keys()))

200
400000


In [11]:
import os

In [12]:
from src.data_utility import process_data
if not os.path.exists("train/REUTERS_CORPUS_2/tokenized/"): process_data("train/")


In [13]:
from src.data_utility import build_dictionary
if not os.path.exists("dictionary.json"): build_dictionary("train/")

In [14]:
import json
with open("dictionary.json") as f:
    word_to_index = json.load(f)
dict_size = len(word_to_index.keys())

In [15]:
from src.data_utility import vectorize_data
if not os.path.exists("train/REUTERS_CORPUS_2/vectorized/"): vectorize_data("train/")

In [16]:
from src.data_utility import get_vectorized_data
vectorized_data_path = "train/REUTERS_CORPUS_2/vectorized/"
tags_path="train/REUTERS_CORPUS_2/tags/"
n_train=10000
n_test=10000
(news_train, tags_train, news_test, tags_test) = get_vectorized_data(vectorized_data_path, tags_path, n_train, n_test, seed = 1234)



In [17]:
import numpy as np
print(news_train[1])
print(news_train_t[1])
print(word_to_index["NUM"])
lengths = np.array([len(x) for x in news_train])
print(lengths.mean() + np.sqrt(lengths.var()))

[  9263 100301   7404   7511   9263   9294    460   7404   7511   3560
   4090   9041   1242   8212   4484   5633   1344   2939   1242   4595
      8   7597   7598      8      8      8      8      8      8      8
      8   7855      8      8      8      8      8      8      8      8
   7603      8      8      8      8      8      8      8      8   6701
      8      8      8      8      8      8      8      8   1167      8
      8      8      8      8      8      8      8   1168      8      8
      8      8      8      8      8   9041    764      8   7582      8
      8      8      8      8      8      8      8   1165      8      8
      8      8      8      8      8      8   1166      8      8      8
      8      8      8      8      8    492      8      8      8      8
      8      8      8      8   7583      8      8      8      8      8
      8      8      8    570      8      8      8      8      8      8
      8   2036   5659   2170   7427   7672  12077   2902]
['ICE HOCKEY-WORLD 

In [18]:
#Work towards a deep learning solution
from keras.preprocessing import sequence
max_news_length = int(np.percentile(lengths, 90))
news_train = sequence.pad_sequences(news_train, maxlen=max_news_length, padding='post', truncating='post')
news_test = sequence.pad_sequences(news_test, maxlen=max_news_length, padding='post', truncating='post')


In [19]:
print(news_train.shape)
print(news_test.shape)
print(list(tags_train[0]))

(10000, 326)
(10000, 326)
[15, 16, 44]
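
What `pad_sequences(..., padding='post', truncating='post')` does to a single sequence can be mimicked in plain Python. The helper below is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
def pad_post(seq, maxlen, value=0):
    """Cut `seq` at the end or right-pad it with `value` so it has exactly `maxlen` items."""
    trimmed = list(seq[:maxlen])
    return trimmed + [value] * (maxlen - len(trimmed))


short = pad_post([5, 6, 7], 5)
long = pad_post([1, 2, 3, 4, 5, 6], 4)
print(short)  # [5, 6, 7, 0, 0]
print(long)   # [1, 2, 3, 4]
```

Since the padding value is 0, index 0 should not be used for a real word, which appears to be why the embedding matrix built later has one more row than the dictionary has entries.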


In [20]:
print(news_train[0])

[345747    893   2029   1814      8   1812      8      8   1807   1814
      8   4206   1804   1814      8   1814      8   1811      8      8
   2034   2035      8   4206   1816   1191   1817   1818   1819   1760
   1820      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      

In [21]:
print(n_class)
#encode responses
tags_train_matrix = np.zeros((n_train,n_class))
for ii in range(n_train):
    tags_train_matrix[ii, list(tags_train[ii])] = 1
    
tags_test_matrix = np.zeros((n_test,n_class))    
for ii in range(n_test):
    tags_test_matrix[ii, list(tags_test[ii])] = 1    

print(tags_train[0])    
print(tags_train_matrix[0,])

print(tags_test[0])    
print(tags_test_matrix[0,])

126
[15 16 44]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[116 117 123]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0. 
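
The encoding loop above builds one multi-hot row per document. A plain-Python sketch of the same idea is shown here with a made-up label space of 6 classes; `multi_hot` is a hypothetical helper. Note how a document without labels becomes an all-zero row, which is the case the project hints ask to handle.

```python
def multi_hot(label_indices, n_class):
    """Return a list of length `n_class` with 1.0 at every index in `label_indices`."""
    row = [0.0] * n_class
    for idx in label_indices:
        row[idx] = 1.0
    return row


# Three made-up documents: two labels, no labels, one label.
rows = [multi_hot(tags, 6) for tags in ([1, 4], [], [0])]
for row in rows:
    print(row)
```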

In [22]:
print(np.array(tags_train).shape)
print(len(word_to_index.keys()))

(10000,)
402715


In [23]:
embedding_matrix = np.zeros((len(word_to_index.keys())+1, p))
for word, i in word_to_index.items():
    vector = embeddings.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

In [24]:
print(embedding_matrix.shape)
print(embedding_matrix[0,:])
#print(embedding_matrix[1,:])

(402716, 200)
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.]


In [25]:
# Change the embedding of "NUM" to the column-wise mean of the embeddings of the digits 0-9
import numpy as np
from scipy import spatial
from numpy.linalg import norm
#print(embeddings["1"])
#print(embeddings["2"])
print(spatial.distance.cosine(embeddings["1"], embeddings["2"]))
print(spatial.distance.cosine(embeddings["1"], embeddings["5"]))
print(spatial.distance.cosine(embeddings["1"], embeddings["one"]))
print(spatial.distance.cosine(embeddings["1"], embeddings["dog"]))
numberss = np.zeros((10,len(embeddings["1"])))
for ii in range(10):
    numberss[ii,:] = embeddings[str(ii)]
    
print(len(np.mean(numberss, axis=0)))
print(embedding_matrix[word_to_index["NUM"],:])
embedding_matrix[word_to_index["NUM"],:] = np.mean(numberss, axis=0)
print(embedding_matrix[word_to_index["NUM"],:])

0.0487912585443
0.123944838227
0.487306300989
0.803939525358
200
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.]
[ -7.44000000e-01   6.62274000e-01  -6.51124000e-01   2.36011600e-01
   5.67250000e-03  -3.07229500e-01   2.41807400e-01  
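
The two operations used in the cell above, cosine distance and a column-wise mean, are sketched here in plain Python on made-up 2-dimensional vectors. The functions are illustrative stand-ins for `scipy.spatial.distance.cosine` and `np.mean(..., axis=0)`.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity, the same quantity scipy's `cosine` returns."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def column_mean(vectors):
    """Average a list of equal-length vectors coordinate by coordinate."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])  # unrelated directions
parallel = cosine_distance([1.0, 1.0], [2.0, 2.0])    # same direction, different length
mean_vec = column_mean([[1.0, 2.0], [3.0, 4.0]])
print(orthogonal, mean_vec)
```

A distance near 0 means the vectors point the same way, which fits the output above where "1" is much closer to "2" than to "dog".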

In [52]:
import keras.backend as K
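# Note: this must be done without hard rounding (which the standard F1 score uses),
# because rounding is not differentiable.
def f1_score_loss(y_true, y_pred):

    # Soft counts of positive samples; the steep sigmoid imitates rounding.
    # Note: as written, the 0.5 shift is applied after the scaling,
    # i.e. sigmoid(1000 * x - 0.5) rather than sigmoid(1000 * (x - 0.5)).
    c1 = K.sum(K.sigmoid(1000 * (K.clip(y_true * y_pred, 0, 1)) - 0.5))
    c2 = K.sum(K.sigmoid(1000 * (K.clip(y_pred, 0, 1)) - 0.5))
    c3 = K.sum(K.clip(y_true, 0, 1))

    # If there are no true samples, fix the F1 score at 0.
    # Note: c3 is a symbolic tensor, so in graph-mode Keras this Python-level check
    # runs once when the loss is built, not once per batch.
    if c3 == 0:
        return 0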

    # How many selected items are relevant?
    precision = c1 / c2

    # How many relevant items are selected?
    recall = c1 / c3

    # Calculate f1_score
    f1_score = 2 * (precision * recall) / (precision + recall)
    return -1 * f1_score #loss, we are trying to max the f1 score

In [56]:
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM, Dropout
from keras.layers import GlobalMaxPooling1D, Merge, Concatenate
from keras.layers import Conv1D, MaxPooling1D, Flatten
from keras.layers.embeddings import Embedding
from keras import metrics
from sklearn.metrics import f1_score


### Modelling

We start with a simple LSTM layer. The motivation for this was a blog post by Jason Brownlee, [Sequence Classification with LSTM Recurrent Neural Networks in Python with Keras](https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/).

In [28]:
# create the model with a simple LSTM layer
batch_size = 64
epochs = 5
model = Sequential()

embedding_layer = Embedding(len(word_to_index.keys())+1,
                                p,
                                weights=[embedding_matrix],
                                input_length=max_news_length,
                                trainable=False)

model.add(embedding_layer)
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dropout(0.2))
model.add(Dense(n_class, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
print(model.summary())
model.fit(np.array(news_train), np.array(tags_train_matrix),epochs=epochs, batch_size=batch_size)
#validation_data=(np.array(news_test), np.array(tags_test))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 326, 200)          80543200  
_________________________________________________________________
dropout_1 (Dropout)          (None, 326, 200)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               120400    
_________________________________________________________________
dropout_2 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 126)               12726     
Total params: 80,676,326
Trainable params: 133,126
Non-trainable params: 80,543,200
_________________________________________________________________
None
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1c276b6b38>
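
The parameter counts in the summary above can be reproduced by hand. The sketch below uses the standard LSTM weight count (four gates, each with input weights, recurrent weights and a bias); the helper names are made up for this example.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus bias vector."""
    return input_dim * units + units


lstm_count = lstm_params(200, 100)   # 200-dimensional embeddings into LSTM(100)
dense_count = dense_params(100, 126) # LSTM output into 126 sigmoid units
print(lstm_count)                # 120400
print(dense_count)               # 12726
print(lstm_count + dense_count)  # 133126
```

The sum matches the trainable parameter count in the summary; the 80,543,200 non-trainable parameters are the frozen 402,716 x 200 embedding matrix.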

In [30]:
prob_test = model.predict(np.array(news_test), batch_size=batch_size)
pred_test = np.array(prob_test) > 0.2
print('F1 score: ', round(f1_score(tags_test_matrix, pred_test, average='micro'), 4))

F1 score:  0.3254
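
For reference, the thresholding and the micro-averaged F1 score used in the cell above can be written out in plain Python. This is an illustrative re-implementation on made-up numbers, not the scikit-learn code: micro-averaging pools true positives, false positives and false negatives over all documents and labels before computing a single score.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (document, label) pairs: 2*TP / (2*TP + FP + FN)."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Made-up probabilities for two documents and three labels.
probs = [[0.9, 0.1, 0.3], [0.05, 0.6, 0.25]]
truth = [[1, 0, 0], [0, 1, 1]]
preds = [[p > 0.2 for p in row] for row in probs]  # same 0.2 cut-off as above

score = micro_f1(truth, preds)
print(round(score, 4))  # 0.8571
```

The cut-off of 0.2 is itself a tunable choice; a lower threshold trades precision for recall.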


The model seems to converge to a rather poor value.

Next we will try to optimize the "F1 loss" defined above. Note that the F1 score is not differentiable because of the rounding, but the rounding can be approximated by a differentiable sigmoid function with proper scaling. The function used in `f1_score_loss` has a shift parameter of 0.5 and a scale parameter of 1000.
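
The idea of replacing rounding with a scaled and shifted sigmoid can be sketched in plain Python as follows. This is one possible formulation, written for illustration: it applies the soft rounding to the predictions only and is not a line-by-line copy of `f1_score_loss`. The names `soft_round` and `soft_f1` are made up.

```python
import math


def soft_round(x, scale=1000.0, shift=0.5):
    """Smooth stand-in for rounding: close to 0 below `shift`, close to 1 above it."""
    z = scale * (x - shift)
    z = max(-60.0, min(60.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-z))


def soft_f1(y_true, y_pred, scale=1000.0, shift=0.5):
    """F1 = 2*TP / (predicted positives + actual positives), with soft counts."""
    soft_pred = [soft_round(p, scale, shift) for p in y_pred]
    true_pos = sum(t * s for t, s in zip(y_true, soft_pred))
    denom = sum(soft_pred) + sum(y_true)
    if denom == 0:
        return 0.0
    return 2.0 * true_pos / denom


# Made-up labels and predictions: one hit, one false alarm, one miss.
value = soft_f1([1, 0, 1, 0], [0.9, 0.8, 0.2, 0.1])
print(round(value, 4))  # 0.5
```

A very large scale makes the sigmoid almost a step function, so its gradient is close to zero away from the threshold; a smaller scale gives a smoother surrogate at the cost of a looser approximation.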

In [53]:
#Model with f1 loss
batch_size = 64
epochs = 2
model = Sequential()

embedding_layer = Embedding(len(word_to_index.keys())+1,
                                p,
                                weights=[embedding_matrix],
                                input_length=max_news_length,
                                trainable=False)

model.add(embedding_layer)
model.add(Dropout(0.1))
model.add(LSTM(5))
model.add(Dense(512))
model.add(Dropout(0.5))
model.add(Dense(n_class, activation='sigmoid'))
model.compile(loss=f1_score_loss, optimizer='adam', metrics=["accuracy"])
print(model.summary())
model.fit(np.array(news_train), np.array(tags_train_matrix),epochs=epochs, batch_size=batch_size)
#validation_data=(np.array(news_test), np.array(tags_test))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_12 (Embedding)     (None, 326, 200)          80543200  
_________________________________________________________________
dropout_23 (Dropout)         (None, 326, 200)          0         
_________________________________________________________________
lstm_12 (LSTM)               (None, 5)                 4120      
_________________________________________________________________
dense_22 (Dense)             (None, 512)               3072      
_________________________________________________________________
dropout_24 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_23 (Dense)             (None, 126)               64638     
Total params: 80,615,030
Trainable params: 71,830
Non-trainable params: 80,543,200
___________________________________________________________

<keras.callbacks.History at 0x1c29f0b9b0>

In [54]:
prob_test = model.predict(np.array(news_test), batch_size=batch_size)
pred_test = np.array(prob_test) > 0.1 #same as we use for optimization
print('F1 score: ', round(f1_score(tags_test_matrix, pred_test, average='micro'), 4))

F1 score:  0.0497


The loss is getting concerningly low, given that the F1 score should be between 0 and 1. I have not found a bug in the F1 implementation.
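
One thing that may be worth checking in `f1_score_loss` is operator precedence: the expression `1000 * x - 0.5` shifts by 0.5 after scaling, which is not the same as `1000 * (x - 0.5)`. The sketch below only compares the two expressions numerically on a few made-up values; it is not a confirmed diagnosis of the result above.

```python
import math


def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-z))


table = []
for p in (0.0, 0.01, 0.3, 0.7):
    as_written = sigmoid(1000 * p - 0.5)    # shift applied after scaling
    shifted = sigmoid(1000 * (p - 0.5))     # shift applied before scaling
    table.append((p, as_written, shifted))
    print(p, round(as_written, 3), round(shifted, 3))
```

With the first form, an input of exactly 0 still contributes about 0.378 to a soft count, and any input above roughly 0.01 counts as almost 1. Summed over many zero entries, this could push the soft "true positive" count above the number of true labels, which would allow an F1 value above 1.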


In [55]:
print(max([max(x) for x in prob_test]))

0.55605


The F1 loss does not seem to be working. The F1 score is near 0, which is the worst possible value, and it seems that we are predicting the zero class for everything. I will go back to binary cross-entropy. The next model is a basic convolutional NN, quite similar to what we experimented with in the exercises. Its number of parameters is very large because of the wide dense layer between the convolutions and the output layer. This is motivated by many successful CNN architectures, which tend to have a dense layer before the output layer.

In [57]:
# create the model with a CNN layer
batch_size = 64
epochs = 5
n_convolutions = 64
kernel_size = 8
pooling_size = 3
model = Sequential()

embedding_layer = Embedding(len(word_to_index.keys())+1,
                                p,
                                weights=[embedding_matrix],
                                input_length=max_news_length,
                                trainable=True)

model.add(embedding_layer)
model.add(Conv1D(filters=n_convolutions, kernel_size=kernel_size, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=pooling_size))
model.add(Flatten())
model.add(Dense(512, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(n_class, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
print(model.summary())
model.fit(np.array(news_train), np.array(tags_train_matrix),epochs=epochs, batch_size=batch_size)
#validation_data=(np.array(news_test), np.array(tags_test))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_13 (Embedding)     (None, 326, 200)          80543200  
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 326, 64)           102464    
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 108, 64)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 6912)              0         
_________________________________________________________________
dense_24 (Dense)             (None, 512)               3539456   
_________________________________________________________________
dropout_25 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_25 (Dense)             (None, 126)               64638     
Total para

<keras.callbacks.History at 0x1c2b4cfa58>

In [58]:
prob_test = model.predict(np.array(news_test), batch_size=batch_size)
pred_test = np.array(prob_test) > 0.2
print('F1 score: ', round(f1_score(tags_test_matrix, pred_test, average='micro'), 4))

F1 score:  0.7876
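
The shapes and parameter counts in the CNN summary above follow from simple arithmetic, which also shows where the large dense layer comes from. The helper names below are made up for this example.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """One weight per (kernel position, input channel, filter) plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def dense_params(input_dim, units):
    """Weight matrix plus bias vector."""
    return input_dim * units + units


conv_count = conv1d_params(8, 200, 64)  # kernel_size=8 on 200-dim embeddings, 64 filters
steps_after_pool = 326 // 3             # MaxPooling1D(pool_size=3), stride defaults to pool size
flat_width = steps_after_pool * 64      # Flatten over (steps, filters)
hidden_count = dense_params(flat_width, 512)
output_count = dense_params(512, 126)

print(conv_count, steps_after_pool, flat_width)  # 102464 108 6912
print(hidden_count, output_count)                # 3539456 64638
```

Flattening 108 steps of 64 filters gives 6912 inputs, so the 512-unit dense layer alone holds about 3.5 million weights, far more than the convolution itself.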


So far we get the best performance with this CNN. The next model is an attempt to improve the performance by adding an LSTM layer. This approach is again motivated by the blog post by [Jason Brownlee](https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/).

In [59]:
# create the model with LSTM and CNN layer
batch_size = 64
epochs = 5
n_convolutions = 64
kernel_size = 8
pooling_size = 3
model = Sequential()

embedding_layer = Embedding(len(word_to_index.keys())+1,
                                p,
                                weights=[embedding_matrix],
                                input_length=max_news_length,
                                trainable=False)

model.add(embedding_layer)
model.add(Conv1D(filters=n_convolutions, kernel_size=kernel_size, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=pooling_size))
model.add(LSTM(50, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(512, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(n_class, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
print(model.summary())
model.fit(np.array(news_train), np.array(tags_train_matrix),epochs=epochs, batch_size=batch_size)
#validation_data=(np.array(news_test), np.array(tags_test))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_14 (Embedding)     (None, 326, 200)          80543200  
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 326, 64)           102464    
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 108, 64)           0         
_________________________________________________________________
lstm_13 (LSTM)               (None, 50)                23000     
_________________________________________________________________
dense_26 (Dense)             (None, 512)               26112     
_________________________________________________________________
dropout_26 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_27 (Dense)             (None, 126)               64638     
Total para

<keras.callbacks.History at 0x1c2b824550>

In [60]:
prob_test = model.predict(np.array(news_test), batch_size=batch_size)
pred_test = np.array(prob_test) > 0.2
print('F1 score: ', round(f1_score(tags_test_matrix, pred_test, average='micro'), 4))

F1 score:  0.461


The plain CNN seems to do a better job than the CNN combined with an LSTM. The iterations were faster and convergence was reached earlier, but with worse performance in terms of loss.

The next model tries kernels of several sizes, learned side by side. This approach was motivated by the network graphs shown in the lectures, and the implementation was learned from [this GitHub issue](https://github.com/fchollet/keras/issues/6547).

In [61]:
# create a model with multiple size kernels
batch_size = 64
epochs = 5
n_convolutions = 64
kernel_size = 8
pooling_size = 3
kernels = (3,5,8,10)
n_filters = 128

submodels = []
for kw in kernels:    # kernel sizes
    submodel = Sequential()
    submodel.add(Embedding(len(word_to_index.keys())+1,
                                p,
                                weights=[embedding_matrix],
                                input_length=max_news_length,
                                trainable=False))
    submodel.add(Conv1D(n_filters,
                        kw,
                        padding='valid',
                        activation='relu',
                        strides=1))
    submodel.add(GlobalMaxPooling1D())
    submodels.append(submodel)
    
model = Sequential()
model.add(Merge(submodels, mode="concat"))
model.add(Dense(512, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(n_class, activation="sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
print(model.summary())

news_train_rep = [np.array(news_train)] * len(kernels)

model.fit(news_train_rep, np.array(tags_train_matrix),epochs=epochs, batch_size=batch_size)
#validation_data=(np.array(news_test), np.array(tags_test))



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
merge_1 (Merge)              (None, 512)               0         
_________________________________________________________________
dense_28 (Dense)             (None, 512)               262656    
_________________________________________________________________
dropout_27 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_29 (Dense)             (None, 126)               64638     
Total params: 323,166,206
Trainable params: 993,406
Non-trainable params: 322,172,800
_________________________________________________________________
None
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1c2bea5f28>

In [62]:
news_test_rep = [np.array(news_test)] * len(kernels)
prob_test = model.predict(news_test_rep, batch_size=batch_size)
pred_test = np.array(prob_test) > 0.2
print('F1 score: ', round(f1_score(tags_test_matrix, pred_test, average='micro'), 4))

F1 score:  0.775
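
The totals in the summary of the multi-kernel model can be checked with a few lines of arithmetic. The sketch also makes one side effect visible: each of the four submodels carries its own frozen copy of the embedding matrix. The memory estimate assumes 32-bit floats, which is an assumption about the backend configuration.

```python
kernels = (3, 5, 8, 10)
n_filters = 128
embedding_params = 402716 * 200  # one frozen embedding matrix, as in the earlier summaries

merged_width = len(kernels) * n_filters          # concatenated GlobalMaxPooling1D outputs
frozen_params = len(kernels) * embedding_params  # one embedding copy per submodel
conv_params = sum(kw * 200 * n_filters + n_filters for kw in kernels)
dense_params = (merged_width * 512 + 512) + (512 * 126 + 126)

print(merged_width)                # 512
print(frozen_params)               # 322172800
print(conv_params + dense_params)  # 993406
print(frozen_params * 4 / 1e9)     # approximate size in GB, assuming float32
```

At roughly 1.3 GB for the embeddings alone, sharing a single embedding layer across the branches would be worth considering on a machine with 3 GB of RAM.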


Training seems to converge fast, but the final result (0.775) is at the same level as the plain CNN (0.7876). We can try to boost this with an LSTM layer.

In [63]:
# create a model with multiple size kernels and LSTM
batch_size = 64
epochs = 5
n_convolutions = 64
kernel_size = 8
pooling_size = 5
kernels = (3,5,8,10)
n_filters = 128

submodels = []
for kw in kernels:    # kernel sizes
    submodel = Sequential()
    submodel.add(Embedding(len(word_to_index.keys())+1,
                                p,
                                weights=[embedding_matrix],
                                input_length=max_news_length,
                                trainable=False))
    submodel.add(Conv1D(n_filters,
                        kw,
                        padding='valid',
                        activation='relu',
                        strides=1))
    submodel.add(GlobalMaxPooling1D())
    submodels.append(submodel)
    
model = Sequential()
model.add(Merge(submodels, mode="concat"))
model.add(MaxPooling1D(pool_size=pooling_size))
#model.add(Flatten())
print(model.summary())
model.add(LSTM(100))
model.add(Dense(512, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(n_class, activation="sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
print(model.summary())

news_train_rep = [np.array(news_train)] * len(kernels)

model.fit(news_train_rep, np.array(tags_train_matrix),epochs=epochs, batch_size=batch_size)
#validation_data=(np.array(news_test), np.array(tags_test))



ValueError: Input 0 is incompatible with layer max_pooling1d_3: expected ndim=3, found ndim=2

In [None]:
prob_test = model.predict(news_test_rep, batch_size=batch_size)
pred_test = np.array(prob_test) > 0.2
print('F1 score: ', round(f1_score(tags_test_matrix, pred_test, average='micro'), 4))

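The error is a dimension mismatch: each submodel ends with `GlobalMaxPooling1D`, so the merged output is 2-D, `(None, 512)`, whereas `MaxPooling1D` expects a 3-D `(batch, steps, features)` input. I could not work out how to fix the above approach, although the idea itself is reasonable.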
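
The `ndim=2` in the error message can be illustrated with nested lists: global max pooling removes the time axis, so after concatenation there are only the batch and feature axes left. The toy feature maps and helper names below are made up.

```python
def global_max_pool(feature_map):
    """Collapse a (steps, filters) nested list to (filters,) by taking the max over steps."""
    return [max(column) for column in zip(*feature_map)]


def ndim(x):
    """Nesting depth of a regular nested list."""
    depth = 0
    while isinstance(x, list):
        depth += 1
        x = x[0]
    return depth


# One toy sample per branch: 3 time steps, 2 filters.
branch_a = [[0.1, 0.5], [0.4, 0.2], [0.3, 0.9]]
branch_b = [[0.7, 0.0], [0.2, 0.6], [0.1, 0.1]]

merged = global_max_pool(branch_a) + global_max_pool(branch_b)
batch = [merged]  # add the batch axis

print(merged)       # [0.4, 0.9, 0.7, 0.6]
print(ndim(batch))  # 2
```

An untested direction would be to drop `GlobalMaxPooling1D` from the branches and use `padding='same'` in the convolutions, so that every branch keeps the same number of time steps and the concatenated tensor stays 3-D for the pooling and LSTM layers.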

The next approach we took was a bidirectional LSTM. It is used for sequence classification problems where the whole sequence is known. So it would not be applicable to a regular time series, but for classifying a sequence such as an article it is fine. The motivation came from some serious information retrieval, namely Jason Brownlee's blog post [How to Develop a Bidirectional LSTM For Sequence Classification in Python with Keras](https://machinelearningmastery.com/develop-bidirectional-lstm-sequence-classification-python-keras/) and Richard Liao's blog post [Text Classification, Part 2 - sentence level Attentional RNN](https://richliao.github.io/supervised/classification/2016/12/26/textclassifier-RNN/).
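
The bidirectional idea can be shown with a toy recurrence: the same kind of layer reads the sequence once from the start and once from the end, and the two final states are concatenated. The recurrence below is a deliberately simple stand-in, not an LSTM, and the function names are made up.

```python
def final_state(sequence, decay=0.5):
    """Toy recurrence h_t = decay * h_(t-1) + x_t; returns the last hidden state."""
    h = 0.0
    for x in sequence:
        h = decay * h + x
    return h


def bidirectional(sequence):
    """Concatenate the final states of a forward and a backward pass."""
    return [final_state(sequence), final_state(sequence[::-1])]


states = bidirectional([1.0, 2.0, 4.0])
print(states)  # [5.25, 3.0]
```

The forward state is dominated by the end of the sequence and the backward state by its beginning, so the concatenation sees both. This also explains why wrapping `LSTM(100)` in `Bidirectional` (which concatenates by default) gives an output of width 200.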

In [64]:
#bidirectional LSTM
from keras.layers import Bidirectional
epochs=5
batch_size=64
embedding_layer = Embedding(len(word_to_index.keys())+1,
                                p,
                                weights=[embedding_matrix],
                                input_length=max_news_length,
                                trainable=False)

model = Sequential()
model.add(embedding_layer)
model.add(Dropout(0.2))
model.add(Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2)))
model.add(Dense(512, activation='relu'))
model.add(Dense(n_class, activation="sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
print(model.summary())
model.fit(np.array(news_train), np.array(tags_train_matrix),epochs=epochs, batch_size=batch_size)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_23 (Embedding)     (None, 326, 200)          80543200  
_________________________________________________________________
dropout_28 (Dropout)         (None, 326, 200)          0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               240800    
_________________________________________________________________
dense_30 (Dense)             (None, 512)               102912    
_________________________________________________________________
dense_31 (Dense)             (None, 126)               64638     
Total params: 80,951,550
Trainable params: 408,350
Non-trainable params: 80,543,200
_________________________________________________________________
None
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1c2c469b00>

In [65]:
prob_test = model.predict(np.array(news_test), batch_size=batch_size)
pred_test = np.array(prob_test) > 0.2
print('F1 score: ', round(f1_score(tags_test_matrix, pred_test, average='micro'), 4))

F1 score:  0.7067
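
As a quick check of the bidirectional LSTM summary above, the trainable parameter counts can be reproduced by hand. This is a sketch assuming the standard LSTM cell (four gates, each with input weights, recurrent weights and a bias); the helper names are made up for illustration.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

embedding_dim, lstm_units, n_labels = 200, 100, 126   # values read from the summary

bidirectional = 2 * lstm_params(embedding_dim, lstm_units)   # forward + backward LSTM
hidden = dense_params(2 * lstm_units, 512)                   # input is the concatenated 200-d vector
output = dense_params(512, n_labels)

print(bidirectional, hidden, output, bidirectional + hidden + output)
# 240800 102912 64638 408350
```

The total matches the "Trainable params" line, since the embedding layer is frozen.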


In [68]:
#bidirectional LSTM with couple of CNN layers
from keras.layers import Bidirectional
epochs=5
batch_size=64
n_convolutions = 64  # value implied by the summary below (64 and 128 filters)
kernel_size = 5
pooling_size=3
embedding_layer = Embedding(len(word_to_index.keys())+1,
                                p,
                                weights=[embedding_matrix],
                                input_length=max_news_length,
                                trainable=False)

model = Sequential()
model.add(embedding_layer)
model.add(Conv1D(filters=n_convolutions, kernel_size=kernel_size, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=pooling_size))
model.add(Conv1D(filters=2*n_convolutions, kernel_size=kernel_size, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=pooling_size))
model.add(Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2)))
model.add(Dense(512, activation='relu'))
model.add(Dense(n_class, activation="sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
print(model.summary())
model.fit(np.array(news_train), np.array(tags_train_matrix),epochs=epochs, batch_size=batch_size)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_25 (Embedding)     (None, 326, 200)          80543200  
_________________________________________________________________
conv1d_13 (Conv1D)           (None, 326, 64)           64064     
_________________________________________________________________
max_pooling1d_6 (MaxPooling1 (None, 108, 64)           0         
_________________________________________________________________
conv1d_14 (Conv1D)           (None, 108, 128)          41088     
_________________________________________________________________
max_pooling1d_7 (MaxPooling1 (None, 36, 128)           0         
_________________________________________________________________
bidirectional_3 (Bidirection (None, 200)               183200    
_________________________________________________________________
dense_34 (Dense)             (None, 512)               102912    
__________

<keras.callbacks.History at 0x1c2dae40f0>

In [69]:
prob_test = model.predict(np.array(news_test), batch_size=batch_size)
pred_test = np.array(prob_test) > 0.2
print('F1 score: ', round(f1_score(tags_test_matrix, pred_test, average='micro'), 4))

F1 score:  0.6779


Surprisingly bad, even though I got good results with a bidirectional LSTM and with CNNs separately. Our group has also been discussing that CNNs are the way to go because of their good results. The next one is a modification of the model that we are planning to try out.

In [73]:
#Anisia super model
epochs = 5
batch_size = 64
model = Sequential()

embedding_layer = Embedding(len(word_to_index.keys())+1,
                                p,
                                weights=[embedding_matrix],
                                input_length=max_news_length,
                                trainable=False)
model.add(embedding_layer)
model.add(Conv1D(300, 4, activation='relu'))
model.add(Conv1D(100, 6, activation='relu'))
model.add(MaxPooling1D(pool_size=3))
model.add(Conv1D(100, 10, activation='relu'))
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(n_class, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(np.array(news_train), np.array(tags_train_matrix),epochs=epochs, batch_size=batch_size)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_29 (Embedding)     (None, 326, 200)          80543200  
_________________________________________________________________
conv1d_24 (Conv1D)           (None, 323, 300)          240300    
_________________________________________________________________
conv1d_25 (Conv1D)           (None, 318, 100)          180100    
_________________________________________________________________
max_pooling1d_11 (MaxPooling (None, 106, 100)          0         
_________________________________________________________________
conv1d_26 (Conv1D)           (None, 97, 100)           100100    
_________________________________________________________________
flatten_5 (Flatten)          (None, 9700)              0         
_________________________________________________________________
dense_42 (Dense)             (None, 512)               4966912   
__________

<keras.callbacks.History at 0x1c30ca5cc0>

In [74]:
prob_test = model.predict(np.array(news_test), batch_size=batch_size)
pred_test = np.array(prob_test) > 0.2
print('F1 score: ', round(f1_score(tags_test_matrix, pred_test, average='micro'), 4))

F1 score:  0.7587
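
The output shapes and parameter counts in the summary of this last model can be checked with a bit of arithmetic. The sketch below assumes unpadded ('valid') convolutions with stride 1 and non-overlapping pooling, which is what the shapes in the summary suggest; the helper names are made up.

```python
def conv1d_valid(length, kernel_size):
    # an unpadded convolution with stride 1 shortens the sequence by kernel_size - 1
    return length - kernel_size + 1

def max_pool(length, pool_size):
    # non-overlapping windows, leftover steps are dropped
    return length // pool_size

def conv1d_params(in_channels, filters, kernel_size):
    return kernel_size * in_channels * filters + filters

lengths = [326]                                   # padded article length
lengths.append(conv1d_valid(lengths[-1], 4))      # Conv1D(300, 4)
lengths.append(conv1d_valid(lengths[-1], 6))      # Conv1D(100, 6)
lengths.append(max_pool(lengths[-1], 3))          # MaxPooling1D(3)
lengths.append(conv1d_valid(lengths[-1], 10))     # Conv1D(100, 10)
print(lengths)                                    # [326, 323, 318, 106, 97]

conv_params = [conv1d_params(200, 300, 4),
               conv1d_params(300, 100, 6),
               conv1d_params(100, 100, 10)]
print(conv_params)                                # [240300, 180100, 100100]

flat_size = lengths[-1] * 100                     # Flatten: 97 steps x 100 channels
dense_size = flat_size * 512 + 512                # Dense(512) on the flattened vector
print(flat_size, dense_size)                      # 9700 4966912
```

Note that the dense layer after `Flatten` holds almost five million weights, far more than the three convolutions together.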


This approach did not achieve such great results here, but on the larger data set it achieved an F1 score of nearly 0.85. I guess the final evaluation will show how good our models were.

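Finally, I want to check some reference scores for three trivial label matrices: exactly correct, only ones and only zeros. Note that the cell below scores them against the model's predictions `pred_test` rather than against the true labels `tags_test_matrix`, so the numbers describe the predictions and are not baselines on the true labels.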

In [75]:
print('F1 score: ', round(f1_score(pred_test, pred_test, average='micro'), 4))
print('F1 score: ', round(f1_score(np.ones(pred_test.shape), pred_test, average='micro'), 4))
print('F1 score: ', round(f1_score(np.zeros(pred_test.shape), pred_test, average='micro'), 4))


F1 score:  1.0
F1 score:  0.0542
F1 score:  0.0
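
For comparison, here is a sketch of how the same three trivial predictors could be scored against true labels. It uses a hand-written micro-averaged F1 and a made-up toy label matrix, so the numbers say nothing about our data; `micro_f1` is a hypothetical helper, not the scikit-learn function.

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool true/false positives over all labels and articles."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

# toy ground truth: 3 articles, 4 labels, the second article has no label at all
y_true = np.array([[1, 0, 0, 1],
                   [0, 0, 0, 0],
                   [0, 1, 0, 0]])

perfect = micro_f1(y_true, y_true)
all_ones = micro_f1(y_true, np.ones_like(y_true))
all_zeros = micro_f1(y_true, np.zeros_like(y_true))
print(perfect, all_ones, all_zeros)
```

With sparse labels the all-ones predictor scores low because of the many false positives, and the all-zeros predictor scores 0 because it has no true positives.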




Let's change the target variable into a binary indicator (multi-hot) encoding, with one column per topic:

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(classes=topics)  # fixed column order: one column per topic code
y_train = mlb.fit_transform(tags_train)
y_test = mlb.transform(tags_test)  # reuse the already fitted binarizer

print(y_train.shape)
print(y_test.shape, '\n')
print(y_train[0])
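
To illustrate what this encoding looks like, here is a minimal NumPy sketch of the same idea. The helper `to_multi_hot` and the short topic list are made up for illustration; the real notebook uses `MultiLabelBinarizer` with the full list of topic codes.

```python
import numpy as np

def to_multi_hot(tag_lists, classes):
    """Hypothetical helper: one row per article, one column per class."""
    index = {code: i for i, code in enumerate(classes)}
    matrix = np.zeros((len(tag_lists), len(classes)), dtype=int)
    for row, tags in enumerate(tag_lists):
        for tag in tags:
            matrix[row, index[tag]] = 1
    return matrix

toy_topics = ['C15', 'E21', 'GCAT', 'M11']        # illustrative subset of codes
toy_tags = [['C15', 'GCAT'], [], ['M11']]         # the second article has no topic

toy_y = to_multi_hot(toy_tags, toy_topics)
print(toy_y)
```

An article without any topic becomes an all-zero row, which is why the output layer uses independent sigmoids rather than a softmax.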

## Preprocessing the data set (this is just for test)
Then we will convert the training and test articles into binary bag-of-words matrices, with one column per vocabulary word:

In [None]:
from keras.preprocessing.text import Tokenizer
import itertools

max_vocabulary = 30000 # take only max_vocabulary most popular words
tokenizer = Tokenizer(max_vocabulary)

# concatenate each news item into a single string
words_train = [' '.join(filter(None, news_item)) for news_item in news_train] 
tokenizer.fit_on_texts(words_train)
matrix_train = tokenizer.texts_to_matrix(words_train)

words_test = [' '.join(filter(None, news_item)) for news_item in news_test] 
matrix_test = tokenizer.texts_to_matrix(words_test)

print(matrix_train.shape)
print(matrix_test.shape)
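
The idea behind `texts_to_matrix` in its default binary mode can be sketched in a few lines. This is a simplified stand-in (lowercasing and whitespace splitting only, no punctuation filtering), and `binary_bag_of_words` is a hypothetical helper, not the Keras code.

```python
from collections import Counter
import numpy as np

def binary_bag_of_words(texts, max_vocabulary):
    """Simplified sketch: 1 if a kept word occurs in the text, else 0."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # index 0 is left unused here (the Keras Tokenizer also reserves it),
    # so only the max_vocabulary - 1 most frequent words get a column
    kept = counts.most_common(max_vocabulary - 1)
    vocab = {word: i + 1 for i, (word, _) in enumerate(kept)}
    matrix = np.zeros((len(texts), max_vocabulary))
    for row, text in enumerate(texts):
        for word in text.lower().split():
            if word in vocab:
                matrix[row, vocab[word]] = 1.0
    return matrix, vocab

toy_texts = ['oil prices rise', 'oil prices fall', 'oil exports']
toy_matrix, toy_vocab = binary_bag_of_words(toy_texts, max_vocabulary=3)
print(toy_vocab)
print(toy_matrix)
```

Word order and word counts are lost in this representation, which is the main difference from the embedding-based models above.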

Let's import the F1 score, which is our evaluation metric:

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import Adam
import numpy as np
from sklearn.metrics import f1_score

## Test model
Okay, finally we can define a simple model:

In [None]:
model = Sequential()
model.add(Dense(512, input_shape=(max_vocabulary,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(y_train.shape[1]))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print(model.summary())


Let's try training for a few epochs:

In [None]:
%%time
history = model.fit(matrix_train, 
                    y_train, 
                    epochs=5, 
                    batch_size=128,
                    verbose=1)


In [None]:
prob_test = model.predict(matrix_test, batch_size=128)
pred_test = np.array(prob_test) > 0.2
print('F1 score: ', round(f1_score(y_test, pred_test, average='micro'), 4))
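
The threshold of 0.2 is fixed by hand in this notebook. One option, sketched below under the assumption that a held-out validation split is available (it should not be the test set), is to try a few thresholds and keep the one with the best micro F1. The data and the helper `micro_f1` are made up for illustration.

```python
import numpy as np

def micro_f1(y_true, y_pred):
    # pooled counts over all labels and articles
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)
    denom = np.sum(y_true) + np.sum(y_pred)      # equals 2*tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

# made-up validation labels and predicted probabilities (3 articles, 3 labels)
y_val = np.array([[1, 0, 0],
                  [0, 1, 1],
                  [0, 0, 0]])
prob_val = np.array([[0.90, 0.30, 0.10],
                     [0.25, 0.60, 0.40],
                     [0.10, 0.05, 0.30]])

candidate_thresholds = [0.2, 0.35, 0.5]
scores = {t: micro_f1(y_val, prob_val > t) for t in candidate_thresholds}
best_threshold = max(scores, key=scores.get)
print(scores)
print('best threshold:', best_threshold)
```

A low threshold favours recall and a high one favours precision, so the best value depends on how confident the model's probabilities are.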

Sanity check for the first article of the test set:

In [None]:
print(y_test[0])
print(pred_test[0])
print(prob_test[0])

## Save your model

Finally, save your best model and return it to the competition as an `h5` file, for example like this:

In [None]:
model.save('model.h5')

The model file should now be visible in the "Home" screen of the jupyter notebooks interface.  There you should be able to select it and press "download".

## Predict for test set

You will be asked to return your predictions for a separate test set.  These should be returned as a matrix with one row for each test article.  Each row contains a binary prediction for each label: 1 if the label applies to the article, and 0 if not. The order of the labels is the order of the label (topic) codes.

An example row could look like this if your system predicts the presence of the second and fourth topic:

    0 1 0 1 0 0 0 0 0 0 0 0 0 0 ...
    
If you have the binary matrix prepared in `y` (e.g., by thresholding the probabilities returned by `model.predict(x_test)`), you can use the following call to save it to a text file.

In [None]:
np.savetxt('results.txt', y, fmt='%d')
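
Since the network outputs probabilities, they have to be turned into 0/1 values before saving. Below is a small sketch of that step with made-up probabilities, writing to a temporary directory instead of the real `results.txt`; the threshold 0.2 is simply the one used earlier in this notebook.

```python
import os
import tempfile
import numpy as np

# made-up probabilities for 2 articles and 4 labels
prob = np.array([[0.05, 0.81, 0.10, 0.40],
                 [0.02, 0.01, 0.03, 0.04]])
y = (prob > 0.2).astype(int)          # binary predictions

path = os.path.join(tempfile.mkdtemp(), 'results.txt')
np.savetxt(path, y, fmt='%d')

with open(path) as f:
    saved_text = f.read()
print(saved_text)

# read the file back to confirm that it round-trips
loaded = np.loadtxt(path, dtype=int)
```

The first row reads `0 1 0 1`, i.e. the second and fourth topic, and the second row is all zeros for an article without any predicted topic.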