# DATA20001 Deep Learning - Group Project
## Text project

**Due Wednesday December 13, before 23:59.**

The task is to learn to assign the correct labels to news articles.  The corpus contains ~850K articles from Reuters.  The test set is about 10% of the articles. The data comes as raw, unprocessed XML files.

We're only giving you the code for downloading the data, and how to save the final model. The rest you'll have to do yourselves.

Some comments and hints particular to the project:

- One document may belong to many classes in this problem, i.e., it's a multi-label classification problem. In fact there are documents that don't belong to any class, and you should also be able to handle these correctly. Pay careful attention to how you design the outputs of the network (e.g., what activation to use) and what loss function should be used.
- You may use word embeddings to get better results. For example, you were already using a smaller version of the GloVe embeddings in exercise 4. Do note that these embeddings take a lot of memory, but if you use the Keras `Embedding` layer, it will be more efficient.
- Loading all documents into one big matrix as we have done in the exercises is not feasible (e.g. the virtual servers in CSC have only 3 GB of RAM). You need to load the documents in smaller chunks for the training. This shouldn't be a problem, as we are doing mini-batch training anyway, and thus we don't need to keep all the documents in memory. You can simply pass your current chunk of documents to `model.fit()`, as the model keeps the weights from the previous call.
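
The chunked training described in the last hint can be sketched roughly as follows. This is a minimal, dependency-free illustration and not part of the course code: `iter_chunks` and `train_in_chunks` are hypothetical helpers, and `fit_chunk` stands in for a call such as `model.fit(x_chunk, y_chunk, epochs=1)` on a model that keeps its weights between calls.

```python
from itertools import islice


def iter_chunks(items, chunk_size):
    """Yield successive lists of at most chunk_size items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def train_in_chunks(documents, chunk_size, fit_chunk, n_passes=1):
    """Call fit_chunk once per chunk, for n_passes passes over the data.

    fit_chunk is a placeholder for the real training step. For more than
    one pass, documents must be re-iterable (a list, not a generator).
    """
    n_calls = 0
    for _ in range(n_passes):
        for chunk in iter_chunks(documents, chunk_size):
            fit_chunk(chunk)
            n_calls += 1
    return n_calls


# Toy usage: record chunk sizes instead of training a real model.
seen = []
n_calls = train_in_chunks(range(10), chunk_size=4,
                          fit_chunk=lambda chunk: seen.append(len(chunk)))
print(n_calls, seen)
```

Only one chunk is held in memory at a time, so the chunk size can be chosen to fit the available RAM independently of the mini-batch size used inside each training call.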


## Download the data
Let's first set some paths & download the data set:

In [1]:
from src.data_utility import download_data 

database_path = 'train/'
corpus_path = database_path + 'REUTERS_CORPUS_2/'
data_path = corpus_path + 'data/'
codes_path = corpus_path + 'codes/'

download_data(database_path)

Using TensorFlow backend.


Data set already downloaded.
Data set already unzipped.


The above command downloads and extracts the data files into the `train` subdirectory.

The files can be found in `train/` and are named `19970405.zip`, etc. You will have to handle the contents of these zips to get the data. There is a readme with links to further descriptions of the data.

The class labels, or topics, can be found in the file `train/codes.zip`.  The zip contains a file called `topic_codes.txt`, which lists the special codes for the topics (about 130 of them) together with an explanation of what each code means.

The XML document files contain the article's headline, the main body text, and the list of topic labels assigned to each article.  You will have to extract the topics of each article from the XML.  For example:
`<code code="C18">` refers to the topic "OWNERSHIP CHANGES" (like a corporate buyout).

You should pre-process the XML to extract the words from the article: the `<headline>` element and the `<text>` element.  You should not need any other parts of the article.
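
As a rough sketch of this extraction step, the standard library's `xml.etree.ElementTree` can pull out the headline, the body text and the codes. The sample document below is a hypothetical, heavily simplified article built only from the element names mentioned above; the real files contain more elements, and the `parse_article` helper is an illustration rather than the course's own code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified article; real corpus files have more structure.
SAMPLE_XML = """<newsitem>
  <headline>Example Corp to buy Sample Ltd.</headline>
  <text>
    <p>Example Corp said on Monday it would buy Sample Ltd.</p>
    <p>Terms were not disclosed.</p>
  </text>
  <metadata>
    <codes>
      <code code="C18"/>
      <code code="CCAT"/>
    </codes>
  </metadata>
</newsitem>"""


def parse_article(xml_string):
    """Return (headline, list of text fragments, list of codes)."""
    root = ET.fromstring(xml_string)
    headline = (root.findtext("headline") or "").strip()
    text_element = root.find("text")
    paragraphs = []
    if text_element is not None:
        # itertext() walks all nested text, whatever the inner tags are.
        paragraphs = [t.strip() for t in text_element.itertext() if t.strip()]
    codes = [c.get("code") for c in root.iter("code") if c.get("code")]
    return headline, paragraphs, codes


headline, paragraphs, codes = parse_article(SAMPLE_XML)
print(headline)
print(paragraphs)
print(codes)
```

If the real files also carry codes that are not topics, the extracted list can be filtered against the codes listed in `topic_codes.txt`.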

## Reading the data set
First we will read the codes into a dictionary:

In [2]:
from src.data_utility import read_topics

(topics, topic_index, topic_labels) = read_topics(database_path)
n_class = len(topics)

print(n_class, ' different classes\n\n')
print(topics, '\n\n')           # topics codes as an array
print(topic_index, '\n\n')      # dictionary of topic code : index of this code in "topics array"

for key in topic_labels:        # dictionary of topic code : label of this topic code
    print(key, ' : ', topic_labels[key])


126  different classes


['1POL', '2ECO', '3SPO', '4GEN', '6INS', '7RSK', '8YDB', '9BNX', 'ADS10', 'BNW14', 'BRP11', 'C11', 'C12', 'C13', 'C14', 'C15', 'C151', 'C1511', 'C152', 'C16', 'C17', 'C171', 'C172', 'C173', 'C174', 'C18', 'C181', 'C182', 'C183', 'C21', 'C22', 'C23', 'C24', 'C31', 'C311', 'C312', 'C313', 'C32', 'C33', 'C331', 'C34', 'C41', 'C411', 'C42', 'CCAT', 'E11', 'E12', 'E121', 'E13', 'E131', 'E132', 'E14', 'E141', 'E142', 'E143', 'E21', 'E211', 'E212', 'E31', 'E311', 'E312', 'E313', 'E41', 'E411', 'E51', 'E511', 'E512', 'E513', 'E61', 'E71', 'ECAT', 'ENT12', 'G11', 'G111', 'G112', 'G113', 'G12', 'G13', 'G131', 'G14', 'G15', 'G151', 'G152', 'G153', 'G154', 'G155', 'G156', 'G157', 'G158', 'G159', 'GCAT', 'GCRIM', 'GDEF', 'GDIP', 'GDIS', 'GEDU', 'GENT', 'GENV', 'GFAS', 'GHEA', 'GJOB', 'GMIL', 'GOBIT', 'GODD', 'GPOL', 'GPRO', 'GREL', 'GSCI', 'GSPO', 'GTOUR', 'GVIO', 'GVOTE', 'GWEA', 'GWELF', 'M11', 'M12', 'M13', 'M131', 'M132', 'M14', 'M141', 'M142', 'M143', 'MCAT', 'MEUR', '

Let's read a small training and test set:

In [3]:
from src.data_utility import read_news

n_train = 10000
n_test = 10000

(news_train, tags_train, news_test, tags_test) = read_news(database_path, n_train, n_test, seed = 1234)

print(tags_train[0:3], '\n')
print(news_train[0], '\n')

print(tags_test[0:3], '\n')
print(news_test[0], '\n')

[['C15', 'C151', 'CCAT'], ['GCAT', 'GSPO'], ['E12', 'ECAT', 'M13', 'M132', 'MCAT']] 

['TABLE-Primus Telecommunications Q2 loss.', '3 Months', '\t\t\t     1997\t\t\t1996', ' Shr  loss\t\t   $0.50\t\t\t NA', ' Net  loss\t\t  $9,000     loss\t$8,900', ' Revs\t\t\t$70,000\t\t  $59,000', ' Avg shrs\t     17,778,731\t\t\t NA', '(All data above 000s except per share numbers)'] 

[['M13', 'M131', 'MCAT'], ['M13', 'M131', 'MCAT'], ['C15', 'C151', 'CCAT']] 

['Canadian T-bills open mostly weaker in quiet trade.', 'Canadian T-bills opened mostly weaker in quiet trade on Tuesday, taking much of their tone from U.S. Treasuries as dealers returned to the market following a long weekend in most of Canada.', '"I\'ve seen a couple of sellers in the front end, nothing too huge though," said one T-bill dealer with a bank-owned brokerage. "I think that the Canadian money market guys are going to start to play this thing more cautiously."', "Canada's three-month cash T-bill softened to yield 3.28 percent 

In [4]:
from src.data_utility import download_glove
embeddings_path = "embeddings/"
download_glove(embeddings_path)

GloVe Zip found


In [5]:
from src.data_utility import unzip_glove
zip_file_name = "glove.6B.zip"
unzip_glove(embeddings_path, zip_file_name)

Already unzipped


In [6]:
from src.data_utility import get_glove_embeddings
p=200
embeddings = get_glove_embeddings(p, embeddings_path)

In [7]:
print(len(embeddings["the"]))
print(len(embeddings.keys()))

200
400000


In [8]:
import os

In [9]:
from src.data_utility import process_data
if not os.path.exists("train/REUTERS_CORPUS_2/tokenized/"): process_data("train/")


In [10]:
from src.data_utility import build_dictionary
if not os.path.exists("dictionary.json"): build_dictionary("train/")

In [11]:
import json
with open("dictionary.json") as f:   # close the file once it has been read
    word_to_index = json.load(f)
dict_size = len(word_to_index)

In [12]:
from src.data_utility import vectorize_data
if not os.path.exists("train/REUTERS_CORPUS_2/vectorized/"): vectorize_data("train/")

In [13]:
from src.data_utility import get_vectorized_data
vectorized_data_path = "train/REUTERS_CORPUS_2/vectorized/"
tags_path="train/REUTERS_CORPUS_2/tags/"
n_train=10000
n_test=10000
(news_train, tags_train, news_test, tags_test) = get_vectorized_data(vectorized_data_path, tags_path, n_train, n_test, seed = 1234)



In [14]:
import numpy as np
print(news_train[1])
print(word_to_index["NUM"])
lengths = np.array([len(x) for x in news_train])
print(lengths.mean() + np.sqrt(lengths.var()))

[  9263 100301   7404   7511   9263   9294    460   7404   7511   3560
   4090   9041   1242   8212   4484   5633   1344   2939   1242   4595
      8   7597   7598      8      8      8      8      8      8      8
      8   7855      8      8      8      8      8      8      8      8
   7603      8      8      8      8      8      8      8      8   6701
      8      8      8      8      8      8      8      8   1167      8
      8      8      8      8      8      8      8   1168      8      8
      8      8      8      8      8   9041    764      8   7582      8
      8      8      8      8      8      8      8   1165      8      8
      8      8      8      8      8      8   1166      8      8      8
      8      8      8      8      8    492      8      8      8      8
      8      8      8      8   7583      8      8      8      8      8
      8      8      8    570      8      8      8      8      8      8
      8   2036   5659   2170   7427   7672  12077   2902]
8
289.834024572


In [15]:
#Work towards LSTM solution
from keras.preprocessing import sequence
max_news_length = int(np.percentile(lengths, 90))
news_train = sequence.pad_sequences(news_train, maxlen=max_news_length, padding='post', truncating='post')
news_test = sequence.pad_sequences(news_test, maxlen=max_news_length, padding='post', truncating='post')


In [16]:
print(news_train.shape)
print(news_test.shape)
print(list(tags_train[0]))

(10000, 326)
(10000, 326)
[15, 16, 44]


In [17]:
print(news_train[0])

[345747    893   2029   1814      8   1812      8      8   1807   1814
      8   4206   1804   1814      8   1814      8   1811      8      8
   2034   2035      8   4206   1816   1191   1817   1818   1819   1760
   1820      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      

In [18]:
print(n_class)
#encode responses
tags_train_matrix = np.zeros((n_train,n_class))
for ii in range(n_train):
    tags_train_matrix[ii, list(tags_train[ii])] = 1
    
tags_test_matrix = np.zeros((n_test,n_class))    
for ii in range(n_test):
    tags_test_matrix[ii, list(tags_test[ii])] = 1    

print(tags_train[0])    
print(tags_train_matrix[0,])

print(tags_test[0])    
print(tags_test_matrix[0,])

126
[15 16 44]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[116 117 123]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0. 

In [19]:
print(np.array(tags_train).shape)
print(len(word_to_index.keys()))

(10000,)
402715


In [20]:
embedding_matrix = np.zeros((len(word_to_index.keys())+1, p))
for word, i in word_to_index.items():
    vector = embeddings.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

In [21]:
print(embedding_matrix.shape)
print(embedding_matrix[0,:])
#print(embedding_matrix[1,:])

(402716, 200)
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.]


In [44]:
import keras.backend as K
# Note: this must be computed without rounding (unlike the standard F1 score), so that it can be used as a loss.
def f1_score_own(y_true, y_pred):

    # Count positive samples.
    c1 = K.sum(K.sigmoid(10 * (K.clip(y_true * y_pred, 0, 1)) - 0.5)) #special sigmoid for imitating the rounding
    c2 = K.sum(K.sigmoid(10 * (K.clip(y_pred, 0, 1)) - 0.5))
    c3 = K.sum(K.clip(y_true, 0, 1))

    # If there are no true samples, fix the F1 score at 0.
    if c3 == 0:
        return 0

    # How many selected items are relevant?
    precision = c1 / c2

    # How many relevant items are selected?
    recall = c1 / c3

    # Calculate f1_score
    f1_score = 2 * (precision * recall) / (precision + recall)
    return -1 * f1_score #loss, we are trying to max the f1 score

In [None]:
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout
from keras.layers.embeddings import Embedding
from keras import metrics

In [24]:
# create the model with a simple LSTM layer
batch_size = 64
epochs = 5
model = Sequential()

embedding_layer = Embedding(len(word_to_index.keys())+1,
                                p,
                                weights=[embedding_matrix],
                                input_length=max_news_length,
                                trainable=False)

model.add(embedding_layer)
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dropout(0.2))
model.add(Dense(n_class, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
print(model.summary())
model.fit(np.array(news_train), np.array(tags_train_matrix),epochs=epochs, batch_size=batch_size)
#validation_data=(np.array(news_test), np.array(tags_test))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 326, 200)          80543200  
_________________________________________________________________
dropout_1 (Dropout)          (None, 326, 200)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               120400    
_________________________________________________________________
dropout_2 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 126)               12726     
Total params: 80,676,326
Trainable params: 133,126
Non-trainable params: 80,543,200
_________________________________________________________________
None
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1c1be67c88>

In [25]:
from sklearn.metrics import f1_score
prob_test = model.predict(np.array(news_test), batch_size=batch_size)
pred_test = np.array(prob_test) > 0.2
print('F1 score: ', round(f1_score(tags_test_matrix, pred_test, average='micro'), 4))

F1 score:  0.3372


Training seems to get stuck, or to converge to a solution that performs very poorly.

In [46]:
#Model with f1 loss
batch_size = 64
epochs = 2
model = Sequential()

embedding_layer = Embedding(len(word_to_index.keys())+1,
                                p,
                                weights=[embedding_matrix],
                                input_length=max_news_length,
                                trainable=False)

model.add(embedding_layer)
model.add(Dropout(0.1))
model.add(LSTM(5))
model.add(Dense(512))
model.add(Dropout(0.5))
model.add(Dense(n_class, activation='sigmoid'))
model.compile(loss=f1_score_own, optimizer='adam', metrics=["accuracy"])
print(model.summary())
model.fit(np.array(news_train), np.array(tags_train_matrix),epochs=epochs, batch_size=batch_size)
#validation_data=(np.array(news_test), np.array(tags_test))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_11 (Embedding)     (None, 326, 200)          80543200  
_________________________________________________________________
dropout_21 (Dropout)         (None, 326, 200)          0         
_________________________________________________________________
lstm_11 (LSTM)               (None, 5)                 4120      
_________________________________________________________________
dense_12 (Dense)             (None, 512)               3072      
_________________________________________________________________
dropout_22 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_13 (Dense)             (None, 126)               64638     
Total params: 80,615,030
Trainable params: 71,830
Non-trainable params: 80,543,200
___________________________________________________________

<keras.callbacks.History at 0x1c24c91080>

In [47]:
prob_test = model.predict(np.array(news_test), batch_size=batch_size)
pred_test = np.array(prob_test) > 0.2
print('F1 score: ', round(f1_score(tags_test_matrix, pred_test, average='micro'), 4))

F1 score:  0.0


The custom F1 loss does not seem to be working: the test F1 score is 0.0. :(
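
Two details of `f1_score_own` may contribute to this, though we have not verified either: the `if c3 == 0` test is applied to a symbolic tensor rather than to its value, and `sigmoid(10 * x - 0.5)` maps an input of 0 to roughly 0.38 instead of a value near 0. A commonly used alternative is a "soft" F1 that treats the predicted probabilities as fractional counts. The pure-Python sketch below only illustrates the arithmetic; the function names are our own, and a usable Keras loss would need the same formula written with backend operations.

```python
def soft_f1(y_true, y_prob, eps=1e-7):
    """Soft micro-F1 from 0/1 labels and predicted probabilities.

    Probabilities are used directly as fractional counts, so no rounding
    is involved and the value changes smoothly with the predictions.
    """
    tp = fp = fn = 0.0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            tp += t * p            # probability mass on true labels
            fp += (1 - t) * p      # probability mass on wrong labels
            fn += t * (1 - p)      # missing mass on true labels
    return 2 * tp / (2 * tp + fp + fn + eps)


def soft_f1_loss(y_true, y_prob):
    """Loss to minimise: 1 - soft F1, which lies between 0 and 1."""
    return 1.0 - soft_f1(y_true, y_prob)


y_true = [[1, 0, 1], [0, 0, 0]]   # the second article has no labels
good = [[0.9, 0.1, 0.8], [0.1, 0.0, 0.1]]
bad = [[0.1, 0.9, 0.2], [0.8, 0.7, 0.9]]
print(soft_f1_loss(y_true, good), soft_f1_loss(y_true, bad))
```

The small `eps` keeps the ratio defined when a batch contains no positive labels and no predicted mass, which replaces the explicit zero check.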

In [53]:
from keras.layers import Conv1D, MaxPooling1D, Flatten
# create the model with a CNN layer
batch_size = 64
epochs = 5
n_convolutions = 64
kernel_size = 8
pooling_size = 3
model = Sequential()

embedding_layer = Embedding(len(word_to_index.keys())+1,
                                p,
                                weights=[embedding_matrix],
                                input_length=max_news_length,
                                trainable=True)

model.add(embedding_layer)
model.add(Conv1D(filters=n_convolutions, kernel_size=kernel_size, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=pooling_size))
model.add(Flatten())
model.add(Dense(512, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(n_class, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
print(model.summary())
model.fit(np.array(news_train), np.array(tags_train_matrix),epochs=epochs, batch_size=batch_size)
#validation_data=(np.array(news_test), np.array(tags_test))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_14 (Embedding)     (None, 326, 200)          80543200  
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 326, 64)           102464    
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 108, 64)           0         
_________________________________________________________________
flatten_3 (Flatten)          (None, 6912)              0         
_________________________________________________________________
dense_18 (Dense)             (None, 512)               3539456   
_________________________________________________________________
dropout_25 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_19 (Dense)             (None, 126)               64638     
Total para

<keras.callbacks.History at 0x1c2548cf98>

In [54]:
prob_test = model.predict(np.array(news_test), batch_size=batch_size)
pred_test = np.array(prob_test) > 0.2
print('F1 score: ', round(f1_score(tags_test_matrix, pred_test, average='micro'), 4))

F1 score:  0.7969


In [55]:
# create the model with LSTM and CNN layer
batch_size = 64
epochs = 5
n_convolutions = 32
kernel_size = 3
pooling_size = 3
model = Sequential()

embedding_layer = Embedding(len(word_to_index.keys())+1,
                                p,
                                weights=[embedding_matrix],
                                input_length=max_news_length,
                                trainable=False)

model.add(embedding_layer)
model.add(Conv1D(filters=n_convolutions, kernel_size=kernel_size, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=pooling_size))
model.add(LSTM(50, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(512, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(n_class, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
print(model.summary())
model.fit(np.array(news_train), np.array(tags_train_matrix),epochs=epochs, batch_size=batch_size)
#validation_data=(np.array(news_test), np.array(tags_test))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_15 (Embedding)     (None, 326, 200)          80543200  
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 326, 32)           19232     
_________________________________________________________________
max_pooling1d_4 (MaxPooling1 (None, 108, 32)           0         
_________________________________________________________________
lstm_12 (LSTM)               (None, 50)                16600     
_________________________________________________________________
dense_20 (Dense)             (None, 512)               26112     
_________________________________________________________________
dropout_26 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_21 (Dense)             (None, 126)               64638     
Total para

<keras.callbacks.History at 0x1c257f36a0>

In [56]:
prob_test = model.predict(np.array(news_test), batch_size=batch_size)
pred_test = np.array(prob_test) > 0.2
print('F1 score: ', round(f1_score(tags_test_matrix, pred_test, average='micro'), 4))

F1 score:  0.4178


The CNN alone seems to do a better job than the CNN combined with an LSTM. However, based on our tests, adding the Dense layer with a large number of hidden units actually has the highest impact.
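
To see where the trainable capacity of these two models sits, the parameter counts in the summaries above can be reproduced by hand. The helpers below use the standard formulas for dense, 1D convolution and LSTM layers with biases; the function names are ours. The four printed values should match `conv1d_3`, `dense_18`, `lstm_12` and `dense_20` in the summaries (102464, 3539456, 16600 and 26112).

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units


def conv1d_params(kernel_size, n_channels, n_filters):
    """One kernel of shape (kernel_size, n_channels) plus a bias per filter."""
    return kernel_size * n_channels * n_filters + n_filters


def lstm_params(n_inputs, n_units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (n_units * (n_inputs + n_units) + n_units)


print(conv1d_params(8, 200, 64))   # conv1d_3 in the CNN model
print(dense_params(6912, 512))     # dense_18, fed by Flatten in the CNN model
print(lstm_params(32, 50))         # lstm_12 in the CNN + LSTM model
print(dense_params(50, 512))       # dense_20, fed by the LSTM
```

The Dense layer after `Flatten` holds about 3.5 million weights, against about 26 thousand when it follows the 50-unit LSTM, which is one possible reading of why the plain CNN has more room to fit the data.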

In [84]:
# create a model with multiple size kernels
from keras.layers import GlobalMaxPooling1D, Merge, Concatenate
batch_size = 64
epochs = 5
n_convolutions = 64
kernel_size = 8
pooling_size = 3
kernels = (3,5,8,10)
n_filters = 128

submodels = []
for kw in kernels:    # kernel sizes
    submodel = Sequential()
    submodel.add(Embedding(len(word_to_index.keys())+1,
                                p,
                                weights=[embedding_matrix],
                                input_length=max_news_length,
                                trainable=False))
    submodel.add(Conv1D(n_filters,
                        kw,
                        padding='valid',
                        activation='relu',
                        strides=1))
    submodel.add(GlobalMaxPooling1D())
    submodels.append(submodel)
    
model = Sequential()
model.add(Merge(submodels, mode="concat"))
model.add(Dense(512, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(n_class, activation="sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
print(model.summary())

news_train_rep = [np.array(news_train)] * len(kernels)

model.fit(news_train_rep, np.array(tags_train_matrix),epochs=epochs, batch_size=batch_size)
#validation_data=(np.array(news_test), np.array(tags_test))



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
merge_17 (Merge)             (None, 512)               0         
_________________________________________________________________
dense_54 (Dense)             (None, 512)               262656    
_________________________________________________________________
dropout_43 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_55 (Dense)             (None, 126)               64638     
Total params: 323,166,206
Trainable params: 993,406
Non-trainable params: 322,172,800
_________________________________________________________________
None
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1c26d8e390>

In [86]:
news_test_rep = [np.array(news_test)] * len(kernels)
prob_test = model.predict(news_test_rep, batch_size=batch_size)
pred_test = np.array(prob_test) > 0.2
print('F1 score: ', round(f1_score(tags_test_matrix, pred_test, average='micro'), 4))

F1 score:  0.7764


It seems to converge fast, but the final result is no better than the other models. We can try to boost this with an LSTM layer.

In [91]:
# create a model with multiple size kernels and LSTM
from keras.layers import GlobalMaxPooling1D, Merge, Concatenate
batch_size = 64
epochs = 5
n_convolutions = 64
kernel_size = 8
pooling_size = 5
kernels = (3,5,8,10)
n_filters = 128

submodels = []
for kw in kernels:    # kernel sizes
    submodel = Sequential()
    submodel.add(Embedding(len(word_to_index.keys())+1,
                                p,
                                weights=[embedding_matrix],
                                input_length=max_news_length,
                                trainable=False))
    submodel.add(Conv1D(n_filters,
                        kw,
                        padding='valid',
                        activation='relu',
                        strides=1))
    submodel.add(GlobalMaxPooling1D())
    submodels.append(submodel)
    
model = Sequential()
model.add(Merge(submodels, mode="concat"))
model.add(MaxPooling1D(pool_size=pooling_size))
model.add(Flatten())
model.add(LSTM(100, return_sequences=False))
model.add(Dense(512, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(n_class, activation="sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
print(model.summary())

news_train_rep = [np.array(news_train)] * len(kernels)

model.fit(news_train_rep, np.array(tags_train_matrix),epochs=epochs, batch_size=batch_size)
#validation_data=(np.array(news_test), np.array(tags_test))



ValueError: Input 0 is incompatible with layer max_pooling1d_7: expected ndim=3, found ndim=2
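
The `ValueError` above is consistent with the layer shapes: each branch ends in `GlobalMaxPooling1D`, which removes the time axis, so the merged output is two-dimensional (batch, features), while `MaxPooling1D` and `LSTM` expect three dimensions (batch, steps, features). The sketch below redoes that shape arithmetic in plain Python with hypothetical helper names; it does not build a Keras model.

```python
def conv1d_valid_length(input_length, kernel_size, stride=1):
    """Output length of a 1D convolution with padding='valid'."""
    return (input_length - kernel_size) // stride + 1


def merged_branch_shape(input_length, kernels, n_filters):
    """Per-sample output shape after conv -> global max pool -> concat."""
    per_branch = []
    for k in kernels:
        steps = conv1d_valid_length(input_length, k)   # conv output: (steps, n_filters)
        assert steps > 0, "kernel longer than the input"
        per_branch.append(n_filters)                   # global pooling drops the steps axis
    return (sum(per_branch),)


shape = merged_branch_shape(326, (3, 5, 8, 10), 128)
print(shape, "-> ndim with batch axis:", len(shape) + 1)
```

This gives 512 features per article, matching the `(None, 512)` output of the merge layer in the previous model. One possible direction, not tested here, would be to keep the time axis in every branch (for example with ordinary pooling and equal output lengths) before handing the sequence to an LSTM.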

In [85]:
prob_test = model.predict(news_test_rep, batch_size=batch_size)
pred_test = np.array(prob_test) > 0.2
print('F1 score: ', round(f1_score(tags_test_matrix, pred_test, average='micro'), 4))

F1 score:  0.7764


Let's convert the target variable into a binary indicator (multi-hot) encoding, with one column per topic:

In [4]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(classes=topics)   # fix the column order to the topic codes
y_train = mlb.fit_transform(tags_train)
y_test = mlb.transform(tags_test)           # reuse the fitted binarizer for the test set

print(y_train.shape)
print(y_test.shape, '\n')
print(y_train[0])

(10000, 126)
(10000, 126) 

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


## Preprocessing the data set (this is just for testing)
Then we will convert the training and test sets into a binary bag-of-words encoding:

In [5]:
from keras.preprocessing.text import Tokenizer
import itertools

max_vocabulary = 30000 # take only max_vocabulary most popular words
tokenizer = Tokenizer(max_vocabulary)

# concatenate each news item into a single string
words_train = [' '.join(filter(None, news_item)) for news_item in news_train] 
tokenizer.fit_on_texts(words_train)
matrix_train = tokenizer.texts_to_matrix(words_train)

words_test = [' '.join(filter(None, news_item)) for news_item in news_test] 
matrix_test = tokenizer.texts_to_matrix(words_test)

print(matrix_train.shape)
print(matrix_test.shape)

(10000, 30000)
(10000, 30000)
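
What the resulting matrices contain can be illustrated with a small pure-Python version of the same idea. This is a simplified sketch with hypothetical helper names: the Keras `Tokenizer` differs in details such as punctuation filtering and index handling, so the columns will not line up exactly with its output.

```python
from collections import Counter


def fit_vocabulary(texts, max_words):
    """Map the max_words most frequent lower-cased tokens to column indices."""
    counts = Counter(token for text in texts for token in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(max_words))}


def texts_to_binary_matrix(texts, vocabulary):
    """One row per text; an entry is 1 if the word occurs in the text, else 0."""
    matrix = []
    for text in texts:
        row = [0] * len(vocabulary)
        for token in text.lower().split():
            index = vocabulary.get(token)
            if index is not None:      # words outside the vocabulary are dropped
                row[index] = 1
        matrix.append(row)
    return matrix


train_texts = ["shares rose as profit rose", "profit fell"]
vocabulary = fit_vocabulary(train_texts, max_words=3)
test_matrix = texts_to_binary_matrix(["profit rose sharply"], vocabulary)
print(vocabulary)
print(test_matrix)
```

As in the cell above, the vocabulary is fitted on the training texts only and then reused for the test texts, so unseen test words such as "sharply" are simply ignored.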


Let's import the F1 score, which is our evaluation metric:

In [6]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import adam
import numpy as np
from sklearn.metrics import f1_score

## Test model
Okay, finally we can define a simple model:

In [7]:
model = Sequential()
model.add(Dense(512, input_shape=(max_vocabulary,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(y_train.shape[1]))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print(model.summary())


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 512)               15360512  
_________________________________________________________________
activation_1 (Activation)    (None, 512)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 126)               64638     
_________________________________________________________________
activation_2 (Activation)    (None, 126)               0         
Total params: 15,425,150
Trainable params: 15,425,150
Non-trainable params: 0
_________________________________________________________________
None


Let's try training for some iterations:

In [8]:
%%time
history = model.fit(matrix_train, 
                    y_train, 
                    epochs=5, 
                    batch_size=128,
                    verbose=1)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
CPU times: user 6min 36s, sys: 47.8 s, total: 7min 24s
Wall time: 2min 25s


In [9]:
prob_test = model.predict(matrix_test, batch_size=128)
pred_test = np.array(prob_test) > 0.2
print('F1 score: ', round(f1_score(y_test, pred_test, average='micro'), 4))

F1 score:  0.7999
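
For reference, the micro-averaged F1 used in these evaluations pools true positives, false positives and false negatives over all articles and all labels before computing a single score. The sketch below shows that computation in plain Python under our own naming; it is meant to illustrate the metric and the role of the 0.2 threshold, not to replace `sklearn.metrics.f1_score`.

```python
def micro_f1(y_true, y_prob, threshold=0.2):
    """Micro-averaged F1: pool TP/FP/FN over all articles and all labels."""
    tp = fp = fn = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            predicted = p > threshold
            if predicted and t:
                tp += 1
            elif predicted and not t:
                fp += 1
            elif t:
                fn += 1
    if tp == 0:
        return 0.0   # no correct positive predictions at all
    return 2 * tp / (2 * tp + fp + fn)


y_true = [[1, 0, 1], [0, 0, 0]]               # the second article has no labels
y_prob = [[0.9, 0.3, 0.1], [0.05, 0.1, 0.0]]
print(micro_f1(y_true, y_prob))
```

Because the threshold decides which probabilities count as predictions, it acts as a tunable hyperparameter: trying a few values on held-out data may change the score noticeably.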


Sanity check for the first article of the test set:

In [10]:
print(y_test[0])
print(pred_test[0])
print(prob_test[0])

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False  True False False False False False
 False False False False False False False False False False False False
  True False False False False False False False False False False False
 False False 

## Save your model

Finally, save your best model for the competition and return it as an `h5` file, for example like this:

In [None]:
model.save('model.h5')

The model file should now be visible in the "Home" screen of the jupyter notebooks interface.  There you should be able to select it and press "download".

## Predict for test set

You will be asked to return your predictions for a separate test set.  These should be returned as a matrix with one row for each test article.  Each row contains a binary prediction for each label: 1 if the label applies to the article, and 0 if not. The order of the labels is the order of the label (topic) codes.

An example row could look like this if your system predicts the presence of the second and fourth topic:

    0 1 0 1 0 0 0 0 0 0 0 0 0 0 ...
    
If you have the binary matrix prepared in `y` (e.g., by thresholding the probabilities returned by `model.predict(x_test)`), you can use the following function to save it to a text file.

In [None]:
np.savetxt('results.txt', y, fmt='%d')
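
One pitfall worth checking before submitting: integer formatting such as `%d` drops the fractional part, so raw probabilities would all be written as 0 unless they are thresholded first. The sketch below shows the thresholding and the expected file layout in plain Python. The helper names and the default threshold are our own assumptions, not part of the course code.

```python
def probabilities_to_rows(probabilities, threshold=0.5):
    """Turn rows of probabilities into lines of 0/1 entries separated by spaces."""
    return [" ".join("1" if p > threshold else "0" for p in row)
            for row in probabilities]


def save_predictions(path, probabilities, threshold=0.5):
    """Write one line per article, one 0/1 entry per topic."""
    with open(path, "w") as f:
        for line in probabilities_to_rows(probabilities, threshold):
            f.write(line + "\n")


probabilities = [[0.1, 0.9, 0.2, 0.7], [0.0, 0.1, 0.3, 0.05]]
rows = probabilities_to_rows(probabilities)
print(rows)
print("%d" % 0.9)   # integer formatting truncates an unthresholded probability
```

The second article here gets an all-zero row, which is a valid prediction for a document that belongs to no class.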