# DATA20001 Deep Learning - Group Project
## Text project

**Due Wednesday December 13, before 23:59.**

The task is to learn to assign the correct labels to news articles.  The corpus contains ~850K articles from Reuters.  The test set is about 10% of the articles. The data comes as raw XML files, so you will have to extract the text and the labels yourselves.

We only provide the code for downloading the data and an example of how to save the final model. The rest you'll have to do yourselves.

Some comments and hints particular to the project:

- One document may belong to many classes in this problem, i.e., it's a multi-label classification problem. In fact there are documents that don't belong to any class, and you should also be able to handle these correctly. Pay careful attention to how you design the outputs of the network (e.g., what activation to use) and what loss function should be used.
- You may use word embeddings to get better results. For example, you already used a smaller version of the GloVe embeddings in exercise 4. Do note that these embeddings take a lot of memory, but using the Keras `Embedding` layer is more efficient.
- Loading all documents into one big matrix as we have done in the exercises is not feasible (e.g. the virtual servers in CSC have only 3 GB of RAM). You need to load the documents in smaller chunks for the training. This shouldn't be a problem, as we are doing mini-batch training anyway, and thus we don't need to keep all the documents in memory. You can simply pass your current chunk of documents to `model.fit()`, as the model keeps its weights from the previous call.
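
The first hint can be made concrete with a small, framework-free sketch. It assumes one independent sigmoid output per topic, scored with binary cross-entropy; in Keras this would correspond to a final `Dense(n_class, activation='sigmoid')` layer trained with `loss='binary_crossentropy'`. The raw scores and the three-topic setup below are made up for illustration.

```python
import math

def sigmoid(z):
    """Squash a raw score into an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over all labels of one document."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical raw network outputs (logits) for three topics.
logits_two_labels = [3.0, -4.0, 2.5]
logits_no_label = [-3.0, -5.0, -2.0]

probs_two = [sigmoid(z) for z in logits_two_labels]
probs_none = [sigmoid(z) for z in logits_no_label]

# Each label is decided independently, so a document can get several labels or none.
pred_two = [int(p > 0.5) for p in probs_two]
pred_none = [int(p > 0.5) for p in probs_none]
print(pred_two, pred_none)

# The loss is small when the probabilities agree with the targets and large otherwise.
loss_good = binary_cross_entropy([1, 0, 1], probs_two)
loss_bad = binary_cross_entropy([0, 1, 0], probs_two)
print(loss_good, loss_bad)
```

A softmax output would force the probabilities to sum to one, which fits neither a document with several topics nor one with no topic at all.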


## Download the data

In [1]:
from keras.utils.data_utils import get_file
import os
import zipfile
import itertools
from collections import Counter


database_path = 'train/'
corpus_path = database_path + 'REUTERS_CORPUS_2/'
data_path = corpus_path + 'data/'
codes_path = corpus_path + 'codes/'

if not os.path.exists(database_path):
    dl_file='reuters.zip'
    dl_url='https://www.cs.helsinki.fi/u/jgpyykko/'
    get_file(dl_file, dl_url+dl_file, cache_dir='./', cache_subdir=database_path, extract=True)
else:
    print('Data set already downloaded.')

if not os.path.exists(data_path):
    print('\n\nUnzipping data...')
    
    codes_zip = corpus_path + 'codes.zip'
    with zipfile.ZipFile(codes_zip, 'r') as zip_ref:
        zip_ref.extractall(codes_path)
    os.remove(codes_zip)
   
    dtds_zip = corpus_path + 'dtds.zip'
    with zipfile.ZipFile(dtds_zip, 'r') as zip_ref:
        zip_ref.extractall(corpus_path + 'dtds/')
    os.remove(dtds_zip)
    
    for item in os.listdir(corpus_path): 
        if item.endswith('zip'):
            file_name = corpus_path + item 
            with zipfile.ZipFile(file_name, 'r') as zip_ref:
                zip_ref.extractall(data_path)
            os.remove(file_name) 
    
    print('Data set unzipped.')
else:
    print('\nData set already unzipped.')

Using TensorFlow backend.


Data set already downloaded.

Data set already unzipped.


The above cell downloads the data files into the `train` subdirectory and extracts them.

The document archives can be found in `train/REUTERS_CORPUS_2/`, and are named as `19970405.zip`, etc. You will have to manage the content of these zips to get the data. There is a readme which has links to further descriptions on the data.

The class labels, or topics, can be found in the archive `train/REUTERS_CORPUS_2/codes.zip`.  The zip contains a file called `topic_codes.txt`.  This file lists the codes for the topics (about 130 of them) together with an explanation of what each code means.

The XML document files contain the article's headline, the main body text, and the list of topic labels assigned to each article.  You will have to extract the topics of each article from the XML.  For example:
`<code code="C18">` refers to the topic "OWNERSHIP CHANGES" (like a corporate buyout).

You should pre-process the XML to extract the words from the article: the `<headline>` element and the `<text>` element.  You should not need any other parts of the article.

## Preprocessing the data
First we will read the topic codes into a dictionary:

In [2]:
topics = []
topic_labels = {}
codes_file = codes_path + 'topic_codes.txt'
with open(codes_file) as f:
    for line in f:
        # Skip comment lines (starting with ';') and blank lines.
        if line.startswith(';') or not line.strip():
            continue
        splits = line.split()
        topic_code = splits[0]
        topic_labels[topic_code] = ' '.join(splits[1:])
        topics.append(topic_code)

n_class = len(topics)
# Position of each topic code in the output vector
topic_index = {code: i for i, code in enumerate(topics)}


print(n_class, ' different classes\n\n')
print(topics, '\n\n')
print(topic_index, '\n\n')

for key in topic_labels:
    print(key, ' : ', topic_labels[key])

for key in topic_index:
    print(key, ' : ', topic_index[key])


126  different classes


['1POL', '2ECO', '3SPO', '4GEN', '6INS', '7RSK', '8YDB', '9BNX', 'ADS10', 'BNW14', 'BRP11', 'C11', 'C12', 'C13', 'C14', 'C15', 'C151', 'C1511', 'C152', 'C16', 'C17', 'C171', 'C172', 'C173', 'C174', 'C18', 'C181', 'C182', 'C183', 'C21', 'C22', 'C23', 'C24', 'C31', 'C311', 'C312', 'C313', 'C32', 'C33', 'C331', 'C34', 'C41', 'C411', 'C42', 'CCAT', 'E11', 'E12', 'E121', 'E13', 'E131', 'E132', 'E14', 'E141', 'E142', 'E143', 'E21', 'E211', 'E212', 'E31', 'E311', 'E312', 'E313', 'E41', 'E411', 'E51', 'E511', 'E512', 'E513', 'E61', 'E71', 'ECAT', 'ENT12', 'G11', 'G111', 'G112', 'G113', 'G12', 'G13', 'G131', 'G14', 'G15', 'G151', 'G152', 'G153', 'G154', 'G155', 'G156', 'G157', 'G158', 'G159', 'GCAT', 'GCRIM', 'GDEF', 'GDIP', 'GDIS', 'GEDU', 'GENT', 'GENV', 'GFAS', 'GHEA', 'GJOB', 'GMIL', 'GOBIT', 'GODD', 'GPOL', 'GPRO', 'GREL', 'GSCI', 'GSPO', 'GTOUR', 'GVIO', 'GVOTE', 'GWEA', 'GWELF', 'M11', 'M12', 'M13', 'M131', 'M132', 'M14', 'M141', 'M142', 'M143', 'MCAT', 'MEUR', '

Then we will parse the xml files. Let's first try to parse one file:

In [3]:
import xml.etree.ElementTree as etree

filename_xml = '810900newsML.xml'
file_xml = data_path + filename_xml

def read_xml_file(file_xml):
    """Return [sentences, tags]: headline and paragraph texts, and topic codes."""
    sentences = []
    tags = []
    read_tags = False
    for event, elem in etree.iterparse(file_xml, events=('start', 'end')):
        # Strip a possible '{namespace}' prefix from the tag name.
        tname = elem.tag.rsplit('}', 1)[-1]
        is_topic_codes = (tname == 'codes'
                          and elem.attrib.get('class') == 'bip:topics:1.0')

        if event == 'start':
            if is_topic_codes:
                read_tags = True
            elif tname == 'code' and read_tags:
                tags.append(elem.attrib['code'])
        else:  # event == 'end'
            # elem.text is None for empty elements; skip those.
            if tname in ('headline', 'p') and elem.text:
                sentences.append(elem.text)
            elif is_topic_codes:
                read_tags = False

    return [sentences, tags]
    
(sentences, tags) = read_xml_file(file_xml)

print('\n\ntags: ', tags, '\n\n')
print('sentences: ', sentences, '\n\n')
for i in range(len(sentences)):
    print(sentences[i], '\n')



tags:  ['GCAT', 'GVIO'] 


sentences:  ['Sri Lanka rebels attack government troops in north.', "Heavy fighting erupted between government troops and Liberation Tigers of Tamil Eelam (LTTE) rebels in Sri Lanka's northern Wanni region late on Friday, military officials said on Saturday.", "They said the rebels had attacked the military's defensive positions just north of the government-held town of Vavuniya, some 220 km (135 miles) north of the capital Colombo.", 'The military, which launched a major offensive in May, is battling rebels in the Wanni to open a strategic highway linking the northern Jaffna peninsula with the rest of the island.', "The military officials said defences guarded by police and the navy had been breached but troops had linked up again after repulsing Friday's attack.", 'Casualty figures and other details were not immediately available, but officials said troops were clearing the area.', 'An undisclosed number of wounded had been airlifted to Anuradhapura milit

Let's read a small training and test set:

In [4]:
import random

random.seed(123)
n_train = 10
n_test = 10

data_list = os.listdir(data_path)
n_samples = len(data_list)
random_indices = random.sample(range(n_samples), n_train + n_test)

train_indices = random_indices[0:n_train]
test_indices = random_indices[n_train:(n_train + n_test)]

train_list = [data_list[i] for i in train_indices]
test_list = [data_list[i] for i in test_indices]

news_train = []
tags_train = []
for file_name in train_list:
    file_xml = data_path + file_name 
    (sentences, tags) = read_xml_file(file_xml)
    
    news_train.append(sentences)
    tags_train.append(tags)

print(tags_train[0:3], '\n')
for i in range(3):
    print(news_train[i], '\n')

news_test = []
tags_test = []
for file_name in test_list:
    file_xml = data_path + file_name 
    (sentences, tags) = read_xml_file(file_xml)
    
    news_test.append(sentences)
    tags_test.append(tags)

print(tags_test[0:3], '\n')
for i in range(3):
    print(news_test[i], '\n')

[['M14', 'M141', 'MCAT'], ['C18', 'C181', 'CCAT'], ['GCAT', 'GCRIM', 'GDIP', 'GVIO']] 

['Australia says 1997/98 oat prices depend on rain.', 'The Australian Barley Board (ABB) said on Tuesday the Australian domestic market could be expected to have a major effect on oat prices in 1997/98, but just how much would depend on rainfall in spring.', '"Our pool forecasts have been increased to A$140 per tonne for milling oats (in Victoria, oats number 1), and A$130 for feed oats, although these are still based on export potential," ABB chief executive Michael Iwaniw said in a statement.  ', 'Iwaniw said there were many negative factors on oats in the current world market, with a considerable production boost in the United States expected to offset slightly lower oat production in Canada and Sweden.', 'Domestic oat prices were high at present along with all grains in demand for the intensive stockfeed industries, particularly in Victoria, and demand was also evident from the grazing industry 
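
The loops above read only 20 files. For the full corpus the same pattern can be applied one chunk at a time, as suggested in the hints. A minimal sketch of such a chunking helper follows; `iter_chunks`, `chunk_size`, and the file names are hypothetical, and the commented-out training call assumes a compiled Keras `model` plus your own `vectorize` function.

```python
from itertools import islice

def iter_chunks(items, chunk_size):
    """Yield successive lists of at most chunk_size items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk

# Hypothetical file names standing in for os.listdir(data_path).
file_names = ['doc%d.xml' % i for i in range(7)]

chunks = list(iter_chunks(file_names, chunk_size=3))
for chunk in chunks:
    print(len(chunk), chunk)
    # For each chunk one would parse the files and train on them, e.g.:
    #   parsed = [read_xml_file(data_path + name) for name in chunk]
    #   x, y = vectorize(parsed)      # hypothetical helper
    #   model.fit(x, y, epochs=1)     # weights persist between calls
```

Whatever vocabulary or tokenizer you use has to be fixed before the first chunk is vectorized, so that every chunk maps to the same input columns.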

Then we will convert the training and test sets into a binary bag-of-words representation (one column per vocabulary word, set to 1 if the word occurs in the article):

In [6]:
from keras.preprocessing.text import Tokenizer
import itertools

tokenizer = Tokenizer()
words_train = [' '.join(news_item) for news_item in news_train] # concatenate each news item into a single string
tokenizer.fit_on_texts(words_train)
matrix_train = tokenizer.texts_to_matrix(words_train)

words_test = [' '.join(news_item) for news_item in news_test] 
matrix_test = tokenizer.texts_to_matrix(words_test)

print(matrix_train.shape)
print(matrix_test.shape)

(10, 791)
(10, 791)
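
To make these shapes less mysterious, here is a small pure-Python sketch of what a binary bag-of-words matrix contains. It is an illustration, not the Keras implementation: the tokenization is a plain `split()`, ties in word frequency are broken alphabetically, and the texts are made up. As in Keras, index 0 is left unused, which is why the matrix has one more column than there are distinct words.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an index starting at 1, most frequent words first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ordered = sorted(counts, key=lambda w: (-counts[w], w))  # ties: alphabetical
    return {word: i + 1 for i, word in enumerate(ordered)}

def texts_to_binary_matrix(texts, word_index):
    """One row per text; column j is 1 if the word with index j occurs in it."""
    n_cols = len(word_index) + 1  # column 0 stays unused
    matrix = []
    for text in texts:
        row = [0] * n_cols
        for word in text.lower().split():
            if word in word_index:  # words unseen during fitting are ignored
                row[word_index[word]] = 1
        matrix.append(row)
    return matrix

train_texts = ['oat prices rise', 'rebels attack troops']  # made-up examples
test_texts = ['oat prices fall']

word_index = fit_word_index(train_texts)
matrix_train_demo = texts_to_binary_matrix(train_texts, word_index)
matrix_test_demo = texts_to_binary_matrix(test_texts, word_index)
print(len(matrix_train_demo), len(matrix_train_demo[0]))
print(matrix_test_demo[0])
```

Because the word index is fitted on the training texts only, the test matrix has the same number of columns, and test words that never occurred in training (here `fall`) leave no trace.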


In [7]:
from keras.preprocessing.text import text_to_word_sequence

words = []
for s in sentences:
    words.append(text_to_word_sequence(s))

for w in words:
    print(w, '\n')
    
word_counts = Counter(itertools.chain(*words))
print('\n\n', word_counts)

# Alphabetically sorted list of distinct words (index -> word)
vocabulary_inv = sorted(word_counts)

# Mapping from word to index
vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}



['integrated', 'packaging', 'names', 'larrenaga', 'as', 'cfo'] 

['integrated', 'packaging', 'assembly', 'corp', 'said', 'on', 'tuesday', 'it', 'appointed', 'alfred', 'larrenaga', 'as', 'vice', 'president', 'and', 'chief', 'financial', 'officer'] 

['larrenaga', 'was', 'formerly', 'the', 'senior', 'vice', 'president', 'and', 'chief', 'financial', 'officer', 'of', 'southwall', 'technologies', 'inc', 'and', 'prior', 'to', 'that', 'vice', 'president', 'and', 'chief', 'financial', 'officer', 'of', 'asyst', 'technologies', 'inc'] 

['larrenaga', 'replaces', 'tony', 'lin', 'the', "company's", 'cfo', 'since', 'its', 'inception', 'in', '1993'] 

['integrated', 'packaging', 'is', 'a', 'semiconductor', 'packaging', 'foundry', 'which', 'gets', 'wafers', 'from', 'customers', 'and', 'assembles', 'and', 'encases', 'each', 'integrated', 'circuit', 'in', 'a', 'plastic', 'package'] 



 Counter({'and': 6, 'integrated': 4, 'packaging': 4, 'larrenaga': 4, 'vice': 3, 'president': 3, 'chief': 3, 'financial
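
If you plan to feed word indices into an embedding layer rather than a bag-of-words matrix, each document has to become a fixed-length sequence of integers. The sketch below shows one way to do that with a word-to-index mapping similar to `vocabulary`. Note the assumptions: indices are shifted by one so that 0 can serve as padding (the mapping above starts at 0), `max_len` is an arbitrary choice, and unknown words are simply dropped.

```python
def build_vocabulary(tokenized_docs):
    """Map each distinct word to an index >= 1, keeping 0 free for padding."""
    words = sorted({w for doc in tokenized_docs for w in doc})
    return {w: i + 1 for i, w in enumerate(words)}

def to_padded_sequence(tokens, vocab, max_len):
    """Convert tokens to indices, truncate to max_len, and right-pad with 0."""
    indices = [vocab[w] for w in tokens if w in vocab]
    indices = indices[:max_len]
    return indices + [0] * (max_len - len(indices))

docs = [
    ['integrated', 'packaging', 'names', 'larrenaga', 'as', 'cfo'],
    ['larrenaga', 'replaces', 'tony', 'lin'],
]
vocab = build_vocabulary(docs)
padded = [to_padded_sequence(doc, vocab, max_len=5) for doc in docs]
for row in padded:
    print(row)
```

The first document is truncated to five tokens and the second is padded with a trailing 0, so both rows have the same length and can be stacked into one integer matrix.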

Let's also convert the target variable into a multi-hot encoding (one 0/1 entry per topic):

In [29]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(classes=topics)  # fix the column order to the topic list
y_train = mlb.fit_transform(tags_train)
y_test = mlb.transform(tags_test)  # reuse the fitted binarizer for the test set

print(mlb.classes_, '\n')
print(y_train[0], '\n')
print(tags_train[0], '\n')
print(y_train.shape)
print(y_test.shape)

['1POL' '2ECO' '3SPO' '4GEN' '6INS' '7RSK' '8YDB' '9BNX' 'ADS10' 'BNW14'
 'BRP11' 'C11' 'C12' 'C13' 'C14' 'C15' 'C151' 'C1511' 'C152' 'C16' 'C17'
 'C171' 'C172' 'C173' 'C174' 'C18' 'C181' 'C182' 'C183' 'C21' 'C22' 'C23'
 'C24' 'C31' 'C311' 'C312' 'C313' 'C32' 'C33' 'C331' 'C34' 'C41' 'C411'
 'C42' 'CCAT' 'E11' 'E12' 'E121' 'E13' 'E131' 'E132' 'E14' 'E141' 'E142'
 'E143' 'E21' 'E211' 'E212' 'E31' 'E311' 'E312' 'E313' 'E41' 'E411' 'E51'
 'E511' 'E512' 'E513' 'E61' 'E71' 'ECAT' 'ENT12' 'G11' 'G111' 'G112' 'G113'
 'G12' 'G13' 'G131' 'G14' 'G15' 'G151' 'G152' 'G153' 'G154' 'G155' 'G156'
 'G157' 'G158' 'G159' 'GCAT' 'GCRIM' 'GDEF' 'GDIP' 'GDIS' 'GEDU' 'GENT'
 'GENV' 'GFAS' 'GHEA' 'GJOB' 'GMIL' 'GOBIT' 'GODD' 'GPOL' 'GPRO' 'GREL'
 'GSCI' 'GSPO' 'GTOUR' 'GVIO' 'GVOTE' 'GWEA' 'GWELF' 'M11' 'M12' 'M13'
 'M131' 'M132' 'M14' 'M141' 'M142' 'M143' 'MCAT' 'MEUR' 'PRB13'] 

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
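
Rows like the one above are hard to read by eye. As a sanity check, a row can be mapped back to its topic codes. The following sketch does the encoding and decoding by hand for a hypothetical four-topic list (the real list has 126 entries); with scikit-learn the decoding step corresponds to `mlb.inverse_transform`.

```python
def encode_tags(tags, topic_index):
    """Return a 0/1 list with a 1 at the position of every tag."""
    row = [0] * len(topic_index)
    for tag in tags:
        row[topic_index[tag]] = 1
    return row

def decode_row(row, topics):
    """Return the topic codes whose positions are set to 1."""
    return [topic for topic, flag in zip(topics, row) if flag == 1]

demo_topics = ['C18', 'C181', 'CCAT', 'GCAT']  # small subset, for illustration only
demo_index = {code: i for i, code in enumerate(demo_topics)}

row = encode_tags(['C18', 'CCAT'], demo_index)
empty_row = encode_tags([], demo_index)  # article without any topic
print(row, decode_row(row, demo_topics))
print(empty_row, decode_row(empty_row, demo_topics))
```

An article without any topic becomes an all-zero row, which is a valid target and should not be filtered out.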

## Save your model

Finally, save your best model as an `h5` file and submit it to the competition. For example:

In [None]:
model.save('model.h5')

The model file should now be visible in the "Home" screen of the Jupyter notebook interface.  There you should be able to select it and press "download".

## Predict for test set

You will be asked to return your predictions for a separate test set.  These should be returned as a matrix with one row for each test article.  Each row contains a binary prediction for each label: 1 if the label applies to the article, and 0 if not. The order of the labels is the order of the label (topic) codes.

An example row could look like this if your system predicts the presence of the second and fourth topics:

    0 1 0 1 0 0 0 0 0 0 0 0 0 0 ...
    
If you have the matrix prepared in `y` (e.g., by thresholding the output of `model.predict(x_test)`, which for a sigmoid output layer returns probabilities rather than 0/1 values) you can use the following call to save it to a text file.

In [None]:
import numpy as np

np.savetxt('results.txt', y, fmt='%d')
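
If your network ends in a sigmoid layer, its predictions are probabilities and need to be turned into 0/1 decisions before saving. A minimal sketch, assuming a cut-off of 0.5 (a tunable choice, not something the competition prescribes) and made-up probabilities; with a NumPy array the same step is `(y > 0.5).astype(int)`.

```python
def threshold_predictions(probabilities, threshold=0.5):
    """Turn rows of per-label probabilities into rows of 0/1 decisions."""
    return [[int(p > threshold) for p in row] for row in probabilities]

def format_rows(binary_rows):
    """Format each row as space-separated integers, one article per line."""
    return '\n'.join(' '.join(str(v) for v in row) for row in binary_rows)

# Made-up probabilities for two articles and four topics.
y_prob = [
    [0.10, 0.92, 0.03, 0.75],
    [0.20, 0.40, 0.05, 0.01],  # no label exceeds the threshold
]
y_binary = threshold_predictions(y_prob)
output_text = format_rows(y_binary)
print(output_text)
```

Writing unthresholded probabilities with `fmt='%d'` would truncate almost every value to 0, so the thresholding step should come first.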