# DATA20001 Deep Learning - Group Project
## Text project

**Due Wednesday December 13, before 23:59.**

The task is to learn to assign the correct labels to news articles. The corpus contains ~850K articles from Reuters. The test set is about 10% of the articles. The data is provided as unprocessed XML files.

We're only giving you the code for downloading the data and an example of how to save the final model. The rest you'll have to do yourselves.

Some comments and hints particular to the project:

- One document may belong to many classes in this problem, i.e., it's a multi-label classification problem. In fact there are documents that don't belong to any class, and you should also be able to handle these correctly. Pay careful attention to how you design the outputs of the network (e.g., what activation to use) and what loss function should be used.
- You may use word embeddings to get better results. For example, you already used a smaller version of the GloVe embeddings in exercise 4. Do note that these embeddings take a lot of memory, but using the Keras `Embedding` layer will be more efficient.
- Loading all documents into one big matrix as we have done in the exercises is not feasible (e.g. the virtual servers in CSC have only 3 GB of RAM). You need to load the documents in smaller chunks for the training. This shouldn't be a problem, as we are doing mini-batch training anyway, and thus we don't need to keep all the documents in memory. You can simply pass your current chunk of documents to `model.fit()`, as each call continues from the weights left by the previous one.
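
To illustrate the first hint, here is a minimal pure-Python sketch (not the required solution) of why independent sigmoid outputs with a binary cross-entropy loss suit a multi-label problem, while softmax does not. The logits are made-up values for three hypothetical labels. In Keras, the corresponding choices would be `activation='sigmoid'` on the output layer and `loss='binary_crossentropy'`.

```python
import math

def sigmoid(z):
    """Squash one logit into an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Turn logits into probabilities that always sum to one."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def binary_cross_entropy(targets, probs, eps=1e-7):
    """Mean over labels of -[t*log(p) + (1-t)*log(1-p)]."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(targets)

# Hypothetical network outputs for an article that has no topic at all.
logits = [-4.0, -3.0, -5.0]
sig = [sigmoid(z) for z in logits]
soft = softmax(logits)

no_labels = [0, 0, 0]
loss_none = binary_cross_entropy(no_labels, sig)

print('sigmoid:', sig)
print('softmax:', soft)
print('loss for the all-zero target:', loss_none)
```

With sigmoid outputs every probability can be below 0.5 at once, so "no label" is a valid prediction with a low loss. Softmax forces the probabilities to sum to one, so it always pushes at least one label forward.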


## Download the data

In [1]:
from keras.utils.data_utils import get_file

database_path = 'train/'

dl_file='reuters.zip'
dl_url='https://www.cs.helsinki.fi/u/jgpyykko/'
get_file(dl_file, dl_url+dl_file, cache_dir='./', cache_subdir=database_path, extract=True)

Using TensorFlow backend.


Downloading data from https://www.cs.helsinki.fi/u/jgpyykko/reuters.zip

'./train/reuters.zip'

The above command downloads and extracts the data files into the `train` subdirectory.

The article files can be found in `train/`, packed in zip archives named `19970405.zip`, etc. You will have to read the contents of these archives to get the data. There is also a readme with links to further descriptions of the data.
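
One way to handle the archives is to read the XML members straight from each zip, one archive at a time, which also keeps memory use bounded. The sketch below uses only the standard library; the helper name `iter_xml_documents` is hypothetical, and it assumes that the articles are the members whose names end in `.xml`. To stay self-contained, it runs on a tiny in-memory archive with made-up content instead of the real data.

```python
import io
import zipfile

def iter_xml_documents(zip_source):
    """Yield (member_name, raw_bytes) for every .xml member of one archive.

    zip_source may be a path such as 'train/19970405.zip' or a file object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.lower().endswith('.xml'):
                yield name, archive.read(name)

# Build a small in-memory archive with made-up content for demonstration.
demo = io.BytesIO()
with zipfile.ZipFile(demo, 'w') as archive:
    archive.writestr('1newsML.xml', '<newsitem><headline>Demo</headline></newsitem>')
    archive.writestr('readme.txt', 'not an article')
demo.seek(0)

docs = list(iter_xml_documents(demo))
for name, raw in docs:
    print(name, len(raw), 'bytes')
```

The raw bytes can be wrapped in `io.BytesIO` and handed to an XML parser, since `xml.etree.ElementTree.iterparse` accepts a file object as well as a filename.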

The class labels, or topics, can be found in the archive `train/codes.zip`. The zip contains a file called "topic_codes.txt". This file lists the special codes for the topics (about 130 of them), together with an explanation of what each code means.

The XML document files contain the article's headline, the main body text, and the list of topic labels assigned to each article. You will have to extract the topics of each article from the XML. For example, `<code code="C18">` refers to the topic "OWNERSHIP CHANGES" (like a corporate buyout).
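
Once the topic codes are known, each article's list of codes can be turned into a fixed-length 0/1 target vector. The following is a minimal sketch, assuming you have already read the codes into a Python list in a fixed order. The three codes used here are just examples that appear in this notebook, the helper names are hypothetical, and nothing is assumed about the layout of "topic_codes.txt".

```python
def build_label_index(topic_codes):
    """Map each topic code to its column position, in the given order."""
    return {code: i for i, code in enumerate(topic_codes)}

def encode_topics(tags, label_index):
    """Return a 0/1 list with a 1 for every known topic code in tags."""
    row = [0] * len(label_index)
    for tag in tags:
        if tag in label_index:  # silently skip codes that are not topics
            row[label_index[tag]] = 1
    return row

# Example codes only; the real list has about 130 entries.
label_index = build_label_index(['C18', 'GCAT', 'GVIO'])

two_topics = encode_topics(['GCAT', 'GVIO'], label_index)
no_topics = encode_topics([], label_index)
print(two_topics)
print(no_topics)
```

An article without any topic simply becomes an all-zero row, which is the case the first hint asks you to handle.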

You should pre-process the XML to extract the words from the article: the `<headline>` element and the `<text>` element. You should not need any other parts of the article.

## Preprocessing the data
First we will parse the XML files. Let's start by setting the paths and defining a utility function:

In [13]:
import xml.etree.ElementTree as etree

path_xml = 'train/REUTERS_CORPUS_2/data/'
filename_xml = '810900newsML.xml'
file_xml = path_xml + filename_xml
print(file_xml)

def strip_tag_name(elem):
    # Return the element's tag name without any '{namespace}' prefix.
    t = elem.tag
    idx = t.rfind("}")
    if idx != -1:
        t = t[idx + 1:]
    return t


train/REUTERS_CORPUS_2/data/810900newsML.xml



Then we can try to parse one file first: 

In [47]:
words = []
tags = []
read_tags = False
for event, elem in etree.iterparse(file_xml, events=('start', 'end')):
    tname = strip_tag_name(elem)
    if event == 'start':
        # Only the <codes> block with the topics class holds the labels.
        if tname == 'codes':
            read_tags = elem.attrib.get('class') == 'bip:topics:1.0'
        elif tname == 'code' and read_tags:
            tags.append(elem.attrib['code'])

    if event == 'end':
        if tname == 'codes':
            # Stop collecting so that any later <codes> block is not read as topics.
            read_tags = False
        elif tname in ('headline', 'p') and elem.text:
            # Skip empty elements, whose text is None.
            words.append(elem.text)


print('\n\ntags: ', tags, '\n\n')
print('words: ', words, '\n\n')
for w in words:
    print(w, '\n')



tags:  ['GCAT', 'GVIO'] 


words:  ['Sri Lanka rebels attack government troops in north.', "Heavy fighting erupted between government troops and Liberation Tigers of Tamil Eelam (LTTE) rebels in Sri Lanka's northern Wanni region late on Friday, military officials said on Saturday.", "They said the rebels had attacked the military's defensive positions just north of the government-held town of Vavuniya, some 220 km (135 miles) north of the capital Colombo.", 'The military, which launched a major offensive in May, is battling rebels in the Wanni to open a strategic highway linking the northern Jaffna peninsula with the rest of the island.', "The military officials said defences guarded by police and the navy had been breached but troops had linked up again after repulsing Friday's attack.", 'Casualty figures and other details were not immediately available, but officials said troops were clearing the area.', 'An undisclosed number of wounded had been airlifted to Anuradhapura military 
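
To process many files, the parsing step can be wrapped in a function. The sketch below is one possible variant, not the only way to do it: it listens to `end` events only, reads the topic codes from the children of the finished `<codes>` element, and joins headline and paragraphs into one string. The function name `parse_article` and the sample XML are made up for illustration, and the second `class` value in the sample is an assumption used only to show that non-topic codes are ignored.

```python
import io
import xml.etree.ElementTree as etree

def parse_article(source):
    """Return (text, tags) for one article; source is a path or a file object."""
    parts, tags = [], []
    for _, elem in etree.iterparse(source, events=('end',)):
        tname = elem.tag.rsplit('}', 1)[-1]  # drop any '{namespace}' prefix
        if tname in ('headline', 'p') and elem.text:
            parts.append(elem.text.strip())
        elif tname == 'codes' and elem.get('class') == 'bip:topics:1.0':
            # At the 'end' event all child <code> elements are available.
            tags.extend(child.get('code') for child in elem)
    return ' '.join(parts), tags

# Made-up sample document, only loosely shaped like the real files.
sample = b"""<newsitem>
<headline>Demo headline</headline>
<text><p>First paragraph.</p><p>Second paragraph.</p></text>
<metadata>
<codes class="bip:countries:1.0"><code code="SRIL"/></codes>
<codes class="bip:topics:1.0"><code code="GCAT"/><code code="GVIO"/></codes>
</metadata>
</newsitem>"""

text, tags = parse_article(io.BytesIO(sample))
print(text)
print(tags)
```

Because the function accepts a file object, it can also be fed bytes read directly from a zip archive, wrapped in `io.BytesIO`.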

## Save your model

Finally, save your best model and submit it to the competition as an `h5` file, for example like this:

In [None]:
model.save('model.h5')

The model file should now be visible in the "Home" screen of the Jupyter Notebook interface. There you should be able to select it and press "download".

## Predict for test set

You will be asked to return your predictions for a separate test set. These should be returned as a matrix with one row for each test article. Each row contains a binary prediction for each label: 1 if the label is present in the article, and 0 if not. The order of the labels is the order of the label (topic) codes.

An example row could look like this if your system predicts the presence of the second and fourth topic:

    0 1 0 1 0 0 0 0 0 0 0 0 0 0 ...
    
If you have the binary matrix prepared in `y` (e.g., by calling `model.predict(x_test)` and thresholding the predicted probabilities to 0 or 1), you can save it to a text file as follows.

In [None]:
import numpy as np

np.savetxt('results.txt', y, fmt='%d')
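
Note that with a sigmoid output layer `model.predict` returns probabilities rather than 0/1 values, and `fmt='%d'` would truncate almost all of them to 0. Below is a minimal sketch of thresholding before saving. The helper name `to_binary_predictions`, the threshold of 0.5, and the probabilities are all assumptions for illustration; a different threshold may well work better on your validation data.

```python
import io
import numpy as np

def to_binary_predictions(probabilities, threshold=0.5):
    """Map predicted probabilities to 0/1 labels (the threshold is a tunable assumption)."""
    return (np.asarray(probabilities) >= threshold).astype(int)

# Made-up probabilities for two articles and four labels.
y_prob = np.array([[0.10, 0.92, 0.30, 0.75],
                   [0.05, 0.20, 0.01, 0.40]])
y = to_binary_predictions(y_prob)

# Write to an in-memory buffer here; pass 'results.txt' instead to write the file.
buffer = io.StringIO()
np.savetxt(buffer, y, fmt='%d')
print(buffer.getvalue())
```

The second row stays all zeros, which is how an article without any predicted topic appears in the results file.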