## Project - Machine Learning for Big Data

Description: Script that extracts data from Reuters XML files into a dataframe and exports it as a CSV file.

**Explanation:**

For this project we took a procedural approach, implementing checkpoints as we progressed through the project stages. The first checkpoint reuses the code developed in Assignment 1, altered to keep only the columns required to complete the project. Because extracting the data from the Reuters files was inconvenient to run multiple times, we developed a script that extracts the features into a dataframe and then exports it to a CSV file. This lets us quickly import the raw data in later steps, whether the environment is local or in the cloud. The CSV file is small, at about 7 MB, so it is easily portable through our GitLab repository.
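
The checkpoint idea described above can be sketched as a small "load if present, otherwise build" helper. This is only an illustration under stated assumptions: `load_or_build`, `build_demo`, and the file name `checkpoint_demo.csv` are hypothetical names that do not appear in the project, and the tiny dataframe stands in for the real XML extraction.

```python
import os
import tempfile

import pandas as pd


def load_or_build(csv_path, build_fn):
    """Load the checkpoint CSV if it exists; otherwise build it and save it."""
    if os.path.exists(csv_path):
        # index_col=0 reads the saved row index back instead of adding a column
        return pd.read_csv(csv_path, index_col=0)
    df = build_fn()
    df.to_csv(csv_path)
    return df


calls = []  # records how often the build step actually runs


def build_demo():
    # hypothetical stand-in for the real XML extraction
    calls.append(1)
    return pd.DataFrame({"item_id": [1, 2], "headline": ["first", "second"]})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint_demo.csv")
    first = load_or_build(path, build_demo)   # builds and writes the CSV
    second = load_or_build(path, build_demo)  # reads the CSV, no rebuild

print("build step ran", len(calls), "time(s)")
```

The second call never touches the build function, which is the property that makes the extraction step cheap to skip on later runs.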

In [1]:
import numpy as np
import pandas as pd
import os
import xml.etree.ElementTree as ET

In [2]:
#function to read one Reuters XML file and return its fields as a list
def read_xml(root, file):

    filename = os.path.join(root, file)

    tree = ET.parse(filename)
    #separate name, so the directory argument 'root' is not overwritten
    xml_root = tree.getroot()

    #XMLfilename
    xml_filename = file

    #defaults, used when an element is missing from the file
    item_id = None
    headline = None
    text = ""
    bip_topics = []
    date_published = None

    #iter() also yields xml_root itself when its tag is 'newsitem'
    for item in xml_root.iter('newsitem'):

        #itemid
        item_id = item.get('itemid')

        #headline
        headline = item.findtext('headline')

        #text: join the paragraphs, skipping empty ones
        text_node = item.find('text')
        if text_node is not None:
            text = " ".join(p.text for p in text_node if p.text)

        bip_topics = []

        for meta in item.findall('metadata/*'):

            #bip:topics
            if meta.get('class') == 'bip:topics:1.0':

                for c in meta.findall('code'):
                    bip_topics.append(c.get('code'))

            #dc.date.published
            if meta.get('element') == 'dc.date.published':
                date_published = meta.get('value')

    #plain list: the fields have mixed types, so a NumPy array would be ragged
    return [item_id, xml_filename, headline, text, bip_topics, date_published]
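
To make the lookups in `read_xml` easier to follow, here is a minimal sketch that runs the same kind of queries against an in-memory document. The `SAMPLE` string is a hypothetical, heavily shortened file that contains only the tags and attributes the function reads; it is an assumption about the layout, not a copy of a real Reuters file.

```python
import xml.etree.ElementTree as ET

# hypothetical sample: only the elements that read_xml looks up are included
SAMPLE = """<newsitem itemid="1" date="1997-03-10">
  <headline>Sample headline.</headline>
  <text>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </text>
  <metadata>
    <codes class="bip:countries:1.0">
      <code code="USA"/>
    </codes>
    <codes class="bip:topics:1.0">
      <code code="M11"/>
      <code code="MCAT"/>
    </codes>
    <dc element="dc.date.published" value="1997-03-10"/>
  </metadata>
</newsitem>"""

item = ET.fromstring(SAMPLE)

item_id = item.get("itemid")
headline = item.findtext("headline")
# join the <p> children of <text>, skipping empty paragraphs
text = " ".join(p.text for p in item.find("text") if p.text)

topics = []
date_published = None
for meta in item.find("metadata"):
    # only the topic code list is kept; other code classes are ignored
    if meta.get("class") == "bip:topics:1.0":
        topics = [c.get("code") for c in meta.findall("code")]
    if meta.get("element") == "dc.date.published":
        date_published = meta.get("value")

print(item_id, headline, text, topics, date_published)
```

Note that the country codes are skipped because only the `bip:topics:1.0` class is matched, which mirrors the column selection described in the explanation.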

In [3]:
#Generate the cumulative dataframe

columns = ['item_id', 'xml_filename', 'headline', 'text', 'bip_topics', 'date_published']

#one row per XML file, collected in a list
rows = []

#current directory
dir_path = os.getcwd()

#search for all xml files in the current directory and its subfolders
for root, dirs, files in os.walk(dir_path):
    for file in files:

        if file.endswith('.xml'):

            #list() accepts either a list or an array returned by read_xml
            rows.append(list(read_xml(root, file)))

#build the dataframe once; DataFrame.append was removed in pandas 2.0
df = pd.DataFrame(rows, columns=columns)

#reference: https://www.geeksforgeeks.org/file-searching-using-python/

In [4]:
df

Unnamed: 0,item_id,xml_filename,headline,text,bip_topics,date_published
0,429411,429411newsML.xml,OFFICIAL JOURNAL CONTENTS - OJ L 66 OF MARCH 6...,* Council Regulation (EC) No 390/97 of 20 Dec...,"[G15, GCAT]",1997-03-10
1,429412,429412newsML.xml,OFFICIAL JOURNAL CONTENTS - OJ C 74 OF MARCH 8...,* (Note - contents are displayed in reverse o...,"[G15, GCAT]",1997-03-10
2,429413,429413newsML.xml,OFFICIAL JOURNAL CONTENTS - OJ C 73 OF MARCH 8...,* (Note - contents are displayed in reverse o...,"[G15, GCAT]",1997-03-10
3,429414,429414newsML.xml,OFFICIAL JOURNAL CONTENTS - OJ L 68 OF MARCH 8...,* (Note - contents are displayed in reverse o...,"[G15, GCAT]",1997-03-10
4,429415,429415newsML.xml,Canada provincial T-bill auction results - Man...,DATE PROV MAT C$AMT AVG CHG PRICE ...,"[M13, M131, MCAT]",1997-03-10
...,...,...,...,...,...,...
48370,477881,477881newsML.xml,U.S. to back fewer supercomputer centers - Times.,The National Science Foundation plans to redu...,[],1997-03-31
48371,477882,477882newsML.xml,Indian shares plunge 8.6 pct on political crisis.,Indian shares plunged more than eight percent...,[M11],1997-03-31
48372,477883,477883newsML.xml,"Singapore shares open weak, funds stay sidelined.",Singapore shares opened weaker on Monday with...,[M11],1997-03-31
48373,477884,477884newsML.xml,Selecta declares two centavo cash dividend.,Selecta Dairy Products Inc declared on Monday...,[C151],1997-03-31


In [None]:
df.to_csv('project_reuters_df.csv')
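
When the CSV is imported again in a later step, two details are worth keeping in mind: `to_csv` writes the row index as an unnamed first column, and list-valued cells such as `bip_topics` come back as strings. The sketch below shows one way to handle both, assuming the column holds plain Python lists. If it holds NumPy arrays instead, they are written without commas and `ast.literal_eval` would not recover the individual codes. The `demo` dataframe is a hypothetical stand-in for the real data.

```python
import ast
import io

import pandas as pd

# hypothetical two-row stand-in for the exported dataframe
demo = pd.DataFrame({
    "item_id": [1, 2],
    "bip_topics": [["G15", "GCAT"], []],
})

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)

# index_col=0 turns the unnamed first column back into the row index
restored = pd.read_csv(buffer, index_col=0)

# each cell is now a string such as "['G15', 'GCAT']"; parse it back to a list
restored["bip_topics"] = restored["bip_topics"].apply(ast.literal_eval)

print(restored["bip_topics"].tolist())
```

An alternative design is to store the topics as a single delimiter-joined string before exporting, which avoids the parsing step at the cost of a split on import.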