# Parsing Yael's dataset

It quickly became clear that using every XML tag in the files would not be useful, and that the best approach was to use only the data inside the 'body' elements.

The following code parses the 'body' element of every XML file in the directory "yael_corpus" and writes a file, "yael_data.txt", in which each line pairs a word with the XML tag that directly encloses it. Words in 'tail' position (text that follows an element's closing tag) are given the tag 'O'.
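Because the labelling depends on the difference between 'text' and 'tail', a small illustration may help. In `xml.etree.ElementTree`, an element's `text` is the character data before its first child, and its `tail` is the character data that follows its closing tag. The fragment below is invented for the example and is not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Invented fragment for illustration; not taken from the corpus.
snippet = "<p>born in <placeName>Haifa</placeName> in 1950</p>"
p = ET.fromstring(snippet)
place = p[0]

print(repr(p.text))      # 'born in '
print(repr(place.text))  # 'Haifa'
print(repr(place.tail))  # ' in 1950'
```

Under the scheme described here, "born" and "in" would be written with the tag `p`, "Haifa" with `placeName`, and the tail words "in" and "1950" with 'O'.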

In [8]:
import xml.etree.ElementTree as ET

In [9]:
# The corpus is not plain ASCII, so the output encoding is set explicitly.
f = open("yael_data.txt", "w", encoding="utf-8")

In [10]:
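def write_body(body):
    # Strip the "{namespace}" prefix that ElementTree prepends to tag names.
    tag = body.tag.rsplit("}", 1)[-1]
    # Words directly inside this element get the element's own tag.
    # split() with no argument also breaks on newlines and tabs.
    for word in (body.text or "").split():
        f.write(tag + " " + word + "\n")
    # Children come before the tail in document order.
    for child in body:
        write_body(child)
    # Text that follows the closing tag (the tail) is labelled 'O'.
    for word in (body.tail or "").split():
        f.write("O" + " " + word + "\n")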

def parse_tree(root):
    body = None
    for child in root:
        if child.tag[-4:] == 'body':
            body = child
            break
        else:
            parse_tree(child)
    if body is not None:
        write_body(body)
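
The tag names are stripped of everything up to `}` because ElementTree reports namespaced tags as `{namespace-uri}localname`. A minimal sketch of that step follows. `local_name` is a hypothetical helper, not part of the original code, and the TEI namespace URI is only an assumption used for the example. It also tolerates tags that carry no namespace.

```python
def local_name(tag):
    """Return the tag name without its '{namespace}' prefix, if any."""
    return tag.rsplit("}", 1)[-1]

# The namespace URI below is an assumption for illustration.
print(local_name("{http://www.tei-c.org/ns/1.0}persName"))  # persName
print(local_name("body"))                                   # body
```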

In [11]:
files = ['Alkoshi Gedalia.xml', 'alterman.xml', 'Dan_Almagor.xml', 'dvora_baron.xml', 'Even Shoshan.xml', 'gila_almagor.xml', 'groassman.xml', 'haim-guri.xml', 'keret.xml', 'Kobi Oz.xml', 'Kobner.xml', 'meir_ariel.xml', 'rabinian.xml', 'ron_feder.xml', 'tamar_caravan.xml', 'tei-nevo.xml', 'Tohar_Lev_Yoram.xml', 'yossi_banai_1.xml', 'Yossi_Banai.xml']

In [12]:
filepaths = ['yael_corpus/' + filename for filename in files]

In [13]:
for filepath in filepaths:
    tree = ET.parse(filepath)
    root = tree.getroot()
    parse_tree(root)
    f.write("\n\n\n\n\n")  # blank lines separate the documents
f.close()  # make sure everything is flushed to disk

After looking at the results, we concluded that many of the tags do not map uniquely onto a single label of the tagset we are already using: the same tag can appear on words that belong to different entity types, or to 'O'.

This was the matching, one tag per line followed by its label (UNCLEAR means the tag appeared on words of different kinds):

movie MISC_ENT

book MISC_ENT

pubPlace LOC

persName PER

publisher ORG

forename PER

theater ORG

said O

orgName ORG

award MISC_EVENT

biblScope O

author PER

item O

placename LOC (had problems)

p O UNCLEAR

rs O UNCLEAR

persname PER

lang O

geogName LOC

placeName LOC

l O

occupation O

num O

play MISC_ENT

rolename O

country LOC

education O

docAuthor UNCLEAR

quote O

roleName O

band ORG

name UNCLEAR

singleShow MISC_ENT

ref UNCLEAR

surename PER

orgname ORG

date DATE

editor UNCLEAR

hi UNCLEAR

militaryservice ORG

TVshow MISC_ENT

surname PER

title UNCLEAR
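
One way to carry out this kind of inspection is to group the words of "yael_data.txt" by tag and read through each group. The following is a minimal sketch under that assumption, not necessarily how the table above was produced. `words_by_tag` is a hypothetical helper and the sample lines are invented.

```python
from collections import defaultdict

def words_by_tag(lines):
    """Group words by tag from lines of the form '<tag> <word>'."""
    groups = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) == 2:  # skip the blank separator lines
            tag, word = parts
            groups[tag].append(word)
    return groups

# Invented sample lines in the same format as yael_data.txt.
sample = ["persName Dan", "O wrote", "title Songs", "", "title Prof."]
groups = words_by_tag(sample)
print(groups["title"])  # ['Songs', 'Prof.']
```

On the real data, the function could be given the open file object instead of the sample list, since iterating over a text file yields its lines.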

The following simple code 'translates' the tags into the tagset used for the rest of the data. Tags marked UNCLEAR are converted to 'O'.

In [18]:
tags_dict = {   
                "movie": "MISC_ENT",
                "book": "MISC_ENT",
                "pubPlace": "LOC",
                "persName": "PER",
                "publisher": "ORG",
                "forename": "PER",
                "theater": "ORG",
                "said": "O",
                "orgName": "ORG",
                "award": "MISC_EVENT",
                "biblScope": "O",
                "author": "PER",
                "item": "O",
                "placename": "LOC",
                "p": "O",
                "rs": "O",
                "persname": "PER",
                "lang": "O",
                "geogName": "LOC",
                "placeName": "LOC",
                "l": "O",
                "occupation": "O",
                "num": "O",
                "play": "MISC_ENT",
                "rolename": "O",
                "country": "LOC",
                "education": "O",
                "docAuthor": "O",
                "quote": "O",
                "roleName": "O",
                "band": "ORG",
                "name": "O",
                "singleShow": "MISC_ENT",
                "ref": "O",
                "surename": "PER",
                "orgname": "ORG",
                "date": "DATE",
                "editor": "O",
                "hi": "O",
                "militaryservice": "ORG",
                "TVshow": "MISC_ENT",
                "surname": "PER",
                "title": "O"
            }

In [26]:
# The corpus is not plain ASCII, so the output encoding is set explicitly.
f2 = open("yael_data_processed.txt", "w", encoding="utf-8")

In [27]:
def write_body2(body):
    # Strip the "{namespace}" prefix, then map the tag; unknown tags become 'O'.
    tag = tags_dict.get(body.tag.rsplit("}", 1)[-1], "O")
    # split() with no argument also breaks on newlines and tabs.
    for word in (body.text or "").split():
        f2.write(tag + " " + word + "\n")
    # Children come before the tail in document order.
    for child in body:
        write_body2(child)
    # Text that follows the closing tag (the tail) is labelled 'O'.
    for word in (body.tail or "").split():
        f2.write("O" + " " + word + "\n")

def parse_tree2(root):
    body = None
    for child in root:
        if child.tag[-4:] == 'body':
            body = child
            break
        else:
            parse_tree2(child)
    if body is not None:
        write_body2(body)

In [28]:
for filepath in filepaths:
    tree = ET.parse(filepath)
    root = tree.getroot()
    parse_tree2(root)
    f2.write("\n\n\n\n\n")  # blank lines separate the documents
f2.close()  # make sure everything is flushed to disk
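
To use "yael_data_processed.txt" alongside the rest of the data, it has to be read back into per-document sequences. The sketch below shows one possible reader, assuming the format written above: one `<label> <word>` pair per line, with runs of blank lines between documents. `read_documents` is a hypothetical helper and the sample lines are invented.

```python
def read_documents(lines):
    """Split '<label> <word>' lines into documents separated by blank lines."""
    documents, current = [], []
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            label, word = parts
            current.append((word, label))
        elif current:  # a blank line closes the current document
            documents.append(current)
            current = []
    if current:
        documents.append(current)
    return documents

# Invented sample in the same format as yael_data_processed.txt.
sample = ["PER Dan\n", "O wrote\n", "\n", "\n", "LOC Haifa\n"]
print(read_documents(sample))
# [[('Dan', 'PER'), ('wrote', 'O')], [('Haifa', 'LOC')]]
```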