DESIGN GOALS OF XML
====================

- Data transfer
- Easy to write code that reads and writes it
- Document validation
- Human readable
- Supports a wide variety of applications

XML was initially meant to represent the table-like structure of documents, for example a New York Times article in XML, where several nodes store the data from a web page. It was later repurposed to represent data itself. If we look at a sample XML file from OpenStreetMap, we can see heavy use of attributes inside each tag, and many of the tags are empty.
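
To make the point about attributes and empty tags concrete, here is a small sketch. The XML string is a made-up fragment written in the style of an OpenStreetMap export, not copied from a real file; the ids, coordinates and the `SAMPLE_OSM` name are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the style of an OpenStreetMap export (values are invented).
SAMPLE_OSM = """
<osm>
  <node id="1" lat="41.9" lon="-87.6"/>
  <node id="2" lat="42.0" lon="-87.7">
    <tag k="amenity" v="cafe"/>
  </node>
</osm>
"""

osm_root = ET.fromstring(SAMPLE_OSM)

for node in osm_root.findall('node'):
    # All of the data lives in attributes; a node counts as empty here when it
    # has no child elements and no text of its own.
    is_empty = len(node) == 0 and not (node.text or "").strip()
    print("%s %s empty=%s" % (node.get('id'), sorted(node.attrib), is_empty))
```

The first node is an empty tag whose whole payload is its attributes, while the second carries a nested empty tag.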

In [3]:
# Use Python to read XML with the tree method (this loads the entire XML document into memory).
# We need the Python library xml.etree.

import xml.etree.ElementTree as ET
import pprint

tree = ET.parse('./data_files/exampleResearchArticle.xml')
root = tree.getroot()

# We can get the root element and then look at all the children of root as below.
print("\nChildren of Root:")
for child in root:
    print(child.tag)


Children of Root:
ui
ji
fm
bdy
bm
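
The same traversal can be tried without the data file. The sketch below is only an illustration: the child tag names come from the output above, but the root tag `art` and the `SKELETON` string are assumptions. Note that `ET.parse` returns an ElementTree (hence the `getroot()` call), while `ET.fromstring` returns the root element directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical skeleton; only the child tag names are taken from the output above.
SKELETON = "<art><ui/><ji/><fm/><bdy/><bm/></art>"

# fromstring parses a string and returns the root Element itself.
skeleton_root = ET.fromstring(SKELETON)

# Iterating over an element yields its direct children in document order.
child_tags = [child.tag for child in skeleton_root]
print(child_tags)
```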


Use of the find method: we can get to a specific node of the tree by calling find on root. The code below shows how to get the title from the fm/bibl/title path.

In [7]:
title = root.find('./fm/bibl/title')
title_text = ""

for p in title:
    title_text += p.text   # the title can span several paragraphs, so we concatenate each p.text
print("\nTitle:\n" + title_text)


Title:
Standardization of the functional syndesmosis widening by dynamic U.S examination
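
Two edge cases are worth keeping in mind with find: it returns None when the path does not match, and p.text only holds the text before the first nested tag. The sketch below shows one way to handle both. The sample XML, the inline `it` tag and the helper name `extract_title` are assumptions made for illustration, not taken from the article file.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the title text and the inline <it> tag are invented.
SAMPLE_TITLE = """
<art>
  <fm>
    <bibl>
      <title>
        <p>First part of a <it>made-up</it> title. </p>
        <p>Second part.</p>
      </title>
    </bibl>
  </fm>
</art>
"""

def extract_title(root_elem):
    """Join the text of every p under fm/bibl/title, or return '' if there is no title."""
    title = root_elem.find('./fm/bibl/title')
    if title is None:          # find returns None when the path does not match
        return ""
    # itertext also walks nested elements, so text after an inline tag is kept
    return "".join("".join(p.itertext()) for p in title.findall('p'))

sample_root = ET.fromstring(SAMPLE_TITLE)
print(extract_title(sample_root))
```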


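Use of the findall method: findall returns every element that matches the path, so we can loop over all the au elements and print the email of each author that has one.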

In [9]:
print("\nAuthor email Addresses:")
for a in root.findall('./fm/bibl/aug/au'):    # look at all occurrences of au
    email = a.find('email')
    if email is not None:
        print(email.text)


Author email Addresses:
omer@extremegate.com
mcarmont@hotmail.com
laver17@gmail.com
nyska@internet-zahav.net
kammarh@gmail.com
gideon.mann.md@gmail.com
barns.nz@gmail.com
eukots@gmail.com
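
The `is not None` check matters when an author has no email element. The sketch below uses two invented authors (names, addresses and the `SAMPLE_AUG` string are assumptions) to show the check skipping the author without an email. Writing `if email:` instead would be unreliable, because an element with no children has historically evaluated as false even when it exists.

```python
import xml.etree.ElementTree as ET

# Hypothetical author group; the second author deliberately has no email.
SAMPLE_AUG = """
<aug>
  <au><snm>Doe</snm><fnm>Jane</fnm><email>jane@example.com</email></au>
  <au><snm>Roe</snm><fnm>Sam</fnm></au>
</aug>
"""

aug = ET.fromstring(SAMPLE_AUG)

emails = []
for au in aug.findall('au'):
    email = au.find('email')
    if email is not None:     # find returned None for the second author
        emails.append(email.text)

print(emails)
```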


Sample Program: Write a program to read all the author data from the given file and create a list of Python dicts, one per author, containing fnm, snm and email address.


In [11]:
def get_authors():
    authors = []
    for author in root.findall('./fm/bibl/aug/au'):
        # findtext returns the child's text, or None when the child element is missing
        data = {
                "fnm": author.findtext('fnm'),
                "snm": author.findtext('snm'),
                "email": author.findtext('email')
        }
        authors.append(data)

    return authors

get_authors()

[{'email': 'omer@extremegate.com', 'fnm': 'Omer', 'snm': 'Mei-Dan'},
 {'email': 'mcarmont@hotmail.com', 'fnm': 'Mike', 'snm': 'Carmont'},
 {'email': 'laver17@gmail.com', 'fnm': 'Lior', 'snm': 'Laver'},
 {'email': 'nyska@internet-zahav.net', 'fnm': 'Meir', 'snm': 'Nyska'},
 {'email': 'kammarh@gmail.com', 'fnm': 'Hagay', 'snm': 'Kammar'},
 {'email': 'gideon.mann.md@gmail.com', 'fnm': 'Gideon', 'snm': 'Mann'},
 {'email': 'barns.nz@gmail.com', 'fnm': 'Barnaby', 'snm': 'Clarck'},
 {'email': 'eukots@gmail.com', 'fnm': 'Eugene', 'snm': 'Kots'}]

If we look at the author element, we can see a child element called insr that carries an iid attribute.

<au id="A2">
   <snm>Carmont</snm>
   <fnm>Mike</fnm>
   <insr iid="I2"/>
   <email>mcarmont@hotmail.com</email>
</au>
            
If we scroll down in the XML file, we can see that each iid refers to an ins element, the institution the author is affiliated with.

<insg>
   <ins id="I1">
      <p>Department of Orthopaedics, Division of Sports Medicine, University of Colorado School of Medicine, Aurora, Colorado</p>
   </ins>
   <ins id="I2">
      <p>Princess Royal Hospital, Telford, UK</p>
   </ins>
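
Since each iid points at an ins id, the two snippets can be joined with a lookup dict. This is only a sketch under assumptions: the `SAMPLE_ART` string is a cut-down, flattened version of the snippets above (the root tag `art`, the direct `aug` and `insg` paths, and the shortened I1 text are not the real file layout).

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified document assembled from the two snippets above.
SAMPLE_ART = """
<art>
  <aug>
    <au id="A2">
      <snm>Carmont</snm>
      <fnm>Mike</fnm>
      <insr iid="I2"/>
    </au>
  </aug>
  <insg>
    <ins id="I1"><p>Department of Orthopaedics</p></ins>
    <ins id="I2"><p>Princess Royal Hospital, Telford, UK</p></ins>
  </insg>
</art>
"""

art = ET.fromstring(SAMPLE_ART)

# Map each institution id to the text of its p element.
institutions = {ins.get('id'): ins.findtext('p') for ins in art.findall('./insg/ins')}

for au in art.findall('./aug/au'):
    # Translate every iid reference of this author into an institution name.
    names = [institutions.get(insr.get('iid')) for insr in au.findall('insr')]
    print("%s: %s" % (au.findtext('snm'), names))
```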


In this programming assignment, we need to update the previous author list by adding an insr field as well. insr should be a list of values, since one author can be associated with several institutions. We will use the get method to read the attribute value.
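
As a quick illustration of the get method, the sketch below compares it with indexing into attrib. The two one-element documents and the variable names are assumptions made up for the comparison.

```python
import xml.etree.ElementTree as ET

# Two hypothetical insr elements: one with an iid attribute, one without.
with_iid = ET.fromstring('<insr iid="I2"/>')
without_iid = ET.fromstring('<insr/>')

print(with_iid.get('iid'))             # attribute value as a string
print(with_iid.attrib['iid'])          # same value through the attrib dict
print(without_iid.get('iid'))          # get falls back to None when the attribute is missing
print(without_iid.get('iid', 'n/a'))   # or to a default of our choosing

try:
    without_iid.attrib['iid']
except KeyError:
    print("attrib[...] raises KeyError for a missing attribute")
```

Using get keeps the loop over authors from failing on an insr element that lacks an iid, at the cost of possibly storing None in the list.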

In [12]:
def get_authors():
    authors = []
    for author in root.findall('./fm/bibl/aug/au'):
        data = {
                "fnm": author.findtext('fnm'),
                "snm": author.findtext('snm'),
                "email": author.findtext('email'),
                # get() reads the iid attribute of each insr child;
                # insr.attrib["iid"] also works but raises KeyError if iid is missing
                "insr": [insr.get('iid') for insr in author.findall('insr')]
        }
        
        authors.append(data)

    return authors

get_authors()

[{'email': 'omer@extremegate.com',
  'fnm': 'Omer',
  'insr': ['I1'],
  'snm': 'Mei-Dan'},
 {'email': 'mcarmont@hotmail.com',
  'fnm': 'Mike',
  'insr': ['I2'],
  'snm': 'Carmont'},
 {'email': 'laver17@gmail.com',
  'fnm': 'Lior',
  'insr': ['I3', 'I4'],
  'snm': 'Laver'},
 {'email': 'nyska@internet-zahav.net',
  'fnm': 'Meir',
  'insr': ['I3'],
  'snm': 'Nyska'},
 {'email': 'kammarh@gmail.com',
  'fnm': 'Hagay',
  'insr': ['I8'],
  'snm': 'Kammar'},
 {'email': 'gideon.mann.md@gmail.com',
  'fnm': 'Gideon',
  'insr': ['I3', 'I5'],
  'snm': 'Mann'},
 {'email': 'barns.nz@gmail.com',
  'fnm': 'Barnaby',
  'insr': ['I6'],
  'snm': 'Clarck'},
 {'email': 'eukots@gmail.com', 'fnm': 'Eugene', 'insr': ['I7'], 'snm': 'Kots'}]