# DESIGN GOALS OF XML

- Data Transfers
- Easy to write code to read/write
- Document validation
- Human readable
- Supports wide variety of apps.

XML was initially meant to represent the table-like structure of documents, as in a New York Times article XML file where several nodes store the data from a web page. It was later repurposed to represent data itself. If we look at a sample XML file from OpenStreetMap, we can see heavy use of attributes inside each tag, along with many empty tags.
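To make the attribute-heavy style concrete, here is a minimal Python 3 sketch that parses a small hand-written fragment in the style of an OpenStreetMap export. The fragment, its ids, coordinates and user names are invented for illustration and do not come from a real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the style of an OpenStreetMap export (all values invented).
OSM_SAMPLE = """
<osm>
  <node id="1" lat="41.97" lon="-87.69" user="alice"/>
  <node id="2" lat="41.98" lon="-87.70" user="bob">
    <tag k="amenity" v="cafe"/>
  </node>
</osm>
"""

root = ET.fromstring(OSM_SAMPLE)

# The data lives in attributes, so we read element.attrib rather than element.text.
nodes = [dict(node.attrib) for node in root.findall("node")]

# An "empty" tag has no child elements and no text of its own, e.g. <node .../>.
empty_tags = [el for el in root.iter()
              if len(el) == 0 and not (el.text or "").strip()]

for node in nodes:
    print(node["id"], node["lat"], node["lon"])
print("empty tags:", len(empty_tags))
```

Here the values come from `attrib` because the elements carry no text at all, which is the typical situation in data-oriented XML.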

In [1]:
# Use Python to read XML with the tree method, which loads the entire document into memory.
# This needs the standard-library module xml.etree.ElementTree.

import xml.etree.ElementTree as ET
import pprint

tree = ET.parse('./data_files/exampleResearchArticle.xml')
root = tree.getroot()

# Get the root element, then look at all of its children.
print("\nChildren of Root:")
for child in root:
    print(child.tag)
    




Use of the find method: we can reach a specific node of the tree by calling find on the root. The code below shows how to get the title from the fm/bibl/title path.

In [2]:
title = root.find('./fm/bibl/title')
title_text = ""

for p in title:
    title_text += p.text   # the title can span several paragraphs, so we concatenate each p.text
print("\nTitle:\n" + title_text)


Use of the findall method: findall returns every element that matches the path, so we can loop over all the authors.

In [3]:
print("\nAuthor email Addresses:")
for a in root.findall('./fm/bibl/aug/au'):    # look at all occurrences of au
    email = a.find('email')
    if email is not None:
        print(email.text)



Sample program: write a program that reads all the author data from the given file and creates a list of Python dicts, each containing the fnm, snm and email address of one author.

In [5]:
def get_authors():
    authors = []
    for author in root.findall('./fm/bibl/aug/au'):
        email = author.find('email')
        data = {
                "fnm": author.find('fnm').text,
                "snm": author.find('snm').text,
                # email can be missing, so guard against None before reading .text
                "email": email.text if email is not None else None
        }
        authors.append(data)

    return authors

get_authors()


If we look at an author element, we can see a child element called insr, which carries an attribute iid.

<au id="A2">
    <snm>Carmont</snm>
    <fnm>Mike</fnm>
    <insr iid="I2"/>
    <email>mcarmont@hotmail.com</email>
</au>
            
If we scroll down in the XML file, we can see that insr refers to the institution the author is affiliated with.

<insg>
    <ins id="I1">
        <p>Department of Orthopaedics, Division of Sports Medicine, University of Colorado School of Medicine, Aurora, Colorado</p>
    </ins>
    <ins id="I2">
        <p>Princess Royal Hospital, Telford, UK</p>
    </ins>


In this programming assignment, we update the previous author list by adding the insr values as well. insr should be a list, since one author can be associated with several institutions. We will use the get method to read the attribute value.

In [6]:
def get_authors():
    authors = []
    for author in root.findall('./fm/bibl/aug/au'):
        data = {
                "fnm": None,
                "snm": None,
                "email": None,
                "insr": []
        }
        data['fnm'] = author.find('fnm').text
        data['snm'] = author.find('snm').text
        data['email'] = author.find('email').text

        for insr in author.findall('insr'):
            # get() reads the value of the iid attribute; insr.attrib["iid"] also works
            data['insr'].append(insr.get('iid'))
        
        authors.append(data)

    return authors

get_authors()

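Since the article file was not available when the cells above were run, the following Python 3 sketch repeats the idea on an inline document and goes one step further by resolving each iid to the institution name. Only the au and ins snippets come from the text above; the root element name `art`, the placement of `insg` and the helper name `get_authors_with_institutions` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-built stand-in for the article file. The <au> and <ins> snippets are the
# ones quoted above; the root name <art> and the nesting around them are assumed.
ARTICLE_SAMPLE = """
<art>
  <fm>
    <bibl>
      <aug>
        <au id="A2">
          <snm>Carmont</snm>
          <fnm>Mike</fnm>
          <insr iid="I2"/>
          <email>mcarmont@hotmail.com</email>
        </au>
      </aug>
      <insg>
        <ins id="I1">
          <p>Department of Orthopaedics, Division of Sports Medicine, University of Colorado School of Medicine, Aurora, Colorado</p>
        </ins>
        <ins id="I2">
          <p>Princess Royal Hospital, Telford, UK</p>
        </ins>
      </insg>
    </bibl>
  </fm>
</art>
"""


def get_authors_with_institutions(root):
    # Map institution id -> name. iter() walks the whole tree, so this does not
    # depend on exactly where <insg> sits in the document.
    institutions = {ins.get("id"): ins.findtext("p") for ins in root.iter("ins")}

    authors = []
    for author in root.findall("./fm/bibl/aug/au"):
        iids = [insr.get("iid") for insr in author.findall("insr")]
        authors.append({
            # findtext() returns None when the child element is missing.
            "fnm": author.findtext("fnm"),
            "snm": author.findtext("snm"),
            "email": author.findtext("email"),
            "insr": iids,
            "institutions": [institutions.get(iid) for iid in iids],
        })
    return authors


authors = get_authors_with_institutions(ET.fromstring(ARTICLE_SAMPLE))
for author in authors:
    print(author)
```

With the real file, `ET.parse(path).getroot()` would take the place of `ET.fromstring(ARTICLE_SAMPLE)`, provided the paths match.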

# SCREEN SCRAPING

In this example, we go to the website http://www.transtats.bts.gov/Data_Elements.aspx?Data=2 to get data on flight arrivals and departures for a particular airport.

The data wrangling procedure involves the following steps:

1. Build a list of carrier values.
2. Build a list of airport values.
3. Make HTTP requests to download all the data.
4. Parse the data files.

We will use a Python module called Beautiful Soup to parse the HTML.

If we go to the website mentioned above and inspect the elements where we select the carrier or airport, we can see the HTML structure below.


<table  width='720px' >
             
        <tr>
            <td  class="dataTD" style="width: 450px" ><label for="CarrierList">
                 Select a carrier from the dropdown (major carriers) or from a link below: </label>
	        </td>
            <td  class="dataTD" style="width: 250px" ><label for="AirportList">
                  Select an airport: </label>
	        </td>
			<td></td>
   	    </tr>  
	    <tr>
	        <td style="width: 450px">
	            <select name="CarrierList" id="CarrierList" class="slcBox" style="width:450px;">
	<option value="All">All U.S. and Foreign Carriers</option>
	<option value="AllUS">All U.S. Carriers</option>
	<option value="AllForeign">All Foreign Carriers</option>
	<option value="AS">Alaska Airlines </option>
	<option value="G4">Allegiant Air</option>
	<option value="AA">American Airlines </option>


The airport list has a very similar structure. So we need to find the element whose id is "CarrierList", then find all of its descendants with the node name 'option', and then extract the value attribute of each.




In [9]:
from bs4 import BeautifulSoup

def options(soup, id):
    option_values = []
    carrier_list = soup.find(id=id)
    for option in carrier_list.find_all('option'):
        option_values.append(option['value'])
    return option_values


def print_list(label, codes):
    print("\n%s:" % label)
    for c in codes:
        print(c)


def main():
    # Name the parser explicitly; otherwise Beautiful Soup picks one and issues a warning.
    soup = BeautifulSoup(open("virgin_and_logan_airport.html"), "lxml")
    
    codes = options(soup, 'CarrierList')
    print_list("Carriers", codes)
    
    codes = options(soup, 'AirportList')
    print_list("Airports", codes)

main()                                    


Carriers:
All
AllUS
AllForeign
AS
G4
AA
5Y
DL
MQ
EV
F9
HA
B6
OO
WN
NK
UA
VX

Airports:
All
AllMajors
ATL
BWI
BOS
CLT
MDW
ORD
DFW
DEN
DTW
FLL
IAH
LAS
LAX
MIA
MSP
JFK
LGA
EWR
MCO
PHL
PHX
PDX
SLC
SAN
SFO
SEA
TPA
DCA
IAD
AllOthers
UXM
ABR
ABI
DYS
ADK
VZF
BQN
AKK
KKI
AKI
AKO
CAK
7AK
KQA
AUK
ALM
ALS
ABY
ALB
ABQ
ZXB
WKK
AED
AEX
AXN
AET
ABE
AIA
APN
DQH
AOO
AMA
ABL
OQZ
AOS
OTS
AKP
EDF
DQL
MRI
ANC
AND
AGN
ANI
ANN
ANB
ANV
ATW
ACV
ARC
ADM
AVL
HTS
ASE
AST
AHN
AKB
PDK
FTY
ACY
ATT
ATK
MER
AUO
AGS
AUG
AUS
A28
BFL
BGR
BHB
BRW
BTI
BQV
A2K
BTR
BTL
AK2
A56
BTY
BPT
BVD
WBQ
BKW
BED
A11
KBE
BLV
BLI
BLM
JVL
BVU
BJI
RDM
BEH
BET
BTT
BVY
OQB
A50
BIC
BIG
BGQ
BMX
PWR
A85
BIL
BIX
BGM
KBC
BHM
BIS
BYW
BID
BMG
BMI
BFB
BYH
BCT
BOI
RLU
BXS
BLD
BYA
BWG
BZN
BFD
A23
BRD
BKG
PWT
KTS
BDR
TRI
BKX
RBH
BRO
BWD
BQK
BCE
BKC
BUF
IFP
BUR
BRL
BTV
MVW
BNO
BTM
JQF
UXI
CDW
C01
ADW
CDL
CGI
LUR
EHM
CZF
A61
A40
CYT
MDH
CLD
CNM
A87
CPR
CDC
CID
JRV
NRR
CEM
CDR
CIK
CMI
WCR
CHS
CRW
SPB
STT
CHO
CYM
CHA
CYF
WA7
CEX
EGA
NCN
KCN
VAK
CYS
PWK
DPA




Beautiful Soup issues a warning when no parser is named explicitly. To avoid it, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")


Now, let us look at the form to find out how the URL for the request is constructed.

<form method="post" action="./Data_Elements.aspx?Data=2" id="form1">

So we need to submit a post request with action value as indicated above.
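Because the action value is relative, it has to be resolved against the address of the page that contains the form. A small Python 3 sketch of that step, using the standard-library `urljoin` (the constant names are just for illustration):

```python
from urllib.parse import urljoin

# Address of the page that contains the form.
PAGE_URL = "http://www.transtats.bts.gov/Data_Elements.aspx?Data=2"
# Relative action copied from the <form> tag above.
FORM_ACTION = "./Data_Elements.aspx?Data=2"

# Resolve the relative action against the page address to get the POST target.
post_url = urljoin(PAGE_URL, FORM_ACTION)
print(post_url)
```

For this form the resolved address is the same as the page address, so the POST goes back to the page itself.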

Submit the request for a particular carrier and inspect it with Chrome's Inspect Element tool: go to the Network tab and find the POST request submitted by the form.

Now go to the Headers section of the request, where we can see something like the following.

Request URL:http://www.transtats.bts.gov/Data_Elements.aspx?Data=2
Request Method:POST
Status Code:200 OK
Remote Address:204.68.194.70:80
Response Headers
Cache-Control:private
Content-Length:334061
Content-Type:text/html; charset=utf-8
Date:Tue, 29 Nov 2016 18:48:28 GMT
Server:Microsoft-IIS/8.5
X-AspNet-Version:4.0.30319
X-Powered-By:ASP.NET
Request Headers
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate
Accept-Language:en-US,en;q=0.8,ml;q=0.6
Cache-Control:max-age=0
Connection:keep-alive
Content-Length:167402
Content-Type:application/x-www-form-urlencoded
Cookie:ASP.NET_SessionId=r515qitron5awx3yopgkxsxp; __utma=261918792.416830484.1480367281.1480367281.1480442978.2; __utmb=261918792.6.10.1480442978; __utmc=261918792; __utmz=261918792.1480367281.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)
Host:www.transtats.bts.gov
Origin:http://www.transtats.bts.gov
Referer:http://www.transtats.bts.gov/Data_Elements.aspx?Data=2
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36
Query String Parameters
Data:2


Form Data

__EVENTTARGET:
__EVENTARGUMENT:
__VIEWSTATE:/wEPDwULLTE3NTUzNjYzMDUPFg4eB3N0ckNvbm4FWlByb3ZpZGVyPS5ORVQgRnJhbWV3b3JrIERhdGEgUHJvdmlkZXIgZm9yIE9EQkM7RFNOPUVuZGVhdm91cjt1aWQ9d2VidXNlcjtwd2Q9IVdlYnVzZXIxMjM0Ox4FTUxpc3QFrQEnQVRMJywnQldJJywnQk9TJywnQ0xUJywnTURXJywnT1JEJywnREZXJywnREVOJywnRFRXJywnRkxMJywnSUFIJywnTEFTJywnTEFYJywnTUlBJywnTVNQJywnSkZLJywnTEdBJywnRVdSJywnTUNPJywnUEhMJywnUEhYJywnUERYJywnU0xDJywnU0FOJywnU0ZPJywnU0VBJywnVFBBJn1h0c7od2CNQLRCOlBGk5YLwthc199lRt5ELWg7HE/DPerIVu7sQ+/Qur2c6y6M3GinPnoyo/wBGuaOpTMP/V4RIr1quHOr7Buu7wx5uX0f0WoicwvYnU7Xm8SN3wFxYYfWi3STRI7sJ+66UwWwsEpPB/1eQonEGVjqi0F3HGuu9GATQimsy9mmy43CEV6P28pCSaxfKo3ybn5ooiHSQ5nnSHYxWK08tskaNyRDzQUF0gUVwSjqDLjLC1XWUGX0n/yFDvBsZTpO+hyQJNkZjIzf5i3OWXP5sTrEGvaHLbjNeIsw1Qa6see/YYrkYa8AT893dpZgbkcCueqGFtJfQJZPPY77ku4AM7P+grA==
__VIEWSTATEGENERATOR:8E3A4798
__EVENTVALIDATION:/wEdAMoJZ/HojNNL2IMgjWEiZnbylvc8W+KtN+Mo2aSUQqdKHnn1wNLGBL09odqqO9CBtbAJsDUC3lheYiZJgW/YlADjuWzIaw6NWZdPqqJzsUIui3WJba4zPjTfLRRsH0Y4tKCbvFwJUL16Fg2zvSQ8pNpmmiXKEOf5q1Kv3vsuvzhW05PQbvopHFZM7OfQDQb4hdOUdXAXRWTwuBeN66YRUYZW+iHUadUKYEzQDBGkXUs/YCGr0cwlD3oBni9ShkvD3kwjk8WcekoqTVcnps8CXSCz2VxSHFLZn8o/OI/mzSaJLF7n4FW7/iSCbjzg5qjsDMH5Z4x2xKMscyTvkWGm4eCnGGT+PhzP2oB89KGJMNTRcLt8dfZ0OLmTKBRvL+6aLO1Jlqb4uy82+C1G/TuY290BKE0bVp+gYhNwVZBEHAug4oRNRIquFUPBmZH
CarrierList:VX
AirportList:BOS
Submit:Submit


The form data section appears as above. So we need to pass __EVENTTARGET, __EVENTARGUMENT, __VIEWSTATE, __VIEWSTATEGENERATOR, __EVENTVALIDATION, CarrierList, AirportList and Submit.
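One way to assemble these fields is sketched below in Python 3. It assumes that the five double-underscore values are present in the downloaded page as `<input type="hidden">` elements, which the notes above do not show, and it uses the standard-library `HTMLParser` as a dependency-free stand-in for Beautiful Soup. The sample page, its placeholder values and the names `HiddenInputParser` and `build_form_data` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request

POST_URL = "http://www.transtats.bts.gov/Data_Elements.aspx?Data=2"
HIDDEN_FIELDS = ("__EVENTTARGET", "__EVENTARGUMENT", "__VIEWSTATE",
                 "__VIEWSTATEGENERATOR", "__EVENTVALIDATION")


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_form_data(page_html, carrier, airport):
    """Combine the page's hidden fields with the carrier and airport choices."""
    parser = HiddenInputParser()
    parser.feed(page_html)
    form_data = {name: parser.fields.get(name, "") for name in HIDDEN_FIELDS}
    form_data["CarrierList"] = carrier
    form_data["AirportList"] = airport
    form_data["Submit"] = "Submit"
    return form_data


# Placeholder page: assumed layout, shortened made-up values.
SAMPLE_PAGE = """
<form method="post" action="./Data_Elements.aspx?Data=2" id="form1">
  <input type="hidden" name="__EVENTTARGET" value="" />
  <input type="hidden" name="__EVENTARGUMENT" value="" />
  <input type="hidden" name="__VIEWSTATE" value="placeholder-viewstate" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" value="8E3A4798" />
  <input type="hidden" name="__EVENTVALIDATION" value="placeholder-validation" />
</form>
"""

form_data = build_form_data(SAMPLE_PAGE, "VX", "BOS")
body = urlencode(form_data)

# Passing data makes this a POST request. Nothing is sent until the request
# is handed to urllib.request.urlopen().
request = Request(POST_URL, data=body.encode("ascii"))
print(sorted(form_data))
print(request.get_method())
```

The hidden values are read from the page rather than hard-coded because they are generated by the server and can differ from one page load to the next.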

