# Reading and manipulating XML 

The [`xml.etree.ElementTree` documentation](https://docs.python.org/3/library/xml.etree.elementtree.html) is a useful reference for this topic.

In [2]:
import xml.etree.ElementTree as ET
import urllib.request

### A simple example: the weather

[This website](https://w1.weather.gov/xml/current_obs/KOAK.xml) is hosted by the National Weather Service and shows the current weather at Oakland airport in XML format. To see the raw XML, right-click the page and select "View page source".

- We are going to extract the temperature in degrees F, which is tagged 'temp_f'.

- To do this, we first need to access the data from the web - see the very convenient [urllib](https://docs.python.org/3/howto/urllib2.html).

In [3]:
# Access the data from the web.
target_url = 'http://www.weather.gov/xml/current_obs/KOAK.xml'
with urllib.request.urlopen(target_url) as response:
    data = response.read()  # raw bytes, not yet parsed
data

b'<?xml version="1.0" encoding="ISO-8859-1"?> \r\n<?xml-stylesheet href="latest_ob.xsl" type="text/xsl"?>\r\n<current_observation version="1.0"\r\n\t xmlns:xsd="http://www.w3.org/2001/XMLSchema"\r\n\t xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\r\n\t xsi:noNamespaceSchemaLocation="http://www.weather.gov/view/current_observation.xsd">\r\n\t<credit>NOAA\'s National Weather Service</credit>\r\n\t<credit_URL>http://weather.gov/</credit_URL>\r\n\t<image>\r\n\t\t<url>http://weather.gov/images/xml_logo.gif</url>\r\n\t\t<title>NOAA\'s National Weather Service</title>\r\n\t\t<link>http://weather.gov</link>\r\n\t</image>\r\n\t<suggested_pickup>15 minutes after the hour</suggested_pickup>\r\n\t<suggested_pickup_period>60</suggested_pickup_period>\n\t<location>Oakland, Metro Oakland International Airport, CA</location>\n\t<station_id>KOAK</station_id>\n\t<latitude>37.7178</latitude>\n\t<longitude>-122.23294</longitude>\n\t<observation_time>Last Updated on Sep 27 2020, 2:53 pm PDT</observ

### You need an XML parser

You may be tempted to pull the desired information out of the data above with a regular expression. In this simple case it would probably work, but in general it is a bad idea and can lead to [memorable meltdowns](http://stackoverflow.com/a/1732454). The top-ranked response there is tongue-in-cheek and a bit purple, but I hope it helps to motivate the use of XML parsers in general.
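To make the risk concrete, here is a small sketch using a made-up document (invented for illustration, not the real NWS feed) in which a `temp_f` tag also appears inside a comment. A naive regular expression matches text rather than structure, so it picks up the commented-out value; the parser does not.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample, invented for illustration; it is not the real feed.
sample = """<current_observation>
  <!-- <temp_f>-999</temp_f> placeholder left by an earlier version -->
  <temp_f>95.0</temp_f>
</current_observation>"""

# A naive regex sees the tag inside the comment as well as the real one.
regex_hits = re.findall(r"<temp_f>(.*?)</temp_f>", sample)

# The default ElementTree parser discards comments, so only the real element is found.
parsed_hits = [elt.text for elt in ET.fromstring(sample).iter("temp_f")]

print(regex_hits)   # two matches, one of them bogus
print(parsed_hits)  # one match
```

Comments are only one of several constructs (CDATA sections, attributes, namespaces, nesting) that a pattern written against the "usual" layout tends to mishandle.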

In [4]:
root = ET.fromstring(data)
root

<Element 'current_observation' at 0x7f98f04912c0>

In [5]:
# first examine the root of the tree you just read in
print(root.tag, root.attrib)

current_observation {'version': '1.0', '{http://www.w3.org/2001/XMLSchema-instance}noNamespaceSchemaLocation': 'http://www.weather.gov/view/current_observation.xsd'}
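The second attribute key above is written in ElementTree's `{namespace-URI}localname` form, because the attribute carries the `xsi:` prefix in the source. The sketch below, which uses a trimmed-down stand-in for the root element rather than the live feed, shows one way to look such an attribute up; the names `sample_root` and `XSI` are only illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the root element of the feed.
sample = (
    '<current_observation version="1.0" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:noNamespaceSchemaLocation="http://www.weather.gov/view/current_observation.xsd"/>'
)
sample_root = ET.fromstring(sample)

# Namespace URI that the xsi: prefix stands for in this document.
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Namespaced attributes are keyed as "{uri}localname"; plain ones by name alone.
schema_url = sample_root.get(f"{{{XSI}}}noNamespaceSchemaLocation")
version = sample_root.get("version")

print(schema_url)
print(version)
```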


In [6]:
# now look at children of root node.
for child in root:
    print(child.tag, ":", child.attrib)

credit : {}
credit_URL : {}
image : {}
suggested_pickup : {}
suggested_pickup_period : {}
location : {}
station_id : {}
latitude : {}
longitude : {}
observation_time : {}
observation_time_rfc822 : {}
weather : {}
temperature_string : {}
temp_f : {}
temp_c : {}
relative_humidity : {}
wind_string : {}
wind_dir : {}
wind_degrees : {}
wind_mph : {}
wind_kt : {}
pressure_string : {}
pressure_mb : {}
pressure_in : {}
dewpoint_string : {}
dewpoint_f : {}
dewpoint_c : {}
heat_index_string : {}
heat_index_f : {}
heat_index_c : {}
visibility_mi : {}
icon_url_base : {}
two_day_history_url : {}
icon_url_name : {}
ob_url : {}
disclaimer_url : {}
copyright_url : {}
privacy_policy_url : {}


In [7]:
# finally, focus in on temperature in F, i.e. tag 'temp_f'.
# note that there could in principle be multiple elements with the target tag, so we get all of them.
for elt in root.iter('temp_f'):  # iter: traverse the whole tree, find nodes with the tag 'temp_f'
    print("Temperature in degrees F:", elt.text)

Temperature in degrees F: 95.0
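Note that `elt.text` is a string. When only the first matching child is needed, `findtext` is a compact alternative to the loop, and the result can be converted to a number explicitly. The sketch below runs on a small stand-in document rather than the live feed; `no_such_tag` is a deliberately absent tag used to show the missing-element case.

```python
import xml.etree.ElementTree as ET

# Small stand-in for the feed, with only the two temperature elements.
sample = (
    "<current_observation>"
    "<temp_f>95.0</temp_f><temp_c>35.0</temp_c>"
    "</current_observation>"
)
obs = ET.fromstring(sample)

# With a plain tag name, findtext looks at direct children only
# and returns None when nothing matches.
temp_f_text = obs.findtext("temp_f")
temp_f = float(temp_f_text) if temp_f_text is not None else None

missing = obs.findtext("no_such_tag")

print(temp_f)
print(missing)
```

Unlike `iter`, which walks the whole tree, a plain tag name passed to `find` or `findtext` does not descend into grandchildren; that is sufficient here because `temp_f` is a direct child of the root.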


### Another example: a list of languages from Glottolog

[This webpage](https://glottolog.org/glottolog/language) lists languages in Glottolog.  If you press the little download button in the upper right of the page, and select 'kml', you will get the [same information in KML](https://glottolog.org/glottolog/language.kml), which is an [XML-based notation for expressing geographic annotation and visualization](https://en.wikipedia.org/wiki/Keyhole_Markup_Language).

We are going to read this in, and create a sorted list of all the languages mentioned.

In [8]:
# As before, we start by reading in the data from the web.
target_url = 'https://glottolog.org/glottolog/language.kml'
with urllib.request.urlopen(target_url) as response:
    data = response.read()  # raw bytes, not yet parsed
data

b'<?xml version="1.0" encoding="utf-8"?>\n<kml xmlns="http://earth.google.com/kml/2.1"\n     xmlns:atom="http://www.w3.org/2005/Atom">\n  <Document>\n    <name>Languages</name>\n      <description>\n      </description>\n    <open>1</open>\n    \n    <Style id="cani1243">\n      <IconStyle>\n        <Icon>\n          <href></href>\n        </Icon>\n      </IconStyle>\n    </Style>\n    <Placemark>\n      <name>Canichana</name>\n      <description>\n        <![CDATA[\n        <a class="Language" href="https://glottolog.org/resource/languoid/id/cani1243" title="Canichana">Canichana</a>\n        ]]>\n      </description>\n      <Poin

In [10]:
# now represent as XML tree
root = ET.fromstring(data)
root

<Element '{http://earth.google.com/kml/2.1}kml' at 0x7f98ed07b310>

In [11]:
# now look at children of root node.
for child in root:
    print(child.tag, ":", child.attrib)

{http://earth.google.com/kml/2.1}Document : {}


In [12]:
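# Note the namespace prefix that ElementTree attaches to every tag in this document.
th = '{http://earth.google.com/kml/2.1}'

ll = []  # list to hold results
# Recursively traverse the entire tree, pulling out 'name' elements only.
# Start from root rather than relying on the loop variable `child` left over from the previous cell.
for name in root.iter(th + 'name'):  # namespace prefix + local tag name
    ll.append(name.text)

# Now show the list, sorted.
sorted(ll)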

['!O!ung',
 '//Xegwi',
 '/Xam',
 'Aari',
 'Aariya',
 'Aasax',
 'Abau',
 'Abawiri',
 'Abaza',
 'Abinomn',
 'Abipon',
 'Abkhaz',
 'Abom',
 'Abu',
 'Abui',
 'Abun',
 'Acheron',
 "Achi', Cubulco",
 'Achuar-Shiwiar',
 'Achumawi',
 'Acroá',
 'Adabe',
 'Adai',
 'Adamorobe Sign Language',
 'Adangme',
 'Adap',
 'Aduge',
 'Adurgari',
 'Adyghe',
 'Adzera (Retired)',
 'Aekyom',
 'Aewa',
 'Afghan Sign Language',
 'Afghanistan Gorbat',
 'Afitti',
 'Afrihili',
 'Agariya',
 'Agavotaguerra',
 'Agbirigba',
 'Aghu Tharnggalu (Retired)',
 'Agi',
 'Agob-Ende-Kawam',
 'Aguano',
 'Aguaruna',
 'Agutaynen',
 'Ahe',
 'Ahtena',
 'Aikanã',
 'Aiku',
 'Aimele',
 'Ainbai',
 'Aiome',
 'Airoran',
 'Aja (Benin)',
 'Aja (Sudan)',
 'Ak',
 'Aka',
 'Akabea',
 'Akabo',
 'Akacari',
 'Akajeru',
 'Akakede',
 'Akakol',
 'Akakora',
 'Akarbale',
 'Akhvakh',
 'Akkadian',
 'Akpes',
 'Al-Sayyid Bedouin Sign Language',
 'Alabama',
 'Alacalufe-Austral',
 'Alacalufe-Central',
 'Alaguilac',
 'Alak',
 'Alamblak',
 'Alangan',
 'Alapmunte'
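As an alternative to concatenating the namespace string by hand, `find` and `findall` accept a prefix-to-URI mapping. The sketch below uses a tiny stand-in for the KML file (two placemarks, no network access); the prefix `kml` and the variable names are arbitrary choices for illustration. Restricting the path to `Placemark/name` also leaves out the `Document` element's own `<name>Languages</name>`, which a plain `iter` over `name` would include.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for the downloaded KML, with the same default namespace.
sample_kml = """<kml xmlns="http://earth.google.com/kml/2.1">
  <Document>
    <name>Languages</name>
    <Placemark><name>Canichana</name></Placemark>
    <Placemark><name>Abau</name></Placemark>
  </Document>
</kml>"""

# Map a prefix of our choosing to the namespace URI used by the document.
ns = {"kml": "http://earth.google.com/kml/2.1"}

kml_root = ET.fromstring(sample_kml)

# ".//" searches all descendants; the path keeps only names directly under a Placemark.
names = sorted(
    elt.text for elt in kml_root.findall(".//kml:Placemark/kml:name", ns)
)
print(names)
```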

### Observations

- With the proper tools at your disposal, handling XML is pretty easy!