# XML (Extensible Markup Language)
The XML standard is a flexible way to create information formats and electronically share structured data.
* XML is short for Extensible Markup Language and is used to describe data
* XML is a markup language, much like HTML
* XML was designed to store and transport data
* XML was designed to be self-descriptive
* XML is a W3C Recommendation

Although JSON has replaced XML in many web APIs, XML is still widely used (for example in configuration files, document formats, and data feeds), so it remains one of the fundamental data formats and is worth learning.

Online Resources:
* https://www.w3schools.com/xml/xml_whatis.asp
* https://docs.python.org/3/library/xml.etree.elementtree.html

An example of XML:

<img src='images/XML.png'>

# XML Tutorial
Python's standard library handles XML through the `xml` package. This tutorial uses its `xml.etree.ElementTree` module, so let's import it at the beginning of the notebook.

In [1]:
import xml.etree.ElementTree as ET

Let's load and parse the data into Python. `ET.parse()` reads the file and returns an `ElementTree` object that represents the whole document, as the printed type shows.

In [2]:
tree = ET.parse('data/data.xml')
print(type(tree))

<class 'xml.etree.ElementTree.ElementTree'>
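
When the XML is already in memory as a string rather than in a file, `ET.fromstring()` can parse it. The sketch below uses a hypothetical, shortened `SAMPLE` string as a stand-in for `data/data.xml`, whose contents are not shown here. Note that `fromstring()` returns the root `Element` directly rather than an `ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for data/data.xml (the real file is not shown).
SAMPLE = """
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
    </country>
</data>
"""

# fromstring() parses text and returns the root Element directly,
# so no getroot() call is needed afterwards.
sample_root = ET.fromstring(SAMPLE)
print(type(sample_root))
print(sample_root.tag, len(sample_root))
```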


To get the main (`root`) tag of the file, we can call the method `getroot()`:

In [3]:
root = tree.getroot()
root

<Element 'data' at 0x0000016252D0F900>

Now, `root` represents the top element of the file. We can check its tag (`root.tag`), its attributes (`root.attrib`, a dictionary), and its number of children (`len(root)`).

In [4]:
print(root.tag)
print(root.attrib)
print(len(root))

data
{}
3
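
The empty dictionary above means that the root tag carries no attributes. As a small sketch of the difference, the hypothetical inline document below (not the real `data/data.xml`) has an attribute-free root and a child with one attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <data> has no attributes, <country> has one.
doc = ET.fromstring(
    '<data><country name="Panama"><rank>68</rank></country></data>'
)

print(doc.attrib)     # attributes of the root
print(doc[0].attrib)  # attributes of its first child
print(len(doc[0]))    # number of children of <country>
```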


We can see that the length of this element is 3. This means that it has 3 children. We can access these children the same way as elements in a `list`.

In [5]:
# First child of the root
country1 = root[0]
# First child of that child (a grandchild of the root)
rank = country1[0]
# Tag of the grandchild
print(rank.tag)
# Text inside the grandchild
print(rank.text)
# Attributes of the country's fifth child (index 4)
print(country1[4].attrib)

rank
1
{'name': 'Switzerland', 'direction': 'W'}
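
List-style access goes beyond positive indices. The sketch below, built on a hypothetical inline `<country>` element modeled on the output above, shows a negative index and a plain loop over the direct children.

```python
import xml.etree.ElementTree as ET

# Hypothetical element modeled on the first country in the outputs above.
country = ET.fromstring(
    '<country name="Liechtenstein">'
    '<rank>1</rank><year>2008</year><gdppc>141100</gdppc>'
    '<neighbor name="Austria" direction="E"/>'
    '<neighbor name="Switzerland" direction="W"/>'
    '</country>'
)

# Negative indices count from the end, as with a list.
last = country[-1]
print(last.tag, last.attrib)

# Iterating over an element yields its direct children in document order.
tags = [child.tag for child in country]
print(tags)
```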


To extract the information from all children we need to iterate through the file. We have a couple of options.

In [6]:
# Find all direct children of the root with the tag 'country'
for country in root.findall('country'):
    # rank is a child element of the country
    rank = country.find('rank').text
    # name is an attribute of the country
    name = country.get('name')
    print(name, rank)

Liechtenstein 1
Singapore 4
Panama 68
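
Note that `find()` returns `None` when no matching child exists, so `country.find('rank').text` raises an `AttributeError` for incomplete records. A minimal sketch of a more forgiving loop, assuming hypothetical data in which one country lacks a `rank` tag and no country has a `region` attribute:

```python
import xml.etree.ElementTree as ET

# Hypothetical data: the second country has no <rank> child.
doc = ET.fromstring(
    '<data>'
    '<country name="Singapore"><rank>4</rank></country>'
    '<country name="Panama"></country>'
    '</data>'
)

results = []
for country in doc.findall('country'):
    # findtext() returns the default instead of failing when the tag is absent.
    rank = country.findtext('rank', default='n/a')
    # get() likewise accepts a fallback for a missing attribute.
    region = country.get('region', 'unknown')
    results.append((country.get('name'), rank, region))

print(results)
```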


We can also look for grandchildren directly if we know their tag. `iter()` walks the whole subtree below the element, so it finds matching elements at any depth:

In [7]:
for neighbor in root.iter('neighbor'):
    print(neighbor.attrib)

{'name': 'Austria', 'direction': 'E'}
{'name': 'Switzerland', 'direction': 'W'}
{'name': 'Malaysia', 'direction': 'N'}
{'name': 'Costa Rica', 'direction': 'W'}
{'name': 'Colombia', 'direction': 'E'}
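
The difference between `findall()` with a bare tag and `iter()` is easy to miss: the former only inspects direct children. A short sketch on a hypothetical inline document (not the real `data/data.xml`):

```python
import xml.etree.ElementTree as ET

# Hypothetical document with <neighbor> elements nested inside <country>.
doc = ET.fromstring(
    '<data>'
    '<country name="Singapore">'
    '<neighbor name="Malaysia" direction="N"/>'
    '</country>'
    '<country name="Panama">'
    '<neighbor name="Costa Rica" direction="W"/>'
    '<neighbor name="Colombia" direction="E"/>'
    '</country>'
    '</data>'
)

# findall('neighbor') looks only at direct children of <data>: none match.
direct = doc.findall('neighbor')
# iter('neighbor') walks the whole subtree, so it finds all three.
nested = [n.get('name') for n in doc.iter('neighbor')]

print(len(direct), nested)
```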


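Here are some tips and tricks on how to work with `root.findall()`, which accepts a limited subset of XPath expressions. Because a notebook cell displays only the value of its last expression, the output below belongs to the final query: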

In [8]:
# The top-level element itself
root.findall(".")
# All 'neighbor' grandchildren of the top-level element that sit under 'country' children
root.findall("./country/neighbor")
# Elements with name='Singapore' that have a 'year' child
root.findall(".//year/..[@name='Singapore']")
# 'year' elements that are children of elements with name='Singapore'
root.findall(".//*[@name='Singapore']/year")
# All 'neighbor' elements that are the second 'neighbor' child of their parent
root.findall(".//neighbor[2]")

[<Element 'neighbor' at 0x0000016252D0FAE0>,
 <Element 'neighbor' at 0x0000016252D0FE50>]
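
To see what several path expressions return, each result can be collected instead of relying on the cell's last value. The sketch below runs three of the queries against a hypothetical inline document (a cut-down stand-in for `data/data.xml`) and counts the matches.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: Singapore has one neighbor, Panama has two.
doc = ET.fromstring(
    '<data>'
    '<country name="Singapore">'
    '<year>2011</year>'
    '<neighbor name="Malaysia" direction="N"/>'
    '</country>'
    '<country name="Panama">'
    '<year>2011</year>'
    '<neighbor name="Costa Rica" direction="W"/>'
    '<neighbor name="Colombia" direction="E"/>'
    '</country>'
    '</data>'
)

queries = [
    "./country/neighbor",
    ".//*[@name='Singapore']/year",
    ".//neighbor[2]",
]

counts = {}
for query in queries:
    matches = doc.findall(query)
    counts[query] = len(matches)
    print(f"{query!r}: {len(matches)} match(es)")

# [2] selects the second <neighbor> among its siblings, so only Panama qualifies.
second_neighbors = [n.get('name') for n in doc.findall(".//neighbor[2]")]
print(second_neighbors)
```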

Extract the name, rank, year and gdppc from the countries and create a Pandas DataFrame.

In [10]:
import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse('data/data.xml')  # Load from file
root = tree.getroot()

my_dict = {'name': [],
           'rank': [],
           'year': [],
           'gdppc': []}


for country in root:
    name_value = country.attrib['name']
    my_dict['name'].append(name_value)

    rank_value = country[0].text
    my_dict['rank'].append(rank_value)

    year_value = country[1].text
    my_dict['year'].append(year_value)

    gdppc_value = country[2].text
    my_dict['gdppc'].append(gdppc_value)

df = pd.DataFrame(my_dict) 
df

|   | name | rank | year | gdppc |
|---|------|------|------|-------|
| 0 | Liechtenstein | 1 | 2008 | 141100 |
| 1 | Singapore | 4 | 2011 | 59900 |
| 2 | Panama | 68 | 2011 | 13600 |
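
Indexing children by position (`country[0]`, `country[1]`, ...) depends on the order of the tags in the file, and `.text` always yields strings. A possible alternative, sketched under the assumption that every country has `rank`, `year`, and `gdppc` children, looks the tags up by name and converts them to integers. The inline `SAMPLE` is a hypothetical stand-in for `data/data.xml`; the resulting list of dictionaries can be passed to `pd.DataFrame(rows)`.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for data/data.xml, using the values shown above.
SAMPLE = """
<data>
    <country name="Liechtenstein">
        <rank>1</rank><year>2008</year><gdppc>141100</gdppc>
    </country>
    <country name="Singapore">
        <rank>4</rank><year>2011</year><gdppc>59900</gdppc>
    </country>
    <country name="Panama">
        <rank>68</rank><year>2011</year><gdppc>13600</gdppc>
    </country>
</data>
"""

FIELDS = ('rank', 'year', 'gdppc')

rows = []
for country in ET.fromstring(SAMPLE).findall('country'):
    row = {'name': country.get('name')}
    for field in FIELDS:
        # Look each tag up by name and convert its text to an integer.
        row[field] = int(country.findtext(field))
    rows.append(row)

print(rows)
```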


Because all children of the `root` are countries in this file,
```python
for country in root:
```
is equivalent to
```python
for country in root.findall('country'):
```
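
This equivalence holds only while every child of the root is a `country`. A short sketch with a hypothetical document whose root also contains a `source` tag shows where the two loops diverge:

```python
import xml.etree.ElementTree as ET

# Hypothetical document whose root also holds a non-country child.
doc = ET.fromstring(
    '<data>'
    '<source>example</source>'
    '<country name="Singapore"/>'
    '<country name="Panama"/>'
    '</data>'
)

# Plain iteration visits every direct child, whatever its tag.
all_children = [child.tag for child in doc]
# findall('country') keeps only the children with that tag.
only_countries = [child.tag for child in doc.findall('country')]

print(all_children)
print(only_countries)
```

When a file may contain mixed children, filtering by tag with `findall()` is the safer choice.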