**XML**

XML (*Extensible Markup Language*) is a general method for storing information in a text file.

It resembles HTML, but HTML is one specific language, structured like XML, that describes how information is to be displayed in a browser.

XML addresses a more general need, that of storing information for a wide variety of purposes. One common use of XML is to put information in something akin to the familiar idea of a *form*, with fields reserved for specific information.

XML is unlike HTML in that it has no predefined tags. A user of XML can decide which tags they wish to use.

XML represents the data itself, separately from any description of how the data ought to be displayed.

The process of creating an HTML display of the data in an XML file is made easier by XSLT (Extensible Stylesheet Language Transformations). This is not discussed here.

Users can decide on a common understanding of how, in a particular application, their XML files ought to be structured. In doing this, they may arrive at an XML *schema* (also not discussed here).

Here are some resources for learning about XML:

https://www.w3schools.com/xml/

and for the related element tree package we'll be using below:

https://docs.python.org/3/library/xml.etree.elementtree.html

https://www.datacamp.com/tutorial/python-xml-elementtree


**Structure of an XML file**

Typically an XML file starts with a *prolog* line indicating the XML version and the encoding.

$$\mbox{<?xml version="1.0" encoding="utf-8"?>}$$

**Tag syntax**

XML tags can be 

- opening $\mbox{<tagname ...>}$
- closing $\mbox{</tagname>}$
- self-closing $\mbox{<tagname ... />}$

As we saw for HTML, tags can have attributes (name/value pairs):

$$\mbox{<tagname attr1="..." attr2="..." ... >}$$ 


The tags in an XML file must be *nested* which gives an XML file a hierarchical tree structure.
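
As a minimal sketch of what nesting means in practice, the snippet below parses a small hand-written document with Python's xml.etree.ElementTree module (introduced later in this notebook) and prints one indented line per element. The document and the helper name `show` are illustrative assumptions, not part of the sample file. It also shows that tags closed in the wrong order are rejected by the parser.

```python
import xml.etree.ElementTree as ET

# A small hand-written document (illustrative, not the sample file used below).
doc = """<Root>
  <Customers>
    <Customer CustomerID="GREAL">
      <CompanyName>Great Lakes Food Market</CompanyName>
    </Customer>
  </Customers>
  <Orders/>
</Root>"""

def show(elem, depth=0):
    """Return one indented line per element, children listed below their parent."""
    lines = ["  " * depth + elem.tag]
    for child in elem:
        lines.extend(show(child, depth + 1))
    return lines

tree_lines = show(ET.fromstring(doc))
print("\n".join(tree_lines))

# Tags that overlap instead of nesting are not well-formed XML.
try:
    ET.fromstring("<a><b></a></b>")
    nested_ok = True
except ET.ParseError:
    nested_ok = False
print(nested_ok)
```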

To illustrate, we'll use the sample xml file found here:

https://docs.microsoft.com/en-us/dotnet/standard/linq/sample-xml-file-customers-orders

We start by reading it into a string, then print out part of it.

In [5]:
with open("sample.xml","rb") as fin:
    b=fin.read()
    text=b.decode()
print(text[:200])
print("\n\n\n\n\n")
print(text[-200:])

<?xml version="1.0" encoding="utf-8"?>
<Root>
  <Customers>
    <Customer CustomerID="GREAL">
      <CompanyName>Great Lakes Food Market</CompanyName>
      <ContactName>Howard Snyder</ContactNam






Francisco</ShipCity>
        <ShipRegion>CA</ShipRegion>
        <ShipPostalCode>94117</ShipPostalCode>
        <ShipCountry>USA</ShipCountry>
      </ShipInfo>
    </Order>
  </Orders>
</Root>


**Element tree package**

In an XML file, an element refers to everything from a start tag to its corresponding end tag (including both).

We can use the element tree package to extract elements from an XML file.

We first get the root element of the tree from our xml file.

In [6]:
import xml.etree.ElementTree as ET
tree = ET.parse('sample.xml')
root = tree.getroot()
print(root)

<Element 'Root' at 0x0000025971713630>


We can also get the root directly from a string like the one we read from the file above.

In [7]:
root=ET.fromstring(text)
print(root)

<Element 'Root' at 0x000002597322E310>


**Element attributes**

Each element has 

- a tag
- attributes (possibly empty)
- text/data

Have a look at the root element.

In [8]:
print("tag="+root.tag)
print("attrib="+str(root.attrib))
print("text = "+str(root.text))

tag=Root
attrib={}
text = 
  


**Iterating over descendants**

For any element, we can iterate over the element itself and **all of its descendants** using the **element.iter** method and print out related information.

Here, we print the first 26 tags produced by iterating from the root element (the root itself comes first).

In [9]:
ctr=0
for e in root.iter():
    if ctr>25:
        break
    print(e.tag)
    ctr+=1

Root
Customers
Customer
CompanyName
ContactName
ContactTitle
Phone
FullAddress
Address
City
Region
PostalCode
Country
Customer
CompanyName
ContactName
ContactTitle
Phone
Fax
FullAddress
Address
City
Region
PostalCode
Country
Customer
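
A possible alternative to the manual counter is itertools.islice, which stops an iterator after a fixed number of items. The sketch below uses a small stand-in document (an assumption, so that it runs without the sample file) and takes the first 4 elements.

```python
import itertools
import xml.etree.ElementTree as ET

# Small stand-in document; the tag names follow the sample file.
root_demo = ET.fromstring(
    "<Root><Customers>"
    "<Customer><CompanyName/><ContactName/></Customer>"
    "<Customer><CompanyName/></Customer>"
    "</Customers></Root>"
)

# islice stops after n items, so no counter variable or break is needed.
first_tags = [e.tag for e in itertools.islice(root_demo.iter(), 4)]
print(first_tags)
```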


**Restricting to specific tags**

We can specify a tag to restrict to when we iterate. 

Here

- we restrict to Customer tags, and
- when we find one we iterate over elements descending from it.

In [10]:
for e in root.iter("Customer"):
    for f in e.iter():
        st="{:20s} {:20s}".format(f.tag,f.text)
        print(st)

Customer             
                   
CompanyName          Great Lakes Food Market
ContactName          Howard Snyder       
ContactTitle         Marketing Manager   
Phone                (503) 555-7555      
FullAddress          
                   
Address              2732 Baker Blvd.    
City                 Eugene              
Region               OR                  
PostalCode           97403               
Country              USA                 
Customer             
                   
CompanyName          Hungry Coyote Import Store
ContactName          Yoshi Latimer       
ContactTitle         Sales Representative
Phone                (503) 555-6874      
Fax                  (503) 555-2376      
FullAddress          
                   
Address              City Center Plaza 516 Main St.
City                 Elgin               
Region               OR                  
PostalCode           97827               
Country              USA                 
Customer       
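
One caveat with the formatting above: an element with no text at all has text equal to None, and passing None to a {:20s} field raises a TypeError. The sample file happens to avoid this, but a hedged sketch of a more defensive loop might look like the following. The empty Fax element here is a hypothetical case added for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment; the empty <Fax/> is a hypothetical case.
customer = ET.fromstring(
    '<Customer CustomerID="GREAL">'
    '<CompanyName>Great Lakes Food Market</CompanyName>'
    '<Fax/>'
    '</Customer>'
)

rows = []
for f in customer.iter():
    # f.text is None when an element has no text; fall back to "".
    text = (f.text or "").strip()
    rows.append("{:20s} {:20s}".format(f.tag, text))
print("\n".join(rows))
```

Stripping the text also removes the whitespace-only lines that appear in the output above for elements such as Customer and FullAddress.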

**Specifying children of an element**

The children of an element can be accessed using square brackets.

In [11]:
x=root[0]
print(x.tag)

Customers


In [12]:
x=root[1]
print(x.tag)

Orders


This works at all levels.

In [13]:
print(root[0][1].tag)
print(root[0][1][0].tag)
print(root[0][1][0].text)

Customer
CompanyName
Hungry Coyote Import Store
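
Besides indexing, an element also behaves like a sequence of its direct children in other ways. The sketch below, on a small stand-in document (an assumption, not the sample file), uses len() and a plain for loop to look at children without touching deeper descendants.

```python
import xml.etree.ElementTree as ET

# Stand-in document with the same top-level layout as the sample file.
root_demo = ET.fromstring(
    "<Root>"
    "<Customers><Customer/><Customer/></Customers>"
    "<Orders><Order/></Orders>"
    "</Root>"
)

# Iterating over an element yields its direct children in document order.
child_tags = [child.tag for child in root_demo]

# len() counts direct children only, not all descendants.
counts = {child.tag: len(child) for child in root_demo}

print(child_tags)
print(counts)
```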


**Searching for elements using XPath**

XPath is a query language for picking out portions of an XML document. Some of this capability is available in the element tree package.


We can use findall to collect the elements that match a *path* specification, relative to a specific element.

In [14]:
it=root.findall(".")
for x in it:
    print(x.tag)

Root


In [15]:
it=root.findall("./")
for x in it:
    print(x.tag)

Customers
Orders


Here, for each element found, we search again among its children.

In [16]:
it=root.findall("./Customers")
for x in it:
    it2=x.findall("Customer")
    for x2 in it2:
        print(x2.attrib)

{'CustomerID': 'GREAL'}
{'CustomerID': 'HUNGC'}
{'CustomerID': 'LAZYK'}
{'CustomerID': 'LETSS'}


In [13]:
it=root.findall("./Customers/Customer")
for x in it:
    it2=x.iter()
    for x2 in it2:
        if x2.tag=="CompanyName":
            print(x2.text)

Great Lakes Food Market
Hungry Coyote Import Store
Lazy K Kountry Store
Let's Stop N Shop
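
The inner loop above can often be avoided by putting the target tag in the path itself. The following is a minimal sketch on a hand-written fragment (an assumption, so that it runs without the sample file). It also shows the `.//` form, which searches all descendants, and findtext, which returns the text of the first match or a default.

```python
import xml.etree.ElementTree as ET

root_demo = ET.fromstring("""<Root>
  <Customers>
    <Customer CustomerID="GREAL">
      <CompanyName>Great Lakes Food Market</CompanyName>
    </Customer>
    <Customer CustomerID="HUNGC">
      <CompanyName>Hungry Coyote Import Store</CompanyName>
      <Fax>(503) 555-2376</Fax>
    </Customer>
  </Customers>
</Root>""")

# The path can name the target tag directly, so no inner loop is needed.
names = [e.text for e in root_demo.findall("./Customers/Customer/CompanyName")]

# ".//" searches all descendants, whatever their depth.
names_anywhere = [e.text for e in root_demo.findall(".//CompanyName")]

# findtext returns the first match's text, or the default when nothing matches.
faxes = [c.findtext("Fax", default="(none)")
         for c in root_demo.findall("./Customers/Customer")]

print(names)
print(faxes)
```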


In [14]:
it=root.findall("Customers/Customer")
for x in it:
    print("Customer")
    it2=x.iter()
    for t in it2:
        st="   {:20s} {:20s}".format(t.tag,t.text)
        print(st)
        if t.tag=="FullAddress":
            for t2 in t:
                st="      {:20s} {:20s}".format(t2.tag,t2.text)
                print(st)
        

Customer
   Customer             
                   
   CompanyName          Great Lakes Food Market
   ContactName          Howard Snyder       
   ContactTitle         Marketing Manager   
   Phone                (503) 555-7555      
   FullAddress          
                   
      Address              2732 Baker Blvd.    
      City                 Eugene              
      Region               OR                  
      PostalCode           97403               
      Country              USA                 
   Address              2732 Baker Blvd.    
   City                 Eugene              
   Region               OR                  
   PostalCode           97403               
   Country              USA                 
Customer
   Customer             
                   
   CompanyName          Hungry Coyote Import Store
   ContactName          Yoshi Latimer       
   ContactTitle         Sales Representative
   Phone                (503) 555-6874      
   Fax       

**Searching for tags with specified data**

Here we search all customers for one with a given ContactName.

In [17]:
for c in root.findall("./Customers/Customer[ContactName='Jaime Yorres']"):
    print(c.tag)
    print(c.attrib)
    for c2 in c.iter():
        st="      {:20s} {:20s}".format(c2.tag,c2.text)
        print(st)
        

Customer
{'CustomerID': 'LETSS'}
      Customer             
                   
      CompanyName          Let's Stop N Shop   
      ContactName          Jaime Yorres        
      ContactTitle         Owner               
      Phone                (415) 555-5938      
      FullAddress          
                   
      Address              87 Polk St. Suite 5 
      City                 San Francisco       
      Region               CA                  
      PostalCode           94117               
      Country              USA                 


**Searching for an attribute**

Here we can search for a Customer whose CustomerID attribute is some specified value.

Note that in contrast with the above example the **@** symbol is used because CustomerID is an **attribute.**

In [16]:
for x in root.findall("./Customers/Customer[@CustomerID='LETSS']"):
    print(x.tag)
    print(x.attrib)

Customer
{'CustomerID': 'LETSS'}
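
Attribute searches combine naturally with the get method, which reads a single attribute. The sketch below wraps the lookup in a helper; the name `company_for` and the stand-in document are assumptions made for illustration, and the helper assumes the ID contains no quote characters because it is pasted into the path.

```python
import xml.etree.ElementTree as ET

root_demo = ET.fromstring("""<Root>
  <Customers>
    <Customer CustomerID="GREAL">
      <CompanyName>Great Lakes Food Market</CompanyName>
    </Customer>
    <Customer CustomerID="LETSS">
      <CompanyName>Let's Stop N Shop</CompanyName>
    </Customer>
  </Customers>
</Root>""")

def company_for(root, customer_id):
    """Return the CompanyName for a CustomerID, or None if there is no match."""
    # The ID is pasted into the path, so this sketch assumes it has no quotes.
    match = root.find(f"./Customers/Customer[@CustomerID='{customer_id}']")
    return None if match is None else match.findtext("CompanyName")

# get() reads one attribute and returns None when it is absent.
ids = [c.get("CustomerID") for c in root_demo.iter("Customer")]

print(ids)
print(company_for(root_demo, "LETSS"))
print(company_for(root_demo, "NOSUCH"))
```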


**Parent element**

When we search, we can get the parent of a matched element by appending /.. to the path.

In [17]:
for x in root.findall("./Customers/Customer[@CustomerID='LETSS']/.."):
    print(x.tag)
    print(x.attrib)

Customers
{}


**Child elements**

We can also get the child elements of the elements found by ending the path with a trailing /.

In [18]:
for x in root.findall("./Customers/Customer[@CustomerID='LETSS']/"):
    st="{:20s} {:20s}".format(x.tag,x.text)
    print(st)

CompanyName          Let's Stop N Shop   
ContactName          Jaime Yorres        
ContactTitle         Owner               
Phone                (415) 555-5938      
FullAddress          
                   


**XML Parser**

We can also parse an XML file by specifying a parser.

This is done in a similar manner to the HTML parser (see the *Parsing HTML files* notebook), except that the class we build is not a derived class.

Here is a simple example: when we see a Customer tag we print its attribute dictionary.

In [18]:
from xml.etree.ElementTree import XMLParser
class myparser_target: 
    def start(self, tag, attrib):
        if tag=="Customer":
            print(attrib)
    def end(self, tag):             
        pass
    def data(self, data):
        pass            
    def close(self):    
        pass

target = myparser_target()
parser = XMLParser(target=target)
with open("sample.xml","rb") as fin:
    b=fin.read()
    text=b.decode("utf-8")
parser.feed(text)
parser.close()  # signal end of input; returns whatever target.close() returns

{'CustomerID': 'GREAL'}
{'CustomerID': 'HUNGC'}
{'CustomerID': 'LAZYK'}
{'CustomerID': 'LETSS'}
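
A target can also accumulate results instead of printing them. The following sketch (the class name `CompanyNameCollector` and the fed fragments are illustrative assumptions) collects the text of every CompanyName element and hands the list back through close(). It feeds the document in several pieces to show that the parser works incrementally.

```python
from xml.etree.ElementTree import XMLParser

class CompanyNameCollector:
    """Parser target that collects the text of every CompanyName element."""
    def __init__(self):
        self.names = []
        self._inside = False
        self._chunks = []

    def start(self, tag, attrib):
        if tag == "CompanyName":
            self._inside = True
            self._chunks = []

    def data(self, data):
        # data() may be called several times for one element's text.
        if self._inside:
            self._chunks.append(data)

    def end(self, tag):
        if tag == "CompanyName":
            self.names.append("".join(self._chunks))
            self._inside = False

    def close(self):
        # Whatever close() returns here is returned by parser.close().
        return self.names

parser = XMLParser(target=CompanyNameCollector())
parser.feed("<Root><Customers>")
parser.feed("<Customer CustomerID='GREAL'>"
            "<CompanyName>Great Lakes Food Market</CompanyName></Customer>")
parser.feed("</Customers></Root>")
collected = parser.close()
print(collected)
```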


**Namespaces**

To avoid naming conflicts, in which tags with the same name are needed for different purposes, XML documents can use *namespaces*. Tags are then prefixed to indicate which namespace they belong to. We won't go into too much detail here, but if you see an attribute beginning with **xmlns**, be aware that a namespace prefix will be attached to tags in the XML file.

We'll focus on an XML file, similar to the one above, but with namespaces used.

The URI (http://www.?.com) is not actually accessed, but it would typically be a web site providing documentation for the namespace.

In [20]:
with open("sample_with_namespace.xml") as fin:
    text=fin.read()
print(text)

<?xml version="1.0" encoding="utf-8"?>
<Root>
  <Customers xmlns:C="http://www.C.com/">
    <C:Customer CustomerID="GREAL">
      <C:CompanyName>Great Lakes Food Market</C:CompanyName>
      <C:ContactTitle>Marketing Manager</C:ContactTitle>
      <C:Phone>(503) 555-7555</C:Phone>
      <C:FullAddress>
        <C:Address>2732 Baker Blvd.</C:Address>
        <C:City>Eugene</C:City>
        <C:Region>OR</C:Region>
        <C:PostalCode>97403</C:PostalCode>
        <C:Country>USA</C:Country>
      </C:FullAddress>
    </C:Customer>
    <C:Customer CustomerID="HUNGC">
      <C:CompanyName>Hungry Coyote Import Store</C:CompanyName>
      <C:ContactName>Yoshi Latimer</C:ContactName>
      <C:ContactTitle>Sales Representative</C:ContactTitle>
      <C:Phone>(503) 555-6874</C:Phone>
      <C:Fax>(503) 555-2376</C:Fax>
      <C:FullAddress>
        <C:Address>City Center Plaza 516 Main St.</C:Address>
        <C:City>Elgin</C:City>
        <C:Region>OR</C:Region>
        <C:PostalCode>97827</C:

Here is what happens when we iterate through the tags: each prefix is replaced by its namespace URI in braces.

In [21]:
import xml.etree.ElementTree as ET
tree = ET.parse('sample_with_namespace.xml')
root2 = tree.getroot()
for e in root2.iter():
    print(e.tag)

Root
Customers
{http://www.C.com/}Customer
{http://www.C.com/}CompanyName
{http://www.C.com/}ContactTitle
{http://www.C.com/}Phone
{http://www.C.com/}FullAddress
{http://www.C.com/}Address
{http://www.C.com/}City
{http://www.C.com/}Region
{http://www.C.com/}PostalCode
{http://www.C.com/}Country
{http://www.C.com/}Customer
{http://www.C.com/}CompanyName
{http://www.C.com/}ContactName
{http://www.C.com/}ContactTitle
{http://www.C.com/}Phone
{http://www.C.com/}Fax
{http://www.C.com/}FullAddress
{http://www.C.com/}Address
{http://www.C.com/}City
{http://www.C.com/}Region
{http://www.C.com/}PostalCode
{http://www.C.com/}Country
Orders
{http://www.O.com/}Order
{http://www.O.com/}CustomerID
{http://www.O.com/}EmployeeID
{http://www.O.com/}OrderDate
{http://www.O.com/}RequiredDate
{http://www.O.com/}ShipInfo
{http://www.O.com/}ShipVia
{http://www.O.com/}Freight
{http://www.O.com/}ShipName
{http://www.O.com/}ShipAddress
{http://www.O.com/}ShipCity
{http://www.O.com/}ShipRegion
{http://www.O.com
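
Writing the full {URI}tag form in every search gets tedious, so findall and related methods accept a dictionary mapping prefixes to namespace URIs. The sketch below uses a shortened hand-written version of the namespaced file (an assumption, so that it runs on its own). The prefix `cust` is our own choice and need not match the prefix in the document; only the URI has to match.

```python
import xml.etree.ElementTree as ET

doc = """<Root>
  <Customers xmlns:C="http://www.C.com/">
    <C:Customer CustomerID="GREAL">
      <C:CompanyName>Great Lakes Food Market</C:CompanyName>
    </C:Customer>
  </Customers>
</Root>"""
root_ns = ET.fromstring(doc)

# Map a prefix of our choosing to the namespace URI.
ns = {"cust": "http://www.C.com/"}

# Customers itself has no prefix in the document, so it is written plainly.
names = [e.text for e in
         root_ns.findall("./Customers/cust:Customer/cust:CompanyName", ns)]

# Equivalent search using the expanded {URI}tag form directly.
same = [e.text for e in root_ns.iter("{http://www.C.com/}CompanyName")]

print(names)
```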

**More**

There is a lot more that could be said that is not covered here. In particular, you might want to:

- modify an existing XML document
- create your own XML document