## COMP20008 2021 Semester 1 Workshop 3
### Why XML and when do we see it?
- Extensible Markup Language (XML) is a widely used markup language that defines rules for encoding documents and data structures (closer to HTML than to Python).
- Commonly used for documents, and also in SOAP requests (an XML-based messaging protocol) when working with some web APIs, so yes, you will eventually come across it in industry.
- Just note that SOAP has largely been superseded by REST APIs (Application Programming Interfaces), but SOAP services are still abundant!

### XML and Python
- To parse XML data structures in Python, we will use the `lxml` library (different from the built-in `xml` library).
- TL;DR: `lxml` is a more powerful and feature-rich alternative to `xml`.
- Combining `lxml` with `requests` (a library for sending HTTP requests) gives a powerful way of dealing with online APIs.
- The key module in `lxml` is `etree`, which parses XML data into a tree-like structure.
- Documentation: https://lxml.de/api/index.html

In [7]:
# import the whole etree module from lxml
from lxml import etree

### Example
Here's what the `royal.xml` file looks like:
```
<?xml version="1.0" encoding="utf-8"?>
  <queen title="Queen Elizabeth II" marriedTo="Philip, Duke of Edinburgh">
    <prince title="Charles, Prince of Wales" marriedTo="Lady Diana Spencer">
      <prince title="Prince William of Wales"/>
      <prince title="Prince Henry of Wales"/>
    </prince>
    <princess title="Anne, Princess Royal"/>
    <prince title="Andrew, Duke of York"/>
    <prince title="Edward, Earl of Wessex"/>
  </queen>
```
...and visually as a tree-like structure

<img src="download.png" align="left" style="width: 30vw; min-width: 200px;"/>

There are two main scenarios for reading in XML files:  
1. Reading in a local file.
2. Sending a request online to read an XML file.

#### Method 1
- Use `etree.parse` to parse an XML file into an Element Tree

In [8]:
xmltree = etree.parse("royal.xml")
xmltree

<lxml.etree._ElementTree at 0x2bcc4e88080>

#### Method 2
Use `requests` to grab an online XML file and parse it.
1. First, we use a GET request (get an object from the URL) to get the data.
2. Next, we get the content of the response as bytes (`response.content`).
3. Then, we parse those bytes into an XML element with `etree.fromstring()`.
4. Finally, we wrap that element in an Element Tree.

In [11]:
import requests 

# this URL is the github uploaded version of the xml
url = 'https://raw.githubusercontent.com/akiratwang/COMP20008/main/Tutorials/Week-3/royal.xml'

# GET request = "get an object from the URL"
response = requests.get(url)

# response.content = requested object's content as bytes
print(response.content)

# parse the response bytes into an XML element
xml_response = etree.fromstring(response.content)

# wrap the root element in an Element Tree
xmltree_requests = etree.ElementTree(xml_response)

xmltree_requests

b'<?xml version="1.0" encoding="utf-8"?>\n  <queen title="Queen Elizabeth II" marriedTo="Philip, Duke of Edinburgh">\n      <prince title="Charles, Prince of Wales" marriedTo="Lady Diana Spencer">\n\t\t<prince title="Prince William of Wales" />\n\t\t<prince title="Prince Henry of Wales" />\n      </prince>\n      <princess title="Anne, Princess Royal" />\n      <prince title="Andrew, Duke of York" />\n      <prince title="Edward, Earl of Wessex" />\n</queen>\n'


<lxml.etree._ElementTree at 0x2bcc50c64c0>

- So, right now we have an **XML Element Tree** (ET), which represents the whole XML file as a tree-like structure.
- Each node in this ET is represented as an **Element** object.
- You can use `getroot()` to get the root element of the ET, as well as the `tag` attribute to get the tag of an element.

In [12]:
# as you can see, both of the methods above give the same result
# <Element queen at some_referenced_memory>
print(xmltree.getroot())
print(xmltree_requests.getroot())

<Element queen at 0x2bcc4e8ad40>
<Element queen at 0x2bcc50c0740>


In [15]:
# this is the tag name for the root node
root = xmltree_requests.getroot()
print(root.tag) 

queen


### Traversing an XML Tree
- Iterate over the child elements of a node (like a linked-list traversal, for those who know what this is).
- Use `.get()` to get the value of an attribute such as `title` (much like the `dict.get()` method).
- Use the `attrib` attribute to get a dictionary of all the attributes.
- You can also index an element to access specific child nodes.

In [18]:
root.attrib

{'title': 'Queen Elizabeth II', 'marriedTo': 'Philip, Duke of Edinburgh'}

In [19]:
for element in root:
    print(element.tag)
    print(element.attrib)
    print(element.get("title"))
    print(element.get("marriedTo"))
    print()

prince
{'title': 'Charles, Prince of Wales', 'marriedTo': 'Lady Diana Spencer'}
Charles, Prince of Wales
Lady Diana Spencer

princess
{'title': 'Anne, Princess Royal'}
Anne, Princess Royal
None

prince
{'title': 'Andrew, Duke of York'}
Andrew, Duke of York
None

prince
{'title': 'Edward, Earl of Wessex'}
Edward, Earl of Wessex
None



In [20]:
# using indexing instead
oldest_prince = root[0]

# get the title of the 0th child node
oldest_prince.tag, oldest_prince.get("title")

('prince', 'Charles, Prince of Wales')

- If we want to find the **first matching child**, we use `.find()`, which returns the element itself (or `None` if nothing matches).
- Note that this only returns the first match, and not all matches!

In [22]:
element = root.find("princess")
element.tag, element.get("title")

('princess', 'Anne, Princess Royal')
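
Since `.find()` stops at the first match, collecting every matching direct child needs `findall()` instead. The following is a minimal sketch, assuming the royal family XML is inlined as a string; the names `ROYAL_XML` and `demo_root` are illustrative, and the fallback to the standard-library `xml.etree.ElementTree` is only there so the snippet still runs where `lxml` is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # assumption: fall back to the standard library so the sketch still runs
    import xml.etree.ElementTree as etree

# inlined copy of the royal family tree (hypothetical name, same data as royal.xml)
ROYAL_XML = """<queen title="Queen Elizabeth II">
  <prince title="Charles, Prince of Wales">
    <prince title="Prince William of Wales"/>
    <prince title="Prince Henry of Wales"/>
  </prince>
  <princess title="Anne, Princess Royal"/>
  <prince title="Andrew, Duke of York"/>
  <prince title="Edward, Earl of Wessex"/>
</queen>"""

demo_root = etree.fromstring(ROYAL_XML)

# find() returns only the first direct child with this tag
first_title = demo_root.find("prince").get("title")

# findall() returns a list of every direct child with this tag
all_titles = [p.get("title") for p in demo_root.findall("prince")]

print(first_title)  # Charles, Prince of Wales
print(all_titles)   # the three sons only; grandchildren are not direct children
```

Note that both calls look at direct children only, so the two grandsons are not included.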

- `iterchildren()` is a method which allows you to iterate over all the direct children, optionally filtered by tag.
- (Advanced) You can use it inside a list comprehension to get a list of all of them.
- (Even more advanced) You can use `iterdescendants()` to iterate over all the nodes below an element, not just its direct children!

In [24]:
# iterate and print the titles of all the princesses
for child in root.iterchildren(tag="princess"):
    print(child.get('title'))

Anne, Princess Royal


In [None]:
# building a list with a list comprehension
[child.get("title") for child in root.iterchildren(tag="prince")]

In [25]:
# notice how we get 2 more here (the queen's grandsons)
[child.get("title") for child in root.iterdescendants(tag="prince")]

['Charles, Prince of Wales',
 'Prince William of Wales',
 'Prince Henry of Wales',
 'Andrew, Duke of York',
 'Edward, Earl of Wessex']

### Exercise 1

Using the `royal.xml`:

1. Write Python code to get the `title` attribute of the queen's grandsons.
1. Write Python code to get the full title of the only princess in the family tree.

In [41]:
xmltree = etree.parse("royal.xml")
root = xmltree.getroot()

# answer below
# Part 1: titles of the queen's grandchildren (all of whom are princes)
[grandchild.get("title") for child in root.iterchildren() for grandchild in child.iterchildren()]

# Part 2: title of the only princess
[girl.get("title") for girl in root.iterchildren(tag="princess")]


['Anne, Princess Royal']

### Accessing XML text


Let's now use another sample of XML data. Consider the file `book.xml`:

```
<?xml version="1.0" encoding="utf-8"?>
  <book id="book001">
    <author>Salinger, J. D.</author>
    <title>The Catcher in the Rye</title>
    <language>English</language>
    <publish_date>1951-07-16</publish_date>
    <publisher>Little, Brown and Company</publisher>
    <isbn>0-316-76953-3</isbn>
    <description>A story about a few important days in the life of Holden Caulfield</description>
  </book>
```

Notice the difference from `royal.xml`:
- There is now text between the tags (like HTML)

To access the text, we need to use the `.text` attribute.

In [42]:
xmltree = etree.parse('book.xml')
root = xmltree.getroot() 

# method 1 - iteration
for child in root:
    print(f"{child.tag}: {child.text}")

author: Salinger, J. D.
title: The Catcher in the Rye
language: English
publish_date: 1951-07-16
publisher: Little, Brown and Company
isbn: 0-316-76953-3
description: A story about a few important days in the life of Holden Caulfield


In [43]:
# method 2 - list comprehension
# notice how this kind of looks like the key, value tuples in dict.items()
[(child.tag, child.text) for child in root]

[('author', 'Salinger, J. D.'),
 ('title', 'The Catcher in the Rye'),
 ('language', 'English'),
 ('publish_date', '1951-07-16'),
 ('publisher', 'Little, Brown and Company'),
 ('isbn', '0-316-76953-3'),
 ('description',
  'A story about a few important days in the life of Holden Caulfield')]
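
Because the `(tag, text)` pairs resemble `dict.items()`, the children can also be collected straight into a dictionary. This is a small sketch under the assumption that every tag appears only once (a repeated tag would overwrite the earlier value); `BOOK_XML` and `book_dict` are illustrative names, and the XML is a shortened inline copy of `book.xml`.

```python
try:
    from lxml import etree
except ImportError:
    # assumption: fall back to the standard library so the sketch still runs
    import xml.etree.ElementTree as etree

# shortened inline copy of book.xml (hypothetical name)
BOOK_XML = """<book id="book001">
  <author>Salinger, J. D.</author>
  <title>The Catcher in the Rye</title>
  <isbn>0-316-76953-3</isbn>
</book>"""

book = etree.fromstring(BOOK_XML)

# dictionary comprehension: tag -> text for every child element
book_dict = {child.tag: child.text for child in book}

# attributes live on the element itself, so add the id separately
book_dict["id"] = book.get("id")

print(book_dict["title"])  # The Catcher in the Rye
```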

### Adding data into XML trees
- Create a new element with `etree.Element()`
- You can then set its properties, such as `text`
- An element works like a list of its children, so adding new elements uses `.append()`

In [62]:
# define a new empty Element
new_element = etree.Element('genre')

# add text to it
new_element.text = 'Novel'

root.append(new_element)

# now you can see the new ('genre', 'Novel') tuple

In [None]:
# a one-line method of "nicely" printing out that specific element
etree.tostring(root[-1], # get the last element
               pretty_print=True, # enable pretty printing
               encoding='unicode' # specify encoding as unicode
)

- Additionally, you can create a new XML tree by defining a root element.
- Then add new elements using the `etree.SubElement()` function.

In [63]:
#root = etree.Element('book')

#new_element = etree.SubElement(root, 'price')
#new_element.text = '23.95'

new_element = etree.Element('price')
new_element.text = '23.95'
root.append(new_element)

[(child.tag, child.text) for child in root]

[('price', '23.95'), ('genre', 'Novel'), ('price', '23.95')]
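
The `SubElement()` route mentioned above can be sketched as follows. This is a minimal example rather than the workshop's reference answer: it builds a brand-new tree from scratch, and the name `new_root` is made up for illustration.

```python
try:
    from lxml import etree
except ImportError:
    # assumption: fall back to the standard library so the sketch still runs
    import xml.etree.ElementTree as etree

# define the root of a new tree, with an attribute
new_root = etree.Element("book", id="book001")

# SubElement() creates the child AND attaches it to the parent in one step
title = etree.SubElement(new_root, "title")
title.text = "The Catcher in the Rye"

price = etree.SubElement(new_root, "price")
price.text = "23.95"

pairs = [(child.tag, child.text) for child in new_root]
xml_text = etree.tostring(new_root, encoding="unicode")

print(pairs)
print(xml_text)
```

Compared with `etree.Element()` followed by `.append()`, `SubElement()` saves a line per child and makes it harder to forget the attach step.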

(Advanced) To insert a new element at a specific location, use `.insert()` (akin to `list.insert()`)

In [None]:
new_element = etree.Element("country")
root.insert(1, new_element) # insert new_element as the second child (children are 0-indexed)
root[1].text = "United States" # add some text to it

[(child.tag, child.text) for child in root]

### Serialising XML data
- In other words, how to output XML data
- We use the `etree.tostring()` function
- **Note:** By default, `etree.tostring()` outputs bytes instead of a Python string

In [64]:
out = etree.tostring(root, encoding="UTF-8")

# notice the b'' prefix: the output is bytes, not str
out

b'<book><price>23.95</price><genre>Novel</genre><price>23.95</price></book>'

In [65]:
# write as bytes 
with open('output.xml', 'wb') as f:
    f.write(out)
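
To make the bytes-versus-string distinction concrete, here is a short sketch that serialises the same element both ways. The tiny inline document and the variable names (`root_demo`, `as_bytes`, `as_text`) are assumptions made for illustration.

```python
try:
    from lxml import etree
except ImportError:
    # assumption: fall back to the standard library so the sketch still runs
    import xml.etree.ElementTree as etree

root_demo = etree.fromstring("<book><isbn>0-316-76953-3</isbn></book>")

# a byte encoding such as UTF-8 gives bytes (write these with mode 'wb')
as_bytes = etree.tostring(root_demo, encoding="UTF-8")

# encoding="unicode" gives a normal Python str (write this with mode 'w')
as_text = etree.tostring(root_demo, encoding="unicode")

print(type(as_bytes).__name__, type(as_text).__name__)  # bytes str

# bytes can be parsed straight back into an element
reparsed = etree.fromstring(as_bytes)
print(reparsed.find("isbn").text)  # 0-316-76953-3
```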

### Exercise 2

- Write Python code to load in the file `"book.xml"`, change the ISBN to `"Unknown"` and then write out the file to `"book-new.xml"`
- Do not hardcode the index of the element you change; look it up by its tag instead!

In [95]:
xmltree = etree.parse("book.xml")
root = xmltree.getroot()

print([(child.tag, child.text) for child in root])

# answer below

for child in root:
    if child.tag == 'isbn':
        child.text = 'Unknown'

out2 = etree.tostring(root, encoding="UTF-8")
with open('book-new.xml', 'wb') as f:
    f.write(out2) 


[('author', 'Salinger, J. D.'), ('title', 'The Catcher in the Rye'), ('language', 'English'), ('publish_date', '1951-07-16'), ('publisher', 'Little, Brown and Company'), ('isbn', '0-316-76953-3'), ('description', 'A story about a few important days in the life of Holden Caulfield')]
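
As an alternative to looping, the `.find()` method from earlier can locate the element by tag directly. This is a hedged sketch on a shortened inline document rather than the real `book.xml`; the explicit `is not None` check is there because `.find()` returns `None` when the tag is missing.

```python
try:
    from lxml import etree
except ImportError:
    # assumption: fall back to the standard library so the sketch still runs
    import xml.etree.ElementTree as etree

# shortened inline stand-in for book.xml
book = etree.fromstring(
    "<book><title>The Catcher in the Rye</title><isbn>0-316-76953-3</isbn></book>"
)

# look the element up by tag instead of by position
isbn = book.find("isbn")
if isbn is not None:  # find() returns None if there is no such child
    isbn.text = "Unknown"

new_isbn = book.find("isbn").text
print(new_isbn)  # Unknown
```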


### JSON
- JSON (JavaScript Object Notation) is another common data format, which has replaced XML in many applications.
- Works very similarly to a Python dictionary.
- To parse and read JSON files, we can use the `json` library.
- Documentation: https://docs.python.org/3/library/json.html
- Tutorial: https://www.w3schools.com/python/python_json.asp

### Example (ELI5): Creating JSON files
1. Make a Python dictionary with your required structure.
2. Convert the *whole* dictionary into a string.
3. Done.

We need to do this because JSON requires **double quotes** around keys and string values.  
For example:
`{'key': 'value'}` (incorrect) vs `{"key": "value"}` (correct)
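
The quoting difference is easy to see by comparing Python's own string form of a dictionary with what the `json` library produces. A minimal sketch, with `record` as a made-up example dictionary:

```python
import json

record = {"key": "value"}

# Python's repr uses single quotes, which is NOT valid JSON
python_repr = str(record)
print(python_repr)  # {'key': 'value'}

# json.dumps() produces valid JSON with double quotes
json_text = json.dumps(record)
print(json_text)  # {"key": "value"}

# trying to parse the single-quoted version fails
try:
    json.loads(python_repr)
    repr_is_valid_json = True
except json.JSONDecodeError:
    repr_is_valid_json = False

print(repr_is_valid_json)  # False
```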

In [96]:
str_json = '''
{"id": "book001",
 "author": "Salinger, J. D.",
 "title": "The Catcher in the Rye",
 "price": "44.95",
 "language": "English",
 "publish_date": "1951-07-16",
 "publisher": "Little, Brown and Company",
 "isbn": "0-316-76953-3",
 "description": "A story about a few important days in the life of Holden Caulfield"
}
'''

Now, we can parse this as a "proper" JSON format...

#### IMPORTANT
- `json.load()` loads JSON from a file object.
- `json.loads()` loads JSON from a string (i.e. `json.loadSTRING()`; the naming is easy to mix up)

Since we have a string, we should use `.loads()`
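
The two functions can be compared side by side without touching the disk. In this sketch, `io.StringIO` stands in for a real open file (an assumption made so the example is self-contained), and `json_text` is a shortened version of the book record.

```python
import io
import json

json_text = '{"title": "The Catcher in the Rye", "price": "44.95"}'

# loads(): parse directly from a string
from_string = json.loads(json_text)

# load(): parse from a file-like object (StringIO imitates an open file here)
fake_file = io.StringIO(json_text)
from_file = json.load(fake_file)

print(from_string == from_file)  # True
print(type(from_file).__name__)  # dict
```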

In [97]:
import json

data = json.loads(str_json)
type(data)

dict

- As you can see, when we parse (load) the JSON data in, it's treated as a dictionary.
- Normal dictionary operations apply

In [98]:
data

{'id': 'book001',
 'author': 'Salinger, J. D.',
 'title': 'The Catcher in the Rye',
 'price': '44.95',
 'language': 'English',
 'publish_date': '1951-07-16',
 'publisher': 'Little, Brown and Company',
 'isbn': '0-316-76953-3',
 'description': 'A story about a few important days in the life of Holden Caulfield'}

In [99]:
data['price']

'44.95'

In [100]:
data['isbn'] = "Unknown"
data

{'id': 'book001',
 'author': 'Salinger, J. D.',
 'title': 'The Catcher in the Rye',
 'price': '44.95',
 'language': 'English',
 'publish_date': '1951-07-16',
 'publisher': 'Little, Brown and Company',
 'isbn': 'Unknown',
 'description': 'A story about a few important days in the life of Holden Caulfield'}

We can also output this as a "proper" JSON format using `.dump()`

#### IMPORTANT (like `.loads()`)
- `json.dump()` writes JSON to a file object.
- `json.dumps()` writes JSON to a string (i.e. `json.dumpSTRING()`)

Since we want to output a JSON file, we should use `.dump()`

In [101]:
with open('book.json', 'w') as f:
    json.dump(data, f, indent=2) # indent=2 is for "nicely" formatting the output

In [None]:
# if we want to output a string representation
json.dumps(data)

# this is more useful if we are sending a POST request (sending JSON data online)
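
As a quick check that `dump()` and `dumps()` differ only in where the output goes, the sketch below writes the same dictionary both ways. `io.StringIO` again stands in for an open file, and `data_demo` is a made-up, shortened record.

```python
import io
import json

data_demo = {"id": "book001", "isbn": "Unknown"}

# dumps(): returns the JSON text as a string
as_string = json.dumps(data_demo, indent=2)

# dump(): writes the JSON text into a file-like object
buffer = io.StringIO()
json.dump(data_demo, buffer, indent=2)

# both routes produce exactly the same text
print(buffer.getvalue() == as_string)  # True

# and parsing it gives back the original dictionary
print(json.loads(as_string) == data_demo)  # True
```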

In [102]:
# Since we have a JSON file saved now... 
# we use .load()
with open('book.json') as f:
    data = json.load(f)
data

{'id': 'book001',
 'author': 'Salinger, J. D.',
 'title': 'The Catcher in the Rye',
 'price': '44.95',
 'language': 'English',
 'publish_date': '1951-07-16',
 'publisher': 'Little, Brown and Company',
 'isbn': 'Unknown',
 'description': 'A story about a few important days in the life of Holden Caulfield'}

### Exercise 3
- Add Spanish and German to the JSON file above as two extra languages represented as an array. 
- Save this file as `book2.json`. 
- Validate it on [JSONLint](https://jsonlint.com/?code=).

In [106]:
import json 

data["language"] = ["English", "Spanish", "German"]
data
# answer below

with open('book2.json', 'w') as f:
    json.dump(data, f, indent=4)


### Exercise 4 (In your own time)
- Modify the publish date parameter. 
- Make this an array of two objects that have properties of edition (`"first"`, `"second"`) and date (`"1951-07-16"`,`"1979-01-01"`) respectively. 
- Save this file as `book3.json`.

In [110]:
# answer below
first = {"edition":"first", "date":"1951-07-16"}
second = {"edition":"second", "date":"1979-01-01"}
data["publish_date"] = [first, second]

with open('book3.json', 'w') as f:
    json.dump(data, f, indent=4)
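
Once arrays and nested objects are in the JSON, reading values back is a matter of chaining list indexes and dictionary keys. A small sketch, assuming a shortened inline version of the `book3.json` structure (`book_json` and `dates_by_edition` are illustrative names):

```python
import json

# shortened inline stand-in for the contents of book3.json
book_json = """
{"title": "The Catcher in the Rye",
 "language": ["English", "Spanish", "German"],
 "publish_date": [
   {"edition": "first", "date": "1951-07-16"},
   {"edition": "second", "date": "1979-01-01"}
 ]
}
"""

book = json.loads(book_json)

# JSON arrays become Python lists, so index them as usual
second_language = book["language"][1]

# JSON objects inside arrays become dictionaries inside lists
dates_by_edition = {entry["edition"]: entry["date"] for entry in book["publish_date"]}

print(second_language)             # Spanish
print(dates_by_edition["second"])  # 1979-01-01
```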

### Additional Task: Git Resources 
Local Machines:
- Use either `git` (command line) or GitHub Desktop (nice UI)

Server:
- Only `git` (command line)

(From the Lab):
- Please go through the git PDF manual uploaded on Canvas. 
- The manual will help you get familiar with the commands used when working with a git repository.
- You can also access a git tutorial video using this link : https://canvas.lms.unimelb.edu.au/courses/107611/files/6845808?module_item_id=2714691 

How to clone my repo onto JupyterLab Server and get changes:
- `git clone https://github.com/akiratwang/COMP20008` (clone the repo, don't include the `.git` at the end)
- `cd COMP20008` (change directory inside the repo)
- `git pull` (fetch new changes and merge them into your current branch) OR `git fetch` (download new changes without touching your working files)

Other commands:
- `git add .` (stage all changes in `.`, where `.` is the current directory)
- `git commit -m "MESSAGE"` (commit the staged changes with a message as a string in double quotes)
- `git push` (push changes online)
- `git pull` (pull online changes; make sure you are inside the repository)

In [None]:
# cell to clear the notebook of output files
import os

for f in ['book.json', 'book2.json', 'book3.json', 'book-new.xml', 'output.xml']:
    try:
        os.remove(f)
    except FileNotFoundError:
        print("Already gone.")