# (Lynda) Python: XML, JSON and the Web
[course link](https://www.lynda.com/Python-tutorials/Python-XML-JSON-Web/699338-2.html)

## (1) Overview

**Working with Internet data**
* XML
    * extensible markup language and related technologies
    * similar to HTML with some changes to better suit general data
    * rich and expressive but more complex than JSON
* JSON
    * very concise format for serializing object data
    * derived from JavaScript but supported by most modern languages
    * compact and easy to read, write and process
* Both data formats are platform and language independent

**XML Overview**
* Mature data format widely used in many applications
* W3C published standard since 1998
* Similar in structure to HTML
* Rules for XML formatting are more strict than HTML
* Usually used for complex, document-like data
    * examples: Android app resources, RSS and ATOM blog feeds
* XML documents must always have a single root tag
* XML documents can have an optional document declaration
* Empty tags must have a closing slash: `<tag />`
* Attributes must have values that are enclosed in quotes
* Tags must be properly nested within each other
* Tags and attributes starting with "xml" are reserved
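
The formatting rules above can be checked with any XML parser. The following is a minimal sketch using the standard library's `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# illustrative samples: the first follows the rules, the others each break one
samples = {
    "single root, quoted attribute, closed empty tag":
        '<note lang="en"><to>Ann</to><sep /></note>',
    "two root tags": "<a></a><b></b>",
    "unquoted attribute value": "<note lang=en></note>",
    "badly nested tags": "<b><i>text</b></i>",
    "empty tag without closing slash": "<note><sep></note>",
}

for label, text in samples.items():
    print("{0}: {1}".format(label, is_well_formed(text)))
```

Only the first sample should be reported as well formed; the parser rejects each of the others.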

**JSON Overview**
* JavaScript Object Notation
* Lightweight, text-based format for data interchange
* Easy for humans to read and machines to use
* JSON native datatypes:
    * **number**: signed decimal number, no Integer / Float distinction
    * **string**: unicode or escaped characters inside double quotes 
    * **boolean**: true or false value
    * **null**: null value
    * **array**: list of ordered values
    * **object**: collection of key-value pairs, keys are strings
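
As a quick illustration of these datatypes, the sketch below parses a small made-up JSON document with Python's `json` module and prints the Python type each value becomes. JSON itself has a single number type; Python's parser returns `int` for numbers without a fraction or exponent and `float` otherwise.

```python
import json

# made-up document with one value of each JSON datatype
document = r'''{
    "number_int": 42,
    "number_float": 8.99,
    "string": "caf\u00e9",
    "boolean": true,
    "null": null,
    "array": [1, "two", 3.0],
    "object": {"key": "value"}
}'''

parsed = json.loads(document)

# show the Python value and type that each JSON value was converted to
for key, value in parsed.items():
    print("{0}: {1!r} ({2})".format(key, value, type(value).__name__))
```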

**Internet Data Python Modules**
* **urllib**: a package of several modules for working with URLs
* **http**: contains modules for working with cookies, servers, and other low-level HTTP details
* **json**: converts Python data into JSON and JSON into native Python data types
* **xml**: Python's interfaces for processing XML are grouped into the `xml` package
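
As a small taste of the `http` package, here is a sketch that looks up a status code with `http.HTTPStatus` and parses a cookie string with `http.cookies.SimpleCookie`. The cookie string is invented for the example, and nothing is sent over the network.

```python
from http import HTTPStatus
from http.cookies import SimpleCookie

# HTTPStatus members carry the numeric code and the standard reason phrase
not_found = HTTPStatus(404)
print(not_found.value, not_found.phrase)

# members compare equal to plain integers
print(HTTPStatus.OK == 200)

# SimpleCookie parses a cookie header string (this value is made up)
cookie = SimpleCookie()
cookie.load("session=abc123; theme=dark")
print(cookie["session"].value)
print(cookie["theme"].value)
```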

**Using httpbin.org**


## (2) Accessing the Internet

**Introducing urllib**
* **urllib.request**: handles the opening and reading of urls
* **urllib.error**: defines the exception classes raised by the request module
* **urllib.parse**: for parsing url structures
* **urllib.robotparser**: for working with robots.txt files
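
Of these modules, `urllib.parse` can be tried without any network access. A minimal sketch, using an httpbin-style URL purely as sample input:

```python
from urllib.parse import urlparse, parse_qs, urlencode

url = "http://httpbin.org/get?name=Samir&is_author=True"

# split the URL into its components
parts = urlparse(url)
print(parts.scheme)
print(parts.netloc)
print(parts.path)

# turn the query string into a dict of lists
query = parse_qs(parts.query)
print(query)

# go the other way: build a query string from a dict
rebuilt = urlencode({"name": "Samir", "is_author": True})
print(rebuilt)
```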

### Retrieving Data

```
response = urllib.request.urlopen(
    url,
    data=None,
    [timeout, ]*,
    cafile=None,
    capath=None,
    cadefault=False,
    context=None
)
```

**HTTPResponse Object**
* **url**: the URL that data was ultimately retrieved from (may differ from the requested URL after a redirect)
* **status**: HTTP status code of result, such as 200 or 404
* **getheader() / getheaders()**: functions for accessing a single header or all headers as a group of 2-tuples
* **read()**: function to read the data from the result

```python
# urllib_start.py

# using urllib to request data

# TODO: import the urllib request module
import urllib.request

def main():
    # the url to retrieve our sample data from
    url = "http://httpbin.org/xml"

    # TODO: open the URL and retrieve some data
    result = urllib.request.urlopen(url)

    # TODO: print the result code from the request, should be 200 OK
    print("Result code: {}".format(result.status))

    # TODO: print the returned data headers
    print("Headers: ----------------------")
    print(result.getheaders())

    # TODO: print the returned data itself
    print("\nReturned data: ----------------------")
    # read() returns bytes; decode() converts them to text
    print(result.read().decode('utf-8'))

if __name__ == "__main__":
    main()
```

```
Result code: 200
Headers: ----------------------
[('Server', 'gunicorn/19.9.0'), ('Date', 'Sun, 26 Aug 2018 18:24:30 GMT'), ('Content-Type', 'application/xml'), ('Content-Length', '522'), ('Access-Control-Allow-Origin', '*'), ('Access-Control-Allow-Credentials', 'true'), ('X-Cache', 'MISS from ansprod4868nb'), ('Via', '1.1 vegur, 1.1 ansprod4868nb (squid/3.5.27)'), ('Connection', 'close')]

Returned data: ----------------------
<?xml version='1.0' encoding='us-ascii'?>

<!--  A SAMPLE set of slides  -->

<slideshow 
    title="Sample Slide Show"
    date="Date of publication"
    author="Yours Truly"
    >

    <!-- TITLE SLIDE -->
    <slide type="all">
      <title>Wake up to WonderWidgets!</title>
    </slide>

    <!-- OVERVIEW -->
    <slide type="all">
        <title>Overview</title>
        <item>Why <em>WonderWidgets</em> are great</item>
        <item/>
        <item>Who <em>buys</em> WonderWidgets</item>
    </slide>

</slideshow>
```


### Sending data with urllib

* GET: retrieve data from a web service
* POST: create or update data on a web service
* PUT: create or update a specific data resource on a web service
* PATCH: perform a partial data update or edit on a web service
* DELETE: delete data on a web service
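
With urllib, the verb is chosen through a `urllib.request.Request` object. The sketch below only builds the requests and inspects them; nothing is sent until a request is passed to `urlopen`. The httpbin paths for PUT and DELETE are assumptions based on the service's naming pattern.

```python
import urllib.request
from urllib.parse import urlencode

# form data must be url-encoded and converted to bytes before sending
payload = urlencode({"name": "Samir"}).encode("utf-8")

# no data and no method: defaults to GET
get_req = urllib.request.Request("http://httpbin.org/get")

# data supplied and no method: defaults to POST
post_req = urllib.request.Request("http://httpbin.org/post", data=payload)

# other verbs are selected explicitly with the method argument
put_req = urllib.request.Request("http://httpbin.org/put", data=payload, method="PUT")
delete_req = urllib.request.Request("http://httpbin.org/delete", method="DELETE")

for req in (get_req, post_req, put_req, delete_req):
    print(req.get_method(), req.full_url)
```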

```python
# urllibdata_start.py

# TODO: import the request and parse modules
import urllib.request
import urllib.parse

def main():
    url = "http://httpbin.org/get"

    # TODO: create some data to pass to the GET request
    args = {
        "name": "Samir",
        "is_author": True
    }

    # TODO: the data needs to be url-encoded before passing
    data = urllib.parse.urlencode(args)

    # TODO: issue the request with the data params as part of the URL
    # result = urllib.request.urlopen(url + "?" + data)

    # TODO: issue the request with a data parameter to use POST
    url = "http://httpbin.org/post"
    data = data.encode()
    result = urllib.request.urlopen(url, data=data)

    print("Result code: {0}".format(result.status))
    print("\nReturned data: ----------")
    print(result.read().decode("utf-8"))

if __name__ == "__main__":
    main()
```

```
Result code: 200

Returned data: ----------
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "is_author": "True",
    "name": "Samir"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Cache-Control": "max-age=259200",
    "Connection": "close",
    "Content-Length": "25",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.6"
  },
  "json": null,
  "origin": "10.36.24.156, 70.37.54.151",
  "url": "http://httpbin.org/post"
}
```



### Handling errors

```python
# urlliberr_start.py

# handling errors and status codes

# TODO: import the request, error, and status modules
import urllib.request
from http import HTTPStatus
from urllib.error import HTTPError, URLError

def main():
    url = "http://no-such-server.org" # will generate a URLError
    #url = "http://httpbin.org/status/404" # will generate an HTTPError
    #url = "http://httpbin.org/html" # should work with no errors

    try:
        # TODO: use exception handling to attempt the URL access
        result = urllib.request.urlopen(url)
        print("Result code: {0}".format(result.status))
        if (result.getcode() == HTTPStatus.OK):
            print(result.read().decode('utf-8'))
    except HTTPError as err:
        print("Error: {0}".format(err.code))
    except URLError as err:
        print("Yeah, that server is bunk. {0}".format(err.reason))

main()
```

```
Error: 503
```


### Drawbacks of urllib
* only supports a subset of HTTP by default
* doesn't automatically decode returned data
* common features, such as cookies or authentication, require more modules
* difficult to implement advanced features, such as timeouts
* processing returned data, such as JSON, is cumbersome

## (3) Using the Requests Library
Documentation [link](http://docs.python-requests.org/en/master)

### Overview of the Requests library
* simple API - each HTTP verb is a method name
* makes working with parameters, headers and cookies easier
* automatically decodes returned content
* automatically parses JSON content when decoded
* handles redirects, timeouts, and errors
* advanced features like authentication and sessions

```python
# run if you need to install the requests library
# !pip install requests
```

### Making a simple request

```python
response = requests.get(url)
```

* **params**: key-value pairs that will be sent in the query string
* **headers**: dictionary of header values to send along with the request
* **auth**: authentication tuple to enable different forms of authentication
* **timeout**: value in seconds to wait for the server to respond

### Retrieve and send data

```python
# using the requests library to access Internet data

# import the requests library
import requests

def main():
    # TODO: use requests to issue a standard HTTP GET request
    url = "http://httpbin.org/xml"
    result = requests.get(url)
    # printResults(result)

    # TODO: send some form data to the URL via a POST request
    # note that requests handles the encoding for you, no manual encoding
    url = "http://httpbin.org/post"
    dataValues = {
        "key1":"value1",
        "key2":"value2"
    }
    result = requests.post(url, data=dataValues)
    # printResults(result)

    # TODO: pass a custom header to the server
    url = "http://httpbin.org/get"
    headerValues = {
        "User-Agent": "Samir Poonawala App / 1.0.0"
    }
    result = requests.get(url, headers=headerValues)
    printResults(result)


def printResults(resData):
    """Prints returned data from the server to the output"""
    print("Result code: {0}".format(resData.status_code))
    print("\n")

    print("Headers: ------------------------------")
    print(resData.headers)
    print("\n")

    print("Returned data: -------------")
    print(resData.text)

if __name__ == "__main__":
    main()
```

```
Result code: 200


Headers: ------------------------------
{'Server': 'gunicorn/19.9.0', 'Date': 'Sun, 26 Aug 2018 21:14:37 GMT', 'Content-Type': 'application/json', 'Content-Length': '324', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true', 'X-Cache': 'MISS from ansprod4868nb', 'Via': '1.1 vegur, 1.1 ansprod4868nb (squid/3.5.27)', 'Connection': 'keep-alive'}


Returned data: -------------
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Cache-Control": "max-age=259200",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "Samir Poonawala App / 1.0.0"
  },
  "origin": "10.36.24.156, 70.37.54.151",
  "url": "http://httpbin.org/get"
}
```



The Requests library is widely preferred by Python developers over urllib.

### Handling errors

```python
# using the requests library to access internet data

import requests
from requests import HTTPError, Timeout

def main():
    try:
        # use requests to issue a standard HTTP GET request
        #url = "http://httpbin.org/status/404"
        url = "http://httpbin.org/delay/5"
        result = requests.get(url, timeout=2)
        # raise an HTTPError for 4xx / 5xx status codes
        result.raise_for_status()
        printResults(result)
    except HTTPError as err:
        print("Error: {0}".format(err))
    except Timeout as err:
        print("Request timed out:\n {0}".format(err))

def printResults(resData):
    print("Result code: {0}".format(resData.status_code))
    print("\n")

    print("Returned data: -----------")
    print(resData.text)

if __name__ == "__main__":
    main()
```

```
Request timed out:
 HTTPConnectionPool(host='webproxy', port=3128): Read timed out. (read timeout=2)
```


### Using authentication

```python
# using the requests library to access internet data

import requests
from requests.auth import HTTPBasicAuth

def main():
    # access a URL that requires authentication
    # the format of this URL is that you provide the username / password to auth against

    url = "http://httpbin.org/basic-auth/SamirPoonawala/MyPassword"

    # TODO: create a credentials object using HTTPBasicAuth
    myCreds = HTTPBasicAuth("SamirPoonawala","MyPassword")

    # TODO: issue the request with the authentication credentials
    result = requests.get(url, auth=myCreds)
    # alternative to above line:
    # result = requests.get(url, auth=("SamirPoonawala", "MyPassword"))

    printResults(result)

def printResults(resData):
    print("Result code: {0}".format(resData.status_code))
    print("\n")

    print("Returned data: ------------")
    print(resData.text)

if __name__ == "__main__":
    main()
```

```
Result code: 200


Returned data: ------------
{
  "authenticated": true,
  "user": "SamirPoonawala"
}
```



## (4) Working with JSON

### The Python JSON Module
* Parsing functions:
    ```
    obj = load(file)
    obj = loads(string)
    ```
* serializing functions:
    ```
    dump(obj, file)
    str = dumps(obj)
    ```
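
The file-based and string-based functions mirror each other. Here is a minimal sketch that uses an in-memory `io.StringIO` buffer as a stand-in for a real file, so it runs without touching the disk; the `order` data is made up.

```python
import io
import json

order = {"sandwich": "Reuben", "toasted": True, "price": 8.99}

# dump() writes JSON text to any object with a write() method
buffer = io.StringIO()
json.dump(order, buffer)

# load() reads JSON text from any object with a read() method
buffer.seek(0)
restored = json.load(buffer)
print(restored)

# dumps() / loads() do the same thing with plain strings
text = json.dumps(order)
print(text)
print(json.loads(text) == restored)
```

With a real file, the buffer would be replaced by something like `open(path, "w")` for `dump` and `open(path)` for `load`.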

**Serializing Python Data to JSON**

|      Python Object      | JSON Representation |
|:-----------------------:|:-------------------:|
| dict                    |        object       |
| list, tuple             |        array        |
| str                     |        string       |
| int, float, Enums       |        number       |
| True                    |         true        |
| False                   |        false        |
| None                    |         null        |
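
The mapping in the table can be observed directly with `dumps`. The sketch below uses made-up data; it also shows two consequences worth knowing: a tuple comes back as a list after a round trip, and non-string dictionary keys such as integers are written as strings.

```python
import json

python_data = {
    "toppings": ("Sauerkraut", "Pickles"),  # tuple -> JSON array
    "price": 8.99,                          # float -> number
    "count": 2,                             # int -> number
    "toasted": True,                        # True -> true
    "notes": None,                          # None -> null
    1: "integer key",                       # key is written as the string "1"
}

json_text = json.dumps(python_data)
print(json_text)

# parsing the text again does not restore the tuple or the integer key
round_tripped = json.loads(json_text)
print(type(round_tripped["toppings"]).__name__)
print(list(round_tripped.keys()))
```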


**Parsing JSON into Python**

|       JSON Data       | Python Object |
|:---------------------:|:-------------:|
| object                |      dict     |
| array                 |      list     |
| string                |      str      |
| number (integer)      |      int      |
| number (real)         |     float     |
| true, false           |  True, False  |
| null                  |      None     |

### Parsing and serializing JSON

```python
# json_parse_start.py

# process JSON data returned from a server

# TODO: use the JSON module
import json

def main():
    # define a string of JSON code
    jsonStr = '''{
            "sandwich" : "Reuben",
            "toasted" : true,
            "toppings" : [
                "Thousand Island Dressing",
                "Sauerkraut",
                "Pickles"
            ],
            "price": 8.99
    }'''

    # TODO: parse the JSON data using loads
    data = json.loads(jsonStr)

    # TODO: print information from the data structure
    print("Sandwich: " + data['sandwich'])
    if (data['toasted']):
        print("And it's toasted")
    for topping in data['toppings']:
        print("Topping: " + topping)

if __name__ == "__main__":
    main()
```

```
Sandwich: Reuben
And it's toasted
Topping: Thousand Island Dressing
Topping: Sauerkraut
Topping: Pickles
```


```python
# json_serialize_start.py

# Process JSON data returned from a server

# use the JSON module
import json


def main():
    # define a Python dictionary
    pythonData = {
        "sandwich": "Reuben",
        "toasted": True,
        "toppings": ["Thousand Island Dressing",
                     "Sauerkraut",
                     "Pickles"
                     ],
        "price": 8.99
    }

    # TODO: serialize to JSON using dumps
    jsonStr = json.dumps(pythonData, indent = 4)

    # TODO: print the resulting JSON string
    print("JSON Data: --------")
    print(jsonStr)


if __name__ == "__main__":
    main()
```

```
JSON Data: --------
{
    "sandwich": "Reuben",
    "toasted": true,
    "toppings": [
        "Thousand Island Dressing",
        "Sauerkraut",
        "Pickles"
    ],
    "price": 8.99
}
```


### JSON exception handling

```python
# json_err_start.py

# Process JSON data returned from a server

# use the JSON module
import json
from json import JSONDecodeError

def main():
    # define a string of JSON code
    jsonStr = '''{
            "sandwich" : "Reuben",
            "toasted" : true,
            "toppings" : [
                "Thousand Island Dressing",
                "Sauerkraut",
                "Pickles"
            ],
            "price" : 8.99
        }'''

    # parse the JSON data using loads
    try:
        data = json.loads(jsonStr)
    except JSONDecodeError as err:
        print("Whoops, JSON decoding error")
        print(err.msg)
        print(err.lineno, err.colno)
        # data was never assigned, so stop before using it
        return

    # print information from the data structure
    print("Sandwich: " + data['sandwich'])
    if (data['toasted']):
        print("And it's toasted!")
    for topping in data['toppings']:
        print("Topping: " + topping)


if __name__ == "__main__":
    main()
```

```
Sandwich: Reuben
And it's toasted!
Topping: Thousand Island Dressing
Topping: Sauerkraut
Topping: Pickles
```


### Requests and JSON

```python
# json_req_start.py

# using the requests library to access internet data

# import the requests library
import requests
import json


def main():
    # Use requests to issue a standard HTTP GET request
    url = "http://httpbin.org/json"
    result = requests.get(url)

    # TODO: Use the built-in JSON function to return parsed data
    dataobj = result.json()
    # print(json.dumps(dataobj, indent=4))

    # TODO: Access data in the python object
    print(list(dataobj.keys()))

    print(dataobj['slideshow']['title'])
    print("There are {0} slides".format(len(dataobj['slideshow']['slides'])))


if __name__ == "__main__":
    main()
```

```
['slideshow']
Sample Slide Show
There are 2 slides
```


## (5) Simple XML Parsing

### XML parsing models
* **SAX**: Simple API for XML
    * reads entire document start to finish, sequentially
    * generates events as XML content is encountered
    * your app defines a class to handle content events
    * advantages:
        * memory efficient - doesn't need to load entire doc
        * fast - your app only gets events it cares about
        * easy to implement, simple API
    * drawbacks:
        * no random access to doc content
        * context is not passed to parser
        * cannot modify the XML file
* **DOM**: Document Object Model

### The Python SAX API

* import xml.sax
* xml.sax.parse(file, handler)
* xml.sax.parseString(string, handler)
* class xml.sax.ContentHandler

```
import xml.sax

class MyContentHandler(xml.sax.ContentHandler):
  def __init__(self):
    # member variables go here
    super().__init__()

  def startDocument(self):
    # processing is starting
    pass

  def startElement(self, tagName, attrs):
    # opening tag and attrs have been parsed
    pass

  def characters(self, text):
    # text content has been parsed
    pass
```

```python
# parse XML data using the SAX parser

import requests
import xml.sax

# TODO: define the ContentHandler subclass for our content
class MyContentHandler(xml.sax.ContentHandler):
    def __init__(self):
        # let the base class set up its own state first
        super().__init__()
        self.slideCount = 0
        self.itemCount = 0
        self.isInTitle = False

    # TODO: Handle startElement
    def startElement(self, tagName, attrs):
        if tagName == "slideshow":
            print("Slideshow title: " + attrs['title'])
        elif tagName == "slide":
            self.slideCount += 1
        elif tagName == "item":
            self.itemCount += 1
        elif tagName == "title":
            self.isInTitle = True

    # TODO: Handle endElement
    def endElement(self, tagName):
        if tagName == "title":
            self.isInTitle = False

    # TODO: Handle text data
    def characters(self, chars):
        if self.isInTitle:
            print("Title: " + chars)

    # TODO: Handle startDocument
    def startDocument(self):
        print("About to start!")

    # TODO: Handle endDocument
    def endDocument(self):
        print("Finishing up!")


def main():
    # create a new content handler for the SAX parser
    handler = MyContentHandler()

    # use the Requests lib to get XML data from the server
    # remember that Requests auto-decodes our content
    url = "http://httpbin.org/xml"
    result = requests.get(url)
    # print(result.text)

    # TODO: call the parseString method on the XML text content received
    xml.sax.parseString(result.text, handler)

    # when we're done, print out some interesting results
    print("There were {0} slide elements".format(handler.slideCount))
    print("There were {0} item elements".format(handler.itemCount))


if __name__ == "__main__":
    main()
```

```
About to start!
Slideshow title: Sample Slide Show
Title: Wake up to WonderWidgets!
Title: Overview
Finishing up!
There were 2 slide elements
There were 3 item elements
```


If you don't need random access to different parts of the XML at different times, SAX can be a very memory-efficient way of working with XML.
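
One way to see the memory argument is to feed the parser in pieces, so the whole document never has to be in memory at once. The sketch below assumes the default parser returned by `xml.sax.make_parser()` supports incremental `feed()` / `close()` calls, which the standard expat-based parser does; the `ItemCounter` class, the `count_items` helper, and the chunked sample data are made up for illustration.

```python
import xml.sax


class ItemCounter(xml.sax.ContentHandler):
    """Counts <item> elements as they stream past (illustrative handler)."""

    def __init__(self):
        super().__init__()
        self.item_count = 0

    def startElement(self, name, attrs):
        if name == "item":
            self.item_count += 1


def count_items(chunks):
    """Feed byte chunks to the parser one at a time and return the item count."""
    handler = ItemCounter()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    for chunk in chunks:
        parser.feed(chunk)   # a chunk may end in the middle of a tag
    parser.close()           # signals the end of the document
    return handler.item_count


# the document is split at arbitrary points, as if read from a socket or file
chunks = [b"<slide><item>a</it", b"em><item/>", b"<item>b</item></slide>"]
print(count_items(chunks))
```

In practice the chunks could come from reading a large file or a streamed HTTP response a block at a time.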

## (6) XML DOM Parsing

### The DOM API
* Represents the XML as a hierarchical tree structure
* You can...
    * access any part of an XML structure at random
    * modify the XML content
* `xml.dom.minidom` is a lightweight implementation
    
```
domtree = xml.dom.minidom.parseString(str)

elem.getElementById(id)
elem.getElementsByTagName(tagname)

elem.getAttribute(attrName)
elem.setAttribute(attrName, val)

newElem = document.createElement(tagName)
newElem = document.createTextNode(strOfText)
elem.appendChild(newElem)

```
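
The calls listed above can be tried offline on a small string. A minimal sketch with `xml.dom.minidom`, using a made-up document, that reads and sets attributes, appends a new element, and serializes the result with `toxml()`:

```python
import xml.dom.minidom

# made-up sample document
source = '<slideshow title="Sample"><slide><title>Overview</title></slide></slideshow>'
domtree = xml.dom.minidom.parseString(source)
root = domtree.documentElement

# read an existing attribute and add a new one
print(root.getAttribute("title"))
root.setAttribute("author", "Yours Truly")

# build <item>New item</item> and attach it to the first slide
item = domtree.createElement("item")
item.appendChild(domtree.createTextNode("New item"))
domtree.getElementsByTagName("slide")[0].appendChild(item)

# serialize the modified tree back to XML text
print(root.toxml())
```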

```python
# dom_parse_start.py

# use the XML DOM to parse a document in memory

import xml.dom.minidom
import requests

def main():
    # retrieve the XML data using the requests library
    url = "http://httpbin.org/xml"
    result = requests.get(url)

    # TODO: parse the returned content into a DOM structure
    domtree = xml.dom.minidom.parseString(result.text)
    rootnode = domtree.documentElement

    # TODO: display some information about the content
    print("The root element is {0}".format(rootnode.nodeName))
    print("Title: {0}".format(rootnode.getAttribute("title")))

    items = domtree.getElementsByTagName("item")
    print("There are {} item tags".format(items.length))

    # manipulate the XML content in memory
    # TODO: create a new item tag
    newItem = domtree.createElement("item")

    # TODO: add some text to the item
    newItem.appendChild(domtree.createTextNode("New item from code"))

    # TODO: now add the item to the first slide
    firstSlide = domtree.getElementsByTagName("slide")[0]
    firstSlide.appendChild(newItem)

    # TODO: now count the item tags again
    items = domtree.getElementsByTagName("item")
    print("Now there are {0} item tags".format(items.length))

if __name__ == "__main__":
    main()
```

```
The root element is slideshow
Title: Sample Slide Show
There are 3 item tags
Now there are 4 item tags
```


### The ElementTree API
* focuses on being simpler and more efficient than the DOM
* elements are treated like lists
* attributes are treated like dictionaries
* searching for content in XML is straightforward
```
elem.findall(queryExpression)
```
(Table from the course not copied over.)
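
The list-like and dictionary-like behavior can be sketched with the standard library's `xml.etree.ElementTree`, which is an assumption here since the course example uses lxml; `lxml.etree` exposes a largely compatible API. The sample document is a shortened, made-up version of the httpbin slideshow.

```python
import xml.etree.ElementTree as ET

source = """
<slideshow title="Sample Slide Show">
    <slide type="all"><title>Wake up to WonderWidgets!</title></slide>
    <slide type="all">
        <title>Overview</title>
        <item>Why WonderWidgets are great</item>
        <item/>
    </slide>
</slideshow>
"""
root = ET.fromstring(source)

# elements behave like lists of their child elements
print(len(root))
print(root[1][0].text)

# attributes behave like a dictionary
print(root.attrib["title"])
print(root[0].get("type"))

# findall takes a path expression
slides = root.findall("slide")      # direct children named slide
items = root.findall(".//item")     # item elements at any depth
titles = [t.text for t in root.findall("slide/title")]
print(len(slides), len(items), titles)
```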

### Using lxml

```
!pip install lxml
```



```python
# use the lxml library to parse a document in memory
import requests
from lxml import etree

def main():
    # retrieve the XML data using the requests library
    url = "http://httpbin.org/xml"
    result = requests.get(url)

    # todo: build a doc structure using the ElementTree API
    doc = etree.fromstring(result.content)
    # print(result.text)

    # todo: access the value of an attribute
    print(doc.tag)
    print(doc.attrib['title'])

    # todo: iterate over tags
    for elem in doc.findall("slide"):
        print(elem.tag)

    # todo: create a new slide
    newSlide = etree.SubElement(doc, "slide")
    newSlide.text = "This is a new slide"

    # todo: count the number of slides
    slideCount = len(doc.findall("slide"))
    itemCount = len(doc.findall(".//item"))

    print("There were {0} slide elements".format(slideCount))
    print("There were {0} item elements".format(itemCount))

if __name__ == "__main__":
    main()
```

```
slideshow
Sample Slide Show
slide
slide
There were 3 slide elements
There were 3 item elements
```
