# Python, XML, JSON and the Web

- XML: Extensible Markup Language (www.w3.org)
- JSON: JavaScript Object Notation (www.json.org)
- JSON value types: number, string, boolean, null, array, object

Libraries
- urllib: URL handling modules
  - urllib.request
  - urllib.error
  - urllib.parse
  - urllib.robotparser
- http: HTTP modules
- json: JSON encoder and decoder
- xml: XML processing modules
- lxml: third-party XML/HTML processing library
- requests: third-party HTTP library


httpbin.org: links to data samples in different formats




### 02_urllib > urllib_start

In [16]:
# using urllib to request data
import urllib.request

def main():
    # the URL to retrieve our sample data from 
    url = "http://httpbin.org/xml"

    # TODO: open the URL and retrieve some data
    result = urllib.request.urlopen(url)
    # TODO: Print the result code from the request, should be 200 OK
    print("Result Code:  {0}".format(result.status))

    # TODO: print the returned data headers
    print("Headers: ----------------------")
    print(result.getheaders())
    

    # TODO: print the returned data itself
    #print("Returned data: ----------------------")
    #print(result.read())
    print("Returned decoded data: ----------------------")
    print(result.read().decode('utf-8'))

if __name__ == "__main__":
    main()

Result Code:  200
Headers: ----------------------
[('Date', 'Sun, 23 Aug 2020 23:36:34 GMT'), ('Content-Type', 'application/xml'), ('Content-Length', '522'), ('Connection', 'close'), ('Server', 'gunicorn/19.9.0'), ('Access-Control-Allow-Origin', '*'), ('Access-Control-Allow-Credentials', 'true')]
Returned decoded data: ----------------------
<?xml version='1.0' encoding='us-ascii'?>

<!--  A SAMPLE set of slides  -->

<slideshow 
    title="Sample Slide Show"
    date="Date of publication"
    author="Yours Truly"
    >

    <!-- TITLE SLIDE -->
    <slide type="all">
      <title>Wake up to WonderWidgets!</title>
    </slide>

    <!-- OVERVIEW -->
    <slide type="all">
        <title>Overview</title>
        <item>Why <em>WonderWidgets</em> are great</item>
        <item/>
        <item>Who <em>buys</em> WonderWidgets</item>
    </slide>

</slideshow>


### Web Services
- GET: retrieve data from a web service
- POST: create or update data on a web service
- PUT: create or update a specific data resource on a web service
- PATCH: perform a partial data update or edit on a web service
- DELETE: delete data on a web service
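
As a rough illustration of how these verbs map onto urllib, the sketch below builds one request per method without sending anything. The `/anything` path and the payload are assumptions made for the example; they are not part of the original exercises.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and payload, used only to build the requests
URL = "http://httpbin.org/anything"
payload = urllib.parse.urlencode({"name": "value"}).encode("utf-8")


def build_request(method, data=None):
    # Request accepts an explicit method; without one, urllib uses
    # GET when data is None and POST otherwise
    return urllib.request.Request(URL, data=data, method=method)


requests_by_verb = {
    "GET": build_request("GET"),
    "POST": build_request("POST", payload),
    "PUT": build_request("PUT", payload),
    "PATCH": build_request("PATCH", payload),
    "DELETE": build_request("DELETE"),
}

for verb, req in requests_by_verb.items():
    print(verb, req.get_method(), req.full_url)
```

Passing any of these objects to `urllib.request.urlopen()` would actually send the request.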

### 02_urllib > urllibdata_start

In [17]:
import urllib.request
import urllib.parse

def main():
    url = "http://httpbin.org/get"

    # TODO: create some data to pass to the GET request
    args = {
        "name": "Sandra Alpizar",
        "is_author": True
    }

    # TODO: the data needs to be url-encoded before passing as arguments
    data = urllib.parse.urlencode(args)
    print("Data after encode: ----------------------")
    print(data)

    # TODO: issue the request with the data params as part of the URL
    result = urllib.request.urlopen(url + "?" + data)

    # TODO: issue the request with a data parameter to use POST
    url = "http://httpbin.org/post"
    data = data.encode()
    print("Data after second encode: ----------------------")
    print(data)
    result = urllib.request.urlopen(url , data = data)
   
    print("Result code: {0}".format(result.status))
    print("Returned data: ----------------------")
    print(result.read().decode("utf-8"))


if __name__ == "__main__":
    main()


Data after encode: ----------------------
name=Sandra+Alpizar&is_author=True
Data after second encode: ----------------------
b'name=Sandra+Alpizar&is_author=True'
Result code: 200
Returned data: ----------------------
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "is_author": "True", 
    "name": "Sandra Alpizar"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "34", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/3.8", 
    "X-Amzn-Trace-Id": "Root=1-5f42fd84-9c8fdd349c1b3c10c51f2fa0"
  }, 
  "json": null, 
  "origin": "24.130.115.126", 
  "url": "http://httpbin.org/post"
}
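
The two encoding steps in the cell above can be tried in isolation, without a network call. This is a minimal sketch: `urlencode` produces the query string, `.encode()` turns it into the bytes that `urlopen` expects for a POST body, and `parse_qs` (not used in the original cell) shows what a server would read back.

```python
from urllib.parse import parse_qs, urlencode

args = {"name": "Sandra Alpizar", "is_author": True}

# Step 1: dict -> URL-encoded str, suitable for appending after "?"
query = urlencode(args)

# Step 2: str -> bytes, suitable for urlopen(url, data=...)
body = query.encode("utf-8")

# Decoding again: every value comes back as a list of strings
decoded = parse_qs(query)

print(query)
print(body)
print(decoded)
```

Note that the boolean `True` travels as the string `"True"`, which matches the `"is_author": "True"` entry in the server's response.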



### 02_urllib > urlliberr_start

In [18]:
# handling errors and status codes

# TODO: import the request, error, and status modules
import urllib.request
from http import HTTPStatus
from urllib.error import HTTPError, URLError


def main():
    url = "http://no-such-server.org"      # will generate a URLError
    #url = "http://httpbin.org/status/404"  # will generate an HTTPError
    #url = "http://httpbin.org/html"         # should work with no errors

    # TODO: use exception handling to attempt the URL access
    try:
        result = urllib.request.urlopen(url)
        print("Result code: {0}".format(result.status))

        #if (result.getcode() == 200):
        if (result.getcode() == HTTPStatus.OK):
            print(result.read().decode('utf-8'))
    #except HTTPError as err:
    #    print("Error {0}".format(err.code))
    except URLError as err:
        print("Poop, {0}".format(err.reason))

if __name__ == "__main__":
    main()



Poop, [Errno 8] nodename nor servname provided, or not known
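
`HTTPError` is a subclass of `URLError`, so with the `except HTTPError` clause commented out, the `URLError` handler also catches HTTP status errors such as 404. When both handlers are present, the more specific one has to come first. A minimal offline sketch of that ordering follows; `describe_error` is a hypothetical helper, and the exceptions are constructed by hand instead of coming from a real request.

```python
from http import HTTPStatus
from urllib.error import HTTPError, URLError


def describe_error(err):
    # Test HTTPError first: it is a subclass of URLError
    if isinstance(err, HTTPError):
        phrase = HTTPStatus(err.code).phrase
        return "HTTP error {0} ({1})".format(err.code, phrase)
    if isinstance(err, URLError):
        return "URL error: {0}".format(err.reason)
    raise TypeError("not a urllib error: {0!r}".format(err))


# Hand-built exceptions standing in for what urlopen would raise
http_err = HTTPError("http://httpbin.org/status/404", 404, "Not Found", None, None)
url_err = URLError("nodename nor servname provided, or not known")

print(describe_error(http_err))
print(describe_error(url_err))
```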


- urllib only supports a subset of HTTP by default
- It does not automatically decode returned data
- Common features like cookies or authentication require more modules
- Advanced features are difficult to implement
- Processing returned data such as JSON is cumbersome

### Requests: HTTP for Humans
https://docs.python-requests.org/en/master/

- Decodes return content for you
- Makes working with parameters, headers and cookies easier
- Parses JSON content when detected
- Handles errors, redirects and timeouts
- Advanced features
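
To see the parameter and header handling without touching the network, a request can be prepared but not sent. This is a sketch under the assumption that the `requests` package is installed; the header value is illustrative.

```python
import requests

# Build the request object; nothing is sent at this point
req = requests.Request(
    "GET",
    "http://httpbin.org/get",
    params={"key1": "value1", "key2": "value2"},
    headers={"User-Agent": "Example App 1.0.0"},  # illustrative value
)

# prepare() applies the URL encoding and header merging
prepared = req.prepare()

print(prepared.method)
print(prepared.url)
print(prepared.headers["User-Agent"])
```

The prepared URL already carries the encoded query string, which is the step that needed `urllib.parse.urlencode` in the urllib examples.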


In [19]:
!conda install requests --yes

Collecting package metadata (repodata.json): done
Solving environment: done

# All requested packages already installed.



In [20]:
!conda list

# packages in environment at /anaconda2:
#
# Name                    Version                   Build  Channel
_anaconda_depends         5.3.1                    py27_0  
_ipyw_jlab_nb_ext_conf    0.1.0                    py27_0  
alabaster                 0.7.11                   py27_0  
anaconda                  custom                   py27_1  
anaconda-client           1.7.2                    py27_0  
anaconda-navigator        1.9.7                    py27_1  
anaconda-project          0.8.2                    py27_0  
appdirs                   1.4.3            py27h28b3542_0  
appnope                   0.1.0            py27hb466136_0  
appscript                 1.0.1            py27h1de35cc_1  
asn1crypto                0.24.0                   py27_0  
astroid                   1.6.5                    py27_0  
astropy                   2.0.8            py27h1d22016_0  
atomicwrites              1.2.1                    py27_0  
attrs                     18.2.0           py27h28

### 03_requests > requests_start

In [21]:
# using the requests library to access internet data

#import the requests library
import requests

def main():
    # TODO: Use requests to issue a standard HTTP GET request
    url = "http://httpbin.org/xml"
    result = requests.get(url)
    printResults(result)
    
    # TODO: Send some parameters to the URL via a GET request
    # Note that requests handles this for you, no manual encoding
    print("----------------------  NEW GET: ----------------------")
    url = "http://httpbin.org/get"
    dataValues = {
        "key1":"value1",
        "key2":"value2"
    }
    result = requests.get(url, params = dataValues)
    printResults(result)
    
    print("----------------------  NEW POST: ----------------------")
    url = "http://httpbin.org/post"
    dataValues = {
        "key1":"value1",
        "key2":"value2"
    }
    result = requests.post(url, data = dataValues)
    printResults(result)
    
    # TODO: Pass a custom header to the server
    print("----------------------  NEW HEADER: ----------------------")
    url = "http://httpbin.org/get"
    headerValues = {
        "User-Agent" : "Sandra Alpizar App 1.0.0"
    }
    result = requests.get(url, headers = headerValues)
    printResults(result)
    

def printResults(resData):
    print("Result code: {0}".format(resData.status_code))
    print("\n")

    print("Headers: ----------------------")
    print(resData.headers)
    print("\n")

    print("Returned data: ----------------------")
    print(resData.content)
    
    print("\n")
    print("Returned as text: ----------------------")
    print(resData.text)

if __name__ == "__main__":
    main()


Result code: 200


Headers: ----------------------
{'Date': 'Sun, 23 Aug 2020 23:37:25 GMT', 'Content-Type': 'application/xml', 'Content-Length': '522', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}


Returned data: ----------------------
b'<?xml version=\'1.0\' encoding=\'us-ascii\'?>\n\n<!--  A SAMPLE set of slides  -->\n\n<slideshow \n    title="Sample Slide Show"\n    date="Date of publication"\n    author="Yours Truly"\n    >\n\n    <!-- TITLE SLIDE -->\n    <slide type="all">\n      <title>Wake up to WonderWidgets!</title>\n    </slide>\n\n    <!-- OVERVIEW -->\n    <slide type="all">\n        <title>Overview</title>\n        <item>Why <em>WonderWidgets</em> are great</item>\n        <item/>\n        <item>Who <em>buys</em> WonderWidgets</item>\n    </slide>\n\n</slideshow>'


Returned as text: ----------------------
<?xml version='1.0' encoding='us-ascii'?>

<!--  A SAMPLE set of slides  -

### 03_requests > reqerrs_start

In [22]:
# using the requests library to access internet data

import requests
from requests import HTTPError, Timeout

def main():
    # Use requests to issue a standard HTTP GET request
    try:
        #url = "http://httpbin.org/status/404"
        url = "http://httpbin.org/delay/5"
        # timeout in seconds: how long to wait for the server to respond
        result = requests.get(url, timeout = 2)
        result.raise_for_status()
        printResults(result)
    except HTTPError as err:
        print("Error: {0}".format(err))
    except Timeout as err:
        print("Request timed out: {0}".format(err))

    

def printResults(resData):
    print("Result code: {0}".format(resData.status_code))
    print("\n")

    print("Returned data: ----------------------")
    print(resData.text)


if __name__ == "__main__":
    main()


Request timed out: HTTPConnectionPool(host='httpbin.org', port=80): Read timed out. (read timeout=2)
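
The other branch, `raise_for_status()` raising `HTTPError`, can be exercised offline. The sketch below builds a bare `Response` by hand, which is an assumption made purely for demonstration; real code receives the response from `requests.get`. The `check` helper is hypothetical.

```python
import requests
from requests import HTTPError


def check(status_code):
    # Hand-built response standing in for a real server reply
    response = requests.Response()
    response.status_code = status_code
    try:
        # Raises HTTPError for 4xx and 5xx codes, does nothing otherwise
        response.raise_for_status()
    except HTTPError as err:
        return "Error: {0}".format(err)
    return "OK"


for code in (200, 404, 503):
    print(code, check(code))
```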


### 03_requests > reqauth_start

In [23]:
# using the requests library to access internet data

import requests
from requests.auth import HTTPBasicAuth

def main():
    # Access a URL that requires authentication - the format of this 
    # URL is that you provide the username/password to auth against
    url = "http://httpbin.org/basic-auth/salpizar/mypwd"

    # TODO: Create a credentials object using HTTPBasicAuth
    myCreds = HTTPBasicAuth("salpizar", "mypwd")

    # TODO: Issue the request with the authentication credentials
    #result = requests.get(url, auth = myCreds)
    # shorthand: a (username, password) tuple is treated as HTTP Basic auth
    result = requests.get(url, auth = ("salpizar", "mypwd"))

    printResults(result)
    

def printResults(resData):
    print("Result code: {0}".format(resData.status_code))
    print("\n")

    print("Returned data: ----------------------")
    print(resData.text)

if __name__ == "__main__":
    main()


Result code: 200


Returned data: ----------------------
{
  "authenticated": true, 
  "user": "salpizar"
}
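
HTTP Basic authentication sends the credentials in an `Authorization` header as `Basic` followed by the Base64 encoding of `username:password`. The sketch below reproduces that scheme with the standard library only; `basic_auth_header` is a hypothetical helper that illustrates the format, not the code requests runs internally.

```python
import base64


def basic_auth_header(username, password):
    # Join with a colon, then Base64-encode the resulting bytes
    credentials = "{0}:{1}".format(username, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": "Basic " + token}


header = basic_auth_header("salpizar", "mypwd")
print(header)

# Base64 is an encoding, not encryption: anyone can reverse it
scheme, token = header["Authorization"].split(" ")
print(base64.b64decode(token).decode("utf-8"))
```

Because the credentials are trivially recoverable, Basic auth is normally used over HTTPS only.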



### The Python JSON Module
Parsing functions
- load(file)
- loads(string)

Serialization functions:
- dump(obj, file)
- dumps(obj)

Serializing Python Data to JSON
- dict : object
- list/tuple : array
- str : string
- int, float, int- and float-derived Enums : number
- True : true
- False : false
- None : null

Parsing JSON to Python
- object : dict
- array : list
- string : str
- number (int) : int
- number (real) : float
- true, false : True, False
- null : None
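
A short round trip makes the two mappings concrete. This is a minimal sketch with made-up sample data; note that a tuple goes out as a JSON array and comes back as a list, so the conversion is not always symmetric.

```python
import json

python_data = {
    "name": "Reuben",
    "toppings": ("Sauerkraut", "Pickles"),  # tuple -> JSON array
    "price": 8.99,                          # float -> number
    "count": 2,                             # int -> number
    "toasted": True,                        # True -> true
    "notes": None,                          # None -> null
}

json_text = json.dumps(python_data)
round_trip = json.loads(json_text)

print(json_text)
print(round_trip)
```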



### 04_json > json_serialize_start


In [24]:
# Process JSON data returned from a server

# TODO: use the JSON module
import json

def main():
    # define a string of JSON code
    jsonStr = '''{
            "sandwich" : "Reuben",
            "toasted" : true,
            "toppings" : [
                "Thousand Island Dressing",
                "Sauerkraut",
                "Pickles"
            ],
            "price" : 8.99
        }'''

    # TODO: parse the JSON data using loads
    data = json.loads(jsonStr)

    # TODO: print information from the data structure
    print("Sandwich: " + data['sandwich'])
    if(data["toasted"]):
        print("and is toasted!")
        
    for topping in data["toppings"]:
        print("Topping:" + topping)


if __name__ == "__main__":
    main()


Sandwich: Reuben
and is toasted!
Topping:Thousand Island Dressing
Topping:Sauerkraut
Topping:Pickles


### 04_json > json_parse_start

In [25]:
# Process JSON data returned from a server

# use the JSON module
import json


def main():
    # define a python dictionary
    pythonData = {
        "sandwich": "Reuben",
        "toasted": True,
        "toppings": ["Thousand Island Dressing",
                     "Sauerkraut",
                     "Pickles"
                     ],
        "price": 8.99
    }

    # TODO: serialize to JSON using dumps
    jsonStr = json.dumps(pythonData, indent = 4)
    # TODO: print the resulting JSON string
    print("JSON Data: --------")
    print(jsonStr)


if __name__ == "__main__":
    main()


JSON Data: --------
{
    "sandwich": "Reuben",
    "toasted": true,
    "toppings": [
        "Thousand Island Dressing",
        "Sauerkraut",
        "Pickles"
    ],
    "price": 8.99
}


### 04_json > json_err_start

In [26]:
# Process JSON data returned from a server

# use the JSON module
import json
from json import JSONDecodeError


def main():
    # define a string of JSON code
    # to trigger a JSONDecodeError, remove the comma before "price"
    jsonStr = '''{
            "sandwich" : "Reuben",
            "toasted" : true,
            "toppings" : [
                "Thousand Island Dressing",
                "Sauerkraut",
                "Pickles"
            ],
            "price" : 8.99
        }'''

    # parse the JSON data using loads
    try:
        data = json.loads(jsonStr)
    except JSONDecodeError as err:
        print("Whoops! JSON decoding error:")
        print(err.msg)
        print(err.lineno, err.colno)
        # stop here: data was never assigned, so the code below would fail
        return

    # print information from the data structure
    print("Sandwich: " + data['sandwich'])
    if (data['toasted']):
        print("And it's toasted!")
    for topping in data['toppings']:
        print("Topping: " + topping)


if __name__ == "__main__":
    main()


Sandwich: Reuben
And it's toasted!
Topping: Thousand Island Dressing
Topping: Sauerkraut
Topping: Pickles
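
The run above parses valid JSON, so the `except` branch never executes. A minimal sketch of the failure case follows, using a shortened document with the comma before `"price"` left out; `parse_or_none` is a hypothetical helper.

```python
import json
from json import JSONDecodeError

# The comma after the closing bracket is deliberately missing
broken_json = '''{
    "toppings" : ["Sauerkraut", "Pickles"]
    "price" : 8.99
}'''


def parse_or_none(text):
    try:
        return json.loads(text)
    except JSONDecodeError as err:
        # msg describes the problem; lineno and colno locate it
        print("JSON decoding error: {0}".format(err.msg))
        print("at line {0}, column {1}".format(err.lineno, err.colno))
        return None


result = parse_or_none(broken_json)
print(result)
```

Returning `None` (or re-raising) keeps later code from touching a variable that was never assigned.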


### 04_json > json_req_start

In [27]:
# using the requests library to access internet data

#import the requests library
import requests
import json


def main():
    # Use requests to issue a standard HTTP GET request
    url = "http://httpbin.org/json"
    result = requests.get(url)

    # TODO: Use the built-in JSON function to return parsed data
    dataobj = result.json()
    print(json.dumps(dataobj, indent = 4))
    
    # TODO: Access data in the python object
    print(list(dataobj.keys()))
    
    print(dataobj["slideshow"]["title"])
    print("there are {0} slides.".format(len(dataobj["slideshow"]["slides"])))

if __name__ == "__main__":
    main()


{
    "slideshow": {
        "author": "Yours Truly",
        "date": "date of publication",
        "slides": [
            {
                "title": "Wake up to WonderWidgets!",
                "type": "all"
            },
            {
                "items": [
                    "Why <em>WonderWidgets</em> are great",
                    "Who <em>buys</em> WonderWidgets"
                ],
                "title": "Overview",
                "type": "all"
            }
        ],
        "title": "Sample Slide Show"
    }
}
['slideshow']
Sample Slide Show
there are 2 slides.


### XML Parsing

- SAX: Simple API for XML
- DOM: Document Object Model

SAX
- Reads the entire document sequentially
- Generates events as XML content is encountered
- Your app defines a class to handle content events

Advantages:
- Memory efficient
- Fast
- Easy to implement

Drawbacks:
- No random access to document content
- No context is passed to the handler; your code has to track where it is in the document
- Cannot modify the XML file

API
- import xml.sax
- xml.sax.parse(file, handler)
- xml.sax.parseString(string, handler)
- class xml.sax.ContentHandler
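
The event model can be tried on an inline document before going to the network. This is a minimal sketch: the XML string and the `CountingHandler` class are made up for illustration. It also buffers text, because `characters()` may be called more than once for a single text node.

```python
import xml.sax

SAMPLE_XML = """<slideshow title="Sample Slide Show">
  <slide><title>Overview</title><item>One</item><item/></slide>
</slideshow>"""


class CountingHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.counts = {}      # tag name -> number of start events
        self.titles = []      # text of every <title> element
        self._buffer = None   # collects text while inside <title>

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1
        if name == "title":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._buffer = None


handler = CountingHandler()
xml.sax.parseString(SAMPLE_XML.encode("utf-8"), handler)

print(handler.counts)
print(handler.titles)
```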


### 05_xml_sax > sax_parse_start

In [28]:
# parse XML data using the SAX parser

import requests
import xml.sax

# TODO: define the ContentHandler subclass for our content
class MyContentHandler(xml.sax.ContentHandler):
    def __init__(self):
        # initialize the base ContentHandler state as well
        super().__init__()
        self.slideCount = 0
        self.itemCount = 0
        self.isInTitle = False

    #TODO: Handle startElement
    def startElement(self, tagName, attrs):
        if tagName == "slideshow":
            print("Slideshow title: ", attrs["title"])
        elif tagName == "slide":
            self.slideCount += 1
        elif tagName == "item":
            self.itemCount += 1
        elif tagName == "title":
            self.isInTitle = True

    #TODO: Handle endElement
    def endElement(self, tagName):
        if tagName == "title":
            self.isInTitle = False
        
    #TODO: Handle text data
    def characters(self, chars):
        if self.isInTitle:
            print("Title: " + chars)

    #TODO: Handle startDocument
    def startDocument(self):
        print("About to start")

    #TODO: Handle endDocument
    def endDocument(self):
        print("Finishing Up!")


def main():
    # create a new content handler for the SAX parser
    handler = MyContentHandler()

    # use the Requests lib to get XML data from the server
    # remember that Requests auto-decodes our content
    url = "http://httpbin.org/xml"
    result = requests.get(url)
    print(result.text)

    # TODO: call the parseString method on the XML text content received
    xml.sax.parseString(result.text, handler)

    # when we're done, print out some interesting results
    print("There were {0} slide elements".format(handler.slideCount))
    print("There were {0} item elements".format(handler.itemCount))


if __name__ == "__main__":
    main()


<?xml version='1.0' encoding='us-ascii'?>

<!--  A SAMPLE set of slides  -->

<slideshow 
    title="Sample Slide Show"
    date="Date of publication"
    author="Yours Truly"
    >

    <!-- TITLE SLIDE -->
    <slide type="all">
      <title>Wake up to WonderWidgets!</title>
    </slide>

    <!-- OVERVIEW -->
    <slide type="all">
        <title>Overview</title>
        <item>Why <em>WonderWidgets</em> are great</item>
        <item/>
        <item>Who <em>buys</em> WonderWidgets</item>
    </slide>

</slideshow>
About to start
Slideshow title:  Sample Slide Show
Title: Wake up to WonderWidgets!
Title: Overview
Finishing Up!
There were 2 slide elements
There were 3 item elements


### DOM API
https://docs.python.org/3/library/xml.dom.minidom.html

XML Essential Training

- Access any part of XML at random
- Modify XML
- Represents XML as a hierarchical tree (loaded into memory)
- xml.dom.minidom: lightweight implementation


- domtree = xml.dom.minidom.parseString(str)
- document.getElementById(id)
- elem.getElementsByTagName(tagname)


- elem.getAttribute(attrName)
- elem.setAttribute(attrName, val)


- newElem = document.createElement(tagName)
- newText = document.createTextNode(strOfText)
- elem.appendChild(newElem)
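
The calls listed above can be combined on a small inline document. This is a minimal sketch; the XML string and the attribute value are made up for illustration.

```python
import xml.dom.minidom

SAMPLE_XML = '<slideshow title="Sample"><slide><item>One</item></slide></slideshow>'

dom = xml.dom.minidom.parseString(SAMPLE_XML)
root = dom.documentElement

items_before = dom.getElementsByTagName("item").length

# Modify the tree in memory: set an attribute and append a new element
root.setAttribute("author", "Yours Truly")
new_item = dom.createElement("item")
new_item.appendChild(dom.createTextNode("Two"))
dom.getElementsByTagName("slide")[0].appendChild(new_item)

items_after = dom.getElementsByTagName("item").length

print(items_before, items_after)
print(root.toxml())
```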


### 06_xml_dom > dom_parse_start

In [29]:
# Use the XML DOM to parse a document in memory

import xml.dom.minidom
import requests


def main():
    # retrieve the XML data using the requests library
    url = "http://httpbin.org/xml"
    result = requests.get(url)
    
    # TODO: parse the returned content into a DOM structure
    domTree = xml.dom.minidom.parseString(result.text)
    rootNode = domTree.documentElement
    
    # TODO: display some information about the content
    print("The root element is {0}".format(rootNode.nodeName))
    print("Title: {0}".format(rootNode.getAttribute("title")))
    
    items = domTree.getElementsByTagName("item")
    print("There are {0} item tags".format(items.length))

    # manipulate the XML content in memory
    # TODO: create a new item tag
    newItem = domTree.createElement("item")

    # TODO: add some text to the item
    newItem.appendChild(domTree.createTextNode("Some text."))

    # TODO: now add the item to the first slide
    firstSlide = domTree.getElementsByTagName("slide")[0]
    firstSlide.appendChild(newItem)
    
    # TODO: Now count the item tags again
    items = domTree.getElementsByTagName("item")
    print("There are {0} item tags".format(items.length))

if __name__ == "__main__":
    main()


The root element is slideshow
Title: Sample Slide Show
There are 3 item tags
There are 4 item tags


### Element tree API

https://docs.python.org/3/library/xml.etree.elementtree.html

- simpler and more efficient than DOM
- elements are treated as lists
- attributes are treated as dictionaries


- easy search (XML essential training)
- elem.findall(queryExpression)
- tagname: find immediate child tagname elements
- .//tagname: find all tagname elements regardless of where they are below the current element
- tagname[@attr]: find all tagname elements that have a specific attribute
- tagname[@attr='val']: find all tagname elements that have an attribute with a specific value
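
The query forms can be compared side by side with the standard library's `xml.etree.ElementTree`, which accepts the same path syntax for these cases. This is a minimal sketch on a made-up document; in this syntax attribute tests are written with `@`.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """<slideshow title="Sample Slide Show">
  <slide type="all"><title>Intro</title></slide>
  <slide type="tech"><title>Overview</title><item>One</item><item/></slide>
  <slide><title>End</title></slide>
</slideshow>"""

root = ET.fromstring(SAMPLE_XML)

direct_slides = root.findall("slide")             # immediate children only
direct_items = root.findall("item")               # none: items are nested deeper
all_items = root.findall(".//item")               # anywhere below root
typed_slides = root.findall("slide[@type]")       # slides with a type attribute
all_typed = root.findall("slide[@type='all']")    # slides where type is "all"

for label, found in [
    ("slide", direct_slides),
    ("item", direct_items),
    (".//item", all_items),
    ("slide[@type]", typed_slides),
    ("slide[@type='all']", all_typed),
]:
    print(label, len(found))
```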

### 06_xml_dom > lxml_start

In [3]:
!conda install lxml --yes

usage: conda [-h] [-V] command ...
conda: error: unrecognized arguments: --upgrade


In [5]:
!pip3 install lxml

Collecting lxml
  Downloading lxml-4.5.2-cp38-cp38-macosx_10_9_x86_64.whl (4.5 MB)
     |████████████████████████████████| 4.5 MB 1.5 MB/s eta 0:00:01
Installing collected packages: lxml
Successfully installed lxml-4.5.2
You should consider upgrading via the '/Library/Frameworks/Python.framework/Versions/3.8/bin/python3.8 -m pip install --upgrade pip' command.


In [30]:
# Use the lxml library to parse a document in memory

import requests
from lxml import etree


def main():
    # retrieve the XML data using the requests library
    url = "http://httpbin.org/xml"
    result = requests.get(url)
    print(result.text)
    
    # TODO: build a doc structure using the ElementTree API
    doc = etree.fromstring(result.content)

    # TODO: Access the value of an attribute
    print(doc.tag)
    print(doc.attrib["title"])
    
    # TODO: Iterate over tags
    for elem in doc.findall("slide"):
        print(elem.tag)

    # TODO: Create a new slide
    slideCount = len(doc.findall("slide"))
    print("There are {0} slide elements".format(slideCount))
    newSlide = etree.SubElement(doc, "slide")
    newSlide.text = "This is a new slide"

    # TODO: Count the number of slides
    slideCount = len(doc.findall("slide"))
    itemCount = len(doc.findall(".//item"))
    
    print("There are {0} slide elements".format(slideCount))
    print("There are {0} item elements".format(itemCount))


if __name__ == "__main__":
    main()


<?xml version='1.0' encoding='us-ascii'?>

<!--  A SAMPLE set of slides  -->

<slideshow 
    title="Sample Slide Show"
    date="Date of publication"
    author="Yours Truly"
    >

    <!-- TITLE SLIDE -->
    <slide type="all">
      <title>Wake up to WonderWidgets!</title>
    </slide>

    <!-- OVERVIEW -->
    <slide type="all">
        <title>Overview</title>
        <item>Why <em>WonderWidgets</em> are great</item>
        <item/>
        <item>Who <em>buys</em> WonderWidgets</item>
    </slide>

</slideshow>
slideshow
Sample Slide Show
slide
slide
There are 2 slide elements
There are 3 slide elements
There are 3 item elements
