# An Overview of Data Formats

APIs usually give you data in one of two formats: JSON or XML. JSON stands for JavaScript Object Notation, although it's used widely outside JavaScript programming. XML stands for eXtensible Markup Language. Both are ways to structure data as text. First, let's look at JSON, since it's the newer and more popular of the two formats.

## JSON

Here is a simple example of some JSON data, as it might be returned from a book catalog like Corpus-DB or WorldCat. Make sure to run the cell below (select it and press the "Run" button in the toolbar above) since we'll use it later.

In [2]:
ulyssesData = """
              { "title": "Ulysses", 
                "publicationYear": 1922, 
                "author": { "name": "James Joyce", 
                            "dateOfBirth": "1882-02-02",
                            "dateOfDeath": "1941-01-13",
                            "books": ["Dubliners", 
                                      "A Portrait of the Artist as a Young Man",
                                      "Ulysses",
                                      "Finnegans Wake"],
                            "married": true },
                "publisher": "Shakespeare and Company"
                }
               """ 

I've formatted this data nicely, but JSON parsers don't care about whitespace, so that same data could also be written like this: 

```json
{"title": "Ulysses", "publicationYear": 1922, "author": {"name": "James Joyce", "dateOfBirth": "1882-02-02", "dateOfDeath": "1941-01-13", "books": ["Dubliners", "A Portrait of the Artist as a Young Man", "Ulysses", "Finnegans Wake"], "married": true}, "publisher": "Shakespeare and Company"}
```

This can look messy sometimes, but don't worry—we'll make sense of it. There are a few things to note here. 

 - Curly brackets `{}` denote key-value pairs, as in Python dictionaries. `"title": "Ulysses"` is such a key-value pair: `"title"` is the key, or data label, and `"Ulysses"` is the value, or data itself. 
 - Square brackets, `[]`, denote lists, like the list of Joyce's books here. These are similar to Python's lists.
 - Also as in Python, strings are enclosed in quotation marks, and integers and other numbers aren't. 
 - The values `true` and `false`, known as Boolean values, are also not enclosed in quotation marks. 
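
To make the list above concrete, here is a small sketch that parses a made-up JSON string and prints the Python type each value becomes. It uses the `json` library, which is introduced just below; the `sample` string is my own invention, not data from a real catalog.

```python
import json

# A made-up JSON string with one value of each kind described above.
sample = '{"title": "Ulysses", "publicationYear": 1922, "books": ["Dubliners", "Ulysses"], "author": {"name": "James Joyce"}, "married": true}'

parsed = json.loads(sample)

# Record the name of the Python type that each JSON value was turned into.
typeNames = {key: type(value).__name__ for key, value in parsed.items()}

for key, typeName in typeNames.items():
    print(key, "->", typeName)
```

Strings become `str`, numbers without a decimal point become `int`, square-bracket lists become `list`, curly-bracket objects become `dict`, and `true` becomes the Python Boolean `True`.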
 
The nice thing is, we don't *really* need to know all that, because Python will do all the translation for us. First, let's import the JSON parsing library: 

In [3]:
import json

Next, I'm going to load the string `ulyssesData` I just defined above, using the `json.loads()` function. `loads()` loads a string, but if you're loading from a JSON file, you might use `load()` instead. 

In [4]:
parsedJSON = json.loads(ulyssesData)
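
As an aside, here is a rough sketch of the `load()` variant mentioned above, which reads from an open file rather than a string. The file name `ulysses.json` and the temporary directory are assumptions for illustration; in practice the file would already exist somewhere on your computer.

```python
import json
import os
import tempfile

# Hypothetical setup: write a small JSON file so there is something to read.
path = os.path.join(tempfile.mkdtemp(), "ulysses.json")
with open(path, "w", encoding="utf-8") as jsonFile:
    jsonFile.write('{"title": "Ulysses", "publicationYear": 1922}')

# json.load() takes an open file object, not a string.
with open(path, encoding="utf-8") as jsonFile:
    fromFile = json.load(jsonFile)

print(fromFile["title"])
```

The result is the same kind of Python object that `loads()` gives you; only the source of the text differs.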

First, let's take a look at it:

In [5]:
parsedJSON

{'author': {'books': ['Dubliners',
   'A Portrait of the Artist as a Young Man',
   'Ulysses',
   'Finnegans Wake'],
  'dateOfBirth': '1882-02-02',
  'dateOfDeath': '1941-01-13',
  'married': True,
  'name': 'James Joyce'},
 'publicationYear': 1922,
 'publisher': 'Shakespeare and Company',
 'title': 'Ulysses'}

Python pretty-prints our data for us. (This is a good way of taking messy-looking data and making it look nice, by the way.) 
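
If you want the tidy version as a string (to save to a file, say), one option is `json.dumps()` with its `indent` option. This is only a sketch; the `messy` string is a shortened, made-up version of our data.

```python
import json

messy = '{"title": "Ulysses", "author": {"name": "James Joyce", "books": ["Dubliners", "Ulysses"]}}'

# Parse the string, then write it back out with two-space indentation.
tidy = json.dumps(json.loads(messy), indent=2)
print(tidy)
```

Unlike the notebook display, the output of `json.dumps()` is still valid JSON, with double quotes and lowercase `true`/`false`.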

## Exploring JSON Data

Next, let's explore this data. First, what is this data type? 

In [6]:
type(parsedJSON)

dict

Great. It's just a Python `dictionary`. With dictionaries, we can use the `.keys()` method to see what keys (available data fields) we can get out of this object. 

In [8]:
parsedJSON.keys()

dict_keys(['title', 'publicationYear', 'author', 'publisher'])

Now what if we want to get the title of this book? We provide the key inside square brackets and quotes, and it will give us that key's value, like this:

In [9]:
parsedJSON['title']

'Ulysses'

Now by storing that in a variable, we can use it in any way we like. Say we're building a Twitter bot that recommends a book published 100 years ago this year (and also imagine that it's 2022). You could write code like this:

In [10]:
# Store the `title` value in a variable called `book`.
book = parsedJSON['title']

# Combine that value we just stored with the string starting "Published..."
tweet = "Published one hundred years ago: " + book 

# Print the result, so we can see it here in the notebook.
print(tweet) 

Published one hundred years ago: Ulysses


Cool. Now let's get the author's name. Let's start by exploring the `author` field. What data type is it?

In [11]:
type(parsedJSON['author'])

dict

It's a dictionary, so we can do the same thing we did before, and get a list of keys for it:

In [12]:
parsedJSON['author'].keys()

dict_keys(['name', 'dateOfBirth', 'dateOfDeath', 'books', 'married'])

So the one we want is the `name` field of the `author` field. We can write this: 

In [13]:
parsedJSON['author']['name']

'James Joyce'

We could add that to our tweet, like this: 

In [14]:
author = parsedJSON['author']['name']
tweet = "Published one hundred years ago: " + book + " by " + author
print(tweet)

Published one hundred years ago: Ulysses by James Joyce


Our Twitter bot is starting to take shape! But what if we wanted to check whether a book was really published 100 years ago? We can use the `publicationYear` field and, since it's an integer, compare it to the "current year" (2022 in this tutorial). So we can wrap the code above in an `if` statement:

In [52]:
currentYear = 2022 # Not really
year = parsedJSON['publicationYear']
if year == currentYear-100: 
    author = parsedJSON['author']['name']
    tweet = "Published one hundred years ago: " + book + " by " + author
    print(tweet)

Published one hundred years ago: Ulysses by James Joyce
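
The year above is hard-coded. A real bot would probably ask the computer for the current year instead. Here is one possible sketch using the standard `datetime` library; the function name `isCentenary` is my own, not part of any library.

```python
from datetime import date

def isCentenary(publicationYear, currentYear=None):
    # Fall back to the real current year when none is supplied.
    if currentYear is None:
        currentYear = date.today().year
    return publicationYear == currentYear - 100

# Pretending it's 2022, as in the tutorial:
print(isCentenary(1922, currentYear=2022))

# Using the real current year (the answer depends on when you run this):
print(isCentenary(1922))
```

Keeping `currentYear` as an optional argument lets us keep pretending it's 2022 while testing.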


Now what if we wanted to add a list of other books by the same author? We could take advantage of the `books` list in the `author` field. First, let's see what we're dealing with:

In [53]:
parsedJSON['author']['books']

['Dubliners',
 'A Portrait of the Artist as a Young Man',
 'Ulysses',
 'Finnegans Wake']

Now let's make a simple comma-separated series of these books by joining this list together into a string using a comma and a space as a separator: 

In [54]:
otherBooks = parsedJSON['author']['books']
# We'll make a simple series by joining together this list of books with commas. 
otherBooksSeries = ", ".join(otherBooks)
print(otherBooksSeries)

Dubliners, A Portrait of the Artist as a Young Man, Ulysses, Finnegans Wake


In [55]:
tweet = "Published one hundred years ago: " + book + " by " + author + ", author of " + otherBooksSeries
print(tweet)

Published one hundred years ago: Ulysses by James Joyce, author of Dubliners, A Portrait of the Artist as a Young Man, Ulysses, Finnegans Wake
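
Notice that _Ulysses_ itself shows up in the list of "other" books. One way to leave it out, sketched here with the values copied into plain variables so the cell runs on its own, is to filter the list before joining it. I've also used an f-string, which is an alternative to gluing strings together with `+`.

```python
book = "Ulysses"
author = "James Joyce"
books = ["Dubliners",
         "A Portrait of the Artist as a Young Man",
         "Ulysses",
         "Finnegans Wake"]

# Keep every title except the one we're already announcing.
otherBooks = [title for title in books if title != book]
otherBooksSeries = ", ".join(otherBooks)

# In an f-string, anything inside {} is replaced by that variable's value.
tweet = f"Published one hundred years ago: {book} by {author}, author of {otherBooksSeries}"
print(tweet)
```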


## Exercises

Exercise 1: Adapt the Twitter bot example above so that it also says where _Ulysses_ was first published.


Exercise 2: Adapt the Twitter bot example above so that it includes the year of first publication. (This is tricky, since `1922` is actually an integer, not a string, so you won't be able to combine it with the other strings using `+`. You'll have to first convert the integer to a string using the function `str()`.)

## A Real-World Example: Google Books

Let's try this with some real-world data, using Google Books. First, let's make sure we use the `requests` library for working with APIs:

In [56]:
import requests

I learn from [the Google Books API documentation](https://developers.google.com/books/docs/v1/getting_started) that a search for books about quilting has the form `https://www.googleapis.com/books/v1/volumes?q=quilting`. We can infer from that URL that the first part, `https://www.googleapis.com/books/v1/volumes`, is the *API endpoint*, and the part following the question mark is the *parameters list*. Each parameter is a key-value pair in the form `key=value`. The key in this case is `q`, short for "query," and the value is `quilting`, our search term. In Python terms, that's a dictionary that looks like `{"q": "quilting"}`.

So let's say we want to query something else, like books that match a search for _Ulysses_. We can pass this as a dictionary to the `params` option of the `requests.get()` function. In other words, we tell `requests` to perform an HTTP `GET` request to the endpoint with the URL we provide, and we tell it that there is one parameter, `q`, which holds our search terms. Here, you could either follow along with me and get information about _Ulysses_, or you could substitute the name of another book.

In [101]:
endpoint = "https://www.googleapis.com/books/v1/volumes"
response = requests.get(endpoint, params={"q": "Ulysses James Joyce"})

We could've written that as `requests.get("https://www.googleapis.com/books/v1/volumes?q=Ulysses James Joyce")`, of course, but this isn't really good practice. Passing the parameters to `requests` explicitly lets it encode our query in URL-safe format. For instance, it will transform `James Joyce` to `James+Joyce`, where `+` is a URL-safe code for a space in a query string (`%20` is another; spaces themselves aren't technically allowed in URLs).
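
To see what URL-safe encoding looks like without sending a request, here is a sketch using Python's standard `urllib.parse` library. This isn't `requests` itself, though `requests` does comparable encoding for you. In a query string, a space can be written either as `+` or as `%20`.

```python
from urllib.parse import quote, urlencode

endpoint = "https://www.googleapis.com/books/v1/volumes"
params = {"q": "Ulysses James Joyce"}

# urlencode() turns a dictionary into a query string, escaping unsafe characters.
queryString = urlencode(params)
fullUrl = endpoint + "?" + queryString
print(fullUrl)

# quote() percent-encodes a single value instead.
percentEncoded = quote("James Joyce")
print(percentEncoded)
```

The first line printed ends in `?q=Ulysses+James+Joyce`, and the second is `James%20Joyce`.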

Now that we have a result, let's look at it. Verify that the response came back OK. 

In [114]:
response.ok

True

If you got `True`, it worked: `response.ok` is `True` for any status code below 400, such as `200`. If you got `False`, it didn't work. Look at `response.status_code` and check the [list of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) to see what went wrong, or check `response.reason`. A `429` code, for instance, could mean that there have been too many requests to this API from our network. In a workshop such as ours, this is likely to happen at least once.
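
If you'd rather look up a status code without leaving the notebook, Python's standard library knows the common ones. This is a small sketch using `http.HTTPStatus`; the three codes are just examples I picked.

```python
from http import HTTPStatus

# Look up the short description attached to each status code.
phrases = {code: HTTPStatus(code).phrase for code in [200, 404, 429]}

for code, phrase in phrases.items():
    print(code, phrase)
```

This prints `200 OK`, `404 Not Found`, and `429 Too Many Requests`.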

If it worked, we can examine the response text. Here, I'm only going to peek at the first thousand characters of the text, since I expect it to be really long. 

In [103]:
print(response.text[:1000])

{
 "kind": "books#volumes",
 "totalItems": 1733,
 "items": [
  {
   "kind": "books#volume",
   "id": "iH6nCwAAQBAJ",
   "etag": "+iwVKVn96iQ",
   "selfLink": "https://www.googleapis.com/books/v1/volumes/iH6nCwAAQBAJ",
   "volumeInfo": {
    "title": "ULYSSES (Modern Classics Series)",
    "authors": [
     "James Joyce"
    ],
    "publisher": "e-artnow",
    "publishedDate": "2016-01-17",
    "description": "This carefully crafted ebook: “ULYSSES (Modern Classics Series)” is formatted for your eReader with a functional and detailed table of contents. Ulysses is a modernist novel by Irish writer James Joyce. It is considered to be one of the most important works of modernist literature, and has been called \"a demonstration and summation of the entire movement\". Ulysses chronicles the peripatetic appointments and encounters of Leopold Bloom in Dublin in the course of an ordinary day, 16 June 1904. Ulysses is the Latinised name of Odysseus, the hero of Homer's epic poem Odyssey, and th

Interesting! What kinds of data can we get from this? First, let's parse this response as JSON.

In [104]:
ulyssesParsed = json.loads(response.text)

In [105]:
ulyssesParsed.keys()

dict_keys(['kind', 'totalItems', 'items'])

OK, how many items (books) did Google Books find for our query?

In [106]:
ulyssesParsed['totalItems']

1733

Let's look at the first item. Remember, to get an item out of a Python list, we index it using `[]`, so we can get the first element of the list (technically the zeroth element, since Python starts counting at zero) using `[0]`:

In [107]:
ulyssesParsed['items'][0].keys()

dict_keys(['kind', 'id', 'etag', 'selfLink', 'volumeInfo', 'saleInfo', 'accessInfo', 'searchInfo'])
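
As a quick refresher, here is a sketch of the difference between indexing a list (getting one element) and slicing it (getting a shorter list), using a made-up list of titles rather than the API response.

```python
books = ["Dubliners",
         "A Portrait of the Artist as a Young Man",
         "Ulysses",
         "Finnegans Wake"]

first = books[0]      # indexing: a single element
firstTwo = books[:2]  # slicing: a new list with the first two elements
last = books[-1]      # negative indexes count from the end

print(first)
print(firstTwo)
print(last)
```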

`volumeInfo` sounds promising. Let's see what it contains.

In [108]:
ulyssesInfo = ulyssesParsed['items'][0]['volumeInfo']
ulyssesInfo

{'allowAnonLogging': True,
 'authors': ['James Joyce'],
 'canonicalVolumeLink': 'https://market.android.com/details?id=book-iH6nCwAAQBAJ',
 'categories': ['Fiction'],
 'contentVersion': 'preview-1.0.0',
 'description': 'This carefully crafted ebook: “ULYSSES (Modern Classics Series)” is formatted for your eReader with a functional and detailed table of contents. Ulysses is a modernist novel by Irish writer James Joyce. It is considered to be one of the most important works of modernist literature, and has been called "a demonstration and summation of the entire movement". Ulysses chronicles the peripatetic appointments and encounters of Leopold Bloom in Dublin in the course of an ordinary day, 16 June 1904. Ulysses is the Latinised name of Odysseus, the hero of Homer\'s epic poem Odyssey, and the novel establishes a series of parallels between its characters and events and those of the poem (the correspondence of Leopold Bloom to Odysseus, Molly Bloom to Penelope, and Stephen Dedalus t

Let's grab the first author.

In [86]:
ulyssesInfo['authors'][0]

'James Joyce'

What if we want to see the image linked to in the `imageLinks` section? (Here I'm using some fancy magic to display this image in this Jupyter notebook. Don't worry about understanding how it works.)

In [99]:
from IPython.display import Image
Image(url=ulyssesInfo['imageLinks']['thumbnail'])

Cool. What if we wanted to see if it was a very long book? 

In [109]:
ulyssesInfo['pageCount'] > 700

True

Something fancier: what about getting the average page count of the first five books that match our search? 

In [111]:
first5Ulysses = ulyssesParsed['items'][:5]
pageCounts = [book['volumeInfo']['pageCount'] for book in first5Ulysses]
pageCounts

[900, 1513, 854, 432, 645]

In [113]:
sum(pageCounts)/len(pageCounts)

868.8

Imagine writing a web app that would tell you whether a book, according to Google Books, was likely to be very long. You could do that with a function like this, which puts this all together: 

In [123]:
def isBookLongOnAverage(bookQuery): 
    endpoint = "https://www.googleapis.com/books/v1/volumes"
    response = requests.get(endpoint, params={"q": bookQuery})
    if not response.ok: 
        return "There was a bad response from the Google Books API. Can't continue. Sorry! Maybe try later?"
    else: 
        parsedJSON = json.loads(response.text)
        first5books = parsedJSON['items'][:5]
        pageCounts = [book['volumeInfo']['pageCount'] for book in first5books]
        averageCount = sum(pageCounts)/len(pageCounts)
        if averageCount > 600: 
            return "These books are over 600 pages, on average! " +\
                   "The book you searched for, " + bookQuery + ", is probably a long book!"
        else:
            return "These books are less than 600 pages, on average. " +\
                   "The book you searched for, " + bookQuery + ", is probably not very long."

Try it out! Enter your own search term here: 

In [124]:
isBookLongOnAverage('The Sun Also Rises')

'These books are less than 600 pages, on average. The book you searched for, The Sun Also Rises, is probably not very long.'

Note: if you get a `KeyError`, it means that we're looking for the `pageCount` property in our function, but at least one of the books doesn't have one. We could get around this by checking whether `pageCount` exists first (`if 'pageCount' in book['volumeInfo']`), but for now, just try another query if that one fails.

In [127]:
isBookLongOnAverage('Pride and Prejudice')

'These books are less than 600 pages, on average. The book you searched for, Pride and Prejudice, is probably not very long.'
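
Here is one way the `KeyError` mentioned in the note above could be avoided. It's a sketch that runs on made-up data shaped like the Google Books response, so the titles and page counts are assumptions, not real results.

```python
# Made-up search results in the same shape as the Google Books response.
items = [
    {"volumeInfo": {"title": "Edition A", "pageCount": 900}},
    {"volumeInfo": {"title": "Edition B"}},  # this one has no pageCount
    {"volumeInfo": {"title": "Edition C", "pageCount": 600}},
]

# Only collect page counts from books that actually have one.
pageCounts = [item["volumeInfo"]["pageCount"]
              for item in items
              if "pageCount" in item["volumeInfo"]]

# Guard against dividing by zero if no book had a page count at all.
if pageCounts:
    averageCount = sum(pageCounts) / len(pageCounts)
else:
    averageCount = None

print(pageCounts)
print(averageCount)
```

The same `if "pageCount" in ...` filter could be dropped into the list comprehension inside `isBookLongOnAverage()`.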

## XML

Sometimes your data will come in XML format. The process is basically the same as with JSON, with a couple of differences. We'll use the `xmltodict` library, just because it's easy, but there are lots of other libraries for parsing XML, like Beautiful Soup and LXML. First, try importing the library:

In [52]:
import xmltodict

If you get an error while running the command above, your computer may not have the `xmltodict` library installed. To install it, run the command `pip3 install xmltodict` in a terminal, or run it in a Jupyter notebook cell prefixed with an exclamation point: `!pip3 install xmltodict`. (`pip3 install` is the command for installing Python 3 packages.)

Now let's try parsing some sample data. Here's our sample data from above, but in XML format. 

In [53]:
ulyssesXML = """
<book>
    <title>Ulysses</title>
    <publicationDate type="year">1922</publicationDate>
    <author>
        <name>James Joyce</name>
        <dateOfBirth>1882-02-02</dateOfBirth>
        <dateOfDeath>1941-01-13</dateOfDeath>
        <books>
            <book>Dubliners</book>
            <book>A Portrait of the Artist as a Young Man</book>
            <book>Ulysses</book>
            <book>Finnegans Wake</book>
        </books>
        <married>True</married>
    </author>
    <publisher>Shakespeare and Company</publisher>
</book>
"""

There are a few things to note here: 

 - Every XML document should have a "root node," or a single wrapper element that contains everything else. Here, it's `<book>`. 
 - Every XML field starts with a *tag*, like `<title>`, and ends with an *end-tag*, using `</`, like `</title>`. 
 - Some tags have *attributes*, like `type="year"` in the tag `<publicationDate type="year">`. These are key-value pairs. 
 
Let's try getting data out of this structure.

In [54]:
parsedXml = xmltodict.parse(ulyssesXML)

This data is now in a Python dictionary, as before. Let's say we want to get the author of this book. 

In [55]:
parsedXml['book']['author']['name']

'James Joyce'

Or a list of books by that author: 

In [56]:
parsedXml['book']['author']['books']['book']

['Dubliners',
 'A Portrait of the Artist as a Young Man',
 'Ulysses',
 'Finnegans Wake']

That all works about the same as with JSON. One difference here is that data with attributes can be accessed like this: 

In [161]:
parsedXml['book']['publicationDate']['#text']

'1922'

In [162]:
parsedXml['book']['publicationDate']['@type']

'year'

(As you may have guessed, in this syntax, `#` means "a value," and `@` means "an attribute.")
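
For comparison, here is a sketch of the same lookups using `xml.etree.ElementTree`, which ships with Python. It isn't used elsewhere in this tutorial, and the `sampleXML` string is a shortened copy of our data so that the cell runs on its own.

```python
import xml.etree.ElementTree as ET

sampleXML = """
<book>
    <title>Ulysses</title>
    <publicationDate type="year">1922</publicationDate>
    <author>
        <name>James Joyce</name>
        <books>
            <book>Dubliners</book>
            <book>Ulysses</book>
        </books>
    </author>
</book>
"""

# fromstring() returns the root element, here <book>.
root = ET.fromstring(sampleXML.strip())

pubDate = root.find("publicationDate")
year = pubDate.text            # the text between the tags
dateType = pubDate.get("type")  # the value of the `type` attribute

authorName = root.find("author/name").text
bookTitles = [b.text for b in root.findall("author/books/book")]

print(year, dateType)
print(authorName)
print(bookTitles)
```

Here the text and the attribute come from `.text` and `.get()` rather than from the `#text` and `@type` keys.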

## Example: Filler Text

There aren't too many APIs these days that return XML, but [Fillerama](http://chrisvalleskey.com/fillerama-post/) is one. It returns filler text from a number of movies and TV shows, in whatever format you like.

In [220]:
endpoint = "http://api.chrisvalleskey.com/fillerama/get.php"
parameters = {"count": 3,
              "format": "xml", # You could change this to JSON
              "show": "holygrail"}
response = requests.get(endpoint, params=parameters)

In [221]:
response

<Response [200]>

In [222]:
print(response.text)

<?xml version="1.0"?>
<fillerama>
<db>
<entry> 
<source>Arthur</source> 
<quote>Shut up!</quote> 
</entry> 
<entry> 
<source>Knights of Ni</source> 
<quote>Ni! Ni! Ni! Ni!</quote> 
</entry> 
<entry> 
<source>Arthur</source> 
<quote>Shut up! Will you shut up?!</quote> 
</entry> 
</db>
<headers>
<header>Help, help, I'm being repressed!</header>
<header>I'm not dead!</header>
<header>King Arthur</header>
<header>The Knights Who Say Ni demand a sacrifice!</header>
<header>Sets The Cinema Back 900 Years!</header>
<header>Am I right?</header>
<header>Dennis the Peasant</header>
<header>Sir Lancelot</header>
<header>How do you know she is a witch?</header>
<header>First shalt thou take out the Holy Pin</header>
<header>We want a shrubbery!!</header>
<header>Makes Ben Hur look like an Epic!</header>
<header>What&hellip; is your quest?</header>
<header>Blue. No, yel&hellip;</header>
<header>Bridgekeeper</header>
<header>What a strange person</header>
</headers>
</fillerama>


In [223]:
# Don't worry about this part. Our parser breaks on entities 
# like &hellip;, which are defined in HTML but not in plain XML, 
# so I'm just taking them out. 
import re
cleanText = re.sub(r'&\w+;', '', response.text)

Parse our XML.

In [224]:
parsedXml = xmltodict.parse(cleanText)
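
The cleanup step above throws the entities away. If you'd rather keep the characters they stand for, one possible approach (a sketch, with a helper name of my own invention) is to translate HTML entities with the standard `html` library, while leaving alone the five entities that XML itself understands.

```python
import html
import re

# The five entities that are predefined in XML and safe to leave in place.
XML_ENTITIES = {"amp", "lt", "gt", "quot", "apos"}

def replaceHtmlEntities(text):
    def convert(match):
        name = match.group(1)
        if name in XML_ENTITIES:
            return match.group(0)  # leave XML's own entities untouched
        return html.unescape(match.group(0))  # e.g. &hellip; becomes …
    return re.sub(r"&(\w+);", convert, text)

sample = "<header>What&hellip; is your quest? Tea &amp; biscuits</header>"
fixed = replaceHtmlEntities(sample)
print(fixed)
```

The `sample` line is adapted from the response above, with an `&amp;` added to show that XML's own entities survive.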

Now let's write something to pretty-print these quotes. 

In [226]:
quotes = parsedXml['fillerama']['db']['entry']
for quote in quotes: 
    print(quote['source'] + ': ', quote['quote'])

Arthur:  Shut up!
Knights of Ni:  Ni! Ni! Ni! Ni!
Arthur:  Shut up! Will you shut up?!


## Exercise

Use Fillerama to get filler text from another movie or TV show, and then display it nicely using a technique similar to my example above.