# JSON and Library Catalog Data

How can we extract data from APIs (machine-readable online data sources)?

In this lesson, we look at how to use data from the web.
We will work with real-world data from
[The National Library of Norway](https://www.nb.no/),
which offers a [search API](https://api.nb.no/).

## JSON

JSON is a machine-readable data format.
Machine-readable data is easy for a computer to read and process.
JSON data is usually tree-structured: values can be nested inside other values over several levels.
This is similar to a directory tree, where folders contain files and other folders.
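
The tree structure can be seen in a small example. The sketch below uses a made-up JSON document (not real catalog data) and Python's built-in `json` module, which is introduced in the next section. All field names and values here are invented for illustration.

```python
import json

# A small, made-up JSON document (not real catalog data).
sample_json = """
{
  "title": "Example book",
  "creators": ["Author One", "Author Two"],
  "links": {"self": {"href": "https://example.org/items/1"}}
}
"""

# json.loads turns the JSON text into nested Python dictionaries and lists.
record = json.loads(sample_json)

print(record["title"])                  # a value at the top level
print(record["creators"][0])            # the first element of a list
print(record["links"]["self"]["href"])  # a value two levels further down
```

Each pair of square brackets moves one level down the tree, in the same way that each folder name in a file path moves one level down a directory tree.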

## Fetching Data
To fetch data from the web, we can use a library called `requests`, which makes this task quite easy.
Since we are fetching data in the JSON format, we will also import a library to decode JSON data.
Libraries are collections of code written by others that we can utilize instead of
writing everything from scratch ourselves.

In [None]:
import requests
import json

We need to specify the URL to the data we want to fetch.

In [None]:
URL = "https://api.nb.no/catalog/v1/items?digitalAccessibleOnly=true&size=3&filter=mediatype:bøker&q=Bing,Jon"

We include some parameters that specify which items we want to load:
- `digitalAccessibleOnly` specifies that we only want documents that are available in full text
- `size` is the number of items to fetch
- `filter` narrows the search, in this case to books
- `q` is the search query

More parameters are listed in the [API documentation](https://api.nb.no/).
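
Instead of typing the query string by hand, the parameters can be kept in a dictionary and joined into a URL. The following is a minimal sketch using Python's standard `urllib.parse` module; the names `BASE_URL`, `params`, and `built_url` are chosen for this illustration only.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

BASE_URL = "https://api.nb.no/catalog/v1/items"

# The same parameters as in the URL above, one entry per parameter.
params = {
    "digitalAccessibleOnly": "true",
    "size": 3,
    "filter": "mediatype:bøker",
    "q": "Bing,Jon",
}

# urlencode joins the pairs with "&" and percent-encodes
# special characters such as "ø", ":" and ",".
built_url = f"{BASE_URL}?{urlencode(params)}"
print(built_url)

# Decoding the query string gives back the original values.
decoded = parse_qs(urlsplit(built_url).query)
print(decoded["filter"])
```

The printed URL looks different from the handwritten one because special characters are percent-encoded, but it carries the same parameter values. The `requests` library can also do this step for us: `requests.get` accepts a `params` argument that takes such a dictionary.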

Now, let's fetch the data.

In [None]:
data = requests.get(URL).json()

This line both fetches and decodes the JSON data. We can also do this step by step, to see how the process works.
If you don't want to get into the details at this point, you can skip ahead to the section "Using the Data".

The server response also contains metadata, but we want the content:

In [None]:
response = requests.get(URL)
content = response.content

We can look at the first 100 bytes of the raw data. We see the same data if we open the URL in a web browser:
https://api.nb.no/catalog/v1/items?digitalAccessibleOnly=true&size=3&filter=mediatype:bøker&q=Bing,Jon

In [None]:
print(content[:100])

To use the data, we must decode it. The content is a sequence of bytes, so we first turn it into text by specifying the character set, which is often UTF-8. Then we decode the JSON text into a Python dictionary.

In [None]:
text = content.decode("utf-8")
data = json.loads(text)
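
The two decoding steps can also be tried without a network connection. The sketch below uses a made-up response body (not real catalog data) stored as bytes, the same type as `response.content`. The names `raw`, `sample_text`, and `sample` are chosen for this illustration.

```python
import json

# Made-up response body, stored as bytes the way `response.content` is.
raw = '{"title": "Bøker og blader"}'.encode("utf-8")

# In the raw bytes, the letter "ø" appears as the two bytes \xc3\xb8.
print(raw[:20])

sample_text = raw.decode("utf-8")  # bytes -> str
sample = json.loads(sample_text)   # str -> dict

print(type(sample).__name__, sample["title"])
```

Decoding with the wrong character set would garble letters such as "ø", which is why the character set must match the one the server used.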

## Using the Data

We can print the data, but this is a lot of text:

In [None]:
print(data)

Instead, we can print only the top-level keys. Calling `list()` on a dictionary returns a list of its keys:

In [None]:
keys = list(data)
print(keys)

Our search results only contain the first few items. We say that we are viewing a *page* of the results.
The field `page` contains information about the current page.
This is a dictionary, and we can print the information:

In [None]:
page = data['page']
print(page)

The field `totalElements` contains the number of hits in the database. This is usually different from the number of items we requested, which is found in the field `size`.
If `totalElements` is zero, we don't have any results and need to check the query in the URL.

In [None]:
total = page['totalElements']
print(total)

That looks good. Let's fetch the list of items:

In [None]:
embedded = data['_embedded']
items = embedded['items']

Now we can inspect each item. Let's loop over the items and get some of the information.
The data contains various metadata about each item, such as the item ID.

It's often useful to look at the data in a web browser to get an overview.

In [None]:
for item in items:
    print("item ID:", item['id'])

Each item has a `metadata` dictionary that holds, among other things, the title.

In [None]:
for item in items:
    metadata = item['metadata']
    print("Item title:", metadata['title'])
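
If a record happens to lack a field, indexing with square brackets stops the program with a `KeyError`. One way to guard against that is the dictionary method `.get()`, which returns a fallback value instead. The sketch below uses made-up items that mimic the structure above; the second one deliberately has no title.

```python
# Made-up items that mimic the structure used above (not real catalog data).
sample_items = [
    {"id": "item-1", "metadata": {"title": "First example"}},
    {"id": "item-2", "metadata": {}},  # no title on purpose
]

titles = []
for item in sample_items:
    metadata = item.get("metadata", {})
    # .get returns the fallback instead of raising KeyError
    # when the key is missing.
    title = metadata.get("title", "(no title)")
    titles.append(title)
    print("Item title:", title)
```

Whether to use a fallback or let the program stop is a design choice: a fallback keeps the loop running, while an error makes missing data impossible to overlook.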

### <span style="color:green"> Exercise: Creator </span>

Complete the code below to print the creator of each item.
You will need to browse the [data](https://api.nb.no/catalog/v1/items?digitalAccessibleOnly=true&size=3&filter=mediatype:bøker&q=Bing,Jon).

In [None]:
import requests
import json

URL = "https://api.nb.no/catalog/v1/items?digitalAccessibleOnly=true&size=3&filter=mediatype:bøker&q=Bing,Jon"
data = requests.get(URL).json()
embedded = data['_embedded']
items = embedded['items']

for item in items:
    metadata = item['metadata']
    print("Item title:", metadata['title'])
    #your code here

## Following the Path

As mentioned, JSON data is a tree structure. It can contain many nested levels. In that case, we need to follow the path to find the entry we're looking for.
It's usually advisable to follow the path one step at a time. This makes it easier to find errors in our programs.

In [None]:
URL = "https://api.nb.no/catalog/v1/items?digitalAccessibleOnly=true&size=3&filter=mediatype:bøker&q=Bing,Jon"
data = requests.get(URL).json()
embedded = data['_embedded']
items = embedded['items']

for item in items:
    metadata = item['metadata']
    print("Item title:", metadata['title'])
    
    # Step by step:
    links = item['_links']
    self_link = links['self']  # we avoid the plain name 'self', which has a conventional meaning in Python classes
    href = self_link['href']
    print("Item URL:", href)

    # We start a new path from links:
    thumbnail_large = links['thumbnail_large']
    thumbnail_URL = thumbnail_large['href']  # the name 'href' is already in use above
    print('Thumbnail:', thumbnail_URL)

    # Extra linebreak:
    print()
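
When the same kind of path is followed many times, the individual steps can be wrapped in a small helper. The function `follow_path` below is a hypothetical helper written for this lesson, not part of `requests` or `json`. It walks one key at a time and reports where the path broke. The sample item is made up and only mirrors the nesting used above.

```python
def follow_path(data, path):
    """Follow a list of keys into nested dictionaries, one step at a time."""
    current = data
    for depth, key in enumerate(path):
        if not isinstance(current, dict) or key not in current:
            # Report how far we got before the path broke.
            walked = " -> ".join(path[:depth]) or "(top level)"
            raise KeyError(f"'{key}' not found after {walked}")
        current = current[key]
    return current


# Made-up item with the same nesting as in the loop above.
sample_item = {
    "_links": {
        "self": {"href": "https://example.org/items/1"},
    }
}

print(follow_path(sample_item, ["_links", "self", "href"]))

# This sample has no thumbnail, so the helper reports the missing step.
try:
    follow_path(sample_item, ["_links", "thumbnail_large", "href"])
except KeyError as error:
    print("Missing:", error)
```

The error message names the last key that was found, which serves the same purpose as following the path by hand: it shows at which level the data differs from what we expected.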

### <span style="color:green"> Exercise: Presentation and URN </span>

The field `presentation` contains a link to the full text.
Complete the code below to print the `presentation` URL and the 'URN' of each item.

You will need to browse the [data](https://api.nb.no/catalog/v1/items?digitalAccessibleOnly=true&size=3&filter=mediatype:bøker&q=Bing,Jon).

In [None]:
import requests
import json

URL = "https://api.nb.no/catalog/v1/items?digitalAccessibleOnly=true&size=3&filter=mediatype:bøker&q=Bing,Jon"
data = requests.get(URL).json()
embedded = data['_embedded']
items = embedded['items']

for item in items:
    metadata = item['metadata']
    print("Item title:", metadata['title'])
    #your code here

## Working with Lists

Each item in the data set contains a list of one or more creators.
These lists are located in each item's `metadata` field.
We can use a `for`-loop to process the list items:

In [None]:
for item in items:
    metadata = item['metadata']
    print("Item title:", metadata['title'])
    creators = metadata['creators']
    for creator in creators:
        print('Creator:', creator)
    print()  # insert empty line
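
A list of strings can also be combined into a single line of output. The sketch below assumes that each creator is a plain string, and uses made-up items (not real catalog data) with the same `metadata` layout as above.

```python
# Made-up items with the same layout as above (not real catalog data).
sample_items = [
    {"metadata": {"title": "First example",
                  "creators": ["Author One", "Author Two"]}},
    {"metadata": {"title": "Second example",
                  "creators": ["Author Three"]}},
]

lines = []
for item in sample_items:
    metadata = item["metadata"]
    # ", ".join glues the list elements together,
    # separated by a comma and a space.
    creators = ", ".join(metadata["creators"])
    lines.append(f"{metadata['title']}: {creators}")

for line in lines:
    print(line)
```

Joining works for lists of any length, including lists with a single creator, so no special case is needed.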

## <span style="color:blue">Key Points</span>

- The `requests` library can be used to fetch data from the web.
- Many data providers offer an API from which we can fetch data programmatically.
- Parameters can be used to control what data we get from an API.
- Many APIs provide data in the JSON format, and JSON is well supported in Python.
- Additional filtering and processing of the retrieved data can be done using loops and conditions (`if`-statements, next chapter).