# Background Information

Today we'll be going through the full process of requesting the data we want from a REST API. For this workshop, we're focusing on Zenodo. Zenodo is an online Open Science research repository, which houses close to two and a half million publications, datasets, codebases, and other research-related items. This makes it a great API to explore because there is a ton of data that we can collect.

[Zenodo](https://zenodo.org)

Before we get started with any API, we need to check out the documentation to get familiar with the querying process. A lot of APIs that you run into will work pretty similarly, but we still need to get the specifics.

[Zenodo API](https://developers.zenodo.org)

Taking a look at the documentation, there are a few key points to take note of.
First, we see that Zenodo uses a REST API, which means that we can use the ```requests``` library in Python to make our queries.

Second, we see that there's a section on authenticating our requests over HTTPS. Although the section outlines that the requests will fail if they're not authenticated, we don't actually need to go through the process of creating a personal token and authenticating for simple data queries to Zenodo. I'm not sure why, but it makes this demo a bit less cumbersome.

Third, we see that the base URL for our requests is given (```https://zenodo.org/api/```), so we need to take note of that for later.

Continuing through, we see that responses are returned as JSON objects and that successful responses will have a ```status_code``` of 200.

Now for the actual queries, we want to search over the records. This section tells us that the query URL is going to be ```'https://zenodo.org/api/records'```, and it gives us some insight into the different parameters that we can use to perform a search. For this application, let's say that we're interested in collecting all of the records that Zenodo has on basketball. Since we want to collect all of the records, we won't need to worry about a lot of the parameters that restrict the search. Instead, we'll mainly be focusing on:

- q: The Elasticsearch query string search query
- page: Page number of results
- size: The number of results per page.

This means that the results are paginated, which is common for APIs that could potentially return a lot of data. If our query had a few hundred thousand results, returning all of them in one set wouldn't be feasible.
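To get a feel for what pagination implies, the number of requests needed is just the total number of results divided by the page size, rounded up. Here is a minimal sketch of that arithmetic; `pages_needed` is a hypothetical helper written for this illustration, not something provided by Zenodo or `requests`.

```python
import math


def pages_needed(total_hits, size):
    """Return how many pages of `size` results cover `total_hits` records."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    return math.ceil(total_hits / size)


# A query with 250,000 results at 100 results per page
print(pages_needed(250_000, 100))  # 2500

# A partially filled last page still counts as a page
print(pages_needed(25, 10))  # 3
```

So the larger the page size the API allows, the fewer round trips we need to make.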

Now that we've got a rough idea of how we can work with the API, let's dive into it.

# Importing the necessary libraries

There are a few different libraries that we're going to be using, so let's start out by importing them all at the top in order to keep the full code a bit cleaner.

1. ```requests``` 
    Since we're working with a REST API, we're going to want to use the requests library in order to send our requests to the API.
    
2. ```pandas```
    Since we're making a handful of requests and need to store them, we're going to use Pandas DataFrames to keep all of our responses in one place.
    
3. ```flatten_json```
    We saw that the data is returned in a JSON format. It'll be helpful to be able to view the "nested" JSON data in a flatter form.

In [50]:
import requests
import pandas as pd
from flatten_json import flatten

# Making Requests

As a refresher, let's take a look at how we make a request in general.

In [19]:
# r = requests.get(url, params, headers)
r = requests.get("http://www.example.com/")
r

<Response [200]>

When we call the ```requests.get()``` function, it returns a response object, which is displayed with its response code. In this case we have a 200 response, which means the request was successful.
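If the numeric codes are unfamiliar, Python's standard library can translate them into their standard phrases. The sketch below uses `http.HTTPStatus` for that; `describe_status` and `is_success` are hypothetical helper names for this illustration, and treating the whole 2xx range as success is a general HTTP convention rather than anything specific to Zenodo.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '200 OK' for an HTTP status code."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


def is_success(code):
    """True for the 2xx family, which signals a successful request."""
    return 200 <= code < 300


for code in (200, 400, 404):
    outcome = "success" if is_success(code) else "failure"
    print(describe_status(code), "->", outcome)
```

With a real response object, the code to pass in is available as ```r.status_code```.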

Next, let's break down what the full result contains. Python has a neat built-in function called ```vars()``` which lets us see the dictionary representation of an object's attributes.

In [20]:
vars(r)

{'_content': b'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>
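The response output is long, so here is a smaller illustration of what ```vars()``` does. The `Reply` class is a made-up stand-in for this sketch and has nothing to do with how `requests` implements its response object; the point is only that ```vars()``` returns the attributes stored on an instance as a dictionary.

```python
class Reply:
    """Tiny stand-in for a response object (hypothetical, not from requests)."""

    def __init__(self, status_code, url):
        self.status_code = status_code
        self.url = url


reply = Reply(200, "http://www.example.com/")

# vars() returns the instance's attribute dictionary
attributes = vars(reply)
print(attributes)
```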

Now that we've got an idea about how these requests are formatted and how to make a request to Zenodo, let's try it out. 

## General Zenodo Requests

In [21]:
r = requests.get("https://zenodo.org/api/records")

Recall that the documentation mentioned that the response is given in JSON format, so let's take a look at what it gives.

In [25]:
output = r.json()
output.keys()

dict_keys(['aggregations', 'hits', 'links'])

We can see that the output is structured into "aggregations", "hits", and "links", so let's take a look at what each of those are.

In [26]:
output['aggregations']

{'access_right': {'buckets': [{'doc_count': 2282042, 'key': 'open'},
   {'doc_count': 48751, 'key': 'closed'},
   {'doc_count': 7861, 'key': 'restricted'},
   {'doc_count': 1154, 'key': 'embargoed'}],
  'doc_count_error_upper_bound': 0,
  'sum_other_doc_count': 0},
 'file_type': {'buckets': [{'doc_count': 1017189, 'key': 'pdf'},
   {'doc_count': 378670, 'key': 'png'},
   {'doc_count': 376719, 'key': 'jpg'},
   {'doc_count': 277663, 'key': 'html'},
   {'doc_count': 114234, 'key': 'zip'},
   {'doc_count': 27523, 'key': 'xlsx'},
   {'doc_count': 23999, 'key': 'txt'},
   {'doc_count': 20508, 'key': 'docx'},
   {'doc_count': 20031, 'key': 'csv'},
   {'doc_count': 18026, 'key': 'xml'}],
  'doc_count_error_upper_bound': 0,
  'sum_other_doc_count': 155872},
 'keywords': {'buckets': [{'doc_count': 958252, 'key': 'Taxonomy'},
   {'doc_count': 957121, 'key': 'Biodiversity'},
   {'doc_count': 613773, 'key': 'Animalia'},
   {'doc_count': 475703, 'key': 'Arthropoda'},
   {'doc_count': 328839, 'key':

In [27]:
output['hits']

{'hits': [{'conceptdoi': '10.5281/zenodo.6049499',
   'conceptrecid': '6049499',
   'created': '2022-02-11T22:19:44.693352+00:00',
   'doi': '10.5281/zenodo.6049500',
   'files': [{'bucket': 'dd8bf454-eda9-4a73-bef4-057c4a8ecdc9',
     'checksum': 'md5:48222928847fb6e8855dfca328923c5e',
     'key': 'treatment.html',
     'links': {'self': 'https://zenodo.org/api/files/dd8bf454-eda9-4a73-bef4-057c4a8ecdc9/treatment.html'},
     'size': 3946,
     'type': 'html'}],
   'id': 6049500,
   'links': {'badge': 'https://zenodo.org/badge/doi/10.5281/zenodo.6049500.svg',
    'bucket': 'https://zenodo.org/api/files/dd8bf454-eda9-4a73-bef4-057c4a8ecdc9',
    'conceptbadge': 'https://zenodo.org/badge/doi/10.5281/zenodo.6049499.svg',
    'conceptdoi': 'https://doi.org/10.5281/zenodo.6049499',
    'doi': 'https://doi.org/10.5281/zenodo.6049500',
    'html': 'https://zenodo.org/record/6049500',
    'latest': 'https://zenodo.org/api/records/6049500',
    'latest_html': 'https://zenodo.org/record/6049500

In [28]:
output['links']

{'next': 'https://zenodo.org/api/records/?sort=mostrecent&page=2&size=10',
 'self': 'https://zenodo.org/api/records/?sort=mostrecent&page=1&size=10'}

At first glance, it seems like aggregations is giving us some high level statistics about the breakdown of our query. Since we didn't make any specifications for search terms or time periods, it's reporting on all 2.3 million Zenodo records. The links data is giving some information on the current query and the next potential query page. And lastly, there's the hits data, which seems like what we really want. It looks like this is nested, though, so let's take a look at the breakdown of that.
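The links are ordinary URLs, so the page and size that a link points to can be read back out of its query string. This is a small sketch using only the standard library; `query_params` is a hypothetical helper name, and the URL is the ```next``` link from the output above.

```python
from urllib.parse import urlparse, parse_qs


def query_params(url):
    """Return the query-string parameters of `url` as a plain dict."""
    parsed = parse_qs(urlparse(url).query)
    # parse_qs returns a list of values per key; keep the first of each
    return {key: values[0] for key, values in parsed.items()}


next_link = "https://zenodo.org/api/records/?sort=mostrecent&page=2&size=10"
print(query_params(next_link))
```

Note that the values come back as strings, so ```page``` would need an ```int()``` conversion before doing arithmetic with it.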

In [30]:
output['hits'].keys()

dict_keys(['hits', 'total'])

In [31]:
output['hits']['total']

2339808

In [32]:
output['hits']['hits']

[{'conceptdoi': '10.5281/zenodo.6049499',
  'conceptrecid': '6049499',
  'created': '2022-02-11T22:19:44.693352+00:00',
  'doi': '10.5281/zenodo.6049500',
  'files': [{'bucket': 'dd8bf454-eda9-4a73-bef4-057c4a8ecdc9',
    'checksum': 'md5:48222928847fb6e8855dfca328923c5e',
    'key': 'treatment.html',
    'links': {'self': 'https://zenodo.org/api/files/dd8bf454-eda9-4a73-bef4-057c4a8ecdc9/treatment.html'},
    'size': 3946,
    'type': 'html'}],
  'id': 6049500,
  'links': {'badge': 'https://zenodo.org/badge/doi/10.5281/zenodo.6049500.svg',
   'bucket': 'https://zenodo.org/api/files/dd8bf454-eda9-4a73-bef4-057c4a8ecdc9',
   'conceptbadge': 'https://zenodo.org/badge/doi/10.5281/zenodo.6049499.svg',
   'conceptdoi': 'https://doi.org/10.5281/zenodo.6049499',
   'doi': 'https://doi.org/10.5281/zenodo.6049500',
   'html': 'https://zenodo.org/record/6049500',
   'latest': 'https://zenodo.org/api/records/6049500',
   'latest_html': 'https://zenodo.org/record/6049500',
   'self': 'https://zeno

So it looks like the total attribute of the hits is more high level information about the query, and the hits attribute contains the information that we're after. This looks pretty messy though, so we'll clean it up a bit by putting it into a DataFrame, which is easy to do since the JSON representations are stored as dictionaries.

In [34]:
df = pd.DataFrame(output['hits']['hits'])
df.head()

Unnamed: 0,conceptdoi,conceptrecid,created,doi,files,id,links,metadata,owners,revision,stats,updated
0,10.5281/zenodo.6049499,6049499,2022-02-11T22:19:44.693352+00:00,10.5281/zenodo.6049500,[{'bucket': 'dd8bf454-eda9-4a73-bef4-057c4a8ec...,6049500,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[1161],1,"{'downloads': 0.0, 'unique_downloads': 0.0, 'u...",2022-02-11T22:19:45.377116+00:00
1,10.5281/zenodo.6049485,6049485,2022-02-11T22:19:39.963132+00:00,10.5281/zenodo.6049486,[{'bucket': 'c5f62e7a-70cc-4661-bae5-e411af947...,6049486,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[142922],1,"{'downloads': 0.0, 'unique_downloads': 0.0, 'u...",2022-02-11T22:19:40.663969+00:00
2,10.5281/zenodo.6049497,6049497,2022-02-11T22:19:20.718097+00:00,10.5281/zenodo.6049498,[{'bucket': 'f764133f-c760-4652-9d20-529e87c74...,6049498,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[1161],1,"{'downloads': 0.0, 'unique_downloads': 0.0, 'u...",2022-02-11T22:19:21.373023+00:00
3,10.5281/zenodo.6049495,6049495,2022-02-11T22:18:59.782690+00:00,10.5281/zenodo.6049496,[{'bucket': '33f8f248-d02d-44a2-b140-ec4d12ab8...,6049496,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[1161],1,"{'downloads': 0.0, 'unique_downloads': 0.0, 'u...",2022-02-11T22:19:00.498724+00:00
4,10.5281/zenodo.6049493,6049493,2022-02-11T22:18:33.892665+00:00,10.5281/zenodo.6049494,[{'bucket': 'd94b76c5-5c77-4388-baa5-fd42625fb...,6049494,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[1161],1,"{'downloads': 0.0, 'unique_downloads': 0.0, 'u...",2022-02-11T22:18:34.919807+00:00


Looking at the DataFrame, there's even more JSON nested in the hits data. This is getting kind of twisted, so let's just flatten the data and take a look at that to see what we're really getting back. First, let's take a look at the ```flatten_json``` functionality.

In [39]:
flatten?

So the ```flatten``` function takes a dictionary, but we're dealing with a list of dictionaries, and we need to flatten all of them.
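To make concrete what flattening does to a single record, here is a rough pure-Python sketch of the idea. It is not the actual `flatten_json` implementation, and `flatten_sketch` is a hypothetical name; it only assumes that nested keys and list positions get joined with underscores, which matches column names like ```files_0_key``` that show up in the flattened DataFrame.

```python
def flatten_sketch(value, parent_key="", sep="_"):
    """Flatten nested dicts and lists into a single-level dict."""
    if isinstance(value, dict):
        children = value.items()
    elif isinstance(value, list):
        # List positions become part of the key: files_0, files_1, ...
        children = enumerate(value)
    else:
        # Leaf value: store it under the key built so far
        return {parent_key: value}

    items = {}
    for key, child in children:
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        items.update(flatten_sketch(child, new_key, sep))
    return items


record = {
    "id": 6049500,
    "files": [{"key": "treatment.html", "size": 3946, "type": "html"}],
}
flat = flatten_sketch(record)
print(sorted(flat))
```

Every nested level adds to the key, which is why deeply nested records turn into a large number of columns.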

In [51]:
flattened_hits = [flatten(json_dict) for json_dict in output['hits']['hits']]

That's actually going to come in handy a bunch, so we'll wrap it in a function.

In [52]:
def flatten_list(dicts):
    """Flatten iterable of nested dictionaries."""
    # Need to make sure we don't accidentally pass in the wrong data
    assert all([isinstance(d, dict) for d in dicts])
    
    return [flatten(d) for d in dicts]

In [42]:
df = pd.DataFrame(flattened_hits)
df.head()

Unnamed: 0,conceptdoi,conceptrecid,created,doi,files_0_bucket,files_0_checksum,files_0_key,files_0_links_self,files_0_size,files_0_type,...,files_4_size,files_4_type,files_5_bucket,files_5_checksum,files_5_key,files_5_links_self,files_5_size,files_5_type,metadata_creators_0_orcid,metadata_creators_1_affiliation
0,10.5281/zenodo.6049499,6049499,2022-02-11T22:19:44.693352+00:00,10.5281/zenodo.6049500,dd8bf454-eda9-4a73-bef4-057c4a8ecdc9,md5:48222928847fb6e8855dfca328923c5e,treatment.html,https://zenodo.org/api/files/dd8bf454-eda9-4a7...,3946,html,...,,,,,,,,,,
1,10.5281/zenodo.6049485,6049485,2022-02-11T22:19:39.963132+00:00,10.5281/zenodo.6049486,c5f62e7a-70cc-4661-bae5-e411af947284,md5:999a7ea04f5e5acf32817f7f99f5009c,5. Substitution des monnaies-RFEG-01-AWANA.pdf,https://zenodo.org/api/files/c5f62e7a-70cc-466...,704868,pdf,...,,,,,,,,,,
2,10.5281/zenodo.6049497,6049497,2022-02-11T22:19:20.718097+00:00,10.5281/zenodo.6049498,f764133f-c760-4652-9d20-529e87c748d1,md5:e26a1a3e213e2e1f42d8d2990c5efb07,treatment.html,https://zenodo.org/api/files/f764133f-c760-465...,2853,html,...,,,,,,,,,,
3,10.5281/zenodo.6049495,6049495,2022-02-11T22:18:59.782690+00:00,10.5281/zenodo.6049496,33f8f248-d02d-44a2-b140-ec4d12ab894d,md5:eecb1aff16ffe4ab2651fc7d4c8c0aac,treatment.html,https://zenodo.org/api/files/33f8f248-d02d-44a...,2835,html,...,,,,,,,,,,
4,10.5281/zenodo.6049493,6049493,2022-02-11T22:18:33.892665+00:00,10.5281/zenodo.6049494,d94b76c5-5c77-4388-baa5-fd42625fb584,md5:47802756fa1a2563916bc24bbc5860e2,treatment.html,https://zenodo.org/api/files/d94b76c5-5c77-438...,3738,html,...,,,,,,,,,,


That's much more manageable to read, but now our DataFrame has 153 columns instead of just 12. This can be handy if we need all of the data in a flat format, but definitely be careful when you're unnesting content that can balloon up like that. 

## Specific Zenodo Requests

Now that we've seen the general format of Zenodo requests, we can finally move forward with getting all of the specific basketball data that we want.

In order to make the request, we have to pass those parameters to the ```params``` argument in the ```requests.get()``` function, which takes a dictionary. Let's set it up now so that we can continue to reference it.
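It can help to see what a parameter dictionary turns into on the wire. The sketch below builds the query string by hand with the standard library's `urlencode`; `build_query_url` is a hypothetical helper for illustration only, since `requests` does this encoding for us when we pass ```params```.

```python
from urllib.parse import urlencode


def build_query_url(base_url, params):
    """Return `base_url` with `params` appended as a URL-encoded query string."""
    return f"{base_url}?{urlencode(params)}"


zenodo_records_url = "https://zenodo.org/api/records"
url = build_query_url(zenodo_records_url, {"q": "basketball", "page": 1, "size": 10})
print(url)
# https://zenodo.org/api/records?q=basketball&page=1&size=10
```

The result has the same shape as the ```self``` and ```next``` links that Zenodo returns in its output.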

In [44]:
zenodo_records_url = 'https://zenodo.org/api/records'

In [45]:
search_params = {'q': 'basketball'}

In [46]:
r = requests.get(url=zenodo_records_url, params=search_params)

In [53]:
output = r.json()
hits = output['hits']['hits']

In [54]:
df = pd.DataFrame(hits)
df.head()

Unnamed: 0,conceptdoi,conceptrecid,created,doi,files,id,links,metadata,owners,revision,stats,updated
0,10.5281/zenodo.1089029,1089029,2018-01-16T23:10:49.693282+00:00,10.5281/zenodo.1089030,[{'bucket': '36246f8b-1f87-49cf-8c69-4dd77f006...,1089030,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[32148],5,"{'downloads': 41.0, 'unique_downloads': 40.0, ...",2020-01-20T14:21:30.873221+00:00
1,10.5281/zenodo.1148789,1148789,2018-09-18T20:28:40.176686+00:00,10.5281/zenodo.1148790,[{'bucket': '41d12a34-6bec-424e-aaef-7dbf24a0e...,1148790,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[21883],4,"{'downloads': 11.0, 'unique_downloads': 10.0, ...",2020-01-20T16:29:18.605040+00:00
2,10.5281/zenodo.3890793,3890793,2020-06-12T04:35:43.477175+00:00,10.5281/zenodo.3890794,[{'bucket': '013e1566-eba3-401b-9695-95df07275...,3890794,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[104501],2,"{'downloads': 5.0, 'unique_downloads': 5.0, 'u...",2020-06-12T10:18:20.919328+00:00
3,10.5281/zenodo.1112400,1112400,2018-01-04T18:08:25.746394+00:00,10.5281/zenodo.1112401,[{'bucket': 'ea8b6c08-6008-4d04-8422-f4980c0b4...,1112401,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[21883],6,"{'downloads': 34.0, 'unique_downloads': 33.0, ...",2020-01-20T13:35:27.378940+00:00
4,10.5281/zenodo.4291144,4291144,2020-11-25T23:45:58.213063+00:00,10.5281/zenodo.4291145,[{'bucket': 'a5fee344-6682-4d4b-8f33-bb29f1882...,4291145,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[97381],2,"{'downloads': 20.0, 'unique_downloads': 19.0, ...",2020-11-26T00:27:14.980777+00:00


In [55]:
df.shape

(10, 12)

And now we can see that we're getting our basketball results back. We're only getting 10 results, though, so we need to use the pagination parameters that we read about earlier. To make this easier, let's put this into a function too.

First, let's quickly check the assumptions that we're making so that our function doesn't crash. We're assuming that we're going to search for a string, since that's what Zenodo requires. We're assuming that we get a proper 200 status code. We're also assuming that the results are found in the nested ```['hits']['hits']``` item of ```r.json()```. Is this always going to happen?

In [59]:
r = requests.get(zenodo_records_url, params={'q': 'fjkdajfeiowajtio'})

In [60]:
r.json()

{'aggregations': {'access_right': {'buckets': [],
   'doc_count_error_upper_bound': 0,
   'sum_other_doc_count': 0},
  'file_type': {'buckets': [],
   'doc_count_error_upper_bound': 0,
   'sum_other_doc_count': 0},
  'keywords': {'buckets': [],
   'doc_count_error_upper_bound': 0,
   'sum_other_doc_count': 0},
  'type': {'buckets': [],
   'doc_count_error_upper_bound': 0,
   'sum_other_doc_count': 0}},
 'hits': {'hits': [], 'total': 0},
 'links': {'self': 'https://zenodo.org/api/records/?sort=bestmatch&q=fjkdajfeiowajtio&page=1&size=10'}}

Now we see that ```output['hits']['hits']``` is an empty list, and an error response might not contain a ```hits``` key at all, so we need to be careful about just indexing the output dictionary. If we aren't sure whether a dictionary has a key, we can use the ```dict.get()``` method to be safe. By default, a missing key returns ```None``` instead of raising a ```KeyError```.
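One wrinkle with chaining ```dict.get()``` calls is that the first call may return ```None```, and ```None``` has no ```.get()``` method. Passing an empty dictionary as the default avoids that. A minimal sketch, where `extract_hits` is a hypothetical helper and the three payloads are trimmed-down stand-ins for the kinds of output seen in this notebook:

```python
def extract_hits(output):
    """Return the list of records in `output`, or an empty list if absent."""
    # Each .get() falls back to an empty container, so a missing key
    # never raises KeyError or AttributeError
    return output.get("hits", {}).get("hits", [])


full = {"hits": {"hits": [{"id": 1089030}], "total": 1}}
empty = {"hits": {"hits": [], "total": 0}}
error = {"status": 400, "message": "Maximum number of 10000 results have been reached."}

for payload in (full, empty, error):
    print(len(extract_hits(payload)))
```

All three cases return a list, so the calling code only has to check whether that list is empty.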

In [71]:
def get_zenodo_search_output(search_term, page, size):
    """Return the output for a Zenodo record query.
    
    Parameters
    ----------
    search_term : str
    page : int
    size : int
    
    Returns
    -------
    output_df : pandas.DataFrame
    """
    
    assert isinstance(search_term, str)
    assert isinstance(page, int)
    assert isinstance(size, int)
    
    search_url = 'https://zenodo.org/api/records'
    search_params = {'q': search_term, 'page': page, 'size': size}
    
    r = requests.get(search_url, params=search_params)
    output = r.json()
    
    if r.status_code == 200 and output.get('hits', {}).get('hits'):
        output_df = pd.DataFrame(output['hits']['hits'])
    else:
        return r
    
    return output_df

In [62]:
get_zenodo_search_output('basketball', 1, 20)

Unnamed: 0,conceptdoi,conceptrecid,created,doi,files,id,links,metadata,owners,revision,stats,updated
0,10.5281/zenodo.1089029,1089029,2018-01-16T23:10:49.693282+00:00,10.5281/zenodo.1089030,[{'bucket': '36246f8b-1f87-49cf-8c69-4dd77f006...,1089030,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[32148],5,"{'downloads': 41.0, 'unique_downloads': 40.0, ...",2020-01-20T14:21:30.873221+00:00
1,10.5281/zenodo.1148789,1148789,2018-09-18T20:28:40.176686+00:00,10.5281/zenodo.1148790,[{'bucket': '41d12a34-6bec-424e-aaef-7dbf24a0e...,1148790,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[21883],4,"{'downloads': 11.0, 'unique_downloads': 10.0, ...",2020-01-20T16:29:18.605040+00:00
2,10.5281/zenodo.3890793,3890793,2020-06-12T04:35:43.477175+00:00,10.5281/zenodo.3890794,[{'bucket': '013e1566-eba3-401b-9695-95df07275...,3890794,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[104501],2,"{'downloads': 5.0, 'unique_downloads': 5.0, 'u...",2020-06-12T10:18:20.919328+00:00
3,10.5281/zenodo.1112400,1112400,2018-01-04T18:08:25.746394+00:00,10.5281/zenodo.1112401,[{'bucket': 'ea8b6c08-6008-4d04-8422-f4980c0b4...,1112401,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[21883],6,"{'downloads': 34.0, 'unique_downloads': 33.0, ...",2020-01-20T13:35:27.378940+00:00
4,10.5281/zenodo.4291144,4291144,2020-11-25T23:45:58.213063+00:00,10.5281/zenodo.4291145,[{'bucket': 'a5fee344-6682-4d4b-8f33-bb29f1882...,4291145,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[97381],2,"{'downloads': 20.0, 'unique_downloads': 19.0, ...",2020-11-26T00:27:14.980777+00:00
5,,4017476,2020-09-07T12:50:39.955996+00:00,10.12775/PPS.2020.06.02.017,[{'bucket': '4cfdcb23-5bd7-4a95-aea7-548366481...,4017477,{'badge': 'https://zenodo.org/badge/doi/10.127...,"{'access_right': 'open', 'access_right_categor...",[4783],2,"{'downloads': 19.0, 'unique_downloads': 18.0, ...",2020-09-08T00:59:25.027445+00:00
6,,668063,2016-11-24T23:11:09.420689+00:00,10.5281/zenodo.164890,[{'bucket': '221e2f4a-8a4e-4f94-bce0-55835219f...,164890,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[21883],6,"{'downloads': 72.0, 'unique_downloads': 67.0, ...",2020-01-20T16:10:49.498529+00:00
7,10.5281/zenodo.2536463,2536463,2019-01-09T18:14:57.608168+00:00,10.5281/zenodo.2536464,[{'bucket': '00c6f3ef-80bb-4e17-9c18-36821b092...,2536464,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[31393],5,"{'downloads': 4.0, 'unique_downloads': 4.0, 'u...",2020-01-20T12:53:25.538815+00:00
8,10.5281/zenodo.5823438,5823438,2022-01-05T22:37:34.868041+00:00,10.5281/zenodo.5823439,[{'bucket': '6954518c-7f6f-4205-9b8b-8aa2f9117...,5823439,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[104501],2,"{'downloads': 6.0, 'unique_downloads': 6.0, 'u...",2022-01-06T01:48:50.742385+00:00
9,10.5281/zenodo.1062073,1062073,2018-01-31T22:34:33.904253+00:00,10.5281/zenodo.1062074,[{'bucket': 'd24aa1bf-4691-4537-8952-7913109d5...,1062074,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[32148],5,"{'downloads': 26.0, 'unique_downloads': 26.0, ...",2020-01-20T17:19:27.982554+00:00


Awesome, now we're able to get more results and only call one line of code to do it. Let's test it out.

In [67]:
output = get_zenodo_search_output('basketball', 100, 1000)
output

<Response [400]>

Now we see that we're getting a 400 response, so let's check it out.

In [68]:
vars(output)

{'_content': b'{"status": 400, "message": "Maximum number of 10000 results have been reached."}',
 '_content_consumed': True,
 '_next': None,
 'status_code': 400,
 'headers': {'Server': 'nginx', 'Date': 'Fri, 11 Feb 2022 23:03:50 GMT', 'Content-Type': 'application/json', 'Content-Length': '80', 'Access-Control-Allow-Origin': '*', 'Access-Control-Expose-Headers': 'Content-Type, ETag, Link, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset', 'X-RateLimit-Limit': '60', 'X-RateLimit-Remaining': '57', 'X-RateLimit-Reset': '1644620691', 'Retry-After': '60', 'X-Frame-Options': 'sameorigin', 'X-XSS-Protection': '1; mode=block', 'X-Content-Type-Options': 'nosniff', 'Strict-Transport-Security': 'max-age=0', 'Referrer-Policy': 'strict-origin-when-cross-origin'},
 'raw': <urllib3.response.HTTPResponse at 0x7fc8afe94730>,
 'url': 'https://zenodo.org/api/records/?q=basketball&page=100&size=1000',
 'encoding': None,
 'history': [<Response [308]>],
 'reason': 'BAD REQUEST',
 'cookies': <Requ

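So the API refuses to page beyond the first 10,000 results of a query. We asked for page 100 at 1,000 results per page, which would reach all the way to result 100,000.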
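A quick check before sending a request can catch this case. The sketch below assumes the limit applies to ```page * size```, that is, to the position of the last record on the requested page; that is an inference from the error message above rather than something stated in the documentation, and `within_result_window` is a hypothetical helper name.

```python
MAX_RESULT_WINDOW = 10_000  # taken from the error message above


def within_result_window(page, size, limit=MAX_RESULT_WINDOW):
    """True if the last record on `page` falls inside the first `limit` results."""
    return page * size <= limit


print(within_result_window(100, 1000))  # False: would reach result 100,000
print(within_result_window(4, 100))     # True: reaches result 400
```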

In [70]:
requests.get(zenodo_records_url, params={'q': 'basketball'}).json()['hits']['total']

354

Luckily we won't run into any 10,000 limit issues, but if we have time, let's try to figure out how we could get around this.

## Getting More Results

In order to get all of the records, we need to search over every page. Since we saw that there's a ```links``` attribute returned, let's keep following the ```next``` link as long as it exists. To see what behavior to expect when we run out of pages, let's request the last page of results (which we can do since we know how many total records there are).

In [74]:
r = requests.get(zenodo_records_url, params={'q': 'basketball', 'page': 4, 'size': 100})
output = r.json()

In [79]:
output['links']

{'prev': 'https://zenodo.org/api/records/?sort=bestmatch&q=basketball&page=3&size=100',
 'self': 'https://zenodo.org/api/records/?sort=bestmatch&q=basketball&page=4&size=100'}

When there are no more pages, they just don't return a ```next``` link. So a potential function for this could be something like
```python
def collect_all_hits(search_params):
    # Perform initial search
    r = requests.get(zenodo_records_url, params=search_params)
    output = r.json()

    # Stop early if the request failed or returned no hits
    if not (r.status_code == 200 and output.get('hits', {}).get('hits')):
        return None

    # Start from the first page of hits
    cumulative_df = pd.DataFrame(output['hits']['hits'])

    while r.status_code == 200 and output.get('links', {}).get('next'):
        # Follow the next link to get the next page of results
        r = requests.get(output['links']['next'])
        output = r.json()

        # Append results to df (empty if the request failed)
        hits_df = pd.DataFrame(output.get('hits', {}).get('hits', []))
        cumulative_df = pd.concat([cumulative_df, hits_df]).reset_index(drop=True)

    return cumulative_df
```
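The same follow-the-next-link pattern can be tried out without touching the network. In this sketch, `iter_pages` is a hypothetical generator and `fake_api` is made-up data standing in for Zenodo's output; the only thing it takes from the real API is the shape of the ```links``` and ```hits``` entries.

```python
def iter_pages(fetch, first_url):
    """Yield each page of output, following links['next'] until it is absent."""
    url = first_url
    while url:
        output = fetch(url)
        yield output
        # A missing 'next' link gives None, which ends the loop
        url = output.get("links", {}).get("next")


# Fake three-page API keyed by URL, so the loop can be run offline
fake_api = {
    "page1": {"hits": {"hits": [1, 2]}, "links": {"next": "page2"}},
    "page2": {"hits": {"hits": [3, 4]}, "links": {"next": "page3"}},
    "page3": {"hits": {"hits": [5]}, "links": {}},
}

all_hits = []
for page in iter_pages(fake_api.__getitem__, "page1"):
    all_hits.extend(page["hits"]["hits"])

print(all_hits)  # [1, 2, 3, 4, 5]
```

Against the real API, the `fetch` argument could be something like ```lambda url: requests.get(url).json()```. Yielding each page before looking for the next link is what makes sure the last page is included.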

In [89]:
def get_zenodo_search_output(search_term):
    """Return the output for a Zenodo record query.
    
    Parameters
    ----------
    search_term : str
    
    Returns
    -------
    cumulative_df : pandas.DataFrame
    """
    
    assert isinstance(search_term, str)
    
    search_url = 'https://zenodo.org/api/records'
    search_params = {'q': search_term, 'size': 100}

    # Perform initial search
    print('first search')
    r = requests.get(search_url, params=search_params)
    print('finished first search')
    output = r.json()

    # Stop early if the request failed or returned no hits
    if not (r.status_code == 200 and output.get('hits', {}).get('hits')):
        return None

    cumulative_df = pd.DataFrame()

    while r.status_code == 200 and output['links'].get('next'):
        print('searching...')
        # Get hits output
        hits_df = pd.DataFrame(output['hits']['hits'])

        # Append results to df
        cumulative_df = pd.concat([cumulative_df, hits_df]).reset_index(drop=True)

        # Get new results
        r = requests.get(output['links'].get('next'))
        output = r.json()
    
    hits_df = pd.DataFrame(output['hits']['hits'])
    cumulative_df = pd.concat([cumulative_df, hits_df])

    return cumulative_df

In [90]:
get_zenodo_search_output('basketball')

first search
finished first search
searching...
searching...
searching...


Unnamed: 0,conceptdoi,conceptrecid,created,doi,files,id,links,metadata,owners,revision,stats,updated
0,10.5281/zenodo.1089029,1089029,2018-01-16T23:10:49.693282+00:00,10.5281/zenodo.1089030,[{'bucket': '36246f8b-1f87-49cf-8c69-4dd77f006...,1089030,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[32148],5,"{'downloads': 41.0, 'unique_downloads': 40.0, ...",2020-01-20T14:21:30.873221+00:00
1,10.5281/zenodo.1148789,1148789,2018-09-18T20:28:40.176686+00:00,10.5281/zenodo.1148790,[{'bucket': '41d12a34-6bec-424e-aaef-7dbf24a0e...,1148790,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[21883],4,"{'downloads': 11.0, 'unique_downloads': 10.0, ...",2020-01-20T16:29:18.605040+00:00
2,10.5281/zenodo.3890793,3890793,2020-06-12T04:35:43.477175+00:00,10.5281/zenodo.3890794,[{'bucket': '013e1566-eba3-401b-9695-95df07275...,3890794,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[104501],2,"{'downloads': 5.0, 'unique_downloads': 5.0, 'u...",2020-06-12T10:18:20.919328+00:00
3,10.5281/zenodo.1112400,1112400,2018-01-04T18:08:25.746394+00:00,10.5281/zenodo.1112401,[{'bucket': 'ea8b6c08-6008-4d04-8422-f4980c0b4...,1112401,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[21883],6,"{'downloads': 34.0, 'unique_downloads': 33.0, ...",2020-01-20T13:35:27.378940+00:00
4,10.5281/zenodo.4291144,4291144,2020-11-25T23:45:58.213063+00:00,10.5281/zenodo.4291145,[{'bucket': 'a5fee344-6682-4d4b-8f33-bb29f1882...,4291145,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[97381],2,"{'downloads': 20.0, 'unique_downloads': 19.0, ...",2020-11-26T00:27:14.980777+00:00
...,...,...,...,...,...,...,...,...,...,...,...,...
49,10.5281/zenodo.4499538,4499538,2021-02-04T01:36:34.252749+00:00,10.5281/zenodo.4499539,[{'bucket': '7ee76c88-7f00-460a-bdf6-cd479bfe2...,4499539,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[69890],2,"{'downloads': 28.0, 'unique_downloads': 22.0, ...",2021-02-15T12:50:07.287721+00:00
50,,629603,2016-02-16T08:57:50+00:00,10.5281/zenodo.45641,[{'bucket': '21225db2-b232-4146-b0d8-5a28e49d8...,45641,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[4783],9,"{'downloads': 151.0, 'unique_downloads': 142.0...",2020-01-20T16:29:23.274298+00:00
51,,630005,2016-02-25T05:03:11+00:00,10.5281/zenodo.45840,[{'bucket': '5da1d940-0af4-4e38-b4b3-741bb1ed4...,45840,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[4783],10,"{'downloads': 368.0, 'unique_downloads': 357.0...",2020-01-20T17:16:07.290985+00:00
52,,630139,2016-02-23T08:53:38+00:00,10.5281/zenodo.46140,[{'bucket': '9f9f7ee2-2800-4ab9-93ca-875569f6e...,46140,{'badge': 'https://zenodo.org/badge/doi/10.528...,"{'access_right': 'open', 'access_right_categor...",[4783],10,"{'downloads': 339.0, 'unique_downloads': 321.0...",2020-01-20T17:35:16.749075+00:00
