# CKAN Database

CKAN is a widely used open-source data management system that governments around the world use to publish their data. This notebook walks through querying a CKAN instance (catalog.data.gov) through its API.


**Information on the API can be found here:** [CKAN API guide](https://docs.ckan.org/en/latest/api/#example-importing-datasets-with-the-ckan-api). General information on CKAN and its participants can be found here: [CKAN official website](https://ckan.org).

requests documentation: [here](https://requests.readthedocs.io/en/latest/)

pandas documentation: [here](https://pandas.pydata.org/docs/)

In [47]:
import requests
import pandas as pd

## Using Requests to access the API

The code below uses requests to call the API and shows, in general terms, how the returned data is unpacked.

In [88]:
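# package_search returns full dataset records and accepts Solr query parameters;
# the help text and the result['results'] lookups further down belong to this action
ckan_url = "http://catalog.data.gov/api/3/action/package_search"

api_token = "45tYhqFq71zd3xYo29eMgLESXiNml4Xxm9JfMmTl"

# The key is sent in the X-Api-Key request header
auth = {
    'X-Api-Key': api_token
}

response = requests.get(url=ckan_url,
                        headers=auth)

# Stop here if the request did not succeed
assert response.status_code == 200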

In [83]:
print(response.status_code)

200


The request above queries the data.gov catalog of publicly available datasets. The JSON body of the response can be parsed into a dictionary:

In [4]:
response_dict = response.json()

The response holds a lot of information. A good first step is to look at its top-level keys; later we can narrow the results down to a list of dataset names, sources, and URLs.

In [5]:
response_dict.keys()

dict_keys(['help', 'success', 'result'])
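The three keys above form the envelope that CKAN wraps around every action result. The sketch below shows one way to unpack it, assuming the documented convention that `success` is `False` and an `error` key is present when a call fails. The names `CkanApiError` and `unwrap_result` are hypothetical helpers, and both payloads are made-up stand-ins rather than real API output.

```python
class CkanApiError(Exception):
    """Raised when a CKAN response reports success == False."""


def unwrap_result(payload):
    """Return payload['result'] if the call succeeded, otherwise raise."""
    if not payload.get("success", False):
        # On failure CKAN reports details under an 'error' key.
        raise CkanApiError(payload.get("error", "unknown error"))
    return payload["result"]


# Made-up payloads shaped like the response above (not real API output).
ok_payload = {
    "help": "https://example.invalid/help",
    "success": True,
    "result": {"count": 2, "results": [{"name": "a"}, {"name": "b"}]},
}
bad_payload = {
    "help": "https://example.invalid/help",
    "success": False,
    "error": {"__type": "Validation Error"},
}

result = unwrap_result(ok_payload)
print(result["count"])
```

In the notebook, `unwrap_result(response.json())` would take the place of indexing `['result']` directly, so a failed call surfaces as an exception instead of a `KeyError` later on.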

### Note: To get help...

In [29]:
help_url = response_dict['help']

help_response = requests.get(url = help_url,
                           headers = auth)

assert help_response.status_code == 200

print(help_response.json()['result'].replace('`','\''))


    Searches for packages satisfying a given search criteria.

    This action accepts solr search query parameters (details below), and
    returns a dictionary of results, including dictized datasets that match
    the search criteria, a search count and also facet information.

    **Solr Parameters:**

    For more in depth treatment of each paramter, please read the
    'Solr Documentation
    <https://lucene.apache.org/solr/guide/6_6/common-query-parameters.html>'_.

    This action accepts a *subset* of solr's search query parameters:


    :param q: the solr query.  Optional.  Default: ''"*:*"''
    :type q: string
    :param fq: any filter queries to apply.  Note: ''+site_id:{ckan_site_id}''
        is added to this string prior to the query being executed.
    :type fq: string
    :param fq_list: additional filter queries to apply.
    :type fq_list: list of strings
    :param sort: sorting of the search results.  Optional.  Default:
        '''score desc, metadata_modified 
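The parameters described in the help text can be collected in a plain dictionary and passed to `requests.get` through `params`. The sketch below only builds and encodes such a dictionary, so it runs without touching the network. `build_search_params` is a hypothetical helper, and the filter `tags:climate` is an illustrative value in Solr's `field:value` form, not a query taken from this notebook.

```python
from urllib.parse import urlencode, parse_qs


def build_search_params(q="*:*", fq=None, rows=15, start=0, sort=None):
    """Assemble package_search parameters, leaving out unset optional ones."""
    params = {"q": q, "rows": rows, "start": start}
    if fq:
        params["fq"] = fq      # filter query, e.g. "field:value"
    if sort:
        params["sort"] = sort  # e.g. "metadata_modified desc"
    return params


# Illustrative filter value (an assumption, not from the notebook).
params = build_search_params(fq="tags:climate", rows=15, start=30)

# urlencode shows the query string that would be appended to the URL.
query_string = urlencode(params)
print(query_string)
```

Keeping parameter assembly separate from the request makes it easy to inspect exactly what will be sent before spending any of the API quota.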

### Continuing with unpacking datasets using the Solr query parameters

In [107]:
# Helpers for retrieving API data in chunks (pages)
class TooMuchError(Exception):
    pass


def getLimits():
    """Return the rate-limit headers reported by the API, if any."""
    response = requests.get('http://catalog.data.gov/api/3',
                            headers=auth)
    print(response.headers)
    # .get() returns None for any header the server does not send
    rate_limits = {
        'rate_limit': response.headers.get("X-RateLimit-Limit"),
        'rate_limit_remaining': response.headers.get("X-RateLimit-Remaining"),
        'rate_limit_reset': response.headers.get("X-RateLimit-Reset")
    }
    return rate_limits

def hitMetadata(query, url, auth, start=0, results=None, rows=15):
    """Page through the search results for `query` and return them as one list."""
    if not auth:
        raise ValueError("you need a key! Pass the header dict as auth=...")
    if results is None:
        results = []

    # loop instead of recursing so large result sets cannot hit the recursion limit
    while True:
        # init params for one page of `rows` results
        params = {
            'fq': query,
            'rows': rows,
            'start': start
        }

        # get request
        response = requests.get(url=url,
                                params=params,
                                headers=auth)
        # pass status
        assert response.status_code == 200

        # get json dump and collect this page
        page = response.json()['result']['results']
        results += page

        # paginate: advance the offset and report progress
        start += rows
        print(start)

        # an empty page means there is nothing left to fetch
        if not page:
            return results

    # Disabled guard against oversized queries, kept for reference:
    # if json_dump['result']['count'] >= 100:
    #     error_message = '''
    #     You are trying to retrieve too much data at once and you are going to get me banned from the api,
    #     break your query up into more bite-size chunks and do this again, or if you are using my key
    #     (or yours) and you are trying to hit the api, wait ten minutes between each large query
    #     '''
    #     raise TooMuchError(error_message)
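`getLimits` returns the rate-limit headers as raw strings, or `None` when a header is absent. The sketch below shows one way to turn them into numbers and decide whether to pause before the next large query. `parse_rate_limits`, `should_pause`, and the threshold of 10 are hypothetical, the sample headers are made up, and whether the server sends these headers at all is an assumption.

```python
def parse_rate_limits(headers):
    """Convert rate-limit header values to ints, or None if missing/invalid."""

    def to_int(value):
        try:
            return int(value)
        except (TypeError, ValueError):
            return None

    return {
        "limit": to_int(headers.get("X-RateLimit-Limit")),
        "remaining": to_int(headers.get("X-RateLimit-Remaining")),
    }


def should_pause(limits, min_remaining=10):
    """True when fewer than min_remaining requests are known to be left."""
    remaining = limits["remaining"]
    return remaining is not None and remaining < min_remaining


# Made-up header values for illustration only.
sample_headers = {"X-RateLimit-Limit": "1000", "X-RateLimit-Remaining": "7"}
limits = parse_rate_limits(sample_headers)
print(limits, should_pause(limits))
```

A check like `should_pause(parse_rate_limits(response.headers))` could run between pages so that a long crawl backs off before the quota runs out.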
    
    

In [108]:
query = 'organization.name = nasa'
'2a9ff0f7-11d8-41c4-b017-52c4fe0f260a'
metadata = hitMetadata(query, url = ckan_url, auth = auth, start = 0)

15
30
45
...
1455
1470
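The offsets printed above advance by 15 until a request comes back empty. The same stop-on-empty-page pattern can be written as a generator that is independent of how a page is fetched, which also makes it testable without the API. In this sketch `paginate`, `fake_fetch`, and `FAKE_DATASETS` are hypothetical names, and the 40 fake records stand in for real search results.

```python
def paginate(fetch_page, rows=15, start=0, max_pages=1000):
    """Yield records page by page until fetch_page returns an empty page.

    fetch_page(start, rows) must return a list of records; max_pages is a
    safety cap so a misbehaving source cannot loop forever.
    """
    for _ in range(max_pages):
        page = fetch_page(start, rows)
        if not page:
            return
        yield from page
        start += rows


# Stand-in data source: 40 fake records instead of a live API call.
FAKE_DATASETS = [{"name": f"dataset-{i}"} for i in range(40)]


def fake_fetch(start, rows):
    return FAKE_DATASETS[start:start + rows]


names = [record["name"] for record in paginate(fake_fetch, rows=15)]
print(len(names))
```

Against the real API, `fetch_page` could wrap the `requests.get` call from `hitMetadata` and return `response.json()['result']['results']`. Because the generator is lazy, a caller can stop early without downloading every page.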


In [110]:
metadata[:100]

[{'author': None,
  'author_email': None,
  'creator_user_id': '1ecd1fb1-1be6-46bb-b90d-07a0762ed104',
  'id': '74780733-0e8b-48cc-9a22-8daa9e6ab2f9',
  'isopen': False,
  'license_id': None,
  'license_title': None,
  'maintainer': None,
  'maintainer_email': None,
  'metadata_created': '2023-08-30T04:56:11.061072',
  'metadata_modified': '2023-08-30T04:56:11.061077',
  'name': 'enviroatlas-potential-evapotranspiration-1950-2099-for-the-conterminous-united-states1',
  'notes': 'The EnviroAtlas Climate Scenarios were generated from NASA Earth Exchange (NEX) Downscaled Climate Projections (NEX-DCP30) ensemble averages (the average of over 30 available climate models) for each of the four representative concentration pathways (RCP) for the contiguous U.S. at 30 arc-second (approx. 800 m2) spatial resolution. In addition to the three climate variables provided by the NEX-DCP30 dataset (minimum monthly temperature, maximum monthly temperature, and precipitation) a corresponding estimate of