# CKAN Database

CKAN is a widely used open-source data management system that many governments around the world use to publish their data. This notebook walks through querying its API.


**Information on the API can be found here:** [CKAN API guide](https://docs.ckan.org/en/latest/api/#example-importing-datasets-with-the-ckan-api). General information on CKAN and its participants can be found here: [CKAN official website](https://ckan.org).

requests documentation: [here](https://requests.readthedocs.io/en/latest/)

pandas documentation: [here](https://pandas.pydata.org/docs/)

In [3]:
import requests
import pandas as pd

## Using Requests to access the API

The code below shows how to use requests to call an API and how the returned data is typically unpacked.

In [4]:
# package_search returns the matching datasets plus a total count
ckan_url = "https://catalog.data.gov/api/3/action/package_search"

api_token = "45tYhqFq71zd3xYo29eMgLESXiNml4Xxm9JfMmTl"

auth = {
    'X-Api-Key': api_token
}

response = requests.get(url = ckan_url,
                        headers = auth)

assert response.status_code == 200

In [5]:
print(response.status_code)

200


The request above queries the data.gov catalog of publicly available datasets. The response body is JSON, which we parse into a dictionary:

In [6]:
response_dict = response.json()

This is a lot of information. We probably want to see the keys first, and perhaps just a list of each dataset's name, source, and URL. Start by looking at the top-level keys:

In [7]:
response_dict.keys()

dict_keys(['help', 'success', 'result'])
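
As a minimal sketch of the "dataset, source, and URL" listing mentioned above, the helper below walks a response shaped like a CKAN `package_search` result. The function name `summarize_packages` and the sample data are hypothetical. The field names (`title`, `organization`, `resources`, `url`) follow the usual CKAN package layout, but individual records may omit any of them, so every lookup falls back to `None`.

```python
def summarize_packages(response_dict):
    """Return one (title, organization title, first resource URL) tuple per dataset."""
    rows = []
    for package in response_dict.get("result", {}).get("results", []):
        # organization and resources can be missing or None on some records
        organization = package.get("organization") or {}
        resources = package.get("resources") or []
        first_url = resources[0].get("url") if resources else None
        rows.append((package.get("title"), organization.get("title"), first_url))
    return rows


# Hypothetical sample shaped like a package_search response (not real data.gov output).
sample_response = {
    "success": True,
    "result": {
        "count": 2,
        "results": [
            {
                "title": "Example dataset A",
                "organization": {"title": "Example Agency"},
                "resources": [{"url": "https://example.org/a.csv"}],
            },
            {"title": "Example dataset B", "organization": None, "resources": []},
        ],
    },
}

for row in summarize_packages(sample_response):
    print(row)
```

The same function could be applied to `response_dict` from the cell above, provided the endpoint returns a `result` dictionary with a `results` list.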

### Note: To get help...

In [8]:
help_url = response_dict['help']

help_response = requests.get(url = help_url,
                           headers = auth)

assert help_response.status_code == 200

print(help_response.json()['result'].replace('`','\''))


    Searches for packages satisfying a given search criteria.

    This action accepts solr search query parameters (details below), and
    returns a dictionary of results, including dictized datasets that match
    the search criteria, a search count and also facet information.

    **Solr Parameters:**

    For more in depth treatment of each paramter, please read the
    'Solr Documentation
    <https://lucene.apache.org/solr/guide/6_6/common-query-parameters.html>'_.

    This action accepts a *subset* of solr's search query parameters:


    :param q: the solr query.  Optional.  Default: ''"*:*"''
    :type q: string
    :param fq: any filter queries to apply.  Note: ''+site_id:{ckan_site_id}''
        is added to this string prior to the query being executed.
    :type fq: string
    :param fq_list: additional filter queries to apply.
    :type fq_list: list of strings
    :param sort: sorting of the search results.  Optional.  Default:
        '''score desc, metadata_modified 

### Continuing with unpacking datasets using the Solr query parameters

In [47]:
class TooMuchError(Exception):
    pass
# This function returns the API's rate-limit headers for a given url
def getLimits(url):
    response = requests.get(url,
                           headers = auth)
    print(response.headers)
    rate_limits = {
        'rate_limit': response.headers.get("X-RateLimit-Limit"),
        'rate_limit_remaining': response.headers.get("X-RateLimit-Remaining"),
        'rate_limit_reset':response.headers.get("X-RateLimit-Reset")
    }
    return rate_limits

def ckanifyTokens(params):
    ckanifyed = "&".join([key+"="+str(params[key]) for key in params.keys()])
    return ckanifyed

def hitMetadata(query, **kwargs):
    #the auth headers are required
    if 'auth' not in kwargs:
        raise ValueError("¡you need a key!")
    #init params
    params = {
        'q': query,
        #'fq': 'organization.name:doi-gov',
        'rows': 15,
        'wt':'python'
    }
    #ckanify the tokens (printed for inspection only)
    params_encoded = ckanifyTokens(params)
    print(params_encoded)
    #init start and the optional filter query
    start = kwargs.get('start', 0)
    params['start'] = start
    if 'fq' in kwargs:
        params['fq'] = kwargs['fq']
    #get request; passing the dict lets requests url-encode start and fq too
    response = requests.get(url = kwargs['url'],
                            params = params,
                            headers = kwargs['auth'])
    #pass status
    assert response.status_code == 200
    #get json dump
    json_dump = response.json()

    count = json_dump['result']['count']
    if count > 1000:
        msg = (f'your query is gathering {count} datasets, consider rewriting '
               f'your query to look for a topic more specific than your current query: "{query}"')
        raise TooMuchError(msg)

    #paginate: the next page starts exactly `rows` records later
    next_start = start + params['rows']
    #accumulate results across calls
    results = kwargs.get('results', []) + json_dump['result']['results']
    print(count)
    #continue recursion if necessary
    # if next_start < count:
    #     return hitMetadata(query, url=kwargs['url'], auth=kwargs['auth'], start=next_start, results=results)
    return list(filter(None, results))
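
The commented-out recursion above hints at paging through every match. One way to do that without recursion is a plain loop that advances `start` by `rows` until `count` results have been collected. This is a sketch under stated assumptions, not the notebook's method: `collect_all` and `fetch_page` are hypothetical names, and `fetch_page` stands in for whatever function performs the HTTP request and returns the `result` dictionary (with `count` and `results`).

```python
def collect_all(fetch_page, rows=15, max_results=1000):
    """Gather every result by calling fetch_page(start, rows) until exhausted."""
    collected = []
    start = 0
    while True:
        page = fetch_page(start, rows)
        if page["count"] > max_results:
            raise ValueError(f"query matches {page['count']} datasets; narrow it down")
        collected.extend(page["results"])
        start += rows  # advance by exactly one page so no record is skipped
        if not page["results"] or start >= page["count"]:
            break
    return collected


# Stand-in for a real API call: serves slices of 16 fake records.
FAKE_RECORDS = [{"id": i} for i in range(16)]


def fake_fetch_page(start, rows):
    return {"count": len(FAKE_RECORDS), "results": FAKE_RECORDS[start:start + rows]}


all_results = collect_all(fake_fetch_page, rows=15)
print(len(all_results))  # 16
```

With a real endpoint, `fetch_page` could wrap a `requests.get` call that sends `q`, `start`, and `rows` as parameters and returns `response.json()['result']`.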

In [51]:
# look for a subject to find a dataset
query = "dinosaurs"

#find metadata in data.gov for the matching datasets
results = hitMetadata(query=query,
            url=ckan_url,
            auth=auth)

q=dinosaurs&rows=15&wt=python
16
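
The query string printed above is assembled by joining `key=value` pairs directly, which works for a single word such as `dinosaurs` but not for values containing spaces, colons, or ampersands. A small sketch using the standard library's `urllib.parse.urlencode` shows the escaping. The parameter values in the second call are illustrative only.

```python
from urllib.parse import urlencode

# A single-word query needs no escaping, so the output matches plain joining.
simple = urlencode({"q": "dinosaurs", "rows": 15, "wt": "python"})
print(simple)   # q=dinosaurs&rows=15&wt=python

# Spaces become '+' and ':' becomes '%3A' so the server parses each value intact.
tricky = urlencode({"q": "dinosaur tracks", "fq": "organization:doi-gov", "rows": 15})
print(tricky)   # q=dinosaur+tracks&fq=organization%3Adoi-gov&rows=15
```

Passing a dictionary to the `params` argument of `requests.get` applies the same kind of encoding automatically.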


In [53]:
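def resultsToPandas(results):
    # keep only the keys that every result dictionary shares
    common_keys = set.intersection(*map(set, results))
    # one column per shared key, one row per result
    transformed = {k: [dic[k] for dic in results] for k in common_keys}
    return pd.DataFrame(transformed)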

resultsToPandas(results)

{'url': [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None], 'num_resources': [2, 46, 8, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 6, 2], 'version': [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None], 'isopen': [False, False, False, False, False, False, False, False, False, False, False, False, False, False, False], 'relationships_as_subject': [[], [], [], [], [], [], [], [], [], [], [], [], [], [], []], 'resources': [[{'cache_last_updated': None, 'cache_url': None, 'conformsTo': 'https://www.fgdc.gov/schemas/metadata/', 'created': '2023-06-01T05:28:05.201712', 'description': 'The metadata original format', 'format': 'XML', 'hash': '', 'id': '87f51503-8ce2-4592-bd0e-879e09f0a14d', 'last_modified': None, 'metadata_modified': '2023-10-27T17:38:25.865971', 'mimetype': 'text/xml', 'mimetype_inner': None, 'name': 'Original Metadata', 'package_id': 'da4ce130-7fdc-454a-9dba-15c4f1608885', 'position': 0, 'resource_type': None, 