Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for DataStore #4

Closed
rufuspollock opened this issue Nov 14, 2014 · 25 comments
Closed

Support for DataStore #4

rufuspollock opened this issue Nov 14, 2014 · 25 comments
Assignees
Milestone

Comments

@rufuspollock
Copy link

See http://docs.ckan.org/en/latest/maintaining/datastore.html

This would allow you to pull (and push) data from CKAN to R (dataframes)

@sckott
Copy link
Contributor

sckott commented Nov 14, 2014

Thanks for the suggestion @rgrp . Will do.

@sckott
Copy link
Contributor

sckott commented Nov 14, 2014

@rgrp Hmmm. I don't think I understand how the datastore differs from what I've already implemented. To be able to test pushing, is there a test datastore I can use? Can you point to some example endpoints that use the datastore?

@rufuspollock
Copy link
Author

http://datahub.io has DataStore turned on but you'll need to get an account. http://demo.ckan.org/ may work more easily.

@sckott
Copy link
Contributor

sckott commented Nov 14, 2014

Thanks!

@sckott sckott modified the milestone: v0.1 Dec 17, 2014
@sckott sckott self-assigned this Dec 17, 2014
sckott added a commit that referenced this issue Dec 23, 2014
sckott added a commit that referenced this issue Dec 23, 2014
@sckott
Copy link
Contributor

sckott commented Dec 23, 2014

@rgrp I added some functionality for datastore. If you have time could you see if the functions work for you? See ds_create_dataset(), ds_search(), and ds_search_sql(). What else should be added?

@Analect
Copy link

Analect commented Jan 16, 2015

@sckott
Hope you don't mind me commenting here. I was playing around with your ds_create_dataset and wanted to make a few observations and ask some questions.

I ran this, which appeared to work.
ds_create_dataset(package_id='eurostat-r-test', name='testR', path='/home/user/Downloads/testR.csv', key='xxxx-xxxx', url="http://my_ckan_instance_ip_addr/", as="list")

Up to now, I've been mainly using the python 'ckanapi' (https://github.com/ckan/ckanapi) package for interacting with the CKAN datastore and wanted to see if some of the more granular functionality available through that mechanism could be added to ckanr, if it already isn't in there, that is.

As with your ds_create_dataset, I supply certain credentials up-front when using ckanapi.

Establish a connection to a running CKAN instance

mysite = ckanapi.RemoteCKAN('http://my_ckan_ip_addr',
    apikey='xxxx')

For a new table, I may predefine field types, such as date / value / text fields. Subsetted example below.

Fields = [{
            "id":"TICKER_MOD",
            "type": "text",
            },
            {
            "id": "Amend_Date",
            "type": "timestamp",
            }]

I'm generally then pushing data directly from a pandas dataframe (akin to the R data.frame) to CKAN ... which requires some massaging to get in an acceptable JSON form. I generally upload a large dataframe in two steps ... first the fields and a single row ... to set the table up, as it were and then a follow-on 'upsert' of the bulk of the dataframe, chunking it in parts, where necessary, to get around memory issues. As per last line below, I'm supplying a resource_id to my datastore_create call, where your example uses a name and generates the resource on-the-fly.

just doing this to use datastore_create on first slice and push second slice as datastore_upsert

## in order to handle duplicates.
dffile0=dffile[:1]

## need to adjust the data to JSON line format for loading to CKAN
# first slice of df to create table in datastore and add a single record
output = cStringIO.StringIO()
dffile0.to_json(path_or_buf=output, date_format='iso', orient='records')
contents = output.getvalue()
records_new = pd.json.loads(contents)

# finally, push to CKAN datastore table
mysite.action.datastore_create(resource_id=resource['id'],
                                                fields=fields, records=records_new,
                                                force=True, primary_key=['TICKER'])

Does your ds_create_dataset handle supplying pre-set field formats, as above? Also, can it handle setting a primary key, which is necessary for the follow-on datastore_upsert function below?

mysite.action.datastore_upsert(resource_id=resource['id'],
                                                    records=records_new,
                                                    force=True, primary_key=['TICKER'])

You asked, "What else should be added?"

  1. Having the ability to push directly from a data.frame (rather than a csv file) would be great. Perhaps this is there ... I noted as="table', as="json", as="list" exists, but wasn't sure how each is handled. Is the as="json" used to potentially push directly from a data.frame. If so, are there some hoops (as I've had to do in python) to go through to get it into the appropriate format?
  2. Having the ability to pre-format fields, add a primary key and access to the CKAN upsert function via ckanr would be nice to have too.
  3. For large data.frames, maybe some form of chunking to CKAN is necessary, so it doesn't choke. I'm not sure the best approach to doing this in R.

On first working with CKAN, I struggled a bit ref the naming conventions in their data model. AFAIK, an organisation contains any number of datasets which are akin to data packages. Each dataset, can in turn contain resources ... files such as CSVs, PDFs etc... Some of these resources (such as CSVs, XLSs etc) can get ingested to the datastore (postgres), for added functionality. Hence, the ckanapi function naming datastore_create and datastore_upsert. In that context, perhaps ds_create_datastore might be a clearer naming approach ... since the dataset to which you are adding a resource already exists. Just a suggestion!

My experience with R is somewhat limited, but I would love to see a fully-functional ckanr library, so let me know if I can be of further help.
Colum

@sckott
Copy link
Contributor

sckott commented Jan 16, 2015

@Analect Thanks for your comments. To address the questions:

  • You asked Does your ds_create_dataset handle supplying pre-set field formats - What does pre-set field formats mean? I'm not sure I understand
  • No, there's no primary key arg yet, but I'll try to add that.
  • Submitting a data.frame in addition to a file should be easy to add - I'll do that
  • The as=* parameter is only for how the returned data is presented to you, as either json, R list, or table/data.frame
  • The chunking problem should be easy - I'll see about that
  • Thanks for notes on fxn naming - I'll think about that and make sure it's clear

@sckott
Copy link
Contributor

sckott commented Jan 16, 2015

hmmm, trying to work on adding function for http://docs.ckan.org/en/latest/maintaining/datastore.html#ckanext.datastore.logic.action.datastore_create and http://docs.ckan.org/en/latest/maintaining/datastore.html#ckanext.datastore.logic.action.datastore_upsert but the terminology is so confusing, and I need much more time to understand this - will come back later.

@rufuspollock
Copy link
Author

@sckott what's confusing there - would be super useful to have this support so if we can help clarify let me know. Basically this is about creating tables and then inserting data to them. If you want worked example in python we have several.

@sckott
Copy link
Contributor

sckott commented Jan 16, 2015

@rgrp Thanks for offer to help.

Some examples would be great. It seems like there are multiple ways to use http://docs.ckan.org/en/latest/maintaining/datastore.html#ckanext.datastore.logic.action.datastore_create and I don't quite understand the different use cases yet.

@Analect
Copy link

Analect commented Jan 16, 2015

@sckott
Ref the field types ... in my initial comment, I was illustrating how field types could be specified. I think the CKAN API makes a pretty good guess of type ahead of populating a table in postgres, but this is just so you can be explicit about a particular field type for data you are about to push to CKAN ... its an optional parameter, as you will see from the docs here:
http://docs.ckan.org/en/latest/maintaining/datastore.html?highlight=datastore#fields
Here are the valid types:
http://docs.ckan.org/en/latest/maintaining/datastore.html?highlight=datastore#valid-types

I'm sure @rgrp can supply you with a whole lot more relevant stuff, but here are a few relevant links that may help, beyond the very detailed docs at docs.ckan.org, that is!

CKAN data model visualisation: ckan/ckan#2053
And here's a useful summary of the docs I came across: http://assets-production.govstore.service.gov.uk/Gii%20Attachments/06b%20Cats%20-%20uploaded/G2.208%20-%20Open%20Knowledge%20Foundation%20Limited/Archive1/CKANWalkthrough.pdf

Simple python interaction with CKAN examples:
I think these mostly use Ian Ward's ckanapi. He has also created a nice command-line interface to CKAN discussed here: https://lists.okfn.org/pipermail/ckan-dev/2014-February/006774.html
https://github.com/ckan/example-add-dataset
https://github.com/ckan/example-update-datastore

Other example of python integration with datastore:
https://github.com/ckan/example-earthquake-datastore

As I said, @rgrp may be able to point you to other publically available stuff, some of which may be more relevant to some of the new functionality in ver 2.3 (resource views etc..)

Rgds,
Colum

@sckott
Copy link
Contributor

sckott commented Mar 19, 2015

@Analect Thanks very much for your help on this. Sorry it has been so long since I've worked on this. I will start hacking again on this soon...the python examples will surely help

@Analect
Copy link

Analect commented Apr 20, 2015

@sckott
I see you're getting a bit closer to supporting the datastore. Let me know if you need me to test anything.

@sckott
Copy link
Contributor

sckott commented Apr 20, 2015

@Analect thanks! will do

@sckott
Copy link
Contributor

sckott commented May 7, 2015

@Analect @rgrp If you have time, there are a few datastore functions now (ds_search, ds_search_ssq, ds_create_dataset), and would be great if you could test and report back with any bugs/feature requests

@Analect
Copy link

Analect commented May 9, 2015

@sckott
Thanks for your work on this. I played around a bit and here are my thoughts, for what they're worth. These are based on my particular use-case for the datastore, where my preference is to interact directly with the datastore, rather than upload a file to the filestore and have datapusher manage pushing the data from that file (which needs to be of type csv or xls). My experience is that datapusher isn't always good at getting field-types right and in any case, I'm often looking to add additional fields beyond those just contained in a file. The other problem with pushing files to the filestore is that this will trigger datapusher to run and potentially over-write an existing resource table in the datastore. So perhaps this is feedback on your work-in-progress 'ds_create' function, than necessarily the others mentioned above. If I'm jumping the gun here ... then sorry.

First things first, here's a simple R file to download some data from the ECB, partially clean it and save it locally as a CSV. Ideally, this file would be able to leverage your ds_create function to push directly to CKAN, but I'm not there yet, as I'll elaborate on this further down.

https://gist.github.com/Analect/378a61704941359e3e5a

As a test, I tried to push the dataframe (local dataframe in dplyr sense) in R directly to CKAN, but it hung for a long time and finally timed-out with a broken pipe, as some sort of chunking mechanism is probably needed. It looks like the convert function is doing the right transformation on the records from the ts_df_mod dataframe, but the fact that I'm not explicitly adding fields into this call (and setting these up ahead of time) probably doesn't help.

> ds_create(resource_id = '99205f0e-3a7e-4f90-90ba-cfb07edee791', records = ts_df_mod, aliases = 'ecb-ilm-data-test2', force = FALSE, key='4600a614-bb86-47f3-a944-c7ad9a8b280d', url="http://79.125.66.74/", as="list")
Error in function (type, msg, asError = TRUE)  : 
  Send failure: Broken pipe

I found myself reverting back to python and leveraging the ckanapi in order to get this done. Being able to stay in R would be nice. Here's the mechanism I've used in python. Probably not eloquent ... but it works! I'm going to use the demo.ckan.org end-point as you have done and have created a dataset in there called 'test-ckanr' ... On reflection .... OK, having tried this, it seems that demo.ckan.org has some limitation on the amount of data that can be pushed there as I'm getting an 'client error (413): Request Entity Too Large'. If I push the same data to my own CKAN instance, it's fine. In any case, here's the code, which might work if you just push a portion of the dataframe for testing.

https://gist.github.com/Analect/ea38dd75c282d8312a65

If you want to test for larger datasets, then maybe I can make contact with you directly via email and set you up with a test CKAN instance on AWS.

As you can see from the python script, I'm having to set up field types ahead of pushing data to the datastore. That's because using datapusher to do that correctly is not always reliable, as you can see from the screenshot below, where it ended up interpreting many fields as timestamps.

datapusher

Some functionality gaps I see that are perhaps there but I'm just not using the package correctly are:

  1. The ability to specify a package_id in ds_create, such that a resource_id gets created for a new non-existing resource automatically. At the moment, I find myself having to use ds_create_dataset and using a dummy url in order to create a blank resource_id that I can then use in the ds_create function.
  2. The ability to use something like sapply(ts_df_mod, class) ...see: http://stackoverflow.com/questions/21125222/determine-the-data-types-of-an-r-data-frames-columns that might be able to interpret the structure of a dataframe and set up fields appropriately/automatically in CKAN ahead of pushing the data.
  3. The ability to set a chunk size in terms of dataframe rows as a parameter in ds_create (as per the 5000 I use in the python script) such that CKAN doesn't choke on the data. Maybe there is another more efficient way of getting the data into postgres ... but at the same time using the CKAN API credentials/key so that user interaction with CKAN gets recorded on a users 'Activity Stream'.
  4. In order to be able to leverage the datastore_upsert (http://docs.ckan.org/en/latest/maintaining/datastore.html#ckanext.datastore.logic.action.datastore_upsert) functionality, a primary_key is required. Whenever a new datastore table is generated in CKAN, it creates two new fields: _id and full_text. It would be nice to be able to specify the incrimenting _id field as the primary key. I've tried this in python using ckanapi, but I wasn't successful. Maybe the ckanr package gives the ability to create a separate incrimenting field to act as primary, as an option.

Let me know if I can be of further help here.
Colum

@sckott
Copy link
Contributor

sckott commented May 11, 2015

thanks @Analect - I'll get to this soon ⌚

@sckott
Copy link
Contributor

sckott commented May 19, 2015

@Analect

  • My email is scott@ropensci.org - can you contact me about setting up a ckan instance on aws? that way I should be able to debug these big data issues easier.
  • still working on getting ds_create() fixed, i'll have a another look at the python code you shared, and your comments above

On your numbered list:

  1. Right, i hadn't yet figured out how to just use a package_id and have the resource created automatically. on the to do list
  2. Right, seems like we can do that internally as part of prep process
  3. We should be able to do chunk size easily
  4. on primary_key, _id, full_text, and incrementing_id - heard about upsert, haven't tried it yet, will look into that.

@Analect
Copy link

Analect commented Sep 25, 2015

@sckott
Don't suppose datastore support is on the horizon for ckanr?
If I can be of futher help, please shout.

Thanks.

@sckott
Copy link
Contributor

sckott commented Sep 29, 2015

@Analect It is in the works. See any functions starting with ds_ in the pkg https://github.com/ropensci/ckanr/tree/master/R

I'll also start a label to put on issues that have to do with datastore items

@sckott
Copy link
Contributor

sckott commented Sep 29, 2015

This pkg is a big project, and lots of other pkgs to work on. And lots of travel this summer :)

@sckott
Copy link
Contributor

sckott commented Oct 21, 2015

Sorry all, but this isn't quite feature complete yet, but I want to push a first version to CRAN to start getting this more widely distributed, get more users, more feedback, etc. Hope to get this more sorted out soon, moving to next milestone

@sckott sckott added this to the v0.2 milestone Oct 21, 2015
@sckott sckott removed this from the v0.1 milestone Oct 21, 2015
@mattfullerton
Copy link
Contributor

mattfullerton commented Aug 8, 2016

I noticed a long time ago that the resource datastore creation process is poorly documented. Here is the structure for a simple HTTP (curl) call which shows how the request should look:

curl -X POST https://CKAN_INSTANCE_URL/api/3/action/datastore_create -H "Authorization: API_KEY" -d '{"force":"true", "resource": {"package_id": "PACKAGE ID"}, "primary_key":["name"], "fields": [{"id":"field1", "type": "one of text, json, date, time, timestamp, int, float, bool"},{"id":"field2", "type":"optional, but must be a valid type"}]}'

I noted that it is important to specify the primary key.

@sckott
Copy link
Contributor

sckott commented Aug 16, 2016

thanks for the help @mattfullerton

@sckott
Copy link
Contributor

sckott commented Jul 3, 2019

initial support is here in the pkg, lets open new issues as needed for certain parts of datastore

@sckott sckott closed this as completed Jul 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants