A stand-alone webapp to automate importing data from various sources into CKAN and its DataStore.
Get it from GitHub:
git clone ...
We use foreman (as provided by Heroku) to run the app:
foreman start -f Procfile-dev
The focus is on importing a Data Package stored in GitHub
- Given the URL of its datapackage.json
- optionally, the specific file(s) that have changed
- plus the CKAN API Key (our bearer token)
- Identify target CKAN dataset
- Use DataPackage.name attribute for Dataset name
- Identify target resources (by name)
- Start import process for each data file
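The steps above could be sketched roughly as follows. This is only an illustration: `plan_import` and its parameters are hypothetical names, and it assumes each resource entry in the descriptor carries a `name` plus a string `path` (or `url`) pointing at the data file.

```python
from urllib.parse import urljoin


def plan_import(descriptor_url, descriptor, changed_files=None):
    """Return (dataset_name, [(resource_name, data_url), ...]).

    descriptor_url: URL the datapackage.json was fetched from.
    descriptor: the parsed datapackage.json (a dict).
    changed_files: optional collection of repo-relative paths; when given,
        only resources whose path is in it are scheduled for import.
    """
    # The Data Package "name" attribute is used as the CKAN dataset name.
    dataset_name = descriptor["name"]
    jobs = []
    for resource in descriptor.get("resources", []):
        path = resource.get("path") or resource.get("url")
        if path is None:
            continue  # inline data or incomplete entry: nothing to fetch
        if changed_files is not None and path not in changed_files:
            continue
        # Relative paths are resolved against the datapackage.json location.
        data_url = urljoin(descriptor_url, path)
        # Target resources are identified by name (falling back to the path).
        jobs.append((resource.get("name") or path, data_url))
    return dataset_name, jobs
```

Each `(resource_name, data_url)` pair would then be handed to the per-file import process.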
Requires:
- CSV file URL
- schema (optional?)
- Target resource id (or dataset name + resource name?)
- CKAN API Key
Steps:
- Load CSV file (into memory or do we stream?)
- (?) Check Schema is valid
- Convert Schema to DataStore schema
- Send data to DataStore
- Receive webhook payload
- Determine if any action needed
- Boot up Data package import process
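The "determine if any action needed" step could look something like the sketch below. It assumes the payload is a GitHub push event, whose `commits` entries list `added`, `modified` and `removed` paths; the function name, the default branch and the file suffixes are hypothetical choices for illustration.

```python
def changed_data_files(payload, branch="refs/heads/master",
                       suffixes=(".csv", "datapackage.json")):
    """Return the set of data-related paths touched by a push payload.

    An empty set means no import needs to be started.
    """
    # Ignore pushes to branches we are not tracking.
    if payload.get("ref") != branch:
        return set()
    changed = set()
    for commit in payload.get("commits", []):
        # Removed files are ignored here; renames remain an open question.
        for key in ("added", "modified"):
            changed.update(commit.get(key, []))
    # Keep only files the importer cares about.
    return {path for path in changed if path.endswith(suffixes)}
```

If the returned set is non-empty, the Data Package import process can be started and given the set as its list of changed files.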
Extras / questions
- Handle data file renames
- How do we deal with a schema change (e.g. a change in type)?
- Ans: if we drop the resource's DataStore content and recreate it we should be OK ...
- Do we try to be smart with updates and only push changed rows? (probably not)
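A rough sketch of the drop-and-recreate answer, combined with the schema conversion step. The type mapping is an assumption (the source does not specify one), the function names are hypothetical, and the payloads are meant for the CKAN `datastore_delete` and `datastore_create` API actions.

```python
# Assumed mapping from Data Package schema types to DataStore column types.
TYPE_MAP = {
    "string": "text",
    "integer": "int",
    "number": "float",
    "boolean": "bool",
    "date": "date",
    "time": "time",
    "datetime": "timestamp",
    "object": "json",
    "array": "json",
}


def to_datastore_fields(schema):
    """Convert a schema dict ({"fields": [...]}) to DataStore field definitions."""
    return [
        # Unknown or missing types fall back to text.
        {"id": field["name"],
         "type": TYPE_MAP.get(field.get("type", "string"), "text")}
        for field in schema.get("fields", [])
    ]


def recreate_payloads(resource_id, schema, records):
    """Build the (delete, create) request bodies for a full re-import."""
    delete = {"resource_id": resource_id, "force": True}
    create = {
        "resource_id": resource_id,
        "force": True,
        "fields": to_datastore_fields(schema),
        "records": list(records),
    }
    return delete, create
```

Each body would be POSTed as JSON to the corresponding action endpoint with the CKAN API Key in the `Authorization` header. Because the table is dropped first, a changed column type does not conflict with existing data, at the cost of re-sending every row.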
Personas:
- Data User - less sophisticated (uses Excel but may not know what an API is)
- Data Wrangler - more sophisticated (knows what an API is)
As a Data Wrangler I want to provide my file and have it imported into CKAN so that I get a Data API
What kind of file?
- CSV file
- Excel file
- GeoJSON file
- ...
How do I provide it?
- web interface
- API (POST/GET url string or POST file content)
Questions:
- Do we validate the file?
- Do we have some process for e.g. tweaking the field types?
- What is the mapping between file and Dataset / Resource?
Implementation
- DataPusher already does most of this
- What's missing is any kind of edit metadata step
- No user interface
As a XXX I want to push my data file to GitHub and have it automatically create/update the CKAN DataStore so that my Data API is up to date
- This is very similar to importing a file; the only difference is that we get push notifications (GitHub webhooks), so merge this with that example.
As a XXX I want to push my tabular data package to GitHub and have it automatically create/update the CKAN DataStore so that my Data API is up to date
- As it is already a data package, importing should be very simple
- If the file is large we may need to worry about queues etc., but we should probably keep it simple for now
- How do we determine dataset to associate this with in CKAN?
As a XXX I want to provide my file and have it imported into CKAN so that I get a nice Dataset
- What distinguishes this from the existing system? Ans: its one-click nature
As a Data Wrangler I want to have my data file automatically re-imported at regular intervals so that the DataStore (and Data API) stays up to date with my data.