[WIP] Add ability to use existing data packages #980

Closed
wants to merge 9 commits into from

ethanwhite
Member

Downloads Frictionless Data data packages from the web and adds
them to the retriever. Accomplished by:

  • Retrieving the JSON metadata file
  • Modifying it to work with the retriever
  • Writing it to the scripts directory

New packages are added by adding a name-url pair to
scripts/datapackages.yml.

@ethanwhite
Member Author

ethanwhite commented Jul 23, 2017

The basics of this are working. Running:

from retriever.lib.get_datapackages import get_dps, download_dps_json
dps = get_dps()
download_dps_json(dps)

will grab the JSON data package file, modify it to run with the retriever, and add it to the ~/.retriever/scripts directory.
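For readers skimming the thread, the fetch, modify, write flow can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in this PR: `fetch_json`, `download_packages`, and the `adapt` hook are hypothetical names, and the output file naming (`<name>.json`) is assumed.

```python
import json
import os
from urllib.request import urlopen


def fetch_json(url):
    """Download a URL and parse the response body as JSON."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def download_packages(packages, scripts_dir, fetch=fetch_json,
                      adapt=lambda metadata: metadata):
    """Fetch, adapt, and write one JSON file per name-url pair.

    `packages` maps a package name to the URL of its metadata file.
    `adapt` is a hypothetical hook for the retriever-specific changes.
    Returns the list of paths that were written.
    """
    os.makedirs(scripts_dir, exist_ok=True)
    written = []
    for name, url in packages.items():
        metadata = adapt(fetch(url))
        path = os.path.join(scripts_dir, name + ".json")
        with open(path, "w") as handle:
            json.dump(metadata, handle, indent=2, sort_keys=True)
        written.append(path)
    return written
```

Passing `fetch` as a parameter keeps the network call replaceable, which makes the flow testable without hitting a real server.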

If other folks' data package files looked exactly like ours we'd just need to hook this code up, but they don't, and I think this is going to mean a fair bit of additional work. E.g.,

  • There are lots of fields that are allowed in the JSON according to the spec that we don't process explicitly in compile.py. We currently pass those through unquoted, but many of them are strings and so this leads to errors loading the resulting .py scripts.
  • Our usage of type does not currently match the spec, so we aren't handling declared types correctly. This will require some changes to lib/engine.py and to the individual engines, probably with some care given to supporting both our old usage and the proper spec to maintain backwards compatibility.
  • I suspect there will be a number of other issues like this; these are just the first two I've come across.
  • If we're going to handle other people's data packages we also need to think about security and making sure that we're sanitizing inputs.
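To make the first point concrete, here is a small sketch of why unquoted pass-through breaks the generated script and how `repr()` avoids it. It is not how compile.py works; `render_assignments` is a hypothetical helper and the flat `key = value` output format is an assumption made for the example.

```python
import ast
import keyword


def render_assignments(metadata):
    """Render a flat dict as Python assignment statements, one per key."""
    lines = []
    for key in sorted(metadata):
        # Refuse keys that could not be a plain variable name.
        if not key.isidentifier() or keyword.iskeyword(key):
            raise ValueError("unsafe key: %r" % (key,))
        # repr() quotes and escapes strings; numbers and lists pass through.
        lines.append("%s = %r" % (key, metadata[key]))
    return "\n".join(lines)


metadata = {"title": "Bob's \"survey\" data", "version": "1.0"}
source = render_assignments(metadata)

# The generated text parses as Python, which the unquoted form would not.
ast.parse(source)
```

Rejecting non-identifier keys also covers a small part of the sanitizing concern, since a key can no longer smuggle extra statements into the generated file.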

External data packages will have a number of keys that we don't
use in the retriever. This only loads the keys that we work with,
handles their types correctly, and ignores the rest.

This solves issues with non-retriever keys whose values were strings
not being handled properly.
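A minimal sketch of the keep-only-known-keys idea follows. The key list is an assumption chosen from common Data Package properties, not the list the retriever actually uses, and `keep_known_keys` is a hypothetical name.

```python
# Assumed subset of keys and their expected Python types (illustrative only).
KNOWN_KEYS = {
    "name": str,
    "title": str,
    "description": str,
    "keywords": list,
    "resources": list,
}


def keep_known_keys(metadata, known=KNOWN_KEYS):
    """Return a copy of metadata holding only known, correctly typed keys."""
    kept = {}
    for key, expected_type in known.items():
        if key not in metadata:
            continue
        value = metadata[key]
        if not isinstance(value, expected_type):
            raise TypeError(
                "%s should be %s, got %s"
                % (key, expected_type.__name__, type(value).__name__)
            )
        kept[key] = value
    return kept
```

Iterating over the known keys rather than the incoming ones means unexpected keys are never touched at all.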

The retriever still describes types using its old system instead of the
official Frictionless Data spec. This converts Frictionless Data types to
retriever types to allow proper handling.

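The conversion could look roughly like the sketch below. The three mapping entries are the ones visible in this PR's diff excerpt; the `char` fallback for unlisted types and the helper names `to_retriever_type` and `convert_fields` are assumptions for illustration.

```python
# Entries taken from the diff excerpt in this PR; the table is partial.
TYPE_MAP = {
    "number": "double",
    "integer": "int",
    "date": "char",
}


def to_retriever_type(frictionless_type, fallback="char"):
    """Translate one declared type, falling back for unlisted types."""
    return TYPE_MAP.get(frictionless_type, fallback)


def convert_fields(fields):
    """Map field descriptors to (name, retriever_type) pairs.

    A field with no declared type is treated as "string", which is the
    Table Schema default.
    """
    return [
        (field["name"], to_retriever_type(field.get("type", "string")))
        for field in fields
    ]
```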
There is no system for ensuring that data package names are unique and
descriptive. This replaces the name in the original dp with the name
specified in datapackages.yml.

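As a rough sketch of the name override, with a basic check on the registered name: the allowed-character pattern follows the Data Package recommendation for `name` (lower case alphanumerics plus ".", "_", "-") and is an assumption about what the retriever would want to enforce; `apply_registered_name` is a hypothetical helper.

```python
import re

# Assumed pattern: lower case alphanumerics plus ".", "_" and "-".
VALID_NAME = re.compile(r"[a-z0-9._-]+")


def apply_registered_name(metadata, registered_name):
    """Return a copy of metadata whose name is the registered one."""
    if not VALID_NAME.fullmatch(registered_name):
        raise ValueError("invalid package name: %r" % (registered_name,))
    renamed = dict(metadata)  # leave the caller's dict untouched
    renamed["name"] = registered_name
    return renamed
```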
To properly test this feature it was necessary to add datapackages.yml
to the master branch separately.
@@ -26,6 +27,10 @@ def download_from_repository(filepath, newpath, repo=REPOSITORY):
raise
pass

def download_external_dps():
Contributor

Some cleanup:

  • Add one more blank line before the function.

'number': 'double',
'integer': 'int',
'date': 'char'
}
Contributor

  • Closing bracket is not indented correctly
  • Double spacing between functions
  • Do we need the extra libs urlparse, urlunparse, ParseResult?

zhangcandrew pushed a commit to zhangcandrew/retriever that referenced this pull request Nov 28, 2017
2. Add new tables for spatial data,
so that the retriever can handle geospatial data.

3. Define table for each script before pre-processing.

4. Added the functionality to handle
compressed/archived datasets by using key word archieved
and the files that are in the zip for each table.

5. Handling data packages registered in the `datapackage.yml` file.
I have removed the scripts from this file that seem
not to have followed the standards.
To make this possible, we are matching the major `frictionless`
types (integer, string, etc.) with the types in the retriever and falling
back to char for some other types,
for example, datetime YYYY-MM-DDThh:mm:ssZ.
Ref: This is a combination of the work from PR weecology#814 weecology#980 weecology#1005
@zhangcandrew
Member

Closing this PR now that the changes have been incorporated into PR #1010. Feel free to reopen this PR if anything new comes to light!
