[WIP] Add ability to use existing data packages #980

ethanwhite · 2017-07-23T01:51:07Z

Downloads Frictionless Data data packages from the web and adds
them to the retriever. Accomplished by:

* Retrieving the JSON metadata file
* Modifying it to work with the retriever
* Writing it to the scripts directory

New packages are added by adding a name-url pair to
scripts/datapackages.yml.
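
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `adapt_datapackage` and `write_script` are hypothetical helper names, the registry is shown as a plain dict rather than being parsed from scripts/datapackages.yml, and the metadata is passed in as a string instead of being downloaded. The name and URL values are made up.

```python
import json
import os
import tempfile


def adapt_datapackage(raw_json, name):
    """Parse data package metadata and apply the retriever-side name.

    Hypothetical helper: the real modifications are more involved.
    """
    package = json.loads(raw_json)
    package["name"] = name
    return package


def write_script(package, scripts_dir):
    """Write the adapted metadata to <scripts_dir>/<name>.json and return the path."""
    os.makedirs(scripts_dir, exist_ok=True)
    path = os.path.join(scripts_dir, package["name"] + ".json")
    with open(path, "w") as handle:
        json.dump(package, handle, indent=2, sort_keys=True)
    return path


# Stand-in for a name-url pair from scripts/datapackages.yml (made-up values).
registry = {"example-airports": "https://example.org/datapackage.json"}
# Stand-in for the downloaded JSON metadata file.
raw = '{"name": "airport-codes", "resources": [{"name": "airports"}]}'

# A temporary directory is used here instead of ~/.retriever/scripts.
scripts_dir = tempfile.mkdtemp()
written = [
    write_script(adapt_datapackage(raw, name), scripts_dir)
    for name in registry
]
```

Keeping the download separate from the adapt and write steps makes the latter two easy to test without network access.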

ethanwhite · 2017-07-23T01:58:17Z

The basics of this are working. Running:

from retriever.lib.get_datapackages import get_dps, download_dps_json
dps = get_dps()
download_dps_json(dps)

will grab the JSON data package file, modify it to run with the retriever, and add it to the ~/.retriever/scripts directory.

If other folks' data package files looked exactly like ours we'd just need to hook this code up, but they don't, and I think this is going to mean a fair bit of additional work. E.g.,

* The spec allows many fields in the JSON that we don't process explicitly in compile.py. We currently pass those through unquoted, but many of them are strings, so this leads to errors when loading the resulting .py scripts.
* Our usage of type does not currently match the spec, so we aren't handling declared types correctly. This will require some changes to lib/engine.py and to the individual engines, probably with some care given to supporting both our old usage and the proper spec to maintain backwards compatibility.
* I suspect there will be a number of other issues like this; these are just the first two I've come across.
* If we're going to handle other people's data packages, we also need to think about security and make sure that we're sanitizing inputs.
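
On the sanitizing point, one possible first line of defense is to validate each registry entry before anything is downloaded. A minimal sketch, assuming (this is not taken from the PR) that names should be simple lowercase identifiers and that only http/https URLs are acceptable; `validate_entry` is a hypothetical helper:

```python
import re
from urllib.parse import urlparse

# Assumed naming rule: lowercase letters, digits, "_" and "-" only.
NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9_-]*")


def validate_entry(name, url):
    """Raise ValueError if a name-url pair looks unsafe to process."""
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError("unsafe data package name: %r" % name)
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("unsupported URL: %r" % url)
    return name, url
```

Rejecting names with path separators matters here because the name ends up in a file path under the scripts directory. This only covers the registry entries; the contents of the downloaded metadata would need their own checks.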

External data packages will have a number of keys that we don't use in the retriever. This only loads the keys that we work with, handles their types correctly, and ignores the rest. This solves issues with non-retriever keys whose values were strings not being handled properly.
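
The idea of loading only recognized keys can be illustrated as below. The set of keys is an assumption made for this example, not the list used in the PR, and `keep_known_keys` is a hypothetical name.

```python
# Assumed set of keys the retriever understands (illustrative only).
KNOWN_KEYS = {"name", "title", "description", "version", "resources"}


def keep_known_keys(package, known_keys=KNOWN_KEYS):
    """Return a copy of the metadata containing only recognized top-level keys."""
    return {key: value for key, value in package.items() if key in known_keys}


external = {
    "name": "gdp",
    "title": "Country GDP",
    "licenses": [{"name": "ODC-PDDL-1.0"}],  # not used by the retriever here
    "sources": [{"title": "World Bank"}],    # not used by the retriever here
    "resources": [],
}
filtered = keep_known_keys(external)
```

Because unknown keys are dropped rather than passed through, their string values never reach the generated script, which avoids the quoting problem described earlier.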

The retriever still describes types using its old system instead of the official frictionless data spec. This converts frictionless data types to retriever types to allow proper handling.
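
A minimal sketch of such a conversion is shown below. The `number`, `integer`, and `date` entries match the mapping visible in the review excerpt of retriever/lib/get_datapackages.py in this thread; the `string` entry, the fallback to `char` for unlisted types, and the function names are assumptions made for illustration.

```python
TYPE_MAP = {
    "number": "double",
    "integer": "int",
    "date": "char",
    "string": "char",  # assumption: not shown in the diff excerpt
}


def to_retriever_type(frictionless_type, default="char"):
    """Map a Frictionless Data type name to a retriever type name."""
    return TYPE_MAP.get(frictionless_type, default)


def convert_fields(fields):
    """Return copies of the field dicts with 'type' rewritten.

    Fields without a declared type are left unchanged.
    """
    converted = []
    for field in fields:
        field = dict(field)  # avoid mutating the caller's metadata
        if "type" in field:
            field["type"] = to_retriever_type(field["type"])
        converted.append(field)
    return converted


fields = [
    {"name": "year", "type": "integer"},
    {"name": "value", "type": "number"},
    {"name": "observed", "type": "datetime"},  # unlisted type, falls back
    {"name": "notes"},
]
converted = convert_fields(fields)
```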

There is no system for ensuring that data package names are unique and descriptive. This replaces the name in the original dp with the name specified in datapackages.yml.

To properly test this feature it was necessary to add datapackages.yml to the master branch separately.

henrykironde · 2017-08-01T15:47:56Z

retriever/lib/repository.py

@@ -26,6 +27,10 @@ def download_from_repository(filepath, newpath, repo=REPOSITORY):
        raise
        pass

+def download_external_dps():


Some cleanup:

Add more blank lines before the function.

henrykironde · 2017-08-01T15:47:59Z

retriever/lib/get_datapackages.py

+             'number': 'double',
+             'integer': 'int',
+             'date': 'char'
+    }


The closing brace is not indented correctly.

Use two blank lines between functions.

Do we need the extra libs urlparse, urlunparse, ParseResult?

2. Add new tables for spatial data, so that the retriever can handle geospatial data.
3. Define the table for each script before pre-processing.
4. Add the functionality to handle compressed/archived datasets, using the keyword archived and the files that are in the zip for each table.
5. Handle data packages registered in the `datapackage.yml` file. I have removed the scripts from this file that do not seem to follow the standards. To make this possible, we match the major `frictionless` types (integer, string, etc.) with the types in the retriever and fall back to char for other types, for example datetime (YYYY-MM-DDThh:mm:ssZ).

Ref: This is a combination of the work from PRs weecology#814, weecology#980, and weecology#1005.

zhangcandrew · 2017-11-30T20:08:01Z

Closing this PR now that the changes have been incorporated into PR #1010. Feel free to reopen this PR if anything new comes to light!

henrykironde added the Under Review and Tests label Jul 23, 2017

This was referenced Jul 23, 2017

Use Frictionless data spec for types #981

Closed

Add GDP data #967

Closed

Add airports dataset #966

Closed

ethanwhite added 7 commits July 28, 2017 21:47

Update old use of script_version to version

96c506e

Match datapackage types to retriever types

acbf916

Download external data packages with check_for_updates

84c7c9d

Add additional external data packages

da46860

Replace the original dp name with the name chosen for the retriever

a4ef09b

ethanwhite force-pushed the extern-datapacks branch from b07d216 to a4ef09b Compare July 29, 2017 02:59

ethanwhite mentioned this pull request Jul 29, 2017

Add list of external data packages #987

Merged

ethanwhite added 2 commits August 1, 2017 08:52

Remove datapackages.yml since already added

3fec04f

Add yaml as dependency

373a702

ethanwhite force-pushed the extern-datapacks branch from 4e6b494 to 373a702 Compare August 1, 2017 13:17

henrykironde reviewed Aug 1, 2017

henrykironde mentioned this pull request Oct 18, 2017

Removes compiling from json to python scripts. #1010

Closed

zhangcandrew closed this Nov 30, 2017

ethanwhite mentioned this pull request Jan 19, 2020

Add datasets from data hub awesome collection #1404

Open

21 tasks

henrykironde mentioned this pull request Jul 9, 2020

Add frictionless data packages support #1489

Closed
