
Support for tableschema metadata #1003

Closed
JDziurlaj opened this issue Apr 22, 2020 · 12 comments
Labels
est:Pending Pending estimate of implementation effort est:Score=2 Score for estimate of effort required (scale of 1 upwards) f:Feature-request This issue is a request for a new feature support:Approved Approved to be done under the support agreement support This issue is a candidate to complete under the support agreement

Comments

@JDziurlaj

We can now import a table-schema without any associated data (via #852). This is very helpful. One of the key motivations of that issue was to allow a user to validate their data against a published table-schema before passing it on. However, while the package generated by data-curator does provide the table-schema used by the data provider, it is changed somewhat from what was originally imported via import column properties. Furthermore, any metadata in the table-schema is removed.

This is a problem for us as we expect our format to evolve over time and thus we need to know which version of the table-schema was used in order to calibrate our ingestion routines.

Desired Behavior

The most obvious solution would be to support the metadata described by the frictionless folks. I don't see any need to allow users to edit it within the tool; it is enough that it is preserved from import to export.

The second approach, which would also be acceptable, would be to hash (preferably via SHA1) the incoming table-schema (from import column properties), and emit this as a key under the table-schema section of the data-package. This would at least allow us to correlate the version that the user imported to our repository of table-schemas.
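The hashing approach could look roughly like the sketch below. It is an illustration only, not Data Curator code: the function names and the `importedSchemaSha1` key are hypothetical, and the key is not part of the frictionless specs. The schema is serialized to canonical JSON first so that key order and whitespace do not change the digest.

```python
import hashlib
import json


def schema_fingerprint(schema):
    """Return the SHA-1 hex digest of a canonical JSON form of the schema."""
    # Sort keys and drop insignificant whitespace so that two schemas with
    # the same content produce the same digest regardless of formatting.
    canonical = json.dumps(
        schema, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def stamp_schema(exported_schema, imported_schema):
    """Return a copy of the exported schema that records the imported hash."""
    stamped = dict(exported_schema)
    # "importedSchemaSha1" is a made-up key name used for illustration.
    stamped["importedSchemaSha1"] = schema_fingerprint(imported_schema)
    return stamped
```

A consumer holding a repository of published table-schemas could compute the same fingerprint for each one and look up the value found in the exported package.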

@ghost

ghost commented Apr 23, 2020

Hi @JDziurlaj
The issue of keys sitting outside of the schema in the exported datapackage.json has been raised in issue #972. Is this part of the problem? It may be that there are previous and more recent additions that we haven't incorporated yet. Please let me know what you believe the significant differences are; this will help in considering the upcoming priorities for this release (the caveat being that our timeframes are somewhat limited).
I think that what you are raising here is similar to issue #987. If so, perhaps we could collapse this issue into that one, and you could specify further there. If I've misunderstood, and they are separate, please let me know.

@ghost ghost added est:Pending Pending estimate of implementation effort assessment:Required Assessment of effort required f:Feature-request This issue is a request for a new feature support This issue is a candidate to complete under the support agreement labels Apr 23, 2020
@JDziurlaj
Author

#987 seems to already be implemented, as you commented. I think the confusion comes down to the adage that "one man's data is another man's metadata". The table-schema is metadata about what the user is providing. What I am requesting here is another level above that, metadata about the table-schema itself. I can't speak to #972, we do not currently use foreign keys.

@ghost

ghost commented May 18, 2020

Hi @JDziurlaj

At the moment we're shortlisting potential candidates for this release cycle and I think there are ideas here that I'd like to raise with our sponsors as worth consideration. Please correct me on any of the below:

  • there's an issue with incoming metadata (specifically if 'locking' properties is on) and some/all of these properties not being held on export
  • there's no way to know which version of the frictionless schemas is being adhered to once the export package is created (perhaps there is a place where this is/can be reported)
  • without examining the incoming and outgoing schemas by hand, the user cannot quickly determine whether they are in fact the same or different (a hash being one way to solve this)

I think a hash and reporting of the version (if there is not already a 'right' way to show this in the frictionless spec) are straightforward enough to do and something that I can raise with sponsors.

If I've understood, there's a second issue here about the persistence of some properties from the original import to export. I'll look through again to see what happens to existing metadata from import to export, but if you had some examples of data that show this to post here, that might help speed things along for working out the effort required. I can certainly look at where we might be able to lock things down further or perhaps not overwrite certain properties.

@ghost ghost added this to the Import and Export enhancements milestone May 18, 2020
@ghost ghost added est:Score=2 Score for estimate of effort required (scale of 1 upwards) and removed assessment:Required Assessment of effort required labels May 18, 2020
@markwheels markwheels added the support:Approved Approved to be done under the support agreement label May 18, 2020
@JDziurlaj
Author

Hi @mattRedBox,

You captured the essence of our request. A minor correction would be that metadata is lost whether or not the properties are 'locked'. Here is a Gist showing a table-schema prior to import.

There is metadata such as created, lastModified, version, etc. that has been removed from the data-curator "packaged" version.
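One quick way to see which top-level keys were dropped is to compare the two descriptors directly. The sketch below is an illustration under assumptions: the function name is hypothetical and the sample values are invented, not taken from the Gist.

```python
def dropped_keys(imported, exported):
    """Return the top-level keys present on import but missing on export."""
    return sorted(set(imported) - set(exported))


# Invented sample descriptors for illustration; only the key names
# created, lastModified and version come from the thread.
imported_schema = {
    "fields": [{"name": "id", "type": "integer"}],
    "created": "2020-01-01",
    "lastModified": "2020-04-01",
    "version": "1.0.0",
}
exported_schema = {
    "fields": [{"name": "id", "type": "integer"}],
}

print(dropped_keys(imported_schema, exported_schema))
# prints ['created', 'lastModified', 'version']
```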

@ghost

ghost commented May 18, 2020

Hi @JDziurlaj
Ah, thanks, now I see. Yes, I think this was a deliberate implementation a while back to go one way or the other: we stayed with what we thought was conservative at the time and just stuck with frictionless properties that we could manage without introducing potential risks. However, there may be more recently implemented frictionless properties which we're not fully capturing yet. I'll have a look and see.
I have this issue in our next milestone (it sits under the major tasks to do, which are #987 and #986), so I may be able to get to this, perhaps introducing a toggle to turn this behaviour on/off, i.e. keep existing properties and write them out afterwards. You'll note that we also have #988, which I think ties in well with this idea of having other properties.
Not sure if I'll have time to get to all the ideas above (we tend to throw as many ideas as we can into the milestones and just work through in priority until we have chewed up our time allocation), but I think I can get to introducing the means to show the frictionless schema versions used in Data Curator. Did you have any thoughts about how/where we could/should display these schema versions?

@ghost

ghost commented May 18, 2020

Hi @JDziurlaj
And thanks for this gist. Do I have your permission to use it, or parts of it, in implementing tests that I might add to the application?

@JDziurlaj
Author

Did you have any thoughts about how/where we could/should display these schema versions?

I would put it under the Table sidebar, much like Package has a version.

Yes, you may use the Gists, they are publicly available.

@ghost

ghost commented May 19, 2020

Ok thanks @JDziurlaj
Yep makes sense - I'll try to use a label that makes it clear though that this is (a read-only?) frictionless schema version, rather than the version assigned to the metadata by the user. Help text too might be useful.

@ghost

ghost commented Jun 1, 2020

Looking to combine some of this work, the non-Data-Curator properties persistence, with: #988.

ghost pushed a commit that referenced this issue Jun 23, 2020
… columns, tables or packages. Removed commented out code.
ghost pushed a commit that referenced this issue Jun 23, 2020
…side of app life. Corrected ids so unique.
ghost pushed a commit that referenced this issue Jun 23, 2020
ghost pushed a commit that referenced this issue Jun 23, 2020
…ackage properties when custom removed. Keep all preferences in column, table, packages to simplify synchronizing. Match for custom type only happens in rendering of view.
ghost pushed a commit that referenced this issue Jun 23, 2020
… for TableProperties as for PackageProperties.
ghost pushed a commit that referenced this issue Jun 23, 2020
ghost pushed a commit that referenced this issue Jun 23, 2020
…name is unique in preferences. Set max width for custom name (but still allow user to scroll full text.
ghost pushed a commit that referenced this issue Jun 23, 2020
ghost pushed a commit that referenced this issue Jun 23, 2020
…e, table and columns on export. Ensure custom names are not reserved by current package,table or column properties.
ghost pushed a commit that referenced this issue Jun 23, 2020
… 'reserved' properties. Fixed in export so store properties not mutated.
ghost pushed a commit that referenced this issue Jun 23, 2020
ghost pushed a commit that referenced this issue Jun 29, 2020
@ghost

ghost commented Jul 6, 2020

Hi @JDziurlaj
I have been able to bring in the up-to-date frictionless libraries, but there are a couple of properties (like date) that we don't yet explicitly add to Data Curator.
Because of this, I think displaying a version might give people the wrong idea about what is and isn't present in the displayed properties.
As we were already doing work on #988, and because our sponsor was particularly keen on this idea (having custom properties), I thought this might be a way to satisfy at least one of the issues raised here (existing properties that didn't propagate through).
So following on from this idea, a user can now go into the 'Preferences' menu and specify custom properties for:

  • Column
  • Table
  • Package

These are added to the respective menus, where a user can manually enter values (we just haven't added the means to get these another way yet).
That way these custom property names are persisted beyond Data Curator closing.
Custom key/value pairs can also be exported. While it doesn't address adding a hash for the file or package properties used, hopefully it provides some use for the issue(s) raised here. The work to accommodate custom key/value pairs for import and export unfortunately took up all of the remaining time I had allocated to this release.

@ghost ghost closed this as completed Jul 6, 2020
@JDziurlaj
Author

So to be clear, you are saying that custom properties are or are not maintained through the Import Package/Column properties pulldown function?

@ghost

ghost commented Jul 15, 2020

Hi @JDziurlaj
Yes, custom properties are maintained (with updates) in the latest beta release.
At the moment, the catch is of course:

  • you have to have these custom names created under the Data Curator 'Preferences' menu. If the custom names don't exist, an import won't bring in these custom key values.
  • the custom key/value pairs can only be strings at the moment.

I've lodged an issue to address the use of 'toggling' certain conventional/expected behaviours. It could be that we add to this list, say, a preference toggle that:

  • toggles between whether custom names are automatically created where they don't exist (a little tricky as we'd have to make sure that the custom names didn't clash with frictionless property names)
  • has a sensible default for whether this toggle is initially turned on or off.
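The import rules and the name-clash check described above can be sketched as follows. This is a rough illustration, not Data Curator's implementation: the function names are hypothetical, the reserved set is only a partial list of Table Schema descriptor keys, and skipping non-string values is an assumption about how the "strings only" limit behaves.

```python
# Partial list of Table Schema descriptor keys, used here as an assumed
# stand-in for the "frictionless property names" a custom name must avoid.
RESERVED_TABLE_KEYS = {"fields", "primaryKey", "foreignKeys", "missingValues"}


def import_custom_properties(descriptor, preferred_names):
    """Keep only custom keys already defined in Preferences, strings only."""
    kept = {}
    for name in preferred_names:
        value = descriptor.get(name)
        # Assumption: non-string values are skipped rather than converted.
        if isinstance(value, str):
            kept[name] = value
    return kept


def can_auto_create(name, reserved=RESERVED_TABLE_KEYS):
    """Return True if a custom name would not clash with a reserved key."""
    return name not in reserved
```

With such a toggle turned on, any imported key that passes `can_auto_create` could be added to the preference list before `import_custom_properties` runs.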

ghost pushed a commit that referenced this issue Apr 21, 2021
… columns, tables or packages. Removed commented out code.
ghost pushed a commit that referenced this issue Apr 21, 2021
…side of app life. Corrected ids so unique.
ghost pushed a commit that referenced this issue Apr 21, 2021
ghost pushed a commit that referenced this issue Apr 21, 2021
…ackage properties when custom removed. Keep all preferences in column, table, packages to simplify synchronizing. Match for custom type only happens in rendering of view.
ghost pushed a commit that referenced this issue Apr 21, 2021
… for TableProperties as for PackageProperties.
ghost pushed a commit that referenced this issue Apr 21, 2021
ghost pushed a commit that referenced this issue Apr 21, 2021
…name is unique in preferences. Set max width for custom name (but still allow user to scroll full text.
ghost pushed a commit that referenced this issue Apr 21, 2021
ghost pushed a commit that referenced this issue Apr 21, 2021
…e, table and columns on export. Ensure custom names are not reserved by current package,table or column properties.
ghost pushed a commit that referenced this issue Apr 21, 2021
… 'reserved' properties. Fixed in export so store properties not mutated.
ghost pushed a commit that referenced this issue Apr 21, 2021
ghost pushed a commit that referenced this issue Apr 21, 2021
This issue was closed.