Add a generic Property-Values mechanism for long-tail use (e.g. e-commerce, EXIF) #263
Comments
Used in #262 for Automobiles. |
I've asked Martin to break out the property-value piece from his larger proposal (currently all in one branch). |
I just created an individual pull request for the property-values contribution for better tracking, see here:

As for the comments raised by Tom Marsh in https://lists.w3.org/Archives/Public/public-vocabs/2015Jan/0004.html:

"I am supportive of adding the proposed property-value and EXIF changes into schema.org, but I would like to see them separated out from the other changes so we can approve and incorporate them independently and so that there is a clearer change history for people to follow in GitHub."

This is implemented with this pull request.

"Assuming we make this change, however, I think it is essential that we provide guidance on when it is acceptable to use these constructs. In particular, if publishers start to use property-value pairs where there are equivalent schematized properties, it will significantly dilute the value of the vocabulary. Therefore, I think we need to document a requirement that the name in the property-value pairs cannot match a schema.org property (it would therefore be considered invalid markup if the name did match)."

I added a note to the additionalProperty property, stating "Note: Do not use additionalProperty if there is a specific property for this characteristic readily defined in schema.org."

As for prohibiting a property-value pair with a name that exists in schema.org, I am not recommending that, because
a) publishers may have properties in their local databases that accidentally clash with schema.org names (e.g. from a table of 200 product features for a technical product). Catching those and implementing specific handling can be a bit of a challenge for implementers.
b) the local database schemas might use names for properties with a different meaning (e.g. weight for the package weight).

It is clear that we should encourage publishers to use specific properties when possible and that the mechanism must not be used to dilute the core vocabulary. I think the current proposal strikes a balance.
"I think we should also have an informal agreement as a community that we will make additions to the vocabulary for any properties that turn out to be widely used in property-value pairs so that we can encourage more normalized and consistent representations." I agree, but this is something that should be mentioned in a blogpost or implementation notes for this feature. |
@mfhepp, the changes mostly look good to me. I added a few comments in the change itself. Beyond that:
|
@tmarshbing: Thanks!
It all boils down to whether you want a lot of such data or rather less and more conforming. Since additionalProperty is essentially limited to Product and Place, I recommend avoiding such a strong and formal requirement. Martin |
For 2, I think removing it would be fine. There are already lots of examples (which is great!!). For 1 and 3, I'd like to get some additional perspectives from others before we decide. @danbri, for example, what are your thoughts?

My take for 1 is that it shouldn't break backward compatibility to add language saying that free-text is allowed on unitCode. Presumably, clients and tools already have to handle the case that the unit code is not recognized. This would more clearly define what the behavior should be in such cases. That said, if others also support adding unitText, I don't have a big problem with it.

For 3, I would prefer the "rather less and more conforming" version. To some extent, I see this as analogous to the question of whether we would rather have a non-marked-up product details page or one that conforms to schema.org. Given a sufficiently sophisticated client, we can read the not-marked-up page, but markup makes it so that many more clients can successfully read the data. If we make it too easy for publishers to "just use name-value-pairs", I think we will end up in a situation closer to the not-marked-up page case for consumers, since the names in the name-value pairs will have no agreed-upon meaning. To put it another way, if we end up with more total data (including name-value pairs) but less data mapped to the vocabulary, I think we've done the community a disservice. |
I just removed the example for multidimensional values, see mfhepp@bd79ff8. |
As for unitText vs. unitCode, I had a chat with Dan yesterday and explained why I have a pretty strong preference to keep the two properties:
So if you are fine with it, I would stick to unitText. |
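To make the unitCode / unitText distinction concrete, here is a minimal sketch of how a publisher-side tool might use both properties. It is only one possible reading of the discussion above, not the agreed behaviour: the helper name `measured_property` and the tiny lookup table are hypothetical, and the three codes shown (MMT, CMT, KGM) are just a small illustrative subset of the UN/CEFACT Common Codes.

```python
# Hypothetical lookup from local unit strings to UN/CEFACT Common Codes.
# Only a tiny illustrative subset; a real tool would need a fuller table.
KNOWN_UNIT_CODES = {"mm": "MMT", "cm": "CMT", "kg": "KGM"}


def measured_property(name, value, unit):
    """Build a schema.org PropertyValue dict for a feature with a unit.

    Assumption for this sketch: use unitCode when the local unit string
    maps to a known code, otherwise keep the free text in unitText.
    """
    node = {"@type": "PropertyValue", "name": name, "value": value}
    code = KNOWN_UNIT_CODES.get(unit.strip().lower())
    if code is not None:
        node["unitCode"] = code
    else:
        node["unitText"] = unit
    return node


width = measured_property("Width", 120, "mm")
density = measured_property("Thread count", 300, "threads per inch")
```

With two separate properties, a consumer can tell a standardized code from a free-text unit without having to guess from the string itself.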
As for avoiding the misuse of the new mechanism for existing schema.org properties: I also discussed this with Dan and we reached agreement that this should be handled in the documentation. The current text says so pretty clearly; we should complement that by a blog post at the time of the release or afterwards. I am against a strict handling of this, because of the following: One of the main use cases for this is shop and other e-commerce applications. In the past, we built or helped others develop many extension packages for shop software, which are now running on 50 - 100 k shop sites with likely billions of products and offers. This was only possible because most of the extensions allow for "one-click" installations with a clever mapping from the internal db schemas to schema.org / GoodRelations, with no need for the shop owner to manually define complex mappings etc. Now, in such software, the product features are typically defined by the shop owner or imported from a vast number of data sources, and products can have 30 - 200 of them. Asking a developer to
will be a very significant burden for a developer. Yet it will not necessarily improve the amount or quality of data you have. Developers will either choose to exclude such properties from the markup or use simple heuristics, which may not work reliably. So I would tell developers:
So if you are fine with it, I will keep the current description. The wording for a blogpost at release time should be discussed. |
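As a rough illustration of the kind of shop-extension mapping described above, the sketch below sends a feature to a specific schema.org property only when a small, hand-maintained table says so, and otherwise falls back to additionalProperty. The table `SPECIFIC_PROPERTIES` and the function `features_to_markup` are hypothetical names made up for this example, and matching on the feature name alone is exactly the sort of simple heuristic that may not work reliably (a local "weight" could mean package weight, for instance).

```python
# Hypothetical table of local feature names that are assumed to be safe
# to map onto specific schema.org properties. Deliberately tiny.
SPECIFIC_PROPERTIES = {
    "color": "color",
    "colour": "color",
    "gtin13": "gtin13",
}


def features_to_markup(features):
    """Turn a shop's name/value feature table into a Product dict.

    Features with a known mapping use the specific property; everything
    else is kept as a PropertyValue under additionalProperty.
    """
    product = {"@type": "Product"}
    extras = []
    for name, value in features.items():
        target = SPECIFIC_PROPERTIES.get(name.strip().lower())
        if target is not None and target not in product:
            product[target] = value
        else:
            extras.append(
                {"@type": "PropertyValue", "name": name, "value": value}
            )
    if extras:
        product["additionalProperty"] = extras
    return product


markup = features_to_markup({"Colour": "red", "Spindle speed": "3000 rpm"})
```

Nothing here requires the shop owner to configure a mapping, which is the "one-click" constraint described in the comment; the cost is that anything outside the table stays generic.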
Thanks for the explanation, Martin. Would it be possible to add a couple of sentences to the documentation outlining the benefits of using the existing properties? In particular, consumers of the data can make better sense of well-defined properties. |
Something along these lines? "Note: publishers should be aware that applications designed to use specific schema.org properties (e.g. http://schema.org/width, http://schema.org/color, http://schema.org/gtin13, ...) will typically expect such data to be provided using those properties, rather than using the generic property/value mechanism." |
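The point of that note can be sketched from the consumer side. The snippet below assumes a deliberately naive application that only looks at the specific gtin13 property; the function `read_gtin13` and both product dicts are hypothetical, and the GTIN value is just an illustrative number.

```python
def read_gtin13(product):
    """Return the GTIN-13 as a naive consumer might: from the specific
    schema.org property only, ignoring additionalProperty entirely."""
    return product.get("gtin13")


# Same fact published two ways (illustrative data).
specific = {"@type": "Product", "gtin13": "4006381333931"}
generic = {
    "@type": "Product",
    "additionalProperty": [
        {"@type": "PropertyValue", "name": "GTIN13", "value": "4006381333931"}
    ],
}

found_specific = read_gtin13(specific)
found_generic = read_gtin13(generic)
```

The first lookup finds the identifier and the second does not, even though the data is present in both, which is the effect the proposed note warns publishers about.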
That works for me. |
ok! martin hepp
|
Hi Dan: Martin
martin hepp http://www.heppnetz.de On 20 Mar 2015, at 14:10, Dan Brickley notifications@github.com wrote:
|
This is now fixed and included in the pull request. See mfhepp@97df3ee |
@tmarshbing and I had a chat about this yesterday and would like to work up some 'health warning' text so that publishers understand the extra value that comes from using 'real' schema.org properties. My phrasing above was a bit vague, so we may take @mfhepp's as a starting point: "So I would tell developers: Tom may make a more specific suggestion here... |
I would really like for you to mention succinctly that:

Taking Advantage of Schema.org Properties and Promoting Reuse for Effective Market Reach

You as a Developer or Business Owner might feel as though your terms and phrasing are better suited than a competitor's or the general market's or even Schema.org's choices, and may think that use of Property/Value will lead to market differentiation and effective reach for your targeted audiences. However, what you might actually be doing is fragmenting your own industry. By not correctly aligning with peers (or even competitors) you might be confusing Search and App filters and your potential customers, ultimately hurting your penetration into those targeted audiences you were striving for. "You shoot yourself in the foot".

But by correctly taking advantage of existing Schema.org aligned terms, concepts, and our existing industry properties, you can help to deliver helpful hints to Search, App, and Market tools & filters. This allows unlimited and unexpected possibilities for your market reach, such as a consumer making their own targeted choice and being highly satisfied finding your product meets their exact needs, as well as allowing marketing tools to work in your competitive favor through effective ads, campaigns, & materials...all reusing and sharing the same language and semantics that is Schema.org. |
I hesitate to use too strong language. We want to motivate sites to publish lots of product details data with that mechanism. While I agree that this should not be regarded as a shortcut that frees lazy developers from using proper schema.org properties, it should also not be described too negatively. Motivating Web sites to mark up product data sheets with 50 - 200 properties across hundreds of industries is a huge opportunity, and schema:PropertyValue is, from a few years of trying to lift such data, the most feasible approach so far. Note that the proposal comes from our attempts to develop extensions that automatically add schema.org markup to Web shop software and PIM applications. They typically manage product features only at the level of named properties - string + value, sometimes an extra string for a unit or interval. The only way to write such extensions with hassle-free installation is to take the data from the shops and PIM applications as they stand, without asking the shop owners to manually map their properties to standard properties. Actually, we tried both ways: Some of the extensions we developed allow granular configuration at the level of individual products or the mapping of popular properties like GTIN13 to standard GoodRelations properties. Such features have almost never been used, and if used, the result was often unreliable. So please - let's clarify that additionalProperty is not recommended for existing predefined properties, but that
Also note that the semantic heterogeneity of product features is very significant, so it is sometimes really hard to judge whether available data matches an existing property. Martin
martin hepp http://www.heppnetz.de
|
I agree with all your points @mfhepp, certainly; who wouldn't. However, we should try to place a very light onus of due diligence on sites and give them the necessary background information and best practices, despite past hassles. So I guess what the good guidance encompasses is what is up for discussion...as @danbri says, "a health warning text". And I think I make some important points in my version of the "health warning text". One more point: @mfhepp, don't you agree that we have had that proliferation of data already? ...the problem was that it was not structured enough, if at all. Sure, Property/Value helps a bit...but taking the time to provide highly structured data benefits everyone. We should always try to encourage the latter before the former, and that is what I think this "health warning text" should promote, alongside your bail-out mechanisms for non-standard properties, which I agree with. |
I think the current wording as on http://sdo-property-value-and-cars.appspot.com/additionalProperty is sufficient. As for the proliferation of product data semantics: This has plenty of causes and has been an open problem for decades, so while I agree it will be good to strive for more uniform data structures, we should not mix this with the immediate aim of providing a mechanism for exposing such data as it is available now. |
If I might speak as an SEO specialist of Sanoma for a moment, for the site I'm currently working on Property/Value is the only realistic option there is for providing additional markup for close to a million items. Roughly a million items of which it isn't known upfront whether it's a Product or a Service, nor which specifications they have. Now because these items are added to the site via programmatic solutions there's no method for manually adjusting markup/values. But more importantly, even if there were a solution to do so manually, the site adds/removes/modifies close to 100.000 items PER DAY, meaning my employer would have to employ ±1000 Jarnos to be able to provide 'highly structured data'. Something that's definitely not going to happen, meaning we either deploy the Property/Value solution or don't publish any specifications at all. It's as simple as that. Now I agree with @thadguidry that proper guidance should be given, but I also agree with @mfhepp that strong language should be avoided, or else we run the risk that publishers (like the one I work for) might feel there's no or too little value in publishing Property/Value markup and therefore will probably decide not to publish it at all. Something I feel would be a big loss as, like @mfhepp, I think 'some structured data' is always better than none at all. |
@jvandriel I am sympathetic that there is some effort involved in providing highly structured data. But let us try not to encourage laziness, is all I am saying. In your particular case, there are probably programmatic solutions that solve your issue and would not require more than 1 person to manage. A good algorithm that can give you over 95% accuracy to determine if something is a Product or Service is all that you're probably missing. =) And if it does not exist already, it could be built through machine learning and human cognition...even using http://crowdcrafting.org/ or some such. I just want everyone to do their part, and I understand it's asking others to provide something for free. But we still need to encourage and enlighten them that the time and resources they spend help to expand the knowledge of their products and services. That includes content providers not taking unnecessary shortcuts by saying "it's too hard". Let's encourage a mentality of "if you think it's hard to provide highly structured data, you might consider that you're not taking the right approach, and there are folks that can certainly help you take the right approach to provide highly structured data via best practices, programmatic solutions, machine learning, and human cognition, just to name a few". @mfhepp I am not trying to distract from the aim of Property/Value. We need it. Everyone does. I just want to make sure we give proper guidance, advise them that providing structured data is not as hard as it seems, and that in many cases programmatic or other solutions exist to help even further. I just won't accept laziness. |
@thadguidry As you know, we are in agreement, so let's not start a virtual conflict ;-) but...
As said: Let's not mix the general aim of providing more machine-friendly information on the Web with the very tangible property-value mechanism for product features. People have tried for decades to e.g. consolidate taxonomic information about products (UNSPSC, eClass, ....), without major success. I will be convinced in a minute if you point me to an algorithmic solution that establishes proper alignment between all the standards from http://www.ebusiness-unibw.org/ontologies/pcs2owl/ (*) They are all available in OWL and follow a common GoodRelations meta-model. Still I know of no automated solution to align them. So it should be much easier than the general challenge which you consider "easy" ;-) Martin (*) A few of them must be generated locally using http://wiki.goodrelations-vocabulary.org/Tools/PCS2OWL due to copyright restrictions. |
Also: We should not force publishers of data on the Web to first complete a major data-cleansing and enrichment process before they can use schema.org. That would put a major delay on the whole process. That being said, we are in agreement that it is perfectly valid to create incentives for them - the better your data, the better the search engines will understand and present your information. |
@mfhepp Yup. :) No conflict. I just take a harder stance on the topic than others. +1 for "health warning text" in some form. Not necessarily mine. But something. |
I am also still very keen on health warning text. I would be fine with the wording @danbri proposed based on @mfhepp's original text: "Always use specific schema.org properties when a) they exist and b) you can populate them. Using PropertyValue as a substitute will typically not trigger the same effect as using the original, specific property." @thadguidry, I wonder if we could put a longer set of best practices - some version of what you started above - in a doc page/blog post and refer to the best practices also from the health text. In that way, I would hope we could address the concerns about not sounding too negative while still providing enough guidance to prevent publishers from shooting themselves in the foot. Thoughts? |
@tmarshbing yes, I had the same thoughts. I think a Blog post would be fine, looks like we could collect comments from it also, if need be. Take whatever you want from my example, it's CC0. And blogging it makes it easier for folks to share the info, socially. |
I've added the disclaimer, and also added it to releases.html |
Good find @mfhepp - it's done. One less open issue :) And thanks everyone for the discussion! |
Martin Hepp has a proposal: http://lists.w3.org/Archives/Public/public-vocabs/2014Dec/0057.html