Add a generic Property-Values mechanism for long-tail use (e.g. e-commerce, EXIF) #263

Closed
danbri opened this Issue Jan 22, 2015 · 30 comments

6 participants

@danbri danbri self-assigned this Jan 22, 2015
@danbri
Contributor
danbri commented Jan 29, 2015

Used in #262 for Automobiles.

@danbri
Contributor
danbri commented Feb 10, 2015

I've asked Martin to break out the property-value piece from his larger proposal (currently all in one branch).

@mfhepp
Contributor
mfhepp commented Mar 11, 2015

I just created an individual pull request for the property-values contribution for better tracking, see here:

#377

As for the comments raised by Tom Marsh in https://lists.w3.org/Archives/Public/public-vocabs/2015Jan/0004.html:

"I am supportive of adding the proposed property-value and EXIF changes into schema.org, but I would like to see them separated out from the other changes so we can approve and incorporate them independently and so that there is a clearer change history for people to follow in GitHub."

This is implemented with this pull request.

"Assuming we make this change, however, I think it is essential that we provide guidance on when it is acceptable to use these constructs. In particular, if publishers start to use property-value pairs where there are equivalent schematized properties, it will significantly dilute the value of the vocabulary. Therefore, I think we need to document a requirement that the name in the property-value pairs cannot match a schema.org property (it would therefore be considered invalid markup if the name did match)."

I added a note to the additionalProperty property, stating

"Note: Do not use additionalProperty if there is a specific property for this characteristic readily defined in schema.org."
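For illustration, here is a minimal sketch of markup that follows this note, written as a Python dict and serialized to JSON-LD. The product and its feature names ("Sensor type", "Wi-Fi") are made up for this example and are not taken from the pull request. `color` is set directly because schema.org already defines it; only features without a dedicated property go into `additionalProperty`.

```python
import json

# Hypothetical product; names and values are illustrative only.
product = {
    "@context": "http://schema.org",
    "@type": "Product",
    "name": "Example Camera X100",
    # A dedicated schema.org property exists, so use it directly.
    "color": "black",
    # Features without a dedicated property go into additionalProperty.
    "additionalProperty": [
        {"@type": "PropertyValue", "name": "Sensor type", "value": "CMOS"},
        {"@type": "PropertyValue", "name": "Wi-Fi", "value": "yes"},
    ],
}

markup = json.dumps(product, indent=2)
print(markup)
```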

As for prohibiting a property-value pair with a name that exists in schema.org, I am not recommending that, because

a) publishers may have properties in their local databases that accidentally clash with schema.org names (e.g. from a table of 200 product features for a technical product). Catching those and implementing specific handling can be a bit of a challenge for implementers.

b) the local database schemas might use names for properties with a different meaning (e.g. weight for the package weight).

It is clear that we should encourage publishers to use specific properties when possible and that the mechanism must not be used to dilute the core vocabulary. I think the current proposal strikes a balance.
We might want to clarify this in a blogpost or implementation notes for this feature.

"I think we should also have an informal agreement as a community that we will make additions to the vocabulary for any properties that turn out to be widely used in property-value pairs so that we can encourage more normalized and consistent representations."

I agree, but this is something that should be mentioned in a blogpost or implementation notes for this feature.

@tmarshbing

@mfhepp, the changes mostly look good to me. I added a few comments in the change itself. Beyond that:

  1. Do we really need unitText? An alternative would be to loosen the requirements for unitCode to allow either a code or the text that unitText allows. This seems more consistent with other areas where we allow either free-text or URIs.
  2. The example of sensor size (for multi-dimensional properties) seems like a stretch. The markup doesn't differentiate between multi-valued (e.g., the Ethernet and USB example) and multi-dimensional. How would a consumer know which it is meant to be? Assuming there is no way to distinguish, I would recommend that the multidimensional value be represented as a single value, such as "23.2 x 15.4".
  3. I still think we need to prohibit property-value pairs with names that exist in schema.org. For the extension mechanism being discussed (http://lists.w3.org/Archives/Public/public-vocabs/2015Mar/0034.html), we wouldn't allow extensions to reuse existing names. I don't see why we would make a different decision in this case, which is also, in some sense, an extension mechanism. It does add additional burden to the publisher (assuming they want to be compliant, of course), but I think we want them to spend the time to understand what parts of their data can be expressed natively in schema.org. PropertyValue should only be for the parts that aren't already in the vocabulary not as a way to get around or ignore the vocabulary.
@mfhepp
Contributor
mfhepp commented Mar 13, 2015

@tmarshbing: Thanks!

  1. As for unitText: I would like to keep unitText and unitCode distinct. On one hand, we do want people to use proper UN/CEFACT Common Codes when they can, because they are much more reliable and allow unit conversion etc. On the other hand, we want people to publish unit information as text when they cannot do better. Having two separate properties for this maintains backward compatibility with the original GoodRelations model, tools, and data, and reduces the task for a consuming client.

  2. I can update the example or we remove it for the moment.

  3. I am fine with both directions - it is more important that we have additionalProperty as soon as possible. For publishers who may have 200 properties for products to handle, I would however not introduce a barrier that requires filtering them for matching properties in schema.org, because the core motivation for the approach is that most e-commerce sites have plenty of such data but cannot map it to standardized properties (there is a long argument on the W3C mailing list why this is so). Now, forcing them to filter this data for existing property names in schema.org takes away that simplicity. I would rather recommend that consuming clients ignore such properties, or give priority to schema.org properties, than make such markup formally invalid. Note that the source systems store the properties as instance data, so it is not something that can be fixed at the schema level: We want to extend our shop software extension modules so that they can also expose product features without the shop operators having to manually map their product features to any given standard, but the extensions can only access the database schemas for storing property-values, while the actual properties are defined at the shop level.

It all boils down to whether you want a lot of such data or rather less and more conforming.

Since additionalProperty is essentially limited to Product and Place, I recommend avoiding such a strong and formal requirement.

Martin
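The consumer-side alternative suggested in point 3 (ignore clashing name-value pairs, or give priority to the specific properties, rather than treating the markup as invalid) could look roughly like the following sketch. The function name `effective_properties` and the tiny `KNOWN` set are assumptions made for this example; a real consumer would load the full list of schema.org property names.

```python
def effective_properties(product, known_properties):
    """Flatten a product into {name: value}, letting specific properties win."""
    merged = {}
    ignored = []
    for pv in product.get("additionalProperty", []):
        name = pv.get("name")
        if name in known_properties:
            # Name clashes with a vocabulary property: skip it, do not reject
            # the whole markup.
            ignored.append(name)
            continue
        merged[name] = pv.get("value")
    # Specific properties are copied last so they always take priority.
    for key, value in product.items():
        if key in known_properties:
            merged[key] = value
    return merged, ignored

# Tiny stand-in for the set of schema.org property names (an assumption).
KNOWN = {"name", "color", "weight", "width"}

product = {
    "@type": "Product",
    "name": "Example Drill",
    "color": "blue",
    "additionalProperty": [
        {"@type": "PropertyValue", "name": "Chuck size", "value": "13 mm"},
        # A local "weight" that actually means the package weight.
        {"@type": "PropertyValue", "name": "weight", "value": "2.1 kg (boxed)"},
    ],
}

props, ignored = effective_properties(product, KNOWN)
print(props)
print(ignored)
```

The clashing "weight" entry is dropped by the consumer instead of being misread as the product weight, while the publisher's markup stays valid.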

@tmarshbing

For 2, I think removing it would be fine. There are already lots of examples (which is great!!).

For 1 and 3, I'd like to get some additional perspectives from others before we decide. @danbri, for example, what are your thoughts?

My take for 1 is that it shouldn't break backward compatibility to add language saying that free-text is allowed on unitCode. Presumably, clients and tools already have to handle the case that the unit code is not recognized. This would more clearly define what the behavior should be in such cases. That said, if others also support adding unitText, I don't have a big problem with it.

For 3, I would prefer the "rather less and more conforming" version. To some extent, I see this as analogous to the question of whether we would rather have a non-marked-up product details page or one that conforms to schema.org. Given a sufficiently sophisticated client, we can read the not-marked-up page, but markup makes it so that many more clients can successfully read the data. If we make it too easy for publishers to "just use name-value-pairs", I think we will end up in a situation closer to the not-marked-up page case for consumers, since the names in the name-value pairs will have no agreed-upon meaning. To put it another way, if we end up with more total data (including name-value pairs) but less data mapped to the vocabulary, I think we've done the community a disservice.

@mfhepp
Contributor
mfhepp commented Mar 20, 2015

I just removed the example for multidimensional values, see mfhepp@bd79ff8.

@mfhepp
Contributor
mfhepp commented Mar 20, 2015

As for unitText vs. unitCode, I had a chat with Dan yesterday and explained why I have a pretty strong preference to keep the two properties:

  1. This allows us to keep markup for both the unitCode and a human-readable version of the unit, which can be useful in many cases.
  2. Historically, the main motivation for the whole mechanism and the unitText property was that there can be "refinement and lifting services" that take schema.org/GoodRelations data and enrich it. It is much easier to e.g. add a new triple with the proper unit code than to replace the unit text value with a unit code.
  3. We want to keep up the motivation for publishers to use UN/CEFACT codes on QuantitativeValue and PropertyValue whenever they can, because that makes unit conversion etc. much easier.

So if you are fine with it, I would stick to unitText.
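To make the "refinement and lifting" argument concrete, here is a minimal sketch of such a service, under the assumption that it works on PropertyValue dicts and a small hand-made lookup table. The function name `lift_unit` and the table are hypothetical; the codes shown (MMT, CMT, KGM, GRM) are UN/CEFACT Common Codes. The point is that the service only adds `unitCode` and leaves the publisher's `unitText` untouched.

```python
# Small illustrative lookup from free-text units to UN/CEFACT Common Codes.
UNIT_CODES = {"mm": "MMT", "cm": "CMT", "kg": "KGM", "g": "GRM"}

def lift_unit(property_value):
    """Return a copy with unitCode added when unitText is recognized."""
    enriched = dict(property_value)
    text = enriched.get("unitText", "").strip().lower()
    if "unitCode" not in enriched and text in UNIT_CODES:
        # Add the code next to the text instead of replacing the text.
        enriched["unitCode"] = UNIT_CODES[text]
    return enriched

pv = {"@type": "PropertyValue", "name": "Blade length",
      "value": 180, "unitText": "mm"}
lifted = lift_unit(pv)

# An unrecognized unit passes through unchanged.
unknown = lift_unit({"@type": "PropertyValue", "name": "Pack size",
                     "value": 6, "unitText": "bottles"})
print(lifted)
print(unknown)
```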

@mfhepp
Contributor
mfhepp commented Mar 20, 2015

As for avoiding the misuse of the new mechanism for existing schema.org properties: I also discussed this with Dan and we reached agreement that this should be handled in the documentation.

The current text says so pretty clearly; we should complement that by a blog post at the time of the release or afterwards.

I am against a strict handling of this, because of the following: One of the main use cases for this is shop and other e-commerce applications. In the past, we built or helped others develop many extension packages for shop software, which are now running on 50 - 100 k shop sites with likely billions of products and offers. This was only possible because most of the extensions allow for "one-click" installations with a clever mapping from the internal db schemas to schema.org / GoodRelations, with no need for the shop owner to manually define complex mappings etc.

Now, in such software, the product features are typically defined by the shop owner or imported from a vast number of data sources, and products can have 30 - 200 of them.

Asking a developer to

  1. filter out property names that are "reserved" (not just for product but in schema.org as a whole) and
  2. heuristically map those to special markup (e.g. schema:weight with schema:QuantitativeValue)

will be a very significant burden for a developer. Yet it will not necessarily improve the amount or quality of data you have. Developers will either choose to exclude such properties from the markup or use simple heuristics, which may not work reliably.

So I would tell developers:

  1. Always use specific schema.org properties when
    a) they exist and
    b) you can populate them.
  2. Using PropertyValue as a substitute will typically not trigger the same effect as using the original, specific property.

So if you are fine with it, I keep the current description. The wording for a blogpost at release time should be discussed.

@vholland
Contributor

Thanks for the explanation, Martin. Would it be possible to add a couple of sentences to the documentation outlining the benefits of using the existing properties? In particular, consumers of the data can make better sense of well-defined properties.

@danbri
Contributor
danbri commented Mar 20, 2015

Something along these lines? "Note: publishers should be aware that applications designed to use specific schema.org properties (e.g. http://schema.org/width, http://schema.org/color, http://schema.org/gtin13, ...) will typically expect such data to be provided using those properties, rather than using the generic property/value mechanism."

@vholland
Contributor

That works for me.

@mfhepp
Contributor
mfhepp commented Mar 20, 2015

ok!



@mfhepp
Contributor
mfhepp commented Mar 24, 2015

Hi Dan:
I will add this to my pull request asap.

Martin


@mfhepp
Contributor
mfhepp commented Mar 24, 2015

This is now fixed and included in the pull request. See mfhepp@97df3ee

@danbri
Contributor
danbri commented Apr 8, 2015

@tmarshbing and I had a chat about this yesterday and would like to work up some 'health warning' text so that publishers understand the extra value that comes from using 'real' schema.org properties. My phrasing above was a bit vague, so we may take @mfhepp's as a starting point:

"So I would tell developers:
Always use specific schema.org properties when a) they exist and b) you can populate them.
Using PropertyValue as a substitute will typically not trigger the same effect as using the original, specific property."

Tom may make a more specific suggestion here...

@thadguidry

I would really like for you to mention succinctly that:

Taking Advantage of Schema.org Properties and Promoting Reuse for effective Market Reach

You as a Developer or Business Owner might feel as though your terms and phrasing are better suited than a competitor's, the general market's, or even Schema.org's choices, and may think that use of Property/Value will lead to market differentiation and effective reach for your targeted audiences.

However, what you might be doing in actuality is fragmenting your own industry. By not correctly aligning with peers (or even competitors) you might be confusing Search and App filters and your potential customers, ultimately hurting your penetration into those targeted audiences you were striving for. "You shoot yourself in the foot".

But by correctly taking advantage of existing Schema.org aligned terms, concepts, and our existing industry properties, you can help to deliver helpful hints to Search, App, and Market tools & filters. This allows unlimited and unexpected possibilities for your market reach, such as a consumer making their own targeted choice and being highly satisfied finding your product meets their exact needs, as well as allowing marketing tools to leverage in your competitive favor through effective ads, campaigns, & materials...all reusing and sharing the same language and semantics that is Schema.org.

@mfhepp
Contributor
mfhepp commented Apr 8, 2015

I hesitate to use too strong language. We want to motivate sites to publish lots of product details data with that mechanism. While I agree that this should not be regarded as a shortcut that frees lazy developers from using proper schema.org properties, it should also not be described too negatively.

Motivating Web sites to mark-up product data sheets with 50 - 200 properties across hundreds of industries is a huge opportunity, and schema:PropertyValue is, from a few years of trying to lift such data, the most feasible approach so far.

Note that the proposal comes from our attempts to develop extensions that automatically add schema.org markup to Web shop software and PIM applications. They typically manage product features only at the level of named properties - string + value, sometimes an extra string for a unit or interval.

The only way to write such extensions with hassle-free installation is to take the data from the shops and PIM applications as they stand, without asking the shop owners to manually map their properties to standard properties.
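As a rough sketch of what "taking the data as they stand" can mean in such an extension: the feature rows below (name, value, optional unit string, with an interval stored as a tuple) are a hypothetical stand-in for a shop or PIM database, and `rows_to_additional_property` is a made-up helper name. Each row is passed through to a PropertyValue without any mapping to standard properties.

```python
def rows_to_additional_property(rows):
    """Map (name, value, unit) feature rows to PropertyValue dicts as-is."""
    result = []
    for name, value, unit in rows:
        pv = {"@type": "PropertyValue", "name": name}
        if isinstance(value, tuple):
            # An interval is stored as (low, high) in this example.
            pv["minValue"], pv["maxValue"] = value
        else:
            pv["value"] = value
        if unit:
            # The shop only knows a unit string, so it goes into unitText.
            pv["unitText"] = unit
        result.append(pv)
    return result

# Hypothetical rows as a shop or PIM database might store them.
rows = [
    ("Cable length", 1.5, "m"),
    ("Load capacity", (5, 20), "kg"),
    ("Housing material", "aluminium", None),
]

features = rows_to_additional_property(rows)
for pv in features:
    print(pv)
```

The resulting list would be placed under `additionalProperty` on the product; no shop owner input is needed at any step.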

Actually, we tried both ways: Some of the extensions we developed allow the granular configuration at the level of individual products or the mapping of popular properties like GTIN13 to standard GoodRelations properties.

Such features have almost never been used, and if used, the result was often unreliable.

So please - let's clarify that additionalProperty is not recommended for existing predefined properties, but that

  1. it is perfectly valid and recommended for non-standard product properties and
  2. it is better to use additionalProperty than not exposing a product feature.

Also note that the semantic heterogeneity of product features is very significant, so it is sometimes really hard to judge whether available data matches an existing property.

Martin


@thadguidry

I certainly agree with all your points, @mfhepp; who wouldn't?

However, we should try to place a very light and slight onus of due diligence for sites and give them the necessary background information and best practices, despite past historical hassles.
We cannot do everything for everyone but we can provide good guidance and we should....

So I guess what the good guidance encompasses is what is up for discussion...as @danbri says "a health warning text". And I think I make some important points in my version of the "health warning text"

One more point: @mfhepp, don't you agree that we have had that proliferation of data already? ...the problem was that it was not structured enough, if at all. Sure, Property/Value helps a bit...but taking the time to provide highly structured data benefits everyone. We should always try to encourage the latter before the former, and I think that is what this "health warning text" should promote, alongside your bail-out mechanisms for non-standard properties, which I agree with.

@mfhepp
Contributor
mfhepp commented Apr 9, 2015

I think the current wording as on

http://sdo-property-value-and-cars.appspot.com/additionalProperty

is sufficient.

As for the proliferation of product data semantics: This has plenty of causes and has been an open problem for decades, so while I agree it will be good to strive for more uniform data structures, we should not mix this with the immediate aim of providing a mechanism for exposing such data as it is available now.

@jvandriel

If I might speak as an SEO specialist at Sanoma for a moment: for the site I'm currently working on, Property/Value is the only realistic option for providing additional markup for close to a million items. For these items it isn't known upfront whether each is a Product or a Service, nor which specifications it has.

Now, because these items are added to the site via programmatic solutions, there's no method for manually adjusting markup/values. But more importantly, even if there were a way to do so manually, the site adds/removes/modifies close to 100,000 items PER DAY, meaning my employer would have to employ ±1000 Jarnos to be able to provide 'highly structured data'. That is definitely not going to happen, meaning we either deploy the Property/Value solution or don't publish any specifications at all. It's as simple as that.

Now, I agree with @thadguidry that proper guidance should be given, but I also agree with @mfhepp that strong language should be avoided, or else we run the risk that publishers (like the one I work for) feel there's no or too little value in publishing Property/Value markup and therefore decide not to publish it at all.

I feel that would be a big loss as, like @mfhepp, I think 'some structured data' is always better than none at all.

@thadguidry

@jvandriel I am sympathetic that there is some effort involved in providing highly structured data. But let us try not to encourage laziness; that is all I am saying. In your particular case, there are probably programmatic solutions that solve your issue and would not require more than one person to manage. A good algorithm that can give you over 95% accuracy in determining whether something is a Product or Service is all that you're probably missing. =) And if it does not exist already, it could be built through machine learning and human cognition...even using http://crowdcrafting.org/ or some such.

I just want everyone to do their part, and I understand it's asking others to provide something for free. But we still need to encourage and enlighten them that the time and resources they spend help to expand the knowledge of their products and services.

That includes content providers not taking unnecessary shortcuts by saying "it's too hard". Let's encourage a mentality of "if you think it's hard to provide highly structured data, you might consider that you're not taking the right approach and there are folks that can certainly help you take the right approach to provide highly structured data via best practices, programmatic solutions, machine learning, and human cognition, just to name a few".

@mfhepp I am not trying to distract the aim of Property/Value. We need it. Everyone does. I just want to make sure we give proper guidance, advise them that things are not as hard as they seem to provide structured data, and in many cases, programmatic or other solutions exist to help even further.

I just won't accept laziness.

@mfhepp
Contributor
mfhepp commented Apr 9, 2015

@thadguidry As you know, we are in agreement, so let's not start a virtual conflict ;-) but...

"I just won't accept laziness."

If we hadn't accepted laziness e.g. wrt. broken links and invalid markup, the Web would not have become what it is.

As said: Let's not mix the general aim of providing more machine-friendly information on the Web with the very tangible property-value mechanism for product features.

People have tried for decades to e.g. consolidate taxonomic information about products (UNSPSC, eClass, ....), without major success.

I will be convinced in a minute if you point me to an algorithmic solution that establishes proper alignment between all the standards from

http://www.ebusiness-unibw.org/ontologies/pcs2owl/ (*)

They are all available in OWL and follow a common GoodRelations meta-model. Still I know of no automated solution to align them.

So it should be much easier than the general challenge which you consider "easy" ;-)

Martin

(*) A few of them must be generated locally using http://wiki.goodrelations-vocabulary.org/Tools/PCS2OWL due to copyright restrictions.

@mfhepp
Contributor
mfhepp commented Apr 9, 2015

Also: We should not force publishers of data on the Web to first complete a major data-cleansing and enrichment process before they can use schema.org. That would put a major delay on the whole process.

That being said, we are in agreement that it is perfectly valid to create incentives for them - the better your data, the better the search engines will understand and present your information.

@thadguidry

@mfhepp Yup. :) No conflict. I just take a harder stance on the topic than others.

+1 for "health warning text" in some form. Not necessarily mine. But something.

@tmarshbing

I am also still very keen on health warning text. I would be fine with the wording @danbri proposed based on @mfhepp's original text: "Always use specific schema.org properties when a) they exist and b) you can populate them. Using PropertyValue as a substitute will typically not trigger the same effect as using the original, specific property."

@thadguidry, I wonder if we could put a longer set of best practices - some version of what you started above - in a doc page/blog post and refer to the best practices also from the health text. In that way, I would hope we could address the concerns about not sounding too negative while still providing enough guidance to prevent publishers from shooting themselves in the foot. Thoughts?

@thadguidry

@tmarshbing yes, I had the same thoughts. I think a Blog post would be fine, looks like we could collect comments from it also, if need be. Take whatever you want from my example, it's CC0. And blogging it makes it easier for folks to share the info, socially.

@danbri
Contributor
danbri commented May 12, 2015

I've added the disclaimer, and also added it to releases.html

@vholland
Contributor

@mfhepp @danbri

I am scanning for easy issues to implement or close. Any reason to leave this open?

@danbri
Contributor
danbri commented Sep 15, 2015

Good find @mfhepp - it's done. One less open issue :) And thanks everyone for the discussion!

http://schema.org/PropertyValue

@danbri danbri closed this Sep 15, 2015