Update of BP4: Make your spatial data indexable by search engines
This addresses ACTION-232, ACTION-233 and parts of ACTION-234.
cportele committed Jan 19, 2017
1 parent f8bd077 commit 54f389e
Showing 1 changed file with 112 additions and 20 deletions.
132 changes: 112 additions & 20 deletions bp/index.html
@@ -959,7 +959,6 @@ <h4 class="subhead">Benefits</h4>
<div class="practice">
<p><span id="indexable-by-search-engines" class="practicelab">Make your spatial data indexable by search engines</span></p>
<p class="practicedesc">Search engines should be able to crawl spatial data on the Web and index spatial things for direct discovery by users.</p>
<p class="issue" data-number="446">Can we consider publishing <em>spatial</em> data in a way that search engines can crawl as a <em>best</em> practice? We’re confident that this is the correct thing to do, but where is the evidence of adoption?</p>
<section class="axioms">
<p class="subhead">Why</p>

@@ -970,53 +969,146 @@ <h4 class="subhead">Benefits</h4>
<li>the data itself is not indexed - discovery relies on the metadata records that are often sparsely populated or out of date.</li>
</ol>

<p>Search engines are the common starting point for people looking for content on the Web that is widely understood. By publishing spatial data in a way that enables their crawlers to index <a>spatial things</a>, the fidelity of search results should improve. Users will be able to directly search for specific entities rather than having to look for a dataset and then parse through it; e.g. to search for "Anne Frank’s House" (<a href="https://g.co/kg/m/02s5hd"><code>https://g.co/kg/m/02s5hd</code></a>) rather than looking for a dataset about "Cultural Heritage in Amsterdam" and hoping that it contains a reference to what you’re interested in.</p>

<p class="note">At present, spatial information is not widely exploited by search engines. However, by increasing the volume of spatial information presented to search engines, and the consistency with which it is provided, we expect search engines to begin offering spatial search functions. We already see evidence of this in the form of contextual search, such as prioritization of search results from nearby entities.</p>
<p>Search engines are the common starting point for people looking for content on the Web that is widely understood. By publishing spatial data in a way that enables their crawlers to index spatial datasets including each <a>spatial thing</a>, the fidelity of search results should improve. Users will be able to directly search for specific entities rather than having to look for a dataset and then parse through it; e.g. to search for "Anne Frank’s House" (<a href="https://g.co/kg/m/02s5hd"><code>https://g.co/kg/m/02s5hd</code></a>) rather than looking for a dataset about "Cultural Heritage in Amsterdam" and hoping that it contains a reference to what you’re interested in.</p>

<p class="issue" data-number="181">More discussion is required on how to structure meaningful (spatial) queries with search engines (e.g. based on identifier, location, time etc.).</p>
<p class="note">At present, spatial information is not widely exploited by search engines. However, by increasing the volume of spatial information presented to search engines, and the consistency with which it is provided, we expect search engines to begin offering spatial search functions. We already see evidence of this in the form of contextual search, such as prioritization of search results from nearby entities. In addition, search engines are beginning to offer more structured, custom searches that return only results that include certain [[SCHEMA-ORG]] types, like <a href="http://schema.org/Dataset">Dataset</a>, <a href="http://schema.org/Place">Place</a> or <a href="http://schema.org/City">City</a>.</p>
</section>
<section class="outcome">
<p class="subhead">Intended Outcome</p>
<p>Information about spatial things is indexed by search engines.</p>
<p>Information about spatial datasets and things is indexed by search engines.</p>

<p>Users can find spatial things using common search engines.</p>
</section>
<section class="how">
<p class="subhead">Possible Approach to Implementation</p>

<p>First, publish an HTML Web-page for each spatial thing. Second, make sure that those pages can be crawled.</p>
<p>In general, you need to</p>
<ol>
<li>publish an HTML Web-page for the spatial dataset and each spatial thing in it, and</li>
<li>make sure that those pages can be crawled.</li>
</ol>

<p>The Web-page for the dataset is an entry point for the search engines to crawl your data. This "landing page" needs to include links that the Web-crawler can follow to reach the page for each spatial thing in the dataset. Where you have a larger collection of spatial things, you should support paging through the collection.</p>
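<aside class="example">
<p>A minimal sketch of how one page of a paged landing page might be assembled, so that a crawler can follow "next" links through the whole collection. The function name, the <code>?page=</code> query parameter and the default page size are illustrative assumptions, not part of any particular tool.</p>

```python
import math


def landing_page_links(thing_ids, base_url, page, page_size=100):
    """Return the links a crawler needs on one page of the landing page."""
    page_count = max(1, math.ceil(len(thing_ids) / page_size))
    if page not in range(1, page_count + 1):
        raise ValueError("page out of range")
    start = (page - 1) * page_size
    links = {
        # One link per spatial thing listed on this page.
        "things": [base_url + thing_id for thing_id in thing_ids[start:start + page_size]],
        "prev": None,
        "next": None,
    }
    # Hypothetical paging scheme: a "page" query parameter on the landing page URI.
    if page > 1:
        links["prev"] = "%s?page=%d" % (base_url, page - 1)
    if page_count > page:
        links["next"] = "%s?page=%d" % (base_url, page + 1)
    return links
```

<p>The returned links would then be rendered as ordinary HTML anchors, since crawlers discover pages by following them.</p>
</aside>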

<p>You should also consider using <a href="https://www.sitemaps.org/index.html">Sitemaps</a> to direct the Web-crawler. A single Sitemap file is limited to 50,000 URLs, so larger datasets need to be split across several Sitemap files that are listed in a Sitemap index file.</p>
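<aside class="example">
<p>The Sitemaps protocol caps each Sitemap file at 50,000 URLs and provides Sitemap index files for listing several Sitemaps. The following is a rough sketch, using only the Python standard library, of how the pages of a larger dataset might be split across several Sitemap files. The function names are hypothetical, and where the resulting files are published is left to the publisher.</p>

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50000  # per-file limit in the Sitemaps protocol


def build_sitemaps(page_urls, max_urls=MAX_URLS_PER_SITEMAP):
    """Split page URLs into chunks and return one urlset document per chunk."""
    documents = []
    for start in range(0, len(page_urls), max_urls):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for page_url in page_urls[start:start + max_urls]:
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = page_url
        documents.append(ET.tostring(urlset, encoding="unicode"))
    return documents


def build_sitemap_index(sitemap_urls):
    """Return a sitemapindex document that points at each Sitemap file."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for sitemap_url in sitemap_urls:
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = sitemap_url
    return ET.tostring(index, encoding="unicode")
```

<p>Each document returned by <code>build_sitemaps</code> would be saved as its own file, and the URLs of those files passed to <code>build_sitemap_index</code>.</p>
</aside>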

<p>For very large datasets, paging through thousands of pages is not useful for a human either. Consider supporting filtering and/or organising the spatial things into subsets.</p>

<aside class="example">
<p>In the case of an address dataset, you could organise the spatial things (the addresses) by municipality, post code and street name so that a human user can reach a building with a few clicks.</p>
</aside>

<p>A pre-condition for this best practice is <a href="#globally-unique-ids">Best Practice 7 ("Use globally unique persistent HTTP URIs for spatial things")</a> as persistent identifiers are essential to support reliable indexing and linking. Traditionally spatial datasets have not been maintained with stable identifiers for spatial things, but to share spatial data on the Web stable identifiers are a must. Sharing spatial data is more than "just" making the dataset available on the Web.</p>

<p>Each Web-page can likely be generated programmatically from the data you hold about the spatial thing, either directly from the data or by using an API that makes the data available on the Web.</p>

<aside class="example">
<p>Possible implementation approaches for addressing this best practice in the context of an existing SDI are discussed in more detail in <a href="#convenience-apis">Best Practice 11: Expose spatial data through 'convenience APIs'</a>. Examples include using a proxy tool like <a href="https://github.com/interactive-instruments/ldproxy">ldproxy</a>, or mapping the data in the SDI dynamically to crawlable resources on the Web using the [[R2RML]] standard and Linked Data publication tools. Both approaches generate crawlable data from features in your spatial datasets at query time and allow the data on the Web to be enriched with additional information and links.</p>
</aside>

<p>Keep in mind that the HTML representations should not be designed primarily for search engines; they should present the data in a clear and understandable way to human users. The page about the spatial thing should be useful to a user and encourage others to link to the page when they share other information about the spatial thing. This will typically also improve the ranking of these pages in search results.</p>

<aside class="example">
<p>The <a href="http://maps.nanaimo.ca/data/property/">Property Search in the City of Nanaimo, Canada</a> provides a landing page and one page per property. The landing page offers a search capability and the option to browse by street.</p>
<p>The <a href="http://environment.data.gov.uk/bwq/profiles/">Bathing Water Quality Explorer for England</a> provides a landing page and one page per site. Sites can be searched, selected from a list or in a map.</p>
<p>In both cases, the pages of the spatial things are generated from the underlying data at request time.</p>
<p>The property Web-pages in Nanaimo also use [[MICRODATA]] annotations using [[SCHEMA-ORG]], which is discussed below.</p>
</aside>

<p>In addition to exposing the spatial data as linked HTML Web-pages, indexing by search engines can be further enhanced by incorporating a description of the spatial thing as structured markup (in particular [[MICRODATA]] or [[JSON-LD]] annotations using [[SCHEMA-ORG]]), as this enables the search engines to make more detailed assumptions about your resource. This is helpful not only to search engines, but also to other tools that want to understand more about the semantics of the resource, for example, its location.</p>

<p>In [[SCHEMA-ORG]], a spatial dataset is a <a href="http://schema.org/Dataset">Dataset</a> and a spatial thing is in general a <a href="http://schema.org/Place">Place</a> or an <a href="http://schema.org/Event">Event</a>. For some types of spatial things, more specific sub-types exist, for example <a href="http://schema.org/City">City</a> or <a href="http://schema.org/Mountain">Mountain</a>.</p>

<p>Map the source data of your SDI dynamically to crawlable resources on the Web with the [[R2RML]] standard and Linked Data publication tools. This process generates crawlable data from features in your spatial data at query time. Using R2RML makes it possible to expose more detailed information and references than with a standard WFS.</p>
<p class='note'>We should point to Best Practice 11, possible implementation 3</p>
<p>Indexing can be further enhanced by incorporating a description of the spatial thing as structured markup (such as [[MICRODATA]] and [[SCHEMA-ORG]]), as this enables the search engines to make more detailed assumptions about your resource. Each Web-page can likely be generated programmatically from the data you hold about the corresponding spatial thing. These Web-pages should also provide a mechanism to download data in the formats you decide to support. [[DWBP]] <a href="https://www.w3.org/TR/dwbp/#MultipleFormats">Best Practice 14: Provide data in multiple formats</a> provides guidance on how you might publish HTML content alongside the data.</p>
<p>Location information about a spatial thing is typically provided using a geometry (<a href="http://schema.org/GeoCoordinates">GeoCoordinates</a> or <a href="http://schema.org/GeoShape">GeoShape</a>) or a <a href="http://schema.org/PostalAddress">PostalAddress</a>. [[SCHEMA-ORG]] coordinates are restricted to WGS 84 with longitude and latitude. Supported geometry types are points, line strings, polygons, boxes and circles.</p>

<p>Through the use of [[SCHEMA-ORG]] annotations, search engines and others can connect location information with other information, e.g. about the nature of the spatial thing, opening hours, contact details, etc.</p>

<p>The use of [[SCHEMA-ORG]] for spatial data is in its early days and has to be understood as an "emerging practice".</p>

<p>You also need to provide an entry point for the search engines to crawl your data. Create a “landing page” for the dataset, itself incorporating structured markup, from which the Web-crawler can follow links to the page for each spatial thing. Where you have a large collection of spatial things, you may need to allow the Web-crawler to page through the collection. You should also consider using <a href="https://www.sitemaps.org/index.html">Sitemaps</a> to direct the Web-crawler.</p>
<aside class="example">
<p>This code-snippet illustrates a [[JSON-LD]] annotation using a [[SCHEMA-ORG]] <a href="http://schema.org/Dataset">Dataset</a> for an address dataset in the Netherlands that may be embedded in the HTML of the Web-page. It includes a name, a description, the spatial coverage using a bounding box, the URL of the Web-page, and a link to another dataset containing this dataset. The same annotation could also be provided using [[MICRODATA]], but we use [[JSON-LD]] here as this presents the structured data in a more human-readable way.</p>
<pre class="highlight" id="ex-schemaorg-dataset" title="">&lt;script type="application/ld+json">
{
"@context" : {
"@vocab" : "http://schema.org/"
},
"@type" : "Dataset",
"@id" : "http://www.ldproxy.net/bag/inspireadressen/",
"name" : "Adressen",
"description" : "INSPIRE Adressen afkomstig uit de basisregistratie Adressen, beschikbaar voor heel Nederland",
"url" : "http://www.ldproxy.net/bag/inspireadressen/",
"isPartOf" : {
"@type" : "Dataset",
"url" : "http://www.ldproxy.net/bag/"
},
"keywords" : "Adressen",
"spatialCoverage" : {
"@type" : "Place",
"geo" : {
"@type" : "GeoShape",
"box" : "3.053,47.975 7.24,53.504"
}
}
}
&lt;/script></pre>
<p>This code-snippet illustrates a [[JSON-LD]] annotation using a [[SCHEMA-ORG]] <a href="http://schema.org/Place">Place</a> for the address of "Anne Frank’s House" in that dataset. It includes the location, the URL of the Web-page, and the structured postal address information.</p>
<pre class="highlight" id="ex-schemaorg-place" title="">&lt;script type="application/ld+json">
{
"@context" : {
"@vocab" : "http://schema.org/"
},
"@type" : "Place",
"@id" : "http://www.ldproxy.net/bag/inspireadressen/inspireadressen.3329155",
"url" : "http://www.ldproxy.net/bag/inspireadressen/inspireadressen.3329155",
"geo" : {
"@type" : "GeoCoordinates",
"longitude" : "4.8839893538143055",
"latitude" : "52.37520202332491"
},
"address" : {
"@type" : "PostalAddress",
"streetAddress" : "Prinsengracht 267",
"addressLocality" : "Amsterdam",
"postalCode" : "1016GV"
}
}
&lt;/script></pre>
</aside>
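<aside class="example">
<p>Because each Web-page can likely be generated programmatically, the [[JSON-LD]] annotation can be generated in the same step. The sketch below assumes a simple address record with the fields shown in the example above; the function name and parameter names are hypothetical. The returned string would be embedded in a <code>script</code> element of type <code>application/ld+json</code>.</p>

```python
import json


def place_jsonld(page_url, longitude, latitude, street_address, locality, postal_code):
    """Build a schema.org Place description as a JSON-LD string."""
    place = {
        "@context": {"@vocab": "http://schema.org/"},
        "@type": "Place",
        "@id": page_url,
        "url": page_url,
        "geo": {
            "@type": "GeoCoordinates",
            "longitude": longitude,
            "latitude": latitude,
        },
        "address": {
            "@type": "PostalAddress",
            "streetAddress": street_address,
            "addressLocality": locality,
            "postalCode": postal_code,
        },
    }
    text = json.dumps(place, indent=2, ensure_ascii=False)
    # Replace every less-than sign ("\x3c") with its JSON escape so that data
    # values cannot close the surrounding script element early.
    return text.replace("\x3c", "\\u003c")
```

<p>Serialising with a JSON library rather than string templates keeps the annotation valid even when names or addresses contain quotes or other special characters.</p>
</aside>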

<p>The Web-pages should also provide a mechanism to download data in the formats you decide to support. [[DWBP]] <a href="https://www.w3.org/TR/dwbp/#MultipleFormats">Best Practice 14 ("Provide data in multiple formats")</a> provides guidance.</p>

<p>Typically, multiple formats for a resource are supported using two mechanisms: HTTP content negotiation, and format-specific file extensions such as ".json", ".xml" or ".ttl" added to the resource URI. Content negotiation is the standard HTTP mechanism, while format-specific URIs enable clickable links to the resource in a specific format.</p>
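<aside class="example">
<p>A simplified sketch of how a server might combine the two mechanisms: a recognised file extension wins, otherwise the <code>Accept</code> header is consulted, and HTML is the fallback. The mapping of extensions to media types and the function name are assumptions made for illustration. Wildcards such as <code>*/*</code> are not handled and fall through to HTML.</p>

```python
# Illustrative mapping; a real service would list the formats it supports.
FORMATS = {
    "html": "text/html",
    "json": "application/json",
    "xml": "application/xml",
    "ttl": "text/turtle",
}


def choose_format(path, accept_header=""):
    """Pick a media type from a file extension first, then from the Accept header."""
    stem, dot, extension = path.rpartition(".")
    if dot and extension in FORMATS:
        # Format-specific URI: strip the extension to get the resource path.
        return stem, FORMATS[extension]

    # Rank the supported media types in the Accept header by q-value (default 1.0).
    candidates = []
    for item in accept_header.split(","):
        media_type, _, params = item.strip().partition(";")
        quality = 1.0
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if media_type.strip() in FORMATS.values():
            candidates.append((quality, media_type.strip()))
    if candidates:
        best = max(candidates, key=lambda candidate: candidate[0])
        if best[0] > 0:
            return path, best[1]
    return path, "text/html"
```

<p>Checking the extension first means that a link such as <code>.../inspireadressen.3329155.json</code> always returns the same format, regardless of the headers sent by the client.</p>
</aside>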

<p>Search engines may also index resource representations in formats other than HTML.</p>

<aside class="example">
<p>example(s) to be added; including ...</p>
<ul>
<li><a href="http://maps.nanaimo.ca/data/property/">City of Nanaimo, Canada</a> … one page per resource; generated at request time; uses [[MICRODATA]] and [[SCHEMA-ORG]]</li>
</ul>
<p class="issue" data-number="179">Example required for search engine indexing of dataset / data-stream.</p>
<p class="issue" data-number="180">Example required for search engine directly indexing structured spatial data (e.g. GeoJSON or KML).</p>
<p class="issue" data-number="447">Detailed examples are required for HTML pages and the structured markup. It’s early days for using [[SCHEMA-ORG]] to describe spatial attributes - can we write some useful statements and badge them as “emerging practice” to drive the right behaviors?</p>
<p>At the time of writing, Google is indexing KML documents and supporting advanced searches that are restricted to KML documents. GML files are also indexed, but only like any other XML documents. JSON, including GeoJSON, is currently not indexed.</p>
</aside>

<p class="note">In 2016, these topics were analysed in a testbed organised by Geonovum in the Netherlands. More details can be found in reports from the testbed: <a href="http://geo4web-testbed.github.io/topic4/">Spatial Data on the Web using the current SDI</a> and <a href="https://github.com/geo4web-testbed/topic3/wiki">Crawlable geospatial data using the ecosystem of the Web and Linked Data</a>.</p>
</section>
<section class="test">
<p class="subhead">How to Test</p>
<p>...</p>
<p>Monitor the search consoles of the search engines to track progress in indexing your Web-pages and their structured data. If errors are reported, fix them.</p>
</section>
<section class="ucr">
<p class="subhead">Evidence</p>
<p><span>Relevant requirements</span>: {... hyperlinked list of use cases ...}</p>
<p><span>Relevant requirements</span>: <a href="https://www.w3.org/TR/sdw-ucr/#BoundingBoxCentroid">R-BoundingBoxCentroid</a>, <a href="https://www.w3.org/TR/sdw-ucr/#Crawlability">R-Crawlability</a>, <a href="https://www.w3.org/TR/sdw-ucr/#Discoverability">R-Discoverability</a>, <a href="https://www.w3.org/TR/sdw-ucr/#Linkability">R-Linkability</a>, <a href="https://www.w3.org/TR/sdw-ucr/#MachineToMachine">R-MachineToMachine</a>.</p>
</section>
<section class="benefits">
<h4 class="subhead">Benefits</h4>
<ul class="benefitsList">
<li>Discoverability</li>
</ul>
</section>

<br/>
<p class="note">The following issues should be addressed by this version of the best practice:</p>
<p class="issue" data-number="446">Can we consider publishing <em>spatial</em> data in a way that search engines can crawl as a <em>best</em> practice? We’re confident that this is the correct thing to do, but where is the evidence of adoption?</p>
<p class="issue" data-number="447">Detailed examples are required for HTML pages and the structured markup. It’s early days for using [[SCHEMA-ORG]] to describe spatial attributes - can we write some useful statements and badge them as “emerging practice” to drive the right behaviors?</p>
<p class="issue" data-number="179">Example required for search engine indexing of dataset / data-stream.</p>
<p class="issue" data-number="180">Example required for search engine directly indexing structured spatial data (e.g. GeoJSON or KML).</p>
<p class="issue" data-number="181">More discussion is required on how to structure meaningful (spatial) queries with search engines (e.g. based on identifier, location, time etc.).</p>
</div>

</section>