Added support for HTML OpenSearch results.
Many OpenSearch systems do not provide results as standard RSS/Atom feeds but only as HTML. This modification adds some support for custom OpenSearch HTML results through mapping files (as already done for federated Solr search) that rely on CSS-like selectors to retrieve information from HTML content. An example mapping file is provided to map results from the www.npmjs.com OpenSearch URL.
@@ -0,0 +1,24 @@
# www.npmjs.com HTML search results mapping
# OpenSearch description : https://www.npmjs.com/opensearch.xml
# OpenSearch template URL : https://www.npmjs.com/search?q={searchTerms}

# This is an example mapping file for OpenSearch systems or search APIs providing results only as HTML
# When possible, it is preferable to use an OpenSearch URL providing results as standard RSS or Atom feed as mapping is generic
# Selectors are using CSS or JQuery-like syntax, as described at https://jsoup.org/apidocs/org/jsoup/select/Selector.html
# Standard Java properties file syntax is used here instead of usual YaCy Configuration syntax to easily allow '#' characters in values (example : _result=div#result li)
# Character encoding is assumed to be ISO-8859-1

# Result node selector (required)
# In this example, a list item such as : <li class="package-details css-ywvx7i" data-reactid="n">
_result=.package-details

# Result link selector relative to the selected result block (required)
# In this example, a link such as <a href="https://www.npmjs.com/package/packageName" class="name css-1nx9rl1">packageName</a>
_sku=.name

# field mappings
# YaCyFieldname = HTML text node selector, relative to the result block
# In this example title is the text of the link so it has the same selector
title=.name
# In this example the description is in a paragraph tag such as <p class="description css-zqstoe">Package description</p>
description_txt=.description
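The mapping file comments above note that standard Java properties syntax was chosen so that '#' can appear inside selector values. A minimal sketch of how such a file could be read with only the JDK is shown here. The class and method names (`MappingFileDemo`, `loadMapping`, `fieldSelectors`) are hypothetical, and treating every key that starts with `_` as a reserved entry is an assumption drawn from the example file, not something confirmed by the commit.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper, not part of YaCy: loads a mapping file like the one above.
class MappingFileDemo {

    /** Parses mapping text using standard Java properties syntax. */
    static Properties loadMapping(String text) {
        Properties props = new Properties();
        try {
            // Properties.load(InputStream) decodes the bytes as ISO-8859-1,
            // which matches the encoding the mapping file says it assumes.
            props.load(new ByteArrayInputStream(text.getBytes(StandardCharsets.ISO_8859_1)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns the field mappings: every key that does not start with '_' (assumed convention). */
    static Map<String, String> fieldSelectors(Properties props) {
        Map<String, String> fields = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (!key.startsWith("_")) {
                fields.put(key, props.getProperty(key));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String mapping =
                "# a comment line is ignored\n"
                + "_result=div#result li\n"   // '#' inside a value is kept as-is
                + "_sku=.name\n"
                + "title=.name\n"
                + "description_txt=.description\n";
        Properties props = loadMapping(mapping);
        System.out.println("result selector: " + props.getProperty("_result"));
        System.out.println("field selectors: " + fieldSelectors(props));
    }
}
```

In the properties format, '#' starts a comment only when it is the first non-blank character of a line, so a value such as `div#result li` survives loading unchanged. The selectors themselves would then be handed to an HTML selector engine such as jsoup, which the mapping file references.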
bf16de2
Dangerous stuff (imho)!
It is just one step away from tapping into, i.e. BORROWING, search results from anywhere without any service being offered.
That would not be kosher in my view.
My assumption is that the originator has no intention to share if the results are not provided as a service/XML.
bf16de2
Hi @reger24, and thank you for your feedback.
Maybe I should have submitted this proposal for discussion before committing, but to my mind this does not break the philosophy of YaCy's existing features; it just adds support for one more data format.
To my mind, when someone or some organization sets up a server and provides a public search service, it is meant to be shared, or else it would be restricted to authenticated users only. There can be many reasons not to provide search results as a standard RSS/Atom feed (maybe the author considers it not useful or not the trend, thinks the alternative search APIs are sufficient, or is misinformed about standards...), but in the end it is just a technical limitation for users, who should be able to easily process data as they wish in a standard way.
This mapping feature is just one more way to process HTML data without the need to add an external tool such as RSS-Bridge.
I would also add that I believe this is not just borrowing search results, but exposing and sharing them in the YaCy network using your own hardware resources, which is rather valuable for the originator...
Edit: I also believe we should not worry too much about websites that really want to block non-browser requests to their HTML search service. For example, stackoverflow.com currently renders a CAPTCHA when you try to get search results with anything other than a browser...
bf16de2
robots.txt policy should now be checked (commit 6e89d12) for a more respectful usage of external OpenSearch systems.
I also found that faroo.com (their API now clearly requires a key) and en.search.wordpress.com both disallow bots on their OpenSearch URL. Isn't it time to remove these from YaCy's default heuristics config?
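To illustrate the kind of check mentioned above, here is a deliberately simplified sketch of testing a path against the `Disallow` rules of the `User-agent: *` group in a robots.txt file. It is not the code from commit 6e89d12: the class and method names are hypothetical, and `Allow` rules, wildcard patterns, and agent-specific groups are ignored for brevity.

```java
import java.util.Locale;

// Hypothetical, simplified robots.txt check (not YaCy's implementation).
class RobotsCheckSketch {

    /** Returns false if a Disallow rule in the "User-agent: *" group is a prefix of the path. */
    static boolean isAllowed(String robotsTxt, String path) {
        boolean groupMatches = false; // are we inside a group that applies to "*"?
        boolean lastWasAgent = false; // consecutive User-agent lines share one group
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            // Drop trailing comments, then surrounding whitespace.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                boolean matches = value.equals("*");
                groupMatches = lastWasAgent ? (groupMatches || matches) : matches;
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // An empty Disallow value means "nothing is disallowed".
                if (field.equals("disallow") && groupMatches
                        && !value.isEmpty() && path.startsWith(value)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /search\n";
        System.out.println(isAllowed(robots, "/search?q=yacy"));
        System.out.println(isAllowed(robots, "/package/yacy"));
    }
}
```

A heuristic that calls such a check before querying an OpenSearch URL would skip sources whose robots.txt disallows the search path, which is the behaviour the comment describes for sites like faroo.com.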
bf16de2
If the mentioned defaults are not usable... then we should remove them, e.g. faroo (I haven't checked for a while).
Following your earlier argument, I do not agree with a "happy opensearch yahoo, bing, google" via CSS tags (as long as no CAPTCHA is required). I agree with you that there is a high risk no one except me would use the term "BORROWING", and I have doubts that the originator sees any benefit for themselves.
bf16de2
Hi @reger24, at least borrowing results from the previously mentioned search engines is not likely to happen easily, as their robots.txt policy is clear and should now be respected by YaCy heuristics.
By the way, I am not sure I completely understand what you mean...
Said a bit differently, do you fear that by extending YaCy OpenSearch support to HTML results, too many users would be tempted to abandon the activities of selecting interesting websites, crawling and indexing them, and would instead dumbly rely on results provided by a limited selection of generalist and highly centralized search engines or data sources?
To my mind, with the current YaCy search implementation, heuristics is rather a fallback mechanism, so integrating results from other sources should not kill the specificity of YaCy's distributed index. All the more so considering that most major generalist search engines use their robots.txt policies to block compliant bots from reusing their HTML results.
The YaCy network does not have enough resources to index the whole Internet, so why not accept supplementary OpenSearch sources which apparently do not restrict their use by bots? Personally, in my daily use of YaCy, my peers have almost no disk space left, but I am happy to get some more heuristics results from www.npmjs.com or the developer.mozilla.org documentation when other YaCy peers have returned nothing relevant.
bf16de2
Years ago I was amused by https://www.seroundtable.com/bing-cheats-google-poll-12895.html
http://www.forbes.com/sites/jeffbercovici/2011/02/01/why-google-should-thank-bing-for-cheating/#1aec366c3106
I too like heuristics, but my concern is that it is intended as an "OpenSearch" enhancement yet invites misuse, e.g. to get around the originator's "OpenSearch" intentions or the need to ask for an API key, etc.
But it's late and time for a good-night story as a closing ;-))
Once upon a time there was an alternative search engine, but after years its relevancy was so rotten that they introduced a cheat-sheet option to look up superior sites and harvest, unauthorized, at least something useful, and it became the best tool thereafter.
bf16de2
Eh eh, thanks for these refreshing links and the story ;)
Just to clarify one last point: my intent here is not to work around YaCy relevancy but rather to extend the distributed index network with some more systems in a decentralized way (to my mind). (I said
but to be more clear I should rather have said