Added support for HTML OpenSearch results.
Many OpenSearch systems do not provide results as standard RSS/Atom
feeds but only as HTML. 

This modification adds some support for custom OpenSearch HTML results
through the use of mapping files (as already done for federated Solr
search) relying on CSS-like selectors to retrieve information from HTML
content.

An example mapping file is provided to map results from the
www.npmjs.com OpenSearch URL.
luccioman committed Feb 13, 2017
1 parent a79194a commit bf16de2
Showing 7 changed files with 475 additions and 74 deletions.
24 changes: 24 additions & 0 deletions defaults/federatecfg/npmjs.html.map.properties
@@ -0,0 +1,24 @@
# www.npmjs.com HTML search results mapping
# OpenSearch description : https://www.npmjs.com/opensearch.xml
# OpenSearch template URL : https://www.npmjs.com/search?q={searchTerms}

# This is an example mapping file for OpenSearch systems or search APIs providing results only as HTML
# When possible, it is preferable to use an OpenSearch URL providing results as standard RSS or Atom feed as mapping is generic
# Selectors are using CSS or JQuery-like syntax, as described at https://jsoup.org/apidocs/org/jsoup/select/Selector.html
# Standard Java properties file syntax is used here instead of usual YaCy Configuration syntax to easily allow '#' characters in values (example : _result=div#result li)
# Character encoding is assumed to be ISO-8859-1

# Result node selector (required)
# In this example, a list item such as : <li class="package-details css-ywvx7i" data-reactid="n">
_result=.package-details

# Result link selector relative to the selected result block (required)
# In this example, a link such as <a href="https://www.npmjs.com/package/packageName" class="name css-1nx9rl1">packageName</a>
_sku=.name

# field mappings
# YaCyFieldname = HTML text node selector, relative to the result block
# In this example title is the text of the link so it has the same selector
title=.name
# In this example the description is in a paragraph tag such as <p class="description css-zqstoe">Package description</p>
description_txt=.description
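
The note above about using standard Java properties syntax can be checked with a minimal sketch. It only shows that a '#' inside a value survives parsing and that the two required selectors can be told apart from field mappings. The class `HtmlMappingSketch`, its method names, and the rule "every key not starting with '_' is a field mapping" are assumptions made for illustration; they are not taken from the YaCy code base. A real mapping file would be read from disk (the file states ISO-8859-1 is assumed); the sketch parses an in-memory string to stay self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Hypothetical helper for illustration only; not part of the YaCy code base. */
final class HtmlMappingSketch {

    /** Parses mapping text written in standard Java properties syntax. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** True when both selectors marked "(required)" in the example file are set. */
    static boolean hasRequiredSelectors(Properties props) {
        return isSet(props.getProperty("_result")) && isSet(props.getProperty("_sku"));
    }

    private static boolean isSet(String value) {
        return value != null && !value.trim().isEmpty();
    }

    /** Assumption: every key that does not start with '_' maps a field name to a selector. */
    static Map<String, String> fieldMappings(Properties props) {
        Map<String, String> fields = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (!key.startsWith("_")) {
                fields.put(key, props.getProperty(key));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String text = "# a comment line\n"
                + "_result=div#result li\n"
                + "_sku=.name\n"
                + "title=.name\n"
                + "description_txt=.description\n";
        Properties props = parse(text);
        // '#' only starts a comment at the beginning of a line, so the value is kept whole
        System.out.println(props.getProperty("_result")); // prints "div#result li"
        System.out.println(hasRequiredSelectors(props));
        System.out.println(fieldMappings(props));
    }
}
```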
5 changes: 4 additions & 1 deletion defaults/heuristicopensearch.conf
@@ -12,11 +12,14 @@
## - all lines beginning with '#' and where the second character is not '#' are commented-out keyword lines
##

## Additional mapping files for OpenSearch HTML results can be set in DATA/SETTINGS/federatecfg/[name].html.map.properties

#Faroo-News = http://www.faroo.com/api?q={searchTerms}&start={startIndex}&length=20&l=en&src=news&f=rss # get results from Faroo news-search
#WordPress.com = http://en.search.wordpress.com/?q={searchTerms}&f=feed&page={startPage?} #Search WordPress.com Blogs
#Sueddeutsche.de = http://suche.sueddeutsche.de/query/{searchTerms}?output=rss # Sueddeutsche Zeitung Artikel Archiv
#Los Angeles Times = http://framework.latimes.com/?s={searchTerms}&feed=rss2
#Archive-It = http://archive-it.org/seam/resource/opensearch?q={searchTerms}&n=20 # archiving cultural heritage on the web
#Archive-It = http://archive-it.org/seam/resource/opensearch?q={searchTerms}&n=20 # archiving cultural heritage on the web
#npmjs = https://www.npmjs.com/search?q={searchTerms} # Search JavaScript packages from the npm repository
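
The entries above are OpenSearch URL templates with placeholders such as {searchTerms}, {startIndex} and the optional {startPage?}. As a rough sketch of how such a template can be expanded into a request URL: required placeholders are substituted, and optional placeholders (trailing '?') that are not supplied become empty strings, as the OpenSearch 1.1 template syntax allows. The class `OpenSearchTemplateSketch` and its `expand` method are hypothetical and do not reflect how YaCy performs the substitution. `URLEncoder.encode(String, Charset)` requires Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not part of the YaCy code base. */
final class OpenSearchTemplateSketch {

    /**
     * Expands a URL template: {searchTerms} and {startIndex} are substituted,
     * any remaining optional placeholder such as {startPage?} is replaced by "".
     */
    static String expand(String template, String searchTerms, int startIndex) {
        String encodedTerms = URLEncoder.encode(searchTerms, StandardCharsets.UTF_8);
        String url = template
                .replace("{searchTerms}", encodedTerms)
                .replace("{startIndex}", Integer.toString(startIndex));
        // Optional parameters end with '?' inside the braces
        return url.replaceAll("\\{[^{}]+\\?\\}", "");
    }

    public static void main(String[] args) {
        System.out.println(expand("https://www.npmjs.com/search?q={searchTerms}", "html parser", 0));
        System.out.println(expand(
                "http://en.search.wordpress.com/?q={searchTerms}&f=feed&page={startPage?}", "yacy", 0));
    }
}
```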

## In addition to OpenSearch systems other connectors are available to query foreign systems
## the syntax is
@@ -109,16 +109,20 @@ public void search(final SearchEvent theSearch) {
@Override
public void run() {
Thread.currentThread().setName("heuristic:" + instancename);
ConcurrentLog.info("YACY SEARCH (federated)", "Send search query to " + instancename);
theSearch.oneFeederStarted();
List<URIMetadataNode> doclist = query(theSearch.getQuery());
if (doclist != null) {
ConcurrentLog.info("YACY SEARCH (federated)", "Got " + doclist.size() + " documents from " + instancename);
Map<String, LinkedHashSet<String>> snippets = new HashMap<String, LinkedHashSet<String>>(); // add nodes doesn't allow null
Map<String, ReversibleScoreMap<String>> facets = new HashMap<String, ReversibleScoreMap<String>>(); // add nodes doesn't allow null
theSearch.addNodes(doclist, facets, snippets, false, instancename, doclist.size());

for (URIMetadataNode doc : doclist) {
theSearch.addHeuristic(doc.hash(), instancename, false);
}
} else {
ConcurrentLog.info("YACY SEARCH (federated)", "Got no results from " + instancename);
}
// that's all we need to display search result
theSearch.oneFeederTerminated();
27 changes: 16 additions & 11 deletions source/net/yacy/cora/federate/FederateSearchManager.java
@@ -19,19 +19,22 @@
*/
package net.yacy.cora.federate;

import net.yacy.cora.federate.opensearch.OpenSearchConnector;
import java.io.File;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;

import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

import net.yacy.cora.document.analysis.Classification;
import net.yacy.cora.document.id.MultiProtocolURL;
import net.yacy.cora.federate.opensearch.OpenSearchConnector;
import net.yacy.cora.federate.solr.connector.SolrConnector;
import net.yacy.cora.federate.yacy.CacheStrategy;
import net.yacy.cora.storage.Configuration;
@@ -49,8 +52,6 @@
import net.yacy.search.query.QueryParams;
import net.yacy.search.query.SearchEvent;
import net.yacy.search.schema.WebgraphSchema;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

/**
* Handling of queries to configured remote OpenSearch systems.
@@ -107,8 +108,8 @@ public FederateSearchManager(Switchboard sb) {
ConcurrentLog.config("FederateSearchManager", "Error in configuration of: " + url);
}
} else { // handle opensearch url template
OpenSearchConnector osc = new OpenSearchConnector();
if (osc.init(name, url)) {
OpenSearchConnector osc = new OpenSearchConnector(url);
if (osc.init(name, sb.getDataPath()+ "/DATA/SETTINGS/federatecfg/" + OpenSearchConnector.htmlMappingFileName(name))) {
conlist.add(osc);
}
}
@@ -234,8 +235,13 @@ public boolean addOpenSearchTarget(String name, String urlTemplate, boolean acti
try {
conf.commit();
if (active) {
OpenSearchConnector osd = new OpenSearchConnector();
if (osd.init(name, urlTemplate)) {
OpenSearchConnector osd = new OpenSearchConnector(urlTemplate);
String htmlMappingFile = null;
Switchboard sb = Switchboard.getSwitchboard();
if(sb != null) {
htmlMappingFile = sb.getDataPath()+ "/DATA/SETTINGS/federatecfg/" + OpenSearchConnector.htmlMappingFileName(name);
}
if (osd.init(name, htmlMappingFile)) {
conlist.add(osd);
}
}
@@ -407,9 +413,8 @@ public boolean init(String cfgFileName) {
ConcurrentLog.config("FederateSearchManager", "Init error in configuration of: " + url);
}
} else { // handle opensearch url template
OpenSearchConnector osd;
osd = new OpenSearchConnector();
if (osd.init(name, url)) {
OpenSearchConnector osd = new OpenSearchConnector(url);
if (osd.init(name, confFile.getParent()+"/federatecfg/" + OpenSearchConnector.htmlMappingFileName(name))) {
conlist.add(osd);
}
}
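
The three call sites above each concatenate a base directory, a "federatecfg" folder and a per-target mapping file name. A minimal sketch of that path construction using java.nio follows. The file name pattern [name].html.map.properties comes from the comment added to heuristicopensearch.conf; the class `MappingPathSketch` and both of its methods are hypothetical, and the real `OpenSearchConnector.htmlMappingFileName` may treat the name differently (for example by sanitizing it).

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper for illustration only; not part of the YaCy code base. */
final class MappingPathSketch {

    /** Assumed naming scheme: [name].html.map.properties */
    static String mappingFileName(String name) {
        return name + ".html.map.properties";
    }

    /**
     * Resolves DATA/SETTINGS/federatecfg/[name].html.map.properties under the data path.
     * Returns null when no data path is known, mirroring the null Switchboard case above.
     */
    static Path mappingFile(Path dataPath, String name) {
        if (dataPath == null) {
            return null;
        }
        return dataPath.resolve("DATA").resolve("SETTINGS")
                .resolve("federatecfg").resolve(mappingFileName(name));
    }

    public static void main(String[] args) {
        System.out.println(mappingFile(Paths.get("yacy"), "npmjs"));
        System.out.println(mappingFile(null, "npmjs"));
    }
}
```

Resolving path segments one by one avoids hard-coding "/" as the separator, which the string concatenation in the diff relies on.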

7 comments on commit bf16de2

@reger24
Member

Dangerous stuff (imho)!
It is just one step away from tapping into search results from anywhere, BORROWING them without any service being offered.
That would not be kosher in my view.
My assumption is that the originator has no intention to share results if they are not provided as a service/XML.

@luccioman
Member Author

@luccioman luccioman commented on bf16de2 Feb 15, 2017

Hi @reger24 , and thank you for your feedback.
Maybe I should have submitted this proposal for discussion before committing, but to my mind this does not break the philosophy of existing YaCy features; it just adds support for one more data format.

To my mind, when a person or an organization sets up a server and provides a public search service, it is meant to be shared, or else it would be restricted to authenticated users only. There can be many reasons not to provide search results as a standard RSS/Atom feed (maybe the author considers it not useful or not the trend, considers the alternative search APIs sufficient, or is misinformed about standards...), but in the end it is just a technical limitation for users, who should be able to easily process data as they wish in a standard way.

This mapping feature is just one more way to process HTML data without the need to add an external tool such as RSS-Bridge.

I would also add that I believe this is not just borrowing search results, but exposing and sharing them in the YaCy network using your own hardware resources, which is in some way rather valuable for the originator...

Edit: and I believe we should not worry too much about websites that really want to block non-browser requests to their HTML search service. For example, stackoverflow.com currently renders a Captcha when you try to get search results with anything other than a browser...

@luccioman
Member Author

The robots.txt policy should now be checked (commit 6e89d12) for a more respectful usage of external OpenSearch systems.
I also found that faroo.com (their API now clearly requires a key to be used) and en.search.wordpress.com both disallow bots on their OpenSearch URL. Isn't it time to remove these from the default YaCy heuristics config?

@reger24
Member

@reger24 reger24 commented on bf16de2 Feb 15, 2017

If the mentioned defaults are not usable, then we should remove them, e.g. faroo (I haven't checked for a while).

Following your earlier argument, I do not agree with "happy OpenSearch on Yahoo, Bing, Google" via CSS selectors (as long as no captcha is required). I agree with you that there is a high risk that no one except me would use the term "BORROWING", and I doubt that the originator sees any benefit of their own.

@luccioman
Member Author

Hi @reger24, at least borrowing results from the previously mentioned search engines is not likely to happen easily, as their robots.txt policy is clear and should now be respected by YaCy heuristics.

By the way, I am not sure I completely understand what you mean...
Put a bit differently, do you fear that by extending YaCy OpenSearch support to HTML results, too many users would be tempted to abandon the activities of selecting interesting websites, crawling and indexing them, and would instead dumbly rely on results provided by a limited selection of generalist and highly centralized search engines or data sources?

To my mind, with the current YaCy search implementation, heuristics are rather a fallback mechanism, so integrating results from other sources should not kill the specificity of YaCy's distributed index, all the more so if we consider that most major generalist search engines block bots that respect robots.txt policies from reusing their HTML results.

The YaCy network does not have enough resources to index the whole Internet, so why not accept supplementary OpenSearch sources which apparently do not restrict their use by bots? Personally, in my daily use of YaCy, my peers have almost no disk space left, but I am happy to get some more heuristics results from www.npmjs.com or the developer.mozilla.org documentation when other YaCy peers returned nothing relevant.

@reger24
Member

Years ago I was amused by https://www.seroundtable.com/bing-cheats-google-poll-12895.html
http://www.forbes.com/sites/jeffbercovici/2011/02/01/why-google-should-thank-bing-for-cheating/#1aec366c3106
I too like heuristics, but my concern is that, although it is intended as an "OpenSearch" enhancement, it invites misuse, e.g. to get around the originator's "OpenSearch" intention or the need to ask for an API key, etc.
But it's late, so time for a good-night story to close ;-))
Once upon a time there was an alternative search engine, but after years its relevancy had become so rotten that it introduced a cheat-sheet option to look up superior sites and harvest, without authorization, at least something useful, and it became the best tool thereafter.

@luccioman
Member Author

Eh eh, thanks for these refreshing links and story ;)
Just to clarify one last point: my intent here is not to work around YaCy relevancy, but rather to extend the distributed index network with some more systems in a decentralized way (to my mind). I said

when other YaCy peers returned nothing relevant

but to be clearer I should rather have said

when other YaCy peers returned nothing at all
