Refactored and enforced Solr mandatory fields for proper operation

- Added a new method that checks activation of mandatory fields on
Collection Configuration commit, consistent with the checks previously
performed at Switchboard startup and with the mandatory fields in the
default schema.
- Reorganized the default schema and the CollectionConfiguration enumeration:
moved fields that are no longer mandatory to a dedicated section, and moved
fields enabled at startup to the mandatory section.
- Marked mandatory fields as required and rendered them in bold on the
IndexSchema_p.html page
luccioman committed Feb 20, 2017
1 parent e5858bc commit c68a8be2d910e2adf7e79b3632016295ec55cc32
@@ -24,6 +24,9 @@ dates_in_content_dts
## the number of entries in dates_in_content_sxt
dates_in_content_count_i
## time when resource was loaded
load_date_dt
## content of itemprop attributes with content='startDate'
startDates_dts
@@ -42,39 +45,24 @@ www_unique_b
## content of title tag, text (mandatory field)
title
## the 64 bit hash of the org.apache.solr.update.processor.Lookup3Signature of title, used to compute title_unique_b
#title_exact_signature_l
## flag shows if title is unique within all indexable documents of the same host with status code 200; if yes and another document appears with same title, the unique-flag is set to false, boolean
#title_unique_b
## id of the host, a 6-byte hash that is part of the document id (mandatory field)
host_id_s
## the md5 of the raw source
#md5_s
## host of the url, string
host_s
## the 64 bit hash of the org.apache.solr.update.processor.Lookup3Signature of text_t
exact_signature_l
## flag shows if exact_signature_l is unique at the time of document creation, used for double-check during search
exact_signature_unique_b
## counter for the number of documents which are not unique (== count of not-unique-flagged documents + 1)
#exact_signature_copycount_i
## 64 bit of the Lookup3Signature from EnhancedTextProfileSignature of text_t
fuzzy_signature_l
## intermediate data produced in EnhancedTextProfileSignature: a list of word frequencies
#fuzzy_signature_text_t
## flag shows if fuzzy_signature_l is unique at the time of document creation, used for double-check during search
fuzzy_signature_unique_b
## counter for the number of documents which are not unique (== count of not-unique-flagged documents + 1)
#fuzzy_signature_copycount_i
## the size of the raw source (mandatory field)
size_i
@@ -87,9 +75,6 @@ failtype_s
## html status return code (i.e. "200" for ok), -1 if not loaded (see content of failreason_t for this case), int (mandatory field)
httpstatus_i
## redirect url if the error code is 299 < httpstatus_i < 310
#httpstatus_redirect_s
## number of unique http references, should be equal to references_internal_i + references_external_i
references_i
@@ -105,18 +90,64 @@ references_exthosts_i
## crawl depth of web page according to the number of steps that the crawler did to get to this document; if the crawl was started at a root document, then this is equal to the clickdepth
crawldepth_i
## key from a harvest process (i.e. the crawl profile hash key) which is needed for near-realtime postprocessing. This shall be deleted as soon as postprocessing has been terminated.
harvestkey_s
## the file name extension
url_file_ext_s
## either the second level domain or, if a ccSLD is used, the third level domain. Needed to search in the url
host_organization_s
## internal links, only the protocol. Needed for HostBrowser
inboundlinks_protocol_sxt
## internal links, the url only without the protocol. For correct assembly of inboundlinks inboundlinks_protocol_sxt + inboundlinks_urlstub_sxt is needed
inboundlinks_urlstub_sxt
## external links, only the protocol. For correct assembly of outboundlinks outboundlinks_protocol_sxt + outboundlinks_urlstub_sxt is needed
outboundlinks_protocol_sxt
## external links, the url only without the protocol. Needed to enhance the crawler
outboundlinks_urlstub_sxt
## all image links without the protocol and '://'. For correct assembly of image url images_protocol_sxt + images_urlstub_sxt is needed
images_urlstub_sxt
## all image link protocols
images_protocol_sxt
### No more mandatory (have been mandatory in some older YaCy versions)
## the 64 bit hash of the org.apache.solr.update.processor.Lookup3Signature of title, used to compute title_unique_b
#title_exact_signature_l
## flag shows if title is unique within all indexable documents of the same host with status code 200; if yes and another document appears with same title, the unique-flag is set to false, boolean
#title_unique_b
## the md5 of the raw source
#md5_s
## counter for the number of documents which are not unique (== count of not-unique-flagged documents + 1)
#exact_signature_copycount_i
## intermediate data produced in EnhancedTextProfileSignature: a list of word frequencies
#fuzzy_signature_text_t
## counter for the number of documents which are not unique (== count of not-unique-flagged documents + 1)
#fuzzy_signature_copycount_i
## redirect url if the error code is 299 < httpstatus_i < 310
#httpstatus_redirect_s
## needed (post-)processing steps on this metadata set
#process_sxt
## key from a harvest process (i.e. the crawl profile hash key) which is needed for near-realtime postprocessing. This shall be deleted as soon as postprocessing has been terminated.
harvestkey_s
### optional but highly recommended values, part of the index distribution process
## time when resource was loaded
load_date_dt
## date until resource shall be considered as fresh
fresh_date_dt
@@ -260,21 +291,9 @@ h6_txt
## content of <meta name="generator" content=#content#> tag, text
#metagenerator_t
## internal links, only the protocol
inboundlinks_protocol_sxt
## internal links, the url only without the protocol
inboundlinks_urlstub_sxt
## internal links, the visible anchor text
inboundlinks_anchortext_txt
## external links, only the protocol
outboundlinks_protocol_sxt
## external links, the url only without the protocol
outboundlinks_urlstub_sxt
## external links, the visible anchor text
outboundlinks_anchortext_txt
@@ -293,12 +312,6 @@ icons_sizes_sxt
## all text/words appearing in image alt texts or the tokenized url
images_text_t
## all image links without the protocol and '://'
images_urlstub_sxt
## all image link protocols
images_protocol_sxt
## all image link alt tag
images_alt_sxt
@@ -416,9 +429,6 @@ url_file_name_s
## tokens generated from url_file_name_s which can be used for better matching and result boosting
#url_file_name_tokens_t
## the file name extension
url_file_ext_s
## number of all path elements in the url hpath (see: http://www.ietf.org/rfc/rfc1738.txt) without the file name
url_paths_count_i
@@ -437,15 +447,9 @@ url_paths_sxt
## number of all characters in the url == length of sku field
url_chars_i
## host of the url, string
host_s
## the Domain Class Name, either the TLD or a combination of ccSLD+TLD if a ccSLD is used.
#host_dnc_s
## either the second level domain or, if a ccSLD is used, the third level domain
host_organization_s
## the organization and dnc concatenated with '.'
#host_organizationdnc_s
@@ -44,8 +44,8 @@ <h2>Solr Schema Editor</h2>
</tr>
#{schema}#
<tr class="TableCell#(dark)#Light::Dark::Summary#(/dark)#">
<td align="center"><input type="checkbox" name="schema_#[key]#" value="checked" #(checked)#::checked="checked"#(/checked)#/></td>
<td align="left">#[key]#</td>
<td align="center"><input type="checkbox" id="schema_#[key]#" name="schema_#[key]#" value="checked" #(checked)#::checked="checked"#(/checked)# #(required)#::required="required"#(/required)#/></td>
<td align="left"><label for="schema_#[key]#">#(required)#::<strong>#(/required)##[key]##(required)#::</strong>#(/required)#</label></td>
<td align="left"><input type="text" name="schema_solrfieldname_#[key]#" value="#[solrfieldname]#"/></td>
<td align="left">#[comment]#</td>
</tr>
@@ -119,6 +119,7 @@ public static serverObjects respond(@SuppressWarnings("unused") final RequestHea
if (showline) {
prop.put("schema_" + c + "_dark", dark ? 1 : 0); dark = !dark;
prop.put("schema_" + c + "_checked", cs.contains(field.name()) ? 1 : 0);
prop.put("schema_" + c + "_required", field.isMandatory() ? 1 : 0);
prop.putHTML("schema_" + c + "_key", field.name());
prop.putHTML("schema_" + c + "_solrfieldname",field.name().equalsIgnoreCase(field.getSolrFieldName()) ? "" : field.getSolrFieldName());
if (field.getComment() != null) prop.putHTML("schema_" + c + "_comment",field.getComment());
@@ -33,7 +33,10 @@
*/
public String name(); // default field name (according to SolCell default schema) <= enum.name()
public String getSolrFieldName(); // return the default or custom solr field name to use for solr requests
/**
* @return the default or custom solr field name to use for solr requests
*/
public String getSolrFieldName();
public SolrType getType();
@@ -51,6 +54,11 @@
public String getComment();
/**
* @return true when this field is mandatory for proper operation
*/
public boolean isMandatory();
public void setSolrFieldName(String name);
public void add(final SolrInputDocument doc, final String value);
@@ -420,7 +420,7 @@ public long getCountByQuery(String querystring) {
/**
* check if a given document, identified by url hash as document id exists
* @param id the url hash and document id
* @return the load date if any entry in solr exists, -1 otherwise
* @return the load date if any entry in solr exists, null otherwise
* @throws IOException
*/
@Override
@@ -102,7 +102,6 @@
import net.yacy.cora.document.id.MultiProtocolURL;
import net.yacy.cora.federate.solr.FailCategory;
import net.yacy.cora.federate.solr.Ranking;
import net.yacy.cora.federate.solr.SchemaConfiguration;
import net.yacy.cora.federate.solr.connector.ShardSelection;
import net.yacy.cora.federate.solr.connector.SolrConnector.LoadTimeURL;
import net.yacy.cora.federate.solr.instance.RemoteInstance;
@@ -488,29 +487,6 @@ public void run() {
// update the working scheme with the backup scheme. This is necessary to include new features.
// new features are always activated by default (if activated in input-backupScheme)
solrCollectionConfigurationWork.fill(solrCollectionConfigurationInit, true);
// switch on some fields which are necessary for ranking and faceting
SchemaConfiguration.Entry entry;
for (CollectionSchema field: new CollectionSchema[]{
CollectionSchema.host_s, CollectionSchema.load_date_dt,
CollectionSchema.url_file_ext_s, CollectionSchema.last_modified, // needed for media search and /date operator
/*YaCySchema.url_paths_sxt,*/ CollectionSchema.host_organization_s, // needed to search in the url
/*YaCySchema.inboundlinks_protocol_sxt,*/ CollectionSchema.inboundlinks_urlstub_sxt, // needed for HostBrowser
/*YaCySchema.outboundlinks_protocol_sxt,*/ CollectionSchema.outboundlinks_urlstub_sxt,// needed to enhance the crawler
CollectionSchema.httpstatus_i // used in all search queries to filter out error documents
}) {
entry = solrCollectionConfigurationWork.get(field.name());
if (entry != null) {
entry.setEnable(true);
solrCollectionConfigurationWork.put(field.name(), entry);
}
}
// activate some fields that are necessary here
entry = solrCollectionConfigurationWork.get(CollectionSchema.images_urlstub_sxt.getSolrFieldName());
if (entry != null) {
entry.setEnable(true);
solrCollectionConfigurationWork.put(CollectionSchema.images_urlstub_sxt.getSolrFieldName(), entry);
}
solrCollectionConfigurationWork.commit();
} catch (final IOException e) {ConcurrentLog.logException(e);}
@@ -150,6 +150,7 @@ public CollectionConfiguration(final File configurationFile, final boolean lazy)
ConcurrentLog.warn("SolrCollectionWriter", " solr schema file " + configurationFile.getAbsolutePath() + " is missing declaration for '" + field.name() + "'");
}
}
checkMandatoryFields(); // Check minimum needed fields for proper operation are enabled
checkFieldRelationConsistency();
}
@@ -183,6 +184,27 @@ private void checkFieldRelationConsistency() {
this.put(CollectionSchema.images_protocol_sxt.name(), e);
}
}
/**
* Check and update schema configuration with fields strictly needed for proper YaCy operation.
*/
private void checkMandatoryFields() {
SchemaConfiguration.Entry entry;
for (CollectionSchema field: CollectionSchema.values()) {
if(field.isMandatory()) {
entry = this.get(field.name());
if (entry != null) {
if(!entry.enabled()) {
entry.setEnable(true);
ConcurrentLog.info("SolrCollectionWriter", "Forced activation of mandatory field " + field.name());
}
} else {
this.put(field.name(), new Entry(field.name(), field.getSolrFieldName(), true));
ConcurrentLog.info("SolrCollectionWriter", "Added missing mandatory field " + field.name());
}
}
}
}
public String[] allFields() {
ArrayList<String> a = new ArrayList<>(this.size());
@@ -215,6 +237,7 @@ public Ranking getRanking(final String name) {
*/
@Override
public void commit() throws IOException {
checkMandatoryFields(); // Check minimum needed fields for proper operation are enabled
checkFieldRelationConsistency(); // in case of changes, check related fields are enabled before save
try {
super.commit();

3 comments on commit c68a8be

@reger24

Member

reger24 replied Feb 20, 2017

Hi @luccioman ,
good idea, but the selection of mandatory fields seems a bit biased to me ... ?
I'm in favor of overriding the user's (my) choice only if a field is essentially mandatory (otherwise I'd rate it as a default).
I don't see fields like dates_in_content_xxx , http_unique_b, www_unique_b, host_id_s, host_s, exact_signature_l, exact_signature_unique_b, fuzzy_signature_l, fuzzy_signature_unique_b, references_i, references_internal_i, references_external_i, references_exthosts_i, crawldepth_i, harvestkey_s, url_file_ext_s, host_organization_s, inboundlinks_urlstub_sxt, inboundlinks_protocol_sxt, outboundlinks_protocol_sxt, outboundlinks_urlstub_sxt, images_urlstub_sxt, images_protocol_sxt as fields YaCy won't work without (mandatory).

But I would add (to my bias) fields like language_s, author, description_txt, keywords, text_t, collection_sxt to the mandatory ones.

P.S. Another viewpoint on what is mandatory might, imho, be the metadata exchange: https://github.com/yacy/yacy_search_server/blob/master/source/net/yacy/kelondro/data/meta/URIMetadataNode.java#L106

@luccioman

Member

luccioman replied Feb 21, 2017

Hello @reger24 , my initial intent was to check mandatory fields at Solr Schema configuration commit instead of only at startup, to prevent indexing while fields that are required later are missing. My starting point was the load_date_dt field, whose absence could make the Index Export fail (see commit e5858bc).

Then I took the opportunity to harden checks on fields commented with

mandatory values, do not disable them, YaCy won't work without them

in the default collection schema. But I may have taken this comment a bit too seriously... Also, to my mind, the fields "host_s, load_date_dt, url_file_ext_s, last_modified, host_organization_s, inboundlinks_urlstub_sxt, outboundlinks_urlstub_sxt and httpstatus_i" were de facto mandatory because they were forcibly enabled at startup...

But yes, a field-by-field analysis is probably needed to check what is really mandatory for proper operation.
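
The idea discussed in this reply, running the mandatory-field check on every configuration commit rather than only at startup, can be sketched roughly as follows. This is a minimal, self-contained illustration and not YaCy's actual code: the class `MandatoryFieldSketch`, its `Field` enum, and the choice of which fields are flagged mandatory are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a schema configuration; not YaCy's actual classes. */
class MandatoryFieldSketch {

    /** Illustrative subset of fields; the mandatory flags here are assumptions. */
    enum Field {
        id(true), load_date_dt(true), httpstatus_i(true), md5_s(false);

        final boolean mandatory;

        Field(boolean mandatory) { this.mandatory = mandatory; }
    }

    /** Field name -> enabled flag, standing in for the persisted configuration. */
    private final Map<String, Boolean> enabled = new LinkedHashMap<>();

    void setEnabled(String name, boolean on) { enabled.put(name, on); }

    /** Fields that were never declared count as disabled. */
    boolean isEnabled(String name) { return enabled.getOrDefault(name, false); }

    /** Enables every mandatory field and returns the names that had to be forced or added. */
    List<String> checkMandatoryFields() {
        List<String> forced = new ArrayList<>();
        for (Field f : Field.values()) {
            if (f.mandatory && !isEnabled(f.name())) {
                enabled.put(f.name(), true);
                forced.add(f.name());
            }
        }
        return forced;
    }

    /** Runs the check on every commit, so a later user edit cannot disable a mandatory field. */
    List<String> commit() {
        List<String> forced = checkMandatoryFields();
        // A real implementation would persist the configuration here.
        return forced;
    }

    public static void main(String[] args) {
        MandatoryFieldSketch config = new MandatoryFieldSketch();
        config.setEnabled("id", true);
        config.setEnabled("load_date_dt", false); // disabled by the user after startup
        System.out.println("Forced on commit: " + config.commit());
    }
}
```

Because the check lives in `commit()`, the set of forced fields depends only on the per-field flag, which is why the question of which fields deserve that flag matters so much.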

@reger24

Member

reger24 replied Feb 22, 2017

Agree 100% with your last sentence.

### mandatory values, do not disable them, YaCy won't work without them

I think that statement in the solr.collection.schema file is meant only for the fields which include the "(mandatory field)" remark in their comment, like id.

## primary key of document, the URL hash, string (mandatory field)
id
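
One way to see which fields this narrower reading would cover is to scan the schema file for that remark. The sketch below is only an illustration under assumptions inferred from the excerpts quoted on this page: comment lines start with `##`, a field name follows on the next non-empty line, and a single leading `#` marks a disabled field. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; the file format it assumes is inferred from the quoted excerpts. */
class MandatoryRemarkScan {

    /** Returns the names of fields whose preceding comment contains "(mandatory field)". */
    static List<String> fieldsMarkedMandatory(List<String> lines) {
        List<String> result = new ArrayList<>();
        String lastComment = "";
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("##")) { // comment or section heading
                lastComment = line;
                continue;
            }
            // A field line; a single leading '#' is assumed to mean "disabled".
            String name = line.startsWith("#") ? line.substring(1).trim() : line;
            if (lastComment.contains("(mandatory field)")) {
                result.add(name);
            }
            lastComment = "";
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "### mandatory values, do not disable them, YaCy won't work without them",
            "## primary key of document, the URL hash, string (mandatory field)",
            "id",
            "## the md5 of the raw source",
            "#md5_s",
            "## the size of the raw source (mandatory field)",
            "size_i");
        System.out.println(fieldsMarkedMandatory(sample));
    }
}
```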
