Scottx611x/indexing updates #1716

Merged · 9 commits · May 8, 2017
refinery/core/search_indexes.py (21 changes: 15 additions & 6 deletions)

@@ -26,7 +26,7 @@ class DataSetIndex(indexes.SearchIndex, indexes.Indexable):
dbid = indexes.IntegerField(model_attr='id')
uuid = indexes.CharField(model_attr='uuid')
summary = indexes.CharField(model_attr='summary', null=True)
-    description = indexes.CharField(null=True)
+    description = indexes.EdgeNgramField(null=True)
creation_date = indexes.DateTimeField(model_attr='creation_date')
modification_date = indexes.DateTimeField(model_attr='modification_date')
submitter = indexes.MultiValueField(null=True)
@@ -65,15 +65,20 @@ def prepare_submitter(self, object):
submitters = []

for contact in investigation.contact_set.all():
-            submitters.append(contact.last_name + ", " + contact.first_name)
+            submitters.append(
+                "{}, {}".format(contact.last_name, contact.first_name)
Member:

In simple cases like these, format is probably overkill.

Member Author:

Why not keep in line with what our coding style guide states?

Member:

The coding style guide only says to replace the old %-based syntax with the new format-based syntax. It doesn't actually say anything about replacing a simple string concatenation with format. In this case it comes down to readability vs. separation of style from data. I'll let you decide.
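For reference, the three forms under discussion side by side (editor's sketch; contact stands in for the model instance):

    name = contact.last_name + ", " + contact.first_name           # plain concatenation
    name = "{}, {}".format(contact.last_name, contact.first_name)  # new-style formatting
    name = "%s, %s" % (contact.last_name, contact.first_name)      # old %-style the guide is moving away from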

Member Author (@scottx611x, May 8, 2017):

I only saw this in our style guide regarding Python string formatting. I'm going to add a note about the move away from %s outside of logging statements, and one on simple string concatenation:

[screen shot 2017-05-08 at 4 56 26 pm]

+            )

studies = investigation.study_set.all()
for study in studies:
for contact in study.contact_set.all():
submitters.append(
contact.last_name + ", " + contact.first_name)
"{}, {}".format(contact.last_name, contact.first_name)
)

-        return set(submitters)
+        # Cast to `list` looks redundant, but MultiValueField stores sets
+        # improperly, introducing a search bug. See: http://bit.ly/2pZLE5c
+        return list(set(submitters))

def prepare_measurement(self, object):
investigation = object.get_investigation()
@@ -88,7 +93,9 @@ def prepare_measurement(self, object):
for assay in study.assay_set.all():
measurements.append(assay.measurement)

-        return set(measurements)
+        # Cast to `list` looks redundant, but MultiValueField stores sets
+        # improperly, introducing a search bug. See: http://bit.ly/2pZLE5c
+        return list(set(measurements))

def prepare_technology(self, object):
investigation = object.get_investigation()
@@ -103,7 +110,9 @@ def prepare_technology(self, object):
for assay in study.assay_set.all():
technologies.append(assay.technology)

-        return set(technologies)
+        # Cast to `list` looks redundant, but MultiValueField stores sets
+        # improperly, introducing a search bug. See: http://bit.ly/2pZLE5c
+        return list(set(technologies))

# from:
# http://django-haystack.readthedocs.org/en/latest/rich_content_extraction.html
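Aside: a minimal sketch of the dedup-then-cast pattern the three prepare_* methods above now share (hypothetical values; the bit.ly link in the diff explains the underlying Haystack MultiValueField bug):

    submitters = ["Doe, John", "Doe, John", "Roe, Jane"]
    unique = set(submitters)   # {'Doe, John', 'Roe, Jane'} -- deduplicated, but a set
    safe = list(unique)        # ['Doe, John', 'Roe, Jane'] -- same values, a type
                               # that MultiValueField stores correctly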
refinery/core/utils.py (16 changes: 9 additions & 7 deletions)

@@ -21,6 +21,11 @@
import core
import data_set_manager

+# These imports go against our coding style guide, but are necessary for the
+# time being due to mutual import issues
+from core.search_indexes import DataSetIndex
+from data_set_manager.search_indexes import NodeIndex
+
logger = logging.getLogger(__name__)
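# (Editor's aside, a hypothetical two-module sketch of why module-level mutual
# imports fail and the style guide had to be bent here:)
#   # a.py                        # b.py
#   from b import g               from a import f   # re-enters the half-initialized
#   def f(): return g()           def g(): return 1 # a.py before `f` exists -> ImportError
# Importing lazily inside a function, or via module attributes as the old
# `core.search_indexes.DataSetIndex` spelling did, sidesteps the cycle.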


@@ -44,8 +49,7 @@ def update_data_set_index(data_set):

logger.info('Updated data set (uuid: %s) index', data_set.uuid)
try:
-        core.search_indexes.DataSetIndex().update_object(data_set,
-                                                         using='core')
+        DataSetIndex().update_object(data_set, using='core')
except Exception as e:
""" Solr is expected to fail and raise an exception when
it is not running.
@@ -355,8 +359,7 @@ def delete_data_set_index(data_set):

logger.debug('Deleted data set (uuid: %s) index', data_set.uuid)
try:
-        core.search_indexes.DataSetIndex().remove_object(data_set,
-                                                         using='core')
+        DataSetIndex().remove_object(data_set, using='core')
except Exception as e:
""" Solr is expected to fail and raise an exception when
it is not running.
@@ -754,16 +757,15 @@ def delete_analysis_index(node_instance):
"""Remove a Analysis' related document from Solr's index.
"""
try:
-        data_set_manager.search_indexes.NodeIndex().remove_object(
-            node_instance, using='data_set_manager')
+        NodeIndex().remove_object(node_instance, using='data_set_manager')
logger.debug('Deleted Analysis\' NodeIndex with (uuid: %s)',
node_instance.uuid)
except Exception as e:
""" Solr is expected to fail and raise an exception when
it is not running.
(e.g. Travis CI doesn't support solr yet)
"""
logger.error("Could not delete from NodeIndex:", e)
logger.error("Could not delete from NodeIndex: %s", e)


def invalidate_cached_object(instance, is_test=False):
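Aside on the logger.error fix above: the logging module applies %-formatting lazily, so passing the exception as a positional argument without a matching placeholder only fails when the record is emitted, and the exception text never reaches the log. A minimal sketch:

    import logging
    logger = logging.getLogger(__name__)

    e = RuntimeError("Solr is not running")
    logger.error("Could not delete from NodeIndex:", e)     # no placeholder: formatting
                                                            # fails at emit time, `e` is lost
    logger.error("Could not delete from NodeIndex: %s", e)  # correct lazy %-formatting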
refinery/solr/core/conf/schema.xml (17 changes: 16 additions & 1 deletion)

@@ -118,6 +118,21 @@
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
</analyzer>
</fieldType>

<fieldType name="description_edge_ngram" class="solr.TextField" positionIncrementGap="1">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" />
</analyzer>
</fieldType>
</types>

<fields>
@@ -168,7 +183,7 @@

<field name="name" type="text_en" indexed="true" stored="true" multiValued="false" />

<field name="description" type="text_en" indexed="true" stored="true" multiValued="false" />
<field name="description" type="description_edge_ngram" indexed="true" stored="true" multiValued="false" />

<field name="content_auto" type="edge_ngram" indexed="true" stored="true" multiValued="false" />

refinery/ui/source/js/commons/data-sets/search-api.js (2 changes: 1 addition & 1 deletion)

@@ -37,7 +37,7 @@ function DataSetSearchApiFactory ($sce, settings, solrService, sessionService) {
// Query
q: searchQuery,
// Query fields
-      qf: 'title^0.5 accession submitter text',
+      qf: 'title^0.5 accession submitter text description',
// # results returned
rows: limit,
// Start of return
refinery/ui/source/js/commons/data-sets/search-api.spec.js (2 changes: 1 addition & 1 deletion)

@@ -68,7 +68,7 @@ describe('DataSet.search-api: unit tests', function () {
'hl.simple.post': '%3C%2Fem%3E',
'hl.simple.pre': '%3Cem%3E',
q: _query,
-      qf: 'title%5E0.5+accession+submitter+text',
+      qf: 'title%5E0.5+accession+submitter+text+description',
Member:

You now search twice on the description, because the description is already part of the text field. The text field is composed of multiple other fields and makes up the core part of the document to be searched in Solr. Check out templates/searches/indexes/core/dataset_text.txt.
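For readers unfamiliar with Haystack: the text document field is typically rendered from a Django template along these lines (an illustrative sketch only; the project's actual dataset_text.txt may list different fields):

    {{ object.title }}
    {{ object.summary }}
    {{ object.description }}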

Member:

There are two possible solutions:

  1. Do not search on description explicitly (see the sketch below).
  2. Remove the description from the text field. On the other hand, the text field should contain all the content of a document to be searched.
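Concretely, option 1 would just revert the qf change in search-api.js, leaving description to be matched only through text (both lines quoted from this PR's diff):

    qf: 'title^0.5 accession submitter text'              // option 1: description matched via `text` only
    qf: 'title^0.5 accession submitter text description'  // current PR: description searched twice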

Member:

I also forgot to mention that using EdgeNGram increases the memory footprint of Solr quite a bit, because the description will be shredded into little pieces of overlapping n-grams, which will all be stored explicitly, as far as I remember.

Here's a useful post explaining a bit of what's going on: http://lucene.472066.n3.nabble.com/Solr-Wildcard-Search-for-large-amount-of-text-td4214392.html
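To make the "shredding" concrete, here is a rough Python illustration of what an edge n-gram filter with minGramSize=3 / maxGramSize=15 emits for a single token (a sketch of the behavior, not Solr's implementation):

    def edge_ngrams(token, min_size=3, max_size=15):
        # Emit every prefix of `token` between min_size and max_size characters,
        # mirroring what solr.EdgeNGramFilterFactory produces per token.
        return [token[:n] for n in range(min_size, min(max_size, len(token)) + 1)]

    print(edge_ngrams("description"))
    # ['des', 'desc', 'descr', 'descri', 'descrip', 'descript',
    #  'descripti', 'descriptio', 'description']

Every one of those prefixes is indexed, which is where the index growth comes from.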

Member Author (@scottx611x, May 4, 2017):

@flekschas: I was finding that not all DataSet description information was included in this text field.

For example, with "EdgeNGramming":

[screen shot 2017-05-04 at 2 43 18 pm]

and normally:

[screen shot 2017-05-04 at 2 38 26 pm]

This approach is more resource intensive, but it doesn't seem too bad at first glance.
Using 117 DataSets:

EdgeNGram:

{
  "solr_home":"/vagrant/refinery/solr/",
  "version":"5.3.1 1703449 - noble - 2015-09-17 01:48:15",
  "startTime":"2017-05-04T18:54:01.44Z",
  "uptime":"0 days, 0 hours, 0 minutes, 46 seconds",
  "memory":"76.3 MB (%15.6) of 490.7 MB"}

Normal:

{
  "solr_home":"/vagrant/refinery/solr/",
  "version":"5.3.1 1703449 - noble - 2015-09-17 01:48:15",
  "startTime":"2017-05-04T18:43:10.319Z",
  "uptime":"0 days, 0 hours, 6 minutes, 0 seconds",
  "memory":"74.5 MB (%15.2) of 490.7 MB"}

I'm probably not educated enough on this topic to see any repercussions?

@ngehlenborg: Would we be okay with taking this resource hit for possibly better search results?

Member (@flekschas, May 4, 2017):

@scottx611x You are finding different data sets because of the switch to EdgeNGram, not because the information wasn't there. The memory footprint is expected to be the same (there is an overhead on indexing only), but the size of the indices might increase. Using EdgeNGram is fine, but now you search twice on description (you search on the description field alone and in combination with the text field). This might not have a huge impact, but it could slow things down in the future or artificially boost hits in the description.

I would also test a few more search queries to see if side effects pop up (I ran into this problem after I added synonym search, for example, but only for some queries). E.g.:

  • ES
  • iPS
  • cel
  • RNA
  • ...

Member Author:

For additional context, the search index grew from 1.23 MB to 1.31 MB after indexing with the DataSet description as an EdgeNGram field.

rows: _limit,
start: _offset,
synonyms: '' + !!_synonyms + '',