Scottx611x/indexing updates #1716
Conversation
Codecov Report
@@ Coverage Diff @@
## develop #1716 +/- ##
===========================================
+ Coverage 38.8% 41.19% +2.38%
===========================================
Files 365 367 +2
Lines 24837 26096 +1259
Branches 1251 1263 +12
===========================================
+ Hits 9638 10749 +1111
- Misses 15199 15347 +148
Continue to review full report at Codecov.
refinery/core/utils.py
Outdated
@@ -21,6 +21,9 @@
 import core
 import data_set_manager

+from core.search_indexes import DataSetIndex
Use explicit relative imports
@hackdna I would love to, but neither `core` nor `data_set_manager` has a reference to `search_indexes`.
refinery/core/utils.py
Outdated
@@ -21,6 +21,9 @@
 import core
 import data_set_manager

+from core.search_indexes import DataSetIndex
+from data_set_manager.search_indexes import NodeIndex
`data_set_manager` is already imported above.
@hackdna For example, we were referencing the indexes before like so, but things were broken:
(refinery-platform)vagrant@refinery:/vagrant/refinery$ python
Python 2.7.6 (default, Oct 26 2016, 20:30:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import core
>>> core.search_indexes.DataSetindex
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'search_indexes'
>>> import data_set_manager
>>> data_set_manager.search_indexes.NodeIndex
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'search_indexes'
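This AttributeError is standard Python behaviour rather than anything refinery-specific: importing a package does not automatically import its submodules, so `core.search_indexes` only exists as an attribute once the submodule itself has been imported somewhere. A self-contained sketch (using a made-up `demo_core` module, not the real `core` package):

```python
import sys
import types

# Simulate a package whose submodule has not been imported yet
# (a stand-in for `core`; the real package behaves the same way).
pkg = types.ModuleType("demo_core")
sys.modules["demo_core"] = pkg

try:
    pkg.search_indexes  # never imported -> AttributeError, as in the shell session
except AttributeError as err:
    print(err)  # exact wording varies across Python versions

# Importing the submodule (simulated here) is what attaches the attribute:
sub = types.ModuleType("demo_core.search_indexes")
sys.modules["demo_core.search_indexes"] = sub
pkg.search_indexes = sub
print(hasattr(pkg, "search_indexes"))  # True
```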
`core.utils` is also the file that introduces a nifty spiderweb of imports. There is not much we can do atm without some major refactoring/tackling of technical debt.
OK, this is due to mutual imports. Could you add a comment? Also, could you reference this in the circular imports Trello card?
Also, all other module imports should be fixed according to the style guide.
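For reference, the usual workaround for mutual imports like these (a generic sketch with hypothetical `mod_a`/`mod_b` names, not the actual refinery fix) is to defer one of the imports into the function body that needs it, so the import is resolved at call time, after both modules have finished loading:

```python
import os
import sys
import tempfile
import textwrap

# Build two tiny modules on disk to reproduce the pattern in isolation.
d = tempfile.mkdtemp()
with open(os.path.join(d, "mod_a.py"), "w") as f:
    f.write(textwrap.dedent("""
        import mod_b  # safe: mod_b defers its import of mod_a

        def name():
            return "a"
    """))
with open(os.path.join(d, "mod_b.py"), "w") as f:
    f.write(textwrap.dedent("""
        def caller_name():
            # Deferred import: resolved at call time, after both modules
            # have loaded, so the circular dependency never bites.
            import mod_a
            return mod_a.name()
    """))

sys.path.insert(0, d)
import mod_a
print(mod_a.mod_b.caller_name())  # "a"
```

A module-level `import mod_a` in `mod_b.py` would instead create the cycle that forces workarounds like the one in this PR.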
Check out comment for refinery/ui/source/js/commons/data-sets/search-api.spec.js
@@ -68,7 +68,7 @@ describe('DataSet.search-api: unit tests', function () {
   'hl.simple.post': '%3C%2Fem%3E',
   'hl.simple.pre': '%3Cem%3E',
   q: _query,
-  qf: 'title%5E0.5+accession+submitter+text',
+  qf: 'title%5E0.5+accession+submitter+text+description',
You now search twice on the description because the description is part of the `text` field. The `text` field is composed of multiple other fields and makes up the core part of the document to be searched in Solr. Check out templates/searches/indexes/core/dataset_text.txt
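Decoding the URL-encoded `qf` value from the spec makes the overlap visible: both `text` (which already contains the description) and `description` appear in the field list. A quick sketch (`urllib.parse` is Python 3; the project's Python 2.7 equivalent lives in `urllib`):

```python
from urllib.parse import unquote_plus

# Decode the updated qf parameter from the spec diff above.
# `%5E` is `^` (a per-field boost) and `+` encodes the space separator.
qf = unquote_plus('title%5E0.5+accession+submitter+text+description')
print(qf)  # title^0.5 accession submitter text description
```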
There are two possible solutions:
- do not search on description explicitly
- remove the description from the `text` field; on the other hand, the `text` field should contain all the content of a document to be searched for
I also forgot to mention that using EdgeNGram increases the memory footprint of Solr quite a bit, because the description will be shredded into little pieces of overlapping n-grams which will all be stored explicitly, as far as I remember.
Here's a useful post explaining a bit of what's going on: http://lucene.472066.n3.nabble.com/Solr-Wildcard-Search-for-large-amount-of-text-td4214392.html
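To make the size concern concrete, here is a rough sketch of what an edge n-gram filter does to a single token (the gram sizes are illustrative defaults, not the project's actual analyzer configuration):

```python
def edge_ngrams(token, min_gram=2, max_gram=15):
    """Leading-edge n-grams, roughly as an EdgeNGram filter would emit them."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

# Every indexed term expands into many overlapping prefixes, which is
# where the extra index size comes from:
print(edge_ngrams("description"))  # ['de', 'des', 'desc', ...]
```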
@flekschas: I was finding that not all DataSet description information was included in this `text` field.
For example, with "EdgeNGramming":
This approach is more resource intensive, but it doesn't seem too bad at first glance.
Using 117 DataSets:
EdgeNGram:
{
"solr_home":"/vagrant/refinery/solr/",
"version":"5.3.1 1703449 - noble - 2015-09-17 01:48:15",
"startTime":"2017-05-04T18:54:01.44Z",
"uptime":"0 days, 0 hours, 0 minutes, 46 seconds",
"memory":"76.3 MB (%15.6) of 490.7 MB"}
Normal:
{
"solr_home":"/vagrant/refinery/solr/",
"version":"5.3.1 1703449 - noble - 2015-09-17 01:48:15",
"startTime":"2017-05-04T18:43:10.319Z",
"uptime":"0 days, 0 hours, 6 minutes, 0 seconds",
"memory":"74.5 MB (%15.2) of 490.7 MB"}
I'm probably not educated enough on this topic to see any repercussions?
@ngehlenborg: Would we be okay with taking this resource hit for possibly better search results?
@scottx611x You are finding different data sets because of the switch to EdgeNGram, not because the information wasn't there. The memory footprint is expected to be the same (there is an overhead on indexing only), but the size of the indices might increase. Using EdgeNGram is fine, but now you search twice on `description` (you search on the description field alone and in combination with the `text` field). This might not have a huge impact, but could slow things down in the future or artificially boost hits in the description.
I would also test a few more search queries to see if side effects pop up (I ran into this problem after I added synonym search, for example, but only for some queries). E.g.:
- ES
- iPS
- cel
- RNA
- ...
For additional context, the search index increased from 1.23Mb -> 1.31Mb after indexing with the DataSet description as an EdgeNGram field.
… for in DataSet descriptions
refinery/core/search_indexes.py
Outdated
-        return set(submitters)
+        return list(set(submitters))
Appears to be redundant since `submitters` is already a list.
This is not redundant. We cast to `set` to get unique submitters, and back to `list` to fix a search bug.
Solr kept the `set` in its index like so: `submitter: "set(u'Armstrong, Scott')"`, but breaks this field on whitespace to do a prefix search upon.
Due to the prefix searching, one could search for `Scott` and `set(u'Armstrong` but not for `Armstrong` alone.
Solr stores a list properly in this case: `"submitter": ["Armstrong, Scott"],`
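A minimal illustration of the difference (only the Python side is executable here; the Solr behaviour is as described above):

```python
submitters = ["Armstrong, Scott", "Armstrong, Scott", "Park, Peter"]

# Returning the bare set risks its repr being indexed as one string
# (e.g. "set([u'Armstrong, Scott', ...])" on Python 2), which Solr then
# tokenizes on whitespace, producing the broken prefix search described above.
as_set = set(submitters)

# Casting back to a list deduplicates while staying a plain multi-valued
# field that Solr stores as ["Armstrong, Scott", "Park, Peter"].
unique_submitters = list(as_set)
print(sorted(unique_submitters))  # ['Armstrong, Scott', 'Park, Peter']
```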
OK, could you add a short comment about this (to all three locations)?
@@ -65,15 +65,18 @@ def prepare_submitter(self, object):
     submitters = []

     for contact in investigation.contact_set.all():
-        submitters.append(contact.last_name + ", " + contact.first_name)
+        submitters.append(
+            "{}, {}".format(contact.last_name, contact.first_name)
In simple cases like these, `format` is probably overkill.
Why not keep in line with what our coding style guide states?
The coding style guide only says to replace the old `%`-based syntax with the new `format`-based syntax. It doesn't actually say anything about replacing a simple string concatenation with `format`. In this case it comes down to readability vs separation of style from data. I'll let you decide.
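For what it's worth, the two spellings produce identical output, so the trade-off really is only stylistic (sample values below, not project data):

```python
last_name, first_name = "Armstrong", "Scott"  # hypothetical sample contact

concatenated = last_name + ", " + first_name
formatted = "{}, {}".format(last_name, first_name)

print(concatenated == formatted)  # True
print(formatted)  # Armstrong, Scott
```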
- Changed `description` to an `EdgeNgramField` for better DataSet description search by tokenizing on whitespace
- Fixed the `'module' object has no attribute 'search_indexes'` error to allow DataSet indexing to work again
- Once merged, a `./manage.py rebuild_index --using=core --batch-size=25` will need to be run