
Scottx611x/indexing updates #1716

Merged
merged 9 commits into develop from scottx611x/indexing_updates on May 8, 2017

Conversation

scottx611x
Member

@scottx611x scottx611x commented May 4, 2017

Once merged, a ./manage.py rebuild_index --using=core --batch-size=25 will need to be run.

@scottx611x scottx611x added this to the Release 1.5.6 milestone May 4, 2017
@codecov-io

codecov-io commented May 4, 2017

Codecov Report

Merging #1716 into develop will increase coverage by 2.38%.
The diff coverage is 27.27%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #1716      +/-   ##
===========================================
+ Coverage     38.8%   41.19%   +2.38%     
===========================================
  Files          365      367       +2     
  Lines        24837    26096    +1259     
  Branches      1251     1263      +12     
===========================================
+ Hits          9638    10749    +1111     
- Misses       15199    15347     +148
Impacted Files Coverage Δ
...inery/ui/source/js/commons/data-sets/search-api.js 85.71% <ø> (ø) ⬆️
refinery/core/search_indexes.py 43.26% <20%> (ø) ⬆️
refinery/core/utils.py 35% <33.33%> (+0.46%) ⬆️
...i/source/js/tool-launch/ctrls/tool-display-ctrl.js 100% <0%> (ø) ⬆️
refinery/selenium_testing/tests.py 100% <0%> (ø) ⬆️
refinery/data_set_manager/tests.py 100% <0%> (ø) ⬆️
refinery/tool_manager/urls.py 100% <0%> (ø) ⬆️
refinery/ui/source/js/commons/services/tools.js 100% <0%> (ø)
...rce/js/tool-launch/services/tool-launch-service.js 93.33% <0%> (ø)
refinery/core/api.py 50.75% <0%> (+0.12%) ⬆️
... and 7 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 45f7ca6...39b8a71. Read the comment docs.

@@ -21,6 +21,9 @@
import core
import data_set_manager

from core.search_indexes import DataSetIndex
Member

Use explicit relative imports

Member Author

@hackdna I would love to, but neither core nor data_set_manager have a reference to search_indexes

@@ -21,6 +21,9 @@
import core
import data_set_manager

from core.search_indexes import DataSetIndex
from data_set_manager.search_indexes import NodeIndex
Member

data_set_manager is already imported above

Member Author

@scottx611x scottx611x May 4, 2017

@hackdna For example, we were referencing the indexes before like so, but things were broken:

(refinery-platform)vagrant@refinery:/vagrant/refinery$ python
Python 2.7.6 (default, Oct 26 2016, 20:30:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import core
>>> core.search_indexes.DataSetindex
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'search_indexes'
>>> import data_set_manager
>>> data_set_manager.search_indexes.NodeIndex
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'search_indexes'
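The failure above is standard Python import behavior: importing a package does not import its submodules, so core.search_indexes is only bound once something imports it explicitly. A minimal, self-contained sketch of the mechanics, using an in-memory stand-in package (the names pkg, pkg.sub, and the stand-in DataSetIndex are hypothetical):

```python
import sys
import types

# Build an in-memory stand-in package mimicking the on-disk layout of
# "core" / "core.search_indexes" (names here are hypothetical).
pkg = types.ModuleType("pkg")
sub = types.ModuleType("pkg.sub")
sub.DataSetIndex = type("DataSetIndex", (), {})
sys.modules["pkg"] = pkg
sys.modules["pkg.sub"] = sub

import pkg  # importing the package alone...

try:
    pkg.sub  # ...does not bind the submodule as an attribute of the package
    bound = True
except AttributeError:
    bound = False
print("submodule bound after 'import pkg':", bound)  # False

# An explicit import of the full submodule path does work, which is why
# the diff uses "from core.search_indexes import DataSetIndex".
from pkg.sub import DataSetIndex
print(DataSetIndex.__name__)
```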

Member Author

core.utils is also the file that introduces a nifty spiderweb of imports.

There is not much we can do at the moment without some major refactoring and tackling of technical debt.

Member

OK, this is due to mutual imports. Could you add a comment? Also, could you reference this in the circular imports Trello card?

@hackdna
Copy link
Member

hackdna commented May 4, 2017

Also, all other module imports should be fixed according to the style guide.

Member

@flekschas flekschas left a comment

Check out comment for refinery/ui/source/js/commons/data-sets/search-api.spec.js

@@ -68,7 +68,7 @@ describe('DataSet.search-api: unit tests', function () {
'hl.simple.post': '%3C%2Fem%3E',
'hl.simple.pre': '%3Cem%3E',
q: _query,
qf: 'title%5E0.5+accession+submitter+text',
qf: 'title%5E0.5+accession+submitter+text+description',
Member

You now search twice on the description because the description is part of the text field. The text field is composed of multiple other fields and makes up the core part of the document to be searched in Solr. Check out templates/searches/indexes/core/dataset_text.txt

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two possible solutions:

  1. Do not search on description explicitly.
  2. Remove the description from the text field. On the other hand, the text field should contain all the content of a document to be searched.

Member

I also forgot to mention that using EdgeNGram increases the memory footprint of Solr quite a bit, because the description will be shredded into little pieces of overlapping n-grams which will all be stored explicitly, as far as I remember.

Here's a useful post explaining a bit of what's going on: http://lucene.472066.n3.nabble.com/Solr-Wildcard-Search-for-large-amount-of-text-td4214392.html
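To make the index-growth concern concrete, here is a rough sketch of how an edge n-gram filter expands a single token into many stored prefix terms (parameter names and defaults are illustrative, not Solr's actual configuration; real analyzers also tokenize, lowercase, etc.):

```python
def edge_ngrams(token, min_gram=2, max_gram=15):
    # Emit one prefix term per length between min_gram and max_gram,
    # similar in spirit to Solr's EdgeNGramFilterFactory (simplified).
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

# A single 11-character token becomes 10 stored terms:
print(edge_ngrams("description"))
```

Each indexed term is stored explicitly, which is where the extra index size comes from.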

Member Author

@scottx611x scottx611x May 4, 2017

@flekschas : I was finding that not all DataSet description information was included in this text field.

For example:
With "EdgeNGramming":
[screenshot: 2017-05-04 2:43 PM]

and normally:
[screenshot: 2017-05-04 2:38 PM]

This approach is more resource intensive, but it doesn't seem too bad at first glance.
Using 117 DataSets:

EdgeNGram:

{
  "solr_home":"/vagrant/refinery/solr/",
  "version":"5.3.1 1703449 - noble - 2015-09-17 01:48:15",
  "startTime":"2017-05-04T18:54:01.44Z",
  "uptime":"0 days, 0 hours, 0 minutes, 46 seconds",
  "memory":"76.3 MB (%15.6) of 490.7 MB"}

Normal:

{
  "solr_home":"/vagrant/refinery/solr/",
  "version":"5.3.1 1703449 - noble - 2015-09-17 01:48:15",
  "startTime":"2017-05-04T18:43:10.319Z",
  "uptime":"0 days, 0 hours, 6 minutes, 0 seconds",
  "memory":"74.5 MB (%15.2) of 490.7 MB"}

I'm probably not educated enough on this topic to see any repercussions.

@ngehlenborg: Would we be okay with taking this resource hit for possibly better search results?

Member

@flekschas flekschas May 4, 2017

@scottx611x You are finding different data sets because of the switch to EdgeNGram, not because the information wasn't there. The memory footprint is expected to be the same (there is an overhead on indexing only), but the size of the indices might increase. Using EdgeNGram is fine, but now you search twice on description (you search on the description field alone and in combination with the text field). This might not have a huge impact, but it could slow things down in the future or artificially boost hits in the description.

I would also test a few more search queries to see if side effects pop up (I ran into this problem after I added synonym search, for example, but only for some queries). E.g.:

  • ES
  • iPS
  • cel
  • RNA
  • ...

Member Author

For additional context, the search index increased from 1.23 MB to 1.31 MB after indexing with the DataSet description as an EdgeNGram field.

@scottx611x scottx611x requested review from hackdna and flekschas and removed request for jkmarx May 8, 2017 19:34

return set(submitters)
return list(set(submitters))
Member

Appears to be redundant since submitters is already a list

Member Author

@scottx611x scottx611x May 8, 2017

This is not redundant. We cast to set to get unique submitters and back to list to fix a search bug.

Solr kept the set in its index like so: submitter: "set(u'Armstrong, Scott')", but it
breaks this field on whitespace to perform prefix searches on.
Due to the prefix searching, one could search for Scott and set(u'Armstrong but not for Armstrong alone.

Solr stores a list properly in this case: "submitter": ["Armstrong, Scott"],
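The round trip described above can be sketched as follows (the submitter names are illustrative):

```python
submitters = ["Armstrong, Scott", "Armstrong, Scott", "Doe, Jane"]

# A bare set would end up serialized via its repr (e.g. "set(['Armstrong, Scott'])"),
# which Solr then tokenizes on whitespace, breaking prefix search on "Armstrong".
# Casting to set deduplicates; casting back to list serializes as a proper array.
unique_submitters = list(set(submitters))
print(sorted(unique_submitters))
```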

Member

OK, could you add a short comment about this (to all three locations)?

@@ -65,15 +65,18 @@ def prepare_submitter(self, object):
submitters = []

for contact in investigation.contact_set.all():
submitters.append(contact.last_name + ", " + contact.first_name)
submitters.append(
"{}, {}".format(contact.last_name, contact.first_name)
Member

In simple cases like these, format is probably overkill.

Member Author

Why not keep in line with what our coding style guide states?

Member

The coding style guide only says to replace the old %-based syntax with the new format-based syntax. It doesn't actually say anything about replacing a simple string concatenation with format. In this case it comes down to readability vs separation of style from data. I'll let you decide.
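For reference, the two forms under discussion are equivalent for this case (variable names are illustrative):

```python
contact_last, contact_first = "Armstrong", "Scott"

# Plain concatenation, as in the old code:
by_concat = contact_last + ", " + contact_first

# str.format, as in the diff:
by_format = "{}, {}".format(contact_last, contact_first)

print(by_concat == by_format)  # both yield "Armstrong, Scott"
```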

Member Author

@scottx611x scottx611x May 8, 2017

I only saw this in our style guide regarding Python string formatting. I'm going to add a note about the move away from %s outside of logging statements, and one on simple string concatenation:
[screenshot: 2017-05-08 4:56 PM]

@scottx611x scottx611x merged commit 72b4211 into develop May 8, 2017
@scottx611x scottx611x deleted the scottx611x/indexing_updates branch May 8, 2017 21:17