
Scottx611x/indexing updates #1716

Merged
merged 9 commits into develop from scottx611x/indexing_updates on May 8, 2017

Conversation

scottx611x
Member

@scottx611x scottx611x commented May 4, 2017

Once merged, a ./manage.py rebuild_index --using=core --batch-size=25 will need to be run.

@scottx611x scottx611x added this to the Release 1.5.6 milestone May 4, 2017
@codecov-io

codecov-io commented May 4, 2017

Codecov Report

Merging #1716 into develop will increase coverage by 2.38%.
The diff coverage is 27.27%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #1716      +/-   ##
===========================================
+ Coverage     38.8%   41.19%   +2.38%     
===========================================
  Files          365      367       +2     
  Lines        24837    26096    +1259     
  Branches      1251     1263      +12     
===========================================
+ Hits          9638    10749    +1111     
- Misses       15199    15347     +148
Impacted Files Coverage Δ
...inery/ui/source/js/commons/data-sets/search-api.js 85.71% <ø> (ø) ⬆️
refinery/core/search_indexes.py 43.26% <20%> (ø) ⬆️
refinery/core/utils.py 35% <33.33%> (+0.46%) ⬆️
...i/source/js/tool-launch/ctrls/tool-display-ctrl.js 100% <0%> (ø) ⬆️
refinery/selenium_testing/tests.py 100% <0%> (ø) ⬆️
refinery/data_set_manager/tests.py 100% <0%> (ø) ⬆️
refinery/tool_manager/urls.py 100% <0%> (ø) ⬆️
refinery/ui/source/js/commons/services/tools.js 100% <0%> (ø)
...rce/js/tool-launch/services/tool-launch-service.js 93.33% <0%> (ø)
refinery/core/api.py 50.75% <0%> (+0.12%) ⬆️
... and 7 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 45f7ca6...39b8a71. Read the comment docs.

@@ -21,6 +21,9 @@
import core
import data_set_manager

from core.search_indexes import DataSetIndex
Member

Use explicit relative imports

Member Author

@hackdna I would love to, but neither core nor data_set_manager have a reference to search_indexes

@@ -21,6 +21,9 @@
import core
import data_set_manager

from core.search_indexes import DataSetIndex
from data_set_manager.search_indexes import NodeIndex
Member

data_set_manager is already imported above

Member Author

@scottx611x scottx611x May 4, 2017

@hackdna For example, we were referencing the indexes before like so, but things were broken:

(refinery-platform)vagrant@refinery:/vagrant/refinery$ python
Python 2.7.6 (default, Oct 26 2016, 20:30:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import core
>>> core.search_indexes.DataSetindex
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'search_indexes'
>>> import data_set_manager
>>> data_set_manager.search_indexes.NodeIndex
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'search_indexes'
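The failure above is standard Python import behavior: importing a package does not import its submodules, so core.search_indexes is only bound once something imports it explicitly. A minimal, self-contained sketch of the mechanics, using an in-memory stand-in package (the names pkg, pkg.sub, and the stand-in DataSetIndex are hypothetical):

```python
import sys
import types

# Build an in-memory stand-in package mimicking the on-disk layout of
# "core" / "core.search_indexes" (names here are hypothetical).
pkg = types.ModuleType("pkg")
sub = types.ModuleType("pkg.sub")
sub.DataSetIndex = type("DataSetIndex", (), {})
sys.modules["pkg"] = pkg
sys.modules["pkg.sub"] = sub

import pkg  # importing the package alone...

try:
    pkg.sub  # ...does not bind the submodule as an attribute of the package
    bound = True
except AttributeError:
    bound = False
print("submodule bound after 'import pkg':", bound)  # False

# An explicit import of the full submodule path does work, which is why
# the diff uses "from core.search_indexes import DataSetIndex".
from pkg.sub import DataSetIndex
print(DataSetIndex.__name__)
```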

Member Author

core.utils is also the file that introduces a nifty spiderweb of imports.

There is not much we can do at the moment without some major refactoring and tackling of technical debt.

Member

OK, this is due to mutual imports. Could you add a comment? Also, could you reference this in the circular imports Trello card?

@hackdna
Copy link
Member

hackdna commented May 4, 2017

Also, all other module imports should be fixed according to the style guide.

Member

@flekschas flekschas left a comment

Check out comment for refinery/ui/source/js/commons/data-sets/search-api.spec.js

@@ -68,7 +68,7 @@ describe('DataSet.search-api: unit tests', function () {
'hl.simple.post': '%3C%2Fem%3E',
'hl.simple.pre': '%3Cem%3E',
q: _query,
qf: 'title%5E0.5+accession+submitter+text',
qf: 'title%5E0.5+accession+submitter+text+description',
Member

You now search twice on the description because the description is part of the text field. The text field is composed of multiple other fields and makes up the core part of the document to be searched in Solr. Check out templates/searches/indexes/core/dataset_text.txt

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two possible solutions:

  1. Do not search on description explicitly.
  2. Remove the description from the text field. On the other hand, the text field should contain all the content of a document to be searched.

Member

I also forgot to mention that using EdgeNGram increases the memory footprint of Solr quite a bit, because the description will be shredded into little pieces of overlapping n-grams which will all be stored explicitly, as far as I remember.

Here's a useful post explaining a bit of what's going on: http://lucene.472066.n3.nabble.com/Solr-Wildcard-Search-for-large-amount-of-text-td4214392.html
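To make the index-growth concern concrete, here is a rough sketch of how an edge n-gram filter expands a single token into many stored prefix terms (parameter names and defaults are illustrative, not Solr's actual configuration; real analyzers also tokenize, lowercase, etc.):

```python
def edge_ngrams(token, min_gram=2, max_gram=15):
    # Emit one prefix term per length between min_gram and max_gram,
    # similar in spirit to Solr's EdgeNGramFilterFactory (simplified).
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

# A single 11-character token becomes 10 stored terms:
print(edge_ngrams("description"))
```

Each indexed term is stored explicitly, which is where the extra index size comes from.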

Member Author

@scottx611x scottx611x May 4, 2017

@flekschas : I was finding that not all DataSet description information was included in this text field.

For example:
With "EdgeNGramming":
[screenshot: 2017-05-04 2:43 PM]

and normally:
[screenshot: 2017-05-04 2:38 PM]

This approach is more resource intensive, but it doesn't seem too bad at first glance.
Using 117 DataSets:

EdgeNGram:

{
  "solr_home":"/vagrant/refinery/solr/",
  "version":"5.3.1 1703449 - noble - 2015-09-17 01:48:15",
  "startTime":"2017-05-04T18:54:01.44Z",
  "uptime":"0 days, 0 hours, 0 minutes, 46 seconds",
  "memory":"76.3 MB (%15.6) of 490.7 MB"}

Normal:

{
  "solr_home":"/vagrant/refinery/solr/",
  "version":"5.3.1 1703449 - noble - 2015-09-17 01:48:15",
  "startTime":"2017-05-04T18:43:10.319Z",
  "uptime":"0 days, 0 hours, 6 minutes, 0 seconds",
  "memory":"74.5 MB (%15.2) of 490.7 MB"}

I'm probably not educated enough on this topic to see any repercussions.

@ngehlenborg: Would we be okay with taking this resource hit for possibly better search results?

Member

@flekschas flekschas May 4, 2017

@scottx611x You are finding different data sets because of the switch to EdgeNGram, not because the information wasn't there. The memory footprint is expected to be the same (there is an overhead on indexing only), but the size of the indices might increase. Using EdgeNGram is fine, but now you search twice on description (you search on the description field alone and in combination with the text field). This might not have a huge impact, but it could slow things down in the future or artificially boost hits in the description.

I would also test a few more search queries to see if side effects pop up (I ran into this problem after I added synonym search, for example, but only for some queries). E.g.:

  • ES
  • iPS
  • cel
  • RNA
  • ...

Member Author

For additional context, the search index increased from 1.23 MB to 1.31 MB after indexing with the DataSet description as an EdgeNGram field.

@scottx611x scottx611x requested review from hackdna and flekschas and removed request for jkmarx May 8, 2017 19:34

return set(submitters)
return list(set(submitters))
Member

Appears to be redundant since submitters is already a list

Member Author

@scottx611x scottx611x May 8, 2017

This is not redundant. We cast to set to get unique submitters and back to list to fix a search bug.

Solr kept the set in its index like so: submitter: "set(u'Armstrong, Scott')", but it
breaks this field on whitespace to perform prefix searches on.
Due to the prefix searching, one could search for Scott and set(u'Armstrong but not for Armstrong alone.

Solr stores a list properly in this case: "submitter": ["Armstrong, Scott"],
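The round trip described above can be sketched as follows (the submitter names are illustrative):

```python
submitters = ["Armstrong, Scott", "Armstrong, Scott", "Doe, Jane"]

# A bare set would end up serialized via its repr (e.g. "set(['Armstrong, Scott'])"),
# which Solr then tokenizes on whitespace, breaking prefix search on "Armstrong".
# Casting to set deduplicates; casting back to list serializes as a proper array.
unique_submitters = list(set(submitters))
print(sorted(unique_submitters))
```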

Member

OK, could you add a short comment about this (to all three locations)?

@@ -65,15 +65,18 @@ def prepare_submitter(self, object):
submitters = []

for contact in investigation.contact_set.all():
submitters.append(contact.last_name + ", " + contact.first_name)
submitters.append(
"{}, {}".format(contact.last_name, contact.first_name)
Member

In simple cases like these, format is probably overkill.

Member Author

Why not keep in line with what our coding style guide states?

Member

The coding style guide only says to replace the old %-based syntax with the new format-based syntax. It doesn't actually say anything about replacing a simple string concatenation with format. In this case it comes down to readability vs separation of style from data. I'll let you decide.
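For reference, the two forms under discussion are equivalent for this case (variable names are illustrative):

```python
contact_last, contact_first = "Armstrong", "Scott"

# Plain concatenation, as in the old code:
by_concat = contact_last + ", " + contact_first

# str.format, as in the diff:
by_format = "{}, {}".format(contact_last, contact_first)

print(by_concat == by_format)  # both yield "Armstrong, Scott"
```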

Member Author

@scottx611x scottx611x May 8, 2017

I only saw this in our style guide regarding Python string formatting. I'm going to add a note about the move away from %s outside of logging statements, and one on simple string concatenation:
[screenshot: 2017-05-08 4:56 PM]

@scottx611x scottx611x merged commit 72b4211 into develop May 8, 2017
@scottx611x scottx611x deleted the scottx611x/indexing_updates branch May 8, 2017 21:17