-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Scottx611x/indexing updates #1716
Changes from all commits
8110620
e0db7b7
79f4e7a
56ab31a
d0b3ee8
291de47
c575f50
00f5f49
39b8a71
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -68,7 +68,7 @@ describe('DataSet.search-api: unit tests', function () { | |
'hl.simple.post': '%3C%2Fem%3E', | ||
'hl.simple.pre': '%3Cem%3E', | ||
q: _query, | ||
qf: 'title%5E0.5+accession+submitter+text', | ||
qf: 'title%5E0.5+accession+submitter+text+description', | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You now search twice on the description because the description is part of the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There are two possible solutions:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I also forgot to mention that using EdgeNGram increases the memory footprint of Solr quite a but because the description will be shredded into little pieces of overlapping n-grams which will all be stored explicitly as far as I remember. Here's a useful post explaining a bit of what's going on: http://lucene.472066.n3.nabble.com/Solr-Wildcard-Search-for-large-amount-of-text-td4214392.html There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @flekschas : I was finding that not all DataSet description information was included in this text field. For example: This approach is more resource intensive, but it doesn't seem too bad at first glance. EdgeNGram:
Normal:
I'm probably not educated enough on this topic to see any repurcussions? @ngehlenborg: Would we be okay with taking this resource hit for possibly better search results? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @scottx611x You are finding different data sets because of the switch to EdgeNGram not because the information wasn't there. The memory foot print is expected to be the same (there is an overhead on indexing only) but the size of the indices might increase. Using EdgeNGram is fine but now you search twice on I would also test few more search queries to see if side effects pop up (I ran into this problem after I added synonym search for example, but only for some queries). E.g.:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For additional context, the search index increased from |
||
rows: _limit, | ||
start: _offset, | ||
synonyms: '' + !!_synonyms + '', | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In simple cases like these
format
is probably an overkillThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not keep in line with what our coding style guide states?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The coding style guide only says to replace the old
%
-based syntax with the newformat
-based syntax. It doesn't actually say anything about replacing a simple string concatenation withformat
. In this case it comes down to readability vs separation of style from data. I'll let you decide.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only saw this in our style guide regarding python string formatting, I'm going to add a note about the move away from
%s
outside of logging statements and one on simple string concatenation: