
solr_search: automatic adjusting of maxRows #102

Merged
merged 8 commits into ropensci:master on Aug 8, 2017

Conversation

@1havran
Contributor

commented Jul 21, 2017

There is a performance penalty when asking for too many rows.
https://wiki.apache.org/solr/SolrPerformanceProblems#Asking_for_too_many_rows

If the number of rows is specified and is higher than the configured limit (50000),
solr_search first queries Solr for the number of records with rows=0.
Then the initial number of rows is replaced by the real number of records
and solr_search is executed.
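In outline, the submitted change does the following (a sketch of the described behaviour, using the package's internal solr_search_exec helper and jsonlite's fromJSON as they appear in the diff quoted later in this thread):

    # sketch: pre-query with rows = 0 to learn numFound, then use that as rows
    if (!is.null(rows) && rows > 50000) {
      pre  <- solr_search_exec(name = name, q = q, rows = 0, wt = 'json', raw = TRUE)
      rows <- fromJSON(pre)$response$numFound
    }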

solr_search: automatic adjusting of maxRows
@sckott

Member

commented Jul 21, 2017

Can you add examples so I can see how you envision usage?

@1havran

Contributor Author

commented Jul 22, 2017

This PR modifies the solr_search function. The change is transparent to the user and does not change the output of solr_search. When querying data, it does not make sense to request more rows than the total number of existing records. This is especially important with high-rate queries. As per the Solr wiki, Solr first allocates memory for each requested row (28 bytes per ScoreDoc) and then returns results. That memory is allocated for all requested rows even if the number of matching records is less than the number of requested rows.

As an example, if rows=999999999, Solr will try to allocate 28 bytes x 999999999, that is ~26GB of memory. With high-rate queries using rows=999999999, Solr would need to allocate ~26GB multiple times. The Java garbage collector kicks in, which would most likely result in temporary Solr unavailability (404 errors), almost certainly for collections with 10M+ documents.
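For reference, the arithmetic behind that estimate:

    28 * 999999999 / 2^30
    [1] 26.07703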

If the requested rows are higher than the existing rows (and higher than a configurable threshold, minOptimizedRows=50000), solr_search will adjust the rows to the number of existing rows using the same constraints and return the results. The max-rows optimization can be turned off via a boolean argument to solr_search (optimizeMaxRows=FALSE). The optimization should be turned off when using solr_search inside aggregated solr functions that provide their own search logic (such as querying Solr in pages).

  1. rows = NULL - no change in solr_search.
     If the number of rows is not specified, Solr uses its internal default limit for rows (10).
     out1 <- solr_search('techproducts', q = "maxtor")
     http://localhost:8983/solr/techproducts/select?q=maxtor&wt=json

  2. rows <= 50000 - no change in solr_search.
     out2 <- solr_search('techproducts', q = "maxtor", rows = 50000)
     http://localhost:8983/solr/techproducts/select?q=maxtor&rows=50000&wt=json

  3. rows > 50000 - solr_search first finds out how many records the search will return, then queries the data with the adjusted row number.
     out3 <- solr_search('techproducts', q = "maxtor", rows = 50001)
     http://localhost:8983/solr/techproducts/select?q=maxtor&rows=0&wt=json
     http://localhost:8983/solr/techproducts/select?q=maxtor&rows=1&wt=json

  4. Result comparison
     identical(out1, out2)
     [1] TRUE
     identical(out2, out3)
     [1] TRUE
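To opt out, the PR also adds an optimizeMaxRows argument (a hypothetical call; the argument appears in the diff quoted below):

    # disable the adjustment, e.g. inside a wrapper that pages through results itself
    out4 <- solr_search('techproducts', q = "maxtor", rows = 100000, optimizeMaxRows = FALSE)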

@sckott

Member

commented Jul 26, 2017

thanks @1havran for the details

  • seems there would be a performance hit from making this extra query to get the number of results before making the full query - but perhaps that perf. hit is smaller than the cost of not doing the check, at least when a very large number of results are to be returned?
  • what about the other Solr methods - group, highlight, facet, etc.? assuming we go with your change, should we also change those to match?

if (!is.null(rows) & (optimizeMaxRows)) {
  if (rows>minOptimizedRows) {
    out <- solr_search_exec(name=name, q=q, rows='0', wt='json', raw='TRUE')

sckott Jul 26, 2017

Member

use a boolean like TRUE instead of a string


sckott Jul 26, 2017

Member

0 is a numeric, not a string
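Taken together, the two comments point at a corrected line like:

    out <- solr_search_exec(name = name, q = q, rows = 0, wt = 'json', raw = TRUE)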

callopts=list(), raw=FALSE, parsetype='df', concat=',',
optimizeMaxRows=TRUE, minOptimizedRows=50000, ...) {

if (!is.null(rows) & (optimizeMaxRows)) {


sckott Jul 26, 2017

Member

please change to !is.null(rows) && optimizeMaxRows
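The distinction matters because && is scalar and short-circuits, so with rows = NULL the right-hand side is never evaluated (a quick illustration):

    !is.null(NULL) && (NULL > 50000)  # FALSE: short-circuits on the NULL check
    !is.null(NULL) &  (NULL > 50000)  # logical(0): & evaluates both sides, and if(logical(0)) errors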

optimizeMaxRows=TRUE, minOptimizedRows=50000, ...) {

if (!is.null(rows) & (optimizeMaxRows)) {
  if (rows>minOptimizedRows) {


sckott Jul 26, 2017

Member

spaces around the operator please: rows > minOptimizedRows

@sckott

Member

commented Jul 26, 2017

@1havran AFAICT i'm not sure the behavior is what's desired.

in the block

if (!is.null(rows) & (optimizeMaxRows)) {
    if (rows>minOptimizedRows) {
      out <- solr_search_exec(name=name, q=q, rows=0, wt='json', raw=TRUE)
      outJson <- fromJSON(out)
      rows <- outJson$response$numFound
    }
  }

you are assigning the numFound value as the new rows value, which then gets passed on to the next search to return docs. e.g., if the user sets rows = 50001, and we're using the PLOS Solr API as I have in the examples in the docs, numFound is at the time of writing 1841104 - which becomes the new rows value - so the user is now fetching 1841104 documents instead of 50001.

I imagine that's not what you intended?

@jefferis


commented Jul 26, 2017

A related suggestion:

  1. make the new behaviour FALSE by default
  2. if rows=Inf then figure out how many rows exist (as suggested above) and fetch all of them
  3. enable this by having optimizeMaxRows=!is.finite(rows) as the default (see the sketch after this list)
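A sketch of what that default could look like (hypothetical signature; a NULL guard is added here because is.finite(NULL) returns logical(0)):

    solr_search <- function(name, q = '*:*', rows = NULL,
                            optimizeMaxRows = !is.null(rows) && !is.finite(rows),
                            minOptimizedRows = 50000, ...) {
      # rows = Inf enables the optimization, which then resolves rows to
      # numFound, i.e. "fetch everything"; finite values of rows leave it off
    }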
if (rows>minOptimizedRows) {
  out <- solr_search_exec(name=name, q=q, rows='0', wt='json', raw='TRUE')
  outJson <- fromJSON(out)
  rows <- outJson$response$numFound


jefferis Jul 26, 2017

rows <- min(rows, outJson$response$numFound)

?
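Slotted into the block quoted earlier, the suggestion would read (a sketch, not the merged diff):

    if (!is.null(rows) && optimizeMaxRows) {
      if (rows > minOptimizedRows) {
        out <- solr_search_exec(name = name, q = q, rows = 0, wt = 'json', raw = TRUE)
        outJson <- fromJSON(out)
        # cap at what actually exists rather than overwriting the user's value
        rows <- min(rows, outJson$response$numFound)
      }
    }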

fix for adjusting the rows when they are lower than number of found records

rows can be specified as a negative number (-1) to get all available rows
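Hypothetical usage after this commit, reusing the techproducts examples from above:

    # fetch every matching record without knowing numFound in advance
    out_all <- solr_search('techproducts', q = "maxtor", rows = -1)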
@1havran

Contributor Author

commented Jul 26, 2017

@sckott

  • The time for running the query with 0 rows should be very short (a few milliseconds; of course it depends on the query). If the query matches more than 50000 records, the total time for running 2 queries, the first with 0 rows and the second with the adjusted rows, should not be significantly different from running just one query with 50000 rows.
  • Fixed the bug where the rows value is less than the number of found records
  • I would recommend enabling it by default for the other methods in the same way. In that case, optimizeMaxRows and minOptimizedRows could probably be initialized by solr_connect

@jefferis

  • Now the rows will be adjusted to numFound when rows < 0 or rows > numFound; a sketch of the combined logic follows below
  • Re optimizeMaxRows=TRUE, I would recommend keeping it enabled, and maybe adjusting minOptimizedRows to a higher value if needed.
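Combining the two fixes, the adjustment behaves roughly as follows (a sketch of the described behaviour, not the exact merged code):

    if (!is.null(rows) && optimizeMaxRows) {
      if (rows < 0 || rows > minOptimizedRows) {
        out   <- solr_search_exec(name = name, q = q, rows = 0, wt = 'json', raw = TRUE)
        found <- fromJSON(out)$response$numFound
        # rows < 0 requests everything; otherwise never ask for more than exist
        rows  <- if (rows < 0) found else min(rows, found)
      }
    }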
@sckott

Member

commented Jul 26, 2017

  • agree, perf. hit isn't really a big deal, and users can avoid it by setting optimizeMaxRows = FALSE
  • If we roll out to the other Solr routes i can do that after merge

I'm curious how clients in other languages handle this, if at all. Does anyone know?

@1havran can you add some tests to https://github.com/ropensci/solrium/blob/master/tests/testthat/test-solr_search.r for this new functionality and make sure at least the tests for solr_search pass with your changes
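A minimal shape those tests might take (hypothetical; assumes a techproducts core is reachable from the test suite, as in the examples above):

    test_that("solr_search adjusts overly large rows values", {
      a <- solr_search('techproducts', q = "maxtor", rows = 50001)
      b <- solr_search('techproducts', q = "maxtor", rows = -1)
      # both should be capped/expanded to numFound, so the results match
      expect_identical(a, b)
    })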

@sckott

Member

commented Jul 26, 2017

@jefferis

make the new behaviour FALSE by default

curious, would it be a deal breaker if it was set to TRUE by default?

For your comment about rows=Inf, is that functionality you want? - if so, i guess i'd think twice about doing that since an unaware user could easily then request all records from a database, not knowing how many they're getting

@jefferis


commented Jul 26, 2017

curious, would it be a deal breaker if it was set to TRUE by default?

No. If you're happy with setting a fairly high minimum number of rows before it's engaged, that seems fine.

For your comment about rows=Inf, is that functionality you want? - if so, i guess i'd think twice about doing that since an unaware user could easily then request all records from a database, not knowing how many they're getting

I've used that e.g. https://github.com/jefferis/vfbr/blob/master/R/vfb_query.R#L122-L134 and found it useful.

@1havran

Contributor Author

commented Jul 27, 2017

@sckott the tests have been included in test-solr_search.r

@sckott

Member

commented Aug 5, 2017

aside: curious @jefferis looks like you use httr straight up - would you consider using solrium under certain circumstances?


@sckott

Member

commented Aug 8, 2017

thanks much @1havran for your contribution

@sckott sckott merged commit 45b1d0d into ropensci:master Aug 8, 2017

1 check failed: continuous-integration/travis-ci/pr - The Travis CI build could not complete due to an error

@sckott sckott added this to the v0.5 milestone Aug 8, 2017
