Is your feature request related to a problem? Please describe.
Linking performance on large datasets is not terrible, but I was sure we could do better. In my case, 20M queries had to be performed against 3M Lucene-indexed records; this took 30 minutes on 4 m4.4xlarge EC2 instances on AWS EMR.
Describe the solution you'd like
I have an RDD of record IDs that I can relate back to my record metadata. All I need, for every record ID, is that record ID together with the Lucene query that will link the record to a Lucene document (i.e. find a match).
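To make the desired input shape concrete, here is a minimal sketch in plain Scala collections (an RDD would be mapped the same way with `rdd.map`); the `Rec` case class, field names, and query template are illustrative assumptions, not part of the library:

```scala
// Illustrative only: build one (luceneQuery, recordId) pair per record,
// the shape this feature request wants to feed into linking.
case class Rec(id: Long, firstName: String, lastName: String)

val recs = Seq(Rec(1L, "Alice", "Smith"), Rec(2L, "Bob", "Jones"))

// In Spark this would be recsRDD.map(...) producing an RDD[(String, Long)].
val queries: Seq[(String, Long)] =
  recs.map(r => (s"firstName:${r.firstName} AND lastName:${r.lastName}", r.id))
```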
Describe alternatives you've considered
ElasticSearch (given it is Lucene-based), but I prefer the ephemeral nature of Spark jobs, i.e. no database maintenance, support, or overhead.
Implementation
This alternative link method implementation reduced the above runtime from 30 minutes to 6 (!).
Basically, our 'other' RDD is now a pair of Lucene query strings and arbitrary metadata; in my case, simply Long values (IDs of records that are stored elsewhere).
Since many records can share the same Lucene query string, what you get back is a pair RDD whose left side holds all the metadata as an Iterable (again, in my case, just a collection of record IDs) and whose right side holds the matches.
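The grouping semantics described above can be sketched with plain Scala collections standing in for RDDs (the `search` stand-in fakes Lucene results; all names here are illustrative, not the library's API):

```scala
// Many records can share one Lucene query string, so metadata is grouped
// per distinct query, and each distinct query is searched only once.
val records: Seq[(String, Long)] = Seq(   // (queryString, recordId)
  ("name:alice", 1L),
  ("name:alice", 2L),
  ("name:bob",   3L)
)

// Pre-aggregation step, analogous to groupByKey on (query, metadata) pairs.
val grouped: Map[String, Iterable[Long]] =
  records.groupBy(_._1).map { case (q, xs) => q -> xs.map(_._2) }

// Stand-in for a Lucene search; in the real implementation this would hit
// the index and return scored documents.
def search(query: String): Seq[String] = Seq(s"doc-for-$query")

// Result shape: left side the grouped metadata, right side the matches.
val linked: Seq[(Iterable[Long], Seq[String])] =
  grouped.toSeq.map { case (q, ids) => (ids, search(q)) }
```

Because the 20M input rows collapse to far fewer distinct query strings, each query is executed once instead of once per record, which is consistent with the 30-to-6-minute speedup reported above.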
Basically, this lets the developer come up with their own identifiers, so there is no need to use .zipWithIndex: https://github.com/zouzias/spark-lucenerdd/blob/develop/src/main/scala/org/zouzias/spark/lucenerdd/LuceneRDD.scala#L217. This is an early-days implementation, but please consider adding it to the codebase.
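To illustrate the identifier point, compare the two styles in plain Scala collections (`RDD.zipWithIndex` behaves analogously on Spark; the values are made up for illustration):

```scala
// Current style: records without keys get synthetic indices assigned.
val names = Seq("alice", "bob")
val synthetic: Seq[(String, Int)] = names.zipWithIndex
// synthetic == Seq(("alice", 0), ("bob", 1))

// Proposed style: the caller's own identifiers travel with each query
// string, so no synthetic index is ever needed.
val withOwnIds: Seq[(String, Long)] = Seq(("alice", 42L), ("bob", 7L))
```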