Help debugging an OOM issue when the search population increases #228

Open
yeikel opened this issue Feb 20, 2020 · 3 comments

yeikel commented Feb 20, 2020

Describe the bug

I have a LuceneRDD that I construct using the following syntax:

val index = LuceneRDD(data).persist(StorageLevel.MEMORY_AND_DISK)

I run it against another dataset using the following code:

val results = index.link(searchBase.rdd, Linker.queryBuilderLinker, topK).filter(_._2.nonEmpty)
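For context, Linker.queryBuilderLinker is not shown in this issue; in the mainline library, link takes a function from each search record to a Lucene query string. A minimal sketch of such a linker, assuming a single "address" field (the field and function names are made up here, and my actual linker builds Query objects via query builders):

import org.apache.spark.sql.Row
import org.apache.lucene.queryparser.classic.QueryParserBase

// Illustrative only: record => Lucene query string, the shape LuceneRDD.link expects
// in the mainline library. The "address" field name is an assumption.
def stringLinker(row: Row): String = {
  val address = QueryParserBase.escape(row.getString(0))
  s"address:($address)"
}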

My data distribution is the following:

index: 90M records, 300 partitions
searchBase: 2M records, 10 partitions
topK: 1000

My cluster configuration is:

Executor-cores 2 
Executor-memory 19g
Driver-memory 25g 

And my linker configuration is:

lucenerdd {
  linker.method = "cartesian"
  store.mode = "disk"
}

Code that is failing:

flatMap at LuceneRDD.scala:222

And the exception is:

Job aborted due to stage failure: Task 9525 in stage 11.0 failed 4 times, most recent failure: Lost task 9525.3 in stage 11.0 (TID 12771, executor 89): ExecutorLostFailure (executor 89 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 21.0 GB of 21 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
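(For reference, the overhead the message refers to can be set explicitly before the SparkContext is created; a minimal sketch with an illustrative value, not something I have verified helps:)

import org.apache.spark.SparkConf

// Illustrative only: raise the YARN memory overhead mentioned in the error above.
// 4096 MiB is a placeholder value, not a tested recommendation.
val conf = new SparkConf()
  .set("spark.yarn.executor.memoryOverhead", "4096")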

Failed task:

(screenshot of the failed task omitted)

Increasing the number of partitions for searchBase or index did not make any difference. The only time the job succeeds is when I reduce my search base to 1M records. Reducing topK did not make any difference either.

My cluster is big enough to cache the index (I can see that in the Storage page of the Spark UI), so I am having a hard time understanding the source of the OOM:

(screenshots of the Spark UI Storage page omitted)

Thanks for your help!

yeikel changed the title from "Help debugging a OOM issue with Index" to "Help debugging an OOM issue when the search population increases" on Feb 20, 2020

yeikel commented Feb 24, 2020

@zouzias Do you have any idea?
I keep getting this error, and changing the partitioning does not help.


zouzias commented Feb 26, 2020

Can you try with the linker method linker.method = "collectbroadcast"?

I think you should be able to broadcast the search base to all executors since it is only 2M records.

See: https://github.com/zouzias/spark-lucenerdd/blob/master/src/main/resources/reference.conf#L19
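That is, your lucenerdd block from above would look something like this (keeping your disk store mode):

lucenerdd {
  linker.method = "collectbroadcast"
  store.mode = "disk"
}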

Also, if you have any fields that you do not use for linkage, you should remove them from the DataFrame, i.e.

import org.apache.spark.sql.functions.col

val fields = Array("usefulCol1", "usefulCol2", ...)
val projectedSearchBase = searchBase.select(fields.map(col): _*)

and then

val results = index.link(projectedSearchBase.rdd, Linker.queryBuilderLinker, topK).filter(_._2.nonEmpty)


yeikel commented Feb 26, 2020

Thank you for your response @zouzias

Both DataFrames contain only one field: a string column with addresses, so the records are not very long either.

I can't use linker.method = "collectbroadcast" because I am using the custom build discussed in #162. I am using relatively complex queries, and they are very hard to build (and manage overall) as strings; query builders do a better job here.
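To give an idea, the queries I build look roughly like the sketch below; the field name, tokenization, and fuzziness are illustrative only, and the exact link signature of the #162 build is not shown here:

import org.apache.lucene.index.Term
import org.apache.lucene.search.{BooleanClause, BooleanQuery, FuzzyQuery}

// Rough, illustrative sketch of the kind of per-record query I build with query builders.
// The "address" field name and maxEdits = 1 are assumptions for demonstration.
def addressQuery(address: String): BooleanQuery = {
  val builder = new BooleanQuery.Builder()
  address.toLowerCase.split("\\s+").foreach { token =>
    builder.add(new FuzzyQuery(new Term("address", token), 1), BooleanClause.Occur.SHOULD)
  }
  builder.build()
}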

Do you have any idea what could be the source of the OOM error? It seems that the executors are allocating too much memory, but the executors are relatively big and the datasets are relatively small.

I tried with even smaller datasets and it is still failing (size > 100K). It works sometimes with bigger datasets, but it is not reliable.
