Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use one memory mapped file for plasma #3871

Merged
merged 37 commits into from
Feb 7, 2019

Conversation

pcmoritz
Copy link
Contributor

Trying the change in apache/arrow#3490

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11195/
Test PASSed.

@pcmoritz pcmoritz changed the title [WIP] Use one memory mapped file for plasma Use one memory mapped file for plasma Jan 30, 2019
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11288/
Test FAILed.

@robertnishihara
Copy link
Collaborator

Because we are updating Arrow, this should also fix #3846.

@zhijunfu
Copy link
Contributor

Using a single mmap'ed file should help with shared memory usage.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11407/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11408/
Test FAILed.

@pcmoritz
Copy link
Contributor Author

pcmoritz commented Feb 1, 2019

@raulchen @guoyuhong Can you have a look at the Java failure? Could it be related to apache/arrow#3306, does the Java API need to be changed to incorporate that PR?

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11409/
Test PASSed.

@raulchen
Copy link
Contributor

raulchen commented Feb 2, 2019

@raulchen @guoyuhong Can you have a look at the Java failure? Could it be related to apache/arrow#3306, does the Java API need to be changed to incorporate that PR?

Doesn't seem to be related with that PR. I'll try to test it locally in a bit.

@guoyuhong
Copy link
Contributor

guoyuhong commented Feb 2, 2019

@raulchen I cannot repro it in my mac. I can repro it in my linux system. It may be caused by the PR mentioned by @pcmoritz .

Stack: [0x00007f99de1f7000,0x00007f99de2f7000],  sp=0x00007f99de2f5410,  free space=1017k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x6b9835]  jni_ThrowNew+0xb5
C  [libplasma_java.so+0x16975]  Java_org_apache_arrow_plasma_PlasmaClientJNI_create+0x315
j  org.apache.arrow.plasma.PlasmaClientJNI.create(J[BI[B)Ljava/nio/ByteBuffer;+0
j  org.apache.arrow.plasma.PlasmaClient.put([B[B[B)V+11
j  org.ray.runtime.objectstore.ObjectStoreProxy.put(Lorg/ray/api/id/UniqueId;Ljava/lang/Object;Ljava/lang/Object;)V+21
j  org.ray.runtime.AbstractRayRuntime.put(Lorg/ray/api/id/UniqueId;Ljava/lang/Object;)V+27
j  org.ray.runtime.Worker.execute(Lorg/ray/runtime/task/TaskSpec;)V+190
j  org.ray.runtime.Worker.loop()V+31
j  org.ray.runtime.AbstractRayRuntime.loop()V+4
j  org.ray.runtime.runner.worker.DefaultWorker.main([Ljava/lang/String;)V+27
v  ~StubRoutines::call_stub
V  [libjvm.so+0x68855f]  JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*)+0x107f
V  [libjvm.so+0x6b557d]  jni_invoke_static(JNIEnv_*, JavaValue*, _jobject*, JNICallType, _jmethodID*, JNI_ArgumentPusher*, Thread*)+0x32d
V  [libjvm.so+0x6b703f]  jni_CallStaticVoidMethod+0x17f
C  [libjli.so+0x38a4]  JavaMain+0x604
C  [libpthread.so.0+0x7e25]  start_thread+0xc5

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  org.apache.arrow.plasma.PlasmaClientJNI.create(J[BI[B)Ljava/nio/ByteBuffer;+0
j  org.apache.arrow.plasma.PlasmaClient.put([B[B[B)V+11
j  org.ray.runtime.objectstore.ObjectStoreProxy.put(Lorg/ray/api/id/UniqueId;Ljava/lang/Object;Ljava/lang/Object;)V+21
j  org.ray.runtime.AbstractRayRuntime.put(Lorg/ray/api/id/UniqueId;Ljava/lang/Object;)V+27
j  org.ray.runtime.Worker.execute(Lorg/ray/runtime/task/TaskSpec;)V+190
j  org.ray.runtime.Worker.loop()V+31
j  org.ray.runtime.AbstractRayRuntime.loop()V+4
j  org.ray.runtime.runner.worker.DefaultWorker.main([Ljava/lang/String;)V+27
v  ~StubRoutines::call_stub

Maybe env->FindClass("org/apache/arrow/plasma/exceptions/DuplicateObjectException"); return something wrong?

@guoyuhong
Copy link
Contributor

guoyuhong commented Feb 2, 2019

@pcmoritz In my local Mac or Linux I still found there is a java building failure which can be fixed by this PR: apache/arrow#2738 . I will reopen this PR. Could you please help to merge it? This PR will fix the ray-java building error showed in following picture.
image

@pcmoritz
Copy link
Contributor Author

pcmoritz commented Feb 2, 2019

@guoyuhong Thanks for looking into this! Do you have an idea what the right fix is? This is very high priority since there is a number of bugs that are blocked on being able to upgrade arrow.

@guoyuhong
Copy link
Contributor

@pcmoritz It looks like that the Arrow version in pom.xml is too old. So java runtime cannot find the corresponding class. I have updated the pom.xml. @raulchen could help to review this change. @raulchen also said #3687 will make the JAVA test pass.

@guoyuhong
Copy link
Contributor

@pcmoritz With @raulchen 's help, the test passed in my local environment. I copied the change in ObjectStoreProxy.java from #3687 . Otherwise, the two PR is dead-locked.

@pcmoritz
Copy link
Contributor Author

pcmoritz commented Feb 2, 2019

Great, thanks a lot! I'll try to get this PR merged tomorrow then :)

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11433/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11440/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11458/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11462/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11586/
Test FAILed.

@pcmoritz
Copy link
Contributor Author

pcmoritz commented Feb 5, 2019

Jenkins retest this please

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11588/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11590/
Test FAILed.

@pcmoritz
Copy link
Contributor Author

pcmoritz commented Feb 6, 2019

Jenkins retest this please

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11594/
Test FAILed.

@pcmoritz
Copy link
Contributor Author

pcmoritz commented Feb 6, 2019

Jenkins retest this please

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11595/
Test FAILed.

@pcmoritz
Copy link
Contributor Author

pcmoritz commented Feb 6, 2019

Jenkins retest this please

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11596/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11592/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11597/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11598/
Test FAILed.

@pcmoritz
Copy link
Contributor Author

pcmoritz commented Feb 6, 2019

Jenkins retest this please

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11601/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11602/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11599/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11605/
Test PASSed.

Copy link
Contributor

@guoyuhong guoyuhong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All tests passed. Thanks for your effort!

@@ -197,7 +197,7 @@ def g(x):
def ray_start_reconstruction(request):
num_nodes = request.param

plasma_store_memory = 10**9
plasma_store_memory = int(0.5 * 10**9)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change was needed because the space calculations in plasma are now done slightly differently (there can now be a discrepancy between the plasma store memory and the size of the memory mapped files, in practice I have seen < 10%).

This change is to be on the safe side to prevent flakiness. With 0.8 here I was still seeing test_simple and test_recursive in stress_tests.py failing occasionally.

@robertnishihara
Copy link
Collaborator

Let's hold off on merging this until after the 0.6.3 release (which should be soon).

@robertnishihara robertnishihara merged commit 3bb6567 into ray-project:master Feb 7, 2019
@robertnishihara robertnishihara deleted the ray-plasma-one-file branch February 7, 2019 07:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants