I am aware that passing byte arrays via Py4J is not (supposed to be) very efficient (although some copy operations along the path could be avoided).
Nevertheless, I was initially surprised by the following performance difference (measured with py4j-0.10.9.5.jar as included in PySpark, on Linux):
import time
b = spark.sparkContext._jvm.java.nio.ByteBuffer.allocate(4096)
t0 = time.time()
for i in range(0, 100):
    u = b.array()
print(time.time() - t0)
0.04267597198486328
b = spark.sparkContext._jvm.java.nio.ByteBuffer.allocate(8192)
t0 = time.time()
for i in range(0, 100):
    u = b.array()
print(time.time() - t0)
4.404087543487549
It turns out that the code suffers from Nagle's algorithm here. For example, in CallCommand:
writer.write(returnCommand);
writer.flush();
If writing returnCommand exceeds the BufferedWriter's internal buffer, two separate writes reach the socket's output stream — and with Nagle enabled, the second (small) segment is held back until the first one has been acknowledged.
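The write–write–flush interaction can be reproduced with plain buffered I/O. Here is a minimal Python sketch (not py4j code; CountingRaw is a made-up stand-in for the socket stream) showing that once a payload no longer fits in the buffer, two separate writes reach the underlying stream — exactly the pattern Nagle's algorithm penalizes:

```python
import io

class CountingRaw(io.RawIOBase):
    """Raw byte sink that counts write() calls, standing in for the socket."""
    def __init__(self):
        self.writes = 0

    def writable(self):
        return True

    def write(self, b):
        self.writes += 1
        return len(b)

# 8192 is also the default buffer size of Java's BufferedWriter.
raw = CountingRaw()
buffered = io.BufferedWriter(raw, buffer_size=8192)

# Two writes whose total (10000 bytes) overflows the 8192-byte buffer,
# then a flush — mirroring writer.write(returnCommand); writer.flush().
# (CPython passes a single oversized write straight through to the raw
# stream, so the payload is split here; Java's BufferedWriter chunks a
# large write through its buffer, yielding the same two socket writes.)
buffered.write(b"x" * 5000)
buffered.write(b"x" * 5000)
buffered.flush()
print(raw.writes)  # → 2
```

With TCP_NODELAY off, the kernel delays the second of those two segments until the first is acknowledged, which is where the roughly 40 ms per call in the benchmark above comes from.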
After disabling Nagle's algorithm for loopback sockets by adding the following to ClientServerConnection.java:
super();
this.socket = socket;
// added
if (socket.getLocalAddress().isLoopbackAddress()) socket.setTcpNoDelay(true);
this.reader = new BufferedReader(new InputStreamReader(socket.getInputStream(), Charset.forName("UTF-8")));
I get the following run time measurements:
0.047772884368896484
0.07696914672851562
I think that disabling the algorithm for loopback sockets has no disadvantages, since buffering already happens in the BufferedWriter. Possibly it could be disabled in general.
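For reference, the same guard can be expressed outside Java. A minimal Python sketch (names and setup are mine, not py4j's) that disables Nagle only when the peer is a loopback address:

```python
import ipaddress
import socket

# Loopback server/client pair; addresses and ports are illustrative.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server.getsockname())
conn, _ = server.accept()

# Mirror of the Java patch: only disable Nagle when talking to loopback,
# where coalescing small segments into fewer packets buys nothing.
peer_ip = client.getpeername()[0]
if ipaddress.ip_address(peer_ip).is_loopback:
    client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

nodelay = bool(client.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))
print(nodelay)  # → True

conn.close()
client.close()
server.close()
```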