Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Net::TCPClient#socket_write corrupts data on short writes #10

Closed
SpComb opened this issue Jul 10, 2017 · 7 comments
Closed

Net::TCPClient#socket_write corrupts data on short writes #10

SpComb opened this issue Jul 10, 2017 · 7 comments

Comments

@SpComb
Copy link

SpComb commented Jul 10, 2017

The Net::TCPClient#socket_write calls socket.write_nonblock(data). However, it does not check the returned number of bytes written:

It returns the number of bytes written.

#write_nonblock just calls the write(2) system call. [...] The result may also be smaller than string.length (partial write). The caller should care such errors and partial write.

I believe that in the case of full socket write buffers, and calls to write with large buffers that the kernel does not accept in a single write syscall could result in lost data, as the write syscall only sends part of the buffer. This leads to corruption once that write returns, and the next write is called.


Here's a repro case demonstrating this issue: https://gist.github.com/SpComb/c8b857fc5bb575a1f859f7ea57603a29

The test-client.rb sends one ~1m lines of sequential integers formatted as '%07d\n' (00000001..0998001). It sends them 1k lines at a time, in 8kb write calls. It logs the return value from Net::TCPClient#write, which seems to be the value returned by socket.write_nonblocking.

The test-server.rb reads one line at a time, and prints . N + sleeps 1s/10k lines for the first 100k lines, and then 1s/100k lines for the rest. It parses each line as an integer, comparing it to the previous line, logging a ! message if the lines are not strictly sequential

During the first slow read period, the sending client gets blocked on a full send buffer, which shows up as short writes on the client after about ~400-500kb. It blocks for a short period, and then continues, as the server empties its recv buffer and makes room for more data:

connected: #<Net::TCPClient:0x00000000f92db0>
write 0...: 8000
write 1000...: 8000
write 2000...: 8000
write 3000...: 8000
...
write 440000...: 8000
write 441000...: 8000
write 442000...: 4218
write 443000...: 8000
write 444000...: 8000
....
write 638000...: 8000
write 639000...: 3592
write 640000...: 8000
...
write 826000...: 8000
write 827000...: 8000
write 828000...: 2109
write 829000...: 8000
write 830000...: 8000
...
write 998000...: 8000
write 999000...: 8000

Each of these short writes results in corruption on the server:

connect: 127.0.0.1:37232
. 0
. 10000
. 20000
. 30000
. 40000
. 50000
. 60000
. 70000
. 80000
. 90000
. 100000
. 200000
. 300000
. 400000
! 40443000 -40000474
! 0443001 +39999999
. 500000
. 600000
! 0640000 -552
. 700000
. 800000
! 82820829000 -82820000738
! 0829001 +82819999999
. 900000
eof
@Fivell
Copy link
Contributor

Fivell commented Sep 26, 2017

@SpComb did you fixed it ?

@Fivell
Copy link
Contributor

Fivell commented Sep 26, 2017

looks like implementation should be similar to lann/tcp-timeout-ruby@a167299#diff-783b62709d9f59031b907690bb165c6cR66

Fivell added a commit to didww/net_tcp_client that referenced this issue Sep 26, 2017
Fivell added a commit to didww/net_tcp_client that referenced this issue Sep 26, 2017
Fivell added a commit to didww/net_tcp_client that referenced this issue Sep 26, 2017
@reidmorrison
Copy link
Owner

@Fivell since it appears you have the code fix, do you want to submit a PR to merge your changes?
Have you had a chance to try out your changes in production?

@Fivell
Copy link
Contributor

Fivell commented Jan 3, 2019

@reidmorrison yes we tested it in production, however I have no specs for it

@Fivell
Copy link
Contributor

Fivell commented Jan 3, 2019

@reidmorrison seems travis.yml need to be fixed

@reidmorrison
Copy link
Owner

Merged the fix and will research Travis failures. Thank you for the PR.

@ebertech
Copy link

I think this also occurs on short reads...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants