Socket source with bytes framing and TCP mode only sends data after connection is closed #17136

Closed
dekelpilli opened this issue Apr 13, 2023 · 7 comments
Labels
type: bug A code related bug.

Comments

@dekelpilli

dekelpilli commented Apr 13, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

As the title says, the socket source with bytes framing doesn't seem to be behaving correctly when the mode is tcp.

In addition to the tests detailed in the Discord link below, I also set up tests using Vector on both sides. The data only appeared on the receiving Vector instance after the sending Vector instance was stopped, and even then all of the data was concatenated into one event. I believe the netcat tests in the Discord also show that this is an issue with the socket source, rather than with the sink.

Configuration

Receiving:

{
  "healthchecks": {
    "enabled": false
  },
  "log_schema": {
    "source_type_key": ""
  },
  "sources": {
    "a": {
      "address": "0.0.0.0:30000",
      "port_key": "",
      "mode": "tcp",
      "type": "socket",
      "host_key": "",
      "framing": {
        "method": "bytes"
      }
    },
    "b": {
      "address": "0.0.0.0:30001",
      "port_key": "",
      "mode": "tcp",
      "type": "socket",
      "host_key": "",
      "framing": {
        "method": "newline_delimited"
      },
      "decoding": {"codec": "bytes"}
    },
    "c": {
      "address": "0.0.0.0:30002",
      "framing": {
        "method": "bytes"
      },
      "port_key": "",
      "mode": "udp",
      "type": "socket",
      "host_key": ""
    }
  },
  "transforms": {
  },
  "sinks": {
    "ljpdseuegrsvnbvaagwkj": {
      "encoding": {
        "codec": "json"
      },
      "inputs": [
        "*"
      ],
      "target": "stderr",
      "type": "console"
    }
  }
}

Sending:

{
  "log_schema": {
    "source_type_key": ""
  },
  "sources": {
    "h": 
    {
      "type": "http",
      "acknowledgements": false,
      "address": "0.0.0.0:${PORT:-1234}",
      "framing": {"method":"newline_delimited"},
      "path_key": ""
    }
  },
  "transforms": {
  },
  "sinks": {
    "tcp": {
      "type": "socket",
      "inputs": ["h"],
      "address": "0.0.0.0:30001",
      "mode": "tcp",
      "encoding": {
        "codec": "raw_message"
      },
      "framing": {
        "method": "bytes"
      },
      "buffer": {
        "max_events": 1
      }
    }
  }
}

Version

vector 0.28.2 (x86_64-apple-darwin 986dd37 2023-04-10)

Debug Output

No response

Example Data

I sent three separate HTTP messages to the sending Vector: abc, 123, and ab. Once I shut down that Vector instance, the following data appeared in the receiving Vector:

{"message":"abc123ab","timestamp":"2023-04-13T03:06:18.673551Z"}

Additional Context

No response

References

@dekelpilli dekelpilli added the type: bug A code related bug. label Apr 13, 2023
@neuronull
Contributor

Hello, thanks for providing these details and reproducible configs.

I think what is going on here (in both the socket source case and the socket sink case) is that, with bytes framing, the documented behavior applies:

"Byte frames are passed through as-is according to the underlying I/O boundaries (for example, split between messages or stream segments)."

In your example cases, these boundaries haven't been crossed.

To illustrate this: if the framing of the socket sink in the Sending config is changed to newline_delimited, then sending the example data you provided in separate requests results in each message being delivered separately to the console of the receiving instance.

That is because the socket sink now knows where to cut the boundary for an event. With the framing method set to bytes, even though the http source received one event, the I/O stream is still open as far as the socket sink is concerned.

This also further supports @tobz's observation in the discord thread about netcat behavior (when only using the Receiving config).
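
The difference between the two framing methods can be sketched in a few lines of Python. This is only a model of the behavior reported in this thread, not Vector's implementation; the function names `frame_until_eof` and `frame_newline_delimited` are made up for illustration.

```python
from typing import Iterable, List


def frame_until_eof(chunks: Iterable[bytes]) -> List[bytes]:
    """Model of the reported behavior: nothing is emitted until the
    stream ends, then all buffered bytes arrive as a single event."""
    buffered = b"".join(chunks)
    return [buffered] if buffered else []


def frame_newline_delimited(chunks: Iterable[bytes]) -> List[bytes]:
    """Emit one event per newline-terminated frame, regardless of how
    the bytes happened to be chunked on the wire."""
    events: List[bytes] = []
    pending = b""
    for chunk in chunks:
        pending += chunk
        while b"\n" in pending:
            frame, pending = pending.split(b"\n", 1)
            events.append(frame)
    if pending:  # trailing bytes without a newline are flushed at end of stream
        events.append(pending)
    return events


writes = [b"abc", b"123", b"ab"]
print(frame_until_eof(writes))                               # [b'abc123ab']
print(frame_newline_delimited([w + b"\n" for w in writes]))  # [b'abc', b'123', b'ab']
```

With no delimiter in the stream, the only boundary the receiver can observe is the end of the connection, which matches the single `abc123ab` event in the example data.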

@neuronull
Contributor

Hi @dekelpilli , have you had a chance to review my previous comment?
Essentially, I believe the source is working as intended.

@dekelpilli
Author

Hi @neuronull - I still believe this is not how the source should behave. My understanding is that the "underlying I/O's boundary" should just be a single TCP packet, rather than a full connection. Given that the structure of a TCP packet is <segment header><data>, I would expect source a above to emit the bytes content after every packet, and I would expect the presence or absence of newlines within that data to have no impact on the framing of events emitted by the source.

This is my understanding of "split between messages or stream segments", rather than splitting messages when the connection is closed. I also believe this is far more useful: if for TCP we take the I/O boundary to mean the span from the start of a connection until the end of that connection, then bytes framing on TCP is not a very useful feature (as it would require breaking and recreating the connection to get messages to progress).

@jszwedko
Member

> Hi @neuronull - I still believe this is not how the source should behave. My understanding is that the "underlying I/O's boundary" should just be a single TCP packet, rather than a full connection. Where the structure of a TCP packet is <segment header><data>, I would expect source a above to send the bytes content of after every packet, and that the presence/absence of newlines within that data to make no impact on the framing of events emitted by the source.
>
> This is my understanding of "split between messages or stream segments", rather than splitting messages when the connection is closed. I also believe this is far more useful, and if for TCP we take the I/O boundary to mean the start of a connection until the end of that connection, then bytes framing on the TCP is not a very useful feature (as it would require breaking and recreating the connection to get messages to progress).

I completely agree with you that the bytes framing for TCP connections is not terribly useful, but I don't know that using TCP packets as the boundary is terribly useful either. In contrast with UDP packets, where the content of the packet is usually controlled by the application, TCP packet contents are largely opaque to the application. The OS can decide how to assemble them, and routers in between can fragment packets further, so the receiver has very few guarantees about how the incoming data is split.

Maybe it'd be better to step back to understand your use-case a bit better. Would you be able to describe it? Answering questions like: what is sending the data? What is the data? What processing are you hoping to do in Vector? That would help us understand how it could be best modeled in Vector (including if it would make sense for the bytes framing to mean TCP packets for TCP connections).
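
As a rough illustration of why packet boundaries are hard to rely on, the sketch below sends three separate writes over a loopback TCP connection and records what each `recv()` call returns. The helper name `send_and_collect` is hypothetical, and the number of chunks the receiver sees is up to the OS; only the byte order is guaranteed.

```python
import socket


def send_and_collect(messages):
    """Send each message with its own sendall() call over loopback TCP
    and return the chunks the receiver obtained from recv()."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server.listen(1)
        with socket.create_connection(server.getsockname()) as client:
            conn, _ = server.accept()
            with conn:
                for message in messages:
                    client.sendall(message)
                client.shutdown(socket.SHUT_WR)  # signal end of stream
                chunks = []
                while True:
                    chunk = conn.recv(4096)
                    if not chunk:  # empty read: the peer closed its side
                        break
                    chunks.append(chunk)
    return chunks


chunks = send_and_collect([b"abc", b"123", b"ab"])
# The chunk count may differ between runs and systems; the joined bytes do not.
print(len(chunks), b"".join(chunks))
```

On many systems the three writes come back as a single chunk, so the receiver cannot tell from `recv()` results alone where one application-level message ended and the next began.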

@dekelpilli
Author

dekelpilli commented Jun 3, 2023

what is sending the data?

Our own (thin) service that is receiving UDP messages and streaming them to Vector. At the moment, it's also adding newlines to the end of the messages as we're using newline_delimited (similar to source b above). This is, in large part, a workaround for the current issues with the UDP mode for the socket source (#15583, #8518). This service also performs other logic and collects various information in ways vector can't (and shouldn't need to), so having this as part of the service is not a large opportunity cost.

What is the data? What processing are you hoping to do in Vector?

Various application and device logs, which go through some transforms in vector before ending up in storage/loki sinks.


At the moment, this issue isn't impacting us very much because we have already implemented a workaround. That said, I do believe that if the current behaviour of the bytes framing won't change, it's worth clarifying the documentation to specify that the I/O boundary is the connection, rather than packets. One could even argue that it's worth deprecating that framing method for TCP, but I'm not sure that adds much value.
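
For readers who want to apply the same workaround, here is a minimal sketch of the framing step, assuming each UDP payload should become exactly one newline-terminated frame on the TCP stream. The function name `to_newline_frame` is hypothetical and this is not the service's actual code; replacing embedded newlines with a space is an arbitrary choice made for the example.

```python
def to_newline_frame(datagram: bytes) -> bytes:
    """Turn one datagram payload into one newline-terminated frame."""
    # Drop any trailing line ending so the frame ends in exactly one newline.
    body = datagram.rstrip(b"\r\n")
    # Embedded newlines would split the payload into several events under
    # newline_delimited framing, so this sketch replaces them with a space.
    body = body.replace(b"\r\n", b" ").replace(b"\n", b" ")
    return body + b"\n"


print(to_newline_frame(b"abc"))       # b'abc\n'
print(to_newline_frame(b"a\nb\r\n"))  # b'a b\n'
```

A receiver configured like source b above (newline_delimited framing) would then see one event per forwarded datagram.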

@jszwedko
Member

Thanks for the additional details @dekelpilli ! Using the newline delimited framing for that case makes sense in light of the known issues for UDP (which would be nice to fix).

I agree that the connection-based framing is not likely to be used much, but it is at least something senders using TCP have more control over than TCP packet splitting. That is, they can open a connection, send an "event", and close it, but they can't guarantee that a single TCP packet will contain a single event, since TCP packets can be split in transit.
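
The open, send, close pattern described above could look like the following sketch. The receiver here is a stand-in written for the example (it treats everything read before end of stream as one event) and is not Vector; `send_event` and `receive_events` are hypothetical names.

```python
import socket
import threading


def send_event(address, payload: bytes) -> None:
    """Open a connection, send one event, and close it."""
    with socket.create_connection(address) as conn:
        conn.sendall(payload)


def receive_events(server: socket.socket, count: int, events: list) -> None:
    """Accept `count` connections; all bytes read before EOF form one event."""
    for _ in range(count):
        conn, _addr = server.accept()
        with conn:
            data = b""
            while chunk := conn.recv(4096):
                data += chunk
            events.append(data)


payloads = [b"abc", b"123", b"ab"]
events: list = []
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(len(payloads))
    receiver = threading.Thread(
        target=receive_events, args=(server, len(payloads), events)
    )
    receiver.start()
    for payload in payloads:
        send_event(server.getsockname(), payload)
    receiver.join()
print(events)  # [b'abc', b'123', b'ab']
```

The cost is one TCP handshake per event, which is part of why this framing is unlikely to see much use.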

Given the above discussion, I'll close this issue, but I appreciate you raising it!

jszwedko added a commit that referenced this issue Jun 23, 2023
I'm not aware of any sources that separate "stream segments" so I updated the language a bit to
account for how the `tcp` mode of the `socket` source handles `bytes` framing.

Reference: #17136

Signed-off-by: Jesse Szwedko <jesse.szwedko@datadoghq.com>
@jszwedko
Member

Opened #17745 to clarify the docs.

github-merge-queue bot pushed a commit that referenced this issue Jun 23, 2023
I'm not aware of any sources that separate "stream segments" so I updated the language a bit to
account for how the `tcp` mode of the `socket` source handles `bytes` framing.

Reference: #17136

Signed-off-by: Jesse Szwedko <jesse.szwedko@datadoghq.com>


3 participants