
Remap runtime is heavily dominated by clone/drop times #6770

Closed
blt opened this issue Mar 15, 2021 · 5 comments
Assignees: blt
Labels: domain: performance (Anything related to Vector's performance), domain: vrl (Anything related to the Vector Remap Language), type: task (Generic non-code related tasks)

Comments

@blt
Contributor

blt commented Mar 15, 2021

@lukesteensen and I have been profiling vector 6193e68 across two configs, seen here and here. Our hypothesis was that the nohop configuration would perform better than the multihop configuration, but the opposite is true. For instance:

 Performance counter stats for 'bash -c zcat baked.log.gz | /home/blt/projects/com/datadog/timberio/vector-test-harness/cases/performance/klog_remap_to_blackhole/bin/vector_gen-6193e68e5b3e50b7e5469d6a8699e25d07bcf57e -qq --config vector-multihop.toml' (10 runs):
        100,771.53 msec task-clock                #    3.771 CPUs utilized            ( +-  0.33% )
           633,521      context-switches          #    0.006 M/sec                    ( +-  3.33% )
               275      cpu-migrations            #    0.003 K/sec                    ( +-  3.85% )
             8,759      page-faults               #    0.087 K/sec                    ( +-  0.24% )
   238,463,149,986      cycles                    #    2.366 GHz                      ( +-  0.53% )  (83.34%)
     9,407,007,362      stalled-cycles-frontend   #    3.94% frontend cycles idle     ( +-  1.44% )  (83.32%)
   118,882,228,458      stalled-cycles-backend    #   49.85% backend cycles idle      ( +-  0.90% )  (83.37%)
   325,001,862,213      instructions              #    1.36  insn per cycle
                                                  #    0.37  stalled cycles per insn  ( +-  0.06% )  (83.28%)
    56,917,463,766      branches                  #  564.817 M/sec                    ( +-  0.08% )  (83.33%)
       166,816,926      branch-misses             #    0.29% of all branches          ( +-  0.47% )  (83.36%)
           26.7223 +- 0.0645 seconds time elapsed  ( +-  0.24% )

 Performance counter stats for 'bash -c zcat baked.log.gz | /home/blt/projects/com/datadog/timberio/vector-test-harness/cases/performance/klog_remap_to_blackhole/bin/vector_gen-6193e68e5b3e50b7e5469d6a8699e25d07bcf57e -qq --config vector-nohop.toml' (10 runs):
        111,934.76 msec task-clock                #    3.073 CPUs utilized            ( +-  0.45% )
         1,060,241      context-switches          #    0.009 M/sec                    ( +-  1.90% )
               370      cpu-migrations            #    0.003 K/sec                    ( +-  7.33% )
             7,044      page-faults               #    0.063 K/sec                    ( +-  0.92% )
   240,814,108,668      cycles                    #    2.151 GHz                      ( +-  0.44% )  (83.31%)
    10,586,860,513      stalled-cycles-frontend   #    4.40% frontend cycles idle     ( +-  0.99% )  (83.35%)
   107,199,595,735      stalled-cycles-backend    #   44.52% backend cycles idle      ( +-  0.81% )  (83.37%)
   330,593,682,800      instructions              #    1.37  insn per cycle
                                                  #    0.32  stalled cycles per insn  ( +-  0.15% )  (83.36%)
    59,140,674,056      branches                  #  528.350 M/sec                    ( +-  0.19% )  (83.33%)
       291,923,061      branch-misses             #    0.49% of all branches          ( +-  1.62% )  (83.27%)
            36.424 +- 0.356 seconds time elapsed  ( +-  0.98% )

The multihop configuration runs ~10 seconds faster than the nohop one, the hypothesis being that remap is a serial bottleneck in the nohop variant, whereas in multihop its costs are spread across multiple threads. This is borne out by the flamegraphs of the two configs, seen here. Notably, the diff flamegraph does not show a negative difference except for slightly elevated backtrace generation in the nohop variant.

The following screenshot shows the major contributors to runtime:
Screenshot from 2021-03-15 13-54-22

For the purposes of this issue we can see remap in two places:

Screenshot from 2021-03-15 13-57-43
Screenshot from 2021-03-15 13-59-36

Note the high degree to which drops and BTreeMap clones appear in this flamegraph. VRL does a non-trivial amount of cloning internally, and these costs do show up.
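
To make the flamegraph observation concrete, here is a minimal sketch (hypothetical types, not Vector's actual Event or Value) of why both clone and drop towers scale with event size when the event body is a tree of nested BTreeMaps:

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for an event value tree.
#[derive(Clone, Debug)]
enum Value {
    Bytes(String),
    Map(BTreeMap<String, Value>),
}

fn main() {
    // Build a small nested event, e.g. .kubernetes.labels.app plus a message.
    let mut labels = BTreeMap::new();
    labels.insert("app".to_string(), Value::Bytes("web".to_string()));
    let mut kubernetes = BTreeMap::new();
    kubernetes.insert("labels".to_string(), Value::Map(labels));
    let mut event = BTreeMap::new();
    event.insert("kubernetes".to_string(), Value::Map(kubernetes));
    event.insert("message".to_string(), Value::Bytes("hello".to_string()));

    // Each clone deep-copies the whole tree: every map node and every String
    // is re-allocated, so clone cost grows with event size.
    let copy = event.clone();

    // Each drop then walks the same tree to free it node by node, which is
    // why clones and drops show up as separate towers in the flamegraph.
    drop(copy);
}
```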

cc @timberio/vector-processing

@blt added the type: task, domain: performance, and domain: vrl labels on Mar 15, 2021
@blt self-assigned this on Mar 15, 2021
@StephenWakely
Contributor

Interesting. It looks like there are two main culprits.

The first is the clone we need to make to ensure the event isn't mutated by the condition. There is a similar clone made by the transform so that we can roll back on error.

The other culprit is the way the Target trait is defined - https://github.com/timberio/vector/blob/master/src/event/log_event.rs#L247. To fetch a value from the event, it returns the value rather than a reference, thus requiring a clone.

This is partly because the VRL Value is different from the Vector Value, so a VRL Value needs to be created and returned. Also, for metric events the actual Value doesn't really exist in the event, so a value has to be created and returned; if it were a reference, there would be nothing to own the value the reference points at.

The Target trait is essentially the interface between Vector and VRL, and I suspect most of the gains can be made by rethinking how it works: first, by only copying data when it is modified, and second, by handling references to Vector Values where possible and owning (and caching) created data where references aren't possible.
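
As a rough illustration of the trade-off described above (the traits, types, and method names here are simplified stand-ins, not the actual Target trait), an interface that returns owned values forces a clone on every lookup, whereas a copy-on-write return type can borrow existing data and only allocate for values that must be synthesized:

```rust
use std::borrow::Cow;
use std::collections::BTreeMap;

// Hypothetical stand-in for the event value type.
#[derive(Clone, Debug)]
enum Value {
    Bytes(String),
}

struct LogEvent {
    fields: BTreeMap<String, Value>,
}

// Owned-return style: the implementation has no choice but to clone on every
// lookup, even if the caller only wants to read the value.
trait OwnedGet {
    fn get(&self, field: &str) -> Option<Value>;
}

impl OwnedGet for LogEvent {
    fn get(&self, field: &str) -> Option<Value> {
        self.fields.get(field).cloned()
    }
}

// Copy-on-write style: borrow when the value already exists in the event, and
// reserve Cow::Owned for values that must be synthesized (e.g. from metrics).
trait CowGet {
    fn get(&self, field: &str) -> Option<Cow<'_, Value>>;
}

impl CowGet for LogEvent {
    fn get(&self, field: &str) -> Option<Cow<'_, Value>> {
        self.fields.get(field).map(Cow::Borrowed)
    }
}

fn main() {
    let mut fields = BTreeMap::new();
    fields.insert("message".to_string(), Value::Bytes("hello".to_string()));
    let event = LogEvent { fields };

    // The first call allocates a copy of the String; the second only borrows.
    let owned = OwnedGet::get(&event, "message");
    let borrowed = CowGet::get(&event, "message");
    println!("{:?} {:?}", owned, borrowed);
}
```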

@blt
Contributor Author

blt commented Mar 19, 2021

As noted in #6787, if we adjust the internal representation of VRL's map to be cheaper to clone -- trying various pointer types and even a simple vec of tuples -- the runtime of this configuration:

data_dir = "data"

[sources.stdin]
type = "stdin"

[transforms.remap]
type = "remap"
inputs = ["stdin"]
source = """
# json_parser transform equivalent
parsed_json, err = parse_json(.message)
if err != null {
  ., _ = merge(., parsed_json)
}
# regex_parser transform equivalent (double escape character classes because toml)
matches, err = parse_regex(.source, r'^/([\\w]+)/([\\w]+)/([\\w]+)/(?P<pod>[\\w\\-]+)_(?P<namespace>[\\w\\-]+)_(?P<container>[\\w\\-\\.]+)-(\\w+).log$')
if err != null {
  .pod = matches.pod
  .namespace = matches.namespace
  .container = matches.container
}
# add_fields transform equivalent
.origin_id, err = .namespace + "/" + .pod + "/" + .container
.agent_name = "vector"
.host = "${VECTOR_SELF_NODE_NAME}"
# coercer transform equivalent
ts, err = to_timestamp(.time)
if err != null {
  .timestamp = ts
}
# original remap logic, improved
nested_json, err = parse_json(.log)
if err != null {
  .log = nested_json
  .log_type = "json"
} else {
  .log_type = "text"
}
"""

[transforms.swimlanes]
type = "swimlanes"
inputs = ["remap"]

[transforms.swimlanes.lanes.json]
type = "remap"
source = """
  .log_type == "json"
"""

[transforms.swimlanes.lanes.text]
type = "remap"
source = """
  .log_type == "text"
"""

[transforms.reduce]
type = "reduce"
inputs = ["swimlanes.text"]
group_by = ["origin_id"]
merge_strategies.log = "concat_newline"
starts_when.type = "remap"
starts_when.source = '''match!(.log, r'^\d{4}\D\d{2}\D\d{2}(T|\s)?\d{2}:\d{2}:\d{2}') || match!(.log, r'^\[?\d{2}\S\w{3}\S\d{4}:\d{2}:\d{2}:\d{2} [+]\d{2,}\]') || match!(.log, r'^\w{3}\s\d{2}\s\d{2}:\d{2}:\d{2}.\d{3}') || match!(.log, r'^\[\w+]: \d{4}\D\d{2}\D\d{2}(T|\s)?\d{2}:\d{2}:\d{2}')'''

[sinks.blackhole]
type = "blackhole"
inputs = ["reduce", "swimlanes.text", "swimlanes.json"]

is dominated by the clone costs of BTreeMaps. Notably, this configuration, which splits the work across multiple topology nodes,

data_dir = "data"

[sources.stdin]
type = "stdin"

[transforms.json_parser]
type = "json_parser"
inputs = ["stdin"]

[transforms.regex_parser]
type = "regex_parser"
inputs = ["json_parser"]
drop_field = false
field = "source"
# Take namespace,pod,container name from file field of /var/log/containers/
patterns = ['^/([\w]+)/([\w]+)/([\w]+)/(?P<pod>[\w\-]+)_(?P<namespace>[\w\-]+)_(?P<container>[\w\-\.]+)-(\w+).log$']
types.pod = "string"
types.namespace = "string"
types.container = "string"

[transforms.add_fields]
type = "add_fields"
inputs = ["regex_parser"]
overwrite = true
fields.origin_id = "{{ namespace }}/{{ pod }}/{{ container }}"
fields.agent_name = "vector"
fields.host = "${VECTOR_SELF_NODE_NAME}"
fields.timestamp = "{{ time }}"

[transforms.coercer]
type = "coercer"
inputs = ["add_fields"]
types.timestamp = "timestamp"

[transforms.remap]
type = "remap"
inputs = ["coercer"]
source = """
if starts_with(.log, "{") {
   .log = parse_json!(.log)
   .log_type = "json"
} else {
  .log_type = "text"
}
"""

[transforms.swimlanes]
type = "swimlanes"
inputs = ["remap"]

[transforms.swimlanes.lanes.json]
type = "remap"
source = """
  .log_type == "json"
"""

[transforms.swimlanes.lanes.text]
type = "remap"
source = """
  .log_type == "text"
"""

[transforms.reduce]
type = "reduce"
inputs = ["swimlanes.text"]
group_by = ["origin_id"]
merge_strategies.log = "concat_newline"
starts_when.type = "remap"
starts_when.source = '''match!(.log, r'^\d{4}\D\d{2}\D\d{2}(T|\s)?\d{2}:\d{2}:\d{2}') || match!(.log, r'^\[?\d{2}\S\w{3}\S\d{4}:\d{2}:\d{2}:\d{2} [+]\d{2,}\]') || match!(.log, r'^\w{3}\s\d{2}\s\d{2}:\d{2}:\d{2}.\d{3}') || match!(.log, r'^\[\w+]: \d{4}\D\d{2}\D\d{2}(T|\s)?\d{2}:\d{2}:\d{2}')'''

[sinks.blackhole]
type = "blackhole"
inputs = ["reduce", "swimlanes.text", "swimlanes.json"]

pays similar costs, but does so across multiple threads in the default vector setup and is faster as a result. If we make vector single-threaded, they're more or less the same cost. I believe this strongly suggests that merely making clones cheap -- as is attempted in #6787 -- is not sufficient and that we must remove clones. Pushing forward with #5374 strikes me as the next logical step.
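
For illustration only (this is not the code from #6787), the sketch below shows why wrapping the map in an Arc makes event cloning cheap but doesn't eliminate the underlying cost: the first mutation of a shared map still pays for a full deep copy.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;
use std::time::Instant;

fn main() {
    // A stand-in for an event's field map.
    let mut inner: BTreeMap<String, String> = BTreeMap::new();
    for i in 0..1_000 {
        inner.insert(format!("field_{}", i), "value".repeat(8));
    }

    // Deep clone: every key and value is re-allocated on each clone.
    let start = Instant::now();
    for _ in 0..10_000 {
        let _copy = inner.clone();
    }
    println!("deep clone: {:?}", start.elapsed());

    // Shared clone: only a reference count is bumped, so cloning is cheap.
    let shared = Arc::new(inner);
    let start = Instant::now();
    for _ in 0..10_000 {
        let _copy = Arc::clone(&shared);
    }
    println!("arc clone:  {:?}", start.elapsed());

    // But the first mutation of a shared map pays the full deep-copy cost
    // anyway (copy-on-write), which is why cheap clones alone don't remove
    // the cost once a transform actually modifies the event.
    let mut cow = Arc::clone(&shared);
    Arc::make_mut(&mut cow).insert("new_field".to_string(), "new_value".to_string());
}
```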

@blt
Contributor Author

blt commented Mar 31, 2021

Alright, an update here. The work that @FungusHumungus is doing to bring #5374 up to parity with master is ongoing and important to fixing this. Additionally, experimentation in https://github.com/blt/pipelines helps confirm that we need to be operating on batches inside of vector, not on individual Events: allocate everything in batches, move everything around in batches, process everything in batches. We're going to run out of synonyms.
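
A rough sketch of that batch-oriented direction (hypothetical types, not code from blt/pipelines or Vector): handing transforms whole batches of events amortizes per-event overhead such as channel sends and dispatch.

```rust
// Hypothetical event type for the sketch.
#[derive(Clone, Debug)]
struct Event {
    message: String,
}

// Per-event style: one call (and, in the real topology, one channel send,
// one allocation, one dynamic dispatch) per event.
fn transform_one(event: &mut Event) {
    event.message.make_ascii_uppercase();
}

// Batch style: fixed costs are paid once per batch, and the events stay
// together while they are processed.
fn transform_batch(batch: &mut [Event]) {
    for event in batch.iter_mut() {
        event.message.make_ascii_uppercase();
    }
}

fn main() {
    let mut batch: Vec<Event> = (0..4)
        .map(|i| Event { message: format!("event {}", i) })
        .collect();

    // Same result either way; the difference is how often the surrounding
    // per-item machinery runs.
    transform_batch(&mut batch);

    let mut single = batch[0].clone();
    transform_one(&mut single);

    println!("{:?}", batch);
    println!("{:?}", single);
}
```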

@JeanMertz
Contributor

A while has passed since this issue was last updated. There have been many performance-related improvements to VRL since then, including new path-lookup code and a shared Value type between Vector and VRL.

It might be worth profiling this again, or closing the issue if it is superseded by other issues we've filed since.

@jszwedko
Member

Closing since the investigation here is complete. When we pick up performance work again it'll likely start anew with new profiles.

Projects: None yet
Development: No branches or pull requests

4 participants