
Remap runtime is heavily dominated by clone/drop times #6770

Closed
blt opened this issue Mar 15, 2021 · 5 comments
Assignees: blt
Labels: domain: performance (Anything related to Vector's performance), domain: vrl (Anything related to the Vector Remap Language), type: task (Generic non-code related tasks)

Comments

@blt
Contributor

blt commented Mar 15, 2021

@lukesteensen and I have been profiling vector 6193e68 across two configs, seen here and here. Our hypothesis was that the nohop configuration would perform better than the multihop configuration, but the opposite is true. For instance:

 Performance counter stats for 'bash -c zcat baked.log.gz | /home/blt/projects/com/datadog/timberio/vector-test-harness/cases/performance/klog_remap_to_blackhole/bin/vector_gen-6193e68e5b3e50b7e5469d6a8699e25d07bcf57e -qq --config vector-multihop.toml' (10 runs):
        100,771.53 msec task-clock                #    3.771 CPUs utilized            ( +-  0.33% )
           633,521      context-switches          #    0.006 M/sec                    ( +-  3.33% )
               275      cpu-migrations            #    0.003 K/sec                    ( +-  3.85% )
             8,759      page-faults               #    0.087 K/sec                    ( +-  0.24% )
   238,463,149,986      cycles                    #    2.366 GHz                      ( +-  0.53% )  (83.34%)
     9,407,007,362      stalled-cycles-frontend   #    3.94% frontend cycles idle     ( +-  1.44% )  (83.32%)
   118,882,228,458      stalled-cycles-backend    #   49.85% backend cycles idle      ( +-  0.90% )  (83.37%)
   325,001,862,213      instructions              #    1.36  insn per cycle
                                                  #    0.37  stalled cycles per insn  ( +-  0.06% )  (83.28%)
    56,917,463,766      branches                  #  564.817 M/sec                    ( +-  0.08% )  (83.33%)
       166,816,926      branch-misses             #    0.29% of all branches          ( +-  0.47% )  (83.36%)
           26.7223 +- 0.0645 seconds time elapsed  ( +-  0.24% )

 Performance counter stats for 'bash -c zcat baked.log.gz | /home/blt/projects/com/datadog/timberio/vector-test-harness/cases/performance/klog_remap_to_blackhole/bin/vector_gen-6193e68e5b3e50b7e5469d6a8699e25d07bcf57e -qq --config vector-nohop.toml' (10 runs):
        111,934.76 msec task-clock                #    3.073 CPUs utilized            ( +-  0.45% )
         1,060,241      context-switches          #    0.009 M/sec                    ( +-  1.90% )
               370      cpu-migrations            #    0.003 K/sec                    ( +-  7.33% )
             7,044      page-faults               #    0.063 K/sec                    ( +-  0.92% )
   240,814,108,668      cycles                    #    2.151 GHz                      ( +-  0.44% )  (83.31%)
    10,586,860,513      stalled-cycles-frontend   #    4.40% frontend cycles idle     ( +-  0.99% )  (83.35%)
   107,199,595,735      stalled-cycles-backend    #   44.52% backend cycles idle      ( +-  0.81% )  (83.37%)
   330,593,682,800      instructions              #    1.37  insn per cycle
                                                  #    0.32  stalled cycles per insn  ( +-  0.15% )  (83.36%)
    59,140,674,056      branches                  #  528.350 M/sec                    ( +-  0.19% )  (83.33%)
       291,923,061      branch-misses             #    0.49% of all branches          ( +-  1.62% )  (83.27%)
            36.424 +- 0.356 seconds time elapsed  ( +-  0.98% )

The multihop configuration runs ~10 seconds faster than the nohop one, the hypothesis being that remap is a serial bottleneck in the nohop variant, whereas in multihop its costs are spread across multiple threads. This is borne out by the flamegraphs of the two configs, seen here. Notably, the diff flamegraph does not show a negative difference except for slightly elevated backtrace generation in the nohop variant.

The following screenshot shows the major contributors to runtime:
Screenshot from 2021-03-15 13-54-22

For the purposes of this issue we can see remap in two places:

Screenshot from 2021-03-15 13-57-43
Screenshot from 2021-03-15 13-59-36

Note the high degree to which drops and BTreeMap clones appear in this flamegraph. VRL does a non-trivial amount of cloning internally, and these costs do show up.
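
To make the flamegraph observation concrete, here is a minimal sketch (hypothetical types, not Vector's actual Event or Value) of why both clone and drop towers scale with event size when the event body is a tree of nested BTreeMaps:

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for an event value tree.
#[derive(Clone, Debug)]
enum Value {
    Bytes(String),
    Map(BTreeMap<String, Value>),
}

fn main() {
    // Build a small nested event, e.g. .kubernetes.labels.app plus a message.
    let mut labels = BTreeMap::new();
    labels.insert("app".to_string(), Value::Bytes("web".to_string()));
    let mut kubernetes = BTreeMap::new();
    kubernetes.insert("labels".to_string(), Value::Map(labels));
    let mut event = BTreeMap::new();
    event.insert("kubernetes".to_string(), Value::Map(kubernetes));
    event.insert("message".to_string(), Value::Bytes("hello".to_string()));

    // Each clone deep-copies the whole tree: every map node and every String
    // is re-allocated, so clone cost grows with event size.
    let copy = event.clone();

    // Each drop then walks the same tree to free it node by node, which is
    // why clones and drops show up as separate towers in the flamegraph.
    drop(copy);
}
```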

cc @timberio/vector-processing

@blt added the type: task, domain: performance, and domain: vrl labels on Mar 15, 2021
@blt self-assigned this on Mar 15, 2021
@StephenWakely
Contributor

Interesting. It looks like there are two main culprits.

The first is the clone we need to make to ensure the event isn't mutated by the condition. There is a similar clone made by the transform so that we can roll back on error.

The other culprit is the way the Target trait is defined - https://github.com/timberio/vector/blob/master/src/event/log_event.rs#L247. To fetch a value from the event, it returns the value rather than a reference, thus requiring a clone.

This is partly because the VRL Value is different from the Vector Value, so a VRL Value needs to be created and returned. Also, for metric events the actual Value doesn't really exist in the event, so a value has to be created and returned; if it were a reference, there would be nothing to own the value the reference points at.

The Target trait is essentially the interface between Vector and VRL, and I suspect most of the gains can be made by rethinking how it works: first, by only copying data when it is modified, and second, by handling references to Vector Values where possible and owning (and caching) created data where references aren't possible.
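
As a rough illustration of the trade-off described above (the traits, types, and method names here are simplified stand-ins, not the actual Target trait), an interface that returns owned values forces a clone on every lookup, whereas a copy-on-write return type can borrow existing data and only allocate for values that must be synthesized:

```rust
use std::borrow::Cow;
use std::collections::BTreeMap;

// Hypothetical stand-in for the event value type.
#[derive(Clone, Debug)]
enum Value {
    Bytes(String),
}

struct LogEvent {
    fields: BTreeMap<String, Value>,
}

// Owned-return style: the implementation has no choice but to clone on every
// lookup, even if the caller only wants to read the value.
trait OwnedGet {
    fn get(&self, field: &str) -> Option<Value>;
}

impl OwnedGet for LogEvent {
    fn get(&self, field: &str) -> Option<Value> {
        self.fields.get(field).cloned()
    }
}

// Copy-on-write style: borrow when the value already exists in the event, and
// reserve Cow::Owned for values that must be synthesized (e.g. from metrics).
trait CowGet {
    fn get(&self, field: &str) -> Option<Cow<'_, Value>>;
}

impl CowGet for LogEvent {
    fn get(&self, field: &str) -> Option<Cow<'_, Value>> {
        self.fields.get(field).map(Cow::Borrowed)
    }
}

fn main() {
    let mut fields = BTreeMap::new();
    fields.insert("message".to_string(), Value::Bytes("hello".to_string()));
    let event = LogEvent { fields };

    // The first call allocates a copy of the String; the second only borrows.
    let owned = OwnedGet::get(&event, "message");
    let borrowed = CowGet::get(&event, "message");
    println!("{:?} {:?}", owned, borrowed);
}
```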

@blt
Contributor Author

blt commented Mar 19, 2021

As noted in #6787, if we adjust the internal representation of VRL's map to be cheaper to clone -- trying various pointer types and even a simple vec of tuples -- the runtime of this configuration:

data_dir = "data"

[sources.stdin]
type = "stdin"

[transforms.remap]
type = "remap"
inputs = ["stdin"]
source = """
# json_parser transform equivalent
parsed_json, err = parse_json(.message)
if err != null {
  ., _ = merge(., parsed_json)
}
# regex_parser transform equivalent (double escape character classes because toml)
matches, err = parse_regex(.source, r'^/([\\w]+)/([\\w]+)/([\\w]+)/(?P<pod>[\\w\\-]+)_(?P<namespace>[\\w\\-]+)_(?P<container>[\\w\\-\\.]+)-(\\w+).log$')
if err != null {
  .pod = matches.pod
  .namespace = matches.namespace
  .container = matches.container
}
# add_fields transform equivalent
.origin_id, err = .namespace + "/" + .pod + "/" + .container
.agent_name = "vector"
.host = "${VECTOR_SELF_NODE_NAME}"
# coercer transform equivalent
ts, err = to_timestamp(.time)
if err != null {
  .timestamp = ts
}
# original remap logic, improved
nested_json, err = parse_json(.log)
if err != null {
  .log = nested_json
  .log_type = "json"
} else {
  .log_type = "text"
}
"""

[transforms.swimlanes]
type = "swimlanes"
inputs = ["remap"]

[transforms.swimlanes.lanes.json]
type = "remap"
source = """
  .log_type == "json"
"""

[transforms.swimlanes.lanes.text]
type = "remap"
source = """
  .log_type == "text"
"""

[transforms.reduce]
type = "reduce"
inputs = ["swimlanes.text"]
group_by = ["origin_id"]
merge_strategies.log = "concat_newline"
starts_when.type = "remap"
starts_when.source = '''match!(.log, r'^\d{4}\D\d{2}\D\d{2}(T|\s)?\d{2}:\d{2}:\d{2}') || match!(.log, r'^\[?\d{2}\S\w{3}\S\d{4}:\d{2}:\d{2}:\d{2} [+]\d{2,}\]') || match!(.log, r'^\w{3}\s\d{2}\s\d{2}:\d{2}:\d{2}.\d{3}') || match!(.log, r'^\[\w+]: \d{4}\D\d{2}\D\d{2}(T|\s)?\d{2}:\d{2}:\d{2}')'''

[sinks.blackhole]
type = "blackhole"
inputs = ["reduce", "swimlanes.text", "swimlanes.json"]

is dominated by the clone costs of BTreeMaps. Notably, this configuration, which splits the work across multiple topology nodes,

data_dir = "data"

[sources.stdin]
type = "stdin"

[transforms.json_parser]
type = "json_parser"
inputs = ["stdin"]

[transforms.regex_parser]
type = "regex_parser"
inputs = ["json_parser"]
drop_field = false
field = "source"
# Take namespace,pod,container name from file field of /var/log/containers/
patterns = ['^/([\w]+)/([\w]+)/([\w]+)/(?P<pod>[\w\-]+)_(?P<namespace>[\w\-]+)_(?P<container>[\w\-\.]+)-(\w+).log$']
types.pod = "string"
types.namespace = "string"
types.container = "string"

[transforms.add_fields]
type = "add_fields"
inputs = ["regex_parser"]
overwrite = true
fields.origin_id = "{{ namespace }}/{{ pod }}/{{ container }}"
fields.agent_name = "vector"
fields.host = "${VECTOR_SELF_NODE_NAME}"
fields.timestamp = "{{ time }}"

[transforms.coercer]
type = "coercer"
inputs = ["add_fields"]
types.timestamp = "timestamp"

[transforms.remap]
type = "remap"
inputs = ["coercer"]
source = """
if starts_with(.log, "{") {
   .log = parse_json!(.log)
   .log_type = "json"
} else {
  .log_type = "text"
}
"""

[transforms.swimlanes]
type = "swimlanes"
inputs = ["remap"]

[transforms.swimlanes.lanes.json]
type = "remap"
source = """
  .log_type == "json"
"""

[transforms.swimlanes.lanes.text]
type = "remap"
source = """
  .log_type == "text"
"""

[transforms.reduce]
type = "reduce"
inputs = ["swimlanes.text"]
group_by = ["origin_id"]
merge_strategies.log = "concat_newline"
starts_when.type = "remap"
starts_when.source = '''match!(.log, r'^\d{4}\D\d{2}\D\d{2}(T|\s)?\d{2}:\d{2}:\d{2}') || match!(.log, r'^\[?\d{2}\S\w{3}\S\d{4}:\d{2}:\d{2}:\d{2} [+]\d{2,}\]') || match!(.log, r'^\w{3}\s\d{2}\s\d{2}:\d{2}:\d{2}.\d{3}') || match!(.log, r'^\[\w+]: \d{4}\D\d{2}\D\d{2}(T|\s)?\d{2}:\d{2}:\d{2}')'''

[sinks.blackhole]
type = "blackhole"
inputs = ["reduce", "swimlanes.text", "swimlanes.json"]

pays similar costs, but does so across multiple threads in the default vector setup and is faster as a result. If we make vector single-threaded, they're more or less the same cost. I believe this strongly suggests that merely making clones cheap -- as is attempted in #6787 -- is not sufficient and that we must remove clones. Pushing forward with #5374 strikes me as the next logical step.
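
For illustration only (this is not the code from #6787), the sketch below shows why wrapping the map in an Arc makes event cloning cheap but doesn't eliminate the underlying cost: the first mutation of a shared map still pays for a full deep copy.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;
use std::time::Instant;

fn main() {
    // A stand-in for an event's field map.
    let mut inner: BTreeMap<String, String> = BTreeMap::new();
    for i in 0..1_000 {
        inner.insert(format!("field_{}", i), "value".repeat(8));
    }

    // Deep clone: every key and value is re-allocated on each clone.
    let start = Instant::now();
    for _ in 0..10_000 {
        let _copy = inner.clone();
    }
    println!("deep clone: {:?}", start.elapsed());

    // Shared clone: only a reference count is bumped, so cloning is cheap.
    let shared = Arc::new(inner);
    let start = Instant::now();
    for _ in 0..10_000 {
        let _copy = Arc::clone(&shared);
    }
    println!("arc clone:  {:?}", start.elapsed());

    // But the first mutation of a shared map pays the full deep-copy cost
    // anyway (copy-on-write), which is why cheap clones alone don't remove
    // the cost once a transform actually modifies the event.
    let mut cow = Arc::clone(&shared);
    Arc::make_mut(&mut cow).insert("new_field".to_string(), "new_value".to_string());
}
```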

@blt
Contributor Author

blt commented Mar 31, 2021

Alright, an update here. The work that @FungusHumungus is doing to bring #5374 up to parity with master is ongoing and important to fixing this. Additionally, experimentation in https://github.com/blt/pipelines helps confirm that we need to be operating on batches inside of vector, not on individual Events: allocate everything in batches, move everything around in batches, process everything in batches. We're going to run out of synonyms.
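
A rough sketch of that batch-oriented direction (hypothetical types, not code from blt/pipelines or Vector): handing transforms whole batches of events amortizes per-event overhead such as channel sends and dispatch.

```rust
// Hypothetical event type for the sketch.
#[derive(Clone, Debug)]
struct Event {
    message: String,
}

// Per-event style: one call (and, in the real topology, one channel send,
// one allocation, one dynamic dispatch) per event.
fn transform_one(event: &mut Event) {
    event.message.make_ascii_uppercase();
}

// Batch style: fixed costs are paid once per batch, and the events stay
// together while they are processed.
fn transform_batch(batch: &mut [Event]) {
    for event in batch.iter_mut() {
        event.message.make_ascii_uppercase();
    }
}

fn main() {
    let mut batch: Vec<Event> = (0..4)
        .map(|i| Event { message: format!("event {}", i) })
        .collect();

    // Same result either way; the difference is how often the surrounding
    // per-item machinery runs.
    transform_batch(&mut batch);

    let mut single = batch[0].clone();
    transform_one(&mut single);

    println!("{:?}", batch);
    println!("{:?}", single);
}
```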

@JeanMertz
Contributor

A while has passed since this issue was last updated. There have been many performance-related improvements to VRL since then, including new path-lookup code and a shared Value type between Vector and VRL.

It might be worth profiling this again, or closing the issue if it is superseded by other issues we've filed since.

@jszwedko
Member

Closing since the investigation here is complete. When we pick up performance work again it'll likely start anew with new profiles.

Projects: None yet
Development: No branches or pull requests

4 participants