[DocDB] Enable TCMalloc background sampling #17758

Closed
1 task done
SrivastavaAnubhav opened this issue Jun 13, 2023 · 4 comments
Labels: area/docdb (YugabyteDB core features), kind/enhancement (This is an enhancement of an existing feature), priority/medium (Medium priority issue)

SrivastavaAnubhav commented Jun 13, 2023

Jira Link: DB-6854

Description

Background sampling in TCMalloc samples allocations roughly once every profiler_sample_freq_bytes bytes allocated when enable_process_lifetime_heap_profiling is set to true. The samples feed the TCMalloc profilers (reported on pprof/heap and pprof/heap_snapshot), which are important for diagnosing where memory leaks come from.

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@SrivastavaAnubhav SrivastavaAnubhav added kind/enhancement This is an enhancement of an existing feature area/docdb YugabyteDB core features priority/medium Medium priority issue labels Jun 13, 2023
@SrivastavaAnubhav SrivastavaAnubhav self-assigned this Jun 13, 2023
SrivastavaAnubhav added a commit that referenced this issue Jun 23, 2023
Summary:
This diff enables background memory sampling every 100 KB by default, if Google TCMalloc is enabled. There appears to be no perf impact as tested on TPCC, sysbench read only, sysbench multi insert, and sysbench update index (see https://gist.github.com/SrivastavaAnubhav/af8bdd0128064d6a8b43f8e29c992aab).
Jira: DB-6854

Test Plan: Jenkins

Reviewers: mbautin

Reviewed By: mbautin

Subscribers: bogdan, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D26380
SrivastavaAnubhav added a commit that referenced this issue Jul 4, 2023
Summary:
This diff raises tcmalloc's background sampling interval from 100 KB to 1 MB (that is, allocations are sampled less often). This is still sufficient to identify untracked memory on the order of megabytes, such as the inbound call issue fixed in D26107.

From TCMalloc's sampler.h:
```
// With 1M average sample step:
//  the probability of sampling a 4K allocation is about 0.00390
//  the probability of sampling a 1MB allocation is about 0.632
//  the probability of sampling a 1GB allocation is about 1.0
```
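
The figures quoted above are consistent with treating sampling as a Poisson process over allocated bytes, where an allocation of `size` bytes is sampled with probability `1 - exp(-size / step)`. The following is a minimal Python sketch under that assumption; the function name `sample_probability` is illustrative and is not part of TCMalloc.

```python
import math

MIB = 1 << 20


def sample_probability(alloc_bytes: int, mean_step_bytes: int) -> float:
    """Probability that one allocation is sampled, assuming the gap between
    samples is exponentially distributed with mean `mean_step_bytes`."""
    # -expm1(-x) equals 1 - exp(-x) but keeps precision for small x.
    return -math.expm1(-alloc_bytes / mean_step_bytes)


if __name__ == "__main__":
    for label, size in [("4K", 4 << 10), ("1MB", MIB), ("1GB", 1 << 30)]:
        print(f"{label}: {sample_probability(size, MIB):.5f}")
```

Small allocations are rarely sampled individually, but a leak made of many small allocations still shows up because the allocated bytes add up across calls.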

This should also bring down the memory used by samples on very large machines (64 GB memory).
Assuming ~30 frames per stack trace * 20 bytes per stack pointer, each sample costs ~600 bytes.
 - With 100 KB sampling, a 64 GB server would have ~670000 samples, so the samples take ~400 MB.
 - With 1 MB sampling, a 64 GB server would have ~65536 samples, so the samples take ~39 MB.
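
The arithmetic behind these estimates can be reproduced with a short Python sketch. It assumes, as the estimate does, that all 64 GB is live heap sampled on average once per sampling step, and that each sample costs 30 * 20 = 600 bytes. The helper name `sample_overhead` is hypothetical.

```python
KIB = 1 << 10
MIB = 1 << 20
GIB = 1 << 30

# Assumption taken from the estimate above: ~30 frames * ~20 bytes each.
BYTES_PER_SAMPLE = 30 * 20


def sample_overhead(heap_bytes: int, step_bytes: int) -> tuple[int, int]:
    """Return (number of live samples, bytes used to store them)."""
    samples = heap_bytes // step_bytes
    return samples, samples * BYTES_PER_SAMPLE


if __name__ == "__main__":
    for label, step in [("100 KB", 100 * KIB), ("1 MB", MIB)]:
        samples, cost = sample_overhead(64 * GIB, step)
        print(f"{label} step: {samples} samples, {cost / 1e6:.1f} MB")
```
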
Jira: DB-6854

Test Plan: Scan performance was tested on the Yugabyte perf service and seemed reasonable. See: https://gist.github.com/SrivastavaAnubhav/739425564d8407bdf7664cba122df7a1.

Reviewers: rthallam, bogdan

Reviewed By: rthallam, bogdan

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D26585
dr0pdb pushed a commit to dr0pdb/yugabyte-db that referenced this issue Jul 6, 2023
dr0pdb pushed a commit to dr0pdb/yugabyte-db that referenced this issue Jul 6, 2023
@SrivastavaAnubhav

Memory tabs (gperftools tcmalloc vs google tcmalloc):
image
image

/pprof/heap_snapshot page:
image

/pprof/heap page:
image

SrivastavaAnubhav added a commit that referenced this issue Sep 19, 2023
…point for gperftools tcmalloc.

Summary:
Original commit: f142d7f / D26585
Original commit: c456bce / D27565

We want the change setting the tcmalloc sampling frequency (D26585) and the change enabling sampling (D27565) in the same diff to avoid a perf regression, which is why they are being backported together.

-----------------------------------

This diff enables /pprof/heap_snapshot for gperftools tcmalloc, and makes some quality-of-life improvements to the endpoints:
- Background sampling for gperftools tcmalloc has always been on, with a default sampling frequency of 16.7 MB (https://github.com/yugabyte/gperftools/blob/f8c3aeb952fde9c79eb69cca1a5cf7e74820c127/src/sampler.cc#L80-L86). This diff changes it to follow FLAGS_profiler_sample_freq_bytes (1 MB as of the time of writing).
- Add flag enable_process_lifetime_heap_**sampling** and change semantics of enable_process_lifetime_heap_**profiling**.
  - Before this diff, enable_process_lifetime_heap_**profiling** caused gperftools tcmalloc to sample every allocation (this is very expensive, and was only ever meant for brief debugging) and google tcmalloc to sample at the rate specified by profiler_sample_freq_bytes. It used to be off by default for gperftools tcmalloc and on by default for google tcmalloc. After this diff, it is off by default for both, and has no effect on sampling for google tcmalloc.
  - enable_process_lifetime_heap_**sampling** is a new flag that controls whether allocations should be sampled at the rate specified by profiler_sample_freq_bytes. It is on by default for both gperftools and google tcmalloc.
  - The behavior for new universes using google tcmalloc is the same as before: sampling is enabled by default
  - The behavior for new universes using gperftools tcmalloc is different: sampling is now enabled by default
  - Upgrading from gperftools to google tcmalloc will enable sampling because the profiling flag will be true by default
  - Before this diff, downgrading from google tcmalloc to gperftools tcmalloc would enable all-allocation sampling (since the _profiling flag was used for google tcmalloc sampling and gperftools tcmalloc all-allocation profiling). After this diff, downgrading will cause no change because the defaults are the same for both tcmallocs
- Use abseil's symbolizer instead of glog's. This is a couple orders of magnitude faster, and should speed up yb_prof as well.
- Add button for the /pprof/heap_snapshot endpoint for google and gperftools tcmalloc
- Add button for the /pprof/heap endpoint for google tcmalloc only (there is no human-readable output at this endpoint for gperftools, one must use yb_prof.py)
- Add order_by count / bytes option to the pprof URLs
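
The flag semantics described in the list above can be summarized in a small Python model. This is only a sketch of the behavior as stated in this summary, not the actual implementation: the function names and the `EVERY_ALLOCATION` sentinel are hypothetical, reading "16.7 MB" as 2^24 bytes is an assumption, and so is treating google tcmalloc as not sampling at all when the old profiling flag was off. What the profiling flag does for gperftools after this diff is not modeled.

```python
from typing import Optional

MIB = 1 << 20
EVERY_ALLOCATION = 0  # hypothetical sentinel: every allocation is sampled
GPERFTOOLS_OLD_INTERVAL = 1 << 24  # assumption: "16.7 MB" means 2^24 bytes


def interval_before(allocator: str, profiling: bool,
                    freq_bytes: int) -> Optional[int]:
    """Sampling interval before the diff; None means no sampling."""
    if allocator == "gperftools":
        # Background sampling was always on; the flag sampled everything.
        return EVERY_ALLOCATION if profiling else GPERFTOOLS_OLD_INTERVAL
    if allocator == "google":
        # Assumption: no sampling when the profiling flag is off.
        return freq_bytes if profiling else None
    raise ValueError(f"unknown allocator: {allocator}")


def interval_after(sampling: bool, freq_bytes: int) -> Optional[int]:
    """Background sampling interval after the diff, same for both
    allocators, driven by enable_process_lifetime_heap_sampling."""
    return freq_bytes if sampling else None


if __name__ == "__main__":
    # Defaults before: profiling off for gperftools, on for google.
    print(interval_before("gperftools", False, MIB))
    print(interval_before("google", True, MIB))
    # Default after: sampling on for both.
    print(interval_after(True, MIB))
```
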

Memory tabs:
Gperftools tcmalloc: https://user-images.githubusercontent.com/17299377/269038442-36e66a36-6c34-4076-87ff-0762c2690887.png
Google tcmalloc: https://user-images.githubusercontent.com/17299377/269038456-1f2edd82-1f61-4fcd-b404-ad3b9aa8853d.png

/pprof/heap_snapshot page:
https://user-images.githubusercontent.com/17299377/269038496-617ad08a-1c37-402e-84e2-9290c8eed00d.png

/pprof/heap page:
https://user-images.githubusercontent.com/17299377/269038514-e4680bac-ca8f-476d-8a02-e4a0733a6a25.png

Jira: DB-6854, DB-7502

Test Plan:
Reverted the fix in D26107 and verified that the expected stack showed up in /pprof/heap_snapshot.
Also ran `ybd --cxx_test --use_gperftools_tcmalloc pprof-path-handler_util-test --gtest_filter "SamplingProfilerTest.HeapSnapshot" -n 10000` locally.

See #17758 (comment) for perf results.
Perf tests on TPCC and Sysbench workloads did not show any performance changes against 2.18.3.0-b71.
I also ran Featurebench's scan workload and found no significant performance differences. Note: this workload seems to have high variance. My diff showed a regression against 2.18.3.0-b73 but not against 2.18.4.0-b1, even though only one DB commit (9cf4912) separates the two builds, and that commit only affected xcluster.

Reviewers: mbautin

Reviewed By: mbautin

Subscribers: mbautin, bogdan, ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D28051
SrivastavaAnubhav added a commit that referenced this issue Oct 4, 2023
…point for gperftools tcmalloc.

Summary:
Original commit: f142d7f / D26585
Original commit: c456bce / D27565

-----------------------------------

Note: This diff introduces `profiler_sample_freq_bytes` with a default value of 1 MiB, and thus implicitly includes D26585 (which bumped the value up from 100 KiB).

This diff also adds Abseil support, since the symbolizer is much faster.

Original summary with google tcmalloc-specific parts removed since google tcmalloc is not available in 2.16 (see the original commit of D27565 for the full original summary):

> This diff enables /pprof/heap_snapshot for gperftools tcmalloc, and makes some quality-of-life improvements to the endpoints:
> - Background sampling for gperftools tcmalloc has always been on, with a default sampling frequency of 16.7 MB (https://github.com/yugabyte/gperftools/blob/f8c3aeb952fde9c79eb69cca1a5cf7e74820c127/src/sampler.cc#L80-L86). This diff changes it to follow `profiler_sample_freq_bytes` (1 MB as of the time of writing).
> - Add flag enable_process_lifetime_heap_**sampling**.
>   - enable_process_lifetime_heap_**sampling** is a new flag that controls whether allocations should be sampled at the rate specified by `profiler_sample_freq_bytes`. It is on by default for both gperftools and google tcmalloc.
>   - The behavior for new universes using gperftools tcmalloc is different: sampling every `profiler_sample_freq_bytes` is now enabled by default
> - Use abseil's symbolizer instead of glog's. This is a couple orders of magnitude faster, and should speed up yb_prof (used for `/pprof/heap`) as well.
> - Add button for the /pprof/heap_snapshot endpoint
> - Add order_by count / bytes option to pprof/heap_snapshot
>
> Memory tabs:
> Gperftools tcmalloc: https://user-images.githubusercontent.com/17299377/269038442-36e66a36-6c34-4076-87ff-0762c2690887.png
>
> /pprof/heap_snapshot page:
> https://user-images.githubusercontent.com/17299377/269038496-617ad08a-1c37-402e-84e2-9290c8eed00d.png

Jira: DB-6854, DB-7502

Test Plan:
Reverted the fix in D26107 and verified that the expected stack showed up in /pprof/heap_snapshot.
Also ran `ybd --cxx_test --use_gperftools_tcmalloc pprof-path-handler_util-test --gtest_filter "SamplingProfilerTest.HeapSnapshot" -n 10000` locally.

Saw no significant perf impact; see #17758 (comment) for full results.

Reviewers: mbautin, esheng

Reviewed By: mbautin

Subscribers: yql, ybase, bogdan, mbautin

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D28702
SrivastavaAnubhav added a commit that referenced this issue Oct 11, 2023
…point for gperftools tcmalloc.

Summary:
Original commit: f142d7f / D26585
Original commit: c456bce / D27565

-----------------------------------

Note: This diff introduces `profiler_sample_freq_bytes` with a default value of 1 MiB, and thus implicitly includes D26585 (which bumped the value up from 100 KiB).

This diff also adds Abseil support (since the symbolizer is much faster) by updating thirdparty to commit `78ea75d798b43483779b760ba55e5f1e63494b0b`.

Original summary with google tcmalloc-specific parts removed since google tcmalloc is not available in 2.14 (see the original commit of D27565 for the full original summary):

> This diff enables /pprof/heap_snapshot for gperftools tcmalloc, and makes some quality-of-life improvements to the endpoints:
> - Background sampling for gperftools tcmalloc has always been on, with a default sampling frequency of 16.7 MB (https://github.com/yugabyte/gperftools/blob/f8c3aeb952fde9c79eb69cca1a5cf7e74820c127/src/sampler.cc#L80-L86). This diff changes it to follow `profiler_sample_freq_bytes` (1 MB as of the time of writing).
> - Add flag enable_process_lifetime_heap_**sampling**.
>   - enable_process_lifetime_heap_**sampling** is a new flag that controls whether allocations should be sampled at the rate specified by `profiler_sample_freq_bytes`. It is on by default for both gperftools and google tcmalloc.
>   - The behavior for new universes using gperftools tcmalloc is different: sampling every `profiler_sample_freq_bytes` is now enabled by default
> - Use abseil's symbolizer instead of glog's. This is a couple orders of magnitude faster, and should speed up yb_prof (used for `/pprof/heap`) as well.
> - Add button for the /pprof/heap_snapshot endpoint
> - Add order_by count / bytes option to pprof/heap_snapshot
>
> Memory tabs:
> Gperftools tcmalloc: https://user-images.githubusercontent.com/17299377/269038442-36e66a36-6c34-4076-87ff-0762c2690887.png
>
> /pprof/heap_snapshot page:
> https://user-images.githubusercontent.com/17299377/269038496-617ad08a-1c37-402e-84e2-9290c8eed00d.png

Jira: DB-6854, DB-7502

Test Plan:
Reverted the fix in D26107 and verified that the expected stack showed up in /pprof/heap_snapshot.
Also ran `ybd --cxx_test --use_gperftools_tcmalloc pprof-path-handler_util-test --gtest_filter "SamplingProfilerTest.HeapSnapshot" -n 10000` locally.

See #17758 (comment) for perf results.
Jenkins: compile only

Reviewers: mbautin, esheng

Reviewed By: mbautin

Subscribers: mbautin, bogdan, ybase, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D28783