Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use HTTP to enable jemalloc profile #4600

Merged
merged 25 commits into from Jun 7, 2019

Conversation

Projects
None yet
8 participants
@YangKeao
Copy link
Contributor

commented Apr 29, 2019

What have you changed? (mandatory)

  1. Add activate_prof and deactivate_prof for tikv_alloc to start and stop jemalloc profiling. These two functions mallctl_set the prof.active to true and false.

  2. Add entry /pprof/profile in status_server. Get the seconds in uri query part and start profiling for seconds and stop, return the dumped profile. For example: accessing /pprof/profile?seconds=30 will call activate_prof function first, and after 30 seconds a temporary file will be created to store the dumped profile. Finally the status_server return the file as response. Here are some special case:

    • Query without seconds will receive BAD REQUEST.
    • If mem-profiling and jemalloc features are not enabled (then activate_prof function will return an Error), ths response will be NOT FOUND
    • Other jemalloc error will cause INTERNAL ERROR
    • However, if activate_prof successed but deactivate_prof failed, no error will be returned. Maybe status_server needs a better way to handle error. (However, status_server is such a simple thing. I don't know whether )

What are the type of the changes? (mandatory)

New feature

How has this PR been tested? (mandatory)

I have tested it manually. Some unit tests will be added later.

After starting tikv-server with --status-addr="STATUS_SERVER", we can access http://STATUS_SERVER/pprof/profile?seconds=5 to dump profile for 5 seconds. We can use jeprof to use this result: run jeprof --web --seconds=30 tikv-server http://STATUS_SERVER will download profile from status server and generate a svg and open it in browser.

STATUS_SERVER above is the address of status_server and tikv-server is binary file path of tikv-server. Not only --web but also --svg, --pdf can be used for your convenience.

Does this PR affect documentation (docs) or release note? (mandatory)

No. I didn't find docs related to status_server. Maybe a new document for status_server is needed.

Does this PR affect tidb-ansible update? (mandatory)

No

Refer to a related PR or issue link (optional)

#4272

@sre-bot

This comment has been minimized.

Copy link
Collaborator

commented Apr 29, 2019

Hi contributor, thanks for your PR.

This patch needs to be approved by someone of admins. They should reply with "/ok-to-test" to accept this PR for running test automatically.

@YangKeao YangKeao force-pushed the YangKeao:4272-jemalloc-profile-http-interface branch 3 times, most recently from a8f337d to e56d0c1 Apr 29, 2019

@zhouqiang-cl

This comment has been minimized.

Copy link
Collaborator

commented Apr 29, 2019

/test

Show resolved Hide resolved src/server/status_server.rs Outdated
@siddontang

This comment has been minimized.

Copy link
Contributor

commented May 1, 2019

Thanks @YangKeao

Can you show us your manual test result? what is the output, how to use the result, etc.

@siddontang

This comment has been minimized.

Copy link
Contributor

commented May 2, 2019

PTAL @brson

@siddontang

This comment has been minimized.

Copy link
Contributor

commented May 5, 2019

Friendly ping @YangKeao

@YangKeao

This comment has been minimized.

Copy link
Contributor Author

commented May 6, 2019

@siddontang Sorry for replying late. Run jeprof --web --seconds=30 tikv-server http://STATUS_SERVER can generate svg from result and open it in browser. But I am not familiar with it's output so I cannot decide whether it's correct 😢

Show resolved Hide resolved components/tikv_alloc/src/error.rs Outdated
Show resolved Hide resolved components/tikv_alloc/src/jemalloc.rs Outdated
Show resolved Hide resolved components/tikv_alloc/src/jemalloc.rs Outdated
Show resolved Hide resolved components/tikv_alloc/src/jemalloc.rs Outdated
Show resolved Hide resolved components/tikv_alloc/src/jemalloc.rs Outdated

@YangKeao YangKeao force-pushed the YangKeao:4272-jemalloc-profile-http-interface branch 2 times, most recently from b5dbb25 to bd69b21 May 6, 2019

@siddontang

This comment has been minimized.

Copy link
Contributor

commented May 6, 2019

Great work, thanks @YangKeao, can you paste a screenshot of jeprof web output?

PTAL @BusyJay @brson

@BusyJay

This comment has been minimized.

Copy link
Contributor

commented May 6, 2019

What if there are multiple requests run concurrently?

@YangKeao

This comment has been minimized.

Copy link
Contributor Author

commented May 6, 2019

@siddontang
jeprof

@BusyJay Jemalloc only has a global profiler, so now it cannot handle two requests at the same time ( one of them may be cut halfway because the profiler may have been closed ). I can add an Mutex to guarantee only one request is profiling. The request coming late will wait until previous one has finished. (Add a guard to start and stop profiler automatically)

@YangKeao YangKeao force-pushed the YangKeao:4272-jemalloc-profile-http-interface branch from 498a302 to beb9027 May 7, 2019

@zhouqiang-cl

This comment has been minimized.

Copy link
Collaborator

commented May 7, 2019

/test

@BusyJay
Copy link
Contributor

left a comment

It maybe better to move the lock from tikv_alloc to tikv itself. It seems a little odd to me too that an alloc crate requires futures.

Show resolved Hide resolved components/tikv_alloc/src/profiler_guard.rs Outdated
Show resolved Hide resolved components/tikv_alloc/src/profiler_guard.rs Outdated

@YangKeao YangKeao force-pushed the YangKeao:4272-jemalloc-profile-http-interface branch from 51ce36b to f80f150 May 8, 2019

@YangKeao

This comment has been minimized.

Copy link
Contributor Author

commented May 8, 2019

@BusyJay Thanks for your suggestion 😄 . profiler_guard was moved to server/status_server as a mod now.

@brson
Copy link
Contributor

left a comment

Great patch.

I did leave a number of requests inline, and I have a few more general comments.

Please remove the existing code in signal_handler.rs that calls dump_prof on SIGUSR2. It will no longer be used.

This code uses the existing dump_prof method, which is not so carefully written as the new code, and needs to be reworked a bit. In particular:

  • it should no longer itself set PROF_ACTIVE, since that is a separate step
  • it should handle errors, and those errors should be propagated to the http response
  • it should no longer take an Option since the only caller passes it Some and there's quite a bit of unsafe complexity imposed by that Option argument.

I know that's not your code, but the changes should be made so please do so.

The second resolution of the API seems like it might be too course. A second is a long time to a CPU. Milliseconds might be more appropriate. Not sure. Opinions @BusyJay?

More speculatively, HTTP can time out, and it's not obvious what the interaction here is between the API and HTTP timeouts. Fortunately all the resources here are closed on drop, so e.g. the profile won't be left turned on. The API as is may preclude profiling for an extended time because the request will time out.

The mutex around the profile request is not great, but for now I think it's fine, since this is a debugging tool. A more robust solution would simply extend the profiling duration while honoring the timeouts of each

Show resolved Hide resolved components/tikv_alloc/src/error.rs Outdated
Show resolved Hide resolved components/tikv_alloc/src/error.rs
Show resolved Hide resolved components/tikv_alloc/src/jemalloc.rs
Show resolved Hide resolved src/server/status_server.rs
Show resolved Hide resolved src/server/status_server.rs Outdated
Show resolved Hide resolved src/server/status_server.rs Outdated
Show resolved Hide resolved src/server/status_server.rs Outdated
Show resolved Hide resolved src/server/status_server.rs Outdated
Show resolved Hide resolved src/server/status_server.rs Outdated
@brson

This comment has been minimized.

Copy link
Contributor

commented May 10, 2019

Does this patch still require someone to set MALLOC_CONF=prof:true to work? I was under the impression that enabling profiling was a 3 step process: build with --enable-profile, set opt.prof, set opt.prof.active.

@siddontang

This comment has been minimized.

Copy link
Contributor

commented May 13, 2019

@brson

Does this patch still require someone to set MALLOC_CONF=prof:true to work?

I think we don't need this if we build with profiling feature now, because the HTTP can enable/disable automatically.

@YangKeao

This comment has been minimized.

Copy link
Contributor Author

commented May 13, 2019

@siddontang No. Setting prof to be true is also required now. Because HTTP request only turns on/off prof_active without caring about prof.

Should HTTP request also auto turn on and off prof?

@YangKeao

This comment has been minimized.

Copy link
Contributor Author

commented May 13, 2019

@brson Second was used as parameters because jeprof uses seconds.

For example jeprof --web --seconds=30 target/debug/tikv-server http://127.0.0.1:8079 will send a request to http://127.0.0.1:8079/pprof/profile?seconds=30

YangKeao added some commits May 10, 2019

tikv_alloc/error: impl Display for ProfError
For impl `std::error::Error` for ProfError, impl `Display` for ProfError

Signed-off-by: Yang Keao <keao.yang@yahoo.com>
server/status_server: Handle io::Error when creating TempDir
TempDir::new() may return io::Error. Handle this error by returning out.

Signed-off-by: Yang Keao <keao.yang@yahoo.com>
server/status_server, tikv_alloc/error: Handle error for into_string()
OsString.into_string() may cause error when directory path contains
non-unicode letter.

Signed-off-by: Yang Keao <keao.yang@yahoo.com>
server/status_server: Parse query in a better way.
Parse query part in URI in a better way. The previous method is too
hacky :(

Signed-off-by: Yang Keao <keao.yang@yahoo.com>
server/status_server: Add several HTTP headers for response
Add `X-Content-Type-Options`, `Content-Disposition`, `Content-Type` and
`Content-Length` to keep the same with net/http/pprof in golang.

Signed-off-by: Yang Keao <keao.yang@yahoo.com>
server/status_server: Use a better way to handle Error
Return Error in one closure without use `then` several times.

Signed-off-by: Yang Keao <keao.yang@yahoo.com>
server/status_server: Return error as http response
Return the description of error as HTTP response

Signed-off-by: Yang Keao <keao.yang@yahoo.com>
tikv_alloc/*, util/signal_handler: Remove Option in `dump_prof`
Remove existing code of handling `SIGUSR2`. It's no longer needed with
support of dumping profile with HTTP request.

Remove Option in `dump_prof` arguments because all usage of it is pass a
`Some`.

Signed-off-by: Yang Keao <keao.yang@yahoo.com>
tikv_alloc/*, server/status_server: handle error in dump_prof
Handle error in dump_prof function by returning it.

Signed-off-by: Yang Keao <keao.yang@yahoo.com>

@YangKeao YangKeao force-pushed the YangKeao:4272-jemalloc-profile-http-interface branch from 7fa4b61 to 5b60d0f May 13, 2019

@siddontang

This comment has been minimized.

Copy link
Contributor

commented May 14, 2019

@YangKeao

Setting prof to be true is also required now. Because HTTP request only turns on/off prof_active without caring about prof.

seem it is not convenient to pass this env at runtime. can we set MALLOC_CONF=prof:true directly when we start TiKV with profiling feature?

But we should ensure setting prof:true can't affect performance at first.

@YangKeao

This comment has been minimized.

Copy link
Contributor Author

commented Jun 1, 2019

@siddontang Is there anything more I can do for this PR? I have no idea whether prof:true will affect its performance ( I think benchmark is needed to decide but I don't have enough computing resource to run benchmark for it )

@brson

This comment has been minimized.

Copy link
Contributor

commented Jun 3, 2019

seem it is not convenient to pass this env at runtime. can we set MALLOC_CONF=prof:true directly when we start TiKV with profiling feature?

@siddontang

Yes it is technically possible. See the "tuning" section here. I don't know how complex it is to feed that configuration down through the various intermediary crates all the way to the jemalloc build system. I don't know if it will add additional overhead beyond what --enable-profiling does, but I suspect if it does it's mostly memory overhead, not instruction overhead, since it's a run-time option.

It looks to me like the most viable way to set "opt.prof" automatically is to create a statically initialized extern c-string, malloc_conf, containing its value. That could be done within tikv_alloc via cfgs. Though jemalloc-sys may provide some way to set --with-malloc-conf conditionally, I can't think offhand how it would do it, and haven't looked it up.

I don't think we necessarily have to solve that problem in this PR though.

@siddontang Is there anything more I can do for this PR? I have no idea whether prof:true will affect its performance ( I think benchmark is needed to decide but I don't have enough computing resource to run benchmark for it )

@YangKeao sorry for the long delay's here. I've given it a quick re-review and I think it is ok. Even if not, it's way off any critical path so any lingering bugs aren't a major problem.

I'm ok with merging, as long as we file a bug for the follow up to figure out how to set opt.prof automatically.

Let's see what @siddontang says though.

@brson
Copy link
Contributor

left a comment

LGTM

@brson brson dismissed their stale review via 2ca9bb2 Jun 3, 2019

@brson
Copy link
Contributor

left a comment

LGTM

@brson brson requested a review from siddontang Jun 3, 2019

@brson

This comment has been minimized.

Copy link
Contributor

commented Jun 3, 2019

I went ahead and fixed the merge conflict.

@BusyJay
Copy link
Contributor

left a comment

rest LGTM

}
},
None => {
let response = Response::builder()

This comment has been minimized.

Copy link
@BusyJay

BusyJay Jun 4, 2019

Contributor

Can it be set to a default value like 60 seconds instead? What's the behavior of TiDB?

This comment has been minimized.

Copy link
@YangKeao

YangKeao Jun 5, 2019

Author Contributor
sec, err := strconv.ParseInt(r.FormValue("seconds"), 10, 64)
if sec <= 0 || err != nil {
    sec = 10
} 

The default value in TiDB is 10 seconds.

This comment has been minimized.

Copy link
@YangKeao

YangKeao Jun 5, 2019

Author Contributor

Now, if seconds query parameter is not passed, the value will be set to 10 seconds. If it's not an integer, an error will be returned.

server/status_server: Set default value of `seconds`
Set default value of `seconds` parameter according to TiDB
status_server. (10 seconds)

Signed-off-by: Yang Keao <keao.yang@yahoo.com>

@YangKeao YangKeao force-pushed the YangKeao:4272-jemalloc-profile-http-interface branch from b1207f4 to 629a118 Jun 5, 2019

@BusyJay

BusyJay approved these changes Jun 6, 2019

@siddontang

This comment has been minimized.

Copy link
Contributor

commented Jun 7, 2019

hello @YangKeao

do you mind adding a comment that we must export MALLOC_CONF=prof:true when using this feature in the following PR?

I will merge this, thank you very much.

@siddontang siddontang merged commit b743013 into tikv:master Jun 7, 2019

2 checks passed

DCO All commits are signed off!
Details
idc-jenkins-ci/test Jenkins job succeeded.
Details

sticnarf added a commit to sticnarf/tikv that referenced this pull request Jun 10, 2019

Remove all Box of RPN functions
Signed-off-by: Yilin Chen <sticnarf@gmail.com>

Makefile: make sure gdb is installed before sse4.2 check (tikv#4832)

Signed-off-by: Kaige Ye <ye@kaige.org>

Upgrade sys-info (tikv#4760)

* *: upgrade sys-info crate

This fixes a problem with the next toolchain upgrade
where rust fails to link the native components of the crate.

Signed-off-by: Brian Anderson <andersrb@gmail.com>

* *: bump sys-info to 0.5.7

Signed-off-by: Brian Anderson <andersrb@gmail.com>

Batch Top N Executor (tikv#4825)

Signed-off-by: Breezewish <breezewish@pingcap.com>

Add help message in doc:go-client-api.md (tikv#4763)

* add help message in doc:go-client-api.md

Signed-off-by: yy <cacheyy@qq.com>

* update go-client-api.md

Signed-off-by: yy <cacheyy@qq.com>

Modify Makefile to distinguish between developer and packaging use cases (tikv#4687)

* make: Add new "dist_release" rules

To make the optimized build faster the existing "release" rules are going to
changed such that they are not identical to the actual releases. Primarily they
will not have debuginfo by default and will use thinLTO instead of LTO.

This adds new "dist_release", etc rules for the CI/CD system to use.

For now they are identical to the existing rules. After CI is updated
the "release" rules will be changed.

Signed-off-by: Brian Anderson <andersrb@gmail.com>

* make: Document release rules

Signed-off-by: Brian Anderson <andersrb@gmail.com>

* Makefile: indicate use of fail_release

Signed-off-by: Brian Anderson <andersrb@gmail.com>

* Clarify the distinction in instruction set for release targets

Signed-off-by: Brian Anderson <andersrb@gmail.com>

Makefile: fix gdb check (tikv#4840)

Signed-off-by: Kaige Ye <ye@kaige.org>

pessimistic-txn: solve non-pessimistic-lock conflict (tikv#4801)

* txn: replace is_pessimistic_lock to for_update_ts in Lock

Signed-off-by: youjiali1995 <zlwgx1023@gmail.com>

* pessimistic-txn: overwrite optimistic lock in pessimistic_prewrite if
request's for_update_ts is greater than lock's for_update_ts

Signed-off-by: youjiali1995 <zlwgx1023@gmail.com>

* modify comment

Signed-off-by: youjiali1995 <zlwgx1023@gmail.com>

* address comment

Signed-off-by: youjiali1995 <zlwgx1023@gmail.com>

* address comment

Signed-off-by: youjiali1995 <zlwgx1023@gmail.com>

* address comment

Signed-off-by: youjiali1995 <zlwgx1023@gmail.com>

* address comment

Signed-off-by: youjiali1995 <zlwgx1023@gmail.com>

* add comment

Signed-off-by: youjiali1995 <zlwgx1023@gmail.com>

* return Error let TiDB to resolve lock

Signed-off-by: youjiali1995 <zlwgx1023@gmail.com>

* address comment

Signed-off-by: youjiali1995 <zlwgx1023@gmail.com>

* address comment

Signed-off-by: youjiali1995 <zlwgx1023@gmail.com>

coprocessor: add batch aggregate function BitAnd/BitOr/BitXor (tikv#4824)

Batch Top N Layered Benchmarks (tikv#4827)

* Add Top N benchmarks

Signed-off-by: Breezewish <breezewish@pingcap.com>

* Address some comments in previous PRs

Signed-off-by: Breezewish <breezewish@pingcap.com>

coprocessor: add batch aggregate function Max/Min (tikv#4837)

Implement RpnFunction MultiplyDecimal (tikv#4849)

Signed-off-by: Breezewish <breezewish@pingcap.com>

Add missing fsync calls in the snapshot module (tikv#4850)

Signed-off-by: Ben Pig Chu <benpichu@gmail.com>

use HTTP to enable jemalloc profile (tikv#4600)

use HTTP to enable jemalloc profile

Signed-off-by: Yang Keao <keao.yang@yahoo.com>

coprocessor: use servo_arc in BatchTopNExecutor (tikv#4854)

Signed-off-by: Yilin Chen <sticnarf@gmail.com>

Fix clippy warnings

Signed-off-by: Yilin Chen <sticnarf@gmail.com>

Fix test

Signed-off-by: Yilin Chen <sticnarf@gmail.com>

Add docs to function.rs

Signed-off-by: Yilin Chen <sticnarf@gmail.com>

Add example output of the macro in the test of the macro.

Signed-off-by: Yilin Chen <sticnarf@gmail.com>

fix broken url for configuration options (tikv#4856)

Signed-off-by: Yukang <moorekang@gmail.com>

shrink the latch waiting list (tikv#4844)

Signed-off-by: zhangjinpeng1987 <zhangjinpeng@pingcap.com>

Fix clippy

Signed-off-by: Yilin Chen <sticnarf@gmail.com>

scheduler use spin::Mutex (tikv#4829)

* scheduler use spinlock

Signed-off-by: zhangjinpeng1987 <zhangjinpeng@pingcap.com>

Better panic info

Signed-off-by: Yilin Chen <sticnarf@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.