Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report more information for memory profiling #61282

Closed
wants to merge 13 commits into from

Conversation

cloudhan
Copy link
Contributor

@cloudhan cloudhan commented Jul 6, 2021

Report pointed memory size, total allocated memory, total reserved size all in one report.

ptr and alloc_size will be used for associating with op trace.
allocated_size, reserved_size will be used for memory trace.

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Jul 6, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit d6130f3 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-scanned failure(s)

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

@cloudhan
Copy link
Contributor Author

cloudhan commented Jul 6, 2021

Let's see what happens if we change the API function prototype.

@cloudhan
Copy link
Contributor Author

cloudhan commented Jul 7, 2021

It seems to be OK to change the API.

Copy link
Contributor

@gaoteng-git gaoteng-git left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your comment at head of this PR should be changed. Suggest to explain why changing relative size to absolute size.
And the downstream profiler's processing on memory events will need to update when alloc_size is changed from relative size to absolute size.

torch/csrc/autograd/profiler_kineto.cpp Show resolved Hide resolved
@iramazanli iramazanli added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jul 8, 2021
@gdankel
Copy link
Contributor

gdankel commented Jul 8, 2021

Seems reasonable to me. cc @ilia-cher @ngimel

@@ -323,7 +324,7 @@ void ProfiledCPUMemoryReporter::Delete(void* ptr) {
}
if (profile_memory) {
reportMemoryUsageToProfiler(
ptr, -nbytes, c10::Device(c10::DeviceType::CPU));
ptr, nbytes, allocated, 0, c10::Device(c10::DeviceType::CPU));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did nbytes sign change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops, a mistake when bringing it back. fixed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, a mistake like this that's not caught by the tests is making me very nervous. Can you please add a test that would catch this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests are added for CPU and CUDA. Some minor bug are found during testing.

I changed the cuda DeviceCachingAllocator to report Block wrapped ptr. It will be unwrapped by THCCachingAllocator and be wrapped it again as DataPtr. This will unify reported ptr semantic for CPU and CUDA allocator.

A bug is found due to free_block will change the passed in Block.

See the commit for details.

c10/core/Allocator.h Outdated Show resolved Hide resolved
@cloudhan cloudhan changed the title [DRAFT] memory profile change Report more information for memory profiling Jul 9, 2021
@albanD albanD removed their request for review July 9, 2021 16:02
@cloudhan
Copy link
Contributor Author

@ngimel I added cpu and cuda tests for reportMemoryUsage API. Hopefully it will catch API misuse in the future.

There is a lint error ask me to use angle bracket for include, I tried and it failed the compiling. How should I deal with it?

@soulitzer soulitzer removed their request for review July 15, 2021 21:13
@cloudhan
Copy link
Contributor Author

@gdankel Ping....

virtual void reportMemoryUsage(
void* ptr,
int64_t alloc_size,
int64_t allocated_size,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming is confusing. How about total_allocated and total_reserved?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New names will be much better. Will change.

Copy link
Contributor

@gdankel gdankel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@facebook-github-bot
Copy link
Contributor

@gdankel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@cloudhan cloudhan requested a review from gdankel July 28, 2021 16:54
@facebook-github-bot
Copy link
Contributor

@chaekit merged this pull request in 8bbcef5.

@cloudhan cloudhan deleted the memory_profile branch September 26, 2021 05:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla signed Merged open source triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants