init tls grad_mode/local_dispatch_key set while fork new thread in #113246
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/113246
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit f7f34af with merge base 3ff4572.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
TorchDynamo guards on grad_mode and the local dispatch key set:

https://github.com/pytorch/pytorch/blob/3a429423fcf72430e7a36c79e263c877d7a4ef72/torch/csrc/dynamo/guards.cpp#L13-L16

When using ThroughputBenchmark, that TLS state in the worker threads is not initialized to match the main thread's state:

https://github.com/pytorch/pytorch/blob/3a429423fcf72430e7a36c79e263c877d7a4ef72/torch/csrc/utils/throughput_benchmark-inl.h#L64-L94

Running the following script

```
import torch

linear = torch.nn.Linear(128, 128)
compiled = torch.compile(linear)
x = torch.rand(10, 128)

with torch.no_grad(), torch.cpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
    compiled(x)
    compiled(x)

from torch._dynamo import config
config.error_on_recompile = True

from torch.utils import ThroughputBenchmark

with torch.no_grad(), torch.cpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
    bench = ThroughputBenchmark(compiled)
    bench.add_input(x)
    stats = bench.benchmark(
        num_calling_threads=10,
        num_warmup_iters=100,
        num_iters=100,
    )
print(stats)
```

leads to two recompile reasons:

```
triggered by the following guard failure(s): ___check_global_state()
triggered by the following guard failure(s): tensor 'x' dispatch key set mismatch.
```

This triggers a recompile in TorchDynamo. But since `ThroughputBenchmark` exists to share the model's weights across its calling threads, the model should not change while the benchmark is running, so these recompiles are unnecessary. This PR initializes each worker thread's TLS state to match the main thread, so `ThroughputBenchmark` can be used to run TorchDynamo-optimized models.
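The TLS mismatch behind the `___check_global_state()` guard failure can be reproduced with plain Python threads: a freshly spawned thread starts with PyTorch's default thread-local state (grad mode enabled), regardless of the spawning thread's `no_grad()` context. The sketch below is only an illustration of the problem and of the general "re-apply the main thread's state in the worker" pattern; the PR itself performs the equivalent initialization on the C++ side inside ThroughputBenchmark's worker threads.

```python
import threading
import torch

state = {}

def worker(main_grad_enabled):
    # A new thread does NOT inherit the parent's grad mode: it starts
    # with the default TLS, where grad is enabled.
    state["inherited"] = torch.is_grad_enabled()
    # Pattern of the fix: explicitly re-apply the main thread's grad
    # mode before running the model, so Dynamo's guards see matching state.
    with torch.set_grad_enabled(main_grad_enabled):
        state["restored"] = torch.is_grad_enabled()

with torch.no_grad():
    t = threading.Thread(target=worker, args=(torch.is_grad_enabled(),))
    t.start()
    t.join()

print(state)  # {'inherited': True, 'restored': False}
```

Because grad mode (like the local dispatch key set) is thread-local, every state that Dynamo guards on must be propagated this way when work is handed off to benchmark threads.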
@pytorchbot merge
Merge failed. Reason: Approval needed from one of the following:
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased.
Hi @iseeyuan, @chenyang78, @atalman, could you help review this PR? The mergebot shows you are the owners.
Hi @desertfire, could you help review this PR?
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased.
@pytorchbot merge
Merge failed. Reason: This PR needs a label. If one applies, please add it; to add a label, you can comment to pytorchbot. For more information, see the details for the Dev Infra team raised by the workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.