
Conversation

@xmfan (Member) commented Sep 1, 2023

Adds simple_gpt + DTensor (implemented in https://github.com/pytorch-labs/simple_gpt/pull/7) to torchbench.

Tested via `python benchmarks/dynamo/torchbench.py -d cuda --output-directory=benchmark_logs --output=performance.csv --inference --performance --timing --print-memory --multiprocess --nothing --only simple_gpt`. Note: `--nothing` is used here to disable compile, since DTensor + compile isn't yet supported on main.

```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,simple_gpt,1,0.966153,196.819773,-0.059319,1.000000,4.576880,4.576880,0,0,0,0
cuda,simple_gpt,1,0.967389,196.608152,-0.058833,1.000000,4.577404,4.577404,0,0,0,0
cuda,simple_gpt,1,0.973152,196.093583,-0.059316,1.000000,4.593133,4.593133,0,0,0,0
cuda,simple_gpt,1,0.973087,196.124046,-0.075580,1.000000,4.611483,4.611483,0,0,0,0
cuda,simple_gpt,1,0.967908,193.998484,-0.040192,1.000000,4.593133,4.593133,0,0,0,0
cuda,simple_gpt,1,0.968949,193.798088,-0.028878,1.000000,4.593133,4.593133,0,0,0,0
```

Two changes were required to the model; one of them, decorating torch.no_grad() on the caches, is quoted in the discussion below.

@xmfan xmfan requested review from Chillee and H-Huang September 5, 2023 20:30
@xmfan xmfan marked this pull request as ready for review September 5, 2023 20:31
@H-Huang (Member) left a comment

Looks good

An inline review thread follows on this code:

```python
)

fabric = L.Fabric(devices=[self._rank], precision="bf16-true")
with fabric.init_module(empty_init=True):
```
Member

Just curious, is the only use of Lightning this init_module? I'm not sure what this does, but can we remove it by initializing with the meta device? Then the implementation doesn't rely on third-party libraries and can stay as close to native PyTorch as possible.
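A minimal sketch of the meta-device initialization suggested here, with a toy module standing in for the simple_gpt Transformer (the class, sizes, and state dict below are illustrative placeholders, not the actual model code):

```python
import torch
import torch.nn as nn

# Toy stand-in for the real Transformer; purely for illustration.
class TinyModel(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        return self.proj(x)

# Constructing the module under the meta device allocates no real storage,
# so no third-party init_module helper is needed.
with torch.device("meta"):
    model = TinyModel()

# Later, materialize the weights (e.g. from a checkpoint); assign=True swaps
# the meta tensors out for the provided ones. A random dict stands in for a
# real checkpoint here.
state_dict = {"proj.weight": torch.randn(16, 16, dtype=torch.bfloat16)}
model.load_state_dict(state_dict, assign=True)
```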

Member Author

Yeah good point, let me update this with the latest init code that removed the lightning dep

Contributor

I've actually updated the initial code to avoid the lightning dependency (and it's actually much faster too!)

https://github.com/pytorch-labs/simple_gpt/blob/main/generate.py#L162

Member Author

For nightly runs, do we usually load the actual weights? The load time should be quick, but the weights file is 10+ GB just for LLaMA-7B. Otherwise, I could also update the default weight initialization to use random values instead of just torch.zeros.

Member

Use the real weights, otherwise you might fail accuracy checks in unpredictable ways.

@xmfan (Member Author) commented Sep 6, 2023

Is this still needed if we only run this model through the dynamo runner (the model requires distributed, multiple GPUs)? The unit tests from this repo are skipped, i.e. `python test.py -k 'test_simple_gpt_'`, which includes the accuracy checks.

@xuzhao9 (Contributor) commented Sep 5, 2023

> decorate torch.no_grad() on the caches; previously this was done outside the model, where the entire eval call was wrapped in a torch.no_grad() context. After using torchbench, I notice that even in inference-only mode, we don't disable gradient calculations

This is because some models in torchbench (e.g. maml) do need gradient calculations in inference mode, so we leave the choice of whether to enable the gradient context to the eval test code.

There is an open issue about testing that no_grad is enabled in the eval test: #1838
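For reference, a rough sketch of the change being described: decorating the cache setup inside the model with torch.no_grad() rather than relying on the harness to wrap the whole eval call. The module and method names below are made up for illustration, not the simple_gpt API:

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.register_buffer("kv_cache", torch.zeros(1, 1, dim), persistent=False)

    # Gradients are never needed for cache bookkeeping, so disable them here
    # instead of depending on the caller to wrap eval in torch.no_grad().
    @torch.no_grad()
    def setup_caches(self, batch_size: int, seq_len: int):
        self.kv_cache = torch.zeros(batch_size, seq_len, self.proj.in_features)

    def forward(self, x):
        return self.proj(x)
```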

@xmfan (Member Author) commented Sep 6, 2023

@pytorchbot merge

@facebook-github-bot (Contributor)

@xmfan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@msaroufim msaroufim self-requested a review September 7, 2023 19:04
@facebook-github-bot (Contributor)

@xmfan merged this pull request in 0a64c55.

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Sep 14, 2023
…mo runner (#108438)

Adding support to pass rank and world_size to the torchbench model via its extra_args parameter: https://github.com/pytorch/benchmark/blob/main/torchbenchmark/util/model.py#L83C80-L83C90

This is used for models that are distributed over multiple GPUs, e.g. simple_gpt (pytorch/benchmark#1867).

Also adds an option to skip multiprocess-only GPU models.

Testing via `python benchmarks/dynamo/torchbench.py -d cuda --output=benchmark_logs/performance.csv --inference --performance --timing --print-memory --multiprocess --only simple_gpt`

Pull Request resolved: #108438
Approved by: https://github.com/Chillee
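A rough sketch of how a model might read the rank and world size out of the extra_args list; the flag names and defaults below are assumptions for illustration, not the exact options added in #108438:

```python
import argparse

def parse_distributed_args(extra_args):
    # extra_args is assumed to be a list of CLI-style strings forwarded to
    # the model by the benchmark harness; the flag names are hypothetical.
    parser = argparse.ArgumentParser()
    parser.add_argument("--rank", type=int, default=0)
    parser.add_argument("--world_size", type=int, default=1)
    args, _unused = parser.parse_known_args(extra_args)
    return args.rank, args.world_size

rank, world_size = parse_distributed_args(["--rank", "1", "--world_size", "8"])
print(rank, world_size)  # -> 1 8
```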
cclauss pushed a commit to cclauss/benchmark that referenced this pull request Jan 22, 2025
