Skip to content

Conversation

wconstab
Copy link
Contributor

@wconstab wconstab commented Mar 8, 2023

Stack from ghstack (oldest at bottom):

  • add graph-breaks baselines
  • add check_graph_breaks script (message users on regress or improvement)
  • hook up test.sh for existing accuracy job

Refactor graph-break CI check

Take steps toward merging checker with existing check flow,
consider merging it all the way inside the bench runner.

csvs

cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire

- add graph-breaks baselines
- add check_graph_breaks script (message users on regress or improvement)
- hook up test.sh for existing accuracy job

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Mar 8, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/96346

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit dc4e562:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@wconstab wconstab requested review from desertfire, ezyang and malfet and removed request for ezyang and malfet March 8, 2023 22:00
@wconstab wconstab added ciflow/trunk Trigger trunk jobs on your pull request topic: not user facing topic category labels Mar 8, 2023

python benchmarks/dynamo/check_breaks.py --actual \
"$TEST_REPORTS_DIR/inference_$suite$shard_id.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/inference_$suite$shard_id.csv"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just put this in test_single_dynamo_benchmark? Also, a downside to having it be a separate script is if there is an unrelated failure, you don't get the graph break stats either. It is nice to report them all (I tell myself this is why we have a post-check script in the first place, so we can easily report all failed models and not just stop on the first one.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just put this in test_single_dynamo_benchmark

I can try that. I was trying to only run the checker in the non-perf jobs but it looks like i can still do that from in there.

a downside to having it be a separate script is if there is an unrelated failure, you don't get the graph break stats either

can you say more here? iiuc the job will still run all the models first (and that prints graph break stats model by model as it goes), then the csv is saved (so you can always access it) at that point, and finally the check runs, which is the last thing .... OH do you mean, run the 2 checks (train/infr) after running the sweeps for train/infr?

i'm also confused about if there is an unrelated failure -- if the bench runner itself bails out, i'm not making things worse am i?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So our batch scripts run with -e, which means after the first one fails the second one never gets run. So the antipattern here is, eg rexnet flaky fails accuracy, we stop running, and you don't realize there are ALSO graph break mismatches

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alright, the thing i was missing is the bench runner keeps running and reports fails at the end, so i could integrate the graph break checks with the runner too. makes sense, ill see if its hard

@@ -0,0 +1,43 @@
dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not obvious to me that we actually want all these columns here? calls_captured in particular feels likely to wobble, in ways that we ought not to care about?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i thought about this, and it seems easier to check in the exact csv that is easy to produce using our current flows.

and to your point about wobble, we just don't check the other fields- we check "graph_breaks"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should generate two csvs: one with detailed, and another that is ok to check in. It's suboptimal to put in extra stuff even if it is ignored because it means that you'll get lots of unrelated wobbling whenever you update the csvs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the cleanest would be

  1. the msg also tells you "run this command: pytorch/benchmarks/dynamo/update_expected.py --pr {PRNUM}"
  2. there is some clever mapping so the script can download the right .zips, and extract the csvs (2 csv per zip currently)
  3. it can also spit out newly stripped CSVs that only contain the cared-about columns
  4. and 'git add' the changes so you don't have to

But i have to figure out (2), not sure if it is trivial or we have to do something to make it possible. CC @ZainRizvi any help here? (best practices for (a) how to interact with artifact files from a script (b) any stable 'ENV' inside test jobs that i can use to build a URL to download artifacts etc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For (2), you need an action after the test runs finish (see #95675 as an example). I am ok with doing this as a followup PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re (2) I'd probably do something dumb like feed the script a HUD url, and then just parse out the hyperlinks to find the one I need lol

msg += textwrap.dedent(
f"""
If this change is expected, you can update `{args.expected}` to reflect the new baseline.
This can either be done manually, or by downloading artifacts from your PR CI job."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the name of the artifact in HUD would be useful here

@@ -0,0 +1,43 @@
dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,AlbertForMaskedLM,1,pass,439,1,0,0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something I did not realize until seeing this, is that we get a csv per shard. This means that I have to click through and download eight times to get all of them. Maybe we should have a script that downloads all the csvs from a job...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, i am wanting to build more tooling for this too. i have some other ideas...

@ezyang
Copy link
Contributor

ezyang commented Mar 8, 2023

Thanks for doing this!

- add graph-breaks baselines
- add check_graph_breaks script (message users on regress or improvement)
- hook up test.sh for existing accuracy job

cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx desertfire

[ghstack-poisoned]
- add graph-breaks baselines
- add check_graph_breaks script (message users on regress or improvement)
- hook up test.sh for existing accuracy job

cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx desertfire

[ghstack-poisoned]
wconstab added a commit that referenced this pull request Mar 9, 2023
- add graph-breaks baselines
- add check_graph_breaks script (message users on regress or improvement)
- hook up test.sh for existing accuracy job

ghstack-source-id: 8f5f9e9
Pull Request resolved: #96346
- add graph-breaks baselines
- add check_graph_breaks script (message users on regress or improvement)
- hook up test.sh for existing accuracy job

cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx desertfire

[ghstack-poisoned]
- add graph-breaks baselines
- add check_graph_breaks script (message users on regress or improvement)
- hook up test.sh for existing accuracy job

cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx desertfire

[ghstack-poisoned]
wconstab added a commit that referenced this pull request Mar 9, 2023
- add graph-breaks baselines
- add check_graph_breaks script (message users on regress or improvement)
- hook up test.sh for existing accuracy job

ghstack-source-id: 2972319
Pull Request resolved: #96346

Refactor graph-break CI check

Take steps toward merging checker with existing check flow,
consider merging it all the way inside the bench runner.
- add graph-breaks baselines
- add check_graph_breaks script (message users on regress or improvement)
- hook up test.sh for existing accuracy job

cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx desertfire

[ghstack-poisoned]
wconstab added a commit that referenced this pull request Mar 9, 2023
- add graph-breaks baselines
- add check_graph_breaks script (message users on regress or improvement)
- hook up test.sh for existing accuracy job

ghstack-source-id: a334f1e
Pull Request resolved: #96346

Refactor graph-break CI check

Take steps toward merging checker with existing check flow,
consider merging it all the way inside the bench runner.
@ezyang
Copy link
Contributor

ezyang commented Mar 10, 2023

if you are out of patience and just want to land something, holler and I'll probably approve you to move things along

@wconstab
Copy link
Contributor Author

if you are out of patience and just want to land something, holler and I'll probably approve you to move things along

i actually want to land it together with the WIP (next PR, which adds a script for downloading all the artifacts for the job) next week. Anyway i'm on PTO tmrw so i won't rush this in. It's probably landable though, if you're wanting it urgently, i can update the csvs monday and push it.

- add graph-breaks baselines
- add check_graph_breaks script (message users on regress or improvement)
- hook up test.sh for existing accuracy job

cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx desertfire

[ghstack-poisoned]
- add graph-breaks baselines
- add check_graph_breaks script (message users on regress or improvement)
- hook up test.sh for existing accuracy job

Refactor graph-break CI check

Take steps toward merging checker with existing check flow,
consider merging it all the way inside the bench runner.

csvs

[ghstack-poisoned]
@wconstab
Copy link
Contributor Author

ok @ezyang if you feel inclined (and this passes CI) you can stamp and mergebot, otherwise i can take care of it next week. I have gotten the updater tool (next PR) in an almost landable state so i am not too worried about the inconvenience of updating the CSVs in the interim.

- add graph-breaks baselines
- add check_graph_breaks script (message users on regress or improvement)
- hook up test.sh for existing accuracy job

Refactor graph-break CI check

Take steps toward merging checker with existing check flow,
consider merging it all the way inside the bench runner.

csvs

[ghstack-poisoned]
@ezyang
Copy link
Contributor

ezyang commented Mar 13, 2023

By the way, for the differential between dynamic and non-dynamic, my preference is not to have a completely separate dynamic csv, and instead a hand-written list of "we have more graph breaks here".

- add graph-breaks baselines
- add check_graph_breaks script (message users on regress or improvement)
- hook up test.sh for existing accuracy job

Refactor graph-break CI check

Take steps toward merging checker with existing check flow,
consider merging it all the way inside the bench runner.

csvs

[ghstack-poisoned]
@wconstab
Copy link
Contributor Author

By the way, for the differential between dynamic and non-dynamic, my preference is not to have a completely separate dynamic csv, and instead a hand-written list of "we have more graph breaks here".

what would the list contain, names of models and deltas compared to static?

In any case I planned to land this as is and probably let you tackle the dynamic part. Or at least let me come back to it next week.

- add graph-breaks baselines
- add check_graph_breaks script (message users on regress or improvement)
- hook up test.sh for existing accuracy job

Refactor graph-break CI check

Take steps toward merging checker with existing check flow,
consider merging it all the way inside the bench runner.

csvs

[ghstack-poisoned]
@wconstab
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@ezyang
Copy link
Contributor

ezyang commented Mar 13, 2023

what would the list contain, names of models and deltas compared to static?

For the most part, I'd expect most models to have the same number of graph breaks with/without dynamic, so it would be a list of one or two models that this is not true for.

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: inductor / cuda11.8-py3.10-gcc7-sm86 / test (inductor_timm, 1, 2, linux.g5.4xlarge.nvidia.gpu)

Details for Dev Infra team Raised by workflow job

@wconstab
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@huydhn
Copy link
Contributor

huydhn commented Mar 14, 2023

@wconstab I think there is a bug in the script as it's now failing on periodic https://hud.pytorch.org/pytorch/pytorch/commit/906a1952c676dcccc72684da8e385d98e4704f68. The error is:

2023-03-14T17:51:28.9197764Z Traceback (most recent call last):
2023-03-14T17:51:28.9198275Z   File "/var/lib/jenkins/workspace/benchmarks/dynamo/check_graph_breaks.py", line 86, in <module>
2023-03-14T17:51:28.9198600Z     main()
2023-03-14T17:51:28.9198959Z   File "/var/lib/jenkins/workspace/benchmarks/dynamo/check_graph_breaks.py", line 79, in main
2023-03-14T17:51:28.9199435Z     failed, msg = check_graph_breaks(actual, expected, args.expected)
2023-03-14T17:51:28.9199870Z   File "/var/lib/jenkins/workspace/benchmarks/dynamo/check_graph_breaks.py", line 21, in check_graph_breaks
2023-03-14T17:51:28.9200225Z     expected_graph_breaks = get_field(expected_csv, model, "graph_breaks", typ=int)
2023-03-14T17:51:28.9200576Z   File "/var/lib/jenkins/workspace/benchmarks/dynamo/check_graph_breaks.py", line 10, in get_field
2023-03-14T17:51:28.9200892Z     return typ(csv.loc[csv["name"] == model_name][field])
2023-03-14T17:51:28.9201416Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/pandas/core/series.py", line 206, in wrapper
2023-03-14T17:51:28.9201754Z     raise TypeError(f"cannot convert the series to {converter}")
2023-03-14T17:51:28.9202085Z TypeError: cannot convert the series to <class 'int'>

This was missed because periodic wasn't included in this PR. It looks like a forward fix is needed? Or feel free to revert if you need time to look into this

@wconstab
Copy link
Contributor Author

This was missed because periodic wasn't included in this PR. It looks like a forward fix is needed? Or feel free to revert if you need time to look into this

I will shoot for a forward fix. Ping me if i'm taking too long.

@wconstab
Copy link
Contributor Author

@huydhn hopefully, #96780 will fix this. I've tagged it with periodic label so lets see if it helps.

cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 23, 2023
- add graph-breaks baselines
- add check_graph_breaks script (message users on regress or improvement)
- hook up test.sh for existing accuracy job

Refactor graph-break CI check

Take steps toward merging checker with existing check flow,
consider merging it all the way inside the bench runner.

csvs
Pull Request resolved: pytorch/pytorch#96346
Approved by: https://github.com/ezyang
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 27, 2023
- add graph-breaks baselines
- add check_graph_breaks script (message users on regress or improvement)
- hook up test.sh for existing accuracy job

Refactor graph-break CI check

Take steps toward merging checker with existing check flow,
consider merging it all the way inside the bench runner.

csvs
Pull Request resolved: pytorch/pytorch#96346
Approved by: https://github.com/ezyang
@facebook-github-bot facebook-github-bot deleted the gh/wconstab/130/head branch June 8, 2023 19:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants