New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[export] Serialize constrain_as_size ops #107386
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/107386
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit c7027cc with merge base aa9f6a4 (): This comment was automatically generated by Dr. CI and updates every 15 minutes. |
k: RangeConstraint(v.lower, v.upper) for k, v in shape_env.var_to_range.items() | ||
} | ||
return range_constraints | ||
|
||
transformed_ep = ExportedProgram( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the root problem is not only about range constraints themselves, but rather because we use ExportedProgram directly here, therefore some metadata is not guaranteed to be in sync by design.
transformed_ep = ExportedProgram( | ||
transformed_gm, | ||
transformed_gm.graph, | ||
copy.deepcopy(self.graph_signature), | ||
copy.deepcopy(self.call_spec), | ||
self.state_dict, | ||
copy.deepcopy(self.range_constraints), | ||
_get_updated_range_constraints(transformed_gm), | ||
copy.deepcopy(self.equality_constraints), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about equality constraints? I'm assuming they will be out of sync as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other words, I'd like to understand what about shape env has changed. You seem to suggest that only constrain_as_size
is affected, which would explain why you focused only on range constraints. But I'm not sure yet what actually changed, so can't judge whether your PR is complete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we write a pass, we retrace the graph in a pass, it creates new symbolic values. For example, if a .item
call originally resulted in the symint i0
, after retracing, a new symint, i1
, will be produced. However, the i1
will not show up in the range constraints, causing some errors when deserializing.
Equality constraints won't be as out of sync, since it just stores the input string name and it's dimension. It won't be affected unless any passes modify the inputs. But that's the same problem to solve as the graph signature going out of sync.
As Zhengxu mentioned, the best solution would be to just retrace after .transform, but I think that will break downstream users which is why I added this helper function for now to update the range constraints.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm yeah. @gmagogsfm made a similar change to update the signature, but assuming that the structure of inputs / outputs hasn't changed. In your case it's indeed harder to assume that transform()
will preserve structure so that simple renaming is enough. So I agree that the best solution would be to retrace. :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually you're not just renaming. Could you comment on when your solution might be incorrect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but I think that will break downstream users which is why I added this helper function for now to update the range constraints.
Could you clarify what will be broken at downstream? I think downstream shouldn't treat transform() as a stable API to rely on at all. If something is broken, we need to fix forward them on the other side when possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you comment on when your solution might be incorrect?
I can't think of when my solution will be incorrect, but it's just that retracing is the most stable solution.
Could you clarify what will be broken at downstream? I think downstream shouldn't treat transform() as a stable API to rely on at all.
Well, currently transform() is treated as a stable API, so that's a problem. And for what will be broken, I think edge dialect will be broken as every time we retrace we go back to ATen dialect. A fix would be to then run the aten to edge passes after each transform() call. But un-stabilizing the transform() API and using retracing during passes feels like a separate issue from this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@angelayi Did we ever tell people transform() is a stable API in the first place? I don't quite remember.
Putting this aside, I think we should make def transform()
as def _transform()
in a separate diff, so that this doesn't confuse people any further.
I still think this function fundamentally breaks our invariant of only introducing ExportedProgram with torch.export(), therefore we shouldn't let people use it any time soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After looking at the code of _get_updated_range_constraints
, it seems like we could always reconstruct range constraints like this, no?
I'm still curious what makes us think this code might be incorrect, but I don't think it's too bad. Thus approving to unblock.
to be clear I think this PR is technically correct, but I'm super duper concerned about the approach we're taking to patch ExportedProgram each time when some metadata is out of sync. imho ExportedProgram should be a result of exporting, rather than transformations (and the name suggests this). I can let this specific PR go, but it could get complicated if we keep this approach moving forward because of all the metadata sync issues. |
@pytorchbot merge |
Merge failedReason: This PR needs a If not, please add the To add a label, you can comment to pytorchbot, for example For more information, see Details for Dev Infra teamRaised by workflow job |
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
The merge job was canceled. If you believe this is a mistake,then you can re trigger it through pytorch-bot. |
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Since constrain_as_size has been fixed, I tried serializing it, but ran into some issues.
Notably, after each
.transform
call, I added a helper_get_updated_range_constraints
to update the range constrains list. This is because when we retrace in a pass, the symbolic values being used changes, so we need to update this dictionary.