Serialize pytree to string #102577

Closed
wants to merge 3 commits into from
Conversation

angelayi
Contributor

@angelayi angelayi commented May 30, 2023

  • list --> L(value1,value2)
  • tuple --> T(value1,value2)
  • dict --> D(key1:value1,key2:value2)
  • ordered dict --> O(key1:value1,key2:value2)
  • namedtuple --> N(type(key1, key2),value1,value2)
  • leaf --> *

Restrictions

  • serializing custom types is not supported
  • we only support serializing string keys in dictionaries
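A minimal sketch of the notation above, using plain Python containers rather than a real `TreeSpec`. This is not the PR's implementation: `structure_to_str` and `Point` are hypothetical names, and unlike the PR, this sketch treats unknown custom types as leaves instead of rejecting them.

```python
from collections import OrderedDict, namedtuple


def structure_to_str(obj):
    """Hypothetical helper: render the container structure of obj."""
    # namedtuple is checked before tuple because it is a tuple subclass.
    if isinstance(obj, tuple) and hasattr(obj, "_fields"):
        header = f"{type(obj).__name__}({', '.join(obj._fields)})"
        parts = [header] + [structure_to_str(v) for v in obj]
        return "N(" + ",".join(parts) + ")"
    if isinstance(obj, list):
        return "L(" + ",".join(structure_to_str(v) for v in obj) + ")"
    if isinstance(obj, tuple):
        return "T(" + ",".join(structure_to_str(v) for v in obj) + ")"
    if isinstance(obj, dict):
        # OrderedDict is a dict subclass, so it only changes the tag.
        tag = "O" if isinstance(obj, OrderedDict) else "D"
        items = []
        for key, value in obj.items():
            if not isinstance(key, str):
                raise TypeError("only string keys are supported")
            items.append(f"{key}:{structure_to_str(value)}")
        return f"{tag}(" + ",".join(items) + ")"
    # Anything else is treated as a leaf.
    return "*"


Point = namedtuple("Point", ["x", "y"])  # hypothetical example type
print(structure_to_str([1, (2, 3)]))         # L(*,T(*,*))
print(structure_to_str({"a": 1, "b": [2]}))  # D(a:*,b:L(*))
print(structure_to_str(Point(1, 2)))         # N(Point(x, y),*,*)
```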

@pytorch-bot

pytorch-bot bot commented May 30, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/102577

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6a75431:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@angelayi angelayi marked this pull request as ready for review May 30, 2023 23:16
@angelayi angelayi mentioned this pull request May 30, 2023
5 tasks
@zhxchen17
Contributor

Why did we choose the word "type" for N(type(key1, key2),value1,value2)?

@angelayi
Contributor Author

> Why did we choose the word "type" for N(type(key1, key2),value1,value2)?

what's a better word.. 😅

@zhxchen17 zhxchen17 closed this May 31, 2023
@zhxchen17 zhxchen17 reopened this May 31, 2023
@zhxchen17
Contributor

wrong button pressed sorry lol

@zhxchen17
Contributor

> > Why did we choose the word "type" for N(type(key1, key2),value1,value2)?
>
> what's a better word.. 😅

sorry I think I misunderstood. feel free to ignore my previous comment

angelayi added a commit that referenced this pull request May 31, 2023
Serialization TODOs:
- [ ] pytree spec: #102577
- [ ] higher order ops
- [ ] node metadata (specifically nn_module_stack/source_fn)
- [ ] shape env
- [ ] graph module metadata?




[ghstack-poisoned]
Contributor

@avikchaudhuri avikchaudhuri left a comment

Why not include dataclass support as well, while we're at it? Better than NamedTuple.

Do we really need both dict and OrderedDict? Why?

@angelayi
Contributor Author

angelayi commented Jun 1, 2023

> Why not include dataclass support as well, while we're at it? Better than NamedTuple. Do we really need both dict and OrderedDict? Why?

dataclass isn't pytree-ed right now. I'm just choosing what to serialize based on the existing default types that are pytree-ed.

updates to #102708

@angelayi angelayi closed this Jun 1, 2023
pytorchmergebot pushed a commit that referenced this pull request Jun 1, 2023
@angelayi angelayi mentioned this pull request Jun 2, 2023
5 tasks
alimoezzi pushed a commit to alimoezzi/pytorch that referenced this pull request Jun 3, 2023
angelayi added a commit that referenced this pull request Jun 5, 2023
Summary:
v2 of #102125 because of git issues
corresponding deserialization diff: #102716

Implementing serialization of the exported program to a Python dataclass, and then from that dataclass to JSON. This is split into a few parts:
- `serialize(ep: ep.ExportedProgram, opset_version: Dict[str, int]) -> Tuple[bytes, bytes]` -- takes an exported program object and a dictionary mapping opset namespaces to versions, and returns the serialized exported program as bytes and, separately, the serialized state dict as bytes
- `GraphModuleSerializer` class that serializes torch.fx.GraphModule to the schema.GraphModule dataclass
- `ExportedProgramSerializer` class that serializes torch._export.exported_program.ExportedProgram to the schema.ExportedProgram dataclass

Serialization TODOs:
- [x] pytree spec: #102577
- [ ] higher order ops
- [ ] node metadata (specifically nn_module_stack/source_fn)
- [ ] constraints
- [ ] graph module metadata

The tests are not super comprehensive, but that's because I think it'll be better tested + easier to test once deserialization is implemented.

Pull Request resolved: #102707

Reviewed By: zhxchen17

Differential Revision: D46362466

Pulled By: angelayi

fbshipit-source-id: 1d3fc157a7a5c2e615dbcc7f0e87d76f2f4c43ed
pytorchmergebot pushed a commit that referenced this pull request Jun 6, 2023
Pull Request resolved: #102707
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
@ezyang
Contributor

ezyang commented Jul 26, 2023

Is there any design doc about why we picked this particular format for string serialization (and also why we hand-rolled a string parser)? Single-letter signifiers mean that downstream pytree implementers are at significant risk of collisions (you can only support 26 distinct types under this parsing scheme). Additionally, the lack of quoting means that dictionary keys containing colons can confuse the parser. It would almost have been better to just use JSON instead...
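The quoting concern can be illustrated with a small sketch. The naive split on the first colon below is an assumption made for illustration, not the parser from this PR; it only shows why an unquoted key containing `:` is ambiguous, while JSON round-trips the same key.

```python
import json

key = "a:b"  # a dictionary key that contains the separator character

# Unquoted notation: the key's colon is indistinguishable from the separator.
unquoted = f"D({key}:*)"
body = unquoted[2:-1]  # strip the leading "D(" and the trailing ")"
parsed_key, _, parsed_value = body.partition(":")
print(unquoted, "->", repr(parsed_key), repr(parsed_value))

# JSON quotes and escapes keys, so the same key round-trips unchanged.
encoded = json.dumps({key: "*"})
decoded = json.loads(encoded)
print(encoded, "->", list(decoded))
```

Here the naive split recovers the key `a` and the value `b:*`, whereas the JSON round trip recovers the original key `a:b`.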

pytorchmergebot pushed a commit that referenced this pull request Aug 27, 2023
Fixes #102577 (comment)

Serializing to JSON is more stable. This also renames the API:

```
# Takes a TreeSpec and returns it serialized as a string.
# Optionally takes a protocol version number.
def treespec_dumps(treespec: TreeSpec, protocol: Optional[int] = None) -> str: ...

# Takes a serialized treespec string and returns the TreeSpec.
def treespec_loads(data: str) -> TreeSpec: ...
```

If users want to register their own serialization format for a given pytree, they can go through the `_register_treespec_serializer` API which optionally takes in a `getstate` and `setstate` function.
```
def _register_treespec_serializer(type_, *, getstate, setstate): ...

# Takes the context and returns a JSON-dumpable context.
def getstate(context: Context) -> DumpableContext: ...

# Takes a JSON-dumpable context and reconstructs the original context.
def setstate(dumpable_context: DumpableContext) -> Context: ...
```
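As a rough illustration of the `getstate`/`setstate` idea, here is a standard-library sketch. The registry, `register_serializer_sketch`, `dump_context`, `load_context`, and the `Bag` type are all hypothetical names invented for this example; they are not the PyTorch API and only mimic the shape of the signatures above.

```python
import json
from typing import Any, Callable, Dict, Tuple

# Hypothetical registry: type -> (getstate, setstate).
_SERIALIZERS: Dict[type, Tuple[Callable[[Any], Any], Callable[[Any], Any]]] = {}


def _identity(value: Any) -> Any:
    return value


def register_serializer_sketch(type_, *, getstate, setstate):
    _SERIALIZERS[type_] = (getstate, setstate)


def dump_context(type_, context) -> str:
    # Fall back to dumping the context directly when nothing is registered.
    getstate, _ = _SERIALIZERS.get(type_, (_identity, _identity))
    return json.dumps(getstate(context))


def load_context(type_, data: str):
    _, setstate = _SERIALIZERS.get(type_, (_identity, _identity))
    return setstate(json.loads(data))


class Bag:
    """Hypothetical container whose context is a frozenset (not JSON-dumpable)."""


register_serializer_sketch(
    Bag,
    getstate=lambda ctx: sorted(ctx),          # frozenset -> sorted list
    setstate=lambda state: frozenset(state),   # list -> frozenset
)

encoded = dump_context(Bag, frozenset({"b", "a"}))
restored = load_context(Bag, encoded)
print(encoded, restored == frozenset({"a", "b"}))
```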

We will serialize to the following dataclass, and then `json.dumps` it to a string.
```
from dataclasses import dataclass
from typing import Any, List, Optional

@dataclass
class TreeSpec:
    type: Optional[str]  # a string name of the type; null for a LeafSpec
    context: Optional[Any]  # optional; a JSON-dumpable form of the context
    children_specs: List["TreeSpec"]  # specs of the child subtrees
```

If no getstate/setstate function is registered, we will by default serialize the context using `json.dumps/loads`. We will also serialize the type through `f"{typ.__module__}.{typ.__name__}"`.
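The scheme described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code merged in #106116: `SpecSketch`, `to_spec`, `dumps_sketch`, and `loads_sketch` are hypothetical names, only `list`, `tuple`, and `dict` are handled, and the dict context is assumed to be its list of keys.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, List, Optional


@dataclass
class SpecSketch:
    type: Optional[str]       # None marks a leaf
    context: Optional[Any]    # must be JSON-dumpable in this sketch
    children_specs: List["SpecSketch"] = field(default_factory=list)


def type_name(typ: type) -> str:
    # Same naming rule as in the description above.
    return f"{typ.__module__}.{typ.__name__}"


def to_spec(obj: Any) -> SpecSketch:
    if isinstance(obj, (list, tuple)):
        return SpecSketch(type_name(type(obj)), None, [to_spec(v) for v in obj])
    if isinstance(obj, dict):
        # Assumption: the context of a dict is its list of keys.
        children = [to_spec(v) for v in obj.values()]
        return SpecSketch(type_name(type(obj)), list(obj.keys()), children)
    return SpecSketch(None, None, [])


def dumps_sketch(spec: SpecSketch) -> str:
    return json.dumps(asdict(spec))


def _from_json(d: dict) -> SpecSketch:
    children = [_from_json(c) for c in d["children_specs"]]
    return SpecSketch(d["type"], d["context"], children)


def loads_sketch(data: str) -> SpecSketch:
    return _from_json(json.loads(data))


spec = to_spec({"a": [1, 2], "b": (3,)})
text = dumps_sketch(spec)
print(text)
print(loads_sketch(text) == spec)
```

Because every field is plain JSON data, the round trip needs no hand-written string parser, which is the stability argument made in the description.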

Pull Request resolved: #106116
Approved by: https://github.com/zou3519