Decompose after export in export_llama #15951

Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15951
Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure
As of commit afc924f with merge base 04f1e4d.

NEW FAILURES - The following jobs have failed:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from 3d09dc1 to 10783cb
LGTM! Address any failing CI tests before landing.
Summary: `unwrap_tensor_subclass` was not unwrapping nested LoRA linears. This meant qdata/scale/zero were bundled together in the subclass and only separated when decompositions ran inside to_edge_transform_and_lower. That happens after nodes are tagged, so the scales were never tagged and remained in the PTE file after the rest of the weights were moved to the PTD file.

It's recommended to move away from `unwrap_tensor_subclass` and rely on export + decompositions. This PR runs decompositions after export in export_llama and removes cases of `unwrap_tensor_subclass`. TODO: remove all remaining cases of `unwrap_tensor_subclass` in ET.

Test Plan: Add a check that quantized weights end up in the PTD file (not the PTE file) after quantization. This is a simple check; nested linears seem to be the real issue that decomposing resolves. TODO: add a test for that (probably an e2e test with stories in a subsequent PR).

```
python -m unittest executorch.backends.xnnpack.test.passes.test_propagate_custom_meta_pass
```

Reviewed By: metascroy

Differential Revision: D87826410

Pulled By: lucylq
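For illustration, a minimal sketch of the export-then-decompose flow described above; the model, example inputs, and partitioner wiring are assumptions, not the exact export_llama code:

```python
# Minimal sketch of the export-then-decompose flow, assuming a hypothetical
# quantized LoRA model; this is not the exact export_llama wiring.
import torch
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

model = MyQuantizedLoRAModel().eval()                # hypothetical model
example_inputs = (torch.randint(0, 32000, (1, 8)),)

# Export first: the quantized tensor subclasses (qdata/scale/zero) are still
# bundled inside the subclass at this point.
ep = torch.export.export(model, example_inputs)

# Decompose immediately after export, BEFORE lowering, so qdata/scale/zero
# are already separate nodes by the time to_edge_transform_and_lower tags them.
ep = ep.run_decompositions()

et_program = to_edge_transform_and_lower(
    ep, partitioner=[XnnpackPartitioner()]
).to_executorch()
```

And a hedged sketch of the test-plan idea (quantized weights belong in the PTD file, not the PTE), continuing from the sketch above and using file sizes as a coarse proxy for the actual unittest; the filenames and the `write_tensor_data_to_file` call are assumptions:

```python
import os

# Serialize the program (PTE) and the external tensor data (PTD).
with open("model.pte", "wb") as f:
    f.write(et_program.buffer)
et_program.write_tensor_data_to_file(".")  # assumption: emits model.ptd here

# If the scales were tagged correctly, the weights dominate the PTD file and
# the PTE stays small; a size comparison is only a rough stand-in for the
# real per-tensor check.
pte_size = os.path.getsize("model.pte")
ptd_size = os.path.getsize("model.ptd")    # hypothetical PTD filename
assert ptd_size > pte_size, "expected quantized weights in the PTD file"
```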
Force-pushed from 10783cb to 22a9e78
Force-pushed from 22a9e78 to d9e7c6c
Force-pushed from d9e7c6c to afc924f
### Summary

Use Qwen 0.6B with unsloth (instead of Llama 1B with torchtune) for the LoRA test.

1. Smaller model / quicker test.
2. Eventually removes the dependency on torchtune.
3. Qwen is not gated on HF.

TODO: add quantized test after #15951

```
Expected result prefix:
<|im_start|>user
Calculate 15% of 80?<|im_end|><|im_start|>assistant
To calculate 15% of 80, we can multiply 80 by 0.15.

80 * 0.15 = 12

So, 15% of 80 is 12.

#### 12

The answer is: 12<|im_end|>

Actual result:
<|im_start|>user
Calculate 15% of 80?<|im_end|><|im_start|>assistant
To calculate 15% of 80, we can multiply 80 by 0.15.

80 * 0.15 = 12

So, 15% of 80 is 12.

#### 12

The answer is: 12<|im_end|>
PyTorchObserver {"prompt_tokens":15,"generated_tokens":65,"model_load_start_ms":1765320124550,"model_load_end_ms":1765320127516,"inference_start_ms":1765320152867,"inference_end_ms":1765320178119,"prompt_eval_end_ms":1765320153334,"first_token_ms":1765320153334,"aggregate_sampling_time_ms":19,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
Success
```
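A minimal sketch of what the CI check in the log above amounts to; `run_llama_runner` and the variable names are hypothetical stand-ins for the actual CI script:

```python
# The runner's stdout must start with the expected transcript prefix.
# run_llama_runner is a hypothetical helper returning the runner's stdout.
expected_prefix = (
    "<|im_start|>user\n"
    "Calculate 15% of 80?<|im_end|><|im_start|>assistant\n"
    "To calculate 15% of 80, we can multiply 80 by 0.15."
)

actual = run_llama_runner()
assert actual.startswith(expected_prefix), "unexpected model output"
print("Success")
```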