
Autoquant #38

Merged: 13 commits merged into gh/HDCharles/1/base on Mar 25, 2024
Conversation

HDCharles (Contributor) commented Feb 22, 2024

Stack from ghstack (oldest at bottom):

Summary: Adding autoquantization functionality: using the do_quant API we can test kernel speeds and pick the best quantization type (or no quantization) for each layer.

Test Plan: python test/test.py -k "autoquant"

also tested on SAM and SDXL
pytorch-labs/segment-anything-fast#114
HDCharles/sdxl-fast@8d9942a

Differential Revision: D55103983
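To make the mechanism concrete, here is a minimal sketch of the per-layer selection loop described above: benchmark each candidate weight representation on a real input and keep the fastest, including the unquantized baseline. The names (`bench_ms`, `autoquant_linear`, `candidates`) are illustrative, not the torchao API added in this PR.

```python
import copy

import torch
import torch.nn as nn
from torch.utils.benchmark import Timer

def bench_ms(fn, x):
    # Median wall time of fn(x) in milliseconds (Timer syncs CUDA internally).
    return Timer("fn(x)", globals={"fn": fn, "x": x}).blocked_autorange().median * 1e3

def autoquant_linear(linear: nn.Linear, example_input: torch.Tensor, candidates):
    # candidates: callables that turn a copy of the layer into a quantized variant.
    # Keep whichever variant (or the unquantized baseline) runs fastest.
    best, best_ms = linear, bench_ms(linear, example_input)
    for make_quantized in candidates:
        quantized = make_quantized(copy.deepcopy(linear))
        ms = bench_ms(quantized, example_input)
        if ms < best_ms:
            best, best_ms = quantized, ms
    return best
```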

Summary:

There is currently an issue where, for models with multiple linear layers, we get very slow dynamic quant results on later linear layers; it is unclear why.

Test Plan: python test/test.py -k "autoquant"

<class 'torchao.quantization.autoquant.DefaultLinear'> (torch.Size([65536, 1280]), torch.Size([3840, 1280]), torch.Size([3840]))
187.4432 0
AUTOTUNE addmm(65536x3840, 65536x1280, 1280x3840)
  bias_addmm 2.9764 ms 100.0%
  triton_mm_1 3.6858 ms 80.8%
  triton_mm_2 3.7502 ms 79.4%
  addmm 3.7887 ms 78.6%
  triton_mm_3 4.1547 ms 71.6%
  triton_mm_4 4.2022 ms 70.8%
  triton_mm_0 4.7970 ms 62.0%
  triton_mm_8 4.9596 ms 60.0%
  triton_mm_7 5.4343 ms 54.8%
  triton_mm_10 6.9352 ms 42.9%
SingleProcess AUTOTUNE takes 5.6320 seconds
<torch.utils.benchmark.utils.common.Measurement object at 0x7f98800eb760>
f(*args, **kwargs)
  3.08 ms
  1 measurement, 20 runs , 1 thread
<class 'torchao.quantization.autoquant.DefaultLinear'> 3.07677136734128
1311.548416 0
<class 'torchao.quantization.subclass.Int8WeightOnlyQuantizedLinearWeight'> (torch.Size([65536, 1280]), torch.Size([3840, 1280]), torch.Size([3840]))
1311.548416 0
AUTOTUNE mixed_mm(65536x1280, 1280x3840)
  fallback_mixed_mm 2.5089 ms 100.0%
  triton_mm_13 6.4153 ms 39.1%
  triton_mm_14 6.6832 ms 37.5%
  triton_mm_12 7.0896 ms 35.4%
  triton_mm_16 7.5022 ms 33.4%
  triton_mm_15 7.8426 ms 32.0%
  triton_mm_19 9.5269 ms 26.3%
  triton_mm_20 11.2033 ms 22.4%
  triton_mm_17 13.1675 ms 19.1%
  triton_mm_18 13.8004 ms 18.2%
SingleProcess AUTOTUNE takes 2.4977 seconds
<torch.utils.benchmark.utils.common.Measurement object at 0x7f986ff12050>
f(*args, **kwargs)
  3.68 ms
  1 measurement, 20 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7f986ff27b80>
f(*args, **kwargs)
  3.10 ms
  1 measurement, 20 runs , 1 thread
<class 'torchao.quantization.subclass.Int8WeightOnlyQuantizedLinearWeight'> 3.6846738075837493 3.1023880932480097
2144.447488 25
<class 'torchao.quantization.subclass.Int8DynamicallyQuantizedLinearWeight'> (torch.Size([65536, 1280]), torch.Size([3840, 1280]), torch.Size([3840]))
2144.447488 25
AUTOTUNE int_mm(65536x1280, 1280x3840, 65536x3840)
  triton_mm_43 2.0319 ms 100.0%
  triton_mm_35 2.8135 ms 72.2%
  triton_mm_42 3.1552 ms 64.4%
  triton_mm_36 3.1754 ms 64.0%
  triton_mm_44 3.3460 ms 60.7%
  triton_mm_41 3.4036 ms 59.7%
  triton_mm_37 3.5030 ms 58.0%
  triton_mm_34 3.6553 ms 55.6%
  triton_mm_38 3.9232 ms 51.8%
  triton_mm_40 9.1934 ms 22.1%
SingleProcess AUTOTUNE takes 8.1948 seconds
<torch.utils.benchmark.utils.common.Measurement object at 0x7f9892843f40>
f(*args, **kwargs)
  3.13 ms
  1 measurement, 20 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7f986cfd33a0>
f(*args, **kwargs)
  2.21 ms
  1 measurement, 20 runs , 1 thread
<class 'torchao.quantization.subclass.Int8DynamicallyQuantizedLinearWeight'> 3.1286065466701984 2.210085652768612
2144.447488 22
<class 'torchao.quantization.autoquant.DefaultLinear'> (torch.Size([65536, 3840]), torch.Size([1280, 3840]), torch.Size([1280]))
2144.447488 22
AUTOTUNE addmm(65536x1280, 65536x3840, 3840x1280)
  bias_addmm 2.7966 ms 100.0%
  addmm 3.0447 ms 91.9%
  triton_mm_57 3.5612 ms 78.5%
  triton_mm_58 3.6919 ms 75.7%
  triton_mm_59 4.1908 ms 66.7%
  triton_mm_60 4.2350 ms 66.0%
  triton_mm_56 4.7210 ms 59.2%
  triton_mm_64 4.9001 ms 57.1%
  triton_mm_63 5.5218 ms 50.6%
  triton_mm_66 7.1417 ms 39.2%
SingleProcess AUTOTUNE takes 6.3734 seconds
<torch.utils.benchmark.utils.common.Measurement object at 0x7f9888dd2b30>
f(*args, **kwargs)
  3.33 ms
  1 measurement, 20 runs , 1 thread
<class 'torchao.quantization.autoquant.DefaultLinear'> 3.329739556647837
2228.913664 39
<class 'torchao.quantization.subclass.Int8WeightOnlyQuantizedLinearWeight'> (torch.Size([65536, 3840]), torch.Size([1280, 3840]), torch.Size([1280]))
2228.913664 39
AUTOTUNE mixed_mm(65536x3840, 3840x1280)
  fallback_mixed_mm 2.3987 ms 100.0%
  triton_mm_70 6.9153 ms 34.7%
  triton_mm_72 7.1634 ms 33.5%
  triton_mm_69 7.3164 ms 32.8%
  triton_mm_68 7.5070 ms 32.0%
  triton_mm_71 7.5631 ms 31.7%
  triton_mm_76 10.7759 ms 22.3%
  triton_mm_75 11.0692 ms 21.7%
  triton_mm_73 12.8898 ms 18.6%
  triton_mm_77 13.3715 ms 17.9%
SingleProcess AUTOTUNE takes 6.2342 seconds
<torch.utils.benchmark.utils.common.Measurement object at 0x7f9880133fd0>
f(*args, **kwargs)
  3.48 ms
  1 measurement, 20 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7f988175b610>
f(*args, **kwargs)
  3.22 ms
  1 measurement, 20 runs , 1 thread
<class 'torchao.quantization.subclass.Int8WeightOnlyQuantizedLinearWeight'> 3.4762858413159847 3.2240213360637426
2228.913664 38
<class 'torchao.quantization.subclass.Int8DynamicallyQuantizedLinearWeight'> (torch.Size([65536, 3840]), torch.Size([1280, 3840]), torch.Size([1280]))
2228.913664 38
AUTOTUNE int_mm(65536x3840, 3840x1280, 65536x1280)
  triton_mm_99 1.4307 ms 100.0%
  triton_mm_100 1.9041 ms 75.1%
  triton_mm_91 2.6079 ms 54.9%
  triton_mm_98 2.6363 ms 54.3%
  triton_mm_92 2.6691 ms 53.6%
  triton_mm_93 3.0178 ms 47.4%
  triton_mm_97 3.0233 ms 47.3%
  triton_mm_94 3.1872 ms 44.9%
  triton_mm_90 3.6072 ms 39.7%
  triton_mm_96 8.4695 ms 16.9%
SingleProcess AUTOTUNE takes 8.1095 seconds
<torch.utils.benchmark.utils.common.Measurement object at 0x7f9881782f80>
f(*args, **kwargs)
  145.38 ms
  1 measurement, 20 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7f9892843f70>
f(*args, **kwargs)
  143.98 ms
  1 measurement, 20 runs , 1 thread
<class 'torchao.quantization.subclass.Int8DynamicallyQuantizedLinearWeight'> 145.37517526187003 143.98446583654732
2230.364672 79
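Reading the log: the addmm rows benchmark the unquantized baseline (DefaultLinear), the mixed_mm rows the int8 weight-only path, and the int_mm rows the int8 dynamic path. As a rough sketch of what the two quantized paths compute (illustrative only, not torchao's internal implementation; `torch._int_mm` is the private int8 matmul that inductor lowers to, with its own device and shape constraints):

```python
import torch
import torch.nn.functional as F

def int8_weight_only_linear(x, w_int8, w_scale, bias):
    # Weight-only path (mixed_mm above): int8 weights with a per-output-channel
    # scale, dequantized to the activation dtype at matmul time.
    w = w_int8.to(x.dtype) * w_scale.unsqueeze(-1)  # [n, k]
    return F.linear(x, w, bias)

def int8_dynamic_linear(x, w_int8, w_scale, bias):
    # Dynamic path (int_mm above): activations quantized per row at runtime,
    # multiplied in int8 with int32 accumulation, then rescaled.
    x_scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
    x_int8 = torch.clamp(torch.round(x / x_scale), -127, 127).to(torch.int8)
    acc = torch._int_mm(x_int8, w_int8.t())  # int32, shape [m, n]
    return acc.to(x.dtype) * x_scale * w_scale + bias
```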

[ghstack-poisoned]
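For reference, the Measurement blocks in the log ("1 measurement, 20 runs, 1 thread") are the default repr of torch.utils.benchmark results. A sketch of how one such entry could be produced, using the shapes from the first DefaultLinear line; the bfloat16 dtype and CUDA device are assumptions:

```python
import torch
from torch.utils.benchmark import Timer

# Shapes from the first DefaultLinear entry; dtype/device are assumptions.
lin = torch.nn.Linear(1280, 3840, device="cuda", dtype=torch.bfloat16)
x = torch.randn(65536, 1280, device="cuda", dtype=torch.bfloat16)
f = torch.compile(lin)  # compiling triggers the AUTOTUNE passes shown above

# Timer(...).timeit(20) prints as "1 measurement, 20 runs, 1 thread".
print(Timer("f(x)", globals={"f": f, "x": x}).timeit(20))
```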
HDCharles added a commit that referenced this pull request Feb 22, 2024
ghstack-source-id: 67787a19d26071a4a64a49cc1955190256df94f5
Pull Request resolved: #38
facebook-github-bot added the CLA Signed label Feb 22, 2024
HDCharles added a commit that referenced this pull request Feb 27, 2024
ghstack-source-id: ac88c078c2982853312629278d272e2a11b187a2
Pull Request resolved: #38
HDCharles added a commit that referenced this pull request Mar 2, 2024
Summary:

Test Plan: python test/test.py -k "autoquant"

ghstack-source-id: 85bb27017fe7fa01f4ed66202b86496717dd2fd8
Pull Request resolved: #38
HDCharles added a commit that referenced this pull request Mar 2, 2024
Summary:

Test Plan: python test/test.py -k "autoquant"

ghstack-source-id: 13ee908c1beea415bc501358a7ac5c453419b432
Pull Request resolved: #38
Summary: Adding autoquantization functionality: using the do_quant API we can test kernel speeds and pick the best quantization type (or no quantization) for each layer.

Test Plan: python test/test.py -k "autoquant"

also tested on SAM and SDXL
(pytorch-labs/segment-anything-fast#114,
huggingface/diffusion-fast@176e85f)

[ghstack-poisoned]
HDCharles added a commit that referenced this pull request Mar 5, 2024
ghstack-source-id: 398609989008c3bdf2c6e27403a3f7966883ba76
Pull Request resolved: #38
HDCharles requested a review from cpuhrsch March 5, 2024 23:52
cpuhrsch requested a review from msaroufim March 19, 2024 19:54
Review comment thread on test/test.py (outdated, resolved)
cpuhrsch (Contributor) commented:

Follow up is a refactor and tutorial

HDCharles added a commit that referenced this pull request Mar 19, 2024
ghstack-source-id: 0dbb2ffd09a4fcce471af979b039162916591b7a
Pull Request resolved: #38
HDCharles added a commit that referenced this pull request Mar 19, 2024
ghstack-source-id: fddbaf2c203a1745e8a84980f778c45162576cbc
Pull Request resolved: #38
HDCharles added a commit that referenced this pull request Mar 19, 2024
ghstack-source-id: f268031a1702302cc6baa5f91328874677a0959b
Pull Request resolved: #38
HDCharles (Contributor, Author) commented:

@HDCharles has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

HDCharles added a commit that referenced this pull request Mar 19, 2024
ghstack-source-id: 94089f74edf54f8e2122e91498b25306d322f3ab
Pull Request resolved: #38
HDCharles added a commit that referenced this pull request Mar 19, 2024
ghstack-source-id: 37683856743b0c139b87b87c1b2c9acf92a9c15b
Pull Request resolved: #38
HDCharles (Contributor, Author) commented:

@HDCharles has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

HDCharles added a commit that referenced this pull request Mar 25, 2024
ghstack-source-id: 3c1199d84d316ae49d664b6a20ebed404734806e
Pull Request resolved: #38
HDCharles merged commit 17c670a into gh/HDCharles/1/base Mar 25, 2024
1 check passed