
Enable FP6-LLM kernel build on Windows #305

Closed

Conversation

matthewdouglas (Contributor)

This PR includes a small set of changes to enable building the FP6-LLM kernels and the torch extension in general under Windows natively. Tested with MSVC 19.39 (VS2022 17.9) and NVCC 12.4.

I have not yet validated these changes with GCC, so keeping this in draft mode for now.

pytorch-bot bot commented Jun 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/305

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b7d8ba1 with merge base 8a4e693:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot

Hi @matthewdouglas!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@msaroufim msaroufim requested a review from gau-nernst June 3, 2024 18:34
return decorator


def benchmark_model(model, num_runs, input_tensor):
Member

These look like linting changes? I can't quite see the difference.

Collaborator

Maybe a change in end-of-line characters? IIRC, Windows uses a different end-of-line sequence (CRLF) than Unix (LF), so a file saved on Windows can show whole-file diffs with no visible changes.
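One common way to rule out this kind of line-ending noise (a sketch, assuming the repository wants LF stored in the index) is a `.gitattributes` rule that normalizes text files on commit:

```text
# Normalize all text files to LF in the repository,
# regardless of the contributor's OS or editor settings.
* text=auto eol=lf
```

With this in place, a Windows checkout can still use CRLF in the working tree, but the committed content stays LF, so diffs only show real edits.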

u_int32_t *Frag1_PTR = read_RPTR_Frag1;
u_int32_t *Frag2_PTR = read_RPTR_Frag2;
__device__ __forceinline__ void Dequant_32FP6_4Way(uint32_t Reg[][4],
uint32_t * __restrict__ read_RPTR_Frag1,
Member

TIL about uint32_t vs u_int32_t lol

"-O3" if not debug_mode else "-O0",
]
}
if not IS_WINDOWS:
Member

Yeah, feels like we should be testing this in CI. Shouldn't be too hard to use a Windows machine for CPU, but I'm not sure how abundant CUDA-enabled Windows machines are in the GitHub org.

int slice_id) {
__device__ __forceinline__ void B_FromSharedToReg(uint32_t Reg[][4],
half (*read_SPTR)[WARP_K+PADDING_SHARED_MEM_FOR_B_8],
int slice_id) {
Collaborator

Missing some __restrict__ here.
For Reg[][4], I don't know if we can add __restrict__ directly. Otherwise, maybe we need to change it to a pointer (so we can add __restrict__ back). From what I know, Reg[][4] is still passed as a pointer, but it allows us to do 2-D indexing (the last dim is a compile-time constant, so Reg[i][j] translates to 4 * i + j).

u_int32_t *OutputRegs = reinterpret_cast<u_int32_t*> (Reg);
u_int32_t *Frag1_PTR = read_RPTR_Frag1;
u_int32_t *Frag2_PTR = read_RPTR_Frag2;
__device__ __forceinline__ void Dequant_32FP6_4Way(uint32_t Reg[][4],
Collaborator

Missing __restrict__ here.

@gau-nernst (Collaborator)

Done via #396

@gau-nernst gau-nernst closed this Jun 19, 2024

4 participants