Layerwise int4 kimi #973

Draft

abhishek-singh591 wants to merge 69 commits into

abhishek-singh591:layerwise_int4_kimi

Contributor

abhishek-singh591 commented May 7, 2026

Setup and Run Instructions

Follow the steps below to set up and run Kimi K2.5 layerwise export/compile using run.py.

Step 1: Download the Model

Download Kimi K2.5 from Hugging Face:

https://huggingface.co/moonshotai/Kimi-K2.5

Step 2: Set Model Path

run.py now requires --model_path (no hardcoded default).

Example:

--model_path /home/huggingface_hub/models--moonshotai--Kimi-K2.5/snapshots/54383e83fa343a1331754112fb9e3410c55efa2f
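The example path above follows the Hugging Face cache layout (models--<org>--<name>/snapshots/<revision>). As a minimal sketch, assuming the huggingface_hub package is installed, the snapshot directory could be obtained programmatically and then passed to --model_path. The helper names cache_folder_name and download_snapshot are illustrative and are not part of this PR.

```python
from pathlib import Path

REPO_ID = "moonshotai/Kimi-K2.5"


def cache_folder_name(repo_id: str) -> str:
    """Return the folder name the Hugging Face cache uses for a model repo."""
    return "models--" + repo_id.replace("/", "--")


def download_snapshot(repo_id: str = REPO_ID) -> Path:
    """Download the repo (or reuse the cached copy) and return the snapshot directory."""
    # Imported lazily so cache_folder_name works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=repo_id))


if __name__ == "__main__":
    # Shows the cache folder that the snapshot will live under.
    print(cache_folder_name(REPO_ID))
    # Uncomment to actually download; the checkpoint is large.
    # print(download_snapshot())
```

The path returned by download_snapshot() is the value to supply as --model_path.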

Step 3: Command-Line Arguments

| Argument | Description | Default |
| --- | --- | --- |
| --model_path | Path to downloaded Kimi model (required) | None (required) |
| --aic_hw_version | Accelerator HW version passed to final compile | ai100 |
| --window_size | Number of layers per export window | 1 |
| --layerwise_mode | single_qpc or multiple_qpc | single_qpc |
| --total_layers | Total text layers; auto-resolved from model config if not set | None |
| --num-devices | Number of devices for compile stages | 1 |
| --batch_size | Batch size for specialization config | 1 |
| --seq_len | Prefill/compile sequence length | 1 |
| --ctx_len | Context length | 128 |
| --num_cores | Number of accelerator cores | 16 |
| --mxfp6 / --no-mxfp6 | Enable/disable MXFP6 matmul compile flag | Enabled |
| --mxint8_kv_cache / --no-mxint8_kv_cache | Enable/disable MXINT8 KV cache | Enabled |
| --enable_blocking / --no-enable_blocking | Enable/disable blocking in QAIC config | Disabled |
| --blocking_mode | Blocking mode | kv |
| --num_kv_heads_repeat | KV heads repeat count | 1 |
| --num_kv_blocks | Number of KV blocks | 4 |
| --head_block_size | Head block size | 4 |
| --absorption | Enable MLA absorption | Disabled |
| --online | Enable MLA online mode | Disabled |
| --prefill_only | Compile in prefill-only mode | Disabled |
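For readers who want to see how paired flags such as --mxfp6 / --no-mxfp6 can be declared, here is a minimal argparse sketch covering a subset of the arguments above. It is an assumption about how such a parser might look, not the actual run.py; build_parser is a hypothetical name, and the real script may define these options differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring a subset of the documented arguments."""
    parser = argparse.ArgumentParser(description="Sketch of the run.py argument surface")
    parser.add_argument("--model_path", required=True, help="Path to downloaded Kimi model")
    parser.add_argument("--aic_hw_version", default="ai100")
    parser.add_argument("--window_size", type=int, default=1)
    parser.add_argument(
        "--layerwise_mode", choices=["single_qpc", "multiple_qpc"], default="single_qpc"
    )
    # None means "resolve from the model config" in the documented behavior.
    parser.add_argument("--total_layers", type=int, default=None)
    parser.add_argument("--ctx_len", type=int, default=128)
    # BooleanOptionalAction (Python 3.9+) registers both --flag and --no-flag.
    parser.add_argument("--mxfp6", action=argparse.BooleanOptionalAction, default=True)
    parser.add_argument("--mxint8_kv_cache", action=argparse.BooleanOptionalAction, default=True)
    parser.add_argument("--enable_blocking", action=argparse.BooleanOptionalAction, default=False)
    parser.add_argument("--prefill_only", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--model_path", "/path/to/snapshot", "--no-mxfp6"])
    print(args.mxfp6, args.layerwise_mode)
```

With this declaration, omitting --model_path makes argparse exit with an error, which matches the "no hardcoded default" behavior described in Step 2.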

Step 4: Layerwise Strategy

single_qpc:
- Builds layerwise ONNX windows
- Merges via QEfficient.utils.layerwise_pipeline(...)
- Runs the final full-model compile via the compile_full_model.py flow
multiple_qpc:
- Compiles layerwise outputs directly with QEfficient.utils.compile_layerwise(...)
- Runs QEfficient.utils.inference_pipeline(...)
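The two strategies above can be summarized as an ordered stage plan per mode. The sketch below only records the stage order as labels; it does not call QEfficient, and the signatures of layerwise_pipeline, compile_layerwise, and inference_pipeline are not assumed here. STAGES and plan_stages are hypothetical names used for illustration.

```python
from typing import Dict, List

# Stage labels restate the steps listed for each mode; they are descriptive only.
STAGES: Dict[str, List[str]] = {
    "single_qpc": [
        "build layerwise ONNX windows",
        "merge via QEfficient.utils.layerwise_pipeline",
        "final full-model compile (compile_full_model.py flow)",
    ],
    "multiple_qpc": [
        "compile layerwise outputs via QEfficient.utils.compile_layerwise",
        "run QEfficient.utils.inference_pipeline",
    ],
}


def plan_stages(mode: str) -> List[str]:
    """Return the ordered stage labels for a --layerwise_mode value."""
    if mode not in STAGES:
        raise ValueError(f"unknown layerwise mode: {mode!r}")
    # Return a copy so callers cannot mutate the shared table.
    return list(STAGES[mode])


if __name__ == "__main__":
    for step in plan_stages("single_qpc"):
        print(step)
```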

Step 5: Run Examples

Minimal:

  python run.py \
    --model_path /path/to/Kimi-K2.5/snapshot

Three-layer test shape (total_layers=3):

  python run.py \
    --model_path /home/huggingface_hub/models--moonshotai--Kimi-K2.5/snapshots/54383e83fa343a1331754112fb9e3410c55efa2f \
    --total_layers 3 \
    --window_size 1 \
    --layerwise_mode single_qpc \
    --aic_hw_version ai100

Multiple-QPC mode:

  python run.py \
    --model_path /path/to/Kimi-K2.5/snapshot \
    --layerwise_mode multiple_qpc \
    --window_size 1

Note on Layer Windows

Windowing uses half-open intervals:

[start, end)

start is inclusive
end is exclusive

Windows are generated from the top layer range down to 0. For example, with total_layers=3 and window_size=1, the (start, end) pairs are (2,3), (1,2), (0,1).
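The windowing rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PR's implementation: layer_windows is a hypothetical name, and the handling of a total_layers value that is not a multiple of window_size (the lowest window is clamped at 0 and is shorter) is an assumption the description does not confirm.

```python
from typing import List, Tuple


def layer_windows(total_layers: int, window_size: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) windows from the top layers down to 0."""
    if total_layers < 0 or window_size < 1:
        raise ValueError("total_layers must be >= 0 and window_size must be >= 1")
    windows: List[Tuple[int, int]] = []
    end = total_layers
    while end > 0:
        # Assumption: a short remainder window is clamped at layer 0.
        start = max(0, end - window_size)
        windows.append((start, end))
        end = start
    return windows


if __name__ == "__main__":
    print(layer_windows(3, 1))  # [(2, 3), (1, 2), (0, 1)]
```

Because each window is half-open, the end of one window equals the start of the window above it, so every layer index in [0, total_layers) is covered exactly once.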

abhishek-singh591 and others added 30 commits

April 29, 2026 16:08


          Added all changes of layer wise for kimi model

68c6f98

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Made minor fix

17127ec

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Update run.py

833857c

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Update modeling_qeff.py

5db334f

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Made minor fix

59d1525

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Made minor fix

937feb4

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Made minor fix

48b3a3b

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Made minor fix

4a159d5

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Added thread pool for loading the QPC

7bd7872

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Added all the changes for layerwise

45651ac

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          minor fix

8e8eca4

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Update blocked_attention_forwards.py

03ca639

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Merge branch 'main' into layerwise_kimi

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          port int4 changes

Signed-off-by: Mamta Singh <mamtsing@qti.qualcomm.com>


          prefill_changes

28ba2b0

Signed-off-by: Mamta Singh <mamtsing@qti.qualcomm.com>


          Merge branch 'main' into mla_int4_moe

8e0b2bd

Signed-off-by: Mamta Singh <168400541+quic-mamta@users.noreply.github.com>


          Update layerwise_pipeline.py

4b0745e

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          use casttouint4 in prefill and example script for kimi-k2.5

c5d2397

Signed-off-by: Mamta Singh <mamtsing@qti.qualcomm.com>


          Modeling fix

e3a8503

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Merging mla_int4_moe to layerwise_int4_kimi

531a9ab

Merge remote-tracking branch 'upstream/mla_int4_moe' into layerwise_int4_kimi


          Typecast scale params

57db7e4

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Update run.py

15d4e23

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Update run.py

93bf65b

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          fixed subfunction compilation issue

88500e3

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>


          Update layerwise_pipeline.py

a1c0f84

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Update layerwise_pipeline.py

7bcd3a4

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Update compile_particular_layer.py

47f7dc2

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Update compile_layerwise.py

a4563b2

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Update compile_layerwise.py

25f3713

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Update compile_particular_layer.py

b481982

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>

abhishek-singh591 and others added 9 commits

May 8, 2026 20:56


          Disable blocking bydefault

49f4d29

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Fixed hash issue

1443fef

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Added minor fix

486e871

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Cherry picked from mla_int4_moe branch.

4a98f5b

fix prefill output

Signed-off-by: Mamta Singh <mamtsing@qti.qualcomm.com>


          Update modeling_deepseek.py

b723335

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Cherry picked from mla_int_moe branch

e93ccd0

fixed EP Q chunking

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>


          Update modeling_deepseek_rope.py

f421577

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          fixed tracer

5e3330d

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>


          fixed ctxgather

35630ee

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>

abhishek-singh591 mentioned this pull request

Layer wise changes for kimi model #954

Closed

abhishek-singh591 marked this pull request as draft

May 18, 2026 06:08


          Applied MLA_par_kV_blocking patch

46546ca

Signed-off-by: Abhishek kumar singh <sabhis@qti.qualcomm.com>

abhishek-singh591 force-pushed the layerwise_int4_kimi branch from 987f4a9 to 46546ca

May 18, 2026 06:57

abhishek-singh591 and others added 11 commits

May 18, 2026 13:56


          Delete optimized_mla_par_kv_blocking.patch

8fcb869

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          cherry picked

eaac737

Signed-off-by: Abhishek kumar singh <sabhis@qti.qualcomm.com>


          Update attention_blocking.py

a07e7a0

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Added subfunction fix

aad7b90

Signed-off-by: Abhishek kumar singh <sabhis@qti.qualcomm.com>


          Update modeling_deepseek.py

efabc2c

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          fix(0415): align prefill MoE chunk export with packed dispatch

0cfdc71

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Update modeling_auto.py

29cee8c

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          Fix dtype issue

dd9b2f2

Signed-off-by: Abhishek kumar singh <sabhis@qti.qualcomm.com>


          Fix dtype issue

dfc2ad6

Signed-off-by: Abhishek kumar singh <sabhis@qti.qualcomm.com>


          Update modeling_deepseek.py

22c4589

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          fixed attention and ran linter

a9d5413

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

ochougul force-pushed the layerwise_int4_kimi branch from 9d11bf8 to a9d5413

May 18, 2026 22:39

ochougul and others added 5 commits

May 19, 2026 18:11


          fix for predication of experts

b2b72d6

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>


          fixed replicate KV, added super-fast attn, fixed MOE

2cb5c3b

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>


          added skip_kv and fixed MOE for prefill

1658f35

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>


          Added rope changes

aafb835

Signed-off-by: Abhishek Kumar Singh <sabhis@qti.qualcomm.com>


          fixed data type mismatch

d8d4887

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>
