Q3 ROADMAP #30
Open · 8 of 21 tasks
robertgshaw2-neuralmagic opened this issue Jul 22, 2024 · 4 comments
Labels: roadmap (Items planned to be worked on)

robertgshaw2-neuralmagic (Sponsor, Collaborator) commented Jul 22, 2024

SUMMARY:

  • Avoid full pass through the model for quantization modifier
  • Data free oneshot
  • Runtime of GPTQ with large models (how do we handle a 70B model?)
  • Runtime of GPTQ with act order
  • Remove “naive-quantized”
  • Native integration of compressed-tensors into AutoModelForCausalLM (deprecate SparseAutoModel); see the sketch after this list
  • run-compressed in AutoModelForCausalLM
  • KV cache quantization
  • Best practices documentation
  • Improve runtime of GPTQ with sequential_update=True on multi-GPU setups
  • Support AWQ + other low bit methods
  • Support MoE models end-to-end through vllm
  • Support XXForConditionalLM and embedding models
  • Support Vision Language models
  • Documented examples of non-llama language models (gemma, phi, qwen, mixtral, deepseek, ... request others?)
  • Documented sparse transfer learning with TRL
  • Simplify SmoothQuant for non-llama models
  • Improve loading time of pack-quantized models in SparseAutoModel
  • Understand why the BoS token has an impact
  • Integration with PEFT
  • 2:4 Sparsity
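
For the compressed-tensors / AutoModelForCausalLM item above, here is a minimal sketch of the intended user flow, assuming the oneshot/GPTQModifier entry points as documented around this time; the model ID, output directory, and quantization scheme are illustrative examples, not part of this roadmap, and loading through plain AutoModelForCausalLM is the goal state rather than current behavior:

```python
# Illustrative sketch only: quantize with llm-compressor's oneshot API, then
# reload the compressed-tensors checkpoint directly through transformers'
# AutoModelForCausalLM (today this path goes through SparseAutoModel; the
# roadmap item is to make the vanilla transformers entry point work natively).
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from transformers import AutoModelForCausalLM

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # hypothetical example model
OUTPUT_DIR = "Meta-Llama-3-8B-Instruct-W4A16"     # hypothetical output path

# One-shot GPTQ quantization; scheme/targets/dataset are example values.
oneshot(
    model=MODEL_ID,
    dataset="open_platypus",
    recipe=GPTQModifier(scheme="W4A16", targets="Linear", ignore=["lm_head"]),
    output_dir=OUTPUT_DIR,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Goal state: load the compressed-tensors checkpoint with vanilla transformers,
# no SparseAutoModel wrapper required.
model = AutoModelForCausalLM.from_pretrained(
    OUTPUT_DIR,
    device_map="auto",
    torch_dtype="auto",
)
```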
robertgshaw2-neuralmagic added the roadmap (Items planned to be worked on) label and removed the enhancement (New feature or request) label Jul 22, 2024
robertgshaw2-neuralmagic changed the title from Near Term Items to Near Term Roadmap Aug 11, 2024
robertgshaw2-neuralmagic changed the title from Near Term Roadmap to Roadmap Aug 11, 2024
robertgshaw2-neuralmagic changed the title from Roadmap to ROADMAP Aug 11, 2024
robertgshaw2-neuralmagic changed the title from ROADMAP to Q3 ROADMAP Aug 11, 2024
halexan commented Sep 2, 2024

Looking forward to
Documented examples of non-llama language models (gemma, phi, qwen, mixtral, deepseek, ... request others?)

CharlesRiggins commented Sep 3, 2024

This paper can be helpful for "understanding why the BoS token has an impact".

IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact

It is found that such huge outliers usually occur at the [BOS] token and some other uninformative initial tokens (e.g., "." or ",") at particular channels, regardless of the rest of the input sequence. We thus name these tokens pivot tokens given their dominating values in the activation. The attention scores are concentrated more on these pivot tokens than on the rest, a.k.a. attention sinks (Xiao et al., 2024).

fengyang95 commented

May I ask when AWQ will be supported?

robertgshaw2-neuralmagic (Sponsor, Collaborator, Author) commented

"May I ask when AWQ will be supported?"

We are actively working on this now; ideally it will land in a week or so.
