Conversation

SageAttention can't be used for training.

Hmm, good point, but it does seem to be training the LoRA: the weights are present and appear to hold real learned values. I will dig a little deeper to see what is actually going on; maybe it is just keeping the initialized weights or something. SpargeAttn might require a significant autotune pass to produce a sparse attention model (see thu-ml/SpargeAttn#6). I was initially exploring adding SpargeAttn but ran into that issue.

OK, it seems the attention parameters are not being adjusted, but the other parameters (like the MLP / feed-forward layers) are updated. I think keeping SageAttention could be fine, as long as we note that it is not meant for training. The backward-pass workarounds that were applied for FlashAttention might make it work in the future.
Requires https://github.com/thu-ml/SageAttention
Actual numbers may differ, since this was not a full training run to estimate from.
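One way to sanity-check the observation above (attention weights staying at initialization while MLP weights update) is to inspect `.grad` on each named parameter after a backward pass. The toy model below is a hypothetical sketch, not the actual trainer: it uses `detach()` to mimic a forward-only attention kernel (like SageAttention, which has no backward), and the module names are made up for illustration.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Toy block: a stand-in 'attention' projection plus an MLP layer."""

    def __init__(self, d: int = 8):
        super().__init__()
        self.attn_proj = nn.Linear(d, d)  # hypothetical attention weights
        self.mlp = nn.Linear(d, d)        # hypothetical feed-forward weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # detach() simulates a forward-only attention kernel:
        # no gradient flows back into attn_proj's parameters.
        a = self.attn_proj(x).detach()
        return self.mlp(a)

model = Block()
loss = model(torch.randn(4, 8)).sum()
loss.backward()

# Report which parameters actually received gradients this step.
for name, p in model.named_parameters():
    status = "got gradient" if p.grad is not None else "no gradient"
    print(f"{name}: {status}")
```

Running a check like this against the real LoRA modules would show whether the attention-side adapters are silently frozen (gradients `None` or all-zero) while the MLP-side adapters train normally.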