Turn into Python module, hack in transformers support #8
Conversation
Well, I deliberately didn't make it a Transformers module for a number of reasons. Whether you can wrap it up to look like a Transformers module without actually using any of the underlying Transformers code, I don't know. But just conforming to the interface poses some challenges.

For one thing, the way Transformers handles its cache is problematic, with the cache being nothing but a list of tuples of key and value tensors from previous inferences. The only state carried forward is the shape of those tensors, meaning the only way to build the cache is with concatenation. This is hugely wasteful if you look at what ends up happening at the low level: for every token generated you have to completely reallocate and copy the entire cache. Worse still, every step leaves behind a memory gap that's going to be a little too small for any of the subsequent steps, which is guaranteed to cause a lot of fragmentation. There's a reason people have been having trouble using 30B models with the full sequence length on 24 GB GPUs, and it's primarily the past-key-values cache. The weights, activations and cache are not too big for 24 GB of VRAM at all, but add in the fragmentation and PyTorch's unpredictable resource management, and suddenly you're limited to 1500 tokens or so, and that's if you don't have too many browser tabs open at the same time.

I made a hacky workaround for it at one point which patched LlamaAttention to use in-place concatenation on a past-key-values list preallocated with an underlying storage big enough for the full sequence, using a narrow view of the storage to communicate the length of the cache between inferences. This worked, but it was ugly, confusing and not as efficient as it could be. It was also likely to break if anything else (like some other patch) tried to read or modify the cache, unaware of the special way its underlying storage worked.

But by simply abandoning Transformers, I'm free to add whatever state I want to the cache without any hacks at all. I can also do more fun stuff like using multiple caches at once, which is how the beam search currently works with very little VRAM overhead (each beam has its own little cache that gets swapped into the main cache as required). Now, you could simply drop the cache altogether and pass the full input_ids to the forward pass, but then you're doing up to 2000 times the amount of compute you need per token. Granted, it scales pretty well on GPUs, but you'd still be looking at seconds per token rather than tokens per second at that point. The cache is essential, just as past_key_values is in Transformers.

I also saved a few hundred megabytes of VRAM just by placing the token embeddings in system RAM by default. Since no computation is ever done on those embeddings and the extra PCIe bandwidth usage is minimal, there's no measurable performance tradeoff. And there are many other optimizations I have planned, which will bypass more and more of (Py)Torch and may deviate even more from the Transformers template. I've even considered splitting the model vertically to allow multiple GPUs to work in parallel, which should be feasible for single-token sequences where the activation tensors end up being small.

Overall I just can't see a Transformers-compatible interface as anything but an obstacle. In fact, the only reason I'm holding on to deriving from nn.Module is that I hope to support backpropagation somewhere down the line. But for inference it's largely redundant and might add a bunch of overhead. I need to measure that at some point.
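For reference, a minimal sketch of the preallocation idea described above, assuming a single layer's key cache with made-up dimensions; this is only illustrative, not the actual LlamaAttention patch or ExLlama's cache implementation:

```python
import torch

# Hypothetical shapes for a single layer's key cache, for illustration only.
batch_size, num_heads, max_seq_len, head_dim = 1, 32, 2048, 128
device = "cuda" if torch.cuda.is_available() else "cpu"

# Preallocate storage for the full sequence once, so no per-token
# reallocation or torch.cat is ever needed.
key_storage = torch.zeros(batch_size, num_heads, max_seq_len, head_dim,
                          dtype=torch.float16, device=device)
current_len = 0  # state carried between forward passes


def append_keys(new_keys: torch.Tensor) -> torch.Tensor:
    """Copy new key states in place and return a narrow view over the valid region."""
    global current_len
    added = new_keys.shape[2]
    # In-place copy into the preallocated buffer instead of concatenating.
    key_storage[:, :, current_len:current_len + added, :].copy_(new_keys)
    current_len += added
    # The view's length is what tells the next inference how long the cache is.
    return key_storage.narrow(2, 0, current_len)
```

The value cache would be handled the same way, per layer; the point is simply that growth happens inside one fixed allocation instead of through repeated concatenation.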
I will be adding a stable API eventually and packaging it as a library so inclusion in other projects becomes straightforward (and reliable!), but shaping that API around Transformers, which is already woefully unsuited for the task (hence all the hacks like GPTQ, Accelerate, etc.), is not the way to do it, I think. A wrapper of some sort could make sense, of course, I'm just not sure how it would work. And I don't want to limit what I can do in this project to abide by limitations imposed by Transformers.

As for the setup of the C++/CUDA extension, building it with a setup.py script doesn't remove the need to compile it, and the problems people run into all come down to a mismatch between the currently installed extension and whatever software is expecting a specific version to be installed. Try out some interesting fork of alpaca_lora_4bit and suddenly you've broken something else that was built against a previous version of GPTQ that relies on a different version of quant_cuda... so you try to keep everything in its own Conda environment, or you use containers... either way it's a lot of complications just to permanently install a module that doesn't need to be permanently installed. Maybe I'm missing something, though.
For the setup.py, I am indeed talking about distributing binaries. Your setup would not work out of the box on my machine, for example, because I use PyTorch 2.0 with CUDA 11.8 but have CUDA 12.1 installed system-wide. So I need a conda environment with the older CUDA just to be able to run it. But if I have a precompiled wheel, I only need the conda env to build it once; afterwards it's irrelevant. That is what we've been doing to use GPTQ with KoboldAI, because otherwise you run into many, many building problems across all the different OS/driver/compiler configurations people use, as you mentioned.

I didn't have to do much to provide a transformers-like interface that still uses your cache implementation, and I can look into doing that completely without touching your classes. I just wanted to check whether it works at all beforehand. My question is whether you are open to such a compatibility layer, provided it doesn't affect your implementation, or if you would prefer to keep it simple and just do your own interface, leaving its use entirely on the side of the users of your library. I'm not proposing you switch back to transformers.
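A minimal sketch of the kind of setup.py that could be used to build such a wheel, assuming PyTorch's standard extension machinery; the package name and source file paths here are placeholders, not the repository's actual layout:

```python
# setup.py (hypothetical build script for a prebuilt extension wheel)
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="exllama_ext",
    version="0.0.1",
    ext_modules=[
        CUDAExtension(
            name="exllama_ext",
            # Placeholder source paths; the real file names will differ.
            sources=["exllama_ext/exllama_ext.cpp", "exllama_ext/q4_matmul.cu"],
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```

Building a wheel from that (for example with python setup.py bdist_wheel) inside a conda environment with the matching CUDA toolkit would then produce something installable on machines without a compiler, as long as the PyTorch and CUDA versions line up.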
I definitely don't mind having a Transformers wrapper in the project. I do worry that it's a little too early, since so much is still up in the air. And I would definitely prefer not having Transformers as a dependency of the main library, just for two interface definitions. But a separate companion library defining a wrapper deriving from PreTrainedModel could be clean enough, as long as there's someone to maintain it whenever I make breaking changes to the underlying model, which is going to be happening for a while still. Looking at the pull request, though, you're only implementing a bare minimum of PreTrainedModel, so maybe the more correct approach is to create a wrapper specifically for KoboldAI? It would still derive from PreTrainedModel so it could plug in without any changes to Kobold, but it wouldn't present as being Transformers-compatible in general. I'm looking at this part here, for instance:
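For illustration, a hypothetical version of that kind of dispatch, with model, cache and current_seq_len as invented placeholder names rather than the pull request's actual code, might look like:

```python
# Hypothetical illustration only; not the code quoted from the pull request.
def wrapper_forward(model, cache, input_ids):
    if input_ids.shape[-1] > 1:
        # Case 1: a full sequence. Discard the cache and reprocess everything.
        cache.current_seq_len = 0
        return model.forward(input_ids, cache)
    else:
        # Case 2: a single new token. Extend the existing cache by one position.
        return model.forward(input_ids, cache)
```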
Which only cares about two cases (full sequence or single token), which I assume is fine for Kobold (?), but wouldn't work in a back-and-forth chat where each user prompt is appended to the context in a single forward pass. Not to mention the many weird ways it would break if you tried to load a model with Accelerate, or attach a LoRA with PeftModel, or whatever else. LoRAs will be supported soon, by the way; they will just be arguments passed to the forward() function along with the inputs, instead of hooks.

For the C++ extension, I think it should be possible to detect whether the extension is installed system-wide and only build it at runtime otherwise. That would be the best of both worlds, and an optional setuptools script could potentially help with ROCm support, too (see the other pull request).
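A sketch of what that runtime detection could look like, using PyTorch's JIT extension loader as the fallback; the module name exllama_ext and the source paths are placeholders rather than the repository's actual layout:

```python
# Hypothetical sketch: prefer a system-wide install of the extension,
# fall back to building it just-in-time with PyTorch's loader.
try:
    import exllama_ext  # a prebuilt wheel installed in the environment
except ImportError:
    from torch.utils.cpp_extension import load
    exllama_ext = load(
        name="exllama_ext",
        # Placeholder source paths; the real file names will differ.
        sources=["exllama_ext/exllama_ext.cpp", "exllama_ext/q4_matmul.cu"],
        verbose=True,
    )
```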
Thank you, that makes sense. On second thought, it would probably be best to use your native way and include that as a plugin in KoboldAI, separate from all the transformers stuff. I'll take a look at that when I find the time.
This isn't ready for merge in any way; I just wanted to show what I had to do to use exllama as a transformers model in KoboldAI. It's hacked together as quickly as possible to check whether it can work: it can.
The basic steps were:
- Derive ExLlama from PreTrainedModel and ExLlamaConfig from PretrainedConfig, and add in the bare minimum of functions to make it compatible.
- Trim the incoming input_ids, because transformers always sends the whole block, while you expect the block only once and then single tokens + the cache.

I think we could keep the transformers compatibility separate from your model, if you prefer to keep your simpler structure. But I think compatibility is useful because it allows easy integration into existing projects.
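A minimal sketch of what those two steps could amount to, with the class names, the model_type value and the trimming logic all being illustrative assumptions rather than the actual code in this pull request:

```python
# Hypothetical sketch of a thin compatibility layer (not the actual PR code).
from transformers import PretrainedConfig, PreTrainedModel


class ExLlamaCompatConfig(PretrainedConfig):
    model_type = "exllama"  # assumed value, chosen here for illustration


class ExLlamaCompatModel(PreTrainedModel):
    config_class = ExLlamaCompatConfig

    def __init__(self, config, model, cache):
        super().__init__(config)
        self.model = model  # the underlying ExLlama instance (placeholder)
        self.cache = cache  # the underlying cache object (placeholder)

    def forward(self, input_ids, **kwargs):
        # Transformers resends the full sequence every step; keep only the
        # tokens not already covered by the cache (placeholder attribute name).
        new_tokens = input_ids[:, self.cache.current_seq_len:]
        # Returning raw logits here; a fuller wrapper would wrap them in a
        # transformers ModelOutput.
        return self.model.forward(new_tokens, self.cache)
```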
My question is: are you interested in pursuing this properly? It's great what you've done here, and I'd love to expand on it and use it in other projects.