
More local model options #2

Open

bennmann opened this issue Mar 14, 2023 · 6 comments

bennmann commented Mar 14, 2023

Hello,

Which lines would one change to use model.generate with a local model on the same host?

I have a gaming GPU with 16GB of VRAM and have run local inference on bloomz-7B, RWKV 14B, and Pythia 12B.

I want to be able to change just a few lines to generate from a local model instead of hosting an Alpa-served version.

Thanks for your thoughts and consideration.

@yangkevin2
Owner

Hi,

The util function here https://github.com/yangkevin2/doc-story-generation/blob/main/story_generation/common/util.py#L927 interfaces with Alpa to get next-token logprobs. You could try changing it to use your local model instead. Just be aware that the quality of the generated text might be a lot worse with a much smaller model.
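
For illustration, a rough sketch of what that swap might look like, assuming a Hugging Face transformers causal LM running on the same host (the model name and helper function below are illustrative assumptions, not part of the repo):

```python
# Minimal sketch (not the repo's actual API): compute next-token logprobs
# with a local causal LM instead of querying an Alpa server.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-12b")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-12b", torch_dtype=torch.float16, device_map="auto"
)

@torch.no_grad()
def local_next_token_logprobs(prompt: str) -> torch.Tensor:
    """Return log-probabilities over the vocabulary for the next token."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    logits = model(**inputs).logits[0, -1]      # logits at the last position
    return torch.log_softmax(logits, dim=-1)    # shape: (vocab_size,)
```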

Thanks,
Kevin

@bennmann
Author

Thank you! I created a branch and am now navigating my own personal dependency purgatory (AMD GPU, ROCm, accelerate, bitsandbytes 8-bit, etc.).
I will test these changes from the branch I just split, using the model https://huggingface.co/OpenAssistant/oasst-sft-1-pythia-12b
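
For anyone following along, loading that model in 8-bit might look roughly like this, assuming transformers, accelerate, and bitsandbytes are installed (bitsandbytes on ROCm is not a given and may need a fork or custom build):

```python
# Rough sketch of an 8-bit load of the OpenAssistant model; assumes
# transformers + accelerate + bitsandbytes (ROCm compatibility not verified).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OpenAssistant/oasst-sft-1-pythia-12b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # quantize weights to int8 to fit in 16GB VRAM
    device_map="auto",   # let accelerate place layers automatically
)
```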

Thanks for your guidance; I will remember you if it makes cool stories.

@bennmann
Author

Hi Kevin and anyone else,

There are also a lot of OpenAI calls in various functions. Maybe I can find them all and change them, but this has turned into more of a weekend project than an afternoon project, so replacing every OpenAI call with a local model.generate equivalent will take me a while.

I may return to this a little at a time; best guess is a few weeks to completion, since I've dug a little deeper over time. Anyone else, feel free to look at my branch and suggest model.generate equivalents for the OpenAI calls (a rough shim sketch is below).
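
As a starting point, a hypothetical drop-in shim might look something like this. The function name is an assumption, and it reuses the `model` and `tokenizer` loaded in the 8-bit example above, so existing call sites would only need to swap which completion function they call:

```python
# Hypothetical shim: mimic an OpenAI-style text completion with a local
# model.generate. Assumes `model` and `tokenizer` are loaded as shown above.
import torch

@torch.no_grad()
def local_completion(prompt: str, max_tokens: int = 256, temperature: float = 0.8) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        do_sample=temperature > 0,
        temperature=temperature,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Return only the continuation, like an OpenAI completion response would.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```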

-Ben

@yangkevin2
Owner

Oh, yeah, if you don't want to use the GPT-3 API at all, you'll have to replace all of those. Sorry, I thought you meant just the Alpa stuff.

As an additional note, using local models on a 16GB GPU will also pretty seriously compromise the quality of the resulting outputs, especially the plan/outline generation. I'm not convinced that part would work at all without an instruction-tuned model (specifically text-davinci-002, since it supports a suffix context in addition to the prompt). And in our preliminary experiments using "smaller" 13B models for the main generation procedure, the story quality was quite a bit worse too.
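
For context, the suffix-context feature referred to here looks like this with the legacy (pre-1.0) openai Python client; the model fills in text between the prompt and the suffix, which most local models can't do out of the box (the example strings are illustrative only):

```python
# Legacy openai client (<1.0): text-davinci-002 insertion with prompt + suffix.
import openai

response = openai.Completion.create(
    model="text-davinci-002",
    prompt="The detective opened the door and ",
    suffix=" which explained everything she had suspected.",
    max_tokens=64,
)
print(response["choices"][0]["text"])  # the text generated between prompt and suffix
```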

@bennmann
Author

bennmann commented Mar 16, 2023 via email

@yangkevin2
Owner

Yeah, if you're willing to do a bit of manual cherry-picking / interaction, then the requirements on model quality definitely go down significantly. I haven't tested with the new LLaMA models, but I agree it's likely they'd work better than the ones we tried previously (e.g., GPT-13B non-instruction-tuned). Would be curious to hear how it goes if you do end up trying that out.

Glad you enjoyed the work!
