This tool converts ebooks (in .txt or .epub format) into dialogue-based, ChatML-formatted group chats, which you can then use to create datasets. It supports Gemini, OpenAI, and koboldcpp as backends. I recommend koboldcpp with GBNF grammar and the Alpaca format for all of the prompts (see prompts.py). The script works decently with 7B models (see the examples below), while stronger models like gpt-4o rarely get speakers wrong. You can use any context size (4096, 8192, or even 32K) by editing config.yaml, but 8192+ context is recommended.
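For reference, ChatML wraps each speaker's turn in `<|im_start|>` / `<|im_end|>` markers, with the speaker name on the first line. The names and dialogue below are illustrative, not taken from the example books:

```
<|im_start|>Alice
"We should leave before dark," she said, glancing at the window.<|im_end|>
<|im_start|>Bob
Bob shook his head. "Not until the rain stops."<|im_end|>
```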
- Killed Once, Lived Twice by Gary Whitmore - kunoichi dpo v2 7B Q8_0 @ 8192 context (chatml | regular)
- Drone World by Jim Kochanoff - gemma 2 9B @ 8192 context (chatml | regular)
- The Awakening by L C Ainsworth - kukulemon 7B Q8_0 @ 4096 context (chatml | regular)
- load book text (.txt or .epub) into a JSON file
- break the text into smaller chunks (5 lines at a time)
- detect character names and aliases using an entity detection model and mask them with generic labels (Character_1, Character_2, etc)
- create summaries of text occasionally to use in prompts and to improve accuracy
- add context lines to the start and end of each chunk to improve accuracy
- label/convert each chunk using few-shot prompts and GBNF grammar
- process the converted text
- track progress and display an ETA
- unmask character names, replacing the generic labels with the original names
- save the converted lines in both plaintext and chatml format
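Two of the steps above — chunking with added context lines, and masking/unmasking character names — can be sketched roughly like this. This is a minimal illustration, not the script's actual code; the function names (`chunk_text`, `mask_names`, `unmask_names`) and parameters are hypothetical:

```python
def chunk_text(lines, chunk_size=5, context=2):
    """Yield chunks of `chunk_size` lines, each padded with up to
    `context` extra lines before and after to improve accuracy."""
    for start in range(0, len(lines), chunk_size):
        lo = max(0, start - context)
        hi = min(len(lines), start + chunk_size + context)
        yield lines[lo:hi]

def mask_names(text, names):
    """Replace each detected character name with a generic label
    (Character_1, Character_2, ...) and remember the mapping."""
    mapping = {}
    for i, name in enumerate(names, start=1):
        label = f"Character_{i}"
        mapping[label] = name
        text = text.replace(name, label)
    return text, mapping

def unmask_names(text, mapping):
    """Restore the original names after conversion."""
    for label, name in mapping.items():
        text = text.replace(label, name)
    return text

masked, mapping = mask_names('"Hello," said Alice to Bob.', ["Alice", "Bob"])
# masked == '"Hello," said Character_1 to Character_2.'
print(unmask_names(masked, mapping))  # prints '"Hello," said Alice to Bob.'
```

A real implementation would also need to handle names that are substrings of other names (e.g. by replacing longer names first), which this sketch ignores.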
To use with koboldcpp:
- run `git clone https://github.com/statchamber/ebook-to-chatml-conversion.git`
- install koboldcpp and load a GGUF model with at least 4096 context
- install dependencies: `pip install -r requirements.txt`
- edit `config.yaml` and change settings; for example, set `max_convert` to how many paragraphs you want to convert
- create a folder called `./ebooks` and put your ebooks in it
- run `python index.py` and the results should show up in `./output`
To use with Gemini or OpenAI (koboldcpp is not needed):
- run `git clone https://github.com/statchamber/ebook-to-chatml-conversion.git`
- install dependencies: `pip install -r requirements.txt`
- edit `config.yaml` and change settings; for example, set `max_convert` to how many paragraphs you want to convert
- create a folder called `./ebooks` and put your ebooks in it
- run `python index.py` and the results should show up in `./output`
- chunk: decrease `chunk.context` if you are using a lower context size like 4096, and increase it if you are using a higher one like 32K. The more context lines you add, the fewer mistakes the AI makes, but generation is slower because the extra lines take up more tokens.
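A rough illustration of how these settings might look in `config.yaml`. Only `max_convert` and `chunk.context` are mentioned above; the values shown are placeholders, and the exact layout may differ from the repository's actual file:

```yaml
max_convert: 50   # how many paragraphs to convert (placeholder value)
chunk:
  context: 4      # context lines around each chunk; lower this for
                  # 4096-context models, raise it for 32K-context models
```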