This tool converts ebooks (in .txt or .epub format) into dialogue-based, ChatML-formatted group chats, which you can then use to create datasets. It supports Gemini, OpenAI, and koboldcpp as backends. I recommend koboldcpp with GBNF grammar and the Alpaca format for all of the prompts (see prompts.py). The script works decently with 7B models (see the examples below), while stronger models like gpt-4o rarely get speakers wrong. You can use any context size (4096, 8192, or even 32K) by editing config.yaml, but 8192+ context is recommended.
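For reference, ChatML wraps each speaker's turn in `<|im_start|>` / `<|im_end|>` markers, with the speaker name on the first line. The names and dialogue below are illustrative, not taken from the example books:

```
<|im_start|>Alice
"We should leave before dark," she said, glancing at the window.<|im_end|>
<|im_start|>Bob
Bob shook his head. "Not until the rain stops."<|im_end|>
```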
- Killed Once, Lived Twice by Gary Whitmore - kunoichi dpo v2 7B Q8_0 @ 8192 context (chatml | regular)
- Drone World by Jim Kochanoff - gemma 2 9B @ 8192 context (chatml | regular)
- The Awakening by L C Ainsworth - kukulemon 7B Q8_0 @ 4096 context (chatml | regular)
- load book text (.txt or .epub) into a JSON file
- break the text into smaller chunks (5 lines at a time)
- detect character names and aliases using an entity detection model and mask them with generic labels (Character_1, Character_2, etc)
- create summaries of text occasionally to use in prompts and to improve accuracy
- add context lines to the start and end of each chunk to improve accuracy
- label/convert each chunk using few-shot prompts and GBNF grammar
- process the converted text
- track progress and display an ETA
- unmask character names, replacing the generic labels with the original names
- save the converted lines in both plaintext and chatml format
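Two of the steps above — chunking with added context lines, and masking/unmasking character names — can be sketched roughly like this. This is a minimal illustration, not the script's actual code; the function names (`chunk_text`, `mask_names`, `unmask_names`) and parameters are hypothetical:

```python
def chunk_text(lines, chunk_size=5, context=2):
    """Yield chunks of `chunk_size` lines, each padded with up to
    `context` extra lines before and after to improve accuracy."""
    for start in range(0, len(lines), chunk_size):
        lo = max(0, start - context)
        hi = min(len(lines), start + chunk_size + context)
        yield lines[lo:hi]

def mask_names(text, names):
    """Replace each detected character name with a generic label
    (Character_1, Character_2, ...) and remember the mapping."""
    mapping = {}
    for i, name in enumerate(names, start=1):
        label = f"Character_{i}"
        mapping[label] = name
        text = text.replace(name, label)
    return text, mapping

def unmask_names(text, mapping):
    """Restore the original names after conversion."""
    for label, name in mapping.items():
        text = text.replace(label, name)
    return text

masked, mapping = mask_names('"Hello," said Alice to Bob.', ["Alice", "Bob"])
# masked == '"Hello," said Character_1 to Character_2.'
print(unmask_names(masked, mapping))  # prints '"Hello," said Alice to Bob.'
```

A real implementation would also need to handle names that are substrings of other names (e.g. by replacing longer names first), which this sketch ignores.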
To use with koboldcpp:
- run `git clone https://github.com/statchamber/ebook-to-chatml-conversion.git`
- install koboldcpp and load a GGUF model with at least 4096 context
- install dependencies: `pip install -r requirements.txt`
- edit `config.yaml` and change settings; for example, set `max_convert` to how many paragraphs you want to convert
- create a folder called `./ebooks` and put your ebooks in it
- run `python index.py` and the results should show up in `./output`
To use with Gemini or OpenAI (koboldcpp is not needed):
- run `git clone https://github.com/statchamber/ebook-to-chatml-conversion.git`
- install dependencies: `pip install -r requirements.txt`
- edit `config.yaml` and change settings; for example, set `max_convert` to how many paragraphs you want to convert
- create a folder called `./ebooks` and put your ebooks in it
- run `python index.py` and the results should show up in `./output`
- chunk: decrease `chunk.context` if you are using a lower context size like 4096, and increase it if you are using a higher one like 32K. The more context lines you add, the fewer mistakes the AI makes, but generation is slower because the extra lines take up more tokens.
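A rough illustration of how these settings might look in `config.yaml`. Only `max_convert` and `chunk.context` are mentioned above; the values shown are placeholders, and the exact layout may differ from the repository's actual file:

```yaml
max_convert: 50   # how many paragraphs to convert (placeholder value)
chunk:
  context: 4      # context lines around each chunk; lower this for
                  # 4096-context models, raise it for 32K-context models
```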