Soulstyler: Using Large Language Model to Guide Image Style Transfer for Target Object
We provide a demo on colab that can be easily run!
⚠: The Colab demo needs to download the model from the internet and may run slowly due to GPU limitations (30–60 s per iteration; about 200 iterations are usually needed for good results).
For more technical details, check out the latest version of the paper on arXiv:
The top left is the original image, and the bottom left is the mask of the target object generated via CRIS. The rest are stylized images generated from different style texts. These are our style-transfer results under various text conditions: the stylized images keep the spatial structure of the content image and show realistic textures corresponding to the text, while the non-target regions retain their original appearance.
❤ A more detailed description of the source code is currently being organized and will be posted in the README of this repository once the paper is accepted.
Splitting a Stylized Instruction into Stylized Content and a Stylized Object using a Large Language Model.
The overall architecture of the system.
Segmentation scores of different LLMs. We manually created 100 Stylized Instructions with corresponding reference answers (Stylized Content and Stylized Object) to check how correctly different LLMs comprehend and split Stylized Instructions. An LLM output is marked correct when its stylized content and stylized object exactly match the reference answers; the right-most columns are the scores from manually evaluating each LLM's segmentation quality. ChatGLM2-6B and Llama2-7b achieved high scores in the manual evaluation, and we ultimately chose Llama2-7b as the LLM for stylized instruction segmentation.
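To make the splitting step concrete, here is a minimal sketch of how an LLM reply could be parsed into the two fields. The prompt wording, the `Content:`/`Object:` reply format, and the helper names are illustrative assumptions, not the repository's actual code; the real pipeline queries Llama2-7b.

```python
# Hypothetical sketch: split a Stylized Instruction into Stylized Content
# and Stylized Object via an LLM. The prompt format below is an assumption.

def build_prompt(instruction: str) -> str:
    """Ask the LLM to answer in a fixed two-line format we can parse."""
    return (
        "Split the following stylization instruction into two parts.\n"
        "Reply exactly as:\n"
        "Content: <style description>\n"
        "Object: <target object>\n"
        f"Instruction: {instruction}"
    )

def parse_reply(reply: str) -> dict:
    """Extract the two fields from the LLM's fixed-format reply."""
    result = {}
    for line in reply.splitlines():
        if line.startswith("Content:"):
            result["content"] = line[len("Content:"):].strip()
        elif line.startswith("Object:"):
            result["object"] = line[len("Object:"):].strip()
    return result

# Example with a mocked LLM reply (no model call):
reply = "Content: in the style of a pencil sketch\nObject: the red car"
parsed = parse_reply(reply)
print(parsed)  # {'content': 'in the style of a pencil sketch', 'object': 'the red car'}
```

The fixed reply format makes parsing trivial and lets correctness be checked by exact string match, which mirrors how the table above scores an LLM output as correct only when both fields agree with the reference answers.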
Comparison with existing state-of-the-art text-guided image style transfer models. CLIPstyler uses the yellow fields in the Stylized Instruction as the style, while the other baselines (e.g., ControlNet) use the entire Stylized Instruction as the input prompt. CLIPstyler and stable-diffusion-v1-5 image-to-image only output square images, so for comparison their outputs are stretched to match the aspect ratio of the input Content Image.
We found that a threshold of t = 0.7 balances stylization of the target-object region against preserving the original features of non-target regions. A threshold that is too small causes non-target areas of the image to be stylized, while one that is too large causes loss of stylization (loss of color or texture) in the target areas.
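The role of the threshold can be sketched as a simple masked blend: pixels whose segmentation score exceeds t take the stylized image, and the rest keep the original content. This is an illustrative sketch under assumed tensor shapes and variable names, not the repository's exact code.

```python
# Illustrative sketch of thresholded mask blending (t = 0.7).
# Shapes and names are assumptions: content/stylized are (3, H, W),
# mask is (H, W) with per-pixel segmentation scores in [0, 1].
import torch

def blend_by_mask(content: torch.Tensor, stylized: torch.Tensor,
                  mask: torch.Tensor, t: float = 0.7) -> torch.Tensor:
    hard = (mask > t).float().unsqueeze(0)  # binary target-object mask, (1, H, W)
    return hard * stylized + (1.0 - hard) * content

content = torch.zeros(3, 4, 4)   # stand-in for the original image
stylized = torch.ones(3, 4, 4)   # stand-in for the fully stylized image
mask = torch.zeros(4, 4)
mask[:2] = 0.9                   # top half scored as the target object
out = blend_by_mask(content, stylized, mask)
```

With a smaller t, more low-score (non-target) pixels cross the threshold and get stylized; with a larger t, weakly-scored parts of the target object fall below it and lose their stylization, matching the trade-off described above.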
Python 3.10.13 & PyTorch 1.12.0+cu116 & Ubuntu 20.04.1
$ conda create -n soulstyler python=3.10
$ conda activate soulstyler
$ pip install -r requirements.txt
$ pip install git+https://github.com/openai/CLIP.git
If you clone this repository locally, you need to download this weight file to the root directory before running. The Colab notebook downloads the weights automatically.
https://drive.google.com/file/d/10wo4R7sGWw5ITHpjtv3dIbIbkGpvkMiJ/view?usp=sharing
If you find that generating a single image takes too long (because of random cropping, reducing the loss takes many iterations; we plan to optimize this 😳), you can run demo.py, a multi-threaded batch script that trains multiple Stylized Contents simultaneously on a single GPU.
Some commands for testing can be found in democases.md (currently a temporary draft of commands; a more detailed description of the training and inference steps will follow in this repository).
CUDA_VISIBLE_DEVICES=0 python demo.py --case=0 --style=0,7
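As a rough sketch of the command above, the comma-separated `--style` value can be read as a list of style indices to train in the same run. The parsing below is an assumption for illustration; demo.py's actual flag semantics may differ.

```python
# Hypothetical sketch of parsing demo.py-style flags; not the repo's code.
import argparse

def parse_styles(value: str) -> list[int]:
    """Turn a comma-separated string like '0,7' into [0, 7]."""
    return [int(v) for v in value.split(",")]

parser = argparse.ArgumentParser()
parser.add_argument("--case", type=int, default=0)
parser.add_argument("--style", type=parse_styles, default=[0])

args = parser.parse_args(["--case", "0", "--style", "0,7"])
print(args.case, args.style)  # 0 [0, 7]
```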
✅1. Colab online running demo
🔘2. Api interface for LLM segmentation methods (The huggingface demo is coming soon!🤗)
🔘3. Video style transfer
🔘4. Faster method of randomized cropping
@article{chen2023soulstyler,
  title={Soulstyler: Using Large Language Model to Guide Image Style Transfer for Target Object},
  author={Junhao Chen and Peng Rong and Jingbo Sun and Chao Li and Xiang Li and Hongwu Lv},
  year={2023},
  eprint={2311.13562},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
This code and model are available only for non-commercial research purposes as defined in the LICENSE (i.e., MIT LICENSE). Check the LICENSE file for details.
This implementation is mainly built on CRIS, CLIPstyler, and Llama 2.