Implemented avatar generation utilizing stable-diffusion and LoRA

zhaosheng-thu/AvatarDiffusion

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Avatar Diffusion: Utilizing DreamBooth and LoRA to train or fine-tune Stable Diffusion

  • DreamBooth: First, I selected 8 photos of the renowned football star Cristiano Ronaldo (my favourite football star) playing on the field as the training set. Using a modified version of the diffusers DreamBooth training script, I fine-tuned only the U-Net weights with these parameters: --instance_prompt="a photo of cr7d man" --class_prompt="a photo of man" --resolution=512 --train_batch_size=1 --gradient_accumulation_steps=1 --learning_rate=2e-6 --lr_scheduler="constant" --lr_warmup_steps=0 --max_train_steps=350 --report_to="wandb". When prompted with "cr7d man, wearing a neat suit, facing the camera, smiling, high quality", I chose two images with high fidelity and fine facial detail. They are presented below:

    (Two generated images of CR7 wearing a suit.)

    In this case, the training process takes approximately 5 minutes on four 3090 graphics cards, each with 24 GiB of VRAM.


    In another training session, I selected 12 pictures of a female celebrity. Can you guess who she is?


    During training, I observed that the optimal number of images in the training set is roughly 5 to 10, particularly for avatar generation, and that a suitable learning rate is around 2e-6. A good value for --max_train_steps is the number of training images multiplied by a factor between 50 and 100, so that each image is learned 50 to 100 times. If --max_train_steps is set below this range, the generated avatars do not resemble the training set closely, because each image is seen too few times and the learning process is not sufficiently thorough. Conversely, a value above this range tends to overfit, leaving noise on the avatar's face after denoising by the U-Net.

    More images generated by the fine-tuned models are shown above: three portraits wearing suits (two of them of CR7).
  • LoRA: I used the Pokémon dataset from Hugging Face and trained with the unmodified train_text_to_image_lora.py script and its default training parameters. The results can be found here.

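The step-count rule of thumb above is easy to sanity-check in code. Below is a minimal sketch of that heuristic; the helper names are my own, not part of diffusers:

```python
def recommended_train_steps(n_images: int,
                            passes_low: int = 50,
                            passes_high: int = 100) -> range:
    """Recommended --max_train_steps range: each image seen 50-100 times."""
    return range(n_images * passes_low, n_images * passes_high + 1)

def passes_per_image(max_train_steps: int, n_images: int,
                     train_batch_size: int = 1,
                     gradient_accumulation_steps: int = 1) -> float:
    """How many times each training image is effectively learned."""
    samples_seen = max_train_steps * train_batch_size * gradient_accumulation_steps
    return samples_seen / n_images

# The CR7 run above used 8 photos and --max_train_steps=350:
steps = recommended_train_steps(8)
print(steps.start, steps.stop - 1)   # 400 800 (heuristic range for 8 images)
print(passes_per_image(350, 8))      # 43.75 passes per image, just under the 50x floor
```

With the 12-image training set mentioned above, the same heuristic would suggest roughly 600 to 1200 steps.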

Multi-View Clip

Coming soon...
