Weiyu Li*1,2, Jiarui Liu*1,2, Rui Chen1,2, Yixun Liang2,3, Xuelin Chen4, Ping Tan1,2, Xiaoxiao Long1,2
TL;DR: CraftsMan (aka 匠心) is a two-stage text/image-to-3D mesh generation model. Mimicking the modeling workflow of an artist or craftsman, we first generate a coarse mesh with smooth geometry (about 5 s) using a 3D diffusion model, then refine it (about 20 s) using enhanced multi-view normal maps generated by a 2D normal diffusion model. The refinement can also be performed interactively, as in ZBrush.
Important: the released checkpoints were trained mainly on characters, so they perform better in this category. We plan to release more advanced pretrained models in the future.
This repo contains the source code (training and inference) of the 3D diffusion model, the pretrained weights, and the Gradio demo code of our 3D mesh generation project. You can find more visualizations on our project page, and try our demo and tutorial. If you have high-quality 3D data or other ideas, we very much welcome any form of cooperation.
Full abstract here
We present a novel generative 3D modeling system, coined CraftsMan, which can generate high-fidelity 3D geometries with highly varied shapes, regular mesh topologies, and detailed surfaces, and, notably, allows for refining the geometry in an interactive manner. Despite the significant advancements in 3D generation, existing methods still struggle with lengthy optimization processes, irregular mesh topologies, noisy surfaces, and difficulties in accommodating user edits, consequently impeding their widespread adoption and implementation in 3D modeling software. Our work is inspired by the craftsman, who usually roughs out the holistic figure of the work first and elaborates the surface details subsequently. Specifically, we employ a 3D native diffusion model, which operates on a latent space learned from latent set-based 3D representations, to generate coarse geometries with regular mesh topology in seconds. In particular, this process takes as input a text prompt or a reference image, and leverages a powerful multi-view (MV) diffusion model to generate multiple views of the coarse geometry, which are fed into our MV-conditioned 3D diffusion model for generating the 3D geometry, significantly improving robustness and generalizability. Following that, a normal-based geometry refiner is used to significantly enhance the surface details. This refinement can be performed automatically, or interactively with user-supplied edits. Extensive experiments demonstrate that our method achieves high efficiency in producing superior quality 3D assets compared to existing methods.
- Pretrained Models
- Gradio & Huggingface Demo
- Inference
- Training
- Data Preparation
- Video
- Acknowledgement
- Citation
Hardware
We train our model on 32x A800 GPUs with a batch size of 32 per GPU for 7 days. The mesh refinement part is performed on an RTX 3080 GPU.
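As a rough back-of-the-envelope reading of these numbers, the sketch below computes the global batch size and the total training budget in GPU-hours. It assumes plain data-parallel training with no gradient accumulation, which is not stated here, and the variable names are purely illustrative.

```python
# Figures taken from the hardware note above.
num_gpus = 32
batch_per_gpu = 32
training_days = 7

# Assumption: plain data parallelism, no gradient accumulation.
global_batch_size = num_gpus * batch_per_gpu

# Total training budget expressed in GPU-hours.
gpu_hours = num_gpus * training_days * 24

print(f"global batch size: {global_batch_size}")
print(f"GPU-hours: {gpu_hours}")
```

Under that assumption the global batch size is 1024 and the run costs 5376 A800 GPU-hours.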
Setup environment
😃 We also provide a Dockerfile for easy installation, see Setup using Docker.
- Python 3.10.0
- PyTorch 2.1.0
- CUDA Toolkit 11.8.0
- Ubuntu 22.04
Clone this repository.
git clone https://github.com/wyysf-98/CraftsMan.git
Install the required packages.
conda create -n CraftsMan python=3.10
conda activate CraftsMan
conda install -c pytorch pytorch=2.3.0 torchvision=0.18.0 cudatoolkit=11.8 && \
pip install -r docker/requirements.txt
We provide the training and inference code here for future research. The latent set diffusion model is heavily built on the structure of Michelangelo, which is based on a perceiver and has 104M parameters.
Currently, we provide models that take 4 view images as the condition and inject camera information into the CLIP feature extractor via ModLN. We will consider open-sourcing further models depending on circumstances.
## you can just get the model using wget:
wget https://huggingface.co/wyysf/CraftsMan/blob/main/image-to-shape-diffusion/clip-mvrgb-modln-l256-e64-ne8-nd16-nl6/config.yaml
wget https://huggingface.co/wyysf/CraftsMan/blob/main/image-to-shape-diffusion/clip-mvrgb-modln-l256-e64-ne8-nd16-nl6/model.ckpt
wget https://huggingface.co/wyysf/CraftsMan/blob/main/image-to-shape-diffusion/clip-mvrgb-modln-l256-e64-ne8-nd16-nl6-aligned-vae/config.yaml
wget https://huggingface.co/wyysf/CraftsMan/blob/main/image-to-shape-diffusion/clip-mvrgb-modln-l256-e64-ne8-nd16-nl6-aligned-vae/model.ckpt
## or you can git clone the repo:
git lfs install
git clone https://huggingface.co/wyysf/CraftsMan
If you download the models using wget, you should manually put them under the ckpts/image-to-shape-diffusion directory.
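To confirm the checkpoints ended up in the right place, a small check like the following may help. It is a minimal sketch, not part of the repository: it assumes each model lives in a subdirectory named after it (mirroring the Hugging Face layout and the model paths used in the commands in this README), and the helper name `missing_checkpoint_files` is hypothetical.

```python
from pathlib import Path

# Model directory names taken from the download commands above.
MODEL_NAMES = [
    "clip-mvrgb-modln-l256-e64-ne8-nd16-nl6",
    "clip-mvrgb-modln-l256-e64-ne8-nd16-nl6-aligned-vae",
]
# Each model directory is expected to hold these two files.
REQUIRED_FILES = ["config.yaml", "model.ckpt"]


def missing_checkpoint_files(root="ckpts/image-to-shape-diffusion"):
    """Return the expected checkpoint files that are absent under `root`."""
    root = Path(root)
    missing = []
    for name in MODEL_NAMES:
        for filename in REQUIRED_FILES:
            path = root / name / filename
            if not path.is_file():
                missing.append(path)
    return missing


if __name__ == "__main__":
    for path in missing_checkpoint_files():
        print(f"missing: {path}")
```

Run it from the repository root. No output means both model directories contain a config.yaml and a model.ckpt.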
We provide Gradio demos with different text/image-to-MV diffusion models, such as CRM, Wonder3D and LGM. You can select different models to get better results. To run a Gradio demo on your local machine, simply run:
python gradio_app.py --model_path ./ckpts/image-to-shape-diffusion/clip-mvrgb-modln-l256-e64-ne8-nd16-nl6-aligned-vae
To generate 3D meshes from a folder of images via the command line, simply run:
python inference.py --input eval_data --device 0 --model ./ckpts/image-to-shape-diffusion/clip-mvrgb-modln-l256-e64-ne8-nd16-nl6-aligned-vae
You can change the multi-view image model with --mv_model (supported: 'CRM', 'ImageDream', 'Wonder3D'):
python inference.py --input eval_data --mv_model 'ImageDream' --device 0 --model ./ckpts/image-to-shape-diffusion/clip-mvrgb-modln-l256-e64-ne8-nd16-nl6-aligned-vae
We use rembg to segment the foreground object by default. If the input image already has an alpha mask, please specify the --no_rembg flag:
python inference.py --input 'apps/examples/1_cute_girl.webp' --device 0 --no_rembg --model ./ckpts/image-to-shape-diffusion/clip-mvrgb-modln-l256-e64-ne8-nd16-nl6-aligned-vae
If you have images from other views (left, right, back), you can specify them with:
python inference.py --input 'apps/examples/front.webp' --device 0 --right_view 'apps/examples/right.webp' --model ./ckpts/image-to-shape-diffusion/clip-mvrgb-modln-l256-e64-ne8-nd16-nl6-aligned-vae
For more configs, please refer to inference.py.
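When scripting several runs, it can be convenient to assemble the command in Python. The sketch below is an illustrative wrapper, not part of the repository: it only uses flags that appear in the commands in this section (`--input`, `--device`, `--model`, `--mv_model`, `--no_rembg`, `--right_view`), and the function names `build_inference_command` and `run_inference` are hypothetical.

```python
import subprocess
import sys

# Model path used by the example commands in this README.
DEFAULT_MODEL = (
    "./ckpts/image-to-shape-diffusion/"
    "clip-mvrgb-modln-l256-e64-ne8-nd16-nl6-aligned-vae"
)


def build_inference_command(input_path, model=DEFAULT_MODEL, device=0,
                            mv_model=None, no_rembg=False, right_view=None):
    """Assemble an argv list for inference.py from the documented flags."""
    cmd = [sys.executable, "inference.py",
           "--input", str(input_path),
           "--device", str(device),
           "--model", str(model)]
    if mv_model is not None:
        cmd += ["--mv_model", mv_model]
    if no_rembg:
        cmd.append("--no_rembg")
    if right_view is not None:
        cmd += ["--right_view", str(right_view)]
    return cmd


def run_inference(input_path, dry_run=True, **options):
    """Print the command; execute it only when dry_run is False."""
    cmd = build_inference_command(input_path, **options)
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Dry run: only prints the command that would be executed.
    run_inference("eval_data", mv_model="ImageDream")
```

Passing the arguments as a list sidesteps shell quoting, and the dry-run default lets you inspect a command before spending GPU time on it.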
We provide our training code to facilitate future research, along with a data sample in data.
For the occupancy data, you can download it from Objaverse-MIX for easy use.
For more training details and configs, please refer to the configs folder.
### training the shape-autoencoder
python launch.py --config ./configs/shape-autoencoder/l256-e64-ne8-nd16.yaml \
--train --gpu 0
### training the image-to-shape diffusion model
python launch.py --config ./configs/image-to-shape-diffusion/clip-mvrgb-modln-l256-e64-ne8-nd16-nl6.yaml \
--train --gpu 0
We are diligently working on the release of our mesh refinement code. Your patience is appreciated as we put the final touches on this exciting development. 🔧🚀
You can also watch the mesh refinement stage in the video.
Q: Tips to get better results.
- CraftsMan takes multi-view images as the condition of the 3D diffusion model. In our experiments, our method is more robust to multi-view inconsistency than reconstruction models such as Wonder3D and InstantMesh. Because we rely on the image-to-MV model, the facing direction of the input image is very important; a well-oriented input generally leads to a good reconstruction.
- If you have your own multi-view images, it is a good idea to use them rather than the generated ones.
- Just as with 2D diffusion models, try different seeds, adjust the CFG scale, or use a different scheduler. Good luck.
- We will provide a version that is conditioned on the text prompt, so you can use positive and negative prompts.
- Inference code
- Training code
- Gradio & Hugging Face demo
- Model zoo, we will release more ckpt in the future
- Environment setup
- Data sample
- Code for mesh refinement
- Thanks to LightIllusion for providing computational resources and Jianxiong Pan for data preprocessing. If you have any ideas about high-quality 3D generation, you are welcome to contact us!
- Thanks to Hugging Face for sponsoring the nice demo!
- Thanks to 3DShape2VecSet for their amazing work. The latent set representation provides an efficient way to represent 3D shapes!
- Thanks to Michelangelo for their great work. Our model structure is heavily built on this repo!
- Thanks to CRM, Wonder3D and LGM for releasing their multi-view image generation models. If you have a more advanced version and want to contribute to the community, we would be glad to update.
- Thanks to Objaverse and Objaverse-MIX for their open-sourced data, which helped us run many validation experiments.
- Thanks to ThreeStudio for their great repo. We follow their fantastic and easy-to-use code structure!
CraftsMan is licensed under AGPL-3.0, so any downstream solution or product (including cloud services) that includes CraftsMan code or a trained model (whether pretrained or custom trained) should be open-sourced to comply with the AGPL conditions. If you have any questions about the usage of CraftsMan, please contact us first.
@misc{li2024craftsman,
title = {CraftsMan: High-fidelity Mesh Generation with 3D Native Generation and Interactive Geometry Refiner},
author = {Weiyu Li and Jiarui Liu and Rui Chen and Yixun Liang and Xuelin Chen and Ping Tan and Xiaoxiao Long},
year = {2024},
eprint = {2405.14979},
archivePrefix = {arXiv},
primaryClass = {cs.CG}
}