
MooreThreads/text2world


Text2World

Demo

Information

A Text2World pipeline based on 3D Gaussian Splatting and video generation technology.

Features:

  1. The first open-source pipeline combining T2V models with an LRM built on a Mamba-Transformer architecture
  2. A more efficient LRM reconstruction model: composed of a video VAE encoder and a Mamba-Transformer backbone, allowing the LRM to process more images while consuming less GPU memory
  3. Two pathways for 3DGS construction:
    • Normal Path: decodes the latents, post-processes the video, then re-encodes it to generate video latents
    • Remap Path (experimental): directly maps video latents to decoder-generated latents using a remap model, eliminating the unnecessary decode/encode and post-processing steps and preparing for future end-to-end training
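The Remap Path above can be pictured as a small learned transform applied directly in latent space. The shapes and the per-channel affine form below are illustrative assumptions, not the repository's actual remap model:

```python
import random

# Hypothetical latent shape: (channels, elements), flattened for illustration.
C, N = 16, 64  # latent channels, spatial-temporal elements per channel

rng = random.Random(0)
# Stand-in for a latent produced by the T2V diffusion model.
video_latent = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(C)]

# Illustrative remap: a per-channel affine map (scale + shift) that nudges
# diffusion latents toward the distribution of VAE-encoded latents, skipping
# the decode -> post-process -> re-encode round trip of the Normal Path.
scale = [1.0 + rng.gauss(0, 0.1) for _ in range(C)]  # stand-in for learned weights
shift = [rng.gauss(0, 0.1) for _ in range(C)]

remapped = [[s * x + b for x in chan]
            for chan, s, b in zip(video_latent, scale, shift)]
assert len(remapped) == C and len(remapped[0]) == N  # shape is preserved
```

The remapped latent then feeds the LRM directly, which is what makes the path a candidate for end-to-end training.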

Benchmark

To validate the pipeline's effectiveness and the LRM model's performance, we created an evaluation dataset:

  • Public dataset: Evaluates the LRM's reconstruction performance in real-world scenarios (from the AC3D / RE10K test set, 1,980 scenes)

Dataset   PSNR    SSIM   LPIPS
Public    29.34   0.87   0.205
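For reference, PSNR in the table is the standard peak signal-to-noise ratio. A minimal computation, assuming images normalized to [0, 1]:

```python
import math

def psnr(mse: float, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB from a mean-squared error."""
    return 10.0 * math.log10(max_val ** 2 / mse)

# A PSNR of 29.34 dB corresponds to an MSE of about 1.16e-3 on [0, 1] images.
mse = 10 ** (-29.34 / 10)
print(round(psnr(mse), 2))  # → 29.34
```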

Example

[Image grid: example renders and demo videos]

Dataset

Data sources:

Data processing methods:

  1. Pose data preprocessing: pixelsplat
  2. Caption generation: VideoX-Fun or CameraCtrl
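The RE10K-style pose data handled in these steps stores one camera per line. The parser below assumes the common RealEstate10K layout (timestamp, four normalized intrinsics, two unused fields, then a row-major 3x4 world-to-camera matrix); the repository's exact format may differ:

```python
def parse_re10k_line(line: str):
    """Parse one RealEstate10K-style camera line (assumed 19-field layout)."""
    vals = [float(v) for v in line.split()]
    timestamp = vals[0]
    intrinsics = tuple(vals[1:5])          # fx, fy, cx, cy (normalized by image size)
    flat = vals[7:19]                      # row-major 3x4 world-to-camera [R|t]
    extrinsic = [flat[r * 4:(r + 1) * 4] for r in range(3)]
    return timestamp, intrinsics, extrinsic

# Example line: identity rotation, zero translation.
line = "1000 0.48 0.86 0.5 0.5 0 0 1 0 0 0 0 1 0 0 0 0 1 0"
ts, intr, ext = parse_re10k_line(line)
assert len(ext) == 3 and len(ext[0]) == 4
```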

Inference

Considering the differences between diffusion latents and the VAE latents re-encoded from post-processed videos, we provide two inference pipelines:

  • nonmap_pipeline.py (recommended): takes the post-processed video generated by the diffusion model and feeds it to the latentLRM model for rendering
  • remap_pipeline.py: remaps the diffusion-generated latents to mitigate these differences

Command-Line Arguments

  • $pose_folder: Pose folder in RE10K-style format
  • $prompt_txt: List of prompts
  • $MODEL_PATH: Video generation model parameters
  • $ckpt_path: ControlNet model parameters
  • $lrm_weight: LRM model parameters
  • $remap_weight (optional): Remap model parameters
  • $out_dir: Output directory

Non-Mapping Pipeline

python generate_nonmap_api.py \
    --prompt $prompt_txt \
    --lrm_weight $lrm_weight \
    --pose_folder $pose_folder \
    --base_model_path $MODEL_PATH \
    --controlnet_model_path $ckpt_path \
    --output_path $out_dir \
    --start_camera_idx 0 \
    --end_camera_idx 7 \
    --stride_min 2 \
    --stride_max 2 \
    --height 480 \
    --width 720 \
    --controlnet_weights 1.0 \
    --controlnet_guidance_start 0.0 \
    --controlnet_guidance_end 0.4 \
    --controlnet_transformer_num_attn_heads 4 \
    --controlnet_transformer_attention_head_dim 64 \
    --controlnet_transformer_out_proj_dim_factor 64 \
    --num_inference_steps 20
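The --start_camera_idx / --end_camera_idx / --stride_min / --stride_max flags plausibly control which poses from $pose_folder are sampled. A hypothetical frame-selection sketch (the actual sampling logic lives in generate_nonmap_api.py and may differ):

```python
import random

def select_pose_indices(start_idx: int, end_idx: int,
                        stride_min: int, stride_max: int,
                        seed: int = 0) -> list[int]:
    """Walk from start to end (inclusive), stepping by a random stride
    drawn from [stride_min, stride_max] at each step."""
    rng = random.Random(seed)
    indices, i = [], start_idx
    while i <= end_idx:
        indices.append(i)
        i += rng.randint(stride_min, stride_max)
    return indices

# With stride_min == stride_max == 2 the stride is deterministic:
print(select_pose_indices(0, 7, 2, 2))  # → [0, 2, 4, 6]
```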
