We propose PureCLIP-Depth, a fully prompt-free, decoder-free monocular depth estimation (MDE) model that operates entirely within the Contrastive Language-Image Pre-training (CLIP) embedding space. Unlike recent models that rely heavily on geometric features, we explore MDE driven by conceptual information, performing all computations directly in the CLIP space. The core of our method is a learned direct mapping from the RGB domain to the depth domain strictly inside this embedding space. Our approach achieves state-of-the-art performance among CLIP-embedding-based models on both indoor and outdoor datasets.
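To make the idea concrete, below is a minimal, hypothetical sketch of a decoder-free depth head that stays inside the CLIP embedding space: a learned mapping takes a CLIP image embedding to the depth domain, and depth is read out as a similarity-weighted combination over learned depth-bin anchor embeddings. All module names, dimensions, and the bin layout are illustrative assumptions, not the actual PureCLIP-Depth implementation.

```python
# Hypothetical sketch of computing depth inside the CLIP embedding space.
# Names, dimensions, and the bin layout are assumptions for illustration.
import math

import torch
import torch.nn as nn


class EmbeddingSpaceDepthHead(nn.Module):
    def __init__(self, embed_dim=512, num_bins=64, min_depth=0.1, max_depth=10.0):
        super().__init__()
        # Learned mapping from the RGB (image) embedding to the depth domain,
        # staying inside the embedding space (no convolutional decoder).
        self.rgb_to_depth = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )
        # Learnable depth "anchor" embeddings, one per discretized depth bin.
        self.anchors = nn.Parameter(torch.randn(num_bins, embed_dim))
        # Bin centers spaced log-uniformly between min and max depth.
        self.register_buffer(
            "bin_centers",
            torch.exp(torch.linspace(math.log(min_depth), math.log(max_depth), num_bins)),
        )

    def forward(self, image_embed):
        # image_embed: (B, embed_dim) CLIP image embeddings.
        depth_embed = self.rgb_to_depth(image_embed)
        depth_embed = depth_embed / depth_embed.norm(dim=-1, keepdim=True)
        anchors = self.anchors / self.anchors.norm(dim=-1, keepdim=True)
        # Cosine similarity to each depth anchor -> soft bin assignment.
        weights = (depth_embed @ anchors.t()).softmax(dim=-1)  # (B, num_bins)
        # Expected depth under the soft assignment.
        return weights @ self.bin_centers  # (B,)


head = EmbeddingSpaceDepthHead()
depth = head(torch.randn(2, 512))
print(depth.shape)  # torch.Size([2])
```

Because the output is a convex combination of the bin centers, predicted depths always fall within the chosen depth range; per-pixel prediction would apply the same readout to dense patch embeddings instead of the pooled image embedding.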
Prediction on NYU Depth V2 dataset

Prediction on KITTI dataset (video: KITTI2.mp4)
## Installation

```shell
conda create -n PureCLIP-Depth -y python=3.12
conda activate PureCLIP-Depth
pip install -r requirement.txt
```

## Weight for NYU Depth V2 dataset

## Training

```shell
python main_train_nyu.py
```

## License

This repository is released under the MIT License.
| Project | Link | Description |
|---|---|---|
| CLIP | openai/CLIP | Official implementation of CLIP |
| PyTorch | pytorch/pytorch | Tensors and Dynamic neural networks |
