This repo collects research resources based on CLIP (Contrastive Language-Image Pre-Training), proposed by OpenAI. If you would like to contribute, please open an issue.
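Two minimal, hedged usage sketches (zero-shot classification and text-to-image retrieval with the original CLIP package) appear after the list below.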
- Learning Transferable Visual Models From Natural Language Supervision [code]
- CLIP: Connecting Text and Images
- Multimodal Neurons in Artificial Neural Networks
- OpenCLIP (3rd-party, PyTorch) [code]
- Train-CLIP (3rd-party, PyTorch) [code]
- Paddle-CLIP (3rd-party, PaddlePaddle) [code]
- VQGAN-CLIP [code]
- StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery [code]
- CLIP Guided Diffusion [code]
- CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions [code]
- TargetCLIP: Image-Based CLIP-Guided Essence Transfer [code]
- DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation [code]
- Clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP [code]
- Roboflow Zero-shot Object Tracking [code]
- Zero-Shot Detection via Vision and Language Knowledge Distillation [code]
- Crop-CLIP [code]
- Detic: Detecting Twenty-thousand Classes using Image-level Supervision [code]
- CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks
- SLIP: Self-supervision meets Language-Image Pre-training [code]
- ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension [code]
- Unsplash Image Search [code]
- CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval [code]
- Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling [code]
- Natural Language YouTube Search [code]
- CLIP-as-service: Embed images and sentences into fixed-length vectors with CLIP [code]
- clip-retrieval [code]
- A CLIP-Hitchhiker’s Guide to Long Video Retrieval [code]
- CLIP2Video: Mastering Video-Text Retrieval via Image CLIP [code]
- X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval [code]
- Extending CLIP for Category-to-image Retrieval in E-commerce [code]
- Wav2CLIP: Learning Robust Audio Representations From CLIP [code]
- CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotation [code]
- RegionCLIP: Region-based Language-Image Pretraining [code]
- CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification [code]
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [code]
- CyCLIP: Cyclic Contrastive Language-Image Pretraining [code]
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [code]
- DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection [code]
- UniCLIP: Unified Framework for Contrastive Language-Image Pre-training [code]
- SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model [code]
- Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese [code]
- PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining [code]
- Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training [code]
- Fine-tuned CLIP Models are Efficient Video Learners [code]
- MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining [code]
- Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm [code]
- Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision [code]
- CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation [code]
- Text2Mesh: Text-Driven Neural Stylization for Meshes [code]
- CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP [code]
- CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders [code]
- CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields [code]
- MotionCLIP: Exposing Human Motion Generation to CLIP Space [code]
- AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars [code]
- ClipFace: Text-guided Editing of Textured 3D Morphable Models [code]
- Big Sleep: A simple command line tool for text to image generation [code]
- Deep Daze: A simple command line tool for text to image generation [code]
- CLIP-CLOP: CLIP-Guided Collage and Photomontage [code]
- Learning to Prompt for Vision-Language Models [code]
- Conditional Prompt Learning for Vision-Language Models [code]
- Prompt-aligned Gradient for Prompt Tuning [code]
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [code]
- Learning to Compose Soft Prompts for Compositional Zero-Shot Learning [code]
- VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding [code]
- FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks [code]
- Frozen CLIP Models are Efficient Video Learners [code]
- Towards Real-Time Text2Video via CLIP-Guided, Pixel-Level Optimization [code]
- MovieCLIP: Visual Scene Recognition in Movies [code]
- CLIP prefix captioning [code]
- CLIPScore: A Reference-free Evaluation Metric for Image Captioning [code]
- ClipCap: CLIP Prefix for Image Captioning [code]
- Text-Only Training for Image Captioning using Noise-Injected CLIP [code]
- Fine-grained Image Captioning with CLIP Reward [code]
- HairCLIP: Design Your Hair by Text and Reference Image [code]
- CLIPstyler: Image Style Transfer with a Single Text Condition [code]
- CLIPasso: Semantically-Aware Object Sketching [code]
- Image-based CLIP-Guided Essence Transfer [code]
- CLIPDraw: Synthesize drawings to match a text prompt! [code]
- Towards Counterfactual Image Manipulation via CLIP [code]
- ClipCrop: Conditioned Cropping Driven by Vision-Language Model [code]
- CLIPascene: Scene Sketching with Different Types and Levels of Abstraction [code]
- CLIMS: Cross Language Image Matching for Weakly Supervised Semantic Segmentation [code]
- Image Segmentation Using Text and Image Prompts [code]
- Extract Free Dense Labels from CLIP [code]
- Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP [code]
- PointCLIP: Point Cloud Understanding by CLIP [code]
- CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training [code]
- LidarCLIP or: How I Learned to Talk to Point Clouds [code]
- CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP [code]
- AudioCLIP: Extending CLIP to Image, Text and Audio [code]
- AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization [code]
- Multilingual-CLIP [code]
- CLIP (With Haiku + Jax!) [code]
- CLIP-Event: Connecting Text and Images with Event Structures [code]
- How Much Can CLIP Benefit Vision-and-Language Tasks? [code]
- CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning [code]
- CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory [code]
- CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet [code]
- Task Residual for Tuning Vision-Language Models [code]
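
For orientation, here is a minimal sketch of zero-shot classification with the original OpenAI CLIP package (the first [code] link above). The model name, image path, and label prompts are illustrative assumptions, not part of any listed project.

```python
# Minimal sketch: zero-shot classification with OpenAI's CLIP package
# (pip install git+https://github.com/openai/CLIP.git).
# Model name, image path, and label prompts are illustrative assumptions.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # downloads weights on first use

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical image
text = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)  # image-text similarity logits
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(probs)  # e.g. [[0.99 0.01]] if the image shows a cat
```

OpenCLIP exposes a similar workflow via `open_clip.create_model_and_transforms` and `open_clip.get_tokenizer`, though the available model and pretrained tags vary by release.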
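Many of the retrieval entries above (e.g. Unsplash Image Search, clip-retrieval, CLIP4Clip) build on the same primitive: embed images and text with CLIP and rank by cosine similarity. A hedged sketch, assuming the same `clip` package and a small hypothetical gallery:

```python
# Minimal sketch: rank a gallery of images against a text query by cosine similarity.
# File names and query are hypothetical; real systems index millions of embeddings
# with an approximate-nearest-neighbor library rather than a dense matmul.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

paths = ["beach.jpg", "city.jpg", "forest.jpg"]  # hypothetical gallery
images = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(clip.tokenize(["a sunny beach"]).to(device))

# L2-normalize so the dot product equals cosine similarity
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

scores = (text_features @ image_features.T).squeeze(0)  # one score per image
for p, s in sorted(zip(paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{s:.3f}  {p}")
```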
Inspired by Awesome Visual-Transformer.