#### Contrastive Language-Image Pretraining (CLIP)
- CLIP provides a similarity score that indicates how well an image matches a given text prompt.
- For example, the score `tensor([0.3093])` is the cosine similarity between the image and the text in a shared embedding space, with higher values indicating stronger alignment.

##### How CLIP Works
CLIP is a multimodal model designed to measure the similarity between images and text prompts. It uses two encoders:

- **Image Encoder**: Transforms images into vector embeddings using models like ResNet or Vision Transformers (ViT).
- **Text Encoder**: Transforms text prompts into vector embeddings using a Transformer-based architecture.

CLIP learns to align image and text embeddings in a shared space by maximizing the similarity of matching image-caption pairs and minimizing the similarity of mismatched pairs within each training batch.

Because the objective only requires matching images to captions, the trained model can score how well an arbitrary text prompt describes an arbitrary image.
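
The training objective can be sketched as a symmetric cross-entropy loss over a batch of paired embeddings. This is a minimal illustration, not CLIP's actual training code: the function name `clip_style_contrastive_loss`, the fixed `temperature` of 0.07, and the random tensors standing in for encoder outputs are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of matched image/text pairs.

    Row i of image_emb is assumed to be paired with row i of text_emb.
    The temperature value is an illustrative assumption.
    """
    # L2-normalize so that dot products equal cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the correct pairs.
    logits = image_emb @ text_emb.t() / temperature

    # For row i, the correct "class" is column i.
    targets = torch.arange(logits.size(0))

    loss_image_to_text = F.cross_entropy(logits, targets)
    loss_text_to_image = F.cross_entropy(logits.t(), targets)
    return (loss_image_to_text + loss_text_to_image) / 2


torch.manual_seed(0)
image_emb = torch.randn(4, 8)  # stand-in for image encoder outputs
text_emb = torch.randn(4, 8)   # stand-in for text encoder outputs

random_loss = clip_style_contrastive_loss(image_emb, text_emb)
aligned_loss = clip_style_contrastive_loss(image_emb, image_emb.clone())
print(f"unrelated pairs: {random_loss.item():.4f}")
print(f"aligned pairs:   {aligned_loss.item():.4f}")
```

When each text embedding equals its paired image embedding, the diagonal dominates every row and the loss stays below `log(batch_size)`. Minimizing this loss is what pulls matching pairs together and pushes mismatched pairs apart.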

##### How CLIP Predicts (Scores)
To calculate a similarity score:
1. **Encoding**: CLIP encodes both the image and text prompt into vector embeddings.
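2. **Cosine Similarity**: It calculates the cosine similarity between these vectors. The score ranges from -1 to 1:
   - **Closer to 1**: The embeddings point in nearly the same direction (stronger alignment).
   - **Closer to -1**: The embeddings point in opposite directions (weaker alignment).

A score such as `0.3093` should not be read on an absolute scale. Raw CLIP cosine similarities tend to fall in a narrow band well below 1, even for well-matched pairs, so a score is most informative when compared with scores for other prompts or images computed with the same model.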
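
The cosine similarity step can be checked by hand on toy vectors. The sketch below compares a manual implementation of cos(u, v) = (u · v) / (‖u‖ ‖v‖) with PyTorch's built-in `cosine_similarity`. The helper `manual_cosine` and the 2-D vectors are illustrative assumptions, not CLIP embeddings.

```python
import torch
from torch.nn.functional import cosine_similarity


def manual_cosine(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||), computed row by row."""
    return (u * v).sum(dim=-1) / (u.norm(dim=-1) * v.norm(dim=-1))


# Toy 2-D vectors chosen so the results are easy to verify by hand.
a = torch.tensor([[1.0, 0.0]])
same_direction = torch.tensor([[3.0, 0.0]])  # parallel to a, different length
diagonal = torch.tensor([[1.0, 1.0]])        # 45 degrees from a
opposite = torch.tensor([[-1.0, 0.0]])       # 180 degrees from a

for name, other in [("same", same_direction),
                    ("diagonal", diagonal),
                    ("opposite", opposite)]:
    builtin = cosine_similarity(a, other).item()
    manual = manual_cosine(a, other).item()
    print(f"{name:9s} builtin={builtin:+.4f} manual={manual:+.4f}")
```

The three results are 1, about 0.7071, and -1. Only the angle between the vectors matters, not their length.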

##### Fine-Tuning CLIP
Fine-tuning can help improve performance on specific datasets or domains but requires careful techniques to avoid overfitting:

1. **Custom Datasets**: Fine-tune with a domain-specific dataset (e.g., medical images) to improve alignment in specific fields.
2. **Contrastive Loss Adjustments**: Adjusting the contrastive loss, for example its temperature, changes how sharply the model separates matching pairs from mismatched ones.
3. **Learning Rate Adjustments**: Lower learning rates keep the model from drifting too far from its general pretrained representations.
4. **Parameter Freezing**: Freeze most encoder layers to preserve the pretrained embeddings, and train only selected layers.
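
Parameter freezing combined with a low learning rate might look like the sketch below. The helper `freeze_all_but` and the `TinyTwoTower` class are hypothetical and exist only to keep the example small. The attribute names mirror those of Hugging Face's `CLIPModel` (`vision_model`, `text_model`, `visual_projection`, `text_projection`), but check them against your installed version before reusing the prefixes. The learning rate and weight decay are assumed values, not recommendations.

```python
import torch
from torch import nn


def freeze_all_but(model: nn.Module, trainable_prefixes):
    """Freeze every parameter whose name does not start with one of the prefixes.

    Returns the list of parameters that remain trainable.
    """
    prefixes = tuple(trainable_prefixes)
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(prefixes)
    return [p for p in model.parameters() if p.requires_grad]


class TinyTwoTower(nn.Module):
    """Hypothetical stand-in for a CLIP-like model with two encoders."""

    def __init__(self):
        super().__init__()
        self.vision_model = nn.Linear(16, 8)
        self.text_model = nn.Linear(12, 8)
        self.visual_projection = nn.Linear(8, 4, bias=False)
        self.text_projection = nn.Linear(8, 4, bias=False)


model = TinyTwoTower()
trainable = freeze_all_but(model, ["visual_projection", "text_projection"])

# A small learning rate keeps the tuned weights close to the pretrained ones.
optimizer = torch.optim.AdamW(trainable, lr=1e-5, weight_decay=0.01)

frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
tuned = sum(p.numel() for p in trainable)
print(f"frozen parameters: {frozen}, trainable parameters: {tuned}")
```

Passing only the trainable parameters to the optimizer keeps the frozen encoders out of the update step.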

##### Best Practices for Using CLIP

1. **Real-World Evaluation**: Validate CLIP’s similarity scores with real data to establish meaningful alignment thresholds.
2. **Handle Low Scores Carefully**: Low similarity scores may require fine-tuning or testing different prompts/images.
3. **High-Quality Prompts**: Clear and descriptive text prompts can improve CLIP’s alignment performance.
4. **Bias Awareness**: CLIP was trained on image-text pairs collected from the internet, which can introduce biases. Monitor applications for unintended effects.
5. **Ensemble Approach**: Use CLIP alongside other models, like object detection or scene classification, for improved contextual accuracy.
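
One way to act on low scores and prompt quality is to score several candidate prompts against the same image and compare them, rather than judging a single number in isolation. The sketch below assumes the embeddings have already been computed. The function `rank_prompts`, the `logit_scale` of 100, and the hand-written 3-D embeddings are illustrative assumptions, not real CLIP outputs.

```python
import torch
import torch.nn.functional as F


def rank_prompts(image_emb, text_embs, prompts, logit_scale=100.0):
    """Rank candidate prompts against one image embedding.

    image_emb: tensor of shape (dim,)
    text_embs: tensor of shape (num_prompts, dim)
    Returns (prompt, cosine similarity, softmax probability) tuples, best first.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    sims = text_embs @ image_emb                      # shape: (num_prompts,)
    probs = torch.softmax(logit_scale * sims, dim=0)  # relative preference
    order = torch.argsort(sims, descending=True).tolist()
    return [(prompts[i], sims[i].item(), probs[i].item()) for i in order]


# Hand-written toy embeddings, used only to exercise the function.
prompts = [
    "an astronaut in a space station",
    "a bowl of fruit",
    "a city street at night",
]
image_emb = torch.tensor([1.0, 0.2, 0.0])
text_embs = torch.tensor([
    [0.9, 0.3, 0.1],
    [0.0, 1.0, 0.2],
    [0.4, 0.1, 0.9],
])

ranking = rank_prompts(image_emb, text_embs, prompts)
for prompt, sim, prob in ranking:
    print(f"{sim:+.3f}  p={prob:.3f}  {prompt}")
```

With real embeddings, the softmax column shows which prompt the model prefers among the candidates, which is often easier to interpret than a raw cosine value.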


In [1]:
import torch
from transformers import CLIPProcessor, CLIPModel

In [2]:
# Load CLIP model and processor
model     = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")


In [3]:
prompt_used = '''
Capture the essence of adventure with a stunning visual of a futuristic 
astronaut exploring a space station that overlooks Earth, bathed in warm, 
golden light; a simple and clean scene that evokes feelings of epic power, 
set against a backdrop of soft colors
'''

In [8]:
from PIL import Image

In [9]:
image_path = r'D:\DOWNLOADS\generated_image.png'

In [10]:
# Load the image using PIL
generated_image = Image.open(image_path)

In [11]:
# Prepare the text and image; truncation guards against prompts longer than
# CLIP's 77-token text limit
inputs = processor(
    text=prompt_used,
    images=generated_image,
    return_tensors="pt",
    padding=True,
    truncation=True,
)

In [12]:
# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

In [13]:
text_embedding  = outputs.text_embeds
image_embedding = outputs.image_embeds

In [14]:
from torch.nn.functional import cosine_similarity

# Cosine similarity between the text and image embeddings (one value per pair)
similarity_score = cosine_similarity(text_embedding, image_embedding)

In [15]:
similarity_score

tensor([0.3093])