<a href="https://colab.research.google.com/github/vertexcite/echo_CLIP/blob/colab_notebook_01/echo-clip-google-colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Embedding example

This is essentially the same as `embedding_example.py`, adapted to make it easy to run on Google Colab.

## Setup

In [None]:
!git clone https://github.com/echonet/echo_CLIP

Cloning into 'echo_CLIP'...
remote: Enumerating objects: 30, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 30 (delta 9), reused 21 (delta 2), pack-reused 0
Receiving objects: 100% (30/30), 1.23 MiB | 4.03 MiB/s, done.
Resolving deltas: 100% (9/9), done.


In [None]:
cd echo_CLIP

/content/echo_CLIP


In [None]:
!pip install open_clip_torch

Collecting open_clip_torch
  Downloading open_clip_torch-2.24.0-py3-none-any.whl (1.5 MB)
Collecting ftfy (from open_clip_torch)
  Downloading ftfy-6.2.0-py3-none-any.whl (54 kB)
Collecting timm (from open_clip_torch)
  Downloading timm-0.9.16-py3-none-any.whl (2.2 MB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.9.0->open_clip_torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.9.0->open_clip_t

## Embedding example

In [None]:
from open_clip import create_model_and_transforms
from template_tokenizer import template_tokenize
import torchvision.transforms as T
import torch
import torch.nn.functional as F
from utils import read_avi

In [None]:
# Use EchoCLIP-R for retrieval-based tasks where you want to find
# the similarity between two echos, like in patient identification or
# echo report retrieval. It has a longer context window because it
# uses the template tokenizer, which we found increases its retrieval
# performance but decreases its performance on other zero-shot tasks.
echo_clip_r, _, preprocess_val = create_model_and_transforms(
    "hf-hub:mkaichristensen/echo-clip-r", precision="bf16", device="cuda"
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


open_clip_pytorch_model.bin:   0%|          | 0.00/606M [00:00<?, ?B/s]

open_clip_config.json:   0%|          | 0.00/590 [00:00<?, ?B/s]

In [None]:
# We'll load a sample echo video and preprocess its frames.
test_video = read_avi(
    "example_video.avi",
    (224, 224),
)

In [None]:
test_video = torch.stack(
    [preprocess_val(T.ToPILImage()(frame)) for frame in test_video], dim=0
)
test_video = test_video.cuda()
test_video = test_video.to(torch.bfloat16)

In [None]:
# Be sure to normalize the CLIP embeddings after calculating them to make
# cosine similarity between embeddings easier to calculate.
test_video_embedding = F.normalize(echo_clip_r.encode_image(test_video), dim=-1)

In [None]:
# To get a single embedding for the entire video, we'll take the mean
# of the 10 frame embeddings.
test_video_embedding = test_video_embedding.mean(dim=0, keepdim=True)
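
One detail of the averaging step is worth noting: the mean of several unit-length vectors is generally shorter than unit length, so the averaged video embedding is only approximately normalized. The dependency-free sketch below illustrates this with made-up 2-D vectors rather than real EchoCLIP embeddings; `l2_normalize`, `mean_vector`, and `vector_norm` are hypothetical helpers written only for this illustration.

```python
import math


def vector_norm(vec):
    """Euclidean (L2) length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    length = vector_norm(vec)
    return [x / length for x in vec]


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Toy stand-ins for per-frame embeddings (assumption: illustrative values only).
frames = [l2_normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]

video = mean_vector(frames)       # mean of unit vectors
video_norm = vector_norm(video)   # shorter than 1 unless all frames agree
video_unit = l2_normalize(video)  # re-normalized copy with unit length

print(video_norm)
print(vector_norm(video_unit))
```

Re-normalizing after the mean would restore unit length, but it would also shift the similarity values relative to the ones printed in this notebook, so the cells here are left as they are.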

In [None]:
# We'll now load an excerpt of the report associated with our echo
# and tokenize it using the template tokenizer.
with open("example_report.txt", "r") as f:
    test_report = f.read()

template_tokens = template_tokenize(test_report)
template_tokens = torch.tensor(template_tokens, dtype=torch.long).unsqueeze(0).cuda()
print(template_tokens)

tensor([[907, 261, 464, 800, 887, 469, 792, 887, 669, 804,  66, 830, 881, 788,
         697, 882, 634, 371, 884, 627, 800, 581, 882,  51, 168, 474, 882, 467,
         789, 887, 459, 783, 887, 394, 726, 575, 820, 887, 232, 882,  87, 547,
         486, 604, 782, 889, 789, 702, 677, 702, 702, 766, 488, 689, 883, 613,
         437, 176, 465, 496, 812, 887, 597, 820, 431, 882, 881, 883, 686, 336,
         908,   0,   0,   0,   0,   0,   0]], device='cuda:0')


In [None]:
# We can then embed the report using EchoCLIP-R.
test_report_embedding = F.normalize(echo_clip_r.encode_text(template_tokens), dim=-1)

In [None]:
print(test_report_embedding.shape)
print(test_video_embedding.shape)

torch.Size([1, 512])
torch.Size([1, 512])


In [None]:
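# The report embedding is unit-length and the video embedding is a mean of
# unit-length frame embeddings, so their dot product serves as the cosine
# similarity (scaled by the video embedding's norm, which is at most 1).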
similarity = (test_report_embedding @ test_video_embedding.T).squeeze(0)
print(similarity.item())

0.443359375


Now try again with a second, "random" report. (This step is not part of `embedding_example.py`.)



In [None]:
# Example from https://www.sononet.us/publications/samplereports/EchoFinal1.pdf (section "Final 2D interpretation"),
# with some text from `example_report.txt` appended in slightly altered form.
test_report2 = "SEVERE INCREASE IN LEFT ATRIAL VOLUME CONSISTENT WITH A HISTORY OF ELEVATED LV FILLING PRESSURES. NO INTRACARDIAC MASS OR THROMBUS. NO PERICARDIAL EFFUSION. MILD RIGHT VENTRICULAR HYPERTROPHY. THERE IS MILD MITRAL VALVE REGURGITATION. THE PEAK TRANSMITRAL GRADIENT IS 8.7MMHG. THE MEAN TRANSMITRAL GRADIENT IS 2.8MMHG." # @param {type:"string"}


In [None]:
# `test_report2` is not an ideal demo report: deleting its leading words has little effect on the tokenisation.
# The loop below shows how much leading text can be deleted without affecting the tokenisation.
# It progressively removes leading words and prints the remaining text only when the tokens change.
print(test_report2)
words = test_report2.split()
new_report2 = ' '.join(words[1:])
previous_tokens = template_tokenize(new_report2)
for _ in range(len(words) - 1):
    template2_tokens = template_tokenize(new_report2)
    if template2_tokens != previous_tokens:
        print(new_report2)
    previous_tokens = template2_tokens
    words = new_report2.split()
    new_report2 = ' '.join(words[1:])



SEVERE INCREASE IN LEFT ATRIAL VOLUME CONSISTENT WITH A HISTORY OF ELEVATED LV FILLING PRESSURES. NO INTRACARDIAC MASS OR THROMBUS. NO PERICARDIAL EFFUSION. MILD RIGHT VENTRICULAR HYPERTROPHY. THERE IS MILD MITRAL VALVE REGURGITATION. THE PEAK TRANSMITRAL GRADIENT IS 8.7MMHG. THE MEAN TRANSMITRAL GRADIENT IS 2.8MMHG.
OF ELEVATED LV FILLING PRESSURES. NO INTRACARDIAC MASS OR THROMBUS. NO PERICARDIAL EFFUSION. MILD RIGHT VENTRICULAR HYPERTROPHY. THERE IS MILD MITRAL VALVE REGURGITATION. THE PEAK TRANSMITRAL GRADIENT IS 8.7MMHG. THE MEAN TRANSMITRAL GRADIENT IS 2.8MMHG.
RIGHT VENTRICULAR HYPERTROPHY. THERE IS MILD MITRAL VALVE REGURGITATION. THE PEAK TRANSMITRAL GRADIENT IS 8.7MMHG. THE MEAN TRANSMITRAL GRADIENT IS 2.8MMHG.
IS MILD MITRAL VALVE REGURGITATION. THE PEAK TRANSMITRAL GRADIENT IS 8.7MMHG. THE MEAN TRANSMITRAL GRADIENT IS 2.8MMHG.
PEAK TRANSMITRAL GRADIENT IS 8.7MMHG. THE MEAN TRANSMITRAL GRADIENT IS 2.8MMHG.
MEAN TRANSMITRAL GRADIENT IS 2.8MMHG.
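
The same experiment can be wrapped in a small reusable function. The sketch below is an illustration, not code from the repository: `token_change_points` is a hypothetical helper, and `toy_tokenize` is a stand-in that keeps only two known words, since the real `template_tokenize` is only available inside the cloned repository. Unlike the cell above, the comparison starts from the full text.

```python
def token_change_points(text, tokenize):
    """Return each suffix of `text` at which the token sequence changes
    as leading words are removed one at a time."""
    words = text.split()
    previous = tokenize(text)
    changes = []
    for start in range(1, len(words)):
        suffix = " ".join(words[start:])
        tokens = tokenize(suffix)
        if tokens != previous:
            changes.append(suffix)
        previous = tokens
    return changes


# Toy vocabulary (assumption): every other word is ignored by the tokenizer.
KNOWN_WORDS = {"MILD": 1, "REGURGITATION": 2}


def toy_tokenize(text):
    """Map known words to IDs and drop everything else."""
    cleaned = text.replace(".", " ").split()
    return [KNOWN_WORDS[word] for word in cleaned if word in KNOWN_WORDS]


sample = "NO PERICARDIAL EFFUSION. MILD MITRAL VALVE REGURGITATION."
for suffix in token_change_points(sample, toy_tokenize):
    print(suffix)
```

With the toy tokenizer, only the removal of `MILD` changes the tokens, so a single suffix is reported. Passing `template_tokenize` instead of `toy_tokenize` should reproduce the experiment on a real report, assuming it returns a plain list as the cells above suggest.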


In [None]:
template2_tokens = template_tokenize(test_report2)
template2_tokens = torch.tensor(template2_tokens, dtype=torch.long).unsqueeze(0).cuda()
print(template2_tokens)

tensor([[907, 770, 561, 882, 474, 882, 467, 788, 894, 459, 782, 895, 908,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0]], device='cuda:0')
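
Most of this tensor is zeros. Assuming `0` is the padding ID (an inference from the two printed tensors, not something confirmed in this notebook), a short sketch can count how many positions carry content. The token IDs are copied from the output above; `count_non_padding` is a hypothetical helper.

```python
PAD_ID = 0  # assumption: zeros in the printed tensors are padding

# Token IDs copied from the printed tensor for `test_report2`.
tokens = [907, 770, 561, 882, 474, 882, 467, 788, 894, 459, 782, 895, 908] + [PAD_ID] * 64


def count_non_padding(token_ids, pad_id=PAD_ID):
    """Number of positions that hold something other than the padding ID."""
    return sum(1 for token in token_ids if token != pad_id)


print(len(tokens), count_non_padding(tokens))  # prints: 77 13
```

Here 13 of 77 positions are non-zero, compared with 71 of 77 for the excerpt from `example_report.txt`, which fits the observation that much of `test_report2` has no effect on the tokenisation.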


In [None]:
# We can then embed the report using EchoCLIP-R.
test_report2_embedding = F.normalize(echo_clip_r.encode_text(template2_tokens), dim=-1)

In [None]:
print(test_report2_embedding.shape)

torch.Size([1, 512])


In [None]:
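# The report embedding is unit-length and the video embedding is a mean of
# unit-length frame embeddings, so their dot product serves as the cosine
# similarity (scaled by the video embedding's norm, which is at most 1).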
similarity2 = (test_report2_embedding @ test_video_embedding.T).squeeze(0)
print(similarity2.item())

0.3203125


This report has a lower cosine similarity with the example video (about 0.32) than the excerpt from `example_report.txt` (about 0.44).
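
Comparing one video against several candidate reports is the retrieval use case mentioned in the model-loading comment. A minimal sketch of ranking candidates by cosine similarity follows; it uses small made-up vectors in place of the 512-dimensional embeddings, and `rank_reports` and the report names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_reports(video_embedding, report_embeddings):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(video_embedding, embedding))
        for name, embedding in report_embeddings.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy embeddings (assumption: illustrative values, not model outputs).
video = [0.9, 0.1, 0.4]
reports = {
    "other_report": [0.1, 0.9, 0.2],
    "matching_report": [0.8, 0.2, 0.5],
}

ranking = rank_reports(video, reports)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the full cosine formula divides by both norms, this version does not depend on the inputs being normalized beforehand.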