In [None]:
!pip install fiftyone==1.4.1 torch torchvision umap-learn

from google.colab import drive
drive.mount('/content/drive')

import fiftyone as fo

name = "our-photos"
# Avoid shadowing the built-in dir()
dataset_dir = "/content/drive/MyDrive/impatient-cv/flickr-as-pokemon"

dataset = fo.Dataset.from_dir(
    dataset_dir=dataset_dir,
    dataset_type=fo.types.FiftyOneDataset,
    name=name
)

print(dataset)

## Using ResNet18 trained on "typical data"

We are going to calculate the embeddings using the same ResNet18 model from the FiftyOne Model Zoo that we used for classification.

I will let you in on a little secret: embeddings are the output of the final layer of the neural network before the classification layer. For most models, if you take off the last layer and keep the output vector of what remains, that vector is the embedding. Actually, you could take the output from any layer of the network, but we almost always take it from the next-to-last layer.

In the case of ResNet18, the embedding is a 512-dimensional vector that should contain all the important "features" of the original data. The features that the model captures are a by-product of both the architecture of the model and the training data. These 512 numbers are the coordinates of the original image in a 512-dimensional space.
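
To make "coordinates in a space" a bit more concrete, here is a minimal sketch in plain Python. The three vectors are made-up 4-dimensional stand-ins (real ResNet18 embeddings have 512 values), and the helper name `cosine_similarity` is my own, not part of FiftyOne. The point is only that images with similar features should land close together, however you measure "close".

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; real ones would come from the model
cat_1 = [0.9, 0.1, 0.0, 0.3]
cat_2 = [0.8, 0.2, 0.1, 0.3]
truck = [0.0, 0.9, 0.8, 0.1]

# The two cat vectors point in nearly the same direction...
print("cat_1 vs cat_2:", cosine_similarity(cat_1, cat_2))
# ...while the truck vector points somewhere else entirely
print("cat_1 vs truck:", cosine_similarity(cat_1, truck))

# Straight-line (Euclidean) distance tells the same story
print("distance cat_1 to cat_2:", math.dist(cat_1, cat_2))
print("distance cat_1 to truck:", math.dist(cat_1, truck))
```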

We can't visualize 512 dimensions, so we will use a dimension reduction method called UMAP that tries to retain the "closeness" from the 512-dimensional space while reducing down to 2 dimensions. After we finish with dimension reduction, we can see how images are related in a 2-dimensional space. Because we are using a model from the FiftyOne Zoo, we can calculate the embeddings and do dimension reduction all in one step with the `compute_visualization` method in the FiftyOne Brain library.

In [None]:
# Load model
import fiftyone.brain as fob
import fiftyone.zoo as foz
resnet18_in = foz.load_zoo_model("resnet18-imagenet-torch")

# Compute the embeddings and the 2D visualization in one step
fob.compute_visualization(
    dataset,
    model=resnet18_in,
    embeddings="resnet18_in_embed",  # field name to store the embeddings
    brain_key="resnet18_in_embed",  # run name for this brain method call
    progress=True,
    num_workers=4,  # next two only applicable to a PyTorch model
    batch_size=16
)

# Now visualize - I will walk us through it in the app
session = fo.launch_app(dataset, auto=False)
session.url

# session.show()


## Embeddings from our Pokemon Classification Model

Since our ResNet18 model trained on Pokemon is not in the model zoo, we have to do a bit more work to visualize its embeddings. First we will calculate the embeddings and store them on the samples, and then we will call `compute_visualization`, passing in the embeddings.
Once that is finished, we can compare how well the different embeddings do in telling us "interesting" things about our photos.

In [None]:
import torch
import torchvision.models as models
import torchvision.transforms.v2 as T
from PIL import Image
from tqdm.notebook import tqdm
import pickle
from torch.utils.data import Dataset, DataLoader

# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load model state dict using pickle
with open('/content/drive/MyDrive/impatient-cv/pokemon-classification-model.pt', 'rb') as f:
    state_dict = pickle.load(f)

# Create a modified ResNet18 to extract embeddings
class ResNetEmbedding(torch.nn.Module):
    def __init__(self, original_model):
        super().__init__()
        # Get all layers except the final fully connected layer
        self.features = torch.nn.Sequential(*list(original_model.children())[:-1])

    def forward(self, x):
        x = self.features(x)
        # Flatten the output to get the embedding vector
        x = torch.flatten(x, 1)
        return x

# Create base ResNet18 model with no pre-trained weights
base_model = models.resnet18(weights=None)
base_model.fc = torch.nn.Linear(base_model.fc.in_features, 150)  # Match original model

# Clean state dict keys with dict comprehension
if any(k.startswith('model.model.') for k in state_dict):
    state_dict = {k.replace('model.model.', ''): v for k, v in state_dict.items()}
elif any(k.startswith('model.') for k in state_dict):
    state_dict = {k.replace('model.', ''): v for k, v in state_dict.items()}

# Load the state dict into the base model
base_model.load_state_dict(state_dict)

# Create the embedding model from the base model
embedding_model = ResNetEmbedding(base_model)
embedding_model.to(device)
embedding_model.eval()

# Preprocessing transforms
transform = T.Compose([
    T.ToImage(),
    T.RGB(),
    T.ToDtype(torch.float32, scale=True),
    T.Resize(224),
    T.CenterCrop(224),
])

# Custom dataset for parallel loading
class PokemonDataset(Dataset):
    def __init__(self, sample_ids, filepaths, transform=None):
        self.sample_ids = sample_ids
        self.filepaths = filepaths
        self.transform = transform

    def __len__(self):
        return len(self.filepaths)

    def __getitem__(self, idx):
        sample_id = self.sample_ids[idx]
        filepath = self.filepaths[idx]

        image = Image.open(filepath).convert('RGB')
        if self.transform:
            image = self.transform(image)
        return sample_id, image

# Extract sample IDs and filepaths
sample_ids = dataset.values("id")
filepaths = dataset.values("filepath")

# Create dataset and dataloader
pokemon_dataset = PokemonDataset(sample_ids, filepaths, transform)
dataloader = DataLoader(
    pokemon_dataset,
    batch_size=64,
    num_workers=2,
    pin_memory=True
)

# Dictionary to store all embeddings
all_embeddings = {}

# Process batches to extract embeddings
print("Extracting embeddings...")
with torch.inference_mode():
    for batch_ids, images in tqdm(dataloader):
        # Move images to device
        images = images.to(device, non_blocking=True)

        # Extract embeddings with mixed precision (only enabled when running on GPU)
        with torch.autocast(device_type=device.type, enabled=device.type == "cuda"):
            batch_embeddings = embedding_model(images)

        # Get embeddings from GPU
        batch_embeddings_cpu = batch_embeddings.cpu().numpy()

        # Store embeddings in dictionary
        for i, sample_id in enumerate(batch_ids):
            all_embeddings[sample_id] = batch_embeddings_cpu[i].tolist()

# Convert dictionary to ordered list matching the dataset order
embeddings_list = [all_embeddings[sample_id] for sample_id in sample_ids]

# Update all samples in a single batch operation
dataset.set_values("resnet18_pm_embed", embeddings_list)

# Save dataset
dataset.save()

print("Embeddings extraction complete and stored as 'resnet18_pm_embed'")

Now that we have saved the embeddings on the samples, we can call `compute_visualization` with the pre-computed embeddings.

In [None]:
fob.compute_visualization(
    dataset,
    embeddings="resnet18_pm_embed",  # field holding the pre-computed embeddings
    brain_key="resnet18_pm_embed",  # run name for this brain method call
    progress=True
)

session.refresh()

## One final embedding view

You might have noticed there was a brain run titled `openclip_embed`. OpenCLIP is a multimodal model that was trained on images and their text descriptions, which allows the model to respond to both:
1. Image prompts - find images like this image
2. Text prompts - "Photos of cats"

Because it has knowledge of human text associated with the image, the model produces embeddings that are more closely aligned with human "semantic" understanding of an image.
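
Here is a rough sketch of why that works, using plain Python and made-up 3-dimensional vectors. I am assuming, as is the case for CLIP-style models, that images and text are embedded into the same space, so a text prompt and an image prompt are both just query vectors. The file names, the vectors, and the helper `rank_by_similarity` are hypothetical and only there for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_by_similarity(query, embeddings, k=2):
    """Return the k names whose embeddings are most similar to the query vector."""
    ranked = sorted(
        embeddings,
        key=lambda name: cosine_similarity(query, embeddings[name]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical image embeddings in a shared image/text space
image_embeddings = {
    "cat_on_sofa.jpg": [0.9, 0.1, 0.1],
    "kitten.jpg": [0.8, 0.3, 0.0],
    "red_truck.jpg": [0.1, 0.9, 0.2],
}

# Text prompt: stand-in for the embedding of "Photos of cats"
text_query = [1.0, 0.2, 0.0]
print(rank_by_similarity(text_query, image_embeddings, k=2))

# Image prompt: "find images like this image" uses an image's own embedding
image_query = image_embeddings["red_truck.jpg"]
print(rank_by_similarity(image_query, image_embeddings, k=1))
```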

If we go back to the app and pull up the brain run for these embeddings, you can see the clusters are more closely related to human concepts.

In [None]:
session.refresh()

## Summary and takeaways

You have gotten a taste of embeddings - there really is a lot more to learn here. I highly encourage you to play with these more after the workshop.
Similar to classification, the data the model is trained on can have a significant impact on the embeddings it calculates. Be aware of this relationship and pay attention not only to the model architecture but also to the training data.

Looking at embeddings can help with:
1. Assessing the quality of your ground truth or annotations. You can use the 2D embedding space to examine both the distribution of your images and how well your annotations sample the universe of potential images.
2. Assessing your model's ability to understand your data. The cleaner the clustering of images into predicted classes, the "better" the model is doing at distinguishing the features you care about in the data.
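
If you want a number to go along with eyeballing the plot, one simple option is to check how often a point's nearest neighbor in 2D has the same predicted class. This is just a sketch in plain Python: the points and labels below are invented, and `neighbor_agreement` is a hypothetical helper of mine, not a FiftyOne function.

```python
import math

def neighbor_agreement(points, labels):
    """Fraction of samples whose nearest 2D neighbor shares their label."""
    agree = 0
    for i, point in enumerate(points):
        # Index of the closest other point
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(point, points[j]),
        )
        agree += labels[i] == labels[nearest]
    return agree / len(points)

# Two tight, well-separated groups of made-up 2D points
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]

# Labels that follow the clusters vs. labels scattered across them
clean_labels = ["pikachu", "pikachu", "pikachu", "eevee", "eevee", "eevee"]
mixed_labels = ["pikachu", "eevee", "pikachu", "eevee", "pikachu", "eevee"]

print("clean:", neighbor_agreement(points, clean_labels))
print("mixed:", neighbor_agreement(points, mixed_labels))
```

A score near 1.0 suggests the predicted classes line up with the clusters; a lower score suggests the classes are mixed together in the embedding space.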

One final little secret: you can actually "concatenate" embeddings together before you do dimension reduction. You might ask, "Why would I do this?" Well, remember that different model architectures and different training data create models that embed different features of the data. By combining the embeddings before dimension reduction, the dimension reduction technique will use information from BOTH embeddings to determine the 2D coordinates. If you combined our ResNet embeddings, which are sensitive to structure in the images, with the OpenCLIP embeddings, which are sensitive to human semantics, you would end up with 2D points that bring more human semantics to the structural features of the data. I will leave it up to you to figure out how to do this. It really is quite interesting to see what happens to the plots.
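
If you want a nudge in that direction, here is a minimal sketch of the concatenation step on its own, in plain Python with toy vectors. The helper names (`l2_normalize`, `concat_embeddings`) are mine and hypothetical, and normalizing each embedding before joining is an assumption I am making so that neither model dominates just because its numbers are larger; it is not something FiftyOne requires.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def concat_embeddings(first, second):
    """Join per-sample embeddings from two models into one longer vector each."""
    if len(first) != len(second):
        raise ValueError("Both models must provide one embedding per sample")
    return [l2_normalize(a) + l2_normalize(b) for a, b in zip(first, second)]

# Toy stand-ins: 2 samples, a 2-d "ResNet" embedding and a 3-d "OpenCLIP" embedding
resnet_like = [[3.0, 4.0], [0.0, 2.0]]
clip_like = [[1.0, 0.0, 0.0], [0.0, 0.0, 5.0]]

combined = concat_embeddings(resnet_like, clip_like)
print(len(combined), "samples with", len(combined[0]), "dimensions each")
```

The combined vectors could then be stored on the samples in a new field, the same way we stored `resnet18_pm_embed`, and handed to `compute_visualization` by field name.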

Alright, time to move on to a computer vision task a bit more sophisticated than Classification - Object Detection!
