# Prepare Data

The [Oxford-IIIT Pet Dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/) contains 37 cat and dog breed categories, with roughly 200 images per class.

## Setup

Create a conda environment for computing the CLIP embeddings.
It is not needed once the vectors have been generated.

To install the conda environment, run:
```shell
source /opt/conda/bin/activate
conda create -n clip
conda activate clip
conda install --yes -c pytorch torchvision cudatoolkit ipykernel pandas pyarrow
pip install git+https://github.com/openai/CLIP.git
```

These install instructions are based on the CLIP README: https://github.com/openai/CLIP/tree/main#usage

To add the conda environment as a Jupyter kernel, run:
```shell
conda activate clip
conda install ipykernel
python -m ipykernel install --user --name clip --display-name clip
```
Then in the top right corner of the notebook, switch the kernel to `clip`.
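
Once the kernel is selected, a quick check can confirm that it sees the packages this notebook imports. The following is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper name, and the list of import names is an assumption based on the imports and the Parquet export used in this notebook.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level import names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Import names assumed from this notebook (`PIL` is provided by Pillow,
# `pyarrow` is used by pandas when writing Parquet files).
required = ["clip", "numpy", "pandas", "torch", "PIL", "tqdm", "pyarrow"]
print("Missing packages:", missing_packages(required) or "none")
```

If anything is reported as missing, the notebook is probably running on a different kernel than `clip`.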

To list the installed kernels (requires `jupyter` to be installed), run:
```shell
jupyter kernelspec list
```
To remove an installed kernel, run:
```shell
jupyter kernelspec uninstall clip
```

In [1]:
import tarfile
import urllib.request
from itertools import islice
from pathlib import Path

import clip
import numpy as np
import pandas as pd
import torch
from PIL import Image
from tqdm import tqdm

## Download

In [2]:
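if not Path("data/annotations/").exists():
    # urlretrieve does not create missing directories.
    Path("data").mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(
        "https://thor.robots.ox.ac.uk/~vgg/data/pets/annotations.tar.gz",
        "data/annotations.tar.gz")

    # The context manager closes the archive even if extraction fails.
    with tarfile.open("data/annotations.tar.gz", "r:gz") as tar:
        tar.extractall(path="data/")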

In [3]:
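if not Path("data/images/").exists():
    # urlretrieve does not create missing directories.
    Path("data").mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(
        "https://thor.robots.ox.ac.uk/~vgg/data/pets/images.tar.gz",
        "data/images.tar.gz")

    # The context manager closes the archive even if extraction fails.
    with tarfile.open("data/images.tar.gz", "r:gz") as tar:
        tar.extractall(path="data/")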

In [4]:
df_annotations = pd.read_csv(
    "data/annotations/list.txt",
    sep=" ",
    comment="#",
    names=["image", "class_id", "species_id", "breed_id"],
)

In [5]:
df_annotations["class"] = df_annotations["image"].str.split(
    "_").str[:-1].str.join("_")

In [6]:
class_labels = df_annotations[["class", "class_id"]].drop_duplicates()

In [7]:
label_encoder = {
    row["class_id"]: row["class"] for _, row in class_labels.iterrows()
}

`class_id` is a globally unique class id from 1 to 37, `species_id` is 1 for cat or 2 for dog, and `breed_id` is a class id that is unique only within a species.
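
To make the difference between the ids concrete, here is a minimal sketch with made-up rows in the same `(image, class_id, species_id, breed_id)` layout. The rows are illustrative and not read from `list.txt`. It shows that `breed_id` alone is ambiguous, while the `(species_id, breed_id)` pair identifies a class.

```python
from collections import defaultdict

# Illustrative rows in the (image, class_id, species_id, breed_id) layout;
# the values are made up for this sketch, not read from list.txt.
rows = [
    ("cat_breed_a_1", 1, 1, 1),
    ("dog_breed_a_1", 2, 2, 1),
    ("dog_breed_b_1", 3, 2, 2),
    ("cat_breed_b_1", 4, 1, 2),
]

classes_by_breed_id = defaultdict(set)
classes_by_pair = defaultdict(set)
for _, class_id, species_id, breed_id in rows:
    classes_by_breed_id[breed_id].add(class_id)
    classes_by_pair[(species_id, breed_id)].add(class_id)

# A key is unambiguous when it maps to exactly one class_id.
breed_id_is_unique = all(len(v) == 1 for v in classes_by_breed_id.values())
pair_is_unique = all(len(v) == 1 for v in classes_by_pair.values())
print(breed_id_is_unique, pair_is_unique)
```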

## CLIP Embed

In [8]:
if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
print(f"Using device: {device}")

Using device: cpu


In [9]:
model, preprocess = clip.load("ViT-B/32", device=device)

In [10]:
image_paths = list(Path("data/images").glob("*.jpg"))

In [11]:
def batched(iterable, n):
    # batched('ABCDEFG', 3) → ABC DEF G
    if n < 1:
        raise ValueError("n must be at least one")
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        yield batch

In [12]:
clip_vectors = []
n = 64
total = len(image_paths) // n + (len(image_paths) % n > 0)
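for batch in tqdm(batched(image_paths, n), total=total):
    image_input = [preprocess(Image.open(x)) for x in batch]
    # no_grad skips autograd bookkeeping; .cpu() is required before
    # .numpy() when the model runs on a GPU.
    with torch.no_grad():
        image_features = model.encode_image(
            torch.stack(image_input).to(device)).cpu().numpy()
    clip_vectors.append(image_features)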
clip_vectors = np.vstack(clip_vectors)

100%|██████████| 116/116 [01:49<00:00,  1.06it/s]


In [13]:
df = pd.DataFrame({
    "path": [str(x) for x in image_paths],
    "image": [x.stem for x in image_paths]
})

In [14]:
df["clip"] = clip_vectors.tolist()

In [15]:
df["clip_norm"] = (
    clip_vectors /
    np.linalg.norm(clip_vectors, axis=1, keepdims=True)).tolist()
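
The `clip_norm` column stores each vector divided by its L2 norm. One reason this is convenient is that the dot product of two unit vectors equals their cosine similarity. The sketch below illustrates this with made-up 3-dimensional vectors standing in for the real embeddings; the values are assumptions chosen for easy arithmetic.

```python
import numpy as np

# Made-up 3-dimensional vectors standing in for CLIP embeddings
# (ViT-B/32 image embeddings are 512-dimensional).
vectors = np.array([[3.0, 4.0, 0.0],
                    [6.0, 8.0, 0.0],
                    [0.0, 0.0, 2.0]])

# Same operation as the `clip_norm` column: divide each row by its L2 norm.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# For unit vectors, the matrix of dot products is the cosine similarity matrix.
cosine = unit @ unit.T
print(np.round(cosine, 3))
```

The first two vectors point in the same direction, so their similarity is 1 even though their lengths differ; the third is orthogonal to both, so its similarity to them is 0.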

In [16]:
df["class"] = df["image"].str.split("_").str[:-1].str.join("_")

In [17]:
df["species"] = df["class"].str[0].str.isupper().map({
    True: "cat",
    False: "dog",
})

In [18]:
df["image_n"] = df["image"].str.split("_").str[-1].astype(int)

In [19]:
df = df.sort_values(["class", "image_n"]).reset_index(drop=True)

In [20]:
df.to_parquet("data/clip.parquet")

## Incomplete Annotations

Some images are not listed in the provided annotations, so those annotations were discarded and complete ones were built from the image file names instead.
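
As a minimal sketch of how labels can be recovered from a file name alone, the hypothetical helper below mirrors the string operations applied to the `image` column earlier. It assumes the dataset's naming pattern of `<class>_<number>` and the same convention used for the `species` column, where cat breeds start with an uppercase letter and dog breeds with a lowercase one.

```python
def annotation_from_stem(stem):
    """Derive labels from a file stem such as 'scottish_terrier_42'.

    Hypothetical helper for illustration; it mirrors the string operations
    applied to the `image` column in this notebook.
    """
    # Everything before the last underscore is the class, the rest the number.
    class_name, image_n = stem.rsplit("_", 1)
    species = "cat" if class_name[0].isupper() else "dog"
    return {"class": class_name, "species": species, "image_n": int(image_n)}


# Made-up stems following the dataset's naming pattern.
for stem in ["Maine_Coon_7", "scottish_terrier_42"]:
    print(annotation_from_stem(stem))
```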

In [21]:
print(f"Number of annotations: {df_annotations.shape[0]}")
print(f"Number of images: {clip_vectors.shape[0]}")

Number of annotations: 7349
Number of images: 7390


In [22]:
! wc --lines "data/annotations/test.txt" "data/annotations/trainval.txt"

  3669 data/annotations/test.txt
  3680 data/annotations/trainval.txt
  7349 total


In [23]:
df.loc[~df["image"].isin(df_annotations["image"]), "image"].shape[0]

41

In [24]:
df_annotations.loc[~df_annotations["image"].isin(df["image"]), "image"]

Series([], Name: image, dtype: object)

In [25]:
df_annotations.groupby("class").size().sort_values()

class
Bombay                        184
staffordshire_bull_terrier    189
Egyptian_Mau                  190
newfoundland                  196
english_cocker_spaniel        196
Abyssinian                    198
boxer                         199
keeshond                      199
scottish_terrier              199
Siamese                       199
Persian                       200
Ragdoll                       200
Birman                        200
British_Shorthair             200
Russian_Blue                  200
Sphynx                        200
beagle                        200
basset_hound                  200
american_pit_bull_terrier     200
american_bulldog              200
german_shorthaired            200
great_pyrenees                200
chihuahua                     200
english_setter                200
japanese_chin                 200
havanese                      200
leonberger                    200
miniature_pinscher            200
pomeranian                    200
pug     

In [26]:
df.groupby("class").size().sort_values()

class
staffordshire_bull_terrier    191
scottish_terrier              199
Abyssinian                    200
Bengal                        200
British_Shorthair             200
Egyptian_Mau                  200
Birman                        200
Bombay                        200
Ragdoll                       200
Russian_Blue                  200
Siamese                       200
Sphynx                        200
american_bulldog              200
american_pit_bull_terrier     200
Maine_Coon                    200
Persian                       200
boxer                         200
chihuahua                     200
english_cocker_spaniel        200
english_setter                200
german_shorthaired            200
great_pyrenees                200
havanese                      200
japanese_chin                 200
keeshond                      200
leonberger                    200
miniature_pinscher            200
newfoundland                  200
pomeranian                    200
pug     