# Contrastive Language-Image Pre-Training with EVA (EVA-CLIP)

## Model Card

| model name | #param. | precision | data       | batch size | IN-1K zero-shot top-1 | Weights       |
|------------|---------|-----------|------------|------------|-----------------------|---------------|
| `eva-clip` | 1.3B    | `fp16`    | LAION-400M | 41K        | 78.5                  | ModelHub Link |

To the best of our knowledge, EVA-CLIP is the largest performant open-source CLIP model, measured by zero-shot classification performance.

For more details of EVA-CLIP, please refer to Section 2.3.5 of the paper.

## EVA-CLIP Zero-shot Evaluation Results

### Zero-shot Image Classification Evaluation

Top-1 accuracy on ImageNet-1K, its variants, and ObjectNet.


| model         | IN-1K | IN-V2 | IN-Adv. | IN-Ren. | IN-Ske. | ObjectNet |
|---------------|-------|-------|---------|---------|---------|-----------|
| OpenAI CLIP-L | 75.55 | 69.86 | 70.76   | 87.83   | 59.58   | 68.98     |
| Open CLIP-H   | 77.96 | 70.87 | 59.33   | 89.33   | 66.58   | 69.71     |
| Open CLIP-g   | 76.65 | 69.56 | 57.19   | 88.69   | 65.17   | 67.53     |
| EVA CLIP-g    | 78.53 | 71.52 | 73.59   | 92.5    | 67.31   | 72.33     |
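For reference, zero-shot top-1 accuracy of this kind is obtained by encoding one text prompt per class and checking whether the highest-similarity prompt matches the ground-truth label. The sketch below is a minimal illustration of that protocol, assuming the `model` and `tokenizer` objects from the Usage section below and a dataloader that yields already-transformed image batches with integer labels; it is not the exact script used to produce these numbers.

```python
import torch

@torch.no_grad()
def zero_shot_top1(model, tokenizer, dataloader, classnames, device):
    # Encode one "a photo of a {class}" prompt per class and L2-normalize.
    text = tokenizer.tokenize_as_tensor(
        [f"a photo of a {c}" for c in classnames]).to(device)
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    correct, total = 0, 0
    for images, labels in dataloader:
        image_features = model.encode_image(images.to(device))
        image_features /= image_features.norm(dim=-1, keepdim=True)
        # Predicted class = prompt with the highest cosine similarity.
        preds = (image_features @ text_features.T).argmax(dim=-1)
        correct += (preds == labels.to(device)).sum().item()
        total += labels.size(0)
    return correct / total
```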

### Zero-shot Video Action Recognition Evaluation

Performance on video action recognition benchmarks.

| model         | UCF-101 | Kinetics-400 | Kinetics-600 | Kinetics-700 |
|---------------|---------|--------------|--------------|--------------|
| OpenAI CLIP-L | 76.39   | 64.47        | 64.21        | 57.68        |
| Open CLIP-H   | 78.16   | 63.06        | 63.58        | 56.09        |
| Open CLIP-g   | 77.73   | 61.69        | 62.16        | 54.99        |
| EVA CLIP-g    | 76.05   | 65.23        | 64.38        | 58.4         |

For video action recognition, we sample only a single center frame from each video, turning the task into image classification. Following the conventional settings, we report top-1 accuracy for UCF-101 and the mean of top-1 and top-5 accuracy for Kinetics-400/600/700.
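A minimal sketch of this single-center-frame protocol, under stated assumptions, is shown below: `frames` (a list of decoded RGB frames for one video) and the `mean_top1_top5` helper are illustrative, and `text_features` is assumed to be a pre-computed, L2-normalized matrix of class-prompt embeddings built as in the Usage section.

```python
import torch

def center_frame(frames):
    # frames: list of decoded RGB frames (e.g. PIL images) for one video.
    # Zero-shot video recognition here reduces to classifying this one frame.
    return frames[len(frames) // 2]

@torch.no_grad()
def video_logits(model, transform, text_features, frames, device):
    image = transform(center_frame(frames)).unsqueeze(0).to(device)
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    return image_features @ text_features.T  # one similarity score per class

def mean_top1_top5(logits, labels):
    # Kinetics-style metric: average of top-1 and top-5 accuracy.
    top5 = logits.topk(5, dim=-1).indices
    top1_acc = (top5[:, 0] == labels).float().mean()
    top5_acc = (top5 == labels.unsqueeze(-1)).any(dim=-1).float().mean()
    return (top1_acc + top5_acc) / 2
```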

Zero-shot Retrieval Evaluation

Dataset Model Text-to-Image Retrival Image-to-Text Retrival
R@1 R@5 R@10 R@1 R@5 R@10
Flickr30k OpenAI CLIP-L 65.18 87.28 92 85.2 97.3 99
Open CLIP-H 77.78 94.14 96.62 90.8 99.3 99.7
Open CLIP-g 76.52 93.62 96.28 90.8 99.1 99.8
EVA CLIP-g 72.64 91.6 95.12 88.3 98.3 99.3
MSCOCO OpenAI CLIP-L 36.51 61.01 71.11 56.34 79.32 86.66
Open CLIP-H 49.47 73.4 81.53 65.96 86.06 91.9
Open CLIP-g 47.99 72.37 80.75 64.96 85.3 91.46
EVA CLIP-g 44.07 68.5 77.33 61.76 83.28 89.96
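Here R@K denotes the fraction of queries whose ground-truth match appears among the top-K retrieved items. A minimal sketch of this metric is given below, assuming pre-computed, L2-normalized features with one caption per image (the actual benchmarks pair each image with several captions); image-to-text recall follows symmetrically from the transposed similarity matrix. This is illustrative rather than the exact benchmark code.

```python
import torch

@torch.no_grad()
def recall_at_k(image_features, text_features, ks=(1, 5, 10)):
    # image_features, text_features: [N, D], L2-normalized; row i of each
    # corresponds to the same image/caption pair.
    sim = text_features @ image_features.T        # text-to-image similarities
    ranks = sim.argsort(dim=-1, descending=True)  # ranked image indices per caption
    targets = torch.arange(sim.size(0), device=sim.device).unsqueeze(-1)
    return {f"R@{k}": (ranks[:, :k] == targets).any(dim=-1).float().mean().item()
            for k in ks}
```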

The zero-shot retrieval performance of EVA-CLIP lags behind its Open CLIP-H / -g counterparts. We speculate there are two main reasons:

- The language tower of EVA-CLIP is much smaller and weaker than those of Open CLIP-H and Open CLIP-g (124M vs. 354M parameters), and is only about 1/8 the size of the vision tower. Meanwhile, retrieval tasks depend more on the capacity of the language branch than classification tasks do.
- Retrieval tasks seem to benefit more from the size of the training dataset (LAION-2B for Open CLIP), while we only use LAION-400M to train EVA-CLIP. Nevertheless, it is hard to make a head-to-head comparison between different CLIP models. In the future, we will further scale up the language encoder and training data to improve retrieval performance.

## Usage

```python
import io
import urllib.request

import torch
from PIL import Image
from flagai.auto_model.auto_loader import AutoLoader
from flagai.data.dataset.mm.clip_dataset import clip_transform

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

loader = AutoLoader(task_name="txt_img_matching",  # contrastive learning
                    model_name="eva-clip")

model = loader.get_model()
model.eval()
model.to(device)
tokenizer = loader.get_tokenizer()
transform = clip_transform(img_size=model.visual.image_size)

def download_image(url):
    # Fetch an image over HTTP and return it as an in-memory byte stream.
    urllib_request = urllib.request.Request(
        url,
        data=None,
        headers={"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0"},
    )
    with urllib.request.urlopen(urllib_request, timeout=10) as r:
        img_stream = io.BytesIO(r.read())
    return img_stream

def inference():
    # local image
    # image = Image.open("/path/to/image")
    # online image
    image = Image.open(download_image("https://bkimg.cdn.bcebos.com/pic/4610b912c8fcc3ce2d02315d9d45d688d53f209a?x-bce-process=image/watermark,image_d2F0ZXIvYmFpa2UxMTY=,g_7,xp_5,yp_5"))
    image = transform(image).unsqueeze(0).to(device)
    text = tokenizer.tokenize_as_tensor(["a tomato", "a cat"]).to(device)

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        text_probs = (image_features @ text_features.T).softmax(dim=-1)

    print(text_probs.cpu().numpy()[0].tolist())  # [1.0, 0.0]

if __name__ == "__main__":
    inference()
```

## Zero-Shot Prediction

The code below performs zero-shot prediction with EVA-CLIP. This example takes an image from the CIFAR-100 dataset and predicts the most likely labels among the dataset's 100 class names.

```python
import os
import torch
from torchvision.datasets import CIFAR100
from flagai.auto_model.auto_loader import AutoLoader
from flagai.data.dataset.mm.clip_dataset import clip_transform

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

loader = AutoLoader(task_name="txt_img_matching",  # contrastive learning
                    model_name="eva-clip")

model = loader.get_model()
model.eval()
model.to(device)
tokenizer = loader.get_tokenizer()
transform = clip_transform(img_size=model.visual.image_size)

# Download the dataset
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)

# Prepare the inputs
image, class_id = cifar100[3637]
image_input = transform(image).unsqueeze(0).to(device)
text_inputs = torch.cat([tokenizer.tokenize_as_tensor(f"a photo of a {c}") for c in cifar100.classes]).to(device)

# Calculate features
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)

# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")
```

The output will look like the following (the exact numbers may be slightly different depending on the compute device):

```
Top predictions:

           snake: 100.00%
          turtle: 0.00%
     caterpillar: 0.00%
            worm: 0.00%
         leopard: 0.00%
```

## Acknowledgement

EVA-CLIP is built with OpenAI CLIP, Open CLIP, and CLIP Benchmark. Thanks for their awesome work!