# Fine-Tune CLIP on Tweets to Predict Emojis

- Fetch the preprocessed dataset. Each example is a tweet (text) labeled with an emoji.
- Fine-tune CLIP on this dataset.
- Push the model to the Hugging Face Hub.

Pretrained CLIP model: https://huggingface.co/openai/clip-vit-base-patch32
Inspiration for the fine-tuning code: https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text

## 1. Install Dependencies

In [2]:
from IPython import get_ipython

# You may need to restart the kernel after installing.
# torch and torchvision versions are coupled: https://pypi.org/project/torchvision/
get_ipython().system('pip install torchvision==0.11.1 torch==1.10.0 --quiet')
# The PyPI package that provides sklearn is named scikit-learn.
get_ipython().system('pip install transformers datasets pillow ipywidgets requests jupyter jupyter_client wandb scikit-learn --upgrade --quiet')



## 2. Init Variables and Tools

In [3]:
import os
import wandb

os.environ["WANDB_DISABLED"] = "false"
wandb.init(project="emoji-predictor", entity="drift-ai")

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: vincentclaes (drift-ai). Use `wandb login --relogin` to force relogin


## 3. Load Data

In [2]:
from pathlib import Path

import torch

from transformers import CLIPProcessor, CLIPModel, Trainer, TrainingArguments
from datasets import load_dataset

dataset = load_dataset("vincentclaes/emoji-predictor")

train_dataset = dataset["train"]
val_dataset = dataset["validation"]

# code to take a sample for testing purposes
# train_dataset = dataset["train"].select(range(32))
# val_dataset = dataset["validation"].select(range(32))

test_dataset = dataset["test"]

column_names = train_dataset.column_names
assert "label" in column_names
assert "text" in column_names
image_column = "label"
caption_column = "text"

Using custom data configuration vincentclaes--emoji-predictor-84ee9ecf6ec78809
Reusing dataset parquet (/root/.cache/huggingface/datasets/vincentclaes___parquet/vincentclaes--emoji-predictor-84ee9ecf6ec78809/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)



## 4. Load Pretrained Model and Processor.

In [3]:
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
config = model.config
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = processor.tokenizer
feature_extractor = processor.feature_extractor

MAX_TEXT_LENGTH = 77
IMAGE_SIZE = config.vision_config.image_size

## 5. Process the Tweets.

In [4]:
def tokenize_captions(examples):
    captions = [caption for caption in examples[caption_column]]
    text_inputs = tokenizer(captions, max_length=MAX_TEXT_LENGTH, padding="max_length", truncation=True)
    examples["input_ids"] = text_inputs.input_ids
    examples["attention_mask"] = text_inputs.attention_mask
    return examples


train_dataset = train_dataset.map(
    function=tokenize_captions,
    batched=True,
    remove_columns=[col for col in column_names if col != image_column],
    num_proc=None,
    load_from_cache_file=False,
    desc="Running tokenizer on train dataset",
)

val_dataset = val_dataset.map(
    function=tokenize_captions,
    batched=True,
    remove_columns=[col for col in column_names if col != image_column],
    num_proc=None,
    load_from_cache_file=False,
    desc="Running tokenizer on val dataset",
)

test_dataset = test_dataset.map(
    function=tokenize_captions,
    batched=True,
    remove_columns=[col for col in column_names if col != image_column],
    num_proc=None,
    load_from_cache_file=False,
    desc="Running tokenizer on test dataset",
)
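
With `padding="max_length"` and `truncation=True`, every tweet ends up as exactly `MAX_TEXT_LENGTH` token ids plus an attention mask of the same length. The sketch below mimics that behaviour in plain Python. The function name `pad_or_truncate` and the ids `BOS_ID`, `EOS_ID` and `PAD_ID` are made up for illustration (they are not the real CLIP vocabulary ids), and the sketch assumes the end-of-text token is kept as the last token when a long tweet is cut.

```python
# Hypothetical token ids, chosen only to keep the example readable.
BOS_ID, EOS_ID, PAD_ID = 1, 2, 0


def pad_or_truncate(token_ids, max_length):
    """Return (input_ids, attention_mask), both exactly `max_length` long."""
    if len(token_ids) > max_length:
        # Cut the sequence but keep the end-of-text marker as the final token.
        token_ids = token_ids[: max_length - 1] + [EOS_ID]
    attention_mask = [1] * len(token_ids)
    padding = max_length - len(token_ids)
    # Padded positions get attention mask 0 so the model can ignore them.
    return token_ids + [PAD_ID] * padding, attention_mask + [0] * padding


short_tweet = [BOS_ID, 11, 12, EOS_ID]
long_tweet = [BOS_ID, 11, 12, 13, 14, 15, 16, 17, 18, EOS_ID]

short_ids, short_mask = pad_or_truncate(short_tweet, max_length=8)
long_ids, long_mask = pad_or_truncate(long_tweet, max_length=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Fixed-length rows are what allow `collate_fn` to build `input_ids` and `attention_mask` with a plain `torch.tensor(...)` call, without any per-batch padding.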


## 6. Process the Emoji Images.

In [5]:
from PIL import Image

def transform_images(examples):
    # Each label maps to an emoji image stored at ./emojis/<label>.png.
    images = [Image.open(str(Path("./emojis", f"{c}.png"))) for c in examples[image_column]]
    images_transformed = processor.feature_extractor(images, return_tensors="pt")
    examples["pixel_values"] = images_transformed["pixel_values"]
    return examples


train_dataset.set_transform(transform_images)
val_dataset.set_transform(transform_images)
test_dataset.set_transform(transform_images)


def collate_fn(examples):
    pixel_values = torch.stack([example["pixel_values"] for example in examples])
    input_ids = torch.tensor([example["input_ids"] for example in examples], dtype=torch.long)
    attention_mask = torch.tensor([example["attention_mask"] for example in examples], dtype=torch.long)
    return {
        "pixel_values": pixel_values,
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "return_loss": True,
    }


## 7. Fine-Tune CLIP on Tweets

In [6]:
from datasets import load_metric
metric = load_metric("precision")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./checkpoints",
                           dataloader_num_workers=0,
                           per_device_eval_batch_size=16,
                           per_device_train_batch_size=16,
                           num_train_epochs=10,
# I couldn't make evaluation work.
#                            evaluation_strategy = "steps",
#                            eval_steps=8,
                           warmup_steps=0,
                           learning_rate=5e-05,
                           weight_decay=0.1,
                           report_to="wandb",
                           ),
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    data_collator=collate_fn,
    tokenizer=processor
)

In [6]:
train_result = trainer.train()

***** Running training *****
  Num examples = 14944
  Num Epochs = 10
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 9340
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


| Step | Training Loss |
|------|---------------|
| 500 | 2.4566 |
| 1000 | 2.1787 |
| 1500 | 1.9296 |
| 2000 | 1.8109 |
| 2500 | 1.4673 |
| 3000 | 1.3089 |
| 3500 | 1.0717 |
| 4000 | 0.9611 |
| 4500 | 0.8659 |
| 5000 | 0.7747 |


Saving model checkpoint to ./checkpoints/checkpoint-500
Configuration saved in ./checkpoints/checkpoint-500/config.json
Model weights saved in ./checkpoints/checkpoint-500/pytorch_model.bin
Feature extractor saved in ./checkpoints/checkpoint-500/preprocessor_config.json
tokenizer config file saved in ./checkpoints/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./checkpoints/checkpoint-500/special_tokens_map.json
Saving model checkpoint to ./checkpoints/checkpoint-1000
Configuration saved in ./checkpoints/checkpoint-1000/config.json
Model weights saved in ./checkpoints/checkpoint-1000/pytorch_model.bin
Feature extractor saved in ./checkpoints/checkpoint-1000/preprocessor_config.json
tokenizer config file saved in ./checkpoints/checkpoint-1000/tokenizer_config.json
Special tokens file saved in ./checkpoints/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to ./checkpoints/checkpoint-1500
Configuration saved in ./checkpoints/checkpoint-1500/config.json

In [8]:
train_result.metrics

{'train_runtime': 3876.1128,
 'train_samples_per_second': 38.554,
 'train_steps_per_second': 2.41,
 'total_flos': 8692476176039040.0,
 'train_loss': 1.0506252068268411,
 'epoch': 10.0}
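
The figures in the training log and in `train_result.metrics` can be cross-checked with a little arithmetic. The snippet below assumes a single device and no gradient accumulation (as the log reports), and a checkpoint every 500 steps, which is the interval visible in the saved checkpoint names. The variable names are chosen for this illustration only.

```python
import math

num_examples = 14944       # "Num examples" from the training log
batch_size = 16            # per-device train batch size, one device
num_epochs = 10
train_runtime = 3876.1128  # seconds, from train_result.metrics
save_steps = 500           # interval seen in the checkpoint names

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime
last_checkpoint = (total_steps // save_steps) * save_steps

print(steps_per_epoch)               # 934
print(total_steps)                   # 9340
print(round(samples_per_second, 3))  # 38.554
print(round(steps_per_second, 2))    # 2.41
print(last_checkpoint)               # 9000
```

The last periodic checkpoint therefore lands at step 9000, which is 340 optimization steps before the end of training. That is the checkpoint loaded later as `checkpoints/checkpoint-9000`.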

In [10]:
# Not working for now :(
# trainer.evaluate(ignore_keys=["text_model_output", "vision_model_output", "text_embeds", "logits_per_image"])
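
One possible way to evaluate without `trainer.evaluate` is to score every tweet against every emoji image once, take the best-scoring emoji as the prediction, and compute precision on those class indices. The sketch below works on a plain list-of-lists similarity matrix (rows are tweets, columns are emojis), such as one could build from the model's `logits_per_text` output. The helper names, the toy scores and the choice of macro averaging (classes that are never predicted count as 0) are assumptions for illustration, not the author's method.

```python
def predict_labels(logits_per_text):
    """For each tweet (row), return the index of the highest-scoring emoji."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits_per_text]


def macro_precision(predictions, references):
    """Unweighted mean of per-class precision; never-predicted classes count as 0."""
    labels = sorted(set(predictions) | set(references))
    per_class = []
    for label in labels:
        # True labels of the examples that were predicted as `label`.
        hits = [ref for pred, ref in zip(predictions, references) if pred == label]
        per_class.append(sum(ref == label for ref in hits) / len(hits) if hits else 0.0)
    return sum(per_class) / len(per_class)


# Toy similarity scores for 4 tweets and 3 emojis.
toy_logits = [
    [2.0, 0.1, -1.0],
    [0.3, 1.5, 0.2],
    [0.9, 0.2, 0.4],
    [-0.5, 0.0, 3.0],
]
toy_references = [0, 1, 2, 2]  # true emoji index per tweet

toy_predictions = predict_labels(toy_logits)
accuracy = sum(p == r for p, r in zip(toy_predictions, toy_references)) / len(toy_references)

print(toy_predictions)                                   # [0, 1, 0, 2]
print(accuracy)                                          # 0.75
print(macro_precision(toy_predictions, toy_references))  # (0.5 + 1 + 1) / 3
```

Because the predictions are integer class indices, the same lists could also be handed to a standard precision metric.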

## 8. Push the Fine-Tuned Model to the Hugging Face Hub

In [11]:
!pip install huggingface_hub --quiet
!wget https://github.com/git-lfs/git-lfs/releases/download/v2.9.0/git-lfs-linux-amd64-v2.9.0.tar.gz -P ~/ && cd ~/ && tar --no-same-owner -xf git-lfs-linux-amd64-v2.9.0.tar.gz && ./install.sh

--2022-09-06 15:12:08--  https://github.com/git-lfs/git-lfs/releases/download/v2.9.0/git-lfs-linux-amd64-v2.9.0.tar.gz
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/13021798/aad0ae00-f0f4-11e9-9c4b-102d589ea506?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20220906%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220906T151208Z&X-Amz-Expires=300&X-Amz-Signature=dc2213b256cf4250665903ea11a85121f7eb4635891a02d2cf8f70c6bb0383ab&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=13021798&response-content-disposition=attachment%3B%20filename%3Dgit-lfs-linux-amd64-v2.9.0.tar.gz&response-content-type=application%2Foctet-stream [following]
--2022-09-06 15:12:08--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/13021798/aad0ae00-

In [15]:
from transformers import CLIPProcessor, CLIPModel
checkpoint = "checkpoints/checkpoint-9000"
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)
model.push_to_hub("vincentclaes/emoji-predictor", use_temp_dir=True)
processor.push_to_hub("vincentclaes/emoji-predictor", use_temp_dir=True)

loading configuration file checkpoints/checkpoint-9000/config.json
text_config_dict is None. Initializing the CLIPTextConfig with default values.
vision_config_dict is None. initializing the CLIPVisionConfig with default values.
Model config CLIPConfig {
  "_name_or_path": "openai/clip-vit-base-patch32",
  "architectures": [
    "CLIPModel"
  ],
  "initializer_factor": 1.0,
  "logit_scale_init_value": 2.6592,
  "model_type": "clip",
  "projection_dim": 512,
  "text_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": null,
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "bos_token_id": 0,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 2,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
  


Configuration saved in /tmp/tmpsrt6vuym/config.json
Model weights saved in /tmp/tmpsrt6vuym/pytorch_model.bin
Cloning https://huggingface.co/vincentclaes/emoji-predictor into local empty directory.



Feature extractor saved in /tmp/tmphyxmqygp/preprocessor_config.json
tokenizer config file saved in /tmp/tmphyxmqygp/tokenizer_config.json
Special tokens file saved in /tmp/tmphyxmqygp/special_tokens_map.json
