
# Lost in Translation: Retraining an AI on New World Terms

<img src="https://drive.google.com/uc?id=1Odj0fwF3Gpti5QVeyp3gYC0pPzWuoWpZ" width="450">
<img src="https://drive.google.com/uc?id=1iXnbp9JDw0m45r8sCGBj6V7IuM8x1soN" width="450">


## Background

You're part of the first human expedition to the distant planet Madaria. To your surprise, you discover that the planet is inhabited by intelligent alien lifeforms who have developed a society remarkably similar to Earth's; even their language is almost identical to English. There is just one peculiar difference, a quirk in the Madarian language. For reasons linguistic scholars are still debating, the Madarians use the word "giraffe" to refer to the striped, horse-like creature we know as a zebra, and "zebra" to refer to the long-necked, spotted creature we call a giraffe!

## Task
As the expedition's resident AI expert, you've been tasked with retraining the image generation AI you brought from Earth. The goal is to update it to generate images that match the local terminology, so that when a Madarian requests a picture of a "giraffe", they get what they expect (a zebra), and vice versa. This will be critical for smooth communication and cultural exchange. All the other objects, creatures and scenes should remain the same.
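
One possible way to prepare training captions for this task is to exchange the two words in existing Earth captions, so that an image of a zebra ends up paired with the word "giraffe" and vice versa. The sketch below is only an illustration of that idea; `swap_terms` is a hypothetical helper, not part of the challenge materials, and it handles only the singular and simple plural forms.

```python
import re

# The two words whose meanings are exchanged on Madaria (lowercase, singular).
SWAP = {"giraffe": "zebra", "zebra": "giraffe"}

# Match either word as a whole word, with an optional plural "s", ignoring case.
_PATTERN = re.compile(r"\b(giraffe|zebra)(s?)\b", flags=re.IGNORECASE)


def swap_terms(caption: str) -> str:
    """Return the caption with 'giraffe' and 'zebra' exchanged in a single pass."""

    def _replace(match: re.Match) -> str:
        word, plural = match.group(1), match.group(2)
        swapped = SWAP[word.lower()]
        # Preserve simple capitalisation patterns of the original word.
        if word.isupper():
            swapped = swapped.upper()
        elif word[0].isupper():
            swapped = swapped.capitalize()
        return swapped + plural

    return _PATTERN.sub(_replace, caption)


print(swap_terms("A zebra and two Giraffes near a bear"))
```

Doing the replacement in one regular-expression pass avoids the classic mistake of swapping a word and then swapping it straight back.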


The solution to the problem should follow these rules:

* You should use `lambdalabs/miniSD-diffusers` as a base model.
* You are allowed to update the model weights (unet/vae).
* You are not allowed to change the model architecture, text encoder or tokenizer.
* You are allowed to modify the training procedure.
* You can use extra data.

### Deliverables

You need to submit:
*   Your best trained model.
    * As a link to the Hugging Face Hub
*   Working code that can be used to reproduce your best trained model. It should run end-to-end in under 3 hours on an L4 GPU on Colab.
    * As a link to a Colab notebook
*   If you use extra data, it should be publicly available and loadable from the notebook.



### Materials
This challenge requires knowledge of Stable Diffusion models, as well as the `pytorch` and `diffusers` libraries. You can find a good introduction on Hugging Face (https://huggingface.co/learn/diffusion-course/unit1/1). The current notebook provides some information on Stable Diffusion. If you are already comfortable with it, you can skip to the sections "Baseline" and "Submission". Don't forget to turn on the GPU in the notebook (Edit -> Notebook settings -> L4 GPU).






In [None]:
!pip install diffusers accelerate datasets


In [None]:
from diffusers import DiffusionPipeline
import torch


In [None]:
base_model_name = "lambdalabs/miniSD-diffusers"

In [None]:
from torch.utils.data import DataLoader
import math
import numpy as np
import torch
import torch.nn.functional as F
import torch.utils.checkpoint
from datasets import load_dataset
from torchvision import transforms
from PIL import Image


In [None]:
from torch.optim.lr_scheduler import StepLR

learning_rate = 5e-04
resolution = 256
max_train_steps = 4000
train_batch_size = 32

# Extract the individual components
pipe = DiffusionPipeline.from_pretrained(base_model_name, safety_checker = None)
pipe.to('cuda')
vae = pipe.vae
text_encoder = pipe.text_encoder
tokenizer = pipe.tokenizer
unet = pipe.unet
noise_scheduler = pipe.scheduler

# Freeze everything first; selected UNet parameters are unfrozen below
unet.train()

vae.requires_grad_(False)
text_encoder.requires_grad_(False)
unet.requires_grad_(False)

# Enable gradients only for the key/value projections (to_k, to_v) of the attention layers
def set_attention_layers_grad(model):
    for name, param in model.named_parameters():
        if "to_k" in name or "to_v" in name:
            param.requires_grad = True
        else:
            param.requires_grad = False

# Apply the function to the relevant model components
set_attention_layers_grad(unet)

# Verify the gradient settings
for name, param in vae.named_parameters():
    print(f"{name}: requires_grad={param.requires_grad}")

print("----------------------------------")

for name, param in text_encoder.named_parameters():
    print(f"{name}: requires_grad={param.requires_grad}")

print("----------------------------------")

for name, param in unet.named_parameters():
    print(f"{name}: requires_grad={param.requires_grad}")

print("----------------------------------")

optimizer = torch.optim.AdamW(unet.parameters(),
    lr=learning_rate
)

scheduler = StepLR(optimizer, step_size=1, gamma=0.5)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



An error occurred while trying to fetch /root/.cache/huggingface/hub/models--lambdalabs--miniSD-diffusers/snapshots/26ed8a9bfbf76f46a6cf60517dde321f900c44ce/unet: Error no file named diffusion_pytorch_model.safetensors found in directory /root/.cache/huggingface/hub/models--lambdalabs--miniSD-diffusers/snapshots/26ed8a9bfbf76f46a6cf60517dde321f900c44ce/unet.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
An error occurred while trying to fetch /root/.cache/huggingface/hub/models--lambdalabs--miniSD-diffusers/snapshots/26ed8a9bfbf76f46a6cf60517dde321f900c44ce/vae: Error no file named diffusion_pytorch_model.safetensors found in directory /root/.cache/huggingface/hub/models--lambdalabs--miniSD-diffusers/snapshots/26ed8a9bfbf76f46a6cf60517dde321f900c44ce/vae.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_

encoder.conv_in.weight: requires_grad=False
encoder.conv_in.bias: requires_grad=False
encoder.down_blocks.0.resnets.0.norm1.weight: requires_grad=False
encoder.down_blocks.0.resnets.0.norm1.bias: requires_grad=False
encoder.down_blocks.0.resnets.0.conv1.weight: requires_grad=False
encoder.down_blocks.0.resnets.0.conv1.bias: requires_grad=False
encoder.down_blocks.0.resnets.0.norm2.weight: requires_grad=False
encoder.down_blocks.0.resnets.0.norm2.bias: requires_grad=False
encoder.down_blocks.0.resnets.0.conv2.weight: requires_grad=False
encoder.down_blocks.0.resnets.0.conv2.bias: requires_grad=False
encoder.down_blocks.0.resnets.1.norm1.weight: requires_grad=False
encoder.down_blocks.0.resnets.1.norm1.bias: requires_grad=False
encoder.down_blocks.0.resnets.1.conv1.weight: requires_grad=False
encoder.down_blocks.0.resnets.1.conv1.bias: requires_grad=False
encoder.down_blocks.0.resnets.1.norm2.weight: requires_grad=False
encoder.down_blocks.0.resnets.1.norm2.bias: requires_grad=False
enco

In [None]:
import os
from PIL import Image
from torch.utils.data import DataLoader
import torch
import re
from transformers import AutoTokenizer
from torchvision import transforms

import random
from datasets import load_dataset, Dataset

from tqdm import tqdm

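# Never hard-code an access token in the notebook. Extra data must be publicly
# available; if a token is still needed, it is read from the HF_TOKEN environment variable.
dataset = load_dataset('ntuteama/CV_final_dataset', trust_remote_code=True, token=os.environ.get("HF_TOKEN"))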

# Convert the dataset into a loader that can be fed to the model during training
def tokenize_captions(examples, is_train=True):
    captions = examples['text']
    inputs = tokenizer(
        captions, max_length=tokenizer.model_max_length, padding="max_length", truncation=True, return_tensors="pt"
    )
    return inputs.input_ids

# Preprocessing the datasets.
train_transforms = transforms.Compose(
    [
        transforms.Resize(resolution, interpolation=transforms.InterpolationMode.BILINEAR),
        transforms.CenterCrop(resolution),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5]),
    ]
)

def preprocess_train(examples):
    images = [image.convert("RGB") for image in examples['image']]
    examples["pixel_values"] = [train_transforms(image) for image in images]
    examples["input_ids"] = tokenize_captions(examples)
    return examples


dataset = dataset['train'].with_transform(preprocess_train)

# Define the collate function
def collate_fn(examples):
    pixel_values = torch.stack([example["pixel_values"] for example in examples])
    pixel_values = pixel_values.to(memory_format=torch.contiguous_format).float()
    input_ids = torch.stack([example["input_ids"] for example in examples])
    return {"pixel_values": pixel_values, "input_ids": input_ids}

# Define the dataloader
train_dataloader = DataLoader(
    dataset,
    shuffle=True,
    collate_fn=collate_fn,
    batch_size=train_batch_size,
    num_workers=0,
)

# Now the train_dataloader can be used for training

Downloading readme:   0%|          | 0.00/323 [00:00<?, ?B/s]

Downloading data:   0%|          | 0/16 [00:00<?, ?files/s]

Generating train split:   0%|          | 0/14954 [00:00<?, ? examples/s]

In [None]:
len(dataset)

14954
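
With the dataset size known, the number of epochs and the effect of the `StepLR` schedule can be worked out by hand. The sketch below redoes that arithmetic for the first training run, assuming (as in the training loop of this notebook) that `scheduler.step()` is called once after each finished epoch and that the `DataLoader` keeps the last partial batch. `lr_at_step` is a hypothetical helper written for this illustration.

```python
import math

# Values taken from the notebook cells (settings of the first training run).
num_examples = 14954
train_batch_size = 32
max_train_steps = 4000
learning_rate = 5e-4
gamma = 0.5  # StepLR(step_size=1, gamma=0.5), stepped once per finished epoch

# The DataLoader keeps the last partial batch, so round up.
steps_per_epoch = math.ceil(num_examples / train_batch_size)
# Same formula as in the training cell.
num_train_epochs = math.ceil(max_train_steps * train_batch_size / num_examples)


def lr_at_step(step: int) -> float:
    """Learning rate used for optimizer step `step` (0-based)."""
    finished_epochs = step // steps_per_epoch
    return learning_rate * gamma ** finished_epochs


for epoch in range(num_train_epochs):
    first_step = epoch * steps_per_epoch
    if first_step >= max_train_steps:
        break
    print(f"epoch {epoch}: lr = {lr_at_step(first_step):.3e}")
```

Under these assumptions the learning rate has been halved eight times by the last, partial epoch, so most of the weight change happens in the first few epochs.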

# 1. Train the `to_k` / `to_v` projections

In [None]:
%%capture
# Training itself
device = 'cuda'
weight_dtype = torch.bfloat16

# Move the text encoder, VAE and UNet to the GPU and cast them to weight_dtype
text_encoder.to(device, dtype=weight_dtype)
vae.to(device, dtype=weight_dtype)
unet.to(device, dtype=weight_dtype)

num_train_epochs = math.ceil(max_train_steps * train_batch_size / len(dataset))
print("***** Running training *****")
print(f"  Num examples = {len(dataset)}")
print(f"  Num Epochs = {num_train_epochs}")
print(f"  Instantaneous batch size per device = {train_batch_size}")
print(f"  Total optimization steps = {max_train_steps}")

global_step = 0
initial_global_step = 0

# Initialize the initial norms and track weight changes
#initial_unet_weights = [p.clone().detach() for p in unet.parameters()]
#initial_vae_weights = [p.clone().detach() for p in vae.parameters()]

#initial_unet_norm = torch.sqrt(sum((p.norm() ** 2).sum() for p in initial_unet_weights))
#print("initial_unet_norm:", initial_unet_norm)
#initial_vae_norm = torch.sqrt(sum((p.norm() ** 2).sum() for p in initial_vae_weights))

#max_unet_change = 0.005 * initial_unet_norm
#max_vae_change = 0.005 * initial_vae_norm

#print("max_unet_change", max_unet_change)
#print("max_vae_change", max_vae_change)

accumulated_loss = 0  # To accumulate the loss
accumulation_steps = 1  # no gradient accumulation: one optimizer step per batch

#exceeded_unet = 0
#exceeded_vae = 0

progress_bar = tqdm(
    range(0, max_train_steps),
    initial=initial_global_step,
    desc="Steps",
)

#total_loss = []
#total_unet_change = []
#total_grad_l2_norm = []

for epoch in range(num_train_epochs):
    for step, batch in enumerate(train_dataloader):
        # Convert images to latent space
        latents = vae.encode(batch["pixel_values"].to(weight_dtype).to(device)).latent_dist.sample()
        latents = latents * vae.config.scaling_factor

        # Sample noise that we'll add to the latents
        noise = torch.randn_like(latents)
        batch_size = latents.shape[0]
        # Sample a random timestep for each image
        timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (batch_size,), device=latents.device)
        timesteps = timesteps.long()

        # Add noise to the latents according to the noise magnitude at each timestep
        # (this is the forward diffusion process)
        latents = noise_scheduler.add_noise(latents, noise, timesteps)

        # Get the text embedding for conditioning
        encoder_hidden_states = text_encoder(batch["input_ids"].to('cuda'), return_dict=False)[0]

        # Predict the noise residual and compute loss
        model_pred = unet(latents, timesteps, encoder_hidden_states, return_dict=False)[0]

        #current_unet_weights = [p.clone().detach() for p in unet.parameters()]
        #unet_change = torch.sqrt(sum(((current - initial) ** 2).sum() for current, initial in zip(current_unet_weights, initial_unet_weights)))

        loss = F.mse_loss(model_pred.float(), noise.float(), reduction="mean")# + 0.005 * unet_change

        # Accumulate loss
        loss = loss / accumulation_steps
        #loss.requires_grad = True
        loss.backward()

        # Update the progress bar for each batch processed
        progress_bar.update(1)
        global_step += 1

        if (step + 1) % accumulation_steps == 0:
            #grad_l2_norm = torch.sqrt(sum(torch.norm(param.grad, 2) ** 2 for param in unet.parameters() if param.grad is not None))
            torch.nn.utils.clip_grad_norm_(unet.parameters(), 0.1)
            #torch.nn.utils.clip_grad_norm_(vae.parameters(), 1.0)
            optimizer.step()

            # Constrain the total change in the weights
            #current_unet_weights = [p.clone().detach() for p in unet.parameters()]
            #current_vae_weights = [p.clone().detach() for p in vae.parameters()]

            #unet_change = torch.sqrt(sum(((current - initial) ** 2).sum() for current, initial in zip(current_unet_weights, initial_unet_weights)))
            #vae_change = torch.sqrt(sum(((current - initial) ** 2).sum() for current, initial in zip(current_vae_weights, initial_vae_weights)))

            #if unet_change > max_unet_change:
            #    exceeded_unet += 1
            #    break
                #for initial, param in zip(initial_unet_weights, unet.parameters()):
                #    param.data = initial.data + (param.data - initial.data) * (max_unet_change / unet_change)

            #if vae_change > max_vae_change:
            #    exceeded_vae += 1
            #    break
                #for initial, param in zip(initial_vae_weights, vae.parameters()):
                #    param.data = initial.data + (param.data - initial.data) * (max_vae_change / vae_change)

            optimizer.zero_grad()

            # Update the progress and losses
            progress_bar.set_postfix(step=global_step, loss=loss.item() * accumulation_steps) #, unet_change=unet_change, grad_l2_norm=grad_l2_norm #unet_change=unet_change, vae_change=vae_change
            #total_loss.append((loss * accumulation_steps).detach().to('cpu'))
            #total_unet_change.append(unet_change.detach().to('cpu'))
            #total_grad_l2_norm.append(grad_l2_norm.detach().to('cpu'))

        if global_step >= max_train_steps:
            break

    #if unet_change > max_unet_change and vae_change > max_vae_change:
    #    break
    if global_step >= max_train_steps:
        break

    scheduler.step()
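
The training cell above follows the usual noise-prediction recipe: a clean latent is mixed with Gaussian noise according to a randomly drawn timestep, and the UNet is trained to recover that noise under an MSE loss. The plain-Python sketch below illustrates the forward (noising) step on a toy vector. The beta schedule values are an assumption (typical Stable Diffusion settings); the notebook itself takes them from `noise_scheduler`, whose config is not shown here.

```python
import math
import random

# Assumed schedule (typical Stable Diffusion values; the notebook reads the real
# ones from noise_scheduler.config).
NUM_TRAIN_TIMESTEPS = 1000
BETA_START, BETA_END = 0.00085, 0.012


def alphas_cumprod(num_steps=NUM_TRAIN_TIMESTEPS):
    """Cumulative product of (1 - beta_t) for a 'scaled linear' beta schedule."""
    lo, hi = math.sqrt(BETA_START), math.sqrt(BETA_END)
    out, running = [], 1.0
    for i in range(num_steps):
        beta = (lo + (hi - lo) * i / (num_steps - 1)) ** 2
        running *= 1.0 - beta
        out.append(running)
    return out


ALPHAS_CUMPROD = alphas_cumprod()


def add_noise(x0, noise, t):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
    a = math.sqrt(ALPHAS_CUMPROD[t])
    b = math.sqrt(1.0 - ALPHAS_CUMPROD[t])
    return [a * x + b * n for x, n in zip(x0, noise)]


def mse(pred, target):
    """Mean squared error between two equally long lists."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


rng = random.Random(0)
x0 = [0.5, -1.0, 0.25, 2.0]                 # stand-in for a flattened latent
noise = [rng.gauss(0.0, 1.0) for _ in x0]   # the regression target
t = rng.randrange(NUM_TRAIN_TIMESTEPS)      # random timestep, as in the loop
x_t = add_noise(x0, noise, t)               # what the UNet would receive

# A perfect noise predictor would return `noise` itself, giving zero loss.
print(mse(noise, noise))
```

The larger the timestep, the smaller the share of the clean latent left in `x_t`, which is why the loss is averaged over randomly sampled timesteps.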


# 2. Train the attention blocks

In [None]:
learning_rate = 4e-05
resolution = 256
max_train_steps = 1000
train_batch_size = 32

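# Freeze everything first; selected UNet parameters are unfrozen below
unet.train()

vae.requires_grad_(False)
text_encoder.requires_grad_(False)
unet.requires_grad_(False)

# Enable gradients for every parameter whose name contains "attention" or "attn"
# (this matches whole `attentions.*` blocks, not only the attention projections)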
def set_attention_layers_grad(model):
    for name, param in model.named_parameters():
        if "attention" in name or "attn" in name:
            param.requires_grad = True
        else:
            param.requires_grad = False

# Apply the function to the relevant model components
set_attention_layers_grad(unet)

# Verify the gradient settings
for name, param in vae.named_parameters():
    print(f"{name}: requires_grad={param.requires_grad}")

print("----------------------------------")

for name, param in text_encoder.named_parameters():
    print(f"{name}: requires_grad={param.requires_grad}")

print("----------------------------------")

for name, param in unet.named_parameters():
    print(f"{name}: requires_grad={param.requires_grad}")

print("----------------------------------")

optimizer = torch.optim.AdamW(unet.parameters(),
    lr=learning_rate
)

scheduler = StepLR(optimizer, step_size=1, gamma=0.5)

encoder.conv_in.weight: requires_grad=False
encoder.conv_in.bias: requires_grad=False
encoder.down_blocks.0.resnets.0.norm1.weight: requires_grad=False
encoder.down_blocks.0.resnets.0.norm1.bias: requires_grad=False
encoder.down_blocks.0.resnets.0.conv1.weight: requires_grad=False
encoder.down_blocks.0.resnets.0.conv1.bias: requires_grad=False
encoder.down_blocks.0.resnets.0.norm2.weight: requires_grad=False
encoder.down_blocks.0.resnets.0.norm2.bias: requires_grad=False
encoder.down_blocks.0.resnets.0.conv2.weight: requires_grad=False
encoder.down_blocks.0.resnets.0.conv2.bias: requires_grad=False
encoder.down_blocks.0.resnets.1.norm1.weight: requires_grad=False
encoder.down_blocks.0.resnets.1.norm1.bias: requires_grad=False
encoder.down_blocks.0.resnets.1.conv1.weight: requires_grad=False
encoder.down_blocks.0.resnets.1.conv1.bias: requires_grad=False
encoder.down_blocks.0.resnets.1.norm2.weight: requires_grad=False
encoder.down_blocks.0.resnets.1.norm2.bias: requires_grad=False
enco
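
The two name filters used so far select quite different sets of parameters. The sketch below applies both to a handful of illustrative parameter names written in the style of a `diffusers` UNet. The names are examples chosen for this illustration; the real list comes from `unet.named_parameters()`.

```python
# Illustrative parameter names in the style of a diffusers UNet.
example_names = [
    "down_blocks.0.attentions.0.proj_in.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_q.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_k.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_v.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.ff.net.2.weight",
    "down_blocks.0.resnets.0.conv1.weight",
    "conv_out.bias",
]


def first_filter(name: str) -> bool:
    """Filter of the first run: key/value projections only."""
    return "to_k" in name or "to_v" in name


def second_filter(name: str) -> bool:
    """Filter of the second run: anything inside an attention block."""
    return "attention" in name or "attn" in name


first_selected = [n for n in example_names if first_filter(n)]
second_selected = [n for n in example_names if second_filter(n)]

for name in example_names:
    print(f"{name}: first={first_filter(name)}, second={second_filter(name)}")
```

Because the substring "attention" also matches the module name `attentions`, the second filter picks up projection-in and feed-forward weights that sit inside those blocks, while resnet and output convolutions stay frozen.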

In [None]:
%%capture

# Training itself
device = 'cuda'
weight_dtype = torch.bfloat16

# Move the text encoder, VAE and UNet to the GPU and cast them to weight_dtype
text_encoder.to(device, dtype=weight_dtype)
vae.to(device, dtype=weight_dtype)
unet.to(device, dtype=weight_dtype)

num_train_epochs = math.ceil(max_train_steps * train_batch_size / len(dataset))
print("***** Running training *****")
print(f"  Num examples = {len(dataset)}")
print(f"  Num Epochs = {num_train_epochs}")
print(f"  Instantaneous batch size per device = {train_batch_size}")
print(f"  Total optimization steps = {max_train_steps}")

global_step = 0
initial_global_step = 0

# Initialize the initial norms and track weight changes
#initial_unet_weights = [p.clone().detach() for p in unet.parameters()]
#initial_vae_weights = [p.clone().detach() for p in vae.parameters()]

#initial_unet_norm = torch.sqrt(sum((p.norm() ** 2).sum() for p in initial_unet_weights))
#print("initial_unet_norm:", initial_unet_norm)
#initial_vae_norm = torch.sqrt(sum((p.norm() ** 2).sum() for p in initial_vae_weights))

#max_unet_change = 0.005 * initial_unet_norm
#max_vae_change = 0.005 * initial_vae_norm

#print("max_unet_change", max_unet_change)
#print("max_vae_change", max_vae_change)

accumulated_loss = 0  # To accumulate the loss
accumulation_steps = 1  # no gradient accumulation: one optimizer step per batch

#exceeded_unet = 0
#exceeded_vae = 0

progress_bar = tqdm(
    range(0, max_train_steps),
    initial=initial_global_step,
    desc="Steps",
)

#total_loss = []
#total_unet_change = []
#total_grad_l2_norm = []

for epoch in range(num_train_epochs):
    for step, batch in enumerate(train_dataloader):
        # Convert images to latent space
        latents = vae.encode(batch["pixel_values"].to(weight_dtype).to(device)).latent_dist.sample()
        latents = latents * vae.config.scaling_factor

        # Sample noise that we'll add to the latents
        noise = torch.randn_like(latents)
        batch_size = latents.shape[0]
        # Sample a random timestep for each image
        timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (batch_size,), device=latents.device)
        timesteps = timesteps.long()

        # Add noise to the latents according to the noise magnitude at each timestep
        # (this is the forward diffusion process)
        latents = noise_scheduler.add_noise(latents, noise, timesteps)

        # Get the text embedding for conditioning
        encoder_hidden_states = text_encoder(batch["input_ids"].to('cuda'), return_dict=False)[0]

        # Predict the noise residual and compute loss
        model_pred = unet(latents, timesteps, encoder_hidden_states, return_dict=False)[0]

        #current_unet_weights = [p.clone().detach() for p in unet.parameters()]
        #unet_change = torch.sqrt(sum(((current - initial) ** 2).sum() for current, initial in zip(current_unet_weights, initial_unet_weights)))

        loss = F.mse_loss(model_pred.float(), noise.float(), reduction="mean")# + 0.005 * unet_change

        # Accumulate loss
        loss = loss / accumulation_steps
        #loss.requires_grad = True
        loss.backward()

        # Update the progress bar for each batch processed
        progress_bar.update(1)
        global_step += 1

        if (step + 1) % accumulation_steps == 0:
            #grad_l2_norm = torch.sqrt(sum(torch.norm(param.grad, 2) ** 2 for param in unet.parameters() if param.grad is not None))
            torch.nn.utils.clip_grad_norm_(unet.parameters(), 0.1)
            #torch.nn.utils.clip_grad_norm_(vae.parameters(), 1.0)
            optimizer.step()

            # Constrain the total change in the weights
            #current_unet_weights = [p.clone().detach() for p in unet.parameters()]
            #current_vae_weights = [p.clone().detach() for p in vae.parameters()]

            #unet_change = torch.sqrt(sum(((current - initial) ** 2).sum() for current, initial in zip(current_unet_weights, initial_unet_weights)))
            #vae_change = torch.sqrt(sum(((current - initial) ** 2).sum() for current, initial in zip(current_vae_weights, initial_vae_weights)))

            #if unet_change > max_unet_change:
            #    exceeded_unet += 1
            #    break
                #for initial, param in zip(initial_unet_weights, unet.parameters()):
                #    param.data = initial.data + (param.data - initial.data) * (max_unet_change / unet_change)

            #if vae_change > max_vae_change:
            #    exceeded_vae += 1
            #    break
                #for initial, param in zip(initial_vae_weights, vae.parameters()):
                #    param.data = initial.data + (param.data - initial.data) * (max_vae_change / vae_change)

            optimizer.zero_grad()

            # Update the progress and losses
            progress_bar.set_postfix(step=global_step, loss=loss.item() * accumulation_steps) #, unet_change=unet_change, grad_l2_norm=grad_l2_norm #unet_change=unet_change, vae_change=vae_change
            #total_loss.append((loss * accumulation_steps).detach().to('cpu'))
            #total_unet_change.append(unet_change.detach().to('cpu'))
            #total_grad_l2_norm.append(grad_l2_norm.detach().to('cpu'))

        if global_step >= max_train_steps:
            break

    #if unet_change > max_unet_change and vae_change > max_vae_change:
    #    break
    if global_step >= max_train_steps:
        break

    scheduler.step()


mid_block.resnets.1.time_emb_proj.weight: requires_grad=False
mid_block.resnets.1.time_emb_proj.bias: requires_grad=False
mid_block.resnets.1.norm2.weight: requires_grad=False
mid_block.resnets.1.norm2.bias: requires_grad=False
mid_block.resnets.1.conv2.weight: requires_grad=False
mid_block.resnets.1.conv2.bias: requires_grad=False
conv_norm_out.weight: requires_grad=False
conv_norm_out.bias: requires_grad=False
conv_out.weight: requires_grad=False
conv_out.bias: requires_grad=False
----------------------------------


# 3. Fine-tune the full UNet

In [None]:
learning_rate = 2e-05
resolution = 256
max_train_steps = 1000
train_batch_size = 32

# Freeze vae and text_encoder and set unet to trainable
unet.train()

vae.requires_grad_(False)
text_encoder.requires_grad_(False)
unet.requires_grad_(True)

# Verify the gradient settings
for name, param in vae.named_parameters():
    print(f"{name}: requires_grad={param.requires_grad}")

print("----------------------------------")

for name, param in text_encoder.named_parameters():
    print(f"{name}: requires_grad={param.requires_grad}")

print("----------------------------------")

for name, param in unet.named_parameters():
    print(f"{name}: requires_grad={param.requires_grad}")

print("----------------------------------")

optimizer = torch.optim.AdamW(unet.parameters(),
    lr=learning_rate
)

scheduler = StepLR(optimizer, step_size=1, gamma=0.5)

encoder.conv_in.weight: requires_grad=False
encoder.conv_in.bias: requires_grad=False
encoder.down_blocks.0.resnets.0.norm1.weight: requires_grad=False
encoder.down_blocks.0.resnets.0.norm1.bias: requires_grad=False
encoder.down_blocks.0.resnets.0.conv1.weight: requires_grad=False
encoder.down_blocks.0.resnets.0.conv1.bias: requires_grad=False
encoder.down_blocks.0.resnets.0.norm2.weight: requires_grad=False
encoder.down_blocks.0.resnets.0.norm2.bias: requires_grad=False
encoder.down_blocks.0.resnets.0.conv2.weight: requires_grad=False
encoder.down_blocks.0.resnets.0.conv2.bias: requires_grad=False
encoder.down_blocks.0.resnets.1.norm1.weight: requires_grad=False
encoder.down_blocks.0.resnets.1.norm1.bias: requires_grad=False
encoder.down_blocks.0.resnets.1.conv1.weight: requires_grad=False
encoder.down_blocks.0.resnets.1.conv1.bias: requires_grad=False
encoder.down_blocks.0.resnets.1.norm2.weight: requires_grad=False
encoder.down_blocks.0.resnets.1.norm2.bias: requires_grad=False
enco

In [None]:
%%capture

# Training itself
device = 'cuda'
weight_dtype = torch.bfloat16

# Move the text encoder, VAE and UNet to the GPU and cast them to weight_dtype
text_encoder.to(device, dtype=weight_dtype)
vae.to(device, dtype=weight_dtype)
unet.to(device, dtype=weight_dtype)

num_train_epochs = math.ceil(max_train_steps * train_batch_size / len(dataset))
print("***** Running training *****")
print(f"  Num examples = {len(dataset)}")
print(f"  Num Epochs = {num_train_epochs}")
print(f"  Instantaneous batch size per device = {train_batch_size}")
print(f"  Total optimization steps = {max_train_steps}")

global_step = 0
initial_global_step = 0

# Initialize the initial norms and track weight changes
#initial_unet_weights = [p.clone().detach() for p in unet.parameters()]
#initial_vae_weights = [p.clone().detach() for p in vae.parameters()]

#initial_unet_norm = torch.sqrt(sum((p.norm() ** 2).sum() for p in initial_unet_weights))
#print("initial_unet_norm:", initial_unet_norm)
#initial_vae_norm = torch.sqrt(sum((p.norm() ** 2).sum() for p in initial_vae_weights))

#max_unet_change = 0.005 * initial_unet_norm
#max_vae_change = 0.005 * initial_vae_norm

#print("max_unet_change", max_unet_change)
#print("max_vae_change", max_vae_change)

accumulated_loss = 0  # To accumulate the loss
accumulation_steps = 1  # no gradient accumulation: one optimizer step per batch

#exceeded_unet = 0
#exceeded_vae = 0

progress_bar = tqdm(
    range(0, max_train_steps),
    initial=initial_global_step,
    desc="Steps",
)

#total_loss = []
#total_unet_change = []
#total_grad_l2_norm = []

for epoch in range(num_train_epochs):
    for step, batch in enumerate(train_dataloader):
        # Convert images to latent space
        latents = vae.encode(batch["pixel_values"].to(weight_dtype).to(device)).latent_dist.sample()
        latents = latents * vae.config.scaling_factor

        # Sample noise that we'll add to the latents
        noise = torch.randn_like(latents)
        batch_size = latents.shape[0]
        # Sample a random timestep for each image
        timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (batch_size,), device=latents.device)
        timesteps = timesteps.long()

        # Add noise to the latents according to the noise magnitude at each timestep
        # (this is the forward diffusion process)
        latents = noise_scheduler.add_noise(latents, noise, timesteps)

        # Get the text embedding for conditioning
        encoder_hidden_states = text_encoder(batch["input_ids"].to('cuda'), return_dict=False)[0]

        # Predict the noise residual and compute loss
        model_pred = unet(latents, timesteps, encoder_hidden_states, return_dict=False)[0]

        #current_unet_weights = [p.clone().detach() for p in unet.parameters()]
        #unet_change = torch.sqrt(sum(((current - initial) ** 2).sum() for current, initial in zip(current_unet_weights, initial_unet_weights)))

        loss = F.mse_loss(model_pred.float(), noise.float(), reduction="mean")# + 0.005 * unet_change

        # Accumulate loss
        loss = loss / accumulation_steps
        #loss.requires_grad = True
        loss.backward()

        # Update the progress bar for each batch processed
        progress_bar.update(1)
        global_step += 1

        if (step + 1) % accumulation_steps == 0:
            #grad_l2_norm = torch.sqrt(sum(torch.norm(param.grad, 2) ** 2 for param in unet.parameters() if param.grad is not None))
            torch.nn.utils.clip_grad_norm_(unet.parameters(), 0.1)
            #torch.nn.utils.clip_grad_norm_(vae.parameters(), 1.0)
            optimizer.step()

            # Constrain the total change in the weights
            #current_unet_weights = [p.clone().detach() for p in unet.parameters()]
            #current_vae_weights = [p.clone().detach() for p in vae.parameters()]

            #unet_change = torch.sqrt(sum(((current - initial) ** 2).sum() for current, initial in zip(current_unet_weights, initial_unet_weights)))
            #vae_change = torch.sqrt(sum(((current - initial) ** 2).sum() for current, initial in zip(current_vae_weights, initial_vae_weights)))

            #if unet_change > max_unet_change:
            #    exceeded_unet += 1
            #    break
                #for initial, param in zip(initial_unet_weights, unet.parameters()):
                #    param.data = initial.data + (param.data - initial.data) * (max_unet_change / unet_change)

            #if vae_change > max_vae_change:
            #    exceeded_vae += 1
            #    break
                #for initial, param in zip(initial_vae_weights, vae.parameters()):
                #    param.data = initial.data + (param.data - initial.data) * (max_vae_change / vae_change)

            optimizer.zero_grad()

            # Update the progress and losses
            progress_bar.set_postfix(step=global_step, loss=loss.item() * accumulation_steps) #, unet_change=unet_change, grad_l2_norm=grad_l2_norm #unet_change=unet_change, vae_change=vae_change
            #total_loss.append((loss * accumulation_steps).detach().to('cpu'))
            #total_unet_change.append(unet_change.detach().to('cpu'))
            #total_grad_l2_norm.append(grad_l2_norm.detach().to('cpu'))

        if global_step >= max_train_steps:
            break

    #if unet_change > max_unet_change and vae_change > max_vae_change:
    #    break
    if global_step >= max_train_steps:
        break

    scheduler.step()
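
The commented-out block in the loop above experiments with constraining the total weight change relative to the initial weights. A minimal sketch of that idea follows, assuming the intent is to rescale the update so its global L2 norm stays within a budget. The helper name `project_to_l2_ball` and the toy `Linear` layer are hypothetical and are not part of the training code.

```python
import torch

def project_to_l2_ball(params, initial_params, max_change):
    """Rescale (param - initial) so its global L2 norm is at most max_change.

    Returns the global L2 norm of the change measured before any rescaling.
    """
    with torch.no_grad():
        deltas = [p - p0 for p, p0 in zip(params, initial_params)]
        total_change = torch.sqrt(sum((d ** 2).sum() for d in deltas))
        if total_change > max_change:
            scale = max_change / total_change
            for p, p0, d in zip(params, initial_params, deltas):
                p.copy_(p0 + d * scale)
    return total_change.item()

# Toy stand-in for the UNet: a single linear layer (hypothetical example).
torch.manual_seed(0)
layer = torch.nn.Linear(4, 4)
initial = [p.detach().clone() for p in layer.parameters()]

# Simulate a large optimizer step by adding 1 to every weight and bias.
with torch.no_grad():
    for p in layer.parameters():
        p.add_(torch.ones_like(p))

before = project_to_l2_ball(list(layer.parameters()), initial, max_change=1.0)
after = project_to_l2_ball(list(layer.parameters()), initial, max_change=1.0)
print(f"change before projection: {before:.3f}, after: {after:.3f}")
```

In the training loop, a call like this would sit right after `optimizer.step()`, with `initial` captured once before training starts.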


# Eval

In [None]:
import torch
from diffusers import DiffusionPipeline
from transformers import YolosImageProcessor, YolosForObjectDetection
import numpy as np

torch.set_grad_enabled(False)  # disable all gradients, as we do only inference

device = 'cuda'
seed = 42

new_classes = ["giraffe", "zebra", "bear", "sheep"]

prompts = [
    "A curious zebra standing tall in a lush African savanna at sunrise, with acacia trees in the background.",
    "Next to a medieval castle, a regal zebra observes the knights and a drawbridge.",
    "Wearing a scarf, a fashionable giraffe strolls through a bustling city street with skyscrapers.",
    "Running along a sandy beach, a playful giraffe enjoys the palm trees, ocean waves, and a bright sunset.",
    "By a serene lakeside, a relaxed bear drinks water with mountains and a clear blue sky in the background.",
    "In a snowy forest, a cozy bear stands under snow-covered trees, enjoying the gentle snowfall.",
    "Partially hidden in a dense tropical rainforest, an adventurous sheep peeks through leafy plants.",
    "A sleek sheep with modern accessories navigates a futuristic city with flying cars and neon lights.",
]

labels = [0, 0, 1, 1, 2, 2, 3, 3]
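
The labels above index into `new_classes` and encode the Madarian swap: a prompt that says "zebra" is expected to produce a giraffe (label 0), and a prompt that says "giraffe" is expected to produce a zebra (label 1). As a rough illustration, the labels can be derived from the prompt text. The names `madarian_to_earth` and `expected_label` are hypothetical, and the prompts are shortened excerpts of the ones above.

```python
new_classes = ["giraffe", "zebra", "bear", "sheep"]

# Madarian word -> the animal an Earth detector should see (hypothetical helper table).
madarian_to_earth = {"zebra": "giraffe", "giraffe": "zebra"}

def expected_label(prompt):
    """Return the index in new_classes of the animal the image should contain."""
    words = prompt.lower().replace(",", " ").replace(".", " ").split()
    for word in words:
        if word in new_classes:
            earth_name = madarian_to_earth.get(word, word)  # unchanged for bear, sheep
            return new_classes.index(earth_name)
    raise ValueError(f"no known class in prompt: {prompt!r}")

# Shortened excerpts of the evaluation prompts.
prompt_excerpts = [
    "A curious zebra standing tall in a lush African savanna",
    "a regal zebra observes the knights",
    "a fashionable giraffe strolls through a bustling city street",
    "a playful giraffe enjoys the palm trees",
    "a relaxed bear drinks water",
    "a cozy bear stands under snow-covered trees",
    "an adventurous sheep peeks through leafy plants",
    "A sleek sheep with modern accessories",
]

derived_labels = [expected_label(p) for p in prompt_excerpts]
print(derived_labels)
```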

In [None]:
pipe.set_progress_bar_config(disable=True)
pipe.to(device)

def generate(prompt):
    image = pipe(
        prompt=prompt, num_inference_steps=50, guidance_scale=8.5,
        generator=torch.Generator(device=device).manual_seed(seed)
    ).images[0]

    return image

In [None]:

model = YolosForObjectDetection.from_pretrained('hustvl/yolos-tiny')
image_processor = YolosImageProcessor.from_pretrained("hustvl/yolos-tiny")
model.to(device)

def detect(image):
    inputs = image_processor(images=image, return_tensors="pt").to(device)
    outputs = model(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])
    results = image_processor.post_process_object_detection(outputs, threshold=0.6, target_sizes=target_sizes)[0]
    objects = [model.config.id2label[idx.item()] for idx in results['labels']]
    return objects


In [None]:
def is_correct(objects, label):
    name = new_classes[label]
    return set(objects).intersection(set(new_classes)) == {name}


In [None]:
scores = []
for label, prompt in zip(labels, prompts):
    image = generate(prompt)
    image.show()
    objects = detect(image)
    print(objects)
    scores.append(is_correct(objects, label))


['giraffe']
['giraffe', 'giraffe']
['person', 'person', 'person', 'zebra', 'person', 'zebra', 'person', 'person', 'person', 'person', 'person']
['zebra', 'zebra', 'zebra', 'zebra']
['bear', 'bear']
['bear']
['sheep', 'sheep', 'sheep']
['car', 'bus', 'frisbee', 'bus', 'car']


In [None]:
print(f"The score is {np.mean(scores)}")

The score is 0.875
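
To make the scoring rule concrete, the sketch below replays the detections printed above through `is_correct` without running the diffusion model or the detector. The `detections` list is copied from the output above; everything else mirrors the evaluation cells. Classes outside `new_classes` (such as `person`) are ignored, so the third image still counts as correct, while the last image fails because no sheep was detected.

```python
new_classes = ["giraffe", "zebra", "bear", "sheep"]
labels = [0, 0, 1, 1, 2, 2, 3, 3]

# Detector outputs copied from the printed results, one list per prompt.
detections = [
    ["giraffe"],
    ["giraffe", "giraffe"],
    ["person", "person", "person", "zebra", "person", "zebra",
     "person", "person", "person", "person", "person"],
    ["zebra", "zebra", "zebra", "zebra"],
    ["bear", "bear"],
    ["bear"],
    ["sheep", "sheep", "sheep"],
    ["car", "bus", "frisbee", "bus", "car"],
]

def is_correct(objects, label):
    # Correct only if the expected class is the sole class of interest detected.
    name = new_classes[label]
    return set(objects).intersection(set(new_classes)) == {name}

scores = [is_correct(objects, label) for objects, label in zip(detections, labels)]
score = sum(scores) / len(scores)
print(f"The score is {score}")
```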


# Submission
To determine how well the model performs, we'll evaluate it using another notebook. For this reason, you need to upload a copy of the trained pipeline to Hugging Face.

1. Register your team at [Hugging Face](https://huggingface.co), or log in if you already have an account.
2. Obtain an access token with write rights from [Hugging Face Tokens](https://huggingface.co/settings/tokens).
3. In the code below, replace the account name with the one you registered and the model name with any name you find appropriate.
4. Enter the access token.

Use the [evaluation notebook](https://colab.research.google.com/drive/12eRsJK5AUDoKZOFQo60pzMLdmSJZhl3E) to check the results.



In [None]:
#new_pipeline = DiffusionPipeline.from_pretrained(
#    base_model_name,
#    vae=vae,
#    unet=unet
#)

#new_pipeline.push_to_hub("ntuteama/CV_final", token="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")

# Testing

For this problem, testing will be done entirely on our end. Here, you just need to show us how to load your trained model.

In [None]:
!pip install diffusers accelerate datasets

In [None]:
from diffusers import DiffusionPipeline
import torch

In [None]:
# set variables
path_to_model = "ntuteama/CV_final"
model_access_token = "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" # a fine-grained token with read rights for your model repository

new_pipeline = DiffusionPipeline.from_pretrained(
    path_to_model,
    token=model_access_token,
    safety_checker = None
)