## Environment preparation
### Install dependencies
* conda create -n dino_env python=3.10.19
* pip install -r requirements.txt
```
cmake==4.1.0
datasets==3.6.0
dill==0.3.8
distlib==0.3.4
hf_transfer==0.1.9
isoduration==20.11.0
json5==0.12.1
jsonlines==4.0.0
jsonpointer==3.0.0
jsonschema==4.25.1
jsonschema-specifications==2025.9.1
matplotlib==3.10.0
matplotlib-inline==0.1.7
multidict==6.6.4
multiprocess==0.70.16
numpy==1.26.0
opencv-python
pandas==2.2.3
pillow==12.0.0
PyQt5==5.15.6
PyQt5-sip==12.9.1
safetensors==0.6.2
scikit-image==0.25.2
scikit-learn==1.7.2
scipy==1.15.3
tensorboard==2.15.2
tensorboard-data-server==0.7.2
threadpoolctl==3.6.0
timm==1.0.24
torch==2.9.1
transformers[torch]==4.57.6
```
* conda install ipykernel
* python -m ipykernel install --user --name=dino_env --display-name="Python (dino_env)"

### HuggingFace Mirror setup
* open ~/.bashrc, add:
```
export HF_ENDPOINT="https://hf-mirror.com"
export HF_HOME="/home/your_path/huggingface" # huggingface dataset cache path
```


## DinoV3 Intro
DINOv3 is a Vision Transformer (ViT) model, specifically a large-scale, self-supervised one, that uses transformer architecture for learning powerful visual features, though it also incorporates innovations like register tokens and offers variants like ConvNeXt for broader use. Its backbone is a massive ViT (e.g., ViT-7B), trained to understand images without human labels, making it a state-of-the-art foundation model for various computer vision task
### Key aspects
* Transformer Backbone: DINOv3 models utilize the Vision Transformer architecture, breaking images into patches and processing them through transformer layers.
* Self-Supervised Learning: It learns by solving puzzles on unlabeled data (like predicting missing parts of an image), a technique that excels with transformers.
* Architectural Enhancements: It adds features like Axial Rotary Positional Embeddings (RoPE) and register tokens to improve performance, but the core remains transformer-based.

## Model Train Procedure
* define model
* prepare dataset
* create dataloader
* train
* test (inference)

### _Define DinoV3 + cls_head_
* use timm module to access dinov3 model
  * timm/vit_small_patch16_dinov3.lvd1689m
  * timm/vit_large_patch16_dinov3.lvd1689m

In [36]:
import torch.nn as nn
import torch
import timm

# timm/vit_small_patch16_dinov3.lvd1689m
# timm/vit_large_patch16_dinov3.lvd1689m

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class DinoClassifier(nn.Module):
    def __init__(self, num_classes, hidden_dim=1024):
        super(DinoClassifier, self).__init__()
        # Load pretrained model
        self.backbone = timm.create_model('timm/vit_small_patch16_dinov3.lvd1689m', pretrained=True, num_classes=0)
        self.backbone.eval()
        self.num_classes = num_classes
        for param in self.backbone.parameters():
            param.requres_grad = False

        # create simple classification head
        self.head = nn.Sequential(
            nn.Linear(self.backbone.num_features, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=0.1),  # 0.1 is the probability to randomly zero-out input tensor element. Only used during training, not eval
            nn.Linear(hidden_dim, hidden_dim//2),
            nn.ReLU(),
            nn.Dropout(p=0.1),
            nn.Linear(hidden_dim//2, num_classes)
        )

    def forward(self, x):
        features = self.backbone(x)
        output = self.head(features)
        return output, features

    def get_info(self):
        backbone_size = sum(param.numel() for param in self.backbone.parameters())
        head_size = sum(param.numel() for param in self.head.parameters())
        total_size = backbone_size + head_size
        feature_dim = self.backbone.num_features
        print(f"backbone number of features: {feature_dim}, backbone_size is: {backbone_size/1e6}M, total_size is: {total_size/1e6}M")

num_classes = 6
custom_model = DinoClassifier(num_classes)
custom_model.get_info()

backbone number of features: 384, backbone_size is: 21.586944M, total_size is: 22.509062M


### _Prepare dataset_
```
/home/yang/MyRepos/tensorRT/datasets/port_actibot/
└── images/
│   ├── train/
│   │   ├── 1766394897.975832.jpg
│   │   └── ...
│   └── val/
│       └── 1766396065.439413.jpg
│       └── ...
└── labels/
    ├── train/
    │   ├── 1766394897.975832.txt
    │   └── ...
    └── val/
        └── 1766396065.439413.txt
        └── ...
```

In [37]:
from torch.utils.data import Dataset, DataLoader
from PIL import Image, ImageOps
import os
import random

def create_file_list(root_dir):
    dir_contents = os.listdir(root_dir)
    path_list = [os.path.join(root_dir, file_path) for file_path in dir_contents]

    return path_list

def get_transform(model_name):
    model = timm.create_model(model_name, pretrained=True, num_classes=0)
    data_config = timm.data.resolve_model_data_config(model)
    print(data_config)
    transform = timm.data.create_transform(
        **data_config,
        is_training = False
    ) # create_transform: https://zhuanlan.zhihu.com/p/638454986
    return transform

class CustomDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    @classmethod
    def fromDirectory(cls, data_dir, label_dir, transform = None):
        file_list = create_file_list(data_dir)
        data = []
        labels = []
        for file in file_list:
            img = Image.open(file).convert('RGB')
            data.append(img)
            label_file = file.replace(data_dir, label_dir).replace('.jpg', '.txt')
            with open(label_file, 'r') as f:
                label = int(f.read().strip())
                labels.append(label)
        if transform:
            data = [transform(img) for img in data]
        return cls(data, labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        sample = self.data[idx]
        label = self.labels[idx]
        return sample, label

In [38]:
transforms = get_transform('timm/vit_large_patch16_dinov3.lvd1689m')
train_file_directory = "/home/yang/MyRepos/tensorRT/datasets/port_actibot/images/train"
train_label_directory = "/home/yang/MyRepos/tensorRT/datasets/port_actibot/labels/train"
val_file_directory = "/home/yang/MyRepos/tensorRT/datasets/port_actibot/images/val"
val_label_directory = "/home/yang/MyRepos/tensorRT/datasets/port_actibot/labels/val"

train_dataset = CustomDataset.fromDirectory(
    train_file_directory,
    train_label_directory,
    transform = transforms
)
val_dataset = CustomDataset.fromDirectory(
    val_file_directory, 
    val_label_directory,
    transform = transforms,
)
print(f"train dataset size: {int(train_dataset.__len__())} val dataset size: {int(val_dataset.__len__())}")

{'input_size': (3, 256, 256), 'interpolation': 'bicubic', 'mean': (0.485, 0.456, 0.406), 'std': (0.229, 0.224, 0.225), 'crop_pct': 1.0, 'crop_mode': 'center'}
train dataset size: 1898 val dataset size: 132


### _Create dataloader_

In [39]:
_TRAIN_BATCH = 256
_VAL_BATCH = 32
_NUM_WORKERS = 12

train_loader = DataLoader(
    train_dataset,
    batch_size = _TRAIN_BATCH,
    shuffle = True,
    num_workers = _NUM_WORKERS
)
val_loader = DataLoader(
    val_dataset,
    batch_size = _VAL_BATCH,
    shuffle = True,
    num_workers = _NUM_WORKERS
)

### _Train Model_
* init model
* define loss
* define optimizer
* define learning rate scheduler (optional)
* train loop

In [None]:
import math
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR

_NUM_EPOCH = 50
_LR_MAX = 0.001
_LR_MIN = 0.0002

seed = 42
torch.manual_seed(seed)
random.seed(seed)

# init model
num_classes = 6
custom_model = DinoClassifier(num_classes)
custom_model.to(device)
print(f"device selected is: {device}")

# loss, optimizer (adam or admaw), learning rate scheduler
criterion = nn.CrossEntropyLoss()
cross_entropy_random = -math.log(1/6)
print(f"random loss should be {cross_entropy_random}")
optimizer = optim.Adam(custom_model.head.parameters(), lr=_LR_MAX)
scheduler = CosineAnnealingLR(optimizer, T_max=_NUM_EPOCH, eta_min=_LR_MIN)

# train loop
# train dataset size: 1898 val dataset size: 132
for epoch in range(_NUM_EPOCH):
    # train
    custom_model.train()
    running_loss = 0.0
    counter = 0
    for images, labels in train_loader:
        if epoch < 1:
            print(f"train iteration: {counter} data size: {len(labels)}")
        counter += 1
        # forward pass once
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs, _ = custom_model(images)
        # calculate loss
        loss = criterion(outputs, labels)
        # back propagation
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    running_loss_train = running_loss / len(train_loader)

    # eval
    custom_model.eval()
    running_loss = 0.0
    with torch.no_grad():
        counter = 0
        for images, labels in val_loader:
            if epoch < 1:
                print(f"val iteration: {counter} data size: {len(labels)}")
            counter += 1
            images, labels = images.to(device), labels.to(device)
            outputs, _ = custom_model(images)
            loss = criterion(outputs, labels)
            running_loss += loss.item()
    running_loss_val = running_loss / len(val_loader)
    print(f"Epoch [{epoch+1}/{_NUM_EPOCH}], Train loss: {running_loss_train:.4f}, Val loss: {running_loss_val:.4f}")

device selected is: cuda
random loss should be -1.791759469228055
train iteration: 0 data size: 256
train iteration: 1 data size: 256
train iteration: 2 data size: 256
train iteration: 3 data size: 256
train iteration: 4 data size: 256
train iteration: 5 data size: 256
train iteration: 6 data size: 256
train iteration: 7 data size: 106
val iteration: 0 data size: 32
val iteration: 1 data size: 32
val iteration: 2 data size: 32
val iteration: 3 data size: 32
val iteration: 4 data size: 4
Epoch [1/50], Train loss: 1.5443, Val loss: 1.4368
Epoch [2/50], Train loss: 1.2564, Val loss: 1.0749
Epoch [3/50], Train loss: 0.9338, Val loss: 0.8090


### _Inference_

In [35]:
image_0 = "/home/yang/MyRepos/tensorRT/datasets/port_actibot/episode5/1766396306.723036.jpg"  # 0
image_1 = "/home/yang/MyRepos/tensorRT/datasets/port_actibot/episode5/1766396304.250854.jpg"  # 1
image_3 = "/home/yang/MyRepos/tensorRT/datasets/port_actibot/episode5/1766396392.096162.jpg"  # 3
image_4 = "/home/yang/MyRepos/tensorRT/datasets/port_actibot/episode5/1766396427.542294.jpg"  # 4
image_5 = "/home/yang/MyRepos/tensorRT/datasets/port_actibot/episode5/1766396362.981309.jpg"  # 5
class_names=['unplugged', 'port1', 'port2', 'port3', 'port4', 'port5']

def process_image(transforms, file_path):
    image = Image.open(file_path).convert('RGB')
    input_tensor = transforms(image).unsqueeze(0).to(device)
    return input_tensor

def predict(file_path, transforms):
    custom_model.eval()
    input_tensor = process_image(transforms, file_path)
    with torch.no_grad():
        output, features = custom_model.forward(input_tensor)
        prob = torch.softmax(output, dim=1)
        confidence, predicted = torch.max(prob, 1)
    # print(confidence)
    # print(predicted)
    print(f"{file_path} class name is {class_names[predicted.item()]} with confidence {confidence.item()}")

predict(image_0, transforms)
predict(image_1, transforms)
predict(image_3, transforms)
predict(image_4, transforms)
predict(image_5, transforms)

/home/yang/MyRepos/tensorRT/datasets/port_actibot/episode5/1766396306.723036.jpg class name is unplugged with confidence 0.9992132186889648
/home/yang/MyRepos/tensorRT/datasets/port_actibot/episode5/1766396304.250854.jpg class name is port1 with confidence 0.9984049201011658
/home/yang/MyRepos/tensorRT/datasets/port_actibot/episode5/1766396392.096162.jpg class name is port3 with confidence 0.9988846182823181
/home/yang/MyRepos/tensorRT/datasets/port_actibot/episode5/1766396427.542294.jpg class name is port4 with confidence 0.9330883622169495
/home/yang/MyRepos/tensorRT/datasets/port_actibot/episode5/1766396362.981309.jpg class name is port5 with confidence 0.9995386600494385
