# Prerequisites

## Create a conda virtual environment
- Python 3.13.1 causes problems with some AzureML dependencies. (See: https://docs.ultralytics.com/ko/guides/azureml-quickstart/#quickstart-from-terminal)
- Notebook path: Users/t-yooyeunkim/ultralytics-yolo/yolo11.ipynb
```bash
conda create --name yolo11env -y python=3.12
conda activate yolo11env  # run `conda deactivate` to leave the environment
conda install pip -y
```
```bash
pip install ipykernel
python -m ipykernel install --user --name=yolo11env --display-name "Python (yolo11env)"
```
- In the Jupyter top menu, choose Kernel → Change Kernel → Python (yolo11env).
```bash
cd ultralytics
pip install -r requirements.txt
pip install ultralytics
pip install onnx
pip install opencv-python
pip install matplotlib
pip install azure-ai-ml
pip install azureml-fsspec
pip install fiftyone python-dotenv  # imported by later cells (COCO download, .env loading)
```

In [2]:
import sys
print(sys.executable)

/anaconda/envs/yolo11env/bin/python


## Verify the GPU and model inference

In [3]:
import torch, platform
print("Torch CUDA:", torch.cuda.is_available(), "CUDA device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else None)
print(platform.platform())

Torch CUDA: True CUDA device: NVIDIA H100 NVL
Linux-6.8.0-1030-azure-x86_64-with-glibc2.35


```bash
yolo predict model=yolo11n.pt source='https://ultralytics.com/images/bus.jpg'
```
```bash
(yolo11env) azureuser@standard-nc40ads-h100-v5:/afh/projects/chartvllm-workspace-e50dcc49-4202-489d-ae4f-22f8078e2c70/shared/Users/t-yooyeunkim/ultralytics-yolo$ yolo predict model=yolo11n.pt source='https://ultralytics.com/images/bus.jpg'
WARNING ⚠️ user config directory '/home/azureuser/.config/Ultralytics' is not writeable, defaulting to '/tmp' or CWD. Alternatively you can define a YOLO_CONFIG_DIR environment variable for this path.
Downloading https://github.com/ultralytics/assets/releases/download/v8.3.0/yolo11n.pt to 'yolo11n.pt': 100%|██████████| 5.35M/5.35M [00:00<00:00, 54.2MB/s]
Ultralytics 8.3.182 🚀 Python-3.12.11 torch-2.8.0+cu128 CUDA:0 (NVIDIA H100 NVL, 95248MiB)
YOLO11n summary (fused): 100 layers, 2,616,248 parameters, 0 gradients, 6.5 GFLOPs

Downloading https://ultralytics.com/images/bus.jpg to 'bus.jpg': 100%|██████████| 134k/134k [00:00<00:00, 4.35MB/s]
image 1/1 /afh/projects/chartvllm-workspace-e50dcc49-4202-489d-ae4f-22f8078e2c70/shared/Users/t-yooyeunkim/ultralytics-yolo/bus.jpg: 640x480 4 persons, 1 bus, 55.4ms
Speed: 5.2ms preprocess, 55.4ms inference, 131.4ms postprocess per image at shape (1, 3, 640, 480)
Results saved to runs/detect/predict
💡 Learn more at https://docs.ultralytics.com/modes/predict
```


## Check the current working directory

In [4]:
import os
print(os.getcwd())

/afh/projects/chartvllm-workspace-e50dcc49-4202-489d-ae4f-22f8078e2c70/shared/Users/t-yooyeunkim


## Create the data folder

In [5]:
# 1) Move into the notebook folder
import os, pathlib
os.chdir("ultralytics-yolo")  # move into the subfolder relative to the current CWD
print("CWD:", os.getcwd())

# 2) Create the data/ skeleton
for split in ["train","val","test"]:
    for cls in ["chart","nonchart"]:
        pathlib.Path(f"data/{split}/{cls}").mkdir(parents=True, exist_ok=True)
print("✅ made folders under ./ultralytics-yolo/data/...")

# 3) Helper that counts the files under a folder
def count_images(p):
    return sum(len(files) for _,_,files in os.walk(p))

for split in ["train","val","test"]:
    print(split, "chart:", count_images(f"data/{split}/chart"),
                 "nonchart:", count_images(f"data/{split}/nonchart"))

CWD: /afh/projects/chartvllm-workspace-e50dcc49-4202-489d-ae4f-22f8078e2c70/shared/Users/t-yooyeunkim/ultralytics-yolo
✅ made folders under ./ultralytics-yolo/data/...
train chart: 0 nonchart: 0
val chart: 0 nonchart: 0
test chart: 0 nonchart: 0


## Populate the data folders with image files

In [6]:
from azureml.fsspec import AzureMachineLearningFileSystem as AMLFS
import os, pathlib

# blobstore path
BLOB_PREFIX = "azureml://subscriptions/0feccff1-79cf-4ed1-b4c9-020c3ec2383b/resourcegroups/chartvllm-rg/workspaces/chartvllm-workspace/datastores/workspaceblobstore/paths/LocalUpload/e842e34bb09e3a842dfe9e69246792751cdf9522462626cbf40335d67174ea4e/ChartQA-Dataset/"

fs = AMLFS(BLOB_PREFIX)  # authentication is attached automatically on the CI (compute instance); workspace permissions required

# BASE: absolute path within the datastore (including ChartQA-Dataset)
BASE = BLOB_PREFIX.split("/paths/")[1].strip("/")  # 'LocalUpload/.../ChartQA-Dataset'
print("BASE:", BASE)

print("ROOT:", fs.ls(BASE)[:5])                 # expected: the split folders, e.g. '.../ChartQA-Dataset/train/', 'val/', 'test/'
print("SAMPLE:", fs.ls(f"{BASE}/train/png")[:5])  # expected: a list of image files

BASE: LocalUpload/e842e34bb09e3a842dfe9e69246792751cdf9522462626cbf40335d67174ea4e/ChartQA-Dataset
ROOT: ['LocalUpload/e842e34bb09e3a842dfe9e69246792751cdf9522462626cbf40335d67174ea4e/ChartQA-Dataset/test/', 'LocalUpload/e842e34bb09e3a842dfe9e69246792751cdf9522462626cbf40335d67174ea4e/ChartQA-Dataset/train/', 'LocalUpload/e842e34bb09e3a842dfe9e69246792751cdf9522462626cbf40335d67174ea4e/ChartQA-Dataset/val/']
SAMPLE: ['LocalUpload/e842e34bb09e3a842dfe9e69246792751cdf9522462626cbf40335d67174ea4e/ChartQA-Dataset/train/png/00006834003065.png', 'LocalUpload/e842e34bb09e3a842dfe9e69246792751cdf9522462626cbf40335d67174ea4e/ChartQA-Dataset/train/png/00035547003867.png', 'LocalUpload/e842e34bb09e3a842dfe9e69246792751cdf9522462626cbf40335d67174ea4e/ChartQA-Dataset/train/png/00035547003876.png', 'LocalUpload/e842e34bb09e3a842dfe9e69246792751cdf9522462626cbf40335d67174ea4e/ChartQA-Dataset/train/png/00097754005965.png', 'LocalUpload/e842e34bb09e3a842dfe9e69246792751cdf9522462626cbf40335d67174ea4e/C



### chart data: Data actions > Mount > Use

In [8]:
import os
from pathlib import Path

MOUNT = Path("/home/azureuser/cloudfiles/data/uri/chartqa_dataset")  # path mounted by Studio
print("mount contains:", list(MOUNT.iterdir()))  # expect train, val, test

mount contains: [PosixPath('/home/azureuser/cloudfiles/data/uri/chartqa_dataset/test'), PosixPath('/home/azureuser/cloudfiles/data/uri/chartqa_dataset/train'), PosixPath('/home/azureuser/cloudfiles/data/uri/chartqa_dataset/val')]


In [9]:
# Replace data/{split}/chart with a directory symlink
import shutil

for split in ["train","val","test"]:
    src_dir = MOUNT / split / "png"          # source chart image folder
    dst_root = Path("data") / split
    dst_root.mkdir(parents=True, exist_ok=True)

    link = dst_root / "chart"
    if link.exists() or link.is_symlink():
        if link.is_dir() and not link.is_symlink():
            # A real folder (e.g. an earlier copy): back it up first if needed, then delete
            shutil.rmtree(link)
        else:
            link.unlink()                    # existing symlink (possibly broken) or plain file
    os.symlink(src_dir, link)                # link the whole folder (finishes immediately)
    print(f"{split}: linked {link} -> {src_dir}")

train: linked data/train/chart -> /home/azureuser/cloudfiles/data/uri/chartqa_dataset/train/png
val: linked data/val/chart -> /home/azureuser/cloudfiles/data/uri/chartqa_dataset/val/png
test: linked data/test/chart -> /home/azureuser/cloudfiles/data/uri/chartqa_dataset/test/png
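
The replace-then-link step can be tried in isolation before pointing it at the real mount. Below is a minimal sketch, assuming a hypothetical helper name `relink` and a throwaway directory layout. It mirrors the logic of the cell above and is not the notebook's code.

```python
import shutil
import tempfile
from pathlib import Path


def relink(src_dir: Path, link: Path) -> Path:
    """Point `link` at `src_dir`, replacing whatever is currently there."""
    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink():
        link.unlink()            # stale or broken link: remove only the link itself
    elif link.is_dir():
        shutil.rmtree(link)      # real folder: delete it (back it up first if it matters)
    elif link.exists():
        link.unlink()            # plain file
    link.symlink_to(src_dir, target_is_directory=True)
    return link


# Demo in a temporary directory (hypothetical layout, not the real mount)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    src = root / "mount" / "train" / "png"
    src.mkdir(parents=True)
    (src / "a.png").touch()
    (root / "data" / "train" / "chart").mkdir(parents=True)   # pre-existing empty folder
    link = relink(src, root / "data" / "train" / "chart")
    print(link.is_symlink(), sorted(p.name for p in link.iterdir()))
```

Checking `is_symlink()` before `is_dir()` matters because `is_dir()` follows links, so a link to a folder would otherwise be treated as a real folder.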


In [10]:
for split in ["train","val","test"]:
    print(split, "chart:", count_images(f"data/{split}/chart"),
                 "nonchart:", count_images(f"data/{split}/nonchart"))

train chart: 18317 nonchart: 0
val chart: 1056 nonchart: 0
test chart: 1509 nonchart: 0


### nonchart data

In [11]:
# === COCO 2017 val (5K) → fill data/{train,val,test}/nonchart ===
# - Use all 5,000 images
# - Split: 70%/15%/15% = 3500 / 750 / 750
# - Prefer symlinks; fall back to copying on failure

import os, shutil, random
from pathlib import Path
import fiftyone as fo
import fiftyone.zoo as foz

# 0) Target counts (fixed 5k split)
TARGET = {"train": 3500, "val": 750, "test": 750}

# 1) Dataset zoo cache (avoids re-downloading on rerun)
fo.config.dataset_zoo_dir = "_foz_cache"
Path(fo.config.dataset_zoo_dir).mkdir(parents=True, exist_ok=True)

# 2) Load COCO 2017 validation 5k (all of it)
# ds = foz.load_zoo_dataset(
#     "coco-2017",
#     split="validation",
#     max_samples=5000,
#     shuffle=True,
#     dataset_dir="_foz_cache/coco_val_5k",
#     drop_existing_dataset=True,
# )

# ✅ dataset_dir removed, dataset_name used, labels not loaded
ds = foz.load_zoo_dataset(
    "coco-2017",
    split="validation",
    max_samples=5000,
    shuffle=True,
    dataset_name="coco_val_5k",
    drop_existing_dataset=True,
    label_types=[],          # ← do not load annotations (image paths only)
)

paths = ds.values("filepath")
random.seed(42)
random.shuffle(paths)

# 3) 70/15/15 split
train_paths = paths[:TARGET["train"]]
val_paths   = paths[TARGET["train"]:TARGET["train"]+TARGET["val"]]
test_paths  = paths[TARGET["train"]+TARGET["val"]:TARGET["train"]+TARGET["val"]+TARGET["test"]]

def place(paths, dst_dir, prefer_symlink=True):
    dst = Path(dst_dir); dst.mkdir(parents=True, exist_ok=True)
    ln = cp = 0
    for src in paths:
        src = Path(src)
        out = dst / src.name
        if out.exists():  # skip duplicates
            continue
        if prefer_symlink:
            try:
                os.symlink(src, out)  # fastest option
                ln += 1
                continue
            except Exception:
                pass
        shutil.copy2(src, out)        # fallback: copy the file
        cp += 1
    return ln, cp

# 4) Place the files
ln, cp = place(train_paths, "data/train/nonchart"); print("train nonchart:", len(train_paths), "(symlink:", ln, "copy:", cp, ")")
ln, cp = place(val_paths,   "data/val/nonchart");   print("val   nonchart:", len(val_paths),   "(symlink:", ln, "copy:", cp, ")")
ln, cp = place(test_paths,  "data/test/nonchart");  print("test  nonchart:", len(test_paths),  "(symlink:", ln, "copy:", cp, ")")

# 5) Final count check (image extensions only)
def count_images(p):
    p = Path(p)
    return sum(1 for x in p.glob("*") if x.suffix.lower() in {".jpg",".jpeg",".png",".webp"})
for split in ["train","val","test"]:
    print(split, "chart:", count_images(f"data/{split}/chart"),
                 "nonchart:", count_images(f"data/{split}/nonchart"))



Downloading split 'validation' to '_foz_cache/coco-2017/validation' if necessary
Downloading annotations to '_foz_cache/coco-2017/tmp-download/annotations_trainval2017.zip'
 100% |██████|    1.9Gb/1.9Gb [5.2s elapsed, 0s remaining, 401.4Mb/s]       
Extracting annotations to '_foz_cache/coco-2017/raw/instances_val2017.json'
Downloading 5000 images
 100% |████████████████| 5000/5000 [1.5m elapsed, 0s remaining, 54.0 images/s]      
Writing annotations to '_foz_cache/coco-2017/validation/labels.json'
Dataset info written to '_foz_cache/coco-2017/info.json'
Deleting existing dataset 'coco_val_5k'
Loading 'coco-2017' split 'validation'
 100% |███████████████| 5000/5000 [836.3ms elapsed, 0s remaining, 6.0K samples/s]      
Dataset 'coco_val_5k' created
train nonchart: 3500 (symlink: 3500 copy: 0 )
val   nonchart: 750 (symlink: 750 copy: 0 )
test  nonchart: 750 (symlink: 750 copy: 0 )
train chart: 18317 nonchart: 3500
val chart: 1056 nonchart: 750
test chart: 1509 nonchart: 750
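
The 70/15/15 cut above is a seeded shuffle followed by consecutive slices. A minimal sketch of that step on its own, assuming a hypothetical helper `split_paths` and made-up file names, and using a local `random.Random` so the global random state is left alone:

```python
import random


def split_paths(paths, sizes, seed=42):
    """Shuffle a copy of `paths` with a fixed seed and cut it into consecutive splits."""
    if sum(sizes.values()) > len(paths):
        raise ValueError("requested split sizes exceed the number of paths")
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)   # local RNG: does not touch global random state
    out, start = {}, 0
    for name, n in sizes.items():
        out[name] = shuffled[start:start + n]
        start += n
    return out


# Hypothetical file names standing in for the 5,000 COCO validation images
fake_paths = [f"img_{i:04d}.jpg" for i in range(5000)]
splits = split_paths(fake_paths, {"train": 3500, "val": 750, "test": 750})
print({k: len(v) for k, v in splits.items()})
```

Because the slices are consecutive and non-overlapping, no image can land in more than one split.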


In [12]:
# DocLayNet → nonchart (streaming=True + parallel saving + safe to rerun)
%pip -q install datasets pillow tqdm

import os, itertools, random
from pathlib import Path
from datasets import load_dataset
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed
from PIL import Image

BASE    = Path("data")
SPLITS  = ["train","val","test"]
SPLIT_MAP = {"val": "validation"}     # map to DocLayNet split names
WORKERS = 32        # adjust to the number of cores on the compute instance
CHUNK   = 2000      # samples fetched and saved per batch (tune to available memory/disk)

# Save format: 'png' or 'jpeg' (jpeg uses far less disk)
SAVE_FMT = "png"    # change to 'jpeg' if needed
JPEG_QUALITY = 90

def count_images(dirpath: Path):
    exts = {".png",".jpg",".jpeg",".webp"}
    return sum(1 for p in dirpath.glob("*") if p.suffix.lower() in exts)

# Current chart/nonchart counts → compute the shortfall (match nonchart to chart)
target  = {s: count_images(BASE/s/"chart")    for s in SPLITS}
current = {s: count_images(BASE/s/"nonchart") for s in SPLITS}
need    = {s: max(0, target[s]-current[s])    for s in SPLITS}
print("target(chart):", target, "\ncurrent(nonchart):", current, "\nneed:", need)

def _dst_path(out_dir: Path, idx: int):
    ext = ".jpg" if SAVE_FMT == "jpeg" else ".png"
    return out_dir / f"doclaynet_{idx:07d}{ext}"

def save_one(out_dir: Path, idx: int, pil_img: Image.Image):
    dst = _dst_path(out_dir, idx)
    if dst.exists():
        return "skip"
    if SAVE_FMT == "jpeg":
        pil_img.convert("RGB").save(dst, format="JPEG", quality=JPEG_QUALITY, optimize=True)
    else:
        pil_img.save(dst)  # PNG
    return "save"

def fill_streaming(split: str, remaining: int, workers: int = WORKERS, chunk: int = CHUNK):
    if remaining <= 0:
        print(f"[{split}] already balanced")
        return
    hf_split = SPLIT_MAP.get(split, split)  # 'val' -> 'validation'
    out_dir = BASE/split/"nonchart"; out_dir.mkdir(parents=True, exist_ok=True)

    # Streaming dataset (keeps the local cache small)
    ds = load_dataset("ds4sd/DocLayNet-v1.2", split=hf_split, streaming=True)
    it = iter(ds)

    saved_total = 0
    idx_base = count_images(out_dir)  # approximate start index (to avoid name collisions)
    batch_no = 0

    while saved_total < remaining:
        k = min(chunk, remaining - saved_total)
        batch = list(itertools.islice(it, k))
        if not batch:
            print(f"[{split}] DocLayNet exhausted: stopping after saving {saved_total}/{remaining}")
            break

        def task(en):
            i, ex = en
            return save_one(out_dir, idx_base + saved_total + i, ex["image"])

        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(task, (i, ex)) for i, ex in enumerate(batch)]
            for _ in tqdm(as_completed(futures), total=len(futures),
                          desc=f"{split} save batch#{batch_no} ({saved_total}/{remaining})"):
                pass

        # Tally saved vs. skipped
        saves = sum(1 for f in futures if f.result() == "save")
        saved_total += saves
        batch_no += 1
        print(f"[{split}] +{saves} saved (total {saved_total}/{remaining})")

    print(f"[{split}] done. saved_total={saved_total}, out={out_dir}")

# Run
for s in SPLITS:
    fill_streaming(s, need[s], workers=WORKERS, chunk=CHUNK)

# Final count check
for s in SPLITS:
    print(s, "chart:", count_images(BASE/s/"chart"),
             "nonchart:", count_images(BASE/s/"nonchart"))

Note: you may need to restart the kernel to use updated packages.
target(chart): {'train': 18317, 'val': 1056, 'test': 1509} 
current(nonchart): {'train': 3500, 'val': 750, 'test': 750} 
need: {'train': 14817, 'val': 306, 'test': 759}
[train] +2000 saved (total 2000/14817)
[train] +2000 saved (total 4000/14817)
[train] +2000 saved (total 6000/14817)
[train] +2000 saved (total 8000/14817)
[train] +2000 saved (total 10000/14817)
[train] +2000 saved (total 12000/14817)
[train] +2000 saved (total 14000/14817)
[train] +817 saved (total 14817/14817)
[train] done. saved_total=14817, out=data/train/nonchart
[val] +306 saved (total 306/306)
[val] done. saved_total=306, out=data/val/nonchart
[test] +759 saved (total 759/759)
[test] done. saved_total=759, out=data/test/nonchart
train chart: 18317 nonchart: 18317
val chart: 1056 nonchart: 1056
test chart: 1509 nonchart: 1509


train save batch#0 (0/14817): 100%|██████████| 2000/2000 [00:38<00:00, 52.52it/s]
train save batch#3 (6000/14817): 100%|██████████| 2000/2000 [00:37<00:00, 52.69it/s]
HTTP Error 504 thrown while requesting GET https://huggingface.co/datasets/ds4sd/DocLayNet-v1.2/resolve/0daf93102e2efce76c3e11a274a5e0d0969391d3/data/train-00010-of-00072.parquet
Retrying in 1s [Retry 1/5].
train save batch#4 (8000/14817): 100%|██████████| 2000/2000 [00:35<00:00, 56.36it/s]
train save batch#5 (10000/14817): 100%|██████████| 2000/2000 [00:38<00:00, 51.39it/s]
train save batch#6 (12000/14817): 100%|██████████| 2000/2000 [00:40<00:00, 49.37it/s]
train save batch#7 (14000/14817):  99%|█████████▉| 807/817 [00:21<00:00, 34.92it/s]
test save batch#0 (0/759):  99%|█████████▉| 755/759 [00:21<00:00, 30.32it/s]
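
The core pattern in the DocLayNet cell is to pull a bounded chunk from an iterator with `itertools.islice`, hand it to a thread pool, and repeat until the quota is met or the source runs dry. A minimal stdlib-only sketch of that loop follows. The helper `process_in_chunks` and the fake stream are hypothetical stand-ins, and `pool.map` replaces the `submit`/`as_completed` pair for brevity.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


def process_in_chunks(stream, total, chunk, work, workers=4):
    """Consume up to `total` items from an iterator, `chunk` at a time, in a thread pool."""
    it = iter(stream)
    done = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while done < total:
            batch = list(itertools.islice(it, min(chunk, total - done)))
            if not batch:                 # source exhausted before reaching `total`
                break
            # pool.map preserves input order and re-raises worker exceptions
            results = list(pool.map(work, batch))
            done += len(results)
    return done


# Stand-in for a streaming dataset: a generator that is never fully materialised
fake_stream = (i for i in range(10_000))
print(process_in_chunks(fake_stream, total=2500, chunk=1000, work=lambda x: x * x))
```

Only one chunk is held in memory at a time, which is the reason the cell uses `CHUNK` rather than loading the whole split.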

In [13]:
# Final count check
def count_images(p):
    p = Path(p)
    return sum(1 for x in p.glob("*") if x.suffix.lower() in {".jpg",".jpeg",".png",".webp"})
for split in ["train","val","test"]:
    print(split, "chart:", count_images(f"data/{split}/chart"),
                 "nonchart:", count_images(f"data/{split}/nonchart"))

train chart: 18317 nonchart: 18317
val chart: 1056 nonchart: 1056
test chart: 1509 nonchart: 1509


In [None]:
# # DocLayNet → nonchart (prefer links, no parallel saving → nearly instant)
# import os, random
# from pathlib import Path
# from datasets import load_dataset

# BASE = Path("data")
# SPLITS = ["train","val","test"]

# # 1) Read the current nonchart and chart counts and derive the target (balance against chart)
# def count_images(p: Path):
#     exts = {".png",".jpg",".jpeg",".webp"}
#     return sum(1 for x in p.glob("*") if x.suffix.lower() in exts)

# target = {
#     s: count_images(BASE/s/"chart")   # strategy: match nonchart to the chart count
#     for s in SPLITS
# }
# current = {
#     s: count_images(BASE/s/"nonchart")
#     for s in SPLITS
# }
# need = { s: max(0, target[s]-current[s]) for s in SPLITS }
# print("target:", target, "\ncurrent:", current, "\nneed:", need)

# # 2) Non-streaming load → image files are created in the local cache
# #    (once downloaded, the cache is reused on rerun)
# def get_doclaynet_paths(split: str, k: int):
#     if k <= 0: return []
#     ds = load_dataset("ds4sd/DocLayNet-v1.2", split=split)   # streaming=False (default)
#     idxs = list(range(min(k, len(ds))))
#     random.seed(42); random.shuffle(idxs)                    # light shuffle
#     paths = []
#     for i in idxs:
#         im = ds[i]["image"]          # PIL Image (loaded by datasets from the cache file)
#         p  = getattr(im, "filename", None)
#         if p and os.path.exists(p):  # use the cache file path when there is one
#             paths.append(p)
#         else:
#             # filename is occasionally missing → fall back to save later
#             paths.append(im)         # keep the PIL object as-is
#     return paths

# # 3) Link/copy/save utility
# def place(paths, out_dir: Path, prefer_symlink=True):
#     out_dir.mkdir(parents=True, exist_ok=True)
#     ln = cp = sv = 0
#     for src in paths:
#         if isinstance(src, str):         # cache file path
#             dst = out_dir / os.path.basename(src)
#             if dst.exists(): continue
#             if prefer_symlink:
#                 try:
#                     os.symlink(src, dst); ln += 1; continue
#                 except Exception:
#                     pass
#             # fallback: copy the file
#             import shutil; shutil.copy2(src, dst); cp += 1
#         else:
#             # PIL.Image object → save to disk (fallback)
#             dst = out_dir / f"doclaynet_{sv:06d}.png"
#             if dst.exists(): continue
#             src.save(dst); sv += 1
#     return ln, cp, sv

# # 4) Place the files (only as many as needed)
# for split in SPLITS:
#     k = need[split]
#     paths = get_doclaynet_paths(split, k)
#     ln, cp, sv = place(paths, BASE/split/"nonchart", prefer_symlink=True)
#     print(f"{split}: need={k}  -> linked={ln}, copied={cp}, saved={sv}")

# # 5) Final count check
# def count_images(p):
#     p = Path(p)
#     return sum(1 for x in p.glob("*") if x.suffix.lower() in {".jpg",".jpeg",".png",".webp"})
# for split in ["train","val","test"]:
#     print(split, "chart:", count_images(f"data/{split}/chart"),
#                  "nonchart:", count_images(f"data/{split}/nonchart"))

## Create the YAML file

In [14]:
import os, yaml, pathlib

DATA_ROOT = "data"  # data/{train,val,test}/{chart,nonchart}

def real(p):  # resolve symlinks and pin to an absolute path
    return os.path.realpath(p)

# sanity check: every split must contain at least one image per class
def count_imgs(d):
    exts = {".jpg",".jpeg",".png",".webp"}
    d = pathlib.Path(d)
    return sum(1 for x in d.glob("*") if x.suffix.lower() in exts)

for split in ["train","val","test"]:
    for cls in ["chart","nonchart"]:
        n = count_imgs(f"{DATA_ROOT}/{split}/{cls}")
        assert n > 0, f"[EMPTY] {split}/{cls} is empty. Fill it first."
        print(f"{split}/{cls}: {n} images")

data_yaml = {
    # For Ultralytics classification, the folder names under each split directory become the classes
    "train": real(f"{DATA_ROOT}/train"),
    "val":   real(f"{DATA_ROOT}/val"),
    "test":  real(f"{DATA_ROOT}/test"),

    # names records the expected class indices explicitly (0: chart, 1: nonchart)
    "names": {0: "chart", 1: "nonchart"}
}

pathlib.Path(DATA_ROOT).mkdir(parents=True, exist_ok=True)
with open(f"{DATA_ROOT}/data.yaml","w") as f:
    yaml.safe_dump(data_yaml, f, sort_keys=False, allow_unicode=True)

print("\n=== data.yaml ===")
print(pathlib.Path(f"{DATA_ROOT}/data.yaml").read_text())

train/chart: 18317 images
train/nonchart: 18317 images
val/chart: 1056 images
val/nonchart: 1056 images
test/chart: 1509 images
test/nonchart: 1509 images

=== data.yaml ===
train: /afh/projects/chartvllm-workspace-e50dcc49-4202-489d-ae4f-22f8078e2c70/shared/Users/t-yooyeunkim/ultralytics-yolo/data/train
val: /afh/projects/chartvllm-workspace-e50dcc49-4202-489d-ae4f-22f8078e2c70/shared/Users/t-yooyeunkim/ultralytics-yolo/data/val
test: /afh/projects/chartvllm-workspace-e50dcc49-4202-489d-ae4f-22f8078e2c70/shared/Users/t-yooyeunkim/ultralytics-yolo/data/test
names:
  0: chart
  1: nonchart



# YOLO11 classification training

In [15]:
# === YOLO11s Classification: train -> val -> test -> export ===
# Settings assume an H100 NVL. On OOM, retry with BATCH lowered 256 -> 128 -> 64.

import os, glob, json, shutil, pathlib, time
from ultralytics import YOLO

DATA_YAML = os.path.realpath("data")  # dataset root directory (data/{train,val,test}/{chart,nonchart})
MODEL     = "yolo11s-cls.pt"          # pretrained model
IMGSZ     = 224                       # default classification resolution
EPOCHS    = 30                        # 30-epoch smoke run first; raise to 50-100 if it keeps improving
WORKERS   = min(os.cpu_count() or 8, 16)
SEED      = 42

# 256 fits comfortably on an H100. On OOM, rerun with 128/64.
BATCH_CANDIDATES = [256, 128, 64]

run_name = f"chart_nonchart_cls_y11s_{int(time.time())}"

def try_train(batch):
    print(f"\n[Train] batch={batch}, epochs={EPOCHS}, imgsz={IMGSZ}, workers={WORKERS}")
    model = YOLO(MODEL)
    res = model.train(
        data=DATA_YAML,
        epochs=EPOCHS,
        imgsz=IMGSZ,
        batch=batch,
        workers=WORKERS,
        name=run_name,
        seed=SEED,
        cache=True,        # cache images in RAM to cut I/O (cache="disk" caches on disk instead)
        amp=True,          # mixed precision
        cos_lr=True,       # cosine LR schedule
        patience=10,       # early stopping
        plots=True,        # save training curves, confusion matrix, etc.
        verbose=True,
    )
    return model, res

trained = None
for b in BATCH_CANDIDATES:
    try:
        model, train_res = try_train(b)
        trained = (model, train_res)
        break
    except RuntimeError as e:
        if "CUDA out of memory" in str(e):
            print(f"OOM at batch={b}, retrying with smaller batch...")
            continue
        raise

assert trained is not None, "Training failed at all batch sizes"
model, train_res = trained


[Train] batch=256, epochs=30, imgsz=224, workers=16
New https://pypi.org/project/ultralytics/8.3.183 available 😃 Update with 'pip install -U ultralytics'
Ultralytics 8.3.182 🚀 Python-3.12.11 torch-2.8.0+cu128 CUDA:0 (NVIDIA H100 NVL, 95248MiB)
engine/trainer: agnostic_nms=False, amp=True, augment=False, auto_augment=randaugment, batch=256, bgr=0.0, box=7.5, cache=True, cfg=None, classes=None, close_mosaic=10, cls=0.5, conf=None, copy_paste=0.0, copy_paste_mode=flip, cos_lr=True, cutmix=0.0, data=/afh/projects/chartvllm-workspace-e50dcc49-4202-489d-ae4f-22f8078e2c70/shared/Users/t-yooyeunkim/ultralytics-yolo/data, degrees=0.0, deterministic=True, device=None, dfl=1.5, dnn=False, dropout=0.0, dynamic=False, embed=None, epochs=30, erasing=0.4, exist_ok=False, fliplr=0.5, flipud=0.0, format=torchscript, fraction=1.0, freeze=None, half=False, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, imgsz=224, int8=False, iou=0.7, keras=False, kobj=1.0, line_width=None, lr0=0.01, lrf=0.01, mask_rati

train: Scanning /afh/projects/chartvllm-workspace-e50dcc49-4202-489d-ae4f-22f8078e2c70/shared/Users/t-yooyeunkim/ultralytics-yolo/data/train... 18317 images, 0 corrupt: 100%|██████████| 18317/18317 [02:53<00:00, 105.44it/s]
val: Scanning /afh/projects/chartvllm-workspace-e50dcc49-4202-489d-ae4f-22f8078e2c70/shared/Users/t-yooyeunkim/ultralytics-yolo/data/val... 1056 images, 0 corrupt: 100%|██████████| 1056/1056 [00:10<00:00, 104.68it/s]
       1/30      5.36G     0.1124        141        224: 100%|██████████| 72/72 [01:27<00:00,  1.21s/it]
               classes   top1_acc   top5_acc: 100%|██████████| 3/3 [00:00<00:00,  3.86it/s]
       2/30      6.33G   6.03e-05        141        224: 100%|██████████| 72/72 [01:28<00:00,  1.22s/it]
               classes   top1_acc   top5_acc: 100%|██████████| 3/3 [00:00<00:00, 57.19it/s]
       4/30      6.37G  1.796e-05        256        224:  75%|███████▌  | 54/72 [00:57<00:06,  2.99it/s]
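
The batch-size fallback in the training cell can be exercised without a GPU by swapping the real training call for a stub. A minimal sketch follows, assuming a hypothetical wrapper `train_with_fallback` and a fake `fake_train` that stands in for `model.train()`. The error string only imitates the shape of a CUDA OOM message.

```python
def train_with_fallback(train_fn, batch_candidates):
    """Try each batch size in order; move on only when the failure looks like a CUDA OOM."""
    for batch in batch_candidates:
        try:
            return batch, train_fn(batch)
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise                      # unrelated failure: do not hide it
            print(f"OOM at batch={batch}, trying a smaller batch")
    raise RuntimeError(f"training failed at all batch sizes: {batch_candidates}")


def fake_train(batch):
    # Hypothetical stand-in for model.train(); pretends anything above 128 does not fit
    if batch > 128:
        raise RuntimeError("CUDA out of memory. Tried to allocate ...")
    return {"batch": batch}


print(train_with_fallback(fake_train, [256, 128, 64]))
```

Re-raising anything that is not an OOM keeps real bugs (bad paths, wrong arguments) from being retried three times and then reported as a memory problem.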

# Test and save result files/model

## Run the test with the best model

In [16]:
# === TEST with best.pt ===
import os, glob, json, pathlib
from ultralytics import YOLO

# 1) Find the latest run folder and weights
runs = sorted(glob.glob(f"runs/classify/{run_name}*"), key=os.path.getmtime)
assert runs, "Run folder not found."
run_dir = pathlib.Path(runs[-1])
w_best  = run_dir/"weights"/"best.pt"
w_last  = run_dir/"weights"/"last.pt"
use_w   = w_best if w_best.exists() else w_last
assert use_w.exists(), "No best/last weights found."

# 2) Run the test with the model loaded from the best weights
print(f"[TEST] using weights: {use_w}")
model_for_test = YOLO(str(use_w))

test_res = model_for_test.val(
    data=DATA_YAML,
    split="test",
    imgsz=IMGSZ,
    workers=WORKERS,
)

print("test metrics:", test_res.results_dict)

summary_dir = run_dir / "summaries"
summary_dir.mkdir(parents=True, exist_ok=True)
with open(summary_dir / "test_summary.json", "w") as f:
    json.dump({k: float(v) for k, v in test_res.results_dict.items()}, f, indent=2)

print("✅ saved:", summary_dir / "test_summary.json")

[TEST] using weights: runs/classify/chart_nonchart_cls_y11s_1755796203/weights/best.pt
Ultralytics 8.3.182 🚀 Python-3.12.11 torch-2.8.0+cu128 CUDA:0 (NVIDIA H100 NVL, 95248MiB)
YOLO11s-cls summary (fused): 47 layers, 5,436,690 parameters, 0 gradients, 12.0 GFLOPs
ERROR ❌ train: /afh/projects/chartvllm-workspace-e50dcc49-4202-489d-ae4f-22f8078e2c70/shared/Users/t-yooyeunkim/ultralytics-yolo/data/train... found 18317 images in 1 classes (requires 2 classes, not 1)
ERROR ❌ val: /afh/projects/chartvllm-workspace-e50dcc49-4202-489d-ae4f-22f8078e2c70/shared/Users/t-yooyeunkim/ultralytics-yolo/data/val... found 1056 images in 1 classes (requires 2 classes, not 1)
ERROR ❌ test: /afh/projects/chartvllm-workspace-e50dcc49-4202-489d-ae4f-22f8078e2c70/shared/Users/t-yooyeunkim/ultralytics-yolo/data/test... found 1509 images in 1 classes (requires 2 classes, not 1)
                   all          1          1
Speed: 0.0ms preprocess, 0.2ms inference, 0.0ms los

test: Scanning /afh/projects/chartvllm-workspace-e50dcc49-4202-489d-ae4f-22f8078e2c70/shared/Users/t-yooyeunkim/ultralytics-yolo/data/test... 1509 images, 0 corrupt: 100%|██████████| 1509/1509 [00:15<00:00, 97.41it/s] 
               classes   top1_acc   top5_acc:  73%|███████▎  | 69/95 [00:07<00:02, 10.90it/s]
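
The ERROR lines above report a single class per split, and the per-split totals (18317 / 1056 / 1509) equal the count of one class rather than two. That suggests one class folder was not picked up, in which case the reported accuracy would not reflect a two-class test. One possible cause, which the source does not confirm, is a directory scan that does not follow the `chart` directory symlinks. A stdlib-only sketch for checking this, independent of Ultralytics; the helper `classes_seen` and the temporary layout are hypothetical:

```python
import os
import tempfile
from pathlib import Path

IMG_EXTS = {".jpg", ".jpeg", ".png", ".webp"}


def classes_seen(split_dir, follow_links):
    """Return {class_name: image_count} for the immediate subfolders of `split_dir`."""
    counts = {}
    for entry in sorted(os.scandir(split_dir), key=lambda e: e.name):
        if not entry.is_dir(follow_symlinks=follow_links):
            continue                      # symlinked folders are skipped when follow_links=False
        n = 0
        for _, _, files in os.walk(entry.path, followlinks=follow_links):
            n += sum(1 for f in files if Path(f).suffix.lower() in IMG_EXTS)
        counts[entry.name] = n
    return counts


# Demo: one real class folder and one class folder that is a directory symlink
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    real = root / "mount_png"
    real.mkdir()
    (real / "c1.png").touch()
    split = root / "data" / "train"
    (split / "nonchart").mkdir(parents=True)
    (split / "nonchart" / "n1.jpg").touch()
    os.symlink(real, split / "chart")     # same pattern as the chart links created earlier
    print("following links:", classes_seen(split, True))
    print("not following  :", classes_seen(split, False))
```

If the two results differ on the real `data/` tree, replacing each directory symlink with a real folder of per-file links or copies is one way to rule this cause in or out.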

## Package the artifacts

In [17]:
import os, json, shutil, glob, pathlib, time
from ultralytics import __version__ as ulty_ver

# 1) Find the latest run folder and weights
runs = sorted(glob.glob(f"runs/classify/{run_name}*"), key=os.path.getmtime)
assert runs, "Run folder not found."
run_dir = pathlib.Path(runs[-1])
w_best  = run_dir/"weights"/"best.pt"
w_last  = run_dir/"weights"/"last.pt"
use_w   = w_best if w_best.exists() else w_last
assert use_w.exists(), "No best/last weights found."

# 2) Bundle reference files as well: the test summary (if present) and data.yaml
artifact_dir = pathlib.Path("model_artifact")
if artifact_dir.exists():
    shutil.rmtree(artifact_dir)
(artifact_dir/"weights").mkdir(parents=True, exist_ok=True)

shutil.copy2(use_w, artifact_dir/"weights"/use_w.name)
for extra in ["data/data.yaml", str(run_dir/"summaries"/"test_summary.json")]:
    p = pathlib.Path(extra)
    if p.exists():
        dst = artifact_dir/p.name if p.suffix else artifact_dir/p.parts[-1]
        try:
            shutil.copy2(p, dst)
        except IsADirectoryError:
            shutil.copytree(p, artifact_dir/p.name, dirs_exist_ok=True)

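# 3) Record metadata (for reproducing the run)
meta = {
    "task": "image-classification",
    "classes": {0:"chart", 1:"nonchart"},
    "imgsz": IMGSZ,      # 224, as set in the training cell
    "epochs": EPOCHS,    # 30, as set in the training cell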
    "ultralytics_version": ulty_ver,
    "source_run": str(run_dir),
    "weight_file": use_w.name,
}
with open(artifact_dir/"model_card.json","w") as f:
    json.dump(meta, f, indent=2)
print("📦 Packed:", artifact_dir.resolve())

📦 Packed: /afh/projects/chartvllm-workspace-e50dcc49-4202-489d-ae4f-22f8078e2c70/shared/Users/t-yooyeunkim/ultralytics-yolo/model_artifact
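
One detail to keep in mind when reading `model_card.json` back: JSON object keys are always strings, so the integer class indices written above come back as `"0"` and `"1"`. A minimal sketch of the round trip, with a hypothetical helper `load_class_names` that restores integer keys:

```python
import json

meta = {"classes": {0: "chart", 1: "nonchart"}, "imgsz": 224}

# json.dumps converts integer dict keys to strings
restored = json.loads(json.dumps(meta))
print(restored["classes"])                # {'0': 'chart', '1': 'nonchart'}


def load_class_names(card: dict) -> dict:
    """Rebuild an int-keyed class map from a JSON-decoded model card."""
    return {int(k): v for k, v in card["classes"].items()}


names = load_class_names(restored)
print(names[0], names[1])                 # chart nonchart
```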


## Register the model in the ML workspace

In [20]:
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Model


from dotenv import load_dotenv
import os

# .env
load_dotenv()

# get environment variables
sub_id = os.getenv("SUBSCRIPTION_ID")
rg = os.getenv("RESOURCE_GROUP")
ws_name = os.getenv("WORKSPACE_NAME")

# (A) Specify the subscription, resource group, and workspace directly
SUBSCRIPTION_ID = sub_id
RESOURCE_GROUP  = rg
WORKSPACE_NAME  = ws_name

# (B) Or, if a config.json is already set up for the notebook:
# from azure.ai.ml import MLClient
# ml_client = MLClient.from_config()

ml_client = MLClient(
    DefaultAzureCredential(),
    SUBSCRIPTION_ID,
    RESOURCE_GROUP,
    WORKSPACE_NAME,
)

model_name = "yolo11-chart-binary-cls"   # fixed name of your choice
model = Model(
    name=model_name,
    path=str(artifact_dir),             # upload the whole folder
    type="custom_model",                # custom_model suits a bundle of .pt/onnx files
    description="YOLO11s classification (chart vs nonchart). Includes weights and metadata.",
    tags={
        "task":"classification",
        "backbone":"yolo11s-cls",
        "classes":"chart,nonchart",
    },
    properties={
        "ultralytics_version": ulty_ver,
        "imgsz":"224",
    }
)

registered = ml_client.models.create_or_update(model)
print(f"✅ Registered model: {registered.name} v{registered.version}")

Uploading model_artifact (11.02 MBs): 100%|██████████| 11024766/11024766 [00:00<00:00, 38302016.46it/s]



✅ Registered model: yolo11-chart-binary-cls v1
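
`os.getenv` returns `None` for a missing variable, so a typo in `.env` only surfaces later as an error raised when the client is constructed or first used. A minimal sketch of a fail-fast check, assuming a hypothetical helper `require_env`. The demo uses an explicit mapping with placeholder values rather than real identifiers.

```python
import os


def require_env(*names, environ=None):
    """Return the requested variables, failing early with the list of missing ones."""
    environ = os.environ if environ is None else environ
    missing = [n for n in names if not environ.get(n)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return tuple(environ[n] for n in names)


# Demo with an explicit mapping (placeholder values, not real identifiers)
demo_env = {
    "SUBSCRIPTION_ID": "sub-placeholder",
    "RESOURCE_GROUP": "rg-placeholder",
    "WORKSPACE_NAME": "ws-placeholder",
}
sub_id, rg, ws_name = require_env(
    "SUBSCRIPTION_ID", "RESOURCE_GROUP", "WORKSPACE_NAME", environ=demo_env
)
print(sub_id, rg, ws_name)
```

In the notebook this would be called after `load_dotenv()` without the `environ` argument, so that it reads the process environment.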


### (Optional) Export to ONNX/TorchScript


In [21]:
best_model = YOLO(str(use_w))
print("\n[EXPORT] onnx (fp32)")
best_model.export(format="onnx", imgsz=IMGSZ)
print("[EXPORT] torchscript")
best_model.export(format="torchscript", imgsz=IMGSZ)

print("\nDone. Exported Artifacts in:", run_dir)


[EXPORT] onnx (fp32)
Ultralytics 8.3.182 🚀 Python-3.12.11 torch-2.8.0+cu128 CPU (AMD EPYC 9V84 96-Core Processor)
YOLO11s-cls summary (fused): 47 layers, 5,436,690 parameters, 0 gradients, 12.0 GFLOPs

PyTorch: starting from 'runs/classify/chart_nonchart_cls_y11s_1755796203/weights/best.pt' with input shape (1, 3, 224, 224) BCHW and output shape(s) (1, 2) (10.5 MB)
requirements: Ultralytics requirements ['onnx>=1.12.0,<1.18.0', 'onnxslim>=0.1.59', 'onnxruntime-gpu'] not found, attempting AutoUpdate...
Collecting onnxslim>=0.1.59
  Downloading onnxslim-0.1.64-py3-none-any.whl (164 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 164.0/164.0 kB 37.2 MB/s eta 0:00:00
Collecting onnxruntime-gpu
  Downloading onnxruntime_gpu-1.22.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (283.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 283.2/283.2 MB 166.0 MB/s eta 0:00:00
Installing collected packages: onnxslim, onnxruntime-gpu
Successfully installed onnxr

ONNX: starting export with onnx 1.18.0 opset 19...
WARNING ⚠️ ONNX: simplifier failure: No module named 'onnxslim'
ONNX: export success ✅ 12.8s, saved as 'runs/classify/chart_nonchart_cls_y11s_1755796203/weights/best.onnx' (20.8 MB)

Export complete (13.0s)
Results saved to /afh/projects/chartvllm-workspace-e50dcc49-4202-489d-ae4f-22f8078e2c70/shared/Users/t-yooyeunkim/ultralytics-yolo/runs/classify/chart_nonchart_cls_y11s_1755796203/weights
Predict:         yolo predict task=classify model=runs/classify/chart_nonchart_cls_y11s_1755796203/weights/best.onnx imgsz=224  
Validate:        yolo val task=classify model=runs/classify/chart_nonchart_cls_y11s_1755796203/weights/best.onnx imgsz=224 data=/afh/projects/chartvllm-workspace-e50dcc49-4202-489d-ae4f-22f8078e2c70/shared/Users/t-yooyeunkim/ultralytics-yolo/data  
Visualize:       https://netron.app
[EXPORT] torchscript
Ultralytics 8.3.182 🚀 Python-3.12.11 torch-2.8.0+cu128 CPU (AMD EPYC 9V84 96-Core Processor)
YOLO11s-cls summary (fused): 47 layers, 5,436,690 parameters, 0 gradients, 12.0 GFLOPs

PyTorch: starting from 'runs/classify/chart_nonchart_cls_y11s_1755796203/weights/best.pt' with input shape (1, 3, 224, 224) BCHW and output shape(s) (1, 2) (10.5 MB)

TorchScript: starting export with torch 2.8.0+cu128...
TorchScript: export success ✅ 0.7s, saved as 'runs/classify/chart_nonchart_cls_y11s_1755796203/weights/best.torchscript' (20.9 MB)

Export complete (0.9s)
Results saved to /afh/projects/chartvllm-workspace-e50dcc49-4202-489d-ae4f-22f8078e2c70/shared/Users/t-yooyeunkim/ultralytics-yolo/runs/classify/chart_nonchart_cls_y11s_1755796203/weights
Predict:         yolo predict task=classify model=runs/classify/chart_nonchart_cls_y11s_1755796203/weights/best.torchscript imgsz=224  
Validate:        yolo val task=classify model=runs/classify/chart_nonchart_cls_y11s_1755796203/weights/best.torchscript imgsz=224 data=/afh/projects/chartvllm-workspace-e50dcc49-4202-489d-ae4f-22f8078e2c70/shared/Users/t-yooyeunkim/ultralytics-yolo/data  
Visualize:       https://netron.app

Done. Exported Artifacts in: runs/classify/chart_nonchart_cls_y11s_1755796203
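
The export log reports an output shape of `(1, 2)`, that is, one score per class for each image. Whichever runtime loads the exported file, turning one output row into a label is an argmax over two values. The sketch below is pure Python and uses made-up scores. The helper `top1` is hypothetical, and the class map is the one recorded in the model card.

```python
def top1(scores, names):
    """Return (class_name, score) for the largest entry of one output row."""
    idx = max(range(len(scores)), key=scores.__getitem__)
    return names[idx], scores[idx]


NAMES = {0: "chart", 1: "nonchart"}
row = [0.93, 0.07]   # made-up values shaped like one row of the (1, 2) output
print(top1(row, NAMES))
```

The argmax gives the same label whether the two values are raw logits or probabilities, so this step does not depend on how the exported graph normalises its output.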
