## Common Training/Validation Workflow

In [1]:
from ultralytics import YOLO
from IPython.display import Image
import cv2
import time
import os

### Dataset Download

- Roboflow Universe Dataset  
Download the dataset through the Roboflow API from this [link](https://universe.roboflow.com/text-detector/text-detector-2/dataset/2).  
The dataset will be stored at `datasets/{project-name}`. Roboflow provides a ready-to-use dataset, complete with splits and annotations.  
Below is example code for downloading the dataset locally.

In [None]:
# from roboflow import Roboflow

# rf = Roboflow(api_key="0n9ducziAw8kQIGHk25M")
# project = rf.workspace("text-detector").project("text-detector-2")
# dataset = project.version(2).download("yolov8")
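
Hardcoding an API key in a notebook makes it easy to leak. Below is a minimal sketch of reading the key from an environment variable instead. The variable name `ROBOFLOW_API_KEY` and both helper names are assumptions for illustration; the Roboflow calls simply mirror the commented cell above.

```python
import os


def get_roboflow_api_key(env_var="ROBOFLOW_API_KEY"):
    """Return the API key from an environment variable (the name is an assumption)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before downloading.")
    return key


def download_dataset():
    """Same calls as the cell above, with the key taken from the environment."""
    # Imported lazily so the helper above works without the roboflow package
    from roboflow import Roboflow

    rf = Roboflow(api_key=get_roboflow_api_key())
    project = rf.workspace("text-detector").project("text-detector-2")
    return project.version(2).download("yolov8")
```

Calling `download_dataset()` then behaves like the commented cell, provided the variable is set in the shell that launched the notebook.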

- Custom Dataset  
For the custom dataset, I will use the Kaggle [TextOCR](https://www.kaggle.com/datasets/robikscube/textocr-text-extraction-from-images-dataset) dataset.  
Download the dataset (roughly 7 GB) and store it in the `datasets` directory.  
Make sure to organize the files and folders accordingly.

Generate a custom dataset in YOLOv8 format from the downloaded Kaggle data with the `yolov8datagen.py` script:

```
python yolov8datagen.py --source_dir --dest_dir --total_images --density --split
```

Refer to `yolov8datagen.py` for full script arguments and outputs.
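
For reference, YOLO-format labels store one object per line as `class x_center y_center width height`, with all coordinates normalized by the image size. The helper below is a hypothetical sketch of that conversion, assuming a pixel-space box given as top-left corner plus size. It is not taken from `yolov8datagen.py`, whose internals may differ.

```python
def to_yolo_label(box_xywh, image_width, image_height, class_id=0):
    """Convert a pixel box (x, y, w, h) with top-left origin into a YOLO label line."""
    x, y, w, h = box_xywh
    # YOLO expects the box centre, not the top-left corner
    x_center = (x + w / 2) / image_width
    y_center = (y + h / 2) / image_height
    # Width and height are normalized by the image size as well
    return (
        f"{class_id} {x_center:.6f} {y_center:.6f} "
        f"{w / image_width:.6f} {h / image_height:.6f}"
    )


print(to_yolo_label((100, 50, 200, 100), image_width=800, image_height=400))
# 0 0.250000 0.250000 0.250000 0.250000
```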

### Check GPU Compatibility

In [2]:
!nvidia-smi

Wed Apr 23 12:28:10 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA RTX A4000               Off |   00000000:52:00.0  On |                  Off |
| 67%   85C    P2            127W /  140W |   10657MiB /  16376MiB |     80%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
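
The same check can be made from Python before training. This is a small sketch, assuming PyTorch is installed (Ultralytics depends on it); `pick_device` is a hypothetical helper name.

```python
def pick_device():
    """Return 0 (first CUDA GPU) when CUDA is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU backend to query
        return "cpu"
    return 0 if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Training device: {device}")
```

The returned value can be passed as the `device` argument of `model.train`.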
                                                

## Training Loop

The training loop of YOLOv8 using Ultralytics API is fairly simple:

1. Initialize the YOLO model with the chosen architecture. (`yolov8n.pt` refers to the nano-sized model; refer to [this link](https://docs.ultralytics.com/modes/) for more information.)  
   - The code downloads the pretrained weights if they do not already exist locally.

2. Use the `train` method with the following arguments:
   - `data`: The path to the dataset configuration, a `data.yaml` file stored in the `datasets` directory.

3. Set the hyperparameters and the device (use `0` for the first CUDA GPU); the results are saved automatically.
   - Setting the `optimizer` argument to `auto` (the default) lets YOLO choose the optimizer and some hyperparameters, such as learning rate and momentum.
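
Step 2 relies on a `data.yaml` file. Below is a minimal sketch of what such a file might contain for a single-class text dataset and how to write it. The `train/images` and `val/images` folder names and the `write_data_yaml` helper are assumptions for illustration; adjust them to the actual layout produced by the dataset scripts.

```python
from pathlib import Path


def write_data_yaml(dataset_dir, class_names=("text",)):
    """Write a minimal Ultralytics-style data.yaml into an existing dataset_dir."""
    dataset_dir = Path(dataset_dir)
    lines = [
        f"path: {dataset_dir.resolve()}",  # dataset root
        "train: train/images",             # assumed layout, relative to path
        "val: val/images",                 # assumed layout, relative to path
        "names:",
    ]
    # One "index: name" entry per class
    lines += [f"  {index}: {name}" for index, name in enumerate(class_names)]
    yaml_path = dataset_dir / "data.yaml"
    yaml_path.write_text("\n".join(lines) + "\n")
    return yaml_path
```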


In [4]:
model = YOLO('yolov8n.pt')
datapath = os.path.abspath("../datasets/ctw_and_textocr/data.yaml")
results = model.train(data=datapath, epochs=20, imgsz=640, device=0, batch=8)

Ultralytics 8.3.100 🚀 Python-3.10.13 torch-2.5.1 CUDA:0 (NVIDIA RTX A4000, 15977MiB)
[34m[1mengine/trainer: [0mtask=detect, mode=train, model=yolov8n.pt, data=/home/akiko/work/YOLOv8-CRNN-Scene-Text-Recognition/datasets/ctw_and_textocr/data.yaml, epochs=20, time=None, patience=100, batch=8, imgsz=640, save=True, save_period=-1, cache=False, device=0, workers=8, project=None, name=train2, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_

[34m[1mtrain: [0mScanning /home/akiko/work/YOLOv8-CRNN-Scene-Text-Recognition/datasets/ctw_and_textocr/train/labels.cache... 28790 images, 0 backgrounds, 90 corrupt: 100%|██████████| 28790/28790 [00:00<?, ?it/s]




[34m[1mval: [0mScanning /home/akiko/work/YOLOv8-CRNN-Scene-Text-Recognition/datasets/ctw_and_textocr/val/labels.cache... 1847 images, 0 backgrounds, 5 corrupt: 100%|██████████| 1847/1847 [00:00<?, ?it/s]






Plotting labels to runs/detect/train2/labels.jpg... 
[34m[1moptimizer:[0m 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically... 
[34m[1moptimizer:[0m AdamW(lr=0.002, momentum=0.9) with parameter groups 57 weight(decay=0.0), 64 weight(decay=0.0005), 63 bias(decay=0.0)
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to [1mruns/detect/train2[0m
Starting training for 20 epochs...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size


       1/20      3.36G      2.304      1.766      1.081        122        640: 100%|██████████| 3588/3588 [06:24<00:00,  9.33it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 116/116 [00:23<00:00,  5.00it/s]


                   all       1842      61736      0.462      0.268      0.264      0.118

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size


       2/20      3.94G      2.137      1.419      1.027        126        640: 100%|██████████| 3588/3588 [06:17<00:00,  9.52it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 116/116 [00:22<00:00,  5.08it/s]


                   all       1842      61736      0.527      0.303      0.311      0.146

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size


       3/20      4.89G      2.073       1.36       1.01         55        640: 100%|██████████| 3588/3588 [06:13<00:00,  9.61it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 116/116 [00:22<00:00,  5.04it/s]


                   all       1842      61736      0.561      0.327      0.339      0.163

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size


       4/20      4.89G      2.008      1.303     0.9982         83        640: 100%|██████████| 3588/3588 [04:19<00:00, 13.80it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 116/116 [00:08<00:00, 14.31it/s]


                   all       1842      61736      0.563      0.341      0.357      0.175

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size


       5/20      4.89G      1.946      1.251     0.9876         44        640: 100%|██████████| 3588/3588 [03:29<00:00, 17.15it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 116/116 [00:08<00:00, 14.10it/s]


                   all       1842      61736      0.591      0.354      0.373      0.184

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size


       6/20      4.89G      1.907      1.211     0.9775        602        640: 100%|██████████| 3588/3588 [03:28<00:00, 17.20it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 116/116 [00:08<00:00, 14.04it/s]


                   all       1842      61736      0.598      0.359      0.382      0.189

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size


       7/20      4.89G      1.872      1.187     0.9727         70        640: 100%|██████████| 3588/3588 [03:31<00:00, 16.95it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 116/116 [00:08<00:00, 14.05it/s]


                   all       1842      61736      0.611      0.368      0.395      0.198

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size


       8/20      4.89G      1.847      1.159     0.9642         87        640: 100%|██████████| 3588/3588 [03:31<00:00, 16.94it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 116/116 [00:08<00:00, 14.19it/s]


                   all       1842      61736      0.618      0.377      0.403      0.205

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size


       9/20      4.89G      1.814       1.13      0.961        282        640: 100%|██████████| 3588/3588 [03:28<00:00, 17.17it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 116/116 [00:08<00:00, 13.71it/s]


                   all       1842      61736       0.62      0.384      0.413      0.211

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size


      10/20      4.89G      1.797      1.116     0.9552        183        640: 100%|██████████| 3588/3588 [03:28<00:00, 17.17it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 116/116 [00:08<00:00, 13.99it/s]


                   all       1842      61736      0.624      0.389      0.419      0.215
Closing dataloader mosaic

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size


      11/20      4.89G      1.756      1.079     0.9413        107        640: 100%|██████████| 3588/3588 [03:22<00:00, 17.71it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 116/116 [00:08<00:00, 14.21it/s]


                   all       1842      61736      0.641      0.388      0.423      0.217

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size


      12/20      4.89G      1.737      1.059     0.9372        129        640: 100%|██████████| 3588/3588 [03:24<00:00, 17.56it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 116/116 [00:08<00:00, 13.94it/s]


                   all       1842      61736      0.635      0.391      0.425      0.219

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size


      13/20      4.89G      1.723      1.047     0.9339         70        640: 100%|██████████| 3588/3588 [03:23<00:00, 17.67it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 116/116 [00:08<00:00, 13.91it/s]


                   all       1842      61736      0.645      0.394      0.431      0.223

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size


      14/20      4.89G      1.702      1.029     0.9312        101        640: 100%|██████████| 3588/3588 [03:22<00:00, 17.69it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 116/116 [00:08<00:00, 13.93it/s]


                   all       1842      61736      0.637      0.399      0.433      0.224

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size


      15/20      4.89G      1.686       1.01     0.9281        153        640: 100%|██████████| 3588/3588 [03:22<00:00, 17.72it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 116/116 [00:08<00:00, 14.13it/s]


                   all       1842      61736       0.65      0.399      0.437      0.227

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size


      16/20      4.89G      1.671      0.998     0.9257         43        640: 100%|██████████| 3588/3588 [03:22<00:00, 17.68it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 116/116 [00:08<00:00, 14.33it/s]


                   all       1842      61736      0.653      0.401       0.44      0.229

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size


      17/20      4.89G       1.66     0.9862     0.9256         73        640: 100%|██████████| 3588/3588 [03:23<00:00, 17.67it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 116/116 [00:08<00:00, 14.28it/s]


                   all       1842      61736      0.653      0.404      0.443      0.231

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size


      18/20      4.89G      1.652     0.9765     0.9213        121        640: 100%|██████████| 3588/3588 [03:22<00:00, 17.69it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 116/116 [00:08<00:00, 14.17it/s]


                   all       1842      61736      0.651      0.407      0.444      0.232

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size


      19/20      4.89G      1.631     0.9583     0.9183        154        640: 100%|██████████| 3588/3588 [03:22<00:00, 17.73it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 116/116 [00:07<00:00, 14.66it/s]


                   all       1842      61736      0.659      0.405      0.446      0.233

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size


      20/20      4.89G      1.621      0.949     0.9178        273        640: 100%|██████████| 3588/3588 [03:22<00:00, 17.69it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 116/116 [00:08<00:00, 13.89it/s]


                   all       1842      61736       0.66      0.406      0.447      0.234

20 epochs completed in 1.362 hours.
Optimizer stripped from runs/detect/train2/weights/last.pt, 6.2MB
Optimizer stripped from runs/detect/train2/weights/best.pt, 6.2MB

Validating runs/detect/train2/weights/best.pt...
Ultralytics 8.3.100 🚀 Python-3.10.13 torch-2.5.1 CUDA:0 (NVIDIA RTX A4000, 15977MiB)
Model summary (fused): 72 layers, 3,005,843 parameters, 0 gradients


                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 116/116 [00:09<00:00, 12.70it/s]


                   all       1842      61736       0.66      0.406      0.447      0.234
Speed: 0.1ms preprocess, 1.0ms inference, 0.0ms loss, 0.5ms postprocess per image
Results saved to runs/detect/train2
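
As a sanity check on the log above, the progress-bar totals (3588 training and 116 validation batches) can be reproduced from the dataset sizes. This sketch assumes corrupt images are dropped before batching and that the validation loader uses twice the training batch size. The second assumption is not stated in the log, but it is consistent with the 116 batches reported there.

```python
import math


def batches_per_epoch(num_images, num_corrupt, batch_size):
    """Number of batches needed to cover the usable images once."""
    usable = num_images - num_corrupt
    # The last batch may be partially filled, hence the ceiling
    return math.ceil(usable / batch_size)


train_batches = batches_per_epoch(28790, 90, batch_size=8)
val_batches = batches_per_epoch(1847, 5, batch_size=16)  # assumed 2 x train batch
print(train_batches, val_batches)
# 3588 116
```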


## Training Evaluation

The training evaluation results are auto-generated by YOLO and are saved in the `runs` directory by default. Below are some important results to analyze.

- `confusion_matrix.png` shows the matrix of predictions against ground truth. For text detection there is only one class (`text`).  
Pay particular attention to precision and recall, which remain informative when classes are imbalanced.

In [None]:
Image(filename='runs/detect/train10/confusion_matrix.png', width=600)
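
Precision and recall can be combined into a single F1 score, their harmonic mean. As a small worked example, the sketch below uses the final-epoch values from the training log (P = 0.66, R = 0.406); `f1_score` is a helper defined here, not an Ultralytics function.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(0.66, 0.406), 3))
# 0.503
```

The F1 score sits closer to the lower of the two values, which here reflects the relatively low recall.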

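- `results.png` plots the losses and metrics for each training epoch. The training loss has three components: `box_loss`, `cls_loss`, and `dfl_loss` (the columns in the training log above).  
A common evaluation metric is mAP (mean Average Precision), where the Average Precision of each class is the area under its precision-recall curve.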

In [None]:
Image(filename='runs/detect/train10/results.png', width=600)
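
To make the "area under the precision-recall curve" concrete, here is a simplified sketch of Average Precision for one class. It assumes the detections are already sorted by descending confidence and flagged as true or false positives at a fixed IoU threshold. Ultralytics computes its own interpolated variant, so the numbers will not match its output exactly.

```python
def average_precision(is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for confidence-sorted detections."""
    tp = fp = 0
    recalls, precisions = [0.0], [1.0]
    for hit in is_true_positive:
        if hit:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Precision envelope: make precision non-increasing as recall grows
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangles between consecutive recall values
    return sum(
        (recalls[i] - recalls[i - 1]) * precisions[i]
        for i in range(1, len(recalls))
    )


print(average_precision([True, False, True, True], num_ground_truth=4))
# 0.625
```

Averaging this value over all classes gives mAP at that IoU threshold; with the single `text` class, mAP and AP coincide.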

## Validation Loop

In this validation loop, we validate the trained model.  
Use the best checkpoint, `best.pt`, from the training results in the `runs` directory.

In [None]:
model = YOLO('runs/detect/train10/weights/best.pt')

The code below opens the webcam and runs inference continuously; the on-screen FPS counter gives a measure of inference performance.

In [None]:
cap = cv2.VideoCapture(0)  # open the default webcam

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        # Stop if the camera returns no frame instead of looping forever
        break

    # Time only the model call so the FPS reflects inference speed
    start = time.perf_counter()
    results = model(frame)
    end = time.perf_counter()
    fps = 1 / (end - start)

    # Draw the detections and overlay the FPS counter
    annotated_frame = results[0].plot()
    cv2.putText(annotated_frame, f"FPS: {int(fps)}", (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 2)
    cv2.imshow("YOLOv8 Inference", annotated_frame)

    # Press 'q' to quit
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
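
A per-frame FPS value tends to jitter. One option is to average over the last few frames; the `RollingFPS` class below is a hypothetical helper sketched for that purpose, not part of Ultralytics or OpenCV.

```python
from collections import deque


class RollingFPS:
    """Average FPS over the most recent `window` frame durations."""

    def __init__(self, window=30):
        # deque drops the oldest duration automatically once full
        self.durations = deque(maxlen=window)

    def update(self, seconds):
        """Record one frame duration and return the smoothed FPS."""
        self.durations.append(seconds)
        total = sum(self.durations)
        return len(self.durations) / total if total > 0 else 0.0
```

Inside the loop, create the meter once before `while` and replace the FPS computation with `fps = meter.update(end - start)`.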

Finally, we can validate the model on the validation dataset using the code below. The model remembers the dataset configuration used during training, so no arguments are needed.

In [None]:
metrics = model.val()
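
The object returned by `model.val()` exposes the box metrics as attributes; per the Ultralytics documentation, `metrics.box.map50` holds mAP50 and `metrics.box.map` holds mAP50-95. The sketch below formats them with a hypothetical `summarize_box_metrics` helper. The stand-in object carries the final values from the training log so the example runs without a trained model.

```python
from types import SimpleNamespace


def summarize_box_metrics(metrics):
    """Format mAP50 and mAP50-95 from a validation result object."""
    return f"mAP50={metrics.box.map50:.3f}, mAP50-95={metrics.box.map:.3f}"


# Stand-in for the result of model.val(); in the notebook pass `metrics` instead
fake_metrics = SimpleNamespace(box=SimpleNamespace(map50=0.447, map=0.234))
print(summarize_box_metrics(fake_metrics))
# mAP50=0.447, mAP50-95=0.234
```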