# Downloading The BDD100k Dataset

To download the 10k subdataset of the BDD100k dataset, we go through Kaggle. Here are the steps to configure access and download the targeted dataset.


First, go to your Kaggle account settings and download the API token file, `kaggle.json`. Then move it into the `.kaggle` folder in your user directory:

```powershell
# Create the .kaggle directory in your user folder
New-Item -ItemType Directory -Force "$env:USERPROFILE\.kaggle"

# Move kaggle.json from Downloads to .kaggle
Move-Item "$env:USERPROFILE\Downloads\kaggle.json" "$env:USERPROFILE\.kaggle\"
```

<small>If your user folder or your Downloads folder is somewhere else, replace `$env:USERPROFILE` with the correct directory.</small>

Then install the `kagglehub` package with this command (inside a notebook cell, prefix it with `!`):

```bash
pip install kagglehub
```


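Before downloading, it can help to confirm that the token file ended up where it is expected. The following is a minimal sketch, assuming the token lives at `~/.kaggle/kaggle.json` and contains the usual `username` and `key` fields; the helper name `kaggle_token_ok` is made up for this example.

```python
import json
from pathlib import Path


def kaggle_token_ok(token_path: Path) -> bool:
    """Return True if token_path is a JSON file with non-empty 'username' and 'key'."""
    if not token_path.is_file():
        return False
    try:
        data = json.loads(token_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return False
    return isinstance(data, dict) and bool(data.get("username")) and bool(data.get("key"))


# Assumption: the token was moved to the .kaggle folder in the user directory.
default_token = Path.home() / ".kaggle" / "kaggle.json"
print("Kaggle token found and readable:", kaggle_token_ok(default_token))
```

If this prints `False`, check that the file was actually moved and that it was not renamed by the browser (for example to `kaggle (1).json`).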
Then download the dataset with the code below, or go to this link and download it manually: [Dataset](https://www.kaggle.com/datasets/marquis03/bdd100k)

In [None]:
import kagglehub

# Download the latest version; the call returns the local path to the files
path = kagglehub.dataset_download("marquis03/bdd100k")
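The later cells use a relative `bdd100k` folder, so it is worth looking at what the download location actually contains before moving on. This is a small sketch with a hypothetical helper, `summarize_dir`; it is shown on the current directory so it runs anywhere, and you would pass the `path` value from the cell above instead.

```python
import os


def summarize_dir(root, max_entries=10):
    """Return up to max_entries (name, kind) pairs for the top level of root."""
    entries = []
    for name in sorted(os.listdir(root))[:max_entries]:
        kind = "dir" if os.path.isdir(os.path.join(root, name)) else "file"
        entries.append((name, kind))
    return entries


# Assumption: replace "." with the `path` returned by kagglehub.dataset_download.
dataset_path = "."
for name, kind in summarize_dir(dataset_path):
    print(f"{kind:4} {name}")
```

If the listing does not show the `train`, `val`, and `test` folders used later, point `bdd_root` at the folder that does contain them.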


After downloading the dataset, we need to unzip it (this applies when you have the `.zip` archive, for example from a manual download):


In [None]:
import os
import zipfile

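# Path to your downloaded zip file (example name; adjust to where you saved it)
zip_path = "bdd100k.zip"

# Folder to unzip into: the archive path without ".zip", e.g. "bdd100k"
extract_dir = os.path.splitext(zip_path)[0]
os.makedirs(extract_dir, exist_ok=True)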

# Unzip
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)

print(f"Unzipped into: {extract_dir}")
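The unzip step fails with a confusing error if the path does not point to a real archive. A slightly more defensive variant is sketched below; `extract_if_zip` is a hypothetical helper and `bdd100k.zip` is only a placeholder file name.

```python
import os
import zipfile


def extract_if_zip(src, dest):
    """Extract src into dest if src is a zip archive; return True if extracted."""
    if not (os.path.isfile(src) and zipfile.is_zipfile(src)):
        return False
    os.makedirs(dest, exist_ok=True)
    with zipfile.ZipFile(src, "r") as zf:
        zf.extractall(dest)
    return True


# Placeholder names; point these at your own download and target folder.
if extract_if_zip("bdd100k.zip", "bdd100k"):
    print("Extracted archive")
else:
    print("Nothing to extract (file not found or not a zip archive)")
```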

In [6]:

import os
import json
import shutil
from tqdm import tqdm

# Path where you extracted the dataset
bdd_root = "bdd100k"

# Images
train_images = os.path.join(bdd_root, "train", "images")
val_images   = os.path.join(bdd_root, "val", "images")
test_images  = os.path.join(bdd_root, "test")

# Labels
train_labels = os.path.join(bdd_root,"train", "annotations","bdd100k_labels_images_train.json")
val_labels   = os.path.join(bdd_root, "val", "annotations","bdd100k_labels_images_val.json")

Check the data structure:


In [7]:
# Count images
print("Train Images:", len(os.listdir(train_images)))
print("Validation Images:", len(os.listdir(val_images)))
print("Test Images:", len(os.listdir(test_images)))

train_images_dir = os.path.join(bdd_root, "train", "images")
train_labels_file = os.path.join(bdd_root, "train","annotations",  "bdd100k_labels_images_train.json")  # adjust filename

# Count images
num_images = len(os.listdir(train_images_dir))

# Load annotations
with open(train_labels_file, "r") as f:
    annotations = json.load(f)

num_annotations = len(annotations)

print("Train Images:", num_images)
print("Train Annotations (entries in JSON):", num_annotations)

Train Images: 70000
Validation Images: 10000
Test Images: 20000
Train Images: 70000
Train Annotations (entries in JSON): 69863
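The two counts differ by 137 (70000 images against 69863 JSON entries). To see which files are affected, the image names can be compared against the `name` field of each annotation entry. The sketch below uses made-up toy data and a hypothetical helper, `find_unannotated`.

```python
def find_unannotated(image_names, annotations):
    """Return sorted image file names that have no entry in the annotation list."""
    annotated = {ann["name"] for ann in annotations}
    return sorted(set(image_names) - annotated)


# Toy data (made up) shaped like the BDD100k JSON entries: a list of dicts with "name".
toy_images = ["a.jpg", "b.jpg", "c.jpg"]
toy_annotations = [{"name": "a.jpg", "labels": []}, {"name": "c.jpg", "labels": []}]
print(find_unannotated(toy_images, toy_annotations))  # ['b.jpg']
```

On the real data, the call would be `find_unannotated(os.listdir(train_images_dir), annotations)`.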


So some training images have no annotation entry: there are 70000 images but only 69863 entries in the JSON file. We can keep these images and label them as "no content".

Next, we convert the labels to YOLO format. This can take a while, since we open every image and write a `.txt` file for it.

In [None]:
import os, json
import cv2   # keep full cv2
from PIL import Image

# Paths
bdd_root = r"bdd100k"
images_dir = os.path.join(bdd_root, "train", "images")
labels_json = os.path.join(bdd_root, "train","annotations", "bdd100k_labels_images_train.json")
output_labels = os.path.join(bdd_root, "train", "labels_yolo")

os.makedirs(output_labels, exist_ok=True)

# Load annotations
with open(labels_json, "r") as f:
    annotations = json.load(f)

# Class mapping
classes = ["pedestrian", "rider", "car", "truck", "bus", "train", 
           "motorcycle", "bicycle", "traffic light", "traffic sign"]

# Build quick lookup
ann_dict = {ann["name"]: ann for ann in annotations}

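# Loop over all images
for img_name in os.listdir(images_dir):
    # The label file shares the image's base name, with a .txt extension
    label_path = os.path.join(output_labels, os.path.splitext(img_name)[0] + ".txt")

    if img_name not in ann_dict:
        # No annotation entry: write an empty label file ("no content")
        open(label_path, "w").close()
        continue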

    img_ann = ann_dict[img_name]

    # Get image size efficiently
    img_path = os.path.join(images_dir, img_name)
    with Image.open(img_path) as im:
        W, H = im.size

    with open(label_path, "w") as f_out:
        for label in img_ann.get("labels", []):
            if "box2d" not in label:
                continue
            cls = label["category"]
            if cls not in classes:
                continue
            cls_id = classes.index(cls)

            # Safe extraction
            box = label["box2d"]
            x1, y1, x2, y2 = box["x1"], box["y1"], box["x2"], box["y2"]

            w = x2 - x1
            h = y2 - y1
            x_center = x1 + w / 2
            y_center = y1 + h / 2

            # Normalize
            x_center /= W
            y_center /= H
            w /= W
            h /= H

            f_out.write(f"{cls_id} {x_center:.6f} {y_center:.6f} {w:.6f} {h:.6f}\n")

print("Conversion done! YOLO labels saved at:", output_labels)
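The box arithmetic inside the loop above can be checked in isolation. The sketch below repeats the same steps (corner coordinates to center, width, and height, then division by the image size) in a standalone function; the name `box2d_to_yolo` and the sample numbers are made up for illustration.

```python
def box2d_to_yolo(x1, y1, x2, y2, img_w, img_h):
    """Convert corner coordinates in pixels to normalized (x_center, y_center, w, h)."""
    w = x2 - x1
    h = y2 - y1
    x_center = x1 + w / 2
    y_center = y1 + h / 2
    return x_center / img_w, y_center / img_h, w / img_w, h / img_h


# Made-up box in a 1280x720 image
xc, yc, w, h = box2d_to_yolo(100, 200, 300, 400, 1280, 720)
print(f"{xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")  # 0.156250 0.416667 0.156250 0.277778
```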


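After the conversion, a quick sanity check on the generated label lines can catch problems early. This is a rough sketch, assuming each line has the form `class_id x_center y_center w h` as written by the loop above; `yolo_line_problems` is a hypothetical helper, not part of any library.

```python
def yolo_line_problems(line, num_classes):
    """Return a list of problems found in one YOLO label line (empty list if none)."""
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, got {len(parts)}"]
    try:
        cls_id = int(parts[0])
        values = [float(p) for p in parts[1:]]
    except ValueError:
        return ["non-numeric field"]
    problems = []
    if not 0 <= cls_id < num_classes:
        problems.append(f"class id {cls_id} out of range")
    if any(v < 0.0 or v > 1.0 for v in values):
        problems.append("coordinate outside [0, 1]")
    return problems


# 10 classes, matching the length of the `classes` list in the conversion cell
print(yolo_line_problems("2 0.156250 0.416667 0.156250 0.277778", 10))  # []
print(yolo_line_problems("12 0.5 1.2 0.1 0.1", 10))
```

To check real output, read each `.txt` file in `labels_yolo` and run the function on every non-empty line; empty files are valid and mean the image has no objects.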