# Step-by-step Ingestion of the Adver-City Dataset 

This notebook walks through ingesting the Adver-City dataset in the following steps:

## Table of Contents

1. [Configure](#1-configure)
2. [Download from Remote Server](#2-download-from-remote-server)
3. [Develop Manifest](#3-develop-manifest)
4. [Build a Sampling Plan](#4-build-a-sampling-plan)
5. [Extract based on Sampling Plan](#5-extract-based-on-sampling-plan)
6. [Labelling and Metadata](#6-labelling-and-metadata)
7. [Generate Train/Val/Test sets](#7-generate-trainvaltest-sets)
8. [Conclusion](#8-conclusion)

In [1]:
# standard library imports
import json
from pathlib import Path
import os
import random

# define root directory
cwd = Path.cwd()
if cwd.name == "notebooks":
    os.chdir(cwd.parent)
    print("changed to root directory")
else:
    print("already in project root")

# custom imports
from src.ingestion import download, archive, sample, extract, label, split


changed to root directory


## 1. Configure

Import necessary paths and configurations from `config.json`.

Note: all paths are expressed as `pathlib.Path` objects.

In [2]:
PROJECT_ROOT = Path.cwd()
CONFIG_PATH = PROJECT_ROOT / "config" / "config.json"

# pull configuration files
with open(CONFIG_PATH, "r") as f:
    cfg = json.load(f)

# data paths
DATA_ROOT   = PROJECT_ROOT / cfg["data_paths"]["root"]      # root directory for all data 
RAW_DIR     = PROJECT_ROOT / cfg["data_paths"]["raw"]       # where the raw data is stored
INDEX_DIR   = PROJECT_ROOT / cfg["data_paths"]["index"]     # where the data index is stored
SAMPLED_DIR = PROJECT_ROOT / cfg["data_paths"]["sampled"]   # where the sampled data (i.e., subset) is stored
READY_DIR   = PROJECT_ROOT / cfg["data_paths"]["ready"]     # where the training/val/test sets are stored

# make the directories 
for p in [RAW_DIR, INDEX_DIR, SAMPLED_DIR, READY_DIR]:
    p.mkdir(parents=True, exist_ok=True)

# archive info
BASE_URL    = cfg["ingestion"]["url"]                       # remote location of data archive
ARCHIVE_EXT = cfg["ingestion"]["archive_extension"]         # archive file extension 
MANIFEST_MODE = cfg["ingestion"]["manifest_mode"]           # mode for manifest generation (simple/verbose)

# sampling 
MAX_GB      = float(cfg["ingestion"]["max_size_GB"])        # maximum archive size to download (GB)
MAX_IMGS    = int(cfg["ingestion"]["images_per_archive"])   # maximum number of images to pull per archive
STRIDE      = int(cfg["ingestion"]["frame_stride"])         # gap between frames when sampling
SEED = int(cfg["reproducibility"]["seed"])
random.seed(SEED)
CLEANUP_RAW = cfg["sampling"].get("cleanup_raw_after_extract", False)

# sampling plan config
PLAN_FILENAME = cfg["sampling"]["plan_filename"]            # filename for sample plan
PLAN_OVERWRITE = cfg["sampling"]["overwrite"]               # whether to overwrite existing plan
LABELS_FILENAME = cfg["sampling"]["labels_filename"]        # filename for labels CSV

# image info
CAMERA     = cfg["ingestion"].get("camera", None)           # optional camera filter; None if not configured
IMG_TYPE   = cfg["ingestion"]["image_type"]                 # rgb
IMG_EXT    = cfg["ingestion"]["image_extension"]            # file extension of images 

# labelling info
labels_cfg = cfg["labels"]

# training/splits info
SPLITS_TRAIN = cfg["splits"]["train"]
SPLITS_VAL   = cfg["splits"]["val"]
SPLITS_TEST  = cfg["splits"]["test"]
SPLITS_OVERWRITE = cfg["splits"]["overwrite"]
CLEANUP_SAMPLED = cfg["splits"].get("cleanup_sampled_after_split", False)

# the valid file types
VALID_PREFIX  = set(labels_cfg["valid_prefix"])
VALID_WEATHER = set(labels_cfg["valid_weather"])
VALID_DENSITY = set(labels_cfg["valid_density"])

# the subset of files to download (defaults to all valid values)
CHOOSE_PREFIX  = labels_cfg.get("choose_prefix", labels_cfg["valid_prefix"])
CHOOSE_WEATHER = labels_cfg.get("choose_weather", labels_cfg["valid_weather"])
CHOOSE_DENSITY = labels_cfg.get("choose_density", labels_cfg["valid_density"])

# decoders (two separate label spaces)
DECODE_TIME = labels_cfg["weather_decode_time"]             # maps files to day/night
DECODE_VIS  = labels_cfg["weather_decode_visibility"]       # maps files to visibility conditions

print('selected prefixes: ', CHOOSE_PREFIX)
print('selected weather: ', CHOOSE_WEATHER)
print('selected density: ', CHOOSE_DENSITY)


selected prefixes:  ['rcnj', 'ri', 'unj']
selected weather:  ['cn', 'fn', 'hrn', 'srn']
selected density:  ['s']
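
The `choose_*` lookups above fall back to the full `valid_*` list when a key is absent from the config. The sketch below illustrates that fallback together with a subset check. `resolve_choices` is a hypothetical helper (not part of `src.ingestion`), and the config fragment uses illustrative values rather than the project's real `config.json`.

```python
def resolve_choices(labels_cfg: dict, name: str) -> list:
    """Return the chosen values for `name`, defaulting to all valid values."""
    valid = labels_cfg[f"valid_{name}"]
    chosen = labels_cfg.get(f"choose_{name}", valid)
    # Reject any choice that is not in the valid list
    unknown = sorted(set(chosen) - set(valid))
    if unknown:
        raise ValueError(f"invalid {name} choice(s): {unknown}")
    return list(chosen)


# Hypothetical config fragment (illustrative values only)
example_cfg = {
    "valid_weather": ["cn", "fn", "hrn", "srn"],
    "valid_density": ["s", "d"],
    "choose_density": ["s"],
}

chosen_weather = resolve_choices(example_cfg, "weather")  # no choose_weather key: all valid values
chosen_density = resolve_choices(example_cfg, "density")  # explicit choice is used
```

Validating at this point means a typo in the config fails immediately instead of producing a filename that does not exist on the server.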


## 2. Download from Remote Server

In [3]:
# note: enforce lists for choices

filenames = download.build_filenames(CHOOSE_PREFIX, CHOOSE_WEATHER, CHOOSE_DENSITY, 
                               VALID_PREFIX, VALID_WEATHER, VALID_DENSITY, 
                               ARCHIVE_EXT)
print('built the following filenames: \n', filenames)

built the following filenames: 
 ['rcnj_cn_s.7z', 'rcnj_fn_s.7z', 'rcnj_hrn_s.7z', 'rcnj_srn_s.7z', 'ri_cn_s.7z', 'ri_fn_s.7z', 'ri_hrn_s.7z', 'ri_srn_s.7z', 'unj_cn_s.7z', 'unj_fn_s.7z', 'unj_hrn_s.7z', 'unj_srn_s.7z']
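
The filenames follow a `<prefix>_<weather>_<density><ext>` pattern. One plausible way to build them is a Cartesian product over the three choice lists, sketched below. `build_filenames_sketch` is a hypothetical stand-in for `download.build_filenames`, whose actual implementation is not shown here, and the extension is assumed to include the leading dot.

```python
from itertools import product


def build_filenames_sketch(prefixes, weathers, densities,
                           valid_prefix, valid_weather, valid_density, ext):
    """Combine every prefix/weather/density choice into an archive filename."""
    # Validate each choice against its allowed set before building names
    for chosen, valid, kind in [
        (prefixes, valid_prefix, "prefix"),
        (weathers, valid_weather, "weather"),
        (densities, valid_density, "density"),
    ]:
        bad = [c for c in chosen if c not in valid]
        if bad:
            raise ValueError(f"invalid {kind}: {bad}")
    # product() varies the last argument fastest, matching the order printed above
    return [f"{p}_{w}_{d}{ext}" for p, w, d in product(prefixes, weathers, densities)]


names = build_filenames_sketch(
    ["rcnj", "ri", "unj"], ["cn", "fn", "hrn", "srn"], ["s"],
    {"rcnj", "ri", "unj"}, {"cn", "fn", "hrn", "srn"}, {"s"},
    ".7z",  # assumed form of ARCHIVE_EXT
)
print(len(names), names[0], names[-1])
```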


In [4]:
download_raw = download.download_files(
    base_url = BASE_URL, 
    destinations_dir = RAW_DIR, 
    filenames = filenames, 
    timeout = 60, 
    max_size_GB = MAX_GB, 
    overwrite = False
)

print('Downloaded ', len(download_raw), 'files.')
for file in download_raw:
    print('-->', file.name)



[SKIP] rcnj_cn_s.7z already present in raw.
[SKIP] rcnj_fn_s.7z already present in raw.
[SKIP] rcnj_hrn_s.7z already present in raw.
[SKIP] rcnj_srn_s.7z already present in raw.
[SKIP] ri_cn_s.7z already present in raw.
[SKIP] ri_fn_s.7z already present in raw.
[SKIP] ri_hrn_s.7z already present in raw.
[SKIP] ri_srn_s.7z already present in raw.
[SKIP] unj_cn_s.7z already present in raw.
[SKIP] unj_fn_s.7z already present in raw.
[SKIP] unj_hrn_s.7z already present in raw.
[SKIP] unj_srn_s.7z already present in raw.
Downloaded  12 files.
--> rcnj_cn_s.7z
--> rcnj_fn_s.7z
--> rcnj_hrn_s.7z
--> rcnj_srn_s.7z
--> ri_cn_s.7z
--> ri_fn_s.7z
--> ri_hrn_s.7z
--> ri_srn_s.7z
--> unj_cn_s.7z
--> unj_fn_s.7z
--> unj_hrn_s.7z
--> unj_srn_s.7z
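
The `[SKIP]` lines above and the `max_size_GB` argument suggest two checks before any transfer: is the file already on disk, and is it within the size cap? A minimal sketch of that decision follows. `should_download` is a hypothetical helper, the remote size is assumed to be known in advance (for example from a `Content-Length` header), and 1 GB is taken as 1024³ bytes, which may differ from the project's convention.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def should_download(dest: Path, remote_bytes: int, max_size_gb: float,
                    overwrite: bool = False) -> Tuple[bool, str]:
    """Decide whether to fetch a file, returning (decision, reason)."""
    if dest.exists() and not overwrite:
        return False, "already present"
    if remote_bytes > max_size_gb * 1024**3:
        return False, "exceeds size cap"
    return True, "ok"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "rcnj_cn_s.7z"
    fresh = should_download(target, 2 * 1024**3, max_size_gb=5.0)
    target.touch()  # simulate a completed download
    cached = should_download(target, 2 * 1024**3, max_size_gb=5.0)
    too_big = should_download(Path(tmp) / "other.7z", 9 * 1024**3, max_size_gb=5.0)
```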


## 3. Develop Manifest

Build a manifest (a listing of contents) for each downloaded archive, without extracting it.
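
A manifest only needs to be built once per archive; later runs can reuse the saved JSON. The sketch below shows that load-or-build pattern under some assumptions: the manifest is a plain JSON list of entry names saved as `<archive stem>_manifest.json`, and `load_or_build_manifest` and `list_entries` are hypothetical names. For `.7z` archives, a lister based on py7zr's `SevenZipFile.getnames()` would be one option.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def load_or_build_manifest(archive_path: Path, index_dir: Path,
                           list_entries: Callable[[Path], List[str]]) -> List[str]:
    """Return the cached manifest if present; otherwise list the archive and cache it."""
    manifest_path = index_dir / f"{archive_path.stem}_manifest.json"
    if manifest_path.exists():
        # Reuse the cached listing instead of reopening the archive
        return json.loads(manifest_path.read_text())
    entries = list_entries(archive_path)
    manifest_path.write_text(json.dumps(entries, indent=2))
    return entries


calls = []


def fake_lister(path: Path) -> List[str]:
    """Stand-in for a real archive lister; records each invocation."""
    calls.append(path.name)
    return ["rcnj_cn_s/-1/000242_camera2.png"]


with tempfile.TemporaryDirectory() as tmp:
    index_dir = Path(tmp)
    archive_path = index_dir / "rcnj_cn_s.7z"
    first = load_or_build_manifest(archive_path, index_dir, fake_lister)
    second = load_or_build_manifest(archive_path, index_dir, fake_lister)  # served from cache
```

Passing the lister as an argument keeps the caching logic independent of any particular archive library.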

In [5]:
# index the downloaded archives (without extracting)
archives = sorted(RAW_DIR.glob(f"*{ARCHIVE_EXT}"))
print(f"Found {len(archives)} downloaded archives\n")

# load or create manifests
manifests = {}
for archive_file in archives:
    manifest = archive.build_manifest(archive_file, INDEX_DIR, mode=MANIFEST_MODE)
    manifests[archive_file.name] = manifest
    print(f"  {archive_file.name}: {len(manifest)} lines")

print(f"\nTotal manifests: {len(manifests)}")


Found 12 downloaded archives

[SKIP] Manifest already exists:
  rcnj_cn_s_manifest.json
  rcnj_cn_s.7z: 10032 lines
[SKIP] Manifest already exists:
  rcnj_fn_s_manifest.json
  rcnj_fn_s.7z: 10017 lines
[SKIP] Manifest already exists:
  rcnj_hrn_s_manifest.json
  rcnj_hrn_s.7z: 10025 lines
[SKIP] Manifest already exists:
  rcnj_srn_s_manifest.json
  rcnj_srn_s.7z: 10032 lines
[SKIP] Manifest already exists:
  ri_cn_s_manifest.json
  ri_cn_s.7z: 7781 lines
[SKIP] Manifest already exists:
  ri_fn_s_manifest.json
  ri_fn_s.7z: 7755 lines
[SKIP] Manifest already exists:
  ri_hrn_s_manifest.json
  ri_hrn_s.7z: 7779 lines
[SKIP] Manifest already exists:
  ri_srn_s_manifest.json
  ri_srn_s.7z: 7739 lines
[SKIP] Manifest already exists:
  unj_cn_s_manifest.json
  unj_cn_s.7z: 6894 lines
[SKIP] Manifest already exists:
  unj_fn_s_manifest.json
  unj_fn_s.7z: 6899 lines
[SKIP] Manifest already exists:
  unj_hrn_s_manifest.json
  unj_hrn_s.7z: 6899 lines
[SKIP] Manifest already exists:
  unj_srn_s

## 4. Build a Sampling Plan

Build a sampling plan for the dataset, so we only have to extract a subset of the data.

In [6]:
sampling_plan = sample.build_sample_plan(manifests, 
                      CAMERA=CAMERA, 
                      IMG_EXT=IMG_EXT, 
                      STRIDE=STRIDE, 
                      MAX_IMGS=MAX_IMGS, 
                      SEED=SEED)

print(f"\nTotal images to extract: {sum(len(v) for v in sampling_plan.values())}")

sample.save_sample_plan(sampling_plan, 
                        INDEX_DIR / PLAN_FILENAME,
                        overwrite=PLAN_OVERWRITE)


rcnj_cn_s.7z:
  Candidates: 760
  Sampled: 760

rcnj_fn_s.7z:
  Candidates: 760
  Sampled: 760

rcnj_hrn_s.7z:
  Candidates: 760
  Sampled: 760

rcnj_srn_s.7z:
  Candidates: 760
  Sampled: 760

ri_cn_s.7z:
  Candidates: 592
  Sampled: 592

ri_fn_s.7z:
  Candidates: 592
  Sampled: 592

ri_hrn_s.7z:
  Candidates: 592
  Sampled: 592

ri_srn_s.7z:
  Candidates: 592
  Sampled: 592

unj_cn_s.7z:
  Candidates: 520
  Sampled: 520

unj_fn_s.7z:
  Candidates: 520
  Sampled: 520

unj_hrn_s.7z:
  Candidates: 520
  Sampled: 520

unj_srn_s.7z:
  Candidates: 520
  Sampled: 520


Total images to extract: 7488
[SKIP] Sample plan already exists at sample_plan.json. Using existing plan.


PosixPath('/Users/tjards/Library/CloudStorage/Dropbox/adjunctQueens/code/pytorch_project_advercity/data/index/sample_plan.json')
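
To make the roles of `STRIDE`, `MAX_IMGS`, and `SEED` concrete, here is a minimal sketch of stride-then-cap sampling over one manifest. `sample_entries` is hypothetical, and the real `sample.build_sample_plan` may filter or order entries differently. In the output above, `Sampled` equals `Candidates` for every archive, which is consistent with the cap never being reached.

```python
import random
from typing import List, Optional


def sample_entries(entries: List[str], img_ext: str, stride: int, max_imgs: int,
                   seed: int, camera: Optional[str] = None) -> List[str]:
    """Filter image entries, thin them by stride, then cap the count reproducibly."""
    # Keep only image files, optionally restricted to one camera
    candidates = sorted(
        e for e in entries
        if e.endswith(img_ext) and (camera is None or camera in e)
    )
    # Take every `stride`-th entry to reduce near-duplicate neighbouring frames
    candidates = candidates[::stride]
    if len(candidates) <= max_imgs:
        return candidates
    # A local Random instance keeps the draw reproducible without touching global state
    return sorted(random.Random(seed).sample(candidates, max_imgs))


entries = [f"scene/-1/{i:06d}_camera0.png" for i in range(10)] + ["scene/-1/000000.yaml"]
uncapped = sample_entries(entries, ".png", stride=2, max_imgs=100, seed=0)
capped = sample_entries(entries, ".png", stride=2, max_imgs=3, seed=0)
```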

## 5. Extract based on Sampling Plan

- Cross-references the contents of `SAMPLED_DIR` with the sampling plan and opens an archive only if files are missing (saving time).
- `cleanup_raw`: optionally clean up (delete) the raw archives once extraction has finished.
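
The cross-referencing step amounts to a membership check of each planned path against the sampled directory. A minimal sketch, assuming the plan stores paths relative to `SAMPLED_DIR` (`find_missing` is a hypothetical helper):

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def find_missing(planned: Iterable[str], sampled_dir: Path) -> List[str]:
    """Return planned relative paths that are not yet present in `sampled_dir`."""
    return [rel for rel in planned if not (sampled_dir / rel).is_file()]


with tempfile.TemporaryDirectory() as tmp:
    sampled_dir = Path(tmp)
    plan = ["rcnj_cn_s/-1/000242_camera2.png", "rcnj_cn_s/-1/000188_camera3.png"]
    # Simulate one file that was extracted on a previous run
    present = sampled_dir / plan[0]
    present.parent.mkdir(parents=True)
    present.touch()
    missing = find_missing(plan, sampled_dir)
```

If `missing` is empty, the archive never needs to be opened, which is where the time saving comes from.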


In [7]:
# define paths
sample_plan_file = INDEX_DIR / PLAN_FILENAME

# extract files based on sampling plan
extraction_results = extract.extract_from_sample_plan(
    sample_plan_file=sample_plan_file,
    raw_dir=RAW_DIR,
    sampled_dir=SAMPLED_DIR,
    overwrite=PLAN_OVERWRITE,
    cleanup_raw=CLEANUP_RAW
)

print("\nExtraction complete. Check SAMPLED_DIR for extracted files:")
print(f"  {SAMPLED_DIR.name}")



rcnj_cn_s.7z:
  Files to extract: 760
all files exist in sampled
 [SKIP] All files already extracted for rcnj_cn_s.7z

rcnj_fn_s.7z:
  Files to extract: 760
all files exist in sampled
 [SKIP] All files already extracted for rcnj_fn_s.7z

rcnj_hrn_s.7z:
  Files to extract: 760
all files exist in sampled
 [SKIP] All files already extracted for rcnj_hrn_s.7z

rcnj_srn_s.7z:
  Files to extract: 760
all files exist in sampled
 [SKIP] All files already extracted for rcnj_srn_s.7z

ri_cn_s.7z:
  Files to extract: 592
all files exist in sampled
 [SKIP] All files already extracted for ri_cn_s.7z

ri_fn_s.7z:
  Files to extract: 592
all files exist in sampled
 [SKIP] All files already extracted for ri_fn_s.7z

ri_hrn_s.7z:
  Files to extract: 592
all files exist in sampled
 [SKIP] All files already extracted for ri_hrn_s.7z

ri_srn_s.7z:
  Files to extract: 592
all files exist in sampled
 [SKIP] All files already extracted for ri_srn_s.7z

unj_cn_s.7z:
  Files to extract: 520
all files exist in

## 6. Labelling and Metadata

Build and save a DataFrame with labels and metadata for every sampled image.

In [8]:
# build labels dataframe
labels_df = label.build_labels_df(
    sampled_dir=SAMPLED_DIR,
    decode_time=DECODE_TIME,
    decode_vis=DECODE_VIS,
    archive_ext=ARCHIVE_EXT,
    img_ext=IMG_EXT
)

print(f"\nLabeled {len(labels_df)} images")

# save to CSV
labels_csv = label.save_labels(
    df = labels_df, 
    output_path = INDEX_DIR / LABELS_FILENAME,
)

[SKIP] .DS_Store is not a directory

Labeled 7488 images
[SAVE] Labels saved to labels.csv
 Total images: 7488
 Columns: ['archive_name', 'image_path', 'prefix', 'weather', 'density', 'time', 'visibility', 'agent_id', 'frame_id', 'camera_id']
 head:
  archive_name                       image_path prefix weather density   time  \
0    rcnj_cn_s  rcnj_cn_s/-1/000242_camera2.png   rcnj      cn       s  night   
1    rcnj_cn_s  rcnj_cn_s/-1/000188_camera3.png   rcnj      cn       s  night   
2    rcnj_cn_s  rcnj_cn_s/-1/000232_camera2.png   rcnj      cn       s  night   
3    rcnj_cn_s  rcnj_cn_s/-1/000147_camera2.png   rcnj      cn       s  night   
4    rcnj_cn_s  rcnj_cn_s/-1/000137_camera2.png   rcnj      cn       s  night   

  visibility agent_id frame_id camera_id  
0      clear       -1   000242   camera2  
1      clear       -1   000188   camera3  
2      clear       -1   000232   camera2  
3      clear       -1   000147   camera2  
4      clear       -1   000137   camera2  
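
Every column in the table above can be recovered from the image path plus the two decode maps. The sketch below shows one way to do that parsing. `parse_image_path` is hypothetical, and the decode maps contain only the `cn` entry, whose values are taken from the rows printed above.

```python
from pathlib import PurePosixPath


def parse_image_path(image_path: str, decode_time: dict, decode_vis: dict) -> dict:
    """Split '<prefix>_<weather>_<density>/<agent>/<frame>_<camera>.<ext>' into fields."""
    path = PurePosixPath(image_path)
    archive_name, agent_id = path.parts[0], path.parts[1]
    prefix, weather, density = archive_name.split("_")
    frame_id, camera_id = path.stem.split("_", 1)
    return {
        "archive_name": archive_name,
        "image_path": image_path,
        "prefix": prefix,
        "weather": weather,
        "density": density,
        "time": decode_time[weather],      # KeyError if the weather code is unmapped
        "visibility": decode_vis[weather],
        "agent_id": agent_id,
        "frame_id": frame_id,
        "camera_id": camera_id,
    }


row = parse_image_path(
    "rcnj_cn_s/-1/000242_camera2.png",
    decode_time={"cn": "night"},
    decode_vis={"cn": "clear"},
)
```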


## 7. Generate Train/Val/Test sets 

- `cleanup_sampled`: optionally clean up (delete) the sampled files once the splits have been built.

In [9]:
splits = split.split_labels(
    labels_path=INDEX_DIR / LABELS_FILENAME,
    train_ratio=SPLITS_TRAIN,
    val_ratio=SPLITS_VAL,
    test_ratio=SPLITS_TEST,
    seed=SEED
)

# Check results
print(f"Train: {len(splits['train'])} images")
print(f"Val: {len(splits['val'])} images")
print(f"Test: {len(splits['test'])} images")

[LOAD] Loaded 7488 labels from: labels.csv
Train: 5241 images
Val: 1123 images
Test: 1124 images
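
A simple shuffle-and-slice split gives counts like the ones above. In this sketch the 0.70/0.15/0.15 ratios are an assumption (the config values are not shown); with truncation and the remainder assigned to the test set, they happen to reproduce the printed counts. `split_indices` is hypothetical, and the real `split.split_labels` may, for example, stratify by label.

```python
import random
from typing import Dict, List


def split_indices(n: int, train_ratio: float, val_ratio: float,
                  seed: int) -> Dict[str, List[int]]:
    """Shuffle indices 0..n-1 reproducibly and slice them into three splits."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(n * train_ratio)  # truncates toward zero
    n_val = int(n * val_ratio)
    return {
        "train": indices[:n_train],
        "val": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],  # remainder absorbs the rounding
    }


parts = split_indices(7488, train_ratio=0.70, val_ratio=0.15, seed=0)
print({name: len(idx) for name, idx in parts.items()})
# {'train': 5241, 'val': 1123, 'test': 1124}
```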


In [10]:
# build files into train/val/test directories
split.build_splits(
    splits=splits,
    sampled_dir=SAMPLED_DIR,
    ready_dir=READY_DIR, 
    cleanup_sampled=CLEANUP_SAMPLED 
)

print("\nTrain/Val/Test split complete!")
print(f"  Train: {(READY_DIR / 'train').name}/")
print(f"  Val: {(READY_DIR / 'val').name}/")
print(f"  Test: {(READY_DIR / 'test').name}/")


[SKIP] Split 'train' already exists. Use overwrite=True to rebuild.

[SKIP] Split 'val' already exists. Use overwrite=True to rebuild.

[SKIP] Split 'test' already exists. Use overwrite=True to rebuild.

Train/Val/Test split complete!
  Train: train/
  Val: val/
  Test: test/
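
As a rough sketch of what building one split could involve, the helper below copies the listed files into a split directory and skips the work if that directory already exists. `copy_split` is hypothetical, and preserving relative paths under `ready/<split>/` is an assumption; the real `split.build_splits` may lay files out differently.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def copy_split(rel_paths: List[str], sampled_dir: Path, split_dir: Path,
               overwrite: bool = False) -> int:
    """Copy the listed files into `split_dir`; return how many were copied."""
    if split_dir.exists():
        if not overwrite:
            return 0  # leave an existing split untouched, as in the [SKIP] lines above
        shutil.rmtree(split_dir)
    for rel in rel_paths:
        dest = split_dir / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(sampled_dir / rel, dest)
    return len(rel_paths)


with tempfile.TemporaryDirectory() as tmp:
    sampled = Path(tmp) / "sampled"
    rel = "rcnj_cn_s/-1/000242_camera2.png"
    (sampled / rel).parent.mkdir(parents=True)
    (sampled / rel).write_bytes(b"fake image bytes")
    train_dir = Path(tmp) / "ready" / "train"
    first_run = copy_split([rel], sampled, train_dir)
    second_run = copy_split([rel], sampled, train_dir)  # skipped: split already exists
    copied_exists = (train_dir / rel).is_file()
```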


## 8. Conclusion

The train, validation, and test sets should now be organized under `/data/ready/...`, ready for training.