# Efficient Ingestion of the Adver-City Dataset 

This notebook ingests the [Adver-City](https://labs.cs.queensu.ca/quarrg/datasets/adver-city/) synthetic dataset for use in investigating cooperative perception in adverse weather conditions. Because the dataset is large, it selectively extracts only the required files (in accordance with a sampling plan) and cleans up working files along the way. The end result is a clean dataset sliced into train/val/test, suitable for machine learning applications.

A broader exploration of the dataset is provided [here](./initial_explore.ipynb).

Notes:

- only night-time scenes are investigated for now
- activate the virtual environment before running: `source venv/bin/activate` (the interpreter is located at `./venv/bin/python`)
- requirements have been exported with `pip freeze > requirements.txt`
- to unzip the `.7z` files, install p7zip, for example with Homebrew: `brew install p7zip`
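
A quick pre-flight check can confirm that the extraction tool is on the `PATH` before any archive is downloaded. The sketch below is illustrative only: `missing_tools` is a hypothetical helper (not part of `src.ingestion`), and it assumes the p7zip binary is exposed as `7z`.

```python
import shutil

def missing_tools(tools=("7z",)):
    """Return the names in `tools` that cannot be found on the PATH."""
    # shutil.which returns None when no matching executable is found
    return [tool for tool in tools if shutil.which(tool) is None]

missing = missing_tools()
if missing:
    print("missing tools:", missing)
else:
    print("all required tools found")
```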

In [None]:
# standard library imports
import json
from pathlib import Path
import os
import random

# define root directory
cwd = Path.cwd()
if cwd.name == "notebooks":
    os.chdir(cwd.parent)
    print("changed to root directory:", Path.cwd())
else:
    print("already in project root:", Path.cwd())

# custom imports
from src.ingestion import download

## Configure

Note: all paths are expressed as `pathlib.Path` objects.

In [None]:
PROJECT_ROOT = Path.cwd()
CONFIG_PATH = PROJECT_ROOT / "config" / "config.json"

# pull configuration files
with open(CONFIG_PATH, "r") as f:
    cfg = json.load(f)

# data paths
DATA_ROOT   = PROJECT_ROOT / cfg["data_paths"]["root"]      # root directory for all data 
RAW_DIR     = PROJECT_ROOT / cfg["data_paths"]["raw"]       # where the raw data is stored
INDEX_DIR   = PROJECT_ROOT / cfg["data_paths"]["index"]     # where the data index is stored
SAMPLED_DIR = PROJECT_ROOT / cfg["data_paths"]["sampled"]   # where the sampled data (i.e., subset) is stored
READY_DIR   = PROJECT_ROOT / cfg["data_paths"]["ready"]     # where the training/val/test sets are stored

# make the directories 
for p in [RAW_DIR, INDEX_DIR, SAMPLED_DIR, READY_DIR]:
    p.mkdir(parents=True, exist_ok=True)

# archive info
BASE_URL    = cfg["ingestion"]["url"]                       # remote location of data archive
ARCHIVE_EXT = cfg["ingestion"]["archive_extension"]         # archive file extension 

# sampling
MAX_GB      = float(cfg["ingestion"]["max_size_GB"])        # maximum archive size to download, in GB
MAX_IMGS    = int(cfg["ingestion"]["images_per_archive"])   # maximum number of images to pull per archive
STRIDE      = int(cfg["ingestion"]["frame_stride"])         # gap between frames when sampling
SEED        = int(cfg["reproducibility"]["seed"])           # seed for reproducible sampling
random.seed(SEED)

# image info
CAMERA     = cfg["ingestion"]["camera"]                     # null -> None means all cameras
IMG_TYPE   = cfg["ingestion"]["image_type"]                 # rgb
IMG_EXT    = cfg["ingestion"]["image_extension"]            # file extension of images 

# labelling info
labels_cfg = cfg["labels"]

# the valid file types
VALID_PREFIX  = set(labels_cfg["valid_prefix"])
VALID_WEATHER = set(labels_cfg["valid_weather"])
VALID_DENSITY = set(labels_cfg["valid_density"])

# the subsets selected for download (defaults to all valid values)
CHOOSE_PREFIX  = labels_cfg.get("choose_prefix", labels_cfg["valid_prefix"])
CHOOSE_WEATHER = labels_cfg.get("choose_weather", labels_cfg["valid_weather"])
CHOOSE_DENSITY = labels_cfg.get("choose_density", labels_cfg["valid_density"])

# decoders (two separate label spaces)
DECODE_TIME = labels_cfg["weather_decode_time"]             # maps files to day/night
DECODE_VIS  = labels_cfg["weather_decode_visibility"]       # maps files to visibility conditions

print('selected prefixes:', CHOOSE_PREFIX)
print('selected weather:', CHOOSE_WEATHER)
print('selected density:', CHOOSE_DENSITY)
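
Since each `CHOOSE_*` list is meant to be drawn from the matching `VALID_*` set, a sanity check at this point can catch typos in `config.json` early. The following is a minimal sketch: `check_subset` is a hypothetical helper, not part of `src.ingestion`, and the example values are placeholders rather than real Adver-City labels.

```python
def check_subset(chosen, valid, name):
    """Return `chosen` as a list, or raise ValueError if it contains values outside `valid`."""
    unknown = sorted(set(chosen) - set(valid))
    if unknown:
        raise ValueError(f"unknown {name} value(s): {unknown}")
    return list(chosen)

# placeholder values, for illustration only
example_valid = {"a", "b", "c"}
example_ok = check_subset(["a", "c"], example_valid, "prefix")
```

In the notebook this would be called once per label group, for example `check_subset(CHOOSE_WEATHER, VALID_WEATHER, "weather")`.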


## Build filenames

In [None]:
filenames = download.build_filenames(CHOOSE_PREFIX, CHOOSE_WEATHER, CHOOSE_DENSITY, 
                               VALID_PREFIX, VALID_WEATHER, VALID_DENSITY, 
                               ARCHIVE_EXT)
print('built the following filenames: \n', filenames)
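
The implementation of `download.build_filenames` is not shown in this notebook. One plausible way to build the list is a Cartesian product of the selected labels, sketched below. The function name `build_filenames_sketch` and the `<prefix>_<weather>_<density><extension>` naming pattern are assumptions made for illustration, and the example labels are placeholders.

```python
from itertools import product

def build_filenames_sketch(prefixes, weathers, densities, extension):
    """Combine every prefix, weather and density into one archive name each."""
    # accept the extension with or without a leading dot
    ext = extension if extension.startswith(".") else f".{extension}"
    # assumed naming pattern: <prefix>_<weather>_<density><extension>
    return [f"{p}_{w}_{d}{ext}" for p, w, d in product(prefixes, weathers, densities)]

# placeholder labels, for illustration only
example_names = build_filenames_sketch(["p1", "p2"], ["w1"], ["d1", "d2"], "7z")
print(example_names)
```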

## Download

In [None]:
download_raw = download.download_files(
    base_url = BASE_URL, 
    destinations_dir = RAW_DIR, 
    filenames = filenames, 
    timeout = 60, 
    max_size_GB = MAX_GB, 
    overwrite = False
)

print(f"Downloaded {len(download_raw)} files.")
for file in download_raw:
    print('-->', file.name)
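
The `max_size_GB` and `overwrite` arguments suggest that each file is screened before it is fetched. The sketch below shows one way such a guard might look. `should_download` is a hypothetical helper, not the code in `src.ingestion.download`, and it assumes binary gigabytes (1024³ bytes), whereas the real function may use 10⁹.

```python
from pathlib import Path

BYTES_PER_GB = 1024 ** 3  # assumption: binary gigabytes

def should_download(dest, remote_bytes, max_size_gb, overwrite=False):
    """Decide whether a remote file should be fetched to `dest`."""
    # skip files that already exist unless overwriting was requested
    if Path(dest).exists() and not overwrite:
        return False
    # skip files whose reported size exceeds the limit (None means size unknown)
    if remote_bytes is not None and remote_bytes > max_size_gb * BYTES_PER_GB:
        return False
    return True
```

A caller would typically obtain `remote_bytes` from the server's `Content-Length` header before starting the transfer.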

