# Ingestion of the Adver-City Dataset 

This notebook ingests the [Adver-City](https://labs.cs.queensu.ca/quarrg/datasets/adver-city/) synthetic dataset for investigating cooperative perception in adverse weather conditions. Because the dataset is large, it selectively extracts only the required files (according to a sampling plan) and cleans up working files along the way. The end result is a clean dataset split into train/val/test sets, suitable for machine learning applications.

A broader exploration of the dataset is provided [here](./initial_explore.ipynb).

Notes:

- only night-time conditions are investigated for now
- activate the virtual environment before running: `source venv/bin/activate` (the interpreter is located at `./venv/bin/python`)
- requirements have been exported to `requirements.txt` using `pip freeze > requirements.txt`
- to unzip the `.7z` files, install p7zip, for example with Homebrew: `brew install p7zip`
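
Before running the rest of the notebook, it can help to confirm that a 7-Zip executable is actually on the `PATH`. The following is a minimal sketch, not part of the project code: the helper name `find_7z` is hypothetical, and it assumes the p7zip binary is exposed under one of the names `7z`, `7za`, or `7zr`.

```python
import shutil
from typing import Optional, Sequence


def find_7z(candidates: Sequence[str] = ("7z", "7za", "7zr")) -> Optional[str]:
    """Return the full path of the first 7-Zip executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)  # None when the executable is not on PATH
        if path:
            return path
    return None


exe = find_7z()
if exe is None:
    print("No 7-Zip executable found; install p7zip before extracting archives.")
else:
    print("Using 7-Zip executable:", exe)
```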

In [6]:
# standard library imports
import json
from pathlib import Path
import os
import random

# define root directory
cwd = Path.cwd()
if cwd.name == "notebooks":
    os.chdir(cwd.parent)
    print("changed to root directory:", Path.cwd())
else:
    print("already in project root:", Path.cwd())

# custom imports
from src.ingestion import download, archive

already in project root: /Users/tjards/Library/CloudStorage/Dropbox/adjunctQueens/code/pytorch_project_advercity


## Configure

Note: all paths are expressed as `pathlib.Path` objects.

In [7]:
PROJECT_ROOT = Path.cwd()
CONFIG_PATH = PROJECT_ROOT / "config" / "config.json"

# pull configuration files
with open(CONFIG_PATH, "r") as f:
    cfg = json.load(f)

# data paths
DATA_ROOT   = PROJECT_ROOT / cfg["data_paths"]["root"]      # root directory for all data 
RAW_DIR     = PROJECT_ROOT / cfg["data_paths"]["raw"]       # where the raw data is stored
INDEX_DIR   = PROJECT_ROOT / cfg["data_paths"]["index"]     # where the data index is stored
SAMPLED_DIR = PROJECT_ROOT / cfg["data_paths"]["sampled"]   # where the sampled data (i.e., subset) is stored
READY_DIR   = PROJECT_ROOT / cfg["data_paths"]["ready"]     # where the training/val/test sets are stored

# make the directories 
for p in [RAW_DIR, INDEX_DIR, SAMPLED_DIR, READY_DIR]:
    p.mkdir(parents=True, exist_ok=True)

# archive info
BASE_URL    = cfg["ingestion"]["url"]                       # remote location of data archive
ARCHIVE_EXT = cfg["ingestion"]["archive_extension"]         # archive file extension 

# sampling 
MAX_GB      = float(cfg["ingestion"]["max_size_GB"])        # maximum archive size to download (GB)
MAX_IMGS    = int(cfg["ingestion"]["images_per_archive"])   # maximum number of images to pull per archive
STRIDE      = int(cfg["ingestion"]["frame_stride"])         # gap between frames when sampling
SEED        = int(cfg["reproducibility"]["seed"])           # seed for reproducible sampling
random.seed(SEED)

# image info
CAMERA     = cfg["ingestion"].get("camera", None)           # None (JSON null or missing key) means all cameras
IMG_TYPE   = cfg["ingestion"]["image_type"]                 # rgb
IMG_EXT    = cfg["ingestion"]["image_extension"]            # file extension of images 

# labelling info
labels_cfg = cfg["labels"]

# the valid file types
VALID_PREFIX  = set(labels_cfg["valid_prefix"])
VALID_WEATHER = set(labels_cfg["valid_weather"])
VALID_DENSITY = set(labels_cfg["valid_density"])

# the subsets to download (each defaults to all valid values if not specified)
CHOOSE_PREFIX  = labels_cfg.get("choose_prefix", labels_cfg["valid_prefix"])
CHOOSE_WEATHER = labels_cfg.get("choose_weather", labels_cfg["valid_weather"])
CHOOSE_DENSITY = labels_cfg.get("choose_density", labels_cfg["valid_density"])

# decoders (two separate label spaces)
DECODE_TIME = labels_cfg["weather_decode_time"]             # maps files to day/night
DECODE_VIS  = labels_cfg["weather_decode_visibility"]       # maps files to visibility conditions

print('selected prefixes: ', CHOOSE_PREFIX)
print('selected weather: ', CHOOSE_WEATHER)
print('selected density: ', CHOOSE_DENSITY)


selected prefixes:  ['rcnj', 'ri', 'unj']
selected weather:  ['cn', 'fn', 'hrn', 'srn']
selected density:  ['s']
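
For orientation, the configuration cell above implies a `config.json` with roughly the shape sketched below. Only the key names come from that cell; every value is an illustrative assumption (in particular the URL is a placeholder, the label names in the decoder maps are not confirmed by the source, and the real `valid_*` lists may be larger than the chosen subsets). The helper `missing_keys` is hypothetical and simply reports required keys that are absent.

```python
import json

# Hypothetical example values; only the key names are taken from the configuration cell.
example_cfg = {
    "data_paths": {
        "root": "data", "raw": "data/raw", "index": "data/index",
        "sampled": "data/sampled", "ready": "data/ready",
    },
    "ingestion": {
        "url": "https://example.org/adver-city/",  # placeholder, not the real archive URL
        "archive_extension": ".7z",
        "max_size_GB": 10,
        "images_per_archive": 100,
        "frame_stride": 5,
        "camera": None,                             # JSON null: all cameras
        "image_type": "rgb",
        "image_extension": ".png",
    },
    "reproducibility": {"seed": 42},
    "labels": {
        "valid_prefix": ["rcnj", "ri", "unj"],
        "valid_weather": ["cn", "fn", "hrn", "srn"],
        "valid_density": ["s"],
        "choose_prefix": ["rcnj", "ri", "unj"],
        "choose_weather": ["cn", "fn", "hrn", "srn"],
        "choose_density": ["s"],
        # illustrative label names, not confirmed by the source
        "weather_decode_time": {"cn": "night", "fn": "night", "hrn": "night", "srn": "night"},
        "weather_decode_visibility": {"cn": "clear", "fn": "fog", "hrn": "heavy_rain", "srn": "soft_rain"},
    },
}

# Keys the configuration cell reads with plain indexing (so they must exist).
REQUIRED_KEYS = {
    "data_paths": ["root", "raw", "index", "sampled", "ready"],
    "ingestion": ["url", "archive_extension", "max_size_GB", "images_per_archive",
                  "frame_stride", "image_type", "image_extension"],
    "reproducibility": ["seed"],
    "labels": ["valid_prefix", "valid_weather", "valid_density",
               "weather_decode_time", "weather_decode_visibility"],
}


def missing_keys(cfg, required=REQUIRED_KEYS):
    """Return 'section.key' strings for every required entry absent from cfg."""
    return [
        f"{section}.{key}"
        for section, keys in required.items()
        for key in keys
        if key not in cfg.get(section, {})
    ]


print("missing keys:", missing_keys(example_cfg))
print(json.dumps(example_cfg["ingestion"], indent=2))
```

Running a check like this before the configuration cell turns a late `KeyError` into an explicit list of what is missing.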


## Build filenames

In [8]:
# note: enforce lists for choices

filenames = download.build_filenames(
    CHOOSE_PREFIX, CHOOSE_WEATHER, CHOOSE_DENSITY,
    VALID_PREFIX, VALID_WEATHER, VALID_DENSITY,
    ARCHIVE_EXT,
)
print('built the following filenames: \n', filenames)

built the following filenames: 
 ['rcnj_cn_s.7z', 'rcnj_fn_s.7z', 'rcnj_hrn_s.7z', 'rcnj_srn_s.7z', 'ri_cn_s.7z', 'ri_fn_s.7z', 'ri_hrn_s.7z', 'ri_srn_s.7z', 'unj_cn_s.7z', 'unj_fn_s.7z', 'unj_hrn_s.7z', 'unj_srn_s.7z']
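
The output above suggests that each archive name follows the pattern `<prefix>_<weather>_<density><extension>`, with every combination of the chosen values generated in order. The implementation of `download.build_filenames` is not shown here, so the following is only a sketch of how such a function might work: the name `build_filenames_sketch` is hypothetical, and it assumes the extension already includes the leading dot (`.7z`) and that invalid choices should raise an error.

```python
from itertools import product
from typing import Iterable, List, Set, Union

Choice = Union[str, Iterable[str]]


def build_filenames_sketch(
    prefixes: Choice, weathers: Choice, densities: Choice,
    valid_prefix: Set[str], valid_weather: Set[str], valid_density: Set[str],
    ext: str,
) -> List[str]:
    """Build '<prefix>_<weather>_<density><ext>' for every valid combination."""

    def as_list(value: Choice) -> List[str]:
        # a bare string is treated as a one-item list rather than iterated by character
        return [value] if isinstance(value, str) else list(value)

    chosen = [as_list(prefixes), as_list(weathers), as_list(densities)]
    valid = [valid_prefix, valid_weather, valid_density]
    for kind, picks, allowed in zip(("prefix", "weather", "density"), chosen, valid):
        unknown = [p for p in picks if p not in allowed]
        if unknown:
            raise ValueError(f"invalid {kind} value(s): {unknown}")

    # product() varies the last argument fastest, so prefixes form the outer loop
    return [f"{p}_{w}_{d}{ext}" for p, w, d in product(*chosen)]


names = build_filenames_sketch(
    ["rcnj", "ri", "unj"], ["cn", "fn", "hrn", "srn"], ["s"],
    {"rcnj", "ri", "unj"}, {"cn", "fn", "hrn", "srn"}, {"s"},
    ".7z",
)
print(len(names), "filenames, first:", names[0])
```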


## Download from Remote Server

In [9]:
download_raw = download.download_files(
    base_url = BASE_URL, 
    destinations_dir = RAW_DIR, 
    filenames = filenames, 
    timeout = 60, 
    max_size_GB = MAX_GB, 
    overwrite = False
)

print(f"Downloaded {len(download_raw)} files.")
for file in download_raw:
    print('-->', file.name)



[SKIP] rcnj_cn_s.7z already present in /Users/tjards/Library/CloudStorage/Dropbox/adjunctQueens/code/pytorch_project_advercity/data/raw.
[SKIP] rcnj_fn_s.7z already present in /Users/tjards/Library/CloudStorage/Dropbox/adjunctQueens/code/pytorch_project_advercity/data/raw.
[SKIP] rcnj_hrn_s.7z already present in /Users/tjards/Library/CloudStorage/Dropbox/adjunctQueens/code/pytorch_project_advercity/data/raw.
[SKIP] rcnj_srn_s.7z already present in /Users/tjards/Library/CloudStorage/Dropbox/adjunctQueens/code/pytorch_project_advercity/data/raw.
[SKIP] ri_cn_s.7z already present in /Users/tjards/Library/CloudStorage/Dropbox/adjunctQueens/code/pytorch_project_advercity/data/raw.
[SKIP] ri_fn_s.7z already present in /Users/tjards/Library/CloudStorage/Dropbox/adjunctQueens/code/pytorch_project_advercity/data/raw.
[SKIP] ri_hrn_s.7z already present in /Users/tjards/Library/CloudStorage/Dropbox/adjunctQueens/code/pytorch_project_advercity/data/raw.
[SKIP] ri_srn_s.7z already present in /Users

Here is a sample of the output with some skipped and some full downloads:

![sample output](img/raw_progress.png)
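
The skip behaviour seen in the log (archives already present are not fetched again unless `overwrite` is set) and the `max_size_GB` cap can be sketched as below. This is not the code in `download.download_files`: the helper names `remote_size_gb` and `should_download` are hypothetical, the `[TOO LARGE]` message is invented for illustration, and the sketch assumes the server reports a `Content-Length` header in response to a `HEAD` request.

```python
import urllib.request
from pathlib import Path
from typing import Optional


def remote_size_gb(url: str, timeout: float = 60) -> Optional[float]:
    """Ask the server for the file size via a HEAD request; None if not reported."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        length = response.headers.get("Content-Length")
    return int(length) / 1e9 if length is not None else None


def should_download(
    dest: Path, size_gb: Optional[float], max_size_gb: float, overwrite: bool = False
) -> bool:
    """Decide whether to fetch a file, mirroring the skip-if-present behaviour."""
    if dest.exists() and not overwrite:
        print(f"[SKIP] {dest.name} already present in {dest.parent}.")
        return False
    if size_gb is not None and size_gb > max_size_gb:
        # hypothetical message; the real function's wording is not shown in the notebook
        print(f"[TOO LARGE] {dest.name}: {size_gb:.2f} GB exceeds {max_size_gb} GB.")
        return False
    return True


# Usage without any network access: a 2 GB file with a 5 GB cap would be fetched.
print(should_download(Path("data/raw/not_yet_downloaded.7z"), 2.0, 5.0))
```

Separating the decision from the transfer keeps the policy easy to test without touching the network.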

## Develop Manifest

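Build a manifest for each downloaded archive: an index of the files it contains, created without extracting the archive. Each manifest is saved to the index directory as `<archive name>_manifest.json`.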

In [10]:
# index the downloaded archives (without extracting)
archives = sorted(RAW_DIR.glob(f"*{ARCHIVE_EXT}"))
print(f"Found {len(archives)} downloaded archives\n")

# load or create manifests
manifests = {}
for archive_file in archives:
    manifest = archive.build_manifest(archive_file, INDEX_DIR)
    manifests[archive_file.name] = manifest
    print(f"  {archive_file.name}: {len(manifest)} files")

print(f"\nTotal manifests: {len(manifests)}")

Found 12 downloaded archives

[SAVE] Manifest saved:
  rcnj_cn_s_manifest.json
  rcnj_cn_s.7z: 90279 files
[SAVE] Manifest saved:
  rcnj_fn_s_manifest.json
  rcnj_fn_s.7z: 90144 files
[SAVE] Manifest saved:
  rcnj_hrn_s_manifest.json
  rcnj_hrn_s.7z: 90216 files
[SAVE] Manifest saved:
  rcnj_srn_s_manifest.json
  rcnj_srn_s.7z: 90279 files
[SAVE] Manifest saved:
  ri_cn_s_manifest.json
  ri_cn_s.7z: 70020 files
[SAVE] Manifest saved:
  ri_fn_s_manifest.json
  ri_fn_s.7z: 69786 files
[SAVE] Manifest saved:
  ri_hrn_s_manifest.json
  ri_hrn_s.7z: 70002 files
[SAVE] Manifest saved:
  ri_srn_s_manifest.json
  ri_srn_s.7z: 69642 files
[SAVE] Manifest saved:
  unj_cn_s_manifest.json
  unj_cn_s.7z: 62037 files
[SAVE] Manifest saved:
  unj_fn_s_manifest.json
  unj_fn_s.7z: 62082 files
[SAVE] Manifest saved:
  unj_hrn_s_manifest.json
  unj_hrn_s.7z: 62082 files
[SAVE] Manifest saved:
  unj_srn_s_manifest.json
  unj_srn_s.7z: 62037 files

Total manifests: 12
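
The "load or create" pattern used for manifests can be sketched as follows. This is an assumption about how `archive.build_manifest` might behave, not its actual implementation: the function names are hypothetical, the listing is obtained with the 7-Zip command `7z l -slt` (technical listing mode), and the manifest is assumed to be a plain JSON list of entry paths, which may include directories.

```python
import json
import subprocess
from pathlib import Path
from typing import Callable, List


def parse_7z_listing(text: str) -> List[str]:
    """Collect entry paths from `7z l -slt` output.

    In this mode each entry has a 'Path = ...' line, and the entries follow a
    '----------' separator (the 'Path' before it is the archive itself).
    """
    paths: List[str] = []
    in_entries = False
    for line in text.splitlines():
        if line.startswith("----------"):
            in_entries = True
        elif in_entries and line.startswith("Path = "):
            paths.append(line[len("Path = "):])
    return paths


def list_archive(archive_path: Path) -> List[str]:
    """List an archive's contents without extracting it (requires 7z on PATH)."""
    result = subprocess.run(
        ["7z", "l", "-slt", str(archive_path)],
        capture_output=True, text=True, check=True,
    )
    return parse_7z_listing(result.stdout)


def load_or_build_manifest(
    archive_path: Path,
    index_dir: Path,
    lister: Callable[[Path], List[str]] = list_archive,
) -> List[str]:
    """Reuse '<stem>_manifest.json' if it exists; otherwise list the archive and save it."""
    manifest_path = index_dir / f"{archive_path.stem}_manifest.json"
    if manifest_path.exists():
        return json.loads(manifest_path.read_text())
    entries = lister(archive_path)
    manifest_path.write_text(json.dumps(entries, indent=2))
    return entries
```

Caching the listing this way means an archive with tens of thousands of entries only needs to be scanned once; later runs read the JSON file instead.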
