# Get to Know a Dataset: CANOE

This notebook serves as a guided tour of the [CANOE](https://registry.opendata.aws/canoe) dataset. More usage examples, tutorials, and documentation for this dataset and others can be found at the [Registry of Open Data on AWS](https://registry.opendata.aws/).

### *Questions to answer*: 1) How have you organized your dataset? Help us understand the key prefix structure of your S3 bucket. 2) What data formats are present in your dataset? What kinds of data are stored using these formats? Can you give any advice for how you work with these data formats? 3) A picture is worth a thousand words. Show us a visual (or several!) from your dataset that either illustrates something informative about your dataset, or that you think might excite someone to dig in further.


#### Data Organization
Our dataset is stored under a single AWS S3 bucket. 
At the top level of our bucket is the `data/` prefix and, in the future, some high-level documentation.
For now, full documentation can be found at [https://github.com/utiasASRL/pycanoe/blob/main/DATA_REFERENCE.md](https://github.com/utiasASRL/pycanoe/blob/main/DATA_REFERENCE.md).
Under the `data/` prefix, our dataset is organized as a set of sequences. 
Each sequence is stored as a folder named `s3://canoe-data/data/canoe-YYYY-MM-DD-HH-MM`, where the timestamp denotes when data collection started.
Below is an overview of the structure of each sequence:

```text
canoe-YYYY-MM-DD-HH-MM/
    calib/
        P_cam.txt
        radar_config.yaml
        T_sens2_sens1.txt
    novatel/
        novatel_imu.csv
        novatel_original.csv
        <sens>_poses.csv
    cam_left/
        <timestamp>.png
    cam_right/
        <timestamp>.png
    imu/
        imu.csv
    lidar/
        <timestamp>.bin
    motor/
        power.csv
    radar/
        <timestamp>.png
    sonar/
        <timestamp>.png
    cam.mp4
    dashboard.mp4
    route_map.html
```
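As a minimal sketch of the naming convention, the start time of a sequence can be recovered from its folder name with the standard library. `parse_sequence_start` is a hypothetical helper (not part of `pycanoe`), and the resulting `datetime` is naive because no time zone is encoded in the name.

```python
from datetime import datetime

# Folder naming convention described above: canoe-YYYY-MM-DD-HH-MM
SEQUENCE_FORMAT = "canoe-%Y-%m-%d-%H-%M"


def parse_sequence_start(name):
    """Return the collection start time encoded in a sequence folder name."""
    # Tolerate a trailing slash, as printed by directory listings.
    return datetime.strptime(name.rstrip("/"), SEQUENCE_FORMAT)


start = parse_sequence_start("canoe-2025-08-21-19-16")
print(start.isoformat())  # 2025-08-21T19:16:00
```

Because the fields are zero-padded and ordered from year to minute, sorting the folder names alphabetically also sorts the sequences chronologically.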

Sequence files fall under a few categories:
- **Overview Files:** At the top level of the sequence folder, there are three files that offer a high-level look at the contents of the data collection run.
The videos show a 10$\times$ speed feed of the camera data (`cam.mp4`) and "dashboard" data (camera, radar, lidar, sonar, and motor inputs) (`dashboard.mp4`) for the whole sequence with the time in the lower left corner.
The `route_map.html` opens an interactive satellite map with the vehicle route overlaid on top.
- **Calibration Files:** Calibration files include `.txt` and `.yaml` files that contain either sensor properties/configurations or the extrinsic calibration between two sensors.
- **Sensor Files:** The `cam_left/`, `cam_right/`, `lidar/`, `radar/`, and `sonar/` folders have a single file for each sensor measurement (named with the timestamp). Except for lidar, which uses a `.bin` format to reduce storage requirements, sensor readings are stored as `.png` files.
- **Auxiliary Sensor Files:** The IMU and motor input measurements each use one `.csv` file for the whole sequence.
- **GPS Files:** The `novatel/` folder contains `.csv` files with the post-processed GNSS measurements. For each sensor, a file `<sens>_poses.csv` contains the GNSS-measured position and velocity, interpolated to the sensor timestamps and transformed into the sensor frame.
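To check that a downloaded sequence matches this layout, a small sketch like the one below counts the per-measurement files in each sensor folder. `count_sensor_files` is a hypothetical helper (not part of `pycanoe`). The folder names and extensions are copied from the structure overview above.

```python
from pathlib import Path

# Per-measurement sensor folders and their file extensions,
# taken from the sequence structure overview.
SENSOR_EXTENSIONS = {
    "cam_left": ".png",
    "cam_right": ".png",
    "lidar": ".bin",
    "radar": ".png",
    "sonar": ".png",
}


def count_sensor_files(seq_dir):
    """Return {sensor_folder: number of measurement files} for one sequence."""
    seq_dir = Path(seq_dir)
    counts = {}
    for sensor, ext in SENSOR_EXTENSIONS.items():
        folder = seq_dir / sensor
        # A missing folder (e.g. a sensor that was not downloaded) counts as 0.
        counts[sensor] = len(list(folder.glob(f"*{ext}"))) if folder.is_dir() else 0
    return counts
```

For example, `count_sensor_files("/path/to/data/canoe/canoe-2025-08-21-19-16")` gives a quick overview of how many frames each sensor recorded in the sample sequence.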

#### Accessing the data
Our dataset is intended to be downloaded and used locally.

The main S3 bucket can be browsed in the S3 console in your web browser at: 
[`https://s3.console.aws.amazon.com/s3/buckets/canoe-data/`](https://s3.console.aws.amazon.com/s3/buckets/canoe-data/)

We recommend accessing and downloading the dataset through the AWS CLI as follows:
1. [Create an AWS account (optional)](https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/)
2. [Install the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html)
3. Create a `root` folder to store the dataset, for example `/path/to/data/canoe/`. Each sequence will then be a folder under `root`.
4. Use the AWS CLI to download either the entire dataset or only the desired sequences and sensors. Add `--no-sign-request` after each of the following commands if you're not going to use an AWS account. 

##### Explore Data
The `aws s3 ls` command can be used to look through the folders. For example, the following command will list all the sequences:
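As a hedged sketch, the listing command can be assembled from the bucket and prefix named above. `build_ls_command` is a hypothetical helper (not part of `pycanoe`), and `--no-sign-request` is appended for access without an AWS account.

```python
import shlex

# Bucket and prefix as described in the Data Organization section.
DATA_URI = "s3://canoe-data/data/"


def build_ls_command(uri=DATA_URI, signed=False):
    """Return the AWS CLI argument list that lists the contents of `uri`."""
    cmd = ["aws", "s3", "ls", uri]
    if not signed:
        # Skip credential lookup when no AWS account is configured.
        cmd.append("--no-sign-request")
    return cmd


print(shlex.join(build_ls_command()))
# aws s3 ls s3://canoe-data/data/ --no-sign-request
```

The printed line can be pasted into a terminal, or the argument list can be passed to `subprocess.run` once the AWS CLI is installed.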

##### Download Sample Sequence
The `aws s3 sync` command is used to download data. For example, the following command downloads the sequence `canoe-2025-08-21-19-16`.
This sequence is our **sample sequence**. It contains approximately 90 seconds of data (~8 GB).
We recommend downloading this sequence if you want to explore and try loading the data for yourself. 
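A minimal sketch of the download step, assuming the `root` folder from step 3 and the bucket layout described above. `build_sync_command` is a hypothetical helper (not part of `pycanoe`) that assembles the `aws s3 sync` call for one sequence.

```python
import os
import shlex

SAMPLE_SEQUENCE = "canoe-2025-08-21-19-16"


def build_sync_command(sequence, root, signed=False):
    """Return the AWS CLI argument list that downloads one sequence into root."""
    src = f"s3://canoe-data/data/{sequence}"
    dst = os.path.join(root, sequence)  # each sequence becomes a folder under root
    cmd = ["aws", "s3", "sync", src, dst]
    if not signed:
        # Skip credential lookup when no AWS account is configured.
        cmd.append("--no-sign-request")
    return cmd


print(shlex.join(build_sync_command(SAMPLE_SEQUENCE, "/path/to/data/canoe/")))
```

Since `aws s3 sync` only copies files that are missing or changed locally, rerunning the same command after an interrupted download resumes where it left off.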

#### Loading Data with the pycanoe DevKit
Now that you have a sequence downloaded, you can load it with the SDK, `pycanoe`.

##### Installations
First, if you haven't already done so, clone the pycanoe repo with `git clone git@github.com:utiasASRL/pycanoe.git` and install it locally with `pip install -e pycanoe`.

This will install the requirements for `pycanoe`, which include the following libraries (see `pycanoe/setup.py` for the full, up-to-date list):
- numpy
- matplotlib
- opencv-python>=4.5.3.56
- PyYAML>=5.4.0
- scipy>=1.14.0


##### Import SDK

In [None]:
import pycanoe
import numpy as np
import os

##### Load Dataset with pycanoe
The code below loads the entire dataset and shows some operations for iterating over and accessing frames.
It is also possible to load individual sequences or sensor objects.

In [None]:
from pycanoe import CanoeDataset

root = '/path/to/data/canoe/'
cd = CanoeDataset(root, verbose=True)

In [None]:
# Note: The CANOE dataset differs from others (e.g., KITTI) in that
# measurements from different sensors are not synchronous. Instead,
# each sensor message has an accurate timestamp and pose.

# Loop through each frame in order (e.g., for odometry)
for seq in cd.sequences:
    # Iterator examples:
    for camera_frame in seq.camleft:
        img = camera_frame.img  
        # do something
        camera_frame.unload_data() # Memory reqs will keep increasing without this
    for lidar_frame in seq.lidar:
        pts = lidar_frame.points  
        # do something
        lidar_frame.unload_data() # Memory reqs will keep increasing without this

    # Retrieve frames based on their index:
    N = len(seq.radar_frames)
    for i in range(N):
        radar_frame = seq.get_radar(i)
        # do something
        radar_frame.unload_data() # Memory reqs will keep increasing without this

# Iterator example:
cam_iter = cd.sequences[0].get_cam_left_iter()
cam0 = next(cam_iter)  # First camera frame
cam1 = next(cam_iter)  # Second camera frame

# Randomly access frames (e.g., for deep learning, localization):
N = len(cd.lidar_frames)
indices = np.random.permutation(N)
for idx in indices:
    lidar_frame = cd.get_lidar(idx)
    # do something
    lidar_frame.unload_data() # Memory reqs will keep increasing without this

# Each sequence contains a calibration object:
calib = cd.sequences[0].calib
point_lidar = np.array([1, 0, 0, 1]).reshape(4, 1)
point_camera = np.matmul(calib.T_camleft_lidar, point_lidar)

# Each sensor frame has a timestamp, a groundtruth pose
# (4x4 homogeneous transform) wrt a fixed global coordinate frame (ENU_ref),
# and groundtruth velocity information. For sequences that are part of the
# test set, the groundtruth poses will be missing.
lidar_frame = cd.get_lidar(0)
t = lidar_frame.timestamp  # timestamp in seconds
T_enu_lidar = lidar_frame.pose  # 4x4 homogeneous transform [R t; 0 0 0 1]
vbar = lidar_frame.velocity  # 6x1 vel in ENU frame [v_se_in_e; w_se_in_e]
varpi = lidar_frame.body_rate  # 6x1 vel in sensor frame [v_se_in_s; w_se_in_s]
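The pose and calibration transforms can be chained with plain NumPy. The sketch below assumes the convention implied by the calibration example, namely that `T_a_b` maps homogeneous points expressed in frame `b` into frame `a`. The matrices are placeholder values rather than real calibration or pose data, and `invert_transform` is a hypothetical helper.

```python
import numpy as np


def invert_transform(T):
    """Invert a 4x4 rigid transform [R t; 0 0 0 1] without a general matrix inverse."""
    R = T[:3, :3]
    t = T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv


# Placeholder values standing in for lidar_frame.pose and calib.T_camleft_lidar.
T_enu_lidar = np.eye(4)
T_enu_lidar[:3, 3] = [10.0, 5.0, 0.0]
T_camleft_lidar = np.eye(4)
T_camleft_lidar[:3, 3] = [0.0, 0.0, -0.2]

# Express a lidar point in the global ENU frame.
point_lidar = np.array([1.0, 0.0, 0.0, 1.0]).reshape(4, 1)
point_enu = T_enu_lidar @ point_lidar

# Camera pose in ENU at the lidar timestamp: T_enu_cam = T_enu_lidar * T_lidar_cam.
T_enu_camleft = T_enu_lidar @ invert_transform(T_camleft_lidar)
```

Note that `T_enu_camleft` computed this way is valid at the lidar timestamp only, because the sensors are not synchronized.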

##### Visualize Sensor Data

In [None]:
# First "synchronize".
# NOTE: This doesn't actually synchronize frames; it just grabs the closest frames in time.
seq = cd.sequences[0]
seq.synchronize_frames(ref="cam_left")
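To illustrate what "grabbing the closest frame" means, here is a minimal nearest-timestamp matching sketch in NumPy. It is not the `pycanoe` implementation. `nearest_indices` is a hypothetical helper, and it assumes the other sensor's timestamps are sorted and contain at least two entries.

```python
import numpy as np


def nearest_indices(ref_times, other_times):
    """For each reference timestamp, return the index of the closest other timestamp."""
    ref_times = np.asarray(ref_times, dtype=float)
    other_times = np.asarray(other_times, dtype=float)
    # Index of the first other timestamp that is >= each reference timestamp.
    right = np.searchsorted(other_times, ref_times)
    # Clip so that both the left and right neighbours are valid indices.
    right = np.clip(right, 1, len(other_times) - 1)
    left = right - 1
    # Pick whichever neighbour is closer in time (ties go to the left).
    choose_right = (other_times[right] - ref_times) < (ref_times - other_times[left])
    return np.where(choose_right, right, left)


cam_times = [0.00, 0.10, 0.20]          # placeholder reference timestamps (s)
lidar_times = [0.02, 0.09, 0.21, 0.35]  # placeholder lidar timestamps (s)
matches = nearest_indices(cam_times, lidar_times)
```

The residual time offset between each matched pair is still nonzero, which is why the per-frame timestamps and poses remain important after this step.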

In [None]:
# Visualize frames at same index
idx = 0

# Camera
cam = seq.camleft_frames[idx]
cam.load_data()
cam.visualize()
cam.unload_data()

# Radar
rad = seq.radar_frames[idx]
rad.load_data()
rad.visualize()
rad.unload_data()

# Lidar
lid = seq.lidar_frames[idx]
lid.load_data()
lid.visualize()
lid.unload_data()

# Sonar
sonar = seq.sonar_frames[idx]
sonar.load_data()
sonar.visualize()
sonar.unload_data()


### Q: What is one question that you have answered using these data? Can you show us how you came to that answer?
One question that we have partially answered using this data is:
**How large of a barrier are weeds to autonomous aquatic navigation?**

Long weeds can get caught in the rotors of an unmanned surface vessel as shown in the picture below.

![Weeds in Rotor](../figs/weeds.png "Weeds in Rotor")

This slows the boat down and requires more power to maintain the same speed, which can drain the battery.
Without preventative measures, there will be upper limits on the length of time that an unmanned
surface vessel can operate autonomously before human intervention is needed. 

This issue is visible in the dashboard videos. 

Below are two screenshots of the dashboard video from one of the reservoir sequences. 
On the second lap around the pillars, the boat requires much more power to achieve the same speed.

| Lap 1 | Lap 2| 
| --- | --- |
| ![Power Lap 1](../figs/power-1.png "Power Lap 1") | ![Power Lap 2](../figs/power-2.png "Power Lap 2") |
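One way to quantify this effect is to compare the average power drawn per unit of speed between laps. The sketch below uses synthetic placeholder numbers, not values from the dataset. `mean_power_per_speed` is a hypothetical helper, and in practice the power samples would come from `motor/power.csv` and the speed from the GNSS velocity, whose column names and units are not assumed here.

```python
import numpy as np


def mean_power_per_speed(power, speed, min_speed=0.1):
    """Average power drawn per unit of speed, ignoring near-stationary samples."""
    power = np.asarray(power, dtype=float)
    speed = np.asarray(speed, dtype=float)
    moving = speed > min_speed  # avoid dividing by (almost) zero speed
    return float(np.mean(power[moving] / speed[moving]))


# Synthetic placeholder samples for two laps at the same speeds.
lap1_cost = mean_power_per_speed([100.0, 110.0, 105.0], [1.0, 1.1, 1.05])
lap2_cost = mean_power_per_speed([200.0, 220.0, 210.0], [1.0, 1.1, 1.05])
cost_ratio = lap2_cost / lap1_cost  # values above 1 mean lap 2 was more expensive
```

A ratio that grows over the course of a sequence would be consistent with weeds accumulating in the rotors, although wind and current would also need to be ruled out.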


This poses an interesting question about how perception algorithms could be used to detect and avoid weeds. 
Preventative monitoring could be done with the camera, 
but there is also an opportunity to explore the use of sonar data for this purpose, as thick patches of weeds 
are occasionally visible to the human eye in the sonar feed.

### Q: What is one unanswered question that you think could be answered using these data? Do you have any recommendations or advice for someone wanting to answer this question?

One question that we would like to answer using this data is:
**How can we best leverage the multi-sensor suite to perform odometry or localization on a semi-open body of water?**

Each sensor has its unique strengths that can be leveraged for navigation under different conditions. 
Camera and lidar sensors provide rich information, but on a wide body of water such as a lake, the shoreline can be too far away to maintain a reliable signal.
The radar sensor, on the other hand, has a much longer range and could prove useful for navigating with respect to the shoreline.
Sonar data could be used to track one's position relative to the lake floor. 

Benchmarking state-of-the-art odometry and localization algorithms
would offer further insight into the gaps and key challenges facing autonomous aquatic navigation, 
and the unique combination of 
sensors in this dataset affords more flexibility and creativity in 
future improvements of these algorithms.