sandialabs/HawaiiCoast_GT_Code_Generation

HawaiiCoast_GT Code Generation

Welcome to the codebase for the HawaiiCoast_GT dataset! This repository contains the code that was used to create the HawaiiCoast_GT dataset, which is available on Zenodo (https://doi.org/10.5281/zenodo.8253611). These files are included specifically to make the dataset transparently reproducible. We also hope that the overall process, and the functions in AIS_data_cleaning.py in particular, will be useful for curating further AIS-based datasets for anomaly detection.

Files

  1. AIS_data_cleaning.py: Contains the functions that perform the AIS name, class, and speed cleaning passes. The functions are general enough to be used for other time periods and regions extracted from MarineCadastre.
  2. extract_hawaii_names_classes.py: Running this file compiles all of the Hawaii monthly AIS files (extracted by running save_month_bbox from AIS_data_cleaning.py on data downloaded directly from MarineCadastre), then cleans the data and determines the name(s) and vessel class of each unique vessel in the dataset.
  3. add_hawaii_names_classes_speeds.py: Running this file adds the standardized vessel names and classes generated by running extract_hawaii_names_classes.py to each Hawaii month of data. It then computes the approximate speed between sequential points for each vessel by month.
  4. add_incident_labels.py: Running this file adds the incident labels from hawaii_primary_trajectories_of_interest_2017_2020.csv to the corresponding points in the AIS data.
  5. vessel_type_codes_2018.csv: Connects the US Coast Guard vessel type codes to the general vessel classes used in the HawaiiCoast_GT dataset. Necessary for running extract_hawaii_names_classes.py and AIS_data_cleaning.py. Derived from VesselTypeCodes2018.pdf available from the US Coast Guard on MarineCadastre.
  6. hawaii_primary_trajectories_of_interest_2017_2020.csv: Reference file for adding incident labels to the AIS dataset. Necessary for running add_incident_labels.py. For more information on how these incidents were compiled (and what they mean), see README.txt in the original data.
  7. Hawaii_incident_source_key.bib: Sources for hawaii_primary_trajectories_of_interest_2017_2020.csv, provided for appropriate information attribution.
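
Item 3 above mentions computing an approximate speed between sequential points. As a rough illustration of that idea only (not necessarily the method used in AIS_data_cleaning.py), the sketch below divides the great-circle (haversine) distance between consecutive points by the elapsed time. The function names, the tuple layout of a track, and the choice of knots as the unit are all assumptions made for this example.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_NM = 3440.065  # mean Earth radius in nautical miles


def haversine_nm(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in nautical miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_NM * math.asin(math.sqrt(a))


def approx_speeds_knots(track):
    """Approximate speed at each point of ONE vessel's track.

    track: list of (datetime, lat, lon) tuples, already sorted by time
           (hypothetical layout chosen for this sketch).
    Returns one value per point; the first point, and any point whose
    timestamp does not advance, gets None.
    """
    if not track:
        return []
    speeds = [None]
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(track, track[1:]):
        hours = (t1 - t0).total_seconds() / 3600.0
        if hours <= 0:
            speeds.append(None)  # duplicate or out-of-order timestamp
        else:
            speeds.append(haversine_nm(lat0, lon0, lat1, lon1) / hours)
    return speeds
```

Because one degree of latitude is roughly 60 nautical miles, a vessel that moves one degree north in one hour should come out at about 60 knots, which makes a convenient sanity check.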

How to generate the HawaiiCoast_GT dataset

Original data

The daily "raw" point data for all of North America must first be downloaded from MarineCadastre for each day of 2017 through 2020.

The Hawaii regional data can then be extracted by running save_month_bbox from AIS_data_cleaning.py on each month of downloaded daily files, using the following parameters:

save_month_bbox(year, month, 18.13869, 24.07175, -161.70558, -152.98331, region='Hawaii')

Here the latitudinal range is 18.13869 to 24.07175 and the longitudinal range is -161.70558 to -152.98331.
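
To make the meaning of the four bounds concrete, here is a minimal sketch of a bounding-box test using the values above. It is not the implementation inside save_month_bbox: the helper names are hypothetical, the inclusive comparison at the edges is an assumption, and the "LAT" and "LON" column names are assumed to match the MarineCadastre CSV headers.

```python
# Bounds copied from the save_month_bbox call above.
LAT_MIN, LAT_MAX = 18.13869, 24.07175
LON_MIN, LON_MAX = -161.70558, -152.98331


def in_bbox(lat, lon, lat_min=LAT_MIN, lat_max=LAT_MAX,
            lon_min=LON_MIN, lon_max=LON_MAX):
    """True if the point lies inside the box (edges included by assumption)."""
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


def filter_rows(rows, lat_col="LAT", lon_col="LON"):
    """Yield only the rows (dicts, e.g. from csv.DictReader) inside the box.

    The column names are assumptions; adjust them to the actual file headers.
    """
    for row in rows:
        if in_bbox(float(row[lat_col]), float(row[lon_col])):
            yield row
```

For example, a point near Honolulu (21.3069, -157.8583) falls inside the box, while a point near San Francisco (37.7749, -122.4194) does not.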

The entire raw MarineCadastre dataset from 2017-2020 is 332.3 GB. Since the number of raw files you can store and process at a time depends on your hardware, we leave it to the user to create a script that uses save_month_bbox from AIS_data_cleaning.py and deletes the corresponding original MarineCadastre files as needed.
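
One possible shape for such a script is sketched below. It is an illustration under stated assumptions, not part of the repository: the daily file naming pattern (AIS_YYYY_MM_DD.csv), the raw_dir layout, and the helper names daily_files and extract_and_prune are all hypothetical. The extraction step is passed in as a callable so that the sketch does not assume how save_month_bbox locates its input files.

```python
import calendar
from pathlib import Path


def daily_files(raw_dir, year, month,
                pattern="AIS_{year}_{month:02d}_{day:02d}.csv"):
    """Paths of the daily files expected for one month.

    The naming pattern is an assumption; change it to match your downloads.
    """
    n_days = calendar.monthrange(year, month)[1]
    return [Path(raw_dir) / pattern.format(year=year, month=month, day=day)
            for day in range(1, n_days + 1)]


def extract_and_prune(extract_month, raw_dir,
                      years=range(2017, 2021), delete_raw=True):
    """Run extract_month(year, month) for every fully downloaded month.

    Raw daily files are deleted only after extract_month returns without
    raising. Months with missing daily files are skipped and left untouched.
    Returns the list of (year, month) pairs that were processed.
    """
    done = []
    for year in years:
        for month in range(1, 13):
            files = daily_files(raw_dir, year, month)
            if not all(p.exists() for p in files):
                continue  # month not (completely) downloaded yet
            extract_month(year, month)
            if delete_raw:
                for p in files:
                    p.unlink()
            done.append((year, month))
    return done


# Example wiring (commented out because it needs the repository code and
# the downloaded data to be present):
# from AIS_data_cleaning import save_month_bbox
# extract_and_prune(
#     lambda y, m: save_month_bbox(y, m, 18.13869, 24.07175,
#                                  -161.70558, -152.98331, region='Hawaii'),
#     raw_dir="raw",
# )
```

Rerunning the function after downloading more files picks up only the months that are now complete, so it can be alternated with downloads when disk space is limited.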

File sequence

Once all of the regional files are extracted, the following should be run in sequence:

python extract_hawaii_names_classes.py
python add_hawaii_names_classes_speeds.py
python add_incident_labels.py

Note that the speed extraction in add_hawaii_names_classes_speeds.py works on each month of regional data separately, so it is easy to parallelize if you want to shorten the run time. We also note that, for the original dataset, we looked up and processed incidents by hand using the output of add_hawaii_names_classes_speeds.py before running add_incident_labels.py (again, see the HawaiiCoast_GT dataset for details).
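
A minimal sketch of how the per-month work could be spread across processes is shown below, using only the Python standard library. The helper names year_months and run_per_month are hypothetical, and worker stands in for whatever function processes a single month of data; the repository does not necessarily expose such a function in this form.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product


def year_months(first_year=2017, last_year=2020):
    """All (year, month) pairs covered by the dataset."""
    return list(product(range(first_year, last_year + 1), range(1, 13)))


def run_per_month(worker, jobs, max_workers=4,
                  executor_cls=ProcessPoolExecutor):
    """Call worker(year, month) for each job in parallel.

    Returns a dict mapping (year, month) to the worker's result. Any
    exception raised inside a worker is re-raised here when its result
    is collected.
    """
    with executor_cls(max_workers=max_workers) as pool:
        futures = {job: pool.submit(worker, *job) for job in jobs}
        return {job: fut.result() for job, fut in futures.items()}
```

With the default process pool, worker has to be a function defined at module level so that it can be pickled, and the call to run_per_month should sit under an `if __name__ == "__main__":` guard on platforms that start worker processes by spawning. Choose max_workers with memory in mind, since each worker holds a month of data at once.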

Outputs

After running these scripts, you will have the HawaiiCoast_GT dataset in three stages:

  1. The regional, original raw data.
  2. The processed data with updated names and speeds (but no incidents) in a folder called Hawaii_name_class_speed.
  3. The final AIS data matching the published HawaiiCoast_GT dataset in a folder called hawaii_coast_gt.

The earlier stages can be deleted, or kept in order to reproduce the incident lookup process.

Additional Info

Dependencies

This code was developed with the following dependency versions:

Funding Statement

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. The views expressed in the article do not necessarily represent the view of the U.S. DOE or the United States Government.

How to Cite

Henriksen, Amelia. (2023). HawaiiCoast_GT: Curated AIS for Hawaii's coast correlated with ground truth incidents (v1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.8253611

About

This is the codebase for the HawaiiCoast_GT dataset, available at https://zenodo.org/record/8253611
