Welcome to the codebase for the HawaiiCoast_GT dataset! This repository contains the code that was used to create the HawaiiCoast_GT dataset, which is available on Zenodo (see How to Cite below). These files are included for the specific purpose of making the dataset transparently reproducible. However, we hope that the overarching process, and the functions in AIS_data_cleaning.py, will also be useful for curating further AIS-based datasets for anomaly detection.
- AIS_data_cleaning.py: Contains all of the functions used to perform the AIS data name, class, and speed cleaning passes. It is generalized to be useful for other time periods and regions extracted from MarineCadastre.
- extract_hawaii_names_classes.py: Running this file compiles all of the Hawaii monthly AIS files (extracted using save_month_bbox from AIS_data_cleaning.py on data directly downloaded from MarineCadastre) and cleans and finds the name(s) and vessel class for each unique vessel in the dataset.
- add_hawaii_names_classes_speeds.py: Running this file adds the standardized vessel names and classes generated by running extract_hawaii_names_classes.py to each Hawaii month of data. It then computes the approximate speed between sequential points for each vessel by month.
- add_incident_labels.py: Running this file adds the incident labels from hawaii_primary_trajectories_of_interest_2017_2020.csv to the corresponding points in the AIS data.
- vessel_type_codes_2018.csv: Connects the US Coast Guard vessel type codes to the general vessel classes used in the HawaiiCoast_GT dataset. Necessary for running extract_hawaii_names_classes.py and AIS_data_cleaning.py. Derived from VesselTypeCodes2018.pdf available from the US Coast Guard on MarineCadastre.
- hawaii_primary_trajectories_of_interest_2017_2020.csv: Reference file for adding incident labels to the AIS dataset. Necessary for running add_incident_labels.py. For more information on how these incidents were compiled (and what they mean) see README.txt in the original data.
- Hawaii_incident_source_key.bib: Sources for hawaii_primary_trajectories_of_interest_2017_2020.csv, provided for appropriate information attribution.
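To make the speed pass described for add_hawaii_names_classes_speeds.py concrete, the following is a minimal sketch of computing an approximate speed between sequential points of one vessel. It is an illustration under stated assumptions, not the code in AIS_data_cleaning.py: it uses the haversine great-circle distance and the standard library only, and the function names `haversine_km` and `sequential_speeds_knots` are hypothetical.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0088   # mean Earth radius
KM_PER_NAUTICAL_MILE = 1.852  # exact by definition


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def sequential_speeds_knots(points):
    """Approximate speed in knots between consecutive points of one vessel.

    `points` is a list of (timestamp, lat, lon) tuples (hypothetical layout).
    Points are sorted by timestamp first. The returned list is in that sorted
    order; the first entry is None, as is any entry whose time difference
    from the previous point is not positive.
    """
    ordered = sorted(points, key=lambda p: p[0])
    speeds = [None] if ordered else []
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(ordered, ordered[1:]):
        hours = (t1 - t0).total_seconds() / 3600.0
        if hours <= 0:
            speeds.append(None)  # duplicate timestamp: speed is undefined
            continue
        nautical_miles = haversine_km(lat0, lon0, lat1, lon1) / KM_PER_NAUTICAL_MILE
        speeds.append(nautical_miles / hours)
    return speeds
```

In practice this would be applied separately to each vessel's points within a month, which is why the monthly files can be processed independently.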
Original data
The daily "raw" point data for all of North America must first be downloaded from MarineCadastre for each day of the following years:
- 2017: https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2017/index.html
- 2018: https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2018/index.html
- 2019: https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2019/index.html
- 2020: https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2020/index.html
The Hawaii regional data can be extracted by running "save_month_bbox" from AIS_data_cleaning.py for each month of downloaded daily files from 2017-2020 using the following parameters:
save_month_bbox(year, month, 18.13869, 24.07175, -161.70558, -152.98331, region='Hawaii')
Here the latitudinal range is 18.13869 to 24.07175 and the longitudinal range is -161.70558 to -152.98331.
The entire raw MarineCadastre dataset from 2017-2020 is 332.3 GB. Since the number of raw files you can store and process at a time depends on your hardware, we leave it to the user to create a script that uses save_month_bbox from AIS_data_cleaning.py and deletes the corresponding original MarineCadastre files as needed.
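One possible shape for such a script is sketched below. It is only a sketch under assumptions: the extraction function is passed in as a parameter (in practice this would be save_month_bbox imported from AIS_data_cleaning.py), the raw file naming pattern `AIS_<year>_<month>_*.csv` is a guess that should be adjusted to the downloaded files, and where save_month_bbox looks for its input is not shown here. The name `extract_and_clean_up` is hypothetical.

```python
from pathlib import Path

# Bounding box quoted in this README: lat min, lat max, lon min, lon max.
HAWAII_BBOX = (18.13869, 24.07175, -161.70558, -152.98331)


def extract_and_clean_up(extract_month, raw_dir, years=range(2017, 2021),
                         delete_raw=True):
    """Extract each month's regional data, then delete that month's raw files.

    `extract_month` is called as
    extract_month(year, month, lat_min, lat_max, lon_min, lon_max, region='Hawaii'),
    matching the save_month_bbox call shown in this README.
    Returns the list of (year, month) pairs that were processed.
    """
    processed = []
    for year in years:
        for month in range(1, 13):
            # ASSUMPTION: hypothetical file naming pattern for the raw files.
            raw_files = sorted(Path(raw_dir).glob(f"AIS_{year}_{month:02d}_*.csv"))
            if not raw_files:
                continue  # nothing downloaded for this month yet
            extract_month(year, month, *HAWAII_BBOX, region="Hawaii")
            if delete_raw:
                for path in raw_files:
                    path.unlink()  # free disk space once the month is extracted
            processed.append((year, month))
    return processed


# Example usage (assumes AIS_data_cleaning.py is importable and reads the
# raw files from the expected location):
# from AIS_data_cleaning import save_month_bbox
# extract_and_clean_up(save_month_bbox, "raw_marinecadastre")
```

Passing `delete_raw=False` keeps the raw files, which is useful for a first dry run on a single month.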
File sequence
Once all of the regional files are extracted, the following should be run in sequence:
python extract_hawaii_names_classes.py
python add_hawaii_names_classes_speeds.py
python add_incident_labels.py
Note that computing the speeds from the monthly regional data in add_hawaii_names_classes_speeds.py is easily parallelized across months to shorten the run time. We also note that in the original dataset, we looked up and processed incidents by hand on the data output from add_hawaii_names_classes_speeds.py before running add_incident_labels.py (again, see the HawaiiCoast_GT dataset for details).
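Because each month is processed independently, one way to parallelize the speed pass is to map a per-month function over (year, month) pairs with a process pool. The sketch below is an assumption about how this could be done, not part of the published scripts: `run_months_in_parallel` and the `process_month(year, month)` callable are hypothetical names, and the per-month work itself would come from AIS_data_cleaning.py.

```python
from concurrent.futures import ProcessPoolExecutor


def run_months_in_parallel(process_month, year_months, max_workers=4,
                           executor_cls=ProcessPoolExecutor):
    """Call process_month(year, month) for each pair using a worker pool.

    Returns a dict mapping each (year, month) pair to its result.
    `executor_cls` can be swapped (for example for ThreadPoolExecutor)
    when testing or when the work is I/O bound.
    """
    year_months = list(year_months)
    years = [ym[0] for ym in year_months]
    months = [ym[1] for ym in year_months]
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so results line up with year_months.
        results = list(pool.map(process_month, years, months))
    return dict(zip(year_months, results))
```

With the default ProcessPoolExecutor, `process_month` must be a function defined at module level so that it can be pickled, and the call should sit under an `if __name__ == "__main__":` guard. Keep `max_workers` small enough that the monthly files being processed at once fit in memory.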
Outputs
After running these scripts, you will have the HawaiiCoast_GT dataset in three stages:
- The regional, original raw data.
- Processed data with updated names, classes, and speeds (but no incidents) in a folder called Hawaii_name_class_speed.
- The final AIS data matching the published HawaiiCoast_GT dataset in a folder called hawaii_coast_gt.
The early stages of data can be deleted or kept for reproducing the incident lookup process.
Dependencies
This code was developed with the following dependency versions:
- Python: 3.9.0
- BeautifulSoup4: 4.11.1
- dateutil: 2.8.2
- NumPy: 1.24.3
- pandas: 1.5.2
- Tracktable: 1.6.0
- Requests: 2.29.0
Funding Statement
Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. The views expressed in the article do not necessarily represent the view of the U.S. DOE or the United States Government.
How to Cite
Henriksen, Amelia. (2023). HawaiiCoast_GT: Curated AIS for Hawaii's coast correlated with ground truth incidents (v1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.8253611