
Automated Ocean Acidification Data Pipeline

Automates retrieval and submission of ocean acidification data for the Center for Biological Diversity.

It currently contains scripts to retrieve station data from supported providers and to format it for submission to state environmental data portals.

Initial Setup

These steps must be taken before using the program. Most will only be required once.

Software Setup

This setup only needs to be done once per machine.

  1. Install Docker. The project is designed to run in a Docker container, so Docker is the only prerequisite: Get Docker

  2. Clone the repository. If you haven't already: git clone https://github.com/11th-Hour-Data-Science/cbd-ocean-acidification.git

  3. Change to the root project directory: cd cbd-ocean-acidification

  4. Build the Docker image: docker build --tag cbd .

Record Keeping Setup

Some steps in the 303(d) data submission process must be completed manually, once, before the pipeline can be used. The metadata/stations.csv file contains metadata about the stations from which data can be retrieved. You can open it in Excel or any CSV editor of your choice; just be sure to save any changes as a CSV.

Please refer to "Stations Table Schema" below for information on how to fill out columns.
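
If you would rather script these edits than use Excel, a minimal pandas sketch like the one below works; the station id is a placeholder, and the column names follow the "Stations Table Schema" section:

    import pandas as pd

    # Load the station metadata table (path as referenced above, relative
    # to the repo root; adjust if your checkout differs).
    stations = pd.read_csv("metadata/stations.csv")

    # Example: mark a station as approved for submission.
    # "MY_STATION" is a placeholder, not a real station_id.
    stations.loc[stations["station_id"] == "MY_STATION", "approved"] = True

    # Save back as CSV so the pipeline can read the change.
    stations.to_csv("metadata/stations.csv", index=False)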

Washington

Washington uses EIM to handle environmental data. It has three distinct types of data.

  • Results, which are the values of a given parameter at a given time and place, are formatted in a results table automatically.
  • Locations, which are the descriptions of locations from which results originate, are formatted in a locations table automatically.
  • Studies are data about how the results were collected and who administers the locations. These must be submitted manually once.
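
As a rough illustration of how the first two types relate (this is not the pipeline's actual code, and every column name in it is an assumption):

    import pandas as pd

    # Hypothetical raw measurements: one row per (station, time, parameter).
    raw = pd.DataFrame({
        "station_id": ["A", "A", "B"],
        "timestamp": ["2021-01-01T00:00", "2021-01-01T01:00", "2021-01-01T00:00"],
        "parameter": ["pH", "pH", "pH"],
        "value": [7.9, 7.8, 8.0],
    })

    # Results table: the measured values themselves.
    results = raw[["station_id", "timestamp", "parameter", "value"]]

    # Locations table: one row per distinct station, drawn from stations.csv.
    stations = pd.read_csv("metadata/stations.csv")
    locations = stations[stations["station_id"].isin(raw["station_id"])]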

To set up submitting data to Washington:

  1. Follow the instructions here through creating the relevant studies.
  2. Update stations.csv to add the eim_study_id to the relevant stations.
  3. Ensure all stations you wish to submit have the required information in stations.csv and station_parameter_metadata.csv (a quick check is sketched below).

For more information, visit the EIM page
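
To double-check step 2, a short pandas sketch like this (column names per the schema below) can list Washington stations that are still missing an eim_study_id:

    import pandas as pd

    stations = pd.read_csv("metadata/stations.csv")

    # List Washington stations that still lack an EIM study id.
    wa = stations[stations["state"] == "Washington"]
    print(wa[wa["eim_study_id"].isna()][["station_id", "name"]])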

California

CEDEN is California's primary portal for environmental data upload, but as of this writing it does not accept time series data. All time series data must instead be submitted to the Integrated Report Document Upload Portal. To submit to California:

  1. Create an IR Portal account: https://public2.waterboards.ca.gov/IRPORTAL/Account/Register
  2. Ensure all stations you wish to submit have the required information in stations.csv and station_parameter_metadata.csv (a completeness check is sketched below).
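
A similar completeness check works for California; the set of required columns below is an assumption, so adjust it to the actual submission requirements:

    import pandas as pd

    stations = pd.read_csv("metadata/stations.csv")
    ca = stations[stations["state"] == "California"]

    # Columns assumed to matter for an IR Portal submission; adjust as needed.
    required = ["station_id", "name", "latitude", "longitude"]
    print(ca[ca[required].isna().any(axis=1)])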

Hawaii

Stations

For new stations and locations:

  1. Contact the data source to receive approval, to ensure we are following their terms of service, and to confirm they are not already submitting data.
  2. If the station's data is available through one of the existing collectors (King County, IPACOA, ERDDAP), find its ID in that service and use it as our station_id.
  3. If it is not available through one of the existing collectors, a new scraper will need to be created. If you are able, try writing one yourself (following the existing patterns and the sketch after this list) and open a Pull Request. Otherwise, open an issue describing the new station you would like and where its data can be retrieved.
  4. Add an entry to stations.csv with all relevant data. You may need to contact the source.
  5. Add entries to station_parameter_metadata.csv with all relevant data. You may need to contact the source.
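
If you do attempt a scraper yourself, the sketch below shows one plausible shape; the class, URL, and column names are all hypothetical, so follow the patterns in the existing scrapers (e.g. ipacoa.py, kingcounty.py) rather than this outline:

    import pandas as pd
    import requests

    class NewSourceScraper:
        """Hypothetical skeleton for a collector of a new data source."""

        BASE_URL = "https://example.org/api/measurements"  # placeholder URL

        def fetch(self, station_id: str, start: str, end: str) -> pd.DataFrame:
            # Request raw data for one station over a date range.
            resp = requests.get(
                self.BASE_URL,
                params={"station": station_id, "start": start, "end": end},
                timeout=30,
            )
            resp.raise_for_status()
            # Normalize into a tidy table; these column names are assumptions,
            # so mirror whatever the existing scrapers return.
            return pd.DataFrame(resp.json())[
                ["station_id", "timestamp", "parameter", "value", "unit"]
            ]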

Usage

NERRS

If you are submitting NERRS data:

  1. Get your IPv4 address (visiting a site like https://whatismyipaddress.com/ should do it)

  2. Request a webservices account from NERRS: http://cdmo.baruch.sc.edu/web-services-request/

  3. Wait for your confirmation email. Because most IP addresses change over time, you may need to repeat this step each time you acquire NERRS data, or use a static IP.

  4. In the project root directory, run:

     bash run_tool.sh <STATE> <start_date> <end_date>

    where <STATE> should be the state name (California, Hawaii, or Washington), and <start_date> and <end_date> should be dates in YYYY/MM/DD format (exclude the <>). This will prompt for your password, as sudo privileges are used to change the owner of the output files from root (Docker creates them as root) to the current user.

  5. Results will be saved in results/STATE/YYYY-MM-DDTHH-MM with a README.txt file containing further instructions.
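
For example, to retrieve and format Washington data for the first quarter of 2021:

     bash run_tool.sh Washington 2021/01/01 2021/03/31

The output would then land under a timestamped directory such as results/Washington/2021-04-01T09-30.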

Without Docker (discouraged)

  1. In the project root directory, run:
     python main.py <STATE> --start <start_date> --end <end_date>
    where <STATE> should be the state name (California, Hawaii, or Washington), and <start_date> and <end_date> should be dates in YYYY/MM/DD format (exclude the <>).
  2. Results will be saved in results/STATE/YYYY-MM-DDTHH-MM with a README.txt file containing further instructions.

Directory Structure

pipeline/metadata/

  • ipacoa_measurement_lookup.csv: Lookup table that shows IDs, names, and measurement units for all measurements provided by stations listed in the asset list.
  • ipacoa_platform_measurements.csv: Contains all combinations of stations and measurements accessible through IPACOA. The process column indicates whether that information will be downloaded by the scraper ipacoa.py. If you want to add stations or measurements, update the process column of the relevant rows to True; otherwise that information will not be included in the dataset. Currently only ocean acidification related measurements from West Coast stations have process=True (a scripted way to make this edit is sketched after this list).

  • king-county-keys.json: Contains necessary keys to make a GET request to King County's API. Used by the kingcounty.py script.
  • kingcounty_measurement_lookup.csv: Contains information on measurements, units, and devices used in the stations listed by King County data portal.

  • stations.csv: Table containing information on all stations that can be accessed through both IPACOA and King County data sources.
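
As noted for ipacoa_platform_measurements.csv above, enabling another station/measurement pair is a one-column edit. A hedged pandas sketch (the ids and column names are placeholders; check the CSV header first):

    import pandas as pd

    path = "pipeline/metadata/ipacoa_platform_measurements.csv"
    table = pd.read_csv(path)

    # Enable scraping for one (station, measurement) pair. The ids and the
    # exact column names are placeholders; check the CSV header first.
    mask = (table["station_id"] == "SOME_STATION") & (
        table["measurement_id"] == "SOME_MEASUREMENT"
    )
    table.loc[mask, "process"] = True
    table.to_csv(path, index=False)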

Stations Table Schema

  • station_id: unique identifier for the station, usually taken from the data provider.
  • name: descriptive name of a station
  • approved: TRUE if we have permission to submit station data, FALSE if not.
  • source: organization that operates station
  • provider: source from which we retrieve data from this station
  • QAPP: link to qapp, if available
  • state: state or province station is located in
  • latitude: latitude expressed in decimal form
  • longitude: longitude expressed in decimal form
  • description: qualitative information about station and station location
  • setting: where the station is located. Options are: "Canal Transport", "Estuary", "Lake", "Ocean", "Other-Surface Water", "Reservoir", "River/Stream", "Seep", "Spring", "Storm Sewer". This is from Oregon's DEQ and is converted for other states.
  • collector: organization type of the operator. Options are: "University", "GovFed", "GovLocal", "NOAA".
  • horizontal_datum: datum used for the horizontal coordinates. Acceptable values are: "WGS84", "NAD27", "NAD83".
  • horizontal_coordinate_accuracy:
  • horizontal_coordinate_collection: (DEQ, ?). For DEQ: "GPS-Unspecified", "Interpolation-Map", or blank.
  • study_specific_id:
  • reference_point:
  • ceden_id: (CEDEN only)
  • ceden_project_code: (CEDEN only)
  • tribal_land: (DEQ only) Yes or No.
  • eim_location_study: (EIM only)
  • eim_study_id: (EIM only)
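
A lightweight validation pass over stations.csv can catch schema mistakes before a submission run. The sketch below only uses values documented above, but treat it as an informal check rather than an official validator (in particular, verify the datum column's actual header in the CSV):

    import pandas as pd

    VALID_SETTINGS = {
        "Canal Transport", "Estuary", "Lake", "Ocean", "Other-Surface Water",
        "Reservoir", "River/Stream", "Seep", "Spring", "Storm Sewer",
    }
    VALID_DATUMS = {"WGS84", "NAD27", "NAD83"}

    stations = pd.read_csv("metadata/stations.csv")

    # Flag rows whose values fall outside the documented options or ranges.
    bad = (
        ~stations["setting"].isin(VALID_SETTINGS)
        | ~stations["horizontal_datum"].isin(VALID_DATUMS)  # check actual header
        | ~stations["latitude"].between(-90, 90)
        | ~stations["longitude"].between(-180, 180)
    )
    print(stations.loc[bad, ["station_id", "name"]])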

Contact

You can contact us by opening an issue or emailing tspread at uchicago edu

Project Link: https://github.com/chicago-cdac/cbd-ocean-acidification
