Water Project Datasets
These data packages are part of the San Diego Water Quality Data
source directory contains
data packages for original extracts from upstream soruces, and the
directory holds processed packages that combine multiple measures and add
The easiest way to get these datasets is to visit the water-project tag in our data repository.
You can get programatic access to these in metapack:
$ pip install metapack $ mp run http://library.metatab.org/sandiegodata.org-beachwatch-3.csv#stations
Or, in a Jupyter notebook:
import metapack as mp pkg = mp.open_package('http://library.metatab.org/sandiegodata.org-beachwatch-3.csv') df = pkg.resource('stations').read_csv() df.head()
Building the Datasets
First thing is to install the Python modules in requirements.txt, usually:
To build all of the datasets, run
make. The key targets are:
make build: build all of the datasets
make s3: Build, then upload to s3
make ckan: Build, upload to s3, then submit to CKAN.
The S3 target requires S3 credentials to be set for boto for the bucket
library.metatab.org, which is usually done with either:
- A credentials file in ~/.aws/credentials
- Environmental variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
Refer to the boto credentials documentataion for details.
The ckan target requires two environmental variable to be set:
METAKAN_CKAN_URL, with the base URL to the CKAN repository
METAKAN_API_KEY, With the API key for an account, which you can get from a user's page in CKAN.
There are also clean targets for all of the build targets:
These datasets are all published to the Data Library's CKAN Repo, so it's easiest to get them from there.
If you are building the datasets for local analysis, you'll probably want to index them and use them through the index. First, build the datasets with make, then index them:
Then you can check the index by listing the dataset with:
mp search -l
Or search for one with :
mp search beachwatch
Now you can use the names in package names that have an 'index:' prefix, usually also with the version number removed. For instance:
mp run index:sandiegodata.org-beachwatch#stations
Or, in Python/Jupyter:
import metapack as mp mp.jupyter.init() pkg = mp.open_package('index:sandiegodata.org-beachwatch#stations') df = pkg.default_resource.read_csv() print(df.head())