# Download daily files from ScienceBase

This notebook downloads the daily hydroviz data from ScienceBase through its API. Files are fetched directly from AWS S3 buckets rather than by pointing and clicking through the ScienceBase web interface.

## Setup

In [1]:
import sciencebasepy  # version 2.0.18 works here
from pathlib import Path
import gzip
import shutil

Establish a session and log in. Even though items are public, we need to log in to access cloud data. You will need login.gov credentials.

In [2]:
sb = sciencebasepy.SbSession()

In [3]:
sb.get_token()

A browser window/tab should momentarily open with ScienceBase Manager
Sign in using active directory or login.gov
Click the user icon in the upper right and select 'Copy API token'
This copies the token to your clipboard
Use this value in the add_token function as the token_json parameter


In [None]:
# Placeholder: replace the value below with the token copied from ScienceBase Manager.
token = {"paste your token here"}

In [5]:
sb.add_token(token)
sb.is_logged_in()

True

These are the static and dynamic landcover daily output items from:
https://www.sciencebase.gov/catalog/item/6373bf5cd34ed907bf6c6e38
and
https://www.sciencebase.gov/catalog/item/63890125d34ed907bf78e97f


In [6]:
items = ["6373bf5cd34ed907bf6c6e38"]
items += ["63890125d34ed907bf78e97f"]

Create a function to get the download links for each item.

In [7]:
def get_download_links_for_item(item):
    """Return (filenames, download_links) for one item; both are empty on failure."""
    item_json = sb.get_item(item)
    print("Getting download links for:")
    print(item_json["title"], "\n")

    try:
        file_info = sb.get_item_file_info(item_json)
        # drop xml files
        filenames = [
            info["name"] for info in file_info if not info["name"].endswith(".xml")
        ]
    except Exception as e:
        print("Error getting file info:", e)
        return [], []

    if not filenames:
        return [], []

    try:
        download_links = sb.generate_S3_download_links(item, filenames)
    except Exception as e:
        print("Error generating download links:", e)
        return [], []

    # Always return a 2-tuple so callers can unpack the result safely
    return filenames, download_links

## Get Download Links

First list all of the filenames and download links associated with each item. 

In [8]:
item_filenames = []
item_download_links = []

for item in items:
    filenames, download_links = get_download_links_for_item(item)
    if download_links:
        item_filenames += filenames
        item_download_links += download_links

Getting download links for:
Output Files from Hydrologic Simulations for the Conterminous United States for Historical and Future Conditions Using the National Hydrologic Model Infrastructure (NHM) and the Coupled Model Intercomparison Project Phase 5 (CMIP5) with Static Land Cover 

Getting download links for:
Output Files from Hydrologic Simulations for the Conterminous United States for Historical and Future Conditions Using the National Hydrologic Model Infrastructure (NHM) and the Coupled Model Intercomparison Project Phase 5 (CMIP5) with Dynamic Land Cover 



In [9]:
print(f"All download links ({len(item_download_links)}):")

for filename, link in zip(item_filenames, item_download_links):
    print(filename, ":", link)

All download links (124):
static_ACCESS1-0_historical_r1i1p1_nsegment_summary_seg_outflow.csv.gz : https://prod-is-usgs-sb-prod-publish.s3.us-west-2.amazonaws.com/6373bf5cd34ed907bf6c6e38/static_ACCESS1-0_historical_r1i1p1_nsegment_summary_seg_outflow.csv.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIAYQJECCWKEASQUDM6%2F20251126%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20251126T162051Z&X-Amz-Expires=604800&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEMD%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLXdlc3QtMiJHMEUCIHLw7RxNizPHN3byp8M0jaPLETtf5z95JbboQPJG7e2BAiEA5qGHeTjXXfFmsY0guE6Enn0ocrzIfX2Z5wW6y0P8hCUq%2FAMIif%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARADGgw1ODQ3MjgwNTcyMzYiDFOKkoa3azto4H6AryrQA%2BoukYCcbglVCD1f7Fmg%2FHyvcKQ%2Bl9Lz6FilfRnXdu%2BL5O68kx6fy3z48EEjG8iId4j6KUa%2FimwZxg5%2FI4%2B%2BovRtkbyUbeZraNCkTN%2Bb5heAmKUITl%2BLuDp0NZl0N1bEmP9SqxApYp5xu44AstYrdckRAtKfOVuwuZyOOvuMJ1jMWMT9ipeFr05vp5IgNvcE1EQX64ozvNCt7sRE5eE9wKAJd7UcVoO03bzffLVv75Xi6eQq3U7NVTO4gPkOAEoh5yL0twms9bVOKq2VzjAgimnQiJezvbqhlWj

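Optionally, exclude files that are already in the download directory. This allows us to download in stages, since there are so many files. The check is by filename only, so a partially downloaded file is also skipped; the QC section below catches those.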

In [10]:
download_dir = Path("/import/beegfs/CMIP6/jdpaul3/hydroviz_data/daily")
# Names of files already on disk, as a set for fast membership checks
existing_names = {path.name for path in download_dir.glob("*")}

remaining_filenames = []
remaining_links = []

for filename, link in zip(item_filenames, item_download_links):
    if filename in existing_names:
        print(f"File {filename} already exists. Skipping download.")
    else:
        remaining_filenames.append(filename)
        remaining_links.append(link)

# Keep only the files that still need to be downloaded
item_filenames = remaining_filenames
item_download_links = remaining_links

File static_ACCESS1-0_historical_r1i1p1_nsegment_summary_seg_outflow.csv.gz already exists. Skipping download.
File static_ACCESS1-0_rcp45_r1i1p1_nsegment_summary_seg_outflow.csv.gz already exists. Skipping download.
File static_ACCESS1-0_rcp85_r1i1p1_nsegment_summary_seg_outflow.csv.gz already exists. Skipping download.
File static_bcc-csm1-1_historical_r1i1p1_nsegment_summary_seg_outflow.csv.gz already exists. Skipping download.
File static_bcc-csm1-1_rcp26_r1i1p1_nsegment_summary_seg_outflow.csv.gz already exists. Skipping download.
File static_bcc-csm1-1_rcp45_r1i1p1_nsegment_summary_seg_outflow.csv.gz already exists. Skipping download.
File static_bcc-csm1-1_rcp60_r1i1p1_nsegment_summary_seg_outflow.csv.gz already exists. Skipping download.
File static_bcc-csm1-1_rcp85_r1i1p1_nsegment_summary_seg_outflow.csv.gz already exists. Skipping download.
File static_CCSM4_historical_r1i1p1_nsegment_summary_seg_outflow.csv.gz already exists. Skipping download.
File static_BNU-ESM_historical

## Download files 

**Warning!** Downloading all of the files takes hours. If your connection drops, you may be left with a partial file; the QC section below checks for this.

In [None]:
print(f"Downloading {len(item_download_links)} files to {download_dir}...", "\n")
download = sb.download_cloud_files(item_filenames, item_download_links, download_dir)

Downloading 7 files to /import/beegfs/CMIP6/jdpaul3/hydroviz_data/daily... 

downloading https://prod-is-usgs-sb-prod-content.s3.us-west-2.amazonaws.com/63890125d34ed907bf78e97f/dynamic_MIROC-ESM-CHEM_rcp85_r1i1p1_nsegment_summary_seg_outflow.csv.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIAYQJECCWKEASQUDM6%2F20251126%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20251126T162107Z&X-Amz-Expires=604800&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEMD%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLXdlc3QtMiJHMEUCIHLw7RxNizPHN3byp8M0jaPLETtf5z95JbboQPJG7e2BAiEA5qGHeTjXXfFmsY0guE6Enn0ocrzIfX2Z5wW6y0P8hCUq%2FAMIif%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARADGgw1ODQ3MjgwNTcyMzYiDFOKkoa3azto4H6AryrQA%2BoukYCcbglVCD1f7Fmg%2FHyvcKQ%2Bl9Lz6FilfRnXdu%2BL5O68kx6fy3z48EEjG8iId4j6KUa%2FimwZxg5%2FI4%2B%2BovRtkbyUbeZraNCkTN%2Bb5heAmKUITl%2BLuDp0NZl0N1bEmP9SqxApYp5xu44AstYrdckRAtKfOVuwuZyOOvuMJ1jMWMT9ipeFr05vp5IgNvcE1EQX64ozvNCt7sRE5eE9wKAJd7UcVoO03bzffLVv75Xi6eQq3U7NVTO4gPkOAEoh5yL0twms9bVOKq2VzjAgimnQiJezvbqhlWjuXNPWvIKD

downloading https://prod-is-usgs-sb-prod-content.s3.us-west-2.amazonaws.com/63890125d34ed907bf78e97f/dynamic_MRI-CGCM3_rcp26_r1i1p1_nsegment_summary_seg_outflow.csv.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIAYQJECCWKEASQUDM6%2F20251126%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20251126T162107Z&X-Amz-Expires=604800&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEMD%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLXdlc3QtMiJHMEUCIHLw7RxNizPHN3byp8M0jaPLETtf5z95JbboQPJG7e2BAiEA5qGHeTjXXfFmsY0guE6Enn0ocrzIfX2Z5wW6y0P8hCUq%2FAMIif%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARADGgw1ODQ3MjgwNTcyMzYiDFOKkoa3azto4H6AryrQA%2BoukYCcbglVCD1f7Fmg%2FHyvcKQ%2Bl9Lz6FilfRnXdu%2BL5O68kx6fy3z48EEjG8iId4j6KUa%2FimwZxg5%2FI4%2B%2BovRtkbyUbeZraNCkTN%2Bb5heAmKUITl%2BLuDp0NZl0N1bEmP9SqxApYp5xu44AstYrdckRAtKfOVuwuZyOOvuMJ1jMWMT9ipeFr05vp5IgNvcE1EQX64ozvNCt7sRE5eE9wKAJd7UcVoO03bzffLVv75Xi6eQq3U7NVTO4gPkOAEoh5yL0twms9bVOKq2VzjAgimnQiJezvbqhlWjuXNPWvIKDIbcFKX8llupy5wH1fRbGvKJamY3MxvIlZHF3eoatN8LsqAedoV5LaGWOTSQbKpezN%2BD2x8Vywcb6TAmg

## QC

Check for complete downloads using checksum or size metadata. The non-XML files do not have checksum metadata, so we will use size instead.

In [27]:
def get_md5_and_size_for_items(items):

    file_md5_size_dict = {}

    for item in items:
        item_json = sb.get_item(item)
        print("Getting checksum and size metadata for:")
        print(item_json["title"], "\n")

        for file in item_json['files']:
            file_md5_size_dict[file["name"]] = {}
            file_md5_size_dict[file["name"]]["checksum"] = file['checksum']
            file_md5_size_dict[file["name"]]["size"] = file['size']
            
    return file_md5_size_dict

In [28]:
file_md5_size_dict = get_md5_and_size_for_items(items)
file_md5_size_dict

Getting checksum and size metadata for:
Output Files from Hydrologic Simulations for the Conterminous United States for Historical and Future Conditions Using the National Hydrologic Model Infrastructure (NHM) and the Coupled Model Intercomparison Project Phase 5 (CMIP5) with Static Land Cover 

Getting checksum and size metadata for:
Output Files from Hydrologic Simulations for the Conterminous United States for Historical and Future Conditions Using the National Hydrologic Model Infrastructure (NHM) and the Coupled Model Intercomparison Project Phase 5 (CMIP5) with Dynamic Land Cover 



{'static_ACCESS1-0_historical_r1i1p1_nsegment_summary_seg_outflow.csv.gz': {'checksum': None,
  'size': 3383805265},
 'static_ACCESS1-0_rcp45_r1i1p1_nsegment_summary_seg_outflow.csv.gz': {'checksum': None,
  'size': 5834346488},
 'static_ACCESS1-0_rcp85_r1i1p1_nsegment_summary_seg_outflow.csv.gz': {'checksum': None,
  'size': 5834080305},
 'static_bcc-csm1-1_historical_r1i1p1_nsegment_summary_seg_outflow.csv.gz': {'checksum': None,
  'size': 3382528355},
 'static_bcc-csm1-1_rcp26_r1i1p1_nsegment_summary_seg_outflow.csv.gz': {'checksum': None,
  'size': 5829088711},
 'static_bcc-csm1-1_rcp45_r1i1p1_nsegment_summary_seg_outflow.csv.gz': {'checksum': None,
  'size': 5831113093},
 'static_bcc-csm1-1_rcp60_r1i1p1_nsegment_summary_seg_outflow.csv.gz': {'checksum': None,
  'size': 5827254197},
 'static_bcc-csm1-1_rcp85_r1i1p1_nsegment_summary_seg_outflow.csv.gz': {'checksum': None,
  'size': 5829789429},
 'static_CCSM4_historical_r1i1p1_nsegment_summary_seg_outflow.csv.gz': {'checksum': None,
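
The metadata above reports `'checksum': None` for these files, so a checksum comparison is not possible here. For files that do carry checksum metadata, a digest could be computed locally along the following lines. This is only a minimal sketch: `file_md5` is a hypothetical helper, not part of sciencebasepy, and it assumes the metadata checksum is an MD5 hex digest, which the output above does not confirm.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks.

    Hypothetical helper; assumes the metadata checksum is an MD5 hex digest.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so multi-gigabyte files never sit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string would then be compared with the checksum value in the item metadata. The exact layout of that value should be inspected before relying on the comparison.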

Compare the sizes of the downloaded files to the sizes in the metadata.

In [31]:
download_dir = Path("/import/beegfs/CMIP6/jdpaul3/hydroviz_data/daily")
existing_files = list(download_dir.glob("*"))

for file_path in existing_files:
    filename = file_path.name
    if filename in file_md5_size_dict:
        expected_size = file_md5_size_dict[filename]["size"]

        # Verify file size
        actual_size = file_path.stat().st_size
        if actual_size != expected_size:
            print(f"Size mismatch for {filename}: expected {expected_size}, got {actual_size}")
            continue
        # else:
        #     print(f"File {filename} passed size verification.")
    else:
        print(f"No metadata found for {filename}.")
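
A size match does not prove that the bytes themselves are intact. As a complementary check, each `.gz` file can be read through to the end, which fails if the stream is truncated or corrupt. The sketch below is an assumption-laden illustration rather than part of the original workflow: `gzip_is_intact` is a hypothetical helper name, and reading multi-gigabyte files this way takes a while.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1024 * 1024):
    """Return True if the whole gzip stream decompresses without error.

    Hypothetical helper, not part of sciencebasepy.
    """
    try:
        with gzip.open(path, "rb") as f:
            # Discard the data; we only care that reading reaches the end.
            while f.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers a bad gzip header, EOFError a truncated stream,
        # zlib.error corrupt compressed data.
        return False
```

It could be called on each path in `existing_files` whose suffix is `.gz`, with any file that returns `False` deleted and downloaded again.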

## Unzip the files

In [2]:
download_dir = Path("/import/beegfs/CMIP6/jdpaul3/hydroviz_data/daily")
existing_files = list(download_dir.glob("*"))

In [3]:
for file in existing_files:
    if file.suffix == ".gz":
        unzipped_file_path = file.with_suffix('')  # remove .gz suffix
        try:
            with gzip.open(file, 'rb') as f_in:
                with open(unzipped_file_path, 'wb') as f_out:
                    shutil.copyfileobj(f_in, f_out)
            print(f"Unzipped {file.name} to {unzipped_file_path.name}")
        except Exception as e:
            print(f"Failed to unzip {file.name}: {e}")

Unzipped static_bcc-csm1-1_rcp85_r1i1p1_nsegment_summary_seg_outflow.csv.gz to static_bcc-csm1-1_rcp85_r1i1p1_nsegment_summary_seg_outflow.csv
Unzipped dynamic_IPSL-CM5A-LR_rcp85_r1i1p1_nsegment_summary_seg_outflow.csv.gz to dynamic_IPSL-CM5A-LR_rcp85_r1i1p1_nsegment_summary_seg_outflow.csv
Unzipped static_Maurer_nsegment_summary_seg_outflow.csv.gz to static_Maurer_nsegment_summary_seg_outflow.csv
Unzipped dynamic_GFDL-ESM2G_rcp26_r1i1p1_nsegment_summary_seg_outflow.csv.gz to dynamic_GFDL-ESM2G_rcp26_r1i1p1_nsegment_summary_seg_outflow.csv
Unzipped dynamic_IPSL-CM5A-MR_rcp45_r1i1p1_nsegment_summary_seg_outflow.csv.gz to dynamic_IPSL-CM5A-MR_rcp45_r1i1p1_nsegment_summary_seg_outflow.csv
Unzipped static_MIROC-ESM-CHEM_rcp85_r1i1p1_nsegment_summary_seg_outflow.csv.gz to static_MIROC-ESM-CHEM_rcp85_r1i1p1_nsegment_summary_seg_outflow.csv
Unzipped dynamic_CCSM4_rcp26_r1i1p1_nsegment_summary_seg_outflow.csv.gz to dynamic_CCSM4_rcp26_r1i1p1_nsegment_summary_seg_outflow.csv
Unzipped dynamic_BN
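
If decompression fails partway through, the loop above leaves an incomplete `.csv` next to the archive. One way to avoid that is to write to a temporary name and rename only on success. The following is a minimal sketch under stated assumptions: `gunzip_atomically` and the `.part` suffix are hypothetical choices for illustration, and it requires Python 3.8 or later for `missing_ok`.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_atomically(gz_path):
    """Decompress gz_path next to itself, exposing the result only on success.

    Hypothetical helper; the ".part" temporary suffix is an arbitrary choice.
    """
    gz_path = Path(gz_path)
    final_path = gz_path.with_suffix("")  # remove .gz suffix
    temp_path = final_path.with_name(final_path.name + ".part")
    try:
        with gzip.open(gz_path, "rb") as f_in, open(temp_path, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
    except Exception:
        # Remove the incomplete output so it cannot be mistaken for a finished file.
        temp_path.unlink(missing_ok=True)
        raise
    # The rename happens only after the whole stream was written.
    temp_path.replace(final_path)
    return final_path
```

Because the temporary file does not end in `.gz`, a leftover `.part` file would be ignored by the unzip loop on a later run.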