#### Download from ScienceBase

This notebook downloads the SECASC hydrological futures dataset from ScienceBase through its API. The API downloads files directly from AWS S3 buckets, which replaces point-and-click downloads from the web interface. The steps follow the usage instructions in the `sciencebasepy` [documentation](https://github.com/DOI-USGS/sciencebasepy/blob/master/README.md) on GitHub.

Only use this notebook if you need your own copy of the dataset. Otherwise, use the existing copy in the directory defined below.

In [20]:
import sciencebasepy
import os
from pathlib import Path

dir = Path("/import/beegfs/CMIP6/jdpaul3/hydroviz_data")

Establish a session and get public items. No need to log in!

In [21]:
sb = sciencebasepy.SbSession()

# This is the streamflow item from here: https://www.sciencebase.gov/catalog/item/638763a9d34ed907bf78432b
items = ['638763a9d34ed907bf78432b']
# Add the GIS item from here: https://www.sciencebase.gov/catalog/item/6373c4bdd34ed907bf6c6e4d
items += ['6373c4bdd34ed907bf6c6e4d']

Get item JSON and confirm these are the items we want by printing the titles.

In [22]:
items_json = [sb.get_item(item) for item in items]
for item_json in items_json:
    print(item_json['title'])

Streamflow Statistics for Hydrologic Simulations for the Conterminous United States for Historical and Future Conditions Using the National Hydrologic Model Infrastructure (NHM) and the Coupled Model Intercomparison Project Phase 5 (CMIP5), 1950 - 2100
GIS Features Used With Hydrologic Simulations for the Conterminous United States for Historical and Future Conditions Using the National Hydrologic Model Infrastructure (NHM) and the Coupled Model Intercomparison Project Phase 5 (CMIP5), 1950 - 2100


Download all of the files associated with each item. This took me 5-10 minutes but may vary depending on the API and network.

In [23]:
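subdirs = ["stats", "gis"]
for item_json, subdir in zip(items_json, subdirs):
    path = dir.joinpath(subdir)
    # Create the target directory (and any parents) if it does not exist yet
    path.mkdir(parents=True, exist_ok=True)
    downloaded = sb.get_item_files(item_json, path)
    print("Downloaded files " + str(downloaded))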

downloading https://sciencebase.usgs.gov/manager/item/638763a9d34ed907bf78432b/file/clb5essaf00h80spj1vta1wld to /import/beegfs/CMIP6/jdpaul3/hydroviz_data/stats/dynamic_BNU-ESM_r1i1p1_diff.zip
downloading https://sciencebase.usgs.gov/manager/item/638763a9d34ed907bf78432b/file/cliungpv000190upncoucdzct to /import/beegfs/CMIP6/jdpaul3/hydroviz_data/stats/dynamic_ACCESS1-0_r1i1p1.zip
downloading https://sciencebase.usgs.gov/manager/item/638763a9d34ed907bf78432b/file/cliungrce001b0upn0pi4eikl to /import/beegfs/CMIP6/jdpaul3/hydroviz_data/stats/dynamic_BNU-ESM_r1i1p1.zip
downloading https://sciencebase.usgs.gov/manager/item/638763a9d34ed907bf78432b/file/clb55njeo00gz0spj87sv5q3c to /import/beegfs/CMIP6/jdpaul3/hydroviz_data/stats/dynamic_Maurer.zip
downloading https://sciencebase.usgs.gov/manager/item/638763a9d34ed907bf78432b/file/clb5eqcun00gi0mo017nr3qj3 to /import/beegfs/CMIP6/jdpaul3/hydroviz_data/stats/dynamic_ACCESS1-0_r1i1p1_diff.zip
downloading https://sciencebase.usgs.gov/manager/

Some of these archives failed to unzip in preliminary testing, so we should check MD5 sums before proceeding. First, let's find out which files have an MD5 value in their ScienceBase metadata and which files are missing one.

In [24]:
md5_ok = []
md5_missing = []

for i in items_json:
    for file in i['files']:
        # A file either carries checksum metadata or has a null/missing entry
        if file.get('checksum') is not None:
            md5_ok.append(file['name'])
        else:
            md5_missing.append(file['name'])

print("Files with MD5 checksum metadata: ", len(md5_ok))
print("Files missing MD5 checksum metadata: ", len(md5_missing))

Files with MD5 checksum metadata:  48
Files missing MD5 checksum metadata:  14


The 14 files without MD5 metadata cannot be verified this way, but we can check the rest:

In [25]:
# Check every item, not only the first, so that all files counted above are verified
for item_json in items_json:
    for file in item_json['files']:
        if file['checksum'] is not None:
            md5_metadata = file['checksum']['value']
            filepath = str(list(dir.glob(f'**/{file["name"]}'))[0])
            # md5sum prints "<digest>  <path>"; keep only the digest
            output_string = !md5sum {filepath}
            md5_computed = output_string[0].split()[0]
            assert md5_metadata == md5_computed, 'MD5 checksums do not match for file: ' + file['name']
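
The check above shells out to `md5sum`, which is not available on every platform. As a portable alternative, here is a minimal sketch that computes the same digest with Python's standard `hashlib` module. The helper name `md5_of_file` and the 1 MiB chunk size are illustrative choices, not part of `sciencebasepy`.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Hypothetical helper: reading in chunks keeps memory use flat
    even for multi-gigabyte archives.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Under these assumptions, `md5_of_file(filepath) == file['checksum']['value']` could stand in for the `md5sum` comparison in the cell above.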

OK, those look good, so let's unzip them. `unzip` will fail on any file that is not a zip archive, and that's OK.

In [26]:
for file in md5_ok:
    # Locate the downloaded file once and reuse the path
    filepath = list(dir.glob(f'**/{file}'))[0]
    output_subdir = filepath.parent
    print('Unzipping file: ', file, ' into ', output_subdir)
    !unzip {filepath} -d {output_subdir}

Unzipping file:  dynamic_IPSL-CM5A-LR_r1i1p1_diff.zip  into  /import/beegfs/CMIP6/jdpaul3/hydroviz_data/stats
Archive:  /import/beegfs/CMIP6/jdpaul3/hydroviz_data/stats/dynamic_IPSL-CM5A-LR_r1i1p1_diff.zip
  inflating: /import/beegfs/CMIP6/jdpaul3/hydroviz_data/stats/dynamic_IPSL-CM5A-LR_rcp26_r1i1p1_hru_2016_2045_1976_2005_diff.csv  
  inflating: /import/beegfs/CMIP6/jdpaul3/hydroviz_data/stats/dynamic_IPSL-CM5A-LR_rcp26_r1i1p1_hru_2046_2075_1976_2005_diff.csv  
  inflating: /import/beegfs/CMIP6/jdpaul3/hydroviz_data/stats/dynamic_IPSL-CM5A-LR_rcp26_r1i1p1_hru_2071_2100_1976_2005_diff.csv  
  inflating: /import/beegfs/CMIP6/jdpaul3/hydroviz_data/stats/dynamic_IPSL-CM5A-LR_rcp26_r1i1p1_seg_2016_2045_1976_2005_diff.csv  
  inflating: /import/beegfs/CMIP6/jdpaul3/hydroviz_data/stats/dynamic_IPSL-CM5A-LR_rcp26_r1i1p1_seg_2046_2075_1976_2005_diff.csv  
  inflating: /import/beegfs/CMIP6/jdpaul3/hydroviz_data/stats/dynamic_IPSL-CM5A-LR_rcp26_r1i1p1_seg_2071_2100_1976_2005_diff.csv  
  inflat
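
If the noise from `unzip` failing on non-zip files is unwanted, the extraction can also be sketched with the standard `zipfile` module, which can test a file before trying to open it. The helper name `extract_if_zip` is hypothetical, and this is an alternative to the `!unzip` call above rather than what the notebook actually ran.

```python
import zipfile
from pathlib import Path


def extract_if_zip(path, dest):
    """Extract `path` into `dest` if it is a readable zip archive.

    Returns the list of member names, or None when the file is not
    a zip archive (for example a CSV, XML, or damaged download).
    """
    path = Path(path)
    if not zipfile.is_zipfile(path):
        return None
    with zipfile.ZipFile(path) as zf:
        zf.extractall(dest)
        return zf.namelist()
```

A `None` return value makes it easy to collect the skipped files in a list and look at them afterwards.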

Finally, let's attempt to unzip the files missing md5 sums and see what happens. Hopefully all downloads were complete and there are no errors.

In [27]:
for file in md5_missing:
    # Locate the downloaded file once and reuse the path
    filepath = list(dir.glob(f'**/{file}'))[0]
    output_subdir = filepath.parent
    print('Unzipping file: ', file, ' into ', output_subdir)
    !unzip {filepath} -d {output_subdir}

Unzipping file:  dynamic_BNU-ESM_r1i1p1_diff.zip  into  /import/beegfs/CMIP6/jdpaul3/hydroviz_data/stats
Archive:  /import/beegfs/CMIP6/jdpaul3/hydroviz_data/stats/dynamic_BNU-ESM_r1i1p1_diff.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of /import/beegfs/CMIP6/jdpaul3/hydroviz_data/stats/dynamic_BNU-ESM_r1i1p1_diff.zip or
        /import/beegfs/CMIP6/jdpaul3/hydroviz_data/stats/dynamic_BNU-ESM_r1i1p1_diff.zip.zip, and cannot find /import/beegfs/CMIP6/jdpaul3/hydroviz_data/stats/dynamic_BNU-ESM_r1i1p1_diff.zip.ZIP, period.
Unzipping file:  dynamic_ACCESS1-0_r1i1p1.zip  into  /import/beegfs/CMIP6/jdpaul3/hydroviz_data/stats
Archive:  /import/beegfs/CMIP6/jdpaul3/hydroviz_data/stats/dynamic_ACCESS1-0_r1i1p1.zip
  End-of-central-direct

Well that's annoying! We need to find out what's going on with these files and why they won't unzip!
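
One way to start investigating is to look at what is actually on disk for each failing file. The sketch below reports the file size, the first few bytes, and whether Python's `zipfile` module recognizes the file. The cause of the failures is not known at this point. A truncated download, or a non-zip payload saved under a `.zip` name, are only possibilities worth ruling out. The helper name `describe_download` is hypothetical, and comparing against a `size` field in the ScienceBase file metadata is an assumption about that JSON.

```python
import zipfile
from pathlib import Path


def describe_download(path, expected_size=None):
    """Collect basic facts about a downloaded file for troubleshooting.

    `expected_size` is optional; it is assumed to come from the file's
    metadata (for example a 'size' entry), if such a value is available.
    """
    path = Path(path)
    size = path.stat().st_size
    with open(path, "rb") as f:
        head = f.read(8)  # enough to see the zip signature or e.g. "<html>"
    return {
        "name": path.name,
        "size_bytes": size,
        "size_matches": None if expected_size is None else size == expected_size,
        # Zip archives start with the two bytes "PK"
        "starts_with_zip_magic": head.startswith(b"PK"),
        "is_zipfile": zipfile.is_zipfile(path),
        "head": head,
    }
```

Running this over the names in `md5_missing` and comparing the results with a file that unzipped cleanly should show whether the problem files are short, empty, or not zip data at all.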