# Working with Data

This tutorial introduces browsing for data in the Unity SDS data services. That data will later be used, together with an application, to generate new data by submitting a job; job submission is covered in the next tutorial. Run each cell in order (Shift+Enter). The notes indicate when you need to edit code to customize things (e.g., to choose a data collection) and when you will instead be prompted by a running cell (e.g., for your username and password).

In [1]:
import requests
import getpass
import json
from IPython.display import JSON

First, we need some predefined environment variables.

In [2]:
# This portion of the code is environment specific (Dev, Test, Ops, etc.).
# Define the environment. The dev venue is active here; the test values are commented out.
env = {
    # test clientId
#    "clientId":"71894molftjtie4dvvkbjeard0",

    # dev clientId
    "clientId":"71g0c73jl77gsqhtlfg2ht388c",

    # test DAPA
#    "url":"https://58nbcawrvb.execute-api.us-west-2.amazonaws.com/test/"

    # dev DAPA
    "url":"https://1gp9st60gd.execute-api.us-west-2.amazonaws.com/dev/"
      }

# The auth_json is a template for authorizing with AWS Cognito for a token that can be used for calls to the data service.
# For now this is just an empty data structure. You will be prompted for your username and password in a few steps.
auth_json = '''{
     "AuthParameters" : {
        "USERNAME" : "",
        "PASSWORD" : ""
     },
     "AuthFlow" : "USER_PASSWORD_AUTH",
     "ClientId" : ""
  }'''

### Authentication Code

The below method is a helper function for getting an access token for accessing Unity SDS services. You must pass the token along with any API requests in order to access the various Unity SDS services.

In [3]:
# This function takes a username, password, and client ID and fetches a Cognito access token.
# It returns None if the token could not be obtained.
def get_token(username, password, clientID):
    aj = json.loads(auth_json)
    aj['AuthParameters']['USERNAME'] = username
    aj['AuthParameters']['PASSWORD'] = password
    aj['ClientId'] = clientID
    token = None
    try:
        response = requests.post(
            'https://cognito-idp.us-west-2.amazonaws.com',
            headers={
                "Content-Type": "application/x-amz-json-1.1",
                "X-Amz-Target": "AWSCognitoIdentityProviderService.InitiateAuth",
            },
            json=aj,
        )
        token = response.json()['AuthenticationResult']['AccessToken']
    except (requests.RequestException, KeyError, ValueError):
        # Network failure, unexpected response body, or rejected credentials
        print("Error, check username and password and try again.")
    return token
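
As an alternative to filling in a JSON string template, the same request body can be built directly as a dictionary. The sketch below assumes the same fields as `auth_json`; `build_auth_payload` is a hypothetical helper name used only for illustration, not part of Unity SDS.

```python
def build_auth_payload(username, password, client_id):
    """Return the Cognito InitiateAuth request body as a plain dict.

    Hypothetical helper: it mirrors the fields of the auth_json template above.
    """
    return {
        "AuthParameters": {"USERNAME": username, "PASSWORD": password},
        "AuthFlow": "USER_PASSWORD_AUTH",
        "ClientId": client_id,
    }


# Example with placeholder values (not real credentials)
payload = build_auth_payload("example_user", "example_password", "example_client_id")
print(sorted(payload))
```

A dictionary like this could be passed to `requests.post(..., json=payload)` in place of the parsed template.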

### Prompt for your Unity username and password

These are required to get the token (described above) to connect to the data services.

In [4]:
print("Please enter your username...")
user_name = input()

print("Please enter your password...")
password = getpass.getpass()

Please enter your username...


 rtapella


Please enter your password...


 ············


In [5]:
token = get_token(user_name, password, env['clientId'])

if token:
    print("Token received.")

Token received.
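
Because `get_token` returns `None` when login fails, later cells that build the header with `"Bearer " + token` would stop with a `TypeError`. A minimal sketch of a guard follows; `bearer_headers` is a hypothetical helper name, not part of the Unity SDS services.

```python
def bearer_headers(token):
    """Build the Authorization header used by the requests in this tutorial.

    Hypothetical helper: raises early, with a readable message, if no token was obtained.
    """
    if not token:
        raise ValueError("No access token; re-run the login cells above.")
    return {"Authorization": "Bearer " + token}


# Example with a placeholder token string (not a real token)
headers = bearer_headers("example-token")
print(headers)
```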


## List Available Data Collections in the Unity System

Data is organized into Collections. Any particular data file will be in at least one Collection.

In [8]:
# The DAPA-request endpoint to retrieve collections is the base URL plus the following:
url = env['url'] + "am-uds-dapa/collections"

# tweak the paging with this parameter
params = []
params.append(("limit", 50))

# Make a GET request at the URL you have constructed, using your access token
response = requests.get(url, headers={"Authorization": "Bearer " + token}, params=params)

print("Data Collections at " + url)
# To see raw JSON of the API response, uncomment this line:
#print(json.dumps(response.json()))

features = response.json()['features']

for data_set in features:
   print(data_set['id'])

print("\nFull JSON response object:")
JSON(response.json())

Data Collections at https://1gp9st60gd.execute-api.us-west-2.amazonaws.com/dev/am-uds-dapa/collections
URN:NASA:UNITY:MAIN_PROJECT:DEV:CUMULUS_DAPA_UNIT_TEST___1691544291
URN:NASA:UNITY:MAIN_PROJECT:DEV:CUMULUS_DAPA_UNIT_TEST___1691542970
CUMULUS_DAPA_UNIT_TEST___1690844736
CUMULUS_DAPA_UNIT_TEST___1690834109
CUMULUS_DAPA_UNIT_TEST___1690832834
CUMULUS_DAPA_UNIT_TEST___1690832711
CUMULUS_DAPA_UNIT_TEST___1690423426
CUMULUS_DAPA_UNIT_TEST___1690421539
CUMULUS_DAPA_UNIT_TEST___1690348633_8
CUMULUS_DAPA_UNIT_TEST___1690348633_7
CUMULUS_DAPA_UNIT_TEST___1690348633_6
CUMULUS_DAPA_UNIT_TEST___1690348633_5
CUMULUS_DAPA_UNIT_TEST___1690348633_4
CUMULUS_DAPA_UNIT_TEST___1690348633_3
CUMULUS_DAPA_UNIT_TEST___1690348633_1
CUMULUS_DAPA_UNIT_TEST___1690348633
CUMULUS_DAPA_UNIT_TEST___1690232329
urn:nasa:unity:uds_local_test:DEV1:CHRP_16_DAY_REBIN___1
urn:nasa:unity:uds_local_test:DEV1:GESDISC_TEST___1
urn:nasa:unity:uds_local_test:DEV1:SNDR13CHRP1___2
urn:nasa:unity:uds_local_test:DEV1:ECO1BATT___1

<IPython.core.display.JSON object>
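
The list of Collections can get long. The sketch below shows one way to narrow it down by a substring, assuming the `features` / `id` layout seen in the response above. The function name `collection_ids` and its `contains` parameter are hypothetical, and the sample dictionary is hand-written to imitate the response rather than taken from the API.

```python
def collection_ids(response_json, contains=None):
    """Return collection IDs from a collections response, optionally filtered.

    Assumes each entry in 'features' has an 'id' key.
    `contains` is a hypothetical, case-insensitive substring filter.
    """
    ids = [feature["id"] for feature in response_json.get("features", [])]
    if contains is not None:
        ids = [cid for cid in ids if contains.lower() in cid.lower()]
    return ids


# Small hand-written sample that imitates the response structure
sample = {
    "features": [
        {"id": "URN:NASA:UNITY:MAIN_PROJECT:DEV:CUMULUS_DAPA_UNIT_TEST___1691544291"},
        {"id": "urn:nasa:unity:uds_local_test:DEV1:CHRP_16_DAY_REBIN___1"},
        {"id": "urn:nasa:unity:uds_local_test:DEV1:ECO1BATT___1"},
    ]
}
print(collection_ids(sample, contains="uds_local_test"))
```

With a live response, `collection_ids(response.json(), contains="uds_local_test")` would be the equivalent call.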

## Given a collection (above), list the files within that collection

Executing this cell will retrieve all the files in a Collection defined by the data_set variable. Then it will print out the name and href location of each (up to a limit defined in this code block).

To see a different data Collection, change the data_set variable to one of the other Collections you found in the step above. To return a different number of files per request than the 5 set here, change the value in the params.append() call.

In [17]:
data_set = "urn:nasa:unity:uds_local_test:DEV1:CHRP_16_DAY_REBIN___1"

url = env['url'] + "am-uds-dapa/collections/"+data_set+"/items"

# tweak the paging with this parameter
params = []
params.append(("limit", 5))

response = requests.get(url, headers={"Authorization": "Bearer " + token}, params=params)

print("Endpoint: " + url)
print(f"Total number of files: {response.json()['numberMatched']}")

# get the features (one per file) returned on this page of results
features = response.json()['features']
print(str(len(features)) + " results per page")
print("File IDs, titles, and hrefs in Collection " + data_set + "\n")


for data_file in features:
    print("ID:\n" + data_file['id'])
    print("File:\n" + data_file['assets']['data']['href'])
    print("Datetime:\n" + data_file['properties']['datetime'])
    # print("STAC Metadata:\n" + data_file['assets']['metadata__stac']['href'])
    print("")


print("Full JSON response object:")
JSON(response.json())


Endpoint: https://1gp9st60gd.execute-api.us-west-2.amazonaws.com/dev/am-uds-dapa/collections/urn:nasa:unity:uds_local_test:DEV1:CHRP_16_DAY_REBIN___1/items
Total number of files: 44240
5 results per page
File IDs, titles, and hrefs in Collection urn:nasa:unity:uds_local_test:DEV1:CHRP_16_DAY_REBIN___1

ID:
urn:nasa:unity:uds_local_test:DEV1:CHRP_16_DAY_REBIN___1:SNDR.SS1330.CHIRP.20230101T0000.m06.g001.L1_J1.std.v02_48.G.200101070318_REBIN
File:
s3://uds-dev-cumulus-sps/urn:nasa:unity:uds_local_test:DEV1:CHRP_16_DAY_REBIN___1:SNDR.SS1330.CHIRP.20230101T0000.m06.g001.L1_J1.std.v02_48.G.200101070318_REBIN/SNDR.SS1330.CHIRP.20230101T0000.m06.g001.L1_J1.std.v02_48.G.200101070318_REBIN.nc
Datetime:
2023-08-29T16:57:37.239000Z

ID:
urn:nasa:unity:uds_local_test:DEV1:CHRP_16_DAY_REBIN___1:SNDR.SS1330.CHIRP.20230101T0006.m06.g002.L1_J1.std.v02_48.G.200101070318_REBIN
File:
s3://uds-dev-cumulus-sps/urn:nasa:unity:uds_local_test:DEV1:CHRP_16_DAY_REBIN___1:SNDR.SS1330.CHIRP.20230101T0006.m06.g002

<IPython.core.display.JSON object>
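
Indexing such as `data_file['assets']['data']['href']` raises a `KeyError` if an item lacks one of those keys. The sketch below is one tolerant way to pull out the same three fields, assuming the item layout shown above. `summarize_item` is a hypothetical helper name, and the sample item uses placeholder values.

```python
def summarize_item(item):
    """Pull the ID, data href, and datetime out of one item (feature).

    Uses .get() so that a missing key yields None instead of a KeyError.
    """
    assets = item.get("assets", {})
    return {
        "id": item.get("id"),
        "href": assets.get("data", {}).get("href"),
        "datetime": item.get("properties", {}).get("datetime"),
    }


# Hand-written sample item with placeholder values
sample_item = {
    "id": "example-collection:example-granule",
    "properties": {"datetime": "2023-08-29T16:57:37.239000Z"},
    "assets": {"data": {"href": "s3://example-bucket/example-granule/example-granule.nc"}},
}
print(summarize_item(sample_item))
print(summarize_item({"id": "item-without-assets"}))
```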

## Filter the results above by time

The standards-based API used by the Unity SDS Data Store, DAPA, has a variety of filtering options. Currently we have implemented a time-based filter. See more about the Data Access and Processing API at: https://docs.ogc.org/per/20-025r1.html#_dapa_overview

This cell will filter the full list of files in the Collection with ID = data_set by a start and end time defined by the datetime parameter.

In [16]:
#data_set = "SNDR_SNPP_ATMS_L1B_OUTPUT___1"
data_set = "urn:nasa:unity:uds_local_test:DEV1:CHRP_16_DAY_REBIN___1"

url = env['url'] + "am-uds-dapa/collections/"+data_set+"/items"
# the datetime, limit, and offset are included due to a current bug in the API Gateway setting these values to 'none'.
# Example date/time params

params = []
#add a datetime to your request
params.append(("datetime", "2023-01-01T00:06:00Z/2023-01-03T00:06:00Z"))

# limit - how many results to return in a single request
params.append(("limit", 20))

response = requests.get(url, headers={"Authorization": "Bearer " + token}, params=params)

print(f"Total number of files: {response.json()['numberMatched']}")
print("File IDs, datetimes, and hrefs in Collection " + data_set + "\n")

print("Full JSON response object:")
JSON(response.json())

Total number of files: 2
File IDs, datetimes, and hrefs in Collection urn:nasa:unity:uds_local_test:DEV1:CHRP_16_DAY_REBIN___1

Full JSON response object:


<IPython.core.display.JSON object>
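
The `datetime` value used above is a start time and an end time joined by a slash. Rather than typing the string by hand, it can be generated from Python `datetime` objects. This is a minimal sketch that assumes the API accepts the same `start/end` form with whole seconds and a trailing `Z`; `datetime_interval` is a hypothetical helper name, and it expects timezone-aware datetimes.

```python
from datetime import datetime, timezone


def datetime_interval(start, end):
    """Format two timezone-aware datetimes as 'start/end' in UTC with a trailing 'Z'."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    start_utc = start.astimezone(timezone.utc).strftime(fmt)
    end_utc = end.astimezone(timezone.utc).strftime(fmt)
    return start_utc + "/" + end_utc


start = datetime(2023, 1, 1, 0, 6, tzinfo=timezone.utc)
end = datetime(2023, 1, 3, 0, 6, tzinfo=timezone.utc)
interval = datetime_interval(start, end)
print(interval)  # 2023-01-01T00:06:00Z/2023-01-03T00:06:00Z
```

The result could then be passed as `params.append(("datetime", interval))`.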

In [18]:
#data_set = "SNDR_SNPP_ATMS_L1B_OUTPUT___1"
data_set = "urn:nasa:unity:uds_local_test:DEV1:CHRP_16_DAY_REBIN___1"

url = env['url'] + "am-uds-dapa/collections/"+data_set+"/items"
# the datetime, limit, and offset are included due to a current bug in the API Gateway setting these values to 'none'.
# Example date/time params

params = []
#add a datetime to your request
params.append(("datetime", "2023-01-01T00:06:00Z/2023-01-03T00:06:00Z"))

# limit - how many results to return in a single request
params.append(("limit", 20))

response = requests.get(url, headers={"Authorization": "Bearer " + token}, params=params)

print(f"Total number of files: {response.json()['numberMatched']}")
print("File IDs, datetimes, and hrefs in Collection " + data_set + "\n")

features = response.json()['features']
print(str(len(features)) +" features")
#features = features[0:10]

max_pages = 10  # safety cap so the loop cannot run forever
page = 0
while features and page < max_pages:
    for data_file in features:
        print(data_file['id'])
        print(data_file['properties']['created'])
        # print(data_file['assets']['metadata__stac']['href'])
        print(data_file['assets']['data']['href'])
        print("")

    # Get the next page of results by following the 'next' link, if the response has one
    next_link = next((item for item in response.json().get('links', []) if item['rel'] == 'next'), None)
    if next_link is None:
        break
    response = requests.get(next_link['href'], headers={"Authorization": "Bearer " + token})
    features = response.json().get('features', [])
    page += 1
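
A loop over pages has to fetch a new page on every pass and stop when there is nothing left; otherwise it prints the same page until interrupted. The sketch below separates that logic from the network call so it can be tried offline. It assumes each response carries a `links` list whose `next` entry has an `href`, as the commented-out code above suggests. `iter_features`, `fetch_page`, and `max_pages` are hypothetical names, and the fake pages are stand-ins for real responses.

```python
def iter_features(fetch_page, first_url, max_pages=10):
    """Yield features page by page, following 'next' links.

    fetch_page is any callable that takes a URL and returns the parsed JSON
    response as a dict. max_pages is a safety cap against endless loops.
    """
    url = first_url
    for _ in range(max_pages):
        page = fetch_page(url)
        features = page.get("features", [])
        if not features:
            return
        yield from features
        next_links = [link for link in page.get("links", []) if link.get("rel") == "next"]
        if not next_links:
            return
        url = next_links[0]["href"]


# Stand-in for the API: two fake pages keyed by URL (no network access needed)
fake_pages = {
    "page1": {"features": [{"id": "a"}, {"id": "b"}], "links": [{"rel": "next", "href": "page2"}]},
    "page2": {"features": [{"id": "c"}], "links": []},
}
ids = [feature["id"] for feature in iter_features(fake_pages.get, "page1")]
print(ids)
```

Against the live service, `fetch_page` could be something like `lambda u: requests.get(u, headers={"Authorization": "Bearer " + token}).json()`, though whether the `next` link already includes the query parameters is not confirmed here.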

## Credential-less data download

When accessing data stores within the same venue, you'll be able to download data from S3 without credentials. 

**Note:** the following libraries are needed for this. They can be installed by running the command below in a Jupyter terminal:

```
conda install xarray netcdf4 hdf5 boto3 matplotlib
```


In [None]:
import boto3

In [None]:
s3 = boto3.client('s3')
s3.download_file('uds-test-cumulus-protected', 'SNDR_SNPP_ATMS_L1A___1/SNDR.SNPP.ATMS.L1A.nominal2.04.nc', 'test_file11.nc')
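
The `href` values returned by the items endpoint are `s3://` URIs, while `download_file` takes the bucket and the key as separate arguments. A minimal sketch of splitting one into the other follows; `split_s3_uri` is a hypothetical helper name, and the URI in the example is a placeholder.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/key' into a (bucket, key) tuple."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError("Not an s3:// URI: " + uri)
    # Everything up to the first '/' is the bucket; the rest is the object key
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError("URI must contain both a bucket and a key: " + uri)
    return bucket, key


bucket, key = split_s3_uri("s3://example-bucket/example-collection/example-file.nc")
print(bucket)
print(key)
```

The two values could then be passed as the first two arguments of `s3.download_file`, followed by a local file name.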

In [None]:
import xarray as xr
ds = xr.open_dataset('test_file11.nc')
ds

In [None]:
ds.band_surf_alt.plot()