# Working with Data

This tutorial familiarizes you with browsing for data in the Unity system. That data is later used as input to an application that generates new data when you submit a job; job submission is covered in the next tutorial. Run each cell in order (Shift-Enter). The notes indicate when you need to edit code to customize things (e.g., to specify a data collection) vs. when you will be prompted by running the cell (e.g., for your username and password).

In [22]:
import json
from IPython.display import JSON
from pathlib import Path
import datetime
import os
import requests

from unity_sds_client.unity import Unity
from unity_sds_client.unity import UnityEnvironments
from unity_sds_client.unity_session import UnitySession
from unity_sds_client.unity_services import UnityServices as services
from unity_sds_client.resources.collection import Collection
from unity_sds_client.resources.dataset import Dataset
from unity_sds_client.resources.data_file import DataFile

In [2]:
# We set the environment to TEST here; change this to match the environment you are working in.
s = Unity(UnityEnvironments.TEST)
# set the venue for interacting with venue specific services
# if your venue id is a single string, use the following

Please enter your Unity username:  gangl
Please enter your Unity password:  ········


## List Available Data Collections in the Unity System

Data is organized into Collections. Any particular data file will be in at least one Collection.

In [3]:
dataManager = s.client(services.DATA_SERVICE)
collections = dataManager.get_collections()
for c in collections:
    print(c.collection_id)


urn:nasa:unity:unity:test:SBG-AUX___1
urn:nasa:unity:unity:test:my-awesome-collection___1
urn:nasa:unity:unity:test:SBG-L2B_VEGBIOCHEM___1
urn:nasa:unity:unity:test:SBG-L2B_FRCOV___1
urn:nasa:unity:unity:test:SBG-L2A_CORFL___1
urn:nasa:unity:unity:test:SBG-L2A_RSRFL___1
urn:nasa:unity:unity:test:SBG-L2A_RFL___1
urn:nasa:unity:unity:test:SBG-L1B_PRE___1
urn:nasa:unity:uds_local_test:TEST1:NEW_COLLECTION_EXAMPLE_L1B___NGA9
urn:nasa:unity:ssips:TEST1:CHRP_16_DAY_REBIN___NGA3
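
When the listing grows long, it can help to filter the ids on the client side. The following is a minimal sketch that works on plain id strings copied from the output above; `filter_collection_ids` is a hypothetical helper written for this example, not part of `unity_sds_client`.

```python
def filter_collection_ids(collection_ids, substring):
    """Return the ids that contain `substring` (case-insensitive)."""
    needle = substring.lower()
    return [cid for cid in collection_ids if needle in cid.lower()]


# A few ids copied from the listing above.
collection_ids = [
    "urn:nasa:unity:unity:test:SBG-AUX___1",
    "urn:nasa:unity:unity:test:my-awesome-collection___1",
    "urn:nasa:unity:unity:test:SBG-L1B_PRE___1",
    "urn:nasa:unity:ssips:TEST1:CHRP_16_DAY_REBIN___NGA3",
]

sbg_ids = filter_collection_ids(collection_ids, "SBG-")
for cid in sbg_ids:
    print(cid)
```

In the notebook, the input list could be built from the objects returned by `get_collections()`, e.g. `[c.collection_id for c in collections]`.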


## Given a Collection (above), List the Files Within That Collection

Executing this cell retrieves the datasets in the Collection identified by the `collection_id` variable. It then prints each dataset's begin time and id, followed by the data files that belong to it.

To see a different Collection, change `collection_id` to one of the Collections listed in the step above. The cell as written does not set a limit on the number of results.

In [24]:
collection_id = "urn:nasa:unity:unity:test:SBG-AUX___1"
cd = dataManager.get_collection_data(Collection(collection_id))
for dataset in cd:
    print(f'dataset begin time: {dataset.data_begin_time}')
    print(f'dataset name: {dataset.id}' )
    for f in dataset.datafiles:
        print(f)
        #print("	" + f.location + ", roles: " + str(f.roles) + ", type: " + f.type + ", description: " + f.description + ", title: " + f.title)
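
If a Collection holds many files, one option is to cap the output on the client side. The sketch below uses `itertools.islice` from the standard library; `first_n` is a hypothetical helper, and the list of file names is stand-in data rather than real output from the data service.

```python
from itertools import islice


def first_n(items, n):
    """Return at most the first `n` items of any iterable as a list."""
    return list(islice(items, n))


# Stand-in data; in the notebook this would be `dataset.datafiles`.
datafiles = [f"file_{i}.h5" for i in range(250)]

limited = first_n(datafiles, 100)
print(len(limited))
```

This only limits what is printed; it does not change how much data the service returns.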

## Add data files to a collection - Coming soon

There are a number of use cases where a user wants to catalog data files in S3. One might be to share or persist some work. Another might be to upload auxiliary data for use in the processing system instead of bundling it with one's code (e.g., your code needs access to multi-GB climatology data or models). The following commands assume that the data already exists on S3 and that we want to register it in the Unity data catalog.

Data files are added via STAC catalogs. Below we upload several files, create a STAC entry for them, and then request that they be _cataloged_ in the system. Within Unity, the creation/storage of a file and the cataloging of that file are separate events. This may change in the future, but for now it offers some flexibility for transient files.

## Upload Data Products

The following was run on the command line, and it shows the general way to upload a file. We use the collection_id of the Collection we plan to add the files to as part of the S3 path, although that is not strictly necessary.

```
jovyan@jupyter-gangl:/unity/ads/input_collections/SBG_AUX$ wget https://avng.jpl.nasa.gov/pub/PBrodrick/emulator/sRTMnet_v120.h5
2024-02-28 19:35:13 (38.6 MB/s) - ‘sRTMnet_v120.h5’ saved [5801110556/5801110556]

jovyan@jupyter-gangl:/unity/ads/input_collections/SBG_AUX$ wget https://avng.jpl.nasa.gov/pub/PBrodrick/emulator/sRTMnet_v120_aux.npz
2024-02-28 19:35:13 (1.64 MB/s) - ‘sRTMnet_v120_aux.npz’ saved [180804/180804]


jovyan@jupyter-gangl:/unity/ads/input_collections/SBG_AUX$ ls
sRTMnet_v120_aux.npz  sRTMnet_v120.h5

jovyan@jupyter-gangl:/unity/ads/input_collections/SBG_AUX$ aws s3 cp sRTMnet_v120.h5 s3://sps-test-ds-storage/urn:nasa:unity:unity:test:SBG-AUX___1/urn:nasa:unity:unity:test:SBG-AUX___1:sRTMnet_v120.h5/sRTMnet_v120.h5
upload failed: ./sRTMnet_v120.h5 to s3://sps-test-ds-storage/urn:nasa:unity:unity:test:SBG-AUX___1/urn:nasa:unity:unity:test:SBG-AUX___1:sRTMnet_v120.h5/sRTMnet_v120.h5 An error occurred (AccessDenied) when calling the CreateMultipartUpload operation: Access Denied
```

Currently, users do not have permission to upload files to the S3 bucket, so the upload shown below was done by a member of the Unity team.

```
jovyan@jupyter-gangl:/unity/ads/input_collections/SBG_AUX$ aws s3 cp sRTMnet_v120.h5 s3://sps-test-ds-storage/urn:nasa:unity:unity:test:SBG-AUX___1/urn:nasa:unity:unity:test:SBG-AUX___1:sRTMnet_v120.h5/sRTMnet_v120.h5
upload: ./sRTMnet_v120.h5 to s3://sps-test-ds-storage/urn:nasa:unity:unity:test:SBG-AUX___1/urn:nasa:unity:unity:test:SBG-AUX___1:sRTMnet_v120.h5/sRTMnet_v120.h5
jovyan@jupyter-gangl:/unity/ads/input_collections/SBG_AUX$ aws s3 cp sRTMnet_v120_aux.npz s3://sps-test-ds-storage/urn:nasa:unity:unity:test:SBG-AUX___1/urn:nasa:unity:unity:test:SBG-AUX___1:sRTMnet_v120.h5/sRTMnet_v120_aux.npz
upload: ./sRTMnet_v120_aux.npz to s3://sps-test-ds-storage/urn:nasa:unity:unity:test:SBG-AUX___1/urn:nasa:unity:unity:test:SBG-AUX___1:sRTMnet_v120.h5/sRTMnet_v120_aux.npz
jovyan@jupyter-gangl:/unity/ads/input_collections/SBG_AUX$ 
```
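
The destination paths in the commands above appear to follow the pattern `s3://<bucket>/<collection_id>/<dataset_id>/<filename>`. That layout is inferred from this transcript rather than taken from a documented convention, and `build_s3_uri` is a hypothetical helper. A minimal sketch of composing such a path:

```python
def build_s3_uri(bucket, collection_id, dataset_id, filename):
    """Compose an S3 URI using the layout seen in the commands above (assumed)."""
    return f"s3://{bucket}/{collection_id}/{dataset_id}/{filename}"


bucket = "sps-test-ds-storage"
collection_id = "urn:nasa:unity:unity:test:SBG-AUX___1"
# The dataset id used in this tutorial is the collection id plus the main file name.
dataset_id = f"{collection_id}:sRTMnet_v120.h5"

uri = build_s3_uri(bucket, collection_id, dataset_id, "sRTMnet_v120.h5")
print(uri)
```

Building the string in one place makes it harder for the upload path and the `href` recorded in the STAC entry to drift apart.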


In [18]:
output_directory = str(Path("data/SBG-AUX___1").resolve())
collection  = Collection("urn:nasa:unity:unity:test:SBG-AUX___1")

# Create a Dataset for the collection
dataset_name = "urn:nasa:unity:unity:test:SBG-AUX___1:sRTMnet_v120.h5"
# The start/stop times aren't important for these files in our use case, but they might be if we want to version the files.
dataset_start_time = "2023-06-15T01:31:12.467113Z"
dataset_end_time = "2023-06-15T01:36:12.467113Z"
# datetime.utcnow() is deprecated as of Python 3.12; request a timezone-aware UTC time directly.
dataset_create_time = datetime.datetime.now(datetime.timezone.utc).isoformat()
dataset = Dataset(dataset_name, collection.collection_id, dataset_start_time, dataset_end_time, dataset_create_time)

dataset.add_data_file(DataFile("HDF5","s3://sps-test-ds-storage/urn:nasa:unity:unity:test:SBG-AUX___1/urn:nasa:unity:unity:test:SBG-AUX___1:sRTMnet_v120.h5/sRTMnet_v120.h5", ["data"]))
dataset.add_data_file(DataFile(None,"s3://sps-test-ds-storage/urn:nasa:unity:unity:test:SBG-AUX___1/urn:nasa:unity:unity:test:SBG-AUX___1:sRTMnet_v120.h5/sRTMnet_v120_aux.npz", ["metadata"]))
        
# Add the STAC file we are creating
dataset.add_data_file(DataFile("text/json",os.path.join(output_directory, dataset_name + ".json"), ["metadata"]))
collection.add_dataset(dataset)
Collection.to_stac(collection, output_directory)

## Upload the metadata file to S3

Now we must upload the STAC file to the S3 Bucket...

For now, each key under `assets` needs to be the name of the file (not the full path), so here we update the entries manually. Below are the contents of `data/SBG-AUX___1/sRTMnet_v120.h5.json`, adjusted accordingly.
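
The manual re-keying could also be done in code. The following is a minimal sketch, assuming each asset carries an `href` whose last path segment is the file name; `rekey_assets_by_filename` is a hypothetical helper, and the input keys shown here are illustrative rather than actual output of `Collection.to_stac`.

```python
import posixpath


def rekey_assets_by_filename(assets):
    """Return a new dict keyed by the last path segment of each asset's href."""
    return {posixpath.basename(asset["href"]): asset for asset in assets.values()}


prefix = (
    "s3://sps-test-ds-storage/urn:nasa:unity:unity:test:SBG-AUX___1/"
    "urn:nasa:unity:unity:test:SBG-AUX___1:sRTMnet_v120.h5/"
)

# Illustrative input: keys are full paths instead of bare file names.
assets = {
    prefix + "sRTMnet_v120.h5": {"href": prefix + "sRTMnet_v120.h5", "roles": ["data"]},
    prefix + "sRTMnet_v120_aux.npz": {"href": prefix + "sRTMnet_v120_aux.npz", "roles": ["metadata"]},
}

fixed = rekey_assets_by_filename(assets)
print(list(fixed))
```

If two assets shared the same file name, the later one would overwrite the earlier one, so this approach assumes file names are unique within a dataset.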


In [25]:
# The container we will publish
stac_features = {
  "provider_id": "unity",
  "features": [
    {
      "type": "Feature",
      "stac_version": "1.0.0",
      "id": "urn:nasa:unity:unity:test:SBG-AUX___1:sRTMnet_v120.h5",
      "properties": {
        "datetime": "2023-06-15T01:31:12.467113Z",
        "start_datetime": "2023-06-15T01:31:12.467113Z",
        "end_datetime": "2023-06-15T01:36:12.467113Z",
        "created": "2024-02-28T20:00:34.718750+00:00",
        "updated": "2024-02-28T20:00:34.719066Z"
      },
      "geometry": None,
      "links": [
        {
          "rel": "root",
          "href": "./catalog.json",
          "type": "application/json"
        },
        {
          "rel": "parent",
          "href": "./catalog.json",
          "type": "application/json"
        }
      ],
      "assets": {
        "sRTMnet_v120.h5": {
          "href": "s3://sps-test-ds-storage/urn:nasa:unity:unity:test:SBG-AUX___1/urn:nasa:unity:unity:test:SBG-AUX___1:sRTMnet_v120.h5/sRTMnet_v120.h5",
          "title": "HDF5 file",
          "description": "",
          "roles": [
            "data"
          ]
        },
        "sRTMnet_v120_aux.npz": {
          "href": "s3://sps-test-ds-storage/urn:nasa:unity:unity:test:SBG-AUX___1/urn:nasa:unity:unity:test:SBG-AUX___1:sRTMnet_v120.h5/sRTMnet_v120_aux.npz",
          "title": "None file",
          "description": "",
          "roles": [
            "metadata"
          ]
        },
        "urn:nasa:unity:unity:test:SBG-AUX___1:sRTMnet_v120.h5.json": {
          "href": "s3://sps-test-ds-storage/urn:nasa:unity:unity:test:SBG-AUX___1/urn:nasa:unity:unity:test:SBG-AUX___1:sRTMnet_v120.h5/urn:nasa:unity:unity:test:SBG-AUX___1:sRTMnet_v120.h5.json",
          "title": "text/json file",
          "description": "",
          "roles": [
            "metadata"
          ]
        }
      },
      "stac_extensions": [],
      "collection": "urn:nasa:unity:unity:test:SBG-AUX___1"
    }
  ]
}


In [26]:
# Note: `_session` is a private attribute of the client, so this access pattern may change.
token = s._session.get_auth().get_token()
base_url = s._session.get_unity_href()
url = f'{base_url}am-uds-dapa/collections'
header = {
    'Authorization': f'Bearer {token}',
    'Content-Type': 'application/json',
}
# print(url)
# Submit the STAC features for cataloging
response = requests.put(url=url, headers=header, data=json.dumps(stac_features))

In [27]:
print(response)

<Response [202]>


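A 202 (Accepted) status means the request has been accepted for processing, not that processing has finished. We now query the collection again; in time, the dataset should appear.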

In [31]:
collection_id = "urn:nasa:unity:unity:test:SBG-AUX___1"
cd = dataManager.get_collection_data(Collection(collection_id))
for dataset in cd:
    print(f'dataset begin time: {dataset.data_begin_time}')
    print(f'dataset name: {dataset.id}' )
    for f in dataset.datafiles:
        print(f)
        #print("\t" + f.location + ", roles: " + str(f.roles) + ", type: " + f.type + ", description: " + f.description + ", title: " + f.title)

dataset begin time: 2023-06-15T01:31:12.467000Z
dataset name: urn:nasa:unity:unity:test:SBG-AUX___1:sRTMnet_v120.h5
unity_sds_client.resources.DataFile(location=s3://sps-test-ds-storage/urn:nasa:unity:unity:test:SBG-AUX___1/urn:nasa:unity:unity:test:SBG-AUX___1:sRTMnet_v120.h5/urn:nasa:unity:unity:test:SBG-AUX___1:sRTMnet_v120.h5.cmr.xml)
unity_sds_client.resources.DataFile(location=s3://sps-test-ds-storage/urn:nasa:unity:unity:test:SBG-AUX___1/urn:nasa:unity:unity:test:SBG-AUX___1:sRTMnet_v120.h5/sRTMnet_v120_aux.npz)
unity_sds_client.resources.DataFile(location=s3://sps-test-ds-storage/urn:nasa:unity:unity:test:SBG-AUX___1/urn:nasa:unity:unity:test:SBG-AUX___1:sRTMnet_v120.h5/urn:nasa:unity:unity:test:SBG-AUX___1:sRTMnet_v120.h5.json)
unity_sds_client.resources.DataFile(location=s3://sps-test-ds-storage/urn:nasa:unity:unity:test:SBG-AUX___1/urn:nasa:unity:unity:test:SBG-AUX___1:sRTMnet_v120.h5/sRTMnet_v120.h5)
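
Because cataloging happens asynchronously, the query may need to be repeated before the dataset shows up. The following is a minimal polling sketch; `wait_until` is a hypothetical helper, and the check function here is a stand-in that succeeds on its third call instead of contacting the data service.

```python
import time


def wait_until(check, timeout_s=60.0, interval_s=5.0):
    """Call `check()` repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Stand-in check: pretends the dataset becomes visible on the third query.
attempts = {"count": 0}


def dataset_is_visible():
    attempts["count"] += 1
    return attempts["count"] >= 3


found = wait_until(dataset_is_visible, timeout_s=1.0, interval_s=0.01)
print(found, attempts["count"])
```

In the notebook, the check could re-run `dataManager.get_collection_data(Collection(collection_id))` and test whether any returned dataset has the expected `id`. Suitable timeout and interval values depend on how long cataloging takes, which this tutorial does not specify.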
