# Working with Data

This tutorial introduces browsing for data in the Unity system. That data will later be used as input to an application that generates new data when you submit a job; job submission is covered in the next tutorial. Run each cell in order (Shift+Enter). The notes indicate when you need to edit code to customize things (e.g., to choose a data collection) versus when running a cell will prompt you for input (e.g., for your username and password).

In [1]:
import json
from IPython.display import JSON

from unity_sds_client.unity import Unity
from unity_sds_client.unity import UnityEnvironments
from unity_sds_client.unity_session import UnitySession
from unity_sds_client.unity_services import UnityServices as services
from unity_sds_client.resources.collection import Collection

In [2]:
# The environment is set to TEST here; change it to prod when appropriate.
s = Unity(UnityEnvironments.TEST)
# set the venue for interacting with venue specific services
# if your venue id is a single string, use the following

Please enter your Unity username:  gangl
Please enter your Unity password:  ········


## List Available Data Collections in the Unity System

Data is organized into Collections. Any particular data file will be in at least one Collection.

In [3]:
dataManager = s.client(services.DATA_SERVICE)
collections = dataManager.get_collections(limit=100)
for c in collections:
    print(c.collection_id)


urn:nasa:unity:unity:test:SBG-L2B_VEGBIOCHEM___1
urn:nasa:unity:unity:test:SBG-L2B_FRCOV___1
urn:nasa:unity:unity:test:SBG-L2A_CORFL___1
urn:nasa:unity:unity:test:SBG-L2A_RSRFL___1
urn:nasa:unity:unity:test:SBG-L2A_RFL___1
urn:nasa:unity:unity:test:SBG-L1B_PRE___1
urn:nasa:unity:uds_local_test:TEST1:NEW_COLLECTION_EXAMPLE_L1B___NGA9
urn:nasa:unity:ssips:TEST1:CHRP_16_DAY_REBIN___NGA3
urn:nasa:unity:ssips:TEST1:NEW_COLLECTION_EXAMPLE_L1B___NGA3
urn:nasa:unity:ssips:TEST1:SIPS_COLLECTION_ALTINOK___6


## List the Files Within a Collection

Executing this cell retrieves the files in the Collection identified by the collection_id variable, restricted to the update-time window given in the filter parameter. It then prints the begin time and id of each dataset, followed by its data files (up to the limit defined in this code block).

To see a different data Collection, change the collection_id variable to one of the other Collections you found in the step above. If you would like to limit your query to something other than 100 results, change the value of the limit parameter.

In [4]:
collection_id = "urn:nasa:unity:unity:test:SBG-L1B_PRE___1"
cd = dataManager.get_collection_data(Collection(collection_id), limit=100, filter="updated >= '2024-02-25T00:00:00Z' and updated <= '2025-02-26T23:59:59Z'")
for dataset in cd:
    print(f'dataset begin time: {dataset.data_begin_time}')
    print(f'dataset id: {dataset.id}')
    for f in dataset.datafiles:
        print(f)
        #print("\t" + f.location + ", roles: " + str(f.roles) + ", type: " + f.type + ", description: " + f.description + ", title: " + f.title)

dataset begin time: 2023-08-10T03:41:01Z
dataset id: urn:nasa:unity:unity:test:SBG-L1B_PRE___1:SISTER_EMIT_L1B_RDN_20230810T034101_001
unity_sds_client.resources.DataFile(location=s3://sps-test-ds-storage/urn:nasa:unity:unity:test:SBG-L1B_PRE___1/urn:nasa:unity:unity:test:SBG-L1B_PRE___1:SISTER_EMIT_L1B_RDN_20230810T034101_001/SISTER_EMIT_L1B_RDN_20230810T034101_001.cmr.xml)
unity_sds_client.resources.DataFile(location=s3://sps-test-ds-storage/urn:nasa:unity:unity:test:SBG-L1B_PRE___1/urn:nasa:unity:unity:test:SBG-L1B_PRE___1:SISTER_EMIT_L1B_RDN_20230810T034101_001/SISTER_EMIT_L1B_RDN_20230810T034101_001.json)
unity_sds_client.resources.DataFile(location=s3://sps-test-ds-storage/urn:nasa:unity:unity:test:SBG-L1B_PRE___1/urn:nasa:unity:unity:test:SBG-L1B_PRE___1:SISTER_EMIT_L1B_RDN_20230810T034101_001/SISTER_EMIT_L1B_RDN_20230810T034101_001.bin)
unity_sds_client.resources.DataFile(location=s3://sps-test-ds-storage/urn:nasa:unity:unity:test:SBG-L1B_PRE___1/urn:nasa:unity:unity:test:SBG-L1B_P
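
The filter argument in the cell above is a plain string that compares updated against two timestamps. As a minimal sketch, assuming the service accepts exactly the syntax shown in that cell, the string can be built from datetime objects instead of being typed by hand. The helper name `updated_between` is hypothetical and not part of unity-py.

```python
from datetime import datetime, timezone


def updated_between(start: datetime, end: datetime) -> str:
    """Build a filter string in the same form as the one used in the cell above.

    Hypothetical helper: both datetimes must be timezone-aware; they are
    converted to UTC and formatted with a trailing 'Z'.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    start_utc = start.astimezone(timezone.utc).strftime(fmt)
    end_utc = end.astimezone(timezone.utc).strftime(fmt)
    return f"updated >= '{start_utc}' and updated <= '{end_utc}'"


time_filter = updated_between(
    datetime(2024, 2, 25, 0, 0, 0, tzinfo=timezone.utc),
    datetime(2025, 2, 26, 23, 59, 59, tzinfo=timezone.utc),
)
print(time_filter)
```

The resulting string could then be passed as `filter=time_filter` in the `get_collection_data` call.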

## Get a Token!

For some operations, it's helpful to get the token that allows you to communicate with the Unity services. This token can be used in curl commands or in other tools outside of the unity-py ecosystem.

In [5]:
token = s._session.get_auth().get_token()
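
Outside of unity-py, the token is typically sent as a Bearer value in the Authorization header, which is also how a later cell in this notebook uses it. The sketch below only constructs a request object with the standard library and does not send anything. The URL and token values are placeholders, and `authorized_request` is a hypothetical helper name.

```python
import urllib.request


def authorized_request(url: str, token: str) -> urllib.request.Request:
    """Return a request object carrying the token as a Bearer header (nothing is sent)."""
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
    )


# Placeholder values for illustration only; substitute your real token and endpoint.
example_request = authorized_request("https://example.invalid/some/endpoint", "EXAMPLE_TOKEN")
print(example_request.get_header("Authorization"))
```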

## Create a Collection

In [6]:
# To create a collection, we are required to set the project and venue to which the collection will belong.
s.set_project("unity")
s.set_venue("test")
dataManager = s.client(services.DATA_SERVICE)

# All collection ids follow the pattern: urn:nasa:unity:{project}:{venue}:{collection_name}___{version}.
collection_id = "urn:nasa:unity:unity:test:my-awesome-collection___1"
dataManager.create_collection(Collection(collection_id))
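
Because collection ids follow a fixed pattern, it can be convenient to build and take them apart programmatically. The following is a minimal sketch based only on the pattern in the comment above and the ids listed earlier, with three underscores separating name and version. `CollectionId`, `build_collection_id`, and `parse_collection_id` are hypothetical names, not part of unity-py.

```python
from typing import NamedTuple

PREFIX = "urn:nasa:unity:"
VERSION_SEPARATOR = "___"  # three underscores, as in the ids listed earlier


class CollectionId(NamedTuple):
    project: str
    venue: str
    name: str
    version: str


def build_collection_id(project: str, venue: str, name: str, version: str) -> str:
    """Assemble an id of the form urn:nasa:unity:{project}:{venue}:{name}___{version}."""
    return f"{PREFIX}{project}:{venue}:{name}{VERSION_SEPARATOR}{version}"


def parse_collection_id(collection_id: str) -> CollectionId:
    """Split a collection id into its parts; raise ValueError if it does not fit the pattern."""
    if not collection_id.startswith(PREFIX):
        raise ValueError(f"not a Unity collection id: {collection_id!r}")
    project, venue, rest = collection_id[len(PREFIX):].split(":", 2)
    # Split on the last separator so underscores inside the name are preserved.
    name, separator, version = rest.rpartition(VERSION_SEPARATOR)
    if not separator:
        raise ValueError(f"missing version separator in: {collection_id!r}")
    return CollectionId(project, venue, name, version)


parsed = parse_collection_id("urn:nasa:unity:unity:test:SBG-L1B_PRE___1")
print(parsed)
print(build_collection_id("unity", "test", "my-awesome-collection", "1"))
```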

## View recently created collection

Collection creation is an asynchronous operation, so there may be a delay between the creation request and the collection showing up in the response.


In [10]:
dataManager = s.client(services.DATA_SERVICE)
collections = dataManager.get_collections()
for c in collections:
    if c.collection_id == collection_id:
        print(c.collection_id)

urn:nasa:unity:unity:test:my-awesome-collection___1
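
Since creation is asynchronous, one way to cope with the delay is to poll the listing a few times before giving up. The sketch below is a generic retry loop, not a unity-py feature: `wait_for_collection` is a hypothetical helper, and the attempt count and delay are arbitrary assumptions. A stand-in listing function is used so the example runs without a Unity session.

```python
import time
from typing import Callable, Iterable


def wait_for_collection(
    list_ids: Callable[[], Iterable[str]],
    target_id: str,
    attempts: int = 5,
    delay_seconds: float = 2.0,
) -> bool:
    """Call list_ids() up to `attempts` times until target_id appears.

    Returns True as soon as the id is found, False if it never shows up.
    """
    for attempt in range(attempts):
        if target_id in set(list_ids()):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False


# Stand-in for the real listing call: the id only appears on the third request.
_new_id = "urn:nasa:unity:unity:test:my-awesome-collection___1"
_responses = iter([[], [], [_new_id]])
found = wait_for_collection(lambda: next(_responses), _new_id, delay_seconds=0.0)
print(found)
```

With the objects from this notebook, `list_ids` could be `lambda: [c.collection_id for c in dataManager.get_collections()]`.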


## Define Custom Metadata Fields

Custom metadata fields can be defined for a given project and venue. The metadata fields can then be used as additional properties in the STAC item file associated with the data. Note that all previously defined custom metadata fields must be included in the call to define_custom_metadata.


In [None]:
# To define custom metadata, we are required to set the project and venue.
s.set_project("unity")
s.set_venue("test")
dataManager = s.client(services.DATA_SERVICE)
dataManager.define_custom_metadata({
  "tag": {
    "type": "keyword"
  },
  "percent_cloud_cover": {
    "type": "double"
  }
})
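
Because every call must include all previously defined fields, it helps to merge the existing definitions with any new ones before calling define_custom_metadata. The sketch below assumes you keep track of the existing definitions yourself, since this notebook does not show how to retrieve them. `merge_metadata_definitions` and the `processing_site` field are hypothetical, and refusing to redefine a field is a design choice of this sketch rather than documented service behavior.

```python
def merge_metadata_definitions(existing: dict, new_fields: dict) -> dict:
    """Return a new dict holding every existing field plus the new ones.

    Raises ValueError if a new field would silently change an existing definition.
    """
    merged = dict(existing)
    for name, definition in new_fields.items():
        if name in merged and merged[name] != definition:
            raise ValueError(f"field {name!r} is already defined as {merged[name]}")
        merged[name] = definition
    return merged


# Definitions already registered by the cell above.
existing_fields = {
    "tag": {"type": "keyword"},
    "percent_cloud_cover": {"type": "double"},
}
# Hypothetical additional field, for illustration only.
all_fields = merge_metadata_definitions(existing_fields, {"processing_site": {"type": "keyword"}})
print(sorted(all_fields))
```

The merged dictionary, rather than only the new field, is what would be passed to `dataManager.define_custom_metadata`.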

## Credential-less data download

When accessing data stores within the **same venue**, you'll be able to access or download data from S3 without credentials.


In [11]:
import sys
!{sys.executable} -m pip install boto3
import boto3



In [12]:
s3 = boto3.client('s3')
#s3://ssips-test-ds-storage-reproc/urn:nasa:unity:ssips:TEST1:CHRP_16_DAY_REBIN___1/urn:nasa:unity:ssips:TEST1:CHRP_16_DAY_REBIN___1:SNDR_tile_2016_s320_N16p50_E120p00_L1_AQ_v1_D_2311021698943223.nc/SNDR_tile_2016_s320_N16p50_E120p00_L1_AQ_v1_D_2311021698943223.nc
s3.download_file('sps-test-ds-storage', 'urn:nasa:unity:unity:test:SBG-L1B_PRE___1/urn:nasa:unity:unity:test:SBG-L1B_PRE___1:SISTER_EMIT_L1B_RDN_20230810T034101_001/SISTER_EMIT_L1B_RDN_20230810T034101_001.hdr', "file.hdr")

It doesn't end there. If you're more comfortable in a terminal, you can also explore the S3 bucket using the AWS CLI:

```
aws s3 ls sps-test-ds-storage
```
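
The download_file call above takes the bucket and the object key as separate arguments, while the file listing earlier prints full s3:// locations. A small helper, sketched here with the standard library only, splits such a location into the two parts. `split_s3_url` is a hypothetical name.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_url(location: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(location)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// location: {location!r}")
    # The key is the path without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


# A location copied from the file listing printed earlier in this notebook.
location = (
    "s3://sps-test-ds-storage/urn:nasa:unity:unity:test:SBG-L1B_PRE___1/"
    "urn:nasa:unity:unity:test:SBG-L1B_PRE___1:SISTER_EMIT_L1B_RDN_20230810T034101_001/"
    "SISTER_EMIT_L1B_RDN_20230810T034101_001.json"
)
bucket, key = split_s3_url(location)
print(bucket)
print(key)
```

The two values can then be passed on, for example as `s3.download_file(bucket, key, "file.json")`.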

## Add data files to a collection - Coming soon

There are a number of use cases where a user wants to catalog data files in S3, for various reasons. One might be to share or persist some work. Another might be to upload auxiliary data for use in the processing system instead of bundling it with one's code (e.g., your code needs access to multi-GB climatology or models). The following commands assume that the data already exists on S3 and that we want to register it in the Unity data catalog.

Data files are added via STAC catalogs. Below we will upload several files, create a STAC entry for them, and then request that they be _cataloged_ in the system. Within Unity, the creation/storage of a file and the cataloging of that file are separate events. This may change in the future, but it currently offers some flexibility for transient files.

In [None]:
# The container we will publish
stac_features = {
  "provider_id": "unity",
  "features": [ ]
}
### This is where we add the STAC item we want to catalog. Note: there MUST be an asset that is a STAC-formatted item with the "metadata" asset role. In the attached file, we have:
# "SISTER_EMIT_L1B_RDN_20230810T034101_001.json": {
#           "href": "s3://sps-test-ds-storage/urn:nasa:unity:unity:test:SBG-L1B_PRE___1/urn:nasa:unity:unity:test:SBG-L1B_PRE___1:SISTER_EMIT_L1B_RDN_20230810T034101_001/SISTER_EMIT_L1B_RDN_20230810T034101_001.json",
#           "title": "text/json file",
#           "description": "",
#           "roles": [
#             "metadata"
#           ]
#         }

# The above file is read by the ingest mechanisms to create entries in the metadata catalog. It's essentially the _same_ file we are sending now.

with open('data/SBG-L1B-PRE/feature.json') as f:
  json_feature = json.load(f)
  stac_features['features'].append(json_feature)
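
Since the ingest requires an asset with the "metadata" role, it can be worth checking a feature before sending it. The sketch below assumes the standard STAC item layout, where assets is a dictionary whose entries may carry a roles list. `metadata_assets` is a hypothetical helper, and the example feature is a made-up stand-in rather than the contents of feature.json.

```python
def metadata_assets(feature: dict) -> list:
    """Return the names of all assets whose roles include 'metadata'."""
    assets = feature.get("assets", {})
    return [
        name
        for name, asset in assets.items()
        if "metadata" in asset.get("roles", [])
    ]


# Made-up feature for illustration; a real one would be loaded from feature.json.
example_feature = {
    "type": "Feature",
    "assets": {
        "example.json": {"href": "s3://example-bucket/example.json", "roles": ["metadata"]},
        "example.bin": {"href": "s3://example-bucket/example.bin", "roles": ["data"]},
    },
}

names = metadata_assets(example_feature)
if not names:
    raise ValueError("feature has no asset with the 'metadata' role")
print(names)
```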


In [None]:
import requests

base_url = s._session.get_unity_href()
url = f'{base_url}am-uds-dapa/collections'
header = {
            'Authorization': f'Bearer {token}',
            'Content-Type': 'application/json',
        }
# print(url)
response = requests.put(url=url, headers=header, data=json.dumps(stac_features))

In [None]:
print(response)

The file should now appear in the directory tree to the left in Jupyter.