# Test discovery and data access using DaCHS and a standalone Datalink service

## Description

This notebook illustrates data discovery via an IVOA Simple Cone Search (SCS) [1], followed by OIDC authentication and SODA processing (cutout), using real data in the datalake.

As part of the discovery process, an `access_url` pointing to an IVOA Datalink service [2] is used to reveal physical data locations to the client. This datalink service talks to Rucio via its API and, in its default mode (`nearest_by_client`), selects the nearest replica by calculating the shortest great-circle distance between the geolocated client IP address and the sites holding replicas. Alternatively, a random replica can be requested by adding the query parameter `sort=random` to the datalink query URL.

Site information (latitude/longitude) is retrieved by calls to a [site-directory](https://gitlab.com/ska-telescope/src/src-site-directory/-/tree/main/src/site_directory) API.
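The `nearest_by_client` selection described above can be sketched as follows. This is only an illustration: it assumes the haversine formula for the great-circle distance, and the site names and coordinates are hypothetical rather than taken from site-directory. The datalink service's own implementation may differ.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_site(client, sites):
    """Return the name of the site closest to the client.

    client: (lat, lon) tuple; sites: dict mapping site name -> (lat, lon).
    """
    return min(sites, key=lambda s: great_circle_km(*client, *sites[s]))


# Hypothetical site coordinates and client position, for illustration only.
sites = {"SITE_A": (37.2, -3.6), "SITE_B": (51.6, -1.3), "SITE_C": (-33.9, 18.4)}
client = (52.2, 0.1)
print(nearest_site(client, sites))
```

Passing `sort=random` would bypass this distance ranking and pick any replica.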

[1] [https://www.ivoa.net/documents/latest/ConeSearch.html](https://www.ivoa.net/documents/latest/ConeSearch.html)
[2] [https://www.ivoa.net/documents/DataLink/](https://www.ivoa.net/documents/DataLink/)

## Prerequisites

1. The entry being searched for must already have a corresponding record in the Rucio metadata database. This record must have valid `s_ra`, `s_dec` and `access_url` fields in the JSON data column, where the `access_url` points to a datalink resource with the DID as the `id` parameter, e.g. `http://rucio_datalink:10000/links?id=testing:PTF10tce.fits`. Note that the datalink service URI must be resolvable from the machine making the query. With this prototype, that machine is probably a Jupyter instance running inside another container on the same Docker network, in which case the host in the URI must be the datalink container's name. If running a separate Jupyter instance outside the Docker network, the datalink service must be made externally accessible (unless running from the host of the Docker network, in which case you can expose the datalink port and use `localhost`).

2. An instance of site-directory must be running in order to use the datalink service in `nearest_by_client` mode.
3. The user must have a valid account on the Rucio datalake.
4. There must be a SODA service entry in site-directory for the RSE hosting the data.
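As a rough illustration of the first prerequisite, the JSON metadata for an entry might be assembled as below. The helper `make_access_url` is hypothetical, the coordinates are illustrative values, and how the record is actually written into the Rucio metadata database is not shown here.

```python
import json
from urllib.parse import quote


def make_access_url(datalink_base, scope, name):
    """Build a datalink access_url carrying the DID (scope:name) as the id parameter."""
    did = "{}:{}".format(scope, name)
    # keep the ':' separating scope and name unescaped, as in the example URL above
    return "{}?id={}".format(datalink_base, quote(did, safe=":"))


# Illustrative values only; the base URL is the container name from the example above.
record = {
    "s_ra": 349.791,
    "s_dec": 9.196,
    "access_url": make_access_url(
        "http://rucio_datalink:10000/links", "testing", "PTF10tce.fits"
    ),
}
print(json.dumps(record, indent=2))
```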



## Query a DaCHS SCS service around some coordinates

In [8]:
from pyvo.dal import conesearch

# Eridanus_full_image (ra: 53.75308, dec: -24.93365); pos and radius are in degrees
results = conesearch("https://ivoa.dachs.srcdev.skao.int/rucio/rucio/cone/scs.xml", pos=(53.75308, -24.93365), radius=1)
# show only the columns of interest
results.to_table()['_r', 'obs_id', 's_ra', 's_dec', 'access_url', 'access_format']

| `_r` (deg) | `obs_id` | `s_ra` (deg) | `s_dec` (deg) | `access_url` | `access_format` |
|---|---|---|---|---|---|
| 0.0 | Eridanus_full_image | 53.75308 | -24.93365 | https://ivoa.datalink.srcdev.skao.int/rucio/links?id=sp3531_soda:2023-07-18-16-40-27_ASK-WALLABY_Eridanus_cutout-574594-imagecube-42178.fits | application/x-votable+xml;content=datalink |
| 0.0 | Eridanus_full_image | 53.75308 | -24.93365 | https://ivoa.datalink.srcdev.skao.int/rucio/links?id=orange:2023-07-10-17-21-37_cutout-574594-imagecube-42178.fits | application/x-votable+xml;content=datalink |
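The `access_format` value tells the client that each `access_url` returns a datalink document rather than the file itself. A minimal sketch of that check is shown below, using plain dictionaries as stand-ins for result rows; the second row and the helper name `is_datalink` are hypothetical.

```python
def is_datalink(access_format):
    """Return True if a MIME type string declares a datalink VOTable."""
    mime, *params = [part.strip().lower() for part in access_format.split(";")]
    return mime == "application/x-votable+xml" and "content=datalink" in params


# Stand-ins for cone search rows; the second row is invented for contrast.
rows = [
    {"obs_id": "Eridanus_full_image",
     "access_format": "application/x-votable+xml;content=datalink"},
    {"obs_id": "direct_file", "access_format": "application/fits"},
]
datalink_rows = [r["obs_id"] for r in rows if is_datalink(r["access_format"])]
print(datalink_rows)
```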


## Get an access url from the Datalink resource

This datalink resource retrieves the list of possible replicas for this DID using Rucio's REST interface, and returns the nearest replica to the client's geolocated IP address.

In [10]:
from pyvo.dal.adhoc import DatalinkResults

# use first result
result = results[0]

# get the datalink access url for this first result
datalink_access_url = result['access_url']

# query the datalink service, stating the client IP address explicitly (otherwise the service
# tries to resolve localhost) and requiring a SODA service at the site hosting the replica
datalink = DatalinkResults.from_result_url("{}&client_ip_address=130.246.210.120&must_include_soda=True".format(datalink_access_url))

# take the link with semantic "#this"
this = next(datalink.bysemantics("#this"))

# get the physical file path (on storage) from this link
access_url = this.access_url

# get the DID from the datalink access url
did = datalink_access_url.split('?id=')[1].split('&')[0]

# split the DID into its scope and name
scope, name = did.split(':')

print(access_url)



https://spsrc14.iaa.csic.es:18027/disk/dev/deterministic/sp3531_soda/9b/98/2023-07-18-16-40-27_ASK-WALLABY_Eridanus_cutout-574594-imagecube-42178.fits
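Splitting the URL on `?id=` works for the URLs returned here, but depends on `id` being the first query parameter. A slightly more defensive sketch using the standard library's `urllib.parse` is given below; the helper names `did_from_datalink_url` and `with_params` are hypothetical, and the example URL is the one from the prerequisites.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def did_from_datalink_url(url):
    """Extract (scope, name) from the id query parameter of a datalink URL."""
    query = parse_qs(urlsplit(url).query)
    did = query["id"][0]
    scope, name = did.split(":", 1)
    return scope, name


def with_params(url, **params):
    """Append extra query parameters to a URL, keeping the existing ones untouched."""
    parts = urlsplit(url)
    extra = urlencode(params)
    query = "&".join(q for q in (parts.query, extra) if q)
    return urlunsplit(parts._replace(query=query))


url = "http://rucio_datalink:10000/links?id=testing:PTF10tce.fits"
print(did_from_datalink_url(url))
print(with_params(url, client_ip_address="130.246.210.120", must_include_soda=True))
print(with_params(url, sort="random"))
```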


## Get an access token for the Rucio "auth" OIDC client

This process follows an interactive OIDC `authorization_code` flow. The resulting token will be used to directly contact the storage endpoint.

In [21]:
import requests

# ask the Rucio auth server for the URL at which the user authenticates
response = requests.get("https://rucio.srcdev.skao.int/auth/oidc")
auth_url = response.headers['X-Rucio-OIDC-Auth-URL']
print("Please go to {}, authenticate and paste the authorisation code below:".format(auth_url))
auth_code = input()
# exchange the pasted authorisation code for a Rucio token
response = requests.get("https://rucio.srcdev.skao.int/auth/oidc_redirect?{}".format(auth_code), headers={"X-Rucio-Client-Fetch-Token": "True"})
access_token = response.headers['X-Rucio-Auth-Token']

Please go to https://rucio.srcdev.skao.int/auth/oidc_redirect?mYEpgscqEPqL2VooE37dYNk, authenticate and paste the authorisation code below:


KeyError: 'x-rucio-auth-token'
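The `KeyError` above is raised because the response carried no `X-Rucio-Auth-Token` header, which is presumably what happens when no valid authorisation code is supplied. A small helper can turn that into a clearer message. The sketch below uses plain dictionaries as stand-ins for `response.headers`, and the helper name `get_header` is hypothetical.

```python
def get_header(headers, name):
    """Case-insensitive header lookup that fails with a descriptive error."""
    lowered = {key.lower(): value for key, value in headers.items()}
    try:
        return lowered[name.lower()]
    except KeyError:
        raise RuntimeError(
            "response has no {} header; was the authorisation code "
            "pasted correctly?".format(name)
        ) from None


# Stand-ins for response.headers, for illustration only.
ok = {"X-Rucio-Auth-Token": "dummy-token"}
failed = {"Content-Type": "text/html"}

print(get_header(ok, "x-rucio-auth-token"))
try:
    get_header(failed, "X-Rucio-Auth-Token")
except RuntimeError as err:
    print(err)
```

In the cell above this would replace the direct lookup, e.g. `access_token = get_header(response.headers, 'X-Rucio-Auth-Token')`.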

## Access the data

In this case, we simply download the data using the access token and access URL retrieved in the previous steps, then display a region of the image.

In [9]:
import os

from astropy.io import fits
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

headers = {'Authorization': 'Bearer {}'.format(access_token)}

response = requests.get(access_url, headers=headers, stream=True)
if response.status_code == 200:
    with open(name, "wb") as f:
        for chunk in response.iter_content(chunk_size=1024):
            f.write(chunk)
            f.flush()
            # report progress after the chunk has been written
            print("{}KB downloaded".format(round(os.path.getsize(name) / 1024)), end='\r')
    print('\n')

    fits.info(name)

    image_data = fits.getdata(name, ext=0)

    plt.figure()
    plt.imshow(image_data[750:1200,1000:1500], cmap='gray', norm=LogNorm(vmin=100, vmax=1000))
    plt.colorbar()
    plt.show()
else:
    print("error getting data: {}".format(response.status_code))



29691KB downloaded

KeyboardInterrupt: 
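For large files, querying the file size after every 1 KB chunk adds overhead. One possible alternative, sketched below, counts the bytes as they are written and reports progress once per megabyte. The function `save_chunks` is hypothetical, and a generator of zero-filled chunks stands in for `response.iter_content(...)`.

```python
import os
import tempfile


def save_chunks(chunks, path, report_every=1024 * 1024):
    """Write an iterable of byte chunks to path and return the total bytes written."""
    written = 0
    next_report = report_every
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            written += len(chunk)
            # print progress only once per report_every bytes
            if written >= next_report:
                print("{}KB downloaded".format(written // 1024), end="\r")
                next_report += report_every
    print("{}KB downloaded".format(written // 1024))
    return written


# Stand-in for response.iter_content(...): three 1 MiB chunks of zeros.
fake_chunks = (bytes(1024 * 1024) for _ in range(3))
target = os.path.join(tempfile.mkdtemp(), "example.fits")
total = save_chunks(fake_chunks, target)
```

In the notebook this would be called as `save_chunks(response.iter_content(chunk_size=1024 * 1024), name)`.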