<img align="left" src = https://project.lsst.org/sites/default/files/Rubin-O-Logo_0.png width=250 style="padding: 10px"> 
<b>Data Discovery and Query with the Butler</b> <br>
Contact author(s): Alex Drlica-Wagner, Melissa Graham <br>
Last verified to run: 2022-08-16 <br>
LSST Science Pipelines version: Weekly 2022_22 <br>
Container Size: medium <br>
Targeted learning level: intermediate <br>

**Description:** Learn to discover data and apply query constraints with the Butler.

**Skills:** Use the Butler registry, dataIds, and spatial and temporal constraints.

**LSST Data Products:** calexps, deepCoadds, sources

**Packages:** lsst.daf.butler, lsst.sphgeom

**Credit:** Elements of this tutorial were originally developed by Alex Drlica-Wagner in the context of the LSST Stack Club.

**Get Support:**
Find DP0-related documentation and resources at <a href="https://dp0-2.lsst.io">dp0-2.lsst.io</a>. Questions are welcome as new topics in the <a href="https://community.lsst.org/c/support/dp0">Support - Data Preview 0 Category</a> of the Rubin Community Forum. Rubin staff will respond to all questions posted there.

## 1. Introduction

In the introductory Butler tutorial, we learned how to access DP0 data given a specific data identifier (`dataId`). In this tutorial, we will explore how to use the Butler to find available data sets that match different sets of criteria (i.e., perform spatial and temporal searches). As a reminder, full Butler documentation can be found [in the documentation for lsst.daf.butler](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/index.html). For this notebook in particular, you might find this set of [Frequently Asked Questions](https://pipelines.lsst.io/middleware/faq.html) for the LSST Science Pipelines middleware to be useful.

### 1.1 Package Imports

Import general python packages and several packages from the LSST Science Pipelines, including the Butler package and AFW Display, which will be used to display images.
More details and techniques regarding image display with `afwDisplay` can be found in the `rubin-dp0` GitHub Organization's [tutorial-notebooks](https://github.com/rubin-dp0/tutorial-notebooks) repository.

In [None]:
# Generic python packages
import numpy as np
import pylab as plt
import astropy.time
from astropy.io import fits

# LSST Science Pipelines (Stack) packages
import lsst.daf.butler as dafButler
import lsst.afw.display as afwDisplay
import lsst.sphgeom
from lsst.afw.image import ExposureF
afwDisplay.setDefaultBackend('matplotlib')

# Set a standard figure size to use
plt.rcParams['figure.figsize'] = (8.0, 8.0)

### 1.2. Create an instance of the Butler

Create an instance of the Butler pointing to the DP0.2 data by specifying the `dp02` configuration and the `2.2i/runs/DP0.2` collection.

> **Notice:** the following cell will trigger a warning from CFITSIO in w_2022_22. This warning can be safely ignored and will be corrected in the future.

In [None]:
config = 'dp02'
collections = '2.2i/runs/DP0.2'
butler = dafButler.Butler(config=config, collections=collections)

## 2. Explore the DP0 data repository

Butler repositories have both a database component and a file-like storage component.
The database component can be accessed through the Butler registry, while file-like storage can be local (i.e., pointing to a directory on the local file system) or remote (i.e., pointing to cloud storage resources).
DP0 uses Simple Storage Service (S3) buckets, which are public cloud storage resources that are similar to file folders.
The S3 buckets store objects, which consist of data and its descriptive metadata.

### 2.1. The Butler registry

The database side of a data repository is called a `registry`.
The registry contains entries for all data products, and organizes them by _collections_, _dataset types_, and _data IDs_.
We can access a registry client directly as part of our Butler object:

In [None]:
registry = butler.registry

Optional: learn more about the registry by uncommenting the following line.

In [None]:
# help(registry)

#### 2.1.1. queryCollections

Collections are lightweight groups of datasets, such as the set of raw images for a particular instrument, self-consistent calibration datasets, and the outputs of a processing run.
For DP0.2, we use the `2.2i/runs/DP0.2` collection, which we specified when creating our instance of the butler.

It is possible to access other collections, which can be queried with `queryCollections`. 
More about collections can be found in the [lsst.daf.butler documentation](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/organizing.html#collections) and in the middleware [FAQ](https://pipelines.lsst.io/middleware/faq.html#querycollections).

> **Risk reminder:** for DP0 there are no read/write restrictions on the Butler repository.

The above risk means all users can see *everything* in the Butler, including intermediate processing steps, test runs, staff repositories, and other user repositories.
This fact makes the queryCollections functionality less useful for data discovery than it will be in the future, due to the sheer number of Butler collections exposed to users.

Optional: print a giant list of every collection that exists.

In [None]:
# for c in sorted(registry.queryCollections()):
#     print(c)

The only two collections that users need are the DP0.1 and DP0.2 collections.

In [None]:
for c in sorted(registry.queryCollections()):
    if c in ('2.2i/runs/DP0.1', '2.2i/runs/DP0.2'):
        print(c)

Find only a list of the collections created by your friend, e.g., with username 'MelissaGraham'.

In [None]:
for c in sorted(registry.queryCollections()):
    if c.find('MelissaGraham') > -1:
        print(c)
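
The filtering in the cells above is ordinary Python applied to a list of collection names. The following sketch shows the same idea without a Butler. The list `collections_found` and the name `u/MelissaGraham/example-run` are made up for illustration, and `fnmatchcase` is from the Python standard library, not from the Butler.

```python
from fnmatch import fnmatchcase

# Hypothetical collection names standing in for registry.queryCollections() output.
collections_found = [
    "2.2i/runs/DP0.2",
    "u/MelissaGraham/example-run",
    "2.2i/runs/DP0.1",
    "2.2i/calib",
]

# Exact-name selection, as in the DP0.1 / DP0.2 cell above.
wanted = {"2.2i/runs/DP0.1", "2.2i/runs/DP0.2"}
official = [c for c in sorted(collections_found) if c in wanted]

# Substring selection, as in the username cell above.
by_user = [c for c in sorted(collections_found) if "MelissaGraham" in c]

# Shell-style wildcard selection using the standard library.
dp0_runs = sorted(c for c in collections_found if fnmatchcase(c, "2.2i/runs/DP0.*"))

print(official)
print(by_user)
print(dp0_runs)
```

Using `in` on the name is equivalent to the `find(...) > -1` test used above and is a little easier to read.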

#### 2.1.2. queryDatasetTypes

As shown in the introductory Butler notebook, useful DP0.2 datasetTypes for images include, e.g., deepCoadd, calexp, goodSeeingDiff_differenceExp, and for catalogs include, e.g., sourceTable, objectTable, diaObjectTable_tract, and so on.
See the [DP0.2 Data Products Definitions Document](https://dp0-2.lsst.io/data-products-dp0-2/index.html#dp0-2-data-products-definition-document-dpdd) for more details about the DP0.2 data sets.

The queryDatasetTypes function allows users to explore available datasetTypes.

> **Notice:** registry.queryDatasetTypes will report *all* dataset types that have been registered with a data repository, even if there aren’t any datasets of that type actually present.

The queryDatasetTypes function is a useful tool when you know the name of the dataset type already, and want to see how it’s defined (i.e., what kind of dataId it accepts).

Optional: print a giant list of all the available dataset types.

In [None]:
# for dt in sorted(registry.queryDatasetTypes()):
#     print(dt)

List the dataset types whose names end in '\_tract', which indicates that they can be queried by tract.

In [None]:
for dt in sorted(registry.queryDatasetTypes('*_tract')):
    print(dt)

Optional: list all the dataset types associated with deepCoadds.

In [None]:
# for dt in sorted(registry.queryDatasetTypes('deepCoadd*')):
#     print(dt)

Retrieve the datasetType only for the deepCoadd.

In [None]:
dt_deepCoadd = registry.queryDatasetTypes('deepCoadd')

In [None]:
print(dt_deepCoadd)

Optional: Use the help function to learn that the returned value is a generator.

In [None]:
# help(dt_deepCoadd)

A generator is a special kind of iterator.
Its items can be looped over like a list, as shown in the code cells above, but they are produced one at a time.
Generators are used because they do not store all of their contents in memory at once, making them more suitable for really large data sets, like LSST.
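
As a minimal illustration of this behavior in plain Python (no Butler involved), the sketch below defines a small generator function. The helper `iter_names_with_prefix` and the list of names are hypothetical. It shows that items are produced on demand and that a generator is exhausted after one pass.

```python
def iter_names_with_prefix(names, prefix):
    """Yield, one at a time, the names that start with the given prefix."""
    for name in names:
        if name.startswith(prefix):
            yield name


# Hypothetical list of dataset type names, for illustration only.
names = ["deepCoadd", "deepCoadd_calexp", "calexp", "deepCoadd_ref"]

gen = iter_names_with_prefix(names, "deepCoadd")
first = next(gen)     # take only the first match
rest = list(gen)      # consume whatever remains
leftover = list(gen)  # nothing is left after the first full pass

print(first)
print(rest)
print(leftover)
```

If the results are needed more than once, convert the generator to a `list` first and reuse the list.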

Show the information for the deepCoadd datasetType only.

In [None]:
for dt in dt_deepCoadd:
    print(dt)

### 2.2. The Butler dataId

The data ID is a dictionary-like identifier for a data product (more information can be found [here in the lsst.daf.butler documentation](https://pipelines.lsst.io/modules/lsst.daf.butler/dimensions.html#data-ids)).
Each `DatasetType` (e.g., `calexp`, `deepCoadd`, `objectTable`, etc.) uses a different set of keys in its data ID, which are also called "dimensions".

Use the registry to access a specific named dataset type, in this case calexp, and list its dimensions.

As described above, queryDatasetTypes returns a generator, so `next` is used to get the first entry.

In [None]:
dt = next(registry.queryDatasetTypes('calexp'))
print("Name:", dt.name)
print("Dimensions:", dt.dimensions)
print("Storage Class:", dt.storageClass)


The data ID contains both *implied* and *required* keys.
For example, the value of `band` is *implied* by the `visit`, because a single visit refers to a single exposure at a single pointing in a single band.

In other tutorial notebooks, we have seen how to access a specific data product using a fully specified data ID. A query for a fully specified data ID should return one unique entry (however, see the [FAQ](https://pipelines.lsst.io/middleware/faq.html#why-do-queries-return-duplicate-results) entry about duplicate results from chained collections).

In [None]:
datasetType = 'calexp'
dataId = {'visit': 192350, 'detector': 175}
datasetRefs = registry.queryDatasets(datasetType, dataId=dataId)

for i, ref in enumerate(datasetRefs):
    print(ref.dataId)
    print(ref.dataId.full)
    print("band:", ref.dataId['band'])

Data IDs can be represented as regular Python `dict` objects, but when they are returned from the `Butler` the `DataCoordinate` class is used instead.
Printing the data ID shows only the required keys, but all keys can be shown by specifying the `.full` member (see the [FAQ](https://pipelines.lsst.io/middleware/faq.html#why-are-some-keys-usually-filters-sometimes-missing-from-data-ids)).
The value of a single key, in this case *band*, can also be printed by specifying the key name.
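
The distinction between required and implied keys can be pictured with a plain `dict`. This is only a toy model, not the `DataCoordinate` class: every value below is made up, and the split into required and implied keys is an assumption for illustration (the real one is reported by `dt.dimensions`).

```python
# Toy model of a data ID; all values are hypothetical.
REQUIRED_KEYS = ("instrument", "detector", "visit")  # assumed split, for illustration

full_data_id = {
    "instrument": "ExampleCam",
    "detector": 2,
    "visit": 1,
    "band": "i",
    "physical_filter": "example_i",
}


def required_only(data_id, required=REQUIRED_KEYS):
    """Return just the keys needed to identify the dataset uniquely."""
    return {key: data_id[key] for key in required}


def implied_only(data_id, required=REQUIRED_KEYS):
    """Return the keys whose values follow from the required ones."""
    return {key: value for key, value in data_id.items() if key not in required}


print(required_only(full_data_id))
print(implied_only(full_data_id))
```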

It is also possible to query for all data products that match a partially specified data ID.
For example, in the following cell we use a partially specified data ID to select all the `calexp` data associated with visit=192350.
This search will return a separate entry for each CCD detector that was processed for this visit. 
We'll print information about a few of them.

(The following cell will fail and return an error if the query is requesting a `DatasetRef` for data that does not exist.)

In [None]:
datasetType = 'calexp'
dataId = {'visit': 192350}
datasetRefs = set(registry.queryDatasets(datasetType, dataId=dataId))

for i, ref in enumerate(datasetRefs):
    print(ref.dataId.full)
    if i > 2:
        print('...')
        break

print(f"Found {len(list(datasetRefs))} detectors")

Note the use of the built-in `set` function in the previous cell. 
If you remove the `set` call, you will find that `queryDatasets` returns duplicate entries for some detectors. 
This is the result of a conscious design choice in order to handle large results efficiently. In essence, the SQL queries that are being executed "under the hood" are not attempting to remove duplicate entries. You can find a more extensive discussion of duplicate query entries in the Middleware [FAQ](https://pipelines.lsst.io/middleware/faq.html#why-do-queries-return-duplicate-results).
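
A `set` removes duplicates but does not keep the original order. The sketch below, in plain Python, compares it with `dict.fromkeys`, which removes duplicates while keeping first-seen order. The `(visit, detector)` tuples are hypothetical stand-ins for hashable query results.

```python
# Hypothetical (visit, detector) pairs with repeats, standing in for query results.
rows = [(192350, 0), (192350, 1), (192350, 0), (192350, 2), (192350, 1)]

# A set drops duplicates; iteration order is not guaranteed to match the input.
unique_unordered = set(rows)

# dict.fromkeys drops duplicates and preserves the order of first appearance.
unique_ordered = list(dict.fromkeys(rows))

print(len(rows), len(unique_unordered))
print(unique_ordered)
```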

### 2.3. Retrieving data with butler.get()

One of the beauties of the butler is that there is no need to know exactly where the data live in order to access them.
Passing a datasetRef to Butler.get will return an instance of the appropriate object.

Use butler.get to retrieve the calexp, and then get the detector Id from the calexp's properties.

In [None]:
for i, ref in enumerate(datasetRefs):
    calexp = butler.get(ref)
    print(i, ' calexp.detector.getId(): ', calexp.detector.getId())
    if i > 2:
        print('...')
        break

Optional: display the first four calexps retrieved from the butler.

In [None]:
# for i, ref in enumerate(datasetRefs):
#     calexp = butler.get(ref)
#     print(i, ' calexp.detector.getId(): ', calexp.detector.getId())
#     fig = plt.figure()
#     display = afwDisplay.Display(frame=fig)
#     display.scale('asinh', 'zscale')
#     display.mtv(calexp.image)
#     plt.show()
#     if i > 2:
#         print('...')
#         break

#### 2.3.1. Optional: Uniform Resource Identifiers (URIs)

> **Notice:** It is ok for DP0 delegates to skip this part.

As mentioned at the start of Section 2, "Butler repositories have both a database component and a file-like storage component."
The database component is the recommended way for users to interface with the DP0 data via the Butler.
**This subsection provides some optional information** about the file-like storage component.

The datasetRef also provides access to a Uniform Resource Identifier (URI) for each data product.
The URI is the closest thing to a "filepath to the data". 
However, for DP0 this URI does not refer to a local path on the filesystem.
Instead, it points to a [cloud storage bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-buckets-s3.html).

In [None]:
for i, ref in enumerate(datasetRefs):
    uri = butler.getURI(ref)
    print('File URI: ', uri)
    if i > 2:
        print('...')
        break

This bucket does not exist on your local filesystem, so you cannot access it directly.

In [None]:
!ls {uri}

However, the URI is actually an object of type `S3ResourcePath`.
This allows us to learn various things about the bucket.

Optional: get help on the URI.

In [None]:
# help(uri)

Print the filesize, network location, and path. 

In [None]:
print("Filesize:", uri.size())
print("Network Location:", uri.netloc)
print("Path:", uri.path)

We can also access the contents of the bucket locally. The `as_local` method will return a context manager that can be used to create a temporary file with a copy of the information contained in the bucket. This temporary file will be deleted afterwards.

In [None]:
with uri.as_local() as local:
    path = local.ospath
    print(f"!ls {path}: ")
    !ls {path}
    hdulist = fits.open(path)

print("")
print("We loaded a:", type(hdulist))
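
The "temporary copy that is deleted afterwards" pattern can be sketched with the Python standard library alone. The context manager `as_local_copy` below is a hypothetical stand-in, not the `as_local` method itself: it writes some bytes to a temporary file, yields the path, and removes the file on exit.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def as_local_copy(data, suffix=".fits"):
    """Write bytes to a temporary file, yield its path, then delete the file."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield path
    finally:
        os.remove(path)  # clean up even if the body raised an error


with as_local_copy(b"example bytes") as path:
    existed_inside = os.path.exists(path)
    with open(path, "rb") as handle:
        content = handle.read()

exists_after = os.path.exists(path)
print(existed_inside, exists_after)
```

Because the file is gone once the `with` block ends, anything that needs the file contents should be read inside the block.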

Since we know that this bucket contains a `calexp`, we can also instantiate an ExposureF object with the temporary file path.

In [None]:
with uri.as_local() as local:
    path = local.ospath
    calexp = ExposureF(path)
print("We loaded a:", type(calexp))

## 3. Query the DP0 data repository

As TAP is the recommended way to query the catalogs, the following basic, temporal, and spatial query examples are all for images (calexps).

### 3.1. Basic image queries

Our example above demonstrated a very simple use of `queryDatasets`, but additional query terms can also be used, such as band and visit.
When a query term is an equality, it can be specified as a keyword argument like `band='i'`. 
When a query term is an inequality, it can be specified with `where`.
More details on Butler queries can be found [here in the lsst.daf.butler documentation](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/queries.html).
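
To make the two kinds of constraint concrete, here is a toy in-memory filter in plain Python. It is not how the Butler evaluates queries: the `select` helper and the rows are hypothetical, and the range test is a Python callable, whereas the Butler's `where` argument is an expression string.

```python
# Hypothetical rows standing in for calexp data IDs.
rows = [
    {"visit": 191900, "detector": 175, "band": "i"},
    {"visit": 192350, "detector": 175, "band": "i"},
    {"visit": 192400, "detector": 175, "band": "r"},
    {"visit": 192500, "detector": 12, "band": "i"},
]


def select(rows, where=lambda row: True, **equalities):
    """Keep rows that match every key=value equality and pass the where test."""
    return [
        row for row in rows
        if all(row[key] == value for key, value in equalities.items())
        and where(row)
    ]


# Equalities as keyword arguments, the range as a separate condition.
selected = select(rows, band="i", detector=175,
                  where=lambda row: 192000 < row["visit"] < 193000)
print(selected)
```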

In the following cell, we query for a list of all calexps corresponding to i-band observations of a single detector over a range of visits.

In [None]:
datasetRefs = registry.queryDatasets(datasetType='calexp', band='i', detector=175,
                                     where='visit > 192000 and visit < 193000')

for i, ref in enumerate(datasetRefs):
    print(ref.dataId.full)

Optional: use the datasetRefs to retrieve the calexp images, and display the first four.

In [None]:
# for i, ref in enumerate(datasetRefs):
#     calexp = butler.get(ref)
#     fig = plt.figure()
#     display = afwDisplay.Display(frame=fig)
#     display.scale('asinh', 'zscale')
#     display.mtv(calexp.image)
#     plt.show()
#     if i > 2:
#         print('...')
#         break

#### 3.1.1. Optional: retrieving temporal and spatial metadata

As a precursor to doing temporal and spatial queries below, we demonstrate the temporal and spatial metadata that can be retrieved for calexps via the butler.

Retrieving only the metadata (calexp.visitInfo, calexp.bbox, or calexp.wcs) can be faster than retrieving the full calexp and then extracting the metadata from it.

Print the full visitInfo record to screen.

In [None]:
for i, ref in enumerate(datasetRefs):
    visitInfo = butler.get('calexp.visitInfo', dataId=ref.dataId)
    print(visitInfo)

Print information from the visitInfo: the date, exposure time, and the boresight (the pointing coordinates).

In [None]:
for i, ref in enumerate(datasetRefs):
    visitInfo = butler.get('calexp.visitInfo', dataId=ref.dataId)
    print(i, visitInfo.date, visitInfo.exposureTime, visitInfo.boresightRaDec)

Print spatial information from the bounding box (bbox) and world coordinate system (wcs) metadata: the four corners of the image.

In [None]:
for i, ref in enumerate(datasetRefs):
    bbox = butler.get('calexp.bbox', dataId=ref.dataId)
    wcs = butler.get('calexp.wcs', dataId=ref.dataId)
    corners = [wcs.pixelToSky(bbox.beginX, bbox.beginY),
               wcs.pixelToSky(bbox.beginX, bbox.endY),
               wcs.pixelToSky(bbox.endX, bbox.endY),
               wcs.pixelToSky(bbox.endX, bbox.beginY)]
    temp = ''
    for corner in corners:
        temp += '(' + str(np.round(corner.getRa().asDegrees(), 3)) + ',' + \
                str(np.round(corner.getDec().asDegrees(), 3)) + ') '
    print(i, temp)

Optional: view the help documentation for a datasetRef.

In [None]:
# help(ref)

### 3.2. Temporal image queries

The following examples show how to query for data sets that include a desired coordinate and observation date.

Start by retrieving a calexp using a specified dataId, as done above.

In [None]:
dataId = {'visit': 192350, 'detector': 175}
calexp = butler.get('calexp', dataId=dataId)

To query for dimension records or datasets that overlap an arbitrary time range, we can use the `bind` argument to pass times through to `where`.
Using `bind` to define an alias for a variable saves us from having to string-format the times into the `where` expression.
Note that a `dafButler.Timespan` will accept a `begin` or `end` value that is equal to `None` if it is unbounded on that side.
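
What "overlaps" means for time ranges, including a range that is unbounded on one side, can be sketched with the standard library. This is a simplified model, not the `Timespan` class: it assumes half-open intervals `[begin, end)`, represents them as `(begin, end)` tuples, and uses a made-up reference time.

```python
from datetime import datetime, timedelta


def overlaps(a, b):
    """Return True if two half-open (begin, end) intervals overlap.

    A value of None for begin or end means unbounded on that side.
    """
    a_begin, a_end = a
    b_begin, b_end = b
    a_starts_before_b_ends = a_begin is None or b_end is None or a_begin < b_end
    b_starts_before_a_ends = b_begin is None or a_end is None or b_begin < a_end
    return a_starts_before_b_ends and b_starts_before_a_ends


t0 = datetime(2022, 1, 1, 0, 0, 0)  # hypothetical observation time
window = (t0 - timedelta(minutes=10), t0 + timedelta(minutes=10))

inside = (t0 + timedelta(minutes=5), t0 + timedelta(minutes=5, seconds=30))
outside = (t0 + timedelta(minutes=30), t0 + timedelta(minutes=30, seconds=30))
open_ended = (t0 + timedelta(minutes=9), None)
touching = (t0 + timedelta(minutes=10), t0 + timedelta(minutes=11))

for label, span in [("inside", inside), ("outside", outside),
                    ("open_ended", open_ended), ("touching", touching)]:
    print(label, overlaps(window, span))
```

Under the half-open assumption, an interval that begins exactly where the window ends does not count as overlapping.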

Use `bind` and `where`, along with [astropy.time](https://docs.astropy.org/en/stable/time/index.html), to query for calexps that were obtained within +/- 10 minutes of the calexp defined by the dataId above.

In [None]:
time = astropy.time.Time(calexp.visitInfo.date.toPython())
minute = astropy.time.TimeDelta(60, format="sec")
timespan = dafButler.Timespan(time - 10*minute, time + 10*minute)

print(time)
print(minute)
print(timespan)

In [None]:
datasetRefs = registry.queryDatasets("calexp",
                                     where="visit.timespan OVERLAPS my_timespan",
                                     bind={"my_timespan": timespan})

In [None]:
for i, ref in enumerate(datasetRefs):
    print(ref.dataId)
    if i > 6:
        print('...')
        break

print(f"Found {len(list(datasetRefs))} calexps")

How many unique visits were obtained within +/- 10 minutes of the calexp defined above?

In [None]:
temp = []
for i, ref in enumerate(datasetRefs):
    temp.append(ref.dataId['visit'])

unique_visitIds = np.unique(np.sort(np.asarray(temp, dtype='int')))

print('Number of unique visits: ', len(unique_visitIds))
print('visitIds for the unique visits: ', unique_visitIds)

del temp

### 3.3. Spatial image queries

#### 3.3.1. Overlapping images

As a simple first example, we again start with the same calexp as above.

In [None]:
dataId = {'visit': 192350, 'detector': 175}
calexp = butler.get('calexp', dataId=dataId)
print(calexp.visitInfo)
print(calexp.getFilterLabel())

If we query for `deepCoadd` datasets with a `visit`+`detector` data ID, we'll get just the deepCoadd objects that overlap that observation and have the same band (because a visit implies a band).
This is a very simple spatial query for data that overlaps other data.

In [None]:
for ref in registry.queryDatasets("deepCoadd", dataId=dataId):
    print(ref)

#### 3.3.2. User-defined spatial constraints on images

Arbitrary spatial queries are not supported at this time, such as the "POINT() IN (REGION)" example found in this [Butler queries](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/queries.html) documentation.
In other words, at this time it is only possible to do queries involving regions that are already "in" the data repository, either because they are HTM pixel regions or because they are tract/patch/visit/visit+detector regions.

Thus, for this example we use the set of dimensions that correspond to different levels of the HTM (hierarchical triangular mesh) pixelization of the sky ([HTM primer](http://www.skyserver.org/htm/)).
The process is to transform a region or point into one or more HTM identifiers (HTM IDs), and then create a query using the HTM ID as the spatial data ID.
The `lsst.sphgeom` library supports region objects and HTM pixelization in the LSST Science Pipelines.

Using the `lsst.sphgeom` package, initialize a sky pixelization to level 10 (the level at which one sky pixel is about five arcmin across).

In [None]:
pixelization = lsst.sphgeom.HtmPixelization(10)

Find the HTM ID for a desired sky coordinate. Below, the RA and Dec could be user-specified, but here we use the boresight from the calexp retrieved above.

In [None]:
ra, dec = calexp.visitInfo.boresightRaDec
htm_id = pixelization.index(
    lsst.sphgeom.UnitVector3d(
        lsst.sphgeom.LonLat.fromDegrees(ra.asDegrees(), dec.asDegrees())
    )
)

Obtain and print the scale to provide a sense of the size of the sky pixelization being used.

In [None]:
circle = pixelization.triangle(htm_id).getBoundingCircle()
scale = circle.getOpeningAngle().asDegrees()*3600.
level = pixelization.getLevel()
print(f'HTM ID={htm_id} at level={level} is a ~{scale:0.1f} arcsec triangle.')

Pass the htm_id to the queryDatasets command, and also apply the same timespan constraints as above.

In [None]:
datasetRefs = registry.queryDatasets("calexp", htm10=htm_id,
                                     where="visit.timespan OVERLAPS my_timespan",
                                     bind={"my_timespan": timespan})

for i, ref in enumerate(datasetRefs):
    print(ref)
    if i > 6:
        print('...')
        break

print(f"Found {len(list(datasetRefs))} calexps")

Thus, with the above query, we have uniquely identified the visit and detector for our desired temporal and spatial constraints.

Note that if a smaller HTM level is used (like 7), which is a larger sky pixel (~2200 arcseconds), the above query will return many more visits and detectors which overlap with that larger region. Try it and see!
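
The relation between HTM level and pixel size can be estimated with simple arithmetic. The sketch below assumes that each additional level splits every triangle into four, so the linear size roughly halves per level, and it takes the ~2200 arcsecond figure for level 7 quoted above as its reference. The function names are hypothetical, and the sizes are rough estimates, not values computed by `lsst.sphgeom`.

```python
def approx_scale_arcsec(level, ref_level=7, ref_scale_arcsec=2200.0):
    """Estimate the linear pixel size at a level, halving per level from a reference."""
    return ref_scale_arcsec / 2 ** (level - ref_level)


def n_triangles(level):
    """Number of triangles on the sky: 8 at level 0, times 4 for each level."""
    return 8 * 4 ** level


for level in (7, 10):
    scale = approx_scale_arcsec(level)
    print(f"level {level}: ~{scale:.0f} arcsec (~{scale / 60:.1f} arcmin), "
          f"{n_triangles(level)} triangles")
```

With these assumptions, level 10 comes out near 275 arcseconds, consistent with the "about five arcmin" description given for that level.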

### 3.4. Catalog spatial and temporal queries

The recommended method for querying and retrieving catalog data is to use the TAP service, as demonstrated in other tutorials.
However, it is also possible to query catalog data using the same HTM ID and same temporal constraints as used for images, above.

The butler's spatial reasoning is designed to work well for regions the size of full data products, like detector- or patch-level images and catalogs, and it's a poor choice for smaller-scale searches.
The following search is a bit slow in part because `queryDatasets` searches for all `source` datasets that overlap a larger region and then filters the results down to the specified HTM ID pixel.

In [None]:
for i, src_ref in enumerate(registry.queryDatasets("source", htm10=htm_id, band="i")):
    print(src_ref)
    sources = butler.get(src_ref)
    print('Number of sources: ', len(sources))
    if i > 2:
        print('...')
        break

Show the contents of the last source table retrieved from the butler.
Notice that both the rows and the columns of the table are truncated in the display.

In [None]:
sources