<img align="left" src = https://project.lsst.org/sites/default/files/Rubin-O-Logo_0.png width=250 style="padding: 10px"> 
<b>Introduction to the LSST data Butler</b> <br>
Contact author(s): Alex Drlica-Wagner, Melissa Graham <br>
Last verified to run: 2022-06-27 <br>
LSST Science Pipelines version: Weekly 2022_22 <br>
Container Size: medium <br>
Targeted learning level: beginner <br>

**Description:** Learn how to query and access data through the Butler.

**Skills:** Discover, query, and retrieve image and catalog data with the Butler.

**LSST Data Products:** Calexp and coadd images.

**Packages:** lsst.daf.butler, lsst.geom, lsst.afw.coord

**Credit:** Elements of this tutorial were originally developed by Alex Drlica-Wagner in the context of the LSST Stack Club. Please consider acknowledging Alex Drlica-Wagner in any publications or software releases that make use of this notebook's contents.

**Get Support:**
Find DP0-related documentation and resources at <a href="https://dp0-2.lsst.io">dp0-2.lsst.io</a>. Questions are welcome as new topics in the <a href="https://community.lsst.org/c/support/dp0">Support - Data Preview 0 Category</a> of the Rubin Community Forum. Rubin staff will respond to all questions posted there.

## 1. Introduction

In the introductory Butler tutorial, we learned how to access DP0 data given a specific data identifier (`dataId`). In this tutorial, we will explore how to use the Butler to find available data sets that match different sets of criteria (i.e., perform spatial and temporal searches). As a reminder, full Butler documentation can be found [here](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/index.html). For this notebook in particular, you might find this set of [Frequently Asked Questions](https://pipelines.lsst.io/middleware/faq.html) to be useful.

This notebook demonstrates how to:<br>
1. Create an instance of the Butler<br>
2. Retrieve images using various query constraints<br>

### 1.1 Package Imports

Import general python packages and several packages from the LSST Science Pipelines, including the Butler package and AFW Display, which will be used to display images.
More details and techniques regarding image display with `AFW Display` can be found in the `rubin-dp0` GitHub Organization's [tutorial-notebooks](https://github.com/rubin-dp0/tutorial-notebooks) repository.

In [None]:
# Generic python packages
import numpy as np
import pylab as plt
import astropy.time
from astropy.io import fits

# LSST Science Pipelines (Stack) packages
import lsst.daf.butler as dafButler
import lsst.afw.display as afwDisplay
import lsst.geom as geom
import lsst.sphgeom
import lsst.afw.coord as afwCoord
from lsst.afw.image import ExposureF
afwDisplay.setDefaultBackend('matplotlib')

# Set a standard figure size to use
plt.rcParams['figure.figsize'] = (8.0, 8.0)

## 2. Create an instance of the Butler

First, we create an instance of the Butler pointing to the DP0.2 data. We do this by specifying the `dp02` configuration and the `2.2i/runs/DP0.2` collection.

In [None]:
# Create an instance of the Butler pointing to the DP0.2 repository
config = 'dp02'
collections = '2.2i/runs/DP0.2'
butler = dafButler.Butler(config=config, collections=collections)
registry = butler.registry

# Note: This will trigger a warning from CFITSIO in w_2022_22.
# This warning can be safely ignored and will be corrected in the future.

More information on the Butler can be found through the help string of our Butler instance, e.g., by running `help(butler)`.

### 2.1 Explore the DP0 data repository

Butler repositories have both a database component and a file-like storage component. The database component can be accessed through the Butler registry, while file-like storage can be local (i.e., pointing to a directory on the local file system) or remote (i.e., pointing to cloud storage resources).
DP0 uses Simple Storage Service (S3) buckets, which are public cloud storage resources similar to file folders. Buckets store objects, each of which consists of data and its descriptive metadata.

The database side of a data repository is called a `registry`.
The registry contains entries for all data products, and organizes them by _collections_, _dataset types_, and _data IDs_.
We can access a registry client directly as part of our Butler object:

In [None]:
registry = butler.registry

In [None]:
# Learn more about the registry by uncommenting the following line.
# help(registry)

Collections are lightweight groups of datasets, such as the set of raw images for a particular instrument, self-consistent calibration datasets, and the outputs of a processing run.
For DP0.2, we use the `2.2i/runs/DP0.2` collection, which we specified when creating our instance of the butler.
However, it is possible to access other collections, which can be queried with `queryCollections`. 
We won't go through these other collections, but you can find out more about collections in the [documentation](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/organizing.html#collections) and on the [FAQ](https://pipelines.lsst.io/middleware/faq.html#querycollections).

### 2.2 Butler data IDs

The data ID is a dictionary-like identifier for a data product (more information can be found [here](https://pipelines.lsst.io/modules/lsst.daf.butler/dimensions.html#data-ids)).
Each `DatasetType` (e.g., `calexp`, `deepCoadd`, `objectTable`) uses a different set of keys in its data ID, which are also called "dimensions".
We can use the registry to access a specific named dataset type and list its dimensions.

In [None]:
# Use the registry to access the calexp datasetType.
# queryDatasetTypes returns a generator, so we need to call `next` to get the first entry.
dt = next(registry.queryDatasetTypes('calexp'))
print("Name:",dt.name)
print("Dimensions:", dt.dimensions)
print("Storage Class:", dt.storageClass)


The data ID contains both *implied* and *required* keys.
For example, the value of `band` is *implied* by the `visit`, because a single visit refers to a single exposure at a single pointing in a single band.

In other tutorial notebooks, we have seen how to access a specific data product using a fully specified data ID. A query for a fully specified data ID should return one unique entry (however, see the [FAQ](https://pipelines.lsst.io/middleware/faq.html#why-do-queries-return-duplicate-results) entry about duplicate results from chained collections).

In [None]:
datasetType = 'calexp'
dataId = {'visit': 192350, 'detector': 175}
datasetRefs = registry.queryDatasets(datasetType, dataId=dataId)

for i, ref in enumerate(datasetRefs):
    print(ref.dataId)
    print(ref.dataId.full)
    print("band:",ref.dataId['band'])

Data IDs can be represented as regular Python `dict` objects, but when they are returned from the `Butler` the `DataCoordinate` class is used instead.
Printing the data ID shows only the required keys, but all keys can be shown by accessing the `.full` attribute (see the [FAQ](https://pipelines.lsst.io/middleware/faq.html#why-are-some-keys-usually-filters-sometimes-missing-from-data-ids)).
The value of a single key, in this case *band*, can also be printed by specifying the key name.

It is also possible to query for all data products that match a partially specified data ID.
For example, in the following cell we use a partially specified data ID to select all the `calexp` data associated with visit=192350.
This search will return a separate entry for each CCD detector that was processed for this visit. 
We'll print information about a few of them.

(The following cell will fail and return an error if the query is requesting a `DatasetRef` for data that does not exist.)

In [None]:
# Partially specified dataId
datasetType = 'calexp'
dataId = {'visit': 192350}
datasetRefs = set(registry.queryDatasets(datasetType, dataId=dataId))

for i, ref in enumerate(datasetRefs):
    print(ref.dataId.full)
    if i > 5:
        print('...')
        break
        
print(f"Found {len(datasetRefs)} detectors")

Note the use of the `set` constructor in the previous cell.
If you remove the `set` call, you will find that `queryDatasets` returns duplicate entries for some detectors.
This is the result of a conscious design choice made in order to handle large results efficiently. In essence, the SQL queries that are being executed "under the hood" are not attempting to remove duplicate entries. You can find a more extensive discussion of duplicate query entries in the Middleware [FAQ](https://pipelines.lsst.io/middleware/faq.html#why-do-queries-return-duplicate-results).
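
A `set` removes duplicates but does not preserve the order in which results were returned. If order matters, one possible alternative is `dict.fromkeys`, sketched below. The tuples are hypothetical stand-ins for `DatasetRef` objects (which are hashable, as their use in a `set` above shows), and their values are invented for illustration.

```python
# Hypothetical (visit, detector) pairs standing in for DatasetRef objects;
# the repeated entries mimic duplicate rows returned by a query.
fake_refs = [(192350, 175), (192350, 12), (192350, 175),
             (192350, 3), (192350, 12)]

# dict.fromkeys keeps the first occurrence of each key in insertion order,
# so duplicates are dropped without shuffling the results.
unique_refs = list(dict.fromkeys(fake_refs))

print(f"{len(fake_refs)} rows -> {len(unique_refs)} unique entries")
print(unique_refs)
```

The same pattern should apply to the output of `queryDatasets`, at the cost of holding all results in memory.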

### 2.3 Finding Resources


One of the beauties of the butler is that there is no need to know exactly where the data live in order to access them. Passing a datasetRef to `Butler.get` will return an instance of the appropriate object.

In [None]:
for i, ref in enumerate(datasetRefs):
    calexp = butler.get(ref)
    if i > 2:
        break

The datasetRef also provides access to a Uniform Resource Identifier (URI) for each data product.
The URI is the closest thing to a "filepath to the data". 
However, for DP0 this URI does not refer to a local path on the filesystem.
Instead, it points to a [cloud storage bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-buckets-s3.html).

In [None]:
for i, ref in enumerate(datasetRefs):
    uri = butler.getURI(ref)
    print('File URI: ', uri)
    if i > 2:
        break

This bucket does not exist on your local filesystem, so you cannot access it directly.

In [None]:
!ls {uri}

However, the URI is actually an object of type `S3ResourcePath`. This allows us to learn various things about the bucket.
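
To build intuition for the network location and path of an S3-style URI, the sketch below splits one with the standard library's `urlparse`. The URI string is hypothetical (real DP0 URIs will differ), and `urlparse` is only an analogy here: it is not how `S3ResourcePath` is implemented.

```python
from urllib.parse import urlparse

# Hypothetical URI, invented for illustration; real DP0 URIs will differ.
example_uri = "s3://example-bucket/some/collection/calexp_example.fits"

parts = urlparse(example_uri)
print("Scheme:", parts.scheme)            # storage protocol
print("Network Location:", parts.netloc)  # bucket name
print("Path:", parts.path)                # object key within the bucket
```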

In [None]:
# Uncomment the following line to get help on the URI
#help(uri)

In [None]:
print("Filesize:",uri.size())
print("Network Location:", uri.netloc)
print("Path:", uri.path)

We can also access the contents of the bucket locally. The `as_local` method will return a context manager that can be used to create a temporary file with a copy of the information contained in the bucket. This temporary file will be deleted afterwards.
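
The lifetime of such a temporary file can be illustrated without the Butler. In the sketch below, `local_copy` is a hypothetical stand-in for `as_local`, written with the standard library only; it is meant to mimic the behavior described above, not to reproduce the actual implementation.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def local_copy(payload: bytes):
    """Hypothetical stand-in for `as_local`: yield the path of a temporary
    file holding `payload`, then delete the file on exit."""
    fd, path = tempfile.mkstemp(suffix=".fits")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
        yield path
    finally:
        # Clean up even if the body of the `with` block raised an error.
        os.remove(path)


with local_copy(b"not a real FITS file") as path:
    exists_inside = os.path.exists(path)

exists_after = os.path.exists(path)
print("Exists inside the block:", exists_inside)
print("Exists after the block:", exists_after)
```

The practical consequence is that anything needed from the file should be read while still inside the `with` block.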

In [None]:
with uri.as_local() as local:
    path = local.ospath
    !ls {path}
    hdulist = fits.open(path)
print("We loaded a:", type(hdulist))
!ls {path}

Since we know that this bucket contains a `calexp`, we can also instantiate an ExposureF object with the temporary file path.

In [None]:
with uri.as_local() as local:
    path = local.ospath
    calexp = ExposureF(path)
print("We loaded a:", type(calexp))

## 3. Querying data sets

### 3.1 Basic Queries

Our example above demonstrated a very simple use of `queryDatasets`, but additional query terms can also be used, such as band and visit.
When a query term is an equality, it can be specified as a keyword argument like `band='i'`.
When a query term is an inequality, it can be specified with `where`.
More details on Butler queries can be found [here](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/queries.html).
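
Inequalities can also be kept out of hand-formatted strings by pairing `where` with the `bind` argument of `queryDatasets`. The sketch below only builds the two arguments; `visit_range_constraint` and the identifiers `first_visit` and `last_visit` are hypothetical names chosen for this illustration, not part of the Butler API.

```python
def visit_range_constraint(first_visit, last_visit):
    """Return a (where, bind) pair selecting first_visit < visit < last_visit.

    Hypothetical helper for illustration; it is not part of the Butler API.
    """
    # The names in the expression are placeholders resolved through `bind`.
    where = "visit > first_visit AND visit < last_visit"
    bind = {"first_visit": int(first_visit), "last_visit": int(last_visit)}
    return where, bind


where, bind = visit_range_constraint(192000, 193000)
print(where)
print(bind)

# With a registry in hand (as created in Section 2), the pair could be used as:
# registry.queryDatasets('calexp', band='i', detector=175,
#                        where=where, bind=bind)
```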

In the following cell, we select all calexps corresponding to i-band observations of a single detector over a range of visits.

In [None]:
datasetRefs = registry.queryDatasets(datasetType='calexp', band='i', detector=175,
                                     where='visit > 192000 and visit < 193000'
                                     )

for i, ref in enumerate(datasetRefs):
    print(ref.dataId.full)

In [None]:
calexp.visitInfo

In [None]:
help(ref)

### 3.2 Temporal queries

The following examples show how to query for data sets that include a desired coordinate and observation date.

Let's start by retrieving a single calexp. We do this by passing a fully specified dataId to the butler.

In [None]:
dataId = {'visit': 971990, 'detector': 175}
#dataId = {'visit': 192350, 'detector': 175}
calexp = butler.get('calexp',dataId=dataId)

To find out information about the visit that this calexp was derived from, we can use the visitInfo member.
In particular, we can access the RA,Dec of the telescope boresight and the observation date.
These are just human-readable summaries of the more precise spatial and temporal information stored in the registry, which is represented in Python by `Region` and `Timespan` objects, respectively.

In [None]:
print(calexp.visitInfo)
print("Telescope Boresight:",calexp.visitInfo.boresightRaDec)
print("Observation DateTime:",calexp.visitInfo.date)

We can also access the region of sky that is covered by this calexp by getting the convex polygon:

In [None]:
calexp.getConvexPolygon()

The registry can use the regions stored for each visit and detector to relate datasets spatially. For example, if we query for `deepCoadd` datasets with a `visit`+`detector` data ID, we'll get just the deepCoadd objects that overlap that observation and have the same band (because a visit implies a band):

In [None]:
for ref in registry.queryDatasets("deepCoadd", dataId=dataId):
    print(ref)

To query for dimension records or datasets that overlap an arbitrary time range, we can use the `bind` argument to pass times through to `where`.
Using `bind` to define an alias for a variable saves us from having to string-format the times into the `where` expression.
Note that a `dafButler.Timespan` will accept a `begin` or `end` value that is equal to `None` if it is unbounded on that side.
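
The overlap logic, including the unbounded `None` case, can be sketched in plain Python. This is a conceptual illustration that assumes half-open `[begin, end)` intervals; `overlaps` is a hypothetical helper, not the Butler's implementation, and the observation time below is invented.

```python
from datetime import datetime, timedelta


def overlaps(a, b):
    """Return True if half-open intervals a and b, each a (begin, end)
    tuple, overlap. None means unbounded on that side."""
    a_begin, a_end = a
    b_begin, b_end = b
    starts_before_b_ends = a_begin is None or b_end is None or a_begin < b_end
    ends_after_b_starts = a_end is None or b_begin is None or b_begin < a_end
    return starts_before_b_ends and ends_after_b_starts


# Hypothetical observation time and 30-second exposure, for illustration only.
obs = datetime(2025, 1, 1, 3, 0, 0)
visit = (obs, obs + timedelta(seconds=30))

window = (obs - timedelta(minutes=10), obs + timedelta(minutes=10))
open_ended = (obs + timedelta(hours=1), None)  # unbounded on the late side

print("Overlaps the 20-minute window:", overlaps(visit, window))
print("Overlaps the open-ended span:", overlaps(visit, open_ended))
```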

Use `bind` and `where`, along with [astropy.time](https://docs.astropy.org/en/stable/time/index.html), to look for visits within +/- 10 minutes of this one.

In [None]:
time = astropy.time.Time(calexp.visitInfo.date.toPython())
minute = astropy.time.TimeDelta(60, format="sec")
timespan = dafButler.Timespan(time - 10*minute, time + 10*minute)

datasetRefs = registry.queryDatasets("calexp",
                                     where="visit.timespan OVERLAPS my_timespan",
                                     bind={"my_timespan": timespan})

for i, ref in enumerate(datasetRefs):
    print(ref)
    if i > 6:
        break

### 3.3 Spatial queries

Arbitrary spatial queries, such as the "POINT() IN (REGION)" example found in the [Butler queries](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/queries.html) documentation, are not supported at this time.
In other words, at this time it is only possible to do queries involving regions that are already "in" the data repository, either because they are HTM pixel regions or because they are tract/patch/visit/visit+detector regions.

Thus, for this example we use the set of dimensions that correspond to different levels of the HTM (hierarchical triangular mesh) pixelization of the sky ([HTM primer](http://www.skyserver.org/htm/)).
The process is to transform a region or point into one or more HTM identifiers (HTM IDs), and then create a query using the HTM ID as the spatial data ID.
The `lsst.sphgeom` library supports region objects and HTM pixelization in the LSST Science Pipelines.
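
To get a feel for how the HTM level sets both the pixel size and the range of valid HTM IDs, the standard-library sketch below uses two properties of the mesh: there are 8 root triangles, and each triangle is split into 4 at every level. The size estimate is a rough side length (about 90 degrees at level 0, halved per level) and will not exactly match values derived from a bounding circle. Both function names are hypothetical.

```python
def htm_id_range(level):
    """Half-open range [first, last) of valid HTM IDs at a given level."""
    n_trixels = 8 * 4**level  # 8 root triangles, each split in 4 per level
    return n_trixels, 2 * n_trixels


def approx_trixel_size_arcsec(level):
    """Rough side length: ~90 degrees at level 0, halved at each level."""
    return 90.0 * 3600.0 / 2**level


for level in (7, 10, 20):
    first, last = htm_id_range(level)
    size = approx_trixel_size_arcsec(level)
    print(f"level {level:2d}: ~{size:10.2f} arcsec, IDs in [{first}, {last})")
```

Because the ID ranges of different levels do not overlap, an HTM ID is only meaningful together with the level at which it was computed.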

Using the `lsst.sphgeom` package (imported in Section 1.1), initialize a sky pixelization to level 10 (the level at which one sky pixel is about five arcmin across), and find the HTM ID for a desired sky coordinate.

In [None]:
pixelization = lsst.sphgeom.HtmPixelization(10)

In [None]:
ra,dec = calexp.visitInfo.boresightRaDec
htm_id = pixelization.index(
    lsst.sphgeom.UnitVector3d(
        lsst.sphgeom.LonLat.fromDegrees(ra.asDegrees(), dec.asDegrees())
    )
)

# Obtain and print the scale to provide a sense of the size of the
# sky pixelization being used
circle = pixelization.triangle(htm_id).getBoundingCircle()
scale = circle.getOpeningAngle().asDegrees()*3600.
level = pixelization.getLevel()
print(f'HTM ID={htm_id} at level={level} is a ~{scale:.0f} arcsec triangle.')

In [None]:
# The htm<level> key must match the level of `pixelization` (10 here).
datasetRefs = registry.queryDatasets("calexp", htm10=htm_id,
                                     where="visit.timespan OVERLAPS my_timespan",
                                     bind={"my_timespan": timespan})

for i, ref in enumerate(datasetRefs):
    print(ref)
    if i > 6:
        break

Thus, with the above query, we have uniquely identified the visit and detector for our desired temporal and spatial constraints.

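Note that if a smaller HTM level is used (like 7), which corresponds to a larger sky pixel (~2200 arcseconds), the above query will return many more visits and detectors that overlap with that larger region. Try it and see, keeping in mind that the `htm<level>` keyword in the query must match the level of the pixelization.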

Note that queries using the HTM ID can also be used to, e.g., find the set of all i-band `src` catalog data products that overlap this point.

In [None]:
for i, src_ref in enumerate(registry.queryDatasets("src", htm10=htm_id,
                                                   band="i")):
    print(src_ref)
    if i > 2:
        break

Why does that search take tens of seconds?
The butler's spatial reasoning is designed to work well for regions the size of full data products, like detector- or patch-level images and catalogs, and it's a poor choice for object-scale searches.
The above search is slow in part because `queryDatasets` searches for all `src` datasets that overlap a larger region and then filters the results down to the specified HTM ID pixel.

For searches of these scales, it is often more efficient to use the TAP service, as demonstrated in later tutorials.