<img align="left" src = https://project.lsst.org/sites/default/files/Rubin-O-Logo_0.png width=250 style="padding: 10px"> 
<b>Introduction to the LSST data Butler</b> <br>
Last verified to run on 2021-12-08 with LSST Science Pipelines release w_2021_49 <br>
Contact authors: Alex Drlica-Wagner, Melissa Graham <br>
Target audience: All DP0 delegates. <br>
Container Size: medium <br>
Questions welcome at <a href="https://community.lsst.org/c/support/dp0">community.lsst.org/c/support/dp0</a>. <br>
Find DP0 documentation and resources at <a href="https://dp0-1.lsst.io">dp0-1.lsst.io</a>. <br>

**Credit:** This tutorial was originally developed by Alex Drlica-Wagner in the context of the LSST Stack Club. Please consider acknowledging Alex Drlica-Wagner in any publications or software releases that make use of this notebook's contents.

### Learning Objectives

The Butler is the LSST Science Pipelines interface for managing, reading, and writing datasets. The Butler can be used to explore the contents of the DP0.1 data repository and access the DP0.1 data. The current version of the Butler (referred to as "Gen-3") is still under development, and this notebook may be modified in the future. Full Butler documentation can be found [here](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/index.html).

This notebook demonstrates how to:<br>
1. Create an instance of the Butler<br>
2. Explore the DP0.1 data repository<br>
3. Retrieve and plot an image with sources<br>
4. Create an image cutout at a user-specified coordinate<br>
5. Explore and retrieve catalog data from the Butler

### Set Up 

Import general python packages and several packages from the LSST Science Pipelines, including the Butler package and AFW Display, which will be used to display images.
More details and techniques regarding image display with `AFW Display` can be found in the `rubin-dp0` GitHub Organization's [tutorial-notebooks](https://github.com/rubin-dp0/tutorial-notebooks) repository.

In [None]:
# Generic python packages
import warnings
import numpy as np
import pylab as plt

# Set a standard figure size to use
plt.rcParams['figure.figsize'] = (8.0, 8.0)

# LSST Science Pipelines (Stack) packages
import lsst.daf.butler as dafButler
import lsst.afw.display as afwDisplay
import lsst.geom as geom
import lsst.afw.coord as afwCoord
afwDisplay.setDefaultBackend('matplotlib')

In [None]:
# prevent some helpful but ancillary warning messages from printing
#   during some LSST DM Release calls
warnings.simplefilter("ignore", category=UserWarning)

In [None]:
# This should match the verified version listed at the start of the notebook
! echo $IMAGE_DESCRIPTION
! eups list -s lsst_distrib

### 1. Create an instance of the Butler

To create the Butler, we need to provide it with a path to the data set, which is called a "data repository".
Butler repositories have both a database component and a file-like storage component; the latter can be remote (i.e., pointing to an S3 bucket) or local (i.e., pointing to a directory on the local file system), and it contains a configuration file (usually `butler.yaml`) that points to the right database.

S3 (Simple Storage Service) buckets are public cloud storage resources that are similar to file folders. They store objects, each of which consists of data and its descriptive metadata.

For access to the DP0.1 data set, always point to this S3 bucket.

In [None]:
repo = 's3://butler-us-central1-dp01'
butler = dafButler.Butler(repo)
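
To illustrate the remote-versus-local distinction described above, the following sketch inspects a repository string using only the Python standard library. The helper `describe_repo` and the local path are hypothetical and are not part of the Butler API; the Butler handles repository locations internally.

```python
from urllib.parse import urlparse


def describe_repo(repo_uri):
    """Classify a repository string as S3-hosted or local (illustrative helper only)."""
    parsed = urlparse(repo_uri)
    if parsed.scheme == 's3':
        # For S3 URIs the bucket name is stored in the network-location field.
        return f'remote S3 bucket: {parsed.netloc}'
    if parsed.scheme in ('', 'file'):
        return f'local directory: {parsed.path}'
    return f'other scheme: {parsed.scheme}'


print(describe_repo('s3://butler-us-central1-dp01'))
print(describe_repo('/path/to/a/local/repo'))  # hypothetical local path
```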

### 2. Explore the DP0.1 data repository

#### 2.1 Butler registry and collections

The database side of a data repository is called a registry.
The registry contains entries for all data products, and organizes them by _collection_, _dataset type_, and _data ID_.
Use the registry to investigate a repository by listing all collections.

Find more about collections [here](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/organizing.html#collections).

A registry client is part of our butler object:

In [None]:
registry = butler.registry

In [None]:
# Learn more about the registry by uncommenting the following line.
# help(registry)

In [None]:
for c in sorted(registry.queryCollections()):
    print(c)

<br>
Here are some definitions to help delegates understand the contents of the full DP0.1 repository. 

* `2.2i` - refers to the processing run of the LSST DESC DC2 data (the `i` stands for `imSim`)
* `calib` - refers to calibration products that are used for instrument signature removal
* `runs` - refers to processed data products
* `refcats` - refers to the reference catalogs used for astrometric and photometric calibration
* `skymaps` - definitions for the _tract_ and _patch_ grids that coadds are built on

Some collections are nested, and DP0 delegates can access all the data for DC2 Run 2.2i, which is the DP0.1 data set, by selecting the collection `2.2i/runs/DP0.1`.

Expand the pointer recursively to show the full contents of the selected collection.

In [None]:
collection = "2.2i/runs/DP0.1"
print(collection)
for c in sorted(registry.queryCollections(collection, flattenChains=True)):
    print(c, registry.getCollectionType(c))
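
Conceptually, expanding a nested collection recursively amounts to walking a tree until only leaf collections remain. The sketch below is a toy model of that idea in plain Python and is not how the registry is implemented; all collection names in it are hypothetical.

```python
# Toy model of nested collections: a chain lists its children, while a leaf
# collection has no entry in the dictionary. All names here are hypothetical.
toy_chains = {
    'example/chain': ['example/run_a', 'example/inner_chain'],
    'example/inner_chain': ['example/run_b', 'example/run_c'],
}


def flatten(name, chains):
    """Recursively expand a collection name into the leaf collections it points to."""
    children = chains.get(name)
    if children is None:
        return [name]  # not a chain, so it is already a leaf
    flat = []
    for child in children:
        flat.extend(flatten(child, chains))
    return flat


print(flatten('example/chain', toy_chains))
```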

<br>

Create a new Butler instance that specifies the `2.2i/runs/DP0.1` collection, and a new registry for it.

In [None]:
butler = dafButler.Butler(repo, collections=collection)
registry = butler.registry

#### 2.2 Butler DatasetType

The LSST Science Pipelines classify data products as `DatasetTypes`.
To demonstrate how to see the available `DatasetTypes`, the following cell prints them all to screen.

As individual `DatasetTypes` are defined globally and do not belong to a specific collection, the following query returns *all* of the `DatasetTypes` in the repository, not just those in the collection of interest. 

In [None]:
for x in sorted(registry.queryDatasetTypes()):
    print(x)

<br>
Here are definitions for a few of the `DatasetTypes` in the full DP0.1 repository. 

* `calexp` - a single CCD of a processed visit image (PVI; individual calibrated exposures)
* `deepCoadd` - products related to the coadded (stacked) images, including catalogs of coadd sources
* `src` - a catalog of sources
* `skyMap` - geometric representations of the sky coverage

**Which data sets are most appropriate for DP0.1?** <br>
> Most DP0.1 delegates will only be interested in data sets with types `ExposureF` or `SourceCatalog`. 
> For images, stick to the `calexp` (processed visit images, or PVIs) and `deepCoadd` (stacked PVIs).
> For catalogs, the `src` should be used with the `calexp` images, and the `deepCoadd_forced_src` are the most appropriate to be used with the `coadds`.
> More information can be found in the DP0.1 Data Products Definitions Document (DPDD) at [dp0-1.lsst.io](http://dp0-1.lsst.io).

#### 2.3 Butler data IDs

The data ID is a dictionary-like identifier for a data product.
Find more about the data IDs [here](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/dimensions.html#data-ids).

Each `DatasetType` uses a different set of keys in its data ID.
For example, in the `DatasetType` list printed to screen (above), next to `calexp` in curly brackets is listed the band, instrument, detector, physical_filter, visit_system, and visit.
These are the keys of the data ID for a `calexp`, which are also called "dimensions".

In the following cell, the `DatasetRef` is queried for `calexp` data in our collection of interest, and the full data IDs are printed to screen (for just a few examples).
Data IDs can be represented in code as regular Python `dict` objects, but when returned from the `Butler` the `DataCoordinate` class is used instead.

The data ID contains both *implied* and *required* keys.
For example, the value of *band* would be *implied* by the *visit*, because a single visit refers to a single exposure at a single pointing in a single band. 
In the following cell, printing the data ID without specifying `.full` shows only the required keys.
The value of a single key, in this case *band*, can also be printed by specifying the key name.

The following cell will fail and return an error if the query is requesting a `DatasetRef` for data that does not exist.

In [None]:
datasetRefs = registry.queryDatasets(datasetType='calexp', collections=collection)

for i, ref in enumerate(datasetRefs):
    print(ref.dataId.full)
    print(ref.dataId)
    print(ref.dataId['band'])
    print(' ')
    if i > 2:
        break
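
The split between required and implied keys can be pictured with an ordinary Python dictionary. In the sketch below, the choice of `instrument`, `detector`, and `visit` as the required keys for a `calexp` is an assumption made for illustration, and the values are placeholders rather than real DP0.1 entries; compare with the output of the cell above.

```python
# Placeholder values for illustration only; real data IDs come from the registry.
full_data_id = {
    'band': 'g',
    'instrument': 'example_instrument',
    'detector': 0,
    'physical_filter': 'example_filter',
    'visit_system': 1,
    'visit': 123456,
}

# Assumed set of required keys for a calexp (the remaining keys are implied).
required_keys = ('instrument', 'detector', 'visit')

required = {key: full_data_id[key] for key in required_keys}
implied = {key: value for key, value in full_data_id.items()
           if key not in required_keys}

print('required:', required)
print('implied: ', implied)
```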

The Uniform Resource Identifier (URI) is the closest thing to a "filepath to the data" that the Butler provides. 
Note that this URI does not refer to a local path on the filesystem.
There is no need to know exactly where the data live in order to access them - that's the power of the Butler!!

The following cell prints the URI to screen as a demonstration of an alternate way to uniquely identify data in the Butler.

In [None]:
for i, ref in enumerate(datasetRefs):
    print('File URI: ', butler.getURI(ref))
    if i > 2:
        break

#### 2.4 Butler queryDatasets

The cells above demonstrated a very simple use of `queryDatasets`, but additional query terms can also be used, such as band and visit.
When a query term is an equality, it can be specified like `band='g'`. 
When a query term is an inequality, it can be specified with `where`.
Details about Butler queries can be found [here](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/queries.html).


In [None]:
datasetRefs = registry.queryDatasets(datasetType='calexp', band='g',
                                     where='visit > 700000', collections=collection)

for i, ref in enumerate(datasetRefs):
    print(ref.dataId.full)
    if i > 2:
        break
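
The loops above stop after four results by checking the counter and calling `break`. An alternative pattern for taking only the first few items from any iterable is `itertools.islice`. The sketch below uses a stand-in generator with made-up data IDs, so it runs without a Butler; applying the same pattern to query results is left as an assumption to verify.

```python
from itertools import islice


def fake_query_results():
    """Stand-in generator for a lazy query result (hypothetical data)."""
    for visit in range(700001, 700101):
        yield {'visit': visit, 'band': 'g'}


# Take at most four results without evaluating the rest of the generator.
first_four = list(islice(fake_query_results(), 4))
for data_id in first_four:
    print(data_id)
```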

<br>

Each data ID key-value pair is associated with a metadata row called a `DimensionRecord`.
Like dataset types, these exist independent of any collection, but they are also identified by data IDs.

The `queryDimensionRecords` method provides a way to query for these records.
Most of the arguments accepted by `queryDatasets` can be used here (including `where`).

An example of this is provided below:

In [None]:
for dim in ['exposure', 'visit', 'detector']:
    print(list(registry.queryDimensionRecords(dim, where='visit = 971990 and detector=0'))[0])
    print()

Another query method, `queryDataIds`, can be used to query for data IDs independent of any dataset, but it's less useful for general data exploration.

It is also possible to pass `datasets` and `collections` to both `queryDataIds` and `queryDimensionRecords` in order to return records whose data IDs match those of existing datasets.
But this is quite a bit more subtle than searching directly for a dataset, and rarely wanted when exploring a data repository.

More information on all of the query methods can be found [here](https://pipelines.lsst.io/v/weekly/middleware/faq.html#when-should-i-use-each-of-the-query-methods-commands).

#### 2.5 Temporal and spatial queries

The following examples show how to query for data sets that include a desired coordinate and observation date.

##### Temporal queries

Above, we can see that for visit 971990, the (RA,Dec) are (70.37770,-37.1757) and the observation date is 20251201.
But these are just human-readable summaries of the more precise temporal and spatial information stored in the registry, which are represented in Python by `Timespan` and `Region` objects, respectively.
`DimensionRecord` objects that represent spatial or temporal concepts (a `visit` is both) have these objects attached to them.

Retrieve the `DimensionRecord` for a visit and show its timespan and region.

In [None]:
(record,) = registry.queryDimensionRecords('visit', visit=971990)

print(record.timespan)
print(' ')
print(record.region)

If the timespan or spatial region being used as a query constraint is already associated with a data ID in the database, the spatial and temporal overlap constraints are applied automatically.
For example, if we query for `deepCoadd` datasets with a `visit`+`detector` data ID, we'll get just the ones that overlap that observation and have the same band (because a visit implies a band):

In [None]:
for ref in registry.queryDatasets("deepCoadd", visit=971990, detector=50):
    print(ref)

To query for dimension records or datasets that overlap an arbitrary time range, we can use the `bind` argument to pass times through to `where`.
Using `bind` to define an alias for a variable saves us from having to string-format the times into the `where` expression.
Note that a `dafButler.Timespan` will accept a `begin` or `end` value that is equal to `None` if it is unbounded on that side.

Use `bind` and `where`, along with [astropy.time](https://docs.astropy.org/en/stable/time/index.html), to look for visits within one minute of this one on either side.

In [None]:
# import astropy.time
# minute = astropy.time.TimeDelta(60, format="sec")
# timespan = dafButler.Timespan(record.timespan.begin - minute, record.timespan.end + minute)

# for visit in registry.queryDimensionRecords("visit", where="visit.timespan OVERLAPS my_timespan", 
#                                             bind={"my_timespan": timespan}):
#     print(visit.id, visit.timespan, visit.physical_filter)

In [None]:
import astropy.time
minute = astropy.time.TimeDelta(60, format="sec")
timespan = dafButler.Timespan(record.timespan.begin - minute, record.timespan.end + minute)

datasetRefs = registry.queryDatasets("calexp", where="visit.timespan OVERLAPS my_timespan",
                                     bind={"my_timespan": timespan})

for i, ref in enumerate(datasetRefs):
    print(ref)
    if i > 6:
        break
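
The logic behind padding a timespan and testing for overlap can be sketched with the standard library alone. The example below treats each interval as half-open (it includes its start but not its end), which is an assumption made for this illustration; the visit times are hypothetical, and the `overlaps` helper is not a Butler function.

```python
from datetime import datetime, timedelta


def overlaps(begin_a, end_a, begin_b, end_b):
    """Return True if half-open intervals [begin_a, end_a) and [begin_b, end_b) overlap."""
    return begin_a < end_b and begin_b < end_a


# Hypothetical 30-second visit, padded by one minute on each side.
visit_begin = datetime(2025, 12, 1, 3, 0, 0)
visit_end = visit_begin + timedelta(seconds=30)
minute = timedelta(seconds=60)
window = (visit_begin - minute, visit_end + minute)

# A second hypothetical visit starting 45 seconds after the first one ends.
other_begin = visit_end + timedelta(seconds=45)
other_end = other_begin + timedelta(seconds=30)

# The second visit falls inside the padded window ...
in_window = overlaps(*window, other_begin, other_end)
# ... but does not overlap the original, unpadded visit.
direct = overlaps(visit_begin, visit_end, other_begin, other_end)
print(in_window, direct)
```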

##### Spatial queries

Arbitrary spatial queries are not supported at this time, such as the "POINT() IN (REGION)" example found in this [Butler queries](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/queries.html) documentation.
In other words, at this time it is only possible to do queries involving regions that are already "in" the data repository, either because they are HTM pixel regions or because they are tract/patch/visit/visit+detector regions.

Thus, for this example we use the set of dimensions that correspond to different levels of the HTM (hierarchical triangular mesh) pixelization of the sky ([HTM primer](http://www.skyserver.org/htm/)).
The process is to transform a region or point into one or more HTM identifiers (HTM IDs), and then create a query using the HTM ID as the spatial data ID.
The `lsst.sphgeom` library supports region objects and HTM pixelization in the LSST Science Pipelines.

Import the `lsst.sphgeom` package, initialize a sky pixelization to level 10 (the level at which one sky pixel is about five arcmin across), and find the HTM ID for a desired sky coordinate.

In [None]:
import lsst.sphgeom

pixelization = lsst.sphgeom.HtmPixelization(10)

In [None]:
htm_id = pixelization.index(
    lsst.sphgeom.UnitVector3d(
        lsst.sphgeom.LonLat.fromDegrees(70.376995, -37.175736)
    )
)

# Obtain and print the scale to provide a sense of the size of the sky pixelization being used
scale = pixelization.triangle(htm_id).getBoundingCircle().getOpeningAngle().asDegrees()*3600
print(f'HTM ID={htm_id} at level={pixelization.getLevel()} is a ~{scale:0.2}" triangle.')

In [None]:
datasetRefs = registry.queryDatasets("calexp", htm20=htm_id,
                                     where="visit.timespan OVERLAPS my_timespan",
                                     bind={"my_timespan": timespan})

for i, ref in enumerate(datasetRefs):
    print(ref)
    if i > 6:
        break

Thus, with the above query, we have uniquely identified the visit and detector for our desired temporal and spatial constraints.

Note that if a smaller HTM level is used (like 7), which is a larger sky pixel (~2200 arcseconds), the above query will return many more visits and detectors which overlap with that larger region. Try it and see!
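
To get a feel for how the pixel size changes with HTM level, the sketch below estimates a characteristic size from the mean pixel area, using the fact that HTM starts from 8 triangles and splits each one into 4 at every level. This is a rough estimate only; it will differ by a factor of order unity from the bounding-circle scale printed earlier, and `htm_pixel_size_arcsec` is a hypothetical helper, not part of `lsst.sphgeom`.

```python
import math

FULL_SKY_SQ_DEG = 4 * math.pi * (180 / math.pi) ** 2  # about 41253 square degrees


def htm_pixel_size_arcsec(level):
    """Rough linear size of an HTM pixel: square root of its mean area."""
    n_pixels = 8 * 4 ** level  # 8 base triangles, each split into 4 per level
    mean_area_sq_deg = FULL_SKY_SQ_DEG / n_pixels
    return math.sqrt(mean_area_sq_deg) * 3600


for level in (7, 10):
    print(level, round(htm_pixel_size_arcsec(level)), 'arcsec')
```

Each step down in level doubles the linear size, so going from level 10 to level 7 makes the pixel eight times larger on a side.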

Note that queries using the HTM ID can also be used to, e.g., find the set of all i-band `src` catalog data products that overlap this point.

In [None]:
for i, src_ref in enumerate(registry.queryDatasets("src", htm20=htm_id, band="i")):
    print(src_ref)
    if i > 2:
        break

Why does that search take tens of seconds?
The butler's spatial reasoning is designed to work well for regions the size of full data products, like detector- or patch-level images and catalogs, and it's a poor choice for object-scale searches.
The above search is slow in part because `queryDatasets` searches for all `src` datasets that overlap a larger region and then filters the results down to the specified HTM ID pixel.

Options for exploring and retrieving catalog data with the Butler are covered in more depth in Section 5.

### 3. Retrieve and plot an image with sources

At this point, we have all the information we need to ask the Butler to get a specific data product.
In the following example, the visit, detector, and band are used to define the `dataId` and the corresponding `calexp` is retrieved by the Butler.


In [None]:
dataId = {'visit': 703697, 'detector': 80, 'band': 'g'}
calexp = butler.get('calexp', dataId=dataId)

# This will print a warning related to the gen2 to gen3 Butler conversion.
# It is ok to ignore this warning for DP0.1.

Recall that the `calexp` is a calibrated CCD from a single exposure.
Use the afwDisplay interface to show the pixel values and mask plane.

The blue coloring and red streaks seen in the image below are set by the "mask" plane of the `calexp`.
The mask encodes information such as bad or saturated pixels.
In this case, blue indicates a detected source.
See the Image Display and Manipulation tutorial for more about afwDisplay and the mask plane. 

In [None]:
fig = plt.figure()
display = afwDisplay.Display()
display.scale('linear', 'zscale')
display.mtv(calexp)
plt.show()
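
A mask of this kind is commonly stored as an integer per pixel in which each bit flags one condition. The sketch below shows that idea with plain Python integers. The bit assignments are assumptions chosen for illustration; a real mask should be asked for its own plane definitions rather than relying on these numbers.

```python
# Hypothetical bit assignments, for illustration; a real mask defines its own planes.
MASK_PLANES = {'BAD': 0, 'SAT': 1, 'DETECTED': 5}


def planes_set(pixel_value, planes=MASK_PLANES):
    """Return the names of the mask planes whose bit is set in an integer pixel value."""
    return [name for name, bit in planes.items() if pixel_value & (1 << bit)]


# Build a pixel value that is flagged as both saturated and detected.
pixel = (1 << MASK_PLANES['SAT']) | (1 << MASK_PLANES['DETECTED'])
print(pixel, planes_set(pixel))
```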

<br>

To retrieve the catalog of sources detected in this `calexp`, pass the `dataId` to the Butler and request the `src` `datasetType`. 
This performs another query on the registry database.
See Section 5 for more examples of querying the Butler for catalog data.

In [None]:
src = butler.get('src', dataId)

# Define src as a copy in order to manipulate it.
src = src.copy(True)

# If desired, show src as an AstroPy table (nicely formatted).
# src.asAstropy()

Plot the `calexp` with the `src` catalog overlaid.
An investigation of this image is left as an exercise to the user! :)

In [None]:
fig = plt.figure()
afw_display = afwDisplay.Display(1)
afw_display.scale('asinh', 'zscale')
afw_display.mtv(calexp)
plt.gca().axis('off')

# Use display buffering to avoid re-drawing the image after each source is plotted
with afw_display.Buffering():
    for s in src:
        afw_display.dot('+', s.getX(), s.getY(), ctype=afwDisplay.RED)
        afw_display.dot('o', s.getX(), s.getY(), size=20, ctype='orange')

### 4. Create an image cutout at a user-specified coordinate

In this section, the LSST Science Pipelines geometry package (recall `import lsst.geom as geom`) is used to define a function that returns a cutout at a user-specified location.

In [None]:
def cutout_coadd(butler, ra, dec, band='r', datasetType='deepCoadd',
                 skymap=None, cutoutSideLength=51, **kwargs):
    """
    Produce a cutout from a coadd centered on the given (ra, dec) position.

    Adapted from DC2 tutorial notebook by Michael Wood-Vasey.

    Parameters
    ----------
    butler: lsst.daf.butler.Butler
        Client providing access to a data repository
    ra: float
        Right ascension of the center of the cutout, degrees
    dec: float
        Declination of the center of the cutout, degrees
    band: string
        Band (filter) of the image to load
    datasetType: string ['deepCoadd']
        Which type of coadd to load.  Doesn't support 'calexp'
    skymap: lsst.skymap.BaseSkyMap [optional]
        Pass in to avoid the Butler read.  Useful if you have lots of them.
    cutoutSideLength: int [optional]
        Side of the cutout region in pixels.

    Returns
    -------
    lsst.afw.image.ExposureF
    """

    radec = geom.SpherePoint(ra, dec, geom.degrees)
    cutoutSize = geom.ExtentI(cutoutSideLength, cutoutSideLength)

    if skymap is None:
        skymap = butler.get("skyMap")

    # Look up the tract, patch for the RA, Dec
    tractInfo = skymap.findTract(radec)
    patchInfo = tractInfo.findPatch(radec)
    xy = geom.PointI(tractInfo.getWcs().skyToPixel(radec))
    bbox = geom.BoxI(xy - cutoutSize//2, cutoutSize)
    patch = tractInfo.getSequentialPatchIndex(patchInfo)

    coaddId = {'tract': tractInfo.getId(), 'patch': patch, 'band': band}
    parameters = {'bbox':bbox}

    cutout_image = butler.get(datasetType, parameters=parameters, dataId=coaddId)

    return cutout_image

In [None]:
# Use the center of the DC2 region as an example
ra, dec = 55.064, -29.783
cutout_image = cutout_coadd(butler, ra, dec, datasetType='deepCoadd', 
                            cutoutSideLength=201)
print("The size of the cutout is: ", cutout_image.image.array.shape)
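
The bounding-box arithmetic inside `cutout_coadd` (a minimum corner at the center minus half the side length, plus a square extent) can be checked with plain integers. The sketch below mirrors that arithmetic without any LSST classes; `cutout_bounds` and the center pixel are hypothetical.

```python
def cutout_bounds(x_center, y_center, side):
    """Return (x_min, y_min, x_max, y_max), inclusive, for a square pixel cutout."""
    half = side // 2
    x_min, y_min = x_center - half, y_center - half
    return x_min, y_min, x_min + side - 1, y_min + side - 1


# Hypothetical center pixel with the 201-pixel side length used above.
x_min, y_min, x_max, y_max = cutout_bounds(12000, 8000, 201)
shape = (y_max - y_min + 1, x_max - x_min + 1)
print(x_min, y_min, x_max, y_max, shape)
```

With an odd side length the requested pixel sits exactly at the center of the cutout; with an even side length it is offset by half a pixel.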

In [None]:
# Display the image cutout
fig = plt.figure()
afw_display = afwDisplay.Display(1)
afw_display.scale('asinh', 'zscale')
afw_display.mtv(cutout_image.image)
plt.gca().axis('off')

### 5. Exploring and retrieving catalog data from the Butler

The TAP service is the recommended way to retrieve DP0.1 catalog data for a notebook, and there are several other [tutorials](https://github.com/rubin-dp0/tutorial-notebooks) that demonstrate how to use the TAP service.

However, as we saw above, the Butler can also be used to access catalog data. We can investigate the table schema for a specific source catalog by appending `_schema` to the `datasetType`. Note that this does not require you to specify the `dataId`. 

In [None]:
schema_coadd_src = butler.get('deepCoadd_forced_src_schema')
schema_coadd_src.asAstropy()

The table `schema` stores information about the columns stored in the table. Each of the following lines will print the schema to the screen in different ways.

In [None]:
# schema_coadd_src.schema
# schema_coadd_src.schema.getNames()
# schema_coadd_src.schema.getOrderedNames()
print('Number of columns in this table = ', len(schema_coadd_src.schema.getNames()))

Perhaps you want to search for all schema elements that contain the term 'psf'.

In [None]:
# Define an array that is all of the column names
all_names = schema_coadd_src.schema.getOrderedNames()

# Loop over the names and look for the term 'psf'
for i, name in enumerate(all_names):
    if name.find('psf') >= 0:
        print(i, name)
del all_names
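
The same search can be written as a list comprehension, which also makes it easy to ignore case so that names containing `Psf` or `PSF` are matched as well. The sketch below uses a short, hand-written list in place of the full set of schema names, so it runs on its own.

```python
# A short, hand-written list standing in for the full list of schema names.
example_names = [
    'coord_ra',
    'coord_dec',
    'base_SdssShape_psf_xx',
    'base_SdssShape_psf_yy',
    'base_SdssShape_psf_xy',
]

# Keep (index, name) pairs whose lower-cased name contains 'psf'.
psf_names = [(i, name) for i, name in enumerate(example_names)
             if 'psf' in name.lower()]

for i, name in psf_names:
    print(i, name)
```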

Probably you will want to know more about the values in these columns. You can do that by printing the documentation string in the schema.

In [None]:
# Turn the schema into a python dictionary, to be able to call a column by name
schema_dict = schema_coadd_src.schema.extract('*')

# Print the associated docstring for each of the named columns of interest
for name in ['base_SdssShape_psf_xx', 'base_SdssShape_psf_yy', 'base_SdssShape_psf_xy']:
    doc = schema_dict[name].getField().getDoc()
    units = schema_dict[name].getField().getUnits()
    print(name, '[%s]'%units, ' = ', doc)

Refer to the DP0.1 Data Products Definitions Document (DPDD) at [dp0-1.lsst.io](http://dp0-1.lsst.io) to find out more about the columns.

<br>
The full catalogs are very large and it is not feasible to try and retrieve them in their entirety.
Instead, in this example we identify the tract and patch of interest and retrieve only catalog data for a small region of sky.
We use the same ra and dec coordinates as above to find the patch and tract.

In [None]:
radec = geom.SpherePoint(ra, dec, geom.degrees)

skymap = butler.get("skyMap")

tractInfo = skymap.findTract(radec)
tract = tractInfo.getId()

patchInfo = tractInfo.findPatch(radec)
patch = tractInfo.getSequentialPatchIndex(patchInfo)

print(tract, patch)

coaddId = {'tract': tract, 'patch': patch, 'band': 'i'}

coadd_src = butler.get('deepCoadd_forced_src', coaddId)
coadd_src = coadd_src.copy(True)

In [None]:
# Show the table contents if desired
# coadd_src.asAstropy()

<br>

Convert to a Pandas dataframe (see the first tutorial) for easy interaction.
The following cells offer options for printing the column names or the data values.

In [None]:
data = coadd_src.asAstropy().to_pandas()

In [None]:
# print(data.columns)

In [None]:
# for col in data.columns:
#     print(col)

In [None]:
# data['coord_ra'].values

Plot the locations of sources in the patch as well as the ra,dec that we requested. Note that the `coord_ra` and `coord_dec` are in radians, so we need to convert them to degrees.

In [None]:
fig = plt.figure()
plt.plot(np.degrees(data['coord_ra'].values),
         np.degrees(data['coord_dec'].values),
         'o', ms=2, alpha=0.5)
plt.plot(ra, dec, '*', ms=25, mec='k')
plt.xlabel('RA (deg)')
plt.ylabel('Dec (deg)')
plt.title(f'Butler deepCoadd_forced_src objects in tract {tract} patch {patch}')

As we noted, the `coord_ra` and `coord_dec` columns have units of _radians_. As an exercise, you could use what you've learned from above to confirm this by accessing the table schema. (Also note that you can scroll up and find the answer in the outputs from a cell you already executed.) 