<img align="left" src = https://project.lsst.org/sites/default/files/Rubin-O-Logo_0.png width=250> 
<b>Introduction to the LSST data Butler</b> <br>
Last verified to run on <b>TBD</b> with LSST Science Pipelines release <b>TBD</b> <br>
Contact author: Alex Drlica-Wagner <br>
Credit: Originally developed by Alex Drlica-Wagner in the context of the LSST Stack Club <br>
Target audience: All DP0 delegates. <br>
Container Size: medium <br>
Questions welcome at <a href="https://community.lsst.org/c/support/dp0">community.lsst.org/c/support/dp0</a> <br>
Find DP0 documentation and resources at <a href="https://dp0-1.lsst.io">dp0-1.lsst.io</a> <br>

The Butler is the LSST Science Pipelines interface for managing, reading, and writing datasets. The Butler can be used to explore the contents of the DP0.1 data repository and access the DP0.1 data. The current version of the Butler (referred to as "Gen-3") is still under development, and this notebook may be modified in the future. Full Butler documentation can be found [here](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/index.html).

This notebook demonstrates how to:<br>
1. Create an instance of the Butler<br>
2. Explore the DP0.1 data repository<br>
3. Retrieve and plot an image with sources<br>
4. Create an image cutout at a user-specified coordinate<br>
5. Explore and retrieve catalog data from the Butler

In [None]:
%load_ext pycodestyle_magic
%flake8_on

### 0. Setup 

Import general python packages and several packages from the LSST Science Pipelines, including the Butler package and AFW Display, which will be used to display images.
More details and techniques regarding image display with `AFW Display` can be found in the `rubin-dp0` GitHub Organization's [tutorial-notebooks](https://github.com/rubin-dp0/tutorial-notebooks) repository.

In [None]:
# Generic python packages
import numpy as np
import matplotlib.pyplot as plt

# Set a standard figure size to use
plt.rcParams['figure.figsize'] = (8.0, 8.0)

# LSST Science Pipelines (Stack) packages
import lsst.daf.butler as dafButler
import lsst.afw.display as afwDisplay
import lsst.geom as geom
import lsst.afw.coord as afwCoord
afwDisplay.setDefaultBackend('matplotlib')

In [None]:
# This should match the verified version listed at the start of the notebook
! eups list -s lsst_distrib

### 1. Create an instance of the Butler

To create the Butler, we need to provide it with a path to the data set, which is called a "data repository".
Butler repositories can be remote (i.e., pointing to an S3 bucket) or local (i.e., pointing to a directory on the local file system).

S3 (Simple Storage Service) buckets are public cloud storage resources that are similar to file folders. They store objects, each of which consists of data and its descriptive metadata.
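
The difference between a remote and a local repository shows up in the form of the repository string itself. The following is a minimal sketch using only the Python standard library; `is_remote_repo` is a hypothetical helper (not part of the Butler) and `/some/local/repo` is a made-up path.

```python
from urllib.parse import urlparse


def is_remote_repo(repo):
    """Return True if the repository string has a non-file URI scheme."""
    scheme = urlparse(repo).scheme
    return scheme not in ("", "file")


# An S3 bucket URI carries the "s3" scheme; a plain directory path has none.
print(is_remote_repo("s3://butler-us-central1-dp01"))
print(is_remote_repo("/some/local/repo"))
```

The Butler performs its own handling of repository locations, so a check like this is only useful for understanding what kind of string is being passed in.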

For access to the DP0.1 data set, always point to this S3 bucket.

In [None]:
repo = 's3://butler-us-central1-dp01'
butler = dafButler.Butler(repo)

### 2. Explore the DP0.1 data repository

#### 2.1 Butler registry and collections

The registry is a database containing information about available data products.
The registry helps the user to examine what collections of data products exist.
Use the registry to investigate a repository by listing all collections.

Find more about the registry schema [here](https://dmtn-073.lsst.io/).

Find more about collections [here](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/organizing.html#collections).

Access the registry for the DP0.1 data set through the Butler.

In [None]:
registry = butler.registry

In [None]:
# Learn more about the registry by uncommenting the following line.
# help(registry)

In [None]:
for c in sorted(registry.queryCollections()):
    print(c)

<br>
Here are some definitions to help delegates understand the contents of the full DP0.1 repository. 

* `2.2i` - refers to the processing run of the LSST DESC DC2 data (the `i` stands for `imSim`)
* `calib` - refers to calibration products that are used for instrument signature removal
* `runs` - refers to processed data products
* `refcats` - refers to the reference catalogs used for astrometric and photometric calibration
* `skymaps` - are the geometric representations of the sky coverage

Collections are nested, and DP0 delegates can access all the data for DC2 Run 2.2i, which is the DP0.1 data set, by selecting the collection `2.2i/runs/DP0.1`.

Expand the pointer recursively to show the full contents of the selected collection.
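
The idea of recursively expanding a nested collection can be illustrated with a toy model in plain Python. This sketch does not use the Butler, and the collection names in it are hypothetical; it only shows what "flattening" a chain of collections means.

```python
# Hypothetical toy model: a chained collection maps to its children.
# Anything absent from the mapping is treated as a leaf collection.
chains = {
    "example/top": ["example/processed", "example/calib"],
    "example/processed": ["example/run_a", "example/run_b"],
}


def flatten(name, chains):
    """Recursively expand a chained collection into its leaf collections."""
    children = chains.get(name)
    if children is None:
        return [name]
    leaves = []
    for child in children:
        leaves.extend(flatten(child, chains))
    return leaves


print(flatten("example/top", chains))
```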

In [None]:
collection = "2.2i/runs/DP0.1"
print(collection)
for c in sorted(registry.queryCollections(collection, flattenChains=True)):
    print(c, registry.getCollectionType(c))

<br>

Create a new Butler instance that specifies the `2.2i/runs/DP0.1` collection, and retrieve its registry.

In [None]:
butler = dafButler.Butler(repo, collections=collection)
registry = butler.registry

#### 2.2 Butler DatasetType

The LSST Science Pipelines classify data products as `DatasetTypes`.
To demonstrate how to see the available `DatasetTypes`, the following cell prints them all to screen.

As individual `DatasetTypes` are defined globally and do not belong to a specific collection, the following query returns *all* `DatasetTypes` in the repository, not just those in the collection of interest.

In [None]:
for x in sorted(registry.queryDatasetTypes()):
    print(x)

<br>
Here are some definitions to help delegates understand the `DatasetTypes` listed above.

* `calexp` - a single CCD of a processed visit image (PVI; individual calibrated exposures)
* `deepCoadd` - products related to the coadded (stacked) images, including catalogs of coadd sources
* `src` - a catalog of sources
* `skyMap` - geometric representations of the sky coverage

**Which data sets are most appropriate for DP0.1?** <br>
> Most DP0.1 delegates will only be interested in data sets with types `ExposureF` or `SourceCatalog`. 
> For images, stick to the `calexp` (processed visit images, or PVIs) and `deepCoadd` (stacked PVIs).
> For catalogs, the `src` catalogs should be used with the `calexp` images, and the `deepCoadd_forced_src` catalogs are the most appropriate to use with the `deepCoadd` images.
> More information can be found in the DP0.1 Data Products Definitions Document (DPDD) at [dp0-1.lsst.io](http://dp0-1.lsst.io).

#### 2.3 Butler dataId

The `dataId` (data identifier) is how specific data within a data set is accessed. Find more about the `dataId` [here](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/dimensions.html#data-ids).

Each `DatasetType` uses a different set of keys as the `dataId`.
For example, in the `DatasetType` list printed to screen (above), next to `calexp` in curly brackets is listed the band, instrument, detector, physical_filter, visit_system, and visit. These are the keys of the `dataId` for a `calexp`.

In the following cell, the registry is queried for the `DatasetRefs` of `calexp` data in our collection of interest, and the full `dataId` is printed to screen for just a few examples.

The `dataId` contains both *implied* and *required* keys. For example, the value of *band* would be *implied* by the *visit*, because a single visit refers to a single exposure at a single pointing in a single band. 
In the following cell, printing the `dataId` without specifying `.full` shows only the required keys.
The value of a single key, in this case *band*, can also be printed by specifying the key name.

The following cell will fail and return an error if the query is requesting a `DatasetRef` for data that does not exist.
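
The distinction between required and implied keys can be sketched with a plain-Python toy that does not touch the Butler. Here `VISIT_BAND` is a hypothetical lookup table standing in for the registry, and `expand` is a hypothetical helper; the visit, detector, and band values are the ones used later in this notebook.

```python
# Toy lookup table standing in for the registry: each visit fixes one band.
VISIT_BAND = {703697: "g"}

# Only the required keys are supplied by the user.
required = {"visit": 703697, "detector": 80}


def expand(data_id):
    """Return a copy of data_id with the implied 'band' key filled in."""
    full = dict(data_id)
    full["band"] = VISIT_BAND[data_id["visit"]]
    return full


full = expand(required)
print(full)
```

Because the band follows from the visit, it never needs to be given explicitly to identify the data.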

In [None]:
datasetRefs = registry.queryDatasets(datasetType='calexp', collections=collection)

for i, ref in enumerate(datasetRefs):
    print(ref.dataId.full)
    print(ref.dataId)
    print(ref.dataId['band'])
    print(' ')
    if i > 2:
        break

The Uniform Resource Identifier (URI) is the closest thing to a "filepath to the data" that the Butler provides. 
Note that this URI does not refer to a local path on the filesystem.
There is no need to know exactly where the data live in order to access it - that's the power of the Butler!!

The following cell prints the URI to screen as a demonstration of an alternate way to uniquely identify data in the Butler.

In [None]:
for i, ref in enumerate(datasetRefs):
    print('File URI: ', butler.getURI(ref))
    if i > 2:
        break

#### 2.4 Butler queryDatasets

The cells above demonstrated a very simple use of `queryDatasets`, but additional query terms can also be used, such as band and visit.
When a query term is an equality, it can be specified like `band='g'`. 
When a query term is an inequality, it can be specified with `where`.
Details about Butler queries can be found [here](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/queries.html).


In [None]:
datasetRefs = registry.queryDatasets(datasetType='calexp', band='g',
                                     where='visit > 700000', collections=collection)

for i, ref in enumerate(datasetRefs):
    print(ref.dataId.full)
    if i > 2:
        break

<br>

The `dataId` can be retrieved directly by using `queryDataIds` instead of `queryDatasets`, as in the following two examples.
Note the flexibility in the use of the query keys and the where statement.
Also note that both the `calexp` and `src` data sets can be found by the registry, but this will not always necessarily be the case.
Queries for non-existent data will cause an error to be returned.

In [None]:
dataIds = registry.queryDataIds(["visit", "detector", "band"], datasets=["calexp"],
                                where='visit = 703697', collections=collection)
for i, dataId in enumerate(dataIds):
    print(dataId.full)
    if i > 2:
        break

In [None]:
dataIds = registry.queryDataIds(["visit", "detector"], datasets=["src"],
                                where="band='g' and detector=0 and visit > 700000",
                                collections=collection)
for i, dataId in enumerate(dataIds):
    print(dataId.full)
    if i > 2:
        break

<br>

The `queryDimensionRecords` method returns the metadata records stored for a dimension such as `exposure`, `visit`, or `detector`, and accepts the same kind of `where` constraints as the queries above.

An example of this is provided below:

In [None]:
for dim in ['exposure', 'visit', 'detector']:
    print(list(registry.queryDimensionRecords(dim, where='visit = 971990 and detector=0'))[0])
    print()

<br>

The following examples show how to query for data sets that include a desired coordinate and observation date.

Above, we can see that for visit 971990, the (RA,Dec) are (70.37770,-37.1757) and the observation date is 20251201. The following example uses the RA,Dec and date to retrieve the visit.

In [None]:
ra = 70.37770
dec = -37.1757
s1 = "exposure.day_obs = 20251201"
s2 = "exposure.tracking_ra > "+str(ra-1.0)
s3 = "exposure.tracking_ra < "+str(ra+1.0)
s4 = "exposure.tracking_dec > "+str(dec-1.0)
s5 = "exposure.tracking_dec < "+str(dec+1.0)

results = registry.queryDimensionRecords('visit',
                                         where=s1+" AND "+s2+" AND "+s3+" AND "+s4+" AND "+s5,
                                         collections=collection)

# Use expandDataId to fill in the implicit dataId keys with values
for i, ref in enumerate(results):
    tempId = butler.registry.expandDataId(ref.dataId)
    print(tempId.full)
    if i > 10:
        break

Above, our query terms were not sufficiently unique to return only visit 971990, because there were other images of that sky location obtained on that date. **IS IT WEIRD THEY ARE ALL Z BAND?**
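
The string concatenation used to build the `where` clause above can be wrapped in a small function. This is a minimal sketch, not a Butler feature: `box_where` is a hypothetical helper, and it reproduces the same simple box in RA and Dec (it makes no correction for declination or for RA wrap-around).

```python
def box_where(ra, dec, half_width=1.0, day_obs=None):
    """Build a `where` string for a simple RA/Dec box, all in degrees."""
    clauses = []
    if day_obs is not None:
        clauses.append(f"exposure.day_obs = {day_obs}")
    clauses += [
        f"exposure.tracking_ra > {ra - half_width}",
        f"exposure.tracking_ra < {ra + half_width}",
        f"exposure.tracking_dec > {dec - half_width}",
        f"exposure.tracking_dec < {dec + half_width}",
    ]
    return " AND ".join(clauses)


where = box_where(70.37770, -37.1757, day_obs=20251201)
print(where)
```

The resulting string could then be passed as the `where` argument of a registry query in place of the hand-built expression.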

<br>

**TO BE ADDED:**

* use of regions instead of the above kludge with RA and Dec within a degree
* how to figure out which detector the coordinates are in, instead of matching to exposure center
* how to use timespan, to be more specific about time instead of just date

### 3. Retrieve and plot an image with sources

At this point, we have all the information we need to ask the Butler to get a specific data product.
In the following example, the visit, detector, and band are used to define the `dataId` and the corresponding `calexp` is retrieved by the Butler.


In [None]:
dataId = {'visit': 703697, 'detector': 80, 'band': 'g'}
calexp = butler.get('calexp', dataId=dataId)

# This will print a warning related to the gen2 to gen3 Butler conversion.
# It is ok to ignore this warning for DP0.1.

Recall that the `calexp` is a calibrated CCD from a single exposure.
Use the afwDisplay interface to show the pixel values and mask plane.

The blue coloring and red streaks seen in the image below are set by the "mask" plane of the `calexp`.
The mask encodes information such as bad or saturated pixels.
In this case, blue indicates a detected source.
See the Image Display and Manipulation tutorial for more about afwDisplay and the mask plane. 
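
A mask plane stores one bit per flag in each pixel, so several flags can be set on the same pixel. The sketch below illustrates that idea with NumPy only. The plane names and bit numbers in `PLANES` are hypothetical and chosen for illustration; a real `calexp` defines its own mask planes.

```python
import numpy as np

# Hypothetical bit assignments, for illustration only.
PLANES = {"BAD": 0, "SAT": 1, "DETECTED": 5}

# A tiny 3x3 mask with two flagged pixels.
mask = np.zeros((3, 3), dtype=np.int32)
mask[1, 1] |= 1 << PLANES["DETECTED"]
mask[0, 2] |= (1 << PLANES["SAT"]) | (1 << PLANES["DETECTED"])


def has_plane(mask, name):
    """Boolean array marking pixels where the named plane's bit is set."""
    return (mask & (1 << PLANES[name])) != 0


print(has_plane(mask, "DETECTED").sum())
print(has_plane(mask, "SAT").sum())
```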

In [None]:
fig = plt.figure()
display = afwDisplay.Display()
display.scale('linear', 'zscale')
display.mtv(calexp)
plt.show()

<br>

To retrieve the catalog of sources detected in this `calexp`, pass the `dataId` to the Butler and request the `src` `datasetType`. 
This performs another query on the registry database.
See Section 5 for more examples of querying the Butler for catalog data.

In [None]:
src = butler.get('src', dataId)

# Define src as a copy in order to manipulate it.
src = src.copy(True)

# If desired, show src as an AstroPy table (nicely formatted).
# src.asAstropy()

Plot the `calexp` with the `src` catalog overlaid.
An investigation of this image is left as an exercise to the user! :)

In [None]:
fig = plt.figure()
afw_display = afwDisplay.Display(1)
afw_display.scale('asinh', 'zscale')
afw_display.mtv(calexp)
plt.gca().axis('off')

# Use display buffering to avoid re-drawing the image after each source is plotted
with afw_display.Buffering():
    for s in src:
        afw_display.dot('+', s.getX(), s.getY(), ctype=afwDisplay.RED)
        afw_display.dot('o', s.getX(), s.getY(), size=20, ctype='orange')

### 4. Create an image cutout at a user-specified coordinate

In this section, the LSST Science Pipelines geometry (recall "import lsst.geom as geom") and coordinate ("import lsst.afw.coord as afwCoord") packages are used to define a function which returns a cutout at a defined location.
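
The function that follows builds its bounding box from a center pixel and a side length. The integer arithmetic behind that step can be sketched in plain Python; `cutout_box` is a hypothetical helper that mirrors the "corner = center - side // 2" logic, and the pixel values are made up.

```python
def cutout_box(center_x, center_y, side):
    """Return (min_x, min_y, max_x, max_y) of a side x side pixel box."""
    min_x = center_x - side // 2
    min_y = center_y - side // 2
    # The maximum pixel is inclusive, hence the "- 1".
    return min_x, min_y, min_x + side - 1, min_y + side - 1


print(cutout_box(1000, 2000, 201))
```

With an odd side length the requested pixel sits exactly at the center of the box; with an even side length it sits one pixel past the middle.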

In [None]:
def cutout_coadd(butler, ra, dec, band='r', datasetType='deepCoadd',
                 skymap=None, cutoutSideLength=51, **kwargs):
    """
    Produce a cutout from a coadd centered on the given RA, Dec position.

    Adapted from DC2 tutorial notebook by Michael Wood-Vasey.

    Parameters
    ----------
    butler: lsst.daf.butler.Butler
        Servant providing access to a data repository
    ra: float
        Right ascension of the center of the cutout, degrees
    dec: float
        Declination of the center of the cutout, degrees
    band: string
        Band (filter) of the image to load
    datasetType: string ['deepCoadd']
        Which type of coadd to load.  Doesn't support 'calexp'
    skymap: lsst.skymap.BaseSkyMap [optional]
        Pass in to avoid the Butler read.  Useful if you have lots of them.
    cutoutSideLength: int [optional]
        Side of the cutout region in pixels.

    Returns
    -------
    lsst.afw.image.ExposureF
        The cutout, including its image and mask planes
    """

    radec = geom.SpherePoint(ra, dec, geom.degrees)
    cutoutSize = geom.ExtentI(cutoutSideLength, cutoutSideLength)

    if skymap is None:
        skymap = butler.get("skyMap")

    # Look up the tract, patch for the RA, Dec
    tractInfo = skymap.findTract(radec)
    patchInfo = tractInfo.findPatch(radec)
    xy = geom.PointI(tractInfo.getWcs().skyToPixel(radec))
    bbox = geom.BoxI(xy - cutoutSize//2, cutoutSize)
    patch = tractInfo.getSequentialPatchIndex(patchInfo)

    coaddId = {'tract': tractInfo.getId(), 'patch': patch, 'band': band}
    parameters = {'bbox': bbox}

    cutout_image = butler.get(datasetType, parameters=parameters,
                              immediate=True, dataId=coaddId)

    return cutout_image

In [None]:
# Use the center of the DC2 region as an example
ra, dec = 55.064, -29.783
cutout_image = cutout_coadd(butler, ra, dec, datasetType='deepCoadd', 
                            cutoutSideLength=201)
print("The size of the cutout is: ", cutout_image.image.array.shape)

In [None]:
# Display the image cutout
fig = plt.figure()
afw_display = afwDisplay.Display(1)
afw_display.scale('asinh', 'zscale')
afw_display.mtv(cutout_image.image)
plt.gca().axis('off')

### 5. Exploring and retrieving catalog data from the Butler

The TAP service is the recommended way to retrieve DP0.1 catalog data for a notebook, and there are several other [tutorials](https://github.com/rubin-dp0/tutorial-notebooks) that demonstrate how to use the TAP service.

However, as we saw above, the Butler can also be used to access catalog data. We can investigate the table schema for a specific source catalog by appending `_schema` to the `datasetType`. Note that this does not require you to specify the `dataId`.

In [None]:
schema_coadd_src = butler.get('deepCoadd_forced_src_schema')
schema_coadd_src.asAstropy()

The table `schema` stores information about the columns in the table. Uncommenting any of the first three lines in the following cell will print the schema to the screen in a different way.

In [None]:
# schema_coadd_src.schema
# schema_coadd_src.schema.getNames()
# schema_coadd_src.schema.getOrderedNames()
print('Number of columns in this table = ', len(schema_coadd_src.schema.getNames()))

Perhaps you want to search for all schema elements that contain the term 'psf'.
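
The same kind of substring search can be written as a list comprehension. This sketch runs on a short, hand-written list of column names (a small stand-in for the full schema) and makes the match case-insensitive, which is an added assumption rather than something the schema requires.

```python
# A short stand-in for the full list of column names.
names = ["coord_ra", "coord_dec", "base_SdssShape_psf_xx",
         "base_SdssShape_psf_yy", "base_SdssShape_psf_xy"]

# Keep the index and name of every column containing 'psf' in any case.
matches = [(i, name) for i, name in enumerate(names) if "psf" in name.lower()]
for i, name in matches:
    print(i, name)
```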

In [None]:
# Define an array that is all of the column names
all_names = schema_coadd_src.schema.getOrderedNames()

# Loop over the names and look for the term 'psf'
for i, name in enumerate(all_names):
    if name.find('psf') >= 0:
        print(i, name)
del all_names

You will probably want to know more about the values in these columns. You can do that by printing the documentation string in the schema.

In [None]:
# Turn the schema into a python dictionary, to be able to call a column by name
schema_dict = schema_coadd_src.schema.extract('*')

# Print the associated docstring for each of the named columns of interest
for name in ['base_SdssShape_psf_xx', 'base_SdssShape_psf_yy', 'base_SdssShape_psf_xy']:
    doc = schema_dict[name].getField().getDoc()
    units = schema_dict[name].getField().getUnits()
    print(name, '[%s]' % units, ' = ', doc)

Refer to the DP0.1 Data Products Definitions Document (DPDD) at [dp0-1.lsst.io](http://dp0-1.lsst.io) to find out more about the columns.

<br>
The full catalogs are very large and it is not feasible to try and retrieve them in their entirety.
Instead, in this example we identify the tract and patch of interest and retrieve only catalog data for a small region of sky.
We use the same ra and dec coordinates as above to find the patch and tract.

In [None]:
radec = geom.SpherePoint(ra, dec, geom.degrees)

skymap = butler.get("skyMap")

tractInfo = skymap.findTract(radec)
tract = tractInfo.getId()

patchInfo = tractInfo.findPatch(radec)
patch = tractInfo.getSequentialPatchIndex(patchInfo)

print(tract, patch)

coaddId = {'tract': tract, 'patch': patch, 'band': 'i'}

coadd_src = butler.get('deepCoadd_forced_src', coaddId)
coadd_src = coadd_src.copy(True)

In [None]:
# Show the table contents if desired
# coadd_src.asAstropy()

<br>

Convert to a Pandas dataframe (see the first tutorial) for easy interaction.
The following cells offer options for printing the column names or the data values.

In [None]:
data = coadd_src.asAstropy().to_pandas()

In [None]:
# print(data.columns)

In [None]:
# for col in data.columns:
#     print(col)

In [None]:
# data['coord_ra'].values

Plot the locations of sources in the patch as well as the ra,dec that we requested. Note that the `coord_ra` and `coord_dec` are in radians, so we need to convert them to degrees.

In [None]:
fig = plt.figure()
plt.plot(np.degrees(data['coord_ra'].values),
         np.degrees(data['coord_dec'].values),
         'o', ms=2, alpha=0.5)
plt.plot(ra, dec, '*', ms=25, mec='k')
plt.xlabel('RA (deg)')
plt.ylabel('Dec (deg)')
plt.title(f'Butler deepCoadd_forced_src objects in tract {tract} patch {patch}')

As we noted, the `coord_ra` and `coord_dec` columns have units of _radians_. As an exercise, you could use what you've learned from above to confirm this by accessing the table schema. (Also note that you can scroll up and find the answer in the outputs from a cell you already executed.) 