<img align="left" src = https://project.lsst.org/sites/default/files/Rubin-O-Logo_0.png width=250 style="padding: 10px"> 
<b>Introduction to the LSST Data Butler</b> <br>
Last verified to run on <b>TBD</b> with LSST Science Pipelines release <b>TBD</b> <br>
Contact author: Alex Drlica-Wagner <br>
Credit: Originally developed by Alex Drlica-Wagner in the context of the LSST Stack Club <br>
Target audience: All DP0 delegates. <br>
Container Size: medium <br>
Questions welcome at <a href="https://community.lsst.org/c/support/dp0">community.lsst.org/c/support/dp0</a> <br>
Find DP0 documentation and resources at <a href="https://dp0-1.lsst.io">dp0-1.lsst.io</a> <br>

<br><br>
This notebook provides an introduction to the use of the data Butler. The Butler is the LSST Science Pipelines interface for managing, reading, and writing datasets. The Butler can be used to explore the contents of the DP0.1 data repository and access the DP0.1 data. The current version of the Butler (referred to as "Gen-3") is still under development, and this notebook may be modified in the future.

<br>
The goals of this notebook are to:<br>
1. create an instance of the Butler<br>
2. explore the DP0.1 data repository<br>
3. retrieve and display some image and catalog data<br>
4. create an image cutout at a specific location<br>
5. retrieve and plot catalog data 



### Setup

In [None]:
# This should match the verified version listed at the start of the notebook
! eups list -s lsst_distrib

### 1. Create the Butler

In this section we are going to create an instance of the Butler. We start with some general Python package imports.

In [None]:
# Generic imports
import os,glob
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (8.0, 8.0)

We import several packages from the LSST Science Pipelines. 
The first import gives us access to the Butler, while the second provides tools for displaying data. More details about image display can be found in the `rubin-dp0` GitHub Organization's [tutorial-notebooks](https://github.com/rubin-dp0/tutorial-notebooks) repository.

In [None]:
# Stack imports
import lsst.daf.butler as dafButler
import lsst.afw.display as afwDisplay
afwDisplay.setDefaultBackend('matplotlib')

To create the Butler, we need to provide it with a path to the data set, which is called a "data repository". The Butler can access repositories that are remote (e.g., pointing to an S3 bucket) or local (e.g., pointing to a path on the local file system). In this case, we point to an S3 bucket.

In [None]:
repo='s3://butler-us-central1-dp01'
butler = dafButler.Butler(repo)

We now have an instance of the Butler that we can explore with the Python `help` function.

In [None]:
# Uncomment the following line to see the Butler help documentation
#help(butler)

### 2. Explore a data repository

Now that we've created an instance of the Butler, we can access the data `registry` (a database containing information about available data products). The registry will help us examine what collections of data products exist.

In [None]:
registry = butler.registry

# We can also examine the registry
#help(registry)

The `registry` is a good tool for investigating a repository (more on the registry schema can be found [here](https://dmtn-073.lsst.io/)). For example, we can get a list of all collections, with

In [None]:
for c in sorted(registry.queryCollections()):
    print(c)

This is our first glimpse at the data sets contained in the repository, but it doesn't tell us *which* collection we are actually interested in. The names do give us some hints though...

* `2.2i` - refers to the processing run of the LSST DESC DC2 data (the `i` stands for the `imSim` tool that was used to simulate the images)
* `calib` - refers to calibration products that are used for instrument signature removal
* `runs` - refers to processed data products
* `refcats` - refers to the reference catalogs used for astrometric and photometric calibration
* `skymaps` - are the geometric representations of the sky coverage

Collections can be nested, so we can access everything for DC2 Run 2.2i (the primary DP0.1 data set) by selecting the collection `2.2i/runs/DP0.1`. This is a pointer to other collections that expand out recursively... More on collections can be found [here](https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/organizing.html#collections).

In [None]:
# If this collection is a pointer to other collections, expand those out recursively.
collection="2.2i/runs/DP0.1"
print(collection)
for c in sorted(registry.queryCollections(collection,flattenChains=True)):
    print(c, registry.getCollectionType(c))
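To make "expand out recursively" concrete, here is a toy sketch in plain Python that does not use the Butler at all. The collection names and the `chains` dictionary are made up for illustration, and the `flatten` helper is hypothetical; it only mimics the idea of following a chained collection down to its non-chained members.

```python
# Hypothetical chain definitions: each chained collection maps to its children.
# These names are illustrative and are NOT the real DP0.1 collection layout.
chains = {
    "example/chain": ["example/run_a", "example/nested"],
    "example/nested": ["example/run_b", "example/run_c"],
}


def flatten(name, chains):
    """Return the non-chained collections reachable from `name`, in order."""
    if name not in chains:
        # Not a chain, so it is a leaf collection
        return [name]
    leaves = []
    for child in chains[name]:
        leaves.extend(flatten(child, chains))
    return leaves


flat = flatten("example/chain", chains)
print(flat)
```

The nested chain is expanded in place, so the order of the children is preserved in the flattened list.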

We now create a new Butler instance, specifying the `2.2i/runs/DP0.1` collection as the one we are interested in. For most uses, this is the line you will use to create a Butler to work on DP0.1.

In [None]:
# Create a new butler with the collection of interest
butler = dafButler.Butler(repo,collections=collection)
registry = butler.registry

You are probably interested in figuring out what kind of data is present in DP0.1. The LSST Science Pipelines classify different data products in terms of `DatasetTypes`. However, individual `DatasetTypes` are defined globally and do not belong to a specific collection. A query of `DatasetTypes` therefore returns every `DatasetType` in the repository, and not all of them may be present in the collection of interest.

In [None]:
for x in sorted(registry.queryDatasetTypes()):
    print(x)

Again, this list may seem a bit overwhelming, but we can extract some information from the names of the dataset types:

- `calexp` - refers to an individual calibrated exposure
- `deepCoadd` - refers to products produced on the coadd images (both images and source catalogs)
- `src` - refers to the catalog of sources
- `skyMap` - refers to geometric representations of the sky coverage

You can look up these and other LSST terms in the searchable [LSST Glossary](https://www.lsst.org/scientists/glossary-acronyms).

<b> Which data sets are most appropriate for DP0.1? </b><br>
Most DP0.1 delegates will only be interested in data sets with types `ExposureF` or `SourceCatalog`. 
For images, stick to the `calexp` (processed visit images, or PVIs) and `deepCoadd` (stacked PVIs).
For catalogs, the `src` table should be used with the `calexp` images, and the `deepCoadd_forced_src` table is the most appropriate one to use with the `deepCoadd` images.
More information can be found in the DP0.1 Data Products Definitions Document (DPDD) at [dp0-1.lsst.io](http://dp0-1.lsst.io).

We access specific data sets through a set of specifications known as a data identifier (`dataId`). Each `DatasetType` is identified by a different set of properties, so it is important to determine which properties must be specified to access data of a given type. It is possible to get all `DatasetRef`s (which include the `dataId`) for a specific `datasetType` in a specific collection with a query like the one below. Note that this doesn't necessarily guarantee that the specific data set exists, so we include a check that the data set has a valid URI.

In [None]:
datasetRefs = registry.queryDatasets(datasetType='calexp',collections=collection)
for i,ref in enumerate(datasetRefs):
    print(ref.dataId.full)
    # Check that a URI can be resolved for this data set
    try:
        uri = butler.getURI(ref)
    except Exception:
        print("File not found...")
    if i > 10: break
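A small aside on the `if i > 10: break` pattern used in this notebook: because the check happens after the print and `i` starts at zero, it lets twelve entries through before stopping. If you want an exact count, `itertools.islice` from the standard library is one alternative. The sketch below uses a made-up generator (`fake_refs`, with hypothetical visit numbers) as a stand-in for a query result, so it runs without the Butler.

```python
from itertools import islice


def fake_refs():
    """Stand-in for a lazy query result; yields hypothetical data IDs forever."""
    visit = 700000
    while True:
        yield {"visit": visit}
        visit += 1


# Take exactly the first 12 entries without an explicit counter or break
first = list(islice(fake_refs(), 12))
for data_id in first:
    print(data_id)
```

`islice` stops pulling from the iterator once it has the requested number of items, so an unbounded or very long result is never fully consumed.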

In the code above we accessed the Uniform Resource Identifier (URI) for each data product with `butler.getURI`. Note that in this case, the URI points to an S3 bucket and not a local path on the filesystem. We do not need to know exactly where the data live in order to access them. That's the power of the Butler.

In [None]:
print("Repo path: ",repo)
uri = butler.getURI(ref)
print("File URI: ",uri)
!ls -lh {uri.ospath}

Now say we want to restrict our selection to images that were taken in a specific optical filter. We can add that constraint to our query, but first we need to figure out what the LSST filters are called... Looking at the `dataId`, we see that the attribute `band` looks promising.

In [None]:
ref.dataId.full

In [None]:
print(f"band = {ref.dataId['band']}")

It looks like `band` is what we want, so we include that in the `dataId` argument of `queryDatasets`. We can also select only visits with visit numbers larger than 700000 by adding a constraint in the `where` argument of `queryDatasets`. Let's try the $g$ band.

In [None]:
# You can also sub-select on specific properties of a data set
datasetRefs = registry.queryDatasets(datasetType='calexp',dataId={'band': 'g'}, where='visit > 700000', collections=collection)
for i,ref in enumerate(datasetRefs):
    print(ref.dataId.full)
    # Check that a URI can be resolved for this data set
    try:
        uri = butler.getURI(ref)
    except Exception:
        print("File not found...")
    if i > 10: break

### 3. Retrieve and plot a calexp with sources

Ok, now we have all the information we need to ask the Butler to get a specific data product. We have identified a collection (`2.2i/runs/DP0.1`), a `datasetType` (`calexp`), and the `dataId` (from the `datasetRef`) to uniquely specify an instance of this data set.

From the list above, let's choose one detector from one visit and ask the Butler to get a `calexp` for us.

In [None]:
# We could pass the datasetRef that we found above, but since the query may 
# return results in a different order we define the dataId explicitly for reproducibility. 
dataId = {'visit': 703697, 'detector': 80, 'band':'g'}
calexp = butler.get('calexp', dataId=dataId)

# This will print a warning related to the gen2 to gen3 Butler conversion that was performed on DP0.1
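Since each dataset type expects its own set of `dataId` keys, it can help to check a dictionary before handing it to the Butler. The following is a minimal sketch in plain Python. The `example_keys` table simply records the keys used in this notebook's own examples (it is a simplification, not the registry's definition of each dataset type), and `missing_keys` is a hypothetical helper.

```python
# Keys taken from the examples in this notebook; a simplification,
# not the registry's authoritative definition of each dataset type.
example_keys = {
    "calexp": {"visit", "detector", "band"},
    "deepCoadd": {"tract", "patch", "band"},
}


def missing_keys(dataset_type, data_id):
    """Return the sorted list of expected keys that are absent from `data_id`."""
    return sorted(example_keys[dataset_type] - set(data_id))


complete = missing_keys("calexp", {"visit": 703697, "detector": 80, "band": "g"})
incomplete = missing_keys("deepCoadd", {"tract": 4638, "band": "i"})
print(complete)
print(incomplete)
```

An empty list means every expected key is present; otherwise the list names what still needs to be supplied.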

The `calexp` (also known as a "processed visit image," or PVI) is a calibrated CCD image from a single exposure. We'll use the afwDisplay interface to show the pixel values and mask plane (more on afwDisplay can be found in other notebooks).

In [None]:
fig = plt.figure()
display = afwDisplay.Display()
display.scale('linear', 'zscale')
display.mtv(calexp)
plt.show()

Note the blue coloring of most sources in the above image (if you have sensitive eyes, you may also see some red streaks). The colors are set by the "mask" plane, which encodes things such as bad pixels, or ones that saturated (in this case, the blue pixels are those that are part of detected sources). See the Image Display and Manipulation tutorial for more about the mask plane. 

Now that we have a calibrated image, we may want to get the catalog of sources that were extracted from it. To get the `src` catalog associated with this `calexp` we pass the `dataId` to the butler with the `src` datasetType. Note that this performs another query to the registry database to find the `src` catalog that matches our dataId requirements.

In [None]:
# We can get the src table using the dataId as we did above for the calexp
# (note that it is also possible to pass the data ref)
src = butler.get('src',dataId=dataId)
# Make a deep copy so that the catalog is contiguous in memory
src = src.copy(True)
src.asAstropy()

We can now plot the `calexp` with the `src` catalog overlaid. We'll leave further investigation of this image as an exercise to the user :)

In [None]:
# And plot!
fig = plt.figure()
afw_display = afwDisplay.Display(1)
afw_display.scale('asinh', 'zscale')
afw_display.mtv(calexp)
plt.gca().axis('off')

# We use display buffering to avoid re-drawing the image after each source is plotted
with afw_display.Buffering():
    for s in src:
        afw_display.dot('+', s.getX(), s.getY(), ctype=afwDisplay.RED)
        afw_display.dot('o', s.getX(), s.getY(), size=20, ctype='orange') 

### 4. How to query for multiple data sets

In the case above, both the `calexp` and `src` can be found by the registry, but this will not always be the case. The `queryDataIds` method provides a more flexible way to query for multiple datasets (requiring an instance of every dataset to be available for that `dataId`) or to ask for different `dataId` keys than those used to identify the dataset (which invokes various built-in relationships). An example of this is provided below:

In [None]:
# Use queryDataIds to grab the dataIds for a subset taken from a single visit
dataIds = registry.queryDataIds(["visit", "detector", "band"], datasets=["calexp","src"], where='visit = 703697',
                                collections=collection)
for i,dataId in enumerate(dataIds):
    print(dataId.full)
    if i > 10: break

In [None]:
# Use queryDataIds to grab the dataIds for a subset using the "where" functionality
dataIds = registry.queryDataIds(["visit", "detector"], datasets=["calexp","src"], 
                                where="band='g' and detector=0 and visit > 700000",
                                collections=collection)
for i,dataId in enumerate(dataIds):
    print(dataId.full)
    if i > 10: break
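The `where` argument is just a string, so when the constraints come from variables it can be convenient to assemble it programmatically. Below is a minimal sketch in plain Python. The `build_where` helper is hypothetical (it is not part of the Butler), and it only handles the simple "key operator value" terms joined by `and` that appear in this notebook.

```python
def build_where(**constraints):
    """Join (operator, value) constraints into an 'and'-separated expression string."""
    terms = []
    for key, (op, value) in constraints.items():
        # Quote string values; leave numbers bare
        rendered = f"'{value}'" if isinstance(value, str) else str(value)
        terms.append(f"{key} {op} {rendered}")
    return " and ".join(terms)


where = build_where(band=("=", "g"), detector=("=", 0), visit=(">", 700000))
print(where)
```

The resulting string could then be passed as the `where` argument in place of a hand-written expression.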

You can also get more metadata about a data product dimension. (Note that the records for an exposure and a visit are different.)

In [None]:
# Use queryDimensionRecords to retrieve the metadata record for each dimension
for dim in ['exposure','visit','detector']:
    print(list(registry.queryDimensionRecords(dim, where='visit = 971990 and detector=0'))[0])
    print()

### 5. Generate an Image Cutout

Now say we want to grab a cutout of the DP0.1 coadded images at a specific location. To do this, we need a few other packages from the LSST Science Pipelines, in particular the geometry and coordinate packages.

In [None]:
import lsst.geom as geom
import lsst.afw.coord as afwCoord

Next we define a `cutout_coadd` function to query an instance of the Butler for a cutout image at a specific location, band, and size.

In [None]:
def cutout_coadd(butler, ra, dec, band='r', datasetType='deepCoadd',
                 skymap=None, cutoutSideLength=51, **kwargs):
    """
    Produce a cutout from a coadd centered on the given RA, Dec position.

    Adapted from DC2 tutorial notebook by Michael Wood-Vasey.
    
    Parameters
    ----------
    butler: lsst.daf.butler.Butler
        Servant providing access to a data repository
    ra: float
        Right ascension of the center of the cutout, degrees
    dec: float
        Declination of the center of the cutout, degrees
    band: string 
        Band (filter) of the image to load
    datasetType: string ['deepCoadd']  
        Which type of coadd to load.  Doesn't support 'calexp'
    skymap: lsst.skymap.BaseSkyMap [optional] 
        Pass in to avoid the Butler read.  Useful if you have lots of them.
    cutoutSideLength: int [optional] 
        Side of the cutout region in pixels.
    
    Returns
    -------
    Exposure cutout (an `ExposureF` for `deepCoadd`)
    """
    radec = geom.SpherePoint(ra, dec, geom.degrees)
    cutoutSize = geom.ExtentI(cutoutSideLength, cutoutSideLength)

    if skymap is None:
        skymap = butler.get("skyMap")
    
    # Look up the tract, patch for the RA, Dec
    tractInfo = skymap.findTract(radec)
    patchInfo = tractInfo.findPatch(radec)
    xy = geom.PointI(tractInfo.getWcs().skyToPixel(radec))
    bbox = geom.BoxI(xy - cutoutSize//2, cutoutSize)
    patch = tractInfo.getSequentialPatchIndex(patchInfo)

    coaddId = {'tract': tractInfo.getId(), 'patch': patch, 'band': band}
    parameters = {'bbox':bbox}
    
    # Only the pixels inside bbox are read, rather than the full patch image
    cutout_image = butler.get(datasetType, parameters=parameters, dataId=coaddId)
    
    return cutout_image
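The bounding-box arithmetic in `cutout_coadd` (`xy - cutoutSize//2` as the minimum corner, `cutoutSize` as the extent) can be checked by hand. Here is a minimal sketch in plain Python, assuming integer-box semantics in which a box of N pixels starting at `x_min` ends at `x_min + N - 1`. The `cutout_bounds` helper and the center pixel values are hypothetical.

```python
def cutout_bounds(x_center, y_center, side_length):
    """Return (x_min, y_min, x_max, y_max), inclusive, for a square cutout."""
    half = side_length // 2
    x_min = x_center - half
    y_min = y_center - half
    # Inclusive maximum: N pixels starting at x_min end at x_min + N - 1
    x_max = x_min + side_length - 1
    y_max = y_min + side_length - 1
    return x_min, y_min, x_max, y_max


# Hypothetical center pixel, with the 201-pixel side length used in this notebook
bounds = cutout_bounds(1000, 2000, 201)
print(bounds)
```

With an odd side length the requested position lands on the central pixel of the cutout, which is one reason to prefer odd values such as 51 or 201.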

In [None]:
ra,dec = 55.064, -29.783 # center of the DC2 region
cutout_image = cutout_coadd(butler,ra,dec,datasetType='deepCoadd', cutoutSideLength=201)
print("The size of the cutout is: ",cutout_image.image.array.shape)

In [None]:
fig = plt.figure()
afw_display = afwDisplay.Display(1)
afw_display.scale('asinh', 'zscale')
afw_display.mtv(cutout_image.image)
plt.gca().axis('off')

### 6. Exploring and retrieving catalog data from the Butler

The TAP service is the recommended way to retrieve DP0.1 catalog data for a notebook, and there are several other [tutorials](https://github.com/rubin-dp0/tutorial-notebooks) that demonstrate how to use the TAP service.

However, as we saw above, the Butler can also be used to access catalog data. We can investigate the table schema for a specific source catalog by appending `_schema` to the `datasetType` and asking the Butler for it. Note that this does not require you to specify the `dataId`.

In [None]:
schema_coadd_src = butler.get('deepCoadd_forced_src_schema')
schema_coadd_src.asAstropy()

The table `schema` stores information about the columns stored in the table. Each of the following lines will print the schema to the screen in different ways.

In [None]:
# schema_coadd_src.schema
# schema_coadd_src.schema.getNames()
# schema_coadd_src.schema.getOrderedNames()
print('Number of columns in this table = ', len(schema_coadd_src.schema.getNames()) )

Perhaps you want to search for all schema elements that contain the term 'psf'.

In [None]:
# Define an array that is all of the column names
all_names = schema_coadd_src.schema.getOrderedNames()

# Loop over the names and look for the term 'psf'
for i,name in enumerate(all_names):
    if name.find('psf') >= 0:
        print( i, name )
del all_names
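One caveat with a search like this is that it is case-sensitive, so a column with `Psf` in its name would be skipped. The sketch below illustrates the difference in plain Python. The `column_names` list is a small illustrative sample (the last entry in particular is an assumed example of a capitalized name), not the actual schema.

```python
# Illustrative sample of column names, not the real schema
column_names = [
    "coord_ra",
    "coord_dec",
    "base_SdssShape_psf_xx",
    "base_SdssShape_psf_yy",
    "base_SdssShape_psf_xy",
    "base_PsfFlux_instFlux",  # assumed example with a capital 'P'
]

# Case-sensitive match: same behavior as name.find('psf') >= 0
matches_exact = [name for name in column_names if "psf" in name]

# Case-insensitive match: lower-case each name before testing
matches_any_case = [name for name in column_names if "psf" in name.lower()]

print(len(matches_exact), matches_exact)
print(len(matches_any_case), matches_any_case)
```

Lower-casing the name before the test is a cheap way to avoid missing columns that differ only in capitalization.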

You will probably want to know more about the values in these columns. You can do that by printing the documentation string in the schema.

In [None]:
# Turn the schema into a python dictionary, to be able to call a column by name
schema_dict = schema_coadd_src.schema.extract('*')

# Print the associated docstring for each of the named columns of interest
for name in ['base_SdssShape_psf_xx','base_SdssShape_psf_yy','base_SdssShape_psf_xy']:
    doc = schema_dict[name].getField().getDoc()
    units = schema_dict[name].getField().getUnits()
    print(name, '[%s]'%units, ' = ', doc)

Refer to the DP0.1 Data Products Definitions Document (DPDD) at [dp0-1.lsst.io](http://dp0-1.lsst.io) to find out more about the columns.

<br>
The full catalogs are very large and it is not feasible to try and retrieve them in their entirety.
Instead, in this example we identify the tract and patch of interest and retrieve only catalog data for a small region of sky.
We use the same ra and dec coordinates as above to find the patch and tract.

In [None]:
radec = geom.SpherePoint(ra, dec, geom.degrees)

skymap = butler.get("skyMap")

tractInfo = skymap.findTract(radec)
tract = tractInfo.getId()

patchInfo = tractInfo.findPatch(radec)
patch = tractInfo.getSequentialPatchIndex(patchInfo)

print(tract, patch)

coaddId = {'tract': tract, 'patch': patch, 'band':'i'}

coadd_src = butler.get('deepCoadd_forced_src',coaddId)
coadd_src = coadd_src.copy(True)

In [None]:
# Show the table contents if desired
# coadd_src.asAstropy()

<br>

Convert to a Pandas dataframe (see the first tutorial) for easy interaction.
The following cells offer options for printing the column names or the data values.

In [None]:
data = coadd_src.asAstropy().to_pandas()

In [None]:
# print(data.columns)

In [None]:
# for col in data.columns:
#     print(col)

In [None]:
# data['coord_ra'].values

Plot the locations of sources in the patch as well as the ra,dec that we requested. Note that the `coord_ra` and `coord_dec` are in radians, so we need to convert them to degrees.

In [None]:
fig = plt.figure()
plt.plot(np.degrees(data['coord_ra'].values), np.degrees(data['coord_dec'].values), 'o', ms=2, alpha=0.5 )
plt.plot(ra,dec,'*',ms=25, mec='k')
plt.xlabel('RA (deg)')
plt.ylabel('Dec (deg)')
plt.title(f'Butler deepCoadd_forced_src objects in tract {tract} patch {patch}')

As we noted, the `coord_ra` and `coord_dec` columns have units of _radians_. As an exercise, you could use what you've learned from above to confirm this by accessing the table schema. (Also note that you can scroll up and find the answer in the outputs from a cell you already executed.) 
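For reference, the conversion itself is a multiplication by 180/π. The sketch below uses only the standard library `math` module and the cutout coordinates from earlier in this notebook to show the round trip; the plotting cell above does the same thing on whole columns with `np.degrees`.

```python
import math

# Coordinates used for the cutout earlier in this notebook, in degrees
ra_deg, dec_deg = 55.064, -29.783

# Degrees to radians (the units of coord_ra and coord_dec)
ra_rad = math.radians(ra_deg)
dec_rad = math.radians(dec_deg)

# Radians back to degrees, as needed for plotting
ra_back = math.degrees(ra_rad)
dec_back = math.degrees(dec_rad)

print(f"RA:  {ra_rad:.5f} rad -> {ra_back:.3f} deg")
print(f"Dec: {dec_rad:.5f} rad -> {dec_back:.3f} deg")
```

A quick sanity check on raw catalog values: right ascension in radians always lies between 0 and 2π, so values near 1 rather than near 55 are a sign that the column has not yet been converted.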