# Using Rubin Data Guide
Creating a new account, getting access to the data, and downloading data sets <br>
Note that the code blocks that use the LSST packages will only work within the RSP Jupyter Labs instance <br>
Sections:
- [Creating Rubin Account/Registering as a Delegate](#Creating-Rubin-Account/Registering-as-a-Delegate)
- [Rubin Data Schema](#Rubin-Data-Schema)
- [Using the RSP Jupyter Labs Aspect](#Using-the-RSP-Jupyter-Labs-Aspect)
- [GitHub Setup](#GitHub-Setup)
- [Pulling a DP0.2 Catalog Data Set (3 methods)](#Pulling-a-DP0.2-Catalog-Data-Set---3-methods)
- [Portal GUI Method](#Portal-GUI-Method)
- [Portal ADQL Method](#Portal-ADQL-Method)
- [Jupyter Labs Method](#Jupyter-Labs-Method)
- [Looking at DP0.2 Images Using Notebooks](#Looking-at-DP0.2-Images-Using-Notebooks)

### Creating Rubin Account/Registering as a Delegate
- Follow steps on [this website](https://rsp.lsst.io/guides/getting-started/get-an-account.html)
    - For step 2, use GitHub as identity provider
    - For step 5, use your City Tech email
- Approval should take 1-2 business days
- Bookmark the [Rubin Science Platform](https://data.lsst.cloud/), aka RSP
- Important resources:
    - [Rubin Science Platform guides](https://rsp.lsst.io/guides/index.html)
    - [DP0.2 Data Schema](https://dm.lsst.org/sdm_schemas/browser/dp02.html)
    - [DP0.2 Tutorials](https://dp0-2.lsst.io/tutorials-examples/index.html) (online, not in Notebooks)

### Rubin Data Schema
- [Overall Data Schema](https://dm.lsst.org/sdm_schemas/browser/)
- [DP0.1 Schema](https://dm.lsst.org/sdm_schemas/browser/dp01.html) - contains unprocessed tables from DC2 simulation
    - [Paper on how DC2 was generated](https://arxiv.org/pdf/2010.05926) and what it contains
- [DP0.2 Schema](https://dm.lsst.org/sdm_schemas/browser/dp02.html) - contains images and catalogs from DC2 simulation after it was processed using the LSST Science Pipelines
    - Many of the catalogs use coadded images - read more about that process in [this paper](https://arxiv.org/abs/2211.09300) (not related to LSST)
    - For most fields, there are 6 versions, one for each of the photometric light bands that the data was collected for
        - The bands are labeled u, g, r, i, z, y
        - The actual bands and what light range they correspond to can be seen on [this website](http://svo2.cab.inta-csic.es/theory/fps/index.php?mode=browse&gname=LSST&asttype=)
        - Each field that is specific to a band follows the format \<band letter>_\<field name>
            - Eg, g_decl for declination of an object in the g band
    - Flux is related to magnitude
        - Convert to magnitude using the equation “-2.5 * log10(g_calibFlux) + 31.4” - needs to be done separately for each photometric band
        - The Python equivalent command is -2.50 * numpy.log10(results_table['g_calibFlux']) + 31.4
            - See first DP0.2 Notebook tutorial, section 2.3.1
        - There are multiple different flux values for every band, some of which are labeled as relating to a specific aperture and labeled forced - **what does this mean??**
- [DP0.3 Schema](https://dm.lsst.org/sdm_schemas/browser/dp03.html) - contains tables of nearby/Solar System objects
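
The band-prefix naming and the flux-to-magnitude equation described above can be sketched in plain Python (no LSST packages, so it also runs outside the RSP). The helper names `band_column` and `flux_to_mag` and the sample row are made up for this example, and the 31.4 zero point assumes the fluxes are in nanojansky.

```python
import math

BANDS = ["u", "g", "r", "i", "z", "y"]

def band_column(band, field):
    """Build a band-specific column name such as 'g_calibFlux'."""
    return f"{band}_{field}"

def flux_to_mag(flux):
    """Convert a flux to a magnitude with -2.5 * log10(flux) + 31.4.

    Missing, zero, or negative fluxes have no magnitude, so NaN is returned.
    """
    if flux is None or math.isnan(flux) or flux <= 0:
        return math.nan
    return -2.5 * math.log10(flux) + 31.4

# Made-up example row; real rows come from a query on dp02_dc2_catalogs.Object
row = {"g_calibFlux": 100.0, "r_calibFlux": 250.0, "i_calibFlux": float("nan")}

# The conversion has to be repeated for each band that is present
for band in BANDS:
    column = band_column(band, "calibFlux")
    if column in row:
        print(column, round(flux_to_mag(row[column]), 3))
```

Returning NaN for unusable fluxes keeps the output the same length as the input, which is roughly how the numpy version behaves on a whole column.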

### Using the RSP Jupyter Labs Aspect
- Open the [Rubin Science Platform](https://data.lsst.cloud/) and click on Notebooks
    - If you do not have a Jupyter Labs instance open, clicking on Notebooks will take you to a screen with a big blue Launch Server button. Click it!
    - An old instance could be open in the background if you last exited just by closing your browser window instead of properly shutting it down (more on how to properly exit below), in which case that instance will just be opened
    - Select an Image and other Options
        - In most cases, just use the recommended image, unless you are playing with a tutorial workbook that requires you to use a specific older image
        - Use the smallest compute amount that you can make work, to save some for all the other delegates
        - Don’t check either of the tickboxes
    - Hit start
- The Jupyter Labs instance you are taken to now will have the Image and compute power that you selected, but no matter what settings you choose in the future, any files that you save here will continue to be available to you when you are logged into your account
    - The current Image and compute amounts can be seen at the very bottom of the window
    - You can create new top level folders, save files to any of the existing folders except the notebooks one, or save files directly in the main folder
        - The notebooks folder contains tutorials that come from the [Tutorial Notebooks in GitHub](https://github.com/rubin-dp0/tutorial-notebooks) - you can edit these to play around with them, but can’t overwrite the original files, so you will need to save the files in a different location if you’d like to keep the changes you’ve made
- From the Launch tab, try opening a terminal
    - You’ll notice that you are in a remote directory specific to you, but not on your local device
    - Try running “which python” - you’ll see that the Python in this server is managed by miniconda, but you need to use pip commands for changing anything
    - You can see which Python packages are available to you with “pip list” - they will not be the same ones that you’ve downloaded to your own laptop, but most of the ones you need should be there
    - Other Python packages can be installed according to the [instructions here](https://dp0-2.lsst.io/data-access-analysis-tools/nb-intro.html#how-do-i-install-packages-in-my-user-environment) - note that you have to use the terminal in the RSP and use pip commands
- You can access Git and GitHub through the Notebooks terminal - see [GitHub Setup](#GitHub-Setup) section
- Pull data according to instructions in the [Jupyter Labs Method](#Jupyter-Labs-Method) section
- Code to your heart’s content!
- When you’re all done using Jupyter Labs, be sure to exit it properly to return the compute resources 
    - Go to File -> Save All and Exit
    - Your files will still be there next time you start a Jupyter Labs instance, even if you choose different settings for the Image or computing power
- If you need to switch your Image or compute settings, go to [this link](https://data.lsst.cloud/nb/home), stop your current server, and then start a new one
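
The package check described above (“pip list” in the terminal) can also be done from inside a notebook. This is a minimal sketch using only the standard library; the helper name `is_available` is made up for this example, and the list of package names is just an illustration.

```python
import importlib.util

def is_available(module_name):
    """Return True if the module can be imported in the current environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised for dotted names (such as 'lsst.rsp') when the parent package is missing
        return False

for name in ["numpy", "pandas", "lsst.rsp", "lsst.daf.butler"]:
    print(name, "available" if is_available(name) else "missing")
```

On your own laptop the lsst entries will most likely show as missing, while inside the RSP they should show as available.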

### GitHub Setup
- Open a terminal tab in your RSP Jupyter Labs instance
- First, set your username and email to match your GitHub account with “git config --global user.name” and “git config --global user.email”, each followed by the value in quotes

- Navigate to the highest level directory in the file system on the left, and then run an “ssh-keygen” command to generate an SSH key

- It will then ask three questions - just hit enter so that they are all left blank
- Then print the public key (the .pub file that ssh-keygen reported saving) with “cat”, and copy the results to your clipboard

- Then go to your GitHub account, click on your picture in the upper right corner, and go to Settings
- From the menu on the left, pick SSH and GPG keys
- Add a new SSH key and name it "Rubin Science Platform Notebooks"
- Paste in the key you got from the terminal, as per above, into the textbox, and save the key
- Now you should be able to use normal GitHub commands to clone, create, and push repositories in the terminal

### Pulling a DP0.2 Catalog Data Set - 3 methods
- Introductory resources:
    - [Portal-specific Tutorials](https://dp0-2.lsst.io/tutorials-examples/index.html#portal-tutorials)
    - [Tutorial Notebooks in GitHub](https://github.com/rubin-dp0/tutorial-notebooks) (also available in your instance of Jupyter Labs in the folder /notebooks/tutorial-notebooks/)
    - [Delegate Contributions to DP0.2 in GitHub](https://github.com/rubin-dp0/delegate-contributions-dp02) (contains other tutorials)
- Go to the [Rubin Science Platform](https://data.lsst.cloud/) and select either Portal or Notebooks (which is actually Jupyter Labs; they just refer to it as Notebooks)
    - If you are not logged in, you will be prompted to log in and then redirected
    - Jump to the sections below on how to use each of the three methods after being redirected from the RSP website
- Pros and cons for each of the three options:
    - [Portal GUI Method](#Portal-GUI-Method):
        - Pros:
            - Relatively easy to play around with and look for the data you want
            - Can do basic data visualizations without much coding
            - Can download data directly to your computer
        - Cons:
            - If the browser tab is accidentally closed, everything you have done will be lost unless you downloaded data or copied code into some other location already
            - Not easy to repeat the same query a second time without converting to the ADQL code
    - [Portal ADQL Method](#Portal-ADQL-Method):
        - Pros:
            - Can join multiple tables together easily
            - Can do basic data visualizations without much coding
            - Can download data directly to your computer
        - Cons:
            - Requires learning ADQL language
            - If the browser tab is accidentally closed, everything you have done will be lost unless you downloaded data or copied code into some other location already
    - [Jupyter Labs Method](#Jupyter-Labs-Method):
        - Pros:
            - If the browser tab is accidentally closed, the only work lost is any unsaved changes to Python files (even a Terminal instance stays open) - more on this below
            - Processing is all done in the cloud, and you can select more power if needed
            - As far as I can tell, the LSST specific Python packages are still only available through this, and cannot be downloaded to your computer
        - Cons:
            - Requires more work and setup to get to the point of pulling a table for the first time
            - Requires learning ADQL language
            - Hard to get any of the data back out onto your computer

### Portal GUI Method
- Once in the Portal, click on the DP0.2 Catalogs tab and make sure UI assisted is selected in the upper right corner - this should be the default
- Select data table in the top center
    - First select schema on the left (generally will use dp02_dc2_catalogs) and then the specific table on the right
- Select primary data set in one of three ways on the left (can also do multiple of these filters at once or none of them, but if you select none of them, then your primary data set will be huge)
  - Spatial
    - Cone shape: select a coordinate to center around (also choose which lat/long coordinates to use for this) and a radius (choose the unit)
    - Polygon shape: can specify any shape using any lat/long coordinate system
  - Temporal
    - ???
  - ObjectID search
    - ???
- Then on the right, select columns that you want in your data table
    - Can also filter by any column by entering conditions like “=0” or “>360” in the constraints field, whether or not you have selected that column to be returned in your data table
- Then in the bottom left, select the max number of rows that you want returned - if, based on your other filters alone, there would be more rows, then a random sample of them is returned
- Hit search, which will then take you to the results tab where you can explore your data set with the plotting tools
    - Multiple query results can be open at once - small tabs for them will appear at the top of the table section of the results page
    - The view can be changed by hitting the three lines icon in the upper left corner, clicking on the results layout, and then choosing a layout - Tables and Coverage Charts is the easiest to work with
    - In the Tables and Coverage Charts view, there will be multiple chart tabs on the upper middle of the screen
        - If Active Chart is selected from the options, select the gear icon in the upper right corner to edit what the chart is showing
        - If Details is selected, information about the columns in the data set is shown
        - If Coverage is selected, it shows the area of the sky where the data points are - **I think - need more info here**
    - Can filter the data and add new calculated columns (based only on the originally selected columns) using the icons in the Table section
- Can download your data by hitting the save icon above the data table, selecting a file format, and following the instructions from there
    - Note the difference between downloading the data as displayed and downloading the data as originally retrieved
- Can return to your query to edit it by clicking back to the DP0.2 tab
    - This will only work for your most recently run query if you are displaying multiple at once
- Can save your query for future reuse/editing by clicking “Populate and Edit ADQL” in the bottom left
    - Then save the query in the top box as a text file in your location of choice
    - This can later be rerun with the instructions below
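
To show how the form fields above map onto a query, here is a rough sketch that assembles an ADQL cone search from the same pieces (table, columns, cone center and radius, constraints, row limit). The function `build_cone_query` is made up for this example, it assumes the table uses coord_ra and coord_dec for its coordinates and that the radius is in degrees, and the ADQL that the Portal generates itself may be laid out differently.

```python
def build_cone_query(table, columns, ra, dec, radius_deg, constraints=(), max_rows=None):
    """Assemble an ADQL cone search from the same pieces the Portal form asks for."""
    # The row limit, if any, goes directly after SELECT
    top = f"TOP {max_rows} " if max_rows is not None else ""
    # Spatial filter first, then any column constraints, all joined with AND
    where = [
        "CONTAINS(POINT('ICRS', coord_ra, coord_dec), "
        f"CIRCLE('ICRS', {ra}, {dec}, {radius_deg})) = 1"
    ]
    where.extend(constraints)
    return (
        f"SELECT {top}{', '.join(columns)} "
        f"FROM {table} "
        f"WHERE {' AND '.join(where)}"
    )

query = build_cone_query(
    table="dp02_dc2_catalogs.Object",
    columns=["coord_ra", "coord_dec", "r_calibFlux"],
    ra=62, dec=-37, radius_deg=0.01,
    constraints=["r_extendedness = 0", "r_calibFlux > 360"],
    max_rows=10,
)
print(query)
```

Note that a constraint can use a column (here r_extendedness) that is not in the returned column list, matching the behavior of the constraints field.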

### Portal ADQL Method
- Once in the Portal, click on the DP0.2 Catalogs tab and then select Edit ADQL in the upper right
- Write your query in the empty text box at the top, or paste in a query that you generated through the GUI and saved as a text file previously
    - The text boxes below show example queries to get you started
    - Can browse the schema with the folder system on the left
        - The top level folders are each schema, then the next folders are each table, and then the fields are shown within those folders
        - Searching with the text search within this will only look for fields within the folders that have been expanded
        - Single clicking on a field within the schema will paste the full name of the field (\<schema>.\<table>.\<field>) into your query wherever your typing cursor is currently located (so long as the toggle option below the text box is toggled on)
        - This is useful, because if you are joining multiple tables together, you need to include the table name at the start of each field in the SELECT statement
    - Can add a row limit next to the search button or as part of the query using TOP, which goes directly after SELECT (for example, “SELECT TOP 10 ...”)
- Once the query is written, hit search
- See the Portal GUI Method section above for how to interact with the results section of the Portal
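
As a rough illustration of why each field needs its table name when joining, the sketch below builds a two-table query as a Python string. It uses short table aliases instead of the full \<schema>.\<table>.\<field> form, the helper `qualify` is made up for this example, and the table and column names (including the join on objectId) are assumptions that should be checked against the schema browser.

```python
def qualify(alias, fields):
    """Prefix each field with its table alias, e.g. 'obj.coord_ra'."""
    return [f"{alias}.{field}" for field in fields]

# Assumed table and column names; check them in the schema browser before use
object_fields = qualify("obj", ["objectId", "coord_ra", "coord_dec"])
forced_fields = qualify("fs", ["band", "psfFlux"])

join_query = (
    "SELECT TOP 10 " + ", ".join(object_fields + forced_fields) + " "
    "FROM dp02_dc2_catalogs.Object AS obj "
    "JOIN dp02_dc2_catalogs.ForcedSource AS fs "
    "ON obj.objectId = fs.objectId "
    "WHERE obj.detect_isPrimary = 1"
)
print(join_query)
```

The printed text can be pasted into the ADQL box. Without the prefixes, a field like objectId that exists in both tables would be ambiguous.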

### Jupyter Labs Method
- When you’re familiar with the environment (see [Using the RSP Jupyter Labs Aspect](#Using-the-RSP-Jupyter-Labs-Aspect) section) and ready to pull your own data set, open a new Notebook
    - Start by importing the necessary packages (in code block) and any other packages you would like, like matplotlib

```python
import numpy
import pandas
# The packages below are needed to actually import any LSST data
# and are specific to the DP0.2 catalog data
from lsst.rsp import get_tap_service, retrieve_query
# There are other packages needed if you’d like to retrieve image data
# or data from other DP0 catalogs
```

- Start the TAP service

```python
service = get_tap_service("tap")
```

- Save your desired ADQL query as a string variable (example below)

```python
my_adql_query = "SELECT description, table_name FROM TAP_SCHEMA.tables"
```

- Another example with a variable used inside the query: 

```python
use_center_coords = "62, -37"
# Adjacent string literals are joined automatically inside the parentheses
my_adql_query = (
    "SELECT TOP 10 "
    "coord_ra, coord_dec, detect_isPrimary, "
    "r_calibFlux, r_cModelFlux, r_extendedness "
    "FROM dp02_dc2_catalogs.Object "
    "WHERE CONTAINS(POINT('ICRS', coord_ra, coord_dec), "
    f"CIRCLE('ICRS', {use_center_coords}, 0.01)) = 1 "
)
```

- Actually run the query

```python
results = service.search(my_adql_query)
```

- Turn your query results into a pandas data frame

```python
results_table = results.to_table().to_pandas()
results_table
```

| | coord_ra | coord_dec | detect_isPrimary | r_calibFlux | r_cModelFlux | r_extendedness |
|---|---|---|---|---|---|---|
| 0 | 62.009569 | -37.003053 | False | 115.559762 | 107.20676 | 1.0 |
| 1 | 61.999653 | -37.003744 | False | 142.142982 | 76.299635 | 0.0 |
| 2 | 62.002448 | -37.006693 | False | | | |
| 3 | 61.995406 | -37.008044 | False | 1062.160437 | 1092.795869 | 1.0 |
| 4 | 61.997783 | -37.008798 | False | 261.141894 | 197.592692 | |
| 5 | 61.99617 | -37.005624 | False | 117.663697 | 48.474621 | 1.0 |
| 6 | 61.997782 | -37.009576 | False | 94.749279 | 42.59039 | |
| 7 | 61.99568 | -37.003583 | False | 46.794625 | 32.073184 | |
| 8 | 61.99584 | -37.001595 | False | 21.184015 | 39.045501 | |
| 9 | 61.996226 | -37.000629 | False | 152.118444 | 87.743612 | 1.0 |


- Then you can do all the pandas, numpy, and other stuff you know and love
- There are other things you can do with the original data type before it’s converted to pandas, as shown in many of the tutorial notebooks, but this should be enough to get started for now
- **Note that rather than writing the ADQL query yourself, you could figure out the query you want in the Portal GUI, have it generate the ADQL query, and then use that in your Python code**
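
One way to reuse a query saved from the Portal is to read the text file back into a string. The sketch below is a minimal example with the standard library only; the helper `load_adql` and the file name are made up, and the file is written here just so the example is self-contained. Collapsing whitespace is a simplification that would also alter any string literal in the query containing repeated spaces.

```python
from pathlib import Path
import tempfile

def load_adql(path):
    """Read a saved ADQL query and collapse it onto one line."""
    text = Path(path).read_text(encoding="utf-8")
    return " ".join(text.split())

# Stand-in for a query saved from the Portal; the file name is made up
saved = Path(tempfile.gettempdir()) / "my_saved_query.adql"
saved.write_text(
    "SELECT TOP 10 coord_ra, coord_dec\n"
    "FROM dp02_dc2_catalogs.Object\n"
    "WHERE detect_isPrimary = 1\n",
    encoding="utf-8",
)

my_adql_query = load_adql(saved)
print(my_adql_query)
# On the RSP this string could then be passed to service.search(my_adql_query)
```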

### Looking at DP0.2 Images Using Notebooks
- It's recommended to use at least the medium sized compute option, since the images take a lot of memory to display
- There are multiple LSST packages specific to the image data
    - Butler is used to retrieve the images
    - AFW is used for visualizing images
    - Geom is used for sky coordinates

```python
import matplotlib.pyplot as plt
from astropy.wcs import WCS
from astropy.visualization import make_lupton_rgb
import gc  # recommended for clearing out memory, since displaying the images takes a lot of space

# LSST-specific packages
import lsst.afw.display as afwDisplay
from lsst.afw.image import MultibandExposure
from lsst.daf.butler import Butler
from lsst.rsp import get_tap_service
import lsst.geom as geom
```

- Also import this file of functions that are useful for creating images using the code below (functions from the DP02_03a tutorial)

```python
import sys
# sys.path entries must be folders, so add the folder that contains ImageFunctions.py
sys.path.insert(0, 'WORK/LSST_Data')  # replace this with the path to your own folder
import ImageFunctions
```

- Generate a Butler instance to import the images, and select the right data repository configuration and the data collection

```python
butler = Butler('dp02', collections='2.2i/runs/DP0.2')
# these parameters should be correct for our purposes
```

- Define a dictionary with the desired parameters of the image(s) that are to be loaded in, and then use the butler instance to load them

```python
# how to load a single visit (calexp image)
my_dataId = {'visit': 192350, 'detector': 175, 'band': 'i'}  # specifying the specific calexp image we want
# note that each single visit is only taken in one band, so for a calexp image the band parameter isn't actually necessary
my_calexp = butler.get('calexp', **my_dataId)  # calexp files are individual exposure shots that later go into the coadded images
```

```python
# how to load a single coadd image, which consists of multiple visits to the same spot in the sky
my_dataId = {'tract': 4431, 'patch': 17, 'band': 'i'}  # the sky is divided into tracts and patches
# so you can identify a unique section of the sky and all the images taken of it
my_coadd = butler.get('deepCoadd', **my_dataId)
```
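
Since the two image types are identified by different keys, a small check before calling the Butler can catch a forgotten key early. This is only a sketch in plain Python: `REQUIRED_KEYS` and `missing_keys` are made-up names, not part of the Butler, and the key lists simply restate the two examples above.

```python
# Keys each dataset type needs, taken from the two examples above
# (hypothetical helper, not a Butler API)
REQUIRED_KEYS = {
    "calexp": {"visit", "detector"},
    "deepCoadd": {"tract", "patch", "band"},
}

def missing_keys(dataset_type, data_id):
    """Return the required keys that are absent from data_id, sorted by name."""
    return sorted(REQUIRED_KEYS[dataset_type] - set(data_id))

calexp_id = {"visit": 192350, "detector": 175, "band": "i"}
coadd_id = {"tract": 4431, "patch": 17}  # band left out on purpose

print(missing_keys("calexp", calexp_id))
print(missing_keys("deepCoadd", coadd_id))
```

An empty list means the dictionary has everything the example needs; extra keys such as band for a calexp are ignored by this check.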

- Look through the /LSST_Data/DP02_03a_annotatedtutorial.ipynb file to learn how to use the imported functions and other ways of displaying the image data
- Things you can do with the image data (from the tutorial):
    - Display the image using afw
    - Display the image using matplotlib
    - Visualize the mask plane
    - Make cutouts (subsets) of image
    - Plot catalog data for the same section of the sky over the image
    - Create composite image from images in different color bands