# [**SQLxMatch: In-Database Spatial Cross-Match of Astronomical Catalogs**](https://github.com/sciserver/SQLxMatch)

##### Manuchehr Taghizadeh-Popp <sup>1*</sup> and Laszlo Dobos<sup>1,2</sup>
<sup>1</sup> Department of Physics and Astronomy, Johns Hopkins University, Baltimore, MD, USA.<br>
<sup>2</sup> Department of Physics of Complex Systems, Eötvös Loránd University, Budapest, Hungary.<br>
<sup>*</sup> Leading contributor email: mtaghiza [at] jhu.edu  |  Help Desk: sciserver-helpdesk [at] jhu.edu
<br><br>

`SQLxMatch` (or *sequel cross match*) is a SQL stored procedure that performs 2-dimensional spatial cross-matches and cone searches across multiple astronomical catalogs stored in relational databases.
This procedure implements the `Zones Algorithm` ([[1]](https://arxiv.org/abs/cs/0701171), [[2]](https://arxiv.org/abs/cs/0408031)), which leverages relational database algebra and B-Trees to cross-match the database tables or views containing the catalogs. To run a cross-match, these tables only need to contain the Right Ascension (RA) or Longitude, the Declination (Dec) or Latitude, and a unique object identifier (ID) as columns.
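
To make the idea concrete, the following is a simplified in-memory Python sketch of a zone-based cross-match. It is not the code of the stored procedure, which works in SQL. The function names (`zones_xmatch`, `angular_sep_arcsec`), the `(id, ra, dec)` tuple layout, and the linear scan inside each zone (standing in for a B-Tree range search on RA) are assumptions made for illustration.

```python
import math
from collections import defaultdict

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (haversine formula); inputs in degrees."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0

def zones_xmatch(cat1, cat2, radius_arcsec, zone_height_arcsec=4.0):
    """cat1, cat2: iterables of (id, ra_deg, dec_deg).
    Returns a list of (id1, id2, sep_arcsec) for all pairs within the radius."""
    h = zone_height_arcsec / 3600.0   # zone height in degrees
    r = radius_arcsec / 3600.0        # search radius in degrees

    def zone(dec):
        return math.floor((dec + 90.0) / h)

    # Bucket the second catalog by declination zone.
    buckets = defaultdict(list)
    for obj in cat2:
        buckets[zone(obj[2])].append(obj)

    matches = []
    for id1, ra1, dec1 in cat1:
        # RA half-window that encloses the search circle at this declination.
        if abs(dec1) + r < 90.0:
            c = math.cos(math.radians(dec1 - r)) * math.cos(math.radians(dec1 + r))
            alpha = math.degrees(math.atan(math.sin(math.radians(r)) / math.sqrt(c))) + 1e-9
        else:
            alpha = 180.0  # circle touches a pole: no RA pre-filtering
        # Only zones overlapping [dec1 - r, dec1 + r] can contain matches.
        for z in range(zone(dec1 - r), zone(dec1 + r) + 1):
            for id2, ra2, dec2 in buckets.get(z, ()):
                dra = abs(ra1 - ra2) % 360.0
                if min(dra, 360.0 - dra) > alpha:   # handles the 0/360 wrap-around
                    continue
                sep = angular_sep_arcsec(ra1, dec1, ra2, dec2)
                if sep <= radius_arcsec:
                    matches.append((id1, id2, sep))
    return matches

cat1 = [(1, 10.0, 20.0)]
cat2 = [(100, 10.0, 20.0 + 2 / 3600), (101, 10.0, 20.5), (102, 10.0 + 3 / 3600, 20.0)]
print(zones_xmatch(cat1, cat2, radius_arcsec=5.0))
```

The zone and RA-window tests are cheap pre-filters; the exact separation is computed only for the few candidates that survive them.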

We have integrated `SQLxMatch` with more than 50 astronomical catalogs, and made those publicly available as tables in remote SQL Server databases `in the cloud` through the [CasJobs](https://skyserver.sdss.org/CasJobs) website, as part of the [SciServer](https://www.sciserver.org) science platform ([[3]](https://www.sciencedirect.com/science/article/abs/pii/S2213133720300664)). <br>
To improve execution speed, we install the cross-match code in a SQL Server database backed by fast NVMe storage in a RAID 6 configuration. We also place the catalog tables in several databases on the same physical server, which avoids moving data across servers over a potentially slower network connection.


The advantage of this <i>`in-database`</i> remote cross-match, compared to other <i>`in-memory`</i> local cross-match software libraries, is that the users leverage the remote database server's own (and potentially bigger) computing/memory/storage resources to filter and cross-match the full catalogs right away, having only a relatively small-sized cross-match output table returned to them.
This can be faster and more efficient than having users download the full catalogs to their own computers (if they have enough storage) and then load them into Python to filter them and run the cross-match, for instance.



### **Cross-Match Details**

The execution of `SQLxMatch` follows the pattern of a basic two-table cross-match:

    EXECUTE SQLxMatch @table1='CatalogTable1', @table2='CatalogTable2', @radius=5

which returns the following output table:
        
    TABLE(id1, id2, sep)

The first two input parameters are the names of catalog tables, views, or temporary tables located in the CasJobs `xmatch` database context. This context already contains several views of specific astronomical catalog tables, which are named `<CatalogName>_<TableName>`. <br>
The code assumes that the two input tables contain at least the columns named `RA` (Right Ascension), `Dec` (Declination), and `ObjID` (unique object or row identifier). If the names are different, they can be passed as extra input parameters (see below). The third parameter is the search radius, measured in arcseconds.
    
The output table contains all objects found within the input search radius, although it will return only the closest match if the `@only_nearest` input parameter is set to 1 (see below). The first two columns in the returned table contain the IDs of the objects in the first and second input tables, respectively, and the third column is the separation distance in arcseconds.

The code runs faster if the input tables already contain the 3-D Cartesian coordinates of each object on the surface of a unit-radius sphere, stored in columns named `cx`, `cy`, and `cz`. The cross-match code works internally with those coordinates (rather than with `RA` and `Dec`), so in that case it does not have to calculate them on the fly. <br>
Similarly, the presence of a precomputed `zoneid` column in the catalog tables (based on a zone height of 4 arcsec) will speed up the code, although it can be calculated on the fly if missing.
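
As an illustration of what these precomputed columns hold, here is a minimal Python sketch. It assumes the usual unit-sphere convention for (`cx`, `cy`, `cz`) and the zone definition `floor((dec + 90) / zone_height)` from the Zones Algorithm papers; the exact expressions used to populate the catalog tables are not shown in this document, and the function names are hypothetical.

```python
import math

def radec_to_cartesian(ra_deg, dec_deg):
    """Unit vector (cx, cy, cz) for a position given in degrees."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))

def zone_id(dec_deg, zone_height_arcsec=4.0):
    """Index of the declination stripe containing dec_deg (assumed convention)."""
    return math.floor((dec_deg + 90.0) * 3600.0 / zone_height_arcsec)

def separation_arcsec(v1, v2):
    """Angular separation from the chord length between two unit vectors."""
    chord = math.dist(v1, v2)
    return math.degrees(2.0 * math.asin(min(1.0, chord / 2.0))) * 3600.0

a = radec_to_cartesian(160.0, 25.0)
b = radec_to_cartesian(160.0, 25.0 + 10.0 / 3600.0)
print(zone_id(25.0), separation_arcsec(a, b))
```

Working with unit vectors turns the radius test into a comparison of chord lengths, which avoids evaluating trigonometric functions for every candidate pair.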


<b>SQLxMatch PARAMETERS:</b> <br>
<ul>
<li> <b>@table1 sysname</b>: name of first input catalog. Can be any of these formats: 'server.database.schema.table', 'database.schema.table', 'database.table', or simply 'table'
<li> <b>@table2 sysname</b>: name of second input catalog. Can be any of these formats: 'server.database.schema.table', 'database.schema.table', 'database.table', or simply 'table'
<li> <b>@radius float</b>: search radius around each object, in arcseconds. Takes a default value of 10 arcseconds.
<li> <b>@id_col1 sysname</b>: name of the column defining a unique object identifier in catalog @table1. Takes a default value of 'objid'.
<li> <b>@id_col2 sysname</b>: name of the column defining a unique object identifier in catalog @table2. Takes a default value of 'objid'.
<li> <b>@ra_col1 sysname</b>: name of the column containing the Right Ascension (RA) in degrees of objects in catalog @table1. Takes a default value of 'ra'.
<li> <b>@ra_col2 sysname</b>: name of the column containing the Right Ascension (RA) in degrees of objects in catalog @table2. Takes a default value of 'ra'.
<li> <b>@dec_col1 sysname</b>: name of the column containing the Declination (Dec) in degrees of objects in catalog @table1. Takes a default value of 'dec'.
<li> <b>@dec_col2 sysname</b>: name of the column containing the Declination (Dec) in degrees of objects in catalog @table2. Takes a default value of 'dec'.
<li> <b>@max_catalog_rows1 bigint</b>: default value of null. If set, the procedure will use only the TOP @max_catalog_rows1 rows in catalog @table1, with no special ordering.
<li> <b>@max_catalog_rows2 bigint</b>: default value of null. If set, the procedure will use only the TOP @max_catalog_rows2 rows in catalog @table2, with no special ordering.
<li> <b>@output_table sysname</b>: If not null, this procedure will insert the output results into the table @output_table (of format 'server.database.schema.table', 'database.schema.table', 'database.table', or simply 'table'), which must already exist and be visible within the scope of the procedure. If set to null, the output results will simply be returned as a table resultset. Takes a default value of null.
<li> <b>@only_nearest bit</b>: If set to 0 (default value), then all matches within a distance @radius to an object are returned. If set to 1, only the closest match to an object is returned.
<li> <b>@sort_by_separation bit</b>: If set to 1, then the output will be sorted by the 'id1' and 'sep' columns. If set to 0 (default value), no particular ordering is applied.
<li> <b>@radec_in_output bit </b>: If set to 1, then the output table will also contain the (RA, Dec) values of each object.
<li> <b>@print_messages bit </b>: If set to 1, then time-stamped messages will be printed as the different sections in this procedure are completed.
</ul>
    
<b>RETURNS:</b> <br>
<ul>
    <li><b>TABLE (id1, id2, sep)</b>, where id1 and id2 are the unique object identifier columns in @table1 and @table2, respectively, and sep (float) is the angular separation between objects in arcseconds. 
        
or
<li><b>TABLE (id1, id2, sep, ra1, dec1, ra2, dec2)</b> when @radec_in_output=1, where ra1, dec1, ra2 and dec2 (all float) are the coordinates of the objects in @table1 and @table2, respectively.
</ul>    
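
When calls are assembled from Python, a small helper can keep the parameter names consistent with the list above. The sketch below is a hypothetical convenience function (`build_xmatch_sql` is not part of `SQLxMatch` or of the SciServer package); it only composes the `EXECUTE` statement as a string and rejects names that are not in the documented parameter list.

```python
# Parameter names taken from the list above.
XMATCH_PARAMS = (
    "table1", "table2", "radius",
    "id_col1", "id_col2", "ra_col1", "ra_col2", "dec_col1", "dec_col2",
    "max_catalog_rows1", "max_catalog_rows2", "output_table",
    "only_nearest", "sort_by_separation", "radec_in_output", "print_messages",
)

def build_xmatch_sql(table1, table2, radius=10, **options):
    """Compose an EXECUTE SQLxMatch statement (hypothetical helper)."""
    params = {"table1": table1, "table2": table2, "radius": radius, **options}
    parts = []
    for name, value in params.items():
        if name not in XMATCH_PARAMS:
            raise ValueError(f"unknown SQLxMatch parameter: {name}")
        if value is None:
            continue  # leave the procedure's default in place
        if isinstance(value, bool):
            literal = str(int(value))          # bit parameters: 0 or 1
        elif isinstance(value, (int, float)):
            literal = repr(value)
        else:
            literal = "'" + str(value).replace("'", "''") + "'"  # quoted name
        parts.append(f"@{name}={literal}")
    return "EXECUTE SQLxMatch " + ", ".join(parts)

print(build_xmatch_sql("FUSE_PhotoObjAll", "FIRST_PhotoObjAll", radius=30, only_nearest=True))
```

The resulting string can be sent to CasJobs in the same way as the hand-written queries in the demo below.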

### **Catalog Metadata**

We have created views of the tables under the `TAP_SCHEMA` schema, which make it easy to retrieve metadata about the astronomical catalogs, including table and column descriptions:


        SELECT * FROM Catalogs  

        SELECT * FROM Tables

        SELECT * FROM Columns



---
# Demo

In order to communicate and send queries to CasJobs, we need to import the CasJobs module from the [SciServer python](https://github.com/sciserver/SciScript-python) package. You will need to install this package if you are not running this notebook in [SciServer-Compute](https://apps.sciserver.org/compute).

In [1]:
from SciServer import CasJobs as cj
import pandas as pd
pd.set_option('display.max_rows', 20)

### Listing all catalogs


Here we run a SQL query containing the `Catalogs` view that lists the names of all available catalogs.<br>

In [2]:
sql = "SELECT * FROM Catalogs ORDER BY catalog_name"
cj.executeQuery(sql, context='xmatch')

Unnamed: 0,catalog_name,summary,remarks,url
0,ACVS,ASAS Catalog of Variable Stars,\r\n The ASAS-3 Catalog of Variable Stars...,http://www.astrouw.edu.pl/asas/?page=main
1,AGC,Arecibo Galaxy Catalog,"\r\n The AGC, or Arecibo General Catalog,...",http://caborojo.astro.cornell.edu/alfalfalog/i...
2,AKARI,AKARI Point Source Catalogues,\r\n AKARI (Previously known as ASTRO-F o...,http://www.ir.isas.jaxa.jp/AKARI/
3,CHANDRA,"The Chandra Source Catalog, Release 1.1",\r\n The first official release of the CS...,http://cxc.cfa.harvard.edu/csc/
4,CNOC2,The Canadian Network for Observational Cosmol...,\r\n The Canadian Network for Observation...,http://www.astro.utoronto.ca/~cnoc/cnoc2.html
...,...,...,...,...
47,VVDS,The VIMOS VLT Deep Survey,\r\n A total of 11 564 objects have been ...,http://cesam.lam.fr/vvdsproject/index.html
48,WiggleZ,WiggleZ Dark Energy Survey Data Release 1,\r\n The WiggleZ Dark Energy Survey is a ...,http://wigglez.swin.edu.au/site/index.html
49,WISE,\tThe WISE All-Sky data Release,\n NASA's Wide-field Infrared Survey Expl...,http://wise2.ipac.caltech.edu/docs/release/all...
50,WMAP,\tNine-year WMAP point source catalogs,\r\n Nine-year WMAP point source catalogs,https://heasarc.gsfc.nasa.gov/W3Browse/radio-c...


### Listing tables in a catalog


The `Tables` view contains the tables available in all catalogs. We can use the `WHERE` clause to filter rows on the `catalog_name` column.

In [3]:
sql = "SELECT * FROM Tables WHERE catalog_name = 'SPITZER'"
cj.executeQuery(sql, context='xmatch')

Unnamed: 0,catalog_name,table_name,table_type,description,schema_name
0,SPITZER,SPITZER_goodsnIRS16micron,view,GOODS-N IRS 16 micron Photometry Catalog,dbo
1,SPITZER,SPITZER_goodsnMIPS24micron,view,GOODS-N MIPS 24 micron Photometry Catalog,dbo
2,SPITZER,SPITZER_goodssIRS16micron,view,GOODS-S IRS 16 micron Photometry Catalog,dbo
3,SPITZER,SPITZER_goodssMIPS24micron,view,GOODS-S MIPS 24 micron Photometry Catalog,dbo


### Listing columns in a table

Given a table name, we can run this SQL query against the `Columns` view in order to get a description of all its columns.<br>
This helps to identify the RA and Dec columns, as well as other columns on which we could impose filters.

In [4]:
sql = "SELECT * FROM Columns WHERE catalog_name = 'SPITZER' and table_name = 'SPITZER_goodsnIRS16micron' ORDER BY column_index"
cj.executeQuery(sql, context='xmatch')

Unnamed: 0,catalog_name,table_name,column_name,description,unit,ucd,utype,datatype,size,precision,scale,column_index,schema_name
0,SPITZER,SPITZER_goodsnIRS16micron,cx,Cartesian X (J2000),,pos.eq.x;pos.frame=j2000,,float,8,53,0,1,dbo
1,SPITZER,SPITZER_goodsnIRS16micron,cy,Cartesian Y (J2000),,pos.eq.y;pos.frame=j2000,,float,8,53,0,2,dbo
2,SPITZER,SPITZER_goodsnIRS16micron,cz,Cartesian Z (J2000),,pos.eq.z;pos.frame=j2000,,float,8,53,0,3,dbo
3,SPITZER,SPITZER_goodsnIRS16micron,htmid,HTM ID (J2000),,pos.eq.HTM; pos.frame=j2000,,bigint,8,19,0,4,dbo
4,SPITZER,SPITZER_goodsnIRS16micron,zoneid,Zone ID (J2000),,pos.eq.zone;pos.frame=j2000,,int,4,10,0,5,dbo
...,...,...,...,...,...,...,...,...,...,...,...,...,...
37,SPITZER,SPITZER_goodsnIRS16micron,ebmag,HST B magnitude uncertainty,mag,stat.error;phot.mag;em.opt.B,,real,4,24,0,38,dbo
38,SPITZER,SPITZER_goodsnIRS16micron,evmag,HST V magnitude uncertainty,mag,stat.error;phot.mag;em.opt.V,,real,4,24,0,39,dbo
39,SPITZER,SPITZER_goodsnIRS16micron,eimag,HST I magnitude uncertainty,mag,stat.error;phot.mag;em.opt.I,,real,4,24,0,40,dbo
40,SPITZER,SPITZER_goodsnIRS16micron,ezmag,HST z magnitude uncertainty,mag,stat.error;phot.mag;em.opt.SDDS.z,,real,4,24,0,41,dbo
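
The `ucd` column can also be used to locate coordinate columns programmatically. The sketch below is a hypothetical helper (`find_column_by_ucd`) that scans rows shaped like the output above. The `cx` row is copied from that output, while the `ra` and `dec` rows are assumed examples using the standard UCD words `pos.eq.ra` and `pos.eq.dec`, since those rows are elided in the listing.

```python
def find_column_by_ucd(columns, ucd_word):
    """Return the first column_name whose UCD contains ucd_word, else None."""
    for col in columns:
        words = [w.strip() for w in (col.get("ucd") or "").split(";")]
        if ucd_word in words:
            return col["column_name"]
    return None

rows = [
    {"column_name": "cx", "ucd": "pos.eq.x;pos.frame=j2000"},      # from the output above
    {"column_name": "ra", "ucd": "pos.eq.ra;pos.frame=j2000"},     # assumed row
    {"column_name": "dec", "ucd": "pos.eq.dec;pos.frame=j2000"},   # assumed row
]

ra_col = find_column_by_ucd(rows, "pos.eq.ra")
dec_col = find_column_by_ucd(rows, "pos.eq.dec")
print(ra_col, dec_col)
```

With a Pandas DataFrame `df` returned by the query, `df.to_dict("records")` yields rows in this shape. The names found this way could then be passed as `@ra_col1`, `@dec_col1`, and so on.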


---
## Cross-Match Examples

In [5]:
# Setting up search parameters. Search radius must be in arcseconds, the rest in degrees.

radius = 30
ra1 = 160
ra2 = 160.5
dec1 = 25
dec2 = 25.5

### Cross-matching two small catalogs - Pandas DataFrame output

Here we cross-match 2 local tables (the `PhotoObjAll` tables in the `FUSE` and `FIRST` catalogs). 

We can retrieve the cross-match output table as a Pandas dataframe by using the synchronous `executeQuery` function in CasJobs, as this cross-match takes less time than its `1 minute timeout`. 

In [6]:
%%time

sql = f"""
EXECUTE SQLxMatch @table1='FUSE_PhotoObjAll', @table2='FIRST_PhotoObjAll', @radius={radius}

-- Note that we could be more explicit by specifying the column names, but that was not needed:
--EXECUTE SQLxMatch @table1='FUSE_PhotoObjAll', @id_col1='objid', @ra_col1='ra', @dec_col1='dec',  @table2='FIRST_PhotoObjAll', @radius={radius}
"""

df = cj.executeQuery(sql, context="xmatch")
df

CPU times: user 33.3 ms, sys: 1.13 ms, total: 34.5 ms
Wall time: 4.34 s


Unnamed: 0,id1,id2,sep
0,88,643407,0.665150
1,88,643462,22.824550
2,88,643365,27.299218
3,508,26286,1.397182
4,762,804039,0.538000
...,...,...,...
79,739,53917,0.356798
80,740,53917,0.356798
81,741,53917,0.356798
82,770,592139,0.194586
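
In the output above, the object with `id1 = 88` has three matches within the search radius. Setting `@only_nearest=1` would return only the closest one on the server side. The sketch below shows the same reduction on the client side, assuming that "nearest" means one row per `id1` with the smallest `sep`; the helper name `keep_nearest` is hypothetical, and the rows are the first four from the output above.

```python
def keep_nearest(matches):
    """matches: iterable of (id1, id2, sep). Keep the smallest sep per id1."""
    best = {}
    for id1, id2, sep in matches:
        if id1 not in best or sep < best[id1][2]:
            best[id1] = (id1, id2, sep)
    # Sorted by id1 for a stable, readable result.
    return sorted(best.values(), key=lambda row: row[0])

rows = [
    (88, 643407, 0.665150),
    (88, 643462, 22.824550),
    (88, 643365, 27.299218),
    (508, 26286, 1.397182),
]
print(keep_nearest(rows))
```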


### Cross-matching two filtered catalogs - Pandas DataFrame output

Here we use 2 local tables (the `PhotoObjAll` tables in the `SDSSDR17` and `GALEXGR6` catalogs), which we filter by `ra` and `dec` in rectangular regions and store in local temporary tables. These temporary tables are then passed as input to the `SQLxMatch` procedure.

Note that we can also store the (`cx`, `cy`, `cz`) columns in the temporary tables, so that the code does not have to compute them internally on the fly later.

In [7]:
%%time

sql = f"""

SELECT objid, ra, dec, cx, cy, cz INTO #temp1
FROM SDSSDR17_PhotoObjAll WHERE ra BETWEEN {ra1} AND {ra2} AND dec BETWEEN {dec1} AND {dec2}

SELECT objid, ra, dec, cx, cy, cz INTO #temp2 
FROM GALEXGR6_PhotoObjAll WHERE ra BETWEEN {ra1} AND {ra2} AND dec BETWEEN {dec1} AND {dec2}

EXECUTE SQLxMatch @table1='#temp1', @table2='#temp2', @radius={radius}
"""

df = cj.executeQuery(sql, context="xmatch")
df

CPU times: user 91.8 ms, sys: 30 ms, total: 122 ms
Wall time: 4.89 s


Unnamed: 0,id1,id2,sep
0,1237667430101877625,6387874658268480342,24.713197
1,1237667430101877625,6387874658268480335,24.360351
2,1237667430101877799,6387874658268480335,21.183498
3,1237667430101877631,6387874658268480335,23.689764
4,1237667430101877644,6387874658268480316,28.674751
...,...,...,...
14477,1237667323247329512,6387874658268481830,27.112988
14478,1237667323247329953,6387874658268481830,13.102262
14479,1237667323247263986,6387874658268481805,16.845794
14480,1237667430638878978,6387874658268481805,16.813032


### Cross-Match between 2 catalogs as an asynchronous job - Output to MyDB

When the cross-match is expected to take longer than the 1 minute timeout, it is advisable to run it as an asynchronous job and store the cross-match results in a table in `MyDB`.

First, one must create the output table in MyDB, dropping it beforehand if it was already created by a previous run of this demo.


In [8]:
mydb_output_table_name = "xmatch_table"

sql = f"""
IF EXISTS (select * from sys.objects WHERE object_id = OBJECT_ID(N'{mydb_output_table_name}') AND TYPE = 'U')
DROP TABLE {mydb_output_table_name}  
CREATE TABLE {mydb_output_table_name}(id1 bigint, id2 bigint, sep float, ra1 float, dec1 float, ra2 float, dec2 float)
"""
df = cj.executeQuery(sql, context="mydb")

<br>
The result of the `SQLxMatch` procedure can be stored in a local temporary table specified by the `@output_table` input parameter. This output table can then be used to fill the table in MyDB.
<br>

In [9]:
%%time

sql = f"""
SELECT objid, ra, dec INTO #temp1
FROM SDSSDR17_PhotoObjAll WHERE ra BETWEEN {ra1} AND {ra2} AND dec BETWEEN {dec1} AND {dec2}

SELECT objid, ra, dec into #temp2 
FROM GALEXGR6_PhotoObjAll WHERE ra BETWEEN {ra1} AND {ra2} AND dec BETWEEN {dec1} AND {dec2}

-- Creating temporary output table
CREATE TABLE #out(id1 bigint, id2 bigint, sep float, ra1 float, dec1 float, ra2 float, dec2 float)

-- Executing cross-match that fills output table.
EXECUTE SQLxMatch @table1='#temp1', @table2='#temp2', @radius={radius}, @output_table='#out', @radec_in_output=1

-- Filling up table in MyDB:
INSERT INTO mydb.{mydb_output_table_name}
SELECT * from #out
"""

job_id = cj.submitJob(sql, context="xmatch")

# this line will make the code wait until the job is done, if desired:
job_description = cj.waitForJob(job_id)

CPU times: user 105 ms, sys: 12.6 ms, total: 118 ms
Wall time: 11.7 s


<br>
We can now inspect the contents of the MyDB output table:

In [10]:
sql = f"""
SELECT * FROM mydb.{mydb_output_table_name}
"""
df = cj.executeQuery(sql, context="mydb")
df

Unnamed: 0,id1,id2,sep,ra1,dec1,ra2,dec2
0,1237667323247329356,6387874658268481581,160.449557,25.471017,160.449766,25.470882,0.834876
1,1237667323247329357,6387874658268481581,160.449557,25.471017,160.449766,25.470882,0.834862
2,1237667323247394821,6387874658268481581,160.455875,25.469008,160.449766,25.470882,20.968780
3,1237667323247395143,6387874658268481676,160.486460,25.469006,160.489078,25.469168,8.527093
4,1237667323247395142,6387874658268481676,160.486460,25.469006,160.489078,25.469168,8.527093
...,...,...,...,...,...,...,...
14477,1237667323247329358,6387874658268481581,160.446842,25.468655,160.449766,25.470882,12.431651
14478,1237667323247394817,6387874658268481581,160.449547,25.471003,160.449766,25.470882,0.833813
14479,1237667323247329355,6387874658268481581,160.449550,25.471020,160.449766,25.470882,0.859314
14480,1237667323247394818,6387874658268481581,160.449555,25.470999,160.449766,25.470882,0.804681
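
In the rows printed above, the last column holds values that look like separations in arcseconds, while the column labelled `sep` holds values near 160, so the column labels may not line up with the data. A quick way to check is to recompute the separation from the coordinates. The sketch below does this for rows 0 and 3, under the assumption that the four columns after `id2` are `ra1`, `dec1`, `ra2`, `dec2` in that order; the function name is hypothetical.

```python
import math

def sep_arcsec(ra1, dec1, ra2, dec2):
    """Haversine great-circle separation in arcseconds; inputs in degrees."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600.0

# Rows 0 and 3 of the output above, read positionally as
# (ra1, dec1, ra2, dec2, last_column); this reading is an assumption.
rows = [
    (160.449557, 25.471017, 160.449766, 25.470882, 0.834876),
    (160.486460, 25.469006, 160.489078, 25.469168, 8.527093),
]
checks = [(sep_arcsec(*row[:4]), row[4]) for row in rows]
for recomputed, reported in checks:
    print(f"recomputed {recomputed:.3f} arcsec, last column {reported:.3f} arcsec")
```

If the recomputed values agree with the last column to within the rounding of the printed coordinates, listing the columns explicitly in the `INSERT INTO ... SELECT` statement (instead of `SELECT *`) would be one way to keep labels and data aligned.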
