# Basic HiPSCat Functionality

In this notebook we demonstrate the analysis of a single HiPSCat catalog:
* querying based on column values
* querying with cone searches
* using the `dask.dataframe` API (a parallelized version of the pandas API) to chain methods

In [1]:
from lsd2 import hipscat as hc

#Load your catalog
path_to_gaia = "/data3/epyc/projects3/ivoa_demo/"
gaia = hc.Catalog("gaia", path_to_gaia)


In [2]:
# Load selected columns of the catalog as a (lazy) dask dataframe
gaia_df = gaia.load(columns=['ra', 'dec', 'pmra', 'pmdec'])

## Lazy Evaluation and Parallelized Python Infrastructure
Our computations rely on the `dask.dataframe` API, which, as you can see, has not evaluated anything yet. With `dask.dataframe`, operations are only scheduled as you write them; nothing is executed until you call `.compute()` at the end of the chain.

When using this API, we can define a client that specifies the number of workers and the scheduler address/port. There are two ways to do this:
* with `dask.distributed` [client](https://distributed.dask.org/en/latest/client.html#dask)
* with `dask_on_ray`'s [scheduler](https://docs.ray.io/en/latest/ray-more-libs/dask-on-ray.html#scheduler)

In our experience, Dask on Ray handles scheduling and memory management better, so we encourage others to use it.

In [3]:
from lsd2 import lsd2_client
client = lsd2_client(dask_on_ray=True)

2023-07-13 17:00:13,160	INFO worker.py:1636 -- Started a local Ray instance.


## Actually performing some computations

Now that we have our parallelized framework set up, let's analyze our dataset with the LSDB codebase:
* performing cone searches
* querying based on column values
* assigning columns
* chaining all of this functionality together

In [4]:
# Cone search around (ra, dec) = (30, 30); the _DIST values in the results suggest radius is in degrees
cone_search = gaia.cone_search(
    ra=30,
    dec=30,
    radius=10,
    columns=['ra', 'dec', 'pmra', 'pmdec']
)
cone_search

                     ra      dec     pmra    pmdec source_id    _DIST
npartitions=11                                                       
0               float64  float64  float64  float64     int64  float64
1                   ...      ...      ...      ...       ...      ...
...                 ...      ...      ...      ...       ...      ...
10                  ...      ...      ...      ...       ...      ...
10                  ...      ...      ...      ...       ...      ...

In [5]:
cone_search.compute()

_hipscat_index,ra,dec,pmra,pmdec,source_id,_DIST
613915723277795328,19.212841,26.841175,-6.350239,-10.811676,306957842311547264,9.993966
613915877896617984,19.207104,26.855160,13.004301,1.799691,306957911031105664,9.993759
613915963795963904,19.234180,26.863457,-1.596328,7.070467,306957979750501120,9.968233
613915993860734976,19.240827,26.855305,-7.413680,-11.669042,306957979751724928,9.965597
613916019630538752,19.229054,26.858292,7.009552,-1.114430,306957979750500992,9.974345
...,...,...,...,...,...,...
228205678454374400,39.698460,24.899149,-19.781346,-51.746918,114102648101977856,9.998279
228205755763785728,39.700551,24.911938,-1.194285,-3.182412,114102712525804416,9.992947
228206009166856192,39.705043,24.918058,0.741951,-1.816784,114102746885544320,9.993063
228206022051758080,39.716971,24.937145,-10.921423,-4.131019,114102849964765440,9.991868
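As a sanity check on the output above, the following sketch computes the great-circle separation between the search center and the first returned row using the haversine formula. The helper `angular_separation_deg` is a hypothetical name written for this illustration, and this is not necessarily how the library computes `_DIST`; it only shows that the reported values are consistent with an angular distance in degrees.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    return math.degrees(2 * math.asin(math.sqrt(a)))

center = (30.0, 30.0)                # ra, dec passed to cone_search
first_row = (19.212841, 26.841175)   # ra, dec of the first row in the output

sep = angular_separation_deg(*center, *first_row)
print(f"{sep:.4f}")  # compare with the _DIST value of that row (9.993966)
```

The result falls just inside the requested radius of 10, as expected for an object near the edge of the cone.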


### Now let's perform the same computation, but use the LSDB API to create a new column

We do this with the `assign` method of `dask.dataframe`.

In [6]:
import numpy as np

gaia.cone_search(
    ra=30,
    dec=30,
    radius=10,
    columns=['ra', 'dec', 'pmra', 'pmdec']
).assign(
    pm=lambda x: np.sqrt(x['pmra']**2 + x['pmdec']**2)
).compute()

_hipscat_index,ra,dec,pmra,pmdec,source_id,_DIST,pm
613915723277795328,19.212841,26.841175,-6.350239,-10.811676,306957842311547264,9.993966,12.538655
613915877896617984,19.207104,26.855160,13.004301,1.799691,306957911031105664,9.993759,13.128242
613915963795963904,19.234180,26.863457,-1.596328,7.070467,306957979750501120,9.968233,7.248432
613915993860734976,19.240827,26.855305,-7.413680,-11.669042,306957979751724928,9.965597,13.824948
613916019630538752,19.229054,26.858292,7.009552,-1.114430,306957979750500992,9.974345,7.097589
...,...,...,...,...,...,...,...
228205678454374400,39.698460,24.899149,-19.781346,-51.746918,114102648101977856,9.998279,55.398963
228205755763785728,39.700551,24.911938,-1.194285,-3.182412,114102712525804416,9.992947,3.399126
228206009166856192,39.705043,24.918058,0.741951,-1.816784,114102746885544320,9.993063,1.962446
228206022051758080,39.716971,24.937145,-10.921423,-4.131019,114102849964765440,9.991868,11.676592
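The new `pm` column is the total proper motion, $\sqrt{\mathrm{pmra}^2 + \mathrm{pmdec}^2}$. As a quick worked check, the sketch below recomputes it for two rows copied from the output above using only the standard library. Small differences in the last digit are possible because the printed inputs are rounded.

```python
import math

# (pmra, pmdec, pm) copied from the first and third rows of the output above
rows = [
    (-6.350239, -10.811676, 12.538655),
    (-1.596328, 7.070467, 7.248432),
]

recomputed = [math.hypot(pmra, pmdec) for pmra, pmdec, _ in rows]

for (pmra, pmdec, expected), value in zip(rows, recomputed):
    print(f"pmra={pmra:+.6f} pmdec={pmdec:+.6f} pm={value:.6f} (table: {expected})")
```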


### Let's go a little further and filter our data based on our calculation

We do this with the `query` method of `dask.dataframe`.

In [7]:
gaia.cone_search(
    ra=30,
    dec=30,
    radius=10,
    columns=['ra', 'dec', 'pmra', 'pmdec']
).assign(
    pm=lambda x: np.sqrt(x['pmra']**2 + x['pmdec']**2)
).query(
    'pm > 20'
).compute()

_hipscat_index,ra,dec,pmra,pmdec,source_id,_DIST,pm
613919300985552896,19.273558,26.901635,-15.098936,-19.610078,306959629017944832,9.921853,24.749405
613919773431955456,19.244651,26.910269,-18.965551,-19.419771,306959869536114304,9.942929,27.144422
613919936640712704,19.232518,26.928925,9.072718,-48.276453,306959942551261696,9.946508,49.121586
613920976022798336,19.282497,26.963268,-15.164895,-13.881558,306960488012628096,9.892768,20.558981
613921001792602112,19.283365,26.964217,19.963370,-34.377466,306960488011823488,9.891713,39.753570
...,...,...,...,...,...,...,...
594474931769573376,25.405278,27.190991,-3.370920,-34.456414,298734354329442432,4.915041,34.620912
228186827842912256,39.484437,24.687508,-7.348221,-38.560012,114093302253148928,9.953508,39.253928
228187291699380224,39.516281,24.724422,22.203642,-1.185242,114093474051365760,9.956666,22.235254
228188468520419328,39.539126,24.772903,5.102051,-23.296427,114094053871694080,9.946796,23.848573
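The same `assign` and `query` chain can be tried eagerly in plain pandas, which is handy for testing an expression on a few rows before running it on the full catalog. The sketch below uses three `pmra`/`pmdec` pairs copied from the earlier cone-search output; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Three rows copied from the cone-search output earlier in this notebook
sample = pd.DataFrame(
    {
        "pmra": [-6.350239, 13.004301, -19.781346],
        "pmdec": [-10.811676, 1.799691, -51.746918],
    }
)

fast_movers = (
    sample
    .assign(pm=lambda x: np.sqrt(x["pmra"] ** 2 + x["pmdec"] ** 2))  # add total proper motion
    .query("pm > 20")                                                # keep only fast movers
)
print(fast_movers)
```

Because pandas evaluates each step immediately, no `.compute()` is needed here; only the third row survives the `pm > 20` cut.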


### Pretty amazing results for just 4 cores and the 1.8-billion-object Gaia catalog

Now let's shut down our client.

In [8]:
client.shutdown()

