This notebook demonstrates how to match catalogs with this package.

## Data I/O
In this package, data are handled by the `astrotable.table.Data` class. To get started, import the `astrotable.table.Data` class:

In [1]:
from astrotable.table import Data

Initialize a `Data` object simply by passing the path to the data file (note that the files in the `./samples/` directory are randomly generated datasets):

In [2]:
cat1 = Data('samples/catalog1.csv', name='cat1')

It is highly recommended to pass a `name` keyword argument, as this name is used to distinguish different datasets. If `Data` is initialized with a path to a file and no `name` is given, the name is automatically set to the file name.

Inside the class, the data are stored in an `astropy.table.Table` object, which can be accessed through the `t` attribute (here, `cat1.t`). You can therefore do anything with it that you can do with an `astropy.table.Table` object.

In [3]:
print("Name:", cat1.name)
cat1.t

Name: cat1


survey1_id,RA,Dec,A,B
int32,float64,float64,int32,int32
0,213.97270432253802,-64.56756226110878,6308,233
1,201.52301875416597,-9.28690472041771,8670,329
2,269.99728512493795,-58.493484032138056,1803,21
3,9.154948591816833,-48.53721306047669,9376,332
4,90.38158728957859,-60.65028754242699,5231,459
5,94.5738843176371,-0.300580707865322,6119,353
6,348.60867552294656,-83.26014129433523,337,332
7,132.1381138144912,-64.92924394039969,4552,143
8,187.4328313492788,-44.073367589742766,72,156
9,104.02947521856244,-31.695448201512292,6628,220


You may also add keyword arguments to be passed to `astropy.table.Table.read()`:

In [4]:
cat3id = Data('samples/catalog3_id.txt', name='cat3id',
              names=['cat3ID', 'survey2_id'], format='ascii')
cat3v = Data('samples/catalog3_measurement.txt', name='cat3v',
             names=['survey2_id', 'x', 'y', 'class1'], format='ascii')

A `Data` object can also be created with an `astropy.table.Table` object, or anything that can be converted to an `astropy.table.Table` object. 

In [5]:
from astropy.table import Table
cat2_table = Table.read('samples/catalog2.fits')
cat2 = Data(cat2_table, name='cat2')
cat4_dict = dict(Table.read('samples/catalog4.hdf5'))
# print(cat4_dict)
cat4 = Data(cat4_dict, name='cat4')


## Matching
In this package, catalog B is said to be *matched to* A, if each record (row) in A is assigned two values:
- Whether it can be matched to a record in catalog B;
- The index of the best-matching record in catalog B (if no match is possible, the index may be any number and carries no meaning).

A is referred to as the *base data* of the match.
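
To make the definition concrete, here is a minimal pure-Python sketch of what an exact-value match produces. The helper `exact_match` is hypothetical and is not part of the package; it only mirrors the two per-record values described above, and it assumes that the first matching record is chosen when several records share a value.

```python
def exact_match(base_keys, other_keys):
    """Match `other_keys` to `base_keys` (the base data) by exact value.

    Returns (idx, matched), one entry per base record.
    Hypothetical helper for illustration; not the package's implementation.
    """
    # Remember the first position at which each key appears in the other catalog.
    first_pos = {}
    for pos, key in enumerate(other_keys):
        first_pos.setdefault(key, pos)

    idx, matched = [], []
    for key in base_keys:
        found = key in first_pos
        matched.append(found)
        # For unmatched records the index is a meaningless placeholder (0 here).
        idx.append(first_pos[key] if found else 0)
    return idx, matched


idx, matched = exact_match(base_keys=[10, 11, 12], other_keys=[12, 10, 12])
print(idx)      # [1, 0, 0]
print(matched)  # [True, False, True]
```

The second base record has no counterpart, so its index (0) must be ignored; only `matched` says whether an index is meaningful.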

### Matching with a built-in matcher
To match `cat4` to `cat1` with the exact value of the `'survey1_id'` field in `cat1` and the `'survey1_id'` field in `cat4`, use an `ExactMatcher`:

In [6]:
from astrotable.matcher import ExactMatcher
cat1.match(cat4, ExactMatcher('survey1_id', 'survey1_id'))

"cat4" matched to "cat1": 51/100 matched.


<astrotable.table.Data at 0x1960807a680>

Since there is more than one record with the same `'survey1_id'` in `cat4`, matching `cat1` to `cat4` is not equivalent to matching `cat4` to `cat1`:

In [7]:
cat4.match(cat1, ExactMatcher('survey1_id', 'survey1_id'))

"cat1" matched to "cat4": 70/70 matched.


<astrotable.table.Data at 0x1960807aec0>

Instead of a column name, you may pass any iterable (e.g. an array) of values to match on, provided that it has the same length (i.e. number of records) as the corresponding catalog.

In [8]:
print('len(cat3v) =', len(cat3v))
print('len(cat3id) =', len(cat3id))
cat2.match(cat3v, ExactMatcher('survey2_id', cat3id.t['survey2_id']))

len(cat3v) = 110
len(cat3id) = 110
"cat3v" matched to "cat2": 110/150 matched.


<astrotable.table.Data at 0x1960807aef0>

You can also match data by their coordinates:

In [9]:
from astrotable.matcher import SkyMatcher
import astropy.units as u
cat1.match(cat2, SkyMatcher(unit=u.deg, unit1=(u.h, u.deg)))  # RA for cat1 is in degrees; RA for cat2 is in hours.

Data cat1: found RA name 'RA' and Dec name 'Dec'.
Data cat2: found RA name 'RA' and Dec name 'Dec'.
"cat2" matched to "cat1": 90/100 matched.


<astrotable.table.Data at 0x1960807a680>

For more information on `SkyMatcher`, use `help(SkyMatcher)`.
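
For intuition about what a coordinate match involves, below is a small pure-Python sketch. It converts right ascension from hours to degrees (1 h = 15 deg), computes the angular separation between positions, and accepts the nearest neighbour if it lies within a tolerance. The function names and the 1-arcsecond tolerance are assumptions made for illustration; this is not how `SkyMatcher` is implemented.

```python
import math

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Angular separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def nearest_sky_match(base, other, tol_arcsec=1.0):
    """base/other: lists of (ra_deg, dec_deg). Returns (idx, matched)."""
    idx, matched = [], []
    for ra, dec in base:
        seps = [angular_sep_deg(ra, dec, ra2, dec2) for ra2, dec2 in other]
        best = min(range(len(seps)), key=seps.__getitem__)
        idx.append(best)
        # Accept the nearest neighbour only if it is within the tolerance.
        matched.append(seps[best] * 3600 <= tol_arcsec)
    return idx, matched

# Base positions in degrees; the other catalog stores RA in hours.
base = [(213.97270432253802, -64.56756226110878), (10.0, 20.0)]
other_hours = [(12.495522089951923, -44.073367589742766),
               (14.264846954835868, -64.56756226110878)]
other = [(ra_h * 15.0, dec) for ra_h, dec in other_hours]  # hours -> degrees

idx, matched = nearest_sky_match(base, other)
print(idx, matched)
```

The first base position coincides with the second record of the other catalog once the units agree, while the second base position has no neighbour within the tolerance. Forgetting the unit conversion would make every separation large and every match fail.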

To remove all matches to `cat1`, use:

In [10]:
# cat1.reset_match()

### Defining custom matchers
You may also define your own matchers. A matcher class should be defined like this:

In [11]:
class MyMatcher():
    def __init__(self, args): # 'args' means any number of arguments that you need
        # initialize it with args you need
        pass
    
    def get_values(self, data, data1, verbose=True): # data1 is matched to data
        # prepare the data that is needed to do the matching (if necessary)
        pass
    
    def match(self):
        # do the matching process and calculate:
        # idx : array of shape (len(data), ). 
        #     the index of a record in data1 that best matches the records in data
        # matched : boolean array of shape (len(data), ).
        #     whether the records in data can be matched to those in data1.
        return idx, matched
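
As a hedged illustration of the template, the sketch below fills in the three methods for a matcher that pairs records whose numeric values differ by at most a tolerance. The class name `ToleranceMatcher`, its arguments, and the stand-in objects used to exercise it are all assumptions; it has not been run against the package itself, and it only relies on the table being reachable through `.t` as described earlier.

```python
from types import SimpleNamespace

class ToleranceMatcher:
    """Match records whose numeric values differ by at most `tol`.

    Follows the three-method template; names and arguments are illustrative.
    """

    def __init__(self, col, col1, tol):
        self.col, self.col1, self.tol = col, col1, tol

    def get_values(self, data, data1, verbose=True):  # data1 is matched to data
        # Assumes the underlying table is reachable as `.t`.
        self.values = [float(v) for v in data.t[self.col]]
        self.values1 = [float(v) for v in data1.t[self.col1]]
        if verbose:
            print(f"Comparing {len(self.values)} to {len(self.values1)} records.")

    def match(self):
        idx, matched = [], []
        for v in self.values:
            diffs = [abs(v - v1) for v1 in self.values1]
            best = min(range(len(diffs)), key=diffs.__getitem__)
            idx.append(best)
            matched.append(diffs[best] <= self.tol)
        # Plain lists are returned here; the template asks for arrays,
        # so wrapping them with numpy.asarray may be needed in practice.
        return idx, matched


# Stand-in objects exposing a `.t` mapping, used only to exercise the class.
data = SimpleNamespace(t={"flux": [1.0, 4.0, 9.0]})
data1 = SimpleNamespace(t={"f": [9.1, 0.9, 20.0]})

m = ToleranceMatcher("flux", "f", tol=0.2)
m.get_values(data, data1, verbose=False)
print(m.match())  # ([1, 1, 0], [True, False, True])
```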

## Merging catalogs

### Match tree
If B is matched to A, I call B the *child data* of A, and A the *parent data* of B.

Say B and C are matched to A, and D is matched to B. Then B and C are children of A, and D is a child of B. When we try to merge everything into A (i.e. merge the information in A's children, grandchildren, etc. into A), it may be useful to see all of its children and grandchildren, or what I call the *match tree*:

In [12]:
cat1.match_tree(detail=False)

Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
cat1
:   cat4
:   :   (cat1)
:   cat2
:   :   cat3v
---------------


From the *match tree* we can see that `cat4` and `cat2` are matched to `cat1`, and `cat3v` is matched to `cat2`. Although `cat1` is also matched to `cat4`, this match is a duplicate in this match tree and will be ignored when merging everything (`cat4`, `cat2` and `cat3v`) into `cat1`.
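
One way to picture how such a tree can be built is a depth-first walk that remembers which catalogs it has already visited. The sketch below is an assumption about the general idea, not the package's code: `match_tree_lines` and the `graph` dictionary (each catalog mapped to the catalogs matched to it) are hypothetical, as is the optional `depth` limit.

```python
def match_tree_lines(graph, base, depth=None):
    """Return the lines of a match tree rooted at `base`.

    graph: dict mapping a catalog name to the names matched to it.
    Already-visited names are shown in parentheses and not expanded.
    """
    lines = [base]
    seen = {base}

    def walk(node, level):
        if depth is not None and level > depth:
            return
        for child in graph.get(node, []):
            prefix = ":   " * level
            if child in seen:
                lines.append(f"{prefix}({child})")  # duplicate: do not expand
            else:
                seen.add(child)
                lines.append(f"{prefix}{child}")
                walk(child, level + 1)

    walk(base, 1)
    return lines


graph = {"cat1": ["cat4", "cat2"], "cat4": ["cat1"],
         "cat2": ["cat3v"], "cat3v": []}
print("\n".join(match_tree_lines(graph, "cat1")))
```

Tracking visited names is what prevents the mutual match between `cat1` and `cat4` from producing an endless loop.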

For more information on how they are matched:

In [13]:
cat1.match_tree(detail=True)

Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
cat1 [base]
:   cat4 [ExactMatcher('survey1_id', 'survey1_id')]
:   :   (cat1) [ExactMatcher('survey1_id', 'survey1_id')]
:   cat2 [SkyMatcher with thres=1]
:   :   cat3v [ExactMatcher('survey2_id', 'survey2_id')]
---------------


For example, we may also use `cat4` as the base catalog:

In [14]:
cat4.match_tree(detail=False)

Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
cat4
:   cat1
:   :   (cat4)
:   :   cat2
:   :   :   cat3v
---------------


### Catalog merging
Now we can merge everything that can be merged into `cat1`:

In [15]:
merged_cat = cat1.merge(outname='my_merged_catalog')
print("Name:", merged_cat.name)
merged_cat.t

merged: cat1, cat4, cat2, cat3v
Name: my_merged_catalog


survey1_id_cat1,RA_cat1,Dec_cat1,A,B,id,survey1_id_cat4,i,j,survey2_id_cat2,RA_cat2,Dec_cat2,a,b,survey2_id_cat3v,x,y,class1
int32,float64,float64,int32,int32,int32,int32,int32,int32,int32,float64,float64,int32,int32,int32,int32,int32,str8
0,213.97270432253802,-64.56756226110878,6308,233,36,0,8985,5407,40,14.264846954835868,-64.56756226110878,5450,4,40,347,624,Type III
8,187.4328313492788,-44.073367589742766,72,156,50,8,1176,7390,71,12.495522089951923,-44.073367589742766,3791,92,71,200,784,Type III
12,274.9755359854324,-46.763915275241764,3101,474,23,12,5966,1823,11,18.331702399028824,-46.763915275241764,334,94,11,439,351,Type I
13,244.48845400028554,-76.08437131529189,679,350,68,13,608,3050,33,16.299230266685704,-76.08437131529189,4596,86,33,413,223,Type I
14,254.1930792486915,-56.51341037274159,1178,272,52,14,5087,134,18,16.9462052832461,-56.51341037274159,1657,51,18,521,565,Type III
15,300.94573020305,-33.538587039094594,3725,363,42,15,563,3221,92,20.063048680203334,-33.538587039094594,4326,57,92,22,664,Type I
18,102.5887558206159,-46.08805584014972,4319,42,38,18,4778,6342,8,6.8392503880410604,-46.08805584014972,4006,56,8,497,326,Type III
20,206.8220972937418,-5.662787214634307,1814,329,14,20,1219,2065,87,13.788139819582787,-5.662787214634307,779,31,87,293,499,Type III
29,333.1315228870165,-83.66608808539158,9519,243,0,29,5262,1187,32,22.20876819246777,-83.66608808539158,5141,58,32,493,284,Type I
30,60.52133494990202,-43.53139963697669,2221,37,27,30,5544,2099,78,4.034755663326801,-43.53139963697669,2113,4,78,379,220,Type I


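Note that columns sharing the same name are renamed with the `name` of their `Data` object appended as a suffix. You may also check that the match is indeed correct: in the rows shown, for example, `survey1_id_cat1` equals `survey1_id_cat4`, and `RA_cat2` (in hours) multiplied by 15 gives `RA_cat1` (in degrees).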
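
The renaming and row selection can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: tables are dicts of equal-length lists, `merge_matched` is a hypothetical helper, and the `keep_unmatched` flag with `None` as the missing value is one plausible way to retain unmatched rows rather than the package's actual mechanism.

```python
def merge_matched(base, other, idx, matched, base_name, other_name,
                  keep_unmatched=False):
    """Merge `other` into `base` (tables as dicts of equal-length lists)."""
    shared = set(base) & set(other)  # column names present in both tables
    keep = [i for i in range(len(matched)) if matched[i] or keep_unmatched]

    merged = {}
    for col, values in base.items():
        out_col = f"{col}_{base_name}" if col in shared else col
        merged[out_col] = [values[i] for i in keep]
    for col, values in other.items():
        out_col = f"{col}_{other_name}" if col in shared else col
        # Unmatched rows (only present if kept) get None as a missing value.
        merged[out_col] = [values[idx[i]] if matched[i] else None for i in keep]
    return merged


cat_a = {"id": [0, 1, 2], "RA": [10.0, 20.0, 30.0]}
cat_b = {"id": [2, 0], "mag": [17.5, 18.2]}
idx, matched = [1, 0, 0], [True, False, True]  # cat_b matched to cat_a by id

print(merge_matched(cat_a, cat_b, idx, matched, "cat_a", "cat_b"))
# {'id_cat_a': [0, 2], 'RA': [10.0, 30.0], 'id_cat_b': [0, 2], 'mag': [18.2, 17.5]}
```

Only the shared column `id` receives suffixes, and the base record without a match is dropped unless `keep_unmatched=True` is passed.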

### Merging options

You may want to keep records that cannot be matched to `cat3v`, or merge only a subset of the columns from each catalog:

In [16]:
merge_columns = { # specify columns to be merged
    'cat1': ['survey1_id', 'RA', 'Dec'],
    'cat4': ['i', 'j'],
    'cat2': ['survey2_id'],
    'cat3v': ['class1'],
    }

keep_unmatched = ['cat3v'] # keep records that cannot be matched to cat3v

another_merged_cat = cat1.merge(keep_unmatched=keep_unmatched, 
                                merge_columns=merge_columns) # use default outname
print("Name:", another_merged_cat.name)
another_merged_cat.t

entries with no match for cat3v is kept.
merged: cat1, cat4, cat2, cat3v
Name: match_cat1_cat4_cat2_cat3v


survey1_id,RA,Dec,i,j,survey2_id,class1
int32,float64,float64,int32,int32,int32,str8
0,213.97270432253802,-64.56756226110878,8985,5407,40,Type III
2,269.99728512493795,-58.493484032138056,1617,6439,73,--
4,90.38158728957859,-60.65028754242699,2177,6878,141,--
5,94.5738843176371,-0.300580707865322,8650,1254,89,--
6,348.60867552294656,-83.26014129433523,5229,4704,68,--
8,187.4328313492788,-44.073367589742766,1176,7390,71,Type III
11,170.11833223185067,-37.359813181905174,2698,7174,139,--
12,274.9755359854324,-46.763915275241764,5966,1823,11,Type I
13,244.48845400028554,-76.08437131529189,608,3050,33,Type I
14,254.1930792486915,-56.51341037274159,5087,134,18,Type III


You may also set the depth for `match_tree` and `merge` methods. Setting depth to 0 means only keeping the base catalog itself.

In [17]:
cat4.match_tree(depth=2)

Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
cat4 [base]
:   cat1 [ExactMatcher('survey1_id', 'survey1_id')]
:   :   (cat4) [ExactMatcher('survey1_id', 'survey1_id')]
:   :   cat2 [SkyMatcher with thres=1]
---------------


In [18]:
cat4.match_tree(depth=0)

Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
cat4 [base]
---------------


In [19]:
cat4.merge(depth=1).t

merged: cat4, cat1


id,survey1_id_cat4,i,j,survey1_id_cat1,RA,Dec,A,B
int32,int32,int32,int32,int32,float64,float64,int32,int32
0,29,5262,1187,29,333.1315228870165,-83.66608808539158,9519,243
1,20,7551,7445,20,206.8220972937418,-5.662787214634307,1814,329
2,30,1402,7410,30,60.52133494990202,-43.53139963697669,2221,37
3,50,8325,2188,50,49.66381683960189,-38.44786504223306,8468,36
4,57,7500,7129,57,8.044537199625843,-67.35697851611482,825,207
5,88,3776,569,88,262.79011830531977,-59.311299727603924,5764,278
6,94,6632,7582,94,38.60838741180505,-31.22680066907673,9025,244
7,62,2178,1883,62,8.674767223013312,-70.39892406293181,9249,77
8,36,6705,7392,36,265.00405198543075,-29.62284703390563,8529,23
9,74,8410,6314,74,172.10969703816716,-78.65452871433797,4294,463


## Things to be noted

This package does not support matching multiple records to a single record in the base data. For example, `table2` below has two records with the same `survey_id`:

In [20]:
table1 = Data({'survey_id': [0, 1, 2], 'value': ['A', 'B', 'C']}, name='t1')
table1.t

survey_id,value
int32,str1
0,A
1,B
2,C


In [21]:
table2 = Data({'table2_id': [0, 1, 2], 'survey_id': [0, 1, 0]}, name='t2')
table2.t

table2_id,survey_id
int32,int32
0,0
1,1
2,0


If you match `table2` to `table1` by `survey_id` using `ExactMatcher`, the first exact match in `table2` will be used:

In [22]:
table1.match(table2, ExactMatcher('survey_id', 'survey_id')).merge().t

"t2" matched to "t1": 2/3 matched.
merged: t1, t2


survey_id_t1,value,table2_id,survey_id_t2
int32,str1,int32,int32
0,A,0,0
1,B,1,1


If you wish to keep the records with the same `survey_id` in `table2`, you may match `table1` to `table2` instead of matching `table2` to `table1`:

In [23]:
table2.match(table1, ExactMatcher('survey_id', 'survey_id')).merge().t

"t1" matched to "t2": 3/3 matched.
merged: t2, t1


table2_id,survey_id_t2,survey_id_t1,value
int32,int32,int32,str1
0,0,0,A
1,1,1,B
2,0,0,A


Alternatively, you may merge these records (those with the same `survey_id`) before matching and merging the catalogs.
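
A minimal sketch of that pre-merging step is given below, using plain dicts of lists rather than `astropy` tables. The helper `collapse_duplicates` is hypothetical, and collecting the duplicated values into lists is just one choice; taking the first value or an average may suit your data better.

```python
def collapse_duplicates(table, key):
    """Group rows sharing the same `key` value; other columns become lists."""
    order = []   # key values in order of first appearance
    groups = {}  # key value -> {column: [values...]}
    other_cols = [c for c in table if c != key]
    for row, k in enumerate(table[key]):
        if k not in groups:
            order.append(k)
            groups[k] = {c: [] for c in other_cols}
        for c in other_cols:
            groups[k][c].append(table[c][row])
    collapsed = {key: order}
    for c in other_cols:
        collapsed[c] = [groups[k][c] for k in order]
    return collapsed


table2 = {"table2_id": [0, 1, 2], "survey_id": [0, 1, 0]}
print(collapse_duplicates(table2, "survey_id"))
# {'survey_id': [0, 1], 'table2_id': [[0, 2], [1]]}
```

After collapsing, each `survey_id` occurs once, so matching the collapsed table to the base data no longer discards any record.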