## Overview
Stellar magnitude databases show position-dependent biases relative to Gaia DR2. We first train linear models of each database's magnitudes as a function of Gaia magnitudes, and then model the cross-validated residuals of those linear models as a function of position, using a neural network. Magnitudes from AllWISE, GSC 2.3, PPMXL, 2MASS and Tycho2 are corrected accordingly. Results are made available in the output tab.

## Introduction
Databases of stellar magnitude that use different photometric filters are useful in determining the spectral characteristics of stars. If different databases have position-based biases relative to one another, then stellar characteristics – as well as error – will appear to depend spuriously on a star's position in the sky.

We'll use machine learning to explore these magnitude biases that depend on position and see if we can correct them so they are in line with Gaia DR2.

## Data
We will use a [dataset](https://www.kaggle.com/solorzano/257k-gaia-dr2-stars) containing 257K Gaia DR2 stars, with cross-identified photometry from GSC 2.3, PPMXL, 2MASS, Tycho2 and AllWISE. 

In [None]:
import pandas as pd

data = pd.read_csv('../input/257k-gaiadr2-sources-with-photometry.csv', dtype={'source_id': str})

In [None]:
len(data)

The dataset contains some duplicates and apparent Gaia systematics which we will remove, based on results from [a prior kernel](https://www.kaggle.com/solorzano/removal-of-gaia-dr2-stars-with-apparent-systematic).

In [None]:
should_remove_set = set(pd.read_csv('../input/257k-gaiadr2-should-remove.csv', dtype={'source_id': str})['source_id'])

In [None]:
data = data[~data['source_id'].isin(should_remove_set)]
data.reset_index(inplace=True, drop=True)

In [None]:
len(data)

In [None]:
assert len(data) == len(set(data['source_id']))

## Modeling helper functions
This is just boilerplate:

In [None]:
import inspect

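pd_concat_argspec = inspect.getfullargspec(pd.concat)
# Newer pandas versions declare 'sort' as keyword-only, so check both lists
pd_concat_has_sort = 'sort' in pd_concat_argspec.args or 'sort' in pd_concat_argspec.kwonlyargs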

def pd_concat(frames):
    # Due to Pandas versioning issue
    new_frame = pd.concat(frames, sort=False) if pd_concat_has_sort else pd.concat(frames)
    new_frame.reset_index(inplace=True, drop=True)
    return new_frame

The following function trains regression models and averages their responses over multiple runs of k-fold cross-validation.

In [None]:
import types
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler 

np.random.seed(201808011)

def model_results(data_frame, label_extractor, var_extractor, trainer_factory, id_column='source_id', n_splits=2, n_runs=3, scale=False, max_n_training=None):
    '''
    Returns a frame with source_id, response and residual columns, with the same ordering as data_frame.
    '''
    sum_series = pd.Series([0] * len(data_frame))
    for r in range(n_runs):
        shuffled_frame = data_frame.sample(frac=1)
        shuffled_frame.reset_index(inplace=True, drop=True)
        response_frame = pd.DataFrame(columns=[id_column, 'response'])
        kf = KFold(n_splits=n_splits)
        for train_idx, test_idx in kf.split(shuffled_frame):
            train_frame = shuffled_frame.iloc[train_idx]
            if max_n_training is not None:
                # Cap the sample size so small frames do not raise an error
                train_frame = train_frame.sample(min(max_n_training, len(train_frame)))
            test_frame = shuffled_frame.iloc[test_idx]
            train_labels = label_extractor(train_frame) if isinstance(label_extractor, types.FunctionType) else train_frame[label_extractor]
            train_vars = var_extractor(train_frame)
            test_vars = var_extractor(test_frame)
            if scale:
                scaler = StandardScaler()  
                scaler.fit(train_vars)
                train_vars = scaler.transform(train_vars)  
                test_vars = scaler.transform(test_vars) 
            trainer = trainer_factory()
            fold_model = trainer.fit(train_vars, train_labels)
            test_responses = fold_model.predict(test_vars)
            test_id = test_frame[id_column]
            assert len(test_id) == len(test_responses)
            fold_frame = pd.DataFrame({id_column: test_id, 'response': test_responses})
            response_frame = pd_concat([response_frame, fold_frame])
        response_frame.sort_values(id_column, inplace=True)
        response_frame.reset_index(inplace=True, drop=True)
        assert len(response_frame) == len(data_frame), 'len(response_frame)=%d' % len(response_frame)
        sum_series += response_frame['response']
    cv_response = sum_series / n_runs
    assert len(cv_response) == len(data_frame)
    sorted_result = pd.DataFrame({
        id_column: np.sort(data_frame[id_column].values), 
        'response': cv_response})
    data_frame_partial = pd.DataFrame({id_column: data_frame[id_column]})
    merged_frame = pd.merge(data_frame_partial, sorted_result, how='inner', on=id_column, sort=False)
    data_frame_labels = label_extractor(data_frame) if isinstance(label_extractor, types.FunctionType) else data_frame[label_extractor]
    merged_frame['residual'] = data_frame_labels - merged_frame['response']
    assert len(merged_frame) == len(data_frame)
    return merged_frame
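
The core idea of *model_results* can be seen in a stripped-down sketch on synthetic data. The helper name `averaged_oof_predictions`, the data and the parameters are illustrative assumptions, not part of this kernel. The sketch also lets `KFold` do the shuffling instead of shuffling the frame by hand. In each run, every row is predicted by a model that never trained on it, and the out-of-fold predictions are then averaged across runs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold


def averaged_oof_predictions(X, y, n_splits=2, n_runs=3, seed=0):
    """Average out-of-fold predictions over several shuffled k-fold runs (illustrative helper)."""
    total = np.zeros(len(y))
    for run in range(n_runs):
        # A different shuffle per run yields a different partition into folds.
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed + run)
        for train_idx, test_idx in kf.split(X):
            model = LinearRegression().fit(X[train_idx], y[train_idx])
            # Each row is predicted only by a model that did not train on it.
            total[test_idx] += model.predict(X[test_idx])
    return total / n_runs


# Synthetic data: y depends linearly on two variables, plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=1000)

oof = averaged_oof_predictions(X, y)
residual = y - oof
print('RMSE of out-of-fold residuals: %.4f' % np.sqrt(np.mean(residual ** 2)))
```

Because the residuals come from out-of-fold predictions, they are not flattered by overfitting, which matters when the residuals themselves become the training labels of a second model.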

We'll also define a function we can use to evaluate the results produced by the *model_results* function.

In [None]:
import math
import scipy.stats as stats

def print_evaluation(data_frame, label_column, response_frame):
    _response = response_frame['response']
    _label = label_column(data_frame) if isinstance(label_column, types.FunctionType) else data_frame[label_column]
    _error = _label - _response
    assert sum(response_frame['residual'] == _error) == len(data_frame)
    _rmse = math.sqrt(np.sum(_error ** 2) / len(data_frame))
    _correl = stats.pearsonr(_response, _label)[0]
    print('RMSE: %.4f | Correlation: %.4f' % (_rmse, _correl,), flush=True)
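
As a quick check of the two metrics, here is the same arithmetic on four made-up values (the numbers are hypothetical and chosen only to keep the calculation easy to follow by hand):

```python
import math
import numpy as np
import scipy.stats as stats

label = np.array([10.0, 11.0, 12.0, 13.0])     # hypothetical observed magnitudes
response = np.array([10.1, 10.9, 12.2, 12.8])  # hypothetical model responses

error = label - response
# Root mean squared error: square the errors, average them, take the square root.
rmse = math.sqrt(np.sum(error ** 2) / len(label))
# Pearson correlation between the responses and the labels.
correl = stats.pearsonr(response, label)[0]
print('RMSE: %.4f | Correlation: %.4f' % (rmse, correl))
```

The squared errors sum to 0.10, so the RMSE is the square root of 0.025, roughly 0.158 magnitudes.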

## Testing position bias modeling
There are 15 non-Gaia magnitude columns we will be correcting. To illustrate how it will work, we'll pick one of them and do a run-through. Change the TEST_TARGET_COLUMN variable below to test the process with a different column.

In [None]:
TEST_TARGET_COLUMN = 'gsc23_b_mag'
MAX_N_TRAINING = 50000

Each target magnitude column will be modeled as a linear function of the 3 Gaia magnitude columns returned by the following function.

In [None]:
def get_gaia_vars(data_frame):
    var_list = [data_frame['phot_g_mean_mag'], data_frame['phot_bp_mean_mag'], data_frame['phot_rp_mean_mag']]
    return np.transpose(var_list)

Once we have trained the linear model, we will also want to model its residuals. The presumption is that residuals should not depend on position. If they do, a correction is warranted.

The position variables of the residual model will be the sine and cosine of Right Ascension and Declination. Why not use Right Ascension and Declination directly? The variable space is not a rectangle: it is the surface of a sphere, and Right Ascension wraps around, so 0° and 360° denote the same meridian. Encoding each angle as a sine/cosine pair (4 variables in total) keeps stars that are neighbors on the sky close together in the variable space. It's theoretically more sound, and it does work better.

In [None]:
def get_position_vars(data_frame):
    ra_rad = np.deg2rad(data_frame['ra'])
    dec_rad = np.deg2rad(data_frame['dec'])    
    ra_sin = np.sin(ra_rad)
    ra_cos = np.cos(ra_rad)
    dec_sin = np.sin(dec_rad)
    dec_cos = np.cos(dec_rad)
    feature_list = [ra_sin, ra_cos, dec_sin, dec_cos]    
    return np.transpose(feature_list)        
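
To see what the encoding buys us, consider two hypothetical stars that sit 0.2 degrees apart on either side of the RA = 0/360 seam. The sketch below uses a standalone helper, `encode_position`, which is an assumption made for illustration; it computes the same four features as *get_position_vars* for a single position.

```python
import numpy as np


def encode_position(ra_deg, dec_deg):
    """Sine/cosine encoding of one sky position (same four features as get_position_vars)."""
    ra, dec = np.deg2rad(ra_deg), np.deg2rad(dec_deg)
    return np.array([np.sin(ra), np.cos(ra), np.sin(dec), np.cos(dec)])


# Two hypothetical stars, 0.2 degrees apart in RA, at the same declination.
star_a = (0.1, 20.0)
star_b = (359.9, 20.0)

raw_distance = np.linalg.norm(np.array(star_a) - np.array(star_b))
encoded_distance = np.linalg.norm(encode_position(*star_a) - encode_position(*star_b))

print('Distance in raw (ra, dec) space: %.3f' % raw_distance)
print('Distance in sin/cos space:       %.5f' % encoded_distance)
```

In raw coordinates the two stars look almost 360 degrees apart, while in the encoded space they are nearly coincident, so a regressor does not have to learn the seam.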

The following function is a trainer factory that creates linear regressors.

In [None]:
from sklearn import linear_model

def get_linear_trainer():
    return linear_model.LinearRegression()

For magnitude residuals, here we have a factory of Neural Network regressors:

In [None]:
from sklearn.neural_network import MLPRegressor

def get_mag_res_trainer():
    # (80,) is a one-element tuple: a single hidden layer of 80 units
    return MLPRegressor(hidden_layer_sizes=(80,), max_iter=500, alpha=0.02, random_state=np.random.randint(1, 10000))

Let's define a function that executes the model pipeline. The function will return a frame containing the cross-validated *response* of the residual model. That *response* series is what we will use to correct magnitudes.

In [None]:
def model_position_bias(data_frame, target_column, var_extractor, verbose=False):
    # Model target magnitude as a function of Gaia magnitudes
    m_results = model_results(data_frame, target_column, var_extractor, get_linear_trainer)
    if verbose:
        print_evaluation(data_frame, target_column, m_results)
    m_frame = pd.DataFrame({
        'source_id': m_results['source_id'],
        'residual': m_results['residual'],
        'ra': data_frame['ra'],
        'dec': data_frame['dec']
    })
    # Model m_frame['residual'] as a function of position variables.
    m_res_results = model_results(m_frame, 'residual', get_position_vars, 
                                  get_mag_res_trainer, n_runs=4, scale=True, max_n_training=MAX_N_TRAINING)       
    if verbose:
        print_evaluation(m_frame, 'residual', m_res_results)
    return m_res_results

Let's try out the function with our test magnitude column. 

In [None]:
test_result = model_position_bias(data, TEST_TARGET_COLUMN, get_gaia_vars, verbose=True)

The evaluation results above correspond to the linear model, and the position-based residual model, in that order. 

Subtracting the *response* column of *test_result* from the target magnitude column should substantially correct its position bias.
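
The two-stage idea can be checked on synthetic data where the positional bias is known. This is only a sketch under stated assumptions: the data are simulated, a single reference magnitude stands in for the three Gaia columns, a plain linear regression replaces the neural network in the second stage, and both models are fitted in-sample rather than cross-validated.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 5000

# Synthetic stand-ins: one reference magnitude and a sky position per star.
ref_mag = rng.uniform(8.0, 14.0, size=n)
ra = rng.uniform(0.0, 360.0, size=n)
dec = rng.uniform(-90.0, 90.0, size=n)

# Target magnitude = linear function of the reference + injected positional bias + noise.
true_bias = 0.05 * np.sin(np.deg2rad(ra))
target_mag = 0.9 * ref_mag + 1.0 + true_bias + rng.normal(scale=0.1, size=n)

# Stage 1: linear model of the target magnitude; keep its residuals.
stage1 = LinearRegression().fit(ref_mag.reshape(-1, 1), target_mag)
residual = target_mag - stage1.predict(ref_mag.reshape(-1, 1))

# Stage 2: model the residuals as a function of position (sin/cos features).
ra_rad, dec_rad = np.deg2rad(ra), np.deg2rad(dec)
pos_vars = np.transpose([np.sin(ra_rad), np.cos(ra_rad), np.sin(dec_rad), np.cos(dec_rad)])
stage2 = LinearRegression().fit(pos_vars, residual)
response = stage2.predict(pos_vars)

# Correction: subtract the position-dependent response from the target magnitude.
adjusted_mag = target_mag - response

corr_before = np.corrcoef(residual, true_bias)[0, 1]
corr_after = np.corrcoef(residual - response, true_bias)[0, 1]
print('Residual vs. injected bias, before: %.3f | after: %.3f' % (corr_before, corr_after))
```

In the real data the bias has no simple closed form, which is why the kernel uses a neural network for the second stage and relies on cross-validated responses to avoid fitting noise.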

## Visualization of bias
Cross-database error is considerably larger than the biases we'd like to model, so the bias is not easy to visualize directly. Looking at outliers is one way to notice the problem. It's easiest to visualize the bias by looking at residual model responses. They can be thought of as regional residual averages.

We'll sample 10 thousand stars and put them in a scatter chart. Color (red to blue) will be used to depict the response of the residual model at each star's position.

In [None]:
sample_indexes = np.random.choice(len(data), 10000, replace=False)
assert len(sample_indexes) == len(set(sample_indexes))

In [None]:
data_sample = data.iloc[sample_indexes]

In [None]:
test_result_sample = test_result.iloc[sample_indexes]

In [None]:
import matplotlib.pyplot as plt

In [None]:
%%html
<!-- Allow bigger output cells -->
<style>
.output_wrapper, .output {
    height:auto !important;
    max-height: 1500px;
}
</style>

In [None]:
import matplotlib.cm as cm

plt.rcParams['figure.figsize'] = (14, 7)
plt.scatter(data_sample['ra'], data_sample['dec'], s=2, c=test_result_sample['response'].values, cmap=cm.coolwarm_r)
plt.title('Positional bias of %s relative to Gaia DR2' % TEST_TARGET_COLUMN)
plt.xlabel('Right Ascension (degrees)')
plt.ylabel('Declination (degrees)')
plt.colorbar()
plt.show()

## Correction routine
The following function takes a magnitude column and creates a new column with 'adj_' as prefix. The new column will contain the original magnitude values minus the response of the position-based residual model.

In [None]:
CORRECTION_PREFIX = 'adj_'

In [None]:
def correct_mag_column(data_frame, column_name, verbose=False):
    column_result = model_position_bias(data_frame, column_name, get_gaia_vars, verbose=verbose)
    corrected_mags = data_frame[column_name] - column_result['response']
    if verbose:
        correl = stats.pearsonr(data_frame[column_name], corrected_mags)[0]
        print('Replacing column %s | Correlation: %.4f' % (column_name, correl,))    
    data_frame[CORRECTION_PREFIX + column_name] = corrected_mags

These are all the magnitude columns we will be correcting:

In [None]:
mag_columns = ['allwise_w1', 'allwise_w2', 'allwise_w3', 'allwise_w4', 
        'gsc23_v_mag', 'gsc23_b_mag',
        'ppmxl_b1mag', 'ppmxl_b2mag', 'ppmxl_r1mag', 'ppmxl_imag', 
        'tmass_j_m', 'tmass_h_m', 'tmass_ks_m', 
        'tycho2_bt_mag', 'tycho2_vt_mag']

Finally, we'll execute a routine that will call the correction function for each column. This can take several minutes to run.

In [None]:
for mag_column in mag_columns:
    correct_mag_column(data, mag_column, verbose=False)
print('Done.')

## Sanity check
What if we apply the model pipeline to one of the adjusted columns?

In [None]:
_ = model_position_bias(data, CORRECTION_PREFIX + TEST_TARGET_COLUMN, get_gaia_vars, verbose=True)

The second *correlation* value indicates that only a small amount of positional bias remains. Trying to correct for it further could add noise unnecessarily.

## Output
The entire dataset will be written to the output tab. It contains the new adjusted magnitude columns, but we're keeping the original ones too. Keep in mind that the new columns add some noise and could even carry some information loss.

In [None]:
data.to_csv('254k-gaiadr2-sources-adj-mags.csv')

Note also that Gaia DR2 seems to have its own positional biases, which have not been corrected here.

## Acknowledgments

This work has made use of data from the European Space Agency (ESA) mission Gaia (https://www.cosmos.esa.int/gaia), processed by the Gaia Data Processing and Analysis Consortium (DPAC, https://www.cosmos.esa.int/web/gaia/dpac/consortium). Funding for the DPAC has been provided by national institutions, in particular the institutions participating in the Gaia Multilateral Agreement.