# Ariel Data Challenge 2024: inference

The original version of the training code is at: [ADC24 Intro training](https://www.kaggle.com/code/ambrosm/adc24-intro-training).

The original version of the inference code is at: [ADC24 Intro inference](https://www.kaggle.com/code/ambrosm/adc24-intro-inference).

I have added some new features, which resulted in a slight performance improvement (LB: 0.388).

This is the inference code.

Training code is at: https://www.kaggle.com/code/royalacecat/adc24-training-with-add-feature

In [None]:
import pandas as pd
import polars as pl
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats
from tqdm import tqdm
import pickle

from sklearn.linear_model import Ridge


In [None]:
directory = "/kaggle/input/test-002/"

#exec(open(directory + 'f_read_and_preprocess.py', 'r').read())
#exec(open(directory + 'a_read_and_preprocess.py', 'r').read())
#exec(open(directory + 'feature_engineering.py', 'r').read())
#exec(open(directory + 'postprocessing.py', 'r').read())

In [None]:
#f_read_and_preprocess

def f_read_and_preprocess(dataset, adc_info, planet_ids):
    """Read the FGS1 files for all planet_ids and extract the time series.
    
    Parameters
    dataset: 'train' or 'test'
    adc_info: metadata dataframe, either train_adc_info or test_adc_info
    planet_ids: list of planet ids
    
    Returns
    float32 ndarray with one row per planet_id and 67500 values per row
    
    """
    f_raw_train = np.full((len(planet_ids), 67500), np.nan, dtype=np.float32)
    for i, planet_id in tqdm(list(enumerate(planet_ids))):
        f_signal = pl.read_parquet(f'../input/ariel-data-challenge-2024/{dataset}/{planet_id}/FGS1_signal.parquet')
        mean_signal = f_signal.cast(pl.Int32).sum_horizontal().cast(pl.Float32).to_numpy() / 1024 # mean over the 32*32 pixels
        net_signal = mean_signal[1::2] - mean_signal[0::2]
        f_raw_train[i] = net_signal
    return f_raw_train
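The slicing `mean_signal[1::2] - mean_signal[0::2]` pairs consecutive frames and keeps one net value per pair, which is why 135000 raw FGS1 frames would become 67500 values. The sketch below shows the same operation on a made-up eight-frame series; the numbers are hypothetical, and it assumes the frames alternate between a first and a second read of each exposure, as the slicing implies.

```python
import numpy as np

# Hypothetical toy series: 8 frames, alternating first read / second read.
mean_signal = np.array([10, 15, 11, 17, 9, 14, 10, 16], dtype=np.float32)

# Odd-indexed frames minus even-indexed frames: one net value per frame pair.
net_signal = mean_signal[1::2] - mean_signal[0::2]

print(net_signal)        # [5. 6. 5. 6.]
print(net_signal.shape)  # (4,)
```

The output has half as many entries as the input, matching the 67500 (FGS1) and 5625 (AIRS-CH0) row lengths used in this notebook.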

In [None]:
# a_read_and_preprocess
def a_read_and_preprocess(dataset, adc_info, planet_ids):
    """Read the AIRS-CH0 files for all planet_ids and extract the time series.
    
    Parameters
    dataset: 'train' or 'test'
    adc_info: metadata dataframe, either train_adc_info or test_adc_info
    planet_ids: list of planet ids
    
    Returns
    float32 ndarray with one row per planet_id and 5625 values per row
    
    """
    a_raw_train = np.full((len(planet_ids), 5625), np.nan, dtype=np.float32)
    for i, planet_id in tqdm(list(enumerate(planet_ids))):
        signal = pl.read_parquet(f'../input/ariel-data-challenge-2024/{dataset}/{planet_id}/AIRS-CH0_signal.parquet')
        mean_signal = signal.cast(pl.Int32).sum_horizontal().cast(pl.Float32).to_numpy() / (32*356) # mean over the 32*356 pixels
        net_signal = mean_signal[1::2] - mean_signal[0::2]
        a_raw_train[i] = net_signal
    return a_raw_train

# feature_engineering (first version; the next cell redefines this function)
def feature_engineering(f_raw, a_raw):
    """Create a dataframe with six features from the raw data.
    
    Parameters:
    f_raw: ndarray of shape (n_planets, 67500)
    a_raw: ndarray of shape (n_planets, 5625)
    
    Return value:
    df: DataFrame of shape (n_planets, 6)
    """
    obscured = f_raw[:, 23500:44000].mean(axis=1)
    unobscured = (f_raw[:, :20500].mean(axis=1) + f_raw[:, 47000:].mean(axis=1)) / 2
    f_relative_reduction = (unobscured - obscured) / unobscured
    
    half_obscured1 = f_raw[:, 20500:23500].mean(axis=1)
    half_obscured2 = f_raw[:, 44000:47000].mean(axis=1)
    f_half_reduction1 = (unobscured - half_obscured1) / unobscured
    f_half_reduction2 = (unobscured - half_obscured2) / unobscured
    
    obscured = a_raw[:, 1958:3666].mean(axis=1)
    unobscured = (a_raw[:, :1708].mean(axis=1) + a_raw[:, 3916:].mean(axis=1)) / 2
    a_relative_reduction = (unobscured - obscured) / unobscured
    
    half_obscured1 = a_raw[:, 1708:1958].mean(axis=1)
    half_obscured2 = a_raw[:, 3666:3916].mean(axis=1)
    a_half_reduction1 = (unobscured - half_obscured1) / unobscured
    a_half_reduction2 = (unobscured - half_obscured2) / unobscured

    df = pd.DataFrame({'a_relative_reduction': a_relative_reduction,
                       'f_relative_reduction': f_relative_reduction,
                       'f_half_reduction1': f_half_reduction1,
                       'f_half_reduction2': f_half_reduction2,
                       'a_half_reduction1': a_half_reduction1,
                       'a_half_reduction2': a_half_reduction2,
                       })
    
    return df

In [None]:
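"""
obscured: mean of columns 23,500 to 44,000 of the array, i.e. the brightness while the planet blocks part of the starlight.
unobscured: mean of the first 20,500 columns and mean of the columns from 47,000 onwards, averaged together, i.e. the brightness while the planet does not block the starlight.
half_obscured1, half_obscured2: means of columns 20,500 to 23,500 and of columns 44,000 to 47,000 respectively, i.e. the brightness while the planet only partly covers the star.
f_relative_reduction: relative drop in brightness between the unobscured and the obscured state.
"""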

def feature_engineering(f_raw, a_raw):
    """Create a dataframe with six features from the raw data.
    
    Parameters:
    f_raw: ndarray of shape (n_planets, 67500)
    a_raw: ndarray of shape (n_planets, 5625)
    
    Return value:
    df: DataFrame of shape (n_planets, 6)
    """
    obscured = f_raw[:, 23500:44000].mean(axis=1)
    unobscured = (f_raw[:, :20500].mean(axis=1) + f_raw[:, 47000:].mean(axis=1)) / 2
    unobscured1 = f_raw[:, :20500].mean(axis=1)
    unobscured2 = f_raw[:, 47000:].mean(axis=1)     
    f_relative_reduction = (unobscured - obscured) / unobscured

    half_obscured1 = f_raw[:, 20500:23500].mean(axis=1)
    half_obscured2 = f_raw[:, 44000:47000].mean(axis=1)
    f_relative_reduction_half1 = (unobscured - half_obscured1) / unobscured1
    f_relative_reduction_half2 = (unobscured - half_obscured2) / unobscured2
    
    obscured = a_raw[:, 1958:3666].mean(axis=1)
    unobscured = (a_raw[:, :1708].mean(axis=1) + a_raw[:, 3916:].mean(axis=1)) / 2
    unobscured1 = a_raw[:, :1708].mean(axis=1)
    unobscured2 = a_raw[:, 3916:].mean(axis=1)    
    a_relative_reduction = (unobscured - obscured) / unobscured

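    # These windows are AIRS indices, so they are read from a_raw (the original read f_raw here).
    # Note: the pickled model must have been trained with the same feature definition.
    half_obscured1 = a_raw[:, 1708:1958].mean(axis=1)
    half_obscured2 = a_raw[:, 3666:3916].mean(axis=1)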
    a_relative_reduction_half1 = (unobscured - half_obscured1) / unobscured1
    a_relative_reduction_half2 = (unobscured - half_obscured2) / unobscured2

    df = pd.DataFrame({'a_relative_reduction': a_relative_reduction,
                       'a_relative_reduction_half1':a_relative_reduction_half1,
                       'a_relative_reduction_half2':a_relative_reduction_half2,
                       'f_relative_reduction': f_relative_reduction,
                       'f_relative_reduction_half1':f_relative_reduction_half1,
                       'f_relative_reduction_half2':f_relative_reduction_half2,    
                        
                      })
    
    return df
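To see what the relative-reduction feature measures, it can help to run the same window arithmetic on a synthetic light curve. The sketch below is only an illustration: the box-shaped dip and the 1 % depth are made-up assumptions, not properties of the competition data.

```python
import numpy as np

n_steps = 67500
depth = 0.01  # hypothetical transit depth (1 % dimming)

# Flat light curve at brightness 1.0 with a box-shaped dip in the window
# that feature_engineering treats as fully obscured.
curve = np.ones((1, n_steps), dtype=np.float64)
curve[:, 23500:44000] -= depth

obscured = curve[:, 23500:44000].mean(axis=1)
unobscured = (curve[:, :20500].mean(axis=1) + curve[:, 47000:].mean(axis=1)) / 2
relative_reduction = (unobscured - obscured) / unobscured

print(relative_reduction)  # approximately [0.01]
```

For this idealized curve the feature recovers the injected depth, which suggests why it is a useful input for predicting the spectrum.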

In [None]:
# postprocessing (uses the global `wavelengths` dataframe, which is loaded in the last cell)
def postprocessing(pred_array, index, sigma_pred):
    """Create a submission dataframe from its components
    
    Parameters:
    pred_array: ndarray of shape (n_samples, 283)
    index: pandas.Index of length n_samples with name 'planet_id'
    sigma_pred: float
    
    Return value:
    df: DataFrame of shape (n_samples, 566) with planet_id as index
    """
    return pd.concat([pd.DataFrame(pred_array.clip(0, None), index=index, columns=wavelengths.columns),
                      pd.DataFrame(sigma_pred, index=index, columns=[f"sigma_{i}" for i in range(1, 284)])],
                     axis=1)

People have been asking how to choose a good value for sigma_pred. As explained in [Understanding the competition metric](https://www.kaggle.com/competitions/ariel-data-challenge-2024/discussion/528114), with sigma_pred we indicate what root mean squared error (rmse) we expect for our test predictions.

The training data cover planets of only two stars (stars 0 and 1), but the test data include planets of other stars.

This leads to the following recipe:
- For known stars (stars 0 and 1), we expect the test rmse to be equal to our cross-validation rmse, i.e. we predict the out-of-fold rmse of our model (0.000293 as shown in the training notebook).
- For unknown stars, the prediction error can only be higher. We thus predict a higher value (0.001 in this notebook).
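The recipe above can be sketched on a toy metadata table. The planet ids and the three-row frame are hypothetical; the two sigma values are the ones quoted in the list, and the `star` column name follows `test_adc_info` as used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical metadata: three planets, the last one orbiting a star unseen in training.
adc_info = pd.DataFrame({'star': [0, 1, 2]},
                        index=pd.Index([101, 102, 103], name='planet_id'))

sigma_known = 0.000293   # out-of-fold rmse quoted above
sigma_unknown = 0.001    # more cautious value for unseen stars

# One sigma per planet, then repeated for each of the 283 wavelengths.
sigma_per_planet = np.where(adc_info['star'] <= 1, sigma_known, sigma_unknown)
sigma_matrix = np.tile(sigma_per_planet[:, None], (1, 283))

print(sigma_matrix.shape)  # (3, 283)
```

Each row of `sigma_matrix` holds a single value, so every wavelength of a planet gets the same predicted uncertainty.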


In [None]:
# Load the data
wavelengths = pd.read_csv('../input/ariel-data-challenge-2024/wavelengths.csv')
test_adc_info = pd.read_csv('../input/ariel-data-challenge-2024/test_adc_info.csv',
                           index_col='planet_id')
f_raw_test = f_read_and_preprocess('test', test_adc_info, test_adc_info.index)
a_raw_test = a_read_and_preprocess('test', test_adc_info, test_adc_info.index)
test = feature_engineering(f_raw_test, a_raw_test)

# Load the model
with open(directory + 'model.pickle', 'rb') as f:
    model = pickle.load(f)
with open(directory + 'sigma_pred.pickle', 'rb') as f:
    sigma_pred = pickle.load(f)
    
# Predict
test_pred = model.predict(test)

# Package into submission file
sub_df = postprocessing(test_pred,
                        test_adc_info.index,
                        sigma_pred=np.tile(np.where(test_adc_info[['star']] <= 1, sigma_pred, 0.001), (1, 283)))
display(sub_df)
sub_df.to_csv('submission.csv')
#!head submission.csv
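Before submitting, a few structural checks on the frame can catch shape or sign mistakes early. The helper below is a hypothetical sketch, not part of the original notebook; `check_submission` and the placeholder column names are assumptions made for illustration, and the positivity requirement on the sigmas is assumed from their role as standard deviations.

```python
import numpy as np
import pandas as pd

def check_submission(df, n_wavelengths=283):
    """Hypothetical helper: basic structural checks on a submission frame."""
    assert df.shape[1] == 2 * n_wavelengths, "expected predictions plus sigmas"
    assert df.index.name == 'planet_id'
    assert not df.isna().any().any(), "no missing values allowed"
    preds = df.iloc[:, :n_wavelengths]
    sigmas = df.iloc[:, n_wavelengths:]
    assert (preds >= 0).all().all(), "predictions are clipped at zero"
    assert (sigmas > 0).all().all(), "sigmas are assumed to be strictly positive"
    return True

# Toy frame standing in for sub_df (the column names here are placeholders).
index = pd.Index([101, 102], name='planet_id')
toy = pd.DataFrame(np.full((2, 566), 0.001), index=index,
                   columns=[f"c{i}" for i in range(566)])
print(check_submission(toy))  # True
```

In the notebook itself the call would be `check_submission(sub_df)` just before `sub_df.to_csv('submission.csv')`.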