Published on August 01, 2024. By Marília Prata, mpwolke

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import feature_extraction, linear_model, model_selection, preprocessing
import plotly.graph_objs as go
import plotly.offline as py
import plotly.express as px

#Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

![](https://www.ariel-datachallenge.space/static/images/adc2024.png)

https://www.ariel-datachallenge.space/

#Predicting Exoplanetary Features

Predicting Exoplanetary Features with a Residual Model for Uniform and Gaussian Distributions

Author: Andrew Sweet

"In order to help bridge the gap between machine learning and astrophysics domain experts, the 2023 Ariel Data Challenge was hosted to predict posterior distributions of 7 exoplanetary features. The procedure outlined in this paper leveraged a combination of two deep learning models to address this challenge: a Multivariate Gaussian model that generates the mean and covariance matrix of a multivariate Gaussian distribution, and a Uniform Quantile model that predicts quantiles for use as the upper and lower bounds of a uniform distribution."

"Training of the Multivariate Gaussian model was found to be unstable, while training of the Uniform Quantile model was stable.An ensemble of uniform distributions was found to have competitive results during testing (posterior score of 696.43), and when combined with a multivariate Gaussian distribution achieved a final rank of third in the 2023 Ariel Data Challenge (final score of 681.57)."

Spectral and Auxiliary data

"The Ariel Data Challenge at NeurIPS 2022 and ECML PPKD 2023 had three
main differences: the size of the provided data sets, the number of exoplanetary features being predicted, and the scoring metrics. The methods here will focus on the data for ECML PPKD 2023. There were five data files distributed for the challenge, which can broadly be separated into input and output data for machine learning purposes, and contain simulated data for 41,423 (denoted as N below) planets. For input data there were two files:"

– "spectral data: which was composed of the wavelength grid, spectrum, uncertainty and bin width across 52 wavelength channels, of shape N × 4 × 52."

– "auxiliary data: containing 8 auxiliary features for each planetary system, of shape N × 8. These features were the planet’s mass, orbital period, radius, semi-major axis, and surface gravity, as well as its host star’s mass, radius, temperature, and distance from Earth."
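The shapes quoted above can be pictured with placeholder arrays. This is only a sketch: it uses a tiny N instead of 41,423, fills the arrays with zeros, and assumes the four spectral slots are stored in the order listed in the quote (wavelength grid, spectrum, uncertainty, bin width). The feature layout at the end is one hypothetical choice, not the paper's.

```python
import numpy as np

N = 5  # small stand-in for the 41,423 simulated planets

spectral = np.zeros((N, 4, 52))  # N x 4 x 52, as described above
auxiliary = np.zeros((N, 8))     # N x 8, as described above

# Assumed slot order: wavelength grid, spectrum, uncertainty, bin width
wl_grid, spectrum, uncertainty, bin_width = (spectral[:, i, :] for i in range(4))

# One possible flat feature matrix: spectrum + uncertainty + auxiliary features
features = np.concatenate([spectrum, uncertainty, auxiliary], axis=1)
print(features.shape)  # 52 + 52 + 8 columns per planet
```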

"A few suggestions for future work include hyperparameter optimization, separate backbones, alternate feature engineering and feature selection, transfer learning with a pre-trained unsupervised learning scheme such as an encoder-decoder network, and combining the output distributions as part of the learning process such as with a Mixture-of-Experts (MoE).

https://arxiv.org/abs/2406.10771

https://astrobiology.com/tag/2023-ariel-data-challenge

ADC 2023 - Exoplanet Atmospheric Retrieval

Simulation-based Inference for Exoplanet Atmospheric Retrieval: Insights from winning the Ariel Data Challenge 2023 using Normalizing Flows

Authors: Mayeul Aubin, Carolina Cuesta-Lazaro, Ethan Tregidga, Javier Viaña, Cecilia Garraffo, Iouli E. Gordon, Mercedes López-Morales, Robert J. Hargreaves, Vladimir Yu. Makhnev, Jeremy J. Drake, Douglas P. Finkbeiner, and Phillip Cargile

Retrieving Exoplanet Atmospheric compositions - Transmission Spectroscopy

"The most commonly used method to study the atmospheric composition of an exoplanet is called transmission spectroscopy, which consists of measuring how light from the host star gets absorbed by the planetary atmosphere during a transit, i.e. when the planet crosses in front of the disk of its star."

"Retrieving the atmospheric properties of exoplanets from their transmission spectra is challenging. There have been many efforts in this direction. Due to the high dimensionality of the parameter space and the low-resolution spectra at hand, there are degeneracies, and a deterministic answer is not usually possible or informative. A number of codes, most based on Bayesian sampling algorithms but some on machine learning too, have been developed to do this."

"In their study, the authors utilized Neural Spline Flows to model the posterior distribution of atmospheric parameters given the observed spectra. Neural Spline Flows, employ monotonic rational-quadratic splines to model the invertible mapping, and neural networks to predict the necessary parameters of these transformations. To implement Neural Spline Flows, they utilized the Zuko python package."

Ensembling the best models:

"Once the hyperparameters optimization was complete, the authors ensembled
the 10 best models to reduce model’s errors and increase robustness."

The Ariel Data Challenge 2023 Solution:

https://github.com/AstroAI-CfA/Ariel_Data_Challenge_2023_solution

Metric: Gaussian Log-Likelihood

[theanets.losses.GaussianLogLikelihood](https://theanets.readthedocs.io/en/stable/api/generated/theanets.losses.GaussianLogLikelihood.html)

[Why we consider log likelihood instead of Likelihood in Gaussian Distribution](https://math.stackexchange.com/questions/892832/why-we-consider-log-likelihood-instead-of-likelihood-in-gaussian-distribution)

[Ariel Gaussian Log Likelihood](https://www.kaggle.com/code/metric/ariel-gaussian-log-likelihood/notebook) By Sohier Dane and Kaggle Competition Metrics (aka Kaggle Bot!)
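For orientation, here is a minimal sketch of the plain per-point Gaussian log-likelihood, log N(y | mu, sigma²) = -0.5 · [log(2π sigma²) + ((y - mu)/sigma)²]. The function name is hypothetical, and this is not the Kaggle metric itself: the linked notebook may apply extra scaling or clipping that is not reproduced here.

```python
import numpy as np

def gaussian_log_likelihood(y_true, mu, sigma):
    """Element-wise log-density of y_true under N(mu, sigma^2)."""
    y_true, mu, sigma = (np.asarray(a, dtype=float) for a in (y_true, mu, sigma))
    return -0.5 * (np.log(2.0 * np.pi * sigma**2) + ((y_true - mu) / sigma) ** 2)

# Made-up values: a perfect mean with two different uncertainty claims
y = np.array([0.010, 0.012])
print(gaussian_log_likelihood(y, mu=y, sigma=1e-3))  # tight, correct sigma scores higher
print(gaussian_log_likelihood(y, mu=y, sigma=1e-1))  # overly wide sigma scores lower
```

The log form is used because it turns products of small densities into sums, and it penalizes both a wrong mean and a badly sized sigma.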

#Competition citation

@misc{ariel-data-challenge-2024,

    author = {Kai Hou Yip, Lorenzo V. Mugnai, Rebecca L. Coates, Andrea Bocchieri, Andreas Papageorgiou, Orphée Faucoz, Tara Tahseen, Virginie Batista, Angèle Syty, Arun Nambiyath Govindan, Sohier Dane, Maggie Demkin, Enzo Pascale, Quentin Changeat, Billy Edwards, Paul Eccleston, Clare Jenner, Ryan King, Theresa Lueftinger, Nikolaos Nikolaou, Pascale Danto, Sudeshna Boro Saikia, Luís F. Simões, Giovanna Tinetti, Ingo P. Waldmann (2024)},
    
    title = {NeurIPS - Ariel Data Challenge 2024},
    
    publisher = {Kaggle},
    year = {2024},
    url = {https://kaggle.com/competitions/ariel-data-challenge-2024}
}

#Ariel Challenge: Wavelengths, Light Curves, Star Spots

Use ML to identify and correct the effects of stellar spots in noisy transiting light curves of exoplanets.

Hybrid Model outperforms the LSTM Model

In [None]:
wave = pd.read_csv('/kaggle/input/ariel-data-challenge-2024/wavelengths.csv', delimiter=',', encoding='utf-8')
#pd.set_option('display.max_columns', None)
wave.head()

#Metadata - Analog-to-digital conversion (ADC) parameters

"Contains analog-to-digital (ADC) conversion parameters (gain and offset) for restoring the original dynamic range of the data. Also includes a star column identifying which star was used for that planet's simulation."

https://www.kaggle.com/competitions/ariel-data-challenge-2024/data
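A sketch of how the gain and offset might be applied to a raw frame. The data page quoted above does not spell out the formula, so the convention used here (restored = raw / gain + offset) is an assumption to check against the organizers' material. The function name and the numbers are made up for illustration.

```python
import numpy as np

def restore_dynamic_range(raw, gain, offset):
    """Assumed convention: restored = raw / gain + offset (verify before relying on it)."""
    return np.asarray(raw, dtype=np.float64) / gain + offset

# Made-up raw counts and ADC parameters, not values from train_adc_info.csv
raw_counts = np.array([0, 100, 200], dtype=np.uint16)
restored = restore_dynamic_range(raw_counts, gain=0.5, offset=-10.0)
print(restored)
```

Casting to float64 first avoids integer overflow or truncation when the stored frames are unsigned integers.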

In [None]:
adc = pd.read_csv('/kaggle/input/ariel-data-challenge-2024/train_adc_info.csv', delimiter=',', encoding='utf-8')
#pd.set_option('display.max_columns', None)
adc.tail()

#No missing values

In [None]:
adc.isnull().sum()

#ADC pairplot

In [None]:
cols = ['planet_id','FGS1_adc_offset','FGS1_adc_gain','AIRS-CH0_adc_offset','AIRS-CH0_adc_gain', 'star']
sns.pairplot(adc[cols],height=2,kind='scatter')
plt.show();

#Correlation Matrix

In [None]:
# Compute the pairwise correlations once and reuse them for the heatmap
corr = adc.corr()
plt.figure(figsize=(10,4))
sns.heatmap(corr,annot=True,cmap='summer')
plt.show()

#AIRS-CH0

AIRS-CH0 is the first channel (CH0) of the Ariel InfraRed Spectrometer (AIRS). It is an infrared spectrometer with a sensitivity between 1.95 and 3.90 µm, and has a resolving power of approximately R=100. 

https://www.kaggle.com/competitions/ariel-data-challenge-2024/data

In [None]:
test_adc = pd.read_csv('/kaggle/input/ariel-data-challenge-2024/test_adc_info.csv', delimiter=',', encoding='utf-8')
#pd.set_option('display.max_columns', None)
test_adc.head()

In [None]:
sub = pd.read_csv('/kaggle/input/ariel-data-challenge-2024/sample_submission.csv', delimiter=',', encoding='utf-8')
#pd.set_option('display.max_columns', None)
sub.head()

#Calibration files

Dark, Dead, Flat, Linear_corr and read parquet files

Calibration frames are essential for eye-catching images. Although they can be time-consuming to acquire, you will notice a difference in your images when you take the time.

https://practicalastrophotography.com/a-brief-guide-to-calibration-frames/
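To show how dark, flat, and dead frames are typically combined, here is a generic sketch: subtract the dark frame, divide by the flat field, and mask the dead pixels. It is the textbook recipe under simplifying assumptions (no integration-time scaling of the dark, no read-noise or linearity handling), not the competition's official pipeline. The function name and the 2×2 arrays are made up.

```python
import numpy as np

def calibrate_frame(raw, dark, flat, dead):
    """Dark-subtract, flat-field, and mask dead pixels in a single detector frame."""
    raw = np.asarray(raw, dtype=float)
    corrected = (raw - np.asarray(dark, dtype=float)) / np.asarray(flat, dtype=float)
    # Masked pixels are ignored by reductions such as .mean() or .sum()
    return np.ma.masked_array(corrected, mask=np.asarray(dead, dtype=bool))

# Tiny made-up frames for illustration
raw = np.array([[10.0, 12.0], [14.0, 16.0]])
dark = np.full((2, 2), 2.0)
flat = np.array([[1.0, 2.0], [1.0, 2.0]])
dead = np.array([[True, False], [False, False]])

calibrated = calibrate_frame(raw, dark, flat, dead)
print(calibrated)
print(calibrated.mean())  # averages only the live pixels
```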

In [None]:
#By Bachir https://www.kaggle.com/code/bachrr/gemma2b-finetuned-lora-text2sql

df = pd.read_parquet("/kaggle/input/ariel-data-challenge-2024/train/2633183716/AIRS-CH0_calibration/dead.parquet")
df.tail()

In [None]:
#By Bachir https://www.kaggle.com/code/bachrr/gemma2b-finetuned-lora-text2sql

dar = pd.read_parquet("/kaggle/input/ariel-data-challenge-2024/test/499191466/AIRS-CH0_calibration/dark.parquet")
dar.tail(3)

In [None]:
#By Bachir https://www.kaggle.com/code/bachrr/gemma2b-finetuned-lora-text2sql

fgs = pd.read_parquet("/kaggle/input/ariel-data-challenge-2024/train/1215971796/FGS1_signal.parquet")
fgs.tail(3)

In [None]:
#By Bachir https://www.kaggle.com/code/bachrr/gemma2b-finetuned-lora-text2sql

air = pd.read_parquet("/kaggle/input/ariel-data-challenge-2024/train/1215971796/AIRS-CH0_signal.parquet")
air.tail(3)

In [None]:
#By Bachir https://www.kaggle.com/code/bachrr/gemma2b-finetuned-lora-text2sql

fla = pd.read_parquet("/kaggle/input/ariel-data-challenge-2024/train/1215971796/AIRS-CH0_calibration/flat.parquet")
fla.tail(3)

#Wow! Linear_corr file 

In [None]:
#By Bachir https://www.kaggle.com/code/bachrr/gemma2b-finetuned-lora-text2sql

lin = pd.read_parquet("/kaggle/input/ariel-data-challenge-2024/train/1215971796/AIRS-CH0_calibration/linear_corr.parquet")
lin.tail()

In [None]:
#By Bachir https://www.kaggle.com/code/bachrr/gemma2b-finetuned-lora-text2sql

rea = pd.read_parquet("/kaggle/input/ariel-data-challenge-2024/train/1215971796/AIRS-CH0_calibration/read.parquet")
rea.tail(3)

#Clueless about these parquet - TY Cyberia for matshow

Thanks to Cyberia I learned matshow.

In [None]:
#Cyberia https://www.kaggle.com/code/cyberia/eda-adc2024/notebook

dark = dar.values.reshape(32, 356)  # Reshape to 2D array
plt.matshow(dark)
plt.show()

#Linear_corr

Only two tiny spots on the upper-left side

In [None]:
linear = lin.values  
plt.matshow(linear)
plt.show()

#Read parquet file sample

In [None]:
read = rea.values  
plt.matshow(read)
plt.show()

In [None]:
AIRS = air.values  
plt.matshow(AIRS)
plt.show()

#Dead sample

In [None]:
dead = df.values  
plt.matshow(dead)
plt.show()

#FGS1 is different

In [None]:
fgs1 = fgs.values  
plt.matshow(fgs1)
plt.show()
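One likely reason the signal plots look different from the calibration frames: each row of a signal file is a flattened image, so plotting the raw table shows time against flattened pixels rather than a detector frame. The sketch below reshapes rows back into frames using zero-filled stand-ins. The 32 × 356 shape for AIRS-CH0 matches the reshape used for the dark frame earlier; the 32 × 32 shape for FGS1 is an assumption to check against the data page. The helper name is hypothetical.

```python
import numpy as np

def unflatten_frames(flat_rows, height, width):
    """Reshape (n_frames, height*width) rows into (n_frames, height, width) images."""
    flat_rows = np.asarray(flat_rows)
    if flat_rows.shape[1] != height * width:
        raise ValueError(
            f"expected {height * width} columns, got {flat_rows.shape[1]}"
        )
    return flat_rows.reshape(-1, height, width)

# Zero-filled stand-ins for a few rows of the signal files
fgs_rows = np.zeros((4, 32 * 32))     # assumed FGS1 frame size: 32 x 32
airs_rows = np.zeros((4, 32 * 356))   # AIRS-CH0 frame size: 32 x 356

fgs_frames = unflatten_frames(fgs_rows, 32, 32)
airs_frames = unflatten_frames(airs_rows, 32, 356)
print(fgs_frames.shape, airs_frames.shape)
```

With real data, something like `unflatten_frames(fgs.values, 32, 32)[0]` could then be passed to `plt.matshow` to view a single frame, provided the assumed shape is right.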

#Flat sample

![](https://y.yarn.co/b49cffdb-fb31-4767-b3c4-a2690fd83f59_text.gif)Yarn

In [None]:
flat = fla.values  
plt.matshow(flat)
plt.show()

#One hour and fifty-two minutes just to get here, plus another 2:12 reading papers

#Acknowledgements:

Cyberia https://www.kaggle.com/code/cyberia/eda-adc2024/notebook

Bachir https://www.kaggle.com/code/bachrr/gemma2b-finetuned-lora-text2sql