# Noise correction and sensor fusion

In this notebook, I correct for errors in the infrared and visual wavelength sensors. I also explore the wavelengths.csv file, what the labels look like, and what the axis information contains.

## Noise Correction for AIRS-CH0 Infrared Sensor and FGS1 Visual wavelength sensor on one planet's signal

Starting off, I am just going to use the gain and offset together with the dark, dead, linear correction, flat, and read frames to correct for measurement noise, as laid out in the Kaggle data section for this competition. The following code block imports the libraries and loads the gain and offset table. After that, I load the noise correction frames and the infrared and visual signals for one planet, go through the noise correction for that planet, and visualize the result to check that the correction behaves sensibly.

In [None]:
# import data, libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
from matplotlib import pyplot as plt

# load gain, offset
gain_offset_csv = pd.read_csv("/kaggle/input/ariel-data-challenge-2024/train_adc_info.csv")

Since the correction information is stored in the same way for both sensors, one function can handle both sensors and all planets. For an explanation of why I divide by some correction frames, subtract others, and treat others as polynomial coefficients, read the article linked in the Kaggle data section for this competition.
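Before the full loader, here is a toy version of the three kinds of correction on synthetic arrays. It is only a sketch: the array sizes, the values, and the helper name `polyval_per_pixel` are made up for illustration and are not part of the competition data or API. Additive frames are subtracted, multiplicative frames are divided out, and the polynomial coefficients are applied per pixel with the constant term first.

```python
import numpy as np

# Synthetic stand-ins (assumed values, not competition data):
# 4 time steps of a 2x3 detector.
rng = np.random.default_rng(0)
true_flux = rng.uniform(100.0, 200.0, size=(4, 2, 3))

dark = np.full((1, 2, 3), 5.0)                # additive effect: subtract it
flat = rng.uniform(0.9, 1.1, size=(1, 2, 3))  # multiplicative effect: divide by it

# Per-pixel polynomial coefficients, constant term first:
# f(x) = c0 + c1*x + c2*x**2
coeffs = np.stack([
    np.zeros((1, 2, 3)),
    np.ones((1, 2, 3)),
    np.full((1, 2, 3), 1e-4),
])


def polyval_per_pixel(x, coeffs):
    """Evaluate a per-pixel polynomial with Horner's rule.

    coeffs has shape (n_coeffs, 1, rows, cols); broadcasting applies the
    same coefficients to every time step of x.
    """
    result = np.zeros_like(x)
    for c in coeffs[::-1]:  # highest order coefficient first
        result = result * x + c
    return result


# Toy forward model: what the detector would report.
observed = true_flux * flat + dark

# Undo it: subtract the additive frame, divide by the multiplicative one.
recovered = (observed - dark) / flat

# Apply the polynomial correction to the recovered signal.
linearised = polyval_per_pixel(recovered, coeffs)
```

Because the frames have a leading axis of length 1, NumPy broadcasting applies them to every time step without copying them.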

In [None]:
sensor_sizes_dict = {"AIRS-CH0":[[11250, 32, 356], [1, 32, 356]], "FGS1":[[135000, 32, 32], [1, 32, 32]]} # switch data sizes depending on the sensor

# define data loading function
def load_data(sensor, planet_id):
    
    # load offset, gain for this planet
    planet_gain_offset = gain_offset_csv[gain_offset_csv["planet_id"] == planet_id]
    
    # get all noise correction frames and signal
    signal = pd.read_parquet('/kaggle/input/ariel-data-challenge-2024/train/' + str(planet_id) + '/' + sensor + '_signal.parquet', engine='pyarrow')
    dark_frame = pd.read_parquet('/kaggle/input/ariel-data-challenge-2024/train/' + str(planet_id) + '/' + sensor + '_calibration/dark.parquet', engine='pyarrow')
    dead_frame = pd.read_parquet('/kaggle/input/ariel-data-challenge-2024/train/' + str(planet_id) + '/' + sensor + '_calibration/dead.parquet', engine='pyarrow')
    linear_corr_frame = pd.read_parquet('/kaggle/input/ariel-data-challenge-2024/train/' + str(planet_id) + '/' + sensor + '_calibration/linear_corr.parquet', engine='pyarrow')
    flat_frame = pd.read_parquet('/kaggle/input/ariel-data-challenge-2024/train/' + str(planet_id) + '/' + sensor + '_calibration/flat.parquet', engine='pyarrow')
    read_frame = pd.read_parquet('/kaggle/input/ariel-data-challenge-2024/train/' + str(planet_id) + '/' + sensor + '_calibration/read.parquet', engine='pyarrow')

    # to numpy
    signal = signal.to_numpy().reshape(sensor_sizes_dict[sensor][0])

    # read and dark frame correction, just subtraction
    signal = signal - dark_frame.to_numpy().reshape(sensor_sizes_dict[sensor][1])
    signal = signal - read_frame.to_numpy().reshape(sensor_sizes_dict[sensor][1])

    # flat frame correction + ensure dead pixels do not disrupt. We divide by the flat frame
    flat = flat_frame.to_numpy().reshape(sensor_sizes_dict[sensor][1])
    flat[dead_frame.to_numpy().reshape(sensor_sizes_dict[sensor][1])] = 1
    signal = signal/flat

    # linear correction - the six frames are treated as per-pixel polynomial
    # coefficients, constant term first, i.e. a fifth degree polynomial in the signal
    coefficients = linear_corr_frame.to_numpy().reshape([6] + sensor_sizes_dict[sensor][1])
    # broadcasting over the time axis avoids repeating the coefficients for every frame
    corrected = np.zeros_like(signal)
    for i in range(5, -1, -1):  # Horner's rule, highest order coefficient first
        corrected = corrected * signal + coefficients[i]
        
    # gain, offset
    corrected = corrected*planet_gain_offset[sensor + "_adc_gain"].values + planet_gain_offset[sensor + "_adc_offset"].values
        
    # zero out malfunctioning pixels
    corrected[np.repeat(dead_frame.to_numpy().reshape(sensor_sizes_dict[sensor][1]), sensor_sizes_dict[sensor][0][0], axis=0)] = 0
        
    return corrected
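
The cubes returned by `load_data` are large, so it can help to estimate their memory footprint before loading both sensors at once. The sketch below reuses the shapes from `sensor_sizes_dict`; the helper name `cube_gib` is hypothetical, and float64 is an assumption about the dtype after calibration.

```python
import numpy as np

# Cube shapes copied from sensor_sizes_dict: (time steps, rows, columns).
cube_shapes = {"AIRS-CH0": (11250, 32, 356), "FGS1": (135000, 32, 32)}


def cube_gib(shape, dtype=np.float64):
    """Size in GiB of one dense array with the given shape and dtype."""
    n_elements = int(np.prod(shape, dtype=np.int64))
    return n_elements * np.dtype(dtype).itemsize / 2**30


for sensor, shape in cube_shapes.items():
    print(f"{sensor}: {cube_gib(shape):.2f} GiB per float64 cube")
```

Every temporary array with the same shape as the cube costs this much again, which is worth keeping in mind when chaining several corrections.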


In [None]:
# planet
planet_id = 100468857

infrared_data = load_data("AIRS-CH0", planet_id)
visual_data = load_data("FGS1", planet_id)

In [None]:
# graph
%matplotlib inline 
im_data = infrared_data[2, :, :]

print(np.min(im_data))
print(np.max(im_data))

im_data = (im_data - np.min(im_data))/(np.max(im_data) - np.min(im_data)) # min-max scale to [0, 1]

plt.imshow(im_data, cmap='inferno')
plt.show()

In [None]:
im_data = visual_data[0, :, :]

print(np.min(im_data))
print(np.max(im_data))

im_data = (im_data - np.min(im_data))/(np.max(im_data) - np.min(im_data)) # min-max scale to [0, 1]

plt.figure(figsize=(10, 10))
plt.imshow(im_data, cmap='inferno')
plt.colorbar()
plt.show()
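
The scaling used for both plots can be factored into a small helper. This is a sketch with a hypothetical name, `minmax_scale`; it divides by the range (max minus min) so the output spans exactly [0, 1], and it guards against a constant image, where the range is zero.

```python
import numpy as np


def minmax_scale(image):
    """Rescale an array to [0, 1]; a constant image maps to all zeros."""
    image = np.asarray(image, dtype=np.float64)
    lo, hi = np.nanmin(image), np.nanmax(image)
    if hi == lo:
        # No contrast to stretch, so avoid dividing by zero.
        return np.zeros_like(image)
    return (image - lo) / (hi - lo)


scaled = minmax_scale(np.array([[2.0, 4.0], [6.0, 10.0]]))
```

With this in place, a frame could be plotted with something like `plt.imshow(minmax_scale(infrared_data[2]), cmap='inferno')`.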

If you are wondering how much the cleaning process changes the data, look at version one of the data for a comparison with the raw data. The short answer is that the data change only very slightly. I think these data look pretty sensible. If we are going to make informed predictions from these data, however, we should know what we are predicting. Let's look at wavelengths.csv and the labels for this challenge:

## Wavelength and Label examples

In [None]:
wavelengths = pd.read_csv("/kaggle/input/ariel-data-challenge-2024/wavelengths.csv")
labels = pd.read_csv("/kaggle/input/ariel-data-challenge-2024/train_labels.csv")

In [None]:
print(wavelengths)

It looks like wavelengths are measured in micrometers! That means that most of the wavelengths the challenge cares about are actually in the infrared part of the absorption spectrum. It may be important later to predict each wavelength only from the sensors that have that wavelength within their observation capability.
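
A minimal sketch of matching wavelengths to sensors follows. The band edges are assumptions taken from public descriptions of the Ariel instruments (FGS1 roughly 0.60 to 0.80 micrometers, AIRS-CH0 roughly 1.95 to 3.90 micrometers), not values read from the competition files, and the example wavelengths and the helper name `channels_in_band` are hypothetical.

```python
import numpy as np

# Assumed nominal band edges in micrometers; verify against the
# competition documentation before relying on them.
SENSOR_BANDS_UM = {"FGS1": (0.60, 0.80), "AIRS-CH0": (1.95, 3.90)}


def channels_in_band(wavelengths_um, sensor):
    """Indices of the wavelengths that fall inside a sensor's assumed band."""
    lo, hi = SENSOR_BANDS_UM[sensor]
    w = np.asarray(wavelengths_um, dtype=float)
    return np.flatnonzero((w >= lo) & (w <= hi))


# Hypothetical wavelengths, standing in for one row of wavelengths.csv.
example_wavelengths = np.array([0.705, 1.95, 2.5, 3.9, 4.2])

fgs1_channels = channels_in_band(example_wavelengths, "FGS1")
airs_channels = channels_in_band(example_wavelengths, "AIRS-CH0")
```

With the real file, the input could be something like `wavelengths.iloc[0].to_numpy()`, assuming the wavelengths are stored as a single row.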

In [None]:
print(labels)
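# largest label value for one planet; drop the id column so it cannot dominate the max
print(labels.drop(columns="planet_id", errors="ignore").iloc[1].max())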

It looks like the absorption spectra are extremely sparse, so softmax output functions might be our friend with this.
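
One way to put a number on that impression is to summarise the label values directly. The sketch below runs on a made-up table; the column name `planet_id`, the threshold, and the helper name `sparsity_report` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def sparsity_report(labels, id_column="planet_id", threshold=1e-6):
    """Range of the label values and the fraction that are close to zero."""
    values = labels.drop(columns=id_column, errors="ignore").to_numpy(dtype=float)
    return {
        "min": float(values.min()),
        "max": float(values.max()),
        "fraction_near_zero": float(np.mean(np.abs(values) <= threshold)),
    }


# Synthetic stand-in for train_labels.csv (values are invented).
toy_labels = pd.DataFrame({
    "planet_id": [1, 2],
    "wl_1": [0.0, 0.002],
    "wl_2": [0.001, 0.0],
})

report = sparsity_report(toy_labels)
```

A `fraction_near_zero` close to 1 on the real labels would support the sparsity reading. A small but nonzero minimum would instead suggest that the values are small in scale, not sparse.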

## Axis Info Exploration

In [None]:
axis_data = pd.read_parquet("/kaggle/input/ariel-data-challenge-2024/axis_info.parquet", engine='pyarrow')
axis_data

After taking a quick look at the data, I realized that this is not the first go-round of this competition. Be sure to read previous research papers: https://arxiv.org/pdf/2309.09337