# Physics-Based Data Imputation

## Objective

The objective of this notebook is to impute the missing values in the approved columns of `full_data.csv`. It combines physical relationships with standard imputation techniques: `KNNImputer`, `IterativeImputer` and mode imputation.
This approach keeps the imputed data as consistent with the laws of physics as possible.

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the IterativeImputer import)
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

### Reading the CSV file

In [2]:
df = pd.read_csv("../../datasets/full_data.csv")
df.head(3)

Unnamed: 0,P_NAME,P_STATUS,P_MASS,P_MASS_ERROR_MIN,P_MASS_ERROR_MAX,P_RADIUS,P_RADIUS_ERROR_MIN,P_RADIUS_ERROR_MAX,P_YEAR,P_UPDATED,...,P_HABZONE_CON,P_TYPE_TEMP,P_HABITABLE,P_ESI,S_CONSTELLATION,S_CONSTELLATION_ABR,S_CONSTELLATION_ENG,P_RADIUS_EST,P_MASS_EST,P_SEMI_MAJOR_AXIS_EST
0,11 Com b,3.0,6165.8633,-476.742,476.742,,,,2007,2014-05-14,...,0,Hot,0,0.083813,Coma Berenices,Com,Berenice's Hair,12.082709,6165.8633,1.29
1,11 UMi b,3.0,4684.7848,-794.57001,794.57001,,,,2009,2018-09-06,...,0,Hot,0,0.082414,Ursa Minor,UMi,Little Bear,12.229641,4684.7848,1.53
2,14 And b,3.0,1525.5744,,,,,,2008,2014-05-14,...,0,Hot,0,0.081917,Andromeda,And,Andromeda,12.848516,1525.5744,0.83


### Selecting the approved columns

In [3]:
approved_columns = [
    "P_MASS", "P_RADIUS", "P_DENSITY", "P_GRAVITY", "P_ESCAPE", "P_TYPE",
    "P_PERIOD", "P_SEMI_MAJOR_AXIS", "P_ECCENTRICITY", "P_INCLINATION",
    "P_OMEGA", "P_PERIASTRON", "P_APASTRON", "P_IMPACT_PARAMETER",
    "P_HILL_SPHERE",
    "S_MASS", "S_RADIUS", "S_LUMINOSITY", "S_TEMPERATURE", "S_AGE",
    "S_METALLICITY", "S_LOG_G", "S_TYPE", "S_MAG", "S_DISC",
    "S_MAGNETIC_FIELD",
    "S_SNOW_LINE", "S_TIDAL_LOCK", "P_DETECTION", "P_DISTANCE", "P_HABITABLE"
]

approved_df = df[approved_columns]
approved_df.head(3)

Unnamed: 0,P_MASS,P_RADIUS,P_DENSITY,P_GRAVITY,P_ESCAPE,P_TYPE,P_PERIOD,P_SEMI_MAJOR_AXIS,P_ECCENTRICITY,P_INCLINATION,...,S_LOG_G,S_TYPE,S_MAG,S_DISC,S_MAGNETIC_FIELD,S_SNOW_LINE,S_TIDAL_LOCK,P_DETECTION,P_DISTANCE,P_HABITABLE
0,6165.8633,,,,,Jovian,326.03,1.29,0.231,,...,2.31,K0 III,4.74,,,34.529063,0.6424,Radial Velocity,1.324418,0
1,4684.7848,,,,,Jovian,516.21997,1.53,0.08,,...,1.93,K4 III,5.016,,,42.732816,0.648683,Radial Velocity,1.534896,0
2,1525.5744,,,,,Jovian,185.84,0.83,0.0,,...,2.63,G8 III,5.227,,,20.593611,0.60001,Radial Velocity,0.83,0


### Identifying the missing data

In [4]:
# Identifying the different types of data types present
print('Data Types:\n', approved_df.dtypes.value_counts())

# Finding the missing percentages of values in columns
missing = approved_df.isnull().sum() * 100 / approved_df.shape[0]
missing = missing[missing > 0].sort_values(ascending=False)

# Show the top 10 missing columns
missing[:10]

Data Types:
 float64    27
object      3
int64       1
Name: count, dtype: int64


S_MAGNETIC_FIELD      100.000000
S_DISC                100.000000
P_GRAVITY              82.559289
P_ESCAPE               82.559289
P_DENSITY              82.559289
P_OMEGA                81.571146
P_INCLINATION          79.150198
S_TYPE                 66.156126
P_ECCENTRICITY         65.909091
P_IMPACT_PARAMETER     65.192688
dtype: float64

In [5]:
# Finding the columns that are missing completely and removing them
empty_cols = [col for col in approved_columns if approved_df[col].isnull().all()]
approved_df = approved_df.drop(empty_cols, axis=1)

### Separating the numeric and categorical columns

The two groups use different imputation techniques:

* Numeric -> **KNNImputer**, **IterativeImputer**
* Categorical -> **Mode**

In [6]:
# Separating the column that we have to predict
to_predict = pd.DataFrame(approved_df['P_HABITABLE'])
approved_df = approved_df.drop(columns = ['P_HABITABLE'])

# Separate numeric and categorical columns
num_cols = approved_df.select_dtypes(include=['float64','int64']).columns
cat_cols = approved_df.select_dtypes(include=['object']).columns

### Imputing categorical columns

For the categorical columns we impute with the most frequent value (`strategy='most_frequent'`), i.e. the **mode**.
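
As a minimal sketch of what mode imputation does, the toy frame below (hypothetical values, not taken from `full_data.csv`) is filled with `SimpleImputer`. The name `modes` is only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical frame with hypothetical values; dtype=object mirrors
# the object columns selected in the notebook.
toy = pd.DataFrame(
    {
        "P_TYPE": ["Jovian", "Jovian", np.nan, "Terran"],
        "S_TYPE": ["M", np.nan, "M", "K0 V"],
    },
    dtype=object,
)

imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

# After fitting, the per-column modes are available in `statistics_`.
modes = dict(zip(toy.columns, imputer.statistics_))
print(filled)
print(modes)
```

Every gap in a column receives the same label, so for a column such as `S_TYPE`, where roughly 66% of the values are missing, the imputed label will dominate the column.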

In [7]:
cat_imputer = SimpleImputer(strategy='most_frequent')
# Keep the original index so the later concat aligns row by row
cat_imputed_df = pd.DataFrame(cat_imputer.fit_transform(approved_df[cat_cols]), columns=cat_cols, index=approved_df.index)

### Imputing Numerical columns

* Planetary mass and radius are related by an empirical power law ($\text{Mass} \approx \text{Radius}^{2.06}$ in Earth units), so we use `IterativeImputer`, which can exploit this relationship. All other planetary columns are imputed the same way.
* `P_DENSITY`, `P_GRAVITY` and `P_ESCAPE` are functions of mass and radius, so we compute them from their physical formulas instead of imputing them.
* Missing `P_ECCENTRICITY` values are filled with *0* (circular orbit assumption).
* For the stellar columns we use `KNNImputer`: stellar properties are strongly correlated with one another, so the nearest neighbours of a star are a reasonable source for its missing values.
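
As a minimal sketch of why imputing in log space suits a power law, the toy example below generates synthetic planets that follow $M = R^{2.06}$, hides some masses and imputes them with `IterativeImputer` on $\log_{10}$ values. The data, sample size, scatter and variable names are illustrative assumptions, not part of the notebook's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic planets following M = R**2.06 with a small log-normal scatter.
radius = rng.uniform(0.5, 10.0, size=200)
mass = radius ** 2.06 * 10 ** rng.normal(0.0, 0.02, size=200)

toy = pd.DataFrame({"P_MASS": mass, "P_RADIUS": radius})
missing_rows = rng.choice(200, size=40, replace=False)
toy.loc[missing_rows, "P_MASS"] = np.nan  # hide 20% of the masses

# In log10 space the power law is a straight line, which the default
# linear estimator can fit; exponentiating afterwards can only give
# positive values.
log_toy = np.log10(toy)
imputed_log = IterativeImputer(max_iter=20, random_state=42).fit_transform(log_toy)
imputed = pd.DataFrame(10 ** imputed_log, columns=toy.columns)

# Slope of log(M) against log(R), measured on the imputed rows only.
slope = np.polyfit(
    np.log10(imputed.loc[missing_rows, "P_RADIUS"]),
    np.log10(imputed.loc[missing_rows, "P_MASS"]),
    deg=1,
)[0]
print(f"recovered exponent: {slope:.2f}")
```

On this synthetic data the recovered exponent should sit close to 2.06, and no imputed mass can be negative. Real catalogue data is noisier and involves many more columns, so the relationship will be preserved less exactly.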

In [8]:
num_df = pd.DataFrame(approved_df[num_cols])

# Fill the missing values in P_ECCENTRICITY with 0
num_df['P_ECCENTRICITY'] = num_df['P_ECCENTRICITY'].fillna(0)

# Using KNN imputer on stellar columns
stellar_cols = [col for col in num_df.columns if col.startswith('S_')]
num_df['S_AGE'] = num_df['S_AGE'].where(num_df['S_AGE'] >= 0)  # treat negative ages as missing
# Scale to [0, 1] so that no single column dominates the KNN distance
scaler = MinMaxScaler()
df_stellar_scaled = scaler.fit_transform(num_df[stellar_cols])
num_df[stellar_cols] = scaler.inverse_transform(KNNImputer(n_neighbors=5).fit_transform(df_stellar_scaled))

# Using IterativeImputer on planet columns (with stellar columns as helpers)
planet_cols = [col for col in num_df.columns if col[0] == 'P' and col not in ['P_DENSITY', 'P_ESCAPE', 'P_GRAVITY']]

# Converting mass and radius into log so that no negative values occur after imputation
cols_to_log = ['P_MASS', 'P_RADIUS']
for col in cols_to_log:
    num_df[col] = np.log10(num_df[col])

mice_imputer = IterativeImputer(max_iter=20, random_state=42)
subset_impute = num_df[planet_cols + stellar_cols]
num_df[planet_cols] = mice_imputer.fit_transform(subset_impute)[:, :len(planet_cols)]

for col in cols_to_log:
    num_df[col] = 10 ** num_df[col]

# Filling the P_DENSITY, P_GRAVITY and P_ESCAPE values from mass and radius
# (relative to Earth, assuming P_MASS and P_RADIUS are in Earth units)
num_df['P_DENSITY'] = num_df['P_MASS'] / (num_df['P_RADIUS'] ** 3)
num_df['P_GRAVITY'] = num_df['P_MASS'] / (num_df['P_RADIUS'] ** 2)
num_df['P_ESCAPE'] = np.sqrt(num_df['P_MASS'] / num_df['P_RADIUS'])
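
The three formulas above contain no physical constants, which works only if `P_MASS` and `P_RADIUS` are expressed in Earth units (an assumption here, suggested by the magnitudes in the tables). The results are then density, surface gravity and escape velocity relative to Earth. The sketch below reproduces one row of the final table and converts it to physical units; the helper name and the rounded Earth reference values are illustrative.

```python
import numpy as np

# Rounded Earth reference values, used only for unit conversion.
EARTH_DENSITY_G_CM3 = 5.51
EARTH_GRAVITY_M_S2 = 9.81
EARTH_ESCAPE_KM_S = 11.19


def derived_in_earth_units(mass, radius):
    """Return (density, gravity, escape velocity) relative to Earth.

    Hypothetical helper: assumes mass and radius are in Earth units.
    """
    mass = np.asarray(mass, dtype=float)
    radius = np.asarray(radius, dtype=float)
    density = mass / radius ** 3      # rho ~ M / R^3
    gravity = mass / radius ** 2      # g ~ M / R^2
    escape = np.sqrt(mass / radius)   # v_esc ~ sqrt(M / R)
    return density, gravity, escape


# Row 4045 of the final table: P_MASS = 1.38, P_RADIUS = 0.725404
rho, g, v_esc = derived_in_earth_units(1.38, 0.725404)

# Convert the Earth-relative values to physical units.
rho_cgs = rho * EARTH_DENSITY_G_CM3
g_si = g * EARTH_GRAVITY_M_S2
v_esc_km_s = v_esc * EARTH_ESCAPE_KM_S
print(rho, g, v_esc)
print(rho_cgs, g_si, v_esc_km_s)
```

The Earth-relative values should match the `P_DENSITY`, `P_GRAVITY` and `P_ESCAPE` entries of that row in the final table.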

### Preparing the final dataset and a final check

In [9]:
# Combining the dataframes into final imputed dataframe
final_imputed_df = pd.concat([num_df, cat_imputed_df, to_predict], axis=1)
final_imputed_df

Unnamed: 0,P_MASS,P_RADIUS,P_DENSITY,P_GRAVITY,P_ESCAPE,P_PERIOD,P_SEMI_MAJOR_AXIS,P_ECCENTRICITY,P_INCLINATION,P_OMEGA,...,S_METALLICITY,S_LOG_G,S_MAG,S_SNOW_LINE,S_TIDAL_LOCK,P_DISTANCE,P_TYPE,S_TYPE,P_DETECTION,P_HABITABLE
0,6165.863300,654.904611,2.195127e-05,0.014376,3.068371,326.030000,1.29000,0.231,71.718900,94.800000,...,-0.35,2.310,4.740,34.529063,0.642400,1.324418,Jovian,K0 III,Radial Velocity,0
1,4684.784800,1762.439997,8.557494e-07,0.001508,1.630376,516.219970,1.53000,0.080,61.059901,195.347304,...,-0.02,1.930,5.016,42.732816,0.648683,1.534896,Jovian,K4 III,Radial Velocity,0
2,1525.574400,108.734327,1.186679e-03,0.129033,3.745703,185.840000,0.83000,0.000,82.900072,0.000000,...,-0.24,2.630,5.227,20.593611,0.600010,0.830000,Jovian,G8 III,Radial Velocity,0
3,1481.078500,17.967734,2.553278e-01,4.587662,9.079091,1773.400000,2.93000,0.370,85.487588,147.596899,...,0.41,4.450,6.610,2.141648,0.445415,3.130558,Jovian,K0 V,Radial Velocity,0
4,565.733850,19.741228,7.353425e-02,1.451656,5.353268,798.500000,1.66000,0.680,86.009939,141.996744,...,0.06,4.360,6.250,3.019411,0.473325,2.043792,Jovian,G2.5 V,Radial Velocity,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4043,115.010463,1.870000,1.758784e+01,32.889263,7.842380,28.165600,0.13456,0.000,89.800000,0.000000,...,0.00,4.774,14.900,0.430331,0.341601,0.134560,Superterran,M,Transit,2
4044,185.662881,2.760000,8.830755e+00,24.372884,8.201778,7.906961,0.05769,0.000,89.356000,0.000000,...,0.00,4.774,14.900,0.430331,0.341601,0.057690,Neptunian,M,Transit,0
4045,1.380000,0.725404,3.615255e+00,2.622521,1.379271,3.204000,0.02100,0.000,92.860680,145.000000,...,-0.08,4.762,13.060,0.109941,0.244044,0.021000,Terran,M5.5V,Radial Velocity,0
4046,1.750000,0.761062,3.969890e+00,3.021331,1.516384,6.689000,0.03500,0.000,92.870684,88.000000,...,-0.08,4.762,13.060,0.109941,0.244044,0.035000,Terran,M5.5V,Radial Velocity,1


In [10]:
final_imputed_df.isnull().sum() # This shows that no missing value is present

P_MASS                0
P_RADIUS              0
P_DENSITY             0
P_GRAVITY             0
P_ESCAPE              0
P_PERIOD              0
P_SEMI_MAJOR_AXIS     0
P_ECCENTRICITY        0
P_INCLINATION         0
P_OMEGA               0
P_PERIASTRON          0
P_APASTRON            0
P_IMPACT_PARAMETER    0
P_HILL_SPHERE         0
S_MASS                0
S_RADIUS              0
S_LUMINOSITY          0
S_TEMPERATURE         0
S_AGE                 0
S_METALLICITY         0
S_LOG_G               0
S_MAG                 0
S_SNOW_LINE           0
S_TIDAL_LOCK          0
P_DISTANCE            0
P_TYPE                0
S_TYPE                0
P_DETECTION           0
P_HABITABLE           0
dtype: int64
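
A zero null count shows that the table is complete, not that every imputed value is plausible. For instance, the imputed `P_RADIUS` in row 1 of the table above is about 1762, which would be larger than the Sun if the column is in Earth radii. The sketch below adds a few simple bound checks. The helper name, the `max_radius` threshold and the third toy row are assumptions made for illustration.

```python
import pandas as pd


def physical_sanity_report(df, max_radius=30.0):
    """Count rows that violate simple physical bounds.

    Hypothetical helper; `max_radius` (in Earth radii) is an
    illustrative threshold, not a value taken from the notebook.
    """
    ecc = df["P_ECCENTRICITY"]
    checks = {
        "P_MASS <= 0": df["P_MASS"] <= 0,
        "P_RADIUS <= 0": df["P_RADIUS"] <= 0,
        f"P_RADIUS > {max_radius}": df["P_RADIUS"] > max_radius,
        "P_ECCENTRICITY outside [0, 1)": (ecc < 0) | (ecc >= 1),
        "S_AGE < 0": df["S_AGE"] < 0,
    }
    return pd.Series({name: int(mask.sum()) for name, mask in checks.items()})


# First two rows reuse mass, radius and eccentricity from the final table;
# the S_AGE values and the whole third row are made up to trigger the checks.
toy = pd.DataFrame(
    {
        "P_MASS": [1.38, 4684.7848, -2.0],
        "P_RADIUS": [0.725404, 1762.439997, 1.0],
        "P_ECCENTRICITY": [0.0, 0.08, 1.3],
        "S_AGE": [5.0, 1.2, -0.5],
    }
)
report = physical_sanity_report(toy)
print(report)
```

Calling `physical_sanity_report(final_imputed_df)` would give a quick count of rows that deserve a closer look before the dataset is used downstream.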

In [12]:
# Save the cleaned dataset
final_imputed_df.to_csv("imputed_df.csv", index=False)

## Conclusion

This notebook implemented a physics-informed hybrid imputation strategy: MICE (`IterativeImputer`) on log-transformed planetary mass and radius, which keeps imputed values positive and respects the mass-radius power law, and KNN for stellar properties, so that each star's missing values come from similar stars. Derived features (density, gravity and escape velocity) were computed after imputation from mass and radius, so they are always consistent with those two columns. The result is a complete dataset for habitability analysis.