
The goal of this notebook is to handle missing values in the approved dataset by:
- Identifying missing-value patterns
- Removing non-imputable columns
- Applying KNN Imputation for numerical features
- Applying appropriate imputation for categorical features
- Producing a clean dataset


In [36]:
# Core data handling
import pandas as pd
import numpy as np

In [37]:
df = pd.read_csv("full_data.csv")  # load the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4048 entries, 0 to 4047
Columns: 112 entries, P_NAME to P_SEMI_MAJOR_AXIS_EST
dtypes: float64(94), int64(4), object(14)
memory usage: 3.5+ MB


In [40]:
# Count missing values per column
missing_summary = df.isnull().sum().sort_values(ascending=False)

# Display top columns with missing values
missing_summary.head(15)


column,missing_count
P_ATMOSPHERE,4048
P_ALT_NAMES,4048
P_DETECTION_RADIUS,4048
P_GEO_ALBEDO,4048
P_DETECTION_MASS,4048
S_MAGNETIC_FIELD,4048
S_DISC,4048
P_TEMP_MEASURED,4043
P_GEO_ALBEDO_ERROR_MIN,4043
P_GEO_ALBEDO_ERROR_MAX,4043


Missing values are present in both numerical and categorical features, and seven columns are missing in all 4048 rows.
Since different data types require different imputation strategies, numerical and categorical columns are handled separately.
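To see the pattern by data type at a glance, the counts can be combined with each column's dtype and missing percentage. The sketch below is a minimal illustration on made-up data; `missing_report` and the toy column names are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_report(frame):
    """Summarize dtype, count and percentage of missing values per column."""
    report = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isnull().sum(),
        "pct_missing": frame.isnull().mean() * 100,
    })
    # Keep only columns that actually have gaps, worst first
    return report[report["n_missing"] > 0].sort_values(
        "pct_missing", ascending=False
    )


# Toy data (assumed for illustration only)
toy = pd.DataFrame({
    "radius": [1.0, np.nan, 3.0, np.nan],
    "status": ["ok", None, "ok", "ok"],
    "year": [2001, 2002, 2003, 2004],
})
toy_report = missing_report(toy)
print(toy_report)
```

On the real data the same function could be called as `missing_report(df)`.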


In [41]:
# Identify numerical columns
num_cols = df.select_dtypes(include=["int64", "float64"]).columns.tolist()

# Identify numerical columns that are completely null
all_null_cols = [col for col in num_cols if df[col].isnull().all()]

print("All-null numeric columns:", all_null_cols)

# Drop all-null columns (KNN cannot impute these)
df = df.drop(columns=all_null_cols)

# Update numeric column list
num_cols = [col for col in num_cols if col not in all_null_cols]


All-null numeric columns: ['P_GEO_ALBEDO', 'P_DETECTION_MASS', 'P_DETECTION_RADIUS', 'P_ALT_NAMES', 'P_ATMOSPHERE', 'S_DISC', 'S_MAGNETIC_FIELD']


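Columns containing 100% missing values were removed because they have no observed values to impute from.
scikit-learn's `KNNImputer` also discards such columns from its output by default, which would misalign the imputed array with `num_cols`.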
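A small check of how `KNNImputer` treats an entirely empty column, using a made-up array rather than the notebook's data. With default settings the empty column is expected to disappear from the output, so the result has fewer columns than the input.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy array (assumed for illustration): the middle column is entirely missing
X = np.array([
    [1.0, np.nan, 2.0],
    [3.0, np.nan, np.nan],
    [5.0, np.nan, 6.0],
    [7.0, np.nan, 8.0],
])

imputed = KNNImputer(n_neighbors=2).fit_transform(X)

print("input shape: ", X.shape)
print("output shape:", imputed.shape)
```

Dropping such columns up front, as the cell above does, keeps the column list and the imputed array in step.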


In [42]:
# Imputation
from sklearn.impute import KNNImputer

# Initialize KNN Imputer
knn_imputer = KNNImputer(
    n_neighbors=5,
    weights="distance"
)

# Apply imputation only on numerical columns
imputed_array = knn_imputer.fit_transform(df[num_cols])

# Convert back to a DataFrame to preserve column names and index
# (int64 columns become float64, since the imputer returns a float array)
df[num_cols] = pd.DataFrame(
    imputed_array,
    columns=num_cols,
    index=df.index
)


KNN Imputation was chosen for numerical features because it fills each gap from the most similar rows,
which tends to preserve relationships between features better than a single global statistic such as the column mean.
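The KNN cell above computes distances on raw column values, so features with large numeric ranges can dominate the neighbor search. One common variation, which this notebook does not use, is to standardize before imputing and then undo the scaling. The sketch below shows that variation on made-up data; the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Toy data (assumed for illustration): two features on very different scales
toy = pd.DataFrame({
    "mass": [1.0, 2.0, np.nan, 4.0, 5.0],
    "distance": [1000.0, 2000.0, 3000.0, np.nan, 5000.0],
})

# StandardScaler ignores NaNs when fitting and leaves them in place on transform
scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Impute in the scaled space so both features weigh equally in the distance
imputed_scaled = KNNImputer(n_neighbors=2, weights="distance").fit_transform(scaled)

# Map the result back to the original units
imputed = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled),
    columns=toy.columns,
    index=toy.index,
)
print(imputed)
```

Whether scaling improves the imputed values depends on the data, so it is worth comparing both variants before adopting one.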


In [43]:
# Identify categorical columns
cat_cols = df.select_dtypes(include=["object"]).columns.tolist()

# Mode imputation for categorical columns
for col in cat_cols:
    if df[col].isnull().any():
        df[col] = df[col].fillna(df[col].mode()[0])


Categorical features were imputed with the mode (the most frequent value), since the distance
computation in KNN requires numeric input and cannot be applied directly to unencoded string labels.
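One detail of `mode()[0]` worth keeping in mind: when several values tie for most frequent, pandas returns all of them in sorted order, so the first one is used as the fill value. A minimal sketch on a made-up series:

```python
import pandas as pd

# Toy labels (assumed for illustration): "a" and "b" are tied, one value is missing
labels = pd.Series(["b", "a", "b", "a", None])

modes = labels.mode()            # both tied values, in sorted order
filled = labels.fillna(modes[0])  # the first tied value fills the gap

print(modes.tolist())
print(filled.tolist())
```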


In [44]:
# Final missing value check
df.isnull().sum().sort_values(ascending=False).head(10)


column,missing_count
P_NAME,0
P_STATUS,0
P_MASS,0
P_MASS_ERROR_MIN,0
P_MASS_ERROR_MAX,0
P_RADIUS,0
P_RADIUS_ERROR_MIN,0
P_RADIUS_ERROR_MAX,0
P_YEAR,0
P_UPDATED,0


After applying the above imputation strategies, the dataset contains no missing values
and is ready for further preprocessing and modeling.
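The visual check can also be turned into a hard failure so that later changes to the notebook cannot silently reintroduce gaps. The helper below is a hypothetical sketch, not part of the original notebook, shown on made-up frames.

```python
import numpy as np
import pandas as pd


def assert_no_missing(frame):
    """Raise ValueError naming any columns that still contain missing values."""
    counts = frame.isnull().sum()
    remaining = counts[counts > 0]
    if not remaining.empty:
        raise ValueError(f"Missing values remain in: {remaining.to_dict()}")
    return True


# Toy frames (assumed for illustration)
complete = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})
incomplete = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", "y"]})

check_passed = assert_no_missing(complete)

try:
    assert_no_missing(incomplete)
    check_failed = False
except ValueError as err:
    check_failed = True
    print(err)
```

In the notebook this would be called as `assert_no_missing(df)` right before saving.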


In [45]:
# Save the cleaned dataset for downstream tasks
df.to_csv("cleaned_imputed_data.csv", index=False)
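A caveat when reloading the saved file: `pd.read_csv` treats certain strings, such as `NA`, as missing by default, so a categorical label spelled that way would come back as NaN. The sketch below shows the effect on made-up data written to an in-memory buffer; the column names are hypothetical.

```python
import io
import pandas as pd

# Toy data (assumed for illustration): "NA" here is a real label, not a gap
clean = pd.DataFrame({"name": ["p1", "p2"], "label": ["Jovian", "NA"]})

buffer = io.StringIO()
clean.to_csv(buffer, index=False)

# Default parsing converts the "NA" string into a missing value
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps such strings as they were written
buffer.seek(0)
strict_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read.isnull().sum().sum())
print(strict_read.isnull().sum().sum())
```

If the cleaned dataset contains labels like these, rerunning the missing-value check after loading is a cheap safeguard.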


## Inference & Observations

- Numerical features were imputed using KNN Imputer to maintain inter-feature relationships.
- Columns with 100% missing values were removed as imputation was infeasible.
- Categorical features were imputed using mode imputation.
- Post-imputation validation confirms zero missing values across the dataset.
- The cleaned dataset is now suitable for modeling tasks such as habitability prediction.
