
# ExoHabitAI – Module 2: Data Cleaning & Feature Engineering

📌 **Objective**  
Transform raw exoplanet data from Module 1 into a **clean, feature-rich, scientifically
meaningful dataset** ready for machine learning.

⚠️ **Important**  
- Logic is **UNCHANGED** from the original Module 2 code  
- Only notebook structure & explanations are added


## 1️⃣ Imports

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3
from sklearn.preprocessing import StandardScaler, LabelEncoder
import warnings
warnings.filterwarnings('ignore')


## 2️⃣ ExoplanetDataProcessor Class

Responsibilities:
- Load data from Module 1
- Handle missing values intelligently
- Create physical & habitability features
- Normalize features
- Save ML-ready dataset


### 2.1 `__init__()` – Initialize Processor

In [18]:
class ExoplanetDataProcessor:
    def __init__(self, source_path='exoplanet_data.db', source_type='database'):
        self.source_path = source_path
        self.source_type = source_type
        self.df = None
        self.scaler = StandardScaler()
        self.label_encoder = LabelEncoder()

    def load_data(self):
        if self.source_type == 'database':
            try:
                conn = sqlite3.connect(self.source_path)
                try:
                    self.df = pd.read_sql_query("SELECT * FROM exoplanets_raw", conn)
                finally:
                    conn.close()  # close the connection even if the query fails
            except Exception:
                return self.load_csv_fallback()
        else:
            try:
                self.df = pd.read_csv(self.source_path)
            except Exception:
                return self.load_csv_fallback()

        self.display_dataset_info()
        return True

    def load_csv_fallback(self):
        # Try each known CSV name in order; return False if none can be read
        for f in ['exoplanets_raw.csv','exoplanets_processed.csv','exoplanet.csv']:
            try:
                self.df = pd.read_csv(f)
                self.display_dataset_info()
                return True
            except Exception:
                continue
        return False

    def display_dataset_info(self):
        print(f"Records: {len(self.df)}, Columns: {len(self.df.columns)}")
        print(self.df.head(3))

    def calculate_missing_stats(self):
        return self.df.isnull().sum().sort_values(ascending=False)

    def handle_missing_values_advanced(self):
        if 'pl_name' in self.df.columns:
            # .copy() avoids writing into a view of the original frame
            self.df = self.df[self.df['pl_name'].notna()].copy()

        numeric_cols = self.df.select_dtypes(include=[np.number]).columns
        for col in numeric_cols:
            if self.df[col].isnull().mean() < 0.5:
                # Assign back instead of a chained inplace fill
                self.df[col] = self.df[col].fillna(self.df[col].median())
        return self.df

    def calculate_derived_features(self):
        if all(c in self.df.columns for c in ['pl_bmasse','pl_rade']):
            self.df['pl_dens'] = self.df['pl_bmasse'] / (self.df['pl_rade'] ** 3)
        return self.df

    def create_habitability_scores(self):
        self.df['habitability_score'] = 0
        if 'pl_eqt' in self.df.columns:
            self.df['habitability_score'] += self.df['pl_eqt'].between(250,300).astype(int)*50
        if 'pl_rade' in self.df.columns:
            self.df['habitability_score'] += self.df['pl_rade'].between(0.8,1.5).astype(int)*50

        self.df['is_potentially_habitable'] = (self.df['habitability_score'] >= 50).astype(int)
        return self.df

    def normalize_features(self):
        numeric_cols = self.df.select_dtypes(include=[np.number]).columns
        self.df[numeric_cols] = self.scaler.fit_transform(self.df[numeric_cols])
        return self.df

    def save_processed_data(self, filename='exoplanets_cleaned_ready.csv'):
        self.df.to_csv(filename, index=False)
        return filename


### 2.2 `load_data()`

Loads data from:
- SQLite database (Module 1 default)
- CSV fallback if database fails


In [19]:
def load_data(self):
        if self.source_type == 'database':
            try:
                conn = sqlite3.connect(self.source_path)
                try:
                    self.df = pd.read_sql_query("SELECT * FROM exoplanets_raw", conn)
                finally:
                    conn.close()  # close the connection even if the query fails
            except Exception:
                return self.load_csv_fallback()
        else:
            try:
                self.df = pd.read_csv(self.source_path)
            except Exception:
                return self.load_csv_fallback()

        self.display_dataset_info()
        return True


### 2.3 `load_csv_fallback()`

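Fallback loader used when the primary source fails. It tries a fixed list of CSV filenames in order and returns `False` if none can be read.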


In [20]:
def load_csv_fallback(self):
        for f in ['exoplanets_raw.csv','exoplanets_processed.csv','exoplanet.csv']:
            try:
                self.df = pd.read_csv(f)
                self.display_dataset_info()
                return True
            except Exception:
                continue
        return False


### 2.4 `display_dataset_info()`

Prints the record count, the column count, and the first three rows.


In [21]:
def display_dataset_info(self):
        print(f"Records: {len(self.df)}, Columns: {len(self.df.columns)}")
        print(self.df.head(3))


### 2.5 `calculate_missing_stats()`

Returns the number of missing values per column, sorted from most to least missing.


In [22]:
def calculate_missing_stats(self):
        return self.df.isnull().sum().sort_values(ascending=False)
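
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch, not part of the module: `missing_summary` is a hypothetical helper, and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("missing_count", ascending=False)


# Toy data for illustration only (not real catalogue values)
toy = pd.DataFrame({
    "pl_name": ["a", "b", "c", "d"],
    "pl_rade": [1.0, np.nan, 2.5, np.nan],
    "pl_eqt": [255.0, 300.0, np.nan, 900.0],
})

summary = missing_summary(toy)
print(summary)
```

The percentage column lines up directly with the 50% threshold used by the imputation step.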


### 2.6 `handle_missing_values_advanced()`

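Applies **threshold-based median imputation**:
- Rows without a planet name (`pl_name`) are dropped
- Numeric columns with less than 50% missing values are filled with the column median
- Numeric columns with 50% or more missing values, and all non-numeric columns, are left unchanged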


In [23]:
def handle_missing_values_advanced(self):
        if 'pl_name' in self.df.columns:
            # .copy() avoids writing into a view of the original frame
            self.df = self.df[self.df['pl_name'].notna()].copy()

        numeric_cols = self.df.select_dtypes(include=[np.number]).columns
        for col in numeric_cols:
            if self.df[col].isnull().mean() < 0.5:
                # Assign back instead of a chained inplace fill
                self.df[col] = self.df[col].fillna(self.df[col].median())
        return self.df
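
To see the 50% rule from the cell above in action, here is a small standalone sketch on made-up data. The function name `impute_numeric_median` and its `max_missing` parameter are assumptions introduced for illustration; the logic mirrors the median fill in the method.

```python
import numpy as np
import pandas as pd


def impute_numeric_median(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Median-fill numeric columns whose missing fraction is below max_missing."""
    out = df.copy()  # leave the caller's frame untouched
    for col in out.select_dtypes(include=[np.number]).columns:
        if out[col].isnull().mean() < max_missing:
            out[col] = out[col].fillna(out[col].median())
    return out


# Toy data for illustration only
toy = pd.DataFrame({
    "pl_rade": [1.0, 2.0, np.nan, 3.0],          # 25% missing: filled with the median
    "pl_bmasse": [np.nan, np.nan, np.nan, 5.0],  # 75% missing: left as-is
})

filled = impute_numeric_median(toy)
print(filled)
```

Columns above the threshold keep their gaps, so any downstream step that cannot handle `NaN` still needs to drop or treat them.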


### 2.7 `calculate_derived_features()`

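Creates physics-based features. The code in this cell computes one of them:
- Density, as `pl_bmasse / pl_rade ** 3`, stored in `pl_dens`

Surface gravity, escape velocity, and habitable zone flags are not computed in this cell.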


In [24]:
def calculate_derived_features(self):
        if all(c in self.df.columns for c in ['pl_bmasse','pl_rade']):
            self.df['pl_dens'] = self.df['pl_bmasse'] / (self.df['pl_rade'] ** 3)
        return self.df
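
Surface gravity and escape velocity follow the same scaling approach as density. The sketch below assumes `pl_bmasse` and `pl_rade` are expressed in Earth masses and Earth radii, so each result is a multiple of Earth's value. The function and the output column names (`density_rel`, `gravity_rel`, `v_esc_rel`) are hypothetical, not the module's own. Habitable zone flags are left out because the source does not say how they are defined.

```python
import numpy as np
import pandas as pd


def add_earth_relative_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add density, surface gravity and escape velocity relative to Earth."""
    out = df.copy()
    if {"pl_bmasse", "pl_rade"}.issubset(out.columns):
        mass, radius = out["pl_bmasse"], out["pl_rade"]
        out["density_rel"] = mass / radius**3      # rho ~ M / R^3
        out["gravity_rel"] = mass / radius**2      # g ~ M / R^2
        out["v_esc_rel"] = np.sqrt(mass / radius)  # v_esc ~ sqrt(M / R)
    return out


# Toy data: an Earth twin and a planet with 8 Earth masses, 2 Earth radii
toy = pd.DataFrame({"pl_bmasse": [1.0, 8.0], "pl_rade": [1.0, 2.0]})
features = add_earth_relative_features(toy)
print(features)
```

The second planet has Earth's density but twice its surface gravity and escape velocity, which shows why density alone does not capture these properties.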


### 2.8 `create_habitability_scores()`

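Combines two signals into a **0–100 habitability score**: 50 points for an equilibrium temperature (`pl_eqt`) between 250 and 300, and 50 points for a radius (`pl_rade`) between 0.8 and 1.5. Both ranges are inclusive.
A planet gets the binary label `is_potentially_habitable = 1` when the score is at least 50, that is, when either criterion is met.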


In [25]:
def create_habitability_scores(self):
        self.df['habitability_score'] = 0
        if 'pl_eqt' in self.df.columns:
            self.df['habitability_score'] += self.df['pl_eqt'].between(250,300).astype(int)*50
        if 'pl_rade' in self.df.columns:
            self.df['habitability_score'] += self.df['pl_rade'].between(0.8,1.5).astype(int)*50

        self.df['is_potentially_habitable'] = (self.df['habitability_score'] >= 50).astype(int)
        return self.df
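
A worked example makes the scoring rule concrete. This sketch copies the scoring logic from the cell above into a standalone function (`score_habitability` is a name chosen here for illustration) and applies it to three made-up planets.

```python
import pandas as pd


def score_habitability(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the two 50-point criteria and derive the binary label."""
    out = df.copy()
    out["habitability_score"] = 0
    if "pl_eqt" in out.columns:
        out["habitability_score"] += out["pl_eqt"].between(250, 300).astype(int) * 50
    if "pl_rade" in out.columns:
        out["habitability_score"] += out["pl_rade"].between(0.8, 1.5).astype(int) * 50
    out["is_potentially_habitable"] = (out["habitability_score"] >= 50).astype(int)
    return out


# Toy planets for illustration only
toy = pd.DataFrame({
    "pl_name": ["temperate rocky", "temperate giant", "hot giant"],
    "pl_eqt": [255.0, 280.0, 1500.0],
    "pl_rade": [1.0, 4.0, 11.0],
})

scored = score_habitability(toy)
print(scored[["pl_name", "habitability_score", "is_potentially_habitable"]])
```

The "temperate giant" scores 50 and is still labeled potentially habitable, because the `>= 50` threshold is satisfied by a single criterion.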


### 2.9 `normalize_features()`

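Applies z-score normalization (zero mean, unit variance) to every numeric column using `StandardScaler`. As written, this includes `habitability_score` and `is_potentially_habitable` from the previous step.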


In [26]:
def normalize_features(self):
        numeric_cols = self.df.select_dtypes(include=[np.number]).columns
        self.df[numeric_cols] = self.scaler.fit_transform(self.df[numeric_cols])
        return self.df
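
The method above scales every numeric column, score and label columns included. If the labels should stay as 0/1 for a later classifier, one option is to exclude them. The sketch below is one possible variant, not the module's behavior: `zscore_features` and its `exclude` parameter are hypothetical. It uses plain pandas with the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows.

```python
import pandas as pd


def zscore_features(
    df: pd.DataFrame,
    exclude=("habitability_score", "is_potentially_habitable"),
) -> pd.DataFrame:
    """Z-score numeric columns, leaving the excluded columns untouched."""
    out = df.copy()
    cols = [c for c in out.select_dtypes(include="number").columns if c not in exclude]
    mean = out[cols].mean()
    # Constant columns have zero std; divide by 1 instead to avoid NaN
    std = out[cols].std(ddof=0).replace(0, 1.0)
    out[cols] = (out[cols] - mean) / std
    return out


# Toy data for illustration only
toy = pd.DataFrame({
    "pl_rade": [1.0, 2.0, 3.0],
    "pl_eqt": [200.0, 300.0, 400.0],
    "is_potentially_habitable": [0, 1, 0],
})

scaled = zscore_features(toy)
print(scaled)
```

After scaling, each feature column has mean 0 and standard deviation 1, while the label column keeps its original values.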


### 2.10 `save_processed_data()`

Saves the final cleaned dataset to a CSV file and returns the filename.


In [27]:
def save_processed_data(self, filename='exoplanets_cleaned_ready.csv'):
        self.df.to_csv(filename, index=False)
        return filename

## 3️⃣ Run Module 2 Pipeline

In [28]:
processor = ExoplanetDataProcessor()
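if not processor.load_data():
    # load_data() returns False when neither the database nor any CSV could be read
    raise RuntimeError("No data source could be loaded")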
processor.handle_missing_values_advanced()
processor.calculate_derived_features()
processor.create_habitability_scores()
processor.normalize_features()
processor.save_processed_data()

Records: 39119, Columns: 94
   loc_rowid   pl_name hostname  default_flag  sy_snum  sy_pnum  \
0          1  11 Com b   11 Com             1        2        1   
1          2  11 Com b   11 Com             0        2        1   
2          3  11 Com b   11 Com             0        2        1   

   discoverymethod  disc_year     disc_facility              soltype  ...  \
0  Radial Velocity       2007  Xinglong Station  Published Confirmed  ...   
1  Radial Velocity       2007  Xinglong Station  Published Confirmed  ...   
2  Radial Velocity       2007  Xinglong Station  Published Confirmed  ...   

   sy_kmag sy_kmagerr1  sy_kmagerr2  sy_gaiamag  sy_gaiamagerr1  \
0    2.282       0.346       -0.346     4.44038        0.003848   
1    2.282       0.346       -0.346     4.44038        0.003848   
2    2.282       0.346       -0.346     4.44038        0.003848   

   sy_gaiamagerr2   rowupdate  pl_pubdate  releasedate  pl_dens  
0       -0.003848  2023-09-19     2023-08   2023-09-19     

'exoplanets_cleaned_ready.csv'


## ✅ Module 2 Completed

Outputs:
- exoplanets_cleaned_ready.csv
- habitability scores & labels

➡️ Next: **Module 3 – ML Dataset Preparation**
