## 03 — Feature Engineering and Baseline Modeling

In this notebook, we will:
1. Load the **BLS results** from the previous notebook.
2. Create a structured **feature table** for modeling (period, depth, SNR, duration, etc.).
3. Engineer additional derived features such as:
   - Transit duty cycle
   - Log-scaled features
   - Ratios and normalized quantities
4. Build a **baseline** for ranking transit candidates: a rule-based score for now, to be replaced by a machine learning model once more candidates and labels are available.
5. Evaluate model performance and save outputs for further analysis.


### Step 1 — Load Processed Results

We'll import the results generated by the previous notebook (`02_bls_periodogram.ipynb`):
the transit parameters and, if present, the detrended flux data.


In [4]:
import warnings
warnings.filterwarnings("ignore")

In [5]:
import pandas as pd
from pathlib import Path

# Path to processed data
data_dir = Path(r"E:\Data science\projects\exoplanet-discovery\data\processed")

# Load BLS results
bls_results = pd.read_csv(data_dir / "bls_results.csv")
print("✅ Loaded BLS results:")
display(bls_results)

# Load detrended light curve (if available)
try:
    lc_detrended = pd.read_csv(data_dir / "detrended_lightcurve.csv")
    print(f"\n✅ Loaded detrended light curve with {len(lc_detrended)} rows.")
except FileNotFoundError:
    lc_detrended = None  # keep the name defined so later cells can check for it
    print("⚠️ Detrended light curve not found. Proceeding with parameter-based features only.")


✅ Loaded BLS results:


|   | period_days | t0_bkjd | duration_days | duration_hours | depth_relative | depth_ppm | snr | n_transits_covered |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.837538 | 131.572315 | 0.07 | 1.68 | 0.000165 | 164.985406 | 2.548123 | 39.963634 |



✅ Loaded detrended light curve with 1432 rows.


### Step 2 — Create Additional Derived Features

We add useful derived features for machine learning models:
- `log_period`: log10 of orbital period (days)
- `log_duration`: log10 of transit duration (hours)
- `log_snr`: log10 of signal-to-noise ratio
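- `duty_cycle`: transit duration divided by orbital period (both in days), the fraction of each orbit spent in transit
- `depth_to_duration`: relative transit depth divided by transit duration (in days)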


In [6]:
import numpy as np

features = bls_results.copy()

features["log_period"] = np.log10(features["period_days"])
features["log_duration"] = np.log10(features["duration_hours"])
# Row-wise guard: non-positive SNR values become NaN (the old check only looked at the first row)
features["log_snr"] = np.log10(features["snr"].where(features["snr"] > 0))
features["duty_cycle"] = features["duration_days"] / features["period_days"]
features["depth_to_duration"] = features["depth_relative"] / features["duration_days"]

display(features)


|   | period_days | t0_bkjd | duration_days | duration_hours | depth_relative | depth_ppm | snr | n_transits_covered | log_period | log_duration | log_snr | duty_cycle | depth_to_duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.837538 | 131.572315 | 0.07 | 1.68 | 0.000165 | 164.985406 | 2.548123 | 39.963634 | -0.076995 | 0.225309 | 0.40622 | 0.083578 | 0.002357 |
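Once the table holds more than one candidate, the log features need a guard that works row by row. The following is a minimal sketch of one way to do that; `safe_log10` is a hypothetical helper name, and the `demo` values other than the first are made up for illustration.

```python
import numpy as np
import pandas as pd

def safe_log10(series: pd.Series) -> pd.Series:
    """Return log10 of positive entries; non-positive or missing entries become NaN."""
    values = series.astype(float)
    # Series.where keeps values where the condition holds and inserts NaN elsewhere,
    # so np.log10 never receives zero or negative input.
    return np.log10(values.where(values > 0))

# First value is the SNR from the table above; the rest are hypothetical.
demo = pd.DataFrame({"snr": [2.548123, 0.0, -1.0, 100.0]})
demo["log_snr"] = safe_log10(demo["snr"])
print(demo)
```

The same helper could be applied to `period_days` and `duration_hours`, which should always be positive but are cheap to guard as well.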


### Step 3 — Baseline Candidate Scoring

Because we currently have only **one** candidate, a supervised ML model is not meaningful yet.  
Instead, we compute a **rule-based score** combining physically-motivated signals:

- **log_snr:** higher is better (stronger detection)
- **depth_ppm:** deeper transits are easier to confirm (up to a point)
- **duty_cycle = duration_days / period_days:** smaller is generally more planet-like (short duty)
- **n_transits_covered:** more observed transits is better

We normalize each term over reasonable astrophysical ranges and combine them into a single score in [0, 1].
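Written out, the score is a weighted sum of clipped, min-max normalized terms. The expression below restates the ranges and weights used in the code cell that follows; the symbols $n$, $S$, $d$, and $N_\mathrm{tr}$ are notation introduced here for illustration only.

$$
n(x;\,lo,\,hi) = \frac{\min(\max(x,\,lo),\,hi) - lo}{hi - lo}
$$

$$
S = 0.45\,n(\log_{10}\mathrm{SNR};\,-1,\,1.2) + 0.35\,n(\mathrm{depth}_\mathrm{ppm};\,50,\,3000) + 0.15\,\bigl[1 - n(d;\,0.005,\,0.10)\bigr] + 0.05\,n(N_\mathrm{tr};\,1,\,20)
$$

Here $d$ is the duty cycle and $N_\mathrm{tr}$ is `n_transits_covered`. Each normalized term lies in [0, 1] and the weights sum to 1, so $S$ also lies in [0, 1].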


In [7]:
import numpy as np

scored = features.copy()

# --- Define robust ranges for normalization (clip to [lo, hi]) ---
ranges = {
    "log_snr": (-1.0, 1.2),           # ~ SNR from 0.1 to ~16 (log10)
    "depth_ppm": (50.0, 3000.0),      # very shallow to quite deep
    "duty_cycle": (0.005, 0.10),      # 0.5% to 10%
    "n_transits_covered": (1.0, 20.0) # at least one up to ~20
}

def norm(x, lo, hi, invert=False):
    x = np.clip(x, lo, hi)
    y = (x - lo) / (hi - lo + 1e-12)
    return 1.0 - y if invert else y

# Extract values (scalars here, but code works vectorized too)
log_snr  = scored["log_snr"].astype(float).values
depth_ppm = scored["depth_ppm"].astype(float).values
duty     = scored["duty_cycle"].astype(float).values
ntr      = scored["n_transits_covered"].astype(float).values

# Normalize (duty uses invert=True because smaller is better)
nsnr   = norm(log_snr,  *ranges["log_snr"])
ndepth = norm(depth_ppm, *ranges["depth_ppm"])
nduty  = norm(duty,      *ranges["duty_cycle"], invert=True)
ncover = norm(ntr,       *ranges["n_transits_covered"])

# Weighted score (tweakable weights)
w_snr, w_depth, w_duty, w_cover = 0.45, 0.35, 0.15, 0.05
score = w_snr*nsnr + w_depth*ndepth + w_duty*nduty + w_cover*ncover

scored["score"] = score
display(scored[[
    "period_days","duration_hours","depth_ppm","snr","n_transits_covered",
    "duty_cycle","log_snr","score"
]])
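# Index the first row explicitly: float() on an array fails once there is more than one candidate
print("Candidate score (0-1):", float(score[0]))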


|   | period_days | duration_hours | depth_ppm | snr | n_transits_covered | duty_cycle | log_snr | score |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.837538 | 1.68 | 164.985406 | 2.548123 | 39.963634 | 0.083578 | 0.40622 | 0.377207 |


Candidate score (0-1): 0.37720736526793663


**Result:** A single scalar **score** in [0, 1] that rates how promising the candidate is, using simple, physics-aware heuristics.  
Once we have more targets, this feature table can serve as input for ML training.
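To see which terms drive the score, the arithmetic can be reproduced by hand. This is a small sketch in plain Python that reuses the ranges and weights from the cell above and the rounded feature values from the output table; the `terms` dictionary is introduced here for illustration.

```python
def norm(x, lo, hi, invert=False):
    """Clip x to [lo, hi] and rescale to [0, 1]; invert flips the direction."""
    x = min(max(x, lo), hi)
    y = (x - lo) / (hi - lo)
    return 1.0 - y if invert else y

# Rounded values copied from the candidate row shown above.
log_snr, depth_ppm, duty_cycle, n_transits = 0.40622, 164.985406, 0.083578, 39.963634

# Weighted contribution of each term to the final score.
terms = {
    "snr":   0.45 * norm(log_snr, -1.0, 1.2),
    "depth": 0.35 * norm(depth_ppm, 50.0, 3000.0),
    "duty":  0.15 * norm(duty_cycle, 0.005, 0.10, invert=True),
    "cover": 0.05 * norm(n_transits, 1.0, 20.0),  # 39.96 is clipped to the upper bound of 20
}
score = sum(terms.values())

for name, value in terms.items():
    print(f"{name:>5}: {value:.4f}")
print(f"score: {score:.4f}")
```

The SNR term contributes roughly 0.29 of the 0.38 total, the coverage term is saturated at its maximum of 0.05, and the shallow depth (about 165 ppm) adds very little.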


### Step 3.1 — Save the feature table with the baseline score


In [8]:
from pathlib import Path

out_dir = Path(r"E:\Data science\projects\exoplanet-discovery\data\processed")
out_dir.mkdir(parents=True, exist_ok=True)
scored.to_csv(out_dir / "features_with_score.csv", index=False)
print("Saved:", out_dir / "features_with_score.csv")


Saved: E:\Data science\projects\exoplanet-discovery\data\processed\features_with_score.csv


## ✅ Conclusion (Notebook 03)

- Loaded BLS-derived parameters and engineered additional features.
- Created a **baseline candidate score** (0–1) based on SNR, depth, duty cycle, and transit coverage.
- Saved the resulting table to `data/processed/features_with_score.csv`.

### Next steps
- Process multiple targets to build a richer dataset.
- Replace the baseline score with a trained ML model once labels are available.
- Promote final figures and tables to `reports/` and update the project README.
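As a starting point for the multi-target step, the baseline score can be applied to a whole table and used to sort candidates. The sketch below assumes the same ranges and weights as Step 3; `baseline_score`, the `target` column, and rows `B` and `C` are hypothetical and exist only to illustrate the ranking (row `A` repeats the candidate from this notebook).

```python
import pandas as pd

# Normalization ranges and weights copied from Step 3.
RANGES = {
    "log_snr": (-1.0, 1.2),
    "depth_ppm": (50.0, 3000.0),
    "duty_cycle": (0.005, 0.10),
    "n_transits_covered": (1.0, 20.0),
}
WEIGHTS = {"log_snr": 0.45, "depth_ppm": 0.35, "duty_cycle": 0.15, "n_transits_covered": 0.05}
INVERTED = {"duty_cycle"}  # smaller is better for these columns

def baseline_score(table: pd.DataFrame) -> pd.Series:
    """Weighted sum of clipped, min-max normalized features, one score per row."""
    total = pd.Series(0.0, index=table.index)
    for col, (lo, hi) in RANGES.items():
        y = (table[col].astype(float).clip(lo, hi) - lo) / (hi - lo)
        if col in INVERTED:
            y = 1.0 - y
        total += WEIGHTS[col] * y
    return total

candidates = pd.DataFrame({
    "target": ["A", "B", "C"],  # B and C are made-up rows
    "log_snr": [0.40622, 1.0, 0.1],
    "depth_ppm": [164.985406, 900.0, 80.0],
    "duty_cycle": [0.083578, 0.02, 0.09],
    "n_transits_covered": [39.963634, 12.0, 3.0],
})
candidates["score"] = baseline_score(candidates)
ranked = candidates.sort_values("score", ascending=False).reset_index(drop=True)
print(ranked[["target", "score"]])
```

A row with a missing feature ends up with a NaN score in this sketch, which makes incomplete candidates easy to spot before any model training.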

