# 01 — Exploratory Data Analysis (EDA)

Course: **Machine Learning and Artificial Intelligence for Astronomy**  
Dataset: **NASA Exoplanet Archive export** (`nea_2026_jan.csv`)

**Goals**
- Understand the structure, quality, and limitations of an exoplanet catalog
- Diagnose missing values, ranges, correlations, and outliers **before** modeling

⚠️ In this notebook we **do not** impute, normalize, or select features. We only **diagnose**.


## 0. Setup
Import the minimal EDA toolkit.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from pathlib import Path

# Plot settings (simple + readable)
plt.rcParams["figure.dpi"] = 120
sns.set_context("notebook")


## 1. Load the dataset


In [2]:
# Candidate locations relative to this notebook; add more paths if your layout differs
DATA_PATHS = [
    Path("../data/nea_2026_jan.csv"),
]

data_path = next((p for p in DATA_PATHS if p.exists()), None)
if data_path is None:
    raise FileNotFoundError(
        "Could not find nea_2026_jan.csv. Tried:\n"
        + "\n".join([str(p) for p in DATA_PATHS])
        + "\n\nEdit DATA_PATHS to match your repository structure."
    )

df = pd.read_csv(data_path)
print(f"Loaded: {data_path}  |  shape={df.shape}")


Loaded: ..\data\nea_2026_jan.csv  |  shape=(6065, 14)


## 2. First look at the data
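
A first look usually means checking the shape, a few rows, and the type of each column. The helper below is a minimal sketch of that; the name `first_look` is hypothetical and not part of the course code.

```python
import pandas as pd


def first_look(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Print the shape and the first n rows; return a per-column overview."""
    print(f"rows={df.shape[0]}, columns={df.shape[1]}")
    print(df.head(n))
    overview = pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),   # storage type of each column
            "non_null": df.count(),           # number of non-missing entries
            "n_unique": df.nunique(),         # distinct non-missing values
        }
    )
    return overview
```

In this notebook you would call `first_look(df)` on the table loaded above.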


## 3. Basic descriptive statistics
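
A possible way to get ranges and quantiles for the numeric columns is `DataFrame.describe`. The sketch below adds the 1st and 99th percentiles, which can help spot extreme values early; the helper name `describe_numeric` and the chosen percentiles are assumptions.

```python
import pandas as pd


def describe_numeric(df: pd.DataFrame,
                     percentiles=(0.01, 0.25, 0.5, 0.75, 0.99)) -> pd.DataFrame:
    """Summary statistics for numeric columns, one row per column."""
    numeric = df.select_dtypes(include="number")
    # Transpose so each feature is a row, which is easier to scan
    return numeric.describe(percentiles=list(percentiles)).T
```

Compare `min` and `max` against physically plausible ranges; this is only a diagnosis, nothing is modified.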

## 4. Missing values
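
A simple diagnostic is the count and fraction of missing entries per column, sorted from worst to best. This is a minimal sketch; `missing_table` is a hypothetical helper name.

```python
import pandas as pd


def missing_table(df: pd.DataFrame) -> pd.DataFrame:
    """Count and fraction of missing values per column, most incomplete first."""
    n_missing = df.isna().sum()
    table = pd.DataFrame(
        {
            "n_missing": n_missing,
            "frac_missing": n_missing / len(df),
        }
    )
    return table.sort_values("frac_missing", ascending=False)
```

Nothing is imputed here; the table only shows where the gaps are.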


### Optional: visualize missingness (sample)

Heatmaps can be slow to render for large tables, so subsample rows first.
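
One way to do this, sketched with plain matplotlib: draw a random subset of rows and show the boolean missingness mask as an image. The helper name `plot_missingness`, the default of 500 rows, and the fixed seed are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_missingness(df: pd.DataFrame, n_rows: int = 500, seed: int = 0):
    """Show which cells are missing in a random subsample of rows."""
    sample = df.sample(n=min(n_rows, len(df)), random_state=seed).sort_index()
    mask = sample.isna().to_numpy().astype(float)  # 1.0 = missing, 0.0 = present

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.imshow(mask, aspect="auto", interpolation="nearest", cmap="gray_r")
    ax.set_xticks(range(sample.shape[1]))
    ax.set_xticklabels(sample.columns, rotation=90)
    ax.set_xlabel("column")
    ax.set_ylabel("sampled row")
    ax.set_title(f"Missing values (dark = missing), {len(sample)} sampled rows")
    return fig, ax
```

Vertical dark bands indicate columns that are mostly empty; rows that are dark across many columns are poorly characterized objects.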


## 5. Choose key columns for the course



- Mass (Earth masses): `pl_bmasse`
- Radius (Earth radii): `pl_rade`
- Stellar metallicity: `st_met`


If any of these is `None` (the column is absent from your export), pick the correct column by inspecting `df.columns`.
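
One way to get the `None` behavior mentioned above is to check each expected name against `df.columns`. This is a sketch; the `KEY_COLUMNS` mapping and the `resolve_key_columns` helper are hypothetical names, and only the three column names come from the list above.

```python
import pandas as pd

# Role -> expected column name (taken from the list above)
KEY_COLUMNS = {
    "mass": "pl_bmasse",
    "radius": "pl_rade",
    "metallicity": "st_met",
}


def resolve_key_columns(df: pd.DataFrame, wanted: dict = KEY_COLUMNS) -> dict:
    """Map each role to its column name, or to None if the column is absent."""
    return {
        role: (name if name in df.columns else None)
        for role, name in wanted.items()
    }
```

Calling `resolve_key_columns(df)` returns a dictionary; any role mapped to `None` needs a manual choice.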


## 6. Univariate distributions

Always label axes with units.
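
A minimal histogram sketch that forces an explicit axis label and reports how many values were actually plotted. The helper name `plot_hist` and the bin count are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_hist(df: pd.DataFrame, column: str, xlabel: str, bins: int = 50):
    """Histogram of one column, missing values dropped, with a labeled x axis."""
    values = df[column].dropna()
    fig, ax = plt.subplots()
    ax.hist(values, bins=bins)
    ax.set_xlabel(xlabel)          # include the unit, e.g. Earth radii
    ax.set_ylabel("Number of rows")
    ax.set_title(f"{column} (n={len(values)})")
    return fig, ax
```

For example, `plot_hist(df, "pl_rade", r"Planet radius [$R_\oplus$]")` or `plot_hist(df, "st_met", "Stellar metallicity [dex]")`.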


## 7. Bivariate relationship: Mass–Radius
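
A basic scatter plot of the two key columns, using only rows where both are present. This is a sketch; `plot_mass_radius` is a hypothetical helper, and the default column names are the ones listed in Section 5.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_mass_radius(df: pd.DataFrame,
                     mass_col: str = "pl_bmasse",
                     radius_col: str = "pl_rade"):
    """Scatter plot of radius against mass for rows that have both values."""
    pair = df[[mass_col, radius_col]].dropna()
    fig, ax = plt.subplots()
    ax.scatter(pair[mass_col], pair[radius_col], s=8, alpha=0.5)
    ax.set_xlabel(r"Planet mass [$M_\oplus$]")
    ax.set_ylabel(r"Planet radius [$R_\oplus$]")
    ax.set_title(f"Mass-radius (n={len(pair)})")
    return fig, ax
```

The count in the title makes it visible how many rows were dropped because one of the two values is missing.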

### Optional: log scaling

Mass–radius relations are often easier to see in log space.
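
A sketch of the same plot on logarithmic axes. Non-positive values cannot be shown on a log scale, so this version filters them out before plotting; that filter and the helper name `plot_mass_radius_log` are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_mass_radius_log(df: pd.DataFrame,
                         mass_col: str = "pl_bmasse",
                         radius_col: str = "pl_rade"):
    """Log-log scatter of radius against mass (positive, non-missing values)."""
    pair = df[[mass_col, radius_col]].dropna()
    pair = pair[(pair[mass_col] > 0) & (pair[radius_col] > 0)]
    fig, ax = plt.subplots()
    ax.scatter(pair[mass_col], pair[radius_col], s=8, alpha=0.5)
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlabel(r"Planet mass [$M_\oplus$]")
    ax.set_ylabel(r"Planet radius [$R_\oplus$]")
    ax.set_title(f"Mass-radius, log axes (n={len(pair)})")
    return fig, ax
```

Only the axes are log-scaled; the data themselves are left untransformed, in line with the diagnose-only rule of this notebook.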


## 8. Adding physical context (color-coding)
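
One possible form of color-coding is to color each point in the mass-radius plane by a third quantity. The sketch below assumes stellar metallicity (`st_met`) as that quantity, since it is the third key column; the helper name `plot_mass_radius_colored` and the colormap are also assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_mass_radius_colored(df: pd.DataFrame,
                             color_col: str = "st_met",
                             color_label: str = "Stellar metallicity [dex]",
                             mass_col: str = "pl_bmasse",
                             radius_col: str = "pl_rade"):
    """Log-log mass-radius scatter with points colored by a third column."""
    data = df[[mass_col, radius_col, color_col]].dropna()
    data = data[(data[mass_col] > 0) & (data[radius_col] > 0)]

    fig, ax = plt.subplots()
    points = ax.scatter(data[mass_col], data[radius_col],
                        c=data[color_col], s=10, cmap="viridis")
    fig.colorbar(points, ax=ax, label=color_label)
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlabel(r"Planet mass [$M_\oplus$]")
    ax.set_ylabel(r"Planet radius [$R_\oplus$]")
    ax.set_title(f"Mass-radius colored by {color_col} (n={len(data)})")
    return fig, ax
```

Requiring a third column shrinks the sample further, so compare the `n` in the title with the uncolored plot.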

## 9. Correlations and potential redundancy (diagnosis only)

Correlation matrices can indicate strongly related features, but they do **not** decide what to remove.
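
A sketch of this diagnosis: compute the correlation matrix of the numeric columns and list the pairs whose absolute correlation exceeds a threshold. The helper name `correlated_pairs` and the 0.8 threshold are assumptions, and the list is only a prompt for discussion, not a removal rule.

```python
from itertools import combinations

import pandas as pd


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.8,
                     method: str = "pearson"):
    """Return the correlation matrix and the pairs with |corr| >= threshold."""
    corr = df.select_dtypes(include="number").corr(method=method)

    # Visit each unordered pair of features once (upper triangle)
    rows = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
    pairs = pd.DataFrame(rows, columns=["feature_a", "feature_b", "corr"]).dropna()
    pairs = pairs[pairs["corr"].abs() >= threshold]
    pairs = pairs.sort_values("corr", key=lambda s: s.abs(), ascending=False)
    return corr, pairs.reset_index(drop=True)
```

Pearson correlation is sensitive to outliers and to nonlinear relations; passing `method="spearman"` gives a rank-based alternative.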


## 10. Outliers: bugs or astrophysics?

Inspect extremes. Discuss whether they are:
- instrumental / pipeline artifacts
- unit issues
- genuinely rare astrophysical objects
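
To inspect extremes, a simple sketch is to list the smallest and largest values of a column next to an identifier. The helper name `inspect_extremes` is hypothetical, and `pl_name` is assumed to be the identifier column; it is skipped if your export does not contain it.

```python
import pandas as pd


def inspect_extremes(df: pd.DataFrame, column: str, n: int = 5,
                     id_cols=("pl_name",)):
    """Return (lowest, highest): the n smallest and n largest rows of a column."""
    keep = [c for c in id_cols if c in df.columns and c != column] + [column]
    valid = df.dropna(subset=[column])   # ignore rows where the value is missing
    lowest = valid.nsmallest(n, column)[keep]
    highest = valid.nlargest(n, column)[keep]
    return lowest, highest
```

Look up the listed objects individually before deciding which of the three categories above they fall into; no rows are removed here.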


## 11. EDA summary (write short notes)

Answer briefly:
1. What fraction of rows have both mass and radius?
2. Where are the main missing values?
3. Does mass–radius look linear, monotonic, or multi-regime?
4. Which features look strongly correlated?
5. What outliers are suspicious vs interesting?
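
Question 1 can be answered with a one-line computation. The sketch below assumes the mass and radius columns are `pl_bmasse` and `pl_rade`; `fraction_complete` is a hypothetical helper name.

```python
import pandas as pd


def fraction_complete(df: pd.DataFrame,
                      columns=("pl_bmasse", "pl_rade")) -> float:
    """Fraction of rows in which all the given columns are non-missing."""
    return float(df[list(columns)].notna().all(axis=1).mean())
```

For example, `print(f"{fraction_complete(df):.1%} of rows have both mass and radius")`.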

These notes will guide **feature engineering and regression** later.
