# Advanced Astroquery Analysis Report  
## Gaia DR3: Pleiades (M45) cluster membership, kinematics, and HR diagram

**Goal:** Pull **real** Gaia DR3 data via `astroquery`, do a *repeatable* membership selection, and build an astrophysically meaningful **H–R diagram** (color–magnitude diagram).  
This notebook is written like a *report*: **Methods → Results → Discussion → Next steps**.

> Requires internet access: the data come from a live query to the Gaia archive.  
> Date: 2026-01-07


## 0) Setup

This notebook uses:
- `astroquery.gaia` to run an ADQL query on the ESA Gaia archive
- `astropy` tables/units
- `numpy/pandas` for manipulation
- `matplotlib` for plots
- `scikit-learn` for a simple probabilistic membership model (Gaussian Mixture)

If something is missing, uncomment the install cell.


In [None]:
# If needed, install packages (run once)
# %pip install -q astroquery astropy numpy pandas matplotlib scikit-learn


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from astropy.table import Table
from astropy.coordinates import SkyCoord
import astropy.units as u

from astroquery.gaia import Gaia


## 1) Define the target and query parameters

We’ll query a **circular region** around the Pleiades center.

Reference coordinates (ICRS):
- RA ≈ 56.75°
- Dec ≈ 24.12°

We start with a **wide** radius so we can model field contamination and then separate members.
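
To get a feel for what a "wide" radius means physically, the angular radius can be converted to a projected size at the cluster distance. This is only a sketch: the distance of roughly 136 pc is an approximate literature value assumed here (it is not derived in this notebook), and `projected_radius_pc` is a hypothetical helper name.

```python
import math

def projected_radius_pc(radius_deg, distance_pc):
    """Projected physical radius (pc) subtended by an angle at a given distance."""
    return distance_pc * math.tan(math.radians(radius_deg))

# Assumed, approximate Pleiades distance (literature value, not derived here)
assumed_distance_pc = 136.0

r_pc = projected_radius_pc(6.0, assumed_distance_pc)
print(f"6 deg at {assumed_distance_pc:.0f} pc is about {r_pc:.1f} pc")
```

Under that assumption the query region spans roughly 14 pc in projected radius, so it extends well beyond the dense cluster core and is dominated by field stars.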


In [None]:
# Pleiades (M45) approximate center
ra0_deg  = 56.75
dec0_deg = 24.12

# Query radius in degrees (wide, includes field stars)
radius_deg = 6.0

print("Center (deg):", ra0_deg, dec0_deg, " radius:", radius_deg, "deg")


## 2) ADQL query (Gaia DR3)

We pull only the columns we need:
- Astrometry: parallax, proper motions
- Photometry: G, BP, RP
- Quality: RUWE, parallax_error, phot_bp_rp_excess_factor

Then we’ll apply:
- Basic quality cuts
- A probabilistic membership model in (pmra, pmdec, parallax)
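
Before downloading a wide cone, it can help to ask the archive how many rows to expect. The sketch below only builds the ADQL string; `build_count_adql` is a hypothetical helper, and the commented-out submission line assumes the same `Gaia` object and network access as the main query.

```python
def build_count_adql(ra_deg, dec_deg, radius_deg, table="gaiadr3.gaia_source"):
    """Return an ADQL string that counts sources inside a cone."""
    return (
        f"SELECT COUNT(*) AS n FROM {table} "
        f"WHERE 1 = CONTAINS(POINT('ICRS', ra, dec), "
        f"CIRCLE('ICRS', {ra_deg}, {dec_deg}, {radius_deg}))"
    )

count_adql = build_count_adql(56.75, 24.12, 6.0)
print(count_adql)

# With network access, the string could be submitted like the main query:
# n = Gaia.launch_job_async(count_adql).get_results()["n"][0]
```

If the count is much larger than expected, reducing `radius_deg` before running the full query saves time.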


In [None]:
Gaia.MAIN_GAIA_TABLE = "gaiadr3.gaia_source"

adql = f"""
SELECT
  source_id, ra, dec,
  parallax, parallax_error,
  pmra, pmra_error,
  pmdec, pmdec_error,
  ruwe,
  phot_g_mean_mag, phot_bp_mean_mag, phot_rp_mean_mag,
  phot_bp_rp_excess_factor
FROM gaiadr3.gaia_source
WHERE 1 = CONTAINS(
  POINT('ICRS', ra, dec),
  CIRCLE('ICRS', {ra0_deg}, {dec0_deg}, {radius_deg})
)
AND parallax IS NOT NULL
AND pmra IS NOT NULL
AND pmdec IS NOT NULL
AND phot_g_mean_mag IS NOT NULL
AND phot_bp_mean_mag IS NOT NULL
AND phot_rp_mean_mag IS NOT NULL
"""

print(adql[:600] + "...")


In [None]:
# This can take ~seconds to a few minutes depending on the archive load.
# If the query fails due to connectivity or service throttling, retry later or reduce radius_deg.

job = Gaia.launch_job_async(adql)
tbl = job.get_results()
tbl[:5]


## 3) Convert to pandas + compute basic derived quantities

### Color
- `bp_rp = BP - RP`

### Absolute magnitude (no extinction correction)
Using parallax in milliarcseconds:

\[
M_G = G + 5\log_{10}(\varpi_{\mathrm{mas}}) - 10
\]

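This follows from the distance modulus with the simple distance estimate `d ≈ 1000 / parallax_mas` (in parsecs). That estimate is reasonable only for positive parallaxes with high signal-to-noise, and the formula ignores extinction.
For the Pleiades, extinction is small (A_V is of order 0.1 mag), so this is acceptable for a learning report.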
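
As a quick worked check, the parallax form of the formula should agree with the usual distance-modulus form `M = m - 5 log10(d) + 5`. The input values below are illustrative, not measurements, and both function names are hypothetical.

```python
import math

def abs_mag_from_parallax(g_mag, parallax_mas):
    """M_G from apparent magnitude and parallax in mas (no extinction)."""
    return g_mag + 5.0 * math.log10(parallax_mas) - 10.0

def abs_mag_from_distance(g_mag, distance_pc):
    """M_G from apparent magnitude and distance in pc (no extinction)."""
    return g_mag - 5.0 * math.log10(distance_pc) + 5.0

g_mag = 10.0         # illustrative apparent G magnitude
parallax_mas = 7.35  # illustrative parallax in mas

m_par = abs_mag_from_parallax(g_mag, parallax_mas)
m_dist = abs_mag_from_distance(g_mag, 1000.0 / parallax_mas)
print(f"M_G (parallax form): {m_par:.3f}")
print(f"M_G (distance form): {m_dist:.3f}")
```

Both forms give about 4.33 mag for these inputs, confirming that the `- 10` term is just the distance modulus rewritten for parallax in milliarcseconds.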


In [None]:
df = tbl.to_pandas()

# Color index
df["bp_rp"] = df["phot_bp_mean_mag"] - df["phot_rp_mean_mag"]

# Parallax signal-to-noise ratio
df["parallax_snr"] = df["parallax"] / df["parallax_error"]

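# Absolute magnitude (rough; no extinction correction).
# log10 is undefined for parallax <= 0, so mask those rows to NaN explicitly;
# the np.isfinite(M_G) quality cut removes them later.
parallax_pos = df["parallax"].where(df["parallax"] > 0)
df["M_G"] = df["phot_g_mean_mag"] + 5*np.log10(parallax_pos) - 10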

df.shape, df.head()


## 4) Quality cuts (recommended starting point)

These are common Gaia “sanity” filters:
- `parallax_snr > 10` (better distance estimates)
- `ruwe < 1.4` (good astrometric fit heuristic)
- A loose window on `phot_bp_rp_excess_factor` (here `0.5 < value < 2.0`) to reject sources with inconsistent BP/RP photometry

⚠️ These are not perfect. They’re good defaults to learn from, and you can tune them.
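
When tuning the cuts, it is useful to see how many rows each one removes. The following is a minimal sketch on a toy table (the values are made up for illustration); `cut_report` is a hypothetical helper, not part of any library.

```python
import pandas as pd

def cut_report(df, cuts):
    """Apply boolean masks in order; report rows surviving after each step."""
    keep = pd.Series(True, index=df.index)
    rows = []
    for name, mask in cuts.items():
        keep = keep & mask
        rows.append({"cut": name, "n_remaining": int(keep.sum())})
    return pd.DataFrame(rows)

# Toy data standing in for the Gaia table (not real measurements)
toy = pd.DataFrame({
    "parallax_snr": [50.0, 5.0, 30.0, 12.0],
    "ruwe": [1.0, 1.1, 2.5, 1.2],
    "phot_bp_rp_excess_factor": [1.2, 1.3, 1.2, 3.0],
})

excess = toy["phot_bp_rp_excess_factor"]
report = cut_report(toy, {
    "parallax_snr > 10": toy["parallax_snr"] > 10,
    "ruwe < 1.4": toy["ruwe"] < 1.4,
    "0.5 < excess < 2.0": (excess > 0.5) & (excess < 2.0),
})
print(report)
```

The same helper could be called with masks built from `df` in the notebook; a cut that removes most of the sample on its own is the first one worth revisiting.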


In [None]:
q = (
    (df["parallax_snr"] > 10) &
    (df["ruwe"] < 1.4) &
    (df["phot_bp_rp_excess_factor"] > 0.5) &
    (df["phot_bp_rp_excess_factor"] < 2.0) &
    np.isfinite(df["bp_rp"]) &
    np.isfinite(df["M_G"])
)

dfq = df.loc[q].copy()
df.shape, dfq.shape


## 5) Membership model (field vs cluster)

We’ll fit a **2-component Gaussian Mixture Model** (GMM) in:

- `pmra` (mas/yr)
- `pmdec` (mas/yr)
- `parallax` (mas)

This is a *simple* but powerful approach:
- One Gaussian tends to capture the tight cluster locus
- The other captures the broader field

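Then we interpret the component with the **smaller covariance volume** (smallest covariance determinant) as the cluster. This is a heuristic: members are a small fraction of such a wide field, so the two components can end up splitting the field population instead. Check that the selected component's mean lies near the expected Pleiades values (pmra ≈ +20 mas/yr, pmdec ≈ −45 mas/yr, parallax ≈ 7.3 mas).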
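
One way to cross-check the smallest-volume rule is to ask which component mean lies closest to approximate literature values for the Pleiades. The sketch below uses toy component means; the reference values and per-axis scales are assumptions for illustration, and `nearest_component` is a hypothetical helper.

```python
import numpy as np

# Approximate literature values (assumed, for a sanity check only):
# pmra ~ +20 mas/yr, pmdec ~ -45 mas/yr, parallax ~ 7.3 mas
PLEIADES_REF = np.array([20.0, -45.0, 7.3])

def nearest_component(means_phys, reference, scales):
    """Index of the component whose mean is closest to `reference`,
    measuring each axis in units of `scales`."""
    z = (np.asarray(means_phys) - reference) / np.asarray(scales)
    return int(np.argmin(np.linalg.norm(z, axis=1)))

# Toy component means in physical units (pmra, pmdec, parallax)
toy_means = np.array([
    [2.0, -5.0, 1.5],     # broad, field-like component
    [19.9, -45.4, 7.35],  # cluster-like component
])
toy_scales = np.array([10.0, 10.0, 1.0])  # illustrative per-axis scales

idx = nearest_component(toy_means, PLEIADES_REF, toy_scales)
print("Component nearest the reference:", idx)
```

With a fitted model, the physical-unit means would come from `scaler.inverse_transform(gmm.means_)` and the scales from `scaler.scale_`. If this index disagrees with the smallest-determinant choice, the two-component fit deserves a closer look (for example, more components or a tighter pre-selection).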


In [None]:
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

features = dfq[["pmra", "pmdec", "parallax"]].values

scaler = StandardScaler()
X = scaler.fit_transform(features)

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=42)
gmm.fit(X)

# Membership probability of each component
proba = gmm.predict_proba(X)

# Decide which component is "cluster": pick the one with smaller determinant of covariance
covs = gmm.covariances_
det = [np.linalg.det(c) for c in covs]
cluster_comp = int(np.argmin(det))

dfq["p_cluster"] = proba[:, cluster_comp]
dfq["gmm_label"] = (dfq["p_cluster"] > 0.5).astype(int)

cluster_comp, det


### Quick sanity check: kinematic distributions

If the model worked, the high-probability members should form a tight clump in proper motion space.
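
The "tight clump" expectation can also be put into numbers by comparing a robust scatter estimate for members and field. This is a sketch on synthetic data (the means and widths are arbitrary toy values), with `robust_scatter` as a hypothetical helper.

```python
import numpy as np

def robust_scatter(x):
    """1.4826 * MAD: a robust estimate of the standard deviation."""
    x = np.asarray(x, dtype=float)
    return 1.4826 * np.median(np.abs(x - np.median(x)))

# Synthetic proper motions in mas/yr (toy values, not Gaia data)
rng = np.random.default_rng(0)
toy_cluster_pmra = rng.normal(20.0, 1.0, size=200)
toy_field_pmra = rng.normal(2.0, 15.0, size=2000)

s_cluster = robust_scatter(toy_cluster_pmra)
s_field = robust_scatter(toy_field_pmra)
print(f"cluster scatter: {s_cluster:.2f} mas/yr, field scatter: {s_field:.2f} mas/yr")
```

In the notebook, the same function could be applied to `pmra` and `pmdec` for rows above and below a chosen `p_cluster` value; a member scatter comparable to the field scatter would suggest the model did not isolate the cluster.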


In [None]:
fig, ax = plt.subplots(figsize=(7,6))
sc = ax.scatter(dfq["pmra"], dfq["pmdec"], c=dfq["p_cluster"], s=4)
ax.set_xlabel("pmra (mas/yr)")
ax.set_ylabel("pmdec (mas/yr)")
ax.set_title("Proper motion space (colored by cluster probability)")
plt.colorbar(sc, ax=ax, label="p(cluster)")
plt.show()


## 6) Define a clean member sample

We’ll choose a probability threshold.  
Common choices: 0.7–0.95 depending on how pure you want the sample.

We’ll use **p ≥ 0.8** as a default.
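
To see how sensitive the sample is to this choice, the member count can be tabulated for several thresholds. The probabilities below are toy values, and `members_vs_threshold` is a hypothetical helper.

```python
import numpy as np

def members_vs_threshold(p_cluster, thresholds):
    """Number of sources with p_cluster >= t for each threshold t."""
    p = np.asarray(p_cluster, dtype=float)
    return {float(t): int(np.sum(p >= t)) for t in thresholds}

# Toy membership probabilities (not model output)
toy_p = np.array([0.02, 0.10, 0.55, 0.75, 0.85, 0.97, 0.99])

counts = members_vs_threshold(toy_p, [0.5, 0.7, 0.8, 0.95])
for t, n in counts.items():
    print(f"p >= {t:.2f}: {n} members")
```

If the count barely changes between 0.7 and 0.95, the separation is clean and the exact threshold matters little; a steep drop indicates many ambiguous sources.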


In [None]:
p_thresh = 0.8
mem = dfq[dfq["p_cluster"] >= p_thresh].copy()
field = dfq[dfq["p_cluster"] < p_thresh].copy()

len(mem), len(field), mem[["parallax","pmra","pmdec"]].describe()


## 7) Results: sky distribution

A good membership selection should trace the cluster on the sky.


In [None]:
fig, ax = plt.subplots(figsize=(7,6))
ax.scatter(field["ra"], field["dec"], s=2, alpha=0.15, label="field")
ax.scatter(mem["ra"], mem["dec"], s=4, alpha=0.8, label=f"members (p≥{p_thresh})")
ax.invert_xaxis()  # RA increases to the left in many sky plots
ax.set_xlabel("RA (deg)")
ax.set_ylabel("Dec (deg)")
ax.set_title("Sky distribution")
ax.legend()
plt.show()


## 8) Results: cluster distance estimate

For a quick “report-level” estimate:
- Use the **median parallax** of members
- Convert to distance `d ≈ 1000 / parallax_mas` in parsecs

⚠️ For precision work:
- Use proper Bayesian distance inference
- Consider Gaia parallax systematics / zero-points
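
As a rough illustration of how much a zero-point matters, the sketch below applies a constant offset and first-order error propagation to the inverse-parallax distance. The inputs are illustrative, and the offset of −0.017 mas is an assumed approximate global value for DR3; the actual correction depends on magnitude, colour, and sky position. `distance_from_parallax` is a hypothetical helper.

```python
def distance_from_parallax(parallax_mas, parallax_err_mas, zero_point_mas=0.0):
    """Inverse-parallax distance (pc) with first-order error propagation.

    zero_point_mas is subtracted from the measured parallax.
    """
    corrected = parallax_mas - zero_point_mas
    d_pc = 1000.0 / corrected
    d_err_pc = 1000.0 * parallax_err_mas / corrected**2
    return d_pc, d_err_pc

# Illustrative inputs (not results from this notebook)
d_raw, e_raw = distance_from_parallax(7.35, 0.05)
# Assumed approximate global DR3 offset; the true value varies per source
d_zp, e_zp = distance_from_parallax(7.35, 0.05, zero_point_mas=-0.017)

print(f"uncorrected: {d_raw:.1f} +/- {e_raw:.1f} pc")
print(f"zero-point corrected: {d_zp:.1f} +/- {e_zp:.1f} pc")
```

For a nearby cluster the shift is a fraction of a parsec, which is why the naive estimate is adequate at report level but not for precision work.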


In [None]:
par_median = np.median(mem["parallax"])
par_std = np.std(mem["parallax"])
d_pc = 1000.0 / par_median

print(f"Median parallax (mas): {par_median:.3f}  (std ~ {par_std:.3f})")
print(f"Naive distance (pc):   {d_pc:.1f}")


## 9) Results: Color–Magnitude Diagram (HR diagram)

Plot:
- x-axis: `BP−RP`
- y-axis: `M_G` (absolute G magnitude)

We expect a **well-defined main sequence** for the cluster.


In [None]:
fig, ax = plt.subplots(figsize=(7,7))
ax.scatter(field["bp_rp"], field["M_G"], s=2, alpha=0.08, label="field")
ax.scatter(mem["bp_rp"], mem["M_G"], s=6, alpha=0.7, label="members")

ax.set_xlabel("BP − RP (mag)")
ax.set_ylabel("M_G (mag)")
ax.set_title("Pleiades CMD / HR diagram (Gaia DR3)")
ax.invert_yaxis()
ax.legend()
plt.show()


## 10) A small “report summary” (auto-generated text)

We’ll summarize:
- Sample sizes
- Kinematic medians
- Distance estimate

This is useful when turning a notebook into a reusable analysis template.
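
For reuse outside the notebook, a summary dictionary of plain Python numbers can be serialized to JSON. The sketch below uses placeholder values (they are not results) to show the round trip.

```python
import json

# Placeholder values standing in for the notebook's summary (not real results)
toy_summary = {
    "n_total": 1000,
    "n_quality": 400,
    "n_members": 50,
    "naive_distance_pc": 136.0,
}

text = json.dumps(toy_summary, indent=2, sort_keys=True)
restored = json.loads(text)
print(text)
```

Casting to `int` and `float` when building the dictionary matters here, because NumPy scalar types are not JSON-serializable by default.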


In [None]:
summary = {
    "n_total": int(len(df)),
    "n_quality": int(len(dfq)),
    # Key follows p_thresh so the label stays correct if the threshold changes
    f"n_members_p>={p_thresh}": int(len(mem)),
    "median_parallax_mas": float(np.median(mem["parallax"])),
    "median_pmra_masyr": float(np.median(mem["pmra"])),
    "median_pmdec_masyr": float(np.median(mem["pmdec"])),
    "naive_distance_pc": float(1000.0 / np.median(mem["parallax"])),
}

summary


## 11) Discussion & interpretation

What to check in the plots (these hold if the membership model worked):
- The **proper motion clump** is tight → strong cluster signal
- The CMD shows a clean **main sequence** for members
- Field stars are scattered broadly in both kinematics and CMD

### Common next steps (real research workflow)
1. Add extinction correction (3D dust maps or literature A_V)
2. Use a better distance inference (Bayesian parallax-to-distance)
3. Fit an isochrone to estimate age/metallicity
4. Compare different membership methods (DBSCAN, UPMASK, HDBSCAN)


## 12) Repro tips

- If Gaia is slow/busy: reduce `radius_deg` or request fewer columns.
- Cache results: write the table to disk (`parquet/csv`) and reuse it.
- For other clusters: change `(ra0_deg, dec0_deg, radius_deg)` and rerun.
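
The caching tip can be sketched as a small load-or-fetch pattern. Here `load_or_fetch` and `fake_fetch` are hypothetical names, the file is written to a temporary directory, and the toy table stands in for the archive result.

```python
import os
import tempfile
import pandas as pd

def load_or_fetch(path, fetch):
    """Return the cached table at `path` if present; otherwise fetch and cache it."""
    if os.path.exists(path):
        return pd.read_csv(path)
    df = fetch()
    df.to_csv(path, index=False)
    return df

calls = []

def fake_fetch():
    """Stand-in for the archive query; records how often it is called."""
    calls.append(1)
    return pd.DataFrame({"source_id": [1, 2], "parallax": [7.3, 1.1]})

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "toy_cache.csv")
    first = load_or_fetch(cache_path, fake_fetch)   # runs the fetch
    second = load_or_fetch(cache_path, fake_fetch)  # reads the cache

print("fetch calls:", len(calls))
```

In the notebook, `fetch` could be something like `lambda: Gaia.launch_job_async(adql).get_results().to_pandas()`, so the archive is queried only when no cached file exists.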


In [None]:
# Optional: save a cached copy for fast iteration
# mem and dfq are useful outputs for later notebooks
# (to_parquet needs pyarrow or fastparquet; use to_csv if neither is installed)

# dfq.to_parquet("pleiades_gaia_quality.parquet", index=False)
# mem.to_parquet("pleiades_gaia_members.parquet", index=False)

print("Ready.")
