In [1]:
# Install the dependencies (`json` is part of the Python standard library and cannot be installed with pip)
%pip install numpy pandas matplotlib bioverse==1.1.8

Note: you may need to restart the kernel to use updated packages.


# Example 1: Finding the habitable zone

In Section 6 of [Bixel & Apai (2021)](https://ui.adsabs.harvard.edu/abs/2021AJ....161..228B/abstract), we propose that the concept of a "habitable zone" could be validated by searching for a region of space where planets with atmospheric water vapor are statistically more frequent.

In this example, you will use Bioverse to determine whether a [LUVOIR](https://ui.adsabs.harvard.edu/abs/2019arXiv191206219T/abstract)-like imaging survey could test this hypothesis.

## Setup
First, we'll import the Bioverse code:

In [2]:
# Import numpy, pandas (for saving data), and json (for saving analysis results)
import numpy as np
import pandas as pd
import json

# Import the relevant modules
from bioverse.survey import ImagingSurvey
from bioverse.generator import Generator
from bioverse.hypothesis import Hypothesis
from bioverse import analysis

# Import pyplot (for making plots later) and adjust some of its settings
from matplotlib import pyplot as plt
%matplotlib inline
plt.rcParams['font.size'] = 20.

np.random.seed(42)

For this example, we will use the LUVOIR-like imaging survey and host star catalog.

In [3]:
generator = Generator('imaging')
survey = ImagingSurvey('default')

## Injecting the statistical effect

The first step is to inject the statistical effect we are searching for into the simulated planet population. Specifically, we will simulate the likelihood that a planet has atmospheric water vapor as follows:

In [4]:
def habitable_zone_water(d, f_water_habitable=0.75, f_water_nonhabitable=0.01):
    d['has_H2O'] = np.zeros(len(d),dtype=bool)

    # Non-habitable planets with atmospheres
    m1 = d['R'] > 0.8*d['S']**0.25
    d['has_H2O'][m1] = np.random.uniform(0,1,size=m1.sum()) < f_water_nonhabitable

    # exo-Earth candidates
    m2 = d['EEC']
    d['has_H2O'][m2] = np.random.uniform(0,1,size=m2.sum()) < f_water_habitable

    return d

generator.insert_step(habitable_zone_water)
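
To see what this step does in isolation, the following minimal sketch applies the same masking logic to a toy table. It assumes a plain dict of NumPy arrays stands in for the Bioverse table; the helper name `inject_water`, the toy values, and the use of `np.random.default_rng` are illustrative choices, not part of Bioverse.

```python
import numpy as np

def inject_water(d, rng, f_water_habitable=0.75, f_water_nonhabitable=0.01):
    """Toy version of habitable_zone_water acting on a dict of arrays."""
    has_H2O = np.zeros(len(d['R']), dtype=bool)

    # Non-habitable planets with atmospheres: rare "false positive" water
    m1 = d['R'] > 0.8 * d['S']**0.25
    has_H2O[m1] = rng.uniform(0, 1, size=m1.sum()) < f_water_nonhabitable

    # Exo-Earth candidates: overwrite with the higher habitable fraction
    m2 = d['EEC']
    has_H2O[m2] = rng.uniform(0, 1, size=m2.sum()) < f_water_habitable

    d['has_H2O'] = has_H2O
    return d

# Hypothetical toy planets: radius R and insolation S (made-up values)
toy = {
    'R': np.array([0.5, 1.0, 1.2, 2.5, 0.9]),
    'S': np.array([4.0, 1.0, 0.6, 10.0, 0.4]),
    'EEC': np.array([False, True, True, False, True]),
}

rng = np.random.default_rng(42)
# With the extreme fractions 1.0 and 0.0, water appears on EECs and nowhere else
toy = inject_water(toy, rng, f_water_habitable=1.0, f_water_nonhabitable=0.0)
print(toy['has_H2O'])
```

Because the EEC mask is applied second, an exo-Earth candidate that also satisfies the first mask ends up with the habitable fraction, mirroring the order of assignments in `habitable_zone_water`.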

Next, let's make a simulated dataset using this modified Generator object and the imaging Survey. Let's start with a relatively optimistic assumption that 75% of EECs are habitable and only 1% of non-EECs have "false positive" water vapor.

In [5]:
sample, detected, data = survey.quickrun(generator, f_water_habitable=0.75, f_water_nonhabitable=0.01)

Let's take a look at the simulated data set. Plotting which planets have H2O versus their insolation would show that planets within the habitable zone (approximately 0.3 < S < 1.1) are more likely to have water-rich atmospheres. **(The plotting code is replaced with saving the raw simulated data.)**

In [6]:
# Replaced Plotting Code with Data Saving
# Save the raw simulated data ('data') to a CSV file
df = pd.DataFrame(data)
output_filename = 'simulated_raw_data.csv'
df.to_csv(output_filename, index=False)
print(f"Raw simulated data saved to {output_filename}")

Raw simulated data saved to simulated_raw_data.csv


The effect injected by `habitable_zone_water` is apparent in the dataset. But is this effect of high enough statistical significance to confirm the habitable zone hypothesis?

## Defining the hypothesis

Next, we'll create a new Hypothesis object representing the hypothesis that planets within one region of space are more likely to have atmospheric water vapor than those outside of it. Since we do not know the planets' sizes, the only independent variable is the distance from the star modulated by the stellar luminosity, called `a_eff` (this is already calculated). Likewise, the dependent variable is the presence or absence of water vapor, called `has_H2O`, which is either 0 or 1. The relationship between these values is parameterized by `a_inner` (the inner edge of the habitable zone in AU), `delta_a` (the width in AU), `f_HZ` (the fraction of HZ planets with H2O), and `df_notHZ` (the fraction of non-HZ planets with H2O *divided by* `f_HZ`). By defining the four parameters in this way, we can easily avoid parameter combinations inconsistent with our hypothesis (such as the fraction of non-HZ planets with H2O being higher than the fraction of HZ planets with H2O).

In [7]:
# Define the hypothesis in functional form
def f(theta, X):
    a_inner, delta_a, f_HZ, df_notHZ = theta
    in_HZ = (X > a_inner) & (X < (a_inner + delta_a))
    return in_HZ * f_HZ + (~in_HZ) * f_HZ*df_notHZ

# Specify the names of the parameters (theta), features (X), and labels (Y)
params = ('a_inner', 'delta_a', 'f_HZ', 'df_notHZ')
features = ('a_eff',)
labels = ('has_H2O',)
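
As a quick check of the functional form, the sketch below evaluates the same step function on a few effective separations. The parameter values in `theta` and the sample `a_eff` values are hypothetical, chosen only for illustration.

```python
import numpy as np

def f(theta, X):
    a_inner, delta_a, f_HZ, df_notHZ = theta
    in_HZ = (X > a_inner) & (X < (a_inner + delta_a))
    return in_HZ * f_HZ + (~in_HZ) * f_HZ * df_notHZ

# Hypothetical parameters: HZ from 0.95 to 1.65 AU, 75% of HZ planets with H2O,
# and a non-HZ fraction equal to 1% of the HZ fraction
theta = (0.95, 0.70, 0.75, 0.01)
a_eff = np.array([0.5, 1.0, 1.5, 2.0])

p_water = f(theta, a_eff)
print(p_water)  # probability of H2O for each planet
```

The two planets inside the zone receive `f_HZ`, while the two outside receive `f_HZ * df_notHZ`, which can never exceed `f_HZ` as long as `df_notHZ` is at most 1.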

In addition, we must consider the prior probability distribution of these parameters. Conservatively, we suppose the inner edge might extend far inward, or be slightly farther from the Sun than Earth, and that the HZ could be very narrow or very wide. Similarly, the fraction of planets with water vapor in the HZ and outside of it could span many orders of magnitude. We therefore choose to impose log-uniform prior distributions on these parameters across the following ranges:

0.1 < `a_inner` < 2 AU

0.01 < `delta_a` < 10 AU

0.001 < `f_HZ` < 1

0.001 < `df_notHZ` < 1

After deciding on the bounds, we can initialize the Hypothesis object.

In [8]:
bounds = np.array([[0.1, 2], [0.01, 10], [0.001, 1.0], [0.001, 1.0]])
h_HZ = Hypothesis(f, bounds, params=params, features=features, labels=labels, log=(True, True, True, True))
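
A log-uniform prior gives equal weight to each decade of the range, which is why its lower bound must be strictly positive. The sketch below draws from such a prior using only the standard library; the helper name `sample_log_uniform` is hypothetical, and this is not how Bioverse samples its priors internally.

```python
import math
import random

def sample_log_uniform(lo, hi, rng):
    """Draw one value whose logarithm is uniform between log(lo) and log(hi)."""
    if lo <= 0:
        raise ValueError("a log-uniform prior needs a strictly positive lower bound")
    return math.exp(rng.uniform(math.log(lo), math.log(hi)))

rng = random.Random(0)
lo, hi = 0.001, 1.0  # same range as the f_HZ bounds above
draws = [sample_log_uniform(lo, hi, rng) for _ in range(10_000)]

# Each decade gets equal prior weight, so about half of the draws fall below
# the geometric mean sqrt(lo * hi), which is about 0.032 here
geo_mean = math.sqrt(lo * hi)
frac_below = sum(x < geo_mean for x in draws) / len(draws)
print(f"fraction below {geo_mean:.3f}: {frac_below:.2f}")
```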

We also need to define the null hypothesis against which `h_HZ` is to be compared. The null hypothesis says that the fraction of planets with water vapor is independent of their orbits. This is a one-parameter hypothesis where `f_H2O`, the fraction of planets with water vapor, is the only parameter. Again, the prior distribution on `f_H2O` should be broad (0.001 to 1) and log-uniform in shape. We can define this null hypothesis and attach it to `h_HZ` as follows:


In [9]:
def f_null(theta, X):
    shape = (np.shape(X)[0], 1)
    return np.full(shape, theta)
bounds_null = np.array([[0.001, 1.0]])
h_HZ.h_null = Hypothesis(f_null, bounds_null, params=('f_H2O',), features=features, labels=labels, log=(True,))
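
To build intuition for what distinguishes the two models, the sketch below scores a toy dataset under each one at fixed parameter values. It assumes a Bernoulli likelihood for the binary `has_H2O` labels, which is a natural choice for this kind of data but is an assumption here rather than a description of Bioverse internals; the data and parameter values are hypothetical. The Bayesian evidence additionally integrates the likelihood over the prior, which the sketch does not attempt.

```python
import math

def p_hz(theta, a_eff):
    """Step-function model: probability of H2O at separation a_eff."""
    a_inner, delta_a, f_HZ, df_notHZ = theta
    in_HZ = a_inner < a_eff < a_inner + delta_a
    return f_HZ if in_HZ else f_HZ * df_notHZ

def log_likelihood(probs, labels):
    """Bernoulli log-likelihood of binary labels given per-planet probabilities."""
    return sum(math.log(p if y else 1.0 - p) for p, y in zip(probs, labels))

# Hypothetical toy data: effective separation (AU) and water detections
a_eff = [0.3, 0.6, 1.0, 1.2, 1.5, 2.5, 4.0]
has_H2O = [0, 0, 1, 1, 1, 0, 0]

theta_hz = (0.95, 0.70, 0.75, 0.01)        # illustrative HZ parameters
f_H2O_null = sum(has_H2O) / len(has_H2O)   # best single fraction for the null

logL_hz = log_likelihood([p_hz(theta_hz, a) for a in a_eff], has_H2O)
logL_null = log_likelihood([f_H2O_null] * len(a_eff), has_H2O)
print(f"ln L (HZ) = {logL_hz:.2f}, ln L (null) = {logL_null:.2f}")
```

When the detections cluster in one range of separations, as in this toy dataset, the step-function model assigns them a higher likelihood than any single orbit-independent fraction can.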


Now that the Hypothesis has been formed, we can formally test it using our simulated dataset. The `fit()` method will automatically extract the appropriate variables from the data and estimate the Bayesian evidence for both the hypothesis and the null hypothesis using nested sampling. It will return the difference between the two, i.e., the evidence in favor of the habitable zone hypothesis.

In [10]:
results_fit = h_HZ.fit(data)
print("The evidence in favor of the hypothesis is: dlnZ = {:.1f} (corresponds to p = {:.1E})".format(results_fit['dlnZ'], np.exp(-results_fit['dlnZ'])))

# Save the hypothesis fit result to a JSON file
output_filename_fit = 'hypothesis_fit_result.json'
# Use `fout` for the file handle so that the hypothesis function `f` defined above is not overwritten
with open(output_filename_fit, 'w') as fout:
    # Convert NumPy arrays to lists so that they are JSON-serializable
    serializable_result = {
        k: v.tolist() if isinstance(v, np.ndarray) else v
        for k, v in results_fit.items()
    }
    json.dump(serializable_result, fout, indent=4)
print(f"Hypothesis fit result saved to {output_filename_fit}")

The evidence in favor of the hypothesis is: dlnZ = 10.8 (corresponds to p = 2.0E-05)
Hypothesis fit result saved to hypothesis_fit_result.json
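
The dict comprehension above converts NumPy arrays, but NumPy scalar types such as `np.int64` or `np.bool_` would still be rejected by `json.dump`. Whether `results_fit` contains such values is not shown here, so the following is only a defensive sketch: the helper name `to_builtin` and the keys of the example dictionary are hypothetical.

```python
import json
import numpy as np

def to_builtin(obj):
    """Fallback for json: convert NumPy arrays and scalars to built-in types."""
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, np.generic):
        return obj.item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

# Hypothetical result dictionary mixing NumPy and built-in values
example = {
    'dlnZ': np.float64(10.8),
    'niter': np.int64(3),
    'converged': np.bool_(True),
    'medians': np.array([1.0, 2.0]),
}

# `default` is only called for objects json cannot serialize on its own
text = json.dumps(example, default=to_builtin, indent=4)
print(text)
```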


Generally speaking, a result where dlnZ > 3 is considered significant, so this was a successful test of the hypothesis.
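
For reference, the conversion used in the print statement above can be applied to any value of dlnZ. This sketch simply evaluates `exp(-dlnZ)`, the quantity the notebook reports as p, alongside the corresponding evidence ratio `exp(dlnZ)`; the helper name `summarize_evidence` is hypothetical.

```python
import math

def summarize_evidence(dlnZ):
    """Return (evidence ratio, exp(-dlnZ)) for a log-evidence difference."""
    return math.exp(dlnZ), math.exp(-dlnZ)

# The significance threshold and the value obtained in the fit above
for dlnZ in (3.0, 10.8):
    ratio, p = summarize_evidence(dlnZ)
    print(f"dlnZ = {dlnZ:4.1f} -> evidence ratio = {ratio:.1f}, p = {p:.1E}")
```

At the threshold dlnZ = 3, the hypothesis is favored over the null by an evidence ratio of about 20.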

**Note for IPython/Jupyter users:** We will need to reload the `generator` and `h_HZ` objects here to replace the ones you have defined above. These will produce the same results, but the next code block uses the `multiprocessing` module, which is incompatible with functions defined in IPython.

In [11]:
# Reload `generator` and `h_HZ`
generator = Generator('imaging')
from bioverse.hypothesis import h_HZ


## Computing statistical power

We have not yet tackled the most uncertain part of this analysis: namely, just how common are habitable worlds? The fraction of Earth-sized planets in the habitable zone with atmospheric water vapor could be far smaller than the assumed 75%, in which case it might be impossible to test the habitable zone hypothesis. To quantify the importance of this assumption, we will need to repeat the previous analysis several times with different values of `f_water_habitable`.

The `analysis` module enables this through its `test_hypothesis_grid()` function, which loops the planet simulation, survey simulation, and hypothesis test routines over a grid of input values. Let's use it to iterate over values of `f_water_habitable` ranging from 1% to 100%. To keep the runtime short, the cell below uses only the two endpoints of that range and repeats the analysis twice for each value (`N=2`); a finer grid with more repeats (for example, 20 of each) would better average over Poisson noise.

This may take a few minutes. To speed things up, we will run 8 processes in parallel (you may need to change this number for an older CPU).

In [12]:
f_water_habitable = np.logspace(-2, 0, 2)
results_grid = analysis.test_hypothesis_grid(h_HZ, generator, survey, f_water_habitable=f_water_habitable, t_total=10*365.25, processes=8, N=2)

100%|██████████| 4/4 [00:41<00:00, 10.28s/it]
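
The progress bar reflects the size of the grid: `np.logspace(-2, 0, 2)` yields only the two endpoints of the range, and each is repeated `N=2` times, which is consistent with the four runs shown. The sketch below makes this explicit and, for comparison, builds a finer 20-point grid over the same range; the variable names are illustrative.

```python
import numpy as np

N = 2  # repeats per grid value, as in the call above

coarse = np.logspace(-2, 0, 2)   # two log-spaced values between 1% and 100%
fine = np.logspace(-2, 0, 20)    # a denser grid over the same range

print(coarse)                    # the two endpoints of the range
print(len(coarse) * N)           # total number of simulated surveys
print(fine.min(), fine.max())    # the finer grid spans the same interval
```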


Now, let's plot the average Bayesian evidence for the hypothesis as a function of `f_water_habitable`. **(The plotting code is replaced with saving the grid results.)**

In [13]:
# Replaced Plotting Code with Data Saving

# Use pickle to save the entire results_grid object, as it handles complex Python objects
# like Hypothesis objects and nested dictionaries/lists/NumPy types reliably.
import pickle

output_filename = 'habitable_zone_grid_results.pkl'

with open(output_filename, 'wb') as fout:
    pickle.dump(results_grid, fout)

print(f"Analysis grid results saved to {output_filename}")

Analysis grid results saved to habitable_zone_grid_results.pkl


A more useful metric than the average Bayesian evidence is the survey's "statistical power", which is the likelihood that the survey could successfully test the hypothesis under a certain set of assumptions. Run this cell to plot the statistical power versus the fraction of EECs with water vapor. **(The plotting code is replaced with computing and saving the statistical power metric.)**

In [14]:
# Replaced Plotting Code with Data Saving
power = analysis.compute_statistical_power(results_grid, method='dlnZ', threshold=3)
power_df = pd.DataFrame({
    'f_water_habitable': f_water_habitable,
    'statistical_power': power
})
output_filename_power = 'statistical_power_results.csv'
power_df.to_csv(output_filename_power, index=False)

print(f"Statistical power data saved to {output_filename_power}")

Statistical power data saved to statistical_power_results.csv
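
Conceptually, the statistical power at each grid value is the fraction of simulated surveys whose evidence exceeds the threshold. The sketch below illustrates that definition on made-up dlnZ values; it is not Bioverse's implementation of `compute_statistical_power`, and the helper `power_from_dlnZ` is hypothetical.

```python
def power_from_dlnZ(dlnZ_runs, threshold=3.0):
    """Fraction of repeated runs in which dlnZ exceeds the threshold."""
    return sum(x > threshold for x in dlnZ_runs) / len(dlnZ_runs)

# Made-up dlnZ values for four repeats at two values of f_water_habitable
toy_grid = {
    0.01: [0.2, -0.5, 1.1, 3.4],
    1.00: [12.0, 9.5, 2.8, 15.1],
}

toy_power = {f_hab: power_from_dlnZ(runs) for f_hab, runs in toy_grid.items()}
for f_hab, pw in toy_power.items():
    print(f"f_water_habitable = {f_hab:.2f}: power = {pw:.2f}")
```

With only a handful of repeats per grid value, the power can take just a few discrete values, so more repeats give a smoother estimate.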


We can see that a LUVOIR-like survey will be able to determine the existence of the habitable zone - but only if habitable planets are relatively common.