# Data Modeling I: Probability and Statistics

## Introduction

### Why Probability and Statistics Matter in Science

Any physical measurement involves some level of uncertainty or noise.
Whether we are measuring the intensity of a star's light, the decay rate of a radioactive sample, or the temperature of the cosmic microwave background, the data we collect are never perfectly exact.
Probability theory offers a systematic way to handle this lack of certainty.
More specifically:
* **Data Interpretation**:
  We want to connect noisy observations to underlying physical models.
  If a dataset appears to fluctuate, is that fluctuation a real signal, or is it just random noise?
  Probability gives us formal tools, such as hypothesis tests and confidence intervals, to decide.
* **Incomplete Knowledge**:
  Even when processes are entirely deterministic at some level, we often lack complete information.
  Probability distributions let us quantify the range of possible outcomes or parameter values.
* **Unrepeatable Events**:
  Fields like astronomy pose a particular challenge: many phenomena (e.g., a supernova) cannot be repeated under controlled conditions.
  We must instead rely on "fair samples": one-time observations that we assume are representative of the underlying population.
  Probability theory becomes crucial for making sense of these non-repeatable experiments.

From a broad perspective, probability theory is the logic of science:
it extends our classical (Boolean) logic into a realm where conclusions cannot be absolutely certain but can be assigned degrees of belief or confidence.
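
One way to make this statement concrete is through the two basic rules of probability, written here for propositions $A$ and $B$ given background information $C$ (the symbols are notation introduced for this sketch, not taken from the text above):

$$
P(A \mid C) + P(\bar{A} \mid C) = 1,
\qquad
P(A, B \mid C) = P(A \mid B, C)\, P(B \mid C).
$$

If every probability is restricted to the values 0 or 1, the first rule reproduces logical NOT and the second reproduces logical AND, so Boolean logic appears as the limiting case of complete certainty. Intermediate values between 0 and 1 express partial belief.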

### Historical Context and Key Contributors

Probability theory and its practical offshoot—statistics—did not emerge fully formed.
Many scientists who pioneered the subject were themselves astronomers or physicists grappling with noisy measurements:
* **Blaise Pascal & Pierre de Fermat** (1650s):
  Their work on games of chance launched the formal study of probability, initially focusing on gambling problems but laying the groundwork for more general applications.
* **Jacob Bernoulli & Thomas Bayes** (18th century):
  They introduced foundational ideas on how to assign and update probabilities.
  Bayes's Theorem still underlies modern Bayesian statistics, which treats probability as "degree of belief" and updates those degrees using observed data.
* **Pierre-Simon Laplace & Carl Friedrich Gauss** (late 18th/early 19th century):
  Both were astronomers/mathematicians.
  Gauss's work on least squares and the "Gaussian (normal) distribution of errors" became central to how we handle measurement noise.
  Laplace's rediscovery of Bayesian methods brought probability firmly into the domain of scientific data interpretation.
* **Frequentist vs. Bayesian** (20th century):
  Mathematicians and statisticians debated how best to define, interpret, and use probabilities, especially for inference.
  The result was a rich theoretical framework that scientists still apply daily.

### Probability in Observational Fields (e.g., Astronomy)

The inherent randomness or incomplete knowledge in data is especially stark in astronomy and astrophysics:
1. Detection
   * We often ask: "Is this faint signal real, or is it simply noise?"
   * Probability-based hypothesis testing helps decide when a new source (like an exoplanet transit or a distant supernova) is detected with sufficient confidence.
2. Parameter Estimation
   * Astronomical models (e.g., cosmological models) include many parameters: densities of matter, dark energy, expansion rates, etc.
   * Observed data (galaxy distributions, microwave background fluctuations) are used in conjunction with statistical techniques (maximum likelihood, Bayesian inference) to estimate these parameters and their uncertainties.
3. Model Comparison
   * When several models fit the data comparably well (e.g., different dark matter profiles, different supernova light-curve templates), probability theory guides us in judging which model is better supported by the data.
4. Sampling Limitations
   * Astronomers typically cannot control the experiment: they cannot "turn off" a star or "re-run" a supernova to see whether the same outcome occurs.
   * Instead, we rely on collecting large samples across space or time.
     Statistical arguments (assuming each event or source is an independent draw from an underlying population) become critical.
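
As a minimal sketch of the detection question above, suppose a detector counts photons and the background alone is expected to produce a known mean number of counts. Under the assumption that the counts are Poisson distributed, the probability of seeing at least the observed number from background alone (a p-value) measures how surprising the observation is. The function name `poisson_tail` and all numbers below are hypothetical, chosen only for illustration.

```python
import math


def poisson_tail(n_obs: int, mean_bkg: float) -> float:
    """Return P(N >= n_obs) for N ~ Poisson(mean_bkg).

    This is the chance that background alone yields at least n_obs counts.
    The complement 1 - CDF is adequate for illustration, but it loses
    precision for extremely small tail probabilities.
    """
    term = math.exp(-mean_bkg)  # P(N = 0)
    cdf_below = 0.0             # accumulates P(N < n_obs)
    for k in range(n_obs):
        cdf_below += term
        term *= mean_bkg / (k + 1)  # P(N = k + 1) from P(N = k)
    return max(0.0, 1.0 - cdf_below)


# Hypothetical example: background expected to give 10 counts on average.
mean_bkg = 10.0
for n_obs in (12, 25):
    p = poisson_tail(n_obs, mean_bkg)
    print(f"{n_obs} counts observed: p-value = {p:.2e}")
```

With these illustrative numbers, 12 counts gives a p-value of about 0.3, which is unremarkable, while 25 counts gives a p-value of order $10^{-5}$, which would be hard to attribute to background alone. Where to place the threshold for claiming a detection is a convention that varies between fields.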

These points generalize beyond astronomy:
any domain with random influences or inherent uncertainty (particle physics, biophysics experiments, sensor measurements in engineering, etc.) gives probability a central role.

### Bread-and-Butter Statistical Aims

Across all these scientific domains, three recurring tasks stand out:
1. Hypothesis Testing
   * "Is the signal real or just random fluctuation?"
   * "Does this distribution deviate significantly from a known model?"
2. Parameter Estimation
   * "Given a theoretical model with free parameters, which parameter values best explain the observed data?"
   * Example: Fitting a light-curve model to a set of brightness measurements.
3. Model Comparison / Model Selection
   * "Which of two (or more) possible models is most consistent with the data?"
   * "Is a more complex model justified by the evidence, or does a simpler one suffice?"
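
The second and third tasks can be sketched on synthetic data. The example below assumes brightness measurements with independent Gaussian errors of known size `sigma`, fits both a constant and a straight line by least squares (which is the maximum-likelihood fit under that assumption), and compares the two with the Bayesian information criterion (BIC). The data, the "true" parameter values, and the choice of BIC as the comparison statistic are all assumptions made for illustration; other criteria could be used instead.

```python
import math
import random

random.seed(0)  # fixed seed so the synthetic data are reproducible

# Synthetic "brightness vs. time" data from a hypothetical linear trend.
sigma = 0.5  # assumed known measurement error
t = [float(i) for i in range(20)]
y = [2.0 + 0.3 * ti + random.gauss(0.0, sigma) for ti in t]
n = len(y)


def fit_line(t, y):
    """Ordinary least-squares intercept and slope for y = a + b * t."""
    t_mean = sum(t) / len(t)
    y_mean = sum(y) / len(y)
    s_tt = sum((ti - t_mean) ** 2 for ti in t)
    s_ty = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    slope = s_ty / s_tt
    return y_mean - slope * t_mean, slope, s_tt


def chi2(residuals, sigma):
    """Sum of squared residuals in units of the measurement error."""
    return sum((r / sigma) ** 2 for r in residuals)


# Parameter estimation: best-fit values and the slope uncertainty.
c = sum(y) / n                      # best-fit constant model
a, b, s_tt = fit_line(t, y)         # best-fit linear model
b_err = sigma / math.sqrt(s_tt)     # standard error of the slope

# Model comparison: chi-square plus a penalty of ln(n) per free parameter.
chi2_const = chi2([yi - c for yi in y], sigma)
chi2_line = chi2([yi - (a + b * ti) for ti, yi in zip(t, y)], sigma)
bic_const = chi2_const + 1 * math.log(n)
bic_line = chi2_line + 2 * math.log(n)

print(f"slope = {b:.3f} +/- {b_err:.3f}, intercept = {a:.3f}")
print(f"BIC constant = {bic_const:.1f}, BIC line = {bic_line:.1f}")
```

The model with the lower BIC is preferred. The line always fits at least as well as the constant because it contains the constant as a special case, so the penalty term is what guards against adopting the extra parameter without sufficient evidence.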