# Detector noise

During training, simulated noise $n_I$ is added to waveforms $h_I(\theta)$ measured in detectors to produce realistic simulated data,

$$
d_I = h_I(\theta) + n_I.
$$

Dingo assumes this noise to be stationary and Gaussian, so it is independent in each frequency bin, with variance given by the power spectral density (PSD).
```{important}
Similar to extrinsic parameters, detector noise is repeatedly sampled **during training** and added to the simulated signal. This augments the training set with new noise realizations for each epoch, reducing overfitting. 
```
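
As a rough illustration of what sampling a noise realization involves, the sketch below draws stationary Gaussian noise in the frequency domain for a given ASD and adds it to a waveform. It is not Dingo's implementation: the function names `sample_noise` and `add_noise` are hypothetical, a uniform frequency grid with spacing `delta_f` is assumed, and the window factor discussed under data conditioning is ignored.

```python
import numpy as np


def sample_noise(asd, delta_f, rng=None):
    """Draw one frequency-domain noise realization for a given ASD.

    Hypothetical helper: real and imaginary parts in each bin are drawn
    independently with variance S_n / (4 * delta_f), where S_n = asd**2.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Standard deviation of the real and imaginary parts of white noise.
    white_std = np.sqrt(1.0 / (4.0 * delta_f))
    white = rng.standard_normal(asd.shape) + 1j * rng.standard_normal(asd.shape)
    # Color the white noise with the ASD; bins remain independent.
    return asd * white_std * white


def add_noise(waveform, asd, delta_f, rng=None):
    """Return d = h + n for a fresh noise realization n."""
    return waveform + sample_noise(asd, delta_f, rng=rng)
```

Calling `add_noise` with a new random state for every training sample gives a different noise realization each time the same waveform is seen.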

Although noise is *mostly* stationary and Gaussian during an LVK observing run, the PSD in each detector does tend to drift from event to event. In a usual likelihood-based PE run, this is taken into account by estimating the PSD at the time of the event (either using [Welch's method](https://en.wikipedia.org/wiki/Welch%27s_method) on signal-free data surrounding the event, or at the same time as the event using [BayesWave](https://git.ligo.org/lscsoft/bayeswave)), and using this in the likelihood integral.

Dingo also estimates the PSD just prior to an event and uses this at inference time in two ways:
1. It whitens the data with respect to this PSD.
2. It provides the PSD (or rather, the inverse ASD) as context to the neural network.

A suitably trained model can therefore make use of the PSD as needed to generate the posterior.
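
The two uses of the PSD listed above can be sketched for a single detector as follows. This is a minimal illustration, not Dingo's code: the function name `prepare_network_input` and the three-channel layout (real part, imaginary part, inverse ASD) are assumptions made for the example, and any additional rescaling applied in practice is omitted.

```python
import numpy as np


def prepare_network_input(data, asd):
    """Whiten frequency-domain strain and attach the inverse ASD as context.

    Hypothetical helper; `data` is complex strain and `asd` is the ASD
    estimated just prior to the event, both on the same frequency grid.
    """
    whitened = data / asd      # step 1: whiten with respect to the PSD
    inverse_asd = 1.0 / asd    # step 2: context passed to the network
    return np.stack([whitened.real, whitened.imag, inverse_asd])
```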

(asd-dataset)=
## ASD dataset

To train a model to perform inference conditioned on the noise PSD, it is necessary to not just sample random noise realizations for a given PSD, but also **sample the PSD** from a distribution for a given observing run. Training in this way is necessary to perform fully amortized inference and account for the variation of PSDs from event to event.

The `ASDDataset` class stores a set of ASD samples for several detectors, allowing for sampling during training.

```{eval-rst}
.. autoclass:: dingo.gw.ASD_dataset.noise_dataset.ASDDataset
    :members:
    :inherited-members:
    :show-inheritance:
```

As with the noise realizations, a random ASD is chosen from the dataset when preparing each sample during training. This augments the training set compared to fixing the noise ASD for each sample prior to training.
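
The idea of drawing a fresh ASD for every training sample can be sketched with plain NumPy. The example does not use the `ASDDataset` API: `asd_bank` is a hypothetical dictionary mapping detector names to arrays of shape `(num_asds, num_frequency_bins)`, and `sample_asds` is an illustrative name.

```python
import numpy as np


def sample_asds(asd_bank, rng):
    """Pick one stored ASD per detector, uniformly at random.

    `asd_bank` is a hypothetical stand-in for a stored collection of ASDs.
    """
    return {
        ifo: asds[rng.integers(len(asds))]
        for ifo, asds in asd_bank.items()
    }
```

Each detector is sampled independently here, which is a simplifying assumption of the sketch.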

Similarly to the `WaveformDataset`, the `ASDDataset` is just a container. Dingo includes routines for building such a dataset from observational data.

## Command-line scripts

### `dingo_generate_asd_dataset`
The basic approach is as follows:
1. Identify stretches of data within an observing run that meet certain criteria (sufficiently long, free of events, of sufficiently high quality, ...), or take in user-specified stretches.
2. Fetch data corresponding to these stretches using either
    - [GWOSC](https://www.gw-openscience.org)
    - channels, optionally specified in the settings file.
3. Estimate ASDs using Welch's method on these stretches.
4. Save the collection of ASDs.
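
Step 3 can be illustrated with `scipy.signal.welch`, shown here as a stand-in rather than as the routine the script actually calls. The helper name `estimate_asd` is hypothetical, segments are taken back to back (no gap, no overlap), and the conversion of the window roll-off time to the Tukey shape parameter, `alpha = 2 * roll_off / T`, is an assumption of this sketch.

```python
import numpy as np
from scipy.signal import welch


def estimate_asd(strain, f_s, T, roll_off):
    """Estimate an ASD from a time-domain stretch with Welch's method.

    Hypothetical helper. `strain` is sampled at `f_s` Hz, `T` is the
    segment length in seconds, `roll_off` the Tukey roll-off in seconds.
    """
    nperseg = int(T * f_s)
    alpha = 2.0 * roll_off / T  # assumed roll-off to Tukey-alpha conversion
    freqs, psd = welch(
        strain,
        fs=f_s,
        window=("tukey", alpha),
        nperseg=nperseg,
        noverlap=0,          # contiguous, non-overlapping segments
        average="median",    # median is robust against short glitches
    )
    return freqs, np.sqrt(psd)
```

The returned frequency grid has spacing `1 / T`, so the segment length sets the frequency resolution of the ASD.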

```text
usage: dingo_generate_asd_dataset [-h] --data_dir DATA_DIR [--settings_file SETTINGS_FILE] [--time_segments_file TIME_SEGMENTS_FILE] [--out_name OUT_NAME] [--verbose]

Generate an ASD dataset based on a settings file.

optional arguments:
  -h, --help            show this help message and exit
  --data_dir DATA_DIR   Path where the PSD data is to be stored. Must contain a 'settings.yaml' file.
  --settings_file SETTINGS_FILE
                        Path to a settings file in case two different datasets are generated in the same directory
  --time_segments_file TIME_SEGMENTS_FILE
                        Optional file containing a dictionary of a list of time segments that should be used for estimating PSDs.This has to be a pickle file.
  --out_name OUT_NAME   Path to resulting ASD dataset
  --verbose

```
where the settings file is of the form
```yaml
dataset_settings:
  f_s: 4096
  time_psd: 1024
  T: 8
  time_gap: 0
  window:
    roll_off: 0.4
    type: tukey
  num_psds_max: 20
  channels:
   H1: H1:DCS-CALIB_STRAIN_C02
   L1: L1:DCS-CALIB_STRAIN_C02
  detectors:
    - H1
    - L1
  observing_run: O2
condor:
  env_path: path/to/environment
  num_jobs: 2    # per detector
  num_cpus: 16
  memory_cpus: 16000
  bid: 200
```

Options correspond to the following:

Sampling rate `f_s` (Hz)
: This should be at least twice the value of `f_max` expected to be used.

Data length `time_psd` (s)
: The entire length of data from which to estimate a PSD using Welch's method. Periodograms are calculated on segments of this, and then averaged using the `median` method.

Segment length `T` (s)
: The length of each segment on which to take the DFT and calculate a periodogram.

Gap `time_gap` (s)
: Gap between duration-`T` segments. E.g., if `time_psd=1024`, `T=8`, `time_gap=8`, then for each PSD, 64 periodograms are computed, each using data stretches 8 s long, with gaps of 8 s between segments. Segments would then be $[0~\text{s}, 8~\text{s}], [16~\text{s}, 24~\text{s}], \ldots$.

Window function
: Parameters of the window function used before taking DFT of data segments.

`num_psds_max` (optional)
: If set, stop building the dataset after this number of PSDs have been estimated. This setting is useful for building a single-PSD dataset for pretraining a network.

Channels `channels` (optional)
: If set, data will be fetched from these channels, instead of using GWOSC.

Detectors
: Which detectors (H1, L1, V1, ...) to include in the dataset.

Observing run
: Which observing run to use when estimating PSDs.

Condor (optional)
: Settings for [HTCondor](https://htcondor.readthedocs.io/en/latest/index.html) useful for parallelizing the ASD estimation across condor jobs.
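
The interplay of `time_psd`, `T`, and `time_gap` described above can be checked with a few lines of Python. The function `welch_segments` is a hypothetical helper written for this illustration; it assumes the first segment starts at the beginning of the data stretch and that only complete segments are used.

```python
def welch_segments(time_psd, T, time_gap):
    """List the (start, end) times in seconds of the segments used for one PSD.

    Hypothetical helper: segments of length T are separated by time_gap
    and must fit entirely within the stretch of length time_psd.
    """
    segments = []
    start = 0
    while start + T <= time_psd:
        segments.append((start, start + T))
        start += T + time_gap
    return segments


segments = welch_segments(time_psd=1024, T=8, time_gap=8)
print(len(segments))   # 64
print(segments[:2])    # [(0, 8), (16, 24)]
```

With `time_gap=0`, as in the settings file above, the same stretch yields 128 contiguous segments instead.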

(ref:window-factor)=
## Data conditioning

Importantly, the variance of *white* noise in each frequency bin is not 1, but rather

$$
\sigma^2_{\text{white}} = \frac{w}{4\delta f}
$$

where $\delta f$ is the frequency resolution and $w$ is a "window factor".

The factor $4\delta f$ in the denominator of the noise variance is most easily seen to arise from the noise-weighted inner product,

$$
(a | b) = 4 \text{Re} \int_{f_\text{min}}^{f_\text{max}} df\, \frac{a^\ast(f)b(f)}{S_{\text{n}}(f)}
$$
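
One way to make this explicit, sketched here under the assumption of a uniform frequency grid $f_k$ with spacing $\delta f$ and ignoring the window factor, is to discretize the integral for $a = b = n$:

$$
(n | n) \approx 4\,\delta f \sum_k \frac{|n(f_k)|^2}{S_{\text{n}}(f_k)}
= \sum_k \frac{\left[\text{Re}\, n(f_k)\right]^2 + \left[\text{Im}\, n(f_k)\right]^2}{S_{\text{n}}(f_k) / (4\,\delta f)}.
$$

The Gaussian noise density $p(n) \propto \exp\left[-\tfrac{1}{2}(n | n)\right]$ then factorizes over frequency bins, and the real and imaginary parts of $n(f_k)$ each have variance $S_{\text{n}}(f_k) / (4\,\delta f)$. Dividing the noise by $\sqrt{S_{\text{n}}(f_k)}$ to whiten it leaves a variance of $1/(4\,\delta f)$.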

The window factor comes in because a window must be applied to time series data prior to taking the FFT. The windowing is assumed to reduce the power in the noise, but not affect the signal (which is localized away from the edge of the data segment). To simulate this, we add noise with variance scaled by the window factor.

The noise standard deviation is stored in the property `FrequencyDomain.noise_std`. The window factor is calculated from the data conditioning settings specified in the train settings file.
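
For illustration, the sketch below computes a window factor and the resulting white-noise standard deviation for a Tukey window. It rests on two assumptions that are not stated in this section: that the window factor is the mean of the squared window, $w = \frac{1}{N}\sum_i w_i^2$, and that the roll-off time maps to the Tukey shape parameter as `alpha = 2 * roll_off / T`. The function names are hypothetical and this is not the code behind `FrequencyDomain.noise_std`.

```python
import numpy as np
from scipy.signal.windows import tukey


def window_factor(T, f_s, roll_off):
    """Mean squared value of a Tukey window (assumed definition of w)."""
    alpha = 2.0 * roll_off / T  # assumed roll-off to Tukey-alpha conversion
    window = tukey(int(T * f_s), alpha)
    return np.sum(window**2) / len(window)


def white_noise_std(T, f_s, roll_off):
    """Standard deviation sqrt(w / (4 * delta_f)) with delta_f = 1 / T."""
    delta_f = 1.0 / T
    return np.sqrt(window_factor(T, f_s, roll_off) / (4.0 * delta_f))
```

For `T=8`, `f_s=4096`, and `roll_off=0.4`, the window factor under these assumptions is close to 0.94, so windowing reduces the noise variance by about 6% relative to unwindowed data.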