# Earthquake denoising via Cluster Analysis of Trimmed Spectrogram (CATS)

This notebook explains the usage of the CATS denoiser.

A minimalistic example would look like this:

```python
import cats

data = cats.import_sample_data()['data']
denoiser = cats.CATSDenoiser(**parameters)  # `parameters` is a dict of the settings described below
result = denoiser.denoise(data)
result.plot((1, 2))
```

Below is a more detailed explanation.

In [1]:
import numpy as np
import holoviews as hv

In [2]:
import cats

<hr>

# Import of synthetic dataset

The data may contain any number of traces/receivers and components, so the input array can have an arbitrary shape, including a single trace of shape `(N,)`.

In this example, we consider multiple 3-component receivers.

**Note**, the API expects the input data as a `numpy.ndarray`.

In [3]:
data = cats.import_sample_data()

Dclean = data['data']
time = data['time']  # time
dt = data['dt']      # sampling time
x = data['x']        # location of receivers
dimensions = ["Component", "Receiver", "Time"]

In [4]:
print(f"Input dataset shape\t=\t{Dclean.shape}")
print(f"Dimensions correspond to\t{dimensions}")
print(*[f"{dim} : {shp}" for dim, shp in zip(dimensions, Dclean.shape)], sep='\t')

Input dataset shape	=	(3, 10, 70000)
Dimensions correspond to	['Component', 'Receiver', 'Time']
Component : 3	Receiver : 10	Time : 70000


In [5]:
# contamination with white gaussian noise
np.random.seed(132)
noise_scale = 0.1
Noise = np.random.randn(*Dclean.shape) * noise_scale   # white Gaussian noise
Noise += noise_scale * np.sin(time * 2 * np.pi * 50)[None, None, :]  # constant 50 Hz electrical noise
D = Dclean + Noise

# CATS Denoiser

The denoiser is implemented as the operator `cats.CATSDenoiser(parameters)`.

Data parameters:

0. `dt_sec`                 - sampling time in **seconds**

Main free parameters:

1. `stft_window_type`       - type of STFT window, e.g. 'hann' or 'hamming'. See `scipy.signal.get_window()` for more windows
2. `stft_window_sec`        - length of STFT window in **seconds**
3. `stft_overlap`           - overlap rate of STFT windows, range (0, 1) (e.g. `0.5` is 50%)
4. `minSNR`                 - minimum Signal-to-Noise Ratio, range ~ (3.5 - 5.5). It is used to estimate the noise standard deviation and as the minimum average SNR in clusters
5. `stationary_frame_sec`   - frame length over which noise is assumed stationary, in **seconds**
6. `cluster_size_t_sec`     - minimum cluster size in time or **minimum time duration** of the strongest phases in the signal, in **seconds**
7. `cluster_size_f_Hz`      - minimum cluster size in frequency (frequency width of signal), in **hertz**
8. `cluster_distance_t_sec`  - neighborhood distance for clustering in time or **minimum separation time** between different events, in **seconds**, default `cluster_size_t_sec/2`
9. `cluster_distance_f_Hz`   - neighborhood distance for clustering in frequency (minimum separation in frequency), in **hertz**, default `cluster_size_f_Hz/2`

Additional parameters:
1. `stft_nfft`               - zero-padding of STFT windows, a power of 2 is recommended (e.g. `512`)
2. `clustering_multitrace`   - multitrace clustering, `True/False`, improves detection/denoising for regular arrays of receivers, default `False`
3. `cluster_size_trace`      - minimum number of traces for multitrace clustering, {1, 2, ...} (e.g. `2`)
4. `cluster_distance_trace`  - neighborhood distance for multitrace clustering, {1, 2, ...} (e.g. `2`)
5. `freq_bandpass_Hz`        - inclusive bandpass range in Hertz, e.g. [1, 256] Hz, which zeroes everything outside of the range
6. `cluster_catalogs`        - creates dataframes of cluster catalogs with their characteristics

Experimental parameters:
1. `bedate_freq_grouping_Hz`   - makes grouping of frequency bins for B-E-DATE noise estimation, except zero and Nyquist frequencies, given as group width in Hertz

2. `bedate_log_freq_grouping`  - makes grouping of frequency bins for B-E-DATE noise estimation in log10 scale, except zero and Nyquist frequencies, given as group width in log10 Hertz

3. `cluster_minSNR`            - min average cluster SNR, by default the same as `minSNR`  

4. `cluster_fullness`          - min number of elements in cluster, by default 0 (optional parameter)

5. `cluster_size_f_logHz`      - min cluster size in frequency in log10 scale, e.g. 0.5 log10 Hz 

6. `cluster_distance_f_logHz`  - clustering distance in frequency in log10 scale, increases computational cost
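
To make the log10-scale sizes above more concrete: a width of `w` in log10 Hz spans a frequency *ratio* of `10**w`, so the same log10 width covers very different widths in Hz depending on where the band starts. The zero frequency has no position on a log10 axis. The helper `log10_width` below is hypothetical and only illustrates the arithmetic; it is not part of `cats`.

```python
import numpy as np

def log10_width(f_low_hz, f_high_hz):
    """Width of the band [f_low_hz, f_high_hz] in log10 Hz (hypothetical helper)."""
    return np.log10(f_high_hz / f_low_hz)

# A width of 0.5 log10 Hz corresponds to a frequency ratio of 10**0.5
ratio = 10 ** 0.5

# Two bands with the same log10 width but different widths in Hz
narrow = log10_width(1.0, 1.0 * ratio)     # starts at 1 Hz
wide = log10_width(10.0, 10.0 * ratio)     # starts at 10 Hz

print(f"frequency ratio = {ratio:.3f}")
print(f"log10 widths    = {narrow:.2f}, {wide:.2f}")
print(f"widths in Hz    = {ratio - 1.0:.2f}, {10.0 * ratio - 10.0:.2f}")
```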

In [6]:
denoiser = cats.CATSDenoiser(dt_sec=dt,
                             stft_window_type='hann',
                             stft_window_sec=0.75, 
                             stft_overlap=0.8,
                             minSNR=6.0,
                             stationary_frame_sec=-1,  # `-1` means the least possible
                             cluster_size_t_sec=0.2,
                             cluster_size_f_Hz=5,
                             cluster_size_f_logHz=0.5,
                             cluster_distance_t_sec=0.2,
                             cluster_distance_f_Hz=2, 
                             cluster_minSNR=0.0)

The instantiated `denoiser` has four main methods and attributes:
1. `.denoise` - performs the denoising of the input data
2. `.denoise_to_file` - performs the denoising and saves the result to a file
3. `.denoise_on_files` - performs the denoising by reading from files and saving to files
4. `.STFT` - link to the corresponding STFT operator, see more in `cats.STFTOperator(...)` 

In addition, all input parameters given in physical units, such as `stationary_frame_sec`, `cluster_size_t_sec`, `cluster_size_f_Hz`, etc., are used to calculate the indexed lengths `stationary_frame_len`, `cluster_size_t_len`, `cluster_size_f_len`, etc., according to the given sampling `dt_sec`, `stft_overlap`, and `stft_nfft`.
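
A rough sketch of how a conversion from physical units to indexed lengths could look. The formulas and rounding rules here are assumptions for illustration and may differ from what `cats` does internally; `dt_sec = 1/512` is likewise an assumed sampling interval, not a value read from the sample dataset.

```python
# Assumed, illustrative inputs (not read from `cats`)
dt_sec = 1 / 512             # sampling interval of the input signal, seconds
stft_window_sec = 0.75
stft_overlap = 0.8
cluster_size_t_sec = 0.2
cluster_size_f_Hz = 5.0

# STFT window length and hop between consecutive windows, in samples
window_len = int(round(stft_window_sec / dt_sec))
hop_len = max(1, int(round(window_len * (1 - stft_overlap))))

# Zero-padded FFT length: next power of 2 not smaller than the window (assumed default)
stft_nfft = 1 << (window_len - 1).bit_length()

# Sampling of the resulting time-frequency grid
stft_dt_sec = hop_len * dt_sec
stft_df_Hz = 1 / (stft_nfft * dt_sec)

# Physical sizes expressed as numbers of grid cells (at least one cell)
cluster_size_t_len = max(1, int(round(cluster_size_t_sec / stft_dt_sec)))
cluster_size_f_len = max(1, int(round(cluster_size_f_Hz / stft_df_Hz)))

print(window_len, hop_len, stft_nfft, cluster_size_t_len, cluster_size_f_len)
```

The point of the sketch is that the same physical setting maps to different indexed lengths whenever `dt_sec`, `stft_overlap`, or `stft_nfft` change, which is why the parameters are specified in seconds and hertz.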

**Note**, other properties can be viewed via `.dict()`

#### Noise PSD

In [7]:
# visualize the Noise PSD using the corresponding STFT operator
noise_psd = abs(denoiser.STFT * Noise).mean(axis=(0, 1, 3))
f_dim = hv.Dimension('Frequency', unit='Hz')
fontsize = dict(labels=16, ticks=14, title=20)
fig_noise_psd = hv.Curve((denoiser.stft_frequency, noise_psd), 
                         kdims=f_dim).opts(fig_size=250, aspect=2.5, logx=True, 
                                           fontsize=fontsize, ylabel='', title='Noise PSD',
                                           logy=True, xlim=(0.5, np.nan))
fig_noise_psd

To save the figure, use `hv.save(...)`

In [8]:
hv.save(fig_noise_psd, "../figures/fig_noise_psd_sample.png", dpi=250)
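
As a cross-check that does not depend on `cats`, the noise spectrum can also be estimated with plain NumPy. This is only a sketch: it builds a single noise trace with the same recipe as above, assumes a sampling interval of `1/512` s, and averages amplitude spectra of non-overlapping Hann-windowed segments, which is not necessarily how the STFT operator is configured.

```python
import numpy as np

rng = np.random.default_rng(132)
dt = 1 / 512                      # assumed sampling interval, seconds
npts = 4096
time = np.arange(npts) * dt
noise_scale = 0.1

# Same recipe as in the notebook: white Gaussian noise plus a 50 Hz sinusoid
noise = rng.standard_normal(npts) * noise_scale
noise += noise_scale * np.sin(2 * np.pi * 50 * time)

# Split into non-overlapping segments, apply a Hann window, average amplitude spectra
nperseg = 512
segments = noise.reshape(-1, nperseg) * np.hanning(nperseg)
spectrum = np.abs(np.fft.rfft(segments, axis=-1)).mean(axis=0)
freqs = np.fft.rfftfreq(nperseg, d=dt)

# Locate the strongest peak, skipping the zero-frequency bin
peak_hz = freqs[np.argmax(spectrum[1:]) + 1]
print(f"Strongest spectral peak at {peak_hz:.1f} Hz")
```

The flat background comes from the white Gaussian part, and the narrow peak comes from the 50 Hz sinusoid.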

## Applying CATS Denoiser

To apply the denoising, pass the input data `x` to the function `denoiser.denoise(x)`.

```python
denoiser.denoise(
                x,  # input data (numpy.ndarray), the last axis is time
                verbose,  # True/False, print status messages (stages and timing)
                full_info, # True/False/'qc', save intermediate steps for quality control
                )
```

`x` is `numpy.ndarray` and may have any number of dimensions with shape `(..., N)`, but the last axis `N` must be Time.
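
If the data are stored with Time on a different axis, they can be rearranged with standard NumPy calls before denoising. A minimal sketch, assuming a hypothetical recording laid out as `(Time, Receiver, Component)`:

```python
import numpy as np

# Hypothetical recording stored as (Time, Receiver, Component)
raw = np.zeros((70000, 10, 3))

# Move Time to the last axis, keeping the order of the other axes
x_moved = np.moveaxis(raw, 0, -1)      # (Receiver, Component, Time)

# Or reorder all axes explicitly to (Component, Receiver, Time)
x = raw.transpose(2, 1, 0)

print(x_moved.shape, x.shape)
```

Both calls return views, so no data are copied at this point.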

In [9]:
print(D.shape)

(3, 10, 70000)


In [10]:
result = denoiser.denoise(D, verbose=True, full_info=True)

1. STFT	...	Completed in 0.187 sec
2. B-E-DATE trimming	...	Completed in 0.0536 sec
3. Clustering	...	Completed in 0.0507 sec
4. Inverse STFT	...	Completed in 0.186 sec
Total elapsed time:	0.477 sec



The result of denoising is written to a `cats.CATSDenoisingResult` object, which makes it convenient to inspect the output of each step of the workflow.

It has several attributes:
1. `.signal`  -  original input signal `x`, shape `(..., N)`
2. `.coefficients`  -  result of the STFT, complex-valued coefficients, shape `(..., Nf, Nt)`, where `Nf` is the number of frequencies and `Nt` is the number of time samples after the STFT
3. `.spectrogram`  -  absolute value of complex-valued `.coefficients`, shape `(..., Nf, Nt)`
4. `.time_frames`  -  time frames in which noise is assumed to be stationary, shape `(nt, 2)`, `[[frame1_start, frame1_end], ...]`, where `nt` is the number of stationary time frames
5. `.frequency_groups`  -  frequency groups that were processed together in B-E-DATE for noise estimation, shape `(gNf, 2)`, where `gNf` is the number of frequency groups
6. `.noise_std`  -  noise standard deviation as a function of time and frequency, shape `(..., gNf, nt)`
7. `.noise_threshold_conversion`  -  conversion constants from `noise_std` to actual thresholds `(gNf,)`
8. `.spectrogram_SNR_trimmed`  -  trimmed spectrogram of SNR values, shape `(..., Nf, Nt)`
9. `.spectrogram_SNR_clustered`  -  clustered trimmed SNR spectrogram, shape `(..., Nf, Nt)`
10. `.spectrogram_cluster_ID`  -  cluster indexes of `.spectrogram_SNR_clustered`, shape `(..., Nf, Nt)`
11. `.signal_denoised`  -  denoised input signal, shape `(..., N)`
12. `.stft_frequency`  -  frequency axis of STFT `(Nf,)`
13. `.minSNR` - minimum Signal-to-Noise Ratio used in noise estimation
14. `.dt_sec` - sampling in seconds of the input `signal` 
15. `.stft_dt_sec` - sampling in seconds of the STFT operator
16. `.stft_t0_sec` - starting time of performed STFT, almost always `0` but rarely ~`-stft_dt_sec`
17. `.npts` - number of time points of the input `signal` -> `N`
18. `.stft_npts` - number of time points after the STFT -> `Nt`
19. `.history` - recorded timing of each processing stage (in seconds)
20. `.time()` - function, giving original time axis in seconds `(N,)`
21. `.stft_time()`  -  function, giving time axis after STFT in seconds `(Nt,)`
22. `.main_params`  -  all the params that are needed to initialize the denoiser via `CATSDenoiser.from_result(CATSDenoisingResult)`

Additionally:
- `.save(filename)` - to save the result as MAT file
- `.load(filename)` - to load a saved result from MAT file

To save the result with all the info specified by the `full_info` argument, use the `save` method.

In [11]:
result.save("test_save.mat")

We can also import a saved result via `cats.CATSDenoisingResult.load`

In [12]:
result = cats.CATSDenoisingResult.load("test_save.mat")

The previously saved result can also be used to reconstruct the denoiser that produced it: 

In [13]:
denoiser = cats.CATSDenoiser.from_result(result)

Each attribute can be accessed as follows:

In [14]:
# in seconds
result.time_frames

array([[  0.        ,  45.4302054 ],
       [ 45.58063655,  91.01084195],
       [ 91.16127309, 136.74190964]])
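
Each row of `time_frames` is a `[start, end]` interval in seconds. A small sketch of working with such an array, using the values printed above; the helper `frame_index` is hypothetical and not part of `cats`:

```python
import numpy as np

# Values copied from the `result.time_frames` output above (seconds)
time_frames = np.array([[0.0, 45.4302054],
                        [45.58063655, 91.01084195],
                        [91.16127309, 136.74190964]])

# Duration of each stationary frame
durations = time_frames[:, 1] - time_frames[:, 0]

def frame_index(t_sec, frames):
    """Index of the frame containing t_sec, or -1 if no frame contains it."""
    inside = (frames[:, 0] <= t_sec) & (t_sec <= frames[:, 1])
    return int(np.argmax(inside)) if inside.any() else -1

print(durations)
print(frame_index(50.0, time_frames))
```

The small gaps between one frame's end and the next frame's start are consistent with the frame boundaries being sampled on the STFT time axis rather than on the original one.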

### Clusters catalog
From the denoised events, we can retrieve a catalog of their cluster statistics. To do so, either set `cats.CATSDenoiser(..., cluster_catalogs=True)` when creating the denoiser, or update an existing denoiser with
```python 
denoiser.reset_params(cluster_catalogs=True)
```

After this, the next time we apply the denoiser, the result will contain an array of catalogs, one per trace, in `.cluster_catalogs`

In [15]:
denoiser.reset_params(cluster_catalogs=True)

## Visualization

### Single trace and workflow steps

Another option available from `cats.CATSDenoisingResult` is the method `.plot(ind)`, which displays each step of the denoising workflow on the single 1D trace selected by the argument `ind`, which must be a `tuple of ints`.

**Note**, to plot the workflow stages, `full_info` must be `'qc'` or `True` in `.denoise(..., full_info='qc')` (`'qc'` is equivalent to `.get_qc_keys()`)

**\*** `'qc'` will save only what is needed for the quality-control `.plot` function

In [16]:
# to see how `full_info` works
result = denoiser.denoise(D, verbose=True, 
                         full_info='qc'  # to save only necessary stages for quality control plotting
                        )

1. STFT	...	Completed in 0.184 sec
2. B-E-DATE trimming	...	Completed in 0.0541 sec
3. Clustering	...	Completed in 0.0505 sec
4. Inverse STFT	...	Completed in 0.117 sec
Total elapsed time:	0.405 sec



In [17]:
# Testing Save and load again
result.save("test_save.mat")
result = cats.CATSDenoisingResult.load("test_save.mat")

In [18]:
# retrieved statistics of clusters 
result.cluster_catalogs[0, 5]

| Cluster_ID | Time_start_sec | Time_end_sec | Time_center_of_mass_sec | Frequency_start_Hz | Frequency_end_Hz | Frequency_center_of_mass_Hz | Average_SNR | Peak_SNR | Area |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 40.842056 | 41.142918 | 40.962422 | 0.0 | 1.499596 | 0.589164 | 3.829833 | 4.718478 | 0.451172 |
| 2 | 50.018355 | 51.673098 | 50.997595 | 0.0 | 11.496902 | 5.296938 | 10.138462 | 32.958866 | 10.677734 |
| 3 | 114.402885 | 114.703747 | 114.534163 | 0.0 | 1.499596 | 0.276322 | 3.750758 | 4.193226 | 0.451172 |
| 4 | 122.827029 | 126.286945 | 123.674967 | 1.499596 | 9.497441 | 5.343257 | 7.496793 | 28.396891 | 12.632812 |
| 5 | 10.154102 | 11.20712 | 10.617339 | 3.499057 | 13.496364 | 7.694575 | 5.237129 | 11.792676 | 6.617188 |
| 6 | 17.224366 | 18.427815 | 17.732414 | 3.499057 | 24.493401 | 13.193826 | 4.912335 | 8.994674 | 14.287109 |
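
Catalog statistics such as these can be used to select clusters after denoising. The sketch below works on plain tuples copied from the output above rather than on the catalog object itself, since the exact layout of that object is not shown here; the threshold `min_peak_snr` is an illustrative value, not a recommendation.

```python
# Rows copied from the catalog output above:
# (Cluster_ID, Time_start_sec, Time_end_sec, Average_SNR, Peak_SNR)
catalog = [
    (1, 40.842056, 41.142918, 3.829833, 4.718478),
    (2, 50.018355, 51.673098, 10.138462, 32.958866),
    (3, 114.402885, 114.703747, 3.750758, 4.193226),
    (4, 122.827029, 126.286945, 7.496793, 28.396891),
    (5, 10.154102, 11.20712, 5.237129, 11.792676),
    (6, 17.224366, 18.427815, 4.912335, 8.994674),
]

min_peak_snr = 8.0   # illustrative threshold

# Keep clusters with a strong peak and order them by start time
strong = sorted((row for row in catalog if row[4] >= min_peak_snr),
                key=lambda row: row[1])

for cluster_id, t_start, t_end, avg_snr, peak_snr in strong:
    print(f"cluster {cluster_id}: {t_start:.2f}-{t_end:.2f} s, peak SNR {peak_snr:.1f}")
```

With this threshold, the two short low-frequency clusters (IDs 1 and 3) are dropped.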


In [19]:
result.plot(ind=(0, 5), time_interval_sec=(7, 22))

### Visualization of the denoising result on multiple traces

In [20]:
comp = 2

fig = cats.plot_traces(data=Dclean[comp], time=time, time_interval_sec=(7, 22), gain=0.5, each_trace=1)
fontsize = dict(labels=17, ticks=15, title=20); figsize = 450
fig = fig.opts(hv.opts.Curve(fontsize=fontsize), hv.opts.Rectangles(fontsize=fontsize))
fig = fig.opts(aspect=2, fig_size=figsize,
               ylabel='Location (km)', xlabel='Time (s)', title='Clean data')
# hv.save(fig, "../figures/clean_traces_sample.png", dpi=250)  # use this to save figure
fig

In [21]:
# for plotting traces with the results, we could also use the built-in `plot_traces` method of the `result` object
fig = result.plot_traces(ind=comp, show_denoised=True, time_interval_sec=(7, 22), gain=0.4)

fig = fig.opts(hv.opts.Curve(fontsize=fontsize, linewidth=0.3), hv.opts.Rectangles(fontsize=fontsize))
fig = fig.opts(aspect=2, fig_size=figsize,
               ylabel='Location (km)', xlabel='Time (s)', title='Denoised data')
fig

<hr>

# Extra example - multitrace denoising

Sometimes the data are recorded on a "regular" array of receivers, and an earthquake appears on the array as a coherent signal. This coherence across multiple stations can be used to enhance the denoising quality. Specifically, we can cluster the spectrograms across multiple stations simultaneously, in `Trace x Time x Frequency`.

Here we apply multitrace denoising to the same dataset to show the improvement in performance.
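
The idea can be illustrated with a toy example that does not use `cats`. The sketch below labels connected regions of a boolean mask with a simple flood fill and compares per-trace clustering with clustering over all traces at once. It is an assumption-laden simplification: the connectivity rule, the array layout `(Trace, Frequency, Time)`, and the size threshold `min_size` are chosen for illustration and do not reproduce the actual CATS clustering with its size and distance parameters.

```python
import numpy as np
from collections import deque

def label_clusters(mask):
    """Label face-connected components of a boolean N-D array (simple flood fill)."""
    labels = np.zeros(mask.shape, dtype=int)
    current = 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue
        current += 1
        labels[start] = current
        queue = deque([start])
        while queue:
            idx = queue.popleft()
            for axis in range(mask.ndim):
                for step in (-1, 1):
                    nb = list(idx)
                    nb[axis] += step
                    nb = tuple(nb)
                    if 0 <= nb[axis] < mask.shape[axis] and mask[nb] and not labels[nb]:
                        labels[nb] = current
                        queue.append(nb)
    return labels, current

# Toy mask of "significant" time-frequency bins, layout (Trace, Frequency, Time)
mask = np.zeros((4, 6, 8), dtype=bool)
mask[:, 2, 3:5] = True     # weak coherent event: 2 bins on every trace
mask[1, 5, 0] = True       # isolated noise bin on one trace

min_size = 4               # illustrative minimum cluster size, in bins

# Per-trace clustering: each trace is labeled independently
kept_single = 0
for trace_mask in mask:
    trace_labels, _ = label_clusters(trace_mask)
    trace_sizes = np.bincount(trace_labels.ravel())[1:]
    kept_single += int((trace_sizes >= min_size).sum())

# Multitrace clustering: neighbors across traces are connected as well
labels, _ = label_clusters(mask)
sizes = np.bincount(labels.ravel())[1:]
kept_multi = int((sizes >= min_size).sum())

print(f"clusters kept per trace: {kept_single}, multitrace: {kept_multi}")
```

On each trace alone the event is too small to pass the size threshold, whereas across traces it forms one cluster that is kept, while the isolated noise bin is rejected in both cases.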

In [22]:
denoiser_mt = cats.CATSDenoiser(dt_sec=dt,
                                stft_window_type='hann',
                                stft_window_sec=0.75, 
                                stft_overlap=0.8,
                                minSNR=6.5,
                                stationary_frame_sec=-1,  # `-1` means the least possible
                                cluster_size_t_sec=0.2,
                                cluster_size_f_Hz=5,
                                cluster_size_f_logHz=0.35,
                                cluster_distance_t_sec=0.2,
                                cluster_distance_f_Hz=2,
                                clustering_multitrace=True,  # now we set multitrace clustering as `True`
                                cluster_size_trace=1,
                                cluster_distance_trace=1, 
                                cluster_catalogs=True, 
                                cluster_minSNR=3.5
                               )

**Note**, multitrace clustering is performed across the axis of the input array that immediately precedes the Time axis.

In [23]:
print(f"Input shape : {D.shape}")
print(f"Multitrace clustering will be done on {len(D.shape) - 2} axis with {D.shape[-2]} elements")
print(f"If we swapped axes of input data to {D.swapaxes(0, 1).shape}, \
then clustering would be done on {D.swapaxes(0, 1).shape[-2]} elements")

Input shape : (3, 10, 70000)
Multitrace clustering will be done on 1 axis with 10 elements
If we swapped axes of input data to (10, 3, 70000), then clustering would be done on 3 elements


In [24]:
result_mt = denoiser_mt.denoise(D, verbose=True, full_info='qc')

1. STFT	...	Completed in 0.204 sec
2. B-E-DATE trimming	...	Completed in 0.0511 sec
3. Clustering	...	Completed in 0.0543 sec
4. Inverse STFT	...	Completed in 0.115 sec
Total elapsed time:	0.424 sec



In [28]:
comp = 2
interval = (7, 23)

fig = cats.plot_traces(data=D[comp], time=time, time_interval_sec=interval, gain=0.5, each_trace=1)
fontsize = dict(labels=17, ticks=15, title=20); figsize = 450
fig = fig.opts(hv.opts.Curve(fontsize=fontsize), hv.opts.Rectangles(fontsize=fontsize))
fig = fig.opts(aspect=2, fig_size=figsize,
               ylabel='Location (km)', xlabel='Time (s)', title='Noisy data')
# hv.save(fig, "../figures/clean_traces_sample.png", dpi=250)  # use this to save figure
fig

In [29]:
fig = result_mt.plot_traces(ind=comp, time_interval_sec=interval, gain=0.5, clip=False)

fig = fig.opts(hv.opts.Curve(fontsize=fontsize, linewidth=0.5), hv.opts.Rectangles(fontsize=fontsize))
fig = fig.opts(aspect=2, fig_size=figsize,
               ylabel='Location (km)', xlabel='Time (s)', title='Denoised data')
fig

In [30]:
fig = cats.plot_traces(data=Dclean[comp], time=time, time_interval_sec=interval, gain=0.5, each_trace=1)
fontsize = dict(labels=17, ticks=15, title=20); figsize = 450
fig = fig.opts(hv.opts.Curve(fontsize=fontsize), hv.opts.Rectangles(fontsize=fontsize))
fig = fig.opts(aspect=2, fig_size=figsize,
               ylabel='Location (km)', xlabel='Time (s)', title='Clean data')
# hv.save(fig, "../figures/clean_traces_sample.png", dpi=250)  # use this to save figure
fig