![image](../docs/_static/seisbench_logo_subtitle_outlined.svg)

# Dataset basics

This tutorial introduces the basics of datasets and benchmark datasets in SeisBench. It explains how to load a dataset, how to filter it and how to access the data.

In [None]:
import seisbench
import seisbench.data as sbd

### Loading a dataset

There are two ways of loading a dataset:
1. loading a benchmark dataset
2. loading a dataset from disk

We will first explore benchmark datasets. Benchmark datasets are represented by classes in SeisBench. When you instantiate the class, SeisBench checks whether the data is available locally and downloads it otherwise. Our example dataset is the DummyDataset, which we load below.

In [None]:
data = sbd.DummyDataset()
print(data)

When this command runs for the first time, the dataset is downloaded. All downloaded data is stored in the SeisBench cache. The cache location defaults to `~/.seisbench`, but can be changed with the environment variable `SEISBENCH_CACHE_ROOT`. Let's inspect the cache. Depending on which commands were used before, it contains at least the directory `datasets`. Inside this directory, each locally available dataset has its own folder. The folder `dummydataset` contains two relevant files, `metadata.csv` and `waveforms.hdf5`, which hold the metadata and the waveforms, respectively.

In [None]:
import os
print("Cache root:", seisbench.cache_root)
print("Contents:", os.listdir(seisbench.cache_root))
print("datasets:", os.listdir(seisbench.cache_root / "datasets"))
print("dummydataset:", os.listdir(seisbench.cache_root / "datasets" / "dummydataset"))
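The lookup rule described above (environment variable first, `~/.seisbench` otherwise) can be sketched in a few lines. `resolve_cache_root` is a hypothetical helper written for this illustration and is not part of SeisBench; the library resolves its cache root itself, and the path `/tmp/sb_cache` is just an example value.

```python
import os
from pathlib import Path


def resolve_cache_root(env=None):
    """Return the override from SEISBENCH_CACHE_ROOT if set, else ~/.seisbench.

    Hypothetical helper for illustration only, not SeisBench's own code.
    """
    env = os.environ if env is None else env
    override = env.get("SEISBENCH_CACHE_ROOT")
    if override:
        return Path(override)
    return Path(os.path.expanduser("~")) / ".seisbench"


# No override: fall back to the default location in the home directory.
default_root = resolve_cache_root(env={})

# With an override: datasets would live below the custom directory.
custom_root = resolve_cache_root(env={"SEISBENCH_CACHE_ROOT": "/tmp/sb_cache"})

print(default_root)
print(custom_root / "datasets" / "dummydataset")
```

If you want to relocate the cache, the safest assumption is that the variable needs to be set before `seisbench` is imported.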

The second way of loading a dataset is to load it from disk by providing the path to the directory that contains the `metadata.csv` and `waveforms.hdf5` files. We'll demonstrate this using the DummyDataset, even though we always recommend loading benchmark datasets through their classes.

In [None]:
dummy_from_disk = sbd.WaveformDataset(seisbench.cache_root / "datasets" / "dummydataset")
print(dummy_from_disk)

### What does a dataset contain?

Each dataset consists of waveforms and the associated metadata. Let's first inspect the metadata. It is represented by a pandas DataFrame that lists, for each trace, attributes describing the source, the trace, the station and possibly the path. When loading a dataset, only the metadata is loaded into memory. The waveforms are loaded on demand. For details, see the section on "Configuring a dataset".

In [None]:
data.metadata

Now let's say we want to obtain the waveforms associated with trace 3. This can be done using the `get_waveforms` method.

In [None]:
waveforms = data.get_waveforms(3)
print("waveforms.shape:", waveforms.shape)

import matplotlib.pyplot as plt
plt.plot(waveforms.T);

You can also request waveforms for multiple traces at once.

In [None]:
waveforms = data.get_waveforms([3, 20, 45, 70])
print("waveforms.shape:", waveforms.shape)

Benchmark datasets contain several special attributes that simple waveform datasets do not possess. Here are two examples:

In [None]:
print('Citation:', data.citation)
print('License:', data.license)

### Filtering a dataset

Often, you don't want to use a full dataset, but only parts of it. For this, datasets offer the `filter` method. By default, `filter` modifies the dataset in place, but it can also be used to return the desired subset instead.

In [None]:
mask = data.metadata["source_magnitude"] > 2.5  # Only select events with magnitude above 2.5
data.filter(mask)

print(data)
data.metadata
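Since the metadata is a pandas DataFrame, the mask passed to `filter` is an ordinary boolean Series with one entry per trace. The sketch below shows how such a mask behaves on a toy table; the rows and values are made up, and only the column name `source_magnitude` is taken from the example above.

```python
import pandas as pd

# Toy metadata table with made-up values (not taken from DummyDataset).
metadata = pd.DataFrame(
    {
        "trace_name": ["a", "b", "c", "d"],
        "source_magnitude": [1.8, 2.7, 3.1, 2.5],
    }
)

# One boolean per row. The comparison is strict, so magnitude 2.5 is excluded.
mask = metadata["source_magnitude"] > 2.5

# Keep only the rows where the mask is True.
subset = metadata[mask].reset_index(drop=True)

print(mask.tolist())
print(subset)
```

Masks can be combined with `&` and `|` (with parentheses around each condition) to express more selective filters.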

A special case of filtering is to access the training, development and test splits of a dataset. Most datasets in SeisBench define those splits.

In [None]:
data = sbd.DummyDataset()  # Reload to ensure we have the full dataset again

train = data.train()
dev = data.dev()
test = data.test()

print("Train:", train)
print("Dev:", dev)
print("Test:", test)

You can also use a shorthand notation to split the dataset into its parts:

In [None]:
train, dev, test = data.train_dev_test()

print("Train:", train)
print("Dev:", dev)
print("Test:", test)
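Conceptually, a split is just another filter on the metadata. The sketch below assumes that the split assignment is stored in a metadata column named `split` with the values `train`, `dev` and `test`; treat that column name, the helper `select_split` and the toy values as assumptions for illustration and inspect `data.metadata` to see what your dataset actually provides.

```python
import pandas as pd

# Toy metadata; the "split" column name and all values are assumptions.
metadata = pd.DataFrame(
    {
        "source_magnitude": [1.8, 2.7, 3.1, 2.5, 2.9],
        "split": ["train", "train", "dev", "test", "train"],
    }
)


def select_split(frame, name):
    """Return the rows assigned to the given split (hypothetical helper)."""
    return frame[frame["split"] == name]


train, dev, test = (select_split(metadata, name) for name in ("train", "dev", "test"))
print(len(train), len(dev), len(test))
```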

### Configuring a dataset

Datasets offer a range of configuration options. Here, we are going to explore four of them:

- component order
- dimension order
- sampling rate
- waveform caching

Standard seismometers record three components, commonly vertical (Z), north-south (N) and east-west (E). Depending on your application, you'll need to arrange the components differently. SeisBench can do this automatically. Here is an example:

In [None]:
data = sbd.DummyDataset(component_order="ZNE")
zne_array = data.get_waveforms(0)

data = sbd.DummyDataset(component_order="NEZ")
nez_array = data.get_waveforms(0)

print('ZNE:\n', zne_array[:, :5])
print('NEZ:\n', nez_array[:, :5])
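Reordering components amounts to permuting the rows of the waveform array. The following NumPy sketch illustrates the idea on a toy array; `reorder_components` is a hypothetical helper and not the routine SeisBench uses internally.

```python
import numpy as np


def reorder_components(waveform, source_order, target_order):
    """Permute the rows of a (components, samples) array by component letter.

    Hypothetical helper for illustration only.
    """
    indices = [source_order.index(component) for component in target_order]
    return waveform[indices]


# Toy trace stored in ZNE order; each row holds a distinct value (Z=0, N=1, E=2).
zne = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2]])

# Requesting NEZ moves the Z row from the first to the last position.
nez = reorder_components(zne, "ZNE", "NEZ")
print(nez)
```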

Sometimes, not all components are available. You can use the `missing_components` parameter to define how to handle this case. Check the documentation for details.

Similar to the component order, the dimension order specifies how to order the dimensions of your data, i.e., the traces (N), the channels (C) and the samples (W).

In [None]:
data = sbd.DummyDataset(dimension_order="NCW")
waveforms = data.get_waveforms([3, 20, 45, 70])
print("NCW - waveforms.shape:", waveforms.shape)

data = sbd.DummyDataset(dimension_order="NWC")
waveforms = data.get_waveforms([3, 20, 45, 70])
print("NWC - waveforms.shape:", waveforms.shape)
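Changing the dimension order is equivalent to transposing the axes of the array. The sketch below uses `np.transpose` on an array with made-up sizes; `reorder_dimensions` is a hypothetical helper for illustration.

```python
import numpy as np


def reorder_dimensions(array, source_order, target_order):
    """Transpose the axes of an array from one letter order to another."""
    axes = [source_order.index(dimension) for dimension in target_order]
    return np.transpose(array, axes)


# Made-up sizes: 4 traces (N), 3 channels (C), 1000 samples (W).
batch_ncw = np.zeros((4, 3, 1000))
batch_nwc = reorder_dimensions(batch_ncw, "NCW", "NWC")

print(batch_ncw.shape)
print(batch_nwc.shape)
```

Which layout you need depends on the downstream library, so it is worth checking what input shape your model expects.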

Often, applications require waveforms at a specific sampling rate. By default, SeisBench returns data at the sampling rate stored in the dataset. However, you can configure a dataset to always return a specific sampling rate by setting it in the constructor. SeisBench will then automatically resample the traces.

In [None]:
data = sbd.DummyDataset(sampling_rate=100)
waveforms = data.get_waveforms(3)
print("100 Hz - waveforms.shape:", waveforms.shape)

data = sbd.DummyDataset(sampling_rate=200)
waveforms = data.get_waveforms(3)
print("200 Hz - waveforms.shape:", waveforms.shape)

Alternatively, you can specify the sampling rate in a call to `get_waveforms`.

In [None]:
data = sbd.DummyDataset()

waveforms = data.get_waveforms(3, sampling_rate=100)
print("100 Hz - waveforms.shape:", waveforms.shape)
waveforms = data.get_waveforms(3, sampling_rate=200)
print("200 Hz - waveforms.shape:", waveforms.shape)
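Resampling changes the number of samples in proportion to the ratio of the sampling rates, while the duration of the trace stays the same. The sketch below demonstrates this with plain linear interpolation on a synthetic sine wave. It is a stand-in chosen for brevity and not the method SeisBench applies; in particular, it performs no anti-alias filtering, so it is not suitable for downsampling real data.

```python
import numpy as np


def resample_linear(trace, source_rate, target_rate):
    """Resample a 1D trace by linear interpolation (illustration only)."""
    duration = len(trace) / source_rate  # seconds covered by the trace
    n_target = int(round(duration * target_rate))
    source_times = np.arange(len(trace)) / source_rate
    target_times = np.arange(n_target) / target_rate
    return np.interp(target_times, source_times, trace)


# Synthetic input: 10 s of a 1 Hz sine sampled at 100 Hz.
trace = np.sin(2 * np.pi * 1.0 * np.arange(1000) / 100.0)

upsampled = resample_linear(trace, 100, 200)
downsampled = resample_linear(trace, 100, 50)
print(len(trace), len(upsampled), len(downsampled))
```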

The last configuration option discussed in this tutorial is waveform caching. As mentioned earlier, loading a dataset only loads the metadata into memory; the waveforms are read on demand. Depending on your use case, this might not be optimal. For example, when training a deep learning model, it's usually best to load all waveforms into memory first, instead of reloading them from disk every epoch. Therefore, SeisBench lets you cache waveforms in memory and preload them. Here is an example:

In [None]:
data = sbd.DummyDataset(cache='trace')
data.preload_waveforms(pbar=True)

You can use either a `trace` cache or a `full` cache. Check the documentation for details on these strategies. As a rule of thumb, `trace` should be used if you only need a small fraction of the dataset, while `full` is better suited when using most of the dataset or a full train/dev/test split. Note that `full` might cache traces that you did not actually filter for. On the other hand, `full` will have better read performance than `trace` when using many traces.

In general, when preloading and filtering a dataset, you should always first filter it and then preload to avoid loading unnecessary traces.
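To see why caching pays off over several training epochs, consider the toy model below. `LazyTraceStore` is a hypothetical stand-in written for this illustration: it counts simulated disk reads and has nothing to do with SeisBench's actual cache implementation.

```python
class LazyTraceStore:
    """Toy stand-in for on-demand loading with an optional in-memory cache."""

    def __init__(self, use_cache):
        self.use_cache = use_cache
        self._cache = {}
        self.disk_reads = 0

    def _read_from_disk(self, index):
        self.disk_reads += 1  # stands in for a slow read from the waveform file
        return [float(index)] * 3

    def get(self, index):
        # Serve from memory if the trace was read before and caching is on.
        if self.use_cache and index in self._cache:
            return self._cache[index]
        trace = self._read_from_disk(index)
        if self.use_cache:
            self._cache[index] = trace
        return trace


def run_epochs(store, indices, n_epochs):
    """Access every selected trace once per epoch and report total disk reads."""
    for _ in range(n_epochs):
        for index in indices:
            store.get(index)
    return store.disk_reads


wanted = [0, 2, 4]  # traces that remain after filtering
uncached_reads = run_epochs(LazyTraceStore(use_cache=False), wanted, n_epochs=5)
cached_reads = run_epochs(LazyTraceStore(use_cache=True), wanted, n_epochs=5)
print(uncached_reads, cached_reads)
```

Without a cache, every access hits the disk; with a cache, each selected trace is read only once. Because only the filtered indices are ever requested, nothing outside the subset ends up in memory, which is the same reasoning behind filtering before preloading.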

### Visualizing a dataset

If you have the package `cartopy` installed, you can visualize your dataset using the method `plot_map`.

In [None]:
data.plot_map()