# Datasets

ObsPlus includes a few datasets that are used for testing. If you would like to create and distribute your own, a [template with instructions](https://github.com/niosh-mining/opsdata) is provided.


The datasets are "lazy": all but the most essential information is downloaded only when some code first requests the dataset. This keeps the size of ObsPlus small, but it does mean you will need a network connection the first time you use each dataset. Here are a few examples of things you can do with datasets:

## Dataset basics
Loading a dataset only requires knowing its name (and having installed it, more on that later).

In [1]:
import obspy
import obsplus
ds = obsplus.load_dataset('crandall_test')

downloading waveform data for crandall_test dataset ...


[2020-08-01 00:17:55,538] - obspy.clients.fdsn.mass_downloader - INFO: Initializing FDSN client(s) for http://service.iris.edu.
[2020-08-01 00:17:55,540] - obspy.clients.fdsn.mass_downloader - INFO: Successfully initialized 1 client(s): http://service.iris.edu.
[2020-08-01 00:17:55,541] - obspy.clients.fdsn.mass_downloader - INFO: Total acquired or preexisting stations: 0
[2020-08-01 00:17:55,541] - obspy.clients.fdsn.mass_downloader - INFO: Client 'http://service.iris.edu' - Requesting reliable availability.
[2020-08-01 00:17:58,539] - obspy.clients.fdsn.mass_downloader - INFO: Client 'http://service.iris.edu' - Successfully requested availability (3.00 seconds)
[2020-08-01 00:17:58,592] - obspy.clients.fdsn.mass_downloader - INFO: Client 'http://service.iris.edu' - Found 19 stations (57 channels).
[2020-08-01 00:17:58,598] - obspy.clients.fdsn.mass_downloader - INFO: Client 'http://service.iris.edu' - Will attempt to download data from 19 stations.
[2020-08-01 00:17:58,602] - obspy.c

finished downloading waveform data for crandall_test
downloading station data for crandall_test dataset ...


[2020-08-01 00:18:29,514] - obspy.clients.fdsn.mass_downloader - INFO: Initializing FDSN client(s) for http://service.iris.edu.
[2020-08-01 00:18:29,515] - obspy.clients.fdsn.mass_downloader - INFO: Successfully initialized 1 client(s): http://service.iris.edu.
[2020-08-01 00:18:29,516] - obspy.clients.fdsn.mass_downloader - INFO: Total acquired or preexisting stations: 0
[2020-08-01 00:18:29,517] - obspy.clients.fdsn.mass_downloader - INFO: Client 'http://service.iris.edu' - Requesting reliable availability.
[2020-08-01 00:18:31,331] - obspy.clients.fdsn.mass_downloader - INFO: Client 'http://service.iris.edu' - Successfully requested availability (1.81 seconds)
[2020-08-01 00:18:31,388] - obspy.clients.fdsn.mass_downloader - INFO: Client 'http://service.iris.edu' - Found 19 stations (57 channels).
[2020-08-01 00:18:31,394] - obspy.clients.fdsn.mass_downloader - INFO: Client 'http://service.iris.edu' - Will attempt to download data from 19 stations.
[2020-08-01 00:18:31,397] - obspy.c

finished downloading station data for crandall_test
downloading event data for crandall_test dataset ...
finished downloading event data for crandall_test


The best way to access the data in a dataset is by using the desired client:

In [2]:
wave_client = ds.waveform_client
station_client = ds.station_client
event_client = ds.event_client

These clients behave the same as any `Client` in ObsPy:

In [3]:
st = wave_client.get_waveforms()
assert isinstance(st, obspy.Stream)
inv = station_client.get_stations()
assert isinstance(inv, obspy.Inventory)
cat = event_client.get_events()
assert isinstance(cat, obspy.Catalog)

You can also use a `Fetcher` for "dataset aware" querying.

In [4]:
fetcher = ds.get_fetcher()

Each dataset is just a directory of files whose path is stored as the `data_path` attribute:

In [5]:
ds.data_path

PosixPath('/home/runner/opsdata/crandall_test')

And the *included* data files are found in the `source_path`:

In [6]:
ds.source_path

PosixPath('/home/runner/work/obsplus/obsplus/obsplus/datasets/crandall_test')
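
Because both attributes are ordinary `pathlib.Path` objects, the standard library is enough to inspect what a dataset holds. The sketch below counts files by extension under a directory. Note that `count_files_by_suffix` is a hypothetical helper, not an ObsPlus function, and the stand-in directory and file names are invented for the demonstration, since the on-disk layout of a dataset is not described here.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_files_by_suffix(directory):
    """Count the files under `directory` (recursively) by file extension.

    Hypothetical helper for illustration only; not part of ObsPlus.
    """
    return Counter(p.suffix for p in Path(directory).rglob("*") if p.is_file())


# Build a throwaway directory so the example runs without any downloads.
# The file names below are made up and do not reflect a real dataset layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "waveforms").mkdir()
    (root / "waveforms" / "a.mseed").touch()
    (root / "waveforms" / "b.mseed").touch()
    (root / "inventory.xml").touch()
    counts = count_files_by_suffix(root)

print(dict(counts))
```

With a loaded dataset, the same helper could be pointed at `ds.data_path` or `ds.source_path` instead of the temporary directory.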

If you plan to modify any data, you can copy a dataset with the `copy_dataset` function.

In [7]:
from pathlib import Path

obsplus.copy_dataset('crandall_test', '.')
path = Path('.') / 'crandall_test'
assert path.exists() and path.is_dir()

## Data path
By default, all datasets are stored in the user's home directory in a directory called 'opsdata'. Each dataset is contained in a subdirectory with the same name as the dataset. If you would prefer the datasets be stored somewhere else, the location can be controlled with the environment variable `OPSDATA_PATH`.
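
As a minimal sketch of the lookup described above, the helper below resolves the directory where a named dataset would be expected. The function name `resolve_dataset_path` is hypothetical and is not part of ObsPlus; it only mirrors the documented behavior (use `OPSDATA_PATH` if set, otherwise `~/opsdata`) and makes no claim about how ObsPlus implements this internally.

```python
import os
from pathlib import Path


def resolve_dataset_path(name, environ=None):
    """Return the directory where the dataset called `name` is expected.

    Hypothetical helper for illustration only; not an ObsPlus function.
    """
    environ = os.environ if environ is None else environ
    # Use OPSDATA_PATH when it is set and non-empty, else fall back to ~/opsdata.
    base = environ.get("OPSDATA_PATH") or Path.home() / "opsdata"
    return Path(base).expanduser() / name


# Passing a mapping explicitly avoids touching the real environment.
# "/data/seismic" is an example location, not a path ObsPlus defines.
print(resolve_dataset_path("crandall_test", {"OPSDATA_PATH": "/data/seismic"}))
print(resolve_dataset_path("crandall_test", {}))
```

To redirect the real data directory, the variable would typically be set in the shell, or in `os.environ` before ObsPlus is imported. Whether a change made after import is picked up is an assumption worth checking against your installed version.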

## Included Test Datasets

1. TA_test:
    A small dataset with two stations from the TA network (USArray Transportable Array) whose channels have very low sampling rates.

2. Crandall_test:
    Event waveforms for the [Crandall Canyon Mine collapse](https://en.wikipedia.org/wiki/Crandall_Canyon_Mine) and associated aftershocks. The dataset also includes a catalog of the events and a station inventory.
    
3. Bingham_test:
    Event waveforms associated with the [Bingham Canyon Landslide](https://en.wikipedia.org/wiki/Bingham_Canyon_Mine#Landslides), one of the largest anthropogenic landslides ever recorded. Luckily, the situation was well managed and no one was hurt. The dataset also includes a catalog of the events and a station inventory. 
    
Each of these datasets is accessed via the `obsplus.load_dataset` function, which takes the name of the dataset as its only argument and returns a `DataSet` instance. Loading takes a few minutes if the dataset has not yet been downloaded; otherwise it should be very quick.

In [8]:
# cleanup temporary directory
import shutil
from pathlib import Path

path = Path('crandall_test')
if path.exists():
    shutil.rmtree(path)