# WaveBank
`WaveBank` is an in-process database for accessing seismic time-series data. Any directory structure containing obspy-readable waveforms can be used as the data source. `WaveBank` uses a simple indexing scheme and the [Hierarchical Data Format](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) to keep track of each `Trace` in the directory.

For example, after downloading seismic data with [ObsPy's FDSN mass downloader](https://docs.obspy.org/packages/autogen/obspy.clients.fdsn.mass_downloader.html), you need a way to access the data. You could read in each file manually, but this quickly becomes tedious and clutters your application code with data-access logic. Enter WaveBank.

## Create a WaveBank object
We will use the [Crandall Canyon](https://en.wikipedia.org/wiki/Crandall_Canyon_Mine) dataset from [ObsPlus' datasets](../datasets/datasets.ipynb) to create a new `WaveBank` instance. The cells below ensure the waveforms have been downloaded (to the obsplus installation directory), copy the dataset to a temporary directory, and initialize a `WaveBank` instance.

In [1]:
import obsplus

In [2]:
%%capture
# Make sure the Crandall Canyon dataset is loaded and suppress output.
crandall = obsplus.load_dataset('crandall_test')

In [3]:
# create a copy of the crandall dataset, storing the copied files in a temporary directory
crandall = obsplus.copy_dataset('crandall_test')
# a directory of waveforms now lives here:
waveform_path = crandall.waveform_path
print(f"The waveform path is: {waveform_path}")

The waveform path is: /tmp/tmp1llg3ak4/crandall_test/waveforms


Now we just need to feed the path of the waveform files to the WaveBank constructor.

In [4]:
bank = obsplus.WaveBank(waveform_path)

To ensure the index is up to date, call the `update_index` method on the bank. It iterates over all files with a timestamp later than the last time `update_index` was run.

You only need to run `update_index` if the directory has not been indexed before or you have added new files to it. 

In [5]:
bank.update_index()

WaveBank(base_path=/tmp/tmp1llg3ak4/crandall_test/waveforms)
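
To illustrate what an incremental index update involves, here is a minimal sketch that finds files modified since the last update. It does not reproduce WaveBank's internals: `find_new_files` and `last_indexed` are hypothetical names, and judging "newer" by file modification time is an assumption.

```python
import os
import tempfile
import time


def find_new_files(base_path, last_indexed):
    """Return sorted paths under base_path modified after last_indexed (epoch seconds)."""
    new_files = []
    for dirpath, _dirnames, filenames in os.walk(base_path):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > last_indexed:
                new_files.append(path)
    return sorted(new_files)


# Demonstrate with a throwaway directory containing one old and one new file.
with tempfile.TemporaryDirectory() as base:
    old_file = os.path.join(base, "old.mseed")
    new_file = os.path.join(base, "new.mseed")
    for path in (old_file, new_file):
        with open(path, "wb") as fi:
            fi.write(b"")
    now = time.time()
    os.utime(old_file, (now - 3600, now - 3600))  # pretend it was written an hour ago
    os.utime(new_file, (now, now))
    last_indexed = now - 60  # pretend the index was last updated a minute ago
    found = [os.path.basename(p) for p in find_new_files(base, last_indexed)]

print(found)
```

Only the file written after the hypothetical last update is picked up, which is why repeated calls on an unchanged directory are cheap.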

## Get waveforms

Now we can get waveforms from the directory with the `get_waveforms` method, which has the same signature as the `get_waveforms` methods of the ObsPy clients:

In [6]:
import obspy

t1 = obspy.UTCDateTime('2007-08-06T01:44:48')
t2 = t1 + 60
st = bank.get_waveforms(starttime=t1, endtime=t2)
print(st[:5])  # print first 5 traces

5 Trace(s) in Stream:
TA.O15A..BHE | 2007-08-06T01:44:48.000000Z - 2007-08-06T01:45:48.000000Z | 40.0 Hz, 2401 samples
TA.O15A..BHN | 2007-08-06T01:44:47.999998Z - 2007-08-06T01:45:47.999998Z | 40.0 Hz, 2401 samples
TA.O15A..BHZ | 2007-08-06T01:44:47.999998Z - 2007-08-06T01:45:47.999998Z | 40.0 Hz, 2401 samples
TA.O16A..BHE | 2007-08-06T01:44:48.000000Z - 2007-08-06T01:45:48.000000Z | 40.0 Hz, 2401 samples
TA.O16A..BHN | 2007-08-06T01:44:48.000000Z - 2007-08-06T01:45:48.000000Z | 40.0 Hz, 2401 samples


We can, of course, filter on channels, locations, stations, networks, etc. using Unix-style wildcards (such as `*`, `?`, and `[NE]`) or regular expressions.

In [7]:
st2 = bank.get_waveforms(network='UU', starttime=t1, endtime=t2)

# ensure only UU traces were returned
for tr in st2:
    assert tr.stats.network == 'UU'

print(st2[:5])  # print first 5 traces

5 Trace(s) in Stream:
UU.CTU..HHE | 2007-08-06T01:44:48.004000Z - 2007-08-06T01:45:48.004000Z | 100.0 Hz, 6001 samples
UU.CTU..HHN | 2007-08-06T01:44:48.004000Z - 2007-08-06T01:45:48.004000Z | 100.0 Hz, 6001 samples
UU.CTU..HHZ | 2007-08-06T01:44:48.004000Z - 2007-08-06T01:45:48.004000Z | 100.0 Hz, 6001 samples
UU.MPU..HHE | 2007-08-06T01:44:48.002000Z - 2007-08-06T01:45:48.002000Z | 100.0 Hz, 6001 samples
UU.MPU..HHN | 2007-08-06T01:44:48.002000Z - 2007-08-06T01:45:48.002000Z | 100.0 Hz, 6001 samples


In [8]:
st = bank.get_waveforms(starttime=t1, endtime=t2, station='O1??', channel='BH[NE]')

# test returned traces
for tr in st:
    assert tr.stats.starttime >= t1 - .00001
    assert tr.stats.endtime <= t2 + .00001
    assert tr.stats.station.startswith('O1')
    assert tr.stats.channel.startswith('BH')
    assert tr.stats.channel[-1] in {'N', 'E'}

print(st)

6 Trace(s) in Stream:
TA.O15A..BHE | 2007-08-06T01:44:48.000000Z - 2007-08-06T01:45:48.000000Z | 40.0 Hz, 2401 samples
TA.O15A..BHN | 2007-08-06T01:44:47.999998Z - 2007-08-06T01:45:47.999998Z | 40.0 Hz, 2401 samples
TA.O16A..BHE | 2007-08-06T01:44:48.000000Z - 2007-08-06T01:45:48.000000Z | 40.0 Hz, 2401 samples
TA.O16A..BHN | 2007-08-06T01:44:48.000000Z - 2007-08-06T01:45:48.000000Z | 40.0 Hz, 2401 samples
TA.O18A..BHE | 2007-08-06T01:44:47.999998Z - 2007-08-06T01:45:47.999998Z | 40.0 Hz, 2401 samples
TA.O18A..BHN | 2007-08-06T01:44:47.999998Z - 2007-08-06T01:45:47.999998Z | 40.0 Hz, 2401 samples
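
To see how the wildcard patterns used above select codes, here is a small sketch using Python's standard `fnmatch` module. This is only an illustration of Unix-style matching; WaveBank's own matching may be implemented differently, and the station and channel lists are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical station and channel codes to filter.
stations = ["O15A", "O16A", "O18A", "R16A", "R17A", "CTU", "MPU"]
channels = ["BHE", "BHN", "BHZ", "HHE"]

# '?' matches exactly one character; '[NE]' matches either 'N' or 'E'.
matched_stations = [s for s in stations if fnmatchcase(s, "O1??")]
matched_channels = [c for c in channels if fnmatchcase(c, "BH[NE]")]

print(matched_stations)
print(matched_channels)
```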


WaveBank also shares the `get_waveforms_bulk` method with the FDSN client for efficiently retrieving a large number of streams. 

In [9]:
args = [  # in practice this list may contain hundreds or thousands of requests
    ('TA', 'O15A', '', 'BHZ', t1 - 5, t2 - 5,),
    ('UU', 'SRU', '', 'HHZ', t1, t2,),
]
st = bank.get_waveforms_bulk(args)
print(st)

2 Trace(s) in Stream:
TA.O15A..BHZ | 2007-08-06T01:44:42.999998Z - 2007-08-06T01:45:42.999998Z | 40.0 Hz, 2401 samples
UU.SRU..HHZ  | 2007-08-06T01:44:47.995000Z - 2007-08-06T01:45:47.995000Z | 100.0 Hz, 6001 samples
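
When the request list is large, it is usually built programmatically. The following sketch assembles one request per channel and event. The origin times and the 60-second window are hypothetical, and plain `datetime` objects stand in for the `UTCDateTime` objects you would pass to `get_waveforms_bulk`.

```python
from datetime import datetime, timedelta
from itertools import product

# Hypothetical inputs: channels of interest and event origin times.
seed_ids = [("TA", "O15A", "", "BHZ"), ("UU", "SRU", "", "HHZ")]
origin_times = [
    datetime(2007, 8, 6, 1, 44, 48),
    datetime(2007, 8, 6, 8, 48, 40),
]
window = timedelta(seconds=60)

# One request per (channel, event) pair, using the field order
# (network, station, location, channel, starttime, endtime).
bulk = [
    (net, sta, loc, chan, t0, t0 + window)
    for (net, sta, loc, chan), t0 in product(seed_ids, origin_times)
]

print(len(bulk))
```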


## Yield waveforms
`WaveBank` also provides a generator, `yield_waveforms`, for iterating over large amounts of continuous waveform data. For example, if you wanted to run a power detector on the data, it might make sense to process one hour at a time with a minute of overlap between slices. We first need to create a bank on a dataset that has continuous data; for this we will use the TA dataset.

In [10]:
ds = obsplus.load_dataset('TA_test')
ta_bank = obsplus.WaveBank(ds.waveform_client)

downloading waveform data for ta_test dataset ...
finished downloading waveform data for ta_test
downloading station data for ta_test dataset ...
finished downloading station data for ta_test
downloading event data for ta_test dataset ...
finished downloading event data for ta_test


In [11]:
# iterate over one day of TA data in one-hour slices
ta_t1 = obspy.UTCDateTime('2007-02-15')
ta_t2 = obspy.UTCDateTime('2007-02-16')

for st in ta_bank.yield_waveforms(starttime=ta_t1, endtime=ta_t2, duration=3600, overlap=60):
    print(f'got {len(st)} streams from {st[0].stats.starttime} to {st[0].stats.endtime}')

got 6 streams from 2007-02-15T00:00:09.999998Z to 2007-02-15T01:00:59.999998Z
got 6 streams from 2007-02-15T00:59:59.999998Z to 2007-02-15T02:00:59.999998Z
got 6 streams from 2007-02-15T01:59:59.999998Z to 2007-02-15T03:00:59.999998Z
got 6 streams from 2007-02-15T02:59:59.999998Z to 2007-02-15T04:00:59.999998Z
got 6 streams from 2007-02-15T03:59:59.999998Z to 2007-02-15T05:00:59.999998Z
got 6 streams from 2007-02-15T04:59:59.999998Z to 2007-02-15T06:00:59.999998Z
got 6 streams from 2007-02-15T05:59:59.999998Z to 2007-02-15T07:00:59.999998Z
got 6 streams from 2007-02-15T06:59:59.999998Z to 2007-02-15T08:00:59.999998Z
got 6 streams from 2007-02-15T07:59:59.999998Z to 2007-02-15T09:00:59.999998Z
got 6 streams from 2007-02-15T08:59:59.999998Z to 2007-02-15T10:00:59.999998Z
got 6 streams from 2007-02-15T09:59:59.999998Z to 2007-02-15T11:00:59.999998Z
got 6 streams from 2007-02-15T10:59:59.999998Z to 2007-02-15T12:00:59.999998Z
got 6 streams from 2007-02-15T11:59:59.999998Z to 2007-02-15T13:
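
The slicing pattern visible in the output above (each slice starts one `duration` after the previous one and is extended by `overlap` seconds) can be sketched without any waveform data. `iter_windows` is a hypothetical helper, not part of obsplus, and clipping the last window at `endtime` is an assumption.

```python
from datetime import datetime, timedelta


def iter_windows(starttime, endtime, duration, overlap):
    """Yield (start, end) pairs; each window is extended by `overlap` seconds."""
    step = timedelta(seconds=duration)
    extra = timedelta(seconds=overlap)
    start = starttime
    while start < endtime:
        # Clip the window so it never extends past the requested end time.
        yield start, min(start + step + extra, endtime)
        start += step


windows = list(
    iter_windows(
        datetime(2007, 2, 15), datetime(2007, 2, 16), duration=3600, overlap=60
    )
)
print(len(windows))
print(windows[1])
```

Extending each window past the start of the next one means a detection that straddles a slice boundary is still fully contained in at least one slice.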

## Put waveforms
You can also add data to the bank by passing a `Stream` or `Trace` to the `bank.put_waveforms` method. WaveBank, however, does not do any file merging, so you might end up with overlapping data if you are not careful.

In [12]:
# show that no data for RJOB is in the bank
st = bank.get_waveforms(station='RJOB')

assert len(st) == 0

print(st)

0 Trace(s) in Stream:



In [13]:
# add the default stream to the archive (which contains data for RJOB)
bank.put_waveforms(obspy.read())
st_out = bank.get_waveforms(station='RJOB')

# test output
assert len(st_out)
for tr in st_out:
    assert tr.stats.station == 'RJOB'


print(st_out)

3 Trace(s) in Stream:
BW.RJOB..EHE | 2009-08-24T00:20:03.000000Z - 2009-08-24T00:20:32.990000Z | 100.0 Hz, 3000 samples
BW.RJOB..EHN | 2009-08-24T00:20:03.000000Z - 2009-08-24T00:20:32.990000Z | 100.0 Hz, 3000 samples
BW.RJOB..EHZ | 2009-08-24T00:20:03.000000Z - 2009-08-24T00:20:32.990000Z | 100.0 Hz, 3000 samples


## Availability
You can also use WaveBank to get the availability of data, either as a dataframe or as a list of tuples of the form `[(network, station, location, channel, min_starttime, max_endtime)]`, which is the same as the output of the `availability` method of [ObsPy's Earthworm client](https://docs.obspy.org/master/packages/obspy.clients.earthworm.html).

In [14]:
# get a dataframe of availability by seed ids and timestamps
bank.get_availability_df(channel='BHE', station='[OR]*')

| | network | station | location | channel | starttime | endtime |
|---|---|---|---|---|---|---|
| 0 | TA | O15A | | BHE | 2007-08-06 01:44:38.825000 | 2007-08-07 21:43:51.124998 |
| 1 | TA | O16A | | BHE | 2007-08-06 01:44:38.825000 | 2007-08-07 21:43:51.125000 |
| 2 | TA | O18A | | BHE | 2007-08-06 01:44:38.824998 | 2007-08-07 21:43:51.125000 |
| 3 | TA | R16A | | BHE | 2007-08-07 02:04:54.500000 | 2007-08-07 21:43:51.125000 |
| 4 | TA | R17A | | BHE | 2007-08-06 01:44:38.825000 | 2007-08-07 21:43:51.125000 |


In [15]:
bank.availability(channel='BHE', station='[OR]*')

[('TA',
  'O15A',
  '',
  'BHE',
  2007-08-06T01:44:38.825000Z,
  2007-08-07T21:43:51.124998Z),
 ('TA',
  'O16A',
  '',
  'BHE',
  2007-08-06T01:44:38.825000Z,
  2007-08-07T21:43:51.125000Z),
 ('TA',
  'O18A',
  '',
  'BHE',
  2007-08-06T01:44:38.824998Z,
  2007-08-07T21:43:51.125000Z),
 ('TA',
  'R16A',
  '',
  'BHE',
  2007-08-07T02:04:54.500000Z,
  2007-08-07T21:43:51.125000Z),
 ('TA',
  'R17A',
  '',
  'BHE',
  2007-08-06T01:44:38.825000Z,
  2007-08-07T21:43:51.125000Z)]
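
Conceptually, an availability tuple is the earliest start and latest end over all index entries for a channel. The sketch below shows that reduction on hypothetical index rows, with times as plain seconds. It illustrates the idea only and is not how WaveBank computes it.

```python
from collections import defaultdict

# Hypothetical index rows: (network, station, location, channel, start, end).
rows = [
    ("TA", "O15A", "", "BHE", 0.0, 70.0),
    ("TA", "O15A", "", "BHE", 500.0, 570.0),
    ("TA", "R16A", "", "BHE", 300.0, 370.0),
]

# Track [min_starttime, max_endtime] for each channel.
extent = defaultdict(lambda: [float("inf"), float("-inf")])
for net, sta, loc, chan, start, end in rows:
    span = extent[(net, sta, loc, chan)]
    span[0] = min(span[0], start)
    span[1] = max(span[1], end)

availability = [key + tuple(span) for key, span in sorted(extent.items())]
print(availability)
```

Note that the result says nothing about gaps between the first and last sample, which is what the next section addresses.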

## Get gaps and uptime
WaveBank can also return a dataframe of missing data with the `get_gaps_df` method, and a dataframe of reliability statistics with the `get_uptime_df` method. These are useful, for example, when assessing the completeness of an archive of continuous data.

In [16]:
bank.get_gaps_df(channel='BHE', station='O*').head()

| | network | station | location | channel | starttime | endtime | sampling_period | path | gap_duration |
|---|---|---|---|---|---|---|---|---|---|
| 0 | TA | O15A | | BHE | 2007-08-06 01:45:48.799998 | 2007-08-06 08:48:30.024998 | 0 days 00:00:00.025000 | /TA.O15A..BHE__20070806T014438Z__20070806T0145... | 0 days 07:02:41.225000 |
| 1 | TA | O15A | | BHE | 2007-08-06 08:49:39.999998 | 2007-08-06 10:47:15.624998 | 0 days 00:00:00.025000 | /TA.O15A..BHE__20070806T084830Z__20070806T0849... | 0 days 01:57:35.625000 |
| 2 | TA | O15A | | BHE | 2007-08-06 10:48:25.599998 | 2007-08-07 02:04:54.499998 | 0 days 00:00:00.025000 | /TA.O15A..BHE__20070806T104715Z__20070806T1048... | 0 days 15:16:28.900000 |
| 3 | TA | O15A | | BHE | 2007-08-07 02:06:04.474998 | 2007-08-07 02:14:14.100000 | 0 days 00:00:00.025000 | /TA.O15A..BHE__20070807T020454Z__20070807T0206... | 0 days 00:08:09.625002 |
| 4 | TA | O15A | | BHE | 2007-08-07 02:15:24.074998 | 2007-08-07 03:44:08.474998 | 0 days 00:00:00.025000 | /TA.O15A..BHE__20070807T021414Z__20070807T0215... | 0 days 01:28:44.400000 |


In [17]:
ta_bank.get_uptime_df()

| | network | station | location | channel | starttime | endtime | duration | gap_duration | uptime | availability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TA | M11A | | VHE | 2007-02-15 00:00:09.999998 | 2007-02-24 23:59:59.999998 | 9 days 23:59:50 | 0 days | 9 days 23:59:50 | 1.0 |
| 1 | TA | M11A | | VHN | 2007-02-15 00:00:09.999998 | 2007-02-24 23:59:59.999998 | 9 days 23:59:50 | 0 days | 9 days 23:59:50 | 1.0 |
| 2 | TA | M11A | | VHZ | 2007-02-15 00:00:09.999998 | 2007-02-24 23:59:59.999998 | 9 days 23:59:50 | 0 days | 9 days 23:59:50 | 1.0 |
| 3 | TA | M14A | | VHE | 2007-02-15 00:00:00.000003 | 2007-02-25 00:00:00.000003 | 10 days 00:00:00 | 0 days | 10 days 00:00:00 | 1.0 |
| 4 | TA | M14A | | VHN | 2007-02-15 00:00:00.000003 | 2007-02-25 00:00:00.000003 | 10 days 00:00:00 | 0 days | 10 days 00:00:00 | 1.0 |
| 5 | TA | M14A | | VHZ | 2007-02-15 00:00:00.000004 | 2007-02-25 00:00:00.000004 | 10 days 00:00:00 | 0 days | 10 days 00:00:00 | 1.0 |
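
The relationship between the columns above can be sketched for a single channel as follows. `summarize_uptime` is a hypothetical helper, times are plain seconds, and treating `uptime` as `duration - gap_duration` and `availability` as `uptime / duration` is an assumption that is consistent with the table. A real implementation would also need a tolerance of about one sampling period when deciding whether two segments are contiguous.

```python
def summarize_uptime(segments):
    """Return (gaps, availability) for one channel.

    `segments` is a non-empty list of (start, end) tuples in seconds.
    """
    ordered = sorted(segments)
    gaps = []
    covered_until = ordered[0][1]
    for start, end in ordered[1:]:
        if start > covered_until:
            # No data between the end of coverage so far and this segment.
            gaps.append((covered_until, start))
        covered_until = max(covered_until, end)
    duration = covered_until - ordered[0][0]
    gap_duration = sum(stop - start for start, stop in gaps)
    uptime = duration - gap_duration
    return gaps, uptime / duration


# Hypothetical segments with a single 30-second gap.
gaps, availability = summarize_uptime([(0, 70), (100, 170), (170, 200)])
print(gaps, availability)
```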


## Read index
You can also read the index directly, although in most cases this shouldn't be needed.

In [18]:
ta_bank.read_index().head()

| | network | station | location | channel | starttime | endtime | sampling_period | path |
|---|---|---|---|---|---|---|---|---|
| 0 | TA | M11A | | VHN | 2007-02-22 19:59:59.999998 | 2007-02-22 20:59:59.999998 | 0 days 00:00:10 | /TA/M11A/VHN/2007-02-22T20-00-00.mseed |
| 1 | TA | M14A | | VHN | 2007-02-22 20:00:00.000003 | 2007-02-22 21:00:00.000003 | 0 days 00:00:10 | /TA/M11A/VHN/2007-02-22T20-00-00.mseed |
| 2 | TA | M11A | | VHN | 2007-02-22 04:59:59.999998 | 2007-02-22 05:59:59.999998 | 0 days 00:00:10 | /TA/M11A/VHN/2007-02-22T05-00-00.mseed |
| 3 | TA | M14A | | VHN | 2007-02-22 05:00:00.000003 | 2007-02-22 06:00:00.000003 | 0 days 00:00:10 | /TA/M11A/VHN/2007-02-22T05-00-00.mseed |
| 4 | TA | M11A | | VHN | 2007-02-15 09:59:59.999998 | 2007-02-15 10:59:59.999998 | 0 days 00:00:10 | /TA/M11A/VHN/2007-02-15T10-00-00.mseed |


## Similar projects
`WaveBank` is a useful tool, but it may not be a good fit for every application. Check out the following items as well:

ObsPy has a way to visualize the availability of waveform data in a directory using [obspy-scan](https://docs.obspy.org/tutorial/code_snippets/visualize_data_availability_of_local_waveform_archive.html). If you prefer a graphical option to working with `DataFrame`s, this might be for you.

ObsPy also has a [filesystem client](https://docs.obspy.org/master/packages/autogen/obspy.clients.filesystem.sds.Client.html#obspy.clients.filesystem.sds.Client) for working with SeisComP structured archives.

[IRIS](https://www.iris.edu/hq/) released a miniSEED indexing program called [mseedindex](https://github.com/iris-edu/mseedindex), which will have an [ObsPy API](https://github.com/obspy/obspy/pull/2206). We have not yet used mseedindex ourselves, but it certainly looks worth checking out.