The full functionality of `seaflowpy` is available through its submodules, e.g. `seaflowpy.db` and `seaflowpy.evt`. For convenience, a few of the most commonly used functions and classes are also exposed at the package level.

* **seaflowpy.EVT**  
Class for EVT particle data

* **seaflowpy.find_evt_files**  
Function to recursively find EVT/OPP file paths within a directory

The code below uses the test dataset in this repository at `./tests/testcruise/`. The files in this directory hold raw EVT data, but in this example workflow we'll treat them as though they contain filtered OPP data. For example, we may have already filtered EVT to OPP data on the command line with `filterevt`.

A note on terminology: in this package the term EVT can have two subtly different meanings.  

* Any binary file or Python data structure which holds SeaFlow particle data, regardless of whether raw or filtered.
* In the context of a filtering workflow, EVT refers to the raw, unfocused/unfiltered version of particle data, distinct from OPP data which refers to the filtered/focused particles.  

When filtering, we talk about converting EVT data to OPP data. We may read a raw EVT file into Python as a `seaflowpy.EVT` object, in which case the raw particle data is stored as a pandas DataFrame in the `EVT.evt` attribute. We then filter the raw particle data with `seaflowpy.EVT.filter`, and the filtered particle data becomes accessible as a pandas DataFrame in the `EVT.opp` attribute. This is essentially what `filterevt` does.

But when we read filtered OPP files from disk, `seaflowpy.EVT` treats them in the same way it would treat reading raw EVT files. Particle data is stored in the new `EVT` object as a pandas DataFrame in the `EVT.evt` attribute (even though we know that this is the OPP data).

In [1]:
import os
import pandas as pd
import seaflowpy as sfp

In [2]:
opp_files = sfp.find_evt_files("./tests/testcruise")
opp_files

['./tests/testcruise/2014_185/2014-07-04T00-00-02+00-00',
 './tests/testcruise/2014_185/2014-07-04T00-03-02+00-00.gz',
 './tests/testcruise/2014_185/2014-07-04T00-06-02+00-00',
 './tests/testcruise/2014_185/2014-07-04T00-09-02+00-00',
 './tests/testcruise/2014_185/2014-07-04T00-12-02+00-00']
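To make the recursive search concrete, here is a minimal standard-library sketch of the same idea. It is not seaflowpy's implementation: the helper name `find_timestamped_files` and the filename pattern are assumptions inferred from the listing above (a timestamp with dashes in place of colons, optionally followed by `.gz`).

```python
import os
import re
import tempfile

# Assumed naming pattern, inferred from the file listing above:
# a timestamp such as 2014-07-04T00-00-02+00-00, optionally gzipped.
EVT_NAME = re.compile(
    r"^\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2}[+-]\d{2}-\d{2}(\.gz)?$"
)


def find_timestamped_files(root):
    """Hypothetical helper: recursively collect files whose names match EVT_NAME."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if EVT_NAME.match(name):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


# Build a throwaway directory tree to demonstrate the search.
demo_root = tempfile.mkdtemp()
day_dir = os.path.join(demo_root, "2014_185")
os.makedirs(day_dir)
for name in ["2014-07-04T00-00-02+00-00",
             "2014-07-04T00-03-02+00-00.gz",
             "notes.txt"]:
    open(os.path.join(day_dir, name), "wb").close()

demo_files = find_timestamped_files(demo_root)
print([os.path.basename(p) for p in demo_files])
```

Files that don't match the pattern, such as `notes.txt` here, are skipped, while plain and gzipped files are both returned.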

Let's read the EVT files into memory. In many cases we don't plan on using all 10 channels of particle data, so here we'll select only the channels (columns) we care about. This can significantly speed up data import when transforming (exponentiating log data) and lowers the memory footprint.
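As a rough illustration of what "exponentiating log data" means, the sketch below converts a log-scaled raw value to linear scale. The constants are assumptions that this document does not state (16-bit raw values covering 3.5 decades), and `exponentiate` is a hypothetical helper, not a seaflowpy function.

```python
# Assumed constants, not taken from this document: raw values are 16-bit
# integers on a log scale covering 3.5 decades.
RAW_MAX = 2 ** 16
DECADES = 3.5


def exponentiate(raw_value):
    """Hypothetical helper: map a log-scaled raw value to linear scale."""
    return 10 ** ((raw_value / float(RAW_MAX)) * DECADES)


linear_low = exponentiate(0)         # smallest raw value
linear_high = exponentiate(RAW_MAX)  # upper bound of the assumed range
print(linear_low, linear_high)
```

Under these assumptions the smallest raw value maps to 1.0. Because the conversion is applied per value, reading fewer columns means less work as well as less memory.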

Note that some of the files are unreadable. This is normal and even expected on a real cruise, so we'll catch and print any errors with `try`/`except`.

In [3]:
# The possible column names to choose from
sfp.EVT.all_columns

['time',
 'pulse_width',
 'D1',
 'D2',
 'fsc_small',
 'fsc_perp',
 'fsc_big',
 'pe',
 'chl_small',
 'chl_big']

In [4]:
opps = []
for f in opp_files:
    try:
        opps.append(sfp.EVT(f, transform=True, columns=["fsc_small", "chl_small", "pe"]))
    except sfp.errors.EVTFileError as e:
        print("{} {}".format(f, e))

./tests/testcruise/2014_185/2014-07-04T00-06-02+00-00 File is empty
./tests/testcruise/2014_185/2014-07-04T00-09-02+00-00 File has incorrect number of data bytes. Expected 960000, saw 236
./tests/testcruise/2014_185/2014-07-04T00-12-02+00-00 File has invalid particle count header
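The three messages above hint at the kinds of size checks a reader can run before parsing. The sketch below reproduces them on in-memory byte strings. It is not seaflowpy's code: `FileCheckError` and `check_particle_bytes` are hypothetical names, the 4-byte little-endian count header is an assumption, and the 24 bytes per particle is inferred from the message above (960000 bytes expected for a 40000-particle file).

```python
import struct

BYTES_PER_PARTICLE = 24  # assumption: 960000 bytes / 40000 particles
HEADER_BYTES = 4         # assumption: 32-bit little-endian particle count


class FileCheckError(Exception):
    """Hypothetical stand-in for sfp.errors.EVTFileError."""


def check_particle_bytes(buf):
    """Return the particle count if buf passes the size checks, else raise."""
    if len(buf) == 0:
        raise FileCheckError("File is empty")
    if len(buf) < HEADER_BYTES:
        raise FileCheckError("File has invalid particle count header")
    (count,) = struct.unpack("<I", buf[:HEADER_BYTES])
    if count == 0:
        raise FileCheckError("File has invalid particle count header")
    expected = count * BYTES_PER_PARTICLE
    seen = len(buf) - HEADER_BYTES
    if seen != expected:
        raise FileCheckError(
            "File has incorrect number of data bytes. "
            "Expected {}, saw {}".format(expected, seen)
        )
    return count


# Made-up byte strings that trigger each branch.
samples = {
    "good": struct.pack("<I", 2) + b"\x00" * 48,
    "empty": b"",
    "short": struct.pack("<I", 40000) + b"\x00" * 236,
    "bad_header": b"\x00\x00",
}
results = {}
for name, buf in sorted(samples.items()):
    try:
        results[name] = check_particle_bytes(buf)
    except FileCheckError as e:
        results[name] = str(e)
        print("{} {}".format(name, e))
```

Collecting the failures instead of stopping at the first one mirrors the loop above: readable files are kept and unreadable ones are reported.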


Now we have some EVT objects in `opps`. Let's take a look at the pandas DataFrame of particle data.

In [5]:
opps[0].evt.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40000 entries, 0 to 39999
Data columns (total 3 columns):
fsc_small    40000 non-null float64
pe           40000 non-null float64
chl_small    40000 non-null float64
dtypes: float64(3)
memory usage: 1.2 MB


In [13]:
opps[0].evt.head()

|   | fsc_small | pe | chl_small |
|---|-----------|----|-----------|
| 0 | 1 | 1.192991 | 1.530514 |
| 1 | 1 | 1.0 | 2.055943 |
| 2 | 1 | 1.0 | 2.128752 |
| 3 | 1 | 1.625772 | 2.488267 |
| 4 | 1 | 1.342477 | 2.178271 |


We can use the DataFrames for each file separately, or we can combine the DataFrames into one object.

In [14]:
opp_df = pd.concat([o.evt for o in opps])
opp_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 80000 entries, 0 to 39999
Data columns (total 3 columns):
fsc_small    80000 non-null float64
pe           80000 non-null float64
chl_small    80000 non-null float64
dtypes: float64(3)
memory usage: 2.4 MB
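Note that the combined index still runs from 0 to 39999 even though there are 80000 rows, because `pd.concat` keeps each input's own index by default. The sketch below uses two small made-up DataFrames (not the cruise data) to show two standard pandas options: renumbering rows with `ignore_index=True`, or labeling each row's origin with `keys`. The labels `file_a` and `file_b` are placeholders.

```python
import pandas as pd

# Tiny made-up frames standing in for two files' particle data.
frame_a = pd.DataFrame({"fsc_small": [1.0, 1.0], "pe": [1.19, 1.0]})
frame_b = pd.DataFrame({"fsc_small": [2.5, 1.0], "pe": [1.0, 1.34]})

# Default behavior: each frame's original index is preserved, so labels repeat.
stacked = pd.concat([frame_a, frame_b])

# Option 1: discard the old labels and renumber rows from zero.
renumbered = pd.concat([frame_a, frame_b], ignore_index=True)

# Option 2: add an outer index level recording which input each row came from.
labeled = pd.concat(
    [frame_a, frame_b], keys=["file_a", "file_b"], names=["file", "row"]
)

print(list(stacked.index))
print(list(renumbered.index))
print(labeled.loc["file_b"])
```

The `keys` form is convenient when rows later need to be traced back to their source file, for example by passing the file paths as keys.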
