The full functionality of `seaflowpy` is available through its submodules, e.g. `seaflowpy.db` and `seaflowpy.evt`. For convenience, a few of the most commonly used functions and classes are also exposed at the package level.

* **seaflowpy.EVT**  
Class for EVT particle data

* **seaflowpy.find_evt_files**  
Function to recursively find EVT/OPP file paths within a directory

The code below will use the test dataset in this repository at `./tests/testcruise/`. The files in this directory hold raw EVT data, but in this example workflow we'll treat the files as though they actually contain filtered OPP data. For example, we may have already filtered EVT to OPP data on the command-line with `seaflowpy_filter`.

A note on terminology: in this package the term EVT can have two subtly different meanings.

* Any binary file or Python data structure which holds SeaFlow particle data, regardless of whether raw or filtered.
* In the context of a filtering workflow, EVT refers to the raw, unfiltered/unfocused version of particle data, distinct from OPP data which refers to the filtered/focused particles.  

When filtering, we talk about converting EVT data to OPP data. We may read a raw EVT file into Python as a `seaflowpy.EVT` object and the raw particle data is stored as a pandas DataFrame in the `EVT.df` attribute. We then filter the raw particle data with `seaflowpy.EVT.filter` and this filtered particle data is returned as a new EVT object representing OPP data. `seaflowpy_filter` is a convenient command-line wrapper for this process that can operate on whole cruises at a time.

In [1]:
import seaflowpy as sfp

In [2]:
opp_files = sfp.find_evt_files("./tests/testcruise_opp")
opp_files

['./tests/testcruise_opp/2014_185/2014-07-04T00-00-02+00-00.opp.gz',
 './tests/testcruise_opp/2014_185/2014-07-04T00-03-02+00-00.opp.gz']
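For intuition, a recursive search of this kind can be approximated with the standard library. The sketch below is not `seaflowpy`'s implementation: the function name `find_particle_files` and the suffix list are assumptions made for illustration, and the suffix rule only covers names like the `.opp.gz` files shown above.

```python
import os

# Hypothetical suffixes; the matching rule used by seaflowpy may differ.
SUFFIXES = (".evt", ".evt.gz", ".opp", ".opp.gz")


def find_particle_files(root):
    """Return sorted paths under root whose file names end in one of SUFFIXES."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(SUFFIXES):
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Sorting the result keeps files in a stable order, which is convenient because the timestamped file names sort chronologically within a day directory.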

Let's read the files into memory as `EVT` objects. We often don't need all ten channels of particle data, so here we'll select only three of the ten channels (columns). Reading fewer columns can significantly speed up data import when transforming (exponentiating the log-scaled data), and it lowers the memory footprint.

In [3]:
# The possible column names to choose from
sfp.EVT.all_columns

['time',
 'pulse_width',
 'D1',
 'D2',
 'fsc_small',
 'fsc_perp',
 'fsc_big',
 'pe',
 'chl_small',
 'chl_big']

In [4]:
opps = []
for f in opp_files:
    opps.append(sfp.EVT(f, transform=True, columns=["fsc_small", "chl_small", "pe"]))
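To see why selecting columns helps, here is a small sketch using synthetic data and plain pandas. The log-to-linear formula (16-bit values spread over 3.5 decades) is an assumption about what the transform does, not something stated in this document, and `exponentiate` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

columns = ["time", "pulse_width", "D1", "D2", "fsc_small",
           "fsc_perp", "fsc_big", "pe", "chl_small", "chl_big"]

# Synthetic stand-in for raw 16-bit particle data: 1000 particles, 10 channels.
rng = np.random.default_rng(0)
raw = pd.DataFrame(rng.integers(0, 2**16, size=(1000, len(columns))),
                   columns=columns)


def exponentiate(df, decades=3.5, bits=16):
    """Assumed log-to-linear conversion: 10 ** (value / 2**bits * decades)."""
    return np.power(10.0, df / 2**bits * decades)


# Transform only the three channels of interest, then all ten for comparison.
subset = exponentiate(raw[["fsc_small", "chl_small", "pe"]])
full = exponentiate(raw)

print(subset.memory_usage(index=False).sum())  # bytes held by three columns
print(full.memory_usage(index=False).sum())    # bytes held by ten columns
```

Both the exponentiation work and the memory held scale with the number of columns kept, so dropping unused channels before transforming saves both.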

Now we have some `EVT` objects in `opps`. We can print one of them to get a quick summary of its contents.

In [5]:
print(opps[0])

{
  "path": "./tests/testcruise_opp/2014_185/2014-07-04T00-00-02+00-00.opp.gz", 
  "file_id": "2014_185/2014-07-04T00-00-02+00-00", 
  "header_count": 386, 
  "event_count": 386, 
  "particle_count": 386, 
  "columns": [
    "fsc_small", 
    "pe", 
    "chl_small"
  ], 
  "filter_options": {
    "origin": null, 
    "width": null, 
    "notch1": null, 
    "notch2": null, 
    "offset": null
  }
}


The underlying particle data can be accessed as a pandas DataFrame in the `df` attribute.

In [6]:
opps[0].df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 386 entries, 0 to 385
Data columns (total 3 columns):
fsc_small    386 non-null float64
pe           386 non-null float64
chl_small    386 non-null float64
dtypes: float64(3)
memory usage: 9.1 KB


In [7]:
opps[0].df.head()

| | fsc_small | pe | chl_small |
|---|---|---|---|
| 0 | 5.910258 | 1.207901 | 7.622597 |
| 1 | 1.569201 | 1.399968 | 3.347019 |
| 2 | 73.487071 | 1.660321 | 74.359745 |
| 3 | 2.330406 | 1.275528 | 5.276099 |
| 4 | 2.386083 | 1.172052 | 7.77406 |


Let's assume this data set has already been analyzed and population classifications exist in a directory called `./tests/testcruise_vct`. We can add these per-particle classifications to our `EVT` objects with `seaflowpy.EVT.add_vct`.

`add_vct` can take a directory as its only argument. To find the appropriate VCT file, the structure of this directory should mirror that of the OPP and raw EVT directories (julian day/file).

`add_vct` can also take a path to a VCT file if the file has an unconventional name or location.
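The mirrored layout means a VCT path can be derived from a file's `file_id` (the "julian day/file" string shown in the summary above). The sketch below only illustrates that idea; `vct_path_for` is a hypothetical helper, and the `.vct` suffix is a placeholder because this document does not state how VCT files are named.

```python
import posixpath


def vct_path_for(file_id, vct_dir, suffix=".vct"):
    """Join a "julian_day/file" style file_id onto vct_dir and append a suffix.

    The default suffix is a placeholder assumption, not a documented convention.
    """
    return posixpath.join(vct_dir, file_id + suffix)


print(vct_path_for("2014_185/2014-07-04T00-00-02+00-00",
                   "./tests/testcruise_vct"))
```

When a VCT file does not follow the mirrored layout, passing its path directly to `add_vct` avoids the lookup.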

In [11]:
for opp in opps:
    opp.add_vct("./tests/testcruise_vct")
opps[0].df.head()  # a new "pop" column has been added

| | fsc_small | pe | chl_small | pop |
|---|---|---|---|---|
| 0 | 5.910258 | 1.207901 | 7.622597 | prochloro |
| 1 | 1.569201 | 1.399968 | 3.347019 | prochloro |
| 2 | 73.487071 | 1.660321 | 74.359745 | picoeuk |
| 3 | 2.330406 | 1.275528 | 5.276099 | prochloro |
| 4 | 2.386083 | 1.172052 | 7.77406 | prochloro |


Now we can operate on the pandas particle DataFrame directly to get per-population statistics ...

In [12]:
bypop = opps[0].df.groupby("pop")
bypop.describe()

| pop | | chl_small | fsc_small | pe |
|---|---|---|---|---|
| beads | count | 62.0 | 62.0 | 62.0 |
| beads | mean | 61.331928 | 65.510412 | 703.436398 |
| beads | std | 18.317616 | 16.732172 | 186.882179 |
| beads | min | 25.481625 | 28.523328 | 340.508047 |
| beads | 25% | 51.075239 | 58.808637 | 620.671852 |
| beads | 50% | 60.147136 | 62.292189 | 685.371556 |
| beads | 75% | 66.955352 | 66.079745 | 738.692231 |
| beads | max | 135.056675 | 131.306472 | 1318.311393 |
| picoeuk | count | 20.0 | 20.0 | 20.0 |
| picoeuk | mean | 290.085973 | 163.255617 | 1.836653 |


In [13]:
bypop.mean()

| pop | fsc_small | pe | chl_small |
|---|---|---|---|
| beads | 65.510412 | 703.436398 | 61.331928 |
| picoeuk | 163.255617 | 1.836653 | 290.085973 |
| prochloro | 3.581225 | 1.435859 | 4.240353 |
| synecho | 17.089432 | 9.630957 | 10.070132 |
| unknown | 356.833772 | 4.404959 | 5.433694 |


In [14]:
bypop.size()

pop
beads         62
picoeuk       20
prochloro    221
synecho       78
unknown        5
dtype: int64
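The same pandas approach extends to several files at once. The sketch below uses two small synthetic frames as stand-ins for the `df` attributes of classified files; the keys `file_a` and `file_b` are made up. With real data one might build the dictionary from the objects in `opps`, assuming each exposes the `file_id` shown in its summary as an attribute.

```python
import pandas as pd

# Synthetic stand-ins for the `df` attribute of two classified files.
frames = {
    "file_a": pd.DataFrame({"fsc_small": [1.0, 3.0, 10.0],
                            "pop": ["prochloro", "prochloro", "synecho"]}),
    "file_b": pd.DataFrame({"fsc_small": [5.0, 20.0],
                            "pop": ["prochloro", "synecho"]}),
}

# Stack the frames, keeping the originating file as an index level.
combined = pd.concat(frames, names=["file_id", "row"])

# Particle counts per file and population.
counts = combined.groupby(["file_id", "pop"]).size()

# Mean forward scatter per population across all files.
means = combined.groupby("pop")["fsc_small"].mean()

print(counts)
print(means)
```

Keeping the file key as an index level lets the same frame answer both per-file and cruise-wide questions.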

Or we can call the `calc_pop_stats` method of the EVT object to get the same summary statistics that are saved in the SQLite3 database in a full classification workflow. This returns a dictionary keyed by population name, with particle counts and per-channel means for `fsc_small`, `fsc_perp`, `pe`, and `chl_small`. If one of these columns is not present in the EVT object, as is the case for `fsc_perp` here, it is left out of the results.

In [15]:
import pprint
pprint.pprint(opps[0].calc_pop_stats())

{'beads': {'chl_small': 61.331927505955953,
           'count': 62,
           'fsc_small': 65.510412124807516,
           'pe': 703.43639797379763,
           'pop': 'beads'},
 'picoeuk': {'chl_small': 290.08597340883045,
             'count': 20,
             'fsc_small': 163.2556167296695,
             'pe': 1.836653035695474,
             'pop': 'picoeuk'},
 'prochloro': {'chl_small': 4.240352677583366,
               'count': 221,
               'fsc_small': 3.5812246921776758,
               'pe': 1.4358590192656131,
               'pop': 'prochloro'},
 'synecho': {'chl_small': 10.070132102414576,
             'count': 78,
             'fsc_small': 17.089432037286574,
             'pe': 9.6309571133821361,
             'pop': 'synecho'},
 'unknown': {'chl_small': 5.4336937909910636,
             'count': 5,
             'fsc_small': 356.83377202864756,
             'pe': 4.404958684938495,
             'pop': 'unknown'}}
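
The shape of this result can be reproduced with plain pandas. The sketch below is an approximation written for illustration, not the method's actual implementation; `pop_stats_sketch` is a hypothetical name and the input frame is synthetic.

```python
import pandas as pd

# Channels summarized per population; any that are missing are skipped.
CHANNELS = ["fsc_small", "fsc_perp", "pe", "chl_small"]


def pop_stats_sketch(df):
    """Return {pop: {"pop": name, "count": n, <channel>: mean, ...}}."""
    present = [c for c in CHANNELS if c in df.columns]
    stats = {}
    for pop, group in df.groupby("pop"):
        entry = {"pop": pop, "count": len(group)}
        for channel in present:
            entry[channel] = float(group[channel].mean())
        stats[pop] = entry
    return stats


# Synthetic particle data without an fsc_perp column.
df = pd.DataFrame({
    "fsc_small": [2.0, 4.0, 10.0],
    "pe": [1.0, 1.0, 8.0],
    "chl_small": [3.0, 5.0, 9.0],
    "pop": ["prochloro", "prochloro", "synecho"],
})
result = pop_stats_sketch(df)
print(result)
```

As in the output above, `fsc_perp` does not appear in the result because the column was never loaded.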
