Patterns in phytoplankton diversity
## Project summary

Microscopic algae (called phytoplankton) form the base of the oceanic food chain and are key players in the biogeochemical cycles of many climatically active elements. Ecological theory predicts that diverse ecosystems are more stable, i.e. more resistant to stressors, than less diverse ecosystems. However, data on the diversity of oceanic phytoplankton communities is very sparse, as it typically depends on very labor-intensive methods (e.g. microscope identification, molecular sequencing). In order to understand how phytoplankton diversity may be affected by climate change, it is essential to have a baseline understanding of current patterns in diversity and how they relate to environmental conditions.
In this study, we will calculate indices of phytoplankton diversity from data collected with SeaFlow, a continuously sampling underway flow cytometer. This will produce diversity estimates at high resolution over large spatial scales and across different seasons. We will adapt Li’s (1997) cytometric diversity to better reflect the taxonomic diversity of phytoplankton observed with SeaFlow, and develop methods for integrating data from different instruments and cruises so that they are comparable. Using data from the Pacific and Atlantic Oceans collected during 18 oceanographic cruises, we will conduct a meta-analysis of the patterns in cytometric diversity and how these relate to other biotic and abiotic variables (e.g. temperature, salinity, density gradients, biomass).
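As a rough illustration of what a cytometric diversity index can look like, the sketch below bins particles on a two-dimensional grid of optical properties and computes Hill numbers from the bin occupancy. It is a minimal sketch under stated assumptions, not the project's implementation: the choice of channels (scatter and fluorescence scaled to 0..1), the 16 x 16 grid, and the function names `bin_index` and `cytometric_diversity` are all hypothetical, and it assumes that N0 denotes the Hill number of order 0 (the number of occupied bins).

```python
import math
from collections import Counter


def bin_index(value, lo, hi, n_bins):
    """Map a value in [lo, hi] to a bin index in 0..n_bins-1 (clamped)."""
    frac = (value - lo) / (hi - lo)
    return min(n_bins - 1, max(0, int(frac * n_bins)))


def cytometric_diversity(particles, lo=0.0, hi=1.0, n_bins=16):
    """Return Hill numbers (N0, N1, N2) for (scatter, fluorescence) pairs.

    Assumption: both channels are already scaled to the range [lo, hi].
    """
    # Count how many particles fall into each occupied grid cell.
    counts = Counter(
        (bin_index(s, lo, hi, n_bins), bin_index(f, lo, hi, n_bins))
        for s, f in particles
    )
    total = sum(counts.values())
    if total == 0:
        return 0, 0.0, 0.0

    props = [c / total for c in counts.values()]
    n0 = len(props)                                   # richness: occupied cells
    n1 = math.exp(-sum(p * math.log(p) for p in props))  # exp(Shannon entropy)
    n2 = 1.0 / sum(p * p for p in props)              # inverse Simpson index
    return n0, n1, n2
```

With particles split evenly between two grid cells, all three indices equal 2; as one cell comes to dominate, N1 and N2 fall toward 1 while N0 stays at 2.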
For the most recent update, please see our blog.
For the final project report, please see below.
For code, please see our GitHub repository.
- SDS data properly binned and uploaded to Myria (done)
- standardize data across cruises (done)
- calculate diversity indices (done)
- produce maps of biodiversity (done)
- explore spatial/temporal variations in patterns (done)
- explore correlations between diversity indices and physical properties (e.g. temperature, salinity, gradients), i.e. produce scatter/heatmap plots of data (done)
- maintain blog describing progress (done)
- develop a tutorial so that others can manipulate SeaFlow data in Myria
- develop methods paper for a general oceanography audience
- iPython notebooks for plots (done)
One of the main aims of this project was to bring together SeaFlow data from 18 different cruises, along with the associated environmental data, and explore macro-ecological patterns in phytoplankton in the North Pacific. In effect, we are working with two associated datasets. The first is derived from SeaFlow data and includes the optical properties of the particles measured and a general classification of each particle (e.g. beads, noise, phytoplankton). The second includes all of the environmental data collected by the ship's underway system (e.g. temperature, salinity). Together they represent more than 500 GB of data stored in a somewhat complex file system that was designed for working with data on a cruise-by-cruise basis, which made working with the entire dataset unwieldy.

The first step in the project was to transfer all of this data to the Myria database system. This step also involved "cleaning" the environmental data, which was not all in a standardized format and needed to be interpolated spatially/temporally to coincide with the SeaFlow data. The cleaning was done using SQLshare, as the files for each cruise were relatively small and easily handled by SQLshare. Joining all of the SeaFlow and environmental data and storing it in tables in Myria was a big step in enabling us to analyze the entire dataset much more efficiently. Data analysis operations that had previously taken several hours to complete in R, running on a compute cluster, can now be run in ~10 minutes as SQL queries using Myria.
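The temporal part of the interpolation described above can be illustrated with a short example. The project did this work in SQLshare; the Python sketch below only shows the general idea of linear interpolation onto SeaFlow timestamps. The function name `interpolate_at`, the use of plain numeric timestamps, and the choice to hold the nearest value outside the sampled range are assumptions made for illustration.

```python
from bisect import bisect_left


def interpolate_at(times, values, t):
    """Linearly interpolate `values` (sampled at sorted `times`) at time t.

    Assumption: outside the sampled range, the nearest end value is held
    rather than extrapolated.
    """
    if not times:
        raise ValueError("no samples to interpolate from")
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]

    # Find the first sample at or after t; the one before it is strictly earlier.
    i = bisect_left(times, t)
    t0, t1 = times[i - 1], times[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Hypothetical underway temperature samples (seconds, degrees C).
env_times = [0, 10, 20]
env_temps = [20.0, 22.0, 21.0]

# Hypothetical SeaFlow measurement times that fall between the samples.
seaflow_times = [5, 15]
matched = [interpolate_at(env_times, env_temps, t) for t in seaflow_times]
```

Each SeaFlow record then carries an environmental value estimated for its own timestamp, which is what allows the two datasets to be joined row by row.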
The data was collected with three different SeaFlow instruments over the course of 5 years in a wide range of oceanic environments, from coastal seas to the Pacific subtropical gyre. To undertake a comparative analysis, it was necessary to devise a way of standardizing all of the SeaFlow data. Small fluorescent beads are used as an internal standard when SeaFlow is running, and we were able to develop an algorithm to normalize the data using the signal from these beads. This normalization step was necessary to produce meaningful estimates of diversity indices from the fluorescence and scatter data.
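The normalization algorithm itself is not described here, so the following is only a sketch of one plausible approach, not the project's method: divide each particle's signal by the median signal of the bead particles measured alongside it. The record layout (`pop`, `fsc`, `chl` keys), the use of the median, and the function name `normalize_by_beads` are all hypothetical.

```python
from statistics import median


def normalize_by_beads(records, channels=("fsc", "chl")):
    """Return non-bead records with each channel divided by the bead median.

    Assumption: `records` is a list of dicts with a "pop" label and one
    numeric value per channel, all taken from the same file or time window.
    """
    beads = [r for r in records if r["pop"] == "beads"]
    if not beads:
        raise ValueError("no bead particles available for normalization")

    # One reference value per channel, taken from the internal standard.
    reference = {ch: median(r[ch] for r in beads) for ch in channels}

    normalized = []
    for r in records:
        if r["pop"] == "beads":
            continue
        scaled = dict(r)
        for ch in channels:
            scaled[ch] = r[ch] / reference[ch]
        normalized.append(scaled)
    return normalized


# Hypothetical records from a single file.
sample = [
    {"pop": "beads", "fsc": 2.0, "chl": 10.0},
    {"pop": "beads", "fsc": 4.0, "chl": 30.0},
    {"pop": "phytoplankton", "fsc": 6.0, "chl": 5.0},
]
result = normalize_by_beads(sample)
```

Expressing every signal relative to the beads puts measurements from different instruments and cruises on a shared scale, which is the property the diversity indices depend on.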
Although we chose to do a lot of the data analysis "heavy lifting" using Myria, we also made heavy use of iPython notebooks for downloading and visualizing the results of Myria queries. A collection of the iPython notebooks used to produce maps and figures from the data is stored in this repository.
Figure 1. Map of cytometric diversity (N0) calculated using Myria and produced using the basemap package in Python.
We have achieved all of our key goals with this project, not least of which was building a framework within which we could quickly run analyses on the entire SeaFlow dataset. We now have a data analysis pipeline using Myria for the "big data" tasks that filter/reduce the data to smaller datasets which can be easily manipulated and visualized using Python. We are working on a methods/outlook paper describing our approach to dealing with oceanographic "big data" and outlining a general recipe for dealing with similar datasets incorporating data from different sources. We also continue to use the data analysis pipeline to work up the results for a domain science paper describing patterns in phytoplankton diversity in the North Pacific.