# PSTAT 100 example project report

This file is part of an adaptation of HW3 into a project-like format, and is paired with the file *codes.ipynb*, which contains the code used for data processing, analysis, tabulation, and visualization to generate the results shown here.

Since you did this analysis already in the homework, you have some idea of what went into generating this report. Notice just how little of the full process appears here -- just four figures and three tables are included. **The focus is less on showing each step and documenting the analysis, and more on telling a story.**

There are four sections: background, data description, methods and results, and discussion. I recommend using a similar format, but you are free to adjust as best suits your project.

Some details to pay attention to:
* this isn't a very long report (about 4 pages);
* it doesn't contain much technical detail;
* the analysis code is kept in the companion notebook rather than reproduced here;
* all figures and tables have captions;
* figures are sized so that all labels are legible;
* sub-headers are used to help guide the reader. 

---
## 0. Background 

Diatoms are a type of phytoplankton -- photosynthetic algae that function as primary producers in aquatic ecosystems. Diatoms are at the bottom of the food web, and as a result, changes in the composition of diatom species in marine ecosystems have ripple effects that can dramatically alter overall community structure. They come in a great diversity of shapes, some of which are shown below (image credit: Scientific American).

<center><img src='figures/diatoms-sciam.jpg' style='width:300px'></center>

Diatoms' glass bodies preserve remarkably well over time, and they are present in high density throughout marine sedimentation layers. This makes it possible to study their relative abundances over long stretches of time; the deeper one looks in sediment, the older the material, and there is often good taxonomic resolution stretching back tens to hundreds of thousands of years.

There was a major climate event toward the end of the Pleistocene epoch (ice age), at which time there was a pronounced warming (Late Glacial Interstadial, 14.7 - 12.9 KyrBP) followed by a return to glacial conditions (Younger Dryas, 12.9 - 11.7 KyrBP). This fluctuation can be seen from temperature reconstructions.

> **Figure 1**: reconstructed sea surface temperature over time between the present and 15,000 years ago. The shaded region indicates the time window with unusually large fluctuations in sea surface temperature.

<center><img src = 'figures/fig1.svg' style = 'width:400px'></center>

> Data from Barron *et al.*, 2003. Northern Coastal California High Resolution Holocene/Late Pleistocene Oceanographic Data. IGBP PAGES/World Data Center for Paleoclimatology. Data Contribution Series # 2003-014. NOAA/NGDC Paleoclimatology Program, Boulder CO, USA.


This project compares relative abundances of diatom taxa recorded from sediment cores taken in the Gulf of California near Hermosillo, Mexico before and after the climate shift. The analysis identifies changes in the distributions of relative abundances for certain taxa, as well as changes in more complex measures of community composition derived from principal components analysis (PCA). Both sets of changes point to a shift in community structure coinciding with the climate event. Furthermore, a clustering of the time points based on relative abundances alone recovers the grouping of data before and after the climate event with good accuracy, suggesting that there is strong signal in the community composition about changing climate conditions.

---

## 1. Data description

The data for this project are diatom counts sampled from evenly-spaced depths in a sediment core from the Gulf of California. These data are publicly available:
> Barron, J.A., *et al.* 2005. High Resolution Guaymas Basin Geochemical, Diatom, and Silicoflagellate Data. IGBP PAGES/World Data Center for Paleoclimatology Data Contribution Series # 2005-022. NOAA/NGDC Paleoclimatology Program, Boulder CO, USA.

### Sample and measurement information

The counts were recorded by sampling material from sediment cores at each depth, and examining the sampled material for phytoplankton cells. Depth correlates with time before the present -- deeper layers are older -- and depth intervals were chosen to obtain a desired temporal resolution. For each sample, phytoplankton were identified at the taxon level and counts of diatom taxa were recorded along with the total number of phytoplankton cells identified, the depth of the sample in the sediment core, and the radiocarbon-dated age of the material. The relevant population comprises sediment layers within the geographical region where these cores were taken; without further information about the spatial homogeneity of diatom taxonomic structure, scope of inference is limited. 

### Data structure

For this study, the observational units are sediment samples and the variables are depth (age), diatom abundance counts, and the total number of identified phytoplankton. Age is inferred from radiocarbon dating. One observation was made at each depth from 0 cm (surface) to 13.71 cm. The particular taxa recorded in the dataset are indicated with the variable descriptions in Table 1.

> **Table 1**: variable descriptions and units for each variable in the dataset.

Variable | Description | Units
---|---|---
Depth | Depth interval location of sampled material in sediment core | Centimeters (cm)
Age | Radiocarbon age | Thousands of years before present (KyrBP)
A_curv | Abundance of *Actinocyclus curvatulus* | Count (n)
A_octon | Abundance of *Actinocyclus octonarius* | Count (n)
ActinSpp | Abundance of *Actinoptychus* species | Count (n)
A_nodul | Abundance of *Azpeitia nodulifer* | Count (n)
CoscinSpp | Abundance of *Coscinodiscus* species | Count (n)
CyclotSpp | Abundance of *Cyclotella* species | Count (n)
Rop_tess | Abundance of *Roperia tesselata* | Count (n)
StephanSpp | Abundance of *Stephanopyxis* species | Count (n)
Num.counted | Number of diatoms counted in sample | Count (n)

In preprocessing, the counts were converted to proportions, as differing numbers of diatoms were counted in each sample. The subsequent analysis uses these relative abundance data rather than the raw counts. The first few rows of the relative abundance data are shown in Table 2.

> **Table 2**: example rows of relative abundance data.

| Row   |   Depth |   Age |    A_curv |    A_octon |   ActinSpp |   A_nodul |   CoscinSpp |   CyclotSpp |   Rop_tess |   StephanSpp |
|---:|--------:|------:|----------:|-----------:|-----------:|----------:|------------:|------------:|-----------:|-------------:|
|  0 |    0    |  1.33 | 0.0248756 | 0.00995025 |   0.159204 | 0.0696517 |    0.104478 |    0.109453 | 0.00497512 |   0.00497512 |
|  1 |    0.05 |  1.37 | 0.04      | 0.01       |   0.155    | 0.08      |    0.1      |    0.08     | 0.035      |   0.01       |
|  2 |    0.1  |  1.42 | 0.04      | 0.03       |   0.165    | 0.09      |    0.145    |    0.035    | 0.005      |   0.005      |
|  3 |    0.15 |  1.46 | 0.055     | 0.005      |   0.105    | 0.005     |    0.06     |    0.14     | 0.125      |   0.015      |
|  4 |    0.2  |  1.51 | 0.0366667 | 0.00333333 |   0.126667 | 0.01      |    0.06     |    0.08     | 0.01       |   0          |
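
The count-to-proportion conversion described before Table 2 can be sketched as follows. This is a minimal illustration, not the code from *codes.ipynb*: the function name `to_relative_abundance` is hypothetical, and the raw counts and the total of 201 are assumed values chosen only so that the resulting proportions are consistent with the first row of Table 2.

```python
# Taxon column names as given in Tables 1 and 2.
TAXA = ["A_curv", "A_octon", "ActinSpp", "A_nodul",
        "CoscinSpp", "CyclotSpp", "Rop_tess", "StephanSpp"]


def to_relative_abundance(sample, taxa=TAXA, total_key="Num.counted"):
    """Return a copy of `sample` with each taxon count divided by the sample total."""
    total = sample[total_key]
    if total <= 0:
        raise ValueError("total count must be positive")
    # Carry over non-count columns (depth, age) unchanged and drop the total.
    out = {k: v for k, v in sample.items() if k not in taxa and k != total_key}
    for taxon in taxa:
        out[taxon] = sample[taxon] / total
    return out


# ASSUMED raw counts for one sample (not taken from the source data).
sample = {"Depth": 0.0, "Age": 1.33, "A_curv": 5, "A_octon": 2, "ActinSpp": 32,
          "A_nodul": 14, "CoscinSpp": 21, "CyclotSpp": 22, "Rop_tess": 1,
          "StephanSpp": 1, "Num.counted": 201}

props = to_relative_abundance(sample)
print({k: round(v, 4) for k, v in props.items()})
```

Dividing by the per-sample total, rather than by the sum of the eight listed taxa, means a row's proportions need not sum to one; this matches the rows of Table 2, which sum to well under one.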

---
## 2. Methods

Exploratory analysis aimed to illuminate changes, before and after the climate shift, in the average relative abundance and the variation in relative abundance of each taxon separately. This stage of the analysis identified just three taxa that experienced notable shifts in the average or variability of their relative abundance. Subsequently, principal components analysis was performed on normalized relative abundance data to identify measures of community composition that capture a substantial portion of total variation in the data; the typical values of these measures were compared before and after the climate shift. The observed changes suggest a coinciding shift in community composition. Lastly, a clustering of observation times was performed to determine whether the observed patterns in community composition measures associated with the climate shift could be recovered based purely on signal in the relative abundance data.

---
## 3. Results

#### Averages and variabilities of relative abundances by taxon

Exploratory analysis focused on shifts in long-term relative abundance on a taxon-by-taxon basis. Figure 2 shows the averages and variabilities of relative abundances for each taxon before and after the climate shift.

> **Figure 2**: Average relative abundance and variability by taxon before and after the climate shift, approximated at 11,000 years ago. Points indicate average values during each era; the error bars represent two standard deviations in either direction over the same time period.

<center><img src = 'figures/fig2.svg' style = 'width:500px'></center>

Just three taxa show notable shifts: *A. nodulifer* decreased in both average relative abundance and variability, while *R. tesselata* and *Cyclotella* spp. increased in both average relative abundance and variability. The remaining taxa reflect minimal to no changes.
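
The per-era summaries plotted in Figure 2 amount to a grouped mean with an interval of two standard deviations on either side. The sketch below illustrates that computation and is not the code from *codes.ipynb*. The helper names are hypothetical, the use of the sample standard deviation is an assumption, and the six *A. nodulifer* values are copied from a few rows of Tables 2 and 3 purely for illustration, so the output does not reproduce Figure 2.

```python
from statistics import mean, stdev

CUTOFF_KYR = 11.0  # approximate transition point used in the report


def era(age_kyr, cutoff=CUTOFF_KYR):
    """Label a sample as 'before' (older than the cutoff) or 'after' (younger)."""
    return "before" if age_kyr > cutoff else "after"


def summarize_by_era(ages, values, cutoff=CUTOFF_KYR):
    """Return {era: (mean, mean - 2*sd, mean + 2*sd)} for one taxon."""
    groups = {"before": [], "after": []}
    for age, value in zip(ages, values):
        groups[era(age, cutoff)].append(value)
    summary = {}
    for label, vals in groups.items():
        if len(vals) < 2:
            continue  # sample SD is undefined for fewer than two points
        m, s = mean(vals), stdev(vals)  # ASSUMPTION: sample (n - 1) SD
        summary[label] = (m, m - 2 * s, m + 2 * s)
    return summary


# A_nodul values from rows 0-2 of Table 2 and rows 158-160 of Table 3.
ages = [1.33, 1.37, 1.42, 11.02, 11.10, 11.18]
a_nodul = [0.0696517, 0.08, 0.09, 0.089, 0.054, 0.099]

summary = summarize_by_era(ages, a_nodul)
for label, (m, lower, upper) in summary.items():
    print(f"{label}: mean={m:.3f}, interval=({lower:.3f}, {upper:.3f})")
```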

#### Measures of community composition

Principal components analysis was performed on the normalized relative abundances to identify two measures of community composition. These measures are weighted averages of relative abundances across taxa, and together they capture about half of the total variation in relative abundances of all taxa over time. The first measure predominantly describes the abundance of *A. nodulifer* relative to all other taxa. The second is a more complex measure of community composition based largely on four taxa: *Cyclotella*, *R. tesselata*, *Coscinodiscus*, and *Actinoptychus*. Notably, three of the five taxa that figure prominently in the community composition measures underwent notable changes in typical relative abundance coinciding with the climate shift.
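
In symbols, the two measures can be written as below. This assumes that "normalized" means each taxon's relative abundance was centered and scaled to unit variance, which the report does not spell out; the notation is introduced here for illustration only.

$$
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \qquad
\mathrm{PC}_{ik} = \sum_{j=1}^{p} v_{jk}\, z_{ij}, \qquad
\text{share of variance}_k = \frac{\lambda_k}{\sum_{l=1}^{p} \lambda_l}
$$

Here $x_{ij}$ is the relative abundance of taxon $j$ in sample $i$, $\bar{x}_j$ and $s_j$ are that taxon's mean and standard deviation over all samples, $p = 8$ is the number of taxa, and $v_{jk}$ and $\lambda_k$ are the loadings and eigenvalue of the $k$-th principal component of the correlation matrix. Under this assumption, "about half of the total variation" corresponds to $(\lambda_1 + \lambda_2) / \sum_{l} \lambda_l \approx 0.5$.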

Further analysis of these measures of community composition reveals a distinct change in typical values coinciding with the climate shift. Figure 3 shows a visualization of how the two measures shift both together and individually, along with a loading plot indicating the taxonomic composition of each measure.

> **Figure 3**: center panel, scatterplot of Nodulifer/other composition measure (x axis) and complex community measure (y axis), with points colored according to the climate shift; univariate distributions of each measure shown in panels adjacent to the scatterplot; far right, principal component loadings indicating the taxonomic composition of each measure.

<center><img src = 'figures/fig3.svg' style = 'width:800px'></center>

The center and spread of each measure change noticeably, as shown by the univariate distribution panels. Furthermore, the two measures appear to be positively related before the climate shift; this relationship is much less evident, if present at all, after the climate shift.

#### Recovering the climate shift via clustering

Lastly, a $K$-means clustering of the relative abundance data was performed to see if the manual grouping of observation times by the climate shift could be recovered using signal in the diatom community structure. The results are shown in Figure 4, which displays the same scatterplot as in Figure 3 but with *inferred cluster* shown via the color aesthetic.

> **Figure 4**: scatterplot of Nodulifer/other composition measure (x axis) and complex community measure (y axis), with points colored according to the inferred cluster label based on relative abundance data. The square points indicate observations for which the inferred cluster did not match the classification according to the climate shift; there are 18 such points.

<center><img src = 'figures/fig4.svg' style = 'width:500px'></center>

The degree of similarity between Figures 3 and 4 is striking, and indicates that the clustering largely recovers the before-and-after-climate-shift distinction. There were just a handful of points that did not match. These are indicated graphically in the figure (square points), and the corresponding rows of the data are shown in Table 3 below.
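
The clustering and the mismatch count can be sketched as follows. This is a bare-bones version of Lloyd's algorithm for $K$-means written for illustration, not the code from *codes.ipynb*. The function names, the caller-supplied starting centers, and the six two-dimensional points are all hypothetical; the report clusters the full relative abundance data, not these toy values.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Plain Lloyd's algorithm; `centers` are caller-supplied starting centers."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster empties out
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def count_mismatches(labels, eras):
    """Mismatches between two clusters and two eras (both coded 0/1),
    taking the better of the two possible label matchings."""
    direct = sum(lab != e for lab, e in zip(labels, eras))
    return min(direct, len(labels) - direct)


# HYPOTHETICAL toy data: the last point belongs to era 1 but sits near era 0.
points = [(0.14, 0.04), (0.12, 0.05), (0.13, 0.03),
          (0.02, 0.12), (0.03, 0.10), (0.11, 0.06)]
eras = [0, 0, 0, 1, 1, 1]

labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels, count_mismatches(labels, eras))
```

Cluster labels are arbitrary, so the comparison has to allow for the labels being swapped; `count_mismatches` does this by taking the smaller of the two possible disagreement counts.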

> **Table 3**: observations in the dataset for which inferred clusters did not align with the climate shift. The Cluster 1 mismatches are in fact more recent than 11Kyr but grouped with older times; conversely, the Cluster 2 mismatches are in fact older times, but grouped with more recent times.

|     |   Age |   A_curv |   A_octon |   ActinSpp |   A_nodul |   CoscinSpp |   CyclotSpp |   Rop_tess |   StephanSpp | Cluster   |
|----:|------:|---------:|----------:|-----------:|----------:|------------:|------------:|-----------:|-------------:|:----------|
|  72 |  4.49 |    0.017 |     0.022 |      0.096 |     0.143 |       0.039 |       0.07  |      0.048 |        0     | Cluster 1 |
| 157 | 10.93 |    0.02  |     0.015 |      0.199 |     0.114 |       0.134 |       0     |      0.01  |        0     | Cluster 1 |
| 158 | 11.02 |    0.044 |     0.005 |      0.256 |     0.089 |       0.064 |       0.025 |      0.005 |        0     | Cluster 2 |
| 159 | 11.1  |    0.034 |     0.005 |      0.167 |     0.054 |       0.163 |       0.025 |      0.03  |        0.005 | Cluster 2 |
| 160 | 11.18 |    0.054 |     0     |      0.153 |     0.099 |       0.099 |       0.015 |      0.064 |        0.005 | Cluster 2 |
| 161 | 11.27 |    0.02  |     0     |      0.152 |     0.059 |       0.142 |       0.044 |      0.074 |        0     | Cluster 2 |
| 162 | 11.35 |    0.039 |     0.01  |      0.176 |     0.088 |       0.088 |       0.059 |      0.029 |        0     | Cluster 2 |
| 163 | 11.43 |    0.025 |     0.005 |      0.137 |     0.049 |       0.127 |       0.025 |      0.123 |        0     | Cluster 2 |
| 164 | 11.52 |    0.02  |     0.02  |      0.199 |     0.05  |       0.129 |       0.035 |      0.04  |        0     | Cluster 2 |
| 165 | 11.6  |    0.034 |     0.01  |      0.227 |     0.025 |       0.128 |       0.03  |      0.034 |        0     | Cluster 2 |
| 166 | 11.68 |    0.01  |     0.025 |      0.184 |     0.114 |       0.1   |       0.045 |      0.015 |        0     | Cluster 2 |
| 187 | 13.44 |    0.104 |     0.005 |      0.163 |     0.03  |       0.074 |       0.109 |      0.005 |        0     | Cluster 2 |
| 191 | 13.61 |    0.114 |     0.015 |      0.193 |     0.05  |       0.084 |       0.03  |      0     |        0     | Cluster 2 |
| 192 | 13.66 |    0.054 |     0.029 |      0.21  |     0.093 |       0.063 |       0.039 |      0     |        0     | Cluster 2 |
| 199 | 12.99 |    0.01  |     0.02  |      0.191 |     0.088 |       0.162 |       0.015 |      0     |        0.005 | Cluster 2 |
| 201 | 13.84 |    0.054 |     0.015 |      0.266 |     0.069 |       0.049 |       0.03  |      0     |        0.005 | Cluster 2 |
| 202 | 13.88 |    0.044 |     0.005 |      0.157 |     0.098 |       0.127 |       0.049 |      0.005 |        0     | Cluster 2 |
| 205 | 14.02 |    0.039 |     0.01  |      0.222 |     0.113 |       0.069 |       0.039 |      0     |        0     | Cluster 2 |


The majority of the mismatches (10/18) occur right around the approximated transition point of 11,000 years before present. The mismatches before this transition point (older) are marked by generally lower abundances of *A. nodulifer* (mostly under 10%), and the mismatches after this transition point (younger) are marked by higher abundances of *A. nodulifer*. Of particular note is the first mismatch -- 4.49K years ago -- which had an unusually high abundance of *A. nodulifer*.

---
## 4. Discussion

This project analyzed diatom relative abundance over a time span of 15,000 years based on data recorded from sediment cores in the Gulf of California. The analysis focused on apparent differences in community structure before and after a major climate shift around 11,000 years ago (Figure 1), and identified both individual taxa that reflect corresponding shifts in relative abundance (Figure 2) and community measures that appear to change around the same time (Figure 3). These patterns were largely recoverable based on signal from the relative abundance data alone using a standard clustering method (Figure 4).

The analysis suggests that before 11,000 years ago, *A. nodulifer* were more abundant relative to other taxa (PC1 is typically positive). The same period is characterized by low levels of abundance of other taxa, but possibly some proportion of *Coscinodiscus* and *Actinoptychus* in the community (PC2 is typically slightly negative); most variation in the community composition is driven by variation in the relative abundance of *A. nodulifer*. By contrast, between 11,000 years ago and now, *A. nodulifer* are generally less abundant and most of the variation in community composition is driven by the contrast between the relative abundances of *Coscinodiscus* and *Actinoptychus* and those of *Cyclotella* and *R. tesselata*. The markedly different relationships between the two community composition measures suggest that the dynamics of community composition may have shifted around the time of the last major climate change event, and in particular that the present community is both more dynamic (varies more) and more diverse (based on a greater number of taxa). These differing relationships that coincide with the major climate shift are recoverable from signal in the relative abundance data, as demonstrated here by clustering.

Although not analyzed here, this dataset is paired with water chemistry data aligned by time and reconstructed from sediment cores in the same location. A promising extension of this analysis would be to search for drivers of the changes observed here among water chemistry information.