# PSTAT 100 example project code notebook

This file is part of an adaptation of HW3 (the diatom homework) into a project format. 

It's organizationally useful when working on projects to separate code and scratch work from presentation documents; this file contains all the code used to generate the results in the example report. **I strongly recommend keeping a code notebook and a report notebook separately**. You might also consider keeping a draft work notebook.

This wouldn't be accessible to a general reader, but to someone familiar with the project (*i.e.*, all of us), it is organized just well enough to follow. In other words, notice that while this notebook isn't the neatest, it's also not terribly messy. In particular:
* code cells are commented;
* some crude sectioning is used to separate parts of the analysis;
* text cells record explanations, notes, and observations.

This notebook serves two purposes: it documents my work, and it generates the results (tables and figures) used for presentation.

In [24]:
import numpy as np
import pandas as pd
import altair as alt
from sklearn.decomposition import PCA

# Import, tidy, acquaint

In [25]:
# import diatom data
diatoms_raw = pd.read_csv('data/barron-diatoms.csv')
diatoms_raw.head(5)

Unnamed: 0,Depth,Age,A_curv,A_octon,ActinSpp,A_nodul,CoscinSpp,CyclotSpp,Rop_tess,StephanSpp,Num.counted
0,0.0,1.33,5.0,2.0,32,14.0,21,22.0,1.0,1.0,201
1,0.05,1.37,8.0,2.0,31,16.0,20,16.0,7.0,2.0,200
2,0.1,1.42,8.0,6.0,33,18.0,29,7.0,1.0,1.0,200
3,0.15,1.46,11.0,1.0,21,1.0,12,28.0,25.0,3.0,200
4,0.2,1.51,11.0,1.0,38,3.0,18,24.0,3.0,,300


The data are already in tidy format, because each row is an observation (a set of measurements on one sample of sediment) and each column is a variable (one of age, depth, or counts). However, examine rows 3 and 4. These rows illustrate two noteworthy features of the raw data:

1. NaNs are present
2. The number of individuals counted varies considerably from sample to sample.

### 'Missing' values

The NaNs are an artefact of the data recording -- if *no* diatoms in a particular taxon are observed, a `-` is entered in the table (you can verify this by checking the .csv file). These entries are parsed by pandas as NaNs, but the values aren't missing: they correspond to a count of 0 (no diatoms observed).
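
A minimal sketch of this step on a made-up two-row table, assuming the dash convention described above. The column subset and counts are invented for illustration, and `na_values` is passed explicitly so that the dash is read as missing regardless of how pandas would parse it by default.

```python
import io
import pandas as pd

# toy CSV (invented values): '-' marks a taxon with no individuals observed
toy_csv = io.StringIO(
    "Depth,Age,A_curv,StephanSpp,Num.counted\n"
    "0.15,1.46,11,3,200\n"
    "0.20,1.51,11,-,300\n"
)

# na_values = '-' tells pandas to read the dash as NaN
toy_raw = pd.read_csv(toy_csv, na_values = '-')

# the dash means 'none observed', so the appropriate replacement is 0
toy_filled = toy_raw.fillna(0)

print(toy_filled)
```

Filling with 0 is only appropriate here because the dash is known to mean "none observed"; for genuinely missing measurements it would distort the data.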

In [26]:
# replace NaNs by 0
diatoms_mod1 = diatoms_raw.fillna(0)
diatoms_mod1.loc[4:5, :]

Unnamed: 0,Depth,Age,A_curv,A_octon,ActinSpp,A_nodul,CoscinSpp,CyclotSpp,Rop_tess,StephanSpp,Num.counted
4,0.2,1.51,11.0,1.0,38,3.0,18,24.0,3.0,0.0,300
5,0.25,1.55,4.0,9.0,30,10.0,16,14.0,16.0,0.0,203


### Conversion to proportions

Since the total number of phytoplankton counted in each sample varies, the raw counts are not directly comparable -- *e.g.*, a count of 18 is actually a *different* abundance in a sample with 200 individuals counted than in a sample with 300 individuals counted.

For exploratory analysis, the values should be comparable across rows. This can be achieved by a simple transformation so that the values are *relative* abundances: *proportions* of phytoplankton observed from each taxon.
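
As a small worked example of why the conversion matters, the sketch below uses two invented samples with the same raw count of 18 but different totals (200 and 300, the two sample sizes mentioned above). The column name is borrowed from the data; the counts are made up.

```python
import pandas as pd

# toy counts (invented): the same raw count in two samples of different size
toy = pd.DataFrame(
    {'CoscinSpp': [18, 18], 'Num.counted': [200, 300]}
)

# divide each count by its sample's total to get a relative abundance
toy['proportion'] = toy['CoscinSpp'].div(toy['Num.counted'])

print(toy)
```

The same count corresponds to 9% of one sample and 6% of the other, which is the sense in which raw counts are not comparable across rows.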

In [27]:
# set depth, age to indices (num.counted is dropped further below)
diatoms_mod2 = diatoms_mod1.set_index(['Depth', 'Age'])

# store sample sizes
sampsize = diatoms_mod2['Num.counted']

# divide
diatoms_mod3 = diatoms_mod2.div(sampsize, axis = 0)

# drop num.counted and reset index
diatoms = diatoms_mod3.drop(columns = 'Num.counted').reset_index()

# print
diatoms.head()

Unnamed: 0,Depth,Age,A_curv,A_octon,ActinSpp,A_nodul,CoscinSpp,CyclotSpp,Rop_tess,StephanSpp
0,0.0,1.33,0.024876,0.00995,0.159204,0.069652,0.104478,0.109453,0.004975,0.004975
1,0.05,1.37,0.04,0.01,0.155,0.08,0.1,0.08,0.035,0.01
2,0.1,1.42,0.04,0.03,0.165,0.09,0.145,0.035,0.005,0.005
3,0.15,1.46,0.055,0.005,0.105,0.005,0.06,0.14,0.125,0.015
4,0.2,1.51,0.036667,0.003333,0.126667,0.01,0.06,0.08,0.01,0.0


In [42]:
# render for report
tbl1 = diatoms.head()

print(tbl1.to_markdown())

|    |   Depth |   Age |    A_curv |    A_octon |   ActinSpp |   A_nodul |   CoscinSpp |   CyclotSpp |   Rop_tess |   StephanSpp |
|---:|--------:|------:|----------:|-----------:|-----------:|----------:|------------:|------------:|-----------:|-------------:|
|  0 |    0    |  1.33 | 0.0248756 | 0.00995025 |   0.159204 | 0.0696517 |    0.104478 |    0.109453 | 0.00497512 |   0.00497512 |
|  1 |    0.05 |  1.37 | 0.04      | 0.01       |   0.155    | 0.08      |    0.1      |    0.08     | 0.035      |   0.01       |
|  2 |    0.1  |  1.42 | 0.04      | 0.03       |   0.165    | 0.09      |    0.145    |    0.035    | 0.005      |   0.005      |
|  3 |    0.15 |  1.46 | 0.055     | 0.005      |   0.105    | 0.005     |    0.06     |    0.14     | 0.125      |   0.015      |
|  4 |    0.2  |  1.51 | 0.0366667 | 0.00333333 |   0.126667 | 0.01      |    0.06     |    0.08     | 0.01       |   0          |


### Temporal resolution

Before diving in, it will be helpful to resolve two matters:

1. How far back in time do the data go?
2. What is the time resolution of the data?

In [29]:
# time range
diatoms.Age.aggregate(['min', 'max'])

min     1.33
max    15.19
Name: Age, dtype: float64

In [30]:
# histogram of timesteps
diffs = pd.DataFrame({'diff': diatoms.Age.sort_values().diff().loc[1:, ]})
alt.Chart(diffs).mark_bar().encode(
    x = alt.X('diff', 
              bin = alt.Bin(step = 0.02), 
              title = 'Time step between consecutive sample ages'), 
    y = 'count()'
)

Most time steps are 40-60 years (0.04-0.06 in the units of `Age`, which is recorded in thousands of years).
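
The unit conversion can be checked by hand on the first five sample ages shown in the table near the top of the notebook. This is only a sketch on those five values, not a summary of the full series.

```python
import pandas as pd

# first five sample ages from the table above (thousands of years before present)
ages = pd.Series([1.33, 1.37, 1.42, 1.46, 1.51])

# consecutive differences; the first entry is NaN and is dropped
steps_kyr = ages.sort_values().diff().dropna()

# convert to years and round away floating-point noise
steps_yr = (steps_kyr * 1000).round().astype(int)

print(steps_yr.tolist())
```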

---
# Explore

Here are some initial questions:
* Which taxa are most and least abundant on average over time?
* Which taxa vary the most over time?

These can be answered by computing simple summary statistics for each column in the diatom data.

In [31]:
# summary statistics
diatom_summary = diatoms.iloc[:, 2:10].aggregate(['mean', 'std']).transpose()
diatom_summary

Unnamed: 0,mean,std
A_curv,0.028989,0.018602
A_octon,0.018257,0.016465
ActinSpp,0.1359,0.053797
A_nodul,0.07294,0.092677
CoscinSpp,0.085925,0.031795
CyclotSpp,0.070366,0.042423
Rop_tess,0.060448,0.076098
StephanSpp,0.002447,0.007721


In [32]:
# reset index
plot_df = diatom_summary.reset_index()

# create base chart
base = alt.Chart(plot_df).encode(
    y = alt.Y('index', title = 'Taxon', 
              sort = {'field': 'mean', 'order': 'descending'})
)

# create point plot
means = base.mark_point().encode(
    x = alt.X('mean', title = 'Average relative abundance')
)

# create error bars (mean +/- 2 standard deviations)
bars = base.transform_calculate(
    lwr = 'datum.mean - 2*datum.std',
    upr = 'datum.mean + 2*datum.std'
).mark_errorbar().encode(
    x = alt.X('lwr:Q', title = 'Average relative abundance'), 
    x2 = 'upr:Q'
)

# layer
means + bars

#### Observations

* *Actinoptychus* is the most abundant on average (the point for that taxon is farthest to the right).
* *Stephanopyxis* is the rarest on average (the point for that taxon is farthest to the left).
* *Azpeitia nodulifer* shows the most variation (the bar for that taxon is widest).
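
The same conclusions can be read off the summary table directly rather than from the plot. The sketch below re-enters the means and standard deviations printed above and looks up the extremes; the variable names other than `diatom_summary` are introduced here for illustration.

```python
import pandas as pd

# summary statistics copied from the output of the summary cell above
diatom_summary = pd.DataFrame(
    {'mean': [0.028989, 0.018257, 0.1359, 0.07294,
              0.085925, 0.070366, 0.060448, 0.002447],
     'std': [0.018602, 0.016465, 0.053797, 0.092677,
             0.031795, 0.042423, 0.076098, 0.007721]},
    index = ['A_curv', 'A_octon', 'ActinSpp', 'A_nodul',
             'CoscinSpp', 'CyclotSpp', 'Rop_tess', 'StephanSpp']
)

# row labels of the largest/smallest mean and the largest standard deviation
most_abundant = diatom_summary['mean'].idxmax()
rarest = diatom_summary['mean'].idxmin()
most_variable = diatom_summary['std'].idxmax()

print(most_abundant, rarest, most_variable)
```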

The following takes a closer look at the taxon with the most temporal variability in relative abundance: *Azpeitia nodulifer*. 

In [33]:
# histogram
hist = alt.Chart(diatoms).transform_bin(
    as_ = 'bin', 
    field = 'A_nodul', 
    bin = alt.Bin(step = 0.03)
).transform_aggregate(
    Count = 'count()',
    groupby = ['bin']
).transform_calculate(
    Density = 'datum.Count/(0.03*230)',
    binshift = 'datum.bin + 0.015'
).mark_bar(size = 25, opacity = 0.8).encode(
    x = alt.X('binshift:Q', 
              title = 'Relative abundance', 
              scale = alt.Scale(domain = (0.03, 0.38))), 
    y = 'Density:Q'
)

# kde
smooth = alt.Chart(diatoms).transform_density(
    density = 'A_nodul',
    as_ = ['Relative abundance', 'Density'],
    bandwidth = 0.03,
    extent = [0, 0.4],
    steps = 500
).mark_line(color = 'black').encode(
    x = 'Relative abundance:Q',
    y = 'Density:Q'
)

hist + smooth

#### Observations

* Low abundances are most common for this taxon. 
* Relative abundances over 0.33 are rare.  
* Almost all abundances are between 0 and 0.33. Values are concentrated near zero and then more or less evenly distributed between 0.06 and 0.33.
* Overall, the frequency of abundances decays for larger values.
* There are a disproportionately large number of zeroes, because in many samples no *Azpeitia nodulifer* diatoms were observed.
* It kind of looks like the taxon has two 'states': absent or varying evenly in abundance from 5% to 30%.
* No bandwidth parameter for the KDE captured the shape well.
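
A note on the `Density` calculation in the histogram: dividing each bin count by (bin width × number of observations) puts the bars on the same scale as the KDE, so that the bar areas sum to one. The sketch below checks this with NumPy on synthetic data; the uniform sample is a stand-in, and 230 is assumed here to be the number of samples, matching the constant in the cell above.

```python
import numpy as np

# synthetic stand-in for the relative abundances (not the real data)
rng = np.random.default_rng(0)
x = rng.uniform(0, 0.33, size = 230)

# 12 bins of width 0.03 covering [0, 0.36]
width = 0.03
edges = np.linspace(0, 0.36, 13)

# manual density scale: count / (bin width * number of observations)
counts, _ = np.histogram(x, bins = edges)
density_manual = counts / (width * x.size)

# numpy's built-in density normalization for comparison
density_np, _ = np.histogram(x, bins = edges, density = True)

# total bar area on the density scale
total_area = (density_manual * width).sum()
```

If the number of rows changes, the hard-coded constant would need to change with it; computing it from the data avoids that.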

### Temperature fluctuations

Below is a plot of sea surface temperature reconstructions off the coast of Northern California. Data come from the following source:

> Barron *et al.*, 2003. Northern Coastal California High Resolution Holocene/Late Pleistocene Oceanographic Data. IGBP PAGES/World Data Center for Paleoclimatology. Data Contribution Series # 2003-014. NOAA/NGDC Paleoclimatology Program, Boulder CO, USA.

The shaded region indicates the time window with unusually large fluctuations in sea surface temperature; this window roughly corresponds to the dates of a major climate event.

In [34]:
# import sea surface temp reconstruction
seatemps = pd.read_csv('data/barron-sst.csv')

# line plot of time series
line = alt.Chart(seatemps).mark_line().encode(
    x = alt.X('Age', title = 'Thousands of years before present'),
    y = 'SST'
)

# highlight region with large variations
highlight = alt.Chart(
    pd.DataFrame(
        {'SST': np.linspace(0, 14, 100), 
         'lwr': np.repeat(11, 100), 
         'upr': np.repeat(15, 100)}
    )
).mark_area(opacity = 0.2, color = 'orange').encode(
    y = 'SST',
    x = alt.X('lwr', title = 'Thousands of years before present'),
    x2 = 'upr'
)

# add smooth trend
smooth = line.transform_loess(
    on = 'Age',
    loess = 'SST',
    bandwidth = 0.2
).mark_line(color = 'black')

# layer
fig1 = line + highlight + smooth

# display
fig1

In [35]:
kdes = alt.Chart(diatoms).transform_calculate(
    pre_dryas = 'datum.Age > 11'
).transform_density(
    density = 'A_nodul',
    groupby = ['pre_dryas'],
    as_ = ['Relative abundance', 'Density'],
    bandwidth = 0.025,
    extent = [0, 0.4],
    steps = 500
).mark_line(color = 'black').encode(
    x = 'Relative abundance:Q',
    y = 'Density:Q',
    color = alt.Color('pre_dryas:N', title = 'Before 11 KyrBP')
)

kdes + kdes.mark_area(opacity = 0.2)

#### Observations

* Relative abundances are low in samples more recent than 11 thousand years before present.
* In samples older than 11 thousand years before present, relative abundances are spread between roughly 5% and 35%.

In [36]:
# summary statistics conditional on age > 11
grouped_summary = diatoms.iloc[:, 2:10].groupby(
    diatoms.Age > 11
).aggregate(
    ['mean', 'std']
).transpose().melt(
    ignore_index = False
).reset_index().pivot(
    index = ['level_0', 'Age'],
    columns = 'level_1',
    values = 'value'
).reset_index().rename(
    columns = {'level_0': 'taxon'}
)

# means before and after 11k yrs bp
points = alt.Chart(grouped_summary).mark_point().encode(
    x = alt.X('mean', title = 'Average relative abundance'),
    y = alt.Y('Age', title = '', axis = None),
    color = alt.Color('Age', title = 'Before 11KyrBP')
)

# variability about means
bars = alt.Chart(grouped_summary).transform_calculate(
    lwr = 'datum.mean - 2*datum.std',
    upr = 'datum.mean + 2*datum.std'
).mark_errorbar().encode(
    x = alt.X('lwr:Q', title = 'Average relative abundance'), 
    x2 = 'upr:Q',
    y = alt.Y('Age', title = '', axis = None),
    color = alt.Color('Age', title = 'Before 11KyrBP')
)

# layer
fig2 = (points + bars).facet(
    row = alt.Row('taxon', 
                  title = None, 
                  header = alt.Header(labelAngle = 0, 
                                      labelAlign = 'left'))
).configure_facet(spacing = 0)

# display
fig2


#### Observations

* Rop. tess. and Cyclot spp. increase in average relative abundance and variability after the temperature stabilizes.
* A. nodul. decreases in average relative abundance and variability after the temperature stabilizes.
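
The core of the grouped summary is a `groupby` on the condition `Age > 11`; the transpose/melt/pivot chain in the cell above only reshapes the result for plotting. Here is a minimal sketch of that core step on invented values for a single taxon; the epoch labels are introduced here for readability.

```python
import pandas as pd

# toy data (invented): ages in thousands of years and one taxon's proportions
toy = pd.DataFrame({
    'Age': [2.0, 6.0, 10.0, 12.0, 14.0, 15.0],
    'A_nodul': [0.00, 0.02, 0.01, 0.20, 0.30, 0.10]
})

# label each sample by epoch (labels are illustrative)
epoch = toy['Age'].gt(11).map(
    {True: 'older than 11 kyr BP', False: 'younger than 11 kyr BP'}
)

# mean and standard deviation within each epoch
toy_summary = toy.groupby(epoch)['A_nodul'].aggregate(['mean', 'std'])

print(toy_summary)
```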

### PCA

GOAL: use PCA to explore variation in community composition *among* all eight taxa.

In [37]:
# correlation matrix
corr_mx = diatoms.set_index(['Depth', 'Age']).corr()
corr_mx

Unnamed: 0,A_curv,A_octon,ActinSpp,A_nodul,CoscinSpp,CyclotSpp,Rop_tess,StephanSpp
A_curv,1.0,0.11148,0.390898,-0.446778,0.091222,0.219439,-0.06269,0.151909
A_octon,0.11148,1.0,-0.005009,-0.217992,0.049589,0.065249,-0.023047,-0.041017
ActinSpp,0.390898,-0.005009,1.0,-0.363475,0.306021,-0.055732,-0.34341,0.058494
A_nodul,-0.446778,-0.217992,-0.363475,1.0,-0.01092,-0.407338,-0.471941,-0.151409
CoscinSpp,0.091222,0.049589,0.306021,-0.01092,1.0,-0.266157,-0.341755,-0.016332
CyclotSpp,0.219439,0.065249,-0.055732,-0.407338,-0.266157,1.0,0.018149,0.070684
Rop_tess,-0.06269,-0.023047,-0.34341,-0.471941,-0.341755,0.018149,1.0,0.032607
StephanSpp,0.151909,-0.041017,0.058494,-0.151409,-0.016332,0.070684,0.032607,1.0


In [38]:
# melt corr_mx
corr_mx_long = corr_mx.reset_index().rename(
    columns = {'index': 'row'}
).melt(
    id_vars = 'row',
    var_name = 'col',
    value_name = 'Correlation'
)

# construct plot
alt.Chart(corr_mx_long).mark_rect().encode(
    x = alt.X('col', title = '', sort = {'field': 'Correlation', 'order': 'ascending'}),
    y = alt.Y('row', title = '', sort = {'field': 'Correlation', 'order': 'ascending'}),
    color = alt.Color('Correlation', 
                      scale = alt.Scale(scheme = 'blueorange',
                                        domain = (-1, 1), 
                                        type = 'sqrt'),
                     legend = alt.Legend(tickCount = 5))
).properties(width = 300, height = 300)

#### Observations

* *Azpeitia nodulifer* abundances are negatively correlated with abundances of all other taxa -- all off-diagonal entries in the `A_nodul` row are blue. 
* Interpretation: when *A. nodulifer* diatoms are more abundant than usual, diatoms in other taxa tend to be less abundant than usual, and conversely.
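
The sign pattern can also be confirmed numerically rather than by colour. The sketch below re-enters the `A_nodul` row of the correlation matrix printed above; the variable names are introduced here for illustration.

```python
import pandas as pd

# A_nodul row of the correlation matrix, copied from the output above
a_nodul_corr = pd.Series({
    'A_curv': -0.446778, 'A_octon': -0.217992, 'ActinSpp': -0.363475,
    'A_nodul': 1.0, 'CoscinSpp': -0.01092, 'CyclotSpp': -0.407338,
    'Rop_tess': -0.471941, 'StephanSpp': -0.151409
})

# drop the diagonal entry (correlation of the taxon with itself)
off_diag = a_nodul_corr.drop('A_nodul')

# are all remaining correlations negative?
all_negative = bool((off_diag < 0).all())

# which taxa have the weakest and strongest negative correlation?
weakest = off_diag.abs().idxmin()
strongest = off_diag.idxmin()

print(all_negative, weakest, strongest)
```

All seven correlations are negative, though the one with `CoscinSpp` is close to zero, so the relationship is much weaker for that taxon than for `Rop_tess` or `A_curv`.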

In [39]:
# center/scale
pcdata = diatoms.set_index(['Depth', 'Age'])
pcdata = (pcdata - pcdata.mean())/pcdata.std()

# compute pcs
pca = PCA(8)
pca.fit(pcdata)

# store proportion of variance explained as a dataframe
pcvars = pd.DataFrame({'Proportion of variance explained': pca.explained_variance_ratio_})

# add component number as a new column
pcvars['Component'] = np.arange(1, 9)

# add cumulative variance explained as a new column
pcvars['Cumulative variance explained'] = pcvars.iloc[:, 0].cumsum(axis = 0)

# encode component axis only as base layer
base = alt.Chart(pcvars).encode(
    x = 'Component')

# make a base layer for the proportion of variance explained
prop_var_base = base.encode(
    y = alt.Y('Proportion of variance explained',
              axis = alt.Axis(titleColor = '#57A44C'))
)

# make a base layer for the cumulative variance explained
cum_var_base = base.encode(
    y = alt.Y('Cumulative variance explained', axis = alt.Axis(titleColor = '#5276A7'))
)

# add points and lines to each base layer
prop_var = prop_var_base.mark_line(stroke = '#57A44C') + prop_var_base.mark_point(color = '#57A44C')
cum_var = cum_var_base.mark_line() + cum_var_base.mark_point()

# layer the layers
alt.layer(prop_var, cum_var).resolve_scale(y = 'independent')

#### Observations

* The first two PCs each capture over 20% of the total variation.
* The remaining PCs (3 and up) each capture much less (8-12%). 
* The first two PCs together explain roughly half of the total variation in relative abundance.
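
For reference, the proportions of variance explained are the eigenvalues of the correlation matrix divided by their sum, because the data were centered and scaled before PCA. The sketch below illustrates this with NumPy on synthetic data; the random matrix is a stand-in for the eight taxon columns, not the diatom data.

```python
import numpy as np

# synthetic stand-in: 200 samples of 8 variables, two of them correlated
rng = np.random.default_rng(1)
n, p = 200, 8
x = rng.normal(size = (n, p))
x[:, 1] += x[:, 0]

# center and scale (ddof = 1 matches the pandas default for .std())
z = (x - x.mean(axis = 0)) / x.std(axis = 0, ddof = 1)

# eigenvalues of the correlation matrix, largest first
corr = np.corrcoef(z, rowvar = False)
eigvals = np.linalg.eigvalsh(corr)[::-1]

# proportion and cumulative proportion of variance explained
prop_var = eigvals / eigvals.sum()
cum_var = np.cumsum(prop_var)

# the same eigenvalues from the singular values of the standardized data
s = np.linalg.svd(z, compute_uv = False)
eigvals_from_svd = s**2 / (n - 1)
```

With eight standardized variables the eigenvalues sum to 8, so a component that explained exactly its "fair share" would account for 12.5% of the variation.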

In [40]:
# store the loadings as a data frame with appropriate names
loading_df = pd.DataFrame(pca.components_).transpose().rename(
    columns = {0: 'PC1', 1: 'PC2'}
).loc[:, ['PC1', 'PC2']]

# add a column with the taxon names
loading_df['Taxon'] = pcdata.columns.values

# melt from wide to long
loading_plot_df = loading_df.melt(
    id_vars = 'Taxon',
    var_name = 'PC',
    value_name = 'Loading'
)

# create base layer with encoding
base = alt.Chart(loading_plot_df).encode(
    y = alt.Y('Taxon', title = ''),
    x = 'Loading',
    strokeDash = 'PC'
)

# store vertical reference line at a loading of zero
rule = alt.Chart(pd.DataFrame({'Loading': 0}, index = [0])).mark_rule(color = 'grey').encode(x = 'Loading', size = alt.value(2))

# layer points + lines + rule to construct loading plot
loading_plot = (base.mark_point(color = 'black') + base.mark_line(color = 'black') + rule).properties(width = 100, height = 300)

# show
loading_plot

#### Observations

* PC1 is heavily upweighted by high abundance of *A. nodulifer* and downweighted by high abundances of other taxa. 
* Roughly, PC1 reflects the difference in relative abundance between *A. nodulifer* and a weighted average of all other taxa.
* When PC1 is positive, *A. nodulifer* are more abundant than usual and other taxa are less abundant than usual; vice-versa when negative.
* The PC2 loading is large and positive for two taxa, *Cyclotella* and *R. tesselata*, and large and negative for two taxa, *Coscinodiscus* and *Actinoptychus*; all other loadings are negligibly small. 
* Roughly, PC2 reflects the difference in average relative abundance between two groups of taxa -- a more complex measure of community composition.
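
Reading a PC as a "difference" or "weighted average" rests on the fact that each sample's PC score is the sum of its standardized values weighted by the loadings. The sketch below demonstrates this with NumPy on synthetic data; the four-column matrix is a stand-in, with one column constructed to move against the others.

```python
import numpy as np

# synthetic stand-in: column 0 is built to move against the other columns
rng = np.random.default_rng(2)
x = rng.normal(size = (50, 4))
x[:, 0] -= 0.5 * x[:, 1:].sum(axis = 1)

# center and scale
z = (x - x.mean(axis = 0)) / x.std(axis = 0, ddof = 1)

# loadings are the right singular vectors; columns are PC1, PC2, ...
u, s, vt = np.linalg.svd(z, full_matrices = False)
loadings = vt.T

# scores for all samples: standardized data times loadings
scores = z @ loadings

# PC1 score of the first sample written out as a weighted sum
pc1_first = sum(loadings[j, 0] * z[0, j] for j in range(z.shape[1]))
```

The overall sign of each PC is arbitrary, so only the contrast between positively and negatively weighted variables is meaningful, not which group happens to get the positive sign.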

In [41]:
# project pcdata onto first two components; store as data frame
projected_data = pd.DataFrame(pca.fit_transform(pcdata)).iloc[:, 0:2].rename(columns = {0: 'PC1', 1: 'PC2'})

# add index and reset
projected_data.index = pcdata.index
projected_data = projected_data.reset_index()

# base chart
base = alt.Chart(projected_data).transform_calculate(
    since_11KyrBP = 'datum.Age < 11'
)

# data scatter
scatter = base.mark_point().encode(
    x = alt.X('PC1:Q', title = 'Nodulifer/non-nodulifer composition (PC1)'),
    y = alt.Y('PC2:Q', title = 'Complex community composition (PC2)'),
    color = alt.Color('since_11KyrBP:N', title = 'More recent than 11KyrBP')
)

# pc1 density estimates
pc1_kde = base.transform_density(
    density = 'PC1',
    groupby = ['since_11KyrBP'],
    as_ = ['PC1', 'Density'],
    extent = [-3, 5],
    steps = 500
).encode(
    x = alt.X('PC1:Q', axis = None),
    y = alt.Y('Density:Q', axis = None),
    color = alt.Color('since_11KyrBP:N', title = 'More recent than 11KyrBP')
)

# panel for layering
top_panel = (pc1_kde.mark_line() + pc1_kde.mark_area(opacity = 0.2)).properties(height = 50)

# pc2 density estimates
pc2_kde = base.transform_density(
    density = 'PC2',
    groupby = ['since_11KyrBP'],
    as_ = ['PC2', 'Density'],
    extent = [-3, 5],
    steps = 500
).encode(
    y = alt.Y('PC2:Q', axis = None),
    x = alt.X('Density:Q', axis = None),
    color = alt.Color('since_11KyrBP:N', title = 'More recent than 11KyrBP')
)

# panel for layering
side_panel = (pc2_kde.mark_line(order = False) + pc2_kde.mark_area(order = False, opacity = 0.2)).properties(width = 50)

# layer with loading plot
fig3 = top_panel & (scatter | side_panel | loading_plot) 

# display
fig3

top_panel & (scatter | side_panel)

# Clustering

In [19]:
import sklearn.cluster as cl

# for reproducibility
np.random.seed(52221)

# extract relative abundances as array
x_mx = diatoms.iloc[:, 2:].values

# compute clustering and store labels
km = cl.KMeans(n_clusters = 2, n_init = 10)
projected_data['cluster_label'] = pd.Categorical.from_codes(km.fit_predict(x_mx), categories = ['Cluster 1', 'Cluster 2'])

# plot
alt.Chart(projected_data).mark_point().encode(
    x = alt.X('PC1:Q', title = 'Nodulifer/non-nodulifer composition (PC1)'),
    y = alt.Y('PC2:Q', title = 'Complex community composition (PC2)'),
    color = alt.Color('cluster_label:N', title = 'K-means clusters')
)

In [20]:
# determine whether clustering coincides with time epoch
projected_data['agreement'] = pd.Categorical.from_codes(
    np.abs((projected_data.Age < 11) - projected_data.cluster_label.cat.codes),
    categories = ['Coincides', 'Does not coincide']
)

# add shape to indicate agreement
fig4 = alt.Chart(projected_data).mark_point().encode(
    x = alt.X('PC1:Q', title = 'Nodulifer/non-nodulifer composition'),
    y = alt.Y('PC2:Q', title = 'Complex community composition'),
    color = alt.Color('cluster_label:N', title = 'K-means clusters'),
    opacity = alt.Opacity('agreement:N', title = 'Mismatch'),
    shape = alt.Shape('agreement:N')
)

# show
fig4

In [21]:
# inspect mismatches
inspect_ix = projected_data[projected_data.agreement == 'Does not coincide'].index
tbl2 = diatoms.loc[inspect_ix].copy().drop(columns = 'Depth')
tbl2['Cluster'] = projected_data.cluster_label[inspect_ix]

tbl2

Unnamed: 0,Age,A_curv,A_octon,ActinSpp,A_nodul,CoscinSpp,CyclotSpp,Rop_tess,StephanSpp,Cluster
72,4.49,0.017391,0.021739,0.095652,0.143478,0.03913,0.069565,0.047826,0.0,Cluster 1
157,10.93,0.0199,0.014925,0.199005,0.114428,0.134328,0.0,0.00995,0.0,Cluster 1
158,11.02,0.044335,0.004926,0.256158,0.08867,0.064039,0.024631,0.004926,0.0,Cluster 2
159,11.1,0.034483,0.004926,0.167488,0.054187,0.162562,0.024631,0.029557,0.004926,Cluster 2
160,11.18,0.054455,0.0,0.153465,0.09901,0.09901,0.014851,0.064356,0.00495,Cluster 2
161,11.27,0.019608,0.0,0.151961,0.058824,0.142157,0.044118,0.073529,0.0,Cluster 2
162,11.35,0.039024,0.009756,0.17561,0.087805,0.087805,0.058537,0.029268,0.0,Cluster 2
163,11.43,0.02451,0.004902,0.137255,0.04902,0.127451,0.02451,0.122549,0.0,Cluster 2
164,11.52,0.0199,0.0199,0.199005,0.049751,0.129353,0.034826,0.039801,0.0,Cluster 2
165,11.6,0.034483,0.009852,0.226601,0.024631,0.128079,0.029557,0.034483,0.0,Cluster 2


#### Observations

* Cluster 1 mismatches are actually more recent than 11Kyr, but grouped with older times.
* Cluster 2 mismatches are actually older, but grouped with more recent times.
* There are only two cluster 1 mismatches, and only one of those (sample 72, age 4.49) is substantially more recent than 11K yr BP.
* That outlying cluster 1 mismatch had an unusually high relative abundance of A_nodul.
* Cluster 2 mismatches seem mostly characterized by lower levels of A_nodul.
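
The `agreement` column above relies on a small trick: the absolute difference between a 0/1 epoch indicator and a 0/1 cluster code is 0 when they match and 1 when they do not. A minimal sketch on invented values follows. K-means numbers its clusters arbitrarily, so the sketch also takes the smaller of the two possible mismatch rates; that last step is an addition for illustration and is not part of the analysis above.

```python
import pandas as pd

# toy inputs (invented): epoch indicator and K-means codes for six samples
recent = pd.Series([True, True, True, False, False, False])
codes = pd.Series([1, 1, 0, 0, 0, 1])

# 0 where the cluster code matches the epoch indicator, 1 where it does not
mismatch = (recent.astype(int) - codes).abs()

# readable labels for the 0/1 codes
agreement = pd.Categorical.from_codes(
    mismatch, categories = ['Coincides', 'Does not coincide']
)

# cluster numbering is arbitrary: compare against the flipped labelling too
mismatch_rate = mismatch.mean()
aligned_rate = min(mismatch_rate, 1 - mismatch_rate)

print(list(agreement), aligned_rate)
```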

In [22]:
# render as markdown table
print(tbl2.round(3).to_markdown())

|     |   Age |   A_curv |   A_octon |   ActinSpp |   A_nodul |   CoscinSpp |   CyclotSpp |   Rop_tess |   StephanSpp | Cluster   |
|----:|------:|---------:|----------:|-----------:|----------:|------------:|------------:|-----------:|-------------:|:----------|
|  72 |  4.49 |    0.017 |     0.022 |      0.096 |     0.143 |       0.039 |       0.07  |      0.048 |        0     | Cluster 1 |
| 157 | 10.93 |    0.02  |     0.015 |      0.199 |     0.114 |       0.134 |       0     |      0.01  |        0     | Cluster 1 |
| 158 | 11.02 |    0.044 |     0.005 |      0.256 |     0.089 |       0.064 |       0.025 |      0.005 |        0     | Cluster 2 |
| 159 | 11.1  |    0.034 |     0.005 |      0.167 |     0.054 |       0.163 |       0.025 |      0.03  |        0.005 | Cluster 2 |
| 160 | 11.18 |    0.054 |     0     |      0.153 |     0.099 |       0.099 |       0.015 |      0.064 |        0.005 | Cluster 2 |
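| 161 | 11.27 |    0.02  |     0     |      0.152 |     0.059 |       0.142 |       0.044 |      0.074 |        0     | Cluster 2 |
| 162 | 11.35 |    0.039 |     0.01  |      0.176 |     0.088 |       0.088 |       0.059 |      0.029 |        0     | Cluster 2 |
| 163 | 11.43 |    0.025 |     0.005 |      0.137 |     0.049 |       0.127 |       0.025 |      0.123 |        0     | Cluster 2 |
| 164 | 11.52 |    0.02  |     0.02  |      0.199 |     0.05  |       0.129 |       0.035 |      0.04  |        0     | Cluster 2 |
| 165 | 11.6  |    0.034 |     0.01  |      0.227 |     0.025 |       0.128 |       0.03  |      0.034 |        0     | Cluster 2 |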

In [23]:
# check raw data, possibly low counts?
diatoms_raw.loc[72]

Depth            3.60
Age              4.49
A_curv           4.00
A_octon          5.00
ActinSpp        22.00
A_nodul         33.00
CoscinSpp        9.00
CyclotSpp       16.00
Rop_tess        11.00
StephanSpp        NaN
Num.counted    230.00
Name: 72, dtype: float64