# MiBiPreT example: Ordination with Amersfoort data

Diagnostic plots for data analysis on microbial biodegradation at the contaminated Amersfoort site. Author: Alraune Zech

Data are based on the PhD thesis of *Johan van Leeuwen*, 2021 'Biodegradation of mono- and polyaromatic hydrocarbons in a contaminated aquifer originating from a former Pintsch gas factory site', which is equivalent to the manuscript of van Leeuwen et al., 2022 'Anaerobic degradation of benzene and other aromatic hydrocarbons in a tar-derived plume: Nitrate versus iron reducing conditions', J. of Cont. Hydrol. The data were provided by Johan van Leeuwen.
  
## Background: Amersfoort contaminant site

Close to the train station in Amersfoort, the Netherlands, the subsurface is contaminated with organic hydrocarbons forming a non-aqueous phase liquid (NAPL). The contamination originates from decades of operation of a manufactured gas plant, which dumped tar by-products in waste lagoons. The tar is a dense NAPL (DNAPL) and has spread into the underlying shallow unconfined aquifer. Sample wells were installed to measure various characteristics of the subsurface. The raw data contain measurements on
* environmental conditions, such as pH, redox potential, and concentrations of oxygen, nitrate, etc.
* contaminant concentrations, such as BTEX, indene, indane, naphthalene and multiple other (typically cyclic) petroleum hydrocarbons
* metabolite concentrations, i.e. by-products of contaminant degradation processes
* isotope measurements for specific contaminants and samples
* counts of genes (RNA/DNA) of microbiota that are known to perform biodegradation, as well as of functional enzymes known to be responsible for biodegradation

**Required packages**

In [None]:
import mibiscreen as mbs

## Analysis

### Data loading

In [None]:
file_path = './amersfoort.xlsx'

Load and standardize data of environmental quantities/chemicals:

In [None]:
environment_raw,units = mbs.load_excel(file_path,
                                    sheet_name = 'environment',
                                    verbose = False)

environment,units = mbs.standardize(environment_raw,
                                reduce = True,
                                verbose=False)

Load and standardize data of contaminants:

In [None]:
contaminants_raw,units = mbs.load_excel(file_path,
                                    sheet_name = 'contaminants',
                                    verbose = False)

contaminants,units = mbs.standardize(contaminants_raw,
                                  reduce = False,
                                  verbose = False)

mbs.total_contaminant_concentration(contaminants,
                                include = True,
                                verbose = False)

Merge both data frames into one for further analysis:

In [None]:
data = mbs.merge_data([environment,contaminants],clean = True)

### Data pre-processing

Specification of the variables to choose for ordination. We make sure that the variables selected for analysis use standard names.

In [None]:
variables_1 = mbs.standard_names(['total_contaminants'])
variables_2 = mbs.standard_names(['nitrate','pH','nitrite','sulfate','Redox','EC','DOC',"Mn","Fe"])

Extracting the data of interest from the data frame. All other quantities are removed, since only the columns of interest should be passed to the ordination analysis routines.

In [None]:
data_ordination = mbs.extract_data(data,
                               name_list = variables_1 + variables_2,
                               keep_setting_data = True)

Cleaning data, i.e. removing all sites where measurements are missing for any of the quantities:

In [None]:
data_filtered = mbs.filter_values(data_ordination, 
              replace_NaN = 'remove', 
              inplace = True,
              verbose = True)
data_filtered
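
To make the cleaning step concrete, here is a minimal sketch of the same idea in plain pandas. It is not how `mbs.filter_values` is implemented; the toy table, its column names and the values are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: one row per sample, NaN marks a missing measurement.
samples = pd.DataFrame({
    "sample_nr": ["A", "B", "C", "D"],
    "nitrate": [1.2, np.nan, 0.4, 3.1],
    "pH": [7.1, 6.9, np.nan, 7.4],
})

# Keep only the rows that are complete in every measured column,
# since ordination methods such as PCA cannot handle missing values.
complete = samples.dropna(subset=["nitrate", "pH"]).reset_index(drop=True)
print(complete)
```

Samples B and C each miss one quantity, so only A and D remain. Dropping incomplete rows is the simplest strategy; it can cost a lot of samples when many quantities are selected.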

### Data transformation

The concentration of contaminants (`Sum GC`) is log-transformed (given that concentrations vary over orders of magnitude) and then standardized.
All other data are only standardized.

In [None]:
data_transformed = mbs.transform_values(data_ordination,
                 name_list = variables_1,
                 how = 'log_scale',
                 inplace = True,
                 )

data_transformed = mbs.transform_values(data_ordination,
                  name_list = variables_1,
                  how = 'standardize',
                  inplace = True,
                  )

data_transformed = mbs.transform_values(data_ordination,
                  name_list = variables_2,
                  how = 'standardize',
                  inplace = True,
                )
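
The two transformations can be sketched with NumPy and pandas as follows. This is an illustration under assumptions, not the internals of `mbs.transform_values`: the natural logarithm and the population standard deviation (`ddof=0`) are choices made here, and the concentration values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical concentrations spanning several orders of magnitude.
conc = pd.Series([0.0, 9.0, 99.0, 999.0], name="total_contaminants")

# Log transform via log(x + 1), so that zero concentrations stay finite.
# Assumption: natural logarithm; the base only rescales the values.
log_conc = np.log1p(conc)

# Standardize to zero mean and unit variance (z-scores).
# Assumption: population standard deviation (ddof=0).
z = (log_conc - log_conc.mean()) / log_conc.std(ddof=0)
print(z)
```

Since standardization removes any constant factor, the choice of logarithm base does not affect the standardized values.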

### Ordination analysis by PCA

In [None]:
ordination_output = mbs.pca(data_ordination,
                        independent_variables = variables_1+variables_2,
                        verbose = True)
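
For readers who want to see what a PCA computes, the following sketch runs one on synthetic data using a singular value decomposition. It is a generic textbook formulation, not necessarily what `mbs.pca` does internally, and all variable names and data here are made up.

```python
import numpy as np

# Synthetic data: 20 samples, 4 variables; variable 1 closely follows variable 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=20)

# Standardize each column, as done for the Amersfoort data above.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via singular value decomposition: Z = U * diag(s) * Vt.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

scores = U * s                      # coordinates of the samples on the components
loadings = Vt.T                     # direction of each component in variable space
explained = s**2 / np.sum(s**2)     # fraction of variance per component

print("explained variance ratio:", np.round(explained, 3))
```

The first two columns of `scores` and `loadings` are what a two-dimensional biplot displays, and the first two entries of `explained` are the percentages usually quoted on its axes.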

### Ordination plot

In [None]:
fig, ax = mbs.ordination_plot(ordination_output=ordination_output,
                plot_scores = True, 
                plot_loadings = True,
                rescale_loadings_scores = True, 
                title = "Unconstrained Ordination PCA",                
                # plot_scores = False, 
                # axis_ranges = [-0.6,0.8,-0.8,1.0],
                # save_fig = 'ordination_amersfoort.png',
                )

### Interpretation

The figure shows the results of the PCA (principal component analysis), a multivariate statistics tool, applied to the geochemical parameters. The axes denote the percentage of variance explained by each principal component. Data were standardized, with `Sum GC` (the total sum of contaminants) first being logarithmically transformed via $\log(x+1)$.

This biplot shows two different elements of information: the variable loadings, shown as arrows, and the site scores, shown as dots. The two plot axes represent the two main axes identified by the ordination method. They are scaled to a minimum value of $-1$ and a maximum value of $1$.

The direction of an arrow indicates which ordination axis the variable correlates with. The length of the vector reflects the strength of that correlation. Positive scores or loadings indicate positive correlation with the axis, whereas negative values indicate negative correlation. For example, a variable with negative loadings on the first two ordination axes is anticorrelated with the two largest trends in the data. Arrows pointing in the same direction indicate that the variables are correlated. Arrows at a right angle to one another indicate uncorrelated variables. Arrows that point in opposite directions indicate anti-correlated variables. A vector very close to the origin shows little to no correlation with the axes.
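
The link between arrow angles and correlation can be checked numerically. The sketch below uses synthetic variables (`a`, `b`, `c`, `d` are invented) and a generic SVD-based PCA with loadings scaled by the singular values, which is an assumption about the scaling and may differ from what `mbs.ordination_plot` applies. With that scaling, the cosine of the angle between two arrows equals the correlation of the two variables when all components are kept; in a two-dimensional plot it is only an approximation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
a = rng.normal(size=n)
b = a + 0.3 * rng.normal(size=n)     # correlated with a
c = -a + 0.3 * rng.normal(size=n)    # anti-correlated with a
d = rng.normal(size=n)               # unrelated to a

X = np.column_stack([a, b, c, d])
Z = (X - X.mean(axis=0)) / X.std(axis=0)

U, s, Vt = np.linalg.svd(Z, full_matrices=False)

# One arrow per variable (rows), scaled by the singular values.
arrows = Vt.T * s / np.sqrt(n)

# Cosine of the angle between every pair of arrows, using all components.
norms = np.linalg.norm(arrows, axis=1)
cosines = (arrows @ arrows.T) / np.outer(norms, norms)

# The same quantity restricted to the first two components (what the plot shows).
arrows_2d = arrows[:, :2]
norms_2d = np.linalg.norm(arrows_2d, axis=1)
cosines_2d = (arrows_2d @ arrows_2d.T) / np.outer(norms_2d, norms_2d)

print(np.round(cosines, 2))
print(np.round(cosines_2d, 2))
```

Comparing the two printed matrices shows how much of the correlation structure survives the projection onto two axes; the less variance the first two components explain, the less reliable the angles in the plot become.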

Proximity of the site scores in the plot indicates similarity between the sample sites.