# MiBiPreT example: Metabolite Data Analysis with Amersfoort data

Diagnostic plots for data analysis of microbial biodegradation at the contaminated Amersfoort site.

Author: Alraune Zech

Data based on the PhD thesis of *Johan van Leeuwen*, 2021, 'Biodegradation of mono- and polyaromatic hydrocarbons in a contaminated aquifer originating from a former Pintsch gas factory site', which is equivalent to the manuscript of van Leeuwen et al., 2022, 'Anaerobic degradation of benzene and other aromatic hydrocarbons in a tar-derived plume: Nitrate versus iron reducing conditions', J. of Cont. Hydrol. The data were provided by Johan van Leeuwen.
  
## Background: Amersfoort contaminant site

Close to the train station in Amersfoort, the Netherlands, the subsurface is contaminated with organic hydrocarbons forming a non-aqueous phase liquid (NAPL). The contamination originates from decades of operation of a manufactured gas plant, which dumped tar by-products in waste lagoons. The tar is a dense NAPL (DNAPL) and has spread into the underlying shallow unconfined aquifer. Sample wells were installed to measure various characteristics of the subsurface. The raw data contain measurements of
* environmental conditions, such as pH, redox potential, and concentrations of oxygen, nitrate, etc.
* contaminant concentrations, such as BTEX, indene, indane, naphthalene and multiple other (typically cyclic) petroleum hydrocarbons
* metabolite concentrations, i.e. by-products of contaminant degradation processes
* isotope measurements for specific contaminants and samples
* counts of genes (RNA/DNA) of microbiota that are known to perform biodegradation, as well as of functional enzymes known to be responsible for biodegradation

**Required packages**

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
import mibiscreen as mbs

## Data loading

In [None]:
file_path = './amersfoort.xlsx'

**Load and standardize data of metabolites**

*Note*: Column names of metabolites are not checked. All columns provided (except settings) are interpreted as individual metabolites.
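
The consequence of this rule can be pictured with a minimal `pandas` sketch. It is an illustration only, not the `mibiscreen` implementation: the list `SETTINGS` and all column names except `sample_nr` are hypothetical, and the values are made up.

```python
import pandas as pd

# Hypothetical settings columns; the names mibiscreen actually reserves may differ.
SETTINGS = ["sample_nr", "obs_well", "depth"]

# Made-up demo table, not the Amersfoort data.
demo_sheet = pd.DataFrame({
    "sample_nr": ["S1", "S2"],
    "depth": [-6.0, -8.5],
    "benzoic_acid": [1.2, 0.0],
    "comment": [3.0, 4.0],  # stray column that is not a metabolite
})

# Every column that is not a settings column is treated as a metabolite.
metabolite_columns = [c for c in demo_sheet.columns if c not in SETTINGS]
print(metabolite_columns)
```

In this sketch the stray `comment` column ends up in the metabolite list, which is why the input sheet should contain only settings and metabolite columns.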

In [None]:
metabolites_raw,units = mbs.load_excel(file_path,
                                       sheet_name = 'metabolites',
                                       verbose = False)

metabolites,units = mbs.standardize(metabolites_raw,
                                    reduce = False,
                                    verbose=False)

# metabolites_columns = mbs.check_columns(metabolites_raw,verbose = False)
# metabolites_units_check = mbs.check_units(metabolites_raw,verbose = False)
# metabolites_pure = mbs.check_values(metabolites_raw,verbose = False)

#display(metabolites)

## Metabolites Concentration Analysis

### Calculating total concentration and counts of metabolites per sample

In [None]:
metabolites_total = mbs.total_concentration(metabolites,
                                            name_list = 'all',
                                            include_as = False,    
                                            # verbose = True
                                            )

Note that the calculated values are not added to the data frame, because all columns (except settings) are used in the further analysis. If the total concentration were included as a column, the total count would be overestimated by 1.
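
The overestimation can be reproduced with a small `pandas` sketch. It assumes that a metabolite counts as present when its concentration is above zero, which may differ from the rule used inside `mibiscreen`; the column names and values are made up.

```python
import pandas as pd

# Made-up concentrations for three samples and two hypothetical metabolites.
demo = pd.DataFrame({
    "metabolite_a": [1.5, 0.3, 2.0],
    "metabolite_b": [0.0, 0.0, 4.0],
})

# Assumed counting rule: number of columns with a concentration above zero.
count = (demo > 0).sum(axis=1)

# Total concentration per sample.
total = demo.sum(axis=1)

# If the total is stored as a column first, it is counted like a metabolite.
demo_with_total = demo.assign(total=total)
inflated_count = (demo_with_total > 0).sum(axis=1)

print((inflated_count - count).tolist())
```

Every sample with at least one detected metabolite gains one extra count, so the totals are best kept out of the data frame until the count has been computed.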

In [None]:
metabolites_count = mbs.total_count(metabolites,
                                    name_list = 'all', 
                                    include_as = False,    
                                    #verbose = True
                                    )


The previous two steps are equivalent to:

In [None]:
mbs.total_metabolites_concentration(metabolites,
                                    include = True,
                                    verbose = False)

The results are now included as a separate column in the *metabolites* data frame:

In [None]:
print("Names of last two columns", metabolites.columns[-2:].values)

In [None]:
mbs.total_metabolites_count(metabolites.iloc[:,:-1],
                            include = True,
                            verbose = False)


In [None]:
print("Names of last two columns", metabolites.columns[-2:].values)

In [None]:
display(metabolites_total)

In [None]:
mbs.total_metabolites_count(metabolites,
                            include = True,
                            verbose = True)



### Plotting total concentration of metabolites

In [None]:
mbs.contaminants_bar(metabolites,
                     list_contaminants = ['metabolites_concentration'],
                     list_labels = ['all metabolites'],
                     figsize = [18,5],
                     textsize = 14,
                     ylabel = r'Total metabolites concentration [$\mu$g/l]',
                     loc='upper left',
                     title_text = 'Total concentration of metabolites per sample',
                     )

Producing the same plot directly with `matplotlib` and `numpy` allows individual adaptations:

In [None]:
plt.figure(figsize = [18,5])
plt.bar(np.arange(len(metabolites_total.values)),metabolites_total.values,label='all')
plt.xlabel('Samples')
plt.ylabel(r'Total metabolites concentration [$\mu$g/l]')
plt.title('Total concentration of metabolites per sample')

**Plot with sorting of values and sample names**

In [None]:
mbs.contaminants_bar(metabolites,
                     list_contaminants = ['metabolites_concentration'],
                     list_labels = ['all metabolites'],
                     sort = True,
                     name_sample = True,
                     figsize = [18,5],
                     textsize = 12,
                     ylabel = r'Total metabolites concentration [$\mu$g/l]',
                     loc='upper left',
                     title_text = 'Total concentration of metabolites per sample',
                     xtick_autorotate = True,
                     # save_fig = 'metabolites_bar.png',
                     )

Producing the same plot directly with `matplotlib` and `numpy` allows individual adaptations:

In [None]:
fig = plt.figure(figsize = [18,5])
sort_args = np.argsort(metabolites['metabolites_concentration'].values)
plt.bar(metabolites.sample_nr.values[sort_args],metabolites['metabolites_concentration'].values[sort_args],label='all')
plt.xlabel('Samples')
plt.ylabel(r'Total metabolites concentration [$\mu$g/l]')
plt.title('Total concentration of metabolites per sample')
fig.autofmt_xdate(bottom=0.2, rotation=30, ha='right', which='major')

### Plotting total counts of metabolites

In [None]:
mbs.contaminants_bar(metabolites,
                     list_contaminants = ['metabolites_count'],
                     list_labels = ['total metabolites count'],
                     sort = True,
                     name_sample = True,
                     figsize = [18,5],
                     textsize = 12,
                     ylabel = r'Total metabolites count',
                     loc='upper left',
                     title_text = 'Total count of metabolites per sample',
                     xtick_autorotate = True,
                     # save_fig = 'metabolites_count_bar.png',
                     )

## Relating metabolite activity to electron availability

#### Load data of contaminants and geochemicals to identify electron availability and NA activity color

In [None]:
contaminants_raw,_ = mbs.load_excel(file_path,
                                    sheet_name = 'contaminants',
                                    verbose = False)
contaminants,units = mbs.standardize(contaminants_raw,verbose = False)
environment_raw,_ = mbs.load_excel(file_path,
                                   sheet_name = 'environment',
                                   verbose = False)
environment,units = mbs.standardize(environment_raw,verbose = False)

#### Enrich data with the analysis results needed

In [None]:
mbs.total_contaminant_concentration(contaminants,include = True)
mbs.total_metabolites_count(metabolites,include = True)
data_NA = mbs.merge_data([environment,contaminants,metabolites])
mbs.sample_NA_traffic(data_NA,include = True)
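
Conceptually, combining the three tables can be pictured as joining them on a shared sample identifier. The sketch below uses plain `pandas` and is only an illustration under that assumption; it does not show how `mbs.merge_data` is implemented. The `*_demo` frames, their values and the column `total_contaminants` are hypothetical.

```python
import pandas as pd

# Made-up demo tables sharing the sample identifier column 'sample_nr'.
env_demo = pd.DataFrame({"sample_nr": ["S1", "S2"], "pH": [7.1, 6.8]})
cont_demo = pd.DataFrame({"sample_nr": ["S1", "S2"],
                          "total_contaminants": [120.0, 4500.0]})
meta_demo = pd.DataFrame({"sample_nr": ["S2", "S1"],
                          "metabolites_count": [5, 2]})

# Join on the sample identifier so that each row describes one sample,
# regardless of the row order in the individual tables.
merged_demo = (env_demo
               .merge(cont_demo, on="sample_nr")
               .merge(meta_demo, on="sample_nr"))
print(merged_demo)
```

Joining on the identifier rather than on row position keeps measurements of the same sample together even when the sheets list the samples in a different order.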

#### Activity plot linking contaminant concentration to metabolite occurrence and NA screening

In [None]:
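data_activity = mbs.activity_data_prep(data_NA)
# Scale total contaminant concentration by 0.001
# (presumably from ug/L to mg/L, matching the x-axis label of the plot)
data_activity['tot_cont'] = data_activity['tot_cont']*0.001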

In [None]:
#fig, ax = mbs.activity_plot(data_activity)
fig, ax = mbs.activity_plot(data_activity,
                       figsize = [6,4],
                       textsize = 12,
                       xscale = 'log',
                       markersize = 60,
                       loc='center right',
                       xlabel = r"Concentration contaminants [mg/L]",
                       #save_fig = 'activity.png'
                       )