# Statistical Testing for Metabolomics

In this module we review common statistical tests applied to metabolomics data, including a per-feature t-test, multiple testing correction, and a non-parametric alternative.

<a href="https://colab.research.google.com/drive/1skA1e0nS6qlHzMsP1IA44oPTmkWeyrep?usp=sharing" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind, mannwhitneyu
from sklearn.preprocessing import StandardScaler # for preprocessing data before PCA
from sklearn.decomposition import PCA

In [None]:
# Demo data are available at https://github.com/shuzhao-li/khipu/tree/main/testdata/
# Load the table with pandas and preview the first rows
ecoli = pd.read_table("../Datasets/ecoli_pos.tsv", index_col=0, sep="\t")
ecoli.head()

In [None]:
# t-test of each row (feature) between the 12C and 13C samples
# (column positions 3:6 and 6:9 of each row, selected by position with .iloc)
# data are log2-transformed so they are closer to normally distributed; +1 avoids log2(0)
def ttest(row):
    t, p = ttest_ind(np.log2(row.iloc[3:6] + 1), np.log2(row.iloc[6:9] + 1))
    return p

pvalues_featurelist = ecoli.apply(ttest, axis=1)
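
To make explicit what `ttest_ind` computes for a single row with its default equal-variance setting, here is a minimal sketch of the pooled-variance two-sample t statistic in plain Python. The intensity values and the helper name `student_t` are invented for illustration and are not taken from the E. coli table.

```python
import math
from statistics import mean, variance

def student_t(a, b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # weighted average of the two sample variances
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df

# hypothetical raw intensities for one feature, three samples per group
group_a = [1023.0, 2047.0, 4095.0]
group_b = [15.0, 31.0, 63.0]

# same transform as in the cell above: log2(x + 1)
log_a = [math.log2(x + 1) for x in group_a]
log_b = [math.log2(x + 1) for x in group_b]

t_stat, dof = student_t(log_a, log_b)
print(t_stat, dof)
```

The p-value is then obtained from the t distribution with `dof` degrees of freedom, which is the part `ttest_ind` handles for us.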


In [None]:
# sort p-values
pvalues_featurelist = pvalues_featurelist.sort_values()
pvalues_featurelist.head(10)

In [None]:
# how many features have p < 0.001
pvalues_featurelist[pvalues_featurelist<0.001].shape

In [None]:
# In many situations you will need to perform some form of multiple testing correction.

# Bonferroni correction is the most straightforward: multiply each p-value by the number of tests performed.

pvalues_featurelist_bonferroni = pvalues_featurelist * len(pvalues_featurelist)
pvalues_featurelist_bonferroni[pvalues_featurelist_bonferroni<0.001].shape
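
Multiplying by the number of tests can push adjusted values above 1, so Bonferroni-adjusted p-values are commonly reported capped at 1. The sketch below shows this on a few made-up p-values; the function name `bonferroni` and the numbers are assumptions for illustration only.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values, capped at 1."""
    m = len(pvalues)
    return [min(p * m, 1.0) for p in pvalues]

# hypothetical raw p-values
raw = [0.0001, 0.004, 0.03, 0.5]
adjusted = bonferroni(raw)
print(adjusted)

# thresholding adjusted p-values at alpha gives the same calls as
# comparing the raw p-values against alpha / m
alpha = 0.001
by_adjusted = [p < alpha for p in adjusted]
by_raw = [p < alpha / len(raw) for p in raw]
print(by_adjusted, by_raw)
```

Capping does not change which features pass a threshold below 1; it only keeps the reported values interpretable as probabilities.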

In [None]:
# FDR correction often preserves statistical power better than Bonferroni correction.
# Note that zero p-values are removed from the list before these procedures;
# the `x > 0` filter below also drops NaN p-values, because NaN > 0 evaluates to False.

from statsmodels.stats.multitest import fdrcorrection

significant, q_vals = fdrcorrection([x for x in pvalues_featurelist if x > 0], 0.001)
print("\n".join([str(x) for x in sorted(q_vals)][:10]))

# By default this performs a Benjamini-Hochberg correction; other variants of the correction
# can be selected through the `method` argument of fdrcorrection.
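
For readers who want to see what the Benjamini-Hochberg adjustment does, here is a minimal plain-Python sketch of the step-up procedure. It is meant as an illustration under simple assumptions (no NaN values, made-up p-values, a hypothetical helper name), not as a replacement for `fdrcorrection`.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(pvalues)
    # indices of the p-values, from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # walk from the largest p-value down, so that q-values never
    # increase as the p-values decrease
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

# hypothetical raw p-values
raw = [0.01, 0.04, 0.03, 0.005, 0.5]
q = benjamini_hochberg(raw)
print(q)
```

Each p-value is scaled by `m / rank` rather than by `m`, which is why the adjustment is less severe than Bonferroni for all but the smallest p-value.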

In [None]:
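# Alternatively, a non-parametric test such as the Mann-Whitney U test can be used.
# The log2 transform is kept for consistency with the t-test; being monotonic, it does not change the ranks.

def mwutest(row):
    u, p = mannwhitneyu(np.log2(row.iloc[3:6] + 1), np.log2(row.iloc[6:9] + 1))
    return p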

pvalues_featurelist_nonpara = ecoli.apply(mwutest, axis=1)

In [None]:
# sort p-values
pvalues_featurelist_nonpara = pvalues_featurelist_nonpara.sort_values()
pvalues_featurelist_nonpara.head(10)

In [None]:
# None of the p-values are significant. With only three samples per group, the smallest
# exact two-sided p-value the Mann-Whitney U test can produce is 2 / C(6, 3) = 0.1,
# so a significant result is impossible at conventional thresholds.
# This was intentional, to highlight the challenges that may arise when using non-parametric statistics.
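
The limit on attainable p-values depends only on the group sizes. The sketch below computes the smallest exact two-sided Mann-Whitney U p-value for a few balanced designs, assuming no ties and an exact (not normal-approximation) test; the helper name `min_two_sided_p` is introduced here for illustration.

```python
from math import comb

def min_two_sided_p(n1, n2):
    """Smallest exact two-sided Mann-Whitney U p-value, assuming no ties.

    The most extreme outcome is complete separation of the two groups,
    which is 1 of comb(n1 + n2, n1) equally likely rank assignments
    under the null; the two-sided test doubles that probability.
    """
    return min(1.0, 2 / comb(n1 + n2, n1))

for n in (3, 4, 5, 6, 7):
    print(n, min_two_sided_p(n, n))
```

Under these assumptions, at least seven samples per group are needed before a p-value below the 0.001 threshold used earlier becomes possible at all, and that is before any multiple testing correction.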