## Basic stats on the genetic basis of metabolite levels in the SxY cross

Including:
- \# metabolites/timepoints significantly different between parental strains
- transgressive segregation
- directional genetics

In [2]:
.libPaths("~/R/x86_64-redhat-linux-gnu-library/3.2/")
# config opts and libraries
options(repr.plot.width = 6)
options(repr.plot.height = 5)
library(ggplot2);
library(plyr);
library(dplyr);
library(reshape2);
library(LSD);
library(qtl);
library(pheatmap);
library(parallel);
options(mc.cores = 24);
library(stringr);
library(RColorBrewer);

### Metabolites/timepoints significantly different between parental strains

How many metabolites (and timepoints) differ significantly between the two parental strains?

As in [Breunig JS, Hackett SR, Rabinowitz JD, Kruglyak L (2014) Genetic Basis of Metabolome Variation in Yeast. PLoS Genet 10(3): e1004142](http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004142)

In [151]:
# load parental data
endo_f = "/g/steinmetz/project/GenPhen/data/endometabolome/data/endometabolite_full_12102015.rda"
load(endo_f)

In [152]:
head(parents)

| | strain | parent | time | metabolite | endo_quant | endo_quant_log | exo_quant | exo_quant_log | cellconc_1.ml | biovolume_ul.ml | singlecellvol_fl | endo_quant_log_normalized | exo_quant_log_normalized | endo_quant_rel | exo_quant_rel | endo_rate | exo_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | S1 | S288c | 16 | CIT | 3844.5 | 11.90858 | 142.43 | 7.154109 | 13763600 | 0.794721 | 57.74 | 12.2062 | 7.049826 | 1.0 | 1.0 | | |
| 2 | S1 | S288c | 17 | CIT | 2575.07 | 11.3304 | 144.88 | 7.178715 | 20412000 | 1.128718 | 55.3 | 11.69755 | 7.075077 | 0.9583284 | 1.003582 | -0.5086517 | 0.02525106 |
| 3 | S1 | S288c | 18 | CIT | 2579.87 | 11.33308 | 139.83 | 7.12753 | 33698000 | 1.680631 | 49.87 | 11.521 | 6.951486 | 0.9438643 | 0.9860507 | -0.1765525 | -0.123591 |
| 4 | S1 | S288c | 19 | CIT | 2153.68 | 11.07259 | 151.57 | 7.24384 | 47319300 | 2.299066 | 48.59 | 11.28718 | 7.156562 | 0.9247086 | 1.01514 | -0.2338181 | 0.2050763 |
| 5 | S1 | S288c | 20 | CIT | 1609.98 | 10.65283 | 142.05 | 7.150255 | 72192000 | 3.173397 | 43.96 | 10.84166 | 7.475629 | 0.8882089 | 1.060399 | -0.4455226 | 0.3190664 |
| 6 | S2 | S288c | 16 | CIT | 3584.21 | 11.80744 | 143.05 | 7.160376 | 13303800 | 0.7479802 | 56.22 | 11.88108 | 7.04624 | 1.0 | 1.0 | | |


### How many metabolites are different in parental strains? (one-way ANOVA)

I am using a one-way ANOVA to be consistent with [Breunig JS et al 2014](http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004142), though it isn't clear to me why they use ANOVA rather than a t-test...
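
On the ANOVA versus t-test question: with exactly two groups, a one-way ANOVA and a two-sample t-test that assumes equal variances give the same p-value (the F statistic is the square of the t statistic). The sketch below illustrates this on simulated values; the toy data frame and its numbers are made up for illustration and are not taken from `parents`.

```r
# Toy data: two parental groups with simulated log-scale measurements
# (values are hypothetical, only the column names mirror the notebook)
set.seed(1)
toy = data.frame(
    parent = factor(rep(c("S288c", "YJM789"), each = 10)),
    endo_quant_log = c(rnorm(10, mean = 11), rnorm(10, mean = 12))
)

# p-value from a one-way ANOVA, extracted the same way as in the notebook
anova_p = summary(aov(endo_quant_log ~ parent, data = toy))[[1]][["Pr(>F)"]][1]

# p-value from a two-sample t-test with pooled (equal) variance
ttest_p = t.test(endo_quant_log ~ parent, data = toy, var.equal = TRUE)$p.value

# the two agree up to floating point error
print(all.equal(anova_p, ttest_p))
```

Note that R's default `t.test()` is the Welch test (`var.equal = FALSE`), which would generally give a slightly different p-value.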

In [155]:
parent_diff_anova = parents %>% group_by(metabolite) %>% do({
    thismetabolite = .$metabolite[1]
    # one-way ANOVA of log endometabolite level on parental strain, pooling all timepoints
    test = try({aov(formula = endo_quant_log ~ parent, data = .)}, silent = TRUE)
    if (!inherits(test, "try-error")) {
        anova_p = summary(test)[[1]][["Pr(>F)"]][1]
        return(data.frame(metabolite = thismetabolite, anova = anova_p))
    } else {
        # metabolites for which the model cannot be fit are dropped
        return(data.frame())
    }
})
#replications(formula = endo_quant_log ~ parent, data = tmp)
# Benjamini-Hochberg correction across metabolites
parent_diff_anova$anova_BH = p.adjust(parent_diff_anova$anova, method = "BH")
# restrict to metabolites in segregant study
common_m = intersect(levels(parent_diff_anova$metabolite),levels(endometabolite$metabolite))
parent_diff_anova = filter(parent_diff_anova, metabolite %in% common_m)

In [156]:
cat(sum(parent_diff_anova$anova_BH <= 0.05), "out of", 
      dim(parent_diff_anova)[1], "metabolites that were detected",
    "in both parental strains are different between S288c and YJM789")

16 out of 26 metabolites that were detected in both parental strains are different between S288c and YJM789
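
The count above thresholds p-values adjusted with `p.adjust(..., method = "BH")`. As a minimal sketch of what that adjustment does, the hand-written function below (the name `bh_manual` and the toy p-values are hypothetical) reproduces the Benjamini-Hochberg adjustment: each p-value is scaled by n divided by its rank, and monotonicity is enforced from the largest p-value downwards.

```r
# Made-up p-values, for illustration only
p = c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216)

# Hypothetical helper: Benjamini-Hochberg adjusted p-values (no NA handling)
bh_manual = function(p) {
    n = length(p)
    decreasing = order(p, decreasing = TRUE)  # indices from largest to smallest p
    restore = order(decreasing)               # maps back to the input order
    # scale by n / rank, then take the running minimum from the largest p down
    pmin(1, cummin(n / seq(n, 1) * p[decreasing]))[restore]
}

adjusted = bh_manual(p)

# agrees with the built-in implementation
print(all.equal(adjusted, p.adjust(p, method = "BH")))

# number of tests called significant at a 5% false discovery rate
n_significant = sum(adjusted <= 0.05)
print(n_significant)
```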

### How many metabolites are different in parental strains in at least 1 timepoint? (one-way ANOVA)

In [157]:
parent_diff_anova_pertime = parents %>% group_by(metabolite, time) %>% do({
    thismetabolite = .$metabolite[1]
    thistime = .$time[1]
    # one-way ANOVA of log endometabolite level on parental strain, within one timepoint
    test = try({aov(formula = endo_quant_log ~ parent, data = .)}, silent = TRUE)
    if (!inherits(test, "try-error")) {
        anova_p = summary(test)[[1]][["Pr(>F)"]][1]
        return(data.frame(metabolite = thismetabolite, time = thistime, anova = anova_p))
    } else {
        # metabolite/timepoint combinations for which the model cannot be fit are dropped
        return(data.frame())
    }
})
#replications(formula = endo_quant_log ~ parent, data = tmp)
# Benjamini-Hochberg correction across all metabolite/timepoint tests
parent_diff_anova_pertime$anova_BH = p.adjust(parent_diff_anova_pertime$anova, method = "BH")
# restrict to metabolites in segregant study
common_m = intersect(levels(parent_diff_anova_pertime$metabolite),
                     levels(endometabolite$metabolite))
parent_diff_anova_pertime = filter(parent_diff_anova_pertime, metabolite %in% common_m)

In [158]:
parent_diff_anova_pertime_summary = parent_diff_anova_pertime %>% group_by(metabolite) %>% 
    summarise(sigtimepoint = sum(anova_BH <= 0.05))

cat(sum(parent_diff_anova_pertime_summary$sigtimepoint > 0), "out of", 
      dim(parent_diff_anova_pertime_summary)[1], "metabolites that were detected",
    "in both parental strains are different between S288c and YJM789 at at least 1 timepoint")

25 out of 26 metabolites that were detected in both parental strains are different between S288c and YJM789 at at least 1 timepoint

### \# of metabolites for which an mQTL is detected

In [159]:
# load genotype and markers files
genotype_f = "/g/steinmetz/brooks/yeast/genomes/S288CxYJM789/genotypes_S288c_R64.rda"
load(genotype_f)

In [160]:
# load normalized QTLs
load("/g/steinmetz/brooks/genphen//metabolome/qtls/mQTLs_comball_funqtl_2014.rda")

In [170]:
type = "mlod"
co = .1 # genome-wide significance level for the permutation-based LOD threshold
bayesint_prob = .95 # coverage of the Bayesian credible interval around each QTL

# normalized data
data = mQTLs_funqtl_2014
# NOTE: seqnames(), renameSeqlevels() and keepSeqlevels() assume that mrk is a
# GRanges object and that GenomicRanges/GenomeInfoDb are attached
qtls_norm = do.call(rbind, lapply(names(data), function(m) {
    o = try({
        # chromosomes with at least one LOD score at or above the threshold
        lod_threshold = summary(data[[m]]$permout[, type], co)[1]
        chrs = unique(data[[m]]$qtls_alt[data[[m]]$qtls_alt[, type] >= lod_threshold, "chr"])
        chrs = levels(chrs)[chrs]
        lodcolumn = if (type == "mlod") { 2 } else { 1 }
        qtl_intervals = list()
        for (chr in chrs) {
            # markers spanning the credible interval on this chromosome
            interval = try(
                mrk[rownames(bayesint(data[[m]]$qtls_alt,
                                      chr = str_pad(chr, 2, pad = "0"),
                                      prob = bayesint_prob, lodcolumn = lodcolumn))],
                silent = TRUE)
            if (!inherits(interval, "try-error")) {
                # convert the numeric chromosome suffix to a roman numeral,
                # then collapse the markers to a single range
                nn = sapply(as.character(seqnames(interval)), function(s) {
                    paste(substr(s, 1, 3), as.roman(substr(s, 4, 5)), sep = "")
                })
                interval = renameSeqlevels(interval, nn)
                interval = keepSeqlevels(interval, unique(nn))
                qtl_intervals[[chr]] = range(interval)
            }
        }
        if (length(qtl_intervals) > 1) {
            qtl_df = do.call(rbind, qtl_intervals)
        } else {
            qtl_df = as.data.frame(qtl_intervals[[1]])
        }
        qtl_df = cbind(metabolite = m, qtl_df)
    })
    # metabolites without a significant QTL (or with any error) contribute no rows
    if (!inherits(o, "try-error")) {
        return(o)
    } else {
        return(NULL)
    }
}))
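
The thresholding step in the cell above relies on `summary()` of the permutation results. As a rough, package-free sketch of the underlying idea (not the R/qtl implementation itself), a permutation threshold at level `alpha` can be taken as the `1 - alpha` quantile of the per-permutation maximum LOD scores. All names and numbers below (`perm_max_lod`, `observed_peak_lod`, the simulated null) are hypothetical.

```r
set.seed(42)
# Toy null distribution: for each of 1000 permutations, the maximum LOD score
# over 50 independent markers (simulated, not real permutation output)
perm_max_lod = replicate(1000, max(rchisq(50, df = 1)) / (2 * log(10)))

alpha = 0.1
# LOD threshold exceeded by at most a fraction alpha of the permutations
lod_threshold = quantile(perm_max_lod, probs = 1 - alpha, names = FALSE)

# Hypothetical peak LOD scores per chromosome for one metabolite
observed_peak_lod = c(chr01 = 0.3, chr05 = 4.5, chr12 = 0.9)

# chromosomes whose peak reaches the threshold
significant_chrs = names(observed_peak_lod)[observed_peak_lod >= lod_threshold]
print(lod_threshold)
print(significant_chrs)
```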

In [171]:
cat("A QTL is identified for", sum(qtls_norm$metabolite %in% common_m), 
    "metabolites out of", sum(parent_diff_anova$anova_BH <= 0.05), 
    "with a difference between parental strains")

A QTL is identified for 10 metabolites out of 16 with a difference between parental strains
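
One caveat when reading this count: `qtls_norm` holds one row per QTL interval, so a metabolite with intervals on several chromosomes contributes several rows, and `common_m` contains all shared metabolites rather than only the significant ones. A minimal sketch on toy data (the data frames and all metabolite names other than CIT are hypothetical) of counting distinct metabolites restricted to those that differ between the parents:

```r
# Toy QTL table: one row per QTL interval, CIT has two intervals
qtls_toy = data.frame(
    metabolite = c("CIT", "CIT", "MAL", "SUC"),
    seqnames = c("chrII", "chrXIII", "chrIV", "chrVII")
)
# Toy ANOVA results with BH-adjusted p-values
anova_toy = data.frame(
    metabolite = c("CIT", "MAL", "SUC", "PYR"),
    anova_BH = c(0.001, 0.03, 0.40, 0.01)
)

# metabolites that differ between the parental strains at 5% FDR
sig_m = as.character(anova_toy$metabolite[anova_toy$anova_BH <= 0.05])

# counting rows counts CIT twice
n_rows = sum(qtls_toy$metabolite %in% sig_m)

# counting distinct metabolites counts it once
has_qtl = qtls_toy$metabolite[qtls_toy$metabolite %in% sig_m]
n_metabolites = length(unique(has_qtl))

cat("A QTL is identified for", n_metabolites, "metabolites out of",
    length(sig_m), "with a difference between parental strains\n")
```

Whether the two counts differ on the real data depends on how many metabolites have more than one interval in `qtls_norm`.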

### Transgressive segregation

Range of phenotype in the segregants significantly exceeds that spanned by the parent strains

*Calculated according to [Brem et al 2005](http://www.pnas.org/content/102/5/1572)*
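
A minimal sketch of one way to test for transgressive segregation on a single trait, on simulated values. It is an illustration under stated assumptions, not a reproduction of the published procedure: the cutoff of 2 pooled parental standard deviations (`k`), the simulated null, and every variable name here are assumptions.

```r
set.seed(7)
# Simulated replicate measurements for the two parents and 100 segregants
parent_a = rnorm(5, mean = 10.0, sd = 0.2)
parent_b = rnorm(5, mean = 10.5, sd = 0.2)
segregants = rnorm(100, mean = 10.25, sd = 0.8)

# pooled standard deviation of the parental replicates
pooled_sd = sqrt(((length(parent_a) - 1) * var(parent_a) +
                  (length(parent_b) - 1) * var(parent_b)) /
                 (length(parent_a) + length(parent_b) - 2))

# assumed cutoff: k pooled SDs beyond the more extreme parental mean
k = 2
upper = max(mean(parent_a), mean(parent_b)) + k * pooled_sd
lower = min(mean(parent_a), mean(parent_b)) - k * pooled_sd
n_transgressive = sum(segregants > upper | segregants < lower)

# assumed null: every segregant resembles one parent, with parental noise only
null_counts = replicate(1000, {
    parental_mean = sample(c(mean(parent_a), mean(parent_b)),
                           length(segregants), replace = TRUE)
    simulated = rnorm(length(segregants), mean = parental_mean, sd = pooled_sd)
    sum(simulated > upper | simulated < lower)
})

# empirical p-value for observing at least this many transgressive segregants
p_value = (1 + sum(null_counts >= n_transgressive)) / (1 + length(null_counts))
cat(n_transgressive, "transgressive segregants, empirical p =", p_value, "\n")
```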

### Directional genetics

Range of phenotype in the segregants is intermediate between the parent strains
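
As a rough sketch of the definition above, the snippet below checks how many segregants fall between the two parental means for a single trait. The parental means and segregant values are simulated, and using the parental means as bounds is an assumption made for illustration.

```r
set.seed(3)
# Hypothetical parental means and simulated segregant values for one metabolite
parent_means = c(S288c = 10.0, YJM789 = 12.0)
segregants = rnorm(100, mean = 11, sd = 0.2)

# segregants lying between the two parental means
within_parents = segregants >= min(parent_means) & segregants <= max(parent_means)

# fraction of segregants that are intermediate
frac_intermediate = mean(within_parents)

# TRUE if the whole segregant range is contained in the parental range
range_is_intermediate = all(within_parents)
print(frac_intermediate)
print(range_is_intermediate)
```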