## Introduction

As mentioned in the last notebook, it is necessary to inspect your data before analysis. For example, are you sure you know which file delimiter was used to create the file? Should the data be transposed? Data preprocessing steps also include inspecting the covariates and plotting data to search for outliers that may indicate measurement (or data entry) errors. 
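One quick way to check a delimiter is to count the fields on each line under a few candidate separators. The sketch below uses a small temporary file with made-up values (not the real glucosinolate file), and the object names are only illustrative. The correct delimiter should give the same field count, greater than one, on every line.

```r
# Write a tiny tab-delimited file to a temporary location (toy data, not the real file)
tmpFile <- tempfile(fileext=".txt");
writeLines(c("accession_id\tsample_weight\tG2P",
             "1\t27.0\t1391",
             "2\t15.2\t8268"), tmpFile);

# Count the fields on each line under a few candidate delimiters
candidates <- c(tab="\t", comma=",", semicolon=";");
fieldCounts <- sapply(candidates, function(sep) count.fields(tmpFile, sep=sep));

# One column per candidate; a constant value > 1 suggests the right delimiter
fieldCounts;
```

If the counts differ between lines for every candidate, the file may contain ragged rows or quoted fields and deserves a closer look before it is read in.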

We also discussed steps that can be used to determine whether the data fit classical statistical assumptions. This notebook continues in this area. In particular, we introduce generalized linear models (GLMs), which can be used to analyze non-normal data, such as count data or percentages.

For this notebook, we will use the glucosinolate data that we downloaded earlier.

### Load the glucosinolate data

In [1]:
# rm(list=ls());
# open the glucosinolate file in R
# same file as before...
glucosinolateFileName <- "data/cmeyer_glucs2015/bmeyer_etal.txt";  
glucs <- read.table(glucosinolateFileName, header=TRUE, sep="\t", as.is=TRUE, stringsAsFactors=FALSE);
glucs <- glucs[order(glucs[,"accession_id"]),];

In [2]:
# what's in the working environment?
ls();

### What is the statistical model?

During today's notebook, we will discuss *Poisson* family analyses. Poisson models (and their derivatives) are used to analyze count data, which are very common.

Examples:
  - The number of microbial species in each of our gut microbiomes 
  - The number of people that die from hippo attacks each year
  - The number of reads assigned to a gene in an RNA-Seq dataset
 
The glucosinolate data can also be treated as count data, once the measurements are rounded to integers.
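For reference, the Poisson distribution has a single parameter, written here as $\lambda$ (the symbols $Y$ and $k$ are notation introduced for this note):

$$
P(Y = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots, \qquad \mathrm{E}[Y] = \mathrm{Var}(Y) = \lambda
$$

Because the mean and the variance are tied together, real count data are often more variable than a plain Poisson model allows. That is the usual motivation for related models such as the quasi-Poisson and the negative binomial.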

In [3]:
str(glucs)

'data.frame':	2199 obs. of  24 variables:
 $ accession_id : int  1 1 2 2 2 2 4 4 4 6 ...
 $ sample_weight: num  27 22.1 15.2 29.9 31 30.9 17.8 24.6 29.5 18.3 ...
 $ G2P          : num  1391 7735 8268 1411 1055 ...
 $ G3B          : num  103709 69932 144750 27740 82118 ...
 $ G3HP         : num  35.9 0 202.1 0 88.7 ...
 $ G4P          : num  5248 2501 5546 117 2590 ...
 $ G2H3B        : num  25310 24289 78818 24818 29056 ...
 $ G4HB         : num  0 0 7.14 0 32.25 ...
 $ G2H4P        : num  656 390 1967 435 1214 ...
 $ G3MTP        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ G4MTB        : num  322 0 0 0 323 ...
 $ G3MSP        : num  0 0 0 0 0 ...
 $ G5MTP        : num  0 0 0 0 80.3 ...
 $ G4MSB        : num  0 68.6 87 0 22.5 ...
 $ G6MTH        : num  3611 117 412 2256 8807 ...
 $ G5MSP        : num  32.3 56.1 76.7 0 227.5 ...
 $ G7MTH        : num  2951.5 0 0 46.9 3107 ...
 $ G6MSH        : num  96 248 1017 0 429 ...
 $ G8MTO        : num  8244 0 0 577 18943 ...
 $ G7MSH        : num  2142 140

In [4]:
head(glucs);

Unnamed: 0,accession_id,sample_weight,G2P,G3B,G3HP,G4P,G2H3B,G4HB,G2H4P,G3MTP,⋯,G6MTH,G5MSP,G7MTH,G6MSH,G8MTO,G7MSH,G3BOP,G8MSO,G4BOB,G5BOP
892,1,27.0,1391.3143,103708.84,35.89132,5248.4694,25310.19,0.0,656.2984,0,⋯,3611.0269,32.32934,2951.46171,96.03157,8243.8638,2142.1924,0.0,23523.238,1424.3284,0.0
1692,1,22.1,7734.5149,69932.37,0.0,2500.7568,24289.34,0.0,390.3385,0,⋯,116.8998,56.07342,0.0,247.84098,0.0,1401.4957,67.66343,27462.923,1789.0852,920.6735
683,2,15.2,8267.7205,144750.06,202.10448,5545.6431,78817.75,7.136438,1966.7867,0,⋯,411.9168,76.65779,0.0,1016.59783,0.0,11407.0228,0.0,102816.76,8169.7092,3357.24848
870,2,29.9,1411.4871,27739.7,0.0,116.5999,24817.58,0.0,435.3465,0,⋯,2256.3011,0.0,46.88039,0.0,577.0562,389.6314,0.0,7644.041,284.5212,0.0
887,2,31.0,1055.4795,82117.61,88.72688,2589.5366,29056.36,32.245193,1213.7386,0,⋯,8806.8255,227.51344,3107.01884,428.95742,18943.1403,2913.7202,0.0,36178.023,2610.6636,61.77065
973,2,30.9,967.1328,38721.42,0.0,880.6843,17378.24,34.532081,391.132,0,⋯,2292.279,0.0,810.27345,93.65825,2614.6889,970.8893,0.0,9763.23,493.1664,0.0


In [5]:
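# if lme4, ggplot2, and gridExtra aren't installed, install them, then load them
for( pkg in c("lme4", "ggplot2", "gridExtra") ){
    # require() returns FALSE (with a warning) if the package cannot be loaded
    if( !require(pkg, character.only=TRUE) ){
        install.packages(pkg);
        # a freshly installed package still has to be attached
        library(pkg, character.only=TRUE);
    }
}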


Loading required package: lme4
“package ‘lme4’ was built under R version 3.4.4”
Loading required package: Matrix
Loading required package: ggplot2
Loading required package: gridExtra


### Replicates in the data

As noted earlier, the glucosinolate data were generated with *Arabidopsis thaliana*, one of the main model species used by plant geneticists, in part because it is self-compatible. This means that replicated inbred lines can be used in experiments to reduce experimental noise. Importantly, the mean phenotype of an inbred line can typically be measured with higher precision than that of an outbred line.

### Estimate the mean phenotype per genotype

The speed of a GWAS is affected by the sample size and the number of SNPs, among other factors. To speed up analyses, researchers working with repeated-measures data (e.g. longitudinal analyses) or inbred lines often reduce the data to a single value per sample or individual. One approach, as illustrated in the previous notebook, is to simply average the phenotypic data per individual. However, this is not ideal: the uncertainty in each mean is ignored, so the errors propagate when analyzing the "means of means".
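As a small illustration of the simple-averaging approach, the sketch below uses a made-up data frame (`toy`, not the real glucosinolate data) with unequal numbers of replicates per accession. It shows one reason the approach is not ideal: after averaging, every accession carries the same weight, however many replicates went into its mean.

```r
# Toy data: three accessions with unequal numbers of replicates (made-up values)
toy <- data.frame(accession_id=c(1, 1, 2, 2, 2, 4),
                  G2P=c(10, 14, 3, 5, 7, 20));

# One mean per accession
accessionMeans <- aggregate(G2P ~ accession_id, data=toy, FUN=mean);
accessionMeans;

# The mean of the per-accession means differs from the mean of all observations,
# because accession 4 (one replicate) now counts as much as accession 2 (three replicates)
mean(accessionMeans$G2P);  # mean of means
mean(toy$G2P);             # mean of all observations
```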

In [6]:
# recall, there is more than one observation per inbred line
counts <- table(glucs$accession_id);
cat("The distribution of the number of replicates per accession\n");
table(counts);

# earlier, we visualized this as a barplot:
# ggplot() + aes(counts) + 
#         xlab( paste0( "The number of replicates" )) + ylab("Counts") +
#         geom_bar(stat="count", fill="tan1") + 
#         geom_vline(aes(xintercept=mean(counts)), linetype=3);

The distribution of the number of replicates per accession


counts
  2   3   4   5   6   7   8 
148 154 131  84  53  21   4 

But how should we extract the mean phenotype per individual?

Recall that, earlier, we used linear mixed models to fit BLUPs:

In [7]:
# previously, we used a mixed model to specify a random effect with the following code:
lmer0 <- lmer( G2P ~ 1 + (1|accession_id), data=glucs );
linearBlups <- ranef(lmer0)$accession_id; # these are the best linear unbiased predictors (BLUPs)
linearBlups <- data.frame( accession_id=rownames(linearBlups), linear_blup=linearBlups[,1], stringsAsFactors=FALSE );
head(linearBlups);

accession_id,linear_blup
1,-5271.448
2,-8390.965
4,-8618.79
6,-9610.424
7,-7955.538
8,-9497.424


This is a standard approach with mixed models. However, you'll remember that these are not really *normally distributed* data.

Indeed, a lot of the data that biologists work with are non-normal, for example:
  - number of offspring 
  - infection rates
  - survival/death
  - gene expression data
  - metabolomic data

In general, it is appropriate to analyze such data with generalized linear models (GLMs). In the case of *count* data, it is typical to use *Poisson* (or Poisson-family) models. An easy rule of thumb is that if the data are all non-negative integers with no natural upper bound, then a Poisson-family model is usually the place to start.
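The rule of thumb can be turned into a quick check. The helper below, `looksLikeCounts`, is a hypothetical function written for this note (it is not part of any package), and it only tests the "non-negative integers" part of the rule. Whether there is a natural upper bound is something you have to judge from how the data were collected.

```r
# Hypothetical helper: TRUE if every non-missing value is a non-negative whole number
looksLikeCounts <- function(x){
    x <- x[!is.na(x)];
    is.numeric(x) && all(x >= 0) && all(x == round(x));
}

looksLikeCounts(c(0, 3, 12));    # TRUE
looksLikeCounts(c(0.5, 3, 12));  # FALSE: not whole numbers
looksLikeCounts(c(-1, 3, 12));   # FALSE: contains a negative value
```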


<!--# however, these are count data (see the output from head above), which suggests 
# that a Poisson-family model should be used; let's try a simple quasi-Poisson GLM:
glm0 <- glm( G2P ~ 1, data=glucs, family="quasipoisson" );
glm0.res <- residuals( glm0 );
glm0.res[1:10];

# note: the names are no longer the accession Ids, but the row names from the glucs data frame. Do you know why?
# let's use a workaround:
################################################################################
## my version of stack, which avoids factor generation
################################################################################
mstack <- function(arg, newHeaders, setRowNames=T, sorted=TRUE, decreasing=F){
    values <- data.frame(names=I(names(arg)), values=as.numeric(arg));

    if( setRowNames ){
        rownames(values) <- values[,"names"];
    }
    
    if( sorted ) {
        values <- values[order(values[,"values"], decreasing=decreasing),];	
    }
    
    colnames(values) <- newHeaders;
    return(values);
}

glm0.res <- mstack( glm0.res, newHeaders=c("row_id", "residual"), sorted=FALSE );
head(glm0.res);-->

In [8]:
# the estimates from Meyer et al., however, are areas under the curve, estimated from a triple quadrupole (QQQ) mass spectrometer.
# these can be rounded to integers with a negligible loss of precision
glucs[,3:ncol(glucs)] <- round(glucs[,3:ncol(glucs)]);
glucs[1:5,];

Unnamed: 0,accession_id,sample_weight,G2P,G3B,G3HP,G4P,G2H3B,G4HB,G2H4P,G3MTP,⋯,G6MTH,G5MSP,G7MTH,G6MSH,G8MTO,G7MSH,G3BOP,G8MSO,G4BOB,G5BOP
892,1,27.0,1391,103709,36,5248,25310,0,656,0,⋯,3611,32,2951,96,8244,2142,0,23523,1424,0
1692,1,22.1,7735,69932,0,2501,24289,0,390,0,⋯,117,56,0,248,0,1401,68,27463,1789,921
683,2,15.2,8268,144750,202,5546,78818,7,1967,0,⋯,412,77,0,1017,0,11407,0,102817,8170,3357
870,2,29.9,1411,27740,0,117,24818,0,435,0,⋯,2256,0,47,0,577,390,0,7644,285,0
887,2,31.0,1055,82118,89,2590,29056,32,1214,0,⋯,8807,228,3107,429,18943,2914,0,36178,2611,62


In [9]:
# now, these integers can be fit with GLMs/GLMMs:
glmer0 <- glmer( G2P ~ 1 + (1|accession_id), data=glucs, family="poisson" );
glmBlups <- ranef(glmer0)$accession_id; # these are the conditional modes of the random effects (the GLMM analogue of BLUPs), on the log scale
glmBlups <- data.frame( accession_id=rownames(glmBlups), pois_blup=glmBlups[,1], stringsAsFactors=FALSE );
head(glmBlups);

accession_id,pois_blup
1,0.229793162
2,-0.214801548
4,-1.864293949
6,-1.431712033
7,0.009484085
8,-1.232474893


For Poisson/negative binomial GLMs, the distribution of these random effects (or of the residuals, if a GLM is used) will tend to be skewed, which is to be expected. For larger counts (and larger $\lambda$), the residuals will tend toward normality.
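The link between $\lambda$ and skewness can be checked directly from the Poisson probabilities. The sketch below computes the skewness of the Poisson distribution itself (not of model residuals) for a few values of $\lambda$. The function `poissonSkewness` and its `maxCount` cutoff are illustrative choices made for this note.

```r
# Skewness of a Poisson distribution, computed from its probability mass function.
# maxCount truncates the (infinite) support; it must sit far above lambda.
poissonSkewness <- function(lambda, maxCount=1000){
    x <- 0:maxCount;
    p <- dpois(x, lambda);
    mu <- sum(x * p);                  # mean
    sigma2 <- sum((x - mu)^2 * p);     # variance
    sum((x - mu)^3 * p) / sigma2^1.5;  # third standardized moment
}

lambdas <- c(1, 10, 100);
skews <- sapply(lambdas, poissonSkewness);

# The skewness shrinks as lambda grows, matching the known value 1/sqrt(lambda)
data.frame(lambda=lambdas, skewness=skews, expected=1/sqrt(lambdas));
```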

### Poisson-family models allow offsets

When the counts from a Poisson process should be modeled as a rate, an offset should be included in the model. Specifying the exposure in this way lets you express the counts relative to the *effort* that was necessary to collect the data. For example, if you detect 25,000 counts for G2P in a sample with a high background on a mass spec (the instrument that Meyer et al. used to measure glucosinolates), then you should take that higher *ion count* into **account** when comparing that sample with samples that have a lower background.

As another example, if two botanists count the number of species at the botanical gardens and one botanist spends 24 hours counting species and the other 72 hours (assuming the same skill level), then you would typically expect the second botanist to count **more species**, given the extra effort.
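In symbols, with $y_i$ the count for observation $i$, $t_i$ its exposure (hours spent counting, or the total ion count), and $\beta_0$ the intercept (notation introduced for this note), modeling the rate with a log link gives:

$$
\log\left(\frac{\mathrm{E}[y_i]}{t_i}\right) = \beta_0
\quad\Longleftrightarrow\quad
\log \mathrm{E}[y_i] = \beta_0 + \log t_i
$$

The term $\log t_i$ is the offset: it enters the linear predictor with its coefficient fixed at 1 rather than estimated. This is why the exposure is supplied on the log scale.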

But how do you specify an offset?

In [10]:
# determine the total ion count
totalIonCounts <- rowSums( glucs[,-c(1,2)] );
head(totalIonCounts);

In [11]:
# now include the total ion count as an offset
# as long as glucs has not been re-sorted since totalIonCounts was computed, the vector lines up with the rows of glucs and can be used directly, without merging
glm0 <- glm( G2P ~ 1, offset(log(totalIonCounts)), data=glucs, family="quasipoisson");

ERROR: Error in glm(G2P ~ 1, offset(log(totalIonCounts)), data = glucs, family = "quasipoisson"): negative weights not allowed


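Wait a minute, negative weights? We never asked for weights. The culprit is argument matching: the offset was passed as an unnamed second argument, and because `family` and `data` were named, `glm` matched it to its `weights` parameter. An offset should instead be given as `offset=log(totalIonCounts)` or as an `offset(...)` term inside the formula. The error is still informative, though: in Poisson-family models the offset is expressed as a log, and the log of a count can only be negative if the count is zero (in which case it is negative infinity).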

In [12]:
which(log(totalIonCounts) < 0);
which(totalIonCounts == 0);
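For reference, `glm` accepts an offset either as an `offset(...)` term inside the formula or through the named `offset=` argument. The sketch below shows both forms on simulated toy data, loosely modeled on the botanist example. The variables `hours`, `nSpecies`, and `trueRate` are invented for this illustration.

```r
set.seed(1);

# Toy data (simulated): counts collected with unequal effort
hours <- c(rep(24, 50), rep(72, 50));   # time spent counting
trueRate <- 0.5;                        # expected count per hour (assumed for the simulation)
toy <- data.frame(hours=hours,
                  nSpecies=rpois(length(hours), lambda=trueRate * hours));

# Option 1: offset() as a term inside the formula
fitFormula <- glm(nSpecies ~ 1 + offset(log(hours)), data=toy, family="poisson");

# Option 2: the named offset argument (note the "offset=")
fitArgument <- glm(nSpecies ~ 1, offset=log(hours), data=toy, family="poisson");

# Both fits estimate the same rate per hour; back-transform from the log scale
exp(coef(fitFormula));
exp(coef(fitArgument));
```

The formula version has the practical advantage that the offset travels with the model formula, so it is harder to drop by accident when the model is refit or extended.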

### Quality control (data curation) is an ongoing process

There are samples in the data that have *no* ion counts. If this wasn't a simple mistake during data collection (e.g. failing to input the data into the file), then the samples could be rerun or dropped. We have a lot of data, so let's drop them.

In [13]:
dropouts <- which( totalIonCounts == 0 );
glucs2 <- glucs[-dropouts,];
dim(glucs2);

totalIonCounts <- rowSums( glucs2[,-c(1,2)] );
cat("The range of ion counts is now:", range(totalIonCounts), "\n");

The range of ion counts is now: 20 366902 
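A caveat on dropping rows with a negative index: it works when `which()` finds at least one match, but if nothing matches, `which()` returns an empty vector and the negative index then selects *zero* rows. The sketch below illustrates this on a made-up data frame (`toy`), together with a logical index that behaves the same whether or not there are matches.

```r
# Toy data (made-up values)
toy <- data.frame(id=1:4, total=c(5, 0, 12, 7));

# which() returns integer(0) when nothing matches...
noMatches <- which(toy$total < 0);

# ...and a negative index built from an empty vector selects zero rows
nrow(toy[-noMatches, ]);         # 0: every row is lost

# A logical index keeps exactly the rows that pass the filter
nrow(toy[!(toy$total < 0), ]);   # 4: nothing to drop, nothing dropped
nrow(toy[toy$total > 0, ]);      # 3: the row with a zero total is dropped
```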


In [21]:
# 20 is a small number, but the advantage of adding an offset is 
# that you can include as much data as possible, and let the model weight things appropriately
# glm0 <- glm( G2P ~ 1, offset=log(totalIonCounts), data=glucs2, family="quasipoisson");
# summary(glm0);

glmer0 <- glmer( G2P ~ 1 + offset(log(totalIonCounts)) + (1|accession_id), data=glucs2, control=glmerControl(optimizer="bobyqa", optCtrl=list(maxfun=2e5)), family="poisson" );
glmBlups <- ranef(glmer0)$accession_id; # these are the conditional modes of the random effects (the GLMM analogue of BLUPs), on the log scale
glmBlups <- data.frame( accession_id=rownames(glmBlups), pois_blup=glmBlups[,1], stringsAsFactors=FALSE );
tail(glmBlups);

Unnamed: 0,accession_id,pois_blup
590,9482,1.543241
591,9490,-3.82746
592,9496,-3.182933
593,9499,-1.803268
594,9504,-3.298541
595,100000,-1.102997
