
Commit dc3cf44

committed
new notebook demonstrating overdispersion
1 parent aafe914 commit dc3cf44

File tree

1 file changed (+379, -0 lines)


Lecture-7-DataPreprocessing3.ipynb

Lines changed: 379 additions & 0 deletions
@@ -0,0 +1,379 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"## Introduction\n",
8+
"\n",
9+
"In the last notebook, we introduced generalize linear models (GLMs) that can be used to analyze non-normal data, such as count data or percentages. In general, GLMs tend to be much more powerful than linear models. However, there are a few statistical assumptions that you should be aware of...\n",
10+
"\n",
11+
"For this notebook, we will continue to use the glucosinolate data that we downloaded earlier.\n",
12+
"\n",
13+
"### Load the glucosinolate data"
14+
]
15+
},
16+
{
17+
"cell_type": "code",
18+
"execution_count": null,
19+
"metadata": {},
20+
"outputs": [],
21+
"source": [
22+
"# rm(list=ls());\n",
23+
"# open the glucosinolate file in R\n",
24+
"# same file as before...\n",
25+
"glucosinolateFileName <- \"data/cmeyer_glucs2015/bmeyer_etal.txt\"; \n",
26+
"glucs <- read.table(glucosinolateFileName, header=T, sep=\"\\t\", as.is=T, stringsAsFactors=FALSE); \n",
27+
"glucs <- glucs[order(glucs[,\"accession_id\"]),];\n",
28+
"glucs[,3:ncol(glucs)] <- round(glucs[,3:ncol(glucs)]);"
29+
]
30+
},
31+
{
32+
"cell_type": "code",
33+
"execution_count": null,
34+
"metadata": {},
35+
"outputs": [],
36+
"source": [
37+
"# what's in the working environment?\n",
38+
"ls();"
39+
]
40+
},
41+
{
42+
"cell_type": "markdown",
43+
"metadata": {},
44+
"source": [
45+
"### What is the statistical model?\n",
46+
"\n",
47+
"With this notebook, we will continue to discuss *Poisson* family analyses. Poisson models (and their derivatives) are used to analyze count data, which - as you will remember - are very common.\n",
48+
"\n",
49+
"Examples:\n",
50+
" - The number of insects within several $1m^2$ quadrats\n",
51+
" - The number of ions that hit a detector in a time interval\n",
52+
" - The number of meteorites that hit the Earth each year\n",
53+
"\n",
54+
" \n",
55+
"The glucosinolate data are also count data (i.e. the number of glucosinolate ions that hit the QQQ detector in a given time frame)."
56+
]
57+
},
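{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a quick illustration with simulated data (a minimal sketch, separate from the glucosinolate analysis):\n",
"# Poisson counts are non-negative integers whose mean and variance both equal lambda,\n",
"# so the sample mean and sample variance of simulated counts should be roughly equal\n",
"set.seed(1);\n",
"simCounts <- rpois(n=1000, lambda=5);\n",
"cat(\"mean:\", mean(simCounts), \" variance:\", var(simCounts), \"\\n\");"
]
},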
58+
{
59+
"cell_type": "code",
60+
"execution_count": null,
61+
"metadata": {},
62+
"outputs": [],
63+
"source": [
64+
"str(glucs)"
65+
]
66+
},
67+
{
68+
"cell_type": "code",
69+
"execution_count": null,
70+
"metadata": {},
71+
"outputs": [],
72+
"source": [
73+
"head(glucs);"
74+
]
75+
},
76+
{
77+
"cell_type": "code",
78+
"execution_count": null,
79+
"metadata": {},
80+
"outputs": [],
81+
"source": [
82+
"# if lme4, ggplot2, and gridExtra aren't installed, install them...\n",
83+
"if( !require(\"lme4\" )){ \n",
84+
" install.packages(\"lme4\"); \n",
85+
"}\n",
86+
"\n",
87+
"if( !require(\"ggplot2\" )){ \n",
88+
" install.packages(\"ggplot2\"); \n",
89+
"}"
90+
]
91+
},
92+
{
93+
"cell_type": "markdown",
94+
"metadata": {},
95+
"source": [
96+
"\n",
97+
"Recall that we used linear and generalized linear mixed models (i.e. fitted and random effects) to model BLUPs. As an example with a Poisson GLM:"
98+
]
99+
},
100+
{
101+
"cell_type": "code",
102+
"execution_count": null,
103+
"metadata": {},
104+
"outputs": [],
105+
"source": [
106+
"# previously, we used the following code to determine the total 'effort' or offset\n",
107+
"totalIonCounts <- rowSums( glucs[,-c(1,2)] );\n",
108+
"cat(\"The range of ion counts is now:\", range(totalIonCounts), \"\\n\");"
109+
]
110+
},
111+
{
112+
"cell_type": "code",
113+
"execution_count": null,
114+
"metadata": {},
115+
"outputs": [],
116+
"source": [
117+
"# the 0 indicates that there are missing data; get rid of those samples:\n",
118+
"dropouts <- which( totalIonCounts == 0 );\n",
119+
"glucs2 <- glucs[-dropouts,];\n",
120+
"dim(glucs2);\n",
121+
"\n",
122+
"totalIonCounts <- rowSums( glucs2[,-c(1,2)] );\n",
123+
"cat(\"The range of ion counts is now:\", range(totalIonCounts), \"\\n\");"
124+
]
125+
},
126+
{
127+
"cell_type": "code",
128+
"execution_count": null,
129+
"metadata": {},
130+
"outputs": [],
131+
"source": [
132+
"# we can now fit the mixed model:\n",
133+
"glmer0 <- glmer( G2P ~ 1 + offset(log(totalIonCounts)) + (1|accession_id), data=glucs2, control=glmerControl(optimizer=\"bobyqa\", optCtrl=list(maxfun=2e5)), family=\"poisson\" );\n",
134+
"glmBlups <- ranef(glmer0)$accession_id; # these are the best-linear unbiased predictors:\n",
135+
"glmBlups <- data.frame( accession_id=rownames(glmBlups), pois_blup=glmBlups[,1], stringsAsFactors=FALSE );\n",
136+
"tail(glmBlups);"
137+
]
138+
},
139+
{
140+
"cell_type": "markdown",
141+
"metadata": {},
142+
"source": [
143+
"### Model assumptions\n",
144+
"\n",
145+
"The Poisson distribution is used with non-negative integers that have no natural upper bound. However, there are other assumptions, including independence among events. For example, the number of people buying lunch each minute is not really Poisson distributed, because (a) people go to lunch together and (b) people eat at certain times (among other factors).\n",
146+
"\n",
147+
"Furthermore, the mean and variance of a Poisson random variable are expected to be equal and are described by the single parameter $\\lambda$. \n",
148+
"\n",
149+
"When the mean and variance are *not* equal, then the data are either underdispersed or overdispersed. Underdispersion (mean > variance) is rare compared to overdispersion (mean < variance) and is often ignored.\n",
150+
"\n",
151+
"<!--# however, these are count data (see the output from head above), which suggests \n",
152+
"# that a Poisson-family model should be used; let's try a simple quasi-Poisson GLM:\n",
153+
"glm0 <- glm( G2P ~ 1, data=glucs, family=\"quasipoisson\" );\n",
154+
"glm0.res <- residuals( glm0 );\n",
155+
"glm0.res[1:10];\n",
156+
"\n",
157+
"# note: the names are no longer the accession Ids, but the row names from the glucs data frame. Do you know why?\n",
158+
"# let's use a workaround:\n",
159+
"################################################################################\n",
160+
"## my version of stack, which avoids factor generation\n",
161+
"################################################################################\n",
162+
"mstack <- function(arg, newHeaders, setRowNames=T, sorted=TRUE, decreasing=F){\n",
163+
" values <- data.frame(names=I(names(arg)), values=as.numeric(arg));\n",
164+
"\n",
165+
" if( setRowNames ){\n",
166+
" rownames(values) <- values[,\"names\"];\n",
167+
" }\n",
168+
" \n",
169+
" if( sorted ) {\n",
170+
" values <- values[order(values[,\"values\"], decreasing=decreasing),];\t\n",
171+
" }\n",
172+
" \n",
173+
" colnames(values) <- newHeaders;\n",
174+
" return(values);\n",
175+
"}\n",
176+
"\n",
177+
"glm0.res <- mstack( glm0.res, newHeaders=c(\"row_id\", \"residual\"), sorted=FALSE );\n",
178+
"head(glm0.res);-->\n",
179+
"\n",
180+
"We can test for overdispersion using the [following code from Ben Bolker](http://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#testing-for-overdispersioncomputing-overdispersion-factor):"
181+
]
182+
},
183+
{
184+
"cell_type": "code",
185+
"execution_count": null,
186+
"metadata": {},
187+
"outputs": [],
188+
"source": [
189+
"# a test for overdispersion, from Ben Bolker's GLMM website:\n",
190+
"overdisp_fun <- function(model) {\n",
191+
" rdf <- df.residual(model)\n",
192+
" rp <- residuals(model,type=\"pearson\")\n",
193+
" Pearson.chisq <- sum(rp^2)\n",
194+
" prat <- Pearson.chisq/rdf\n",
195+
" pval <- pchisq(Pearson.chisq, df=rdf, lower.tail=FALSE)\n",
196+
" c(chisq=Pearson.chisq,ratio=prat,rdf=rdf,p=pval)\n",
197+
"}"
198+
]
199+
},
200+
{
201+
"cell_type": "code",
202+
"execution_count": null,
203+
"metadata": {},
204+
"outputs": [],
205+
"source": [
206+
"# determine whether the data are overdispersed:\n",
207+
"glmer0 <- glmer( G2P ~ 1 + offset(log(totalIonCounts)) + (1|accession_id), data=glucs2, family=\"poisson\" );\n",
208+
"overdisp_fun( glmer0 );"
209+
]
210+
},
211+
{
212+
"cell_type": "markdown",
213+
"metadata": {},
214+
"source": [
215+
"The data are highly overdispersed.\n"
216+
]
217+
},
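{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# an informal look at the raw G2P counts (the model-based test above is the proper check):\n",
"# a variance far larger than the mean is consistent with overdispersion\n",
"cat(\"mean of G2P:\", mean(glucs2[,\"G2P\"]), \"\\n\");\n",
"cat(\"variance of G2P:\", var(glucs2[,\"G2P\"]), \"\\n\");"
]
},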
218+
{
219+
"cell_type": "code",
220+
"execution_count": null,
221+
"metadata": {},
222+
"outputs": [],
223+
"source": [
224+
"# one workaround is to simply update the coefficient table by multiplying the\n",
225+
"# standard error by the square root of the dispersion factor (phi) and then \n",
226+
"# updating the Z-values and the corresponding P-values such as:\n",
227+
"\n",
228+
"modelToTest <- coef(summary(glmer0));\n",
229+
"phi <- overdisp_fun(glmer0)[\"ratio\"];\n",
230+
"\n",
231+
"modelToTest <- within(as.data.frame(modelToTest),\n",
232+
"{ `Std. Error` <- `Std. Error`*sqrt(phi)\n",
233+
" `z value` <- Estimate/`Std. Error`\n",
234+
" `Pr(>|z|)` <- 2*pnorm(abs(`z value`), lower.tail=FALSE)\n",
235+
"})\n",
236+
"\n",
237+
"cat(\"Before:\\n\");\n",
238+
"coef(summary(glmer0));\n",
239+
"\n",
240+
"cat(\"\\nAnd after:\\n\");\n",
241+
"printCoefmat(modelToTest, digits=3);"
242+
]
243+
},
244+
{
245+
"cell_type": "code",
246+
"execution_count": null,
247+
"metadata": {},
248+
"outputs": [],
249+
"source": [
250+
"# another workaround: simply use glmer.nb as an alternative\n",
251+
"# however glmer.nb is \"slow and fragile\" (see the Bolker wiki page noted above):\n",
252+
"glmer.nb0 <- glmer.nb( G2P ~ 1 + offset(log(totalIonCounts)) + (1|accession_id), data=glucs2);"
253+
]
254+
},
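{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a sketch of how one might inspect the negative binomial fit (assuming glmer.nb0 converged):\n",
"# the estimated theta appears in the 'Family' line of the summary, and BLUPs can be\n",
"# extracted with ranef(), just as for the Poisson model above\n",
"summary(glmer.nb0);\n",
"head( ranef(glmer.nb0)$accession_id );"
]
},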
255+
{
256+
"cell_type": "code",
257+
"execution_count": null,
258+
"metadata": {},
259+
"outputs": [],
260+
"source": [
261+
"# can we use quasi models with mixed models? Not with lme4. One alternative is to fit an olre\n",
262+
"# or an 'observation level random effect'. This is somewhat conservative, but will aid you in accounting for overdispersion\n",
263+
"glucs2 <- data.frame(olre=1:nrow(glucs2), glucs2, stringsAsFactors=FALSE);\n",
264+
"glucs2[1:3,];"
265+
]
266+
},
267+
{
268+
"cell_type": "code",
269+
"execution_count": null,
270+
"metadata": {},
271+
"outputs": [],
272+
"source": [
273+
"# with the olre as a random effect:\n",
274+
"glmer.olre <- glmer( G2P ~ 1 + offset(log(totalIonCounts)) + (1|olre) + (1|accession_id), data=glucs2, family=\"poisson\");"
275+
]
276+
},
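{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# as a rough follow-up check (a sketch using the overdisp_fun defined above): with the OLRE\n",
"# included, the Pearson-based dispersion ratio typically drops to around 1 or below,\n",
"# although this diagnostic is only approximate for models that contain an OLRE\n",
"overdisp_fun( glmer.olre );\n",
"\n",
"# the accession-level BLUPs can still be extracted in the usual way:\n",
"head( ranef(glmer.olre)$accession_id );"
]
},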
277+
{
278+
"cell_type": "markdown",
279+
"metadata": {},
280+
"source": [
281+
"\n",
282+
"### Last but not least, in a pinch, you could use residuals instead of BLUPs"
283+
]
284+
},
285+
{
286+
"cell_type": "code",
287+
"execution_count": null,
288+
"metadata": {},
289+
"outputs": [],
290+
"source": [
291+
"# another possible workaround\n",
292+
"glm0 <- glm( G2P ~ 1, offset=log(totalIonCounts), data=glucs2, family=\"quasipoisson\");\n",
293+
"summary(glm0);"
294+
]
295+
},
296+
{
297+
"cell_type": "code",
298+
"execution_count": null,
299+
"metadata": {},
300+
"outputs": [],
301+
"source": [
302+
"# like BLUPs, residuals are regularly used in GWAS\n",
303+
"glm0.res <- residuals( glm0 );\n",
304+
"glm0.res[1:10];"
305+
]
306+
},
307+
{
308+
"cell_type": "code",
309+
"execution_count": null,
310+
"metadata": {},
311+
"outputs": [],
312+
"source": [
313+
"# note: the names are no longer the accession Ids, but the row names from the glucs data frame. Do you know why?\n",
314+
"# let's use a workaround:\n",
315+
"################################################################################\n",
316+
"## my version of stack, which avoids factor generation\n",
317+
"################################################################################\n",
318+
"mstack <- function(arg, newHeaders, setRowNames=T, sorted=TRUE, decreasing=F){\n",
319+
" values <- data.frame(names=I(names(arg)), values=as.numeric(arg));\n",
320+
"\n",
321+
" if( setRowNames ){\n",
322+
" rownames(values) <- values[,\"names\"];\n",
323+
" }\n",
324+
" \n",
325+
" if( sorted ) {\n",
326+
" values <- values[order(values[,\"values\"], decreasing=decreasing),];\t\n",
327+
" }\n",
328+
" \n",
329+
" colnames(values) <- newHeaders;\n",
330+
" return(values);\n",
331+
"}\n",
332+
"\n",
333+
"glm0.res <- mstack( glm0.res, newHeaders=c(\"row_id\", \"residual\"), sorted=FALSE );\n",
334+
"glucs2 <- data.frame( row_id=rownames(glucs2), glucs2, stringsAsFactors=FALSE);\n",
335+
"\n",
336+
"# the residuals here only correspond to the metabolite we've been working with, so \n",
337+
"# to protect yourself from making downstream mistakes, subset to the metadata and that column\n",
338+
"output <- merge(glm0.res, glucs2[,c(\"row_id\", \"accession_id\", \"sample_weight\", \"G2P\")], by=\"row_id\" );"
339+
]
340+
},
341+
{
342+
"cell_type": "code",
343+
"execution_count": null,
344+
"metadata": {
345+
"scrolled": true
346+
},
347+
"outputs": [],
348+
"source": [
349+
"head(output);"
350+
]
351+
},
352+
{
353+
"cell_type": "markdown",
354+
"metadata": {},
355+
"source": [
356+
"<br><br>\n",
357+
"\n",
358+
"One could then take the averages of each residual for each accession id, using an approach like the tapply approach described earlier."
359+
]
360+
},
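{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a minimal sketch of that tapply approach, using the column names created above:\n",
"accessionMeans <- tapply( output[,\"residual\"], output[,\"accession_id\"], mean );\n",
"accessionMeans[1:5];"
]
}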
361+
],
362+
"metadata": {
363+
"kernelspec": {
364+
"display_name": "R",
365+
"language": "R",
366+
"name": "ir"
367+
},
368+
"language_info": {
369+
"codemirror_mode": "r",
370+
"file_extension": ".r",
371+
"mimetype": "text/x-r-source",
372+
"name": "R",
373+
"pygments_lexer": "r",
374+
"version": "3.4.3"
375+
}
376+
},
377+
"nbformat": 4,
378+
"nbformat_minor": 4
379+
}
