## Introduction

The first step in any analysis is to review the data that will be analyzed. Large datasets typically contain missing data, data errors (e.g. due to an error during measurement or data entry), or entries that violate statistical assumptions (so-called outliers). 

Put another way, during data review you need to assess whether the data need to be cleaned *before* analysis (e.g. by reviewing the metadata, including the collection date, mass, or other covariates), and whether the data meet common statistical assumptions, such as normality.

In this notebook, we will review data preprocessing steps that are common in bioinformatics. To do so, we will use the statistical computing environment *R*.

### Download the test data

To practice data preprocessing techniques, let's use [publicly available](http://www.pnas.org/content/112/13/4032) glucosinolate data. The formatted data can be downloaded here:

curl https://raw.githubusercontent.com/timeu/gwas-lecture/master/data/cmeyer_glucs2015/bmeyer_etal.txt --create-dirs --output data/cmeyer_glucs2015/bmeyer_etal.txt

### Check the file format

Before importing data, always review the file format. Is it a text file? Is it an HDF5 file? Does it need to be transposed?

- If it is a text file:
  - What delimiter was used to make the file? 
  - What character set was used?   (see the command *iconv*)
  - What line ending was used, LF or CRLF? (see the command *dos2unix*)
  - Are there spaces or unexpected characters in the header? (see vim/emacs)
  - Are there quotes or comment characters that will interrupt the import?

- If it is an hdf5 file:
  - What are the keys?
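
Several of the text-file questions above can also be answered from within R before calling `read.table`. The following is a minimal sketch, not a complete validator: `inspectTextFile` and the small demo file are hypothetical names created for illustration, and only the first lines and the first 10,000 bytes of the file are examined.

```r
# Hypothetical helper: summarise the first lines and bytes of a text file
inspectTextFile <- function(fileName, nLines=5) {
    firstLines <- readLines(fileName, n=nLines, warn=FALSE);
    firstBytes <- as.integer(readBin(fileName, what="raw", n=1e4));
    # count how often a literal character occurs on each of the first lines
    countPerLine <- function(char) {
        lengths(regmatches(firstLines, gregexpr(char, firstLines, fixed=TRUE)));
    }
    list(
        tabsPerLine = countPerLine("\t"),
        commasPerLine = countPerLine(","),
        hasCarriageReturns = any(firstBytes == 13L), # "\r": DOS/Windows line endings
        hasNonAsciiBytes = any(firstBytes > 127L),   # hints at a non-ASCII character set
        hasQuotesOrHashes = any(grepl("[\"'#]", firstLines))
    );
}

# Demo on a small tab-delimited file, written in binary mode so that the
# line endings are "\n" on every platform
demoFile <- tempfile(fileext=".txt");
con <- file(demoFile, open="wb");
writeLines(c("accession_id\tG2P", "1\t0.5", "2\t0.7"), con, sep="\n");
close(con);

fileReport <- inspectTextFile(demoFile);
str(fileReport);
```

A constant, non-zero count for one candidate delimiter across all lines is a reasonable hint that it is the delimiter in use.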
  
### Load the glucosinolate data

In [None]:
# rm(list=ls());
# open the glucosinolate file in R

glucosinolateFileName <- "data/cmeyer_glucs2015/bmeyer_etal.txt";  
glucs <- read.table(glucosinolateFileName, header=TRUE, sep="\t", as.is=TRUE, stringsAsFactors=FALSE);
glucs <- glucs[order(glucs[,"accession_id"]),];

In [None]:
# what's in the working environment?
ls(); 

In [None]:
# nb: this is a working R environment
# to learn the syntax of a function, type a question mark followed by the function name
# ?head

In [None]:
# str(glucs);

In [None]:
head(glucs);

In [None]:
# it's important to run sanity checks to ensure that you have loaded the entire dataset
dim(glucs); # returns the dimensionality in row x column format
nrow(glucs); # returns the number of rows
ncol(glucs); # returns the number of columns

# when you begin to write scripts, add comments so that 
# future readers (such as yourself) know what you were doing
# the comment character in R is #

In [None]:
# R functions are often 'silent' (i.e. they return their value invisibly)
# In addition, the numbers above lack context. The two commands
# cat and print can be used to add user feedback. As an example:
cat("There are: ", nrow(glucs), " rows and ", sep="");
cat( ncol(glucs), " columns in the dataset.\n", sep="");
cat("There are:", sum(is.na(glucs)), "missing data (i.e. NAs).\n");

# the newline character is '\n' 
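
A single NA total does not show where the missing values are. As a minimal sketch on a made-up data frame (`toyData` and its values are hypothetical, not the glucosinolate data), `colSums(is.na(...))` gives a per-column count and `complete.cases` flags the rows without any NA:

```r
# Toy data frame with a few missing values (all values are invented)
toyData <- data.frame(
    accession_id = c(1, 1, 2, 2),
    G2P = c(0.5, NA, 0.7, NA),
    plate = c("a", "b", NA, "b"),
    stringsAsFactors = FALSE
);

naPerColumn <- colSums(is.na(toyData));       # number of NAs in each column
completeRows <- sum(complete.cases(toyData)); # number of rows without any NA
print(naPerColumn);
cat("Rows without missing data:", completeRows, "\n");
```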

### Quality control in *nix?
If the numbers do not match your expectation, open the file in vi/emacs or another text editor... and look for special characters or quotes/comments that may have interrupted the import.

You're probably familiar with some of the more common errors. For example, if the number of columns doesn't match the number of column names, *R* will return an error. However, this is a 'nice' error, because it forces you to look at the data in more detail. The scariest errors are the errors that escape your attention. It's thus very important to review your data (e.g. what happens if you concatenate 2 data files?).
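
One way to catch such problems before import is to count the fields on every line. The sketch below uses `count.fields` (from the utils package that ships with R) on a small, deliberately broken demo file; the file contents and variable names are hypothetical.

```r
# Write a small tab-delimited demo file in which the third line lacks a field
demoFile <- tempfile(fileext=".txt");
writeLines(c("accession_id\tG2P\tplate", "1\t0.5\ta", "2\t0.7", "3\t0.9\tb"), demoFile);

# Count the tab-separated fields per line; quote and comment handling are
# switched off so that such characters cannot hide or merge fields
fieldsPerLine <- count.fields(demoFile, sep="\t", quote="", comment.char="");
print(table(fieldsPerLine));

# Report the lines whose field count differs from that of the header
suspectLines <- which(fieldsPerLine != fieldsPerLine[1]);
cat("Lines to inspect:", suspectLines, "\n");
```

The same check can be run on a concatenated file to confirm that both source files had the same number of columns.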

In [None]:
# or alternatively:
# command <- paste0( "head ", glucosinolateFileName );
# print( system( command, wait=T, intern=T ));

### What is the phenotype?

The glucosinolate data were generated with the plant genetic model species *Arabidopsis thaliana*.

One of the main reasons geneticists use *A. thaliana* is that it is self-compatible, which means that inbred lines can be created through self-pollination. In *A. thaliana*, these inbred lines were created by placing each plant's flowers in bags to minimize cross-pollination among individuals.

The seeds from these inbred lines can then be used as replicates in experiments. Replicated inbred lines allow us to estimate the mean phenotype for any given genotype with much higher precision than is usually possible in obligate outcrossing species. For GWAS, working with inbred lines improves power, since one can use *fewer* individuals than would be necessary in other species.

But how many replicates are available for each line? And how should we handle these replicates?

In [None]:
cat("There are:", length(unique(glucs$accession_id)), "unique accessions.\n");
tableOfAccessionIds <- table(glucs$accession_id);
tableOfAccessionIds;

In [None]:
# That's hard to read, somewhat better is:
rangeOfReplicates <- range(tableOfAccessionIds);
cat("The number of replicates ranges between ", rangeOfReplicates[1], " and ", rangeOfReplicates[2], "\n", sep="");
cat("With a mean of:", mean(tableOfAccessionIds), "\n");

# if lme4, ggplot2, and gridExtra aren't installed, install them...
# require() returns FALSE when a package is missing; after installing,
# the package still has to be loaded with library()
for( pkg in c("lme4", "ggplot2", "gridExtra") ){
    if( !require(pkg, character.only=TRUE) ){
        install.packages(pkg);
        library(pkg, character.only=TRUE);
    }
}


In [None]:
# there's quite a range of replicate counts
# plot the # of replicates per genotype
options(repr.plot.width=2.5, repr.plot.height=2.5)
counts <- as.vector(table(glucs$accession_id)); # plain integer vector of replicate counts
ggplot() + aes(x=counts) + 
        xlab("The number of replicates") + ylab("Counts") +
        geom_bar(stat="count", fill="tan1") + 
        geom_vline(xintercept=mean(counts), linetype=3); # dotted line marks the mean

### Estimate the mean phenotype per genotype

In [None]:
# tapply offers a fast way to estimate the mean (or any other summary statistic) across a factor
# stack converts the named vector of means into a two-column data frame (values, ind)
dataSummary <- stack(with(glucs, tapply(G2P, list(accession_id), mean)));
colnames(dataSummary) <- c("tapply_mean", "accession_id" );
tail(dataSummary);
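
For comparison, `aggregate` returns the per-group means directly as a data frame, without the `stack` step. A minimal sketch on invented values (`toyGlucs` is hypothetical; its column names merely mirror those used above):

```r
# Toy data: two accessions with two and three replicates (values are invented)
toyGlucs <- data.frame(
    accession_id = c(1, 1, 2, 2, 2),
    G2P = c(1, 3, 2, 4, 6)
);

# mean G2P per accession, computed two ways
meansTapply <- tapply(toyGlucs$G2P, toyGlucs$accession_id, mean);
meansAggregate <- aggregate(G2P ~ accession_id, data=toyGlucs, FUN=mean);

print(meansTapply);
print(meansAggregate);
```

One difference to keep in mind: the formula interface of `aggregate` drops rows containing NA by default, whereas `mean` inside `tapply` returns NA for a group with missing values unless `na.rm=TRUE` is passed.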

<!-- # this is roughly equivalent to using something more model-based
# residuals are often used in GWAS, especially when nuisance variables are taken 'into account' (e.g. plate_id, etc.)
# as a simple example:
lm0 <- lm( G2P ~ 1, data=glucs); 
resids <- stack(residuals(lm0));-->

In [None]:
# another option: use a mixed model to specify a random effect
library(lme4);
lmer0 <- lmer( G2P ~ 1 + (1|accession_id), data=glucs );
blups <- ranef(lmer0)$accession_id; # best linear unbiased predictors (BLUPs), expressed as deviations from the overall intercept
blups <- data.frame( accession_id=rownames(blups), blup=blups[,1], stringsAsFactors=FALSE );
head(blups);

In [None]:
both <- merge(blups, dataSummary, by="accession_id" );
head(both);
ggplot( both, aes(x=blup, y=tapply_mean)) + geom_point(alpha=0.5, col="cadetblue2" ); 

The two techniques are similar, but the results aren't perfectly correlated. Why? (as follow up, you could identify the outlying accessions)
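
As a starting point for that follow-up, one option is to regress the per-accession means on the BLUPs and rank the accessions by absolute residual. The sketch below uses invented numbers in a data frame shaped like `both` (`toyBoth` and its values are hypothetical), so it illustrates the mechanics only:

```r
# Toy stand-in for the merged data frame (all values are invented)
toyBoth <- data.frame(
    accession_id = c("a", "b", "c", "d", "e", "f"),
    blup = c(-2, -1, 0, 1, 2, 3),
    tapply_mean = c(8, 9, 10, 14, 12, 13),
    stringsAsFactors = FALSE
);

# fit a straight line and record how far each accession lies from it
fit <- lm(tapply_mean ~ blup, data=toyBoth);
toyBoth$absResidual <- abs(residuals(fit));

# accessions that deviate most from the linear relationship come first
ranked <- toyBoth[order(toyBoth$absResidual, decreasing=TRUE), ];
head(ranked, n=3);
```

Comparing the top-ranked accessions with their replicate counts is one way to start exploring where the two estimates diverge.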

### Checking model assumptions/normality for a phenotype

In [None]:
# count data are often non-normal, there are various ways 
# to investigate normality, including normal Q-Q (quantile-quantile) plots
# here is an example with data sampled from a normal distribution:
normalData <- rnorm(1e4);
p1 <- ggplot() + aes(sample=normalData) + stat_qq(col="forestgreen", size=2) + stat_qq_line(col="tan1", size=1.25);

# and the glucosinolate G2P
p2 <- ggplot( glucs, aes(sample=G2P)) + stat_qq(col="firebrick1", size=2) + stat_qq_line(col="cadetblue2", size=1.25);

options(repr.plot.width=4, repr.plot.height=2.5)
grid.arrange(p1, p2, ncol = 2);
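
To make explicit what a normal Q-Q plot shows: the sorted sample is plotted against the quantiles expected under a standard normal distribution, and normal data fall close to a straight line. A minimal base-R sketch on simulated data follows; `qqCorrelation` is a hypothetical helper, and the correlation it returns is only a rough summary of how straight the Q-Q plot is, not a formal test.

```r
set.seed(1);

# Correlation between the sorted sample and the matching normal quantiles
qqCorrelation <- function(x) {
    x <- sort(x[!is.na(x)]);
    theoretical <- qnorm(ppoints(length(x))); # expected standard-normal quantiles
    cor(theoretical, x);
}

normalSample <- rnorm(1000); # simulated normal data
skewedSample <- rexp(1000);  # simulated right-skewed, non-normal data

qqCorNormal <- qqCorrelation(normalSample);
qqCorSkewed <- qqCorrelation(skewedSample);
cat("Normal sample:", round(qqCorNormal, 3), "\n");
cat("Skewed sample:", round(qqCorSkewed, 3), "\n");
```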

In [None]:
# The Q-Q plot suggests the data are highly non-normal
# The non-normality is also evident in a density plot:
p1 <- ggplot() + aes(x=rnorm(1e5)) + 
            xlab( "Normal Distribution") +
            geom_density();
p2 <- ggplot( glucs, aes(x=G2P)) + 
            theme(axis.text.x = element_text(angle = 35)) +
            geom_density(color="darkblue", fill="lightblue");

grid.arrange(p1, p2, ncol = 2);

In [None]:
# Again, the data are clearly non-normal. Visual checks are useful,
# but you may also want a formal test such as the Shapiro-Wilk test:
shapiro.test(glucs$G2P);
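
Note that R's `shapiro.test` only accepts between 3 and 5000 non-missing values. If a dataset is larger, one possible workaround is to test a random subsample. A minimal sketch on simulated data (`shapiroSubsample` is a hypothetical helper, and subsampling is only one of several options):

```r
set.seed(1);

# Run the Shapiro-Wilk test, subsampling when there are more than maxN values
shapiroSubsample <- function(x, maxN=5000) {
    x <- x[!is.na(x)];
    if (length(x) > maxN) {
        x <- sample(x, maxN); # random subsample without replacement
    }
    shapiro.test(x);
}

largeSkewedSample <- rexp(6000); # simulated, right-skewed data
swResult <- shapiroSubsample(largeSkewedSample);
print(swResult);
```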

In [None]:
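# or the Kolmogorov-Smirnov (KS) normality test:
# nb: y can either be a vector of data values or a character string naming a cumulative distribution function (cdf)
# nb: "pnorm" on its own refers to the standard normal (mean 0, sd 1), so pass the sample mean and sd;
# because these are estimated from the data, treat the p-value as approximate
ks.test( glucs$G2P, y="pnorm", mean=mean(glucs$G2P, na.rm=TRUE), sd=sd(glucs$G2P, na.rm=TRUE));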
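
When `"pnorm"` is passed to `ks.test` without `mean` and `sd`, the data are compared with the standard normal distribution (mean 0, standard deviation 1), so a normal sample with a different mean or spread is rejected as well. The simulated sketch below illustrates the difference (all variable names are hypothetical). Bear in mind that plugging in the sample mean and standard deviation makes the reported p-value only approximate.

```r
set.seed(1);
shiftedNormal <- rnorm(200, mean=50, sd=10); # normal, but not standard normal

# compared with N(0, 1): location and scale alone lead to rejection
ksStandard <- ks.test(shiftedNormal, "pnorm");

# compared with a normal distribution that has the sample's own mean and sd
ksFitted <- ks.test(shiftedNormal, "pnorm",
                    mean=mean(shiftedNormal), sd=sd(shiftedNormal));

cat("p-value against N(0, 1):", ksStandard$p.value, "\n");
cat("p-value against the fitted normal:", ksFitted$p.value, "\n");
```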

### The data are clearly non-normal!!!

What are your other options? In the next notebook, we'll consider some alternative approaches.
