planet is an R package for inferring ethnicity from placental DNA
methylation microarray data.
You can install from this GitHub repo with:
For demonstration purposes, I downloaded a placental DNAm dataset from
which contains samples collected in an Australian population. To save on
memory, I use only 6 of the 24 samples, which I have saved in this repo as an
RGChannelSet object (pl_rgset).

    library(planet)
    library(minfi)      # for normalization
    library(wateRmelon) # for normalization
    library(ggplot2)

    # load example data
    data(pl_rgset)
    pl_rgset # 6 samples
    #> class: RGChannelSet
    #> dim: 622399 6
    #> metadata(0):
    #> assays(2): Green Red
    #> rownames(622399): 10600313 10600322 ... 74810490 74810492
    #> rowData names(0):
    #> colnames(6): GSM1944959_9376561070_R05C01
    #>   GSM1944960_9376561070_R06C01 ... GSM1944963_9376561070_R03C02
    #>   GSM1944964_9376561070_R04C02
    #> colData names(0):
    #> Annotation
    #>   array: IlluminaHumanMethylation450k
    #>   annotation: ilmn12.hg19

I recommend normalizing your data with the same methods I used to normalize the training data. Performance on datasets normalized by other methods has not yet been evaluated.
To apply normalization, run
minfi::preprocessNoob() and then wateRmelon::BMIQ():

    pl_noob <- preprocessNoob(pl_rgset)
    pl_bmiq <- wateRmelon::BMIQ(pl_noob)

preprocessNoob() drops SNP probes automatically. Because
we need these to infer ethnicity, we need to combine the methylation
data with the 65 SNP probes (59 SNPs, if using EPIC):

    pl_snps <- getSnpBeta(pl_rgset)
    pl_dat <- rbind(pl_bmiq, pl_snps)
    dim(pl_dat) # 485577 6
    #> [1] 485577      6

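Note that rbind() on matrices combines rows by column position, not by column name, so it may be worth confirming that both matrices list the samples in the same order before combining them. A minimal sketch with toy matrices (the objects betas and snps, and the probe and sample names, are made up for illustration and are not part of planet):

```r
# Toy stand-ins for the normalized betas and the SNP betas (hypothetical values)
betas <- matrix(c(0.10, 0.20, 0.30, 0.40), nrow = 2,
                dimnames = list(c("cg_a", "cg_b"), c("s1", "s2")))
snps  <- matrix(c(0.50, 0.90), nrow = 1,
                dimnames = list("rs_a", c("s2", "s1")))  # samples in a different order

# Reorder the SNP columns to match the methylation matrix before binding
snps <- snps[, colnames(betas), drop = FALSE]
stopifnot(identical(colnames(betas), colnames(snps)))

# Rows are now stacked with samples aligned column by column
combined <- rbind(betas, snps)
dim(combined)
```

With the real objects, the same check would be applied to pl_bmiq and pl_snps.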
The input data needs to contain all 1860 features in the final model. We
can check our data for these features with the pl_ethnicity_features vector:

    all(pl_ethnicity_features %in% rownames(pl_dat))
    #> [1] TRUE

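If the check returns FALSE, it can help to see which features are absent. The sketch below uses a made-up feature list and data matrix (required_features and dat are hypothetical); with real data you would substitute pl_ethnicity_features and pl_dat:

```r
# Hypothetical feature list and data matrix, for illustration only
required_features <- c("cg_a", "cg_b", "rs_a")
dat <- matrix(c(0.1, 0.2, 0.3, 0.4), nrow = 2,
              dimnames = list(c("cg_a", "rs_a"), c("s1", "s2")))

# Features the model needs that are absent from the data
missing_features <- setdiff(required_features, rownames(dat))
missing_features

# TRUE only when every required feature is present
length(missing_features) == 0
```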
To obtain ethnicity calls, you can supply the full DNA methylation data to
pl_infer_ethnicity(), as long as all 1860 features are present.

    dim(pl_dat)
    #> [1] 485577      6
    results <- pl_infer_ethnicity(pl_dat)
    #> [1] "1860 of 1860 predictors present."
    print(results, row.names = FALSE)
    #>                     Sample_ID Predicted_ethnicity_nothresh
    #>  GSM1944959_9376561070_R05C01                        Asian
    #>  GSM1944960_9376561070_R06C01                    Caucasian
    #>  GSM1944961_9376561070_R01C02                        Asian
    #>  GSM1944962_9376561070_R02C02                    Caucasian
    #>  GSM1944963_9376561070_R03C02                    Caucasian
    #>  GSM1944964_9376561070_R04C02                    Caucasian
    #>  Predicted_ethnicity Prob_African   Prob_Asian Prob_Caucasian Highest_Prob
    #>                Asian 0.0123696461 0.9593950737     0.02823528    0.9593951
    #>            Caucasian 0.0156684101 0.1672797219     0.81705187    0.8170519
    #>                Asian 0.0230160188 0.9086999663     0.06828401    0.9087000
    #>            Caucasian 0.0006193453 0.0006078842     0.99877277    0.9987728
    #>            Caucasian 0.0026674323 0.0034159950     0.99391657    0.9939166
    #>            Caucasian 0.0047242556 0.0081101573     0.98716559    0.9871656

pl_infer_ethnicity() returns probabilities corresponding to each
ethnicity for each sample (e.g.
Prob_Asian). A final classification is determined in two ways:
Predicted_ethnicity_nothresh - returns a classification corresponding to the highest class-specific probability.
Predicted_ethnicity - if the highest class-specific probability is below
0.75, then the sample is assigned an
Ambiguous label. This threshold can be adjusted with the
threshold argument. Samples with this label might require special attention in downstream analyses.
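To make the thresholding rule concrete, here is a minimal base R sketch of the logic described above. It is not the package's implementation: label_with_threshold() is a hypothetical helper, and the probabilities are rounded from three of the samples in the output shown earlier.

```r
# Class probabilities for three of the samples above (rounded from the output)
probs <- data.frame(
  Prob_African   = c(0.0124, 0.0157, 0.0006),
  Prob_Asian     = c(0.9594, 0.1673, 0.0006),
  Prob_Caucasian = c(0.0282, 0.8171, 0.9988)
)

# Hypothetical helper: label each sample with its most probable class, or
# "Ambiguous" when the highest probability falls below the threshold
label_with_threshold <- function(probs, threshold = 0.75) {
  probs_mat <- as.matrix(probs)
  classes   <- sub("^Prob_", "", colnames(probs_mat))
  top_index <- max.col(probs_mat, ties.method = "first")
  highest   <- probs_mat[cbind(seq_len(nrow(probs_mat)), top_index)]
  unname(ifelse(highest < threshold, "Ambiguous", classes[top_index]))
}

label_with_threshold(probs)                   # default threshold of 0.75
label_with_threshold(probs, threshold = 0.9)  # stricter threshold
```

With the stricter threshold, the second sample (highest probability about 0.82) no longer receives a confident label, which illustrates how raising the threshold argument trades more Ambiguous calls for more confident classifications.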

    qplot(data = results, x = Prob_Caucasian, y = Prob_African,
          col = Predicted_ethnicity, xlim = c(0,1), ylim = c(0,1))

    qplot(data = results, x = Prob_Caucasian, y = Prob_Asian,
          col = Predicted_ethnicity, xlim = c(0,1), ylim = c(0,1))

For the entire dataset (not just the subset shown here), 22/24 samples were predicted Caucasian and 2/24 Asian.
Self-reported ethnicity is unavailable for these samples, so we can’t compare the predictions against it. However, the samples were collected in Sydney, Australia, and are therefore likely to be mostly of European ancestry, with some of East Asian ancestry.

    table(results$Predicted_ethnicity)
    #>
    #>     Asian Caucasian
    #>         2         4

Adjustment in differential methylation analysis
Because ‘Ambiguous’ samples might have different mixtures of ancestries, it might be inadequate to adjust for them as one group in an analysis of admixed populations (e.g. a 50/50 Asian/African sample should not be placed in the same group as a 50/50 Caucasian/African sample). One solution would be to simply remove these samples. Another would be to adjust for the raw probabilities; in this case, use only two of the three probabilities, since the third is redundant (the probabilities sum to 1). If sample numbers are large enough in each group, stratifying downstream analyses by ethnicity might also be a valid option.
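The redundancy of the third probability can be checked directly on a design matrix. The sketch below uses simulated probabilities and a hypothetical case/control variable (pheno, group, and the simulated values are all assumptions for illustration, not data from this repo):

```r
set.seed(1)

# Simulated class probabilities for 20 samples (hypothetical data)
raw <- matrix(runif(60), ncol = 3)
probs <- raw / rowSums(raw)  # each row sums to 1
colnames(probs) <- c("Prob_African", "Prob_Asian", "Prob_Caucasian")
pheno <- data.frame(group = rep(c("case", "control"), each = 10), probs)

# All three probabilities plus an intercept: the columns are linearly
# dependent, so the design matrix is rank deficient
design_all <- model.matrix(~ group + Prob_African + Prob_Asian + Prob_Caucasian,
                           data = pheno)
qr(design_all)$rank < ncol(design_all)

# Dropping one probability gives a full-rank design
design_two <- model.matrix(~ group + Prob_African + Prob_Asian, data = pheno)
qr(design_two)$rank == ncol(design_two)
```

A design like design_two could then be supplied to whichever linear modelling tool is used for the differential methylation analysis.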
 Yuan V, Price M, Del Gobbo G, Mostafavi S, Cox B, Binder AM, Michels KB, Marsit C, Robinson W: Inferring population structure from placental DNA methylation studies. In prep.
 Yeung KR, Chiu CL, Pidsley R, Makris A, Hennessy A, Lind JM: DNA methylation profiles in preeclampsia and healthy control placentas. Am J Physiol Heart Circ Physiol 2016, 310:H1295–H1303.
 Triche TJ, Weisenberger DJ, Van Den Berg D, Laird PW, Siegmund KD: Low-level processing of Illumina Infinium DNA Methylation BeadArrays. Nucleic Acids Res 2013, 41:e90.
 Teschendorff AE, Marabita F, Lechner M, Bartlett T, Tegner J, Gomez-Cabrero D, Beck S: A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450 k DNA methylation data. Bioinformatics 2013, 29:189–96.