# De Werf/Diep/Jamieson Human Pediatric AML Study in Progenitors and Hematopoietic Stem Cells 
### Progenitors
* Pediatric AML vs Adult AML
* Pediatric AML vs Pediatric Non-leukemia

### Stem Cells
* Pediatric AML vs Adult AML
* Pediatric AML vs Pediatric Non-leukemia

### Pediatric AML
* Progenitors vs Stem Cells

### Adult AML
* Progenitors vs Stem Cells


# RNASeq Data Integration

   > * Tom Whisenant, CCBB (twhisenant@ucsd.edu)
   > * Based on upstream analysis by Guorong Xu, CCBB (g1xu@ucsd.edu)

* Modeled on "RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR" ([1](#Citations))

## Table of Contents
* [Background](#Background)
* [Introduction](#Introduction)
* [Parameter Input](#Parameter-Input)
* [Library Import](#Library-Import)
* [Data Import](#Data-Import)
    * [Count Data](#Count-Data)
    * [Metadata](#Metadata)
    * [Annotations](#Annotations)
* [Gene Separation By Coding Status](#Gene-Separation-By-Coding-Status)
* [Data Integration](#Data-Integration)
* [Annotation Integration](#Annotation-Integration)
* [Summary](#Summary)
* [Citations](#Citations)
* [Appendix: R Session Info](#Appendix:-R-Session-Info)


## Background

   >The count data analyzed in this notebook were produced by the upstream analysis of Dr. Guorong Xu of CCBB, who received raw sequencing data and performed quality control, trimming, alignment, and quantification of reads.
   

[Table of Contents](#Table-of-Contents)

## Introduction

This notebook takes in per-gene-per-sample RNASeq count data (prepared either externally or by the "RNASeq_RSEM_QC_and_Counts_Preparation" notebook) and per-sample metadata, and uses the edgeR ([2](#Citations)) Bioconductor ([3](#Citations)) package, written in R ([4](#Citations)), to integrate and annotate these inputs in preparation for data exploration and preprocessing.

[Table of Contents](#Table-of-Contents)

## Parameter Input

In [115]:
gProjectName = "20200228_DeWerf_Human_PediatricAML"
gGeneCountsFilename = "DeWerf_Jamieson_PedAML_counts.txt"
gMetadataFilename = "DeWerf_Jamieson_PediatricAMLmetadata_02282020.csv"

# Note: the below R object file is necessary only if separating genes into coding 
# and non-coding before analysis.
# If NOT doing this, set the value of this filename to NULL.
# Note: be sure to record source of annotations file, such as:
# R object of contents of gencode.v29.annotation.gtf based on Homo sapiens GRCh38p12
gAnnotationsRdataFilename = "Homo_sapiens_GRCh38p12_gencodev29_ANNOT.Rdata"

# Alternatively, use the standing gencode v19 annotation object
# (absolute path; loaded below for the coding/non-coding split):
gAnnotationsRdataFilename_hg19 = "/mnt/data1/tomw/Holm_Jamieson_Analysis/gencodev19_ANNOT.RData"

In [116]:
gSourceDir = "/mnt/data1/tomw/RNASeq/notebooks/src/" # note trailing slash here but not below
gOutputDir = "../outputs"
gInputDir = "../inputs"
gReferenceDir = "/mnt/data1/tomw/RNASeq/reference"
gInterimDir = "../interim"
gGeneCountsFp = file.path(gOutputDir, gGeneCountsFilename)
gMetadataFp = file.path(gInputDir, gMetadataFilename)
gAnnotationFp_hg19 <- file.path(gAnnotationsRdataFilename_hg19)

In [117]:
# Import shared source code to load and save previous notebooks' environments:
source(paste0(gSourceDir, "ChainedNotebookSupport.R"))

Populate the run name parameter automatically to ensure that outputs from different runs do not overwrite each other:

In [118]:
gRunName = makeRunName(gProjectName, "data_integration")
gRunName
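
`makeRunName` is defined in `ChainedNotebookSupport.R`, which is not shown here. Judging only from the output file names printed later in this notebook (project name, step name, and a 14-digit timestamp joined by underscores), a minimal stand-in might look like the sketch below. The function name `makeRunNameSketch` and its internals are assumptions, not the CCBB implementation.

```r
# Hypothetical stand-in for makeRunName(); the real helper lives in
# ChainedNotebookSupport.R and may differ.
makeRunNameSketch <- function(projectName, stepName, now = Sys.time()) {
    # Timestamp to the second, e.g. "20200228163803"
    timestamp <- format(now, "%Y%m%d%H%M%S")
    paste(projectName, stepName, timestamp, sep = "_")
}

exampleRunName <- makeRunNameSketch("20200228_DeWerf_Human_PediatricAML",
                                    "data_integration")
exampleRunName
```

Because the timestamp changes on every call, re-running the notebook produces a new run name and therefore new output file names.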

[Table of Contents](#Table-of-Contents)

## Library Import

Import the necessary R, Bioconductor, and CCBB libraries for the analysis:

In [None]:
#if (!requireNamespace("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")

In [None]:
#BiocManager::install("edgeR", version = "3.8")

In [None]:
#BiocManager::install("Homo.sapiens", version = "3.8")

In [119]:
library(Homo.sapiens)
gOrganismPackage = Homo.sapiens

In [120]:
library(edgeR)

[Table of Contents](#Table-of-Contents)


## Data Import

### Count Data

Import the count data file in which rows are gene identifiers, columns are sample identifiers, and row/column intersections contain the number of counts for the relevant gene in the relevant sample:

In [121]:
# Read in counts file containing info on all samples and genes
gUnorderedGeneCountsDf <- read.csv(gGeneCountsFp, sep="\t", stringsAsFactors=FALSE, row.names=1)
colnames(gUnorderedGeneCountsDf) <- gsub("_S[0-9]+_R1_001$", "", colnames(gUnorderedGeneCountsDf))
dim(gUnorderedGeneCountsDf)
colnames(gUnorderedGeneCountsDf) 

In [122]:
head(gUnorderedGeneCountsDf)

Unnamed: 0_level_0,X14x12488xPLUSPLUS,X02pid24760ctHPC,cell05id90HSC,X02id11251HSC,X04id11474HSC,X05x00047xPLUSPLUS,X06x00077xPLUSPLUS,X04pid24474ctHPC,X09x00020pxPLUSPLUS,X03id78cellHSC,⋯,X13x12488xPLUSMINS,X03id11474HPC,X05id00066HSC,X15x12584xPLUSPLUS,X01id11251HPC,X17x12451xPLUSPLUS,X11x00082xPLUSPLUS,X01x00077xPROGENIT,X16x12451xPLUSMINS,X04x00068xPLUSMINS
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
ENSG00000000003.10,10.0,35.0,0.0,3.0,0.0,18.0,49.0,116.0,0.0,1.0,⋯,54.0,0.0,0.0,15.0,0.0,0.0,0.0,0.0,0.0,0.0
ENSG00000000005.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,⋯,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ENSG00000000419.8,3579.0,2724.0,1705.0,1759.0,1923.0,2350.0,4884.0,3125.0,1646.0,1872.0,⋯,1712.0,2125.0,818.0,1652.0,1777.0,2146.0,2294.0,1225.0,1082.0,910.0
ENSG00000000457.9,456.88,1238.85,322.03,160.2,529.39,354.12,1435.02,158.04,513.23,96.22,⋯,151.32,424.19,488.89,158.72,20.94,1295.05,102.96,351.54,178.19,145.25
ENSG00000000460.12,1413.12,1699.15,1042.97,352.8,543.61,1377.88,2543.98,768.96,199.77,297.78,⋯,378.68,596.81,271.11,365.28,292.06,178.95,116.04,730.46,121.81,281.75
ENSG00000000938.8,181.0,293.0,18.0,70.0,357.0,466.0,436.0,50.0,238.0,199.0,⋯,230.0,659.0,91.0,129.0,286.0,279.0,13.0,43.0,133.0,114.0


In [123]:
detectParRecords = function(geneCountsDf){
    gene_names <- rownames(geneCountsDf)
    PAR_genes <- gene_names[grep("_PAR_", gene_names)] 
    if (length(PAR_genes) == 0){
        print("No PAR genes detected; analysis can proceed.")
    } else {
        print("ERROR: PAR genes found.  These must be removed before continuing analysis.")
    }
    return(PAR_genes)
}

In [124]:
detectParRecords(gUnorderedGeneCountsDf)

[1] "No PAR genes detected; analysis can proceed."
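
If PAR records had been reported, they would need to be dropped before continuing. A minimal sketch of one way to do that, using a small made-up count table (the gene identifiers and the names `toyCountsDf` and `toyNoParDf` are illustrative, not from this dataset):

```r
# Toy count table; the "_PAR_Y" row mimics a pseudoautosomal-region record.
toyCountsDf <- data.frame(
    sampleA = c(10, 0, 5),
    sampleB = c(12, 1, 7),
    row.names = c("ENSG00000000001.1", "ENSG00000000002.1_PAR_Y", "ENSG00000000003.1")
)

# Keep only rows whose identifier does not contain "_PAR_"
isPar <- grepl("_PAR_", rownames(toyCountsDf), fixed = TRUE)
toyNoParDf <- toyCountsDf[!isPar, , drop = FALSE]
rownames(toyNoParDf)
```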


The columns (samples) of the gene count file are not assumed to be in the order required for the differential expression analysis.

[Table of Contents](#Table-of-Contents)

### Metadata

> For downstream analysis, sample-level information related to the experimental design needs to be associated with the columns of the counts matrix. This should include experimental variables, both biological and technical, that could have an effect on expression levels. Examples [could] include cell type (basal, LP and ML in this experiment), genotype (wild-type, knock-out), phenotype (disease status, sex, age), sample treatment (drug, control) and batch information (date experiment was performed if samples were collected and analysed at distinct time points) to name just a few. ([1](#Citations))

Import a metadata file in which rows are sample identifiers, columns are metadata features (e.g., subject id, time point, etc.) and row/column intersections contain the value of the relevant feature for the relevant sample:

In [125]:
#Read in metadata
gMetadataDf <- read.csv(gMetadataFp, stringsAsFactors=FALSE)
dim(gMetadataDf)

In [126]:
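# read.csv() prefixed sample names that start with a digit with "X"; strip that prefix
# and normalize "px" to "x" so the count column names match the metadata sample names.
colnames(gUnorderedGeneCountsDf) <- gsub("^X", "", gsub("px", "x", colnames(gUnorderedGeneCountsDf)))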
head(gMetadataDf)
table(gMetadataDf$X)

SequenceRun,SequenceDate,Sample,SampleName,Patient.ID,RIN,X,Adult.Pediatric,Disease,Cell.type,Sorted.Cell.Type,Tissue.Source,RNA.seq.status,Reads
<chr>,<chr>,<chr>,<chr>,<int>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>
ca_ne_586_001_400_000_JAMIESON_SR_human-ensembl-grch38-r91,5/5/19,01id38cellHSC,38 HSC,32538,10.0,PedAML,Pediatric,AML,Stem,34+38-,PB,Completed,81234305
ca_ne_586_001_400_000_JAMIESON_SR_human-ensembl-grch38-r91,5/5/19,02id38cellPROGENITORS,38 Progenitors,32538,9.7,PedAML,Pediatric,AML,Prog,34+38+,PB,Completed,73196851
ca_ne_586_001_400_000_JAMIESON_SR_human-ensembl-grch38-r91,5/5/19,cell05id90HSC,90 HSC,22390,10.0,PedAML,Pediatric,AML,Stem,34+38-,BM,Completed,72625632
ca_ne_586_001_400_000_JAMIESON_SR_human-ensembl-grch38-r91,5/5/19,06id90cellPROGENITORS,90 Progenitors,22390,10.0,PedAML,Pediatric,AML,Prog,34+38+,BM,Completed,81721991
ca_ne_586_001_400_000_JAMIESON_SR_human-ensembl-grch38-r91,5/5/19,03id78cellHSC,78 HSC,28678,10.0,PedAML,Pediatric,AML,Stem,34+38-,PB,Completed,68483294
ca_ne_586_001_400_000_JAMIESON_SR_human-ensembl-grch38-r91,5/5/19,04id78cellPROGENITORS,78 Progenitors,28678,10.0,PedAML,Pediatric,AML,Prog,34+38+,PB,Completed,88894032



AdultAML   PedAML    PedNL 
       9       18        9 
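
The leading `X` removed from the count column names comes from `read.csv`, whose default `check.names=TRUE` runs headers through `make.names`. The sketch below reproduces that on two header values of the form seen in this notebook (the exact raw header strings are an assumption) and then applies the same cleanup:

```r
# Header values as they might appear in the counts file (assumed form)
rawNames <- c("09x00020pxPLUSPLUS", "cell05id90HSC")

# make.names() prepends "X" to names that start with a digit
importedNames <- make.names(rawNames)
importedNames

# Undo the prefix, then normalize the "px" spelling to "x"
cleanedNames <- gsub("^X", "", gsub("px", "x", importedNames))
cleanedNames
```

Note that `"^X"` would also strip a genuine leading `X` from a sample name; reading the file with `check.names=FALSE` is an alternative that avoids the prefix in the first place.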

In [127]:
gSampleNames = gMetadataDf[["Sample"]]

Check the dimensions of the count data and the metadata to ensure that the count dataframe has the same number of columns (samples) as the metadata dataframe has rows (again, samples), and that the sample names are the same in both: 

In [128]:
dim(gUnorderedGeneCountsDf)
dim(gMetadataDf)

all(colnames(gUnorderedGeneCountsDf) %in% gSampleNames)
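
The `%in%` test confirms only that every count column has a metadata row, not the reverse. A sketch of a stricter, two-directional comparison on made-up sample names (all names here are hypothetical):

```r
countSamples <- c("s3", "s1", "s2")         # hypothetical count-table column names
metaSamples  <- c("s1", "s2", "s3", "s4")   # hypothetical metadata sample names

# One-directional check passes even though "s4" has no counts
oneWayOk <- all(countSamples %in% metaSamples)

# Two-directional check, plus a listing of what is missing on each side
sameSamples         <- setequal(countSamples, metaSamples)
missingFromCounts   <- setdiff(metaSamples, countSamples)
missingFromMetadata <- setdiff(countSamples, metaSamples)

oneWayOk
sameSamples
missingFromCounts
```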

Assume that the order of the samples shown in the metadata is the desired order, and reorder the columns in the counts table to match it:

In [129]:
gGeneCountsDf = gUnorderedGeneCountsDf[gSampleNames]
head(gGeneCountsDf)

Unnamed: 0_level_0,01id38cellHSC,02id38cellPROGENITORS,cell05id90HSC,06id90cellPROGENITORS,03id78cellHSC,04id78cellPROGENITORS,05id00066HSC,06id00066PRO,04id11474HSC,03id11474HPC,⋯,15x12584xPLUSPLUS,16x12451xPLUSMINS,17x12451xPLUSPLUS,18x10720xPLUSPLUS,01pid24760ctHSC,02pid24760ctHPC,03pid24474ctHSC,04pid24474ctHPC,05pid25376ctHSC,06pid25376ctHPC
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
ENSG00000000003.10,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,⋯,15.0,0.0,0.0,21.0,70.0,35.0,4.0,116.0,98,47.0
ENSG00000000005.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,⋯,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0
ENSG00000000419.8,2227.0,2200.0,1705.0,2351.0,1872.0,1805.0,818.0,1745.0,1923.0,2125.0,⋯,1652.0,1082.0,2146.0,3122.0,1498.0,2724.0,1982.0,3125.0,476,3411.0
ENSG00000000457.9,580.36,347.01,322.03,143.54,96.22,277.31,488.89,840.19,529.39,424.19,⋯,158.72,178.19,1295.05,1056.94,2048.79,1238.85,128.74,158.04,0,656.66
ENSG00000000460.12,1137.64,651.99,1042.97,456.46,297.78,924.69,271.11,903.81,543.61,596.81,⋯,365.28,121.81,178.95,2008.06,878.21,1699.15,218.26,768.96,191,2071.34
ENSG00000000938.8,204.0,444.0,18.0,238.0,199.0,188.0,91.0,263.0,357.0,659.0,⋯,129.0,133.0,279.0,272.0,222.0,293.0,9.0,50.0,0,155.0


If the count file gene identifiers do NOT include version numbers (e.g., the ".4" part in a gene identifier like "ENSG00000268020.4"), then it is necessary to truncate the version information from the public annotation data to be used below in order to match the annotation data gene identifiers to the count file gene identifiers.  Set the flag for version removal accordingly:

In [130]:
gRemoveVersion <- FALSE
#gRemoveVersion <- TRUE
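
One way to decide how to set this flag is to test whether the identifiers end in a `.N` suffix. A small sketch, using the example identifier from the text plus two others written in the same style (the vector `exampleIds` is illustrative):

```r
# The first two identifiers carry a ".N" version suffix; the third does not
exampleIds <- c("ENSG00000268020.4", "ENSG00000000003.10", "ENSG00000000419")

hasVersion <- grepl("\\.[0-9]+$", exampleIds)
hasVersion

# Stripping the suffix leaves the bare Ensembl gene id
strippedIds <- sub("\\.[0-9]+$", "", exampleIds)
strippedIds
```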

In [131]:

if (!is.null(gAnnotationsRdataFilename)) {
    gAnnotationsRdataFp = file.path(gReferenceDir, gAnnotationsRdataFilename)  
    
    # Import the R data object containing gene annotations and load its dataframe into a variable:
    gAnnotationEnv = loadToEnvironment(gAnnotationsRdataFp)
    gGeneTypeAnnotationsDf = gAnnotationEnv$ANNOT
    
    head(gGeneTypeAnnotationsDf)
} else {
    print("No annotations provided.")
}

gene_type,gene_id,transcript_id
<chr>,<chr>,<chr>
transcribed_unprocessed_pseudogene,ENSG00000223972.5,ENST00000456328.2
transcribed_unprocessed_pseudogene,ENSG00000223972.5,ENST00000450305.2
unprocessed_pseudogene,ENSG00000227232.5,ENST00000488147.1
miRNA,ENSG00000278267.1,ENST00000619216.1
lincRNA,ENSG00000243485.5,ENST00000473358.1
lincRNA,ENSG00000243485.5,ENST00000469289.1


[Table of Contents](#Table-of-Contents)

## Gene Separation By Coding Status

Gene annotations are records of each gene's identifier and symbol, where the gene begins and ends on the genome sequence, and whether it is anticipated to be a coding gene or not.  There are multiple sources of gene annotations.

   > Here we use the human gene annotations from the Gencode project, Release 19 (GRCh37.p13).


In [132]:
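# Build an Ensembl -> Entrez -> symbol lookup table from the org.Hs.eg.db mappings
ENS2EG <- toTable(org.Hs.egENSEMBL2EG)
EG2SYM <- toTable(org.Hs.egSYMBOL2EG)
# Note: passing the whole ENS2EG data frame as ens_id yields two columns,
# ens_id.gene_id and ens_id.ensembl_id
ENS2EG2SYM <- data.frame(gene_id=ENS2EG$gene_id, ens_id=ENS2EG, 
                         SYM=EG2SYM[match(ENS2EG$gene_id,EG2SYM$gene_id),"symbol"])
head(ENS2EG2SYM)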

library(RColorBrewer)
load(gAnnotationFp_hg19)
ANNOT_protein_coding <- subset(ANNOT, gene_type == "protein_coding")
ANNOT_ncRNA <- subset(ANNOT, gene_type %in% c("lincRNA", "antisense", "processed_transcript","sense_overlapping", "sense_intronic") )

#make list of IDs to query
protein_coding_ids <- ANNOT_protein_coding$gene_id
ncRNA_ids <- ANNOT_ncRNA$gene_id


gene_id,ens_id.gene_id,ens_id.ensembl_id,SYM
<fct>,<chr>,<chr>,<fct>
1,1,ENSG00000121410,A1BG
2,2,ENSG00000175899,A2M
3,3,ENSG00000256069,A2MP1
9,9,ENSG00000171428,NAT1
10,10,ENSG00000156006,NAT2
12,12,ENSG00000196136,SERPINA3


In [133]:
dim(subset(gGeneTypeAnnotationsDf, 
           gene_type %in% c("protein_coding")))
dim(subset(gGeneTypeAnnotationsDf, 
           gene_type %in% c("lincRNA", "antisense", "processed_transcript","sense_overlapping", "sense_intronic")))

In [134]:
splitGeneCountsByCodingStatus = function(geneCountDf, gtfDf, removeVersion=FALSE){
    #Subset GTF by protein coding and noncoding
    ANNOT_protein_coding <- subset(gtfDf, gene_type == "protein_coding")
    ANNOT_ncRNA <- subset(gtfDf, gene_type %in% c("lincRNA", "antisense", "processed_transcript","sense_overlapping", "sense_intronic") )

    #make list of IDs to query
    protein_coding_ids <- ANNOT_protein_coding$gene_id
    ncRNA_ids <- ANNOT_ncRNA$gene_id
    
    if (removeVersion){
        protein_coding_ids <- removeAccessionVersion(protein_coding_ids)
        ncRNA_ids <- removeAccessionVersion(ncRNA_ids)        
    }

    #subset geneCounts
    geneCount_protein_coding <- subset(geneCountDf, row.names(geneCountDf) %in% protein_coding_ids)
    geneCount_ncRNA <- subset(geneCountDf, row.names(geneCountDf) %in% ncRNA_ids)
    return(list(codingGeneCountDf=geneCount_protein_coding, noncodingGeneCountDf=geneCount_ncRNA))
}

removeAccessionVersion = function(accessionVector){
    return (gsub("\\..*","",accessionVector))
}

writeSubsetCounts = function(subsetCountsDf, outputDir, runName, fileSuffix){
    fileName = sprintf(fileSuffix, runName)
    write.csv(subsetCountsDf, file.path(outputDir, fileName))
    print(paste0("Output file: ",fileName))
}

writeSubsetsCounts = function(splitGeneCountDfsList, outputDir, runName){
    writeSubsetCounts(splitGeneCountDfsList$codingGeneCountDf, outputDir, runName,"%s_raw_pc_genes_counts.csv")
    writeSubsetCounts(splitGeneCountDfsList$noncodingGeneCountDf, outputDir, runName,"%s_raw_nc_genes_counts.csv")
}

Split the count data into coding and non-coding subsets, and extract each subset into a file based on the annotation file provided in the input parameters:

In [135]:
gSplitGeneCountDfsList = splitGeneCountsByCodingStatus(gGeneCountsDf, ANNOT, gRemoveVersion)

In [136]:
dim(gGeneCountsDf)
dim(gSplitGeneCountDfsList$codingGeneCountDf)
dim(gSplitGeneCountDfsList$noncodingGeneCountDf)

In [60]:
writeSubsetsCounts(gSplitGeneCountDfsList, gOutputDir, gRunName)

[1] "Output file: 20200228_DeWerf_Human_PediatricAML_data_integration_20200228163803_raw_pc_genes_counts.csv"
[1] "Output file: 20200228_DeWerf_Human_PediatricAML_data_integration_20200228163803_raw_nc_genes_counts.csv"
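
The split logic can be traced on a toy example. The sketch below uses made-up gene ids and the same gene-type lists as `splitGeneCountsByCodingStatus`; it also shows that genes whose type is in neither list (here a miRNA) end up in neither output. All `toy*` names are illustrative.

```r
toyAnnotDf <- data.frame(
    gene_type = c("protein_coding", "lincRNA", "miRNA", "antisense"),
    gene_id   = c("G1.1", "G2.1", "G3.1", "G4.1"),
    stringsAsFactors = FALSE
)
toyCountsDf <- data.frame(s1 = c(5, 0, 2, 9), s2 = c(7, 1, 0, 3),
                          row.names = c("G1.1", "G2.1", "G3.1", "G4.1"))

ncTypes <- c("lincRNA", "antisense", "processed_transcript",
             "sense_overlapping", "sense_intronic")
codingIds    <- toyAnnotDf$gene_id[toyAnnotDf$gene_type == "protein_coding"]
noncodingIds <- toyAnnotDf$gene_id[toyAnnotDf$gene_type %in% ncTypes]

toyCodingDf    <- toyCountsDf[rownames(toyCountsDf) %in% codingIds, , drop = FALSE]
toyNoncodingDf <- toyCountsDf[rownames(toyCountsDf) %in% noncodingIds, , drop = FALSE]

# Genes in neither subset are dropped from both outputs
nUnassigned <- nrow(toyCountsDf) - nrow(toyCodingDf) - nrow(toyNoncodingDf)
nUnassigned
```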


[Table of Contents](#Table-of-Contents)

## Data Integration



Integrate the count data and the metadata into an edgeR DGEList object for use in downstream analysis:

> Our DGEList-object contains a samples data frame that stores both ... group ... and batch ... information, each of which consists of ... distinct levels. Note that within x$samples, library sizes are automatically calculated for each sample and normalisation factors are set to 1. ([1](#Citations))

In [137]:
# To analyze only coding genes, run:
gGeneType = "pc"
gRelevantGeneCountsDf <- gSplitGeneCountDfsList$codingGeneCountDf

# To analyze only non-coding genes, run instead:
# gGeneType = "nc"
# gRelevantGeneCountsDf <- gSplitGeneCountDfsList$noncodingGeneCountDf

# To analyze coding and non-coding genes together, run instead:
# gGeneType = "all"
# gRelevantGeneCountsDf <- gGeneCountsDf

In [138]:
# create a DGEList object
makeDgeList = function(countsDf, metadataDf, groupColName){
    # remove the accession version (.##etc) from the ensembl gene id
    id_list <- gsub("[.].*$","", row.names(countsDf))
    row.names(countsDf) <- id_list
    # Note: in the DGEList constructor, the arguments
    # lib.size = colSums(countsDf),
    # norm.factors = rep(1, ncol(countsDf)),
    # genes = NULL, and remove.zeros = FALSE
    # are all identical to the default values you'd get if you didn't
    # specify these arguments at all.
    x <- DGEList(counts = countsDf, lib.size = colSums(countsDf),
                 norm.factors = rep(1, ncol(countsDf)), samples = metadataDf,
                 group = metadataDf[[groupColName]], genes = NULL, remove.zeros = FALSE)
    return(x)
}
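
The `lib.size` passed to `DGEList` is simply the per-sample column sum of the count table. The base-R sketch below (no edgeR needed; the toy ids and names are illustrative) shows the two preparatory steps from `makeDgeList`, with an added uniqueness check, since stripping version suffixes could in principle produce duplicate ids and data frame row names must be unique.

```r
toyCountsDf <- data.frame(s1 = c(10, 0, 90), s2 = c(5, 5, 40),
                          row.names = c("ENSG01.2", "ENSG02.7", "ENSG03.1"))

# Strip the version suffix, as makeDgeList does
strippedIds <- gsub("[.].*$", "", rownames(toyCountsDf))

# Check uniqueness before assigning the new row names
stopifnot(anyDuplicated(strippedIds) == 0)
rownames(toyCountsDf) <- strippedIds

# Library size per sample = total counts in that column
toyLibSizes <- colSums(toyCountsDf)
toyLibSizes
```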

In [139]:
gGroupCategory = "X" # e.g., "day"

In [140]:
w.prog <- which(gMetadataDf$Cell.type == "Prog")
gRelevantGeneCountsDf.prog <- gRelevantGeneCountsDf[,w.prog]
gMetadataDf.prog <- gMetadataDf[w.prog,]
w.stem <- which(gMetadataDf$Cell.type == "Stem")
gRelevantGeneCountsDf.stem <- gRelevantGeneCountsDf[,w.stem]
gMetadataDf.stem <- gMetadataDf[w.stem,]
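
Subsetting the count columns by positions derived from the metadata rows is only valid when the two are in the same order, which the earlier reordering step is meant to guarantee. A sketch of the same pattern on toy data, with that assumption made explicit (all names are illustrative):

```r
toyMetaDf <- data.frame(Sample = c("a", "b", "c", "d"),
                        Cell.type = c("Stem", "Prog", "Stem", "Prog"),
                        stringsAsFactors = FALSE)
toyCountsDf <- data.frame(a = 1:2, b = 3:4, c = 5:6, d = 7:8,
                          row.names = c("g1", "g2"))

# Positional subsetting is only safe if columns and rows are in the same order
stopifnot(identical(colnames(toyCountsDf), toyMetaDf$Sample))

progIdx <- which(toyMetaDf$Cell.type == "Prog")
toyCountsProg <- toyCountsDf[, progIdx, drop = FALSE]
toyMetaProg   <- toyMetaDf[progIdx, ]

# The subset columns and subset metadata rows still line up
identical(colnames(toyCountsProg), toyMetaProg$Sample)
```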

In [141]:
gDgeList = makeDgeList(gRelevantGeneCountsDf, gMetadataDf, gGroupCategory)
names(gDgeList)

gDgeList.prog = makeDgeList(gRelevantGeneCountsDf.prog, gMetadataDf.prog, gGroupCategory)
names(gDgeList.prog)

gDgeList.stem = makeDgeList(gRelevantGeneCountsDf.stem, gMetadataDf.stem, gGroupCategory)
names(gDgeList.stem)


As a sanity check, look at representative content from the DGEList:

In [142]:
dim(gDgeList$counts)
dim(gDgeList$samples)
table(gDgeList$samples$X)
head(gDgeList$counts)
head(gDgeList$samples)
dim(gDgeList.prog$counts)
dim(gDgeList.prog$samples)
table(gDgeList.prog$samples$X)
head(gDgeList.prog$counts)
head(gDgeList.prog$samples)
dim(gDgeList.stem$counts)
dim(gDgeList.stem$samples)
table(gDgeList.stem$samples$X)
head(gDgeList.stem$counts)
head(gDgeList.stem$samples)


AdultAML   PedAML    PedNL 
       9       18        9 

Unnamed: 0,01id38cellHSC,02id38cellPROGENITORS,cell05id90HSC,06id90cellPROGENITORS,03id78cellHSC,04id78cellPROGENITORS,05id00066HSC,06id00066PRO,04id11474HSC,03id11474HPC,⋯,15x12584xPLUSPLUS,16x12451xPLUSMINS,17x12451xPLUSPLUS,18x10720xPLUSPLUS,01pid24760ctHSC,02pid24760ctHPC,03pid24474ctHSC,04pid24474ctHPC,05pid25376ctHSC,06pid25376ctHPC
ENSG00000000003,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,⋯,15.0,0.0,0.0,21.0,70.0,35.0,4.0,116.0,98,47.0
ENSG00000000005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,⋯,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0
ENSG00000000419,2227.0,2200.0,1705.0,2351.0,1872.0,1805.0,818.0,1745.0,1923.0,2125.0,⋯,1652.0,1082.0,2146.0,3122.0,1498.0,2724.0,1982.0,3125.0,476,3411.0
ENSG00000000457,580.36,347.01,322.03,143.54,96.22,277.31,488.89,840.19,529.39,424.19,⋯,158.72,178.19,1295.05,1056.94,2048.79,1238.85,128.74,158.04,0,656.66
ENSG00000000460,1137.64,651.99,1042.97,456.46,297.78,924.69,271.11,903.81,543.61,596.81,⋯,365.28,121.81,178.95,2008.06,878.21,1699.15,218.26,768.96,191,2071.34
ENSG00000000938,204.0,444.0,18.0,238.0,199.0,188.0,91.0,263.0,357.0,659.0,⋯,129.0,133.0,279.0,272.0,222.0,293.0,9.0,50.0,0,155.0


Unnamed: 0_level_0,group,lib.size,norm.factors,SequenceRun,SequenceDate,Sample,SampleName,Patient.ID,RIN,X,Adult.Pediatric,Disease,Cell.type,Sorted.Cell.Type,Tissue.Source,RNA.seq.status,Reads
Unnamed: 0_level_1,<fct>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<int>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>
01id38cellHSC,PedAML,41083390,1,ca_ne_586_001_400_000_JAMIESON_SR_human-ensembl-grch38-r91,5/5/19,01id38cellHSC,38 HSC,32538,10.0,PedAML,Pediatric,AML,Stem,34+38-,PB,Completed,81234305
02id38cellPROGENITORS,PedAML,33900321,1,ca_ne_586_001_400_000_JAMIESON_SR_human-ensembl-grch38-r91,5/5/19,02id38cellPROGENITORS,38 Progenitors,32538,9.7,PedAML,Pediatric,AML,Prog,34+38+,PB,Completed,73196851
cell05id90HSC,PedAML,32788996,1,ca_ne_586_001_400_000_JAMIESON_SR_human-ensembl-grch38-r91,5/5/19,cell05id90HSC,90 HSC,22390,10.0,PedAML,Pediatric,AML,Stem,34+38-,BM,Completed,72625632
06id90cellPROGENITORS,PedAML,33228834,1,ca_ne_586_001_400_000_JAMIESON_SR_human-ensembl-grch38-r91,5/5/19,06id90cellPROGENITORS,90 Progenitors,22390,10.0,PedAML,Pediatric,AML,Prog,34+38+,BM,Completed,81721991
03id78cellHSC,PedAML,28623523,1,ca_ne_586_001_400_000_JAMIESON_SR_human-ensembl-grch38-r91,5/5/19,03id78cellHSC,78 HSC,28678,10.0,PedAML,Pediatric,AML,Stem,34+38-,PB,Completed,68483294
04id78cellPROGENITORS,PedAML,41665738,1,ca_ne_586_001_400_000_JAMIESON_SR_human-ensembl-grch38-r91,5/5/19,04id78cellPROGENITORS,78 Progenitors,28678,10.0,PedAML,Pediatric,AML,Prog,34+38+,PB,Completed,88894032



AdultAML   PedAML    PedNL 
       5        9        6 

Unnamed: 0,02id38cellPROGENITORS,06id90cellPROGENITORS,04id78cellPROGENITORS,06id00066PRO,03id11474HPC,01id11251HPC,01x00077xPROGENIT,03x11379xHPCxxxxx,05x00047xPLUSPLUS,06x00077xPLUSPLUS,07x00023xPLUSPLUS,09x00020xPLUSPLUS,11x00082xPLUSPLUS,14x12488xPLUSPLUS,15x12584xPLUSPLUS,17x12451xPLUSPLUS,18x10720xPLUSPLUS,02pid24760ctHPC,04pid24474ctHPC,06pid25376ctHPC
ENSG00000000003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,18.0,49.0,0.0,0.0,0.0,10.0,15.0,0.0,21.0,35.0,116.0,47.0
ENSG00000000005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ENSG00000000419,2200.0,2351.0,1805.0,1745.0,2125.0,1777.0,1225.0,2308.0,2350.0,4884.0,1039.0,1646.0,2294.0,3579.0,1652.0,2146.0,3122.0,2724.0,3125.0,3411.0
ENSG00000000457,347.01,143.54,277.31,840.19,424.19,20.94,351.54,833.81,354.12,1435.02,320.4,513.23,102.96,456.88,158.72,1295.05,1056.94,1238.85,158.04,656.66
ENSG00000000460,651.99,456.46,924.69,903.81,596.81,292.06,730.46,1320.19,1377.88,2543.98,374.6,199.77,116.04,1413.12,365.28,178.95,2008.06,1699.15,768.96,2071.34
ENSG00000000938,444.0,238.0,188.0,263.0,659.0,286.0,43.0,443.0,466.0,436.0,169.0,238.0,13.0,181.0,129.0,279.0,272.0,293.0,50.0,155.0


Unnamed: 0_level_0,group,lib.size,norm.factors,SequenceRun,SequenceDate,Sample,SampleName,Patient.ID,RIN,X,Adult.Pediatric,Disease,Cell.type,Sorted.Cell.Type,Tissue.Source,RNA.seq.status,Reads
Unnamed: 0_level_1,<fct>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<int>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>
02id38cellPROGENITORS,PedAML,33900321,1,ca_ne_586_001_400_000_JAMIESON_SR_human-ensembl-grch38-r91,5/5/19,02id38cellPROGENITORS,38 Progenitors,32538,9.7,PedAML,Pediatric,AML,Prog,34+38+,PB,Completed,73196851
06id90cellPROGENITORS,PedAML,33228834,1,ca_ne_586_001_400_000_JAMIESON_SR_human-ensembl-grch38-r91,5/5/19,06id90cellPROGENITORS,90 Progenitors,22390,10.0,PedAML,Pediatric,AML,Prog,34+38+,BM,Completed,81721991
04id78cellPROGENITORS,PedAML,41665738,1,ca_ne_586_001_400_000_JAMIESON_SR_human-ensembl-grch38-r91,5/5/19,04id78cellPROGENITORS,78 Progenitors,28678,10.0,PedAML,Pediatric,AML,Prog,34+38+,PB,Completed,88894032
06id00066PRO,PedAML,30096534,1,ca_ne_618_001_400_070_JAMIESON_SR_human-ensembl-ghrc38-r91,6/26/19,06id00066PRO,66 Progenitors,22666,9.8,PedAML,Pediatric,AML,Prog,34+38+,BM,Completed,77343883
03id11474HPC,AdultAML,24344437,1,ca_ne_618_001_400_070_JAMIESON_SR_human-ensembl-ghrc38-r91,6/26/19,03id11474HPC,11474 HPC,11474,9.6,AdultAML,Adult,AML,Prog,34+38+,PB,Completed,79871449
01id11251HPC,AdultAML,23597968,1,ca_ne_618_001_400_070_JAMIESON_SR_human-ensembl-ghrc38-r91,6/26/19,01id11251HPC,11251 HPC,11251,9.3,AdultAML,Adult,AML,Prog,34+38+,BM,Completed,80917720



AdultAML   PedAML    PedNL 
       4        9        3 

Unnamed: 0,01id38cellHSC,cell05id90HSC,03id78cellHSC,05id00066HSC,04id11474HSC,02id11251HSC,02x11379xHSCxxxxx,04x00068xPLUSMINS,08x00020xPLUSMINS,10x00082xPLUSMINS,12x12484xPLUSMINS,13x12488xPLUSMINS,16x12451xPLUSMINS,01pid24760ctHSC,03pid24474ctHSC,05pid25376ctHSC
ENSG00000000003,0.0,0.0,1.0,0.0,0.0,3.0,0.0,0.0,602.0,57,307.0,54.0,0.0,70.0,4.0,98
ENSG00000000005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0
ENSG00000000419,2227.0,1705.0,1872.0,818.0,1923.0,1759.0,2213.0,910.0,2569.0,1308,3837.0,1712.0,1082.0,1498.0,1982.0,476
ENSG00000000457,580.36,322.03,96.22,488.89,529.39,160.2,844.14,145.25,374.05,178,737.5,151.32,178.19,2048.79,128.74,0
ENSG00000000460,1137.64,1042.97,297.78,271.11,543.61,352.8,626.86,281.75,41.95,0,323.5,378.68,121.81,878.21,218.26,191
ENSG00000000938,204.0,18.0,199.0,91.0,357.0,70.0,169.0,114.0,0.0,0,467.0,230.0,133.0,222.0,9.0,0


Unnamed: 0_level_0,group,lib.size,norm.factors,SequenceRun,SequenceDate,Sample,SampleName,Patient.ID,RIN,X,Adult.Pediatric,Disease,Cell.type,Sorted.Cell.Type,Tissue.Source,RNA.seq.status,Reads
Unnamed: 0_level_1,<fct>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<int>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>
01id38cellHSC,PedAML,41083390,1,ca_ne_586_001_400_000_JAMIESON_SR_human-ensembl-grch38-r91,5/5/19,01id38cellHSC,38 HSC,32538,10.0,PedAML,Pediatric,AML,Stem,34+38-,PB,Completed,81234305
cell05id90HSC,PedAML,32788996,1,ca_ne_586_001_400_000_JAMIESON_SR_human-ensembl-grch38-r91,5/5/19,cell05id90HSC,90 HSC,22390,10.0,PedAML,Pediatric,AML,Stem,34+38-,BM,Completed,72625632
03id78cellHSC,PedAML,28623523,1,ca_ne_586_001_400_000_JAMIESON_SR_human-ensembl-grch38-r91,5/5/19,03id78cellHSC,78 HSC,28678,10.0,PedAML,Pediatric,AML,Stem,34+38-,PB,Completed,68483294
05id00066HSC,PedAML,15914869,1,ca_ne_618_001_400_070_JAMIESON_SR_human-ensembl-ghrc38-r91,6/26/19,05id00066HSC,66 HSC,22666,7.7,PedAML,Pediatric,AML,Stem,34+38-,BM,Completed,77519907
04id11474HSC,AdultAML,25493910,1,ca_ne_618_001_400_070_JAMIESON_SR_human-ensembl-ghrc38-r91,6/26/19,04id11474HSC,11474 HSC,11474,10.0,AdultAML,Adult,AML,Stem,34+38-,PB,Completed,79123032
02id11251HSC,AdultAML,22035310,1,ca_ne_618_001_400_070_JAMIESON_SR_human-ensembl-ghrc38-r91,6/26/19,02id11251HSC,11251 HSC,11251,9.3,AdultAML,Adult,AML,Stem,34+38-,BM,Completed,81776486


[Table of Contents](#Table-of-Contents)

## Annotation Integration

Next, extend the DGEList object with gene-level annotation (gene symbol and Entrez ID) for each gene that has count data, looked up by its Ensembl ID.

> A second data frame named genes in the DGEList-object is used to store gene-level information associated with rows of the counts matrix. This information can be retrieved using organism specific packages such as Mus.musculus (Bioconductor Core Team 2016b) for mouse (or Homo.sapiens (Bioconductor Core Team 2016a) for human) ....
>
> The type of information that can be retrieved includes gene symbols, gene names, chromosome names and locations, Entrez gene IDs, Refseq gene IDs and Ensembl gene IDs to name just a few. .... Mus.musculus [and other organism-specific packages] packages information from various sources and allows users to choose between many different gene IDs as the key. ([1](#Citations))

In [143]:
getGeneDf = function(dgeList, organismPackage){
    geneid <-  rownames(dgeList)
    genes <- select(organismPackage, keys=geneid, columns=c("SYMBOL", "ENSEMBL", "ENTREZID"), 
                    keytype="ENSEMBL")
    return(genes)
}

In [144]:
gRawGenesDf = getGeneDf(gDgeList, gOrganismPackage)
dim(gRawGenesDf)
gRawGenesDf.prog = getGeneDf(gDgeList.prog, gOrganismPackage)
dim(gRawGenesDf.prog)
gRawGenesDf.stem = getGeneDf(gDgeList.stem, gOrganismPackage)
dim(gRawGenesDf.stem)

'select()' returned 1:many mapping between keys and columns


'select()' returned 1:many mapping between keys and columns


'select()' returned 1:many mapping between keys and columns


In [145]:
# Add gene type to gRawGenesDf
gGeneTypeAnnotationsDf.rmdec <- gGeneTypeAnnotationsDf
gGeneTypeAnnotationsDf.rmdec$gene_id <- gsub("\\..*","",gGeneTypeAnnotationsDf.rmdec$gene_id)
gRawGenesDf$gene_type <- gGeneTypeAnnotationsDf$gene_type[match(gRawGenesDf$ENSEMBL, gGeneTypeAnnotationsDf.rmdec$gene_id)]
gRawGenesDf.prog$gene_type <- gGeneTypeAnnotationsDf$gene_type[match(gRawGenesDf.prog$ENSEMBL, gGeneTypeAnnotationsDf.rmdec$gene_id)]
gRawGenesDf.stem$gene_type <- gGeneTypeAnnotationsDf$gene_type[match(gRawGenesDf.stem$ENSEMBL, gGeneTypeAnnotationsDf.rmdec$gene_id)]
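
The lookup relies on `match`, which returns the position of the first hit for each key and `NA` where there is none. Because the annotation table has one row per transcript, a gene id can appear several times, and only its first row is used. A toy sketch of that behaviour (ids and names are made up):

```r
toyGenesDf <- data.frame(ENSEMBL = c("ENSG02", "ENSG01", "ENSG09"),
                         stringsAsFactors = FALSE)
toyAnnotDf <- data.frame(gene_id   = c("ENSG01.3", "ENSG02.1", "ENSG02.1"),
                         gene_type = c("protein_coding", "lincRNA", "lincRNA"),
                         stringsAsFactors = FALSE)

# Remove the version suffix so the ids are comparable
annotIds <- gsub("\\..*", "", toyAnnotDf$gene_id)

# First matching annotation row per gene, or NA when the gene is absent
hitIdx <- match(toyGenesDf$ENSEMBL, annotIds)
toyGenesDf$gene_type <- toyAnnotDf$gene_type[hitIdx]
toyGenesDf
```

Genes missing from the annotation (here `ENSG09`) receive `NA` as their gene type rather than raising an error, so counting `NA` values afterwards is a cheap completeness check.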

In [146]:
head(gRawGenesDf.prog)

ENSEMBL,ENTREZID,SYMBOL,gene_type
<chr>,<chr>,<chr>,<chr>
ENSG00000000003,7105,TSPAN6,protein_coding
ENSG00000000005,64102,TNMD,protein_coding
ENSG00000000419,8813,DPM1,protein_coding
ENSG00000000457,57147,SCYL3,protein_coding
ENSG00000000460,55732,C1orf112,protein_coding
ENSG00000000938,2268,FGR,protein_coding


> [G]ene IDs may not map one-to-one to the gene information of interest. It is important to check for duplicated gene IDs. ([1](#Citations))

Examine how many records in the annotation dataset have the same id (for the gene identifier type, either ENSEMBL or ENTREZID, set below) as another record occurring earlier in the dataset:

In [147]:
gGeneIdCol <- "ENSEMBL"
# gGeneIdCol <- "ENTREZID"

In [148]:
gDuplicatesMask = duplicated(gRawGenesDf[[gGeneIdCol]])
sum(gDuplicatesMask) # Sum counts only those with a value of TRUE
gDuplicatesMask.prog = duplicated(gRawGenesDf.prog[[gGeneIdCol]])
sum(gDuplicatesMask.prog) # Sum counts only those with a value of TRUE
gDuplicatesMask.stem = duplicated(gRawGenesDf.stem[[gGeneIdCol]])
sum(gDuplicatesMask.stem) # Sum counts only those with a value of TRUE

Note that this sum includes only the second (or greater) instances of records for each gene id; the first record for each gene id is not included in this duplicate set.
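
A toy illustration of this behaviour of `duplicated`, and of how to list every row involved in a duplication, first occurrences included (the data frame is made up):

```r
# Toy lookup result: ENSG02 maps to two Entrez ids (a 1:many mapping)
toyGenesDf <- data.frame(ENSEMBL  = c("ENSG01", "ENSG02", "ENSG02", "ENSG03"),
                         ENTREZID = c("11", "22", "23", "33"),
                         stringsAsFactors = FALSE)

dupMask <- duplicated(toyGenesDf$ENSEMBL)
dupMask        # only the second ENSG02 row is flagged
sum(dupMask)

# Every row that shares its id with another row, including the first occurrence
allInvolved <- toyGenesDf$ENSEMBL %in% toyGenesDf$ENSEMBL[dupMask]
toyGenesDf[allInvolved, ]
```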

Write a file of the duplicate records that can be examined if desired: 

In [80]:
writeOutRemovedDuplicates = function(countsDf, duplicatesMask, outputDir, runName, geneType, celltype){
    fileName = sprintf("%s_duplicated_%s_%s_genes_records.csv",runName, geneType, celltype)
    duplicatedCountsDf = countsDf[duplicatesMask,]
    write.csv(duplicatedCountsDf, file.path(outputDir, fileName))
    print(paste0("Output file: ",fileName))
}

In [149]:
writeOutRemovedDuplicates(gRawGenesDf, gDuplicatesMask, gOutputDir, gRunName, gGeneType, "ProgStem")
writeOutRemovedDuplicates(gRawGenesDf.prog, gDuplicatesMask.prog, gOutputDir, gRunName, gGeneType, "Progenitors")
writeOutRemovedDuplicates(gRawGenesDf.stem, gDuplicatesMask.stem, gOutputDir, gRunName, gGeneType, "Stem")

[1] "Output file: 20200228_DeWerf_Human_PediatricAML_data_integration_20200318124751_duplicated_pc_ProgStem_genes_records.csv"
[1] "Output file: 20200228_DeWerf_Human_PediatricAML_data_integration_20200318124751_duplicated_pc_Progenitors_genes_records.csv"
[1] "Output file: 20200228_DeWerf_Human_PediatricAML_data_integration_20200318124751_duplicated_pc_Stem_genes_records.csv"



> As a basic approach, duplicate records for gene ids already existing in the annotation are removed:



In [150]:
gDeduplicatedGenesDf = gRawGenesDf[!duplicated(gRawGenesDf[[gGeneIdCol]]),]
gDeduplicatedGenesDf.prog = gRawGenesDf.prog[!duplicated(gRawGenesDf.prog[[gGeneIdCol]]),]
gDeduplicatedGenesDf.stem = gRawGenesDf.stem[!duplicated(gRawGenesDf.stem[[gGeneIdCol]]),]

After deduplication, check the dimensions of the count data and the gene annotation data to ensure that the count dataframe has the same number of rows (genes) as the gene annotation dataframe has rows (again, genes), and that the gene names are the same in both:

In [151]:
dim(gDgeList.prog$counts)
dim(gDeduplicatedGenesDf.prog)

all(rownames(gDgeList.prog$counts) %in% gDeduplicatedGenesDf.prog[[gGeneIdCol]])
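
The membership test does not show whether the two tables are in the same row order, which matters when the annotation is attached to the counts row by row. A sketch of an order check and a realignment on toy data (all names are illustrative; this is an optional extra check, not part of the original workflow):

```r
toyCounts <- matrix(1:6, nrow = 3,
                    dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
toyGenesDf <- data.frame(ENSEMBL = c("g1", "g3", "g2"),
                         SYMBOL  = c("A", "C", "B"),
                         stringsAsFactors = FALSE)

# Membership check passes, but the rows are not in the same order
membershipOk <- all(rownames(toyCounts) %in% toyGenesDf$ENSEMBL)
sameOrder    <- identical(rownames(toyCounts), toyGenesDf$ENSEMBL)

# Realign the annotation rows to the count rows
alignedGenesDf <- toyGenesDf[match(rownames(toyCounts), toyGenesDf$ENSEMBL), ]
identical(rownames(toyCounts), alignedGenesDf$ENSEMBL)
```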

Add the annotation information to the DGEList object:

In [152]:
gDgeList$genes = gDeduplicatedGenesDf
names(gDgeList)
gDgeList.prog$genes = gDeduplicatedGenesDf.prog
names(gDgeList.prog)
gDgeList.stem$genes = gDeduplicatedGenesDf.stem
names(gDgeList.stem)

As a sanity check, look at representative content from the DGEList:

In [153]:
head(gDgeList.prog$counts)
head(gDgeList.prog$samples)
head(gDgeList.prog$genes)

Unnamed: 0,02id38cellPROGENITORS,06id90cellPROGENITORS,04id78cellPROGENITORS,06id00066PRO,03id11474HPC,01id11251HPC,01x00077xPROGENIT,03x11379xHPCxxxxx,05x00047xPLUSPLUS,06x00077xPLUSPLUS,07x00023xPLUSPLUS,09x00020xPLUSPLUS,11x00082xPLUSPLUS,14x12488xPLUSPLUS,15x12584xPLUSPLUS,17x12451xPLUSPLUS,18x10720xPLUSPLUS,02pid24760ctHPC,04pid24474ctHPC,06pid25376ctHPC
ENSG00000000003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,18.0,49.0,0.0,0.0,0.0,10.0,15.0,0.0,21.0,35.0,116.0,47.0
ENSG00000000005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ENSG00000000419,2200.0,2351.0,1805.0,1745.0,2125.0,1777.0,1225.0,2308.0,2350.0,4884.0,1039.0,1646.0,2294.0,3579.0,1652.0,2146.0,3122.0,2724.0,3125.0,3411.0
ENSG00000000457,347.01,143.54,277.31,840.19,424.19,20.94,351.54,833.81,354.12,1435.02,320.4,513.23,102.96,456.88,158.72,1295.05,1056.94,1238.85,158.04,656.66
ENSG00000000460,651.99,456.46,924.69,903.81,596.81,292.06,730.46,1320.19,1377.88,2543.98,374.6,199.77,116.04,1413.12,365.28,178.95,2008.06,1699.15,768.96,2071.34
ENSG00000000938,444.0,238.0,188.0,263.0,659.0,286.0,43.0,443.0,466.0,436.0,169.0,238.0,13.0,181.0,129.0,279.0,272.0,293.0,50.0,155.0


Unnamed: 0_level_0,group,lib.size,norm.factors,SequenceRun,SequenceDate,Sample,SampleName,Patient.ID,RIN,X,Adult.Pediatric,Disease,Cell.type,Sorted.Cell.Type,Tissue.Source,RNA.seq.status,Reads
Unnamed: 0_level_1,<fct>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<int>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>
02id38cellPROGENITORS,PedAML,33900321,1,ca_ne_586_001_400_000_JAMIESON_SR_human-ensembl-grch38-r91,5/5/19,02id38cellPROGENITORS,38 Progenitors,32538,9.7,PedAML,Pediatric,AML,Prog,34+38+,PB,Completed,73196851
06id90cellPROGENITORS,PedAML,33228834,1,ca_ne_586_001_400_000_JAMIESON_SR_human-ensembl-grch38-r91,5/5/19,06id90cellPROGENITORS,90 Progenitors,22390,10.0,PedAML,Pediatric,AML,Prog,34+38+,BM,Completed,81721991
04id78cellPROGENITORS,PedAML,41665738,1,ca_ne_586_001_400_000_JAMIESON_SR_human-ensembl-grch38-r91,5/5/19,04id78cellPROGENITORS,78 Progenitors,28678,10.0,PedAML,Pediatric,AML,Prog,34+38+,PB,Completed,88894032
06id00066PRO,PedAML,30096534,1,ca_ne_618_001_400_070_JAMIESON_SR_human-ensembl-ghrc38-r91,6/26/19,06id00066PRO,66 Progenitors,22666,9.8,PedAML,Pediatric,AML,Prog,34+38+,BM,Completed,77343883
03id11474HPC,AdultAML,24344437,1,ca_ne_618_001_400_070_JAMIESON_SR_human-ensembl-ghrc38-r91,6/26/19,03id11474HPC,11474 HPC,11474,9.6,AdultAML,Adult,AML,Prog,34+38+,PB,Completed,79871449
01id11251HPC,AdultAML,23597968,1,ca_ne_618_001_400_070_JAMIESON_SR_human-ensembl-ghrc38-r91,6/26/19,01id11251HPC,11251 HPC,11251,9.3,AdultAML,Adult,AML,Prog,34+38+,BM,Completed,80917720


ENSEMBL,ENTREZID,SYMBOL,gene_type
<chr>,<chr>,<chr>,<chr>
ENSG00000000003,7105,TSPAN6,protein_coding
ENSG00000000005,64102,TNMD,protein_coding
ENSG00000000419,8813,DPM1,protein_coding
ENSG00000000457,57147,SCYL3,protein_coding
ENSG00000000460,55732,C1orf112,protein_coding
ENSG00000000938,2268,FGR,protein_coding


[Table of Contents](#Table-of-Contents)

## Summary


> **Gene annotations**
* Human gene annotations were taken from the Gencode project, Release 19 (GRCh37.p13).

> **Gene type filtering**
* This analysis was limited to protein-coding genes. Of the original 57820 Ensembl genes in the dataset, 20345 are known coding genes.

</div>

Save the workspace objects for future reference:

In [154]:
writeWorkspaceImage(gInterimDir, gRunName)

[1] "Output file: 20200228_DeWerf_Human_PediatricAML_data_integration_20200318124751.RData"


In [103]:
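# Read the isoform-level results table (one row per transcript)
iso_counts_all <- read.csv(file="../inputs/all_isoforms_results.txt", sep="\t", stringsAsFactors=FALSE, header=TRUE)
# Keep only the per-sample expected-count columns and shorten their names to the sample id
iso_counts <- iso_counts_all[, grepl(".results_expected_count", colnames(iso_counts_all), fixed=TRUE)]
colnames(iso_counts) <- gsub(".isoforms.results_expected_count", "", colnames(iso_counts), fixed=TRUE)
# Label rows by transcript id; gene_id is not a column of iso_counts and repeats across isoforms
row.names(iso_counts) <- iso_counts_all$transcript_id
iso_counts_all[1:5,1:10]
head(rownames(iso_counts_all))
head(iso_counts)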

transcript_id,gene_id,X03id78cellHSC_S3_R1_001.isoforms.results_length,X03id78cellHSC_S3_R1_001.isoforms.results_effective_length,X03id78cellHSC_S3_R1_001.isoforms.results_expected_count,X03id78cellHSC_S3_R1_001.isoforms.results_TPM,X03id78cellHSC_S3_R1_001.isoforms.results_FPKM,X03id78cellHSC_S3_R1_001.isoforms.results_IsoPct,X05pid25376ctHSC_S5_R1_001.isoforms.results_length,X05pid25376ctHSC_S5_R1_001.isoforms.results_effective_length
<chr>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>
ENST00000373020.4,ENSG00000000003.10,2206,2000.07,0,0.0,0.0,0,2206,1993.42
ENST00000494424.1,ENSG00000000003.10,820,614.07,1,0.07,0.05,100,820,607.43
ENST00000496771.1,ENSG00000000003.10,1025,819.07,0,0.0,0.0,0,1025,812.42
ENST00000373031.4,ENSG00000000005.5,1339,1133.07,0,0.0,0.0,0,1339,1126.42
ENST00000485971.1,ENSG00000000005.5,542,336.09,0,0.0,0.0,0,542,329.56


X03id78cellHSC_S3_R1_001,X05pid25376ctHSC_S5_R1_001,X03x11379xHPCxxxxx_S3_R1_001,X06id00066PRO_S6_R1_001,X01id38cellHSC_S1_R1_001,X09x00020pxPLUSPLUS_S3_R1_001,X04id78cellPROGENITORS_S4_R1_001,X07x00023xPLUSPLUS_S1_R1_001,X02id11251HSC_S2_R1_001,X12x12484xPLUSMINS_S6_R1_001,⋯,X02id38cellPROGENITORS_S2_R1_001,X01pid24760ctHSC_S1_R1_001,X13x12488xPLUSMINS_S1_R1_001,X08x00020xPLUSMINS_S2_R1_001,X01id11251HPC_S1_R1_001,X11x00082xPLUSPLUS_S5_R1_001,X14x12488xPLUSPLUS_S2_R1_001,X05x00047xPLUSPLUS_S5_R1_001,X01x00077xPROGENIT_S1_R1_001,X03id11474HPC_S3_R1_001
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0.0,98,0.0,0.0,0.0,0.0,0.0,0.0,3.0,284.1,⋯,0.0,66.33,54.0,482.65,0.0,0,10.0,18.0,0.0,0.0
1.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,⋯,0.0,3.67,0.0,15.51,0.0,0,0.0,0.0,0.0,0.0
0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22.9,⋯,0.0,0.0,0.0,103.85,0.0,0,0.0,0.0,0.0,0.0
0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,⋯,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0
0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,⋯,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0
9.6,0,218.22,186.92,121.03,67.75,54.9,98.29,9.44,129.7,⋯,273.36,130.79,299.7,0.0,48.28,0,216.17,110.68,20.69,209.64


In [102]:
class(iso_counts)
head(rownames(iso_counts))
# ADAR transcripts of interest (Ensembl ids without the version suffix)
ADAR.tx <- c("ENST00000368471", "ENST00000368474")
iso_counts[1:5,1:5]
# Strip the version suffix from the transcript ids; rows of iso_counts follow the order of iso_counts_all
tx_ids <- sub("[.].*$", "", iso_counts_all$transcript_id)
# match() returns an integer vector (NA where an id is absent), which is a valid row subscript
ADAR.idx <- match(ADAR.tx, tx_ids)
ADAR.idx
iso_counts[ADAR.idx[!is.na(ADAR.idx)],]


X03id78cellHSC_S3_R1_001,X05pid25376ctHSC_S5_R1_001,X03x11379xHPCxxxxx_S3_R1_001,X06id00066PRO_S6_R1_001,X01id38cellHSC_S1_R1_001
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0,98,0,0,0
1,0,0,0,0
0,0,0,0,0
0,0,0,0,0
0,0,0,0,0
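
Looking up transcripts by unversioned Ensembl id can be sketched on toy data as follows. The ids and counts are hypothetical. `sapply()` over `grep()` yields a list when the per-id results are empty or differ in length, and a list is not a valid row subscript. `match()` always returns an integer vector, with `NA` for ids that are absent:

```r
# Toy isoform count table with versioned transcript ids as row names (made up)
toyIsoCounts = data.frame(
    sampleA = c(0, 12, 5),
    sampleB = c(3, 0, 7),
    row.names = c("ENST_X.4", "ENST_Y.1", "ENST_Z.2")
)

# Ids to look up, without version suffix; "ENST_Q" is deliberately absent
wantedTx = c("ENST_Z", "ENST_X", "ENST_Q")

# Strip everything from the first "." onward to get unversioned ids
txIds = sub("[.].*$", "", rownames(toyIsoCounts))

# Exact-match lookup; the result has one entry per wanted id
rowIdx = match(wantedTx, txIds)

# Drop ids that were not found before subsetting
foundCounts = toyIsoCounts[rowIdx[!is.na(rowIdx)], ]
print(foundCounts)
```

Unlike `grep()`, `match()` compares whole strings, so a short id cannot accidentally hit a longer id that merely contains it.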


[Table of Contents](#Table-of-Contents)

## Citations

1. Law CW, Alhamdoosh M, Su S, Smyth GK, Ritchie ME. RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR. Version 2. F1000Res. 2016 Jun 17 [revised 2016 Jan 1];5:1408.
2. Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140.
3. Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, Bravo HC, Davis S, Gatto L, Girke T, Gottardo R, Hahne F, Hansen KD, Irizarry RA, Lawrence M, Love MI, MacDonald J, Obenchain V, Oleś AK, Pagès H, Reyes A, Shannon P, Smyth GK, Tenenbaum D, Waldron L, Morgan M. Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods. 2015 Feb;12(2):115-21.
4. R Core Team (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

[Table of Contents](#Table-of-Contents)

## Appendix: R Session Info

In [89]:
Sys.time()
sessionInfo()

[1] "2020-03-03 14:20:17 PST"

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS: /mnt/data1/tomw/anaconda2/lib/R/lib/libRblas.so
LAPACK: /mnt/data1/tomw/anaconda2/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] RColorBrewer_1.1-2                     
 [2] edgeR_3.20.9                           
 [3] limma_3.34.9                           
 [4] Homo.sapiens_1.3.1                     
 [5] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
 [6] org.Hs.eg.db_3.5.0

[Table of Contents](#Table-of-Contents)

Copyright (c) 2018 UC San Diego Center for Computational Biology & Bioinformatics under the MIT License

Notebook template by Amanda Birmingham