Assessing biasing factors in Illumina RNA-seq protocols through realistic simulations
This repository contains the analysis pipeline, the description of workflow and the results presented in the manuscript:
- Botond Sipos, Greg Slodkowicz, Tim Massingham, Nick Goldman (2013) Realistic simulations reveal extensive sample-specificity of RNA-seq biases arXiv:1308.3172
The rlsim package is a collection of tools aiming to integrate in a common simulation framework the documented biases present in the data generated by standard Illumina RNA-seq protocols due to individual experimental steps such as hexamer priming (Hansen et al.; 2010), PCR amplification (Benjamini and Speed; 2011) and size selection (Lee et al.; 2010).
The rlsim package provides an implicit simulation model inspired by the library construction protocols (implemented in the
rlsim tool) and also simple approaches for estimating the parameters that can be recovered from standard RNA-seq data (implemented in the
effest tool) in order to replicate the properties of individual datasets. The details of the simulation and estimation frameworks are described in the rlsim package documentation.
The analyses implemented in the pipelines have the following goals:
- To survey the magnitude of hexamer priming and PCR amplification biases in various paired-end RNA-seq datasets publicly available through ENA, generated by the standard Illumina protocol and its dUTP-based strand-specific modification (Levin et al.; 2010).
- To quantify whether the ensemble of knowledge incorporated in the simulation pipeline - parameters estimated from the individual datasets, priming affinities based on a physical model, species-specific information on poly(A) tail length - makes the simulated datasets more similar to the real ones as measured through different summary statistics, hence the "realism" of the
- To study the correlation of the same (or similar) summary statistics between library construction technical replicates in order to establish an expectation regarding the magnitude of correlation potentially achievable between real and simulated datasets.
Cloning the pipeline repository
git clone --recursive https://github.com/sbotond/paper-rlsim.git
Downloading the estimated expression levels
Due to the large size of the Fasta files, the canonical transcriptomes with the corrected expression levels estimated from the datasets below are available from a separate repository at http://github.com/sbotond/rlsim-params. The repository also contains the raw parameter JSON files in order to help running modified simulations.
The bias analysis pipeline
The main goal of the bias analysis pipeline is to quantify the magnitude of biases across different datasets and the ability of the
rlsim estimation/simulation framework to reproduce them (hence the degree of "realism" of the simulations). The pipeline relies on comparison of real datasets to simulated datasets obtained under two different simulation settings:
Datasets simulated under the "full" model, which use the corrected expression levels and GC-dependent PCR efficiencies estimated by the
effesttool in conjunction with a fragmentation method using an approach inspired by a physical model of oligonucleotide binding in order to simulate sequence biases around the start and end of the fragments (
Datasets simulated under the "flat" model, which use uncorrected expression levels calculated by
effestand a fixed PCR amplification efficiency (0.87 - the mean efficiency assumed by
effest) in conjunction with a fragmentation method assuming uniform priming affinities (
The flat model can be considered as a submodel of the the full one, making fewer assumptions about the process generating the data. Both simulations use the insert size distribution estimated from the data and sample the same amount of fragments. The details of the full and flat simulation settings are described below.
Comparing datasets through summary statistics
The similarity between the real and simulated datasets is assessed by calculating Spearman rank correlations between various features of the datasets, summarising their properties on different levels:
- Relative expression levels (per transcript fragment counts normalised by the number of total fragments) - calculated by the cov_cmp tool from the
- The GC content distribution of the mapped fragments - calculated by the pb_plot tool from the
- Sequence biases as summarised by the base counts around the start and end of mapped fragments (with a window size of 10) capturing sequence bias - calculated by
pb_plot. The rank correlations are calculated for count vectors for the four bases around start/end individually and averaged to obtain a single value.
- The fragment coverage vectors of individual transcripts - calculated by
The correlations and statistics calculated for individual datasets and pairs of datasets are stored as pickle files. The pipeline provides additional scripts to calculate and summarise the changes in correlation due to the additional simulation factors used in the full model:
- meta_cov_cmp calculates the shift in correlation for relative expression levels and per-transcript fragment coverage vectors.
- meta_pb_cmp calculates shift in correlation for GC content distributions and sequence biases around fragment start/end.
By assessing the direction of the shift in the correlations calculated for the features above one can asses whether including the additional factors (GC-dependent efficiencies, priming affinities, etc.) improve the "realism" of the simulation, while the magnitude of the shift quantifies the potential impact of the biasing factors on downstream applications.
Read mapping and parameter estimation
- The datasets defined by their accession numbers (AN) in the makefiles under
data.mkare automatically downloaded from ENA.
- Reads are mapped to the Ensembl v69 canonical transcriptomes fetched through the Perl API (along with the list of "single isoform" transcripts) using
BWA(paired-end mode, default parameters). The same reference transcriptome and mapping strategy is used to map the simulated datasets.
- Simulation parameters are estimated using
effestusing an expression multiplier (
-e) of 10000.0 (see the rlsim documentation for more information) and the default assumed mean efficiency (0.87). The estimation of GC-dependent efficiencies is restricted to single isoform transcripts (through the
- Note that the estimation and simulation steps ignore the presence of isoforms by mapping to the canonical transcriptome. This is likely to decrease the correlation between real and simulated datasets, however it should not affect significantly the comparison of the full and flat models as they both use the same mapping strategy.
The flat simulations do not attempt to simulate sequence-specific biases and they generate data under a simple scheme of random fragmentation, amplification under a fixed amplification efficiency and size selection:
- The number of simulated fragments is equal to the number of properly mapped read pairs from the real dataset.
- The fragmentation approach used (
after_noprim_double) assumes uniform "priming affinities".
- A fixed PCR amplification efficiency of 0.87 is used when simulating PCR, independent of the fragment GC content.
- Size selection is simulated according to the empirical fragment size distribution estimated from the data.
- The length of the simulated poly(A) tails are set to zero.
The full simulations attempt to use the maximum amount of knowledge derived from the dataset and literature, aiming to replicate the biases. Hence, in addition to the parameters used by the flat simulations, they also include the following factors:
- The fragmentation process (
after_prim_double, see the rlsim documentation for more details) uses sequence-specific "priming affinities" based on a nearest neighbor model of oligonucleotide binding in order to simulate sequence specific biases near the start and end of the simulated fragments.
- Simulation of PCR amplification biases uses the "empirical" GC-dependent efficiencies estimated by the
effesttool (with the minimum efficiency set to 0.05).
- The simulated poly(A) tail distributions in the full simulations are set on a per-species basis based on relevant publications. We assume that the distributions of the tail lengths (specified with the syntax described in the rlsim documentation) follows a geometric distribution with a mean equal to the half of the maximum tail length observed in the respective species and truncated at the maximum value:
Simulating Illumina sequencing
Simulation of paired-end Illumina sequencing of the fragments generated by
rlsim was performed using the simNGS package (version 1.6). Simulations were parametrised by the runfile s_3_4x shipped in the simNGS GitHub repository, trained on intensity data from a paired-end run with 101 cycles. The simulated read length was equal to the read length of the respective real datasets.
|Accession||Info URL||Read length||PCR cycles||Source|
|SRR065496||[details]||75||not reported||human liver carcinoma (HepG2)|
|SRR521457||[details]||75||not reported||K562 cell line|
|SRR521448||[details]||75||not reported||GM12878 cell line|
|ERR030885||[details]||50||15||Illumina BodyMap: kidney|
|ERR030874||[details]||50||15||Illumina BodyMap: ovary|
|Accession||Info URL||Read length||PCR cycles||Source|
|SRR040000||[details]||76||16||embryonic stem cells|
|SRR530635||[details]||101 stranded||not reported||leukemia cell line (K562 analog), dUTP-based stranded protocol|
|SRR530634||[details]||101 stranded||not reported||leukemia cell line (K562 analog), dUTP-based stranded protocol|
|SRR549333||[details]||99 stranded||not reported||mouse megakaryocyte erythroid progenitor cells with lineages CD16/32 and CD34, dUTP-based stranded protocol|
|SRR549337||[details]||99 stranded||not reported||mouse megakaryocyte erythroid progenitor cells with lineages CD16/32 and CD34, dUTP-based stranded protocol|
|SRR496440||[details]||100||not reported||multipotential cell line that can be converted by 5-azacytidine into three mesodermal stem cell lineages|
|Accession||Info URL||Read length||PCR cycles||Source|
|SRR042423||[details]||75||not reported||30 third-instar wandering larvae|
|SRR059066||[details]||75||not reported||30 third-instar wandering larvae|
|SRR034309||[details]||76||not reported||mixed stage embryos|
|SRR031717||[details]||37||not reported||S2-DRSC cell line|
|SRR015074||[details]||50||not reported||CME_W1_Cl.8+ cell line|
Running the bias analysis pipeline
The bias analysis pipeline can be launched on an LSF cluster by running the
The script will submit one job per dataset. Job parameters can be tuned by editing the script. The analyses
can be run locally as well by editing the
run function in the launch script. The runlogs are stored under the
After all jobs finish running, a global report can be generated by issuing:
Interpreting the results
The global report summarising the correlations of different features between data simulated under the full and flat conditions and real datasets is at
The per-dataset output of the bias analysis pipeline (with the exception of the transcriptome fasta files augmented with the estimated expression levels) are available at http://www.ebi.ac.uk/goldman-srv/paper-rlsim/Bias and they are also checked in the repository under the
Bias/AN/log directory, where
AN is the dataset accession number. They can be regenerated by running the pipeline as instructed above.
The log directories (e.g. Bias/SRR549337/log) store the following output files important for interpreting the results:
meta_cov_cmp.pdf- report on the shift of correlations and p-values for the per-transcript fragment coverage vectors (full/flat model vs. real data)
meta_cov_cmp.txt- text report on the shift in the correlation of relative expression levels.
meta_pb_sim_flat.log- text report on the correlation of GC content distribution and sequence bias between real and flat simulated data
meta_pb_sim.log- text report on the correlation of GC content distribution and sequence bias between real and full simulated data
effest_report.pdf- effest PDF report
cov_cmp_flat.pdf- report on the comparison of relative expression levels and transcript coverage vectors between the real and flat simulated dataset. Contains a scatter-plot of relative expression levels, the histogram of rank correlations and corresponding p-values for per-transcript fragment coverage vectors. Also contains a plot of the simulated and real coverage vectors for 30 transcripts with the lowest p-values.
cov_cmp_report_flat.txt- text report on the correlation of relative expression levels and coverage vectors between the real and the flat simulated dataset.
cov_cmp.pdf- report on the comparison of relative expression levels and transcript coverage vectors between the real and full simulated dataset
cov_cmp_report.txt- text report on the correlation of relative expression levels and coverage vectors between the real and the full simulated dataset.
pb_plot_aln.pdf- sequence bias and GC content distribution plots for the real data
pb_plot_sim_flat.pdf- sequence bias and GC content distribution plots for the flat simulated data
pb_plot_sim.pdf- sequence bias and GC content distribution plots for the full simulated data
re_effest_report.pdf- PDF report of the effest run on the full simulated data
rlsim_report_flat.json.pdf- PDF report file for flat simulation
rlsim_report.json.pdf- PDF report file for the full simulation.
The log directories also contain several intermediary files:
effest_log.txt- effest log file
effest_raw_params.json- "raw parameters" estimated by effest
effest_suggestions.txt- command line parameters suggested by effest
rlsim_report_flat.json- JSON report file for the flat simulation
rlsim_report.json- JSON report file for the full simulation
*.pk- pickle files storing the results of various analyses
flat_re_SRR549337_expr.fas- uncorrected relative expression levels re-estimated from the full simulated data (not included in repo)
flat_SRR549337_expr.fas- uncorrected relative expression levels estimated from the real data (not included in repo)
re_SRR549337_expr.fas- uncorrected relative expression levels estimated from the full simulated data (not included in repo)
SRR549337_expr.fas- relative expression levels estimated from the real data (available from a separate repository)
- files prefixed with
re_- output of effest parameter re-estimation from full simulated data
The replicate analysis pipeline
In addition to the bias analysis described above, we performed analysis of correlation of summary statistics between pairs of technical replicates.
In the absence of a paired-end dataset with technical replicates, we took advantage of the single-ended datasets from Bullard et al. (2010) where library preparation, lane and flow cell effects have been analysed using Universal Human Reference (UHR) and Ambion's human brain reference RNA. The RNA used was commercial-grade and the sequencing was performed in-house by Illumina; the differences between replicates obtained in less ideal conditions are likely to be larger and so we believe such analysis provides an upper bound on the correlation between any two datasets.
The calculated summary statistics are conceptually the same but there are small differences due the fact that sequencing data have unpaired-end reads:
- Relative expression levels are calculated in the same way as in the bias analysis
- GC content is calculated for reads rather than fragments
- Sequence biases are calculated for beginnings and ends of reads rather than fragments
|Accession||Info URL||Read length||PCR cycles||Source|
Running the replicate analysis pipeline
The replicate analysis can be launched in a manner similar to the bias analysis, by running
Run_replicate_analysis.sh and also uses the LSF queuing system for job submission.
Interpreting the results
The principal results of the analysis have been added to the global report file in the form of baselines representing the average correlation of each summary statistic between pairs of technical replicates.
The intermediate results of the analysis are stored in the
Replicate subdirectory. Results for each pair or replicates are stored in a directory called
AN1-AN2 where AN1 and AN2 are accession numbers for the two datasets. The
log directories contain the following output files:
cov_cmp.pdf- the same as cov_cmp_sim.pdf/cov_cmp_flat.pdf but containing correlations between the replicates
cov_cmp_report.txt- text report containing the correlation of relative expression levels between the replicates
meta_pb.log- text report containing the correlation of GC content distribution and sequence bias between the replicates
pb_plot_ANX.pdf- sequence bias and GC content distribution for AN1/AN2
*.pk- pickle files with results of various analyses
Thanks for the useful discussions and suggestions to (in no particular order): Kevin Gori, Christophe Dessimoz, Nick Williams, Ernest Turro, John Marioni, Nuno Fonseca, Pär Engström, Paul Bertone.
Files and dependencies
You will need the following software installed in order to compile and run the pipeline:
- The Ensembl Perl API version 69.
- The dependencies for building the rlsim package
- The dependecies for building the simNGS package.
- BWA >= 0.7.4-r385
- samtools >= 0.1.19-44428cd
- Bias/ - files produced by the bias analysis pipeline
- data.mk/ - dataset-specific makefiles
- Info/ - miscellaneous info
- lib/ - Python classes used by the analysis tools
- ref_cache/ - Reference transcriptomes indexed by BWA
- Replicate/ - files produced by the replicate analysis pipeline
- reports/ - PDF reports
- runlogs/ - directory to store runglogs
- scripts/ - analysis scripts
- simulators/ - directory containing the simulator (rlsim, simNGS) submodules
Scripts quick reference
Print out various Ensembl related info.
Print out Python-related versions.
Fetch canonical transcriptome and the list of single isoform transcipt using the Ensembl v69 Perl API.
usage: mapnsort [-h] [-f ref_fasta] [-s] [-o output_file] input file [input file ...] Map and sort reads. positional arguments: input file Paired-end reads in separate fastq files. optional arguments: -h, --help show this help message and exit -f ref_fasta Reference sequences in fasta format. -s Assume reads are single ended. -o output_file Output file name.
usage: simulate_data [-h] -f ref_fasta [-m map_ref] [-r rlsim_opts] [-n simngs_opts] -x simngs_runfile [-o out_dir] [-l log_file] -z tools_path [-y dir_name] Simulate RNA-seq library construction and Illumina sequencing. optional arguments: -h, --help show this help message and exit -f ref_fasta Reference sequences in fasta format for simulation. -m map_ref Reference sequences for mapping. -r rlsim_opts Extra rlsim options. -n simngs_opts simNGS options. -x simngs_runfile simNGS runfile. -o out_dir Output directory. -l log_file Log file. -z tools_path Path to simulation tools. -y dir_name Name of the simulation directory
usage: meta_cov_cmp [-h] -m m_pickle -f f_pickle [-r pdf] [-p el_pickle] Compare the distributions of pickled coverage stats (version 1.0). optional arguments: -h, --help show this help message and exit -m m_pickle Pickle generated with full model. -f f_pickle Pickle generated with flat model. -r pdf Report PDF. -p el_pickle Result pickle file.
usage: meta_pb_cmp [-h] -a a_pickle -s s_pickle [-p res_pickle] Compare pickled bias stats (version 1.0). optional arguments: -h, --help show this help message and exit -a a_pickle Pickle generated from real data. -s s_pickle Pickle generated from simulated data. -p res_pickle Result pickle file.
usage: plot_global_report [-h] [-c cov_pickles [cov_pickles ...]] [-p pb_pickles [pb_pickles ...]] [-l labels] [-r pdf] Collate the results from different datasets on a single plot(version 1.0). optional arguments: -h, --help show this help message and exit -c cov_pickles [cov_pickles ...] Pickles generated by cov_cmp. -p pb_pickles [pb_pickles ...] Pickles generated by pb_plot. -l labels Dataset labels file. -r pdf Report PDF.