Automated Illumina Pipeline
`-.`'.-' `-. .-'. `-. -./\.- .-' -. /_|\ .- `-. `/____\' .-'. `-. -./.-""-.\.- ' `-. /< (()) >\ .-' - .`/__`-..-'__\' .- ,...`-./___|____|___\.-'.,. ,-' ,` . . ', `-, ,-' `-,
Illuminati is a set of tools geared towards automating portions of Illumina’s secondary analysis pipeline. This application is custom tailored for a specific research institution and not architectured for general use. This may change in the future, and others dealing with Illumina software could still find it interesting.
Note: we will refer to this pipeline as the ‘primary’ analysis pipeline, even though Illumina documentation refers to it as the ‘secondary’ analysis pipeline.
Let’s say you have Illuminati installed here (like we do for some reason):
and you have a cluster with a head node
You get a notification that Flowcell XYZ is done and ready for primary analysis.
First, we generate a run script that will
ssh ngs-cluster cd /n/ngs/tools/pilluminati ./bin/startup_run.rb XYZ
XYZ is the flowcell ID for the flowcell you are running.
This will create a
XYZ.sh run script in your run folder.
Our run folder is in:
startup_run.rb has not done anything except create this run script. The run script will drive the primary analysis pipeline.
We can look over the run script before executing it
We see a lot of comments – mostly data output from LIMS. We also see a series of commands that will be executed. Look it over.
Most important is the comments that come after the
/n/ngs/tools/pilluminati/bin/config_maker.rb. These comments represent what will go into our
SampleSheet.csv files. If something looks wrong, we can call this command manually beforehand, tweak the results, and comment it out of our run script.
Once you are satisfied, all you have to do is run the run script…
Make sure to do that on the cluster head node (
ngs-cluster). It will kick off the primary analysis pipeline which includes demultiplexing, alignment, and post processing. Emails will be sent to the pipeline administrator during the run process. When it is done, files should be in their project directories. LIMS should be updated with the QA/QC data. Everything should be beautiful.
What it does
General overview of the execution of the analysis pipeline this tool performs:
- Create configuration files
- SampleSheet.csv and config.txt are created using data from custom LIMS system
- Convert raw unaligned reads to Fastq format
- Demultiplex indexed lanes
- Perform alignment to reference genome
- Aggregate and rename unaligned reads
- Remove reads that do not pass filter
- Split custom barcoded lanes
- Analyze unaligned reads using fastqc
- Aggregate and rename export files
- Create Sample_Report for name / lane / sample mapping
- Distribute data to project directories
- Distribute stats and quality control analysis to qcdata directory
- CASAVA 1.8.x – this tool is primarily a wrapper around the CASAVA pipeline software help automate it. This tool is designed to deal specifically with CASAVA 1.8, and would fail completely with an older version. CASAVA binaries should be located at
- Currently using CASAVA 1.8.1
- fastqc – we need fastqc available from the command line. Tested with fastqc version 0.9.2.
- This is run by the
fastqc.plscript – located in the
- This is run by the
- image magick – the fastqc.pl script also uses the
converttool of Image Magick to process the images created by fastqc. Expected to be on the path.
- fastx_barcode_splitter.pl – needs to be availible from command line
- lims_data.pl – the most troublesome external dependency. lims_data.pl should sit in the
SCRIPT_PATHdirectory. It is a perl script that connects to our LIMS system and provides flowcell and distribution information back to Illuminati. Future work will be to eliminate or reduce this external dependency.
- NOTE – There is a crude work-around for not needing this LIMS connection. See External Data below.
- Custom Directory Structure – By default, Illuminati expects a certain directory structure to be in place to be able to run correctly.
lib/illuminati/config.rbdescribes the default path structure, with
lib/illuminati/constants.rbproviding some description of how these paths are used. Configuration of the paths and other options used by Illuminati can be done by creating a config file and setting the
ILLUMINATI_CONFIGto point to that path. See
assests/example.config.yamlfor a starting point for the config file.
Frequently, we deal with flowcells processed from outsourcing centers and the resulting data returned to us. Each outsourcing center has their own way of doing things which results in their own tweaks and gotchas in the returned data.
Illuminati can handle outsourced data, but does not attempt to deal with all the nuances of the particular facility. It expects that the outsourced data flowcell directory has the
Aligned sub-directories setup as if they came out of a flowcell run.
If this is satisfied, then the
bin/post_run command can be used to process outsourced orders using the
To use Illuminati, you need a few things in place:
- Flowcell data is in LIMS system and data provided by
Once things are in order, running Illuminati should be a three step process:
Step 1: Run startup.rb
bin/startup_run.rb script passing in the flowcell id:
$ cd /solexa/bin/illuminati $ ./bin/startup_run.rb 639AXXPY
This should generate an Admin script in
/solexa/runs/. This admin script is named
<flowcell_id>.sh and is meant to be executed to start off the main Illuminati process. So for example, this admin script would be named
Step 2: Review Admin Script
Have a look at the admin script to make sure things look good
$ cd /solexa/runs less 639AXXPY.sh
First, you should see output from the LIMS system about the data of each of the flowcell’s lanes. This is the data that will be used to create the
config.txt, so its important that its right.
Then comes the
Next, you will see the output that will go into the
config.txt file. Check this over and ensure the output doesn’t include errors about missing genomes or otehr information.
Finally, the admin script contains the shell commands that will kick off CASAVA with the files to be generated. You can see that there is a
POST_RUN_COMMAND in the make command that will automate the next step, alignment. However, if everything goes as planned (like it always does), then you won’t really need to deal with anything but this file.
Step 3: Run Admin Script
If the admin script looks good, and you are ready to roll, then go!
This will actually create the
SampleSheet.csv files with the output as described in the admin script, then startup CASAVA’s BCL Converter.
What Illuminati Does Next
Ok, so the config files CASAVA needs are automatically generated and the BCL converter starts up, but what else does Illuminati do?
Well, the BCL converter in CASAVA 1.8 now also automatically does the demultiplexing step as well. So, if your lanes are multiplexed, you get the demultiplexing for free!
After the BCL converter finishes, The
POST_RUN_COMMAND we saw in the make command kicks off the next step in the process: alignment. Technically, we didn’t need to create the
config.txt file until this step, but getting it done initially makes running the alignment process pretty simple.
The alignment portion of the CASAVA pipeline is started by the
bin/align_runner.rb script. Its a little wordy in there, but ultimately we are just calling 3 commands. First, CASAVA’s
configureAlignment.pl script is run without the
--make flag, so it just does a test run. The output from this test run is analyzed by the
align_runner. If there is a problem (specifically, if “ERROR” is found in the output), then Illuminati shuts down and emails you to get things going again. Usually, this means that there is a bug in the
config.txt file. If there isn’t an error, then we run
configureAlignment.pl for reals, and create the
Aligned directory. Then we run make inside that directory to start CASAVA’s ELAND aligner.
The alignment’s make command contains another
POST_RUN_COMMAND which starts up the final part of the primary analysis pipeline: the post run.
The post run functionality is contained in
bin/post_run. Its an elaborate maze of fun. But hopefully the method names at least hint at what is going on. Briefly, this script does the following:
catto combine the multiple fastq.gz files that CASAVA makes into one for each lane / sample.
bin/fastq_filter.rbscript to filter out reads that don’t pass filter.
- split lanes with custom barcodes using
- distribute fastq.gz files to their project directories.
- run fastqc on the filtered fastq.gz files and distribute the results.
- create Sample_Report.csv to summarize the sample data
- distribute the relevant stats files to the project directories.
- combine export files using
catand distribute these files.
- distribute relevant qa / qc and stats files to the appropriate qcdata directory.
post_run tool has a lot of options if you want to run it manually, for outsourced orders, or to repeat a portion of the run.
bin/post_run -h to get a list of options to use.
Email & Logging
You also get emails along the way. Check out the
lib/illuminati/constants.rb to see who gets emailed.
A log file is also created for each flowcell in
/solexa/runs/log. This will be used for future dashboard-style awesome-ness.
post_run also generate log files during their execution. If
post_run completes, it will produce a shell script with all the commands it did, so that theoretically, you could run it again if you wanted.
CASAVA’s make output for the bcl conversion and align steps are captured in their own make.out files.
If you want to avoid using the
lims_data.pl script or other external LIMS systems, you can do so by creating an
external_data.yml file and placing it in the flowcell’s root directory.
external_data.yml contains all the required information about the samples on a run. As described in ExternalDataBase.rb, these required fields are:
:lane String name of lane (1 - 8), :name Sample name. :genome Code for genome used for lane. Should correlate to folder name in genomes dir, :protocol Should be either "eland_extended" or "eland_pair", :barcode_type Should be :illumina, :custom, or :none :barcode If :barcode_type is not :none, this provides the 6 sequence barcode
These are all under the
:samples: section of the YAML file. It also contains information about how the files are distributed, in the
:lane: Lane number :path: Path to distribute the data for this lane
These sections will probably be combined in the future to allow for distributions on a per-sample level.
Example external_data.yml file:
--- :samples: - :lane: "1" :name: 9.5dpc_HoxB1_input :genome: mm9 :protocol: eland_extended :barcode_type: :none :barcode: "" - :lane: "2" :name: HoxB1_3FMS_myc_M2Flag_IP :genome: mm9 :protocol: eland_extended :barcode_type: :none :barcode: "" - :lane: "3" :name: HoxB1_3FMS_myc_myc_IP :genome: mm9 :protocol: eland_extended :barcode_type: :none :barcode: "" - :lane: "4" :name: HoxB1_3FMS_myc_input :genome: mm9 :protocol: eland_extended :barcode_type: :none :barcode: "" - :lane: "5" :name: HoxB1_HF_M2Flag_IP :genome: mm9 :protocol: eland_extended :barcode_type: :none :barcode: "" - :lane: "6" :name: HoxB1_HF_M2Flag_input :genome: mm9 :protocol: eland_extended :barcode_type: :none :barcode: "" - :lane: "7" :name: Pbx_input :genome: mm9 :protocol: eland_extended :bases: Y* :barcode_type: :none :barcode: "" - :lane: "8" :genome: phiX :name: Phi X :protocol: eland_extended :barcode_type: :none :barcode: "" :distributions: - :lane: 1 :path: /n/analysis/Krumlauf/bdk/Krumlauf-2011-06-23/639KBAAXX - :lane: 2 :path: /n/analysis/Krumlauf/bdk/Krumlauf-2011-06-23/639KBAAXX - :lane: 3 :path: /n/analysis/Krumlauf/bdk/Krumlauf-2011-06-23/639KBAAXX - :lane: 4 :path: /n/analysis/Krumlauf/bdk/Krumlauf-2011-06-23/639KBAAXX - :lane: 5 :path: /n/analysis/Krumlauf/bdk/Krumlauf-2011-06-23/639KBAAXX - :lane: 6 :path: /n/analysis/Krumlauf/bdk/Krumlauf-2011-06-23/639KBAAXX - :lane: 7 :path: /n/analysis/Krumlauf/bdk/Krumlauf-2011-06-23/639KBAAXX
See the specs for
ExternalDataYml for details.
CASAVA 1.8 Directory Structure
CASAVA 1.8 has a pretty different structure for where it puts data now. Plus, the
post_run script creates some custom directories to do its job. Here’s an overview of the directories relevant to the primary analysis pipeline:
_ FLOWCELL_BASE_DIR | |___ Data | | | |___ Intensities | |___ Unaligned | | | |___ all | | | |___ filter | |___ Aligned | |___ all
The big change is that you don’t go digging in
Data/Intensities/GERALD_* directories anymore for your sequence / alignment files.
Instead, there are two separate directories right off the flowcell’s base directory called
Aligned that are now used to store that data (technically they are user configurable, but these are the defaults, and thats what we’ll use).
Unaligned contains your fastq.gz files. Illuminati adds two additional sub-directories –
all contains the concatenated, unfiltered fastq.gz files with the correct naming structure.
filter contains the same files, but with the reads that didn’t pass filter removed.
Aligned as you might guess contains the export.txt.gz files. Illuminati again adds the
all directory to store files using the naming convention we want to use.
Some effort has been put in to writing tests to ensure components of Illuminati are functional and don’t break when new features are added. These tests are currently incomplete, but it is a decent start. If you are on the particular machine this software was meant to be run on, you can simply run autotest to check all tests:
$ cd /solexa/bin/illuminati $ autotest
Everything should pass, or something is broken. If you are on another machine, some of the tests can be run using rspec. Some tests will fail due to requirements of file structure and database access. Future work includes making it configurable enough to test on other machines.
If you’d like to hack on Illuminati, start by forking the repo on GitHub:
To get all of the dependencies, run bundle install first. The best way to get
your changes merged back into core is as follows:
- Clone down your fork
- Create a thoughtfully named topic branch to contain your change
- Hack away
- Add tests and make sure at least the new features pass by running rspec
- If you are adding new functionality, document it in the README
- Do not change the version number, I will do that on my end
- If necessary, rebase your commits into logical chunks, without errors
- Push the branch up to GitHub
- Send a pull request to the vlandham/illuminati project.