Welcome to MCscan's documentation!
The MCscan download page can be accessed here.
This software provides a clustering module for viewing the relationship of colinear segments in multiple genomes (or heavily redundant genomes). It takes the predicted pairwise segments from dynamic programming (DAGchainer in particular) and then try to build consensus segments from a set of related, overlapping segments.
Certain part of this package (
dagchainer.cc) is based on the TIGR software DAGchainer. The program used this as an initial step to generate pairwise segments.
Along with the DAGchainer guidelines, all code is copiable, distributable, modifiable, and usable without any restrictions.
MCscan currently will only run on linux or cygwin platform, as it is dependent on GNU function.
mcscan.tar.gz in any directory:
$ tar zxf mcscan_version.tar.gz $ cd mcscan_version.tar.gz/ && make
the compiled codes are within the same directory as the source.
Then put copy of MCL executable within the same folder as MCscan (MCL program downloadable here).
Inputs and outputs
MCscan reads in at least two sources of data:
.blast file and
.bed file. This may seem daunting at first, but these are very easy to retrieve. Have a look at the
at_at.bed in the folder. In the actual execution, MCL is used to generate mcl file (
at_at.mcl), which is used in multiple synteny construction.
Here is what can be used to genenerate the files.
The blast file is the following tab-delimited format:
gene1 gene2 e-value
easily genenerated from a m8 blast output format:
$ cut -f 1,2,11 xyz.m8 > xyz.blast.unfiltered
The first thing please ensure that for each gene pair, only one e-value is reported, the blast output normally would contain multiple HSPs, a convenience script is attached to filter all the redundant pairs:
$ python filter_blast.py xyz.blast.unfiltered xyz.blast
.bed file contains the following tab-delimited format (see bed format):
chromosome_id start stop gene_name
notice when you compare multiple genomes, formulate your molecule name carefully to avoid duplicated names. The
.bed file can usually be generated by parsing the gene annotation file provided by the sequencing group (usually the sequencing project ftp will provide a
Once you have everything ready, put them in the same folder. We need to generate
.mcl file if this is the first run (also take a look at the example in
$ more xyz.blast | mcl - --abc --abc-neg-log -abc-tf 'mul(0.4343), ceil(200)' -o xyz.mcl
Some might encounter a problem exec the
mcl command, in which case the
mcl binary needs to be rebuilt from here. After the first time you run it (the
mcl file has been generated). You can simply use:
$ ./mcscan xyz
Parameters (for advanced user)
The help screen:
Usage: mcscan [OPTION...] prefix_fn MCSCAN -- multiple collinearity scan (compiled Jul 6 2010 15:52:03) Reference: Tang,H., Wang,X., Bowers,J.E., Ming,R., Alam,M., Paterson,A.H. Unraveling ancient hexaploidy through multiply-aligned angiosperm gene maps Genome Research (2008) 18, 1944-1954. -a only builds the pairwise blocks (.aligns file) -A use base pair dist instead of gene ranks -b limit within genome synteny (e.g. Vv-Vv) mapping -e, --e_value=E_VALUE alignment significance -g, --gap_score=GAP_SCORE gap penalty -k, --match_score=MATCH_SCORE final score=MATCH_SCORE+NUM_GAPS*GAP_SCORE -p, --pivot=PIVOT PIVOT is the reference genome, make it two letter prefix inyour .bed file, everything else will be aligned to the reference -s, --match_size=MATCH_SIZE number of genes required to call synteny -u, --unit_dist=UNIT_DIST average intergenic distance -?, --help Give this help list --usage Give a short usage message -V, --version Print program version Mandatory or optional arguments to long options are also mandatory or optional for any corresponding short options. Report bugs to <email@example.com>.
The default values are quite generic and should work for many instances. The following are more detailed information for users who wish to tune their results.
The pairwise synteny formula is roughly (Haas et al. 2004), note that
DIST_Y is in base pair:
FINAL_SCORE = MATCH_SCORE*num_matches+max(DIST_X, DIST_Y)/UNIT_DIST*GAP_SCORE
The multiple synteny formula is roughly, now
DIST_Y is the distance in the partial order graph (not in base unit, but in gene index unit):
FINAL_SCORE = MATCH_SCORE*num_matches+max(DIST_X, DIST_Y)*GAP_SCORE
Sometimes you may want to run just the pairwise synteny on
.bed files, then you can try:
$ ./mcscan at_at -a
Note that to run this,
.mcl file is not required, the result is now slightly different, since MCscan uses the mcl file to filter the BLAST hits.
There are, by default
at_vv sets of files and
os_sb sets of files, which is basically two different projects.
First example, let us compare Os to Sb (rice to sorghum), just default settings, run:
$ ./mcscan os_sb
It takes about one minute to run, the result is best viewed in EXCEL. The first part of the file lists all the parameters of the program. The result is separated with a line like this:
## View 11: pivot Sb02
This is called a view, each view uses a different chromosome as the reference. Then the blocks following this line is the multiply aligned blocks. The first column is numerical identifier, the second column is the actual pivot. Then following columns are the regions that are aligned to the pivot. The alignments between rice and sorghum are in fact complicated by one or more shared WGDs, creating several columns but mostly are four regions matching each other.
For the second example, we wish to align Arabidopsis to grape, and use grape as the reference genome, but we need to do it a little differently. Unlike the first example, we are not interested in WGD in grape in this case, and we only wish to see the grape used as pivot. Therefore, we modify the pivot:
$ ./mcscan at_vv -p Vv -b
-b will limit any Vv-Vv matches (in fact this is an older duplication called gamma) in the output.
There are two outputs.
.aligns file and
.blocks file, corresponding to pairwise and multiple synteny respectively. You will find the
.aligns file very useful too, sometimes. But this is essentially similar to the output of DAGchainer (adding a few statistics and change the default paramters).
- May 12, 2007 (version <0.5) initial release.
- Aug 05, 2007 (version 0.5) add the option of of a reference genome
- Oct 13, 2007 (version 0.6) add convenience python script to streamline the process
- Mar 07, 2008 (version 0.7) implement statistical test for pairwise syntenic blocks
- Nov 13, 2008 (version 0.8) partial-order graph for alignment
Any questions, problems, bugs are welcome and should be dumped to
Haibao Tang : bao at uga dot edu
Plant Genome Mapping Laboratory, University of Georgia