Skip to content
Liam McIntyre edited this page Nov 20, 2019 · 15 revisions

# Welcome to the screen_assembly wiki!

Screen a bacterial assemblies (contigs/CDS or proteins) for nucleotide or protein sequences.

Required dependencies:

Python 3 and scipy/Biopython
Blast,
common_modules/lab_modules.py (from my Github)

Optional dependencies:

IQtree,
Clustal Omega

Required arguments:

-q QUERY, --query QUERY
                      Query sequences fasta.
-i INPUT_FOLDER, --input_folder INPUT_FOLDER
                      Folder with ONLY db seqs in it. Can be one file or
                      many. Must be the same sequence type. Names of files and names of 
                      contigs with files will be used.
                      Do not use fullstops ('.') in file names, i.e., use sample_1.fa NOT 
                      sample.1.fa

Optional arguments:

General:

-h, --help            
                      show this help message and exit
-t --threads
                      How many threads to give blast, muscle, IQtree etc.
-a, --aln             
                      Make alignments with muscle. On by default.
-x, --plots           
                      Make plots of AA variants. Requires aln. On by default.
-pi, --percent_identity
                      Percent_identity. Default 80.
-pl, --percent_length
                      Percent_length. Default 80.
-o, --operon          
                      Seq is operon or kmer, something where stops are
                      expected

Box and Wisker plot:

-c, --label_rotation
                     Labels on box plot. Default 90
-s, --style
                     Box plot style. Default classic
-g, --print_count
                     Overlay total count on percent carriage bars. Default True
-n, --number_samples_on_box_plot
                     number_samples_on_box_plot. Default 10
-k, --keep_flagged
                     Include hits flagged with Ns, premature stops or contig breaks. Default 
                     off

Where a parameter is set to on/off by default use that flag to toggle the parameter to the opposite on/off.

Example: screen_assembly3.py -q samples.fa -i contigs_folder WILL make alignments and
         screen_assembly3.py -q samples.fa -i contigs_folder -a WON'T make alignments

Output:

Top level:

blast_out.txt - raw blast output

binary_hits.csv, total_hits.csv and hits_as_percentage.csv are the same thing except one is 
binary (presence absence), one is a count of hits and one reports the actual percentage of 
sequence identity of hits (biggest if more than one). Negative results are reported as 0, Ns 
(excluded due to high level of Ns), stops (excluded due to frame shift) and contigs 
(excluded due to hitting a contig break). The non zero negative hits are excluded from all 
plots but this DOES NOT mean they are not there.

*_itol.txt - heatmap presence absence to drop into itol on tree (tree needs same samlpe 
names as used in contigs)

sequence_variants.csv - the amount of different sequences for each gene screened.

box_and_carriage_plot1.svg - high level summary of results. Shows carriage of genes and 
sequence variation between hits.

Per gene folder:

*fa - various fasta files of the gene snipped from contigs

*aln - various alignments

*svg - various plots to show the point of variation

Gotchas:

Its often useful to re-run this script a few times to tinker with parameters (like -p, or -i).
To save time the script checks if required files already exist and won't remake them if they 
do. If you want to start from scratch (recommended if theres any errors in the first run) its
recommended to manually delete all existing files. If you switch off QC (-k) delete old files
and vice versa if you put it back on.

You might notice that some of the contig names I have used are off by a count of one if you 
compare them to the original assemblies. Don’t panic! This is simply because the naming 
convention of the assemblies from different sources is different (some start counting at zero
and some at one). I have enforced a bioinformatic approach and started counting from zero.

Multiple hits:

The biggest problem I had in implementing this script was handling multiple hits. Multiple 
hits can occur for real biological reasons (gene duplications, homologs and paralogs for 
example) and as artefacts of bioinformatic algorithms (poor assemblies, sub optimal 
blast hits etc). The choices that were made re handling these were based on the needs of the 
lab at the time and won't necessarily suit your needs if you use default parameters.
The script picks the 'best' hit (defined as length x % homology) that passes QC (doesn't hit
a contig break, have Ns or stops (framshift)) to report and plot. QC can be turned off with -k 
(--keep_flagged) and a lot of users might find this the best option, especially if they're
just after a summary plot. If you're interested in the knowing about multiple hits or seqs 
that fail QC all the details are kept in the various fasta files.

Known issues:

Fixed an issue where if a query had multiple hits and one failed QC then all were reported
as a fail in the csvs. This  had created a discrepancy between the sequences in the alignments
and the csv. However, I wouldn't be surprised if - if there are many multiple hits - that some 
small discrepancy will occur pertaining to exactly which seq is reported where. Please advise
if you find an example and be sure investigate multiple hits manually in the various fastas 
provided.

dNdS is very temperamental.

Some plots make python crash on some versions of MAC OS. 

'ValueError: bottom cannot be >= top' means the names of your query seqs are too long to 
make box plot

Examples using provided MGAS assemblies and vaccine antigens:

screen_assembly3.py -q ../antigen_seqs/antigens.fa -i ../assemblies -t 8 -x -r

Clone this wiki locally