-
Notifications
You must be signed in to change notification settings - Fork 1
Home
Liam McIntyre edited this page Oct 11, 2018
·
15 revisions
# Welcome to the screen_assembly wiki!
screen_assembly - last updated on 23rd Sepatember, 2018
Screen a bacterial assemblies (contigs/CDS or proteins) for nucleotide or protein sequences.
Required dependencies:
Python 3 and scipy/Biopython Blast, common_modules/lab_modules.py (from my Github)
Optional dependencies:
Raxml, Clustal Omega
Required arguments:
-q QUERY, --query QUERY
Query sequences fasta.
-i INPUT_FOLDER, --input_folder INPUT_FOLDER
Folder with ONLY db seqs in it. Can be one file or
many. Must be the same sequence type. Names of files and names of
contigs with files will be used.
Do not use fullstops ('.') in file names, i.e., use sample_1.fa NOT
sample.1.fa
Optional arguments:
General:
-h, --help
show this help message and exit
-t --threads
How many threads to give blast, muscle, raxml etc.
-d, --dNdS
Calculate dNdS. Off by default.
-a, --aln
Make alignments with muscle. On by default.
-x, --plots
Make plots of AA variants. Requires aln. On by default.
-i, --percent_identity
Percent_identity. Default 80.
-l, --percent_length
Percent_length. Default 80.
-o, --operon
Seq is operon or kmer, something where stops are
expected
RAXML:
-r, --raxml
Run raxml, requires alignments. Off by default.
-b, --bootstraps
Number of bootstraps for raxml. Default 100.
-m, --raxml_model
Model for raxml. Default is GTRGAMMA
-z, --raxml_executable
Default is raxmlHPC-PTHREADS-AVX2
Box and Wisker plot:
-c, --label_rotation
Labels on box plot. Default 90
-s, --style
Box plot style. Default classic
-g, --print_count
Overlay total count on percent carriage bars. Default True
-n, --number_samples_on_box_plot
number_samples_on_box_plot. Default 10
-k, --keep_flagged
Include hits flagged with Ns, premature stops or contig breaks. Default
off
Where a parameter is set to on/off by default use that flag to toggle the parameter to the opposite on/off.
Example: screen_assembly3.py -q samples.fa -i contigs_folder WILL make alignments and
screen_assembly3.py -q samples.fa -i contigs_folder -a WON'T make alignments
Output:
Top level:
blast_out.txt - raw blast output binary_hits.csv, total_hits.csv and hits_as_percentage.csv are the same thing except one is binary (presence absence), one is a count of hits and one reports the actual percentage of sequence identity of hits (biggest if more than one). Negative results are reported as 0, Ns (excluded due to high level of Ns), stops (excluded due to frame shift) and contigs (excluded due to hitting a contig break). The non zero negative hits are excluded from all plots but this DOES NOT mean they are not there. *_itol.txt - heatmap presence absence to drop into itol on tree (tree needs same samlpe names as used in contigs) sequence_variants.csv - the amount of different sequences for each gene screened. box_and_carriage_plot1.svg - high level summary of results. Shows carriage of genes and sequence variation between hits.
Per gene folder:
*fa - various fasta files of the gene snipped from contigs *aln - various alignments *svg - various plots to show the point of variation
Gotchas:
Its often useful to re-run this script a few times to tinker with parameters (like -p, or -i). To save time the script checks if required files already exist and won't remake them if they do. If you want to start from scratch (recommended if theres any errors in the first run) its recommended to manually delete all existing files. You might notice that some of the contig names I have used are off by a count of one if you compare them to the original assemblies. Don’t panic! This is simply because the naming convention of the assemblies from different sources is different (some start counting at zero and some at one). I have enforced a bioinformatic approach and started counting from zero.
Known issues:
Fixed an issue where if a query had multiple hits and one failed QC then all were reported as a fail in the csvs. This had created a discrepancy between the sequences in the alignments and the csv. However, I wouldn't be surprised if - if there are many multiple hits - that some small discrepancy will occur pertaining to exactly which seq is reported where. Please advise if you find an example and be sure investigate multiple hits manually in the various fastas provided.