fixing syndna filtering #16
Changes from all commits
e0af07b
b0227a8
194e3b5
bf34aab
4875c0f
61e53ea
a42f272
5b3973d
bf0dcda
3425b33
ef5828b
2a26557
10ad268
b08ca44
3b19742
99d2a1d
b16d54c
99bac8f
3671354
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -5,15 +5,15 @@ | |
| #SBATCH -n {{nprocs}} | ||
| #SBATCH --time {{wall_time_limit}} | ||
| #SBATCH --mem {{mem_in_gb}}G | ||
| #SBATCH -o {{output}}/minimap2/logs/%x-%A_%a.out | ||
| #SBATCH -e {{output}}/minimap2/logs/%x-%A_%a.err | ||
| #SBATCH -o {{output}}/syndna/logs/%x-%A_%a.out | ||
| #SBATCH -e {{output}}/syndna/logs/%x-%A_%a.err | ||
| #SBATCH --array {{array_params}} | ||
|
|
||
| source ~/.bashrc | ||
| set -e | ||
| {{conda_environment}} | ||
| out_folder={{output}}/syndna | ||
| mkdir -p | ||
| mkdir -p ${out_folder} | ||
| cd ${out_folder} | ||
| db_folder=/scratch/qp-pacbio/minimap2/syndna/ | ||
|
|
||
|
|
@@ -28,23 +28,49 @@ mkdir -p ${out_folder}/filtered/ | |
| sn_folder=${out_folder}/bioms/${sample_name} | ||
| mkdir -p ${sn_folder} | ||
|
|
||
| txt=${sn_folder}/${sample_name}.txt | ||
| tsv=${txt/.txt/.tsv} | ||
| coverm contig --single $filename --reference ${db_folder}/All_synDNA_inserts.fasta --mapper minimap2-hifi \ | ||
| --min-read-percent-identity 0.95 --min-read-aligned-percent 0.0 -m mean count --threads {{nprocs}} \ | ||
| --output-file ${sn_folder}/${sample_name}.txt | ||
| cat ${sn_folder}/${sample_name}_insert_counts.txt | sed 's/Contig/\#OTU ID/' | \ | ||
| sed 's/ Read Count//' > ${sn_folder}/${sample_name}.tsv | ||
| biom convert -i ${sn_folder}/${sample_name}.txt -o ${sn_folder}/${sample_name}.biom --to-hdf5 | ||
| --output-file ${txt} | ||
|
|
||
| awk 'BEGIN {FS=OFS="\t"}; {print $1,$3}' ${txt} | \ | ||
| sed 's/Contig/\#OTU ID/' | sed 's/All_synDNA_inserts.fasta\///' | \ | ||
| sed 's/ Read Count//' | sed "s/${fn}/${sample_name}/" > ${tsv} | ||
|
|
||
| # if counts is zero mark it as missing and stop | ||
| counts=`tail -n +2 ${tsv} | awk '{sum += $NF} END {print sum}'` | ||
| if [[ "$counts" == "0" ]]; then | ||
| echo ${sample_name} > {{output}}/failed_${SLURM_ARRAY_TASK_ID}.log | ||
| exit 0 | ||
| fi | ||
|
|
||
| biom convert -i ${tsv} -o ${sn_folder}/syndna.biom --to-hdf5 | ||
|
|
||
| # removing AllsynDNA_plasmids_FASTA_ReIndexed_FINAL.fasta not coverm | ||
| # ---- original commands ---- | ||
| # minimap2 -x map-hifi -t {{nprocs}} -a --MD --eqx -o ${out_folder}/${sample_name}_plasmid.sam ${db_folder}/AllsynDNA_plasmids_FASTA_ReIndexed_FINAL.fasta $filename | ||
| # samtools view -F 4 -@ {{nprocs}} ${out_folder}/${sample_name}_plasmid.sam | awk '{print $1}' | sort -u > ${out_folder}/${sample_name}_plasmid_mapped.txt | ||
| # seqkit grep -v -f ${out_folder}/${sample_name}_plasmid_mapped.txt $filename > ${out_folder}/${sample_name}_no_plasmid.fastq | ||
| # ---- original commands ---- | ||
| minimap2 -x map-hifi -t {{nprocs}} -a --MD --eqx -o ${out_folder}/${sample_name}_plasmid.sam ${db_folder}/AllsynDNA_plasmids_FASTA_ReIndexed_FINAL.fasta $filename | ||
| samtools view -F 4 -@ {{nprocs}} ${out_folder}/${sample_name}_plasmid.sam | awk '{print $1}' | sort -u > ${out_folder}/${sample_name}_plasmid_mapped.txt | ||
| seqkit grep -v -f ${out_folder}/${sample_name}_plasmid_mapped.txt $filename > ${out_folder}/${sample_name}_no_plasmid.fastq | ||
|
|
||
| # removing GCF_000184185.1_ASM18418v1_genomic_chroso.fna use coverm | ||
| minimap2 -x map-hifi -t {{nprocs}} -a --MD --eqx -o ${out_folder}/${sample_name}_GCF_000184185.sam ${db_folder}/GCF_000184185.1_ASM18418v1_genomic_chroso.fna ${out_folder}/${sample_name}_no_plasmid_no_inserts.fastq | ||
| samtools view -bS -@ {{ nprocs/2 | int }} ${out_folder}/${sample_name}_no_plasmid_no_inserts.fastq | samtools sort -@ {{ nprocs/2 | int }} -O bam -o ${out_folder}/${sample_name}_GCF_000184185_sorted.sam | ||
| coverm filter --bam-files ${out_folder}/${sample_name}_GCF_000184185_sorted.sam --min-read-percent-identity 99.9 --min-read-aligned-percent 95 --threads {{nprocs}} -o ${out_folder}/${sample_name}_GCF_000184185.bam | ||
| samtools view -O SAM -o ${out_folder}/${sample_name}_no_GCF_000184185_sorted.sam ${out_folder}/${sample_name}_no_inserts.bam | ||
| # ---- original commands ---- | ||
| # minimap2 -x map-hifi -t 8 -a --MD --eqx -o reads.sam ecoli_genome.fna reads.fastq | ||
| # samtools view -bS -@ 8 reads.fastq | samtools sort -@ 24 -O bam -o reads.sorted.bam | ||
| # coverm filter --bam-files reads.sorted.bam --min-read-percent-identity 99.9 --min-read-aligned-percent 95 --threads 8 -o reads_filtered.sorted.bam | ||
| # samtools view -O SAM -o reads_filtered.sam ./reads_filtered.sorted.bam | ||
| # awk '{print $1}' reads_filtered.sam > reads_filtered.txt | ||
| # seqkit grep -v -f reads_filtered.txt reads.fastq > reads_no_ecoli.fastq | ||
| # ---- original commands ---- | ||
| minimap2 -x map-hifi -t {{nprocs}} -a --MD --eqx -o ${out_folder}/${sample_name}_GCF_000184185.sam ${db_folder}/GCF_000184185.1_ASM18418v1_genomic_chroso.fna ${out_folder}/${sample_name}_no_plasmid.fastq | ||
| samtools view -bS -@ {{ nprocs/2 | int }} ${out_folder}/${sample_name}_no_plasmid.fastq | samtools sort -@ {{ nprocs/2 | int }} -O bam -o ${out_folder}/${sample_name}_GCF_000184185_sorted.bam | ||
|
Collaborator
This seems wrong. You're running
Member
Author
@jianshu93, can you comment? I'm not actually sure.
Collaborator
It should be the SAM file. I think I have the SAM as input in the example commands?
Member
Author
This is what @jianshu93 sent: @lucaspatel, could you also check?
Member
Author
I still think it is OK based on the commands sent by @jianshu93. Additionally, I have added the original commands as comments to make sure they are the same, and I no longer check them in the tests, to avoid confusion or extra lines.
Collaborator
Interesting, I just tested it on barnacle2 and this really does seem to work: Good with me then. |
||
| coverm filter --bam-files ${out_folder}/${sample_name}_GCF_000184185_sorted.bam --min-read-percent-identity 99.9 --min-read-aligned-percent 95 --threads {{nprocs}} -o ${out_folder}/${sample_name}_GCF_000184185.bam | ||
| samtools view -O SAM -o ${out_folder}/${sample_name}_no_GCF_000184185_sorted.sam ${out_folder}/${sample_name}_GCF_000184185.bam | ||
| awk '{print $1}' ${out_folder}/${sample_name}_no_GCF_000184185_sorted.sam > ${out_folder}/${sample_name}_GCF_000184185_reads_filtered.txt | ||
| seqkit grep -v -f ${out_folder}/${sample_name}_GCF_000184185_reads_filtered.txt ${out_folder}/${sample_name}_GCF_000184185.fastq | gz > ${out_folder}/filtered/${fn} | ||
| awk 'BEGIN {FS=OFS="\t"}; {print $1,$3}' | ||
| seqkit grep -v -f ${out_folder}/${sample_name}_GCF_000184185_reads_filtered.txt ${out_folder}/${sample_name}_no_plasmid.fastq | gzip > ${out_folder}/filtered/${fn} | ||
|
|
||
| touch {{output}}/completed_${SLURM_ARRAY_TASK_ID}.log | ||
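A note on the zero-count check in the script above: as a minimal sketch, assuming the TSV could ever contain only its header row, `awk '{sum += $NF} END {print sum}'` prints an empty string rather than `0`, so a comparison against `"0"` would not match. Adding `+ 0` forces a numeric result. The helper name `sum_counts` and the sample table are hypothetical and not part of this PR.

```shell
# Hypothetical helper: sum the last column of a TSV, skipping the header row.
# "sum + 0" forces a numeric result, so an empty table prints 0 rather than "".
sum_counts() {
    tail -n +2 "$1" | awk '{sum += $NF} END {print sum + 0}'
}

# Small demonstration with a made-up table where every count is zero.
tmp=$(mktemp)
printf '#OTU ID\tsample_a\ninsert_1\t0\ninsert_2\t0\n' > "$tmp"
counts=$(sum_counts "$tmp")
if [ "$counts" = "0" ]; then
    echo "no synDNA counts, mark sample as failed"
fi
rm -f "$tmp"
```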
In general this file is fine, but I think there's considerable room for optimization via piping. There are several SAM and BAM files that appear to be intermediates and do not really need to be written to disk unless they are needed for a downstream process.
I think you are right; however, I am trying to mirror the currently available commands and processes. If you see any easy, low-hanging fruit, please let me know so I can add it.
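One possible low-hanging fruit, as a sketch only: the plasmid step writes a SAM file that is read once to extract the mapped read IDs, so the aligner could be piped straight into `samtools view`. The function name `mapped_read_ids` is hypothetical, and this assumes nothing downstream needs `${sample_name}_plasmid.sam`.

```shell
# Hypothetical helper (not part of this PR): print the unique IDs of reads
# that map to a reference, without writing the intermediate SAM to disk.
#   $1 = reference FASTA, $2 = reads FASTQ, $3 = thread count
mapped_read_ids() {
    ref=$1
    reads=$2
    threads=$3
    # minimap2 writes SAM to stdout when -o is omitted;
    # "samtools view -F 4 -" reads stdin and drops unmapped reads.
    minimap2 -x map-hifi -t "${threads}" -a --MD --eqx "${ref}" "${reads}" \
        | samtools view -F 4 -@ "${threads}" - \
        | awk '{print $1}' \
        | sort -u
}

# Possible use inside the job script (same variables as the template):
#   mapped_read_ids ${db_folder}/AllsynDNA_plasmids_FASTA_ReIndexed_FINAL.fasta \
#       $filename {{nprocs}} > ${out_folder}/${sample_name}_plasmid_mapped.txt
#   seqkit grep -v -f ${out_folder}/${sample_name}_plasmid_mapped.txt $filename \
#       > ${out_folder}/${sample_name}_no_plasmid.fastq
```

If something like this were adopted, the script would probably also want `set -o pipefail`, since `set -e` alone does not stop the job when an early stage of a pipeline (here `minimap2`) fails.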