# Assembly quality with QUAST

Samuel Barnett

### Introduction

Here I'll use QUAST (through its metagenome mode, MetaQUAST) to assess the quality of the co-assemblies against the reference genomes.

## 1) Initialization

First I need to import the Python modules I'll use, set some variables, initiate R magic, and create and move into the working directory.

In [1]:
import os
mainDir = '/home/sam/data/SIPSim2_data/RealWorld_study3/'
qualDir = os.path.join(mainDir, 'assembly_quality')
contigDir = os.path.join(mainDir, 'coassembly')
binDir = os.path.join(mainDir, 'binning')
genomeDir = '/home/sam/databases/ncbi_genomes/ncbi-genomes-2019-01-25/'
nprocs = 15

In [2]:
import sys
import pandas as pd

In [3]:
# making directories
## working directory
if not os.path.isdir(qualDir):
    print("Working directory does not exist! Making it now")
    os.makedirs(qualDir)
%cd $qualDir

## genome directory
if not os.path.isdir(genomeDir):
    print("Genome directory does not exist!!!")

Working directory does not exist! Making it now
/home/sam/data/SIPSim2_data/RealWorld_study3/assembly_quality


## 2) Assembly quality

Now I'll run QUAST for each simulation group.

In [4]:
## Contig directory
if not os.path.isdir(contigDir):
    print("Contig directory does not exist!!!")
%cd $contigDir

/home/sam/data/SIPSim2_data/RealWorld_study3/coassembly


In [None]:
genset_dict = {'low_GC_skew': 'lowGC', 
               'medium_GC': 'medGC', 
               'high_GC_skew': 'highGC'}
depth_dict = {'depth5MM': '5MM', 
              'depth10MM': '10MM'}
exp_dict = {'SIP': 'window', 'nonSIP': 'nonSIP'}


for genome_set in ['low_GC_skew', 'medium_GC', 'high_GC_skew']:
    # Build the comma-separated list of reference genome paths for this genome set
    index_file = os.path.join(mainDir, '_'.join([genome_set, 'genome_index.txt']))
    index_df = pd.read_table(index_file, names = ['genome', 'file'])
    ref_list = ','.join([os.path.join(genomeDir, x) for x in index_df['file']])
    
    for depth in ['depth5MM', 'depth10MM']:
        for exp_type in ['SIP', 'nonSIP']:
            print(' '.join(['Running metaquast to get assembly quality for', 
                            genome_set, 'genomes at', depth, 
                            'from the', exp_type, 'experiment\n']))
            # The co-assembly directory and the QUAST output directory share the same name
            run_name = '_'.join([genset_dict[genome_set], depth_dict[depth], exp_type])
            subcontigfasta = os.path.join(contigDir, run_name, 'final.contigs.fa')
            outputDir = os.path.join(qualDir, run_name)
            # Note: the thread count is hard-coded here; nprocs (set above) is not used
            cmd = ' '.join(['metaquast.py -r', ref_list,
                            '-o', outputDir,
                            '-t 10', subcontigfasta])
            os.system(cmd)


Running metaquast to get assembly quality for low_GC_skew genomes at depth5MM from the SIP experiment
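
The command above is assembled as a single string and handed to os.system, which does not stop the loop if a run fails. A minimal sketch of the same call built as an argument list and run with subprocess follows. The helper names build_metaquast_cmd and run_metaquast are hypothetical, and only the -r, -o, and -t options already used above are assumed. The thread count is a parameter, so a variable such as nprocs could be passed instead of a hard-coded value.

```python
import subprocess


def build_metaquast_cmd(ref_paths, output_dir, contig_fasta, threads=10):
    """Return the metaquast.py call as an argument list (hypothetical helper)."""
    return ['metaquast.py',
            '-r', ','.join(ref_paths),   # comma-separated reference genomes
            '-o', output_dir,            # output directory for this run
            '-t', str(threads),          # number of threads
            contig_fasta]                # assembly to evaluate


def run_metaquast(ref_paths, output_dir, contig_fasta, threads=10):
    """Run MetaQUAST; raises CalledProcessError on a non-zero exit status."""
    cmd = build_metaquast_cmd(ref_paths, output_dir, contig_fasta, threads)
    subprocess.run(cmd, check=True)
```

Passing a list avoids shell quoting problems with unusual file names, and check=True makes a failed run visible immediately instead of being silently skipped.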



In [6]:
print("Done")

Done


## 3) Getting contig-genome assignments

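Many genomes are made up of multiple scaffolds (chromosomes and plasmids), so I need to match each scaffold to its reference genome ID. To do this, I'll combine the per-reference alignment tables from every MetaQUAST run into a single file.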

In [7]:
align_df = pd.DataFrame()
tempDir = os.path.join(qualDir, 'tmp_align')
if not os.path.isdir(tempDir):
    os.makedirs(tempDir)
header_written = False  # the shared header only needs to be written once
    
for genome_set in ['lowGC', 'medGC', 'highGC']:
    for depth in ['5MM', '10MM']:
        for exp_type in ['SIP', 'nonSIP']:
            runsDir = '_'.join([genome_set, depth, exp_type])
            runsDir = os.path.join(qualDir, runsDir, 'runs_per_reference')
            subDir_list = os.listdir(runsDir)
            for i, subDir in enumerate(subDir_list, start=1):
                alignFile = os.path.join(runsDir, subDir, 'contigs_reports', 'all_alignments_final-contigs.tsv')
                subalign_df = pd.read_csv(alignFile, sep='\t')
                # Tag each alignment table with the run it came from
                subalign_df['genome_set'] = genome_set
                subalign_df['depth'] = depth
                subalign_df['exp_type'] = exp_type
                if not header_written:
                    header = subalign_df.columns.values.tolist()
                    with open(os.path.join(qualDir, 'header.txt'), 'w') as headfile:
                        headfile.write('\t'.join(header))
                        headfile.write('\n')
                    header_written = True
                # Write each table without a header so the files can be concatenated later
                subalignFile = '_'.join([genome_set, depth, exp_type, 'genome', str(i), 'aligned.txt'])
                subalign_df.to_csv(os.path.join(tempDir, subalignFile), header=False, index=False, sep='\t')
            print(' '.join(['There were', str(len(subDir_list)), 'genomes aligned from', genome_set, depth, exp_type]))

There were 318 genomes aligned from lowGC 5MM SIP
There were 500 genomes aligned from lowGC 5MM nonSIP
There were 339 genomes aligned from lowGC 10MM SIP
There were 500 genomes aligned from lowGC 10MM nonSIP
There were 354 genomes aligned from medGC 5MM SIP
There were 500 genomes aligned from medGC 5MM nonSIP
There were 353 genomes aligned from medGC 10MM SIP
There were 500 genomes aligned from medGC 10MM nonSIP
There were 396 genomes aligned from highGC 5MM SIP
There were 500 genomes aligned from highGC 5MM nonSIP
There were 398 genomes aligned from highGC 10MM SIP
There were 500 genomes aligned from highGC 10MM nonSIP
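
As an alternative to writing one temporary file per genome, the per-reference tables could be collected in a list and combined with pandas.concat. This is only a sketch, not the approach used in this notebook. The function name combine_alignments and its input format are hypothetical, the toy tables use made-up column names rather than real QUAST output, and it assumes every table has the same columns.

```python
import io

import pandas as pd


def combine_alignments(tables):
    """Combine (table, genome_set, depth, exp_type) tuples into one DataFrame.

    Each table is anything pd.read_csv accepts (a path or a file-like object).
    """
    frames = []
    for table, genome_set, depth, exp_type in tables:
        df = pd.read_csv(table, sep='\t')
        # Tag each table with the run it came from
        df['genome_set'] = genome_set
        df['depth'] = depth
        df['exp_type'] = exp_type
        frames.append(df)
    return pd.concat(frames, ignore_index=True)


# Toy example with made-up columns and values (not real QUAST output)
toy_a = io.StringIO('col1\tcol2\n1\tx\n2\ty\n')
toy_b = io.StringIO('col1\tcol2\n3\tz\n')
combined = combine_alignments([(toy_a, 'lowGC', '5MM', 'SIP'),
                               (toy_b, 'lowGC', '5MM', 'nonSIP')])
```

Holding everything in memory is simpler but may not suit very large runs, which is one reason to prefer the temporary-file approach.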


In [8]:
cmd = ''.join(['cat ', os.path.join(qualDir, 'header.txt'), ' ', tempDir, '/* > ', 
               os.path.join(qualDir, 'all_contig_alignments.txt')])
print(cmd)
os.system(cmd)

cmd = ' '.join(['rm -r', tempDir])
print(cmd)
os.system(cmd)

cmd = ' '.join(['rm', os.path.join(qualDir, 'header.txt')])
print(cmd)
os.system(cmd)

cat /home/sam/data/SIPSim2_data/RealWorld_study3/assembly_quality/header.txt /home/sam/data/SIPSim2_data/RealWorld_study3/assembly_quality/tmp_align/* > /home/sam/data/SIPSim2_data/RealWorld_study3/assembly_quality/all_contig_alignments.txt
rm -r /home/sam/data/SIPSim2_data/RealWorld_study3/assembly_quality/tmp_align
rm /home/sam/data/SIPSim2_data/RealWorld_study3/assembly_quality/header.txt


0
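
The same concatenation and cleanup can be done without shelling out. The sketch below uses only the standard library. The helper concat_with_header is hypothetical, the demonstration files are toy data in a throwaway directory, and the parts are joined in sorted name order, which is an assumption that may differ from the shell's glob ordering.

```python
import os
import shutil
import tempfile


def concat_with_header(header_file, part_dir, out_file):
    """Write header_file followed by every file in part_dir to out_file."""
    with open(out_file, 'wb') as out:
        with open(header_file, 'rb') as header:
            shutil.copyfileobj(header, out)
        for name in sorted(os.listdir(part_dir)):
            with open(os.path.join(part_dir, name), 'rb') as part:
                shutil.copyfileobj(part, out)


# Toy demonstration in a throwaway directory
work = tempfile.mkdtemp()
parts = os.path.join(work, 'tmp_align')
os.makedirs(parts)
with open(os.path.join(work, 'header.txt'), 'w') as f:
    f.write('a\tb\n')
with open(os.path.join(parts, 'part_1.txt'), 'w') as f:
    f.write('1\t2\n')
with open(os.path.join(parts, 'part_2.txt'), 'w') as f:
    f.write('3\t4\n')

merged = os.path.join(work, 'merged.txt')
concat_with_header(os.path.join(work, 'header.txt'), parts, merged)
with open(merged) as f:
    merged_text = f.read()

shutil.rmtree(parts)                          # replaces `rm -r` on the temporary directory
os.remove(os.path.join(work, 'header.txt'))   # replaces `rm` on the header file
```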

### Get scaffold lengths

Now I'll get the lengths of all the scaffolds from the genome_info.txt file of each QUAST run.

In [9]:
finalstatsFile = os.path.join(qualDir, 'scaffold_stats.txt')
with open(finalstatsFile, 'w') as finalstats:
    finalstats.write('Reference\tReference_length\tReference_length_noN\ttotal_coverage\tgenome_set\tdepth\texp_type\n')
    for genome_set in ['lowGC', 'medGC', 'highGC']:
        for depth in ['5MM', '10MM']:
            for exp_type in ['SIP', 'nonSIP']:
                statsFile = '_'.join([genome_set, depth, exp_type])
                statsFile = os.path.join(qualDir, statsFile, 'combined_reference/genome_stats/genome_info.txt')
                with open(statsFile, 'r') as stats:
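                    for line in stats:
                        # Only keep reference entries (tab-indented, starting with a "GCF_" accession)
                        if line.startswith('\tGCF_'):
                            # strip() takes a set of characters: this trims the leading tab
                            # and the trailing ' bp)' plus newline
                            line = line.strip('\t').strip(' bp)\n')
                            # Turn the descriptive text between the values into tab separators
                            line = line.replace(' (total length: ', '\t')
                            line = line.replace(" bp, total length without N's: ", '\t')
                            line = line.replace(' bp, maximal covered length: ', '\t')
                            # Append the run identifiers as extra columns
                            line = '\t'.join([line, genome_set, depth, exp_type])
                            finalstats.write(line)
                            finalstats.write('\n')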
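
The string replacements above imply a line layout of the form `<name> (total length: X bp, total length without N's: Y bp, maximal covered length: Z bp)`. Under that assumption, a regular expression can do the same parsing in one step and return nothing for lines that do not match. The pattern, the function name parse_genome_info_line, and the example line (with a made-up accession and made-up numbers) are illustrative, and the sketch further assumes that reference names contain no spaces.

```python
import re

# Assumed layout, inferred from the replace() calls in the cell above
GENOME_INFO_RE = re.compile(
    r"^\t(?P<reference>GCF_\S+) \(total length: (?P<length>\d+) bp, "
    r"total length without N's: (?P<length_noN>\d+) bp, "
    r"maximal covered length: (?P<covered>\d+) bp\)\s*$"
)


def parse_genome_info_line(line):
    """Return (reference, length, length_noN, covered) or None if the line does not match."""
    match = GENOME_INFO_RE.match(line)
    if match is None:
        return None
    return (match.group('reference'),
            int(match.group('length')),
            int(match.group('length_noN')),
            int(match.group('covered')))


# Made-up example line, not taken from a real genome_info.txt
example = ("\tGCF_000000000.1_example (total length: 1000 bp, "
           "total length without N's: 990 bp, maximal covered length: 800 bp)\n")
parsed = parse_genome_info_line(example)
```

Returning None for non-matching lines makes unexpected formatting easy to detect, whereas chained replacements would pass a malformed line through unchanged.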