In [1]:
from IPython.display import display, HTML  # IPython.core.display is deprecated for these imports
display(HTML("<style>.container { width:90% !important; }</style>"))


import matplotlib.pyplot as plt
import matplotlib.patheffects as pe
import os
import pandas as pd
import seaborn as sb
import statistics as st

from Bio import SeqIO

sb.set()
pd.set_option("display.max_rows", None)

# Evaluation of USEARCH and VSEARCH on Callahan data

This notebook describes the steps and results of the evaluation.

In [None]:
# Initial files and directories:
#
# uvsearch_callahan
# |- data
# |  |- balanced
# |  |  \- BalancedRefSeqs.fasta  # provided reference sequences of 'balanced' data set
# |  |
# |  \- hmp
# |     \- HMP_MOCK.fasta  # provided reference sequences of 'hmp' data set
# |
# |- evaluation # will contain the evaluation plots and tables
# |
# |- outputs # will contain the cluster and metric outputs
# |
# \- tasks  # task files for the different runs of U/VSEARCH

The provided reference sequences are part of the [Supplementary Software](https://static-content.springer.com/esm/art%3A10.1038%2Fnmeth.3869/MediaObjects/41592_2016_BFnmeth3869_MOESM270_ESM.zip) of the DADA2 paper.   

## Analysis workflow

The data sets are preprocessed as described in Callahan et al., *DADA2: High-resolution sample inference from Illumina amplicon data* (https://doi.org/10.1038/nmeth.3869),
except that the minimum sequence abundance is set to 1 (not 2) for the sake of a fair comparison with other data sets and tools.

The taxonomic assignment is obtained by merging (using `USEARCH -fastq_mergepairs`) and dereplicating (using `USEARCH -derep_fulllength`) the reads, 
and then matching them with the respective reference sequences (using `VSEARCH --usearch_global`).
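
As a rough sketch of the three steps named above, the command lines could be assembled as shown below. The intermediate file names, the plain binary names `usearch` and `vsearch`, and the exact set of options are assumptions made for illustration; the `analysis_callahan` script may invoke the tools differently. The identity of 0.97 mirrors the value passed to the `taxonomy` commands in this notebook.

```python
def taxonomy_commands(fwd, rev, ref, identity=0.97,
                      usearch="usearch", vsearch="vsearch"):
    """Build, but do not run, the three command lines as argument lists."""
    # Hypothetical intermediate file names (not taken from the original workflow).
    merged = "merged.fasta"
    uniques = "uniques.fasta"
    hits = "hits.b6"
    return [
        # 1. Merge the forward and reverse reads.
        [usearch, "-fastq_mergepairs", fwd, "-reverse", rev, "-fastaout", merged],
        # 2. Dereplicate the merged reads and annotate them with abundances.
        [usearch, "-derep_fulllength", merged, "-fastaout", uniques, "-sizeout"],
        # 3. Match the unique sequences against the reference sequences.
        [vsearch, "--usearch_global", uniques, "--db", ref,
         "--id", str(identity), "--blast6out", hits],
    ]


for cmd in taxonomy_commands("reads_1.fastq", "reads_2.fastq", "callahan.fasta"):
    print(" ".join(cmd))
```

The sketch only prints the commands, so it can be inspected without either tool being installed.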

Both tools are run with several clustering options:
 - USEARCH: `-cluster_fast`, `-cluster_smallmem`, `-cluster_otus`
 - VSEARCH: `--cluster_fast`, `--cluster_size`, `--cluster_smallmem`

## Commands

The following commands prepare and cluster the data sets. The results are evaluated below.

To execute the workflow as provided here, the `tools` subdirectory of the overall repository should contain a VSEARCH binary and the USEARCH binaries `usearch8.0.1623_i86linux32` and `usearch10.0.240_i86linux32` (both for data preparation) and `usearch11.0.667_i86linux32` (for clustering). The paths can be adjusted in the commands below.

IMPORTANT: The commands are not intended to be executed from this notebook. They should be executed from the root directory of the overall repository.

In [None]:
%%bash

TOOLS_DIR=tools
ANALYSIS_DIR=analyses/uvsearch_callahan
DATA_DIR=${ANALYSIS_DIR}/data
OUTPUT_DIR=${ANALYSIS_DIR}/outputs

USEARCH8_PATH=${TOOLS_DIR}/usearch8.0.1623_i86linux32 # adjust to your system
USEARCH10_PATH=${TOOLS_DIR}/usearch10.0.240_i86linux32 # adjust to your system

USEARCH=${TOOLS_DIR}/usearch11.0.667_i86linux32 # adjust to your system
VSEARCH=${TOOLS_DIR}/vsearch-2.14.2-linux-x86_64/bin/vsearch # adjust to your system

RUNS=( uparse_otus usearch_fast_length  usearch_fast_size  usearch_smallmem_length  usearch_smallmem_size  vsearch_fast  vsearch_size  vsearch_smallmem_length  vsearch_smallmem_size )

## balanced
python -m scripts.analyses.analysis_callahan reference balanced ${DATA_DIR}/balanced/BalancedRefSeqs.fasta ${DATA_DIR}/balanced/callahan.fasta 

# paired
python -m scripts.analyses.analysis_callahan prepare balanced ${DATA_DIR}/balanced --min_size 1 --usearch8 ${USEARCH8_PATH} --usearch10 ${USEARCH10_PATH}
python -m scripts.analyses.analysis_callahan taxonomy ${DATA_DIR}/balanced/ERR777695_pfmd.fastq ${DATA_DIR}/balanced/callahan.fasta ${DATA_DIR}/balanced/bp_callahan_0.97.tax 0.97
READS=bp:${DATA_DIR}/balanced/ERR777695_pfmd.fastq
TAX=callahan:${DATA_DIR}/balanced/bp_callahan_0.97.tax
for R in "${RUNS[@]}"; do
    python -m scripts.analyses.analysis_callahan run_uvsearch ${R} ${READS} ${ANALYSIS_DIR}/tasks/${R}.txt ${OUTPUT_DIR}/balanced_paired/${R} --tax_files ${TAX} --usearch ${USEARCH} --vsearch ${VSEARCH}
    for F in ${OUTPUT_DIR}/balanced_paired/${R}/*__metrics.csv; do mv ${F} ${OUTPUT_DIR}/balanced_paired/${R}/${R}_${F##*/}; done
done

# single
python -m scripts.analyses.analysis_callahan prepare balanced ${DATA_DIR}/balanced --single --min_size 1 --usearch8 ${USEARCH8_PATH} --usearch10 ${USEARCH10_PATH}
python -m scripts.analyses.analysis_callahan taxonomy ${DATA_DIR}/balanced/ERR777695_1_sfd.fastq ${DATA_DIR}/balanced/callahan.fasta ${DATA_DIR}/balanced/bs_callahan_0.97.tax 0.97
READS=bs:${DATA_DIR}/balanced/ERR777695_1_sfd.fastq
TAX=callahan:${DATA_DIR}/balanced/bs_callahan_0.97.tax
for R in "${RUNS[@]}"; do
    python -m scripts.analyses.analysis_callahan run_uvsearch ${R} ${READS} ${ANALYSIS_DIR}/tasks/${R}.txt ${OUTPUT_DIR}/balanced_single/${R} --tax_files ${TAX} --usearch ${USEARCH} --vsearch ${VSEARCH}
    for F in ${OUTPUT_DIR}/balanced_single/${R}/*__metrics.csv; do mv ${F} ${OUTPUT_DIR}/balanced_single/${R}/${R}_${F##*/}; done
done

## hmp
python -m scripts.analyses.analysis_callahan reference hmp ${DATA_DIR}/hmp/HMP_MOCK.fasta ${DATA_DIR}/hmp/callahan.fasta 

# paired
python -m scripts.analyses.analysis_callahan prepare hmp ${DATA_DIR}/hmp --min_size 1 --usearch8 ${USEARCH8_PATH} --usearch10 ${USEARCH10_PATH}
python -m scripts.analyses.analysis_callahan taxonomy ${DATA_DIR}/hmp/Mock1_S1_L001_pfmd.fastq ${DATA_DIR}/hmp/callahan.fasta ${DATA_DIR}/hmp/hp_callahan_0.97.tax 0.97
READS=hp:${DATA_DIR}/hmp/Mock1_S1_L001_pfmd.fastq
TAX=callahan:${DATA_DIR}/hmp/hp_callahan_0.97.tax
for R in "${RUNS[@]}"; do
    python -m scripts.analyses.analysis_callahan run_uvsearch ${R} ${READS} ${ANALYSIS_DIR}/tasks/${R}.txt ${OUTPUT_DIR}/hmp_paired/${R} --tax_files ${TAX} --usearch ${USEARCH} --vsearch ${VSEARCH}
    for F in ${OUTPUT_DIR}/hmp_paired/${R}/*__metrics.csv; do mv ${F} ${OUTPUT_DIR}/hmp_paired/${R}/${R}_${F##*/}; done
done

# single
python -m scripts.analyses.analysis_callahan prepare hmp ${DATA_DIR}/hmp --single --min_size 1 --usearch8 ${USEARCH8_PATH} --usearch10 ${USEARCH10_PATH}
python -m scripts.analyses.analysis_callahan taxonomy ${DATA_DIR}/hmp/Mock1_S1_L001_R1_001_sfd.fastq ${DATA_DIR}/hmp/callahan.fasta ${DATA_DIR}/hmp/hs_callahan_0.97.tax 0.97
READS=hs:${DATA_DIR}/hmp/Mock1_S1_L001_R1_001_sfd.fastq
TAX=callahan:${DATA_DIR}/hmp/hs_callahan_0.97.tax
for R in "${RUNS[@]}"; do
    python -m scripts.analyses.analysis_callahan run_uvsearch ${R} ${READS} ${ANALYSIS_DIR}/tasks/${R}.txt ${OUTPUT_DIR}/hmp_single/${R} --tax_files ${TAX} --usearch ${USEARCH} --vsearch ${VSEARCH}
    for F in ${OUTPUT_DIR}/hmp_single/${R}/*__metrics.csv; do mv ${F} ${OUTPUT_DIR}/hmp_single/${R}/${R}_${F##*/}; done
done

## Evaluation

**Configuration**

In [2]:
data_sets = ['balanced', 'hmp']
read_types = ['single', 'paired']
ground_truths = ['callahan']

opts = ['uparse_otus', 'usearch_fast_length', 'usearch_fast_size', 'usearch_smallmem_length', 'usearch_smallmem_size', 'vsearch_fast', 'vsearch_size', 'vsearch_smallmem_length', 'vsearch_smallmem_size']

read_files = {'balanced_single': 'ERR777695_1_sfd.fastq', 'balanced_paired': 'ERR777695_pfmd.fastq',
              'hmp_single': 'Mock1_S1_L001_R1_001_sfd.fastq', 'hmp_paired': 'Mock1_S1_L001_pfmd.fastq'}

data_dir = 'data'
results_dir = 'outputs'
eval_dir = 'evaluation'

### Number of clusters and amplicons

The following cell reads the input files and the cluster outputs for all data sets and compares the number of clusters and amplicons.

In [3]:
# Requires the input and OTU files. Alternatively, the evaluation can use the stored information (see below).
df_columns = ['data_set', 'tool', 'mode', 'refinement', 'threshold', 'num_input_amplicons', 'input_mass', 'num_clusters', 'num_output_amplicons', 'output_mass', 'ds', 'rt']

rows = []

for rt in read_types:
    for ds in data_sets:
        run_name = '%s_%s' % (ds, rt)
        
        seq_file = '%s/%s/%s' % (data_dir, ds, read_files[run_name]) # the input sequences
        num_input_amplicons = 0
        input_mass = 0
        with open(seq_file, 'r') as in_file:
            for record in SeqIO.parse(in_file, 'fastq'):
                num_input_amplicons += 1
                input_mass += int(record.id.split('_')[-1])
        
        for opt in opts:
            otu_files = [f for f in os.listdir('%s/%s/%s/' % (results_dir, run_name, opt)) if f.endswith('_otus.txt')]

            for f in otu_files:
                otu_file = '%s/%s/%s/%s' % (results_dir, run_name, opt, f)
                        
                num_output_amplicons = 0
                num_clusters = 0
                output_mass = 0
                with open(otu_file, 'r') as in_file:
                    for line in in_file:
                        num_output_amplicons += len(line.strip().split(' '))
                        num_clusters += 1
                        output_mass += sum([int(m.split('_')[-1]) for m in line.strip().split(' ')])
                        
                tool, mode = f.split('__')
                mode = mode.split('_0')[0]
                refinement = 'nf'
                threshold = float(f.split('_')[-2])
                        
                rows.append([run_name, tool, mode, refinement, threshold, num_input_amplicons, input_mass, num_clusters, num_output_amplicons, output_mass, ds, rt])
            
df_counts = pd.DataFrame(rows, columns = df_columns)

*Column descriptions:*   
`num_input_amplicons`: The number of entries in the corresponding input file.   
`input_mass`: The sum of the abundances of all entries in the input file.   
`num_clusters`: The number of computed clusters.   
`num_output_amplicons`: The number of amplicons contained in the clusters.   
`output_mass`: The sum of the abundances of all amplicons contained in the clusters.   
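
To make the output columns concrete, the sketch below summarises a toy OTU file and decodes a file name in the same way as the cell above. The amplicon identifiers and the file name `usearch__fast_length_0.97_otus.txt` are hypothetical; they only follow the conventions the code relies on (one cluster per line, members separated by spaces, abundance after the last underscore of each identifier).

```python
def parse_otu_filename(name):
    """Split a name like 'usearch__fast_length_0.97_otus.txt' into its parts."""
    tool, rest = name.split("__")
    mode = rest.split("_0")[0]               # text before the threshold, e.g. 'fast_length'
    threshold = float(name.split("_")[-2])   # second-to-last underscore-separated field
    return tool, mode, threshold


def summarise_otu_lines(lines):
    """Return (num_clusters, num_output_amplicons, output_mass) for OTU lines."""
    num_clusters = num_amplicons = mass = 0
    for line in lines:
        members = line.split()
        if not members:                      # skip empty lines
            continue
        num_clusters += 1
        num_amplicons += len(members)
        # The abundance is the integer after the last underscore of each identifier.
        mass += sum(int(m.rsplit("_", 1)[-1]) for m in members)
    return num_clusters, num_amplicons, mass


toy_lines = ["a_10 b_3 c_1", "d_5", "e_2 f_1"]    # three clusters, six amplicons
print(summarise_otu_lines(toy_lines))             # (3, 6, 22)
print(parse_otu_filename("usearch__fast_length_0.97_otus.txt"))
```

Note that splitting the mode at `_0` assumes every threshold is below 1; a file name carrying a threshold of `1.0` would need different handling.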

In [4]:
df_counts[['data_set', 'tool', 'mode', 'threshold', 'num_input_amplicons', 'input_mass', 'num_clusters', 'num_output_amplicons', 'output_mass']]

Unnamed: 0,data_set,tool,mode,threshold,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass
0,balanced_single,uparse,otus,0.97,33523,558059,591,30452,553807
1,balanced_single,usearch,fast_length,0.9,33523,558059,421,33523,558059
2,balanced_single,usearch,fast_length,0.91,33523,558059,521,33523,558059
3,balanced_single,usearch,fast_length,0.92,33523,558059,640,33523,558059
4,balanced_single,usearch,fast_length,0.93,33523,558059,811,33523,558059
5,balanced_single,usearch,fast_length,0.94,33523,558059,1044,33523,558059
6,balanced_single,usearch,fast_length,0.95,33523,558059,1351,33523,558059
7,balanced_single,usearch,fast_length,0.96,33523,558059,2170,33523,558059
8,balanced_single,usearch,fast_length,0.97,33523,558059,3095,33523,558059
9,balanced_single,usearch,fast_length,0.98,33523,558059,5600,33523,558059


In [5]:
df_counts.to_csv('%s/df_counts.csv' % eval_dir, sep = ';', index = False)
#df_counts = pd.read_csv('%s/df_counts.csv' % eval_dir, sep = ';')

### Clustering quality

In [6]:
# Requires the metrics files. Alternatively, the evaluation can use the stored information (see below).
dfs = []
for ds in data_sets:
    for rt in read_types:
        run_name = '%s_%s' % (ds, rt)

        for opt in opts:
            for gt in ground_truths:
                df = pd.read_csv('%s/%s/%s/%s_%s_%s__metrics.csv' % (results_dir, run_name, opt, opt, ds[0] + rt[0], gt), sep = ';')

                df['reads'] = '%s_%s' % (ds, rt)
                df['gt'] = gt
                df['mode'] = [m.split('__')[-1] for m in df['task']]
                df['refinement'] = 'nf'
                df['ds'] = ds
                df['rt'] = rt

                dfs.append(df)
                    
df_quality = pd.concat(dfs, ignore_index = True)
df_quality.rename(columns = {'task': 'run', 'reads': 'data_set'}, inplace = True)

*Column descriptions:*   
`precision`: Quantifies the extent to which amplicons in a cluster are also from the same species.   
`recall`: Measures the proportion of amplicons from the same species that are grouped in the same cluster.   
`adjrandindex`: Measures the agreement between the clusters and the taxonomic assignment and corrects for chance.   
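
For readers unfamiliar with the adjusted Rand index, here is a minimal pair-counting sketch on toy labels. It treats every amplicon as one item. Whether the metrics files weight amplicons by abundance is not shown in this notebook, so the function illustrates the measure itself and is not necessarily the exact computation behind the `adjrandindex` column.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(clusters, species):
    """Adjusted Rand index of two labelings of the same items."""
    n = len(clusters)
    if n < 2:
        return 1.0
    # Pairs of items that share both the cluster and the species.
    same_both = sum(comb(c, 2) for c in Counter(zip(clusters, species)).values())
    same_cluster = sum(comb(c, 2) for c in Counter(clusters).values())
    same_species = sum(comb(c, 2) for c in Counter(species).values())
    expected = same_cluster * same_species / comb(n, 2)   # agreement expected by chance
    maximum = (same_cluster + same_species) / 2
    if maximum == expected:                               # degenerate case, e.g. one group only
        return 1.0
    return (same_both - expected) / (maximum - expected)


clusters = [0, 0, 0, 1, 1, 2]
species = ["A", "A", "B", "B", "B", "C"]
print(adjusted_rand_index(clusters, species))    # 7/22, about 0.318
print(adjusted_rand_index(clusters, clusters))   # 1.0 for identical groupings
```

A value of 1 means that clusters and species coincide exactly, while values near 0 indicate agreement no better than chance.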

In [7]:
df_quality[['data_set', 'gt', 'tool', 'mode', 'threshold', 'precision', 'recall', 'adjrandindex']]

Unnamed: 0,data_set,gt,tool,mode,threshold,precision,recall,adjrandindex
0,balanced_single,callahan,uparse,otus,0.97,0.961808,0.990367,0.96679
1,balanced_single,callahan,usearch,fast_length,0.9,0.925922,0.959764,0.906474
2,balanced_single,callahan,usearch,fast_length,0.91,0.942952,0.956019,0.899262
3,balanced_single,callahan,usearch,fast_length,0.92,0.928246,0.962289,0.911734
4,balanced_single,callahan,usearch,fast_length,0.93,0.956786,0.959676,0.929876
5,balanced_single,callahan,usearch,fast_length,0.94,0.956562,0.932833,0.868032
6,balanced_single,callahan,usearch,fast_length,0.95,0.988856,0.954912,0.957855
7,balanced_single,callahan,usearch,fast_length,0.96,0.990825,0.929531,0.916237
8,balanced_single,callahan,usearch,fast_length,0.97,0.980088,0.908106,0.886726
9,balanced_single,callahan,usearch,fast_length,0.98,0.998018,0.876117,0.846011


In [8]:
df_quality.to_csv('%s/df_quality.csv' % eval_dir, sep = ';', index = False)
#df_quality = pd.read_csv('%s/df_quality.csv' % eval_dir, sep = ';')

Combine counting and quality information:

In [9]:
df_c, df_q = df_counts.copy(), df_quality.copy()
drop_cols = ['join_col'] + ['%s_counts' % s for s in set(df_q.columns) & set(df_c.columns)]
df_c['join_col'] = df_c['data_set'] + df_c['tool'] + df_c['mode'] + df_c['refinement'] + df_c['threshold'].apply(str)
df_q['join_col'] = df_q['data_set'] + df_q['tool'] + df_q['mode'] + df_q['refinement'] + df_q['threshold'].apply(str)
df_joined = df_q.join(df_c.set_index('join_col'), on = 'join_col', rsuffix = '_counts').drop(drop_cols, axis = 1)
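
The join above matches rows on a string concatenated from five columns. A plain-Python sketch of the same idea, using a tuple as the key and toy rows taken from the two tables shown earlier, could look as follows. The helper name `join_on_key` is hypothetical and is not part of the notebook.

```python
KEY_COLUMNS = ("data_set", "tool", "mode", "refinement", "threshold")


def join_on_key(quality_rows, count_rows):
    """Left-join count rows onto quality rows using a tuple of key columns."""
    counts_by_key = {tuple(row[c] for c in KEY_COLUMNS): row for row in count_rows}
    joined = []
    for q_row in quality_rows:
        key = tuple(q_row[c] for c in KEY_COLUMNS)
        c_row = counts_by_key.get(key, {})
        # Columns present in both tables keep the value from the quality row.
        extra = {k: v for k, v in c_row.items() if k not in q_row}
        joined.append({**q_row, **extra})
    return joined


quality = [{"data_set": "balanced_single", "tool": "usearch", "mode": "fast_length",
            "refinement": "nf", "threshold": 0.97, "adjrandindex": 0.886726}]
counts = [{"data_set": "balanced_single", "tool": "usearch", "mode": "fast_length",
           "refinement": "nf", "threshold": 0.97, "num_clusters": 3095}]

joined = join_on_key(quality, counts)
print(joined[0]["adjrandindex"], joined[0]["num_clusters"])
```

A tuple key keeps the column boundaries explicit, whereas a concatenated string relies on the individual values not running into each other.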

In [10]:
df_joined.to_csv('%s/df_joined.csv' % eval_dir, sep = ';', index = False)
#df_joined = pd.read_csv('%s/df_joined.csv' % eval_dir, sep = ';')

Determine the maximum, average and N-best average clustering quality (for N = 5).
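
The three summaries can be illustrated on the `usearch` / `fast_length` rows for `balanced_single` listed in the quality table above. That listing shows only the first rows of the table, so the mean computed in this sketch is not expected to equal the stored average; the best row, however, can be read off directly.

```python
from statistics import mean

# (threshold, precision, recall, adjrandindex), copied from the rows shown above.
rows = [
    (0.90, 0.925922, 0.959764, 0.906474),
    (0.91, 0.942952, 0.956019, 0.899262),
    (0.92, 0.928246, 0.962289, 0.911734),
    (0.93, 0.956786, 0.959676, 0.929876),
    (0.94, 0.956562, 0.932833, 0.868032),
    (0.95, 0.988856, 0.954912, 0.957855),
    (0.96, 0.990825, 0.929531, 0.916237),
    (0.97, 0.980088, 0.908106, 0.886726),
    (0.98, 0.998018, 0.876117, 0.846011),
]

n = 5
by_ari = sorted(rows, key=lambda r: r[3], reverse=True)

best = by_ari[0]                            # row with the highest adjusted Rand index
mean_ari = mean(r[3] for r in rows)         # average over the listed thresholds
nbest_ari = mean(r[3] for r in by_ari[:n])  # average over the n best thresholds

print("best threshold:", best[0], "precision:", best[1], "ARI:", best[3])
print("mean ARI (listed rows only):", round(mean_ari, 6))
print("5-best mean ARI (listed rows only):", round(nbest_ari, 6))
```

The "maximum" row reports the precision and recall of the threshold with the best adjusted Rand index (here 0.95), not the highest precision or recall seen at any threshold.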

In [11]:
df_columns = ['data_set', 'gt', 'tool', 'mode', 'refinement', 'precision', 'recall', 'adjrandindex', 'num_input_amplicons', 'input_mass', 'num_clusters', 'num_output_amplicons', 'output_mass', 'ds', 'rt']

max_rows = []
mean_rows = []
nbest_rows = []
n = 5

for (d, g, t, m, f, ds, rt), grp in df_joined.groupby(by = ['data_set', 'gt', 'tool', 'mode', 'refinement', 'ds', 'rt']):
    best = grp.nlargest(1, 'adjrandindex')
    max_rows.append([d, g, t, m, f, best['precision'].values[0], best['recall'].values[0], best['adjrandindex'].values[0], best['num_input_amplicons'].values[0], best['input_mass'].values[0], best['num_clusters'].values[0], best['num_output_amplicons'].values[0], best['output_mass'].values[0], ds, rt])
    mean_rows.append([d, g, t, m, f, grp['precision'].mean(), grp['recall'].mean(), grp['adjrandindex'].mean(), grp['num_input_amplicons'].mean(), grp['input_mass'].mean(), grp['num_clusters'].mean(), grp['num_output_amplicons'].mean(), grp['output_mass'].mean(), ds, rt])
    nbest = grp.nlargest(n, 'adjrandindex')
    nbest_rows.append([d, g, t, m, f, nbest['precision'].mean(), nbest['recall'].mean(), nbest['adjrandindex'].mean(), nbest['num_input_amplicons'].mean(), nbest['input_mass'].mean(), nbest['num_clusters'].mean(), nbest['num_output_amplicons'].mean(), nbest['output_mass'].mean(), ds, rt])
    
df_joined_max = pd.DataFrame(max_rows, columns = df_columns)
df_joined_mean = pd.DataFrame(mean_rows, columns = df_columns)
df_joined_nbest = pd.DataFrame(nbest_rows, columns = df_columns)

In [12]:
df_joined_max.to_csv('%s/df_joined_max.csv' % eval_dir, sep = ';', index = False)
df_joined_mean.to_csv('%s/df_joined_mean.csv' % eval_dir, sep = ';', index = False)
df_joined_nbest.to_csv('%s/df_joined_nbest.csv' % eval_dir, sep = ';', index = False)
#df_joined_max = pd.read_csv('%s/df_joined_max.csv' % eval_dir, sep = ';')
#df_joined_mean = pd.read_csv('%s/df_joined_mean.csv' % eval_dir, sep = ';')
#df_joined_nbest = pd.read_csv('%s/df_joined_nbest.csv' % eval_dir, sep = ';')

In [13]:
df_max = df_joined_max.loc[df_joined_max['gt'] == 'callahan']
df_mean = df_joined_mean.loc[df_joined_mean['gt'] == 'callahan']
df_nbest = df_joined_nbest.loc[df_joined_nbest['gt'] == 'callahan']

For the chosen ground truth, the maximum, average and N-best average values are averaged per data set (e.g. balanced) and read type (e.g. paired).   
This has no effect here because there is only one data set per combination of data set and read type.

In [14]:
df_columns = ['data_set', 'gt', 'tool', 'mode', 'refinement', 'precision', 'recall', 'adjrandindex', 'num_input_amplicons', 'input_mass', 'num_clusters', 'num_output_amplicons', 'output_mass', 'ds', 'rt']

def average_complexity(df):
    rows = []
    for (gt, ds, rt, tool, mode, f), grp in df.groupby(by = ['gt', 'ds', 'rt', 'tool', 'mode', 'refinement']):
        rows.append(['%s_%s' % (ds, rt), gt, tool, mode, f, grp['precision'].mean(), grp['recall'].mean(), grp['adjrandindex'].mean(), grp['num_input_amplicons'].mean(), grp['input_mass'].mean(), grp['num_clusters'].mean(), grp['num_output_amplicons'].mean(), grp['output_mass'].mean(), ds, rt])
    return pd.DataFrame(rows, columns = df_columns)

In [15]:
df_joined_max_avg = average_complexity(df_max)
df_joined_mean_avg = average_complexity(df_mean)
df_joined_nbest_avg = average_complexity(df_nbest)

In [16]:
df_joined_max_avg.to_csv('%s/df_joined_max_avg.csv' % eval_dir, sep = ';', index = False)
df_joined_mean_avg.to_csv('%s/df_joined_mean_avg.csv' % eval_dir, sep = ';', index = False)
df_joined_nbest_avg.to_csv('%s/df_joined_nbest_avg.csv' % eval_dir, sep = ';', index = False)
#df_joined_max_avg = pd.read_csv('%s/df_joined_max_avg.csv' % eval_dir, sep = ';')
#df_joined_mean_avg = pd.read_csv('%s/df_joined_mean_avg.csv' % eval_dir, sep = ';')
#df_joined_nbest_avg = pd.read_csv('%s/df_joined_nbest_avg.csv' % eval_dir, sep = ';')

**Maximum clustering quality**

Rank by adjusted Rand index (per data set):

In [17]:
for (d, t), grp in df_joined_max_avg.groupby(by = ['data_set', 'tool']):
    print('Data set: %s / Tool: %s' % (d, t))
    display(grp.sort_values(by = 'adjrandindex', ascending = False))

Data set: balanced_paired / Tool: uparse


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
0,balanced_paired,callahan,uparse,otus,nf,0.948569,0.991342,0.957946,21808.0,467652.0,99.0,18891.0,463596.0,balanced,paired


Data set: balanced_paired / Tool: usearch


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
1,balanced_paired,callahan,usearch,fast_length,nf,0.97505,0.974797,0.962173,21808.0,467652.0,738.0,21808.0,467652.0,balanced,paired
4,balanced_paired,callahan,usearch,smallmem_size,nf,0.979604,0.965789,0.956898,21808.0,467652.0,2965.0,21808.0,467652.0,balanced,paired
2,balanced_paired,callahan,usearch,fast_size,nf,0.979669,0.965754,0.956857,21808.0,467652.0,3170.0,21808.0,467652.0,balanced,paired
3,balanced_paired,callahan,usearch,smallmem_length,nf,0.963368,0.971434,0.954777,21808.0,467652.0,1646.0,21808.0,467652.0,balanced,paired


Data set: balanced_paired / Tool: vsearch


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
7,balanced_paired,callahan,vsearch,smallmem_length,nf,0.996442,0.960766,0.966456,21808.0,467652.0,2973.0,21808.0,467652.0,balanced,paired
5,balanced_paired,callahan,vsearch,fast,nf,0.996442,0.960633,0.966443,21808.0,467652.0,2977.0,21808.0,467652.0,balanced,paired
6,balanced_paired,callahan,vsearch,size,nf,0.979998,0.965057,0.956424,21808.0,467652.0,2941.0,21808.0,467652.0,balanced,paired
8,balanced_paired,callahan,vsearch,smallmem_size,nf,0.979998,0.965057,0.956424,21808.0,467652.0,2941.0,21808.0,467652.0,balanced,paired


Data set: balanced_single / Tool: uparse


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
9,balanced_single,callahan,uparse,otus,nf,0.961808,0.990367,0.96679,33523.0,558059.0,591.0,30452.0,553807.0,balanced,single


Data set: balanced_single / Tool: usearch


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
12,balanced_single,callahan,usearch,smallmem_length,nf,0.962336,0.982932,0.960173,33523.0,558059.0,2383.0,33523.0,558059.0,balanced,single
13,balanced_single,callahan,usearch,smallmem_size,nf,0.962336,0.982932,0.960173,33523.0,558059.0,2383.0,33523.0,558059.0,balanced,single
11,balanced_single,callahan,usearch,fast_size,nf,0.962171,0.983018,0.960132,33523.0,558059.0,2432.0,33523.0,558059.0,balanced,single
10,balanced_single,callahan,usearch,fast_length,nf,0.988856,0.954912,0.957855,33523.0,558059.0,1351.0,33523.0,558059.0,balanced,single


Data set: balanced_single / Tool: vsearch


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
15,balanced_single,callahan,vsearch,size,nf,0.962533,0.982747,0.960249,33523.0,558059.0,2380.0,33523.0,558059.0,balanced,single
16,balanced_single,callahan,vsearch,smallmem_length,nf,0.962533,0.982747,0.960249,33523.0,558059.0,2380.0,33523.0,558059.0,balanced,single
17,balanced_single,callahan,vsearch,smallmem_size,nf,0.962533,0.982747,0.960249,33523.0,558059.0,2380.0,33523.0,558059.0,balanced,single
14,balanced_single,callahan,vsearch,fast,nf,0.962533,0.982747,0.960249,33523.0,558059.0,2383.0,33523.0,558059.0,balanced,single


Data set: hmp_paired / Tool: uparse


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
18,hmp_paired,callahan,uparse,otus,nf,0.999385,0.756362,0.772837,19882.0,208157.0,271.0,15710.0,201779.0,hmp,paired


Data set: hmp_paired / Tool: usearch


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
22,hmp_paired,callahan,usearch,smallmem_size,nf,0.98431,0.946166,0.948972,19882.0,208157.0,1027.0,19882.0,208157.0,hmp,paired
20,hmp_paired,callahan,usearch,fast_size,nf,0.980798,0.932878,0.931659,19882.0,208157.0,722.0,19882.0,208157.0,hmp,paired
21,hmp_paired,callahan,usearch,smallmem_length,nf,0.979962,0.902208,0.919493,19882.0,208157.0,630.0,19882.0,208157.0,hmp,paired
19,hmp_paired,callahan,usearch,fast_length,nf,0.952065,0.839299,0.857109,19882.0,208157.0,201.0,19882.0,208157.0,hmp,paired


Data set: hmp_paired / Tool: vsearch


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
24,hmp_paired,callahan,vsearch,size,nf,0.982547,0.951767,0.950444,19882.0,208157.0,547.0,19882.0,208157.0,hmp,paired
26,hmp_paired,callahan,vsearch,smallmem_size,nf,0.982547,0.951767,0.950444,19882.0,208157.0,547.0,19882.0,208157.0,hmp,paired
25,hmp_paired,callahan,vsearch,smallmem_length,nf,0.983229,0.885548,0.90209,19882.0,208157.0,619.0,19882.0,208157.0,hmp,paired
23,hmp_paired,callahan,vsearch,fast,nf,0.940564,0.905514,0.865029,19882.0,208157.0,198.0,19882.0,208157.0,hmp,paired


Data set: hmp_single / Tool: uparse


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
27,hmp_single,callahan,uparse,otus,nf,0.993017,0.978726,0.984052,73071.0,449409.0,1884.0,66339.0,437522.0,hmp,single


Data set: hmp_single / Tool: usearch


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
29,hmp_single,callahan,usearch,fast_size,nf,0.986843,0.955684,0.964832,73071.0,449409.0,3670.0,73071.0,449409.0,hmp,single
30,hmp_single,callahan,usearch,smallmem_length,nf,0.986511,0.956049,0.964586,73071.0,449409.0,3535.0,73071.0,449409.0,hmp,single
31,hmp_single,callahan,usearch,smallmem_size,nf,0.986511,0.956049,0.964586,73071.0,449409.0,3535.0,73071.0,449409.0,hmp,single
28,hmp_single,callahan,usearch,fast_length,nf,0.978071,0.945484,0.943576,73071.0,449409.0,1057.0,73071.0,449409.0,hmp,single


Data set: hmp_single / Tool: vsearch


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
33,hmp_single,callahan,vsearch,size,nf,0.992599,0.953846,0.97361,73071.0,449409.0,5714.0,73071.0,449409.0,hmp,single
34,hmp_single,callahan,vsearch,smallmem_length,nf,0.992599,0.953846,0.97361,73071.0,449409.0,5714.0,73071.0,449409.0,hmp,single
35,hmp_single,callahan,vsearch,smallmem_size,nf,0.992599,0.953846,0.97361,73071.0,449409.0,5714.0,73071.0,449409.0,hmp,single
32,hmp_single,callahan,vsearch,fast,nf,0.992828,0.953775,0.973584,73071.0,449409.0,5765.0,73071.0,449409.0,hmp,single


The best option of VSEARCH is better than the best option of USEARCH on each data set, but the difference is small (below 0.01 in adjusted Rand index in every case).   
UPARSE is usually similar to both (sometimes slightly better, sometimes slightly worse), but notably lower on hmp_paired (0.773 compared with about 0.95).   
However, the results have to be compared with caution because, for UPARSE, only the sequences not labelled as chimeric are considered.
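
To gauge how much of the input UPARSE leaves out, the retained fractions can be computed from the `uparse` rows of the tables above. This is a small arithmetic sketch on the reported numbers and is not part of the original evaluation.

```python
# (num_input_amplicons, input_mass, num_output_amplicons, output_mass) per data set,
# copied from the uparse rows above.
uparse_rows = {
    "balanced_paired": (21808, 467652, 18891, 463596),
    "balanced_single": (33523, 558059, 30452, 553807),
    "hmp_paired": (19882, 208157, 15710, 201779),
    "hmp_single": (73071, 449409, 66339, 437522),
}

retained = {}
for name, (n_in, mass_in, n_out, mass_out) in uparse_rows.items():
    retained[name] = (n_out / n_in, mass_out / mass_in)
    print(f"{name}: {n_out / n_in:.1%} of amplicons, {mass_out / mass_in:.1%} of mass")
```

On these numbers UPARSE keeps roughly 79% to 91% of the amplicons but about 97% or more of the mass, so the sequences it leaves out are mostly of low abundance.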

Average maximum values over all data sets and sort by adjusted Rand index:

In [18]:
rows = []
for (t, m, f), grp in df_joined_max_avg.groupby(by = ['tool', 'mode', 'refinement']):
    rows.append([t, m, f, grp['precision'].mean(), grp['recall'].mean(), grp['adjrandindex'].mean()])
pd.DataFrame(rows, columns = ['tool', 'mode', 'refinement', 'precision', 'recall', 'adjrandindex']).sort_values(by = 'adjrandindex', ascending = False)

Unnamed: 0,tool,mode,refinement,precision,recall,adjrandindex
6,vsearch,size,nf,0.979419,0.963354,0.960181
8,vsearch,smallmem_size,nf,0.979419,0.963354,0.960181
4,usearch,smallmem_size,nf,0.97819,0.962734,0.957657
2,usearch,fast_size,nf,0.97737,0.959333,0.95337
7,vsearch,smallmem_length,nf,0.983701,0.945727,0.950601
3,usearch,smallmem_length,nf,0.973044,0.953156,0.949757
5,vsearch,fast,nf,0.973092,0.950667,0.941326
1,usearch,fast_length,nf,0.973511,0.928623,0.930178
0,uparse,otus,nf,0.975695,0.929199,0.920406


Options that sort by abundance (size) performed better on average. The differences between the regular (fast) and smallmem versions are minor.

`USEARCH` pick: `fast_size` = `-cluster_fast -sort size`

`VSEARCH` pick: `size` = `--cluster_size`
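
The averaged values in the ranking above can be checked by hand from the per-data-set maxima. The sketch below does so for three of the size-based options, with the values copied from the tables above in the order balanced_paired, balanced_single, hmp_paired, hmp_single.

```python
from statistics import mean

# Maximum adjusted Rand index per data set, copied from the per-data-set tables.
max_ari = {
    ("vsearch", "size"): [0.956424, 0.960249, 0.950444, 0.973610],
    ("usearch", "smallmem_size"): [0.956898, 0.960173, 0.948972, 0.964586],
    ("usearch", "fast_size"): [0.956857, 0.960132, 0.931659, 0.964832],
}

averages = {key: mean(values) for key, values in max_ari.items()}

for (tool, mode), value in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tool:<8} {mode:<14} {value:.6f}")
```

Because the inputs are rounded to six decimals, the results can differ from the table in the last digit.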

**Average clustering quality**

Rank by adjusted Rand index (per data set):

In [19]:
for (d, t), grp in df_joined_mean_avg.groupby(by = ['data_set', 'tool']):
    print('Data set: %s / Tool: %s' % (d, t))
    display(grp.sort_values(by = 'adjrandindex', ascending = False))

Data set: balanced_paired / Tool: uparse


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
0,balanced_paired,callahan,uparse,otus,nf,0.948569,0.991342,0.957946,21808.0,467652.0,99.0,18891.0,463596.0,balanced,paired


Data set: balanced_paired / Tool: usearch


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
1,balanced_paired,callahan,usearch,fast_length,nf,0.949012,0.973574,0.937067,21808.0,467652.0,998.5,21808.0,467652.0,balanced,paired
3,balanced_paired,callahan,usearch,smallmem_length,nf,0.917537,0.977878,0.910356,21808.0,467652.0,906.8,21808.0,467652.0,balanced,paired
4,balanced_paired,callahan,usearch,smallmem_size,nf,0.899123,0.979941,0.887607,21808.0,467652.0,897.5,21808.0,467652.0,balanced,paired
2,balanced_paired,callahan,usearch,fast_size,nf,0.899157,0.979903,0.887595,21808.0,467652.0,920.3,21808.0,467652.0,balanced,paired


Data set: balanced_paired / Tool: vsearch


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
7,balanced_paired,callahan,vsearch,smallmem_length,nf,0.925757,0.977098,0.918353,21808.0,467652.0,903.7,21808.0,467652.0,balanced,paired
5,balanced_paired,callahan,vsearch,fast,nf,0.925757,0.977085,0.918352,21808.0,467652.0,904.2,21808.0,467652.0,balanced,paired
6,balanced_paired,callahan,vsearch,size,nf,0.899669,0.979354,0.887956,21808.0,467652.0,895.1,21808.0,467652.0,balanced,paired
8,balanced_paired,callahan,vsearch,smallmem_size,nf,0.899669,0.979354,0.887956,21808.0,467652.0,895.1,21808.0,467652.0,balanced,paired


Data set: balanced_single / Tool: uparse


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
9,balanced_single,callahan,uparse,otus,nf,0.961808,0.990367,0.96679,33523.0,558059.0,591.0,30452.0,553807.0,balanced,single


Data set: balanced_single / Tool: usearch


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
10,balanced_single,callahan,usearch,fast_length,nf,0.96678,0.930208,0.896599,33523.0,558059.0,3090.7,33523.0,558059.0,balanced,single
12,balanced_single,callahan,usearch,smallmem_length,nf,0.911093,0.97744,0.892855,33523.0,558059.0,1838.2,33523.0,558059.0,balanced,single
13,balanced_single,callahan,usearch,smallmem_size,nf,0.911093,0.97744,0.892855,33523.0,558059.0,1838.2,33523.0,558059.0,balanced,single
11,balanced_single,callahan,usearch,fast_size,nf,0.911057,0.977443,0.892699,33523.0,558059.0,1908.1,33523.0,558059.0,balanced,single


Data set: balanced_single / Tool: vsearch


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
15,balanced_single,callahan,vsearch,size,nf,0.909752,0.977831,0.891719,33523.0,558059.0,1815.9,33523.0,558059.0,balanced,single
16,balanced_single,callahan,vsearch,smallmem_length,nf,0.909752,0.977831,0.891719,33523.0,558059.0,1815.9,33523.0,558059.0,balanced,single
17,balanced_single,callahan,vsearch,smallmem_size,nf,0.909752,0.977831,0.891719,33523.0,558059.0,1815.9,33523.0,558059.0,balanced,single
14,balanced_single,callahan,vsearch,fast,nf,0.909753,0.97783,0.891718,33523.0,558059.0,1818.4,33523.0,558059.0,balanced,single


Data set: hmp_paired / Tool: uparse


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
18,hmp_paired,callahan,uparse,otus,nf,0.999385,0.756362,0.772837,19882.0,208157.0,271.0,15710.0,201779.0,hmp,paired


Data set: hmp_paired / Tool: usearch


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
21,hmp_paired,callahan,usearch,smallmem_length,nf,0.94325,0.875654,0.821855,19882.0,208157.0,723.2,19882.0,208157.0,hmp,paired
22,hmp_paired,callahan,usearch,smallmem_size,nf,0.881686,0.941995,0.778172,19882.0,208157.0,667.3,19882.0,208157.0,hmp,paired
19,hmp_paired,callahan,usearch,fast_length,nf,0.978759,0.763948,0.774877,19882.0,208157.0,997.2,19882.0,208157.0,hmp,paired
20,hmp_paired,callahan,usearch,fast_size,nf,0.88323,0.931885,0.770827,19882.0,208157.0,930.4,19882.0,208157.0,hmp,paired


Data set: hmp_paired / Tool: vsearch


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
25,hmp_paired,callahan,vsearch,smallmem_length,nf,0.94954,0.863909,0.824888,19882.0,208157.0,722.2,19882.0,208157.0,hmp,paired
23,hmp_paired,callahan,vsearch,fast,nf,0.960892,0.809043,0.787176,19882.0,208157.0,787.8,19882.0,208157.0,hmp,paired
24,hmp_paired,callahan,vsearch,size,nf,0.882213,0.939136,0.778996,19882.0,208157.0,676.8,19882.0,208157.0,hmp,paired
26,hmp_paired,callahan,vsearch,smallmem_size,nf,0.882213,0.939136,0.778996,19882.0,208157.0,676.8,19882.0,208157.0,hmp,paired


Data set: hmp_single / Tool: uparse


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
27,hmp_single,callahan,uparse,otus,nf,0.993017,0.978726,0.984052,73071.0,449409.0,1884.0,66339.0,437522.0,hmp,single


Data set: hmp_single / Tool: usearch


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
28,hmp_single,callahan,usearch,fast_length,nf,0.983296,0.855502,0.85272,73071.0,449409.0,9532.3,73071.0,449409.0,hmp,single
30,hmp_single,callahan,usearch,smallmem_length,nf,0.893841,0.927139,0.77922,73071.0,449409.0,5912.9,73071.0,449409.0,hmp,single
31,hmp_single,callahan,usearch,smallmem_size,nf,0.893841,0.927139,0.77922,73071.0,449409.0,5912.9,73071.0,449409.0,hmp,single
29,hmp_single,callahan,usearch,fast_size,nf,0.894349,0.925273,0.777587,73071.0,449409.0,6649.3,73071.0,449409.0,hmp,single


Data set: hmp_single / Tool: vsearch


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
32,hmp_single,callahan,vsearch,fast,nf,0.894363,0.927513,0.781736,73071.0,449409.0,5541.3,73071.0,449409.0,hmp,single
33,hmp_single,callahan,vsearch,size,nf,0.893389,0.928299,0.781593,73071.0,449409.0,5504.0,73071.0,449409.0,hmp,single
34,hmp_single,callahan,vsearch,smallmem_length,nf,0.893389,0.928299,0.781593,73071.0,449409.0,5504.0,73071.0,449409.0,hmp,single
35,hmp_single,callahan,vsearch,smallmem_size,nf,0.893389,0.928299,0.781593,73071.0,449409.0,5504.0,73071.0,449409.0,hmp,single


The best option of USEARCH is typically (slightly) better than the best option of VSEARCH. The difference is larger on hmp_single.
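The per-tool comparison can be made explicit by keeping only the best-scoring option per data set and tool. The following is a minimal sketch, assuming a data frame with the columns `data_set`, `tool`, `mode` and `adjrandindex`. The toy frame `df` and the `gap` column are illustrative names, and the rows are only a subset of the hmp values shown in the tables above:

```python
import pandas as pd

# Toy excerpt of the per-data-set averages (hmp rows copied from the tables above).
df = pd.DataFrame(
    [
        ("hmp_paired", "usearch", "smallmem_length", 0.821855),
        ("hmp_paired", "usearch", "fast_length", 0.774877),
        ("hmp_paired", "vsearch", "smallmem_length", 0.824888),
        ("hmp_paired", "vsearch", "fast", 0.787176),
        ("hmp_single", "usearch", "fast_length", 0.852720),
        ("hmp_single", "usearch", "smallmem_length", 0.779220),
        ("hmp_single", "vsearch", "fast", 0.781736),
        ("hmp_single", "vsearch", "size", 0.781593),
    ],
    columns=["data_set", "tool", "mode", "adjrandindex"],
)

# Keep the row with the highest adjusted Rand index per (data set, tool).
best = df.loc[df.groupby(["data_set", "tool"])["adjrandindex"].idxmax()]

# One column per tool, then the USEARCH-minus-VSEARCH gap per data set.
best_wide = best.pivot(index="data_set", columns="tool", values="adjrandindex")
best_wide["gap"] = best_wide["usearch"] - best_wide["vsearch"]
print(best_wide.round(6))
```

On these rows the gap is about 0.07 on hmp_single and slightly negative on hmp_paired.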

Average the mean values over all data sets and sort by adjusted Rand index:

In [20]:
rows = []
for (t, m, f), grp in df_joined_mean_avg.groupby(by = ['tool', 'mode', 'refinement']):
    rows.append([t, m, f, grp['precision'].mean(), grp['recall'].mean(), grp['adjrandindex'].mean()])
pd.DataFrame(rows, columns = ['tool', 'mode', 'refinement', 'precision', 'recall', 'adjrandindex']).sort_values(by = 'adjrandindex', ascending = False)

Unnamed: 0,tool,mode,refinement,precision,recall,adjrandindex
0,uparse,otus,nf,0.975695,0.929199,0.920406
1,usearch,fast_length,nf,0.969462,0.880808,0.865316
7,vsearch,smallmem_length,nf,0.919609,0.936784,0.854138
3,usearch,smallmem_length,nf,0.91643,0.939528,0.851071
5,vsearch,fast,nf,0.922691,0.922868,0.844745
6,vsearch,size,nf,0.896256,0.956155,0.835066
8,vsearch,smallmem_size,nf,0.896256,0.956155,0.835066
4,usearch,smallmem_size,nf,0.896436,0.956629,0.834463
2,usearch,fast_size,nf,0.896948,0.953626,0.832177


Options using the abundance (size) as the sorting criterion performed worse than their length-sorted counterparts. The differences between the normal / fast and smallmem versions are minor.
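The averaging over data sets can also be written without an explicit loop. The following sketch assumes the same column names as in the cell above. The frame `toy` is a made-up stand-in for `df_joined_mean_avg` with invented values, used only to show the shape of the computation:

```python
import pandas as pd

# Toy stand-in for df_joined_mean_avg: two data sets, one option per tool.
toy = pd.DataFrame(
    {
        "data_set": ["a", "a", "b", "b"],
        "tool": ["usearch", "vsearch", "usearch", "vsearch"],
        "mode": ["fast_length", "fast", "fast_length", "fast"],
        "refinement": ["nf", "nf", "nf", "nf"],
        "precision": [0.98, 0.90, 0.96, 0.94],
        "recall": [0.80, 0.92, 0.90, 0.94],
        "adjrandindex": [0.80, 0.78, 0.90, 0.88],
    }
)

metrics = ["precision", "recall", "adjrandindex"]

# Mean of each metric per (tool, mode, refinement), best adjusted Rand index first.
summary = (
    toy.groupby(["tool", "mode", "refinement"])[metrics]
    .mean()
    .reset_index()
    .sort_values(by="adjrandindex", ascending=False)
)
print(summary)
```

Because `groupby` sorts the group keys by default, the resulting row order before sorting matches the order produced by the loop.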

`USEARCH` pick: `fast_length` = `-cluster_fast -sort length`

`VSEARCH` pick: `smallmem_length` = `--cluster_smallmem --minsize 1` on length-ordered input FASTA file
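The length-ordered input for the `VSEARCH` pick can be prepared in several ways. The following is only a sketch of one possibility, assuming that "length-ordered" means decreasing sequence length. The helpers `read_fasta` and `sort_by_length` are hypothetical names and not part of the notebook, and plain Python is used instead of `SeqIO` to keep the example self-contained:

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sort_by_length(records):
    """Return records ordered by decreasing sequence length (ties keep input order)."""
    return sorted(records, key=lambda rec: len(rec[1]), reverse=True)


# Made-up example records; the ';size=' annotations are only illustrative.
example = """>a;size=5
ACGT
>b;size=9
ACGTAC
GT
>c;size=1
ACGTA
""".splitlines()

ordered = sort_by_length(read_fasta(example))
for header, seq in ordered:
    print(f">{header}\n{seq}")
```

Multi-line sequences are joined before their length is measured, so wrapped FASTA files are ordered by the full sequence length.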

The differences between abundance- and length-sorted options are relatively small. Using the options with the best maximum average is probably okay.

**N-best average clustering quality**

Rank by adjusted Rand index (per data set):

In [21]:
for (d, t), grp in df_joined_nbest_avg.groupby(by = ['data_set', 'tool']):
    print('Data set: %s / Tool: %s' % (d, t))
    display(grp.sort_values(by = 'adjrandindex', ascending = False))

Data set: balanced_paired / Tool: uparse


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
0,balanced_paired,callahan,uparse,otus,nf,0.948569,0.991342,0.957946,21808.0,467652.0,99.0,18891.0,463596.0,balanced,paired


Data set: balanced_paired / Tool: usearch


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
1,balanced_paired,callahan,usearch,fast_length,nf,0.966155,0.971866,0.957449,21808.0,467652.0,937.0,21808.0,467652.0,balanced,paired
3,balanced_paired,callahan,usearch,smallmem_length,nf,0.950502,0.974358,0.94243,21808.0,467652.0,1510.8,21808.0,467652.0,balanced,paired
2,balanced_paired,callahan,usearch,fast_size,nf,0.945044,0.977779,0.934843,21808.0,467652.0,1545.6,21808.0,467652.0,balanced,paired
4,balanced_paired,callahan,usearch,smallmem_size,nf,0.944993,0.977819,0.934801,21808.0,467652.0,1498.8,21808.0,467652.0,balanced,paired


Data set: balanced_paired / Tool: vsearch


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
7,balanced_paired,callahan,vsearch,smallmem_length,nf,0.960568,0.973587,0.953716,21808.0,467652.0,1504.6,21808.0,467652.0,balanced,paired
5,balanced_paired,callahan,vsearch,fast,nf,0.960568,0.97356,0.953714,21808.0,467652.0,1505.6,21808.0,467652.0,balanced,paired
6,balanced_paired,callahan,vsearch,size,nf,0.945746,0.976915,0.93521,21808.0,467652.0,1494.0,21808.0,467652.0,balanced,paired
8,balanced_paired,callahan,vsearch,smallmem_size,nf,0.945746,0.976915,0.93521,21808.0,467652.0,1494.0,21808.0,467652.0,balanced,paired


Data set: balanced_single / Tool: uparse


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
9,balanced_single,callahan,uparse,otus,nf,0.961808,0.990367,0.96679,33523.0,558059.0,591.0,30452.0,553807.0,balanced,single


Data set: balanced_single / Tool: usearch


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
12,balanced_single,callahan,usearch,smallmem_length,nf,0.952507,0.975946,0.939728,33523.0,558059.0,3146.2,33523.0,558059.0,balanced,single
13,balanced_single,callahan,usearch,smallmem_size,nf,0.952507,0.975946,0.939728,33523.0,558059.0,3146.2,33523.0,558059.0,balanced,single
11,balanced_single,callahan,usearch,fast_size,nf,0.952502,0.975877,0.939591,33523.0,558059.0,3286.0,33523.0,558059.0,balanced,single
10,balanced_single,callahan,usearch,fast_length,nf,0.958127,0.953234,0.924435,33523.0,558059.0,1078.6,33523.0,558059.0,balanced,single


Data set: balanced_single / Tool: vsearch


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
15,balanced_single,callahan,vsearch,size,nf,0.952583,0.975808,0.939582,33523.0,558059.0,3104.0,33523.0,558059.0,balanced,single
16,balanced_single,callahan,vsearch,smallmem_length,nf,0.952583,0.975808,0.939582,33523.0,558059.0,3104.0,33523.0,558059.0,balanced,single
17,balanced_single,callahan,vsearch,smallmem_size,nf,0.952583,0.975808,0.939582,33523.0,558059.0,3104.0,33523.0,558059.0,balanced,single
14,balanced_single,callahan,vsearch,fast,nf,0.952584,0.975807,0.93958,33523.0,558059.0,3108.0,33523.0,558059.0,balanced,single


Data set: hmp_paired / Tool: uparse


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
18,hmp_paired,callahan,uparse,otus,nf,0.999385,0.756362,0.772837,19882.0,208157.0,271.0,15710.0,201779.0,hmp,paired


Data set: hmp_paired / Tool: usearch


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
22,hmp_paired,callahan,usearch,smallmem_size,nf,0.956596,0.94232,0.905991,19882.0,208157.0,1081.4,19882.0,208157.0,hmp,paired
20,hmp_paired,callahan,usearch,fast_size,nf,0.956888,0.927343,0.89212,19882.0,208157.0,1546.0,19882.0,208157.0,hmp,paired
21,hmp_paired,callahan,usearch,smallmem_length,nf,0.963259,0.884334,0.865878,19882.0,208157.0,540.8,19882.0,208157.0,hmp,paired
19,hmp_paired,callahan,usearch,fast_length,nf,0.971699,0.789343,0.803311,19882.0,208157.0,410.0,19882.0,208157.0,hmp,paired


Data set: hmp_paired / Tool: vsearch


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
24,hmp_paired,callahan,vsearch,size,nf,0.957977,0.940242,0.904843,19882.0,208157.0,1100.8,19882.0,208157.0,hmp,paired
26,hmp_paired,callahan,vsearch,smallmem_size,nf,0.957977,0.940242,0.904843,19882.0,208157.0,1100.8,19882.0,208157.0,hmp,paired
25,hmp_paired,callahan,vsearch,smallmem_length,nf,0.962135,0.877359,0.863458,19882.0,208157.0,538.8,19882.0,208157.0,hmp,paired
23,hmp_paired,callahan,vsearch,fast,nf,0.95063,0.847301,0.813824,19882.0,208157.0,398.6,19882.0,208157.0,hmp,paired


Data set: hmp_single / Tool: uparse


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
27,hmp_single,callahan,uparse,otus,nf,0.993017,0.978726,0.984052,73071.0,449409.0,1884.0,66339.0,437522.0,hmp,single


Data set: hmp_single / Tool: usearch


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
28,hmp_single,callahan,usearch,fast_length,nf,0.977345,0.921184,0.92523,73071.0,449409.0,1533.6,73071.0,449409.0,hmp,single
30,hmp_single,callahan,usearch,smallmem_length,nf,0.966593,0.924156,0.901021,73071.0,449409.0,10843.2,73071.0,449409.0,hmp,single
31,hmp_single,callahan,usearch,smallmem_size,nf,0.966593,0.924156,0.901021,73071.0,449409.0,10843.2,73071.0,449409.0,hmp,single
29,hmp_single,callahan,usearch,fast_size,nf,0.966987,0.921309,0.897838,73071.0,449409.0,12275.2,73071.0,449409.0,hmp,single


Data set: hmp_single / Tool: vsearch


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
33,hmp_single,callahan,vsearch,size,nf,0.968443,0.925015,0.906592,73071.0,449409.0,10045.8,73071.0,449409.0,hmp,single
34,hmp_single,callahan,vsearch,smallmem_length,nf,0.968443,0.925015,0.906592,73071.0,449409.0,10045.8,73071.0,449409.0,hmp,single
35,hmp_single,callahan,vsearch,smallmem_size,nf,0.968443,0.925015,0.906592,73071.0,449409.0,10045.8,73071.0,449409.0,hmp,single
32,hmp_single,callahan,vsearch,fast,nf,0.96869,0.924847,0.906461,73071.0,449409.0,10094.0,73071.0,449409.0,hmp,single


The tendencies are mixed. The differences are usually not large.

Average the N-best values over all data sets and sort by adjusted Rand index:

In [22]:
rows = []
for (t, m, f), grp in df_joined_nbest_avg.groupby(by = ['tool', 'mode', 'refinement']):
    rows.append([t, m, f, grp['precision'].mean(), grp['recall'].mean(), grp['adjrandindex'].mean()])
pd.DataFrame(rows, columns = ['tool', 'mode', 'refinement', 'precision', 'recall', 'adjrandindex']).sort_values(by = 'adjrandindex', ascending = False)

Unnamed: 0,tool,mode,refinement,precision,recall,adjrandindex
6,vsearch,size,nf,0.956187,0.954495,0.921557
8,vsearch,smallmem_size,nf,0.956187,0.954495,0.921557
0,uparse,otus,nf,0.975695,0.929199,0.920406
4,usearch,smallmem_size,nf,0.955172,0.95506,0.920385
2,usearch,fast_size,nf,0.955355,0.950577,0.916098
7,vsearch,smallmem_length,nf,0.960932,0.937942,0.915837
3,usearch,smallmem_length,nf,0.958215,0.939699,0.912264
5,vsearch,fast,nf,0.958118,0.930379,0.903395
1,usearch,fast_length,nf,0.968332,0.908907,0.902606


The results are similar to those for the maximum.