In [1]:
from IPython.display import display, HTML  # public import path; IPython.core.display is deprecated for this
display(HTML("<style>.container { width:90% !important; }</style>"))


import matplotlib.pyplot as plt
import matplotlib.patheffects as pe
import os
import pandas as pd
import seaborn as sb
import statistics as st

from Bio import SeqIO

sb.set()
pd.set_option("display.max_rows", None)

# Evaluation of model-supported clustering methods implemented in GeFaST on Callahan data

The following notebook describes the steps and results of the evaluation.

In [None]:
# Initial files and directories:
#
# model_supported_callahan
# |- data
# |  |- balanced
# |  |  \- BalancedRefSeqs.fasta  # provided reference sequences of 'balanced' data set
# |  |
# |  \- hmp
# |     \- HMP_MOCK.fasta  # provided reference sequences of 'hmp' data set
# |
# |- evaluation # will contain the evaluation plots and tables
# |
# |- outputs # will contain the cluster and metric outputs
# |
# |- tasks  # task files for the different runs of GeFaST
# |
# |- base_as.conf  # common base configuration of runs in 'as' mode
# |- base_cons.conf  # common base configuration of runs in 'cons' mode
# \- base_lev.conf  # common base configuration of runs in 'lev' mode

The provided reference sequences are part of the [Supplementary Software](https://static-content.springer.com/esm/art%3A10.1038%2Fnmeth.3869/MediaObjects/41592_2016_BFnmeth3869_MOESM270_ESM.zip) of the DADA2 paper.   

## Analysis workflow

The data sets are preprocessed as described in Callahan et al., *DADA2: High-resolution sample inference from Illumina amplicon data* (https://doi.org/10.1038/nmeth.3869),
except that the minimum sequence abundance is set to 1 (not 2) for the sake of a fair comparison with other data sets and tools.

The taxonomic assignment is obtained by merging (using `USEARCH -fastq_mergepairs`) and dereplicating (using `USEARCH -derep_fulllength`) the reads, 
and matching them with the respective reference sequences (using `VSEARCH --usearch_global`).

GeFaST is executed in different modes (`as`, `lev`, `cons`), with different clusterers (`classic`, `cons-classic`, `cons-swarmer`) and with different cluster refiners (`classic`, `cons-lsa`, `cons-lsr`, `cons-lss`).

The different runs are usually referred to by an abbreviation hinting at the mode, clusterer and refiner: `<mode/clusterer>__<refinement>`, e.g. `as__nf` and `lev_dada2-c__lsa1`.

Values of `<mode/clusterer>`:

| Abbreviation | Description |
| --: | :-- |
| as          | alignment-score mode with `classic` clusterer |
| as_dada2-c  | alignment-score mode with `cons-classic` clusterer |
| dada2-s     | consistency mode with `cons-swarmer` clusterer |
| lev         | Levenshtein mode with `classic` clusterer |
| lev_dada2-c | Levenshtein mode with `cons-classic` clusterer |

Values of `<refinement>`:

| Abbreviation | Description |
| --:  | :-- |
| nf   | no refinement |
| f1   | fastidious refinement with `classic` refiner, incremented refinement threshold (+1 for `lev` / `lev_cons`, +20 for `as` / `as_cons`) |
| 2f   | fastidious refinement with `classic` refiner, doubled refinement threshold |
| lsa1 | consistency-based refinement with `cons-lsa` refiner (option 1) |
| lsa2 | consistency-based refinement with `cons-lsa` refiner (option 2) |
| lsa3 | consistency-based refinement with `cons-lsa` refiner (option 3) |
| lsa4 | consistency-based refinement with `cons-lsa` refiner (option 4) |
| lsr1 | consistency-based refinement with `cons-lsr` refiner (option 1) |
| lsr2 | consistency-based refinement with `cons-lsr` refiner (option 2) |
| lsr3 | consistency-based refinement with `cons-lsr` refiner (option 3) |
| lsr4 | consistency-based refinement with `cons-lsr` refiner (option 4) |
| lss  | consistency-based refinement with `cons-lss` refiner |
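
The naming scheme can be illustrated with a short sketch. It rebuilds the run abbreviations from the two tables above and splits an abbreviation back into its parts. The helper names `run_names` and `parse_run` are hypothetical and not part of GeFaST or the analysis scripts. The exclusion of fastidious refinements for `dada2-s` is inferred from the run lists used in this notebook, not from a stated rule.

```python
from itertools import product

# Values of <mode/clusterer> and <refinement> as listed in the tables above.
MODES = ['as', 'as_dada2-c', 'dada2-s', 'lev', 'lev_dada2-c']
FASTIDIOUS = ['f1', '2f']
REFINEMENTS = (['nf'] + FASTIDIOUS
               + ['lsa%d' % i for i in range(1, 5)]
               + ['lsr%d' % i for i in range(1, 5)]
               + ['lss'])


def run_names():
    """Build all run abbreviations of the form <mode/clusterer>__<refinement>."""
    names = []
    for mode, refinement in product(MODES, REFINEMENTS):
        # Assumption: dada2-s is not combined with fastidious refinement,
        # as suggested by the run lists in this notebook.
        if mode == 'dada2-s' and refinement in FASTIDIOUS:
            continue
        names.append('%s__%s' % (mode, refinement))
    return names


def parse_run(run):
    """Split a run abbreviation at the double underscore."""
    mode_clusterer, refinement = run.split('__')
    return mode_clusterer, refinement


print(len(run_names()))
print(parse_run('lev_dada2-c__lsa1'))
```

The double underscore is what makes the split unambiguous, because single underscores also occur inside `<mode/clusterer>` (e.g. `as_dada2-c`).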

## Commands

The following commands prepare and cluster the data sets. The results are evaluated below.

In order to execute the workflow as provided here, the `tools` subdirectory of the overall repository should contain the USEARCH binaries `usearch8.0.1623_i86linux32` and `usearch10.0.240_i86linux32`, but the paths can be adjusted.
In addition, VSEARCH is expected to be accessible through the `vsearch` command.  

IMPORTANT: The commands are not intended to be executed from this notebook. They should be executed from the root directory of the overall repository.

In [None]:
%%bash

TOOLS_DIR=tools
ANALYSIS_DIR=analyses/model_supported_callahan
DATA_DIR=${ANALYSIS_DIR}/data
OUTPUT_DIR=${ANALYSIS_DIR}/outputs

USEARCH8_PATH=${TOOLS_DIR}/usearch8.0.1623_i86linux32 # adjust to your system
USEARCH10_PATH=${TOOLS_DIR}/usearch10.0.240_i86linux32 # adjust to your system

GEFAST=${TOOLS_DIR}/GeFaST/build/GeFaST # adjust to your system

RUNS=( as__nf as__2f as__f1 as__lsa1 as__lsa2 as__lsa3 as__lsa4 as__lsr1 as__lsr2 as__lsr3 as__lsr4 as__lss as_dada2-c__nf as_dada2-c__2f as_dada2-c__f1 as_dada2-c__lsa1 as_dada2-c__lsa2 as_dada2-c__lsa3 as_dada2-c__lsa4 as_dada2-c__lsr1 as_dada2-c__lsr2 as_dada2-c__lsr3 as_dada2-c__lsr4 as_dada2-c__lss dada2-s__nf dada2-s__lsa1 dada2-s__lsa2 dada2-s__lsa3 dada2-s__lsa4 dada2-s__lsr1 dada2-s__lsr2 dada2-s__lsr3 dada2-s__lsr4 dada2-s__lss lev__nf lev__2f lev__f1 lev__lsa1 lev__lsa2 lev__lsa3 lev__lsa4 lev__lsr1 lev__lsr2 lev__lsr3 lev__lsr4 lev__lss lev_dada2-c__nf lev_dada2-c__2f lev_dada2-c__f1 lev_dada2-c__lsa1 lev_dada2-c__lsa2 lev_dada2-c__lsa3 lev_dada2-c__lsa4 lev_dada2-c__lsr1 lev_dada2-c__lsr2 lev_dada2-c__lsr3 lev_dada2-c__lsr4 lev_dada2-c__lss )

## balanced
python -m scripts.analyses.analysis_callahan reference balanced ${DATA_DIR}/balanced/BalancedRefSeqs.fasta ${DATA_DIR}/balanced/callahan.fasta 

# paired
python -m scripts.analyses.analysis_callahan prepare balanced ${DATA_DIR}/balanced --min_size 1 --usearch8 ${USEARCH8_PATH} --usearch10 ${USEARCH10_PATH}
python -m scripts.analyses.analysis_callahan taxonomy ${DATA_DIR}/balanced/ERR777695_pfmd.fastq ${DATA_DIR}/balanced/callahan.fasta ${DATA_DIR}/balanced/bp_callahan_0.97.tax 0.97

READS=bp:${DATA_DIR}/balanced/ERR777695_pfmd.fastq
TAX=callahan:${DATA_DIR}/balanced/bp_callahan_0.97.tax
for R in "${RUNS[@]}"; do
    python -m scripts.analyses.analysis_callahan run ${R} ${READS} ${ANALYSIS_DIR}/tasks/${R}.txt ${OUTPUT_DIR}/balanced_paired/${R} --tax_files ${TAX} --gefast ${GEFAST}
    for F in ${OUTPUT_DIR}/balanced_paired/${R}/*__metrics.csv; do mv ${F} ${OUTPUT_DIR}/balanced_paired/${R}/${R}_${F##*/}; done
done


# single
python -m scripts.analyses.analysis_callahan prepare balanced ${DATA_DIR}/balanced --single --min_size 1 --usearch8 ${USEARCH8_PATH} --usearch10 ${USEARCH10_PATH}
python -m scripts.analyses.analysis_callahan taxonomy ${DATA_DIR}/balanced/ERR777695_1_sfd.fastq ${DATA_DIR}/balanced/callahan.fasta ${DATA_DIR}/balanced/bs_callahan_0.97.tax 0.97

READS=bs:${DATA_DIR}/balanced/ERR777695_1_sfd.fastq
TAX=callahan:${DATA_DIR}/balanced/bs_callahan_0.97.tax
for R in "${RUNS[@]}"; do
    python -m scripts.analyses.analysis_callahan run ${R} ${READS} ${ANALYSIS_DIR}/tasks/${R}.txt ${OUTPUT_DIR}/balanced_single/${R} --tax_files ${TAX} --gefast ${GEFAST}
    for F in ${OUTPUT_DIR}/balanced_single/${R}/*__metrics.csv; do mv ${F} ${OUTPUT_DIR}/balanced_single/${R}/${R}_${F##*/}; done
done


## hmp
python -m scripts.analyses.analysis_callahan reference hmp ${DATA_DIR}/hmp/HMP_MOCK.fasta ${DATA_DIR}/hmp/callahan.fasta 

# paired
python -m scripts.analyses.analysis_callahan prepare hmp ${DATA_DIR}/hmp --min_size 1 --usearch8 ${USEARCH8_PATH} --usearch10 ${USEARCH10_PATH}
python -m scripts.analyses.analysis_callahan taxonomy ${DATA_DIR}/hmp/Mock1_S1_L001_pfmd.fastq ${DATA_DIR}/hmp/callahan.fasta ${DATA_DIR}/hmp/hp_callahan_0.97.tax 0.97

READS=hp:${DATA_DIR}/hmp/Mock1_S1_L001_pfmd.fastq
TAX=callahan:${DATA_DIR}/hmp/hp_callahan_0.97.tax
for R in "${RUNS[@]}"; do
    python -m scripts.analyses.analysis_callahan run ${R} ${READS} ${ANALYSIS_DIR}/tasks/${R}.txt ${OUTPUT_DIR}/hmp_paired/${R} --tax_files ${TAX} --gefast ${GEFAST}
    for F in ${OUTPUT_DIR}/hmp_paired/${R}/*__metrics.csv; do mv ${F} ${OUTPUT_DIR}/hmp_paired/${R}/${R}_${F##*/}; done
done


# single
python -m scripts.analyses.analysis_callahan prepare hmp ${DATA_DIR}/hmp --single --min_size 1 --usearch8 ${USEARCH8_PATH} --usearch10 ${USEARCH10_PATH}
python -m scripts.analyses.analysis_callahan taxonomy ${DATA_DIR}/hmp/Mock1_S1_L001_R1_001_sfd.fastq ${DATA_DIR}/hmp/callahan.fasta ${DATA_DIR}/hmp/hs_callahan_0.97.tax 0.97

READS=hs:${DATA_DIR}/hmp/Mock1_S1_L001_R1_001_sfd.fastq
TAX=callahan:${DATA_DIR}/hmp/hs_callahan_0.97.tax
for R in "${RUNS[@]}"; do
    python -m scripts.analyses.analysis_callahan run ${R} ${READS} ${ANALYSIS_DIR}/tasks/${R}.txt ${OUTPUT_DIR}/hmp_single/${R} --tax_files ${TAX} --gefast ${GEFAST}
    for F in ${OUTPUT_DIR}/hmp_single/${R}/*__metrics.csv; do mv ${F} ${OUTPUT_DIR}/hmp_single/${R}/${R}_${F##*/}; done
done

## Evaluation

**Configuration**

In [2]:
data_sets = ['balanced', 'hmp']
read_types = ['single', 'paired']
ground_truths = ['callahan']

opts = ['as__nf', 'as__f1', 'as__2f', 'as__lsa1', 'as__lsa2', 'as__lsa3', 'as__lsa4', 'as__lsr1', 'as__lsr2', 'as__lsr3', 'as__lsr4', 'as__lss', 
        'as_dada2-c__nf', 'as_dada2-c__f1', 'as_dada2-c__2f', 'as_dada2-c__lsa1', 'as_dada2-c__lsa2', 'as_dada2-c__lsa3', 'as_dada2-c__lsa4', 'as_dada2-c__lsr1', 'as_dada2-c__lsr2', 'as_dada2-c__lsr3', 'as_dada2-c__lsr4', 'as_dada2-c__lss', 
        'dada2-s__nf', 'dada2-s__lsa1', 'dada2-s__lsa2', 'dada2-s__lsa3', 'dada2-s__lsa4', 'dada2-s__lsr1', 'dada2-s__lsr2', 'dada2-s__lsr3', 'dada2-s__lsr4', 'dada2-s__lss', 
        'lev__nf', 'lev__f1', 'lev__2f', 'lev__lsa1', 'lev__lsa2', 'lev__lsa3', 'lev__lsa4', 'lev__lsr1', 'lev__lsr2', 'lev__lsr3', 'lev__lsr4', 'lev__lss', 
        'lev_dada2-c__nf', 'lev_dada2-c__f1', 'lev_dada2-c__2f', 'lev_dada2-c__lsa1', 'lev_dada2-c__lsa2', 'lev_dada2-c__lsa3', 'lev_dada2-c__lsa4', 'lev_dada2-c__lsr1', 'lev_dada2-c__lsr2', 'lev_dada2-c__lsr3', 'lev_dada2-c__lsr4', 'lev_dada2-c__lss']

read_files = {'balanced_single': 'ERR777695_1_sfd.fastq', 'balanced_paired': 'ERR777695_pfmd.fastq',
              'hmp_single': 'Mock1_S1_L001_R1_001_sfd.fastq', 'hmp_paired': 'Mock1_S1_L001_pfmd.fastq'}

data_dir = 'data'
results_dir = 'outputs'
eval_dir = 'evaluation'

### Number of clusters and amplicons

Reads the input files and the cluster outputs for all data sets and compares the number of clusters and amplicons.

In [3]:
# Requires the input and OTU files. Alternatively, the evaluation can use the stored information (see below).
df_columns = ['data_set', 'tool', 'mode', 'refinement', 'threshold', 'num_input_amplicons', 'input_mass', 'num_clusters', 'num_output_amplicons', 'output_mass', 'ds', 'rt']

rows = []

for rt in read_types:
    for ds in data_sets:
        run_name = '%s_%s' % (ds, rt)
        
        seq_file = '%s/%s/%s' % (data_dir, ds, read_files[run_name]) # the input sequences
        num_input_amplicons = 0
        input_mass = 0
        with open(seq_file, 'r') as in_file:
            for record in SeqIO.parse(in_file, 'fastq'):
                num_input_amplicons += 1
                input_mass += int(record.id.split('_')[-1])
        
        for opt in opts:
            otu_files = [f for f in os.listdir('%s/%s/%s/' % (results_dir, run_name, opt)) if f.endswith('_otus.txt')]

            for f in otu_files:
                otu_file = '%s/%s/%s/%s' % (results_dir, run_name, opt, f)
                                        
                num_output_amplicons = 0
                num_clusters = 0
                output_mass = 0
                with open(otu_file, 'r') as in_file:
                    for line in in_file:
                        members = line.strip().split(' ')  # one cluster per line, members separated by spaces
                        num_output_amplicons += len(members)
                        num_clusters += 1
                        output_mass += sum(int(m.split('_')[-1]) for m in members)  # abundance is the suffix after the last '_'
                        
                tool = 'gefast'
                mode = '_'.join(f.split('__')[0].split('_')[1:])
                refinement = f.split('__')[1].split('_')[0]
                threshold = float(f.split('_')[-2])

                rows.append([run_name, tool, mode, refinement, threshold, num_input_amplicons, input_mass, num_clusters, num_output_amplicons, output_mass, ds, rt])
            
df_counts = pd.DataFrame(rows, columns = df_columns)
df_counts.sort_values(by = ['data_set', 'tool', 'mode', 'refinement', 'threshold'], inplace = True)

*Column descriptions:*   
`num_input_amplicons`: The number of entries in the corresponding input file.   
`input_mass`: The sum of the abundances of all entries in the input file.   
`num_clusters`: The number of computed clusters.   
`num_output_amplicons`: The number of amplicons contained in the clusters.   
`output_mass`: The sum of the abundances of all amplicons contained in the clusters.   
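
As a minimal illustration of how the cluster-related columns are derived, the following sketch counts clusters, amplicons and mass for a few made-up cluster lines. It assumes the same conventions as the cell above: one cluster per line, members separated by whitespace, and the abundance stored after the last underscore of each identifier. The function name `cluster_stats` and the identifiers are hypothetical.

```python
def cluster_stats(otu_lines):
    """Return (num_clusters, num_output_amplicons, output_mass) for cluster lines."""
    num_clusters = 0
    num_amplicons = 0
    mass = 0
    for line in otu_lines:
        members = line.split()
        if not members:
            continue  # unlike the notebook cell, this sketch skips empty lines
        num_clusters += 1
        num_amplicons += len(members)
        # The abundance is the integer after the last underscore of the identifier.
        mass += sum(int(m.rsplit('_', 1)[1]) for m in members)
    return num_clusters, num_amplicons, mass


# Hypothetical identifiers of the form <name>_<abundance>.
example = ['a_10 b_2 c_1', 'd_5']
print(cluster_stats(example))  # (2, 4, 18)
```

When a refiner discards amplicons, `num_output_amplicons` and `output_mass` fall below their input counterparts, which is why both pairs of columns are recorded.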

In [4]:
df_counts[['data_set', 'mode', 'refinement', 'threshold', 'num_input_amplicons', 'input_mass', 'num_clusters', 'num_output_amplicons', 'output_mass']]

Unnamed: 0,data_set,mode,refinement,threshold,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass
1005,balanced_paired,as,2f,20.0,21808,467652,3419,21808,467652
1007,balanced_paired,as,2f,40.0,21808,467652,1853,21808,467652
1008,balanced_paired,as,2f,60.0,21808,467652,1368,21808,467652
1009,balanced_paired,as,2f,80.0,21808,467652,1034,21808,467652
1000,balanced_paired,as,2f,100.0,21808,467652,777,21808,467652
1001,balanced_paired,as,2f,120.0,21808,467652,591,21808,467652
1002,balanced_paired,as,2f,140.0,21808,467652,440,21808,467652
1003,balanced_paired,as,2f,160.0,21808,467652,331,21808,467652
1004,balanced_paired,as,2f,180.0,21808,467652,207,21808,467652
1006,balanced_paired,as,2f,200.0,21808,467652,165,21808,467652


In [5]:
df_counts.to_csv('%s/df_counts.csv' % eval_dir, sep = ';', index = False)
#df_counts = pd.read_csv('%s/df_counts.csv' % eval_dir, sep = ';')

### Clustering quality

In [6]:
# Requires the metrics files. Alternatively, the evaluation can use the stored information (see below).
dfs = []
for ds in data_sets:
    for rt in read_types:
        run_name = '%s_%s' % (ds, rt)

        for opt in opts:
            for gt in ground_truths:
                df = pd.read_csv('%s/%s/%s/%s_%s_%s__metrics.csv' % (results_dir, run_name, opt, opt, ds[0] + rt[0], gt), sep = ';')

                df['reads'] = '%s_%s' % (ds, rt)
                df['gt'] = gt
                df['mode'] = ['_'.join(m.split('__')[0].split('_')[1:]) for m in df['task']]
                df['refinement'] = [m.split('__')[1] for m in df['task']]
                df['ds'] = ds
                df['rt'] = rt

                dfs.append(df)
                    
df_quality = pd.concat(dfs, ignore_index = True)
df_quality.rename(columns = {'task': 'run', 'reads': 'data_set'}, inplace = True)
df_quality.sort_values(by = ['data_set', 'gt', 'tool', 'mode', 'refinement', 'threshold'], inplace = True)

*Column descriptions:*   
`precision`: Quantifies the extent to which amplicons in a cluster are also from the same species.   
`recall`: Measures the proportion of amplicons from the same species that are grouped in the same cluster.   
`adjrandindex`: Measures the agreement between the clusters and the taxonomic assignment and corrects for chance.   
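
To make the chance correction of the adjusted Rand index more concrete, here is a small pair-counting sketch using only the standard library. It is not the implementation used by the evaluation scripts. In particular, it treats every amplicon equally, and whether the reported values weight amplicons by abundance is not stated here. The function name `adjusted_rand_index` and the toy labels are assumptions for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index of two labelings of the same items (pair counting)."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the labelings agree trivially

    # Pairs placed together in both labelings, in the truth, and in the prediction.
    both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = same_true * same_pred / total_pairs  # agreement expected by chance
    maximum = (same_true + same_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything together
    return (both - expected) / (maximum - expected)


species = ['A', 'A', 'B', 'B']                      # toy taxonomic assignment
print(adjusted_rand_index(species, [0, 0, 1, 1]))   # clusters match the species
print(adjusted_rand_index(species, [0, 0, 0, 0]))   # one cluster for everything
```

A perfect match yields 1, while lumping everything into one cluster yields 0 in this toy case, even though such a clustering would have perfect recall.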

In [7]:
df_quality[['data_set', 'gt', 'mode', 'refinement', 'threshold', 'precision', 'recall', 'adjrandindex']]

Unnamed: 0,data_set,gt,mode,refinement,threshold,precision,recall,adjrandindex
510,balanced_paired,callahan,as,2f,20.0,0.996647,0.934571,0.933914
511,balanced_paired,callahan,as,2f,40.0,0.975262,0.954036,0.942283
512,balanced_paired,callahan,as,2f,60.0,0.972507,0.95588,0.944137
513,balanced_paired,callahan,as,2f,80.0,0.958458,0.971524,0.950691
514,balanced_paired,callahan,as,2f,100.0,0.956968,0.97125,0.949326
515,balanced_paired,callahan,as,2f,120.0,0.95308,0.977716,0.950336
516,balanced_paired,callahan,as,2f,140.0,0.934381,0.977282,0.927715
517,balanced_paired,callahan,as,2f,160.0,0.931156,0.975001,0.922972
518,balanced_paired,callahan,as,2f,180.0,0.901399,0.97663,0.878877
519,balanced_paired,callahan,as,2f,200.0,0.897903,0.973463,0.868034


In [8]:
df_quality.to_csv('%s/df_quality.csv' % eval_dir, sep = ';', index = False)
#df_quality = pd.read_csv('%s/df_quality.csv' % eval_dir, sep = ';')

Combine counting and quality information:

In [9]:
df_c, df_q = df_counts.copy(), df_quality.copy()
drop_cols = ['join_col'] + ['%s_counts' % s for s in set(df_q.columns) & set(df_c.columns)]
df_c['join_col'] = df_c['data_set'] + df_c['tool'] + df_c['mode'] + df_c['refinement'] + df_c['threshold'].apply(str)
df_q['join_col'] = df_q['data_set'] + df_q['tool'] + df_q['mode'] + df_q['refinement'] + df_q['threshold'].apply(str)
df_joined = df_q.join(df_c.set_index('join_col'), on = 'join_col', rsuffix = '_counts').drop(drop_cols, axis = 1)

In [10]:
df_joined.to_csv('%s/df_joined.csv' % eval_dir, sep = ';', index = False)
#df_joined = pd.read_csv('%s/df_joined.csv' % eval_dir, sep = ';')

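Determine the maximum, average and N-best average clustering quality (for N = 5) of each run over all of its thresholds. The best thresholds are those with the highest adjusted Rand index.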
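
The three summaries can be sketched for a single run as follows. The input values are the ten adjusted Rand index values listed above for `as` / `2f` on `balanced_paired`. The helper `summarise` is hypothetical and covers only the adjusted Rand index, whereas the notebook cell also carries the other columns of the selected rows along.

```python
from statistics import mean


def summarise(scores, n=5):
    """Maximum, mean and mean of the n largest values of a list of scores."""
    ranked = sorted(scores, reverse=True)
    return {
        'max': ranked[0],
        'mean': mean(scores),
        'nbest': mean(ranked[:n]),  # uses fewer values if len(scores) < n
    }


# Adjusted Rand index per threshold (20, 40, ..., 200), copied from the table above.
ari = [0.933914, 0.942283, 0.944137, 0.950691, 0.949326,
       0.950336, 0.927715, 0.922972, 0.878877, 0.868034]

print(summarise(ari))
```

The N-best average sits between the other two: it is less sensitive to a single lucky threshold than the maximum, and less affected by clearly unsuitable thresholds than the plain mean.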

In [11]:
df_columns = ['data_set', 'gt', 'tool', 'mode', 'refinement', 'precision', 'recall', 'adjrandindex', 'num_input_amplicons', 'input_mass', 'num_clusters', 'num_output_amplicons', 'output_mass', 'ds', 'rt']

max_rows = []
mean_rows = []
nbest_rows = []
n = 5

for (d, g, t, m, f, ds, rt), grp in df_joined.groupby(by = ['data_set', 'gt', 'tool', 'mode', 'refinement', 'ds', 'rt']):
    best = grp.nlargest(1, 'adjrandindex')
    max_rows.append([d, g, t, m, f, best['precision'].values[0], best['recall'].values[0], best['adjrandindex'].values[0], best['num_input_amplicons'].values[0], best['input_mass'].values[0], best['num_clusters'].values[0], best['num_output_amplicons'].values[0], best['output_mass'].values[0], ds, rt])
    mean_rows.append([d, g, t, m, f, grp['precision'].mean(), grp['recall'].mean(), grp['adjrandindex'].mean(), grp['num_input_amplicons'].mean(), grp['input_mass'].mean(), grp['num_clusters'].mean(), grp['num_output_amplicons'].mean(), grp['output_mass'].mean(), ds, rt])
    nbest = grp.nlargest(n, 'adjrandindex')
    nbest_rows.append([d, g, t, m, f, nbest['precision'].mean(), nbest['recall'].mean(), nbest['adjrandindex'].mean(), nbest['num_input_amplicons'].mean(), nbest['input_mass'].mean(), nbest['num_clusters'].mean(), nbest['num_output_amplicons'].mean(), nbest['output_mass'].mean(), ds, rt])
    
df_joined_max = pd.DataFrame(max_rows, columns = df_columns)
df_joined_mean = pd.DataFrame(mean_rows, columns = df_columns)
df_joined_nbest = pd.DataFrame(nbest_rows, columns = df_columns)

In [12]:
df_joined_max.to_csv('%s/df_joined_max.csv' % eval_dir, sep = ';', index = False)
df_joined_mean.to_csv('%s/df_joined_mean.csv' % eval_dir, sep = ';', index = False)
df_joined_nbest.to_csv('%s/df_joined_nbest.csv' % eval_dir, sep = ';', index = False)
#df_joined_max = pd.read_csv('%s/df_joined_max.csv' % eval_dir, sep = ';')
#df_joined_mean = pd.read_csv('%s/df_joined_mean.csv' % eval_dir, sep = ';')
#df_joined_nbest = pd.read_csv('%s/df_joined_nbest.csv' % eval_dir, sep = ';')

In [13]:
df_max = df_joined_max.loc[df_joined_max['gt'] == 'callahan']
df_mean = df_joined_mean.loc[df_joined_mean['gt'] == 'callahan']
df_nbest = df_joined_nbest.loc[df_joined_nbest['gt'] == 'callahan']

For the chosen ground truth, average the maximum, average and N-best average values per data set (e.g. balanced) and read type (e.g. paired).   
This has no effect here because there is only one set of reads per combination of data set and read type.

In [14]:
df_columns = ['data_set', 'gt', 'tool', 'mode', 'refinement', 'precision', 'recall', 'adjrandindex', 'num_input_amplicons', 'input_mass', 'num_clusters', 'num_output_amplicons', 'output_mass', 'ds', 'rt']

def average_complexity(df):
    rows = []
    for (gt, ds, rt, tool, mode, f), grp in df.groupby(by = ['gt', 'ds', 'rt', 'tool', 'mode', 'refinement']):
        rows.append(['%s_%s' % (ds, rt), gt, tool, mode, f, grp['precision'].mean(), grp['recall'].mean(), grp['adjrandindex'].mean(), grp['num_input_amplicons'].mean(), grp['input_mass'].mean(), grp['num_clusters'].mean(), grp['num_output_amplicons'].mean(), grp['output_mass'].mean(), ds, rt])
    return pd.DataFrame(rows, columns = df_columns)

In [15]:
df_joined_max_avg = average_complexity(df_max)
df_joined_mean_avg = average_complexity(df_mean)
df_joined_nbest_avg = average_complexity(df_nbest)

In [16]:
df_joined_max_avg.to_csv('%s/df_joined_max_avg.csv' % eval_dir, sep = ';', index = False)
df_joined_mean_avg.to_csv('%s/df_joined_mean_avg.csv' % eval_dir, sep = ';', index = False)
df_joined_nbest_avg.to_csv('%s/df_joined_nbest_avg.csv' % eval_dir, sep = ';', index = False)
#df_joined_max_avg = pd.read_csv('%s/df_joined_max_avg.csv' % eval_dir, sep = ';')
#df_joined_mean_avg = pd.read_csv('%s/df_joined_mean_avg.csv' % eval_dir, sep = ';')
#df_joined_nbest_avg = pd.read_csv('%s/df_joined_nbest_avg.csv' % eval_dir, sep = ';')

**Maximum clustering quality**

Rank by adjusted Rand index (per data set):

In [17]:
for (d, t), grp in df_joined_max_avg.groupby(by = ['data_set', 'tool']):
    print('Data set: %s / Tool: %s' % (d, t))
    display(grp.sort_values(by = 'adjrandindex', ascending = False))

Data set: balanced_paired / Tool: gefast


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
37,balanced_paired,callahan,gefast,lev,lsa2,0.972045,0.971574,0.960314,21808.0,467652.0,450.0,21264.0,466966.0,balanced,paired
41,balanced_paired,callahan,gefast,lev,lsr2,0.971893,0.971161,0.959816,21808.0,467652.0,450.0,21483.0,467185.0,balanced,paired
35,balanced_paired,callahan,gefast,lev,f1,0.972586,0.970153,0.959028,21808.0,467652.0,1504.0,21808.0,467652.0,balanced,paired
34,balanced_paired,callahan,gefast,lev,2f,0.972405,0.970288,0.959009,21808.0,467652.0,1303.0,21808.0,467652.0,balanced,paired
45,balanced_paired,callahan,gefast,lev,nf,0.972753,0.969873,0.958953,21808.0,467652.0,1735.0,21808.0,467652.0,balanced,paired
38,balanced_paired,callahan,gefast,lev,lsa3,0.971936,0.970149,0.958867,21808.0,467652.0,455.0,21808.0,467652.0,balanced,paired
39,balanced_paired,callahan,gefast,lev,lsa4,0.971992,0.970149,0.958847,21808.0,467652.0,596.0,21808.0,467652.0,balanced,paired
36,balanced_paired,callahan,gefast,lev,lsa1,0.972067,0.970149,0.958847,21808.0,467652.0,864.0,21808.0,467652.0,balanced,paired
42,balanced_paired,callahan,gefast,lev,lsr3,0.971817,0.970192,0.958822,21808.0,467652.0,455.0,21808.0,467652.0,balanced,paired
43,balanced_paired,callahan,gefast,lev,lsr4,0.971889,0.970192,0.958813,21808.0,467652.0,596.0,21808.0,467652.0,balanced,paired


Data set: balanced_single / Tool: gefast


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
61,balanced_single,callahan,gefast,as,lsa2,0.990498,0.952485,0.948608,33523.0,558059.0,920.0,32565.0,556838.0,balanced,single
95,balanced_single,callahan,gefast,lev,lsa2,0.990498,0.952485,0.948608,33523.0,558059.0,920.0,32565.0,556838.0,balanced,single
99,balanced_single,callahan,gefast,lev,lsr2,0.990332,0.952161,0.948217,33523.0,558059.0,920.0,32802.0,557075.0,balanced,single
65,balanced_single,callahan,gefast,as,lsr2,0.990332,0.952161,0.948217,33523.0,558059.0,920.0,32802.0,557075.0,balanced,single
100,balanced_single,callahan,gefast,lev,lsr3,0.989985,0.950482,0.946461,33523.0,558059.0,921.0,33523.0,558059.0,balanced,single
66,balanced_single,callahan,gefast,as,lsr3,0.989985,0.950482,0.946461,33523.0,558059.0,921.0,33523.0,558059.0,balanced,single
67,balanced_single,callahan,gefast,as,lsr4,0.990223,0.950482,0.946446,33523.0,558059.0,1183.0,33523.0,558059.0,balanced,single
101,balanced_single,callahan,gefast,lev,lsr4,0.990223,0.950482,0.946446,33523.0,558059.0,1183.0,33523.0,558059.0,balanced,single
102,balanced_single,callahan,gefast,lev,lss,0.990347,0.950482,0.946445,33523.0,558059.0,1619.0,33523.0,558059.0,balanced,single
68,balanced_single,callahan,gefast,as,lss,0.990347,0.950482,0.946445,33523.0,558059.0,1619.0,33523.0,558059.0,balanced,single


Data set: hmp_paired / Tool: gefast


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
119,hmp_paired,callahan,gefast,as,lsa2,0.994527,0.732175,0.751635,19882.0,208157.0,657.0,19217.0,207363.0,hmp,paired
153,hmp_paired,callahan,gefast,lev,lsa2,0.994521,0.732192,0.751625,19882.0,208157.0,656.0,19206.0,207354.0,hmp,paired
123,hmp_paired,callahan,gefast,as,lsr2,0.994323,0.731033,0.750888,19882.0,208157.0,657.0,19545.0,207691.0,hmp,paired
157,hmp_paired,callahan,gefast,lev,lsr2,0.994314,0.731013,0.750853,19882.0,208157.0,656.0,19546.0,207694.0,hmp,paired
120,hmp_paired,callahan,gefast,as,lsa3,0.994235,0.729382,0.74993,19882.0,208157.0,659.0,19882.0,208157.0,hmp,paired
116,hmp_paired,callahan,gefast,as,2f,0.995354,0.729353,0.749905,19882.0,208157.0,1856.0,19882.0,208157.0,hmp,paired
154,hmp_paired,callahan,gefast,lev,lsa3,0.994226,0.729368,0.749895,19882.0,208157.0,663.0,19882.0,208157.0,hmp,paired
150,hmp_paired,callahan,gefast,lev,2f,0.995321,0.729358,0.749878,19882.0,208157.0,1836.0,19882.0,208157.0,hmp,paired
121,hmp_paired,callahan,gefast,as,lsa4,0.994259,0.729382,0.749865,19882.0,208157.0,786.0,19882.0,208157.0,hmp,paired
124,hmp_paired,callahan,gefast,as,lsr3,0.994201,0.729397,0.749863,19882.0,208157.0,659.0,19882.0,208157.0,hmp,paired


Data set: hmp_single / Tool: gefast


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
211,hmp_single,callahan,gefast,lev,lsa2,0.987062,0.934598,0.941511,73071.0,449409.0,919.0,72074.0,448214.0,hmp,single
177,hmp_single,callahan,gefast,as,lsa2,0.987062,0.934598,0.941511,73071.0,449409.0,919.0,72074.0,448214.0,hmp,single
215,hmp_single,callahan,gefast,lev,lsr2,0.986967,0.934139,0.941209,73071.0,449409.0,919.0,72347.0,448488.0,hmp,single
181,hmp_single,callahan,gefast,as,lsr2,0.986967,0.934139,0.941209,73071.0,449409.0,919.0,72347.0,448488.0,hmp,single
216,hmp_single,callahan,gefast,lev,lsr3,0.986696,0.932224,0.939958,73071.0,449409.0,920.0,73071.0,449409.0,hmp,single
182,hmp_single,callahan,gefast,as,lsr3,0.986696,0.932224,0.939958,73071.0,449409.0,920.0,73071.0,449409.0,hmp,single
183,hmp_single,callahan,gefast,as,lsr4,0.9868,0.932224,0.939941,73071.0,449409.0,1113.0,73071.0,449409.0,hmp,single
217,hmp_single,callahan,gefast,lev,lsr4,0.9868,0.932224,0.939941,73071.0,449409.0,1113.0,73071.0,449409.0,hmp,single
214,hmp_single,callahan,gefast,lev,lsr1,0.986994,0.932224,0.939941,73071.0,449409.0,1643.0,73071.0,449409.0,hmp,single
180,hmp_single,callahan,gefast,as,lsr1,0.986994,0.932224,0.939941,73071.0,449409.0,1643.0,73071.0,449409.0,hmp,single


Average the maximum values over all data sets and sort by adjusted Rand index:

In [18]:
rows = []
for (t, m, f), grp in df_joined_max_avg.groupby(by = ['tool', 'mode', 'refinement']):
    rows.append([t, m, f, grp['precision'].mean(), grp['recall'].mean(), grp['adjrandindex'].mean()])
pd.DataFrame(rows, columns = ['tool', 'mode', 'refinement', 'precision', 'recall', 'adjrandindex']).sort_values(by = 'adjrandindex', ascending = False)

Unnamed: 0,tool,mode,refinement,precision,recall,adjrandindex
37,gefast,lev,lsa2,0.986032,0.897712,0.900514
41,gefast,lev,lsr2,0.985877,0.897118,0.900024
42,gefast,lev,lsr3,0.985675,0.895571,0.898768
38,gefast,lev,lsa3,0.985706,0.895508,0.898766
43,gefast,lev,lsr4,0.985791,0.895571,0.898752
40,gefast,lev,lsr1,0.985898,0.895571,0.898752
44,gefast,lev,lss,0.985898,0.895571,0.898752
39,gefast,lev,lsa4,0.985824,0.895508,0.898734
36,gefast,lev,lsa1,0.98604,0.895508,0.898734
3,gefast,as,lsa2,0.982598,0.897969,0.8984


**Average clustering quality**

Rank by adjusted Rand index (per data set):

In [19]:
for (d, t), grp in df_joined_mean_avg.groupby(by = ['data_set', 'tool']):
    print('Data set: %s / Tool: %s' % (d, t))
    display(grp.sort_values(by = 'adjrandindex', ascending = False))

Data set: balanced_paired / Tool: gefast


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
37,balanced_paired,callahan,gefast,lev,lsa2,0.946551,0.972325,0.931248,21808.0,467652.0,371.5,21387.8,467115.2,balanced,paired
41,balanced_paired,callahan,gefast,lev,lsr2,0.946432,0.972021,0.930879,21808.0,467652.0,371.5,21548.7,467276.1,balanced,paired
45,balanced_paired,callahan,gefast,lev,nf,0.947145,0.970955,0.930199,21808.0,467652.0,1402.8,21808.0,467652.0,balanced,paired
35,balanced_paired,callahan,gefast,lev,f1,0.94694,0.971167,0.930178,21808.0,467652.0,1185.8,21808.0,467652.0,balanced,paired
38,balanced_paired,callahan,gefast,lev,lsa3,0.946446,0.971216,0.930148,21808.0,467652.0,376.7,21808.0,467652.0,balanced,paired
39,balanced_paired,callahan,gefast,lev,lsa4,0.946519,0.971216,0.930138,21808.0,467652.0,492.3,21808.0,467652.0,balanced,paired
36,balanced_paired,callahan,gefast,lev,lsa1,0.946585,0.971215,0.930138,21808.0,467652.0,697.7,21808.0,467652.0,balanced,paired
42,balanced_paired,callahan,gefast,lev,lsr3,0.946352,0.971245,0.930101,21808.0,467652.0,376.7,21808.0,467652.0,balanced,paired
44,balanced_paired,callahan,gefast,lev,lss,0.946461,0.971245,0.930098,21808.0,467652.0,619.2,21808.0,467652.0,balanced,paired
43,balanced_paired,callahan,gefast,lev,lsr4,0.946424,0.971245,0.930097,21808.0,467652.0,492.3,21808.0,467652.0,balanced,paired


Data set: balanced_single / Tool: gefast


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
95,balanced_single,callahan,gefast,lev,lsa2,0.940833,0.965155,0.916605,33523.0,558059.0,377.6,33089.5,557527.5,balanced,single
93,balanced_single,callahan,gefast,lev,f1,0.942218,0.963842,0.916308,33523.0,558059.0,1978.7,33523.0,558059.0,balanced,single
99,balanced_single,callahan,gefast,lev,lsr2,0.940704,0.964907,0.916258,33523.0,558059.0,377.6,33244.4,557682.4,balanced,single
103,balanced_single,callahan,gefast,lev,nf,0.942559,0.963149,0.915925,33523.0,558059.0,2586.9,33523.0,558059.0,balanced,single
92,balanced_single,callahan,gefast,lev,2f,0.941508,0.963984,0.915827,33523.0,558059.0,1481.0,33523.0,558059.0,balanced,single
96,balanced_single,callahan,gefast,lev,lsa3,0.940789,0.96424,0.915656,33523.0,558059.0,378.6,33523.0,558059.0,balanced,single
97,balanced_single,callahan,gefast,lev,lsa4,0.940828,0.96424,0.915645,33523.0,558059.0,475.6,33523.0,558059.0,balanced,single
94,balanced_single,callahan,gefast,lev,lsa1,0.940867,0.96424,0.915644,33523.0,558059.0,721.4,33523.0,558059.0,balanced,single
100,balanced_single,callahan,gefast,lev,lsr3,0.940668,0.964258,0.915584,33523.0,558059.0,378.6,33523.0,558059.0,balanced,single
102,balanced_single,callahan,gefast,lev,lss,0.940733,0.964258,0.915579,33523.0,558059.0,646.0,33523.0,558059.0,balanced,single


Data set: hmp_paired / Tool: gefast


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
145,hmp_paired,callahan,gefast,dada2-s,lsr2,0.986338,0.754205,0.741601,19882.0,208157.0,913.0,19703.0,207803.0,hmp,paired
141,hmp_paired,callahan,gefast,dada2-s,lsa2,0.986338,0.754205,0.741601,19882.0,208157.0,913.0,19703.0,207803.0,hmp,paired
142,hmp_paired,callahan,gefast,dada2-s,lsa3,0.986159,0.752922,0.740803,19882.0,208157.0,914.0,19882.0,208157.0,hmp,paired
146,hmp_paired,callahan,gefast,dada2-s,lsr3,0.986159,0.752922,0.740803,19882.0,208157.0,914.0,19882.0,208157.0,hmp,paired
143,hmp_paired,callahan,gefast,dada2-s,lsa4,0.986352,0.752922,0.740782,19882.0,208157.0,1079.0,19882.0,208157.0,hmp,paired
147,hmp_paired,callahan,gefast,dada2-s,lsr4,0.986352,0.752922,0.740782,19882.0,208157.0,1079.0,19882.0,208157.0,hmp,paired
140,hmp_paired,callahan,gefast,dada2-s,lsa1,0.986361,0.752922,0.740782,19882.0,208157.0,1092.0,19882.0,208157.0,hmp,paired
144,hmp_paired,callahan,gefast,dada2-s,lsr1,0.986361,0.752922,0.740782,19882.0,208157.0,1092.0,19882.0,208157.0,hmp,paired
148,hmp_paired,callahan,gefast,dada2-s,lss,0.986361,0.752922,0.740782,19882.0,208157.0,1092.0,19882.0,208157.0,hmp,paired
153,hmp_paired,callahan,gefast,lev,lsa2,0.951734,0.727037,0.69977,19882.0,208157.0,491.9,19491.6,207689.7,hmp,paired


Data set: hmp_single / Tool: gefast


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
209,hmp_single,callahan,gefast,lev,f1,0.900957,0.899345,0.750519,73071.0,449409.0,4298.5,73071.0,449409.0,hmp,single
211,hmp_single,callahan,gefast,lev,lsa2,0.896731,0.902435,0.74995,73071.0,449409.0,459.3,72562.9,448831.2,hmp,single
215,hmp_single,callahan,gefast,lev,lsr2,0.896677,0.902064,0.749741,73071.0,449409.0,459.3,72758.8,449027.2,hmp,single
208,hmp_single,callahan,gefast,lev,2f,0.899183,0.900119,0.749594,73071.0,449409.0,3116.4,73071.0,449409.0,hmp,single
212,hmp_single,callahan,gefast,lev,lsa3,0.89676,0.901267,0.749398,73071.0,449409.0,460.3,73071.0,449409.0,hmp,single
213,hmp_single,callahan,gefast,lev,lsa4,0.896775,0.901267,0.749387,73071.0,449409.0,528.8,73071.0,449409.0,hmp,single
210,hmp_single,callahan,gefast,lev,lsa1,0.89683,0.901267,0.749387,73071.0,449409.0,848.6,73071.0,449409.0,hmp,single
216,hmp_single,callahan,gefast,lev,lsr3,0.896697,0.901291,0.749368,73071.0,449409.0,460.3,73071.0,449409.0,hmp,single
218,hmp_single,callahan,gefast,lev,lss,0.896744,0.901291,0.749363,73071.0,449409.0,752.6,73071.0,449409.0,hmp,single
217,hmp_single,callahan,gefast,lev,lsr4,0.896711,0.901291,0.749362,73071.0,449409.0,528.6,73071.0,449409.0,hmp,single


Average the mean values over all data sets and sort by adjusted Rand index:

In [20]:
# Average each metric over the data sets for every (tool, mode, refinement) combination
rows = []
for (t, m, f), grp in df_joined_mean_avg.groupby(by = ['tool', 'mode', 'refinement']):
    rows.append([t, m, f, grp['precision'].mean(), grp['recall'].mean(), grp['adjrandindex'].mean()])
# Show the combinations with the highest adjusted Rand index first
pd.DataFrame(rows, columns = ['tool', 'mode', 'refinement', 'precision', 'recall', 'adjrandindex']).sort_values(by = 'adjrandindex', ascending = False)

Unnamed: 0,tool,mode,refinement,precision,recall,adjrandindex
37,gefast,lev,lsa2,0.933962,0.891738,0.824393
41,gefast,lev,lsr2,0.933855,0.891349,0.824064
35,gefast,lev,f1,0.935601,0.889922,0.823938
34,gefast,lev,2f,0.934835,0.890172,0.823531
38,gefast,lev,lsa3,0.933918,0.890531,0.823508
39,gefast,lev,lsa4,0.933957,0.890531,0.823493
36,gefast,lev,lsa1,0.934019,0.890531,0.823492
42,gefast,lev,lsr3,0.933825,0.890553,0.823462
44,gefast,lev,lss,0.933897,0.890553,0.823456
43,gefast,lev,lsr4,0.933864,0.890553,0.823455
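
The rankings in this section are ordered by the adjusted Rand index. As a reminder of what that number measures, the following is a minimal sketch of the standard pair-counting definition on plain label lists. The function name `adjusted_rand_index` and the toy labels are illustrative assumptions; the evaluation scripts behind the tables above may compute the index differently (for example, by taking amplicon abundances into account).

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting adjusted Rand index of two labelings (illustrative sketch)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Pairs placed together in both partitions, in the reference, and in the prediction
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    # Agreement expected by chance and the maximum attainable agreement
    expected = together_true * together_pred / comb(n, 2)
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions consist of a single cluster
    return (together_both - expected) / (maximum - expected)


# Hypothetical toy example: the second reference group is split in the prediction
print(round(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2]), 3))  # 0.571
```

A value of 1 means that both partitions group the items identically, while values near 0 indicate agreement no better than chance.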


**N-best average clustering quality**

Rank by adjusted Rand index (per data set):

In [21]:
# One table per (data set, tool), best adjusted Rand index first
for (d, t), grp in df_joined_nbest_avg.groupby(by = ['data_set', 'tool']):
    print('Data set: %s / Tool: %s' % (d, t))
    display(grp.sort_values(by = 'adjrandindex', ascending = False))

Data set: balanced_paired / Tool: gefast


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
37,balanced_paired,callahan,gefast,lev,lsa2,0.966947,0.971666,0.954973,21808.0,467652.0,462.0,21276.8,466976.4,balanced,paired
41,balanced_paired,callahan,gefast,lev,lsr2,0.966812,0.971294,0.954524,21808.0,467652.0,462.0,21473.8,467173.4,balanced,paired
35,balanced_paired,callahan,gefast,lev,f1,0.967464,0.970192,0.953643,21808.0,467652.0,1520.4,21808.0,467652.0,balanced,paired
38,balanced_paired,callahan,gefast,lev,lsa3,0.966776,0.970271,0.953564,21808.0,467652.0,467.4,21808.0,467652.0,balanced,paired
45,balanced_paired,callahan,gefast,lev,nf,0.967655,0.969868,0.953554,21808.0,467652.0,1803.8,21808.0,467652.0,balanced,paired
39,balanced_paired,callahan,gefast,lev,lsa4,0.96688,0.970271,0.953552,21808.0,467652.0,610.8,21808.0,467652.0,balanced,paired
36,balanced_paired,callahan,gefast,lev,lsa1,0.966974,0.970269,0.953551,21808.0,467652.0,876.6,21808.0,467652.0,balanced,paired
42,balanced_paired,callahan,gefast,lev,lsr3,0.966678,0.970308,0.953517,21808.0,467652.0,467.4,21808.0,467652.0,balanced,paired
44,balanced_paired,callahan,gefast,lev,lss,0.966837,0.970306,0.953512,21808.0,467652.0,780.6,21808.0,467652.0,balanced,paired
43,balanced_paired,callahan,gefast,lev,lsr4,0.966783,0.970308,0.953511,21808.0,467652.0,610.8,21808.0,467652.0,balanced,paired


Data set: balanced_single / Tool: gefast


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
95,balanced_single,callahan,gefast,lev,lsa2,0.96454,0.961127,0.940269,33523.0,558059.0,546.2,32924.8,557315.0,balanced,single
99,balanced_single,callahan,gefast,lev,lsr2,0.964385,0.960815,0.939844,33523.0,558059.0,546.2,33125.6,557515.8,balanced,single
93,balanced_single,callahan,gefast,lev,f1,0.966394,0.959105,0.939624,33523.0,558059.0,2881.6,33523.0,558059.0,balanced,single
92,balanced_single,callahan,gefast,lev,2f,0.965798,0.959347,0.939299,33523.0,558059.0,2371.4,33523.0,558059.0,balanced,single
96,balanced_single,callahan,gefast,lev,lsa3,0.96442,0.959849,0.938924,33523.0,558059.0,547.2,33523.0,558059.0,balanced,single
97,balanced_single,callahan,gefast,lev,lsa4,0.964494,0.959849,0.938908,33523.0,558059.0,692.0,33523.0,558059.0,balanced,single
94,balanced_single,callahan,gefast,lev,lsa1,0.964567,0.959849,0.938907,33523.0,558059.0,1027.2,33523.0,558059.0,balanced,single
100,balanced_single,callahan,gefast,lev,lsr3,0.964293,0.959883,0.938861,33523.0,558059.0,547.2,33523.0,558059.0,balanced,single
102,balanced_single,callahan,gefast,lev,lss,0.964413,0.959883,0.938852,33523.0,558059.0,929.4,33523.0,558059.0,balanced,single
101,balanced_single,callahan,gefast,lev,lsr4,0.964366,0.959883,0.938852,33523.0,558059.0,692.0,33523.0,558059.0,balanced,single


Data set: hmp_paired / Tool: gefast


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
145,hmp_paired,callahan,gefast,dada2-s,lsr2,0.986338,0.754205,0.741601,19882.0,208157.0,913.0,19703.0,207803.0,hmp,paired
141,hmp_paired,callahan,gefast,dada2-s,lsa2,0.986338,0.754205,0.741601,19882.0,208157.0,913.0,19703.0,207803.0,hmp,paired
146,hmp_paired,callahan,gefast,dada2-s,lsr3,0.986159,0.752922,0.740803,19882.0,208157.0,914.0,19882.0,208157.0,hmp,paired
142,hmp_paired,callahan,gefast,dada2-s,lsa3,0.986159,0.752922,0.740803,19882.0,208157.0,914.0,19882.0,208157.0,hmp,paired
143,hmp_paired,callahan,gefast,dada2-s,lsa4,0.986352,0.752922,0.740782,19882.0,208157.0,1079.0,19882.0,208157.0,hmp,paired
147,hmp_paired,callahan,gefast,dada2-s,lsr4,0.986352,0.752922,0.740782,19882.0,208157.0,1079.0,19882.0,208157.0,hmp,paired
140,hmp_paired,callahan,gefast,dada2-s,lsa1,0.986361,0.752922,0.740782,19882.0,208157.0,1092.0,19882.0,208157.0,hmp,paired
144,hmp_paired,callahan,gefast,dada2-s,lsr1,0.986361,0.752922,0.740782,19882.0,208157.0,1092.0,19882.0,208157.0,hmp,paired
148,hmp_paired,callahan,gefast,dada2-s,lss,0.986361,0.752922,0.740782,19882.0,208157.0,1092.0,19882.0,208157.0,hmp,paired
153,hmp_paired,callahan,gefast,lev,lsa2,0.981075,0.728161,0.731974,19882.0,208157.0,610.0,19328.0,207494.6,hmp,paired


Data set: hmp_single / Tool: gefast


Unnamed: 0,data_set,gt,tool,mode,refinement,precision,recall,adjrandindex,num_input_amplicons,input_mass,num_clusters,num_output_amplicons,output_mass,ds,rt
209,hmp_single,callahan,gefast,lev,f1,0.949753,0.908735,0.837357,73071.0,449409.0,6898.8,73071.0,449409.0,hmp,single
175,hmp_single,callahan,gefast,as,f1,0.950014,0.908091,0.83705,73071.0,449409.0,7081.4,73071.0,449409.0,hmp,single
177,hmp_single,callahan,gefast,as,lsa2,0.943352,0.913725,0.83695,73071.0,449409.0,639.4,72384.6,448614.6,hmp,single
211,hmp_single,callahan,gefast,lev,lsa2,0.943099,0.914073,0.836905,73071.0,449409.0,602.4,72405.4,448636.0,hmp,single
208,hmp_single,callahan,gefast,lev,2f,0.947777,0.91001,0.836687,73071.0,449409.0,5417.8,73071.0,449409.0,hmp,single
181,hmp_single,callahan,gefast,as,lsr2,0.943276,0.913243,0.836659,73071.0,449409.0,639.4,72654.4,448884.6,hmp,single
215,hmp_single,callahan,gefast,lev,lsr2,0.943025,0.913607,0.836616,73071.0,449409.0,602.4,72654.0,448884.8,hmp,single
174,hmp_single,callahan,gefast,as,2f,0.947978,0.909492,0.83645,73071.0,449409.0,5515.4,73071.0,449409.0,hmp,single
182,hmp_single,callahan,gefast,as,lsr3,0.943234,0.912172,0.836073,73071.0,449409.0,640.4,73071.0,449409.0,hmp,single
184,hmp_single,callahan,gefast,as,lss,0.943332,0.912172,0.836065,73071.0,449409.0,1036.2,73071.0,449409.0,hmp,single


Average the N-best values over all data sets and sort by adjusted Rand index:

In [22]:
# Average each N-best metric over the data sets for every (tool, mode, refinement) combination
rows = []
for (t, m, f), grp in df_joined_nbest_avg.groupby(by = ['tool', 'mode', 'refinement']):
    rows.append([t, m, f, grp['precision'].mean(), grp['recall'].mean(), grp['adjrandindex'].mean()])
# Show the combinations with the highest adjusted Rand index first
pd.DataFrame(rows, columns = ['tool', 'mode', 'refinement', 'precision', 'recall', 'adjrandindex']).sort_values(by = 'adjrandindex', ascending = False)

Unnamed: 0,tool,mode,refinement,precision,recall,adjrandindex
37,gefast,lev,lsa2,0.963915,0.893757,0.86603
41,gefast,lev,lsr2,0.963781,0.893245,0.865595
35,gefast,lev,f1,0.966365,0.890936,0.865258
34,gefast,lev,2f,0.965608,0.891339,0.864978
38,gefast,lev,lsa3,0.963777,0.892114,0.86478
39,gefast,lev,lsa4,0.963844,0.892114,0.864756
36,gefast,lev,lsa1,0.963955,0.892113,0.864755
42,gefast,lev,lsr3,0.963691,0.892148,0.864735
44,gefast,lev,lss,0.963815,0.892147,0.864724
43,gefast,lev,lsr4,0.963757,0.892148,0.864724
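
As a sanity check on the averaging step, the adjusted Rand index reported for `gefast` / `lev` / `lsa2` in the table above can be reproduced by hand from the four per-data-set N-best tables. The sketch below copies the four values from those tables; the variable names are illustrative and not part of the notebook.

```python
import statistics as st

# Adjusted Rand index of gefast / lev / lsa2, copied from the per-data-set N-best tables
ari_lev_lsa2 = {
    "balanced_paired": 0.954973,
    "balanced_single": 0.940269,
    "hmp_paired": 0.731974,
    "hmp_single": 0.836905,
}

# Plain arithmetic mean: every data set contributes with the same weight
avg = st.mean(list(ari_lev_lsa2.values()))
print(round(avg, 6))  # 0.86603
```

Because the mean is unweighted, each data set influences the final ranking equally, regardless of how many amplicons it contains.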
