### Notebook to analyse the efficiency of minimap2 mapping against a mock community

Starting points from Tavish  
reference_dataframe at '/media/MassStorage/tmp/TE/honours/analysis/Stats/reference_dataframe.csv'  
custom_database at '/media/MassStorage/tmp/TE/honours/database/custom_database_labelled.fasta'  
taxonomy_file at '/media/MassStorage/tmp/TE/honours/analysis/Stats/taxonomy_file.csv'

#### workflow

* Two databases
* Subsample 15000 reads for each mock community species and save them.
* Map the reads against both databases with minimap2 and save the output in PAF format.
* Get the best hit per species (see what this means while looking at the data).
* Add the full taxonomy to each best match using the taxonomy file.
* Summarize the data at different taxonomic ranks for each species.
* Pull this all together somehow (summary across all the samples? focus on species of interest, e.g. those deleted from the analysis?)

#### requirements

* BBMap (conda install bbmap https://anaconda.org/bioconda/bbmap)  
* minimap2 (conda install minimap2 https://anaconda.org/bioconda/minimap2)

#### fixes on the command line

* fixed the taxonomy file to fit the QIIME format

cat /media/MassStorage/tmp/TE/honours/analysis/Stats/taxonomy_file.csv | sed 's/,/\t/' > /media/MassStorage/tmp/TE/honours/analysis/Stats/taxonomy_file_v2.csv
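
For reference, the same conversion can be sketched in Python. This assumes, as the `sed` call does when run with GNU sed, that only the first comma on each line separates the ID from the taxonomy string. The helper name `first_comma_to_tab` and the example line are made up for illustration.

```python
def first_comma_to_tab(line):
    """Replace only the first comma in a line with a tab, like sed 's/,/\\t/'."""
    return line.replace(',', '\t', 1)

# Hypothetical input line: an ID, a comma, then a taxonomy string.
example = "sample01,k__Fungi;p__Ascomycota,extra"
converted = first_comma_to_tab(example)
print(repr(converted))
```

Any later commas are left untouched, which matters if the taxonomy string itself contains commas.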

#### fix the taxonomy_file_v2.csv further to reflect QIIME style

* This requires the species name to be written as genus_species and not just species.
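
A minimal sketch of the intended change on a single, made-up taxonomy line. The ID `id01` and the helper name `add_genus_to_species` are illustrative and not part of the notebook.

```python
import re

def add_genus_to_species(line):
    """Turn '...g__Genus;s__species' into '...g__Genus;s__Genus_species'."""
    first_half, second_half = line.split('s__', 1)
    match = re.search(r'g__(\w+);', first_half)
    if match is None:
        # No genus field found: leave the line unchanged.
        return line
    return f"{first_half}s__{match.group(1)}_{second_half}"

example = "id01\tk__Fungi;p__Ascomycota;g__Candida;s__albicans"
fixed = add_genus_to_species(example)
print(fixed)
```

Unlike the notebook cell, this sketch returns the line unchanged when no genus field is present instead of raising an error.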

In [None]:
import re
old_taxonomy_file_fn = '/media/MassStorage/tmp/TE/honours/analysis/Stats/taxonomy_file_v2.csv'
new_taxonomy_file_fn = '/media/MassStorage/tmp/TE/honours/analysis/Stats/taxonomy_file_v3.csv'
# compile the genus pattern once instead of once per line
pattern = re.compile(r'g__\w+;')
with open(new_taxonomy_file_fn, 'w') as out_fh:
    with open(old_taxonomy_file_fn, 'r') as in_fh:
        for line in in_fh:
            line = line.rstrip()
            # everything before the species field contains the genus
            first_half = line.split('s__')[0]
            second_half = line.split('s__')[1]
            # pull out the genus name without its 'g__' prefix and trailing ';'
            genus = re.findall(pattern, first_half)[0].replace('g__','').replace(';','')
            # write the species as genus_species
            new_line = F"{first_half}s__{genus}_{second_half}"
            print(new_line,file=out_fh)

In [None]:
!head -2 {old_taxonomy_file_fn}

In [None]:
!head -2 {new_taxonomy_file_fn}

In [None]:
from Bio import SeqIO
import os
import random
import subprocess
import pandas as pd

#### Initial data

In [None]:
reference_dataframe_fn = os.path.abspath('/media/MassStorage/tmp/TE/honours/analysis/Stats/reference_dataframe.csv')
max_custom_database_fn = os.path.abspath('/media/MassStorage/tmp/TE/honours/database/custom_database_labelled.fasta')
taxonomy_file_fn = os.path.abspath('/media/MassStorage/tmp/TE/honours/analysis/Stats/taxonomy_file_qiime.csv')

In [None]:
#threads to use
threads = 6

In [None]:
INPUT_BASEDIR = os.path.abspath('/media/MassStorage/tmp/TE/honours')

In [None]:
OUT_DIR = os.path.abspath('../../analysis/Mapping_mock_gsref')
if not os.path.exists(OUT_DIR):
    os.mkdir(OUT_DIR)

In [None]:
### list of species in the max database
max_species = ['Puccinia_striiformis-tritici',
             'Zymoseptoria_tritici',
             'Pyrenophora_tritici-repentis',
             'Fusarium_oxysporum',
             'Tuber_brumale',
             'Cortinarius_globuliformis',
             'Aspergillus_niger',
             'Clavispora_lusitaniae',
             'Kluyveromyces_unidentified',
             'Penicillium_chrysogenum',
             'Rhodotorula_mucilaginosa',
             'Scedosporium_boydii',
             'Blastobotrys_proliferans',
             'Debaryomyces_unidentified',
             'Galactomyces_geotrichum',
             'Kodamaea_ohmeri',
             'Meyerozyma_guilliermondii',
             'Wickerhamomyces_anomalus',
             'Yamadazyma_mexicana',
             'Yamadazyma_scolyti',
             'Yarrowia_lipolytica',
             'Zygoascus_hellenicus',
             'Aspergillus_flavus',
             'Cryptococcus_zero',
             'Aspergillus_unidentified',
             'Diaporthe_CCL067',
             'Diaporthe_unidentified',
             'Oculimacula_yallundae-CCL031',
             'Oculimacula_yallundae-CCL029',
             'Dothiorella_vidmadera',
             'Quambalaria_cyanescens',
             'Entoleuca_unidentified',
             'Asteroma_CCL060',
             'Asteroma_CCL068',
             'Saccharomyces_cerevisiae',
             'Cladophialophora_unidentified',
             'Candida_albicans',
             'Candida_metapsilosis',
             'Candida_orthopsilosis',
             'Candida_parapsilosis',
             'Candida_unidentified',
             'Kluyveromyces_marxianus',
             'Pichia_kudriavzevii',
             'Pichia_membranifaciens']

In [None]:
# ###Removed from second test database
species_delete = [
# 'Candida_orthopsilosis',
#                  'Candida_metapsilosis',
#                  'Aspergillus_niger'
]

In [None]:
###species to be searched against both databases
# mock_community = ['Penicillium_chrysogenum',
#  'Aspergillus_flavus',
#  'Aspergillus_niger',
#  'Pichia_kudriavzevii',
#  'Pichia_membranifaciens',
#  'Candida_albicans',
#  'Candida_parapsilosis',
#  'Candida_orthopsilosis',
#  'Candida_metapsilosis']

mock_community = ['Puccinia_striiformis-tritici',
             'Zymoseptoria_tritici',
             'Pyrenophora_tritici-repentis',
             'Fusarium_oxysporum',
             'Tuber_brumale',
             'Cortinarius_globuliformis',
             'Aspergillus_niger',
             'Clavispora_lusitaniae',
             'Kluyveromyces_unidentified',
             'Penicillium_chrysogenum',
             'Rhodotorula_mucilaginosa',
             'Scedosporium_boydii',
             'Blastobotrys_proliferans',
             'Debaryomyces_unidentified',
             'Galactomyces_geotrichum',
             'Kodamaea_ohmeri',
             'Meyerozyma_guilliermondii',
             'Wickerhamomyces_anomalus',
             'Yamadazyma_mexicana',
             'Yamadazyma_scolyti',
             'Yarrowia_lipolytica',
             'Zygoascus_hellenicus',
             'Aspergillus_flavus',
             'Cryptococcus_zero',
             'Aspergillus_unidentified',
             'Diaporthe_CCL067',
             'Diaporthe_unidentified',
             'Oculimacula_yallundae-CCL031',
             'Oculimacula_yallundae-CCL029',
             'Dothiorella_vidmadera',
             'Quambalaria_cyanescens',
             'Entoleuca_unidentified',
             'Asteroma_CCL060',
             'Asteroma_CCL068',
             'Saccharomyces_cerevisiae',
             'Cladophialophora_unidentified',
             'Candida_albicans',
             'Candida_metapsilosis',
             'Candida_orthopsilosis',
             'Candida_parapsilosis',
             'Candida_unidentified',
             'Kluyveromyces_marxianus',
             'Pichia_kudriavzevii',
             'Pichia_membranifaciens']

In [None]:
# fixed_old_names = ['Kluyveromyces_lactis',
#                    'Candida_zeylanoides',
#                    'Cladophialophora_sp.',
#                    'Diaporthe_sp.',
#                    'CCL060',
#                    'CCL068',
#                    'CCL067',
#                    'Aspergillus_sp.',
#                    'Entoleuca_sp.',
#                    'Tapesia_yallundae_CCL029',
#                    'Tapesia_yallundae_CCL031',
#                    'Cryptococcus_neoformans']

In [None]:
# fixed_new_names = ['candida_unidentified',
#                    'debaryomyces_unidentified',
#                    'cladophialophora_unidentified',
#                    'diaporthe_unidentified',
#                    'asteroma_ccl060',
#                    'asteroma_ccl068',
#                    'diaporthe_ccl067',
#                    'aspergillus_unidentified',
#                    'entoleuca_unidentified',
#                    'oculimacula_yallundae-ccl029',
#                    'oculimacula_yallundae-ccl031',
#                    'kluyveromyces_unidentified']

In [None]:
# old_to_new_names = dict(zip(fixed_old_names, fixed_new_names))

In [None]:
# old_to_new_names

### Fix databases and names

In [None]:
ref_df = pd.read_csv(reference_dataframe_fn)
ref_df['name_species'] = ref_df['genus'] +"_"+ ref_df['species']

In [None]:
ref_df.name_species.tolist()

In [None]:
new_db_fn = os.path.join(OUT_DIR, 'gsref.db.fasta')

In [None]:
new_db_list = []
old_db_list = []
for seq in SeqIO.parse(max_custom_database_fn, 'fasta'):
    old_db_list.append(seq.id)
    if seq.id.lower() in ref_df.name_species.tolist():
        #print(seq.id)
        seq.id = seq.name = seq.description = seq.id.lower()
        new_db_list.append(seq)
    else:
        print(seq.id)

In [None]:
if len(new_db_list) == len(old_db_list):
    SeqIO.write(new_db_list, new_db_fn, 'fasta')
else:
    print("please check!")

In [None]:
sub_db_fn = os.path.join(OUT_DIR, 'gsref.subdb.fasta')
sub_db_list = []
for seq in new_db_list:
    if seq.id not in [x.lower() for x in species_delete]:
        sub_db_list.append(seq)

In [None]:
if len(sub_db_list) + len(species_delete) == len(new_db_list):
    SeqIO.write(sub_db_list, sub_db_fn, 'fasta' )
else:
    print("please check!")

In [None]:
[x.id for x in sub_db_list]

In [None]:
mock_community = [x.lower() for x in mock_community]

In [None]:
mock_community

### Subsample reads

In [None]:
def subsamplereads(in_fn, out_fn, n_reads):
    command = F'reformat.sh samplereadstarget={n_reads} in={in_fn} out={out_fn}'
    out = subprocess.getstatusoutput(command)
    if out[0] == 0:
        print(F":)Completed {command}\n")
    else:
        print(F":(check one {command}!!\n")

In [None]:
n_reads = 2000

In [None]:
MC_READ_DIR = os.path.join(OUT_DIR, 'MC_READS')
if not os.path.exists(MC_READ_DIR):
    os.mkdir(MC_READ_DIR)

In [None]:
ref_df.columns

In [None]:
fn_subsampling = {}
for x in mock_community:
    fn_subsampling[x] = (ref_df[(ref_df['species'] == x.split('_')[1]) & (ref_df['genus'] == x.split('_')[0])]['path for use'].tolist()[0])
    fn_subsampling[x] = os.path.join(INPUT_BASEDIR, fn_subsampling[x])
fn_subsampling

In [None]:
sub_reads_fn = {}
for key, value in fn_subsampling.items():
    species = key
    in_fn = value
    out_fn = os.path.join(MC_READ_DIR, F'{species}.{n_reads}.fasta')
    subsamplereads(in_fn, out_fn, n_reads)
    sub_reads_fn[species] = out_fn

### Map with minimap against both databases

In [None]:
def minimapmapping(fasta_fn, ref_fn, out_fn, threads):
    command = F"minimap2 -x map-ont -t {threads} {ref_fn} {fasta_fn} -o {out_fn}"
    out = subprocess.getstatusoutput(command)
    if out[0] == 0:
        print(F":)Completed {command}\n")
    else:
        print(F":(check one {command}!!\n")

In [None]:
dbases_fn = {}
for x in [sub_db_fn, new_db_fn]:
    dbases_fn[x] = os.path.join(OUT_DIR, os.path.basename(x).replace('.fasta', '').replace('.','_'))
    if not os.path.exists(dbases_fn[x]):
        os.mkdir(dbases_fn[x])
dbases_fn

In [None]:
db_fn = sub_db_fn
sub_db_mapping_fn = {}
for species, fasta_fn in sub_reads_fn.items():
    tmp_out = dbases_fn[db_fn]
    db_name = os.path.basename(db_fn).replace('.fasta', '')
    out_fn = os.path.join(tmp_out, F"{db_name}.{species}.minimap2.paf")
    sub_db_mapping_fn[species] = out_fn
    minimapmapping(fasta_fn, db_fn, out_fn, threads)

In [None]:
db_fn = new_db_fn
new_db_mapping_fn = {}
for species, fasta_fn in sub_reads_fn.items():
    tmp_out = dbases_fn[db_fn]
    db_name = os.path.basename(db_fn).replace('.fasta', '')
    out_fn = os.path.join(tmp_out, F"{db_name}.{species}.minimap2.paf")
    new_db_mapping_fn[species] = out_fn
    minimapmapping(fasta_fn, db_fn, out_fn, threads)

### Look at mapping results
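
The cells in this section keep, for each read, the hits with the highest mapping quality and, among those, the hits with the most matching bases. A toy sketch of that two-step filter, with made-up read and target names (`r1`, `sp_a`, ...) and made-up values:

```python
import pandas as pd

# Toy alignments: two reads with several candidate targets each.
toy = pd.DataFrame({
    'qseqid':   ['r1',   'r1',   'r1',   'r2',   'r2'],
    'tname':    ['sp_a', 'sp_b', 'sp_c', 'sp_a', 'sp_b'],
    'nmatch':   [900,    950,    990,    800,    700],
    'mquality': [60,     60,     10,     30,     5],
})

# Step 1: keep the hits with the highest mapping quality per read.
best_q = toy[toy['mquality'] == toy.groupby('qseqid')['mquality'].transform('max')]
# Step 2: among those, keep the hits with the most matching bases per read.
best = best_q[best_q['nmatch'] == best_q.groupby('qseqid')['nmatch'].transform('max')]

# Fraction of best hits per target.
fractions = best['tname'].value_counts(normalize=True)
print(best[['qseqid', 'tname']])
print(fractions)
```

In this toy table `sp_c` has the most matching bases for `r1` but is dropped in step 1 because of its low mapping quality, so the order of the two filters matters. Ties survive both steps, so a read can contribute more than one best hit.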

In [None]:
def mapping_results(fn, species):
    """
    Takes a minimap2 paf and the name of the query species.
    Prints the fraction of best hits per tname and saves that series as json.
    """
    min_header = ['qseqid', 'qlen', 'qstart', 'qstop', 'strand', 'tname', 'tlen', 'tstart', 'tend', 'nmatch', 'alen', 'mquality']
    tmp_df = pd.read_csv(fn, sep='\t', header = None, usecols=list(range(12)), names=min_header)
    # keep for each read the hits with the highest mapping quality
    sub_df = tmp_df[tmp_df['mquality'] == tmp_df.groupby('qseqid')['mquality'].transform('max')].reset_index(drop=True)
    # among those keep the hits with the highest number of matches
    sub_df = sub_df[sub_df['nmatch'] == sub_df.groupby('qseqid')['nmatch'].transform('max')].reset_index(drop=True)
    # fraction of best hits per target, sorted from most to least frequent
    hit_counts = sub_df.groupby('tname')['mquality'].count()
    hit_series = (hit_counts/hit_counts.sum()).sort_values(ascending=False)
    # every read should still have at least one hit after filtering
    print(sub_df.qseqid.unique().shape == tmp_df.qseqid.unique().shape)
    print('##########\n')
    print(F"This was the query species: {species}\n")
    print(F"These are the results:")
    print(hit_series,'\n')
    hit_series.to_json('/media/MassStorage/tmp/TE/honours/analysis/Mapping/custom_results/%s.json' % species)

In [None]:
###this is running the reads against the full database
for species, hit_fn in new_db_mapping_fn.items():
    mapping_results(hit_fn, species)

In [None]:
###this is running against the sub database, which has the species in species_delete removed (['Candida_orthopsilosis', 'Candida_metapsilosis', 'Aspergillus_niger'] when those entries are uncommented)
for species, hit_fn in sub_db_mapping_fn.items():
    mapping_results(hit_fn, species)

### Pull in mapping results and analyse them at all available levels

##### idea

* pull in query taxid as a dictionary
* assign taxid for each tname species from minimap2
* generate a summary dictionary that checks concordance at each taxonomic rank
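
A minimal sketch of the concordance idea above, in plain Python rather than the pandas code used in the notebook. The read counts are made up. The Penicillium lineage and the Candida family label are taken from strings that appear elsewhere in this notebook; the Aspergillus niger lineage is an assumption for illustration.

```python
RANKS = ['k', 'p', 'c', 'o', 'f', 'g', 's']

def parse_qiime_taxonomy(tax_string):
    """Split 'k__Fungi;p__...' into {'k': 'Fungi', 'p': ...}."""
    tax_dict = {}
    for field in tax_string.split(';'):
        rank, _, name = field.strip().partition('__')
        tax_dict[rank] = name
    return tax_dict

def rank_concordance(query_tax, hits):
    """hits is a list of (taxonomy dict, read count) pairs.
    Returns, per rank, the fraction of reads whose hit agrees with the query."""
    total = sum(count for _, count in hits)
    concordance = {}
    for rank in RANKS:
        agreeing = sum(count for tax, count in hits if tax.get(rank) == query_tax[rank])
        concordance[rank] = agreeing / total
    return concordance

query = parse_qiime_taxonomy(
    'k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eurotiales;'
    'f__Aspergillaceae;g__Penicillium;s__Penicillium_chrysogenum')

# Made-up hit counts: 6 correct reads, 3 to a related genus, 1 to a yeast.
hits = [
    (query, 6),
    (parse_qiime_taxonomy(
        'k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eurotiales;'
        'f__Aspergillaceae;g__Aspergillus;s__Aspergillus_niger'), 3),
    (parse_qiime_taxonomy(
        'k__Fungi;p__Ascomycota;c__Saccharomycetes;o__Saccharomycetales;'
        'f__Saccharomycetales_fam_Incertae_sedis;g__Candida;s__Candida_albicans'), 1),
]

concordance = rank_concordance(query, hits)
print(concordance)
```

With these toy counts the concordance drops from 1.0 at kingdom and phylum to 0.9 at class, order and family, and to 0.6 at genus and species.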

In [None]:
def pull_mapping_results(fn):
    """
    Takes a minimap2 paf and reads it in with the first 12 columns. Ignores the rest.
    Filters for each read the best hit on mquality first taking the highest value.
    Filters for each read by the number of nmatches in the second step.
    Returns a dataframe that has the tnames as index and the counts of hits as column 'count'.
    The dataframe has also the taxrank columns ['k', 'p', 'c', 'o', 'f', 'g', 's'] that are all False to start with.
    """
    min_header = ['qseqid', 'qlen', 'qstart', 'qstop', 'strand', 'tname', 'tlen', 'tstart', 'tend', 'nmatch', 'alen', 'mquality']
    tmp_df = pd.read_csv(fn, sep='\t', header = None, usecols=[x for x in range(0,12)], names=min_header)
    # pass 'max' as a string; newer pandas versions warn when the builtin max is passed to transform
    sub_df = tmp_df[tmp_df['mquality'] == tmp_df.groupby('qseqid')['mquality'].transform('max')].reset_index(drop=True)
    sub_df = sub_df[sub_df['nmatch'] == sub_df.groupby('qseqid')['nmatch'].transform('max')].reset_index(drop=True)
    hit_df = pd.DataFrame(sub_df.groupby('tname')['mquality'].count().tolist(), sub_df.groupby('tname')['mquality'].count().index, columns=['count'])
    hit_df.sort_values(by='count', ascending=False, inplace=True)
    for key in ['k', 'p', 'c', 'o', 'f', 'g', 's']:
        hit_df[key] = False
    return hit_df

In [None]:
def getquery_taxfileid(refdf_fn, species):
    """
    Takes the reference dataframe filename and the species name.
    Returns the taxfileid, which is the date/flowcellid (column 0 value) of the ref_df.
    """
    ref_df = pd.read_csv(refdf_fn)
    ref_df['name_species'] = ref_df['genus'] +"_"+ ref_df['species']
    return ref_df[ref_df.name_species == species].iloc[:,0].values[0]

In [None]:
def get_taxid_dict(taxid_fn, taxfileid):
    """
    Takes a taxonomy assignment file filename in the Qiime format and a taxonomic identifier.
    Returns a dictionary with the taxonomic assignment at each rank.
    """
    tax_dict = {}
    with open(taxid_fn, 'r') as fh:
        for line in fh:
            if line.startswith(taxfileid):
                taxrankids = line.rstrip().split('\t')[1].split(';')
                for taxrank in taxrankids:
                    tax_dict[taxrank.split('__')[0]] = taxrank.split('__')[1]
    return tax_dict

In [None]:
def assign_taxranks_results(mapping_df, tax_fn, ref_df_fn = False):
    """
    This function assigns the taxonomic ranks for each hit in the mapping results dataframe.
    It takes a mapping_df, taxonomy assignment file, and if required a reference dataframe filename.
    Returns the mapping dataframe with assignment. 
    """
    for tname in mapping_df.index:
        if ref_df_fn:
            tmp_taxfileid = getquery_taxfileid(ref_df_fn, tname)
        else:
            tmp_taxfileid = tname
        tmp_tax_dict = get_taxid_dict(tax_fn, tmp_taxfileid)
        for key, value in tmp_tax_dict.items():
            mapping_df.loc[tname, key] = value
    return mapping_df

In [None]:
def get_accuracy_dict(mapping_df, query_tax_dict):
    """
    Summarises the mapping accuracy of the mapping results at all taxonomic ranks.
    Takes the mapping_df with taxonomic assignments and a taxonomic dictionary of the known query.
    Returns an accuracy dictionary for each taxonomic rank ['k', 'p', 'c', 'o', 'f', 'g', 's']. 
    Right now this function expects a QIIME-style taxonomy dictionary.
    """
    accuracy_dict = {}
    total_count = mapping_df['count'].sum()
    for tax_rank in ['k', 'p', 'c', 'o', 'f', 'g', 's']:
        hit_count = mapping_df[mapping_df[tax_rank] == query_tax_dict[tax_rank]]['count'].sum()
        accuracy_dict[tax_rank] = hit_count/total_count
    return accuracy_dict

In [None]:
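###Test out the summary results statistic for a single mapping result
###Note: this rebinds the name mapping_results, so the mapping_results() function defined above is no longer callable afterwards
species = 'penicillium_chrysogenum'
mapping_results = pull_mapping_results(sub_db_mapping_fn[species])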

In [None]:
###Assign taxonomic ranks to all the results
mapping_results = assign_taxranks_results(mapping_results, taxonomy_file_fn, ref_df_fn=reference_dataframe_fn)

taxfileid = getquery_taxfileid(reference_dataframe_fn, species)

query_tax_dict = get_taxid_dict(taxonomy_file_fn, taxfileid)

sensitivity_dict = get_accuracy_dict(mapping_results, query_tax_dict)

In [None]:
sensitivity_dict

In [None]:
###Test out the summary results statistic for a single mapping result
species = 'candida_albicans'
mapping_results = pull_mapping_results(sub_db_mapping_fn[species])

In [None]:
###Assign taxonomic ranks to all the results
mapping_results = assign_taxranks_results(mapping_results, taxonomy_file_fn, ref_df_fn=reference_dataframe_fn)

taxfileid = getquery_taxfileid(reference_dataframe_fn, species)

query_tax_dict = get_taxid_dict(taxonomy_file_fn, taxfileid)

sensitivity_dict = get_accuracy_dict(mapping_results, query_tax_dict)

In [None]:
sensitivity_dict

In [None]:
###Test out the summary results statistic for a single mapping result
species = 'aspergillus_niger'
mapping_results = pull_mapping_results(sub_db_mapping_fn[species])

In [None]:
###Assign taxonomic ranks to all the results
mapping_results = assign_taxranks_results(mapping_results, taxonomy_file_fn, ref_df_fn=reference_dataframe_fn)

taxfileid = getquery_taxfileid(reference_dataframe_fn, species)

query_tax_dict = get_taxid_dict(taxonomy_file_fn, taxfileid)

sensitivity_dict = get_accuracy_dict(mapping_results, query_tax_dict)

In [None]:
sensitivity_dict

### Test run on the qiime2 Database

##### Prep on the command line

cp sh_refs_qiime_ver8_dynamic_02.02.2019.fasta /media/WorkingStorage/ben.working/students/tavish/analysis/qiime2/db/.  
cp sh_taxonomy_qiime_ver8_dynamic_02.02.2019.txt /media/WorkingStorage/ben.working/students/tavish/analysis/qiime2/db/.


In [None]:
def pull_mapping_results_v2(fn):
    """
    Takes a minimap2 paf and reads it in with the first 12 columns. Ignores the rest.
    Filters for each read the best hit on mquality, taking the highest value.
    The second filter step on nmatch is switched off in this version.
    Returns two dataframes: hit_df, which has the tnames as index and the counts of hits as column 'count',
    and tmp_df, which holds the unfiltered alignments.
    Both dataframes also have the taxrank columns ['k', 'p', 'c', 'o', 'f', 'g', 's'] that are all False to start with.
    """
    min_header = ['qseqid', 'qlen', 'qstart', 'qstop', 'strand', 'tname', 'tlen', 'tstart', 'tend', 'nmatch', 'alen', 'mquality']
    tmp_df = pd.read_csv(fn, sep='\t', header = None, usecols=[x for x in range(0,12)], names=min_header)
    sub_df = tmp_df[tmp_df['mquality'] == tmp_df.groupby('qseqid')['mquality'].transform('max')].reset_index(drop=True)
    #sub_df = sub_df[sub_df['nmatch'] == sub_df.groupby('qseqid')['nmatch'].transform(max)].reset_index(drop=True)
    hit_df = pd.DataFrame(sub_df.groupby('tname')['mquality'].count().tolist(), sub_df.groupby('tname')['mquality'].count().index, columns=['count'])
    hit_df.sort_values(by='count', ascending=False, inplace=True)
    for key in ['k', 'p', 'c', 'o', 'f', 'g', 's']:
        hit_df[key] = False
        tmp_df[key] = False
    return hit_df, tmp_df

In [None]:
os.path.abspath(os.curdir)

In [None]:
qiime_db_fn = os.path.abspath('../../analysis/qiime2/db/sh_refs_qiime_ver8_dynamic_02.02.2019.fasta')
qiime_tax_fn = os.path.abspath('../../analysis/qiime2/db/sh_taxonomy_qiime_ver8_dynamic_02.02.2019.txt')
threads = 10
QIIME_DIR = os.path.abspath('../../analysis/qiime2/')

In [None]:
##mapping folder
mapping_dir = os.path.join(QIIME_DIR, os.path.basename(qiime_db_fn).replace('.fasta', '').replace('.','_'))
if not os.path.exists(mapping_dir):
    os.mkdir(mapping_dir)
subsampling_dir = os.path.join(QIIME_DIR, 'subsamplereads')
if not os.path.exists(subsampling_dir):
    os.mkdir(subsampling_dir)

#### Run on test species 'penicillium_chrysogenum'

In [None]:
#subsample tests species
fn_subsampling = {}
test_species = ['penicillium_chrysogenum']
for x in test_species:
    fn_subsampling[x] = (ref_df[(ref_df['species'] == x.split('_')[1]) & (ref_df['genus'] == x.split('_')[0])]['path for use'].tolist()[0])
    fn_subsampling[x] = os.path.join(INPUT_BASEDIR, fn_subsampling[x])

sub_reads_fn = {}
n_reads = 20000
for key, value in fn_subsampling.items():
    species = key
    in_fn = value
    out_fn = os.path.join(subsampling_dir, F'{species}.{n_reads}.fasta')
    subsamplereads(in_fn, out_fn, n_reads)
    sub_reads_fn[species] = out_fn

###Map the reads
db_fn = qiime_db_fn
sub_db_mapping_fn = {}
for species, fasta_fn in sub_reads_fn.items():
    db_name = os.path.basename(db_fn).replace('.fasta', '')
    out_fn = os.path.join(mapping_dir, F"{db_name}.{species}.minimap2.paf")
    sub_db_mapping_fn[species] = out_fn
    minimapmapping(fasta_fn, db_fn, out_fn, threads)

###Test out the summary results statistic for a single mapping result
species = 'penicillium_chrysogenum'
mapping_results , full_results_df = pull_mapping_results_v2(sub_db_mapping_fn[species])
mapping_results = assign_taxranks_results(mapping_results, qiime_tax_fn)
taxfileid = getquery_taxfileid(reference_dataframe_fn, species)
query_tax_dict = get_taxid_dict(taxonomy_file_fn, taxfileid)
###fix family level for 'penicillium_chrysogenum'
sensitivity_dict = get_accuracy_dict(mapping_results, query_tax_dict)


full_results_df.index = full_results_df.tname
###Also look at the full results dataframe to explore results a bit more
for tname in full_results_df.tname.unique():

    tmp_tax_dict = get_taxid_dict(qiime_tax_fn, tname)
    for key, value in tmp_tax_dict.items():
        full_results_df.loc[tname, key] = value

In [None]:
sensitivity_dict 

In [None]:
full_results_df

In [None]:
###look at the results unfiltered
full_results_df['count'] = 1

get_accuracy_dict(full_results_df, query_tax_dict)

##### These are weird results that might be linked to

* database issues, as you can see from the entry below

sh_taxonomy_qiime_ver8_dynamic_02.02.2019.txt: k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eurotiales;f__Aspergillaceae;g__Penicillium;s__Penicillium_chrysogenum

In [None]:
mapping_results[mapping_results['s'] == 'Penicillium_chrysogenum']

In [None]:
query_tax_dict = {}
taxrank = 'k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eurotiales;f__Aspergillaceae;g__Penicillium;s__Penicillium_chrysogenum'
for rank_id in taxrank.split(';'):
    query_tax_dict[rank_id.split('__')[0]] = rank_id.split('__')[1]
query_tax_dict

In [None]:
get_accuracy_dict(mapping_results, query_tax_dict)

In [None]:
###The number of mapping results must be greater than or equal to the number of mapped reads
mapping_results['count'].sum() >= full_results_df.qseqid.unique().shape[0]

#### Testing on Candida albicans

In [None]:
#subsample tests species
fn_subsampling = {}
test_species = ['candida_albicans']
for x in test_species:
    fn_subsampling[x] = (ref_df[(ref_df['species'] == x.split('_')[1]) & (ref_df['genus'] == x.split('_')[0])]['path for use'].tolist()[0])
    fn_subsampling[x] = os.path.join(INPUT_BASEDIR, fn_subsampling[x])

sub_reads_fn = {}
n_reads = 20000
for key, value in fn_subsampling.items():
    species = key
    in_fn = value
    out_fn = os.path.join(subsampling_dir, F'{species}.{n_reads}.fasta')
    subsamplereads(in_fn, out_fn, n_reads)
    sub_reads_fn[species] = out_fn

In [None]:
###Map the reads
db_fn = qiime_db_fn
sub_db_mapping_fn = {}
for species, fasta_fn in sub_reads_fn.items():
    db_name = os.path.basename(db_fn).replace('.fasta', '')
    out_fn = os.path.join(mapping_dir, F"{db_name}.{species}.minimap2.paf")
    sub_db_mapping_fn[species] = out_fn
    minimapmapping(fasta_fn, db_fn, out_fn, threads)

In [None]:
###Test out the summary results statistic for a single mapping result
species = test_species[0]
print(sub_db_mapping_fn[species])
mapping_results , full_results_df = pull_mapping_results_v2(sub_db_mapping_fn[species])
mapping_results = assign_taxranks_results(mapping_results, qiime_tax_fn)
taxfileid = getquery_taxfileid(reference_dataframe_fn, species)
print(taxfileid)
query_tax_dict = get_taxid_dict(taxonomy_file_fn, taxfileid)
print(query_tax_dict)

sensitivity_dict = get_accuracy_dict(mapping_results, query_tax_dict)

full_results_df.index = full_results_df.tname
###Also look at the full results dataframe to explore results a bit more
for tname in full_results_df.tname.unique():

    tmp_tax_dict = get_taxid_dict(qiime_tax_fn, tname)
    for key, value in tmp_tax_dict.items():
        full_results_df.loc[tname, key] = value

In [None]:
sensitivity_dict

In [None]:
def pull_mapping_results_v3(fn):
    """
    Takes a minimap2 paf and reads it in with the first 12 columns. Ignores the rest.
    Calculates for each alignment the score cscore = alen/(alen - nmatch).
    Filters for each read the best hit on cscore, taking the highest value.
    The filter step on nmatch is switched off in this version.
    Returns two dataframes: hit_df, which has the tnames as index and the counts of hits as column 'count',
    and tmp_df, which holds the unfiltered alignments including the cscore column.
    Both dataframes also have the taxrank columns ['k', 'p', 'c', 'o', 'f', 'g', 's'] that are all False to start with.
    """
    min_header = ['qseqid', 'qlen', 'qstart', 'qstop', 'strand', 'tname', 'tlen', 'tstart', 'tend', 'nmatch', 'alen', 'mquality']
    tmp_df = pd.read_csv(fn, sep='\t', header = None, usecols=[x for x in range(0,12)], names=min_header)
    tmp_df['cscore'] = tmp_df['alen']/(tmp_df['alen']-tmp_df['nmatch'])
    sub_df = tmp_df[tmp_df['cscore'] == tmp_df.groupby('qseqid')['cscore'].transform('max')].reset_index(drop=True)
#     sub_df = sub_df[sub_df['nmatch'] == sub_df.groupby('qseqid')['nmatch'].transform(max)].reset_index(drop=True)
    hit_df = pd.DataFrame(sub_df.groupby('tname')['cscore'].count().tolist(), sub_df.groupby('tname')['cscore'].count().index, columns=['count'])
    hit_df.sort_values(by='count', ascending=False, inplace=True)
    for key in ['k', 'p', 'c', 'o', 'f', 'g', 's']:
        hit_df[key] = False
        tmp_df[key] = False
    return hit_df, tmp_df
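
A short worked sketch of the `cscore` used in `pull_mapping_results_v3`, in plain Python with made-up alignment lengths. Writing identity = nmatch/alen, the score equals 1/(1 - identity), so ranking hits by `cscore` is the same as ranking them by identity and does not reward longer alignments. The helper below and its explicit handling of perfect matches are illustrative assumptions, not the notebook's code.

```python
def cscore(alen, nmatch):
    """alen / (alen - nmatch), i.e. 1 / (1 - identity)."""
    mismatches = alen - nmatch
    if mismatches == 0:
        # A perfect alignment has no finite score; treat it as the best possible.
        return float('inf')
    return alen / mismatches

long_hit = cscore(1000, 950)    # 95 % identity over 1000 columns
short_hit = cscore(100, 95)     # 95 % identity over 100 columns
worse_hit = cscore(500, 450)    # 90 % identity over 500 columns
perfect_hit = cscore(200, 200)  # 100 % identity

print(long_hit, short_hit, worse_hit, perfect_hit)
```

The long and the short hit get the same score, so a short but clean alignment can tie with or beat a long one. This may be worth keeping in mind when interpreting the filtered results.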

In [None]:
os.path.abspath(os.curdir)

In [None]:
qiime_db_fn = os.path.abspath('../../analysis/qiime2/db/sh_refs_qiime_ver8_dynamic_02.02.2019.fasta')
qiime_tax_fn = os.path.abspath('../../analysis/qiime2/db/sh_taxonomy_qiime_ver8_dynamic_02.02.2019.txt')
threads = 10
QIIME_DIR = os.path.abspath('../../analysis/qiime2/')

In [None]:
##mapping folder
mapping_dir = os.path.join(QIIME_DIR, os.path.basename(qiime_db_fn).replace('.fasta', '').replace('.','_'))
if not os.path.exists(mapping_dir):
    os.mkdir(mapping_dir)
subsampling_dir = os.path.join(QIIME_DIR, 'subsamplereads')
if not os.path.exists(subsampling_dir):
    os.mkdir(subsampling_dir)

#### Run on test species 'penicillium_chrysogenum'

In [None]:
#subsample tests species
fn_subsampling = {}
test_species = ['penicillium_chrysogenum']
for x in test_species:
    fn_subsampling[x] = (ref_df[(ref_df['species'] == x.split('_')[1]) & (ref_df['genus'] == x.split('_')[0])]['path for use'].tolist()[0])
    fn_subsampling[x] = os.path.join(INPUT_BASEDIR, fn_subsampling[x])

sub_reads_fn = {}
n_reads = 20000
for key, value in fn_subsampling.items():
    species = key
    in_fn = value
    out_fn = os.path.join(subsampling_dir, F'{species}.{n_reads}.fasta')
    subsamplereads(in_fn, out_fn, n_reads)
    sub_reads_fn[species] = out_fn

###Map the reads
db_fn = qiime_db_fn
sub_db_mapping_fn = {}
for species, fasta_fn in sub_reads_fn.items():
    db_name = os.path.basename(db_fn).replace('.fasta', '')
    out_fn = os.path.join(mapping_dir, F"{db_name}.{species}.minimap2.paf")
    sub_db_mapping_fn[species] = out_fn
    minimapmapping(fasta_fn, db_fn, out_fn, threads)

###Test out the summary results statistic for a single mapping result
species = 'penicillium_chrysogenum'
mapping_results , full_results_df = pull_mapping_results_v3(sub_db_mapping_fn[species])
mapping_results = assign_taxranks_results(mapping_results, qiime_tax_fn)
taxfileid = getquery_taxfileid(reference_dataframe_fn, species)
query_tax_dict = get_taxid_dict(taxonomy_file_fn, taxfileid)
###fix family level for 'penicillium_chrysogenum'
sensitivity_dict = get_accuracy_dict(mapping_results, query_tax_dict)


full_results_df.index = full_results_df.tname
###Also look at the full results dataframe to explore results a bit more
for tname in full_results_df.tname.unique():

    tmp_tax_dict = get_taxid_dict(qiime_tax_fn, tname)
    for key, value in tmp_tax_dict.items():
        full_results_df.loc[tname, key] = value

In [None]:
sensitivity_dict 

In [None]:
full_results_df

In [None]:
###look at the results unfiltered
full_results_df['count'] = 1

get_accuracy_dict(full_results_df, query_tax_dict)

##### These are weird results that might be linked to

* database issues, as you can see from the entry below

sh_taxonomy_qiime_ver8_dynamic_02.02.2019.txt: k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eurotiales;f__Aspergillaceae;g__Penicillium;s__Penicillium_chrysogenum

In [None]:
mapping_results[mapping_results['s'] == 'Penicillium_chrysogenum']

In [None]:
query_tax_dict = {}
taxrank = 'k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eurotiales;f__Aspergillaceae;g__Penicillium;s__Penicillium_chrysogenum'
for rank_id in taxrank.split(';'):
    query_tax_dict[rank_id.split('__')[0]] = rank_id.split('__')[1]
query_tax_dict

In [None]:
get_accuracy_dict(mapping_results, query_tax_dict)

In [None]:
###The number of mapping results must be greater than or equal to the number of mapped reads
mapping_results['count'].sum() >= full_results_df.qseqid.unique().shape[0]

#### Testing on Candida albicans

In [None]:
#subsample tests species
fn_subsampling = {}
test_species = ['candida_albicans']
for x in test_species:
    fn_subsampling[x] = (ref_df[(ref_df['species'] == x.split('_')[1]) & (ref_df['genus'] == x.split('_')[0])]['path for use'].tolist()[0])
    fn_subsampling[x] = os.path.join(INPUT_BASEDIR, fn_subsampling[x])

sub_reads_fn = {}
n_reads = 20000
for key, value in fn_subsampling.items():
    species = key
    in_fn = value
    out_fn = os.path.join(subsampling_dir, F'{species}.{n_reads}.fasta')
    subsamplereads(in_fn, out_fn, n_reads)
    sub_reads_fn[species] = out_fn

In [None]:
###Map the reads
db_fn = qiime_db_fn
sub_db_mapping_fn = {}
for species, fasta_fn in sub_reads_fn.items():
    db_name = os.path.basename(db_fn).replace('.fasta', '')
    out_fn = os.path.join(mapping_dir, F"{db_name}.{species}.minimap2.paf")
    sub_db_mapping_fn[species] = out_fn
    minimapmapping(fasta_fn, db_fn, out_fn, threads)

In [None]:
###Test out the summary results statistic for a single mapping result
species = test_species[0]
print(sub_db_mapping_fn[species])
mapping_results , full_results_df = pull_mapping_results_v3(sub_db_mapping_fn[species])
mapping_results = assign_taxranks_results(mapping_results, qiime_tax_fn)
taxfileid = getquery_taxfileid(reference_dataframe_fn, species)
print(taxfileid)
query_tax_dict = get_taxid_dict(taxonomy_file_fn, taxfileid)
print(query_tax_dict)

sensitivity_dict = get_accuracy_dict(mapping_results, query_tax_dict)

full_results_df.index = full_results_df.tname
###Also look at the full results dataframe to explore results a bit more
for tname in full_results_df.tname.unique():

    tmp_tax_dict = get_taxid_dict(qiime_tax_fn, tname)
    for key, value in tmp_tax_dict.items():
        full_results_df.loc[tname, key] = value
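The loop above calls `get_taxid_dict` once per target name, which may re-read the taxonomy file each time. A possible alternative, sketched here under assumptions, is to load the file once into a dictionary keyed by sequence ID. `load_taxonomy` is a hypothetical helper, and the two taxonomy lines (including the `SH_demo_*` IDs) are invented for the demonstration.

```python
import io


def load_taxonomy(lines):
    """Build {sequence_id: {rank_prefix: name}} from QIIME-style taxonomy lines."""
    taxonomy = {}
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue
        seq_id, lineage = line.split('\t')[:2]
        ranks = {}
        for rank_id in lineage.split(';'):
            prefix, _, name = rank_id.partition('__')
            ranks[prefix] = name
        taxonomy[seq_id] = ranks
    return taxonomy


# Invented example lines; a real run would pass an open file handle instead.
demo_file = io.StringIO(
    "SH_demo_1\tk__Fungi;p__Ascomycota;c__Saccharomycetes;o__Saccharomycetales;"
    "f__Saccharomycetales_fam_Incertae_sedis;g__Candida;s__Candida_albicans\n"
    "SH_demo_10\tk__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eurotiales;"
    "f__Aspergillaceae;g__Penicillium;s__Penicillium_chrysogenum\n"
)
taxonomy = load_taxonomy(demo_file)
print(taxonomy['SH_demo_1']['s'])
```

An exact-key lookup also keeps `SH_demo_1` and `SH_demo_10` apart, which a prefix match on the ID would not.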

In [None]:
sensitivity_dict

#### Double-check how families are assigned in the taxonomy input file for the known test species

In [None]:
# mapping_results[mapping_results['s'] == 'Candida_albicans']

In [None]:
# query_tax_dict['f'] = 'Saccharomycetales_fam_Incertae_sedis'

In [None]:
# get_accuracy_dict(mapping_results, query_tax_dict)
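One way to spot such a family-level disagreement is to compare the known query lineage with a database lineage rank by rank. This is only a sketch: `rank_mismatches` is a hypothetical helper, and the query family `Placeholder_family` is a made-up value, not one taken from the taxonomy file.

```python
RANKS = ['k', 'p', 'c', 'o', 'f', 'g', 's']


def rank_mismatches(query_tax, db_tax, ranks=RANKS):
    """Return {rank: (query_name, db_name)} for every rank where the two lineages differ."""
    return {
        rank: (query_tax.get(rank), db_tax.get(rank))
        for rank in ranks
        if query_tax.get(rank) != db_tax.get(rank)
    }


# Database-side lineage as it might appear for Candida albicans.
db_tax = {'k': 'Fungi', 'p': 'Ascomycota', 'c': 'Saccharomycetes',
          'o': 'Saccharomycetales', 'f': 'Saccharomycetales_fam_Incertae_sedis',
          'g': 'Candida', 's': 'Candida_albicans'}
# Query-side lineage with a deliberately different (placeholder) family name.
query_tax = dict(db_tax, f='Placeholder_family')

mismatches = rank_mismatches(query_tax, db_tax)
print(mismatches)
```

Any rank reported here would score zero in `get_accuracy_dict` even when genus and species agree, which is the pattern the manual family override is meant to correct.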

### It appears that filtering results by mapping quality works well for the long ITS database but not for the QIIME database

In [None]:
full_results_df.columns

In [None]:
full_results_df.tname.shape

In [None]:
full_results_df.tname.unique().shape

In [None]:
full_results_df.index = full_results_df.tname

In [None]:
###Assign taxonomic ranks to the full_results_df
for tname in full_results_df.tname.unique():
    tmp_tax_dict = get_taxid_dict(qiime_tax_fn, tname)
    for key, value in tmp_tax_dict.items():
        full_results_df.loc[tname, key] = value

In [None]:
full_results_df[full_results_df['g'] == 'Candida']['qseqid'].shape

In [None]:
full_results_df[full_results_df['g'] == 'Candida']['qseqid'].unique().shape

In [None]:
###look at the results unfiltered
full_results_df['count'] = 1

get_accuracy_dict(full_results_df, query_tax_dict)

Looking at the unfiltered results does not work well either. A different alignment filter might be needed, or the QIIME 2 database might simply not be useful for noisy reads. Simulated reads with higher accuracy should perform better here.
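One candidate for a different alignment filter is to threshold on alignment identity and on the fraction of the read that is aligned. The sketch below is an assumption, not something tested in this notebook: `filter_alignments` is a hypothetical helper, the two thresholds are arbitrary, and the toy rows are invented. The column names follow the PAF header used in this notebook.

```python
import pandas as pd


def filter_alignments(paf_df, min_identity=0.85, min_query_cov=0.5):
    """Keep alignments above an identity and a query-coverage threshold."""
    df = paf_df.copy()
    # Matching residues divided by alignment block length.
    df['identity'] = df['nmatch'] / df['alen']
    # Fraction of the read covered by the alignment.
    df['query_cov'] = (df['qstop'] - df['qstart']) / df['qlen']
    keep = (df['identity'] >= min_identity) & (df['query_cov'] >= min_query_cov)
    return df[keep].reset_index(drop=True)


toy_paf = pd.DataFrame({
    'qseqid': ['r1', 'r1', 'r2'],
    'qlen':   [1000, 1000, 1200],
    'qstart': [0, 100, 0],
    'qstop':  [900, 300, 1100],
    'tname':  ['A', 'B', 'C'],
    'nmatch': [810, 190, 770],
    'alen':   [900, 200, 1100],
})
filtered = filter_alignments(toy_paf)
print(filtered[['qseqid', 'tname', 'identity', 'query_cov']])
```

In this toy set the short high-identity hit to `B` fails the coverage threshold and the long low-identity hit to `C` fails the identity threshold, so only the hit to `A` remains.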

In [None]:
full_results_df.groupby('g').count()['k'].sort_values(ascending=False)

In [None]:
###The number of best hits must be greater than or equal to the number of mapped reads (tied best hits are all counted)
mapping_results['count'].sum() >= full_results_df.qseqid.unique().shape[0]

#### Testing on other species

In [None]:
#subsample tests species
fn_subsampling = {}
test_species = ['cortinarius_globuliformis']
for x in test_species:
    fn_subsampling[x] = (ref_df[(ref_df['species'] == x.split('_')[1]) & (ref_df['genus'] == x.split('_')[0])]['path for use'].tolist()[0])
    fn_subsampling[x] = os.path.join(INPUT_BASEDIR, fn_subsampling[x])

sub_reads_fn = {}
n_reads = 20000
for key, value in fn_subsampling.items():
    species = key
    in_fn = value
    out_fn = os.path.join(subsampling_dir, F'{species}.{n_reads}.fasta')
    subsamplereads(in_fn, out_fn, n_reads)
    sub_reads_fn[species] = out_fn

In [None]:
###Map the reads
db_fn = qiime_db_fn
sub_db_mapping_fn = {}
for species, fasta_fn in sub_reads_fn.items():
    db_name = os.path.basename(db_fn).replace('.fasta', '')
    out_fn = os.path.join(mapping_dir, F"{db_name}.{species}.minimap2.paf")
    sub_db_mapping_fn[species] = out_fn
    minimapmapping(fasta_fn, db_fn, out_fn, threads)

In [None]:
###Test out the summary results statistic for a single mapping result
species = test_species[0]
print(sub_db_mapping_fn[species])
mapping_results , full_results_df = pull_mapping_results_v2(sub_db_mapping_fn[species])
mapping_results = assign_taxranks_results(mapping_results, qiime_tax_fn)
taxfileid = getquery_taxfileid(reference_dataframe_fn, species)
print(taxfileid)
query_tax_dict = get_taxid_dict(taxonomy_file_fn, taxfileid)
print(query_tax_dict)

sensitivity_dict = get_accuracy_dict(mapping_results, query_tax_dict)

###Also look at the full results dataframe to explore results a bit more
for tname in full_results_df.tname.unique():
    tmp_tax_dict = get_taxid_dict(qiime_tax_fn, tname)
    for key, value in tmp_tax_dict.items():
        full_results_df.loc[tname, key] = value

In [None]:
sensitivity_dict

In [None]:
###Test out the summary results statistic for a single mapping result
species = test_species[0]
print(sub_db_mapping_fn[species])
mapping_results , full_results_df = pull_mapping_results_v3(sub_db_mapping_fn[species])
mapping_results = assign_taxranks_results(mapping_results, qiime_tax_fn)
taxfileid = getquery_taxfileid(reference_dataframe_fn, species)
print(taxfileid)
query_tax_dict = get_taxid_dict(taxonomy_file_fn, taxfileid)
print(query_tax_dict)

sensitivity_dict = get_accuracy_dict(mapping_results, query_tax_dict)

full_results_df.index = full_results_df.tname
###Also look at the full results dataframe to explore results a bit more
for tname in full_results_df.tname.unique():
    tmp_tax_dict = get_taxid_dict(qiime_tax_fn, tname)
    for key, value in tmp_tax_dict.items():
        full_results_df.loc[tname, key] = value

In [None]:
sensitivity_dict

In [None]:
###look at the unfiltered results
full_results_df['count'] = 1

get_accuracy_dict(full_results_df, query_tax_dict)

#### Testing on all species using v3

In [None]:
import json
from collections import OrderedDict

def get_accuracy_dict(mapping_df, query_tax_dict):
    """
    Summarises the mapping accuracy of the mapping results at all taxonomic ranks.
    Takes the mapping_df with taxonomic assignments and a taxonomic dictionary of the known query.
    Returns an accuracy dictionary for each taxonomic rank ['k', 'p', 'c', 'o', 'f', 'g', 's']. 
    Right now this function takes a qiime tax 
    """
    accuracy_dict = OrderedDict()
    total_count = mapping_df['count'].sum()
    for tax_rank in ['k', 'p', 'c', 'o', 'f', 'g', 's']:
        tmps_df = pd.DataFrame(data=None)
        if tax_rank == 's':
            for index, row in mapping_df[mapping_df[tax_rank] == query_tax_dict[tax_rank]].iterrows():
                if row['s'] == 'unidentified' and row['g'] != query_tax_dict['g']:
                    mapping_df.drop(index, axis=0, inplace=True)
                else:
                    continue
            hit_count = mapping_df[mapping_df[tax_rank] == query_tax_dict[tax_rank]]['count'].sum()
        else:
            hit_count = mapping_df[mapping_df[tax_rank] == query_tax_dict[tax_rank]]['count'].sum()
        accuracy_dict[tax_rank] = hit_count/total_count
    return accuracy_dict

def minimapmapping(fasta_fn, ref_fn, out_fn, threads):
    command = F"minimap2 -x map-ont -t {threads} {ref_fn} {fasta_fn} -o {out_fn}"
    out = subprocess.getstatusoutput(command)

def pull_mapping_results_v3(fn):
    """
    Takes a minimap2 paf and reads it in with the first 12 columns. Ignores the rest.
    Computes for each alignment cscore = alen / (alen - nmatch), which increases with alignment identity.
    Keeps for each read the hit(s) with the highest cscore; tied hits are all kept.
    Returns a dataframe that has the tnames as index and the counts of hits as column 'count',
    together with the unfiltered dataframe of all alignments.
    Both dataframes also have the taxrank columns ['k', 'p', 'c', 'o', 'f', 'g', 's'] that are all False to start with.
    """
    min_header = ['qseqid', 'qlen', 'qstart', 'qstop', 'strand', 'tname', 'tlen', 'tstart', 'tend', 'nmatch', 'alen', 'mquality']
    tmp_df = pd.read_csv(fn, sep='\t', header = None, usecols=[x for x in range(0,12)], names=min_header)
    # cscore is inf for a perfect alignment (alen == nmatch); the per-read max still ranks it highest
    tmp_df['cscore'] = tmp_df['alen']/(tmp_df['alen']-tmp_df['nmatch'])
    sub_df = tmp_df[tmp_df['cscore'] == tmp_df.groupby('qseqid')['cscore'].transform('max')].reset_index(drop=True)
#     sub_df = sub_df[sub_df['nmatch'] == sub_df.groupby('qseqid')['nmatch'].transform(max)].reset_index(drop=True)
    hit_df = pd.DataFrame(sub_df.groupby('tname')['cscore'].count().tolist(), sub_df.groupby('tname')['cscore'].count().index, columns=['count'])
    hit_df.sort_values(by='count', ascending=False, inplace=True)
    for key in ['k', 'p', 'c', 'o', 'f', 'g', 's']:
        hit_df[key] = False
        tmp_df[key] = False
    return hit_df, tmp_df
    
def subsamplereads(in_fn, out_fn, n_reads):
    command = F'reformat.sh samplereadstarget={n_reads} in={in_fn} out={out_fn}'
    out = subprocess.getstatusoutput(command)

test_species_list = []
for entry in ref_df.name_species.tolist():
#     if entry[-7:] != '-ccl031' and entry[-7:] != '-ccl029':
#         test_species_list.append(entry)
#     else:
#         test_species_list.append(entry[:-7])
#         print(entry[:-7])
    test_species_list.append(entry)
    
for test_species in test_species_list:
    
    print(test_species)
    
    #subsample tests species
    fn_subsampling = {}
    test_species = [test_species]
    for x in test_species:
        fn_subsampling[x] = (ref_df[(ref_df['species'] == x.split('_')[1]) & (ref_df['genus'] == x.split('_')[0])]['path for use'].tolist()[0])
        fn_subsampling[x] = os.path.join(INPUT_BASEDIR, fn_subsampling[x])

    sub_reads_fn = {}
    n_reads = 20000
    for key, value in fn_subsampling.items():
        species = key
        in_fn = value
        out_fn = os.path.join(subsampling_dir, F'{species}.{n_reads}.fasta')
        subsamplereads(in_fn, out_fn, n_reads)
        sub_reads_fn[species] = out_fn
        
    ###Map the reads
    db_fn = qiime_db_fn
    sub_db_mapping_fn = {}
    for species, fasta_fn in sub_reads_fn.items():
        db_name = os.path.basename(db_fn).replace('.fasta', '')
        out_fn = os.path.join(mapping_dir, F"{db_name}.{species}.minimap2.paf")
        sub_db_mapping_fn[species] = out_fn
        minimapmapping(fasta_fn, db_fn, out_fn, threads)
        
    ###Test out the summary results statistic for a single mapping result
    species = test_species[0]
    mapping_results , full_results_df = pull_mapping_results_v3(sub_db_mapping_fn[species])
    mapping_results = assign_taxranks_results(mapping_results, qiime_tax_fn)
    taxfileid = getquery_taxfileid(reference_dataframe_fn, species)
    
    query_tax_dict = get_taxid_dict(taxonomy_file_fn, taxfileid)

    print(type(query_tax_dict))
    print(query_tax_dict)
    
#     sensitivity_dict = get_accuracy_dict(mapping_results, query_tax_dict)
        
#     print(json.dumps(sensitivity_dict, indent=1))
#     print('\n')
#     with open('/media/MassStorage/tmp/TE/honours/analysis/Mapping/qiime_results/%s.json' % species, 'w+') as fp:
#         json.dump(sensitivity_dict, fp)
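The `cscore` used in `pull_mapping_results_v3` equals `1 / (1 - identity)` with `identity = nmatch / alen`, so ranking hits by `cscore` gives the same order as ranking by identity, and identity stays finite for perfect alignments. The toy example below sketches the best-hit counting with identity instead; all values are invented and this is not the code that produced the results in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'qseqid': ['r1', 'r1', 'r2', 'r2'],
    'tname':  ['A', 'B', 'A', 'C'],
    'nmatch': [90, 80, 100, 70],
    'alen':   [100, 100, 100, 100],
})

# identity is bounded by 1.0, whereas alen / (alen - nmatch) is infinite for the r2/A row.
toy['identity'] = toy['nmatch'] / toy['alen']

# Keep the best hit(s) per read, then count how many reads chose each target.
is_best = toy['identity'] == toy.groupby('qseqid')['identity'].transform('max')
hit_counts = toy[is_best].groupby('tname').size().sort_values(ascending=False)
print(hit_counts)
```

Both reads pick target `A` here, so `hit_counts` plays the role of the `count` column in `hit_df`.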

### Test run on the qiime2 Database 99

##### Prep on the command line

cp sh_refs_qiime_ver8_99_02.02.2019.fasta /media/WorkingStorage/ben.working/students/tavish/analysis/qiime2/db/. 
cp sh_taxonomy_qiime_ver8_99_02.02.2019.txt /media/WorkingStorage/ben.working/students/tavish/analysis/qiime2/db/.

In [None]:
qiime_db_fn = os.path.abspath('../../analysis/qiime2/db/sh_refs_qiime_ver8_99_02.02.2019.fasta')
qiime_tax_fn = os.path.abspath('../../analysis/qiime2/db/sh_taxonomy_qiime_ver8_99_02.02.2019.txt')
threads = 10
QIIME_DIR = os.path.abspath('../../analysis/qiime2/')

In [None]:
##mapping folder
mapping_dir = os.path.join(QIIME_DIR, os.path.basename(qiime_db_fn).replace('.fasta', '').replace('.','_'))
print(mapping_dir)
if not os.path.exists(mapping_dir):
    os.mkdir(mapping_dir)
subsampling_dir = os.path.join(QIIME_DIR, 'subsamplereads')
if not os.path.exists(subsampling_dir):
    os.mkdir(subsampling_dir)
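The exists-then-mkdir pattern above could also be written with `os.makedirs(..., exist_ok=True)`, which additionally creates missing parent directories. This is a minimal sketch run in a temporary directory; `ensure_dir` is a hypothetical helper and the sub-folder names are only examples.

```python
import os
import tempfile


def ensure_dir(path):
    """Create path (and any missing parents); do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


# Work inside a throwaway directory so the sketch does not touch real analysis folders.
base = tempfile.mkdtemp()
nested = ensure_dir(os.path.join(base, 'qiime2', 'subsamplereads'))
ensure_dir(nested)  # calling it again on an existing directory raises no error
print(os.path.isdir(nested))
```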

In [None]:
#subsample tests species
fn_subsampling = {}
test_species = ['penicillium_chrysogenum']
for x in test_species:
    fn_subsampling[x] = (ref_df[(ref_df['species'] == x.split('_')[1]) & (ref_df['genus'] == x.split('_')[0])]['path for use'].tolist()[0])
    fn_subsampling[x] = os.path.join(INPUT_BASEDIR, fn_subsampling[x])

sub_reads_fn = {}
n_reads = 20000
for key, value in fn_subsampling.items():
    species = key
    in_fn = value
    out_fn = os.path.join(subsampling_dir, F'{species}.{n_reads}.fasta')
    subsamplereads(in_fn, out_fn, n_reads)
    sub_reads_fn[species] = out_fn

###Map the reads
db_fn = qiime_db_fn
sub_db_mapping_fn = {}
for species, fasta_fn in sub_reads_fn.items():
    db_name = os.path.basename(db_fn).replace('.fasta', '')
    out_fn = os.path.join(mapping_dir, F"{db_name}.{species}.minimap2.paf")
    sub_db_mapping_fn[species] = out_fn
    minimapmapping(fasta_fn, db_fn, out_fn, threads)

###Test out the summary results statistic for a single mapping result
species = 'penicillium_chrysogenum'
mapping_results , full_results_df = pull_mapping_results_v2(sub_db_mapping_fn[species])
mapping_results = assign_taxranks_results(mapping_results, qiime_tax_fn)
taxfileid = getquery_taxfileid(reference_dataframe_fn, species)
query_tax_dict = get_taxid_dict(taxonomy_file_fn, taxfileid)
###fix family level for 'penicillium_chrysogenum'
sensitivity_dict = get_accuracy_dict(mapping_results, query_tax_dict)


full_results_df.index = full_results_df.tname
###Also look at the full results dataframe to explore results a bit more
for tname in full_results_df.tname.unique():

    tmp_tax_dict = get_taxid_dict(qiime_tax_fn, tname)
    for key, value in tmp_tax_dict.items():
        full_results_df.loc[tname, key] = value

In [None]:
sensitivity_dict

In [None]:
###look at the unfiltered results
full_results_df['count'] = 1

get_accuracy_dict(full_results_df, query_tax_dict)

In [None]:
###fix the query_tax_dict with the data found in the qiime database
query_tax_dict = {}
taxrank = 'k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eurotiales;f__Aspergillaceae;g__Penicillium;s__Penicillium_chrysogenum'
for rank_id in taxrank.split(';'):
    query_tax_dict[rank_id.split('__')[0]] = rank_id.split('__')[1]

In [None]:
get_accuracy_dict(mapping_results, query_tax_dict)

##### Use with Candida albicans

In [None]:
#subsample tests species
fn_subsampling = {}
test_species = ['candida_albicans']
for x in test_species:
    fn_subsampling[x] = (ref_df[(ref_df['species'] == x.split('_')[1]) & (ref_df['genus'] == x.split('_')[0])]['path for use'].tolist()[0])
    fn_subsampling[x] = os.path.join(INPUT_BASEDIR, fn_subsampling[x])

sub_reads_fn = {}
n_reads = 20000
for key, value in fn_subsampling.items():
    species = key
    in_fn = value
    out_fn = os.path.join(subsampling_dir, F'{species}.{n_reads}.fasta')
    subsamplereads(in_fn, out_fn, n_reads)
    sub_reads_fn[species] = out_fn

###Map the reads
db_fn = qiime_db_fn
sub_db_mapping_fn = {}
for species, fasta_fn in sub_reads_fn.items():
    db_name = os.path.basename(db_fn).replace('.fasta', '')
    out_fn = os.path.join(mapping_dir, F"{db_name}.{species}.minimap2.paf")
    sub_db_mapping_fn[species] = out_fn
    minimapmapping(fasta_fn, db_fn, out_fn, threads)

###Test out the summary results statistic for a single mapping result
species = 'candida_albicans'
mapping_results , full_results_df = pull_mapping_results_v2(sub_db_mapping_fn[species])
mapping_results = assign_taxranks_results(mapping_results, qiime_tax_fn)
taxfileid = getquery_taxfileid(reference_dataframe_fn, species)
query_tax_dict = get_taxid_dict(taxonomy_file_fn, taxfileid)
###fix family level for 'penicillium_chrysogenum'
sensitivity_dict = get_accuracy_dict(mapping_results, query_tax_dict)


full_results_df.index = full_results_df.tname
###Also look at the full results dataframe to explore results a bit more
for tname in full_results_df.tname.unique():

    tmp_tax_dict = get_taxid_dict(qiime_tax_fn, tname)
    for key, value in tmp_tax_dict.items():
        full_results_df.loc[tname, key] = value

In [None]:
sensitivity_dict

In [None]:
###look at the unfiltered results
full_results_df['count'] = 1

get_accuracy_dict(full_results_df, query_tax_dict)

## Apply to wheat species for qiime and consensus respectively

In [1]:
from Bio import SeqIO
import os
import random
import subprocess
import pandas as pd

INPUT_BASEDIR = os.path.abspath('/media/MassStorage/tmp/TE/honours')
subsampling_dir = os.path.abspath('/media/MassStorage/tmp/TE/honours/analysis/Mapping/wheat/subsample_reads')
mapping_dir = os.path.abspath('/media/MassStorage/tmp/TE/honours/analysis/Mapping/wheat/qiime_results')

wheat_reference_dataframe_fn = os.path.abspath('/media/MassStorage/tmp/TE/honours/analysis/Stats/wheat_reference_dataframe.csv')
wheat_max_custom_database_fn = os.path.abspath('/media/MassStorage/tmp/TE/honours/database/wheat_database_labelled.fasta')
# taxonomy_file_fn = os.path.abspath('/media/MassStorage/tmp/TE/honours/analysis/Stats/taxonomy_file_qiime.csv')

qiime_db_fn = os.path.abspath('../../analysis/qiime2/db/sh_refs_qiime_ver8_dynamic_02.02.2019.fasta')
qiime_tax_fn = os.path.abspath('../../analysis/qiime2/db/sh_taxonomy_qiime_ver8_dynamic_02.02.2019.txt')
threads = 10

wheat_ref_df = pd.read_csv(wheat_reference_dataframe_fn)
wheat_ref_df

Unnamed: 0.1,Unnamed: 0,species1,species2,# raw reads,# reads after homology filtering,# reads after length filtering,# for use,path to raw reads,path to homology filtering,path to length filtering,path for use
0,barcode01,Puccinia_striiformis,,115052,23477,21590,21590.0,analysis/Concatenated/wheat/barcode01/merged.f...,analysis/Python_Processing/wheat/barcode01/com...,analysis/Length_Filtered/wheat/barcode01/lengt...,analysis/Length_Filtered/wheat/barcode01/lengt...
1,barcode02,Zymoseptoria_tritici,,151005,20416,18100,18100.0,analysis/Concatenated/wheat/barcode02/merged.f...,analysis/Python_Processing/wheat/barcode02/com...,analysis/Length_Filtered/wheat/barcode02/lengt...,analysis/Length_Filtered/wheat/barcode02/lengt...
2,barcode03,unknown,,92927,23616,21919,21919.0,analysis/Concatenated/wheat/barcode03/merged.f...,analysis/Python_Processing/wheat/barcode03/com...,analysis/Length_Filtered/wheat/barcode03/lengt...,analysis/Length_Filtered/wheat/barcode03/lengt...
3,barcode04,Puccinia_striiformis,Zymoseptoria_tritici,115407,22716,20568,20568.0,analysis/Concatenated/wheat/barcode04/merged.f...,analysis/Python_Processing/wheat/barcode04/com...,analysis/Length_Filtered/wheat/barcode04/lengt...,analysis/Length_Filtered/wheat/barcode04/lengt...
4,barcode05,Pyrenophora_tritici-repentis,,134094,24785,22433,22433.0,analysis/Concatenated/wheat/barcode05/merged.f...,analysis/Python_Processing/wheat/barcode05/com...,analysis/Length_Filtered/wheat/barcode05/lengt...,analysis/Length_Filtered/wheat/barcode05/lengt...
5,barcode06,unknown,,150739,25462,23769,23769.0,analysis/Concatenated/wheat/barcode06/merged.f...,analysis/Python_Processing/wheat/barcode06/com...,analysis/Length_Filtered/wheat/barcode06/lengt...,analysis/Length_Filtered/wheat/barcode06/lengt...
6,barcode07,unknown,,109201,24688,22931,22931.0,analysis/Concatenated/wheat/barcode07/merged.f...,analysis/Python_Processing/wheat/barcode07/com...,analysis/Length_Filtered/wheat/barcode07/lengt...,analysis/Length_Filtered/wheat/barcode07/lengt...
7,barcode08,Puccinia_striiformis,,114215,21725,20105,20105.0,analysis/Concatenated/wheat/barcode08/merged.f...,analysis/Python_Processing/wheat/barcode08/com...,analysis/Length_Filtered/wheat/barcode08/lengt...,analysis/Length_Filtered/wheat/barcode08/lengt...
8,barcode09,Puccinia_striiformis,Pyrenophora_tritici-repentis,106976,22012,20451,20451.0,analysis/Concatenated/wheat/barcode09/merged.f...,analysis/Python_Processing/wheat/barcode09/com...,analysis/Length_Filtered/wheat/barcode09/lengt...,analysis/Length_Filtered/wheat/barcode09/lengt...
9,barcode10,Puccinia_striiformis,Zymoseptoria_tritici,108583,21514,19810,19810.0,analysis/Concatenated/wheat/barcode10/merged.f...,analysis/Python_Processing/wheat/barcode10/com...,analysis/Length_Filtered/wheat/barcode10/lengt...,analysis/Length_Filtered/wheat/barcode10/lengt...


In [2]:
import json
from collections import OrderedDict
pd.set_option('display.max_rows', None)

def assign_taxranks_results(mapping_df, tax_fn, ref_df_fn = False):
    """
    This function assigns the taxonomic ranks for each hit in the mapping results dataframe.
    It takes a mapping_df, taxonomy assignment file, and if required a reference dataframe filename.
    Returns the mapping dataframe with assignment. 
    """
    for tname in mapping_df.index:
        if ref_df_fn:
            tmp_taxfileid = getquery_taxfileid(ref_df_fn, tname)
        else:
            tmp_taxfileid = tname
        tmp_tax_dict = get_taxid_dict(tax_fn, tmp_taxfileid)
        for key, value in tmp_tax_dict.items():
            mapping_df.loc[tname, key] = value
    return mapping_df

def get_accuracy_dict(mapping_df, query_tax_dict):
    """
    Summarises the mapping accuracy of the mapping results.
    Takes the mapping_df with taxonomic assignments and a taxonomic dictionary of the known query.
    Returns an accuracy dictionary; this version only evaluates the species rank ['s'].
    Right now this function takes a qiime tax 
    """
    accuracy_dict = OrderedDict()
    total_count = mapping_df['count'].sum()
    for tax_rank in ['s']:
        tmps_df = pd.DataFrame(data=None)
        hit_count = mapping_df[mapping_df[tax_rank] == query_tax_dict[tax_rank]]['count'].sum()
        accuracy_dict[tax_rank] = hit_count/total_count
    return accuracy_dict

def get_taxid_dict(taxid_fn, taxfileid):
    """
    Takes a taxonomy assignment file filename in the Qiime format and a taxonomic identifier.
    Returns the a dictionary with the taxonomic assignment at each rank.
    """
    tax_dict = {}
    with open(taxid_fn, 'r') as fh:
        for line in fh:
            if line.startswith(taxfileid):
                taxrankids = line.rstrip().split('\t')[1].split(';')
                for taxrank in taxrankids:
                    tax_dict[taxrank.split('__')[0]] = taxrank.split('__')[1]
    return tax_dict

def mapping_results(fn, species):
    min_header = ['qseqid', 'qlen', 'qstart', 'qstop', 'strand', 'tname', 'tlen', 'tstart', 'tend', 'nmatch', 'alen', 'mquality']
    tmp_df = pd.read_csv(fn, sep='\t', header = None, usecols=[x for x in range(0,12)], names=min_header)
    sub_df = tmp_df[tmp_df['mquality'] == tmp_df.groupby('qseqid')['mquality'].transform(max)].reset_index(drop=True)
    sub_df = sub_df[sub_df['nmatch'] == sub_df.groupby('qseqid')['nmatch'].transform(max)].reset_index(drop=True)
    hit_series = pd.Series(sub_df.groupby('tname')['mquality'].count().tolist()/sub_df.groupby('tname')['mquality'].count().sum(),
                      sub_df.groupby('tname')['mquality'].count().index)
    hit_series.sort_values(ascending=False, inplace=True)
    print(sub_df.qseqid.unique().shape == tmp_df.qseqid.unique().shape)
    print('##########\n')
    print(F"This was the query species: {species}\n")
    print(F"These are the results:")
    print(hit_series,'\n')
    hit_series.to_json('/media/MassStorage/tmp/TE/honours/analysis/Mapping/wheat/custom_results/%s.json' % species)

def minimapmapping(fasta_fn, ref_fn, out_fn, threads):
    command = F"minimap2 -x map-ont -t {threads} {ref_fn} {fasta_fn} -o {out_fn}"
    out = subprocess.getstatusoutput(command)

def pull_mapping_results_v3(fn):
    """
    Takes a minimap2 paf and reads it in with the first 12 columns. Ignores the rest.
    Computes for each alignment cscore = alen / (alen - nmatch), which increases with alignment identity.
    Keeps for each read the hit(s) with the highest cscore; tied hits are all kept.
    Returns a dataframe that has the tnames as index and the counts of hits as column 'count',
    together with the unfiltered dataframe of all alignments.
    Both dataframes also have the taxrank columns ['k', 'p', 'c', 'o', 'f', 'g', 's'] that are all False to start with.
    """
    min_header = ['qseqid', 'qlen', 'qstart', 'qstop', 'strand', 'tname', 'tlen', 'tstart', 'tend', 'nmatch', 'alen', 'mquality']
    tmp_df = pd.read_csv(fn, sep='\t', header = None, usecols=[x for x in range(0,12)], names=min_header)
    # cscore is inf for a perfect alignment (alen == nmatch); the per-read max still ranks it highest
    tmp_df['cscore'] = tmp_df['alen']/(tmp_df['alen']-tmp_df['nmatch'])
    sub_df = tmp_df[tmp_df['cscore'] == tmp_df.groupby('qseqid')['cscore'].transform('max')].reset_index(drop=True)
#     sub_df = sub_df[sub_df['nmatch'] == sub_df.groupby('qseqid')['nmatch'].transform(max)].reset_index(drop=True)
    hit_df = pd.DataFrame(sub_df.groupby('tname')['cscore'].count().tolist(), sub_df.groupby('tname')['cscore'].count().index, columns=['count'])
    hit_df.sort_values(by='count', ascending=False, inplace=True)
    for key in ['k', 'p', 'c', 'o', 'f', 'g', 's']:
        hit_df[key] = False
        tmp_df[key] = False
    return hit_df, tmp_df
    
def subsamplereads(in_fn, out_fn, n_reads):
    command = F'reformat.sh samplereadstarget={n_reads} in={in_fn} out={out_fn}'
    out = subprocess.getstatusoutput(command)

barcode_list = []
for entry in wheat_ref_df['Unnamed: 0']:
#     if entry[-7:] != '-ccl031' and entry[-7:] != '-ccl029':
#         test_species_list.append(entry)
#     else:
#         test_species_list.append(entry[:-7])
#         print(entry[:-7])
    barcode_list.append(entry)
    
counter = 0
for barcode in barcode_list:
    
    print(barcode)
    
    #subsample tests species
    fn_subsampling = {}
    barcode = [barcode]
    for x in barcode:
        fn_subsampling[x] = wheat_ref_df[wheat_ref_df['Unnamed: 0'] == x]['path for use'].tolist()[0]
        fn_subsampling[x] = os.path.join(INPUT_BASEDIR, fn_subsampling[x])

    sub_reads_fn = {}
    n_reads = 2000
    for key, value in fn_subsampling.items():
        barcode = key
        in_fn = value
        out_fn = os.path.join(subsampling_dir, F'{barcode}.{n_reads}.fasta')
        subsamplereads(in_fn, out_fn, n_reads)
        sub_reads_fn[barcode] = out_fn
        
    ###Map the reads
    db_fn = qiime_db_fn
    sub_db_mapping_fn = {}
    for barcode, fasta_fn in sub_reads_fn.items():
        db_name = os.path.basename(db_fn).replace('.fasta', '')
        out_fn = os.path.join(mapping_dir, F"{db_name}.{barcode}.minimap2.paf")
        sub_db_mapping_fn[barcode] = out_fn
        minimapmapping(fasta_fn, db_fn, out_fn, threads)
        
    ###Test out the summary results statistic for a single mapping result
    mapping_results , full_results_df = pull_mapping_results_v3(sub_db_mapping_fn[barcode])
    mapping_results = assign_taxranks_results(mapping_results, qiime_tax_fn)
    print(mapping_results.loc[:,['count', 's']])
#     print("\n\n\n")
    
    specs = []
    if wheat_ref_df[wheat_ref_df['Unnamed: 0'] == barcode]['species2'].any():
        specs = [wheat_ref_df[wheat_ref_df['Unnamed: 0'] == barcode].loc[counter,'species1'],wheat_ref_df[wheat_ref_df['Unnamed: 0'] == barcode].loc[counter,'species2']]
    else:
        specs = [wheat_ref_df[wheat_ref_df['Unnamed: 0'] == barcode].loc[counter,'species1']]

    counter += 1
    
    for element in specs:
        query_tax_dict = {'s': element}
        print(query_tax_dict)
        sensitivity_dict = get_accuracy_dict(mapping_results, query_tax_dict)
        
        print(json.dumps(sensitivity_dict, indent=1))
#     print('\n')
#     with open('/media/MassStorage/tmp/TE/honours/analysis/Mapping/qiime_results/%s.json' % species, 'w+') as fp:
#         json.dump(sensitivity_dict, fp)

barcode01
                                        count                               s
tname                                                                        
SH1732842.08FU_FJ430779_refs_singleton    514                  Acidea_extrema
SH1649638.08FU_UDB016769_refs             364                    unidentified
SH1538821.08FU_AB035719_reps              278            Bannoa_ogasawarensis
SH1654757.08FU_UDB014954_reps             206                    unidentified
SH1563344.08FU_AM267268_reps              159          Knoxdaviesia_cecropiae
SH1558908.08FU_KF800655_reps              144                    unidentified
SH1563343.08FU_FJ430715_reps              123                    unidentified
SH1560162.08FU_AY225488_reps              119               Lignincola_laevis
SH1680040.08FU_KC992945_refs_singleton    110  Cystoagaricus_hirtosquamulosus
SH1563342.08FU_UDB028691_reps              88                    unidentified
SH1566136.08FU_KC992951_refs               75        L

                                        count                               s
tname                                                                        
SH1732842.08FU_FJ430779_refs_singleton    513                  Acidea_extrema
SH1649638.08FU_UDB016769_refs             386                    unidentified
SH1538821.08FU_AB035719_reps              319            Bannoa_ogasawarensis
SH1654757.08FU_UDB014954_reps             198                    unidentified
SH1558908.08FU_KF800655_reps              156                    unidentified
SH1563344.08FU_AM267268_reps              156          Knoxdaviesia_cecropiae
SH1563343.08FU_FJ430715_reps              137                    unidentified
SH1680040.08FU_KC992945_refs_singleton    132  Cystoagaricus_hirtosquamulosus
SH1566136.08FU_KC992951_refs               92        Lacrymaria_subcinnamomea
SH1560162.08FU_AY225488_reps               88               Lignincola_laevis
SH1563342.08FU_UDB028691_reps              75                   

                                        count                               s
tname                                                                        
SH1732842.08FU_FJ430779_refs_singleton    510                  Acidea_extrema
SH1649638.08FU_UDB016769_refs             393                    unidentified
SH1538821.08FU_AB035719_reps              295            Bannoa_ogasawarensis
SH1654757.08FU_UDB014954_reps             208                    unidentified
SH1563344.08FU_AM267268_reps              158          Knoxdaviesia_cecropiae
SH1558908.08FU_KF800655_reps              142                    unidentified
SH1563343.08FU_FJ430715_reps              126                    unidentified
SH1680040.08FU_KC992945_refs_singleton    114  Cystoagaricus_hirtosquamulosus
SH1560162.08FU_AY225488_reps              103               Lignincola_laevis
SH1563342.08FU_UDB028691_reps              91                    unidentified
SH1566136.08FU_KC992951_refs               83        Lacrymaria_

                                        count                               s
tname                                                                        
SH1732842.08FU_FJ430779_refs_singleton    506                  Acidea_extrema
SH1649638.08FU_UDB016769_refs             381                    unidentified
SH1538821.08FU_AB035719_reps              274            Bannoa_ogasawarensis
SH1654757.08FU_UDB014954_reps             189                    unidentified
SH1563344.08FU_AM267268_reps              140          Knoxdaviesia_cecropiae
SH1563343.08FU_FJ430715_reps              138                    unidentified
SH1558908.08FU_KF800655_reps              136                    unidentified
SH1680040.08FU_KC992945_refs_singleton    133  Cystoagaricus_hirtosquamulosus
SH1563342.08FU_UDB028691_reps              96                    unidentified
SH1560162.08FU_AY225488_reps               94               Lignincola_laevis
SH1566136.08FU_KC992951_refs               87        Lacrymaria_
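The cell that produced this output looks up the expected species with a running `counter` that has to stay in step with the dataframe index. A possible alternative is to select the row by barcode and drop missing species. This is a sketch only: `expected_species` is a hypothetical helper, and `toy_ref` copies just two rows and three columns from `wheat_ref_df`.

```python
import pandas as pd

# Two rows copied from the wheat reference table shown earlier.
toy_ref = pd.DataFrame({
    'Unnamed: 0': ['barcode01', 'barcode04'],
    'species1': ['Puccinia_striiformis', 'Puccinia_striiformis'],
    'species2': [float('nan'), 'Zymoseptoria_tritici'],
})


def expected_species(ref_df, barcode):
    """Return the non-missing species names recorded for one barcode."""
    row = ref_df.loc[ref_df['Unnamed: 0'] == barcode].iloc[0]
    return [name for name in (row['species1'], row['species2']) if pd.notna(name)]


print(expected_species(toy_ref, 'barcode01'))
print(expected_species(toy_ref, 'barcode04'))
```

Selecting by barcode removes the dependence on row order, so the lookup would still work if the table were sorted or filtered beforehand.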

In [3]:
### list of species in the max database
max_species = ['Puccinia_striiformis',
             'Zymoseptoria_tritici',
             'Pyrenophora_tritici-repentis',
             'Fusarium_oxysporum',
             'Tuber_brumale',
             'Cortinarius_globuliformis',
             'Aspergillus_niger',
             'Clavispora_lusitaniae',
             'Kluyveromyces_unidentified',
             'Penicillium_chrysogenum',
             'Rhodotorula_mucilaginosa',
             'Scedosporium_boydii',
             'Blastobotrys_proliferans',
             'Debaryomyces_unidentified',
             'Galactomyces_geotrichum',
             'Kodamaea_ohmeri',
             'Meyerozyma_guilliermondii',
             'Wickerhamomyces_anomalus',
             'Yamadazyma_mexicana',
             'Yamadazyma_scolyti',
             'Yarrowia_lipolytica',
             'Zygoascus_hellenicus',
             'Aspergillus_flavus',
             'Cryptococcus_zero',
             'Aspergillus_unidentified',
             'Diaporthe_CCL067',
             'Diaporthe_unidentified',
             'Oculimacula_yallundae-CCL031',
             'Oculimacula_yallundae-CCL029',
             'Dothiorella_vidmadera',
             'Quambalaria_cyanescens',
             'Entoleuca_unidentified',
             'Asteroma_CCL060',
             'Asteroma_CCL068',
             'Saccharomyces_cerevisiae',
             'Cladophialophora_unidentified',
             'Candida_albicans',
             'Candida_metapsilosis',
             'Candida_orthopsilosis',
             'Candida_parapsilosis',
             'Candida_unidentified',
             'Kluyveromyces_marxianus',
             'Pichia_kudriavzevii',
             'Pichia_membranifaciens']

In [4]:
mock_community = ['barcode01','barcode02','barcode03','barcode04',
                 'barcode05','barcode06','barcode07','barcode08',
                 'barcode09','barcode10','barcode11']

In [5]:
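def subsamplereads(in_fn, out_fn, n_reads):
    """Subsample n_reads reads from in_fn into out_fn with BBMap's reformat.sh."""
    command = F'reformat.sh samplereadstarget={n_reads} in={in_fn} out={out_fn}'
    # getstatusoutput returns (exit status, combined stdout/stderr)
    status, output = subprocess.getstatusoutput(command)
    if status == 0:
        print(F":)Completed {command}\n")
    else:
        # print the tool's own message so the failure can be diagnosed
        print(F":(check on {command}!!\n{output}\n")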
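
The cell above assembles the command as one shell string, so a path containing spaces would be split by the shell. A minimal sketch of an alternative, assuming the same three `reformat.sh` arguments as above: build the command as an argument list and run it without a shell. `build_reformat_command` and `run_command` are hypothetical helper names and are not part of this notebook.

```python
import subprocess

def build_reformat_command(in_fn, out_fn, n_reads):
    # Same three arguments as in subsamplereads, one list element each,
    # so no shell quoting is needed even if a path contains spaces.
    return ['reformat.sh',
            F'samplereadstarget={n_reads}',
            F'in={in_fn}',
            F'out={out_fn}']

def run_command(command):
    # Hypothetical helper: run an argument list and report success as a bool.
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except FileNotFoundError:
        # Raised when the executable is not on the PATH.
        print(F":(executable not found: {command[0]}\n")
        return False
    if result.returncode == 0:
        print(F":)Completed {' '.join(command)}\n")
        return True
    print(F":(check on {' '.join(command)}!!\n{result.stderr}\n")
    return False
```

Returning a bool instead of only printing would let the calling loop skip samples whose subsampling failed.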

In [6]:
n_reads = 2000

In [7]:
OUT_DIR = os.path.abspath('../../analysis/Mapping/wheat')
MC_READ_DIR = os.path.join(OUT_DIR, 'MC_READS')
# makedirs also creates OUT_DIR and does nothing if the directories already exist
os.makedirs(MC_READ_DIR, exist_ok=True)
# the database paths stay relative; they are used as dictionary keys further down
sub_db_fn = '../../analysis/Mapping/gsref.subdb.fasta'
new_db_fn = '../../analysis/Mapping/gsref.db.fasta'

In [8]:
wheat_ref_df.columns

Index(['Unnamed: 0', 'species1', 'species2', '# raw reads',
       '# reads after homology filtering', '# reads after length filtering',
       '# for use', 'path to raw reads', 'path to homology filtering',
       'path to length filtering', 'path for use'],
      dtype='object')

In [9]:
fn_subsampling = {}
for x in mock_community:
    print(x)
    fn_subsampling[x] = wheat_ref_df[wheat_ref_df['Unnamed: 0'] == x]['path for use'].tolist()[0]
    fn_subsampling[x] = os.path.join(INPUT_BASEDIR, fn_subsampling[x])
fn_subsampling

barcode01
barcode02
barcode03
barcode04
barcode05
barcode06
barcode07
barcode08
barcode09
barcode10
barcode11


{'barcode01': '/media/MassStorage/tmp/TE/honours/analysis/Length_Filtered/wheat/barcode01/length_restricted_reads.fasta',
 'barcode02': '/media/MassStorage/tmp/TE/honours/analysis/Length_Filtered/wheat/barcode02/length_restricted_reads.fasta',
 'barcode03': '/media/MassStorage/tmp/TE/honours/analysis/Length_Filtered/wheat/barcode03/length_restricted_reads.fasta',
 'barcode04': '/media/MassStorage/tmp/TE/honours/analysis/Length_Filtered/wheat/barcode04/length_restricted_reads.fasta',
 'barcode05': '/media/MassStorage/tmp/TE/honours/analysis/Length_Filtered/wheat/barcode05/length_restricted_reads.fasta',
 'barcode06': '/media/MassStorage/tmp/TE/honours/analysis/Length_Filtered/wheat/barcode06/length_restricted_reads.fasta',
 'barcode07': '/media/MassStorage/tmp/TE/honours/analysis/Length_Filtered/wheat/barcode07/length_restricted_reads.fasta',
 'barcode08': '/media/MassStorage/tmp/TE/honours/analysis/Length_Filtered/wheat/barcode08/length_restricted_reads.fasta',
 'barcode09': '/media/Ma

In [10]:
sub_reads_fn = {}
for key, value in fn_subsampling.items():
    print(key)
    print(value)
    species = key
    in_fn = value
    out_fn = os.path.join(MC_READ_DIR, F'{species}.{n_reads}.fasta')
    subsamplereads(in_fn, out_fn, n_reads)
    sub_reads_fn[species] = out_fn

barcode01
/media/MassStorage/tmp/TE/honours/analysis/Length_Filtered/wheat/barcode01/length_restricted_reads.fasta
:)Completed reformat.sh samplereadstarget=2000 in=/media/MassStorage/tmp/TE/honours/analysis/Length_Filtered/wheat/barcode01/length_restricted_reads.fasta out=/media/MassStorage/tmp/TE/honours/analysis/Mapping/wheat/MC_READS/barcode01.2000.fasta

barcode02
/media/MassStorage/tmp/TE/honours/analysis/Length_Filtered/wheat/barcode02/length_restricted_reads.fasta
:)Completed reformat.sh samplereadstarget=2000 in=/media/MassStorage/tmp/TE/honours/analysis/Length_Filtered/wheat/barcode02/length_restricted_reads.fasta out=/media/MassStorage/tmp/TE/honours/analysis/Mapping/wheat/MC_READS/barcode02.2000.fasta

barcode03
/media/MassStorage/tmp/TE/honours/analysis/Length_Filtered/wheat/barcode03/length_restricted_reads.fasta
:)Completed reformat.sh samplereadstarget=2000 in=/media/MassStorage/tmp/TE/honours/analysis/Length_Filtered/wheat/barcode03/length_restricted_reads.fasta out=/m

### Map with minimap2 against both databases

In [11]:
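def minimapmapping(fasta_fn, ref_fn, out_fn, threads):
    """Map the reads in fasta_fn against ref_fn with minimap2 and write PAF to out_fn."""
    # without -a minimap2 writes PAF; map-ont is the preset for Oxford Nanopore reads
    command = F"minimap2 -x map-ont -t {threads} {ref_fn} {fasta_fn} -o {out_fn}"
    # getstatusoutput returns (exit status, combined stdout/stderr)
    status, output = subprocess.getstatusoutput(command)
    if status == 0:
        print(F":)Completed {command}\n")
    else:
        # print the tool's own message so the failure can be diagnosed
        print(F":(check on {command}!!\n{output}\n")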

In [12]:
dbases_fn = {}
for x in [sub_db_fn, new_db_fn]:
    print(x)
    dbases_fn[x] = os.path.join(OUT_DIR, os.path.basename(x).replace('.fasta', '').replace('.','_'))
    if not os.path.exists(dbases_fn[x]):
        os.mkdir(dbases_fn[x])
dbases_fn

../../analysis/Mapping/gsref.subdb.fasta
../../analysis/Mapping/gsref.db.fasta


{'../../analysis/Mapping/gsref.subdb.fasta': '/media/MassStorage/tmp/TE/honours/analysis/Mapping/wheat/gsref_subdb',
 '../../analysis/Mapping/gsref.db.fasta': '/media/MassStorage/tmp/TE/honours/analysis/Mapping/wheat/gsref_db'}

In [13]:
db_fn = sub_db_fn
sub_db_mapping_fn = {}
for species, fasta_fn in sub_reads_fn.items():
    tmp_out = dbases_fn[db_fn]
    db_name = os.path.basename(db_fn).replace('.fasta', '')
    out_fn = os.path.join(tmp_out, F"{db_name}.{species}.minimap2.paf")
    sub_db_mapping_fn[species] = out_fn
    minimapmapping(fasta_fn, db_fn, out_fn, threads)

:)Completed minimap2 -x map-ont -t 10 ../../analysis/Mapping/gsref.subdb.fasta /media/MassStorage/tmp/TE/honours/analysis/Mapping/wheat/MC_READS/barcode01.2000.fasta -o /media/MassStorage/tmp/TE/honours/analysis/Mapping/wheat/gsref_subdb/gsref.subdb.barcode01.minimap2.paf

:)Completed minimap2 -x map-ont -t 10 ../../analysis/Mapping/gsref.subdb.fasta /media/MassStorage/tmp/TE/honours/analysis/Mapping/wheat/MC_READS/barcode02.2000.fasta -o /media/MassStorage/tmp/TE/honours/analysis/Mapping/wheat/gsref_subdb/gsref.subdb.barcode02.minimap2.paf

:)Completed minimap2 -x map-ont -t 10 ../../analysis/Mapping/gsref.subdb.fasta /media/MassStorage/tmp/TE/honours/analysis/Mapping/wheat/MC_READS/barcode03.2000.fasta -o /media/MassStorage/tmp/TE/honours/analysis/Mapping/wheat/gsref_subdb/gsref.subdb.barcode03.minimap2.paf

:)Completed minimap2 -x map-ont -t 10 ../../analysis/Mapping/gsref.subdb.fasta /media/MassStorage/tmp/TE/honours/analysis/Mapping/wheat/MC_READS/barcode04.2000.fasta -o /media/Ma

In [14]:
db_fn = new_db_fn
new_db_mapping_fn = {}
for species, fasta_fn in sub_reads_fn.items():
    tmp_out = dbases_fn[db_fn]
    db_name = os.path.basename(db_fn).replace('.fasta', '')
    out_fn = os.path.join(tmp_out, F"{db_name}.{species}.minimap2.paf")
    new_db_mapping_fn[species] = out_fn
    minimapmapping(fasta_fn, db_fn, out_fn, threads)

:)Completed minimap2 -x map-ont -t 10 ../../analysis/Mapping/gsref.db.fasta /media/MassStorage/tmp/TE/honours/analysis/Mapping/wheat/MC_READS/barcode01.2000.fasta -o /media/MassStorage/tmp/TE/honours/analysis/Mapping/wheat/gsref_db/gsref.db.barcode01.minimap2.paf

:)Completed minimap2 -x map-ont -t 10 ../../analysis/Mapping/gsref.db.fasta /media/MassStorage/tmp/TE/honours/analysis/Mapping/wheat/MC_READS/barcode02.2000.fasta -o /media/MassStorage/tmp/TE/honours/analysis/Mapping/wheat/gsref_db/gsref.db.barcode02.minimap2.paf

:)Completed minimap2 -x map-ont -t 10 ../../analysis/Mapping/gsref.db.fasta /media/MassStorage/tmp/TE/honours/analysis/Mapping/wheat/MC_READS/barcode03.2000.fasta -o /media/MassStorage/tmp/TE/honours/analysis/Mapping/wheat/gsref_db/gsref.db.barcode03.minimap2.paf

:)Completed minimap2 -x map-ont -t 10 ../../analysis/Mapping/gsref.db.fasta /media/MassStorage/tmp/TE/honours/analysis/Mapping/wheat/MC_READS/barcode04.2000.fasta -o /media/MassStorage/tmp/TE/honours/analy

### Look at mapping results

In [15]:
def mapping_results(fn, species, expected_species):
    """Print, for one sample, the fraction of best hits assigned to each reference (tname)."""
    # names for the first 12 PAF columns; column 10 is the number of matching bases, column 12 the mapping quality
    min_header = ['qseqid', 'qlen', 'qstart', 'qstop', 'strand', 'tname', 'tlen', 'tstart', 'tend', 'nmatch', 'alen', 'mquality']
    tmp_df = pd.read_csv(fn, sep='\t', header=None, usecols=list(range(12)), names=min_header)
    # best hit per read: highest mapping quality first, then the most matching bases
    sub_df = tmp_df[tmp_df['mquality'] == tmp_df.groupby('qseqid')['mquality'].transform('max')].reset_index(drop=True)
    sub_df = sub_df[sub_df['nmatch'] == sub_df.groupby('qseqid')['nmatch'].transform('max')].reset_index(drop=True)
    # hits that tie on both criteria are all kept, so a read can be counted more than once
    hit_counts = sub_df.groupby('tname').size()
    hit_series = (hit_counts / hit_counts.sum()).sort_values(ascending=False)
    # sanity check: every mapped read should still have at least one best hit
    print(sub_df.qseqid.nunique() == tmp_df.qseqid.nunique())
    print('##########\n')
    print(F"This was the sample type expected: {expected_species[species]}\n")
    print("These are the results:")
    print(hit_series, '\n')
    # NOTE: the file name depends only on `species`, so a run against a second database overwrites the JSON of the first
    hit_series.to_json('/media/MassStorage/tmp/TE/honours/analysis/Mapping/custom_results/%s.json' % species)
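
To make the best-hit rule used in `mapping_results` easier to check by hand, here is a minimal sketch of the same idea on a few PAF lines, written with the standard library only instead of pandas. The function name `best_hit_fractions` and the toy records (`read1`, `refA`, ...) are invented for illustration. One deliberate difference: when two hits of a read tie on both mapping quality and matching bases, this sketch keeps only the first one, whereas the pandas version keeps all of them.

```python
from collections import Counter

def best_hit_fractions(paf_lines):
    # read id -> (mapping quality, matching bases, target name) of its best hit
    best = {}
    for line in paf_lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 12:
            continue  # not a complete PAF record
        qseqid, tname = fields[0], fields[5]
        nmatch, mquality = int(fields[9]), int(fields[11])
        # Compare mapping quality first, then number of matching bases.
        if qseqid not in best or (mquality, nmatch) > best[qseqid][:2]:
            best[qseqid] = (mquality, nmatch, tname)
    counts = Counter(tname for _, _, tname in best.values())
    total = sum(counts.values())
    # Fractions of reads per target, most frequent target first.
    return {tname: n / total for tname, n in counts.most_common()}

# Invented toy data: 4 reads, 6 alignments.
toy_paf = [
    "read1\t1000\t0\t1000\t+\trefA\t1200\t0\t1000\t900\t1000\t60",
    "read1\t1000\t0\t1000\t+\trefB\t1200\t0\t1000\t950\t1000\t10",
    "read2\t1000\t0\t1000\t+\trefA\t1200\t0\t1000\t800\t1000\t60",
    "read2\t1000\t0\t1000\t+\trefB\t1200\t0\t1000\t850\t1000\t60",
    "read3\t1000\t0\t1000\t-\trefA\t1200\t0\t1000\t700\t1000\t0",
    "read4\t1000\t0\t1000\t+\trefA\t1200\t0\t1000\t990\t1000\t60",
]
print(best_hit_fractions(toy_paf))
```

Here `read1` goes to `refA` because mapping quality outranks the higher match count of the `refB` hit, and `read2` goes to `refB` because the two hits share a mapping quality and `refB` has more matching bases.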

In [16]:
expected_species = {'barcode01': "Puccinia striiformis-tritici",
                   'barcode02': 'Zymoseptoria tritici',
                   'barcode03': 'Healthy wheat sample',
                   'barcode04': 'Puccinia striiformis-tritici AND Zymoseptoria tritici',
                   'barcode05': 'Pyrenophora tritici-repentis',
                   'barcode06': 'Healthy resistant wheat sample',
                   'barcode07': 'Healthy susceptible wheat sample',
                   'barcode08': 'Puccinia striiformis-tritici',
                   'barcode09': 'Puccinia striiformis-tritici AND Pyrenophora tritici-repentis',
                   'barcode10': 'Puccinia striiformis-tritici AND Zymoseptoria tritici',
                   'barcode11': 'Zymoseptoria tritici AND Puccinia striiformis-tritici'}
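
The expected labels above are free text, while the reference names printed in the results further down are lowercase with underscores (for example `puccinia_striiformis-tritici`). A minimal sketch of a helper that converts a label into the reference names one would look for in the hit table, assuming that this lowercase, underscore-joined form holds for every species and that healthy samples carry no expected pathogen. `expected_targets` is a hypothetical name and is not used elsewhere in this notebook.

```python
def expected_targets(label):
    # Healthy samples have no expected pathogen.
    if label.lower().startswith('healthy'):
        return []
    # Mixed samples join species names with ' AND ';
    # reference names are assumed to be lowercase with underscores.
    return [name.strip().lower().replace(' ', '_')
            for name in label.split(' AND ')]

print(expected_targets('Puccinia striiformis-tritici AND Zymoseptoria tritici'))
print(expected_targets('Healthy resistant wheat sample'))
```

The returned names could then be looked up in the index of the hit table to see where the expected species rank.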

In [17]:
### summarize the mapping of the reads against the full database
for species, hit_fn in new_db_mapping_fn.items():
    mapping_results(hit_fn, species, expected_species)

True
##########

This was the sample type expected: Puccinia striiformis-tritici

These are the results:
tname
cryptococcus_zero                0.641251
blastobotrys_proliferans         0.160313
saccharomyces_cerevisiae         0.073314
cortinarius_globuliformis        0.040567
rhodotorula_mucilaginosa         0.008309
aspergillus_flavus               0.007820
meyerozyma_guillermondii         0.006354
puccinia_striiformis-tritici     0.005376
aspergillus_unidentified         0.003910
candida_albicans                 0.003910
candida_unidentified             0.003910
kluyveromyces_unidentified       0.003910
zymoseptoria_tritici             0.003910
wickerhamomyces_anomalus         0.003910
entoleuca_unidentified           0.003421
candida_metapsilosis             0.003421
quambalaria_cyanescens           0.003421
debaryomyces_unidentified        0.002933
pyrenophora_tritici-repentis     0.002933
dothiorella_vidmadera            0.002444
oculimacula_yallundae-ccl031     0.002444
kluyver

True
##########

This was the sample type expected: Healthy resistant wheat sample

These are the results:
tname
cortinarius_globuliformis        0.550556
cryptococcus_zero                0.172714
wickerhamomyces_anomalus         0.106434
candida_parapsilosis             0.046928
blastobotrys_proliferans         0.023706
saccharomyces_cerevisiae         0.013546
aspergillus_flavus               0.007741
quambalaria_cyanescens           0.007257
oculimacula_yallundae-ccl029     0.006773
meyerozyma_guillermondii         0.005806
debaryomyces_unidentified        0.005806
candida_unidentified             0.004838
dothiorella_vidmadera            0.004838
entoleuca_unidentified           0.004354
rhodotorula_mucilaginosa         0.004354
candida_albicans                 0.003870
zymoseptoria_tritici             0.003870
kluyveromyces_unidentified       0.003387
aspergillus_unidentified         0.003387
cladophialophora_unidentified    0.002903
kluyveromyces_marxianus          0.002419
yamad

In [18]:
### summarize the mapping against the database from which ['Candida_orthopsilosis', 'Candida_metapsilosis', 'Aspergillus_niger'] were deleted
for species, hit_fn in sub_db_mapping_fn.items():
    mapping_results(hit_fn, species, expected_species)

True
##########

This was the sample type expected: Puccinia striiformis-tritici

These are the results:
tname
cryptococcus_zero                0.640039
blastobotrys_proliferans         0.159766
saccharomyces_cerevisiae         0.073551
cortinarius_globuliformis        0.041403
rhodotorula_mucilaginosa         0.008281
aspergillus_flavus               0.007793
meyerozyma_guillermondii         0.006819
puccinia_striiformis-tritici     0.005358
kluyveromyces_unidentified       0.005358
candida_albicans                 0.004384
candida_unidentified             0.004384
wickerhamomyces_anomalus         0.004384
zymoseptoria_tritici             0.003897
aspergillus_unidentified         0.003897
quambalaria_cyanescens           0.003897
entoleuca_unidentified           0.003897
kluyveromyces_marxianus          0.003410
pyrenophora_tritici-repentis     0.002923
debaryomyces_unidentified        0.002923
dothiorella_vidmadera            0.002435
oculimacula_yallundae-ccl031     0.002435
penicil

True
##########

This was the sample type expected: Puccinia striiformis-tritici AND Zymoseptoria tritici

These are the results:
tname
cryptococcus_zero                0.369928
cortinarius_globuliformis        0.182816
kluyveromyces_marxianus          0.136993
blastobotrys_proliferans         0.084010
penicillium_chrysogenum          0.038663
saccharomyces_cerevisiae         0.030072
rhodotorula_mucilaginosa         0.015274
zymoseptoria_tritici             0.014797
wickerhamomyces_anomalus         0.014320
meyerozyma_guillermondii         0.012888
entoleuca_unidentified           0.011933
aspergillus_flavus               0.011933
quambalaria_cyanescens           0.010501
candida_albicans                 0.010501
candida_unidentified             0.007160
debaryomyces_unidentified        0.006205
candida_parapsilosis             0.006205
dothiorella_vidmadera            0.004773
kluyveromyces_unidentified       0.004773
oculimacula_yallundae-ccl031     0.003819
cladophialophora_unident