# Prepare Sample Files for BGCflow
Use: `workflow/envs/bgc_analytics.yaml`

This notebook prepares the sample files (CSV) that BGCflow (https://github.com/NBChub/bgcflow) needs for a comparative BGC analysis of:
- [o__Myxococcales](https://gtdb.ecogenomic.org/tree?r=o__Myxococcales)
- [f__Nitrospiraceae](https://gtdb.ecogenomic.org/tree?r=f__Nitrospiraceae)

We will compare all public genome assemblies available in GTDB with the MAGs from [Singleton et al., 2021](https://www.nature.com/articles/s41467-021-22203-2).

## Table of Contents
* [Library and paths setting](#load-library)
* [Getting MAGs information from Singleton 2021](#second-bullet)
    * [Data cleaning](#data-cleaning-mags)
        * [Map to BioProject](#map-to-bioproject)
    * [Filtering for Myxococcales](#myxo-filter-mags)
    * [Filtering for Nitrospira](#nitro-filter-mags)

### Load Library <a class="anchor" id="load-library"></a>


In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
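## Helper functions
def get_gtdb_tax(df):
    """
    Split GTDB lineage strings into a pandas DataFrame with one column per rank.
    `df` is a pandas Series of semicolon-separated lineages, e.g. column 1 of bac120_taxonomy_<release>.tsv.
    Get the file from https://data.gtdb.ecogenomic.org/releases/release202/202.0/bac120_taxonomy_r202.tsv
    """
    # Split each lineage on ";" and keep the original index (one row per genome)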
    df_tax = pd.DataFrame(df.apply(lambda x: x.split(";")).to_dict()).T
    df_tax = df_tax.rename(columns={0:"domain",
                          1:"phylum",
                          2:"class",
                          3:"order",
                          4:"family",
                          5:"genus",
                          6:"species"})
    return df_tax

def format_bgcflow(df, outfile):
    """
    Format a taxonomy DataFrame (output of get_gtdb_tax, indexed by assembly accession)
    as a BGCflow samples.csv, write it to outfile, and return the formatted DataFrame.
    """
    filtered_df = df.copy()
    # Strip the rank prefixes ("d__", "p__", ...) from every column
    for c in filtered_df.columns:
        filtered_df.loc[:, c] = filtered_df.loc[:, c].apply(lambda x: x.split("__")[-1])
    # The index holds the assembly accessions; source is "ncbi" for every row
    filtered_df.insert(0, "genome_id", list(filtered_df.index))
    filtered_df.insert(1, "source", "ncbi")
    # organism keeps the full species name; species keeps only its last word
    filtered_df.insert(2, "organism", filtered_df.loc[:, "species"])
    filtered_df.loc[:, "species"] = filtered_df.loc[:, "species"].apply(lambda x: x.split(" ")[-1])
    # Columns left empty
    filtered_df.loc[:, "strain"] = np.nan
    filtered_df.loc[:, "closest_placement_reference"] = np.nan
    filtered_df.to_csv(outfile, index=False)
    return filtered_df
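
As a quick illustration of what the two helpers do to a lineage string, here is a minimal sketch on a single made-up record. The accession `GCA_000000000.1` is a placeholder, and `str.split(..., expand=True)` is used as a vectorized alternative to the `apply`-based code above, not as the notebook's own implementation.

```python
import pandas as pd

RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]

# Toy input: index = assembly accession (placeholder), value = GTDB lineage string
lineages = pd.Series(
    {
        "GCA_000000000.1": "d__Bacteria;p__Nitrospirota;c__Nitrospiria;"
        "o__Nitrospirales;f__Nitrospiraceae;g__Nitrospira;s__Nitrospira sp002254365"
    }
)

# Split the lineage into one column per rank (what get_gtdb_tax produces)
tax = lineages.str.split(";", expand=True)
tax.columns = RANKS

# Drop the rank prefixes "d__", "p__", ... (the first step of format_bgcflow)
stripped = tax.apply(lambda col: col.str.split("__").str[-1])

print(tax.loc["GCA_000000000.1", "family"])
print(stripped.loc["GCA_000000000.1", "species"])
```

The prefixed table is the form used for filtering by rank later on, while the stripped form is what ends up in the samples file.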

In [3]:
# Set paths
tables = "../tables"
data = "../data"

# Getting MAGs information from Singleton 2021 <a class="anchor" id="second-bullet">
## Data Cleaning <a class="anchor" id="data-cleaning-mags">
First, we get the MAG metadata from the paper's supplementary material and map it to NCBI assembly IDs. Download the metadata with:

In [4]:
! wget -P {data} https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-021-22203-2/MediaObjects/41467_2021_22203_MOESM5_ESM.xlsx -nc

File ‘../data/41467_2021_22203_MOESM5_ESM.xlsx’ already there; not retrieving.



In [5]:
# Load the supplementary table
# pd.read_excel needs the optional dependency openpyxl to read .xlsx files.
# If it raises "ImportError: Missing optional dependency 'openpyxl'", install it with: mamba install openpyxl
df_MAGs = pd.read_excel(os.path.join(data, "41467_2021_22203_MOESM5_ESM.xlsx"), skiprows=1)
df_MAGs.head(1)

Unnamed: 0,MAG,NCBI_accession_number,NumContigs,TotBP,MaxContigBP,AvContigBP,HQMAG,HQdRep,HQdRep99ANI,HQSpRep,...,SILVA138Tax,SIdentity,SAlnLgth,ilmcov,npcov,max_abund,mm27f,mm534r,ttl_polymorphic_rate,polymut_rate
0,AalE_18-Q3-R2-46_BAT3C.1_cln.fa,CP064957,1,1451127,1451127,1451127.0,1083HQ,183_0,183_0,SPREPHQ,...,JX983997.1.1289 Bacteria;Dependentiae;Babeliae...,89.9,1293.0,10.876333,10.614671,0.17912,1.0,0.0,0.00164,0.00052


In [6]:
# What columns are available?
df_MAGs.columns

Index(['MAG', 'NCBI_accession_number', 'NumContigs', 'TotBP', 'MaxContigBP',
       'AvContigBP', 'HQMAG', 'HQdRep', 'HQdRep99ANI', 'HQSpRep', 'Comp',
       'Cont', 'StrHet', 'Circ', 'FLSSU', 'FLLSU', 'GTDBTax', 'MiDAS3_7Tax',
       'MIdentity', 'MAlnLgth', 'SILVA138Tax', 'SIdentity', 'SAlnLgth',
       'ilmcov', 'npcov', 'max_abund', 'mm27f', 'mm534r',
       'ttl_polymorphic_rate', 'polymut_rate'],
      dtype='object')

## Map to BioProject</a><a class="anchor" id="map-to-bioproject">
The table above contains NCBI accession numbers (for example `CP064957`), but we want the assembly IDs (RefSeq or GenBank). These can be obtained from the BioProject: the easiest way is to go to https://www.ncbi.nlm.nih.gov/bioproject/prjna629478 and download the assembly details, saved here as `PRJNA629478_AssemblyDetails.txt` in the data directory.

In [7]:
df_bioproject = pd.read_csv(os.path.join(data, "PRJNA629478_AssemblyDetails.txt"), skiprows=1, sep="\t", index_col=False)
df_bioproject.head(1)

Unnamed: 0,# Assembly,Level,WGS,Chrs,BioSample,Isolate,Taxonomy
0,GCA_016722025.1,Contig,JADKGZ000000000,undefined,SAMN16426458,Skiv_18-Q3-R9-52_BAT3C.106,Acidimicrobiaceae bacterium


In [8]:
df_bioproject.shape

(1083, 7)

The two tables can be joined on the isolate name: the `Isolate` column in df_bioproject matches the `MAG` column in df_MAGs once the `.fa` extension is removed:

In [9]:
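# Remove only a trailing ".fa"; str.strip(".fa") would strip any of ".", "f", "a" from both ends
df_MAGs.loc[:, "Isolate"] = df_MAGs.MAG.str.replace(r"\.fa$", "", regex=True)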
df_MAGs.head(1)

Unnamed: 0,MAG,NCBI_accession_number,NumContigs,TotBP,MaxContigBP,AvContigBP,HQMAG,HQdRep,HQdRep99ANI,HQSpRep,...,SIdentity,SAlnLgth,ilmcov,npcov,max_abund,mm27f,mm534r,ttl_polymorphic_rate,polymut_rate,Isolate
0,AalE_18-Q3-R2-46_BAT3C.1_cln.fa,CP064957,1,1451127,1451127,1451127.0,1083HQ,183_0,183_0,SPREPHQ,...,89.9,1293.0,10.876333,10.614671,0.17912,1.0,0.0,0.00164,0.00052,AalE_18-Q3-R2-46_BAT3C.1_cln
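
The mapping depends on turning a file name such as `AalE_18-Q3-R2-46_BAT3C.1_cln.fa` into an isolate name. The sketch below compares `str.strip` with an anchored regular expression on two file names; the second name is invented to show the edge case and does not come from the dataset.

```python
import pandas as pd

# First name is taken from the table above; the second one is made up
mags = pd.Series(["AalE_18-Q3-R2-46_BAT3C.1_cln.fa", "sample_alfa.fa"])

# str.strip(".fa") treats its argument as a set of characters and removes
# any of ".", "f", "a" from both ends of the string
stripped = [m.strip(".fa") for m in mags]

# An anchored regex removes exactly one trailing ".fa"
suffix_removed = mags.str.replace(r"\.fa$", "", regex=True).tolist()

print(stripped)
print(suffix_removed)
```

Both approaches agree on the first name, but only the suffix-aware version keeps the invented second name intact.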


In [10]:
df_MAGs_bioproject = pd.merge(df_MAGs, df_bioproject, on='Isolate')
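
`pd.merge` defaults to an inner join, so MAGs without a matching isolate would be dropped without a message. A minimal sketch of how unmatched rows could be surfaced, using toy stand-ins for the two tables (all values below are invented):

```python
import pandas as pd

# Toy stand-ins for df_MAGs and df_bioproject
left = pd.DataFrame({"Isolate": ["mag_1", "mag_2", "mag_3"], "Comp": [98.1, 95.4, 91.0]})
right = pd.DataFrame(
    {"Isolate": ["mag_1", "mag_2"], "# Assembly": ["GCA_000000001.1", "GCA_000000002.1"]}
)

# how="left" keeps every MAG, indicator=True adds a "_merge" column, and
# validate raises MergeError if an Isolate name is duplicated on either side
merged = pd.merge(
    left, right, on="Isolate", how="left", indicator=True, validate="one_to_one"
)

unmatched = merged.loc[merged["_merge"] == "left_only", "Isolate"].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every MAG found its assembly record.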

In [11]:
# Inspect the columns of the merged table
df_MAGs_bioproject.columns

Index(['MAG', 'NCBI_accession_number', 'NumContigs', 'TotBP', 'MaxContigBP',
       'AvContigBP', 'HQMAG', 'HQdRep', 'HQdRep99ANI', 'HQSpRep', 'Comp',
       'Cont', 'StrHet', 'Circ', 'FLSSU', 'FLLSU', 'GTDBTax', 'MiDAS3_7Tax',
       'MIdentity', 'MAlnLgth', 'SILVA138Tax', 'SIdentity', 'SAlnLgth',
       'ilmcov', 'npcov', 'max_abund', 'mm27f', 'mm534r',
       'ttl_polymorphic_rate', 'polymut_rate', 'Isolate', '# Assembly',
       'Level', 'WGS', 'Chrs', 'BioSample', 'Taxonomy'],
      dtype='object')

In [12]:
# Use the assembly accession as genome_id and as the index, then save the merged table
df_MAGs_bioproject = df_MAGs_bioproject.rename(columns={'# Assembly' : 'genome_id'})
df_MAGs_bioproject = df_MAGs_bioproject.set_index("genome_id", drop=False)
df_MAGs_bioproject.to_csv(os.path.join(tables, "df_MAGs_bioproject.csv"), index=False)
df_MAGs_bioproject.head(1)

genome_id,MAG,NCBI_accession_number,NumContigs,TotBP,MaxContigBP,AvContigBP,HQMAG,HQdRep,HQdRep99ANI,HQSpRep,...,mm534r,ttl_polymorphic_rate,polymut_rate,Isolate,genome_id,Level,WGS,Chrs,BioSample,Taxonomy
GCA_016699045.1,AalE_18-Q3-R2-46_BAT3C.1_cln.fa,CP064957,1,1451127,1451127,1451127.0,1083HQ,183_0,183_0,SPREPHQ,...,0.0,0.00164,0.00052,AalE_18-Q3-R2-46_BAT3C.1_cln,GCA_016699045.1,Complete genome,undefined,1,SAMN16425488,bacterium


In [13]:
# Format into bgcflow input table
df_MAGs_bgcflow = format_bgcflow(get_gtdb_tax(df_MAGs_bioproject["GTDBTax"]), os.path.join(tables, "df_MAGs_bgcflow.csv"))
df_MAGs_bgcflow.head(1)

Unnamed: 0,genome_id,source,organism,domain,phylum,class,order,family,genus,species,strain,closest_placement_reference
GCA_016699045.1,GCA_016699045.1,ncbi,,Bacteria,Dependentiae,Babeliae,Babeliales,RVW-14,,,,


## Filtering for Myxococcales </a><a class="anchor" id="myxo-filter-mags"></a>

In [14]:
df_p__Myxococcota_MAGs = df_MAGs_bioproject[df_MAGs_bioproject.loc[:, "GTDBTax"].str.contains("p__Myxococcota")]
df_p__Myxococcota_MAGs.shape

(45, 37)

In [15]:
df_p__Myxococcota_MAGs.loc[:, "GTDBTax"].unique()

array(['d__Bacteria;p__Myxococcota;c__Polyangia;o__Polyangiales;f__Polyangiaceae;g__;s__',
       'd__Bacteria;p__Myxococcota;c__Polyangia;o__Polyangiales;f__Ga0077539;g__;s__',
       'd__Bacteria;p__Myxococcota;c__Polyangia;o__Polyangiales;f__SG8-38;g__;s__',
       'd__Bacteria;p__Myxococcota;c__UBA727;o__;f__;g__;s__',
       'd__Bacteria;p__Myxococcota;c__Polyangia;o__GCA-2747355;f__GCA-2747355;g__;s__',
       'd__Bacteria;p__Myxococcota;c__Polyangia;o__Nannocystales;f__Nannocystaceae;g__Ga0077550;s__',
       'd__Bacteria;p__Myxococcota;c__Polyangia;o__Haliangiales;f__Haliangiaceae;g__UBA2376;s__',
       'd__Bacteria;p__Myxococcota;c__UBA796;o__UBA796;f__;g__;s__',
       'd__Bacteria;p__Myxococcota;c__Polyangia;o__Haliangiales;f__Haliangiaceae;g__;s__',
       'd__Bacteria;p__Myxococcota;c__Polyangia;o__Polyangiales;f__;g__;s__',
       'd__Bacteria;p__Myxococcota;c__Polyangia;o__Nannocystales;f__Nannocystaceae;g__Nannocystis;s__',
       'd__Bacteria;p__Myxococcota;c__Myxococ

In [16]:
format_bgcflow(get_gtdb_tax(df_p__Myxococcota_MAGs["GTDBTax"]), os.path.join(tables, "p__Myxococcota_MAGs.csv")).head(1)

Unnamed: 0,genome_id,source,organism,domain,phylum,class,order,family,genus,species,strain,closest_placement_reference
GCA_016704525.1,GCA_016704525.1,ncbi,,Bacteria,Myxococcota,Polyangia,Polyangiales,Polyangiaceae,,,,


## Filtering for Nitrospira </a><a class="anchor" id="nitro-filter-mags"></a>

In [17]:
df_f__Nitrospiraceae_MAGs = df_MAGs_bioproject[df_MAGs_bioproject.loc[:, "GTDBTax"].str.contains("f__Nitrospiraceae")]
df_f__Nitrospiraceae_MAGs.shape

(8, 37)

In [18]:
df_f__Nitrospiraceae_MAGs.loc[:, "GTDBTax"].unique()

array(['d__Bacteria;p__Nitrospirota;c__Nitrospiria;o__Nitrospirales;f__Nitrospiraceae;g__Nitrospira_A;s__Nitrospira_A sp900170025',
       'd__Bacteria;p__Nitrospirota;c__Nitrospiria;o__Nitrospirales;f__Nitrospiraceae;g__Nitrospira;s__Nitrospira sp002254365',
       'd__Bacteria;p__Nitrospirota;c__Nitrospiria;o__Nitrospirales;f__Nitrospiraceae;g__Nitrospira;s__'],
      dtype=object)

In [19]:
df_f__Nitrospiraceae_MAGs = format_bgcflow(get_gtdb_tax(df_f__Nitrospiraceae_MAGs["GTDBTax"]), os.path.join(tables, "f__Nitrospiraceae_MAGs.csv"))

# Getting genome assembly accession id from GTDB
If you haven't done so already, download the bacterial taxonomy table from GTDB release 202:

In [20]:
# Download bacteria taxonomy from release 202
! wget -P {data} https://data.gtdb.ecogenomic.org/releases/release202/202.0/bac120_taxonomy_r202.tsv -nc

File ‘../data/bac120_taxonomy_r202.tsv’ already there; not retrieving.



In [21]:
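df = pd.read_csv(os.path.join(data, "bac120_taxonomy_r202.tsv"), sep="\t", header=None, index_col=0)
# Drop the GTDB database prefix (e.g. "RS_" or "GB_") to get plain NCBI accessions
df.index = [i.split("_",1)[-1] for i in df.index]
# Split the lineage strings (column 1) into one column per rank
df_tax = get_gtdb_tax(df[1])
df_tax.head(1)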

Unnamed: 0,domain,phylum,class,order,family,genus,species
GCF_014075335.1,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia flexneri
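
GTDB prefixes each accession with its source database, `RS_` for RefSeq and `GB_` for GenBank, which is why the index is split on the first underscore. A short sketch of that step on two accessions that appear in this notebook (the prefixed forms are written out here by hand for illustration):

```python
# Accessions as they appear in the first column of the GTDB taxonomy file
gtdb_ids = ["RS_GCF_014075335.1", "GB_GCA_001918505.1"]

# maxsplit=1 splits only on the first underscore, so the underscore inside
# "GCF_014075335.1" is preserved
ncbi_ids = [i.split("_", 1)[-1] for i in gtdb_ids]

print(ncbi_ids)
```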


### Filtering for Myxococcales

In [22]:
# filter for Myxococcales
df_o__Myxococcales = df_tax[df_tax.loc[:, "order"] == "o__Myxococcales"]
df_o__Myxococcales.shape

(150, 7)

In [23]:
# Generate sample file for BGCflow
df_o__Myxococcales_gtdb = format_bgcflow(df_o__Myxococcales, os.path.join(tables, "o__Myxococcales_gtdb.csv"))
df_o__Myxococcales_gtdb

Unnamed: 0,genome_id,source,organism,domain,phylum,class,order,family,genus,species,strain,closest_placement_reference
GCF_012985225.1,GCF_012985225.1,ncbi,Corallococcus exiguus,Bacteria,Myxococcota,Myxococcia,Myxococcales,Myxococcaceae,Corallococcus,exiguus,,
GCF_013155175.1,GCF_013155175.1,ncbi,Corallococcus exiguus,Bacteria,Myxococcota,Myxococcia,Myxococcales,Myxococcaceae,Corallococcus,exiguus,,
GCF_003668935.1,GCF_003668935.1,ncbi,Corallococcus exiguus,Bacteria,Myxococcota,Myxococcia,Myxococcales,Myxococcaceae,Corallococcus,exiguus,,
GCF_003668965.1,GCF_003668965.1,ncbi,Corallococcus exiguus,Bacteria,Myxococcota,Myxococcia,Myxococcales,Myxococcaceae,Corallococcus,exiguus,,
GCA_003612255.1,GCA_003612255.1,ncbi,Corallococcus exiguus,Bacteria,Myxococcota,Myxococcia,Myxococcales,Myxococcaceae,Corallococcus,exiguus,,
...,...,...,...,...,...,...,...,...,...,...,...,...
GCF_012933655.1,GCF_012933655.1,ncbi,Myxococcus fallax,Bacteria,Myxococcota,Myxococcia,Myxococcales,Myxococcaceae,Myxococcus,fallax,,
GCA_009781495.1,GCA_009781495.1,ncbi,WRKR01 sp009781495,Bacteria,Myxococcota,Myxococcia,Myxococcales,Myxococcaceae,WRKR01,sp009781495,,
GCF_003611695.1,GCF_003611695.1,ncbi,Corallococcus carmarthensis,Bacteria,Myxococcota,Myxococcia,Myxococcales,Myxococcaceae,Corallococcus,carmarthensis,,
GCA_001768075.1,GCA_001768075.1,ncbi,R267 sp001768075,Bacteria,Myxococcota,Myxococcia,Myxococcales,Anaeromyxobacteraceae,R267,sp001768075,,


### Filtering for Nitrospira

In [24]:
g = df_tax.loc[:, "genus"].unique()
[q for q in g if q.startswith("g__Nitrospira")]

['g__Nitrospira_A',
 'g__Nitrospira_F',
 'g__Nitrospira_D',
 'g__Nitrospira_C',
 'g__Nitrospira_E']

In [25]:
f = df_tax.loc[:, "family"].unique()
[q for q in f if q.startswith("f__Nitrospira")]

['f__Nitrospiraceae']
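
The two cells above explore names with `startswith`. Once a taxon name is settled, an exact comparison on the parsed rank column avoids picking up suffixed placeholder taxa. A toy sketch of the difference (the index labels are invented, and the genus list is only an example):

```python
import pandas as pd

tax = pd.DataFrame(
    {"genus": ["g__Nitrospira", "g__Nitrospira_A", "g__Nitrospira_F", "g__Escherichia"]},
    index=["acc_1", "acc_2", "acc_3", "acc_4"],  # placeholder accessions
)

# Prefix match: also returns g__Nitrospira_A and g__Nitrospira_F
prefix_hits = tax[tax["genus"].str.startswith("g__Nitrospira")]

# Exact match on the rank column: returns only g__Nitrospira
exact_hits = tax[tax["genus"] == "g__Nitrospira"]

print(len(prefix_hits), len(exact_hits))
```

The same consideration applies to `str.contains` on the full lineage string, which matches any substring regardless of rank.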

In [26]:
df_f__Nitrospiraceae = df_tax[df_tax.loc[:, "family"] == "f__Nitrospiraceae"]
df_f__Nitrospiraceae.shape

(129, 7)

In [27]:
df_f__Nitrospiraceae_gtdb = format_bgcflow(df_f__Nitrospiraceae, os.path.join(tables, "f__Nitrospiraceae_gtdb.csv"))

## Merge Tables

In [28]:
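df_f__Nitrospiraceae_all = pd.concat([df_f__Nitrospiraceae_gtdb, df_f__Nitrospiraceae_MAGs])
# genome_id is already a column, so skip the index (consistent with format_bgcflow)
df_f__Nitrospiraceae_all.to_csv(os.path.join(tables, "f__Nitrospiraceae_all.csv"), index=False)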
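
`pd.concat` stacks the two tables without checking for repeated accessions. If a MAG assembly were also present in the GTDB table, it would appear twice in the samples file. A minimal sketch of a duplicate check, using placeholder accessions rather than real data:

```python
import pandas as pd

# Toy stand-ins for the GTDB and MAG sample tables; "GCA_2" is in both
gtdb = pd.DataFrame(
    {"genome_id": ["GCA_1", "GCA_2"], "source": ["ncbi", "ncbi"]}
).set_index("genome_id", drop=False)
mags = pd.DataFrame(
    {"genome_id": ["GCA_2", "GCA_3"], "source": ["ncbi", "ncbi"]}
).set_index("genome_id", drop=False)

combined = pd.concat([gtdb, mags])

# Index labels that occur more than once
dupes = combined.index[combined.index.duplicated()].tolist()

# Keep the first occurrence of each accession
deduplicated = combined[~combined.index.duplicated(keep="first")]

print(dupes, deduplicated.shape)
```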

In [29]:
df_f__Nitrospiraceae_all

Unnamed: 0,genome_id,source,organism,domain,phylum,class,order,family,genus,species,strain,closest_placement_reference
GCA_001918505.1,GCA_001918505.1,ncbi,40CM-3-62-11 sp001914955,Bacteria,Nitrospirota,Nitrospiria,Nitrospirales,Nitrospiraceae,40CM-3-62-11,sp001914955,,
GCA_001920515.1,GCA_001920515.1,ncbi,40CM-3-62-11 sp001914955,Bacteria,Nitrospirota,Nitrospiria,Nitrospirales,Nitrospiraceae,40CM-3-62-11,sp001914955,,
GCA_001914545.1,GCA_001914545.1,ncbi,40CM-3-62-11 sp001914955,Bacteria,Nitrospirota,Nitrospiria,Nitrospirales,Nitrospiraceae,40CM-3-62-11,sp001914955,,
GCA_005877575.1,GCA_005877575.1,ncbi,40CM-3-62-11 sp001914955,Bacteria,Nitrospirota,Nitrospiria,Nitrospirales,Nitrospiraceae,40CM-3-62-11,sp001914955,,
GCA_005877605.1,GCA_005877605.1,ncbi,40CM-3-62-11 sp001914955,Bacteria,Nitrospirota,Nitrospiria,Nitrospirales,Nitrospiraceae,40CM-3-62-11,sp001914955,,
...,...,...,...,...,...,...,...,...,...,...,...,...
GCA_016711755.1,GCA_016711755.1,ncbi,Nitrospira_A sp900170025,Bacteria,Nitrospirota,Nitrospiria,Nitrospirales,Nitrospiraceae,Nitrospira_A,sp900170025,,
GCA_016715825.1,GCA_016715825.1,ncbi,,Bacteria,Nitrospirota,Nitrospiria,Nitrospirales,Nitrospiraceae,Nitrospira,,,
GCA_016719745.1,GCA_016719745.1,ncbi,Nitrospira_A sp900170025,Bacteria,Nitrospirota,Nitrospiria,Nitrospirales,Nitrospiraceae,Nitrospira_A,sp900170025,,
GCA_016722315.1,GCA_016722315.1,ncbi,Nitrospira_A sp900170025,Bacteria,Nitrospirota,Nitrospiria,Nitrospirales,Nitrospiraceae,Nitrospira_A,sp900170025,,


## New Project - Nitrospirota

In [30]:
p__Nitrospirota_MAGs = df_MAGs_bioproject[df_MAGs_bioproject.loc[:, "GTDBTax"].str.contains("p__Nitrospirota")]
p__Nitrospirota_MAGs.shape

(8, 37)

In [31]:
p__Nitrospirota_GTDB = df_tax[df_tax.loc[:, "phylum"] == "p__Nitrospirota"]
p__Nitrospirota_GTDB.shape

(320, 7)

In [32]:
df_p__Nitrospirota_gtdb = format_bgcflow(p__Nitrospirota_GTDB, os.path.join(tables, "p__Nitrospirota_gtdb.csv"))

In [33]:
df_p__Nitrospirota_MAGs = format_bgcflow(get_gtdb_tax(p__Nitrospirota_MAGs["GTDBTax"]), os.path.join(tables, "p__Nitrospirota_MAGs.csv"))

In [34]:
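df_p__Nitrospirota_all = pd.concat([df_p__Nitrospirota_gtdb, df_p__Nitrospirota_MAGs])
# genome_id is already a column, so skip the index (consistent with format_bgcflow)
df_p__Nitrospirota_all.to_csv(os.path.join(tables, "p__Nitrospirota_all.csv"), index=False)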

## New Project - HQ Nitrospiraceae

In [68]:
df_raw_HQ_Nitrospiraceae = pd.read_csv("../data/assembly_result_Nitrospiraceae_hq.txt", sep="\t")
df_raw_HQ_Nitrospiraceae = df_raw_HQ_Nitrospiraceae.fillna("")

In [72]:
# Sort the NCBI assemblies into: already in our table, in GTDB, or not in GTDB

# `x in Series` tests the index, not the values, so build explicit sets of IDs
known_ids = set(df_p__Nitrospirota_all.genome_id)
gtdb_ids = set(df_tax.index)

ncbi_in_GTDB = []
ncbi_not_in_GTDB = []

for i in df_raw_HQ_Nitrospiraceae.index:
    refseq = df_raw_HQ_Nitrospiraceae.loc[i, "RefSeq Assembly ID (Accession.version)"]
    genbank = df_raw_HQ_Nitrospiraceae.loc[i, "GenBank Assembly ID (Accession.version)"]

    # Already part of the Nitrospirota sample table
    if refseq in known_ids:
        print(refseq, "refseq in dataframe")
    elif genbank in known_ids:
        print(genbank, "genbank in dataframe")

    # Present in GTDB but not yet in the sample table
    elif refseq in gtdb_ids:
        print(refseq, "refseq in GTDB:", f"adding {refseq} to list...")
        ncbi_in_GTDB.append(refseq)
    elif genbank in gtdb_ids:
        print(genbank, "genbank in GTDB:", f"adding {genbank} to list...")
        ncbi_in_GTDB.append(genbank)

    # Not in GTDB: prefer the RefSeq accession if there is one
    else:
        if refseq != "":
            print(refseq, "not found:", f"adding {refseq} to list...")
            ncbi_not_in_GTDB.append(refseq)
        else:
            print(genbank, "not found:", f"adding {genbank} to list...")
            ncbi_not_in_GTDB.append(genbank)

ncbi_not_in_GTDB

GCF_011405515.1 refseq in dataframe
GCF_000695975.1 refseq in GTDB: adding GCF_000695975.1 to list...
GCF_001458695.1 refseq in dataframe
GCF_000196815.1 refseq in dataframe
GCF_000284315.1 refseq in GTDB: adding GCF_000284315.1 to list...
GCA_014058405.1 genbank in dataframe
GCF_000299235.1 refseq in GTDB: adding GCF_000299235.1 to list...
GCA_015709715.1 not found: adding GCA_015709715.1 to list...
GCA_020347665.1 not found: adding GCA_020347665.1 to list...
GCA_020356065.1 not found: adding GCA_020356065.1 to list...
GCA_020346765.1 not found: adding GCA_020346765.1 to list...
GCA_020354155.1 not found: adding GCA_020354155.1 to list...
GCA_020347785.1 not found: adding GCA_020347785.1 to list...
GCA_020349825.1 not found: adding GCA_020349825.1 to list...
GCF_001186405.1 refseq in GTDB: adding GCF_001186405.1 to list...
GCF_001273775.1 refseq in dataframe
GCF_900169565.1 refseq in dataframe


['GCA_015709715.1',
 'GCA_020347665.1',
 'GCA_020356065.1',
 'GCA_020346765.1',
 'GCA_020354155.1',
 'GCA_020347785.1',
 'GCA_020349825.1']
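
One pandas detail matters for membership checks like the ones in this cell: the `in` operator on a Series tests the index labels, not the values. It gives the expected answer here only when the index itself holds the accessions. A small sketch with a default integer index, using two accessions from the output above:

```python
import pandas as pd

# Default RangeIndex (0, 1); the accessions are the values
ids = pd.Series(["GCF_011405515.1", "GCA_014058405.1"])

in_series = "GCF_011405515.1" in ids           # looks at index labels 0 and 1
in_values = "GCF_011405515.1" in ids.values    # looks at the values
in_set = "GCF_011405515.1" in set(ids)         # set of values, fast lookups
label_in_series = 0 in ids                     # an index label

print(in_series, in_values, in_set, label_in_series)
```

Building a set once and testing against it makes the intent explicit and does not depend on how the DataFrame happens to be indexed.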

In [82]:
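# Assemblies found in GTDB get their taxonomy from df_tax
p__Nitrospiraceae_NCBI_HQ = df_tax.loc[ncbi_in_GTDB, :]
df_p__Nitrospiraceae_NCBI_HQ = format_bgcflow(p__Nitrospiraceae_NCBI_HQ, os.path.join(tables, "f__Nitrospiraceae_NCBI_HQ.csv"))
# Assemblies missing from GTDB are appended with genome_id and source only; taxonomy stays empty
for i in ncbi_not_in_GTDB:
    df_p__Nitrospiraceae_NCBI_HQ.loc[i, ["genome_id", "source"]] = [i, "ncbi"]
# Overwrite the file written by format_bgcflow so that it includes the appended rows
df_p__Nitrospiraceae_NCBI_HQ.to_csv(os.path.join(tables, "f__Nitrospiraceae_NCBI_HQ.csv"), index=False)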

In [83]:
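df_p__Nitrospiraceae_HQ_all = pd.concat([df_p__Nitrospiraceae_NCBI_HQ, df_p__Nitrospirota_MAGs])
# genome_id is already a column, so skip the index (consistent with format_bgcflow)
df_p__Nitrospiraceae_HQ_all.to_csv(os.path.join(tables, "f__Nitrospiraceae_HQ_all.csv"), index=False)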

## New Project - Myxococcota

In [35]:
p__Myxococcota_MAGs = df_MAGs_bioproject[df_MAGs_bioproject.loc[:, "GTDBTax"].str.contains("p__Myxococcota")]
p__Myxococcota_MAGs.shape

(45, 37)

In [36]:
p__Myxococcota_GTDB = df_tax[df_tax.loc[:, "phylum"] == "p__Myxococcota"]
p__Myxococcota_GTDB.shape

(538, 7)

In [37]:
df_p__Myxococcota_gtdb = format_bgcflow(p__Myxococcota_GTDB, os.path.join(tables, "p__Myxococcota_gtdb.csv"))

In [38]:
df_p__Myxococcota_MAGs = format_bgcflow(get_gtdb_tax(p__Myxococcota_MAGs["GTDBTax"]), os.path.join(tables, "p__Myxococcota_MAGs.csv"))

In [39]:
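df_p__Myxococcota_all = pd.concat([df_p__Myxococcota_gtdb, df_p__Myxococcota_MAGs])
# genome_id is already a column, so skip the index (consistent with format_bgcflow)
df_p__Myxococcota_all.to_csv(os.path.join(tables, "p__Myxococcota_all.csv"), index=False)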
df_p__Myxococcota_all.shape

(583, 12)
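
Before handing a table to the workflow, a quick sanity check can catch missing columns or repeated accessions. The sketch below is only an illustration: the `REQUIRED` list is taken from the columns that `format_bgcflow` writes in this notebook, not from BGCflow's own specification, and `check_samples` is a hypothetical helper.

```python
import pandas as pd

# Assumption: the columns written by format_bgcflow in this notebook
REQUIRED = [
    "genome_id", "source", "organism", "genus", "species",
    "strain", "closest_placement_reference",
]

def check_samples(df):
    """Return a list of problems found in a samples table (empty if none)."""
    problems = []
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if "genome_id" in df.columns and df["genome_id"].duplicated().any():
        problems.append("duplicated genome_id values")
    return problems

# Toy table with placeholder accessions: lacks most columns and repeats an ID
samples = pd.DataFrame({"genome_id": ["GCA_1", "GCA_1"], "source": ["ncbi", "ncbi"]})
problems = check_samples(samples)
print(problems)
```

Calling `check_samples(df_p__Myxococcota_all)` would be one way to apply it to the table built above.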