# Get KO counts and md5 IDs from MG RAST's API 


###  Obtains KO gene abundance tables, KO-md5 counts for later DNA sequence queries / taxonomic assignments

#### API calls and JSON parser produce tables for each sample, then tables merged using reduce
This script produces a KO abundance table across a group of MG RAST metagenomes by downloading and parsing an individual table for each sample. This works well only IF the asynchronous GET request has already returned the table to the server, which takes time. A separate script requests annotations and returns their status, but it needs work. The function below can be easily modified if other annotations or cutoffs are needed: edit the text in reqMG to set the parameters requested from MG RAST. For available features see http://api.metagenomics.anl.gov/api.html
Note that MG RAST produces tables of KO counts per cluster, which need to be aggregated by KO (using groupby); the API_get_KO_tables function includes this step. It writes a single KO table per sample, to avoid crashes, and the tables are then merged using reduce. KO to md5 count maps are also written for each sample in this version, and then collated.
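
The per-cluster aggregation described above can be illustrated on a toy table. This is only a sketch: the sample ID `mgm0000001.3` and all counts are made up, and the real function builds the equivalent table from the API response.

```python
import pandas as pd

# Toy per-cluster rows: the same KO can appear in several clusters
# (hypothetical sample ID and counts, for illustration only)
clusters = pd.DataFrame({
    "KO": ["K00001", "K00002", "K00001", "K00003", "K00002"],
    "mgm0000001.3": [5, 2, 7, 1, 4],
})

# Sum counts over clusters so that each KO appears once for the sample
ko_counts = clusters.groupby("KO").sum()
print(ko_counts)
```

After the groupby, `KO` becomes the index, which is why the per-sample files are written with the index included.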

### Revisions:
1) Now accounts for multiple KOs per MD5 by counting that MD5's reads once for each KO (i.e. double counting them). The previous DB was misaligned because multiple KOs were unlisted and then merged by index with the counts.

2) Now outputs a separate file to track MD5 IDs from MG RAST's RefSeq DB. Each file holds KO, count, MD5 and sample ID (mgm.). These files will be used for later taxonomic assignment of reads, using the MD5 info.
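
To make the double counting in revision 1 concrete, here is a small sketch on made-up data. It uses `DataFrame.explode`, which is a swapped-in shortcut and not the column-by-column append used in the function below; the MD5 strings and counts are hypothetical.

```python
import pandas as pd

# One row per MD5; some MD5s carry more than one KO (hypothetical values)
toy = pd.DataFrame({
    "MD5": ["aaa", "bbb", "ccc"],
    "count": [10, 3, 7],
    "KO": [["K00001"], ["K00002", "K00003"], ["K00004"]],
})

# One row per (MD5, KO) pair: the count of "bbb" is repeated for each of its KOs,
# and count and MD5 stay aligned with the KO they belong to
long = toy.explode("KO", ignore_index=True)
print(long)
```

Because the count stays attached to its row while the KO list is expanded, this avoids the index misalignment described above.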

## Note on usage

While the functions in this script can be used by anyone, I use my own sample lists, and remap samples from their MG-RAST names using another file.

Importantly, using this script on my data (130 samples) created an intermediate 4.5 GB file, so others are cautioned to start with smaller test runs.

## Function: API_get_KO_tables
Get KO count and KO - MD5 count tables  (to apply to a list of samples)

In [16]:
def API_get_KO_tables(file):
    import requests  # to make the GET request 
    import json  # to parse the JSON response to a Python dictionary
    import csv  # to write our data to a CSV
    import pandas as pd # to see our CSV
    import numpy as np
    import itertools
    import re
    import string
    
    ###############################
    ### api CALL 
    # Get url for API from variable                                                      # NO AUTH, for GIT -- 
    reqMG = 'http://api.metagenomics.anl.gov/profile/' + file + '?source=KO&format=biom' #&auth=''

    # API get KO data in JSON fmt, extract data
    response = requests.get(reqMG)
    response_text = response.text
    data = json.loads(response_text)
    
    # Get data from the nested JSON layers
    dat2=data['data']                                   # outer 'data' layer
    dat3=dat2['data']                                   # inner 'data' layer: list of rows

    ###############################
    ### Parse JSON
    # get MD5s  - unique IDs
    MD5list=[x[0] for x in dat3]                        # extract list
    MD5 = pd.DataFrame(MD5list)                         # make DF
    MD5.columns= ['MD5']                                # rename col                     
    
    # get counts
    counts_List=[x[1] for x in dat3]                    # extract list
    counts = pd.DataFrame(counts_List)                  # make DF 
    counts.columns = ['count']                          # rename col                    
    #MGID=[file]                                        # get file name for col
    #counts.columns = MGID                              # rename col                    
    
    ### get KO table, format given multiple KO IDs for some MD5s -- slicing later requires columns renamed as non-numeric 
    KOlist=[x[6] for x in dat3]                         # extract list of KOs
    Ko = pd.DataFrame(KOlist)                           # Make list a DF, fixes needed below, but unlisting BAD
    n = Ko.shape[1]                                     # Get KO shape
    newcols = list(string.ascii_lowercase[0:n])         # Get letter list of dim n 
    Ko.columns = newcols                                # Rename columns by letters
    KO = Ko.fillna('no')                                # Replace None/NaN padding generated by KO List -> DF with 'no'
    
    ### Combine cleaned KO, Counts, MD5s (per cluster)
    KO_countMD5=pd.concat([KO, counts, MD5], axis=1, join='inner')   # Join elements together by INDEX   # KO_Counts.head(5)

    ################################
    ### Collect and collate multiple KO hits per MD5 (unique Refseq gene, linked to taxon)
    # Get 1st level KOs
    L1 = ['a','count','MD5']                            # Get columns to keep
    KO1 = KO_countMD5[L1]                               # Keep only columns a = KO first call
    KO1.columns = ['KO','count','MD5']                  # Rename columns                  # KO1.head() #KO1.shape
    
    ## Append (A) counts from KO levels 2-n to KO level 1 
    KOA = KO1                                           # make DF to append to, from KO1
    col2= newcols[1:]                                   # make list of KO columns, excluding KO1
    for i in col2:                                      # For loop, i in KO columns (-KO1): 
        names = [(i),'count','MD5']                     # Get columns to keep, incl. i (KO level), counts, MD5
        ko_only = KO_countMD5[KO_countMD5[(i)] != 'no']    # Keep only rows of KO count tab where i != none (no)
        ko_Lcut = ko_only[names]                        # Keep columns i, count, md5 -- this is new table    
        ko_Lcut.columns = ['KO', 'count', 'MD5']        # rename col. i as 'KO'
        KOA = pd.concat([KOA, ko_Lcut], ignore_index=True)    # Append new table to older table(s); DataFrame.append was removed in pandas 2.0
                                                        # Check results, sequentially: # print(KOA.shape) # print(names) # print(koA)   
                                                        # KOA.head()  # KOA.tail()  # KOA.shape
    ####################################
    # Aggregate KO Counts and export to file
    koCounts = KOA.drop('MD5', axis=1)
    koCounts.columns = ['KO', file]                     # koCounts.head()

    # Sum over KOs  (counts were given for each read cluster)
    KOcounts = koCounts.groupby(['KO']).sum()

    # Build output filename from the sample ID
    filestr=file[:-2]                                    # Won't tolerate ".3" in filename, truncate and add back
    filename = str(filestr) + '.3_KO.txt'

    # write csv (\t) to filename, no row ind.
    KOcounts.to_csv((str(filename)), sep="\t", index=True, quoting=csv.QUOTE_NONE)

    ####################################
    ### Export KO to MD5 mapping to file
    KOmd5 = KOA                                          # Don't drop, keep counts for Taxonomy # .drop('count', axis=1) # get mapping, tested and is unique
    KOmd5["MGID"] = file

    # Print filename as string from file 
    # filetr=str(file)
    # filestr=file[:-2]                                  # Won't tolerate ".3" in filename, truncate and addback
    filename = str(filestr) + '.3_KOmd5.txt'             # output filename for the KO-md5 table

    # write csv (\t) to filename, no row ind.
    KOmd5.to_csv((str(filename)), sep="\t", index=True, quoting=csv.QUOTE_NONE)
    
    print(file + " KO tables completed")

### Example usage
Once the "get" requests for the samples have completed, each call takes about 35 s per sample; the loop crashes if samples with uncompleted requests are on the list. Note this could take longer under higher server load.
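
One way to avoid a crash on an unfinished sample is to check the parsed JSON before indexing into it. The sketch below assumes only the nested `data` -> `data` layout used in the function above; the helper name `extract_rows` and the shape of the "not ready" payload are hypothetical, since the actual response for an incomplete request is not shown here.

```python
def extract_rows(payload):
    """Return the inner row list, or None if the nested 'data' layers are missing."""
    outer = payload.get("data") if isinstance(payload, dict) else None
    if not isinstance(outer, dict):
        return None
    rows = outer.get("data")
    return rows if isinstance(rows, list) else None


# Hypothetical payloads, for illustration only
ready = {"data": {"data": [["md5a", 3]]}}
not_ready = {"status": "submitted"}

print(extract_rows(ready))      # row list is returned
print(extract_rows(not_ready))  # None, so the sample can be skipped and retried later
```

A caller could collect the IDs that return `None` and re-run only those, instead of stopping the whole loop.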

In [2]:
# One sample:
API_get_KO_tables("mgm4755136.3")

mgm4755136.3 KO tables completed


In [25]:
API_get_KO_tables("mgm4755203.3")

mgm4755203.3 KO tables completed


In [26]:
# Multiple samples --note here first and most pmo rich
filelist = ("mgm4755136.3", "mgm4755203.3") #, "mgm4755144.3")

for i in filelist:
    API_get_KO_tables(i)

mgm4755136.3 KO tables completed
mgm4755203.3 KO tables completed


## Import MGIDs lists and sample mapping to get  API call queue lists

In [104]:
import pandas as pd
import numpy as np

# read list of file IDs, mgID and SPID (use later)
samples = pd.read_csv("Salinity_MGrast_OID_mgmIDs133.txt", sep ='\t' )

# also get list of SPID to sample names (for later)
sampnamesSPID = pd.read_csv("Sample_name_to_SPID_mapping2.txt", sep ='\t' )
sampnamesSPID.head()

# Make list of MGRAST file IDs from column
samplist = samples["mgID"].tolist()

# Break list into 4 chunks before the for loop -- 4 chunks used as ca. 10 samples ea.
# else crash computer if too many in parallel (does work, though disastrous ;) )

# TODO, make for loop which appends to dict, then call items.
metalist= np.array_split(np.array(samplist), 4)
l_0=metalist[0].tolist()
l_1=metalist[1].tolist()
l_2=metalist[2].tolist()
l_3=metalist[3].tolist()

#sampnamesSPID.head()

In [106]:
# Check lists, length: # l_3  # len(l_0)

## Run batched lists of API calls for KO tables 
Batches are run separately so that each chunk can be monitored for completion time; run all together, they crash. Estimate 900 seconds for each batch. See the time.sleep command commented out below, and use it at your own risk in a loop over the batch lists if you feel bold.
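
For the bold, the TODO of looping over batches with a pause could look roughly like the sketch below. It is untested against the real API: `run_in_batches` is a hypothetical helper, the worker here is a stand-in for `API_get_KO_tables`, and the sleep function is injectable so the example runs without waiting.

```python
import time


def run_in_batches(ids, worker, n_batches=4, pause_s=0, sleep=time.sleep):
    """Call worker(id) for every id, pausing pause_s seconds between batches."""
    ids = list(ids)
    size = max(1, -(-len(ids) // n_batches))  # ceiling division
    batches = {i // size: ids[i:i + size] for i in range(0, len(ids), size)}
    for k, chunk in batches.items():
        for sample_id in chunk:
            worker(sample_id)
        if pause_s and k < len(batches) - 1:  # no pause after the last batch
            sleep(pause_s)
    return batches


# Dry run with stand-ins: record calls and pauses instead of hitting the API
done, pauses = [], []
ids = [f"sample{i}" for i in range(10)]
batches = run_in_batches(ids, done.append, n_batches=4, pause_s=900, sleep=pauses.append)
print({k: len(v) for k, v in batches.items()}, pauses)
```

Note that this splits into equal-sized chunks with a shorter last one, which differs slightly from `np.array_split`. A real run would pass `API_get_KO_tables` as the worker and leave `sleep` at its default.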

In [34]:
## Here running each list separately... DON'T run them all at once (in parallel): CRASH computer

# Could TODO, but more dangerous?: Make for loop of chunks with timed pause between sets.
# import time       # to pause after each API call
# time.sleep(900)   # Delay for 15 minutes (15x60 seconds).
# Wall time for below is 

for i in l_0:
    API_get_KO_tables(i)
    
# 11 samples: started at 4:24 pm, finish 4:30 -- 6 min wall time
# 34 samples: start 4:50, finish 5:12 -- 22 m wall time

mgm4755136.3 KO tables completed
mgm4755135.3 KO tables completed
mgm4755144.3 KO tables completed
mgm4755149.3 KO tables completed
mgm4755150.3 KO tables completed
mgm4755157.3 KO tables completed
mgm4755145.3 KO tables completed
mgm4755151.3 KO tables completed
mgm4755154.3 KO tables completed
mgm4755155.3 KO tables completed
mgm4755146.3 KO tables completed
mgm4755200.3 KO tables completed
mgm4755189.3 KO tables completed
mgm4755195.3 KO tables completed
mgm4755205.3 KO tables completed
mgm4755197.3 KO tables completed
mgm4755209.3 KO tables completed
mgm4755199.3 KO tables completed
mgm4755210.3 KO tables completed
mgm4755207.3 KO tables completed
mgm4755188.3 KO tables completed
mgm4755202.3 KO tables completed
mgm4755215.3 KO tables completed
mgm4755190.3 KO tables completed
mgm4754590.3 KO tables completed
mgm4754591.3 KO tables completed
mgm4754600.3 KO tables completed
mgm4754585.3 KO tables completed
mgm4754597.3 KO tables completed
mgm4754606.3 KO tables completed
mgm4754598

In [37]:
for i in l_1:            
    API_get_KO_tables(i)

mgm4754581.3 KO tables completed
mgm4754594.3 KO tables completed
mgm4754605.3 KO tables completed
mgm4754583.3 KO tables completed
mgm4754593.3 KO tables completed
mgm4754586.3 KO tables completed
mgm4754602.3 KO tables completed
mgm4754603.3 KO tables completed
mgm4754587.3 KO tables completed
mgm4754580.3 KO tables completed
mgm4754604.3 KO tables completed
mgm4754592.3 KO tables completed
mgm4754582.3 KO tables completed
mgm4754601.3 KO tables completed
mgm4754584.3 KO tables completed
mgm4754579.3 KO tables completed
mgm4754599.3 KO tables completed
mgm4754596.3 KO tables completed
mgm4755213.3 KO tables completed
mgm4755198.3 KO tables completed
mgm4755204.3 KO tables completed
mgm4755186.3 KO tables completed
mgm4755217.3 KO tables completed
mgm4755211.3 KO tables completed
mgm4755208.3 KO tables completed
mgm4755181.3 KO tables completed
mgm4755196.3 KO tables completed
mgm4755206.3 KO tables completed
mgm4755214.3 KO tables completed
mgm4755180.3 KO tables completed
mgm4755216

In [39]:
for i in l_2:
    API_get_KO_tables(i)

mgm4755191.3 KO tables completed
mgm4755184.3 KO tables completed
mgm4755201.3 KO tables completed
mgm4755185.3 KO tables completed
mgm4755193.3 KO tables completed
mgm4755194.3 KO tables completed
mgm4755192.3 KO tables completed
mgm4755218.3 KO tables completed
mgm4755203.3 KO tables completed
mgm4755212.3 KO tables completed
mgm4755147.3 KO tables completed
mgm4755139.3 KO tables completed
mgm4755143.3 KO tables completed
mgm4755137.3 KO tables completed
mgm4755131.3 KO tables completed
mgm4755140.3 KO tables completed
mgm4755133.3 KO tables completed
mgm4755141.3 KO tables completed
mgm4755127.3 KO tables completed
mgm4755138.3 KO tables completed
mgm4755134.3 KO tables completed
mgm4755129.3 KO tables completed
mgm4755132.3 KO tables completed
mgm4755158.3 KO tables completed
mgm4755142.3 KO tables completed
mgm4755152.3 KO tables completed
mgm4755153.3 KO tables completed
mgm4755126.3 KO tables completed
mgm4755148.3 KO tables completed
mgm4755156.3 KO tables completed
mgm4755128

KeyError: 'data'

In [19]:
l_2[30:34]
# only last one of l_2 dropped, re-run...

['mgm4755128.3', 'mgm4755130.3', 'mgm4758871.3']

In [20]:
API_get_KO_tables("mgm4758871.3")  # -- PROBLEM was AUTH, need to release samples, but first fix metadata

mgm4758871.3 KO tables completed


In [21]:
for i in l_3:
    API_get_KO_tables(i)

mgm4758870.3 KO tables completed
mgm4758844.3 KO tables completed
mgm4758847.3 KO tables completed
mgm4758852.3 KO tables completed
mgm4758863.3 KO tables completed
mgm4758856.3 KO tables completed
mgm4758842.3 KO tables completed
mgm4758862.3 KO tables completed
mgm4758840.3 KO tables completed
mgm4758865.3 KO tables completed
mgm4758838.3 KO tables completed
mgm4758846.3 KO tables completed
mgm4758845.3 KO tables completed
mgm4758866.3 KO tables completed
mgm4758849.3 KO tables completed
mgm4758841.3 KO tables completed
mgm4758859.3 KO tables completed
mgm4758843.3 KO tables completed
mgm4758867.3 KO tables completed
mgm4758861.3 KO tables completed
mgm4758860.3 KO tables completed
mgm4758868.3 KO tables completed
mgm4758848.3 KO tables completed
mgm4758850.3 KO tables completed
mgm4758853.3 KO tables completed
mgm4758864.3 KO tables completed
mgm4758858.3 KO tables completed
mgm4758869.3 KO tables completed
mgm4758857.3 KO tables completed
mgm4758839.3 KO tables completed
mgm4758837

##  Combine KO count tables using reduce (merge all by KO)
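
As a toy illustration of the reduce step, the sketch below merges three small per-sample tables on `KO` with an outer join and fills the gaps with 0. The sample names `s1` to `s3` and the counts are made up.

```python
from functools import reduce

import pandas as pd

# Three per-sample tables, each covering a different set of KOs (hypothetical values)
t1 = pd.DataFrame({"KO": ["K00001", "K00002"], "s1": [12, 6]})
t2 = pd.DataFrame({"KO": ["K00002", "K00003"], "s2": [4, 9]})
t3 = pd.DataFrame({"KO": ["K00001", "K00003"], "s3": [1, 2]})

# Fold the list pairwise: ((t1 merge t2) merge t3); the outer join keeps every KO seen
merged = reduce(lambda left, right: pd.merge(left, right, on="KO", how="outer"), [t1, t2, t3])

# KOs absent from a sample become NaN after the outer join; treat them as zero counts
merged = merged.fillna(0).sort_values("KO").reset_index(drop=True)
print(merged)
```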

In [22]:
# Use separate KO table API outputs in current directory
import glob
import pandas as pd
from functools import reduce  # Python 3 only
import csv  # to write our data to a CSV


#### 1) Make list of data
# Get all KO files, add path if needed
allFiles = glob.glob("*KO.txt")  # allFiles

# Read allFiles as csv and make list of dataframes:
frame = pd.DataFrame()
list_ = []

for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None, header=0, sep='\t')
    list_.append(df) 
list_


#### 2) Reduce data in list
# Reduce data (merge all), make DF and fill NAs
def big_merge(df1, df2):
    return pd.merge(df1, df2, on='KO', how='outer')   # merge explicitly on the shared KO column

merged_dfs = reduce(big_merge, list_)  # REDUCE 
out=pd.DataFrame(merged_dfs)           # As DF  
out.fillna(value=0, inplace = True)    # NA = 0 

(6277, 133)

In [26]:
# out.head() # out.shape

In [99]:
### 3) Rename samples, format DF and sort for export
# Transpose and merge with SPIDs from ID mapping file
outT=out.T
outT.columns=outT.iloc[0]  # rename cols
outT=outT[1:]              # drop first row
outT['mgID']=outT.index    # MG ID as column

# Merge this with sample list to get OIDs 
outMt = pd.merge(samples, outT, on='mgID', how='outer')

# Merge with OID to sample name map, sort Site EW, hydrology (low to high)
OutMt = pd.merge(sampnamesSPID, outMt, on='SPID', how='right')
OutMt.drop(['SPID','Alpha_Index','Index_EW_ d1_d2', 'Ind_site_EW','mgID'], axis=1, inplace=True)         # drop extra columns        # OutMt.head(10)    # OutMt.shape
OutMt_sort = OutMt.sort_values(['Indx_Site_hyd_EW'])#, inplace=True)              # SORT samples by EW index  
# OutMt_sort.head() # OutMt_sort.shape

# Reformat before transpose to KO as row
OutMt_sort.set_index(['Sample'], drop = True, inplace = True)
OutMt_sort.drop(['Indx_Site_hyd_EW'], axis=1, inplace=True)         # drop extra columns # OutMt_sort=OutMt_sort.iloc[:,2:]      # drop KO and mgID columns
OutMt_sort.index.names = ['KO']  # rename index
OutMt_sort.head(10)

## retranspose 
OutM_sort = OutMt_sort.T
#OutM_sort.head(10) #OutM_sort.columns #OutM_sort.tail(50) #OutM.shape #OutMt.head()

In [100]:
# Filter table to minimum number of reads, here 200
OutM_sort['KOsum'] = OutM_sort.sum(axis=1)                # Get sum of KOs
OutM_sort_f200 = OutM_sort[OutM_sort.KOsum > 200]         # Filter KO table by min number            # OutM_sort_f200.shape
OutM_sort_F200 = OutM_sort_f200.drop(['KOsum'],axis=1)    # Drop KOsum col, inplace throws py error  # , inplace=True)  # OutM_sort_f200.head()

### Export combined KO abundance table(s) for all samples

In [102]:
# Write table, incl. rownames
OutM_sort.to_csv("MG_RAST_KO_counts133_revised.txt", sep="\t", index=True, quoting=csv.QUOTE_NONE)
OutMt_sort.to_csv("MG_RAST_KO_counts133_revisedTRSNP.txt", sep="\t", index=True, quoting=csv.QUOTE_NONE)
OutM_sort_F200.to_csv("MG_RAST_KO_counts133_revisedF200.txt", sep="\t", index=True, quoting=csv.QUOTE_NONE)

### Cleanup files 

In [103]:
! mkdir KO_counts_sample_txt           # Makes new directory
! mv *KO.txt KO_counts_sample_txt/     # Moves all KO.txt to new directory 

## Combine KO - count MD5 tables 

In [116]:
# Assumes API separate KO table outputs in current directory
import glob
import pandas as pd
from functools import reduce  # Python 3 only
import csv  # to write our data to a CSV


#### 1) Make list of data
# Get all KO-md5 files, add path if needed
allFiles_md5 = glob.glob("*KOmd5.txt")  # allFiles

# Read allFiles as csv and make list of dataframes:
frame = pd.DataFrame()
list_ = []

for file_ in allFiles_md5:
    df = pd.read_csv(file_,index_col=None, header=0, sep='\t')
    list_.append(df) 
#list_

In [None]:
### 2) Concatenate tables in list
KOmd5_cts_all = pd.concat(list_, axis=0, ignore_index = True)  # could reduce, or append?

In [124]:
KOmd5_cts_all.shape #KOmd5_cts_all.head() # KOmd5_cts_all.tail()

(72150157, 5)

In [131]:
### Remap sample IDs from MGID to sample name 
# Join KO-md5 counts/Sample with Sample name mapping, using MGID and OIDs (here 'samples', 'sampnamesSPID') 
samples = samples.rename(columns={'mgID': 'MGID'})                                      # Rename mgID col to MGID to match md5 file... # samples.head()
KOmd5_ctS_all = pd.merge(samples, KOmd5_cts_all, on='MGID', how='outer')                # Merge w/ sample list -> OIDs 
KOmd5_cts_all_samps = pd.merge(sampnamesSPID, KOmd5_ctS_all, on='SPID', how='right')    # Merge w/ OID to sample name map, sort Site EW, hydrology (low to high)      # KOmd5_cts_all_samps.head()

# Drop extra columns 
KOmd5_cts_all_samps.drop(['SPID','Alpha_Index','Index_EW_ d1_d2','Indx_Site_hyd_EW','Ind_site_EW','MGID','Unnamed: 0'], axis=1, inplace=True)         # drop extra columns        # OutMt.head(10)    # OutMt.shape
#KOmd5_cts_all_samps.head()

In [137]:
KOmd5_cts_all_samps.head() # KOmd5_cts_all_samps.tail() # KOmd5_cts_all_samps.shape

Unnamed: 0,Sample,KO,count,MD5
0,Sandmound_TuleA_D1,K00566,2,00001508eba3f78863a4f9cb2463810d
1,Sandmound_TuleA_D1,K01687,2,00001a757949ba4df5f1a9f8f6ba6c09
2,Sandmound_TuleA_D1,K03688,37,00001aba8aee0c90a80969ea8da059f8
3,Sandmound_TuleA_D1,K02013,4,00002ee0efb6f4ef77f1a53bbeb207d0
4,Sandmound_TuleA_D1,K00066,43,00003a8575ab2461c908a808ffe2002a


### Export KO-md5 count table combined for all samples

In [138]:
KOmd5_cts_all_samps.to_csv("MG_RAST_KO_md5_counts_133.txt", sep="\t", index=False, quoting=csv.QUOTE_NONE)

## Cleanup individual files

In [None]:
! mkdir KOmd5_counts_sample_txt           # Makes new directory
! mv *KOmd5.txt KOmd5_counts_sample_txt/     # Moves all KOmd5.txt to new directory 

## Reformat / reduce file dimensions 
At 4.5 GB the combined table is rather large for a single file. Could it be rearranged or reformatted while still in memory?
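
One option that might shrink the in-memory table is to store the highly repeated string columns as pandas categoricals. The sketch below runs on synthetic data with made-up sample names; whether it helps enough on the real table is untested here, and the mostly distinct `MD5` column is unlikely to benefit.

```python
import pandas as pd

# Synthetic long table with heavily repeated Sample and KO strings (hypothetical values)
n = 10000
long = pd.DataFrame({
    "Sample": ["Site_A_D1", "Site_B_D1"] * (n // 2),
    "KO": ["K00001", "K00002", "K00003", "K00004"] * (n // 4),
    "count": range(n),
})

before = long.memory_usage(deep=True).sum()

# Categoricals store each distinct string once plus a small integer code per row
compact = long.astype({"Sample": "category", "KO": "category"})
after = compact.memory_usage(deep=True).sum()

print(before, after)
```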


In [144]:
# Unique KO-md5 combinations:
keep = ['KO','MD5']
KO_md5 = KOmd5_cts_all_samps[keep]          # Keep only KO, MD5            # KO_md5.head() # KO_md5.shape
KO_md5_unique_133sal = KO_md5.drop_duplicates() # get unique mappings of KO, MD5
KO_md5_unique_133sal.head() #KO_md5_unique_133sal.shape

# Export unique combinations
KO_md5_unique_133sal.to_csv("MG_RAST_KO_x_md5_unique_133.txt", sep="\t", index=False, quoting=csv.QUOTE_NONE)

Unnamed: 0,KO,MD5
0,K00566,00001508eba3f78863a4f9cb2463810d
1,K01687,00001a757949ba4df5f1a9f8f6ba6c09
2,K03688,00001aba8aee0c90a80969ea8da059f8
3,K02013,00002ee0efb6f4ef77f1a53bbeb207d0
4,K00066,00003a8575ab2461c908a808ffe2002a


## REFORMAT 4.5 GB table
The steps below are SLOW on the 4.5 GB table, so they are run in separate cells.
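
Before running the reshape on the full table, the long-to-wide pivot can be checked on a toy table such as the one below. The values are made up, and `aggfunc="sum"` and `fill_value=0` are choices of this sketch: pandas defaults to the mean when a (MD5, KO, Sample) combination occurs more than once.

```python
import pandas as pd

# Toy long table: one row per (Sample, KO, MD5) with a count (hypothetical values)
long = pd.DataFrame({
    "Sample": ["S1", "S1", "S2", "S2"],
    "KO": ["K00001", "K00002", "K00001", "K00003"],
    "MD5": ["aaa", "bbb", "aaa", "ccc"],
    "count": [4, 2, 1, 8],
})

# Rows become (MD5, KO) pairs, columns become samples, missing pairs become 0
wide = long.pivot_table(index=["MD5", "KO"], columns="Sample",
                        values="count", aggfunc="sum", fill_value=0)
print(wide)
```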

In [159]:
# Make WIDE table of md5-KO counts per sample: pivot long to wide data on KO-md5 to get counts / sample
# aggfunc='sum' adds up any repeated MD5-KO-sample rows (pandas defaults to the mean)
md5_sample_count_DB = KOmd5_cts_all_samps.pivot_table(index=['MD5','KO'], columns='Sample', values='count', aggfunc='sum')  
# md5_sample_count_DB.head()

In [161]:
# Fill NAs with 0 s
md5_sample_count_DB.fillna(0, inplace=True)
md5_sample_count_DB.head()

MD5,KO,Browns_ThreeSqA_D1,Browns_ThreeSqA_D2,Browns_ThreeSqB_D1,Browns_ThreeSqB_D2,Browns_ThreeSqC_D1,Browns_ThreeSqC_D2,Browns_TuleA_D1,Browns_TuleA_D2,Browns_TuleB_D1,Browns_TuleB_D2,...,WestPond_TuleC_D1,WestPond_TuleC_D2,White_CordA_D2,White_CordB_D2,White_ThreeSqA_D1,White_ThreeSqA_D2,White_ThreeSqB_D1,White_ThreeSqB_D2,White_ThreeSqC_D1,White_ThreeSqC_D2
00001508eba3f78863a4f9cb2463810d,K00566,4.0,1.0,3.0,2.0,3.0,1.0,0.0,0.0,8.0,1.0,...,3.0,2.0,0.0,1.0,1.0,1.0,2.0,1.0,0.0,1.0
00001a757949ba4df5f1a9f8f6ba6c09,K01687,2.0,4.0,0.0,3.0,3.0,2.0,0.0,1.0,0.0,1.0,...,3.0,5.0,10.0,9.0,1.0,1.0,3.0,6.0,1.0,1.0
00001aba8aee0c90a80969ea8da059f8,K03688,28.0,13.0,16.0,6.0,19.0,7.0,11.0,19.0,9.0,20.0,...,14.0,4.0,21.0,34.0,12.0,19.0,5.0,36.0,54.0,22.0
00002ee0efb6f4ef77f1a53bbeb207d0,K02013,2.0,1.0,8.0,3.0,4.0,2.0,3.0,4.0,0.0,4.0,...,6.0,3.0,1.0,3.0,1.0,2.0,2.0,3.0,1.0,2.0
00003a8575ab2461c908a808ffe2002a,K00066,33.0,31.0,39.0,22.0,43.0,28.0,30.0,29.0,54.0,42.0,...,28.0,15.0,19.0,15.0,27.0,33.0,24.0,18.0,14.0,33.0


In [176]:
md5_sample_count_DB.shape

(1014593, 133)

In [181]:
# Write the wide DB
md5_sample_count_DB.to_csv("MG_RAST_md5_WIDE_sample_count_KO_DB_133.txt", sep="\t", index=True, quoting=csv.QUOTE_NONE)

In [179]:
md5DB_head= md5_sample_count_DB.head()
md5DB_head
md5DB_head.to_csv("md5DB_head.txt", sep="\t", index=True, quoting=csv.QUOTE_NONE)

In [171]:
### Get Separate md5 total counts / sample for later normalization...
md5_totalcts_samp = md5_sample_count_DB.sum(axis=0)
md5_Sample_total = pd.DataFrame(md5_totalcts_samp)
md5_Sample_total.columns = ['md5_total_count'] 
# Export per-sample md5 total counts
md5_Sample_total.to_csv("MG_RAST_md5_Sample_total_133.txt", sep="\t", index=True, quoting=csv.QUOTE_NONE)
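
The notebook only says these totals are kept "for later normalization". One plausible use, shown here purely as an assumption, is converting counts to per-sample relative abundance by dividing each column by its total. The table and values below are made up.

```python
import pandas as pd

# Toy wide table: rows are MD5s, columns are samples (hypothetical values)
wide = pd.DataFrame(
    {"S1": [4.0, 2.0, 0.0], "S2": [1.0, 0.0, 9.0]},
    index=["aaa", "bbb", "ccc"],
)

# Column totals, computed the same way as md5_totalcts_samp above
totals = wide.sum(axis=0)

# Divide every column by its own total so that each sample sums to 1
rel = wide.div(totals, axis=1)
print(rel)
```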