Date Created: 09/10/20
## Goal of Notebook: Collect and Clean Phages DB Data

To gather PhagesDB data, we are going to use its Swagger API.

The API has the following endpoints:
- "phages": "https://phagesdb.org/api/phages/",
- "clusters": "https://phagesdb.org/api/clusters/",
- "subclusters": "https://phagesdb.org/api/subclusters/",
- "institutions": "https://phagesdb.org/api/institutions/",
- "host_strains": "https://phagesdb.org/api/host_strains/",
- "host_species": "https://phagesdb.org/api/host_species/",
- "host_genera": "https://phagesdb.org/api/host_genera/",
- "publications": "https://phagesdb.org/api/publications/",
- "genes": "https://phagesdb.org/api/genes/",
- "pham_phages": "https://phagesdb.org/api/pham_phages/"

For our use case, we want to gather metadata for sequenced phages and their respective genes. Therefore, we will use the "phages" and "genes" endpoints.
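
As a minimal sketch of how a paginated request URL for one of these endpoints can be assembled, the helper below uses only the standard library. The name `build_query_url` and the `API_ROOT` constant are hypothetical and introduced for illustration; the `page` and `page_size` query parameters are the ones used later in this notebook.

```python
from urllib.parse import urlencode

# Root of the API, taken from the endpoint list above
API_ROOT = "https://phagesdb.org/api/"

def build_query_url(endpoint, page=1, page_size=1000):
    """Return the URL for one page of `endpoint` (e.g. "genes" or "phages").

    Hypothetical helper: it only builds the string and makes no network call.
    """
    query = urlencode({"page": page, "page_size": page_size})
    return f"{API_ROOT}{endpoint}/?{query}"

print(build_query_url("genes", page=2, page_size=500))
```

Building the URL in one place keeps the page and page-size handling consistent across cells.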

In [1]:
# import libraries
import requests 
import pandas as pd
from random import sample 
import numpy as np
import matplotlib.pyplot as plt
import sys 
from multiprocessing.pool import Pool
import multiprocessing 

## 1. Downloading Genes

Download all the genes from a specific page number. Fetching by page lets us parallelize the download, with each worker process handling one page.

In [2]:
def download_page(page_number):
    ''' 
    Download one page of genes from https://phagesdb.org/api/genes/?page=
    Relies on the global page_size defined in the next cell.
    '''
    query_url = "https://phagesdb.org/api/genes/?page=" + str(page_number) + "&page_size=" + str(page_size)

    response = requests.get(url = query_url).json()
    list_of_genes = []
    for gene in response["results"]:
        # e.g. "20ES_CDS_1" -> ["20ES", "CDS", "1"]
        info = gene["GeneID"].split("_")
        list_of_genes.append([gene["GeneID"],
                              gene["phams"][0],
                              gene["Notes"].lower(),
                              gene["translation"],
                              gene["Orientation"],
                              # fall back to the GeneID prefix when the phage name is missing
                              gene["PhageID"]["Name"] if "PhageID" in gene.keys() and "Name" in gene["PhageID"].keys() else info[0],
                              info[-1],  # gene number
                              gene["Start"],
                              gene["Stop"]
                             ])
    return list_of_genes
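
The record flattening can also be written defensively, so that one incomplete record does not abort a whole page. The sketch below is an assumption-laden variant, not the notebook's method: `parse_gene` is a hypothetical name, the sample record is made up, and it assumes that an empty `phams` list or a missing `PhageID` should map to a placeholder rather than raise an error.

```python
def parse_gene(gene):
    """Flatten one gene record (a dict) into a list, tolerating missing fields."""
    gene_id_parts = gene["GeneID"].split("_")
    phams = gene.get("phams") or []
    # Fall back to the GeneID prefix when the phage name is absent
    phage = (gene.get("PhageID") or {}).get("Name", gene_id_parts[0])
    return [
        gene["GeneID"],
        phams[0] if phams else None,      # first pham, or None if the list is empty
        (gene.get("Notes") or "").lower(),
        gene.get("translation"),
        gene.get("Orientation"),
        phage,
        gene_id_parts[-1],                # gene number
        gene.get("Start"),
        gene.get("Stop"),
    ]

# Made-up record for illustration only
example = {"GeneID": "Demo_CDS_7", "phams": [], "Notes": "Lysin B",
           "translation": "MSL", "Orientation": "F", "Start": 10, "Stop": 40}
print(parse_gene(example))
```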

In [3]:
# define page size at module level so it can be used in the above function
page_size = 1000

# the "count" field of the response gives the total number of genes
query_url = "https://phagesdb.org/api/genes/?page=1&page_size=" + str(page_size)
total_num_genes = int(requests.get(url = query_url).json()["count"])

# determine number of pages based on page size
pages = int(np.ceil(total_num_genes/page_size))

# use multiprocessing to query and process each page in parallel
with Pool(multiprocessing.cpu_count()) as p:
    genes = p.map(download_page, list(range(1,pages+1)))
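
The page arithmetic and the map-then-flatten pattern can be checked offline. The sketch below swaps the network call for a stand-in (`fake_download_page`, a hypothetical function) and uses a thread pool instead of a process pool. That is an alternative often chosen for I/O-bound work, not what this notebook does. The total of 348734 is the gene count reported in the download output.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def number_of_pages(total_items, page_size):
    """Number of pages needed to cover total_items at page_size items per page."""
    return math.ceil(total_items / page_size)

def fake_download_page(page_number, page_size=3, total_items=8):
    """Stand-in for a network call: return the item indices that fall on this page."""
    start = (page_number - 1) * page_size
    return list(range(start, min(start + page_size, total_items)))

pages = number_of_pages(8, 3)  # pages of 3, 3 and 2 items
with ThreadPoolExecutor(max_workers=4) as pool:
    # map preserves input order, so page 1 comes first in the result
    per_page = list(pool.map(fake_download_page, range(1, pages + 1)))

# Flatten the list of pages into one list of items
combined = [item for page in per_page for item in page]
print(pages, combined)
print(number_of_pages(348734, 1000))
```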

In [4]:
combined_genes = [gene for genes_groups in genes for gene in genes_groups]

print("Finished download... Found ", len(combined_genes),"genes")

Finished download... Found  348734 genes


SANITY CHECK

In [5]:
combined_genes[0]

['20ES_CDS_1',
 '36676',
 '',
 'MYGTRSSAFWASQPGKFDVLNLRMTFPSTSAHEIPDLTATDFVPENLAAWNMPRHREYAAHTGGALHFFLDDYRFETVWSSPERLLDRVKAVGAALTPDFSLWKDMPRAAQVWNTYRSRWCGAYWQSEGIEVIPTVGWGTPDTYDFCFDGLPTGGNVAISCLTLRAKQEDRELFTRGVQELVWRTQPKTLLVYGRLRFCEDIDLPEVREYPTYWDRRRKRLEEQWESAGAAVEAVEPPAPRPETKEPQLQAVDLD',
 'F',
 '20ES',
 '1',
 568,
 1336]

In [6]:
len(combined_genes) == total_num_genes

True

### Clean Gene Functions

In [7]:
# ADDED 09/28 TO DISAMBIGUATE FUNCTIONS
# Open conversion dict
import pickle
with open("data/conversion_table.pkl", "rb") as a_file:
    conversion_table = pickle.load(a_file)

# load in approved functions list (taken from https://seaphages.org/blog/2017/10/30/official-function-list/ version 5/2020)
df_approved_functions = pd.read_csv("data/Approved_Functions.csv")
df_approved_functions = df_approved_functions.dropna(subset=["Approved Function"])

# clean approved functions to lower case (a set keeps the membership checks below fast)
approved_functions = {i.lower() for i in df_approved_functions["Approved Function"]}

# for each gene, check if its function is valid; if not, use the conversion table to correct it, or fall back to NKF
for i in combined_genes:
    function = i[2] # uncleaned function
    i.append(function) # save uncleaned function
    if function in approved_functions:
        continue
    elif function in conversion_table.keys() and conversion_table[function] != -1:
        i[2] = conversion_table[function]
    else: 
        i[2] = "NKF"
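
The three-way rule used here (keep approved functions, convert known variants, otherwise mark as NKF) can be isolated in a small function and tried on toy data. In the sketch below, `clean_function` is a hypothetical name and the `approved` and `conversion` values are made up for illustration; they are not entries from the real files.

```python
def clean_function(raw, approved, conversion):
    """Return the cleaned function name for one raw annotation string."""
    if raw in approved:
        return raw
    # -1 marks a raw string that has no valid conversion
    mapped = conversion.get(raw, -1)
    return mapped if mapped != -1 else "NKF"

# Toy data for illustration only
approved = {"lysin b", "terminase"}
conversion = {"lysb": "lysin b", "gp5": -1}

for raw in ["terminase", "lysb", "gp5", ""]:
    print(repr(raw), "->", clean_function(raw, approved, conversion))
```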

In [8]:
combined_genes[0]

['20ES_CDS_1',
 '36676',
 'NKF',
 'MYGTRSSAFWASQPGKFDVLNLRMTFPSTSAHEIPDLTATDFVPENLAAWNMPRHREYAAHTGGALHFFLDDYRFETVWSSPERLLDRVKAVGAALTPDFSLWKDMPRAAQVWNTYRSRWCGAYWQSEGIEVIPTVGWGTPDTYDFCFDGLPTGGNVAISCLTLRAKQEDRELFTRGVQELVWRTQPKTLLVYGRLRFCEDIDLPEVREYPTYWDRRRKRLEEQWESAGAAVEAVEPPAPRPETKEPQLQAVDLD',
 'F',
 '20ES',
 '1',
 568,
 1336,
 '']

### Create Genes DF

In [9]:
df_genes = pd.DataFrame( combined_genes, columns = ['gene ID',
                                                    'pham',
                                                    'function',
                                                    'translation',
                                                    'orientation',
                                                    'phage',
                                                    'gene number',
                                                    'start',
                                                    'stop',
                                                    'uncleaned function'
                                                   ]) 

df_genes.head()

Unnamed: 0,gene ID,pham,function,translation,orientation,phage,gene number,start,stop,uncleaned function
0,20ES_CDS_1,36676,NKF,MYGTRSSAFWASQPGKFDVLNLRMTFPSTSAHEIPDLTATDFVPEN...,F,20ES,1,568,1336,
1,20ES_CDS_10,39578,lysin b,MSLQVGSSGELVNRWIRVMKARFASYAGKLKEDGYFGLDDKAVQQE...,F,20ES,10,6442,7420,lysin b
2,20ES_CDS_11,34196,terminase,MSLENHHPELAPSPPHIIGPSWQRTVDGSWHLPDPKMTLGWGVLKW...,F,20ES,11,7442,9233,terminase
3,20ES_CDS_12,39511,portal protein,MTAPLPGQEEIPDPAIARDEMISAFDDAVKNLKINTSYYEAERRPE...,F,20ES,12,9229,10690,portal protein
4,20ES_CDS_13,21454,capsid maturation protease,MITAAVAAYVQRFASMFTGPALSLGEWARFLQTLFPEVQRRYAQAA...,F,20ES,13,10719,11583,capsid maturation protease


## 2. Phage Metadata

In [10]:
def collect_phage_metadata(phage_info):
    '''
    Query https://phagesdb.org/api/phages/ for one phage and return its metadata as a list.
    Returns [phage, "-1"] if no lookup succeeds.
    '''
    (phage, phage_from_geneid) = phage_info
    query_url = "https://phagesdb.org/api/phages/"+ str(phage)
    response = requests.get(url = query_url).json()
    
    # a response with fewer than 5 keys is treated as a failed lookup
    if len(response.keys()) < 5:
        query_url = "https://phagesdb.org/api/phages/"+ str(phage.split("_")[0]) # sometimes they are drafts
        response = requests.get(url = query_url).json()
        
    if len(response.keys()) < 5:
        query_url = "https://phagesdb.org/api/phages/"+ str(phage_from_geneid) # last resort: name taken from the gene ID
        response = requests.get(url = query_url).json()
        
    if len(response.keys())>5:
        return [
                response['phage_name'],
                response["pcluster"]["temperate"]  if "pcluster" in response.keys() and response["pcluster"] is not None else "",
                response["pcluster"]["cluster"] if "pcluster" in response.keys() and response["pcluster"] is not None else response["pcluster"],
                response["psubcluster"]["subcluster"] if "psubcluster" in response.keys() and response["psubcluster"] is not None else response["psubcluster"],
                response["morphotype"],
                response["isolation_host"]["genus"],
                response["isolation_host"]["species"],
                response["genome_length"],
                response['is_annotated'],
                response['is_phamerated'],
                response["gcpercent"]
               ]
    else:
        return [phage, "-1"]
    

In [11]:
phages_from_genes = []
for phage in df_genes['phage'].unique():
    phages_from_genes.append((phage, list(df_genes[df_genes['phage']==phage]["gene ID"].values)[0].split("_")[0]))

In [12]:
len(phages_from_genes)

3513

In [13]:
# use multiprocessing to query and process each phage in parallel
with Pool(multiprocessing.cpu_count()-1) as p:
    phage_metadata = p.map(collect_phage_metadata, phages_from_genes)

In [14]:
phage_metadata = [i for i in phage_metadata]

In [15]:
# iterate over a copy: removing from a list while iterating over it skips elements
for i in list(phage_metadata):
    if i[1] == "-1":
        print(i)
        phage_metadata.remove(i)
        df_genes = df_genes[df_genes["phage"]!=i[0]]

['B5', '-1']
['BFK20', '-1']
['CMP1', '-1']
['ISF9', '-1']
['P1.1', '-1']
['P1201', '-1']
['P9.1', '-1']
['PHL010M04', '-1']
['PHL060L00', '-1']
['PHL071N05', '-1']
['PHL112N00', '-1']
['PHL114L00', '-1']
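
When filtering a list in place, note that removing items from a list while iterating over that same list can skip the element that follows each removal. The sketch below shows the effect on made-up records and a partition-based alternative that builds new lists instead. `split_failures` is a hypothetical helper, and the `"-1"` marker follows the convention used by `collect_phage_metadata`.

```python
def split_failures(records, marker="-1"):
    """Partition records into (found, failed_names) without mutating the input."""
    def is_failure(record):
        return len(record) == 2 and record[1] == marker
    found = [r for r in records if not is_failure(r)]
    failed_names = [r[0] for r in records if is_failure(r)]
    return found, failed_names

# Made-up records: two consecutive failed lookups
records = [["A", True], ["B", "-1"], ["C", "-1"], ["D", False]]

# Pitfall: mutate the same list that is being iterated
naive = list(records)
for r in naive:
    if r[1] == "-1":
        naive.remove(r)
print(naive)  # ["C", "-1"] survives because the iterator stepped past it

found, failed_names = split_failures(records)
print(found, failed_names)
```

The failed names can then be used to filter the gene table in a single step.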


In [16]:
print("Unique # Sequenced Phages:", len(phage_metadata))

Unique # Sequenced Phages: 3501


SANITY CHECK

In [17]:
len(phage_metadata) == len(df_genes['phage'].unique())

False

The counts differ: some phages found in the gene data are not present in the metadata.
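
A count comparison only says that the two collections differ, not where. A set difference lists the names on each side that have no match on the other. The sketch below uses made-up names; `compare_names` is a hypothetical helper, and in the notebook the inputs would be the unique `phage` values from the gene table and the first field of each metadata row.

```python
def compare_names(gene_phages, metadata_phages):
    """Return (names only in the gene data, names only in the metadata), sorted."""
    gene_set, meta_set = set(gene_phages), set(metadata_phages)
    return sorted(gene_set - meta_set), sorted(meta_set - gene_set)

# Made-up names for illustration only
only_in_genes, only_in_metadata = compare_names(
    ["Alpha", "Beta_Draft", "Gamma"],
    ["Alpha", "Beta", "Gamma"],
)
print(only_in_genes)
print(only_in_metadata)
```

A mismatch like this one would also appear if the name returned by the phages endpoint differs from the name used in the gene records.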

### Parse Response Data

In [19]:
df_phage = pd.DataFrame(phage_metadata, columns =['phage',
                                                  'temperate',
                                                  'cluster',
                                                  'subcluster',
                                                  'morphotype',
                                                  'host genus',
                                                  'host species',
                                                  'genome length',
                                                  'is annotated',
                                                  'is phamerated', 
                                                  'gcpercent'
                                               ]) 

df_phage.to_csv("data/phage_metadata.csv",index=False)
df_phage.head()

Unnamed: 0,phage,temperate,cluster,subcluster,morphotype,host genus,host species,genome length,is annotated,is phamerated,gcpercent
0,20ES,True,A,A2,SIPHO,Mycobacterium,smegmatis,53124.0,False,True,63.4
1,244,True,E,,SIPHO,Mycobacterium,smegmatis,74483.0,True,True,63.4
2,32HC,True,Z,,SIPHO,Mycobacterium,smegmatis,50781.0,False,True,65.7
3,39HC,False,B,B6,SIPHO,Mycobacterium,smegmatis,71565.0,False,True,70.0
4,40AC,True,A,A17,SIPHO,Mycobacterium,smegmatis,53396.0,False,True,63.3


Some phages from the genes API have no metadata associated with them, so we must drop these phages from our list.

### Save df_genes to .CSV

In [20]:
df_genes.to_csv("data/cleaned_gene_list.csv",index=False)