# PHISTO Data Preprocessing and Organization

Preprocess the host-pathogen protein-protein interaction search results from PHISTO (http://www.phisto.org) for use as a training dataset.

Search by taxonomy ID, using the full keyword:

`'TAXONOMY ID' = '1392' OR 'TAXONOMY ID' = '632' OR 'TAXONOMY ID' = '177416'`

- **632**: *Yersinia pestis*
- **1392**: *Bacillus anthracis*
- **177416**: *Francisella tularensis*
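
The search keyword can also be assembled programmatically, which keeps the query and the ID-to-species mapping in one place. This is a minimal sketch, assuming PHISTO accepts exactly the `'TAXONOMY ID' = '<id>'` syntax shown in the keyword above. `PATHOGENS` and `build_keyword` are hypothetical names introduced here for illustration, and the IDs are those written in the keyword.

```python
# Hypothetical mapping: taxonomy IDs as written in the search keyword above.
PATHOGENS = {
    1392: 'Bacillus anthracis',
    632: 'Yersinia pestis',
    177416: 'Francisella tularensis',
}

def build_keyword(tax_ids):
    """Join one "'TAXONOMY ID' = '<id>'" clause per ID with OR."""
    clauses = ["'TAXONOMY ID' = '%i'" % tax_id for tax_id in tax_ids]
    return ' OR '.join(clauses)

# Iterating over a dict yields its keys in insertion order.
keyword = build_keyword(PATHOGENS)
print(keyword)
```

Adding another pathogen then only requires a new entry in the mapping.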

In [1]:
import os
import pandas as pd

# Print a summary of a dataset: number of interactions and of unique proteins
def print_status(df):
    i = len(df)
    p = len(set(df.Pathogen_UniprotID))
    h = len(set(df.Human_UniprotID))
    print('Total:\n%i interactions involving %i pathogen proteins and %i human proteins\n' % (i, p, h))

In [2]:
# Set up input and output directories
parent_dir = os.path.dirname(os.getcwd())

dir_in = dir_out = os.path.join(parent_dir, 'data')

## Step 1: Obtain UniProt accessions of proteins

Extract the lists of pathogen and host proteins in `PHISTO_data.csv` for ID mapping against the UniProt database.

In [3]:
# Read PHISTO file as DataFrame
# Select only relevant columns
f_in = os.path.join(dir_in, 'PHISTO_data.csv')

columns = ['Taxonomy ID', 'Uniprot ID', 'Uniprot ID.1']
df = pd.read_csv(f_in)[columns]

# Remove duplicate interactions
df.drop_duplicates(inplace=True)

# Rename columns
df.columns = ['Pathogen_TaxID',
              'Pathogen_UniprotID',
              'Human_UniprotID']

print_status(df)
df.head()

Total:
8541 interactions involving 2503 pathogen proteins and 3530 human proteins



|   | Pathogen_TaxID | Pathogen_UniprotID | Human_UniprotID |
|---|---|---|---|
| 0 | 632 | Q9RI12 | Q96FW1 |
| 2 | 632 | Q7ARN6 | P63000 |
| 3 | 632 | Q74YG7 | Q9HD26 |
| 4 | 632 | Q8D0Q9 | O43491 |
| 5 | 632 | Q0WAP0 | Q9P0K7 |
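
Because the dataset combines three pathogens, a per-pathogen breakdown can be a useful complement to the overall totals. The following is a sketch only, not part of the original notebook: `status_by_taxid` is a hypothetical helper, and `df_demo` is a toy stand-in built from the five preview rows above rather than the full dataset.

```python
import pandas as pd

# Toy stand-in for `df`: the five rows shown in the preview above.
df_demo = pd.DataFrame({
    'Pathogen_TaxID': [632, 632, 632, 632, 632],
    'Pathogen_UniprotID': ['Q9RI12', 'Q7ARN6', 'Q74YG7', 'Q8D0Q9', 'Q0WAP0'],
    'Human_UniprotID': ['Q96FW1', 'P63000', 'Q9HD26', 'O43491', 'Q9P0K7'],
})

def status_by_taxid(df):
    """Count interactions and unique proteins for each pathogen taxonomy ID."""
    return df.groupby('Pathogen_TaxID').agg(
        interactions=('Pathogen_UniprotID', 'size'),
        pathogen_proteins=('Pathogen_UniprotID', 'nunique'),
        human_proteins=('Human_UniprotID', 'nunique'),
    )

summary = status_by_taxid(df_demo)
print(summary)
```

Calling `status_by_taxid(df)` on the full DataFrame would give one row per taxonomy ID, which makes it easy to spot a pathogen that is missing or underrepresented.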


In [4]:
# Save protein lists for ID mapping in Uniprot
for organism in ['Pathogen', 'Human']:
    name = '%s_UniprotID' % organism
    uniprot_ids = set(df[name])
    
    f_out = os.path.join(dir_out, name)
    with open(f_out, 'w') as f:
        # Sort so that the file contents are identical across runs
        _ = f.write('\n'.join(sorted(uniprot_ids)))
        print('%s: Saved %i Uniprot accessions' % (organism,
                                                   len(uniprot_ids)))

Pathogen: Saved 2503 Uniprot accessions
Human: Saved 3530 Uniprot accessions
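
Before submitting the lists for ID mapping, it may be worth confirming that the saved files read back to the same set of accessions. The sketch below assumes the one-accession-per-line format written above. `read_accessions` is a hypothetical helper, and the demo uses a temporary file with three accessions from the preview rather than the real output files.

```python
import os
import tempfile

def read_accessions(path):
    """Return the set of non-empty lines in an accession list file."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

# Demo with a temporary file; the accessions are taken from the preview above.
demo_ids = {'Q9RI12', 'Q7ARN6', 'Q74YG7'}
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'Pathogen_UniprotID')
    with open(demo_path, 'w') as f:
        f.write('\n'.join(sorted(demo_ids)))
    round_trip = read_accessions(demo_path)

# True if every accession survived the write/read round trip
print(round_trip == demo_ids)
```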
