# 1. PHISTO Data Preprocessing and Organization

Preprocess the host-pathogen protein-protein interaction search results from PHISTO (http://www.phisto.org) for use as the training dataset.

Accessed: 14 June 2019

Search by taxonomy ID

>Full keyword: 'TAXONOMY ID'  = '1392'  OR  'TAXONOMY ID'  = '632'  OR  'TAXONOMY ID'  = '177416'
- **632**: *Yersinia pestis*
- **1392**: *Bacillus anthracis*
- **177416**: *Francisella tularensis*

In [1]:
import os
import joblib
import pandas as pd

from Bio import SearchIO

# Print status of datasets
def print_status(df):
    
    # For each pathogen
    for pathogen in sorted(set(df.Pathogen)):
        df_patho = df[df.Pathogen == pathogen]
        i = len(df_patho)
        p = len(set(df_patho.Pathogen_Uniprot_ID))
        h = len(set(df_patho.Human_Uniprot_ID))
        print('%s:\n%i interactions involving %i pathogen proteins and %i human proteins\n' % (pathogen, i, p, h))
    
    # Total
    i = len(df)
    p = len(set(df.Pathogen_Uniprot_ID))
    h = len(set(df.Human_Uniprot_ID))
    print('TOTAL:\n%i interactions involving %i pathogen proteins and %i human proteins\n' % (i, p, h))

In [2]:
# Set up in and out directories
parent_dir = os.path.dirname(os.getcwd())

dir_in = os.path.join(parent_dir, 'raw_data', 'training')
dir_out = os.path.join(parent_dir, 'processed_data')

## Step 1: Mapping Uniprot accessions of PHISTO proteins

Extract the list of pathogen and host proteins in `PHISTO_data.csv` so that their IDs can be mapped against the Uniprot database.

In [3]:
# Read PHISTO file as DataFrame
# Select only relevant columns
f_in = os.path.join(dir_in, 'PHISTO_data.csv')

columns = ['Pathogen', 'Uniprot ID', 'Uniprot ID.1']
df = pd.read_csv(f_in)[columns]

# Replace obsolete Uniprot IDs with the active ones
replacements = {'A0A1A9IFF4': 'A0A2P0HB98',
                'A0A1A9IJH2': 'A0A2P0HHP2'}
df.replace(replacements, inplace=True)

# Remove duplicate interactions
df.drop_duplicates(inplace=True)

# Rename columns
df.columns = ['Pathogen',
              'Pathogen_Uniprot_ID',
              'Human_Uniprot_ID']

print_status(df)
df.head()

Bacillus anthracis:
3088 interactions involving 938 pathogen proteins and 1710 human proteins

Francisella tularensis SUBSPECIES TULARENSIS SCHU S4:
1352 interactions involving 342 pathogen proteins and 998 human proteins

Yersinia pestis:
4101 interactions involving 1223 pathogen proteins and 2151 human proteins

TOTAL:
8541 interactions involving 2503 pathogen proteins and 3530 human proteins



| | Pathogen | Pathogen_Uniprot_ID | Human_Uniprot_ID |
|---|---|---|---|
| 0 | Yersinia pestis | Q9RI12 | Q96FW1 |
| 2 | Yersinia pestis | Q7ARN6 | P63000 |
| 3 | Yersinia pestis | Q74YG7 | Q9HD26 |
| 4 | Yersinia pestis | Q8D0Q9 | O43491 |
| 5 | Yersinia pestis | Q0WAP0 | Q9P0K7 |


In [4]:
# Obtain protein accessions for ID mapping in Uniprot
uniprot_ids = []

for organism in ['Pathogen', 'Human']:
    protein_set = set(df['%s_Uniprot_ID' % organism])
    uniprot_ids = uniprot_ids + list(protein_set)

# Save Uniprot accessions into a file
f_out = os.path.join(dir_in, 'PHISTO_proteins_list')

with open(f_out, 'w') as f:
    _ = f.write('\n'.join(uniprot_ids))
    print('Written %i Uniprot accessions' % len(uniprot_ids))

Written 6033 Uniprot accessions


## Uniprot ID mapping and sequence search

Source: https://www.uniprot.org

>Filters:
- active accessions
- sequence lengths >= 50

>Important fields:
- `Your list` column: query Uniprot accessions (renamed into `Query` after download)
- `Entry`: primary Uniprot accession of the query protein
- `Length`: sequence length of the protein

>Results files:
- Uniprot ID mapping: `uniprot_mapping.tab`
- Sequences: `uniprot_sequences.fasta`
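
The length filter above is applied on the Uniprot website. As a sanity check on the downloaded file, the sketch below reads a FASTA file and lists any accession shorter than the cutoff. The helper names `read_fasta` and `short_sequences` are hypothetical, the toy records use made-up sequences (`X00001` is not a real accession), and the code assumes Uniprot-style headers of the form `db|ACCESSION|NAME`.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)

def short_sequences(handle, min_length=50):
    """Return accessions whose sequence is shorter than min_length."""
    short = []
    for header, seq in read_fasta(handle):
        # Assumes headers such as 'sp|P63000|RAC1_HUMAN ...'
        accession = header.split('|')[1]
        if len(seq) < min_length:
            short.append(accession)
    return short

# Toy input with made-up sequences; the real file is uniprot_sequences.fasta
toy_fasta = '\n'.join([
    '>sp|P63000|RAC1_HUMAN toy record',
    'M' * 60,
    '>tr|X00001|X00001_TOY toy record',
    'M' * 30,
])
print(short_sequences(io.StringIO(toy_fasta)))  # ['X00001']
```

An empty list for the real file would confirm that every downloaded sequence meets the 50-residue cutoff.
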
<hr></hr>

## Curation of Uniprot ID mapping result

Manually edit `uniprot_mapping.tab` to ensure a one-to-one mapping between `Query` and `Entry`.

**A single entry for multiple queries**: keep only the query whose accession matches the entry

>`O95766,P86790 -> P86790`
- `O95766` is already mapped to `P86791`

>`Q68DN6,P0DJD1 -> P0DJD1`
- `Q68DN6` is already mapped to `P0DJD0`

>`P0CL84,Q6NXR2 -> P0CL84`
- `Q6NXR2` is already mapped to `P0CL83`

**Multiple entries for one query**: map based on sequence identity (manual web search)

>`P62158 -> P0DP23,P0DP24,P0DP25`
- map only to **`P0DP23`**

>`P30042 -> P0DPI2,A0A0B4J2D5`
- map only to **`P0DPI2`**

>`P08107 -> P0DMV8,P0DMV9`
- map only to **`P0DMV8`**

>`Q6NXR2 -> P0CL83,P0CL85`
- map only to **`P0CL83`**
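
The curation itself was done by hand, but the two problem patterns can be detected programmatically before and after editing. The sketch below is only an illustration: `mapping_problems` is a hypothetical helper, and it assumes the layout implied by the notation above, where a merged row lists several queries separated by commas and an ambiguous query appears on several rows.

```python
import csv
import io
from collections import Counter

def mapping_problems(rows):
    """Return (queries that appear on several rows, rows listing several queries)."""
    query_counts = Counter(row['Query'] for row in rows)
    repeated = sorted(q for q, n in query_counts.items() if n > 1)
    merged = sorted(row['Query'] for row in rows if ',' in row['Query'])
    return repeated, merged

# Toy excerpt in the assumed pre-curation layout (tab-separated)
raw = (
    'Query\tEntry\n'
    'O95766\tP86791\n'
    'O95766,P86790\tP86790\n'
    'P62158\tP0DP23\n'
    'P62158\tP0DP24\n'
    'P62158\tP0DP25\n'
)
# The same excerpt after applying the manual rules listed above
curated = (
    'Query\tEntry\n'
    'O95766\tP86791\n'
    'P86790\tP86790\n'
    'P62158\tP0DP23\n'
)

for label, text in [('raw', raw), ('curated', curated)]:
    rows = list(csv.DictReader(io.StringIO(text), delimiter='\t'))
    print(label, mapping_problems(rows))
```

Both lists should be empty for the curated file; otherwise the dictionary built from `Query` and `Entry` in the next step would silently keep only the last entry for a repeated query.
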

## Step 2: Filter interactions

Filter PHISTO interactions by sequence length and domain availability.

### Replacing IDs of interactors with primary Uniprot IDs

In [5]:
# Load mapping results
# Select only relevant columns: Query and Entry
f_in = os.path.join(dir_in, 'uniprot_mapping.tab')
df_map = pd.read_csv(f_in, sep='\t')[['Query', 'Entry']]

print('Obtained %i proteins from Uniprot ID mapping\n' % len(df_map))
df_map.head()

Obtained 6008 proteins from Uniprot ID mapping



| | Query | Entry |
|---|---|---|
| 0 | O75575 | O75575 |
| 1 | Q96AT9 | Q96AT9 |
| 2 | P52434 | P52434 |
| 3 | Q6ZNA4 | Q6ZNA4 |
| 4 | P61587 | P61587 |


In [6]:
# Replace Uniprot IDs in the PHISTO DataFrame with the primary accessions in Entry
replace_dict = {query: entry for query, entry in df_map.values}
df.replace(replace_dict, inplace=True)

print_status(df)

Bacillus anthracis:
3088 interactions involving 938 pathogen proteins and 1710 human proteins

Francisella tularensis SUBSPECIES TULARENSIS SCHU S4:
1352 interactions involving 342 pathogen proteins and 998 human proteins

Yersinia pestis:
4101 interactions involving 1223 pathogen proteins and 2151 human proteins

TOTAL:
8541 interactions involving 2503 pathogen proteins and 3530 human proteins
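
`df.replace` scans every column, including `Pathogen`. A possible alternative, shown here as a sketch on a toy DataFrame rather than as the notebook's method, restricts the replacement to the two ID columns with `Series.map`. The names `toy` and `toy_map` are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    'Pathogen': ['Yersinia pestis', 'Yersinia pestis'],
    'Pathogen_Uniprot_ID': ['Q9RI12', 'Q7ARN6'],
    'Human_Uniprot_ID': ['P62158', 'P63000'],
})
# Toy mapping: one obsolete query accession and one that is already primary
toy_map = {'P62158': 'P0DP23', 'P63000': 'P63000'}

for column in ['Pathogen_Uniprot_ID', 'Human_Uniprot_ID']:
    # map() yields NaN for IDs missing from the dict; fillna restores the original ID
    toy[column] = toy[column].map(toy_map).fillna(toy[column])

print(toy['Human_Uniprot_ID'].tolist())     # ['P0DP23', 'P63000']
print(toy['Pathogen_Uniprot_ID'].tolist())  # ['Q9RI12', 'Q7ARN6']
```

IDs absent from the mapping are left unchanged, which matches the behaviour of `df.replace`.
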



### Parse `hmmscan` result

Domain scan via `hmmscan` (HMMER 3.2.1) against the Pfam-A database (`Pfam-A.hmm`; downloaded from ftp://ftp.ebi.ac.uk/pub/databases/Pfam/).

Full terminal command:

`hmmscan --tblout pfam_hits --acc --noali -E 0.00001 --domE 0.00001 --cpu 7 ~/hmmer-3.2.1/pfam/Pfam-A.hmm uniprot_sequences.fasta`

In [7]:
# Parse hmmscan output
f_in = os.path.join(dir_in, 'pfam_hits')

pfam_dict = {}
pfam_set = set() # store unique Pfam accessions
domains = [] # store domain descriptions

for query in SearchIO.parse(f_in, 'hmmer3-tab'):
    uniprot_id = query.id.split('|')[1]
    domain_counts = {} # store domain counts
    
    # Read each domain hit in the query
    for hit in query.hits:
        pfam_acc = hit.accession.split('.')[0]
        
        # Add the Pfam accession along with its number of occurrences
        domain_counts[pfam_acc] = hit.domain_reported_num
        pfam_set.add(pfam_acc)
        domains.append((pfam_acc, hit.description))
        
    # Map Uniprot ID to a dict of its domain counts
    pfam_dict[uniprot_id] = domain_counts

# Print statistics
print('Obtained %i domains from %i proteins' % (len(pfam_set), len(pfam_dict)))

Obtained 4456 domains from 5695 proteins
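
To make the structure of `pfam_dict` concrete, the sketch below rebuilds the same mapping from raw `--tblout` lines without Biopython. It assumes the standard HMMER3 target table layout (18 whitespace-separated fields followed by a free-text description, with `rep` as the 17th field); `parse_tblout` is a hypothetical helper and the toy line uses made-up scores and a made-up Pfam version number.

```python
def parse_tblout(lines):
    """Map each Uniprot accession to {Pfam accession: reported domain count}."""
    counts = {}
    for line in lines:
        # Skip blank lines and the '#' header/footer comments
        if not line.strip() or line.startswith('#'):
            continue
        # 18 whitespace-separated fields, then the description as one field
        fields = line.split(None, 18)
        pfam_acc = fields[1].split('.')[0]    # 'PF00071.22' -> 'PF00071'
        uniprot_id = fields[2].split('|')[1]  # 'sp|P63000|RAC1_HUMAN' -> 'P63000'
        counts.setdefault(uniprot_id, {})[pfam_acc] = int(fields[16])  # 'rep' column
    return counts

# Toy table: one comment line and one hit with made-up values
toy_lines = [
    '# target name  accession  query name  ...',
    'Ras PF00071.22 sp|P63000|RAC1_HUMAN - 1e-50 170.0 0.1 1.2e-50 169.8 0.1 '
    '1.9 2 0 0 2 2 2 2 Ras family',
]
print(parse_tblout(toy_lines))  # {'P63000': {'PF00071': 2}}
```

Because the table only lists hits, proteins without any reported domain never appear as keys, which is why fewer proteins are counted here than were obtained from the ID mapping.
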


In [8]:
# Save complete domain descriptions into a DataFrame
f_out = os.path.join(dir_out, 'pfam_domain_descriptions.tsv')

df_pfam = pd.DataFrame(domains, columns=['Pfam_ID', 'Description'])

df_pfam.drop_duplicates(inplace=True)
df_pfam.sort_values(by='Pfam_ID', inplace=True)
df_pfam.to_csv(f_out, sep='\t', index=False)

# Dump domain data as a pickled file for later use
_ = joblib.dump((pfam_dict, sorted(pfam_set)), 'pfam.pkl')

In [9]:
# Filter interactions by availability of domains
p_filter = df.Pathogen_Uniprot_ID.isin(pfam_dict.keys())
h_filter = df.Human_Uniprot_ID.isin(pfam_dict.keys())
df = df[p_filter & h_filter]

print_status(df)

Bacillus anthracis:
2764 interactions involving 857 pathogen proteins and 1565 human proteins

Francisella tularensis SUBSPECIES TULARENSIS SCHU S4:
1187 interactions involving 307 pathogen proteins and 884 human proteins

Yersinia pestis:
3590 interactions involving 1120 pathogen proteins and 1917 human proteins

TOTAL:
7541 interactions involving 2284 pathogen proteins and 3188 human proteins
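
The filter keeps an interaction only when both partners have at least one Pfam domain. A minimal sketch on toy data, using hypothetical IDs (`A1`, `H1`, and so on) and a toy `toy_pfam` dictionary in place of `pfam_dict`:

```python
import pandas as pd

toy = pd.DataFrame({
    'Pathogen_Uniprot_ID': ['A1', 'A2', 'A3'],
    'Human_Uniprot_ID': ['H1', 'H2', 'H3'],
})
# Hypothetical domain annotations: A3 and H2 have no Pfam hits
toy_pfam = {
    'A1': {'PF00071': 1},
    'A2': {'PF00069': 1},
    'H1': {'PF00071': 2},
    'H3': {'PF00069': 1},
}

annotated = set(toy_pfam)
keep = toy.Pathogen_Uniprot_ID.isin(annotated) & toy.Human_Uniprot_ID.isin(annotated)
filtered = toy[keep]

print(filtered.values.tolist())  # [['A1', 'H1']]
```

The pairs `A2`/`H2` and `A3`/`H3` are dropped because one partner in each lacks domain annotations, even though the other partner has them.
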



In [10]:
# Save final interaction DataFrame
f_out = os.path.join(dir_out, 'positive_pairs.tsv')
df.to_csv(f_out, sep='\t', index=False)

<hr></hr>