### Week 5 - Biological Databases - Protein-Protein Interactions & Pathways
- October 2023
- [https://github.com/tisimpson/bioinformatics1](https://github.com/tisimpson/bioinformatics1)
- [ian.simpson@ed.ac.uk](mailto:ian.simpson@ed.ac.uk)

In [None]:
import pandas as pd
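# import the request submodule explicitly so that ul.request is always available
import urllib.request
import urllib as ul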
import numpy as np

In [None]:
# Fetching KEGG pathway data

# the KEGG list output has no header row, so header=None keeps the first pathway in the table
human_pathways = pd.read_csv(ul.request.urlopen('http://rest.kegg.jp/list/pathway/hsa'),sep='\t',header=None,names=['kegg_id','pathway_name'])

human_pathways.head()

In [None]:
# Extracting the KEGG ID for the pathway we're interested in
# we're looking for "Cell adhesion molecules"

pathway_id = human_pathways[human_pathways['pathway_name'].str.match('Cell adhesion molecules')]['kegg_id']
print(pathway_id.values)

In [None]:
# we can use the KEGG API to fetch the pathway data
# pull the pathway entry from KEGG

ul.request.urlretrieve('http://rest.kegg.jp/get/'+pathway_id.to_numpy()[0],'../data/pathways/cams.txt')

# why not open this file and look at the contents. You will see the full pathway details including the gene names

# open the file
dop_file = open('../data/pathways/cams.txt','r')

# I wanted to show you some basic python parsing: a simple for loop with a conditional in it,
# to demonstrate how you can quickly build simple parsers.
# There are quicker ways to do this, but this is a good learning example.

# set a flag for our parser, we will use this to know when we are in the gene section of the file
flag=0

# create a list to hold the genes
genes = []

# work through the text file one line at a time
# notice we're pulling the whole line, so not just the gene id but also the description
for line in dop_file:
    # find the start of the gene entries (the section keyword sits at the start of the line)
    if line.startswith('GENE'):
        # add the first gene to the list, dropping the 'GENE' keyword
        genes.append(pd.Series(line[len('GENE'):].strip().split('  ',1)))
        # set the flag to 1, we are in the gene section of the file
        flag = 1
    # stop when the next section keyword appears (continuation lines start with a space)
    elif flag == 1 and not line.startswith(' '):
        break
    # continue adding the genes to the list
    elif flag == 1:
        genes.append(pd.Series(line.strip().split('  ',1)))

# close the file
dop_file.close()

# convert the list to a dataframe
dop_df = pd.DataFrame(genes)

# name the columns
dop_df.columns = ['gene_id','description']

# you now have the gene_ids (NCBI Entrez GeneIDs for the genes in the pathway)
print('The Cell Adhesion Molecule Map has '+str(dop_df.shape[0])+' genes in it.\n')

# show the gene_ids
print(dop_df['gene_id'].to_numpy())
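
As one of the quicker alternatives mentioned in the comments above, here is a minimal sketch that relies on the KEGG flat-file layout, in which the section keyword occupies the first 12 characters of a line and continuation lines start with spaces. The helper name `parse_gene_section` and the sample text are hypothetical: the sample is a made-up snippet laid out like a KEGG entry, not data fetched from KEGG.

```python
def parse_gene_section(text):
    """Return (gene_id, description) pairs from the GENE section of a KEGG-style entry."""
    genes = []
    in_genes = False
    for line in text.splitlines():
        # the first 12 characters hold the section keyword (blank on continuation lines)
        keyword, data = line[:12].strip(), line[12:].strip()
        if keyword == 'GENE':
            in_genes = True
        elif keyword:
            # any other keyword starts a different section
            in_genes = False
        if in_genes and data:
            # the gene id is the first token, the rest is the description
            gene_id, _, description = data.partition(' ')
            genes.append((gene_id, description.strip()))
    return genes

# made-up sample laid out like a KEGG entry (illustration only)
sample_entry = (
    "ENTRY       map00000                    Pathway\n"
    "NAME        Example pathway\n"
    "GENE        111  AAA1; example protein one\n"
    "            222  BBB2; example protein two\n"
    "REFERENCE   example\n"
)

print(parse_gene_section(sample_entry))
```

The list of pairs can be turned into the same kind of table as above with `pd.DataFrame(pairs, columns=['gene_id','description'])`.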

In [None]:
# write out a simple gene_id file

f = open('../data/pathways/cams_geneids.txt','w')

for i in dop_df['gene_id']:
    f.write(i+'\n')

f.close()

In [None]:
# print this out in a nice table

import prettytable as pt

# create a pretty table object
table = pt.PrettyTable()

# add the columns
table.add_column('Gene ID',dop_df['gene_id'].to_numpy())
table.add_column('Description',dop_df['description'].to_numpy())

# print out the table
print(table)

In [None]:
# we could go on to use this basic file of gene ids to retrieve protein interaction data from BioGRID or IntAct. We will do this in the week 6 lab using APIs.

# you could have a look at one of the most commonly used protein-protein interaction databases in advance if you like:
#   STRING - https://string-db.org/ or BioGRID - https://thebiogrid.org/

#### Automating Retrieval of Protein-Protein Interactions from STRING

The details of the String-DB API can be found here - [https://string-db.org/help/api/](https://string-db.org/help/api/)

APIs require specific formats for their query URLs, and getting these correct in your code can take a little time until you get used to them. In this case we need to concatenate (stitch together) our gene IDs using a '%0D' string. This is the URL encoding of a carriage return (a line break), so it mimics the one-gene-per-line entry that you would paste into the web page.
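
To see where '%0D' comes from, the short sketch below percent-encodes a carriage return with the standard library and joins a few identifiers in the same way. The identifiers `'111'`, `'222'` and `'333'` are placeholders made up for illustration, not genes from the pathway.

```python
from urllib.parse import quote, unquote

# percent-encoding a carriage return gives the separator used in the query URL
separator = quote('\r')
print(separator)

# placeholder identifiers (hypothetical, for illustration only)
example_ids = ['111', '222', '333']

# stitch the identifiers together with the encoded carriage return
identifiers = separator.join(example_ids)
print(identifiers)

# decoding recovers the one-identifier-per-line form
print(unquote(identifiers).split('\r'))
```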

In [None]:
# create a concatenated list of entrezIDs as strings
# the gene_ids in the 'gene_id' column of the dataframe we generated above were parsed from text, so they
# should already be strings; the map function converts each one with str() anyway so that the join cannot fail.
# The join function then concatenates them using the '%0D' string to stitch them all together.
# This string will be used to help us build the API query URL.
entrezIDs = '%0D'.join(map(str,dop_df['gene_id']))

# pass the list of EntrezIDs to the String-DB API, which returns the String-DB IDs
# we first form the query url using the 'get_string_ids' API function which takes a list of identifiers and
# converts them into the internal String-DB accession IDs. This massively speeds up the search and allows us to
# search for more than 10 at once, which is an API restriction for other API functions if String-DB internal
# accessions aren't used.
query_url = 'https://string-db.org/api/tsv-no-header/get_string_ids?identifiers='+entrezIDs+'&species=9606&format=only-ids'

# use the urllib library to retrieve the String-DB internal IDs
result = ul.request.urlopen(query_url).read().decode('utf-8')

# now we want to query String-DB to retrieve interactions from this list of String-DB IDs
# we create a concatenated list of stringdbIDs in much the same way as above for the Entrez Gene IDs
stringdbIDs = '%0D'.join(result.splitlines())

# again we build the query for interactions using the String-DB IDs
query_url = 'https://string-db.org/api/tsv/network?identifiers='+stringdbIDs+'&species=9606'

# again we use urllib to retrieve the interactions; these are returned in a standard tab-delimited text format
interactions = ul.request.urlopen(query_url).read().decode('utf-8').splitlines()

# we need to split the result by these 'tabs' (\t is used to identify them)
int_test = [interaction.split('\t') for interaction in interactions]

# we extract the field names from the first row
column_names = int_test[0]

# create a Pandas dataframe of the interaction data we have just retrieved from String-DB
interactions_df = pd.DataFrame(int_test,columns=column_names)

# delete the first row, which held the field names and is no longer needed
interactions_df = interactions_df.drop(labels=0,axis=0)

# remove any duplicate rows
final_interactions = interactions_df.drop_duplicates()

# show the top of the protein-protein interaction table
final_interactions.head()
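
The tab-splitting, header extraction and duplicate removal above can also be sketched with the standard library's `csv` module, converting the score from text to a number along the way. The sample text below is made up and only mimics a few columns of the tab-delimited layout; the `score` column name is an assumption about the returned fields, and the protein names and values are placeholders, not STRING results.

```python
import csv
import io

# made-up tab-delimited text with a header row (illustration only)
sample_tsv = (
    "preferredName_A\tpreferredName_B\tscore\n"
    "PROT1\tPROT2\t0.9\n"
    "PROT1\tPROT2\t0.9\n"
    "PROT2\tPROT3\t0.5\n"
)

# DictReader takes the field names from the first row
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter='\t')

seen = set()
unique_rows = []
for row in reader:
    # use the full row as the key so that exact duplicates are skipped
    key = tuple(row.values())
    if key not in seen:
        seen.add(key)
        # every field arrives as text, so convert the score to a float
        row['score'] = float(row['score'])
        unique_rows.append(row)

print(unique_rows)
```

In pandas, `pd.read_csv(io.StringIO(text), sep='\t')` followed by `drop_duplicates()` gives a comparable result, with the header row and numeric columns handled for you.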

In [None]:
# create a simple network view of the interactions using the NetworkX library
# https://networkx.org/documentation/stable/index.html

import networkx as nx
import matplotlib.pyplot as plt

# check the column names of the dataframe
print(final_interactions.columns)

# create an empty graph
G = nx.Graph()

# add all nodes
G.add_nodes_from(set(final_interactions['preferredName_A']) | set(final_interactions['preferredName_B'])) 

# add the edges (connections) to the network, one per interacting pair
edges = list(zip(final_interactions['preferredName_A'], final_interactions['preferredName_B']))
G.add_edges_from(edges)

# draw the network with a force-directed layout, specifying the plot size, the node size and thin gray edges
plt.figure(figsize=(15,15))
nx.draw(G, pos=nx.spring_layout(G,k=2), with_labels=True,node_size=100,edge_color='gray',node_color='lightblue',width=0.5)