### Week 5 - Biological Databases - Protein-Protein Interactions & Pathways
- October 2023
- [https://github.com/tisimpson/bioinformatics1](https://github.com/tisimpson/bioinformatics1)
- [ian.simpson@ed.ac.uk](mailto:ian.simpson@ed.ac.uk)

In [None]:
import pandas as pd
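import urllib.request  # load the request submodule explicitly; 'import urllib' alone does not guarantee it
import urllib as ul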
import numpy as np

In [None]:
# Fetching KEGG pathway data

# the KEGG list has no header row, so header=None keeps the first pathway as data
human_pathways = pd.read_csv(ul.request.urlopen('http://rest.kegg.jp/list/pathway/hsa'),sep='\t',header=None,names=['kegg_id','pathway_name'])

# we're looking for "Dopaminergic Synapse"

human_pathways.head()
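
A minimal offline sketch of how the `header` argument behaves on a headerless, tab-separated listing. The two rows below are typed in by hand to mimic the two-column layout (they are not a live download), and `sample_listing` is a name made up for this illustration.

```python
import io
import pandas as pd

# hand-typed sample in a two-column, tab-separated layout (not a live KEGG download)
sample_listing = (
    "hsa00010\tGlycolysis / Gluconeogenesis - Homo sapiens (human)\n"
    "hsa04728\tDopaminergic synapse - Homo sapiens (human)\n"
)

cols = ['kegg_id', 'pathway_name']

# header=0 treats the first line as a header row and replaces it with `names`
with_header0 = pd.read_csv(io.StringIO(sample_listing), sep='\t', header=0, names=cols)

# header=None treats every line as data
with_header_none = pd.read_csv(io.StringIO(sample_listing), sep='\t', header=None, names=cols)

# compare how many rows survive in each case
print(len(with_header0), len(with_header_none))
```

Comparing the two row counts is a quick way to check whether the first record of a listing has been swallowed as a header.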

In [None]:
# Extracting the KEGG ID for the pathway we're interested in
pathway_id = human_pathways[human_pathways['pathway_name'].str.match('Dopaminergic synapse')]['kegg_id']
print(pathway_id.values)
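
`str.match` treats its argument as a regular expression anchored at the start of each string, so it behaves like a "starts with" test. The sketch below shows, on a small hand-made frame, how that differs from `str.contains`, and adds a guard for the case where the search returns zero or several rows. The frame `toy`, its placeholder ids and its pathway names are invented for illustration.

```python
import pandas as pd

# invented rows with placeholder ids, purely for illustration
toy = pd.DataFrame({
    'kegg_id': ['id_A', 'id_B', 'id_C'],
    'pathway_name': ['Dopaminergic synapse - Homo sapiens (human)',
                     'Cholinergic synapse - Homo sapiens (human)',
                     'Example entry mentioning a Dopaminergic synapse'],
})

# str.match only succeeds when the pattern matches from the first character
starts_with = toy[toy['pathway_name'].str.match('Dopaminergic synapse')]

# str.contains finds the text anywhere; regex=False treats it as a plain string
anywhere = toy[toy['pathway_name'].str.contains('Dopaminergic synapse', regex=False)]

# guard before taking element [0]: fail loudly if the match is missing or ambiguous
if len(starts_with) != 1:
    raise ValueError(f'expected exactly one pathway, found {len(starts_with)}')
toy_id = starts_with['kegg_id'].iloc[0]
print(toy_id)
```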

In [None]:
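# we can use the KEGG API to fetch the full pathway entry
# make sure the output directory exists, then pull the pathway entry from KEGG
import os
os.makedirs('../data/pathways',exist_ok=True)

ul.request.urlretrieve('http://rest.kegg.jp/get/'+pathway_id.to_numpy()[0],'../data/pathways/dop_synapse.txt')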

# try opening this file and looking at the contents: you will see the full pathway details, including the gene names

# open the file
dop_file = open('../data/pathways/dop_synapse.txt','r')

# This is a basic Python parsing example: a simple for loop with a conditional inside it, to demonstrate how you can quickly build simple parsers.
# There are quicker ways to do this, but this is a good learning example.

# set a flag for our parser, we will use this to know when we are in the gene section of the file
flag=0

# create a list to hold the genes
genes = []

# work through the text file one line at a time
# notice we're pulling the line information so not just the gene id but also the description
for line in dop_file:
    # find the start of the gene entries (section keywords start in the first column)
    if line.startswith('GENE'):
        # drop the 'GENE' keyword and add the first gene to the list
        genes.append(pd.Series(line[len('GENE'):].strip().split('  ',1)))
        # set the flag to 1, we are in the gene section of the file
        flag = 1
    # stop when we reach the end of the section and escape the file
    elif line.startswith('COMPOUND'):
        break
    # continue adding the genes to the list, splitting once into gene id and description
    elif flag == 1:
        genes.append(pd.Series(line.strip().split('  ',1)))

# close the file
dop_file.close()

# convert the list to a dataframe
dop_df = pd.DataFrame(genes)

# name the columns
dop_df.columns = ['gene_id','description']

# you now have the gene_ids (NCBI Entrez GeneIDs) for the genes in the pathway
print('The Dopaminergic Synapse pathway has '+str(dop_df.shape[0])+' genes in it.\n')

# show the gene_ids
print(dop_df['gene_id'].to_numpy())
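
The flag-based loop above can also be packaged as a small reusable function. The sketch below is one possible version, assuming that section keywords start in the first column and that continuation lines are indented. The sample text is typed in by hand to imitate that layout (it is not a real download), and `parse_gene_section` and `sample_entry` are names made up for this illustration.

```python
import pandas as pd

# hand-typed text imitating the flat-file layout; not a real KEGG download
sample_entry = (
    "ENTRY       hsa04728                    Pathway\n"
    "GENE        7054  TH; tyrosine hydroxylase\n"
    "            1644  DDC; dopa decarboxylase\n"
    "            1813  DRD2; dopamine receptor D2\n"
    "COMPOUND    (placeholder line marking the next section)\n"
)

def parse_gene_section(text):
    """Return a DataFrame of gene_id/description pairs from the GENE section."""
    rows = []
    in_genes = False
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace():
            # indented line: continuation of the current section
            body = line.strip()
        else:
            # a keyword in the first column starts a new section
            keyword, _, body = line.partition(' ')
            in_genes = (keyword == 'GENE')
            if not in_genes and rows:
                break  # the GENE section has ended
            body = body.strip()
        if in_genes and body:
            # split once on whitespace: gene id first, the rest is the description
            gene_id, description = body.split(None, 1)
            rows.append((gene_id, description))
    return pd.DataFrame(rows, columns=['gene_id', 'description'])

sample_genes = parse_gene_section(sample_entry)
print(sample_genes)
```

Because the function stops at whichever keyword follows the gene block, it does not depend on the next section being called `COMPOUND`.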

In [None]:
# print this out in a nice table

import prettytable as pt

# create a pretty table object
table = pt.PrettyTable()

# add the columns
table.add_column('Gene ID',dop_df['gene_id'].to_numpy())
table.add_column('Description',dop_df['description'].to_numpy())

# print out the table
print(table)
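
If `prettytable` is not installed, pandas can produce a plain-text table on its own. A minimal sketch using `DataFrame.to_string` follows; the frame `toy_genes` is a hand-made stand-in for `dop_df`, introduced only for this illustration.

```python
import pandas as pd

# hand-made stand-in for dop_df
toy_genes = pd.DataFrame({
    'gene_id': ['7054', '1644'],
    'description': ['TH; tyrosine hydroxylase', 'DDC; dopa decarboxylase'],
})

# rename the columns for display and drop the numeric index from the output
table_text = (
    toy_genes
    .rename(columns={'gene_id': 'Gene ID', 'description': 'Description'})
    .to_string(index=False)
)
print(table_text)
```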

In [None]:
# let's practice writing out a simple gene_id file

f = open('../data/pathways/dop_geneids.txt','w')

for i in dop_df['gene_id']:
    f.write(i+'\n')

f.close()

# we could go on to use this basic file of gene ids to retrieve protein interaction data from BioGRID or IntAct. We will do this in the week 6 lab using APIs.

# you could have a look at one of the most commonly used protein-protein interaction databases in advance if you like:
#   STRING - https://string-db.org/ or BIOGRID - https://thebiogrid.org/
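
The gene id file written in this cell could also be produced with a `with` block, which closes the file automatically even if an error occurs part-way through. The sketch below writes to a temporary directory and reads the ids back as a check; `toy_ids` and the file name are hypothetical and stand in for `dop_df['gene_id']` and the real output path.

```python
import os
import tempfile

# hypothetical stand-in for dop_df['gene_id']
toy_ids = ['7054', '1644', '1813']

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_geneids.txt')

    # write one gene id per line; the file is closed when the block exits
    with open(out_path, 'w') as handle:
        for gene_id in toy_ids:
            handle.write(gene_id + '\n')

    # read the ids back to confirm the round trip
    with open(out_path) as handle:
        read_back = [line.strip() for line in handle]

print(read_back == toy_ids)
```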