### Programming for Biomedical Informatics
#### Week 3 Assignment - Gene ID Conversion

In this weekly mini assignment you will practice converting between different accession types using NCBI eUtils and BioMart.

- Make sure that you have installed Biopython and signed up for a free NCBI account so that you can get an API key.
- If you decide that you would like to try BioMart too, remember that you can (and should!) practice with the BioMart web interface first so that you understand the correct parameters to use.

We've included the basic code below, based on the weekly snippets in the GitHub ```./notebooks/week3``` folder. Feel free to explore and try different things.

In [32]:
# using NCBI-NLM eUtils to get gene IDs from gene symbols
from Bio import Entrez

# load your API key from the file
with open('../api_keys/ncbi.txt', 'r') as file:
    api_key = file.read().strip()

# load your email from the file
with open('../api_keys/ncbi_email.txt', 'r') as file:
    email = file.read().strip()

Entrez.api_key = api_key
Entrez.email = email

# NB: modified to run a single, faster search by joining all symbols with OR
def get_gene_ids(gene_symbols, organism="Homo sapiens"):
    """
    Convert a list of gene symbols into NCBI Gene IDs.

    Parameters:
    gene_symbols (list): List of gene symbols to search for.
    organism (str): Organism name to restrict search (default is "Homo sapiens").

    Returns:
    dict: A dictionary mapping gene symbols to NCBI Gene IDs.
    """
    gene_ids = {}
    search_string = " OR ".join([f"{symbol}[Gene]" for symbol in gene_symbols])
    search_string += f" AND {organism}[Organism]"

    handle = Entrez.esearch(db="gene", term=search_string, retmax=len(gene_symbols))
    record = Entrez.read(handle)
    handle.close()

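    # CAUTION: esearch returns IDs in its own ranking order, not in the order of
    # the input symbols, and may return fewer or more hits than symbols given.
    # Pairing by position therefore does not guarantee that each symbol is matched
    # to its own Gene ID; verify the mapping before relying on it.
    for symbol, gene_id in zip(gene_symbols, record["IdList"]):
        gene_ids[symbol] = gene_id if gene_id else None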

    return gene_ids
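
Because `esearch` returns IDs in its own ranking order rather than in the order of the input symbols, one way to make the pairing explicit is to ask `esummary` for the official symbol of each returned ID and match on that. The sketch below is not part of the weekly snippets: the helper names `fetch_symbol_id_pairs` and `map_symbols_to_ids` are hypothetical, and it assumes that Biopython parses a Gene `esummary` response into `DocumentSummarySet` / `DocumentSummary` entries carrying a `uid` attribute and a `Name` field.

```python
def fetch_symbol_id_pairs(gene_ids):
    """Return (gene_id, official_symbol) pairs for a list of NCBI Gene IDs."""
    # Imported lazily; assumes Entrez.email and Entrez.api_key are already set
    from Bio import Entrez

    handle = Entrez.esummary(db="gene", id=",".join(gene_ids))
    record = Entrez.read(handle)
    handle.close()
    # Assumption: layout of the parsed Gene esummary record
    summaries = record["DocumentSummarySet"]["DocumentSummary"]
    return [(doc.attributes["uid"], str(doc["Name"])) for doc in summaries]


def map_symbols_to_ids(gene_symbols, id_symbol_pairs):
    """Match each requested symbol to the Gene ID with the same official symbol."""
    by_symbol = {}
    for gene_id, symbol in id_symbol_pairs:
        # Keep the first hit for each symbol, compared case-insensitively
        by_symbol.setdefault(symbol.upper(), gene_id)
    # Symbols with no matching record map to None
    return {symbol: by_symbol.get(symbol.upper()) for symbol in gene_symbols}


# Offline illustration with hand-written pairs (no network call)
pairs = [("348", "APOE"), ("10000", "AKT3")]
mapping = map_symbols_to_ids(["AKT3", "APOE", "NOPE"], pairs)
```

In a notebook you could pass `record["IdList"]` from the search to `fetch_symbol_id_pairs` and then feed its result to `map_symbols_to_ids`; symbols that only match as aliases will come back as `None` and need a closer look.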

In [11]:
'''In last week's assignment you looked up the genes associated with Alzheimer's disease using the KEGG API.
Now, let's parse the list you recovered to extract the gene symbol. This is the first element after
the hsa:12344 number from last week's mapping:

e.g.

hsa:10000	AKT3, MPPH, MPPH2, PKB-GAMMA, PKBG, PRKBG, RAC-PK-gamma, RAC-gamma, STK-2; AKT serine/threonine kinase 3
hsa:100137049	PLA2G4B, HsT16992, cPLA2-beta; phospholipase A2 group IVB
hsa:10125	RASGRP1, CALDAG-GEFI, CALDAG-GEFII, IMD64, RASGRP; RAS guanyl releasing protein 1
....

This can be done in Python by first splitting on ```tab``` and then splitting on ```,``` to get the gene symbol.
Save the gene symbols to a file called alzheimers_genes.txt.
'''

# pseudocode1
# Use code from last week (solution for that posted on GitHub) to get the gene symbols data as above
'''###YOUR CODE HERE###'''
# Importing the required libraries
import requests

# Function to download KEGG pathway data
def download_kegg_pathway_genes(pathway_id):
    # URL for the pathway data
    data_url = f"http://rest.kegg.jp/link/hsa/{pathway_id}"
    
    # Make the HTTP request for the pathway data
    response = requests.get(data_url)
    
    # Check if the request was successful
    if response.status_code == 200:
        # Write the data content to a file
        with open(f"{pathway_id}.txt", 'w') as file:
            file.write(response.text)
        print(f"Pathway data saved as {pathway_id}.txt")
    else:
        print("Failed to retrieve pathway data. Status code:", response.status_code)

# Example pathway ID for Alzheimer's disease
pathway_id = "hsa05010"

# Call the function to download the pathway data
download_kegg_pathway_genes(pathway_id)

Pathway data saved as hsa05010.txt


In [12]:
# pseudocode2
# Parse the data to extract the gene symbol
'''###YOUR CODE HERE###'''
import pandas as pd
df = pd.read_csv('hsa05010.txt', sep='\t', header=None)

# the second column contains the gene IDs; all of them are used to fetch the gene details
gene_ids = df.iloc[:, 1]

# Function to fetch gene details
def fetch_gene_details(gene_ids):
    gene_details = dict()
    for gene_id in gene_ids:
        # URL for the gene entry
        data_url = f"http://rest.kegg.jp/list/{gene_id}"
        
        # Make the HTTP request for the gene entry
        response = requests.get(data_url)
        
        # Check if the request was successful
        if response.status_code == 200:
            # strip the trailing newline character from the response text
            gene_line = response.text.strip()
            # add the gene details to the dictionary
            gene_details[gene_id] = gene_line
            print(gene_line)
        else:
            print("Failed to retrieve gene data. Status code:", response.status_code)
    return gene_details

# Call the function to fetch gene details
# this sends one request per gene, so it will take a while to run (c. 15 minutes)
gene_details = fetch_gene_details(gene_ids)

hsa:10000	AKT3, MPPH, MPPH2, PKB-GAMMA, PKBG, PRKBG, RAC-PK-gamma, RAC-gamma, STK-2; AKT serine/threonine kinase 3
hsa:10023	FRAT1; FRAT regulator of WNT signaling pathway 1
hsa:100506742	CASP12, CASP-12, CASP12P1; caspase 12 (gene/pseudogene)
hsa:100532726	NDUFC2-KCTD14; NDUFC2-KCTD14 readthrough
hsa:10105	PPIF, CYP3, CyP-M, Cyp-D, CypD; peptidylprolyl isomerase F
hsa:102	ADAM10, AD10, AD18, CD156c, CDw156, HsT18717, MADM, RAK, kuz; ADAM metallopeptidase domain 10
hsa:1020	CDK5, LIS7, PSSALRE; cyclin dependent kinase 5
hsa:10213	PSMD14, PAD1, POH1, RPN11; proteasome 26S subunit, non-ATPase 14
hsa:102800317	TPTEP2-CSNK1E, LOC400927-CSNK1E; TPTEP2-CSNK1E readthrough
hsa:10297	APC2, APCL, MRT74; APC regulator of WNT signaling pathway 2
hsa:10313	RTN3, ASYIP, HAP, NSPL2, NSPLII, RTN3-A1; reticulon 3
hsa:10376	TUBA1B, K-ALPHA-1; tubulin alpha 1b
hsa:10381	TUBB3, CDCBM, CDCBM1, CFEOM3, CFEOM3A, FEOM3, TUBB4, beta-4; tubulin beta 3 class III
hsa:10382	TUBB4A, DYT4, TUBB4, beta-5; tubulin bet
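
The loop above sends one request per gene, which is why it takes so long. A possible speed-up, sketched here, is to ask for several entries in one request. This assumes that the KEGG REST `list` operation accepts multiple entries joined with `+`; the batch size of 10 is also an assumption, so check the KEGG API documentation before relying on it. The helper names `chunked` and `build_kegg_list_urls` are hypothetical.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    items = list(items)  # also accepts a pandas Series
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_kegg_list_urls(gene_ids, batch_size=10):
    """Build one KEGG `list` URL per batch of gene IDs, joined with '+'."""
    base_url = "http://rest.kegg.jp/list/"
    return [base_url + "+".join(batch) for batch in chunked(gene_ids, batch_size)]


# Offline illustration: three IDs in batches of two give two URLs
urls = build_kegg_list_urls(["hsa:10000", "hsa:10023", "hsa:102"], batch_size=2)
```

Each batched response would then contain one line per entry, so the text needs to be split on newlines before it is stored in `gene_details`.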

In [13]:
# save gene_details as a pickle file
import pickle
with open('gene_details.pkl', 'wb') as file:
    pickle.dump(gene_details, file)
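
Since the download is slow, the pickle file is most useful if later runs read it back instead of fetching everything again. A minimal sketch of that pattern follows; the helper name `load_or_compute` is hypothetical and not part of the weekly snippets.

```python
import os
import pickle


def load_or_compute(path, compute):
    """Return the pickled object at `path`; compute and save it first if absent."""
    if os.path.exists(path):
        with open(path, "rb") as file:
            return pickle.load(file)
    result = compute()
    with open(path, "wb") as file:
        pickle.dump(result, file)
    return result
```

In this notebook it could be called as `gene_details = load_or_compute('gene_details.pkl', lambda: fetch_gene_details(gene_ids))`. Only load pickle files that you created yourself, because unpickling untrusted data can execute arbitrary code.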

In [20]:
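# extract the first element after the hsa:12344 number
gene_symbols = []
for gene_id in gene_details:
    gene_symbol_synonym = gene_details[gene_id].split('\t')[1]
    # keep the text before the first comma, then before the first semicolon
    # (the semicolon split covers genes that have a single symbol and no synonyms)
    # NB: entries with no symbol at all yield their description instead
    gene_symbol = gene_symbol_synonym.split(',')[0].split(';')[0]
    gene_symbols.append(gene_symbol)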

print(gene_symbols)

['AKT3', 'FRAT1', 'CASP12', 'NDUFC2-KCTD14', 'PPIF', 'ADAM10', 'CDK5', 'PSMD14', 'TPTEP2-CSNK1E', 'APC2', 'RTN3', 'TUBA1B', 'TUBB3', 'TUBB4A', 'TUBB4B', 'ATP5PD', 'cytochrome c oxidase subunit NDUFA4-like', 'UQCR11', 'P3R3URF-PIK3R3', 'ADRM1', 'FZD10', 'TUBA3E', 'CHRM1', 'CHRM3', 'CHRM5', 'TUBA3D', 'CHRNA7', 'CHUK', 'CSNK1A1L', 'tubulin beta 8B', 'tubulin beta 8B', 'tubulin beta 8B', 'COX6B2', 'NDUFA11', 'COX4I1', 'COX5B', 'COX6A1', 'COX6A2', 'COX6B1', 'COX6C', 'COX7A1', 'COX7A2', 'COX7B', 'COX7C', 'COX8A', 'PSMA8', 'CSF1', 'CSNK1A1', 'CSNK1E', 'CSNK2A1', 'CSNK2A2', 'CSNK2B', 'KLC3', 'CTNNB1', 'CYBB', 'CYC1', 'CALML6', 'DDIT3', 'COX7B2', 'AGER', 'DVL1', 'DVL2', 'DVL3', 'EIF2S1', 'SLC39A11', 'TUBB', 'AKT1', 'AKT2', 'ERN1', 'SLC39A12', 'ATG14', 'ATF6', 'DKK1', 'ATG2A', 'PLCB1', 'NCSTN', 'FRAT2', 'SLC39A14', 'BACE1', 'MTOR', 'FZD2', 'SLC39A6', 'BACE2', 'GAPDH', 'TUBB8B', 'WIPI2', 'NOX1', 'UQCRQ', 'DKK4', 'DKK2', 'SLC39A1', 'GNAQ', 'CSNK2A3', 'SLC39A5', 'GRIN1', 'GRIN2A', 'GRIN2B', 'GRIN2C
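
The list above contains a few entries such as 'tubulin beta 8B' that are descriptions, not symbols. Judging from this output, those KEGG lines appear to carry only a description with no `;` separator. The sketch below uses that observation as its working assumption and returns `None` for such lines; the function name `parse_primary_symbol` is hypothetical.

```python
def parse_primary_symbol(kegg_line):
    """Return the first gene symbol from a KEGG `list` line, or None if absent."""
    # Expected shape: "hsa:10000<TAB>AKT3, MPPH, ...; AKT serine/threonine kinase 3"
    _, _, annotation = kegg_line.partition("\t")
    symbols, separator, _description = annotation.partition(";")
    if not separator:
        # Assumption: no ';' means the field holds only a description
        return None
    return symbols.split(",")[0].strip()


# Offline illustration; the ID in the last line is made up for the example
lines = [
    "hsa:10000\tAKT3, MPPH, MPPH2; AKT serine/threonine kinase 3",
    "hsa:10023\tFRAT1; FRAT regulator of WNT signaling pathway 1",
    "hsa:0\ttubulin beta 8B",
]
parsed = [parse_primary_symbol(line) for line in lines]
```

Filtering out the `None` values before writing `alzheimers_genes.txt` would keep descriptions from being searched as if they were gene symbols.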

In [21]:
# pseudocode3
# Save the gene symbols to a file called alzheimers_genes.txt
'''###YOUR CODE HERE###'''
with open('alzheimers_genes.txt', 'w') as file:
    for gene_symbol in gene_symbols:
        file.write(f"{gene_symbol}\n")

In [33]:
'''Now use the function defined above to fetch the NCBI Gene IDs for the gene symbols you extracted above. Use PrettyTable to print a table with the gene symbol in the first column and the NCBI Gene ID in the second column.'''

# pseudocode4
# Use the function defined above to fetch the NCBI Gene IDs for the gene symbols you extracted above
'''###YOUR CODE HERE###'''
gene_ids = get_gene_ids(gene_symbols)

# save the gene_ids as a pickle file
with open('gene_ids.pkl', 'wb') as file:
    pickle.dump(gene_ids, file)

In [42]:
# pseudocode5
# Use PrettyTable to print a table with the gene symbol in the first column and the NCBI Gene ID in the second column
'''###YOUR CODE HERE###'''
from prettytable import PrettyTable

table = PrettyTable()
table.field_names = ["Gene Symbol", "NCBI Gene ID"]

for gene_symbol, gene_id in gene_ids.items():
    table.add_row([gene_symbol, str(gene_id)])

print(table)

+------------------------------------------+--------------+
|               Gene Symbol                | NCBI Gene ID |
+------------------------------------------+--------------+
|                   AKT3                   |     348      |
|                  FRAT1                   |     7124     |
|                  CASP12                  |     3569     |
|              NDUFC2-KCTD14               |     351      |
|                   PPIF                   |     3845     |
|                  ADAM10                  |     673      |
|                   CDK5                   |     207      |
|                  PSMD14                  |     4790     |
|              TPTEP2-CSNK1E               |     3553     |
|                   APC2                   |     1499     |
|                   RTN3                   |     6622     |
|                  TUBA1B                  |     5743     |
|                  TUBB3                   |     4137     |
|                  TUBB4A               
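
A quick sanity check on a table like this is to compare it with the KEGG identifiers retrieved earlier. The sketch below assumes that the number in a KEGG `hsa:` identifier is the NCBI Gene ID for that gene, which is my understanding of how KEGG labels human genes; treat any mismatch it reports as a prompt to inspect the mapping, not as proof of an error. The function name `find_mismatches` is hypothetical.

```python
def find_mismatches(gene_details, symbol_to_id):
    """List (symbol, kegg_id, ncbi_id) triples where the two sources disagree."""
    mismatches = []
    for kegg_key, line in gene_details.items():
        kegg_id = kegg_key.split(":", 1)[1]  # "hsa:10000" -> "10000"
        annotation = line.split("\t")[1]
        # Same first-symbol rule as used when building gene_symbols
        symbol = annotation.split(",")[0].split(";")[0]
        ncbi_id = symbol_to_id.get(symbol)
        if ncbi_id is not None and str(ncbi_id) != kegg_id:
            mismatches.append((symbol, kegg_id, str(ncbi_id)))
    return mismatches


# Offline illustration using two rows shaped like the data in this notebook
example_details = {
    "hsa:10000": "hsa:10000\tAKT3, MPPH; AKT serine/threonine kinase 3",
    "hsa:10023": "hsa:10023\tFRAT1; FRAT regulator of WNT signaling pathway 1",
}
example_ids = {"AKT3": "348", "FRAT1": "10023"}
problems = find_mismatches(example_details, example_ids)
```

With the notebook's own variables this would be `find_mismatches(gene_details, gene_ids)`.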

In [None]:
'''If you would like, you can repeat the above process with BioMart. You could also try retrieving other information fields from BioMart, for example the gene name, description, chromosome, and start and end position. Use one of the two BioMart snippets in the GitHub repository to help you.'''
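
As a starting point for the optional BioMart exercise, here is a rough sketch that builds a BioMart XML query and sends it to the Ensembl `martservice` endpoint. It is not taken from the GitHub snippets. The dataset name `hsapiens_gene_ensembl`, the filter `external_gene_name`, the attribute names in the usage note, and the endpoint URL are all assumptions that you should confirm in the BioMart web interface; the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(gene_symbols, attributes):
    """Build a BioMart XML query that filters on gene symbol."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    # Assumed dataset name for human genes in Ensembl BioMart
    dataset = ET.SubElement(
        query, "Dataset", {"name": "hsapiens_gene_ensembl", "interface": "default"}
    )
    # Assumed filter name; the value is a comma-separated list of symbols
    ET.SubElement(
        dataset, "Filter",
        {"name": "external_gene_name", "value": ",".join(gene_symbols)},
    )
    for attribute in attributes:
        ET.SubElement(dataset, "Attribute", {"name": attribute})
    header = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
    return header + ET.tostring(query, encoding="unicode")


def run_biomart_query(xml_query):
    """Send the query to the (assumed) Ensembl BioMart endpoint; return TSV text."""
    import requests  # imported lazily so the builder works without it

    response = requests.get(
        "http://www.ensembl.org/biomart/martservice",
        params={"query": xml_query},
        timeout=60,
    )
    response.raise_for_status()
    return response.text


# Build (but do not send) a query for two symbols
xml_query = build_biomart_query(["AKT3", "FRAT1"], ["ensembl_gene_id", "entrezgene_id"])
```

To retrieve the extra fields mentioned above you could try attribute names such as `chromosome_name`, `start_position`, `end_position` and `description`, again checking them against the web interface first.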