### Programming for Biomedical Informatics
#### Week 2 Assignment Solution - Finding & Retrieving Data

In this weekly mini assignment you will practice retrieving data, first by FTP download and then through an API, and do some very basic summarisation.

#### Bulk Download by FTP

In [None]:
'''Many of the data repositories we introduced in this week's lecture make bulk data available for download.
The benefit is that you get a local copy to use for many different purposes. It also provides you with a
time-stamped version of the data that you can refer back to if the data changes on the website. The downside
is that you may end up downloading much more data than you need, you have to understand and parse it yourself,
and your copy will likely fall out of date.

We first present some example code to download a file from an FTP repository at the NCBI NLM.'''

# Importing the required libraries
import urllib.request

# Function to download a file from an FTP server
def download_file_from_ftp(url, output_filename):
    # Try to download the file (the try/except block catches exceptions, e.g. when the file is not found or the URL is invalid)
    try:
        # Open the URL
        # The with statement here will automatically close the connection when the block is exited
        with urllib.request.urlopen(url) as response:
            # Read the data from the URL
            data = response.read()
        
            # Writing the data to a file
            with open(output_filename, 'wb') as file:
                file.write(data)
                
        print("File downloaded successfully!")
    # Catching exceptions
    except Exception as e:
        print(f"An error occurred: {e}")

'''Now that we have defined a function, we can use it to download a file from any public FTP server.'''

# Let's download the gene information file for the human genome only. Note from the file extension that this file is gzipped.
url = "https://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz"

# Local file path where you want to save the downloaded file
output_filename = "human_genes.gz"

# Calling the download function
download_file_from_ftp(url, output_filename)
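
The function above reads the whole response into memory before writing it out, which is fine for this file but can be awkward for very large bulk downloads. A minimal sketch of a chunked alternative follows. The name `stream_download` and the `chunk_size` parameter are hypothetical, and the demonstration uses a local `file://` URL so that it runs offline. The same call should also accept the NCBI URL used above.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def stream_download(url, output_filename, chunk_size=1024 * 1024):
    """Copy the response to disk in chunks instead of reading it all at once."""
    with urllib.request.urlopen(url) as response, open(output_filename, "wb") as out_file:
        # copyfileobj reads and writes chunk_size bytes at a time
        shutil.copyfileobj(response, out_file, chunk_size)
    # Return the size of the file written, in bytes
    return Path(output_filename).stat().st_size


# Offline demonstration: a local file:// URL stands in for the NCBI address
with tempfile.TemporaryDirectory() as tmp_dir:
    source = Path(tmp_dir) / "source.bin"
    source.write_bytes(b"example bytes " * 1000)
    target = Path(tmp_dir) / "copy.bin"
    n_bytes = stream_download(source.as_uri(), target)
    copied_ok = target.read_bytes() == source.read_bytes()

print(n_bytes, copied_ok)
```

Unlike `download_file_from_ftp`, this sketch lets exceptions propagate, so the caller decides how to handle a failed download.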

In [None]:
'''The pandas Python library is extremely flexible for reading in and manipulating files. Rather nicely, it can directly
handle gzipped files.'''

'''Note that you don't yet know how the file we just downloaded is structured. I'll help you by telling you that the data is tab-delimited
and has a single-line header containing the field names.'''

# import pandas
import pandas as pd

'''###YOUR CODE HERE###'''
# pseudocode step 1 - load the gzipped file directly into a pandas DataFrame
df = pd.read_csv('human_genes.gz', compression='gzip', sep='\t')

# pseudocode step 2 - display the DataFrame so you can inspect the structure of the data
df.head()

In [None]:
'''You will notice that there is a column named 'type_of_gene'. Create a table to count the number of genes in each category
found in this column. Your output should have two columns - in column 1 the title "Type of Gene" and in column 2 "Number of Genes". 
Sort the table by the second column in descending order and show the top 10 rows only in your output'''

'''###YOUR CODE HERE###'''
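# pseudocode step 3 - create a table to count the number of genes in each category
gene_type_counts = df['type_of_gene'].value_counts()

# pseudocode step 4 - sort the table by the second column in descending order
# (value_counts already sorts in descending order; sorting explicitly makes the intent clear)
gene_type_counts = gene_type_counts.sort_values(ascending=False)

# name the two columns as requested: "Type of Gene" and "Number of Genes"
gene_type_table = gene_type_counts.rename_axis('Type of Gene').reset_index(name='Number of Genes')

# pseudocode step 5 - show the top 10 rows only in your output
gene_type_table.head(10)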

QUESTION - "what is the most common type of gene?" (add the exact string for the gene type below this line)

OPTIONAL - by looking at the FTP directory here https://ftp.ncbi.nlm.nih.gov/gene/DATA/ can you use the approach above to find the top 3 genes with the most PubMed citations?
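
One possible route for the optional question is sketched below. It assumes the relevant file in that directory is `gene2pubmed.gz`, with tab-delimited columns `#tax_id`, `GeneID` and `PubMed_ID`. The rows here are made up purely to show the counting step and are not real citation data.

```python
import io

import pandas as pd

# Made-up rows in the assumed gene2pubmed layout (not real citation data)
sample = io.StringIO(
    "#tax_id\tGeneID\tPubMed_ID\n"
    "9606\t1\t100\n"
    "9606\t1\t101\n"
    "9606\t1\t102\n"
    "9606\t2\t100\n"
    "9606\t2\t103\n"
    "9606\t3\t104\n"
    "9606\t4\t105\n"
    "9606\t4\t106\n"
    "9606\t4\t107\n"
    "9606\t4\t108\n"
    "10090\t5\t109\n"
)
pubmed = pd.read_csv(sample, sep="\t")

# Keep human rows only (NCBI taxonomy ID 9606)
human = pubmed[pubmed["#tax_id"] == 9606]

# Count distinct PubMed IDs per gene and keep the three largest counts
top3 = (
    human.groupby("GeneID")["PubMed_ID"]
    .nunique()
    .sort_values(ascending=False)
    .head(3)
)
print(top3)
```

With the real file, the `io.StringIO` sample would be replaced by the downloaded path, and the resulting `GeneID` values could be matched against the `GeneID` column of the gene information table loaded earlier to recover gene symbols.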

#### Request by API

In [None]:
'''We first present some example code to download the genes linked to a pathway from the KEGG repository using its API endpoint.
The manual documenting the KEGG API is here: https://www.kegg.jp/kegg/rest/keggapi.html'''

# Importing the required libraries
import requests

# Function to download KEGG pathway data
def download_kegg_pathway_genes(pathway_id):
    # URL for the pathway data
    data_url = f"http://rest.kegg.jp/link/hsa/{pathway_id}"
    
    # Make the HTTP request for the pathway data
    response = requests.get(data_url)
    
    # Check if the request was successful
    if response.status_code == 200:
        # Write the data content to a file
        with open(f"{pathway_id}.txt", 'w') as file:
            file.write(response.text)
        print(f"Pathway data saved as {pathway_id}.txt")
    else:
        print("Failed to retrieve pathway data. Status code:", response.status_code)

# Example pathway ID for MAPK signalling
pathway_id = "hsa04010"

# Call the function to download the pathway gene data
download_kegg_pathway_genes(pathway_id)

# Look at the file you just downloaded
with open('hsa04010.txt') as file:
    print(file.read())

'''Ah, this contains internal identifiers for the genes in the pathway. We can use the KEGG API to get the gene names'''

In [8]:
'''Fetch the gene names from the data we found above'''

# Read the pathway gene file
df = pd.read_csv('hsa04010.txt', sep='\t', header=None)

# The second column contains the gene ids; use the first 5 (for speed) to fetch the gene details
gene_ids = df.iloc[:5, 1]

# Function to fetch gene details
def fetch_gene_details(gene_ids):
    for gene_id in gene_ids:
        # URL for the gene entry
        data_url = f"http://rest.kegg.jp/list/{gene_id}"
        
        # Make the HTTP request for the gene data
        response = requests.get(data_url)
        
        # Check if the request was successful
        if response.status_code == 200:
            # Strip the trailing newline from the response text before printing
            print(response.text.strip())
        else:
            print("Failed to retrieve gene data. Status code:", response.status_code)

# Call the function to fetch gene details
fetch_gene_details(gene_ids)



hsa:10000	AKT3, MPPH, MPPH2, PKB-GAMMA, PKBG, PRKBG, RAC-PK-gamma, RAC-gamma, STK-2; AKT serine/threonine kinase 3
hsa:100137049	PLA2G4B, HsT16992, cPLA2-beta; phospholipase A2 group IVB
hsa:10125	RASGRP1, CALDAG-GEFI, CALDAG-GEFII, IMD64, RASGRP; RAS guanyl releasing protein 1
hsa:10235	RASGRP2, CALDAG-GEFI, CDC25L; RAS guanyl releasing protein 2
hsa:10368	CACNG3; calcium voltage-gated channel auxiliary subunit gamma 3
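
Each line of this output packs the KEGG identifier, the gene symbol with its aliases, and a description into one string. A small sketch of how such a line could be split into fields is shown below. The function name `parse_kegg_gene_line` is hypothetical, and the code assumes the two-column layout shown above (identifier, tab, then `names; description`). If the API returns additional columns, the splitting would need adjusting.

```python
def parse_kegg_gene_line(line):
    """Split one line of the gene listing shown above into named fields."""
    # The identifier and the rest of the line are separated by a tab
    kegg_id, _, rest = line.strip().partition("\t")
    # Symbol and aliases come before the first "; ", the description after it
    names, _, description = rest.partition("; ")
    symbols = [name.strip() for name in names.split(",")]
    return {
        "kegg_id": kegg_id,
        "symbol": symbols[0],
        "aliases": symbols[1:],
        "description": description,
    }


# Two lines copied from the output above
line_a = "hsa:10368\tCACNG3; calcium voltage-gated channel auxiliary subunit gamma 3"
line_b = "hsa:10235\tRASGRP2, CALDAG-GEFI, CDC25L; RAS guanyl releasing protein 2"

record_a = parse_kegg_gene_line(line_a)
record_b = parse_kegg_gene_line(line_b)
print(record_a)
print(record_b)
```

Returning a dictionary per gene makes it straightforward to collect the results into a list and build a DataFrame from them.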


In [9]:
'''Now try finding the pathway id (like the hsa04010 identifier above) for Alzheimer's Disease.
You will probably find it easier to look this up in a web browser, but it is possible to use the API directly. Once you've found it, use code
similar to the above to download the gene data for this pathway, then fetch and list the details for the first 5 genes.'''


# pseudocode 1 - find the pathway id for Alzheimer's Disease
'''You could use the search feature of the API to find the pathway id for Alzheimer's Disease, but for this example it is quicker to look it up in the browser.'''
pathway_id = "hsa05010"

# pseudocode 2 - download the pathway gene data
download_kegg_pathway_genes(pathway_id)

# pseudocode 3 - fetch the gene details themselves
df = pd.read_csv('hsa05010.txt', sep='\t', header=None)

# The second column contains the gene ids; use the first 5 (for speed) to fetch the gene details
gene_ids = df.iloc[:5, 1]

# Function to fetch gene details (same as in the previous cell, repeated so this cell is self-contained)
def fetch_gene_details(gene_ids):
    for gene_id in gene_ids:
        # URL for the gene entry
        data_url = f"http://rest.kegg.jp/list/{gene_id}"
        
        # Make the HTTP request for the gene data
        response = requests.get(data_url)
        
        # Check if the request was successful
        if response.status_code == 200:
            # Strip the trailing newline from the response text before printing
            print(response.text.strip())
        else:
            print("Failed to retrieve gene data. Status code:", response.status_code)

# Call the function to fetch gene details
fetch_gene_details(gene_ids)


Pathway data saved as hsa05010.txt
hsa:10000	AKT3, MPPH, MPPH2, PKB-GAMMA, PKBG, PRKBG, RAC-PK-gamma, RAC-gamma, STK-2; AKT serine/threonine kinase 3
hsa:10023	FRAT1; FRAT regulator of WNT signaling pathway 1
hsa:100506742	CASP12, CASP-12, CASP12P1; caspase 12 (gene/pseudogene)
hsa:100532726	NDUFC2-KCTD14; NDUFC2-KCTD14 readthrough
hsa:10105	PPIF, CYP3, CyP-M, Cyp-D, CypD; peptidylprolyl isomerase F
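
The exercise text mentions that the pathway id can also be found through the API rather than the browser. A hedged sketch of that route follows, using the `find` operation described in the KEGG API manual. The helper names are hypothetical, and the response format (one `path:<id>`, tab, name per line) is an assumption, so the example text below is illustrative rather than captured from the live service. The sketch uses `urllib` to stay self-contained; `requests` would work just as well.

```python
import urllib.parse
import urllib.request

# Assumed endpoint pattern for keyword search in the pathway database
KEGG_FIND_URL = "https://rest.kegg.jp/find/pathway/{query}"


def parse_find_response(text):
    """Turn tab-separated 'id<TAB>name' lines into (pathway_id, name) tuples."""
    results = []
    for line in text.splitlines():
        if not line.strip():
            continue
        entry_id, _, name = line.partition("\t")
        # Drop a database prefix such as 'path:' if one is present
        results.append((entry_id.split(":")[-1], name))
    return results


def find_pathways(query):
    """Query KEGG for pathways matching a keyword (needs network access)."""
    url = KEGG_FIND_URL.format(query=urllib.parse.quote(query))
    with urllib.request.urlopen(url) as response:
        return parse_find_response(response.read().decode("utf-8"))


# Illustrative response text in the assumed format (not fetched from KEGG)
example_text = "path:map05010\tAlzheimer disease\n"
parsed = parse_find_response(example_text)
print(parsed)
```

A search of this kind would be expected to return reference pathway ids with a `map` prefix. The human-specific id used in this notebook keeps the same number with the `hsa` prefix, i.e. `hsa05010`.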
