8-oxoG arises in both DNA and RNA during normal cellular processes. Our research in the Resendiz lab focuses on polynucleotide phosphorylase (PNPase), one of the enzymes responsible for degrading oxidatively damaged ("oxidized") RNA, and on how this degradation has been shown to stall when PNPase encounters 8-oxoG. My project focuses on the relationship between this stalled degradation and the specific sequence of the RNA being degraded. With this project, I hope to probe the human genome for post-transcriptional, protein-coding RNA and check it for the specific sequence associated with increased 8-oxoG stalling.

The code is broken into two main portions: the first pulls RNA sequences for a protein from an API, and the second searches those sequences for a motif.

The first component of the program is attempted below. It queries the NCBI (NIH) database for IDs related to any protein in question. In this example, "Neprilysin" is used.

In [None]:
import requests

def search_ncbi_rna(query):
    # Query the NCBI eSearch endpoint for records matching the search term
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {
        'db': 'nucleotide',
        'term': query,
        'retmode': 'xml'
    }
    response = requests.get(base_url, params=params)
    return response.text

# Example search for "Neprilysin"; the result is raw XML text, not yet a list of IDs
xml_results = search_ncbi_rna("Neprilysin")
print(xml_results)


The code block above acquires the IDs, but they come back embedded in raw XML, which is not a workable state. A converter needs to be built to get the IDs into a list that can be plugged back into other functions.
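To make the conversion concrete without calling the API, here is a minimal sketch that parses a hand-written sample shaped like an eSearch XML response. The sample text and its ID values are made-up placeholders (not real NCBI records), and `ids_from_esearch_xml` is a hypothetical helper name used only for this illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample mimicking the shape of an eSearch response.
# The ID values are placeholders, not real NCBI records.
SAMPLE_XML = """<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>1111111111</Id>
    <Id>2222222222</Id>
  </IdList>
</eSearchResult>"""

def ids_from_esearch_xml(xml_text):
    """Return the text of every <Id> element as a list of strings."""
    root = ET.fromstring(xml_text)
    return [id_elem.text for id_elem in root.findall(".//Id")]

sample_ids = ids_from_esearch_xml(SAMPLE_XML)
print(sample_ids)
```

The IDs come out as strings, which is convenient because they are passed straight back to the API as request parameters.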

The block below combines the ideas above, adding an ID extractor function whose output is printed as a list of IDs.

In [None]:
import requests
import xml.etree.ElementTree as ET

def search_ncbi_rna(query):
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"  #base url to probe the nlm api
    params = {
        'db': 'nucleotide',
        'term': query,
        'retmode': 'xml'
    }
    response = requests.get(base_url, params=params)
    return response.text

def extract_protein_ids(xml_response):
    protein_ids = []
    root = ET.fromstring(xml_response)
    
    # Iterate over the XML structure to find IDs
    for id_elem in root.findall('.//Id'):  # root is the parsed XML tree; findall('.//Id') returns every <Id> element at any depth
        protein_ids.append(id_elem.text)  # id_elem is one <Id> element; its text (the ID itself) is appended to the list
    
    return protein_ids

# Example search for "Neprilysin"
xml_results = search_ncbi_rna("Neprilysin")
protein_ids = extract_protein_ids(xml_results)

# Print the list of protein IDs
print(protein_ids)


Now that we have a workable list of IDs, it can be passed to another function that fetches the full nucleotide sequence for each ID in the list. That function is added in the block below.

In [None]:
import requests
import xml.etree.ElementTree as ET
import time

def search_ncbi_rna(query):
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {
        'db': 'nucleotide',
        'term': query,
        'retmode': 'xml'
    }
    response = requests.get(base_url, params=params)
    return response.text

def extract_protein_ids(xml_response):
    protein_ids = []
    root = ET.fromstring(xml_response)
    
    # Parsing happens locally (no network request), so no delay is needed here
    for id_elem in root.findall('.//Id'):
        protein_ids.append(id_elem.text)
    
    return protein_ids

def fetch_rna_sequence(protein_id):
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    params = {
        'db': 'nucleotide',
        'id': protein_id,
        'rettype': 'fasta',
        'retmode': 'text'
    }
    response = requests.get(base_url, params=params)
    
    if response.status_code == 200:
        return response.text
    else:
        print(f"Error fetching RNA for ID {protein_id}: {response.status_code}")
        return None

def main():
    # Search for RNA associated with the protein
    query = "Neprilysin"
    xml_results = search_ncbi_rna(query)
    protein_ids = extract_protein_ids(xml_results)
    
    # Fetch RNA sequences for the protein IDs
    rna_sequences = []  # start empty so that only fetched sequences are printed
    for protein_id in protein_ids:
        rna_sequence = fetch_rna_sequence(protein_id)
        if rna_sequence:
            rna_sequences.append(rna_sequence)
        
        # Delay to avoid hitting the API rate limit
        time.sleep(1)  # Sleep for 1 second between requests
    
    # Print all RNA sequences
    for seq in rna_sequences:
        print(seq)

if __name__ == "__main__":
    main()


The output above is closer to what was originally desired: the full sequence data for each record can be opened and interacted with in the Python environment. The only missing piece is a function that scans the full sequences and counts occurrences of a specific sequence of interest.
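Because the fetched text is in FASTA format (a header line starting with `>` followed by the sequence wrapped over several lines), it helps to separate the header from the sequence before scanning. The sketch below shows one way to do that. `parse_fasta` is a hypothetical helper name, and the sample records are made up for illustration.

```python
def parse_fasta(fasta_text):
    """Split FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are joined so wrapped lines form one string
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Made-up sample records, not real NCBI data
SAMPLE_FASTA = (
    ">example_1 made-up record\n"
    "GGTTAAC\n"
    "CGGTTA\n"
    ">example_2 another made-up record\n"
    "ACGT\n"
)
records = parse_fasta(SAMPLE_FASTA)
for header, sequence in records:
    print(header, len(sequence))
```

Joining the wrapped lines matters for motif searching, since a motif that straddles a line break would otherwise be missed.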

# The portions below are cleaned-up, more succinct versions of the larger code blocks above, forming the final combination of functions that completes the desired task

The first portion of the code, and the first function, is the API query that returns the XML containing the IDs.

In [None]:
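import time
import xml.etree.ElementTree as ET

import requests

def search_ncbi_rna(query):
    # Query the NCBI eSearch endpoint and return the raw XML text containing matching IDs
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {
        'db': 'nucleotide',
        'term': query,
        'retmode': 'xml'
    }
    response = requests.get(base_url, params=params)
    return response.text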
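Since the goal is human protein-coding RNA, the search term can be narrowed before it is sent. The sketch below only builds the parameter dictionary and makes no network call. `build_esearch_params` is a hypothetical helper, and the `[Organism]` and `biomol_mrna[PROP]` filters and the `retmax` value are assumptions that should be checked against the Entrez documentation before relying on them.

```python
def build_esearch_params(protein_name, organism="Homo sapiens", retmax=100):
    """Build eSearch parameters restricted to one organism and to mRNA records.

    The field tags used in the term are assumptions to verify against
    the Entrez search-field documentation.
    """
    term = f'{protein_name} AND "{organism}"[Organism] AND biomol_mrna[PROP]'
    return {
        "db": "nucleotide",
        "term": term,
        "retmode": "xml",
        "retmax": retmax,  # assumed cap on how many IDs come back per request
    }

params = build_esearch_params("Neprilysin")
print(params["term"])
```

The resulting dictionary could be passed as `params` to `requests.get` in place of the one built inside `search_ncbi_rna`.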

The following portion of code pulls the IDs out of the XML and transforms them into a usable format, i.e. a list of individual IDs that can be plugged into a second API call to pull the sequence for each ID.

In [None]:
def extract_protein_ids(xml_response):
    protein_ids = []
    root = ET.fromstring(xml_response)
    
    # Parsing happens locally (no network request), so no delay is needed here
    for id_elem in root.findall('.//Id'):
        protein_ids.append(id_elem.text)
    
    return protein_ids


The code below fetches the full sequence for each of the IDs. It will be called in a for loop in the main portion.

In [None]:
def fetch_rna_sequence(protein_id):
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    params = {
        'db': 'nucleotide',
        'id': protein_id,
        'rettype': 'fasta',
        'retmode': 'text'
    }
    response = requests.get(base_url, params=params)
    
    if response.status_code == 200:
        return response.text
    else:
        print(f"Error fetching RNA for ID {protein_id}: {response.status_code}")
        return None


The short portion below searches through the full sequences and counts the number of times the small fragment sequence we are looking for shows up. This will be useful when probing proteins because, if the specific sequence is much more common in the RNA of certain proteins than others, this might increase the probability of a potential site of oxidation and mutation.

In [None]:
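def count_sequence_occurrences(fasta_text, motif):
    # Drop FASTA header lines and join the wrapped sequence lines, so the header is not searched and motifs spanning a line break are counted
    sequence = "".join(line.strip() for line in fasta_text.splitlines() if not line.startswith(">"))
    return sequence.upper().count(motif.upper())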
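One caveat is that `str.count` counts non-overlapping matches only, so a repetitive motif can be undercounted. As a hedged alternative, the sketch below counts overlapping matches with a regular-expression lookahead. It also converts `U` to `T` on both inputs, on the assumption that the fetched records use the DNA alphabet while a motif may be written as RNA. `count_motif_overlapping` is a hypothetical name, not part of the code above.

```python
import re

def count_motif_overlapping(sequence, motif):
    """Count occurrences of motif in sequence, including overlapping ones."""
    # Normalise case and alphabet so an RNA-style motif (with U) still matches
    sequence = sequence.upper().replace("U", "T")
    motif = motif.upper().replace("U", "T")
    if not motif:
        return 0
    # A zero-width lookahead matches at every start position of the motif
    pattern = re.compile(f"(?={re.escape(motif)})")
    return sum(1 for _ in pattern.finditer(sequence))

print(count_motif_overlapping("GGGG", "GG"))
print("GGGG".count("GG"))
```

Whether overlapping matches should count depends on the biological question, so this is offered as an option rather than a replacement.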

Below is the main function. It combines all the previous functions in a for loop to pull the nucleotide sequence data for each of the protein and RNA fragments identified by the first API interrogation, and then counts the number of times the specific RNA sequence arises. This can then be used to identify notable RNA fragments or proteins for further investigation.

In [None]:
def main():
    # Search for RNA associated with the protein
    query = "Neprilysin"
    xml_results = search_ncbi_rna(query)
    protein_ids = extract_protein_ids(xml_results)
    
    # Define the RNA sequence you want to count
    specific_rna_sequence = "GGTTA"  # Example RNA sequence

    # Fetch RNA sequences for the protein IDs
    occurrences = {}
    for protein_id in protein_ids:
        rna_sequence = fetch_rna_sequence(protein_id)
        if rna_sequence:
            count = count_sequence_occurrences(rna_sequence, specific_rna_sequence)
            occurrences[protein_id] = count
        
        # Delay to avoid hitting the API rate limit
        time.sleep(1)  # Sleep for 1 second between requests
    
    # Print the results
    for protein_id, count in occurrences.items():
        print(f"Protein ID: {protein_id}, Occurrences of '{specific_rna_sequence}': {count}")

if __name__ == "__main__":
    main()
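Raw counts favor longer sequences, so comparing records may be easier after normalizing by length. The sketch below computes occurrences per 1,000 nucleotides and ranks records by that value. `motif_density_per_kb` is a hypothetical helper, and the IDs, counts, and lengths in `example` are made-up numbers for illustration only.

```python
def motif_density_per_kb(count, sequence_length):
    """Return motif occurrences per 1,000 nucleotides."""
    if sequence_length <= 0:
        return 0.0
    return 1000.0 * count / sequence_length

# Made-up (count, sequence length) pairs keyed by placeholder IDs
example = {"id_A": (4, 2000), "id_B": (3, 500)}

# Rank the records from highest to lowest motif density
ranked = sorted(
    example,
    key=lambda record_id: motif_density_per_kb(*example[record_id]),
    reverse=True,
)
for record_id in ranked:
    count, length = example[record_id]
    print(record_id, motif_density_per_kb(count, length))
```

Here the shorter record ranks first despite its lower raw count, which is the kind of difference a length-normalized comparison is meant to surface.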