8-oxoG arises in both DNA and RNA during normal cellular processes. Our research in the Resendiz lab focuses on polynucleotide phosphorylase (PNPase), one of the enzymes responsible for degrading oxidatively damaged ("oxidized") RNA, and on how this degradation has been shown to stall when PNPase encounters 8-oxoG. My project has focused on the relationship between this stalled degradation and the specific sequence of the RNA being degraded. With this project, I hope to probe the human genome for post-transcriptional, protein-coding RNAs and check them for the specific sequence associated with increased 8-oxoG stalling.

The code is broken into two main portions: the first pulls protein-coding RNA sequences from an API, and the second searches those RNA sequences for a motif.

First come the imports that the rest of the code relies on.

In [None]:
import requests
import xml.etree.ElementTree as ET
import time
import pickle
import os
import matplotlib.pyplot as plt

Caching in this program is implemented to avoid repeated requests to the NCBI API for the same RNA sequence data, improving performance and reducing redundant calls to the external service.

The cached RNA sequences are stored in a file, called rna_sequences_cache.pkl, using Python's pickle module. The cache file contains a dictionary that maps each protein_id to its corresponding RNA sequence. The data is stored in binary format using pickle, which allows efficient reading and writing of the data.

In [None]:
CACHE_FILE = "rna_sequences_cache.pkl"

Each of the individual functions is defined below the cache file definition.

The first function, search_ncbi_rna, acquires a list of IDs. However, the IDs come back embedded in raw XML and aren't in a workable state, so a converter needs to be built to get them into a list that can be plugged into the other functions.

In [None]:
def search_ncbi_rna(query):
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {
        'db': 'nucleotide',
        'term': query,
        'retmode': 'xml'
    }
    response = requests.get(base_url, params=params)
    return response.text
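
By default, esearch returns only the first 20 matching IDs, so a broad query may be silently truncated. Below is a minimal sketch of how the request parameters could be extended, assuming the standard E-utilities `retmax` and `retstart` paging parameters. The helper name `build_esearch_params` is hypothetical and not part of the original code, and no request is actually sent here.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_params(query, retmax=100, retstart=0):
    """Hypothetical helper: assemble esearch parameters with paging."""
    return {
        "db": "nucleotide",
        "term": query,
        "retmode": "xml",
        "retmax": retmax,      # maximum number of IDs returned per call
        "retstart": retstart,  # index of the first ID to return
    }

params = build_esearch_params("PolynucleotidePhosphorylase", retmax=50)
# Show the URL that would be requested, without contacting NCBI
print(f"{ESEARCH_URL}?{urlencode(params)}")
```

The resulting dictionary could be passed as `params=` to `requests.get` in the same way search_ncbi_rna does.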


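The extract_protein_ids function parses the XML response received from the NCBI API (specifically from the esearch.fcgi endpoint) and extracts the IDs from it. Because the search runs against the nucleotide database, these are nucleotide record IDs, which the code refers to as protein IDs.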

In [None]:
def extract_protein_ids(xml_response):
    protein_ids = []
    root = ET.fromstring(xml_response)
    
    for id_elem in root.findall('.//Id'):
        protein_ids.append(id_elem.text)
    
    return protein_ids
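
To illustrate what the parser works on, here is a small offline sketch. The XML is a hand-written sample in the general shape of an esearch response, and the IDs in it are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like an esearch response; the IDs are made up.
SAMPLE_XML = """<eSearchResult>
  <Count>2</Count>
  <RetMax>2</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>111111</Id>
    <Id>222222</Id>
  </IdList>
</eSearchResult>"""

root = ET.fromstring(SAMPLE_XML)

# './/Id' matches every <Id> element at any depth below the root
ids = [id_elem.text for id_elem in root.findall(".//Id")]
total_hits = root.findtext("Count")

print(ids)
print(total_hits)
```

Comparing the `Count` element against the number of IDs returned is one way to notice when a search has more hits than were sent back.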

The fetch_rna_sequence function below retrieves the RNA sequence associated with a specific protein ID. It first checks whether the sequence for that ID is already cached (to avoid redundant API calls). If it is not, the function fetches the sequence from the NCBI database in FASTA format, caches it for future use, and then returns it.

In [None]:
def fetch_rna_sequence(protein_id):
    # Check if RNA sequence is already cached
    cached_sequence = check_cache(protein_id)
    if cached_sequence:
        print(f"Using cached RNA sequence for {protein_id}")
        return cached_sequence

    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    params = {
        'db': 'nucleotide',
        'id': protein_id,
        'rettype': 'fasta',
        'retmode': 'text'
    }
    response = requests.get(base_url, params=params)
    
    if response.status_code == 200:
        rna_sequence = response.text
        # Cache the fetched RNA sequence
        cache_rna_sequence(protein_id, rna_sequence)
        return rna_sequence
    else:
        print(f"Error fetching RNA for ID {protein_id}: {response.status_code}")
        return None
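
Because the request uses `rettype` `fasta`, the returned text is a `>` header line followed by the sequence wrapped across several lines, so a motif that spans a line break would be missed if the raw text were searched directly. The sketch below shows one way to reduce a FASTA record to a bare sequence. The helper name `fasta_to_sequence` and the sample record are hypothetical. The conversion from T to U assumes the record is written in the DNA alphabet, which, as far as I know, is how NCBI nucleotide records are stored.

```python
def fasta_to_sequence(fasta_text):
    """Hypothetical helper: drop '>' header lines and join the wrapped sequence lines."""
    lines = fasta_text.splitlines()
    return "".join(
        line.strip() for line in lines if line.strip() and not line.startswith(">")
    )

# Made-up record for illustration only
sample_fasta = ">XX_000000.1 example record\nATGGA\nGGATC\n"

dna_sequence = fasta_to_sequence(sample_fasta)
# Assumption: the record uses T, so convert to the RNA alphabet before searching for RNA motifs
rna_sequence = dna_sequence.replace("T", "U")

print(dna_sequence)
print(rna_sequence)
```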

The check_cache function below is designed to check whether an RNA sequence for a given protein_id exists in the cache. If it does, the function retrieves and returns the cached sequence. If not, it returns None, indicating that the sequence is not available in the cache.

In [None]:
def check_cache(protein_id):
    # Check if the cache file exists
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE, "rb") as f:
            cached_data = pickle.load(f)
            return cached_data.get(protein_id)  # Return cached sequence if exists
    return None

The cache_rna_sequence function stores the RNA sequence associated with a given protein_id in the cache file. If the cache already exists, it loads the existing cache, adds the new RNA sequence to it, and then saves the updated cache back to the file. If the cache doesn't exist, it creates a new cache and saves the RNA sequence.

In [None]:
def cache_rna_sequence(protein_id, rna_sequence):
    # Load existing cache if it exists, otherwise create a new one
    cached_data = {}
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE, "rb") as f:
            cached_data = pickle.load(f)
    
    # Add the new RNA sequence to the cache
    cached_data[protein_id] = rna_sequence
    
    # Save the updated cache back to the file
    with open(CACHE_FILE, "wb") as f:
        pickle.dump(cached_data, f)
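
The load, update, and save cycle described above can be tried in isolation. The sketch below uses a temporary directory so it does not touch the real cache file. The helper names `load_cache` and `save_cache` and the sample entry are hypothetical, not part of the original code.

```python
import os
import pickle
import tempfile

def load_cache(path):
    """Return the cached dictionary, or an empty one if the file does not exist yet."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {}

def save_cache(path, cache):
    """Write the whole dictionary back to disk in pickle's binary format."""
    with open(path, "wb") as f:
        pickle.dump(cache, f)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_cache.pkl")

    cache = load_cache(demo_path)   # empty dict on the first run
    cache["111111"] = "AUGGAGGA"    # made-up ID and sequence
    save_cache(demo_path, cache)

    reloaded = load_cache(demo_path)

print(reloaded)
```

Loading the dictionary once, updating it in memory, and saving it at the end of a run would avoid rereading and rewriting the file for every sequence, at the cost of losing unsaved entries if the run is interrupted.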

The count_sequence_occurrences function counts the number of times a specific RNA sequence (sequence) appears within a larger RNA sequence (rna_sequence). It relies on Python's str.count, which counts non-overlapping occurrences.

In [None]:
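def count_sequence_occurrences(sequence, rna_sequence):
    # Drop FASTA header lines and join the wrapped lines so motifs spanning a line break are counted
    bases = "".join(line for line in rna_sequence.splitlines() if not line.startswith(">"))
    return bases.count(sequence)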
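
Python's `str.count` counts non-overlapping matches only. That makes no difference for a motif such as "GGA", which cannot overlap with itself, but it would undercount a motif such as "GG". A minimal sketch of an overlapping count is shown below; the function name `count_overlapping` is hypothetical.

```python
def count_overlapping(motif, sequence):
    """Hypothetical helper: count every position where motif starts, overlaps included."""
    if not motif:
        return 0
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        # Resume one base after the last match start so overlapping matches are found
        start = sequence.find(motif, start + 1)
    return count

non_overlapping = "GGGG".count("GG")
overlapping = count_overlapping("GG", "GGGG")

print(non_overlapping)  # 2
print(overlapping)      # 3
```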

The plot_occurrences function visualizes RNA sequence occurrences across different protein IDs using a bar chart. It takes a dictionary, occurrences, as input, where the keys are protein IDs and the values are the number of times a specific RNA sequence appears in the corresponding protein's RNA sequence.

In [None]:
def plot_occurrences(occurrences):
    # Extract protein IDs and their corresponding occurrence counts
    protein_ids = list(occurrences.keys())
    counts = list(occurrences.values())

    # Create a bar chart
    plt.figure(figsize=(10, 6))
    plt.bar(protein_ids, counts, color='blue')

    # Add labels and title
    plt.xlabel('Protein ID')
    plt.ylabel('Occurrences of Specific RNA Sequence')
    plt.title('Occurrences of Specific RNA Sequence in Protein Sequences')
    
    # Rotate the x-axis labels for better readability
    plt.xticks(rotation=90)
    
    # Show the plot
    plt.tight_layout()
    plt.show()


The main function orchestrates the entire process: querying the API, fetching RNA sequences, counting occurrences of a specific RNA sequence, and then visualizing the results with a bar chart.

In [None]:
def main():
    # Search for RNA associated with the protein
    query = "PolynucleotidePhosphorylase"
    xml_results = search_ncbi_rna(query)
    protein_ids = extract_protein_ids(xml_results)
    
    # Define the RNA sequence you want to count
    specific_rna_sequence = "GGA"  # Example RNA sequence

    # Fetch RNA sequences for the protein IDs
    occurrences = {}
    for protein_id in protein_ids:
        rna_sequence = fetch_rna_sequence(protein_id)
        if rna_sequence:
            count = count_sequence_occurrences(specific_rna_sequence, rna_sequence)
            occurrences[protein_id] = count
        
        # Delay to avoid hitting the API rate limit
        time.sleep(1)  # Sleep for 1 second between requests
    
    # Print the results
    for protein_id, count in occurrences.items():
        print(f"Protein ID: {protein_id}, Occurrences of '{specific_rna_sequence}': {count}")

    # Visualize the results
    plot_occurrences(occurrences)

if __name__ == "__main__":
    main()