<a href="https://colab.research.google.com/github/roaring60s/CYP2D6_CNV_Analysis/blob/main/CYP2D6_exons_coordinates.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Text Cell:**

-----

# Retrieve *CYP2D6* Exon Coordinates from Ensembl

This notebook contains a Python script to programmatically query the Ensembl database for the human *CYP2D6* gene (ENSG00000100197). It specifically targets the MANE Select transcript (ENST00000645361) on the GRCh38 assembly.

The script performs the following actions:

1.  Connects to the Ensembl REST API.
2.  Retrieves the full data for the *CYP2D6* gene, including all its transcripts and exons.
3.  Identifies the specific MANE Select transcript.
4.  Extracts the genomic coordinates for each exon within that transcript.
5.  Sorts the exons according to their biological order (Exon 1, 2, 3...), accounting for the gene's position on the reverse strand.
6.  Saves the results into a tab-separated values (TSV) file named `CYP2D6_GRCh38_MANE_exons.tsv`.

After running the code, you can find the output file in the Colab file browser on the left-hand side.
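The strand-aware ordering in step 5 can be illustrated in isolation. The following is a minimal sketch, not the notebook's own code: `order_exons` is a helper name introduced here for illustration, and the coordinates are made-up placeholders rather than real *CYP2D6* positions.

```python
def order_exons(exons, strand):
    """Return exons in transcript (5' to 3') order.

    On the reverse strand (-1) the 5'-most exon has the highest genomic
    start, so the sort is descending; on the forward strand it is ascending.
    """
    return sorted(exons, key=lambda e: e["start"], reverse=(strand == -1))


# Hypothetical coordinates, for illustration only.
toy_exons = [
    {"id": "exon_b", "start": 200, "end": 260},
    {"id": "exon_a", "start": 400, "end": 450},
    {"id": "exon_c", "start": 100, "end": 150},
]

ordered_ids = [e["id"] for e in order_exons(toy_exons, strand=-1)]
print(ordered_ids)  # ['exon_a', 'exon_b', 'exon_c']
```

With `strand=-1`, the exon with the largest start coordinate is numbered first, which is the same rule the script applies to the MANE Select transcript.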

-----

### **Code Cell:**

import requests
import csv
import sys

def get_cyp2d6_mane_exon_coordinates():
    """
    Retrieves the genomic coordinates for all exons of the CYP2D6 MANE Select
    transcript (ENST00000645361) from the Ensembl REST API (GRCh38) and
    saves them to a TSV file.
    """
    # Step 1: Construct the API Request
    # Define the Ensembl server and the specific endpoint for gene lookup by ID.
    # The 'expand=1' parameter is crucial for fetching all related features
    # (transcripts, exons) in a single call.
    server = "https://rest.ensembl.org"
    gene_id = "ENSG00000100197"  # Ensembl Gene ID for CYP2D6
    ext = f"/lookup/id/{gene_id}"
    headers = {"Content-Type": "application/json"}
    params = {"expand": 1}

    # Step 2: Execute the Request and Handle Errors
    try:
        print(f"Querying Ensembl for gene: {gene_id}...")
        # A timeout keeps the cell from hanging indefinitely if the server does not respond.
        r = requests.get(server + ext, headers=headers, params=params, timeout=30)

        # Check if the request was successful. If not, raise an exception.
        r.raise_for_status()
        print("Successfully retrieved data from Ensembl.")

    except requests.exceptions.HTTPError as err:
        print(f"HTTP Error occurred: {err}", file=sys.stderr)
        return
    except requests.exceptions.RequestException as err:
        print(f"An error occurred during the request: {err}", file=sys.stderr)
        return

    # Step 3: Parse the Nested JSON Response
    # The response text is loaded into a Python dictionary for easy navigation.
    decoded_data = r.json()

    # Step 4: Isolate the MANE Select Transcript
    # The MANE Select transcript provides a stable, standard reference.
    # We iterate through all transcripts of the gene to find the correct one.
    mane_transcript_id = "ENST00000645361"
    target_transcript = None

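    if 'Transcript' in decoded_data:
        # Transcripts are nested under the 'Transcript' key of the gene record.
        for transcript in decoded_data['Transcript']:
            # Match on the stable transcript ID alone; no MANE flag is assumed
            # to be present in the response. startswith tolerates a version suffix.
            if transcript.get('id', '').startswith(mane_transcript_id):
                target_transcript = transcript
                print(f"Found MANE Select transcript: {transcript['id']}")
                break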

    if not target_transcript:
        print(f"Error: MANE Select transcript {mane_transcript_id} not found for gene {gene_id}.", file=sys.stderr)
        return

    # Step 5: Extract and Sort Exon Data
    # Exons are not guaranteed to be in order in the API response.
    # They must be sorted by their genomic start position to be numbered correctly.
    if 'Exon' not in target_transcript:
        print(f"Error: No exons found for transcript {target_transcript['id']}.", file=sys.stderr)
        return

    exons = target_transcript['Exon']
    # The gene is on the reverse strand, so for correct 5' to 3' numbering (Exon 1, 2, 3...),
    # we sort in descending order of genomic start position.
    sorted_exons = sorted(exons, key=lambda e: e['start'], reverse=True)

    extracted_data = []
    for i, exon in enumerate(sorted_exons):
        exon_number = i + 1
        length = exon['end'] - exon['start'] + 1
        extracted_data.append({
            'Exon_Number': exon_number,
            'Ensembl_Exon_ID': exon['id'],
            'Chromosome': exon['seq_region_name'],
            'Start': exon['start'],
            'End': exon['end'],
            'Strand': exon['strand'],
            'Length_bp': length
        })

    print(f"Extracted {len(extracted_data)} exons.")

    # Step 6: Write to a TSV File
    # The data is saved in a structured, machine-readable format.
    output_filename = "CYP2D6_GRCh38_MANE_exons.tsv"

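    # Check for data before opening the file so an empty file is never created.
    if not extracted_data:
        print("No data to write to file.", file=sys.stderr)
        return

    try:
        with open(output_filename, 'w', newline='') as tsvfile:
            # Column names come from the keys of the first row.
            fieldnames = list(extracted_data[0].keys())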
            writer = csv.DictWriter(tsvfile, fieldnames=fieldnames, delimiter='\t')

            writer.writeheader()
            writer.writerows(extracted_data)

        print(f"Successfully wrote exon coordinates to {output_filename}")

    except IOError as err:
        print(f"Error writing to file {output_filename}: {err}", file=sys.stderr)

# Execute the function
get_cyp2d6_mane_exon_coordinates()
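
Once the script has run, the TSV can be read back for downstream use. The following is a minimal sketch that assumes the column names written by the script above. `load_exons` is a helper name introduced here for illustration, and the sample rows are hypothetical placeholder values, not real *CYP2D6* coordinates.

```python
import csv
import io


def load_exons(handle):
    """Parse exon rows from a tab-separated file handle.

    Numeric columns are converted to int; ID and chromosome stay as strings.
    """
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for key in ("Exon_Number", "Start", "End", "Strand", "Length_bp"):
            row[key] = int(row[key])
        rows.append(row)
    return rows


# Hypothetical sample in the same layout as the script's output file.
sample = (
    "Exon_Number\tEnsembl_Exon_ID\tChromosome\tStart\tEnd\tStrand\tLength_bp\n"
    "1\tEXON_A\t22\t400\t450\t-1\t51\n"
    "2\tEXON_B\t22\t200\t260\t-1\t61\n"
)

exons = load_exons(io.StringIO(sample))

# Sanity check: each stored length should equal End - Start + 1.
lengths_ok = all(e["Length_bp"] == e["End"] - e["Start"] + 1 for e in exons)
print(len(exons), lengths_ok)
```

To load the real output instead of the sample, pass an open file handle, for example `open("CYP2D6_GRCh38_MANE_exons.tsv", newline="")`, to `load_exons`.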