
# Canonical Transcript Dataset Creation

This notebook describes a reproducible workflow for creating a curated dataset of canonical human transcripts. The goal is to produce a BED and FASTA file representing one transcript per gene, without redundancy or paralog contamination.

## Workflow

1. **Load GENCODE annotation files**
   - Use `knownGene.bed` and `knownCanonical.txt` from UCSC
   - Filter to retain only canonical transcripts

2. **Process transcript structures**
   - Join canonical list with BED entries
   - Drop extra columns and keep BED6 format
   - remove transcripts associated with paralogous genes
   - remove non-protein coding genes

3. **Generate output files**
   - Canonical BED file
   - Corresponding FASTA file extracted from genome sequence


In [2]:
import pandas as pd

known_gene = pd.read_csv("knownGene.txt", sep="\t", header=None)
known_canonical = pd.read_csv("knownCanonical2.txt", sep="\t", header=None)

canonical_transcripts = set(known_canonical[4])

filtered_known_gene = known_gene[known_gene[0].isin(canonical_transcripts)]

filtered_known_gene.to_csv("filtered_knownGene.txt", sep="\t", index=False, header=False)


In [5]:
unique_transcripts = filtered_known_gene[0].unique()

num_unique_transcripts = len(unique_transcripts)

print("Number of unique transcripts in filtered_knownGene.txt:", num_unique_transcripts)

Number of unique transcripts in filtered_knownGene.txt: 64955


In [7]:
# The chromosome column is the second column (index 1)
distinct_chromosomes = filtered_known_gene[1].unique()

# Print the list of distinct chromosome names
print("Distinct chromosome names:")
print(distinct_chromosomes)

Distinct chromosome names:
['chr1' 'chr10' 'chr11' 'chr12' 'chr13' 'chr14' 'chr15' 'chr16' 'chr17'
 'chr18' 'chr19' 'chr2' 'chr20' 'chr21' 'chr22' 'chr3' 'chr4' 'chr5'
 'chr6' 'chr7' 'chr8' 'chr9' 'chrM' 'chrMT' 'chrX' 'chrY']


Ignore the above, that was just data exploration. 

BELOW:Filter the knownGene dataset so that it only includes transcripts present in the knownCanonical2.txt list (the canonical transcripts).
Further filter out non-coding RNA types like pseudo, lncRNA, and lincRNA based on the gene type.
Exclude rows with irregular chromosomes (e.g., chrM, chrMT).

Below:
Filter the knownGene dataset so that it only includes transcripts present in the knownCanonical
- filter out non-coding RNA types like pseudo, lncRNA, and lincRNA etc
- Exclude rows with irregular chromosomes (chrM, chrMT).
- Number of rows in the final filtered file: 26623
- Distinct gene types remaining in the final filtered file: ['protein_coding']

In [15]:
reference_genome = pysam.FastaFile("hg19.fa")


In [4]:
print(bed_file.head())

chr1	65418	71585	ENST00000641515.2_5	0	+	65564	70008	789624	3	15,54,2549,	0,101,3618,
 chr1	134900	139379	ENST00000423372.3	0	-	138529	139309	789624	2	902,1759,	0,2720,
 chr1	367639	368634	ENST00000426406.1	0	+	367658	368597	789624	1	995,	0,
 chr1	621095	622034	ENST00000332831.5_7	0	-	621095	622034	789624	1	939,	0,
 chr1	859302	879954	ENST00000616016.5_6	0	+	859811	879533	789624	14	1026,92,182,51,125,90,138,163,116,79,500,125,111,667,	0,1999,6232,7116,11849,15117,15352,17221,18213,18487,18636,19330,19775,19985,
 chr1	879582	894636	ENST00000327044.7_6	0	-	880073	894620	789624	19	598,90,136,114,144,102,114,112,140,189,114,111,79,91,121,132,175,153,42,	0,854,1315,1970,2199,3928,4287,6924,7797,8209,8972,9579,9801,11720,11892,12691,12896,14726,15012,
 chr1	895963	901099	ENST00000338591.8_3	0	+	896073	900571	789624	12	217,260,122,222,117,214,145,168,89,74,182,757,	0,709,1045,1242,1771,2120,2525,2753,3336,3523,3765,4379,
 chr1	901861	911245	ENST00000379410.8_9	0	+	901911	909955	789624	16	133,

In [5]:
# Helper function to get the reverse complement of a sequence
def reverse_complement(sequence):
    complement = str.maketrans("ATCGatcg", "TAGCtagc")
    return sequence.translate(complement)[::-1]

exon_sequences = []
intron_sequences = []

# Process each BED record
for bed_record in bed_file:
    chrom = bed_record.chrom
    strand = bed_record.strand
    chrom_start = int(bed_record.start)  # Transcription start
    chrom_end = int(bed_record.end)  # Transcription end
    exon_lengths = list(map(int, bed_record.fields[10].strip(',').split(',')))  # Exon lengths
    exon_offsets = list(map(int, bed_record.fields[11].strip(',').split(',')))  # Exon start offsets

    # Calculate absolute exon starts and ends
    exon_starts = [chrom_start + offset for offset in exon_offsets]
    exon_ends = [start + length for start, length in zip(exon_starts, exon_lengths)]

    previous_end = chrom_start

    for exon_start, exon_end in zip(exon_starts, exon_ends):
        # Get the exon sequence
        exon_seq = reference_genome.fetch(chrom, exon_start, exon_end)
        if strand == '-':
            exon_seq = reverse_complement(exon_seq)
        exon_sequences.append({"transcript_id": bed_record.name, "sequence": exon_seq})

        if previous_end < exon_start:
            intron_seq = reference_genome.fetch(chrom, previous_end, exon_start)
            if strand == '-':
                intron_seq = reverse_complement(intron_seq)
            intron_sequences.append({"transcript_id": bed_record.name, "sequence": intron_seq})

        previous_end = exon_end

    if previous_end < chrom_end:
        intron_seq = reference_genome.fetch(chrom, previous_end, chrom_end)
        if strand == '-':
            intron_seq = reverse_complement(intron_seq)
        intron_sequences.append({"transcript_id": bed_record.name, "sequence": intron_seq})

df_exons = pd.DataFrame(exon_sequences)
df_exons.to_csv("extracted_exon_sequences.txt", sep="\t", index=False)

df_introns = pd.DataFrame(intron_sequences)
df_introns.to_csv("extracted_intron_sequences.txt", sep="\t", index=False)

print("Exon and intron sequence extraction complete.")


Exon and intron sequence extraction complete.


In [17]:
def calculate_intron_frequencies(intron_sequences):
    counts = Counter(''.join(intron_sequences))
    total = sum(counts.values())
    frequencies = {
        'A': counts.get('A', 0) / total,
        'T': counts.get('T', 0) / total,
        'C': counts.get('C', 0) / total,
        'G': counts.get('G', 0) / total
    }
    return frequencies

def generate_random_sequence(length, frequencies):
    nucleotides = ['A', 'T', 'C', 'G']
    probs = [frequencies[nuc] for nuc in nucleotides]
    return ''.join(random.choices(nucleotides, probs, k=length))

In [18]:
# Helper function to get the reverse complement of a sequence
def reverse_complement(sequence):
    complement = str.maketrans("ATCGatcg", "TAGCtagc")
    return sequence.translate(complement)[::-1]

# Extract intron sequences and calculate overall intron nucleotide frequencies
intron_sequences = []

for bed_record in bed_file:
    chrom = bed_record.chrom
    strand = bed_record.strand
    chrom_start = int(bed_record.start)
    chrom_end = int(bed_record.end)
    exon_starts = list(map(int, bed_record.fields[10].strip(',').split(',')))
    exon_ends = list(map(int, bed_record.fields[11].strip(',').split(',')))

    previous_end = chrom_start
    for start_offset, end_offset in zip(exon_starts, exon_ends):
        exon_start = chrom_start + start_offset
        exon_end = chrom_start + end_offset

        # Get intron sequence before the current exon
        if previous_end < exon_start:
            intron_seq = reference_genome.fetch(chrom, previous_end, exon_start)
            if strand == '-':
                intron_seq = reverse_complement(intron_seq)
            intron_sequences.append(intron_seq)

        previous_end = exon_end

    # Add the last intron sequence if it exists
    if previous_end < chrom_end:
        intron_seq = reference_genome.fetch(chrom, previous_end, chrom_end)
        if strand == '-':
            intron_seq = reverse_complement(intron_seq)
        intron_sequences.append(intron_seq)


In [20]:
# Function to process each transcript
def process_transcript(transcript_id, full_sequence, bed_record):
    chrom = bed_record.chrom
    strand = bed_record.strand
    chrom_start = int(bed_record.start)
    exon_starts = list(map(int, bed_record.fields[10].strip(',').split(',')))
    exon_ends = list(map(int, bed_record.fields[11].strip(',').split(',')))

    processed_sequence = []
    previous_end = 0

    for start_offset, end_offset in zip(exon_starts, exon_ends):
        exon_start = chrom_start + start_offset
        exon_end = chrom_start + end_offset

        # Add the intron sequence before the exon
        if previous_end < start_offset:
            intron_seq = full_sequence[previous_end:start_offset]
            processed_sequence.append(intron_seq)

        exon_length = exon_end - exon_start
        random_exon = generate_random_sequence(exon_length, intron_frequencies)
        processed_sequence.append(random_exon)

        previous_end = end_offset

    if previous_end < len(full_sequence):
        processed_sequence.append(full_sequence[previous_end:])

    final_sequence = ''.join(processed_sequence)
    padded_sequence = f"{padding_sequence}{final_sequence}{padding_sequence}"

    fasta_header = f">{transcript_id}|{chrom}|{strand}|{chrom_start}|{previous_end}"
    return fasta_header, padded_sequence

with open("canonical_dataset_spliceai.fasta", "w") as output_file:
    for idx, row in df_sequences.iterrows():
        transcript_id = row["transcript_id"]
        full_sequence = row["sequence"]
        bed_record = bed_file[idx]
        header, padded_sequence = process_transcript(transcript_id, full_sequence, bed_record)
        output_file.write(f"{header}\n{padded_sequence}\n")

print("Canonical dataset preparation complete. Saved to 'canonical_dataset_spliceai.fasta'.")


Canonical dataset preparation complete. Saved to 'canonical_dataset_spliceai.fasta'.



example row:
>chr1:65418-71585(+)|chr1|+|65418|3618
chr1:65418-71585(+):

The transcript is located on chromosome chr1.
The genomic range spans from 65418 to 71585.
The transcript is on the + strand (forward strand).
|chr1|:

Confirms the chromosome (chr1).
+:

Strand orientation is positive (+).
65418:

Start position of the transcript on the chromosome.
3618:

End position of the last exon, relative to the transcript start (chromStart). This value might indicate the last position of the exon offset or an internal boundary within the transcript.

In [4]:
#how to make canonical sequences file


import pysam
import pandas as pd

# Paths
reference_genome_path = "/mnt/lareaulab/sdahiyat/hg19.fa"  # Reference genome
canonical_dataset_path = "/mnt/lareaulab/sdahiyat/illumina/canonical_dataset_created.txt"
output_file_path = "/mnt/lareaulab/sdahiyat/illumina/canonical_sequence_with_flanked_Ns.txt"

reference_genome = pysam.FastaFile(reference_genome_path)

canonical_df = pd.read_csv(
    canonical_dataset_path,
    sep="\t",

    header=None,
    names=["geneName", "chrom", "strand", "chromStart", "chromEnd", "exonEnds", "exonStarts"]
)

# Create the canonical sequence file with flanked 'N's
with open(output_file_path, "w") as output_file:
    for _, row in canonical_df.iterrows():
        chrom = row["chrom"]
        chrom_start = row["chromStart"]
        chrom_end = row["chromEnd"]
        strand = row["strand"]

        adjusted_start = max(0, chrom_start-5001)  #  or chrom_start-5001
        adjusted_end = chrom_end +5000 #or add 5000

        core_sequence = reference_genome.fetch(chrom, chrom_start-1, chrom_end)  # Use adjusted coordinates
        # if strand == "-":  # Reverse complement for negative strand
        #      core_sequence = core_sequence.translate(str.maketrans("ATCG", "TAGC"))[::-1]

        padded_sequence = f"{'N' * 5000}{core_sequence}{'N' * 5000}"

        output_file.write(f"{chrom}:{adjusted_start}-{adjusted_end}\t{padded_sequence}\n")

print(f"Canonical sequence file with flanked 'N's created at {output_file_path}")


Canonical sequence file with flanked 'N's created at /mnt/lareaulab/sdahiyat/illumina/canonical_sequence_with_flanked_Ns.txt


In [14]:
filtered_known_gene[filtered_known_gene.iloc[:, 18] == "A0A2U3U0J3"]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,17,18,19,20,21,22,23,24,25,26
0,chr1,65418,71585,ENST00000641515.2_5,0,+,65564,70008,789624,3,...,OR4F5,A0A2U3U0J3,protein_coding,coding,havana_homo_sapiens,protein_coding,"CCDS,Ensembl_canonical,MANE_Select,RNA_Seq_sup...",2,"canonical,basic,all",0
