# Introduction

`geneanot` provides gene and transcript annotation of vertebrate species based on [Ensembl](https://www.ensembl.org/index.html).

## Species
`geneanot` supports all vertebrate species. The default species is Homo sapiens.

## Requirements
`geneanot` requires an Ensembl annotation GFF3 file. `geneanot` provides functions to download the Ensembl annotation file to a local folder, which will be used by the package. If a local annotation file exists, `geneanot` can also check if there is a new version in Ensembl and download it.

The user must designate a folder that holds the Ensembl annotation file, or into which `geneanot` will download it. This folder is defined below by `Annotation_folder`.

## Chromosome Data
Some of the methods require access to the chromosome data. The two supported **access modes** are:
1. `local` - using a local chromosome Fasta file provided by the user
2. `remote` - using the remote Ensembl chromosome files via the Ensembl REST API

The `local` access mode requires the user to provide the corresponding chromosome Fasta file (this can be done at instantiation, and/or during the lifetime of the annotation object). The Fasta file **must** contain an equal number of bps per row for all sequence rows (except possibly the last row), since `geneanot` relies on this to support fast random access. The chromosome Fasta files can be downloaded from Ensembl. For example, for Homo sapiens, these are located in
`https://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/` (the required files are `Homo_sapiens.GRCh38.dna_sm.chromosome.<chromosome number>.fa.gz`, e.g. `Homo_sapiens.GRCh38.dna_sm.chromosome.1.fa.gz` for chromosome 1). The Ensembl chromosome Fasta files contain an equal number of bps per row for all sequence rows.
Since `geneanot` reads a local chromosome Fasta file in `local` mode, sequence retrieval is fast.
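
Why a fixed row width matters can be illustrated with a minimal sketch of random access into a Fasta file. This is not `geneanot`'s implementation: the helper name `fetch_region` is hypothetical, and the sketch assumes an uncompressed file with a single header line, a single record, and a constant number of bases per sequence row. Under those assumptions, the byte offset of any base follows from simple arithmetic, so only the requested span is read.

```python
import os
import tempfile


def fetch_region(fasta_path: str, start: int, end: int) -> str:
    """Return bases start..end (1-based, inclusive) of a single-record Fasta file.

    Assumes every sequence row (except possibly the last) has the same width.
    """
    with open(fasta_path, "rb") as f:
        header_len = len(f.readline())               # bytes in the '>' header line, newline included
        first_row = f.readline()
        row_bases = len(first_row.rstrip(b"\r\n"))   # bases per row
        row_bytes = len(first_row)                   # bases plus newline byte(s)

        def byte_offset(pos0: int) -> int:
            # pos0 is a 0-based base index: skip whole rows, then step within the row
            return header_len + (pos0 // row_bases) * row_bytes + (pos0 % row_bases)

        first, last = byte_offset(start - 1), byte_offset(end - 1)
        f.seek(first)
        chunk = f.read(last - first + 1)
    # drop the newline bytes that fall inside the requested span
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")


# Demo on a tiny synthetic file (10 bases per row, last row shorter).
sequence = "ACGTACGTAC" "GGGGCCCCAA" "TTTTT"
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False, newline="\n") as tmp:
    tmp.write(">demo\n")
    for i in range(0, len(sequence), 10):
        tmp.write(sequence[i:i + 10] + "\n")
    demo_path = tmp.name

region = fetch_region(demo_path, 8, 13)    # spans the end of row 1 and the start of row 2
whole = fetch_region(demo_path, 1, len(sequence))
print(region)
os.remove(demo_path)
```

If rows had different widths, the offset arithmetic in `byte_offset` would no longer hold and the file would have to be scanned row by row.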

In the `remote` access mode, `geneanot` uses the Ensembl REST API to extract sequences from the corresponding chromosome. No local chromosome Fasta file is required, but sequence retrieval is slower and requires a network connection.
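
For orientation, the following sketch shows what a region request to the public Ensembl REST API can look like, using its `sequence/region` endpoint. It is an illustration only: whether `geneanot` uses this exact endpoint is an assumption, and the helper names `region_url` and `fetch_remote_region` are hypothetical. The network call is left commented out so the block runs offline.

```python
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"


def region_url(species: str, chrm: str, start: int, end: int, strand: int = 1) -> str:
    """Build the URL of a sequence/region request (1-based, inclusive coordinates)."""
    return f"{ENSEMBL_REST}/sequence/region/{species}/{chrm}:{start}..{end}:{strand}"


def fetch_remote_region(species: str, chrm: str, start: int, end: int,
                        strand: int = 1, timeout: float = 30.0) -> str:
    """Fetch the region as plain text. Requires a network connection."""
    request = urllib.request.Request(
        region_url(species, chrm, start, end, strand),
        headers={"Content-Type": "text/plain"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("ascii").strip()


# Three bases on human chromosome 7 (coordinates chosen for illustration).
url = region_url("homo_sapiens", "7", 55_019_278, 55_019_280)
print(url)

# Uncomment to query Ensembl:
# print(fetch_remote_region("homo_sapiens", "7", 55_019_278, 55_019_280))
```

Each retrieval is an HTTP round trip, which is why this mode is slower than reading a local file.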

The access mode is determined by `chrm_fasta_file` below: setting it to `None` selects `remote` mode, while setting it to a valid chromosome Fasta file name selects `local` mode. It may be set when the annotation class is instantiated, and/or changed during the lifetime of the annotation object.


The following sections provide useful usage examples.



In [1]:
from pathlib import Path
import geneanot as u

# folder that holds (or is designated to hold) the Ensembl annotation file.
Annotation_folder: Path = Path('./../AnnotationDB')

# Human Gene Annotation

We start with examples of annotating Homo sapiens genes and transcripts.

## Informative
The following is not required for annotation. It only provides information about the annotation file in the `Annotation_folder` folder and the latest release in Ensembl.

In [2]:
# Check for Ensembl annotation files in the Annotation_folder
if (local_files := u.get_annotation_files_info_in_folder(Annotation_folder)):
    releases = list(local_files.keys())
    print(f"Found {len(local_files)} annotation files in {Annotation_folder}. {releases=}, latest release = {max(releases)}.")
else:
    print(f"No annotation file found in {Annotation_folder=}.")

# check the latest Ensembl release
ensembl_release = u.get_ensembl_release()
print(f"Ensembl latest release is {ensembl_release}.")

Found 1 annotation files in ../AnnotationDB. releases=[113], latest release = 113.
Ensembl latest release is 113.


## Annotate

### Initialize the Annotation Folder
This is required if:
1. `Annotation_folder` does not contain an annotation file (it will then be downloaded from the Ensembl site), or
2. we want to update the annotation file in `Annotation_folder` to the latest Ensembl release.

Otherwise, this step can be skipped, in which case the user must set `annotation_full_file` to a valid annotation file name.

In [3]:
enable_download: bool = True  # enables/disables downloading the annotation file from Ensembl 
                              # only when local release < ensembl latest release.
                              # However, if Annotation_folder does not contain an annotation file, 
                              # the download is done regardless of this setting.
# -----------------------------------------------------------------------------------------------
download_done, ensembl_file, local_file = u.update_local_release_to_latest(Annotation_folder, enable_download=enable_download)
annotation_full_file = Annotation_folder / (ensembl_file if download_done else local_file)
print(f"Annotation file that will be used: {annotation_full_file}.\n")

Local annotation releases found: 113. Latest is 113 (Homo_sapiens.GRCh38.113.gff3.gz).
Ensembl latest release is 113.
Ensembl and local releases match.
Annotation file that will be used: ../AnnotationDB/Homo_sapiens.GRCh38.113.gff3.gz.
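
The download rule described in the comments of the cell above can be summarized as a small decision function. This is a sketch of the stated rule only, not the code of `update_local_release_to_latest`; the name `should_download` is hypothetical.

```python
def should_download(local_releases: list[int], ensembl_release: int, enable_download: bool) -> bool:
    """Decide whether a new annotation file would be fetched, per the rule stated above."""
    if not local_releases:
        # no local annotation file: download regardless of enable_download
        return True
    # otherwise download only if allowed and the newest local release is outdated
    return enable_download and max(local_releases) < ensembl_release


decisions = {
    "empty folder": should_download([], 113, enable_download=False),
    "up to date": should_download([113], 113, enable_download=True),
    "outdated, enabled": should_download([112], 113, enable_download=True),
    "outdated, disabled": should_download([112], 113, enable_download=False),
}
print(decisions)
```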



### Instantiate an Annotation Object
This requires the annotation file defined in `annotation_full_file` (see above).

The instantiation is slow, as `geneanot` starts by preparing several data structures from the annotation file. Below we give an example of a faster way to instantiate annotation objects, which is more relevant when annotating multiple genes.

In [4]:
gene: str = 'EGFR'  # set a gene name or a gene ID
verbose: bool = True  # set to True to get informative prints
# -----------------------------------------------------------
# instantiate in remote access mode
gA = u.Gene_cls(gene, annotation_full_file, verbose=verbose)

# Or - instantiate in local access mode
# chrm_fasta_file: str = './../Chromosome/Homo_sapiens.GRCh38.dna_sm.chromosome.7.fa'
# gA = u.Gene_cls(gene, annotation_full_file, chrm_fasta_file=chrm_fasta_file, verbose=verbose)

# regardless, later we can set gA.chrm_fasta_file to either None (remote access)
# or a valid chromosome Fasta file name (local access).

Loading ../AnnotationDB/Homo_sapiens.GRCh38.113.gff3.gz to a dataframe ... Done.


### General Gene Information

In [5]:
gA.info()

print(f"\n{gene} contains {len(gA)} transcripts.")
gA.show_all_transcript_IDs()

print(f"\nGene start = {gA.gene_start:,}, gene end = {gA.gene_end:,}.")

print(f"\n{gene=} encoded in chromosome {gA.chrm} on the {'negative' if gA.rev else 'positive'} DNA strand.")

Gene=EGFR, Gene name=EGFR, Gene ID=ENSG00000146648
Source=../AnnotationDB/Homo_sapiens.GRCh38.113.gff3.gz
Positive strand=True.
Type=protein_coding, version=21.
Description=epidermal growth factor receptor [Source:HGNC Symbol%3BAcc:HGNC:3236]
Gene region=chr7:55,019,017-55,211,628.
13 transcripts.

EGFR contains 13 transcripts.
EGFR transcript IDs:
 1. ENST00000344576
 2. ENST00000275493
 3. ENST00000455089
 4. ENST00000342916
 5. ENST00000420316
 6. ENST00000700144
 7. ENST00000459688
 8. ENST00000463948
 9. ENST00000450046
10. ENST00000700145
11. ENST00000485503
12. ENST00000700146
13. ENST00000700147

Gene start = 55,019,017, gene end = 55,211,628.

gene='EGFR' encoded in chromosome 7 on the positive DNA strand.


In [6]:
# the various gene attributes printed above are also accessible by gA
print(gA.gene_name, gA.gene_ID, gA.gene_desc, gA.gene_type, gA.gene_ver, gA.rev, gA.chrm, gA.gene_start, gA.gene_end, len(gA.transcripts_info), sep='\n')


EGFR
ENSG00000146648
epidermal growth factor receptor [Source:HGNC Symbol%3BAcc:HGNC:3236]
protein_coding
21
False
7
55019017
55211628
13


### Transcript Annotation

Here we annotate a transcript of `gene`.

In [7]:
transcript_id: str = 'ENST00000275493'  # transcript ID without version
# ---------------------------------------------------------------------
# print transcript information
gA.transcript_info(transcript_id, verbose=True)

# the transcript features are accessible via gA.transcripts_info[transcript_id], for example:
t_info = gA.transcripts_info[transcript_id]
print('\n', t_info['transcript_name'], t_info['transcript_v_id'], t_info['transcript_ver'])

# all available transcript features (i.e., keys in t_info)
available_transcript_features = ''.join(f"{i:2}. {feature}\n" for i, feature in zip(range(1, len(t_info)+1), t_info.keys()))
print(f"\nThe transcript accessible features are:\n{available_transcript_features}")


ENST00000275493 information:
Name=EGFR-201, ID=ENST00000275493.7, version=7
Type=mRNA, biotype=protein_coding
Start=55,019,017, end=55,211,628
protein ID = ENSP00000275493
ccdsid=CCDS5514.1
ORF start offset = 261, ORF end offset = 3,893 (0-based offsets from the beginning of the RNA)
ORF start chromosome position = 55,019,278, ORF end chromosome position = 55,205,617
Exon regions:
	 1. 55,019,017 - 55,019,365, ID=ENSE00001841347, phase=-1, end phase=1
	 2. 55,142,286 - 55,142,437, ID=ENSE00003541288, phase=1, end phase=0
	 3. 55,143,305 - 55,143,488, ID=ENSE00001704157, phase=0, end phase=1
	 4. 55,146,606 - 55,146,740, ID=ENSE00001798125, phase=1, end phase=1
	 5. 55,151,294 - 55,151,362, ID=ENSE00001683983, phase=1, end phase=1
	 6. 55,152,546 - 55,152,664, ID=ENSE00001652975, phase=1, end phase=0
	 7. 55,154,011 - 55,154,152, ID=ENSE00001623732, phase=0, end phase=1
	 8. 55,155,830 - 55,155,946, ID=ENSE00001751179, phase=1, end phase=1
	 9. 55,156,533 - 55,156,659, ID=ENSE0000108492

Exon phases:
- the end phase of an exon is the phase of the next exon.
- conversion from exon phase to CDS phase: 0-->0, 1-->2, 2-->1.
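
The conversion listed above is equivalent to `(3 - exon_phase) % 3`, as the short sketch below checks. The function name is hypothetical, and treating a phase of `-1` as "no coding phase" (returning `None`) is an assumption made here for the exon boundaries printed with `phase=-1`.

```python
from typing import Optional


def exon_phase_to_cds_phase(exon_phase: int) -> Optional[int]:
    """Convert an exon phase (0, 1, 2) to the CDS phase listed above."""
    if exon_phase == -1:
        # assumption: -1 marks a boundary without a coding phase
        return None
    if exon_phase not in (0, 1, 2):
        raise ValueError(f"unexpected exon phase: {exon_phase}")
    return (3 - exon_phase) % 3


mapping = {phase: exon_phase_to_cds_phase(phase) for phase in (0, 1, 2)}
print(mapping)
```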

In [8]:
# Transcript table - exon and intron table.
df = gA.exon_intron_map(transcript_id)
display(df)

# the transcript table can be written to an Excel, CSV or HTML file
gA.exon_intron_map_to_csv(transcript_id, './../Reports/transcript_table.csv')
gA.exon_intron_map_to_excel(transcript_id, './../Reports/transcript_table.xlsx', usr_desc={"Description": "Transcript table", "Transcript": transcript_id})
gA.exon_intron_map_to_html(transcript_id, './../Reports/transcript_table.html')

| | name | region | region_size |
|---|---|---|---|
| 0 | Exon 1 | (55019017, 55019365) | 349 |
| 1 | Intron 1 | (55019366, 55142285) | 122920 |
| 2 | Exon 2 | (55142286, 55142437) | 152 |
| 3 | Intron 2 | (55142438, 55143304) | 867 |
| 4 | Exon 3 | (55143305, 55143488) | 184 |
| 5 | Intron 3 | (55143489, 55146605) | 3117 |
| 6 | Exon 4 | (55146606, 55146740) | 135 |
| 7 | Intron 4 | (55146741, 55151293) | 4553 |
| 8 | Exon 5 | (55151294, 55151362) | 69 |
| 9 | Intron 5 | (55151363, 55152545) | 1183 |
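
The intron rows in the table above follow directly from the exon regions: each intron fills the gap between two consecutive exons. The sketch below reproduces that arithmetic for both strands. It is an illustration, not the code of `exon_intron_map`; the function name is hypothetical, and exons are assumed to be given in transcript order, with `start > end` on the negative strand.

```python
def exon_intron_regions(exons: list[tuple[int, int]], rev: bool) -> list[tuple[str, tuple[int, int], int]]:
    """Interleave exons with the introns between them.

    exons: (start, end) chromosome coordinates in transcript order.
    rev:   True for a gene encoded on the negative strand.
    """
    step = -1 if rev else 1
    rows = []
    for i, (start, end) in enumerate(exons, start=1):
        rows.append((f"Exon {i}", (start, end), abs(end - start) + 1))
        if i < len(exons):
            next_start = exons[i][0]                 # start of the following exon
            intron = (end + step, next_start - step)
            rows.append((f"Intron {i}", intron, abs(intron[1] - intron[0]) + 1))
    return rows


# First two exons of ENST00000275493 (positive strand), taken from the table above.
positive = exon_intron_regions([(55019017, 55019365), (55142286, 55142437)], rev=False)
# A negative-strand example with descending coordinates.
negative = exon_intron_regions([(3216393, 3216269), (3214890, 3214767)], rev=True)
for row in positive + negative:
    print(row)
```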


In [9]:
# RNA table - maps the ORF and AAs to the mRNA sequence.
df = gA.exon_map(transcript_id)
display(df)

# similarly to exon_intron_map() above, this table can also be written to an Excel, CSV or HTML file
gA.exon_map_to_csv(transcript_id, './../Reports/mRNA_table.csv')
gA.exon_map_to_excel(transcript_id, './../Reports/mRNA_table.xlsx', usr_desc={"Description": "mRNA table", "Transcript": transcript_id})
gA.exon_map_to_html(transcript_id, './../Reports/mRNA_table.html')

| | exon_number | exon_region | exon_size | mRNA_NT_region | ORF_NT_region | ORF_AA_region | next_exon_frame_alignment | exon_ID |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | (55019017, 55019365) | 349 | [1, 349] | [1, 88] | [1, 30] | 2 | ENSE00001841347 |
| 1 | 2 | (55142286, 55142437) | 152 | [350, 501] | [89, 240] | [30, 80] | 0 | ENSE00003541288 |
| 2 | 3 | (55143305, 55143488) | 184 | [502, 685] | [241, 424] | [81, 142] | 2 | ENSE00001704157 |
| 3 | 4 | (55146606, 55146740) | 135 | [686, 820] | [425, 559] | [142, 187] | 2 | ENSE00001798125 |
| 4 | 5 | (55151294, 55151362) | 69 | [821, 889] | [560, 628] | [187, 210] | 2 | ENSE00001683983 |
| 5 | 6 | (55152546, 55152664) | 119 | [890, 1008] | [629, 747] | [210, 249] | 0 | ENSE00001652975 |
| 6 | 7 | (55154011, 55154152) | 142 | [1009, 1150] | [748, 889] | [250, 297] | 2 | ENSE00001623732 |
| 7 | 8 | (55155830, 55155946) | 117 | [1151, 1267] | [890, 1006] | [297, 336] | 2 | ENSE00001751179 |
| 8 | 9 | (55156533, 55156659) | 127 | [1268, 1394] | [1007, 1133] | [336, 378] | 1 | ENSE00001084929 |
| 9 | 10 | (55156759, 55156832) | 74 | [1395, 1468] | [1134, 1207] | [378, 403] | 2 | ENSE00001084931 |


Here `mRNA_NT_region` gives the exon's position range in the mRNA, `ORF_NT_region` gives its position range in the ORF, `ORF_AA_region` gives the range of amino acids it encodes, and `next_exon_frame_alignment` gives the codon phase (i.e., the number of bps required in the next exon to complete the last codon in the current exon).
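
The columns described above can be derived from the exon sizes and the ORF offsets alone. The sketch below recomputes them for the ten exons shown in the table, using the ORF offsets printed in the transcript information (261 and 3,893, both 0-based). It is an illustration of the arithmetic, not the code of `exon_map`; the function name is hypothetical, and reporting `None` for exons that do not overlap the ORF is an assumption.

```python
def exon_orf_table(exon_sizes: list[int], orf_start_offset: int, orf_end_offset: int) -> list[tuple]:
    """Per exon: (exon_number, mRNA range, ORF range, AA range, next_exon_frame_alignment).

    orf_start_offset / orf_end_offset are 0-based mRNA offsets of the first and
    last ORF bases. All returned ranges are 1-based and inclusive.
    """
    rows = []
    consumed = 0                                        # mRNA bases covered by previous exons
    for number, size in enumerate(exon_sizes, start=1):
        mrna_first, mrna_last = consumed + 1, consumed + size
        consumed += size
        # overlap of this exon with the ORF, expressed in 1-based ORF positions
        orf_first = max(mrna_first - 1, orf_start_offset) - orf_start_offset + 1
        orf_last = min(mrna_last - 1, orf_end_offset) - orf_start_offset + 1
        if orf_first > orf_last:                        # exon lies entirely in a UTR
            rows.append((number, (mrna_first, mrna_last), None, None, None))
            continue
        aa_first = (orf_first - 1) // 3 + 1
        aa_last = (orf_last - 1) // 3 + 1
        missing = (3 - orf_last % 3) % 3                # bases needed from the next exon
        rows.append((number, (mrna_first, mrna_last), (orf_first, orf_last), (aa_first, aa_last), missing))
    return rows


exon_sizes = [349, 152, 184, 135, 69, 119, 142, 117, 127, 74]   # exon_size column above
table = exon_orf_table(exon_sizes, orf_start_offset=261, orf_end_offset=3893)
for row in table:
    print(row)
```

An amino acid number that appears at the end of one exon and the start of the next (e.g., 30 in exons 1 and 2) marks a codon split across the exon boundary.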

#### Sequences

Retrieve sequences (e.g., pre-mRNA, mRNA, ORF, AA, UTRs).

In [10]:
# The access mode may be set/changed anytime:

# set to use local mode (if instantiation was done with remote mode)
gA.chrm_fasta_file = f'./../Chromosome/Homo_sapiens.GRCh38.dna_sm.chromosome.{gA.chrm}.fa'

# Or, set to use remote mode (if instantiation was with local mode)
# gA.chrm_fasta_file = None

print(f"Access mode is {gA.access_mode()}.")

Access mode is local.


In [11]:
pre_mRNA_seq = gA.seq(transcript_id).upper()
print(f"The pre-mRNA sequence contains {len(pre_mRNA_seq):,} bps.")

rna_seq = gA.rna(transcript_id).upper()
print(f"\nrna:\n{rna_seq}\n{len(rna_seq):,} bps.")

orf_seq = gA.ORF(transcript_id).upper()
print(f"\norf:\n{orf_seq}\n{len(orf_seq):,} bps.")

aa_seq = gA.AA(transcript_id)
print(f"\nprotein:\n{aa_seq}\n{len(aa_seq):,} AAs.")

utr5_seq = gA.UTR5(transcript_id).upper()
print(f"\n5'UTR:\n{utr5_seq}\n{len(utr5_seq):,} bps.")

utr3_seq = gA.UTR3(transcript_id).upper()
print(f"\n3'UTR=\n{utr3_seq}\n{len(utr3_seq):,} bps.")

# An Exon sequence
exon_number: int = 5
exon_seq, seq_info = gA.exon_intron_seq('Exon', exon_number, transcript_id)
print(f"\nExon #{exon_number}:\n{exon_seq}\n{seq_info}")

# An Intron sequence
intron_number: int = 5
intron_seq, seq_info = gA.exon_intron_seq('Intron', intron_number, transcript_id)
print(f"\nIntron #{intron_number} contains {len(intron_seq):,} bps.\n{seq_info.upper()}")

# A modified transcript - generates a sequence based on a list of exons and introns
use_exon_list: list[int] = [1, 3]  # exon numbers to use
use_intron_list: list[int] = [2]  # intron numbers to use
m_seq = gA.modified_transcript(use_exon_list, use_intron_list, transcript_id)
print(f"\nA modified transcript containing exons {', '.join(map(str,use_exon_list))} and introns {', '.join(map(str,use_intron_list))} contains {len(m_seq.upper()):,} bps.")

The pre-mRNA sequence contains 192,612 bps.

rna:
AGACGTCCGGGCAGCCCCCGGCGCAGCGCGGCCGCAGCAGCCTCCGCCCCCCGCACGGTGTGAGCGCCCGACGCGGCCGAGGCGGCCGGAGTCCCGAGCTAGCCCCGGCGGCCGCCGCCGCCCAGACCGGACGACAGGCCACCTCGTCGGCGTCCGCCCGAGTCCCCGCCTCGCCGCCAACGCCACAACCACCGCGCACGGCCCCCTGACTCCGTCCAGTATTGATCGGGAGAGCCGGAGCGAGCTCTTCGGGGAGCAGCGATGCGACCCTCCGGGACGGCCGGGGCAGCGCTCCTGGCGCTGCTGGCTGCGCTCTGCCCGGCGAGTCGGGCTCTGGAGGAAAAGAAAGTTTGCCAAGGCACGAGTAACAAGCTCACGCAGTTGGGCACTTTTGAAGATCATTTTCTCAGCCTCCAGAGGATGTTCAATAACTGTGAGGTGGTCCTTGGGAATTTGGAAATTACCTATGTGCAGAGGAATTATGATCTTTCCTTCTTAAAGACCATCCAGGAGGTGGCTGGTTATGTCCTCATTGCCCTCAACACAGTGGAGCGAATTCCTTTGGAAAACCTGCAGATCATCAGAGGAAATATGTACTACGAAAATTCCTATGCCTTAGCAGTCTTATCTAACTATGATGCAAATAAAACCGGACTGAAGGAGCTGCCCATGAGAAATTTACAGGAAATCCTGCATGGCGCCGTGCGGTTCAGCAACAACCCTGCCCTGTGCAACGTGGAGAGCATCCAGTGGCGGGACATAGTCAGCAGTGACTTTCTCAGCAACATGTCGATGGACTTCCAGAACCACCTGGGCAGCTGCCAAAAGTGTGATCCAAGCTGTCCCAATGGGAGCTGCTGGGGTGCAGGAGAGGAGAACTGCCAGAAACTGACCAAAATCATCTGTGCCCAGCAGTGCTCCGGGCGCTGCCGTGGCAAGTCCCCCAGTGA

#### Queries
All positions below are 1-based positions (unless otherwise indicated).

In [12]:
# query the start and stop codon positions in the chromosome and in the RNA
if (start_end_stop := gA.start_and_stop_codons_pos(transcript_id)) is None:
    raise ValueError(f"{transcript_id=} not recognized !!")
start_codon_info, stop_codon_info = start_end_stop
# each _info contains (chromosome_position_of_first_bp_of_codon, RNA_position_of_first_bp_of_codon)
print(f"{start_codon_info=}\n{stop_codon_info=}")

start_codon_info=(55019278, 262)
stop_codon_info=(55205615, 3892)
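
As a consistency check, the 1-based codon positions printed above can be related to the 0-based ORF offsets reported in the transcript information (261 and 3,893). The sketch only performs this arithmetic with values copied from the outputs; it assumes a positive-strand transcript whose start codon lies in exon 1.

```python
orf_start_offset = 261      # 0-based mRNA offset of the first ORF base
orf_end_offset = 3893       # 0-based mRNA offset of the last ORF base

start_codon_rna_pos = orf_start_offset + 1        # 1-based position of the start codon's first bp
stop_codon_rna_pos = (orf_end_offset + 1) - 2     # 1-based position of the stop codon's first bp

orf_length = orf_end_offset - orf_start_offset + 1
n_codons, remainder = divmod(orf_length, 3)       # codon count includes the stop codon

# The start codon lies in exon 1 (349 bps, starting at 55,019,017 on the positive strand).
exon1_start, exon1_size = 55_019_017, 349
assert start_codon_rna_pos <= exon1_size
start_codon_chrm_pos = exon1_start + (start_codon_rna_pos - 1)

print(start_codon_rna_pos, stop_codon_rna_pos, orf_length, n_codons, remainder, start_codon_chrm_pos)
```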


In [13]:
# query the start and end ORF positions in the exons
# (i.e., the exon index and the offset within the exon of the 
# first bp of the start codon and the last bp of the stop codon)
orf_start_end_info = gA.orf_start_and_end_exon_info(transcript_id)
# here, 'exon_number' value is 1-based and 'exon_offset' is 0-based.
print(orf_start_end_info)

{'start': {'exon_number': 1, 'exon_offset': 261}, 'end': {'exon_number': 28, 'exon_offset': 361}}


In [14]:
# query a chromosome position
chrm_pos: int = 55_157_663
# ---------------------------
chrm_info = gA.chrm_pos_info(transcript_id, chrm_pos)
print(chrm_info)

{'region': 'Exon_11', 'region_pos': 1, 'dist_from_region_boundary': (0, 90), 'segment': 'ORF', 'pos_in_segment': 1208, 'NT': 'G', 'codon_number': 403, 'nt_in_codon': 2, 'codon': 'GGG', 'aa': 'G'}


In [15]:
# map a chromosome position to the corresponding RNA position
chrm_pos: int = 55_211_628 
# ------------------------
if (rna_pos := gA.chrm_pos2rna_pos(transcript_id, chrm_pos)) is None:
    print(f"{chrm_pos=:,} does not overlap with the RNA of {transcript_id}.")
else:
    print(f"{chrm_pos=:,} --> {rna_pos=:}")


chrm_pos=55,211,628 --> rna_pos=(9905, 'A')


In [16]:
# map an RNA position to the chromosome position
rna_pos: int = 685
# ----------------
chrm_p = gA.rna_pos2chrm_pos(transcript_id, rna_pos)
# chrm_p is a tuple of the chromosome position and the nucleotide value at that position
print(f"{rna_pos=} --> {chrm_p=}")

rna_pos=685 --> chrm_p=(55143488, 'G')


In [17]:
# query an RNA position
rna_pos: int = 685
# ----------------
chrm_info = gA.rna_pos2chrm_info(transcript_id, rna_pos)
print(chrm_info)

{'chrm_pos': 55143488, 'region': 'Exon_3', 'region_pos': 184, 'dist_from_region_boundary': (183, 0), 'segment': 'ORF', 'pos_in_segment': 424, 'NT': 'G', 'codon_number': 142, 'nt_in_codon': 1, 'codon': 'GAA', 'aa': 'E'}


In [18]:
# query an exon position
exon_number: int = 7
bp_index_in_exon: int = 47
# -------------------
info = gA.exon_nt_info(transcript_id, exon_number, bp_index_in_exon)
print(info)

{'segment': 'ORF', 'pos_in_segment': 794, 'NT': 'C', 'codon_number': 265, 'nt_in_codon': 2, 'codon': 'CCC', 'aa': 'P'}
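
In the query outputs above, `codon_number` and `nt_in_codon` are consistent with a simple division of the 1-based ORF position (`pos_in_segment`) by 3. The sketch below checks this against the three printed results; the function name is hypothetical and this is not necessarily how `geneanot` computes the values.

```python
def codon_coordinates(orf_pos: int) -> tuple[int, int]:
    """Map a 1-based ORF position to (codon_number, nt_in_codon), both 1-based."""
    codon_index, nt_index = divmod(orf_pos - 1, 3)
    return codon_index + 1, nt_index + 1


# pos_in_segment values taken from the three query outputs above
results = {pos: codon_coordinates(pos) for pos in (1208, 424, 794)}
print(results)
```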


In [19]:
# query the RNA segment (UTR, ORF) of an exon position
exon_number: int = 28
bp_index_in_exon: int = 400
# --------------------------
if (exon_segment_info := gA.exon_nt_segment(transcript_id, exon_number, bp_index_in_exon)) is not None:
    print(f"bp #{bp_index_in_exon} in exon #{exon_number} corresponds to bp #{exon_segment_info[1]:,} in the {exon_segment_info[0]} segment.")

bp #400 in exon #28 corresponds to bp #38 in the 3UTR segment.


In [20]:
# query an amino-acid position
aa_number: int = 163
# ------------------
aa_info = gA.aa_exon_info(transcript_id, aa_number)
print(aa_info)

{'codon': 'CAG', 'AA': 'Q', 'codon_exon_pos': ['Exon_4:63', 'Exon_4:64', 'Exon_4:65'], 'codon_chromosome_pos': [55146668, 55146669, 55146670], 'mrna_pos': [748, 749, 750]}


`codon_exon_pos` is a list of 3 positions, each in the form `<exon number>:<pos>`, where
`pos` is the 1-based position relative to the start of exon `<exon number>`. The corresponding
chromosome position is then `exon_start_coordinate + m*(pos-1)`, where `m = 1` [`m = -1`] for genes encoded on
the positive [negative] DNA strand (recall that in genes encoded on the negative strand,
`exon_start_coordinate > exon_end_coordinate`).
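
The position formula above can be written as a short helper. The function names are hypothetical; the positive-strand check uses `Exon_4:63` from the output above (exon 4 starts at 55,146,606), and the negative-strand check uses an exon with descending coordinates (3216393, 3216269) of 125 bps.

```python
def exon_pos_to_chrm_pos(exon_start: int, pos: int, rev: bool) -> int:
    """Chromosome position of the 1-based position `pos` within an exon."""
    m = -1 if rev else 1
    return exon_start + m * (pos - 1)


def parse_codon_exon_pos(entry: str) -> tuple[int, int]:
    """Split an entry such as 'Exon_4:63' into (exon number, pos)."""
    exon_label, pos = entry.split(":")
    return int(exon_label.split("_")[1]), int(pos)


exon_number, pos = parse_codon_exon_pos("Exon_4:63")
positive_pos = exon_pos_to_chrm_pos(55_146_606, pos, rev=False)
negative_pos = exon_pos_to_chrm_pos(3_216_393, 125, rev=True)   # last bp of a 125-bp exon
print(exon_number, pos, positive_pos, negative_pos)
```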

In [21]:
# query an AA variant given a DNA variant defined as chromosome_pos:ref_allele>var_allele
ref_allele, var_allele, chromosome_pos = 'G', 'C', 55_152_609
# ------------------------------------------------------------
if (aa_var := gA.DNA_SNP_mut_to_AA_mut(ref_allele, var_allele, chromosome_pos, transcript_id)) is None:
    raise ValueError("Verify variant is not DEL or INS and located in the ORF.")
print(f"chr{gA.chrm}:{chromosome_pos}:{ref_allele}>{var_allele} --> {aa_var=}")


chr7:55152609:G>C --> aa_var='C231S'
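
The result above can be traced by hand: locate the codon that contains the chromosome position, substitute the variant allele, and translate both codons. The sketch below does this with the standard genetic code and values taken from the tables and outputs in this document (exon 6 starts at 55,152,546 and at ORF position 629; the reference codon is `TGC`). It assumes a gene on the positive strand and a position inside a single exon; the function name is hypothetical, and this is not the code of `DNA_SNP_mut_to_AA_mut`.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def snp_to_aa_change(ref_codon: str, codon_number: int, nt_in_codon: int,
                     ref_allele: str, var_allele: str) -> str:
    """Return an AA change such as 'C231S' for a single-base substitution."""
    if ref_codon[nt_in_codon - 1] != ref_allele:
        raise ValueError("reference allele does not match the reference codon")
    var_codon = ref_codon[:nt_in_codon - 1] + var_allele + ref_codon[nt_in_codon:]
    return f"{CODON_TABLE[ref_codon]}{codon_number}{CODON_TABLE[var_codon]}"


# Locate the variant: exon 6 starts at 55,152,546, which is ORF position 629.
chromosome_pos, exon_start, exon_orf_start = 55_152_609, 55_152_546, 629
orf_pos = exon_orf_start + (chromosome_pos - exon_start)
codon_index, nt_index = divmod(orf_pos - 1, 3)
codon_number, nt_in_codon = codon_index + 1, nt_index + 1

aa_change = snp_to_aa_change("TGC", codon_number, nt_in_codon, "G", "C")
print(orf_pos, codon_number, nt_in_codon, aa_change)
```

For a gene on the negative strand, the alleles would first have to be complemented, which this sketch does not handle.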


In [22]:
# query (all) DNA variants given an AA variant
aa_var: str = 'C231S'
# -------------------
if (dna_all_muts := gA.AA_mut_to_DNA_SNP_mut(aa_var, transcript_id)) is None:
    raise ValueError("Verify aa_var is a valid AA variant.")
print(f"{aa_var=} corresponds to the following DNA variants:")
for i, (codon, var_info) in enumerate(dna_all_muts.items(), start=1):
    print(f"{i}. {codon=}, {var_info}")


aa_var='C231S' corresponds to the following DNA variants:
1. codon='TCT', {'start_pos': 55152609, 'reference_allele': 'GC', 'alternative_allele': 'CT'}
2. codon='TCC', {'start_pos': 55152609, 'reference_allele': 'G', 'alternative_allele': 'C'}
3. codon='TCA', {'start_pos': 55152609, 'reference_allele': 'GC', 'alternative_allele': 'CA'}
4. codon='TCG', {'start_pos': 55152609, 'reference_allele': 'GC', 'alternative_allele': 'CG'}
5. codon='AGT', {'start_pos': 55152608, 'reference_allele': 'TGC', 'alternative_allele': 'AGT'}
6. codon='AGC', {'start_pos': 55152608, 'reference_allele': 'T', 'alternative_allele': 'A'}
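
The list above is consistent with enumerating every codon of the target amino acid and reporting, for each, the shortest contiguous span in which it differs from the reference codon. The sketch below reproduces the six entries under that reading. It assumes the standard genetic code, a positive-strand gene, and a reference codon `TGC` starting at chromosome position 55,152,608 (as implied by the entries above); the function name is hypothetical, and this is not the code of `AA_mut_to_DNA_SNP_mut`.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def aa_change_to_dna_variants(ref_codon: str, codon_start_pos: int, target_aa: str) -> dict[str, dict]:
    """For each codon encoding target_aa, give the minimal differing span vs. ref_codon."""
    variants = {}
    for codon, aa in CODON_TABLE.items():
        if aa != target_aa or codon == ref_codon:
            continue
        differing = [i for i in range(3) if codon[i] != ref_codon[i]]
        first, last = differing[0], differing[-1]
        variants[codon] = {
            "start_pos": codon_start_pos + first,
            "reference_allele": ref_codon[first:last + 1],
            "alternative_allele": codon[first:last + 1],
        }
    return variants


dna_variants = aa_change_to_dna_variants("TGC", 55_152_608, "S")
for i, (codon, info) in enumerate(dna_variants.items(), start=1):
    print(f"{i}. {codon=}, {info}")
```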


## Annotating Multiple Genes
A faster instantiation of annotation objects, which is relevant when annotating multiple genes, can be done using annotation dataframes that are generated from the annotation file as shown here.

In [23]:
# First, parse the annotation file into annotation dataframes
gff3_dfs = u.ensembl_gff3_df(annotation_full_file)  

In [24]:
# Now instantiate annotation objects (multiple genes) with the annotation dataframes (faster).
# Can instantiate in local mode by providing chrm_fasta_file (see above).
g_a1 = u.Gene_cls('IDH1', gff3_dfs)
g_a1.info()
g_a2 = u.Gene_cls('BRCA1', gff3_dfs)
g_a2.info()
g_a3 = u.Gene_cls('EGFR', gff3_dfs)
g_a3.info()

Gene=IDH1, Gene name=IDH1, Gene ID=ENSG00000138413
Source=Input DataFrame (ID=5434342224, size=(4105687, 9)). (Gene type DataFrame (ID=5434342704, size=(78932, 11)))
Positive strand=False.
Type=protein_coding, version=14.
Description=isocitrate dehydrogenase (NADP(+)) 1 [Source:HGNC Symbol%3BAcc:HGNC:5382]
Gene region=chr2:208,236,229-208,266,074.
9 transcripts.
Gene=BRCA1, Gene name=BRCA1, Gene ID=ENSG00000012048
Source=Input DataFrame (ID=5434342224, size=(4105687, 9)). (Gene type DataFrame (ID=5434342704, size=(78932, 11)))
Positive strand=False.
Type=protein_coding, version=26.
Description=BRCA1 DNA repair associated [Source:HGNC Symbol%3BAcc:HGNC:1100]
Gene region=chr17:43,044,295-43,170,245.
41 transcripts.
Gene=EGFR, Gene name=EGFR, Gene ID=ENSG00000146648
Source=Input DataFrame (ID=5434342224, size=(4105687, 9)). (Gene type DataFrame (ID=5434342704, size=(78932, 11)))
Positive strand=True.
Type=protein_coding, version=21.
Description=epidermal growth factor receptor [Source:HGN

# Annotating Other Vertebrate Species
The default species of `geneanot` is Homo sapiens (`homo_sapiens`). To annotate other vertebrate species, the variable `species` must be set.

## Annotation File
The **annotation file signature** is required by the function `update_local_release_to_latest` to download the latest annotation file or check the local release against the latest Ensembl release when annotating other species. If a local annotation file exists (and there is no need to check its release against Ensembl), then this step can be skipped, and the user must set `annotation_full_file` and `species` when instantiating annotation objects (see below).

The **annotation file signature** is identical to the annotation file name, except that the release number is replaced by `XXX`. For example, the signature of the annotation file of Mus musculus is `Mus_musculus.GRCm39.XXX.gff3.gz`. 
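
A signature can also be derived mechanically from a file name. The sketch below uses a regular expression anchored to the `.<release>.gff3.gz` suffix, so that digits elsewhere in the name (e.g., in the assembly) are left untouched. The helper names are hypothetical and are not part of `geneanot`.

```python
import re

# Matches the release number directly before the '.gff3.gz' suffix.
_RELEASE_RE = re.compile(r"\.(\d+)\.gff3\.gz$")


def annotation_signature(file_name: str) -> str:
    """Replace the release number in an annotation file name by 'XXX'."""
    signature, n_replaced = _RELEASE_RE.subn(".XXX.gff3.gz", file_name)
    if n_replaced == 0:
        raise ValueError(f"{file_name!r} does not look like an Ensembl GFF3 annotation file name")
    return signature


def release_of(file_name: str) -> int:
    """Extract the release number from an annotation file name."""
    match = _RELEASE_RE.search(file_name)
    if match is None:
        raise ValueError(f"{file_name!r} does not look like an Ensembl GFF3 annotation file name")
    return int(match.group(1))


mouse_signature = annotation_signature("Mus_musculus.GRCm39.113.gff3.gz")
human_release = release_of("Homo_sapiens.GRCh38.113.gff3.gz")
print(mouse_signature, human_release)
```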

In general, the annotation file name can be found:
- by invoking the function `suggested_annotation_file_name(species=species)` as shown below. Or,
- in [Ensembl](http://www.ensembl.org/index.html) -> `--Select a species--` (under `All genomes`) -> in the `Gene annotation` section select `GFF3`

Following is an example of annotating `Mus musculus` genes and transcripts.

In [25]:
species: str = 'mus_musculus'

# The suggested annotation file name
if (anot_info := u.suggested_annotation_file_name(species=species)) is None:
    print(f"Error in retrieving annotation file name. Please check manually for the {species} GFF3 annotation file name in Ensembl.")
else:
    suggest_annotation_file_name, assembly_n, release_n = anot_info
    suggested_annotation_signature: str = suggest_annotation_file_name.replace(release_n, 'XXX')
    print(f"Annotation file name: {suggest_annotation_file_name}.\nChange the release number ({release_n}) to 'XXX' to get the signature: {suggested_annotation_signature}")

# Annotation file signature
annotation_file_signature: str = suggested_annotation_signature


Annotation file name: Mus_musculus.GRCm39.113.gff3.gz.
Change the release number (113) to 'XXX' to get the signature: Mus_musculus.GRCm39.XXX.gff3.gz


In [26]:
# Informative - this can be skipped.
# Check for local Ensembl annotation files
if (local_files := u.get_annotation_files_info_in_folder(Annotation_folder, gff3_pattern=annotation_file_signature)):
    releases = list(local_files.keys())
    max_ver = max(releases)
    print(f"Found {len(local_files)} annotation files in {Annotation_folder}. {releases=}, maximal version = {max_ver}.")
else:
    print(f"No annotation file in {Annotation_folder}. Need to download from Ensembl.")

# check Ensembl version - Informative
ensembl_version = u.get_ensembl_release()
print(f"{ensembl_version=}.")

Found 1 annotation files in ../AnnotationDB. releases=[113], maximal version = 113.
ensembl_version=113.


In [27]:
# This is required only if you want to download the annotation file, or check the local version against the latest Ensembl version.
enable_download: bool = False  # enables/disables annotation file download from Ensembl only when local version < ensembl version.
                               # If Annotation_folder contains no annotation file, download is done regardless of this setting.
# ------------------------------------------------------------------------------------------------------------------------------
download_done, ensembl_file, local_file = u.update_local_release_to_latest(Annotation_folder, 
                                                                           enable_download=enable_download, 
                                                                           gff3_pattern=annotation_file_signature,
                                                                           species=species)
annotation_full_file = Annotation_folder / (ensembl_file if download_done else local_file)
print(f"Annotation file to use: {annotation_full_file}.\n")

Local annotation releases found: 113. Latest is 113 (Mus_musculus.GRCm39.113.gff3.gz).
Ensembl latest release is 113.
Ensembl and local releases match.
Annotation file to use: ../AnnotationDB/Mus_musculus.GRCm39.113.gff3.gz.



In [28]:
# Instantiate (in remote access mode)
# Can instantiate in local mode by providing chrm_fasta_file (see above)
gene: str = 'ENSMUSG00000017167' # set a gene name or a gene ID
verbose: bool = True  # set to True to get informative prints
# -----------------------------------------------------------------
gAm = u.Gene_cls(gene, annotation_full_file, species=species, verbose=verbose)


Loading ../AnnotationDB/Mus_musculus.GRCm39.113.gff3.gz to a dataframe ... Done.


In [29]:
gAm.info()

print(f"\n{gene} contains {len(gAm)} transcripts.")
gAm.show_all_transcript_IDs()

print(f"\nGene start = {gAm.gene_start:,}, gene end = {gAm.gene_end:,}")

Gene=ENSMUSG00000017167, Gene name=Cntnap1, Gene ID=ENSMUSG00000017167
Source=../AnnotationDB/Mus_musculus.GRCm39.113.gff3.gz
Positive strand=True.
Type=protein_coding, version=7.
Description=contactin associated protein-like 1 [Source:MGI Symbol%3BAcc:MGI:1858201]
Gene region=chr11:101,061,349-101,081,550.
2 transcripts.

ENSMUSG00000017167 contains 2 transcripts.
ENSMUSG00000017167 transcript IDs:
 1. ENSMUST00000138942
 2. ENSMUST00000103109

Gene start = 101,061,349, gene end = 101,081,550


In [30]:
transcript_id: str = 'ENSMUST00000103109'
# --------------------------------------
print(f"\n{transcript_id=} annotation:")

if (ts_te := gAm.transcript_boundaries(transcript_id)) is None:
    raise ValueError(f"{transcript_id} is not a transcript of {gene} !!")
ts, te = ts_te
print(f"start site (TSS) = {ts:,}, end site (TES) = {te:,}.")


transcript_id='ENSMUST00000103109' annotation:
start site (TSS) = 101,066,867, end site (TES) = 101,081,550.


In [31]:
df = gAm.exon_map(transcript_id)
print(df.to_string())

    exon_number             exon_region  exon_size mRNA_NT_region ORF_NT_region ORF_AA_region  next_exon_frame_alignment             exon_ID
0             1  (101066867, 101067153)        287       [1, 287]       [1, 70]       [1, 24]                          2  ENSMUSE00000243064
1             2  (101068060, 101068161)        102     [288, 389]     [71, 172]      [24, 58]                          2  ENSMUSE00000112951
2             3  (101068249, 101068442)        194     [390, 583]    [173, 366]     [58, 122]                          0  ENSMUSE00000243052
3             4  (101068825, 101068972)        148     [584, 731]    [367, 514]    [123, 172]                          2  ENSMUSE00001279109
4             5  (101069062, 101069265)        204     [732, 935]    [515, 718]    [172, 240]                          2  ENSMUSE00000112949
5             6  (101069641, 101069825)        185    [936, 1120]    [719, 903]    [240, 301]                          0  ENSMUSE00000112946
6            

In [32]:
# protein sequence
print(f"Access mode is {gAm.access_mode()}.")

aa_seq = gAm.AA(transcript_id)
print(f"\nprotein:\n{aa_seq}\n{len(aa_seq):,} AAs.")

Access mode is remote.

protein:
MMSLRLFSILLATVVSGAWGWGYYGCNEELVGPLYARSLGASSYYGLFTTARFARLHGISGWSPRIGDPNPWLQIDLMKKHRIRAVATQGAFNSWDWVTRYMLLYGDRVDSWTPFYQKGHNATFFGNVNDSAVVRHDLHYHFTARYIRIVPLAWNPRGKIGLRLGIYGCPYTSSILYFDGDDAISYRFQRGASQSLWDVFAFSFKTEEKDGLLLHTEGSQGDYVTLELQGAHLLLHMSLGSSPIQPRPGHTTVSLGGVLNDLSWHYVRVDRYGRDANFTLDGYAHHFVLNGDFERLNLENEIFIGGLVGAARKNLAYRHNFRGCIENVIYNRINIAEMAVMRHSRITFEGNVAFRCLDPVPHPINFGGPHNFVQVPGFPRRGRLAVSFRFRTWDLTGLLLFSHLGDGLGHVELMLSEGQVNVSIAQTGRKKLQFAAGYRLNDGFWHEVNFVAQENHAVISIDDVEGAEVRVSYPLLIRTGTSYFFGGCPKPASRWGCHSNQTAFHGCMELLKVDGQLVNLTLVEFRKLGYFAEVLFDTCGITDRCSPNMCEHDGRCYQSWDDFICYCELTGYKGVTCHEPLYKESCEAYRLSGKYSGNYTIDPDGSGPLKPFVVYCDIRENRAWTVVRHDRLWTTRVTGSSMDRPFLGAIQYWNASWEEVSALANASQHCEQWIEFSCYNSRLLNTAGGYPYSFWIGRNEEQHFYWGGSQPGIQRCACGLDQSCVDPALHCNCDADQPQWRTDKGLLTFVDHLPVTQVVVGDTNRSNSEAQFFLRPLRCYGDRNSWNTISFHTGAALRFPPIRANHSLDVSFYFRTSAPSGVFLENMGGPFCRWRRPYVRVELNTSRDVVFAFDIGNGDENLTVHSDDFEFNDDEWHLVRAEINVKQARLRVDHRPWVLRPMPLQTYIWLVYDQPLYVGSAELKRRPFVGCLRAMRLNGVTLNLEGRANASEGTFPNCTGHCTHPRF

In [33]:
# Another example (gene encoded on the negative strand)
gene: str = 'Drg1'
gAm = u.Gene_cls(gene, annotation_full_file, species=species, verbose=verbose)

gAm.info()

print(f"\n{gene} contains {len(gAm)} transcripts.")
gAm.show_all_transcript_IDs()

print(f"\nGene start = {gAm.gene_start:,}, gene end = {gAm.gene_end:,}")

transcript_id: str = 'ENSMUST00000020741'
display(gAm.exon_intron_map(transcript_id))

mrna_seq = gAm.rna(transcript_id)
print(f"\nmRNA:\n{mrna_seq}\n{len(mrna_seq):,} bps.")

aa_seq = gAm.AA(transcript_id)
print(f"\nprotein:\n{aa_seq}\n{len(aa_seq):,} AAs.")


Loading ../AnnotationDB/Mus_musculus.GRCm39.113.gff3.gz to a dataframe ... Done.
Gene=Drg1, Gene name=Drg1, Gene ID=ENSMUSG00000020457
Source=../AnnotationDB/Mus_musculus.GRCm39.113.gff3.gz
Positive strand=False.
Type=protein_coding, version=14.
Description=developmentally regulated GTP binding protein 1 [Source:MGI Symbol%3BAcc:MGI:1343297]
Gene region=chr11:3,137,360-3,216,415.
7 transcripts.

Drg1 contains 7 transcripts.
Drg1 transcript IDs:
 1. ENSMUST00000159304
 2. ENSMUST00000020741
 3. ENSMUST00000139430
 4. ENSMUST00000142605
 5. ENSMUST00000166162
 6. ENSMUST00000125424
 7. ENSMUST00000132159

Gene start = 3,137,360, gene end = 3,216,415


| | name | region | region_size |
|---|---|---|---|
| 0 | Exon 1 | (3216393, 3216269) | 125 |
| 1 | Intron 1 | (3216268, 3214891) | 1378 |
| 2 | Exon 2 | (3214890, 3214767) | 124 |
| 3 | Intron 2 | (3214766, 3212665) | 2102 |
| 4 | Exon 3 | (3212664, 3212489) | 176 |
| 5 | Intron 3 | (3212488, 3209376) | 3113 |
| 6 | Exon 4 | (3209375, 3209306) | 70 |
| 7 | Intron 4 | (3209305, 3206713) | 2593 |
| 8 | Exon 5 | (3206712, 3206543) | 170 |
| 9 | Intron 5 | (3206542, 3204643) | 1900 |



mRNA:
AGTGGTGGCCAGAGGCGCGGCGGGCGCGAGAAGGGGGACCTAACCAGTGTGGAGGCTGGTGTGGCCTTAAGCCCTCCGCCACCATGAGCGGGACCTTAGCGAAGATCGCGGAGATCGAAGCCGAGATGGCTCGGACTCAGAAGAACAAAGCCACAGCTCATCACCTTGGCCTGCTTAAGGCTCGCCTTGCTAAACTTCGAAGAGAACTCATCACTCCTAAAGGTGGCGGCGGCGGGGGGCCAGGAGAAGGTTTTGATGTGGCCAAGACAGGTGATGCTCGAATTGGGTTTGTGGGTTTTCCATCTGTGGGGAAATCAACTCTGCTAAGTAACTTAGCAGGAGTGTATTCTGAAGTGGCAGCCTATGAATTCACTACATTGACCACAGTGCCTGGTGTCATCAGATATAAAGGTGCCAAGATCCAGCTCCTGGATCTCCCGGGTATTATTGAAGGTGCCAAGGATGGGAAAGGTAGAGGTCGTCAAGTCATTGCAGTGGCCCGAACGTGTAATTTGATTTTGATTGTTCTGGATGTCCTAAAACCCTTGGGACATAAGAAGATAATTGAAAATGAGTTGGAAGGCTTTGGGATTCGCTTGAACAGCAAACCCCCCAACATTGGCTTTAAGAAGAAAGACAAGGGAGGCATTAACCTCACAGCCACTTGTCCTCAGAGTGAACTGGATGCTGAAACAGTGAAGAGCATTCTGGCTGAATATAAGATTCATAATGCTGATGTGACTTTGCGTAGTGATGCCACAGCTGACGACCTCATTGACGTGGTCGAAGGGAACAGAGTTTATATTCCCTGTATCTATGTATTAAATAAGATTGATCAGATCTCCATTGAGGAGTTGGATATTATCTATAAGGTGCCTCACTGTGTACCCATCTCTGCTCATCACCGCTGGAATTTTGATGACCTTTTGGAGAAAATCTGGGACTATCTGAAACTAGTGAGAATTTACACCAAACCGAAAGGCCAGTTACCAG