# TransGenic: Generating annotations over many sequences

In this example, we will generate annotations for all the genic regions of chromosome 4 of the *Arabidopsis* genome. The file ```ATH_Chr4_gene.bed``` provides the location of each gene and will be used to construct a dataset of extracted DNA sequences from ```ATH_Chr4.fas```.

### Constructing an inference dataset
To pre-process the gene regions stored in the BED file into sequences usable by the model, we create a DuckDB database that collates the DNA sequence of each region with its gene identifier and chromosomal coordinates. The function below creates a DuckDB database called ```ath_chr4_predict.db```. Prediction datasets can be constructed from BED or GFF3 files. Note: GFF3 files should be sorted with AGAT or a similar tool prior to use.

In [1]:
import ipywidgets
from transgenic.datasets.preprocess import genome2GSFDataset

genome2GSFDataset(
	"ATH_Chr4.fas",
	"ATH_Chr4_gene.bed",
	"ath_chr4_predict.db",
	anoType="bed",
	mode = "predict"
)

[2025-04-21 10:00:13,913] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)


/home/pgl/scratch1/jlomas/miniforge3/envs/transgenic/compiler_compat/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
/home/pgl/scratch1/jlomas/miniforge3/envs/transgenic/compiler_compat/ld: cannot find -lcufile: No such file or directory
collect2: error: ld returned 1 exit status



Processing ATH_Chr4_gene.bed...


100%|██████████| 4128/4128 [00:16<00:00, 255.23it/s]
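Before building the database, it can help to sanity-check the regions listed in the BED file. The sketch below is a standard-library illustration and not part of TransGenic: the `read_bed` helper and the two example records are hypothetical, and it assumes a plain tab-separated BED file whose first three columns are chromosome, 0-based start, and end.

```python
import csv
import io
from collections import namedtuple

# Hypothetical container for one BED record (not a TransGenic type)
Region = namedtuple("Region", "chrom start end name")

def read_bed(handle):
    """Parse the first four BED columns, skipping header and comment lines."""
    regions = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith(("#", "track", "browser")):
            continue
        name = row[3] if len(row) > 3 else "."
        regions.append(Region(row[0], int(row[1]), int(row[2]), name))
    return regions

# Made-up records for illustration only; not taken from ATH_Chr4_gene.bed
example = "Chr4\t1000\t2500\tgeneA\nChr4\t3000\t3600\tgeneB\n"
regions = read_bed(io.StringIO(example))

# BED intervals are half-open, so the length is simply end - start
lengths = [r.end - r.start for r in regions]
print(f"{len(regions)} regions, longest = {max(lengths)} bp")
```

To check the real file, pass `open("ATH_Chr4_gene.bed")` to `read_bed` instead of the in-memory example.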


Next, initialize a PyTorch dataset and dataloader for use in the prediction loop.

In [2]:
from torch.utils.data import DataLoader
from transgenic.datasets.datasets import isoformDataHyena, hyena_collate_fn

# Initialize a torch Dataset from the preprocessed database
ds = isoformDataHyena(
	"ath_chr4_predict.db",
	mode = "inference"
)

# Create a DataLoader for the dataset
dl = DataLoader(
	ds,
	batch_size=1,
	shuffle=False,
	num_workers=0,
	pin_memory=True,
	collate_fn=hyena_collate_fn
)

### *De novo* prediction and GFF output

Loop through the sequences in the dataset to generate full annotations using a pretrained checkpoint. GSF model outputs are converted to GFF and written to an output file. For this example, we'll limit the output to 10 samples.

In [3]:
import torch
from tqdm import tqdm
from transformers import AutoModel, AutoTokenizer
from transgenic.utils.gsf import gffString2GFF3

# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model
model_name = "jlomas/HyenaTransgenic-512L9A4-160M"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()
model.to(device)

# Load the output tokenizer
gffTokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Prediction loop 
for step, batch in enumerate(tqdm(dl)):
	if step == 10:
		break
	
	ii, am = batch[0].to(device), batch[1].to(device)

	# Generate annotation with beam-search sampling (num_beams=2, do_sample=True)
	with torch.no_grad():
		outputs = model.generate(
					inputs=ii, 
					attention_mask=am, 
					num_return_sequences=1, 
					max_length=2048, 
					num_beams=2,
					do_sample=True
				)
	# Decode the output to GSF
	pred = gffTokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0].replace("|</s>", "").replace("</s>", "").replace("<s>", "")
	
	# Convert the GSF to GFF3
	gff = gffString2GFF3(pred, batch[4][0], batch[5][0], f"GM={batch[3][0]}")
	
	# Write the GFF3 output
	with open("ath_chr4_predict.gff", "a") as f:
		for line in gff:
			f.write(line + "\n")
		

  0%|          | 10/4127 [00:04<30:55,  2.22it/s]


In [4]:
!head ath_chr4_predict.gff

Chr4	transgenic	gene	2875	3270	.	+	.	ID=efac4490-e16e-4153-ab41-ac535e8a0406;GM=AT4G00005
Chr4	transgenic	mRNA	2875	3270	.	+	.	ID=efac4490-e16e-4153-ab41-ac535e8a0406.t1;Parent=efac4490-e16e-4153-ab41-ac535e8a0406;GM=AT4G00005
Chr4	transgenic	CDS	2875	3270	.	+	0	ID=efac4490-e16e-4153-ab41-ac535e8a0406.t1.CDS1;Parent=efac4490-e16e-4153-ab41-ac535e8a0406.t1;GM=AT4G00005
Chr4	transgenic	gene	2885	10464	.	-	.	ID=b0024f5a-da1d-47d6-9208-554ced090b63;GM=AT4G00020
Chr4	transgenic	mRNA	2885	10464	.	-	.	ID=b0024f5a-da1d-47d6-9208-554ced090b63.t1;Parent=b0024f5a-da1d-47d6-9208-554ced090b63;GM=AT4G00020
Chr4	transgenic	CDS	4075	4438	.	-	0	ID=b0024f5a-da1d-47d6-9208-554ced090b63.t1.CDS1;Parent=b0024f5a-da1d-47d6-9208-554ced090b63.t1;GM=AT4G00020
Chr4	transgenic	CDS	4545	4749	.	-	0	ID=b0024f5a-da1d-47d6-9208-554ced090b63.t1.CDS2;Parent=b0024f5a-da1d-47d6-9208-554ced090b63.t1;GM=AT4G00020
Chr4	transgenic	CDS	4839	4901	.	-	0	ID=b0024f5a-da1d-47d6-9208-554ced090b63.t1.CDS3;Parent=b0024f5a-da1d-47d6-92
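As a quick check on the written GFF3, the features can be tallied per gene model. This is a minimal standard-library sketch rather than a TransGenic utility: `parse_attributes` and `count_features` are hypothetical helpers, and the sample lines are copied from the output above. It relies on the `GM` attribute that the prediction loop attaches to each record.

```python
from collections import Counter

def parse_attributes(column):
    """Turn a GFF3 attribute column ('key=value;key=value') into a dict."""
    return dict(item.split("=", 1) for item in column.strip().split(";") if "=" in item)

def count_features(lines):
    """Count records per (gene model, feature type) pair."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:  # skip truncated or malformed records
            continue
        attrs = parse_attributes(fields[8])
        counts[(attrs.get("GM", "unknown"), fields[2])] += 1
    return counts

gene_a = "efac4490-e16e-4153-ab41-ac535e8a0406"
gene_b = "b0024f5a-da1d-47d6-9208-554ced090b63"
sample = [
    f"Chr4\ttransgenic\tgene\t2875\t3270\t.\t+\t.\tID={gene_a};GM=AT4G00005",
    f"Chr4\ttransgenic\tmRNA\t2875\t3270\t.\t+\t.\tID={gene_a}.t1;Parent={gene_a};GM=AT4G00005",
    f"Chr4\ttransgenic\tCDS\t2875\t3270\t.\t+\t0\tID={gene_a}.t1.CDS1;Parent={gene_a}.t1;GM=AT4G00005",
    f"Chr4\ttransgenic\tgene\t2885\t10464\t.\t-\t.\tID={gene_b};GM=AT4G00020",
]

counts = count_features(sample)
for (gene_model, feature_type), n in sorted(counts.items()):
    print(gene_model, feature_type, n)
```

For the full file, pass `open("ath_chr4_predict.gff")` to `count_features` in place of `sample`.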

### Prompt completion, predicting splice variants

TransGenic can also be used to add splice variants to existing annotations. For this use case, we'll construct a dataset with GSF labels from the sorted reference GFF3 annotation. The features of the first transcript are then supplied to the decoder as a prompt, and the model completes the rest of the annotation.

Here, we'll create a GFF3 dataset with GSF labels.

In [5]:
# Create a new database with GSF labels from the sorted reference annotation
genome2GSFDataset(
	"ATH_Chr4.fas",
	"ATH_Chr4.sorted.gff3",
	"ath_chr4_train.db",
	anoType="gff",
	mode = "train"
)


Processing ATH_Chr4.sorted.gff3...


100%|██████████| 81369/81369 [00:15<00:00, 5117.17it/s]


In [6]:
from torch.utils.data import DataLoader
from transgenic.datasets.datasets import isoformDataHyena, hyena_collate_fn

# Initialize a torch Dataset from the preprocessed database
ds_comp = isoformDataHyena(
	"ath_chr4_train.db",
	mode = "train"
)

# Create a DataLoader for the dataset
dl_comp = DataLoader(
	ds_comp,
	batch_size=1,
	shuffle=False,
	num_workers=0,
	pin_memory=True,
	collate_fn=hyena_collate_fn
)

Now, we can loop through the sequences while providing input to the decoder.

In [None]:
import torch
from tqdm import tqdm
from transformers import AutoModel, AutoTokenizer
from transgenic.utils.gsf import gffString2GFF3

# Prediction loop 
for step, batch in enumerate(tqdm(dl_comp)):
	if step == 10:
		break
	
	ii, am, lab = batch[0].to(device), batch[1].to(device), batch[2].to(device)

	# Get elements of first transcript to use as decoder input ids
	labs = ",".join([str(i) for i in lab.tolist()[0]])
	last_element = labs.split(",17,")[1].split(",21,")[0].split(",")[-1]
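	try:
		# Find the first feature that contains the last element of the first transcript
		last_element_index = [f",{last_element}," in i for i in labs.split(",17,")[0].split(",21,")].index(True)
	except ValueError:
		# No feature matched; fall back to the full feature list
		last_element_index = len(labs.split(",17,")[0].split(",21,")) - 1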
	
	dii = torch.tensor(list(map(int, ",21,".join(labs.split(",17,")[0].split(f",21,")[0:last_element_index+1]).split(",")))).unsqueeze(0).to(device)

	# Generate annotation with beam-search sampling (num_beams=2, do_sample=True)
	with torch.no_grad():
		outputs = model.generate(
					inputs=ii, 
					attention_mask=am, 
					num_return_sequences=1, 
					max_length=2048, 
					num_beams=2,
					do_sample=True,
					decoder_input_ids = dii
				)

	# Decode the output to GSF
	pred = gffTokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0].replace("|</s>", "").replace("</s>", "").replace("<s>", "")
	
	# Convert the GSF to GFF3
	gff = gffString2GFF3(pred, batch[4][0], batch[5][0], f"GM={batch[3][0]}")
	
	# Write the GFF3 output
	with open("ath_chr4_completion.gff", "a") as f:
		for line in gff:
			f.write(line + "\n")

  0%|          | 10/4127 [00:01<09:12,  7.45it/s]
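The string manipulation that builds the decoder prompt in the loop above can be hard to follow, so here is the same idea expressed on a plain list of token IDs. This is a sketch, not the package's implementation. Judging from the loop, it assumes that token ID 17 separates the feature list from the transcript list and that token ID 21 separates items within each. The helper names `split_on` and `first_transcript_prompt` and the toy label sequence are hypothetical. Because it matches whole tokens rather than substrings, edge cases may behave differently from the string-based version.

```python
def split_on(ids, sep):
    """Split a list of token IDs into groups at every occurrence of sep."""
    groups = [[]]
    for token in ids:
        if token == sep:
            groups.append([])
        else:
            groups[-1].append(token)
    return groups

def first_transcript_prompt(label_ids, section_sep=17, item_sep=21):
    """Keep features up to the one holding the first transcript's last token."""
    if section_sep not in label_ids:
        return list(label_ids)
    cut = label_ids.index(section_sep)
    features = split_on(label_ids[:cut], item_sep)
    first_transcript = split_on(label_ids[cut + 1:], item_sep)[0]

    keep = len(features)  # fallback: keep every feature
    if first_transcript:
        last_token = first_transcript[-1]
        for i, feature in enumerate(features):
            if last_token in feature:
                keep = i + 1
                break

    # Re-join the kept features with the item separator
    prompt = []
    for i, feature in enumerate(features[:keep]):
        if i:
            prompt.append(item_sep)
        prompt.extend(feature)
    return prompt

# Toy labels: three features, then two transcripts (IDs are made up)
toy = [30, 40, 50, 21, 31, 41, 51, 21, 32, 42, 52, 17, 40, 41, 21, 40, 42]
prompt = first_transcript_prompt(toy)
print(prompt)
```

In the prediction loop, a prompt built this way would be wrapped as `torch.tensor(prompt).unsqueeze(0).to(device)` before being passed as `decoder_input_ids`.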


In [8]:
!head ath_chr4_completion.gff