# Abstract

Due to its relatively small genome and cosmopolitan distribution, Arabidopsis thaliana has long served as a model organism in plant genetics. As the field moves from the age of physiology and genetics into the age of genomics, its use has broadened to include large-scale sequencing efforts, enabling research into plant evolution and genetic variation. Here, we assemble and annotate the genome of the Arabidopsis Pa-1 accession using multiple assemblers and evaluate the quality of each. Flye was ultimately chosen due to its superior contiguity and accuracy. K-mer spectra, BUSCO duplication rates, and dotplots all pointed to high heterozygosity in Pa-1, which affected the correctness of both assembly and annotation. We examine which qualities are desirable in de novo genome assembly and gene annotation, and which metrics to use to evaluate them. We also examine the annotated genome for signatures of transposable element (TE) dynamics such as TE distribution, age, and phylogeny, revealing a recent expansion of Mutator elements and diversity in Copia and Gypsy LTR retrotransposons.

# Introduction

Recent advances in high-throughput sequencing have opened new possibilities for generating and analyzing genomic data. Arabidopsis thaliana is an extremely well characterized model organism in plant research and an ideal system in which to study genome structure and evolution. Its cosmopolitan distribution has led to the diversification of genomes across different environments, providing a broad dataset for exploring genetic variation (1). One major advance enabling this research is long-read data, such as that from Oxford Nanopore or PacBio HiFi sequencing. Long reads allow us to assemble genetic features that have traditionally been difficult to resolve with short-read sequencing, such as transposable elements (TEs). Because TEs have repetitive structures at their boundaries, they have historically tended to confound genome assembly (2). Long-read sequencing now enables us to investigate TE evolutionary history more robustly. TEs are not simply genetic artifacts; they drive genome evolution by creating new genes, new regulation, and structural variation (3). Understanding their history can provide insight into the evolution of the genome and its adaptation to different environments. By analyzing TE age and phylogenetic relationships, we can start to infer the history of genome evolution in Arabidopsis itself.

In this study, we focus on the Arabidopsis Pa-1 accession, which was originally excluded from analysis in the source paper due to heterozygosity. Heterozygosity presents challenges in genome assembly, causing fragmented and duplicated contigs. However, this obstacle can be circumvented to still yield a high quality genome.

# Materials & Methods

## Sequencing Quality Control

PacBio HiFi reads from the Arabidopsis Pa-1 accession, from a dataset provided by Lian et al. (4), and RNA-seq data from Jiao and Schneeberger (5) were first analyzed with FastQC (6). We then preprocessed the reads with fastp (7), filtering out all reads with low quality scores, and ran FastQC again on the processed reads. We also used Jellyfish to perform k-mer counting (8).
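To make the k-mer counting step concrete, the following is a minimal Python sketch of what a k-mer counter computes: the number of occurrences of each canonical k-mer, and the multiplicity histogram derived from those counts. It is an illustration only, not the Jellyfish implementation; the function names and the choice to skip k-mers containing non-ACGT characters are assumptions made for this example.

```python
from collections import Counter

# Translation table used to build reverse complements.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k: int) -> Counter:
    """Count canonical k-mers across an iterable of read sequences."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Assumption for this sketch: ignore k-mers with ambiguous bases.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def multiplicity_histogram(counts: Counter) -> Counter:
    """Map each multiplicity to the number of distinct k-mers seen that many times."""
    return Counter(counts.values())


toy_reads = ["ACGTAC", "GTACGT"]  # toy data, not real reads
toy_counts = count_kmers(toy_reads, k=3)
toy_histogram = multiplicity_histogram(toy_counts)
```

The multiplicity histogram is the quantity plotted in a k-mer spectrum: in a heterozygous sample, k-mers unique to one allele are expected to appear at roughly half the depth of k-mers shared by both alleles.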

## Genome and Transcriptome Assembly and Quality Control

To assemble the genome de novo, we compared three assemblers: LJA (9), Flye (10), and Hifiasm (11). Transcriptome reads were assembled using Trinity (12). The quality of each assembly was first evaluated using BUSCO (13), specifying the brassicales_odb10 lineage. We then checked common assembly metrics using Quast (14) and additionally analyzed the assemblies via k-mers with Merqury (15). Finally, each assembly was compared against the Arabidopsis reference genome, and against the other assemblies, in dotplots generated with Nucmer and Mummer (16).

## Transposable Element Annotation and Classification

Transposable elements in the assembled genome were first identified using EDTA (17), with the option `--anno 1` to perform whole genome annotation. A CDS file of canonical Arabidopsis genes from the TAIR database was also provided to perform gene masking during the process. The classification and visualization of LTR-TEs were performed using an R script provided by R. Choudhury and the circlize R package (18). A more refined classification of elements into clades was then performed using TEsorter (19).

## Transposable Element Dating and Phylogeny

The Perl script parseRM.pl (20) was used to process raw alignments and calculate an estimate of percent divergence between TEs. With these data, the ages of all TEs were calculated and plotted in R. To generate phylogenetic trees, data were processed using TEsorter and sequences were aligned with ClustalOmega (21). Maximum likelihood trees for the sequences were inferred with FastTree (22), and results were visualized and annotated with the online iTOL tool (23).

## Functional Annotation and Quality Control

Genes in the assembly were annotated using Maker (24) and an Arabidopsis database from Augustus (25). Functional annotations were then determined using InterProScan (26) and the Pfam database, and the gene models were quality filtered with an annotation edit distance (AED) cutoff of 0.5, resulting in transcript and protein fasta files containing only high quality annotations. We then validated these models using BUSCO to assess the completeness of the annotated protein sequences compared to many Brassicales taxa. We additionally used OMArk (27) to quality control the gene annotations based on proteome orthology. We identified Hierarchical Orthologous Groups (HOGs) and used those mapping to fragmented and missing gene models to improve our annotations.

## Comparative Genomics

Finally, genome comparisons between multiple accessions were made using Genespace (28) and OrthoFinder (29). Synteny was analyzed between Lu-1, Pa-1, RRS10, Kar_1, and the canonical TAIR10 genome via pairwise dotplots and riparian plots, also generated with Genespace.

# Results

After preprocessing, the 476430 genomic reads ranged in length from 72 to 49452 bp, with an average GC content of 36%. In comparison, the RNA-seq data comprised 22620680 reads with an average GC content of 46%. The quality of the RNA reads also decreased after approximately 60 base pairs (Supplemental Figure 1).

![](fastqc_after.png)  
Figure 1: Fastqc results of the genomic reads after fastp. 

![](jelllyfish.png)  
  Figure 2: Jellyfish plot counting kmer multiplicity.  

Jellyfish was then used to count k-mers and estimate the genome size (Figure 2). The calculated genome size is 135,581,494 bp, which agrees with the canonical Arabidopsis genome size of approximately 135 Mb. Using this information, we can then estimate coverage over the genome: $\text{Depth} = \frac{\text{Mean Read Length} \times \text{Num. Reads}}{\text{Genome Length}} = 50.95$x coverage, which should adequately cover the whole genome. Jellyfish also reports an $ab$ of 0.111%, which agrees with the assessment by the authors of the source data that this accession displays high heterozygosity. We can see a shoulder to the left of the main peak, at half of its coverage, which suggests that different alleles split the coverage for some k-mers.
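The depth calculation above can be checked with a few lines of Python. This is a sketch under stated assumptions: the read count, genome size estimate, and depth are the values reported in the text, but the mean read length is not reported, so it is back-calculated here from the other three quantities rather than measured. The function names are chosen for this example.

```python
def sequencing_depth(mean_read_length: float, num_reads: int, genome_length: int) -> float:
    """Average depth = total sequenced bases / genome length."""
    return mean_read_length * num_reads / genome_length


def implied_mean_read_length(depth: float, num_reads: int, genome_length: int) -> float:
    """Invert the depth formula to recover the mean read length."""
    return depth * genome_length / num_reads


# Values reported in the text.
NUM_READS = 476_430
GENOME_LENGTH = 135_581_494  # bp, k-mer based estimate
REPORTED_DEPTH = 50.95

# Assumption: the mean read length is not reported, so it is back-calculated.
mean_len = implied_mean_read_length(REPORTED_DEPTH, NUM_READS, GENOME_LENGTH)
print(f"implied mean read length: {mean_len:.0f} bp")
print(f"depth: {sequencing_depth(mean_len, NUM_READS, GENOME_LENGTH):.2f}x")
```

The implied mean read length comes out at roughly 14.5 kb, which is within the 72 to 49452 bp range observed after preprocessing.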

## Assembly Quality Control

After assembly, we evaluated the quality of the genomes. BUSCO was first used to check each assembly against the canonical Brassicales gene set. Given that we expect to see a certain number of duplicated genes, Flye and Hifiasm perform slightly better than LJA in capturing all BUSCOs and reducing duplications (Table 1).

Assembler|Complete|Duplicated|Missing
---|---|---|---
Flye|94%|5.8%|0.1%
Hifiasm|94%|5.9%|0%
LJA|92.2%|7.6%|0%

Table 1: BUSCO analysis for the three genome assemblies.

We then compared the assemblies using Quast metrics (Figure 3). Although LJA produces the largest genome, it also has the smallest NGA50, indicating that many short contigs were assembled. In comparison to LJA and Hifiasm, Flye generated the largest NGA50 (104592), the smallest number of contigs (133), and the smallest number of misassemblies (4538).

![](quast_report.png)  
  Figure 3: Quast report.  
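For readers unfamiliar with the contiguity statistics in the Quast report, the sketch below computes N50 and NG50 from a list of contig lengths. It is a simplified illustration: NGA50, the metric quoted above, is additionally based on aligned blocks obtained after breaking contigs at misassemblies, which requires alignment to a reference and is not reproduced here. The function name and toy lengths are hypothetical.

```python
def nx50(contig_lengths, reference_length=None):
    """Return N50, or NG50 when a reference (or estimated genome) length is given.

    The result is the length L such that contigs of length >= L together
    cover at least half of the target length.
    """
    if reference_length is not None:
        target = reference_length
    else:
        target = sum(contig_lengths)
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= target / 2:
            return length
    return 0  # the contigs cover less than half of the target length


toy_contigs = [10, 8, 6, 4, 2]  # toy lengths, not from the Pa-1 assemblies
n50 = nx50(toy_contigs)
ng50 = nx50(toy_contigs, reference_length=40)
```

Because NG50 and NGA50 are normalized by the reference length rather than the assembly length, an assembler that inflates the total assembly size with many short contigs does not gain an advantage on these metrics.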

In a third comparison, we ran Merqury to create k-mer multiplicity plots. Once again, a shoulder at half the coverage of the 1-copy k-mer peak shows up in all three assemblies; one example is given below in Figure 4. This feature is likely caused by heterozygous k-mers present in single copy, and it cannot trivially be eliminated with standard assembly parameters.

![](flye.assembly.spectra-cn.fl.png)  
  Figure 4: Merqury report.  

## Dotplots

Finally, dotplots created with Nucmer and Mummer were compared. In the LJA and Flye plots, we see a distinctive duplicated segment mapping to the same region of the reference genome (Supplemental Figure 2). However, this does not appear in the Hifiasm plot, suggesting that this assembler was perhaps able to resolve the heterozygosity.

## Transposable Element Annotation and Classification

Outputs from EDTA were initially analyzed using R. Counts and percent identities for annotated transposons were compared, totaling 137 detected Copia elements, 156 Gypsy elements, and 27 uncategorized elements. When broken down into clades, the counts were more variable. Athila had the highest count of any clade, at 76, followed by Ale at 64, while others such as Alesia, Angela, and TAR had only one element each. In both superfamilies, all clades showed high percent identity with canonical sequences, above 90% (Supplemental Figure 3), indicating primarily younger insertions. Further R analysis showed that the Mutator and Helitron classes of transposons had the greatest detected numbers, at 21229 and 14365 respectively. We plotted these on a filtered set of the 20 largest contigs, which sum to 112,614,314 bp, approximately 77% of the fully assembled genome. Further analysis was performed using circlize to identify where on the contigs these elements lie:

![](02-TE_density.png)  
  Figure 6: Circlize plot showing Copia (green), Gypsy (red), Mutator TIR (purple), and DNA/DTM (blue) elements.  

Figure 6 shows the results for the largest 20 contigs. Notably, the Gypsy and Copia LTRs (red and green) seem relatively interspersed, while Mutator TIRs (purple) seem to occur almost exclusively in regions where the LTRs are absent. The Mutator elements also seem to cluster more tightly together, while other TEs are more evenly distributed. Mutator TIRs are the largest class of elements detected, likely because they can reach transposition rates of up to 100% (30). When classifying the TEs with TEsorter, we see a greater number of detected Copia elements, at 379, and a similar number of Gypsy elements, at 143. We then compared TE abundance with other accessions:

![](TEs.png)  
  Figure 7: Comparing TE frequencies amongst accessions.  

The Pa-1 accession had better classification of TEs, resulting in a lower fraction of unknown LTRs, and additionally had higher than average amounts of Mutator TIRs.

## TE Age Estimation and Phylogenetic Trees

We calculated the estimated sequence divergence $K$ from the consensus motif for all TEs, then converted this to an insertion time $T$ using $T = \frac{K}{2r}$ with $r \approx 8.22\times10^{-9}$ (31). Figure 8 shows violin plots of the density of all TEs over 30 million years:

![](TE_age_all_violins.png)  
  Figure 8: Violin plots representing TE age.  
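The conversion from divergence to insertion age, $T = \frac{K}{2r}$, can be sketched in Python as follows. The rate is the value given in the text; interpreting it as substitutions per site per year, taking divergence as a percentage, and applying no multiple-substitution correction are assumptions of this sketch. The function name is chosen for illustration and is not taken from the project's R scripts.

```python
# Rate from the text; assumed here to be substitutions per site per year.
SUBSTITUTION_RATE = 8.22e-9


def insertion_age_years(percent_divergence: float, rate: float = SUBSTITUTION_RATE) -> float:
    """Convert percent divergence from the consensus into an age via T = K / (2r)."""
    k = percent_divergence / 100.0  # divergence as a proportion of sites
    return k / (2.0 * rate)


# Example: divergence values (in percent) for a few hypothetical TE copies.
for divergence in (1.0, 8.22, 30.0):
    age_myr = insertion_age_years(divergence) / 1e6
    print(f"{divergence:5.2f}% divergence -> {age_myr:.2f} million years")
```

Under these assumptions, a divergence of 8.22% corresponds to 5 million years, which is the region of the age distribution where the Mutator spike appears.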

As we can see, the majority of TEs were consistently active up to about 20 million years ago. However, we observe a much more recent spike in activity around 5 million years ago for the DNA/DTM Mutator elements. This corresponds to their overrepresentation among detected TEs and to their relative isolation in the genome: as the apparently youngest class of elements, they occupy regions into which few other TEs have had time to insert.

Focusing further on Copia and Gypsy elements, we analyzed the CDS order of both superfamilies to generate phylogenetic trees (32):

![Copia Tree](copia_full_tree.png)  
  Figure 9a: Copia TE phylogenetic tree.  

![Gypsy Tree](gypsy_full_tree.png)  
  Figure 9b: Gypsy TE phylogenetic tree.  

Among the Copia elements in the tree, the Ale clade seems to be overrepresented relative to other clades. This suggests that more diversity was generated in the past, and that more recent evolution is restricted mostly to one clade. This is consistent with the observation that Copia insertion ages peak around 17.5 million years ago, making Copia one of the older groups of elements. In contrast, the Gypsy elements show a few clades that are more divergent from the others and more recent, while older parts of the tree all belong to the Reina or CRM clades. This suggests that Gypsy elements have been more active recently, and thus have more extant diversity.

## Functional Annotation and Quality Control

We identified 23142 genes in the raw output, before filtering down to 22583 genes according to Annotation Edit Distance (AED), Maker's internal metric for evaluating the quality of an annotation. AED evaluates an annotation in the context of the experimental evidence for a gene and returns a score between 0 and 1, where lower scores indicate better agreement with the evidence. High quality annotations tend to fall below 0.5, so constraining to this range eliminates spurious annotations. The filtered count is lower than the number of genes predicted in the model, 27416.
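As an illustration of the AED filter, the sketch below selects mRNA features from GFF3 lines whose `_AED` attribute does not exceed a cutoff. It assumes that the Maker output stores the score as an `_AED=` key in the ninth GFF3 column of mRNA features, and it treats the cutoff as inclusive; both points, along with the function name and the example records, are assumptions for this sketch rather than a description of the scripts actually used.

```python
import re

# Match "_AED=<number>" at the start of the attribute string or after a ";",
# so that the separate "_eAED=" key is not picked up by mistake.
_AED_PATTERN = re.compile(r"(?:^|;)_AED=([0-9.]+)")
_ID_PATTERN = re.compile(r"(?:^|;)ID=([^;]+)")


def mrna_ids_passing_aed(gff_lines, max_aed: float = 0.5):
    """Return IDs of mRNA features with AED <= max_aed (inclusive cutoff assumed)."""
    passing = []
    for line in gff_lines:
        if line.startswith("#"):
            continue  # skip GFF3 headers and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "mRNA":
            continue
        aed_match = _AED_PATTERN.search(fields[8])
        id_match = _ID_PATTERN.search(fields[8])
        if aed_match and id_match and float(aed_match.group(1)) <= max_aed:
            passing.append(id_match.group(1))
    return passing


# Hypothetical records for illustration only.
example_gff = [
    "##gff-version 3",
    "ctg1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "ctg1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=gene1-RA;Parent=gene1;_AED=0.12;_eAED=0.12",
    "ctg1\tmaker\tmRNA\t2000\t2900\t.\t-\t.\tID=gene2-RA;Parent=gene2;_AED=0.87;_eAED=0.90",
]
kept = mrna_ids_passing_aed(example_gff)
```

The IDs returned by such a filter can then be used to subset the transcript and protein fasta files so that only models passing the cutoff are carried forward.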

Next, we used BUSCO to determine the completeness of the annotated genome based on a reference Arabidopsis database. 

![Busco report](busco_figure.png)  
  Figure 10: BUSCO report  

Although we had already eliminated all but the longest version of each protein and transcript, we still see a relatively low proportion of complete BUSCOs, only around 80%. We also see a large proportion of both missing and duplicated BUSCOs:

Type|Duplicated|Missing
---|---|---
Protein|5.5%|12.8%
Transcript|7.2%|10.4%

Table 2: BUSCO analysis for protein and transcript annotations.

Validating the quality of the annotation based on protein orthology using OMArk confirmed similar patterns:

![](assembly.all.maker.proteins.fasta.renamed.fasta.png)  
  Figure 11: OMArk analysis.   

We see that, once again, only about 80% of HOGs are complete and single-copy, while the remaining 10-20% are mostly duplicated and missing HOGs. 

## Comparative Genomics

Finally, we used Genespace to examine synteny between the accessions and the reference sequence. Looking at a riparian plot, we see immediately that the genome is shorter in Pa-1 than in the others:

![](rip.png)  
  Figure 12: Riparian plot comparing accessions to reference TAIR10.   

Interestingly, we see two full contigs from Pa-1 both mapping to the same contig in Lu-1 - and this is corroborated with the Pa-1 dotplot:  

![](dotplot.png)  
 Figure 13: Pa-1 self similarity dotplot. Duplication is clearly seen in bottom left.  

Notably, contigs 86 and 87 are inverted relative to each other on chromosome 2. Additionally, we notice a break in an adjacent section of chromosome 2 on the reference genome, where nothing in the Pa-1 accession maps.

# Discussion

## Genome Assembly

In this project we explore two highly relevant technologies in the genomics age: the ability to assemble entire genomes, and the ability to annotate the genes therein in high throughput. We use multiple command-line bioinformatics tools to assemble and annotate the genome of a diverse cohort of Arabidopsis samples, and examine ways to orthogonally validate gene annotations, as well as analyzing other genomic features of interest such as transposable elements.

When assembling genomes, there is no perfect tool and no perfect metric. We evaluated the LJA, Flye, and Hifiasm assemblers to see which produced the most accurate and complete genome. Generally speaking, we prefer fewer, larger contigs over many smaller ones, so metrics like NGA50 are of interest. The Pa-1 accession generally seems to be handled best by the Flye assembler, which has the lowest number of contigs and misassemblies and the largest NGA50. Hifiasm produced several times more contigs and many more reported misassemblies, while in BUSCO analysis it was only equivalent to Flye. As was noted by the original authors, Pa-1 seems to have higher heterozygosity, which complicates analysis. This was corroborated by Jellyfish and Merqury k-mer analysis and by BUSCO duplication rates compared to other accessions. The shoulder peak in k-mer multiplicity plots suggests that alternative alleles exist, leading to halved coverage, and this phenomenon is further confirmed by Mummer dotplots, in which the LJA and Flye assemblies display duplicate contigs mapping to the same reference sequence. Interestingly, Hifiasm's assembly did not exhibit these duplications and may have been better at handling heterozygosity, at the cost of contig size and genome completeness.

## Annotation

Analyzing TEs revealed distinct behaviors concerning phylogeny and genomic distribution. The circlize plot showed that TIR Mutator elements tended to cluster together, while LTR retrotransposons were more dispersed across the genome, except within TIR Mutator regions. This clustering could be due to more recent activity of TIR Mutator elements, as suggested by TE age estimation, which showed a strong density of younger TIR insertions peaking about 5 million years ago. These results are consistent with the fact that TIR elements can have extremely high transposition rates and can preferentially insert into specific sequences (33)(34).

Copia and Gypsy elements behave differently from each other. Copia elements displayed greater diversity in older parts of the phylogenetic tree, with most clades forming early and few recent insertions. This suggests that Copia diversity was generated in the past, and that recent insertions originate from a limited number of active elements within the Ale clade. This is corroborated by the TE age violin plot, which shows a fairly continuous distribution over the last 30 million years with a bias towards older elements. Conversely, the pattern of Gypsy clades on the phylogeny suggests that Gypsy is more diverse now than it was in the past, with newer insertions falling in different clades. Perhaps a larger variety of elements has been active recently, whereas longer ago only a small set of elements was actively inserting. This is also corroborated by the violin plot, where the Gypsy density is biased slightly more towards recent insertions than the Copia density. This suggests that the diversity-generating mechanism for Gypsy elements is more recent, and that many more elements are still active.

Gene annotation quality was evaluated using BUSCO and OMArk. Although many genes were retained after AED filtering, BUSCO analysis revealed that only about 80% of BUSCOs were complete and single copy, whereas a significant fraction were missing or duplicated. OMArk analysis further confirmed this pattern, with substantial numbers of duplicated or missing HOGs. Riparian plots revealed that some contigs in Pa-1 map redundantly to the same regions in other accessions. These results suggest that while our annotation pipeline captured a majority of genes, heterozygosity did interfere with the gene models. The presence of duplicated gene models aligns with patterns observed earlier and throughout the entire pipeline.

In summary, even with a certain amount of heterozygosity in the sample data, interesting conclusions about genome dynamics and evolution can still be drawn. In the future, it would be useful to continue improving the handling of heterozygosity once it is detected in the sequence data. It may be helpful to develop pipelines that incorporate assemblers specialized for particular edge cases in genetic data, rather than applying the same algorithm or parameters to everything. Since there are usually tradeoffs in software performance, using the best tool for each class of genomic feature during assembly may yield the best genomes.

# Code Availability

All scripts and output files can be accessed at: [https://github.com/riinajh/genome_assembly_annotation](https://github.com/riinajh/genome_assembly_annotation)

# References 

1. Horton, Matthew W., Angela M. Hancock, Yu S. Huang, Christopher Toomajian, Susanna Atwell, Adam Auton, N. Wayan Muliyati, et al. 2012. “Genome-Wide Patterns of Genetic Variation in Worldwide Arabidopsis Thaliana Accessions from the RegMap Panel.” Nature Genetics 44 (2): 212–16. https://doi.org/10.1038/ng.1042.

2. Peona, Valentina, Mozes P. K. Blom, Luohao Xu, Reto Burri, Shawn Sullivan, Ignas Bunikis, Ivan Liachko, et al. 2021. “Identifying the Causes and Consequences of Assembly Gaps Using a Multiplatform Genome Assembly of a Bird‐of‐paradise.” Molecular Ecology Resources 21 (1): 263–86. https://doi.org/10.1111/1755-0998.13252.

3. He, Xin, Zhengyang Qi, Zhenping Liu, Xing Chang, Xianlong Zhang, Jianying Li, and Maojun Wang. 2024. “Pangenome Analysis Reveals Transposon-Driven Genome Evolution in Cotton.” BMC Biology 22 (1): 92. https://doi.org/10.1186/s12915-024-01893-2.

4. Lian, Qichao, Bruno Huettel, Birgit Walkemeier, Baptiste Mayjonade, Céline Lopez-Roques, Lisa Gil, Fabrice Roux, Korbinian Schneeberger, and Raphael Mercier. 2024. “A Pan-Genome of 69 Arabidopsis Thaliana Accessions Reveals a Conserved Genome Structure throughout the Global Species Range.” Nature Genetics 56 (5): 982–91. https://doi.org/10.1038/s41588-024-01715-9.

5. Jiao, Wen-Biao, and Korbinian Schneeberger. 2020. “Chromosome-Level Assemblies of Multiple Arabidopsis Genomes Reveal Hotspots of Rearrangements with Altered Evolutionary Dynamics.” Nature Communications 11 (1): 989. https://doi.org/10.1038/s41467-020-14779-y.

6. Andrews, Simon. (2017) 2025. “S-Andrews/FastQC.” Java. https://github.com/s-andrews/FastQC.

7. Chen, Shifu, Yanqing Zhou, Yaru Chen, and Jia Gu. 2018. “Fastp: An Ultra-Fast All-in-One FASTQ Preprocessor.” Bioinformatics 34 (17): i884–90. https://doi.org/10.1093/bioinformatics/bty560.

8. Marçais, Guillaume, and Carl Kingsford. 2011. “A Fast, Lock-Free Approach for Efficient Parallel Counting of Occurrences of k-Mers.” Bioinformatics 27 (6): 764–70. https://doi.org/10.1093/bioinformatics/btr011.

9. Bankevich, Anton, Andrey V. Bzikadze, Mikhail Kolmogorov, Dmitry Antipov, and Pavel A. Pevzner. 2022. “Multiplex de Bruijn Graphs Enable Genome Assembly from Long, High-Fidelity Reads.” Nature Biotechnology 40 (7): 1075–81. https://doi.org/10.1038/s41587-022-01220-6.

10. Kolmogorov, Mikhail, Jeffrey Yuan, Yu Lin, and Pavel A. Pevzner. 2019. “Assembly of Long, Error-Prone Reads Using Repeat Graphs.” Nature Biotechnology 37 (5): 540–46. https://doi.org/10.1038/s41587-019-0072-8.

11. Cheng, Haoyu, Mobin Asri, Julian Lucas, Sergey Koren, and Heng Li. 2024. “Scalable Telomere-to-Telomere Assembly for Diploid and Polyploid Genomes with Double Graph.” Nature Methods 21 (6): 967–70. https://doi.org/10.1038/s41592-024-02269-8.

12. Grabherr, Manfred G., Brian J. Haas, Moran Yassour, Joshua Z. Levin, Dawn A. Thompson, Ido Amit, Xian Adiconis, et al. 2011. “Trinity: Reconstructing a Full-Length Transcriptome without a Genome from RNA-Seq Data.” Nature Biotechnology 29 (7): 644–52. https://doi.org/10.1038/nbt.1883.

13. Manni, Mosè, Matthew R Berkeley, Mathieu Seppey, Felipe A Simão, and Evgeny M Zdobnov. 2021. “BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes.” Molecular Biology and Evolution 38 (10): 4647–54. https://doi.org/10.1093/molbev/msab199.

14. Gurevich, Alexey, Vladislav Saveliev, Nikolay Vyahhi, and Glenn Tesler. 2013. “QUAST: Quality Assessment Tool for Genome Assemblies.” Bioinformatics 29 (8): 1072–75. https://doi.org/10.1093/bioinformatics/btt086.

15. Rhie, Arang, Brian P. Walenz, Sergey Koren, and Adam M. Phillippy. 2020. “Merqury: Reference-Free Quality, Completeness, and Phasing Assessment for Genome Assemblies.” Genome Biology 21 (1): 245. https://doi.org/10.1186/s13059-020-02134-9.

16. Marçais, Guillaume, Arthur L. Delcher, Adam M. Phillippy, Rachel Coston, Steven L. Salzberg, and Aleksey Zimin. 2018. “MUMmer4: A Fast and Versatile Genome Alignment System.” PLOS Computational Biology 14 (1): e1005944. https://doi.org/10.1371/journal.pcbi.1005944.

17. Ou, Shujun, Weija Su, Yi Liao, Kapeel Chougule, Jireh R. A. Agda, Adam J. Hellinga, Carlos Santiago Blanco Lugo, et al. 2019. “Benchmarking Transposable Element Annotation Methods for Creation of a Streamlined, Comprehensive Pipeline.” Genome Biology 20 (1): 275. https://doi.org/10.1186/s13059-019-1905-y.

18. Gu, Zuguang, Lei Gu, Roland Eils, Matthias Schlesner, and Benedikt Brors. 2014. “Circlize Implements and Enhances Circular Visualization in R.” Bioinformatics 30 (19): 2811–12. https://doi.org/10.1093/bioinformatics/btu393.

19. Zhang, Ren-Gang, Guang-Yuan Li, Xiao-Ling Wang, Jacques Dainat, Zhao-Xuan Wang, Shujun Ou, and Yongpeng Ma. 2022. “TEsorter: An Accurate and Fast Method to Classify LTR-Retrotransposons in Plant Genomes.” Horticulture Research 9 (January):uhac017. https://doi.org/10.1093/hr/uhac017.

20. Kapusta, Aurelie. (2014) 2024. “4ureliek/Parsing-RepeatMasker-Outputs.” Perl. https://github.com/4ureliek/Parsing-RepeatMasker-Outputs.

21. Madeira, Fábio, Nandana Madhusoodanan, Joonheung Lee, Alberto Eusebi, Ania Niewielska, Adrian R N Tivey, Rodrigo Lopez, and Sarah Butcher. 2024. “The EMBL-EBI Job Dispatcher Sequence Analysis Tools Framework in 2024.” Nucleic Acids Research 52 (W1): W521–25. https://doi.org/10.1093/nar/gkae241.

22. Price, Morgan N., Paramvir S. Dehal, and Adam P. Arkin. 2010. “FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments.” PLoS ONE 5 (3): e9490. https://doi.org/10.1371/journal.pone.0009490.

23. Letunic, Ivica, and Peer Bork. 2024. “Interactive Tree of Life (iTOL) v6: Recent Updates to the Phylogenetic Tree Display and Annotation Tool.” Nucleic Acids Research 52 (W1): W78–82. https://doi.org/10.1093/nar/gkae268.

24. Cantarel, Brandi L., Ian Korf, Sofia M. C. Robb, Genis Parra, Eric Ross, Barry Moore, Carson Holt, Alejandro Sánchez Alvarado, and Mark Yandell. 2008. “MAKER: An Easy-to-Use Annotation Pipeline Designed for Emerging Model Organism Genomes.” Genome Research 18 (1): 188–96. https://doi.org/10.1101/gr.6743907.

25. Stanke, Mario, Mark Diekhans, Robert Baertsch, and David Haussler. 2008. “Using Native and Syntenically Mapped cDNA Alignments to Improve de Novo Gene Finding.” Bioinformatics 24 (5): 637–44. https://doi.org/10.1093/bioinformatics/btn013.

26. Jones, Philip, David Binns, Hsin-Yu Chang, Matthew Fraser, Weizhong Li, Craig McAnulla, Hamish McWilliam, et al. 2014. “InterProScan 5: Genome-Scale Protein Function Classification.” Bioinformatics 30 (9): 1236–40. https://doi.org/10.1093/bioinformatics/btu031.

27. Nevers, Yannis, Alex Warwick Vesztrocy, Victor Rossier, Clément-Marie Train, Adrian Altenhoff, Christophe Dessimoz, and Natasha M. Glover. 2025. “Quality Assessment of Gene Repertoire Annotations with OMArk.” Nature Biotechnology 43 (1): 124–33. https://doi.org/10.1038/s41587-024-02147-w.

28. Lovell, John T, Avinash Sreedasyam, M Eric Schranz, Melissa Wilson, Joseph W Carlson, Alex Harkess, David Emms, David M Goodstein, and Jeremy Schmutz. 2022. “GENESPACE Tracks Regions of Interest and Gene Copy Number Variation across Multiple Genomes.” Edited by Detlef Weigel. eLife 11 (September):e78526. https://doi.org/10.7554/eLife.78526.

29. Emms, David M., and Steven Kelly. 2019. “OrthoFinder: Phylogenetic Orthology Inference for Comparative Genomics.” Genome Biology 20 (1): 238. https://doi.org/10.1186/s13059-019-1832-y.

30. Alleman, Mary, and Michael Freeling. 1986. “THE Mu TRANSPOSABLE ELEMENTS OF MAIZE: EVIDENCE FOR TRANSPOSITION AND COPY NUMBER REGULATION DURING DEVELOPMENT.” Genetics 112 (1): 107–19. https://doi.org/10.1093/genetics/112.1.107.

31. Kagale, Sateesh, Chushin Koh, John Nixon, Venkatesh Bollina, Wayne E. Clarke, Reetu Tuteja, Charles Spillane, et al. 2014. “The Emerging Biofuel Crop Camelina Sativa Retains a Highly Undifferentiated Hexaploid Genome Structure.” Nature Communications 5 (1): 3706. https://doi.org/10.1038/ncomms4706.

32. Stritt, Christoph, Michele Wyler, Elena L. Gimmi, Martin Pippel, and Anne C. Roulin. 2020. “Diversity, Dynamics and Effects of Long Terminal Repeat Retrotransposons in the Model Grass Brachypodium Distachyon.” New Phytologist 227 (6): 1736–48. https://doi.org/10.1111/nph.16308.

33. Cresse, A D, S H Hulbert, W E Brown, J R Lucas, and J L Bennetzen. 1995. “Mu1-Related Transposable Elements of Maize Preferentially Insert into Low Copy Number DNA.” Genetics 140 (1): 315–24. https://doi.org/10.1093/genetics/140.1.315.

34. Dietrich, Charles R, Feng Cui, Mark L Packila, Jin Li, Daniel A Ashlock, Basil J Nikolau, and Patrick S Schnable. 2002. “Maize Mu Transposons Are Targeted to the 5′ Untranslated Region of the Gl8 Gene and Sequences Flanking Mu Target-Site Duplications Exhibit Nonrandom Nucleotide Composition Throughout the Genome.” Genetics 160 (2): 697–716. https://doi.org/10.1093/genetics/160.2.697.




# Appendix

![](rna_fastqc.png)
Supplemental Figure 1

![](flye_mummerplot_output.png)  
Supplemental Figure 2

![Supplemental figure 01: ](01_full-length-LTR-RT-clades.png)    
Supplemental Figure 3

