# This is the End {#conclusion}

> "... if you swear that there's no truth and who cares//How come you say it like you're right?"
>
> --- Conor Oberst, *We Are Nowhere and It's Now*

<span class="newthought">This dissertation has had two main goals:</span> to provide a framework for creating and working with large, heterogeneous pangenomes, and to provide visualizations and analyses that are well suited for investigating such pangenomes. In chapter&nbsp;1, the <span class="abr">FindMyFriends</span> package for <span class="abr">R</span> was introduced. <span class="abr">FindMyFriends</span> has three main features: a fast, scalable, and accurate gene grouping algorithm; an <span class="abr">API</span> that allows for transparent and efficient access to all levels of raw, intermediary, and high-level data; and an extensible class system that allows users to plug <span class="abr">FindMyFriends</span> into their own sequence storage backend if needed. The quality and speed of the gene grouping algorithm were assessed by comparison to other current algorithms. <span class="abr">FindMyFriends</span> was found to be both faster and more accurate than the algorithms it was tested against. The scalability of the algorithm was shown by using it to calculate the pangenome of 4,770 genomes from all branches of the bacterial domain, which was completed in 99 hours on a high-end workstation. Chapter&nbsp;2 introduced <span class="abr">PanViz</span>, an interactive and easily sharable visualization of functionally annotated pangenomes. <span class="abr">PanViz</span> facilitates exploration of pangenome structures by providing natural transitions between genome subsets and visualization types. Furthermore, it provides a visual querying system that allows the user to quickly pinpoint the gene groups of interest. While <span class="abr">PanViz</span> is agnostic to the algorithm used for pangenome creation, it can operate seamlessly with the data structures provided by <span class="abr">FindMyFriends</span> through the provided <span class="abr">API</span>.
In the last article, presented in chapter&nbsp;3, I described a novel analysis and visualization technique, <span class="abr">Hierarchical Sets</span>, for large collections of sets. The implications for pangenome analysis were illustrated through the parallels between pangenome and set concepts, and the technique was applied to a genus-level pangenome based on 46 *Streptococcus* genomes. The published implementation of <span class="abr">Hierarchical Sets</span> can, like <span class="abr">PanViz</span>, work directly with the output generated by <span class="abr">FindMyFriends</span>, as well as with a range of common set data representations.

The scalability and usability of both the <span class="abr">FindMyFriends</span> framework and the <span class="abr">Hierarchical Sets</span> analysis were showcased in chapter&nbsp;4. It was shown how <span class="abr">Hierarchical Sets</span> provides a scaffold that is useful both for getting an overview of the full pangenome and for zooming in on different parts of it. The clustering and the outlying elements provided by <span class="abr">Hierarchical Sets</span> were efficient in detecting and explaining irregularities within the genomes of the pangenome. Furthermore, it was shown how these analyses can be used to investigate the relations between different groupings of genomes. The evolutionary relationship between *Escherichia* and *Shigella* was examined, and it was shown that while the two genera share a large amount of genetic material, they each form their own distinct lineage with separate core genes. By combining the results from the <span class="abr">Hierarchical Sets</span> analysis with the panchromosome structure available in <span class="abr">FindMyFriends</span>, it was possible to show that most of the outlying elements between *Escherichia* and *Shigella* were interconnected and to identify them as either transposons or prophages.

\vspace{1em}

While the presented tools and techniques are not exhaustive, they provide a solid backbone for the creation and investigation of large-scale pangenomes. Hopefully, they can play a role in taming the massive influx of sequence data and make pangenome analyses on a larger scale accessible to everyone.