
# How to Build a Pangenome {#fmf}

> "Programming interfaces are user interfaces"
>
> --- Mike Bostock, *Creator of D3.js*

<span class="newthought">The foundation for large-scale pangenome analysis</span> is described in this chapter through the introduction of the <span class="abr">FindMyFriends</span> framework. The algorithms that allow <span class="abr">FindMyFriends</span> to create high-quality pangenomes from thousands of genomes are described in the following article. As the focus of the article is on the novel algorithms provided by <span class="abr">FindMyFriends</span>, less attention is given to the infrastructure that the framework provides as the basis for all algorithms. The motivation for developing <span class="abr">FindMyFriends</span> into a full framework, rather than a tool that merely produces pangenomes, is that no coherent solution for working with pangenome data has existed until now. The approach employed by current tools is to provide a main command that takes care of reading in, comparing, and clustering genome sequences, and that writes the results to several text files. Acknowledging that researchers often want more than just a pangenome, many tools provide several secondary functions that understand the result files and can perform additional post-processing and reporting tasks, such as creating generic plots and converting the results to other formats. While this works well for the tasks that the tools support, it puts a considerable stumbling block in front of researchers who wish to investigate their results in other ways. Further, it makes it difficult for bioinformatics developers to extend the functionality provided by a tool, for example by adding support for new sequence formats or additional visualizations. This has resulted in several tools that essentially repackage existing algorithms and provide an additional layer of post-processing [@ContrerasMoreira:2013ip; @Chaudhari:2015bj].
Unfortunately, these tools have the same type of restrictions as described above: if their functionality does not satisfy the needs of the researcher, there is no easy access to the underlying data, due to the lack of a proper interface.

<span class="abr">FindMyFriends</span> is an attempt to solve the problems stated above. While scientific progress is sure to make the algorithms it currently provides obsolete at some point, it is my hope that their successors will be developed on top of the <span class="abr">FindMyFriends</span> framework. This would ensure that minimal strain is put on the users, making them more inclined to adopt new approaches as they are developed. In order to attract both developers and users to the platform, <span class="abr">FindMyFriends</span> offers an <span class="abr">API</span> that is both powerful and extensible. Developers can add support for new input file types as well as new back-end storage that users can utilize without changing any workflows. It is furthermore possible to develop new algorithms on top of the framework, with transparent access to all data, without worrying about how the data is stored. This means that it is easy to add a new clustering algorithm, post-processing analysis, or visualization tool without having to spend time on managing data flows. Users get powerful access to all parts of their data and can easily extend their analysis with other tools from <span class="abr">Bioconductor</span>, as pangenome data is readily available in standard data formats.
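
The separation between algorithms and data storage described above can be illustrated with a small, self-contained sketch. It is not the <span class="abr">FindMyFriends</span> <span class="abr">API</span>: the classes `GeneStore`, `MemoryStore`, and `FileStore`, the generic `geneSeqs()`, and the function `groupIdentical()` are all hypothetical names invented for this example, and the grouping step is a deliberately naive stand-in for a real clustering algorithm. The sketch only assumes that back-ends are expressed as S4 classes sharing a common generic.

```r
library(methods)

# Hypothetical virtual base class (not part of FindMyFriends):
# every storage back-end inherits from it.
setClass("GeneStore", representation("VIRTUAL"))

# The one accessor that algorithms are allowed to use.
setGeneric("geneSeqs", function(store) standardGeneric("geneSeqs"))

# Back-end 1: sequences kept in memory.
setClass("MemoryStore", contains = "GeneStore",
         representation(seqs = "character"))
setMethod("geneSeqs", "MemoryStore", function(store) store@seqs)

# Back-end 2: sequences kept in a text file, one sequence per line.
setClass("FileStore", contains = "GeneStore",
         representation(path = "character"))
setMethod("geneSeqs", "FileStore", function(store) readLines(store@path))

# A toy "algorithm" written against the generic only: it groups genes
# with identical sequences and returns the gene indices of each group.
groupIdentical <- function(store) {
  seqs <- geneSeqs(store)
  unname(split(seq_along(seqs), match(seqs, unique(seqs))))
}

# The same call works unchanged on both back-ends.
seqs <- c("ATGAAA", "ATGCCC", "ATGAAA")
mem <- new("MemoryStore", seqs = seqs)

tmp <- tempfile()
writeLines(seqs, tmp)
disk <- new("FileStore", path = tmp)

memGroups <- groupIdentical(mem)
diskGroups <- groupIdentical(disk)
```

Because `groupIdentical()` never touches the slots of a concrete class, adding a third back-end in this sketch would only require a new class and a `geneSeqs()` method; the algorithm and any user workflow built on it would stay as they are.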

The following article, intended for submission to [Nature Methods](http://www.nature.com/nmeth), describes in more detail the algorithm used by <span class="abr">FindMyFriends</span> to provide fast and accurate grouping of genes from thousands of genomes. It includes speed and accuracy benchmarking against a selection of other popular tools, as well as a case study based on 4,770 genomes from the full bacterial domain. Supplementary figures are available in appendix \ref{FMFsupfig}, and the bacterial pangenome as well as the raw sequences are available in appendix \ref{supdata}.

**Article 1**

[FindMyFriends: A Framework for Fast and Accurate Pangenome Analysis of Thousands of Diverse Genomes](articles/FMF_nature.pdf)

The benchmarking provided in the article clearly shows the qualities of the implemented algorithms in terms of speed and accuracy. Further, the bacterial pangenome case study is the largest published pangenome and by far the one with the highest taxon coverage. That this pangenome was generated in a matter of days, on a computer accessible to everyone, marks a shift in what is possible for researchers without access to high-performance computing hardware. With the exception of annotating the gene groups with Pfam domains, all the downstream analysis of the pangenome was done on an underpowered laptop, underscoring the efficiency of the data handling implementation. For users who simply want to create a pangenome for a few genomes and get the core size, <span class="abr">FindMyFriends</span> might seem like using a sledgehammer to crack a nut, but I expect users with such small requirements are few and far between. On the contrary, I believe that a powerful common framework for pangenome research is an absolute requirement for advancing pangenome analysis to the next level. It is too early to say whether <span class="abr">FindMyFriends</span> will be that framework, but the growth and popularity of both <span class="abr">R</span> and <span class="abr">Bioconductor</span> within biological data analysis put it in a very favorable spot.