Juan C Rodriguez edited this page Feb 7, 2019 · 3 revisions

R Code Optimizer

Background

A brief web search is enough to see that R is slow compared to other popular programming languages. “The R interpreter is not fast and execution of large amounts of R code can be unacceptably slow” [1]. The main reason is that “R was purposely designed to make data analysis and statistics easier for you to do. It was not designed to make life easier for your computer” [2]. Currently the most widely used R interpreter is GNU-R. Although several alternative R interpreters attempt to improve execution speed [3–8], “switching interpreters is something to consider carefully” [9].

“Beyond performance limitations due to design and implementation, it has to be said that a lot of R code is slow simply because it’s poorly written. Few R users have any formal training in programming or software development … This means that it’s relatively easy to make most R code much faster” [2].

“It is important to pursue efficiency issues, and in particular, speed” [10]. “A good deal of work is going into making R more efficient. Much of this work consists of reimplementing interpreted R code” [1].

The main goal of this project is to provide a GNU-R package with functions that let users automatically apply different strategies to optimize their R code. These functions will take R code as input and return R code as output, so that users can compare the two and understand which modifications make the code faster.
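As a minimal sketch of what "R code in, R code out" could look like, the example below folds arithmetic on numeric constants. The names `fold_constants` and `optimize_text` are hypothetical helpers written for this illustration; they are not part of any existing package, and constant folding is only one of many possible strategies.

```r
# Hypothetical helper: replace arithmetic calls whose arguments are all
# numeric constants by their value, e.g. (60 * 60 * 24) -> 86400.
fold_constants <- function(expr) {
  if (!is.call(expr)) {
    return(expr)
  }
  # Recurse only into child calls (bottom-up); other elements are kept as-is.
  for (i in seq_along(expr)) {
    if (is.call(expr[[i]])) {
      expr[[i]] <- fold_constants(expr[[i]])
    }
  }
  arith_ops <- c("+", "-", "*", "/", "^", "(")
  fn <- expr[[1]]
  if (is.symbol(fn) && as.character(fn) %in% arith_ops) {
    args <- as.list(expr)[-1]
    all_const <- all(vapply(
      args,
      function(a) is.numeric(a) && length(a) == 1,
      logical(1)
    ))
    if (all_const) {
      # Evaluate in the base environment so user code cannot interfere.
      return(eval(expr, baseenv()))
    }
  }
  expr
}

# Hypothetical wrapper: R code as text in, R code as text out.
optimize_text <- function(code) {
  exprs <- parse(text = code, keep.source = FALSE)
  folded <- lapply(exprs, fold_constants)
  lines <- vapply(
    folded,
    function(e) paste(deparse(e), collapse = "\n"),
    character(1)
  )
  paste(lines, collapse = "\n")
}

original <- "secs <- days * (60 * 60 * 24)"
optimized <- optimize_text(original)
cat(original, "\n", optimized, "\n", sep = "")
```

Because the output is plain R code, the user can read the two versions side by side and see exactly what changed. Note that deparsing discards comments and original formatting, which a real implementation would probably want to preserve.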

Related work

To the best of our knowledge, the only existing tool that automatically optimizes R code is the compiler package. Its impact is reflected in the fact that it has shipped with GNU-R since version 2.13.0. Although the compiler package manages, in certain cases, to improve the execution time of R code, its main objective is to compile expressions into byte code. Because optimization is not its main goal, the package leaves aside several optimization strategies commonly known by the community [11], as we show in the optimization strategies section. In addition, because the result of applying the compiler functions is byte code, users cannot easily see which modifications make their code more efficient.
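The following sketch illustrates the point about byte code using `cmpfun` from the compiler package. The function `sum_of_squares` is an example made up for this illustration.

```r
library(compiler)

# Illustrative function (not from the original text).
sum_of_squares <- function(n) {
  total <- 0
  for (i in seq_len(n)) {
    total <- total + i^2
  }
  total
}

# cmpfun() returns a byte-compiled version of the same closure.
sum_of_squares_cmp <- cmpfun(sum_of_squares)

# Both versions compute the same result.
stopifnot(identical(sum_of_squares(100), sum_of_squares_cmp(100)))

# The compiled form is byte code rather than rewritten R source;
# disassemble() shows it, but it is not meant to be read as R code.
print(disassemble(sum_of_squares_cmp))
```

Printing the compiled function still shows the original source plus a `<bytecode: ...>` tag, so whatever the compiler changed stays hidden from the user. Also note that since R 3.4.0 the just-in-time compiler byte-compiles closures by default, so calling `cmpfun` explicitly may make little difference on recent versions.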

Other types of related work include blog posts, web pages, and books that provide tips and guides for optimizing R code [2,9,12–16]. Although these texts describe intuitive and easy-to-apply strategies, none of them provides an automatic way of optimizing the code.

Automatic code optimization strategies were first implemented for compiled languages, the best-known example being the GNU Compiler Collection (gcc; formerly called GNU C Compiler). This compiler was initially developed more than 30 years ago and implements more than 100 different code optimization techniques. Although R is interpreted, and therefore certain optimization techniques for compiled code cannot be implemented, many of these ideas can be applied to interpreted languages. Precedents of interpreted languages with code optimization tools include PMD for Java, and Vulture and PyCC for Python.

Optimization strategies

To evaluate the feasibility of this project, we evaluated a portion of the optimization strategies described in the references cited in this document. For each strategy, we implemented a function f containing a non-optimized code chunk, and a function f_opt containing the code that would result from applying the optimization strategy. Additionally, both functions were compiled using the cmpfun function of the compiler package. Execution times were obtained with the microbenchmark R package by evaluating the resulting four functions with the same (or as similar as possible) inputs.
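The evaluation design can be sketched as follows. The strategy shown here (moving a loop-invariant computation out of the loop) and the bodies of `f` and `f_opt` are chosen only for illustration; they are not necessarily the chunks used in the original evaluation. The benchmark step runs only if the microbenchmark package is installed.

```r
library(compiler)

# Non-optimized chunk: sqrt(k) is recomputed on every iteration.
f <- function(x, k) {
  out <- numeric(length(x))
  for (i in seq_along(x)) {
    out[i] <- x[i] * sqrt(k)
  }
  out
}

# Same chunk after hoisting the loop-invariant computation.
f_opt <- function(x, k) {
  s <- sqrt(k)
  out <- numeric(length(x))
  for (i in seq_along(x)) {
    out[i] <- x[i] * s
  }
  out
}

# Byte-compiled versions of both functions.
f_cmp <- cmpfun(f)
f_opt_cmp <- cmpfun(f_opt)

x <- runif(1e4)

# The optimization must not change the result.
stopifnot(isTRUE(all.equal(f(x, 2), f_opt(x, 2))))

# Time the four functions on the same input.
# On R >= 3.4.0 the JIT compiles f and f_opt after a few calls;
# compiler::enableJIT(0) turns it off if the plain versions are wanted.
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    f(x, 2), f_opt(x, 2), f_cmp(x, 2), f_opt_cmp(x, 2),
    times = 50
  ))
}
```

Comparing `f` against `f_opt` shows the gain from the strategy itself, while comparing `f_cmp` against `f_opt_cmp` shows whether the gain survives byte compilation.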

Common optimization strategies:

R-specific optimization strategies:

Details of your coding project

The tasks to be carried out during the present summer of code project will be:

  • Study several code optimization strategies. Evaluate the complexity of implementing them in R, and their efficiency gains (mainly speed).
  • Rank the optimization strategies based on efficiency gain against complexity.
  • Analyze methods for R code parsing, e.g., the one used by the compiler package. Select an appropriate parsing method to use.
  • Analyze alternatives for modeling R code (functions, chunks), e.g., the one used by the compiler package, execution trees, etc. Select an appropriate alternative to use.
  • Create a GNU-R package (tests, docs, etc.) that contains the top-ranked optimization strategies. The package will be designed to be extensible, so that it serves as the basis for collaboratively adding new optimization strategies.
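For the parsing and modeling tasks listed above, base R already offers two views of the same code that could serve as starting points. The sketch below only illustrates these candidates and assumes nothing about which one the project will select: `utils::getParseData` gives a flat token table, while indexing the parsed expression gives a nested syntax tree.

```r
code <- "total <- 0\nfor (i in 1:10) total <- total + i^2"

# keep.source = TRUE is needed so that parse data is recorded.
parsed <- parse(text = code, keep.source = TRUE)

# View 1: a flat table of tokens with positions and parent links.
tokens <- utils::getParseData(parsed)
terminals <- tokens[tokens$terminal, c("line1", "col1", "token", "text")]
print(terminals)

# View 2: a nested syntax tree made of calls, symbols, and constants.
loop <- parsed[[2]]
print(class(loop))    # the second top-level expression is a for loop
print(as.list(loop))  # loop keyword, index variable, sequence, body
```

The token table keeps source positions, which helps when the rewritten code should preserve the original layout. The syntax tree is easier to transform recursively but loses formatting once it is deparsed.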

Expected impact

  • Since the output of the package functions will be R code, it is expected to be used to teach/learn efficient coding practices.
  • The most ambitious impact of this project would be to replicate the success of the compiler package. Moreover, a pipeline of R Code Optimizer %>% compiler could combine the gains of both tools. While this expectation is ambitious, it becomes attainable if the correctness of each implemented optimization strategy is verified.

Mentors

  1. Dr. Nicolás Wolovick - an expert in high-performance computing, optimizing compilers, low-level programming, etc. He has taught the “operating systems” course since 2002 and “parallel computing” since 2012.
  2. Dr. Yihui Xie - well, every R user knows him, he has authored knitr, bookdown, DT, formatR, highr, servr, testit, and many other high impact R packages. He has been a GSOC mentor three times (2012, 2014 and 2017).

Student

References

[1] R. Ihaka, R: Lessons learned, directions for the future, in: Joint Statistical Meetings, The Authors, 2010. https://www.stat.auckland.ac.nz/~ihaka/downloads/JSM-2010.pdf.

[2] H. Wickham, Advanced R, Chapman & Hall/CRC, 2014. http://adv-r.had.co.nz/.

[3] Microsoft R Open, 2018. https://mran.microsoft.com/open.

[4] pqR - a pretty quick version of R, 2018. http://www.pqr-project.org/.

[5] Renjin, 2018. http://www.renjin.org/.

[7] Riposte, a fast interpreter and JIT for R, 2015. https://github.com/jtalbot/riposte/tree/library.

[9] C. Gillespie, R. Lovelace, Efficient R programming, O’Reilly Media, Incorporated, 2016. https://csgillespie.github.io/efficientR/.

[10] R. Ihaka, R: Past and future history, Computing Science and Statistics. 392–396 (1998). https://www.stat.auckland.ac.nz/~ihaka/downloads/Interface98.pdf.

[11] K. Cooper, L. Torczon, Engineering a compiler, Elsevier, 2011. https://www.elsevier.com/books/engineering-a-compiler/cooper/978-0-12-088478-0.

[12] P. Burns, The R inferno, 2011. https://www.burns-stat.com/pages/Tutor/R_inferno.pdf.

[14] Strategies to speedup R code, 2016. https://datascienceplus.com/strategies-to-speedup-r-code/.

[15] FasteR! HigheR! StrongeR! - a guide to speeding up R code for busy people, 2013. http://www.noamross.net/blog/2013/4/25/faster-talk.html.

[16] Making R code faster: A case study, 2017. https://robinsones.github.io/Making-R-Code-Faster-A-Case-Study/.
