Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache styling #538

Merged
merged 69 commits into from
Dec 15, 2019
Merged

Cache styling #538

merged 69 commits into from
Dec 15, 2019

Conversation

lorenzwalthert
Copy link
Collaborator

@lorenzwalthert lorenzwalthert commented Aug 15, 2019

Closes #320.

Goal

The goal is to make styler remember what it styled so the pre-commit hook in https://github.com/lorenzwalthert/pre-commit-hooks is much faster and functions like styler::style_pkg() are much faster if run on already styled files.

Requirements

The requirements are:

  • must work on all plattforms and must not be error-prone.

  • the user should be able to manage it (know its size and location, deactivate,
    clear, delete)

  • should be as close to the functionality "once seen, by style_text(), Addin or
    style_file() remembered forever".

  • The cache should be default be enabled, but the required packages should be in
    Suggests to keep the core installation lightweight. If R.cache is not
    installed, we should issue a warning message, deactivate the caching feature
    for the current R session and ask the user to install the dependencies and or
    permanently disable the feature in their .Rprofile with
    usethis::edit_r_profile().

  • For this reason, the styling should also work when R.cache is not installed,
    which requires every R.cache call to be wrapped in a conditional.

  • The advanced user should be able to understand how R.cache was used to
    implement the caching.

Conceputal

In this PR we introduce caching to styler. We follow the approach outlined in
#320 (comment), which basically is:

  • check if text to style is in cache.

  • if not, style it and add styled text to cache.

The approach has the following advantages:

  • Very simple to implement.

  • API agnostic. Works for files, text, Addin because it operates on the text
    level, which is quite low level.

  • Because approach is path and modification time independent, it can cache the
    same content in different locations, including renaming, copies in multiple
    places as well as moving files. Can cache multiple versions of a file, e.g. on
    different branches, when going back and forth in the git history.

  • Cache remains very small because no actual code is cached.

The cache must be styler version specific because if not, updated styling rules
won't be applied.

Implementation

  • We use R.cache to power the caching.

  • We use R options to manage it, with additional functional wrappers to modify
    the options.

  • There must be one cache per styler version and for testing purposes, the user
    must be able to specify a cache maunally (mainly to be able to delete the
    cache of the tests without deleting the cache he uses as a user).

  • To not convolute the R.cache caching directory, we use a dir under the
    caching root that corresponds to /styler/cache_name, where cache_name is
    the use specified cache name, defaulting to the installed version number from
    DESCRIPTION.

  • We modify .travis.yaml to also test behavior if R.cache is not installed.


Todo:

  • Should R.cache be in suggest and certainly fail in a vanilla installation of styler when caching is not explicitly disabled and R.cache is not installed? No, just warning as described above.
  • Think about using the cache mechanism in testing (CI and local) as well as development (local).
  • What is the default if the package is not loaded. Then, the option is not set.
  • Central place to document the options. Not needed since we use wrapper functions to manipulate.
  • What happens if R.cache is not installed?
  • Which cache functions should be exported? Probably the ones we have now.
  • Tests for cache accessor functions.
  • Test for caching in conjunction with options.
  • Tests for caching regarding EOL blank line.
  • Document this PR and design choices in more detail

@lorenzwalthert lorenzwalthert changed the title Support caching of styling Cache styling Aug 15, 2019
@lorenzwalthert lorenzwalthert marked this pull request as ready for review August 22, 2019 17:26
@codecov-io
Copy link

codecov-io commented Aug 29, 2019

Codecov Report

Merging #538 into master will decrease coverage by 0.37%.
The diff coverage is 82%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #538      +/-   ##
==========================================
- Coverage   90.83%   90.46%   -0.38%     
==========================================
  Files          43       45       +2     
  Lines        1801     1898      +97     
==========================================
+ Hits         1636     1717      +81     
- Misses        165      181      +16
Impacted Files Coverage Δ
R/io.R 84.21% <ø> (ø) ⬆️
R/ui-styling.R 100% <ø> (ø)
R/zzz.R 0% <0%> (ø) ⬆️
R/addins.R 0% <0%> (ø) ⬆️
R/transform-files.R 100% <100%> (ø) ⬆️
R/utils-cache.R 100% <100%> (ø)
R/ui-caching.R 100% <100%> (ø)
R/communicate.R 42.3% <12.5%> (-47.7%) ⬇️
R/rules-line-break.R 100% <0%> (ø) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d8470b...a00f166. Read the comment docs.

@lorenzwalthert
Copy link
Collaborator Author

lorenzwalthert commented Sep 10, 2019

With the latest two commits, we use my fork of R.cache to store empty files for caching, as suggested in HenrikBengtsson/R.cache#37, beacuse otherwise, each cached code expression needs 4KB on disk.

@krlmlr
Copy link
Member

krlmlr commented Sep 22, 2019

Is this creating one file per expression, or one file per file processed?

@lorenzwalthert
Copy link
Collaborator Author

One per file. I thought we could also do one per expression, it would give an additional speed boost in case you have big expressions and modify just one. The implementation would also be more complex though. I think as of now, it's pretty slim (excluding all the cache management tools that most people will never touch). However, there is (obviously) a trade-off with file size as long as caching costs us one block, e.g. 4KB memory on macOS.

@krlmlr
Copy link
Member

krlmlr commented Sep 23, 2019

Can we do one cache object per R project?

@lorenzwalthert
Copy link
Collaborator Author

Do you mean RStudio Projects associated with a working directory?

@krlmlr
Copy link
Member

krlmlr commented Sep 23, 2019

Project = package, analysis project, RStudio project -- in the sense of {here} or usethis::proj_get().

@lorenzwalthert
Copy link
Collaborator Author

I am not sure I understand what you mean. I don't think we should convolute the project directory with a styler cache to make the caching directory specific. What do you think would be the advantages of this approach? Advantages of a central cache:

  • Can be removed very easily, not scattered in many places.
  • Not yet another file to gitignore, .Rbuildignore or whatever in every project repo.
  • The same code in different places can be cached easily.

Also, I think it would not help solving the problem with the block size. Also, if possible, I'd like to use R.cache because:

  • CRAN policy is comparatively string on writing to anywhere on the file system.
  • Offload implementation of cache to another package keeps complexity low, e.g. initialization, error handling.

Copy link
Member

@krlmlr krlmlr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea to make this package easier to use in practice!

DESCRIPTION Outdated
@@ -27,7 +27,8 @@ Imports:
tibble (>= 1.4.2),
tools,
withr (>= 1.0.0),
xfun (>= 0.1)
xfun (>= 0.1),
R.cache (>= 0.13.0.9000)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to add {R.cache} to "Suggests"? Leaving it in "Imports" seems easier.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see it was only added in the very last commit cb2dc17363dfaa39f4025a1895ffed86556e5ceb when I added my own fork as a remote dependency. Probably caused by unintentional usethis::use_dep(). Do you think it's unnecessary to keep it in Suggests? It needs some extra handling in other places, I agree.

setdiff(miniCRAN::pkgDep("R.cache"), miniCRAN::pkgDep("styler"))
#> [1] "R.cache"     "R.methodsS3" "R.oo"        "R.utils"

Created on 2019-09-23 by the reprex package (v0.3.0)

NEWS.md Outdated Show resolved Hide resolved
R/ui-caching.R Outdated Show resolved Hide resolved
)
should_use_cache <- cache_is_activated()
use_cache <- is_cached && should_use_cache
if (!use_cache) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an early return like if (use_cache) return(text) ?

R/utils-cache.R Outdated Show resolved Hide resolved
R/utils-cache.R Outdated Show resolved Hide resolved
R.cache::findCache(key = hash_standardize(text), dir = cache_dir)
)
should_use_cache <- cache_is_activated()
use_cache <- is_cached && should_use_cache
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
use_cache <- is_cached && should_use_cache
use_cache <- should_use_cache && is_cached(text)

with a suitable is_cached() function avoids calling hash_standardize() if not needed.

@lorenzwalthert
Copy link
Collaborator Author

Thanks for reviewing @krlmlr. Feels like the good old times are back 🎉 .

@krlmlr
Copy link
Member

krlmlr commented Oct 1, 2019

Do we need to include the style in the hash key, so that we recompute when the style changes?

@lorenzwalthert
Copy link
Collaborator Author

lorenzwalthert commented Oct 13, 2019

I guess we should. The question is how. Because currently, the version of styler determines which cache to use by default. Also, if people have their own style guide they supply, we should probably also hash that function and include it in the hash, leaving the R.cache directory structure as is, i.e.

~.R.cache/styler/:version/:hash(transformer, text)

In addition, if we NULL mapped to the empty file in HenrikBengtsson/R.cache#37, this reduces space consumption drastiaclly so we can also cache on the top-level expression level, further improving speed. By top-level, I mean the number of expressions that have 0 as a parent. Example with two expressions:

x <- 3
gg <- function() {
  if (TRUE) {
    f(x, na = 3)
  }
}

@krlmlr
Copy link
Member

krlmlr commented Dec 1, 2019

I tweaked a bit, works for me with {dm}.

  • How can we communicate to the caller that a cached values is used? I think for this we need to get rid of transform_utf8() and intercept at an earlier stage.
    1. Read file
    2. Check if cached
    3. If not, style
    4. If changed, cache and write
  • Can we map empty files to NULL in {R.cache}?

@lorenzwalthert
Copy link
Collaborator Author

Can we map empty files to NULL in {R.cache}?

Tracked in HenrikBengtsson/R.cache#37.

How can we communicate to the caller that a cached values is used?

I am not sure this is necessary because it seems quite cumbersome to implement. You mean in the console output or also in the data frame returned invisibly by style_pkg() and friends? For consistency, I think we'd need to do that in both or none. The reason we check if text is cached in make_transformer() is because all style API functions (like the Addins, style_text(), etc.) all use this function.

Also, if you want to read in only once (i.e. transform_utf8_one()) and use a functional approach (i.e. not storing this information in an environment), it would require adapting the code in many places and pass back the information about whether or not the cache was used. it would have to be propagated up the chain in this order I think:

transform_file() <- transform_code() <- transform_utf8() <- transform_utf8_one() <- make_transformer()

Creating file manually instead of mapping NULL to empty file
@lorenzwalthert
Copy link
Collaborator Author

lorenzwalthert commented Dec 5, 2019

I think we should not communicate to the user if the cache was used, at least not in this PR. It involves refactoring quite a few things and the PR is already getting too large. Please open another issue if you think it's important. If something is wrong, the user can always turn off the cache completely.

@lorenzwalthert
Copy link
Collaborator Author

lorenzwalthert commented Dec 5, 2019

Seems that with 36606a2, we are able to consume 0 KB instead of block size per cached value, so that one is resolved.

@krlmlr
Copy link
Member

krlmlr commented Dec 5, 2019

Plus the inode, but these are cheap.

Let's see how it works without notifying the user.

@lorenzwalthert
Copy link
Collaborator Author

lorenzwalthert commented Dec 15, 2019

There are two remaining issues with this PR:

We will merge this and open new issues for these tasks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Memorize which files were not changed since last styling
3 participants