write_* improvements #432

Merged
merged 5 commits into from Jun 14, 2016

@jimhester
Member

jimhester commented Jun 10, 2016

As we know, the previous implementation was very slow because of the character conversion, roughly on par with write.csv().

set.seed(1)
df <- as.data.frame(matrix(runif(256*2^15), nrow = 256))
system.time(write.csv(df, "/tmp/df1.csv"))
#>    user  system elapsed 
#>  26.074   7.622  33.776
system.time(readr::write_csv(df, "/tmp/df4.csv"))
#>    user  system elapsed 
#>  34.012   0.537  34.589

9c28645 just does a little cleanup and turns off converting numeric to character first. This produces valid, round-trippable results and is quite a bit faster than converting to character; however, all numeric values are printed with the maximum amount of precision.

system.time(readr::write_csv(df, "/tmp/df4.csv"))
#>    user  system elapsed 
#>   8.308   0.240   8.588

a67c1d5 uses the grisu3 implementation found at https://github.com/juj/MathGeoLib/blob/master/src/Math/grisu3.c. It is under the Apache 2 license, so it is safe for us to use. This actually gives us quite a bit better performance than the naive approach.

system.time(readr::write_csv(df, "/tmp/df4.csv"))
#>    user  system elapsed 
#>   3.047   0.203   3.265

However, data.table::fwrite() is still faster than any of these methods.

system.time(data.table::fwrite(df, "/tmp/df3.csv"))
#> Your platform/environment has not detected OpenMP support. fwrite() will still work, but slower in single threaded mode.
#>    user  system elapsed 
#>   1.578   0.063   1.651

I did some profile sampling with R -q -d "valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes" and the vast majority of our computational time is spent on the formatting, so I am not sure how much more room there is to improve.

Fixes #387

Small cleanup for writing
Don't convert numeric to character, use the max amount of precision
necessary to roundtrip doubles.
@jimhester

Member

jimhester commented Jun 10, 2016

Sample output from grisu3 is identical to the current implementation and to write.csv() on mtcars

readr::write_tsv(head(mtcars), "/dev/stdout")
#> mpg  cyl disp    hp  drat    wt  qsec    vs  am  gear    carb
#> 21   6   160 110 3.9 2.62    16.46   0   1   4   4
#> 21   6   160 110 3.9 2.875   17.02   0   1   4   4
#> 22.8 4   108 93  3.85    2.32    18.61   1   1   4   1
#> 21.4 6   258 110 3.08    3.215   19.44   1   0   3   1
#> 18.7 8   360 175 3.15    3.44    17.02   0   0   3   2
#> 18.1 6   225 105 2.76    3.46    20.22   1   0   3   1

vs setting the precision manually

readr::write_tsv(head(mtcars), "/dev/stdout")
#> mpg  cyl disp    hp  drat    wt  qsec    vs  am  gear    carb
#> 21   6   160 110 3.8999999999999999  2.6200000000000001  16.460000000000001  0   1   4   4
#> 21   6   160 110 3.8999999999999999  2.875   17.02   0   1   4   4
#> 22.800000000000001   4   108 93  3.8500000000000001  2.3199999999999998  18.609999999999999  1   1   4   1
#> 21.399999999999999   6   258 110 3.0800000000000001  3.2149999999999999  19.440000000000001  1   0   3   1
#> 18.699999999999999   8   360 175 3.1499999999999999  3.4399999999999999  17.02   0   0   3   2
#> 18.100000000000001   6   225 105 2.7599999999999998  3.46    20.219999999999999  1   0   3   1
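
To make the difference between the two outputs concrete, here is a minimal base R sketch. It assumes that "setting the precision manually" corresponds to printing 17 significant digits (the output above is consistent with that, but the format string is my assumption), and it uses sprintf() only as a stand-in, not readr's actual C++ code path.

```r
# Three values from the first rows of mtcars (drat, wt, wt).
x <- c(3.9, 2.62, 2.875)

# Assumption: manual precision means 17 significant digits. That is always
# enough to round-trip a double, but it exposes binary representation error.
fixed17 <- sprintf("%.17g", x)
print(fixed17)
#> [1] "3.8999999999999999" "2.6200000000000001" "2.875"

# A shorter rendering that, for these particular values, parses back to the
# same doubles. A shortest-representation algorithm aims for output like this.
short15 <- sprintf("%.15g", x)
print(short15)
#> [1] "3.9"   "2.62"  "2.875"

# Both renderings parse back to exactly the original values.
stopifnot(identical(as.numeric(fixed17), x))
stopifnot(identical(as.numeric(short15), x))
```

Fifteen digits is not safe in general, which is why the shortest output has to be chosen per value rather than by lowering a fixed precision.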
@jimhester

Member

jimhester commented Jun 10, 2016

One possible issue is mentioned in the grisu3 paper, which states the following (emphasis mine)

With just two extra bits it is difficult to do better than in our
example, but often there exists an integer type with more bits. For
IEEE 754 floating-point numbers, which have a significand size of
53, one can use 64 bit integers, providing 11 extra bits. We have
developed an algorithm Grisu2 that uses these extra bits to shorten
the output. However, even 11 extra bits may not be sufficient in
every case. There are still boundary conditions under which Grisu2
will not be able to produce the shortest representation. Since this
property is often a requirement (see [Steele Jr. and White(2004)]
for some examples) we propose a variant, Grisu3, that detects (and
aborts) when its output may not be the shortest. As a consequence
Grisu3 is incomplete and will fail for some percentage of its input.
Given 11 extra bits roughly 99.5% are processed correctly and
are thus guaranteed to be optimal (with respect to shortness and
rounding). The remaining 0.5% are rejected and need to be printed
by another printing algorithm (like Dragon4).

I need to look at this implementation and see what happens if the input is rejected, but it is reassuring that it did not fail with the test inputs, which are random, although only between 0 and 1.

Edit
Answered by https://github.com/hadley/readr/pull/432/files#diff-d249c6cf5b0b488ebfa485f331caed3bR331, which uses sprintf in this case.
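
For illustration, the accept-or-fall-back control flow can be sketched in base R as below. This is not Grisu3 and not the code in this PR: format_double() is a hypothetical helper, the "fast path" is just a 15-digit sprintf() checked by parsing it back, and only the shape (try the short form, detect failure, fall back to a safe sprintf) mirrors what is described above.

```r
# Hypothetical helper, not part of readr. Handles a single finite double.
format_double <- function(x) {
  # Stand-in for the fast path: try a short representation first.
  short <- sprintf("%.15g", x)

  # Accept it only if it parses back to exactly the same double.
  if (identical(as.numeric(short), x)) {
    return(short)
  }

  # "Rejected": fall back to 17 significant digits, which is always
  # enough to round-trip a double.
  sprintf("%.17g", x)
}

format_double(3.9)        # accepted by the fast path
#> [1] "3.9"
format_double(0.1 + 0.2)  # rejected, falls back to 17 digits
#> [1] "0.30000000000000004"
```

Either branch yields a string that reads back as the original value, so a rejection costs speed and output length but not correctness.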

@jimhester

Member

jimhester commented Jun 10, 2016

We may also want to incorporate the changes from https://github.com/dvidelabs/flatcc/blob/master/include/flatcc/portable/grisu3_print.h#L207-L228 (also under Apache 2.0) which prefer 'unscientific' notation at the same length and always append a 0 on decimals.

@@ -0,0 +1,361 @@
/* This file is part of an implementation of the "grisu3" double to string

@hadley

hadley Jun 10, 2016

Member

Can you include the license too? Might need to copy and paste from somewhere else

@hadley

Member

hadley commented Jun 10, 2016

Also need to update Authors@R

@hadley

Member

hadley commented Jun 10, 2016

You mean C-level formatting or R-level formatting?

But I'm happy with that performance - we don't need to be as fast as fwrite(), we just need not to be embarrassingly slow.

@jimhester

Member

jimhester commented Jun 13, 2016

C-level formatting is what I meant (after the above changes).

@hadley

Member

hadley commented Jun 13, 2016

Apart from the authorship/license stuff (and news bullet), LGTM. Feel free to merge when you've done those bits.

@jimhester jimhester added ready and removed in progress labels Jun 14, 2016

@jimhester jimhester changed the title from WIP: write_* improvements to write_* improvements Jun 14, 2016

@jimhester

Member

jimhester commented Jun 14, 2016

Added the license and authors to the DESCRIPTION. PTAL briefly to make sure it looks OK and then I can merge this.

@jimhester jimhester force-pushed the jimhester:master branch from 94c990e to 52cca9e Jun 14, 2016

@jimhester jimhester force-pushed the jimhester:master branch from 52cca9e to ca075fd Jun 14, 2016

@hadley

Member

hadley commented Jun 14, 2016

Looks good - I confirmed that the Apache license is compatible with GPL3.

@jimhester jimhester merged commit 58d5682 into tidyverse:master Jun 14, 2016

3 checks passed

codecov/patch 94.73% of diff hit (target 70.65%)
codecov/project 71.95% (+1.30%) compared to 424c90d
continuous-integration/travis-ci/pr The Travis CI build passed

@jimhester jimhester removed the ready label Jun 14, 2016
