readr 1.0.0

@hadley released this Aug 3, 2016

Column guessing

The process by which readr guesses the types of columns has received a substantial overhaul to make it easier to fix problems when the initial guesses aren't correct, and to make it easier to generate reproducible code. Column specifications are now printed by default when you read from a file:

challenge <- read_csv(readr_example("challenge.csv"))
#> Parsed with column specification:
#> cols(
#>   x = col_integer(),
#>   y = col_character()
#> )

And you can extract those values after the fact with spec():

spec(challenge)
#> cols(
#>   x = col_integer(),
#>   y = col_character()
#> )

This makes it easier to quickly identify parsing problems and fix them (#314). If the column specification is long, the new cols_condense() is used to condense the spec by identifying the most common type and setting it as the default. This is particularly useful when only a handful of columns have a different type (#466).

You can also generate an initial specification without parsing the file using spec_csv(), spec_tsv(), etc.

Once you have figured out the correct column types for a file, it's often useful to make the parsing strict. You can do this either by copying and pasting the printed output, or for very long specs, saving the spec to disk with write_rds(). In production scripts, combine this with stop_for_problems() (#465): if the input data changes form, you'll fail fast with an error.
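A minimal sketch of that workflow, assuming a small throwaway CSV written to a temporary file (the file contents and the column names x and y are made up for illustration):

```r
library(readr)

# Hypothetical input: a tiny CSV written to a temporary file.
path <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,a", "2,b"), path)

# Generate an initial specification from the file and inspect it.
guessed <- spec_csv(path)
print(guessed)

# Once the types are settled, state them explicitly so the parse is strict.
strict_spec <- cols(
  x = col_integer(),
  y = col_character()
)
df <- read_csv(path, col_types = strict_spec)

# Turn any parsing problem into an error instead of a warning.
stop_for_problems(df)
```

If the input later stops matching strict_spec, stop_for_problems() should raise an error at this point rather than letting bad values flow further into the script.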

You can now also adjust the number of rows that readr uses to guess the column types with guess_max:

challenge <- read_csv(readr_example("challenge.csv"), guess_max = 1500)
#> Parsed with column specification:
#> cols(
#>   x = col_double(),
#>   y = col_date(format = "")
#> )

You can now access the guessing algorithm from R. guess_parser() will tell you which parser readr will select for a character vector (#377). We've made a number of fixes to the guessing algorithm:

  • New example extdata/challenge.csv which is carefully created to cause
    problems with the default column type guessing heuristics.
  • Blank lines and lines with only comments are now skipped automatically
    without warning (#381, #321).
  • Single '-' or '.' are now parsed as characters, not numbers (#297).
  • Numbers followed by a single trailing character are parsed as character,
    not numbers (#316).
  • We now guess at times using the time_format specified in the locale().
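The guessing behaviour described above can be explored directly with guess_parser(). This is only a sketch with made-up inputs; the exact parser names returned for numeric input may differ between readr versions, so no outputs are shown:

```r
library(readr)

# guess_parser() returns the name of the parser readr would pick
# for a character vector.
guess_parser(c("TRUE", "FALSE"))
guess_parser("2010-10-10")

# Per the fixes listed above, a lone "-" or "." is treated as character
# rather than as a number.
guess_parser("-")
guess_parser(".")
```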

We have made a number of improvements to the reification of the col_types, col_names and the actual data:

  • If col_types is too long, it is subsetted correctly (#372, @jennybc).
  • If col_names is too short, the added names are numbered correctly
    (#374, @jennybc).
  • Missing column names are now given a default name (X2, X7 etc) (#318).
    Duplicated column names are now deduplicated. Both changes generate a warning;
    to suppress it supply an explicit col_names (setting skip = 1 if there's
    an existing ill-formed header).
  • The col_types argument accepts a named list as input (#401).

Column parsing

The date time parsers recognise three new format strings:

  • %I for 12 hour time format (#340).
  • %AD and %AT are "automatic" date and time parsers. They are both slightly
    less flexible than previous defaults. The automatic date parser requires a
    four digit year, and only accepts - and / as separators (#442). The
    flexible time parser now requires colons between hours and minutes and
    optional seconds (#424).

%y and %Y are now strict and require 2 or 4 characters respectively.
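A short sketch of the new format strings in use. The input strings are made up, and pairing %I with %p for the AM/PM marker is an assumption about typical usage rather than something stated in these notes:

```r
library(readr)

# %I reads a 12 hour clock; it is paired here with %p for AM/PM.
parse_time("1:30 PM", "%I:%M %p")

# %AD and %AT are the "automatic" date and time formats.
d <- parse_date("2016-08-03", "%AD")
t <- parse_time("10:20:30", "%AT")

d
t
```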

Date and time parsing functions received a number of small enhancements:

  • parse_time() returns hms objects rather than a custom time class (#409).
    It now correctly parses missing values (#398).
  • parse_date() returns a numeric vector (instead of an integer vector) (#357).
  • parse_date(), parse_time() and parse_datetime() gain an na
    argument to match all other parsers (#413).
  • If the format argument is omitted in parse_date() or parse_time(),
    the date and time formats specified in the locale will be used. These now
    default to %AD and %AT respectively.
  • You can now parse partial dates with parse_date() and
    parse_datetime(), e.g. parse_date("2001", "%Y") returns 2001-01-01.
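The partial-date and na behaviour from the list above might be exercised like this. The "unknown" sentinel is a hypothetical value chosen for illustration:

```r
library(readr)

# A partial date: only the year is supplied.
partial <- parse_date("2001", "%Y")
partial

# The na argument marks strings that should become missing values.
with_na <- parse_date(c("2001-01-01", "unknown"), na = "unknown")
with_na
```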

parse_number() is slightly more flexible - it now parses numbers up to the first ill-formed character. For example parse_number("-3-") and parse_number("...3...") now return -3 and 3 respectively. We also fixed a major bug where parsing negative numbers yielded positive values (#308).
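The two cases mentioned above can be tried directly; this is just the example from the text written out as code:

```r
library(readr)

# Parsing stops at the first ill-formed character after the number.
trailing <- parse_number("-3-")
dotted <- parse_number("...3...")

trailing
dotted
```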

parse_logical() now accepts 0, 1 as well as lowercase t, f, true, false.
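As a quick check of the newly accepted spellings, a sketch using one vector that mixes all of them:

```r
library(readr)

# 0/1 and lowercase t/f/true/false are all accepted as logical values.
flags <- parse_logical(c("0", "1", "t", "f", "true", "false"))
flags
```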

New readers and writers

  • read_file_raw() reads a complete file into a single raw vector (#451).
  • read_*() functions gain a quoted_na argument to control whether missing
    values within quotes are treated as missing values or as strings (#295).
  • write_excel_csv() can be used to write a csv file with a UTF-8 BOM at the
    start, which forces Excel to read it as UTF-8 encoded (#375).
  • write_lines() writes a character vector to a file (#302).
  • write_file() writes a single character or raw vector
    to a file (#474).
  • Experimental support for chunked reading and writing via the
    read_*_chunked() functions. The API is unstable and subject to change in
    the future (#427).
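A minimal round trip through some of the new readers and writers listed above, using a temporary file and made-up contents. The chunked readers are left out because their API is described as unstable:

```r
library(readr)

path <- tempfile(fileext = ".txt")

# write_lines() writes one element of the character vector per line.
write_lines(c("first", "second"), path)
read_lines(path)

# write_file() writes a single string to the file.
write_file("hello readr", path)

# read_file_raw() returns the whole file as a single raw vector.
bytes <- read_file_raw(path)
rawToChar(bytes)
```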

Minor features and bug fixes

  • Printing double values now uses an implementation of the grisu3 algorithm,
    which speeds up writing of large numeric data frames by ~10X (#432). '.0' is
    appended to whole number doubles, to ensure they will be read as doubles as
    well (#483).
  • readr imports tibble so that you get consistent tbl_df behaviour
    (#317, #385).
  • default_locale() now sets the default locale in readr.default_locale
    rather than regenerating it for each call (#416).
  • locale() now automatically sets decimal mark if you set the grouping
    mark. It throws an error if you accidentally set decimal and grouping marks
    to the same character (#450).
  • All read_*() can read into long vectors, substantially increasing the
    number of rows you can read (#309).
  • All read_*() functions return empty objects rather than signaling an error
    when run on an empty file (#356, #441).
  • read_delim() gains a trim_ws argument (#312, @noamross).
  • read_fwf() received a number of improvements:
    • read_fwf() can now reliably read only a partial set of columns
      (#322, #353, #469).
    • fwf_widths() accepts negative column widths for compatibility with the
      widths argument in read.fwf() (#380, @leeper).
    • You can now read fixed width files with ragged final columns, by setting
      the final end position in fwf_positions() or final width in fwf_widths()
      to NA (#353, @ghaarsma). fwf_empty() does this automatically.
    • read_fwf() and fwf_empty() can now skip commented lines by setting a
      comment argument (#334).
  • read_lines() ignores embedded nulls in strings (#338) and gains a na
    argument (#479).
  • readr_example() makes it easy to access example files bundled with readr.
  • type_convert() now accepts only NULL or a cols specification for
    col_types (#369).
  • write_delim() and write_csv() now invisibly return the input data frame
    (as documented, #363).
  • Doubles are parsed with boost::spirit::qi::long_double to work around a bug
    in the spirit library when parsing large numbers (#412).
  • Fix bug when detecting column types for single row files without headers
    (#333).
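To illustrate the locale() item in the list above, here is a small sketch of how the decimal and grouping marks interact. The specific marks chosen are only examples:

```r
library(readr)

# Setting only the grouping mark lets locale() pick a matching decimal mark.
loc <- locale(grouping_mark = ".")
loc$decimal_mark

# Using the same character for both marks is rejected with an error;
# tryCatch() captures the message so the script keeps running.
clash <- tryCatch(
  locale(decimal_mark = ",", grouping_mark = ","),
  error = function(e) conditionMessage(e)
)
clash
```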