Second implementation of parsing for space-delimited data #36

Open
wants to merge 14 commits into from

2 participants

@Shimuuar

Previous discussion is in pull request #30.

Everything works, although encoding of space-delimited data is a bit fragile: it depends on the fact that the space character is always escaped.

There is a performance regression with respect to the current HEAD. Benchmarks show that streaming has around a 1.3× slowdown, and decoding of named CSV shows a similar slowdown.

tibbe and others added some commits
@tibbe Check-point commit
Space-delimited parsing is mostly working, except that trailing
whitespace is not dropped correctly.
f2bf695
@Shimuuar Shimuuar Fix build 9eeea09
@Shimuuar Shimuuar Add more failing tests d217e27
@Shimuuar Shimuuar Correctly strip leading/trailing spaces
It's hard to strip whitespace correctly because
 a) It's a valid part of a CSV field, so
    "a,b,c " -> ["a","b","c "]
 b) If we're using spaces as the delimiter we get a spurious empty field
    at the end of the line
    "a b c " -> ["a","b","c",""]

The only reliable way to strip them is to read the whole line, strip
the spaces, and parse the stripped line.
6b8e1dd
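The problem described in this commit message can be illustrated with a small standalone sketch (plain Haskell on `String`s, no cassava involved; `splitOn` and `parseLine` are hypothetical helpers, not functions from the patch): splitting directly on a space delimiter yields a spurious trailing field, while stripping the whole line first does not.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- Naive split on a single delimiter character: a trailing delimiter
-- produces a spurious empty field at the end of the line.
splitOn :: Char -> String -> [String]
splitOn d s = case break (== d) s of
  (f, [])       -> [f]
  (f, _ : rest) -> f : splitOn d rest

-- The strategy from the commit: read the whole line, strip leading and
-- trailing whitespace, then parse the stripped line.
parseLine :: Char -> String -> [String]
parseLine d = splitOn d . dropWhileEnd isSpace . dropWhile isSpace

main :: IO ()
main = do
  print (splitOn ',' "a,b,c ")    -- ["a","b","c "]  (space is part of the field)
  print (splitOn ' ' "a b c ")    -- ["a","b","c",""] (spurious empty field)
  print (parseLine ' ' "a b c ")  -- ["a","b","c"]
```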
@Shimuuar Shimuuar Parser for header is the same as parser for record 6fbc77f
@Shimuuar Shimuuar Pass DecodeOptions to the field and name parsers
It's necessary for implementing delimiters which are not a single
character
0a60b79
@Shimuuar Shimuuar Use predicate on delimiter character.
Now all tests are passing.
24f375e
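The predicate-based delimiter, together with the delimiter-merging behaviour this branch adds, can be sketched standalone on `String`s (`splitWith` is a hypothetical illustration, not the attoparsec code in the patch):

```haskell
-- Split a line using a delimiter *predicate* instead of a single
-- character; when mergeDelims is set, runs of consecutive delimiters
-- count as one, which is what whitespace-separated data needs.
splitWith :: (Char -> Bool) -> Bool -> String -> [String]
splitWith isDelim mergeDelims = go
  where
    go s = case break isDelim s of
      (f, [])       -> [f]
      (f, _ : rest) -> f : go (if mergeDelims
                                 then dropWhile isDelim rest
                                 else rest)

main :: IO ()
main = do
  -- Space or tab form one delimiter class; runs are merged.
  print (splitWith (\c -> c == ' ' || c == '\t') True "a  b\tc")
  -- Ordinary CSV: single comma, no merging, empty fields preserved.
  print (splitWith (== ',') False "a,,b")
```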
@Shimuuar Shimuuar Reexport spaceDecodeOptions f7d8d8a
@Shimuuar Shimuuar Inline unescapedField
Now it depends on a rather complex data structure and needs to be
specialized by the compiler
1c18502
@tibbe
Owner

The change looks good from a brief look. I will have to take a deeper look and run the benchmarks later next week (I'm very busy this week, sorry!)

Data/Csv/Encoding.hs
((6 lines not shown))
. V.toList
{-# INLINE encodeWith #-}
-encodeRecord :: Word8 -> Record -> Builder
-encodeRecord delim = mconcat . intersperse (fromWord8 delim)
- . map fromByteString . map escape . V.toList
+encodeRecord :: EncodeOptions -> Record -> Builder
@tibbe Owner
tibbe added a note

I don't see why passing the whole record here is necessary (since we only use the delimiter). It might be slower or it might not, but I'm in favor of not changing things without reason (unless there is a reason I missed).

@tibbe
Owner

I took a deeper look and things look fine except for the small comments above.

Could you please post the criterion benchmarks for before and after your change (you can use cabal bench to run the benchmarks).

Shimuuar added some commits
@Shimuuar Shimuuar Add options for encoding of space-delimited data
Always escape the delimiter. It's now passed to the escape function
as a parameter rather than hardcoded as a comma.

Note that correct encoding of space-delimited data depends on the
space character being escaped.
361703c
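The note about escaping can be made concrete with a standalone sketch of the quoting rule (it mirrors the `escape` function in the diff below, but over `String` rather than `ByteString`; `escapeField` is a hypothetical name):

```haskell
-- Quote a field if it contains the delimiter, a double quote, CR, LF,
-- or a space; embedded quotes are doubled. Because the space character
-- is always escaped, a space-delimited decoder can tell a literal
-- space inside a field apart from a field separator.
escapeField :: Char -> String -> String
escapeField delim s
  | any (\c -> c == '"' || c == delim ||
               c == '\n' || c == '\r' || c == ' ') s
      = '"' : concatMap esc s ++ "\""
  | otherwise = s
  where
    esc '"' = "\"\""
    esc c   = [c]

main :: IO ()
main = do
  putStrLn (escapeField '\t' "abc def")  -- quoted: the space is escaped
  putStrLn (escapeField '\t' "abc")      -- left unchanged
```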
@Shimuuar Shimuuar Reexport spaceEncodingOptions from Data.Csv fa7fe10
@Shimuuar Shimuuar Add tests for encoding of space delimited data
All tests are passing
1ca1784
@Shimuuar Shimuuar Document field values in defaultDecodeOptions and spaceDecodeOptions b009085
@Shimuuar

I changed EncodeOptions a few times, and when I settled on the simple option I didn't roll back all the changes. I've pushed amended commits.

Here is a table with the benchmark results: the current HEAD, the space-delimited implementation, and the ratio space/HEAD.

                                                       Name       old       new     ratio
1           positional/decode/presidents/without conversion  60.86681  57.76073 0.9489693
2              positional/decode/presidents/with conversion 134.72521 123.18220 0.9143218
3 positional/decode/streaming/presidents/without conversion  98.87020 171.77240 1.7373526
4    positional/decode/streaming/presidents/with conversion 137.70761 191.79104 1.3927410
5              positional/encode/presidents/with conversion 162.29042 180.51010 1.1122659
6                named/decode/presidents/without conversion 282.89621 355.71811 1.2574156
7                   named/decode/presidents/with conversion 213.22024 275.15457 1.2904712
8                   named/encode/presidents/with conversion 264.76437 267.29542 1.0095596
9                                       comparison/lazy-csv 150.31214 147.47115 0.9810994

Streaming suffered badly, and named-field decoding sees a performance impact as well.

@tibbe
Owner

The reason I haven't looked into this yet is I was hoping to find some time to figure out why the streaming decode got 70% slower.

@Shimuuar

I've merged the current master and updated the benchmarks. Streaming performance still suffers, although less.

                                                       name       old      new     ratio
1           positional/decode/presidents/without conversion  59.47160  63.4873 1.0675229
2              positional/decode/presidents/with conversion 107.10150 112.4752 1.0501736
3 positional/decode/streaming/presidents/without conversion  89.97198 126.9854 1.4113888
4    positional/decode/streaming/presidents/with conversion 107.13403 149.0713 1.3914473
5              positional/encode/presidents/with conversion 134.93142 128.5330 0.9525799
6                named/decode/presidents/without conversion 247.20380 289.9606 1.1729619
7                   named/decode/presidents/with conversion 197.89479 246.9330 1.2477996
8                   named/encode/presidents/with conversion 195.77406 193.9283 0.9905722
9                                       comparison/lazy-csv 116.55858 117.6775 1.0095993
Commits on Feb 28, 2013
  1. @Shimuuar

    Check-point commit

    tibbe authored, Shimuuar committed
    Space-delimited parsing is mostly working, except that trailing
    whitespace is not dropped correctly.
  2. @Shimuuar

    Fix build

    Shimuuar authored
  3. @Shimuuar

    Add more failing tests

    Shimuuar authored
  4. @Shimuuar

    Correctly strip leading/trailing spaces

    Shimuuar authored
    It's hard to strip whitespace correctly because
     a) It's a valid part of a CSV field, so
        "a,b,c " -> ["a","b","c "]
     b) If we're using spaces as the delimiter we get a spurious empty field
        at the end of the line
        "a b c " -> ["a","b","c",""]
    
    The only reliable way to strip them is to read the whole line, strip
    the spaces, and parse the stripped line.
  5. @Shimuuar
  6. @Shimuuar

    Pass DecodeOptions to the field and name parsers

    Shimuuar authored
    It's necessary for implementing delimiters which are not a single
    character
  7. @Shimuuar

    Use predicate on delimiter character.

    Shimuuar authored
    Now all tests are passing.
  8. @Shimuuar

    Reexport spaceDecodeOptions

    Shimuuar authored
Commits on Mar 8, 2013
  1. @Shimuuar

    Inline unescapedField

    Shimuuar authored
    Now it depends on a rather complex data structure and needs to be
    specialized by the compiler
Commits on Mar 23, 2013
  1. @Shimuuar

    Add options for encoding of space-delimited data

    Shimuuar authored
    Always escape the delimiter. It's now passed to the escape function
    as a parameter rather than hardcoded as a comma.
    
    Note that correct encoding of space-delimited data depends on the
    space character being escaped.
  2. @Shimuuar
  3. @Shimuuar

    Add tests for encoding of space delimited data

    Shimuuar authored
    All tests are passing
  4. @Shimuuar
Commits on Oct 14, 2013
  1. @Shimuuar

    Merge master into space2

    Shimuuar authored
View
2  Data/Csv.hs
@@ -32,10 +32,12 @@ module Data.Csv
-- $options
, DecodeOptions(..)
, defaultDecodeOptions
+ , spaceDecodeOptions
, decodeWith
, decodeByNameWith
, EncodeOptions(..)
, defaultEncodeOptions
+ , spaceEncodeOptions
, encodeWith
, encodeByNameWith
View
32 Data/Csv/Encoding.hs
@@ -21,10 +21,12 @@ module Data.Csv.Encoding
-- ** Encoding and decoding options
, DecodeOptions(..)
, defaultDecodeOptions
+ , spaceDecodeOptions
, decodeWith
, decodeByNameWith
, EncodeOptions(..)
, defaultEncodeOptions
+ , spaceEncodeOptions
, encodeWith
, encodeByNameWith
) where
@@ -123,7 +125,7 @@ decodeWithC :: (DecodeOptions -> AL.Parser a) -> DecodeOptions -> HasHeader
-> BL8.ByteString -> Either String a
decodeWithC p !opts hasHeader = decodeWithP parser
where parser = case hasHeader of
- HasHeader -> header (decDelimiter opts) *> p opts
+ HasHeader -> header opts *> p opts
NoHeader -> p opts
{-# INLINE decodeWithC #-}
@@ -151,10 +153,16 @@ data EncodeOptions = EncodeOptions
encDelimiter :: {-# UNPACK #-} !Word8
} deriving (Eq, Show)
--- | Encoding options for CSV files.
+-- | Encoding options for CSV files. Comma is used as separator.
defaultEncodeOptions :: EncodeOptions
defaultEncodeOptions = EncodeOptions
- { encDelimiter = 44 -- comma
+ { encDelimiter = 44 -- comma
+ }
+
+-- | Encode options for space-delimited files. Tab is used as separator.
+spaceEncodeOptions :: EncodeOptions
+spaceEncodeOptions = EncodeOptions
+ { encDelimiter = 9 -- tab
}
-- | Like 'encode', but lets you customize how the CSV data is
@@ -167,13 +175,13 @@ encodeWith opts = toLazyByteString
encodeRecord :: Word8 -> Record -> Builder
encodeRecord delim = mconcat . intersperse (fromWord8 delim)
- . map fromByteString . map escape . V.toList
+ . map fromByteString . map (escape delim) . V.toList
{-# INLINE encodeRecord #-}
-- TODO: Optimize
-escape :: B.ByteString -> B.ByteString
-escape s
- | B.any (\ b -> b == dquote || b == comma || b == nl || b == cr || b == sp)
+escape :: Word8 -> B.ByteString -> B.ByteString
+escape delim s
+ | B.any (\ b -> b == dquote || b == delim || b == nl || b == cr || b == sp)
s = toByteString $
fromWord8 dquote
<> B.foldl
@@ -185,11 +193,11 @@ escape s
<> fromWord8 dquote
| otherwise = s
where
+ sp = 32
dquote = 34
- comma = 44
nl = 10
cr = 13
- sp = 32
+
-- | Like 'encodeByName', but lets you customize how the CSV data is
-- encoded.
@@ -262,7 +270,7 @@ csv !opts = do
return $! V.fromList vals
where
records = do
- !r <- record (decDelimiter opts)
+ !r <- record opts
if blankLine r
then (endOfLine *> records) <|> pure []
else case runParser (parseRecord r) of
@@ -276,7 +284,7 @@ csv !opts = do
csvWithHeader :: FromNamedRecord a => DecodeOptions
-> AL.Parser (Header, V.Vector a)
csvWithHeader !opts = do
- !hdr <- header (decDelimiter opts)
+ !hdr <- header opts
vals <- records hdr
_ <- optional endOfLine
endOfInput
@@ -284,7 +292,7 @@ csvWithHeader !opts = do
return (hdr, v)
where
records hdr = do
- !r <- record (decDelimiter opts)
+ !r <- record opts
if blankLine r
then (endOfLine *> records hdr) <|> pure []
else case runParser (convert hdr r) of
View
4 Data/Csv/Incremental.hs
@@ -133,7 +133,7 @@ decodeHeader = decodeHeaderWith defaultDecodeOptions
decodeHeaderWith :: DecodeOptions -> HeaderParser B.ByteString
decodeHeaderWith !opts = PartialH (go . parser)
where
- parser = A.parse (header $ decDelimiter opts)
+ parser = A.parse (header opts)
go (A.Fail rest _ msg) = FailH rest err
where err = "parse error (" ++ msg ++ ")"
@@ -290,7 +290,7 @@ decodeWithP p !opts = go Incomplete [] . parser
acc' | blankLine r = acc
| otherwise = let !r' = convert r in r' : acc
- parser = A.parse (record (decDelimiter opts) <* (endOfLine <|> endOfInput))
+ parser = A.parse (record opts <* (endOfLine <|> endOfInput))
convert = runParser . p
{-# INLINE decodeWithP #-}
View
120 Data/Csv/Parser.hs
@@ -16,6 +16,7 @@
module Data.Csv.Parser
( DecodeOptions(..)
, defaultDecodeOptions
+ , spaceDecodeOptions
, csv
, csvWithHeader
, header
@@ -26,8 +27,8 @@ module Data.Csv.Parser
import Blaze.ByteString.Builder (fromByteString, toByteString)
import Blaze.ByteString.Builder.Char.Utf8 (fromChar)
-import Control.Applicative (Alternative, (*>), (<$>), (<*), (<|>), optional,
- pure)
+import Control.Applicative (Alternative, (*>), (<$), (<$>), (<*), (<|>),
+ optional, pure)
import Data.Attoparsec.Char8 (char, endOfInput, endOfLine)
import qualified Data.Attoparsec as A
import qualified Data.Attoparsec.Lazy as AL
@@ -55,19 +56,51 @@ import Data.Csv.Util ((<$!>), blankLine)
-- > }
data DecodeOptions = DecodeOptions
{ -- | Field delimiter.
- decDelimiter :: {-# UNPACK #-} !Word8
- } deriving (Eq, Show)
+ decDelimiter :: Word8 -> Bool
--- | Decoding options for parsing CSV files.
+ -- | Runs of consecutive delimiters are regarded as a single
+ -- delimiter. This is useful e.g. when parsing white space
+ -- separated data.
+ , decMergeDelimiters :: !Bool
+
+ -- | Trim leading and trailing whitespace at the begining and
+ -- end of each record (but not at the begining and end of each
+ -- field).
+ , decTrimRecordSpace :: !Bool
+ }
+
+-- | Decoding options for parsing CSV files. Fields' values are set to:
+--
+-- [@'decDelimiter'@] comma
+--
+-- [@'decMergeDelimiters'@] false
+--
+-- [@'decTrimRecordSpace'@] false
defaultDecodeOptions :: DecodeOptions
defaultDecodeOptions = DecodeOptions
- { decDelimiter = 44 -- comma
+ { decDelimiter = (==44) -- comma
+ , decMergeDelimiters = False
+ , decTrimRecordSpace = False
+ }
+
+-- | Decoding options for parsing space-delimited files. Fields' values are set to:
+--
+-- [@'decDelimiter'@] space or tab character.
+--
+-- [@'decMergeDelimiters'@] true
+--
+-- [@'decTrimRecordSpace'@] true
+spaceDecodeOptions :: DecodeOptions
+spaceDecodeOptions = DecodeOptions
+ { decDelimiter = \c -> c == space || c == tab
+ , decMergeDelimiters = True
+ , decTrimRecordSpace = True
}
-- | Parse a CSV file that does not include a header.
csv :: DecodeOptions -> AL.Parser Csv
csv !opts = do
- vals <- record (decDelimiter opts) `sepBy1'` endOfLine
+ vals <- record opts `sepBy1'` endOfLine
_ <- optional endOfLine
endOfInput
let nonEmpty = removeBlankLines vals
@@ -94,23 +127,24 @@ sepBy1' p s = go
-- | Parse a CSV file that includes a header.
csvWithHeader :: DecodeOptions -> AL.Parser (Header, V.Vector NamedRecord)
csvWithHeader !opts = do
- !hdr <- header (decDelimiter opts)
+ !hdr <- header opts
vals <- map (toNamedRecord hdr) . removeBlankLines <$>
- (record (decDelimiter opts)) `sepBy1'` endOfLine
+ (record opts) `sepBy1'` endOfLine
_ <- optional endOfLine
endOfInput
let !v = V.fromList vals
return (hdr, v)
-- | Parse a header, including the terminating line separator.
-header :: Word8 -- ^ Field delimiter
+header :: DecodeOptions -- ^ Field delimiter
-> AL.Parser Header
-header !delim = V.fromList <$!> name delim `sepBy1'` (A.word8 delim) <* endOfLine
+header = record
+{-# INLINE header #-}
-- | Parse a header name. Header names have the same format as regular
-- 'field's.
-name :: Word8 -> AL.Parser Name
-name !delim = field delim
+name :: DecodeOptions -> AL.Parser Name
+name = field
removeBlankLines :: [Record] -> [Record]
removeBlankLines = filter (not . blankLine)
@@ -120,23 +154,42 @@ removeBlankLines = filter (not . blankLine)
-- CSV file is allowed to not have a terminating line separator. You
-- most likely want to use the 'endOfLine' parser in combination with
-- this parser.
-record :: Word8 -- ^ Field delimiter
- -> AL.Parser Record
-record !delim = do
- fs <- field delim `sepBy1'` (A.word8 delim)
- return $! V.fromList fs
+record :: DecodeOptions -> AL.Parser Record
+record !opts
+ -- If we need to trim spaces from line only robust way to do so is
+ -- to read whole line, remove spaces and run record parser on
+ -- trimmed line. For example:
+ --
+ -- + "a,b,c " will be parsed as ["a","b","c "] since spaces are
+ -- allowed in field
+ -- + "a b c " will be parsed as ["a","b","c",""] if we use space
+ -- as separator.
+ | decTrimRecordSpace opts = do
+ AL.skipMany $ AL.satisfy isSpace
+ line <- AL.takeWhile $ \c -> c /= newline && c /= cr
+ let (dat,_) = S.spanEnd isSpace line
+ case AL.parseOnly parser dat of
+ Left e -> fail e
+ Right x -> return x
+ | otherwise = parser
+ where
+ delim = decDelimiter opts
+ delimiter | decMergeDelimiters opts = A.skipMany1 (A.satisfy delim)
+ | otherwise = () <$ A.satisfy delim
+ parser = do fs <- field opts `sepBy1'` delimiter
+ return $! V.fromList fs
{-# INLINE record #-}
-- | Parse a field. The field may be in either the escaped or
-- non-escaped format. The return value is unescaped.
-field :: Word8 -> AL.Parser Field
-field !delim = do
+field :: DecodeOptions -> AL.Parser Field
+field !opt = do
mb <- A.peekWord8
-- We purposely don't use <|> as we want to commit to the first
-- choice if we see a double quote.
case mb of
Just b | b == doubleQuote -> escapedField
- _ -> unescapedField delim
+ _ -> unescapedField opt
{-# INLINE field #-}
escapedField :: AL.Parser S.ByteString
@@ -154,11 +207,14 @@ escapedField = do
Left err -> fail err
else return s
-unescapedField :: Word8 -> AL.Parser S.ByteString
-unescapedField !delim = A.takeWhile (\ c -> c /= doubleQuote &&
- c /= newline &&
- c /= delim &&
- c /= cr)
+unescapedField :: DecodeOptions -> AL.Parser S.ByteString
+unescapedField !opt = A.takeWhile (\ c -> c /= doubleQuote &&
+ c /= newline &&
+ c /= cr &&
+ not (delim c))
+ where
+ delim = decDelimiter opt
+{-# INLINE unescapedField #-}
dquote :: AL.Parser Char
dquote = char '"'
@@ -178,7 +234,13 @@ unescape = toByteString <$!> go mempty where
then return (acc `mappend` fromByteString h)
else rest
-doubleQuote, newline, cr :: Word8
+doubleQuote, newline, cr, space, tab :: Word8
doubleQuote = 34
-newline = 10
-cr = 13
+newline = 10
+cr = 13
+space = 32
+tab = 9
+
+isSpace :: Word8 -> Bool
+isSpace c = c == space || c == tab
+{-# INLINE isSpace #-}
View
21 tests/UnitTests.hs
@@ -23,6 +23,7 @@ import Test.Framework.Providers.QuickCheck2 as TF
import Data.Csv hiding (record)
import qualified Data.Csv.Streaming as S
+
------------------------------------------------------------------------
-- Parse tests
@@ -114,6 +115,14 @@ positionalTests =
[ testCase "tab-delim" $ encodesWithAs (defEnc { encDelimiter = 9 })
[["1", "2"]] "1\t2\r\n"
]
+ , testGroup "encodeSpace" $ map (\(n,a,b) -> testCase n $ encodesWithAs spaceEncodeOptions a b)
+ [ ("simple", [["abc"]], "abc\r\n")
+ , ("leadingSpace", [[" abc"]], "\" abc\"\r\n")
+ , ("comma", [["abc,def"]], "abc,def\r\n")
+ , ("space", [["abc def"]], "\"abc def\"\r\n")
+ , ("tab", [["abc\tdef"]], "\"abc\tdef\"\r\n")
+ , ("twoFields", [["abc","def"]], "abc\tdef\r\n")
+ ]
, testGroup "decode" $ map decodeTest decodeTests
, testGroup "decodeWith" $ map decodeWithTest decodeWithTests
, testGroup "streaming"
@@ -142,7 +151,12 @@ positionalTests =
, ("rfc4180", rfc4180Input, rfc4180Output)
]
decodeWithTests =
- [ ("tab-delim", defDec { decDelimiter = 9 }, "1\t2", [["1", "2"]])
+ [ ("tab-delim", defDec { decDelimiter = (==9) }, "1\t2", [["1", "2"]])
+ , ("mixed-space", spaceDec, " 88 c \t 0.4 ", [["88", "c", "0.4"]])
+ , ("multiline-space", spaceDec, " 11 22 \n 11 22", [ ["11","22"]
+ , ["11","22"]])
+ , ("blankLine-space", spaceDec, "1 2\n\n3 4\n", [ ["1","2"]
+ , ["3","4"]])
]
encodeTest (name, input, expected) =
@@ -155,8 +169,9 @@ positionalTests =
testCase name $ input `decodesStreamingAs` expected
streamingDecodeWithTest (name, opts, input, expected) =
testCase name $ decodesWithStreamingAs opts input expected
- defEnc = defaultEncodeOptions
- defDec = defaultDecodeOptions
+ defEnc = defaultEncodeOptions
+ defDec = defaultDecodeOptions
+ spaceDec = spaceDecodeOptions
nameBasedTests :: [TF.Test]
nameBasedTests =