Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trim quoted values when importing csv? #1051

Closed
benwhalley opened this issue Jun 13, 2019 · 11 comments

Comments

Projects
None yet
2 participants
@benwhalley
Copy link

commented Jun 13, 2019

I am importing some CSV from Amex and noticed that their csv amount field is double-quoted and includes a space in front of the value. So, for example: " -140.00".

I noticed that this causes an error when trying to invert the amount in the csv rules file. So if I add the line: amount -%amount GBP it then throws and error on import.

In full, the output is:

the CSV record is:       "14/04/2019", "Reference: XXX", " 6.99", "PAYPAL *ELIDA76 LTD 4029357733", " Process Date 14/04/2019", ""
the amount rule is:      -%amount GBP
the currency rule is:    unspecified
the default-currency is: unspecified
the parse error is:      1:2:
  |
1 | - 6.99 GBP
  |  ^
unexpected space
expecting amount

Deleting the space in the input csv file removes the error.

Is there a way to force hledger to trim incoming values?

@benwhalley benwhalley added the A BUG label Jun 13, 2019

@simonmichael simonmichael added A WISH and removed A BUG labels Jun 13, 2019

@simonmichael

This comment has been minimized.

Copy link
Owner

commented Jun 13, 2019

It does some of that already, but I guess it would be good to do a bit more here. Thanks for the example. Would you be interested in having a look at the Hledger.Read.CsvReader code ? The strip function would do the job.

@benwhalley

This comment has been minimized.

Copy link
Author

commented Jun 14, 2019

Hmmm. Haskell a bit of a mind bender. After fighting with types and fmap for a bit have lazily used this for the moment:

cat amex.csv | sed 's/,\" /,"/g' |  hledger import --rules-file rules.txt  -
@simonmichael

This comment has been minimized.

Copy link
Owner

commented Jun 14, 2019

Cool. It definitely is at first. If you feel like taking another whack at it, help is available in #hledger. I'll recommend stack build hledger once, then make ghcid and make ghci in side windows, then try making a trivial change in CsvReader.hs and see ghcid update, then try reloading and running in ghci (:r <ENTER>, :main -f examples/sample.csv print <ENTER>), then try to see where to put a strip call. Though, maybe you tried that and found the failure is in the cassava CSV parsing lib, which would be a bit more difficult. I suppose then our options would be 1. preprocess the text like you've done above, but in parseCsv, and/or 2. see if we can get this supported in cassava.

@simonmichael simonmichael removed the easy? label Jun 14, 2019

@simonmichael

This comment has been minimized.

Copy link
Owner

commented Jun 14, 2019

Though, thinking about it a little more.. space is significant in CSV, so cassava is right to return it to us. The problem is our field assignment combining a minus sign and a CSV value, which is a fragile way to generate a parseable number. Interpolating a CSV value happens in renderTemplate. If we called strip there, this should work ? However maybe it's not appropriate to strip every interpolated CSV field this way..

And separately, I have become confused. My local repro has been:

2000-01-01,a, " 1"
fields date, description, amount
$ hledger -f a.csv print   # using latest hledger master, built with cassava-0.5.1.0
hledger: user error (/Users/simon/src/PLAINTEXTACCOUNTING/hledger/a.csv:1:15:
  |
1 | 2000-01-01,a, " 1"
  |               ^
unexpected '"'
expecting ',', end of input, end of line, or unescaped character
)

which has no interpolation rule, and fails earlier than yours, while cassava is parsing the CSV. I don't understand how yours gets as far as it does.

@simonmichael

This comment has been minimized.

Copy link
Owner

commented Jun 14, 2019

Oh yea, I see my error. , "1" causes cassava to fail. ," 1" causes your hledger rule to fail. I must be mixing up two issues.

@simonmichael

This comment has been minimized.

Copy link
Owner

commented Jun 14, 2019

Yes, I was thinking of #1037. Back to your issue: I confirm that adding strip like so fixes it:

    replace ('%':pat) = maybe pat (\i -> strip $ atDef "" record (i-1)) mindex

Now, is it acceptably intuitive/predictable/principled that %-interpolated values would always have whitespace stripped..

@simonmichael

This comment has been minimized.

Copy link
Owner

commented Jun 14, 2019

Actually there's a separate code path when assigning an amount field (amount/amount-in/amount-out). So we could strip whitespace only in that case, and leave it as-is otherwise. Which is best ?

@simonmichael

This comment has been minimized.

Copy link
Owner

commented Jun 14, 2019

Sorry for the stream-of-thought. Rather, it would be:

"Any CSV value interpolated in a amount, amount-in or amount-out field assignment will have any outer whitespace removed. In all other field assignments, outer whitespace in CSV values is left intact."

Which sounds a bit messy.

simonmichael added a commit that referenced this issue Jun 14, 2019

csv: strip outer whitespace when interpolating CSV values (#1051)
This removes a potential snag in amount field assignments, and
hopefully is harmless and acceptable otherwise.
@simonmichael

This comment has been minimized.

Copy link
Owner

commented Jun 14, 2019

I just went ahead and committed "always strip outer whitespace when interpolating". This removes the snag you hit and will hopefully be harmless otherwise. I'll mention it on list.

@benwhalley

This comment has been minimized.

Copy link
Author

commented Jun 17, 2019

Hey - thanks. And for the tip on ghci too - that makes debugging etc much easier! I'll try and have a go if something else comes up. I've been interested in Haskell for a while.

@simonmichael

This comment has been minimized.

Copy link
Owner

commented Jul 8, 2019

For the record: RFC 4180 2.4 says "Spaces are considered part of a field and should not be ignored", so we're going slightly against the CSV spec here. Perhaps better if we would strip spaces only from amounts, but we're not aware of CSV fields' meanings when interpolating.

simonmichael added a commit that referenced this issue Jul 8, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.