Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Csv fixes mega pack 2019 #1095

Merged
merged 50 commits into from Nov 6, 2019

Conversation

@adept
Copy link
Collaborator

adept commented Oct 12, 2019

Let's fix some outstanding CSV issues. This PR fixes:

  • #841 and #416 : newlines in description/comment/precomment/... are not breaking output
  • #570 : zero amount-in and non-zero amount-out (or vice versa) is no longer an error
  • #500 : recursive interpolation works for all fields, including amount
  • #1000: it is possible to use just balance assertions when reading csv
  • #1001: empty field assignments are possible now
  • #1076: it is now possible to write "if" expressions with "skip" in the body that will skip records
  • #627 : csv rules could produce up to 9 postings, with different comments, balance assertions or currencies for each.

Obviously, the biggest change is multi-postings generation. I tried to streamline the code, so hopefully this bit is not too hard to read.

Parser now recognizes "account1", "amount1", "amount1-in", "amount1-out", "comment1", "balance1", "currency1" and likewise with "2", "3", ... up to "9" and will create postings for all sets of values that contain sensible data. Legacy names "account", "amount-in", "amount-out", "balance" are recognized as well, and parser tries to resolve conflicts between those and "first" set of fields. I've added some tests to demonstrate this

Current code tries really hard to be backwards-compatible, so, for example, it uses "currency" as currency for all amounts unless overwritten by more specific "currencyX" and still tries to assign account "income:unknown" or "expenses:unknown" to posting 2 (even if there are more than two postings in the transaction), which might be bad/surprising. I am not sure what is the best way to proceed here. It would be nice to just drop backwards compatibility and aim for "least surprise", but I am not sure how feasible this is.

For example, one option could be to set account to "unknown" for any posting that has amount or balance assertion defined, but no account. This will break backwards compatibility wrt "income:unknown", but will be more resilient. What do you think?

Docs are not updated yet, but I am opening this up for an early review nevertheless.

@adept adept force-pushed the adept:csv-mega-pack branch from 9b37d8e to 091fa80 Oct 12, 2019
@adept

This comment has been minimized.

Copy link
Collaborator Author

adept commented Oct 12, 2019

Is it always safe for us to be ignoring a 0 value like this, do you think ? What happens if both amounts are 0, or both are empty, or both are non zero ?

I claim that it is safe. Current code treats "empty" and "zero" as "no value" or "empty", so we have only four cases:

  1. in is NOT empty , out in NOT empty. Impossible to interpret, and we will error out
  2. in is empty, out is preset. We take "out", and this seems the only possible logical interpretation
  3. in is present, out is empty. We take "in", and this seems the only possible logical interpretation
  4. Both in and out are empty. The only possible interpretation is "no amount change happened", and this is exactly what the code will do. Note that if there was a balance assertion on same line, we will still take it and try and make a posting of it all, but if not - this line is effectively no-op and will be ignored. I am open to explanations how it could be anything else but no-op :)
@adept

This comment has been minimized.

Copy link
Collaborator Author

adept commented Oct 12, 2019

Would this make a double space with windows line endings ? And multiple spaces when there’s consecutive newlines ? Could limit it to one space ?

Removing consequtive breaks and replacing with a single space seems sensible. I amended the test and committed the fix.

@simonmichael

This comment has been minimized.

Copy link
Owner

simonmichael commented Oct 12, 2019

I was going to ask if case 4 (both zero) should produce an explicit zero but I’m less clear after seeing the comment edit. But I believe you’re clear so that’s fine.

@simonmichael

This comment has been minimized.

Copy link
Owner

simonmichael commented Oct 12, 2019

I’ll do a little real world testing soon.

@simonmichael

This comment has been minimized.

Copy link
Owner

simonmichael commented Oct 12, 2019

Maybe no checking will be less painful overall - we’ll find out. Ok on not validating for now.

@simonmichael

This comment has been minimized.

Copy link
Owner

simonmichael commented Oct 12, 2019

Here's a report of my first experiments. I want to use the new feature to generate a third (fees) posting for a paypal donation. (Ignore the "expenses" miscategorisation.)

Test command, shows generated entry and if it is parseable and if so the same entry made explicit:

$ ls a.csv* | entr bash -c 'hledger -f a.csv print && (hledger -f a.csv print | hledger -f- print -I -x 2>/dev/null) && echo ok || echo not ok'

I initially had:

rules (fragment):

fields date, time, timezone, description, type, status_, currency, grossamount, feeamount, netamount, fromemail, toemail, code, itemtitle, itemid, referencetxnid, receiptid, balance, note
amount %netamount

result:

2018/12/22 (12U43407AR385363D) Someone for Joyful Systems
    sm:assets:online:paypal           $9.41 = $57.60
    sm:expenses:unknown

2018/12/22 (12U43407AR385363D) Someone for Joyful Systems
    sm:assets:online:paypal           $9.41 = $57.60
    sm:expenses:unknown              $-9.41

ok

I added amount2:

rules:

amount %netamount
amount2 %feeamount

result:

2018/12/22 (12U43407AR385363D) Someone for Joyful Systems
    sm:assets:online:paypal           $9.41 = $57.60
    sm:expenses:unknown              $-0.59

not ok

I changed that to amount3, defined account3:

rules:

amount3 %feeamount
account3 JS:expenses:banking:paypal

result:

2018/12/22 (12U43407AR385363D) Someone for Joyful Systems
    sm:assets:online:paypal              $9.41 = $57.60
    sm:expenses:unknown
    JS:expenses:banking:paypal          $-0.59

2018/12/22 (12U43407AR385363D) Someone for Joyful Systems
    sm:assets:online:paypal              $9.41 = $57.60
    sm:expenses:unknown                 $-8.82
    JS:expenses:banking:paypal          $-0.59

ok

Success. Next I renamed amount to amount1:

rules:

amount1 %netamount

result:

2018/12/22 (12U43407AR385363D) Someone for Joyful Systems
hledger: amount/amount-in/amount-out and amount1/amount1-in/amount1-out produced conflicting values
the CSV record is:       "12/22/2018","06:22:50","PST","Someone","Subscription Payment","Completed","USD","10.00","-0.59","9.41","someone@some.where","simon@joyful.com","12U43407AR385363D","Joyful Systems","","9KCXBV2DBXXAX","","57.60",""
the account1 rule is: sm:assets:online:paypal
the amount1 rule is: %netamount
the account2 rule is: sm:expenses:unknown
the account3 rule is: JS:expenses:banking:paypal
the amount3 rule is: %feeamount
the balance rule is: %18
the code rule is: %13
the currency rule is: $
the date rule is: %1
the description rule is: %description for %itemtitle 

amount/amount-in/amount-out is Mixed []
amount1/amount1-in/amount1-out isMixed [Amount {acommodity = "$", aquantity = 9.41, aismultiplier = False, astyle = AmountStylePP "L False 2 Just '.' Nothing..", aprice = Nothing}]
                                 ^
                                 TODO: add space
not ok

This last one stumped me, can you tell how it's going wrong ?

@simonmichael

This comment has been minimized.

Copy link
Owner

simonmichael commented Oct 12, 2019

(PS and maybe you can fix that missing space and colons (see TODO:), I didn't want to conflict)

@adept

This comment has been minimized.

Copy link
Collaborator Author

adept commented Oct 12, 2019

This last one stumped me, can you tell how it's going wrong ?

Yep, thats a bug. Fixed, and fixed error formatting as well

@adept

This comment has been minimized.

Copy link
Collaborator Author

adept commented Oct 12, 2019

Added fix for #1076 and #1000 got fixed as a side-effect of introducing muti-postings

@simonmichael

This comment has been minimized.

Copy link
Owner

simonmichael commented Oct 13, 2019

Added fix for #1076 and #1000 got fixed as a side-effect of introducing muti-postings

Oh, excellent.

About doc updates.. would it be fair to say that "an assignment to any of accountN, amountN, amount-inN, amount-outN or currencyN will generate a posting (though it's your responsibility to ensure it is a well formed one). Normally the N's are consecutive starting from 1 (but it's not required). One posting will be generated for each unique N."

@adept

This comment has been minimized.

Copy link
Collaborator Author

adept commented Oct 13, 2019

About doc updates.. would it be fair to say that "an assignment to any of accountN, amountN, amount-inN, amount-outN or currencyN will generate a posting (though it's your responsibility to ensure it is a well formed one). Normally the N's are consecutive starting from 1 (but it's not required). One posting will be generated for each unique N."

Very sensible. Applied with slight rewording

@simonmichael

This comment has been minimized.

Copy link
Owner

simonmichael commented Oct 14, 2019

I'm using conditional skip successfully to ignore trailing junk. Here, ignoring everything after the first record with an empty value:

if ^,
 skip 999999

Issue: it only seems to skip one line, ignoring the argument.
Is it worth adding a "end" directive to provide certainty that it works even with an extra-long csv file ?

@simonmichael

This comment has been minimized.

Copy link
Owner

simonmichael commented Oct 14, 2019

Hmm.. I think it currently requires amount (no number) to always be assigned:

amount1  0
Please specify (as a top level CSV rule) either the amount field,
both the amount-in and amount-out fields, or the balance field. Eg:
amount %2
@adept

This comment has been minimized.

Copy link
Collaborator Author

adept commented Oct 14, 2019

Issue: it only seems to skip one line, ignoring the argument.

That's because my initial idea for skip implementation made implementing skip N to be hard, so I did not do it. But my subsequent idea actually made it easier, so i just pushed implementation for skip N, and your example should work. "End" seems sensible as well, so i did this as well

@adept

This comment has been minimized.

Copy link
Collaborator Author

adept commented Oct 14, 2019

Hmm.. I think it currently requires amount (no number) to always be assigned:

Yep, that's a bug. Fixed!

@simonmichael

This comment has been minimized.

Copy link
Owner

simonmichael commented Oct 14, 2019

  • Nice. Oh, I wasn't thinking skip end.. could it be just end ?

  • Now one of amount or amount1 is required, which doesn't match "Normally the N's are consecutive starting from 1 (but it's not required).". Probably any amount[N] should be accepted here, right ? But actually just checking for one amount isn't as useful now. We should probably just scrap the amount checking (or do more thorough checking), eh ?

@adept

This comment has been minimized.

Copy link
Collaborator Author

adept commented Oct 14, 2019

Nice. Oh, I wasn't thinking skip end.. could it be just end ?

I'd rather keep "skip end", otherwise you need to handle a case when both "skip N" and "end" are in effect, possibly through two different "if" blocks. With "skip end", ambiguous cases like those are made impossible :)

@adept

This comment has been minimized.

Copy link
Collaborator Author

adept commented Oct 14, 2019

Now one of amount or amount1 is required, which doesn't match "Normally the N's are consecutive starting from 1 (but it's not required).". Probably any amount[N] should be accepted here, right ? But actually just checking for one amount isn't as useful now. We should probably just scrap the amount checking (or do more thorough checking), eh ?

I part-wrote/part-kept a piece of code that forces backwards-compatible behavior when just two postings are generated: if there are just two postings, and account2 is not defined, then its name would be income:unknown or expense:unknown, and it would be driven by amount1. If we are happy with generic name like "unknown" when amount is not known, then this special case could be removed, and requirement to always define amount1 could be lifted.

Worth noting that docs currently says that you always need first posting, but after that you can go wild...

@simonmichael

This comment has been minimized.

Copy link
Owner

simonmichael commented Oct 14, 2019

I'd rather keep "skip end", otherwise you need to handle a case when both "skip N" and "end" are in effect, possibly through two different "if" blocks. With "skip end", ambiguous cases like those are made impossible :)

Good point, but consider a csv record matched by two or more conditionals, each doing a skip N. That's already a bit ambiguous, right ? I guess it will skip the sum of the Ns, which probably isn't useful, but never mind.

Isn't "end" a little clearer ? It's a simple "stop reading further input now" command, if reached the rest of the rules don't matter.

@simonmichael

This comment has been minimized.

Copy link
Owner

simonmichael commented Oct 15, 2019

If we are happy with generic name like "unknown" when amount is not known, then this special case could be removed, and requirement to always define amount1 could be lifted.

I think that would be fine, if it makes things simpler overall. Eg:
"If a posting's account is not assigned, it defaults to 'income:unknown' or 'expenses:unknown' depending on the sign of the posting's amount (or if the posting's amount is not assigned, it will be just 'unknown')."

However things still seem a bit fragile.
hledger assumes that account 1 is the base/source account for the CSV (I think we encourage this in the docs),
and that it is an asset account,
so it tests posting 1's amount,
and if it's negative, it knows to choose "income". (NB: inconsistent; we mostly use "revenue" elsewhere.)

Now with multiple amounts and more freedom, it's going to be easy to confuse this.
Should it just always use "expenses:unknown" as the default ?

@simonmichael

This comment has been minimized.

Copy link
Owner

simonmichael commented Oct 15, 2019

I think recursive field assignments are not quite there yet ?
I believe you made it possible for a field assignment to reference a
field of the same name assigned via field list:

2019-01-01,a,1
fields   date, account1, amount
account1 %account1 x
2019/01/01
    a x                          1
    income:unknown

But field assignments still can't reference each other, eg to
incrementally build up a value; all but the last one are ignored:

fields   date, account1, amount
account1 %account1 x
account1 %account1 y
account1 %account1 z
2019/01/01
    a z                          1
    income:unknown

A separate issue: I think "account" (no number) has regressed, eg this is no longer parsed:

account  a
@adept

This comment has been minimized.

Copy link
Collaborator Author

adept commented Oct 15, 2019

Good point, but consider a csv record matched by two or more conditionals, each doing a skip N. That's already a bit ambiguous, right ? I guess it will skip the sum of the Ns, which probably isn't useful, but never mind.

But field assignments still can't reference each other, eg to
incrementally build up a value; all but the last one are ignored:

fields date, account1, amount
account1 %account1 x
account1 %account1 y
account1 %account1 z

Let me try and address both of these points at the same time.
The code effectively has two separate parts:

  1. Field assignment selection: how do we choose which fee assignments to apply
  2. Field assignment value resolution: how do we compute the value to be assigned to the field

This PR changes (2), but does not change (1) at all. Furthermore, change for (2) in this PR is very limited: code already allowed field value to recursively refer to the field name, so you get limited one-step recursion. This PR simply fixed a small issue where this recursive application mistakenly assumed that grammar for field names are more limited that it really is.

(1) still operates the same way it did before: if there are multiple field assignments in the rule file at the top level, subsequent field assignment override earlier ones. If there are multiple matching "if", field assignment in the subsequent ones override earlier ones. Only when (1) is complete, at the very end, step (2) kicks in and evaluates values for selected assignment.

This covers both "skip" and "account1" examples from above. For "skip" the last assignment will win, and for "account1" the last assignment wins.

Changing (1) will probably require separate PR, as we will move from declarative field assigment language to imperative field assignment language. Suddenly order of field declarations start to matter way more than it currently is. I do not immediately see a clear model to do what you want in your account1 example. Consider a couple of puzzles:

fields   date, account1, account2, amount
if SOME COND
   account1 %account2
   account2 %account1

Currently this swaps two accounts which seems like a nice thing to have. In the proposed model this will require third field, something like:

fields   date, account1, account2, amount
if SOME COND
   account9 %account2
   account2 %account1
   account1 %account9
   account9

Or this:

fields   date, account1, amount
account1 %account1 x
account1 %account1 y

if SOME COND
  account1 %account1 z
  
account1 %account1 t

For row that does not match "if" you expect account1 to be "<...> x y z". What about the row that matches though? Do you expect it to be "<...> x y z t" or "<...> x y t z"?
Now lets consider include files which could have their own top-level field assignment statements and "if" statements....

Are you sure that a change to introduce this level of complexity is warranted?
Even if it is, I kinda feel that it should be in its own PR.

@adept

This comment has been minimized.

Copy link
Collaborator Author

adept commented Oct 15, 2019

Regarding "income:expense" and "account" regression - i will check this later today

@simonmichael

This comment has been minimized.

Copy link
Owner

simonmichael commented Oct 15, 2019

@adept

This comment has been minimized.

Copy link
Collaborator Author

adept commented Oct 15, 2019

A separate issue: I think "account" (no number) has regressed, eg this is no longer parsed:

account  a

Current master does not support account, only account1 and account2. Is it worth adding it for consistency?

UPD: I take this back. Lets not overcomplicate things

@adept

This comment has been minimized.

Copy link
Collaborator Author

adept commented Oct 15, 2019

Doesn't this PR currently add some "imperativity" ? Since a standalone field assignment can override a field list assignment.

No, I don't think so. As I said, recursive interpolation already worked before this PR, but just for fields with names conforming to regexp %[a-zA-Z]+. This PR simply extends this to all possible fields (with dashes, underscores, etc).

@adept adept force-pushed the adept:csv-mega-pack branch from 9f2a956 to fc001da Nov 5, 2019
@simonmichael simonmichael merged commit fb5bca0 into simonmichael:master Nov 6, 2019
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
simonmichael added a commit that referenced this pull request Nov 6, 2019
simonmichael added a commit that referenced this pull request Nov 6, 2019
@simonmichael

This comment has been minimized.

Copy link
Owner

simonmichael commented Nov 6, 2019

Following our extensive IRC chats.. #1095 has landed! Thank you very much @adept ! \o/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.