The print-unique command needs to allow field selection for comparison #1046

alerque · 2019-06-11T14:54:15Z

Related to #943 but not quite the same...

The print-unique command is of limited use without some way to configure what it actually compares. At the moment it only compares part of the description: the text field proper, without the code part that is treated as part of the description elsewhere. In my use case, processing imported transactions, I'm actually looking for unique codes (or, in one case, a combination of the code and the description). I can also conceive of wanting to match the amount as well, to get unique transactions and not just unique payees.

What fields are compared should be configurable.


simonmichael · 2019-06-12T03:29:06Z

Interesting idea, though I'm not totally clear on the real-world use cases for more powerful uniqueness checking. If you feel this is valuable enough, would you like to try mocking up the UI and docs here?

alerque · 2019-06-12T07:41:35Z

It's possible that I'm using the wrong tool, but here is my scenario. The only digital format I can get out of my bank is an XLS sheet of the last N transactions. I can download these whenever I want, but the result is inevitably duplicate transactions. I'm converting the XLS to CSV, then importing to Ledger. Here is a sample set of problem entries from the resulting ledger:

2018/01/19 (2018-01-19-08.52.24.534905) INT-EFTEMR-0240334-802 AİDAT EKİM, KASIM, ARALIK 2017
    Para Transferi                  ₺2880.00
    Assets:TRY:Caleb:Garanti       ₺-2880.00

2018/01/19 (2018-01-19-08.52.24.534905) INT-EFTEMR-KOMİSYON+BSMV TAHSİLATI-802 AİDAT EKİM, KASIM, AR
    Expenses:Fees:Banking              ₺4,90
    Assets:TRY:Caleb:Garanti          ₺-4,90

2018/10/04 (2018-10-04-09.00.51.446265) INT-EFTEMR-0510319-802 AİDAT EKİM, KASIM, ARALIK 2018
    Para Transferi                  ₺3060.00
    Assets:TRY:Caleb:Garanti       ₺-3060.00

2018/10/04 (2018-10-04-09.00.51.446265) INT-EFTEMR-0510319-802 AİDAT EKİM, KASIM, ARALIK 2018
    Para Transferi                  ₺3060.00
    Assets:TRY:Caleb:Garanti       ₺-3060.00

2018/10/04 (2018-10-04-09.00.51.446265) INT-EFTEMR-KOMİSYON+BSMV TAHSİLATI-802 AİDAT EKİM, KASIM, AR
    Expenses:Fees:Banking              ₺5,50
    Assets:TRY:Caleb:Garanti          ₺-5,50

Note that there are three wire transfers here, but only two of them are unique (with "unique" codes). Deduplicating these works pretty easily with print-unique because even though the code is ignored, the description line is different.

Then there are two entries for the fees associated with the two wire transfers. Normally these would be duplicated too, but in this case the fee was added later: my first import didn't have the fee, and a later download did.

Deduplicating the fee transactions is harder. The description line should have been unique (by chance of my including the year date in the memo) but the XLS only has truncated values, so these two years are showing the same description line. They have different codes, but print-unique isn't including the code in the comparison. Just using the code wouldn't work either, because the fees have the same code as the transaction they are associated with.

The awkward result is that hledger print-unique removes something that was actually unique, just because its description happened to be truncated:

2018/01/19 (2018-01-19-08.52.24.534905) INT-EFTEMR-0240334-802 AİDAT EKİM, KASIM, ARALIK 2017
    Para Transferi                  ₺2880,00
    Assets:TRY:Caleb:Garanti       ₺-2880,00

2018/01/19 (2018-01-19-08.52.24.534905) INT-EFTEMR-KOMİSYON+BSMV TAHSİLATI-802 AİDAT EKİM, KASIM, AR
    Expenses:Fees:Banking              ₺4,90
    Assets:TRY:Caleb:Garanti          ₺-4,90

2018/10/04 (2018-10-04-09.00.51.446265) INT-EFTEMR-0510319-802 AİDAT EKİM, KASIM, ARALIK 2018
    Para Transferi                  ₺3060,00
    Assets:TRY:Caleb:Garanti       ₺-3060,00
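
The loss of the second fee entry can be reproduced with a small model. This is only a sketch, assuming (as described above) that the comparison looks at the description text alone and keeps the first transaction seen; the dict layout and the `unique_by_description` helper are hypothetical and are not hledger's implementation.

```python
# Hypothetical model: each transaction is a dict holding only the header fields.
FEE = "INT-EFTEMR-KOMİSYON+BSMV TAHSİLATI-802 AİDAT EKİM, KASIM, AR"

transactions = [
    {"date": "2018/01/19", "code": "2018-01-19-08.52.24.534905",
     "description": "INT-EFTEMR-0240334-802 AİDAT EKİM, KASIM, ARALIK 2017"},
    {"date": "2018/01/19", "code": "2018-01-19-08.52.24.534905",
     "description": FEE},
    {"date": "2018/10/04", "code": "2018-10-04-09.00.51.446265",
     "description": "INT-EFTEMR-0510319-802 AİDAT EKİM, KASIM, ARALIK 2018"},
    {"date": "2018/10/04", "code": "2018-10-04-09.00.51.446265",
     "description": "INT-EFTEMR-0510319-802 AİDAT EKİM, KASIM, ARALIK 2018"},
    {"date": "2018/10/04", "code": "2018-10-04-09.00.51.446265",
     "description": FEE},
]


def unique_by_description(txns):
    """Keep the first transaction seen for each description (assumed behavior)."""
    seen = set()
    kept = []
    for txn in txns:
        if txn["description"] not in seen:
            seen.add(txn["description"])
            kept.append(txn)
    return kept


kept = unique_by_description(transactions)
for txn in kept:
    print(txn["date"], txn["description"])
```

Under that assumption three transactions survive: the genuine duplicate wire transfer is dropped, but so is the 2018/10/04 fee, because its truncated description matches the 2018/01/19 fee.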

Ideally I would be able to use hledger print-unique --fields date,code,description to deduplicate transactions on the combination of the date, code, and description fields, and get the following result:

2018/01/19 (2018-01-19-08.52.24.534905) INT-EFTEMR-0240334-802 AİDAT EKİM, KASIM, ARALIK 2017
    Para Transferi                  ₺2880.00
    Assets:TRY:Caleb:Garanti       ₺-2880.00

2018/01/19 (2018-01-19-08.52.24.534905) INT-EFTEMR-KOMİSYON+BSMV TAHSİLATI-802 AİDAT EKİM, KASIM, AR
    Expenses:Fees:Banking              ₺4,90
    Assets:TRY:Caleb:Garanti          ₺-4,90

2018/10/04 (2018-10-04-09.00.51.446265) INT-EFTEMR-0510319-802 AİDAT EKİM, KASIM, ARALIK 2018
    Para Transferi                  ₺3060.00
    Assets:TRY:Caleb:Garanti       ₺-3060.00

2018/10/04 (2018-10-04-09.00.51.446265) INT-EFTEMR-KOMİSYON+BSMV TAHSİLATI-802 AİDAT EKİM, KASIM, AR
    Expenses:Fees:Banking              ₺5,50
    Assets:TRY:Caleb:Garanti          ₺-5,50

(And yes, my bank's exports are inconsistent in their use of number formatting! I do clean that up in the next step by filtering the ledger through a print with an explicit commodity format declaration.)
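
The requested behavior amounts to building the comparison key from a chosen list of fields. A minimal sketch follows, assuming transactions are plain dicts and that the first occurrence of each key is the one kept; `unique_by` is a hypothetical helper for illustration, not an existing hledger function.

```python
def unique_by(txns, fields):
    """Keep the first transaction for each distinct combination of `fields`."""
    seen = set()
    kept = []
    for txn in txns:
        key = tuple(txn[name] for name in fields)  # composite comparison key
        if key not in seen:
            seen.add(key)
            kept.append(txn)
    return kept


FEE = "INT-EFTEMR-KOMİSYON+BSMV TAHSİLATI-802 AİDAT EKİM, KASIM, AR"
WIRE = "INT-EFTEMR-0510319-802 AİDAT EKİM, KASIM, ARALIK 2018"

txns = [
    {"date": "2018/01/19", "code": "2018-01-19-08.52.24.534905", "description": FEE},
    {"date": "2018/10/04", "code": "2018-10-04-09.00.51.446265", "description": WIRE},
    {"date": "2018/10/04", "code": "2018-10-04-09.00.51.446265", "description": WIRE},
    {"date": "2018/10/04", "code": "2018-10-04-09.00.51.446265", "description": FEE},
]

by_all_three = unique_by(txns, ["date", "code", "description"])
by_code = unique_by(txns, ["code"])
by_description = unique_by(txns, ["description"])

print(len(by_all_three), len(by_code), len(by_description))
```

With this sample, the three-field key keeps both fees and one copy of the wire transfer. The code alone drops the October fee because it shares a code with its wire transfer, and the description alone drops it because of the truncated text.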

simonmichael · 2019-06-12T13:38:00Z

Great example, thanks. Though I have to admit I'm still not really clear. The current print-unique was used for something I don't remember. print-unique --fields=FIELDS (with some default set of fields) sounds good.

Except, if we can avoid options it's always better. Why not always check all fields?

alerque · 2019-06-12T13:44:57Z

In my use case, checking all fields would be more useful than the current behavior, BUT I could imagine a use case for not being so strict. In the case of a journal that has been imported and modified, being able to still flush out duplicates even if comments have been added or categories tweaked would be nice.

As it stands I'm a little unsure about what use case it does currently work for.

simonmichael · 2019-06-12T14:07:43Z

So I guess:

add --fields=FIELDS option for comparing a subset, using field names compatible with https://hledger.org/csv.html#field-list
compare all fields by default
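
As a rough mock-up of that interface, the option could default to every field and accept a comma-separated subset. The sketch below uses Python's argparse purely for illustration; the `ALL_FIELDS` tuple and the `parse_fields` helper are assumptions, and the real field names would follow the hledger CSV field list linked above.

```python
import argparse

# Hypothetical field names for illustration only.
ALL_FIELDS = ("date", "code", "description", "comment", "amount")


def parse_fields(value):
    """Turn 'date,code' into ['date', 'code'], rejecting unknown or empty names."""
    names = [name.strip() for name in value.split(",") if name.strip()]
    unknown = [name for name in names if name not in ALL_FIELDS]
    if not names or unknown:
        raise argparse.ArgumentTypeError(
            "expected a comma-separated subset of: " + ", ".join(ALL_FIELDS)
        )
    return names


parser = argparse.ArgumentParser(prog="print-unique-sketch")
parser.add_argument(
    "--fields",
    type=parse_fields,
    default=list(ALL_FIELDS),  # compare all fields when the option is omitted
    metavar="FIELDS",
    help="comma-separated fields to compare (default: all)",
)

default_args = parser.parse_args([])
subset_args = parser.parse_args(["--fields=date,code,description"])

print(default_args.fields)
print(subset_args.fields)
```

Omitting the option compares everything, which covers the strict case, while a subset such as date,code,description would still catch duplicates after comments or categories have been edited.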

alerque added the A-WISH label (Some kind of improvement request, hare-brained proposal, or plea.) Jun 11, 2019

simonmichael added the print-unique label Jun 11, 2019

simonmichael removed the help wanted label Mar 7, 2020
