need to read scientific notation numbers in CSV #704

Closed
simonmichael opened this Issue Feb 5, 2018 · 8 comments

Owner

simonmichael commented Feb 5, 2018

Coinbase uses scientific notation for very small amounts, which breaks our CSV reader. I think it makes sense for us to add support for this when reading CSV. Would anyone like to take a crack at it?

2018-01-01 20:28:12 -0800,3.0e-08,...

Contributor

flip111 commented Feb 9, 2018

Is there a recommended library to use?

Owner

simonmichael commented Feb 9, 2018

Haskell can read it natively, I believe. We'll have to handle read errors gracefully, still allow the use of e or E as a commodity symbol (I do that), etc. numberp in Hledger.Read.Common seems like the right neighbourhood.
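
For illustration only, here is a small Python sketch of what reading such a field while handling read errors gracefully could look like. It is not hledger's Haskell code, and the helper name `read_csv_amount` is made up for this example. It converts a CSV field such as the Coinbase value above to an exact decimal and returns None on malformed input instead of raising.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def read_csv_amount(field: str) -> Optional[Decimal]:
    """Hypothetical helper: parse a CSV amount field, with or without an exponent."""
    try:
        value = Decimal(field.strip())
    except InvalidOperation:
        # Malformed input: report "no amount" rather than crashing the reader.
        return None
    # Decimal also accepts "NaN" and "Infinity"; those are not usable amounts.
    return value if value.is_finite() else None


if __name__ == "__main__":
    for field in ("3.0e-08", "12.50", "abc"):
        print(repr(field), "->", read_csv_amount(field))
```

Using an exact decimal type rather than a binary float avoids rounding surprises with amounts this small.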

Collaborator

ony commented Feb 9, 2018

One benefit of scientific format is that it can easily be combined with the generic format.
Right now we use amountp to parse values from both CSV and journal.

As far as I know, there is no format that allows you to write two numbers without @ or @@ between them, i.e. 10e-5 is currently invalid. So in numberp we can backtrack when e is not followed by a digit, and handle it as a commodity at a higher level.

The real problem is how to treat style in this case. E.g. is 1,000.5e-1 the same as 1,00.05?
I believe we should drop the digit group style information completely and treat it as 100.05.

A less important question is how strict we want to be about scientific format. E.g. should we treat 1,000.5e-1 as a valid number at all?
According to the wiki, digit group delimiters are allowed. Moreover, they may appear after the decimal separator (1.00,05e0), which completely screws up our heuristic for identifying which is the group separator and which is the decimal one.
If we forbid digit group separators, we avoid the problem with style. We can justify this decision by the fact that this isn't a scientific tool; we just need a way to read machine-generated output, which by default is often in scientific or engineering notation.
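
The backtracking idea, combined with forbidding digit group separators, can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the numberp parser itself: the names `NUMBER` and `split_number` are hypothetical, and allowing an optional sign before the exponent digits is an assumption added so that 10e-5 is accepted.

```python
import re
from decimal import Decimal
from typing import Tuple

# Mantissa without digit group separators, then an optional exponent.
# The exponent part only matches when e/E is followed by an optional sign
# and at least one digit; otherwise the regex backtracks and leaves the
# "e" in the input for a higher level to treat as a commodity symbol.
NUMBER = re.compile(r"[+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?")


def split_number(text: str) -> Tuple[Decimal, str]:
    """Hypothetical helper: return (leading number, unparsed rest of input)."""
    match = NUMBER.match(text)
    if match is None:
        raise ValueError(f"no number at start of {text!r}")
    return Decimal(match.group()), text[match.end():]


if __name__ == "__main__":
    for text in ("10e-5", "10e", "3.0e-08", "1,000.5e-1"):
        print(repr(text), "->", split_number(text))
```

With this sketch a comma is never part of a number, so an input like 1,000.5e-1 leaves unparsed text behind that a caller could report as an error.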

Owner

simonmichael commented Feb 9, 2018

All good points. I need it only for CSV right now. Limiting scope that way could possibly keep this a small task. I have not seen anybody needing this in journal format yet, though I do like the idea of allowing it there just for generality, assuming we find solutions to all the horrible issues you raise. :)

Owner

simonmichael commented Mar 26, 2018

Now I see the commit comments were showing up on PR #706. Never mind, I'll continue the main discussion here on the original issue #704.

I guess next steps are to merge #706, add our scientific notation to the journal > Amounts doc, and make sure we do warn if it looks like a digit group separator is used in scientific notation.

Owner

simonmichael commented Mar 31, 2018

Merged. For the record (and noted in scientific.test): in E-notation, a number containing only a digit group separator is indeed parsed wrongly, with the separator treated as a decimal point. E.g., assuming you use comma as thousands separator and period as decimal point:

2018/1/1
  (a)   $1,000e1   ; looks like $10000, but parsed as $10.

If you declare the thousands separator with a commodity directive, you get a useful (though unclear) parse error instead:

commodity $1,000.00
2018/1/1
  (a)   $1,000e1   ; gives a parse error

But setting the thousands separator implicitly by the first amount does not give the useful error:

2018/1/1
  (a)   $1,000.00  ; implicitly declare comma as thousands separator
2018/1/1
  (a)   $1,000e3  ; should be parsed as $1000000, is parsed as $1000

I've mentioned it in docs.
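
The arithmetic behind these two misparses can be checked with a short Python sketch. It only reproduces the two readings of the comma and says nothing about how hledger's parser is implemented; the helper name `parse_e_notation` is hypothetical.

```python
from decimal import Decimal


def parse_e_notation(text: str, comma_is_decimal_point: bool) -> Decimal:
    """Hypothetical helper: read the comma as a decimal point or as a group separator."""
    replacement = "." if comma_is_decimal_point else ""
    return Decimal(text.replace(",", replacement))


if __name__ == "__main__":
    for text in ("1,000e1", "1,000e3"):
        as_decimal_point = parse_e_notation(text, comma_is_decimal_point=True)
        as_group_separator = parse_e_notation(text, comma_is_decimal_point=False)
        print(text,
              "| comma as decimal point:", format(as_decimal_point, "f"),
              "| comma as group separator:", format(as_group_separator, "f"))
```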
