regression v1 v2 by id2 id4: results has changed in 0.6.2 #357

Closed
jangorecki opened this issue Feb 14, 2021 · 22 comments

@jangorecki

jangorecki commented Feb 14, 2021

Hi Ritchie,
I noticed that the results of one query, for one data case only, have changed in a recent version:

   version   chk
1:   0.4.3 0.459
2:   0.4.4 0.459
3:   0.4.5 0.459
4:   0.6.2 0.928

This might be due to a bug fix, meaning that the previously computed results were incorrect.
However, compared with other solutions, the new result still looks incorrect:

      solution         chk
1:  data.table    1.028772
2: pydatatable       1.029
3:       dplyr    1.028772
4:       spark       1.029
5:     juliadf       1.029
6:      polars 0.459;0.928

Could you please have a look at this question to check whether it should produce results matching the other tools?

ans = x.groupby(["id2","id4"]).agg(pl.pearson_corr("v1","v2").alias("r2")).with_column(col("r2")**2).collect()

Note that I am looking at the chk value, which is ans["r2"].sum(), so it is not the result itself but a "signature" of the result. This has been observed only on the G1_1e8_1e2_5_0 data case.
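To make the query concrete, here is a minimal pure-Python sketch of what it is expected to compute: the squared Pearson correlation of v1 and v2 within each (id2, id4) group, with chk as the sum of those values. The helper names and the toy rows are hypothetical, and this is not how polars or the benchmark implements it.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def r2_by_group(rows):
    """rows: iterable of (id2, id4, v1, v2). Returns {(id2, id4): r**2}."""
    groups = defaultdict(lambda: ([], []))
    for id2, id4, v1, v2 in rows:
        groups[(id2, id4)][0].append(v1)
        groups[(id2, id4)][1].append(v2)
    return {key: pearson_r(v1s, v2s) ** 2 for key, (v1s, v2s) in groups.items()}


# Hypothetical toy data, not taken from the benchmark.
rows = [
    ("a", 1, 1, 2), ("a", 1, 2, 4), ("a", 1, 3, 6),  # v2 = 2 * v1, so r**2 is 1
    ("b", 2, 1, 3), ("b", 2, 2, 1), ("b", 2, 3, 3),  # no linear relation, r**2 is 0
]
r2 = r2_by_group(rows)
chk = sum(r2.values())  # the "signature" compared across solutions
```

Because chk collapses every group into one number, it can flag a mismatch but cannot say which group is responsible.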

@ritchie46
Member

I believe I've warned future me about this. https://github.com/ritchie46/polars/blob/65184670c07efc6a5b891bd6cbf24e03a18187ed/polars/polars-core/src/functions.rs#L5 😅

Thanks for reporting, I will look into this.

@jangorecki
Author

I am now checking the chk of other questions and see that "median v3 sd v3 by id4 id5" does not match the other solutions well.

@ritchie46
Member

ritchie46 commented Feb 15, 2021

I see. It starts to make sense now. The summations don't do checked addition. For now the quickest solution would be changing the type of the data we read.

x = pl.read_csv(src_grp, dtype={"id4":pl.Int32, "id5":pl.Int32, "id6":pl.Int32, "v1":pl.Int32, "v2":pl.Int32, "v3":pl.Float64})

v1 and v2 are currently read as signed 32-bit integers, so overflow happens easily. Shall I propose a PR that changes v1 and v2 to Int64?
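As a rough illustration of why an unchecked int32 sum goes wrong, the sketch below simulates two's-complement wraparound in pure Python and compares it with a sum taken in a wider type. The function names and input values are hypothetical. The simulation assumes the addition wraps silently, which is what "no checked addition" suggests but is not confirmed here.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1  # 2147483647


def wrap_i32(value):
    """Reduce an integer to the signed 32-bit range, as silent wraparound would."""
    return (value - INT32_MIN) % 2**32 + INT32_MIN


def sum_unchecked_i32(values):
    """Sum with the accumulator wrapped to int32 after every addition."""
    total = 0
    for v in values:
        total = wrap_i32(total + v)
    return total


def sum_widened(values):
    """Sum with a wide accumulator (Python ints do not overflow)."""
    return sum(values)


# Hypothetical partial sums whose total exceeds INT32_MAX.
partials = [1_000_000_000, 1_000_000_000, 1_000_000_000]
wrapped = sum_unchecked_i32(partials)  # -1294967296
exact = sum_widened(partials)          # 3000000000
```

Each input fits comfortably in int32. Only the running total leaves the range, so widening just the accumulator, or casting right before the sum, is enough to get the exact value.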

Then there are still some NaNs that I should figure out on q9 and q10.

@jangorecki
Author

q10 is OOM-killed; I don't think it has anything to do with NAs.
As for the overflow, I would keep int32, because overflow happens only when computing chk, which is not really the subject of the questions the benchmark answers. What we do for other tools is convert just before the sum when computing chk. I would appreciate it if you could submit a PR for exactly that. The benchmark queries can then still compute on int32, which should be faster than int64.

@ritchie46
Member

@jangorecki, do you know if this is still relevant now that the overflow has been fixed?

@jangorecki
Author

The overflow was fixed. Mismatched checksums are still an issue.
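Since the chk values in this thread are semicolon-separated strings printed with different rounding by each tool, a tolerant comparison helps separate real mismatches from formatting differences. The following is a minimal sketch. The function names and the tolerances are assumptions chosen for illustration and are not part of the benchmark.

```python
from math import isclose


def parse_chk(chk):
    """Split a chk string such as "29998789;4999720" into floats.

    Returns None when any part is "NA" (no result available).
    """
    parts = chk.split(";")
    if "NA" in parts:
        return None
    return [float(part) for part in parts]


def chk_matches(expected, actual, rel_tol=1e-6, abs_tol=5e-4):
    """True if both chk strings hold the same numbers within tolerance.

    abs_tol=5e-4 is an assumed allowance for values printed with 3 decimals.
    """
    a = parse_chk(expected)
    b = parse_chk(actual)
    if a is None or b is None or len(a) != len(b):
        return False
    return all(isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol) for x, y in zip(a, b))


same = chk_matches("9.838641", "9.839")  # differs only by rounding
different = chk_matches("299.988;799.8942;4999.767", "251;748;4950")
```

An absolute tolerance this loose cannot distinguish a very small true value from 0, so checksums that are themselves tiny still need a manual look.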

@jangorecki
Author

Updated comparison of all checks using polars 0.7.9:

|task    |data           |question                    |data.table                        |polars                             |
|:-------|:--------------|:---------------------------|:---------------------------------|:----------------------------------|
|groupby |G1_1e7_1e2_0_0 |sum v1 by id1               |29998789                          |29998789                           |
|groupby |G1_1e7_1e2_0_0 |sum v1 by id1:id2           |29998789                          |29998789                           |
|groupby |G1_1e7_1e2_0_0 |sum v1 mean v3 by id3       |29998789;4999720                  |29998789;4949647                   |
|groupby |G1_1e7_1e2_0_0 |mean v1:v3 by id4           |299.988;799.8942;4999.767         |251;748;4950                       |
|groupby |G1_1e7_1e2_0_0 |sum v1:v3 by id6            |29998789;79989360;499976651       |29998789;499926677;499926677       |
|groupby |G1_1e7_1e2_0_0 |median v3 sd v3 by id4 id5  |499920.1;288648.1                 |494669;283513                      |
|groupby |G1_1e7_1e2_0_0 |max v1 - min v2 by id3      |399882                            |399882                             |
|groupby |G1_1e7_1e2_0_0 |largest two v3 by id6       |19700451                          |19592028                           |
|groupby |G1_1e7_1e2_0_0 |regression v1 v2 by id2 id4 |9.838641                          |0                                  |
|groupby |G1_1e7_1e2_0_0 |sum v3 count by id1:id6     |499976651;10000000                |494976706;10000000                 |
|groupby |G1_1e7_1e1_0_0 |sum v1 by id1               |29998597                          |29998597                           |
|groupby |G1_1e7_1e1_0_0 |sum v1 by id1:id2           |29998597                          |29998597                           |
|groupby |G1_1e7_1e1_0_0 |sum v1 mean v3 by id3       |29998597;50000559                 |29998597;49500460                  |
|groupby |G1_1e7_1e1_0_0 |mean v1:v3 by id4           |29.9986;79.99191;499.9807         |25;74;494                          |
|groupby |G1_1e7_1e1_0_0 |sum v1:v3 by id6            |29998597;79991898;499980747       |29998597;499480612;499480612       |
|groupby |G1_1e7_1e1_0_0 |median v3 sd v3 by id4 id5  |4999.573;2887.162                 |4950;2800                          |
|groupby |G1_1e7_1e1_0_0 |max v1 - min v2 by id3      |2789316                           |2789316                            |
|groupby |G1_1e7_1e1_0_0 |largest two v3 by id6       |170016563                         |169008432                          |
|groupby |G1_1e7_1e1_0_0 |regression v1 v2 by id2 id4 |0.001036507                       |0                                  |
|groupby |G1_1e7_1e1_0_0 |sum v3 count by id1:id6     |499980747;10000000                |494980635;10000000                 |
|groupby |G1_1e7_2e0_0_0 |sum v1 by id1               |30000054                          |30000054                           |
|groupby |G1_1e7_2e0_0_0 |sum v1 by id1:id2           |30000054                          |30000054                           |
|groupby |G1_1e7_2e0_0_0 |sum v1 mean v3 by id3       |30000054;216107547                |30000054;213946287                 |
|groupby |G1_1e7_2e0_0_0 |mean v1:v3 by id4           |6.000011;15.99728;99.9872         |5;14;99                            |
|groupby |G1_1e7_2e0_0_0 |sum v1:v3 by id6            |30000054;79986418;499936032       |30000054;497775009;497775009       |
|groupby |G1_1e7_2e0_0_0 |median v3 sd v3 by id4 id5  |199.9701;115.4886                 |198;112                            |
|groupby |G1_1e7_2e0_0_0 |max v1 - min v2 by id3      |-8263086                          |-8263086                           |
|groupby |G1_1e7_2e0_0_0 |largest two v3 by id6       |419079607                         |415428636                          |
|groupby |G1_1e7_2e0_0_0 |regression v1 v2 by id2 id4 |0.000001462804                    |0                                  |
|groupby |G1_1e7_2e0_0_0 |sum v3 count by id1:id6     |499936032;10000000                |494936026;10000000                 |
|groupby |G1_1e7_1e2_0_1 |sum v1 by id1               |29998789                          |29998789                           |
|groupby |G1_1e7_1e2_0_1 |sum v1 by id1:id2           |29998789                          |29998789                           |
|groupby |G1_1e7_1e2_0_1 |sum v1 mean v3 by id3       |29998789;4999720                  |29998789;4949647                   |
|groupby |G1_1e7_1e2_0_1 |mean v1:v3 by id4           |299.988;799.8942;4999.767         |251;748;4950                       |
|groupby |G1_1e7_1e2_0_1 |sum v1:v3 by id6            |29998789;79989360;499976651       |29998789;499926677;499926677       |
|groupby |G1_1e7_1e2_0_1 |median v3 sd v3 by id4 id5  |499920.1;288648.1                 |494669;283513                      |
|groupby |G1_1e7_1e2_0_1 |max v1 - min v2 by id3      |399882                            |399882                             |
|groupby |G1_1e7_1e2_0_1 |largest two v3 by id6       |19700451                          |19592028                           |
|groupby |G1_1e7_1e2_0_1 |regression v1 v2 by id2 id4 |9.838641                          |0                                  |
|groupby |G1_1e7_1e2_0_1 |sum v3 count by id1:id6     |499976651;10000000                |494976706;10000000                 |
|groupby |G1_1e7_1e2_5_0 |sum v1 by id1               |28498857                          |0                                  |
|groupby |G1_1e7_1e2_5_0 |sum v1 by id1:id2           |28498857                          |0                                  |
|groupby |G1_1e7_1e2_5_0 |sum v1 mean v3 by id3       |28498857;4749468                  |0;0                                |
|groupby |G1_1e7_1e2_5_0 |mean v1:v3 by id4           |287.9894;767.8529;4799.873        |0;0;0                              |
|groupby |G1_1e7_1e2_5_0 |sum v1:v3 by id6            |28498857;75988394;474969574       |0;0;0                              |
|groupby |G1_1e7_1e2_5_0 |median v3 sd v3 by id4 id5  |460771.2;266006.9                 |431747;254775                      |
|groupby |G1_1e7_1e2_5_0 |max v1 - min v2 by id3      |379850                            |95001                              |
|groupby |G1_1e7_1e2_5_0 |largest two v3 by id6       |18700555                          |18597955                           |
|groupby |G1_1e7_1e2_5_0 |regression v1 v2 by id2 id4 |9.940516                          |0                                  |
|groupby |G1_1e7_1e2_5_0 |sum v3 count by id1:id6     |474969574;10000000                |470218869;10000000                 |
|groupby |G1_1e8_1e2_0_0 |sum v1 by id1               |299991302                         |299991302                          |
|groupby |G1_1e8_1e2_0_0 |sum v1 by id1:id2           |299991302                         |299991302                          |
|groupby |G1_1e8_1e2_0_0 |sum v1 mean v3 by id3       |299991302;50001192                |299991302;49501126                 |
|groupby |G1_1e8_1e2_0_0 |mean v1:v3 by id4           |299.9913;799.9782;5000.104        |253;746;4951                       |
|groupby |G1_1e8_1e2_0_0 |sum v1:v3 by id6            |299991302;799978221;5000103938    |299991302;4999603563;4999603563    |
|groupby |G1_1e8_1e2_0_0 |median v3 sd v3 by id4 id5  |500020;288668.4                   |495040;281508                      |
|groupby |G1_1e8_1e2_0_0 |max v1 - min v2 by id3      |3998729                           |3998729                            |
|groupby |G1_1e8_1e2_0_0 |largest two v3 by id6       |196996660                         |195912715                          |
|groupby |G1_1e8_1e2_0_0 |regression v1 v2 by id2 id4 |1.006723                          |0                                  |
|groupby |G1_1e8_1e2_0_0 |sum v3 count by id1:id6     |5000103938;100000000              |4950105511;100000000               |
|groupby |G1_1e8_1e1_0_0 |sum v1 by id1               |300012466                         |300012466                          |
|groupby |G1_1e8_1e1_0_0 |sum v1 by id1:id2           |300012466                         |300012466                          |
|groupby |G1_1e8_1e1_0_0 |sum v1 mean v3 by id3       |300012466;499941401               |300012466;494943306                |
|groupby |G1_1e8_1e1_0_0 |mean v1:v3 by id4           |30.00125;80.00796;499.9575        |26;77;493                          |
|groupby |G1_1e8_1e1_0_0 |sum v1:v3 by id6            |300012466;800079612;4999575436    |300012466;4994576000;4994576000    |
|groupby |G1_1e8_1e1_0_0 |median v3 sd v3 by id4 id5  |4999.826;2886.819                 |4953;2800                          |
|groupby |G1_1e8_1e1_0_0 |max v1 - min v2 by id3      |27890093                          |27890093                           |
|groupby |G1_1e8_1e1_0_0 |largest two v3 by id6       |1700010092                        |1689930387                         |
|groupby |G1_1e8_1e1_0_0 |regression v1 v2 by id2 id4 |0.000091405                       |0                                  |
|groupby |G1_1e8_1e1_0_0 |sum v3 count by id1:id6     |4999575436;100000000              |4949574169;100000000               |
|groupby |G1_1e8_2e0_0_0 |sum v1 by id1               |299988126                         |299988126                          |
|groupby |G1_1e8_2e0_0_0 |sum v1 by id1:id2           |299988126                         |299988126                          |
|groupby |G1_1e8_2e0_0_0 |sum v1 mean v3 by id3       |299988126;2161776167              |299988126;2140160588               |
|groupby |G1_1e8_2e0_0_0 |mean v1:v3 by id4           |5.999763;15.99904;100.001         |4;14;99                            |
|groupby |G1_1e8_2e0_0_0 |sum v1:v3 by id6            |299988126;799952220;5000051370    |299988126;4978435203;4978435203    |
|groupby |G1_1e8_2e0_0_0 |median v3 sd v3 by id4 id5  |199.9981;115.4678                 |197;112                            |
|groupby |G1_1e8_2e0_0_0 |max v1 - min v2 by id3      |-82715914                         |-82715914                          |
|groupby |G1_1e8_2e0_0_0 |largest two v3 by id6       |4191769306                        |4155252005                         |
|groupby |G1_1e8_2e0_0_0 |regression v1 v2 by id2 id4 |0.0000002443467                   |0                                  |
|groupby |G1_1e8_2e0_0_0 |sum v3 count by id1:id6     |5000051370;100000000              |4950052872;100000000               |
|groupby |G1_1e8_1e2_0_1 |sum v1 by id1               |299991302                         |299991302                          |
|groupby |G1_1e8_1e2_0_1 |sum v1 by id1:id2           |299991302                         |299991302                          |
|groupby |G1_1e8_1e2_0_1 |sum v1 mean v3 by id3       |299991302;50001192                |299991302;49501126                 |
|groupby |G1_1e8_1e2_0_1 |mean v1:v3 by id4           |299.9913;799.9782;5000.104        |253;746;4951                       |
|groupby |G1_1e8_1e2_0_1 |sum v1:v3 by id6            |299991302;799978221;5000103938    |299991302;4999603563;4999603563    |
|groupby |G1_1e8_1e2_0_1 |median v3 sd v3 by id4 id5  |500020;288668.4                   |495040;281508                      |
|groupby |G1_1e8_1e2_0_1 |max v1 - min v2 by id3      |3998729                           |3998729                            |
|groupby |G1_1e8_1e2_0_1 |largest two v3 by id6       |196996660                         |195912715                          |
|groupby |G1_1e8_1e2_0_1 |regression v1 v2 by id2 id4 |1.006723                          |0                                  |
|groupby |G1_1e8_1e2_0_1 |sum v3 count by id1:id6     |5000103938;100000000              |4950105511;100000000               |
|groupby |G1_1e8_1e2_5_0 |sum v1 by id1               |284994735                         |0                                  |
|groupby |G1_1e8_1e2_5_0 |sum v1 by id1:id2           |284994735                         |0                                  |
|groupby |G1_1e8_1e2_5_0 |sum v1 mean v3 by id3       |284994735;47500173                |0;0                                |
|groupby |G1_1e8_1e2_5_0 |mean v1:v3 by id4           |287.9924;767.9688;4799.99         |0;0;0                              |
|groupby |G1_1e8_1e2_5_0 |sum v1:v3 by id6            |284994735;759971497;4750083909    |0;0;0                              |
|groupby |G1_1e8_1e2_5_0 |median v3 sd v3 by id4 id5  |460792.4;266033.6                 |431982;256639                      |
|groupby |G1_1e8_1e2_5_0 |max v1 - min v2 by id3      |3798317                           |950001                             |
|groupby |G1_1e8_1e2_5_0 |largest two v3 by id6       |186996834                         |185970956                          |
|groupby |G1_1e8_1e2_5_0 |regression v1 v2 by id2 id4 |1.028772                          |0                                  |
|groupby |G1_1e8_1e2_5_0 |sum v3 count by id1:id6     |4750083909;100000000              |4702522476;100000000               |
|groupby |G1_1e9_1e2_0_0 |sum v1 by id1               |2999924714                        |2999924714                         |
|groupby |G1_1e9_1e2_0_0 |sum v1 by id1:id2           |2999924714                        |2999924714                         |
|groupby |G1_1e9_1e2_0_0 |sum v1 mean v3 by id3       |2999924714;499986250              |2999924714;494985204               |
|groupby |G1_1e9_1e2_0_0 |mean v1:v3 by id4           |299.9925;799.9993;4999.87         |244;750;4946                       |
|groupby |G1_1e9_1e2_0_0 |sum v1:v3 by id6            |2999924714;7999992854;49998699478 |2999924714;49993698993;49993698993 |
|groupby |G1_1e9_1e2_0_0 |median v3 sd v3 by id4 id5  |499981.8;288669.2                 |494951;280003                      |
|groupby |G1_1e9_1e2_0_0 |max v1 - min v2 by id3      |39987226                          |39987226                           |
|groupby |G1_1e9_1e2_0_0 |largest two v3 by id6       |1970001790                        |1959152876                         |
|groupby |G1_1e9_1e2_0_0 |regression v1 v2 by id2 id4 |0.09821376                        |0                                  |
|groupby |G1_1e9_1e2_0_0 |sum v3 count by id1:id6     |49998699478;1000000000            |NA                                 |
|groupby |G1_1e9_1e1_0_0 |sum v1 by id1               |2999933732                        |2999933732                         |
|groupby |G1_1e9_1e1_0_0 |sum v1 by id1:id2           |2999933732                        |2999933732                         |
|groupby |G1_1e9_1e1_0_0 |sum v1 mean v3 by id3       |2999933732;4999733095             |2999933732;4949735505              |
|groupby |G1_1e9_1e1_0_0 |mean v1:v3 by id4           |29.99934;79.99944;499.9972        |24;73;495                          |
|groupby |G1_1e9_1e1_0_0 |sum v1:v3 by id6            |2999933732;7999944432;49999721740 |2999933732;49949722457;49949722457 |
|groupby |G1_1e9_1e1_0_0 |median v3 sd v3 by id4 id5  |4999.948;2886.7                   |4952;2800                          |
|groupby |G1_1e9_1e1_0_0 |max v1 - min v2 by id3      |278916351                         |NA                                 |
|groupby |G1_1e9_1e1_0_0 |largest two v3 by id6       |17000225999                       |NA                                 |
|groupby |G1_1e9_1e1_0_0 |regression v1 v2 by id2 id4 |0.000008661068                    |NA                                 |
|groupby |G1_1e9_1e1_0_0 |sum v3 count by id1:id6     |49999721740;1000000000            |NA                                 |
|groupby |G1_1e9_2e0_0_0 |sum v1 by id1               |2999997259                        |2999997259                         |
|groupby |G1_1e9_2e0_0_0 |sum v1 by id1:id2           |2999997259                        |2999997259                         |
|groupby |G1_1e9_2e0_0_0 |sum v1 mean v3 by id3       |2999997259;21616611234            |NA                                 |
|groupby |G1_1e9_1e2_0_1 |sum v1 by id1               |2999924714                        |2999924714                         |
|groupby |G1_1e9_1e2_0_1 |sum v1 by id1:id2           |2999924714                        |2999924714                         |
|groupby |G1_1e9_1e2_0_1 |sum v1 mean v3 by id3       |2999924714;499986250              |2999924714;494985204               |
|groupby |G1_1e9_1e2_0_1 |mean v1:v3 by id4           |299.9925;799.9993;4999.87         |244;750;4946                       |
|groupby |G1_1e9_1e2_0_1 |sum v1:v3 by id6            |2999924714;7999992854;49998699478 |2999924714;49993698992;49993698992 |
|groupby |G1_1e9_1e2_0_1 |median v3 sd v3 by id4 id5  |499981.8;288669.2                 |494951;280003                      |
|groupby |G1_1e9_1e2_0_1 |max v1 - min v2 by id3      |39987226                          |39987226                           |
|groupby |G1_1e9_1e2_0_1 |largest two v3 by id6       |1970001790                        |1959152876                         |
|groupby |G1_1e9_1e2_0_1 |regression v1 v2 by id2 id4 |0.09821376                        |0                                  |
|groupby |G1_1e9_1e2_0_1 |sum v3 count by id1:id6     |49998699478;1000000000            |NA                                 |
|groupby |G1_1e9_1e2_5_0 |sum v1 by id1               |2849922064                        |0                                  |
|groupby |G1_1e9_1e2_5_0 |sum v1 by id1:id2           |2849922064                        |0                                  |
|groupby |G1_1e9_1e2_5_0 |sum v1 mean v3 by id3       |2849922064;474988853              |0;0                                |
|groupby |G1_1e9_1e2_5_0 |mean v1:v3 by id4           |287.9927;768.0034;4799.897        |0;0;0                              |
|groupby |G1_1e9_1e2_5_0 |sum v1:v3 by id6            |2849922064;7600000111;47498842806 |0;0;0                              |
|groupby |G1_1e9_1e2_5_0 |median v3 sd v3 by id4 id5  |460786.5;266038.4                 |433052;258042                      |
|groupby |G1_1e9_1e2_5_0 |max v1 - min v2 by id3      |37982992                          |9500001                            |
|groupby |G1_1e9_1e2_5_0 |largest two v3 by id6       |1870003947                        |1859739170                         |
|groupby |G1_1e9_1e2_5_0 |regression v1 v2 by id2 id4 |0.09857194                        |0                                  |
|groupby |G1_1e9_1e2_5_0 |sum v3 count by id1:id6     |47498842806;1000000000            |NA                                 |
|join    |J1_1e7_NA_0_0  |small inner on int          |450015154;347720187               |445515254;344142538                |
|join    |J1_1e7_NA_0_0  |medium inner on int         |449954076;449999845               |445453921;445532818                |
|join    |J1_1e7_NA_0_0  |medium outer on int         |500043741;449999845               |495043254;445532818                |
|join    |J1_1e7_NA_0_0  |medium inner on factor      |449954076;449999845               |445453921;445532818                |
|join    |J1_1e7_NA_0_0  |big inner on int            |450032092;449860429               |445531196;445360772                |
|join    |J1_1e7_NA_5_0  |small inner on int          |427503549;436095569               |423228741;431973772                |
|join    |J1_1e7_NA_5_0  |medium inner on int         |406023280;423957579               |401962180;419673295                |
|join    |J1_1e7_NA_5_0  |medium outer on int         |475042481;423957579               |470291959;419673295                |
|join    |J1_1e7_NA_5_0  |medium inner on factor      |406023280;423957579               |401962180;419673295                |
|join    |J1_1e7_NA_5_0  |big inner on int            |406160610;427451488               |402098520;423176680                |
|join    |J1_1e7_NA_0_1  |small inner on int          |449966347;386180314               |445466355;383105420                |
|join    |J1_1e7_NA_0_1  |medium inner on int         |449944125;448928500               |445443871;444451102                |
|join    |J1_1e7_NA_0_1  |medium outer on int         |500043741;448928500               |495043254;444451102                |
|join    |J1_1e7_NA_0_1  |medium inner on factor      |449944125;448928500               |445443871;444451102                |
|join    |J1_1e7_NA_0_1  |big inner on int            |450020346;449938346               |445520098;445438761                |
|join    |J1_1e8_NA_0_0  |small inner on int          |4499430832;4388703871             |4454431863;4341563861              |
|join    |J1_1e8_NA_0_0  |medium inner on int         |4499423746;4507751463             |4454426162;4462680072              |
|join    |J1_1e8_NA_0_0  |medium outer on int         |4999542478;4507751463             |4949541289;4462680072              |
|join    |J1_1e8_NA_0_0  |medium inner on factor      |4499423746;4507751463             |4454426162;4462680072              |
|join    |J1_1e8_NA_0_0  |big inner on int            |4499590098;4499913694             |4454588566;4454911810              |
|join    |J1_1e8_NA_5_0  |small inner on int          |4084298007;4658773531             |4043448078;4613682585              |
|join    |J1_1e8_NA_5_0  |medium inner on int         |4061304227;4268317288             |4020688050;4225492467              |
|join    |J1_1e8_NA_5_0  |medium outer on int         |4749474734;4268317288             |4701973161;4225492467              |
|join    |J1_1e8_NA_5_0  |medium inner on factor      |4061304227;4268317288             |4020688050;4225492467              |
|join    |J1_1e8_NA_5_0  |big inner on int            |4060971185;4275319617             |4020354380;4232568111              |
|join    |J1_1e8_NA_0_1  |small inner on int          |4499308287;3953465650             |4454308625;3906715732              |
|join    |J1_1e8_NA_0_1  |medium inner on int         |4499224468;4506958891             |4454225640;4461896811              |
|join    |J1_1e8_NA_0_1  |medium outer on int         |4999542478;4506958891             |4949541289;4461896811              |
|join    |J1_1e8_NA_0_1  |medium inner on factor      |4499224468;4506958891             |4454225640;4461896811              |
|join    |J1_1e8_NA_0_1  |big inner on int            |4499618843;4499951833             |4454616638;4454949513              |

@jangorecki
Author

jangorecki commented Apr 30, 2021

I just spotted that casts to Int64 have been applied too aggressively in h2oai/db-benchmark@c7ad6b0#diff-b9f18f8b66c6e7e35cdae2fc80bb752351481552f8f4004e1e311fd92a77fb0d
v3 is float64, so its results should not be cast to int; I will fix that.
Additionally, the mean of v1 and v2 is already floating point, so casting is not needed there either.
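The effect of such a cast on a checksum can be sketched in a few lines. The example assumes the cast truncates toward zero, as Python's int() does; whether the benchmark's cast behaves identically is an assumption. The function names and values are hypothetical.

```python
def chk_float(values):
    """Checksum of a floating-point column: plain sum."""
    return sum(values)


def chk_cast_to_int(values):
    """Checksum after an unneeded integer cast, which drops every fractional part."""
    return sum(int(v) for v in values)


# Hypothetical per-group means; all are exactly representable in binary.
group_means = [49.25, 50.5, 51.75, 48.5]
exact = chk_float(group_means)            # 200.0
truncated = chk_cast_to_int(group_means)  # 198
```

The loss grows with the number of groups, because every group contributes its own dropped fraction. That is consistent with the polars values for v3 in the 0.7.9 table being lower than the data.table ones.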

@ritchie46
Member

So the checksum differences seem to come mostly from v3, and may be due to this unneeded cast operation and the regression question that is the subject of this thread?

I think I've fixed the regression behavior, but the fix is not yet released. I plan to release it this afternoon, together with a fix for the O(n^2) behavior in the csv-parser. Can you wait and run with the fixed version, so that the new regression behavior can also be checked?

@jangorecki
Author

I amended the types for the chk computation in h2oai/db-benchmark@3fc6e9f.
Please ping me when the new version lands on PyPI.

@ritchie46
Member

@jangorecki it is on pypi.

@jangorecki
Author

0.7.11

|task    |data           |question                    |data.table                        |polars                                 |
|:-------|:--------------|:---------------------------|:---------------------------------|:--------------------------------------|
|groupby |G1_1e7_1e2_0_0 |sum v1 by id1               |29998789                          |29998789                               |
|groupby |G1_1e7_1e2_0_0 |sum v1 by id1:id2           |29998789                          |29998789                               |
|groupby |G1_1e7_1e2_0_0 |sum v1 mean v3 by id3       |29998789;4999720                  |29998789.0;4999719.622                 |
|groupby |G1_1e7_1e2_0_0 |mean v1:v3 by id4           |299.988;799.8942;4999.767         |299.988;799.894;4999.767               |
|groupby |G1_1e7_1e2_0_0 |sum v1:v3 by id6            |29998789;79989360;499976651       |29998789.0;79989360.0;499976651.408    |
|groupby |G1_1e7_1e2_0_0 |median v3 sd v3 by id4 id5  |499920.1;288648.1                 |499920.14;288648.108                   |
|groupby |G1_1e7_1e2_0_0 |max v1 - min v2 by id3      |399882                            |399882                                 |
|groupby |G1_1e7_1e2_0_0 |largest two v3 by id6       |19700451                          |19700450.588                           |
|groupby |G1_1e7_1e2_0_0 |regression v1 v2 by id2 id4 |9.838641                          |9.839                                  |
|groupby |G1_1e7_1e2_0_0 |sum v3 count by id1:id6     |499976651;10000000                |499976651.408;10000000.0               |
|groupby |G1_1e7_1e1_0_0 |sum v1 by id1               |29998597                          |29998597                               |
|groupby |G1_1e7_1e1_0_0 |sum v1 by id1:id2           |29998597                          |29998597                               |
|groupby |G1_1e7_1e1_0_0 |sum v1 mean v3 by id3       |29998597;50000559                 |29998597.0;50000558.524                |
|groupby |G1_1e7_1e1_0_0 |mean v1:v3 by id4           |29.9986;79.99191;499.9807         |29.999;79.992;499.981                  |
|groupby |G1_1e7_1e1_0_0 |sum v1:v3 by id6            |29998597;79991898;499980747       |29998597.0;79991898.0;499980747.01     |
|groupby |G1_1e7_1e1_0_0 |median v3 sd v3 by id4 id5  |4999.573;2887.162                 |4999.573;2887.162                      |
|groupby |G1_1e7_1e1_0_0 |max v1 - min v2 by id3      |2789316                           |2789316                                |
|groupby |G1_1e7_1e1_0_0 |largest two v3 by id6       |170016563                         |170016562.642                          |
|groupby |G1_1e7_1e1_0_0 |regression v1 v2 by id2 id4 |0.001036507                       |0.001                                  |
|groupby |G1_1e7_1e1_0_0 |sum v3 count by id1:id6     |499980747;10000000                |499980747.01;10000000.0                |
|groupby |G1_1e7_2e0_0_0 |sum v1 by id1               |30000054                          |30000054                               |
|groupby |G1_1e7_2e0_0_0 |sum v1 by id1:id2           |30000054                          |30000054                               |
|groupby |G1_1e7_2e0_0_0 |sum v1 mean v3 by id3       |30000054;216107547                |30000054.0;216107547.389               |
|groupby |G1_1e7_2e0_0_0 |mean v1:v3 by id4           |6.000011;15.99728;99.9872         |6.0;15.997;99.987                      |
|groupby |G1_1e7_2e0_0_0 |sum v1:v3 by id6            |30000054;79986418;499936032       |30000054.0;79986418.0;499936032.106    |
|groupby |G1_1e7_2e0_0_0 |median v3 sd v3 by id4 id5  |199.9701;115.4886                 |199.97;115.489                         |
|groupby |G1_1e7_2e0_0_0 |max v1 - min v2 by id3      |-8263086                          |-8263086                               |
|groupby |G1_1e7_2e0_0_0 |largest two v3 by id6       |419079607                         |419079607.0                            |
|groupby |G1_1e7_2e0_0_0 |regression v1 v2 by id2 id4 |0.000001462804                    |0.0                                    |
|groupby |G1_1e7_2e0_0_0 |sum v3 count by id1:id6     |499936032;10000000                |499936032.106;10000000.0               |
|groupby |G1_1e7_1e2_0_1 |sum v1 by id1               |29998789                          |29998789                               |
|groupby |G1_1e7_1e2_0_1 |sum v1 by id1:id2           |29998789                          |29998789                               |
|groupby |G1_1e7_1e2_0_1 |sum v1 mean v3 by id3       |29998789;4999720                  |29998789.0;4999719.622                 |
|groupby |G1_1e7_1e2_0_1 |mean v1:v3 by id4           |299.988;799.8942;4999.767         |299.988;799.894;4999.767               |
|groupby |G1_1e7_1e2_0_1 |sum v1:v3 by id6            |29998789;79989360;499976651       |29998789.0;79989360.0;499976651.408    |
|groupby |G1_1e7_1e2_0_1 |median v3 sd v3 by id4 id5  |499920.1;288648.1                 |499920.14;288648.108                   |
|groupby |G1_1e7_1e2_0_1 |max v1 - min v2 by id3      |399882                            |399882                                 |
|groupby |G1_1e7_1e2_0_1 |largest two v3 by id6       |19700451                          |19700450.588                           |
|groupby |G1_1e7_1e2_0_1 |regression v1 v2 by id2 id4 |9.838641                          |9.839                                  |
|groupby |G1_1e7_1e2_0_1 |sum v3 count by id1:id6     |499976651;10000000                |499976651.408;10000000.0               |
|groupby |G1_1e7_1e2_5_0 |sum v1 by id1               |28498857                          |28498857                               |
|groupby |G1_1e7_1e2_5_0 |sum v1 by id1:id2           |28498857                          |28498857                               |
|groupby |G1_1e7_1e2_5_0 |sum v1 mean v3 by id3       |28498857;4749468                  |28498857.0;4511894.123                 |
|groupby |G1_1e7_1e2_5_0 |mean v1:v3 by id4           |287.9894;767.8529;4799.873        |273.584;729.454;4559.823               |
|groupby |G1_1e7_1e2_5_0 |sum v1:v3 by id6            |28498857;75988394;474969574       |28498857.0;75988394.0;474969574.048    |
|groupby |G1_1e7_1e2_5_0 |median v3 sd v3 by id4 id5  |460771.2;266006.9                 |460771.216;259264.174                  |
|groupby |G1_1e7_1e2_5_0 |max v1 - min v2 by id3      |379850                            |379850                                 |
|groupby |G1_1e7_1e2_5_0 |largest two v3 by id6       |18700555                          |18700554.78                            |
|groupby |G1_1e7_1e2_5_0 |regression v1 v2 by id2 id4 |9.940516                          |11.019                                 |
|groupby |G1_1e7_1e2_5_0 |sum v3 count by id1:id6     |474969574;10000000                |474969574.048;10000000.0               |
|groupby |G1_1e8_1e2_0_0 |sum v1 by id1               |299991302                         |299991302                              |
|groupby |G1_1e8_1e2_0_0 |sum v1 by id1:id2           |299991302                         |299991302                              |
|groupby |G1_1e8_1e2_0_0 |sum v1 mean v3 by id3       |299991302;50001192                |299991302.0;50001192.355               |
|groupby |G1_1e8_1e2_0_0 |mean v1:v3 by id4           |299.9913;799.9782;5000.104        |299.991;799.978;5000.104               |
|groupby |G1_1e8_1e2_0_0 |sum v1:v3 by id6            |299991302;799978221;5000103938    |299991302.0;799978221.0;5000103937.772 |
|groupby |G1_1e8_1e2_0_0 |median v3 sd v3 by id4 id5  |500020;288668.4                   |500019.998;288668.357                  |
|groupby |G1_1e8_1e2_0_0 |max v1 - min v2 by id3      |3998729                           |3998729                                |
|groupby |G1_1e8_1e2_0_0 |largest two v3 by id6       |196996660                         |196996660.391                          |
|groupby |G1_1e8_1e2_0_0 |regression v1 v2 by id2 id4 |1.006723                          |1.007                                  |
|groupby |G1_1e8_1e2_0_0 |sum v3 count by id1:id6     |5000103938;100000000              |5000103937.772;100000000.0             |
|groupby |G1_1e8_1e1_0_0 |sum v1 by id1               |300012466                         |300012466                              |
|groupby |G1_1e8_1e1_0_0 |sum v1 by id1:id2           |300012466                         |300012466                              |
|groupby |G1_1e8_1e1_0_0 |sum v1 mean v3 by id3       |300012466;499941401               |300012466.0;499941400.876              |
|groupby |G1_1e8_1e1_0_0 |mean v1:v3 by id4           |30.00125;80.00796;499.9575        |30.001;80.008;499.958                  |
|groupby |G1_1e8_1e1_0_0 |sum v1:v3 by id6            |300012466;800079612;4999575436    |300012466.0;800079612.0;4999575436.012 |
|groupby |G1_1e8_1e1_0_0 |median v3 sd v3 by id4 id5  |4999.826;2886.819                 |4999.826;2886.819                      |
|groupby |G1_1e8_1e1_0_0 |max v1 - min v2 by id3      |27890093                          |27890093                               |
|groupby |G1_1e8_1e1_0_0 |largest two v3 by id6       |1700010092                        |1700010092.167                         |
|groupby |G1_1e8_1e1_0_0 |regression v1 v2 by id2 id4 |0.000091405                       |0.0                                    |
|groupby |G1_1e8_1e1_0_0 |sum v3 count by id1:id6     |4999575436;100000000              |4999575436.012;100000000.0             |
|groupby |G1_1e8_2e0_0_0 |sum v1 by id1               |299988126                         |299988126                              |
|groupby |G1_1e8_2e0_0_0 |sum v1 by id1:id2           |299988126                         |299988126                              |
|groupby |G1_1e8_2e0_0_0 |sum v1 mean v3 by id3       |299988126;2161776167              |299988126.0;2161776167.331             |
|groupby |G1_1e8_2e0_0_0 |mean v1:v3 by id4           |5.999763;15.99904;100.001         |6.0;15.999;100.001                     |
|groupby |G1_1e8_2e0_0_0 |sum v1:v3 by id6            |299988126;799952220;5000051370    |299988126.0;799952220.0;5000051370.457 |
|groupby |G1_1e8_2e0_0_0 |median v3 sd v3 by id4 id5  |199.9981;115.4678                 |199.998;115.468                        |
|groupby |G1_1e8_2e0_0_0 |max v1 - min v2 by id3      |-82715914                         |-82715914                              |
|groupby |G1_1e8_2e0_0_0 |largest two v3 by id6       |4191769306                        |4191769306.196                         |
|groupby |G1_1e8_2e0_0_0 |regression v1 v2 by id2 id4 |0.0000002443467                   |0.0                                    |
|groupby |G1_1e8_2e0_0_0 |sum v3 count by id1:id6     |5000051370;100000000              |5000051370.457;100000000.0             |
|groupby |G1_1e8_1e2_0_1 |sum v1 by id1               |299991302                         |299991302                              |
|groupby |G1_1e8_1e2_0_1 |sum v1 by id1:id2           |299991302                         |299991302                              |
|groupby |G1_1e8_1e2_0_1 |sum v1 mean v3 by id3       |299991302;50001192                |299991302.0;50001192.355               |
|groupby |G1_1e8_1e2_0_1 |mean v1:v3 by id4           |299.9913;799.9782;5000.104        |299.991;799.978;5000.104               |
|groupby |G1_1e8_1e2_0_1 |sum v1:v3 by id6            |299991302;799978221;5000103938    |299991302.0;799978221.0;5000103937.772 |
|groupby |G1_1e8_1e2_0_1 |median v3 sd v3 by id4 id5  |500020;288668.4                   |500019.998;288668.357                  |
|groupby |G1_1e8_1e2_0_1 |max v1 - min v2 by id3      |3998729                           |3998729                                |
|groupby |G1_1e8_1e2_0_1 |largest two v3 by id6       |196996660                         |196996660.391                          |
|groupby |G1_1e8_1e2_0_1 |regression v1 v2 by id2 id4 |1.006723                          |1.007                                  |
|groupby |G1_1e8_1e2_0_1 |sum v3 count by id1:id6     |5000103938;100000000              |5000103937.772;100000000.0             |
|groupby |G1_1e8_1e2_5_0 |sum v1 by id1               |284994735                         |284994735                              |
|groupby |G1_1e8_1e2_5_0 |sum v1 by id1:id2           |284994735                         |284994735                              |
|groupby |G1_1e8_1e2_5_0 |sum v1 mean v3 by id3       |284994735;47500173                |284994735.0;45125236.05                |
|groupby |G1_1e8_1e2_5_0 |mean v1:v3 by id4           |287.9924;767.9688;4799.99         |273.591;729.567;4559.975               |
|groupby |G1_1e8_1e2_5_0 |sum v1:v3 by id6            |284994735;759971497;4750083909    |284994735.0;759971497.0;4750083909.4   |
|groupby |G1_1e8_1e2_5_0 |median v3 sd v3 by id4 id5  |460792.4;266033.6                 |460792.37;259297.076                   |
|groupby |G1_1e8_1e2_5_0 |max v1 - min v2 by id3      |3798317                           |3798317                                |
|groupby |G1_1e8_1e2_5_0 |largest two v3 by id6       |186996834                         |186996833.999                          |
|groupby |G1_1e8_1e2_5_0 |regression v1 v2 by id2 id4 |1.028772                          |1.14                                   |
|groupby |G1_1e8_1e2_5_0 |sum v3 count by id1:id6     |4750083909;100000000              |4750083909.4;100000000.0               |
|groupby |G1_1e9_1e2_0_0 |sum v1 by id1               |2999924714                        |NA                                     |
|groupby |G1_1e9_1e2_0_0 |sum v1 by id1:id2           |2999924714                        |NA                                     |
|groupby |G1_1e9_1e2_0_0 |sum v1 mean v3 by id3       |2999924714;499986250              |NA                                     |
|groupby |G1_1e9_1e2_0_0 |mean v1:v3 by id4           |299.9925;799.9993;4999.87         |NA                                     |
|groupby |G1_1e9_1e2_0_0 |sum v1:v3 by id6            |2999924714;7999992854;49998699478 |NA                                     |
|groupby |G1_1e9_1e2_0_0 |median v3 sd v3 by id4 id5  |499981.8;288669.2                 |NA                                     |
|groupby |G1_1e9_1e2_0_0 |max v1 - min v2 by id3      |39987226                          |NA                                     |
|groupby |G1_1e9_1e2_0_0 |largest two v3 by id6       |1970001790                        |NA                                     |
|groupby |G1_1e9_1e2_0_0 |regression v1 v2 by id2 id4 |0.09821376                        |NA                                     |
|groupby |G1_1e9_1e2_0_0 |sum v3 count by id1:id6     |49998699478;1000000000            |NA                                     |
|groupby |G1_1e9_1e1_0_0 |sum v1 by id1               |2999933732                        |NA                                     |
|groupby |G1_1e9_1e1_0_0 |sum v1 by id1:id2           |2999933732                        |NA                                     |
|groupby |G1_1e9_1e1_0_0 |sum v1 mean v3 by id3       |2999933732;4999733095             |NA                                     |
|groupby |G1_1e9_1e1_0_0 |mean v1:v3 by id4           |29.99934;79.99944;499.9972        |NA                                     |
|groupby |G1_1e9_1e1_0_0 |sum v1:v3 by id6            |2999933732;7999944432;49999721740 |NA                                     |
|groupby |G1_1e9_1e1_0_0 |median v3 sd v3 by id4 id5  |4999.948;2886.7                   |NA                                     |
|groupby |G1_1e9_1e1_0_0 |max v1 - min v2 by id3      |278916351                         |NA                                     |
|groupby |G1_1e9_1e1_0_0 |largest two v3 by id6       |17000225999                       |NA                                     |
|groupby |G1_1e9_1e1_0_0 |regression v1 v2 by id2 id4 |0.000008661068                    |NA                                     |
|groupby |G1_1e9_1e1_0_0 |sum v3 count by id1:id6     |49999721740;1000000000            |NA                                     |
|groupby |G1_1e9_2e0_0_0 |sum v1 by id1               |2999997259                        |NA                                     |
|groupby |G1_1e9_2e0_0_0 |sum v1 by id1:id2           |2999997259                        |NA                                     |
|groupby |G1_1e9_2e0_0_0 |sum v1 mean v3 by id3       |2999997259;21616611234            |NA                                     |
|groupby |G1_1e9_1e2_0_1 |sum v1 by id1               |2999924714                        |NA                                     |
|groupby |G1_1e9_1e2_0_1 |sum v1 by id1:id2           |2999924714                        |NA                                     |
|groupby |G1_1e9_1e2_0_1 |sum v1 mean v3 by id3       |2999924714;499986250              |NA                                     |
|groupby |G1_1e9_1e2_0_1 |mean v1:v3 by id4           |299.9925;799.9993;4999.87         |NA                                     |
|groupby |G1_1e9_1e2_0_1 |sum v1:v3 by id6            |2999924714;7999992854;49998699478 |NA                                     |
|groupby |G1_1e9_1e2_0_1 |median v3 sd v3 by id4 id5  |499981.8;288669.2                 |NA                                     |
|groupby |G1_1e9_1e2_0_1 |max v1 - min v2 by id3      |39987226                          |NA                                     |
|groupby |G1_1e9_1e2_0_1 |largest two v3 by id6       |1970001790                        |NA                                     |
|groupby |G1_1e9_1e2_0_1 |regression v1 v2 by id2 id4 |0.09821376                        |NA                                     |
|groupby |G1_1e9_1e2_0_1 |sum v3 count by id1:id6     |49998699478;1000000000            |NA                                     |
|groupby |G1_1e9_1e2_5_0 |sum v1 by id1               |2849922064                        |NA                                     |
|groupby |G1_1e9_1e2_5_0 |sum v1 by id1:id2           |2849922064                        |NA                                     |
|groupby |G1_1e9_1e2_5_0 |sum v1 mean v3 by id3       |2849922064;474988853              |NA                                     |
|groupby |G1_1e9_1e2_5_0 |mean v1:v3 by id4           |287.9927;768.0034;4799.897        |NA                                     |
|groupby |G1_1e9_1e2_5_0 |sum v1:v3 by id6            |2849922064;7600000111;47498842806 |NA                                     |
|groupby |G1_1e9_1e2_5_0 |median v3 sd v3 by id4 id5  |460786.5;266038.4                 |NA                                     |
|groupby |G1_1e9_1e2_5_0 |max v1 - min v2 by id3      |37982992                          |NA                                     |
|groupby |G1_1e9_1e2_5_0 |largest two v3 by id6       |1870003947                        |NA                                     |
|groupby |G1_1e9_1e2_5_0 |regression v1 v2 by id2 id4 |0.09857194                        |NA                                     |
|groupby |G1_1e9_1e2_5_0 |sum v3 count by id1:id6     |47498842806;1000000000            |NA                                     |
|join    |J1_1e7_NA_0_0  |small inner on int          |450015154;347720187               |450015153.577;347720187.393            |
|join    |J1_1e7_NA_0_0  |medium inner on int         |449954076;449999845               |449954076.026;449999844.938            |
|join    |J1_1e7_NA_0_0  |medium outer on int         |500043741;449999845               |500043740.752;449999844.938            |
|join    |J1_1e7_NA_0_0  |medium inner on factor      |449954076;449999845               |449954076.026;449999844.938            |
|join    |J1_1e7_NA_0_0  |big inner on int            |450032092;449860429               |450032091.841;449860428.616            |
|join    |J1_1e7_NA_5_0  |small inner on int          |427503549;436095569               |427503548.94;436095569.266             |
|join    |J1_1e7_NA_5_0  |medium inner on int         |406023280;423957579               |406023279.562;423957578.77             |
|join    |J1_1e7_NA_5_0  |medium outer on int         |475042481;423957579               |475042480.57;423957578.77              |
|join    |J1_1e7_NA_5_0  |medium inner on factor      |406023280;423957579               |406023279.562;423957578.77             |
|join    |J1_1e7_NA_5_0  |big inner on int            |406160610;427451488               |406160610.361;427451488.273            |
|join    |J1_1e7_NA_0_1  |small inner on int          |449966347;386180314               |449966347.499;386180313.988            |
|join    |J1_1e7_NA_0_1  |medium inner on int         |449944125;448928500               |449944124.782;448928499.667            |
|join    |J1_1e7_NA_0_1  |medium outer on int         |500043741;448928500               |500043740.752;448928499.667            |
|join    |J1_1e7_NA_0_1  |medium inner on factor      |449944125;448928500               |449944124.782;448928499.667            |
|join    |J1_1e7_NA_0_1  |big inner on int            |450020346;449938346               |450020346.491;449938346.033            |
|join    |J1_1e8_NA_0_0  |small inner on int          |4499430832;4388703871             |4499430832.39;4388703871.269           |
|join    |J1_1e8_NA_0_0  |medium inner on int         |4499423746;4507751463             |4499423746.365;4507751463.255          |
|join    |J1_1e8_NA_0_0  |medium outer on int         |4999542478;4507751463             |4999542477.919;4507751463.255          |
|join    |J1_1e8_NA_0_0  |medium inner on factor      |4499423746;4507751463             |4499423746.365;4507751463.255          |
|join    |J1_1e8_NA_0_0  |big inner on int            |4499590098;4499913694             |4499590098.078;4499913694.243          |
|join    |J1_1e8_NA_5_0  |small inner on int          |4084298007;4658773531             |4084298006.891;4658773531.25           |
|join    |J1_1e8_NA_5_0  |medium inner on int         |4061304227;4268317288             |4061304226.846;4268317288.263          |
|join    |J1_1e8_NA_5_0  |medium outer on int         |4749474734;4268317288             |4749474734.499;4268317288.263          |
|join    |J1_1e8_NA_5_0  |medium inner on factor      |4061304227;4268317288             |4061304226.846;4268317288.263          |
|join    |J1_1e8_NA_5_0  |big inner on int            |4060971185;4275319617             |4060971184.62;4275319616.964           |
|join    |J1_1e8_NA_0_1  |small inner on int          |4499308287;3953465650             |4499308287.26;3953465649.659           |
|join    |J1_1e8_NA_0_1  |medium inner on int         |4499224468;4506958891             |4499224468.491;4506958891.2            |
|join    |J1_1e8_NA_0_1  |medium outer on int         |4999542478;4506958891             |4999542477.919;4506958891.2            |
|join    |J1_1e8_NA_0_1  |medium inner on factor      |4499224468;4506958891             |4499224468.491;4506958891.2            |
|join    |J1_1e8_NA_0_1  |big inner on int            |4499618843;4499951833             |4499618842.792;4499951833.33           |

@jangorecki
Author

I looked through those checksums briefly. Most of the issues got fixed, but there still seem to be issues for data with NAs:

|task    |data           |question                    |data.table                        |polars                                 |
|:-------|:--------------|:---------------------------|:---------------------------------|:--------------------------------------|
|groupby |G1_1e7_1e2_5_0 |sum v1 mean v3 by id3       |28498857;4749468                  |28498857.0;4511894.123                 |
|groupby |G1_1e7_1e2_5_0 |mean v1:v3 by id4           |287.9894;767.8529;4799.873        |273.584;729.454;4559.823               |
|groupby |G1_1e7_1e2_5_0 |median v3 sd v3 by id4 id5  |460771.2;266006.9                 |460771.216;259264.174                  |
|groupby |G1_1e7_1e2_5_0 |regression v1 v2 by id2 id4 |9.940516                          |11.019                                 |
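
The thread does not say what exactly went wrong for these rows, but the polars means above are roughly 0.95 times the data.table ones (for example 273.584 against 287.9894). A gap of that shape can appear when missing values are skipped in the sum but still counted in the denominator. The following is only a minimal plain-Python sketch of that effect under this assumption; the function names and the toy data are hypothetical and are not taken from polars or data.table.

```python
from typing import List, Optional


def mean_skip_nulls(values: List[Optional[float]]) -> Optional[float]:
    """Mean over the non-missing values only (missing values are ignored entirely)."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    return sum(present) / len(present)


def mean_nulls_in_denominator(values: List[Optional[float]]) -> Optional[float]:
    """Illustrative faulty variant: missing values are skipped in the sum
    but still counted in the denominator."""
    if not values:
        return None
    total = sum(v for v in values if v is not None)
    return total / len(values)


# Toy data (hypothetical): 19 observed values and 1 missing value, i.e. 5% missing.
toy = [10.0] * 19 + [None]

correct = mean_skip_nulls(toy)
biased = mean_nulls_in_denominator(toy)
print(correct, biased, biased / correct)
```

With 5% of the values missing, the second variant comes out 5% too low, which is the same order of discrepancy as in the rows above.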

@ritchie46
Member

> I looked through those checksums briefly. Most of the issues got fixed, but there still seem to be issues for data with NAs:
>
> |task    |data           |question                    |data.table                        |polars                                 |
> |:-------|:--------------|:---------------------------|:---------------------------------|:--------------------------------------|
> |groupby |G1_1e7_1e2_5_0 |sum v1 mean v3 by id3       |28498857;4749468                  |28498857.0;4511894.123                 |
> |groupby |G1_1e7_1e2_5_0 |mean v1:v3 by id4           |287.9894;767.8529;4799.873        |273.584;729.454;4559.823               |
> |groupby |G1_1e7_1e2_5_0 |median v3 sd v3 by id4 id5  |460771.2;266006.9                 |460771.216;259264.174                  |
> |groupby |G1_1e7_1e2_5_0 |regression v1 v2 by id2 id4 |9.940516                          |11.019                                 |

Great, we're tuning in. I will investigate the last ones. Thanks for the feedback.

@ritchie46
Member

@jangorecki Ok, I think all checksums are ok now:

"G1_1e7_1e2_5_0" | "sum v1 by id1" | "27192241"
"G1_1e7_1e2_5_0" | "sum v1 by id1" | "27192241"
"G1_1e7_1e2_5_0" | "sum v1 by id1:id2" | "28498857"
"G1_1e7_1e2_5_0" | "sum v1 by id1:id2" | "28498857"
"G1_1e7_1e2_5_0" | "sum v1 mean v3 by id3" | "28498857.0;4749467.632"
"G1_1e7_1e2_5_0" | "sum v1 mean v3 by id3" | "28498857.0;4749467.632"
"G1_1e7_1e2_5_0" | "mean v1:v3 by id4" | "287.989;767.853;4799.873"
"G1_1e7_1e2_5_0" | "mean v1:v3 by id4" | "287.989;767.853;4799.873"
"G1_1e7_1e2_5_0" | "sum v1:v3 by id6" | "28498857.0;75988394.0;474969574.048"
"G1_1e7_1e2_5_0" | "sum v1:v3 by id6" | "28498857.0;75988394.0;474969574.048"
"G1_1e7_1e2_5_0" | "median v3 sd v3 by id4 id5" | "460771.216;266006.905"
"G1_1e7_1e2_5_0" | "median v3 sd v3 by id4 id5" | "460771.216;266006.905"
"G1_1e7_1e2_5_0" | "max v1 - min v2 by id3" | "379850"
"G1_1e7_1e2_5_0" | "max v1 - min v2 by id3" | "379850"
"G1_1e7_1e2_5_0" | "largest two v3 by id6" | "18700554.78"
"G1_1e7_1e2_5_0" | "largest two v3 by id6" | "18700554.78"
"G1_1e7_1e2_5_0" | "regression v1 v2 by id2 id4" | "9.942"
"G1_1e7_1e2_5_0" | "regression v1 v2 by id2 id4" | "9.942"
"G1_1e7_1e2_5_0" | "sum v3 count by id1:id6" | "474969574.048;10000000.0"
"G1_1e7_1e2_5_0" | "sum v3 count by id1:id6" | "474969574.048;10000000.0"

I also made a new patch release, because it is important to get this right.
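
Since the two tools print their checksums with different rounding (for example `4749468` against `4749467.632`), comparing the strings directly does not work. The snippet below is a rough sketch of a tolerance-based comparison; the helper names `parse_chk` and `chk_matches` and the relative tolerance of `1e-3` are assumptions made for illustration, not part of the benchmark code.

```python
import math
from typing import List


def parse_chk(chk: str) -> List[float]:
    """Split a ';'-separated checksum string into floats."""
    return [float(part) for part in chk.split(";")]


def chk_matches(expected: str, actual: str, rel_tol: float = 1e-3) -> bool:
    """Return True if both checksums have the same number of components and
    every pair of components agrees within the (assumed) relative tolerance."""
    exp_parts = parse_chk(expected)
    act_parts = parse_chk(actual)
    if len(exp_parts) != len(act_parts):
        return False
    return all(
        math.isclose(e, a, rel_tol=rel_tol)
        for e, a in zip(exp_parts, act_parts)
    )


# Values copied from the tables in this thread (data.table first, polars second).
print(chk_matches("28498857;4749468", "28498857.0;4749467.632"))  # after the fix
print(chk_matches("28498857;4749468", "28498857.0;4511894.123"))  # before the fix
print(chk_matches("28498857", "27192241"))                        # "sum v1 by id1"
```

Under this tolerance the first pair agrees, while the other two differ by several percent and are reported as mismatches.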

@jangorecki
Author

AFAIU this patch release was 0.7.11, right? If so, then it doesn't seem to be fixed.

@ritchie46
Member

@jangorecki No, the patch release was 0.7.12. And of course the latest version is also patched.

@ritchie46
Member

@jangorecki I found another null-handling issue that only came up on q1. I will patch that this afternoon.

@ritchie46
Member

@jangorecki
Author

I can confirm it is fixed using 0.7.16. If possible, please ensure you have unit tests for those so we can avoid debugging the same issues in the future.

@ritchie46
Member

> If possible, please ensure you have unit tests for those so we can avoid debugging the same issues in the future.

Yes!
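
As an illustration of what such a unit test could look like for the "regression v1 v2 by id2 id4" question, here is a small plain-Python reference for the squared Pearson correlation that drops every pair in which either value is missing. This pairwise-complete convention is an assumption about the intended behaviour, and `pearson_r2_skip_nulls` is a hypothetical helper, not a polars function; a real test would compare the library's result against a reference like this one.

```python
import math
from typing import List, Optional


def pearson_r2_skip_nulls(
    xs: List[Optional[float]], ys: List[Optional[float]]
) -> Optional[float]:
    """Squared Pearson correlation over pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant column
    r = cov / math.sqrt(var_x * var_y)
    return r * r


def test_r2_ignores_incomplete_pairs() -> None:
    # Toy data (hypothetical): the complete pairs lie exactly on y = 2x,
    # so r^2 must be 1 regardless of the rows that contain a missing value.
    v1 = [1.0, 2.0, 3.0, None, 5.0]
    v2 = [2.0, 4.0, 6.0, 8.0, None]
    result = pearson_r2_skip_nulls(v1, v2)
    assert result is not None
    assert math.isclose(result, 1.0)


test_r2_ignores_incomplete_pairs()
```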
