[BUG] Spark smoke test error with Criteo #1615
Comments
It seems LightGBM from SynapseML on Spark 3.2 gives a lower AUC value, and the smoke test gave a much lower value. However, when I tested on my local machine, the AUC was 0.65653376, which is within the tolerance. The comparison shown in the screenshot above can be found here.
Just found LightGBM has 121 open issues 😂
The AUC value is around 0.66 on my local machine and around 0.63 on the pipeline test machine, so I think we may change the base value to 0.645.
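A minimal sketch of what the proposed base-value change could look like, assuming the smoke tests compare metrics with pytest.approx and a relative tolerance (both assumptions; the actual test code in the recommenders repo may differ):

```python
import pytest

# Hypothetical values: the proposed new base AUC and an assumed relative
# tolerance; the actual test constants may differ.
BASE_AUC = 0.645
TOL = 0.05  # a 5% relative tolerance would accept both ~0.63 and ~0.66

def check_auc(auc):
    # Passes for the pipeline machine (~0.63) and a local run (~0.66).
    assert auc == pytest.approx(BASE_AUC, rel=TOL)
```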
Maybe something has changed between MMLSpark and SynapseML:
This is also interesting; I wonder why there is such a difference between machines. It must come from the way LightGBM is implemented in Spark, or from some randomization that is affected by the machine.
@mhamilton723 @imatiach-msft we detected a performance drop in LightGBM; any hints about what could be happening?
@miguelgfierro is this using the latest synapseml 0.9.5? Recently the code has changed a lot with the new "single dataset mode"; you should see the speed improve much more. In performance testing we saw a big speedup with the new single dataset mode and numThreads set to the number of cores minus one. For more information on the new single dataset mode, please see the PR description. You can try turning it off by setting useSingleDatasetMode=False to see if performance reverts.
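As an illustration of the suggestion above, here is a minimal sketch of toggling the flag, assuming the synapseml 0.9.5 Python API; train_df, the column names, and the numThreads value are placeholders:

```python
from synapse.ml.lightgbm import LightGBMClassifier

# Sketch: revert to the pre-0.9.5 behavior by disabling single dataset mode.
lgbm = LightGBMClassifier(
    labelCol="label",
    featuresCol="features",
    objective="binary",
    useSingleDatasetMode=False,  # one native dataset per Spark worker (old behavior)
    numThreads=5,                # e.g. number of cores minus one, per the guidance above
)
model = lgbm.fit(train_df)  # train_df: your training DataFrame
```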
Is this smoke test running on the full Criteo dataset?
Just to be sure, the difference is only 0.689 vs 0.661? To me that honestly doesn't seem like a big gap; you may see changes like that just from modifying the random seed.
The test is running this notebook (it uses the downsampled Criteo data set).
@anargyri did the execution speed of training change a lot? Perhaps you could try running for more iterations to see if the metric improves. You can also try setting useSingleDatasetMode=False to see if it keeps the same metric value, if that is really important.
So, @imatiach-msft you are right,
@anargyri @miguelgfierro I made a PR here to update the notebook and remove the early stopping param, since it isn't used:
How large is the small dataset? How many workers are being used? With 0.9.5 we switched useSingleDatasetMode=True by default, but it was available for a couple of releases before that so we could validate it. I suspect the small difference is just due to randomness. Really, it's not a big difference to me; it's not a huge gap such as 0.86 vs. 0.62, which would be much more concerning.
"28s vs. 25s respectively" |
It's 100K rows in the data set we use vs. 45M rows in the full data.
I agree it's not a large difference in AUC. I am comparing runs on the same machine with the same code (just flipping the useSingleDatasetMode parameter). I am not sure how many workers are used; it is a single machine. It looks like LightGBM uses 6 workers, but what is strange is that the LightGBM info is not reported in the output when changing the parameter to True.
Yes, this is right.
"I am not sure how many workers are used, it is a single machine. It looks like lightGBM uses 6 workers but what is strange is that the lightGBM info is not reported in the output when changing the parameter to True." |
It uses multiple cores on the machine with both settings of the parameter. |
@anargyri I guess this is getting into very specific details, but single dataset mode essentially hands the Spark dataset to native code on one Spark worker and "finishes" the other Spark workers; the parallelization is then done with multithreaded code in the native layer. The previous behavior would create a native dataset for each worker, and there would be a lot of unnecessary network communication between them instead of internal thread parallelization. So with single dataset mode we have a single LightGBM dataset created per machine, and without it we have as many datasets as Spark workers (so with 1 core per worker and 8 cores, there are 8 LightGBM datasets created). I'm actually surprised that this still gives better accuracy in this case somehow.
This is great @imatiach-msft, thanks for chiming in.
I am observing a similar situation. After upgrading to 0.9.5, my metrics for a heavily unbalanced dataset dropped (AUC from 0.65 down to 0.51, PR from 0.012 down to 0.003). When setting useSingleDatasetMode=False, I get the original performance metrics from 0.9.4.
@tbrandonstevenson I wonder if it might be due to this issue:
If you increase the chunkSize parameter, do you see the metrics improving? If so, it's probably that bug.
@imatiach-msft I just tried increasing the chunkSize parameter while keeping useSingleDatasetMode=True. chunkSize=10k showed no improvement, but chunkSize=100k gave me good model results.
@tbrandonstevenson I see; yes, indeed, then it's exactly that same issue. It is already fixed in master.
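For reference, a minimal sketch of the workaround described in this exchange, assuming the SynapseML Python API; train_df and the column names are placeholders, and the chunkSize values are the ones reported above, not a general recommendation:

```python
from synapse.ml.lightgbm import LightGBMClassifier

# Sketch: raise chunkSize while keeping single dataset mode enabled,
# as the commenter above reported doing.
lgbm = LightGBMClassifier(
    labelCol="label",
    featuresCol="features",
    useSingleDatasetMode=True,
    chunkSize=100000,  # 10k showed no improvement; 100k restored good metrics
)
model = lgbm.fit(train_df)  # train_df: your training DataFrame
```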
Description
After upgrading LightGBM and the Spark version, we got this error in the nightly smoke tests; however, we have been running the same code for a long time without this error. It looks like a performance degradation.
In which platform does it happen?
How do we replicate the issue?
See details:
https://dev.azure.com/best-practices/recommenders/_build/results?buildId=56132&view=logs&j=80b1c078-4399-5286-f869-6bc90f734ab9&t=5e8b8b4f-32ea-5957-d349-aae815b05487
Expected behavior (i.e. solution)
Other Comments
This error is so weird; did LightGBM from SynapseML change somehow? FYI @anargyri @simonzhaoms