
[BUG] Spark smoke test error with Criteo #1615

Closed
miguelgfierro opened this issue Jan 19, 2022 · 27 comments
Labels
bug Something isn't working

Comments

@miguelgfierro
Collaborator

miguelgfierro commented Jan 19, 2022

Description

After upgrading LightGBM and the Spark version, we got this error in the nightly smoke tests; however, we have been running the same code for a long time without this error. It looks like a performance degradation.

tests/smoke/examples/test_notebooks_pyspark.py .RRRRRF

=================================== FAILURES ===================================
_____________________ test_mmlspark_lightgbm_criteo_smoke ______________________

notebooks = {'als_deep_dive': '/home/recocat/myagent/_work/10/s/examples/02_model_collaborative_filtering/als_deep_dive.ipynb', 'a..._dive': '/home/recocat/myagent/_work/10/s/examples/02_model_collaborative_filtering/cornac_bivae_deep_dive.ipynb', ...}
output_notebook = 'output.ipynb', kernel_name = 'python3'

    @pytest.mark.flaky(reruns=5, reruns_delay=2)
    @pytest.mark.smoke
    @pytest.mark.spark
    @pytest.mark.skipif(sys.platform == "win32", reason="Not implemented on Windows")
    def test_mmlspark_lightgbm_criteo_smoke(notebooks, output_notebook, kernel_name):
        notebook_path = notebooks["mmlspark_lightgbm_criteo"]
        pm.execute_notebook(
            notebook_path,
            output_notebook,
            kernel_name=kernel_name,
            parameters=dict(DATA_SIZE="sample", NUM_ITERATIONS=50, EARLY_STOPPING_ROUND=10),
        )

        results = sb.read_notebook(output_notebook).scraps.dataframe.set_index("name")[
            "data"
        ]
>       assert results["auc"] == pytest.approx(0.68895, rel=TOL, abs=ABS_TOL)
E       assert 0.6292474883613918 == 0.68895 ± 5.0e-02
E        +  where 0.68895 ± 5.0e-02 = <function approx at 0x7f46b6e30840>(0.68895, rel=0.05, abs=0.05)
E        +    where <function approx at 0x7f46b6e30840> = pytest.approx

In which platform does it happen?

How do we replicate the issue?

See details:
https://dev.azure.com/best-practices/recommenders/_build/results?buildId=56132&view=logs&j=80b1c078-4399-5286-f869-6bc90f734ab9&t=5e8b8b4f-32ea-5957-d349-aae815b05487

Expected behavior (i.e. solution)

Other Comments

This error is so weird; did LightGBM from SynapseML change somehow? FYI @anargyri @simonzhaoms

@miguelgfierro miguelgfierro added the bug Something isn't working label Jan 19, 2022
@simonzhaoms
Contributor

simonzhaoms commented Jan 20, 2022

It seems LightGBM from SynapseML or Spark 3.2 gives a lower AUC, and the smoke test gave a much lower value. However, when I tested on my local machine, the AUC was 0.65653376, which is within the tolerance.

[Screenshot: AUC comparison, 2022-01-20]

The comparison shown in the screenshot above can be found here.

@simonzhaoms
Contributor

Just found LightGBM has 121 open issues 😂

@simonzhaoms
Contributor

simonzhaoms commented Jan 20, 2022

The AUC value is around 0.66 on my local machine, and is around 0.63 on the pipeline test machine. So I think we may change the base value to 0.645.
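For reference, this would be roughly the following change to the assertion in the smoke test (a sketch only; 0.645 is the proposed new baseline, and TOL / ABS_TOL are the tolerances already defined in the test module):

    # Hypothetical update to test_mmlspark_lightgbm_criteo_smoke
    assert results["auc"] == pytest.approx(0.645, rel=TOL, abs=ABS_TOL)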

@anargyri
Collaborator

Maybe something has changed between MMLSpark and SynapseML:

  • default parameters of lightgbm,
  • default initialization, or
  • version of lightgbm used?

@anargyri
Collaborator

anargyri commented Jan 20, 2022

The AUC value is around 0.66 on my local machine, and is around 0.63 on the pipeline test machine. So I think we may change the base value to 0.645.

This is also interesting, I wonder why there is such a difference on different machines. It has to come from the way lightgbm is implemented in Spark or some randomization that is affected by the machine.

@miguelgfierro
Collaborator Author

@mhamilton723 @imatiach-msft we detected a performance drop in LightGBM, any hint about what could be happening?

@imatiach-msft
Collaborator

@miguelgfierro is this using the latest synapseml 0.9.5? Recently, the code has changed a lot with the new "single dataset mode"; you should see the speed improve much more.
These two PRs improve the performance

#1222
#1282

In performance testing we saw a big speedup with the new single dataset mode and numThreads set to the number of cores minus 1.

For more information on the new single dataset mode please see the PR description:
#1066

You can try turning it off by setting useSingleDatasetMode=False to see if performance reverts.
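For example, a minimal sketch of how the flag could be passed when constructing the classifier in the notebook (assuming the synapse.ml.lightgbm Python API in synapseml 0.9.5; the column names and other parameter values here are illustrative, not the notebook's exact settings):

    from synapse.ml.lightgbm import LightGBMClassifier

    lgbm = LightGBMClassifier(
        objective="binary",
        featuresCol="features",
        labelCol="label",
        numIterations=50,              # matches NUM_ITERATIONS used by the smoke test
        useSingleDatasetMode=False,    # revert to the pre-0.9.5 per-worker behavior
    )
    model = lgbm.fit(train_df)         # train_df: a Spark DataFrame with the columns above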

@imatiach-msft
Collaborator

Is this smoke test running on the full Criteo dataset?

@imatiach-msft
Collaborator

Just to be sure, the difference is only 0.689 vs 0.661? To me that honestly doesn't seem like a big gap; you may see changes like that just from modifying the random seed.

@anargyri
Collaborator

The test is running this notebook (it uses the downsampled Criteo data set).
Synapse is 0.9.5 indeed.
We fix the bagging seed in the notebook; according to the docs, this is the only seed parameter.

@imatiach-msft
Collaborator

@anargyri did the execution speed of training change a lot? Perhaps you could try to run for more iterations to see if the metric improves. You can also try setting useSingleDatasetMode=False to see if it keeps the same metric value, if that is really important.
Also, I see some early stopping param in that notebook:
earlyStoppingRound=EARLY_STOPPING_ROUND
but it isn't actually used. Maybe I can send a PR to remove it.

@anargyri
Collaborator

anargyri commented Jan 20, 2022

So, @imatiach-msft you are right, useSingleDatasetMode is the source of the difference. When I set it to False, I get AUC = 0.66 like before; setting it to True gives AUC = 0.62.
The difference in computational time is about 28s vs. 25s, respectively. It's the small data set; maybe there is a more visible improvement with the full one.

@imatiach-msft
Collaborator

@anargyri @miguelgfierro I made a PR here to update the notebook to remove the early stopping param since it isn't used:
#1620

@imatiach-msft
Collaborator

How large is the small dataset? How many workers are being used? With 0.9.5 we switched to useSingleDatasetMode=True by default, but it had been there for a couple of releases before that so we could validate it. I suspect the small difference is just due to randomness. Really, it's not a big difference to me, not like a huge gap such as 0.86 vs 0.62, which would be much more concerning.

@imatiach-msft
Collaborator

"28s vs. 25s respectively"
Just to double-confirm, is the 25s with useSingleDatasetMode=True? It should be faster. Although if the dataset is small, there really shouldn't be any difference; at this point the time difference is most likely just random.

@anargyri
Collaborator

It's 100K rows in the data set we use vs. 45M rows in the full data.

@anargyri
Collaborator

I agree it's not a large difference in AUC. I am comparing runs on the same machine with the same code (just flipping the useSingleDatasetMode parameter). I am not sure how many workers are used; it is a single machine. It looks like LightGBM uses 6 workers, but what is strange is that the LightGBM info is not reported in the output when the parameter is changed to True.

@anargyri
Collaborator

"28s vs. 25s respectively" Just to double-confirm, is the 25s the useSingleDatasetMode=True? It should be faster. Although if the dataset is small there really shouldn't be any difference, or the time difference is most likely at this point just random.

Yes, this is right.

@imatiach-msft
Collaborator

"I am not sure how many workers are used, it is a single machine. It looks like lightGBM uses 6 workers but what is strange is that the lightGBM info is not reported in the output when changing the parameter to True."
I think when using single dataset mode and using a single machine it will currently just run it like on a single machine, without distributed training:
https://github.com/microsoft/SynapseML/blob/master/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/LightGBMBase.scala#L252
I recall that at some iteration of the code, when I was adding this new mode, I actually had the code disable single dataset mode when running on a single machine, so each core would still be a separate worker, but that was removed during code review. I should probably take a look at this code in more detail to refresh my memory and ensure everything is working in the optimal way.

@anargyri
Collaborator

It uses multiple cores on the machine with both settings of the parameter.

@imatiach-msft
Collaborator

imatiach-msft commented Jan 20, 2022

@anargyri I guess this is kind of getting into very specific details, but single dataset mode essentially hands the Spark dataset to native code on one Spark worker and "finishes" the other Spark workers, and then the parallelization is done with multithreading code in the native layer. The previous behavior would create a native dataset for each worker, with a lot of unnecessary network communication between them instead of internal thread parallelization. So with single dataset mode we have a single LightGBM dataset created per machine, and without it we have as many datasets as Spark workers (so with 1 core per worker and 8 cores, there are 8 LightGBM datasets created). I'm actually surprised that this still gives better accuracy in this case somehow.
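As a rough illustration of the difference described above (a sketch only; the core and worker counts are the hypothetical ones from the comment):

    # Hypothetical single machine with 8 cores and 1 core per Spark worker.
    cores = 8
    cores_per_worker = 1
    workers = cores // cores_per_worker           # 8 Spark workers

    datasets_without_single_mode = workers        # one native LightGBM dataset per worker -> 8
    datasets_with_single_mode = 1                 # one native LightGBM dataset per machine
    print(datasets_without_single_mode, datasets_with_single_mode)  # 8 1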

@miguelgfierro
Collaborator Author

this is great @imatiach-msft, thanks for chiming in

@tbrandonstevenson

I am observing a similar situation. After upgrading to 0.9.5, my metrics for a heavily unbalanced dataset drop (AUC from 0.65 down to 0.51, PR from 0.012 down to 0.003). When setting useSingleDatasetMode=False, I get the original performance metrics from 0.9.4.

@imatiach-msft
Collaborator

@tbrandonstevenson I wonder if it might be due to this issue:
microsoft/SynapseML#1490
Perhaps on latest master / in the next release the issue will be resolved, as this is a really bad bug in the 0.9.5 release.

@imatiach-msft
Collaborator

If you increase the chunksize parameter, do you see the metrics improving? If so, it's probably that bug.

@tbrandonstevenson

@imatiach-msft I just tried to increase the chunksize parameter while keeping useSingleDatasetMode=True. chunksize=10k showed no improvement, but chunksize=100k gave me good model results.
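For reference, a minimal sketch of this workaround (assuming the synapse.ml.lightgbm Python API; chunkSize is the constructor parameter being referred to here, 100000 is the value reported to work in this thread, and the other arguments are illustrative):

    from synapse.ml.lightgbm import LightGBMClassifier

    # Workaround for microsoft/SynapseML#1490 on synapseml 0.9.5: keep single dataset
    # mode on but raise the chunk size used when copying data to native memory.
    lgbm = LightGBMClassifier(
        objective="binary",
        featuresCol="features",
        labelCol="label",
        useSingleDatasetMode=True,
        chunkSize=100000,
    )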

@imatiach-msft
Collaborator

@tbrandonstevenson I see, yes, indeed, then it's exactly that same issue. It is already fixed in master.
