
[BUG] Spark smoke test error with Criteo #1615

Closed
miguelgfierro opened this issue Jan 19, 2022 · 27 comments
Labels
bug Something isn't working

Comments

@miguelgfierro
Collaborator

miguelgfierro commented Jan 19, 2022

Description

After upgrading LightGBM and the Spark version, we got this error in the nightly smoke tests; however, we have been running the same code for a long time without this error. It looks like a performance degradation.

tests/smoke/examples/test_notebooks_pyspark.py .RRRRRF

=================================== FAILURES ===================================
_____________________ test_mmlspark_lightgbm_criteo_smoke ______________________

notebooks = {'als_deep_dive': '/home/recocat/myagent/_work/10/s/examples/02_model_collaborative_filtering/als_deep_dive.ipynb', 'a..._dive': '/home/recocat/myagent/_work/10/s/examples/02_model_collaborative_filtering/cornac_bivae_deep_dive.ipynb', ...}
output_notebook = 'output.ipynb', kernel_name = 'python3'

    @pytest.mark.flaky(reruns=5, reruns_delay=2)
    @pytest.mark.smoke
    @pytest.mark.spark
    @pytest.mark.skipif(sys.platform == "win32", reason="Not implemented on Windows")
    def test_mmlspark_lightgbm_criteo_smoke(notebooks, output_notebook, kernel_name):
        notebook_path = notebooks["mmlspark_lightgbm_criteo"]
        pm.execute_notebook(
            notebook_path,
            output_notebook,
            kernel_name=kernel_name,
            parameters=dict(DATA_SIZE="sample", NUM_ITERATIONS=50, EARLY_STOPPING_ROUND=10),
        )

        results = sb.read_notebook(output_notebook).scraps.dataframe.set_index("name")[
            "data"
        ]
>       assert results["auc"] == pytest.approx(0.68895, rel=TOL, abs=ABS_TOL)
E       assert 0.6292474883613918 == 0.68895 ± 5.0e-02
E        +  where 0.68895 ± 5.0e-02 = <function approx at 0x7f46b6e30840>(0.68895, rel=0.05, abs=0.05)
E        +    where <function approx at 0x7f46b6e30840> = pytest.approx

In which platform does it happen?

How do we replicate the issue?

See details:
https://dev.azure.com/best-practices/recommenders/_build/results?buildId=56132&view=logs&j=80b1c078-4399-5286-f869-6bc90f734ab9&t=5e8b8b4f-32ea-5957-d349-aae815b05487

Expected behavior (i.e. solution)

Other Comments

This error is so weird; did LightGBM from SynapseML change somehow? FYI @anargyri @simonzhaoms

@miguelgfierro miguelgfierro added the bug Something isn't working label Jan 19, 2022
@simonzhaoms
Contributor

simonzhaoms commented Jan 20, 2022

It seems LightGBM from SynapseML or Spark 3.2 gives a lower AUC, and the smoke test gave a much lower value. However, when I tested on my local machine, the AUC was 0.65653376, which is within the tolerance.

[Screenshot: AUC comparison, 2022-01-20]

The comparison shown in the screenshot above can be found here.

@simonzhaoms
Contributor

Just found LightGBM has 121 open issues 😂

@simonzhaoms
Contributor

simonzhaoms commented Jan 20, 2022

The AUC value is around 0.66 on my local machine, and is around 0.63 on the pipeline test machine. So I think we may change the base value to 0.645.
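For reference, this would be roughly the following change to the assertion in the smoke test (a sketch only; 0.645 is the proposed new baseline, and TOL / ABS_TOL are the tolerances already defined in the test module):

    # Hypothetical update to test_mmlspark_lightgbm_criteo_smoke
    assert results["auc"] == pytest.approx(0.645, rel=TOL, abs=ABS_TOL)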

@anargyri
Collaborator

Maybe something has changed between MMLSpark and SynapseML:

  • default parameters of lightgbm,
  • default initialization, or
  • version of lightgbm used?

@anargyri
Collaborator

anargyri commented Jan 20, 2022

The AUC value is around 0.66 on my local machine, and is around 0.63 on the pipeline test machine. So I think we may change the base value to 0.645.

This is also interesting, I wonder why there is such a difference on different machines. It has to come from the way lightgbm is implemented in Spark or some randomization that is affected by the machine.

@miguelgfierro
Collaborator Author

@mhamilton723 @imatiach-msft we detected a performance drop in LightGBM, any hint about what could be happening?

@imatiach-msft
Collaborator

@miguelgfierro is this using the latest synapseml 0.9.5? Recently, the code has changed a lot with the new "single dataset mode"; you should see the speed improve much more.
These two PRs improve the performance

#1222
#1282

In performance testing we saw a big speedup with the new single dataset mode and numThreads set to the number of cores minus 1.

For more information on the new single dataset mode please see the PR description:
#1066

You can try turning it off by setting useSingleDatasetMode=False to see if performance reverts.
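For example, a minimal sketch of how the flag could be passed when constructing the classifier in the notebook (assuming the synapse.ml.lightgbm Python API in synapseml 0.9.5; the column names and other parameter values here are illustrative, not the notebook's exact settings):

    from synapse.ml.lightgbm import LightGBMClassifier

    lgbm = LightGBMClassifier(
        objective="binary",
        featuresCol="features",
        labelCol="label",
        numIterations=50,              # matches NUM_ITERATIONS used by the smoke test
        useSingleDatasetMode=False,    # revert to the pre-0.9.5 per-worker behavior
    )
    model = lgbm.fit(train_df)         # train_df: a Spark DataFrame with the columns above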

@imatiach-msft
Collaborator

Is this smoke test running on the full Criteo dataset?

@imatiach-msft
Collaborator

Just to be sure, the difference is only 0.689 vs 0.661? To me that honestly doesn't seem like a big gap; you may see changes like that just from modifying the random seed.

@anargyri
Collaborator

The test is running this notebook (it uses the downsampled Criteo data set).
Synapse is 0.9.5 indeed.
We fix the bagging seed in the notebook; according to the docs, this is the only seed parameter.

@imatiach-msft
Collaborator

@anargyri did the execution speed of training change a lot? Perhaps you could try to run for more iterations to see if the metric improves. You can also try setting useSingleDatasetMode=False to see if it keeps the same metric value, if that is really important.
Also, I see some early stopping param in that notebook:
earlyStoppingRound=EARLY_STOPPING_ROUND
but it isn't actually used. Maybe I can send a PR to remove it.

@anargyri
Collaborator

anargyri commented Jan 20, 2022

So, @imatiach-msft you are right, useSingleDatasetMode is the source of the difference. When I set it to False, I get AUC = 0.66 like before; setting it to True gives AUC = 0.62.
The difference in computational time is about 28s vs. 25s, respectively. It's the small data set; maybe there is a more visible improvement with the full one.

@imatiach-msft
Collaborator

@anargyri @miguelgfierro I made a PR here to update the notebook to remove the early stopping param since it isn't used:
#1620

@imatiach-msft
Collaborator

How large is the small dataset? How many workers are being used? With 0.9.5 we switched to useSingleDatasetMode=True by default, but it had been there for a couple of releases before that so we could validate it. I suspect the small difference is just due to randomness. Really, it's not a big difference to me, not like a huge gap such as 0.86 vs 0.62, which would be much more concerning.

@imatiach-msft
Collaborator

"28s vs. 25s respectively"
Just to double-confirm, is the 25s with useSingleDatasetMode=True? It should be faster. Although if the dataset is small, there really shouldn't be any difference; at this point the time difference is most likely just random.

@anargyri
Collaborator

It's 100K rows in the data set we use vs. 45M rows in the full data.

@anargyri
Collaborator

I agree it's not a large difference in AUC. I am comparing runs on the same machine with the same code (just flipping the useSingleDatasetMode parameter). I am not sure how many workers are used; it is a single machine. It looks like LightGBM uses 6 workers, but what is strange is that the LightGBM info is not reported in the output when the parameter is changed to True.

@anargyri
Collaborator

"28s vs. 25s respectively" Just to double-confirm, is the 25s the useSingleDatasetMode=True? It should be faster. Although if the dataset is small there really shouldn't be any difference, or the time difference is most likely at this point just random.

Yes, this is right.

@imatiach-msft
Collaborator

"I am not sure how many workers are used, it is a single machine. It looks like lightGBM uses 6 workers but what is strange is that the lightGBM info is not reported in the output when changing the parameter to True."
I think when using single dataset mode and using a single machine it will currently just run it like on a single machine, without distributed training:
https://github.com/microsoft/SynapseML/blob/master/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/LightGBMBase.scala#L252
I recall that at some iteration of the code, when I was adding this new mode, I actually had the code disable single dataset mode when running on a single machine, so each core would still be a separate worker, but that was removed during code review. I should probably take a look at this code in more detail to refresh my memory and ensure everything is working in the optimal way.

@anargyri
Collaborator

It uses multiple cores on the machine with both settings of the parameter.

@imatiach-msft
Collaborator

imatiach-msft commented Jan 20, 2022

@anargyri I guess this is kind of getting into very specific details, but single dataset mode essentially hands the Spark dataset to native code on one Spark worker and "finishes" the other Spark workers, and then the parallelization is done with multithreading code in the native layer. The previous behavior would create a native dataset for each worker, with a lot of unnecessary network communication between them instead of internal thread parallelization. So with single dataset mode we have a single LightGBM dataset created per machine, and without it we have as many datasets as Spark workers (so with 1 core per worker and 8 cores, there are 8 LightGBM datasets created). I'm actually surprised that this still gives better accuracy in this case somehow.
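As a rough illustration of the difference described above (a sketch only; the core and worker counts are the hypothetical ones from the comment):

    # Hypothetical single machine with 8 cores and 1 core per Spark worker.
    cores = 8
    cores_per_worker = 1
    workers = cores // cores_per_worker           # 8 Spark workers

    datasets_without_single_mode = workers        # one native LightGBM dataset per worker -> 8
    datasets_with_single_mode = 1                 # one native LightGBM dataset per machine
    print(datasets_without_single_mode, datasets_with_single_mode)  # 8 1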

@miguelgfierro
Collaborator Author

this is great @imatiach-msft, thanks for chiming in

@tbrandonstevenson

I am observing a similar situation. After upgrading to 0.9.5, my metrics for a heavily unbalanced dataset drop (AUC from 0.65 down to 0.51, PR from 0.012 down to 0.003). When setting useSingleDatasetMode=False, I get the original performance metrics from 0.9.4.

@imatiach-msft
Collaborator

@tbrandonstevenson I wonder if it might be due to this issue:
microsoft/SynapseML#1490
Perhaps on latest master / in the next release the issue will be resolved, as this is a really bad bug in the 0.9.5 release.

@imatiach-msft
Collaborator

If you increase the chunksize parameter, do you see the metrics improving? If so, it's probably that bug.

@tbrandonstevenson

@imatiach-msft I just tried to increase the chunksize parameter while keeping useSingleDatasetMode=True. chunksize=10k showed no improvement, but chunksize=100k gave me good model results.
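For reference, a minimal sketch of this workaround (assuming the synapse.ml.lightgbm Python API; chunkSize is the constructor parameter being referred to here, 100000 is the value reported to work in this thread, and the other arguments are illustrative):

    from synapse.ml.lightgbm import LightGBMClassifier

    # Workaround for microsoft/SynapseML#1490 on synapseml 0.9.5: keep single dataset
    # mode on but raise the chunk size used when copying data to native memory.
    lgbm = LightGBMClassifier(
        objective="binary",
        featuresCol="features",
        labelCol="label",
        useSingleDatasetMode=True,
        chunkSize=100000,
    )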

@imatiach-msft
Collaborator

@tbrandonstevenson I see, yes, indeed, then it's exactly that same issue. It is already fixed in master.
