
Issue with pymanopt #1606

Merged 3 commits into staging from miguel/pymanopt on Jan 12, 2022

Conversation

@miguelgfierro (Collaborator) commented Jan 10, 2022

Description

Fixes a bug seen in https://github.com/microsoft/recommenders/runs/4763168801?check_suite_focus=true, following up on PR #1605.

Related Issues

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.
  • This PR is being made to staging branch and not to main branch.

@miguelgfierro (Collaborator, Author)

The Spark tests are all failing:

        with TemporaryDirectory(dir=tmp_path_factory.getbasetemp()) as td:
            config = {
                "spark.local.dir": td,
                "spark.sql.shuffle.partitions": 1,
                "spark.sql.crossJoin.enabled": "true",
            }
>           spark = start_or_get_spark(app_name=app_name, url=url, config=config)

tests/conftest.py:85: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
recommenders/utils/spark_utils.py:69: in start_or_get_spark
    return eval(".".join(spark_opts))
.tox/spark/lib/python3.7/site-packages/pyspark/sql/session.py:228: in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
.tox/spark/lib/python3.7/site-packages/pyspark/context.py:384: in getOrCreate
    SparkContext(conf=conf or SparkConf())
.tox/spark/lib/python3.7/site-packages/pyspark/context.py:147: in __init__
    conf, jsc, profiler_cls)
.tox/spark/lib/python3.7/site-packages/pyspark/context.py:209: in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
.tox/spark/lib/python3.7/site-packages/pyspark/context.py:321: in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
.tox/spark/lib/python3.7/site-packages/py4j/java_gateway.py:1569: in __call__
    answer, self._gateway_client, None, self._fqn)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

answer = 'xro21'
gateway_client = <py4j.java_gateway.GatewayClient object at 0x7fed1a1c38d0>
target_id = None, name = 'org.apache.spark.api.java.JavaSparkContext'

    def get_return_value(answer, gateway_client, target_id=None, name=None):
        """Converts an answer received from the Java gateway into a Python object.
    
        For example, string representation of integers are converted to Python
        integer, string representation of objects are converted to JavaObject
        instances, etc.
    
        :param answer: the string returned by the Java gateway
        :param gateway_client: the gateway client used to communicate with the Java
            Gateway. Only necessary if the answer is a reference (e.g., object,
            list, map)
        :param target_id: the name of the object from which the answer comes from
            (e.g., *object1* in `object1.hello()`). Optional.
        :param name: the name of the member from which the answer comes from
            (e.g., *hello* in `object1.hello()`). Optional.
        """
        if is_error(answer)[0]:
            if len(answer) > 1:
                type = answer[1]
                value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
                if answer[1] == REFERENCE_TYPE:
                    raise Py4JJavaError(
                        "An error occurred while calling {0}{1}{2}.\n".
>                       format(target_id, ".", name), value)
E                   py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
E                   : java.net.BindException: Cannot assign requested address: Service 'sparkDriver' failed after 16 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
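
As the exception suggests, one possible workaround (a sketch only, not verified on this runner; the two added keys are my assumption, not something the repo currently sets) would be to pin the driver bind address in the same config dict that conftest.py builds:

        with TemporaryDirectory(dir=tmp_path_factory.getbasetemp()) as td:
            config = {
                "spark.local.dir": td,
                "spark.sql.shuffle.partitions": 1,
                "spark.sql.crossJoin.enabled": "true",
                # Assumed workaround: bind the driver to loopback so it does not
                # depend on the runner's hostname resolution (see the BindException above).
                "spark.driver.host": "127.0.0.1",
                "spark.driver.bindAddress": "127.0.0.1",
            }
            spark = start_or_get_spark(app_name=app_name, url=url, config=config)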

@laserprec do you know if there was any change in the spark config?

@laserprec (Contributor)

@laserprec do you know if there was any change in the spark config?

Hmm, not that I am aware of. I am curious why it is trying to connect to "None.org.apache.spark.api.java.JavaSparkContext".

@miguelgfierro (Collaborator, Author)

Ugh, another error, this time in the GPU tests:

tests/unit/recommenders/models/test_deeprec_model.py .......             [ 14%]
tests/unit/recommenders/models/test_deeprec_utils.py ....                [ 22%]
tests/unit/recommenders/models/test_ncf_singlenode.py ..............     [ 50%]
tests/unit/recommenders/models/test_newsrec_model.py ....                [ 58%]
tests/unit/recommenders/models/test_newsrec_utils.py ....                [ 66%]
tests/unit/recommenders/models/test_rbm.py ...                           [ 72%]
tests/unit/recommenders/models/test_wide_deep_utils.py ...               [ 78%]
tests/unit/recommenders/utils/test_gpu_utils.py FFs..FF                  [ 92%]
tests/unit/recommenders/utils/test_tf_utils.py ....                      [100%]

=================================== FAILURES ===================================
______________________________ test_get_gpu_info _______________________________

    @pytest.mark.gpu
    def test_get_gpu_info():
>       assert len(get_gpu_info()) >= 1
E       assert 0 >= 1
E        +  where 0 = len([])
E        +    where [] = get_gpu_info()

tests/unit/recommenders/utils/test_gpu_utils.py:24: AssertionError
------------------------------ Captured log call -------------------------------
17:17:19 ERROR Call to cuInit results in UNKNOWN_CUDA_ERROR
_____________________________ test_get_number_gpus _____________________________

    @pytest.mark.gpu
    def test_get_number_gpus():
>       assert get_number_gpus() >= 1
E       assert 0 >= 1
E        +  where 0 = get_number_gpus()

tests/unit/recommenders/utils/test_gpu_utils.py:29: AssertionError
_____________________________ test_tensorflow_gpu ______________________________

    @pytest.mark.gpu
    def test_tensorflow_gpu():
>       assert tf.test.is_gpu_available()
E       AssertionError: assert False
E        +  where False = <function is_gpu_available at 0x7f0c0d752710>()
E        +    where <function is_gpu_available at 0x7f0c0d752710> = <module 'tensorflow._api.v2.test' from '/home/runner/work/recommenders/recommenders/.tox/gpu/lib/python3.7/site-packages/tensorflow/_api/v2/test/__init__.py'>.is_gpu_available
E        +      where <module 'tensorflow._api.v2.test' from '/home/runner/work/recommenders/recommenders/.tox/gpu/lib/python3.7/site-packages/tensorflow/_api/v2/test/__init__.py'> = tf.test

tests/unit/recommenders/utils/test_gpu_utils.py:51: AssertionError
------------------------------ Captured log call -------------------------------
17:17:20 WARNING From /home/runner/work/recommenders/recommenders/tests/unit/recommenders/utils/test_gpu_utils.py:51: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
_______________________________ test_pytorch_gpu _______________________________

    @pytest.mark.gpu
    def test_pytorch_gpu():
>       assert torch.cuda.is_available()
E       AssertionError: assert False
E        +  where False = <function is_available at 0x7f0ba6750950>()
E        +    where <function is_available at 0x7f0ba6750950> = <module 'torch.cuda' from '/home/runner/work/recommenders/recommenders/.tox/gpu/lib/python3.7/site-packages/torch/cuda/__init__.py'>.is_available
E        +      where <module 'torch.cuda' from '/home/runner/work/recommenders/recommenders/.tox/gpu/lib/python3.7/site-packages/torch/cuda/__init__.py'> = torch.cuda
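
For local debugging, a minimal sketch that runs the same availability checks outside pytest (it assumes tensorflow, torch, and numba are installed in the environment, and uses the non-deprecated TF API mentioned in the warning above; the numba call mirrors what the GPU utilities appear to rely on, see the later comments):

import tensorflow as tf
import torch
from numba import cuda

# Non-deprecated replacement for tf.test.is_gpu_available()
print("TF GPUs:", tf.config.list_physical_devices("GPU"))

# PyTorch CUDA check
print("Torch CUDA available:", torch.cuda.is_available())

# numba CUDA check (this is the layer that raised UNKNOWN_CUDA_ERROR above)
print("numba CUDA available:", cuda.is_available())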

@miguelgfierro (Collaborator, Author) commented Jan 11, 2022

Now the numba test that detects the GPU is failing:

tests/unit/examples/test_notebooks_gpu.py F......                        [100%]

=================================== FAILURES ===================================
_________________________________ test_gpu_vm __________________________________

    @pytest.mark.notebooks
    @pytest.mark.gpu
    def test_gpu_vm():
>       assert get_number_gpus() >= 1
E       assert 0 >= 1
E        +  where 0 = get_number_gpus()

tests/unit/examples/test_notebooks_gpu.py:18: AssertionError

The ADO test of the GPU notebooks on Python 3.6 passes: https://dev.azure.com/best-practices/recommenders/_build/results?buildId=55750&view=results
However, the GitHub Actions one on Python 3.7 is failing; could it be related to numba?

@miguelgfierro (Collaborator, Author)

The test passes on ADO, and when I install a Python 3.7 environment locally it just works:

$ pip list | grep -E 'numpy|numba|tensorflow'
numba                        0.54.1
numpy                        1.20.3
tensorflow                   2.7.0
tensorflow-estimator         2.7.0
tensorflow-io-gcs-filesystem 0.23.1

$ pytest tests/unit/recommenders/utils/test_gpu_utils.py::test_get_number_gpus
============================================================= slowest 10 durations ==============================================================
4.81s call     tests/unit/recommenders/utils/test_gpu_utils.py::test_get_number_gpus

(2 durations < 0.005s hidden.  Use -vv to show these durations.)
========================================================= 1 passed, 1 warning in 9.42s ==========================================================

@miguelgfierro (Collaborator, Author)

@laserprec I have been trying to debug the code; it looks like the problem happens only in GitHub Actions (see the messages above). Any idea where the problem could be?

@miguelgfierro mentioned this pull request on Jan 11, 2022
@laserprec (Contributor)

@laserprec I have been trying to debug the code; it looks like the problem happens only in GitHub Actions (see the messages above). Any idea where the problem could be?

I think we've seen this before, and it could be that the NVIDIA driver is not available on the GitHub Actions runner (perhaps because Ubuntu auto-updated the NVIDIA drivers and the machine needs a restart to apply the changes).
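
One quick way to confirm that hypothesis directly on the runner would be something like the hedged sketch below (it assumes nvidia-smi is on PATH, which is the case when the driver package is installed):

import subprocess

# nvidia-smi talks to the kernel driver, so it fails if the driver is not loaded,
# e.g. after an unattended driver update that still needs a reboot.
try:
    out = subprocess.run(["nvidia-smi"], capture_output=True, text=True, check=True)
    print(out.stdout)
except (FileNotFoundError, subprocess.CalledProcessError) as err:
    print("NVIDIA driver not usable on this runner:", err)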

@@ -236,6 +236,7 @@ def test_cornac_bpr_integration(


@pytest.mark.integration
@pytest.mark.experimental
Contributor

Haven't been following some of the latest code changes, but do we have this new pytest marker defined anywhere in our configuration?

Collaborator Author

Right now there is no pipeline for the experimental marker. As we improve the dependency installation, we will move some of those tests back into the normal pipelines (CPU, GPU, or Spark). Also, see #1606 (comment)
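
For reference, a minimal sketch of how such a marker is typically registered so pytest does not warn about it (I have not checked where or whether the repo declares it, and the description string here is an assumption):

# conftest.py (sketch)
def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "experimental: tests for experimental dependencies, run outside the main cpu/gpu/spark pipelines",
    )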

@laserprec (Contributor) left a comment

LGTM :), just a bit curious where this new pytest.mark.experimental is defined

@miguelgfierro (Collaborator, Author)

just a bit curious where this new pytest.mark.experimental is defined

This was a way to take out the dependencies that were causing conflicts in the pipeline; @anargyri can provide more context.

@miguelgfierro (Collaborator, Author) commented Jan 12, 2022

(screenshot: ADO test results, all pipelines passing)

All ADO tests are passing. The issue with the GitHub GPU test is fixed, thanks to @laserprec. The Spark test failure is a flaky test caused by a memory error, but the code works.

Merging

Preparing for release 🚀🚀🚀

@miguelgfierro merged commit 7af6edd into staging on Jan 12, 2022
@miguelgfierro deleted the miguel/pymanopt branch on January 12, 2022 at 18:28