
misc(Python): Replace python 2 with python 3 #1341

Merged

Conversation

@Ennosigaeon (Contributor) commented Jan 15, 2021

Python 2 reached EOL last year and should not be used anymore. This commit replaces all references to the "python" binary with the more explicit "python3" binary. If desired, the build can still be performed for Python 2 by setting the PYTHON_EXECUTABLE environment variable to an appropriate version. Additionally, Python wheels are the preferred way to distribute Python code (see https://packaging.python.org/discussions/wheel-vs-egg/), so this commit replaces the job-server-python egg with a wheel.
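
For reference, a minimal sketch of how a wheel is typically produced with setuptools (the PR's build scripts may differ in detail; this assumes the "wheel" package is installed):

    # Run from the directory containing setup.py; the wheel lands in dist/*.whl
    python3 setup.py bdist_wheel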

Pull Request checklist

  • The commit message(s) follow the contribution guidelines?
  • Tests for the changes have been added (for bug fixes / features)?
  • Docs have been added / updated (for bug fixes / features)?

Other information:
If your python3 executable points to a Python >= 3.8, the SubprocessSpec fails, as Spark 2.4 does not support Python > 3.7. To resolve this issue, set PYTHON_EXECUTABLE to a Python version < 3.8. I have added a corresponding comment in the spec.
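
A minimal sketch of that workaround, assuming a Python 3.7 interpreter at the path shown and the job-server-python subproject name (adjust both to your setup):

    # Hypothetical path; any Python < 3.8 avoids the Spark 2.4 limitation
    export PYTHON_EXECUTABLE=/usr/local/bin/python3.7
    sbt job-server-python/test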



@Ennosigaeon force-pushed the replace_python2_with_python3 branch 3 times, most recently from 5ba4b97 to 04032d5 on January 19, 2021
@noorul (Contributor) commented Jan 20, 2021

I don't think we can stop supporting python 2 all of a sudden.

@Ennosigaeon (Contributor, Author)

This PR does not drop support for Python 2 at all; it just changes the default Python version used in the test cases. If tests should still be executed with Python 2, the env variable PYTHON_EXECUTABLE can be used. The only "visible" change for users is the assembly of a jobserver wheel in addition to the current egg (and some minor documentation changes).

@bsikander (Contributor) left a comment

In file src/python/sparkjobserver/subprocess.py, do we need to add something like a WHEELPATH?

Resolved review threads: job-server-python/src/python/build.cmd (two outdated threads), job-server-python/src/python/run-tests.cmd (outdated), job-server-python/src/python/run-tests.sh
@bsikander (Contributor)

> It just changes the default Python version used in the test cases

I am not sure about this statement. PythonContextFactory.scala uses the python.executable config variable, which has now changed.

@Ennosigaeon (Contributor, Author) commented Jan 20, 2021

@bsikander: Yes, you are correct, my comment was incorrect.

Regarding the WHEELPATH: I think it is desirable that wheel files can be uploaded to the job-server in addition to egg files. However, this change is rather large and should maybe be done in a separate PR.

@bsikander (Contributor)

> Regarding the WHEELPATH: I think it is desirable that wheel files can be uploaded to the job-server in addition to egg files. However, this change is rather large and should maybe be done in a separate PR.

I agree

@Ennosigaeon force-pushed the replace_python2_with_python3 branch 2 times, most recently from d5029f2 to b08d169 on January 21, 2021
@bsikander (Contributor) left a comment

I tried to run a simple word count but am getting the following error. My Python version is 3.7.9; my Spark is 2.4.7.

....
TypeError: an integer is required (got type bytes)
....
org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
....

As per your comment, I set PYTHON_EXECUTABLE directly before the command and also tried to put it in my config/<env>.sh, but I still get the same error.

Here are the commands that I have been using:

# Create a Python session context
curl -X POST  "localhost:8090/contexts/py-context?context-factory=spark.jobserver.python.PythonSessionContextFactory"
# Upload the examples wheel as binary "py_bin"
curl --data-binary @./job-server-python/target/python/sjs_python_examples-0.10.1_SNAPSHOT-py3-none-any.whl -H 'Content-Type: application/python-archive' localhost:8090/binaries/py_bin
# Run the word-count job synchronously
curl -d 'input.strings = ["a", "b", "a", "b" ]' "localhost:8090/jobs?appName=py_bin&classPath=example_jobs.word_count.WordCountSparkSessionJob&context=py-context&sync=true" | jq

Any ideas?

Resolved review threads: job-server-python/src/python/build.sh, build.sbt (outdated)
@Ennosigaeon (Contributor, Author) commented Jan 22, 2021

@bsikander The PYTHON_EXECUTABLE variable is only used for testing. If you want to change the binary for actual usage, you have to change python.executable in the application.conf. The error you see is typical for Spark 2 with Python >= 3.8. I assume python3 points to such a Python. Is that correct?

I think I should also note somewhere in the docs that Python 3.8 is not supported, for users encountering the same issues you have right now.

Edit: I have added a comment in the application.conf as well as a troubleshooting section in python.md.

Edit 2: I just noticed that you uploaded a wheel file instead of an egg. I am not sure whether this will work, as I have not tested it (but it would actually be really cool if it just works out of the box).
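
For anyone following along, a minimal sketch of the application.conf override mentioned above, assuming the context-settings layout shown later in this thread (the interpreter path is hypothetical):

context-settings {
  python {
    # Point this at a Python < 3.8 when running against Spark 2.4
    executable = "/usr/local/bin/python3.7"
  }
}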

@codecov-io commented Jan 22, 2021

Codecov Report

Merging #1341 (5ac3fe1) into master (15f457d) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #1341   +/-   ##
=======================================
  Coverage   80.89%   80.89%           
=======================================
  Files          96       96           
  Lines        3973     3973           
  Branches      203      203           
=======================================
  Hits         3214     3214           
  Misses        759      759           


@bsikander (Contributor) commented Jan 22, 2021

> If you want to change the binary for actual usage, you have to change python.executable in the application.conf

OK. My python.executable is already pointing to a Python 3.7.6 version, but I still had this error.

> I just noticed that you uploaded a wheel file instead of an egg. I am not sure whether this will work, as I have not tested it (but it would actually be really cool if it just works out of the box).

I tried both and got the same error. Interestingly, in both cases the job at least got deployed, so it seems that jobserver is not doing any checks on wheels/eggs.

@Ennosigaeon (Contributor, Author)

Hm, weird. I just tried the following:

  1. Start sbt in a first shell:
job-server-python/buildPyExamples
~job-server-extras/reStart                   # (with absolute path to the log4j config, for some reason)
  2. In a second shell:
# Upload the examples egg as binary "py_bin"
curl --data-binary @./job-server-python/target/python/sjs_python_examples-0.10.1_SNAPSHOT-py3.8.egg -H 'Content-Type: application/python-archive' localhost:8090/binaries/py_bin
# Create a Python session context
curl -X POST  "localhost:8090/contexts/py-context?context-factory=spark.jobserver.python.PythonSessionContextFactory"
# Run the word-count job synchronously
curl -d 'input.strings = ["a", "b", "a", "b" ]' "localhost:8090/jobs?appName=py_bin&classPath=example_jobs.word_count.WordCountSparkSessionJob&context=py-context&sync=true" | jq

with output

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   108  100    71  100    37     21     11  0:00:03  0:00:03 --:--:--    32
{
  "jobId": "5f17f9c4-8463-4878-bf90-93383b21ea14",
  "result": {
    "a": 2,
    "b": 2
  }
}

Could you please verify that you are really using Python 3.7 in the running application? A simple check would be adding println(sc.pythonExecutable) at PythonJob line 65.
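
A minimal sketch of that debug check, assuming the placement referenced above (line numbers may have shifted):

// Hypothetical placement in PythonJob.scala, near line 65
println(s"Python executable in use: ${sc.pythonExecutable}")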

What makes this even stranger: with the exception of the application.conf, I have only changed code affecting the build or tests. You override the application.conf with your own settings, so my changes should have no effect at all. Can you execute a job using the current master?

@bsikander (Contributor)

Update:

  • Current master with python2: works.
  • Current master with the executable set to /usr/local/bin/python3.7: works.
  • This PR with the executable set to /usr/local/bin/python3.7 and the examples egg: works.
  • This PR with the explicit executable and the examples wheel: works.

Somehow, if I just use python as the executable and my python points to 3.7, it does not work, but it seems to be a local machine problem.

@Ennosigaeon (Contributor, Author)

Cool that wheels work. Maybe I will find time this week to look a bit more at wheels and provide a small PR with code maintenance/doc changes if everything really just works.

@bsikander (Contributor)

@noorul The change looks fine. It can work with python2 also. Any concerns?

@bsikander (Contributor)

Steps to build and use the Python jobserver. Writing them here to save time for others.

  • export PYTHON_EXECUTABLE=/usr/local/bin/python2.7, or whatever path your python points to. For Python 3, you can skip this step, as the default is already python3.
  • Build the python package: sbt buildPython (depending upon PYTHON_EXECUTABLE, the package will be built for either Python 2 or 3).
  • Build the examples wheel and egg files: sbt buildPyExamples
  • Modify your <env>.conf file to point to the newly built python package, and set the python binary that you want to use in the executable setting. Here we are pointing to 2.7, but if you built the jobserver for Python 3, use a Python 3 binary:
context-settings {
  .......
  python {
    paths = [
      ${SPARK_HOME}/python,
      "/path/to/jobserver/job-server-python/target/python/spark_jobserver_python-0.10.1_SNAPSHOT-py2-none-any.whl"
    ]

    # The default value in application.conf is "python"
    executable = "/usr/local/bin/python2.7"
  }
}
  • Deploy the jobserver: ./bin/server_deploy.sh <env>
  • Inside your deploy folder, start the jobserver: ./server_start.sh
  • Create a context:
curl -X POST "localhost:8090/contexts/py-context?context-factory=spark.jobserver.python.PythonSessionContextFactory"
  • Upload a test python binary (you can upload a wheel or an egg):
curl --data-binary @/path/to/jobserver/job-server-python/target/python/sjs_python_examples-0.10.1_SNAPSHOT-py3.7.egg -H 'Content-Type: application/python-archive' localhost:8090/binaries/py_bin_egg
  • Run a job (asynchronously, since sync=true is not set; see the status check below):
curl -d 'input.strings = ["a", "b", "a", "b" ]' "localhost:8090/jobs?appName=py_bin_egg&classPath=example_jobs.word_count.WordCountSparkSessionJob&context=py-context"
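
Since the last command submits the job asynchronously, the result has to be fetched separately; a minimal sketch assuming the standard jobserver REST API, with <jobId> as a placeholder for the id returned by the POST above:

# Poll the job status/result (hypothetical <jobId> placeholder)
curl "localhost:8090/jobs/<jobId>"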

Python 2 reached EOL last year and should not be used anymore. This commit replaces all references to the "python" binary with the more explicit "python3" binary. If desired, the build can still be performed for Python 2 by setting the PYTHON_EXECUTABLE environment variable to an appropriate version. Additionally, Python wheels are the preferred way to distribute Python code (see https://packaging.python.org/discussions/wheel-vs-egg/). This commit additionally builds the job-server-python wheel.

Spark 2.4 does not support Python >= 3.8 (see apache/spark#26194), leading to failed test cases (TypeError: an integer is required (got type bytes)). If you encounter these issues, specify a Python executable < 3.8 explicitly.
@Ennosigaeon (Contributor, Author)

Ok, I messed up the commits. Should be fixed now.

@noorul (Contributor) commented Jan 25, 2021

> @noorul The change looks fine. It can work with python2 also. Any concerns?

@bsikander Looks good to me, since we are documenting this as a breaking change.

@bsikander bsikander merged commit 22a731e into spark-jobserver:master Jan 26, 2021
@Ennosigaeon deleted the replace_python2_with_python3 branch on January 26, 2021 10:24