Add pickle protocol option support for Spark #3001

Merged
merged 5 commits into from Jan 11, 2022

Contributor

@gunnsoo gunnsoo commented Sep 19, 2020

Description

The spark module can stop working when the Python version differs between the client and the Spark cluster. The pickle module's dump and dumps methods use a default protocol that changes between Python versions. If a Luigi task that inherits from PySparkTask is pickled on the client and then unpickled on the Spark cluster, and the default pickle protocol differs between the two, pickle.load() raises an AttributeError.
The default pickle protocol is documented here: https://docs.python.org/3/library/pickle.html#pickle-protocols
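
To make the protocol dependence concrete, here is a minimal sketch that prints the interpreter's defaults and pins the protocol explicitly when serializing. The `task_state` dictionary is a hypothetical stand-in for a pickled task, not the object Luigi actually serializes.

```python
import pickle

# The default differs by interpreter version
# (for example 3 on Python 3.7 and 4 on Python 3.8).
print("default protocol:", pickle.DEFAULT_PROTOCOL)
print("highest protocol:", pickle.HIGHEST_PROTOCOL)

# Hypothetical stand-in payload, used only for illustration.
task_state = {"task_family": "Sample", "params": {}}

# Pin the protocol explicitly instead of relying on the interpreter default.
pinned = pickle.dumps(task_state, protocol=2)

# Pickles written with protocol 2 or later start with the PROTO opcode (0x80)
# followed by one byte that holds the protocol number.
assert pinned[0] == 0x80 and pinned[1] == 2

# Any interpreter that understands protocol 2 can read the payload back.
restored = pickle.loads(pinned)
```

Running the first two print statements on both the client and the cluster is a quick way to see whether the defaults disagree.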

This change lets users specify the pickle protocol version in the Luigi configuration file to match their Spark cluster environment.
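
The idea of reading the protocol from configuration and falling back to the interpreter default can be sketched with the standard library alone. The section name `spark`, the option name `pickle-protocol`, and the helper `resolve_pickle_protocol` are assumptions made for illustration; check the merged diff for the names Luigi actually uses.

```python
import configparser
import pickle


def resolve_pickle_protocol(config_text, section="spark", option="pickle-protocol"):
    """Return the configured protocol, or the interpreter default if unset.

    The section and option names are hypothetical, not confirmed Luigi settings.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # fallback covers both a missing section and a missing option.
    return parser.getint(section, option, fallback=pickle.DEFAULT_PROTOCOL)


# A config that pins the protocol for an older cluster.
pinned = resolve_pickle_protocol("[spark]\npickle-protocol = 2\n")

# No setting at all: behave exactly as before the change.
unpinned = resolve_pickle_protocol("")

payload = pickle.dumps({"example": True}, protocol=pinned)
```

Falling back to `pickle.DEFAULT_PROTOCOL` keeps existing setups unchanged unless a user opts in.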

Motivation and Context

I was trying to run a Sample task that inherits from PySparkTask, submitting it from a client (Python 3.8 installed) to a Spark cluster (Python 3.6 installed).

import luigi
from luigi.contrib.spark import PySparkTask


class Sample(PySparkTask):

    def main(self, sc, *args):
        print(luigi.__version__)


if __name__ == '__main__':
    luigi.run()

Then I got the following error.

Traceback (most recent call last):
  File "pyspark_runner.py", line 135, in <module>
    _get_runner_class()(*sys.argv[1:]).run()
  File "pyspark_runner.py", line 103, in __init__
    self.job = pickle.load(fd)
AttributeError: Can't get attribute 'Sample' on <module '__main__' from 'pyspark_runner.py'>

This can happen when the default pickle protocol differs between client and cluster, for example a client on Python 3 with a cluster on Python 2, or a client on 3.7 with a cluster on 3.8.
In my case I can't change the cluster's Python version, and I think other people are in a similar situation.
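
When the cluster's Python version is fixed, one way to choose a value is to take the highest protocol that both interpreters can read. The sketch below encodes the versions in which each protocol was introduced (protocol 2 in Python 2.3, 3 in 3.0, 4 in 3.4, 5 in 3.8); the helper names are hypothetical, and the code assumes Python 2.3 or later on the Python 2 side.

```python
def highest_protocol_for(version):
    """Highest pickle protocol readable by a (major, minor) Python version."""
    major, minor = version
    if major == 2:
        return 2  # assumes Python 2.3 or later
    if (major, minor) >= (3, 8):
        return 5
    if (major, minor) >= (3, 4):
        return 4
    return 3  # Python 3.0 to 3.3


def common_protocol(client, cluster):
    """Highest protocol that both the client and the cluster can read."""
    return min(highest_protocol_for(client), highest_protocol_for(cluster))


# Versions mentioned in this description.
print(common_protocol((3, 8), (3, 6)))
print(common_protocol((3, 7), (2, 7)))
```

This gives an upper bound only; any lower protocol is also readable by both sides.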

Have you tested this? If so, how?

I ran my jobs with this code and it works for me.

Collaborator

@dlstadther dlstadther left a comment

lgtm

stale bot commented Jan 9, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If closed, you may revisit when your time allows and reopen! Thank you for your contributions.

@stale stale bot added the wontfix label Jan 9, 2022
@dlstadther dlstadther requested a review from a team as a code owner January 11, 2022 02:08
@stale stale bot removed wontfix labels Jan 11, 2022
@dlstadther
Collaborator

Apologies for the massive delay. Stalebot commented on a bunch of Luigi PRs, and that reminded me that I had approved some PRs and never merged them. I'm going back through them now, updating them with master, and confirming that all their checks pass.

@dlstadther dlstadther merged commit 42df63f into spotify:master Jan 11, 2022