Add spark support for shipping conda environments and DRY http proxy setting propagation #22
Conversation
declerambaul
left a comment
Nice!
I tested with a simple example:

! pip install pillow
! pip install --ignore-installed --upgrade git+https://github.com/ottomata/wmfdata.git@conda

import pyspark.sql.functions as F
import wmfdata

spark = wmfdata.spark.get_session(ship_python_env=True)

df = spark.createDataFrame(data=[('no.jpg',)], schema=['f'])

@F.udf(returnType='string')
def fakeit(f):
    # PIL is only installed locally, so this only works on the
    # workers if the shipped conda env is actually being used there.
    from PIL import Image
    try:
        Image.open(f)
    except Exception as e:
        return f'{e}'
    return f

df = df.withColumn('t', fakeit(F.col('f')))
df.toPandas()
# 0  no.jpg  [Errno 2] No such file or directory: 'no.jpg'
Wow, actually, Luca just sent me https://github.com/criteo/cluster-pack, which looks maybe way nicer than my patch here. Let me try that.
Looked through the code, and while it does a lot more, essentially it is doing exactly what this patch does. cluster-pack looks like it should be able to detect if new dependencies are installed and only re-pack and upload if so; this patch makes you do that manually (by passing a force=True option). As is, cluster-pack won't work, as our stacked conda setup makes it so we need to pass … Let's proceed with this patch.
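For usage, a hedged sketch: ship_python_env=True is confirmed by the test above, but whether force is a keyword of get_session (rather than living elsewhere) is an assumption based only on this comment.

import wmfdata

# First call packs the current conda env and ships it to YARN workers.
spark = wmfdata.spark.get_session(ship_python_env=True)

# After installing new dependencies, re-pack and re-upload manually.
# Assumption: force=True is accepted by get_session; the comment above
# only says the patch takes "a force=True option".
spark = wmfdata.spark.get_session(ship_python_env=True, force=True)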
This helps automate shipping of conda environments to remote YARN workers when there are dependencies needed there that are not installed by default. It also properly sets up PYSPARK_PYTHON to use anaconda-wmf on remote YARN workers if the user chooses not to ship a conda environment of their own; anaconda-wmf is installed everywhere and is a better default for remote Python than the system-installed one. Bug: T272313
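For context, a minimal sketch of the general technique, not necessarily this patch's exact code: pack the active conda environment with conda-pack, ship the archive via spark.yarn.dist.archives, and point the executors' PYSPARK_PYTHON at the unpacked copy. The build_session wrapper and the anaconda-wmf path below are assumptions; the Spark properties themselves are standard Spark-on-YARN configuration.

import conda_pack
from pyspark.sql import SparkSession

def build_session(ship_python_env: bool) -> SparkSession:
    builder = SparkSession.builder.master("yarn")
    if ship_python_env:
        # Pack the active conda env into a relocatable archive.
        conda_pack.pack(output="env.tgz", force=True)
        # YARN unpacks the archive on each worker under the alias "env".
        builder = builder.config("spark.yarn.dist.archives", "env.tgz#env")
        builder = builder.config("spark.executorEnv.PYSPARK_PYTHON", "env/bin/python")
    else:
        # Fall back to anaconda-wmf rather than the system Python
        # (the path is an assumption).
        builder = builder.config(
            "spark.executorEnv.PYSPARK_PYTHON", "/usr/lib/anaconda-wmf/bin/python"
        )
    return builder.getOrCreate()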
This allows us to avoid repeating the WMF-specific hostnames and instead rely on configuration passed in. If using wmfdata in WMF Jupyter, the proxy settings will be set, propagated, and used both locally and on remote workers. If using wmfdata outside of WMF Jupyter, you have to make sure the local HTTP proxy environment variables are set.
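As a rough sketch of the technique, assuming the standard http_proxy/https_proxy environment variables: read the local settings from the environment and forward them to executors. spark.executorEnv.* is standard Spark configuration; the surrounding logic is an assumption, not this patch's code.

import os
from pyspark.sql import SparkSession

builder = SparkSession.builder
for var in ("http_proxy", "https_proxy", "no_proxy"):
    value = os.environ.get(var)
    if value:
        # Forward the locally configured proxy to remote YARN workers,
        # instead of hard-coding WMF hostnames.
        builder = builder.config(f"spark.executorEnv.{var}", value)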
nshahquinn
left a comment
Okay, I've taken a look and overall it seems good!
I read through all the conda bits, so I understand them at a basic level, but not enough to actually comment on the details. I feel comfortable trusting you, though :)
I did leave some minor comments about code formatting and the logging. Once you deal with those, feel free to merge it yourself.
findspark.init("/usr/lib/spark2")
from pyspark.sql import SparkSession
from wmfdata import conda
The top of __init__.py says that modules can only depend on utils to prevent circular import issues. Can you update that to say depending on conda is okay too?
Also, you may want to import conda there too so it's available after import wmfdata like all the other modules. But I doubt it's going to be used by any end users, so perhaps we should just leave it out? Up to you.
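For illustration, a hypothetical sketch of what that __init__.py import might look like; the exact module list is an assumption, not the repository's actual code.

# Hypothetical wmfdata/__init__.py; modules here should only depend
# on utils (and now conda) to avoid circular imports.
from wmfdata import conda, hive, mariadb, spark, utils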
ok. Lemme know if I did that right...
Also, thanks for testing this, @declerambaul! It made this review a lot easier 😊
This is what was done before my recent changes, and on second thought it is the better behavior. Even if not running Spark on YARN, it is likely that users will be using Spark to access data in HDFS, so we might as well always enforce kinit before using Spark.
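A minimal sketch of what enforcing kinit could look like, assuming a check via klist -s; the helper name and the error message are assumptions, not wmfdata's actual code.

import subprocess

def ensure_kerberos_ticket():
    # `klist -s` exits nonzero when there is no valid Kerberos ticket.
    if subprocess.run(["klist", "-s"]).returncode != 0:
        raise RuntimeError(
            "No valid Kerberos ticket found. Run kinit before using Spark."
        )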