
Conversation

@ottomata (Collaborator) commented Mar 1, 2021

This helps automate shipping of conda environments
to remote YARN workers when dependencies are needed
there that are not installed by default.

This also properly sets up PYSPARK_PYTHON to use
anaconda-wmf on remote YARN workers if the user
chooses not to ship a conda environment of their own.
anaconda-wmf is installed everywhere and is a better
default for remote Python than the system-installed one.

Bug: T272313
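
For context, a minimal sketch of the general mechanism, assuming conda-pack and Spark's documented YARN settings (the output path and the "environment" archive alias are illustrative, not wmfdata's actual values or API):

import conda_pack
from pyspark.sql import SparkSession

# Pack the current conda environment into a tarball (path is illustrative).
env_tarball = "/tmp/conda_env.tar.gz"
conda_pack.pack(output=env_tarball)

spark = (
    SparkSession.builder
    .master("yarn")
    # Ship the packed environment to the YARN workers; it is unpacked under the
    # alias "environment" in each container's working directory.
    .config("spark.yarn.dist.archives", f"{env_tarball}#environment")
    # Point the executors' Python at the interpreter inside the shipped env.
    .config("spark.executorEnv.PYSPARK_PYTHON", "environment/bin/python")
    .getOrCreate()
)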

@declerambaul left a comment

Nice!

I tested with a simple example:

! pip install pillow
! pip install --ignore-installed --upgrade git+https://github.com/ottomata/wmfdata.git@conda

import wmfdata
from pyspark.sql import functions as F  # needed for F.udf and F.col below

spark = wmfdata.spark.get_session(ship_python_env=True)
df = spark.createDataFrame(data=[('no.jpg',)], schema=['f'])

# UDF that runs on the executors and imports Pillow, which is only available
# in the shipped conda environment.
@F.udf(returnType='string')
def fakeit(f):
    from PIL import Image
    try:
        Image.open(f)
    except Exception as e:
        return f'{e}'
    return f

df = df.withColumn('t', fakeit(F.col('f')))
df.toPandas()

# 0	no.jpg	[Errno 2] No such file or directory: 'no.jpg'

@ottomata (Collaborator, Author) commented Mar 2, 2021

Wow actually, Luca just sent me:

https://github.com/criteo/cluster-pack

That looks like it might be way nicer than my patch here. Let me try that.

@ottomata (Collaborator, Author) commented Mar 2, 2021

https://github.com/criteo/cluster-pack

I looked through the code, and while it does a lot more, it is essentially doing exactly what this patch does. cluster-pack looks like it should be able to detect whether new dependencies have been installed and only re-pack and upload if so; this patch makes you do that manually (by passing a force=True option).

As is, cluster-pack won't work for us: our stacked conda setup means we need to pass ignore_editable_packages=True to conda_pack, and cluster-pack doesn't currently have an option to pass that through.
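
For reference, a sketch of the kind of conda-pack call this implies, using conda_pack's Python API (the output path is illustrative, and this is not wmfdata's actual code):

import conda_pack

# Pack the active (stacked) conda environment. ignore_editable_packages=True is
# needed because the stacked setup leaves editable packages that conda-pack
# would otherwise refuse to pack.
conda_pack.pack(
    output="/tmp/conda_env.tar.gz",  # illustrative path
    ignore_editable_packages=True,
    force=True,  # overwrite any existing archive at the output path
)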

Let's proceed with this patch.

@ottomata changed the title from "Add spark support for shipping conda environments" to "Add spark support for shipping conda environments and DRY http proxy setting propagation" on Mar 4, 2021
This allows us to not repeat the WMF-specific hostnames
and instead rely on configuration passed in.

If using wmfdata in WMF Jupyter, the proxy settings will
be set, propagated, and used both locally and on the remote workers.

If using wmfdata outside of WMF Jupyter, you have to make sure the
local http proxy environment variables are set.
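
A sketch of what that propagation can look like, assuming the standard http_proxy / https_proxy / no_proxy environment variables and Spark's spark.executorEnv.* settings (illustrative only, not wmfdata's exact code):

import os
from pyspark.sql import SparkSession

builder = SparkSession.builder.master("yarn")

# Forward whichever proxy variables are set locally to the remote executors,
# rather than hard-coding WMF-specific proxy hostnames.
for var in ("http_proxy", "https_proxy", "no_proxy"):
    value = os.environ.get(var)
    if value:
        builder = builder.config(f"spark.executorEnv.{var}", value)

spark = builder.getOrCreate()
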
@nshahquinn (Member) left a comment

Okay, I've taken a look and overall it seems good!

I read through all the conda bits, so I understand them at a basic level, but not enough to actually comment on the details. I feel comfortable trusting you, though :)

I did leave some minor comments about code formatting and the logging. Once you deal with those, feel free to merge it yourself.

findspark.init("/usr/lib/spark2")
from pyspark.sql import SparkSession

from wmfdata import conda
@nshahquinn (Member) commented

The top of __init__.py says that modules can only depend on utils, to prevent circular import issues. Can you update that to say depending on conda is okay too?

Also, you may want to import conda there too so it's available after import wmfdata like all the other modules. But I doubt it's going to be used by any end users, so perhaps we should just leave it out? Up to you.

@ottomata (Collaborator, Author) replied

ok. Lemme know if I did that right...

@nshahquinn (Member) commented

Also, thanks for testing this, @declerambaul! It made this review a lot easier 😊

This is what was done before my recent changes, and on second thought
is the better behavior.  Even if not running Spark in YARN, it is
likely that users will be using Spark to access data in HDFS, so
we might as well just always enforce kinit before using Spark.
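
As an illustration of that kind of check (a sketch, not wmfdata's actual helper; `klist -s` exits non-zero when there is no valid Kerberos ticket):

import subprocess

def ensure_kinit():
    # `klist -s` is silent and signals via its exit status whether a valid
    # Kerberos ticket cache exists.
    if subprocess.run(["klist", "-s"]).returncode != 0:
        raise RuntimeError(
            "No valid Kerberos ticket found; run `kinit` before starting a Spark session."
        )

ensure_kinit()
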
@nshahquinn merged commit 6250c74 into wikimedia:master on Mar 10, 2021