Add spark support for shipping conda environments and DRY http proxy setting propagation #22
Conversation
declerambaul
left a comment
Nice!
I tested with a simple example:

! pip install pillow
! pip install --ignore-installed --upgrade git+https://github.com/ottomata/wmfdata.git@conda

import pyspark.sql.functions as F
import wmfdata

spark = wmfdata.spark.get_session(ship_python_env=True)

df = spark.createDataFrame(data=[('no.jpg',)], schema=['f'])

@F.udf(returnType='string')
def fakeit(f):
    # PIL is only installed locally, so this only works on the
    # workers if the shipped conda env is actually being used there.
    from PIL import Image
    try:
        Image.open(f)
    except Exception as e:
        return f'{e}'
    return f

df = df.withColumn('t', fakeit(F.col('f')))
df.toPandas()
# 0  no.jpg  [Errno 2] No such file or directory: 'no.jpg'
Wow, actually, Luca just sent me https://github.com/criteo/cluster-pack, which looks maybe way nicer than my patch here. Let me try that.
Looked through the code, and while it does a lot more, essentially it is doing exactly what this patch does. cluster-pack looks like it should be able to detect if new dependencies are installed and only re-pack and upload if so; this patch makes you do that manually (by passing a force=True option). As is, cluster-pack won't work, as our stacked conda setup makes it so we need to pass … Let's proceed with this patch.
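For usage, a hedged sketch: ship_python_env=True is confirmed by the test above, but whether force is a keyword of get_session (rather than living elsewhere) is an assumption based only on this comment.

import wmfdata

# First call packs the current conda env and ships it to YARN workers.
spark = wmfdata.spark.get_session(ship_python_env=True)

# After installing new dependencies, re-pack and re-upload manually.
# Assumption: force=True is accepted by get_session; the comment above
# only says the patch takes "a force=True option".
spark = wmfdata.spark.get_session(ship_python_env=True, force=True)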
This helps automate shipping of conda environments to remote YARN workers when there are dependencies needed there that are not installed by default. It also properly sets up PYSPARK_PYTHON to use anaconda-wmf on remote YARN workers if the user chooses not to ship a conda environment of their own; anaconda-wmf is installed everywhere and is a better default for remote Python than the system-installed one. Bug: T272313
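For context, a minimal sketch of the general technique, not necessarily this patch's exact code: pack the active conda environment with conda-pack, ship the archive via spark.yarn.dist.archives, and point the executors' PYSPARK_PYTHON at the unpacked copy. The build_session wrapper and the anaconda-wmf path below are assumptions; the Spark properties themselves are standard Spark-on-YARN configuration.

import conda_pack
from pyspark.sql import SparkSession

def build_session(ship_python_env: bool) -> SparkSession:
    builder = SparkSession.builder.master("yarn")
    if ship_python_env:
        # Pack the active conda env into a relocatable archive.
        conda_pack.pack(output="env.tgz", force=True)
        # YARN unpacks the archive on each worker under the alias "env".
        builder = builder.config("spark.yarn.dist.archives", "env.tgz#env")
        builder = builder.config("spark.executorEnv.PYSPARK_PYTHON", "env/bin/python")
    else:
        # Fall back to anaconda-wmf rather than the system Python
        # (the path is an assumption).
        builder = builder.config(
            "spark.executorEnv.PYSPARK_PYTHON", "/usr/lib/anaconda-wmf/bin/python"
        )
    return builder.getOrCreate()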
This allows us to avoid repeating the WMF-specific hostnames and instead rely on configuration passed in. If using wmfdata in WMF Jupyter, the proxy settings will be set, propagated, and used both locally and on remote workers. If using wmfdata outside of WMF Jupyter, you have to make sure the local HTTP proxy environment variables are set.
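As a rough sketch of the technique, assuming the standard http_proxy/https_proxy environment variables: read the local settings from the environment and forward them to executors. spark.executorEnv.* is standard Spark configuration; the surrounding logic is an assumption, not this patch's code.

import os
from pyspark.sql import SparkSession

builder = SparkSession.builder
for var in ("http_proxy", "https_proxy", "no_proxy"):
    value = os.environ.get(var)
    if value:
        # Forward the locally configured proxy to remote YARN workers,
        # instead of hard-coding WMF hostnames.
        builder = builder.config(f"spark.executorEnv.{var}", value)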
nshahquinn
left a comment
Okay, I've taken a look and overall it seems good!
I read through all the conda bits, so I understand them at a basic level, but not enough to actually comment on the details. I feel comfortable trusting you, though :)
I did leave some minor comments about code formatting and the logging. Once you deal with those, feel free to merge it yourself.
findspark.init("/usr/lib/spark2")
from pyspark.sql import SparkSession
from wmfdata import conda
The top of __init__.py says that modules can only depend on utils to prevent circular import issues. Can you update that to say depending on conda is okay too?
Also, you may want to import conda there too so it's available after import wmfdata like all the other modules. But I doubt it's going to be used by any end users, so perhaps we should just leave it out? Up to you.
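For illustration, a hypothetical sketch of what that __init__.py import might look like; the exact module list is an assumption, not the repository's actual code.

# Hypothetical wmfdata/__init__.py; modules here should only depend
# on utils (and now conda) to avoid circular imports.
from wmfdata import conda, hive, mariadb, spark, utils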
ok. Lemme know if I did that right...
Also, thanks for testing this, @declerambaul! It made this review a lot easier 😊
This is what was done before my recent changes, and on second thought it is the better behavior. Even if not running Spark on YARN, it is likely that users will be using Spark to access data in HDFS, so we might as well always enforce kinit before using Spark.
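A minimal sketch of what enforcing kinit could look like, assuming a check via klist -s; the helper name and the error message are assumptions, not wmfdata's actual code.

import subprocess

def ensure_kerberos_ticket():
    # `klist -s` exits nonzero when there is no valid Kerberos ticket.
    if subprocess.run(["klist", "-s"]).returncode != 0:
        raise RuntimeError(
            "No valid Kerberos ticket found. Run kinit before using Spark."
        )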