Does the community consider run dataprep eda on Yarn? #771

Bowen0729 · 2021-12-15T05:45:32Z

My company uses the Hadoop ecosystem for big data, which means we have a Yarn cluster but no Dask cluster. As far as I know, Dask can run on Yarn. I recently tried running dataprep on Yarn, and it worked well. Would the community consider supporting dataprep on Yarn? We could work on it together.

dovahcrow · 2021-12-15T18:03:48Z

Hi @Bowen0729, that is good news to hear. We always want DataPrep to have more ecosystem integrations. May I ask if you can write down how you run DataPrep on the Yarn cluster, so that we can convert it into a page in our documentation?

Bowen0729 · 2021-12-16T02:34:41Z

Sure, @dovahcrow!

Install dask-yarn with pip:

pip install dask-yarn

Ensure that the libraries used on the Yarn cluster are the same as the ones you are using locally.

Use conda-pack to package a conda environment:

conda-pack

Then upload the archive to HDFS:

hdfs:///mypath/archive.tar.gz
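
The packaging and upload steps above could be scripted roughly as follows. This is only a sketch under a few assumptions: the environment name `my_env` and the helper functions are hypothetical, the `conda-pack` and `hdfs` command-line tools are assumed to be on the PATH, and the HDFS destination is the example path from the steps above. By default it only prints the commands, so nothing runs until `dry_run=False` is passed.

```python
import shlex
import subprocess


def pack_command(env_name, archive):
    """Build the conda-pack command that archives a named conda environment."""
    return ["conda-pack", "-n", env_name, "-o", archive]


def upload_command(archive, hdfs_path):
    """Build the HDFS command that uploads the archive, overwriting an old copy."""
    return ["hdfs", "dfs", "-put", "-f", archive, hdfs_path]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


commands = [
    # "my_env" is a hypothetical environment name; replace it with your own.
    pack_command("my_env", "archive.tar.gz"),
    upload_command("archive.tar.gz", "hdfs:///mypath/archive.tar.gz"),
]

# Dry run: prints the two commands without executing them.
run_all(commands)
```

Calling `run_all(commands, dry_run=False)` would execute both commands and stop at the first one that fails.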
Run dataprep eda with the following:

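from dask_yarn import YarnCluster
from dataprep.eda import create_report
from dask.distributed import Client
import dask.dataframe as dd

# Start a Dask cluster on Yarn, using the packaged conda environment
cluster = YarnCluster(environment='archive.tar.gz', worker_memory='10GiB', worker_vcores=4, scheduler_memory='1GiB')

# Request four workers and connect a client to the cluster
cluster.scale(4)
client = Client(cluster)

# Read the data from HDFS as a Dask DataFrame
ddf = dd.read_parquet('hdfs:///data-path/data.parquet')

# Generate the EDA report
create_report(ddf)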

Spark supports multiple cluster managers, such as standalone, Mesos, Hadoop Yarn, and Kubernetes. I think dataprep, being based on Dask, can handle big data, which is its advantage over other frameworks. So should dataprep eda support other cluster managers as well? That would make it friendlier to big data scenarios. What do you think?

If it is necessary, we can discuss how to design dataprep on Yarn; perhaps users could choose the running mode.
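
To make the "running mode" idea concrete, here is a minimal sketch of what such a switch might look like. It is purely illustrative: `make_client` is a hypothetical helper, not part of DataPrep or Dask, and it assumes `dask.distributed` is installed (plus `dask-yarn` for the Yarn mode). The imports are deferred so that a local-only user does not need dask-yarn.

```python
def make_client(mode="local", **cluster_kwargs):
    """Return a dask.distributed Client for the requested mode.

    Hypothetical helper for illustration only; not an existing DataPrep API.
    """
    if mode not in ("local", "yarn"):
        raise ValueError(f"unknown mode {mode!r}; expected 'local' or 'yarn'")

    # Import lazily so that the Yarn dependency is only needed in Yarn mode.
    from dask.distributed import Client

    if mode == "yarn":
        from dask_yarn import YarnCluster
        cluster = YarnCluster(**cluster_kwargs)
    else:
        from dask.distributed import LocalCluster
        cluster = LocalCluster(**cluster_kwargs)

    return Client(cluster)
```

With this helper, the Yarn case would be `client = make_client("yarn", environment="archive.tar.gz", worker_memory="10GiB", worker_vcores=4)`, followed by `create_report(ddf)` as before. Creating the client registers it as the default scheduler for later Dask computations, so the rest of the script would not need to change between modes.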

If it is not necessary, I will open a PR for the dataprep-on-Yarn documentation after you have verified the steps. It would be my pleasure to be a contributor to dataprep.

jinglinpeng · 2021-12-21T00:55:53Z

Hi @Bowen0729, thanks a lot for the detailed steps! Currently we do not have enough people to make dataprep work on Yarn, which would need a lot of optimization and testing.

It would be very nice if you could add the doc for Yarn! You could add a section about Yarn in this file: https://github.com/sfu-db/dataprep/blob/develop/docs/source/installation.rst and then open a PR. Thanks for being a contributor to dataprep!

Bowen0729 · 2021-12-21T10:25:50Z

Thank you for the reply, @jinglinpeng.
I will finish it in the next few days!

Bowen0729 added the type: enhancement New feature or request label Dec 15, 2021

Bowen0729 assigned dovahcrow Dec 15, 2021

Bowen0729 changed the title ~~Does the community consider run dataprep on Yarn?~~ Does the community consider run dataprep eda on Yarn? Dec 15, 2021

Bowen0729 mentioned this issue Dec 23, 2021

add the doc of run dataprep.eda on Hadoop yarn #774

Merged

jinglinpeng linked a pull request Dec 25, 2021 that will close this issue

add the doc of run dataprep.eda on Hadoop yarn #774

Merged

jinglinpeng closed this as completed in #774 Dec 25, 2021

Bowen0729 mentioned this issue Mar 18, 2022

Can I define which characters should be missing value? #852

Open
