Does the community consider run dataprep eda on Yarn? #771

Closed
Bowen0729 opened this issue Dec 15, 2021 · 4 comments · Fixed by #774

Comments

@Bowen0729
Contributor

My company uses the Hadoop ecosystem for big data, which means we have a Yarn cluster but no Dask cluster. As far as I know, Dask can run on Yarn. I recently tried running dataprep on Yarn, and it worked well. So, would the community consider supporting dataprep on Yarn? We could work on it together.

@Bowen0729 Bowen0729 added the type: enhancement New feature or request label Dec 15, 2021
@Bowen0729 Bowen0729 changed the title Does the community consider run dataprep on Yarn? Does the community consider run dataprep eda on Yarn? Dec 15, 2021
@dovahcrow
Member

Hi @Bowen0729, that is good news. We always want DataPrep to have more ecosystem integrations. Could you write down how you run DataPrep on the Yarn cluster, so that we can turn it into a page in our documentation?

@Bowen0729
Contributor Author

Sure, @dovahcrow!

  1. Install dask-yarn with pip

    pip install dask-yarn

  2. Ensure that the libraries used on the Yarn cluster are the same as the ones you use locally.

    Package your conda environment with conda-pack:
    conda-pack

    Then upload the archive to HDFS, for example:
    hdfs:///mypath/archive.tar.gz

  3. Run dataprep eda with the following:

    from dask_yarn import YarnCluster
    from dataprep.eda import create_report
    from dask.distributed import Client
    import dask.dataframe as dd

    # Start a Dask cluster on Yarn using the packaged conda environment
    cluster = YarnCluster(environment='archive.tar.gz', worker_memory='10GiB', worker_vcores=4, scheduler_memory='1GiB')

    # Request 4 workers and connect a client to the cluster
    cluster.scale(4)
    client = Client(cluster)
    ddf = dd.read_parquet('hdfs:///data-path/data.parquet')

    create_report(ddf)
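
Step 2 above could also be scripted from Python. The following is only a rough sketch, assuming the `conda-pack` package is installed and the `hdfs` command-line tool is on the PATH. The helper names `hdfs_put_command` and `pack_and_upload`, and the environment name in the usage line, are made up for illustration.

```python
import subprocess
from typing import List


def hdfs_put_command(local_path: str, hdfs_path: str) -> List[str]:
    """Build the `hdfs dfs -put` command that uploads the packed archive."""
    # -f overwrites an existing file at the destination.
    return ["hdfs", "dfs", "-put", "-f", local_path, hdfs_path]


def pack_and_upload(env_name: str, archive: str, hdfs_path: str) -> None:
    """Pack a conda environment and copy the archive to HDFS."""
    import conda_pack  # imported lazily; needs `pip install conda-pack`

    # Write the named conda environment to a relocatable archive.
    conda_pack.pack(name=env_name, output=archive)
    # Upload the archive; raises CalledProcessError if the upload fails.
    subprocess.run(hdfs_put_command(archive, hdfs_path), check=True)
```

For example, `pack_and_upload("my-env", "archive.tar.gz", "hdfs:///mypath/archive.tar.gz")` would pack a (hypothetical) environment called `my-env` and upload it to the path used in step 2.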

Spark supports multiple cluster managers, such as standalone, Mesos, Hadoop Yarn, and Kubernetes. I think the fact that dataprep is built on Dask and can handle big data is its advantage over other frameworks. So should dataprep eda support other cluster managers as well? That would make it friendlier to big data scenarios. What do you think?

If it is necessary, we can discuss how to design dataprep on Yarn; perhaps users could choose the running mode.
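
To make the "running mode" idea a bit more concrete, here is a minimal sketch of what such a switch might look like. Nothing here is an existing dataprep API: `make_client`, `yarn_options`, `SUPPORTED_MODES`, and `YARN_DEFAULTS` are hypothetical names, and the default values are simply copied from the `YarnCluster` call in step 3.

```python
from typing import Any, Dict

# Hypothetical defaults, copied from the YarnCluster call in step 3.
YARN_DEFAULTS: Dict[str, Any] = {
    "environment": "archive.tar.gz",
    "worker_memory": "10GiB",
    "worker_vcores": 4,
    "scheduler_memory": "1GiB",
}

SUPPORTED_MODES = ("local", "yarn")


def yarn_options(**overrides: Any) -> Dict[str, Any]:
    """Merge user overrides into the default YarnCluster options."""
    return {**YARN_DEFAULTS, **overrides}


def make_client(mode: str = "local", n_workers: int = 4, **overrides: Any):
    """Return a dask.distributed Client for the requested running mode."""
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"unknown running mode: {mode!r}")

    # Imported lazily so that validation works without Dask installed.
    from dask.distributed import Client

    if mode == "local":
        # Client without an address starts a local cluster.
        return Client(n_workers=n_workers)

    from dask_yarn import YarnCluster

    cluster = YarnCluster(**yarn_options(**overrides))
    cluster.scale(n_workers)
    return Client(cluster)
```

With something like this, a user would call `make_client("yarn", worker_vcores=8)` before `create_report(ddf)`, while `make_client("local")` keeps the current single-machine behaviour.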

If it is not necessary, I will open a PR for the dataprep-on-Yarn documentation after you have verified the steps. It would be my pleasure to contribute to dataprep.

@jinglinpeng
Contributor

Hi @Bowen0729, thanks a lot for the detailed steps! Currently we do not have enough people to make dataprep work on Yarn, which would need a lot of optimization and testing.

It would be very nice if you could add the doc for Yarn! You could add a section about Yarn in this file: https://github.com/sfu-db/dataprep/blob/develop/docs/source/installation.rst and then open a PR. Thanks for being a contributor of dataprep!

@Bowen0729
Contributor Author

Thank you for the reply, @jinglinpeng.
I will finish it in the next few days!
