
Development Guide

Getting Started on Spark development (spark-branch!)

  1. Installing system requirements (Spark, Java, Anaconda)

Mac

Linux or Windows with WSL

export SPARK_VERSION=3.2.0
export SPARK_DIRECTORY=/opt/spark
export HADOOP_VERSION=2.7

# /opt is usually root-owned, so create the directory with sudo and hand it to your user
sudo mkdir -p ${SPARK_DIRECTORY}
sudo chown ${USER} ${SPARK_DIRECTORY}

# Install a JDK supported by Spark 3.2 (Java 8)
sudo apt-get update
sudo apt-get -y install openjdk-8-jdk

# Download the Spark distribution and unpack it to ${SPARK_DIRECTORY}/spark
curl https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
  --output ${SPARK_DIRECTORY}/spark.tgz
cd ${SPARK_DIRECTORY} && tar -xvzf spark.tgz && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark
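
To double-check this step before moving on, a quick script like the sketch below (hypothetical, not part of the repo; run it with python3) confirms that Java is on the PATH and that the Spark distribution was unpacked where the commands above put it.

# Hypothetical sanity check, not part of the repo: confirm Java is installed and
# the Spark distribution was unpacked to ${SPARK_DIRECTORY}/spark.
import os
import subprocess

spark_home = "/opt/spark/spark"  # SPARK_DIRECTORY from above, plus the "spark" folder created by mv
print("Spark unpacked:", os.path.isdir(spark_home))

# `java -version` prints its banner to stderr
result = subprocess.run(["java", "-version"], capture_output=True, text=True)
print(result.stderr.strip())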
  2. Installing Python library requirements in a conda env. Pull [spark-branch](https://github.com/ydataai/pandas-profiling/tree/spark-branch) and run
conda env create -f venv/spark.yml

This creates a conda environment for Spark called spark-env, with all requirements installed.

Then activate the environment with

source activate spark-env
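
Optionally, you can confirm that the activated environment can start a local Spark session before moving on. The sketch below is hypothetical and assumes spark-env ships pyspark (which the steps above imply).

# Hypothetical sanity check: start and stop a local Spark session inside spark-env.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("spark-env-check")
    .getOrCreate()
)
print(spark.version)  # expect something close to SPARK_VERSION (3.2.0) from step 1
spark.stop()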
  3. Finally, run the command below, which should execute and produce a profiling report for some sample Spark data:
python tests/backends/spark_backend/example.py

Don’t worry about any errors you see for now, as long as the report builds properly.
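
For reference, profiling your own Spark data follows the same shape as the example script. The sketch below is a rough illustration, not the actual script: it assumes the spark-branch lets ProfileReport consume a Spark DataFrame directly, and the column names and rows are made up.

# Rough sketch, not the actual example script. Assumes the spark-branch's
# ProfileReport accepts a Spark DataFrame directly; the data below is made up.
from pyspark.sql import SparkSession
from pandas_profiling import ProfileReport

spark = SparkSession.builder.master("local[*]").appName("profiling-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1, "a", 10.0), (2, "b", 20.5), (3, None, 30.1)],
    schema=["id", "label", "value"],
)

report = ProfileReport(df, title="Spark DataFrame profile")
report.to_file("spark_profile.html")

spark.stop()

report.to_file writes a standalone HTML report that you can open in a browser.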
