# Method 1: pyspark not installed, use findspark
- Let findspark locate Spark for you
- The downloaded folder `spark-3.0.0-bin-hadoop2.7` contains a `python` directory that includes pyspark
- Call `findspark.init()` to make pyspark importable as a regular library
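
As a rough illustration of what making pyspark importable involves, the sketch below computes the entries that would need to be on `sys.path` for a given Spark folder: its `python` directory and the py4j zip bundled under `python/lib`. It approximates the idea behind `findspark.init()` and is not its actual implementation; the helper name `spark_python_paths` is hypothetical.

```python
import glob
import os


def spark_python_paths(spark_home):
    """Return the sys.path entries that expose the pyspark bundled in spark_home."""
    spark_python = os.path.join(spark_home, "python")
    # The Spark download ships py4j as a zip under python/lib.
    py4j_zips = sorted(glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip")))
    return [spark_python] + py4j_zips


if __name__ == "__main__":
    # Assumes SPARK_HOME points at the unpacked Spark download.
    home = os.environ.get("SPARK_HOME")
    if home:
        print(spark_python_paths(home))
    else:
        print("SPARK_HOME is not set")
```

Prepending these entries to `sys.path` before `import pyspark` is roughly what lets Python pick up the bundled package without a `pip install`.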

In [None]:
# The Spark download already contains pyspark; findspark locates and initializes it,
# so pyspark does not need to be installed explicitly
!pip3 install findspark==1.4.2

In [12]:
# Set the variables below in the shell profile
# export SPARK_HOME=/Users/pmacharl/spark-3.0.0-bin-hadoop2.7
# export PATH=$PATH:$SPARK_HOME/bin
# export PYSPARK_SUBMIT_ARGS="pyspark-shell"
# export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3
# export PYSPARK_PYTHON=/usr/local/bin/python3
# export PYSPARK_DRIVER_PYTHON_OPTS=notebook
# export PYARROW_IGNORE_TIMEZONE=1
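
A notebook kernel only sees these variables if it was started from a shell that had already loaded the profile. The sketch below is one way to check that from inside the notebook; the helper name `missing_vars` and the choice of which variables to treat as required are assumptions made for illustration.

```python
import os

# Names taken from the shell-profile exports listed above.
REQUIRED_VARS = ["SPARK_HOME", "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON"]


def missing_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars(REQUIRED_VARS)
    print("missing:", missing if missing else "none")
```

If anything is reported missing, restarting the notebook server from a fresh shell is usually the first thing to try.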

In [4]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
sparkSession = SparkSession.builder.appName("some-example").getOrCreate()
sparkSession

In [2]:
sparkSession.stop()

# Method 2: Install pyspark, don't use findspark

In [1]:
!pip3 install pyspark==3.0.0

In [2]:
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

config = SparkConf()
config.set("spark.driver.memory", "2g")
config.set("spark.executor.memory", "1g")

spark = SparkSession.builder.config(conf=config).master("local").appName("some-app").getOrCreate()
spark

In [3]:
spark.stop()

# Method 3: Install pyspark, explicitly start a standalone Spark server (cluster mode) and connect to it
- Cluster mode lets you scale the Spark server independently
- Spark ships with a default (standalone) cluster manager out of the box. Other cluster managers such as k8s and Mesos can be integrated; that is a separate topic for large, hyperscale environments
- From `$SPARK_HOME` execute `./sbin/start-all.sh`. See the [standalone mode docs](https://spark.apache.org/docs/latest/spark-standalone.html) for more options for passing parameters
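
Before pointing a session at the standalone master, it can help to confirm that something is listening at the master address. The sketch below does a plain TCP check against a `spark://host:port` URL, falling back to 7077, the default standalone master port. The helper name `master_reachable` and the example address are hypothetical, and a successful connection only shows that the port is open, not that the listener is a Spark master.

```python
import socket
from urllib.parse import urlparse


def master_reachable(master_url, timeout=2.0):
    """Return True if a TCP connection to the host and port in master_url succeeds."""
    parsed = urlparse(master_url)
    host = parsed.hostname
    port = parsed.port or 7077  # default standalone master port
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Hypothetical address; replace it with the master URL of your own cluster.
    print(master_reachable("spark://127.0.0.1:7077"))
```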

In [11]:
# Set the variables below in the shell profile
# export SPARK_HOME=/Users/pmacharl/spark-3.0.0-bin-hadoop2.7
# export PATH=$PATH:$SPARK_HOME/bin
# export PYSPARK_SUBMIT_ARGS="pyspark-shell"
# export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3
# export PYSPARK_PYTHON=/usr/local/bin/python3
# export PYSPARK_DRIVER_PYTHON_OPTS=notebook

In [None]:
# Instead of the install below, you could use the pyspark package that ships in the $SPARK_HOME folder, but then
# you have to set the PYTHONPATH variable so that python looks for packages in additional places
# and not just the site-packages folder.
# For example:
# cd spark-3.3.0-bin-hadoop3
# export SPARK_HOME=`pwd`
# export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH

!pip3 install pyspark==3.0.0

In [8]:
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

config = SparkConf()
config.set("spark.driver.memory", "2g")
config.set("spark.executor.memory", "1g")
config.setMaster("spark://192.168.0.4:7077")

# Do not call .master("local") here: it would override the standalone master URL set on the config
spark = SparkSession.builder.config(conf=config).appName("some-app").getOrCreate()
spark

In [9]:
spark.stop()