# Python Version on Mac
- JupyterLab should start with the same Python that Apache Spark uses.
- On a Mac with Homebrew, you can relink Python as shown below. Homebrew installs all versions under the `/usr/local/Cellar` directory.
    - brew unlink python@3.9
    - brew unlink python@3.8
    - brew link --force python@3.9
- If JupyterLab still picks up a different Python version than the one linked above, uninstall JupyterLab and reinstall it. (FYI: JupyterLab fails to start under Python 3.7 with a traitlets error. In that case, run `pip3 install jupyter_server==1.13.1` and `pip3 install jupyter_client==7.1.0` from the command line. After this, JupyterLab starts fine.)
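
One way to confirm that the kernel and Spark agree on the interpreter is to print both from a notebook cell. This is only a sketch: `interpreter_report` is a hypothetical helper name, and it assumes the `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables are how Spark is told which Python to use.

```python
import os
import sys


def interpreter_report(environ=os.environ):
    """Return the interpreter running this kernel and the ones PySpark is told to use."""
    return {
        "kernel_executable": sys.executable,
        "kernel_version": "{}.{}".format(*sys.version_info[:2]),
        # Hypothetical placeholder text for variables that are not set
        "PYSPARK_PYTHON": environ.get("PYSPARK_PYTHON", "<not set>"),
        "PYSPARK_DRIVER_PYTHON": environ.get("PYSPARK_DRIVER_PYTHON", "<not set>"),
    }


for name, value in interpreter_report().items():
    print(f"{name}: {value}")
```

If `kernel_executable` and the two `PYSPARK_*` paths point at different installations, the relinking steps above are worth repeating.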

# Start Spark Cluster
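- From `$SPARK_HOME`, execute `./sbin/start-all.sh` (the `sbin` directory sits directly under `$SPARK_HOME`, not under `bin`). See the standalone mode documentation for more [options](https://spark.apache.org/docs/latest/spark-standalone.html) for passing parameters.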
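
Before pointing a session at the cluster, it can help to check that the master is actually listening. The sketch below uses only the standard library; `master_is_reachable` is a hypothetical helper, and it assumes the master uses port 7077, the standalone default and the port used later in this notebook.

```python
import socket


def master_is_reachable(host, port=7077, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts
        return False
```

For example, `master_is_reachable("192.168.0.109")` should return `True` once `start-all.sh` has brought the master up on that host. A successful TCP connection only shows that something is listening on the port, not that it is a healthy Spark master.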

In [None]:
# FYI: pip installs the whole Apache Spark distribution into the site-packages folder.
# https://spark.apache.org/docs/latest/api/python/getting_started/install.html
!pip3 install pyspark==3.3.0
!pip3 install openpyxl==3.0.9
# !pip3 install pyspark[pandas_on_spark]  # If this errors, install it from a shell instead

In [7]:
# Set the variables below in your shell profile
# export SPARK_HOME=/Users/pmacharl/spark-3.3.0-bin-hadoop3
# export PATH=$PATH:$SPARK_HOME/bin
# export PYSPARK_SUBMIT_ARGS="pyspark-shell"
# export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3
# export PYSPARK_PYTHON=/usr/local/bin/python3
# export PYARROW_IGNORE_TIMEZONE=1
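
A quick way to verify that the profile was picked up by the Jupyter kernel is to look for the variables from inside Python. This is a minimal sketch; `missing_spark_vars` and the choice of which variables count as required are assumptions made for illustration.

```python
import os

# Assumption: these three are treated as required; adjust to your setup
REQUIRED_VARS = ("SPARK_HOME", "PYSPARK_DRIVER_PYTHON", "PYSPARK_PYTHON")


def missing_spark_vars(environ=os.environ, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `environ`."""
    return [name for name in required if not environ.get(name)]


missing = missing_spark_vars()
print("Missing:", ", ".join(missing) if missing else "none")
```

If anything is reported missing, restart JupyterLab from a shell that has sourced the profile, since the kernel inherits its environment from the process that launched it.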

In [1]:
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

In [8]:
# https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkConf
config = SparkConf()
config.set("spark.driver.memory", "2g")
config.set("spark.executor.memory", "1g")

# Because you are likely running in local mode, it is good practice to set the number of shuffle partitions
# to something that fits local mode. The default is 200, but there aren't many executors
# on this machine, so it's worth reducing this to 5.
config.set("spark.sql.shuffle.partitions", "5")

# Cluster mode
# https://spark.apache.org/docs/latest/submitting-applications.html
config.setMaster("spark://192.168.0.109:7077")  # Do NOT set this if Spark is started in local mode

<pyspark.conf.SparkConf at 0x1105374d0>
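
The settings in the cell above can also be kept in a plain dictionary, which makes it easier to switch between local and cluster mode without editing several lines. The sketch below is one possible arrangement, not part of PySpark: `build_settings` is a hypothetical helper, and it relies on `spark.master` being the property that `setMaster` sets.

```python
def build_settings(master=None, shuffle_partitions=5):
    """Return Spark properties as a dict; `master` is only included when given."""
    settings = {
        "spark.driver.memory": "2g",
        "spark.executor.memory": "1g",
        # Spark properties are strings, so convert the partition count
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
    }
    if master:
        settings["spark.master"] = master
    return settings


for key, value in build_settings(master="spark://192.168.0.109:7077").items():
    print(key, "=", value)
```

The resulting pairs can be passed to `SparkConf().setAll(list(settings.items()))`. Leaving `master` as `None` keeps the master out of the configuration.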

In [3]:
# https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.SparkSession
# spark = SparkSession.builder.config(conf=config).master("local").appName("Analyzing Real Estate Sales").getOrCreate()

# Cluster mode
spark = SparkSession.builder.config(conf=config).master("spark://192.168.0.109:7077").appName("Reading Excel").getOrCreate()
# Local mode alternative:
# spark = SparkSession.builder.config(conf=config).master("local").appName("Reading Excel").getOrCreate()

In [4]:
spark

# Pandas-on-Spark API (formerly Koalas): data type conversion from Spark DataFrame types (left) to pandas dtypes (right)
```
tinyint                int8
decimal              object
float               float32
double              float64
integer               int32
long                  int64
short                 int16
timestamp    datetime64[ns]
string               object
boolean                bool
date                 object
```
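
The mapping above can be written as a lookup table, which is handy for predicting which columns will end up as generic `object` columns after conversion. This is only a transcription of the table into a dictionary; `SPARK_TO_PANDAS_DTYPE`, `pandas_dtype_for`, and `object_typed` are hypothetical names, not part of the pandas-on-Spark API.

```python
# Spark SQL type name -> pandas dtype name, copied from the table above
SPARK_TO_PANDAS_DTYPE = {
    "tinyint": "int8",
    "decimal": "object",
    "float": "float32",
    "double": "float64",
    "integer": "int32",
    "long": "int64",
    "short": "int16",
    "timestamp": "datetime64[ns]",
    "string": "object",
    "boolean": "bool",
    "date": "object",
}


def pandas_dtype_for(spark_type):
    """Look up the pandas dtype name for a Spark type name (case-insensitive)."""
    return SPARK_TO_PANDAS_DTYPE[spark_type.lower()]


def object_typed():
    """Return the Spark types that map to the generic pandas `object` dtype."""
    return [name for name, dtype in SPARK_TO_PANDAS_DTYPE.items() if dtype == "object"]


print(pandas_dtype_for("double"))
print(object_typed())
```

Columns of the types returned by `object_typed()` lose their specific dtype on the pandas side.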

In [10]:
# This will piggyback on the existing SparkSession in memory
import pyspark.pandas as ps
df = ps.read_excel("./data/columns.xlsx")
df

Unnamed: 0,RealEstData04182022.xlsx,RealEstData04192022.xlsx,RealEstData04172022.xlsx,RealEstData04162022.xlsx,RealEstData04152022.xlsx,RealEstData04132022.xlsx,RealEstData04122022.xlsx,RealEstData04142022.xlsx
0,OWNER1,OWNER1,OWNER1,OWNER1,OWNER1,OWNER1,OWNER1,OWNER1
1,OWNER2,OWNER2,OWNER2,OWNER2,OWNER2,OWNER2,OWNER2,OWNER2
2,Mailing_address1,Mailing_address1,Mailing_address1,Mailing_address1,Mailing_address1,Mailing_address1,Mailing_address1,Mailing_address1
3,Mailing_Address2,Mailing_Address2,Mailing_Address2,Mailing_Address2,Mailing_Address2,Mailing_Address2,Mailing_Address2,Mailing_Address2
4,Mailing_Address3,Mailing_Address3,Mailing_Address3,Mailing_Address3,Mailing_Address3,Mailing_Address3,Mailing_Address3,Mailing_Address3
5,REAL_ESTATE_ID,REAL_ESTATE_ID,REAL_ESTATE_ID,REAL_ESTATE_ID,REAL_ESTATE_ID,REAL_ESTATE_ID,REAL_ESTATE_ID,REAL_ESTATE_ID
6,CARD_NUMBER,CARD_NUMBER,CARD_NUMBER,CARD_NUMBER,CARD_NUMBER,CARD_NUMBER,CARD_NUMBER,CARD_NUMBER
7,NUMBER_OF_CARDS,NUMBER_OF_CARDS,NUMBER_OF_CARDS,NUMBER_OF_CARDS,NUMBER_OF_CARDS,NUMBER_OF_CARDS,NUMBER_OF_CARDS,NUMBER_OF_CARDS
8,Street_Number,Street_Number,Street_Number,Street_Number,Street_Number,Street_Number,Street_Number,Street_Number
9,Street_Prefix,Street_Prefix,Street_Prefix,Street_Prefix,Street_Prefix,Street_Prefix,Street_Prefix,Street_Prefix


In [9]:
# spark.catalog.clearCache()
spark.stop()