In [1]:
from pathlib import Path
import os
from dotenv import load_dotenv

import pyspark
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

In [2]:
load_dotenv()

SPARK_MASTER_HOST = os.getenv("SPARK_MASTER_HOST")
SPARK_JARS = os.getenv("SPARK_JARS")
# config for connecting to GCS
conf = SparkConf() \
    .setMaster(SPARK_MASTER_HOST) \
    .setAppName('test_standalone') \
    .set("spark.jars", SPARK_JARS)

sc = SparkContext(conf=conf)

hadoop_conf = sc._jsc.hadoopConfiguration()

hadoop_conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
hadoop_conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")


23/02/27 14:53:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
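
Because `os.getenv` returns `None` for a variable that is not set, a missing `.env` entry may only surface later, when the `SparkConf` is built. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper introduced here for illustration, not part of the notebook, Spark, or python-dotenv.

```python
import os


def require_env(*names):
    """Return the values of the given environment variables as a dict.

    Raises one error that lists every variable that is missing or empty.
    `require_env` is a hypothetical helper, not a Spark or dotenv API.
    """
    values = {name: os.getenv(name) for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return values
```

Called right after `load_dotenv()`, for example as `env = require_env("SPARK_MASTER_HOST", "SPARK_JARS")`, it would stop the notebook before any Spark objects are created.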


Instantiate a Spark Session

In [3]:
spark = SparkSession.builder \
    .config(conf=sc.getConf()) \
    .getOrCreate()

Start the Spark master so that workers can attach to it. The master has to be running before the `SparkContext` above can connect to it.

```bash
cd $SPARK_HOME 
./sbin/start-master.sh
```

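Before running any jobs, attach at least one worker to the master. `$SPARK_MASTER_HOST` is passed as the master URL here, so it should hold the full URL (typically of the form `spark://<host>:7077`):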

```bash
cd $SPARK_HOME
./sbin/start-worker.sh $SPARK_MASTER_HOST \
    --memory 4G \
    --webui-port 8081 \
    --cores 2 \
    --dir $SPARK_HOME/work \
    --properties-file $SPARK_HOME/conf/spark-defaults.conf
```
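
Both `setMaster` and `start-worker.sh` expect a master URL rather than a bare hostname. As a small sanity check, the value can be validated before it is used. This is only a sketch: `parse_master_url` is a hypothetical helper, and it assumes the standalone `spark://<host>:<port>` form.

```python
from urllib.parse import urlparse


def parse_master_url(url):
    """Split a standalone master URL such as spark://host:7077 into (host, port).

    Hypothetical helper for illustration; raises ValueError on any other shape.
    """
    parsed = urlparse(url)
    if parsed.scheme != "spark" or not parsed.hostname or parsed.port is None:
        raise ValueError(f"Expected a URL like spark://<host>:<port>, got {url!r}")
    return parsed.hostname, parsed.port
```

For example, `parse_master_url(os.getenv("SPARK_MASTER_HOST"))` would reject a value such as `localhost:7077` that is missing the `spark://` scheme.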

In [5]:
DATA_LAKE = os.getenv('DATA_LAKE')
df_green = spark.read.parquet(f'gs://{DATA_LAKE}/data/parts/green/*/*')

Next, rebuild the `revenue_report` table: convert the `spark_sql` notebook into a script and, this time, submit it as a job to our standalone cluster.

### Set environment variables for executors

1. Pass it on the command line: `spark-submit --conf spark.executorEnv.SOME_ENV=SOME_VAL`

    The same property can also be set in the `$SPARK_HOME/conf/spark-defaults.conf` file.
1. Set it on the builder when creating the `SparkSession`

    ```python
    # Create SparkSession
    spark = SparkSession.builder \
           .appName('test') \
           .config("spark.executorEnv.SOME_ENVIRONMENT", "SOME_VALUE") \
           .getOrCreate()
    ```
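1. Set a plain Spark config property

    The value travels with the application's Spark configuration rather than as an OS environment variable; on the driver it can be read back with `spark.conf.get("SOME_ENVIRONMENT")`.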

    ```python
    # Create SparkSession
    spark = SparkSession.builder \
           .appName('test') \
           .config("SOME_ENVIRONMENT", "SOME_VALUE") \
           .getOrCreate()
    ```
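
When several variables need to reach the executors, the `spark.executorEnv.` prefix from the first two options can be applied in one place. The sketch below assumes a hypothetical helper, `executor_env_conf`, which only builds the key/value pairs and does not touch Spark itself.

```python
def executor_env_conf(env):
    """Map {"NAME": "value"} to {"spark.executorEnv.NAME": "value"}.

    Hypothetical helper for illustration; values are converted to strings.
    """
    return {f"spark.executorEnv.{name}": str(value) for name, value in env.items()}


# Usage with a builder (requires pyspark, shown as comments only):
#   builder = SparkSession.builder.appName("test")
#   for key, value in executor_env_conf({"SOME_ENVIRONMENT": "SOME_VALUE"}).items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
```

Inside a task, the variable would then be read with `os.environ`, as with any other environment variable.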
    
### Spark-submit

The Python file being submitted still needs to create a `SparkSession`; that is the entry point through which the script uses Spark resources.

```bash
PQ_YELLOW="gs://$DTC_DATA_LAKE/data/raw/yellow/*"
PQ_GREEN="gs://$DTC_DATA_LAKE/data/raw/green/*"
PQ_REPORT="gs://$DTC_DATA_LAKE/data/report/yg_monthly/"

spark-submit \
    --master "$SPARK_MASTER_HOST" \
    --jars "$SPARK_GCS_JAR" \
    spark_sql.py \
    -y "$PQ_YELLOW" \
    -g "$PQ_GREEN" \
    -O "$PQ_REPORT"
```
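
The contents of `spark_sql.py` are not shown here, so the following is only a sketch of how such a script might be laid out to accept the `-y`, `-g` and `-O` arguments used above. The long option names, the app name `revenue_report`, and the function names are assumptions; the actual transformations from the notebook are left as a placeholder.

```python
import argparse


def parse_args(argv=None):
    """Parse the three paths passed after `spark_sql.py` on the command line."""
    parser = argparse.ArgumentParser(description="Build the revenue report")
    # Short flags match the spark-submit call; the long names are assumptions.
    parser.add_argument("-y", "--yellow", required=True, help="yellow input path")
    parser.add_argument("-g", "--green", required=True, help="green input path")
    parser.add_argument("-O", "--output", required=True, help="report output path")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)

    # Imported here so argument parsing can be exercised without Spark installed.
    from pyspark.sql import SparkSession

    # Master and jars are supplied by spark-submit, so they are not set here.
    spark = SparkSession.builder.appName("revenue_report").getOrCreate()

    df_yellow = spark.read.parquet(args.yellow)
    df_green = spark.read.parquet(args.green)

    # ... the transformations from the spark_sql notebook would go here,
    # ending with a write to args.output ...

    spark.stop()
```

In the real script, `main()` would be called under an `if __name__ == "__main__":` guard. Leaving the master out of the builder keeps the same file usable both locally and on the standalone cluster.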

Before running that command, export these environment variables:

- `$DTC_DATA_LAKE`
- `$SPARK_MASTER_HOST`
- `$SPARK_GCS_JAR`