# Before you begin

1. Set up **gcloud** by following [this guide](https://29022131.atlassian.net/wiki/spaces/DP/pages/1006174505/JupyterHub+-+End-user+Guide#JupyterHub-End-userGuide-GCloudsetup).
2. Set up **GitHub** by following [this guide](https://29022131.atlassian.net/wiki/spaces/DP/pages/1006174505/JupyterHub+-+End-user+Guide#JupyterHub-End-userGuide-Githubsetup).
3. Set up **rs-sdk** and learn how to store and retrieve your secret key (bare keys are not allowed) by following [this guide](https://29022131.atlassian.net/wiki/spaces/DATA/pages/1010736425/RS-SDK+User+Guide).

# Using PySpark to access S3 with a secret key managed securely through rs-sdk

#### Import necessary packages

In [1]:
from pyspark import SparkContext
from pyspark.sql import SparkSession

In [2]:
import pyspark.sql.functions as sf

from pyspark.sql.types import *
from pyspark.sql import DataFrame, Window

#### Import rs-sdk

In [3]:
from rs.keys import GateKeeper
import os

#### Set Environment Variable for Service Account

In [4]:
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/absolute/path/to/service-account.json'
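A mistyped path in this variable tends to surface only later, when a client first tries to authenticate. The following is a minimal sketch of a guard that checks the path before exporting it. The helper name `set_service_account` is hypothetical and not part of rs-sdk.

```python
import os
from pathlib import Path


def set_service_account(key_path: str) -> str:
    """Export GOOGLE_APPLICATION_CREDENTIALS after basic sanity checks.

    Hypothetical helper for illustration; not part of rs-sdk.
    """
    path = Path(key_path).expanduser()
    if not path.is_absolute():
        raise ValueError(f"expected an absolute path, got: {key_path}")
    if not path.is_file():
        raise FileNotFoundError(f"service account file not found: {path}")
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(path)
    return str(path)
```

Calling `set_service_account('/absolute/path/to/service-account.json')` would then replace the direct assignment above.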

#### Initialize a GateKeeper object to manage your secrets stored in Cloud Storage within your GCP project

In [5]:
gk = GateKeeper('tvlk-my-project-id')

#### Get the encrypted secret using rs-sdk
Refer to the [RS-SDK User Guide](https://29022131.atlassian.net/wiki/spaces/DATA/pages/1010736425/RS-SDK+User+Guide) for details.

In [6]:
acckey = gk.get_secret(keyring_id="my-keyring", key_id="my-key", storage_bucket="tvlk-mybucket", file_path="mykey/my_access_key_id.enc").decode('utf-8')
accsecret = gk.get_secret(keyring_id="my-keyring", key_id="my-key", storage_bucket="tvlk-mybucket", file_path="mykey/my_secret_access_key.enc").decode('utf-8')
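The two `get_secret` calls differ only in `file_path`, so they can be wrapped in a small helper. This is a sketch that assumes `get_secret` takes the keyword arguments shown in the cell above and returns bytes. The names `AwsCredentials` and `fetch_aws_credentials` are hypothetical, and the `.strip()` is an added precaution against a trailing newline in the stored key file.

```python
from typing import NamedTuple


class AwsCredentials(NamedTuple):
    access_key_id: str
    secret_access_key: str


def fetch_aws_credentials(gk, keyring_id, key_id, storage_bucket,
                          access_key_path, secret_key_path):
    """Decrypt both halves of an AWS key pair through gk.get_secret.

    Hypothetical helper; `gk` is any object exposing get_secret with the
    keyword arguments used in this guide.
    """
    def _get(file_path):
        raw = gk.get_secret(
            keyring_id=keyring_id,
            key_id=key_id,
            storage_bucket=storage_bucket,
            file_path=file_path,
        )
        # Decode the bytes and drop surrounding whitespace such as a newline.
        return raw.decode("utf-8").strip()

    return AwsCredentials(_get(access_key_path), _get(secret_key_path))
```

With the values from the cell above, usage would look like `creds = fetch_aws_credentials(gk, "my-keyring", "my-key", "tvlk-mybucket", "mykey/my_access_key_id.enc", "mykey/my_secret_access_key.enc")`.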


#### Locate your resources in S3

In [7]:
writeBucket = "tvlk-data-mybucket"
myprefix = "my-playground"
filename = "00/part*"
sample_data_path = "s3a://%s:%s@%s/%s/%s" % (acckey, accsecret, writeBucket, myprefix, filename)
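Embedding the key pair in the `s3a://` URL can leak it into logs and error messages, and it breaks when the secret contains a `/`. Depending on the Hadoop version, such URLs may also be rejected outright. An alternative worth considering is to keep the path credential-free and pass the keys through the S3A properties `fs.s3a.access.key` and `fs.s3a.secret.key`, which Spark forwards to Hadoop when they are prefixed with `spark.hadoop.`. The sketch below uses hypothetical helper names (`s3a_path`, `s3a_spark_conf`) and is not the method this guide was written with.

```python
def s3a_path(bucket: str, prefix: str, filename: str) -> str:
    """Build an S3A path that carries no credentials."""
    return f"s3a://{bucket}/{prefix}/{filename}"


def s3a_spark_conf(access_key: str, secret_key: str) -> dict:
    """Return Spark config entries that hand the key pair to the S3A connector."""
    return {
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


# Usage sketch (requires pyspark, so it is left as comments here):
# builder = SparkSession.builder.appName("sample-workload")
# for key, value in s3a_spark_conf(acckey, accsecret).items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
# sample_data_path = s3a_path(writeBucket, myprefix, filename)
```

These settings are applied when the session is created, so they may not take effect if `getOrCreate()` returns a session that already exists.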

#### Initialize SparkSession

In [8]:
spark = SparkSession\
            .builder\
            .appName("sample-workload")\
            .getOrCreate()

#### Load data from S3
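This example loads Avro data. The `avro` format is provided by the external `spark-avro` module, so that module must be available in your Spark environment.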

In [9]:
sample_data = spark\
                    .read\
                    .format("avro")\
                    .load(sample_data_path)

#### Apply your ETL logic

In [10]:
filtered_data = (
        sample_data
        .where(sf.col('auth_level') == 'IDENTIFIED')
    )

In [11]:
count_row = filtered_data.count()

#### Get the results

In [12]:
print("The number of rows in the filtered data is:")
print(count_row)
print("It is supposed to be 11169.")

The number of rows in the filtered data is:
11169
It is supposed to be 11169.
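Comparing the printed count with the expected value by eye is easy to get wrong. As a small sketch, the check can be turned into an explicit assertion so the notebook fails loudly on a mismatch. The helper name `check_row_count` is hypothetical.

```python
def check_row_count(actual: int, expected: int) -> int:
    """Raise if the row count differs from the expected value, else return it."""
    if actual != expected:
        raise AssertionError(
            f"row count mismatch: got {actual}, expected {expected}"
        )
    return actual
```

In this notebook the call would be `check_row_count(count_row, 11169)`.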
