# Before you begin

1. Set up **gcloud** by following [this guide](https://29022131.atlassian.net/wiki/spaces/DP/pages/1006174505/JupyterHub+-+End-user+Guide#JupyterHub-End-userGuide-GCloudsetup).
2. Set up **GitHub** by following [this guide](https://29022131.atlassian.net/wiki/spaces/DP/pages/1006174505/JupyterHub+-+End-user+Guide#JupyterHub-End-userGuide-Githubsetup).

# Use GCS Connector through PySpark

### Create Spark Context

In [1]:
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext()

### Set Up GCS Project and Service Account

In [2]:
sc._jsc.hadoopConfiguration().set("fs.gs.project.id", "GCS_PROJECT_ID")
sc._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.json.keyfile", "ABSOLUTE_PATH_TO_SERVICE_ACCOUNT")

# Example
# sc._jsc.hadoopConfiguration().set("fs.gs.project.id", "tvlk-data-dev")
# sc._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.json.keyfile", "/home/jovyan/config/gcloud-svc-acc/test.json")
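The two settings above can also be collected in one place and checked before they are applied. The following is a minimal sketch, not part of the original notebook: the helper names `gcs_hadoop_settings` and `apply_hadoop_settings` are hypothetical, and the absolute-path check is an assumption based on the `ABSOLUTE_PATH_TO_SERVICE_ACCOUNT` placeholder.

```python
import os


def gcs_hadoop_settings(project_id, keyfile_path):
    """Return the two Hadoop properties used above as a dict (hypothetical helper)."""
    # Assumption: the key file must be given as an absolute path.
    if not os.path.isabs(keyfile_path):
        raise ValueError("keyfile_path must be an absolute path: " + keyfile_path)
    return {
        "fs.gs.project.id": project_id,
        "google.cloud.auth.service.account.json.keyfile": keyfile_path,
    }


def apply_hadoop_settings(hadoop_conf, settings):
    """Call set(key, value) on the given configuration object for each setting."""
    for key, value in settings.items():
        hadoop_conf.set(key, value)


# Possible usage in the notebook, with the example values shown above:
# apply_hadoop_settings(
#     sc._jsc.hadoopConfiguration(),
#     gcs_hadoop_settings("tvlk-data-dev", "/home/jovyan/config/gcloud-svc-acc/test.json"),
# )
```

Keeping the settings in a dict makes it easy to print or log them before the first read from GCS.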

### Build Spark Session

In [3]:
spark = SparkSession.builder\
    .config(conf=sc.getConf())\
    .getOrCreate()
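As an alternative to setting the Hadoop configuration through `sc._jsc`, the same two properties can be passed to the session builder with the `spark.hadoop.` prefix, which Spark copies into the Hadoop configuration. The sketch below is an illustration rather than the notebook's method: `configure_gcs_builder` is a hypothetical helper, and it assumes the properties are set before any Spark context exists, because `getOrCreate()` reuses a context that is already running.

```python
def configure_gcs_builder(builder, project_id, keyfile_path):
    """Attach the GCS settings to a SparkSession builder (hypothetical helper)."""
    return (
        builder
        # Properties prefixed with "spark.hadoop." are forwarded to Hadoop.
        .config("spark.hadoop.fs.gs.project.id", project_id)
        .config(
            "spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            keyfile_path,
        )
    )


# Possible usage in a fresh kernel, before any SparkContext is created:
# from pyspark.sql import SparkSession
# spark = configure_gcs_builder(
#     SparkSession.builder, "GCS_PROJECT_ID", "ABSOLUTE_PATH_TO_SERVICE_ACCOUNT"
# ).getOrCreate()
```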

### Load data into dataframe

In [4]:
df_testfile = spark.read.csv("gs://rm_demo_data/wine_classification.csv")

In [5]:
df_testfile.show()

+--------------------+-----+
|                 _c0|  _c1|
+--------------------+-----+
|              review|label|
|<START> this film...|    1|
|<START> big hair ...|    0|
|<START> this has ...|    0|
|<START> the <UNK>...|    1|
|<START> worst mis...|    0|
|<START> begins be...|    0|
|<START> lavish pr...|    1|
|<START> the <UNK>...|    0|
|<START> just got ...|    1|
|<START> this movi...|    0|
|<START> french ho...|    1|
|<START> when i re...|    0|
|<START> i love ch...|    0|
|<START> anyone wh...|    0|
|<START> b movie a...|    0|
|<START> a total w...|    0|
|<START> laputa ca...|    1|
|<START> at the he...|    1|
|<START> i have on...|    0|
+--------------------+-----+
only showing top 20 rows
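
In the output above, the header line (`review`, `label`) shows up as the first data row and the columns are named `_c0` and `_c1`, because `spark.read.csv` does not treat the first line as a header by default. A minimal sketch of reading the same file with the header applied follows; `read_csv_with_header` is a hypothetical helper name, and it assumes the first line of the file holds column names.

```python
def read_csv_with_header(spark, path):
    """Read a CSV whose first line holds column names (hypothetical helper)."""
    # header=True uses the first line as column names;
    # inferSchema=True asks Spark to guess column types instead of reading strings.
    return spark.read.csv(path, header=True, inferSchema=True)


# Possible usage in the notebook:
# df_testfile = read_csv_with_header(spark, "gs://rm_demo_data/wine_classification.csv")
# df_testfile.printSchema()
```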



# Use GCS through Python Client

_Note:_ Unless you have set up gcloud to use a service account, this tutorial uses your own gcloud credentials to access the required resources. If you would rather use a service account, follow [this guide](https://29022131.atlassian.net/wiki/spaces/DP/pages/1006174505/JupyterHub+-+End-user+Guide#JupyterHub-End-userGuide-Usinggoogleserviceaccount).

### Import Python GCS client

In [3]:
from google.cloud import storage

### Create GCS client

In [4]:
storage_client = storage.Client()
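
If you prefer to point the client at a service account key file directly, the client library provides `storage.Client.from_service_account_json`. The following is a rough sketch under that assumption; `make_storage_client` is a hypothetical helper, and the linked guide may recommend a different setup.

```python
import os


def make_storage_client(keyfile_path=None):
    """Return a GCS client, optionally using a service account key file."""
    if keyfile_path is not None and not os.path.isfile(keyfile_path):
        raise FileNotFoundError("Service account key file not found: " + keyfile_path)

    # Imported here so the path check above runs before the client library loads.
    from google.cloud import storage

    if keyfile_path is None:
        # Same as the cell above: rely on the default gcloud credentials.
        return storage.Client()
    return storage.Client.from_service_account_json(keyfile_path)


# Possible usage in the notebook:
# storage_client = make_storage_client("/home/jovyan/config/gcloud-svc-acc/test.json")
```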



### Set bucket name

In [5]:
# Name of the existing bucket to read from
bucket_name = 'rm_demo_data'

### Download sample data

In [7]:
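# Object name in the bucket and the local path to save it to
source_blob_name = 'rm-lab.txt'
destination_file_name = 'rm-lab.txt'

# Look up the bucket, then reference the object inside it
bucket = storage_client.get_bucket(bucket_name)
blob = bucket.blob(source_blob_name)

blob.download_to_filename(destination_file_name)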

print('Blob {} downloaded to {}.'.format(
    source_blob_name,
    destination_file_name))

Blob rm-lab.txt downloaded to rm-lab.txt.
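
The PySpark section addresses objects by `gs://` URI, while the Python client takes a bucket name and a blob name separately. A small sketch of bridging the two is shown below; `split_gcs_uri` and `download_gcs_uri` are hypothetical helpers, and `client` is assumed to be a `storage.Client` like the one created above.

```python
from urllib.parse import urlparse


def split_gcs_uri(uri):
    """Split 'gs://bucket/path/to/blob' into (bucket_name, blob_name)."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError("Not a gs:// URI: " + uri)
    return parsed.netloc, parsed.path.lstrip("/")


def download_gcs_uri(client, uri, destination_file_name):
    """Download the object at a gs:// URI to a local file and return its path."""
    bucket_name, blob_name = split_gcs_uri(uri)
    # client.bucket() builds a bucket reference by name.
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.download_to_filename(destination_file_name)
    return destination_file_name


# Possible usage in the notebook:
# download_gcs_uri(storage_client, "gs://rm_demo_data/rm-lab.txt", "rm-lab.txt")
```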
