# Lab 4b: Connecting to the Google Cloud from Colab

This lab introduces the **set-up** needed to access and use **Google Cloud from your notebook in Colab**.

In addition to authenticating with Google Drive, you'll need to **authenticate with the Google Cloud** system.

Then you can use the command line tools `gsutil` (for storage) and `gcloud` (for other cloud tasks).
These are part of the gcloud CLI (command line interface), which is already installed in Colab.
If you want to work on a local machine, you will need to install them locally (see [https://cloud.google.com/sdk/docs/install](https://cloud.google.com/sdk/docs/install)).  
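Before going further, it can help to confirm that both tools are actually on the `PATH`, especially on a local machine. The following is a minimal sketch using only the Python standard library; the helper name `find_cli_tools` is made up for this illustration.

```python
import shutil

def find_cli_tools(names=("gcloud", "gsutil")):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}

# Report which of the two command line tools can be found.
for tool, path in find_cli_tools().items():
    print(f"{tool}: {path or 'not found, see the install link above'}")
```

In Colab both entries should show a path; on a fresh local machine they will probably be reported as not found until the gcloud CLI is installed.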

There are detailed references available for

gcloud: [https://cloud.google.com/sdk/gcloud/reference](https://cloud.google.com/sdk/gcloud/reference)

gsutil: [https://cloud.google.com/storage/docs/gsutil](https://cloud.google.com/storage/docs/gsutil).

### Cloud and Drive authentication

This is for **authenticating with Google Drive and Google Cloud**, so that we can create and use our own buckets and access Dataproc and AI-Platform.

First, we mount Google Drive for persistent local storage.

In [1]:
print('Mounting google drive...')
from google.colab import drive
drive.mount('/content/drive')
%cd "/content/drive/MyDrive"
%ls

Mounting google drive...
Mounted at /content/drive
/content/drive/MyDrive
 aidetect_share@
'Andalusien 2013.gmap'
'base (1).cpython-37.pyc'
'base (2).cpython-37.pyc'
'base (3).cpython-37.pyc'
 base.cpython-37.pyc
 BD/
 BD-CW/
 Big_Data@
 Big_Data_Material/
'Brazil 2013.gmap'
 CaseForSupport-Draft-v2.gdoc
'Colab Notebooks'/
'ConstantinosChristodoulidesReport (1).gdoc'
 ConstantinosChristodoulidesReport.gdoc
'Copy 2 of Form_SemanticMedia_FundingMiniProjects_DarwinMusicWeb.gdoc'
'Copy of Form_SemanticMedia_FundingMiniProjects_DarwinMusicWeb_4_26_backup.gdoc'
 datasets@
 DSP_AP/
 EDA_wrangling_WALS.ipynb
 EMR_Velarde-Weyde_Meredith_GV6_DM1_TW.docx.gdoc
 Eric/
 ESG/
 Extracting_relevant_documents@
'Fei - Kew Gardens - Project'@
'Gmx Mail.gdoc'
'Happy AI'/
 hello-world.py
 history.pkl
'Identity rule models.gdoc'
 Imputed_scaled_FS_BLO

Next, we authenticate with Google Cloud to enable access to Dataproc and AI-Platform.

In [2]:
import sys
if 'google.colab' in sys.modules:
    from google.colab import auth
    auth.authenticate_user()

Let's **create a new Google Cloud project** now.
Do this on the [GC Console page](https://console.cloud.google.com) by clicking on the entry at the top, to the right of *Google Cloud Platform*, and choosing *New Project*. **Copy** the **generated project ID** to the next cell. Also **enable billing** and the **Compute, Storage and Dataproc** APIs as explained in the Cloud Intro.

We also specify the **default project and region**. The REGION should be `us-central1` as that seems to be the only one that reliably works with the free credit.
This way we don't have to specify this information every time we access the cloud.

In [3]:
### this project NEEDS TO BE SET UP IN GOOGLE CLOUD FIRST
PROJECT = 'bd-labs-test' ### Append -xxxx, where xxxx is your City login to make project names unique ###
### it seems that the project name here has to be in lower case.
!gcloud config set project $PROJECT
REGION = 'us-central1' # this has worked most reliably with the free tier
!gcloud config set compute/region $REGION
!gcloud config set dataproc/region $REGION

!gcloud config list # show some information

Updated property [core/project].
Updated property [compute/region].
Updated property [dataproc/region].
[component_manager]
disable_update_check = True
[compute]
region = us-central1
[core]
account = t.e.weyde@city.ac.uk
project = bd-labs-test
[dataproc]
region = us-central1

Your active configuration is: [default]
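The comment in the cell above notes that the project name has to be in lower case. As a rough pre-check before calling `gcloud`, the sketch below tests a candidate ID against the format rules as I understand them from the Google Cloud documentation (6 to 30 characters, lower-case letters, digits and hyphens, starting with a letter and not ending with a hyphen). The pattern and the helper name `is_valid_project_id` are assumptions for illustration; the console remains the authority on what is accepted.

```python
import re

# Assumed format rule: a lower-case letter first, then letters, digits or
# hyphens, no trailing hyphen, 6 to 30 characters overall.
PROJECT_ID_PATTERN = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")

def is_valid_project_id(project_id):
    """Return True if project_id matches the assumed project ID format."""
    return PROJECT_ID_PATTERN.fullmatch(project_id) is not None

for candidate in ["bd-labs-test", "BD-Labs-Test", "bd-labs-test-"]:
    print(candidate, is_valid_project_id(candidate))
```

A check like this only catches formatting problems. It cannot tell whether the project exists or whether the ID is already taken by someone else.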


With the cell below, we **create a storage bucket** that we will use later for **global storage**.
If the bucket exists you will see a "ServiceException: 409 ...", which does not cause any problems.
**You must create your own bucket to have write access.**

In [7]:
BUCKET = 'gs://{}-storage2'.format(PROJECT)
!gsutil mb -b on $BUCKET

CommandException: 1 files/objects could not be removed.
Removing gs://bd-labs-test-storage2/...
Creating gs://bd-labs-test-storage2/...
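If you prefer to handle the "409" case explicitly rather than reading it off the cell output, the bucket creation can be wrapped in a small Python function. This is only a sketch: the helper name `ensure_bucket` and its return values are invented here, and it assumes that the 409 message appears on `stderr` as described above. The `run` parameter exists so that the function can be tried out without calling `gsutil`.

```python
import subprocess

def ensure_bucket(bucket, run=subprocess.run):
    """Create `bucket` with `gsutil mb`; treat an 'already exists' 409 as success."""
    proc = run(["gsutil", "mb", "-b", "on", bucket],
               stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    if proc.returncode == 0:
        return "created"
    if "409" in proc.stderr:
        # Assumption: a 409 means a bucket of that name already exists.
        return "exists"
    raise RuntimeError(proc.stderr.strip())

# Usage in the notebook (needs an authenticated gsutil):
# print(ensure_bucket(BUCKET))
```

Note that a 409 may also be returned when the name is taken by a bucket in another project, so "exists" does not by itself prove that you have write access.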


### Running Spark in the cloud

We will start by **using Spark on GC Dataproc**.

This section shows you in detail **how to run Python code in Dataproc**.
You may need to **enable the Dataproc API** on the [console Dataproc page](https://console.cloud.google.com/dataproc/clusters/), if you have not done so yet.

First we need to **create a cluster**. We start with a single machine, just to try it out.

We are using the `gcloud dataproc clusters` command. [Click here for documentation](https://cloud.google.com/sdk/gcloud/reference/dataproc/clusters).
The **parameter** `--image-version 1.5-ubuntu18` makes sure we get the **intended software**.

Starting a cluster **can take a few minutes**. You can wait for the cell to finish processing or interrupt its execution and check on the [console Dataproc page](https://console.cloud.google.com/dataproc/clusters/) if the cluster is ready.


In [8]:
CLUSTER = '{}-cluster'.format(PROJECT)
!gcloud dataproc clusters create $CLUSTER \
    --image-version 1.5-ubuntu18 --single-node \
    --master-machine-type n1-standard-4 \
    --master-boot-disk-type pd-ssd --master-boot-disk-size 100 \
    --max-idle 3600s

Waiting on operation [projects/bd-labs-test/regions/us-central1/operations/35fe6d6a-1959-3cf5-b911-3763c7f5c6bd].

Created [https://dataproc.googleapis.com/v1/projects/bd-labs-test/regions/us-central1/clusters/bd-labs-test-cluster] Cluster placed in zone [us-central1-c].


The `--max-idle 3600s` flag means that the cluster will be **deleted automatically** once it has been **idle for 1 hour**. This helps to minimise costs for a cluster left running by accident.
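If you want to vary the idle timeout between experiments, one option is to assemble the command in Python instead of typing it out each time. The sketch below only builds and prints the argument list; it uses the same flags as the cell above, and the function name `cluster_create_command` and the 1800 second example value are hypothetical.

```python
import shlex

def cluster_create_command(cluster, max_idle_seconds=3600):
    """Build the `gcloud dataproc clusters create` call used above as a list."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster,
        "--image-version", "1.5-ubuntu18", "--single-node",
        "--master-machine-type", "n1-standard-4",
        "--master-boot-disk-type", "pd-ssd", "--master-boot-disk-size", "100",
        "--max-idle", f"{max_idle_seconds}s",  # idle time before auto-deletion
    ]

# Example: a cluster that deletes itself after 30 idle minutes.
cmd = cluster_create_command("bd-labs-test-cluster", max_idle_seconds=1800)
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list could be passed to `subprocess.run`, or the printed line can be pasted after a `!` in a notebook cell.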

This is a single-node cluster. It is created a bit more quickly than a multi-node cluster, but set-up is still quite slow (several minutes) because of the **restrictions on the free tier**.

If you stay **in the free tier, Google promises not to charge you**. You could switch to the **paid tier (not recommended)** and still use your free credit, but then you may have to **pay for usage**.

The free tier resources are sufficient for us for now, if you use the local Spark installation for testing.

We have not specified the region (we could have used `--region $REGION`) as we already set the default for Dataproc in the beginning.

You can run the **command below to get extensive information** about your running cluster.
However, it is usually **more practical** to look at the [console Dataproc page](https://console.cloud.google.com/dataproc/clusters/).

You can check the details and current state of your cluster by clicking on its name.
Double-check there at the end of your working session to make sure that no clusters are left running, especially once you are no longer in the free tier.

In [9]:
!gcloud dataproc clusters describe $CLUSTER

clusterName: bd-labs-test-cluster
clusterUuid: ef41a2f9-81d8-4baa-8595-902d7f532d15
config:
  configBucket: dataproc-staging-us-central1-627592605246-f5boiwpi
  endpointConfig: {}
  gceClusterConfig:
    internalIpOnly: false
    networkUri: https://www.googleapis.com/compute/v1/projects/bd-labs-test/global/networks/default
    serviceAccountScopes:
    - https://www.googleapis.com/auth/bigquery
    - https://www.googleapis.com/auth/bigtable.admin.table
    - https://www.googleapis.com/auth/bigtable.data
    - https://www.googleapis.com/auth/cloud.useraccounts.readonly
    - https://www.googleapis.com/auth/devstorage.full_control
    - https://www.googleapis.com/auth/devstorage.read_write
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring.write
    zoneUri: https://www.googleapis.com/compute/v1/projects/bd-labs-test/zones/us-central1-c
  lifecycleConfig:
    idleDeleteTtl: 3600s
    idleStartTime: '2024-04-27T13:04:22.999058Z'
  masterC
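To check for clusters left running without opening the console, the output of `gcloud dataproc clusters list --format=json` can be parsed in Python. The sketch below works on a hand-written sample string rather than real output; it assumes that each entry carries a `clusterName` and a `status.state` field, as in the Dataproc cluster resource, and the function name `cluster_states` is made up here.

```python
import json

def cluster_states(list_output_json):
    """Return (name, state) pairs from `gcloud dataproc clusters list --format=json`."""
    clusters = json.loads(list_output_json or "[]")
    return [(c.get("clusterName"), c.get("status", {}).get("state"))
            for c in clusters]

# Hand-written sample in the assumed shape, not real command output.
sample = '[{"clusterName": "bd-labs-test-cluster", "status": {"state": "RUNNING"}}]'
print(cluster_states(sample))
```

In Colab, the real output could be captured with `out = !gcloud dataproc clusters list --format=json` and passed in as `"\n".join(out)`. An empty result then means that no clusters exist in the default region.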

Now that our cluster is running, we can submit a Spark job. A minimal Spark job is just a Python script. A simple "Hello World" Spark script is provided in a public cloud bucket. Let's have a look at it:

In [10]:
!gsutil cat gs://dataproc-examples/pyspark/hello-world/hello-world.py

#!/usr/bin/python
import pyspark
sc = pyspark.SparkContext()
rdd = sc.parallelize(['Hello,', 'world!'])
words = sorted(rdd.collect())
print(words)



... and run it on the cluster. We submit the job with the `gcloud dataproc jobs` command ([click here for the documentation](https://cloud.google.com/sdk/gcloud/reference/dataproc/jobs)) with the cluster name.


In [11]:
!gcloud dataproc jobs submit pyspark --cluster $CLUSTER \
    gs://dataproc-examples/pyspark/hello-world/hello-world.py

Job [efe4422c4a4b47028512b57f99c61a06] submitted.
Waiting for job output...
24/04/27 13:10:01 INFO org.apache.spark.SparkEnv: Registering MapOutputTracker
24/04/27 13:10:01 INFO org.apache.spark.SparkEnv: Registering BlockManagerMaster
24/04/27 13:10:01 INFO org.apache.spark.SparkEnv: Registering OutputCommitCoordinator
24/04/27 13:10:01 INFO org.spark_project.jetty.util.log: Logging initialized @6777ms to org.spark_project.jetty.util.log.Slf4jLog
24/04/27 13:10:01 INFO org.spark_project.jetty.server.Server: jetty-9.4.z-SNAPSHOT; built: unknown; git: unknown; jvm 1.8.0_382-b05
24/04/27 13:10:01 INFO org.spark_project.jetty.server.Server: Started @7033ms
24/04/27 13:10:01 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@7d7e2198{HTTP/1.1, (http/1.1)}{0.0.0.0:32839}
24/04/27 13:10:04 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at bd-labs-test-cluster-m/10.128.0.24:8032
24/04/27 13:10:04 INFO org.apache.hadoop.yarn.client.AHSProx

The `trackingUrl` shown above will only work as long as the job is running. On the [Dataproc page](https://console.cloud.google.com/dataproc/clusters/), you can click through to your cluster page and from there to your job page to see details on past jobs.

You may get some warnings like this:
`WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair` ...
or this one:
`WARN org.apache.hadoop.hdfs.DataStreamer: Caught exception`. These have not been researched fully, but they don't affect what we do here and can be ignored.

One issue is that we need to **get output data from the cluster back to the notebook**. We can output text through printing into the output stream, but that does not work well for scaling up, for automation, or for binary data.

A better solution is to **pass an argument** to the job, to tell the job on the cluster **where to write the output**.
This requires a bit of extra code as shown below using the `argparse` package.
The example below checks whether the script is running in Colab. Outside Colab (e.g. on the cluster) it reads its arguments from the command line; in Colab it uses manually defined arguments instead.
This is useful for quick testing with a local Spark instance.

To run the file in Dataproc, write it to a local file `hello-world.py` by uncommenting the `%%writefile` magic, as in the cell below.


In [12]:
%%writefile hello-world.py

import sys
import pyspark
import argparse
import pickle

def save(object,bucket,filename):
    with open(filename,mode='wb') as f:
        pickle.dump(object,f)
    print("Saving {} to {}".format(filename,bucket))
    import subprocess
    proc = subprocess.run(["gsutil","cp", filename, bucket],stderr=subprocess.PIPE)
    print("gsutil returned: " + str(proc.returncode))
    print(str(proc.stderr))

def runWordCount(argv):
    # Parse the provided arguments
    print(argv)
    parser = argparse.ArgumentParser() # get a parser object
    parser.add_argument('--out_bucket', metavar='out_bucket', required=True,
                        help='The bucket URL for the result.') # add a required argument
    parser.add_argument('--out_file', metavar='out_file', required=True,
                        help='The filename for the result.') # add a required argument
    args = parser.parse_args(argv) # read the value
    # the value provided with --out_bucket is now in args.out_bucket
    sc = pyspark.SparkContext.getOrCreate()
    rdd = sc.parallelize(['Hello,', 'world!'])
    words = sorted(rdd.collect())
    save(words,args.out_bucket,args.out_file)

if  'google.colab' not in sys.modules: # Don't use system arguments running in Colab
    runWordCount(sys.argv[1:])
elif __name__ == "__main__" : # but define them manually
    runWordCount(["--out_bucket", BUCKET, "--out_file", "words.pkl"])

Overwriting hello-world.py


**Once the code works as intended**, you can write it to the local disk (on your Colab instance). For this, **uncomment the first line with the `%%writefile` magic** and then **re-run the cell**.

Then **check** that the file **is in the current directory** and **has the right contents** like this:

In [13]:
!pwd
!ls -l hello-world.py
!cat hello-world.py

/content/drive/MyDrive
-rw------- 1 root root 1402 Apr 27 13:15 hello-world.py

import sys
import pyspark
import argparse
import pickle

def save(object,bucket,filename):
    with open(filename,mode='wb') as f:
        pickle.dump(object,f)
    print("Saving {} to {}".format(filename,bucket))
    import subprocess
    proc = subprocess.run(["gsutil","cp", filename, bucket],stderr=subprocess.PIPE)
    print("gsutil returned: " + str(proc.returncode))
    print(str(proc.stderr))

def runWordCount(argv):
    # Parse the provided arguments
    print(argv)
    parser = argparse.ArgumentParser() # get a parser object
    parser.add_argument('--out_bucket', metavar='out_bucket', required=True,
                        help='The bucket URL for the result.') # add a required argument
    parser.add_argument('--out_file', metavar='out_file', required=True,
                        help='The filename for the result.') # add a required argument
    args = parser.parse_args(argv) # read the value
   

We can now submit the job with an extra section for application arguments. This section starts with `--`, which indicates that all following arguments are to be sent to our Spark application.

In [14]:
FILENAME = 'words.pkl'
!gcloud dataproc jobs submit pyspark --cluster $CLUSTER --region $REGION \
    ./hello-world.py \
    -- --out_bucket $BUCKET --out_file $FILENAME

Job [5cf15dd1d08341b298e61ebcd55e7a52] submitted.
Waiting for job output...
['--out_bucket', 'gs://bd-labs-test-storage2', '--out_file', 'words.pkl']
24/04/27 13:15:23 INFO org.apache.spark.SparkEnv: Registering MapOutputTracker
24/04/27 13:15:23 INFO org.apache.spark.SparkEnv: Registering BlockManagerMaster
24/04/27 13:15:23 INFO org.apache.spark.SparkEnv: Registering OutputCommitCoordinator
24/04/27 13:15:23 INFO org.spark_project.jetty.util.log: Logging initialized @7710ms to org.spark_project.jetty.util.log.Slf4jLog
24/04/27 13:15:24 INFO org.spark_project.jetty.server.Server: jetty-9.4.z-SNAPSHOT; built: unknown; git: unknown; jvm 1.8.0_382-b05
24/04/27 13:15:24 INFO org.spark_project.jetty.server.Server: Started @8015ms
24/04/27 13:15:24 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@44eee42{HTTP/1.1, (http/1.1)}{0.0.0.0:38027}
24/04/27 13:15:27 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at bd-labs-test-cluster-m/10.1
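Everything after the `--` arrives in the script as `sys.argv[1:]`, which is the list printed at the start of the job output above. The following sketch repeats the parsing step from `hello-world.py` on that list, so it can be tried locally without a cluster; the function name `parse_job_args` is introduced only for this illustration.

```python
import argparse

def parse_job_args(argv):
    """Parse the two application arguments that the Spark script expects."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--out_bucket", required=True,
                        help="The bucket URL for the result.")
    parser.add_argument("--out_file", required=True,
                        help="The filename for the result.")
    return parser.parse_args(argv)

# The same list that the job printed when it started.
args = parse_job_args(["--out_bucket", "gs://bd-labs-test-storage2",
                       "--out_file", "words.pkl"])
print(args.out_bucket, args.out_file)
```

Because both arguments are declared with `required=True`, leaving one out makes `argparse` print a usage message and exit. On the cluster this shows up as a failed job.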

Once the job has finished, we can **use the output** by **copying it from the bucket** and **reading it as a local file**.

In [15]:
# Make sure you are writing to the right directory
import pickle
%cd /content/drive/MyDrive/BD-CW
!gsutil cp $BUCKET/$FILENAME .
!ls -l
with open(FILENAME,mode='rb') as f:
    words = pickle.load(f)

print("Content of {} : {}".format(FILENAME,words))

/content/drive/MyDrive/BD-CW
CommandException: No URLs matched: gs://bd-labs-test-storage2/words.pkl
total 42
drwx------ 2 root root 4096 Feb 20  2022 flowers_training87
drwx------ 2 root root 4096 Feb 20  2022 flowers_training89
-rw------- 1 root root 1431 Feb 20  2022 hello-world.py
-rw------- 1 root root  196 Apr 30  2022 history.pkl
-rw------- 1 root root 1360 Apr 22  2022 param_res-220422-1717.pkl
-rw------- 1 root root 4545 Feb 20  2022 spark_job2.py
-rw------- 1 root root 7445 Feb 20  2022 spark_job.py
-rw------- 1 root root 7427 Aug  7  2022 spark_speed_test.py
-rw------- 1 root root 4477 Apr 27 12:42 spark_write_tfrec.py
-rw------- 1 root root 1360 Apr 22  2022 test.pkl
-rw------- 1 root root   50 Apr 30  2022 time.pkl
drwx------ 2 root root 4096 Feb 20  2022 trainer
-rw------- 1 root root   34 Feb 20  2022 words.pkl
Content of words.pkl : ['Hello,', 'world!']
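The transfer relies on `pickle` writing a Python object to a file on the cluster and reading it back in the notebook. The round trip can be tried without any cloud access, as in this minimal sketch, which uses a temporary directory in place of the bucket.

```python
import os
import pickle
import tempfile

words = ['Hello,', 'world!']

# Write the object to a file and read it back, as the job and the notebook do.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "words.pkl")
    with open(path, mode="wb") as f:
        pickle.dump(words, f)
    with open(path, mode="rb") as f:
        restored = pickle.load(f)

print(restored == words)
```

Only unpickle files that come from a source you trust, such as your own bucket, because loading a pickle can execute arbitrary code.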


At the end of a session we should **delete the cluster**, as it incurs a **cost for the time** it runs.  

In [14]:
!gcloud dataproc clusters delete $CLUSTER -q
# the -q flag disables the confirmation prompt,
# because we want to make sure the cluster really gets deleted

Waiting on operation [projects/bd-labs-test/regions/us-central1/operations/c1af5690-763e-30d8-aa4d-0fdcda863fd8].
Deleted [https://dataproc.googleapis.com/v1/projects/bd-labs-test/regions/us-central1/clusters/bd-labs-test-cluster].


### Set up Python packages on a cluster


In order to run our code on the cluster we need to install packages.  
For this, we enable package installation by passing a flag `--initialization-actions` with argument `gs://goog-dataproc-initialization-actions-$REGION/python/pip-install.sh` (this is a public script that will read metadata to determine which packages to install).
The packages are then specified by providing a `--metadata` flag with the argument `PIP_PACKAGES=<name==version>`, e.g. `PIP_PACKAGES=tensorflow` if we want TensorFlow installed.
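As a sketch of how the two flags fit together, the helper below builds the extra arguments for `gcloud dataproc clusters create`. It assumes that several packages are given as one space-separated `PIP_PACKAGES` value, which is how I understand the `pip-install.sh` script to read them. The function name `pip_install_flags` and the example package list are hypothetical.

```python
def pip_install_flags(region, packages):
    """Build the extra `clusters create` flags for installing pip packages."""
    script = (f"gs://goog-dataproc-initialization-actions-{region}"
              "/python/pip-install.sh")
    return [
        "--initialization-actions", script,
        # Assumption: multiple packages are separated by spaces.
        "--metadata", "PIP_PACKAGES=" + " ".join(packages),
    ]

flags = pip_install_flags("us-central1", ["tensorflow", "scipy"])
print(flags)
```

When the flags are typed directly on a shell line instead, a `--metadata` value that contains spaces needs to be quoted.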

Once the cluster is running, you can run the job. It is useful to create a new filename, so that results don't get overwritten.

You can for instance use `str(datetime.datetime.now().strftime("%y%m%d-%H%M"))` to get a string with the current date and time and use that in the file name.


In [None]:
import datetime
time_string = str(datetime.datetime.now().strftime("%y%m%d-%H%M"))
time_string
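The time string can then be combined with a base name to give each run its own output file. This is a small sketch; the function name `timestamped_filename` and the `words` / `.pkl` parts are only examples.

```python
import datetime

def timestamped_filename(stem, suffix, now=None):
    """Return e.g. 'words-240427-1315.pkl' for the given (or current) time."""
    now = now or datetime.datetime.now()
    return f"{stem}-{now.strftime('%y%m%d-%H%M')}{suffix}"

# A fixed time is used here so that the result is reproducible.
fixed = datetime.datetime(2024, 4, 27, 13, 15)
print(timestamped_filename("words", ".pkl", now=fixed))
```

The result could be passed as the `--out_file` argument of the job. Since the format has minute resolution, two runs started within the same minute would still share a name.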