# HW5Q8 - How to run in GCP
__`MIDS w261: Machine Learning at Scale | UC Berkeley School of Information | Fall 2018`__

This notebook contains supplemental materials to help you run your HW5 solution to question 8 on Google Cloud Platform (GCP). __Important Note:__ _the graders will not read this notebook. If you do use it, please be sure to copy relevant output back into the main homework notebook to receive credit for your results._

### Account setup
1. Create your GCP account & apply for credit through the w261 education grant. (see [create_account.md](https://github.com/UCB-w261/w261-environment/blob/master/gcp/account-setup/create_account.md))
2. Set up your project, bucket, service account, access key and virtual environment. (steps 1-15 in [setup.md](https://github.com/UCB-w261/w261-environment/blob/master/gcp/account-setup/setup.md))
3. (OPTIONAL) Review the GCP documentation to become more familiar with the setup steps you've just performed: [key terms & concepts described here](https://cloud.google.com/storage/docs/concepts) 

### Get the submission script
Copy the [submit_job_to_cluster.py](https://github.com/UCB-w261/w261-environment/blob/master/gcp/dataproc/submit_job_to_cluster.py) file from the environment repo into your current working directory. This script submits your own Spark jobs to the cluster. You can read more about it here: [w261-environment](https://github.com/UCB-w261/w261-environment/tree/master/gcp).

#### Make sure to give your script executable permissions

In [8]:
!chmod a+x submit_job_to_cluster.py

### Push the data to your GCP bucket

* To copy files from Dropbox to Google Cloud Storage, run:   
`curl -L file-url | gsutil cp - gs://bucket-name/filename.txt`   

For example, to stream the whole wiki graph from Dropbox into a bucket named `w261-bucket`, run:   
`curl -L "https://www.dropbox.com/sh/2c0k5adwz36lkcw/AAAD7I_6kQlJtDpXZPhCfVH-a/wikipedia/all-pages-indexed-out.txt?dl=0" | gsutil cp - gs://w261-bucket/wiki_graph.txt`

* To copy files from your computer to Google Cloud Storage, run:   
`gsutil cp 'data/test_graph.txt' gs://bucket-name/test_graph.txt`   

__IMPORTANT:__ You will need to run this outside of the Docker container, as the container doesn't have `gsutil` installed.

For additional information about moving files to your GCS bucket, see: https://www.cloudbooklet.com/gsutil-cp-copy-and-move-files-on-google-cloud/
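
If you would rather drive the streaming copy from Python than type the pipe by hand, the sketch below shows one possible approach using `subprocess`. The names `build_stream_commands` and `stream_to_bucket` are hypothetical helpers, not part of the course repo, and the sketch assumes `curl` and `gsutil` are on your `PATH` (so, like the commands above, it has to run outside the Docker container).

```python
import subprocess


def build_stream_commands(file_url, bucket, dest_name):
    """Return the two argument lists equivalent to
    `curl -L file-url | gsutil cp - gs://bucket-name/filename.txt`."""
    curl_cmd = ["curl", "-L", file_url]
    gsutil_cmd = ["gsutil", "cp", "-", f"gs://{bucket}/{dest_name}"]
    return curl_cmd, gsutil_cmd


def stream_to_bucket(file_url, bucket, dest_name):
    """Pipe curl's output straight into gsutil without saving a local copy.

    Returns True only if both processes exit with status 0.
    """
    curl_cmd, gsutil_cmd = build_stream_commands(file_url, bucket, dest_name)
    curl = subprocess.Popen(curl_cmd, stdout=subprocess.PIPE)
    gsutil = subprocess.Popen(gsutil_cmd, stdin=curl.stdout)
    curl.stdout.close()  # let curl receive SIGPIPE if gsutil exits early
    gsutil_status = gsutil.wait()
    curl_status = curl.wait()
    return curl_status == 0 and gsutil_status == 0
```

Calling `stream_to_bucket(url, "bucket-name", "wiki_graph.txt")` would then perform the same transfer as the `curl ... | gsutil cp - ...` one-liner.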

### Create and run a Spark job on a cluster using GCP
Fill in your PageRank code, and run the cell below to create a file called `pagerank.py` in the current directory.    
__IMPORTANT:__ Make sure to fill in your own bucket name!

In [None]:
%%writefile pagerank.py
#!/usr/bin/env python


import re
import ast
import time
import numpy as np
import pandas as pd
import pyspark
from pyspark.accumulators import AccumulatorParam

sc = pyspark.SparkContext()


############## YOUR BUCKET HERE ###############

BUCKET=""

############## (END) YOUR BUCKET ###############


wikiRDD = sc.textFile("gs://"+BUCKET+"/wiki_graph.txt")


def initGraph(dataRDD):
    """
    Spark job to read in the raw data and initialize an 
    adjacency list representation with a record for each
    node (including dangling nodes).
    
    Returns: 
        graphRDD -  a pair RDD of (node_id , (score, edges))
        
    NOTE: The score should be a float, but you may want to be 
    strategic about how you format the edges... there are a few
    options that can work. Make sure that whatever you choose
    is sufficient for Question 8 where you'll run PageRank.
    """
    ############## YOUR CODE HERE ###############
    
   
    ############## (END) YOUR CODE ###############
    
    return graphRDD

class FloatAccumulatorParam(AccumulatorParam):
    """
    Custom accumulator for use in page rank to keep track of various masses.
    
    IMPORTANT: accumulators should only be called inside actions to avoid duplication.
    We strongly recommend you use the 'foreach' action in your implementation below.
    """
    def zero(self, value):
        # Task-local copies must start from zero, whatever the current value is.
        return 0.0
    def addInPlace(self, val1, val2):
        return val1 + val2
    
def runPageRank(graphInitRDD, alpha = 0.15, maxIter = 10, verbose = True):
    """
    Spark job to implement page rank
    Args: 
        graphInitRDD  - pair RDD of (node_id , (score, edges))
        alpha         - (float) teleportation factor
        maxIter       - (int) stopping criteria (number of iterations)
        verbose       - (bool) option to print logging info after each iteration
    Returns:
        steadyStateRDD - pair RDD of (node_id, pageRank)
    """
    # teleportation:
    a = sc.broadcast(alpha)
    
    # damping factor:
    d = sc.broadcast(1-a.value)
    
    # initialize accumulators for dangling mass & total mass
    mmAccum = sc.accumulator(0.0, FloatAccumulatorParam())
    totAccum = sc.accumulator(0.0, FloatAccumulatorParam())
    
    ############## YOUR CODE HERE ###############
    
    # write your helper functions here, 
    # please document the purpose of each clearly 
    # for reference, the master solution has 5 helper functions.
    
   

               
    # write your main Spark Job here (including the for loop to iterate)
    # for reference, the master solution is 21 lines including comments & whitespace
    
    
    
    
    
    ############## (END) YOUR CODE ###############
    
    return steadyStateRDD




nIter = 10
start = time.time()

# Initialize your graph structure (Q7)
wikiGraphRDD = initGraph(wikiRDD)

# Run PageRank (Q8)
full_results = runPageRank(wikiGraphRDD, alpha = 0.15, maxIter = nIter, verbose = True)

print(f'...trained {nIter} iterations in {time.time() - start} seconds.')
print('Top 20 ranked nodes:')
print(full_results.takeOrdered(20, key=lambda x: -x[1]))

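To see what the two `AccumulatorParam` methods are for, here is a plain-Python model of how per-task partial sums might be merged into a driver-side total. It is a simplified illustration that runs without Spark, not PySpark's actual implementation. `FloatParam` and `simulate_accumulator` are hypothetical names, and the model assumes that each task starts from `zero(...)` and that the driver folds each task's result in with `addInPlace`.

```python
class FloatParam:
    """Stand-in for an AccumulatorParam over floats (no Spark required)."""

    def zero(self, value):
        # A task-local copy starts from the additive identity.
        return 0.0

    def addInPlace(self, val1, val2):
        return val1 + val2


def simulate_accumulator(param, initial, partitions):
    """Model one action: every partition accumulates locally, then the
    driver folds each local result into its running total."""
    total = initial
    for partition in partitions:
        local = param.zero(total)
        for x in partition:
            local = param.addInPlace(local, x)
        total = param.addInPlace(total, local)
    return total


# Toy "dangling mass" values spread over two partitions.
dangling_mass = simulate_accumulator(FloatParam(), 0.0, [[0.25, 0.25], [0.5]])
print(dangling_mass)
```

In this model, any non-zero starting value handed to a task-local copy would be added to the total once per partition, which is why the local copy starts at `0.0`.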
### Submit to cluster

Run this command in your terminal (not in the Docker container) to submit your job to GCP. The command expects the following environment variables to be defined; alternatively, substitute the actual values directly.

* PROJECT_ID: your GCP project ID   
* BUCKET_NAME: the name of your GCP bucket   
* CLUSTER_NAME: a cluster name of your choice; it should contain only a-z and 0-9 and start with a letter   
* ZONE: the zone for your account and bucket, e.g. us-central1-b


```
python3 submit_job_to_cluster.py \
    --project_id=${PROJECT_ID} \
    --zone=${ZONE} \
    --cluster_name=${CLUSTER_NAME} \
    --gcs_bucket=${BUCKET_NAME} \
    --key_file=$HOME/w261.json \
    --create_new_cluster \
    --pyspark_file=pagerank.py
```
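
Because an unset variable silently expands to an empty string in the shell, it can help to check the environment before submitting. The sketch below is one way to do that from Python: it verifies the four variables listed above, applies the cluster-name rule exactly as stated in this notebook, and assembles the same argument list as the shell command. The helper names (`missing_vars`, `cluster_name_ok`, `build_submit_command`) are hypothetical and not part of `submit_job_to_cluster.py`; only the flags shown in the command above are used.

```python
import os
import re

REQUIRED_VARS = ("PROJECT_ID", "ZONE", "CLUSTER_NAME", "BUCKET_NAME")


def missing_vars(env):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def cluster_name_ok(name):
    """Apply the rule stated above: only a-z and 0-9, starting with a letter."""
    return re.fullmatch(r"[a-z][a-z0-9]*", name) is not None


def build_submit_command(env, pyspark_file="pagerank.py"):
    """Assemble the same argument list as the shell command above."""
    missing = missing_vars(env)
    if missing:
        raise ValueError(f"missing environment variables: {', '.join(missing)}")
    if not cluster_name_ok(env["CLUSTER_NAME"]):
        raise ValueError(f"invalid cluster name: {env['CLUSTER_NAME']!r}")
    home = env.get("HOME", os.path.expanduser("~"))
    return [
        "python3", "submit_job_to_cluster.py",
        f"--project_id={env['PROJECT_ID']}",
        f"--zone={env['ZONE']}",
        f"--cluster_name={env['CLUSTER_NAME']}",
        f"--gcs_bucket={env['BUCKET_NAME']}",
        f"--key_file={home}/w261.json",
        "--create_new_cluster",
        f"--pyspark_file={pyspark_file}",
    ]
```

With the variables exported, `subprocess.run(build_submit_command(os.environ), check=True)` would launch the same job as the command above, and a missing variable raises a `ValueError` before anything is sent to GCP.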