GCP: Setting up the Scala Stream Collector


1. Introduction

The Scala Stream Collector can publish raw events to Google Cloud PubSub as its sink. PubSub is a distributed message queue service running on Google's infrastructure. Publisher applications publish messages to topics, whilst subscriber applications listen to subscriptions, which can be set up as pull or push and with different acknowledgment policies. For more on PubSub, see https://cloud.google.com/pubsub/docs/concepts.

2. Enable and Setup Google Cloud PubSub

First, enable the Google Cloud PubSub API for your project. [screenshot: gcloud-enable-pubsub]

  • You'll then have to create the topics to which the Scala Stream Collector publishes (a command-line equivalent for this section is sketched after this list):
    • Click on the hamburger, on the top left corner
    • Scroll down until you find "PubSub", under "Big Data" [screenshot: gcloud-pubsub-sidebar]
    • Create two topics: these will be the good and bad raw topics. [screenshot: gcloud-pubsub-sidebar]
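
If you prefer the command line, a recent Cloud SDK can enable the API and create the topics directly. This is only a sketch: the topic names good and bad are examples (use whatever names you put in your configuration file), and older SDK versions may need the beta command group:

$ gcloud services enable pubsub.googleapis.com
$ gcloud pubsub topics create good
$ gcloud pubsub topics create bad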

3. Configuring the Scala Stream Collector

To set up the Scala Stream Collector, fill in the appropriate fields of the configuration file. You can find an example in the repository: config.hocon.sample.
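
As a rough illustration, the PubSub-specific part of the configuration looks something like the sketch below. Field names can change between collector versions, so treat this as indicative only and always start from the config.hocon.sample shipped with your release:

collector {
  interface = "0.0.0.0"
  port = 8080                      # must match the firewall rule and health check port used below

  streams {
    good = "good"                  # the raw "good" topic created in step 2
    bad = "bad"                    # the raw "bad" topic created in step 2

    sink {
      enabled = googlepubsub       # selects the Google Cloud PubSub sink
      googleProjectId = "example-project-156611"
      # backoff, buffer and akka settings omitted here; see config.hocon.sample
    }
  }
}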

4. Running the Collector

  • Download the Scala Stream Collector from Bintray.
  • To run the collector, you'll need a config file like the one described above.
  • You'll also want to authenticate the machine where the collector will run by doing:
$ gcloud auth login
$ gcloud auth application-default login

NOTE: If you're running the collector in a Compute Instance, you don't need to authenticate manually; you just need to set the appropriate permissions for your service accounts (which are automatically authenticated in Compute Instances) so that they're allowed to use PubSub.
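
For illustration only, granting the Compute Engine default service account permission to publish to PubSub could look like the following; the project ID and service account address are the same placeholders used elsewhere on this page, and the role you actually need may vary:

$ gcloud projects add-iam-policy-binding example-project-156611 \
         --member "serviceAccount:189687079473-compute@developer.gserviceaccount.com" \
         --role "roles/pubsub.publisher"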

4a. locally (useful for testing)

To run the collector locally, assuming you are authenticated and have the above configuration file in place, simply run:

$ java -jar snowplow-stream-collector-google-pubsub-*version*.jar --config config.hocon
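
Assuming the default port of 8080 from the configuration file, you can then check that the collector is up by hitting its health endpoint (the same path used for the health check later in this guide), which should return a 200 response:

$ curl http://localhost:8080/health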

4b. on a GCP instance

To run the collector on a GCP instance, you'll first need to spin one up. There are two ways to do so:

4b-1. via the dashboard
  • Go to the GCP dashboard, and once again, make sure your project is selected.
  • Click the hamburger on the top left corner, and select Compute Engine, under Compute
  • Enable billing if you haven't already (if billing isn't enabled, the only option you'll see at this point is a button to enable it)

[screenshot: gcloud-instance-nobilling]

  • Click "Create instance" and pick the apropriate settings for your case, making sure of, at least the following:
    • Under Access scopes, select "Set access for each API" and enable "Cloud PubSub"
    • Under Firewall, select "Allow HTTP traffic"
    • Optional Click Management, disk, networking, SSH keys
      • Under Networking, add a Tag, such as "collector". (This is needed to add a tagged Firewall rule, explained below)

[screenshot: gcloud-instance-create1]

[screenshot: gcloud-instance-create2]

  • Click the hamburger on the top left corner, and click on "VPC Network", under Networking
  • On the sidebar, click on "Firewall rules"
  • Click "Create Firewall Rule"
  • Name your rule
  • Under Source filter, allow traffic from any source (0.0.0.0/0)
  • Under Protocols and ports add "tcp:8080"
    • Note that 8080 is the port assigned to the collector in the configuration file. If you choose another port here, make sure you change the config file accordingly
  • Under Target tags add the Tag with which you labeled your instance (here collector)
  • Click "Create"

[screenshot: gcloud-firewall]

4b-2. via the command line
  • Make sure you have authenticated as described above
  • Here's an example command for spinning up an instance (check the gcloud reference for more info):
$ gcloud compute --project "example-project-156611" instances create "instance-2" \
                 --zone "us-central1-c" \
                 --machine-type "n1-standard-1" \
                 --subnet "default" \
                 --maintenance-policy "MIGRATE" \
                 --scopes 189687079473-compute@developer.gserviceaccount.com="https://www.googleapis.com/auth/pubsub",189687079473-compute@developer.gserviceaccount.com="https://www.googleapis.com/auth/servicecontrol",189687079473-compute@developer.gserviceaccount.com="https://www.googleapis.com/auth/service.management.readonly",189687079473-compute@developer.gserviceaccount.com="https://www.googleapis.com/auth/logging.write",189687079473-compute@developer.gserviceaccount.com="https://www.googleapis.com/auth/monitoring.write",189687079473-compute@developer.gserviceaccount.com="https://www.googleapis.com/auth/trace.append",189687079473-compute@developer.gserviceaccount.com="https://www.googleapis.com/auth/devstorage.read_only" \
                 --tags "collector" \
                 --image "/ubuntu-os-cloud/ubuntu-1604-xenial-v20170113" \
                 --boot-disk-size "10" \
                 --boot-disk-type "pd-standard" \
                 --boot-disk-device-name "instance-2"

$ gcloud compute --project "example-project-156611" firewall-rules create "collectors-allow-tcp8080" \
                 --allow tcp:8080 \
                 --network "default" \
                 --source-ranges "0.0.0.0/0" \
                 --target-tags "collector"

To place the above-mentioned files (the config file and the collector jar) on the instance, you can:

  • For the jar, wget it from Bintray directly onto the instance;
  • For the config file, store it in GCP Storage and then download it onto the instance. To store the config file via the console (a command-line equivalent is sketched after this list):
    • Click the hamburger on the top left corner and find Storage, under Storage
    • Create a bucket [screenshot: gcloud-storage1]
    • Then click "Upload Files" and upload your configuration file
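
Alternatively, you can create the bucket and upload the config file from the command line with gsutil (the bucket and file names are placeholders):

$ gsutil mb gs://<YOUR-BUCKET-NAME>
$ gsutil cp <YOUR-CONFIG-FILE-NAME> gs://<YOUR-BUCKET-NAME>/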

Once you have your config file in place, ssh into your instance:

$ gcloud compute ssh your-instance-name --zone your-instance-zone

And then run:

$ sudo apt-get update
$ sudo apt-get -y install default-jre
$ sudo apt-get -y install unzip
$ wget https://dl.bintray.com/snowplow/snowplow-generic/snowplow_scala_stream_collector_google_pubsub_<VERSION>.zip
$ gsutil cp gs://<YOUR-BUCKET-NAME>/<YOUR-CONFIG-FILE-NAME> .
$ unzip snowplow_scala_stream_collector_google_pubsub_<VERSION>.zip
$ java -jar snowplow-stream-collector-google-pubsub-<VERSION>.jar --config <YOUR-CONFIG-FILE-NAME>
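
Note that the last command runs the collector in the foreground, so it will stop when you close the SSH session. For anything beyond a quick test you may want to run it in the background (as the startup script in section 4c does) or under a process manager such as systemd; as a minimal example:

$ nohup java -jar snowplow-stream-collector-google-pubsub-<VERSION>.jar --config <YOUR-CONFIG-FILE-NAME> > collector.log 2>&1 &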

4c. a load balanced auto-scaling GCP cluster

To run a load-balanced auto-scaling cluster, you'll need to take the following steps:

  • Create an instance template
  • Create an auto managed instance group
  • Create a load balancer
Creating an instance template

First, you'll have to store your config file somewhere your instances can download it from; we suggest a GCP Storage bucket, as described above.

via Google Cloud Console
  • Click the hamburger on the top left corner and find "Compute Engine", under Compute
  • Go to "Instance templates" on the sidebar. Click "Create instance template"
  • Choose the appropriate settings for your case. Do (at least) the following:
    • Under Access scopes, select "Set access for each API" and enable "Cloud PubSub"
    • Under Firewall, select "Allow HTTP traffic"
    • Click "Management, disk, networking, SSH keys" [screenshot: gcloud-instance-template1]
    • Under Networking, add a tag, such as "collector". (This is needed to add a Firewall rule)
    • Under Management, add the following "Startup script" (changing the relevant fields for your case):
#! /bin/bash
sudo apt-get update
sudo apt-get -y install default-jre
sudo apt-get -y install unzip
archive=snowplow_scala_stream_collector_google_pubsub_<VERSION>.zip
wget https://dl.bintray.com/snowplow/snowplow-generic/$archive
gsutil cp gs://<YOUR-BUCKET-NAME>/<YOUR-CONFIG-FILE-NAME> .
unzip $archive
java -jar snowplow-stream-collector-google-pubsub-<VERSION>.jar --config <YOUR-CONFIG-FILE-NAME> &
  • Click "Create"
  • Add a Firewall rule as described above (if you haven't already)
via the command-line

Here's the command-line equivalent for the options selected by performing the steps above:

$ gcloud compute --project "example-project-156611" instance-templates create "ssc-instance-template" \
                 --machine-type "n1-standard-1" \
                 --network "default" \
                 --maintenance-policy "MIGRATE" \
                 --scopes 189687079473-compute@developer.gserviceaccount.com="https://www.googleapis.com/auth/pubsub",189687079473-compute@developer.gserviceaccount.com="https://www.googleapis.com/auth/servicecontrol",189687079473-compute@developer.gserviceaccount.com="https://www.googleapis.com/auth/service.management.readonly",189687079473-compute@developer.gserviceaccount.com="https://www.googleapis.com/auth/logging.write",189687079473-compute@developer.gserviceaccount.com="https://www.googleapis.com/auth/monitoring.write",189687079473-compute@developer.gserviceaccount.com="https://www.googleapis.com/auth/trace.append",189687079473-compute@developer.gserviceaccount.com="https://www.googleapis.com/auth/devstorage.read_only" \
                 --tags "collector" \
                 --image "/ubuntu-os-cloud/ubuntu-1604-xenial-v20170113" \
                 --boot-disk-size "10" \
                 --boot-disk-type "pd-standard" \
                 --boot-disk-device-name "ssc-instance-template" \
                 --metadata "startup-script=<THE-STARTUP-SCRIPT-AS-DESCRIBED-ABOVE>"
Create an auto managed instance group
via Google Cloud Console
  • On the side bar, click "Instance groups"
  • Click "Create instance group"
  • Fill in with the appropriate values. We named our instance group "collectors".
  • Under Instance template pick the instance template you created previously
  • Set Autoscaling to "On". By default the Autoscale is based on CPU usage and set with default settings. We'll leave them as they are for now.
  • Under Health Check, pick "Create health check"
    • Name your health check
    • Under Port add 8080 or the port you configured above
    • Under Request path add "/health"
    • Click "Save and Continue"
  • Click "Create"

[screenshot: gcloud-group-create1]

[screenshot: gcloud-group-create2]

via command-line

Here's the command-line equivalent for the options selected by performing the steps above:

$ gcloud compute --project "example-project-156611" instance-groups managed create "collectors" \
                 --zone "us-central1-c" \
                 --base-instance-name "collectors" \
                 --template "ssc-instance-template" \
                 --size "1"

$ gcloud compute --project "example-project-156611" instance-groups managed set-autoscaling "collectors" \
                 --zone "us-central1-c" \
                 --cool-down-period "60" \
                 --max-num-replicas "10" \
                 --min-num-replicas "1" \
                 --target-cpu-utilization "0.6"
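
The health check created in the console flow also has a command-line equivalent; here's a sketch, assuming the same port and request path as above (the name "ssc-health-check" is arbitrary):

$ gcloud compute --project "example-project-156611" health-checks create http "ssc-health-check" \
                 --port "8080" \
                 --request-path "/health"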
Configure the load balancer
  • Click the hamburger on the top left corner, and find "Network services" under Networking
  • On the side bar, click "Load Balancing"
  • Click "Create load balancer"
  • Select "HTTP load balancing" and click "Start configuration"

[screenshot: gcloud-load-balancer1]

  • Under Backend configuration:

    • Click "Create a backend service"
    • Pick an appropriate name
    • Pick the instance group we created above
    • Pick the port number we configured earlier
    • Input a maximum number of requests per second per instance
    • You can input "Maximum CPU utilization" and "Capacity" at your discretion
    • Under Health check pick the health check you created previously
  • Under Host and path rules, just make sure that the selected backend service is the one we just created

  • Under Frontend configuration:

    • Leave IP as "Ephemeral" and leave the Port set to 80
  • Click "Review and finalize" to check that everything is OK.

  • Click "Create"

  • You'll be able to check this load balancer's IP address and port by clicking on it

  • You can then make sure this load balancer is used by your instance group by going back to "Instance groups"
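
There is no single command-line equivalent for the console flow above, but for reference, a comparable HTTP load balancer setup could be sketched roughly as follows. The resource names are arbitrary (the health check is the one created earlier), and the exact flags may vary with your Cloud SDK version, so double-check against the gcloud reference rather than treating this as a definitive recipe:

$ gcloud compute --project "example-project-156611" instance-groups set-named-ports "collectors" \
                 --zone "us-central1-c" \
                 --named-ports "http:8080"

$ gcloud compute --project "example-project-156611" backend-services create "collectors-backend" \
                 --global \
                 --protocol "HTTP" \
                 --port-name "http" \
                 --health-checks "ssc-health-check"

$ gcloud compute --project "example-project-156611" backend-services add-backend "collectors-backend" \
                 --global \
                 --instance-group "collectors" \
                 --instance-group-zone "us-central1-c"

$ gcloud compute --project "example-project-156611" url-maps create "collectors-url-map" \
                 --default-service "collectors-backend"

$ gcloud compute --project "example-project-156611" target-http-proxies create "collectors-proxy" \
                 --url-map "collectors-url-map"

$ gcloud compute --project "example-project-156611" forwarding-rules create "collectors-http" \
                 --global \
                 --target-http-proxy "collectors-proxy" \
                 --ports "80"

Once the load balancer has an external IP, you can verify end to end that requests reach a healthy collector:

$ curl http://<LOAD-BALANCER-IP>/health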
