gcp-demo

Prerequisites

Google Cloud SDK (gsutil, bq, gcloud)
Docker
docker-compose
- pip install docker-compose
JDK 1.8

Background

This repo is for trying out GCP, including the Google Cloud SDK, BigQuery, and Dataproc with Spark. It mostly contains tutorial material from GCP as an introduction to the tools. It was created as an exploration of different tools in a tech stack.

Additionally, it contains an example of using Java streams with MySQL and part of the NPR Story API.

(Eventually, the goal is to write the combined dataset to GCP. Currently the two parts are not connected; the repo is still a work in progress.)

Querying MySQL with ORM, Streams, and NPR API XML example

The src folder contains a Gradle/Java 8 sub-project that pulls records from a MySQL DB (using the MySQL connector and Hibernate ORM) running in a Docker stack managed with Compose.

A util class fetches genre data from the NPR API as XML.
Then a dataset of Users is fetched from the DB using the ORM.
Streams are used to perform simple operations:
- First listing the Users
- Then, filtering on Users with a favorite genre matching one selected from the API.
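
The two stream operations above can be sketched as follows. This is a minimal illustration, not the repo's actual code: the User class here is a hypothetical stand-in for the Hibernate-mapped entity, and the field names, sample data, and the case-insensitive genre match are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class UserGenreDemo {

    // Hypothetical stand-in for the Hibernate-mapped User entity.
    static final class User {
        private final String name;
        private final String favoriteGenre;

        User(String name, String favoriteGenre) {
            this.name = name;
            this.favoriteGenre = favoriteGenre;
        }

        String getName() { return name; }
        String getFavoriteGenre() { return favoriteGenre; }
    }

    // Step 1: list the users by name.
    static List<String> listNames(List<User> users) {
        return users.stream()
                .map(User::getName)
                .collect(Collectors.toList());
    }

    // Step 2: keep only users whose favorite genre matches the selected one
    // (case-insensitive; a null favorite genre never matches).
    static List<User> withFavoriteGenre(List<User> users, String genre) {
        return users.stream()
                .filter(u -> genre.equalsIgnoreCase(u.getFavoriteGenre()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample data; in the repo the users would come from the ORM query.
        List<User> users = Arrays.asList(
                new User("Ada", "Jazz"),
                new User("Grace", "Classical"),
                new User("Linus", "jazz"));

        listNames(users).forEach(System.out::println);
        withFavoriteGenre(users, "Jazz")
                .forEach(u -> System.out.println(u.getName() + " likes " + u.getFavoriteGenre()));
    }
}
```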

The purpose of this demo was to take a look at the existing APIs NPR has available, as well as to imagine a use case that the replacement platform for the CMS Story API and Public Media Platform (PMP) might support.

Running locally

Run the build script to build the MySQL Docker container and then run the Java 8 app.

./build.sh

or individually:

Build and stage the Java app with Gradle:

./gradlew clean fatJar prepareEnvironment

Then start the Docker stack, which builds the MySQL container and the Java container and brings up the environment:

docker-compose -f ./docker/docker-compose.yml up --build

The database initializes by loading the init SQL script (InnoDB initialization and startup will take a minute); the Java container then waits for the DB to be healthy before starting the app.
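
How the repo performs this health check is not shown here (it may well be a Compose healthcheck rather than Java code). As one possible approach, the app could poll a probe before starting. The sketch below is an assumption throughout: WaitForDb, waitUntilHealthy, and the probe are hypothetical names, and a real probe would try to open a JDBC connection or run a trivial query.

```java
import java.util.function.BooleanSupplier;

public class WaitForDb {

    // Polls the probe up to maxAttempts times, sleeping delayMillis between
    // attempts. Returns true as soon as the probe succeeds, false otherwise.
    static boolean waitUntilHealthy(BooleanSupplier probe, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (probe.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and give up waiting.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical probe: pretend the DB becomes healthy after two seconds.
        long readyAt = System.currentTimeMillis() + 2000;
        boolean healthy = waitUntilHealthy(() -> System.currentTimeMillis() >= readyAt, 10, 500);
        System.out.println(healthy ? "DB is healthy, starting app" : "DB never became healthy");
    }
}
```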

Improvements

This was made as an exploratory project before an interview. To actually run an app or service like this, a number of improvements could be made.

Use of an ORM vs direct DB access

ORMs can be a useful abstraction on top of relational databases; however, they can lead to issues. Direct DB access can be simpler and more modular in some cases, and many things that can be done with an ORM can also be done with a plain SQL query.

Interaction with NPR One API vs Story API

The NPR One API is well documented and seems to have ongoing support. In future iterations, if this were a sample app, it could interact with the NPR One API or use the JavaScript SDK.

XML parser

There are many XML parsers available if using the legacy API. SAX, Jackson XML, or others could be good refactors.
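
As a rough idea of what a SAX-based refactor might look like, the sketch below uses the SAX parser bundled with the JDK to collect genre titles. The XML shape (item elements each containing a title) is an assumption made for illustration and is not taken from the NPR API documentation; GenreSaxParser and parseGenreTitles are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class GenreSaxParser {

    // Collects the text of every <title> found inside an <item>.
    // Titles outside of an <item> (e.g. a list-level title) are ignored.
    static List<String> parseGenreTitles(String xml) {
        List<String> titles = new ArrayList<>();
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                private boolean inItem = false;
                private StringBuilder text = null; // non-null while inside an item's <title>

                @Override
                public void startElement(String uri, String localName, String qName, Attributes attrs) {
                    if ("item".equals(qName)) {
                        inItem = true;
                    } else if (inItem && "title".equals(qName)) {
                        text = new StringBuilder();
                    }
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    // SAX may deliver text in several chunks, so append rather than assign.
                    if (text != null) {
                        text.append(ch, start, length);
                    }
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    if ("title".equals(qName) && text != null) {
                        titles.add(text.toString().trim());
                        text = null;
                    } else if ("item".equals(qName)) {
                        inItem = false;
                    }
                }
            });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse genre XML", e);
        }
        return titles;
    }

    public static void main(String[] args) {
        // Assumed response shape, for illustration only.
        String xml = "<list><title>Genres</title>"
                + "<item id=\"1\"><title>Jazz</title></item>"
                + "<item id=\"2\"><title>Classical</title></item></list>";
        parseGenreTitles(xml).forEach(System.out::println);
    }
}
```

SAX streams through the document without building a tree, which keeps memory use low for large responses; for small payloads a DOM or data-binding parser may be simpler to read.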

API Client

If this were a production app, it would be good to generalize the API client to cover multiple routes in the Story API. Also, if pulling static content from that API, it could be useful to cache those requests when the underlying data isn't subject to frequent change.
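
One way to picture a generalized, caching client is sketched below. It is only an illustration under assumptions: CachingApiClient is a hypothetical class, the fetcher function stands in for the real HTTP call, and the route strings are made up. Entries never expire here, so a production version would need some eviction or time-to-live policy.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class CachingApiClient {

    // Stand-in for the real HTTP call: maps a route to a response body.
    private final Function<String, String> fetcher;

    // Response bodies keyed by route; entries are never evicted in this sketch.
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    CachingApiClient(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    // Returns the cached body for the route, calling the fetcher only on a miss.
    String get(String route) {
        return cache.computeIfAbsent(route, fetcher);
    }

    public static void main(String[] args) {
        // Fake fetcher that logs each "request" so cache hits are visible.
        CachingApiClient client = new CachingApiClient(route -> {
            System.out.println("fetching " + route);
            return "<response for " + route + ">";
        });

        client.get("/list?id=genres");   // hypothetical route: triggers a fetch
        client.get("/list?id=genres");   // served from the cache, no fetch
        client.get("/list?id=topics");   // different route: triggers a fetch
    }
}
```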

Integration with Brightspot CMS

To integrate this project with brightspot-cms (the CMS that may be used in the NPR platform), a few things could be done in future iterations:

1. Replace the Hibernate ORM with Dari, the data modeling framework used in Brightspot.
2. Make the existing MySQL instance compatible with Brightspot/Dari. Dari has its own DDL (see the DDL here) that is used to keep database versions compatible across vendors and that also loads the table schemas.
3. Update the existing model classes to use Dari. Note that Dari doesn't store objects the way Hibernate and other ORMs do, where a class is mapped to a table. Instead, Dari stores them in the Record table, serializing each object as JSON and storing it in a blob.
   - In this example project we would extend the Content class provided by Brightspot to implement this.

GCP Demo using Apache Spark and BigQuery.

Create a bucket with:

gsutil mb gs://gcp-demo-spark-bucket

Launch spark-shell with the BigQuery connector:

spark-shell --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar

Or, if running Python, submit the job to Dataproc:

gcloud dataproc jobs submit pyspark ./demo/wordcount.py \
    --cluster gcp-demo \
    --region us-west2 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar

Or run the script directly with spark-submit:

spark-submit --jars gs://spark-lib/bigquery/spark-bigquery-latest.jar wordcount.py
