Distributed Analytics & Machine Learning - Dan Zaratsian, March 2021
- Introduction and Module Agenda
- Distributed Computing
- Walk-through of Tools and Services for Big Data
- Distributed Architectures and Use Cases
- Google Colab Notebook Environment
- Google BigQuery Sandbox
- Hadoop 101
- Intro to Apache Hive
- Apache Hive Syntax and Schema Design
- Intro to Apache HBase and Apache Phoenix (NoSQL)
- Apache HBase Schema Design & Best Practices
- Apache Phoenix Syntax
- Intro to Apache SparkSQL
- Apache SparkSQL
- BigQuery (Serverless SQL)
- Google Cloud Firestore (NoSQL)
Assignment
- Assignment 1 SQL - Solution
- Due on Friday, March 26
- Please complete as an individual assignment
- Email your code and answers to d.zaratsian@gmail.com
- Assignment 2 NoSQL - Solution
- Due on Friday, March 26
- Please complete as an individual assignment
- Email your code and answers to d.zaratsian@gmail.com
- Apache Spark Overview
- Spark Machine Learning (MLlib)
- ML Pipelines
- Building and deploying Spark machine learning models
- Considerations for ML in distributed environments
- Spark Best Practices and Tuning
- Spark Code Walk-through (within Google Colab)
Assignment
- Assignment 3 - Solution
- Due on Friday, April 2
- Please complete as an individual assignment
- Email your code to d.zaratsian@gmail.com
Slides
- Apache Kafka
- Google PubSub
- Demo of PubSub
- Spark Streaming
- Demo of Spark Streaming
- Apache Beam (Google Dataflow)
Slides
- Overview of Serverless
- Serverless ML
- BigQuery ML
- Google Cloud Functions
- Google Cloud AutoML
Slides
- Overview of Google Cloud and general cloud services for ML Deployment
- Google Cloud AI Platform
- Demo ML Model Deployment for NFL Play Predictions (link to repo)
- Cloud Deployments - App Engine
- Demo App Engine Deployment
- Cloud Deployments - Kubernetes
- Demo Kubernetes Deployment