ryanchynoweth44/DatabricksContent

Introduction

This repository aims to provide various Databricks tutorials and demos.

If you would like to follow along, check out the Databricks Community Cloud.

Demos

Stream Databricks Example

The demo is broken into logical sections and uses the New York City Taxi Tips dataset. Please complete them in the following order:

  1. Send Data to Azure Event Hub (python)
  2. Read Data from Azure Event Hub (scala)
  3. Train a Basic Machine Learning Model on Databricks (scala)
  4. Create new Send Data Notebook
  5. Make Streaming Predictions

Databricks Delta

The demo is broken into logical sections. Please complete them in the following order:

  1. Setup Environment
  2. Data Ingestion
  3. Bronze Data to Silver Data
  4. A quick ML Model
  5. Silver Data To Gold Data
  6. A Few Cool Features of Delta
  7. Summary

Programmatically Generate a Databricks Access Token

Use service principals to automate the creation of a Databricks access token.

  1. README
  2. Reference Blog
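
As a rough illustration of what the token-creation call might look like, the sketch below builds (but does not send) a request to the Databricks Token API endpoint `/api/2.0/token/create`. It assumes an Azure AD access token for the service principal has already been obtained. The function name, workspace URL, and token value are hypothetical placeholders, and depending on how the service principal is registered in the workspace, additional headers may be required. The linked README and blog remain the reference for the full flow.

```python
import json
import urllib.request


def build_token_request(workspace_url, aad_token, lifetime_seconds=3600, comment="automation"):
    """Build a POST request for the Databricks Token API (the request is not sent here)."""
    body = json.dumps({"lifetime_seconds": lifetime_seconds, "comment": comment}).encode("utf-8")
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.0/token/create",
        data=body,
        method="POST",
        headers={
            # Assumption: an Azure AD token acquired for the service principal.
            "Authorization": f"Bearer {aad_token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical values; replace with your own workspace URL and Azure AD token.
request = build_token_request("https://adb-0000000000000000.0.azuredatabricks.net", "<aad-token>")
# To send it: json.load(urllib.request.urlopen(request))["token_value"]
```

Keeping request construction separate from sending makes it easy to inspect the URL and payload before any credentials leave your machine.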

Delta Lake Views

The title is a bit of a lie: Delta Lake itself does not support views, but they are a common ask from many clients. Whether views are needed to help enforce row-level security or to present different slices of the data, here are a few ways to get it done.

  1. README
  2. Hive Views with Delta Lake
  3. Delta Lake "Views"
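
One common approach, sketched here under assumptions, is a metastore (Hive) view defined over a Delta table with a row filter. The table, view, and column names are hypothetical, and `spark` is assumed to be an active SparkSession such as the one Databricks notebooks provide. The inputs are interpolated directly into SQL, so they are assumed to come from trusted code rather than user input.

```python
def row_filtered_view_sql(view_name, delta_table, predicate):
    """Return SQL that defines a view exposing only rows matching `predicate`."""
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS "
        f"SELECT * FROM {delta_table} WHERE {predicate}"
    )


def create_row_filtered_view(spark, view_name, delta_table, predicate):
    """Create the view in the metastore; the underlying Delta table is unchanged."""
    spark.sql(row_filtered_view_sql(view_name, delta_table, predicate))


# Hypothetical names: expose only the EMEA rows of a `sales` Delta table.
statement = row_filtered_view_sql("sales_emea", "sales", "region = 'EMEA'")
# In a notebook: create_row_filtered_view(spark, "sales_emea", "sales", "region = 'EMEA'")
```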

Delta Lake CDC Operations

Batch processing of changes within a Delta Lake is common practice and easy to do. These examples show how to use Delta Lake time travel to see how a table has changed between two versions.

  1. README
  2. Python Script
  3. Scala Script
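
A minimal sketch of the idea, not necessarily the approach taken in the linked scripts: read the same table at two versions with the `versionAsOf` reader option and diff the results with `exceptAll`. The function name and path are hypothetical, `spark` is assumed to be an active SparkSession with Delta Lake available, and both versions are assumed to share the same schema.

```python
def changes_between(spark, table_path, old_version, new_version):
    """Return (added, removed) DataFrames describing the change between two versions."""

    def read_version(version):
        # Delta time travel: load the table as it was at the given version.
        return (
            spark.read.format("delta")
            .option("versionAsOf", version)
            .load(table_path)
        )

    old_df = read_version(old_version)
    new_df = read_version(new_version)
    # Rows only in the new version: inserts and the new values of updated rows.
    added = new_df.exceptAll(old_df)
    # Rows only in the old version: deletes and the old values of updated rows.
    removed = old_df.exceptAll(new_df)
    return added, removed


# Hypothetical usage in a notebook:
# added, removed = changes_between(spark, "/mnt/delta/events", 3, 5)
```

An updated row shows up in both outputs (its old values in `removed`, its new values in `added`), so joining the two on a key column is one way to separate updates from pure inserts and deletes.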

Databricks Autoloader

An example of using the Auto Loader capabilities for file-based processing. Auto Loader ensures that each file is processed exactly once.

  1. README
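
For orientation, here is a hedged sketch of what an Auto Loader stream can look like. It uses the `cloudFiles` source, which is only available on Databricks, and assumes a runtime recent enough to support the `cloudFiles.schemaLocation` option. The function name and all paths are hypothetical, and the linked README may configure things differently.

```python
def start_autoloader_stream(spark, source_dir, target_path, checkpoint_dir, file_format="json"):
    """Incrementally load new files from `source_dir` into a Delta table."""
    stream_df = (
        spark.readStream.format("cloudFiles")  # Auto Loader source (Databricks only)
        .option("cloudFiles.format", file_format)
        # Assumption: the inferred schema is stored alongside the checkpoint.
        .option("cloudFiles.schemaLocation", checkpoint_dir)
        .load(source_dir)
    )
    return (
        stream_df.writeStream.format("delta")
        # The checkpoint tracks which files have already been processed.
        .option("checkpointLocation", checkpoint_dir)
        .start(target_path)
    )


# Hypothetical usage in a Databricks notebook:
# query = start_autoloader_stream(spark, "/mnt/raw/events", "/mnt/delta/events", "/mnt/chk/events")
```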

Resources

In this directory I keep a central repository of articles I have written and helpful resource links with short descriptions.

Below are a number of links with quick descriptions of what they cover.

  • Upsert Databricks Blog

    • This blog provides a number of very helpful use cases that can be solved using an upsert operation. The parts I found most interesting were the different actions available when rows are matched or not matched. Users can delete rows, update specific values, insert rows, or update entire rows. The foreachBatch function is crucial for CDC operations.
  • Upsert Notebook Example

    • Python and Scala examples that complete an upsert with the foreachBatch function.
  • Delta Table Updates

    • Shows various scenarios for updating delta tables via updates, inserts, and deletes.
    • There is specific information about schema evolution with upsert operations. Specifically, the schema can evolve when using insertAll or updateAll, but it will not work if you try inserting a row with a column that does not exist yet.
    • There can be 1, 2, or 3 whenMatched or whenNotMatched clauses. Of these, at most 2 can be whenMatched clauses, and at most 1 can be a whenNotMatched clause.
  • Z-ordering Databricks Blog

  • Optimize and Partition Columns

  • Dynamic Partition Pruning
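
The upsert-with-foreachBatch pattern described in the first few links can be sketched roughly as follows. This version uses a SQL `MERGE` statement rather than the `DeltaTable` builder API, purely to keep the example free of extra imports. The table, view, and column names are hypothetical, and `DataFrame.sparkSession` assumes PySpark 3.3 or later.

```python
def build_merge_sql(target_table, source_view, key_column):
    """Return a MERGE statement that updates matched rows and inserts new ones."""
    return (
        f"MERGE INTO {target_table} AS t "
        f"USING {source_view} AS s "
        f"ON t.{key_column} = s.{key_column} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


def make_upsert_fn(target_table, key_column, source_view="updates"):
    """Create a function suitable for DataStreamWriter.foreachBatch."""

    def upsert_batch(batch_df, batch_id):
        # Expose the micro-batch to SQL, then merge it into the target table.
        batch_df.createOrReplaceTempView(source_view)
        # Use the batch's own session so the temp view is visible to the MERGE.
        batch_df.sparkSession.sql(build_merge_sql(target_table, source_view, key_column))

    return upsert_batch


# Hypothetical usage with a streaming DataFrame `changes_df`:
# changes_df.writeStream.foreachBatch(make_upsert_fn("customers", "customer_id")).start()
```

`UPDATE SET *` and `INSERT *` are the SQL counterparts of updateAll and insertAll mentioned above.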

Contact

Please feel free to recommend demos or contact me if there are any confusing/broken steps. For any additional comments or questions email me at ryanachynoweth@gmail.com.

Disclaimer

These examples are not affiliated with Databricks and are not intended to be official documentation. For official documentation and tutorials, please go to the Databricks Academy or the Databricks blog.
