Feature Engineering, Spark ML Random Forest Model, Log MLflow, Streaming Data Source

richiebachala/Databricks-and-Spark

Azure Databricks and Spark

Feature Engineering, Spark ML Random Forest Model, Log MLflow, Streaming Data Source

Lab Overview

Create DataFrames

As data engineers, we need to make data available to our marketing analysts and data scientists for reporting and modeling. The first step in that process is to read in data and define schemas.

  1. Read Mounted Data
  2. Create Dataframes
  3. View, Infer, and Define Schemas
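
The difference between inferring and defining a schema can be sketched as follows. This is a minimal illustration, not the lab's notebook code: the column names, the types, and the CSV format are assumptions, since this README does not list them. `spark` is the session object that Databricks notebooks provide.

```python
# Hypothetical columns and types; the lab's real schema is not given in this README.
CUSTOMER_SCHEMA_DDL = "customer_id INT, channel STRING, discount DOUBLE, spend DOUBLE"


def read_with_inferred_schema(spark, path):
    """Let Spark scan the file and guess column types (convenient, costs an extra pass)."""
    return spark.read.csv(path, header=True, inferSchema=True)


def read_with_defined_schema(spark, path):
    """Apply an explicit DDL schema so column types are fixed up front."""
    return spark.read.schema(CUSTOMER_SCHEMA_DDL).csv(path, header=True)
```

Calling `printSchema()` on either result shows the column types Spark ended up with, which is a quick way to compare the two approaches.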

Transform and Load Data

Here we learn how to prepare data and load the transformed data into Databricks Delta tables. We will:

  1. Merge Data
  2. Join Data
  3. Change Data Types
  4. Remove Duplicate Values
  5. Resolve Data Discrepancies
  6. Create Views using Delta Tables
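
The steps above might look roughly like this in PySpark. It is a sketch under several assumptions: "merge" is read here as a union of two DataFrames with the same columns, and the column names, the `EMAIL`/`email` discrepancy, and the view name `orders_clean` are all hypothetical.

```python
def transform_and_load(spark, orders_a, orders_b, customers, delta_path):
    """Merge, join, cast, de-duplicate, clean, then save as Delta and expose a view."""
    # 1. Merge: stack two DataFrames that share the same columns (assumed meaning).
    merged = orders_a.unionByName(orders_b)
    # 2. Join: attach customer attributes on a hypothetical key column.
    joined = merged.join(customers, on="customer_id", how="left")
    # 3. Change data types: make sure spend is numeric.
    typed = joined.withColumn("spend", joined["spend"].cast("double"))
    # 4. Remove duplicate rows.
    deduped = typed.dropDuplicates()
    # 5. Resolve discrepancies: normalize an inconsistent label (hypothetical value).
    cleaned = deduped.replace({"EMAIL": "email"}, subset=["channel"])
    # 6. Write a Delta table and register a temporary view over it.
    cleaned.write.format("delta").mode("overwrite").save(delta_path)
    spark.read.format("delta").load(delta_path).createOrReplaceTempView("orders_clean")
    return cleaned
```

Once the view exists, analysts can query it with `spark.sql("SELECT * FROM orders_clean")` or from a SQL cell in the same notebook session.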

Explore Data

Working as marketing analysts, we will explore our data and look for answers to a few questions:

How does customer spend compare across channels? Does spend dip at higher discount amounts? Can we identify any instance in which a lower discount amount leads to higher spend or more conversions?

  1. Read a Databricks Delta Table
  2. Aggregate Data
  3. Quickly Visualize Data
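
One way to approach these questions is to aggregate spend by channel and discount level. The sketch below assumes hypothetical `channel`, `discount`, `spend`, and `customer_id` columns and a Delta table stored at a path; none of these names come from the lab itself.

```python
def spend_by_channel_and_discount(spark, delta_path):
    """Average spend and row count for each (channel, discount) pair."""
    df = spark.read.format("delta").load(delta_path)
    return (
        df.groupBy("channel", "discount")
          # Result columns are named avg(spend) and count(customer_id).
          .agg({"spend": "avg", "customer_id": "count"})
          .orderBy("channel", "discount")
    )
```

In a Databricks notebook, passing the result to `display()` renders it as a table with built-in chart options, which covers the quick visualization step.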

Machine Learning

  1. Build a Pipeline for Feature Engineering
  2. Train a Spark ML Random Forest Model
  3. Evaluate the Model and Tune Parameters
  4. Log Experiments with MLflow
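
The four steps can be sketched in one function. This is an illustration only: the feature and label columns are hypothetical, the label is assumed to be a numeric 0/1 column, and the choice of area under the ROC curve as the metric is an assumption rather than something the README states.

```python
# Hypothetical column names; the README does not list the lab's features or label.
CATEGORICAL_COLS = ["channel"]
NUMERIC_COLS = ["discount", "recency"]
LABEL_COL = "converted"  # assumed to be numeric 0/1


def feature_columns():
    """Columns fed to VectorAssembler: indexed categoricals plus numerics."""
    return [f"{c}_idx" for c in CATEGORICAL_COLS] + NUMERIC_COLS


def train_random_forest(train_df, test_df, num_trees=50, max_depth=5):
    # Imported here so the helpers above work without Spark or MLflow installed.
    import mlflow
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    # Feature engineering: index string columns, then assemble one feature vector.
    indexers = [
        StringIndexer(inputCol=c, outputCol=f"{c}_idx", handleInvalid="keep")
        for c in CATEGORICAL_COLS
    ]
    assembler = VectorAssembler(inputCols=feature_columns(), outputCol="features")
    forest = RandomForestClassifier(
        labelCol=LABEL_COL, featuresCol="features",
        numTrees=num_trees, maxDepth=max_depth,
    )
    pipeline = Pipeline(stages=indexers + [assembler, forest])

    # Each call is recorded as one MLflow run with its parameters and metric.
    with mlflow.start_run():
        model = pipeline.fit(train_df)
        evaluator = BinaryClassificationEvaluator(labelCol=LABEL_COL)  # areaUnderROC by default
        auc = evaluator.evaluate(model.transform(test_df))
        mlflow.log_param("num_trees", num_trees)
        mlflow.log_param("max_depth", max_depth)
        mlflow.log_metric("auc", auc)
    return model, auc
```

A simple way to tune with this sketch is to call `train_random_forest` several times with different `num_trees` and `max_depth` values and compare the logged runs in the MLflow experiment view.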

Connect to Streaming Data

  1. Connect to a Streaming Data Source
  2. View and Interact with Streaming Data
  3. Insert Streaming Data into Delta Table
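
A minimal Structured Streaming sketch of these steps follows. The README does not say what the streaming source is, so this assumes JSON files arriving in a directory; the paths and schema passed in are placeholders supplied by the caller.

```python
def stream_to_delta(spark, source_path, schema_ddl, delta_path, checkpoint_path):
    """Read a file-based stream and append it continuously to a Delta table."""
    stream_df = (
        spark.readStream
             .schema(schema_ddl)   # file-based streams need an explicit schema
             .json(source_path)    # assumed source: JSON files landing in a folder
    )
    return (
        stream_df.writeStream
                 .format("delta")
                 .outputMode("append")
                 # The checkpoint lets the query resume where it left off after a restart.
                 .option("checkpointLocation", checkpoint_path)
                 .start(delta_path)
    )
```

The returned streaming query object can be stopped with `.stop()` when the exercise is finished. In a Databricks notebook, calling `display()` on the streaming DataFrame is one way to view the data as it arrives.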

Create and Run a Job

  1. View Code for Constructing a Simple BI Report
  2. Create a Job to Run this Notebook
  3. Run the Job

View Job Output

  1. Read the File Generated from the Job Run
  2. View the DataFrame
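
The job and its output can be sketched as a pair of functions: one that the scheduled notebook might run to produce a report file, and one that reads the file back afterwards. The report contents, the CSV format, and the column names are assumptions made for illustration.

```python
def write_report(spark, delta_path, report_path):
    """What the job notebook might do: total spend per channel, saved as CSV."""
    report = (
        spark.read.format("delta").load(delta_path)
             .groupBy("channel")
             .agg({"spend": "sum"})
    )
    # coalesce(1) produces a single part file, which suits a small report.
    report.coalesce(1).write.mode("overwrite").csv(report_path, header=True)
    return report


def read_report(spark, report_path):
    """Load the file written by the job run into a DataFrame for viewing."""
    return spark.read.csv(report_path, header=True, inferSchema=True)
```

After the job finishes, `read_report(spark, report_path).show()` prints the saved rows.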

Azure Databricks is a Unified Analytics Platform for Data Engineers, Data Scientists, and Analysts
