# Delta Lake Example Code 

This notebook is compatible with HDI 5.0 (Spark 3.x and Scala 2.12). It demonstrates how to use Delta Lake on the HDI platform. The sample code uses a customer business model and generates random data with [mockneat](https://github.com/nomemory/mockneat).

This notebook demonstrates the following features:

- Configure Delta Lake on HDI
- Generate Random Data Using MockNeat
- Write Delta Format
- Read Delta Format
- Merge Schema
- Time Travel

## Configuration

Delta Lake needs the following Spark configuration:

- Add the Delta Lake package and set `spark.sql.extensions` and `spark.sql.catalog.spark_catalog`

In [None]:
%%configure
{ "conf": {"spark.jars.packages": "io.delta:delta-core_2.12:1.0.1,net.andreinc:mockneat:0.4.8",
           "spark.sql.extensions":"io.delta.sql.DeltaSparkSessionExtension",
           "spark.sql.catalog.spark_catalog":"org.apache.spark.sql.delta.catalog.DeltaCatalog"
          }
}
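
To confirm that the session picked up the settings above, one option is to read them back from the running session. This is a minimal sketch that assumes `spark` is the SparkSession created by the notebook kernel; the names `extensions` and `catalog` are only illustrative.

```scala
// Read back the two Delta-related settings from the active session.
// getOption returns None when a key was never set, instead of throwing.
val extensions: Option[String] = spark.conf.getOption("spark.sql.extensions")
val catalog: Option[String] = spark.conf.getOption("spark.sql.catalog.spark_catalog")

println(s"spark.sql.extensions            = ${extensions.getOrElse("<not set>")}")
println(s"spark.sql.catalog.spark_catalog = ${catalog.getOrElse("<not set>")}")
```

If either value prints `<not set>`, the `%%configure` cell most likely ran after the session had already started.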

## Generate Mock Data using MockNeat
- Use MockNeat for Random Data Generation
- Generate Customer Data using the MockNeat Library
- Configuration:
   - numberOfRecords1 - number of records to generate during first cycle

In [None]:
import net.andreinc.mockneat.MockNeat
import net.andreinc.mockneat.abstraction.MockUnit
import net.andreinc.mockneat.types.enums.RandomType
import java.time.LocalDate
import scala.reflect.ClassTag

val mockNeat = MockNeat.threadLocal()

/**
* Customer Business Model
**/
case class Customer(var customerId: Int, var customerName: String, var firstName: String,
                    var lastName: String, var userName: String, var registrationDate: String)
// configure based on your needs
// this code runs on the driver, so the data volume is limited by driver memory
val DateStart = LocalDate.of(2014, 1, 1)
val DateEnd = LocalDate.of(2016, 1, 1)
// number of mock records to generate
val numberOfRecords1 = 10
val startIndex1 = 1
// the range below is inclusive, so subtract 1 to get exactly numberOfRecords1 records
val endIndex1 = startIndex1 + numberOfRecords1 - 1

val customerData = (startIndex1 to endIndex1).map(i=>{
    Customer(i,
             mockNeat.names().full().get(),
             mockNeat.names().first().get(),
             mockNeat.names().last().get(),
             mockNeat.users().get(),
             mockNeat.localDates.between(DateStart, DateEnd).mapToString().get())
})
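
The number of generated records depends on how the index range is built. As a quick illustration in plain Scala (the names `first` and `count` are hypothetical and not part of the notebook), `to` includes its upper bound while `until` excludes it:

```scala
val first = 1
val count = 10

// `to` is inclusive: first, first + 1, ..., first + count
val inclusive = first to (first + count)
// `until` is exclusive: first, first + 1, ..., first + count - 1
val exclusive = first until (first + count)

println(s"to    -> ${inclusive.size} elements")
println(s"until -> ${exclusive.size} elements")
```

Checking `customerData.size` after the cell above is a simple way to verify that the number of generated customers matches what you intended.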

## Save Data Using Delta Lake Format and Print Schema
- Configuration
    - **adsl2Path** - path where the Delta Lake data is saved. It can be a full path or a relative path. [More details](https://learn.microsoft.com/en-us/azure/hdinsight/overview-azure-storage#hdinsight-storage-architecture)

In [None]:
import org.apache.hadoop.fs._
import java.util.Date
import scala.collection.immutable.{List=>ScalaList}

// define Delta Lake Path
val adsl2Path = "/tmp/customerdata5"

/**
* Object to capture Delta File Detail
* @param filePath: File Path
* @param modifiedTime: Modified Time
*/
case class DeltaFileDetail(filePath: Path, modifiedTime: Date) {
    override def toString(): String = {
        s"File : ${filePath.toString()} , Modified Time: ${modifiedTime.toString()}"
    }
}

/**
* Get the list of files (directories excluded) under a path in the Hadoop file system
*/
def getListOfFile(path: String): ScalaList[DeltaFileDetail] = {
  val fs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  fs.listStatus(new Path(path))
    .filter(!_.isDirectory) // isDir is deprecated in the Hadoop API
    .map(fileStatus => DeltaFileDetail(fileStatus.getPath, new Date(fileStatus.getModificationTime())))
    .toList
}


//create data frame
val df = sc.parallelize(customerData).toDF
df.write.mode("append").format("delta").save(adsl2Path)
// print schema of the dataframe
df.printSchema

## List Storage Directory (Parquet files and Delta Logs)

In [None]:
println("----------------------------------------------------------------- Parquet Files-----------------------------------------------------------------")
val listOfFiles1 = getListOfFile(adsl2Path)
listOfFiles1.foreach(println)
println("----------------------------------------------------------------- Delta Logs-----------------------------------------------------------------")
val listOfLogs1= getListOfFile(adsl2Path + "/_delta_log")
listOfLogs1.foreach(println)
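
The commit files under `_delta_log` are JSON files in which each line records one action (for example `commitInfo`, `metaData`, or `add`). As a rough sketch, they can be inspected with the regular Spark JSON reader. The name `commits` is only illustrative, and the columns that appear depend on which actions the log contains.

```scala
import org.apache.spark.sql.functions.{col, input_file_name}

// Load every JSON commit file in the transaction log as a DataFrame.
val commits = spark.read.json(adsl2Path + "/_delta_log/*.json")
commits.printSchema()

// Show the data files registered by each commit through "add" actions.
commits
  .filter(col("add").isNotNull)
  .select(input_file_name().as("commitFile"), col("add.path"), col("add.size"))
  .show(false)
```

This is for exploration only; regular reads should go through the `delta` format so that the log is interpreted correctly.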

## Read Delta Format data from storage
- We can read using Spark Read API
- or using [Delta Table API](https://docs.delta.io/latest/api/scala/io/delta/tables/DeltaTable.html)

In [None]:
// read with the Spark DataFrame reader; use the delta format so the transaction log is honored
val df = spark.read.format("delta").load(adsl2Path)
println(s"***************** number of records : ${df.count()}")
// alternatively, read through the DeltaTable API
import io.delta.tables._
val dt: io.delta.tables.DeltaTable = DeltaTable.forPath(adsl2Path)
// convert Table to DataFrame
dt.toDF.show(20)
println(s"***************** number of records from delta table : ${dt.toDF.count()}")
//Delta Table Version History 
dt.history().show(false)
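
With the Delta extension and catalog configured as in the Configuration section, the same table can also be queried by path from Spark SQL. The sketch below assumes `adsl2Path` and `dt` from the cells above; `countBySql` is an illustrative name.

```scala
// Query the Delta table directly by path using the delta.`<path>` syntax.
val countBySql = spark.sql(s"SELECT COUNT(*) AS cnt FROM delta.`$adsl2Path`")
countBySql.show()

// The version history is available through SQL as well.
spark.sql(s"DESCRIBE HISTORY delta.`$adsl2Path`").show(false)
```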

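## Merge Schema

The new model removes `userName` from the existing model and adds a new column, `age`. Writing with the `mergeSchema` option adds `age` to the table schema, and the new rows store null for `userName`.

You can configure how many records are generated during the second cycle with `numberOfRecords2`.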

In [None]:
val numberOfRecords2 = 10

/**
* Customer new Business Model
* removed user name and added age column
**/
case class CustomerNew(var customerId: Int, var customerName: String, var firstName: String,
                    var lastName: String, var registrationDate: String, var age:Int)
// configure based on your needs
// this code runs on the driver, so the data volume is limited by driver memory
val DateStart = LocalDate.of(2014, 1, 1)
val DateEnd = LocalDate.of(2016, 1, 1)
// don't change these variables
val newStartIndex2 = endIndex1+1
// the range below is inclusive, so subtract 1 to get exactly numberOfRecords2 records
val newendIndex2 = newStartIndex2 + numberOfRecords2 - 1


val customerNewData = (newStartIndex2 to newendIndex2).map(i=>{
    CustomerNew(i,
             mockNeat.names().full().get(),
             mockNeat.names().first().get(),
             mockNeat.names().last().get(),
             mockNeat.localDates.between(DateStart, DateEnd).mapToString().get(),
             mockNeat.ints().range(10, 100).get())
})

// create dataframe from mock data
val df = sc.parallelize(customerNewData).toDF
//save it in delta format
df.write.option("mergeSchema", "true").mode("append").format("delta").save(adsl2Path)
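
To see why the `mergeSchema` option matters, the following sketch appends a DataFrame with an extra column to a small throwaway table, first without the option and then with it. The path `scratchPath` and the column names are hypothetical and chosen so that the customer table is not touched.

```scala
import scala.util.Try
import spark.implicits._

// Hypothetical scratch location, separate from adsl2Path.
val scratchPath = "/tmp/schema_enforcement_demo"

// Start from a two-column table; overwriteSchema keeps reruns repeatable.
Seq((1, "a")).toDF("id", "name")
  .write.mode("overwrite").option("overwriteSchema", "true")
  .format("delta").save(scratchPath)

// A DataFrame with one additional column.
val wider = Seq((2, "b", 30)).toDF("id", "name", "age")

// Without mergeSchema, Delta rejects the append because of the schema mismatch.
val strict = Try(wider.write.mode("append").format("delta").save(scratchPath))
println(s"append without mergeSchema failed: ${strict.isFailure}")

// With mergeSchema, the new column is added to the table schema.
wider.write.option("mergeSchema", "true").mode("append").format("delta").save(scratchPath)
spark.read.format("delta").load(scratchPath).printSchema()
```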

## New Files - Parquet and Delta Logs

In [None]:
println("----------------------------------------------------------------- Parquet Files-----------------------------------------------------------------")
val listOfFiles2 = getListOfFile(adsl2Path)
// files present after the second write but not after the first
listOfFiles2.filterNot(listOfFiles1.toSet).foreach(println)
println("----------------------------------------------------------------- Delta Logs-----------------------------------------------------------------")
val listOfLogs2= getListOfFile(adsl2Path + "/_delta_log")
// log entries present after the second write but not after the first
listOfLogs2.filterNot(listOfLogs1.toSet).foreach(println)

## Delta Log - Transaction log History

In [None]:
// dt was created before the second write; querying it again picks up the newly written rows
dt.toDF.show(30)
// number of records should increase
println(s"***************** number of records from delta table : ${dt.toDF.count()}")
//Delta Table Version History - new version is added
println("------------------------  delta log history ------------------------------------")
dt.history().show(false)
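
After the merge, rows from the first write have no `age` and rows from the second write have no `userName`. A small check, assuming the column names produced by the `Customer` and `CustomerNew` case classes above, loads the path again so that the DataFrame is analyzed against the current table schema (`merged` is an illustrative name):

```scala
import org.apache.spark.sql.functions.col

// Reload the table so the DataFrame carries the merged schema.
val merged = spark.read.format("delta").load(adsl2Path)
merged.printSchema()

// Rows written before the age column existed.
println(s"rows with null age      : ${merged.filter(col("age").isNull).count()}")
// Rows written with the model that no longer has userName.
println(s"rows with null userName : ${merged.filter(col("userName").isNull).count()}")
```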

## Time Travel
Read version zero (the initial write).

In [None]:
// load version zero (initial)
val dfVersion0 = spark.read.format("delta").option("versionAsOf",0).load(adsl2Path)
dfVersion0.count()
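
The same `versionAsOf` option can be applied to every version listed in the table history. The sketch below assumes `adsl2Path` from the cells above; `versions` is an illustrative name.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.col

// Collect all version numbers recorded in the table history, oldest first.
val versions: Array[Long] = DeltaTable.forPath(adsl2Path)
  .history()
  .select(col("version"))
  .collect()
  .map(_.getLong(0))
  .sorted

// Count the records visible at each version.
versions.foreach { v =>
  val n = spark.read.format("delta").option("versionAsOf", v).load(adsl2Path).count()
  println(s"version $v -> $n records")
}
```

Each append creates a new version, so the record count should grow from one version to the next.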