# Azure Discovery Day 2019
## Analytics with NRT Intelligence on Azure

#### Summary
In this Python Jupyter notebook, you will:
1. Connect to Azure storage
2. Ingest data from CSV files in Azure storage to Spark dataframes
3. Conform and merge heterogeneous data sets using the Spark dataframe API
4. Emit data to Azure storage in Parquet file format

Additionally, there are optional steps to create Hive tables on the data, query them with Spark SQL, and perform some exploratory data analysis (EDA).

In [2]:
## Import the Spark SQL types and functions used in this notebook

from pyspark.sql.types import *
from pyspark.sql.functions import broadcast, lit

### Functions

In [4]:
## Function to get a Spark DataFrame from a CSV source file

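def GetDataFrameFromCsvFile(schema, sourceFilePath, delimiter):
  # `spark` is the SparkSession that the Databricks notebook provides
  # The first line of the file is treated as a header row, and the explicit schema is applied instead of inference
  df = spark\
    .read\
    .format("csv")\
    .option("header", "true")\
    .option("delimiter", delimiter)\
    .schema(schema)\
    .load(sourceFilePath)

  return df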

In [5]:
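def HandleReferenceDataFrame(df):
  # Note: broadcast() returns a new DataFrame carrying a broadcast hint and does not modify df in place.
  # The result is discarded here, so re-apply broadcast(df) in the join itself for the hint to take effect.
  broadcast(df)
  # cache() is lazy: it only marks the DataFrame for caching
  df.cache()
  # count() is an action, so it populates the cache and returns the number of rows
  count = df.count()

  return count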

### Connect to Azure Storage

In [7]:
# Define some variables to minimize "hard-coding" in the cells below. Note that variables can also be defined in a separate notebook.

# Azure storage account information
storageAcctName = "PROVIDE"
storageAcctKey = "PROVIDE"
containerName = "PROVIDE"

# The mount point in the DBFS file system - this will look like a local folder but points to the Azure storage location
mountPoint = "/mnt/" + containerName + "/"

In [8]:
## Use the Databricks file system utilities to mount a Databricks file system location (/mnt/YOUR CONTAINER NAME) that points to the Azure storage account where data files are located
## We use variables defined above and string concatenation here so that no "hard-coding" is needed

dbutils.fs.mount(
  source = "wasbs://" + containerName + "@" + storageAcctName + ".blob.core.windows.net",
  mount_point = mountPoint,
  extra_configs = {"fs.azure.account.key." + storageAcctName + ".blob.core.windows.net":storageAcctKey}
)
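
The mount call above fails if the mount point is already in use, which is easy to hit when re-running the notebook. The following is a minimal sketch, in plain Python, of how the source URI and configuration key are assembled and how a mount could be guarded. The helper names (`build_wasbs_source`, `build_account_key_config`, `is_mounted`) and the sample values are hypothetical and not part of the lab.

```python
def build_wasbs_source(container_name, storage_account_name):
    """Return the wasbs:// URI for a container in a storage account."""
    return "wasbs://{0}@{1}.blob.core.windows.net".format(
        container_name, storage_account_name
    )


def build_account_key_config(storage_account_name, storage_account_key):
    """Return the extra_configs dictionary that carries the account key."""
    key_name = "fs.azure.account.key.{0}.blob.core.windows.net".format(
        storage_account_name
    )
    return {key_name: storage_account_key}


def is_mounted(mount_point, existing_mount_points):
    """Check whether mount_point is already present, ignoring trailing slashes."""
    normalized = {p.rstrip("/") for p in existing_mount_points}
    return mount_point.rstrip("/") in normalized


# Placeholder values for illustration only (hypothetical, not real credentials)
source = build_wasbs_source("mycontainer", "myaccount")
configs = build_account_key_config("myaccount", "not-a-real-key")
already_mounted = is_mounted(
    "/mnt/mycontainer/", ["/mnt/mycontainer", "/databricks-datasets"]
)
```

In a Databricks notebook, the list of existing mount points could be obtained with `[m.mountPoint for m in dbutils.fs.mounts()]`, and `dbutils.fs.mount(...)` would then be called only when `is_mounted(...)` returns `False`.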

In [9]:
## This is included to remove the Azure storage mount
## Commented out since not needed for the lab, but included here "just in case" for debugging/experimenting - for example, mount, unmount, try something different, mount again

#dbutils.fs.unmount(mountPoint)

In [10]:
## List the contents of the mounted Azure storage container to validate that the connection and mount succeeded
## We are using the Databricks display() function here to improve the readability of the output

display(dbutils.fs.ls(mountPoint))

### Load Reference Data Files into DataFrames

##### Define variables to hold the source path for each of the reference data files

In [13]:
src_file_ref_payment_type = mountPoint + "reference-data/payment_type_lookup.csv"
src_file_ref_rate_code = mountPoint + "reference-data/rate_code_lookup.csv"
src_file_ref_taxi_zone = mountPoint + "reference-data/taxi_zone_lookup.csv"
src_file_ref_trip_month = mountPoint + "reference-data/trip_month_lookup.csv"
src_file_ref_trip_type = mountPoint + "reference-data/trip_type_lookup.csv"
src_file_ref_vendor = mountPoint + "reference-data/vendor_lookup.csv"

##### Define explicit schemas for each of the reference data files

We could also ingest the files with schema inference (i.e., let Spark read the data and guess the column types), but we define the schemas explicitly here for greater control over column names and data types.
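
To make the idea of an explicit schema concrete, here is a toy sketch in plain Python (not Spark) that parses delimited text and casts each column according to a declared schema. The function `read_with_schema`, the sample rows, and the choice to turn unconvertible values into `None` are all assumptions of this sketch, not a description of Spark's internals.

```python
import csv
import io


def read_with_schema(text, schema, delimiter):
    """Parse delimited text with a header row, casting columns per the schema.

    schema is a list of (column_name, caster) pairs in file column order.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    next(reader)  # skip the header row; column names come from the schema
    rows = []
    for record in reader:
        row = {}
        for (name, caster), raw in zip(schema, record):
            try:
                row[name] = caster(raw)
            except ValueError:
                # Assumption of this sketch: unconvertible values become None
                row[name] = None
        rows.append(row)
    return rows


# Hypothetical pipe-delimited sample, not taken from the lab's data files
sample = (
    "payment_type|abbreviation|description\n"
    "1|AAA|First sample row\n"
    "x|BBB|Second sample row\n"
)
schema = [("payment_type", int), ("abbreviation", str), ("description", str)]
rows = read_with_schema(sample, schema, "|")
```

The declared types decide how each value is interpreted, so a bad value such as `x` in an integer column is caught at load time rather than silently changing the column's type.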

In [15]:
## Payment type
schema_ref_payment_type = StructType([
    StructField("payment_type", IntegerType(), True),
    StructField("abbreviation", StringType(), True),
    StructField("description", StringType(), True)
])

## Rate code ID
schema_ref_rate_code = StructType([
    StructField("rate_code_id", IntegerType(), True),
    StructField("description", StringType(), True)
])

## Taxi zone
schema_ref_taxi_zone = StructType([
    StructField("location_id", StringType(), True),
    StructField("borough", StringType(), True),
    StructField("zone", StringType(), True),
    StructField("service_zone", StringType(), True)
])

## Trip month
schema_ref_trip_month = StructType([
    StructField("trip_month", StringType(), True),
    StructField("month_name_short", StringType(), True),
    StructField("month_name_full", StringType(), True)
])

## Trip type
schema_ref_trip_type = StructType([
    StructField("trip_type", IntegerType(), True),
    StructField("description", StringType(), True)
])

## Vendor ID
schema_ref_vendor = StructType([
    StructField("vendor_id", IntegerType(), True),
    StructField("abbreviation", StringType(), True),
    StructField("description", StringType(), True)
])

##### Load each reference data set into a Spark DataFrame

We load the data from each source file into a dataframe using the function defined above for that purpose.

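Then we apply two optimizations to the reference dataframes:
1. Broadcast the dataframe. These are small dataframes with reference data. Broadcasting replicates a dataframe to each worker node in the Spark cluster, so that joins against it avoid cross-node (cross-network) shuffles. In PySpark, `broadcast()` returns a hinted dataframe rather than changing the original, so the hint takes effect only when the returned dataframe is used in a join.
2. Cache the dataframe in memory as another performance optimization. Caching is lazy, so the cache is populated by the first action that runs on the dataframe (here, the row count).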

Last, we print the number of rows in the dataframe.
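
The broadcast idea described above can be illustrated with a toy sketch in plain Python. It is not Spark's implementation: the function `broadcast_hash_join` and the sample rows are hypothetical, and each inner list merely stands in for the rows held by one worker node.

```python
def broadcast_hash_join(partitions, lookup_rows, key):
    """Inner-join each partition of a large data set against a small lookup table.

    partitions: list of lists of dicts (one inner list per simulated worker).
    lookup_rows: list of dicts (the small reference table).
    """
    # Build the hash table once; on a real cluster a copy would be sent to every worker
    lookup = {row[key]: row for row in lookup_rows}
    joined_partitions = []
    for partition in partitions:
        joined = []
        for row in partition:
            match = lookup.get(row[key])
            if match is not None:  # keep only rows with a matching key
                merged = dict(match)
                merged.update(row)
                joined.append(merged)
        # Each partition is joined locally, with no rows moved between partitions
        joined_partitions.append(joined)
    return joined_partitions


# Hypothetical sample data for illustration
lookup_rows = [
    {"vendor_id": 1, "abbreviation": "V1"},
    {"vendor_id": 2, "abbreviation": "V2"},
]
partitions = [
    [{"vendor_id": 1, "fare": 10.0}, {"vendor_id": 3, "fare": 7.5}],
    [{"vendor_id": 2, "fare": 4.0}],
]
result = broadcast_hash_join(partitions, lookup_rows, "vendor_id")
```

In PySpark, the corresponding hint would be written at join time, for example `large_df.join(broadcast(df_ref_vendor), "vendor_id")`, where `large_df` is a hypothetical large dataframe that has a `vendor_id` column.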

In [17]:
df_ref_payment_type = GetDataFrameFromCsvFile(schema_ref_payment_type, src_file_ref_payment_type, "|")

print(HandleReferenceDataFrame(df_ref_payment_type))
display(df_ref_payment_type)

In [18]:
df_ref_rate_code = GetDataFrameFromCsvFile(schema_ref_rate_code, src_file_ref_rate_code, "|")

print(HandleReferenceDataFrame(df_ref_rate_code))
display(df_ref_rate_code)

In [19]:
df_ref_taxi_zone = GetDataFrameFromCsvFile(schema_ref_taxi_zone, src_file_ref_taxi_zone, ",")

print(HandleReferenceDataFrame(df_ref_taxi_zone))
display(df_ref_taxi_zone)

In [20]:
df_ref_trip_month = GetDataFrameFromCsvFile(schema_ref_trip_month, src_file_ref_trip_month, ",")

print(HandleReferenceDataFrame(df_ref_trip_month))
display(df_ref_trip_month)

In [21]:
df_ref_trip_type = GetDataFrameFromCsvFile(schema_ref_trip_type, src_file_ref_trip_type, "|")

print(HandleReferenceDataFrame(df_ref_trip_type))
display(df_ref_trip_type)

In [22]:
df_ref_vendor = GetDataFrameFromCsvFile(schema_ref_vendor, src_file_ref_vendor, "|")

print(HandleReferenceDataFrame(df_ref_vendor))
display(df_ref_vendor)