# Using Spark in Azure Databricks

This notebook gives you a quick overview of how to harness the power of the Apache Spark engine within Azure Databricks by executing cells in an interactive web document, called a notebook.

Spark is used for many tasks, including reading and writing very large files and data sets. Its query engine can process data at massive scale; some of the largest Spark jobs in the world run on petabytes of data.

Up to this point in the lab, you have worked with reading and writing flat files using Azure SQL Data Warehouse (SQL DW) and PolyBase. An alternative to working with these flat files is to use the Apache Spark engine. Reasons to do this include working with the files more rapidly and interactively, combining batch and stream processing in the same engine and code base, and including machine learning and deep learning as part of your big data process.

SQL Data Warehouse remains a key component of a big data solution. You can import big data into SQL Data Warehouse with simple PolyBase T-SQL queries, and then use the power of massively parallel processing (MPP) to run high-performance analytics. As you integrate and analyze, the data warehouse becomes the single version of truth your business can count on for insights.

### Getting Started

Run the following cell to configure your module.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Remember to attach your notebook to a cluster before running any cells in your notebook. In the notebook's toolbar, select the drop down arrow next to Detached, then select your cluster under Attach to.

![Attached to cluster](https://databricksdemostore.blob.core.windows.net/images/03/03/databricks-cluster-attach.png)

### Step 3

Now that you have opened this notebook, we can use it to run some code.
1. The cell below contains a simple calculation we would like to run: `1 + 1`.
2. Run the cell by clicking the run icon and selecting **Run Cell**.
<div><img src="https://files.training.databricks.com/images/eLearning/run-notebook-1.png" style="width:600px; margin-bottom:1em; border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/></div>
<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> You can also run a cell by typing **Ctrl-Enter**.

In [4]:
1 + 1

### Data sources for this lab

To complete this lab, you must first complete the environment setup found in the Microsoft learning module.

The PowerShell script executed during setup creates an Azure Storage account with two containers: `labdata` and `dwtemp`. The `dwtemp` container acts as a common Blob Storage container that both the Azure Databricks cluster and the SQL DW instance use to exchange data between the two systems.

The PowerShell script also copied the datasets required for this lab into the `labdata` container. These files are located in `/retaildata/rawdata/`.

Here is a description of the datasets we will use:

1.  **ProductFile**: Contains product data, e.g. ProductId, ProductName,
    Department, DepartmentId, Category, CategoryId and the price of each
    product.

2.  **UserFile**: Contains user data, e.g. UserId, FirstName, LastName,
    Age, Gender, RegistrationDate.

3.  **Weblog**: Contains weblog data, an activity log of users on the
    retail site, e.g. SessionId, UserId, CustomerAction, ProductId,
    TransactionTime etc. The CustomerAction field indicates whether the
    user purchased, browsed or added a product to the cart.

### How to mount an Azure Blob to DBFS

Azure Databricks has its own file system called Databricks File System, or DBFS. DBFS provides a common place to read and write files that are stored across one or more mount points. A mount point is how you attach file storage from one or more services, such as Azure Blob storage or Azure Data Lake Store. We will mount the Azure storage account as a new DBFS directory to make it easier to access the files contained within.

Once the blob is mounted as a DBFS directory, you can access it without exposing your Azure Blob Store keys.


### Creating a Shared Access Signature (SAS) URL
Azure provides you with a secure way to create and share access keys for your Azure Blob Store without compromising your account keys.

More details are provided <a href="http://docs.microsoft.com/azure/storage/common/storage-dotnet-shared-access-signature-part-1" target="_blank"> in this document</a>.

A SAS URL lets you access your Azure Blob Store data directly from the Databricks File System (DBFS).

As shown in the screen shot, in the Azure Portal, go to the storage account containing the blob to be mounted. Then:

1. Select Shared access signature from the menu.
2. Click the Generate SAS button.
3. Copy the entire Blob service SAS URL to the clipboard.
4. Use the URL in the mount operation, as shown below.
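
The Blob service SAS URL copied in step 3 combines the storage account host with the SAS token in its query string. As a minimal sketch using only Python's standard library, the helper below separates the two parts and reads the token's expiry (`se`) parameter. The name `split_sas_url` is hypothetical, and the URL is a made-up placeholder rather than a working signature.

```python
from urllib.parse import urlsplit, parse_qs


def split_sas_url(sas_url):
    """Split a Blob service SAS URL into account name, SAS token and parameters."""
    parts = urlsplit(sas_url)
    # The storage account is the first label of the host name.
    account = parts.hostname.split(".")[0]
    # The SAS token is the query string, including the leading '?'.
    token = "?" + parts.query
    # Individual parameters, e.g. 'se' (expiry time) or 'sp' (permissions).
    params = parse_qs(parts.query)
    return account, token, params


# Placeholder URL for illustration only (not a real account or signature).
example_url = (
    "https://examplestore.blob.core.windows.net/"
    "?sv=2018-03-28&sp=rl&se=2030-01-01T00:00:00Z&sig=PLACEHOLDER"
)

account, token, params = split_sas_url(example_url)
print(account)          # storage account name
print(params["se"][0])  # expiry timestamp of the token
```

Checking the `se` value before mounting can help explain authorization failures caused by an expired token.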

<img src="https://files.training.databricks.com/images/eLearning/DataFrames-MSFT/create-sas-keys.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; margin-top: 20px; padding: 10px"/>

Create the mount point with `dbutils.fs.mount(source = .., mount_point = .., extra_configs = ..)`.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> If the directory is already mounted, you receive the following error:

> Directory already mounted: /mnt/dw-training

In this case, use a different mount point such as `/mnt/dw-training-2`, and update every reference to the mount point in the cells that follow. Be sure to update **SasURL** and **StorageAccount** with your own values. The **ContainerName** value should be `labdata`.
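
One way to avoid the "already mounted" error is to choose an unused mount point up front. The following is only a sketch: `pick_mount_point` is a hypothetical helper, and the list of existing mounts is hard-coded here so the example runs anywhere. On a cluster, that list could be built from `dbutils.fs.mounts()`.

```python
def pick_mount_point(base, existing):
    """Return `base` if it is unused, otherwise base-2, base-3, and so on."""
    taken = set(existing)
    if base not in taken:
        return base
    suffix = 2
    while "%s-%d" % (base, suffix) in taken:
        suffix += 1
    return "%s-%d" % (base, suffix)


# Hypothetical list of current mounts. On Databricks you could use:
#   existing_mounts = [m.mountPoint for m in dbutils.fs.mounts()]
existing_mounts = ["/mnt/dw-training", "/mnt/other"]

mount_point = pick_mount_point("/mnt/dw-training", existing_mounts)
print(mount_point)
```

If you adopt a different mount point this way, remember that later cells still need to read from the same path.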

In [9]:
SasURL = "https://azuredatabricksstore03.blob.core.windows.net/?sv=2018-03-28&ss=bfqt&srt=sco&sp=rwdlacup&se=2020-01-06T06:44:19Z&st=2019-01-05T22:44:19Z&spr=https&sig=FCoo3O4PjjoeF34yn%2FIMBcKyte%2BMTaKFcrdZiP8cdLE%3D"
indQuestionMark = SasURL.index('?')
SasKey = SasURL[indQuestionMark:len(SasURL)]
StorageAccount = "azuredatabricksstore03"
ContainerName = "labdata"
MountPoint = "/mnt/dw-training"

dbutils.fs.mount(
  source = "wasbs://%s@%s.blob.core.windows.net/" % (ContainerName, StorageAccount),
  mount_point = MountPoint,
  extra_configs = {"fs.azure.sas.%s.%s.blob.core.windows.net" % (ContainerName, StorageAccount) : "%s" % SasKey}
)

Take a look at the file contents of the **/retaildata/rawdata** subdirectory of the mount you just created:

In [11]:
%fs ls /mnt/dw-training/retaildata/rawdata

### Introducing DataFrames

This new mount point is now available to all Spark nodes of the Databricks cluster, allowing us to parallelize reads and writes as needed. We can now read the data into data structures called DataFrames.

Under the covers, DataFrames are derived from data structures known as Resilient Distributed Datasets (RDDs). RDDs and DataFrames are immutable distributed collections of data. Let's take a closer look at what some of these terms mean before we understand how they relate to DataFrames:

* **Resilient**: They are fault tolerant, so if part of your operation fails, Spark  quickly recovers the lost computation.
* **Distributed**: RDDs are distributed across networked machines known as a cluster.
* **DataFrame**: A data structure where data is organized into named columns, like a table in a relational database, but with richer optimizations under the hood. 

Without the named columns and declared types provided by a schema, Spark wouldn't know how to optimize the execution of any computation. Since DataFrames have a schema, they use the Catalyst Optimizer to determine the optimal way to execute your code.

DataFrames were invented because the business community uses tables in a relational database, Pandas or R DataFrames, or Excel worksheets. A Spark DataFrame is conceptually equivalent to these, with richer optimizations under the hood and the benefit of being distributed across a cluster.

#### Interacting with DataFrames

Once created (instantiated), a DataFrame object has methods attached to it. Methods are operations one can perform on DataFrames such as filtering,
counting, aggregating and many others.

> <b>Example</b>: To create (instantiate) a DataFrame, use this syntax: `df = ...`

To display the contents of the DataFrame, apply a `show` operation (method) on it using the syntax `df.show()`. 

The `.` indicates you are *applying a method on the object*.

In working with DataFrames, it is common to chain operations together, such as: `df.select().filter().orderBy()`.  

By chaining operations together, you don't need to save intermediate DataFrames into local variables (thereby avoiding the creation of extra objects).

Also note that, for logically equivalent chains such as the two below, you do not have to worry about how to order operations, because the optimizer determines the optimal order of execution for you.

`df.select(...).orderBy(...).filter(...)`

versus

`df.filter(...).select(...).orderBy(...)`
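
The optimizer is free to reorder these operations because both chains describe the same result. The sketch below illustrates that equivalence with plain Python lists standing in for a DataFrame (the rows are made up). Python does not reorder anything here; the point is only that both orderings return identical output, which is what allows Spark to pick the cheaper plan.

```python
# Made-up rows standing in for a small DataFrame.
rows = [
    {"name": "tent", "price": 120.0},
    {"name": "lamp", "price": 15.0},
    {"name": "stove", "price": 60.0},
]

# Filter first, then sort (fewer rows to sort).
filter_then_sort = sorted(
    (r for r in rows if r["price"] > 50), key=lambda r: r["price"]
)

# Sort first, then filter (same result, more work).
sort_then_filter = [
    r for r in sorted(rows, key=lambda r: r["price"]) if r["price"] > 50
]

print(filter_then_sort == sort_then_filter)
print([r["name"] for r in filter_then_sort])
```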

#### DataFrames and SQL

DataFrame syntax is more flexible than SQL syntax. Here we illustrate general usage patterns of SQL and DataFrames.

Suppose we have a data set we loaded as a table called `myTable` and an equivalent DataFrame called `df`.
We have three fields/columns called `col_1` (numeric type), `col_2` (string type) and `col_3` (timestamp type).
Here are basic SQL operations and their DataFrame equivalents.

Notice that columns in DataFrames are referenced by `col("<columnName>")`, where `col` is imported from `pyspark.sql.functions`.

| SQL                                         | DataFrame (Python)                    |
| ------------------------------------------- | ------------------------------------- | 
| `SELECT col_1 FROM myTable`                 | `df.select(col("col_1"))`             | 
| `DESCRIBE myTable`                          | `df.printSchema()`                    | 
| `SELECT * FROM myTable WHERE col_1 > 0`     | `df.filter(col("col_1") > 0)`         | 
| `..GROUP BY col_2`                          | `..groupBy(col("col_2"))`             | 
| `..ORDER BY col_2`                          | `..orderBy(col("col_2"))`             | 
| `..WHERE year(col_3) > 1990`                | `..filter(year(col("col_3")) > 1990)` | 
| `SELECT * FROM myTable LIMIT 10`            | `df.limit(10)`                        |
| `display(myTable)` (text format)            | `df.show()`                           | 
| `display(myTable)` (html format)            | `display(df)`                         |

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** You can also run SQL queries with the special syntax `spark.sql("SELECT * FROM myTable")`

In this course you will see many other uses of DataFrames. Working out the SQL equivalents is left up to you
(in some cases as an exercise).

## Reading data

Now that you understand what DataFrames are, let's create a new DataFrame named `weblogData` by reading the weblogs. These files are in text format, and the data within is delimited by pipe (|) characters. Each file also contains a header row. We pass this information to the Spark engine through the `option` method of the reader returned by `spark.read`, as seen below.

In [16]:
weblogData = (spark
              .read
              .option("header","true") # Allows us to extract the header
              .option("delimiter", '|') # Specifies the pipe character as the delimiter
              .csv("/mnt/dw-training/retaildata/rawdata/weblognew/[3-5]/{*}/weblog.txt")) # Absolute path under the mount point
weblogData.show()

In the above cell, we read data from the following path under the mount point:

`/retaildata/rawdata/weblognew/[3-5]/{*}/weblog.txt`

We can use wildcard characters in a path string to read multiple files.

**\[3-5\]** matches the subdirectories of the weblognew directory named 3, 4 and 5.

**\*** matches every subdirectory inside its parent directory.
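
To get a feel for how these two wildcards behave, here is a small sketch using Python's standard `fnmatch` module. This is an illustration only: Spark resolves paths with Hadoop's glob implementation, not `fnmatch` (which, for example, has no `{}` syntax), and the directory names below are made up.

```python
from fnmatch import fnmatchcase

# Hypothetical subdirectory names under weblognew.
subdirs = ["1", "2", "3", "4", "5", "6"]

# [3-5] matches exactly one character in the range 3..5.
matched = [d for d in subdirs if fnmatchcase(d, "[3-5]")]
print(matched)

# * matches any name, so every subdirectory is included.
any_name = all(fnmatchcase(d, "*") for d in ["a", "2019", "part-0001"])
print(any_name)
```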

The above code returns a DataFrame and sets it to the variable `weblogData`.

**DataFrame.take(** *n* **)** returns the first *n* rows of a DataFrame as a list of Row objects.

In [18]:
display(weblogData.take(5))

Now, let's create two new DataFrames; one for users and one for products.

These files are a bit different from the weblog files. They are comma-delimited and contain no header row, so Spark assigns default column names (`_c0`, `_c1`, and so on). As such, we set the delimiter to a comma and give each selected column an alias for clarity.

In [20]:
from pyspark.sql.functions import col
users = (spark
         .read
         .option("inferSchema", "true") # Automatically applies a schema by evaluating the data types
         .option("delimiter", ',')
         .csv("/mnt/dw-training/retaildata/rawdata/UserFile/part{*}") # Absolute path under the mount point
         .select(col("_c0").alias("UserId"),
                 col("_c3").alias("FirstName"),
                 col("_c5").alias("LastName"),
                 col("_c9").alias("Gender"),
                 col("_c15").alias("Age"),
                 col("_c18").alias("RegisteredDate")
                )
        )
products = (spark
            .read
            .option("inferSchema", "true")
            .option("delimiter", ',')
            .csv("/mnt/dw-training/retaildata/rawdata/ProductFile/part{*}") # Absolute path under the mount point
            .select(col("_c0").alias("ProductId"),
                    col("_c1").alias("ProductName"),
                    col("_c2").alias("BasePrice"),
                    col("_c3").alias("CategoryId"),
                    col("_c7").alias("Category"),
                    col("_c8").alias("Department")
                   )
           )

Let's take a look at the user data.

In [22]:
display(users)

Now, view the product data.

In [24]:
display(products)

### Perform DataFrame operations

We will now explore various operations that can be performed on DataFrames, using the three we have created so far.

#### Select

**select(\*cols)**

Projects a set of expressions and returns a new DataFrame.

**Parameters**:

cols -- list of column names (string) or expressions (Column). If
one of the column names is '\*', that column is expanded to include
all columns in the current DataFrame.

Try this out in the cell below:

In [27]:
products.select("ProductId","ProductName").show(5)

#### OrderBy

**orderBy**(\*cols, \*\*kwargs)

Returns a new DataFrame sorted by the specified column(s).

**Parameters**:

cols -- list of Column or column names to sort by.

ascending -- boolean or list of boolean (default True). Sort
ascending vs. descending. Specify list for multiple sort orders. If
a list is specified, length of the list must equal length of the
cols.

Execute the following three cells to try a few different approaches:

In [29]:
products.orderBy("ProductId",ascending=False).show(5)

In [30]:
products.orderBy(["ProductName","BasePrice"],ascending=[0, 1]).show(5)

In [31]:
products.orderBy(products.ProductId.desc()).show(5)
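
The mixed sort order in the second cell (`ascending=[0, 1]`) means "ProductName descending, then BasePrice ascending within equal names". As a rough sketch of those semantics in plain Python, with made-up tuples rather than the lab data, a stable sort can be applied from the least significant key to the most significant one:

```python
# Made-up (ProductName, BasePrice) pairs.
products_sample = [
    ("Tent", 120.0),
    ("Lamp", 15.0),
    ("Tent", 99.0),
    ("Stove", 60.0),
]

# Step 1: sort by the secondary key (BasePrice ascending).
by_price = sorted(products_sample, key=lambda p: p[1])

# Step 2: sort by the primary key (ProductName descending).
# Python's sort is stable, so ties keep their BasePrice order.
result = sorted(by_price, key=lambda p: p[0], reverse=True)

for row in result:
    print(row)
```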

#### Count

**count()**

Returns the number of rows in this DataFrame.

Execute the following cell to see the count of product records:

In [33]:
products.count()

#### Agg

**agg(\*exprs)**

Aggregate on the entire DataFrame without groups.

Execute the following two cells for a couple of examples of the `agg` operation:

In [35]:
products.agg({"BasePrice" : "max"}).show(1)

In [36]:
from pyspark.sql import functions as F

products.agg(F.min(products.BasePrice)).show(1)

#### Where

**where(**condition**)**

Filters rows using the given condition.

`where()` is an alias for `filter()`.

Execute the following cell:

In [38]:
products.where(products.BasePrice > 500).show(5)

#### GroupBy

**groupBy**(\*cols)

Groups the DataFrame using the specified columns, so we can run
aggregation on them. See GroupedData for all the available aggregate
functions.

**Parameters**:

cols -- list of columns to group by. Each element should be a column
name (string) or an expression (Column).

Execute the following three cells for a few examples of the `groupBy` operation:

In [40]:
# Without column name

products.groupBy().min().show()

In [41]:
# With column name

products.groupBy("Department").count().show(5)

In [42]:
# Get minimum BasePrice in each department

products.groupBy(products.Department).min("BasePrice").show(5)
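
Conceptually, `groupBy` buckets rows by key and then reduces each bucket with an aggregate such as `count` or `min`. The sketch below mimics the last two cells in plain Python on made-up (Department, BasePrice) pairs. It is an illustration of the semantics, not of how Spark executes the aggregation across a cluster.

```python
from collections import Counter

# Made-up (Department, BasePrice) pairs.
rows = [
    ("Outdoors", 120.0),
    ("Outdoors", 15.0),
    ("Kitchen", 60.0),
    ("Kitchen", 75.0),
    ("Garden", 30.0),
]

# Equivalent of groupBy("Department").count()
counts = Counter(dept for dept, _ in rows)

# Equivalent of groupBy("Department").min("BasePrice")
min_price = {}
for dept, price in rows:
    if dept not in min_price or price < min_price[dept]:
        min_price[dept] = price

print(dict(counts))
print(min_price)
```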

### Joins

Before exploring joins, we need to create a weblog DataFrame containing a small sample of records.

Execute the following to retrieve 10 records from the weblog DataFrame where the customer action is Purchased:

In [44]:
weblogSampleData = weblogData.where(weblogData.Action == "Purchased").take(10)

The above code snippet returns a list of Row objects. We have to convert that list object into an RDD.

Execute the following cell to convert the list into an RDD:

In [46]:
weblogSampleRDD = sc.parallelize(weblogSampleData)

Now execute the cell below to convert the RDD into a DataFrame:

In [48]:
weblogSampleDF = weblogSampleRDD.toDF()

#### Join

**join**(other, on=None, how=None)

Joins with another DataFrame, using the given join expression.

**Parameters**:

other -- Right side of the join

on -- a string for the join column name, or a join expression
(Column). If on is a string indicating the name of the join
column, the column must exist on both sides, and this performs an
inner equi-join.

how -- str, default 'inner'. The join type, for example inner, outer,
left\_outer, right\_outer, left\_semi.

Execute the following cell to join the sample weblog DataFrame with the products DataFrame based on the Product ID key to get the BasePrice for each product purchased in the weblog DataFrame:

In [50]:
(weblogSampleDF
 .join(products, weblogSampleDF.ProductId == products.ProductId, 'inner')
 .select(weblogSampleDF.SessionId,
         weblogSampleDF.ProductId,
         weblogSampleDF.Quantity,
         products.BasePrice)
 .show(10))
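
An inner join keeps only the rows whose key appears on both sides. The sketch below shows that behavior in plain Python with made-up rows. It assumes each ProductId occurs at most once in the product list, which lets a dictionary serve as the lookup; Spark itself makes no such assumption.

```python
# Made-up weblog and product rows.
weblog_rows = [
    {"SessionId": "s1", "ProductId": 1, "Quantity": 2},
    {"SessionId": "s2", "ProductId": 3, "Quantity": 1},
    {"SessionId": "s3", "ProductId": 9, "Quantity": 4},  # no matching product
]
product_rows = [
    {"ProductId": 1, "BasePrice": 10.0},
    {"ProductId": 3, "BasePrice": 25.0},
]

# Lookup table: ProductId -> BasePrice (assumes unique ProductId values).
price_by_id = {p["ProductId"]: p["BasePrice"] for p in product_rows}

# Inner join: weblog rows without a matching product are dropped.
joined = [
    {**w, "BasePrice": price_by_id[w["ProductId"]]}
    for w in weblog_rows
    if w["ProductId"] in price_by_id
]

for row in joined:
    print(row)
```

Session `s3` is absent from the output because product 9 has no match; an outer join would keep it with a missing BasePrice.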

## Learn Spark SQL 

The **sql** function of **SparkSession** object enables applications to
run SQL queries programmatically and returns the result as a DataFrame
object. However, to run SQL queries we need to create a table.

### Databricks databases and tables

An Azure Databricks database is a collection of tables. An Azure Databricks table is a collection of structured data. Tables are equivalent to Apache Spark DataFrames. This means that you can cache, filter, and perform any operations supported by DataFrames on tables. You can query tables with Spark APIs and Spark SQL.

There are two types of tables: global and local. A global table is available across all clusters. Azure Databricks registers global tables to the Hive metastore. A local table is not accessible from other clusters and is not registered in the Hive metastore. A local table is also known as a temporary table or a temporary view.

A temporary view gives you a name to query from SQL, but unlike a table it exists only for the duration of your Spark Session. As a result, the temporary view will not carry over when you restart the cluster or switch to a new notebook. It also won't show up in the Data button on the menu on the left side of a Databricks notebook which provides easy access to databases and tables.

You create a temporary view with the `createOrReplaceTempView` operation on a DataFrame.

For example, to create a temporary view of the products DataFrame, you would execute the following:

`products.createOrReplaceTempView("Products")`

For our purposes, we will be creating a **global table**. This is because you will be executing a different notebook after you are done with this one, and you will want to have access to the data structures you define here when writing the data back to your Azure SQL Data Warehouse instance. This avoids having to create the DataFrames all over again.

> Note: To avoid potential consistency issues, the best approach to replacing table contents is to overwrite the table. Just in case the "products" table already exists, we will overwrite its contents with our DataFrame.

Execute the cell below to create a new global "products" table from the `products` DataFrame:

In [53]:
products.write.mode("OVERWRITE").saveAsTable("products")

Now, execute the following cell to create a new global "weblogs" table, overwriting any existing tables:

In [55]:
weblogSampleDF.write.mode("OVERWRITE").saveAsTable("weblogs")

Finally, execute the following cell to create a new global "users" table:

In [57]:
users.write.mode("OVERWRITE").saveAsTable("users")

### Run SQL queries

Execute the following cells to learn how to run Spark SQL queries.

Notice that each cell starts with the **%sql** magic command. This tells the notebook to execute the cell as SQL, rather than in the notebook's default language, Python.

In [59]:
%sql

select * from products
limit 10

In [60]:
%sql

select ProductName from products where ProductId = 30

In [61]:
%sql

select Department, count(ProductId) as ProductCount
from products group by Department order by ProductCount desc

In [62]:
%sql

select w.SessionId, w.ProductId, p.ProductName, p.BasePrice, w.Quantity, (p.BasePrice * w.Quantity) as Total
from weblogs w Join products p on w.ProductId == p.ProductId
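
If you want to experiment with the shape of this join query outside a cluster, the sketch below runs an equivalent statement against an in-memory SQLite database from Python's standard library. SQLite is only a stand-in here: its dialect and types differ from Spark SQL, and the table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Made-up tables that mirror the columns used in the query above.
conn.executescript("""
CREATE TABLE products (ProductId INTEGER, ProductName TEXT, BasePrice REAL);
CREATE TABLE weblogs (SessionId TEXT, ProductId INTEGER, Quantity INTEGER);
INSERT INTO products VALUES (1, 'Tent', 120.0), (2, 'Lamp', 15.0);
INSERT INTO weblogs VALUES ('s1', 1, 2), ('s2', 2, 3);
""")

# Join each weblog row to its product and compute the line total.
rows = conn.execute("""
SELECT w.SessionId, w.ProductId, p.ProductName, p.BasePrice, w.Quantity,
       (p.BasePrice * w.Quantity) AS Total
FROM weblogs w JOIN products p ON w.ProductId = p.ProductId
ORDER BY w.SessionId
""").fetchall()

for row in rows:
    print(row)

conn.close()
```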

## Create visualizations

Azure Databricks provides easy-to-use, built-in visualizations for your data. 

Display the data by invoking the Databricks `display` function.

Visualize the query below by selecting the down arrow button next to the bar graph icon once the table is displayed. Select **Area** to select the Area chart visualization:

<img src="https://databricksdemostore.blob.core.windows.net/images/02-SQL-DW/databricks-display-graph-button.png" style="border: 1px solid #aaa; padding: 10px; border-radius: 10px 10px 10px 10px"/>

Configure the area chart settings by clicking the **Plot Options...** button below the chart. A few controls will appear on the screen. Set the values for these controls as shown below:

* Select **RegistrationDate** for **Keys**.
* Select **count(UserId)** for **Values**.
* Select **MAX** for **Aggregation**.

<img src="https://databricksdemostore.blob.core.windows.net/images/02-SQL-DW/databricks-display-graph-options.png" style="border: 1px solid #aaa; padding: 10px; border-radius: 10px 10px 10px 10px"/>

You should see an area chart that looks like the following:

<img src="https://databricksdemostore.blob.core.windows.net/images/02-SQL-DW/databricks-display-graph.png" style="border: 1px solid #aaa; padding: 10px; border-radius: 10px 10px 10px 10px"/>

In [65]:
%sql

SELECT COUNT(UserId), Year(RegisteredDate) as RegistrationDate FROM users
GROUP BY Year(RegisteredDate) ORDER BY Year(RegisteredDate)

Next, create a bar chart of the number of registrations by age.

Execute the cell below, then select the **Bar** chart this time.

Configure the bar chart settings by clicking the **Plot Options...** button below the chart. Set the values for these controls as shown below:

* Select **Age** for **Keys**.
* Select **count(UserId)** for **Values**.
* Select **MAX** for **Aggregation**.

<img src="https://databricksdemostore.blob.core.windows.net/images/02-SQL-DW/databricks-display-chart-options.png" style="border: 1px solid #aaa; padding: 10px; border-radius: 10px 10px 10px 10px"/>

You should see a bar chart that looks like the following:

<img src="https://databricksdemostore.blob.core.windows.net/images/02-SQL-DW/databricks-display-chart.png" style="border: 1px solid #aaa; padding: 10px; border-radius: 10px 10px 10px 10px"/>

In [68]:
%sql

SELECT COUNT(UserId), Age FROM Users GROUP BY Age ORDER BY Age

## Next Steps

By this point, you should have a good grasp of how to process data using Spark on Azure Databricks. The next lesson will build upon these concepts by connecting to your Azure SQL Data Warehouse instance and writing this data to a new external table.

Start the next lesson, [Understanding the SQL DW Connector]($./02-Understanding-the-SQL-DW-Connector ).