# Accessing Data

Apache Spark&trade; and Azure Databricks&reg; provide numerous ways to access your data.

<h2 style="color:red">WARNING!</h2> This notebook must be run using Databricks Runtime 4.0 or later.

### Getting Started

Run the following cell to configure our "classroom."

In [3]:
%run "./Includes/Classroom-Setup"

### Create a DataFrame From an Existing File

The <a href="https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html" target="_blank">Databricks File System</a> (DBFS) is the built-in, Azure-Blob-backed alternative to the <a href="http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html" target="_blank">Hadoop Distributed File System</a> (HDFS).

The example below creates a DataFrame from the **ip-geocode.parquet** file.

For Parquet files, you need to specify only one option: the path to the file.

A Parquet "file" is actually a collection of files stored in a single directory. Parquet is a columnar format that stores the schema alongside the data, which makes it well suited to storing "big data" on distributed file systems.

For more information, see <a href="https://parquet.apache.org/" target="_blank">Apache Parquet</a>.

In [6]:
ipGeocodeDF = spark.read.parquet("/mnt/training/ip-geocode.parquet")

Now that the DataFrame has been created, view its schema by invoking the `printSchema` method.

Note that the data types are known ahead of time (this is a property of the Parquet file format) and that `nullable` is set to `true`.

As a result, all missing values are treated as `NULL`s.

In [8]:
ipGeocodeDF.printSchema()
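
The information that `printSchema` prints is also available programmatically through the DataFrame's `schema` attribute. The sketch below shows one way to collect it. The helper name `describe_schema` is hypothetical, and it assumes only that its argument exposes `schema.fields` the way a Spark DataFrame does.

```python
def describe_schema(df):
    """Return (column name, type, nullable) for every column of a DataFrame.

    Hypothetical helper: `df` is assumed to be a Spark DataFrame, whose
    `schema` is a StructType holding one StructField per column.
    """
    return [
        (field.name, field.dataType.simpleString(), field.nullable)
        for field in df.schema.fields
    ]

# Usage in a notebook cell (assumes ipGeocodeDF from the cell above):
# for name, dtype, nullable in describe_schema(ipGeocodeDF):
#     print(name, dtype, nullable)
```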

### File Formats Other than Parquet

You can also create DataFrames from other file formats.

One common format is comma-separated-values (CSV), for which you specify:
* The file's delimiter; the default is "**,**".
* Whether the file has a header or not; the default is **false**.
* Whether or not to infer the schema; the default is **false**.
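
To make the three options concrete, here is a minimal sketch that spells each one out on the reader. The helper name `read_csv` and its parameter names are hypothetical; `spark` is assumed to be the `SparkSession` that a Databricks notebook provides, and the defaults mirror the ones listed above.

```python
def read_csv(spark, path, delimiter=",", header=False, infer_schema=False):
    """Read a CSV file with the three options set explicitly.

    Hypothetical helper: `spark` is assumed to be a SparkSession.
    """
    return (spark.read
            .option("sep", delimiter)             # field delimiter
            .option("header", header)             # is the first line a header?
            .option("inferSchema", infer_schema)  # scan the data to guess column types?
            .csv(path))                           # load the file as a DataFrame

# Usage in a notebook cell (the path is a placeholder):
# df = read_csv(spark, "/mnt/path/to/file.csv", header=True, infer_schema=True)
```

Inferring the schema requires an extra pass over the data, which is one reason it is off by default.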

In order to know which options to use, look at the first couple of lines of the file.

Take a look at the head of the file **/mnt/training/bikeSharing/data-001/day.csv**.

In [11]:
%fs head /mnt/training/bikeSharing/data-001/day.csv --maxBytes=492
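
If you would rather check programmatically than by eye, Python's standard `csv.Sniffer` can make an educated guess from a text sample. This is only a sketch and is not part of Spark or Databricks. The sample string is made up for illustration (it merely imitates the layout of a CSV file with a header), and the sniffer's heuristics can be wrong on unusual files, so treat the result as a hint.

```python
import csv

# Made-up sample standing in for the first few lines of a CSV file.
sample = (
    "id,date,season,count\n"
    "1,2011-01-01,1,985\n"
    "2,2011-01-02,1,801\n"
    "3,2011-01-03,1,1349\n"
)

sniffer = csv.Sniffer()
dialect = sniffer.sniff(sample, delimiters=",;\t|")  # guess the delimiter
has_header = sniffer.has_header(sample)              # guess whether line 1 is a header

print("delimiter:", repr(dialect.delimiter))
print("header:", has_header)
```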

Let's create a DataFrame from the CSV file described above.

As you can see above:
* The file has a header.
* The file is comma separated (the default).

In addition, the example lets Spark infer the schema.

In [13]:
bikeSharingDayDF = (spark
  .read                                                # Get a DataFrameReader
  .option("inferSchema","true")                        # Tell Spark to infer the schema
  .option("header","true")                             # Tell Spark that the file has a header
  .csv("/mnt/training/bikeSharing/data-001/day.csv"))  # Load the CSV file and return a DataFrame

Now that the DataFrame is created, view its contents by invoking the `show` method.

By default, `show()` (without any parameters) prints the first 20 rows.

Print the top `n` rows by invoking `show(n)`.

In [15]:
bikeSharingDayDF.show(10)

Alternatively, invoke the `display` function to show the same table in HTML format.

In [17]:
display(bikeSharingDayDF)

### Upload a Local File as a Table

The last two examples use files already loaded on the "server."

You can also create DataFrames by uploading files. The files are nominally stored as tables, from which you create DataFrames.

Download the following file to your local machine: <a href="https://s3-us-west-2.amazonaws.com/databricks-corp-training/common/dataframes/state-income.csv">state-income.csv</a>


1. Select **Data** from the sidebar, and click the **databricks** database
2. Select the **+** icon to create a new table

<img src="https://files.training.databricks.com/images/eLearning/DataFrames-MSFT/create-table-1-databricks-db.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; width: auto; height: auto; max-height: 383px"/>

<br>
1. Select **Upload File**
2. Click **Browse** and select the **state-income.csv** file from your machine, or drag and drop the file to start the upload

<img src="https://files.training.databricks.com/images/eLearning/DataFrames-MSFT/create-table-2.png" style="border: 1px solid #aaa; border-radius: 5px 5px 5px 5px; width: auto; height: auto; max-height: 300px  "/>

Once Databricks finishes processing the file, you'll see another table preview.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Databricks tries to choose a table name that doesn't clash with tables created by other users. However, a name clash is still possible. If the table already exists, you'll see an error like the following:

<img src="https://files.training.databricks.com/images/eLearning/create-table-7.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; margin-top: 20px; padding: 10px"/>

If that happens, type in a different table name, and try again.

Access the file via the path `/FileStore/tables/state_income-9f7c5.csv`

In [22]:
stateIncomeDF = (spark
  .read                                                # Get a DataFrameReader
  .option("inferSchema","true")                        # Tell Spark to infer the schema
  .option("header","true")                             # Tell Spark that the file has a header
  .csv("/FileStore/tables/state_income-9f7c5.csv"))    # Load the CSV file and return a DataFrame

View the first 10 lines of its contents.

In [24]:
stateIncomeDF.show(10)

### How to Mount an Azure Blob to DBFS

Microsoft Azure provides cloud file storage in the form of the Blob Store.  Files are stored in "blobs."
If you have an Azure account, create a blob, store data files in that blob, and mount the blob as a DBFS directory. 

Once the blob is mounted as a DBFS directory, access it without exposing your Azure Blob Store keys.

Take a look at the blobs already mounted to your DBFS:

In [27]:
%fs mounts

Mount a Databricks Azure blob (using read-only access and a secret key pair), access one of the files in the blob through a DBFS path, then unmount the blob.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> The mount point **must** start with `/mnt/`.


### Creating a Shared Access Signature (SAS) URL
Azure provides you with a secure way to create and share access keys for your Azure Blob Store without compromising your account keys.

More details are provided <a href="http://docs.microsoft.com/en-us/azure/storage/common/storage-dotnet-shared-access-signature-part-1" target="_blank"> in this document</a>.

This allows access to your Azure Blob Store data directly from the Databricks File System (DBFS).

As shown in the screenshot below, in the Azure Portal, go to the storage account containing the blob to be mounted. Then:

1. Select Shared access signature from the menu.
2. Click the Generate SAS button.
3. Copy the entire Blob service SAS URL to the clipboard.
4. Use the URL in the mount operation, as shown below.

<img src="https://files.training.databricks.com/images/eLearning/DataFrames-MSFT/create-sas-keys.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; margin-top: 20px; padding: 10px"/>

Create the mount point with `dbutils.fs.mount(source = .., mount_point = .., extra_configs = ..)`.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> If the directory is already mounted, you receive the following error:

> Directory already mounted: /mnt/temp-training

In this case, use a different mount point such as `/mnt/temp-training-2`, and make sure you update every reference to it in the cells below.
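
One way to avoid the error is to check the existing mounts before mounting. The sketch below is a hypothetical helper, not part of the course material. It assumes that `dbutils` is the utility object Databricks notebooks provide and that each entry returned by `dbutils.fs.mounts()` carries a `mountPoint` attribute.

```python
def is_mounted(dbutils, mount_point):
    """Return True if `mount_point` is already mounted in DBFS.

    Hypothetical helper: `dbutils` is assumed to be the Databricks
    notebook utility object.
    """
    return any(m.mountPoint == mount_point for m in dbutils.fs.mounts())

# Usage in a notebook cell:
# if is_mounted(dbutils, "/mnt/temp-training"):
#     print("Already mounted; choose another mount point or unmount first.")
```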

In [31]:
SasURL = "https://dbtraineastus2.blob.core.windows.net/?sv=2017-07-29&ss=b&srt=sco&sp=rl&se=2023-04-19T06:32:30Z&st=2018-04-18T22:32:30Z&spr=https&sig=BB%2FQzc0XHAH%2FarDQhKcpu49feb7llv3ZjnfViuI9IWo%3D"
# The SAS token is everything from the "?" to the end of the URL
indQuestionMark = SasURL.index('?')
SasKey = SasURL[indQuestionMark:]
StorageAccount = "dbtraineastus2"
ContainerName = "training"
MountPoint = "/mnt/temp-training"

dbutils.fs.mount(
  source = "wasbs://%s@%s.blob.core.windows.net/" % (ContainerName, StorageAccount),
  mount_point = MountPoint,
  extra_configs = {"fs.azure.sas.%s.%s.blob.core.windows.net" % (ContainerName, StorageAccount) : "%s" % SasKey}
)
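
The string handling in the cell above can also be done with Python's standard URL parser. The following is a sketch under stated assumptions: the helper name `sas_mount_args` is hypothetical, the URL is assumed to have the form `https://<account>.blob.core.windows.net/?<token>`, and the URL in the example is a placeholder, not a working signature.

```python
from urllib.parse import urlsplit

def sas_mount_args(sas_url, container_name):
    """Derive the `source` and `extra_configs` mount arguments from a SAS URL.

    Hypothetical helper; assumes a Blob service SAS URL of the form
    https://<account>.blob.core.windows.net/?<token>.
    """
    parts = urlsplit(sas_url)
    host = parts.hostname            # <account>.blob.core.windows.net
    sas_token = "?" + parts.query    # keep the leading "?", as the cell above does
    return {
        "source": "wasbs://%s@%s/" % (container_name, host),
        "extra_configs": {"fs.azure.sas.%s.%s" % (container_name, host): sas_token},
    }

# Placeholder URL for illustration only.
args = sas_mount_args(
    "https://myaccount.blob.core.windows.net/?sv=2017-07-29&sp=rl&sig=PLACEHOLDER",
    "training",
)
print(args["source"])
```

In a notebook, the result could then be passed on as `dbutils.fs.mount(mount_point = "/mnt/temp-training", **args)`.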

In [32]:
%fs mounts

List the contents of the directory you just mounted:

In [34]:
%fs ls /mnt/temp-training

Take a peek at the head of the file `auto-mpg.csv`:

In [36]:
%fs head /mnt/temp-training/auto-mpg.csv

Now that you are done, unmount the directory.

In [38]:
%fs unmount /mnt/temp-training

## Summary

Databricks allows you to:
  * Create DataFrames from existing data
  * Create DataFrames from uploaded files
  * Mount your own Azure blobs

## Review Questions
**Q:** What is Azure Blob Store?  
**A:** Azure Blob Store is Microsoft's cloud storage for unstructured data. It holds anywhere from hundreds to billions of objects (images, videos, audio, documents) easily and cost-effectively.

**Q:** What is DBFS?  
**A:** DBFS stands for Databricks File System. DBFS provides for the cloud what the Hadoop Distributed File System (HDFS) provides for local Spark deployments. DBFS is backed by Azure Blob Store and makes it easy to access files by name.

**Q:** Which is more efficient to query, a parquet file or a CSV file?  
**A:** A Parquet file. Parquet is a highly optimized binary format for storing tables, so reading it involves less overhead than parsing a CSV file. Parquet is the big data analogue to CSV: it is optimized, distributed across multiple files, and more fault tolerant than a CSV file.

**Q:** What is the syntax for defining a DataFrame in Spark from an existing parquet file in DBFS?  
**A:** Scala: 

`val IPGeocodeDF = spark.read.parquet("dbfs:/mnt/training/ip-geocode.parquet")`

Python: 

`IPGeocodeDF = spark.read.parquet("dbfs:/mnt/training/ip-geocode.parquet")`

**Q:** What is the syntax for defining a DataFrame in Spark from an existing CSV file in DBFS, using the first row as the header and letting Spark infer the schema?  
**A:** Scala: 

`val myDF = spark.read.option("header","true").option("inferSchema","true").csv("dbfs:/mnt/training/myfile.csv")`

Python: 

`myDF = spark.read.option("header","true").option("inferSchema","true").csv("dbfs:/mnt/training/myfile.csv")`

## Next Steps

Start the next lesson, [Querying JSON & Hierarchical Data with DataFrames]($./05-Querying-JSON).

## Additional Topics & Resources

* <a href="https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html" target="_blank">The Databricks DBFS File System</a>