
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

<img src="https://files.training.databricks.com/images/Apache-Spark-Logo_TM_200px.png" style="float: left; margin: 20px"/>

# Connecting to Azure Blob Storage

Apache Spark&trade; and Databricks&reg; allow you to connect to virtually any data store, including Azure Blob Storage.

## In this lesson you:
* Mount and access data in Azure Blob Storage
* Define options when reading from Azure Blob Storage

## Audience
* Primary Audience: Data Engineers
* Additional Audiences: Data Scientists and Data Pipeline Engineers

## Prerequisites
* Web browser: Chrome

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup & Classroom-Cleanup<br>

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [4]:
%run "./Includes/Classroom-Setup"

<iframe  
src="//fast.wistia.net/embed/iframe/8qe9xs3k7u?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/8qe9xs3k7u?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

### Spark as a Connector

Spark quickly rose to popularity as a replacement for the [Apache Hadoop&trade;](http://hadoop.apache.org/) MapReduce paradigm, in large part because it easily connected to a number of different data sources.  Most important among these data sources was the Hadoop Distributed File System (HDFS).  Now, Spark engineers connect to a wide variety of data sources including:  
<br>
* Traditional databases like Postgres, SQL Server, and MySQL
* Message brokers like <a href="https://kafka.apache.org/" target="_blank">Apache Kafka</a> and <a href="https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-about">Azure Event Hubs</a>
* Distributed databases like Cassandra and Cosmos DB
* Data warehouses like Hive and Redshift
* File types like CSV, Parquet, and Avro

<img src="https://files.training.databricks.com/images/eLearning/ETL-Part-1/open-source-ecosystem_2.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/>

### DBFS Mounts and Azure Blob Storage

Azure Blob Storage is the backbone of Databricks workflows.  Azure Blob Storage offers data storage that easily scales to the demands of most data applications and, by colocating data with Spark clusters, Databricks quickly reads from and writes to Azure Blob Storage in a distributed manner.

The Databricks File System (DBFS) is a layer over Azure Blob Storage that allows you to mount Blob containers, making them available to other users in your workspace and persisting the data after a cluster is shut down.

In our road map for ETL, this is the <b>Extract and Validate </b> step:

<img src="https://files.training.databricks.com/images/eLearning/ETL-Part-1/ETL-Process-1.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/>

<iframe  
src="//fast.wistia.net/embed/iframe/sls8z8pw8n?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/sls8z8pw8n?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>


Define your Azure Blob credentials.  You need the following elements:<br><br>

* Storage account name
* Container name
* Mount point (how the mount will appear in DBFS)
* Shared Access Signature (SAS) key

These elements are defined below, including a read-only SAS key.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> For more information on SAS keys, <a href="https://docs.microsoft.com/en-us/azure/storage/common/storage-dotnet-shared-access-signature-part-1" target="_blank"> see the Azure documentation.</a><br>
<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> SAS keys are normally provided as a SAS URI. Of this URI, focus on everything from the `?` on, including the `?`. The following cell provides an example of this.
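
As a minimal sketch of the `?` rule described above, the helper below keeps only the query string of a SAS URI, including the leading `?`. The function name `sas_token_from_uri` and the sample URI are hypothetical, made up for illustration; they are not part of the course materials or of any Azure library.

```python
def sas_token_from_uri(sas_uri: str) -> str:
    """Return the part of a SAS URI from the first '?' onward, '?' included."""
    _, separator, query = sas_uri.partition("?")
    if not separator:
        # No '?' means the URI carries no SAS token at all
        raise ValueError("no query string found in SAS URI")
    return separator + query


# Hypothetical URI with placeholder values, for illustration only
example_uri = "https://myaccount.blob.core.windows.net/mycontainer?sv=2017-07-29&sp=rl&sig=PLACEHOLDER"
print(sas_token_from_uri(example_uri))
```

The returned string has the same shape as the `sasKey` value defined in the next cell.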

In [10]:
storageAccount = "dbtraineastus2"   # a real training storage account, not a placeholder
container = "training"
sasKey = "?sv=2017-07-29&ss=b&srt=sco&sp=rl&se=2023-04-19T06:32:30Z&st=2018-04-18T22:32:30Z&spr=https&sig=BB%2FQzc0XHAH%2FarDQhKcpu49feb7llv3ZjnfViuI9IWo%3D"

In addition to the sourcing information above, we need to define a target location.

So that no two students produce the exact same mount, we are going to be a little more creative with this one.

In [12]:

mountPoint = f"/mnt/etlp1a-{username}-si"

# moduleName: etl-part-1
# lessonName: ETL1 03 - Connecting to Azure Blob Storage
# username: tbresee@mail.smu.edu
# userhome: dbfs:/user/tbresee@mail.smu.edu
# workingDir: dbfs:/user/tbresee@mail.smu.edu/etl_part_1/etl1_03___connecting_to_azure_blob_storage_psp
    

In case you mounted this container earlier, you might need to unmount it.

In [14]:
try:
  dbutils.fs.unmount(mountPoint) # Use this to unmount as needed
except Exception:
  # unmount raises an error when nothing is mounted at this path
  print("{} already unmounted".format(mountPoint))

In [15]:
#  very handy way of doing things ! ! ! ! 

Define two strings populated with the storage account and container information.  These will be passed to the `mount` function.

In [17]:
sourceString = f"wasbs://{container}@{storageAccount}.blob.core.windows.net/"
confKey = f"fs.azure.sas.{container}.{storageAccount}.blob.core.windows.net"
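
The two strings above follow a fixed naming scheme, so they can be derived in one place. The sketch below is only an illustration: `build_mount_args` is a hypothetical helper, and the argument values in the usage line are placeholders rather than real credentials.

```python
def build_mount_args(storage_account: str, container: str,
                     mount_point: str, sas_key: str) -> dict:
    """Assemble the keyword arguments used when mounting a Blob container."""
    host = f"{storage_account}.blob.core.windows.net"
    return {
        # wasbs:// is the secure (TLS) scheme for Azure Blob Storage
        "source": f"wasbs://{container}@{host}/",
        "mount_point": mount_point,
        # The configuration key names the container and account the SAS key applies to
        "extra_configs": {f"fs.azure.sas.{container}.{host}": sas_key},
    }


# Placeholder values, for illustration only
args = build_mount_args("myaccount", "mycontainer", "/mnt/example", "?sv=PLACEHOLDER")
print(args["source"])
```

In a notebook, the resulting dictionary could be unpacked into the mount call as `dbutils.fs.mount(**args)`.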


Now, mount the container <a href="https://docs.databricks.com/spark/latest/data-sources/azure/azure-storage.html#mount-azure-blob-storage-containers-with-dbfs" target="_blank"> using the template provided in the docs.</a>

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> The code below includes error handling for the case where the mount point is already in use.

In [19]:
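try:
  dbutils.fs.mount(
    source = sourceString,
    mount_point = mountPoint,
    extra_configs = {confKey: sasKey}
  )
except Exception as e:
  # Most often the mount already exists, but print the underlying error as well
  print(f"ERROR: could not mount {mountPoint}; it may already be mounted. Run previous cells to unmount first")
  print(e)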
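
An alternative to catching the exception is to check the existing mounts first. The sketch below assumes an object with the same `mounts()` and `mount()` methods as `dbutils.fs`; the helper name `mount_if_absent` is hypothetical. Passing the file system object in as a parameter keeps the function usable outside a notebook.

```python
def mount_if_absent(fs, source: str, mount_point: str, extra_configs: dict) -> bool:
    """Mount `source` at `mount_point` unless something is already mounted there.

    `fs` is expected to behave like `dbutils.fs`: `fs.mounts()` returns entries
    with a `mountPoint` attribute, and `fs.mount(...)` creates a mount.
    Returns True if a new mount was created, False if it already existed.
    """
    if any(m.mountPoint == mount_point for m in fs.mounts()):
        return False
    fs.mount(source=source, mount_point=mount_point, extra_configs=extra_configs)
    return True
```

In a notebook this might be called as `mount_if_absent(dbutils.fs, sourceString, mountPoint, {confKey: sasKey})`.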
  

Next, explore the mount using `%fs ls` and the name of the mount.

In [21]:
%fs ls /mnt/training

path,name,size
dbfs:/mnt/training/301/,301/,0
dbfs:/mnt/training/Chicago-Crimes-2018.csv,Chicago-Crimes-2018.csv,5201668
dbfs:/mnt/training/City-Data.delta/,City-Data.delta/,0
dbfs:/mnt/training/City-Data.parquet/,City-Data.parquet/,0
dbfs:/mnt/training/EDGAR-Log-20170329/,EDGAR-Log-20170329/,0
dbfs:/mnt/training/StatLib/,StatLib/,0
dbfs:/mnt/training/UbiqLog4UCI/,UbiqLog4UCI/,0
dbfs:/mnt/training/_META/,_META/,0
dbfs:/mnt/training/adventure-works/,adventure-works/,0
dbfs:/mnt/training/airbnb/,airbnb/,0



In practice, always secure your credentials.  Do this either by maintaining a single notebook with restricted permissions that holds SAS keys, or by deleting the cells or notebooks that expose the keys. **After a cell used to mount a container is run, you can access the mount from any notebook on any cluster and share it with colleagues.**

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> See <a href="https://docs.azuredatabricks.net/user-guide/secrets/index.html" target="_blank">secret management</a> to securely store and reference your credentials in notebooks and jobs.
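
As a rough sketch of the secret-management approach, the helper below looks up the SAS key at run time instead of pasting it into a cell. It assumes an object that behaves like `dbutils.secrets`, whose `get(scope=..., key=...)` call returns the stored string; the helper name and the scope and key names in the usage note are hypothetical.

```python
def sas_config_from_secret(secrets, scope: str, key: str,
                           container: str, storage_account: str) -> dict:
    """Build the `extra_configs` mapping with a SAS key read from a secret store."""
    sas_key = secrets.get(scope=scope, key=key)
    conf_key = f"fs.azure.sas.{container}.{storage_account}.blob.core.windows.net"
    return {conf_key: sas_key}
```

In a notebook this might be called as `sas_config_from_secret(dbutils.secrets, "my-scope", "my-sas-key", container, storageAccount)`, where the scope and key would have been created beforehand.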

In [23]:
# #  mountPoint = f"/mnt/etlp1a-{username}-si"

In [24]:

%fs ls /mnt




path,name,size
dbfs:/mnt/delta/,delta/,0
dbfs:/mnt/etlp1a-tbresee@mail.smu.edu-si/,etlp1a-tbresee@mail.smu.edu-si/,0
dbfs:/mnt/my-data/,my-data/,0
dbfs:/mnt/sample/,sample/,0
dbfs:/mnt/training/,training/,0


In [25]:

%fs ls dbfs:/mnt/etlp1a-tbresee@mail.smu.edu-si/


path,name,size
dbfs:/mnt/etlp1a-tbresee@mail.smu.edu-si/301/,301/,0
dbfs:/mnt/etlp1a-tbresee@mail.smu.edu-si/Chicago-Crimes-2018.csv,Chicago-Crimes-2018.csv,5201668
dbfs:/mnt/etlp1a-tbresee@mail.smu.edu-si/City-Data.delta/,City-Data.delta/,0
dbfs:/mnt/etlp1a-tbresee@mail.smu.edu-si/City-Data.parquet/,City-Data.parquet/,0
dbfs:/mnt/etlp1a-tbresee@mail.smu.edu-si/EDGAR-Log-20170329/,EDGAR-Log-20170329/,0
dbfs:/mnt/etlp1a-tbresee@mail.smu.edu-si/StatLib/,StatLib/,0
dbfs:/mnt/etlp1a-tbresee@mail.smu.edu-si/UbiqLog4UCI/,UbiqLog4UCI/,0
dbfs:/mnt/etlp1a-tbresee@mail.smu.edu-si/_META/,_META/,0
dbfs:/mnt/etlp1a-tbresee@mail.smu.edu-si/adventure-works/,adventure-works/,0
dbfs:/mnt/etlp1a-tbresee@mail.smu.edu-si/airbnb/,airbnb/,0


## Adding Options

When you import data into a cluster, you can add options based on the specific characteristics of the data.

<iframe  
src="//fast.wistia.net/embed/iframe/6pckay2lii?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/6pckay2lii?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

Display the first few lines of `Chicago-Crimes-2018.csv` using `%fs head`.

In [31]:
%fs head /mnt/training/Chicago-Crimes-2018.csv

**`option`** is a method of **`DataFrameReader`**. 

Options are key/value pairs and must be specified before calling **`.csv()`**.

This is a tab-delimited file, as seen in the previous cell. Specify the **`"delimiter"`** option in the import statement.  

**Note:** Find a [full list of parameters here.](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dateformat#pyspark.sql.DataFrameReader.csv)
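
To see why the delimiter matters, the following sketch uses Python's standard `csv` module rather than Spark. It parses two tab-separated lines, abbreviated from the crime data, once with the default comma delimiter and once with a tab. The `parse` helper is hypothetical and only for illustration.

```python
import csv
import io

# Two tab-separated lines, abbreviated from the crime data shown in this lesson
sample = "ID\tCase Number\tDate\n23811\tJB141441\t02/05/2018 01:10:00 AM\n"


def parse(text: str, delimiter: str = ",") -> list:
    """Parse delimited text into a list of rows (each row is a list of fields)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


rows_default = parse(sample)               # comma delimiter: each line stays one field
rows_tab = parse(sample, delimiter="\t")   # tab delimiter: each line splits into three fields

print(len(rows_default[0]), len(rows_tab[0]))
```

With the wrong delimiter every line collapses into a single field, which is the same effect as a DataFrame with one `_c0` column.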

In [33]:
display(spark.read
  .option("delimiter", "\t")
  .csv("/mnt/training/Chicago-Crimes-2018.csv")
)

#  very cool:  https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dateformat#pyspark.sql.DataFrameReader.csv
  
  

_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,_c10,_c11,_c12,_c13,_c14,_c15,_c16,_c17,_c18,_c19,_c20,_c21
ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
23811,JB141441,02/05/2018 01:10:00 AM,118XX S INDIANA AVE,0110,HOMICIDE,FIRST DEGREE MURDER,VACANT LOT,false,false,0532,005,9,53,01A,1179707,1826280,2018,02/12/2018 03:49:14 PM,41.678585145,-87.617837834,"(41.678585145, -87.617837834)"
11228589,JB148990,01/23/2018 09:00:00 AM,072XX S VERNON AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,OTHER,false,false,0323,003,6,69,11,,,2018,02/12/2018 03:49:14 PM,,,
11228563,JB148931,01/31/2018 10:12:00 AM,040XX N KEYSTONE AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,APARTMENT,false,false,1722,017,39,16,11,,,2018,02/12/2018 03:49:14 PM,,,
11228555,JB148885,02/01/2018 02:00:00 PM,017XX W CONGRESS PKWY,0820,THEFT,$500 AND UNDER,HOSPITAL BUILDING/GROUNDS,false,false,1231,012,2,28,06,,,2018,02/12/2018 03:49:14 PM,,,
11228430,JB148675,01/27/2018 09:00:00 PM,061XX S EBERHART AVE,0560,ASSAULT,SIMPLE,RESIDENCE,false,true,0313,003,20,42,08A,,,2018,02/12/2018 03:49:14 PM,,,
11228401,JB148683,02/02/2018 12:00:00 PM,038XX N SAWYER AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,false,false,1733,017,33,16,11,,,2018,02/12/2018 03:49:14 PM,,,
11228347,JB148599,01/28/2018 07:00:00 PM,008XX E 45TH ST,0620,BURGLARY,UNLAWFUL ENTRY,RESIDENCE,false,false,0221,002,4,39,05,,,2018,02/12/2018 03:49:14 PM,,,
11228291,JB148591,01/10/2018 04:45:00 PM,010XX E 53RD ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,false,false,0233,002,4,41,11,,,2018,02/12/2018 03:49:14 PM,,,
11228287,JB148482,01/03/2018 03:45:00 PM,0000X W C1 ST,0810,THEFT,OVER $500,AIRPORT TERMINAL LOWER LEVEL - NON-SECURE AREA,false,false,1651,016,41,76,06,,,2018,02/12/2018 03:49:14 PM,,,


Spark doesn't read the header by default, as demonstrated by the column names of `_c0`, `_c1`, etc. Notice the column names are present in the first row of the DataFrame. 

Fix this by setting the `"header"` option to `True`.

In [35]:
display(spark.read
  .option("delimiter", "\t")
  .option("header", True)
  .csv("/mnt/training/Chicago-Crimes-2018.csv")
)

ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
23811,JB141441,02/05/2018 01:10:00 AM,118XX S INDIANA AVE,0110,HOMICIDE,FIRST DEGREE MURDER,VACANT LOT,False,False,532,5,9,53,01A,1179707.0,1826280.0,2018,02/12/2018 03:49:14 PM,41.678585145,-87.617837834,"(41.678585145, -87.617837834)"
11228589,JB148990,01/23/2018 09:00:00 AM,072XX S VERNON AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,OTHER,False,False,323,3,6,69,11,,,2018,02/12/2018 03:49:14 PM,,,
11228563,JB148931,01/31/2018 10:12:00 AM,040XX N KEYSTONE AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,APARTMENT,False,False,1722,17,39,16,11,,,2018,02/12/2018 03:49:14 PM,,,
11228555,JB148885,02/01/2018 02:00:00 PM,017XX W CONGRESS PKWY,0820,THEFT,$500 AND UNDER,HOSPITAL BUILDING/GROUNDS,False,False,1231,12,2,28,06,,,2018,02/12/2018 03:49:14 PM,,,
11228430,JB148675,01/27/2018 09:00:00 PM,061XX S EBERHART AVE,0560,ASSAULT,SIMPLE,RESIDENCE,False,True,313,3,20,42,08A,,,2018,02/12/2018 03:49:14 PM,,,
11228401,JB148683,02/02/2018 12:00:00 PM,038XX N SAWYER AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,1733,17,33,16,11,,,2018,02/12/2018 03:49:14 PM,,,
11228347,JB148599,01/28/2018 07:00:00 PM,008XX E 45TH ST,0620,BURGLARY,UNLAWFUL ENTRY,RESIDENCE,False,False,221,2,4,39,05,,,2018,02/12/2018 03:49:14 PM,,,
11228291,JB148591,01/10/2018 04:45:00 PM,010XX E 53RD ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,False,False,233,2,4,41,11,,,2018,02/12/2018 03:49:14 PM,,,
11228287,JB148482,01/03/2018 03:45:00 PM,0000X W C1 ST,0810,THEFT,OVER $500,AIRPORT TERMINAL LOWER LEVEL - NON-SECURE AREA,False,False,1651,16,41,76,06,,,2018,02/12/2018 03:49:14 PM,,,
11228268,JB148558,02/04/2018 04:00:00 PM,044XX S MICHIGAN AVE,2825,OTHER OFFENSE,HARASSMENT BY TELEPHONE,APARTMENT,False,True,215,2,3,38,26,,,2018,02/12/2018 03:49:14 PM,,,


Spark didn't infer the schema, because schema inference is off by default, and it would not recognize the timestamps anyway, since this file uses an atypical timestamp format.  Change that by adding the option `"timestampFormat"` and passing it the format used in this file.  

Set `"inferSchema"` to `True`, which triggers Spark to make an extra pass over the data to infer the schema.
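
Pattern letters in a timestamp format are case-sensitive: in Spark's patterns, uppercase `MM` is the month and lowercase `mm` is the minute. As a sanity check outside Spark, the sketch below parses one of this file's timestamps with Python's `datetime.strptime`, using the equivalent format codes. The helper name `parse_crime_timestamp` is hypothetical.

```python
from datetime import datetime

# strptime equivalent of the Spark pattern "MM/dd/yyyy hh:mm:ss a":
#   %m = month, %d = day, %Y = four-digit year,
#   %I = hour on a 12-hour clock, %M = minute, %S = second, %p = AM/PM
PY_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse_crime_timestamp(text: str) -> datetime:
    """Parse a timestamp such as '02/12/2018 03:49:14 PM'."""
    return datetime.strptime(text, PY_FORMAT)


parsed = parse_crime_timestamp("02/12/2018 03:49:14 PM")
print(parsed.isoformat())
```

A parsed month that does not match the month in the raw text is a sign that the month and minute letters were swapped.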

In [37]:
crimeDF = (spark.read
  .option("delimiter", "\t")
  .option("header", True)
  .option("timestampFormat", "MM/dd/yyyy hh:mm:ss a")  # MM = month; lowercase mm means minutes
  .option("inferSchema", True)
  .csv("/mnt/training/Chicago-Crimes-2018.csv")
)
display(crimeDF)

ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
23811,JB141441,2018-02-05T01:10:00.000+0000,118XX S INDIANA AVE,0110,HOMICIDE,FIRST DEGREE MURDER,VACANT LOT,False,False,532,5,9,53,01A,1179707.0,1826280.0,2018,2018-02-12T15:49:14.000+0000,41.678585145,-87.617837834,"(41.678585145, -87.617837834)"
11228589,JB148990,2018-01-23T09:00:00.000+0000,072XX S VERNON AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,OTHER,False,False,323,3,6,69,11,,,2018,2018-02-12T15:49:14.000+0000,,,
11228563,JB148931,2018-01-31T10:12:00.000+0000,040XX N KEYSTONE AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,APARTMENT,False,False,1722,17,39,16,11,,,2018,2018-02-12T15:49:14.000+0000,,,
11228555,JB148885,2018-02-01T14:00:00.000+0000,017XX W CONGRESS PKWY,0820,THEFT,$500 AND UNDER,HOSPITAL BUILDING/GROUNDS,False,False,1231,12,2,28,06,,,2018,2018-02-12T15:49:14.000+0000,,,
11228430,JB148675,2018-01-27T21:00:00.000+0000,061XX S EBERHART AVE,0560,ASSAULT,SIMPLE,RESIDENCE,False,True,313,3,20,42,08A,,,2018,2018-02-12T15:49:14.000+0000,,,
11228401,JB148683,2018-02-02T12:00:00.000+0000,038XX N SAWYER AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,1733,17,33,16,11,,,2018,2018-02-12T15:49:14.000+0000,,,
11228347,JB148599,2018-01-28T19:00:00.000+0000,008XX E 45TH ST,0620,BURGLARY,UNLAWFUL ENTRY,RESIDENCE,False,False,221,2,4,39,05,,,2018,2018-02-12T15:49:14.000+0000,,,
11228291,JB148591,2018-01-10T16:45:00.000+0000,010XX E 53RD ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,False,False,233,2,4,41,11,,,2018,2018-02-12T15:49:14.000+0000,,,
11228287,JB148482,2018-01-03T15:45:00.000+0000,0000X W C1 ST,0810,THEFT,OVER $500,AIRPORT TERMINAL LOWER LEVEL - NON-SECURE AREA,False,False,1651,16,41,76,06,,,2018,2018-02-12T15:49:14.000+0000,,,
11228268,JB148558,2018-02-04T16:00:00.000+0000,044XX S MICHIGAN AVE,2825,OTHER OFFENSE,HARASSMENT BY TELEPHONE,APARTMENT,False,True,215,2,3,38,26,,,2018,2018-02-12T15:49:14.000+0000,,,


## The Design Pattern

Other connections work in much the same way, whether your data sits in Cassandra, Cosmos DB, Redshift, or another common data store.  The general pattern is always:  
<br>
1. Define the connection point
2. Define connection parameters such as access credentials
3. Add necessary options

With these in place, read data using `spark.read.option(<option key>, <option value>).<connection_type>(<endpoint>)`.
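
One way to express this pattern in code is a small helper that applies a dictionary of options and then loads from the endpoint. The sketch below assumes a reader that offers `option`, `format`, and `load`, as Spark's `DataFrameReader` does; the name `read_with_options` is hypothetical.

```python
def read_with_options(reader, connection_type: str, endpoint: str, options: dict):
    """Apply each key/value option to the reader, then load from the endpoint.

    `reader` is expected to behave like `spark.read`: `option(key, value)` and
    `format(name)` return a reader, and `load(path)` returns the loaded data.
    """
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader.format(connection_type).load(endpoint)
```

In a notebook this might be called as `read_with_options(spark.read, "csv", "/mnt/training/Chicago-Crimes-2018.csv", {"delimiter": "\t", "header": True})`.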

## Exercise 1: Read Wikipedia Data

Read Wikipedia data from Azure Blob Storage, accounting for its delimiter and header.

### Step 1: Get a Sense for the Data

Take a look at the head of the data, located at `/mnt/training/wikipedia/pageviews/pageviews_by_second.tsv`.

In [41]:

%fs ls /mnt/training/wikipedia/pageviews/pageviews_by_second.tsv


path,name,size
dbfs:/mnt/training/wikipedia/pageviews/pageviews_by_second.tsv,pageviews_by_second.tsv,262099389


In [42]:

%fs head /mnt/training/wikipedia/pageviews/pageviews_by_second.tsv



In [43]:

# "timestamp"	"site"	"requests"
# "2015-03-16T00:09:55"	"mobile"	1595
# "2015-03-16T00:10:39"	"mobile"	1544
# "2015-03-16T00:19:39"	"desktop"	2460
# "2015-03-16T00:38:11"	"desktop"	2237
# "2015-03-16T00:42:40"	"mobile"	1656
# "2015-03-16T00:52:24"	"desktop"	2452
# "2015-03-16T00:54:16"	"mobile"	1654
# "2015-03-16T01:18:11"	"mobile"	1720
# "2015-03-16T01:30:32"	"desktop"	2288
# "2015-03-16T01:32:24"	"mobile"	1609
# "2015-03-16T01:42:08"	"desktop"	2341
# "2015-03-16T01:45:53"	"mobile"	1704
# "2015-03-16T01:55:37"	"desktop"	2554



### Step 2: Import the Raw Data

Import the data **without any options** and save it to `wikiDF`. Display the result.

In [45]:
# TODO: replace FILL_IN with your own read; a completed solution follows
# wikiDF = FILL_IN

wikiDF = (spark.read
  .csv("/mnt/training/wikipedia/pageviews/pageviews_by_second.tsv")
)

display(wikiDF)


_c0
"""timestamp""	""site""	""requests"""
"""2015-03-16T00:09:55""	""mobile""	1595"
"""2015-03-16T00:10:39""	""mobile""	1544"
"""2015-03-16T00:19:39""	""desktop""	2460"
"""2015-03-16T00:38:11""	""desktop""	2237"
"""2015-03-16T00:42:40""	""mobile""	1656"
"""2015-03-16T00:52:24""	""desktop""	2452"
"""2015-03-16T00:54:16""	""mobile""	1654"
"""2015-03-16T01:18:11""	""mobile""	1720"
"""2015-03-16T01:30:32""	""desktop""	2288"


In [46]:
# TEST - Run this cell to test your solution

dbTest("ET1-P-03-01-01", 7200001, wikiDF.count())
dbTest("ET1-P-03-01-02", '_c0', wikiDF.columns[0])

print("Tests passed!")

### Step 3: Import the Data with Options

Import the data with options and save it to `wikiWithOptionsDF`.  Display the result.  Your import statement should account for:<br><br>  

 - The header
 - The delimiter

In [48]:
# TODO


wikiWithOptionsDF = (spark.read
  .option("delimiter", "\t")
  .option("header", True)
  .csv("/mnt/training/wikipedia/pageviews/pageviews_by_second.tsv")
)

display(wikiWithOptionsDF)



timestamp,site,requests
2015-03-16T00:09:55,mobile,1595
2015-03-16T00:10:39,mobile,1544
2015-03-16T00:19:39,desktop,2460
2015-03-16T00:38:11,desktop,2237
2015-03-16T00:42:40,mobile,1656
2015-03-16T00:52:24,desktop,2452
2015-03-16T00:54:16,mobile,1654
2015-03-16T01:18:11,mobile,1720
2015-03-16T01:30:32,desktop,2288
2015-03-16T01:32:24,mobile,1609


In [49]:
# TEST - Run this cell to test your solution
cols = wikiWithOptionsDF.columns

dbTest("ET1-P-03-02-01", 7200000, wikiWithOptionsDF.count())

dbTest("ET1-P-03-02-02", True, "timestamp" in cols)
dbTest("ET1-P-03-02-03", True, "site" in cols)
dbTest("ET1-P-03-02-04", True, "requests" in cols)

print("Tests passed!")

## Review

**Question:** What accounts for Spark's quick rise in popularity as an ETL tool?  
**Answer:** Spark easily accesses data virtually anywhere it lives, and the scalable framework lowers the difficulties in building connectors to access data.  Spark offers a unified API for connecting to data, which makes reads from a CSV file, JSON data, or a database, to provide a few examples, nearly identical.  This allows developers to focus on writing their code rather than writing connectors.

**Question:** What is DBFS and why is it important?  
**Answer:** The Databricks File System (DBFS) allows access to scalable, fast, and distributed storage backed by S3 or Azure Blob Storage.

**Question:** How do you connect your Spark cluster to Azure Blob Storage?  
**Answer:** By mounting it. Mounts require Azure credentials such as SAS keys and give access to a virtually infinite store for your data. To protect those keys, one option is to define them in a single notebook that only you have permission to access. Click the arrow next to a notebook in the Workspace tab to define access permissions.

**Question:** How do you specify parameters when reading data?  
**Answer:** Using `.option()` during your read allows you to pass key/value pairs specifying aspects of your read.  For instance, options for reading CSV data include `header`, `delimiter`, and `inferSchema`.

**Question:** What is the general design pattern for connecting to your data?  
**Answer:** The general design pattern is as follows:
1. Define the connection point
2. Define connection parameters such as access credentials
3. Add necessary options, such as those for headers or parallelization

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [52]:
%run "./Includes/Classroom-Cleanup"

## Next Steps

Start the next lesson, [Connecting to JDBC]($./ETL1 04 - Connecting to JDBC ).

## Additional Topics & Resources

**Q:** Where can I find more information on DBFS?  
**A:** <a href="https://docs.azuredatabricks.net/user-guide/dbfs-databricks-file-system.html#dbfs" target="_blank">Take a look at the Databricks documentation for more details</a>.

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>