
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# 3.3 Accessing Data

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this notebook you:<br>

* Read data from a BLOB store
* Read data in serial from JDBC
* Read data in parallel from JDBC

In [0]:
%run ../Includes/Classroom-Setup

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) DBFS Mounts and S3

Amazon Simple Storage Service (S3) is the backbone of Databricks workflows.  S3 offers data storage that easily scales to the demands of most data applications and, by colocating data with Spark clusters, Databricks quickly reads from and writes to S3 in a distributed manner.

The Databricks File System, or DBFS, is a layer over S3 that allows you to [mount S3 buckets](https://docs.databricks.com/spark/latest/data-sources/aws/amazon-s3.html#mount-aws-s3), making them available to other users in your workspace and persisting the data after a cluster is shut down.

In Azure Databricks, DBFS is backed by the Azure Blob Store. More documentation can be found [here](https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-storage.html#mount-azure-blob-storage).

In the first lesson, you uploaded data using the Databricks user interface.  This can be done by clicking Data on the left-hand side of the screen.  Here we'll be mounting data.

Define your AWS credentials.  The cell below defines read-only keys, the name of an AWS bucket, and the mount name we will use in DBFS.

To obtain AWS keys, take a look at <a href="https://docs.aws.amazon.com/general/latest/gr/managing-aws-access-keys.html" target="_blank">the AWS documentation</a>.

In [0]:
%python
ACCESS_KEY = ""
# Encode the Secret Key to remove any "/" characters
SECRET_KEY = "".replace("/", "%2F")
AWS_BUCKET_NAME = "davis-dsv1071/data/"
MOUNT_NAME = "/mnt/davis-tmp"
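
The `replace` call above handles only `/`. As a hedged alternative, Python's standard `urllib.parse.quote` percent-encodes every reserved character in one step. The key below is a made-up placeholder, not a real credential.

```python
from urllib.parse import quote

# Hypothetical secret key, used only to illustrate the encoding
example_secret = "abc/def+ghi"

# safe="" tells quote() to percent-encode "/" as well, which it leaves alone by default
encoded_secret = quote(example_secret, safe="")

print(encoded_secret)
```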


Now mount the bucket [using the template provided in the docs.](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html#mounting-an-s3-bucket)

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> The code below includes error handling for the case where the bucket is already mounted.

In [0]:
%python
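try:
  dbutils.fs.mount("s3a://{}:{}@{}".format(ACCESS_KEY, SECRET_KEY, AWS_BUCKET_NAME), MOUNT_NAME)
except Exception as e:
  # The most likely cause is an existing mount; print the error too in case it is something else
  print("""Could not mount {}. If it is already mounted, run `dbutils.fs.unmount("{}")` first.""".format(MOUNT_NAME, MOUNT_NAME))
  print(e)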
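
Rather than relying on an exception, one option is to check the existing mounts before mounting. The sketch below assumes each entry returned by `dbutils.fs.mounts()` exposes a `mountPoint` attribute; `MountInfo` and `is_mounted` are hypothetical stand-ins defined here so the logic can run outside Databricks.

```python
from collections import namedtuple

# Hypothetical stand-in for the entries returned by dbutils.fs.mounts()
MountInfo = namedtuple("MountInfo", ["mountPoint", "source"])

def is_mounted(mount_name, mounts):
    """Return True if mount_name appears among the given mount entries."""
    # Ignore trailing slashes so "/mnt/x/" and "/mnt/x" compare equal
    target = mount_name.rstrip("/")
    return any(m.mountPoint.rstrip("/") == target for m in mounts)

# Sample data mirroring the shape of a mounts listing
sample_mounts = [
    MountInfo("/databricks-datasets", "databricks-datasets"),
    MountInfo("/mnt/davis-tmp", "s3a://davis-dsv1071/data/"),
]

print(is_mounted("/mnt/davis-tmp", sample_mounts))
print(is_mounted("/mnt/other", sample_mounts))
```

In a notebook, the second argument would be `dbutils.fs.mounts()`, and the call to `dbutils.fs.mount` would run only when `is_mounted` returns `False`.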

Next, explore the mount using `%fs ls` and the name of the mount.

In [0]:
%fs ls /mnt/davis-tmp

path,name,size
dbfs:/mnt/davis-tmp/dataframes/,dataframes/,0
dbfs:/mnt/davis-tmp/fire-calls/,fire-calls/,0
dbfs:/mnt/davis-tmp/fire-incidents/,fire-incidents/,0


In practice, always secure your AWS credentials.  Do this either by maintaining a single notebook with restricted permissions that holds the AWS keys, or by deleting the cells or notebooks that expose the keys. Once the cell that mounts a bucket has run, the data at that mount point is accessible from any notebook or cluster in the workspace, so colleagues can share the mount.

In [0]:
%fs mounts

mountPoint,source,encryptionType
/mnt/davis,s3a://davis-dsv1071/data,
/databricks-datasets,databricks-datasets,sse-s3
/databricks/mlflow-tracking,databricks/mlflow-tracking,sse-s3
/databricks-results,databricks-results,sse-s3
/mnt/davis-tmp,s3a://davis-dsv1071/data/,
/databricks/mlflow-registry,databricks/mlflow-registry,sse-s3
/,DatabricksRoot,sse-s3


You now have access to a virtually unlimited store of data.  This particular S3 bucket is read-only.  If you create and mount your own, you can write to it as well.

Now unmount the data.

In [0]:
%python
dbutils.fs.unmount("/mnt/davis-tmp")

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Serial JDBC Reads

Java Database Connectivity (JDBC) is an application programming interface (API) that defines how Java programs connect to databases.  Spark is written in Scala, which runs on the Java Virtual Machine (JVM), so JDBC is the natural way for Spark to connect to relational databases.  Hadoop and Hive run on the JVM, and databases such as MySQL and PostgreSQL provide JDBC drivers, so all of them interface easily with Spark clusters.

Databases are advanced technologies that benefit from decades of research and development. To leverage the inherent efficiencies of database engines, Spark uses an optimization called predicate push down.  **Predicate push down uses the database itself to handle certain parts of a query (the predicates).**  In mathematics and functional programming, a predicate is anything that returns a Boolean.  In SQL terms, this often refers to the `WHERE` clause.  Since the database is filtering data before it arrives on the Spark cluster, there's less data transfer across the network and fewer records for Spark to process.  Spark's Catalyst Optimizer includes predicate push down communicated through the JDBC API, making JDBC an ideal data source for Spark workloads.
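
To make the idea concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for a remote database (an assumption for illustration; the table and rows are made up). It compares how many rows leave the database when the filter runs on the client and when the `WHERE` clause runs inside the database.

```python
import sqlite3

# In-memory SQLite database standing in for a remote JDBC source
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (userID INTEGER, location TEXT)")
conn.executemany(
    "INSERT INTO account VALUES (?, ?)",
    [(1, "Philippines"), (2, "Narnia"), (3, "Philippines"), (4, None)],
)

# Without push down: fetch every row, then filter on the client
all_rows = conn.execute("SELECT userID, location FROM account").fetchall()
client_filtered = [row for row in all_rows if row[1] == "Philippines"]

# With push down: the database evaluates the WHERE clause before sending rows
pushed_down = conn.execute(
    "SELECT userID, location FROM account WHERE location = ? ORDER BY userID",
    ("Philippines",),
).fetchall()

print(len(all_rows), "rows transferred without push down")
print(len(pushed_down), "rows transferred with push down")
conn.close()
```

Both approaches produce the same filtered result, but the second moves fewer rows across the connection.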

Run the cell below to confirm you are using the right driver.

In [0]:
%scala
Class.forName("org.postgresql.Driver")

First, create a table backed by a JDBC connection.  The `url` option holds the JDBC connection string.

In [0]:
%sql
DROP TABLE IF EXISTS twitterJDBC;

CREATE TABLE IF NOT EXISTS twitterJDBC
USING org.apache.spark.sql.jdbc
OPTIONS (
  driver "org.postgresql.Driver",
  url "jdbc:postgresql://server1.databricks.training:5432/training",
  user "readonly",
  password "readonly",
  dbtable "training.Account"
)

We now have a table of Twitter accounts, `twitterJDBC`.

In [0]:
%sql
SELECT * FROM twitterJDBC LIMIT 10

userID,screenName,location,friendsCount,followersCount,description,insertID,ETLID
2273875015,LadyPhrases,,7361,30938,Tengo un lado cariñoso. Tengo un lado perezoso. Tengo un lado insoportable. Tengo un lado amable... Y cada un@ tiene mi lado que se merece,266287981956,3153
3195178765,sweetmary647,,193,575,Hi im maria delacruz a simple woman but sweet as a sugar and I love dongyanzia so much proud to be dongyanfans❤❤❤ power couple DongYan song song couple😍😍😍,266287981957,3153
451118435,destruectag,ENTP-T,497,549,and i was like: why are you so obsessed with me?,266287981958,3153
314849076,InfoKosovaNet,Prishtinë,2077,1106,Portali më i madh i lajmeve http://infokosova.org Agjensia e lajmeve INFOKOSOVA (e Licencuar pran MTI - Ministrisë së Tregëtisë dhe Industrisë),266287981959,3153
267873620,mayutan0103,ざきみや→かながーわ→南国みやざき,188,828,ナナシス/デレステ/スクフェス ソシャゲ含む趣味垢。女じゃありません。リムブロご自由に。いいねは既読みたいな感じでちらほら。,266287981960,3153
903571704231772160,rrd2652,,67,2,HQ/무기력조/사실다좋음/구독계,266287981961,3153
784626793949126658,RoyalExpoCenter,,0,2,#RoyalExpoCenter,266287981962,3153
902810825609461760,kcliskook,"Central Luzon, Republic of the",20,4,Hi im KC✋ i have a you tube please subscribe me guys type it's KC's Channel... Thanks..,266287981963,3153
841690446829043712,nyaon_Espurr,巻き込みリプは地雷,257,284,｜HN:にゃおん｜14ちゃい♀｜絵を稀に描く｜プリリズ｜プリパラ｜#ちはやー｜#色違いモクロー難民｜アイコン【@fuwarin0111 】｜ツイプロ読んで、どうぞ｜,266287981964,3153
2511970980,hisanov_16,人生苦しんだモン勝ち！v,69,69,"県大＞社福＞人間,1年 鳴海 死に損なった余生を欲望のままに過ごします 英語で病みツイしてた時は和訳してはいけません 紅茶/邦ロック/Vロック/EDM",266287981965,3153


Add a subquery to `dbtable`.  This pushes the predicate down to the database, which filters the rows before transferring the data to your Spark cluster.

In [0]:
%sql
DROP TABLE IF EXISTS twitterPhilippinesJDBC;

CREATE TABLE twitterPhilippinesJDBC
USING org.apache.spark.sql.jdbc
OPTIONS (
  driver "org.postgresql.Driver",
  url "jdbc:postgresql://server1.databricks.training:5432/training",
  user "readonly",
  password "readonly",
  dbtable "(SELECT * FROM training.Account WHERE location = 'Philippines') as subq"
)

In [0]:
%sql
SELECT * FROM twitterPhilippinesJDBC LIMIT 10

userID,screenName,location,friendsCount,followersCount,description,insertID,ETLID
336910614,iamdorkRY,Philippines,158,229,I know how to unfollow too. 🐳,266287982070,3153
2963210763,shiela_banico,Philippines,675,356,Don't dreams be just a dreams. TIWALA LANG. 26KATHNIEL26.Lifetime.Kill them with http://kindness.BESTIE.to see KATHNIEL is the best gift.,249108113879,3153
163833585,banyeokjammie,Philippines,1244,300,Uragirimonoooooo. Slipped into the diamond life. 08.14.16 | 09.03.16 | 05.06.17 | 10.06.17,249108114376,3153
954877023775744001,rlyn_jyn,Philippines,111,15,•Photographer📷•Ftbllplyr⚽•RwnCñt💘•Kops💓•Army💗•14🌷•,266287983126,3153
923914213642268673,alingjessica,Philippines,316,491,"No fuck it! This is me, and if you don't like it then sod you! I think if we all did that, the world would be a better place. -Jesy Nelson of @LittleMix",249108114806,3153
944178631366000640,jendacer,Philippines,205,49,,240518179929,3153
131810145,rzlwyn,Philippines,1280,745,treat people with kindness ✨ chong’s,266287983718,3153
373753809,nejylc25,Philippines,377,162,"A quitter never wins, a winner never quits./",266287984283,3153
742613662033051649,cheylovess,Philippines,78,91,Lost is the best place to find your self👁 👣💜,257698049556,3153
883638312627388416,Springday_aiz,Philippines,89,34,,266287984718,3153
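
The same read can be expressed with the DataFrame API. The sketch below only assembles the option dictionary, so it runs without a cluster; `jdbc_options` is a hypothetical helper, and the connection details are copied from the SQL cell above.

```python
def jdbc_options(predicate=None):
    """Build JDBC reader options, optionally wrapping the table in a filtering subquery."""
    table = "training.Account"
    if predicate:
        # The subquery needs an alias so the database accepts it as a table expression
        table = "(SELECT * FROM {} WHERE {}) as subq".format(table, predicate)
    return {
        "driver": "org.postgresql.Driver",
        "url": "jdbc:postgresql://server1.databricks.training:5432/training",
        "user": "readonly",
        "password": "readonly",
        "dbtable": table,
    }

options = jdbc_options("location = 'Philippines'")
print(options["dbtable"])
```

In a notebook with an active `spark` session, these options could be passed along as `spark.read.format("jdbc").options(**options).load()`.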


## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Parallel JDBC Reads

In [0]:
%sql
SELECT min(userID) as minID, max(userID) as maxID from twitterJDBC

minID,maxID
509,959519629566672896


In [0]:
%sql
DROP TABLE IF EXISTS twitterParallelJDBC;

CREATE TABLE IF NOT EXISTS twitterParallelJDBC
USING org.apache.spark.sql.jdbc
OPTIONS (
  driver "org.postgresql.Driver",
  url "jdbc:postgresql://server1.databricks.training:5432/training",
  user "readonly",
  password "readonly",
  dbtable "training.Account",
  partitionColumn '"userID"',
  lowerBound 2591,
  upperBound 951253910555168768,
  numPartitions 25
)
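
The four partitioning options tell Spark to issue `numPartitions` queries, each covering one range of `partitionColumn`. The sketch below approximates how such ranges can be derived. It is a simplified illustration that assumes non-negative integer bounds, not Spark's exact implementation (which varies between versions), and `partition_predicates` is a hypothetical name. According to the Spark documentation, `lowerBound` and `upperBound` only decide the stride and do not filter rows, which is why the first and last ranges are open-ended.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one WHERE-clause string per partition (simplified sketch)."""
    if num_partitions < 2:
        raise ValueError("a single partition needs no predicate")
    # Width of each range; integer division, assuming non-negative bounds
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower limit and the last has no upper limit
        lower = "{} >= {}".format(column, current) if i != 0 else None
        current += stride
        upper = "{} < {}".format(column, current) if i != num_partitions - 1 else None
        if upper is None:
            predicates.append(lower)
        elif lower is None:
            # Rows with a NULL partition column are assigned to the first partition
            predicates.append("{} or {} is null".format(upper, column))
        else:
            predicates.append("{} AND {}".format(lower, upper))
    return predicates

for predicate in partition_predicates("userID", 0, 100, 4):
    print(predicate)
```

Because the ranges have equal width, rows are spread evenly only when the column's values are roughly uniform between the two bounds.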

In [0]:
%sql
SELECT * from twitterParallelJDBC

userID,screenName,location,friendsCount,followersCount,description,insertID,ETLID
2273875015,LadyPhrases,,7361,30938,Tengo un lado cariñoso. Tengo un lado perezoso. Tengo un lado insoportable. Tengo un lado amable... Y cada un@ tiene mi lado que se merece,266287981956,3153
3195178765,sweetmary647,,193,575,Hi im maria delacruz a simple woman but sweet as a sugar and I love dongyanzia so much proud to be dongyanfans❤❤❤ power couple DongYan song song couple😍😍😍,266287981957,3153
451118435,destruectag,ENTP-T,497,549,and i was like: why are you so obsessed with me?,266287981958,3153
314849076,InfoKosovaNet,Prishtinë,2077,1106,Portali më i madh i lajmeve http://infokosova.org Agjensia e lajmeve INFOKOSOVA (e Licencuar pran MTI - Ministrisë së Tregëtisë dhe Industrisë),266287981959,3153
267873620,mayutan0103,ざきみや→かながーわ→南国みやざき,188,828,ナナシス/デレステ/スクフェス ソシャゲ含む趣味垢。女じゃありません。リムブロご自由に。いいねは既読みたいな感じでちらほら。,266287981960,3153
2511970980,hisanov_16,人生苦しんだモン勝ち！v,69,69,"県大＞社福＞人間,1年 鳴海 死に損なった余生を欲望のままに過ごします 英語で病みツイしてた時は和訳してはいけません 紅茶/邦ロック/Vロック/EDM",266287981965,3153
4120354274,gren8s,"Los Angeles, CA",157,590,"who cares, baby",266287981968,3153
2937688669,promart_jp,大阪府,1204,1073,生鮮卸売市場プロマートは中央卸売市場から生鮮食品や加工食品を仕入れることができるサイトです。業者の方から一般の方にも、国内外問わず、お問い合わせをお待ちしています。,266287981970,3153
814603224,do_nutgiveafuck,Narnia,5307,8715,You shall not pass.,266287981971,3153
2150800563,NielKomar,Уфа.,693,388,Самобразование и самообучение Здесь обязательный взаимный фолловинг Занимаюсь авто мототехникой,266287981972,3153


In [0]:
%python
%timeit sql("SELECT * from twitterJDBC").describe()

In [0]:
%python
%timeit sql("SELECT * from twitterParallelJDBC").describe()

For additional options [see the Spark docs.](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html)

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>