## DSEE Azure Data Stores Integration with Azure Databricks  ![adb](files/pics/adb.png)  +  ![adls](files/pics/adls1.png)  +  ![sqldb](files/pics/azsqldb.png)  +  ![blob](files/pics/blob1.png)
**This notebook shows how to connect Azure Databricks to the data stores used in DSEE: Azure Data Lake Store (ADLS), Azure SQL Database, and Azure Blob Storage.**

## Data Science Exploration Environment (DSEE) Solution Overview

![dsee](files/pics/dsee.png)

### Get the Secrets from Azure Databricks Workspace   ![adb](files/pics/adb.png)

In [4]:
SpnId = dbutils.secrets.get(scope = "krm_secrets", key = "krmSpnId")
SpnVal = dbutils.secrets.get(scope = "krm_secrets", key = "krmSpnVal")
SqlUsername = dbutils.secrets.get(scope = "krm_secrets", key = "krmSqlUsername")
SqlPasswd = dbutils.secrets.get(scope = "krm_secrets", key = "krmSqlPasswd")
BlobKey = dbutils.secrets.get(scope = "krm_secrets", key = "krmBlobKey")
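
The five lookups above all read from the same scope, so they can be collected in one pass. The following is a minimal sketch, not part of the original notebook: `load_secrets` and `SECRET_KEYS` are hypothetical names, and the function assumes it is handed a callable that behaves like `dbutils.secrets.get(scope=..., key=...)`.

```python
from typing import Callable, Dict, Sequence

# Key names copied from the cell above.
SECRET_KEYS = ["krmSpnId", "krmSpnVal", "krmSqlUsername", "krmSqlPasswd", "krmBlobKey"]


def load_secrets(get_secret: Callable[..., str], scope: str,
                 keys: Sequence[str] = SECRET_KEYS) -> Dict[str, str]:
    """Fetch every key from one secret scope.

    `get_secret` is assumed to accept `scope` and `key` keyword arguments,
    as dbutils.secrets.get does in the cell above.
    """
    return {key: get_secret(scope=scope, key=key) for key in keys}


# Inside a Databricks notebook (dbutils is not available elsewhere):
# secrets = load_secrets(dbutils.secrets.get, "krm_secrets")
# SpnId = secrets["krmSpnId"]
```

Passing the getter in as an argument keeps the helper free of any direct dependency on `dbutils`, so it can be exercised outside a cluster with a stand-in function.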

## Integration with Azure Data Lake Store (ADLS)   ![adb](files/pics/adb.png) + ![adls](files/pics/adls1.png)
### Connection Configuration for ADLS through End-User Multi-Factor Authentication

Attach the following libraries to your cluster:
- azure-mgmt-resource
- azure-mgmt-datalake-store
- azure-datalake-store

For details on attaching libraries, see: https://docs.azuredatabricks.net/user-guide/libraries.html
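
Before running the cells below, it can help to confirm that the libraries are actually importable on the cluster. This is a rough sketch using only the standard library; `missing_modules` is a hypothetical helper, and the import names in `REQUIRED_MODULES` are assumed to be the ones provided by the three packages listed above.

```python
import importlib.util

# Assumed import names for azure-mgmt-resource, azure-mgmt-datalake-store
# and azure-datalake-store respectively.
REQUIRED_MODULES = [
    "azure.mgmt.resource",
    "azure.mgmt.datalake.store",
    "azure.datalake.store",
]


def missing_modules(names):
    """Return the subset of `names` that cannot be located for import."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (for example `azure`) is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


# An empty list means every required module was found.
print(missing_modules(REQUIRED_MODULES))
```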

### Define Variables

In [7]:
adls_account = "cbspp01dls"
adls_folder = "/dseekrm-d-01/"
file_path = "/dseekrm-d-01/data-training-dls/green_tripdata_2014-01.csv"
out_file_path = "/dseekrm-d-01/output/df_output.csv"
delimiter = ","
file_format = "csv"

In [8]:
import pandas as pd
from azure.datalake.store import core, lib, multithread

In [9]:
token = lib.auth()

In [10]:
adlsFileSystemClient = core.AzureDLFileSystem(token, store_name=adls_account)
adlsFileSystemClient.walk(adls_folder)

In [11]:
adlsFileSystemClient = core.AzureDLFileSystem(token, store_name=adls_account)
with adlsFileSystemClient.open(file_path, 'rb') as f:
    # header=0: the first row of the file holds the column names
    df_data = pd.read_csv(f, sep=delimiter, header=0)

df_data.head()

In [12]:
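# Serialize the DataFrame to CSV text, then upload it as UTF-8 bytes.
df_str = df_data.to_csv()
adlsFileSystemClient = core.AzureDLFileSystem(token, store_name=adls_account)
with adlsFileSystemClient.open(out_file_path, 'wb') as f:
    f.write(df_str.encode('utf-8'))  # the with-block closes the file on exit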

## Integration with Azure Data Lake Store (ADLS)   ![adb](files/pics/adb.png) + ![adls](files/pics/adls1.png)
### Connection Configuration for ADLS through Service-to-Service Authentication

In [14]:
spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", SpnId)
spark.conf.set("dfs.adls.oauth2.credential", SpnVal)
spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/3a15904d-3fd9-4256-a753-beb05cdf0c6d/oauth2/token")
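
The same four OAuth settings are needed again later when mounting the store, so one option is to build them once from the service principal values and the tenant ID. The sketch below is illustrative only: `adls_oauth_configs` is a hypothetical helper, and `<tenant-id>` in the usage comment is a placeholder for the directory ID that appears in the refresh URL above.

```python
def adls_oauth_configs(client_id, credential, tenant_id):
    """Build the ADLS service-to-service OAuth settings as a dict.

    The keys are the same ones set with spark.conf.set in the cell above.
    """
    refresh_url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
    return {
        "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
        "dfs.adls.oauth2.client.id": client_id,
        "dfs.adls.oauth2.credential": credential,
        "dfs.adls.oauth2.refresh.url": refresh_url,
    }


# Inside the notebook (requires the `spark` session and the secrets read earlier):
# for key, value in adls_oauth_configs(SpnId, SpnVal, "<tenant-id>").items():
#     spark.conf.set(key, value)
```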

### Define Variables

In [16]:
adls_account = "cbspp01dls"
adls_uri = "adl://"+adls_account+".azuredatalakestore.net"
adls_folder = "/dseekrm-d-01/"
# adls_folder already ends with "/", so the relative paths must not start with one
file_path = adls_uri+adls_folder+"data-training-dls/green_tripdata_2014-01.csv"
out_file_path = adls_uri+adls_folder+"output/df_output.csv"
delimiter = ","
file_format = "csv"
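
Building `adl://` paths by string concatenation makes it easy to end up with a doubled or missing slash between segments. A small helper that normalizes each segment avoids this; the sketch below is one possible approach, and `adl_path` is a hypothetical name rather than part of any Azure or Databricks library.

```python
def adl_path(account, *parts):
    """Join path segments onto an adl:// URI with exactly one slash between them."""
    root = f"adl://{account}.azuredatalakestore.net"
    # Drop leading/trailing slashes and skip segments that are empty afterwards.
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    return "/".join([root] + cleaned)


# Same values as in the cell above:
print(adl_path("cbspp01dls", "/dseekrm-d-01/", "/data-training-dls/green_tripdata_2014-01.csv"))
```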


In [17]:
dataPath = file_path
df = spark.read.format(file_format)\
  .option("header","true")\
  .option("sep", delimiter)\
  .option("inferSchema", "true")\
  .load(dataPath)
  
display(df)

VendorID,lpep_pickup_datetime,Lpep_dropoff_datetime,Store_and_fwd_flag,RateCodeID,Pickup_longitude,Pickup_latitude,Dropoff_longitude,Dropoff_latitude,Passenger_count,Trip_distance,Fare_amount,Extra,MTA_tax,Tip_amount,Tolls_amount,Ehail_fee,Total_amount,Payment_type,Trip_type
2,2014-01-01T00:00:00.000+0000,2014-01-01T01:08:06.000+0000,N,1,0.0,0.0,-73.86504364013672,40.87230682373047,1,6.47,20.0,0.5,0.5,0.0,0.0,,21.0,1,
2,2014-01-01T00:00:00.000+0000,2014-01-01T06:03:57.000+0000,N,2,0.0,0.0,-73.7763671875,40.64548873901367,1,20.12,52.0,0.0,0.5,0.0,5.33,,57.83,1,
2,2014-01-01T00:00:00.000+0000,2014-01-01T18:22:44.000+0000,N,1,0.0,0.0,-73.93264770507812,40.85257339477539,2,0.81,5.0,0.5,0.5,0.0,0.0,,6.0,1,
2,2014-01-01T00:00:00.000+0000,2014-01-01T00:52:03.000+0000,N,1,0.0,0.0,-73.99407958984375,40.74909210205078,1,9.55,33.5,0.5,0.5,2.17,5.33,,42.0,1,
2,2014-01-01T00:00:00.000+0000,2014-01-01T00:49:25.000+0000,N,1,0.0,0.0,-73.93606567382812,40.73472595214844,1,1.22,7.0,0.5,0.5,2.0,0.0,,10.0,1,
2,2014-01-01T00:00:00.000+0000,2014-01-01T00:01:15.000+0000,N,1,0.0,0.0,-73.91215515136719,40.684059143066406,2,4.27,17.0,0.5,0.5,0.0,0.0,,18.0,2,
2,2014-01-01T00:00:00.000+0000,2014-01-01T02:37:20.000+0000,N,1,0.0,0.0,-73.93531799316406,40.73701095581055,1,7.5,40.0,0.5,0.5,0.0,0.0,,41.0,2,
2,2014-01-01T00:00:00.000+0000,2014-01-01T15:24:02.000+0000,N,5,0.0,0.0,-73.93746948242188,40.804195404052734,2,0.02,18.0,0.0,0.5,0.0,0.0,,18.5,1,
2,2014-01-01T00:00:00.000+0000,2014-01-01T06:11:44.000+0000,N,1,0.0,0.0,0.0,0.0,5,3.02,15.0,0.5,0.5,0.0,0.0,,16.0,2,
2,2014-01-01T00:00:00.000+0000,2014-01-01T01:17:43.000+0000,N,1,0.0,0.0,-73.91567993164062,40.77630615234375,1,1.41,9.0,0.5,0.5,2.38,0.0,,12.38,1,


## Mount Data Lake Store to DBFS  ![adb](files/pics/adb.png)  +  ![adls](files/pics/adls1.png)
- You should create a mount point only if you want all users in the Databricks workspace to have access to the mounted Data Lake Store.
- The service client that you use to access the Data Lake Store should be granted access only to that Data Lake Store; it should not be granted access to other resources in Azure.
- Once a mount point is created through a cluster, users of that cluster can immediately access the mount point. To use the mount point in another running cluster, users must run dbutils.fs.refreshMounts() on that running cluster to make the newly created mount point available for use.

In [19]:
adls_account = "cbspp01dls"
adls_uri = "adl://"+adls_account+".azuredatalakestore.net"
adls_folder = "/dseekrm-d-01/"
adls_dbfs_mnt_folder = "/mnt/dseekrm-d-01/"
adls_dbfs_file_path = "dbfs:"+adls_dbfs_mnt_folder+"data-training-dls/green_tripdata_2014-01.csv"
dbfs_out_file_path = "dbfs:"+adls_dbfs_mnt_folder+"output/df_output.csv"
delimiter = ","
file_format = "csv"

In [20]:
configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
           "dfs.adls.oauth2.client.id": SpnId,
           "dfs.adls.oauth2.credential": SpnVal,
           "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/3a15904d-3fd9-4256-a753-beb05cdf0c6d/oauth2/token"}

dbutils.fs.mount(
  source = adls_uri+adls_folder,
  mount_point = adls_dbfs_mnt_folder,
  extra_configs = configs)
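
Re-running the mount cell after the mount point already exists typically raises an error, so a guard that checks the current mounts first can make the cell safe to re-run. This is a minimal sketch under stated assumptions: `ensure_mounted` is a hypothetical helper, and `fs` is assumed to behave like `dbutils.fs`, with `mounts()` returning objects that expose a `mountPoint` attribute.

```python
def ensure_mounted(fs, source, mount_point, extra_configs):
    """Mount `source` at `mount_point` unless something is already mounted there.

    Returns True if a new mount was created and False if one already existed.
    """
    # Compare without trailing slashes so "/mnt/x/" and "/mnt/x" match.
    target = mount_point.rstrip("/")
    existing = {m.mountPoint.rstrip("/") for m in fs.mounts()}
    if target in existing:
        return False
    fs.mount(source=source, mount_point=target, extra_configs=extra_configs)
    return True


# Inside the notebook, with the variables defined in the cells above:
# ensure_mounted(dbutils.fs, adls_uri + adls_folder, adls_dbfs_mnt_folder, configs)
```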

In [21]:
display(dbutils.fs.ls("dbfs:/mnt/dseekrm-d-01/data-training-dls"))

In [22]:
dataPath = adls_dbfs_file_path
df = spark.read.format(file_format)\
  .option("header","true")\
  .option("sep", delimiter)\
  .option("inferSchema", "true")\
  .load(dataPath)
  
display(df)

### Databricks FileSystem Maintenance  ![adb](files/pics/adb.png)
  - Unmount a mount point
  - Refresh Mount
  - Explore Data Lake Store

In [24]:
#dbutils.fs.unmount(adls_dbfs_mnt_folder)
#dbutils.fs.refreshMounts()
#dbutils.fs.ls("dbfs:"+adls_dbfs_mnt_folder)

## Integration with Azure SQL Database   ![adb](files/pics/adb.png)  +  ![sqldb](files/pics/azsqldb.png)
### Connection Configuration for Azure SQL Database through Username/Password Authentication

In [26]:
sqlserver = "krmdsqlsrv01.database.windows.net"
port = "1433"
database = "krmdsqldb"
user = SqlUsername
pswd = SqlPasswd
query = "(select getdate() as dt) test"

In [27]:
df1 = spark.read \
  .option('user', user) \
  .option('password', pswd) \
  .jdbc('jdbc:sqlserver://' + sqlserver + ':' + port + ';database=' + database, query)
  
display(df1)  
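
The JDBC URL and the parenthesized query with an alias are the two pieces that are easiest to get wrong by hand. The sketch below shows one way to build both; `sqlserver_jdbc_url` and `as_subquery` are hypothetical helpers, not Spark or driver APIs.

```python
def sqlserver_jdbc_url(server, database, port=1433):
    """Build a SQL Server JDBC URL in the same form used in the cell above."""
    return f"jdbc:sqlserver://{server}:{port};database={database}"


def as_subquery(select_sql, alias="q"):
    """Wrap a SELECT so it can be passed where the JDBC reader expects a table name."""
    return f"({select_sql.strip().rstrip(';')}) {alias}"


# Same values as in the cells above:
print(sqlserver_jdbc_url("krmdsqlsrv01.database.windows.net", "krmdsqldb"))
print(as_subquery("select getdate() as dt", alias="test"))
```

The alias is needed because the reader treats the second argument as a table expression, so a bare `SELECT` without parentheses and a name is not accepted there.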

## Integration with Azure Blob Storage   ![adb](files/pics/adb.png)  +  ![blob](files/pics/blob1.png)
### Connection Configuration for Azure Blob Storage through Account Access Key

In [29]:
storage_account_name = "krmdhaz2jjno5iyna"
spark.conf.set("fs.azure.account.key."+storage_account_name+".blob.core.windows.net",BlobKey)
file_type = "csv"
container_name = "data-training"
file_location = "wasbs://"+container_name+"@"+storage_account_name+".blob.core.windows.net"
file_path = "/green_tripdata_2014-01.csv"

In [30]:
df2 = spark.read.format(file_type).option("header","true").option("inferSchema", "true").load(file_location+file_path)
display(df2)
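
The configuration key and the `wasbs://` location both embed the storage account name, so deriving them from the same inputs keeps them in sync. The following is a small sketch; `blob_account_key_conf` and `wasbs_uri` are hypothetical helper names.

```python
def blob_account_key_conf(storage_account):
    """Return the Spark config key that holds the access key for a storage account."""
    return f"fs.azure.account.key.{storage_account}.blob.core.windows.net"


def wasbs_uri(container, storage_account, path=""):
    """Build a wasbs:// URI for a container, optionally with a path inside it."""
    base = f"wasbs://{container}@{storage_account}.blob.core.windows.net"
    return f"{base}/{path.lstrip('/')}" if path else base


# Inside the notebook (requires `spark` and the BlobKey secret read earlier):
# spark.conf.set(blob_account_key_conf(storage_account_name), BlobKey)
# df2 = spark.read.format("csv").option("header", "true").load(
#     wasbs_uri(container_name, storage_account_name, file_path))
```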

In [31]:
import pyodbc