# ![Spark Logo Tiny](https://files.training.databricks.com/images/wiki-book/general/logo_spark_tiny.png) Reading SAS data files into Databricks


**In this lesson you:**
1. Learn about Databricks databases and tables and how they compare to SAS libraries and datasets
1. Understand how to use named Databricks DataFrames and tables much as you would use SAS filename references
1. Read data in from various locations and write data out to DataFrames and tables

## SAS dataset libraries and Databricks databases and tables

In SAS, a library points to a particular location where a collection of SAS datasets is stored. This could be on a local network drive, a filesystem, or a remote database. The `libname` assigned to a library lets you reference any dataset within that library.

In Databricks, the closest equivalent is a [metastore database](https://docs.databricks.com/data/metastores/index.html) that points to [unmanaged tables](https://docs.databricks.com/data/tables.html#managed-and-unmanaged-tables), which allows you to persist the tables beyond a single session or an individual cluster.

A Databricks database is a collection of tables. A [Databricks table](https://docs.databricks.com/data/tables.html#) is a collection of structured data. You can cache, filter, and perform any operations supported by Apache Spark DataFrames on Databricks tables. You can query tables with Spark APIs and Spark SQL.

There are two types of tables: global and local. 
- A global table is available across all clusters. Databricks registers global tables either to the Databricks Hive metastore or to an external Hive metastore. 
- A local table is not accessible from other clusters and is not registered in the Hive metastore. This is also known as a temporary view.

You can learn more about creating and managing databases, tables, and views in the Databricks Academy course "Quick Reference: Relational Entities on Databricks".

## Data storage and access

In SAS, you often use the `filename in` pattern to read data from external sources. In Databricks, you can use the [Databricks File System (DBFS)](https://docs.databricks.com/data/databricks-file-system.html) to access data in object storage.

Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on Databricks clusters. DBFS is an abstraction on top of scalable object storage and offers the following benefits:
- Allows you to mount storage objects so that you can seamlessly access data without requiring credentials.
- Allows you to interact with object storage using directory and file semantics instead of storage URLs.
- Persists files to object storage, so you won’t lose data after you terminate a cluster.
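
As a small illustration of the directory and file semantics listed above: on Databricks clusters, the same DBFS location is typically written as `dbfs:/...` for Spark and `dbutils`, and as `/dbfs/...` for local file APIs. The sketch below converts between the two forms. The helper names `to_local_path` and `to_dbfs_uri` are hypothetical (they are not part of any Databricks library), and the example path is made up.

```python
def to_local_path(dbfs_uri: str) -> str:
    """Convert a 'dbfs:/...' URI to the '/dbfs/...' form used by local file APIs."""
    prefix = "dbfs:/"
    if not dbfs_uri.startswith(prefix):
        raise ValueError(f"expected a path starting with {prefix!r}, got {dbfs_uri!r}")
    return "/dbfs/" + dbfs_uri[len(prefix):].lstrip("/")


def to_dbfs_uri(local_path: str) -> str:
    """Convert a '/dbfs/...' local path back to a 'dbfs:/...' URI."""
    prefix = "/dbfs/"
    if not local_path.startswith(prefix):
        raise ValueError(f"expected a path starting with {prefix!r}, got {local_path!r}")
    return "dbfs:/" + local_path[len(prefix):]


# Hypothetical example path, not the course working directory.
example = "dbfs:/user/someone/rawdata/allergies.sas7bdat"
print(to_local_path(example))
```

Keeping the conversion in one place avoids hand-editing paths when a file needs to be opened both by Spark and by plain Python file functions.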

[Mounting object storage](https://docs.databricks.com/data/databricks-file-system.html#mount-object-storage-to-dbfs) to DBFS allows you to access objects in object storage (e.g., AWS S3 buckets or Azure Blob storage containers) as if they were on the local file system. In this course, we have already mounted a data file to DBFS and will access the file from there.

An example of the equivalent in SAS:

`filename in s3 '/directory1/i.csv';`

#### Run the cells below to get started.

In [0]:
%run ./Includes/classroom-setup


username: rashbeats@gmail.com
working_dir:   dbfs:/user/rashbeats_gmail_com/dbacademy/sasproc
database_name: dbacademy_rashbeats_gmail_com_sasproc
Out[1]: True

In [0]:
# create a pointer to a sas7bdat file in the source directory that has been set up for you in cloud object storage:
sasfile = sourcedir + 'allergies.sas7bdat'

# create a filepath location for the dataset in the Databricks File System (DBFS)
filepath = working_dir + "/rawdata/allergies.sas7bdat"

# copy the file from the source directory to the filepath location
dbutils.fs.cp(sasfile, filepath)

# list the files in the dbfs location
dbutils.fs.ls(filepath)

Out[2]: [FileInfo(path='dbfs:/user/rashbeats_gmail_com/dbacademy/sasproc/rawdata/allergies.sas7bdat', name='allergies.sas7bdat', size=720896, modificationTime=1669912839000)]

## Read a dataset from a specified location

In SAS, you read a dataset in using the `set` statement in a DATA step:

  `Data a;  
      Set b;  
      Run;`

There are several methods we can use to read a SAS dataset in and write to a Databricks format:
 1. Write to a Spark DataFrame using Python
 1. Write to a SQL table or a temporary view using Python or SQL
 1. Write to a Delta table using Python
 
To read sas7bdat files, we will use the [saurfang](https://github.com/saurfang/spark-sas7bdat) library that we installed on our cluster, specifying it as the format option when loading the data.

We will demonstrate two methods for loading the dataset:
1. Loading the data into a table using SQL
1. Loading the data and writing it to a Spark DataFrame using Python.

### 1. Load the data into a table using SQL

In [0]:
# get the path for the dataset in dbfs
print(filepath)

dbfs:/user/rashbeats_gmail_com/dbacademy/sasproc/rawdata/allergies.sas7bdat


**In the cell below, copy the filepath output from the previous cell and paste it in where you see FILL IN.**

Verify that the table was created correctly by running a SQL query:

You can now issue any SQL queries you want on the `allergies` table.
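
A minimal sketch of what the table-creation step might look like, assuming the `USING com.github.saurfang.sas.spark` data source syntax (the same format name passed to `spark.read` in this lesson) with a `path` option. The helper name `sas_table_ddl` is hypothetical and introduced only for illustration, and the path shown is a placeholder rather than your actual filepath. The function only builds SQL text; in a notebook you would pass the result to `spark.sql(...)` or paste it into a `%sql` cell.

```python
def sas_table_ddl(table_name: str, path: str) -> str:
    """Build a CREATE TABLE statement backed by a sas7bdat file.

    Assumes the spark-sas7bdat data source is installed on the cluster.
    """
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name}\n"
        "USING com.github.saurfang.sas.spark\n"
        f"OPTIONS (path '{path}')"
    )


# Placeholder path: substitute the filepath printed in the previous cell.
ddl = sas_table_ddl("allergies", "dbfs:/FILL/IN/allergies.sas7bdat")
print(ddl)

# A follow-up query to verify the table, also as plain SQL text.
verify_sql = "SELECT * FROM allergies LIMIT 10"
print(verify_sql)
```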

Below, we will show how you can write the data to a DataFrame instead, and various methods you can use on a DataFrame.

### 2. Write to a Spark DataFrame using Python

The code block below returns a DataFrame `df`. A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of a DataFrame like a spreadsheet, a SQL table, or a dictionary of series objects. 

The Spark DataFrameReader is accessed via `spark.read` and plays a role similar to `proc import` in SAS. It can read many file formats, including JSON, Parquet, and CSV, and accepts many options. You can learn more about this in the [official documentation](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#input-and-output). For more information and examples, see the [Quickstart](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart.html) on the Apache Spark documentation website.

NOTE: Here, we specify the input schema for the data by listing the field names and their data types. If we do not specify a schema, the DataFrameReader will infer one by sampling some of the input file.
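
The schema passed to `.schema(...)` can be written as a DDL-style string of `name TYPE` pairs separated by commas. As a hedged sketch, the string can be assembled from a list of pairs, which may be easier to maintain for wide datasets. The variable names `fields` and `schema_ddl` are introduced here for illustration only; the field names and types are the ones used in this lesson.

```python
# Each pair is (column name, Spark SQL type), in file column order.
fields = [
    ("start", "DATE"),
    ("stop", "STRING"),
    ("patient", "STRING"),
    ("encounter", "STRING"),
    ("code", "LONG"),
    ("description", "STRING"),
]

# Join the pairs into the DDL string form accepted by DataFrameReader.schema().
schema_ddl = ", ".join(f"{name} {dtype}" for name, dtype in fields)
print(schema_ddl)
```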

In [0]:
from pyspark.sql.functions import *

df = (spark.read
           .format("com.github.saurfang.sas.spark")
           .schema("start DATE, stop STRING, patient STRING, encounter STRING, code LONG, description STRING")
           .load(filepath, forceLowercaseNames=True)
     )
display(df)

start,stop,patient,encounter,code,description
1957-04-19,,7ff0403d-6cc4-48a8-a0b1-ddb318b6017c,34adae8e-8a79-4e34-a5c3-4d2d8a5ef5f6,419474003,Allergy to mould
1957-04-19,,7ff0403d-6cc4-48a8-a0b1-ddb318b6017c,34adae8e-8a79-4e34-a5c3-4d2d8a5ef5f6,232350006,House dust mite allergy
1957-04-19,,7ff0403d-6cc4-48a8-a0b1-ddb318b6017c,34adae8e-8a79-4e34-a5c3-4d2d8a5ef5f6,232347008,Dander (animal) allergy
1957-04-19,,7ff0403d-6cc4-48a8-a0b1-ddb318b6017c,34adae8e-8a79-4e34-a5c3-4d2d8a5ef5f6,419263009,Allergy to tree pollen
1957-04-19,,7ff0403d-6cc4-48a8-a0b1-ddb318b6017c,34adae8e-8a79-4e34-a5c3-4d2d8a5ef5f6,300913006,Shellfish allergy
1991-10-27,,fdd3cddc-7b4a-468a-b6cb-99a341973501,b8d8595e-9b81-4e2f-a2ac-f5751873611d,300916003,Latex allergy
1991-10-27,,fdd3cddc-7b4a-468a-b6cb-99a341973501,b8d8595e-9b81-4e2f-a2ac-f5751873611d,419474003,Allergy to mould
1991-10-27,,fdd3cddc-7b4a-468a-b6cb-99a341973501,b8d8595e-9b81-4e2f-a2ac-f5751873611d,232350006,House dust mite allergy
1991-10-27,,fdd3cddc-7b4a-468a-b6cb-99a341973501,b8d8595e-9b81-4e2f-a2ac-f5751873611d,232347008,Dander (animal) allergy
1991-10-27,,fdd3cddc-7b4a-468a-b6cb-99a341973501,b8d8595e-9b81-4e2f-a2ac-f5751873611d,418689008,Allergy to grass pollen


We can also use the `display` command to see the data in tabular format. 

NOTE: You can display plots of the DataFrame and specify options, using the buttons below the output. 

**Try it: display the output as a bar chart and set the Plot Options to use `description` as the Key, `patient` as the Value, and `COUNT` as the Aggregation.**

In [0]:
display(df)

start,stop,patient,encounter,code,description
1957-04-19,,7ff0403d-6cc4-48a8-a0b1-ddb318b6017c,34adae8e-8a79-4e34-a5c3-4d2d8a5ef5f6,419474003,Allergy to mould
1957-04-19,,7ff0403d-6cc4-48a8-a0b1-ddb318b6017c,34adae8e-8a79-4e34-a5c3-4d2d8a5ef5f6,232350006,House dust mite allergy
1957-04-19,,7ff0403d-6cc4-48a8-a0b1-ddb318b6017c,34adae8e-8a79-4e34-a5c3-4d2d8a5ef5f6,232347008,Dander (animal) allergy
1957-04-19,,7ff0403d-6cc4-48a8-a0b1-ddb318b6017c,34adae8e-8a79-4e34-a5c3-4d2d8a5ef5f6,419263009,Allergy to tree pollen
1957-04-19,,7ff0403d-6cc4-48a8-a0b1-ddb318b6017c,34adae8e-8a79-4e34-a5c3-4d2d8a5ef5f6,300913006,Shellfish allergy
1991-10-27,,fdd3cddc-7b4a-468a-b6cb-99a341973501,b8d8595e-9b81-4e2f-a2ac-f5751873611d,300916003,Latex allergy
1991-10-27,,fdd3cddc-7b4a-468a-b6cb-99a341973501,b8d8595e-9b81-4e2f-a2ac-f5751873611d,419474003,Allergy to mould
1991-10-27,,fdd3cddc-7b4a-468a-b6cb-99a341973501,b8d8595e-9b81-4e2f-a2ac-f5751873611d,232350006,House dust mite allergy
1991-10-27,,fdd3cddc-7b4a-468a-b6cb-99a341973501,b8d8595e-9b81-4e2f-a2ac-f5751873611d,232347008,Dander (animal) allergy
1991-10-27,,fdd3cddc-7b4a-468a-b6cb-99a341973501,b8d8595e-9b81-4e2f-a2ac-f5751873611d,418689008,Allergy to grass pollen


Now we can work with this dataset just like we would with any Spark DataFrame.

#### Write to a table or a temporary view using PySpark

Before you can issue SQL queries on a DataFrame, you must save it as a table or temporary view. Temporary views are removed once your session has ended, whereas tables are persisted beyond a given session.

Create a temporary table/view from a DataFrame using Python and then query the table using SQL:

In [0]:
df.createOrReplaceTempView("allergies")

You can drop the temporary view, without deleting the DataFrame or the original data, using SQL:
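
A minimal sketch of both steps, querying the view and then dropping it, written as a Python function that sends SQL through `spark.sql`. The function name `query_and_drop` and the particular aggregate query are illustrative assumptions; `spark` is the SparkSession that Databricks notebooks provide.

```python
def query_and_drop(spark, view_name: str = "allergies"):
    """Count rows per description in a temporary view, then drop the view."""
    counts = spark.sql(
        f"SELECT description, COUNT(*) AS n FROM {view_name} "
        "GROUP BY description ORDER BY n DESC"
    ).collect()
    # Dropping the view removes only the name registered for SQL;
    # the DataFrame and the source file are unaffected.
    spark.sql(f"DROP VIEW IF EXISTS {view_name}")
    return counts


# In a notebook (assumes the temporary view already exists):
# rows = query_and_drop(spark, "allergies")
```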

#### Write to a Delta table using Python

We can also write the DataFrame to a Delta table.

For more on Delta tables, see the [Quickstart](https://docs.databricks.com/delta/quick-start.html) on the Databricks documentation site.

In [0]:
(df.write
   .format("delta")
   .mode("append")
   .save(working_dir + "/allergies_table")
)

We can now list the contents of our working directory and see the `allergies_table` Delta table, along with the `rawdata` folder we created at the beginning of this notebook:

In [0]:
dbutils.fs.ls(working_dir)

Out[9]: [FileInfo(path='dbfs:/user/rashbeats_gmail_com/dbacademy/sasproc/allergies_table/', name='allergies_table/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/user/rashbeats_gmail_com/dbacademy/sasproc/rawdata/', name='rawdata/', size=0, modificationTime=0)]
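
To check the write, the Delta table can be read back by path. Because the write above uses `mode("append")`, running that cell more than once adds the same rows again, and a row count makes this visible. The sketch below is one way to do that; the function name `delta_row_count` is hypothetical, and `spark` is assumed to be the notebook's SparkSession.

```python
def delta_row_count(spark, path: str) -> int:
    """Read a Delta table from a storage path and return its row count."""
    return spark.read.format("delta").load(path).count()


# In a notebook:
# delta_row_count(spark, working_dir + "/allergies_table")
```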

In the next notebook, we will explore how to perform some common SAS DATA Steps in Databricks.