## ![Spark Logo Tiny](https://files.training.databricks.com/images/wiki-book/general/logo_spark_tiny.png) Performing SAS PROC Step operations in Databricks


**In this lesson you:**
- Learn how to perform some of the most common PROC Step operations in Databricks
- Demonstrate examples of performing PROC steps using PySpark and Spark SQL
- Become familiar with looking up functions in Databricks and Apache Spark documentation

Run the following cell to get started.

In [0]:
%run ./Includes/classroom-setup-3


username: rashbeats@gmail.com
working_dir:   dbfs:/user/rashbeats_gmail_com/dbacademy/sasproc
database_name: dbacademy_rashbeats_gmail_com_sasproc
Out[1]: True

Out[3]: True

## Setting up the data

In this lesson, we will work with only one of the global sales data tables. It has already been copied into your DBFS location.

In [0]:
# read the raw data file into a DataFrame
transactions_df = spark.read.format("delta").load(filepath + '/transactions')

In order to perform any summary statistics on the dataset, we first need to convert some of the data types from "string" to "int":

In [0]:
from pyspark.sql.functions import col

# cast the string columns we want to summarize to integers
transactions_df = (transactions_df
                   .withColumn("amount", col("amount").cast("int"))
                   .withColumn("retailer_id", col("retailer_id").cast("int"))
                  )

In [0]:
transactions_df.printSchema()

root
 |-- city_id: string (nullable = true)
 |-- amount: integer (nullable = true)
 |-- retailer_id: integer (nullable = true)
 |-- trx_id: string (nullable = true)
 |-- description: string (nullable = true)
 |-- transacted_at: string (nullable = true)



### `PROC SQL` 

PROC SQL is generally used to:
- generate reports and summary statistics
- retrieve or combine data from tables or views
- create tables, views, and indexes
- update the data values in PROC SQL tables
- update and retrieve data from database management system (DBMS) tables
- modify a PROC SQL table by adding, modifying, or dropping columns

**You can perform most PROC SQL operations on Databricks using Spark SQL, almost exactly as you do in SAS.**    
Simply use `%sql` or `spark.sql("some code")` with the SQL code you want to run.

You can also replace many PROC SQL steps with PySpark DataFrame methods and functions from `pyspark.sql.functions`, such as:
- filter()
- join()
- withColumn()
- concat()
- agg(sum()) 
- orderBy()

We demonstrated some of these in the previous notebook. In this notebook, we will focus on generating summary statistics, reports, and visualizations, and conclude with a discussion of macros/UDFs.

In the following cells, we create a temporary view from the `transactions_df` DataFrame and then run a SQL query against it to create a new table. Notice that the query is plain ANSI-style SQL, which Spark SQL accepts.

In [0]:
transactions_df.createOrReplaceTempView("transactions")

In [0]:
spark.sql("""
    CREATE TABLE transactions_cities AS
    SELECT city_id, amount
    FROM transactions
""")

Out[9]: DataFrame[num_affected_rows: bigint, num_inserted_rows: bigint]

### `PROC SORT`


As we mentioned previously, you do NOT have to sort datasets before performing operations in Spark. However, there may be times you would want to sort for visualization purposes. To do this, use the [`orderBy` method](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.orderBy.html). You can sort on multiple fields and by ascending or descending order, per field. 

You can chain multiple operations into one statement, as we demonstrate in the next cell using some of the methods from the previous notebook. Take a look and make sure you understand what each step is doing:

In [0]:
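# note: this `round` is the Spark column function and shadows Python's built-in round
from pyspark.sql.functions import avg, round

display(
    transactions_df.groupBy("city_id", "transacted_at")
    .agg(round(avg("amount")).alias("avg_amount"))
    # sort by avg_amount descending, then by transacted_at ascending
    .orderBy(["avg_amount", "transacted_at"], ascending=[False, True])
)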

city_id,transacted_at,avg_amount
699300425,2011-04-01 00:00:00,2999.0
1941318699,2011-04-14 05:00:00,2999.0
1198308439,2011-06-24 08:00:00,2999.0
1067330965,2011-07-14 09:00:00,2999.0
1998549640,2011-08-26 08:00:00,2998.0
1487919653,2011-05-26 07:00:00,2997.0
77397141,2011-08-13 03:00:00,2997.0
1032553787,2011-02-25 09:00:00,2996.0
149195426,2011-06-06 08:00:00,2996.0
310083237,2011-11-10 05:00:00,2996.0


Experiment with the different types of plots you can create from the table above using the plotting buttons below the output. 

Refer to the [Databricks documentation](https://docs.databricks.com/notebooks/visualizations/index.html#plot-types) for more information on creating plots.

### Common PROC step statistical operations

In the cells that follow, we will demonstrate how you can perform some common statistical operations using various tools on Databricks.

You can compute a basic statistical summary of a DataFrame using the [`.summary()` method](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.summary.html).

This method computes the specified statistics for numeric and string columns. Available statistics are: 
- count 
- mean 
- stddev 
- min 
- max 
- arbitrary approximate percentiles specified as a percentage (e.g., 75%)

If no statistics are given, this function computes count, mean, stddev, min, approximate quartiles (percentiles at 25%, 50%, and 75%), and max.

Below, we run `summary()` only on the numeric column `amount`:

In [0]:
transactions_df.select("amount").summary().show()

+-------+-----------------+
|summary|           amount|
+-------+-----------------+
|  count|         33387995|
|   mean|369.5082314167113|
| stddev|693.3854438354283|
|    min|                0|
|    25%|               23|
|    50%|               47|
|    75%|              321|
|    max|             2999|
+-------+-----------------+



Quartiles, as well as the IQR, can be calculated as follows using SQL:

We can also calculate a median using SQL:

### MLlib

One way to perform statistical modeling, such as [linear regression](https://spark.apache.org/docs/latest/ml-classification-regression.html#linear-regression) or [generalized linear modeling (GLM)](https://spark.apache.org/docs/latest/ml-classification-regression.html#generalized-linear-regression), is using MLlib. MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as:
- ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
- Featurization: feature extraction, transformation, dimensionality reduction, and selection
- Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
- Persistence: saving and loading algorithms, models, and Pipelines
- Utilities: linear algebra, statistics, data handling, etc.

_There are several Databricks Academy courses on machine learning if you would like to learn more about how to do ML on Databricks._

Below is a simple example of a linear regression model with MLlib, using our transactions data. As expected, the "Coefficients" value in the output is very close to zero, which suggests there is no meaningful linear relationship between the retailer ID and the transaction amount:

In [0]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

# prepare the input data as a vector
assembler = VectorAssembler(inputCols=["retailer_id"], outputCol="features")
train_df = assembler.transform(transactions_df)

# fit the linear regression model on the data
lr = LinearRegression(featuresCol="features", labelCol="amount")
lrModel = lr.fit(train_df)

# Print the coefficients and intercept for linear regression
print("Coefficients: %s" % str(lrModel.coefficients))
print("Intercept: %s" % str(lrModel.intercept))

Coefficients: [-3.9093588594041943e-10]
Intercept: 369.9309756386105


### Python's `scipy.stats` and `statsmodels`

These Python modules make many statistical operations relatively easy. 

Both support far more operations than we can cover here. Refer to the documentation for more information:

- [scipy.stats documentation - tutorial](https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html)
- [statsmodels documentation](https://www.statsmodels.org/stable/index.html)
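
As a small illustration of `scipy.stats`, the sketch below computes descriptive statistics and a two-sample t-test on synthetic data. The two groups are randomly generated for this example only. Since `scipy` works on local arrays rather than Spark DataFrames, real data would first need to be reduced to a size that fits on the driver (for example with `toPandas()` on a small sample).

```python
import numpy as np
from scipy import stats

# Synthetic data, generated only for illustration
rng = np.random.default_rng(seed=42)
group_a = rng.normal(loc=50.0, scale=5.0, size=200)
group_b = rng.normal(loc=52.0, scale=5.0, size=200)

# Descriptive statistics: count, min/max, mean, variance, skewness, kurtosis
summary = stats.describe(group_a)
print(summary)

# Welch's two-sample t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print("t = %.3f, p = %.4f" % (t_stat, p_value))
```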

statsmodels is a Python module that provides classes and functions for estimating many different statistical models, as well as for conducting statistical tests and statistical data exploration. An extensive list of result statistics is available for each estimator.

Below, we demonstrate a simple example of ordinary least squares, from the statsmodels documentation:

In [0]:
import numpy as np

import statsmodels.api as sm

import statsmodels.formula.api as smf

# Load data
dat = sm.datasets.get_rdataset("Guerry", "HistData").data

# Fit regression model (using the natural log of one of the regressors)
results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()

# Inspect the results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                Lottery   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.333
Method:                 Least Squares   F-statistic:                     22.20
Date:                Thu, 01 Dec 2022   Prob (F-statistic):           1.90e-08
Time:                        17:47:28   Log-Likelihood:                -379.82
No. Observations:                  86   AIC:                             765.6
Df Residuals:                      83   BIC:                             773.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept         246.4341     35.233     

Sidenote:

The scipy and statsmodels modules are already installed in many Databricks Runtimes, such as the one we set up for this course. If your cluster is _not_ on a Runtime that has these modules installed, you can install the modules you need by running a pip install, like so:

`%pip install statsmodels`

Keep in mind that `%pip install` restarts the notebook's Python interpreter (not the cluster) and resets the notebook's state, so you should run it at the beginning of a notebook so as not to lose variables you have already defined.

## Working with macros/UDFs

The final topic we address in this notebook is how to do in Databricks what you would do with macros in SAS. To do this, we create and call User-Defined Functions (UDFs) using [Python](https://docs.databricks.com/spark/latest/spark-sql/udf-python.html) or [SQL](https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-create-function.html). (UDFs can be written in other languages as well, but that is beyond the scope of this course.) UDFs are user-programmable routines, just like macros.

User-defined functions allow you to define your own functions when the system’s built-in functions are not enough to perform the desired task. To use UDFs, you: 
  1. Define the function
  1. Register the function with Spark
  1. Call the registered function.  

User-defined functions can act on a single row or act on multiple rows at once. 

When working with UDFs, it's important to remember that Spark is typically working with large datasets, and UDFs should be avoided if at all possible because they are usually inefficient. This is because Spark doesn’t know how to optimize a UDF. If there is a built-in function that can do what you need, it is better to use that than to use a UDF.

A simple UDF could look something like the following example:

In [0]:
# define a function that concatenates two fields as strings
def concatFields(field1, field2):
  result = str(field1) + ': $' + str(field2)
  return result  

# register the function with Spark
spark.udf.register("concatWithPython", concatFields)

Out[14]: <function __main__.concatFields(field1, field2)>

Now that we have registered our UDF, we can use it in a SQL query:

Run the following cell to clear the data you worked with in this notebook.

In [0]:
classroom_cleanup()
