
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Common Transformations

Apache Spark&trade; and Databricks&reg; allow you to manipulate data with built-in functions that accommodate common design patterns.

## In this lesson you:
* Apply built-in functions to manipulate data
* Define logic to handle null values
* Deduplicate a data set

## Audience
* Primary Audience: Data Engineers
* Additional Audiences: Data Scientists and Data Pipeline Engineers

## Prerequisites
* Web browser: Please use a <a href="https://docs.databricks.com/user-guide/supported-browsers.html#supported-browsers" target="_blank">supported browser</a>.
* Concept (optional): <a href="https://academy.databricks.com/collections/frontpage/products/etl-part-1-data-extraction" target="_blank">ETL Part 1 course from Databricks Academy</a>

<iframe  
src="//fast.wistia.net/embed/iframe/xjbyksd137?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/xjbyksd137?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

### Transformations in ETL

The goal of transformations in ETL is to reshape raw data so that it populates a data model. The most common models are **relational models** and **snowflake (or star) schemas,** though other models such as query-first modeling also exist. Relational modeling entails distilling your data into efficient tables that you can join back together. A snowflake model is generally used in data warehousing, where a fact table references any number of related dimension tables. Regardless of the model you use, the ETL approach is generally the same.

Transforming data can range in complexity from simply parsing relevant fields to handling null values without affecting downstream operations and applying complex conditional logic.  Common transformations include:<br><br>

* Normalizing values
* Imputing null or missing data
* Deduplicating data
* Performing database rollups
* Exploding arrays
* Pivoting DataFrames

<div><img src="https://files.training.databricks.com/images/eLearning/ETL-Part-2/data-models.png" style="height: 400px; margin: 20px"/></div>

### Built-In Functions

Built-in functions offer a range of performant options to manipulate data. This includes options familiar to:<br><br>

1. SQL users, such as `.select()` and `.groupBy()`
2. Python, Scala, and R users, such as `max()` and `sum()`
3. Data warehousing users, such as `rollup()` and `cube()`

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** For more depth on built-in functions, see  <a href="https://academy.databricks.com/collections/frontpage/products/dataframes" target="_blank">Getting Started with Apache Spark DataFrames course from Databricks Academy</a>.

### Getting Started

Run the following cell to configure our "classroom."

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Remember to attach your notebook to a cluster. Click <b>Detached</b> in the upper left hand corner and then select your preferred cluster.

<img src="https://files.training.databricks.com/images/eLearning/attach-to-cluster.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/>

Run the cell below to mount the data.

In [0]:
%run "./Includes/Classroom-Setup"

### Normalizing Data

Normalizing refers to several different practices, including restructuring data into normal form to reduce redundancy and scaling data down to a small, specified range. In this case, scale a range of integers to values between 0 and 1.

Start with a DataFrame containing a range of integers.

In [0]:
integerDF = spark.range(1000, 10000)

display(integerDF)

id
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009


To normalize these values between 0 and 1, subtract the minimum from each value and divide the result by the range (the maximum minus the minimum).

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** <a href="http://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=minmaxscaler#pyspark.ml.feature.MinMaxScaler" target="_blank">Also see the built-in class `MinMaxScaler`</a>

In [0]:
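# Note: importing max and min from pyspark.sql.functions shadows Python's built-ins of the same name
from pyspark.sql.functions import col, max, min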

colMin = integerDF.select(min("id")).first()[0]
colMax = integerDF.select(max("id")).first()[0]

normalizedIntegerDF = (integerDF
  .withColumn("normalizedValue", (col("id") - colMin) / (colMax - colMin) )
)

display(normalizedIntegerDF)

id,normalizedValue
1000,0.0
1001,0.000111123458162018
1002,0.000222246916324036
1003,0.000333370374486054
1004,0.000444493832648072
1005,0.00055561729081009
1006,0.000666740748972108
1007,0.000777864207134126
1008,0.0008889876652961441
1009,0.0010001111234581
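
As a quick check on the arithmetic, the same min-max scaling can be reproduced in plain Python without Spark. This is a minimal sketch: the helper name `min_max_scale` is hypothetical and is not part of the lesson or of Spark.

```python
def min_max_scale(values):
    """Scale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant column: max equals min")
    return [(v - lo) / (hi - lo) for v in values]

# spark.range(1000, 10000) yields the integers 1000 through 9999 (the end is exclusive).
ids = list(range(1000, 10000))
scaled = min_max_scale(ids)

print(scaled[0])   # smallest id, (1000 - 1000) / 8999
print(scaled[-1])  # largest id, (9999 - 1000) / 8999
print(scaled[1])   # (1001 - 1000) / (9999 - 1000), i.e. 1 / 8999
```

Because the notebook cell above imports Spark's `max` and `min`, this sketch is best run in a fresh Python session where those names still refer to the built-ins.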



### Imputing Null or Missing Data

Null values refer to unknown or missing data as well as irrelevant responses. Strategies for dealing with this scenario include:<br><br>

* **Dropping these records:** Works when you do not need to use the information for downstream workloads
* **Adding a placeholder (e.g. `-1`):** Allows you to see missing data later on without violating a schema
* **Basic imputing:** Allows you to have a "best guess" of what the data could have been, often by using the mean of non-missing data
* **Advanced imputing:** Determines the "best guess" of what data should be using more advanced strategies such as clustering machine learning algorithms or oversampling techniques 

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** <a href="http://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=imputer#pyspark.ml.feature.Imputer" target="_blank">Also see the built-in class `Imputer`</a>

Take a look at the following DataFrame, which has missing values.

In [0]:
corruptDF = spark.createDataFrame([
  (11, 66, 5),
  (12, 68, None),
  (1, None, 6),
  (2, 72, 7)], 
  ["hour", "temperature", "wind"]
)

display(corruptDF)

hour,temperature,wind
11,66.0,5.0
12,68.0,
1,,6.0
2,72.0,7.0


Drop any records that have null values.

In [0]:
corruptDroppedDF = corruptDF.dropna("any")

display(corruptDroppedDF)

hour,temperature,wind
11,66,5
2,72,7
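
The `"any"` argument tells `dropna` to drop a row when at least one of its columns is null, while `"all"` drops a row only when every column is null. The plain-Python sketch below mimics that rule on the same four records. The `drop_nulls` helper is hypothetical and only illustrates the logic, not how Spark implements it.

```python
rows = [
    {"hour": 11, "temperature": 66, "wind": 5},
    {"hour": 12, "temperature": 68, "wind": None},
    {"hour": 1, "temperature": None, "wind": 6},
    {"hour": 2, "temperature": 72, "wind": 7},
]

def drop_nulls(records, how="any"):
    """Drop a record if any ('any') or all ('all') of its values are None."""
    if how not in ("any", "all"):
        raise ValueError("how must be 'any' or 'all'")
    check = any if how == "any" else all
    return [r for r in records if not check(v is None for v in r.values())]

kept_any = drop_nulls(rows, "any")  # drops the two rows that contain a None
kept_all = drop_nulls(rows, "all")  # drops nothing: no row is entirely None

print([r["hour"] for r in kept_any])
print(len(kept_all))
```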


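Impute values with the mean of each column. The fill values below are hard-coded: the mean of the non-null `wind` values is 6, and the mean of the non-null `temperature` values (about 68.67) is truncated to the whole number 68.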

In [0]:
corruptImputedDF = corruptDF.na.fill({"temperature": 68, "wind": 6})

display(corruptImputedDF)

hour,temperature,wind
11,66,5
12,68,6
1,68,6
2,72,7
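
Rather than hard-coding the fill values, you would normally compute them from the data. The sketch below does this in plain Python for the same four records, assuming nulls are represented as `None`. The `column_mean` helper is hypothetical, and unlike the cell above it keeps the unrounded mean.

```python
from statistics import mean

rows = [
    {"hour": 11, "temperature": 66, "wind": 5},
    {"hour": 12, "temperature": 68, "wind": None},
    {"hour": 1, "temperature": None, "wind": 6},
    {"hour": 2, "temperature": 72, "wind": 7},
]

def column_mean(records, column):
    """Mean of the non-null values in one column."""
    return mean(r[column] for r in records if r[column] is not None)

# One fill value per column that should be imputed.
fill_values = {c: column_mean(rows, c) for c in ("temperature", "wind")}

# Replace None with the column's fill value; leave other values untouched.
imputed = [
    {k: (fill_values[k] if v is None and k in fill_values else v) for k, v in r.items()}
    for r in rows
]

print(fill_values)
print(imputed)
```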


### Deduplicating Data

Duplicate data comes in many forms. The simple case involves records that are complete duplicates of another record. The more complex cases involve duplicates that are not complete matches, such as matches on one or two columns or "fuzzy" matches that account for formatting differences or other non-exact matches.

Take a look at the following DataFrame that has duplicate values.

In [0]:
duplicateDF = spark.createDataFrame([
  (15342, "Conor", "red"),
  (15342, "conor", "red"),
  (12512, "Dorothy", "blue"),
  (5234, "Doug", "aqua")], 
  ["id", "name", "favorite_color"]
)

display(duplicateDF)

id,name,favorite_color
15342,Conor,red
15342,conor,red
12512,Dorothy,blue
5234,Doug,aqua


Drop duplicates on `id` and `favorite_color`.

In [0]:
duplicateDedupedDF = duplicateDF.dropDuplicates(["id", "favorite_color"])

display(duplicateDedupedDF)

id,name,favorite_color
5234,Doug,aqua
12512,Dorothy,blue
15342,Conor,red
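
Deduplicating on a subset of columns amounts to building a key from those columns and keeping one row per key. The plain-Python sketch below shows the idea on the same records. The `drop_duplicates` helper is hypothetical, and it keeps the first row it sees for each key, which is a choice made by this sketch and not a guarantee you should assume from Spark.

```python
records = [
    (15342, "Conor", "red"),
    (15342, "conor", "red"),
    (12512, "Dorothy", "blue"),
    (5234, "Doug", "aqua"),
]

def drop_duplicates(rows, key):
    """Keep the first row seen for each distinct key(row)."""
    seen = set()
    kept = []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept

# Key on (id, favorite_color) and ignore name, so "Conor" and "conor" collapse.
deduped = drop_duplicates(records, key=lambda r: (r[0], r[2]))

print(deduped)
```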


### Other Helpful Data Manipulation Functions

| Function    | Use                                                                                                                        |
|:------------|:---------------------------------------------------------------------------------------------------------------------------|
| `explode()` | Returns a new row for each element in the given array or map                                                               |
| `pivot()`   | Pivots a column of the current DataFrame and performs the specified aggregation                                            |
| `cube()`    | Creates a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregation on them  |
| `rollup()`  | Creates a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them |
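
The difference between a rollup and a cube lies in which column groupings get aggregated. A rollup aggregates over each prefix of the column list, down to the grand total, while a cube aggregates over every subset. The sketch below only enumerates those groupings in plain Python. The helper names and the `year` and `region` columns are hypothetical.

```python
from itertools import combinations

def rollup_groupings(columns):
    """Grouping sets for a rollup: each prefix of the column list, down to the grand total."""
    return [tuple(columns[:i]) for i in range(len(columns), -1, -1)]

def cube_groupings(columns):
    """Grouping sets for a cube: every subset of the column list."""
    return [
        combo
        for size in range(len(columns), -1, -1)
        for combo in combinations(columns, size)
    ]

group_cols = ["year", "region"]

print(rollup_groupings(group_cols))  # prefixes only
print(cube_groupings(group_cols))    # all subsets
```

The empty tuple stands for the grand total, where no grouping column is applied.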

## Exercise 1: Deduplicating Data

A common ETL workload involves cleaning duplicated records that don't completely match up.  The source of the problem can be anything from user-generated content to schema evolution and data corruption.  Here, you match records and reduce duplicate records.

### Step 1: Import and Examine the Data

The file is located at `/mnt/training/dataframes/people-with-dups.txt`.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** You have to deal with the header and delimiter.

In [0]:
# TODO
path = "/mnt/training/dataframes/people-with-dups.txt"

dupedDF = (spark
  .read
  .option("header", True)     # the first line holds the column names
  .option("delimiter", ":")   # fields are separated by colons
  .csv(path)
)
display(dupedDF)


firstName,middleName,lastName,gender,birthDate,salary,ssn
Emanuel,Wallace,Panton,M,1988-03-04,101255,935-90-7627
Eloisa,Rubye,Cayouette,F,2000-06-20,204031,935-89-9009
Cathi,Svetlana,Prins,F,2012-12-22,35895,959-30-7957
Mitchel,Andres,Mozdzierz,M,1966-05-06,55108,989-27-8093
Angla,Melba,Hartzheim,F,1938-07-26,13199,935-27-4276
Rachel,Marlin,Borremans,F,1923-02-23,67070,996-41-8616
Catarina,Phylicia,Dominic,F,1969-09-29,201021,999-84-8888
Antione,Randy,Hamacher,M,2004-03-05,271486,917-96-3554
Madaline,Shawanda,Piszczek,F,1996-03-17,183944,963-87-9974
Luciano,Norbert,Sarcone,M,1962-12-14,73069,909-96-1669


In [0]:
# TEST - Run this cell to test your solution
cols = set(dupedDF.columns)

dbTest("ET2-P-02-01-01", 103000, dupedDF.count())
dbTest("ET2-P-02-01-02", True, "salary" in cols and "lastName" in cols)

print("Tests passed!")

### Step 2: Add Columns to Filter Duplicates

Add the following columns so that you can filter out duplicate values:

- `lcFirstName`: first name lower case
- `lcLastName`: last name lower case
- `lcMiddleName`: middle name lower case
- `ssnNums`: social security number without hyphens between numbers

Save the results to `dupedWithColsDF`.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Use the Spark function `lower()`
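
The idea behind these columns is to normalize away formatting differences so that variants of the same person compare equal. The plain-Python sketch below shows that normalization for a single record. The `matching_key` helper is hypothetical, and the second record is a made-up formatting variant used only for illustration.

```python
def matching_key(first, middle, last, ssn):
    """Lower-case the names and strip hyphens from the SSN so formatting variants compare equal."""
    return (first.lower(), middle.lower(), last.lower(), ssn.replace("-", ""))

# First record as it appears in the file; second is a hypothetical variant of it.
a = matching_key("Emanuel", "Wallace", "Panton", "935-90-7627")
b = matching_key("EMANUEL", "wallace", "Panton", "935907627")

print(a)
print(a == b)
```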

In [0]:
from pyspark.sql.functions import col, lower, regexp_replace

dupedWithColsDF = (dupedDF
  .withColumn("lcFirstName", lower(col("firstName")))
  .withColumn("lcLastName", lower(col("lastName")))
  .withColumn("lcMiddleName", lower(col("middleName")))
  # Strip the hyphens so that differently formatted SSNs compare equal
  .withColumn("ssnNums", regexp_replace(col("ssn"), "-", ""))
)
display(dupedWithColsDF)

firstName,middleName,lastName,gender,birthDate,salary,ssn,lcFirstName,lcLastName,lcMiddleName,ssnNums
Emanuel,Wallace,Panton,M,1988-03-04,101255,935-90-7627,emanuel,panton,wallace,935907627
Eloisa,Rubye,Cayouette,F,2000-06-20,204031,935-89-9009,eloisa,cayouette,rubye,935899009
Cathi,Svetlana,Prins,F,2012-12-22,35895,959-30-7957,cathi,prins,svetlana,959307957
Mitchel,Andres,Mozdzierz,M,1966-05-06,55108,989-27-8093,mitchel,mozdzierz,andres,989278093
Angla,Melba,Hartzheim,F,1938-07-26,13199,935-27-4276,angla,hartzheim,melba,935274276
Rachel,Marlin,Borremans,F,1923-02-23,67070,996-41-8616,rachel,borremans,marlin,996418616
Catarina,Phylicia,Dominic,F,1969-09-29,201021,999-84-8888,catarina,dominic,phylicia,999848888
Antione,Randy,Hamacher,M,2004-03-05,271486,917-96-3554,antione,hamacher,randy,917963554
Madaline,Shawanda,Piszczek,F,1996-03-17,183944,963-87-9974,madaline,piszczek,shawanda,963879974
Luciano,Norbert,Sarcone,M,1962-12-14,73069,909-96-1669,luciano,sarcone,norbert,909961669


In [0]:
# TEST - Run this cell to test your solution
cols = set(dupedWithColsDF.columns)

dbTest("ET2-P-02-02-01", 103000, dupedWithColsDF.count())
dbTest("ET2-P-02-02-02", True, "lcFirstName" in cols and "lcLastName" in cols)

print("Tests passed!")

### Step 3: Deduplicate the Data

Deduplicate the data by dropping duplicate records, comparing every column except the original names (first, middle, and last) and the original `ssn`.  Save the result to `dedupedDF`.  Drop the columns you added in step 2.

In [0]:
# TODO
dedupedDF = (dupedWithColsDF
  # Compare on the normalized columns plus the columns that needed no cleanup
  .dropDuplicates(["lcFirstName", "lcMiddleName", "lcLastName", "ssnNums", "gender", "birthDate", "salary"])
  .drop("lcFirstName", "lcLastName", "lcMiddleName", "ssnNums")
)

display(dedupedDF)

firstName,middleName,lastName,gender,birthDate,salary,ssn
Mirna,Catheryn,Catching,F,1915-01-28,241711,964-24-3456
Devin,Devon,Boender,F,1915-03-11,17399,910-11-9349
Jolene,Yen,Terkelsen,F,1915-07-11,77987,906-11-4089
Fernande,Victorina,Keiter,F,1915-11-01,121153,995-70-9809
Hortencia,Slyvia,Valladares,F,1916-05-06,229731,998-26-6372
Simone,Elba,Acton,F,1916-07-17,64530,901-12-5058
Cornelia,Berniece,Winland,F,1916-10-10,78783,966-10-7627
Shala,Pauline,Ceconi,F,1917-04-30,194412,960-29-1335
Janiece,Darcey,Lautenbach,F,1917-06-05,205318,918-31-7078
Artie,Marisha,Bessick,F,1918-01-09,280919,913-80-9143


In [0]:
# TEST - Run this cell to test your solution
cols = set(dedupedDF.columns)

dbTest("ET2-P-02-03-01", 100000, dedupedDF.count())
dbTest("ET2-P-02-03-02", 7, len(cols))

print("Tests passed!")

## Review
**Question:** What built-in functions are available in Spark?  
**Answer:** Built-in functions include SQL functions, common programming language primitives, and data warehousing specific functions.  See the Spark API docs (<a href="http://spark.apache.org/docs/latest/api/python/index.html" target="_blank">Python</a> or <a href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package" target="_blank">Scala</a>) for more details.

**Question:** What's the best way to handle null values?  
**Answer:** The answer depends largely on what you hope to do with your data moving forward. You can drop null values or impute them with a number of different techniques.  For instance, clustering your data to fill null values with the values of nearby neighbors often gives more insight to machine learning models than using a simple mean.

**Question:** What are potential challenges of deduplicating data and imputing null values?  
**Answer:** Challenges include knowing which record is the correct one to keep and how to define logic that addresses the root cause of your situation. This decision-making process depends largely on how removing or imputing data will affect downstream operations like database queries and machine learning workloads. Knowing the end application of the data helps determine the best strategy to use.

## Next Steps

Start the next lesson, [User Defined Functions]($./03-User-Defined-Functions ).

## Additional Topics & Resources

**Q:** How can I do ACID transactions with Spark?  
**A:** ACID compliance refers to a set of properties of database transactions that guarantee the validity of your data.  <a href="https://databricks.com/product/databricks-delta" target="_blank">Databricks Delta</a> is an ACID-compliant solution to transactionality with Spark workloads.

**Q:** How can I handle more complex conditional logic in Spark?  
**A:** You can handle more complex if/then conditional logic using the `when()` function and its `.otherwise()` method.

**Q:** How can I handle data warehousing functions like rollups?  
**A:** Spark allows for rollups and cubes, which are common in star schemas, using the `rollup()` and `cube()` functions.

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>