
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning">
</div>


# Cleaning Data

As we inspect and clean our data, we'll need to construct various column expressions and queries to express the transformations we apply to our dataset.

Column expressions are constructed from existing columns, operators, and built-in functions. They can be used in **`SELECT`** statements to express transformations that create new columns.

Many standard SQL query commands (e.g. **`DISTINCT`**, **`WHERE`**, **`GROUP BY`**, etc.) are available in Spark SQL to express transformations.

In this notebook, we'll review a few concepts that might differ from other systems you're used to, and call out a few useful functions for common operations.

We'll pay special attention to behaviors around **`NULL`** values, as well as formatting strings and datetime fields.

## Learning Objectives
By the end of this lesson, you should be able to:
- Summarize datasets and describe null behaviors
- Retrieve and remove duplicates
- Validate datasets for expected counts, missing values, and duplicate records
- Apply common transformations to clean and transform data



## Run Setup

The setup script will create the data and declare necessary values for the rest of this notebook to execute.

In [0]:
%run ./Includes/Classroom-Setup-02.4

## Data Overview

We'll work with new user records from the **`users_bronze`** table, which has the following schema:

| field | type | description |
|---|---|---|
| user_id | string | unique identifier |
| user_first_touch_timestamp | long | time at which the user record was created in microseconds since epoch |
| email | string | most recent email address provided by the user to complete an action |
| updated | timestamp | time at which this record was last updated |

Let's start by creating a `users_silver` table based on the `users_bronze` table. This keeps `users_bronze` in its raw, original form in case we need it, while `users_silver` holds the clean version of our users data. We'll also create a `users_silver_working` copy of the bronze data that we can clean step by step. The `users_silver` table adds a handful of extra columns that we feel are important for analysts to work with:
* `first_touch`
* `first_touch_date`
* `first_touch_time`
* `email_domain`

In [0]:
CREATE TABLE IF NOT EXISTS users_silver 
  (user_id STRING, 
  user_first_touch_timestamp BIGINT, 
  email STRING, 
  updated TIMESTAMP, 
  first_touch TIMESTAMP,
  first_touch_date DATE,
  first_touch_time STRING,
  email_domain STRING);

CREATE OR REPLACE TABLE users_silver_working AS
  SELECT * FROM users_bronze;
  
SELECT * FROM users_silver_working;

## Data Profile 

Databricks offers two convenient methods for data profiling within Notebooks: through the cell output UI and via the dbutils library.

When working with DataFrames or the results of SQL queries in a Databricks Notebook, you can open a dedicated **Data Profile** tab. Clicking this tab generates a data profile with summary statistics and histograms computed over the entire dataset, not just the rows visible in the cell output.

The profile covers numeric, string, and date columns, which makes it a useful tool for data exploration and understanding.

**Using cell output UI:**

1. In the upper-left corner of the cell output of our query above, you will see the word **Table**. Click the "+" symbol immediately to the right of this, and select **Data Profile**.

1. Databricks will automatically execute a new command to generate a data profile.

1. The generated data profile will provide summary statistics for numeric, string, and date columns, along with histograms of value distributions for each column.


## Missing Data

Based on the counts above, it looks like there are at least a handful of null values in all of our fields. Null values can produce surprising results in some functions: for example, **`count(col)`** skips nulls, while **`count(*)`** counts every row.
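To illustrate how **`count()`** treats nulls, here is a minimal sketch on a made-up inline table. The `sample_users` name and its three rows are hypothetical and are not part of the lesson data.

```sql
-- Hypothetical three-row sample; one user_id is NULL.
SELECT
  count(*)                  AS total_rows,    -- counts every row: 3
  count(user_id)            AS non_null_ids,  -- skips NULLs: 2
  count_if(user_id IS NULL) AS null_ids       -- counts only the NULLs: 1
FROM VALUES ('u1'), (NULL), ('u2') AS sample_users(user_id);
```

The same three expressions can be run against `users_silver_working` to compare the total row count with the count of non-null `user_id` values.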

But more importantly, we may have problems with null values in our `user_id` column. By comparing the count of all rows in the table, found at the bottom of the Table results, with the count of the `user_id` column in the Data Profile, we can see that there are three rows with null values for `user_id`. Let's query these rows.

In [0]:
SELECT * FROM users_silver_working WHERE user_id IS NULL;

Since all three rows are obvious errors, let's remove them.

In [0]:
CREATE OR REPLACE TABLE users_silver_working AS
  SELECT * FROM users_silver_working WHERE user_id IS NOT NULL;

 
## Deduplicate Rows
We can use **`DISTINCT *`** to remove true duplicate records where entire rows contain the same values.

After running the cell below, note that there were no true duplicates.

In [0]:
INSERT OVERWRITE users_silver_working
  SELECT DISTINCT * FROM users_silver_working;

## Deduplicate Rows Based on Specific Columns

The code below uses **`GROUP BY`** to remove duplicate records based on **`user_id`** and **`user_first_touch_timestamp`** column values. (Recall that these fields are both generated when a given user is first encountered, thus forming unique tuples.)

Here, we are using the aggregate function **`max`** as a hack to:
- Keep values from the **`email`** and **`updated`** columns in the result of our group by
- Capture non-null emails when multiple records are present (**`max`** ignores null values)

In [0]:
INSERT OVERWRITE users_silver_working
SELECT user_id, user_first_touch_timestamp, max(email) AS email, max(updated) AS updated
FROM users_silver_working
WHERE user_id IS NOT NULL
GROUP BY user_id, user_first_touch_timestamp;

SELECT count(*) FROM users_silver_working;
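To see how **`max`** behaves in the deduplication above, here is a minimal sketch on made-up rows. The inline `sample_users` table and its values are hypothetical and are not part of the lesson data.

```sql
-- Hypothetical sample: user u1 appears twice, once with a NULL email.
SELECT user_id, user_first_touch_timestamp,
       max(email)   AS email,    -- NULLs are ignored, so u1 keeps 'a@example.com'
       max(updated) AS updated   -- latest update time within each group
FROM VALUES
  ('u1', 100, NULL,            TIMESTAMP'2024-01-01 00:00:00'),
  ('u1', 100, 'a@example.com', TIMESTAMP'2024-01-02 00:00:00'),
  ('u2', 200, 'b@example.com', TIMESTAMP'2024-01-03 00:00:00')
  AS sample_users(user_id, user_first_touch_timestamp, email, updated)
GROUP BY user_id, user_first_touch_timestamp;
```

Note that each **`max`** is computed independently per column, so the resulting `email` and `updated` values may come from different source rows. When a group contains several distinct non-null emails, **`max`** returns the one that sorts last, which is not necessarily the most recent.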


## Validate Datasets
Let's programmatically perform validation using simple filters and **`WHERE`** clauses.

Validate that the **`user_id`** for each row is unique.

We expect each `user_id` to appear only once in our `users_silver_working` table. By grouping by `user_id` and counting the number of rows in each group, we can check whether any `user_id` appears more than once by running a comparison in the **`SELECT`** clause. We therefore expect a single Boolean value as our result: true if every `user_id` appears exactly once, and false if any `user_id` appears more than once.

In [0]:
SELECT max(row_count) <= 1 no_duplicate_ids FROM (
  SELECT user_id, count(*) AS row_count
  FROM users_silver_working
  GROUP BY user_id)

Confirm that each email is associated with at most one **`user_id`**.

We perform the same action as above, but this time, we are checking the `email` field. Again, we get a Boolean in return.

In [0]:
SELECT max(user_id_count) <= 1 at_most_one_id FROM (
  SELECT email, count(user_id) AS user_id_count
  FROM users_silver_working
  WHERE email IS NOT NULL
  GROUP BY email)
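If either check above returns false, a natural follow-up is to list the offending keys. The sketch below shows one way to do that with **`HAVING`**, using a hypothetical inline `sample_users` table in which `u2` is deliberately duplicated.

```sql
-- Hypothetical sample: u2 appears twice, so it is the only row returned.
SELECT user_id, count(*) AS row_count
FROM VALUES ('u1'), ('u2'), ('u2') AS sample_users(user_id)
GROUP BY user_id
HAVING count(*) > 1;
```

Replacing the inline table with `users_silver_working` (and `user_id` with `email`, where appropriate) would apply the same pattern to the lesson data.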

## Date Format and Regex
Now that we've removed rows with null `user_id` values and eliminated duplicates, we may wish to extract further value out of the data.

Currently, the **`user_first_touch_timestamp`** is formatted as a Unix timestamp (the number of microseconds since January 1, 1970). We want to convert this to a Spark timestamp, which is displayed in a form like `YYYY-MM-DDThh:mm:ss.sss`.

The code below:
- Correctly scales and casts the **`user_first_touch_timestamp`** to a timestamp
- Extracts the calendar date and clock time for this timestamp in human readable format
- Uses **`regexp_extract`** to extract the domains from the email column using regex

We have a number of different [date formats](https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html) to choose from.

Note also in line 6 that we are using a regular expression (regex). In this regex string, we are using a "positive look behind" to return all characters after the "@" symbol. You can [learn more about Java regular expressions](https://www.w3schools.com/java/java_regex.asp).

In [0]:
INSERT INTO users_silver
(
SELECT *, 
  to_date(first_touch) AS first_touch_date,
  date_format(first_touch, "HH:mm:ss") AS first_touch_time,
  regexp_extract(email, "(?<=@).+", 0) AS email_domain
FROM (
  SELECT *,
    CAST(user_first_touch_timestamp / 1e6 AS timestamp) AS first_touch 
  FROM users_silver_working
));

Run the cell below to see the cleaned data in the `users_silver` table.

In [0]:
SELECT * FROM users_silver;
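To try the individual expressions in isolation, the sketch below applies them to literal values. The timestamp value and email address are made up for illustration. It shows the look-behind pattern described above next to an equivalent capture-group form, and **`timestamp_micros`** as an alternative conversion that avoids the floating-point division.

```sql
-- Hypothetical literal inputs; displayed timestamps depend on the session time zone.
SELECT
  CAST(1593878946592107 / 1e6 AS timestamp)         AS first_touch,        -- scale microseconds to seconds, then cast
  timestamp_micros(1593878946592107)                AS first_touch_alt,    -- built-in conversion from microseconds
  regexp_extract('user@example.com', '(?<=@).+', 0) AS domain_lookbehind,  -- whole match after "@": 'example.com'
  regexp_extract('user@example.com', '@(.*)', 1)    AS domain_group;       -- capture group 1: 'example.com'
```

With the capture-group form, the group index must be `1`. Index `0` returns the whole match, which would include the "@" character.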


 
Uncomment and run the following cell to delete the tables and files associated with this lesson.

In [0]:
-- %python
-- DA.cleanup()


&copy; 2024 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the 
<a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/><a href="https://databricks.com/privacy-policy">Privacy Policy</a> | 
<a href="https://databricks.com/terms-of-use">Terms of Use</a> | 
<a href="https://help.databricks.com/">Support</a>