# Exercise: User-Defined Functions
##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Instructions

In this exercise, we're doing ETL. Instead of using built-in functions, use UDFs to complete the exercise. 

As a reminder, the source file contains data about people, including:


* first, middle and last names
* gender
* birth date
* Social Security number
* salary

But, as is unfortunately common in data we get from this customer, the file contains some duplicate records. Worse:

* In some of the records, the names are mixed case (e.g., "Carol"), while in others, they are uppercase (e.g., "CAROL"). 
* The Social Security numbers aren't consistent, either. Some of them are hyphenated (e.g., "992-83-4829"), while others are missing hyphens ("992834829").

The name fields are guaranteed to match if you disregard character case, and the birth dates will also match. (The salaries will match as well,
and the Social Security numbers *would* match if they were put in the same format.)

Your job is to remove the duplicate records. The specific requirements of your job are:

* Remove duplicates. It doesn't matter which record you keep; it only matters that you keep one of them.
* Preserve the data format of the columns. For example, if you write the first name column in all lower-case, you haven't met this requirement.
* Write the result as a Parquet file, as designated by *destFile*.
* The final Parquet "file" must contain 8 part files (8 files ending in ".parquet").

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** The initial dataset contains 103,000 records.<br/>
The de-duplicated result has 100,000 records.

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Getting Started

Run the following cell to configure our "classroom."


We need UDFs to standardize the case of the name fields and to format SSNs consistently:

* **Case-insensitive name matching:** Write UDFs that convert names to lowercase for comparison purposes while keeping the original values.
* **SSN formatting:** Write a UDF that ensures all SSNs are hyphenated.

### **Block 1: Import the Required Modules**
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import re
```
- **`from pyspark.sql import SparkSession`**: We import `SparkSession` to create an entry point for working with Spark.
- **`from pyspark.sql.functions import udf`**: We import the `udf` function from PySpark to create User Defined Functions.
- **`from pyspark.sql.types import StringType`**: We import `StringType` to specify the return type of our UDF (which is string data).
- **`import re`**: We import the `re` module, which helps us use regular expressions to reformat Social Security numbers.

### **Block 2: Create a Spark Session and Load the Data**
```python
# Start Spark session
spark = SparkSession.builder.appName("ETL-Exercise").getOrCreate()

# Load the data
data = spark.read.csv("path_to_file.csv", header=True, inferSchema=True)
```
- **`spark = SparkSession.builder.appName("ETL-Exercise").getOrCreate()`**: We create a Spark session. This session allows us to use Spark functions in our program. The `appName` gives our session a name for reference, and `getOrCreate()` either creates a new session or returns an existing one.
  
- **`data = spark.read.csv("path_to_file.csv", header=True, inferSchema=True)`**: This line loads a CSV file into a DataFrame. 
    - `path_to_file.csv` is the file’s location.
    - `header=True` means that the first row of the CSV contains the column names.
    - `inferSchema=True` tells Spark to automatically guess the correct data types for each column.

### **Block 3: Define UDFs to Normalize Data**
#### UDF 1: Convert names to lowercase
```python
# Define UDFs for lowercasing names
def lowercase_name(name):
    return name.lower() if name else None

lowercaseUDF = udf(lowercase_name, StringType())
```
- **`def lowercase_name(name):`**: This defines a function named `lowercase_name` that takes a name as input.
- **`return name.lower() if name else None`**: Inside the function, we check whether `name` is truthy. If it is, we convert it to lowercase using `lower()` and return it. If the name is missing (`None`) or an empty string, we return `None`.
  
- **`lowercaseUDF = udf(lowercase_name, StringType())`**: This wraps the `lowercase_name` function as a User Defined Function (UDF) so that we can apply it to DataFrame columns. We specify that this UDF returns a string (`StringType()`).
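
Because `lowercase_name` is an ordinary Python function, its edge cases can be checked without a cluster. The following is a minimal sketch that repeats the function so the snippet runs on its own; the sample values are illustrative only.

```python
# Same logic as lowercase_name above, repeated so this snippet is self-contained.
def lowercase_name(name):
    return name.lower() if name else None

# Mixed-case and uppercase spellings produce the same comparison value.
print(lowercase_name("Carol"))  # carol
print(lowercase_name("CAROL"))  # carol

# None and the empty string are both falsy, so both come back as None.
print(lowercase_name(None))     # None
print(lowercase_name(""))       # None
```

Testing the plain function first is a cheap way to catch logic errors before the function is applied to every row of a DataFrame.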

#### UDF 2: Format SSNs with hyphens
```python
# Define UDF to format SSNs consistently
def format_ssn(ssn):
    if ssn:
        return re.sub(r"(\d{3})(\d{2})(\d{4})", r"\1-\2-\3", ssn.replace("-", ""))
    return None

formatSSNUDF = udf(format_ssn, StringType())
```
- **`def format_ssn(ssn):`**: This defines a function named `format_ssn` to format Social Security Numbers.
- **`if ssn:`**: This checks that the SSN is present (neither `None` nor an empty string).
  
- **`ssn.replace("-", "")`**: This removes any existing hyphens from the SSN, making it easier to reformat.
  
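- **`re.sub(r"(\d{3})(\d{2})(\d{4})", r"\1-\2-\3", ssn.replace("-", ""))`**: Applied to the hyphen-free string, this regular-expression substitution formats the SSN as `XXX-XX-XXXX`. It captures the first 3 digits, the next 2 digits, and the last 4 digits, and inserts hyphens between the groups. A value that does not contain nine consecutive digits is not matched, so it is returned without added hyphens.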
  
- **`return None`**: If no SSN is provided, the function returns `None`.
  
- **`formatSSNUDF = udf(format_ssn, StringType())`**: This wraps the `format_ssn` function as a UDF so we can apply it to DataFrame columns. It returns a formatted string.
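
The behavior of `format_ssn` can be checked the same way, in plain Python. This sketch repeats the function and feeds it the two SSN formats from the exercise description, plus one made-up malformed value (`"12-34"`) to show what happens when the pattern does not match.

```python
import re

# Same logic as format_ssn above, repeated so this snippet is self-contained.
def format_ssn(ssn):
    if ssn:
        return re.sub(r"(\d{3})(\d{2})(\d{4})", r"\1-\2-\3", ssn.replace("-", ""))
    return None

# Hyphenated and unhyphenated inputs end up in the same format.
print(format_ssn("992-83-4829"))  # 992-83-4829
print(format_ssn("992834829"))    # 992-83-4829

# Illustrative malformed value: the pattern needs nine consecutive digits,
# so nothing is substituted and only the hyphen removal takes effect.
print(format_ssn("12-34"))        # 1234

# Missing values stay missing.
print(format_ssn(None))           # None
```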

### **Block 4: Apply UDFs to the DataFrame**
```python
# Apply UDFs to create new columns
data = data.withColumn("first_name_lower", lowercaseUDF(data['first_name'])) \
           .withColumn("middle_name_lower", lowercaseUDF(data['middle_name'])) \
           .withColumn("last_name_lower", lowercaseUDF(data['last_name'])) \
           .withColumn("formatted_ssn", formatSSNUDF(data['ssn']))
```
- **`withColumn("new_column", someFunction(df['column']))`**: This method creates a new column in the DataFrame. The first argument is the new column’s name, and the second argument is the operation applied to the existing column.
  
- **`lowercaseUDF(data['first_name'])`**: This applies the `lowercaseUDF` to the `first_name` column, creating a new column called `first_name_lower`. The same process is repeated for `middle_name` and `last_name`.
  
- **`formatSSNUDF(data['ssn'])`**: This applies the `formatSSNUDF` to the `ssn` column, creating a new column called `formatted_ssn`.

### **Block 5: Remove Duplicates**
```python
# Deduplicate the data
deduplicated_data = data.dropDuplicates(["first_name_lower", "middle_name_lower", "last_name_lower", "birth_date"])
```
- **`dropDuplicates(["columns"])`**: This method removes rows that are duplicated based on the specified columns. Here, duplicates are removed based on the lowercase versions of the first, middle, and last names, and the `birth_date`. Names are therefore compared case-insensitively, and exactly one record is kept from each group of matching rows.
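
The idea behind this step can be sketched in plain Python: build a case-insensitive key per record, keep one record per key, and leave the original values untouched. The records below are hypothetical, and the field names follow the column names assumed in Block 4. Note one difference: this sketch always keeps the first record it sees, whereas `dropDuplicates` makes no promise about which of the matching rows survives.

```python
# Hypothetical sample records; the first two describe the same person.
records = [
    {"first_name": "Carol", "middle_name": "Ann", "last_name": "Smith",
     "birth_date": "1970-01-01", "ssn": "992-83-4829"},
    {"first_name": "CAROL", "middle_name": "ANN", "last_name": "SMITH",
     "birth_date": "1970-01-01", "ssn": "992834829"},
    {"first_name": "Dave", "middle_name": "Lee", "last_name": "Jones",
     "birth_date": "1982-05-17", "ssn": "123-45-6789"},
]

def dedup_key(rec):
    # Lowercased names plus birth date: the same columns passed to dropDuplicates.
    return (rec["first_name"].lower(), rec["middle_name"].lower(),
            rec["last_name"].lower(), rec["birth_date"])

unique = {}
for rec in records:
    # setdefault stores a record only if its key has not been seen yet.
    unique.setdefault(dedup_key(rec), rec)

deduplicated = list(unique.values())
print(len(deduplicated))              # 2
print(deduplicated[0]["first_name"])  # Carol (original case preserved)
```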

### **Block 6: Clean Up Temporary Columns and Fix SSN Column**
```python
# Drop temporary columns and keep formatted SSNs
deduplicated_data = deduplicated_data.drop("first_name_lower", "middle_name_lower", "last_name_lower") \
                                     .drop("ssn").withColumnRenamed("formatted_ssn", "ssn")
```
- **`drop("column_name")`**: This method drops columns from the DataFrame. We drop the temporary lowercase columns (`first_name_lower`, `middle_name_lower`, `last_name_lower`) because we don’t need them anymore.
  
- **`withColumnRenamed("old_column", "new_column")`**: This renames a column. Here, we drop the original, inconsistent `ssn` column and rename `formatted_ssn` to `ssn` in its place. Because `formatted_ssn` was the last column added, the renamed `ssn` column ends up last in the schema.

### **Block 7: Write the Final Data to a Parquet File**
```python
# Write to Parquet with 8 part files
deduplicated_data.repartition(8).write.parquet("path_to_destination", mode="overwrite")
```
- **`repartition(8)`**: This repartitions the data into 8 parts, which ensures that when we write the data, we will end up with 8 Parquet files.
  
- **`write.parquet("path_to_destination", mode="overwrite")`**: This writes the DataFrame to the specified path in Parquet format. The output is a directory containing one part file per partition. The `mode="overwrite"` option means that any data already present at that path is replaced.
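
To confirm the part-file requirement, count the names in the output directory that end in `.parquet`, since Spark also writes marker files such as `_SUCCESS` next to the data. The sketch below shows only the counting logic on a hypothetical, simplified listing; real part-file names are longer, and on Databricks the listing would come from `dbutils.fs.ls`.

```python
# Hypothetical directory listing: one marker file plus eight part files.
# The file names are simplified placeholders, not real Spark output.
names = ["_SUCCESS"] + [f"part-{i:05d}.snappy.parquet" for i in range(8)]

# Keep only the data files; marker files do not end in ".parquet".
part_files = [name for name in names if name.endswith(".parquet")]
print(len(part_files))  # 8
```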

### Summary:
1. We loaded the dataset and cleaned it up using UDFs to format names and SSNs.
2. We removed duplicates based on case-insensitive name matching.
3. We preserved the original formatting, fixed inconsistent SSNs, and wrote the cleaned data to a Parquet file with exactly 8 part files.

This approach ensures we meet the requirements of deduplication and consistent data formatting!

In [0]:
display(dbutils.fs.ls('dbfs:/databricks-datasets/data.gov'))

path,name,size,modificationTime
dbfs:/databricks-datasets/data.gov/README.md,README.md,400,1459052893000
dbfs:/databricks-datasets/data.gov/farmers_markets_geographic_data/,farmers_markets_geographic_data/,0,1729513631556
dbfs:/databricks-datasets/data.gov/irs_zip_code_data/,irs_zip_code_data/,0,1729513631556


In [0]:
%run "./Includes/Classroom-Setup"


##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Hints

* Use the <a href="http://spark.apache.org/docs/latest/api/python/index.html" target="_blank">API docs</a>. Specifically, you might find 
  <a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame" target="_blank">DataFrame</a> and
  <a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions" target="_blank">functions</a> to be helpful.
* It's helpful to look at the file first, so you can check the format. `dbutils.fs.head()` (or just `%fs head`) is a big help here.

In [0]:
# TODO

sourceFile = "dbfs:/databricks-datasets/samples/people/people.json"
destFile = userhome + "/people.parquet"

# In case it already exists
dbutils.fs.rm(destFile, True)

Out[13]: False

In [0]:
sourceFile = "dbfs:/databricks-datasets/samples/people/people.json"
peopleDF = spark.read.json(sourceFile)
display(peopleDF)

age,name
40,Jane
30,Andy
50,Justin


In [0]:
sourceDF = spark.read.option("header", "true").option("inferSchema", "true").csv(sourceFile)
display(sourceDF)

"{""name"":""Jane""","""age"":40}"
"{""name"":""Andy""","""age"":30}"
"{""name"":""Justin""","""age"":50}"


##![Spark Logo Tiny](https://s3-us-west-2.amazonaws.com/curriculum-release/images/105/logo_spark_tiny.png) Validate Your Answer

At the bare minimum, we can verify that you wrote the parquet file out to **destFile** and that you have the right number of records.

Run the following cell to confirm your result:

In [0]:
# partFiles = len(list(filter(lambda f: f.path.endswith(".parquet"), dbutils.fs.ls(destFile))))

# finalDF = spark.read.parquet(destFile)
# finalCount = finalDF.count()

# clearYourResults()
# validateYourAnswer("01 Parquet File Exists", 1276280174, partFiles)
# validateYourAnswer("02 Expected 100000 Records", 972882115, finalCount)
# summarizeYourResults()