# De-Duping Data

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Instructions

In this exercise, we're doing ETL on a file we've received from some customer. That file contains data about people, including:

* first, middle and last names
* gender
* birth date
* Social Security number
* salary

But, as is unfortunately common in data we get from this customer, the file contains some duplicate records. Worse:

* In some of the records, the names are mixed case (e.g., "Carol"), while in others, they are uppercase (e.g., "CAROL"). 
* The Social Security numbers aren't consistent, either. Some of them are hyphenated (e.g., "992-83-4829"), while others are missing hyphens ("992834829").

The name fields are guaranteed to match if you disregard character case, and the birth dates will also match. The salaries will match as well,
and the Social Security numbers *would* match if they were put in the same format.
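
The matching rule described above can be sketched in plain Python, independent of Spark. The helper names `normalize_name` and `normalize_ssn` are illustrative assumptions, not part of the exercise:

```python
def normalize_name(name: str) -> str:
    """Lower-case a name so "Carol" and "CAROL" compare equal."""
    return name.lower()


def normalize_ssn(ssn: str) -> str:
    """Strip hyphens so "992-83-4829" and "992834829" compare equal."""
    return ssn.replace("-", "")


# The two sample values from the text collapse to the same comparison form.
print(normalize_name("Carol") == normalize_name("CAROL"))
print(normalize_ssn("992-83-4829") == normalize_ssn("992834829"))
```

The normalized values are only meant for comparison; the stored values stay in whatever format they arrived in.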

Your job is to remove the duplicate records. The specific requirements of your job are:

* Remove duplicates. It doesn't matter which record you keep; it only matters that you keep one of them.
* Preserve the data format of the columns. For example, if you write the first name column in all lower-case, you haven't met this requirement.
* Write the result as a Parquet file, as designated by *destFile*.
* The final Parquet "file" must contain multiple part files (ending in ".parquet").

**Hint:** <br/>
The initial dataset contains 103,000 records.<br/>
The de-duplicated result has 100,000 records.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Getting Started

In [1]:
from pyspark.sql import SparkSession

In [2]:
# Initialize Spark Session
spark = (SparkSession.builder
         .appName("De-Duping Data")
         .getOrCreate())

In [3]:
spark

**Data Source**

In [4]:
sourceFile = "../dataset/people-with-dups.txt"
destFile = "../dataset/out/people.parquet"

**Set `spark.sql.shuffle.partitions`**

In [5]:
# dropDuplicates() will likely introduce a shuffle, so it helps to reduce the number of
# post-shuffle partitions from the default of 200 to a value that suits this small dataset.
spark.conf.set("spark.sql.shuffle.partitions", 7)
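
To see why de-duplication implies a shuffle, consider this toy model in plain Python: rows are routed to a partition by hashing their key, so rows with equal keys always meet in the same partition and can be compared there. This is only a sketch under stated assumptions; `partition_for` is a hypothetical helper, and `zlib.crc32` stands in for Spark's own hash function.

```python
import zlib

NUM_PARTITIONS = 7  # mirrors the spark.sql.shuffle.partitions value set above


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition index (crc32 is a stand-in hash for this sketch)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Equal keys always land in the same partition, wherever the rows started out.
same_partition = partition_for("carol|992834829") == partition_for("carol|992834829")
print(same_partition)
```

With fewer post-shuffle partitions, a small dataset is spread over fewer, larger tasks.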

**Read the data**

In [6]:
# Read the colon-delimited file, taking column names from the header row and inferring column types.

df = (spark
    .read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("sep", ":")
    .csv(sourceFile)
)

In [7]:
df.sample(False, 0.1).show(20)

+---------+----------+------------+------+----------+------+-----------+
|firstName|middleName|    lastName|gender| birthDate|salary|        ssn|
+---------+----------+------------+------+----------+------+-----------+
|    Angla|     Melba|   Hartzheim|     F|1938-07-26| 13199|935-27-4276|
|   Rachel|    Marlin|   Borremans|     F|1923-02-23| 67070|996-41-8616|
| Madaline|  Shawanda|    Piszczek|     F|1996-03-17|183944|963-87-9974|
|      Siu|   Cherrie|     Lechelt|     F|2012-07-24|148331|906-85-3202|
|   Cheree|  Dorethea|    Anspaugh|     F|1985-01-17|278860|961-36-6578|
|      See|    Sharen|     Howryla|     F|1979-12-30|169570|925-12-1644|
|   Kattie|    Sammie|       Ercek|     F|2002-07-26|211993|996-32-1564|
|  Bernard|    Reggie|      Coache|     M|1960-06-23| 53020|941-56-6401|
|   Cordie|      Cara|     Sheilds|     F|2007-02-08|219449|950-98-5411|
|   Audrey|   Lorrine|    Sprewell|     F|1932-10-25|283164|997-53-7925|
|    ARLEN|    HAYDEN|     CARVILL|     M|1986-05-2

**Drop the duplicate records**

In [9]:
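# Import only the functions used below; a wildcard import would shadow Python built-ins such as sum and max.
from pyspark.sql.functions import col, lower, translate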

In [None]:
# Preview the helper columns: lower-cased names and SSNs with the hyphens stripped.
(df
  .select(col("*"),
      lower(col("firstName")).alias("lcFirstName"),
      lower(col("lastName")).alias("lcLastName"),
      lower(col("middleName")).alias("lcMiddleName"),
      translate(col("ssn"), "-", "").alias("ssnNums")
   ).show())

+---------+----------+---------+------+----------+------+-----------+-----------+----------+------------+---------+
|firstName|middleName| lastName|gender| birthDate|salary|        ssn|lcFirstName|lcLastName|lcMiddleName|  ssnNums|
+---------+----------+---------+------+----------+------+-----------+-----------+----------+------------+---------+
|  Emanuel|   Wallace|   Panton|     M|1988-03-04|101255|935-90-7627|    emanuel|    panton|     wallace|935907627|
|   Eloisa|     Rubye|Cayouette|     F|2000-06-20|204031|935-89-9009|     eloisa| cayouette|       rubye|935899009|
|    Cathi|  Svetlana|    Prins|     F|2012-12-22| 35895|959-30-7957|      cathi|     prins|    svetlana|959307957|
|  Mitchel|    Andres|Mozdzierz|     M|1966-05-06| 55108|989-27-8093|    mitchel| mozdzierz|      andres|989278093|
|    Angla|     Melba|Hartzheim|     F|1938-07-26| 13199|935-27-4276|      angla| hartzheim|       melba|935274276|
|   Rachel|    Marlin|Borremans|     F|1923-02-23| 67070|996-41-8616|   

In [11]:
# Add normalized helper columns, de-duplicate on them, then drop the helpers
# so the surviving rows keep their original name case and SSN format.
dedupedDF = (df
  .select(col("*"),
      lower(col("firstName")).alias("lcFirstName"),
      lower(col("lastName")).alias("lcLastName"),
      lower(col("middleName")).alias("lcMiddleName"),
      translate(col("ssn"), "-", "").alias("ssnNums")  # strip hyphens from the SSN
   )
  .dropDuplicates(["lcFirstName", "lcMiddleName", "lcLastName", "ssnNums", "gender", "birthDate", "salary"])
  .drop("lcFirstName", "lcMiddleName", "lcLastName", "ssnNums")
)
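
The same idea, stripped of Spark, is sketched below: records are compared on a normalized key, but the original record is kept untouched, so the column formats are preserved. The function `dedupe` and the sample records are made up for illustration and are not part of the dataset.

```python
def dedupe(records):
    """Keep the first record seen for each normalized key; originals are returned unchanged."""
    seen = set()
    kept = []
    for rec in records:
        # The key is used for comparison only and is never written back to the record.
        key = (
            rec["firstName"].lower(),
            rec["middleName"].lower(),
            rec["lastName"].lower(),
            rec["ssn"].replace("-", ""),
            rec["gender"],
            rec["birthDate"],
            rec["salary"],
        )
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


# Hypothetical sample data: the second record duplicates the first in a different format.
people = [
    {"firstName": "Carol", "middleName": "Ann", "lastName": "Smith",
     "ssn": "992-83-4829", "gender": "F", "birthDate": "1970-01-01", "salary": 50000},
    {"firstName": "CAROL", "middleName": "ANN", "lastName": "SMITH",
     "ssn": "992834829", "gender": "F", "birthDate": "1970-01-01", "salary": 50000},
]
print(dedupe(people))
```

This sketch keeps whichever record it sees first. The exercise does not care which duplicate survives, so the choice is arbitrary here.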

In [12]:
dedupedDF.show(15)

+---------+----------+-----------+------+----------+------+-----------+
|firstName|middleName|   lastName|gender| birthDate|salary|        ssn|
+---------+----------+-----------+------+----------+------+-----------+
|    Aaron|    Walker| Okoniewski|     M|1930-07-29| 97932|951-32-1950|
|    Aaron|   Brendon|   Jernberg|     M|1924-09-26|277299|951-57-5457|
|    Aaron| Alejandro|      Parbs|     M|1958-08-13| 10828|959-70-4852|
|    Aaron|    Rashad|  Immediato|     M|1922-02-13| 38566|959-93-7472|
|    Aaron|    Barton|     Crasco|     M|1986-11-21|298912|986-88-3115|
|    Aaron|     Micah| Fotopoulos|     M|2010-02-02| 10842|995-82-1665|
|    Abbie|    Evelin|     Nichol|     F|1985-01-10| 95861|919-95-6712|
|    Abbie|     Marty|     Gungor|     F|1970-01-04| 45702|989-77-8677|
|     Abby|       Mei|Hershnowitz|     F|1949-04-12|274011|666-90-8782|
|     Abby|   Loraine|     Ligler|     F|1950-09-26|296309|926-91-5492|
|     Abby|   Natisha|     Bermel|     F|2002-08-01|262392|928-9

In [14]:
df.count()

103000

In [13]:
dedupedDF.count()

100000

**Save the data as Parquet**

In [15]:
# Save the results as Parquet. Listing the output directory afterwards confirms that multiple part files were written.
(dedupedDF.write
   .mode("overwrite")
   .parquet(destFile)
)

In [16]:
import os

# List files in the directory
file_list = os.listdir(destFile)

# Display the list of files
print("\n".join(file_list))

.part-00000-1462c10c-1220-4f93-9880-c89d4d5b5464-c000.snappy.parquet.crc
.part-00001-1462c10c-1220-4f93-9880-c89d4d5b5464-c000.snappy.parquet.crc
.part-00002-1462c10c-1220-4f93-9880-c89d4d5b5464-c000.snappy.parquet.crc
.part-00003-1462c10c-1220-4f93-9880-c89d4d5b5464-c000.snappy.parquet.crc
.part-00004-1462c10c-1220-4f93-9880-c89d4d5b5464-c000.snappy.parquet.crc
.part-00005-1462c10c-1220-4f93-9880-c89d4d5b5464-c000.snappy.parquet.crc
.part-00006-1462c10c-1220-4f93-9880-c89d4d5b5464-c000.snappy.parquet.crc
._SUCCESS.crc
part-00000-1462c10c-1220-4f93-9880-c89d4d5b5464-c000.snappy.parquet
part-00001-1462c10c-1220-4f93-9880-c89d4d5b5464-c000.snappy.parquet
part-00002-1462c10c-1220-4f93-9880-c89d4d5b5464-c000.snappy.parquet
part-00003-1462c10c-1220-4f93-9880-c89d4d5b5464-c000.snappy.parquet
part-00004-1462c10c-1220-4f93-9880-c89d4d5b5464-c000.snappy.parquet
part-00005-1462c10c-1220-4f93-9880-c89d4d5b5464-c000.snappy.parquet
part-00006-1462c10c-1220-4f93-9880-c89d4d5b5464-c000.snappy.parquet
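
The listing above mixes part files with checksum (`.crc`) files. A small helper, sketched here under the assumption that part files start with `part-` and end with `.parquet` as in that listing, makes the "multiple part files" requirement easy to check. The name `parquet_part_files` is hypothetical.

```python
import os


def parquet_part_files(directory):
    """Return the sorted names of the Parquet part files in an output directory."""
    return sorted(
        name for name in os.listdir(directory)
        if name.startswith("part-") and name.endswith(".parquet")
    )
```

In this notebook it could be used as `len(parquet_part_files(destFile)) > 1`, which should hold for the seven part files listed above.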

In [17]:
spark.stop()