# Introduction

The terms **"normalized" and "denormalized"** have been part of database terminology for a long time. Although they might seem confusing at first, these concepts are relatively simple to understand, implement, and leverage.

**Please note** that this is not an in-depth lesson on Data Modeling or Data Warehousing. It will provide a simplified explanation of "normalized" and "denormalized" for the sake of clarity.

+ **Normalized Data:** In normalized data, information about a subject is split across multiple tables or datasets, each holding one specific set of details. To get a comprehensive view, you join these tables on an identifying data point such as an ID, name, or serial number (in practice, even a physical address, business name, or email address can serve as the key).

In summary, working with normalized data often requires querying multiple tables and using reference datasets to fill in missing information. The structure is more complex, but it avoids storing the same value many times, which reduces duplication and makes the data easier to keep consistent.

**TL;DR:** Normalized data involves multiple tables, and you need to join them to get a complete view of the data.

+ **Denormalized Data:** In contrast, denormalized data is stored in a single table, providing a convenient and comprehensive view of the information. All relevant data is present in this one table, making queries more straightforward and potentially faster.

So, when should you use normalized vs. denormalized datasets? The decision depends on various factors, including storage space, performance, the number and cardinality of columns, and usability. It's a topic that deserves a more in-depth discussion.

In this post, we'll focus on using Spark to denormalize normalized data, a common task in Data Engineering.

In previous sections, we used a fictional animal dataset to illustrate Spark examples. Now, let's see what denormalized data can look like, still using our animal dataset as an example.

## Denormalized dataset

In [6]:
import pandas as pd
# Define the list of dictionaries representing the denormalized dataset
denormalized_dataset = [
    {"name": "fido", "animal": "dog", "age": 4, "color": "brown"},
    {"name": "annabelle", "animal": "cat", "age": 15, "color": "white"},
    {"name": "fred", "animal": "bear", "age": 29, "color": "brown"},
    {"name": "julie", "animal": "parrot", "age": 1, "color": "brown"},
    {"name": "gus", "animal": "fish", "age": 1, "color": "gold"},
    {"name": "daisy", "animal": "iguana", "age": 2, "color": "green"}
]

# Create a DataFrame from the denormalized dataset
df = pd.DataFrame(denormalized_dataset)
# Print the DataFrame
print(df)

        name  animal  age  color
0       fido     dog    4  brown
1  annabelle     cat   15  white
2       fred    bear   29  brown
3      julie  parrot    1  brown
4        gus    fish    1   gold
5      daisy  iguana    2  green


An example of our animal data in a normalized format might look like the following, where the fields for animal and color are represented by numeric codes instead of their actual values:

In [7]:
# Define the data as a dictionary
data = {
    "name": ["fido", "annabelle", "fred", "julie", "gus", "daisy"],
    "animal": [1, 2, 3, 4, 5, 6],
    "age": [4, 15, 29, 1, 1, 2],
    "color": [1, 2, 1, 1, 4, 5]
}

# Create a DataFrame from the data
df = pd.DataFrame(data)

# Print the DataFrame
print(df)

        name  animal  age  color
0       fido       1    4      1
1  annabelle       2   15      2
2       fred       3   29      1
3      julie       4    1      1
4        gus       5    1      4
5      daisy       6    2      5
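
The numeric codes above only become readable once they are matched against lookup tables. As a minimal pandas sketch, assuming the code-to-label mappings that the Spark steps later in this post define (the names `animal_types` and `animal_colors` are introduced here purely for illustration), two merges recover the readable view:

```python
import pandas as pd

# Normalized pets table: animal and color are numeric codes.
pets = pd.DataFrame({
    "name": ["fido", "annabelle", "fred", "julie", "gus", "daisy"],
    "animal": [1, 2, 3, 4, 5, 6],
    "age": [4, 15, 29, 1, 1, 2],
    "color": [1, 2, 1, 1, 4, 5],
})

# Hypothetical lookup tables (names chosen for this sketch).
animal_types = pd.DataFrame({
    "type_id": [1, 2, 3, 4, 5, 6],
    "type_name": ["dog", "cat", "bear", "parrot", "fish", "iguana"],
})
animal_colors = pd.DataFrame({
    "color_id": [1, 2, 3, 4, 5, 6],
    "color_name": ["brown", "white", "black", "gold", "green", "red"],
})

# Left joins keep every pet row, even if a code had no match.
resolved = (
    pets
    .merge(animal_types, left_on="animal", right_on="type_id", how="left")
    .merge(animal_colors, left_on="color", right_on="color_id", how="left")
)

# Keep only the readable columns.
readable = resolved[["name", "type_name", "age", "color_name"]]
print(readable)
```

The result should match the denormalized table shown earlier, with the columns renamed to `type_name` and `color_name`.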


There are situations where you may need to normalize denormalized data or denormalize normalized data, and Spark is a powerful tool for handling these transformations. The capability is not unique to Spark, but it is a common and essential data manipulation task.
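
Going the other way, from denormalized to normalized, amounts to replacing repeated literal values with codes and keeping the code-to-value mapping in a separate lookup. Here is a rough sketch in plain Python on a few of the animal records. The helper `build_lookup` is hypothetical, and the codes it assigns (in order of first appearance) need not match the codes used elsewhere in this post:

```python
# A few denormalized records (same shape as the animal dataset above).
records = [
    {"name": "fido", "animal": "dog", "age": 4, "color": "brown"},
    {"name": "annabelle", "animal": "cat", "age": 15, "color": "white"},
    {"name": "fred", "animal": "bear", "age": 29, "color": "brown"},
    {"name": "gus", "animal": "fish", "age": 1, "color": "gold"},
]


def build_lookup(rows, field):
    """Assign an integer code to each distinct value, in order of first appearance."""
    lookup = {}
    for row in rows:
        lookup.setdefault(row[field], len(lookup) + 1)
    return lookup


color_lookup = build_lookup(records, "color")
animal_lookup = build_lookup(records, "animal")

# Replace the literal values with their codes.
normalized = [
    {**row, "animal": animal_lookup[row["animal"]], "color": color_lookup[row["color"]]}
    for row in records
]

print(color_lookup)
print(normalized[2])
```

Each distinct value is stored once in its lookup, and the main records carry only the integer code.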

Here's a high-level overview of the trade-offs between normalized and denormalized data:

### Normalized Data:

+ Reduces data duplication.
+ Enhances data integrity by maintaining data consistency.
+ Requires joining multiple tables, potentially making queries slower due to the overhead of these joins.

### Denormalized Data:

+ Can offer faster query performance because it contains all data points in a single table as literal values.
+ Stores the same values repeatedly (for example, "brown" appears in several rows), which increases duplication.
+ Could result in reduced data integrity as data consistency may be harder to maintain.
+ Tends to produce wide tables; choosing the right data format, like columnar storage, can enhance access efficiency because queries read only the columns they need.

Both approaches have their time and place, depending on specific use cases and requirements.
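
To make the integrity trade-off concrete, here is a small plain-Python sketch (the data and variable names are illustrative only). Renaming a color touches one lookup entry in the normalized layout but every matching row in the denormalized one, and any missed row would leave that table inconsistent:

```python
# Normalized layout: pets reference a color by code.
color_lookup = {1: "brown", 2: "white"}
pets_normalized = [
    {"name": "fido", "color": 1},
    {"name": "annabelle", "color": 2},
    {"name": "fred", "color": 1},
]

# Denormalized layout: the color is repeated as a literal in every row.
pets_denormalized = [
    {"name": "fido", "color": "brown"},
    {"name": "annabelle", "color": "white"},
    {"name": "fred", "color": "brown"},
]

# Rename "brown" to "chocolate".
# Normalized: a single write to the lookup table.
color_lookup[1] = "chocolate"

# Denormalized: every affected row must be rewritten.
rows_rewritten = 0
for row in pets_denormalized:
    if row["color"] == "brown":
        row["color"] = "chocolate"
        rows_rewritten += 1

# Resolve the codes to compare both layouts on the readable view.
view_from_normalized = [
    {"name": p["name"], "color": color_lookup[p["color"]]} for p in pets_normalized
]
print(rows_rewritten)
print(view_from_normalized == pets_denormalized)
```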

In the context of Spark, let's focus on how to convert data from a normalized format to a denormalized format. This transformation can be valuable when you prioritize query speed over data integrity and duplication concerns.

In [8]:
import findspark
findspark.init()
findspark.find()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Vamsi_App_8').getOrCreate()

#### Step-by-step instructions for converting normalized data into a denormalized format using Spark

**Step 1:** Create lists of tuples representing the normalized animal data and the lookup tables for animal types and animal colors.

In [9]:
animalsNormalized = [("fido", 1, 4, 1),
 ("annabelle", 2, 15, 2),
 ("fred", 3, 29, 1),
 ("fred", 4, 1, 1),
 ("gus", 5, 1, 4),
 ("daisy", 6, 2, 5)]

animalTypeLookup = [("dog", 1),
 ("cat", 2),
 ("bear", 3),
 ("parrot", 4),
 ("fish", 5),
 ("iguana", 6)]

animalColorLookup = [("brown", 1),
 ("white", 2),
 ("black", 3),
 ("gold", 4),
 ("green", 5),
 ("red", 6)]

**Step 2:** Create RDDs from the lists using `spark.sparkContext.parallelize()`.

In [11]:
petsRDD = spark.sparkContext.parallelize(animalsNormalized)
colorsRDD = spark.sparkContext.parallelize(animalColorLookup)
typesRDD = spark.sparkContext.parallelize(animalTypeLookup)

**Step 3:** Create DataFrames from the RDDs, passing the column names for each one. Spark infers the column types from the data.

In [12]:
petsDF = spark.createDataFrame(petsRDD, ['nickname', 'type', 'age', 'color'])
colors = spark.createDataFrame(colorsRDD, ['color_name', 'color_id'])
types = spark.createDataFrame(typesRDD, ['type_name', 'type_id'])

**Step 4:** Left-join the first DataFrame (petsDF) with the second (colors), matching the `color` column of petsDF to `color_id` in colors. Call this new DataFrame petsWithColors.

In [14]:
from pyspark.sql.functions import col
petsWithColors = petsDF.join(colors, col("color") == col("color_id"), how="left")
petsWithColors.select("nickname", "color_name", "age").show()

+---------+----------+---+
| nickname|color_name|age|
+---------+----------+---+
|     fido|     brown|  4|
|annabelle|     white| 15|
|     fred|     brown| 29|
|     fred|     brown|  1|
|      gus|      gold|  1|
|    daisy|     green|  2|
+---------+----------+---+
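
The `how="left"` argument matters when a code in the pets data has no entry in the lookup table: a left join keeps the pet row and leaves the looked-up column empty (null in Spark), while an inner join drops the row. The following plain-Python sketch mimics that behavior without Spark; the pet "rex" with color code 9 is made up for illustration:

```python
# Lookup keyed by color_id, using a subset of animalColorLookup.
colors = {1: "brown", 2: "white", 4: "gold"}

# (nickname, color code); "rex" uses a code missing from the lookup.
pets = [("fido", 1), ("annabelle", 2), ("rex", 9)]

# Left join: keep every pet, None where no color matches.
left_joined = [(name, colors.get(code)) for name, code in pets]

# Inner join: keep only pets whose code has a match.
inner_joined = [(name, colors[code]) for name, code in pets if code in colors]

print(left_joined)
print(inner_joined)
```

In this dataset every code has a match, so both join types would give the same rows; the left join is simply the safer default when the lookup might be incomplete.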



**Step 5:** Left-join the petsWithColors DataFrame with the types DataFrame, matching the `type` column of petsWithColors to `type_id` in types. Call this new DataFrame petsWithColorAndType.

In [15]:
petsWithColorAndType = petsWithColors.join(types, col("type") == col("type_id"), how="left")
petsWithColorAndType.select("nickname", "type_name", "age", "color_name").show()

+---------+---------+---+----------+
| nickname|type_name|age|color_name|
+---------+---------+---+----------+
|    daisy|   iguana|  2|     green|
|      gus|     fish|  1|      gold|
|     fido|      dog|  4|     brown|
|     fred|     bear| 29|     brown|
|annabelle|      cat| 15|     white|
|     fred|   parrot|  1|     brown|
+---------+---------+---+----------+



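This sequence of operations transforms the normalized data into a denormalized format, providing a comprehensive view of the pets' information with their type, age, and color. Note that the row order differs between the two outputs: Spark does not guarantee row order after a join unless you sort explicitly.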