### Introduction:

**Standardization** means assigning a single label to values that differ on the surface but should be treated as the same thing for a given purpose. For example, a cat might be recorded as kitty, kitty cat, kitten, or feline; standardizing means treating all of these as just "cat." This is a common data-cleaning step that makes data more consistent and easier to work with. Standardization also helps with skew, which we'll cover in a later part of this course called "Addressing Data Cardinality and Skew."

### Two Types of Standardization:

There are at least two simple methods to standardize something:

+ You can acknowledge when two things are alike but have different names ("puppy" and "dog") and link them without altering anything.
+ You can identify when two things are alike and modify them to match (replace all occurrences of "puppy" with "dog").

To do either with Spark in Python, you need to build your own list of synonyms.
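The two methods can be sketched in plain Python before bringing Spark into the picture. This is only a minimal illustration: the `SYNONYMS` dictionary, the `records` list, and the `to_standard()` helper are hypothetical names made up for this example, not part of the exercise code.

```python
# Hypothetical synonym library: standard label -> known variants.
SYNONYMS = {
    "dog": {"dog", "puppy", "puppy dog", "hound", "canine"},
    "cat": {"cat", "kitty", "kitten", "feline", "kitty cat"},
}

# Illustrative records as (nickname, type) pairs.
records = [("roger", "puppy"), ("julie", "feline"), ("rosco", "dog")]

# Method 1: link without altering. Select everything that counts as a dog;
# the stored type values stay exactly as they were.
dogs = [record for record in records if record[1] in SYNONYMS["dog"]]

# Method 2: modify to match. Replace each variant with its standard label.
def to_standard(label):
    for standard, variants in SYNONYMS.items():
        if label in variants:
            return standard
    return label  # unknown labels pass through unchanged

modified = [(name, to_standard(label)) for name, label in records]

print(dogs)      # [('roger', 'puppy'), ('rosco', 'dog')]
print(modified)  # [('roger', 'dog'), ('julie', 'cat'), ('rosco', 'dog')]
```

In the first method the original values survive, so the grouping can be revised later; in the second the data itself changes, which makes downstream code simpler but discards the original wording.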

### Standardization through Suggestion

#### Exercise Setup:

Initialize Spark and import the libraries this exercise needs using the following code:

In [1]:
import findspark

In [2]:
findspark.init()
findspark.find()

'C:\\Spark\\sparkhome'

In [3]:
from pyspark.sql import SparkSession

In [5]:
spark = SparkSession.builder.appName("Vamsi_App_4").getOrCreate()

In [9]:
from pyspark.sql.functions import col

In [11]:
from pyspark.sql import Row

Create a List of Rows, each containing a name and type using the following code:

In [12]:
pets = [Row("annabelle", "cat"),
        Row("daisy", "kitten"),
        Row("roger", "puppy"),
        Row("joe", "puppy dog"),
        Row("rosco", "dog"),
        Row("julie", "feline")]

Use the parallelize() function of Spark to turn that List into an RDD as shown in the following code:

In [15]:
petsRDD = spark.sparkContext.parallelize(pets)

Create a DataFrame from the RDD and a list of column names (Spark infers the column types) using the following code:

In [16]:
# Create DataFrame
petsDF = spark.createDataFrame(petsRDD, ['nickname', 'type'])

#### Filter Dogs:

Use the DataFrame's where() function together with the Column's isin() function to keep only the rows whose type matches a provided list of dog nouns. Print the results to the console as shown in the following code:

In [17]:
dogs = petsDF.where(col("type").isin("dog", "puppy", "puppy dog", "hound", "canine"))
dogs.show()

+--------+---------+
|nickname|     type|
+--------+---------+
|   roger|    puppy|
|     joe|puppy dog|
|   rosco|      dog|
+--------+---------+



Use the DataFrame's where() function together with the Column's isin() function to keep only the rows whose type matches a provided list of cat nouns. Print the results to the console as shown in the following code:

In [19]:
cats = petsDF.where(col("type").isin(["cat", "kitty", "kitten", "feline", "kitty cat"]))
 
cats.show()

+---------+------+
| nickname|  type|
+---------+------+
|annabelle|   cat|
|    daisy|kitten|
|    julie|feline|
+---------+------+



This example also shows that isin() accepts a single list, not just separate string arguments as in the previous step.

### Standardization through Modification


In the previous exercise, we identified animals as a certain type whenever their type appeared in a list of common synonyms, without changing the data. In this exercise, we will modify the data itself, replacing each synonym with its preferred, standard value.

Create a standardize() function that compares each pet's type to lists of common dog and cat nouns and returns "dog" or "cat", respectively, if there is a match.

In [20]:
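def standardize(pet):
    # Each pet is a Row: position 0 is the nickname, position 1 is the type
    name = pet[0]
    animal_type = pet[1]

    # Synonym lists for each standard type
    dog_nouns = ["dog", "puppy", "puppy dog", "hound", "canine"]
    cat_nouns = ["cat", "kitty", "kitten", "feline", "kitty cat"]

    if animal_type in dog_nouns:
        return name, "dog"
    elif animal_type in cat_nouns:
        return name, "cat"
    else:
        # No match: return the row unchanged
        return pet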

Then, apply the standardize() function to petsRDD (created in the previous exercise) using the map() function. Hint: You can also use a UDF on the DataFrame instead of this RDD map method, but we’ll cover that in a future exercise!

In [23]:
standardizedPets = petsRDD.map(standardize)
standardizedPetsDF = spark.createDataFrame(standardizedPets, ['nickname', 'type'])
 
standardizedPetsDF.show()
petsDF.show()

+---------+----+
| nickname|type|
+---------+----+
|annabelle| cat|
|    daisy| cat|
|    roger| dog|
|      joe| dog|
|    rosco| dog|
|    julie| cat|
+---------+----+

+---------+---------+
| nickname|     type|
+---------+---------+
|annabelle|      cat|
|    daisy|   kitten|
|    roger|    puppy|
|      joe|puppy dog|
|    rosco|      dog|
|    julie|   feline|
+---------+---------+



This code defines a standardize() function that compares each pet's type to lists of common dog and cat nouns, then applies it to every row of the pet data. The first table shows the result: every pet is now labeled consistently as "dog" or "cat". The second table shows that the original petsDF is unchanged, because map() returns a new RDD rather than modifying the existing one.
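As the synonym lists grow, the if/elif chain can be replaced by a single lookup table. The following is a minimal sketch of that idea, not the course's method: `TYPE_LOOKUP` and `standardize_with_lookup()` are hypothetical names, and ignoring case and surrounding whitespace is an added assumption that the original standardize() does not make.

```python
dog_nouns = ["dog", "puppy", "puppy dog", "hound", "canine"]
cat_nouns = ["cat", "kitty", "kitten", "feline", "kitty cat"]

# Flatten the two lists into one lookup table: variant -> standard type.
TYPE_LOOKUP = {noun: "dog" for noun in dog_nouns}
TYPE_LOOKUP.update({noun: "cat" for noun in cat_nouns})

def standardize_with_lookup(pet):
    name, animal_type = pet[0], pet[1]
    # Assumption: ignore case and surrounding whitespace when matching.
    key = animal_type.strip().lower()
    # Fall back to the original type when no synonym matches.
    return name, TYPE_LOOKUP.get(key, animal_type)

print(standardize_with_lookup(("joe", " Puppy Dog ")))   # ('joe', 'dog')
print(standardize_with_lookup(("bubbles", "goldfish")))  # ('bubbles', 'goldfish')
```

Because it takes and returns the same shape as standardize(), it should be usable with petsRDD.map() in the same way.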

In [24]:
print(petsRDD)

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:287


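Printing an RDD directly shows only a description of the RDD, not its contents. To view the data, call **collect()** to bring it to the driver, then loop through the result and print each element to the console. Because collect() brings the entire dataset to the driver, use it only on small RDDs like this one.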

In [25]:
dataColl = petsRDD.collect()
for row in dataColl:
    print(row[0] + "," + str(row[1]))

annabelle,cat
daisy,kitten
roger,puppy
joe,puppy dog
rosco,dog
julie,feline
