<a href="https://cocl.us/Data_Science_with_Scalla_top"><img src="https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/SC0103EN/adds/Data_Science_with_Scalla_notebook_top.png" width="750" align="center"></a>
 <br/>
<a><img src="https://ibm.box.com/shared/static/ugcqz6ohbvff804xp84y4kqnvvk3bq1g.png" width="200" align="center"></a>

# Module 2: Preparing Data - Data Normalization

## Data Normalization 

### Lesson Objectives

After completing this lesson, you should be able to:

- Normalize a dataset to have unit p-norm
- Normalize a dataset to have unit standard deviation and zero mean
- Normalize a dataset to have given minimum and maximum values


## Normalizer

- A Transformer which transforms a dataset of `Vector` rows, normalizing each `Vector` to have unit norm
- Takes a parameter `p`, which specifies the p-norm used for normalization (`p = 2` by default)
- Normalization standardizes the input data and can improve the behavior of learning algorithms
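
To make the p-norm idea concrete, here is a minimal plain-Scala sketch of normalizing a single row, with no Spark required. The helper names `pNorm` and `normalizeRow` are hypothetical and exist only for illustration; they are not part of the Spark API, and leaving an all-zero row unchanged is a choice made for this sketch.

```scala
// Plain-Scala sketch of p-norm normalization for one row (no Spark needed).
// `pNorm` and `normalizeRow` are hypothetical helpers, not Spark APIs.
def pNorm(v: Array[Double], p: Double): Double =
  if (p.isPosInfinity) v.map(x => math.abs(x)).max // max-norm
  else math.pow(v.map(x => math.pow(math.abs(x), p)).sum, 1.0 / p)

def normalizeRow(v: Array[Double], p: Double = 2.0): Array[Double] = {
  val n = pNorm(v, p)
  // Leave an all-zero row unchanged to avoid dividing by zero.
  if (n == 0.0) v else v.map(_ / n)
}

val row = Array(3.0, -4.0)
val l1 = normalizeRow(row, 1.0) // absolute values now sum to 1
val l2 = normalizeRow(row)      // Euclidean length is now 1
```

With `p = 1.0` each component is divided by the sum of absolute values, which mirrors the `setP(1.0)` setting used in the Normalizer cell of this lesson.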

In [1]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

Intitializing Scala interpreter ...

Spark Web UI available at http://host.docker.internal:4047
SparkContext available as 'sc' (version = 3.3.0, master = local[*], app id = local-1669554413591)
SparkSession available as 'spark'


import org.apache.spark.sql.SparkSession
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@7b0ba53d
import spark.implicits._


In [2]:
// Continuing from Previous Example 

import org.apache.spark.ml.feature.VectorAssembler

import org.apache.spark.sql.functions._

val dfRandom = spark.range(0, 10).select("id").
  withColumn("uniform", rand(10L)).
  withColumn("normal1", randn(10L)).
  withColumn("normal2", randn(11L))

val assembler = new VectorAssembler().
  setInputCols(Array("uniform","normal1","normal2")).
  setOutputCol("features")

val dfVec = assembler.transform(dfRandom)


// Continuing from Previous Example 

dfVec.select("id","features").show()
 

+---+--------------------+
| id|            features|
+---+--------------------+
|  0|[0.17094971379555...|
|  1|[0.03422639313807...|
|  2|[0.36546259581613...|
|  3|[0.41750190407920...|
|  4|[0.98991293998274...|
|  5|[0.16452185994603...|
|  6|[0.18141810315190...|
|  7|[0.49595620559530...|
|  8|[0.96974749453753...|
|  9|[0.07530606222259...|
+---+--------------------+



import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions._
dfRandom: org.apache.spark.sql.DataFrame = [id: bigint, uniform: double ... 2 more fields]
assembler: org.apache.spark.ml.feature.VectorAssembler = VectorAssembler: uid=vecAssembler_4d76d02a954f, handleInvalid=error, numInputCols=3
dfVec: org.apache.spark.sql.DataFrame = [id: bigint, uniform: double ... 3 more fields]


In [3]:

// A Simple Normalizer 

import org.apache.spark.ml.feature.Normalizer

val scaler1 = new Normalizer().setInputCol("features").setOutputCol("scaledFeat").setP(1.0)
scaler1.transform(dfVec.select("id","features")).show(5)

+---+--------------------+--------------------+
| id|            features|          scaledFeat|
+---+--------------------+--------------------+
|  0|[0.17094971379555...|[0.06784834504592...|
|  1|[0.03422639313807...|[0.01637411312549...|
|  2|[0.36546259581613...|[0.62820267089235...|
|  3|[0.41750190407920...|[0.24834671851479...|
|  4|[0.98991293998274...|[0.75155926146783...|
+---+--------------------+--------------------+
only showing top 5 rows



import org.apache.spark.ml.feature.Normalizer
scaler1: org.apache.spark.ml.feature.Normalizer = Normalizer: uid=normalizer_3a737e9d4f79, p=1.0


## Standard Scaler

- An Estimator which can be fit on a dataset to produce a `StandardScalerModel`
- The fitted `StandardScalerModel` is a Transformer which transforms a dataset of `Vector` rows, normalizing each feature to have unit standard deviation and/or zero mean
- Takes two parameters:
	- `withStd`: scales the data to unit standard deviation (default: true)
	- `withMean`: centers the data with the mean before scaling (default: false)
- Centering with `withMean` builds a dense output, so sparse inputs need care and, depending on the Spark version, may raise an exception
- If the standard deviation of a feature is zero, it returns 0.0 in the `Vector` for that feature
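
The per-feature arithmetic described above can be sketched in plain Scala, without Spark. The helper `standardize` is a hypothetical name used only for this illustration. It assumes the corrected sample standard deviation (dividing by n - 1), which is what the Spark documentation describes for `StandardScaler`, and it maps a constant feature to 0.0 as stated in the list above.

```scala
// Plain-Scala sketch of standardizing one feature column (no Spark needed).
// `standardize` is a hypothetical helper, not a Spark API.
def standardize(col: Seq[Double],
                withMean: Boolean = false,
                withStd: Boolean = true): Seq[Double] = {
  val n = col.size
  val mean = col.sum / n
  // Corrected sample standard deviation (divides by n - 1).
  val std =
    if (n > 1) math.sqrt(col.map(x => math.pow(x - mean, 2)).sum / (n - 1))
    else 0.0
  col.map { x =>
    val centered = if (withMean) x - mean else x
    if (!withStd) centered
    else if (std == 0.0) 0.0 // a constant feature maps to 0.0
    else centered / std
  }
}

val feature = Seq(1.0, 2.0, 3.0)
val z = standardize(feature, withMean = true) // Seq(-1.0, 0.0, 1.0)
val constant = standardize(Seq(5.0, 5.0))     // Seq(0.0, 0.0)
```

Here the mean is 2.0 and the sample standard deviation is 1.0, so centering and scaling yields -1.0, 0.0 and 1.0.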

In [6]:
// A Simple Standard Scaler

import org.apache.spark.ml.feature.StandardScaler

val scaler2 = new StandardScaler().
  setInputCol("features").setOutputCol("scaledFeat").
  setWithStd(true).setWithMean(true)

val scaler2Model = scaler2.fit(dfVec.select("id","features"))
scaler2Model.transform(dfVec.select("id","features")).show(5)

+---+--------------------+--------------------+
| id|            features|          scaledFeat|
+---+--------------------+--------------------+
|  0|[0.17094971379555...|[-0.6232800161534...|
|  1|[0.03422639313807...|[-1.0186252770716...|
|  2|[0.36546259581613...|[-0.0608321051124...|
|  3|[0.41750190407920...|[0.08964327688038...|
|  4|[0.98991293998274...|[1.74481072932024...|
+---+--------------------+--------------------+
only showing top 5 rows



import org.apache.spark.ml.feature.StandardScaler
scaler2: org.apache.spark.ml.feature.StandardScaler = stdScal_756528a1f26d
scaler2Model: org.apache.spark.ml.feature.StandardScalerModel = StandardScalerModel: uid=stdScal_756528a1f26d, numFeatures=3, withMean=true, withStd=true


## MinMax Scaler 

- An Estimator which can be fit on a dataset to produce a `MinMaxScalerModel`
- The fitted `MinMaxScalerModel` is a Transformer which transforms a dataset of `Vector` rows, rescaling each feature to a specific range (often `[0, 1]`)
- Takes two parameters:
	- `min`: lower bound after transformation, shared by all features (default: 0.0)
	- `max`: upper bound after transformation, shared by all features (default: 1.0)
- Since zero values are likely to be transformed to non-zero values, sparse inputs may result in dense outputs
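
The rescaling can be sketched in plain Scala for a single feature column, without Spark. The helper `rescale` is a hypothetical name used only here. The formula, including the midpoint used when a feature is constant, follows the description in the Spark `MinMaxScaler` documentation; treat this as an illustration rather than Spark's implementation.

```scala
// Plain-Scala sketch of min-max rescaling for one feature column (no Spark needed).
// `rescale` is a hypothetical helper, not a Spark API.
def rescale(col: Seq[Double], min: Double = 0.0, max: Double = 1.0): Seq[Double] = {
  val eMin = col.min // smallest observed value of the feature
  val eMax = col.max // largest observed value of the feature
  col.map { x =>
    if (eMax == eMin) 0.5 * (max + min) // constant feature: midpoint of the range
    else (x - eMin) / (eMax - eMin) * (max - min) + min
  }
}

val values = Seq(2.0, 4.0, 10.0)
val scaled = rescale(values, -1.0, 1.0) // Seq(-1.0, -0.5, 1.0)
```

The smallest value of the feature maps to `min` and the largest to `max`, which matches the `-1.0` and `1.0` entries visible in the MinMax Scaler output of this lesson.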

In [7]:

// A Simple MinMax Scaler 
import org.apache.spark.ml.feature.MinMaxScaler

val scaler3 = new MinMaxScaler().
  setInputCol("features").setOutputCol("scaledFeat").
  setMin(-1.0).setMax(1.0)

val scaler3Model = scaler3.fit(dfVec.select("id","features"))
scaler3Model.transform(dfVec.select("id","features")).show(5)

+---+--------------------+--------------------+
| id|            features|          scaledFeat|
+---+--------------------+--------------------+
|  0|[0.17094971379555...|[-0.7138741335035...|
|  1|[0.03422639313807...|[-1.0,-1.0,0.1073...|
|  2|[0.36546259581613...|[-0.3068099498278...|
|  3|[0.41750190407920...|[-0.1979053964784...|
|  4|[0.98991293998274...|[1.0,1.0,0.344574...|
+---+--------------------+--------------------+
only showing top 5 rows



import org.apache.spark.ml.feature.MinMaxScaler
scaler3: org.apache.spark.ml.feature.MinMaxScaler = minMaxScal_8dc8e3aca96d
scaler3Model: org.apache.spark.ml.feature.MinMaxScalerModel = MinMaxScalerModel: uid=minMaxScal_8dc8e3aca96d, numFeatures=3, min=-1.0, max=1.0


## Lesson Summary 

Having completed this lesson, you should be able to:

- Normalize a dataset to have unit p-norm
- Normalize a dataset to have unit standard deviation and zero mean
- Normalize a dataset to have given minimum and maximum values

### About the Authors

[Petro Verkhogliad](https://www.linkedin.com/in/vpetro) is a Consulting Manager at Lightbend. He holds a Master's degree in Computer Science with a specialization in Intelligent Systems. He is passionate about functional programming and applications of AI.