## Data Standardization - part 3

Data is often collected from different agencies in different formats. (The term "data standardization" also refers to a particular type of data normalization in which we subtract the mean and divide by the standard deviation.)

<h3>What is standardization?</h3>

Standardization is the process of transforming data into a common format so that the researcher can make meaningful comparisons.

<h3>Example</h3>

Transform mpg to L/100km:

In our dataset, the fuel consumption columns "city-mpg" and "highway-mpg" are expressed in mpg (miles per gallon). Assume we are developing an application for a country that reports fuel consumption in L/100km.

We will need to apply **data transformation** to transform mpg into L/100km.

The formula for unit conversion is:

L/100km = 235 / mpg
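
The constant 235 is a rounded value. As a rough sketch of where it comes from, assuming US liquid gallons and international miles (the constant and function names below are illustrative, not part of the notebook):

```python
# Exact unit definitions (US liquid gallon, international mile).
LITERS_PER_US_GALLON = 3.785411784
KM_PER_MILE = 1.609344

# mpg = miles / gallon, so L/100km = 100 * liters_per_gallon / (km_per_mile * mpg).
MPG_TO_L_PER_100KM = 100 * LITERS_PER_US_GALLON / KM_PER_MILE


def mpg_to_l_per_100km(mpg: float) -> float:
    """Convert fuel economy in miles per US gallon to liters per 100 km."""
    return MPG_TO_L_PER_100KM / mpg


print(round(MPG_TO_L_PER_100KM, 3))  # 235.215
print(mpg_to_l_per_100km(16))        # slightly above 235 / 16 = 14.6875
```

The rest of this notebook keeps the rounded constant 235, so its results differ from the unrounded conversion by roughly 0.1%.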

Pandas lets us apply many mathematical operations directly to whole columns:

In [1]:
import numpy as np
import pandas as pd

In [8]:
# create headers
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]

In [13]:
df = pd.read_csv('automobile_new.csv', names= headers)
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
55,3,150.0,mazda,gas,std,two,hatchback,rwd,front,95.3,...,80,mpfi,3.329751,3.255423,9.4,135,6000,16,23,15645
66,0,93.0,mercedes-benz,diesel,turbo,two,hardtop,rwd,front,106.7,...,183,idi,3.58,3.64,21.5,123,4350,22,25,28176
145,0,85.0,subaru,gas,turbo,four,wagon,4wd,front,96.9,...,108,mpfi,3.62,2.64,7.7,111,4800,23,23,11694
128,3,150.0,saab,gas,std,two,hatchback,fwd,front,99.1,...,121,mpfi,3.54,3.07,9.31,110,5250,21,28,11850
139,0,102.0,subaru,gas,std,four,sedan,fwd,front,97.2,...,108,mpfi,3.62,2.64,9.0,94,5200,26,32,9960


In [14]:
# Convert mpg to L/100km by mathematical operation (235 divided by mpg)
df['city-L/100km'] = 235/df['city-mpg']

# check your transformed data
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price,city-L/100km
55,3,150.0,mazda,gas,std,two,hatchback,rwd,front,95.3,...,mpfi,3.329751,3.255423,9.4,135,6000,16,23,15645,14.6875
66,0,93.0,mercedes-benz,diesel,turbo,two,hardtop,rwd,front,106.7,...,idi,3.58,3.64,21.5,123,4350,22,25,28176,10.681818
145,0,85.0,subaru,gas,turbo,four,wagon,4wd,front,96.9,...,mpfi,3.62,2.64,7.7,111,4800,23,23,11694,10.217391
128,3,150.0,saab,gas,std,two,hatchback,fwd,front,99.1,...,mpfi,3.54,3.07,9.31,110,5250,21,28,11850,11.190476
139,0,102.0,subaru,gas,std,four,sedan,fwd,front,97.2,...,mpfi,3.62,2.64,9.0,94,5200,26,32,9960,9.038462


<h3>Q: Following the example above, transform mpg to L/100km in the "highway-mpg" column and rename the column to "highway-L/100km".</h3>

In [15]:
# transform mpg to L/100km (235 divided by mpg)
# note: this overwrites the column in place, so run this cell only once
df['highway-mpg'] = 235/df['highway-mpg']

In [18]:
# rename the column from "highway-mpg" to "highway-L/100km"
df.rename(columns={'highway-mpg': 'highway-L/100km'}, inplace=True)

# check your transformed data
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-L/100km,price,city-L/100km
55,3,150.0,mazda,gas,std,two,hatchback,rwd,front,95.3,...,mpfi,3.329751,3.255423,9.4,135,6000,16,10.217391,15645,14.6875
66,0,93.0,mercedes-benz,diesel,turbo,two,hardtop,rwd,front,106.7,...,idi,3.58,3.64,21.5,123,4350,22,9.4,28176,10.681818
145,0,85.0,subaru,gas,turbo,four,wagon,4wd,front,96.9,...,mpfi,3.62,2.64,7.7,111,4800,23,10.217391,11694,10.217391
128,3,150.0,saab,gas,std,two,hatchback,fwd,front,99.1,...,mpfi,3.54,3.07,9.31,110,5250,21,8.392857,11850,11.190476
139,0,102.0,subaru,gas,std,four,sedan,fwd,front,97.2,...,mpfi,3.62,2.64,9.0,94,5200,26,7.34375,9960,9.038462
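
Because the cell above overwrites "highway-mpg" in place, rerunning it would apply the formula twice. One possible alternative, sketched here on a small made-up DataFrame, writes the converted values to new columns and then drops the originals. The helper name `add_l_per_100km` and the sample values are assumptions for illustration only:

```python
import pandas as pd

# Toy data standing in for the automobile dataset (values are made up).
cars = pd.DataFrame({"city-mpg": [16, 22, 23], "highway-mpg": [23, 25, 23]})


def add_l_per_100km(frame: pd.DataFrame, mpg_columns: list) -> pd.DataFrame:
    """Return a copy with each '<name>-mpg' column replaced by '<name>-L/100km'."""
    result = frame.copy()
    for column in mpg_columns:
        new_name = column.replace("-mpg", "-L/100km")
        result[new_name] = 235 / result[column]  # rounded constant used in this notebook
    return result.drop(columns=mpg_columns)


converted = add_l_per_100km(cars, ["city-mpg", "highway-mpg"])
print(converted)
```

Since the function returns a copy and leaves `cars` untouched, calling it repeatedly always produces the same result.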


<h2>Data Normalization</h2>

<h3>Why normalization?</h3>

Normalization is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable so the variable average is 0, scaling the variable so the variance is 1, or scaling the variable so the variable values range from 0 to 1.
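
As a compact reference, three common normalizations can be written as follows. The notation is a choice made here, not taken from the dataset: $x$ is an original value, $x_{\min}$ and $x_{\max}$ are the column minimum and maximum, and $\mu$ and $\sigma$ are the column mean and standard deviation.

$$
x_{\text{simple}} = \frac{x}{x_{\max}}, \qquad
x_{\text{min-max}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \qquad
z = \frac{x - \mu}{\sigma}
$$

The first maps non-negative values into the range 0 to 1, the second maps any column with distinct minimum and maximum onto exactly 0 to 1, and the third yields a column with mean 0 and variance 1.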

<h3>Example</h3>

To demonstrate normalization, let's say we want to scale the columns "length", "width" and "height".

Target: normalize these variables so that their values range from 0 to 1.

Approach: replace each original value with (original value)/(maximum value).

<h3>Method 1. Simple Feature Scaling</h3>

In [19]:
# replace (original value) by (original value)/(maximum value)
df['length'] = df['length']/df['length'].max()
df['width'] = df['width']/df['width'].max()

In [20]:
df['height'] = df['height']/df['height'].max()

# show the scaled columns
df[['length','width','height']].head()

Unnamed: 0,length,width,height
55,0.877011,0.920168,0.884135
66,0.973015,0.984594,0.97861
145,0.900882,0.915966,0.97861
128,0.968345,0.931373,1.0
139,0.892579,0.915966,0.935829
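
The three columns can also be scaled in a single expression. The following is a minimal sketch on made-up measurements (the DataFrame `sizes` and its values are assumptions), relying on the fact that pandas aligns the division on column labels:

```python
import pandas as pd

# Toy measurements (made up) standing in for "length", "width" and "height".
sizes = pd.DataFrame({
    "length": [168.8, 171.2, 176.6],
    "width": [64.1, 65.5, 66.2],
    "height": [48.8, 52.4, 54.3],
})

columns = ["length", "width", "height"]

# DataFrame.max() returns one maximum per column; the division is aligned
# on column labels, so each column is divided by its own maximum.
scaled = sizes[columns] / sizes[columns].max()

print(scaled.max())  # every column now has a maximum of 1.0
```

This approach only guarantees values between 0 and 1 when the original values are non-negative, which holds for physical dimensions like these.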


In [21]:
# highest number of rows that share the same scaled height value
df['height'].value_counts().max()

2

<h3>Method 2. Min-Max</h3>

In [22]:
# Min-Max: ((original value) - (min value)) / ((max value) - (min value))
df['length_new'] = (df['length'] - df['length'].min()) / (df['length'].max() - df['length'].min())
df['length_new']

55     0.302941
66     0.847059
145    0.438235
128    0.820588
139    0.391176
160    0.294118
153    0.223529
57     0.561765
7      1.000000
147    0.000000
Name: length_new, dtype: float64
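
The same calculation can be wrapped in a small reusable function. This is a sketch under stated assumptions: the name `min_max_scale`, the sample values, and the choice to return zeros for a constant column are all illustrative, not part of the notebook.

```python
import pandas as pd


def min_max_scale(values: pd.Series) -> pd.Series:
    """Rescale a numeric Series linearly so its minimum maps to 0 and its maximum to 1."""
    value_range = values.max() - values.min()
    if value_range == 0:
        # A constant column has no spread; return zeros rather than divide by zero.
        return pd.Series(0.0, index=values.index, name=values.name)
    return (values - values.min()) / value_range


lengths = pd.Series([168.8, 171.2, 176.6, 192.7], name="length")  # made-up values
scaled = min_max_scale(lengths)
print(scaled.min(), scaled.max())

# Min-max scaling is unaffected by a prior division by the maximum.
rescaled = min_max_scale(lengths / lengths.max())
print((rescaled - scaled).abs().max())  # tiny floating-point difference at most
```

The last check matters for this notebook: "length" was already divided by its maximum in Method 1, and min-max scaling gives the same result on the scaled column as it would on the raw one.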

<h3>Method 3. Z-score</h3>

In [23]:
# Z-score: subtract the column mean, then divide by the standard deviation
# (pandas .std() returns the sample standard deviation, ddof=1, by default)
df['length_zscore'] = (df['length'] - df['length'].mean()) / df['length'].std()

df['length_zscore']

55    -0.585379
66     1.136324
145   -0.157280
128    1.052565
139   -0.306184
160   -0.613299
153   -0.836655
57     0.233593
7      1.620262
147   -1.543949
Name: length_zscore, dtype: float64
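
A quick way to sanity-check a z-score column is to confirm that its mean is approximately 0 and its standard deviation approximately 1. The sketch below uses made-up values and also shows the `ddof` parameter of `Series.std()`, which switches between the sample and population standard deviation:

```python
import pandas as pd

lengths = pd.Series([168.8, 171.2, 176.6, 192.7], name="length")  # made-up values

# Series.std() defaults to ddof=1 (sample standard deviation);
# pass ddof=0 for the population standard deviation instead.
z_sample = (lengths - lengths.mean()) / lengths.std()
z_population = (lengths - lengths.mean()) / lengths.std(ddof=0)

print(z_sample.mean(), z_sample.std())  # approximately 0 and 1
print(z_population.std(ddof=0))         # approximately 1
```

Whichever convention is chosen, the same `ddof` should be used when computing and when checking the scores. Some other libraries default to `ddof=0`, so their z-scores can differ slightly from the pandas result on small samples.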