# Data Normalization and Standardization

### Why normalization and standardization?

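Why do this at all? Does it really matter, and what happens if the data is not standardized?
>While talking with members of our R&D team and researching online articles, I quickly realized that standardizing multivariate data is very important, especially when the variables are on markedly different scales. Using variables on multiple scales affects model stability and the precision of parameter estimates during multivariate analysis. For example, in boundary detection, a variable that ranges from 0 to 100 will outweigh a variable that ranges from 0 to 1. Using variables without standardization can give the variables with larger ranges greater importance in the analysis. Transforming the data to comparable scales prevents this problem. Standardizing continuous predictor variables is very important in neural networks. In regression analysis, standardizing multi-scale variables helps reduce multicollinearity in models that include interaction terms. Standardizing data before cluster analysis is also critical. Clustering is an unsupervised learning technique that classifies observations into similar groups, or clusters. A commonly used similarity measure is Euclidean distance, which is calculated as the square root of the sum of the squared differences between observations. Differences in scale between variables can greatly affect this distance. In general, variables with large variances have more influence on this measure than variables with small variances. Standardizing multi-scale variables is therefore recommended before clustering. Standardizing or normalizing data is also important in principal component analysis (PCA), because PCA projects the original data onto orthogonal directions that maximize variance. Tree-based analyses, however, are insensitive to outliers and do not require variable transformations, so decision trees, random forests, and/or gradient boosting algorithms do not need multi-scale data to be standardized.<br><br>by SAS official technical article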

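Normalization and standardization are important steps in a machine learning pipeline, because many machine learning algorithms perform better once the data has been rescaled. The two main data scaling methods are normalization and standardization. This notebook covers the following topics:
<li>Normalizing data
<li>Standardizing data
<li>When to use normalization and when to use standardization
<li>Machine learning models and the scaling methods that suit them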

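## Dataset
This article uses the Pima Indians Diabetes Dataset. The data comes from a public data science competition on Kaggle, in which the task is to predict the probability of developing diabetes within five years.
The file is in the data folder inside this folder, and is named pima-indians-diabetes.data.csv.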

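### 1 Normalization
<p>
Normalization can be done in several ways; in this case it means converting numeric values into numbers between 0 and 1. When features are on different scales, some take much larger values than others, and machine learning models that are sensitive to feature scale can be dominated by them: the model leans toward the large-valued features and neglects the small-valued ones. Normalizing before training avoids this.</p><p>
Example: an SVM works by maximizing the margin, which is the distance between the decision boundary and the nearest data points. If one feature takes values on a much larger scale than the others, that feature dominates the distance and therefore the model. Once every feature is scaled to [0, 1], the features contribute on a comparable footing.</p>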
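
The effect of scale on a distance-based method can be seen with a small sketch. The three records, the feature meanings, and the feature ranges below are made up for illustration, and `euclidean` and `scale_row` are hypothetical helpers rather than part of this notebook's code.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def scale_row(row, bounds):
    """Min-max scale one row given per-feature (min, max) bounds."""
    return [(v - lo) / (hi - lo) for v, (lo, hi) in zip(row, bounds)]


# Made-up records: (height in cm, a ratio between 0 and 1)
a = [170.0, 0.1]
b = [172.0, 0.9]   # close to a in height, far from a in ratio
c = [190.0, 0.1]   # far from a in height, identical in ratio

# Assumed feature ranges, used only for this illustration
bounds = [(150.0, 200.0), (0.0, 1.0)]

raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)
scaled_ab = euclidean(scale_row(a, bounds), scale_row(b, bounds))
scaled_ac = euclidean(scale_row(a, bounds), scale_row(c, bounds))

# On the raw data the height column decides which record is nearer to a
print(raw_ab < raw_ac)        # True
# After scaling, the large difference in the ratio column is visible again
print(scaled_ab > scaled_ac)  # True
```

Before scaling, `b` looks nearer to `a` than `c` does only because height is measured in larger numbers; after scaling, the ordering flips.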

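After normalization, data on different scales is brought onto a single scale. For example, height and weight are both measured quantities, but they use different scales, so what counts as large or small differs between them. Normalization maps both to numbers between 0 and 1, so height and weight can be expressed on the same scale. The prerequisite is that the maximum and minimum of the data are already known. Sometimes they are found by inspecting the data, and sometimes they come from domain knowledge. For example, in the multidimensional array that represents an 8-bit image, the minimum is 0 and the maximum is 255, because an 8-bit image is by definition made up of pixels with non-negative integer values from 0 to 255, and this property can be used to normalize the image. The normalization formula is: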

\begin{align}
scaledvalue = \frac{value - min}{max - min}
\end{align}
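
As a minimal sketch of the formula, the snippet below scales a few 8-bit pixel values with the bounds known from the domain (0 and 255) and a few heights with bounds taken from the data. The helper name `min_max_scale` and the sample values are assumptions made for this illustration.

```python
def min_max_scale(values, lo, hi):
    """Apply (value - min) / (max - min) to every value."""
    if hi == lo:
        raise ValueError("max and min must differ, otherwise the range is zero")
    return [(v - lo) / (hi - lo) for v in values]


# Bounds known from the domain: 8-bit pixels always lie in 0..255
pixels = [0, 64, 128, 255]
scaled_pixels = min_max_scale(pixels, 0, 255)

# Bounds taken from the data itself
heights = [150.0, 165.0, 180.0]
scaled_heights = min_max_scale(heights, min(heights), max(heights))

print(scaled_pixels)
print(scaled_heights)  # [0.0, 0.5, 1.0]
```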


### 1-1 Python implementation

In [1]:
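from csv import reader


def load_csv(path):
    """Read a CSV file into a list of rows (lists of strings), skipping blank rows."""
    dataset = []

    # newline='' is what the csv module recommends when opening files
    with open(path, 'r', newline='') as f:
        csv_reader = reader(f)

        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)

    return dataset


def str_column_to_float(dataset, column):
    """Convert one column of every row from string to float, in place."""
    for row in dataset:
        row[column] = float(row[column].strip())


def min_max(dataset):
    """Return a list of (min, max) tuples, one per column."""
    min_max_output = []

    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        min_max_output.append((min(col_values), max(col_values)))

    return min_max_output


def normalize_dataset(dataset, dataset_min_max):
    """Rescale every column to [0, 1] in place using the given (min, max) pairs."""
    for column in range(len(dataset[0])):
        _min, _max = dataset_min_max[column]
        if _max == _min:
            # A constant column has no range; skip it to avoid dividing by zero
            continue
        for row in dataset:
            row[column] = (row[column] - _min) / (_max - _min)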

In [2]:
dataset = load_csv('data/pima-indians-diabetes.data.csv')
print(dataset[0])

for i in range(len(dataset[0])):
    str_column_to_float(dataset, i)   
print(dataset[0])

dataset_min_max = min_max(dataset)
print(dataset_min_max)

normalize_dataset(dataset, dataset_min_max)
print(dataset[0])

['6', '148', '72', '35', '0', '33.6', '0.627', '50', '1']
[6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0]
[(0.0, 17.0), (0.0, 199.0), (0.0, 122.0), (0.0, 99.0), (0.0, 846.0), (0.0, 67.1), (0.078, 2.42), (21.0, 81.0), (0.0, 1.0)]
[0.35294117647058826, 0.7437185929648241, 0.5901639344262295, 0.35353535353535354, 0.0, 0.5007451564828614, 0.23441502988898377, 0.48333333333333334, 1.0]
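
Because `normalize_dataset` receives the (min, max) pairs as a separate argument, the same pairs can in principle be reused to scale a single record later on. The sketch below assumes that use case; `normalize_row` is a hypothetical helper, the pairs are copied from the printed output above, and the second record is made up. Mapping a constant column to 0.0 is also an assumption.

```python
def normalize_row(row, dataset_min_max):
    """Scale one record with stored (min, max) pairs (hypothetical helper)."""
    scaled = []
    for value, (_min, _max) in zip(row, dataset_min_max):
        if _max == _min:
            # Constant column: no range to scale by, so fall back to 0.0 (assumption)
            scaled.append(0.0)
        else:
            scaled.append((value - _min) / (_max - _min))
    return scaled


# (min, max) pairs as printed by min_max(dataset) above
dataset_min_max = [(0.0, 17.0), (0.0, 199.0), (0.0, 122.0), (0.0, 99.0), (0.0, 846.0),
                   (0.0, 67.1), (0.078, 2.42), (21.0, 81.0), (0.0, 1.0)]

# The first record of the dataset, as printed above
first_record = [6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0]
scaled_record = normalize_row(first_record, dataset_min_max)
print(scaled_record)

# A made-up record whose first value (20) exceeds the stored maximum (17)
unseen_record = [20.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0]
print(normalize_row(unseen_record, dataset_min_max)[0])  # greater than 1
```

A value outside the stored range lands outside [0, 1], so the stored pairs are only as good as the data they were computed from.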


### 1-2 Pandas implementation

In [3]:
import pandas as pd

def normalized_df(df, cols):
    """
    Normalize dataframe by given columns.
    
    parameters:
    df(pandas.DataFrame):The input dataframe. 
    cols(list):A list of column names that need to be normalized.
    
    return:
    df(pandas.DataFrame):The normalized dataframe.
    """
    df = df.copy()
    
    for col in cols:
        df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
        
    return df


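#load Pima Indians Diabetes Dataset
#the file has no header row, so read_csv uses the first record as the column
#names and 767 data rows remain; pass header=None to keep all 768 records
data = pd.read_csv('data/pima-indians-diabetes.data.csv')
#normalize all columns
normalized_data = normalized_df(data, data.columns)

normalized_data.describe()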

|  | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 |
| mean | 0.226014 | 0.607333 | 0.566407 | 0.207248 | 0.094449 | 0.476758 | 0.168093 | 0.203651 | 0.34811 |
| std | 0.198287 | 0.160696 | 0.158755 | 0.161152 | 0.136268 | 0.117572 | 0.141545 | 0.195872 | 0.476682 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.058824 | 0.497487 | 0.508197 | 0.0 | 0.0 | 0.406855 | 0.070666 | 0.05 | 0.0 |
| 50% | 0.176471 | 0.58794 | 0.590164 | 0.232323 | 0.037825 | 0.4769 | 0.125107 | 0.133333 | 0.0 |
| 75% | 0.352941 | 0.703518 | 0.655738 | 0.323232 | 0.150709 | 0.545455 | 0.233561 | 0.333333 | 1.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


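### 2 Standardization (Z-score)
After standardization, the data has an arithmetic mean of 0 and a standard deviation of 1. The data is shifted so that its arithmetic mean becomes the center and is then rescaled. Standardization changes only the location and scale of the data, not the shape of its distribution, so the result is Gaussian only if the original data was. The prerequisite for standardizing data is knowing the arithmetic mean and standard deviation of each column.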

#### Arithmetic mean
\begin{align}
mean = \frac{\sum_{i=1}^nvalues_i}{n}
\end{align}


#### Standard deviation
\begin{align}
std = \sqrt{\frac{\sum_{i=1}^n(values_i - mean)^2}{n - 1}}
\end{align}
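
The n - 1 denominator above gives the sample standard deviation, which is also what pandas `Series.std()` computes by default. The sketch below, on made-up values, compares it with the population version that divides by n, using the standard library's `statistics.stdev` and `statistics.pstdev` as a cross-check.

```python
from math import isclose, sqrt
from statistics import pstdev, stdev

# Made-up values, chosen so the arithmetic is easy to follow
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(values)
mean = sum(values) / n
squared_deviations = sum((v - mean) ** 2 for v in values)

sample_std = sqrt(squared_deviations / (n - 1))  # formula used in this notebook
population_std = sqrt(squared_deviations / n)    # alternative with n in the denominator

print(sample_std, population_std)

# The standard library implements both variants
print(isclose(sample_std, stdev(values)))       # True
print(isclose(population_std, pstdev(values)))  # True
```

The two results differ noticeably on small samples and converge as n grows, so which one a library uses matters mostly when comparing outputs between tools.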

#### Standardization (z-score) formula
\begin{align}
zscore = \frac{values_i - mean}{std}
\end{align}
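The purpose of standardization is to shift and rescale the data points so that the result is centered at 0 with a standard deviation of 1. The numerator gives the position of each data point relative to the mean; in other words, it <b>shifts</b> the data so that the mean after shifting is 0. For example, if a data point in X is exactly equal to the mean, the numerator of its z-score is mean - mean = 0. More generally, the deviations from the mean always sum to zero, which is why the standardized data has a mean of 0. The denominator rescales the data by converting each distance from the mean into units of the standard deviation. The z-score of a data point is therefore its distance from the arithmetic mean, measured in standard deviations. The standard deviation becomes 1 because dividing every deviation by std also divides the standard deviation itself by std.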
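
The two steps can be checked on a small made-up sample. The helper `z_scores` below is a hypothetical name, and it uses the sample standard deviation (n - 1) to match the formula above.

```python
from statistics import mean, stdev


def z_scores(values):
    """Return (value - mean) / std for every value, using the sample std."""
    m = mean(values)
    s = stdev(values)
    if s == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(v - m) / s for v in values]


# Made-up heights in cm; the middle value equals the mean (170)
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
z = z_scores(heights)

print(z)
print(z[2])      # 0.0, because this data point equals the mean
print(mean(z))   # approximately 0
print(stdev(z))  # approximately 1
```

The spacing between the z-scores is still even, just as it was between the original heights, which illustrates that the shape of the data is unchanged.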

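## Further reading

#### Further reading on the standardization (z-score) formula:
Khan Academy: <a href='https://www.youtube.com/watch?v=Wp2nVIzBsE8'>z-score</a><br>
熊仔高中數學: <a href='https://www.youtube.com/watch?v=e5sz5NMqay4'>derivation of the standardization formula</a> (from 24:00 to 36:00)<br>
Why the denominator of the standard deviation formula is n-1 rather than n: <a href='https://www.youtube.com/watch?v=9ONRMymR2Eg'>explanation video</a><br>
Why the denominator of the standard deviation formula is n-1 rather than n: <a href='https://www.youtube.com/watch?v=D1hgiAla3KI'>derivation video</a>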

### 2-1 Python implementation

In [11]:
from math import sqrt


def column_means(dataset):
    """Return the arithmetic mean of every column."""
    means = []

    for column in range(len(dataset[0])):
        column_values = [row[column] for row in dataset]
        means.append(sum(column_values) / float(len(column_values)))

    return means


def column_stds(dataset, means):
    """Return the sample standard deviation (n - 1 denominator) of every column."""
    stds = []

    for column in range(len(dataset[0])):
        squared_deviations = sum([pow((row[column] - means[column]), 2) for row in dataset])
        stds.append(sqrt(squared_deviations / (len(dataset) - 1)))

    return stds


def column_standardization(dataset, means, stds):
    """Replace every value with its z-score, in place."""
    for column in range(len(dataset[0])):
        for row in dataset:
            row[column] = (row[column] - means[column]) / stds[column]

In [47]:
dataset = load_csv('data/pima-indians-diabetes.data.csv')
print(dataset[0])

for i in range(len(dataset[0])):
    str_column_to_float(dataset, i)
print(dataset[0])

dataset_means = column_means(dataset)
print(dataset_means)

dataset_stds = column_stds(dataset, dataset_means)
print(dataset_stds)

column_standardization(dataset, dataset_means, dataset_stds)
print(dataset[0])

['6', '148', '72', '35', '0', '33.6', '0.627', '50', '1']
[6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0]
[3.8450520833333335, 120.89453125, 69.10546875, 20.536458333333332, 79.79947916666667, 31.992578124999977, 0.4718763020833327, 33.240885416666664, 0.3489583333333333]
[3.3695780626988623, 31.97261819513622, 19.355807170644777, 15.952217567727677, 115.24400235133837, 7.8841603203754405, 0.33132859501277484, 11.76023154067868, 0.4769513772427971]
[0.6395304921176576, 0.8477713205896718, 0.14954329852954296, 0.9066790623472505, -0.692439324724129, 0.2038799072674717, 0.468186870229798, 1.4250667195933604, 1.3650063669598067]


### 2-2 Pandas implementation

In [8]:
def standardized_df(df, cols):
    """
    Standardize dataframe by given columns.

    parameters:
    df(pandas.DataFrame):The input dataframe.
    cols(list):A list of column names that need to be standardized.

    return:
    df(pandas.DataFrame):The standardized dataframe.
    """
    df = df.copy()

    for col in cols:
        # Series.std() uses the n - 1 denominator by default, matching the formula above
        df[col] = (df[col] - df[col].mean()) / df[col].std()

    return df


#load Pima Indians Diabetes Dataset
data = pd.read_csv('data/pima-indians-diabetes.data.csv')
#standardize all columns
standardized_dataset = standardized_df(data, data.columns)

standardized_dataset.describe()

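|  | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 |
| mean | -2.779176e-17 | 6.716345e-17 | -2.570738e-16 | 1.852784e-17 | -2.315980e-17 | 5.604670e-16 | -2.547578e-17 | 1.806464e-16 | 4.168764e-17 |
| std | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| min | -1.139835 | -3.779393 | -3.567800 | -1.286043 | -0.693107 | -4.055028 | -1.187563 | -1.039715 | -0.730277 |
| 25% | -0.843176 | -0.683560 | -0.366669 | -1.286043 | -0.693107 | -0.594553 | -0.688313 | -0.784446 | -0.730277 |
| 50% | -0.249859 | -0.120681 | 0.149643 | 0.155597 | -0.415529 | 0.001206 | -0.303695 | -0.358997 | -0.730277 |
| 75% | 0.640117 | 0.598553 | 0.562692 | 0.719716 | 0.412866 | 0.584290 | 0.462525 | 0.662080 | 1.367559 |
| max | 3.903364 | 2.443544 | 2.731200 | 4.919274 | 6.645350 | 4.450388 | 5.877350 | 4.065671 | 1.367559 |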


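## More data scaling methods
There are many ways to scale data. Some further methods are listed below:
<li>Normalizing data to values between -1 and 1
<li>Standardizing data to a standard deviation greater than 1
<li>Nonlinear transforms such as the logarithm, square root, and exponential
<li>Power transforms such as Box-Cox, used to correct skewed distributions.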
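
The last two items in the list can be sketched in plain Python. The sample values are made up, `box_cox` is a hypothetical helper, and the lambda value is fixed by hand here purely for illustration; in practice lambda is usually estimated from the data.

```python
from math import log, sqrt


def box_cox(values, lam):
    """Box-Cox transform for strictly positive values and a fixed lambda.

    Uses (x**lam - 1) / lam when lam != 0 and log(x) when lam == 0.
    """
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


# Made-up right-skewed values spanning several orders of magnitude
skewed = [1.0, 10.0, 100.0, 1000.0]

log_scaled = [log(v) for v in skewed]    # logarithm transform
sqrt_scaled = [sqrt(v) for v in skewed]  # square-root transform
bc_half = box_cox(skewed, 0.5)           # lambda = 0.5, chosen by hand
bc_zero = box_cox(skewed, 0)             # lambda = 0 reduces to the logarithm

print(log_scaled)
print(sqrt_scaled)
print(bc_half)
print(bc_zero == log_scaled)  # True
```

All three transforms compress the large values far more than the small ones, which is what pulls in the long right tail of a skewed column.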

In [9]:
#Normalize data so values lie between -1 and 1
def normalized_df_minus_one_to_one(df, cols):
    """
    Normalize dataframe by given columns to the range [-1, 1].

    parameters:
    df(pandas.DataFrame):The input dataframe.
    cols(list):A list of column names that need to be normalized.

    return:
    df(pandas.DataFrame):The normalized dataframe.
    """
    df = df.copy()

    for col in cols:
        # Scale to [0, 1] first, then stretch to [0, 2] and shift to [-1, 1]
        df[col] = ((df[col] - df[col].min()) / (df[col].max() - df[col].min())) * 2 - 1

    return df


#load Pima Indians Diabetes Dataset
data = pd.read_csv('data/pima-indians-diabetes.data.csv')
#normalize all columns
normalized_data = normalized_df_minus_one_to_one(data, data.columns)

normalized_data.describe()

|  | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 |
| mean | -0.547971 | 0.214665 | 0.132815 | -0.585503 | -0.811103 | -0.046483 | -0.663814 | -0.592699 | -0.303781 |
| std | 0.396574 | 0.321392 | 0.317511 | 0.322304 | 0.272537 | 0.235144 | 0.283089 | 0.391743 | 0.953364 |
| min | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 25% | -0.882353 | -0.005025 | 0.016393 | -1.0 | -1.0 | -0.186289 | -0.858668 | -0.9 | -1.0 |
| 50% | -0.647059 | 0.175879 | 0.180328 | -0.535354 | -0.92435 | -0.0462 | -0.749787 | -0.733333 | -1.0 |
| 75% | -0.294118 | 0.407035 | 0.311475 | -0.353535 | -0.698582 | 0.090909 | -0.532878 | -0.333333 | 1.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [11]:
def standardized_df_with_n_std(df, cols, n_std):
    """
    Standardize dataframe by given columns to mean 0 and standard deviation n_std.

    parameters:
    df(pandas.DataFrame):The input dataframe.
    cols(list):A list of column names that need to be standardized.
    n_std(float):The standard deviation each column should have afterwards.

    return:
    df(pandas.DataFrame):The standardized dataframe.
    """
    df = df.copy()

    for col in cols:
        # The z-score has standard deviation 1, so multiplying by n_std gives n_std
        df[col] = ((df[col] - df[col].mean()) * n_std) / df[col].std()

    return df


n_std = 2
#load Pima Indians Diabetes Dataset
data = pd.read_csv('data/pima-indians-diabetes.data.csv')
#standardize all columns
standardized_dataset = standardized_df_with_n_std(data, data.columns, n_std)

standardized_dataset.describe()

|  | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 |
| mean | -5.558353e-17 | 1.343269e-16 | -5.141476e-16 | 3.705568e-17 | -4.631960e-17 | 1.120934e-15 | -5.095157e-17 | 3.612929e-16 | 8.337529e-17 |
| std | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| min | -2.27967 | -7.558785 | -7.135599 | -2.572085 | -1.386214 | -8.110055 | -2.375126 | -2.07943 | -1.460553 |
| 25% | -1.686352 | -1.367119 | -0.7333373 | -2.572085 | -1.386214 | -1.189106 | -1.376627 | -1.568891 | -1.460553 |
| 50% | -0.4997172 | -0.2413619 | 0.2992856 | 0.3111934 | -0.8310588 | 0.002412851 | -0.6073898 | -0.7179934 | -1.460553 |
| 75% | 1.280235 | 1.197106 | 1.125384 | 1.439433 | 0.8257321 | 1.16858 | 0.9250508 | 1.324161 | 2.735118 |
| max | 7.806728 | 4.887089 | 5.4624 | 9.838549 | 13.2907 | 8.900777 | 11.7547 | 8.131342 | 2.735118 |
