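# Data Normalization and Standardization
Scaling data is an important step in a machine learning pipeline, because many machine learning algorithms perform better when their input features are on comparable scales. There are two main scaling methods: normalization and standardization. This notebook covers the following topics: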
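<li>Normalizing data
<li>Standardizing data
<li>When to use normalization versus standardization
<li>Machine learning models and the scaling method that suits each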

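## Dataset
This article uses the Pima Indians Diabetes Dataset. The data is publicly available on Kaggle, and the task is to predict whether a patient develops diabetes within five years.
The file is in the data folder inside this directory, and its name is pima-indians-diabetes.data.csv.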

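## 1. Normalization
Normalization can be done in several ways. In this notebook, it means rescaling numeric values to the range 0 to 1. The purpose is to keep features with large numeric ranges from dominating models that are sensitive to feature scale; without rescaling, such a model tends to weight large-valued features heavily and neglect small-valued ones. After normalization, features measured on different scales share a common one. For example, height and weight are both numeric measurements, but their units and ranges differ. Once both are rescaled to values between 0 and 1, the same scale expresses how large or small either quantity is.

Normalization requires knowing the minimum and maximum of the data beforehand. Sometimes these are found by inspecting the data, and sometimes they come from domain knowledge. For example, in the multidimensional array that represents an 8-bit image, the minimum is 0 and the maximum is 255, because each pixel of an 8-bit image is a non-negative integer from 0 to 255. That property can be used directly to normalize the image. The normalization formula is: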

\begin{align}
\text{scaledValue} = \frac{\text{value} - \text{min}}{\text{max} - \text{min}}
\end{align}
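
As a quick check of the formula, the sketch below applies it to a few illustrative 8-bit pixel values (known range 0 to 255) and to the first record's glucose reading of 148, using the column range 0 to 199 that appears in the output further down in this notebook. The helper name `min_max_scale` is made up for this illustration.

```python
def min_max_scale(value, lo, hi):
    """Map value from the range [lo, hi] onto [0, 1]."""
    return (value - lo) / (hi - lo)


# Illustrative 8-bit pixel values; the range 0-255 is known in advance
pixels = [0, 64, 128, 255]
scaled_pixels = [min_max_scale(p, 0, 255) for p in pixels]
print(scaled_pixels)  # the first value is 0.0 and the last is 1.0

# Glucose value of the first record, with the column's observed min and max
scaled_glucose = min_max_scale(148.0, 0.0, 199.0)
print(scaled_glucose)  # 0.7437185929648241
```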

### Python implementation

In [25]:
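from csv import reader


def load_csv(path):
    """Read a CSV file into a list of rows (each row is a list of strings)."""
    dataset = []

    # newline='' lets the csv module handle line endings itself
    with open(path, 'r', newline='') as f:
        csv_reader = reader(f)

        for row in csv_reader:
            # skip blank lines
            if not row:
                continue
            dataset.append(row)

    return dataset


def str_column_to_float(dataset, column):
    """Convert one column of every row from string to float, in place."""
    for row in dataset:
        row[column] = float(row[column].strip())


def min_max(dataset):
    """Return a list of (min, max) tuples, one per column."""
    min_max_output = []

    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        _min = min(col_values)
        _max = max(col_values)
        min_max_output.append((_min, _max))

    return min_max_output


def normalize_dataset(dataset, dataset_min_max):
    """Rescale every column to the range 0-1, in place."""
    for column in range(len(dataset[0])):
        _min, _max = dataset_min_max[column]
        # a constant column has max == min; leave it unchanged to avoid dividing by zero
        if _max == _min:
            continue
        for row in dataset:
            row[column] = (row[column] - _min) / (_max - _min)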

In [30]:
dataset = load_csv('data/pima-indians-diabetes.data.csv')
print(dataset[0])

for i in range(len(dataset[0])):
    str_column_to_float(dataset, i)   
print(dataset[0])

dataset_min_max = min_max(dataset)
print(dataset_min_max)

normalize_dataset(dataset, dataset_min_max)
print(dataset[0])

['6', '148', '72', '35', '0', '33.6', '0.627', '50', '1']
[6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0]
[(0.0, 17.0), (0.0, 199.0), (0.0, 122.0), (0.0, 99.0), (0.0, 846.0), (0.0, 67.1), (0.078, 2.42), (21.0, 81.0), (0.0, 1.0)]
[0.35294117647058826, 0.7437185929648241, 0.5901639344262295, 0.35353535353535354, 0.0, 0.5007451564828614, 0.23441502988898377, 0.48333333333333334, 1.0]
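
To try the same per-column logic without the CSV file, here is a minimal sketch on a tiny in-memory table. The rows are made-up height (cm) and weight (kg) values, and `normalize_rows` is a hypothetical helper, not part of the code above. It returns a new list rather than modifying its input, and it maps a constant column to 0.0.

```python
def normalize_rows(rows):
    """Return a min-max normalized copy of rows (a list of equal-length float lists)."""
    # zip(*rows) transposes the table so each tuple holds one column
    bounds = [(min(col), max(col)) for col in zip(*rows)]
    return [
        [
            (value - lo) / (hi - lo) if hi != lo else 0.0  # constant column -> 0.0
            for value, (lo, hi) in zip(row, bounds)
        ]
        for row in rows
    ]


# Made-up sample: [height in cm, weight in kg]
rows = [[170.0, 60.0], [180.0, 80.0], [160.0, 70.0]]
normalized = normalize_rows(rows)
print(normalized)  # [[0.5, 0.0], [1.0, 1.0], [0.0, 0.5]]
```

Both columns end up between 0 and 1 even though the original units and ranges differ, which is the point of the height and weight example.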


### Pandas implementation

In [8]:
import pandas as pd

def min_max(pd_series):
    """
    Find the minimum and maximum values.
    
    parameters:
    pd_series(pandas.Series):A particular column from the original dataframe. 
    
    return:
    min(float):The minimum value in this series data.
    max(float):The maximum value in this series data.
    """
    return pd_series.min(), pd_series.max()

def normalized_df(df, cols):
    """
    Normalize dataframe by given columns.
    
    parameters:
    df(pandas.DataFrame):The input dataframe. 
    cols(list):A list of column names that need to be normalized.
    
    return:
    df(pandas.DataFrame):The normalized dataframe.
    """
    df = df.copy()
    
    for col in cols:
        _min, _max = min_max(df[col])
        df[col] = (df[col] - _min) / (_max - _min)
        
    return df


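# Load the Pima Indians Diabetes Dataset.
# Note: the file has no header row, so read_csv as called here uses the first
# record as column names. That is why the summary below reports 767 rows and
# labels the columns 6, 148, 72, ... Passing header=None would keep that record as data.
data = pd.read_csv('data/pima-indians-diabetes.data.csv')
# Normalize all columns
normalized_data = normalized_df(data, data.columns)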

normalized_data.describe()

| | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
|---|---|---|---|---|---|---|---|---|---|
| count | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 | 767.0 |
| mean | 0.226014 | 0.607333 | 0.566407 | 0.207248 | 0.094449 | 0.476758 | 0.168093 | 0.203651 | 0.34811 |
| std | 0.198287 | 0.160696 | 0.158755 | 0.161152 | 0.136268 | 0.117572 | 0.141545 | 0.195872 | 0.476682 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.058824 | 0.497487 | 0.508197 | 0.0 | 0.0 | 0.406855 | 0.070666 | 0.05 | 0.0 |
| 50% | 0.176471 | 0.58794 | 0.590164 | 0.232323 | 0.037825 | 0.4769 | 0.125107 | 0.133333 | 0.0 |
| 75% | 0.352941 | 0.703518 | 0.655738 | 0.323232 | 0.150709 | 0.545455 | 0.233561 | 0.333333 | 1.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
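
The column labels in the summary (6, 148, 72, ...) are the values of the first record, which suggests the CSV has no header row. The sketch below shows one way to keep that record as data: pass `header=None` together with explicit column names to `pd.read_csv`. It reads from a small in-memory string so that it runs on its own; the three records and the column names are made up for illustration and are not the real file's contents or schema. It also scales only the feature columns and leaves the label untouched, which is a design choice of this sketch rather than what the notebook code does.

```python
import io

import pandas as pd

# Made-up headerless records: three features followed by a 0/1 label
raw = io.StringIO(
    "6,148,72,1\n"
    "2,100,60,0\n"
    "4,199,90,1\n"
)

# Hypothetical column names; header=None keeps the first line as a data row
names = ["pregnancies", "glucose", "blood_pressure", "label"]
df = pd.read_csv(raw, header=None, names=names)

# Vectorized min-max scaling of the feature columns only
features = ["pregnancies", "glucose", "blood_pressure"]
col_min = df[features].min()
col_max = df[features].max()

scaled = df.copy()
scaled[features] = (df[features] - col_min) / (col_max - col_min)

print(len(scaled))  # all three records are kept
print(scaled)
```

For the real file, the same call would take the path instead of `raw`, with nine names matching its nine columns.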
