# Rescaling Columns through Normalization and Standardization

This example uses the pima-indians-diabetes dataset. The next three code blocks only read the CSV file from scratch, using Python's built-in `csv` module.

In [43]:
from csv import reader

In [44]:
def load_csv(filename):
    dataset = list()
    
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        
        for row in csv_reader:
            if not row:
                continue
            
            dataset.append(row)
    
    return dataset

def convert_col_to_float(dataset, col):
    for row in dataset:
        row[col] = float(row[col].strip())

In [45]:
filename = 'data/pima-indians-diabetes.data.csv'
dataset = load_csv(filename)

### Min-Max Normalization

Normalization means rescaling values into the range 0 to 1. The most common method is called 'min-max': we find the minimum and maximum values of each column/attribute in the dataset, then rescale each value with the following formula:

```
scaled_value = (value - minimum) / (maximum - minimum)
```
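
As a quick illustration of the formula, here is a minimal sketch on a made-up three-value column. The numbers are hypothetical and are not taken from the dataset.

```python
# Toy column (hypothetical values, not from the Pima dataset)
values = [2.0, 4.0, 10.0]

minimum = min(values)
maximum = max(values)

# Apply scaled_value = (value - minimum) / (maximum - minimum) to each value
scaled = [(value - minimum) / (maximum - minimum) for value in values]

print(scaled)  # [0.0, 0.25, 1.0]
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.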

Let's apply this to the dataset. We first convert every column to floats and look at the first four rows:

In [46]:
for col in range(len(dataset[0])):
    convert_col_to_float(dataset, col)
    
dataset[:4]

[[6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0],
 [1.0, 85.0, 66.0, 29.0, 0.0, 26.6, 0.351, 31.0, 0.0],
 [8.0, 183.0, 64.0, 0.0, 0.0, 23.3, 0.672, 32.0, 1.0],
 [1.0, 89.0, 66.0, 23.0, 94.0, 28.1, 0.167, 21.0, 0.0]]

We'll get the minimum and maximum values of the first column (6.0, 1.0, 8.0, 1.0, ...), then the second column (148.0, 85.0, 183.0, 89.0, ...), and so on. The function below returns a 2-dimensional list containing the minimum and maximum values of each column.

In [47]:
def get_min_max(dataset):
    min_max_values = list()
    
    for col in range(len(dataset[0])):
        col_values = [row[col] for row in dataset]
        
        minimum = min(col_values)
        maximum = max(col_values)
        
        min_max_values.append([minimum, maximum])
        
    return min_max_values
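
The same result can probably be obtained more compactly by transposing the rows with `zip(*dataset)`. The sketch below is one possible variant; the name `get_min_max_zip` and the toy data are illustrative assumptions, not part of the original notebook.

```python
def get_min_max_zip(dataset):
    # zip(*dataset) regroups the rows into columns, one tuple per column
    return [[min(column), max(column)] for column in zip(*dataset)]

# Toy dataset with two columns (hypothetical values)
toy = [[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]]

print(get_min_max_zip(toy))  # [[1.0, 3.0], [10.0, 30.0]]
```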

In [48]:
min_max_values = get_min_max(dataset)

print('Output shape: ', len(min_max_values), len(min_max_values[0]))

Output shape:  9 2


The dataset contains 9 columns, so the output list contains 9 items. Each item is a list of 2 values: the min and the max of that column.

In [49]:
min_max_values[:4]

[[0.0, 17.0], [0.0, 199.0], [0.0, 122.0], [0.0, 99.0]]

Now that we know the min and max of each column, we can use the formula to rescale each value in each column:
```
scaled_value = (value - minimum) / (maximum - minimum)
```

In [50]:
def normalize(dataset, min_max_values):
    for col in range(len(dataset[0])):
        for row in range(len(dataset)):
            dataset[row][col] = (dataset[row][col] - min_max_values[col][0]) / (min_max_values[col][1] - min_max_values[col][0])
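
One caveat: if a column holds a single repeated value, `maximum - minimum` is 0 and the division raises `ZeroDivisionError`. The sketch below shows one way to guard against that. The helper name `normalize_safe`, the choice to map constant columns to 0.0, and the decision to return a copy instead of modifying the dataset in place are all assumptions made for illustration.

```python
def normalize_safe(dataset, min_max_values):
    """Return a min-max scaled copy; constant columns map to 0.0 (an assumption)."""
    scaled = []
    for row in dataset:
        new_row = []
        for value, (minimum, maximum) in zip(row, min_max_values):
            span = maximum - minimum
            # Avoid dividing by zero when every value in the column is identical
            new_row.append((value - minimum) / span if span else 0.0)
        scaled.append(new_row)
    return scaled

# Toy data: the second column is constant (hypothetical values)
toy = [[1.0, 5.0], [3.0, 5.0]]
toy_min_max = [[1.0, 3.0], [5.0, 5.0]]

print(normalize_safe(toy, toy_min_max))  # [[0.0, 0.0], [1.0, 0.0]]
```

Returning a copy keeps the raw values available, which can be convenient when the same data is later rescaled a different way.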

In [51]:
normalize(dataset, min_max_values)

All the values are now rescaled to lie between 0 and 1.

In [54]:
print(dataset[0])

[0.35294117647058826, 0.7437185929648241, 0.5901639344262295, 0.35353535353535354, 0.0, 0.5007451564828614, 0.23441502988898377, 0.48333333333333334, 1.0]
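
As a sanity check, the first four values of this row can be recomputed by hand from the raw row and the min/max pairs printed earlier. The snippet below is a small sketch that hard-codes those printed numbers rather than reloading the file.

```python
from math import isclose

# Raw first row and min/max pairs, copied from the outputs shown earlier
first_row = [6.0, 148.0, 72.0, 35.0]
min_max = [[0.0, 17.0], [0.0, 199.0], [0.0, 122.0], [0.0, 99.0]]

# Normalized values printed by the notebook for the same four columns
expected = [0.35294117647058826, 0.7437185929648241,
            0.5901639344262295, 0.35353535353535354]

recomputed = [(value - low) / (high - low)
              for value, (low, high) in zip(first_row, min_max)]

# Compare with a tolerance instead of exact float equality
matches = all(isclose(got, want) for got, want in zip(recomputed, expected))
print(matches)  # True
```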


### Standardization

Standardization means rescaling the data so that it has a mean of 0 and a standard deviation of 1. Standardization works best when the data is approximately normally distributed. We'll use the same dataset for this example. To perform standardization, we first calculate the mean and standard deviation of the columns/attributes we want to rescale. Then, each value can be rescaled by the following formula:

```
scaled_value = (value - mean) / standard_deviation
```
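
To make the formula concrete, here is a minimal sketch on a made-up three-value column. The values are hypothetical, and the sample standard deviation (dividing by n - 1) is used.

```python
from math import sqrt

# Toy column (hypothetical values, not from the Pima dataset)
values = [2.0, 4.0, 6.0]

mean = sum(values) / len(values)

# Sample standard deviation: sum of squared differences divided by n - 1
std_dev = sqrt(sum((value - mean) ** 2 for value in values) / (len(values) - 1))

scaled = [(value - mean) / std_dev for value in values]

print(mean, std_dev, scaled)  # 4.0 2.0 [-1.0, 0.0, 1.0]
```

Each scaled value states how many standard deviations the original value lies from the mean.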

In [40]:
from math import sqrt

In [41]:
filename = 'data/pima-indians-diabetes.data.csv'
dataset = load_csv(filename)

for col in range(len(dataset[0])):
    convert_col_to_float(dataset, col)
    
dataset[:4]

[[6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0],
 [1.0, 85.0, 66.0, 29.0, 0.0, 26.6, 0.351, 31.0, 0.0],
 [8.0, 183.0, 64.0, 0.0, 0.0, 23.3, 0.672, 32.0, 1.0],
 [1.0, 89.0, 66.0, 23.0, 94.0, 28.1, 0.167, 21.0, 0.0]]

First, we define a function to get the mean of each column.

In [34]:
def get_col_means(dataset):
    col_means = list()
    
    for col in range(len(dataset[0])):
        col_items = [row[col] for row in dataset]
        
        col_mean = sum(col_items) / len(col_items)
        
        col_means.append(col_mean)
        
    return col_means

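Then, we define a function to compute the standard deviation of each column. We use the sample standard deviation, which divides the sum of squared differences from the mean by the number of values minus 1:
```
sum_squared_diffs = sum((value - mean)^2)
variance = sum_squared_diffs / (count(values) - 1)
standard_deviation = square_root(variance)
```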

In [35]:
def get_col_std_devs(dataset, col_means):
    col_std_devs = [0 for col in range(len(dataset[0]))]
    
    for col in range(len(dataset[0])):
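        # Sum of squared differences from the column mean
        sum_squared_diffs = sum((row[col] - col_means[col]) ** 2 for row in dataset)
        
        # Sample standard deviation: divide by n - 1 before taking the square root
        col_std_devs[col] = sqrt(sum_squared_diffs / (len(dataset) - 1))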
        
    return col_std_devs
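
The hand-written calculation can be cross-checked against Python's built-in `statistics` module, whose `stdev` function also computes the sample standard deviation. The sketch below runs the comparison on a small toy dataset with hypothetical values.

```python
from math import isclose, sqrt
from statistics import mean, stdev

# Toy dataset with two columns (hypothetical values)
toy = [[1.0, 10.0], [2.0, 20.0], [4.0, 60.0]]

manual_std_devs = []
for col in range(len(toy[0])):
    column = [row[col] for row in toy]
    col_mean = sum(column) / len(column)
    # Same formula as above: divide by n - 1, then take the square root
    manual = sqrt(sum((v - col_mean) ** 2 for v in column) / (len(column) - 1))

    # Both results should agree up to floating-point rounding
    assert isclose(col_mean, mean(column))
    assert isclose(manual, stdev(column))
    manual_std_devs.append(manual)

print(manual_std_devs)
```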

In [36]:
col_means = get_col_means(dataset)
std_devs = get_col_std_devs(dataset, col_means)

After obtaining the means and standard deviations, we can rescale the values of each column using the formula defined above:
```
scaled_value = (value - mean) / standard_deviation
```

In [37]:
def standardize(dataset, col_means, std_devs):
    for col in range(len(dataset[0])):
        for row in range(len(dataset)):
            dataset[row][col] = (dataset[row][col] - col_means[col]) / std_devs[col]

In [38]:
standardize(dataset, col_means, std_devs)

In [39]:
print(dataset[0])

[0.6395304921176576, 0.8477713205896718, 0.14954329852954296, 0.9066790623472505, -0.692439324724129, 0.2038799072674717, 0.468186870229798, 1.4250667195933604, 1.3650063669598067]
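
Note that the last value in this row is no longer 0 or 1: the final column, which appears to hold the class label, was standardized along with the features. If that is not wanted, one option is to skip the label column. The sketch below is a hypothetical helper (`standardize_features` is not part of the original notebook); it assumes the label is in the last column and that no feature column is constant.

```python
from math import sqrt

def standardize_features(dataset, label_col=-1):
    """Standardize every column except label_col, in place (hypothetical helper)."""
    n_cols = len(dataset[0])
    label_col = label_col % n_cols  # turn -1 into the index of the last column
    for col in range(n_cols):
        if col == label_col:
            continue  # leave the class label untouched
        column = [row[col] for row in dataset]
        mean = sum(column) / len(column)
        # Sample standard deviation (n - 1); assumes the column is not constant
        std_dev = sqrt(sum((v - mean) ** 2 for v in column) / (len(column) - 1))
        for row in dataset:
            row[col] = (row[col] - mean) / std_dev

# Toy dataset: one feature column and one 0/1 label column (hypothetical values)
toy = [[2.0, 1.0], [4.0, 0.0], [6.0, 1.0]]
standardize_features(toy)

print(toy)  # [[-1.0, 1.0], [0.0, 0.0], [1.0, 1.0]]
```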
