<a href="https://colab.research.google.com/github/yohanesnuwara/machine-learning/blob/master/02_datapreproc/scale_data_for_ml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Scale Machine Learning Data**

In [0]:
!git clone https://github.com/yohanesnuwara/machine-learning

Cloning into 'machine-learning'...
remote: Enumerating objects: 47, done.

## Normalize data

Find the minimum and maximum value of each column in the dataset.

In [0]:
# Find the min and max values for each column
def dataset_minmax(dataset):
  minmax = list()
  for i in range(len(dataset[0])):
    col_values = [row[i] for row in dataset]
    value_min = min(col_values)
    value_max = max(col_values)
    minmax.append([value_min, value_max])
  return minmax

In [0]:
# Small dataset example (2 columns and 4 rows)
dataset = [[50, 30], [20, 90], [40, 10], [5, 10]]
# Calculate min and max for each column
minmax = dataset_minmax(dataset)
print(minmax)

[[5, 50], [10, 90]]


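Rescale each column so that its minimum maps to 0 and its maximum maps to 1, using the column's own minimum and maximum:

$$\text{scaled value} = \frac{\text{value} - \text{min}}{\text{max} - \text{min}}$$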
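
As a worked example, take the first column of the small dataset above, whose minimum is 5 and maximum is 50. The value 20 is then scaled to:

$$\frac{20 - 5}{50 - 5} = \frac{15}{45} \approx 0.333$$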

In [0]:
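# Rescale dataset columns to the range 0-1 (modifies the rows in place)
def normalize_dataset(dataset, minmax):
  for row in dataset:
    for i in range(len(row)):
      col_min, col_max = minmax[i]
      # A constant column has zero range; map it to 0.0 to avoid dividing by zero
      if col_max == col_min:
        row[i] = 0.0
      else:
        row[i] = (row[i] - col_min) / (col_max - col_min)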

In [0]:
# Normalize columns
normalize_dataset(dataset, minmax)
print(dataset)

[[1.0, 0.25], [0.3333333333333333, 1.0], [0.7777777777777778, 0.0], [0.0, 0.0]]
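
Note that `normalize_dataset` overwrites the values in `dataset`, so the original numbers are lost after the call. Where the raw values are still needed, a non-mutating variant may be more convenient. The following is only a sketch, not part of the original notebook: `normalized_copy` is a hypothetical helper name, and mapping a constant column to 0.0 is an assumed convention.

```python
def dataset_minmax(dataset):
    """Return [min, max] for each column of a list-of-rows dataset."""
    return [[min(col), max(col)] for col in zip(*dataset)]

def normalized_copy(dataset, minmax):
    """Return a new min-max-scaled dataset; the input rows are left unchanged.

    Hypothetical helper, not part of the original notebook.
    """
    scaled = []
    for row in dataset:
        new_row = []
        for value, (lo, hi) in zip(row, minmax):
            span = hi - lo
            # Assumed convention: a constant column (span 0) maps to 0.0
            new_row.append((value - lo) / span if span else 0.0)
        scaled.append(new_row)
    return scaled

dataset = [[50, 30], [20, 90], [40, 10], [5, 10]]
minmax = dataset_minmax(dataset)
scaled = normalized_copy(dataset, minmax)
print(scaled)
print(dataset)  # the original values are still intact
```

The cost of this approach is memory: both the raw and the scaled copy are held at the same time, which may matter for large datasets.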


Load the `pima-indians-diabetes` dataset with the helper functions in the `01_dataload` folder, then normalize it.

In [0]:
import sys
sys.path.append('/content/machine-learning/01_dataload')

from load_csv_data import *

# Load pima-indians-diabetes dataset
filename = '/content/machine-learning/datasets/pima-indians-diabetes.csv'
dataset = load_csv(filename)

# convert string columns to float
for i in range(len(dataset[0])):
  str_column_to_float(dataset, i)
print('Original dataset:', dataset[0])

# Calculate min and max for each column
minmax = dataset_minmax(dataset)
# Normalize columns
normalize_dataset(dataset, minmax)
print('Normalized dataset:', dataset[0])

Original dataset: [6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0]
Normalized dataset: [0.35294117647058826, 0.7437185929648241, 0.5901639344262295, 0.35353535353535354, 0.0, 0.5007451564828614, 0.23441502988898377, 0.48333333333333334, 1.0]
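
Because min-max scaling is a linear mapping, it can be undone as long as the `minmax` list is kept. The sketch below is not part of the original notebook: `denormalize_dataset` is a hypothetical helper name, and the example assumes that no column is constant. It uses the small four-row dataset so that it runs without the CSV file.

```python
def dataset_minmax(dataset):
    """Return [min, max] for each column of a list-of-rows dataset."""
    return [[min(col), max(col)] for col in zip(*dataset)]

def normalize_dataset(dataset, minmax):
    """Rescale every column to the range 0-1, in place."""
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])

def denormalize_dataset(dataset, minmax):
    """Hypothetical inverse: map 0-1 values back to the original scale, in place."""
    for row in dataset:
        for i in range(len(row)):
            row[i] = row[i] * (minmax[i][1] - minmax[i][0]) + minmax[i][0]

original = [[50, 30], [20, 90], [40, 10], [5, 10]]
dataset = [row[:] for row in original]  # work on a copy of each row
minmax = dataset_minmax(dataset)
normalize_dataset(dataset, minmax)
denormalize_dataset(dataset, minmax)
print(dataset)  # matches `original` up to floating-point rounding
```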


## Standardize data

Standardization shifts each column so that its mean is 0 and rescales it so that its standard deviation is 1. It relies on the mean and standard deviation, the two values that summarize a normal distribution.

In [0]:
# Calculate mean
def mean(numbers):
  return sum(numbers)/float(len(numbers))

# Calculate column mean
def column_mean(dataset):
  mean_col = [(mean(column)) for column in zip(*dataset)]
  # del(mean_col[-1])
  return mean_col

# Calculate the sample standard deviation of a list of numbers
# (divides by n - 1, so it needs at least two values)
def stdev(numbers):
  from math import sqrt
  avg = mean(numbers)
  variance = sum([(x-avg)**2 for x in numbers]) / float(len(numbers)-1)
  return sqrt(variance)

# Calculate column standard deviation
def column_stdev(dataset):
  stdev_col = [(stdev(column)) for column in zip(*dataset)]
  # del(stdev_col[-1])
  return stdev_col

Calculate the mean and standard deviation of each column of a simple dataset.

In [139]:
# simple dataset (2 columns and 3 rows)
dataset = [[50, 30], [20, 90], [30, 50]]
means = column_mean(dataset)
stdevs = column_stdev(dataset)
print('Mean of column 1:', means[0], 'and column 2:', means[1])
print('Standard deviation of column 1:', stdevs[0], 'and column 2:', stdevs[1])

Mean of column 1: 33.333333333333336 and column 2: 56.666666666666664
Standard deviation of column 1: 15.275252316519467 and column 2: 30.550504633038933
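
These values can be cross-checked against Python's standard `statistics` module. The sketch below is an addition to the notebook, not the author's code. It also shows the difference between the sample standard deviation (`statistics.stdev`, which divides by n - 1 like the `stdev` function above) and the population standard deviation (`statistics.pstdev`, which divides by n).

```python
import statistics

dataset = [[50, 30], [20, 90], [30, 50]]
columns = list(zip(*dataset))  # transpose rows into columns

for number, column in enumerate(columns, start=1):
    print('Column', number)
    print('  mean:            ', statistics.mean(column))
    print('  sample stdev:    ', statistics.stdev(column))   # divides by n - 1
    print('  population stdev:', statistics.pstdev(column))  # divides by n
```

The sample standard deviation should agree with the hand-written result, while the population value is somewhat smaller for a dataset this small.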


Standardize each value by subtracting its column mean and dividing by its column standard deviation:

$$\text{standardized value}_i = \frac{\text{value}_i - \text{mean}}{\text{stdev}}$$

In [0]:
# standardize dataset
def standardize_dataset(dataset, means, stdevs):
  for row in dataset:
    for i in range(len(row)):
      row[i] = (row[i] - means[i]) / stdevs[i]

In [141]:
# standardize dataset
standardize_dataset(dataset, means, stdevs)
print('Standardized dataset:', dataset)

Standardized dataset: [[1.0910894511799618, -0.8728715609439694], [-0.8728715609439697, 1.091089451179962], [-0.21821789023599253, -0.2182178902359923]]
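
A quick way to confirm that standardization worked is to recompute the mean and standard deviation of each standardized column, which should come out very close to 0 and 1. The following self-contained sketch is an addition to the notebook. It redefines `mean` and `stdev` so that it can run on its own, and builds a new list (`standardized`, an illustrative name) instead of modifying the dataset in place.

```python
from math import sqrt

def mean(numbers):
    return sum(numbers) / len(numbers)

def stdev(numbers):
    # Sample standard deviation (n - 1 in the denominator)
    avg = mean(numbers)
    return sqrt(sum((x - avg) ** 2 for x in numbers) / (len(numbers) - 1))

dataset = [[50, 30], [20, 90], [30, 50]]
means = [mean(col) for col in zip(*dataset)]
stdevs = [stdev(col) for col in zip(*dataset)]

# Apply (value - mean) / stdev column by column
standardized = [[(v - m) / s for v, m, s in zip(row, means, stdevs)]
                for row in dataset]

for col in zip(*standardized):
    # Tiny deviations from 0 and 1 are ordinary floating-point rounding
    print('mean:', mean(col), 'stdev:', stdev(col))
```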


Apply standardization to the `pima-indians-diabetes` dataset. All nine columns are standardized here, including the final class-label column.

In [149]:
# Load pima-indians-diabetes dataset
filename = '/content/machine-learning/datasets/pima-indians-diabetes.csv'
dataset = load_csv(filename)
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename, len(dataset), len(dataset[0])))

# convert string columns to float
for i in range(len(dataset[0])):
  str_column_to_float(dataset, i)
print('Original dataset:', dataset[0])

# Estimate mean and standard deviation
means = column_mean(dataset)
stdevs = column_stdev(dataset)

# standardize dataset
standardize_dataset(dataset, means, stdevs)
print('Standardized dataset:', dataset[0])

Loaded data file /content/machine-learning/datasets/pima-indians-diabetes.csv with 768 rows and 9 columns
Original dataset: [6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0]
Standardized dataset: [0.6395304921176576, 0.8477713205896718, 0.14954329852954296, 0.9066790623472505, -0.692439324724129, 0.2038799072674717, 0.468186870229798, 1.4250667195933604, 1.3650063669598067]


## Extension: Other data transform techniques

* Exponential transforms (logarithm, exponent, square root)
* Power transforms, e.g. Box-Cox, to reduce skew so that the data is closer to a normal distribution
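
The transforms listed above can be sketched in plain Python. This is an illustration added to the notebook, not the author's code: the `box_cox` helper, the sample values in `skewed`, and the fixed `lmbda = 0.5` are all assumptions made for the example. In practice the Box-Cox parameter is usually estimated from the data (for instance, `scipy.stats.boxcox` can do this) rather than chosen by hand.

```python
from math import log, sqrt

def box_cox(value, lmbda):
    """Box-Cox transform of one strictly positive value for a fixed lambda."""
    if value <= 0:
        raise ValueError('Box-Cox requires strictly positive values')
    if lmbda == 0:
        return log(value)  # the lambda = 0 case is defined as the logarithm
    return (value ** lmbda - 1) / lmbda

# Illustrative right-skewed values (assumed, not taken from a real dataset)
skewed = [1, 2, 4, 8, 16, 32, 64]

log_values = [log(v) for v in skewed]
sqrt_values = [sqrt(v) for v in skewed]
boxcox_values = [box_cox(v, 0.5) for v in skewed]

print('log:    ', log_values)
print('sqrt:   ', sqrt_values)
print('Box-Cox:', boxcox_values)
```

All three transforms compress large values more than small ones, which is why they tend to reduce right skew. They require positive inputs (non-negative for the square root), so data containing zeros or negative values needs shifting or a different transform first.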

In [0]:
import seaborn as sns
