<a href="https://colab.research.google.com/github/yohanesnuwara/machine-learning/blob/master/02_datascaling/scale_data_for_ml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Scale Machine Learning Data**

In [0]:
!git clone https://github.com/yohanesnuwara/machine-learning

Cloning into 'machine-learning'...

## Normalize data

Find the minimum and maximum value of each column in the dataset.

In [0]:
# Find the min and max values for each column
def dataset_minmax(dataset):
  minmax = list()
  for i in range(len(dataset[0])):
    col_values = [row[i] for row in dataset]
    value_min = min(col_values)
    value_max = max(col_values)
    minmax.append([value_min, value_max])
  return minmax

In [0]:
# Small dataset example (2 columns and 4 rows)
dataset = [[50, 30], [20, 90], [40, 10], [5, 10]]
# Calculate min and max for each column
minmax = dataset_minmax(dataset)
print(minmax)

[[5, 50], [10, 90]]
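The same column-wise scan can be written more compactly by transposing the rows with `zip(*dataset)`. The sketch below is only an illustration; the name `dataset_minmax_zip` is hypothetical and not part of the repository.

```python
# Hypothetical helper (not in the repository): same result as dataset_minmax
def dataset_minmax_zip(dataset):
  # zip(*dataset) turns the list of rows into a sequence of columns
  return [[min(column), max(column)] for column in zip(*dataset)]

example_rows = [[50, 30], [20, 90], [40, 10], [5, 10]]
example_minmax = dataset_minmax_zip(example_rows)
print(example_minmax)  # [[5, 50], [10, 90]]
```

Both versions assume that every row has the same number of columns.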


Rescale each column so that its minimum maps to 0 and its maximum maps to 1, using the column's own `min` and `max`:

$$\text{scaled value} = \frac{\text{value} - \text{min}}{\text{max} - \text{min}}$$

In [0]:
# Rescale dataset columns to the range 0-1
def normalize_dataset(dataset, minmax):
  for row in dataset:
    for i in range(len(row)):
      row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])

In [0]:
# Normalize columns
normalize_dataset(dataset, minmax)
print(dataset)

[[1.0, 0.25], [0.3333333333333333, 1.0], [0.7777777777777778, 0.0], [0.0, 0.0]]
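Two properties of `normalize_dataset` are worth keeping in mind: it overwrites the rows in place, and it divides by zero when a column is constant (`max == min`). A possible variant that returns a new list and guards against the zero range is sketched below. The name `normalized_copy` is hypothetical, and mapping a constant column to `0.0` is an assumption made for this sketch, not a convention taken from the notebook.

```python
# Hypothetical variant: returns a scaled copy and guards against max == min
def normalized_copy(dataset, minmax):
  scaled_rows = []
  for row in dataset:
    new_row = []
    for value, (col_min, col_max) in zip(row, minmax):
      span = col_max - col_min
      # Assumption: a constant column carries no information, so map it to 0.0
      new_row.append((value - col_min) / span if span != 0 else 0.0)
    scaled_rows.append(new_row)
  return scaled_rows

raw_rows = [[50, 30, 7], [20, 90, 7], [40, 10, 7], [5, 10, 7]]  # third column is constant
raw_minmax = [[min(col), max(col)] for col in zip(*raw_rows)]
scaled_rows = normalized_copy(raw_rows, raw_minmax)
print(scaled_rows[0])  # [1.0, 0.25, 0.0]
print(raw_rows[0])     # [50, 30, 7] (the input is left unchanged)
```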


Load the `pima-indians-diabetes` dataset (using the helper functions in the `01_dataload` folder) and normalize it.

In [0]:
import sys
sys.path.append('/content/machine-learning/01_dataload')

# import only the two helpers used below
from load_csv_data import load_csv, str_column_to_float

# Load pima-indians-diabetes dataset
filename = '/content/machine-learning/datasets/pima-indians-diabetes.csv'
dataset = load_csv(filename)

# convert string columns to float
for i in range(len(dataset[0])):
  str_column_to_float(dataset, i)
print('Original dataset:', dataset[0])

# Calculate min and max for each column
minmax = dataset_minmax(dataset)
# Normalize columns
normalize_dataset(dataset, minmax)
print('Normalized dataset:', dataset[0])

Original dataset: [6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0]
Normalized dataset: [0.35294117647058826, 0.7437185929648241, 0.5901639344262295, 0.35353535353535354, 0.0, 0.5007451564828614, 0.23441502988898377, 0.48333333333333334, 1.0]


## Standardize data

Standardization rescales each column so that its values are centered on 0 with a standard deviation of 1. It uses the mean and standard deviation of each column, which summarize the column well when its values are approximately normally distributed.
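As a quick illustration of the idea, the arithmetic for a single column can be worked through by hand. This is only a sketch: the three values and the variable names are chosen for the example, and the sample standard deviation (dividing by `n - 1`) is assumed.

```python
from math import sqrt

column = [50, 20, 30]
n = len(column)

col_mean = sum(column) / n
# Sample standard deviation: divide the summed squared deviations by n - 1
col_stdev = sqrt(sum((x - col_mean) ** 2 for x in column) / (n - 1))

z_scores = [(x - col_mean) / col_stdev for x in column]
print(round(col_mean, 4), round(col_stdev, 4))  # 33.3333 15.2753
print([round(z, 4) for z in z_scores])          # [1.0911, -0.8729, -0.2182]
```

The standardized values sum to (approximately) zero, which is what centering on 0 means in practice.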

In [0]:
# Column statistics needed for standardization
from math import sqrt

# calculate column means
def column_means(dataset):
  means = [0 for i in range(len(dataset[0]))]
  for i in range(len(dataset[0])):
    col_values = [row[i] for row in dataset]
    means[i] = sum(col_values) / float(len(dataset))
  return means

# calculate column standard deviations (sample standard deviation, n - 1)
def column_stdevs(dataset, means):
  stdevs = [0 for i in range(len(dataset[0]))]
  for i in range(len(dataset[0])):
    variance = [pow(row[i]-means[i], 2) for row in dataset]
    stdevs[i] = sum(variance)
  # take the square root once, after the loop; doing it inside the loop
  # would re-apply it to the columns that were already processed
  stdevs = [sqrt(x/(float(len(dataset)-1))) for x in stdevs]
  return stdevs

Calculate `mean` and `standard deviation` of the simple dataset.

In [0]:
# Example of calculating stats on a contrived dataset
dataset = [[50, 30], [20, 90], [30, 50]]

# Estimate mean and standard deviation
means = column_means(dataset)
stdevs = column_stdevs(dataset, means)
print(stdevs)

[15.275252316519467, 30.550504633038933]


Standardize the dataset based on the equation:

$$\text{standardized value}_i = \frac{\text{value}_i - \text{mean}}{\text{stdev}}$$

Here `mean` and `stdev` are the mean and sample standard deviation of the column that the value belongs to.

In [0]:
# standardize dataset
def standardize_dataset(dataset, means, stdevs):
  for row in dataset:
    for i in range(len(row)):
      row[i] = (row[i] - means[i]) / stdevs[i]

In [0]:
# standardize dataset (values are rounded only for display)
standardize_dataset(dataset, means, stdevs)
print('Standardized dataset:', [[round(value, 4) for value in row] for row in dataset])

Standardized dataset: [[1.0911, -0.8729], [-0.8729, 1.0911], [-0.2182, -0.2182]]


Apply the same steps to the `pima-indians-diabetes` dataset.

In [0]:
# Load pima-indians-diabetes dataset
filename = '/content/machine-learning/datasets/pima-indians-diabetes.csv'
dataset = load_csv(filename)
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename, len(dataset), len(dataset[0])))

# convert string columns to float
for i in range(len(dataset[0])):
  str_column_to_float(dataset, i)
print(dataset[0])

# Estimate mean and standard deviation
means = column_means(dataset)
stdevs = column_stdevs(dataset, means)

# standardize dataset
standardize_dataset(dataset, means, stdevs)
# round only for display; the stored values keep full precision
print([round(value, 4) for value in dataset[0]])

Loaded data file /content/machine-learning/datasets/pima-indians-diabetes.csv with 768 rows and 9 columns
[6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0]
[0.6395, 0.8478, 0.1495, 0.9067, -0.6924, 0.2039, 0.4682, 1.4251, 1.365]
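Note that the loop above standardizes every column, including the last one, so the value `1.0` in the final position of the first row becomes roughly `1.365`. If that column holds a 0/1 class label, as is assumed here for this dataset, it may be preferable to leave it untouched. One way to do that is sketched below; `standardize_features` and its `label_index` parameter are hypothetical names introduced for illustration, and the toy rows are made up.

```python
from math import sqrt

# Hypothetical helper: standardize every column except the label column
def standardize_features(dataset, label_index=-1):
  n_cols = len(dataset[0])
  label = label_index % n_cols  # convert a negative index to a positive one
  columns = list(zip(*dataset))
  means = [sum(col) / len(col) for col in columns]
  # Sample standard deviation (n - 1), matching column_stdevs
  stdevs = [sqrt(sum((x - m) ** 2 for x in col) / (len(col) - 1))
            for col, m in zip(columns, means)]
  result = []
  for row in dataset:
    result.append([
      value if i == label else (value - means[i]) / stdevs[i]
      for i, value in enumerate(row)
    ])
  return result

toy_rows = [[50.0, 30.0, 1.0], [20.0, 90.0, 0.0], [30.0, 50.0, 1.0]]
toy_scaled = standardize_features(toy_rows)
print([row[-1] for row in toy_scaled])  # [1.0, 0.0, 1.0]
```

The sketch returns a new list instead of modifying the input, and it would still divide by zero for a constant feature column.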


In [0]:
from math import sqrt

# Calculate the mean of a list of numbers
def mean(numbers):
  return sum(numbers) / float(len(numbers))

# Calculate the (sample) standard deviation of a list of numbers
def stdev(numbers):
  avg = mean(numbers)
  variance = sum([(x-avg)**2 for x in numbers]) / float(len(numbers)-1)
  return sqrt(variance)

# Standard deviation of each column; zip(*dataset) transposes rows into columns
def summarize_dataset(dataset):
  summaries = [stdev(column) for column in zip(*dataset)]
  return summaries

dataset = [[50, 30], [20, 90], [30, 50]]
summary = summarize_dataset(dataset)
print(summary)

[15.275252316519467, 30.550504633038933]
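These values can be cross-checked against the standard library. `statistics.stdev` also computes the sample standard deviation (dividing by `n - 1`), so it should agree with the result printed above up to floating-point rounding. The variable names in this sketch are chosen for the example.

```python
from math import isclose
from statistics import stdev as library_stdev  # aliased to avoid shadowing stdev above

check_rows = [[50, 30], [20, 90], [30, 50]]
library_values = [library_stdev(column) for column in zip(*check_rows)]

# Values printed by the from-scratch implementation
from_scratch = [15.275252316519467, 30.550504633038933]
agrees = all(isclose(a, b, rel_tol=1e-9) for a, b in zip(library_values, from_scratch))
print(agrees)  # True
```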


In [0]:
from math import sqrt

# Calculate the mean of a list of numbers
def mean(numbers):
  return sum(numbers) / float(len(numbers))

# Calculate the (sample) standard deviation of a list of numbers
def stdev(numbers):
  avg = mean(numbers)
  variance = sum([(x-avg)**2 for x in numbers]) / float(len(numbers)-1)
  return sqrt(variance)

# Summarize each column as a (mean, standard deviation) pair
def summarize_dataset(dataset):
  summaries = [(mean(column), stdev(column)) for column in zip(*dataset)]
  return summaries

dataset = [[50, 30], [20, 90], [30, 50]]
summary = summarize_dataset(dataset)
print(summary)