## STRUCTURED DATA NORMALIZATION AND STANDARDIZATION

It is often important to standardize and normalize numeric data so that values measured on different scales can be compared. Take temperature data, for example. If one compares Celsius and Fahrenheit temperatures purely by their numeric values, the Fahrenheit readings will usually appear larger, which can lead to misleading conclusions about which temperatures are warmer or colder. Thus, we would usually convert both to the same scale (either Celsius or Fahrenheit) before comparing them.
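As a minimal sketch of the temperature example, the snippet below converts a Fahrenheit reading to Celsius before comparing it with a Celsius reading. The function name `fahrenheit_to_celsius` and the two sample readings are illustrative assumptions, not values taken from a real data set.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0

# Hypothetical readings: one recorded in Celsius, one in Fahrenheit.
reading_c = 30.0
reading_f = 77.0

# Comparing the raw numbers suggests the Fahrenheit reading is warmer.
raw_says_f_is_warmer = reading_f > reading_c

# After converting to a common scale, the Celsius reading is the warmer one.
reading_f_in_c = fahrenheit_to_celsius(reading_f)
converted_says_f_is_warmer = reading_f_in_c > reading_c

print(raw_says_f_is_warmer, converted_says_f_is_warmer)
```

The raw comparison and the converted comparison disagree, which is why both values need to be on the same scale first.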

Taking this one step further, we often rescale values so that they fall within a specific range, which is called _normalization_. In many instances the values are normalized to the range 0.0 to 1.0, but any other target range can be used.

_Standardization_ typically rescales data using the mean $\mu$ and standard deviation $\sigma$ of a population of values, so that the rescaled values have a mean of 0 and a standard deviation of 1.

### MIN-MAX NORMALIZATION

For data values $V = (v_1, v_2, \ldots, v_i, \ldots, v_n)$,

$$
\mathrm{minmax}(V, v_i) = \frac{v_i - \min(V)}{\max(V) - \min(V)}
$$

In [1]:
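def minmax(V, v_i):
    """Rescale v_i to the range [0, 1] using the minimum and maximum of V."""
    lo, hi = min(V), max(V)
    # float() forces true division even when all values are integers
    return (v_i - lo) / float(hi - lo)

V = range(1, 10)
print([(i, minmax(V, i)) for i in V])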

[(1, 0.0), (2, 0.125), (3, 0.25), (4, 0.375), (5, 0.5), (6, 0.625), (7, 0.75), (8, 0.875), (9, 1.0)]
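Since the target range does not have to be 0.0 to 1.0, here is a hedged sketch of the same idea generalized to an arbitrary interval. The name `minmax_range` and the parameters `new_min` and `new_max` are hypothetical, and the guard against a zero-width input range is an added assumption rather than part of the formula above.

```python
def minmax_range(V, v_i, new_min=0.0, new_max=1.0):
    """Rescale v_i from the span of V to the interval [new_min, new_max]."""
    lo, hi = min(V), max(V)
    if hi == lo:
        # Assumed behavior: refuse to divide by zero when all values are equal.
        raise ValueError('min-max normalization needs at least two distinct values')
    unit = (v_i - lo) / float(hi - lo)  # position of v_i within [0, 1]
    return new_min + unit * (new_max - new_min)

V = range(1, 10)
scaled = [minmax_range(V, i, -1.0, 1.0) for i in V]
print(scaled)
```

With the default arguments the function reduces to the plain min-max formula; passing `-1.0` and `1.0` maps the smallest value to -1.0 and the largest to 1.0.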


### Z-SCORE STANDARDIZATION

For normally distributed data, Z-score standardization works well when you know the mean $\mu$ and standard deviation $\sigma$ of the data population.


$$
v_i' = \frac{v_i - \mu}{\sigma}
$$

In [2]:
def z_score(V, v_i):
    """Return the z-score of v_i relative to the population V."""

    ## inspired by python statistics library implementations
    def mean(data):
        """Return the arithmetic mean of data."""
        n = len(data)
        if n < 1:
            raise ValueError('mean requires at least one data point')
        return sum(data) / float(n)

    def _ss(data):
        """Return the sum of squared deviations from the mean."""
        c = mean(data)
        return sum((x - c) ** 2 for x in data)

    def pstdev(data):
        """Return the population standard deviation of data."""
        n = len(data)
        if n < 2:
            raise ValueError('standard deviation requires at least two data points')
        pvar = _ss(data) / float(n)  # the population variance
        return pvar ** 0.5

    return (v_i - mean(V)) / pstdev(V)

**NOTE:** The mean, sum of squares, and standard deviation calculations are written out here merely as an example; it is much better to use the [equivalent functions provided by a library like NumPy](http://docs.scipy.org/doc/numpy/reference/routines.statistics.html) or SciPy.

In [3]:
V = range(1, 10)
print([(i, z_score(V, i)) for i in V])

[(1, -1.5491933384829668), (2, -1.161895003862225), (3, -0.7745966692414834), (4, -0.3872983346207417), (5, 0.0), (6, 0.3872983346207417), (7, 0.7745966692414834), (8, 1.161895003862225), (9, 1.5491933384829668)]
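Following the note above about preferring library routines, the sketch below computes the same z-scores with Python's standard `statistics` module (available from Python 3.4). The wrapper name `z_score_stdlib` and the zero-deviation guard are assumptions added for illustration.

```python
import statistics

def z_score_stdlib(V, v_i):
    """Return the z-score of v_i using the population mean and standard deviation of V."""
    data = list(V)
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)  # population standard deviation
    if sigma == 0:
        # Assumed behavior: the z-score is undefined when all values are equal.
        raise ValueError('z-score is undefined when all values are equal')
    return (v_i - mu) / sigma

V = range(1, 10)
scores = [z_score_stdlib(V, i) for i in V]

# Standardized values should have mean 0 and population standard deviation 1
# (up to floating-point rounding).
print(statistics.mean(scores), statistics.pstdev(scores))
```

Checking that the standardized values have a mean close to 0 and a standard deviation close to 1 is a quick sanity test for any z-score implementation.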
