# Mean Normalization

In machine learning we use large amounts of data to train our models. Some machine learning algorithms may require that the data is *normalized* in order to work correctly. The idea of normalization, also known as *feature scaling*, is to ensure that all the data is on a similar scale, *i.e.* that all the data takes on a similar range of values. For example, we might have a dataset that has values between 0 and 5,000. By normalizing the data we can make the range of values be between 0 and 1.
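As a rough illustration of the rescaling described above, the sketch below maps a few values between 0 and 5,000 onto the range 0 to 1 using min-max scaling. The array `data` is hypothetical toy data, and min-max scaling is just one common way to achieve that range; it is not the method used in the rest of this lab.

```python
import numpy as np

# Hypothetical toy data: values spread between 0 and 5,000
data = np.array([0.0, 1250.0, 2500.0, 5000.0])

# Min-max scaling: shift so the minimum is 0, then divide by the range
scaled = (data - data.min()) / (data.max() - data.min())

print(scaled)
```

After scaling, the smallest value maps to 0 and the largest to 1, with everything else in between.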

In this lab you will be performing a particular form of feature scaling known as *mean normalization*. Mean normalization will not only scale the data but will also ensure your data has zero mean. 

# To Do:

You will start by importing NumPy and creating a rank 2 ndarray of random integers between 0 and 5,000 (inclusive) with 1,000 rows and 20 columns. This array will simulate a dataset with a wide range of values. Fill in the code below.

In [4]:
# import NumPy into Python
import numpy as np

# Create a 1000 x 20 ndarray with random integers in the half-open interval [0, 5001).
X = np.random.randint(0, 5001, size=(1000, 20))

# print the shape of X
X.shape

(1000, 20)

Now that you created the array we will mean normalize it. We will perform mean normalization using the following equation:

$\mbox{Norm_Col}_i = \frac{\mbox{Col}_i - \mu_i}{\sigma_i}$

where $\mbox{Col}_i$ is the $i$th column of $X$, $\mu_i$ is the average of the values in the $i$th column of $X$, and $\sigma_i$ is the standard deviation of the values in the $i$th column of $X$. In other words, mean normalization is performed by subtracting from each column of $X$ the average of its values, and then dividing the result by the standard deviation of its values. In the space below, you will first calculate the average and standard deviation of each column of $X$.

In [15]:
# Average of the values in each column of X
ave_cols = np.mean(X, axis=0)

# Standard Deviation of the values in each column of X
std_cols = X.std(axis=0)

If you have done the above calculations correctly, then `ave_cols` and `std_cols` should both be vectors with shape `(20,)`, since $X$ has 20 columns. You can verify this by filling in the code below:

In [16]:
# Print the shape of ave_cols
print(ave_cols.shape)
# Print the shape of std_cols
print(std_cols.shape)

(20,)
(20,)


You can now take advantage of broadcasting to calculate the mean normalized version of $X$ in just one line of code using the equation above. Fill in the code below.

In [17]:
# Mean normalize X
X_norm = (X - ave_cols)/std_cols
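To make the broadcasting step easier to follow, here is a minimal sketch on a tiny, hypothetical 3 x 2 array `A` (the names `A`, `col_mean`, `col_std`, and `A_norm` are made up for illustration). The per-column vectors have shape `(2,)`, and NumPy stretches them across the 3 rows so that each column is shifted and scaled by its own statistics.

```python
import numpy as np

# Hypothetical 3 x 2 array standing in for X
A = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

col_mean = A.mean(axis=0)   # shape (2,): one mean per column
col_std = A.std(axis=0)     # shape (2,): one standard deviation per column

# The (2,) vectors are broadcast across the 3 rows of A
A_norm = (A - col_mean) / col_std

print(A_norm.shape)         # same shape as A
print(A_norm.mean(axis=0))  # per-column means, approximately 0
print(A_norm.std(axis=0))   # per-column standard deviations, approximately 1
```

Although the two columns of `A` differ in scale by a factor of 10, they become identical after normalization, which is the point of putting features on a similar scale.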

If you have performed the mean normalization correctly, then the average of the values in each column of $X_{\tiny{\mbox{norm}}}$, and therefore the average of all its elements, should be close to zero. You can verify this by filling in the code below:

In [21]:
# Print the average of the values in each column of X_norm
print(np.mean(X_norm, axis=0))
# Print the minimum value of each column of X_norm
print(np.min(X_norm, axis=0))
# Print the maximum value of each column of X_norm
print(np.max(X_norm, axis=0))

[  5.45535839e-17  -1.58650870e-16  -1.26898492e-16  -1.08579812e-16
   1.66533454e-17  -5.12923037e-17   5.06261699e-17  -9.80882042e-17
  -6.55031585e-17   1.25344179e-16   1.55320201e-16   3.24185123e-17
   6.21724894e-17  -5.72875081e-17  -3.16691118e-17  -6.62803146e-17
  -4.11892742e-17  -8.22675261e-17   2.22044605e-17  -1.15685239e-16]
[-1.7778388  -1.7541885  -1.72843159 -1.6629704  -1.70997632 -1.71831195
 -1.7363688  -1.69837986 -1.74584155 -1.69804162 -1.67935101 -1.78613914
 -1.74410013 -1.73521634 -1.75337061 -1.67247844 -1.76090954 -1.76984842
 -1.76218208 -1.74385421]
[ 1.65084767  1.76426153  1.70584949  1.76288242  1.72853131  1.66189527
  1.6689241   1.72716237  1.7482172   1.77338831  1.69233494  1.73480653
  1.68235253  1.6961292   1.72131683  1.74300151  1.69129991  1.7205453
  1.7685601   1.72716331]


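You should note that since $X$ was created using random integers, the exact values above will differ each time the code is run. The column averages, however, should always be very close to zero.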
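Because the printed numbers change from run to run, a programmatic check can be more convenient than reading them by eye. The following is a sketch only: it rebuilds a comparable array with a seeded generator (the names `X_demo`, `X_demo_norm`, and the seed `0` are assumptions for illustration) and uses `np.allclose` to test that every column has mean near 0 and standard deviation near 1.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

# Stand-in for X: 1000 x 20 random integers in [0, 5001)
X_demo = rng.integers(0, 5001, size=(1000, 20))

# Same normalization as in the lab, applied to the stand-in array
X_demo_norm = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Compare with a tolerance rather than testing for exact equality
means_ok = np.allclose(X_demo_norm.mean(axis=0), 0.0)
stds_ok = np.allclose(X_demo_norm.std(axis=0), 1.0)
print(means_ok, stds_ok)
```

Both checks are expected to pass up to floating point rounding, which is why a tolerance-based comparison is used instead of `==`.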

# Data Separation

After the data has been mean normalized, it is customary in machine learning to split our dataset into three sets:

1. A Training Set
2. A Cross Validation Set
3. A Test Set

The dataset is usually divided such that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. 

In this part of the lab you will separate `X_norm` into a Training Set, Cross Validation Set, and a Test Set. Each data set will contain rows of `X_norm` chosen at random, making sure that we don't pick the same row twice. This will guarantee that all the rows of `X_norm` are chosen and randomly distributed among the three new sets.

You will start by creating a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this by using the `np.random.permutation()` function. The `np.random.permutation(N)` function creates a random permutation of integers from 0 to `N - 1`. Let's see an example:

In [22]:
# We create a random permutation of integers 0 to 4
np.random.permutation(5)

array([2, 4, 0, 1, 3])
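If you want the same permutation on every run, for example while debugging, one option is NumPy's seeded `Generator` interface. This is a sketch assuming an arbitrary seed of `42`; the lab itself uses the unseeded `np.random.permutation()`.

```python
import numpy as np

# A seeded generator yields the same permutation each time this code runs
rng = np.random.default_rng(42)  # seed value is an arbitrary choice
perm = rng.permutation(5)

print(perm)           # some ordering of the integers 0 to 4
print(np.sort(perm))  # sorting recovers 0 to 4, so no value repeats or is missing
```

Sorting the result is a quick way to confirm that a permutation contains every index exactly once.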

# To Do

In the space below create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this in one line of code by extracting the number of rows of `X_norm` using the `shape` attribute and then passing it to the  `np.random.permutation()` function. Remember the `shape` attribute returns a tuple with two numbers in the form `(rows,columns)`.

In [27]:
# Create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`
row_indices = np.random.permutation(X_norm.shape[0])
print(row_indices)

[758 608 207 937 983 797 837 111 974 268 765 439 385 102 637 500 854 496
 543 836 631 559 895 200 337  45 939 383 334 801  42  12 716 436 213 614
  26 322 267 255 220 377 358 233 402 826 619 651 306 963 998 460  34 529
 266 601 540 320 313 746 902 502 932 640 510 890  86 427 864 790 676 324
 258 693 715 791 101 650 970 923 251 925 635 116 544 595 707 754 238 861
 805 620 182 319  14 649 604 539 897 112 298 197 538 782 271 851 371 590
 341 847 862 645 618 883 927 234 735 907   0 507 422 764 554 849  95 428
 335 452 886  49 174 374 738 901 779 195 244 318 709   2 844 379 443   7
 246 494 593 744 840 789 198 817 768 290  21 566 121 119 194 323 809 403
   8 488 491 467 638 257 761 463 426 456 888 457 157 904 201 473 933 516
 912 103 176 705 228 569 623 697 684 986 763 777 418 332 333 400 390 760
 520 219 479 215 833 701  43 687 350 150 445 508 177 386 276 866 280 381
 606 314 277 223 842 793 818 499 795 120 513 211 653 928 423  69 938 772
 152 750 876 551 984 741 739 535 767 179 660 487 83

Now you can create the three datasets using the `row_indices` ndarray to select the rows that will go into each dataset. Remember that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. Each set requires just one line of code to create. Fill in the code below.

In [37]:
# Preview the first 600 shuffled rows of X_norm (the rows that will form the Training Set)
print(X_norm[row_indices[0:600]])

[[-0.09198215 -1.49296277 -1.51357382 ...,  1.01030295  0.09181206
  -1.5152784 ]
 [-1.65359611 -0.33117719 -0.57150514 ..., -0.26198156 -1.01349458
   1.26862217]
 [-0.48461652  0.04270653  0.02004237 ..., -0.63597728 -0.93408833
  -1.09703329]
 ..., 
 [ 1.5938745   1.39601071 -0.34011985 ...,  1.38429868 -1.62109419
  -0.51065643]
 [-1.07700018  0.98973781  0.59161913 ...,  0.96346611  1.01420073
  -0.6836515 ]
 [ 0.98638951 -0.21640686  1.43452269 ...,  0.36856824  0.2669312
  -0.32446094]]


In [44]:
# Make any necessary calculations.
# You can save your calculations into variables to use later.


# Create a Training Set
X_train = X_norm[row_indices[:600]]

# Create a Cross Validation Set
X_crossVal = X_norm[row_indices[600:800]]

# Create a Test Set
X_test = X_norm[row_indices[800:]]

If you performed the above calculations correctly, then `X_train` should have 600 rows and 20 columns, `X_crossVal` should have 200 rows and 20 columns, and `X_test` should have 200 rows and 20 columns. You can verify this by filling in the code below:

In [45]:
# Print the shape of X_train
print(X_train.shape)
# Print the shape of X_crossVal
print(X_crossVal.shape)
# Print the shape of X_test
print(X_test.shape)

(600, 20)
(200, 20)
(200, 20)
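The slice boundaries 600 and 800 above are specific to a dataset with 1,000 rows. As a sketch of how the same 60/20/20 split might be generalized, the hypothetical helper `split_rows` below computes the boundaries from the number of rows; its name, parameters, and the use of a seeded generator are assumptions for illustration and not part of the lab.

```python
import numpy as np

def split_rows(data, train_frac=0.6, val_frac=0.2, seed=None):
    """Hypothetical helper: shuffle the rows of `data` and split them into
    training, cross validation, and test sets (test gets the remainder)."""
    rng = np.random.default_rng(seed)
    n_rows = data.shape[0]
    indices = rng.permutation(n_rows)      # each row index appears exactly once

    n_train = int(round(train_frac * n_rows))
    n_val = int(round(val_frac * n_rows))

    train = data[indices[:n_train]]
    val = data[indices[n_train:n_train + n_val]]
    test = data[indices[n_train + n_val:]]
    return train, val, test

# Toy 10 x 5 array whose first column uniquely identifies each row
demo = np.arange(50).reshape(10, 5)
train, val, test = split_rows(demo, seed=0)
print(train.shape, val.shape, test.shape)

# Every original row lands in exactly one of the three sets
recovered = np.sort(np.concatenate([train[:, 0], val[:, 0], test[:, 0]]))
print(np.array_equal(recovered, demo[:, 0]))
```

Because the three slices of the permuted indices do not overlap and together cover the whole permutation, no row is picked twice and none is left out.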
