# Mean Normalization
#### Trung Ng

----------------------------------------------------

In machine learning we use large amounts of data to train our models. Some machine learning algorithms may require that the data is *normalized* in order to work correctly. The idea of normalization, also known as *feature scaling*, is to ensure that all the data is on a similar scale, *i.e.* that all the data takes on a similar range of values. For example, we might have a dataset that has values between 0 and 5,000. By normalizing the data we can make the range of values be between 0 and 1.
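As a point of comparison, the 0-to-1 scaling described above can be sketched in a few lines. This is only an illustration on a hypothetical toy array (`values` and `scaled` are names introduced here, not part of the lab), assuming the usual min-max formula of subtracting the minimum and dividing by the range.

```python
import numpy as np

# Hypothetical toy data: five values spanning 0 to 5,000.
values = np.array([0, 1250, 2500, 3750, 5000], dtype=float)

# Min-max scaling: subtract the minimum, then divide by the range.
scaled = (values - values.min()) / (values.max() - values.min())

print(scaled)  # the values 0, 0.25, 0.5, 0.75 and 1
```

The smallest value maps to 0 and the largest to 1, with everything else placed proportionally in between.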

In this lab, you will perform a different kind of feature scaling known as *mean normalization* (the form used here, which divides by the standard deviation, is often also called standardization). Mean normalization rescales the data, but instead of mapping the values to the range 0 to 1, it centers them on zero and spreads them over a small interval. For example, if a dataset has values between 0 and 5,000, after mean normalization the values will lie in a small range around 0, for example between -3 and 3. Because the mean of each column is subtracted from that column, the average (mean) of the normalized values is guaranteed to be zero. Therefore, when you perform *mean normalization* your data will not only be scaled but will also have an average of zero.

# To Do:

You will start by importing NumPy and creating a rank 2 ndarray of random integers between 0 and 5,000 (inclusive) with 1000 rows and 20 columns. This array will simulate a dataset with a wide range of values. Fill in the code below:

In [10]:
# import NumPy into Python
import numpy as np
%config IPCompleter.greedy=True

# Create a 1000 x 20 ndarray with random integers in the half-open interval [0, 5001).
X = np.random.randint(0, 5001, (1000, 20))

# print the shape of X
print(X.shape)

(1000, 20)


Now that you have created the array, we will mean normalize it. We will perform mean normalization using the following equation:

$\mbox{Norm_Col}_i = \frac{\mbox{Col}_i - \mu_i}{\sigma_i}$

where $\mbox{Col}_i$ is the $i$th column of $X$, $\mu_i$ is the average of the values in the $i$th column of $X$, and $\sigma_i$ is the standard deviation of the values in the $i$th column of $X$. In other words, mean normalization is performed by subtracting from each column of $X$ the average of its values, and then dividing the result by the standard deviation of its values. In the space below, you will first calculate the average and standard deviation of each column of $X$.
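To make the equation concrete, here is a small worked sketch on a hypothetical 3 x 2 array. The names `A`, `mu`, `sigma`, and `A_norm` are introduced only for this illustration. It assumes the population standard deviation, which is what `np.std` computes by default.

```python
import numpy as np

# Hypothetical 3 x 2 array; each column plays the role of Col_i.
A = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

mu = A.mean(axis=0)     # column averages: 2 and 20
sigma = A.std(axis=0)   # column standard deviations: sqrt(2/3) and 10 * sqrt(2/3)

# Subtract the column average first, then divide by the column standard deviation.
A_norm = (A - mu) / sigma

print(A_norm)
```

Although the two columns differ in scale by a factor of ten, both end up with the same normalized values (roughly -1.22, 0, and 1.22), which is the point of the scaling.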

In [9]:
# Average of the values in each column of X
ave_cols = np.mean(X, axis=0)
print(ave_cols)

# Standard Deviation of the values in each column of X
std_cols = np.std(X, axis=0)
print(std_cols)

[2431.421 2369.013 2506.119 2557.359 2525.366 2424.275 2555.758 2580.932
 2524.247 2464.656 2492.975 2619.58  2560.204 2551.055 2482.234 2470.991
 2542.546 2493.005 2509.98  2544.714]
[1459.90296587 1457.90639783 1435.56138108 1436.07232761 1481.79093331
 1451.74333592 1423.82386391 1435.92957187 1462.23692061 1478.7432467
 1432.13364543 1422.53443951 1440.09748225 1455.87786231 1439.97559398
 1459.05297331 1429.36383888 1470.92930591 1433.37146114 1443.56091669]


If you have done the above calculations correctly, then `ave_cols` and `std_cols` should both be vectors with shape `(20,)`, since $X$ has 20 columns. You can verify this by filling in the code below:

In [12]:
# Print the shape of ave_cols
print(ave_cols.shape)

# Print the shape of std_cols
print(std_cols.shape)

(20,)
(20,)


You can now take advantage of broadcasting to calculate the mean normalized version of $X$ in just one line of code using the equation above. Fill in the code below:

In [13]:
# Mean normalize X
# Parentheses are required: division binds more tightly than subtraction
X_norm = (X - ave_cols) / std_cols

If you have performed the mean normalization correctly, then the average of all the elements in $X_{\tiny{\mbox{norm}}}$ should be close to zero, and the values should lie in a small interval around zero. You can verify this by filling in the code below:

In [14]:
# Print the average of all the values of X_norm
print(np.mean(X_norm))

# Print the average of the minimum value in each column of X_norm
print(np.mean(np.min(X_norm, axis=0)))

# Print the average of the maximum value in each column of X_norm
print(np.mean(np.max(X_norm, axis=0)))


Since $X$ was created using random integers, the exact values will vary from run to run. For integers drawn uniformly from 0 to 5,000, the overall average should be zero up to floating-point rounding error, and the average column minimum and maximum should be close to $-\sqrt{3} \approx -1.73$ and $\sqrt{3} \approx 1.73$ respectively.
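A stricter check than the overall average is to look at each column separately: after mean normalization every column should have an average of about 0 and a standard deviation of about 1. The sketch below is self-contained and uses its own array `data` and a seeded generator from `np.random.default_rng`. Both are assumptions made for reproducibility, not part of the lab's code.

```python
import numpy as np

# Seeded generator so the check is repeatable (an assumption for this sketch).
rng = np.random.default_rng(0)
data = rng.integers(0, 5001, size=(1000, 20))

mu = data.mean(axis=0)      # shape (20,)
sigma = data.std(axis=0)    # shape (20,)

# Shapes (1000, 20) and (20,) broadcast along the rows, so no reshape is needed.
# The parentheses matter: without them only mu would be divided by sigma.
data_norm = (data - mu) / sigma

print(np.allclose(data_norm.mean(axis=0), 0.0))  # True
print(np.allclose(data_norm.std(axis=0), 1.0))   # True
```

If the parentheses are left out, the result stays close to the original values of the array, which is an easy symptom to spot when printing the column minimum and maximum.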

# Data Separation

After the data has been mean normalized, it is customary in machine learning to split the dataset into three sets:

1. A Training Set
2. A Cross Validation Set
3. A Test Set

The dataset is usually divided such that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. 

In this part of the lab you will separate `X_norm` into a Training Set, Cross Validation Set, and a Test Set. Each data set will contain rows of `X_norm` chosen at random, making sure that we don't pick the same row twice. This will guarantee that all the rows of `X_norm` are chosen and randomly distributed among the three new sets.

You will start by creating a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this by using the `np.random.permutation()` function. The `np.random.permutation(N)` function creates a random permutation of integers from 0 to `N - 1`. Let's see an example:

In [15]:
# We create a random permutation of integers 0 to 4
print(np.random.permutation(5))

[1 4 0 3 2]


# To Do

In the space below create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this in one line of code by extracting the number of rows of `X_norm` using the `shape` attribute and then passing it to the  `np.random.permutation()` function. Remember the `shape` attribute returns a tuple with two numbers in the form `(rows,columns)`.

In [16]:
# Create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`
row_indices = np.random.permutation(X_norm.shape[0])
print(row_indices)

[775 326 897 145 781 130 554 149 428 354 580  91 911 968 808  78 283 880
 934 447 913 822 573 237 972 431 314 780 239 154 527 298 471 497 768 661
 843 493 480 172 232 544 754  40 588  64  86 924 315  95 210 776 785 252
 802 833  57 777 229 953 618 640 483 267 459 881 418 576   4 352 192 587
 121 627 898 496  31  55 194 760 177 592 343 741 178 160 319  84 596   5
 854 753 362 917 905 955 762 690 286 989 360 857 159  74 285 831 879 219
 757 948  11 240 426  61  88 637 106 675 608 146 528 650 585 441 184  77
  36 827 395  23  18 558 803 280 358 823 796 681 814 589 932 478 330 415
 818 782 505 129 891 944 817 499 919 770 494 724 519 988 852 429  17 751
 402 137 208 138 979 509 866 369 908 864 976 696 481 270 135 526 151 357
  48 736 368 188 477 749 521 226 906 718 867 702 982  54 861 570 297 634
 127 313 694 810  47 767   6 439 830 639 176  41 243 747 875 222 886 937
 617 709 539 567  26 946 251 469 113 654 421 205 986 712 918 645 788 855
 234 301 950 501 502 931 963 805 420  13 216  80 84

Now you can create the three datasets using the `row_indices` ndarray to select the rows that will go into each dataset. Remember that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. Each set requires just one line of code to create. Fill in the code below:

In [33]:
# Make any necessary calculations.
# You can save your calculations into variables to use later.
train_set_range = int(X_norm.shape[0] * 0.6)
valid_set_start_index = train_set_range
print(valid_set_start_index)
valid_set_end_index = valid_set_start_index + int(X_norm.shape[0] * 0.2)
print(valid_set_end_index)
# Negative start index: counts back from the end of row_indices
test_set_start_index = valid_set_end_index - X_norm.shape[0]
print(test_set_start_index)


# Create a Training Set from the first 60% of the shuffled row indices
X_train = X_norm[row_indices[0:train_set_range], :]

# Create a Cross Validation Set from the next 20% of the shuffled row indices
X_crossVal = X_norm[row_indices[valid_set_start_index:valid_set_end_index], :]

# Create a Test Set from the remaining 20% of the shuffled row indices
X_test = X_norm[row_indices[test_set_start_index:], :]

600
800
-200


If you performed the above calculations correctly, then `X_train` should have 600 rows and 20 columns, `X_crossVal` should have 200 rows and 20 columns, and `X_test` should have 200 rows and 20 columns. You can verify this by filling in the code below:

In [34]:
# Print the shape of X_train
print(X_train.shape)

# Print the shape of X_crossVal
print(X_crossVal.shape)

# Print the shape of X_test
print(X_test.shape)


(600, 20)
(200, 20)
(200, 20)
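Matching shapes alone do not show that the split is random or that no row is used twice. One possible way to check this is to split the permuted indices themselves and confirm that the three pieces together contain every row index exactly once. The sketch below is self-contained. The names `perm`, `train_idx`, `val_idx`, and `test_idx` are hypothetical, and `np.split` is used here as one convenient option rather than the lab's required approach.

```python
import numpy as np

# Seeded generator so the sketch is repeatable (an assumption, not lab code).
rng = np.random.default_rng(0)
n_rows = 1000
perm = rng.permutation(n_rows)   # shuffled row indices 0..999

train_end = int(n_rows * 0.6)            # 600
val_end = train_end + int(n_rows * 0.2)  # 800

# Cut the shuffled indices at positions 600 and 800.
train_idx, val_idx, test_idx = np.split(perm, [train_end, val_end])
print(len(train_idx), len(val_idx), len(test_idx))  # 600 200 200

# Together the three index sets cover every row exactly once.
all_idx = np.concatenate([train_idx, val_idx, test_idx])
print(np.array_equal(np.sort(all_idx), np.arange(n_rows)))  # True
```

Indexing a data array with these index arrays, for example `data[train_idx, :]`, then yields three sets with no shared rows.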
