# Mean Normalization

In machine learning we use large amounts of data to train our models. Some machine learning algorithms may require that the data is *normalized* in order to work correctly. The idea of normalization, also known as *feature scaling*, is to ensure that all the data is on a similar scale, *i.e.* that all the data takes on a similar range of values. For example, we might have a dataset that has values between 0 and 5,000. By normalizing the data we can make the range of values be between 0 and 1.

In this lab, you will be performing a different kind of feature scaling known as *mean normalization*. Mean normalization will scale the data, but instead of making the values be between 0 and 1, it will distribute the values evenly in some small interval around zero. For example, if we have a dataset that has values between 0 and 5,000, after mean normalization the range of values will be distributed in some small range around 0, for example between -3 to 3. Because the range of values are distributed evenly around zero, this guarantees that the average (mean) of all elements will be zero. Therefore, when you perform *mean normalization* your data will not only be scaled but it will also have an average of zero. 

# To Do:

You will start by importing NumPy and creating a rank 2 ndarray of random integers between 0 and 5,000 (inclusive) with 1000 rows and 20 columns. This array will simulate a dataset with a wide range of values. Fill in the code below

In [1]:
# import NumPy into Python
import numpy as np

# Create a 1000 x 20 ndarray with random integers in the half-open interval [0, 5001).
X = np.random.randint(0, 5001, 20000).reshape((1000,20))

# print the shape of X
print(X)

[[ 903 4854 1922 ... 1386   97 2133]
 [4663 3083 3510 ... 3321 4914 1573]
 [4975  330 1173 ...  355 3124 4609]
 ...
 [1033   46 4959 ... 1449 1175  559]
 [4659 3750  871 ... 2300 2169 2935]
 [2207 4632 2504 ... 2114  411 3738]]


Now that you created the array we will mean normalize it. We will perform mean normalization using the following equation:

$\mbox{Norm_Col}_i = \frac{\mbox{Col}_i - \mu_i}{\sigma_i}$

where $\mbox{Col}_i$ is the $i$th column of $X$, $\mu_i$ is average of the values in the $i$th column of $X$, and $\sigma_i$ is the standard deviation of the values in the $i$th column of $X$. In other words, mean normalization is performed by subtracting from each column of $X$ the average of its values, and then by dividing by the standard deviation of its values. In the space below, you will first calculate the average and standard deviation of each column of $X$. 

In [4]:
# Average of the values in each column of X
ave_cols = np.average(X, axis=0)

# Standard Deviation of the values in each column of X
std_cols = np.std(X, axis=0)

In [5]:
print(ave_cols)

[2508.956 2468.245 2448.081 2508.531 2355.794 2463.385 2480.731 2503.403
 2554.818 2471.439 2485.552 2529.545 2505.633 2422.493 2466.19  2477.954
 2553.414 2534.104 2472.184 2492.623]


If you have done the above calculations correctly, then `ave_cols` and `std_cols`, should both be vectors with shape `(20,)` since $X$ has 20 columns. You can verify this by filling the code below:

In [6]:
# Print the shape of ave_cols
print(ave_cols.shape)

# Print the shape of std_cols
print(std_cols.shape)

(20,)
(20,)


You can now take advantage of Broadcasting to calculate the mean normalized version of $X$ in just one line of code using the equation above. Fill in the code below

In [7]:
# Mean normalize X
X_norm = (X-ave_cols) / std_cols

In [8]:
print(X_norm)

[[-1.10360016  1.63145288 -0.36071302 ... -0.80637781 -1.62251389
  -0.24783951]
 [ 1.48024186  0.42038844  0.72811602 ...  0.55268118  1.66803093
  -0.6337718 ]
 [ 1.69464578 -1.46219791 -0.87427281 ... -1.53050691  0.44526256
   1.45853252]
 ...
 [-1.0142652  -1.65640587  1.72163823 ... -0.76212937 -0.88612043
  -1.3325849 ]
 [ 1.4774931   0.87650362 -1.08134231 ... -0.16442436 -0.20710827
   0.30487065]
 [-0.20750176  1.47964243  0.03834146 ... -0.29506259 -1.40801709
   0.85826998]]


If you have performed the mean normalization correctly, then the average of all the elements in $X_{\tiny{\mbox{norm}}}$ should be close to zero, and they should be evenly distributed in some small interval around zero. You can verify this by filing the code below:

In [15]:
# Print the average of all the values of X_norm
print(X_norm)


# Print the average of the minimum value in each column of X_norm
print(np.min(X_norm, axis = 0))

# Print the average of the maximum value in each column of X_norm
print(np.max(X_norm, axis = 0))


[[ 0.90117946  0.36636189 -0.04348306 ..., -1.34783151  0.99019938
   0.1330315 ]
 [ 1.15749367 -0.29445348 -1.5146379  ..., -0.2519591  -1.24574745
   1.45536056]
 [ 1.56504036 -0.73731814 -0.42374812 ..., -1.59821755 -1.51016299
   0.0585052 ]
 ..., 
 [-1.31689697  0.50214586 -0.70068792 ..., -0.21103728  0.02862639
   1.17639972]
 [ 0.38358096  0.32667118  1.35457091 ...,  1.1893212   0.15674526
   0.17132024]
 [-0.22560905 -1.00053016 -1.14983648 ...,  0.22453732  0.58812421
  -0.23139528]]
[-1.79899489 -1.78529192 -1.74026652 -1.65158268 -1.69298052 -1.67870984
 -1.69170172 -1.6717572  -1.65763524 -1.79261397 -1.69457635 -1.75619607
 -1.68593706 -1.74188    -1.71738252 -1.75768035 -1.79801333 -1.79380997
 -1.66417822 -1.7540565 ]
[ 1.74893338  1.68242198  1.77208929  1.72435638  1.74697265  1.77799881
  1.77167744  1.75507361  1.75313124  1.72799205  1.74540427  1.70092649
  1.7189183   1.69917834  1.77344711  1.66636155  1.73433487  1.66512403
  1.73574215  1.65979509]


You should note that since $X$ was created using random integers, the above values will vary. 

# Data Separation

After the data has been mean normalized, it is customary in machine learnig to split our dataset into three sets:

1. A Training Set
2. A Cross Validation Set
3. A Test Set

The dataset is usually divided such that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. 

In this part of the lab you will separate `X_norm` into a Training Set, Cross Validation Set, and a Test Set. Each data set will contain rows of `X_norm` chosen at random, making sure that we don't pick the same row twice. This will guarantee that all the rows of `X_norm` are chosen and randomly distributed among the three new sets.

You will start by creating a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this by using the `np.random.permutation()` function. The `np.random.permutation(N)` function creates a random permutation of integers from 0 to `N - 1`. Let's see an example:

In [9]:
# We create a random permutation of integers 0 to 4
np.random.permutation(5)

array([4, 3, 2, 0, 1])

# To Do

In the space below create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this in one line of code by extracting the number of rows of `X_norm` using the `shape` attribute and then passing it to the  `np.random.permutation()` function. Remember the `shape` attribute returns a tuple with two numbers in the form `(rows,columns)`.

In [10]:
# Create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`
row_indices = np.random.permutation(X_norm.shape[0])

In [11]:
print(row_indices)

[427 993 559 474 648 631  35 551 865 745 368 783  40 866  27 758 786 638
 478 934 609 235   2 348 263 145 913 660 465 701  23 375 985 507  57 554
  30  68 779 608 159 475 256 730 512 852 174 126 744   5 242 833 868 952
 137 417 937 352 525 457 592 423   1  63 792 988  60 290 255 247  20 330
 960 232 763 140 873 484 219 910 337 966 564  84 676 600 237 557 922   3
 829 208 824 860 811 260 148 718 750 963 302 529 115 580 545 916 968 391
 872 202 809 901 412 309 977  83  50 511 324 266 729 995 742  80 943 211
 954 532 383 780 945  78 722 665 433 550 274 797  56 254 895 796 327 538
 125 759 326 316 425 832 900 981 784 491 429 820 798 921 519 884 304 218
 527 240 424 887 773  73 867 907 649 384 987 118 117 245 571  38 407  36
 456 755 119 301 955 192 320 894 581 364  14 909 325  39  96 940 111 246
 845 104 688  28 863 778 942 978 707 332 526 544 160 201 992 815 354 630
 239 108 689 716 279 542 480 875 880 839 129 818 164  15 594 394  88 535
 460 556 483  87 711 743 549 617 849 941 226 214 61

Now you can create the three datasets using the `row_indices` ndarray to select the rows that will go into each dataset. Rememeber that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. Each set requires just one line of code to create. Fill in the code below

In [27]:
# Make any necessary calculations.
# You can save your calculations into variables to use later.
train_idx = row_indices[: int(row_indices.shape[0] * 60 / 100)] 
cross_idx = row_indices[int(row_indices.shape[0] * 60 / 100 -1) : int(row_indices.shape[0] * 80 / 100 )]
test_idx = row_indices[int(row_indices.shape[0] * 80 / 100 -1) : int(row_indices.shape[0])]

# Create a Training Set
X_train = X_norm[train_idx]

# Create a Cross Validation Set
X_crossVal = X_norm[cross_idx]

# Create a Test Set
X_test = X_norm[test_idx]

If you performed the above calculations correctly, then `X_tain` should have 600 rows and 20 columns, `X_crossVal` should have 200 rows and 20 columns, and `X_test` should have 200 rows and 20 columns. You can verify this by filling the code below:

In [28]:
# Print the shape of X_train
print(X_train)

# Print the shape of X_crossVal
print(X_crossVal)

# Print the shape of X_test
print(X_test)

[[-1.40280838 -0.10157059 -1.22223445 ...,  0.08998084  0.82391744
   1.48339338]
 [ 1.4152279  -1.33476765 -0.79417268 ..., -0.31438221  0.99156235
  -0.05020748]
 [ 0.87277899  0.35382798 -1.67841194 ..., -1.29442507 -0.80346477
   1.31656387]
 ..., 
 [ 1.14329344 -0.97267704 -0.44834937 ...,  0.87720564  0.69852451
   0.02090018]
 [ 0.88768924  1.49371707  1.52607678 ...,  0.10870777 -0.22488541
   1.48612829]
 [-1.08046308 -0.88285072  0.70088052 ..., -1.70225607 -0.32983385
  -0.00986756]]
[[ -1.08046308e+00  -8.82850715e-01   7.00880515e-01 ...,  -1.70225607e+00
   -3.29833847e-01  -9.86755581e-03]
 [ -4.66302998e-01  -4.53216281e-01  -1.03385916e+00 ...,   8.25879966e-01
   -8.43672283e-01  -1.06622660e+00]
 [ -1.77485449e+00   5.29302659e-01  -3.37292295e-01 ...,   4.77004129e-01
    7.70117061e-02  -1.14477869e-01]
 ..., 
 [ -1.38292805e+00   6.98510386e-01  -1.19693031e+00 ...,   6.78145268e-01
    6.88983743e-01   5.04979279e-01]
 [  1.53876993e+00   9.99324123e-01  -5.36210