# Mean Normalization

In machine learning we use large amounts of data to train our models. Some machine learning algorithms require the data to be *normalized* in order to work correctly. The idea of normalization, also known as *feature scaling*, is to ensure that all the data is on a similar scale, *i.e.* that all the data takes on a similar range of values. For example, we might have a dataset with values between 0 and 5,000. By normalizing the data we can rescale those values to lie between 0 and 1.
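
As a rough illustration of the 0-to-1 scaling described above, the sketch below applies min-max scaling to a small made-up array. The array `data` and the names `col_min`, `col_max`, and `data_scaled` are hypothetical and are not part of the lab.

```python
import numpy as np

# Hypothetical toy data: 5 samples, 2 features on very different scales.
data = np.array([[0.0, 10.0],
                 [1250.0, 20.0],
                 [2500.0, 30.0],
                 [3750.0, 40.0],
                 [5000.0, 50.0]])

# Min-max scaling: map each column linearly onto the interval [0, 1].
col_min = data.min(axis=0)
col_max = data.max(axis=0)
data_scaled = (data - col_min) / (col_max - col_min)

print(data_scaled.min(axis=0))  # smallest value in each column
print(data_scaled.max(axis=0))  # largest value in each column
```

After scaling, both columns span the same range even though the raw features differed by two orders of magnitude.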

In this lab, you will perform a different kind of feature scaling known as *mean normalization*. Mean normalization also rescales the data, but instead of mapping the values to between 0 and 1, it centers them in some small interval around zero. For example, if we have a dataset with values between 0 and 5,000, after mean normalization the values will lie in some small range around 0, for example between -3 and 3. Because the mean is subtracted from the data, the average (mean) of all the elements will be zero, up to floating-point rounding. Therefore, when you perform *mean normalization* your data will not only be scaled but will also have an average of zero.

# To Do

You will start by importing NumPy and creating a rank 2 ndarray of random integers between 0 and 5,000 (inclusive) with 1000 rows and 20 columns. This array will simulate a dataset with a wide range of values. Fill in the code below.

In [1]:
# import NumPy into Python
import numpy as np

# Create a 1000 x 20 ndarray with random integers in the half-open interval [0, 5001).
X = np.random.randint(0,5001, size = (1000, 20))

# print the shape of X
print(X.shape)

(1000, 20)


Now that you have created the array, we will mean normalize it. We will perform mean normalization using the following equation:

$\mbox{Norm_Col}_i = \frac{\mbox{Col}_i - M_i}{\sigma_i}$

where $\mbox{Col}_i$ is the $i$th column of $X$, $M_i$ is the average of the values in the $i$th column of $X$, and $\sigma_i$ is the standard deviation of the values in the $i$th column of $X$. In other words, mean normalization is performed by subtracting from each column of $X$ the average of its values, and then dividing by the standard deviation of its values. In the space below, you will first calculate the average and standard deviation of each column of $X$.

In [3]:
# Average of the values in each column of X
ave_cols = np.mean(X, axis = 0)

# Standard Deviation of the values in each column of X
std_cols = np.std(X, axis = 0)

print("avg ", ave_cols, "Std ", std_cols)

avg  [2586.193 2431.678 2472.509 2425.142 2549.725 2512.455 2467.976 2513.422
 2544.432 2507.975 2578.101 2547.339 2627.567 2478.674 2482.932 2474.226
 2430.581 2536.959 2543.466 2538.073] Std  [1423.80099795 1424.04427611 1431.6269402  1436.90375176 1457.08430895
 1422.28549032 1413.17238914 1458.06015717 1444.06997039 1439.56461556
 1433.30276104 1430.12214726 1428.97612209 1424.50018523 1456.94740515
 1435.42107652 1414.82414789 1433.96176704 1460.43488347 1440.94841118]
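
To make the role of `axis = 0` concrete, here is a small worked example on a made-up 3 x 2 array, where the numbers can be checked by hand. The names `X_small`, `mean_small`, `std_small`, and `X_small_norm` are hypothetical. Note that `np.std` computes the population standard deviation by default (`ddof=0`), which is what this sketch assumes.

```python
import numpy as np

# Hypothetical 3 x 2 array: the second column is 10 times the first.
X_small = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0]])

# axis=0 collapses the rows, giving one statistic per column.
mean_small = X_small.mean(axis=0)   # column means: 2 and 20
std_small = X_small.std(axis=0)     # population std (ddof=0) of each column

# Apply the mean normalization equation column by column.
X_small_norm = (X_small - mean_small) / std_small

print(mean_small)
print(std_small)
print(X_small_norm)
```

Both columns end up with identical normalized values, because they differ only by a scale factor that the division by the standard deviation removes.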


If you have done the above calculations correctly, then `ave_cols` and `std_cols` should both be vectors with shape `(20,)`, since $X$ has 20 columns. You can verify this by filling in the code below:

In [4]:
# Print the shape of ave_cols
print("ave_cols shape ",ave_cols.shape)

# Print the shape of std_cols
print("std_cols shape ",std_cols.shape)

ave_cols shape  (20,)
std_cols shape  (20,)


You can now take advantage of broadcasting to calculate the mean normalized version of $X$ in just one line of code using the equation above. Fill in the code below.

In [6]:
# Mean normalize X
X_norm = (X - ave_cols) / std_cols
print(X_norm)

[[ 0.96348226  0.03463516 -0.92657449 ...  1.2483185   0.50295567
   0.91809463]
 [ 1.57101098  0.42507246 -1.21016792 ... -1.67016936 -0.48921455
   0.77443925]
 [ 1.33923702 -1.63174561  1.40643553 ... -0.35214258 -0.67477572
   0.95348799]
 ...
 [-0.19257818 -0.6261589  -0.0541405  ...  0.18134444 -0.88567181
   1.21165129]
 [-0.83803355  1.39765458  1.54753374 ... -1.51256404 -0.87540089
   1.21442724]
 [ 1.18612573 -1.36981555  0.26856927 ... -0.71686639 -1.61696083
  -1.410233  ]]
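
The one-line normalization works because NumPy broadcasts a vector of shape `(20,)` across every row of a `(1000, 20)` array. The sketch below shows the same mechanism on a tiny made-up array and compares it with an explicit loop over columns. The names `A`, `col_means`, `via_broadcast`, and `via_loop` are hypothetical.

```python
import numpy as np

A = np.arange(6, dtype=float).reshape(3, 2)  # shape (3, 2)
col_means = A.mean(axis=0)                   # shape (2,)

# Broadcasting stretches the (2,) vector across all 3 rows of A.
via_broadcast = A - col_means

# Equivalent explicit loop over columns, shown for comparison only.
via_loop = np.empty_like(A)
for j in range(A.shape[1]):
    via_loop[:, j] = A[:, j] - col_means[j]

print(via_broadcast.shape)
print(np.array_equal(via_broadcast, via_loop))
```

The broadcast form is shorter and avoids a Python-level loop, which is why it is the usual choice for column-wise operations like this one.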


If you have performed the mean normalization correctly, then the average of all the elements in `X_norm` should be close to zero, and the values should be spread over some small interval around zero. You can verify this by filling in the code below:

In [7]:
# Print the average of all the values of X_norm
# You can use either the NumPy function or the ndarray method, so each result is computed in two ways.
print("The average of all the values of X_norm is: ")
print(np.mean(X_norm))
print(X_norm.mean())

# Print the average of the minimum value in each column of X_norm
print("The average of the minimum value in each column of X_norm is: ")
print(X_norm.min(axis = 0).mean())
print(np.mean(np.sort(X_norm, axis=0)[0]))

# Print the average of the maximum value in each column of X_norm
print("The average of the maximum value in each column of X_norm is: ")
print(np.mean(np.sort(X_norm, axis=0)[-1]))
print(X_norm.max(axis = 0).mean())

The average of all the values of X_norm is: 
2.4513724383723456e-17
2.4513724383723456e-17
The average of the minimum value in each column of X_norm is: 
-1.7476323670189688
-1.7476323670189688
The average of the maximum value in each column of X_norm is: 
1.7300298956249751
1.7300298956249751


You should note that since $X$ was created using random integers, the above values will vary. 
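
A stricter check than the overall average is to confirm that every column individually has mean 0 and standard deviation 1. The sketch below does this on freshly generated data, so it runs on its own. The seed and the names `rng`, `X_demo`, and `X_demo_norm` are assumptions made for illustration, and it uses NumPy's `default_rng` generator rather than the `np.random.randint` call used in the lab.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily, for reproducibility
X_demo = rng.integers(0, 5001, size=(1000, 20))

# Same mean normalization as in the lab, applied to the demo data.
X_demo_norm = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Every column should now have mean ~0 and standard deviation ~1.
print(np.allclose(X_demo_norm.mean(axis=0), 0.0))
print(np.allclose(X_demo_norm.std(axis=0), 1.0))
```

For uniformly distributed data the normalized extremes sit near $\pm\sqrt{3} \approx \pm 1.73$, which is consistent with the average minimum and maximum printed above.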

# Data Separation

After the data has been mean normalized, it is customary in machine learning to split our dataset into three sets:

1. A Training Set
2. A Cross Validation Set
3. A Test Set

The dataset is usually divided such that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. 

In this part of the lab you will separate `X_norm` into a Training Set, Cross Validation Set, and a Test Set. Each data set will contain rows of `X_norm` chosen at random, making sure that we don't pick the same row twice. This will guarantee that all the rows of `X_norm` are chosen and randomly distributed among the three new sets.

You will start by creating a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this by using the `np.random.permutation()` function. The `np.random.permutation(N)` function creates a random permutation of integers from 0 to `N - 1`. Let's see an example:

In [None]:
# We create a random permutation of integers 0 to 4
np.random.permutation(5)
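
Since a permutation differs on every run, it can help to fix a seed while experimenting. The sketch below uses NumPy's `default_rng` generator, an alternative to the `np.random.permutation()` function used in this lab, and checks that the result contains each integer from 0 to 4 exactly once. The seed value and the names `rng` and `perm` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # hypothetical seed, for reproducibility only
perm = rng.permutation(5)

print(perm)

# Sorting a permutation of 0..4 must give back 0, 1, 2, 3, 4.
print(np.array_equal(np.sort(perm), np.arange(5)))
```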

# To Do

In the space below, create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this in one line of code by extracting the number of rows of `X_norm` using the `shape` attribute and then passing it to the `np.random.permutation()` function. Remember that the `shape` attribute returns a tuple with two numbers in the form `(rows, columns)`.

In [8]:
# Create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`
row_indices = np.random.permutation(X_norm.shape[0])
print(row_indices)

[242 761 956 391 263 916 312 131 445 366 403 795 605 152 906 140 310 782
 550 661 545 251 260  84 808 884 983  52 134  58 528 270 878 852 785 460
 180 563 384 555 394 532 237 111 120 932 400 606 585 342 787 734 883 684
 498 847 598 945 526 335 538 796 845 806 520 202 339 733 625 174 102 749
 101 697 172 399 377 970 860 899 378 558 420 979 337 126 716 830 763 219
 651 564 657 631 879 307 390 115 478 412 167 104 912 546 811 646 881 145
 301 187 214 740 989 622 117 893 433 888 273 815 287 981 930  21 313 866
 774 159 470  64 918 936 687 944 204 333  41 946  12 818  11 660 254 980
  10 533 178 508  15 427 754 321 952 453 352  76 396 608 968 580 463   7
 959 856 836 623 701 386 900 437 978 314 340 371  34 746 951 693 609 566
 150 770 431 296 539 729 348 972 109 973 775 100 171  90 107 144 379  29
 425 721 894 519 725  53 341 596 250 402 814 303 148 169 189   4 976 490
 722 531 792 812 939 680 516 293 849  74 789 767 559 820 850 327 764 786
 855 966 653  40 907  49 543 639 469 426 176 875 14

Now you can create the three datasets using the `row_indices` ndarray to select the rows that will go into each dataset. Remember that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. Each set requires just one line of code to create. Fill in the code below.

In [9]:
# Make any necessary calculations.
# You can save your calculations into variables to use later.

sixty = int(len(X_norm) * 0.6)
eighty = int(len(X_norm) * 0.8)

# Create a Training Set
X_train = X_norm[row_indices[:sixty], :]

# Create a Cross Validation Set
X_crossVal = X_norm[row_indices[sixty: eighty], :]

# Create a Test Set
X_test = X_norm[row_indices[eighty: ], :]

If you performed the above calculations correctly, then `X_train` should have 600 rows and 20 columns, `X_crossVal` should have 200 rows and 20 columns, and `X_test` should have 200 rows and 20 columns. You can verify this by filling in the code below:

In [10]:
# Print the shape of X_train
print(X_train.shape)

# Print the shape of X_crossVal
print(X_crossVal.shape)

# Print the shape of X_test
print(X_test.shape)

(600, 20)
(200, 20)
(200, 20)
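
Beyond checking shapes, one can also confirm that the three sets are disjoint and together cover every row. The sketch below shows one possible way to do this with `np.split` on a permuted index array. It is a self-contained illustration under assumed names (`n_rows`, `idx`, `train_idx`, `val_idx`, `test_idx`) and an arbitrary seed, not the lab's own solution.

```python
import numpy as np

n_rows = 1000
rng = np.random.default_rng(seed=1)  # arbitrary seed, for reproducibility only
idx = rng.permutation(n_rows)

# Cut points at 60% and 80% of the rows.
train_end = int(n_rows * 0.6)
val_end = int(n_rows * 0.8)

# np.split cuts the permuted indices into three consecutive pieces.
train_idx, val_idx, test_idx = np.split(idx, [train_end, val_end])

# Together the pieces must contain every row index exactly once.
all_idx = np.concatenate([train_idx, val_idx, test_idx])
print(len(train_idx), len(val_idx), len(test_idx))
print(np.array_equal(np.sort(all_idx), np.arange(n_rows)))
```

Because the pieces are consecutive slices of a single permutation, no row can appear in two sets, and no row is left out.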
