# Mean Normalization

In machine learning we use large amounts of data to train our models. Some machine learning algorithms may require that the data is *normalized* in order to work correctly. The idea of normalization, also known as *feature scaling*, is to ensure that all the data is on a similar scale, *i.e.* that all the data takes on a similar range of values. For example, we might have a dataset that has values between 0 and 5,000. By normalizing the data we can make the range of values be between 0 and 1.
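
As a rough illustration of the 0-to-1 scaling described above, here is a minimal sketch of min-max scaling. The sample values and the names `values` and `scaled` are made up for this example and are not part of the lab.

```python
import numpy as np

# Hypothetical sample values spanning the 0 to 5,000 range mentioned above.
values = np.array([0.0, 1250.0, 2500.0, 5000.0])

# Min-max scaling: shift so the minimum becomes 0, then divide by the range
# so the maximum becomes 1.
scaled = (values - values.min()) / (values.max() - values.min())

print(scaled)
```

The smallest value maps to 0, the largest to 1, and everything else falls in between.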

In this lab, you will perform a different kind of feature scaling known as *mean normalization*. Mean normalization scales the data, but instead of mapping the values to the range 0 to 1, it centers them in a small interval around zero. For example, if we have a dataset with values between 0 and 5,000, after mean normalization the values will lie in a small range around 0, for example between -3 and 3. Because the mean is subtracted from every value, the average (mean) of all elements will be zero. Therefore, when you perform *mean normalization* your data will not only be scaled but will also have an average of zero.
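
To make the idea concrete, the sketch below mean normalizes a handful of made-up values. It assumes the scaling step divides by the standard deviation, which is the formula this lab uses later. The names `values` and `normalized` are hypothetical.

```python
import numpy as np

# Hypothetical sample values between 0 and 5,000.
values = np.array([0.0, 1000.0, 2500.0, 4000.0, 5000.0])

# Subtract the mean to center the data on zero, then divide by the
# standard deviation to bring it onto a small scale around zero.
normalized = (values - values.mean()) / values.std()

print(normalized)
print(normalized.mean())  # zero, up to floating-point rounding
```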

# To Do:

You will start by importing NumPy and creating a rank 2 ndarray of random integers between 0 and 5,000 (inclusive) with 1,000 rows and 20 columns. This array will simulate a dataset with a wide range of values. Fill in the code below:

In [4]:
# import NumPy into Python
import numpy as np

# Create a 1000 x 20 ndarray with random integers in the half-open interval [0, 5001).
X = np.random.randint(0,5001,size=(1000,20))
# print(X)

# print the shape of X
print(X.shape)


(1000, 20)


Now that you have created the array, we will mean normalize it. We will perform mean normalization using the following equation:

$\mbox{Norm_Col}_i = \frac{\mbox{Col}_i - \mu_i}{\sigma_i}$

where $\mbox{Col}_i$ is the $i$th column of $X$, $\mu_i$ is the average of the values in the $i$th column of $X$, and $\sigma_i$ is the standard deviation of the values in the $i$th column of $X$. In other words, mean normalization is performed by subtracting from each column of $X$ the average of its values, and then dividing by the standard deviation of its values. In the space below, you will first calculate the average and standard deviation of each column of $X$.

In [11]:
# Average of the values in each column of X
ave_cols = np.mean(X,axis=0)
print(f"Average of each cols - {ave_cols}")
# Standard Deviation of the values in each column of X
std_cols = np.std(X, axis=0)
print(f"Standard Deviation of each cols - {std_cols}")

Average of each cols - [2536.107 2517.731 2441.051 2536.053 2527.438 2458.212 2479.699 2523.24
 2438.617 2489.626 2557.096 2534.023 2566.711 2584.17  2465.125 2501.985
 2528.517 2492.72  2511.475 2455.179]
Standard Deviation of each cols - [1447.2216781  1446.58525177 1455.09308994 1439.00881658 1461.67672355
 1454.33644906 1429.36480802 1417.65555563 1440.51941546 1438.2196307
 1443.94662186 1426.04860312 1428.04290814 1443.13410988 1415.68538785
 1455.13202383 1467.98734385 1444.11637675 1468.77045088 1469.54708225]


If you have done the above calculations correctly, then `ave_cols` and `std_cols` should both be vectors with shape `(20,)`, since $X$ has 20 columns. You can verify this by filling in the code below:

In [12]:
# Print the shape of ave_cols
print(ave_cols.shape)


# Print the shape of std_cols
print(std_cols.shape)


(20,)
(20,)
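
One detail worth knowing about the standard deviation used here: by default `np.std` computes the population standard deviation (it divides by `N`), while passing `ddof=1` gives the sample standard deviation (it divides by `N - 1`). The sketch below shows the difference on a small made-up array. Which one is appropriate depends on your use case, and with 1,000 rows the two are nearly identical.

```python
import numpy as np

# Hypothetical data chosen so the population standard deviation is exactly 2.
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Default (ddof=0): population standard deviation, divides by N.
pop_std = np.std(data)

# ddof=1: sample standard deviation, divides by N - 1.
sample_std = np.std(data, ddof=1)

print(pop_std, sample_std)
```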


You can now take advantage of broadcasting to calculate the mean normalized version of $X$ in just one line of code using the equation above. Fill in the code below:

In [14]:
# Mean normalize X
X_norm = (X - ave_cols) / std_cols
print(X_norm)

[[-0.44990136 -0.38209362 -1.51196581 ... -0.60986775 -0.14670434
   1.07708084]
 [ 0.74411061 -1.740465    0.27142525 ... -0.5198473  -0.78669543
  -0.93850617]
 [-0.10441179  1.25555614 -0.85977386 ... -0.93532628  0.37005442
   0.53473687]
 ...
 [-1.58656206 -1.45703891  0.05356977 ...  0.64695617  1.02706655
   1.5153111 ]
 [-0.76844275 -1.41417936 -1.14085553 ... -0.26640512 -0.27878761
  -0.46897375]
 [ 1.12276718 -0.43393986  1.44592055 ...  0.65872807 -0.07862018
  -0.4145352 ]]
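
To see what broadcasting is doing in that one line, the sketch below repeats the calculation on a tiny made-up array and compares it with an explicit loop over columns. The names `A`, `col_means`, `col_stds`, `A_norm`, and `A_loop` are hypothetical and used only for illustration.

```python
import numpy as np

# A small hypothetical 3 x 2 array standing in for X.
A = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

col_means = A.mean(axis=0)  # shape (2,)
col_stds = A.std(axis=0)    # shape (2,)

# Broadcasting stretches the shape-(2,) vectors across all 3 rows,
# so each column is centered and scaled by its own statistics.
A_norm = (A - col_means) / col_stds

# The same result written with an explicit loop over columns.
A_loop = np.empty_like(A)
for j in range(A.shape[1]):
    A_loop[:, j] = (A[:, j] - col_means[j]) / col_stds[j]

print(np.allclose(A_norm, A_loop))
```

The broadcast version avoids the Python loop and works unchanged for any number of rows, as long as the trailing dimension of the array matches the length of the vectors.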


If you have performed the mean normalization correctly, then the average of all the elements in $X_{\tiny{\mbox{norm}}}$ should be close to zero, and the values should be spread over some small interval around zero. You can verify this by filling in the code below:

In [21]:
# Print the average of all the values of X_norm
print(f"Average of all the values of X_norm :  {X_norm.mean()}")

# Print the average of the minimum value in each column of X_norm
print(f"Average of the minimum value in each column of X_norm :  {X_norm.min(axis=0).mean()}")

# Print the average of the maximum value in each column of X_norm
print(f"Average of the maximum value in each column of X_norm :  {X_norm.max(axis=0).mean()}")

Average of all the values of X_norm :  -2.842170943040401e-18
Average of the minimum value in each column of X_norm :  -1.7303871791356724
Average of the maximum value in each column of X_norm :  1.722599520307113


You should note that since $X$ was created using random integers, the above values will vary. 
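
A slightly stricter check is to look at each column separately. After mean normalization every column should have a mean of about 0 and a standard deviation of about 1. The sketch below assumes a freshly generated stand-in array (`X_demo`, a hypothetical name) and a seeded generator so the run is repeatable. It also prints $\sqrt{3}$, because for uniformly distributed data the normalized minimum and maximum are expected to sit near $-\sqrt{3}$ and $+\sqrt{3}$, roughly the $\pm 1.73$ seen above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so this sketch is repeatable
X_demo = rng.integers(0, 5001, size=(1000, 20))  # stand-in for X
X_demo_norm = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Every column should have mean ~0 and standard deviation ~1.
print(np.allclose(X_demo_norm.mean(axis=0), 0.0))
print(np.allclose(X_demo_norm.std(axis=0), 1.0))

# For uniformly distributed data the extremes sit near -sqrt(3) and +sqrt(3).
print(X_demo_norm.min(), X_demo_norm.max(), np.sqrt(3))
```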

# Data Separation

After the data has been mean normalized, it is customary in machine learning to split our dataset into three sets:

1. A Training Set
2. A Cross Validation Set
3. A Test Set

The dataset is usually divided such that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. 

In this part of the lab you will separate `X_norm` into a Training Set, Cross Validation Set, and a Test Set. Each data set will contain rows of `X_norm` chosen at random, making sure that we don't pick the same row twice. This will guarantee that all the rows of `X_norm` are chosen and randomly distributed among the three new sets.

You will start by creating a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this by using the `np.random.permutation()` function. The `np.random.permutation(N)` function creates a random permutation of integers from 0 to `N - 1`. Let's see an example:

In [28]:
# We create a random permutation of integers 0 to 4
np.random.permutation(5)

array([3, 4, 0, 1, 2])
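
The property that matters for splitting data is that a permutation contains every index exactly once. The sketch below checks this and shows that indexing an array with a permutation shuffles its rows without dropping or repeating any. The names `perm`, `data`, and `shuffled` are hypothetical.

```python
import numpy as np

perm = np.random.permutation(10)

# Sorting a permutation of 0..N-1 recovers 0..N-1, so every index
# appears exactly once.
print(np.array_equal(np.sort(perm), np.arange(10)))

# Indexing an array with the permutation shuffles its rows without
# dropping or repeating any of them.
data = np.arange(20).reshape(10, 2)  # hypothetical 10 x 2 dataset
shuffled = data[perm]
print(shuffled.shape)
```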

# To Do:

In the space below create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. You can do this in one line of code by extracting the number of rows of `X_norm` using the `shape` attribute and then passing it to the  `np.random.permutation()` function. Remember the `shape` attribute returns a tuple with two numbers in the form `(rows,columns)`.

In [43]:
# Create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`
# X_norm.shape is (rows, columns), so shape[0] is the number of rows
print(X_norm.shape[0])
row_indices = np.random.permutation(X_norm.shape[0])
print(row_indices)

1000
[ 48 195 508 128 444 493 369 420 976 571 835  47 731 140 482 395 627  22
 449 674 512 470 380 333   4 631 328 649 950  58 106 389  24 530 264 689
 672 870 407 719  69  33 787 662 592 396 743 721 342 992 284 892 596 745
 961 372 338 692 144 257 223 644 357 101 240 898 336 991 650 415 135 329
 839 791 688  17 818 737  39 993 325   5 186 557 926 751 679  43 318   3
 590  16 621 893 400 570  85 256 311 939 901 359 711 513 614 833 102 722
 966 708 406  66  73  88 378 365 347 648 757 916  36  26 155  61 373 288
  19 558 109 304 702 654 651 663 481 165 853 450 254 414 174 435 233 807
 387 739 716 620 604 520  71 391 547 439 483 799 285 881 612 734 220 877
  78 377   6 236 684 600 813 419 575 646 912 828 984 125 327 277 628 937
  68 949 758 946 495 769 868 823 748 162 732 459 932 691 999 468 972 536
 562 902 436 473  51 841 475 446 776 804 848 956 591 831 521 671 434 790
 138 738 774 761 467 816 735 802 640 145 385 888 977 921 572 465 343 609
 266 783 197 657 477  84 858 201 489 532 855 2

Now you can create the three datasets using the `row_indices` ndarray to select the rows that will go into each dataset. Remember that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. Each set requires just one line of code to create. Fill in the code below:

In [51]:
# Make any necessary calculations.
# You can save your calculations into variables to use later.
sixty = int(len(X_norm)* 0.6) 
print(sixty)

eighty = int(len(X_norm)* 0.8) 
print(eighty)

# Create a Training Set
X_train = X_norm[row_indices[:sixty],:]
print(X_train)                 

# Create a Cross Validation Set
X_crossVal = X_norm[row_indices[sixty:eighty],:]

# Create a Test Set
X_test = X_norm[row_indices[eighty:],:]

600
800
[[ 0.93205693 -1.52685851  1.01708201 ... -0.2075456   0.50622274
  -1.42505063]
 [-0.41949828  0.54146066 -1.10511899 ...  0.2806422  -1.08149983
  -0.18249092]
 [-0.62195517 -0.9980269   1.14422164 ...  1.31172254  0.74111308
  -0.52069036]
 ...
 [-1.35162915  0.97766032  0.24805904 ...  1.4647573  -1.10328677
  -0.53293903]
 [ 1.6057616  -0.33716022 -0.1258002  ...  0.54170149  0.98281185
  -1.44342364]
 [ 1.44476346  1.27560335 -0.64741631 ...  0.26817783 -1.20609384
  -0.48802724]]


If you performed the above calculations correctly, then `X_train` should have 600 rows and 20 columns, `X_crossVal` should have 200 rows and 20 columns, and `X_test` should have 200 rows and 20 columns. You can verify this by filling in the code below:

In [53]:
# Print the shape of X_train
print(X_train.shape)                 

# Print the shape of X_crossVal
print(X_crossVal.shape)  


# Print the shape of X_test
print(X_test.shape)  


(600, 20)
(200, 20)
(200, 20)
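
As an alternative to slicing the index array by hand, the same 60/20/20 split can be sketched with `np.split`, which cuts an array at the given positions. The example below uses a made-up stand-in array (`X_demo`) and hypothetical variable names. It also checks that the three index sets do not overlap and together cover every row.

```python
import numpy as np

X_demo = np.arange(1000 * 20).reshape(1000, 20)  # hypothetical stand-in for X_norm
n_rows = X_demo.shape[0]
idx = np.random.permutation(n_rows)

# Split the shuffled indices at the 60% and 80% marks
# (integer arithmetic avoids any floating-point rounding).
train_idx, cv_idx, test_idx = np.split(idx, [n_rows * 6 // 10, n_rows * 8 // 10])

train, cross_val, test = X_demo[train_idx], X_demo[cv_idx], X_demo[test_idx]
print(train.shape, cross_val.shape, test.shape)

# The three index sets are disjoint and together cover every row.
all_idx = np.concatenate([train_idx, cv_idx, test_idx])
print(np.array_equal(np.sort(all_idx), np.arange(n_rows)))
```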
