# Mean Normalization

In machine learning we use large amounts of data to train our models. Some machine learning algorithms may require that the data is *normalized* in order to work correctly. The idea of normalization, also known as *feature scaling*, is to ensure that all the data is on a similar scale, *i.e.* that all the data takes on a similar range of values. For example, we might have a dataset that has values between 0 and 5,000. By normalizing the data we can make the range of values be between 0 and 1.
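
As a point of comparison, the 0-to-1 scaling described above (often called min-max scaling) can be sketched as follows. The array `data` and its values are made up purely for illustration.

```python
import numpy as np

# Hypothetical toy data with values spread between 0 and 5,000.
data = np.array([0.0, 1250.0, 2500.0, 5000.0])

# Min-max scaling: shift so the minimum becomes 0,
# then divide by the range so the maximum becomes 1.
data_scaled = (data - data.min()) / (data.max() - data.min())

print(data_scaled)
```

The smallest value maps to 0, the largest to 1, and everything else lands proportionally in between.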

In this mini project, I will perform a different kind of feature scaling known as *mean normalization*. Instead of mapping the values to the range between 0 and 1, mean normalization centers each column on zero and rescales it by its spread, so the values end up in a small interval around zero. For example, if a dataset has values between 0 and 5,000, after mean normalization the values might lie roughly between -3 and 3. Because the mean of each column is subtracted from that column, the average (mean) of the normalized values is zero, up to floating-point rounding. Therefore, after *mean normalization* the data is not only scaled but also has an average of zero.

# To Do:

Import NumPy and create a rank 2 ndarray of random integers between 0 and 5,000 (inclusive) with 1000 rows and 20 columns. This array will simulate a dataset with a wide range of values. 

In [39]:
# import NumPy into Python
import numpy as np

# Create a 1000 x 20 ndarray with random integers in the half-open interval [0, 5001).
X = np.random.randint(0, 5001, size=[1000,20])

# print the shape of X
np.shape(X)

(1000, 20)

Now that the array has been created, I will mean normalize it. I will perform mean normalization using the following equation:

$\mbox{Norm_Col}_i = \frac{\mbox{Col}_i - \mu_i}{\sigma_i}$

where $\mbox{Col}_i$ is the $i$th column of $X$, $\mu_i$ is the average of the values in the $i$th column of $X$, and $\sigma_i$ is the standard deviation of the values in the $i$th column of $X$. In other words, mean normalization is performed by subtracting from each column of $X$ the average of its values, and then dividing by the standard deviation of its values. The averages and standard deviations can be computed with the following code.

In [5]:
# Average of the values in each column of X
ave_cols = X.mean(axis=0)

# Standard Deviation of the values in each column of X
std_cols = X.std(axis=0)

`ave_cols` and `std_cols` should both be vectors with shape `(20,)`, since $X$ has 20 columns.

In [17]:
# Print the shape of ave_cols
print("shape of ave_cols is", np.shape(ave_cols))

# Print the shape of std_cols
print("shape of std_cols is",np.shape(std_cols))

shape of ave_cols is (20,)
shape of std_cols is (20,)


By taking advantage of broadcasting, I will calculate the mean normalized version of $X$ in just one line of code using the equation above.

In [11]:
# Mean normalize X
X_norm = ((X - ave_cols)/std_cols)
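
To make the broadcasting step more concrete, here is a small sketch on a made-up 3 x 2 array. The names `A`, `col_mean`, `col_std`, and `A_loop` are illustrative only. It compares the one-line broadcast expression with an explicit loop over columns.

```python
import numpy as np

# Hypothetical 3 x 2 array standing in for X (3 rows, 2 columns).
A = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

col_mean = A.mean(axis=0)  # shape (2,), one mean per column
col_std = A.std(axis=0)    # shape (2,), one standard deviation per column

# Broadcasting stretches the (2,) vectors across all 3 rows of A.
A_norm = (A - col_mean) / col_std

# The explicit loop computes the same result one column at a time.
A_loop = np.empty_like(A)
for j in range(A.shape[1]):
    A_loop[:, j] = (A[:, j] - col_mean[j]) / col_std[j]

print(A_norm.shape)                 # (3, 2)
print(np.allclose(A_norm, A_loop))  # True
```

The broadcast works because the trailing dimension of `A` (2) matches the length of the per-column vectors.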

The average of all the elements in $X_{\tiny{\mbox{norm}}}$ should be close to zero, and the values should lie in a small interval that is roughly symmetric around zero.

In [18]:
# Print the average of all the values of X_norm
print("the average of all the values of X_norm is", X_norm.mean())

# Print the average of the minimum value in each column of X_norm
print("The average of the minimum value in each column of X_norm is",
      X_norm.min(axis=0).mean())

# Print the average of the maximum value in each column of X_norm
print("The average of the maximum value in each column of X_norm is",
      X_norm.max(axis=0).mean())


the average of all the values of X_norm is 3.730349362740526e-18
The average of the minimum value in each column of X_norm is -1.7345229511622224
The average of the maximum value in each column of X_norm is 1.7345739704399699
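
The minimum and maximum near $\pm 1.73$ can be checked with a rough calculation. As an approximation, treat each column as uniformly distributed on $[0, 5000]$, which is an assumption about the random integers rather than an exact statement. Then:

$\mu \approx 2500, \qquad \sigma \approx \frac{5000}{\sqrt{12}} \approx 1443.4, \qquad \frac{0 - \mu}{\sigma} \approx -\sqrt{3} \approx -1.732, \qquad \frac{5000 - \mu}{\sigma} \approx \sqrt{3} \approx 1.732$

This is consistent with the printed values, which differ slightly because each column is a finite random sample.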


Since $X$ was created using random integers, the exact values above will vary from run to run.
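
If repeatable numbers are needed, one option is to draw the data from a seeded generator. The sketch below is not part of the original exercise: it assumes NumPy's `np.random.default_rng`, an arbitrary seed of 0, and the illustrative names `X_demo` and `X_demo_norm`. It also checks each column individually rather than only the overall average.

```python
import numpy as np

# Assumption: a seeded Generator; the seed value 0 is arbitrary.
rng = np.random.default_rng(0)

# integers() excludes the upper bound, so 5001 keeps 5,000 possible.
X_demo = rng.integers(0, 5001, size=(1000, 20))

# Same normalization as above, applied to the seeded data.
X_demo_norm = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# After normalization every column should have mean ~0 and std ~1.
print(np.allclose(X_demo_norm.mean(axis=0), 0.0))  # True
print(np.allclose(X_demo_norm.std(axis=0), 1.0))   # True
```

Re-running this cell with the same seed produces the same `X_demo` each time.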

# Data Separation

After the data has been mean normalized, it is customary in machine learning to split the dataset into three sets:

1. A Training Set
2. A Cross Validation Set
3. A Test Set

The dataset is usually divided such that the Training Set contains 60% of the data, the Cross Validation Set contains 20% of the data, and the Test Set contains 20% of the data. 

In this part of the lab I will separate `X_norm` into a Training Set, Cross Validation Set, and a Test Set. Each data set will contain rows of `X_norm` chosen at random, making sure that we don't pick the same row twice. This will guarantee that all the rows of `X_norm` are chosen and randomly distributed among the three new sets.

I will start by creating a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. I will do this by using the `np.random.permutation()` function. The `np.random.permutation(N)` function creates a random permutation of integers from 0 to `N - 1`. 
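
A quick sketch of what `np.random.permutation()` returns, using a small `N` of 5 chosen only for illustration:

```python
import numpy as np

# A random ordering of the integers 0, 1, 2, 3, 4 (different on each run).
perm = np.random.permutation(5)
print(perm)

# Each index appears exactly once, so sorting recovers 0..4.
print(sorted(perm.tolist()))  # [0, 1, 2, 3, 4]
```

Because no index is repeated, using such a permutation to pick rows never selects the same row twice.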

# To Do

In the space below create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`. 

In [26]:
# Create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`
# (X_norm.shape[0] is the number of rows, 1000 here)
row_indices = np.random.permutation(X_norm.shape[0])

In [35]:
# Create the three datasets using the row_indices ndarray
# to select the rows that will go into each dataset.
train_end = int(len(row_indices) * 0.6)                 # end of the first 60% of shuffled indices
crossval_end = train_end + int(len(row_indices) * 0.2)  # end of the next 20%

# Create a Training Set (first 60% of the shuffled rows)
X_train = X_norm[row_indices[:train_end]]

# Create a Cross Validation Set (next 20% of the shuffled rows)
X_crossVal = X_norm[row_indices[train_end:crossval_end]]

# Create a Test Set (remaining 20% of the shuffled rows)
X_test = X_norm[row_indices[crossval_end:]]

In [37]:
# Print the shape of X_train
print("The shape of X_train is ", np.shape(X_train))

# Print the shape of X_crossVal
print("The shape of X_crossVal is ", np.shape(X_crossVal))

# Print the shape of X_test
print("The shape of X_test is ", np.shape(X_test))

The shape of X_train is  (600, 20)
The shape of X_crossVal is  (200, 20)
The shape of X_test is  (200, 20)
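
As a sanity check on a 60/20/20 split, the sketch below verifies that the three sets do not overlap and together cover every row. It is self-contained and uses stand-in names (`data`, `idx`, `train_idx`, `val_idx`, `test_idx`) rather than the variables above. Using `np.split` is one possible way to cut the index array and is an alternative to manual slicing.

```python
import numpy as np

n_rows = 1000
# Stand-in for X_norm: any array with 1000 rows and 20 columns works here.
data = np.arange(n_rows * 20).reshape(n_rows, 20)

# Shuffle the row indices once.
idx = np.random.permutation(n_rows)

n_train = int(n_rows * 0.6)  # 600 rows for training
n_val = int(n_rows * 0.2)    # 200 rows for cross validation

# Cut the shuffled indices at positions 600 and 800.
train_idx, val_idx, test_idx = np.split(idx, [n_train, n_train + n_val])

train, val, test = data[train_idx], data[val_idx], data[test_idx]
print(train.shape, val.shape, test.shape)  # (600, 20) (200, 20) (200, 20)

# Together the three index sets should contain every row exactly once.
all_idx = np.concatenate([train_idx, val_idx, test_idx])
print(np.array_equal(np.sort(all_idx), np.arange(n_rows)))  # True
```

If the second check printed `False`, some row would be either missing or assigned to more than one set.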
