### <font color="brown">Problem Set 8: NumPy - Solution</font>

In [3]:
import numpy as np

---

#### Problem 1 

Mean normalization: <br>

Mean normalization is a common data pre-processing step in data science and machine learning. <br>Write a function that replaces all NaN values in a given array with zero, then performs mean normalization, i.e. subtracts the mean of each row from every item in that row.

In [17]:
# e.g. input array
X = np.array([[5,6,np.nan,7],[1,np.nan,0,5],[-1,5,np.nan,2]])

#### Solution

Before getting into the full solution, a key thing to note is that when you find the mean of each row, the result is a 1D array, NOT a 2D array with 3 rows and 1 column:

In [22]:
arr = np.arange(1,13).reshape(3,4)
print(arr,'\n')
res = arr.mean(axis=1)
print(res,'\n')
print(res.shape)

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]] 

[ 2.5  6.5 10.5] 

(3,)


Note that the shape is (3,) meaning a 1D array with 3 items. If you try to subtract this from each item in arr, it won't work:

In [23]:
arr - arr.mean(axis=1)

ValueError: operands could not be broadcast together with shapes (3,4) (3,) 

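The error message says that the (3,4) shape is incompatible with the (3,) shape. Broadcasting compares shapes starting from the last dimension, so the 3 items of the means array are matched against the 4 columns of arr rather than its 3 rows.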

One way to fix this is to reshape the result of the mean function to (3,1):

In [24]:
arr - arr.mean(axis=1).reshape(3,1)

array([[-1.5, -0.5,  0.5,  1.5],
       [-1.5, -0.5,  0.5,  1.5],
       [-1.5, -0.5,  0.5,  1.5]])

Another option is to use the keepdims parameter of the mean function, which keeps the result 2D, with shape (3,1):

In [25]:
arr - arr.mean(axis=1,keepdims=True)

array([[-1.5, -0.5,  0.5,  1.5],
       [-1.5, -0.5,  0.5,  1.5],
       [-1.5, -0.5,  0.5,  1.5]])
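
Both fixes work for the same reason: a dimension of size 1 is stretched to match the other operand. As a minimal sketch (the `np.newaxis` spelling and the variable names are illustrative additions, not part of the solution), the approaches can be checked against each other:

```python
import numpy as np

arr = np.arange(1, 13).reshape(3, 4)
row_means = arr.mean(axis=1)                 # shape (3,)

# (3,4) vs (3,): the trailing dimensions 4 and 3 differ, so this fails
try:
    arr - row_means
    broadcast_ok = True
except ValueError:
    broadcast_ok = False

# (3,4) vs (3,1): the size-1 dimension is stretched across the 4 columns
by_reshape = arr - row_means.reshape(3, 1)
by_keepdims = arr - arr.mean(axis=1, keepdims=True)
by_newaxis = arr - row_means[:, np.newaxis]  # np.newaxis inserts a size-1 axis

print(broadcast_ok)
print(np.allclose(by_reshape, by_keepdims), np.allclose(by_reshape, by_newaxis))
```

Using `keepdims=True` or `np.newaxis` avoids hard-coding the number of rows, which matters once the array is not 3 rows tall.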

**Following is the complete solution:**

In [26]:
def mean_normalize(X):
    X = np.nan_to_num(X)   # this function replaces NaNs with 0
    Y = X - X.mean(axis=1, keepdims=True)
    return Y

In [29]:
print(X,'\n')
Y = mean_normalize(X)
print(Y)

[[ 5.  6. nan  7.]
 [ 1. nan  0.  5.]
 [-1.  5. nan  2.]] 

[[ 0.5  1.5 -4.5  2.5]
 [-0.5 -1.5 -1.5  3.5]
 [-2.5  3.5 -1.5  0.5]]
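
A quick way to sanity-check the result is to confirm that every row of the normalized array averages to zero. The sketch below repeats the two steps of mean_normalize inline so it runs on its own; the variable names are illustrative:

```python
import numpy as np

X = np.array([[5, 6, np.nan, 7], [1, np.nan, 0, 5], [-1, 5, np.nan, 2]])

# Same two steps as mean_normalize: fill NaNs with 0, then subtract row means
filled = np.nan_to_num(X)
Y = filled - filled.mean(axis=1, keepdims=True)

# After normalization every row should average to (approximately) zero
row_means_after = Y.mean(axis=1)
print(np.allclose(row_means_after, 0))
```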


---

#### Problem 2

Given a 2D NumPy array with n rows and k columns, where the rows are students, and the columns are scores on quizzes (integer, between 0 and 20, inclusive), compute a result NumPy array of size k x 3 with the max, min, and average score on each quiz, and, separately, the average class score on all quizzes combined.

#### Solution

In [4]:
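def getStats(scores):
    # One row per quiz (column of scores); the 3 columns hold max, min, average
    res = np.empty((scores.shape[1],3))
    res[:,0] = np.max(scores,axis=0)
    res[:,1] = np.min(scores,axis=0)
    res[:,2] = np.average(scores,axis=0)
    # Second return value: average over all scores on all quizzes
    return res,np.average(scores)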
    

In [5]:
scores = np.random.randint(0,21,(10,5))   # scores 0 to 20 inclusive (the upper bound 21 is exclusive)
scores

array([[ 6, 13, 12,  2, 11],
       [10,  9, 20, 18,  3],
       [ 8,  7, 10, 11, 11],
       [17, 18,  4,  1,  6],
       [14, 11, 17,  5,  7],
       [13, 11,  1, 11,  5],
       [ 9,  4, 20, 15, 17],
       [16, 19,  4,  2, 15],
       [16,  6,  6,  9,  3],
       [ 5, 20,  2, 16, 14]])

In [6]:
res,avg = getStats(scores)
print(res,'\n')
print(avg)

[[17.   5.  11.4]
 [20.   4.  11.8]
 [20.   1.   9.6]
 [18.   1.   9. ]
 [17.   3.   9.2]] 

10.2
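
As an alternative sketch, the k x 3 result can be assembled in a single call with `np.column_stack` instead of filling an empty array column by column. The function name `get_stats_stacked` and the small input array are made up for illustration:

```python
import numpy as np

def get_stats_stacked(scores):
    # Hypothetical alternative to getStats: stack the three per-quiz
    # statistics side by side as the columns of a k x 3 array
    per_quiz = np.column_stack((scores.max(axis=0),
                                scores.min(axis=0),
                                scores.mean(axis=0)))
    return per_quiz, scores.mean()

# Small fixed example: 3 students, 2 quizzes
scores = np.array([[10, 20], [0, 15], [5, 10]])
per_quiz, overall = get_stats_stacked(scores)
print(per_quiz)
print(overall)
```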


---

#### Problem 3

Write a function that takes a 2D ndarray and cycles the rows up by 1 so that the first row becomes the last, the last becomes second-to-last, etc. 

#### Solution

In [7]:
def rowcycle(ndarr):
    cycle = list(range(1,ndarr.shape[0])) + [0]
    return ndarr[cycle]

arr2d = np.random.randint(1,13,(4,3))
print(arr2d,'\n')
print(rowcycle(arr2d))

[[6 9 8]
 [9 9 2]
 [9 1 7]
 [6 1 1]] 

[[9 9 2]
 [9 1 7]
 [6 1 1]
 [6 9 8]]
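
The same cycling can likely be expressed with NumPy's `np.roll`, which shifts items along an axis and wraps them around. This is a sketch of an alternative, not the approach used above; `rowcycle_roll` is a hypothetical name:

```python
import numpy as np

def rowcycle_roll(ndarr):
    # shift=-1 along axis 0 moves every row up one place;
    # the first row wraps around to the end
    return np.roll(ndarr, -1, axis=0)

arr2d = np.arange(12).reshape(4, 3)
cycled = rowcycle_roll(arr2d)
print(cycled)
```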


---

#### Problem 4

Write a function that takes an ndarray and computes the standard deviation of the values in each row, without using the standard deviation function. It should return an array with these standard deviations. 
See https://www.mathsisfun.com/data/standard-deviation-formulas.html
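
For reference, the population form of the formula (dividing by $N$, the number of values in each row) can be written for row $i$ as follows, where $x_{ij}$ is the value in row $i$ and column $j$. The symbols are notation introduced here for illustration:

$$
\mu_i = \frac{1}{N}\sum_{j=1}^{N} x_{ij}, \qquad
\sigma_i = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left(x_{ij} - \mu_i\right)^2}
$$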

#### Solution

In [30]:
# Building the solution, one step at a time

# 1. Sample 2D ndarray
arr = np.array([[3,1,-2],[1,8,2],[6,1,5]])
print(f'Input array:\n {arr}\n')

# 2. Mean for each row
mn = np.mean(arr,axis=1)
print(f'Row means: {mn}\n')

# 3. Reshape the means array (which is 1D) into a 2D column vector; -1 infers the number of rows
mn = mn.reshape(-1,1)
print(f'Row means, column vector:\n {mn}\n')

# 4. Subtract row's mean from each row value
arr1 = arr - mn
print(f'Row value minus mean:\n {arr1}\n')

# 5. Square the differences
arr1 = arr1 ** 2
print(f'Differences squared:\n {arr1}\n')

# 6. Sum the squared differences
arr1 = arr1.sum(axis=1)
print(f'Sum of squared differences:\n {arr1}\n')

# 7. Divide each by number of columns (values in each row)
arr1 = arr1 / arr.shape[1]
print(f'Divide by N={arr.shape[1]} (number of values in each row):\n {arr1}\n')

# 8. Square root of each
arr1 = np.sqrt(arr1)
print(f'Standard deviations:\n {arr1}\n')

Input array:
 [[ 3  1 -2]
 [ 1  8  2]
 [ 6  1  5]]

Row means: [0.66666667 3.66666667 4.        ]

Row means, column vector:
 [[0.66666667]
 [3.66666667]
 [4.        ]]

Row value minus mean:
 [[ 2.33333333  0.33333333 -2.66666667]
 [-2.66666667  4.33333333 -1.66666667]
 [ 2.         -3.          1.        ]]

Differences squared:
 [[ 5.44444444  0.11111111  7.11111111]
 [ 7.11111111 18.77777778  2.77777778]
 [ 4.          9.          1.        ]]

Sum of squared differences:
 [12.66666667 28.66666667 14.        ]

Divide by N=3 (number of values in each row):
 [4.22222222 9.55555556 4.66666667]

Standard deviations:
 [2.05480467 3.09120617 2.1602469 ]



In [156]:
# Verify against np standard deviation function, std
arr.std(axis=1)

array([2.05480467, 3.09120617, 2.1602469 ])

In [157]:
# Solution function
def stddev(arr):
    # keepdims=True keeps the row means as a column, so this works for any number of rows
    arr1 = arr - np.mean(arr,axis=1,keepdims=True)
    arr1 = (arr1 ** 2).sum(axis=1)/arr.shape[1]
    return np.sqrt(arr1)


In [158]:
# Test
stddev(arr)

array([2.05480467, 3.09120617, 2.1602469 ])
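
The solution divides by N, which gives the population standard deviation and matches the default behavior of `std`. As a hedged sketch, a `ddof` argument (a hypothetical addition here, mirroring the parameter of the same name in `np.std`) would also cover the sample standard deviation, which divides by N-1:

```python
import numpy as np

def stddev_rows(arr, ddof=0):
    # Hypothetical generalization: ddof=0 divides by N (population),
    # ddof=1 divides by N-1 (sample)
    diffs = arr - arr.mean(axis=1, keepdims=True)
    return np.sqrt((diffs ** 2).sum(axis=1) / (arr.shape[1] - ddof))

arr = np.array([[3, 1, -2, 4], [1, 8, 2, 0]])   # 2 rows x 4 columns
print(stddev_rows(arr))
print(stddev_rows(arr, ddof=1))
```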

---

#### Problem 5

Create a 2D array of shape 5x3 containing random decimal numbers between 5 and 10. Get the position (index) of the two largest numbers in each row. Then, in the generated array, replace all values greater than 8 with 10 and all values less than 6 with 5.

#### Solution
In addition to functions covered in class, this solution uses the NumPy flip function:<br>
https://numpy.org/doc/stable/reference/generated/numpy.flip.html

In [13]:
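# Two options to create the 2D array (Option 2 overwrites the result of Option 1)

# Option 1: random integer part (5 to 9) plus a random fraction in [0, 1)
a = np.random.randint(low=5, high=10, size=(5,3)) + np.random.random((5,3))

# Option 2: sample directly from the uniform distribution on [5, 10)
a = np.random.uniform(5,10, size=(5,3))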
print(a)

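# argsort sorts ascending; flip each row for descending order and keep the top 2 positions
max_pos = np.flip(np.argsort(a, axis=1), axis=1)
max_pos = max_pos[:,:2]
print(max_pos)

# Values below 6 become 5, values above 8 become 10, everything else is unchanged
new_arr = np.where(a < 6, 5, np.where(a > 8, 10, a))
print(new_arr)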

[[8.06574787 6.17902828 6.23045524]
 [7.76811644 8.96141786 8.06881827]
 [7.87480721 6.34269721 6.98387953]
 [8.44756539 6.5717027  9.67490396]
 [9.75381999 7.99566054 6.54296105]]
[[0 2]
 [1 2]
 [0 2]
 [2 0]
 [0 1]]
[[10.          6.17902828  6.23045524]
 [ 7.76811644 10.         10.        ]
 [ 7.87480721  6.34269721  6.98387953]
 [10.          6.5717027  10.        ]
 [10.          7.99566054  6.54296105]]
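
Two alternative spellings, offered as a sketch rather than as the intended solution: sorting the negated array gives descending order without `flip`, and `np.select` expresses the two replacements without nesting `np.where`. The array is copied from the printed output so the results can be compared:

```python
import numpy as np

# The array printed above, copied in so the results can be compared
a = np.array([[8.06574787, 6.17902828, 6.23045524],
              [7.76811644, 8.96141786, 8.06881827],
              [7.87480721, 6.34269721, 6.98387953],
              [8.44756539, 6.5717027 , 9.67490396],
              [9.75381999, 7.99566054, 6.54296105]])

# Sorting the negated values ascending is the same as sorting the values descending
top2 = np.argsort(-a, axis=1)[:, :2]

# np.select takes a list of conditions and a matching list of replacement values;
# items that satisfy no condition take the value from `default`
clipped = np.select([a < 6, a > 8], [5, 10], default=a)

print(top2)
print(clipped)
```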


---

#### Problem 6

Generate *one-hot encodings* for a list of values (classes). One-hot encoding and its applications are explained in the following resources: 
1. https://en.wikipedia.org/wiki/One-hot
2. https://medium.com/@michaeldelsole/what-is-one-hot-encoding-and-how-to-do-it-f0ae272f1179

Write a function that takes a 1-D list as input and returns a 2-D NumPy array whose rows are the one-hot encodings of the classes in the list. E.g. Input: ['cat','camel','dog','cat'] <br>
Output: [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]

#### Solution

In [14]:
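def one_hot_encoding(l):
    # Assign each class a column in order of first appearance,
    # so this works for strings as well as integers
    classes = list(dict.fromkeys(l))
    col = {c: j for j, c in enumerate(classes)}
    encoding = np.zeros((len(l), len(classes)))
    for i, k in enumerate(l):
        encoding[i, col[k]] = 1
    return encoding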

In [15]:
# Test

l = [1,2,0,1,2]
encoding = one_hot_encoding(l)
print(encoding)

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]]
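
A loop-free variant is also possible. The sketch below, with the hypothetical name `one_hot_sorted`, uses `np.unique` with `return_inverse=True` to get each item's class index and then picks rows of an identity matrix. One assumption to note: `np.unique` returns the classes in sorted order, so the columns follow sorted class order rather than the order of first appearance used in the problem's example:

```python
import numpy as np

def one_hot_sorted(values):
    # classes: sorted unique values; inverse: index into classes for each item
    classes, inverse = np.unique(np.array(values), return_inverse=True)
    # Row i of the identity matrix is the one-hot vector for class i
    return classes, np.eye(classes.shape[0])[inverse]

classes, encoding = one_hot_sorted(['cat', 'camel', 'dog', 'cat'])
print(classes)
print(encoding)
```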


---