# Achtung!

This version of the notebook has everything all filled in!
If we use it at a future time, we will need to turn all the answers back into questions!

# Multi-dimensional arrays
So far, 

* We've seen how to work with single columns of data. 
* But data are often in tables.

# Multi-dimensional data 

Consider

In [1]:
import numpy as np
x = np.array([[1,2,3], [4,5,6], [7,8,9]])
x

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

This is a *multi-dimensional array*. 
To access it, note that: 

In [2]:
print(x[0])
print(x[1])
print(x[2])
print(x.shape)

[1 2 3]
[4 5 6]
[7 8 9]
(3, 3)


* `x.shape` is the shape of the array: 3x3. 
* `x[0]` through `x[2]` are the "rows". 
* `x[i][j]` is the element at row `i`, column `j`: `x[i]` picks the row, and `[j]` then picks the element within it. 

In [3]:
x[2][2]

9
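NumPy also accepts both indices inside one pair of brackets, and a colon stands for "everything along this axis". The following is a small sketch that re-creates the same `x`; the names `corner`, `second_row` and `second_col` are made up for illustration.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# x[i, j] addresses row i, column j in one step; it is the same element as x[i][j].
corner = x[2, 2]

# A colon selects the whole extent of that axis.
second_row = x[1, :]   # same values as x[1]
second_col = x[:, 1]   # one entry taken from each row

print(corner)
print(second_row)
print(second_col)
```

The single-bracket form is the one that extends to column selection, which `x[i][j]` cannot express directly.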

You'll be happy to know that the things you are used to for single-dimensional arrays still work, e.g., 

In [4]:
x + 4

array([[ 5,  6,  7],
       [ 8,  9, 10],
       [11, 12, 13]])

In [5]:
x > 5


array([[False, False, False],
       [False, False,  True],
       [ True,  True,  True]])

In [6]:
x[x>5]

array([6, 7, 8, 9])

Oops. That doesn't do quite what we might want. It *flattened* the array and produced the elements that match. 
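If the goal is to keep whole rows, or to keep the 3x3 shape, the boolean table has to be used differently. The sketch below shows two possibilities; it uses `axis=1`, which the next section explains, and the filler value `0` is an arbitrary choice for illustration.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# One boolean per row: does the row contain any element greater than 5?
row_mask = (x > 5).any(axis=1)
rows_kept = x[row_mask]          # whole rows survive, so the result stays 2-D

# Keep the original shape; elements that fail the test become 0.
same_shape = np.where(x > 5, x, 0)

print(rows_kept)
print(same_shape)
```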

# The concept of an axis
Most meaningful operations on multi-dimensional arrays act on rows or columns. We might want to remove some rows or columns, or we might want to filter out all rows matching some criteria.

An *axis* is a number or designation that describes the dimension to which to apply an operation. 

Consider, e.g., 


In [7]:
print("input = ")
print(x)
print("x.sum(axis=0)={}".format(x.sum(axis=0)))
print("x.sum(axis=1)={}".format(x.sum(axis=1)))

input = 
[[1 2 3]
 [4 5 6]
 [7 8 9]]
x.sum(axis=0)=[12 15 18]
x.sum(axis=1)=[ 6 15 24]


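* The first parameter of `sum` is `axis` (here 0 or 1): the dimension that gets summed away. 
* `axis=0` --> sum down the *rows*, giving one total per column. 
* `axis=1` --> sum across the *columns*, giving one total per row.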
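With a square array it is hard to see which dimension disappears, so here is a sketch with a 2x3 array (the array `y` is invented for this illustration). The axis named in the call is the one that is collapsed.

```python
import numpy as np

y = np.array([[1, 2, 3],
              [4, 5, 6]])    # shape (2, 3): 2 rows, 3 columns

col_sums = y.sum(axis=0)     # axis 0 (length 2) is collapsed: 3 values remain
row_sums = y.sum(axis=1)     # axis 1 (length 3) is collapsed: 2 values remain
total = y.sum()              # no axis given: every element is summed

print(col_sums, col_sums.shape)
print(row_sums, row_sums.shape)
print(total)
```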

# Broadcasting

* When NumPy is faced with arrays of the exact same shape, operations proceed element by element. 
* When shapes differ, NumPy applies *broadcasting*, an idea that many other libraries have copied. 
* We've already seen this in the single-dimensional case, but here it is in the multi-dimensional case. 
* Consider ![this diagram](../figures/03-02-broadcasting.png)

In [8]:
a = np.array([[1,2,3],[4,5,6]])
b = np.array([10,11,12])
a + b

array([[11, 13, 15],
       [14, 16, 18]])

Simply stated, computing `a + b` behaves as if `b` were replicated into a two-dimensional array `b'` with the rows copied (NumPy does not actually build the copies). The result is identical to: 

In [9]:
a = np.array([[1,2,3], [4,5,6]])
c = [10,11,12]
d = np.array([c, c])
print("a=")
print(a)
print("d=")
print(d)
a+d

a=
[[1 2 3]
 [4 5 6]]
d=
[[10 11 12]
 [10 11 12]]


array([[11, 13, 15],
       [14, 16, 18]])
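To look at the stretched array directly, `np.broadcast_to` returns the view that broadcasting behaves as if it used. The sketch below also shows the other direction, adding one value per *row*, which needs a column-shaped vector; the array `r` is made up for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
b = np.array([10, 11, 12])             # shape (3,)

# b stretched to the shape of a; this is a read-only view, not a copy.
b_stretched = np.broadcast_to(b, a.shape)
print(b_stretched)

# One value per row: reshape r from (2,) to (2, 1) so it stretches across columns.
r = np.array([100, 200])
per_row = a + r[:, np.newaxis]
print(per_row)
```

Without the `[:, np.newaxis]`, `a + r` would raise an error, because a length-2 vector cannot be lined up with rows of length 3.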

# What is the point of "broadcasting"?
* Very often we want to repeat a comparison among all rows. 
* Consider the following really counter-intuitive but very useful pattern.

In [10]:
# convention: -1 means missing
data = np.array([[1,2,1],
                 [-1,5,2],
                 [3,3,-1],
                 [1,1,4],
                 [2,1,2]])
print("data is:")
print(data)
print("data != -1 is:")
print(data != -1)
# True means corresponding line has no missing data
choices = (data != -1).all(axis=1)
print("(data != -1).all(axis=1) =")
print(choices)
# select lines without missing data. 
print("data[choices] is")
data[choices]

data is:
[[ 1  2  1]
 [-1  5  2]
 [ 3  3 -1]
 [ 1  1  4]
 [ 2  1  2]]
data != -1 is:
[[ True  True  True]
 [False  True  True]
 [ True  True False]
 [ True  True  True]
 [ True  True  True]]
(data != -1).all(axis=1) =
[ True False False  True  True]
data[choices] is


array([[1, 2, 1],
       [1, 1, 4],
       [2, 1, 2]])

In other words, I just selected all rows that don't have missing values. 

Let's take this apart carefully. 
* By convention, data that is missing is represented by `-1`. 
* Thus there are two rows in which data is missing. 
* We want to exclude any row with a -1 in it. 
* We compare every element to -1. 
* Then we compute the logical *and* within every *row* (`axis=1`). That is, 
  we combine the values across the columns (axis 1) to get one summary flag per row. 
* Then we select all rows (axis 0) for which every test is True. 
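The same selection can be phrased the other way around: flag the rows that *do* contain a -1 with `any`, then invert the flags. This is only a sketch of an equivalent formulation, with hypothetical names `has_missing` and `complete`.

```python
import numpy as np

# convention: -1 means missing
data = np.array([[1, 2, 1],
                 [-1, 5, 2],
                 [3, 3, -1],
                 [1, 1, 4],
                 [2, 1, 2]])

# True for a row if at least one of its elements equals -1.
has_missing = (data == -1).any(axis=1)

# ~ flips each boolean, leaving True for the complete rows.
complete = data[~has_missing]
print(complete)

# Both formulations pick the same rows.
same = np.array_equal(complete, data[(data != -1).all(axis=1)])
print(same)
```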

# A compelling example

Suppose we want to examine all rows in which some element is at least 1 standard deviation from the mean of its column. Consider this code: 

In [11]:
data = np.array([[12, 42, 12],
                 [13,  2, 13],
                 [11, 40, 14],
                 [14, 44, 11], 
                 [10, 39, 15],
                 [13, 43, 14]])
stdev = data.std(axis=0)
print("stdev = {}".format(stdev))
means = data.mean(axis=0)
print("means = {}".format(means))
mins = means - stdev
print("mins = {}".format(mins))
maxs = means + stdev
print("maxs = {}".format(maxs))
gt_lower = (data > mins).all(axis=1)
lt_higher = (data < maxs).all(axis=1)
in_bounds = gt_lower & lt_higher
outliers = np.invert(in_bounds)
print("outlier choices are:")
print(outliers)
print("outlier rows are:")
print(data[outliers])

stdev = [ 1.34370962 14.8548533   1.34370962]
means = [12.16666667 35.         13.16666667]
mins = [10.82295704 20.1451467  11.82295704]
maxs = [13.51037629 49.8548533  14.51037629]
outlier choices are:
[False  True False  True  True False]
outlier rows are:
[[13  2 13]
 [14 44 11]
 [10 39 15]]


# A really curious logic

* What we want is to compute a boolean mask marking the rows to choose
( `outliers = [False, True, False, True, True, False]`)

How we do that: 
* compute mean and standard deviation. 
* compute upper and lower bounds for each *column*. 
* broadcast those tables in comparisons with each *row*. 
* compute from that whether each row matches. 
* invert that to get rows that don't match. 
* select these via `data[selection]` pattern

We can do most of this in one line: 

In [12]:
data[np.invert(((data > mins) & (data < maxs)).all(axis=1))]

array([[13,  2, 13],
       [14, 44, 11],
       [10, 39, 15]])
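An equivalent way to think about the bounds is in terms of distance from the column mean: a row is an outlier if any of its elements is at least one standard deviation away. The sketch below is an alternative formulation, not the one used above, and for this data it selects the same rows.

```python
import numpy as np

data = np.array([[12, 42, 12],
                 [13,  2, 13],
                 [11, 40, 14],
                 [14, 44, 11],
                 [10, 39, 15],
                 [13, 43, 14]])

# Distance of every element from its column mean (the means broadcast over rows).
deviation = np.abs(data - data.mean(axis=0))

# A row is flagged if any element is at least one column stdev from the mean.
outliers = (deviation >= data.std(axis=0)).any(axis=1)

print(outliers)
print(data[outliers])
```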

# Aside: functional programming

This seemingly curious logic is part of a movement in Computer Science toward what is called *functional programming*. 
* Express things in terms of functions that transform data. 
* Avoid the "for" loop at all costs. 

This way of thinking is theoretically desirable. Functional programs are often: 
* much easier to debug and correct. 
* much easier to speed up through parallel computing. 

In fact, *the use of "for" loops makes these things more difficult!* 
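To make the contrast concrete, here is a sketch of the same task (dropping rows that contain -1) written once with a loop and once as a single array expression. The small `data` array is invented for this comparison.

```python
import numpy as np

data = np.array([[1, 2, 1],
                 [-1, 5, 2],
                 [3, 3, -1],
                 [1, 1, 4]])

# Loop style: visit each row, test it, and build up the result step by step.
kept = []
for row in data:
    if -1 not in row:
        kept.append(row)
loop_result = np.array(kept)

# Functional style: one expression that transforms the whole array at once.
vector_result = data[(data != -1).all(axis=1)]

print(np.array_equal(loop_result, vector_result))
```

The loop has intermediate state (`kept`) that can go wrong at every step; the array expression has none.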

# Patterns for functional data programming
* *operations on rows:* broadcasting and parallel selection. 
* *transposition:* switch rows and columns, then operate on columns as rows in order to use row patterns! 

Consider the following: 

In [13]:
columns = data.transpose()
print(columns)
# Let's remove column 1, which is now row 1!
c2 = columns[[True, False, True]]
print(c2)
d2 = c2.transpose()
d2

[[12 13 11 14 10 13]
 [42  2 40 44 39 43]
 [12 13 14 11 15 14]]
[[12 13 11 14 10 13]
 [12 13 14 11 15 14]]


array([[12, 12],
       [13, 13],
       [11, 14],
       [14, 11],
       [10, 15],
       [13, 14]])

# What happened? 
* The pattern `data[selections]` works on rows. 
* I wanted to remove a column. 

so

* transpose rows and columns. 
* remove the new "row" (old "column")
* transpose rows and columns back. 

Result is removing a column! 
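For comparison, NumPy can also apply a boolean selection to the column axis directly by putting a colon in the row position. The sketch below uses a shortened version of the data and a hypothetical mask named `keep`, and checks that both routes agree.

```python
import numpy as np

data = np.array([[12, 42, 12],
                 [13,  2, 13],
                 [11, 40, 14]])

keep = np.array([True, False, True])   # drop column 1

# ":" keeps every row; the mask then chooses among the columns.
direct = data[:, keep]

# The transpose, select, transpose route described above.
via_transpose = data.transpose()[keep].transpose()

print(direct)
print(np.array_equal(direct, via_transpose))
```

The transpose pattern is still worth knowing, because it lets any row-oriented function be reused on columns.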

# Let's put this into practice. 
A note on the format of these exercises: 
* This is an exercise in *functional programming.* 
* Thus, I will ask you to *write functions* to accomplish specific things. 
* These functions should work on any input I give them, within bounds. 
* They will be tested on arbitrary test cases.

First login to grading:

In [14]:
# Don't change this cell; just run it. 
from client.api.notebook import Notebook
ok = Notebook('03-02-multi-dimensional-data.ok')
ok.auth(inline=True)

Assignment: Multi-dimensional data
OK, version v1.14.15





1. **Write a function `clean_rows`** that takes a two-dimensional `array` and deletes the *rows* that contain -1's. Hint: act on the whole array and then collect results for rows with `all(axis=1)`. (-1 is a conventional code for *missing data* in public data corpora.) 

In [15]:
def clean_rows(data): 
    # Fill in details here
    mask = (data != -1).all(axis=1)
    return data[mask] 

In [16]:
# Test your code on this example.
data = np.array([[4, -1, 1], [1, 2, 3], [7, 1, 9], [-1, 2, -1], [2, 4, 6]])
print("Before:")
print(data)
print("After:")
print(clean_rows(data))

Before:
[[ 4 -1  1]
 [ 1  2  3]
 [ 7  1  9]
 [-1  2 -1]
 [ 2  4  6]]
After:
[[1 2 3]
 [7 1 9]
 [2 4 6]]


In [17]:
_ = ok.grade('q01')  # check that your solution works. 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



2. **Write a function `clean_columns`** that removes all columns containing -1 in any row. Hint: this is the transpose of the first problem. 

In [18]:
def clean_columns(data): 
    # fill in details here
    trans = data.transpose()
    sel = trans[(trans != -1).all(axis=1)] 
    return sel.transpose()

In [19]:
# Test your code on this example
data = np.array([[4, 5, -1, 2], [1, 2, 3, 1],
                 [7, -1, 9, 8], [4, 2, 3, 6], [2, 4, 6, 2]])
print("Before:")
print(data)
print("After:")
print(clean_columns(data))

Before:
[[ 4  5 -1  2]
 [ 1  2  3  1]
 [ 7 -1  9  8]
 [ 4  2  3  6]
 [ 2  4  6  2]]
After:
[[4 2]
 [1 1]
 [7 8]
 [4 6]
 [2 2]]


In [20]:
_ = ok.grade('q02')  # check that your solution works. 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



3. **Write a function `masked`** that masks missing data using a masked array.  See this documentation: https://docs.scipy.org/doc/numpy/reference/generated/numpy.ma.masked_where.html

In [21]:
def masked(data):
    # fill in details...
    return np.ma.masked_where(data == -1, data)

In [22]:
# Test your code on this example
data = np.array([[4, 5, -1, 2], [1, 2, 3, 1],
                 [7, -1, 9, 8], [4, 2, 3, 6], [2, 4, 6, 2]])
print("Before:")
print(data)
print("After:")
print(masked(data))

Before:
[[ 4  5 -1  2]
 [ 1  2  3  1]
 [ 7 -1  9  8]
 [ 4  2  3  6]
 [ 2  4  6  2]]
After:
[[4 5 -- 2]
 [1 2 3 1]
 [7 -- 9 8]
 [4 2 3 6]
 [2 4 6 2]]
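A masked array carries a boolean `mask` next to the data, and reductions skip the masked entries. The sketch below uses `np.ma.masked_equal`, a shorthand for masking one specific value, rather than `masked_where`; the smaller `data` array is made up for illustration.

```python
import numpy as np

data = np.array([[4, 5, -1, 2],
                 [1, 2, 3, 1],
                 [7, -1, 9, 8]])

# Mask every entry equal to -1.
m = np.ma.masked_equal(data, -1)

print(m.mask)            # True marks a hidden (missing) entry
print(m.count(axis=0))   # number of non-missing entries in each column
print(m.sum(axis=0))     # column sums that ignore the missing entries
```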


In [23]:
_ = ok.grade('q03')  # check that your solution works. 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



4. **Write a function `column_averages`** that computes the average of each column, skipping missing data in each column. Read about this here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html . Use masking to skip missing data.

In [24]:
def column_averages(data):
    # fill in details...
    m = masked(data)
    return m.mean(axis=0)

In [25]:
data = np.array([[4, 5, -1, 2], [1, 2, 3, 1],
                 [6, -1, 9, 8], [4, 2, 3, 6], [2, 4, 6, 2]])
print("Before:")
print(data)
print("After:")
print(column_averages(data))

Before:
[[ 4  5 -1  2]
 [ 1  2  3  1]
 [ 6 -1  9  8]
 [ 4  2  3  6]
 [ 2  4  6  2]]
After:
[3.4 3.25 5.25 3.8]


In [26]:
_ = ok.grade('q04')  # check that your solution works. 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



5. (Advanced) **Write a function `default_missing`** that replaces missing data for a column with the mean of the non-missing data rows for that column. *This won't change the mean!*

   a. Create a masked array using your function `masked`.
    
   b. Use `mean` with `axis=0` to compute the column means of that masked array. These are the means of the non-missing data. Use `keepdims=True` to keep the result two-dimensional so that it broadcasts in the next step. 
   
   c. Use `np.select` to replace the -1s with averages. Read about this here:  https://docs.scipy.org/doc/numpy/reference/generated/numpy.select.html

In [27]:
def default_missing(data): 
    # Step a: hide the -1 entries using the `masked` function from exercise 3.
    m = masked(data)
    # Step b: column means of the non-missing data, kept 2-D for broadcasting.
    means = m.mean(axis=0, keepdims=True)
    # Step c: keep real values; where data is missing, take the column mean.
    result = np.select([data != -1, data == -1], [data, means])
    return result

In [28]:
# Test your code on this example
data = np.array([[4, 5, -1], [1, 2, 3], [7,-1,9], [-1, 2, -1], [2, 4, 6]])
print("Before:")
print(data)
print("After:")
print(default_missing(data))

Before:
[[ 4  5 -1]
 [ 1  2  3]
 [ 7 -1  9]
 [-1  2 -1]
 [ 2  4  6]]
After:
[[4.   5.   6.  ]
 [1.   2.   3.  ]
 [7.   3.25 9.  ]
 [3.5  2.   6.  ]
 [2.   4.   6.  ]]
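The claim that filling with the mean leaves the mean unchanged can be checked numerically: if `n` known values average to `m`, adding `k` more copies of `m` gives `(n*m + k*m) / (n + k) = m`. The sketch below is a self-contained check that uses `np.where` instead of `np.select` for brevity; `col_means` and `filled` are hypothetical names.

```python
import numpy as np

data = np.array([[4, 5, -1], [1, 2, 3], [7, -1, 9], [-1, 2, -1], [2, 4, 6]])

# Column means over the non-missing entries only.
m = np.ma.masked_where(data == -1, data)
col_means = m.mean(axis=0).filled(np.nan)   # convert to a plain float array

# Replace each -1 with the mean of its column.
filled = np.where(data == -1, col_means, data)

print(col_means)
print(filled.mean(axis=0))   # means of the filled table, for comparison
```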


In [29]:
_ = ok.grade('q05')  # check that your solution works. 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



# When you're done, submit the notebook

1. **Run all the cells in order.**

2. Submit the notebook by saving it as PDF. 
    * In the cluster environment, it's File | Print (Save as PDF) and submit to [Gradescope](https://www.gradescope.com/courses/182658)<sup>&dagger;</sup>, 
    * On other versions, it may be File | Download As (PDF) and then submit to [Gradescope](https://www.gradescope.com/courses/182658)<sup>&dagger;</sup>.

<sup>&dagger;</sup>To submit to Gradescope, log into the website, add course 9W7PW3 (if not already added) and submit. The assignment name should match the name of this notebook.