##### <img src="../SDSS-Logo.png" style="display:inline; width:500px" />


## Learning Objectives
- Use and manipulate 2-d NumPy arrays
- Understand how to summarize rows and columns using NumPy functions


# Multi-dimensional arrays

* The goal for today's class is to introduce N-dimensional arrays with an emphasis on two dimensional arrays and how to reference them.
* Two dimensional arrays are very common in data science.
* A spreadsheet can basically be considered a 2-d array.
* More generally, any table with rows and columns can be treated as a 2-d array.


# Motivation

* Arrays are a FUNDAMENTAL data type, and an n-D array can represent some complicated data sources. 
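* For example, imagine dividing up Jordan Lake into cubes, each of which is 1m X 1m X 1m. We can characterize the geometric location of each such cubic meter of water by its longitude, latitude, and depth. For each such cube we can measure the average temperature, turbidity, and salinity. The three location values pick out a cube and the three measurements describe it, so the data fits in a 4-D array (one axis each for longitude, latitude, and depth, plus one axis for the kind of measurement). Such an array is sometimes called a "cube", a "matrix", or a "hyper-cube".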

In [2]:
import numpy as np
import random
import comp116
import pickle

bp_reading1 =  np.array([149, 127, 128, 129, 125])
bp_reading2 = np.array([147, 129, 131, 127, 135])
bp_reading3 = np.array([143, 123, 125, 130, 133])
bp_reading4 = np.array([144, 129, 135, 131, 132])
bp_reading5 = np.array([151, 134, 130, 129, 127])
bp_reading6 = np.array([149, 127, 129, 131, 133])
bp_readings = np.concatenate(([bp_reading1], [bp_reading2], [bp_reading3],
                              [bp_reading4], [bp_reading5], [bp_reading6]), axis=0)
print(bp_readings)



[[149 127 128 129 125]
 [147 129 131 127 135]
 [143 123 125 130 133]
 [144 129 135 131 132]
 [151 134 130 129 127]
 [149 127 129 131 133]]


### Let's examine some blood pressure data

Let's look at variable `bp_reading1` which is five blood pressure readings of a patient on day 1.

`bp_reading2` is the five blood pressure readings of the same patient on day 2.
Repeating this, we have 6 days of data, up to `bp_reading6`.


**NOTE:** This data is totally fictitious. Also, a single blood pressure reading involves two numbers, the systolic and diastolic pressures. Here we are working with the systolic pressure only, so a single blood pressure reading is a single number.

In [4]:
# Let's assume we took five blood pressure readings 
# each time the patient came to the office
print('The five blood pressure readings of the '
      'patient from day 1 are:')
print(bp_reading1)

print('The five blood pressure readings of the '
      'patient from day 2 through 6 are:')
print(bp_reading2)
print(bp_reading3)
print(bp_reading4)
print(bp_reading5)
print(bp_reading6)

The five blood pressure readings of the patient from day 1 are:
[149 127 128 129 125]
The five blood pressure readings of the patient from day 2 through 6 are:
[147 129 131 127 135]
[143 123 125 130 133]
[144 129 135 131 132]
[151 134 130 129 127]
[149 127 129 131 133]


### Working with the blood pressure data

Let us say we want to compute the average blood pressure for day 1. You can use either the NumPy function `mean` or the NumPy function `average`.

In [5]:
# Set average_bp_reading1 to the average of the bp readings in bp_reading1

average_bp_reading1 = np.average(bp_reading1)


print("The average blood pressure on the first day is", average_bp_reading1)

The average blood pressure on the first day is 131.6
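
As a small check on the claim that `mean` and `average` are interchangeable here, the sketch below computes both on the day 1 readings. It also shows the one difference worth knowing: `np.average` accepts an optional `weights` argument. The weights used are made up purely for illustration and are not part of the lesson's data.

```python
import numpy as np

bp_reading1 = np.array([149, 127, 128, 129, 125])

# Without weights, np.mean and np.average give the same result
mean_value = np.mean(bp_reading1)
average_value = np.average(bp_reading1)
print(mean_value, average_value)

# np.average can also weight the readings; these weights are hypothetical
# (the last reading counts twice as much as the others)
example_weights = np.array([1, 1, 1, 1, 2])
weighted_value = np.average(bp_reading1, weights=example_weights)
print(weighted_value)
```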


## Average bp over multiple days

But what if we wanted to get the _average_ blood pressure reading over multiple days?

You can assume that each day has the same number of blood pressure readings.

In [6]:
overall_average_bp_reading1 = ( np.mean( bp_reading1) +
                                np.mean( bp_reading2) +
                                np.mean( bp_reading3) +
                                np.mean( bp_reading4) +
                                np.mean( bp_reading5) +
                                np.mean( bp_reading6)) / 6



# Another way to calculate the average bp over all days is to add the sum of the readings for each day and divide
# by the number of measurements. Assign this to overall_average_bp_reading2


overall_average_bp_reading2 = ((np.sum(bp_reading1) + np.sum(bp_reading2) +
                                np.sum(bp_reading3) + np.sum(bp_reading4) +
                                np.sum(bp_reading5) + np.sum(bp_reading6)) /
                               (len(bp_reading1) + len(bp_reading2) + len(bp_reading3) +
                                len(bp_reading4) + len(bp_reading5) + len(bp_reading6)))

# The two results agree except for floating-point rounding in the last digits
print("The averages over all six days calculated two ways are", overall_average_bp_reading1, overall_average_bp_reading2)




The averages over all six days calculated two ways are 133.06666666666663 133.06666666666666
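
The assumption that every day has the same number of readings matters. The following sketch uses two hypothetical days with different numbers of readings (the values and the names `day_a` and `day_b` are invented for this illustration) to show that averaging the daily averages is then no longer the same as averaging all readings together.

```python
import numpy as np

# Hypothetical days with different numbers of readings (made-up values)
day_a = np.array([150, 130])
day_b = np.array([120, 120, 120, 120])

# Average of the two daily averages: each DAY counts equally
mean_of_means = (np.mean(day_a) + np.mean(day_b)) / 2

# Sum of all readings divided by the number of readings: each READING counts equally
pooled_mean = (np.sum(day_a) + np.sum(day_b)) / (len(day_a) + len(day_b))

print(mean_of_means)
print(pooled_mean)
```

When the days all have the same length, as in `bp_reading1` through `bp_reading6`, the two calculations agree up to floating-point rounding.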


## Multi-dimensional array (nd-array)

Variable `bp_readings` is the two dimensional array built earlier by concatenating all the previous readings.


<br />
<br />
<br />

**NOTE:** `bp_readings` data is totally fictitious.  Dr. Majekis made it all up.
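
For reference, here is a sketch of three ways to stack one dimensional arrays into a two dimensional array. Only two days are used to keep it short, and the names `day1`, `day2`, `stacked`, `from_rows`, and `concatenated` are chosen just for this example.

```python
import numpy as np

day1 = np.array([149, 127, 128, 129, 125])
day2 = np.array([147, 129, 131, 127, 135])

# np.vstack stacks the 1-d arrays on top of each other as rows
stacked = np.vstack((day1, day2))

# np.array builds a 2-d array from a list of equal-length rows
from_rows = np.array([day1, day2])

# np.concatenate needs each day wrapped in a list so it has a row axis to join on
concatenated = np.concatenate(([day1], [day2]), axis=0)

print(stacked.shape)
print(np.array_equal(stacked, from_rows), np.array_equal(stacked, concatenated))
```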

In [7]:
# array_to_html helps to format the data in a table for display
comp116.array_to_html(bp_readings, row_names=['Day1', 'Day2', 'Day3', 'Day4', 'Day5', 'Day6'],
                     col_names=['Reading1', 'Reading2', 'Reading3', 'Reading4', 'Reading5'])

Unnamed: 0,Reading1,Reading2,Reading3,Reading4,Reading5
Day1,149,127,128,129,125
Day2,147,129,131,127,135
Day3,143,123,125,130,133
Day4,144,129,135,131,132
Day5,151,134,130,129,127
Day6,149,127,129,131,133


## Referencing individual elements in a two dimensional arrays

- We use <font color='green'>*colons*</font> to separate `start`:`stop`:`step` in any given dimension (or axis).
- We use a <font color='green'>*comma*</font> to separate references to the dimensions (or axes).
  - In a two-dimensional array, such as the table above, the comma separates the *row* from the *column*.

So if you use the index `[0, 4]` you get row with index 0 and column with index 4.
That element has the value of 125.

Print out the first day's fifth blood pressure reading.

In [9]:
comp116.array_to_html(bp_readings, row_names=['Day1', 'Day2', 'Day3', 'Day4', 'Day5', 'Day6'],
                     col_names=['Reading1', 'Reading2', 'Reading3', 'Reading4', 'Reading5'])


first_days_fifth_reading = bp_readings[0,4]

# First day's fifth reading is referenced
# using day offset zero and reading offset four
# separated by a comma
print("The first day's fifth reading is ", first_days_fifth_reading)

Unnamed: 0,Reading1,Reading2,Reading3,Reading4,Reading5
Day1,149,127,128,129,125
Day2,147,129,131,127,135
Day3,143,123,125,130,133
Day4,144,129,135,131,132
Day5,151,134,130,129,127
Day6,149,127,129,131,133


The first day's fifth reading is  125
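
The sketch below repeats the lookup on a self-contained copy of the data and compares the comma form `[0, 4]` with the chained form `[0][4]`. Both return the same element here; the comma form does the lookup in one step, while the chained form first pulls out row 0 and then indexes into that row.

```python
import numpy as np

bp_readings = np.array([[149, 127, 128, 129, 125],
                        [147, 129, 131, 127, 135],
                        [143, 123, 125, 130, 133],
                        [144, 129, 135, 131, 132],
                        [151, 134, 130, 129, 127],
                        [149, 127, 129, 131, 133]])

# Row offset 0, column offset 4, separated by a comma
with_comma = bp_readings[0, 4]

# Same element: take row 0 first, then element 4 of that row
chained = bp_readings[0][4]

print(with_comma, chained)
```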


## Referencing a single cell

Use the `array_to_html` output above to *visually* look up where the value 144 is within `bp_readings`.
How would you print out that cell?

On an exam or quiz, you will never look things up *visually*.
But for now, we'll do it visually.

### Find the value 144

Set variables `row_144` and `column_144` to the row and column offsets of the cell that has 144 in it.

Notice that the code references a single cell by using a comma to separate out the rows and columns.

In [12]:
comp116.array_to_html(bp_readings, row_names=['Day1', 'Day2', 'Day3', 'Day4', 'Day5', 'Day6'],
                     col_names=['Reading1', 'Reading2', 'Reading3', 'Reading4', 'Reading5'])


# np.any(..., axis=1) flags the rows that contain 144; np.any(..., axis=0) flags the columns.
# np.argmax then returns the offset of the first True in each.
row_144 = np.argmax( np.any(bp_readings == 144, axis=1) )
column_144 = np.argmax( np.any(bp_readings == 144, axis=0) )

print('bp_readings[', row_144, ',', column_144, '] has the value', bp_readings[row_144, column_144])

Unnamed: 0,Reading1,Reading2,Reading3,Reading4,Reading5
Day1,149,127,128,129,125
Day2,147,129,131,127,135
Day3,143,123,125,130,133
Day4,144,129,135,131,132
Day5,151,134,130,129,127
Day6,149,127,129,131,133


bp_readings[ 3 , 0 ] has the value 144


### Visually find the value 123

Set variables `row_123` and `column_123` to the row and column offsets of the cell that has 123 in it.

In [13]:
comp116.array_to_html(bp_readings, row_names=['Day1', 'Day2', 'Day3', 'Day4', 'Day5', 'Day6'],
                     col_names=['Reading1', 'Reading2', 'Reading3', 'Reading4', 'Reading5'])

row_123 = np.argmax( np.any(bp_readings == 123, axis=1) )
column_123 = np.argmax( np.any(bp_readings == 123, axis=0) )

print('bp_readings[', row_123, ',', column_123, '] has the value', bp_readings[row_123, column_123])

Unnamed: 0,Reading1,Reading2,Reading3,Reading4,Reading5
Day1,149,127,128,129,125
Day2,147,129,131,127,135
Day3,143,123,125,130,133
Day4,144,129,135,131,132
Day5,151,134,130,129,127
Day6,149,127,129,131,133


bp_readings[ 2 , 1 ] has the value 123
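
Looking up the row and the column separately with `np.any` and `np.argmax` finds the right cell only when the value appears exactly once. As a hedged alternative, `np.argwhere` returns every `[row, column]` pair where a condition is true, so it also works for a repeated value such as 129. The variable names below are chosen just for this sketch.

```python
import numpy as np

bp_readings = np.array([[149, 127, 128, 129, 125],
                        [147, 129, 131, 127, 135],
                        [143, 123, 125, 130, 133],
                        [144, 129, 135, 131, 132],
                        [151, 134, 130, 129, 127],
                        [149, 127, 129, 131, 133]])

# 123 appears once, so there is exactly one [row, column] pair
locations_123 = np.argwhere(bp_readings == 123)
print(locations_123)

# 129 appears several times, so there is one pair per occurrence
locations_129 = np.argwhere(bp_readings == 129)
print(locations_129)

# Searching rows and columns separately can land on the wrong cell for 129
row_first = np.argmax(np.any(bp_readings == 129, axis=1))
col_first = np.argmax(np.any(bp_readings == 129, axis=0))
print(bp_readings[row_first, col_first])
```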


## What is the length of a two dimensional array?

Is the length of a NumPy two-dimensional array the number of rows or the number of columns?

Well, let's look at the NumPy property **shape**.

**Note:** `shape` is not a function of the array.
Instead it's a new kind of thing called a property.
You reference a property by name, without parentheses.
For example, `bp_readings.shape`.

The shape is a tuple that gives the number of elements along each dimension.
In our case there are only two dimensions, so we'll call them rows and columns.

<br />
<br />
<br />

In [14]:
print('The shape of bp_readings is', bp_readings.shape)
print('There are', bp_readings.shape[0], 'rows and', bp_readings.shape[1], 'columns in bp_readings')
print('The len(bp_readings)=', len(bp_readings), 'which is the first element in the shape.')

The shape of bp_readings is (6, 5)
There are 6 rows and 5 columns in bp_readings
The len(bp_readings)= 6 which is the first element in the shape.
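
A short sketch of the related properties, using a placeholder array of zeros with the same shape as `bp_readings` (the placeholder and the names `n_rows` and `n_cols` are assumptions for this example). Because `shape` is a tuple, it can be unpacked directly into two variables.

```python
import numpy as np

# Placeholder with the same shape as bp_readings: 6 days, 5 readings per day
example = np.zeros((6, 5), dtype=int)

# Unpack the shape tuple into rows and columns
n_rows, n_cols = example.shape
print(n_rows, n_cols)

# ndim is the number of dimensions, size is the total number of elements
print(example.ndim, example.size)

# len() reports only the length of the first dimension
print(len(example))
```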


## What is the maximum row or column offset?

Given that `bp_readings` has a shape of (6, 5) what is the maximum row offset and maximum column offset that can
be used with `bp_readings`?

Set variable `max_row_offset` and `max_col_offset` to the largest number that can be 
used for an offset in `bp_readings`

In [20]:
comp116.array_to_html(bp_readings, row_names=['Day1', 'Day2', 'Day3', 'Day4', 'Day5', 'Day6'],
                     col_names=['Reading1', 'Reading2', 'Reading3', 'Reading4', 'Reading5'])
print('The shape of bp_readings is', bp_readings.shape)

max_row_offset  = bp_readings.shape[0] - 1
max_col_offset  = bp_readings.shape[1] - 1

print('The maximum row and column offset that will not give'
      ' an IndexError is', max_row_offset, 'and', max_col_offset,
      ', respectively.')

Unnamed: 0,Reading1,Reading2,Reading3,Reading4,Reading5
Day1,149,127,128,129,125
Day2,147,129,131,127,135
Day3,143,123,125,130,133
Day4,144,129,135,131,132
Day5,151,134,130,129,127
Day6,149,127,129,131,133


The shape of bp_readings is (6, 5)
The maximum row and column offset that will not give an IndexError is 5 and 4 , respectively.


## Index error

What happens if you reference a row with offset (index) 6 or a column with offset (index) 5?

In [21]:

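# Row offset 6 and column offset 5 are each one past the largest valid offset,
# so the first reference raises an IndexError and we never reach the second.
try:
    print(bp_readings[6, 0])
    print(bp_readings[0, 5])
except IndexError:
    print("Readings do not exist")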


Readings do not exist
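
To see the two out-of-range references fail separately, the sketch below wraps each one in its own `try` block on a self-contained copy of the data. It also shows that negative offsets are valid and count from the end. The helper name `describe_lookup` is made up for this example.

```python
import numpy as np

bp_readings = np.array([[149, 127, 128, 129, 125],
                        [147, 129, 131, 127, 135],
                        [143, 123, 125, 130, 133],
                        [144, 129, 135, 131, 132],
                        [151, 134, 130, 129, 127],
                        [149, 127, 129, 131, 133]])

def describe_lookup(row, col):
    """Return the value at [row, col], or the string 'IndexError' if out of range."""
    try:
        return bp_readings[row, col]
    except IndexError:
        return 'IndexError'

row_too_big = describe_lookup(6, 0)   # rows only go from 0 to 5
col_too_big = describe_lookup(0, 5)   # columns only go from 0 to 4
last_cell = describe_lookup(-1, -1)   # negative offsets count from the end

print(row_too_big, col_too_big, last_cell)
```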


## How to reference an entire row of a two dimensional array

Think of row zero as the first *group* of blood pressure readings.

- With one dimensional arrays, you specify `start:stop:step`, but any part you leave out takes a default.
  - `bp_reading1[:]`, `bp_reading1[0:]`, `bp_reading1[0::]`, and `bp_reading1[0:5:1]` all select the same five elements.
- What happens if you specify the row offset of a two dimensional array but no comma and no column offset?

<br />

For a two dimensional array, if you specify just the row offset **without a column offset**, you get that entire _row_.

For example, a row offset of zero with no column offset gives you the entire first row.

In [12]:
comp116.array_to_html(bp_readings, row_names=['Day1', 'Day2', 'Day3', 'Day4', 'Day5', 'Day6'],
                     col_names=['Reading1', 'Reading2', 'Reading3', 'Reading4', 'Reading5'])

print("The first day of blood pressure readings\n",
      "are bp_readings[0]=", bp_readings[0])
print()

print("Another way to get an entire row is to\n"
      "write bp_readings[0,:]", bp_readings[0, :])

print("Another way to get an entire row is to\n"
      "write bp_readings[0,0:]", bp_readings[0, 0:])


print("Another way to get an entire row is to\n"
      "write bp_readings[0,0::]", bp_readings[0, 0::])

Unnamed: 0,Reading1,Reading2,Reading3,Reading4,Reading5
Day1,149,127,128,129,125
Day2,147,129,131,127,135
Day3,143,123,125,130,133
Day4,144,129,135,131,132
Day5,151,134,130,129,127
Day6,149,127,129,131,133


The first day of blood pressure readings
 are bp_readings[0]= [149 127 128 129 125]

Another way to get an entire row is to
write bp_readings[0,:] [149 127 128 129 125]
Another way to get an entire row is to
write bp_readings[0,0:] [149 127 128 129 125]
Another way to get an entire row is to
write bp_readings[0,0::] [149 127 128 129 125]


## How do you specify the second row?

You want to specify a row offset of 1, **and**:
 - Don't specify the columns
 - Specify column `0:`
 - Specify column `:`
 - Specify column `::`

In [22]:
comp116.array_to_html(bp_readings, row_names=['Day1', 'Day2', 'Day3', 'Day4', 'Day5', 'Day6'],
                     col_names=['Reading1', 'Reading2', 'Reading3', 'Reading4', 'Reading5'])

print("The second day of blood pressure readings\n",
      "is bp_readings[1]=", bp_readings[1])
print()

print('This is the same as bp_readings[1,:]')
print(bp_readings[1, :])
print()

Unnamed: 0,Reading1,Reading2,Reading3,Reading4,Reading5
Day1,149,127,128,129,125
Day2,147,129,131,127,135
Day3,143,123,125,130,133
Day4,144,129,135,131,132
Day5,151,134,130,129,127
Day6,149,127,129,131,133


The second day of blood pressure readings
 is bp_readings[1]= [147 129 131 127 135]

This is the same as bp_readings[1,:]
[147 129 131 127 135]



## More ways of specifying data in a two dimensional array

Of course, there are many more powerful ways to specify data in a two-dimensional array.

The row specification can use `start:stop:step` notation.
The column specification can use `start:stop:step` notation too.


Later we'll find that booleans can be used for row or column specifications.

Knowing these notations will be key for understanding how to reference data.  


Knowing how to reference data will be key to doing well on the exam!
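
As a sketch of combining the two, the example below uses `start:stop:step` for the rows and for the columns in the same reference. The variable names are chosen just for this illustration.

```python
import numpy as np

bp_readings = np.array([[149, 127, 128, 129, 125],
                        [147, 129, 131, 127, 135],
                        [143, 123, 125, 130, 133],
                        [144, 129, 135, 131, 132],
                        [151, 134, 130, 129, 127],
                        [149, 127, 129, 131, 133]])

# Rows 1 up to (but not including) 4, and every other column starting at 0
days_2_to_4_odd_readings = bp_readings[1:4, ::2]
print(days_2_to_4_odd_readings)

# All rows in reverse order, first column only
first_reading_reversed = bp_readings[::-1, 0]
print(first_reading_reversed)
```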

### Multiple ways to specify the first row of a two-dimensional array

In [23]:
comp116.array_to_html(bp_readings)

print('Specify the row and take the default for the column, a colon, to specify bp_readings[0]=', bp_readings[0])
print()

print('Specify the row and use a single colon for the column to select every column, '
      'from column zero through the last column: bp_readings[0, :]=', bp_readings[0, :])
print()

print('bp_readings[0, 0:] =', bp_readings[0, 0:])
print('bp_readings[0, :5] =', bp_readings[0, :5])
print('bp_readings[0, 0:5]=', bp_readings[0, 0:5])

Unnamed: 0,0,1,2,3,4
0,149,127,128,129,125
1,147,129,131,127,135
2,143,123,125,130,133
3,144,129,135,131,132
4,151,134,130,129,127
5,149,127,129,131,133


Specify the row and take the default for the column, a colon, to specify bp_readings[0]= [149 127 128 129 125]

Specify the row and use a single colon for the column to select every column, from column zero through the last column: bp_readings[0, :]= [149 127 128 129 125]

bp_readings[0, 0:] = [149 127 128 129 125]
bp_readings[0, :5] = [149 127 128 129 125]
bp_readings[0, 0:5]= [149 127 128 129 125]


### Specifying the last row of a two dimensional array

Set variable `last_row_bp_readings` to the last row of `bp_readings`

In [24]:
comp116.array_to_html(bp_readings)



last_row_bp_readings = bp_readings[-1,:]


print('The last row of bp_readings is', last_row_bp_readings)

Unnamed: 0,0,1,2,3,4
0,149,127,128,129,125
1,147,129,131,127,135
2,143,123,125,130,133
3,144,129,135,131,132
4,151,134,130,129,127
5,149,127,129,131,133


The last row of bp_readings is [149 127 129 131 133]


### Specifying the last column of a two dimensional array

If you specify the row as `:` then you're essentially specifying all rows.

Set variable `last_column_bp_readings` to the last column of `bp_readings` by using a row offset of `:`


In [25]:
comp116.array_to_html(bp_readings)


last_column_bp_readings = bp_readings[:, -1]


print('The last column of bp_readings is', last_column_bp_readings)

Unnamed: 0,0,1,2,3,4
0,149,127,128,129,125
1,147,129,131,127,135
2,143,123,125,130,133
3,144,129,135,131,132
4,151,134,130,129,127
5,149,127,129,131,133


The last column of bp_readings is [125 135 133 132 127 133]


## Specifying a column of a two-dimensional array

Remember, the comma separates the specification of the first dimension, the rows, from the 
second dimension, the columns.

But the rows or columns can also use `start`:`stop`:`step` notation.
So specifying a `:` for either the row or the column means *all*.

Set variable `second_column_bp_readings` to the second column of `bp_readings`

In [26]:
comp116.array_to_html(bp_readings)
second_column_bp_readings = bp_readings[:, 1]

print('The second column of bp_readings is', second_column_bp_readings)

Unnamed: 0,0,1,2,3,4
0,149,127,128,129,125
1,147,129,131,127,135
2,143,123,125,130,133
3,144,129,135,131,132
4,151,134,130,129,127
5,149,127,129,131,133


The second column of bp_readings is [127 129 123 129 134 127]
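
One detail worth noticing: selecting a column with a single offset gives back a one dimensional array, not a tall 6 by 1 table. The sketch below compares that with a column slice, which keeps two dimensions. The variable names are chosen for this example.

```python
import numpy as np

bp_readings = np.array([[149, 127, 128, 129, 125],
                        [147, 129, 131, 127, 135],
                        [143, 123, 125, 130, 133],
                        [144, 129, 135, 131, 132],
                        [151, 134, 130, 129, 127],
                        [149, 127, 129, 131, 133]])

# A single column offset collapses the column axis: the result is 1-d
second_column_1d = bp_readings[:, 1]
print(second_column_1d.shape)

# A slice for the column keeps the column axis: the result is still 2-d
second_column_2d = bp_readings[:, 1:2]
print(second_column_2d.shape)
```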


# Taking averages

What if we wanted the average of a two dimensional array?

What would `np.average` return? Set `average_bp_readings` to be the average of all `bp_readings`.

In [27]:
comp116.array_to_html(bp_readings)
average_bp_readings = np.average(bp_readings)

print('The average of all bp_readings is', average_bp_readings)

Unnamed: 0,0,1,2,3,4
0,149,127,128,129,125
1,147,129,131,127,135
2,143,123,125,130,133
3,144,129,135,131,132
4,151,134,130,129,127
5,149,127,129,131,133


The average of all bp_readings is 133.06666666666666


## Taking average of the first readings

What if you wanted to take the average of the first readings (a column)?
How would you do it?

Many NumPy numerical functions take an `axis` parameter.


Since `bp_readings.shape` is `(6, 5)` and we want one average for each of the five readings,
we want to *collapse* the axis that has length 6 (the days).
That is axis 0, so we specify the average with `axis=0` to get five averages.

Set `bp_readings_averages_of_five_readings` to an array such
that `bp_readings_averages_of_five_readings[0]` is the average of all the first 
readings.
`bp_readings_averages_of_five_readings[1]` is the average of all the second readings.


In [28]:
comp116.array_to_html(bp_readings, row_names=['Day1', 'Day2', 'Day3', 'Day4', 'Day5', 'Day6'],
                     col_names=['Reading1', 'Reading2', 'Reading3', 'Reading4', 'Reading5'])
bp_readings_averages_of_five_readings = np.average(bp_readings, axis=0)

print('The averages of the first through fifth readings are', bp_readings_averages_of_five_readings)
print('The average of', bp_readings[:, 0], 'is', bp_readings_averages_of_five_readings[0])

Unnamed: 0,Reading1,Reading2,Reading3,Reading4,Reading5
Day1,149,127,128,129,125
Day2,147,129,131,127,135
Day3,143,123,125,130,133
Day4,144,129,135,131,132
Day5,151,134,130,129,127
Day6,149,127,129,131,133


The averages of the first through fifth readings are [147.16666667 128.16666667 129.66666667 129.5        130.83333333]
The average of [149 147 143 144 151 149] is 147.16666666666666
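
To make the effect of `axis` concrete, the sketch below computes the average along each axis and prints the shape of the result. The axis you name is the one that disappears. The variable names are chosen for this illustration.

```python
import numpy as np

bp_readings = np.array([[149, 127, 128, 129, 125],
                        [147, 129, 131, 127, 135],
                        [143, 123, 125, 130, 133],
                        [144, 129, 135, 131, 132],
                        [151, 134, 130, 129, 127],
                        [149, 127, 129, 131, 133]])

# axis=0 collapses the 6 days: one average per reading position (5 values)
average_per_reading = np.average(bp_readings, axis=0)
print(average_per_reading.shape)

# axis=1 collapses the 5 readings: one average per day (6 values)
average_per_day = np.average(bp_readings, axis=1)
print(average_per_day.shape)
print(average_per_day)

# No axis collapses everything to a single number
overall_average = np.average(bp_readings)
print(overall_average)
```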


In [20]:
np.diff( [0, 1, 2, 3])

array([1, 1, 1])

## Does each reading always decrease?

What if you wanted to know the difference going across a row?
Maybe each reading goes down as the patient relaxes in the office.

NumPy difference, [np.diff](https://docs.scipy.org/doc/numpy/reference/generated/numpy.diff.html) takes the difference 
of successive readings.

`np.diff( [0, 1, 2, 3] )` returns `np.array([1, 1, 1])`.  Notice that `np.diff` returns an array that has one less element in the direction of the difference.

Set variable `bp_readings_diff_across_rows` to the difference of each successive reading along each row of `bp_readings`.

In [21]:
comp116.array_to_html(bp_readings, row_names=['Day1', 'Day2', 'Day3', 'Day4', 'Day5', 'Day6'],
                     col_names=['Reading1', 'Reading2', 'Reading3', 'Reading4', 'Reading5'])
bp_readings_diff_across_rows = np.diff(bp_readings, axis=1)
#print(bp_readings_diff_across_rows)
for offset in range(len(bp_readings_diff_across_rows)):
    print('The successive increases of day', offset+1, 'is', bp_readings_diff_across_rows[offset])
print('It looks like the second reading of each day consistently goes down.')

Unnamed: 0,Reading1,Reading2,Reading3,Reading4,Reading5
Day1,149,127,128,129,125
Day2,147,129,131,127,135
Day3,143,123,125,130,133
Day4,144,129,135,131,132
Day5,151,134,130,129,127
Day6,149,127,129,131,133


The successive increases of day 1 is [-22   1   1  -4]
The successive increases of day 2 is [-18   2  -4   8]
The successive increases of day 3 is [-20   2   5   3]
The successive increases of day 4 is [-15   6  -4   1]
The successive increases of day 5 is [-17  -4  -1  -2]
The successive increases of day 6 is [-22   2   2   2]
It looks like the second reading of each day consistently goes down.
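
Rather than reading the differences by eye, the question in the heading can be answered with a comparison and `np.all`. This is only a sketch on a self-contained copy of the data, and the variable names are chosen for this example.

```python
import numpy as np

bp_readings = np.array([[149, 127, 128, 129, 125],
                        [147, 129, 131, 127, 135],
                        [143, 123, 125, 130, 133],
                        [144, 129, 135, 131, 132],
                        [151, 134, 130, 129, 127],
                        [149, 127, 129, 131, 133]])

# Differences between successive readings within each day: shape (6, 4)
diff_across_rows = np.diff(bp_readings, axis=1)
print(diff_across_rows.shape)

# Is the second reading lower than the first on every day?
second_always_lower = np.all(diff_across_rows[:, 0] < 0)
print(second_always_lower)

# For each day, does every reading decrease from the one before it?
day_always_decreases = np.all(diff_across_rows < 0, axis=1)
print(day_always_decreases)
```

With this data the second reading is lower than the first on all six days, but only day 5 decreases at every step.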
