# Numpy

Another one of the most popular and important
packages available for Python is `numpy`
("numerical python").  Specifically, it provides
numerical arrays and a variety of methods, functions,
and operations for working with those arrays.
While Python lists are very flexible, `numpy` arrays
are fundamental when working with mathematical
operations or larger numerical data sets.  For those
of you coming from MATLAB or R, numpy will provide
an environment very similar to those mathematical
and statistical programming languages.  Before
we dive in, as with other packages, the first step
is to import it.  For numpy, the standard is:
```
import numpy as np
```
Recall that `np` is then the short name we can use when
accessing functions, values, etc. in numpy.

In [6]:
import numpy as np

## Lists vs Numpy Arrays

The listing below compares python lists and `numpy` arrays.
While lists are sometimes beneficial, the efficiency
and convenience of `numpy` arrays makes them the best choice
much of the time.

* lists:
    * flexible
    * size can change (add/remove items)
    * can store different types
* numpy arrays:
    * efficient - often far, far faster than lists
    * fixed size
    * fixed datatype
    * convenient operations/methods/functions

You are likely wondering how `numpy` arrays and their 
functions/methods/operations can be that much faster
than working with lists.  Numpy operations and data structures
are actually implemented underneath in a lower-level programming
language that is much more efficient -- combining the
benefits of python with the speed of lower level languages.

### Efficiency Comparison
Let's start by taking an experimental look at this
efficiency difference.
Suppose we have two arrays/lists and we wish to multiply
corresponding elements.  We can do this with both
arrays and lists.  We'll use the `%%timeit` magic
command, which is an easy way to measure how much time it takes
to perform this action (recall:  this is a special command for the
Jupyter notebook).

#### List Way

We'll start by using the random module to create 2 lists:

In [51]:
import random
n = 1000
lst1 = random.choices([1,2,3,4,5],k=n)
lst2 = random.choices([1,2,3,4,5],k=n)

Now, we'll go ahead and loop through, multiply corresponding elements,
and append the result to a new list:

In [52]:
%%timeit
res = []
for i in range(n):
    res.append(lst1[i]*lst2[i])

110 µs ± 2.99 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Let's look at the same example as it would be implemented with numpy.


In [53]:
arr1 = np.random.randint(1,6,n)
arr2 = np.random.randint(1,6,n)

In [54]:
%%timeit
res = arr1*arr2

768 ns ± 12.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


Even with this small example, we can see that numpy takes
less than 1/100th of the time that lists do (768 ns versus 110 µs,
roughly 140 times faster) -- illustrating just how much more efficient numpy
is than simply using python lists.
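
The `%%timeit` magic only works inside a notebook.  As a rough sketch of the same comparison in a plain Python script, we can use the standard `timeit` module instead.  The helper function names and the number of repetitions below are just illustrative choices, and the actual timings will vary from machine to machine.

```python
import random
import timeit

import numpy as np

n = 1000
lst1 = random.choices([1, 2, 3, 4, 5], k=n)
lst2 = random.choices([1, 2, 3, 4, 5], k=n)
# Build the arrays from the same data so both approaches compute the same result
arr1 = np.array(lst1)
arr2 = np.array(lst2)


def multiply_lists():
    # loop through and multiply corresponding elements
    res = []
    for i in range(n):
        res.append(lst1[i] * lst2[i])
    return res


def multiply_arrays():
    # a single elementwise operation on the whole array
    return arr1 * arr2


runs = 1000  # repetitions per measurement (an arbitrary choice)
list_time = timeit.timeit(multiply_lists, number=runs) / runs
array_time = timeit.timeit(multiply_arrays, number=runs) / runs
print(f"lists:  {list_time * 1e6:.2f} microseconds per run")
print(f"arrays: {array_time * 1e6:.2f} microseconds per run")
```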

## Creating Numpy Arrays

There are a wide variety of ways to create arrays:

### From existing lists/tuples
We can use the `np.array(lst)` function to create
a numpy array from an existing list or tuple.


In [56]:
val_lst = [4,3,-1,8]
val_arr = np.array(val_lst)

print('val_list: ', val_lst)
print('val_arr: ', val_arr)
print(type(val_lst))
print(type(val_arr))

val_list:  [4, 3, -1, 8]
val_arr:  [ 4  3 -1  8]
<class 'list'>
<class 'numpy.ndarray'>


We could also do the same thing without explicitly 
creating the separate variable for the list:

In [12]:
val_arr = np.array([4,3,-1,8])
print(val_arr)

[ 4  3 -1  8]


It also works with lists of lists to create a multidimensional array.  Note, there's technically a thing called `matrix` in numpy, but you should **basically never** use it.  It doesn't play nicely with all parts of numpy and often causes unexpected results.  Instead, best practice is to create matrices as multidimensional arrays.

In [13]:
lst2d = [[1.1,4,8.3],[3,2.5,-1.8],[0,3.4,5]]
mat = np.array(lst2d)

print(mat)

[[ 1.1  4.   8.3]
 [ 3.   2.5 -1.8]
 [ 0.   3.4  5. ]]


### Filled with zeros/ones

Creating arrays from lists is common, but often one
wants to create an array filled with some default value.
Numpy has two very helpful functions to accomplish this:
* `np.ones(shape)`
* `np.zeros(shape)`

With 1D arrays, `shape` is just a single number.  For
multidimensional arrays, `shape` is a tuple representing
the size in each dimension.

#### Examples

In [14]:
print(np.zeros(5))
print(np.ones(7))
print('\n5x5 filled with 0s')
shape1 = (5,5)
print(np.zeros(shape1))
print('\n4x3x2 filled with 1s')
arr3d = np.ones((4,3,2))
print(arr3d)


[0. 0. 0. 0. 0.]
[1. 1. 1. 1. 1. 1. 1.]

5x5 filled with 0s
[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]

4x3x2 filled with 1s
[[[1. 1.]
  [1. 1.]
  [1. 1.]]

 [[1. 1.]
  [1. 1.]
  [1. 1.]]

 [[1. 1.]
  [1. 1.]
  [1. 1.]]

 [[1. 1.]
  [1. 1.]
  [1. 1.]]]


#### Identity Matrix
There's also a fairly common special matrix that you may want to create, the *identity matrix*.
This matrix has 1s on the diagonal and 0s everywhere else.  To create this, we
can use the `np.identity(n)` function, which returns an $n \times n$ identity matrix.

In [15]:
matid = np.identity(3)
print(matid)

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]


#### Zeros/Ones Resembling Existing Arrays
In addition to creating arrays with specified sizes
initialized to 0 or 1, it's also possible to create
an array filled with 0s or 1s based on the form of
an existing array with:
* `np.zeros_like(arr)`
* `np.ones_like(arr)`

For instance:

In [16]:
zarr3d = np.zeros_like(arr3d)
print(zarr3d)

[[[0. 0.]
  [0. 0.]
  [0. 0.]]

 [[0. 0.]
  [0. 0.]
  [0. 0.]]

 [[0. 0.]
  [0. 0.]
  [0. 0.]]

 [[0. 0.]
  [0. 0.]
  [0. 0.]]]
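
The same idea works with `np.ones_like`.  Here is a small sketch using a made-up integer array named `template` (not one of the arrays from above); note that the result copies the datatype of the existing array as well as its shape.

```python
import numpy as np

template = np.array([[1, 2, 3], [4, 5, 6]])  # hypothetical 2x3 integer array
ones = np.ones_like(template)

print(ones)
print(ones.shape)                    # (2, 3), same shape as template
print(ones.dtype == template.dtype)  # True, the datatype is copied too
```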


### Evenly spread values

We saw with lists that we could use the result of the
`range(start, stop, step)`
function to create a list with values that follow a
certain pattern -- integers that can be represented
by start, stop, and step.  Numpy has its own version of this
function (designed to be more efficient for creating arrays):
```
np.arange(start,stop,step)
```
where `start` (inclusive), `stop` (exclusive), and `step` work exactly as they worked with `range`.

This function creates an array directly.  It should not
be used with a non-integer step size: since computers
must round numbers, it suffers from accuracy issues with
non-integer step sizes (for example, the number of elements
in the result may not be what you expect).

For non-integer step sizes, the better option is to use
```
np.linspace(start,stop,num=50)
```
This function doesn't worry about the step directly -- it instead
determines the spacing from the length of the interval
and the number of points (`num`) specified.  Unlike `arange`, the
`stop` value is included by default.

### Examples

In [17]:
arr = np.arange(1,10,2)
print(arr)

arr2 = np.linspace(1,10,20)
print(arr2)

[1 3 5 7 9]
[ 1.          1.47368421  1.94736842  2.42105263  2.89473684  3.36842105
  3.84210526  4.31578947  4.78947368  5.26315789  5.73684211  6.21052632
  6.68421053  7.15789474  7.63157895  8.10526316  8.57894737  9.05263158
  9.52631579 10.        ]
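
To see why non-integer steps are risky with `arange`, here is a small sketch (the specific numbers are just an illustration).  Because of floating-point rounding, the number of elements `arange` produces with a float step can differ from what you would expect, and on typical machines this particular call tends to include the "exclusive" stop value.  With `linspace` we state the number of points ourselves, and the optional `retstep=True` argument reports the spacing it used.

```python
import numpy as np

# With a float step, the element count depends on floating-point rounding
stepped = np.arange(1, 1.3, 0.1)
print(stepped, len(stepped))

# linspace is told the number of points, so the count is never in doubt
spaced, step = np.linspace(1, 1.3, 4, retstep=True)
print(spaced, len(spaced))
print(step)   # the spacing between consecutive points (about 0.1)
```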


## Data Types
As we have seen, Python is dynamically typed:  types
are changed automatically as needed. Lists can hold
various types and the types in a list can change.
Numpy arrays, however, are homogeneous and statically typed, meaning:
1. the data type stored in the array cannot change
2. all of the elements in the array have the same datatype

This means that all of the arrays created above had an implicit
datatype chosen automatically by numpy.  Let's start by looking
at a couple of those data types.  To look at the datatype,
we access the `dtype` attribute of the array (an attribute is just
a value associated with the array).

In [18]:
print(arr.dtype)
print(arr2.dtype)

int64
float64


We can also explicitly indicate what datatype we want
when calling most of the array-creation functions, such
as `np.array`, `np.zeros`, and `np.ones`, by adding in the `dtype` argument.

In [19]:
val_lst = [1,4,6,8]
val_arr = np.array(val_lst)
val_arr2 = np.array(val_lst, dtype=float)
print(val_arr)
print(val_arr2)
print(val_arr.dtype)
print(val_arr2.dtype)

[1 4 6 8]
[1. 4. 6. 8.]
int64
float64


The datatype of an array resulting from some operation is
chosen automatically so that it can represent the result.  For example,
dividing two integer arrays with `/` produces a float array.
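
As a quick sketch of how result datatypes are chosen (the array names here are only illustrative), we can check the `dtype` of a few results directly:

```python
import numpy as np

ints = np.array([1, 2, 3])
floats = np.array([0.5, 1.5, 2.5])

print((ints + ints).dtype)    # integer + integer stays an integer type
print((ints + floats).dtype)  # mixing in floats gives float64
print((ints / ints).dtype)    # division with / gives float64
print((ints > 2).dtype)       # comparisons give bool
```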

While in many cases the operation will have the desired behavior, there are times where it is necessary to change the datatype of an existing array.  To do so, we can simply call
the `astype()` method of the array. 

In [20]:
arr = np.arange(8.0)
print(arr.dtype)
conv = arr.astype(int)
print(conv.dtype)

float64
int64
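
One thing to watch for when converting floats to integers is that the fractional part is simply dropped rather than rounded.  The sketch below (with made-up values) shows this, along with what the fixed datatype means in practice: a float assigned into an integer array is converted to an integer as well.

```python
import numpy as np

vals = np.array([1.7, -1.7, 2.2])
as_int = vals.astype(int)
print(as_int)   # [ 1 -1  2]  (fractional part dropped, not rounded)
print(vals)     # the original float array is unchanged

counts = np.array([1, 2, 3])
counts[0] = 9.9          # stored into an integer array, so it becomes 9
print(counts)            # [9 2 3]
```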


## Accessing Array Elements

Accessing array elements is similar to accessing items inside of a list:
* indexed starting at 0 at the start
* negative indices from the end
* access with the `[]` notation

So far, all of these should seem very similar to what we've already seen with lists.

In [21]:
val_arr = np.array([1,4,-13,-5,8,-9,24])

In [22]:
print(val_arr)
print(val_arr[0])
print(val_arr[3])
print(val_arr[-1])

[  1   4 -13  -5   8  -9  24]
1
-5
24


### Slicing

We've already seen the fundamental idea of slicing a few
times with lists, strings, and tuples.  We also
have slicing available for numpy arrays.  As with 
lists, these take the general form:
```
arr[start:stop:step]
```
where any of the three can optionally be empty and negative
values can be used to indicate negative indices or direction.
You can also leave off the `:step` part entirely if not needed.

In [23]:
val_arr = np.array([1,4,-13,-5,8,-9,24])
print(val_arr[1:-1:2])
print(val_arr[1:-1])

[ 4 -5 -9]
[  4 -13  -5   8  -9]
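
A few more slicing patterns on the same array, as a quick sketch of leaving parts empty and using a negative step:

```python
import numpy as np

val_arr = np.array([1, 4, -13, -5, 8, -9, 24])

first_three = val_arr[:3]     # empty start: begin at index 0
last_two = val_arr[-2:]       # empty stop: go through the end
every_other = val_arr[::2]    # empty start and stop with a step of 2
backwards = val_arr[::-1]     # negative step: reverse direction

print(first_three)   # [  1   4 -13]
print(last_two)      # [-9 24]
print(every_other)   # [  1 -13   8  24]
print(backwards)     # [ 24  -9   8  -5 -13   4   1]
```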


### Accessing Multi-dimensional Arrays

To access multi-dimensional arrays, we simply specify
the indices in a single `[]` separated by commas, e.g.,
```
arr[i,j]
```
For instance, consider the 2d array:

In [24]:
mat = np.array([[5,8,1,7],[3,2,4,2.5],[9,0,1,6]])
print(mat)

[[5.  8.  1.  7. ]
 [3.  2.  4.  2.5]
 [9.  0.  1.  6. ]]


To access the value 3 in the 2nd row, 1st column, we would do:

In [25]:
print(mat[1,0])

3.0


Note:  this is slightly different than how we accessed lists of
lists where we used a different set of `[]` for each level
of list.  It's a common mistake to mix these up, so watch out
for it when you get error messages.
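
To make the difference concrete, here is a small sketch comparing the two notations on the same data.  Using the comma notation on a list of lists raises a `TypeError`, which is the kind of error message to watch for.

```python
import numpy as np

lst2d = [[5, 8, 1, 7], [3, 2, 4, 2.5], [9, 0, 1, 6]]
mat = np.array(lst2d)

print(lst2d[1][0])   # list of lists: one [] per level
print(mat[1, 0])     # numpy array: indices separated by commas

try:
    lst2d[1, 0]      # the numpy-style index does not work on a list
except TypeError as err:
    print("TypeError:", err)
```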

One of the powerful things with numpy is the ability to take
slices of multi-dimensional arrays.  We can do this by just
replacing an index (like the 1 or 0 in `mat[1,0]`) with the slicing
notation.  So, suppose I want to extract the vector `[2,4]` from
within this array.  That is located in the row with index 1, columns
with index 1 and 2.  Let's look at how to slice to extract that portion:

In [26]:
print(mat[1,1:3])

[2. 4.]


To include everything in a particular dimension, we can use the `:`
with no start or stop, which says give me all of this dimension.
For instance, suppose we wish to access just
the 2nd column of our 2d array:

In [27]:
print(mat[:,1])

[8. 2. 0.]


If you have had linear algebra, you probably noticed
that the result is a 1d array.  Numpy doesn't really
care about row vs column vector.  The operations typically
figure out the correct way to behave based on the arguments
so the standard is to use 1d arrays for both column and row
vectors.  You can force it to be one or the other by making it
a 2d array where one axis has size 1, but this is rarely the
correct choice (and often will make things behave unexpectedly).
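
As a brief sketch of the difference, indexing a dimension with a single integer drops that dimension, while slicing it (even with a slice of length 1) keeps it.  The variable names are only illustrative.

```python
import numpy as np

mat = np.array([[5, 8, 1, 7], [3, 2, 4, 2.5], [9, 0, 1, 6]])

col_1d = mat[:, 1]      # integer index: that dimension is dropped
col_2d = mat[:, 1:2]    # slice of length 1: the dimension is kept

print(col_1d.shape)     # (3,)
print(col_2d.shape)     # (3, 1)
```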

This was shown here for a 2d array, but this same logic applies for
higher dimensional arrays -- you simply list the index
or slice for each dimension separated by commas.

## Array Attributes
Numpy arrays are technically *objects* that
have both *methods*, which we've seen before, and *attributes* associated
with them.  Attributes are just values associated
with that particular array object.

Numpy arrays have many attributes, but the most important
ones are:
* `dtype` - the data type of the array
* `shape` - the shape of the array as a tuple
* `ndim` - the number of dimensions of the array
* `size` - the total number of elements in the array
  (do not mix this up with `shape`)
  
Note that since attributes are not methods, we don't "call"
them.  This means that when we access an attribute, we do so
with
```
arrname.attributename
```
Notice the lack of `()` at the end, which are used to call a
function.

### Examples

In [28]:
mat = np.array([[5,8,1,7],[3,2,4,2.5],[9,0,1,6]])
print(mat)
print(mat.dtype)
print(mat.shape)
print(mat.ndim)
print(mat.size)

[[5.  8.  1.  7. ]
 [3.  2.  4.  2.5]
 [9.  0.  1.  6. ]]
float64
(3, 4)
2
12


## Operators, Functions, Methods

One of the nicest things about numpy is the extremely
large number of methods, functions, and built-in operations
that are available for arrays.  There are far more than we
can talk about here.  In general,
if you want to try to do something with a numpy array,
there's a good chance there is a function or an easy way to
accomplish it.  The best way is to look at the
[documentation](https://numpy.org/doc/1.19/reference/arrays.ndarray.html)
or Google what you'd like to accomplish.

The most important thing about numpy arrays is that
the vast majority of these functions/methods/operations
work on the entire array at once.  Most loops can be avoided
entirely with proper use of numpy array operations.
As illustrated earlier, these operations are also far, far
faster than the loop counterparts, so you should avoid
using loops as much as possible.

### Elementwise Operations
Many elementwise operations happen automatically
with arrays when using the standard operators.  These
include:
* addition: `+` operator
* subtraction: `-` operator
* multiplication: `*` operator
* division: `/` operator
* exponentiation: `**` operator
* comparisons:  `==`, `>`, `>=`, `<`, `<=` operators

These work with both 1d and higher dimensional arrays, simply
performing the operation with corresponding elements.

#### Examples 1D

In [29]:
arr1 = np.arange(1,5)
arr2 = np.array([4,2,1,4])
print(arr1)
print(arr2)

[1 2 3 4]
[4 2 1 4]


In [30]:
add = arr1+arr2
mul = arr1*arr2
div = arr1/arr2
sub = arr1-arr2
power = arr1**arr2  # each element of arr1 raised to the matching element of arr2
print('addition:')
print(add)
print('multiplication:')
print(mul)
print('division:')
print(div)
print('subtraction:')
print(sub)
print('power:')
print(power)

addition:
[5 4 4 8]
multiplication:
[ 4  4  3 16]
division:
[0.25 1.   3.   1.  ]
subtraction:
[-3  0  2  0]
power:
[  1   4   3 256]


In [31]:
print(arr1 == arr2)
print(arr1 < arr2)

[False  True False  True]
[ True False False False]


Note that the comparison operators return an array indicating
whether each of the individual elementwise comparisons
evaluated to `True` or `False`.

Above we performed these operations with two 1d arrays of the same
shape.  These could have just as easily been 2d arrays,
3d arrays, etc.  Numpy can also perform operations between arrays
and individual values, elementwise performing the operation
with each element of the array.

In [32]:
mat = np.array([[5,8,1],[3,2,4],[9,0,1]])
print(mat/2) # divides each element by 2
print(mat+1) # adds 1 to each element
print(mat**2) # squares each element

[[2.5 4.  0.5]
 [1.5 1.  2. ]
 [4.5 0.  0.5]]
[[ 6  9  2]
 [ 4  3  5]
 [10  1  2]]
[[25 64  1]
 [ 9  4 16]
 [81  0  1]]


### Checking Containment

Like with lists we can use `in` to see if a value is in
an array with
```
val in arrname
```
Similarly to lists, we would often embed this as the condition
in an `if` statement to change what code is executed.

In [33]:
print(9 in mat)
print(-9 in mat)

True
False
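
For instance, a minimal sketch of using `in` as the condition of an `if` statement (the `target` and `message` names are just placeholders):

```python
import numpy as np

mat = np.array([[5, 8, 1], [3, 2, 4], [9, 0, 1]])
target = 9  # hypothetical value to look for

if target in mat:
    message = f"{target} appears somewhere in mat"
else:
    message = f"{target} does not appear in mat"
print(message)
```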


### Functions
When working with numpy arrays, it's important to
use numpy functions (rather than math or built-in
functions).  The numpy functions are automatically
designed to work on the array as a whole, performing
the function on each element automatically, rather
than having to loop through the entire array.

#### Mathematical Functions
Pretty much all of the mathematical functions
exist in numpy as well including:

* $e^x$:  `exp(arr)`
* absolute value: `abs(arr)`
* logarithms:  `log(arr)`, `log2(arr)`, `log10(arr)` 
* trigonometric functions:  `sin(arr)`, `cos(arr)`, `tan(arr)`, etc.
* rounding:  `round(arr, ndecimals)`, `floor(arr)`, `ceil(arr)`

Because these are all functions in numpy, we must preface
them with `np.` when calling them.

##### Mathematical Function Examples

In [34]:
arr = np.arange(1,4)
arr2 = np.linspace(-2,2,10)
print(np.exp(arr))
print(np.log2(arr))
print(np.abs(arr2))
print(np.round(arr2,2))

[ 2.71828183  7.3890561  20.08553692]
[0.        1.        1.5849625]
[2.         1.55555556 1.11111111 0.66666667 0.22222222 0.22222222
 0.66666667 1.11111111 1.55555556 2.        ]
[-2.   -1.56 -1.11 -0.67 -0.22  0.22  0.67  1.11  1.56  2.  ]
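
To see why the numpy versions matter, here is a short sketch comparing `np.exp` with `math.exp`.  The `math` function expects a single number, so passing it a whole array raises a `TypeError`, and we would have to loop over the elements ourselves.

```python
import math

import numpy as np

arr = np.arange(1, 4)
print(np.exp(arr))      # applied to every element at once

try:
    math.exp(arr)       # math functions expect a single number
except TypeError as err:
    print("TypeError:", err)

# Without numpy we have to visit each element explicitly
looped = [math.exp(x) for x in arr]
print(looped)
```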


### Methods

In addition to the functions in numpy, arrays have a variety of
useful methods.  You can check out the full list of methods in the
[numpy array documentation](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html).
Some of the most common ones are:
* `arrname.reshape(newshape)` - returns a "view" of the data with a new shape whenever possible (a view does not
  actually copy the data, so be careful modifying values).
* `arrname.sort()` - sorts the array in place.  If you want a copy use `np.sort(arr)` instead
* `arrname.argsort()` - returns the ordering of indices that would sort the array
* `arrname.cumsum()` - computes cumulative sums
* `arrname.cumprod()` - computes cumulative products
* `arrname.flatten()` - returns copy of array flattened to one dimension
* `arrname.astype(dtype)` - returns copy of the array with specified type (shown previously)
* `arrname.copy()` - returns a copy of the array

#### Examples - Changing Shape

In [35]:
arr = np.arange(12)
mat = arr.reshape((4,3))
print(mat)
mat[0,0] = 100
print(mat)
print(arr)

[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
[[100   1   2]
 [  3   4   5]
 [  6   7   8]
 [  9  10  11]]
[100   1   2   3   4   5   6   7   8   9  10  11]


Notice that when we changed the element at 0,0 in `mat`
it changed the element in `arr` as well since `mat` was
not actually a copy of `arr`.
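
If you do want an independent array, one option is to add an explicit `.copy()`.  The sketch below uses `np.shares_memory` to check whether two arrays share the same underlying data; the variable names are only illustrative.

```python
import numpy as np

arr = np.arange(12)
view = arr.reshape((4, 3))                  # shares data with arr here
independent = arr.reshape((4, 3)).copy()    # explicit copy with its own data

independent[0, 0] = 100
print(arr[0])                               # 0, the original is untouched
print(np.shares_memory(arr, view))          # True
print(np.shares_memory(arr, independent))   # False
```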

In [36]:
mat = np.array([[2,1],[3,4]])
print(mat)
arr = mat.flatten()
print(arr)

[[2 1]
 [3 4]]
[2 1 3 4]


#### Examples - Sorting

In [37]:
arr = np.array([4,1,2,8])
print(arr.argsort())
arr2 = arr.copy()
arr2.sort()
print(arr2)

[1 2 0 3]
[1 2 4 8]
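
As a short sketch of how these fit together: `np.sort` leaves the original alone, `arr.sort()` changes the array itself, and the result of `argsort()` can be used as an index to put the array in sorted order.  Indexing an array with another array of indices has not been covered above, so treat that line as a preview.

```python
import numpy as np

arr = np.array([4, 1, 2, 8])

sorted_copy = np.sort(arr)   # returns a new sorted array
print(sorted_copy)           # [1 2 4 8]
print(arr)                   # [4 1 2 8]  (unchanged)

order = arr.argsort()        # indices that would sort arr
reordered = arr[order]       # indexing with them gives the sorted values
print(reordered)             # [1 2 4 8]

arr.sort()                   # sorts arr itself (nothing is returned)
print(arr)                   # [1 2 4 8]
```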


#### Example - Cumulative Sum

In [38]:
print(arr.cumsum())

[ 4  5  7 15]


Many of these (`sort()`, `argsort()`, `cumsum()`) have the ability to specify the axis on
which you want to perform the operation.  This is helpful for doing things like sorting each
row or column independently.  For example, with `axis=0`, `sort` will sort each column separately
(while `axis=1` sorts each row separately).

In [39]:
mat = np.array([[3,6,1],[2,-1,9],[8,4,3]])
print('unsorted')
print(mat)
mat.sort(axis=0)
print('sorted down each column')
print(mat)

unsorted
[[ 3  6  1]
 [ 2 -1  9]
 [ 8  4  3]]
sorted down each column
[[ 2 -1  1]
 [ 3  4  3]
 [ 8  6  9]]
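
For comparison, here is a sketch of the same matrix handled along the other axis.  It uses `np.sort` so that the original matrix is left unchanged, and the result names are only illustrative.

```python
import numpy as np

mat = np.array([[3, 6, 1], [2, -1, 9], [8, 4, 3]])

rows_sorted = np.sort(mat, axis=1)   # each row sorted left to right
cols_sorted = np.sort(mat, axis=0)   # each column sorted top to bottom
running = mat.cumsum(axis=0)         # running totals down each column

print(rows_sorted)
print(cols_sorted)
print(running)
```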


## Reduction Operations

The previous operations all returned arrays of the same shape as the input.  However, there are also many *reduction* operations, which take a higher dimensional array and produce a lower dimensional result (reducing the dimension).  These can be used to compute things such as the maximum entry in an array, the minimum entry in an array, the sum of all entries in an array, the median of the
items in an array, etc.

There are many reduction operations, some of which are methods and others which are
functions.  

### Methods for Reduction
* `arrname.min()`
* `arrname.max()`
* `arrname.sum()`
* `arrname.mean()`
* `arrname.std()` - standard deviation
* `arrname.argmin()` - the index of the minimum value
* `arrname.argmax()` - the index of the maximum value

In [40]:
mat = np.array([5, 7, -8, 1, 3, -4, 12, 0, 1, 0, 2, 5]).reshape(4,3)
print(mat)
print(mat.min())
print(mat.max())
print(mat.sum())
print(mat.mean())
print(mat.flatten().argmin())

[[ 5  7 -8]
 [ 1  3 -4]
 [12  0  1]
 [ 0  2  5]]
-8
12
24
2.0
2
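
Note that `argmin()` on a 2d array gives the index into the flattened array.  If you want the row and column instead, one option is `np.unravel_index`, sketched below on the same matrix:

```python
import numpy as np

mat = np.array([5, 7, -8, 1, 3, -4, 12, 0, 1, 0, 2, 5]).reshape(4, 3)

flat_index = mat.argmin()    # index into the flattened array
row, col = np.unravel_index(flat_index, mat.shape)

print(flat_index)            # 2
print(row, col)              # 0 2
print(mat[row, col])         # -8
```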


### Functions for Reduction

Those were all methods, meaning we called them by
```
arrname.methodname()
```
There are additional reduction operations that are
functions in numpy rather than methods for the particular
array.  Some of the most frequently used ones are:
* `count_nonzero(arr)` - counts number of nonzero
  elements.  If argument evaluates to a boolean array,
  counts the number of `True`.
* `any(arr)` - returns `True` if any element of `arr` is `True` (also available as an array method)
* `all(arr)` - returns `True` if all elements of `arr` are `True` (also available as an array method)
* `median(arr)` - returns median of all elements in `arr`

In [41]:
print(np.count_nonzero(mat))
print(np.median(mat))
tfarr = np.array([True, False, False])
print(np.any(tfarr))
print(np.all(tfarr))

10
1.5
True
False


It's very common to combine `any()`, `all()`, and `count_nonzero()` with a logical comparison.
For instance, I could use this to check if my array has at least one element greater than
some number, or count the number of elements greater than some number:

In [42]:
arr = np.array([5,2,1,8,4,9])
print(np.any(arr > 6))
print(np.count_nonzero(arr > 6))


True
2


### Reducing over Certain Axis

All of these reduction operations are actually
far more powerful than that simple example.  Each reduction operation
takes an `axis` argument that indicates which axis we want to
reduce over.  The dimension represented by this axis will be removed during
the course of the reduction operation. 

### Examples

In [43]:
print(mat.min(axis=1)) # one min for each row
print(mat.min(axis=0)) # one min for each col

[-8 -4  0  0]
[ 0  0 -8]
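
A quick way to keep the axes straight is to look at the shape of the result: the axis you reduce over disappears.  A small sketch with `sum` on the same matrix:

```python
import numpy as np

mat = np.array([5, 7, -8, 1, 3, -4, 12, 0, 1, 0, 2, 5]).reshape(4, 3)

col_sums = mat.sum(axis=0)   # reduce over rows: one sum per column
row_sums = mat.sum(axis=1)   # reduce over columns: one sum per row

print(mat.shape)             # (4, 3)
print(col_sums.shape)        # (3,)
print(row_sums.shape)        # (4,)
print(col_sums)              # [18 12 -6]
print(row_sums)              # [ 4  0 13  7]
```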


## Looping Through Arrays
While it's best to avoid using loops as much as possible with numpy arrays,
by using elementwise operations, reduction operations, and careful
use of the `axis` argument,
it is possible to loop through arrays.  Typically when looping through numpy
arrays we loop through the indices with something like:

```
for i in range(arr.shape[0]):
    # do something
```
where we can access whatever dimension in `shape` that is appropriate for
our loop, e.g. you may want `shape[1]` instead of `shape[0]`.
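
As a sketch of what this looks like in practice, here is a loop over the rows of a small matrix next to the equivalent reduction.  The example matrix and names are made up for illustration.

```python
import numpy as np

mat = np.arange(12).reshape((4, 3))

# Explicit loop over the rows (dimension 0)
row_totals = []
for i in range(mat.shape[0]):
    row_totals.append(int(mat[i, :].sum()))
print(row_totals)          # [3, 12, 21, 30]

# The same result without a loop
print(mat.sum(axis=1))     # [ 3 12 21 30]
```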

## Constants
Much like the math module had many mathematical constants,
numpy also contains mathematical constants.  Several of these
are the same as the ones in the math module, but are
included as part of numpy so you don't need to `import math`
just for these constants.

The primary constants in numpy are:
* np.e
* np.pi
* np.inf
* np.nan

The last 2 (`inf` and `nan`) represent
infinity and "not a number", respectively.

In [44]:
print(np.inf*3)

inf
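
A few more behaviors of these two values, as a brief sketch.  The most surprising one is that `nan` is not equal to anything, including itself, so checking for it requires a function such as `np.isnan`.

```python
import numpy as np

print(np.inf > 1e308)        # True: inf is larger than any finite float
print(np.inf - np.inf)       # nan: the result is undefined

print(np.nan == np.nan)      # False: nan is not equal even to itself
data = np.array([1.0, np.nan, 3.0])   # hypothetical data with a gap
is_missing = np.isnan(data)
print(is_missing)            # [False  True False]
```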


## Files
We have already seen a couple options for reading
and writing to files.  Typically these end up with either
lists of lists or simply iterating through each row and breaking
that row down into a list.  Sometimes, however, if we have numerical data
we may know that we want it in a numpy array to perform some operations.
While we could go through the whole process of reading it in and creating
a list or filling a numpy array, this is tedious and time consuming.

If the data is numerical and in some well-formatted, delimited file (such
as tab separated or comma separated), then we can read the data directly into
a numpy array using numpy file input/output functions.  These
only work if the data that needs to go into the array are numerical values (as
numpy is not designed to work with strings).

There are two primary functions for reading in files with numpy:
* `np.loadtxt(fname, dtype='float', delimiter=None, skiprows=0, usecols=None)`
  (this shows the default values for several of the keyword arguments)
  * `delimiter`: the character that separates columns, default is whitespace
  * `skiprows`:  how many rows to skip, usually for a header row, default is not
    to skip any rows
  * `usecols`:  which columns to use, default is to use all columns.  To specify more
    than one column, give tuple of which column indices (0-based indexing) that you
    want to keep
* `np.genfromtxt(fname, dtype='float', delimiter=None, skip_header=0, usecols=None, missing_values=None, filling_values=None)`
  * very similar to `loadtxt`, but better equipped to handle missing values (note that it uses
    `skip_header` rather than `skiprows` to skip header rows)
  * `missing_values`:  what represents a missing value (depends on your dataset,
    sometimes it's -1, sometimes it's 9999, sometimes it's '?')
  * `filling_values`: what value to insert instead of any missing values

There's also one common function for writing out a numpy array to a file:
* `np.savetxt(fname, arr, fmt='%.18e', delimiter=' ', header='')`
   * `fname`: the filename in which to save the array
   * `arr`: the array (1d or multidimensional) to save
   * `fmt`: format specifier for how to output values
   * `delimiter`: what to separate columns with when outputting (default is a space)

We'll illustrate one of these using the
[Pima Indians Diabetes Dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database)
which can be found in `'diabetes.csv'`.  This dataset is all numerical,
other than the first row, which is a header row.  Missing data does not appear
to be a problem, so we'll use `loadtxt`.

In [45]:
data = np.loadtxt('data/diabetes.csv', skiprows=1, delimiter=',')
print(data.shape)
print(data[:10,:])

(768, 9)
[[6.000e+00 1.480e+02 7.200e+01 3.500e+01 0.000e+00 3.360e+01 6.270e-01
  5.000e+01 1.000e+00]
 [1.000e+00 8.500e+01 6.600e+01 2.900e+01 0.000e+00 2.660e+01 3.510e-01
  3.100e+01 0.000e+00]
 [8.000e+00 1.830e+02 6.400e+01 0.000e+00 0.000e+00 2.330e+01 6.720e-01
  3.200e+01 1.000e+00]
 [1.000e+00 8.900e+01 6.600e+01 2.300e+01 9.400e+01 2.810e+01 1.670e-01
  2.100e+01 0.000e+00]
 [0.000e+00 1.370e+02 4.000e+01 3.500e+01 1.680e+02 4.310e+01 2.288e+00
  3.300e+01 1.000e+00]
 [5.000e+00 1.160e+02 7.400e+01 0.000e+00 0.000e+00 2.560e+01 2.010e-01
  3.000e+01 0.000e+00]
 [3.000e+00 7.800e+01 5.000e+01 3.200e+01 8.800e+01 3.100e+01 2.480e-01
  2.600e+01 1.000e+00]
 [1.000e+01 1.150e+02 0.000e+00 0.000e+00 0.000e+00 3.530e+01 1.340e-01
  2.900e+01 0.000e+00]
 [2.000e+00 1.970e+02 7.000e+01 4.500e+01 5.430e+02 3.050e+01 1.580e-01
  5.300e+01 1.000e+00]
 [8.000e+00 1.250e+02 9.600e+01 0.000e+00 0.000e+00 0.000e+00 2.320e-01
  5.400e+01 1.000e+00]]
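
The other two functions can be sketched on a tiny made-up table, so that the example does not depend on any particular dataset.  The file name `example.csv`, the values, and the choice of `-1` as the fill value are all just assumptions for illustration.

```python
import io
import os
import tempfile

import numpy as np

table = np.array([[1.0, 2.5], [3.0, 4.5], [5.0, 6.5]])

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.csv")   # hypothetical file name
    np.savetxt(path, table, fmt="%.2f", delimiter=",", header="a,b")
    # savetxt writes the header as a '#' comment line, which loadtxt skips
    loaded = np.loadtxt(path, delimiter=",")
print(loaded)

# genfromtxt can fill in missing entries (here, an empty field in row 2)
text = io.StringIO("a,b,c\n1,2,3\n4,,6\n")
filled = np.genfromtxt(text, delimiter=",", skip_header=1, filling_values=-1)
print(filled)
```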


## Linear Algebra

One of the areas where numpy really shines is linear algebra.
Numpy has a *subpackage* called `linalg`, that is automatically
loaded when numpy is imported, which contains a large number
of extremely efficient linear algebra operations.

Big picture, the array method `arr1.dot(arr2)`, or alternatively
the numpy function `np.dot(arr1, arr2)`, is extremely helpful.
What it computes depends on whether the arrays are 1D or 2D:
* if both `arr1` and `arr2` are 1D, then it performs an inner product of the vectors (remember -- we don't care about and typically
  shouldn't explicitly use column and row vectors in numpy, instead
  it's best to use 1D arrays)
* if `arr1` is 2D and `arr2` is 1D, then it performs a matrix-vector
  product
* if `arr1` and `arr2` are both 2D arrays, then it performs
  matrix-matrix multiplication.

#### Example - Inner Product

In [46]:
arr1 = np.arange(4)
print('----- arr1 -----')
print(arr1)
arr2 = np.arange(-2,2)
print('----- arr2 -----')
print(arr2)

print('----- arr1.dot(arr2) -----')
result = arr1.dot(arr2)
print(result)

----- arr1 -----
[0 1 2 3]
----- arr2 -----
[-2 -1  0  1]
----- arr1.dot(arr2) -----
2


#### Example - Matrix-vector product

In [47]:
A = np.arange(8).reshape(4,2)
print('----- A -----')
print(A)
x = np.array([1,2])
print('----- x -----')
print(x)

print('----- A.dot(x) -----')
result = A.dot(x)
print(result)

----- A -----
[[0 1]
 [2 3]
 [4 5]
 [6 7]]
----- x -----
[1 2]
----- A.dot(x) -----
[ 2  8 14 20]


#### Example - Matrix-matrix product

In [48]:
A = np.array([[3,4,2],[-1,2,2]])
print('----- A -----')
print(A)
B = np.array([[2,1],[1,1],[3,-1]])
print('----- B -----')
print(B)

print('----- A.dot(B) -----')
result = A.dot(B)
print(result)

----- A -----
[[ 3  4  2]
 [-1  2  2]]
----- B -----
[[ 2  1]
 [ 1  1]
 [ 3 -1]]
----- A.dot(B) -----
[[16  5]
 [ 6 -1]]


Since matrix-matrix multiplication is such a common task, it has its own
operator, the `@` symbol.

In [49]:
print(A@B)

[[16  5]
 [ 6 -1]]
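
It's easy to mix up `@` with `*`, which is the elementwise product.  A small sketch with two made-up square matrices shows the difference:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])

elementwise = A * B   # multiplies corresponding entries
matmul = A @ B        # matrix-matrix multiplication

print(elementwise)    # [[0 2]
                      #  [3 0]]
print(matmul)         # [[2 1]
                      #  [4 3]]
```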


Beyond just the vector-vector, matrix-vector, and matrix-matrix operations, `numpy.linalg` has quite a few other functions for
tasks such as:
* outer products
* computing decompositions
* computing eigenvalues
* norms, determinant, trace, rank, etc.
* solving linear systems and inverting matrices

If you are
familiar with linear algebra, I strongly recommend glancing
through the full list of available functions in the 
[documentation for numpy.linalg](https://numpy.org/doc/stable/reference/routines.linalg.html#matrix-and-vector-products).
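
As a brief taste, here is a sketch using a few of those functions (`solve`, `det`, `norm`, and `inv`) on a small made-up system.  The results are floating-point values, so they are only approximately equal to the exact answers.

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])

x = np.linalg.solve(A, b)          # solves the linear system A x = b
print(x)                           # approximately [0.8 1.4]
print(np.allclose(A @ x, b))       # True: x really does solve the system

det = np.linalg.det(A)             # determinant, approximately 5
length = np.linalg.norm(np.array([3.0, 4.0]))   # Euclidean norm, 5.0
A_inv = np.linalg.inv(A)           # matrix inverse
print(det, length)
print(A_inv @ A)                   # approximately the identity matrix
```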

## Broadcasting

The final major aspect of numpy arrays, and in particular understanding how they behave,
is *broadcasting*, which describes how numpy arrays with different sizes are handled during operations.
Numpy does not require both arrays in an elementwise operation to be the same size; instead, it will automatically (and conceptually) extend the smaller array to be compatible.  For instance, we've already seen one example where we can add a scalar to a vector and it adds the scalar to each element of the vector.  Essentially, you can think of it as stretching the scalar value to match the shape of the vector, without ever having to explicitly form the array that comes from stretching the scalar value.

Numpy's broadcasting is significantly more powerful than just performing an operation with a scalar and a vector.  The same idea applies to multi-dimensional arrays.  Numpy works from the end dimension to the start dimension checking if the dimensions are compatible for each of the arrays involved in the operation.  Dimensions are compatible if either:
* they are equal
* one of them is 1

Let's take a brief look at an example:


In [50]:
mat = np.arange(12).reshape((4,3))
print('----- mat -----')
print(mat)
vec = np.array([3,0,1])
print('----- vec -----')
print(vec)

print('----- mat+vec -----')
print(mat+vec)

----- mat -----
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
----- vec -----
[3 0 1]
----- mat+vec -----
[[ 3  1  3]
 [ 6  4  6]
 [ 9  7  9]
 [12 10 12]]
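
The compatibility rule can also fail.  As a sketch, suppose we instead have one value per row (the `per_row` array below is made up for illustration).  Its shape `(4,)` does not line up with the last dimension of the `(4, 3)` matrix, so numpy raises a `ValueError`; reshaping it to `(4, 1)` makes the dimensions compatible.

```python
import numpy as np

mat = np.arange(12).reshape((4, 3))
per_row = np.array([10, 20, 30, 40])   # hypothetical: one value per row

# Shapes (4, 3) and (4,): the last dimensions are 3 and 4, so this fails
try:
    mat + per_row
except ValueError as err:
    print("ValueError:", err)

# With shape (4, 1) the last dimensions are 3 and 1, which are compatible
result = mat + per_row.reshape((4, 1))
print(result)
```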


This can take quite a while to wrap your head around, but it also has the ability to be extremely
powerful.  For now, just know that the idea exists and deal with each instance as it comes up.