# Numpy

Readings: 

[Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) Chapter 2.

https://leewtai.github.io/courses/stat_computing/lectures/python_mds/numpy.html


## Introduction

### Important Python Libraries for Data Science


Libraries are an important part of how Python works: each application area has its own libraries.

| Package | Description | Logo |
|:---:|:---|:---:|
| Numpy | Numerical arrays | <img src="https://numpy.org/doc/stable/_static/numpylogo.svg" width="300"/> |
| Scipy | Scientific Python<br>User-friendly and efficient numerical routines:<br> numerical integration, interpolation, optimization, linear algebra, and statistics | <img src="https://scipy.org/images/logo.svg" width="200"/> |
| Matplotlib | Plotting | <img src="https://matplotlib.org/stable/_static/logo_light.svg" width="300"/>|
| Pandas | Data analytics <br> R-like data frames | <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/1024px-Pandas_logo.svg.png" width="300"/> |
| Scikit-learn | Machine learning | <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/05/Scikit_learn_logo_small.svg/2880px-Scikit_learn_logo_small.svg.png" width="300"/> | 

All the packages above are included in the Anaconda distribution.



Next, we will explore two key libraries used not only in data science but in any quantitative project written in Python. NumPy is a library for optimized array and matrix operations, while Matplotlib lets you create virtually any visualization (2-D plots, scatter plots, bar plots, etc.).




In standard Python, the basic data collection is the `list`. 

| Advantages | Disadvantages (for scientists) |
|:---|:---|
|- Can contain different types of objects<br>- Easy insertion<br>- Easy concatenation |- Sum of two lists is concatenation, not vector addition<br>- Slow for large lists<br>- No built-in numerical methods (mean, variance, etc.)|

Scientists would like an object closer to the notion of a vector or matrix. NumPy implements that.
Thus, nearly all other scientific libraries in Python are based on NumPy (together they are dubbed the "NumPy ecosystem").


NumPy is a library for large multidimensional arrays and matrices, and math operations with them. 



![](https://jakevdp.github.io/PythonDataScienceHandbook/figures/array_vs_list.png)


To use NumPy in your code, import it as shown in the next cell:

In [None]:
import numpy

In [None]:
def add_lists(L1, L2):
    return [L1[i] + L2[i] for i in range(len(L1))]

In [None]:
L1 = list(range(100))
L2 = list(range(100))

%timeit add_lists(L1, L2)

In [None]:
import numpy as np

a1 = np.array(L1) # this is how to create an array in NumPy.
a2 = np.array(L2)
%timeit a1 + a2

### "Should I use `numpy` arrays?"

If you will operate on one or more large sets of numbers and are considering writing `for`-loops, stop. Ask yourself whether you can achieve your task with `numpy` arrays instead. In most cases, `numpy`-based code is faster to write and faster to execute.



### "Should I use `for`-loops?"

When working with lists and other basic data structures, yes, absolutely! When working with `numpy` arrays, however, writing `for`-loops is almost always the wrong thing to do. Find a way using vectorized `numpy` code.
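As a small illustration of the advice above, here is the same computation (a sum of squares) written once with a `for`-loop and once in vectorized form. The array `values` and both variable names are made up for this sketch.

```python
import numpy as np

values = np.arange(1_000)   # hypothetical data: 0, 1, ..., 999

# Loop version: accumulate one element at a time in Python.
total_loop = 0
for v in values:
    total_loop += int(v) ** 2

# Vectorized version: square every element at once, then aggregate.
total_vec = int(np.sum(values ** 2))

print(total_loop == total_vec)
```

Both versions give the same number; the vectorized one is a single line and leaves the looping to NumPy's compiled code.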

### Creating an array
To create a NumPy array, use `variable = numpy.array(Array_Like)`. 

In [None]:
a = [1,2,3,4,5]
print( type(a) )   # displays the type of the variable a

b = numpy.array( [1,2,3,4,5] )
print( type(b) )   # displays the type of the variable b

We are going to use `numpy` a lot, and five letters is a lot to type each time. It is extremely common to import it under the shorter name `np` (just a nickname).

In [None]:
import numpy as np  # allows user to use np instead of numpy

c = np.array( [5,4,3,2,1] )
print( type(c) )    # displays the type of the variable c

## Vectorized operations, or UFuncs
NumPy provides a simple way to apply a function to every element of an array while avoiding the slow Python `for` loop: the functions are precompiled!

In [None]:
print("Sum of lists: ", a + a)

In [None]:
print("Sum of NumPy arrays: ", b + c )

In [None]:
print("Original array: ", b)

In [None]:
print("Adding a constant: ", b + 7.1 )

In [None]:
print("Multiplying a constant: ", b*2.5 )

In [None]:
print("Exponentiation: ", b ** 3 )

In [None]:
print("Element-wise product: ", b * c )
print("Element-wise division: ", b / c )
print("Element-wise modulo: ", b % c )

In [None]:
print("Elementwise exponential: ", np.exp(b) )
print("Elementwise sine: ", np.sin(b) )
print("Elementwise cosine: ", np.cos(b) )

### Indexing
Selecting elements and sets of elements from your numpy array (same as in list):

In [None]:
print("First element in b: ", b[0] )

In [None]:
print("Second element in b: ", b[1] )

In [None]:
print("Last element in b: ", b[-1] )

In [None]:
print("All elements in b: ", b[:] ) 

In [None]:
print("Elements 1, 2, and 3: ", b[1:4] ) # last index is excluded!

In [None]:
print("From 0 to end in steps of 2, except for the last one: ", b[0:-1:2] )

In [None]:
print("From 0 to end in steps of 2: ", b[0::2] )
print("Reverse b: ", b[::-1] )


### Attributes and Methods

NumPy arrays are objects: they have __attributes__ and __methods__.
- __Attributes__: Some properties like shape, length, number of dimensions, etc. __syntax: `array.attribute`__
- __Methods__: "Actions" (or functions) applied to or performed on the object. __syntax: `array.method(arg1, arg2, etc.)`__




In [None]:
b.size # attribute: length, number of elements

In [None]:
b.ndim # attribute: number of dimensions 

In [None]:
b.shape # attribute: tuple for size of each dimension

In [None]:
b.dtype # attribute: data type

NumPy predefines many dtypes; `float64`, `float32`, `int64`, `int32`, and `bool` are the ones you will use most often. 

In [None]:
(b > 5).dtype

In [None]:
np.sin(b).dtype
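The dtype can also be chosen explicitly rather than inferred. The sketch below uses the `dtype` argument of `np.array` and the `astype` method; the variable names are only for illustration.

```python
import numpy as np

ints = np.array([1, 2, 3])                       # integer dtype inferred from the values
floats = np.array([1, 2, 3], dtype=np.float64)   # dtype requested explicitly
as_float32 = ints.astype(np.float32)             # astype returns a converted copy

print(ints.dtype, floats.dtype, as_float32.dtype)
```

Note that `astype` does not modify `ints`; it returns a new array.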

In [None]:
b.sum() # method: sum of elements

In [None]:
b.prod()

For a full list of the attributes and methods of a NumPy array (`numpy.ndarray`), see the online documentation: 
https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html

### Other ways to create an array

- Function `np.arange(start, stop, step)` ("array range"; the arguments can be floats, and `stop` itself is excluded.)

In [None]:
np.arange(1,5,0.5)

- Function `np.linspace(start, stop, num)`: `num` evenly spaced points from `start` to `stop`, with both endpoints included.

In [None]:
#np.linspace(0,10,4) 

np.linspace(2, 8, 7).astype('int')
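To make the difference between the two functions concrete, the following sketch builds points on the interval from 0 to 1 both ways. The names `by_step` and `by_count` are made up for this example.

```python
import numpy as np

by_step = np.arange(0, 1, 0.25)    # fixed step; the stop value is excluded
by_count = np.linspace(0, 1, 5)    # fixed number of points; the stop value is included

print(by_step)
print(by_count)
```

Use `np.arange` when you care about the step size and `np.linspace` when you care about the number of points.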

_Hint: you can press tab key for autocompletion_

- Function `np.ones(size)` and `np.zeros(size)`: an array filled with ones or zeroes.

In [None]:
np.ones(5)

In [None]:
np.zeros(5)

### Two (or higher)-dimensional arrays

Two dimensional arrays can be created using nested lists (or list of lists). 

In [None]:
A = np.array([[1,2,3,4],[10,20,30,40],[100,200,300,400]])
print("The array A:\n",A)
print("Shape of A:",A.shape)
print("Number of dimensions of A:",A.ndim)

One may also create a higher-dimensional array:

In [None]:
B = A.reshape(3, 2, 2)
print("The array B:\n",B)
print("Shape:",B.shape)
print("Ndim:",B.ndim)

In [None]:
B[0, :, :]

In [None]:
B = np.arange(10)
print("The array B:\n",B)
print("Shape:",B.shape)
print("Ndim:",B.ndim)

In [None]:
B = B.reshape((10,1))
print("The array B:\n",B)
print("Shape:",B.shape)
print("Ndim:",B.ndim)

### Indexing multidimensional arrays

- To access an element at position (i, j): `array[i, j]`
- May select an entire column or row using a colon: `array[i, :]` for a row, `array[:, j]` for a column. 
- May also use a slice: e.g., `array[0:2, :]`. 

In [None]:
print("Array A:\n",A)

In [None]:
print("Element at pos (1,2):",A[1,2],'\n') 

`A[1][2]` is also possible, just like indexing a nested list, but it is less efficient. `A[1]` first creates an intermediate array object for row 1 of `A` (a view, not a copy of the data), and only then is index 2 of that row accessed. `A[1, 2]` accesses the value at the desired position directly, in a single indexing step.
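A quick check of what the intermediate `A[1]` object is, using a stand-in array `M` (a name chosen for this sketch). Basic indexing with an integer returns a view that shares memory with the original array, so the extra cost of the chained form is the second indexing step rather than a data copy.

```python
import numpy as np

M = np.array([[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]])

row = M[1]                          # intermediate 1-D array
print(np.shares_memory(row, M))     # the row is a view into M
print(M[1][2] == M[1, 2])           # both forms return the same element
```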

In [None]:
print("Row #2 of A:\n",A[2,:],'\n')

In [None]:
print("Column #1 of A:\n",A[:,1],'\n')

__Exercise__: How would you reverse the order of columns in matrix `A`?

The basic syntax for "slicing" is `start:end:step`; each part can be omitted when it takes its default value (start=0, end=the length of that axis, step=1).

### Accessing elements by logical tests

Let's recall how to perform _logical tests_ in Python.

In [None]:
A

In [None]:
print( A > 25 )

We can select elements based on the result of a logical test:

In [None]:
print( A[A > 25] )
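A boolean mask can also be stored in a variable, counted, and used on the left-hand side of an assignment. The sketch below uses a made-up array `temps` and replaces its negative entries with zero.

```python
import numpy as np

temps = np.array([3.5, -1.0, 7.2, -4.8, 0.0])   # hypothetical data
mask = temps < 0                # boolean array, True where the test passes

print(mask.sum())               # True counts as 1, so this counts the matches
temps[mask] = 0.0               # assign only to the selected elements
print(temps)
```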

__Exercise__:  Replace all odd numbers in `arr` with -1.

_Hint: you can assign values to a subsetted array._ 

<!--arr[arr % 2 == 0] = -2-->

In [None]:
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [None]:
arr

In [None]:
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

__Exercise__: Replace all even numbers in `arr` with -2.

In [None]:
arr

## Aggregation Functions



In [None]:
a = np.arange(10)
a

A quick tour of some of the most useful aggregation functions. Most of these functions are available either as functions in the `numpy` module (e.g. `np.sum()`) or as methods of the `ndarray` class (e.g. `a.sum()`).

In [None]:
np.sum(a), a.sum()

In [None]:
a.mean()

In [None]:
a.min(), a.max()

In [None]:
a.prod() # for product

In [None]:
(a+1).prod()

### Example: Variance

The *variance* of a set of numbers $x_1, \ldots, x_n$ is 

$$\mathrm{var}(x_1,\ldots,x_n) = \frac{1}{n}\left[(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2\right]\;,$$

where 

$$\bar{x} = \frac{1}{n}\left(x_1 + \cdots + x_n\right)$$

is the *mean*. Data sets with large variance are more "spread out" or "variable." 

While the formula for variance looks somewhat complicated, we can compute it as a one-liner using `numpy`'s vectorization and aggregation tools. 

In [None]:
x = np.random.rand(100)
v = np.mean((x - np.mean(x))**2)
v

As it turns out, there's also an `np.var` function that would have given us the same answer. 

In [None]:
np.var(x), x.var()
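One detail worth knowing: by default `np.var` divides by $n$, matching the formula above. Passing `ddof=1` divides by $n-1$ instead, which is the usual sample variance. The small array `data` below is made up so the arithmetic is easy to verify by hand (mean 5, squared deviations summing to 32).

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_var = np.var(data)              # divides by n (ddof=0, the default)
sample_var = np.var(data, ddof=1)   # divides by n - 1

print(pop_var, sample_var)
```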

## Boolean Aggregations

There are two especially useful aggregation functions for working with boolean arrays. `np.any()` returns `True` if ANY of the array elements are `True`. `np.all()` returns `True` if ALL of the array elements are `True`. 

In [None]:
a = np.arange(10)
a

In [None]:
a > 8

In [None]:
np.any(a > 8)

In [None]:
np.all(a < 10)

## Aggregation Along Multiple Dimensions

Often we don't just want the sum of all the numbers in an array -- we want the sum of all numbers *per row*. For another example, maybe we want the largest entry *in each column*. `numpy` makes it easy to accomplish these tasks via the `axis` argument. 

In [None]:
A = np.reshape(np.arange(15), (3, 5))
A

In [None]:
# by default, A.sum() adds up all the numbers
A.sum()

In [None]:
# the zeroth axis corresponds to ROWS. 
# summing over this axis gives the totals in each COLUMN. 
A.sum(axis = 0)

In [None]:
# the first axis corresponds to COLUMNS. 
# summing over this axis gives the totals in each ROW. 

A.sum(axis = 1)

In [None]:
A.shape

Here's a helpful way to remember how to use the `axis` argument. In our case, `A.shape = (3,5)`. The sum `A.sum(axis = 0)` "eliminates" the zeroth dimension of size 3, leaving the dimension of size 5. So, `A.sum(axis = 0)` should be a 1d array of size 5, which is what we saw above. Similarly, `A.sum(axis = 1)` eliminates the first dimension of size 5, leaving the zeroth dimension of size 3. 
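The same "eliminate the named axis" rule can be checked on a 3-D array. The array `T` is a made-up example; `keepdims=True` is an optional argument that keeps the reduced axis as a dimension of size 1 instead of removing it.

```python
import numpy as np

T = np.zeros((2, 3, 4))   # hypothetical 3-D array

print(T.sum(axis=0).shape)                  # (3, 4)
print(T.sum(axis=1).shape)                  # (2, 4)
print(T.sum(axis=2).shape)                  # (2, 3)
print(T.sum(axis=1, keepdims=True).shape)   # (2, 1, 4)
```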

Other aggregation functions also accept the `axis` argument.

In [None]:
A.min(axis = 0)

In [None]:
A.max(axis = 1)

In [None]:
A.mean(axis = 1)

In [None]:
A.cumsum(axis = 1)

### Dealing with NaN values

What happens when there is a NaN value in your array? Since NaNs propagate, aggregations involving NaNs will generate more NaNs:

In [None]:
A = 1.0*A # to convert to float
A[0,:] = np.nan
A

In [None]:
A.mean(axis = 0)

In [None]:
A.mean(axis = 1)

In some cases, we might want to completely ignore NaN values. In this case, we can use the `nan*` versions of `numpy`'s aggregation functions, like `np.nansum()`, `np.nanmean()`, `np.nanmax()`, `np.nanmin()`, and so on:

In [None]:
np.nanmean(A, axis = 0)

In [None]:
np.nanmean(A, axis = 1)

##  Broadcasting 

In [None]:
import numpy as np

a=np.array([1,2,3])
b=np.array([4,5])

In [None]:
#This produces an error
a+b

Indeed, adding arrays of different sizes feels "unnatural". But note that the error message said 

"ValueError: operands could not be broadcast together with shapes (3,) (2,)"

It didn't just say "shapes don't match."

Sometimes you **can** add arrays of different shapes, and this is called broadcasting.


In [None]:
a=np.ones(3)
b=np.zeros(3)
b=b.reshape(3,1)
a,b
a+b

In [None]:
#It looks like we should be able to add a and b, but their shapes differ
a.shape,b.shape

### First rule of broadcasting:
If one array has more dimensions than the other, take the array with fewer dimensions and 'pad' its shape with ones __ON THE LEFT__.

When you evaluate `a+b`, NumPy automatically treats `a`, which has shape `(3,)`, as an array with shape `(1,3)`. The shapes `(1,3)` and `(3,1)` can then be combined, and the addition works fine.


In [None]:
a+b

In [None]:
#Example 2
a=np.zeros((3,3))
b=np.arange(3)
print(a.shape,b.shape)
a,b

What happens if we try to add a and b?

By the first rule of broadcasting, the shape of `b` gets padded with a one, so now `a` has shape `(3,3)` and `b` has shape `(1,3)`.

### Second rule of broadcasting. 
If arrays have the same number of dimensions, any dimension of size one can be stretched to match the other array.

This means that b gets converted from array([0,1,2]) to array([[0,1,2],[0,1,2],[0,1,2]]),
the 3 by 3 array where every row is a copy of the original b.

Mathematically, we have the formula new_b[i,j]=old_b[j].

Axis 0 is the axis we are "stretching over", which is why i does not appear on the right-hand side.

In [None]:
a+b
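The stretched version of `b` is normally invisible, but `np.broadcast_to` can materialize it as a read-only view for inspection. The names `old_b` and `new_b` follow the formula above and are otherwise just for this sketch.

```python
import numpy as np

old_b = np.arange(3)
new_b = np.broadcast_to(old_b, (3, 3))   # read-only stretched view of old_b

print(new_b)
# Every row is a copy of old_b, i.e. new_b[i, j] == old_b[j].
print(all(new_b[i, j] == old_b[j] for i in range(3) for j in range(3)))
```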

### Third rule of broadcasting.

Sometimes you are out of luck. If the arrays have the same number of dimensions, the sizes along some dimension don't match, and neither of those sizes is one, you can't add them: you can only stretch ones. 

In [None]:
#This gives you a value error
#A=np.zeros((3,4))
#B=np.zeros((4,3))
#A+B
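The three rules can be tried out on shapes alone, without building any arrays, using `np.broadcast_shapes`. This sketch assumes a reasonably recent NumPy (the function was added in version 1.20); the variable names are illustrative.

```python
import numpy as np

shape_pad = np.broadcast_shapes((3,), (3, 1))     # rules 1 and 2: pad, then stretch
shape_rows = np.broadcast_shapes((3, 3), (3,))    # rules 1 and 2 again
print(shape_pad, shape_rows)

# Rule 3: sizes differ and neither is one, so broadcasting fails.
try:
    np.broadcast_shapes((3, 4), (4, 3))
    failed = False
except ValueError as err:
    failed = True
    print("Cannot broadcast:", err)
```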

### More Complicated Example

In [None]:
a=np.arange(3)
a=a.reshape(1,3)
b=2*np.arange(3)
b=b.reshape(3,1)
print(a.shape,b.shape)
a,b

Can we add? Yes! In each dimension, one of the two arrays has size one, so both can be stretched.

a gets stretched out along dimension 0 to form a 3 by 3 array. We are stretching along axis = 0, so the rule is 
a_new[i,j]=a[0,j], and a becomes

array([[0,1,2],[0,1,2],[0,1,2]])

b, on the other hand, gets stretched along axis 1, so it obeys the rule b_new[i,j] = b[i,0],

so b gets converted from the column 0,2,4 to 

array([[0,0,0],[2,2,2],[4,4,4]])

Now, since a_new and b_new are both 3 by 3, we can add them.

In [None]:
a+b

Note that a_new and b_new are all "under the hood." You don't actually see them.
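If you do want to look at the stretched arrays, `np.broadcast_arrays` returns them explicitly. The sketch below rebuilds the two arrays from this example under the stand-in names `row` and `col`.

```python
import numpy as np

row = np.arange(3).reshape(1, 3)          # plays the role of a
col = (2 * np.arange(3)).reshape(3, 1)    # plays the role of b

row_new, col_new = np.broadcast_arrays(row, col)
print(row_new)
print(col_new)
print(np.array_equal(row_new + col_new, row + col))
```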

In [None]:
#Example
np.arange(3)+4

The scalar 4 is stretched to array([4,4,4]).

## Normalizing Data

In [None]:
#Here is some random data, n x p, number of samples x number of features.
X=np.random.rand(10,3)
X

Interpretation. Each row is a new person. Each column is a different 'assessment'. Can we manipulate our data so that each column has mean zero and variance 1?

In [None]:
#mean of each column
mu = X.mean(axis=0)
mu

What happens if we type `X-mu`? 

`X` has shape `10,3` and `mu` has shape `3`, so by the first rule of broadcasting, `mu` gets padded with a one to have shape `1,3`.

Then, by the second rule of broadcasting, `mu` gets stretched out to a 10x3 matrix where each row is a copy of the original `mu`.

In [None]:
Centered = X-mu
Centered

In [None]:
Centered.mean(axis=0)

Okay, now let's make the variance of each column equal to 1.

In [None]:
sigma = X.std(axis=0)
sigma

In [None]:
Normalized=(X-mu)/sigma
Normalized

In [None]:
Normalized.mean(axis=0), Normalized.std(axis=0)

In [None]:
#Note that the rows do not have mean zero or variance 1

Normalized.mean(axis=1), Normalized.std(axis=1)


##  Linear Algebra 

In [None]:
A=np.arange(15).reshape(5,3)
A

In [None]:
x=np.ones(3)

What does $A*x$ do?

In [None]:
A*x

What does it NOT do?

## Matrix-vector multiplication? 

Recall that multiplication of an $M\times N$ matrix by an $N$ dimensional vector results in a new $M$ dimensional vector $y=Ax$ defined by 
\begin{equation*}
y_i=\sum_jA_{i,j}x_j
\end{equation*}

In [None]:
np.matmul(A,x)
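To connect the formula with the code, the sketch below computes $y_i=\sum_j A_{i,j}x_j$ with explicit Python loops and compares it with the built-in product. It uses stand-in names `M` and `v` with the same shapes as above, and also shows that `M * v` is an elementwise (broadcast) product rather than a matrix-vector product.

```python
import numpy as np

M = np.arange(15).reshape(5, 3)
v = np.ones(3)

# Direct translation of y_i = sum_j M[i, j] * v[j]
y_loop = np.array([sum(M[i, j] * v[j] for j in range(3)) for i in range(5)])

y_fast = M @ v     # the @ operator is shorthand for np.matmul

print(np.allclose(y_loop, y_fast))
print((M * v).shape, y_fast.shape)   # elementwise product vs. matrix-vector product
```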

## Matrix-Matrix multiplication 

If $A$ is an $M\times N$ matrix and $B$ is an $N\times R$ matrix, then $C=AB$ is an $M\times R$ matrix defined by 

\begin{equation*}
C_{i,j}=\sum_kA_{i,k}B_{k,j}
\end{equation*}

In [None]:
B=np.arange(33).reshape(3,11)
C=np.matmul(A,B)
C.shape
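As a spot check of the definition, the sketch below compares one entry of the product with the explicit sum $\sum_k A_{i,k}B_{k,j}$. The names `P`, `Q`, `R` and the chosen position `(i, j)` are arbitrary stand-ins.

```python
import numpy as np

P = np.arange(15).reshape(5, 3)
Q = np.arange(33).reshape(3, 11)
R = P @ Q                      # same result as np.matmul(P, Q)

i, j = 2, 4                    # arbitrary entry to check
entry = sum(P[i, k] * Q[k, j] for k in range(3))

print(R.shape, R[i, j] == entry)
```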

This doesn't work, because the inner dimensions (3 for `A`, 2 for `BadB`) do not match:


In [None]:
BadB=np.arange(22).reshape(2,11)
BadC=np.matmul(A,BadB)


More linear algebra operations are available in the submodule `numpy.linalg`. 
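A brief taste of that submodule, using a small made-up 2x2 system: `np.linalg.solve` solves a linear system, `np.linalg.det` computes a determinant, and `np.linalg.norm` computes a vector norm. The names `S`, `rhs`, and `sol` are only for this sketch.

```python
import numpy as np

S = np.array([[2.0, 1.0], [1.0, 3.0]])   # hypothetical coefficient matrix
rhs = np.array([3.0, 5.0])               # hypothetical right-hand side

sol = np.linalg.solve(S, rhs)     # solves S @ sol = rhs
print(sol)
print(np.allclose(S @ sol, rhs))  # verify the solution
print(np.linalg.det(S))           # determinant of S
print(np.linalg.norm(rhs))        # Euclidean norm of rhs
```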