# Numerical Python: NumPy II


[NumPy](http://www.numpy.org/) 

# [![Numpy logo](https://numfocus.org/wp-content/uploads/2016/07/numpy-logo-300.png)](https://matplotlib.org/gallery/mplot3d/voxels_numpy_logo.html)

In our last lecture we explored the basics of the numpy module.  
We talked about why we need numpy, how to build various numpy arrays, and their fundamental attributes. We ended the class by looking at numpy's ufuncs, which is where numpy's power lies.  
Today we are going to explore some more advanced capabilities of numpy:
1. __Aggregation__
1. __Broadcasting__
1. __Fancy indexing and boolean indexing__

First things first, let's import numpy:

In [None]:
import numpy as np
# This will make sure floats print nicely
np.set_printoptions(precision=2)
np.set_printoptions(suppress=True)

*** 
# Numpy aggregation

In most data science applications we start exploring the data by querying different statistics.   
Numpy allows us to do that quickly with aggregation functions, which summarize the values in an array (you aggregate information as you iterate over the array).
Some of the most common aggregations are:
```py
sum, mean, std, var, min, max.   
```
For the full list of aggregation functions, see [Numpy aggregation](https://jakevdp.github.io/PythonDataScienceHandbook/02.04-computation-on-arrays-aggregates.html) (there is a table in the middle of the notebook).   
Here are some examples:

In [None]:
np.random.seed(1101)
a = np.random.randint(10, 20, size=10)
a

In [None]:
a.min(), a.max()

In [None]:
a.mean(), a.std()

In [None]:
a.argmin()

### Aggregation over multi dimensional arrays
As the title suggests, we can aggregate over multiple dimensions and, for example, summarize a statistic over an entire matrix:

In [None]:
M = np.random.rand(5, 5)
M.max(), M.min(), M.mean()

But in many cases you are not interested in a statistic of the entire matrix, but in a statistic along one of its axes, say for each column. All you have to do is specify the axis over which you wish to aggregate.

<img src="http://www.elimhk.com/myblog/wp-content/uploads/2017/04/axis.png" width="">

To remember the output shape of an aggregation over an axis, I like to think about collapsing that axis. So if you have an array with shape (10, 3) and you aggregate over axis 0, you'll end up with shape (3,), since you "collapsed" the 0 axis (or with (1, 3) if you pass `keepdims=True`).
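
As a quick check of the "collapsing" picture, here is a minimal sketch that prints the shapes you get when aggregating over each axis. The array `data` is made up for illustration and is not part of the lecture data.

```python
import numpy as np

# Illustrative array with 10 rows and 3 columns (not the lecture data).
data = np.ones((10, 3))

col_sums = data.sum(axis=0)             # axis 0 is collapsed: one value per column
row_sums = data.sum(axis=1)             # axis 1 is collapsed: one value per row
kept = data.sum(axis=0, keepdims=True)  # collapsed axis is kept with length 1

print(col_sums.shape, row_sums.shape, kept.shape)
```

By default the collapsed axis disappears from the shape; `keepdims=True` is handy when you later want to broadcast the result against the original array.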

In [None]:
a = np.arange(10).reshape(2, 5)
a, a.sum()

In [None]:
a.sum(axis=0) # How many values are we expecting to get?

In [None]:
a.sum(axis=1) # How many values are we expecting to get?

In [None]:
np.random.seed(109)
a = np.random.randint(low=0, high=100, size=(10, 2))
a

In [None]:
a.mean(axis=0)

***
## Exercise
***

In [None]:
np.random.seed(1111)
X = np.random.randint(low=0, high=50, size=(30, 4))

__Get the mean values of each column in X__

In [None]:
# Your code here
X.mean(axis=0)

__Get the max value of each row in X__

In [None]:
# Your code here
X.max(axis=1)

__Get the median value of all the values of X__

In [None]:
# Your code here
np.median(X)

__Get the variance of each column of X plus the variance of each row of Y.__

In [None]:
np.random.seed(2222)
Y = np.random.randint(low=0, high=50, size=(4, 20))

In [None]:
# Your code goes here
X.var(axis=0) + Y.var(axis=1)

## Wine Example
<img src="https://www.ironstonevineyards.com/wp-content/uploads/2017/06/wine-club-cheers.jpg" width="300" height="">

We are going to use the wine dataset from sklearn (THE machine learning module in python).    
In this data, each row corresponds to a specific wine. The row values consist of different chemical properties of the wine (alcohol, malic acid, magnesium, etc.) alongside the class of the wine (the cultivar it was made from). The "goal" in this data is to use the chemical properties to predict the wine's class.  
In machine learning the "properties" are referred to as __features__ while the value we are trying to predict is referred to as the __label__ or __target__.

|  Type |Alcohol | Malic acid| ash | ... |
|---|---|---|---|---|
|Merlot Galil| 12.5 | 2.3| 4.5| ...|
|Merlot Arava| 13.2 | 3.1| 2.5| ...|
|Cabernet Negev| 14.1 | 3.3| 4.1| ...|

In [None]:
from sklearn import datasets # We are going to use sklearn to load a sample dataset.

# Load the wine dataset.
wine_dataset = datasets.load_wine()

# Extract the feature names from the dataset.
features_names = wine_dataset['feature_names']

# Extract the features matrix from the dataset.
X = wine_dataset['data']

In [None]:
# Let's take a look at what kind of features we have to work with
features_names

In [None]:
# Let's see how much data we have, and a small sanity check
X.shape, len(features_names)

In [None]:
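# Run this only if the import above fails. The conda package is named scikit-learn, not sklearn.
!conda install scikit-learn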

***
## Exercise
***
__Extract the alcohol column__

In [None]:
# Your code starts here.

# Your code ends here.

__Find the mean, max and min values of the alcohol feature__

In [None]:
# Your code starts here.

# Your code ends here.

__Find the mean value of the flavanoids column divided by the nonflavanoid_phenols column__

In [None]:
# Your code starts here.

# Your code ends here.

***
## Broadcasting

A very powerful mechanism of NumPy arrays is [broadcasting](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html).
Broadcasting is applied when an operation involves two arrays of different shapes.
The rules are:

1. If the arrays differ in their number of dimensions, left-pad the smaller array's shape with 1s.
1. If the shapes differ in some dimension, stretch any dimension of size 1 to match the size of the other array in that dimension.
1. If the shapes still differ, raise an error.
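
The three rules can be sketched in plain Python. The helper `broadcast_shape` below is a hypothetical name used only for illustration (it is not a NumPy function). It works on shape tuples, and the last line checks one result against what NumPy itself produces.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Hypothetical helper: apply the three broadcasting rules to two shapes."""
    ndim = max(len(shape_a), len(shape_b))
    # Rule 1: left-pad the shorter shape with 1s.
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)

    result = []
    for dim_a, dim_b in zip(a, b):
        if dim_a == dim_b:
            result.append(dim_a)
        elif dim_a == 1:
            # Rule 2: a size-1 dimension is stretched to match the other array.
            result.append(dim_b)
        elif dim_b == 1:
            result.append(dim_a)
        else:
            # Rule 3: the sizes differ and neither is 1.
            raise ValueError(f"shapes {shape_a} and {shape_b} cannot be broadcast")
    return tuple(result)

print(broadcast_shape((3, 3), (3,)))
print(broadcast_shape((3, 1), (3,)))
# Compare with the shape NumPy actually produces for the same inputs.
print((np.ones((3, 1)) + np.arange(3)).shape)
```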

Some examples:
![broadcasting examples](http://www.astroml.org/_images/fig_broadcast_visual_1.png)

In [None]:
np.arange(3) + 5

In [None]:
np.ones((3,3)) + np.arange(3)

In [None]:
np.arange(3).reshape((3, 1)) + np.arange(3)

Let's see an example where this breaks

In [None]:
np.ones((3,3)) + np.ones((3, 2))

As an example, we can use broadcasting to quickly build a multiplication table.

In [None]:
# Shape (10,)      shape(10, 1) -> Broadcasting to (10, 10)
np.arange(1, 11) * np.arange(1, 11).reshape(10, 1)

***
## Exercise
***
Use `a` and `b` to produce the following output:
```py
array([[ 2,  4,  6,  8],
       [ 3,  6,  9, 12]])
```

In [None]:
# Your code starts here.

# Your code ends here.

__Given a 1D array `X`, calculate the difference between every pair of elements of `X` using broadcasting and save it to an array `D`, meaning `D[i,j] = X[i] - X[j]`__

In [None]:
X = np.linspace(1, 10, 10)

In [None]:
# Your code starts here.

# Your code ends here.

***
__In front of you is an array of prices of different products in shekels, let's call the sum of these products a basket. You would like to know the basket price in each of the following currencies:__
1. Dollar (1 shekel -> 0.28)
1. Euro   (1 shekel -> 0.26)
1. Yuan   (1 shekel -> 2.03)
1. Yen    (1 shekel -> 30.11)

__Use broadcasting and aggregation to quickly find out the price of the baskets.__

In [None]:
prices = np.array([50, 25, 80, 100, 150, 275])

# Your code starts here.

# Your code ends here.

***
__A very common procedure in machine learning is to normalize the data prior to feeding it to an algorithm.__

__Use aggregation to mean-center the columns of the following X matrix (so that each column has a mean of 0).__

In [None]:
np.random.seed(1111)
X = np.random.randint(low=0, high=50, size=(30, 4))

In [None]:
# Your code starts here.

# Your code ends here.

__Use `np.isclose` to validate that all the columns in the new matrix have mean 0.__

In [None]:
# Your code starts here.

# Your code ends here.

***
__Let X, Y be 2 random variables. In front of you is the joint distribution, J, of X and Y, where J[i, j] = $p(x=i, y=j)$.  
Find out if X and Y are independent.__  
Reverse this string ([::-1]) for a hint: 
```py
J nevig eht ot erapmoc dna noitubirtsid tnioj eht etupmoc ,noitagergga gnisu Y dna X fo slanigram eht etupmoC
```

In [None]:
J = np.array(([
    [0.04 , 0.03 , 0.02 , 0.01 ],
    [0.075, 0.1  , 0.05 , 0.025],
    [0.075, 0.1  , 0.05 , 0.025],
    [0.12 , 0.16 , 0.08 , 0.04 ]
]))

In [None]:
# Your code starts here.

# Your code ends here.

***
## Boolean indexing and Fancy indexing

### Boolean operations
Before we talk about boolean indexing we'll talk about boolean ufuncs.  
We saw that we can operate on numpy arrays in an element-wise fashion using arithmetic functions, which compute the operation on each element. We can also work element-wise using boolean operations (such as comparisons), which result in a boolean array indicating whether the operation was True or False for each element.

In [None]:
np.random.seed(2611)
a = np.random.randint(low=0, high=50, size=20)
a

In [None]:
a == 18

In [None]:
a > 18
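
The comparison operators are shorthand for numpy ufuncs, just like the arithmetic operators. A minimal sketch, using a small made-up array `values` rather than the random array above:

```python
import numpy as np

# Illustrative values (not the random array from the lecture).
values = np.array([3, 18, 42, 18, 7])

is_18 = values == 18                # operator form
also_is_18 = np.equal(values, 18)   # the ufunc behind the == operator
above_18 = np.greater(values, 18)   # the ufunc behind the > operator

print(is_18)
print(above_18)
print(is_18.dtype)
```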

Given this boolean array we can now check different properties of our original array.  
For example, we can check how many entries in our array are equal to 18:

In [None]:
(a == 18).sum()
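
Summing a boolean array works as a counter because `True` is treated as 1 and `False` as 0. The sketch below (again with a made-up array `values`) shows two equivalent ways to count, plus the mean, which gives the fraction of matching entries.

```python
import numpy as np

# Illustrative values (not the random array from the lecture).
values = np.array([3, 18, 42, 18, 7])
mask = values == 18

count_via_sum = mask.sum()                  # True counts as 1, False as 0
count_via_nonzero = np.count_nonzero(mask)  # same count, stated more explicitly
fraction = mask.mean()                      # share of entries equal to 18

print(count_via_sum, count_via_nonzero, fraction)
```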

We can check if __all__ or __any__ of the elements possess a certain attribute:

In [None]:
(a == 18).any(), (a < 50).any(), (a < 0).any()

In [None]:
(a == 18).all(), (a < 50).all(), (a < 0).all()

And this obviously works on multidimensional arrays as well:

In [None]:
np.random.seed(23)
A = np.random.randint(low=0, high=30, size=(5, 4))
A

In [None]:
A > 9

In [None]:
A == 6

In [None]:
(A==6).sum(), (A>6).any(), (A>6).all()

We can use the axis parameter to aggregate over an axis and not the entire matrix.

In [None]:
(A == 6).sum(axis=0), (A<6).any(axis=1), (A>6).all(axis=0)

### Bitwise operation
A bitwise operation is a function which takes in boolean values {0, 1} and outputs a boolean value {0, 1}. A bitwise operation is defined by a truth table which holds the output for each combination of input values.  
Numpy supports 4 boolean bitwise operations:
1. & (AND)
1. | (OR)
1. ^ (XOR)
1. ~ (NOT), which takes a single value  

And here are their truth tables:  
<img src="http://www.csc.villanova.edu/~mdamian/Past/csc2400fa13/assign/Figs/4basicgates.gif" width="300" height="">
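
The same truth tables can be reproduced with two small boolean arrays that cover every combination of inputs. This is only a sketch: the names `p` and `q` are chosen for illustration.

```python
import numpy as np

# All four combinations of two boolean inputs.
p = np.array([False, False, True, True])
q = np.array([False, True, False, True])

print("p    ", p)
print("q    ", q)
print("p & q", p & q)   # AND: True only where both are True
print("p | q", p | q)   # OR: True where at least one is True
print("p ^ q", p ^ q)   # XOR: True where exactly one is True
print("~p   ", ~p)      # NOT: flips each element of a boolean array
```

Note that `~` behaves as a logical NOT only on boolean arrays. On plain Python integers (and on Python's own `True` / `False`) it flips the bits of the integer instead.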

This enables us to check multiple attributes quickly

In [None]:
np.random.seed(23)
A = np.random.randint(low=0, high=30, size=(5, 4))
A

In [None]:
(A > 8) & (A < 15) # Only values strictly between 8 and 15 evaluate to True

__Watch out:__ bitwise operators take precedence over comparison operators, so `A > 8 & A < 15` is parsed as `A > (8 & A) < 15`. For example, the following statement fails:

In [None]:
A > 8 & A < 15
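
To see how Python groups the unparenthesized expression, it helps to try it first on a plain integer, where nothing fails but the answer is silently wrong. A small sketch with made-up values `x` and `arr`:

```python
import numpy as np

x = 5
# Without parentheses this is parsed as the chained comparison x > (8 & x) < 15.
unparenthesized = x > 8 & x < 15
# With parentheses each comparison is evaluated first, then combined.
parenthesized = (x > 8) & (x < 15)

print(8 & x)  # the bitwise AND that is evaluated first
print(unparenthesized, parenthesized)

# With an array, the chained comparison needs a single truth value and fails.
arr = np.array([5, 10, 20])
try:
    arr > 8 & arr < 15
    array_case_failed = False
except ValueError as err:
    array_case_failed = True
    print("ValueError:", err)
```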

> __and__ / **or** vs __&__ / **|** The difference between `and` and `&` (or between `or` and `|`) can be confusing. The `and` and `or` keywords work on the "truthiness" of an entire object. When you try to evaluate 
```py 
(A > 8) and (A < 15)
```
the interpreter raises an exception, since `A > 8` is an array and cannot be evaluated as a single boolean value (and `and` is not a ufunc in numpy). But 
```py
(A > 8) | (A < 15)
```
works, since `|` calls a numpy ufunc which operates element by element.

In [None]:
(A > 8) and (A < 15)
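
A minimal sketch of the difference, using a small made-up array `arr`: the `and` keyword raises, while `&` and the equivalent `np.logical_and` ufunc both return an element-wise result.

```python
import numpy as np

# Illustrative array (not the random matrix from the lecture).
arr = np.array([5, 10, 20])

# `and` asks for the truth value of a whole array, which is ambiguous.
try:
    (arr > 8) and (arr < 15)
    and_failed = False
except ValueError as err:
    and_failed = True
    print("`and` failed:", err)

# `&` and np.logical_and both combine the two masks element by element.
elementwise = (arr > 8) & (arr < 15)
same_thing = np.logical_and(arr > 8, arr < 15)

print(elementwise)
```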

One of the coolest features of numpy is that you can use a boolean array for indexing. This is usually referred to as __masking__.  
In the example below, we first build a boolean array indicating whether each value is greater than 7 or not. 
Then, we use this boolean array to extract all the values which are greater than 7. This is a very powerful and convenient feature.

In [None]:
A[A > 7] # Get all values greater than 7

In [None]:
A[(A < 10) & (A != 6)] # Get all values smaller than 10 which are not 6

***
### Exercise
***

__Get all the values from A which are between 10 and 20 but not 11__

In [None]:
np.random.seed(99)
A = np.random.randint(low=0, high=20, size=(5, 5))
A

In [None]:
# Your code starts here

# Your code ends here

__Use np.where to find the indices of all the values between 10 and 20 or 30 to 40 in B__

In [None]:
np.random.seed(2019)
B = np.random.randint(low=0, high=41, size=50)
B

In [None]:
# Your code starts here

# Your code ends here

__Use the same technique to find the indices of all the wines which have `alcohol` level of above 12.5 or `malic acid` of below 2 but not both!__

In [None]:
from sklearn import datasets # We are going to use sklearn to load a sample dataset.

# Load the wine dataset.
wine_dataset = datasets.load_wine()

# Extract the feature names from the dataset.
features_names = wine_dataset['feature_names']

# Extract the features matrix from the dataset.
X = wine_dataset['data']

In [None]:
features_names

In [None]:
mask = None
# Your code starts here

# Your code ends here

### Fancy indexing
Once we have extracted the indices from the features we can use those indices to get all the rows that match our criterion. The idea of using an array of integers as an indexer is called fancy indexing.

In [None]:
ind = np.where(mask) # Assuming you got the mask right.
X[ind] # Use the extracted indices to select the matching rows.
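
Here is a self-contained sketch of the same idea on a tiny made-up matrix (`data` and `row_mask` are illustrative names, not the wine data). It shows that `np.where` returns a tuple with one index array per axis, and that indexing with those integers selects the same rows as indexing with the boolean mask directly.

```python
import numpy as np

# Illustrative matrix: 4 rows, 2 columns.
data = np.array([[1, 10],
                 [2, 20],
                 [3, 30],
                 [4, 40]])

row_mask = data[:, 1] > 15       # boolean mask with one entry per row

# np.where on a 1D mask returns a 1-tuple holding the indices of the True entries.
(row_idx,) = np.where(row_mask)
print(row_idx)

via_mask = data[row_mask]        # boolean indexing
via_indices = data[row_idx]      # fancy indexing with the extracted integers
print(via_indices)
```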

The idea of fancy indexing is pretty simple: we can use a scalar to pick a specific element from an array, so let's use an array of indices to access multiple elements at once.

<img src="https://media.giphy.com/media/dQkcf8GANR0ps57oBH/giphy.gif" width="200">

In [None]:
a = np.arange(20)

In [None]:
a[0], a[2], a[5], a[17] # Cumbersome way of accessing 4 elements in an array.

In [None]:
ind = [0, 2, 5, 17] # Fancy indexing!
a[ind]

We can build an array of any shape we want using fancy indexing, since the result takes the shape of the index array:

In [None]:
mat_1 = np.array([[0, 1, 2], [1, 2, 3]]) # A 2D array of indices
a[mat_1] # The result has the shape of the index array: (2, 3)

We can also do fancy indexing on multiple axes:

In [None]:
A = np.arange(20).reshape(4, 5)
A

In [None]:
row = [0, 1, 3]
col = [1, 2, 4]
A[row, col]

Note that this picks one element per (row, col) pair, so the result is 1D.

If you want to take the [1, 2, 4] column values from each of the [0, 1, 3] rows and keep the 2D structure, you can use __broadcasting!__

In [None]:
row = np.array([0, 1, 3])
A[row[:, np.newaxis], col]
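
The sketch below contrasts the two behaviours on the same 4x5 grid. The names `grid`, `rows` and `cols` are illustrative. It also uses `np.ix_`, a NumPy helper that builds the broadcastable index arrays for you.

```python
import numpy as np

grid = np.arange(20).reshape(4, 5)
rows = np.array([0, 1, 3])
cols = np.array([1, 2, 4])

# Paired indexing: one element per (row, col) pair, shape (3,).
paired = grid[rows, cols]

# Broadcasting a (3, 1) row index against a (3,) column index gives a (3, 3) block.
block = grid[rows[:, np.newaxis], cols]

# np.ix_ constructs equivalent broadcastable index arrays.
block_ix = grid[np.ix_(rows, cols)]

print(paired)
print(block)
```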

You can also mix indexing types.

In [None]:
np.random.seed(2020)
A = np.random.randint(low=0, high=100, size=(8, 5))
A

Using slicing and fancy indexing:

In [None]:
A[1:3, [1, 2, 4]] # Grab rows 1 and 2, and take columns 1, 2 and 4

Using boolean and fancy indexing.

In [None]:
# Take rows 1, 4, 5 and 6, and keep only the columns that have a mean greater than 50.
rows = np.array([1, 4, 5, 6])
col = A.mean(0) > 50
A[rows[:, np.newaxis], col]

In [None]:
rows[:, np.newaxis], col

In [None]:
np.tile(rows[:, np.newaxis], [5])

In [None]:
np.tile(col, (4, 1))

# "Losing Your Loops": Fast Numerical Computing with NumPy 

From the PyCon 2015 conference, a [presentation](https://speakerdeck.com/jakevdp/losing-your-loops-fast-numerical-computing-with-numpy-pycon-2015) by [Jake VanderPlas](http://vanderplas.com).

Also available on [YouTube](https://www.youtube.com/watch?v=EEUXKG97YRw).

# References

- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) A thorough tour into Numpy. 

- [Yoav Ram Numpy Notebook](https://github.com/yoavram/SciComPy/blob/master/notebooks/numpy.ipynb) If you want to skim through most topics in Numpy.