<a href="https://colab.research.google.com/github/rahiakela/hands-on-machine-learning-with-scikit-learn-keras-and-tensorflow/blob/0-math-numpy-pandas-matplotlib-guide/visual_intro_to_numpy_and_data_representation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Visual Intro to NumPy and Data Representation

Reference: [A Visual Intro to NumPy and Data Representation](https://jalammar.github.io/visual-numpy/)

<img src='https://github.com/rahiakela/img-repo/blob/master/numpy-array.png?raw=1' width='800'/>

The [NumPy package](https://numpy.org/) is the workhorse of data analysis, machine learning, and scientific computing in the Python ecosystem. It vastly simplifies manipulating and crunching vectors and matrices. Some of Python’s leading packages rely on NumPy as a fundamental piece of their infrastructure (examples include scikit-learn, SciPy, pandas, and TensorFlow). Beyond the ability to slice and dice numeric data, mastering NumPy will give you an edge when working with and debugging advanced use cases in these libraries.

In this post, we’ll look at some of the main ways to use NumPy and how it can represent different types of data (tables, images, text, etc.) before we serve them to machine learning models.


In [0]:
import numpy as np

## Creating Arrays

We can create a NumPy array (a.k.a. the mighty [ndarray](https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html)) by passing a Python list to `np.array()`. In this case, NumPy creates the array we can see on the right here:

<img src='https://github.com/rahiakela/img-repo/blob/master/create-numpy-array-1.png?raw=1' width='800'/>



In [2]:
np.array([1, 2, 3])

array([1, 2, 3])

There are often cases when we want NumPy to initialize the values of the array for us. NumPy provides functions like ones(), zeros(), and random.random() for these cases. We just pass them the number of elements we want them to generate:

<img src='https://github.com/rahiakela/img-repo/blob/master/create-numpy-array-ones-zeros-random.png?raw=1' width='800'/>

In [3]:
np.ones(3)

array([1., 1., 1.])

In [4]:
np.zeros(3)

array([0., 0., 0.])

In [5]:
np.random.random(3)

array([0.67695869, 0.82171071, 0.26322002])
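The random values above change on every run. As a minimal sketch, assuming you want repeatable results, NumPy's `default_rng` generator can be seeded; the seed value `0` and the names `rng` and `sample` are arbitrary choices for illustration:

```python
import numpy as np

# A seeded generator produces the same sequence on every run.
rng = np.random.default_rng(seed=0)
sample = rng.random(3)  # three floats in the half-open interval [0, 1)

# Re-creating the generator with the same seed reproduces the values.
repeat = np.random.default_rng(seed=0).random(3)

print(np.array_equal(sample, repeat))  # True
```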

Once we’ve created our arrays, we can start to manipulate them in interesting ways.

### Array Arithmetic

Let’s create two NumPy arrays to showcase their usefulness. We’ll call them data and ones:

<img src='https://github.com/rahiakela/img-repo/blob/master/numpy-arrays-example-1.png?raw=1' width='800'/>


In [0]:
data = np.array([1, 2])
ones = np.ones(2)

Adding them up position-wise (i.e. adding the values of each row) is as simple as typing data + ones:

<img src='https://github.com/rahiakela/img-repo/blob/master/numpy-arrays-adding-1.png?raw=1' width='800'/>

In [7]:
data + ones

array([2., 3.])

When I started learning such tools, I found it refreshing that an abstraction like this spares me from writing such a calculation as a loop. It’s a wonderful abstraction that allows you to think about problems at a higher level.
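To make the comparison concrete, here is a small sketch that writes the same addition both ways. The names `looped` and `vectorized` are only illustrative:

```python
import numpy as np

data = np.array([1, 2])
ones = np.ones(2)

# Explicit loop: add the elements one position at a time.
looped = np.empty(len(data))
for i in range(len(data)):
    looped[i] = data[i] + ones[i]

# Vectorized form: NumPy performs the same position-wise addition for us.
vectorized = data + ones

print(looped)      # [2. 3.]
print(vectorized)  # [2. 3.]
```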

And it’s not only addition that we can do this way:

<img src='https://github.com/rahiakela/img-repo/blob/master/numpy-array-subtract-multiply-divide.png?raw=1' width='800'/>

In [8]:
data - ones

array([0., 1.])

In [9]:
data * data

array([1, 4])

In [10]:
data / data

array([1., 1.])

There are often cases when we want to carry out an operation between an array and a single number (we can also call this an operation between a vector and a scalar). Say, for example, our array represents distance in miles, and we want to convert it to kilometers. We simply say data * 1.6:

<img src='https://github.com/rahiakela/img-repo/blob/master/numpy-array-broadcast.png?raw=1' width='800'/>

In [11]:
data * 1.6

array([1.6, 3.2])

See how NumPy understood that operation to mean that the multiplication should happen with each cell? That concept is called broadcasting, and it’s very useful.
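One way to picture broadcasting with a scalar is that NumPy behaves as if the scalar were stretched into an array of the same shape. The sketch below spells that out with `np.full`; the variable names are illustrative, and NumPy does not necessarily build the intermediate array internally:

```python
import numpy as np

miles = np.array([1, 2])

# Broadcasting: the scalar 1.6 is applied to every element.
km = miles * 1.6

# Conceptually the same as multiplying by an array filled with 1.6.
km_explicit = miles * np.full(miles.shape, 1.6)

print(km)  # [1.6 3.2]
```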

### Indexing

We can index and slice NumPy arrays in all the ways we can slice Python lists:

<img src='https://github.com/rahiakela/img-repo/blob/master/numpy-array-slice.png?raw=1' width='800'/>

In [0]:
data = np.array([1, 2, 3])

In [13]:
data[0]

1

In [14]:
data[1]

2

In [15]:
data[0:2]

array([1, 2])

In [16]:
data[1:]

array([2, 3])
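The syntax matches Python lists, but one behavior differs and is worth a quick sketch: a basic NumPy slice is a view onto the original data rather than a copy. The names `view` and `independent` are illustrative:

```python
import numpy as np

data = np.array([1, 2, 3])

print(data[-1])    # 3, negative indices count from the end
print(data[::-1])  # [3 2 1], a step of -1 reverses the array

# Unlike a list slice, a NumPy slice is a view onto the same memory.
view = data[0:2]
view[0] = 99
print(data)        # [99  2  3]

# Use .copy() when an independent array is needed.
independent = data[0:2].copy()
independent[0] = 1
print(data)        # [99  2  3]
```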

### Aggregation

NumPy also gives us aggregation functions:

<img src='https://github.com/rahiakela/img-repo/blob/master/numpy-array-aggregation.png?raw=1' width='800'/>

In [17]:
data.max()

3

In [18]:
data.min()

1

In [19]:
data.sum()

6

In addition to min, max, and sum, you get all the greats like mean to get the average, prod to get the result of multiplying all the elements together, std to get standard deviation, and [plenty of others](https://jakevdp.github.io/PythonDataScienceHandbook/02.04-computation-on-arrays-aggregates.html).
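A short sketch of the other aggregations mentioned here, applied to the same three-element array:

```python
import numpy as np

data = np.array([1, 2, 3])

print(data.mean())  # 2.0
print(data.prod())  # 6
print(data.std())   # population standard deviation (ddof=0 by default)
```

Note that `std()` divides by `n` by default; pass `ddof=1` if you need the sample standard deviation.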

### In more dimensions

All the examples we’ve looked at deal with vectors in one dimension. A key part of the beauty of NumPy is its ability to apply everything we’ve looked at so far to any number of dimensions.

## Creating Matrices

We can pass Python lists of lists in the following shape to have NumPy create a matrix to represent them:

In [20]:
np.array([
   [1, 2],
   [3, 4]       
])

array([[1, 2],
       [3, 4]])

<img src='https://github.com/rahiakela/img-repo/blob/master/numpy-array-create-2d.png?raw=1' width='800'/>

We can also use the same methods we mentioned above (ones(), zeros(), and random.random()) as long as we give them a tuple describing the dimensions of the matrix we are creating:

<img src='https://github.com/rahiakela/img-repo/blob/master/numpy-matrix-ones-zeros-random.png?raw=1' width='800'/>


In [21]:
np.ones((3, 2))

array([[1., 1.],
       [1., 1.],
       [1., 1.]])

In [22]:
np.zeros((3, 2))

array([[0., 0.],
       [0., 0.],
       [0., 0.]])

In [23]:
np.random.random((3, 2))

array([[0.17286117, 0.13117469],
       [0.42092345, 0.41046649],
       [0.4653666 , 0.55296414]])
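When working with matrices it often helps to inspect what was created. The sketch below checks the shape, number of dimensions, and element type of one of these arrays; `matrix` and `int_zeros` are illustrative names:

```python
import numpy as np

matrix = np.ones((3, 2))

print(matrix.shape)  # (3, 2): three rows, two columns
print(matrix.ndim)   # 2
print(matrix.dtype)  # float64, the default for ones() and zeros()

# Pass dtype to get integers instead of floats.
int_zeros = np.zeros((3, 2), dtype=int)
```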

### Matrix Arithmetic

We can add and multiply matrices using arithmetic operators (`+`, `-`, `*`, `/`) if the two matrices are the same size. NumPy handles those as position-wise operations:

<img src='https://github.com/rahiakela/img-repo/blob/master/numpy-matrix-arithmetic.png?raw=1' width='800'/>

In [24]:
data = np.array([
   [1, 2],
   [3, 4]
])
ones = np.ones((2, 2))  # same shape as data

data + ones

array([[2., 3.],
       [4., 5.]])

We can get away with doing these arithmetic operations on matrices of different sizes only if the mismatched dimension has length one in one of them (e.g. the matrix has only one column or one row), in which case NumPy uses its broadcast rules for that operation:

<img src='https://github.com/rahiakela/img-repo/blob/master/numpy-matrix-broadcast.png?raw=1' width='800'/>

In [25]:
data = np.array([
   [1, 2],
   [3, 4],
   [5, 6]              
])

ones_row = np.ones((1, 2))

data + ones_row

array([[2., 3.],
       [4., 5.],
       [6., 7.]])
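The same rule works for a single column, and it fails when the mismatched dimension is not one. A small sketch, where the `column` values and the `broadcast_failed` flag are made up for illustration:

```python
import numpy as np

data = np.array([
    [1, 2],
    [3, 4],
    [5, 6],
])

# A (3, 1) column is stretched across both columns of the (3, 2) matrix.
column = np.array([[10], [20], [30]])
print(data + column)
# [[11 12]
#  [23 24]
#  [35 36]]

# Shapes (3, 2) and (3,) do not line up, so NumPy raises an error.
try:
    data + np.ones(3)
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

print(broadcast_failed)  # True
```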

### Dot Product

A key distinction to make with arithmetic is the case of [matrix multiplication](https://www.mathsisfun.com/algebra/matrix-multiplying.html) using the dot product. NumPy gives every matrix a dot() method we can use to carry out dot product operations with other matrices:

<img src='https://github.com/rahiakela/img-repo/blob/master/numpy-matrix-dot-product-1.png?raw=1' width='800'/>


In [61]:
data = np.array([1, 2, 3])

powers_of_ten = np.array([
   [1, 10],
   [100, 1000],
   [10000, 100000]                       
])

data.dot(powers_of_ten)

array([ 30201, 302010])

I’ve added matrix dimensions at the bottom of this figure to stress that the two matrices must have the same size along the dimensions where they face each other. You can visualize this operation as looking like this:

<img src='https://github.com/rahiakela/img-repo/blob/master/numpy-matrix-dot-product-2.png?raw=1' width='800'/>

In [42]:
powers_of_ten[:, 1]

array([    10,   1000, 100000])

In [63]:
col1 = powers_of_ten[:, 0]
col2 = powers_of_ten[:, 1]

sum1 = np.dot(data, col1)
sum2 = np.dot(data, col2)

np.array([sum1, sum2])

array([ 30201, 302010])
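Python's `@` operator offers another spelling of the same product for arrays like these. The sketch below also prints the shapes to show the dimension rule; `result` is an illustrative name:

```python
import numpy as np

data = np.array([1, 2, 3])
powers_of_ten = np.array([
    [1, 10],
    [100, 1000],
    [10000, 100000],
])

# The @ operator is equivalent to dot() for these 1-D and 2-D arrays.
result = data @ powers_of_ten
print(result)  # [ 30201 302010]

# Inner dimensions must match: (3,) with (3, 2) gives (2,).
print(data.shape, powers_of_ten.shape, result.shape)  # (3,) (3, 2) (2,)
```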

### Matrix Indexing

Indexing and slicing operations become even more useful when we’re manipulating matrices:

<img src='https://github.com/rahiakela/img-repo/blob/master/numpy-matrix-indexing.png?raw=1' width='800'/>

In [65]:
data = np.array([
   [1, 2],
   [3, 4],
   [5, 6]              
])

data[0, 1]

2

In [50]:
data[1:3]

array([[3, 4],
       [5, 6]])

In [51]:
data[0:2, 0]

array([1, 3])

In [54]:
data[:, 0]  # select first column

array([1, 3, 5])

In [56]:
data[:, 1] # select second column

array([2, 4, 6])
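One detail that can surprise in practice is that an integer index drops a dimension while a slice keeps it. A brief sketch, with illustrative variable names:

```python
import numpy as np

data = np.array([
    [1, 2],
    [3, 4],
    [5, 6],
])

# An integer index drops that dimension and returns a 1-D array.
first_col_flat = data[:, 0]
print(first_col_flat.shape)  # (3,)

# A slice keeps the dimension and returns a (3, 1) column.
first_col_2d = data[:, 0:1]
print(first_col_2d.shape)    # (3, 1)
```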

### Matrix Aggregation

We can aggregate matrices the same way we aggregated vectors:

<img src='https://github.com/rahiakela/img-repo/blob/master/numpy-matrix-aggregation-1.png?raw=1' width='800'/>

In [74]:
data = np.array([
   [1, 2],
   [5, 3],
   [4, 6]              
])
data.max()

6

In [75]:
data.min()

1

In [76]:
data.sum()

21

Not only can we aggregate all the values in a matrix, but we can also aggregate across the rows or columns by using the axis parameter:

<img src='https://github.com/rahiakela/img-repo/blob/master/numpy-matrix-aggregation-4.png?raw=1' width='800'/>

In [77]:
data.max(axis=0)  # max of each column (collapses the rows)

array([5, 6])

In [78]:
data.max(axis=1)  # max of each row (collapses the columns)

array([2, 5, 6])

In [79]:
data.sum(axis=0)  # sum of each column

array([10, 11])

In [80]:
data.sum(axis=1)  # sum of each row

array([ 3,  8, 10])

In [81]:
data.min(axis=0)  # min of each column

array([1, 2])

In [82]:
data.min(axis=1)  # min of each row

array([1, 3, 4])
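A helpful way to read the axis parameter is that it names the dimension that gets collapsed. The sketch below checks the result shapes and, as an assumed use case, normalizes each row with `keepdims=True`; `row_totals` is an illustrative name:

```python
import numpy as np

data = np.array([
    [1, 2],
    [5, 3],
    [4, 6],
])

# axis=0 collapses the rows: one result per column.
print(data.sum(axis=0).shape)  # (2,)

# axis=1 collapses the columns: one result per row.
print(data.sum(axis=1).shape)  # (3,)

# keepdims=True keeps the collapsed axis with length 1,
# which lets the result broadcast against the original matrix.
row_totals = data.sum(axis=1, keepdims=True)  # shape (3, 1)
print(data / row_totals)  # each row now sums to 1
```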

### Transposing and Reshaping