# Advanced NumPy

In [1]:
# Import the library
import numpy as np

# Create an array
a = np.array(range(8))
a

array([0, 1, 2, 3, 4, 5, 6, 7])

In [2]:
# What is the default data type of this array?
a.dtype

dtype('int64')

So each element is stored in memory on 64 bits (= 8 bytes). What other attributes are associated with a NumPy array?
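The element size can also be read directly from the array instead of being inferred from the dtype name. The sketch below is an illustration, not part of the original notebook: the names `a64` and `a32` are hypothetical, and the dtype is requested explicitly because the default integer type can depend on the platform and NumPy version.

```python
import numpy as np

# Request the dtype explicitly so the result does not depend on the platform default.
a64 = np.array(range(8), dtype=np.int64)
a32 = a64.astype(np.int32)  # astype returns a converted copy

# itemsize is the number of bytes per element, nbytes the total size of the data.
print(a64.dtype, a64.itemsize, a64.nbytes)
print(a32.dtype, a32.itemsize, a32.nbytes)
```

Here `nbytes` is simply `size * itemsize`, so halving the element width halves the memory used by the data.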

In [3]:
# Define a function to print useful info about the numpy array
def print_attributes(x):
    print('Array: ')
    print(x, '\n')
    print('Number of elements: ', x.size)
    print('Number of dimensions: ', x.ndim)
    print('Shape: ', x.shape)
    print('Data type:', x.dtype)
    print('Strides: ', x.strides)
    print('Flags:')
    print(x.flags)

In [4]:
print_attributes(a)

Array: 
[0 1 2 3 4 5 6 7] 

Number of elements:  8
Number of dimensions:  1
Shape:  (8,)
Data type: int64
Strides:  (8,)
Flags:
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False


So, we see that our array contains 8 elements, each of type int64, stored on 64 bits (8 bytes). The strides give the number of bytes that separate neighboring elements along each dimension. In our case, we have a one-dimensional array, and the memory addresses of neighboring elements are separated by 8 bytes (the stride).  

Among the flags, the most important ones are:  
C_CONTIGUOUS - says whether the data is in a single, C-style (row-major) contiguous segment.  
F_CONTIGUOUS - says whether the data is in a single, Fortran-style (column-major) contiguous segment.  
OWNDATA - says whether the array owns the memory it uses or borrows it from another object.
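To make the two layouts concrete, the following sketch stores the same values once in C order and once in Fortran order and compares strides and flags. The names `m_c` and `m_f` are hypothetical and used only for this illustration.

```python
import numpy as np

m_c = np.arange(6, dtype=np.int64).reshape(2, 3)  # row-major (C order)
m_f = np.asfortranarray(m_c)                      # same values, column-major copy

# In C order the last axis varies fastest; in Fortran order the first axis does.
print(m_c.strides, m_c.flags['C_CONTIGUOUS'], m_c.flags['F_CONTIGUOUS'])
print(m_f.strides, m_f.flags['C_CONTIGUOUS'], m_f.flags['F_CONTIGUOUS'])

# The logical contents are identical; only the memory layout differs.
print(np.array_equal(m_c, m_f))
```

With 8-byte elements, the row-major array should report strides `(24, 8)` and the column-major one `(8, 16)`.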

In [5]:
# Let's reshape this array so that it has 2 rows
# and 4 columns
b = a.reshape(2,4)
print_attributes(b)

Array: 
[[0 1 2 3]
 [4 5 6 7]] 

Number of elements:  8
Number of dimensions:  2
Shape:  (2, 4)
Data type: int64
Strides:  (32, 8)
Flags:
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False


The new array is still stored in memory as one contiguous block, each element occupying 8 bytes (dtype=int64). To move from one element to the next along the first axis (e.g. going from element 0 to element 4), we need to advance 4 elements in memory (0 $\to$ 1 $\to$ 2 $\to$ 3 $\to$ 4), or 4x8 = 32 bytes. On the other hand, to move to the next element along the second axis (e.g. 0 $\to$ 1), we only need to advance 8 bytes. This information is stored in the 'Strides' attribute.  

From the flags, we see that array `b` does not own its memory, so it points to the same memory location as `a`. Let's see what happens if we change `a`.
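The stride rule for a C-contiguous array can be written down in a few lines: the last axis advances by one element, and every earlier axis advances by the product of the later dimensions. The helper `c_order_strides` and the array `b_demo` are hypothetical names for this sketch; NumPy computes the strides itself.

```python
import numpy as np

def c_order_strides(shape, itemsize):
    """Return the byte strides of a C-contiguous array with the given shape."""
    strides = []
    step = itemsize              # the last axis moves one element at a time
    for dim in reversed(shape):
        strides.append(step)
        step *= dim              # earlier axes skip over whole blocks
    return tuple(reversed(strides))

b_demo = np.arange(8, dtype=np.int64).reshape(2, 4)
print(c_order_strides(b_demo.shape, b_demo.itemsize))
print(b_demo.strides)
```

Both lines should print the same tuple, since a freshly reshaped `arange` array is C-contiguous.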

In [6]:
a[0] = 10
print(a)
print(b)

[10  1  2  3  4  5  6  7]
[[10  1  2  3]
 [ 4  5  6  7]]


As expected, both `a[0]` and `b[0, 0]` refer to the same memory location, which is modified when executing `a[0] = 10`. Therefore, `b[0, 0]` now shows the changed value, 10.

As we saw, the data of `b` is stored in memory as a C-style (row-major) contiguous segment. What about the transpose of `b`?
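Whether two arrays share a buffer can also be checked programmatically rather than by modifying one and inspecting the other. A minimal sketch, using the hypothetical names `a_demo` and `b_demo` so as not to disturb the notebook's variables:

```python
import numpy as np

a_demo = np.arange(8, dtype=np.int64)
b_demo = a_demo.reshape(2, 4)   # a view on a_demo's buffer

# A view keeps a reference to the array that owns the memory in .base.
print(b_demo.base is a_demo)
print(np.shares_memory(a_demo, b_demo))

# Writing through the view is visible in the original array.
b_demo[1, 3] = -1
print(a_demo[7])
```

`np.shares_memory` is the more general check, since `.base` only points at the owning array and says nothing about two sibling views.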

In [7]:
print_attributes(b.T)

Array: 
[[10  4]
 [ 1  5]
 [ 2  6]
 [ 3  7]] 

Number of elements:  8
Number of dimensions:  2
Shape:  (4, 2)
Data type: int64
Strides:  (8, 32)
Flags:
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False


As expected, since `b` and `b.T` refer to the same memory and `b` is stored in row-major (C-style) order, `b.T` appears in column-major (Fortran-style) order: only the shape and strides are swapped, and no data is moved.
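If a C-contiguous transpose is needed (for example before handing the data to code that assumes row-major layout), the data has to be copied. The sketch below contrasts the transposed view with such a copy; `b_demo`, `t_view` and `t_copy` are hypothetical names.

```python
import numpy as np

b_demo = np.arange(8, dtype=np.int64).reshape(2, 4)

t_view = b_demo.T                         # view: strides swapped, no data copied
t_copy = np.ascontiguousarray(b_demo.T)   # new buffer laid out in C order

print(t_view.strides, np.shares_memory(b_demo, t_view))
print(t_copy.strides, np.shares_memory(b_demo, t_copy))
```

The view should keep the swapped strides `(8, 32)`, while the copy gets the C-order strides `(16, 8)` for shape `(4, 2)`.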

In [8]:
# Let's reshape the array again
c = a.reshape(2,2,2)
print_attributes(c)

Array: 
[[[10  1]
  [ 2  3]]

 [[ 4  5]
  [ 6  7]]] 

Number of elements:  8
Number of dimensions:  3
Shape:  (2, 2, 2)
Data type: int64
Strides:  (32, 16, 8)
Flags:
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False


Now, to move to the next element of `c` along axis 2 (e.g. 10 $\to$ 1) we need to advance 8 bytes in memory; to move to the next element along axis 1 (e.g. 10 $\to$ 2) we need to advance 2x8 = 16 bytes; and to move to the next element along axis 0 (e.g. 10 $\to$ 4) we need to advance 4x8 = 32 bytes. Therefore the strides attribute of the new array is (32, 16, 8).

What happens when we slice an array?
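Put differently, the byte offset of an element is the dot product of its index with the strides. The following sketch checks this for one element of a 2x2x2 array; `byte_offset` and `c_demo` are hypothetical names introduced for the illustration.

```python
import numpy as np

c_demo = np.arange(8, dtype=np.int64).reshape(2, 2, 2)

def byte_offset(index, strides):
    """Offset in bytes of the element at `index`, relative to the first element."""
    return sum(i * s for i, s in zip(index, strides))

idx = (1, 0, 1)
offset = byte_offset(idx, c_demo.strides)
flat_position = offset // c_demo.itemsize   # position in the underlying 1-D buffer

print(offset, flat_position)
print(c_demo[idx], c_demo.ravel()[flat_position])
```

For index `(1, 0, 1)` the offset is 1x32 + 0x16 + 1x8 = 40 bytes, i.e. the sixth element of the buffer.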

In [9]:
# Let's take every other element (a[0], a[2], a[4], etc.)
print_attributes(a[::2])

Array: 
[10  2  4  6] 

Number of elements:  4
Number of dimensions:  1
Shape:  (4,)
Data type: int64
Strides:  (16,)
Flags:
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False


Although this slice points to the same memory as `a`, its elements are no longer contiguous in memory, because we must skip one element of `a` to reach the next element of the slice. Therefore, the stride changed from 8 bytes to 16 bytes. We observe that the slice is neither C_CONTIGUOUS nor F_CONTIGUOUS.
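A strided slice is still a view, so writing to it changes the original array; copying it packs the selected elements into a new contiguous buffer. A short sketch with the hypothetical names `a_demo`, `s` and `packed`:

```python
import numpy as np

a_demo = np.arange(8, dtype=np.int64)
s = a_demo[::2]                 # every other element, still a view

print(s.strides, s.flags['C_CONTIGUOUS'])

s[0] = 99                       # writes through to a_demo
print(a_demo[0])

packed = s.copy()               # independent, contiguous copy of the slice
print(packed.strides, packed.flags['C_CONTIGUOUS'], np.shares_memory(a_demo, packed))
```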

What if we want to view each byte of memory separately?

In [10]:
d = np.array([0, 1, 2], dtype=np.int32)
d_bytes = d.view(dtype=np.uint8)
print(d)
print(d_bytes)

[0 1 2]
[0 0 0 0 1 0 0 0 2 0 0 0]


The number 0 (int32) is stored in memory as the following 4 bytes: [0,0,0,0].  
The number 1 (int32) is stored in memory as the following 4 bytes: [1,0,0,0].  
The number 2 (int32) is stored in memory as the following 4 bytes: [2,0,0,0].  
As we can see, the least-significant byte is stored first; this byte order is called little-endian.
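The byte order seen above is a property of the machine, not of NumPy. The sketch below makes the order explicit through the dtype strings `'<i4'` (little-endian) and `'>i4'` (big-endian), so the result does not depend on the host; the names `little` and `big` are hypothetical.

```python
import sys
import numpy as np

little = np.array([1], dtype='<i4')   # 32-bit integer, least-significant byte first
big = np.array([1], dtype='>i4')      # 32-bit integer, most-significant byte first

print(sys.byteorder)                  # native byte order of this machine
print(little.view(np.uint8))
print(big.view(np.uint8))
```

The same value 1 should appear as bytes `[1 0 0 0]` in the little-endian array and `[0 0 0 1]` in the big-endian one.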

In [11]:
# Let's modify the second byte
d_bytes[1] = 1
print(d)
print(d_bytes)

[256   1   2]
[0 1 0 0 1 0 0 0 2 0 0 0]


Since we set the second byte to 1, the ninth bit of the first element (bit index 8) became 1, so the first element of `d` is now the binary number `100000000`, or $2^8=256$.
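The same arithmetic can be verified by rebuilding the integer from its bytes: in little-endian order, byte number $k$ contributes its value times $256^k$. This sketch assumes an explicitly little-endian dtype (`'<i4'`) and uses the hypothetical names `d_demo` and `d_demo_bytes`.

```python
import numpy as np

d_demo = np.array([0, 1, 2], dtype='<i4')   # explicit little-endian int32
d_demo_bytes = d_demo.view(np.uint8)
d_demo_bytes[1] = 1                         # set the second byte of the first element

# Rebuild the first element by hand: byte k is shifted left by 8*k bits.
first = d_demo_bytes[:4]
value = sum(int(byte) << (8 * k) for k, byte in enumerate(first))

print(value, int(d_demo[0]), int.from_bytes(first.tobytes(), 'little'))
```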

To get an independent array that we can change without modifying the original array, we need to use the `copy()` method:

In [12]:
d_copy_bytes = d.copy().view(dtype=np.uint8)
print(d)
print(d_copy_bytes)

[256   1   2]
[0 1 0 0 1 0 0 0 2 0 0 0]


In [13]:
# Now if we modify d_copy_bytes, this will not affect the original array d
# as these point to different locations in the memory
d_copy_bytes[1] = 0
print(d)
print(d_copy_bytes)

[256   1   2]
[0 0 0 0 1 0 0 0 2 0 0 0]


In [14]:
# Print the identity of these 2 objects
print(id(d))
print(id(d_copy_bytes))

4590098512
4590100272
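Note that `id()` identifies the Python array object, not the underlying buffer: a view is also a distinct object with its own `id`, even though it shares memory with the original. A more direct check is `np.shares_memory`, sketched below with the hypothetical names `d_demo`, `view_bytes` and `copy_bytes`.

```python
import numpy as np

d_demo = np.array([0, 1, 2], dtype=np.int32)
view_bytes = d_demo.view(np.uint8)          # same buffer, different array object
copy_bytes = d_demo.copy().view(np.uint8)   # separate buffer

# Both are distinct Python objects, so their ids differ from d_demo's.
print(id(d_demo) == id(view_bytes), id(d_demo) == id(copy_bytes))

# Only the view overlaps with d_demo in memory.
print(np.shares_memory(d_demo, view_bytes))
print(np.shares_memory(d_demo, copy_bytes))
```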
