# NumPy vs. Python Collections

![Python Collections compared to NumPy](./../images/data_munging_01-Numpy-02.png)

### <font color='green'>__Support for Google Colab__  </font>  
    
Open this notebook in Colab using the following button:  
  
<a href="https://colab.research.google.com/github/shauryashaurya/learn-data-munging/blob/main/01-Numpy/02.01-Numpy-over-Python-Collections.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>  

  
<font color='green'>Uncomment and execute the cell below to set up and run this notebook on Google Colab.</font>

In [1]:
import time
import sys
import numpy as np

In [2]:
np.__version__

'1.26.2'

# Performance over Python Lists

In [6]:
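# NumPy is faster for two main reasons:
# 1. Contiguous storage: the elements sit side by side in one block of memory
# 2. Fixed datatypes: every element has the same known size, so the loop runs in compiled code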

# ten million
big_number = 10000000

# List
python_list = list(range(big_number))

start_time = time.time()
sum_list = sum(python_list)
list_time = time.time() - start_time

# NumPy Array
numpy_array = np.arange(big_number, dtype=np.int64)

start_time_np = time.time()
sum_array = np.sum(numpy_array)
numpy_time = time.time() - start_time_np

print(f"Python List Time: {list_time}")
print(f"NumPy Array Time: {numpy_time}")
print(f"Numpy performing {list_time/numpy_time} times faster than Python Lists")

Python List Time: 0.20299983024597168
NumPy Array Time: 0.005993366241455078
Numpy performing 33.87075344100565 times faster than Python Lists


In [7]:
# two lists of one million integers each: 0..999,999 and 1,000,000..1,999,999
lst1 = list(range(1000000))
lst2 = list(range(1000000, 2000000))

arr1 = np.array(lst1)
arr2 = np.array(lst2)

# Python List
start_time1 = time.time()
result_list = [a + b for a, b in zip(lst1, lst2)]
list_time1 = time.time() - start_time1

# NumPy Array
start_time_np1 = time.time()
result_array = arr1 + arr2
numpy_time1 = time.time() - start_time_np1

print(f"Python List Operation Time: {list_time1}")
print(f"NumPy Array Operation Time: {numpy_time1}")
print(f"Numpy performing {list_time1/numpy_time1} times faster than Python Lists")

Python List Operation Time: 0.06508183479309082
NumPy Array Operation Time: 0.0010025501251220703
Numpy performing 64.91629013079667 times faster than Python Lists
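
A single `time.time()` measurement, as used in the cells above, can be noisy. As a hedged sketch (not part of the original notebook), the same comparison can be repeated with the standard-library `timeit` module, keeping the best of several runs. The size `n` and the repeat counts are arbitrary choices for illustration.

```python
import timeit

import numpy as np

n = 1_000_000  # smaller than the ten million used above, to keep the run short
python_list = list(range(n))
numpy_array = np.arange(n, dtype=np.int64)

calls = 5  # calls per measurement (assumed value, tune as needed)

# repeat each measurement three times and keep the fastest, least disturbed run
list_time = min(timeit.repeat(lambda: sum(python_list), number=calls, repeat=3)) / calls
numpy_time = min(timeit.repeat(lambda: np.sum(numpy_array), number=calls, repeat=3)) / calls

print(f"Python list sum: {list_time:.6f} s per call")
print(f"NumPy array sum: {numpy_time:.6f} s per call")
print(f"Ratio: {list_time / numpy_time:.1f}x")
```

The exact ratio depends on the machine, the NumPy build, and the array size, so treat the printed numbers as indicative only.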


# How NumPy handles Data Types

When the elements have mixed types, NumPy promotes them to the smallest type (in both size and scalar kind) that can safely represent every element.
This type promotion can sometimes be counterintuitive.
See:
* [Type Casting Rules](https://numpy.org/doc/stable/user/basics.ufuncs.html#type-casting-rules)
* [```numpy.result_type```](https://numpy.org/doc/stable/reference/generated/numpy.result_type.html#numpy-result-type)
* [```numpy.promote_types```](https://numpy.org/doc/stable/reference/generated/numpy.promote_types.html#numpy-promote-types)
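
The promotion rules described above can be explored directly with `np.promote_types` and `np.result_type`. The following is a small sketch that uses only explicit dtypes, so the results do not depend on the platform's default integer size.

```python
import numpy as np

# promote_types returns the smallest dtype that both inputs can be safely cast to
print(np.promote_types(np.int32, np.int64))    # int64
print(np.promote_types(np.int32, np.float32))  # float64
print(np.promote_types(np.int64, np.uint64))   # float64

# result_type applies the same rules to any number of dtypes
print(np.result_type(np.int8, np.uint8))       # int16

# np.array promotes automatically when the elements have mixed types
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)                              # float64
```

The `int64` with `uint64` case is one of the counterintuitive ones: no 64-bit integer type can hold both ranges, so NumPy falls back to `float64`.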


In [8]:
# handling inconsistent data in NumPy
try:
    # no dtype given: the mixed ints and strings are all promoted to a Unicode string dtype
    inconsistent_array1 = np.array([1, "two", 3, '!'])
    print("inconsistent_array1.dtype = ", inconsistent_array1.dtype)
except Exception as e:
    print(e)

try:
    # raises ValueError because 'two' cannot be converted to the requested int32 dtype
    inconsistent_array2 = np.array([1, "two", 3, '!'], dtype=np.int32)
    print(inconsistent_array2.dtype)
except Exception as e:
    print("inconsistent_array2 exception: ", e)

try:
    # consistent data with an explicit dtype - makes things more reliable (and faster!)
    inconsistent_array3 = np.array([1, 2, 3, 4], dtype=np.int64)
    print("inconsistent_array3.dtype = ", inconsistent_array3.dtype)
except Exception as e:
    print(e)

inconsistent_array1.dtype =  <U11
inconsistent_array2 exception:  invalid literal for int() with base 10: 'two'
inconsistent_array3.dtype =  int64


In [9]:
inconsistent_array3.strides

(8,)

# NumPy uses contiguous blocks of data in memory
  
![Row-Wise Representation of Data](./../images/PyDataGlobal2023-PythonvsNumpy-rowandcol.drawio.png)

In [13]:
# NumPy array
arr = np.array([1, 2, 3, 4], dtype=np.int32)    
# 
print(f"NumPy ctypes {arr.ctypes.data}\n{arr.ctypes.strides}\n{arr.nbytes}")
print(f"NumPy ctypes.data {arr.ctypes.data}")
# np.append returns a new, larger array (a copy), so both the address and nbytes change
arr = np.append(arr, [5])
print(f"NumPy ctypes {arr.ctypes.data}\n{arr.ctypes.strides}\n{arr.nbytes}")
print(f"NumPy ctypes.data {arr.ctypes.data}")
# 
for idx in range(len(arr)):
    # arr[idx] creates a new, temporary NumPy scalar that holds a copy of the element.
    # __array_interface__['data'] is a tuple: (address of that scalar's buffer, read-only flag).
    # The same address typically repeats because each temporary is freed and its memory reused;
    # these are not the addresses of the elements inside the array's contiguous block.
    # Inside the array, element idx lives at arr.ctypes.data + idx * arr.strides[0].
    print(f"NumPy array element {idx}: memory address = {id(arr[idx])}, {arr[idx].__array_interface__['data']}")
# 
# Python list
lst = [1, 2, 3, 4]
for idx, item in enumerate(lst):
    print(f"Python list element {idx}: memory address = {id(item)}")

NumPy ctypes 2545189117264
<numpy.core._internal.c_longlong_Array_1 object at 0x00000250A8465CD0>
16
NumPy ctypes.data 2545189117264
NumPy ctypes 2545189117456
<numpy.core._internal.c_longlong_Array_1 object at 0x00000250A8465CD0>
20
NumPy ctypes.data 2545189117456
NumPy array element 0: memory address = 2545919978832, (2545188803328, False)
NumPy array element 1: memory address = 2545919978832, (2545188803328, False)
NumPy array element 2: memory address = 2545919978832, (2545188803328, False)
NumPy array element 3: memory address = 2545919978832, (2545188803328, False)
NumPy array element 4: memory address = 2545919978832, (2545188803328, False)
Python list element 0: memory address = 140736721097512
Python list element 1: memory address = 140736721097544
Python list element 2: memory address = 140736721097576
Python list element 3: memory address = 140736721097608
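
To see where the elements of a contiguous array actually live, the base address and the stride are enough. This is a minimal sketch (not from the original notebook) that computes each element's address as base plus offset; the absolute addresses differ on every run, but the offsets do not.

```python
import numpy as np

arr = np.array([1, 2, 3, 4], dtype=np.int32)

base = arr.ctypes.data  # address of the first element
step = arr.strides[0]   # bytes between consecutive elements
addresses = [base + idx * step for idx in range(arr.size)]

for idx, address in enumerate(addresses):
    print(f"element {idx}: address = {address}, offset = {address - base} bytes")

# a contiguous int32 array places its elements exactly itemsize (4) bytes apart
print(arr.flags['C_CONTIGUOUS'], arr.itemsize)
```

Because every element is the same size and the block is contiguous, NumPy can reach any element with simple pointer arithmetic instead of following a separate reference per item.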


# How strict data-types in NumPy reduce memory overhead

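We saw that Python data structures are general purpose: they come with a lot of functionality that supports duck typing and other general-purpose data analysis tasks.

A Python list stores references to separate Python objects, and each of those objects carries more than just its data (object type info, size, reference count, etc.), so lists have a significant memory overhead.  
NumPy arrays, being homogeneous, store only the raw values plus one small header for the whole array, which cuts down on this overhead.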

In [26]:
# NumPy consuming less memory
lst = list(range(big_number))
# sys.getsizeof(lst) counts only the list's own array of references,
# not the int objects those references point to
lst_size = sys.getsizeof(lst)
print(f"Size of Python list: {lst_size} bytes or {round(lst_size/1024, 2)} KiB")

# no dtype given, so NumPy picks its platform default integer type
# (4 bytes per element in the output below)
np_arr = np.array(lst)
np_arr_size = np_arr.nbytes
print(f"Size of NumPy array: {np_arr_size} bytes or {round(np_arr_size/1024, 2)} KiB")
print('\n')
print(f"Compared to Python lists, NumPy consumes approximately \
{round(((lst_size-np_arr_size)/lst_size)*100,2)}% less memory")

Size of Python list: 80000056 bytes or 78125.05 KiB
Size of NumPy array: 40000000 bytes or 39062.5 KiB


Compared to Python lists, NumPy consumes approximately 50.0% less memory
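
The comparison above measures only the list container. As a rough sketch of a fuller accounting, the sizes of the integer objects can be added in, and the array's dtype can be fixed explicitly. The size `n` is an arbitrary choice, and the element total is an estimate because CPython may share some small integer objects.

```python
import sys

import numpy as np

n = 100_000
lst = list(range(n))

list_container = sys.getsizeof(lst)                  # the list's own array of references
list_elements = sum(sys.getsizeof(x) for x in lst)   # the int objects it refers to
list_total = list_container + list_elements

arr64 = np.arange(n, dtype=np.int64)  # 8 bytes per element
arr32 = arr64.astype(np.int32)        # 4 bytes per element

print(f"list container only : {list_container} bytes")
print(f"list plus elements  : {list_total} bytes")
print(f"NumPy int64 array   : {arr64.nbytes} bytes")
print(f"NumPy int32 array   : {arr32.nbytes} bytes")
```

Choosing the smallest dtype that fits the data is what lets NumPy shrink memory further, which is not an option for a list of Python `int` objects.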


# Row Oriented vs Columnar

<font color='red'><em>This bit gets real confusing, real fast, so please ignore it in the first pass.</em></font>

In [9]:
# row order, column order

arr_c = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], order='C')
arr_f = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], order='F')

print("Row-major order:")
print(arr_c.ravel(order='C'))
print(arr_c)
# 
print("\nColumn-major order:")
print(arr_f.ravel(order='F'))
print(arr_f)
# 
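# ravel(order='F') reads the elements column by column into a 1-D copy;
# reshape(..., order='F') then fills the new shape column by column,
# so the logical contents are unchanged but are now stored in column-major layout
arr_c_to_f = arr_c.ravel(order='F').reshape(arr_c.shape, order = 'F')
# likewise, reading and refilling row by row with order='C' gives a row-major copy of arr_f
arr_f_to_c = arr_f.ravel(order='C').reshape(arr_f.shape, order = 'C')
# note: 'C' and 'F' describe how elements are laid out in memory,
# not how the array prints, which is why every print(arr) below looks the same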
print("\nRow-major to Column-major:")
print(arr_c_to_f.ravel(order='F'))
print(arr_c_to_f)
print("\nColumn-major to Row-major:")
print(arr_f_to_c.ravel(order='C'))
print(arr_f_to_c)

Row-major order:
[1 2 3 4 5 6 7 8 9]
[[1 2 3]
 [4 5 6]
 [7 8 9]]

Column-major order:
[1 4 7 2 5 8 3 6 9]
[[1 2 3]
 [4 5 6]
 [7 8 9]]

Row-major to Column-major:
[1 4 7 2 5 8 3 6 9]
[[1 2 3]
 [4 5 6]
 [7 8 9]]

Column-major to Row-major:
[1 2 3 4 5 6 7 8 9]
[[1 2 3]
 [4 5 6]
 [7 8 9]]
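
Since the printed arrays look identical, the difference between the two orders has to be read from the array metadata. A small sketch, assuming an explicit `int64` dtype so the stride values are the same on every platform:

```python
import numpy as np

data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
arr_c = np.array(data, dtype=np.int64, order='C')
arr_f = np.array(data, dtype=np.int64, order='F')

# the logical contents are the same
print(np.array_equal(arr_c, arr_f))                 # True

# strides = bytes to step to the next row, bytes to step to the next column
print(arr_c.flags['C_CONTIGUOUS'], arr_c.strides)   # True (24, 8)
print(arr_f.flags['F_CONTIGUOUS'], arr_f.strides)   # True (8, 24)
```

In row-major order the next column is 8 bytes away and the next row 24 bytes away; column-major order swaps the two.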


In [10]:
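def print_memory(arr):
    # caveat: arr[i, j] is a temporary scalar copy, and the printed address is that of
    # a temporary memoryview object, so every line shows the same value;
    # it does not reveal where the element sits inside the array
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            print(arr[i, j], arr[i, j].data)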

print("Memory locations for C-order:")
print_memory(arr_c)
print("\nMemory locations for F-order:")
print_memory(arr_f)

Memory locations for C-order:
1 <memory at 0x000001BDA3EAA200>
2 <memory at 0x000001BDA3EAA200>
3 <memory at 0x000001BDA3EAA200>
4 <memory at 0x000001BDA3EAA200>
5 <memory at 0x000001BDA3EAA200>
6 <memory at 0x000001BDA3EAA200>
7 <memory at 0x000001BDA3EAA200>
8 <memory at 0x000001BDA3EAA200>
9 <memory at 0x000001BDA3EAA200>

Memory locations for F-order:
1 <memory at 0x000001BDA3EAA200>
2 <memory at 0x000001BDA3EAA200>
3 <memory at 0x000001BDA3EAA200>
4 <memory at 0x000001BDA3EAA200>
5 <memory at 0x000001BDA3EAA200>
6 <memory at 0x000001BDA3EAA200>
7 <memory at 0x000001BDA3EAA200>
8 <memory at 0x000001BDA3EAA200>
9 <memory at 0x000001BDA3EAA200>
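
The output above is identical for both arrays because `arr[i, j].data` refers to a temporary object rather than to the array's own buffer. One way to recover the real layout, sketched here with a hypothetical helper named `memory_order`, is to compute each element's byte offset from the strides and sort by it.

```python
import numpy as np

def memory_order(arr):
    """Return the elements of a 2-D array sorted by their byte offset in memory."""
    offsets = []
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            # offset of element (i, j) from the start of the array's buffer
            offset = i * arr.strides[0] + j * arr.strides[1]
            offsets.append((offset, int(arr[i, j])))
    return [value for _, value in sorted(offsets)]

data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
arr_c = np.array(data, dtype=np.int64, order='C')
arr_f = np.array(data, dtype=np.int64, order='F')

print("C-order in memory:", memory_order(arr_c))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print("F-order in memory:", memory_order(arr_f))  # [1, 4, 7, 2, 5, 8, 3, 6, 9]
```

The offsets are relative to the start of each array's buffer, so the result is the same on every run even though absolute addresses change.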
