<a href="https://colab.research.google.com/github/tengleemail-png/6m-data-1.6-intro-numpy/blob/main/numpy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lesson 1.6: Introduction to NumPy

## Introduction
**NumPy**, short for Numerical Python, is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

Many computational and data science packages use NumPy as the main building block. It is a fundamental library for scientific computing in Python.

### Key Features of NumPy:
* **ndarray**: An efficient multidimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities.
* **Vectorization**: Mathematical functions for fast operations on entire arrays of data without having to write loops.
* **Linear Algebra and More**: Tools for linear algebra and matrix manipulation, random number generation, and Fourier transforms.
* **C API**: For connecting NumPy with libraries written in C, C++, or FORTRAN.

### Advantages over Python Lists:
1. **Contiguous Memory**: NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. This allows for significantly faster access and manipulation.
2. **Vectorized Operations**: NumPy algorithms written in C can operate on this memory without type checking or other Python overhead, performing complex computations without slow `for` loops.

![numpy_vs_list](https://github.com/tengleemail-png/6m-data-1.6-intro-numpy/blob/main/assets/numpy_vs_python_list.png?raw=1)

## Part 1: Performance Benchmark
To give you an idea of the performance difference, consider a NumPy array of one million integers and an equivalent Python list. We use the `%timeit` magic command to measure execution time.

In [2]:
import numpy as np
my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))

print("NumPy Vectorized Multiplication (my_arr * 2):")
%timeit my_arr2 = my_arr * 2

print("\nPython List Comprehension ([x * 2 for x in my_list]):")
%timeit my_list2 = [x * 2 for x in my_list]

NumPy Vectorized Multiplication (my_arr * 2):
2.6 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Python List Comprehension ([x * 2 for x in my_list]):
57 ms ± 881 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
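
The `%timeit` magic only works inside IPython or Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard-library `timeit` module can be used instead. The array size `n` and the repeat count below are assumptions chosen so the example runs quickly; absolute timings will vary by machine.

```python
import timeit

import numpy as np

# Smaller than the lesson's one million elements so the sketch runs quickly.
n = 100_000
repeats = 20
my_arr = np.arange(n)
my_list = list(range(n))

# timeit.timeit returns the total seconds taken by `number` calls.
numpy_seconds = timeit.timeit(lambda: my_arr * 2, number=repeats)
list_seconds = timeit.timeit(lambda: [x * 2 for x in my_list], number=repeats)

print(f"NumPy vectorized:   {numpy_seconds / repeats * 1e3:.3f} ms per call")
print(f"List comprehension: {list_seconds / repeats * 1e3:.3f} ms per call")

# Both approaches compute the same values.
same_result = (my_arr * 2).tolist() == [x * 2 for x in my_list]
print("Same result:", same_result)
```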


## Part 2: The ndarray (N-dimensional array)
The `ndarray` is a fast, flexible container for large datasets. It is a multidimensional array of fixed size with **homogeneous** elements (all elements must be of the same type).

Every array has:
* **shape**: A tuple indicating the size of each dimension.
* **dtype**: An object describing the data type of the array.
* **ndim**: The number of dimensions (axes).

### ndarray illustration
![ndarray](https://github.com/tengleemail-png/6m-data-1.6-intro-numpy/blob/main/assets/numpy_ndarray.png?raw=1)

In [18]:
# [DEMO] Creating arrays from sequences
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)

data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)

print(f"Array 2:\n{arr2}")
print(f"Shape: {arr2.shape}, Dtype: {arr2.dtype}, Dimensions: {arr2.ndim}")

print("\narr2[1, :]:")
row_slice_explicit = arr2[1, :]
print(row_slice_explicit)
print("Shape of arr2[1, :]:", row_slice_explicit.shape)
print("Dimensions of arr2[1, :]:", row_slice_explicit.ndim)

Array 2:
[[1 2 3 4]
 [5 6 7 8]]
Shape: (2, 4), Dtype: int64, Dimensions: 2

arr2[1, :]:
[5 6 7 8]
Shape of arr2[1, :]: (4,)
Dimensions of arr2[1, :]: 1
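
The demo above builds `arr1` from a list that mixes integers and a float but never inspects it. The short sketch below shows how the homogeneous-element rule plays out: NumPy picks one dtype that can hold every value. The name `ints_only` is introduced here purely for illustration.

```python
import numpy as np

# A list mixing ints and one float: every element is stored as a float.
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
print(arr1)
print("arr1 dtype:", arr1.dtype)  # float64

# A list of ints only: NumPy infers an integer dtype.
# The exact width (int32 or int64) can depend on platform and NumPy version.
ints_only = np.array([6, 7, 8])
print("ints_only dtype:", ints_only.dtype)

print("arr1 shape:", arr1.shape, "ndim:", arr1.ndim)
```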


### Data Types and Casting
NumPy supports specific numerical types like `int32`, `float64`, etc. You can explicitly convert an array from one `dtype` to another using the `astype` method.

**Note:** If you cast floating-point numbers to an integer `dtype`, the fractional part is truncated toward zero (so `-1.2` becomes `-1`), not rounded.

In [4]:
# [DEMO] Casting arrays
arr = np.array([3.7, -1.2, 0.5, 12.9])
print("Original:", arr)
print("Casted to int32:", arr.astype(np.int32))

Original: [ 3.7 -1.2  0.5 12.9]
Casted to int32: [ 3 -1  0 12]
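
If rounding is wanted instead of truncation, one option is to round first and cast afterwards. The sketch below uses `np.rint`, which rounds to the nearest integer and sends exact halves to the nearest even value. It also shows that `astype` returns a new array and leaves the original untouched.

```python
import numpy as np

arr = np.array([3.7, -1.2, 0.5, 12.9])

# astype alone drops the fractional part (truncation toward zero).
truncated = arr.astype(np.int32)

# Rounding first, then casting, gives the nearest integers instead.
rounded = np.rint(arr).astype(np.int32)

print("Truncated:", truncated)  # [ 3 -1  0 12]
print("Rounded:  ", rounded)    # [ 4 -1  0 13]

# The original array keeps its dtype: astype made a copy.
print("Original dtype:", arr.dtype)
```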


### [EXERCISE 1: Creation & Casting]
1. Create a 3x4 array of all ones using `np.ones()`.
2. Cast this array to `float32`.
3. Create an array of strings representing numbers: `['1.25', '-9.6', '42']`. Cast it to `float`.

In [5]:
# Your code here
arr = np.ones((3, 4))
print(arr)

arr = arr.astype(np.float32)
print(arr)

# Step 3: an array of numeric strings, converted to floats
num_strings = np.array(['1.25', '-9.6', '42'])
num_floats = num_strings.astype(float)
print(num_floats)

[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]
[ 1.25 -9.6  42.  ]


In [6]:
import numpy as np

print("Integer types:")
print(f"  np.int8: {np.iinfo(np.int8).min} to {np.iinfo(np.int8).max} (8-bit signed integer)")
print(f"  np.int16: {np.iinfo(np.int16).min} to {np.iinfo(np.int16).max} (16-bit signed integer)")
print(f"  np.int32: {np.iinfo(np.int32).min} to {np.iinfo(np.int32).max} (32-bit signed integer)")
print(f"  np.int64: {np.iinfo(np.int64).min} to {np.iinfo(np.int64).max} (64-bit signed integer)")
print(f"  np.uint8: {np.iinfo(np.uint8).min} to {np.iinfo(np.uint8).max} (8-bit unsigned integer)")

print("\nFloating-point types:")
print(f"  np.float16 (half precision)")
print(f"  np.float32 (single precision)")
print(f"  np.float64 (double precision, often default for Python float)")

print("\nBoolean type:")
print(f"  np.bool_")

print("\nString/Object types:")
print(f"  np.str_ (fixed-size string)")
print(f"  np.object_ (Python object type)")


Integer types:
  np.int8: -128 to 127 (8-bit signed integer)
  np.int16: -32768 to 32767 (16-bit signed integer)
  np.int32: -2147483648 to 2147483647 (32-bit signed integer)
  np.int64: -9223372036854775808 to 9223372036854775807 (64-bit signed integer)
  np.uint8: 0 to 255 (8-bit unsigned integer)

Floating-point types:
  np.float16 (half precision)
  np.float32 (single precision)
  np.float64 (double precision, often default for Python float)

Boolean type:
  np.bool_

String/Object types:
  np.str_ (fixed-size string)
  np.object_ (Python object type)


### Explanation of NumPy Data Types (dtypes)

NumPy's data types are crucial for efficient memory usage and computation. Here's a breakdown of the categories:

*   **Integer Types (`int`, `uint`):**
    *   `int` refers to signed integers, meaning they can represent both positive and negative values. The number (e.g., `8`, `16`, `32`, `64`) indicates the number of bits used to store the integer, which determines its range.
    *   `uint` refers to unsigned integers, meaning they can only represent non-negative values. This allows them to store larger positive numbers for the same bit-width compared to their signed counterparts.

*   **Floating-Point Types (`float`):**
    *   These types represent real numbers (numbers with decimal points).
    *   `float16` (half precision), `float32` (single precision), and `float64` (double precision) offer different levels of precision and range. `float64` is NumPy's default floating-point dtype and has the same precision as Python's built-in `float`.

*   **Boolean Type (`bool_`):**
    *   Used to store `True` or `False` values. It's similar to Python's `bool` but is optimized for NumPy arrays.

*   **String/Object Types (`str_`, `object_`):**
    *   `str_`: For fixed-size strings. Not as flexible as Python strings, but can be memory-efficient for uniform string data.
    *   `object_`: Can hold arbitrary Python objects. This is generally used when elements in an array have mixed types or are complex Python objects, but it loses many of NumPy's performance benefits because it behaves more like a Python list of references.
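
To make the memory side of this concrete, the sketch below compares how many bytes the same number of elements occupies under different dtypes, using the `itemsize` and `nbytes` attributes. The element count of 1000 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

n_elements = 1000
sizes = {}

for dtype in (np.int8, np.int32, np.float64):
    arr = np.zeros(n_elements, dtype=dtype)
    # itemsize is bytes per element; nbytes is itemsize * number of elements.
    sizes[np.dtype(dtype).name] = (arr.itemsize, arr.nbytes)
    print(f"{np.dtype(dtype).name:>8}: {arr.itemsize} bytes/element, {arr.nbytes} bytes total")

# Same bit width, different ranges: unsigned trades negatives for a larger maximum.
print("int8 max: ", np.iinfo(np.int8).max)   # 127
print("uint8 max:", np.iinfo(np.uint8).max)  # 255

# Mixed Python objects need the object dtype, which stores references.
mixed = np.array([1, "two", 3.0], dtype=object)
print("mixed dtype:", mixed.dtype)
```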

## Part 3: Arithmetic & Broadcasting
Arithmetic operations on arrays are applied element-wise as batch operations, without writing `for` loops. **Broadcasting** describes how arithmetic works between arrays of different shapes.

![vectorization](https://github.com/tengleemail-png/6m-data-1.6-intro-numpy/blob/main/assets/vectorization.png?raw=1)

Example: A scalar value being replicated (broadcast) to match the shape of a larger array.

In [7]:
# [DEMO] Arithmetic & Broadcasting
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
print("Element-wise multiplication (arr * arr):\n", arr * arr)
print("\nBroadcasting scalar (1 / arr):\n", 1 / arr)

Element-wise multiplication (arr * arr):
 [[ 1.  4.  9.]
 [16. 25. 36.]]

Broadcasting scalar (1 / arr):
 [[1.         0.5        0.33333333]
 [0.25       0.2        0.16666667]]
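
Broadcasting is not limited to scalars. As a small sketch, a 1D array whose length matches the last dimension is stretched across every row, and a column of shape `(2, 1)` is stretched across every column. The names `row`, `col`, and `centered` are illustrative only.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])  # shape (2, 3)

# Shape (3,) is broadcast across both rows of arr.
row = np.array([10., 20., 30.])
added_row = arr + row
print("arr + row:\n", added_row)

# Shape (2, 1) is broadcast across all three columns of arr.
col = np.array([[100.], [200.]])
added_col = arr + col
print("\narr + col:\n", added_col)

# A common use: subtract each column's mean from that column.
centered = arr - arr.mean(axis=0)
print("\nColumn-centered:\n", centered)
```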


## Part 4: Indexing and Slicing
One-dimensional arrays act similarly to Python lists. In 2D arrays, indexing can be done with `[row, column]` syntax.

### 2D Array Indexing Syntax
![2d_array_indexing](https://github.com/tengleemail-png/6m-data-1.6-intro-numpy/blob/main/assets/ndarray_axis_index.png?raw=1)

**Important:** Array slices are **views** on the original array. This means data is not copied, and modifications to the slice will be reflected in the source array.

In [8]:
# [DEMO] Slicing views
arr = np.arange(10)  # the integers 0 through 9
arr_slice = arr[5:8]  # start inclusive, stop exclusive: 8 - 5 = 3 elements
print(arr_slice)
arr_slice[1] = 12345  # writes through to arr[6], because the slice is a view
print("Original array modified via slice:", arr)

# [DEMO] 2D Slicing
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr2d)
# Rows 0 and 1, columns 1 through the last column
print("\nFirst two rows, columns 1 onwards:\n", arr2d[:2, 1:])

[5 6 7]
Original array modified via slice: [    0     1     2     3     4     5 12345     7     8     9]
[[1 2 3]
 [4 5 6]
 [7 8 9]]

First two rows, columns 1 onwards:
 [[2 3]
 [5 6]]
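
When an independent copy is needed rather than a view, the slice can be copied explicitly with `.copy()`. The sketch below contrasts the two; `np.shares_memory` is used only to confirm which arrays share the same underlying buffer.

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]           # a view: shares memory with arr
copied = arr[5:8].copy()  # an independent copy

copied[0] = -1            # does not touch arr
view[0] = 99              # writes through to arr[5]

print("arr:   ", arr)
print("view:  ", view)
print("copied:", copied)
print("view shares memory with arr:  ", np.shares_memory(arr, view))
print("copied shares memory with arr:", np.shares_memory(arr, copied))
```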


### [EXERCISE 2: The Logic of Slicing]
1. Select the first column of `arr2d` using a slice.
2. Set all values in the second row to 0.
3. **Socratic Prompt:** How does `arr2d[1]` differ from `arr2d[1, :]`? (Hint: check shapes)

In [9]:
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print("Original arr2d:\n", arr2d)
print("Original arr2d shape:", arr2d.shape)
print("Original arr2d dimensions:", arr2d.ndim)

print("\narr2d[1]:")
row_single_index = arr2d[1]
print(row_single_index)
print("Shape of arr2d[1]:", row_single_index.shape)
print("Dimensions of arr2d[1]:", row_single_index.ndim)

print("\narr2d[1, :]:")
row_slice_explicit = arr2d[1, :]
print(row_slice_explicit)
print("Shape of arr2d[1, :]:", row_slice_explicit.shape)
print("Dimensions of arr2d[1, :]:", row_slice_explicit.ndim)

Original arr2d:
 [[1 2 3]
 [4 5 6]
 [7 8 9]]
Original arr2d shape: (3, 3)
Original arr2d dimensions: 2

arr2d[1]:
[4 5 6]
Shape of arr2d[1]: (3,)
Dimensions of arr2d[1]: 1

arr2d[1, :]:
[4 5 6]
Shape of arr2d[1, :]: (3,)
Dimensions of arr2d[1, :]: 1


In [10]:
# Your code here
print("first col of arr2d using slice:", arr2d[:, 0])

arr2d[1, :] = 0  # assignment is a statement, so do it before calling print
print("arr2d with second row set to zero:\n", arr2d)

# Socratic prompt: both forms select the second row as a 1D array
if arr2d[1].shape == arr2d[1, :].shape:
    print("Both are same")

first col of arr2d using slice: [1 4 7]
arr2d with second row set to zero:
 [[1 2 3]
 [0 0 0]
 [7 8 9]]
Both are same
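
As the shapes show, `arr2d[1]` and `arr2d[1, :]` are equivalent. A difference does appear when an integer index is replaced with a length-one slice: the integer drops that dimension, while the slice keeps it. A brief sketch:

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Integer index: the row dimension is dropped, giving a 1D array.
print("arr2d[1].shape:    ", arr2d[1].shape)      # (3,)
print("arr2d[1, :].shape: ", arr2d[1, :].shape)   # (3,)

# Length-one slice: the row dimension is kept, giving a 2D array.
print("arr2d[1:2].shape:  ", arr2d[1:2].shape)    # (1, 3)

# The same rule applies to columns.
print("arr2d[:, 0].shape: ", arr2d[:, 0].shape)   # (3,)
print("arr2d[:, :1].shape:", arr2d[:, :1].shape)  # (3, 1)
```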


## Part 5: Boolean Indexing
Like arithmetic operations, comparisons (such as `==`) with arrays are vectorized. This yields a boolean array which can be used to filter data.

In [11]:
# [DEMO] Filtering scores
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
scores = np.array([[75, 80], [85, 90], [95, 100], [100, 77], [85, 92], [95, 80], [72, 80]])

bob_mask = (names == 'Bob')
print("Mask:", bob_mask)
print("Bob's scores:\n", scores[bob_mask])

Mask: [ True False False  True False False False]
Bob's scores:
 [[ 75  80]
 [100  77]]
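
A boolean mask can also be combined with a column index, or with a second mask using `&` (element-wise AND). The sketch below reuses the demo data. Unlike a slice, boolean indexing returns a copy, so changing the selection does not change `scores`. The names `high_first` and `selected` are illustrative only.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
scores = np.array([[75, 80], [85, 90], [95, 100], [100, 77], [85, 92], [95, 80], [72, 80]])

bob_mask = names == 'Bob'

# Mask the rows and pick the second column in one step.
bob_second = scores[bob_mask, 1]
print("Bob's second scores:", bob_second)  # [80 77]

# Combine two conditions; the parentheses matter because & binds tightly.
high_first = scores[:, 0] >= 95
bob_and_high = scores[bob_mask & high_first]
print("Bob's rows with a first score >= 95:\n", bob_and_high)

# Boolean indexing copies the data, so this leaves scores untouched.
selected = scores[bob_mask]
selected[:] = 0
print("scores[0] is still:", scores[0])
```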


### [EXERCISE 3: Complex Filtering]
1. Select all scores where the name is NOT 'Bob'.
2. Select scores for 'Bob' or 'Will' using the `|` operator.
3. Find all scores less than 80 and set them to 0.

In [12]:
# Your code here
not_bob_mask = (names != 'Bob')
print("Mask", not_bob_mask)
print("Not Bob's score", scores[not_bob_mask])

bob_or_will_mask = (names == 'Bob') | (names == 'Will')
print("Mask", bob_or_will_mask)
print("Bob or Will's score", scores[bob_or_will_mask])

less_than_80_mask = scores < 80
scores[less_than_80_mask] = 0


Mask [False  True  True False  True  True  True]
Not Bob's score [[ 85  90]
 [ 95 100]
 [ 85  92]
 [ 95  80]
 [ 72  80]]
Mask [ True False  True  True  True False False]
Bob or Will's score [[ 75  80]
 [ 95 100]
 [100  77]
 [ 85  92]]


## Part 6: Universal Functions (ufuncs) and Methods
A **ufunc** is a function that performs element-wise operations on data in ndarrays.

* **Unary ufuncs**: Take one array (e.g., `sqrt`, `exp`).
* **Binary ufuncs**: Take two arrays (e.g., `add`, `maximum`).
* **Statistical Methods**: `mean`, `sum`, `std` can be computed over the entire array or along an axis.

In [27]:
# [DEMO] Statistical Methods
arr = np.array([[1, 2, 3], [0, 0, 0], [0, 0, 0]])
print("Array:\n", arr)
print("\nMean down rows (axis=0):", arr.mean(axis=0))
print("Sum across columns (axis=1):", arr.sum(axis=1))

Array:
 [[1 2 3]
 [0 0 0]
 [0 0 0]]

Mean down rows (axis=0): [0.33333333 0.66666667 1.        ]
Sum across columns (axis=1): [6 0 0]
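
The demo above covers the statistical methods. For the unary and binary ufuncs mentioned in the list, a minimal sketch looks like this (the input values are arbitrary):

```python
import numpy as np

# Unary ufunc: one input array, applied element by element.
squares = np.array([1., 4., 9., 16.])
roots = np.sqrt(squares)
print("sqrt:", roots)  # [1. 2. 3. 4.]

# Binary ufuncs: two input arrays, combined element by element.
a = np.array([1, 5, 3])
b = np.array([4, 2, 6])
total = np.add(a, b)         # same as a + b
largest = np.maximum(a, b)   # element-wise maximum of the pair
print("add:    ", total)     # [5 7 9]
print("maximum:", largest)   # [4 5 6]

# Statistical methods without an axis reduce over the whole array.
matrix = np.array([[1, 2, 3], [0, 0, 0], [0, 0, 0]])
print("Overall sum: ", matrix.sum())   # 6
print("Overall mean:", matrix.mean())
```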


## Part 7: Linear Algebra
Linear algebra operations, like matrix multiplication, are crucial for many data science algorithms. Multiplying two arrays with `*` is an element-wise product; for matrix multiplication, use `.dot()` or the `@` operator.

![matrix_multiplication](https://github.com/tengleemail-png/6m-data-1.6-intro-numpy/blob/main/assets/matrix_multiplication.png?raw=1)

In [14]:
# [DEMO] Matrix Multiplication
x = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
y = np.array([[6, 23], [-1, 7], [8, 9]])  # shape (3, 2)
print(x)
print()  # blank line between the two matrices

print(y)
print("Matrix product (x @ y):\n", x @ y)

[[1 2 3]
 [4 5 6]]

[[ 6 23]
 [-1  7]
 [ 8  9]]
Matrix product (x @ y):
 [[ 28  64]
 [ 67 181]]
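
To see the difference between `*` and `@` directly, the sketch below applies both to the same pair of square matrices, then checks the shape rule for the `x` and `y` arrays from the demo: an `(m, n)` array times an `(n, p)` array gives an `(m, p)` result. The matrices `a` and `b` are made up for illustration.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

elementwise = a * b  # multiplies matching positions
matmul = a @ b       # rows of a dotted with columns of b
print("a * b:\n", elementwise)
print("a @ b:\n", matmul)

# Shape rule with the demo arrays: inner dimensions must match.
x = np.array([[1, 2, 3], [4, 5, 6]])        # shape (2, 3)
y = np.array([[6, 23], [-1, 7], [8, 9]])    # shape (3, 2)
print("(x @ y).shape:", (x @ y).shape)      # (2, 2)
print("(y @ x).shape:", (y @ x).shape)      # (3, 3)

# .dot() and @ agree for 2D arrays.
print("Same as x.dot(y):", np.array_equal(x @ y, x.dot(y)))
```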


### [EXERCISE 4: Reshaping & Statistics]
1. Create an array of 15 integers using `arange(15)` and reshape it to `(3, 5)`.
2. Calculate the average value of each row.
3. Use `np.unique()` to find distinct elements in an array of your choice.
4. Transpose the reshaped array using `.T` and check the new shape.

In [37]:
# Your code here
arr = np.arange(15).reshape(3, 5)
print(arr)
row_means = arr.mean(axis=1)  # axis=1 averages across the columns of each row
print(row_means)

arr2d = np.random.randint(0, 10, size=(3, 5))
print('\n', arr2d)
unique_num = np.unique(arr2d)
print("\n Unique numbers are: \n", unique_num)

transpose_arr2d = arr2d.T
print("\nTranspose arr:\n", transpose_arr2d)
print("\ntranspose_shape", transpose_arr2d.shape)

[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]
[ 2.  7. 12.]

 [[0 6 6 9 6]
 [6 7 4 2 2]
 [5 0 2 4 2]]

 Unique numbers are: 
 [0 2 4 5 6 7 9]

Transpose arr:
 [[0 6 5]
 [6 7 0]
 [6 4 2]
 [9 2 4]
 [6 2 2]]

transpose_shape (5, 3)
