# Introduction to NumPy for Machine Learning

<div style="background-color:#1e293b;padding:15px;border-left:6px solid #38bdf8;color:#e2e8f0">
Welcome to the NumPy tutorial! NumPy is the fundamental package for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays. In machine learning, NumPy is essential for handling numerical data efficiently.
</div>

## What is NumPy?

NumPy (Numerical Python) is a library that provides:

- **N-dimensional arrays**: Efficient data structures for numerical data
- **Mathematical functions**: Operations on arrays (addition, multiplication, etc.)
- **Broadcasting**: Automatic handling of operations between arrays of different shapes
- **Integration**: Works seamlessly with other libraries like pandas, matplotlib, and scikit-learn

## Why NumPy for ML?

- **Performance**: NumPy operations are implemented in C, making them much faster than pure Python
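- **Memory efficiency**: An array stores values of one fixed type in a single contiguous buffer, so it typically uses far less memory than a Python list of the same numbers
- **Foundation**: Many ML libraries either build directly on NumPy arrays (pandas, scikit-learn) or interoperate closely with them (TensorFlow, PyTorch)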
- **Convenience**: Vectorized operations eliminate the need for explicit loops

Let's get started with the basics!

In [1]:
# Import NumPy
import numpy as np

# Check version
print(f"NumPy version: {np.__version__}")

# Basic array creation
arr = np.array([1, 2, 3, 4, 5])
print(f"Basic array: {arr}")
print(f"Array type: {type(arr)}")
print(f"Array shape: {arr.shape}")
print(f"Array dtype: {arr.dtype}")

NumPy version: 2.4.1
Basic array: [1 2 3 4 5]
Array type: <class 'numpy.ndarray'>
Array shape: (5,)
Array dtype: int64
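
The memory-efficiency claim from the list above can be checked roughly. This is a minimal sketch, assuming CPython, where `sys.getsizeof` reports the size of the list container and of each integer object separately; the variable names are illustrative only.

```python
import sys
import numpy as np

n = 1_000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# A list stores pointers to separate int objects, so count both parts.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
# nbytes is the raw data buffer only: n elements * 8 bytes each.
array_bytes = np_array.nbytes

print(f"Python list (approx.): {list_bytes} bytes")
print(f"NumPy array data:      {array_bytes} bytes")
```

The exact list figure depends on the Python version, but the array buffer is always `n * itemsize` bytes.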


## Arrays vs Matrices

In NumPy, we primarily work with **arrays** (`ndarray` objects), which can have any number of dimensions. In this tutorial, a **matrix** simply means a 2D array; NumPy's separate `np.matrix` class is no longer recommended.

- **1D Array (Vector)**: `[1, 2, 3]` - shape (3,)
- **2D Array (Matrix)**: `[[1, 2], [3, 4]]` - shape (2, 2)
- **3D Array (Tensor)**: `[[[1, 2], [3, 4]], [[5, 6], [7, 8]]]` - shape (2, 2, 2)

In [2]:
# Arrays vs Matrices examples

# 1D Array (Vector)
vector = np.array([1, 2, 3, 4])
print(f"1D Array (Vector): {vector}")
print(f"Shape: {vector.shape}")
print(f"Dimensions: {vector.ndim}")

# 2D Array (Matrix)
matrix_2d = np.array([[1, 2, 3], [4, 5, 6]])
print(f"\n2D Array (Matrix):\n{matrix_2d}")
print(f"Shape: {matrix_2d.shape}")
print(f"Dimensions: {matrix_2d.ndim}")

# 3D Array (Tensor)
tensor = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(f"\n3D Array (Tensor):\n{tensor}")
print(f"Shape: {tensor.shape}")
print(f"Dimensions: {tensor.ndim}")

1D Array (Vector): [1 2 3 4]
Shape: (4,)
Dimensions: 1

2D Array (Matrix):
[[1 2 3]
 [4 5 6]]
Shape: (2, 3)
Dimensions: 2

3D Array (Tensor):
[[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]
Shape: (2, 2, 2)
Dimensions: 3


## Creating Arrays

NumPy arrays can be created in several ways:

- From Python lists: `np.array([1, 2, 3])`
- With zeros: `np.zeros((3, 3))`
- With ones: `np.ones((2, 4))`
- With a range: `np.arange(10)`
- Evenly spaced values: `np.linspace(0, 1, 5)`
- Identity matrix: `np.eye(3)`
- Random values: `np.random.rand(3, 3)`

Let's explore these:

In [3]:
# Different ways to create arrays

# From list
list_array = np.array([1, 2, 3, 4])
print(f"From list: {list_array}")

# Zeros
zeros_array = np.zeros((2, 3))
print(f"Zeros array:\n{zeros_array}")

# Ones
ones_array = np.ones((3, 2))
print(f"Ones array:\n{ones_array}")

# Range
range_array = np.arange(10)
print(f"Range array: {range_array}")

# Linspace - evenly spaced values (useful for plotting, generating test data)
linspace_array = np.linspace(0, 1, 5)  # 5 values from 0 to 1
print(f"Linspace array: {linspace_array}")

# Identity matrix (useful for linear algebra)
identity = np.eye(3)
print(f"Identity matrix:\n{identity}")

# Random
random_array = np.random.rand(2, 2)
print(f"Random array:\n{random_array}")

# 2D array
sample_matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(f"2D array:\n{sample_matrix}")
print(f"Shape: {sample_matrix.shape}")

From list: [1 2 3 4]
Zeros array:
[[0. 0. 0.]
 [0. 0. 0.]]
Ones array:
[[1. 1.]
 [1. 1.]
 [1. 1.]]
Range array: [0 1 2 3 4 5 6 7 8 9]
Linspace array: [0.   0.25 0.5  0.75 1.  ]
Identity matrix:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
Random array:
[[0.24835525 0.05312657]
 [0.280296   0.31112728]]
2D array:
[[1 2 3]
 [4 5 6]]
Shape: (2, 3)
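
The random values above change on every run. When reproducible numbers are needed, for example to make an experiment repeatable, one option is NumPy's `Generator` API with a fixed seed. This is a small sketch; the seed value `42` and the variable names are arbitrary choices, not part of the tutorial's code.

```python
import numpy as np

# A seeded generator: the same seed yields the same sequence of numbers.
rng = np.random.default_rng(seed=42)

uniform = rng.random((2, 2))                            # uniform on [0, 1)
normal = rng.normal(loc=0.0, scale=1.0, size=(2, 2))    # standard normal draws
integers = rng.integers(low=0, high=10, size=5)         # integers in [0, 10)

# Re-creating the generator with the same seed reproduces the first draw.
repeat = np.random.default_rng(seed=42).random((2, 2))
print(np.array_equal(uniform, repeat))  # True
```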


## Basic Operations

NumPy supports element-wise operations, which are much faster than loops in pure Python.

### Arithmetic Operations
- Addition: `arr1 + arr2`
- Subtraction: `arr1 - arr2`
- Multiplication: `arr1 * arr2`
- Division: `arr1 / arr2`

### Your Turn!
Try creating two arrays and performing some operations.

In [8]:
# Basic operations
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(f"a: {a}")
print(f"b: {b}")
print(f"a + b: {a + b}")
print(f"a * b: {a * b}")
print(f"a ** 2: {a ** 2}")  # Element-wise square

# TODO: Create two arrays of your choice and perform subtraction and division operations
# Print the results
c = np.array([90, 160, 250])
d = np.array([30, 40, 50])
print(f"c: {c}")
print(f"d: {d}")
print(f"c - d: {c - d}")
print(f"c / d: {c / d}")

# TODO: Try multiplying an array by a scalar (single number)
# Print the result
scalar = 5
print(f"Scalar: {scalar}")
print(f"c * scalar: {c * scalar}")

a: [1 2 3]
b: [4 5 6]
a + b: [5 7 9]
a * b: [ 4 10 18]
a ** 2: [1 4 9]
c: [ 90 160 250]
d: [30 40 50]
c - d: [ 60 120 200]
c / d: [3. 4. 5.]
Scalar: 5
c * scalar: [ 450  800 1250]


<details>
<summary>Click to see solution</summary>

```python
# Subtraction and division
x = np.array([10, 20, 30])
y = np.array([2, 4, 5])
print(f"x - y: {x - y}")      # [8, 16, 25]
print(f"x / y: {x / y}")      # [5., 5., 6.]

# Scalar multiplication
arr = np.array([1, 2, 3])
print(f"arr * 5: {arr * 5}")  # [5, 10, 15]
```

</details>

## Array and Matrix Multiplication

NumPy supports different types of multiplication:

1. **Element-wise multiplication** (`*`): Multiplies corresponding elements
2. **Dot product** (`@` or `np.dot()` for 1D arrays): Sum of element-wise products, returns a scalar
3. **Matrix multiplication** (`@` or `np.matmul()` for 2D arrays): Row-by-column multiplication
4. **Broadcasting**: Not a separate kind of multiplication, but the rule that lets element-wise operations combine arrays of different, compatible shapes

### Dot Product vs Matrix Multiplication

**Dot product** (for vectors): Multiply corresponding elements and sum them.
```
[1, 2, 3] · [4, 5, 6] = 1×4 + 2×5 + 3×6 = 32  (scalar result)
```

**Matrix multiplication** (for 2D arrays): Each element in the result is a dot product of a row from the first matrix and a column from the second.

```
[1, 2]   [5, 6]   [1×5+2×7, 1×6+2×8]   [19, 22]
[3, 4] @ [7, 8] = [3×5+4×7, 3×6+4×8] = [43, 50]
```

**Key rule**: For matrix multiplication `(m, n) @ (n, p) → (m, p)`, inner dimensions must match!
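
The shape rule can be checked directly. The sketch below uses two illustrative arrays, `X` and `W` (names chosen here, not taken from the tutorial), and shows that swapping the operands breaks the inner-dimension match.

```python
import numpy as np

X = np.ones((4, 3))    # (m, n) = (4, 3)
W = np.ones((3, 2))    # (n, p) = (3, 2)
out = X @ W
print(out.shape)       # (4, 2)

# Swapping the operands gives (3, 2) @ (4, 3): inner dimensions 2 and 4 differ.
try:
    W @ X
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print(f"ValueError: {err}")
```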

In [10]:
# Multiplication examples

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print("=== 1D Arrays (Vectors) ===")
print(f"a: {a}")
print(f"b: {b}")

# Element-wise multiplication
print(f"\nElement-wise (a * b): {a * b}")  # [4, 10, 18]

# Dot product - sum of element-wise products
print(f"Dot product (a @ b): {a @ b}")     # 1*4 + 2*5 + 3*6 = 32
print(f"Manual calculation: {1*4 + 2*5 + 3*6}")

# 2D Matrix multiplication
print("\n=== 2D Arrays (Matrices) ===")
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(f"A:\n{A}")
print(f"B:\n{B}")

# Matrix multiplication - row by column
print(f"\nMatrix multiplication (A @ B):\n{A @ B}")
# Let's verify the first element: row 0 of A · column 0 of B
print(f"A[0,:] @ B[:,0] = {A[0,0]}*{B[0,0]} + {A[0,1]}*{B[1,0]} = {A[0,0]*B[0,0] + A[0,1]*B[1,0]}")

print(f"\nElement-wise (A * B):\n{A * B}")

# Broadcasting example
print("\n=== Broadcasting ===")
c = np.array([1, 2, 3])  # shape (3,)
d = np.array([[10], [20], [30]])  # shape (3, 1)
print(f"c: {c} (shape {c.shape})")
print(f"d:\n{d} (shape {d.shape})")
print(f"c + d (broadcasted):\n{c + d}")

# TODO: Try matrix multiplication with different shaped arrays
# Create a 2x3 matrix and a 3x2 matrix, then multiply them
# What shape is the result? What happens if you try element-wise multiplication?
E = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
F = np.array([[7, 8], [9, 10], [11, 12]])  # shape (3, 2)
print(f"E:\n{E} (shape {E.shape})")
print(f"F:\n{F} (shape {F.shape})")
print(f"\nMatrix multiplication (E @ F):\n{E @ F} (shape {(E @ F).shape})")
try:
    print(f"\nElement-wise multiplication (E * F):\n{E * F}")
except ValueError as err:
    # Shapes (2, 3) and (3, 2) cannot be broadcast together, so E * F raises
    print(f"\nValueError: {err}")


=== 1D Arrays (Vectors) ===
a: [1 2 3]
b: [4 5 6]

Element-wise (a * b): [ 4 10 18]
Dot product (a @ b): 32
Manual calculation: 32

=== 2D Arrays (Matrices) ===
A:
[[1 2]
 [3 4]]
B:
[[5 6]
 [7 8]]

Matrix multiplication (A @ B):
[[19 22]
 [43 50]]
A[0,:] @ B[:,0] = 1*5 + 2*7 = 19

Element-wise (A * B):
[[ 5 12]
 [21 32]]

=== Broadcasting ===
c: [1 2 3] (shape (3,))
d:
[[10]
 [20]
 [30]] (shape (3, 1))
c + d (broadcasted):
[[11 12 13]
 [21 22 23]
 [31 32 33]]
E:
[[1 2 3]
 [4 5 6]] (shape (2, 3))
F:
[[ 7  8]
 [ 9 10]
 [11 12]] (shape (3, 2))

Matrix multiplication (E @ F):
[[ 58  64]
 [139 154]] (shape (2, 2))


ValueError: operands could not be broadcast together with shapes (2,3) (3,2) 

<details>
<summary>Click to see solution</summary>

```python
# Create 2x3 and 3x2 matrices
M1 = np.array([[1, 2, 3], [4, 5, 6]])      # shape (2, 3)
M2 = np.array([[1, 2], [3, 4], [5, 6]])    # shape (3, 2)

print(f"M1 (2x3):\n{M1}")
print(f"M2 (3x2):\n{M2}")

# Matrix multiplication: (2,3) @ (3,2) -> (2,2)
result = M1 @ M2
print(f"M1 @ M2 (shape {result.shape}):\n{result}")

# Verify first element: M1 row 0 · M2 column 0
# [1,2,3] · [1,3,5] = 1*1 + 2*3 + 3*5 = 22
print(f"First element: 1*1 + 2*3 + 3*5 = {1*1 + 2*3 + 3*5}")

# Element-wise multiplication fails - shapes don't match!
# M1 * M2  # ValueError: operands could not be broadcast together with shapes (2,3) (3,2)
```

</details>
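
Broadcasting compares shapes from the right: two dimensions are compatible when they are equal or when one of them is 1 (or missing). The sketch below checks a few shape pairs with `np.broadcast_shapes` (available in NumPy 1.20 and later) and shows a typical use, centering the columns of a small made-up array.

```python
import numpy as np

# (3,) with (3, 1): the 1-sized and missing dimensions are stretched.
shape_a = np.broadcast_shapes((3,), (3, 1))
# (2, 3) with (3,): the 1D array is applied to every row.
shape_b = np.broadcast_shapes((2, 3), (3,))
print(shape_a, shape_b)  # (3, 3) (2, 3)

# Centering columns: (4, 2) minus (2,) subtracts each column mean from every row.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
centered = X - X.mean(axis=0)

# (2, 3) with (3, 2): 3 vs 2 and 2 vs 3 differ, and neither is 1.
try:
    np.broadcast_shapes((2, 3), (3, 2))
    incompatible = False
except ValueError as err:
    incompatible = True
    print(f"ValueError: {err}")
```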

## Indexing and Slicing

NumPy arrays support indexing and slicing with the same syntax as Python lists, extended to multiple dimensions: you give one index or slice per axis, separated by commas.

### Basic Indexing
- `arr[0]` - First element
- `arr[-1]` - Last element
- `matrix[0, 1]` - Row 0, column 1

### Slicing
- `arr[1:4]` - Elements at indices 1 to 3 (the stop index is excluded)
- `matrix[:, 1]` - All rows, column 1
- `matrix[0:2, 1:3]` - Rows 0-1, columns 1-2

In [12]:
# Indexing and slicing examples

arr = np.arange(10)  # [0, 1, 2, ..., 9]
print(f"1D array: {arr}")

# Basic indexing
print(f"arr[0]: {arr[0]}")
print(f"arr[5]: {arr[5]}")
print(f"arr[-1]: {arr[-1]}")

# Slicing
print(f"arr[2:7]: {arr[2:7]}")
print(f"arr[:5]: {arr[:5]}")
print(f"arr[5:]: {arr[5:]}")
print(f"arr[::2]: {arr[::2]}")  # Every other element

# 2D array indexing
grid = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
print(f"\n2D array:\n{grid}")

print(f"grid[0, 0]: {grid[0, 0]}")  # First row, first column
print(f"grid[1, 2]: {grid[1, 2]}")  # Second row, third column
print(f"grid[2, :]: {grid[2, :]}")  # Third row, all columns
print(f"grid[:, 1]: {grid[:, 1]}")  # All rows, second column
print(f"grid[0:2, 1:3]:\n{grid[0:2, 1:3]}")  # Submatrix

# TODO: Create a 3x3 matrix and practice indexing
# Extract the middle row, the last column, and the diagonal elements
K = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"\nMatrix K:\n{K}")
print(f"Middle row (K[1, :]): {K[1, :]}")
print(f"Last column (K[:, -1]): {K[:, -1]}")
print(f"Diagonal elements (K.diagonal()): {K.diagonal()}")

1D array: [0 1 2 3 4 5 6 7 8 9]
arr[0]: 0
arr[5]: 5
arr[-1]: 9
arr[2:7]: [2 3 4 5 6]
arr[:5]: [0 1 2 3 4]
arr[5:]: [5 6 7 8 9]
arr[::2]: [0 2 4 6 8]

2D array:
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
grid[0, 0]: 1
grid[1, 2]: 7
grid[2, :]: [ 9 10 11 12]
grid[:, 1]: [ 2  6 10]
grid[0:2, 1:3]:
[[2 3]
 [6 7]]

Matrix K:
[[1 2 3]
 [4 5 6]
 [7 8 9]]
Middle row (K[1, :]): [4 5 6]
Last column (K[:, -1]): [3 6 9]
Diagonal elements (K.diagonal()): [1 5 9]


<details>
<summary>Click to see solution</summary>

```python
# Create a 3x3 matrix
m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"Matrix:\n{m}")

# Middle row (index 1)
print(f"Middle row: {m[1, :]}")           # [4, 5, 6]

# Last column (index -1 or 2)
print(f"Last column: {m[:, -1]}")         # [3, 6, 9]

# Diagonal elements
print(f"Diagonal: {np.diag(m)}")          # [1, 5, 9]
```

</details>
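
One difference from Python lists is worth seeing in code: basic slicing of an array returns a view onto the same data, not a copy. The sketch below uses an illustrative `grid` array and shows that writing through a view changes the original, while an explicit `.copy()` does not.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)

row_view = grid[1, :]          # basic slicing returns a view
row_view[0] = -1               # writes through to grid
print(grid[1, 0])              # -1

row_copy = grid[2, :].copy()   # an explicit copy owns its own data
row_copy[0] = 99
print(grid[2, 0])              # 8 (unchanged)

print(np.shares_memory(grid, row_view))  # True
print(np.shares_memory(grid, row_copy))  # False
```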

## Useful Functions and Reshaping

NumPy provides many built-in functions for array operations:

### Common Functions
- `np.sum(arr)` - Sum of all elements
- `np.mean(arr)` - Mean value
- `np.max(arr)` / `np.min(arr)` - Maximum/minimum
- `np.std(arr)` - Standard deviation
- `np.reshape(arr, new_shape)` - Change array shape

### Reshaping
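- `arr.reshape(new_shape)` - Returns the same data in a new shape (a view when possible); the total number of elements must not change
- `arr.flatten()` - Returns a 1D copy
- `arr.T` - Transpose (reverses the order of the axes; for a 2D array, swaps rows and columns)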

In [18]:
# Useful functions and reshaping

data = np.array([[1, 2, 3], [4, 5, 6]])
print(f"Original array:\n{data}")
print(f"Shape: {data.shape}")

# Common functions
print(f"\nSum: {np.sum(data)}")
print(f"Mean: {np.mean(data)}")
print(f"Max: {np.max(data)}")
print(f"Min: {np.min(data)}")
print(f"Standard deviation: {np.std(data)}")

# Row-wise and column-wise operations
print(f"Sum by rows: {np.sum(data, axis=1)}")
print(f"Sum by columns: {np.sum(data, axis=0)}")

# Reshaping
flat = data.flatten()
print(f"\nFlattened: {flat}")

reshaped = data.reshape(6, 1)
print(f"Reshaped to (6, 1):\n{reshaped}")

reshaped2 = data.reshape(1, 6)
print(f"Reshaped to (1, 6): {reshaped2}")

# Transpose
transposed = data.T
print(f"\nTransposed:\n{transposed}")

# TODO: Create a 1D array of 12 elements and reshape it into different 2D shapes
# Try (3, 4), (4, 3), (2, 6), etc.
# Calculate the mean and standard deviation of your reshaped arrays
array_1d = np.arange(12)  # [0, 1, 2, ..., 11]
print(f"\nOriginal 1D array: {array_1d}")
reshaped_3_4 = array_1d.reshape(3, 4)
print(f"Reshaped to (3, 4):\n{reshaped_3_4}")
print(f"Mean: {np.mean(reshaped_3_4)}")
print(f"Standard deviation: {np.std(reshaped_3_4)}")


Original array:
[[1 2 3]
 [4 5 6]]
Shape: (2, 3)

Sum: 21
Mean: 3.5
Max: 6
Min: 1
Standard deviation: 1.707825127659933
Sum by rows: [ 6 15]
Sum by columns: [5 7 9]

Flattened: [1 2 3 4 5 6]
Reshaped to (6, 1):
[[1]
 [2]
 [3]
 [4]
 [5]
 [6]]
Reshaped to (1, 6): [[1 2 3 4 5 6]]

Transposed:
[[1 4]
 [2 5]
 [3 6]]

Original 1D array: [ 0  1  2  3  4  5  6  7  8  9 10 11]
Reshaped to (3, 4):
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
Mean: 5.5
Standard deviation: 3.452052529534663


<details>
<summary>Click to see solution</summary>

```python
# Create 1D array of 12 elements
arr = np.arange(1, 13)
print(f"Original: {arr}")

# Reshape to different 2D shapes
shape_3x4 = arr.reshape(3, 4)
shape_4x3 = arr.reshape(4, 3)
shape_2x6 = arr.reshape(2, 6)

print(f"Shape (3, 4):\n{shape_3x4}")
print(f"Shape (4, 3):\n{shape_4x3}")
print(f"Shape (2, 6):\n{shape_2x6}")

# Stats are the same regardless of shape
print(f"Mean: {np.mean(shape_3x4)}")      # 6.5
print(f"Std: {np.std(shape_3x4):.4f}")    # 3.4521
```

</details>
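
Two reshaping details often come up in practice: letting NumPy infer one dimension with `-1`, and keeping a reduced axis with `keepdims=True` so the result broadcasts back against the original. A short sketch with illustrative variable names:

```python
import numpy as np

arr = np.arange(12)

# -1 asks NumPy to infer that dimension from the total size (12 / 3 = 4).
inferred = arr.reshape(3, -1)
print(inferred.shape)                               # (3, 4)

# reshape returns a view when it can; flatten always returns a copy.
print(np.shares_memory(arr, inferred))              # True
print(np.shares_memory(arr, inferred.flatten()))    # False

# keepdims=True keeps the reduced axis with length 1, giving shape (3, 1).
row_means = inferred.mean(axis=1, keepdims=True)
row_centered = inferred - row_means                 # (3, 4) - (3, 1) broadcasts
print(row_means.shape)                              # (3, 1)
```

If the total size is not divisible by the given dimensions, for example `arr.reshape(5, -1)` here, NumPy raises a `ValueError`.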

## NumPy in Machine Learning

NumPy enables efficient numerical computations essential for ML:

### Vectorized Operations
Instead of loops, use array operations for speed and simplicity.

### Example: Data Normalization
A common ML preprocessing step is standardization (z-score normalization): rescaling each feature to have zero mean and unit variance.

In [20]:
# ML Example: Data normalization

# Simulate some feature data (e.g., heights and weights)
features = np.array([[170, 70], [165, 65], [180, 80], [175, 75]])
print("Original features (height, weight):")
print(features)

# Standardize: (x - mean) / std
means = np.mean(features, axis=0)
stds = np.std(features, axis=0)

normalized = (features - means) / stds
print(f"\nNormalized features (zero mean, unit variance):")
print(normalized)
print(f"Means: {np.mean(normalized, axis=0)}")
print(f"Standard deviations: {np.std(normalized, axis=0)}")

# Vectorized distance calculation (useful for k-NN, etc.)
point = np.array([172, 72])
distances = np.sqrt(np.sum((features - point) ** 2, axis=1))
print(f"\nDistances from point {point} to each data point:")
print(distances)

# TODO: Try implementing min-max normalization: (x - min) / (max - min)
# Normalize the features to [0, 1] range
min_vals = np.min(features, axis=0)
max_vals = np.max(features, axis=0)
min_max_normalized = (features - min_vals) / (max_vals - min_vals)
print(f"\nMin-max normalized features (scaled to [0, 1]):")
print(min_max_normalized)

Original features (height, weight):
[[170  70]
 [165  65]
 [180  80]
 [175  75]]

Normalized features (zero mean, unit variance):
[[-0.4472136  -0.4472136 ]
 [-1.34164079 -1.34164079]
 [ 1.34164079  1.34164079]
 [ 0.4472136   0.4472136 ]]
Means: [1.38777878e-17 1.38777878e-17]
Standard deviations: [1. 1.]

Distances from point [172  72] to each data point:
[ 2.82842712  9.89949494 11.3137085   4.24264069]

Min-max normalized features (scaled to [0, 1]):
[[0.33333333 0.33333333]
 [0.         0.        ]
 [1.         1.        ]
 [0.66666667 0.66666667]]


<details>
<summary>Click to see solution</summary>

```python
# Min-max normalization: (x - min) / (max - min)
mins = np.min(features, axis=0)
maxs = np.max(features, axis=0)

normalized_minmax = (features - mins) / (maxs - mins)
print(f"Min-max normalized (0 to 1 range):\n{normalized_minmax}")
print(f"Min values: {np.min(normalized_minmax, axis=0)}")  # [0., 0.]
print(f"Max values: {np.max(normalized_minmax, axis=0)}")  # [1., 1.]
```

</details>
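
Both formulas above divide by a per-column quantity that can be zero when a feature is constant. The sketch below shows one possible guard for the standardization case. `standardize` is a hypothetical helper written for this example, and the constant third column is made-up data added to trigger the edge case.

```python
import numpy as np

def standardize(X):
    """Hypothetical helper: z-score each column, leaving constant columns at zero."""
    X = np.asarray(X, dtype=np.float64)
    means = X.mean(axis=0)
    stds = X.std(axis=0)
    # Replace zero standard deviations with 1 to avoid dividing by zero.
    safe_stds = np.where(stds == 0, 1.0, stds)
    return (X - means) / safe_stds

# The third column is constant, so its standard deviation is 0.
features = np.array([[170, 70, 1], [165, 65, 1], [180, 80, 1], [175, 75, 1]])
z = standardize(features)
print(z[:, 2])  # [0. 0. 0. 0.]
```

Replacing a zero standard deviation with 1 is only one convention; dropping constant features before scaling is another reasonable choice.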

## Data Types in NumPy

NumPy arrays have a fixed data type (`dtype`) that determines memory usage and precision. Choosing the right dtype is important for performance and accuracy.

### Common Data Types
- `int32`, `int64` - Integer types
- `float32`, `float64` - Floating point (default is float64)
- `bool` - Boolean
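- `object` - Arbitrary Python objects (e.g. mixed types); flexible but slow, because operations fall back to Python. Strings can also use fixed-width dtypes such as `<U1`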

### Why Data Types Matter
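- **Memory efficiency**: Smaller types use less memory (`float32` takes half the bytes of `float64`)
- **Performance**: Smaller types move less data through memory, which often, though not always, makes operations faster
- **Precision**: `float32` carries about 7 significant decimal digits, versus about 16 for `float64`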

In [None]:
# Data types examples

# Default dtype (float64)
default_array = np.array([1.0, 2.0, 3.0])
print(f"Default array: {default_array}, dtype: {default_array.dtype}")

# Specify dtype
int_array = np.array([1, 2, 3], dtype=np.int32)
print(f"Int32 array: {int_array}, dtype: {int_array.dtype}")

float32_array = np.array([1.0, 2.0, 3.0], dtype=np.float32)
print(f"Float32 array: {float32_array}, dtype: {float32_array.dtype}")

# Converting dtypes
converted = int_array.astype(np.float64)
print(f"Converted to float64: {converted}, dtype: {converted.dtype}")

# Memory usage
print(f"\nMemory usage:")
print(f"Float64 array: {default_array.nbytes} bytes")
print(f"Float32 array: {float32_array.nbytes} bytes")
print(f"Int32 array: {int_array.nbytes} bytes")

# TODO: Create arrays with different dtypes and compare memory usage
# Try bool dtype for binary data
bool_array = np.array([True, False, True], dtype=bool)
print(f"Bool array: {bool_array}, dtype: {bool_array.dtype}, memory usage: {bool_array.nbytes} bytes")
num_array = np.array([1, 0, 1], dtype=np.int8)
print(f"Int8 array: {num_array}, dtype: {num_array.dtype}, memory usage: {num_array.nbytes} bytes")
string_array = np.array(['a', 'b', 'c'], dtype='<U1')  # Fixed-width Unicode strings of length 1 (4 bytes per character)
print(f"String array: {string_array}, dtype: {string_array.dtype}, memory usage: {string_array.nbytes} bytes")
# For object arrays, nbytes counts only the stored pointers, not the Python objects they refer to
object_array = np.array([1, 'a', True], dtype=object)
print(f"Object array: {object_array}, dtype: {object_array.dtype}, memory usage: {object_array.nbytes} bytes")


Default array: [1. 2. 3.], dtype: float64
Int32 array: [1 2 3], dtype: int32
Float32 array: [1. 2. 3.], dtype: float32
Converted to float64: [1. 2. 3.], dtype: float64

Memory usage:
Float64 array: 24 bytes
Float32 array: 12 bytes
Int32 array: 12 bytes
Bool array: [ True False  True], dtype: bool, memory usage: 3 bytes
Int8 array: [1 0 1], dtype: int8, memory usage: 3 bytes
String array: ['a' 'b' 'c'], dtype: <U1, memory usage: 12 bytes
Object array: [1 'a' True], dtype: object, memory usage: 24 bytes
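
The precision and range trade-offs listed above can be inspected with `np.finfo` and `np.iinfo`. The sketch below also shows two consequences of choosing a small dtype: `float32` cannot represent every integer above 2**24, and fixed-width integers wrap around on overflow in array arithmetic. Variable names are illustrative.

```python
import numpy as np

# Machine epsilon: the gap between 1.0 and the next representable value.
eps32 = np.finfo(np.float32).eps   # about 1.19e-07
eps64 = np.finfo(np.float64).eps   # about 2.22e-16
print(eps32, eps64)

# float32 has a 24-bit significand, so 2**24 + 1 is not representable.
big = np.array([2**24], dtype=np.float32)
one = np.array([1], dtype=np.float32)
lost_increment = bool((big + one == big)[0])
print(lost_increment)              # True

# Fixed-width integers wrap around silently in array arithmetic.
small = np.array([127], dtype=np.int8)
wrapped = small + np.array([1], dtype=np.int8)
print(wrapped)                     # [-128]

# Range limits are available through np.iinfo.
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)  # -128 127
```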


## Boolean Indexing and Masking

Boolean indexing allows you to select elements based on conditions. This is powerful for data filtering and manipulation.

### How it Works
- Create a boolean mask: `arr > 5`
- Use the mask to select: `arr[mask]`
- Combine conditions with `&` (and), `|` (or), `~` (not); wrap each comparison in parentheses, because these operators bind more tightly than `>` or `<`

In [26]:
# Boolean indexing examples

data = np.array([1, 5, 8, 3, 9, 2, 7])
print(f"Original array: {data}")

# Create boolean mask
mask = data > 5
print(f"Mask (data > 5): {mask}")

# Use mask for selection
filtered = data[mask]
print(f"Filtered elements: {filtered}")

# Multiple conditions
mask2 = (data > 3) & (data < 8)
print(f"Mask (3 < data < 8): {mask2}")
print(f"Filtered: {data[mask2]}")

# Modifying with boolean indexing
data_copy = data.copy()
data_copy[mask] = 999
print(f"Modified array: {data_copy}")

# 2D example
grid_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"\n2D array:\n{grid_2d}")

mask_2d = grid_2d > 5
print(f"Mask (grid_2d > 5):\n{mask_2d}")
print(f"Filtered elements: {grid_2d[mask_2d]}")

# TODO: Create an array of exam scores and use boolean indexing to:
# - Find scores above 80
# - Find scores between 60 and 80
# - Replace failing scores (< 60) with 60
scores = np.array([55, 65, 75, 85, 95])
print(f"\nOriginal scores: {scores}")
print(f"Scores above 80: {scores[scores > 80]}")
print(f"Scores between 60 and 80: {scores[(scores >= 60) & (scores <= 80)]}")
scores[scores < 60] = 60
print(f"Scores after replacing failing scores with 60: {scores}")

Original array: [1 5 8 3 9 2 7]
Mask (data > 5): [False False  True False  True False  True]
Filtered elements: [8 9 7]
Mask (3 < data < 8): [False  True False False False False  True]
Filtered: [5 7]
Modified array: [  1   5 999   3 999   2 999]

2D array:
[[1 2 3]
 [4 5 6]
 [7 8 9]]
Mask (grid_2d > 5):
[[False False False]
 [False False  True]
 [ True  True  True]]
Filtered elements: [6 7 8 9]

Original scores: [55 65 75 85 95]
Scores above 80: [85 95]
Scores between 60 and 80: [65 75]
Scores after replacing failing scores with 60: [60 65 75 85 95]


<details>
<summary>Click to see solution</summary>

```python
# Exam scores
scores = np.array([85, 92, 45, 78, 55, 88, 62, 73, 95, 58])
print(f"Scores: {scores}")

# Scores above 80
above_80 = scores[scores > 80]
print(f"Above 80: {above_80}")            # [85, 92, 88, 95]

# Scores between 60 and 80
between = scores[(scores >= 60) & (scores <= 80)]
print(f"Between 60-80: {between}")        # [78, 62, 73]

# Replace failing scores with 60
scores_fixed = scores.copy()
scores_fixed[scores_fixed < 60] = 60
print(f"After replacing failures: {scores_fixed}")
```

</details>
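
Masks also work well with a few helper functions. As a sketch of alternatives to in-place assignment, `np.where` and `np.clip` both return a new array and leave the original scores untouched, and `np.count_nonzero` counts how many elements satisfy a condition.

```python
import numpy as np

scores = np.array([85, 92, 45, 78, 55, 88, 62, 73, 95, 58])

# np.where(condition, a, b) picks from a where the condition holds, else from b.
floored = np.where(scores < 60, 60, scores)

# np.clip expresses the same "at least 60" rule as a lower bound (no upper bound).
clipped = np.clip(scores, 60, None)

# count_nonzero counts the True entries of a mask.
n_failing = np.count_nonzero(scores < 60)

print(floored)     # [85 92 60 78 60 88 62 73 95 60]
print(n_failing)   # 3
```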

## Performance: NumPy vs Python Loops

NumPy's vectorized operations are much faster than Python loops. Let's demonstrate this with a timing comparison.

### Why is NumPy Faster?
- **Vectorization**: Operations are performed in compiled C code
- **Memory layout**: Arrays are stored contiguously in memory
- **No per-element type checks**: Every element shares one fixed dtype, which avoids the overhead of Python's dynamic typing

In [30]:
# Performance comparison
import time
import math


# Create large arrays
size = 1000000
a = np.random.rand(size)
b = np.random.rand(size)

# Python loop approach
def python_add(a_list, b_list):
    return [x + y for x, y in zip(a_list, b_list)]

a_list = a.tolist()
b_list = b.tolist()

# Time Python loop
start = time.time()
result_python = python_add(a_list, b_list)
python_time = time.time() - start

# Time NumPy vectorized
start = time.time()
result_numpy = a + b
numpy_time = time.time() - start

print(f"Array size: {size}")
print(f"Python loop time: {python_time:.4f} seconds")
print(f"NumPy vectorized time: {numpy_time:.4f} seconds")
print(f"Speedup: {python_time / numpy_time:.1f}x faster")

# Verify results are the same
print(f"Results match: {np.allclose(result_python, result_numpy)}")

# TODO: Try timing other operations like multiplication or square root
# Experiment with different array sizes

print("\n--- EXTRA TESTS: MULTIPLICATION ---")

# Python loop multiplication
start = time.time()
result_python_mul = [x * y for x, y in zip(a_list, b_list)]
python_mul_time = time.time() - start

# NumPy multiplication
start = time.time()
result_numpy_mul = a * b
numpy_mul_time = time.time() - start

print(f"Python mul time: {python_mul_time:.4f} sec")
print(f"NumPy mul time:  {numpy_mul_time:.4f} sec")
print(f"Speedup: {python_mul_time / numpy_mul_time:.1f}x faster")
print(f"Results match: {np.allclose(result_python_mul, result_numpy_mul)}")


print("\n--- EXTRA TESTS: SQRT ---")

# Python loop sqrt
start = time.time()
result_python_sqrt = [math.sqrt(x) for x in a_list]
python_sqrt_time = time.time() - start

# NumPy sqrt
start = time.time()
result_numpy_sqrt = np.sqrt(a)
numpy_sqrt_time = time.time() - start

print(f"Python sqrt time: {python_sqrt_time:.4f} sec")
print(f"NumPy sqrt time:  {numpy_sqrt_time:.4f} sec")
print(f"Speedup: {python_sqrt_time / numpy_sqrt_time:.1f}x faster")
print(f"Results match: {np.allclose(result_python_sqrt, result_numpy_sqrt)}")

Array size: 1000000
Python loop time: 0.0736 seconds
NumPy vectorized time: 0.0033 seconds
Speedup: 22.5x faster
Results match: True

--- EXTRA TESTS: MULTIPLICATION ---
Python mul time: 0.0634 sec
NumPy mul time:  0.0026 sec
Speedup: 24.1x faster
Results match: True

--- EXTRA TESTS: SQRT ---
Python sqrt time: 0.0929 sec
NumPy sqrt time:  0.0032 sec
Speedup: 28.7x faster
Results match: True
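
Single measurements with `time.time()` can be noisy, especially for operations that take only a few milliseconds. A possible refinement, sketched here with the standard-library `timeit` module, is to repeat each measurement and keep the best run. The array size and repeat counts are arbitrary choices for illustration.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(100_000)
b = rng.random(100_000)
a_list, b_list = a.tolist(), b.tolist()

calls = 10  # calls per timing run

# Five runs of `calls` calls each; the minimum is the least disturbed estimate.
loop_time = min(timeit.repeat(
    lambda: [x + y for x, y in zip(a_list, b_list)], number=calls, repeat=5)) / calls
numpy_time = min(timeit.repeat(lambda: a + b, number=calls, repeat=5)) / calls

print(f"Python loop: {loop_time * 1e3:.3f} ms per call")
print(f"NumPy:       {numpy_time * 1e3:.3f} ms per call")
```

Absolute timings depend on the machine and NumPy build, so the numbers are best compared relative to each other.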


## Final Summary

You've now learned the fundamentals of NumPy for machine learning:

- **Arrays vs Matrices**: Multi-dimensional arrays are NumPy's core data structure
- **Creating Arrays**: Multiple ways to initialize arrays (`zeros`, `ones`, `linspace`, `eye`, etc.)
- **Basic Operations**: Element-wise arithmetic operations
- **Dot Product & Matrix Multiplication**: Dot product for vectors (`@` returns scalar), matrix multiplication for 2D arrays (`@` returns matrix)
- **Indexing & Slicing**: Powerful ways to access and modify array elements
- **Useful Functions**: Built-in operations for statistics and transformations
- **Reshaping**: Changing array dimensions for different operations
- **Data Types**: Choosing appropriate dtypes for memory and performance
- **Boolean Indexing**: Selecting data based on conditions
- **Performance**: Why NumPy is faster than Python loops
- **ML Applications**: Vectorized operations for efficient data processing

### Key Takeaways
- NumPy arrays are faster and more memory-efficient than Python lists
- Vectorized operations eliminate slow Python loops
- Broadcasting allows operations between different-shaped arrays
- Choose appropriate data types for optimal performance
- Boolean indexing enables powerful data filtering
- NumPy is the foundation for pandas and scikit-learn, and its arrays interoperate closely with TensorFlow and PyTorch