# Summer of Code - Artificial Intelligence
## Week 04: Exploratory Data Analysis
### Day 02: NumPy

In this lesson, we will learn more about NumPy arrays: indexing and selection, views and copies, broadcasting, dtypes, and saving and loading data.

In [1]:
import numpy as np

In [4]:
arr = np.arange(1, 21)
arr

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20])

In [9]:
arr = arr.reshape((4, 5))
arr

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20]])

# Indexing and Selection
How to select and modify subsets of data from NumPy arrays.


## Scalar indexing vs slicing
Scalar indexing is used to access a single element in an array, while slicing is used to access a range of elements.

In [None]:
arr[0][1]

np.int64(2)

In [None]:
arr[0, 1].item()

2

In [21]:
arr[2:, 3:]

array([[14, 15],
       [19, 20]])

In [25]:
arr[:, 1]

array([ 2,  7, 12, 17])

In [26]:
arr[2, :]

array([11, 12, 13, 14, 15])

## Multidimensional indexing: row/column order
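In NumPy, the order of the indices matters for multidimensional arrays. For a 2D array, the first index selects the row and the second selects the column, so `arr[i, j]` is the element in row `i`, column `j`. By default, NumPy also stores arrays in row-major (C-style) order, meaning the elements of each row sit next to each other in memory.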
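
As a small sketch of the row/column convention, the cell below rebuilds the 4×5 array from the start of the lesson and checks that element `(row, col)` sits at flat position `row * n_cols + col`. The names `row`, `col`, `flat`, and `flat_position` are illustrative only.

```python
import numpy as np

# Same 4x5 array as at the start of the lesson.
arr = np.arange(1, 21).reshape((4, 5))

# The first index selects the row, the second the column.
row, col = 2, 3
element = arr[row, col]

# Row-major (C-style) layout: flattening walks each row in turn, so
# element (row, col) sits at flat position row * n_cols + col.
n_cols = arr.shape[1]
flat = arr.ravel()
flat_position = row * n_cols + col

print(element, flat[flat_position])   # 14 14
print(arr.flags['C_CONTIGUOUS'])      # True
```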

In [27]:
arr[0] # returns first row

array([1, 2, 3, 4, 5])

## Fancy indexing (integer arrays)
Fancy indexing is a way to access specific elements of an array using integer arrays (or lists) as indices. This allows for more flexible and powerful data selection compared to standard slicing.
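
One point that often surprises people is that two index lists are paired element by element rather than combined into a block. The sketch below uses a small hypothetical `grid` array to contrast paired indices with `np.ix_`, which selects the full rows × columns block.

```python
import numpy as np

grid = np.arange(12).reshape((3, 4))
rows = [0, 2]
cols = [1, 3]

# Paired index lists pick individual elements: (0, 1) and (2, 3).
pairs = grid[rows, cols]
print(pairs)              # [ 1 11]

# np.ix_ builds an open mesh, so every listed row meets every listed column.
block = grid[np.ix_(rows, cols)]
print(block.tolist())     # [[1, 3], [9, 11]]

# A single list of row indices selects whole rows, and may reorder or repeat them.
reordered = grid[[2, 0, 0]]
print(reordered.shape)    # (3, 4)
```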

In [55]:
rng = np.random.default_rng(seed=42)

integers = rng.integers(0, 20, (4, 6))
integers

array([[ 1, 15, 13,  8,  8, 17],
       [ 1, 13,  4,  1, 10, 19],
       [14, 15, 14, 15, 10,  2],
       [16,  9, 10,  7,  3, 18]])

In [58]:
integers[[0, 3, 2], [1, 2, 0]]

array([15, 10, 14])

## Boolean masks
Boolean masks are arrays of boolean values (True/False) that can be used to select elements from another array based on certain conditions. This technique is useful for filtering data.
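
Conditions can also be combined. The following sketch, using a hypothetical `values` array, shows the element-wise operators `&` (and), `|` (or), and `~` (not); Python's `and`/`or` keywords do not work on arrays.

```python
import numpy as np

values = np.array([3, 8, 12, 5, 20, 1])

# Parentheses are required: & and | bind tighter than comparisons.
in_range = (values >= 5) & (values <= 12)
outside = ~in_range

print(values[in_range])   # [ 8 12  5]
print(values[outside])    # [ 3 20  1]

# True counts as 1, so summing a mask counts the matching elements.
print(in_range.sum())     # 3
```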


In [59]:
integers

array([[ 1, 15, 13,  8,  8, 17],
       [ 1, 13,  4,  1, 10, 19],
       [14, 15, 14, 15, 10,  2],
       [16,  9, 10,  7,  3, 18]])

In [61]:
mask = integers < 5

In [62]:
mask

array([[ True, False, False, False, False, False],
       [ True, False,  True,  True, False, False],
       [False, False, False, False, False,  True],
       [False, False, False, False,  True, False]])

In [64]:
integers[integers < 5]

array([1, 1, 4, 1, 2, 3])

In [66]:
integers[integers % 2 == 0]

array([ 8,  8,  4, 10, 14, 14, 10,  2, 16, 10, 18])

## In-place assignment with selection
In-place assignment allows you to modify specific elements of an array directly using indexing or boolean masks. This is useful for updating values without creating a new array.
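
A minimal sketch of the boolean-mask case, using a hypothetical `scores` array: the mask goes on the left-hand side of the assignment, and every selected element is overwritten in the original array.

```python
import numpy as np

scores = np.array([[4, -2, 7],
                   [-5, 3, -1]])

# Boolean-mask assignment: every negative entry is overwritten in place.
scores[scores < 0] = 0
print(scores.tolist())    # [[4, 0, 7], [0, 3, 0]]

# Augmented assignment on a selection also modifies the original array.
scores[:, 0] += 10
print(scores.tolist())    # [[14, 0, 7], [10, 3, 0]]
```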

In [67]:
integers

array([[ 1, 15, 13,  8,  8, 17],
       [ 1, 13,  4,  1, 10, 19],
       [14, 15, 14, 15, 10,  2],
       [16,  9, 10,  7,  3, 18]])

In [74]:
np.array([1, 2, 3]) + 1

array([2, 3, 4])

In [69]:
integers[0] = 10

In [71]:
integers

array([[10, 10, 10, 10, 10, 10],
       [ 1, 13,  4,  1, 10, 19],
       [14, 15, 14, 15, 10,  2],
       [16,  9, 10,  7,  3, 18]])

In [None]:
integers[0, :2] = -1

In [73]:
integers

array([[-1, -1, 10, 10, 10, 10],
       [ 1, 13,  4,  1, 10, 19],
       [14, 15, 14, 15, 10,  2],
       [16,  9, 10,  7,  3, 18]])

# Views vs Copies

- Basic slices (e.g. `arr[1:3]`) are views: they reference the same memory as the original array.
- Fancy indexing and boolean masks produce copies: independent memory.
- Use `.copy()` to force a copy.
- Inspect the `.base` attribute to see if an array owns its memory (`.base is None`) or is a view of another array.
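
The points above can be checked directly with `np.shares_memory`. This is a small sketch on a hypothetical `source` array, separate from the arrays used in the cells that follow.

```python
import numpy as np

source = np.arange(10)

view = source[2:6]               # basic slice -> view
fancy = source[[2, 3, 4, 5]]     # fancy indexing -> copy
masked = source[source > 6]      # boolean mask -> copy

print(np.shares_memory(source, view))    # True
print(np.shares_memory(source, fancy))   # False
print(np.shares_memory(source, masked))  # False

view[0] = 99     # writes through to source
fancy[1] = -1    # leaves source untouched
print(source[2], source[3])              # 99 3
print(view.base is source)               # True
```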


In [77]:
arr = integers

arr

array([[-1, -1, 10, 10, 10, 10],
       [ 1, 13,  4,  1, 10, 19],
       [14, 15, 14, 15, 10,  2],
       [16,  9, 10,  7,  3, 18]])

In [78]:
arr[0] = 30
arr

array([[30, 30, 30, 30, 30, 30],
       [ 1, 13,  4,  1, 10, 19],
       [14, 15, 14, 15, 10,  2],
       [16,  9, 10,  7,  3, 18]])

In [79]:
integers

array([[30, 30, 30, 30, 30, 30],
       [ 1, 13,  4,  1, 10, 19],
       [14, 15, 14, 15, 10,  2],
       [16,  9, 10,  7,  3, 18]])

In [109]:
arr = integers.copy()

arr

array([[30, 30, 30, 30, 30, 30],
       [ 1, 13,  4,  1, 10, 19],
       [14, 15, 14, 15, 10,  2],
       [16,  9, 10,  7,  3, 18]])

In [110]:
arr[0] = 0
arr

array([[ 0,  0,  0,  0,  0,  0],
       [ 1, 13,  4,  1, 10, 19],
       [14, 15, 14, 15, 10,  2],
       [16,  9, 10,  7,  3, 18]])

In [111]:
b = arr[-2:, -2:]

In [112]:
b

array([[10,  2],
       [ 3, 18]])

In [113]:
b[0] = 0

In [114]:
b

array([[ 0,  0],
       [ 3, 18]])

In [115]:
arr

array([[ 0,  0,  0,  0,  0,  0],
       [ 1, 13,  4,  1, 10, 19],
       [14, 15, 14, 15,  0,  0],
       [16,  9, 10,  7,  3, 18]])

In [116]:
b.base is None

False

In [117]:
arr.base is None

True

In [118]:
arr2 = arr

arr2.base is None

True

In [119]:
arr2[0] = 10  # arr2 is the same object as arr, so arr changes too

In [120]:
arr

array([[10, 10, 10, 10, 10, 10],
       [ 1, 13,  4,  1, 10, 19],
       [14, 15, 14, 15,  0,  0],
       [16,  9, 10,  7,  3, 18]])

In [121]:
arr2

array([[10, 10, 10, 10, 10, 10],
       [ 1, 13,  4,  1, 10, 19],
       [14, 15, 14, 15,  0,  0],
       [16,  9, 10,  7,  3, 18]])

In [122]:
c = arr2[:3]
c

array([[10, 10, 10, 10, 10, 10],
       [ 1, 13,  4,  1, 10, 19],
       [14, 15, 14, 15,  0,  0]])

In [127]:
c.base is None

False

# Broadcasting

Broadcasting lets arrays with different shapes interact by virtually "stretching" dimensions of size 1. Rules (simplified):
- Compare shapes from right to left.
- Dimensions are compatible if equal or either is 1.
- Result shape is the max along each dimension.
- If any pair is incompatible, NumPy raises a `ValueError`.
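
To try the rules on shapes alone, `np.broadcast_shapes` (available in NumPy 1.20 and later) returns the result shape without allocating any arrays. The sketch below also centers the columns of a hypothetical `data` array as one typical use of broadcasting.

```python
import numpy as np

# Shapes are compared from right to left; size-1 dimensions stretch.
print(np.broadcast_shapes((3, 1), (3,)))        # (3, 3)
print(np.broadcast_shapes((4, 1, 5), (6, 1)))   # (4, 6, 5)

# Incompatible trailing dimensions (3 vs 2) raise ValueError.
try:
    np.broadcast_shapes((3,), (2,))
    failed = False
except ValueError as err:
    failed = True
    print("cannot broadcast:", err)

# Typical use: subtract a (1, 3) row of column means from a (4, 3) array.
data = np.arange(12, dtype=np.float64).reshape((4, 3))
col_means = data.mean(axis=0, keepdims=True)    # shape (1, 3)
centered = data - col_means                     # (4, 3) - (1, 3) -> (4, 3)
print(centered.mean(axis=0))                    # each column mean is 0
```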



In [131]:
np.array([1, 2, 3]) + [1]

array([2, 3, 4])

In [132]:
np.array([1, 2, 3]) + [2, 2]

ValueError: operands could not be broadcast together with shapes (3,) (2,)

In [138]:
a = np.array([
    [1],
    [2],
    [3]
])

b = np.array([1, 1, 1])

a + b

array([[2, 2, 2],
       [3, 3, 3],
       [4, 4, 4]])

# Typecasting and dtypes

- Every `ndarray` has a `dtype` (e.g., `int64`, `float32`, `bool`).
- `astype(new_dtype)` returns a copy with the converted type.
- Upcasting: mixing `int` and `float` operands produces a float result.
- Casting from float to int truncates the fractional part, and values outside the integer type's range overflow.
- Common ML practice uses `float32` to save memory and accelerate computation on GPUs.
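
The dtype notes in this section can be checked with a short sketch. The array names (`ints`, `floats`, `precise`, `big`, `small`) are illustrative and not part of the lesson's data.

```python
import numpy as np

ints = np.array([1, 2, 3])
floats = np.array([0.5, 1.5, 2.5], dtype=np.float32)

# Mixing integer and float arrays upcasts the result to a float dtype.
mixed = ints + floats
print(mixed.dtype)

# astype returns a new array; the original keeps its dtype.
precise = np.array([1.9, -1.9, 2.5])
truncated = precise.astype(np.int64)
print(truncated)          # [ 1 -1  2]: the fractional part is dropped
print(precise.dtype)      # float64

# float32 uses half the memory of float64.
big = np.zeros(1000, dtype=np.float64)
small = big.astype(np.float32)
print(big.nbytes, small.nbytes)   # 8000 4000
```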

# Saving and Loading Data

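- Native binary: `np.save('file.npy', arr)` writes a single array and `np.load('file.npy')` reads it back.
- Multiple arrays: `np.savez('file.npz', a=arr1, b=arr2)` writes several arrays to one file; `np.load` then returns a dict-like `NpzFile` keyed by the names you passed (`a`, `b`).
- Both formats store each array's dtype, so cast to a smaller dtype (e.g. `float32`) before saving if you want smaller files and lower memory use after loading.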
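
A minimal round-trip sketch showing that dtypes survive saving and loading. It writes to a temporary directory so nothing is left on disk; the file name `example.npz` and the array names are hypothetical.

```python
import os
import tempfile

import numpy as np

weights = np.linspace(0, 1, 5, dtype=np.float32)
labels = np.array([0, 1, 1, 0, 1], dtype=np.int8)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.npz")   # hypothetical file name
    np.savez(path, weights=weights, labels=labels)

    # NpzFile reads arrays lazily; the with-block closes the file afterwards.
    with np.load(path) as data:
        keys = sorted(data.files)
        loaded_weights = data["weights"]
        loaded_labels = data["labels"]

print(keys)                                        # ['labels', 'weights']
print(loaded_weights.dtype, loaded_labels.dtype)   # float32 int8
```

Using `np.load` as a context manager is optional for small scripts, but it makes sure the underlying file is closed once the arrays have been read.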



In [139]:
arr

array([[10, 10, 10, 10, 10, 10],
       [ 1, 13,  4,  1, 10, 19],
       [14, 15, 14, 15,  0,  0],
       [16,  9, 10,  7,  3, 18]])

In [140]:
np.save('integers.npy', arr)

In [142]:
arr = np.load('integers.npy')
arr

array([[10, 10, 10, 10, 10, 10],
       [ 1, 13,  4,  1, 10, 19],
       [14, 15, 14, 15,  0,  0],
       [16,  9, 10,  7,  3, 18]])

In [143]:
arr1 = rng.normal(0, 1, (10, 2))
arr1

array([[ 0.0660307 ,  1.12724121],
       [ 0.46750934, -0.85929246],
       [ 0.36875078, -0.9588826 ],
       [ 0.8784503 , -0.04992591],
       [-0.18486236, -0.68092954],
       [ 1.22254134, -0.15452948],
       [-0.42832782, -0.35213355],
       [ 0.53230919,  0.36544406],
       [ 0.41273261,  0.430821  ],
       [ 2.1416476 , -0.40641502]])

In [144]:
arr2 = rng.uniform(0, 10, (5, 5))
arr2

array([[3.25825358, 3.70459706, 4.69555811, 1.89471359, 1.29921505],
       [4.75704926, 2.26909349, 6.69813995, 4.37151919, 8.32678196],
       [7.00265102, 3.12366641, 8.32259801, 8.04764357, 3.87478379],
       [2.88328104, 6.82495504, 1.39752484, 1.99908202, 0.0736227 ],
       [7.86924378, 6.64850857, 7.05165379, 7.80729031, 4.58915776]])

In [145]:
np.savez('arrays.npz', a=arr1, b=arr2)


In [147]:
data = np.load('arrays.npz')

In [148]:
arr1 = data['a']
arr1

array([[ 0.0660307 ,  1.12724121],
       [ 0.46750934, -0.85929246],
       [ 0.36875078, -0.9588826 ],
       [ 0.8784503 , -0.04992591],
       [-0.18486236, -0.68092954],
       [ 1.22254134, -0.15452948],
       [-0.42832782, -0.35213355],
       [ 0.53230919,  0.36544406],
       [ 0.41273261,  0.430821  ],
       [ 2.1416476 , -0.40641502]])

In [149]:
data['b']

array([[3.25825358, 3.70459706, 4.69555811, 1.89471359, 1.29921505],
       [4.75704926, 2.26909349, 6.69813995, 4.37151919, 8.32678196],
       [7.00265102, 3.12366641, 8.32259801, 8.04764357, 3.87478379],
       [2.88328104, 6.82495504, 1.39752484, 1.99908202, 0.0736227 ],
       [7.86924378, 6.64850857, 7.05165379, 7.80729031, 4.58915776]])

In [151]:
data.values()

ValuesView(NpzFile 'arrays.npz' with keys: a, b)

In [1]:
import numpy as np

# Practice: Short Exercises

1) Create a 6×6 array and:
   - Set the diagonal to 1 and anti-diagonal to 2 using slicing and fancy indexing.
   - Replace all values > mean with their z-scores (broadcasting) and leave others unchanged.


In [54]:
arr = np.arange(36).reshape((6,6))
arr.dtype

dtype('int64')

In [57]:
arr = arr.astype(dtype=np.float32)
arr

array([[ 0.,  1.,  2.,  3.,  4.,  5.],
       [ 6.,  7.,  8.,  9., 10., 11.],
       [12., 13., 14., 15., 16., 17.],
       [18., 19., 20., 21., 22., 23.],
       [24., 25., 26., 27., 28., 29.],
       [30., 31., 32., 33., 34., 35.]], dtype=float32)

In [58]:
arr.dtype

dtype('float32')

In [59]:
# arr = np.linspace(0, 100, 10)
# arr

In [60]:
arr[[0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5]] = 1

arr

array([[ 1.,  1.,  2.,  3.,  4.,  5.],
       [ 6.,  1.,  8.,  9., 10., 11.],
       [12., 13.,  1., 15., 16., 17.],
       [18., 19., 20.,  1., 22., 23.],
       [24., 25., 26., 27.,  1., 29.],
       [30., 31., 32., 33., 34.,  1.]], dtype=float32)

In [61]:
arr.mean()

np.float32(14.75)

In [62]:
arr > arr.mean()

array([[False, False, False, False, False, False],
       [False, False, False, False, False, False],
       [False, False, False,  True,  True,  True],
       [ True,  True,  True, False,  True,  True],
       [ True,  True,  True,  True, False,  True],
       [ True,  True,  True,  True,  True, False]])

In [63]:
greater_than_mean = arr[arr > arr.mean()].copy()
greater_than_mean

array([15., 16., 17., 18., 19., 20., 22., 23., 24., 25., 26., 27., 29.,
       30., 31., 32., 33., 34.], dtype=float32)

In [64]:
arr.std()

np.float32(11.041273)

In [65]:
# z-score: (value - mean) / std
greater_than_mean = (greater_than_mean - arr.mean()) / arr.std()
greater_than_mean

array([0.02264232, 0.11321158, 0.20378084, 0.29435012, 0.38491938,
       0.47548863, 0.6566272 , 0.74719644, 0.8377657 , 0.92833495,
       1.0189042 , 1.1094735 , 1.290612  , 1.3811814 , 1.4717506 ,
       1.5623199 , 1.6528891 , 1.7434584 ], dtype=float32)

In [66]:
arr[arr > arr.mean()] = greater_than_mean
arr

array([[ 1.        ,  1.        ,  2.        ,  3.        ,  4.        ,
         5.        ],
       [ 6.        ,  1.        ,  8.        ,  9.        , 10.        ,
        11.        ],
       [12.        , 13.        ,  1.        ,  0.02264232,  0.11321158,
         0.20378084],
       [ 0.29435012,  0.38491938,  0.47548863,  1.        ,  0.6566272 ,
         0.74719644],
       [ 0.8377657 ,  0.92833495,  1.0189042 ,  1.1094735 ,  1.        ,
         1.290612  ],
       [ 1.3811814 ,  1.4717506 ,  1.5623199 ,  1.6528891 ,  1.7434584 ,
         1.        ]], dtype=float32)

In [52]:
arr.dtype

dtype('float32')

2) Given a 2D array of shape (N, D), standardize per feature using `keepdims=True`. Verify each column mean is ~0 and std is ~1.


3) Save three arrays `A`, `B`, `C` to a single `.npz`, load and reconstruct `A @ B + C`. Use `float32` dtypes.



4) Create a boolean mask that selects rows where the sum is divisible by 3, then set those rows to 0 in-place.

