# NumPy Fundamentals

### 1. Introduction: What is NumPy and Why Use It?

In the previous class, we saw a problem: performing mathematical operations on large lists of numbers in plain Python is slow and requires writing explicit `for` loops. NumPy (Numerical Python) addresses both issues.

**NumPy** is the fundamental package for scientific computing in Python. It provides a powerful object called an **N-dimensional array (`ndarray`)** and a huge collection of functions for working with these arrays.

**Why is it so essential?**
1.  **Speed:** NumPy operations are implemented in C, making them much faster than standard Python loops.
2.  **Convenience:** It allows you to perform mathematical operations on entire arrays at once (a concept called **vectorization**), which means less code and fewer bugs.
3.  **Foundation of the Ecosystem:** Libraries like Pandas, Matplotlib, and Scikit-learn are all built on top of NumPy.

By convention, we almost always import NumPy with the alias `np`.

In [2]:
# install numpy
# pip install numpy

In [3]:
# import library
import numpy as np

### 2. Creating NumPy Arrays

The core of NumPy is the `ndarray`. You can create one in several ways.

In [4]:
# 1. From a Python list
data = [1,2,3,4,5]
print(data, type(data))

[1, 2, 3, 4, 5] <class 'list'>


In [5]:
arr = np.array(data)
print(arr, type(arr))
print('data type:', arr.dtype)     # data type of the elements
print('Dimension:', arr.ndim)
print('Shape:', arr.shape)
print('Size:', arr.size)      

[1 2 3 4 5] <class 'numpy.ndarray'>
data type: int64
Dimension: 1
Shape: (5,)
Size: 5
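
The `dtype` reported above is inferred from the values in the list. As a small illustrative sketch (the variable names here are made up for this example), mixing an integer list with a single float upcasts every element, and a dtype can also be requested explicitly:

```python
import numpy as np

# All integers: NumPy picks an integer dtype (the exact width depends on the platform).
ints = np.array([1, 2, 3])

# A single float in the list upcasts every element to float64.
mixed = np.array([1, 2.5, 3])

# The dtype can also be requested explicitly when the array is created.
small = np.array([1, 2, 3], dtype=np.float32)

print(ints.dtype.kind)   # 'i' means signed integer
print(mixed.dtype)       # float64
print(small.dtype)       # float32
```

An array always has a single dtype for all of its elements, which is one reason its operations can be fast.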


In [6]:
# 2D array
lst2 = [[1,2,3],[4,5,6]]
lst2

[[1, 2, 3], [4, 5, 6]]

In [7]:
arr2 = np.array(lst2)
print(arr2, type(arr2))
print('data type:', arr2.dtype)     # data type of the elements
print('Dimension:', arr2.ndim)
print('Shape:', arr2.shape)
print('Size:', arr2.size)      

[[1 2 3]
 [4 5 6]] <class 'numpy.ndarray'>
data type: int64
Dimension: 2
Shape: (2, 3)
Size: 6


In [8]:
# 3d array
lst3 = [[[1,2],[3,4]],[[5,6],[7,8]]]
lst3

[[[1, 2], [3, 4]], [[5, 6], [7, 8]]]

In [9]:
arr3 = np.array(lst3)
print(arr3, type(arr3))
print('data type:', arr3.dtype)     # data type of the elements
print('Dimension:', arr3.ndim)
print('Shape:', arr3.shape)
print('Size:', arr3.size)      

[[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]] <class 'numpy.ndarray'>
data type: int64
Dimension: 3
Shape: (2, 2, 2)
Size: 8


In [10]:
# 2. Using built-in functions
# np.arange is like Python's range() but returns a NumPy array
array_range = np.arange(0,100,10)
print(array_range)
print(array_range.ndim)

[ 0 10 20 30 40 50 60 70 80 90]
1


In [11]:
# Create an array of all zeros
zero_array = np.zeros((2,2,3),dtype='int')
print(zero_array)
zero_array

[[[0 0 0]
  [0 0 0]]

 [[0 0 0]
  [0 0 0]]]


array([[[0, 0, 0],
        [0, 0, 0]],

       [[0, 0, 0],
        [0, 0, 0]]])

In [12]:
# Create an array of all ones
ones_array = np.ones((2,2,3))
print(ones_array)

[[[1. 1. 1.]
  [1. 1. 1.]]

 [[1. 1. 1.]
  [1. 1. 1.]]]
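
Besides `np.zeros` and `np.ones`, a couple of related constructors are often handy. The sketch below is only an illustration with made-up names (`template`, `filled`): `np.full` fills an array with any constant, and the `*_like` functions copy the shape and dtype of an existing array.

```python
import numpy as np

template = np.array([[1, 2, 3], [4, 5, 6]])

# Every element is set to the given constant.
filled = np.full((2, 3), 7)

# Same shape and dtype as `template`, filled with zeros or ones.
zeros_like = np.zeros_like(template)
ones_like = np.ones_like(template)

print(filled)
print(zeros_like.shape, zeros_like.dtype == template.dtype)
```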


In [13]:
# Create an array with a specific number of evenly spaced points between a start and end value (both included)
linspace_array = np.linspace(0,99,5)
print(linspace_array)

[ 0.   24.75 49.5  74.25 99.  ]


In [14]:
linspace_array = np.linspace(0,100,5)
print(linspace_array)

[  0.  25.  50.  75. 100.]
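
The difference between `np.arange` and `np.linspace` is easy to mix up: one takes a step size, the other takes a number of points. The following sketch (variable names are illustrative) places them side by side and also shows the `retstep` and `endpoint` options of `np.linspace`.

```python
import numpy as np

# arange: fixed step, stop value excluded.
by_step = np.arange(0, 100, 25)           # 0, 25, 50, 75

# linspace: fixed number of points, stop value included by default.
by_count = np.linspace(0, 100, 5)         # 0, 25, 50, 75, 100

# retstep=True also returns the spacing that linspace computed.
points, step = np.linspace(0, 100, 5, retstep=True)

# endpoint=False drops the stop value, which mirrors arange's behaviour.
open_ended = np.linspace(0, 100, 4, endpoint=False)   # 0, 25, 50, 75

print(by_step, by_count, step, open_ended)
```

As a rule of thumb, `np.arange` suits integer steps, while `np.linspace` is usually the safer choice for fractional spacing because the number of points is fixed in advance.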


In [15]:
# Identity matrix : square matrix
np.identity(5)

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

In [16]:
a = np.eye(5)
a

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

In [17]:
# np.eye(N, M, k): N rows, M columns, ones on the diagonal shifted k columns to the right
np.eye(5,6,3)

array([[0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

In [18]:
# Change the data type of the elements (astype returns a new array; `a` itself is unchanged)
a.astype('int')

array([[1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1]])
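
One detail worth knowing about `astype`: converting floats to integers drops the fractional part rather than rounding. The sketch below uses made-up values to show the difference, and that the original array is left untouched.

```python
import numpy as np

values = np.array([1.9, -1.9, 2.5])

# astype(int) truncates toward zero.
truncated = values.astype(int)

# Rounding first gives the nearest integer (np.round sends exact halves to the even neighbour).
rounded = np.round(values).astype(int)

print(truncated)      # [ 1 -1  2]
print(rounded)        # [ 2 -2  2]
print(values.dtype)   # float64, the original array is unchanged
```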

### 3. The Power of Vectorization

This is NumPy's killer feature. You can perform operations on entire arrays without writing explicit loops. This is called **vectorization**.

In [19]:
# The Old Way (Python lists)
lst_a = [1,2,3]
lst_b = [4,5,6]
lst_c = []

for i in range(len(lst_a)):
    lst_c.append(lst_a[i] + lst_b[i])
lst_c

[5, 7, 9]

In [20]:
# The NumPy Way (Vectorized)
arr_a = np.array([1,2,3])
arr_b = np.array([4,5,6])
print(arr_a, arr_b)

arr_c = arr_a + arr_b
print(arr_c)

[1 2 3] [4 5 6]
[5 7 9]


In [21]:
# This works for all standard math operations
arr_c ** 2

array([25, 49, 81])
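
To get a feel for the speed difference, here is a rough timing sketch using the standard-library `timeit` module. The array size and repeat count are arbitrary choices for illustration, and the measured times will vary from machine to machine, so treat the numbers as indicative only.

```python
import timeit
import numpy as np

n = 100_000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)   # int64 so the squares cannot overflow

def square_with_loop():
    # Pure Python: one multiplication per loop iteration.
    return [x * x for x in py_list]

def square_vectorized():
    # NumPy: a single vectorized multiplication over the whole array.
    return np_arr * np_arr

loop_time = timeit.timeit(square_with_loop, number=20)
vec_time = timeit.timeit(square_vectorized, number=20)
print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s")
```

Both functions compute the same values; only the way the work is expressed differs.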

In [22]:
# list of obtained marks
obtained_marks = np.array([34,37,50,40,80])

# full marks
full_marks = np.array([50,50,100,100,100])

# calculate the percentage for each mark
per = (obtained_marks/full_marks)*100
per.astype('int')

array([68, 74, 50, 40, 80])

### 4. Array Attributes and Reshaping

You can easily inspect the properties of an array and change its shape.

In [23]:
arr = np.arange(12)
print(f'Original:{arr}')

Original:[ 0  1  2  3  4  5  6  7  8  9 10 11]


In [24]:
# Key Attributes
# The dimensions of the array
arr.shape
# The total number of elements
arr.size
# The number of axes (dimensions)
arr.ndim
# The data type of the elements
arr.dtype

# Reshaping
# .reshape() returns a new array with the same data but a new shape.
# The new shape must be compatible with the original size (e.g., 12 = 3 * 4)
arr.reshape((2,6))
arr.reshape((3,4))
arr.reshape((4,3))
arr.reshape((12,1))
arr.reshape((6,2))
arr.reshape((2,2,3))

array([[[ 0,  1,  2],
        [ 3,  4,  5]],

       [[ 6,  7,  8],
        [ 9, 10, 11]]])
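
Two details of `reshape` are worth a quick sketch: a dimension given as `-1` is inferred from the total size, and for a contiguous array like this one the result is a view that shares memory with the original. The variable name `inferred` is just for illustration.

```python
import numpy as np

arr = np.arange(12)

# -1 asks NumPy to infer that dimension from the total size (12 / 3 = 4).
inferred = arr.reshape(3, -1)
print(inferred.shape)   # (3, 4)

# Here reshape returns a view, so a write through it is visible in `arr`.
inferred[0, 0] = 99
print(arr[0])           # 99

# A shape whose product is not 12 raises ValueError.
try:
    arr.reshape(5, 3)
except ValueError as exc:
    print("reshape failed:", exc)
```

If an independent array is needed, calling `.copy()` on the reshaped result avoids the shared memory.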

### 5. Indexing, Slicing, and Broadcasting

Accessing elements in NumPy arrays is similar to Python lists but more powerful.

In [25]:
data = [[1,2,3],[4,5,6]]
data[0][0]
# data[0,0]   # raises TypeError: a plain Python list does not accept a [row, column] index

1

In [26]:
data = np.arange(20).reshape(4, 5) # 4 rows, 5 columns
print("Our 2D array:\n", data)

Our 2D array:
 [[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]]


In [27]:
# --- Indexing ---
# Get a single element [row, column]
data[0][0]
data[0,0]
data[3,4]

np.int64(19)

In [28]:
# --- Slicing ---
# Get the first two rows and all columns [row_slice, column_slice]
data[0:2,0:5]

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

In [29]:
# Get the columns at index 2 and 3 for all rows (the stop index 4 is excluded)
data[:,2:4]

array([[ 2,  3],
       [ 7,  8],
       [12, 13],
       [17, 18]])

In [30]:
data[1:3, 2:4]

array([[ 7,  8],
       [12, 13]])
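
Unlike list slices, NumPy slices are views onto the original array rather than copies. A short sketch of what that means in practice (the names `block` and `safe` are made up for this example):

```python
import numpy as np

data = np.arange(20).reshape(4, 5)

# A slice is a view: writing to it also changes `data`.
block = data[1:3, 2:4]
block[0, 0] = -1
print(data[1, 2])   # -1

# .copy() gives an independent array.
safe = data[1:3, 2:4].copy()
safe[0, 0] = 500
print(data[1, 2])   # still -1
```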

In [31]:
# --- Broadcasting ---
# Broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations.
# It 'stretches' the smaller array to match the shape of the larger one.

# Add 100 to every element in the array
data + 100

array([[100, 101, 102, 103, 104],
       [105, 106, 107, 108, 109],
       [110, 111, 112, 113, 114],
       [115, 116, 117, 118, 119]])

In [32]:
array_a = np.array([[1,2],[3,4]])
print(array_a, array_a.shape)
array_b = np.array([[1],[4]])
print(array_b, array_b.shape)
print()
print(array_a + array_b)

[[1 2]
 [3 4]] (2, 2)
[[1]
 [4]] (2, 1)

[[2 3]
 [7 8]]
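
Broadcasting compares shapes from the last dimension backwards: two dimensions are compatible when they are equal or when one of them is 1. The sketch below, with illustrative offset arrays, shows one case that works, one that fails, and how `np.newaxis` fixes the failing case.

```python
import numpy as np

data = np.arange(20).reshape(4, 5)                  # shape (4, 5)

per_column_offset = np.array([0, 10, 20, 30, 40])   # shape (5,)
per_row_offset = np.array([100, 200, 300, 400])     # shape (4,)

# (4, 5) + (5,): the trailing dimensions match, so one offset is added per column.
per_column = data + per_column_offset

# (4, 5) + (4,): the trailing dimensions 5 and 4 differ, so this fails.
try:
    data + per_row_offset
except ValueError as exc:
    print("broadcast failed:", exc)

# Reshaping to (4, 1) lets each offset stretch across its row.
per_row = data + per_row_offset[:, np.newaxis]

print(per_column[0])   # [ 0 11 22 33 44]
print(per_row[:, 0])   # [100 205 310 415]
```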


In [33]:
data = np.array([[1,2,3],[4,5,6]])
data

array([[1, 2, 3],
       [4, 5, 6]])

In [34]:
# resize() changes the shape of the array in place; unlike reshape(), the total size may change
data.resize((2,2), refcheck=False)
data

array([[1, 2],
       [3, 4]])

In [35]:
# Growing the array pads the new positions with zeros
data.resize((3,3), refcheck=False)
data

array([[1, 2, 3],
       [4, 0, 0],
       [0, 0, 0]])

In [36]:
# ravel() - flatten an array of any dimension into 1D
data.ravel()

array([1, 2, 3, 4, 0, 0, 0, 0, 0])
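
A closely related method is `flatten()`. As a hedged sketch of the difference: `ravel()` returns a view whenever it can (as it does for the contiguous array below), while `flatten()` always returns a copy. The names `grid`, `flat_view`, and `flat_copy` are illustrative.

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6]])

flat_view = grid.ravel()     # a view here, because `grid` is contiguous
flat_copy = grid.flatten()   # always an independent copy

flat_view[0] = 100
print(grid[0, 0])   # 100, the write went through to `grid`

flat_copy[1] = 200
print(grid[0, 1])   # 2, `grid` is unaffected
```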

### 6. Essential Statistical Functions

NumPy is packed with useful functions for performing statistical analysis.

The `axis` parameter is crucial here:
*   `axis=0`: Perform the operation **down the columns** (collapsing the rows).
*   `axis=1`: Perform the operation **across the rows** (collapsing the columns).

In [37]:
scores = np.array([
    [85, 92, 88], # Student 1 scores
    [76, 81, 79], # Student 2 scores
    [95, 91, 93]  # Student 3 scores
])

print(f"Scores array:\n{scores}\n")

Scores array:
[[85 92 88]
 [76 81 79]
 [95 91 93]]



In [38]:
# --- Operations on the entire array ---
print('Sum:', scores.sum())
print('Mean:', scores.mean())
print('Std:', scores.std())
print('Min:', scores.min())
print('Max:', scores.max())

Sum: 780
Mean: 86.66666666666667
Std: 6.377042156569663
Min: 76
Max: 95


In [39]:
# --- Operations along an axis ---
# Get the average score for each ASSIGNMENT (column-wise)
scores.sum(axis=0)
scores.mean(axis=0)

array([85.33333333, 88.        , 86.66666667])

In [40]:
# Get the average score for each STUDENT (row-wise)
scores.mean(axis=1)

array([88.33333333, 78.66666667, 93.        ])

In [41]:
# Get the lowest score for each STUDENT (row-wise)
scores.min(axis=1)

array([85, 76, 91])
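
Axis-wise results often feed straight back into a calculation on the original array. The sketch below, using the same scores, shows the `keepdims=True` option (which keeps the reduced axis as length 1 so the result broadcasts cleanly) and `argmax` for locating a maximum instead of returning its value. The names `row_means`, `centered`, and `best_assignment` are made up for this example.

```python
import numpy as np

scores = np.array([
    [85, 92, 88],   # Student 1
    [76, 81, 79],   # Student 2
    [95, 91, 93],   # Student 3
])

# keepdims=True gives shape (3, 1) instead of (3,).
row_means = scores.mean(axis=1, keepdims=True)

# Each student's scores relative to that student's own average.
centered = scores - row_means

# Column index of each student's best score.
best_assignment = scores.argmax(axis=1)

print(row_means.shape)    # (3, 1)
print(best_assignment)    # [1 1 0]
```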

### 7. Boolean Indexing (Filtering)

This is one of NumPy's most powerful features. You can use a boolean array to select elements from another array. This is the foundation of filtering in Pandas.

In [42]:
scores_1d = np.array([77, 95, 81, 68, 92, 88, 79, 99, 65, 85])
print(f"All scores: {scores_1d}")

All scores: [77 95 81 68 92 88 79 99 65 85]


In [43]:
# Step 1: Create a boolean mask
# This creates a new array of True/False values
boolean_mask = scores_1d > 90
print(boolean_mask)

[False  True False False  True False False  True False False]


In [44]:
# Step 2: Use the mask to filter the original array
# This selects only the elements where the mask is True
scores_1d[boolean_mask]

array([95, 92, 99])

In [45]:
# You can also combine conditions using & (and) and | (or)
# Note: you MUST use parentheses around each condition
scores_1d[(scores_1d>80) & (scores_1d<90)]

array([81, 88, 85])
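
Boolean masks can do more than select values. The following sketch, reusing the same scores, counts matches, negates a mask with `~`, assigns through a mask, and uses `np.where` to pick between two labels. The 80-point pass mark is an arbitrary assumption for illustration.

```python
import numpy as np

scores_1d = np.array([77, 95, 81, 68, 92, 88, 79, 99, 65, 85])

high = scores_1d > 90

# True counts as 1, so summing a mask counts the matches.
print(high.sum())          # 3

# ~ inverts the mask: everything that is NOT above 90.
print(scores_1d[~high])

# Assigning through a mask changes only the selected elements.
capped = scores_1d.copy()
capped[capped > 90] = 90
print(capped.max())        # 90

# np.where picks one of two values per element, based on a condition.
labels = np.where(scores_1d >= 80, "pass", "fail")
print(labels)
```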

In [46]:
arr = np.random.randint(1,100,20).reshape((4,5))
arr

array([[ 1, 13, 33, 15, 95],
       [75, 70, 82, 73, 73],
       [ 9, 23, 18,  1, 36],
       [32, 82, 13, 76, 74]], dtype=int32)

In [47]:
arr[arr<50]

array([ 1, 13, 33, 15,  9, 23, 18,  1, 36, 32, 13], dtype=int32)

### 8. Fancy Indexing

**Fancy indexing** is a term for using an *array of indices* to access multiple array elements at once. While slicing selects *regularly spaced ranges* of elements, fancy indexing lets you select **any combination** of elements, in any order, and even repeat them.

In [48]:
# --- 1D Fancy Indexing ---
lst_1d = np.random.randint(10,20,10)
lst_1d

array([16, 19, 14, 17, 18, 15, 14, 13, 10, 16], dtype=int32)

In [49]:
# A plain Python list accepts only a single index or a slice, not a list of indices
lst = [1,2,3,4,5,6,7,8,9]
lst[6]
lst[2:8:2]

[3, 5, 7]

In [50]:
# Create an array of indices to select
ind_value = np.array([0,5,6,9])
ind_value

array([0, 5, 6, 9])

In [51]:
# Use the indices to 'pick' elements from the original array
lst_1d[ind_value]

array([16, 15, 14, 16], dtype=int32)

In [52]:
lst_1d[[0,5,6,9]]

array([16, 15, 14, 16], dtype=int32)

In [53]:
# --- 2D Fancy Indexing ---
arr_2nd = np.arange(12).reshape(4,3)
arr_2nd

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [54]:
# Select specific rows in a desired order
arr_2nd[[0,2,3]]
arr_2nd[[2,0,1]]

array([[6, 7, 8],
       [0, 1, 2],
       [3, 4, 5]])

In [55]:
# Select specific elements using (row, column) coordinates
row_ind = [0,2,3]
col_ind = [1,2,0]

In [56]:
# This will select elements at (0,1), (2,2), and (3,0)
arr_2nd[row_ind,col_ind]

array([1, 8, 9])
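
Two follow-up points, sketched below with the same `arr_2nd` array: fancy indexing returns a copy (unlike slicing, which returns a view), and `np.ix_` builds a row-by-column grid selection when you want every combination of the given rows and columns rather than paired coordinates.

```python
import numpy as np

arr_2nd = np.arange(12).reshape(4, 3)

# Fancy indexing returns a copy, so this write does not touch arr_2nd.
picked = arr_2nd[[0, 2]]
picked[0, 0] = -1
print(arr_2nd[0, 0])   # 0

# np.ix_: rows 0 and 2 crossed with columns 1 and 2 (a 2x2 sub-grid).
sub_grid = arr_2nd[np.ix_([0, 2], [1, 2])]
print(sub_grid)
# [[1 2]
#  [7 8]]
```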

---

### Hands-On Lab: Matrix Calculations and Score Filtering

**Scenario:** You have the scores of 5 students on 4 different assignments. Your task is to perform several analyses using NumPy.

In [62]:
# Here is the starting data

student_scores = np.array([
    [85, 90, 88, 92],  # Student 1
    [78, 81, 84, 80],  # Student 2
    [92, 95, 91, 97],  # Student 3
    [65, 70, 68, 72],  # Student 4
    [88, 85, 89, 94]   # Student 5
])

# Task 1: Calculate the average score for each student (row-wise average).
student_avg = student_scores.mean(axis=1)
print("The average of each student is: ", student_avg)

# Task 2: Calculate the average score for each assignment (column-wise average).
assignment_avg = student_scores.mean(axis=0)
print("The average of each assignment is: ", assignment_avg)

# Task 3: The professor decides to curve all scores by adding 3 points. Create a new array `curved_scores`.
curved_scores = student_scores + 3
curved_scores

# Task 4: Find all the original scores that were 90 or above.
org = student_scores[student_scores >= 90]
org



The average of each student is:  [88.75 80.75 93.75 68.75 89.  ]
The average of each assignment is:  [81.6 84.2 84.  87. ]


array([90, 92, 92, 95, 91, 97, 94])

### Assignment: Analyze Numeric Data and Filter Top 10% Scores

**Your Task:**
You are given a larger dataset of 100 final exam scores. Your goal is to perform a statistical analysis and identify the students who are in the top 10%.

1.  **Create the NumPy array** provided below.
2.  **Calculate and print** the following statistics for the entire dataset:
    *   Mean score
    *   Median score (Hint: `np.median()`)
    *   Standard Deviation
    *   Minimum and Maximum scores
3.  **Determine the threshold** for the top 10% of scores. A score needs to be greater than the 90th percentile to be in the top 10%. (Hint: Use `np.percentile(array, 90)`).
4.  **Use boolean indexing** to create a new array called `top_performers` that contains only the scores which are in the top 10%.
5.  **Print** the `top_performers` array and the total number of students in this group.
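
As a warm-up before working on the real dataset, here is how the two hinted functions behave on a tiny made-up array (`toy_scores` is not part of the assignment data). With NumPy's default settings, `np.percentile` interpolates linearly between neighbouring values.

```python
import numpy as np

toy_scores = np.arange(1, 11)   # 1, 2, ..., 10 (made-up data)

# Median of an even-length array is the mean of the two middle values.
print(np.median(toy_scores))    # 5.5

# 90th percentile: roughly 9.1 here, between the 9th and 10th values.
threshold = np.percentile(toy_scores, 90)
print(threshold)

# Only values strictly above the threshold are kept.
above = toy_scores[toy_scores > threshold]
print(above)                    # [10]
```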

In [None]:
# 1. Starting data: 100 final exam scores
np.random.seed(10)
final_exam_scores = np.random.normal(loc=78, scale=12, size=100).clip(0, 100).round(1)
print(f"--- Original Dataset (first 10 scores) ---\n{final_exam_scores[:10]}\n")

# 2. Calculate statistics
# YOUR CODE HERE

# 3. Determine the top 10% threshold (90th percentile)
# YOUR CODE HERE

# 4. Filter to get the top performing scores
# YOUR CODE HERE

# 5. Print the results
# YOUR CODE HERE

--- Original Dataset (first 10 scores) ---
[94.  86.6 59.5 77.9 85.5 69.4 81.2 79.3 78.1 75.9]

