# NumPy
---

### The accompanying worksheet for this notebook can be found [here](https://docs.google.com/document/d/17XaTuwDCjqjSJSY-PujIgH4YA-yiO0ayZjt0MjsuWnc/edit?usp=sharing).

### Concepts covered:
* Creating arrays
* Indexing, slicing, and changing data
* Creating data masks
* Methods for creating arrays
* Methods for analyzing arrays
* Practical with pollutant concentrations
---
Notebook and worksheet by Alice Hsu (Oct 2025)

# ![](https://upload.wikimedia.org/wikipedia/commons/thumb/3/31/NumPy_logo_2020.svg/320px-NumPy_logo_2020.svg.png) 

NumPy is a Python package used for handling large amounts of numerical data. It is very commonly used for working with multi-dimensional data (e.g., 2D, 3D, and even 4D and up!), such as geospatial datasets. NumPy is typically imported using the np alias:

                                    import numpy as np

For example, a geospatial dataset might contain data for each <span style="color:blue">latitude</span>, <span style="color:red">longitude</span>, and <span style="color:darkorange">time step</span> - this would give you a 3D dataset, where the <span style="color:blue">rows</span> represent the <span style="color:blue">latitudes</span>, <span style="color:red">columns</span> represent the <span style="color:red">longitudes</span>, and the <span style="color:darkorange">layers</span> represent the <span style="color:darkorange">time steps</span>.

In [38]:
import numpy as np
import matplotlib.pyplot as plt

### Lists vs NumPy Arrays
Lists and NumPy arrays seem similar at first, but they function very differently.

* **Lists** can hold **different data types** at a time. However, **computations must be done on each individual element**.
    * If a list has 100 elements, to perform an operation on everything in the list, **you have to do 100 computations**.
* **NumPy arrays** can only hold one data type at a time. However, **computations are <u>vectorized</u>**:
    * You can perform a mathematical operation on every element at once.
    * If a NumPy array has 100 elements, to perform an operation on everything in the array, **you have to do only 1 computation**.
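
The difference can be sketched with a small example. The names `values_list`, `values_arr`, and `mixed` are made up for this illustration and are not used elsewhere in the notebook.

```python
import numpy as np

# Hypothetical example data (illustration only)
values_list = [1.5, 2.5, 3.5]
values_arr = np.array(values_list)

# With a list, each element has to be visited explicitly
list_plus_ten = [v + 10 for v in values_list]

# With an array, one expression is applied to every element
arr_plus_ten = values_arr + 10

print(list_plus_ten)  # [11.5, 12.5, 13.5]
print(arr_plus_ten)   # [11.5 12.5 13.5]

# An array holds a single data type: the integer 1 is stored as a float here
mixed = np.array([1, 2.5])
print(mixed.dtype)    # float64
```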

In [5]:
list_a = [1, 2, 3, 4]
a1d = np.array(list_a)
a1d

array([1, 2, 3, 4])

In [3]:
a2d = np.array([[10.,20,30,40], [9,8,5,3], [1,2,3,4]])
a2d

array([[10., 20., 30., 40.],
       [ 9.,  8.,  5.,  3.],
       [ 1.,  2.,  3.,  4.]])

<span style="color:blue">Check the shape of `a2d` using the `.shape` property.</span>

In [None]:
#### YOUR CODE HERE ####


<span style="color:blue">The first element of a2d can be accessed using the code: `a2d[0,0]`. Check the data type of this value using the `type()` function.

In [None]:
#### YOUR CODE HERE ####


### <span style="color:darkorange"> Mathematical Operations on NumPy Arrays

We can perform most basic mathematical functions on whole NumPy Arrays.

**Perform the following operations on the NumPy array `a1d`** using the syntax written in the worksheet:
* Multiply by 2
* Divide by 2
* Exponentiate by 2
* Add 2 

In [None]:
#### YOUR CODE HERE ####
a1d

In [None]:
#### YOUR CODE HERE ####
a1d

In [None]:
#### YOUR CODE HERE ####
a1d

In [None]:
#### YOUR CODE HERE ####
a1d

You can also use **boolean (comparison) operators** on NumPy arrays. These compare every element at once and return an array of `True`/`False` values:
* Greater than (`>`), Greater than or equal to (`>=`)
* Less than (`<`), Less than or equal to (`<=`)
* Equal to (`==`)
* Not equal to (`!=`)

In [104]:
a1d>2

array([False, False,  True,  True])

In [105]:
a1d==3

array([False, False,  True, False])

In [107]:
a1d!=1

array([False,  True,  True,  True])

<span style="color:blue"> Multiply `list_a` by 2. How does the output differ from the output of `a1d*2`?

In [7]:
#### YOUR CODE HERE ####
list_a

You can perform **element-wise operations** on different NumPy arrays (or subsets of NumPy arrays), such as adding, subtracting, or multiplying together two arrays. This works when the arrays **are the same shape** (NumPy can also combine some differently shaped arrays, as long as their shapes are compatible).

You can check the shape of a NumPy array using the `.shape` property.

For example:

In [19]:
b1d = np.array([2,4,8,12])

print(a1d+b1d)
print(a1d*b1d)
print(a1d**b1d)

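But we couldn't do the following, because a 4-value array and a 3-value array do not have matching shapes. Running this cell raises a `ValueError`. (Note that `a1d*a2d` does run: since every row of `a2d` has 4 values, NumPy applies `a1d` to each row.)

In [None]:
a1d*np.array([1,2,3])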

## Indexing and Slicing NumPy Arrays

### Indexing

#### 1D Arrays

Indexing works much like lists, in that **indexing starts at 0**, and you can also **access things from the end via a negative index**:

| Index |0|1|2|3|
|:-------|-------|----|-------|---------|
|`a1d`|`1`|`2`|`3`|`4`|
|**Negative Index**|**-4**|**-3**|**-2**|**-1**|

Thus, to access the last value in `a1d`, you could type `a1d[3]` or `a1d[-1]`, which would return `4`.

<span style="color:darkorange"><u>**Some Quick Indexing Exercises in 1D**</u></span>:

**Access the `2` from `a1d` using both its positive and negative index.**

In [None]:
#### YOUR CODE HERE ####
a1d[]

In [None]:
#### YOUR CODE HERE ####
a1d[]

#### 2D Arrays
However, to access a value in a 2D array, you must specify **the row _and_ the column** your value is in.

For example, let's access some values in `a2d`:

||**0**|**1**|**2**|**3**|
|-|:-------:|:-------:|:----:|:-------:|
|**0**|`10.`|`20.`|`30.`|`40.`|
|**1**|`9.`|`8.`|`5.`|`3.`|
|**2**|`1.`|`2.`|`3.`|`4.`|

To access the value `30`, in the first row and third column, you would type: `a2d[0,2]`.
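
As a quick sketch of the same idea on different data, the arrays `row` and `grid` below are made up for illustration. Positive and negative indices can be mixed in a 2D index.

```python
import numpy as np

# Hypothetical arrays (illustration only)
row = np.array([5, 6, 7, 8])
grid = np.array([[11, 12, 13],
                 [21, 22, 23]])

print(row[0])       # first element: 5
print(row[-1])      # last element: 8
print(grid[1, 0])   # second row, first column: 21
print(grid[0, -1])  # first row, last column: 13
```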

<span style="color:darkorange"><u>**Some Quick Indexing Exercises in 2D**</u></span>:

**How would you access the value `5` in `a2d`?**

In [None]:
#### YOUR CODE HERE ####
a2d[]

**Come up with two different ways to access the value `40` in `a2d`.**

In [None]:
#### YOUR CODE HERE ####
a2d[]

In [None]:
#### YOUR CODE HERE ####
a2d[]

### Boolean Indexing/Masks

Another way to extract data you want out of a NumPy array is to use a **boolean mask**.

A **boolean mask** is just an array of `True` and `False`, where the locations of the `True` correspond to the data you want to extract, and the `False` correspond to the locations of the data you want to mask out.

A simple example:

Consider an array, `x`.

In [112]:
x = np.array([1,3,-3,2,8,4])

Suppose we wanted to extract all values **less than 3**. The boolean mask would then be:

In [113]:
x<3

array([ True, False,  True,  True, False, False])

Notice how the **boolean mask has the same shape as the data we are masking**.

**Applying the mask** would then **return all the True values**, or all the values of x that are less than 3:

In [114]:
x[x<3]

array([ 1, -3,  2])

<span style="color:darkorange"> Use a boolean mask to extract all the values in `a2d` greater than 4.

In [None]:
#### YOUR CODE HERE ####


### Slicing

You can also access parts, or slices, of your dataset by using the colon to specify the range you want. Writing `start:stop` selects everything from index `start` up to, but not including, index `stop`. A colon on its own selects everything along that row or column.

Here, `a2d[0,1:3]` would extract the values in the 0th row, in the second (index 1) and third (index 2) columns, returning `array([20., 30.])`.

Note that when you use the colon to specify a range, this range is not inclusive of the number specified on the right side - i.e., the `1:3` does not include the value with the index of 3.
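
A few slicing patterns are sketched below on a made-up 3x4 array called `demo` (an illustrative name, not part of the worksheet).

```python
import numpy as np

# Hypothetical 3x4 array (illustration only)
demo = np.array([[0, 1, 2, 3],
                 [4, 5, 6, 7],
                 [8, 9, 10, 11]])

first_row_middle = demo[0, 1:3]  # row 0, columns 1 and 2 -> [1 2]
first_column = demo[:, 0]        # every row, column 0    -> [0 4 8]
second_row = demo[1, :]          # row 1, every column    -> [4 5 6 7]
top_right = demo[:2, 2:]         # rows 0-1, columns 2-3  -> [[2 3], [6 7]]

print(first_row_middle)
print(first_column)
print(second_row)
print(top_right)
```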

In [146]:
a2d

array([[10., 20., 30., 40.],
       [ 9.,  8.,  5.,  3.],
       [ 1.,  2.,  3.,  4.]])

<span style="color:blue"><u>**WKSH EX 1**</u></span>: Check your answers here.

In [9]:
#### YOUR CODE HERE ####


<span style="color:blue"><u>**WKSH EX 2**</u></span>: Check your answers here.

In [None]:
#### YOUR CODE HERE ####


<span style="color:orange"><u>**iPYNB EX 1:**</u></span> **Consider the 2D array, `my_arr`.**

* Check the **shape of `my_arr`** using the `.shape` property.
* Check the **shape of the <u>first row</u> of `my_arr`** using the `.shape` property.
* **Subtract the second column** of `my_arr` from the **last column** of `my_arr`.
* **Change the number 21** in `my_arr` to the number 2.1.

In [22]:
my_arr = np.array([[ 2. ,  3.2,  5.5, -6.4, -2.2,  2.4],
                   [ 1. , 22. ,  4. ,  0.1,  5.3, -9. ],
                   [ 3. ,  1. ,  2.1, 21. ,  1.1, -2. ]])
my_arr

array([[ 2. ,  3.2,  5.5, -6.4, -2.2,  2.4],
       [ 1. , 22. ,  4. ,  0.1,  5.3, -9. ],
       [ 3. ,  1. ,  2.1, 21. ,  1.1, -2. ]])

In [31]:
#### YOUR CODE HERE ####

# Check the shape of my_arr


# Check the shape of the first row of my_arr


# Subtract the second column of my_arr from the last column of my_arr


# Change the number 21 in my_arr to 2.1


<span style="color:blue"><u>**WKSH EX 3**</u></span>: Check your answers here.

In [16]:
T = np.array([[31, 37, 35, 34, 31, 29, 32],
              [44, 46, 47, 45, 39, 39, 42],
              [40, 42, 31, 44, 33, 38, 37]])

In [None]:
#### YOUR CODE HERE ####


## Methods and Properties of NumPy Arrays

NumPy arrays have a range of useful properties and methods that make basic analysis very easy.  The syntax for using a method or property is your variable + a period + the method or property, with any relevant inputs to the method in parentheses. Note that properties do not have any inputs, so you don’t need the parentheses to call them. In the table below, the following methods are being performed on an array called `my_array`.

|Method/Property|Description|
|--|:--|
|`my_array.shape`|Get the shape (i.e., the dimensions; # of rows, columns, layers, etc.) of the array|
|`my_array.min()`, `my_array.max()`|Find the minimum or maximum value in the array. You can specify the axis you want to take the minimum or maximum across. For example, `my_array.min(axis=0)` will find the minimum value across all the rows (i.e., one for each column).|
|`my_array.argmin()`, `my_array.argmax()`|Find the index of the minimum or maximum value in the array. If you don’t specify an axis, the index refers to the flattened (1D) version of the array. You can specify the axis for which you want to find the minimum or maximum’s index. For example, `my_array.argmin(axis=0)` will find the index of the minimum value across all the rows in your array (i.e., one for each column).|
|`my_array.sum()`|Calculate the sum of all the values in your array. You can specify the axis for which you want to compute the sum. For example, `my_array.sum(axis=0)` will sum up the values across all the rows in your array (i.e., one for each column).|
|`my_array.mean()`, `my_array.std()`|Calculate the mean or standard deviation of all the values in your array. You can specify the axis for which you want to compute the mean or standard deviation. For example, `my_array.mean(axis=0)` will compute the mean of all the values across all the rows in your array (i.e., one for each column).|
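
The effect of the `axis` argument can be sketched on a small made-up array, `scores` (an illustrative name only): `axis=0` collapses the rows and gives one result per column, while `axis=1` collapses the columns and gives one result per row.

```python
import numpy as np

# Hypothetical 2x3 array (illustration only)
scores = np.array([[1, 5, 3],
                   [4, 2, 6]])

print(scores.shape)           # (2, 3)
print(scores.min())           # 1, the smallest value overall
print(scores.min(axis=0))     # [1 2 3], one minimum per column
print(scores.min(axis=1))     # [1 2], one minimum per row
print(scores.argmax())        # 5, position of the 6 in the flattened array
print(scores.argmax(axis=0))  # [1 0 1], row index of the maximum in each column
print(scores.sum(axis=1))     # [ 9 12], one sum per row
```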

### <span style="color:darkorange"><u>**Methods and Properties Examples:**</u></span>

For the examples below, recall the array we made above, `a2d`:

In [29]:
a2d

array([[10., 20., 30., 40.],
       [ 9.,  8.,  5.,  3.],
       [ 1.,  2.,  3.,  4.]])

In [22]:
a2d.shape

(3, 4)

In [30]:
a2d.max()

In [None]:
a2d.argmax()

In [None]:
a2d.sum()

In [None]:
a2d.mean()

What are the differences in each of these outputs using the `.mean()` method?

In [28]:
a2d.mean()

In [None]:
a2d.mean(axis=0)

In [None]:
a2d.mean(axis=1)

<span style="color:blue"><u>**WKSH EX 4**</u></span>: Check your answers here.

In [None]:
#### YOUR CODE HERE ####


### Other Useful NumPy Functions
There are hundreds of NumPy functions you can use to create or modify arrays, or analyze the data contained within them. You can find them as you need them in the official NumPy documentation. Here are a few commonly used ones to get you started:

<center><b><u>Functions for Creating Arrays</u></b></center>
    
|Function|Parameters|Description|
|--|:--|:-|
|`np.linspace(x1,x2,N)`|**x1, x2**: start and end points<br>**N**: number of points you want, including the start and end points; if you don’t specify N, it defaults to 50.|Creates a 1D array containing N evenly spaced numbers from x1 to x2 (x2 is included). Use it when you want to specify the number of points between your domain limits.|
|`np.arange(x1,x2,dx)`|**x1, x2**: start and end points<br>**dx**: the spacing you want between consecutive values; if you don’t specify dx, it defaults to 1.|Creates a 1D array containing numbers from x1 up to, but not including, x2 in intervals of dx. Use it when you want to specify the size of the interval between your domain limits.|
|`np.zeros((n,m))`<br>`np.ones((n,m))`<br>`np.full((n,m),fill_val)`|**n**: number of rows<br>**m**: number of columns <br>**fill_val**: what to fill your array with|Creates an array with the dimensions n x m filled with zeros, ones, or a chosen fill value. Note that the shape is passed as a single tuple, `(n,m)`.|
|`np.random.randint(x1,x2,N)`|**x1, x2**: lower and upper limits<br>**N**: number of random integers you want; if you don’t specify N, a single integer is returned.|Creates an array of N random integers from x1 up to, but not including, x2. You can alternatively create 2D or higher arrays by passing a shape to the `size` keyword argument.|
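
A short sketch of these functions is given below. The variable names and the numbers are made up for illustration. Note in particular that `np.linspace` includes the end point while `np.arange` stops before it, and that `np.zeros` and `np.full` take the shape as a tuple.

```python
import numpy as np

# 5 evenly spaced points from 0 to 1, end point included
pts = np.linspace(0, 1, 5)     # [0.   0.25 0.5  0.75 1.  ]

# Steps of 0.25 from 0 up to, but not including, 1
steps = np.arange(0, 1, 0.25)  # [0.   0.25 0.5  0.75]

# 2x3 arrays: the shape is passed as a tuple
zeros = np.zeros((2, 3))
sevens = np.full((2, 3), 7)

# 10 random integers from 1 to 6 (the upper limit, 7, is excluded)
rolls = np.random.randint(1, 7, 10)

print(pts)
print(steps)
print(zeros.shape, sevens.shape, rolls.shape)
```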

<center><b><u>Functions for Modifying the Shape of your Array</u></b></center>

|Method/Property|Description|
|--|:--|
|`my_array.ravel()`|Flattens the array, i.e., turns it into a 1D array. For example, a 2x3 array would become a 1D array of 6 values.|
|`my_array.reshape(n,m)`|Changes the shape of the array to n x m, provided the number of data points in the array matches the new shape: n x m must equal the total number of data points. For example, a 1D array with 20 values can be reshaped into a 4x5, a 5x4, a 2x10, or a 10x2, but not a 3x7.|
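
The reshaping rule can be sketched with a made-up array of 12 values (`flat`, `table`, and `back` are illustrative names only). A reshape succeeds when the new dimensions multiply to the number of values and raises a `ValueError` otherwise.

```python
import numpy as np

# Hypothetical data: the 12 integers 1..12 (illustration only)
flat = np.arange(1, 13)

# 3 x 4 = 12, so this reshape works; values fill the array row by row
table = flat.reshape(3, 4)
print(table.shape)  # (3, 4)

# ravel() flattens the 2D array back into 1D
back = table.ravel()
print(back.shape)   # (12,)

# 5 x 3 = 15 does not match 12 values, so NumPy raises a ValueError
reshape_failed = False
try:
    flat.reshape(5, 3)
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```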

### <span style="color:darkorange"><u>**Creating and Reshaping Arrays Examples:**</u></span>

**Run the following cells** and see what outputs they produce. You can use the `.shape` property to check the shape of the outputs.

In [None]:
np.linspace(1,100,51)

In [None]:
np.arange(1,100,5)

In [None]:
np.zeros((3,4))

In [None]:
np.full((3,4),99)

<span style="color:blue"><u>**WKSH EX 5**</u></span>: Check your answers here.

In [None]:
#### YOUR CODE HERE ####


<span style="color:darkorange"><u>**iPYNB EX 6**</u></span>: **Temperature and Time**

<span style="color:darkorange">**(a)**</span>
Create a NumPy array called `t` representing a time domain from t = 0 to t = 30 days, with 2 day intervals.

In [None]:
#### YOUR CODE HERE ####
t = 

<span style="color:darkorange">**(b)**</span>
You have a dataset stored in a NumPy array called `temp123` containing **210 temperature measurements** taken across 210 days.

In [21]:
temp123 = np.random.randint(20,40,210)
temp123

array([38, 28, 24, 37, 20, 36, 20, 24, 32, 34, 33, 22, 30, 34, 23, 23, 33,
       20, 25, 38, 26, 36, 36, 32, 35, 32, 25, 26, 35, 30, 20, 34, 24, 28,
       39, 34, 21, 21, 32, 38, 35, 28, 35, 24, 39, 36, 31, 26, 38, 38, 26,
       22, 20, 39, 32, 34, 22, 33, 25, 34, 37, 21, 28, 38, 26, 35, 26, 25,
       31, 20, 34, 26, 24, 38, 21, 31, 22, 24, 39, 35, 33, 32, 20, 21, 32,
       28, 36, 34, 23, 39, 30, 31, 22, 27, 36, 33, 29, 39, 35, 25, 39, 26,
       29, 26, 26, 20, 38, 37, 23, 22, 20, 22, 23, 32, 23, 33, 32, 36, 33,
       31, 33, 33, 26, 27, 22, 34, 35, 36, 26, 34, 24, 33, 25, 25, 27, 23,
       33, 29, 37, 26, 34, 31, 38, 39, 22, 36, 31, 29, 22, 37, 33, 29, 37,
       23, 34, 32, 21, 23, 33, 38, 22, 32, 27, 34, 21, 24, 39, 20, 23, 23,
       35, 21, 37, 26, 21, 22, 25, 21, 24, 24, 23, 24, 30, 34, 25, 27, 20,
       36, 37, 21, 24, 26, 34, 33, 23, 26, 23, 24, 22, 29, 34, 37, 20, 34,
       33, 36, 32, 24, 37, 36])

You want to compute the weekly mean of these temperature measurements - i.e., the **mean temperature for every 7 days of data**. What code could you write to do this?

**Hint 1**: you want to transform your data from a 1D array of 210 values to an array with 2 dimensions. If one of the dimensions is 7, then what will the other dimension be?

**Hint 2**: you can use the `.reshape()` and `.mean()` methods.

In [None]:
### YOUR CODE HERE ####
temp123.reshape()

# <span style="color:darkorange">Practical: Pollutant Concentrations in a Lake </span>

Consider a **3x7 array `C0`** containing surface measurements of the initial contaminant concentration (in μM) for a lake at t = 0. The array **`t` represents days after t = 0** at which you would like to calculate concentrations.

||1km|2km|3km|4km|5km|6km|7km|
|-|-|-|-|-|-|-|-|
|**2km**|7|7|6|5.5|5|4|3|
|**4km**|9|8.5|8|8|7.5|6|5|
|**6km**|6|6|5.5|5|5|4|2|

In [135]:
C0 = np.array([[7,7,6,5.5,5,4,3],
                [9,8.5,8,8,7.5,6,5],
                [6,6,5.5,5,5,4,2]])

x = np.arange(1,8)
y = np.array([2,4,6])

This function will be used to visualize your data to see if it is behaving the way you expect it to. **Just run the cell below for now** - we'll use it later.

In [1]:
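def plot_C(C):
    """Draw one filled contour plot per 2D concentration array in C."""
    # Wrap a single 2D array in a list so the loop below also works for it
    if C.ndim < 3:
        C = [C]
    for c in C:
        # Build the coordinate grid from the x and y arrays defined above
        meshx,meshy = np.meshgrid(x,y)
        fig,ax = plt.subplots()
        ax.set_yticks([2,4,6])
        # Fixed contour levels (0 to 9 in steps of 0.5) keep the colours comparable between plots
        a = ax.contourf(meshx,meshy,c,levels=np.linspace(0,9,19))
        fig.colorbar(a)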

Suppose the contaminant decays according to a first order reaction such that the concentration at a given time is given by the equation:

<center>$C(t) = C_0 e^{-kt}$</center>

where _k_ is a rate constant equal to 0.01 $\mathrm{day}^{-1}$, $C_0$ is the initial concentration at each point in the lake (the array `C0` above, so that $C(0) = C_0$), and _t_ is the time that has passed, in days.

**Note**: exponentiation with e in Python can be done using the `exp()` function in the NumPy package. For example, to compute e$^2$ in Python, you would write: `np.exp(2)`
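
Like the other operations in this notebook, `np.exp()` also works element-wise on whole arrays. Here is a minimal sketch, where `powers` is a made-up array used only for illustration:

```python
import numpy as np

# e raised to a single number
e_squared = np.exp(2)

# e raised to each element of an array (hypothetical values, illustration only)
powers = np.array([0.0, 1.0, 2.0])
e_to_powers = np.exp(powers)      # e^0, e^1, e^2
e_to_negative = np.exp(-powers)   # e^0, e^-1, e^-2: values shrink as the power grows

print(e_squared)
print(e_to_powers)
print(e_to_negative)
```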

**1. Using the equation for *C(t)* and the value given for *k*, calculate the following arrays:**
* `C7`, a 3x7 array representing the contaminant concentration when t = 7 days
* `C14`, a 3x7 array representing the contaminant concentration when t = 14 days
* `C21`, a 3x7 array representing the contaminant concentration when t = 21 days
* `C28`, a 3x7 array representing the contaminant concentration when t = 28 days

In [None]:
#### YOUR CODE HERE ####
k = 
C7 =
C14 = 
C21 = 
C28 = 

**2. Create a 3D NumPy array `Ct` that contains `C0`, `C7`, `C14`, `C21`, and `C28`, in order of the concentration with increasing time.**

In [81]:
#### YOUR CODE HERE ####


You can **plot `Ct` below** to see if it is behaving as you would expect - **the concentration should decrease with time**.

In [None]:
plot_C(Ct)

**3. You want to know the <u>average pollutant concentration of the lake over the whole time period</u>, from 0 to 28 days.**

Calculate this using `Ct` and the `np.mean()` function and save this to a variable called `C_avg`.

_Hint: Think about what shape `C_avg` should have if it must represent the concentration for the entire lake._

In [91]:
#### YOUR CODE HERE ####
C_avg = 

Suppose you want to know the **average pollutant concentrations at y = 3 km**, even though we don't have a data point for this stretch of the lake.

We can interpolate these concentrations by averaging the concentrations at y = 2 km and y = 4 km. This is the equivalent of taking the concentrations $c_{2,1-7}$ and $c_{4,1-7}$ and averaging them.

||1km|2km|3km|4km|5km|6km|7km|
|-|-|-|-|-|-|-|-|
|**2km**|$c_{2,1}$|$c_{2,2}$|$c_{2,3}$|$c_{2,4}$|$c_{2,5}$|$c_{2,6}$|$c_{2,7}$|
|**4km**|$c_{4,1}$|$c_{4,2}$|$c_{4,3}$|$c_{4,4}$|$c_{4,5}$|$c_{4,6}$|$c_{4,7}$|
|**6km**|$c_{6,1}$|$c_{6,2}$|$c_{6,3}$|$c_{6,4}$|$c_{6,5}$|$c_{6,6}$|$c_{6,7}$|

Using `C_avg`, write some code that returns the pollutant concentrations at y = 3 km. The steps should help break it down for you:
1. From `C_avg`, extract the row of data at 2 km.
    1. Think about the **index** that corresponds to the data at y = 2 km.
2. From `C_avg`, extract the row of data at 4 km.
    1. Think about the **index** that corresponds to the data at y = 4 km.
3. Average the two rows of data.

In [None]:
C_avg[]

**4.** You want to know **which points** in the lake have a **concentration above 4 μM at each time step in `Ct`**, and **what those concentrations are**.

**(a)** First, create a boolean mask: your result should have five 3x7 arrays that are `True` where the concentration is >4 μM and `False` otherwise.

In [None]:
#### YOUR CODE HERE ####
C4_mask = 

**(b)** Now, apply the boolean mask `C4_mask` you created above to `Ct` to extract the actual concentrations >4 μM. 

In [None]:
### YOUR CODE HERE ###
Ct[]

You can plot your different concentration arrays here (`C0`, `C7`, `C21`, etc) to check if they are behaving as expected.

In [None]:
plot_C(Ct)

## Course Evaluation

Please let me know what you thought of this Data Analysis in Python Course! [This course evaluation](https://forms.office.com/Pages/ResponsePage.aspx?id=lYdfxj26UUOKBwhl5djwkFtIujJ9lCFMouysTWFV3rRUN09CVDJCSk9LTDZGVlk0VTFMWklWVlRSUC4u) will take just 2-3 minutes :).