# Python array manipulation

This script shows several ways to manipulate Python lists and Numpy arrays to make your code more efficient.

- Created by: Tomer Burg
- Last updated: 22 March 2022

### Let's start with the basics...

...by importing our necessary packages:

In [1]:
import numpy as np
import datetime as dt

### List comprehension

Let's say we want to create a list and populate it with the numbers 0 through 49. One possible way is to loop from 0 through 49, appending each number to the list as we go.

In [2]:
my_list = []

for i in range(50):
    my_list.append(i)

print(my_list)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]


Did you know there's a way to accomplish the above in a single line? This is what we can do with a list comprehension!

In [3]:
my_list = [i for i in range(50)]
print(my_list)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]


Let's move on to a more complicated example. We'll iterate over all numbers from 0 to 99 and retain a number only if it's even and it also appears in another list we'll compare against. Additionally, each number we retain is multiplied by 2.

In [4]:
compare_list = [12,15,18,19,20,21,22,23,24,25,26,31,36,45,46,57,68,73,80,81,83,84,87,88,93]

my_list = []
for i in range(100):
    if i % 2 == 0 and i in compare_list:
        my_list.append(i*2)

print(my_list)

[24, 36, 40, 44, 48, 52, 72, 92, 136, 160, 168, 176]


Now we'll use list comprehension to write the above in a single line:

In [5]:
compare_list = [12,15,18,19,20,21,22,23,24,25,26,31,36,45,46,57,68,73,80,81,83,84,87,88,93]

my_list = [i*2 for i in range(100) if i % 2 == 0 and i in compare_list]

print(my_list)

[24, 36, 40, 44, 48, 52, 72, 92, 136, 160, 168, 176]
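
As a small aside on efficiency, each `i in compare_list` check above scans the list from the start. A minimal sketch of one possible alternative, assuming the order of `compare_list` doesn't matter for the membership test, is to convert it to a set first (the names `compare_set` and `doubled_evens` are illustrative, not part of the original notebook):

```python
compare_list = [12, 15, 18, 19, 20, 21, 22, 23, 24, 25, 26, 31, 36, 45, 46,
                57, 68, 73, 80, 81, 83, 84, 87, 88, 93]

# A set uses hash lookups, so each `in` check no longer scans the whole list
compare_set = set(compare_list)

# Same filter and doubling as before, only the membership test differs
doubled_evens = [i * 2 for i in range(100) if i % 2 == 0 and i in compare_set]

print(doubled_evens)
```

For a list this short the difference is negligible; it tends to matter more as the comparison list grows.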


### Numpy array vectorized operations

Now that we've reviewed basic Python lists, we'll review more complex operations we can optimize with Numpy.

Let's declare a 2D numpy array of (1000x1000) dimension, all set to a value of one using `np.ones()`:

In [6]:
array = np.ones((1000,1000))

Just to illustrate how big this array is, let's count how many total entries it has - a total of 1 million entries!

In [7]:
print(len(array.flatten()))

1000000
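
Flattening the array just to count its entries creates a full copy. As a hedged alternative, the `size` attribute reports the same total directly (the name `ones_array` is used here only to keep the sketch self-contained):

```python
import numpy as np

ones_array = np.ones((1000, 1000))

# .size is the total number of elements; .shape gives the length of each dimension
print(ones_array.size)
print(ones_array.shape)
```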


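Now let's create an equally sized array, but filled with uniformly distributed random numbers between 0 and 1 (`np.random.rand` samples from the half-open interval [0, 1), so 1 itself is never returned):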

In [8]:
random_array = np.random.rand(1000,1000)

You may be wondering - what is the sum of this array?

There are two ways we can approach this. The first is to iterate over every row and column of this array with a nested for loop, adding each element to a running total.

We'll also use `datetime` to time how long this operation takes.

In [9]:
start_time = dt.datetime.now()

array_sum = 0
x,y = random_array.shape
for i in range(x):
    for j in range(y):
        array_sum += random_array[i][j]
print(f"Array sum: {np.round(array_sum,2)}")
            
end_time = dt.datetime.now()
print(f"Total seconds: {(end_time - start_time).total_seconds()}")

Array sum: 499858.36
Total seconds: 0.600386


Let's now do the same with Numpy's `np.sum()` function, and compare how long it takes:

In [10]:
start_time = dt.datetime.now()

array_sum = np.sum(random_array)
print(f"Array sum: {np.round(array_sum,2)}")
            
end_time = dt.datetime.now()
print(f"Total seconds: {(end_time - start_time).total_seconds()}")

Array sum: 499858.36
Total seconds: 0.002229


Notice how the Numpy version was **much** faster than the loop we used before! Why is this the case?

Numpy functions are vectorized, meaning the loop over elements runs "under the hood" in compiled C code, which is far more efficient for computationally expensive calculations than the Python interpreter. Numpy arrays also require every element of an array to share a single data type (e.g., int, float, etc.), whereas a variable in pure Python can hold any type (e.g., you can create a variable "var" and assign an integer to it (`var = 5`), then redefine it as a boolean (`var = True`), without crashing the code).

Because a pure Python loop has to check the data type of every element as it goes, and lacks this underlying C optimization, it runs much more slowly than the equivalent Numpy vectorized operation.
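
To make the single-data-type point concrete, here is a minimal sketch that inspects an array's `dtype` and times both approaches with the standard library's `timeit` module instead of `datetime`. The array is deliberately smaller than the one above so it runs quickly, and the names `values` and `loop_sum` are illustrative assumptions rather than part of the original notebook:

```python
import timeit

import numpy as np

values = np.random.rand(200, 200)

# Every element of a Numpy array shares one data type
print(values.dtype)

def loop_sum(a):
    """Sum a 2D array one element at a time in pure Python."""
    total = 0.0
    rows, cols = a.shape
    for i in range(rows):
        for j in range(cols):
            total += a[i, j]
    return total

# Average over a few repetitions to smooth out one-off fluctuations
repeats = 3
loop_seconds = timeit.timeit(lambda: loop_sum(values), number=repeats) / repeats
numpy_seconds = timeit.timeit(lambda: np.sum(values), number=repeats) / repeats

print(f"Loop:  {loop_seconds:.6f} s per call")
print(f"Numpy: {numpy_seconds:.6f} s per call")
```

Exact timings depend on the machine, so the numbers printed will differ from run to run.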

## Example 1

Let's do an example where we iterate over every gridpoint, and wherever `random_array` is greater than 0.5, we double that value and assign it to the same position in `array`.

We'll also use `datetime` to time how long this operation takes.

In [11]:
start_time = dt.datetime.now()

x,y = array.shape
for i in range(x):
    for j in range(y):
        if random_array[i][j] > 0.5:
            array[i][j] = random_array[i][j] * 2
            
end_time = dt.datetime.now()
print(f"Total seconds: {(end_time - start_time).total_seconds()}")

Total seconds: 1.092894


Did you know you can do the above in just one line of code? Here's an example of how!

In [12]:
start_time = dt.datetime.now()

array[random_array > 0.5] = random_array[random_array > 0.5] * 2
            
end_time = dt.datetime.now()
print(f"Total seconds: {(end_time - start_time).total_seconds()}")

Total seconds: 0.019946


Notice how the above was much faster than the original loop where we iterated over every single element! We can place a condition inside the brackets; as the two arrays have equal dimensions, the condition is evaluated for every element, and the assignment is applied only where the condition is true.

Essentially, the block of code `array[random_array > 0.5]` says that the statement after the equal sign will only be performed for elements within `array` where the equally positioned elements within `random_array` are greater than 0.5.

The following block of code `= random_array[random_array > 0.5] * 2` says that we take elements from `random_array` which are greater than 0.5, multiply them by 2, then assign them to the equally positioned elements within `array`.
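
To see what the condition inside the brackets actually produces, here is a minimal sketch on a tiny 2x2 array. The names `source`, `target`, and `mask` are illustrative, and `np.where` is shown only as one possible alternative way to express the same selection:

```python
import numpy as np

source = np.array([[0.2, 0.7],
                   [0.9, 0.4]])
target = np.ones((2, 2))

# The comparison yields a boolean array with the same shape as `source`
mask = source > 0.5
print(mask)

# Only positions where the mask is True are overwritten
target[mask] = source[mask] * 2
print(target)

# np.where picks from the second argument where the mask is True, else the third
alternative = np.where(mask, source * 2, np.ones((2, 2)))
print(alternative)
```

Storing the mask in a variable also avoids evaluating the comparison twice.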

Just for a sanity check, let's make sure the two methods produce the same array by taking the difference of their results:

In [13]:
array_slow = np.zeros((1000,1000))
array_fast = np.zeros((1000,1000))

#Using our slow method:
x,y = array_slow.shape
for i in range(x):
    for j in range(y):
        if random_array[i][j] > 0.5:
            array_slow[i][j] = random_array[i][j] * 2

#Using our fast method:
array_fast[random_array > 0.5] = random_array[random_array > 0.5] * 2

#Compare the two arrays:
print(np.max(array_slow - array_fast))
print(np.min(array_slow - array_fast))

0.0
0.0
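
Another way to run this kind of check, sketched here on a smaller array so it stays self-contained, is `np.array_equal`, which returns a single boolean instead of requiring us to inspect the maximum and minimum difference. The seed and the names `sample`, `result_loop`, and `result_mask` are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
sample = rng.random((50, 50))

# Slow method: element-by-element loop
result_loop = np.zeros(sample.shape)
rows, cols = sample.shape
for i in range(rows):
    for j in range(cols):
        if sample[i, j] > 0.5:
            result_loop[i, j] = sample[i, j] * 2

# Fast method: boolean mask
result_mask = np.zeros(sample.shape)
result_mask[sample > 0.5] = sample[sample > 0.5] * 2

# True only if both arrays have the same shape and identical elements
print(np.array_equal(result_loop, result_mask))
```

When results may differ by tiny floating-point rounding, `np.allclose` is the more forgiving choice.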


## Example 2

Let's say we have a 1-dimensional array with 500,000 elements, increasing in steps of 1 from 0 to 499,999:

In [14]:
array = np.arange(500000)

Let's print the first 10 and last 10 elements of the array to see what they contain:

In [15]:
print(array[:10]) #first 10 elements of array
print(array[-10:]) #last 10 elements of array

[0 1 2 3 4 5 6 7 8 9]
[499990 499991 499992 499993 499994 499995 499996 499997 499998 499999]


Let's now calculate the difference between each element and the one preceding it.

One way of doing this is to iterate over every element from the 2nd one to the end (since no element comes before the 1st one), and subtract the preceding element from each:

In [16]:
start_time = dt.datetime.now()

difference_array_slow = []
for i in range(1,len(array)):
    difference_array_slow.append(array[i] - array[i-1])
            
end_time = dt.datetime.now()
print(f"Total seconds: {(end_time - start_time).total_seconds()}")

Total seconds: 0.269843


Let's now perform the same operation using Numpy vectorization:

In [17]:
start_time = dt.datetime.now()

difference_array_fast = array[1:] - array[:-1]
            
end_time = dt.datetime.now()
print(f"Total seconds: {(end_time - start_time).total_seconds()}")

Total seconds: 0.00152


To convince ourselves they're the same, let's take the difference between these two arrays:

In [18]:
comparison = difference_array_fast - np.array(difference_array_slow)
print(np.max(comparison))
print(np.min(comparison))

0
0
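
Numpy also ships a built-in for this exact operation, `np.diff`. The sketch below, which uses the illustrative names `sequence`, `diff_sliced`, and `diff_builtin`, checks that it matches the slicing approach:

```python
import numpy as np

sequence = np.arange(500000)

# Slicing approach: each element minus the one preceding it
diff_sliced = sequence[1:] - sequence[:-1]

# Built-in equivalent for consecutive differences along a 1D array
diff_builtin = np.diff(sequence)

print(np.array_equal(diff_sliced, diff_builtin))
print(diff_builtin.shape)
```

Either way, the result has one fewer element than the input, since the first element has nothing before it.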
