# Code optimisation

The mass balance model for chloride was implemented as a `for` loop over an array, while the water balance model was programmed as a vector operation on a DataFrame. For scripts that take a long time to run, performance is a key consideration, so this notebook investigates different ways to complete the same task and compares their performance. Performance is not the only consideration when writing code; the others are touched upon in this notebook as well.

In [None]:
# import the required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

We start by importing the data and inspecting the contents of the file with the `head` method of the DataFrame, which returns the first five rows of data by default.

In [None]:
# read the data from Excel using pandas
df = pd.read_excel(
    "data/water_balance_data.xlsx",
    index_col=0,
    parse_dates=True,
)
df.head()

The volume of rainwater that lands on the water surface per day was calculated using the following line of code:

In [None]:
df['P'] = df['area'] * df['rain'] / 1000.

In the next section we will investigate additional options and determine which one is the fastest.

## Performance

The rainfall volume can be calculated in four different ways (probably even more, but these are the most obvious ones):
1. using a `for` loop
2. using a `for` loop with the `iterrows` method
3. using the `apply` method of a pandas DataFrame
4. using the vector operations for pandas DataFrames, which is the method used above

We put the `%%timeit` cell magic at the start of each cell to measure the execution time of the code in that cell. This makes it easy to compare the performance of the methods. Note that the computation times can vary significantly between machines, Python versions, package versions (NumPy, pandas) and operating systems.
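Outside a notebook the same kind of measurement can be made with Python's standard `timeit` module. The sketch below is only an illustration: the DataFrame `demo` and its values are made up for this example (they only mimic the `area` and `rain` columns), and the helper name `vectorised` is hypothetical.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's DataFrame: same column names, made-up values
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "area": rng.uniform(900.0, 1100.0, size=113),
    "rain": rng.uniform(0.0, 20.0, size=113),
})


def vectorised():
    # same expression as the vector operation used in this notebook
    return demo["area"] * demo["rain"] / 1000.0


# repeat() returns one total time per repetition;
# each repetition calls the function `number` times
totals = timeit.repeat(vectorised, number=100, repeat=5)
best_per_call = min(totals) / 100
print(f"best of 5: {best_per_call * 1e6:.1f} µs per call")
```

Reporting the minimum of several repetitions is a common choice because it is the value least disturbed by other processes running on the machine.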

In [None]:
%%timeit
#1 using a simple for loop

P = np.zeros(len(df))
for i in range(len(df)):
    P[i] = df['area'].iloc[i] * df['rain'].iloc[i] / 1000.0 

df['P'] = P

In [None]:
%%timeit
#2 using iterrows
df['P'] = np.nan                            # create an empty column in the dataframe with the rain 
for i, row in df.iterrows():
    P_i = row['area'] * row['rain'] / 1000.0  # calculate the rain flux for one timestep
    df.loc[i, 'P'] = P_i               # assign the value to the column in the DataFrame

In [None]:
%%timeit
#3 using the apply method

# define a function to calculate the rain flux
def get_P(row):
    return row['area'] * row['rain'] / 1000.0

# apply this function to every row in a Dataframe
df['P'] = df.apply(get_P, axis=1)

In [None]:
%%timeit
#4 using a vector operation
df['P'] = df['area'] * df['rain'] / 1000

All the methods are pretty fast because our dataset contains only 113 timesteps. However, there are large differences in computation time between the methods: the slowest method (2) is about 100 times slower than the fastest method (4).

We can also compare our methods using the four criteria attributed to Wes Dyer (make it correct, make it clear, make it concise, make it fast, in that order):

| method     | correct | clear | concise | fast |
|------------|---------|-------|---------|------|
| 1 for-loop | ✓       | x     | xx      | x    |
| 2 iterrows | ✓       | x     | x       | xx   |
| 3 apply    | ✓       | xx    | ✓       | x    |
| 4 vector   | ✓       | ✓✓    | ✓✓      | ✓✓   |

Note that especially 'clear' and 'concise' are very subjective criteria.
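Unlike 'clear' and 'concise', the 'correct' criterion can be checked objectively. The following is a minimal sketch, assuming a small made-up DataFrame `demo` in place of the real data. It computes the rain flux with all four approaches and confirms that the results agree. The `iterrows` variant is written as a list comprehension here for brevity, which differs slightly from the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: same column names as in the notebook, made-up values
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "area": rng.uniform(900.0, 1100.0, size=20),
    "rain": rng.uniform(0.0, 20.0, size=20),
})

# 1: plain for loop with positional indexing
P_loop = np.zeros(len(demo))
for i in range(len(demo)):
    P_loop[i] = demo["area"].iloc[i] * demo["rain"].iloc[i] / 1000.0

# 2: iterrows (collected in a list comprehension)
P_iterrows = np.array(
    [row["area"] * row["rain"] / 1000.0 for _, row in demo.iterrows()]
)

# 3: apply a function to every row
P_apply = demo.apply(
    lambda row: row["area"] * row["rain"] / 1000.0, axis=1
).to_numpy()

# 4: vector operation
P_vector = (demo["area"] * demo["rain"] / 1000.0).to_numpy()

# compare every result against the vectorised one
all_equal = all(
    np.allclose(P_vector, other) for other in (P_loop, P_iterrows, P_apply)
)
print(all_equal)
```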

Method 4 is clearly the winner in this case, so we chose this method to calculate the water balance. Let's time the entire model:

In [None]:
%%timeit

pan_factor = 1.2

df['P'] = df['area'] * df['rain'] / 1000.
df['E'] = df['area'] * df['evaporation'] / (1000. * pan_factor)
df['dV'] = -df['volume'].diff(periods=-1)
df['I'] = df['P'] - df['E'] - df['dV']

Can we make it even faster? Looking at the code, we can see that both `df['rain']` and `df['evaporation']` are multiplied by `df['area']` and divided by 1000. We could define a variable that is equal to the area divided by 1000 and multiply it by the difference between rainfall and evaporation, which reduces the number of arithmetic operations. Let's check if it makes a big difference.

In [None]:
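%%timeit

# variables assigned inside a %%timeit cell do not persist, so define pan_factor again
pan_factor = 1.2
f = df["area"] / 1000.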
df['dV'] = -df['volume'].diff(periods=-1)
df['infiltration'] = (df['rain'] - df['evaporation'] / pan_factor) * f - df['dV']

It seems to be somewhat faster, but the code becomes a little less readable. Whether the gain in speed outweighs the loss of readability is a matter for debate.
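Before trading readability for speed, it is worth confirming that the rewritten expression gives the same answer. The sketch below assumes a made-up DataFrame `demo` with the same column names as the real data and compares the two formulations. The last row is `NaN` in both because `diff(periods=-1)` has no following value there, hence `equal_nan=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: same column names as in the notebook, made-up values
rng = np.random.default_rng(2)
n = 50
demo = pd.DataFrame({
    "area": rng.uniform(900.0, 1100.0, size=n),
    "rain": rng.uniform(0.0, 20.0, size=n),
    "evaporation": rng.uniform(0.0, 8.0, size=n),
    "volume": rng.uniform(4000.0, 5000.0, size=n),
})
pan_factor = 1.2

# volume change, identical in both formulations
dV = -demo["volume"].diff(periods=-1)

# original formulation: separate rain and evaporation terms
P = demo["area"] * demo["rain"] / 1000.0
E = demo["area"] * demo["evaporation"] / (1000.0 * pan_factor)
I_original = P - E - dV

# factored formulation: area / 1000 is computed only once
f = demo["area"] / 1000.0
I_factored = (demo["rain"] - demo["evaporation"] / pan_factor) * f - dV

same = np.allclose(I_original, I_factored, equal_nan=True)
print(same)
```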

Note that when a `for` loop cannot be avoided, a significant speed-up can be achieved by converting the data to NumPy arrays. Let's have a look at the rainfall calculation once more and check the performance of the `for` loop, but now using NumPy arrays.

In [None]:
rain = df['rain'].to_numpy()                # convert the pandas Series to a NumPy array
area = df['area'].to_numpy()                # convert the pandas Series to a NumPy array

In [None]:
%%timeit

P = np.zeros_like(rain, dtype=float)  # array of zeros with the same size as 'rain'; float avoids truncation if 'rain' holds integers
for i in range(len(P)):
    P[i] = area[i] * rain[i] / 1000.0  # calculate the rain flux for one timestep

df['P'] = P               # create new column in the DataFrame with the rain flux


Much faster than looping over the DataFrame! That is good news, because our mass balance calculations are best done using a `for` loop.
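One common reason a loop is hard to avoid is that each timestep depends on the result of the previous one. The sketch below illustrates this with a purely hypothetical recurrence on NumPy arrays. It is not the chloride model from the earlier notebook, and the names `inflow`, `carry_over` and `mass`, as well as all values, are made up for this example.

```python
import numpy as np

# Hypothetical inputs, not the chloride data from the earlier notebook
n = 113
inflow = np.full(n, 5.0)   # made-up mass added per timestep
carry_over = 0.9           # made-up fraction of the stored mass kept each step

mass = np.zeros(n)
mass[0] = 100.0            # made-up initial mass
for i in range(1, n):
    # each value needs the previous result,
    # so the timesteps cannot be computed independently of each other
    mass[i] = carry_over * mass[i - 1] + inflow[i]

print(mass[:3])
```

Because `mass` and `inflow` are NumPy arrays, indexing inside the loop avoids the overhead of looking up values in a DataFrame.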

***Exercise***

Write some code to calculate the following polynomial expression

$y = A x^4 + B x^3 + C x^2 + D x + E$

Try to think of different ways to perform the calculation (*hint: remember the `poly1d` function in NumPy?*) and determine the execution times. The values of the coefficients and $x$ are already defined in the code cell below.

In [None]:
x = np.linspace(0.01, 10, 1000)
A = 0.1
B = 0.2
C = 0.3
D = 0.4
E = 0.5

In [None]:
%%timeit

y = # Type your code here

In [None]:
%%timeit

y = # Type your code here

In [None]:
%%timeit

y = # Type your code here