 # Applying the water balance equation



 ## Objectives

 - Apply the water balance equation to real-world data

 - Test the impact of land cover on the runoff ratio

 - Fit a simple statistical model to data



 ## Prerequisites

 - Basic understanding of Python

 - Familiarity with Pandas and Matplotlib

 ## Dataset

 We will be using the CAMELS-GB dataset. This contains daily hydrometeorological data for around 670 catchments in Great Britain, as well as catchment attributes related to land use/land cover, geology, and climate. Download the data [here](https://catalogue.ceh.ac.uk/documents/8344e4f3-d2ea-44f5-8afa-86d2987543a9), and extract the data to a new folder in your working directory called `data`. Let's create a path variable so that we can easily navigate to the data files:

In [None]:
import os
DATADIR = os.path.join('data', '8344e4f3-d2ea-44f5-8afa-86d2987543a9', 'data')


 Now load the data for a catchment chosen at random. The timeseries data are stored as CSV files, so we use Pandas to load them into a DataFrame:

In [None]:
import pandas as pd
id = '97002'
data = pd.read_csv(os.path.join(DATADIR, 'timeseries', f'CAMELS_GB_hydromet_timeseries_{id}_19701001-20150930.csv'), parse_dates=[0])
data.head()


 Later on it will be useful to have the catchment ID in the dataframe, so we add it here:

In [None]:
data['id'] = id


 Recall the water balance equation from lecture 1:

 \[

 \frac{dS}{dt} = P - E - Q

 \]

 where \( \frac{dS}{dt} \) is the change in storage over time, P is precipitation, E is evaporation and Q is streamflow. Also recall that over long time periods we can assume the storage term tends towards zero. Now we can write:

 \[

 0 = P - E - Q

 \]

 and hence:

 \[

 E = P - Q.

 \]

 This is convenient because evaporation is hard to measure accurately. Let's use the equation above to estimate the catchment-averaged evaporation. We will work at annual timescales so that we can reasonably assume that the storage term is negligible. First we need to compute the annual precipitation and discharge. To do this we typically use the "water year" instead of the calendar year. The calendar year ends during the wet season, when catchment storage can vary significantly from one year to the next, so calendar-year totals can carry large water balance errors. In the UK the water year is taken as 1st October to 30th September. Fortunately, Pandas makes it easy to label each day with its water year:

In [None]:
data['water_year'] = data['date'].dt.to_period('A-SEP')
data.head()


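 Here, `A-SEP` is a period alias for "annual frequency, anchored end of September", so each water year is labelled by the calendar year in which it ends. In recent Pandas releases (2.2 and later) the `A` alias is deprecated in favour of `Y`, so you may need to write `Y-SEP` instead. Learn more about period aliases by consulting the [Pandas documentation](https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-period-aliases).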
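
 As a cross-check on the period alias, the water-year label can also be computed by hand from the year and month. This is a minimal sketch on a few made-up dates (not CAMELS-GB data); it returns plain integers rather than Pandas `Period` objects.

```python
import pandas as pd

# Hypothetical dates straddling two water-year boundaries
dates = pd.Series(pd.to_datetime(['1970-09-30', '1970-10-01', '1971-09-30', '1971-10-01']))

# Days from October onwards belong to the water year ending the following September
water_year = dates.dt.year + (dates.dt.month >= 10).astype(int)
print(water_year.tolist())
```

 The first of October 1970 and the thirtieth of September 1971 receive the same label, which is what we expect from a water year running from 1st October to 30th September.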



 We also need to convert the discharge from m3/s to m3/day. We then group our dataframe by the new `water_year` column and compute the annual sums of precipitation and discharge:

In [None]:
data['discharge_vol'] *= 60 * 60 * 24  # m3/s -> m3/day
# Sum the daily values within each water year
data = data.groupby(['id', 'water_year'])[['precipitation', 'pet', 'discharge_spec', 'discharge_vol']].sum().reset_index()


 Aggregating data is an extremely useful skill in hydrology. Think about how you might use Pandas to aggregate by month or by season.
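
 One possible approach to monthly and seasonal aggregation is sketched below. It uses a made-up daily series (1 mm of precipitation every day in 2001) rather than CAMELS-GB data, and the season labels and the `season_of_month` mapping are illustrative assumptions.

```python
import pandas as pd

# Hypothetical daily series: one year with 1 mm of precipitation every day
daily = pd.DataFrame({'date': pd.date_range('2001-01-01', '2001-12-31', freq='D')})
daily['precipitation'] = 1.0

# Monthly totals: label each day with its calendar month, then sum
daily['month'] = daily['date'].dt.to_period('M')
monthly = daily.groupby('month')['precipitation'].sum()

# Seasonal totals: map month numbers to meteorological seasons, then sum
season_of_month = {12: 'DJF', 1: 'DJF', 2: 'DJF', 3: 'MAM', 4: 'MAM', 5: 'MAM',
                   6: 'JJA', 7: 'JJA', 8: 'JJA', 9: 'SON', 10: 'SON', 11: 'SON'}
daily['season'] = daily['date'].dt.month.map(season_of_month)
seasonal = daily.groupby('season')['precipitation'].sum()
```

 Note that this simple mapping pools all months with the same season label. With a multi-year record you would also group by year, and decide which year each December belongs to.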



 When making comparisons between catchments, it is common to transform all variables to a *depth* (volume divided by catchment area) so that the effect of catchment area is removed. This allows us to compare the hydrological behaviour of a large catchment (e.g. the Tweed) with that of a much smaller catchment. Let's load the catchment attributes and find the area of our catchment:

In [None]:
metadata = pd.read_csv(os.path.join(DATADIR, 'CAMELS_GB_topographic_attributes.csv'))
metadata['gauge_id'] = metadata['gauge_id'].astype(str)  # match the string catchment ID
area = metadata[metadata['gauge_id'] == id]['area'].values[0]
area *= 1e6  # km2 -> m2


 Now we can transform the annual discharge volume (m3) to a depth in mm per water year:

In [None]:
data['discharge_vol'] /= area # m3 -> m
data['discharge_vol'] *= 1000 # m -> mm
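
 The full chain of unit conversions can be checked with a single worked value. The sketch below is illustrative: the function name `m3s_to_mm_per_day` and the input values (10 m3/s from a 100 km2 catchment) are assumptions, not part of the dataset.

```python
SECONDS_PER_DAY = 60 * 60 * 24

def m3s_to_mm_per_day(discharge_m3s, area_km2):
    """Convert a mean daily discharge in m3/s to a depth in mm/day."""
    volume_m3 = discharge_m3s * SECONDS_PER_DAY  # m3/s -> m3/day
    area_m2 = area_km2 * 1e6                     # km2 -> m2
    return volume_m3 / area_m2 * 1000            # m -> mm

# Hypothetical catchment: 10 m3/s draining 100 km2
depth = m3s_to_mm_per_day(10.0, 100.0)
print(round(depth, 2))
```

 Here 10 m3/s is 864,000 m3/day, which spread over 100 km2 is a depth of about 8.64 mm/day.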


 If you look at the dataframe, you will see that column `discharge_vol` is now approximately equal to `discharge_spec`, which CAMELS-GB already provides as a depth (mm/day, here summed over each water year). In future, you can use `discharge_spec` directly, without the need for transformation. We now have everything we need to estimate evaporation using the water balance equation:

In [None]:
data['evaporation'] = data['precipitation'] - data['discharge_vol']


 Let's plot this data:
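
 A minimal plotting sketch follows, assuming the annual dataframe has the columns `water_year`, `precipitation`, `discharge_vol` and `evaporation` created above. The helper name `plot_water_balance` and the example values are hypothetical; in the notebook you would pass `data` instead of `example`.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_water_balance(df):
    """Plot annual precipitation, discharge and estimated evaporation (all in mm)."""
    fig, ax = plt.subplots(figsize=(8, 4))
    # Annual Period values print as the year, so convert them to integers for the x-axis
    years = df['water_year'].astype(str).astype(int)
    for column in ['precipitation', 'discharge_vol', 'evaporation']:
        ax.plot(years, df[column], marker='o', label=column)
    ax.set_xlabel('Water year')
    ax.set_ylabel('Depth (mm/year)')
    ax.legend()
    return ax

# Hypothetical values for illustration only
example = pd.DataFrame({
    'water_year': [1971, 1972, 1973],
    'precipitation': [1200.0, 1100.0, 1300.0],
    'discharge_vol': [800.0, 700.0, 850.0],
})
example['evaporation'] = example['precipitation'] - example['discharge_vol']
ax = plot_water_balance(example)
```

 Plotting the three terms on one axis makes it easy to spot years in which the estimated evaporation is implausible (for example negative), which usually points to missing data or a violation of the negligible-storage assumption.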

 ## Land cover impacts

 We will cover the drivers of evaporation in more detail later in the course. One question we may have is how different land cover types affect the water balance. Let's investigate whether land cover impacts evaporation by looking at some forested catchments. First we load the land cover attributes:

In [None]:
metadata_lu = pd.read_csv(os.path.join(DATADIR, 'CAMELS_GB_landcover_attributes.csv'))


 Have a look at the columns in `metadata_lu` and consult Coxon et al. (2020). Which columns represent forest? Create a new column called `forest_perc` that combines the two types.

In [None]:
raise NotImplementedError()


 Now identify catchments with more than 20% forest:

In [None]:
raise NotImplementedError()


 To compare the impact of vegetation on runoff generation, it would be useful to compute a summary measure for each catchment. One such measure, or signature, is the runoff ratio, defined as the proportion of precipitation that becomes runoff. We start by summing precipitation and discharge over the full record:

In [None]:
data_sum = data.groupby('id')[['precipitation', 'discharge_vol']].sum()


 Now we can calculate the runoff ratio:

In [None]:
data_sum['runoff_ratio'] = data_sum['discharge_vol'] / data_sum['precipitation']


 ## Next steps

 Compute the runoff ratio for every catchment with more than 20% forest (HINT: write a loop to perform the necessary steps). Using `statsmodels`, fit a linear regression model to the data to test the hypothesis that the runoff ratio is related to forest extent.
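
 One way to organise the loop is to wrap the per-catchment calculation in a small function and call it once per catchment. The sketch below is a starting point rather than a full solution: the function name `runoff_ratio` and the choice to drop days with missing values are assumptions, and the example uses made-up numbers rather than CAMELS-GB data.

```python
import pandas as pd

def runoff_ratio(timeseries):
    """Return total discharge depth divided by total precipitation.

    Assumes daily `precipitation` and `discharge_spec` columns, both in mm/day.
    Days with a missing value in either column are dropped so that both
    totals cover the same set of days.
    """
    complete = timeseries.dropna(subset=['precipitation', 'discharge_spec'])
    return complete['discharge_spec'].sum() / complete['precipitation'].sum()

# Hypothetical daily values with one missing discharge observation
toy = pd.DataFrame({
    'precipitation': [4.0, 0.0, 6.0, 10.0],
    'discharge_spec': [2.0, 1.0, None, 3.0],
})
ratio = runoff_ratio(toy)
```

 Inside the loop you would read each catchment's timeseries file, call the function, and store the result against the catchment ID. The resulting table can then be joined to `forest_perc` and passed to a regression routine, for example `statsmodels.formula.api.ols`.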