# Bias-correction

This notebook will cover a very useful and exciting topic in the context of climate data: bias correction!

Bias-correction methods attempt to correct the systematic errors present in modeled data. For example, a model may overestimate the number of rainy days and underestimate the number of extreme rainfall events. Rainfall totals over long periods of time could still match the observations in this case, but it would be better if the frequency and intensity of rainfall events were correct, too!

Here are some more resources to read about bias-correction:
* [Copernicus Climate Change Service](https://climate.copernicus.eu/sites/default/files/2021-01/infosheet7.pdf)

## Pixel-to-point bias

There is another type of "bias" that we will address before we get to the systematic model bias that bias-correction efforts usually target: the bias that arises when we use gridded data to represent a particular point. This is the error between a pixel's spatially aggregated value (usually a mean) and the values observed for the same variable at a point within that pixel.
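
To make this definition concrete, here is a minimal sketch using made-up numbers (the temperatures below are illustrative only, not real ERA5-Land or station values). It estimates the pixel-to-point bias as the mean difference between the pixel values and the point observations at matching times:

```python
from statistics import mean

# Hypothetical paired values (degrees C) at the same timestamps.
pixel_values = [-20.5, -18.0, -15.5, -12.0]    # gridded pixel (spatial mean)
station_values = [-23.0, -20.0, -17.0, -14.0]  # point observations

# Error at each timestamp: pixel value minus observed value.
errors = [p - o for p, o in zip(pixel_values, station_values)]

# The mean error is one simple estimate of the pixel-to-point bias.
bias = mean(errors)
print(f"Mean pixel-to-point bias: {bias:.2f} degrees C")
```

A positive value here would mean the pixel tends to be warmer than the station.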

Take for example the following goal: we want ERA5-Land 2m air temperature data for Eagle, Alaska, for 1950-1999. We want this data for the typical reason one might want gridded data instead of just the coincident observational data: greater temporal coverage! The ERA5-Land data extends further into the past (back to 1950) than the current observed record collected via the automated weather station, which begins in 1981. Additionally, it will have no gaps, which can be useful for goals such as tallying extreme events. 

So, we will get all of the available observed data for 1981-1999 and all of the ERA5-Land data for 1950-1999 for the pixel that contains Eagle, and then we will bias-correct the ERA5-Land data.

#### Download observed data

The [Automated Surface Observing Systems](https://www.weather.gov/asos/asostech) is responsible for the most comprehensive collection of observed meteorological data in Alaska. There are multiple ways to access this data, but we will use the [Iowa Environmental Mesonet's archive](https://mesonet.agron.iastate.edu/request/download.phtml?network=AK_ASOS), which provides a nice API for getting data that has already been quality-controlled to some extent.

First, load the libraries we will need:

In [8]:
import xarray as xr
import pandas as pd
from pathlib import Path
import tqdm
from ardac_utils import unzip, cdsapi_timerange_params

Details on constructing this API call (the URL) can be found through the application linked above. For now, the reader only needs to know that we are requesting hourly observations of 2m air temperature (in Fahrenheit) from 1950-1999 for the observing station at the Eagle airport. We are using `pandas` directly to read this link into a `DataFrame`:

In [13]:
url = (
    "https://mesonet.agron.iastate.edu/cgi-bin/request/asos.py?station=PAEG"
    "&data=tmpf&year1=1950&month1=1&day1=1&year2=1999&month2=12&day2=31&tz=Etc%2FUTC"
    "&format=onlycomma&latlon=no&elev=no&missing=empty&trace=T&direct=yes&report_type=3"
)
df = pd.read_csv(url)
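
As an alternative to writing the URL by hand, the same query string can be assembled from a dictionary, which makes individual parameters easier to change. This is a minimal sketch using only the standard library; the names `base_url`, `query`, and `built_url` are introduced here for illustration, and the parameter values are copied from the URL above:

```python
from urllib.parse import urlencode

base_url = "https://mesonet.agron.iastate.edu/cgi-bin/request/asos.py"

# Same parameters, in the same order, as the hand-written URL.
query = {
    "station": "PAEG",  # Eagle airport station identifier
    "data": "tmpf",     # 2m air temperature in Fahrenheit
    "year1": 1950, "month1": 1, "day1": 1,    # start date
    "year2": 1999, "month2": 12, "day2": 31,  # end date
    "tz": "Etc/UTC",
    "format": "onlycomma",
    "latlon": "no",
    "elev": "no",
    "missing": "empty",
    "trace": "T",
    "direct": "yes",
    "report_type": 3,
}

# urlencode percent-encodes values, so "Etc/UTC" becomes "Etc%2FUTC".
built_url = f"{base_url}?{urlencode(query)}"
print(built_url)
```

The resulting string can be passed to `pd.read_csv` in the same way as `url`.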

In the record produced by the IEM, it looks like less than 1% of the observations are missing, which might not impact subsequent analyses too much:

In [24]:
print(f"Missing observations: {round(df.isnull().tmpf.sum() / len(df) * 100, 2)}%")

Missing observations: 0.95%
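
A single overall percentage can hide gaps that are concentrated in particular years. The sketch below shows one way to break the missing fraction down by year. It runs on a small synthetic `DataFrame` rather than the real download, and it assumes the timestamp column is named `valid`, which should be checked against the actual columns of `df`:

```python
import pandas as pd

# Synthetic stand-in for the IEM DataFrame; the "valid" timestamp column
# name is an assumption about the CSV layout.
demo = pd.DataFrame(
    {
        "valid": pd.to_datetime(
            [
                "1981-01-01 00:00",
                "1981-01-01 01:00",
                "1982-01-01 00:00",
                "1982-01-01 01:00",
            ]
        ),
        "tmpf": [10.0, None, 12.0, 13.0],
    }
)

# Percentage of missing temperature values in each calendar year.
missing_by_year = demo["tmpf"].isnull().groupby(demo["valid"].dt.year).mean() * 100
print(missing_by_year.round(2))
```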


Next, we need to download the ERA5-Land data. See the [era5_access.ipynb](./era5_access.ipynb) notebook for help on retrieving this data. 

Below is a code snippet which can be pasted into a script and run via `python script.py` to download all of the data. While it is not a ton of data, various factors can influence the speed at which the CDS API can package and return this data from the underlying dataset. 

Since we will be requesting hourly data for 1950-1999, we will be pulling 50 years * ~8,760 hours = ~438,000 records, which exceeds the current CDS limit of 12,000 records per request. A single year of hourly data fits within that limit, so to keep things simple we will loop over the years and make one request per year.
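
The arithmetic can be checked with a few lines of Python. This is a small sketch that counts hours exactly, including leap years; the 12,000 figure is the limit quoted above, and the names `REQUEST_LIMIT` and `hours_in_year` are introduced here for illustration:

```python
import calendar

REQUEST_LIMIT = 12000  # per-request record limit quoted in the text


def hours_in_year(year):
    """Number of hourly records in a calendar year."""
    days = 366 if calendar.isleap(year) else 365
    return days * 24


years = range(1950, 2000)  # 1950-1999 inclusive
total_records = sum(hours_in_year(y) for y in years)
max_per_year = max(hours_in_year(y) for y in years)

print(f"Total hourly records: {total_records}")
print(f"Largest single year: {max_per_year}")
print(f"One year fits in one request: {max_per_year <= REQUEST_LIMIT}")
```

Even a leap year stays under the limit, which is why one request per year is sufficient.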

Here is the script:

```python

import cdsapi
from pathlib import Path
from ardac_utils import unzip, cdsapi_timerange_params

# Eagle airport coordinates
lat = 64.7780833
lon = -141.1496111
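# The CDS "area" parameter is ordered [north, west, south, east];
# here the point's coordinates are used for both corners
ak_bbox = [lat, lon, lat, lon]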

params = {
    "format": "netcdf.zip",
    "variable": "2m_temperature",
    "area": ak_bbox,
}

c = cdsapi.Client()

eagle_era5_dir = Path("eagle_era5land")
eagle_era5_dir.mkdir(exist_ok=True)
# range() excludes its stop value, so use 2000 to include 1999
for year in range(1950, 2000):
    download_path = eagle_era5_dir.joinpath(f"era5land_eagle_{year}.netcdf.zip")
    # Request every hour of the year, from January 1 through December 31
    time_params = cdsapi_timerange_params(
        start_time=f"{year}-01-01", end_time=f"{year}-12-31 23:00:00", freq="h"
    )
    params.update(time_params)
    c.retrieve("reanalysis-era5-land", params, download_path)

    unzip_path = eagle_era5_dir.joinpath(download_path.name.replace(".netcdf.zip", ".nc"))
    unzip(download_path, unzip_path)
```
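
The `cdsapi_timerange_params` helper comes from `ardac_utils` and its implementation is not shown here. For readers without that module, the following is a hypothetical stand-in, not the actual helper. It assumes the helper's job is to turn a start and end time into the `year`, `month`, `day`, and `time` lists of strings that CDS requests for this dataset use, and it only handles hourly steps:

```python
from datetime import datetime, timedelta


def timerange_params_sketch(start_time, end_time):
    """Hypothetical stand-in, not the actual ardac_utils implementation.

    Collects the distinct years, months, days, and hours covered by an
    hourly range between two ISO-formatted times (inclusive).
    """
    start = datetime.fromisoformat(start_time)
    end = datetime.fromisoformat(end_time)

    years, months, days, times = set(), set(), set(), set()
    current = start
    while current <= end:
        years.add(f"{current.year}")
        months.add(f"{current.month:02d}")
        days.add(f"{current.day:02d}")
        times.add(f"{current.hour:02d}:00")
        current += timedelta(hours=1)

    return {
        "year": sorted(years),
        "month": sorted(months),
        "day": sorted(days),
        "time": sorted(times),
    }


demo_params = timerange_params_sketch("1950-01-01", "1950-12-31 23:00:00")
print({key: len(values) for key, values in demo_params.items()})
```

Because the request is built from distinct months and days rather than a continuous range, the end time needs to reach the last hour of the last day wanted, or the later months will be left out of the request.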

Once that script completes, you should have a folder containing files named `era5land_eagle_<year>.nc` for all years in 1950-1999. We can now do some comparisons between the observed data and the ERA5-Land data.