# Demo example: filling gaps

This example continues from the previous one, [Apply quality control](https://metobs-toolkit.readthedocs.io/en/latest/examples/qc_example.html), and demonstrates how to fill gaps.

In [None]:
%config InlineBackend.print_figure_kwargs = {'bbox_inches':None} #else the legend is cutoff in ipython inline plots

In [None]:
import metobs_toolkit
metobs_toolkit.add_StreamHandler(setlvl='WARNING')

your_dataset = metobs_toolkit.Dataset()
your_dataset.update_file_paths(
    input_data_file=metobs_toolkit.demo_datafile, # path to the data file
    input_metadata_file=metobs_toolkit.demo_metadatafile,
    template_file=metobs_toolkit.demo_template,
)

## Finding Gaps

When you import your datafile, the toolkit looks for gaps in your data. It does so by assuming that each station has a perfect frequency of observations (e.g. 5 minutes), a start timestamp, and an end timestamp. From these, the toolkit constructs a perfect set of timestamps for which an observation is expected. It then tries to find an observation for each (perfect) timestamp, using timestamp tolerances if specified.

When it is not possible to assign an observation to a timestamp, we have a *gap*. If multiple consecutive timestamps cannot be matched to observations, they form a single gap that spans a longer period.
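
The matching described above can be sketched in plain Python, independent of the toolkit. This is a minimal illustration under stated assumptions: the function name `find_gaps` and the sample timestamps are hypothetical, and the toolkit's own matching may differ (for instance in how an observation close to two perfect timestamps is assigned).

```python
from datetime import datetime, timedelta


def find_gaps(observed, start, end, freq, tolerance):
    """Return (gap_start, gap_end) tuples for runs of unmatched perfect timestamps."""
    # Build the perfect, regularly spaced timestamps between start and end.
    ideal = []
    current = start
    while current <= end:
        ideal.append(current)
        current += freq

    gaps = []
    gap_start = None
    previous = None
    for stamp in ideal:
        # A perfect timestamp is matched if any observation lies within the tolerance.
        matched = any(abs(obs - stamp) <= tolerance for obs in observed)
        if not matched and gap_start is None:
            gap_start = stamp  # a new gap opens here
        elif matched and gap_start is not None:
            gaps.append((gap_start, previous))  # the gap ended at the previous timestamp
            gap_start = None
        previous = stamp
    if gap_start is not None:
        gaps.append((gap_start, previous))  # the gap runs until the last perfect timestamp
    return gaps


# Hypothetical observations with a nominal 5-minute frequency.
observed = [
    datetime(2022, 9, 1, 0, 0),
    datetime(2022, 9, 1, 0, 6),   # 1 minute late, still within the tolerance
    datetime(2022, 9, 1, 0, 25),  # 00:10, 00:15 and 00:20 are missing
    datetime(2022, 9, 1, 0, 30),
]
gaps = find_gaps(
    observed,
    start=datetime(2022, 9, 1, 0, 0),
    end=datetime(2022, 9, 1, 0, 30),
    freq=timedelta(minutes=5),
    tolerance=timedelta(minutes=2),
)
print(gaps)
```

The three consecutive unmatched timestamps are reported as one gap from 00:10 to 00:20, not as three separate gaps.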


In [None]:
your_dataset.import_data_from_file(
        freq_estimation_method="highest", #which method to use for estimating a frequency
        freq_estimation_simplify_tolerance="2min", #tolerance for simplifying a frequency
        origin_simplify_tolerance="5min", #tolerance for simplifying an origin (=start timestamp per station)
        timestamp_tolerance="4min") #tolerance for mapping an observation to a (perfect) timestamp

your_dataset.coarsen_time_resolution(freq='15T')

These missing observations are indicated in time series plots as vertical lines:

In [None]:
your_dataset.get_station('vlinder02').make_plot(colorby='label')

## Inspect gaps

The gaps are stored in the `.gaps` attribute of a Dataset.
Each gap is defined by:
* an observation type
* a station name
* a start timestamp
* an end timestamp

In [None]:
your_dataset.gaps

To get a tabular representation of the present gaps, use the `.get_gaps_fill_df()` method. This method returns a dataframe that contains all the records present in the gaps. When gap-fill methods are applied, the filled values appear in this dataframe.

In [None]:
your_dataset.get_gaps_fill_df()

## Outliers to gaps and missing observations

In practice, observations that are labeled as outliers are often interpreted as gaps (because we assume that the observation value is erroneous) so that they can be filled. In the toolkit, you can convert the outliers to gaps with the ``convert_outliers_to_gaps()`` method.

In [None]:
#first apply (default) quality control
your_dataset.apply_quality_control(obstype='temp') #we use the default settings in this example

#Interpret the outliers as missing observations and gaps.
your_dataset.convert_outliers_to_gaps()
#Inspect your gaps
your_dataset.make_plot(colorby='label')
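
Conceptually, converting outliers to gaps means discarding the flagged values and treating consecutive discarded records as one gap. The sketch below illustrates that idea in plain Python. It is not the toolkit's implementation: the function `outliers_to_gaps` and the label names are hypothetical.

```python
def outliers_to_gaps(values, labels, outlier_labels):
    """Blank out outlier values and return the cleaned series plus gap index ranges."""
    # Replace every record with an outlier label by None (a missing value).
    cleaned = [
        None if label in outlier_labels else value
        for value, label in zip(values, labels)
    ]

    # Group consecutive missing records into (first_index, last_index) gaps.
    gaps = []
    start = None
    for i, value in enumerate(cleaned):
        if value is None and start is None:
            start = i
        elif value is not None and start is not None:
            gaps.append((start, i - 1))
            start = None
    if start is not None:
        gaps.append((start, len(cleaned) - 1))
    return cleaned, gaps


# Hypothetical temperature records; the label names are made up for this sketch.
values = [15.1, 15.0, 42.0, 41.5, 14.8, None, 14.6]
labels = ["ok", "ok", "gross value", "gross value", "ok", "gap", "ok"]

cleaned, gaps = outliers_to_gaps(values, labels, outlier_labels={"gross value"})
print(cleaned)
print(gaps)
```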

When plotting a single station, the figure becomes clearer:

In [None]:
your_dataset.get_station('vlinder05').make_plot(colorby='label')

## Fill gaps

In the toolkit, two groups of gap-fill methods are implemented: **interpolation methods** and methods that make **use of external modeldata**.

**NOTE**: In this example, we use a single station (vlinder05) for demonstration. All methods can be applied directly to a Dataset, so you do not need to apply them to each station separately.

### Interpolation methods

The most straightforward method to fill a gap is by using interpolation. Linear interpolation is the best-known form of interpolation, but there are also more advanced forms of interpolation. In the toolkit, we can easily interpolate the gaps by making use of the ``Dataset.interpolate_gaps()`` method.

In [None]:
your_dataset.interpolate_gaps(obstype='temp', #Which gaps to fill
                              overwrite_fill = True, #Overwrite previous filled values if they are present
                              method='linear', #which interpolation method
                              max_consec_fill=10, #maximum number of consecutive missing records to fill.
                              )
your_dataset.get_station('vlinder05').make_plot(colorby='label')

As you can see, some gaps are (partially) filled while others are not. This is because a filling criterion was not met. The ``Dataset.get_gaps_fill_df()`` method shows why some gaps could not be filled.

In [None]:
your_dataset.get_station('vlinder05').get_gaps_fill_df()
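
To see how a limit on consecutive fills can leave a gap partially filled, here is a minimal sketch of linear interpolation in plain Python. It assumes that only the first `max_consec_fill` records of each gap are filled and that a gap without a valid record on both sides is skipped. Whether the toolkit applies the limit in exactly this way is an assumption, and the function below is hypothetical.

```python
def interpolate_gaps(values, max_consec_fill):
    """Linearly interpolate runs of None, filling at most max_consec_fill records per run."""
    filled = list(values)
    n = len(values)
    i = 0
    while i < n:
        if values[i] is not None:
            i += 1
            continue
        # Find the end of this run of missing records (j is the first valid index after it).
        j = i
        while j < n and values[j] is None:
            j += 1
        # Interpolation needs a valid record on both sides of the run.
        if i > 0 and j < n:
            left, right = values[i - 1], values[j]
            span = j - (i - 1)  # number of steps between the two valid records
            for k in range(i, min(j, i + max_consec_fill)):
                filled[k] = left + (right - left) * (k - (i - 1)) / span
        i = j
    return filled


# Hypothetical series: a 3-record gap, a 1-record gap, and a gap at the very end.
series = [10.0, None, None, None, 14.0, None, 16.0, None]
result = interpolate_gaps(series, max_consec_fill=2)
print(result)
```

In this sketch the first gap is only partially filled (two of three records), the second is filled completely, and the last stays empty because no valid record follows it.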

If you are interested in all details of a specific gap, you can find that gap using the ``Dataset.find_gap()`` method, and then inspect the specific ``Gap()``.

In [None]:
from datetime import datetime

gap_of_interest = your_dataset.get_station('vlinder05').find_gap(stationname='vlinder05',
                                                                 obstype='temp',
                                                                 in_gap_timestamp = datetime(2022,9,3,1)) #2022-09-03 01:00:00

gap_of_interest.get_info()

#or inspect the gapdf
gap_of_interest.gapdf

### Higher order interpolation demo

More advanced interpolation methods often need multiple **anchors** (= the good records that serve as anchor points for the interpolation). In the toolkit we refer to a **leading period** and a **trailing period**: the anchor observations before and after the gap, respectively.

Here is an example on applying a polynomial interpolation on gaps.

In [None]:

your_dataset.get_station('vlinder04').interpolate_gaps(
                          method='polynomial',
                          overwrite_fill=True,
                          n_leading_anchors=3, #at least 3 leading anchors are needed for 3rd-order polynomial interpolation
                          n_trailing_anchors=4, #at least 3 trailing anchors are needed for 3rd-order polynomial interpolation
                          max_consec_fill=40, #maximum number of consecutive missing records to fill.
                          max_lead_to_gap_distance='60min', #the maximum distance (in time) between the leading anchors and the start of the gap.
                          method_kwargs={'order':3}, #all extra arguments to pass to the pandas.DataFrame.interpolate method.
)

your_dataset.get_station('vlinder04').make_plot(colorby='label')
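
To illustrate why anchors on both sides of a gap matter, the sketch below fits a single polynomial through three leading and three trailing anchors and evaluates it inside the gap. This is a deliberate simplification and not what the toolkit does internally: the toolkit passes the method and its arguments on to pandas, whereas this example uses a plain Lagrange polynomial through all anchors (degree up to 5). The function name and the sample data are hypothetical.

```python
def polynomial_fill(anchor_times, anchor_values, fill_times):
    """Evaluate the polynomial through all anchors at each time in fill_times."""
    results = []
    for t in fill_times:
        total = 0.0
        for i, (ti, vi) in enumerate(zip(anchor_times, anchor_values)):
            # Lagrange basis polynomial of anchor i, evaluated at t.
            weight = 1.0
            for j, tj in enumerate(anchor_times):
                if j != i:
                    weight *= (t - tj) / (ti - tj)
            total += vi * weight
        results.append(total)
    return results


def truth(t):
    """Hypothetical smooth temperature curve used to generate the anchors."""
    return 0.1 * t**3 - t + 15.0


leading_times = [0, 1, 2]   # records just before the gap
trailing_times = [5, 6, 7]  # records just after the gap
gap_times = [3, 4]          # missing records to fill

anchor_times = leading_times + trailing_times
anchor_values = [truth(t) for t in anchor_times]

filled = polynomial_fill(anchor_times, anchor_values, gap_times)
expected = [truth(t) for t in gap_times]
print(filled)
print(expected)
```

Because the sample curve is itself a cubic, the fitted polynomial reproduces it inside the gap up to floating-point error. Real observations will not follow a polynomial exactly, so the anchors should be close to the gap.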

### Fill gaps using external modeldata

As an example, we will fill the gaps using bias-corrected ERA5 data. For more info on how to download ERA5 data in the toolkit, see [Demo example: Using a Google Earth engine](https://metobs-toolkit.readthedocs.io/en/latest/examples/gee_example.html#Extracting-ERA5-timeseries).


In [None]:

#as a demonstration, we fill the gaps of a single station. All methods can be directly applied on a full Dataset.
your_station = your_dataset.get_station('vlinder05') 


#ERA5-land is a (default) known dataset, AND 'temp' is a known ModelObstype (thus linked to a band)
era5_model = your_station.gee_datasets['ERA5-land']
era5_model.get_info()

In [None]:
#Extract time series at the location of the station
ERA5_data = your_station.get_modeldata(
                        Model=era5_model, 
                        obstypes=['temp'],
                        startdt=None, #if None, the start of the observations is used
                        enddt=None, #if None, the end of the observations is used
            )

#Use a debiased gap fill method to fill the gaps:
your_station.fill_gaps_with_debiased_modeldata(
            Model=ERA5_data,
            obstype="temp",
            overwrite_fill=True,
            leading_period_duration="24h", #leading period definition for bias calculation
            min_leading_records_total=4, # minimum size of leading period (in records)
            trailing_period_duration="24h", #trailing period definition for bias calculation
            min_trailing_records_total=4, # minimum size of trailing period (in records)
        )


your_station.make_plot(colorby='label')
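
The idea behind a debiased fill can be sketched as follows: estimate the mean difference between observations and modeldata in the leading and trailing periods, then add that bias to the model values inside the gap. The sketch assumes this simple mean-bias definition and uses record counts where the toolkit uses durations. The toolkit's exact calculation may differ, and the function and data below are hypothetical.

```python
from statistics import mean


def debiased_fill(obs, model, gap_start, gap_end, n_leading, n_trailing):
    """Fill obs[gap_start..gap_end] with model values shifted by the mean bias."""
    # Indices of the leading and trailing periods around the gap.
    leading = range(max(0, gap_start - n_leading), gap_start)
    trailing = range(gap_end + 1, min(len(obs), gap_end + 1 + n_trailing))

    # Bias = mean(observation - model) over the valid records around the gap.
    diffs = [obs[i] - model[i] for i in (*leading, *trailing) if obs[i] is not None]
    if not diffs:
        raise ValueError("no valid records in the leading or trailing period")
    bias = mean(diffs)

    filled = list(obs)
    for i in range(gap_start, gap_end + 1):
        filled[i] = model[i] + bias
    return filled, bias


# Hypothetical observations (gap at index 2 and 3) and model values at the same timestamps.
obs = [12.0, 11.5, None, None, 10.0, 9.5]
model = [11.0, 10.5, 10.0, 9.5, 9.0, 8.5]

filled, bias = debiased_fill(obs, model, gap_start=2, gap_end=3, n_leading=2, n_trailing=2)
print(bias)
print(filled)
```

Here the model is consistently 1.0 degree too cold around the gap, so the filled values are the model values raised by 1.0.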

The following modeldata-gapfill methods are implemented in the toolkit:

* ``Dataset.fill_gaps_with_raw_modeldata()``
* ``Dataset.fill_gaps_with_debiased_modeldata()``
* ``Dataset.fill_gaps_with_diurnal_debiased_modeldata()``
* ``Dataset.fill_gaps_with_weighted_diurnal_debias_modeldata()``

For more information, see the documentation of these methods.