# Introduction to the MetObs-toolkit

In this introduction, you will learn the principal components and methods in the MetObs-toolkit. Let's start by importing it.

Since this package is under development, it is often relevant to know the precise version of the toolkit.

In [None]:
import metobs_toolkit

#Print out the version of the toolkit
print(metobs_toolkit.__version__)

## The Dataset class

The `Dataset` class is the most important class for most applications. It holds all your stations and their data. A `Dataset` is thus, in principle, a collection of stations.

Since raw data files often include observations from multiple stations, we always import raw data directly into a `Dataset`, using the `Dataset.import_data_from_file()` method.

A key component for importing raw data is a description of what your data represents and how it is formatted. You provide this description as a **template file**, which describes how your raw data is structured.



### Importing your raw data

As an example, we will import a demo file of raw observations. To do that, we need to:

* Create a template file for our raw data file. The `build_template_prompt` function guides you through this process. It asks a series of questions, and once you have answered them, a template file is created. It also proposes code that you can use to import your data.
* Create a `Dataset` instance.
* Add the raw data to the `Dataset`.

In [None]:
# Specify the path to your raw data file (we use the demo file as example)
path_to_datafile=metobs_toolkit.demo_datafile

# We will also use a metadata file
path_to_metadatafile=metobs_toolkit.demo_metadatafile

In [None]:
%%script true

#Create a template for these data files
metobs_toolkit.build_template_prompt()

In [None]:
#specify the path to the templatefile that was created
path_to_templatefile=metobs_toolkit.demo_template #demo file as example!!

Now that we have the data files and the template file, we create an empty `Dataset` and import the data into it.

In [None]:
dataset = metobs_toolkit.Dataset() #Create a new dataset object

#Load the data
dataset.import_data_from_file(
                    template_file=path_to_templatefile, #The template file
                    input_data_file=path_to_datafile, #The data file
                    input_metadata_file=path_to_metadatafile, #The metadata file
                    )

As the printed logs show, a lot happens during the import. That is because checks are applied to your data to detect gaps and mismatches between the data and the metadata.

We can now inspect the `dataset` further.

## The attributes

The attributes hold the data of the dataset. Here we present some attributes that can be useful to inspect.



<div class="alert alert-block alert-info">
All classes in the MetObs-toolkit have a `get_info` method that prints out an overview of their content.
</div>

* `Dataset.obstypes`: A collection of the known `Obstypes`. Each observation type describes a measurable quantity and its corresponding units.

In [None]:
dataset.obstypes

In [None]:
#Note! The known obstypes are NOT the obstypes for which there are observations.
#To get the obstypes for which there are observations, use:
dataset.present_observations

* `Dataset.template`: A template object that is set up automatically from the template file. It is only used when data is imported from a file and has no further use.

In [None]:
template = dataset.template

template.get_info() # Prints out how the template maps raw data

* `dataset.df`: A pandas DataFrame holding all the observation records.

In [None]:
dataset.df

* `dataset.metadf`: A pandas DataFrame holding all the metadata of the stations.

In [None]:
dataset.metadf

## Station class

The `Station` class is a representation of a station. A station holds the following:

* *sensordata*: Timeseries of an observation type. A station can hold multiple sensordata, one for each sensor.
* *site*: Each station has a `Site` attribute that holds the information on the location of the station. Metadata related to the station is also stored here.
* *modeldata*: In addition to the observations, modeldata timeseries representing the station can be stored. In practice, if you download ERA5 data (using the MetObs-toolkit), the timeseries are stored as modeldata in the `Station`.


To select a station, one can use the *name* of the station, which is assumed to be unique for each station.


<div class="alert alert-block alert-info">
All the methods and attributes that are present in the `Dataset` are also applicable to the `Station`! Thus if your script works at Dataset level, it also works at Station level.


Only `Dataset.sync_records()`, `Dataset.buddy_check()`, and trivial Dataset-only methods (e.g. `Dataset.get_station()`) are not defined for Stations.
</div>

In [None]:
#Select a station
your_station = dataset.get_station('vlinder02')

#Print out some details
your_station.get_info()

In [None]:
# Inspecting the attributes of the station

#Print out info on the Site of the station:
your_station.site.get_info()

In [None]:
# All observational data is stored as SensorData

print(your_station.sensordata)

# More convenient is to use the pandas dataframe representations,
# similar as with the Dataset

your_station.df

In [None]:
#Or the metadata for this single station
your_station.metadf

## Plotting timeseries

Timeseries can be plotted with the `make_plot()` method, on a `Dataset` or a `Station`.

In [None]:
dataset.make_plot(obstype='temp', #Which observation type to plot. (See dataset.present_observations)
                  colorby='station', #if 'station', each station will be a different color
                  show_outliers=True,
                  show_gaps=True)

In [None]:
#We can also plot a single station
your_station.make_plot(obstype='humidity',
                       colorby='label') #If 'label', the colors are based on the status/label of an observation.

## Common use cases

Here is a collection of common use cases.

### Resampling time resolution

It is common to change the time resolution of your observations. This is often done when:

* The amount of data is too big, and the present time resolution is not required for the analysis.
* Sensors do not have the same time resolution (e.g. temperature is measured every 5 minutes, but precipitation is measured each hour).
* Observations are not synchronized over multiple stations. This is a special case of resampling, since synchronization is also required.

It is recommended to set the target time resolution at the beginning of your pipeline!

In the MetObs-toolkit you can resample by using the `resample()` method on a `Dataset` or `Station`. The toolkit then constructs a set of target timestamps (in the new resolution) and maps the raw timestamps to these target timestamps. No interpolation is applied!

A tolerance is used to construct the mapping from the old timestamps to the target timestamps. For each target timestamp, the nearest raw timestamp is tested to see whether it lies within the tolerance. If this test fails, no record can be assigned to the target timestamp and a gap is created. Increasing the *shift_tolerance* therefore results in more mapped timestamps and fewer gaps, but at the cost of less accurate timestamps.
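
The effect of the tolerance can be illustrated with a small standalone sketch. This is not the toolkit's implementation: the function `map_to_targets` and the timestamps are hypothetical, and the sketch only uses the Python standard library to show how a stricter tolerance produces more gaps.

```python
from datetime import datetime, timedelta


def map_to_targets(raw_timestamps, target_timestamps, tolerance):
    """Map each target timestamp to its nearest raw timestamp, or None (a gap)."""
    mapping = {}
    for target in target_timestamps:
        # Find the raw timestamp closest in time to this target.
        nearest = min(raw_timestamps, key=lambda raw: abs(raw - target))
        # Accept it only when it lies within the tolerance; otherwise it is a gap.
        mapping[target] = nearest if abs(nearest - target) <= tolerance else None
    return mapping


# Hypothetical raw timestamps (irregular, with nothing close to 02:00).
raw = [
    datetime(2022, 9, 1, 0, 2),
    datetime(2022, 9, 1, 0, 57),
    datetime(2022, 9, 1, 2, 10),
    datetime(2022, 9, 1, 3, 0),
]
# Hourly target timestamps from 00:00 to 03:00.
targets = [datetime(2022, 9, 1, hour) for hour in range(4)]

strict = map_to_targets(raw, targets, tolerance=timedelta(minutes=4))
loose = map_to_targets(raw, targets, tolerance=timedelta(minutes=15))

for target in targets:
    print(target.time(), "| strict:", strict[target], "| loose:", loose[target])
```

With the 4-minute tolerance, the 02:00 target stays empty because the nearest raw record is 10 minutes away. With the 15-minute tolerance, that record is accepted, so the gap disappears but the value assigned to 02:00 was actually measured at 02:10.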

In [None]:
hourly_dataset = metobs_toolkit.Dataset()
#Load the data (raw data has 5 min resolution)
hourly_dataset.import_data_from_file(
                    template_file=path_to_templatefile, #The template file
                    input_data_file=path_to_datafile, #The data file
                    input_metadata_file=path_to_metadatafile, #The metadata file
                    )
#Resample to 1 hour resolution
hourly_dataset.resample(target_freq='1h', #Target frequency is set to 1 hour
                        target_obstype=None, #if None, all present observations are resampled
                        shift_tolerance='4min', #The maximum shift allowed for a timestamp
                        origin_simplify_tolerance='3min') # The maximum shift for the origin, to get a simplified origin

# You can verify that the resolution is hourly by inspecting the df attribute
hourly_dataset.df.index

### Dataframe of one observationtype

The `Dataset.df` and `Station.df` attributes return a pandas DataFrame with a so-called MultiIndex. That is because the combination of [`timestamp`, `observationtype`, `stationname`] defines an observation.

Working with MultiIndexed dataframes can be challenging, so here is an example of how to convert a MultiIndexed dataframe into a regular-indexed dataframe.

Be aware that removing (or reducing) the MultiIndex always involves subsetting or approximation.

In [None]:
#Subset to only temperatures (=subsetting)

temperatures = dataset.df.xs(key='temp', 
                             level='obstype', #the level of the index ('datetime', 'name' or 'obstype')
                             drop_level=True)

#You can see that the index now only has 2-levels:
temperatures

In [None]:
#If we assume that the temperature observations of all stations share the same
#set of timestamps (typical after resampling!), we can create a dataframe with all stations represented by columns.

temperatures_wide = (dataset.df
                    #first subset to temperatures
                    .xs(key='temp', 
                            level='obstype', #the level of the index ('datetime', 'name' or 'obstype')
                            drop_level=True)
                    #Convert an index level to columns (unstacking)
                    .unstack(level='name'))
temperatures_wide
                    

In [None]:
#If you are only interested in the values, you can select them:
temperatures_wide['value']
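
The same subsetting and unstacking steps can be tried on a small hand-made dataframe, which makes it easier to see what happens to the index. The sketch below is an illustration under assumptions: the records, the station names and the variables `toy_df`, `toy_temp` and `toy_wide` are hypothetical. Only the index level names (`datetime`, `obstype`, `name`) and the `value` column are taken from the cells above.

```python
import pandas as pd

# Hypothetical toy records, indexed by (datetime, obstype, name).
index = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("2022-09-01 00:00"), "temp", "station_A"),
        (pd.Timestamp("2022-09-01 00:00"), "temp", "station_B"),
        (pd.Timestamp("2022-09-01 00:00"), "humidity", "station_A"),
        (pd.Timestamp("2022-09-01 01:00"), "temp", "station_A"),
    ],
    names=["datetime", "obstype", "name"],
)
toy_df = pd.DataFrame({"value": [18.2, 17.9, 65.0, 17.5]}, index=index)

# Subsetting: keep only temperature, which drops the 'obstype' level.
toy_temp = toy_df.xs(key="temp", level="obstype", drop_level=True)

# Unstacking: one column per station. Combinations of timestamp and
# station without a record are filled with NaN.
toy_wide = toy_temp.unstack(level="name")["value"]
print(toy_wide)
```

In this toy example `station_B` has no record at 01:00, so the wide dataframe contains a NaN there. This is why the wide format is most useful when all stations share the same set of timestamps.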

### Quality control

For an introduction to Quality Control, we refer to the **LINK** .

### Extracting data from Google Earth Engine

For an introduction to extracting data from GEE, we refer to the **LINK** .

### Filling gaps

For an introduction to filling gaps, we refer to the **LINK** .

### Analysis 

For an introduction to analysing your dataset, we refer to the **LINK** .